# Exploratory Data Analysis ATCS Final Project

Dataset from Kaggle: https://www.kaggle.com/datasets/antaresnyc/human-gut-microbiome-with-asd?resource=download

Research Paper analysis using this data: https://www.tandfonline.com/doi/full/10.1080/19490976.2020.1747329

Business Understanding/Goal: Explore whether the composition of the gut microbiome is correlated with Autism Spectrum Disorder (ASD).

## 1. Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", 200)

# Data Exploration

## 2. Load the Dataset

In [None]:
df = pd.read_csv("data/ASD meta abundance.csv")

## 3. Understand the Data

In [None]:
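# only the last expression in a cell is displayed, so print the earlier ones
print(df.shape)    # (rows, columns)
print(df.columns)
df.head(10)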

# Data Cleaning

Changes to make to the dataset:
- check for duplicates (there are none)
- check for null values (there are none)
- transpose the data so the user (sample) ID is the row index, which makes it easier to aggregate the amount of each bacteria
- create separate datatables for ASD and control, for easier analysis
- the taxonomy labels follow a `g__<genus>;s__<species>` pattern (they could be shortened to just the part before the `;` for easier analysis)

## 4. Check for Duplicates & Null Values

In [None]:
# no duplicate rows and no null values!
print(df.duplicated().sum())       # number of duplicated rows
df.isna().sum() / df.shape[0]      # fraction of null values per column

## 5. Create a datatable with just the genus label (named `phylum` in the code)

In [None]:
# create a new dataset that maps each name to just the part before the ";"
# (the g__ prefix marks the genus rank, so this is a genus-level label)
def return_phylum(word):
    return word.split(";")[0]

phylum = df.copy()
phylum["Taxonomy"] = phylum["Taxonomy"].map(return_phylum)
phylum
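
The `Taxonomy` strings have the form `g__<genus>;s__<species>`, so the part before the `;` is the genus. As a minimal sketch (the toy labels are copied from the column names shown in this notebook; the variable names `labels`, `parts`, `genus`, and `species` are only illustrative), the same split can be done with pandas string methods, which also makes it easy to drop the rank prefixes:

```python
import pandas as pd

# Toy labels in the same "g__<genus>;s__<species>" shape as the Taxonomy column.
labels = pd.Series([
    "g__Faecalibacterium;s__Faecalibacterium prausnitzii",
    "g__Prevotella;s__Prevotella copri",
    "g__Unclassified;s__Firmicutes bacterium CAG:176",
])

# Split once on ";" into two columns, then strip the rank prefixes.
parts = labels.str.split(";", n=1, expand=True)
genus = parts[0].str.replace("g__", "", regex=False)
species = parts[1].str.replace("s__", "", regex=False)

print(genus.tolist())
print(species.tolist())
```

Splitting with `n=1` keeps any later `;` inside the species part, which is a cautious choice in case a label contains more than one separator.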

## 6. Transpose the Data

In [None]:
df = df.set_index('Taxonomy')
df = df.transpose()

In [31]:
df.head(5)

Taxonomy,g__Faecalibacterium;s__Faecalibacterium prausnitzii,g__Hungatella;s__Hungatella hathewayi,g__Clostridium;s__uncultured Clostridium sp.,g__Butyricimonas;s__Butyricimonas virosa,g__Alistipes;s__Alistipes indistinctus,g__Unclassified;s__Firmicutes bacterium CAG:176,g__Clostridium;s__Clostridium sp. CAG:7,g__Unclassified;s__Firmicutes bacterium CAG:882,g__Lachnoclostridium;s__[Clostridium] asparagiforme,g__Butyricicoccus;s__uncultured Butyricicoccus sp.,g__Unclassified;s__Firmicutes bacterium CAG:95,g__Oscillibacter;s__Oscillibacter sp. ER4,g__Desulfovibrio;s__Desulfovibrio piger,g__Fusobacterium;s__Fusobacterium mortiferum,g__Clostridium;s__Clostridium sp. GD3,g__Unclassified;s__Firmicutes bacterium CAG:124,g__Unclassified;s__Burkholderiales bacterium YL45,g__Ruminococcus;s__Ruminococcus callidus,g__Flavonifractor;s__uncultured Flavonifractor sp.,g__Subdoligranulum;s__Subdoligranulum variabile,g__Clostridium;s__Clostridium sp. CAG:127,g__Clostridium;s__Clostridium sp. CAG:510,g__Phascolarctobacterium;s__Phascolarctobacterium sp. CAG:207,g__Eubacterium;s__Eubacterium sp. CAG:38,g__Eubacterium;s__Eubacterium sp. CAG:192,g__Clostridium;s__Clostridium sp. CAG:242,g__Roseburia;s__Roseburia sp. CAG:380,g__Ruminococcus;s__Ruminococcus sp. CAG:254,g__Prevotella;s__Prevotella copri,g__Unclassified;s__Firmicutes bacterium CAG:110,g__Ruminococcus;s__uncultured Ruminococcus sp.,g__Clostridium;s__Clostridium sp. CAG:75,g__Lachnoclostridium;s__[Clostridium] bolteae,g__Prevotella;s__Prevotella sp. CAG:1092,g__Holdemanella;s__Holdemanella biformis,g__Prevotella;s__Prevotella sp. CAG:279,g__Dialister;s__Dialister succinatiphilus,g__Eubacterium;s__Eubacterium sp. CAG:252,g__Alistipes;s__Alistipes sp. CAG:268,g__Ruminococcus;s__Ruminococcus lactaris,g__Eubacterium;s__Eubacterium sp. CAG:202,g__Faecalibacterium;s__Faecalibacterium sp. CAG:74,g__Clostridium;s__Clostridium sp. CAG:277,g__Prevotella;s__Prevotella sp. 
CAG:1058,g__Roseburia;s__Roseburia inulinivorans,g__Phascolarctobacterium;s__Phascolarctobacterium sp. CAG:266,g__Akkermansia;s__Akkermansia sp. CAG:344,g__Prevotella;s__Prevotella sp. CAG:1031,g__Eubacterium;s__Eubacterium sp. CAG:156,g__Clostridium;s__Clostridium sp. CAG:299,g__Sutterella;s__Sutterella parvirubra,g__Mycoplasma;s__Mycoplasma sp. CAG:472,g__Megasphaera;s__Megasphaera sp. DJF_B143,g__Eubacterium;s__Eubacterium sp. CAG:786,g__Unclassified;s__Firmicutes bacterium CAG:137,g__Escherichia;s__Escherichia coli,g__Unclassified;s__Firmicutes bacterium CAG:103,g__Unclassified;s__[Eubacterium] rectale,g__Clostridium;s__Clostridium sp. CAG:62,g__Roseburia;s__Roseburia intestinalis,g__Coprobacillus;s__Coprobacillus sp. CAG:605,g__Bacteroides;s__Bacteroides fragilis,g__Eubacterium;s__Eubacterium ventriosum,g__Clostridium;s__Clostridium sp. CAG:43,g__Bilophila;s__Bilophila wadsworthia,g__Unclassified;s__Firmicutes bacterium CAG:313,g__Ruminococcus;s__Ruminococcus sp. CAG:488,g__Clostridium;s__Clostridium sp. CAG:245,g__Clostridium;s__Clostridium sp. CAG:451,g__Clostridium;s__Clostridium sp. CAG:269,g__Clostridium;s__Clostridium sp. CAG:58,g__Eubacterium;s__Eubacterium sp. CAG:86,g__Eubacterium;s__[Eubacterium] eligens,g__Ruminiclostridium;s__[Eubacterium] siraeum,g__Faecalibacterium;s__Faecalibacterium sp. CAG:82,g__Clostridium;s__Clostridium sp. CAG:417,g__Roseburia;s__Roseburia sp. CAG:197,g__Bifidobacterium;s__Bifidobacterium adolescentis,g__Bacteroides;s__Bacteroides vulgatus,g__Odoribacter;s__Odoribacter splanchnicus,g__Bacteroides;s__Bacteroides uniformis,g__Eubacterium;s__Eubacterium sp. CAG:248,g__Phascolarctobacterium;s__Phascolarctobacterium succinatutens,g__Bacteroides;s__Bacteroides thetaiotaomicron,g__Clostridium;s__Clostridium sp. KLE 1755,g__Unclassified;s__Firmicutes bacterium CAG:345,g__Clostridium;s__Clostridium sp. AT5,g__Bacteroides;s__Bacteroides stercoris,g__Clostridium;s__Clostridium sp. 
CAG:302,g__Bacteroides;s__Bacteroides ovatus,g__Unclassified;s__Lachnospiraceae bacterium TF01-11,g__Unclassified;s__Firmicutes bacterium CAG:129,g__Blautia;s__[Ruminococcus] torques,g__Sutterella;s__Sutterella sp. CAG:397,g__Unclassified;s__Firmicutes bacterium CAG:170,g__Parabacteroides;s__Parabacteroides merdae,g__Oscillibacter;s__Oscillibacter sp. CAG:241,g__Prevotella;s__Prevotella sp. KHD1,g__Prevotella;s__Prevotella copri CAG:164,g__Clostridium;s__Clostridium sp. CAG:343,...,g__Clostridium;s__Clostridium sp. NCR,g__Caviibacter;s__Caviibacter abscessus,g__Unclassified;s__Ignavibacteria bacterium RIFCSPLOWO2_12_FULL_56_21,g__Unclassified;s__Nitrospira bacterium SG8_35_4,g__Unclassified;s__Nitrospirae bacterium GWC2_57_9,g__Asticcacaulis;s__Asticcacaulis benevestitus,g__Aureimonas;s__Aureimonas altamirensis,g__Mesorhizobium;s__Mesorhizobium muleiense,g__Mesorhizobium;s__Mesorhizobium qingshengii,g__Mesorhizobium;s__Mesorhizobium sp. Root552,g__Kaistia;s__Kaistia granuli,g__Tepidicaulis;s__Tepidicaulis marinus,g__Labrys;s__Labrys sp. WJW,g__Falsirhodobacter;s__Falsirhodobacter sp. alg1,g__Rhodobacter;s__Rhodobacter sp. CACIA14H1,g__Roseobacter;s__Roseobacter sp. SK209-2-6,g__Sulfitobacter;s__Sulfitobacter geojensis,g__Asaia;s__Asaia astilbis,g__Skermanella;s__Skermanella aerolata,g__Elioraea;s__Elioraea tepidiphila,g__Novosphingobium;s__Novosphingobium acidiphilum,g__Bordetella;s__Bordetella sp. SCN 68-11,g__Pusillimonas;s__Pusillimonas sp. T7-7,g__Caballeronia;s__Paraburkholderia sordidicola,g__Paraburkholderia;s__Paraburkholderia andropogonis,g__Paraburkholderia;s__Paraburkholderia nodosa,g__Duganella;s__Duganella sp. Root1480D1,g__Janthinobacterium;s__Janthinobacterium agaricidamnosum,g__Sphaerotilus;s__Sphaerotilus natans,g__Ferrovum;s__Ferrovum sp. JA12,g__Ferriphaselus;s__Ferriphaselus amnicola,g__Methylophilus;s__Methylophilus sp. 
TWE2,g__Aquitalea;s__Aquitalea pelogenes,g__Chitiniphilus;s__Chitiniphilus shinanonensis,g__Eikenella;s__Eikenella corrodens,g__Neisseria;s__Neisseria mucosa,g__Neisseria;s__Neisseria sp. HMSC055H02,g__Neisseria;s__Neisseria sp. HMSC061B04,g__Neisseria;s__Neisseria sp. HMSC066H01,g__Neisseria;s__Neisseria sp. HMSC068C04,g__Snodgrassella;s__Snodgrassella sp. R-53583,g__Nitrosomonas;s__Nitrosomonas marina,g__Azoarcus;s__Azoarcus sp. KH32C,g__Desulfovibrio;s__Desulfovibrio sp. L21-Syr-AB,g__Hippea;s__Hippea sp. KM1,g__Unclassified;s__Deltaproteobacteria bacterium RIFOXYD12_FULL_55_16,g__Unclassified;s__Deltaproteobacteria bacterium RIFOXYD12_FULL_56_24,g__Nautilia;s__Nautilia profundicola,g__Sulfurovum;s__Sulfurovum lithotrophicum,g__Aeromonas;s__Aeromonas media,g__Thioalkalivibrio;s__Thioalkalivibrio sp. ALE16,g__Enterobacter;s__Enterobacter sp. BWH52,g__Enterobacter;s__Enterobacter sp. NFIX09,g__Klebsiella;s__Klebsiella oxytoca,g__Mangrovibacter;s__Mangrovibacter sp. MFB070,g__Unclassified;s__Type-F symbiont of Plautia stali,g__Tatumella;s__Tatumella morbirosei,g__Providencia;s__Providencia stuartii,g__Yersinia;s__Yersinia pestis,g__Legionella;s__Legionella erythra,g__Legionella;s__Legionella spiritensis,g__Methylobacter;s__Methylobacter luteus,g__Nevskia;s__Nevskia soli,g__Endozoicomonas;s__Endozoicomonas sp. ab112,g__Terasakiispira;s__Terasakiispira papahanaumokuakeensis,g__Avibacterium;s__Avibacterium paragallinarum,g__Acinetobacter;s__Acinetobacter soli,g__Acinetobacter;s__Acinetobacter sp. NIPH 284,g__Pseudomonas;s__Pseudomonas batumici,g__Pseudomonas;s__Pseudomonas caeni,g__Pseudomonas;s__Pseudomonas extremorientalis,g__Pseudomonas;s__Pseudomonas sp. BAY1663,g__Pseudomonas;s__Pseudomonas veronii,g__Unclassified;s__Bathymodiolus septemdierum thioautotrophic gill symbiont,g__Unclassified;s__Gammaproteobacteria bacterium RBG_16_51_14,g__Unclassified;s__Gammaproteobacteria bacterium RIFCSPHIGHO2_12_FULL_63_22,g__Vibrio;s__Vibrio sp. 
OY15,g__Rhodanobacter;s__Rhodanobacter sp. Root561,g__Mycoplasma;s__Mycoplasma arginini,g__Mycoplasma;s__Mycoplasma penetrans,g__Thermodesulfatator;s__Thermodesulfatator indicus,g__Unclassified;s__candidate division TA06 bacterium DG_78,g__Unclassified;s__uncultured marine bacterium EB0_49D07,g__Sugiyamaella;s__Sugiyamaella lignohabitans,g__Calocera;s__Calocera cornea,g__Uromyces;s__Uromyces hobsonii,g__Mitosporidium;s__Mitosporidium daphniae,g__Lichtheimia;s__Lichtheimia ramosa,g__T4virus;s__Aeromonas phage phiAS5,g__Unclassified;s__Bacillus phage BCD7,g__Unclassified;s__Clostridium phage c-st,g__Unclassified;s__Enterococcus phage EFDG1,g__Unclassified;s__Podovirus Lau218,g__Sap6virus;s__Enterococcus phage VD13,g__Unclassified;s__Bacillus phage vB_BanS-Tsamsa,g__Unclassified;s__Gordonia phage GTE2,g__Alphabaculovirus;s__Hyphantria cunea nucleopolyhedrovirus,g__Potyvirus;s__Bean common mosaic virus,g__Potyvirus;s__Telosma mosaic virus,g__Unclassified;s__Freshwater phage uvFW-CGR-AMD-COM-C203
A3,4988,5803,3793,64,15,100,2119,12,453,1266,17,524,7,1,1466,42,1,16,2000,724,536,20,1915,64,1383,5,33,5,34,20,841,1,1024,0,11,3,1,83,23,53,6,79,22,2,792,15,1416,4,45,1433,0,0,5,79,10,1203,23,846,29,132,0,999,146,1138,1159,1,2,0,1,0,617,37,114,188,155,0,9,13,581,1004,705,858,31,884,70,1,814,93,3,569,20,33,161,2,27,847,90,23,15,25,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A5,5060,5612,2795,1385,20,29,1230,24,691,1682,119,2018,714,0,1412,65,25,21,444,1712,9,19,1871,1868,11,10,6,13,92,48,903,4,997,49,1371,1666,2,104,16,1545,12,13,18,8,1037,12,10,12,26,406,1,0,3,6,68,1147,31,965,13,135,1,1028,640,1044,1030,1,5,0,0,1,480,37,868,707,1082,1,10,54,800,716,505,947,27,604,52,1,250,762,4,439,20,202,919,9,73,323,104,44,13,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A6,2905,4109,1355,725,723,11,1322,1,2278,43,9,101,19,0,937,36,26,7,1987,244,9,4,1868,41,1,14,0,1,28,29,325,0,1829,5,10,104,1,9,35,45,8,4,6,3,690,11,1,16,7,35,0,0,3,1,60,800,16,124,7,94,1,992,12,48,935,0,3,0,1,0,994,6,851,32,115,0,3,3,740,898,594,15,26,756,885,1,721,746,3,402,5,22,170,9,43,743,58,21,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A9,5745,1432,5558,1553,620,1320,2675,44,107,1726,1938,2023,150,1,2138,2213,41,1359,626,1930,2182,58,1067,2062,629,17,143,7,25,472,835,226,180,1,8,12,2,31,21,102,659,1046,1516,5,1291,8,100,4,638,59,1,1,4,1394,94,899,1059,1110,1209,1014,8,535,222,1247,1086,1,10,728,3,27,86,1142,973,1123,1089,737,22,26,704,877,738,96,25,666,40,4,208,890,761,689,741,757,280,9,863,768,505,81,6,874,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A31,4822,2652,5383,40,3261,51,1470,26,342,1804,49,415,17,9,1256,81,24,180,1892,1734,57,39,1918,1853,31,21,34,1925,1030,44,634,74,1237,43,7,132,179,96,18,1469,430,622,23,8,971,23,1451,15,50,60,27,1,6,5,106,474,602,393,53,234,1,874,1098,1154,1112,0,6,750,11,9,1045,101,1071,51,199,74,43,463,885,960,844,614,39,847,21,2,703,333,40,882,394,120,288,59,86,165,151,56,509,11,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## 7. Create separate datatables for the ASD and control patients

In [None]:
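# sample IDs that start with "A" are ASD patients; IDs that start with "B" are controls
# startswith is stricter than contains, which would also match a letter anywhere in the ID
ASD = df.loc[df.index.str.startswith('A')]
control = df.loc[df.index.str.startswith('B')]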
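
A small sketch of this split on made-up data, assuming (as in this notebook) that the group is encoded by the first letter of the sample ID. The frame `toy` and its values are invented for illustration. The final check confirms that every sample ends up in exactly one group:

```python
import pandas as pd

# Made-up counts for one bacteria column; the index mimics the sample IDs.
toy = pd.DataFrame(
    {"species_x": [5, 0, 7, 2]},
    index=["A3", "A31", "B1", "B12"],
)

# Boolean masks built from the first letter of each sample ID.
is_asd = toy.index.str.startswith("A")
is_control = toy.index.str.startswith("B")

asd_toy = toy.loc[is_asd]
control_toy = toy.loc[is_control]

# Every sample should belong to exactly one of the two groups.
print(len(asd_toy), len(control_toy), len(toy))
```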

# Exploratory Data Analysis

## 8. Find the summary statistics for the entire datatable

In [None]:
summary_stats = df.describe()
summary_stats

## 9. Find the important/common bacteria columns

In [None]:
# rank the bacteria columns by their average count across all users (highest first)
df.mean().sort_values(ascending=False)

In [None]:
# Find the most common bacteria in control patients

control.mean().sort_values(ascending=False)

## 10. Visualize the bacteria data across users in a barplot

In [None]:
# Bar graph of the amount of Faecalibacterium prausnitzii bacteria in each ASD patient
plt.figure(figsize=(16,5))
plt.title("Amount of Faecalibacterium prausnitzii bacteria per ASD patient")
sns.barplot(data=ASD, y="g__Faecalibacterium;s__Faecalibacterium prausnitzii", x=ASD.index)

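Analysis: This bar graph shows how much the abundance of a single species varies between individuals, even though they all belong to the same (ASD) group. This is consistent with the idea that a person's gut microbiome is highly individual, almost like a fingerprint, although one species alone cannot establish that. It also means that any difference between the ASD and control groups has to be judged against this large person-to-person variation.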
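
One way to put a number on this person-to-person variability is the coefficient of variation (standard deviation divided by mean) of each bacteria column. This is a rough sketch on invented counts, not a result from the dataset; the column names `steady` and `erratic` are hypothetical:

```python
import pandas as pd

# Invented counts for two bacteria columns across four samples.
toy = pd.DataFrame(
    {"steady": [100, 110, 90, 100], "erratic": [5, 400, 0, 95]},
    index=["A1", "A2", "A3", "A4"],
)

# Coefficient of variation per column: spread relative to the average count.
cv = toy.std() / toy.mean()
print(cv.sort_values(ascending=False))
```

A column whose mean is 0 would give a division by zero here, so in practice such columns would need to be dropped or masked first.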

In [None]:
# all the bacteria (column names) in a list
bacteria = df.columns.tolist()

# Statistical Analysis

Statistical Questions:
- Which genus of bacteria is most common among all individuals?
- Which bacteria are most and least common for ASD patients?
- Which bacteria differ the most between ASD and control patients?
- Is the diversity of the microbiome different between the two groups?

## 1. Which genus of bacteria is the most common among all individuals?

In [None]:
# group by the genus label (the part of Taxonomy before the ";") and average the species rows in each group
phylum = phylum.groupby("Taxonomy").mean()

In [None]:
# transpose the datatable to aggregate and sort all user columns for the bacteria
phylum = phylum.transpose()
phylum.mean().sort_values(ascending=False)
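
Note that `groupby(...).mean()` averages the species rows within each group, whereas `groupby(...).sum()` would give the group's total count. Which one is appropriate depends on the question being asked. A small sketch on made-up numbers (the frame `toy` and its values are invented) showing how the two differ:

```python
import pandas as pd

# Two species share the first label, one species has the second label.
toy = pd.DataFrame({
    "Taxonomy": ["g__Clostridium", "g__Clostridium", "g__Prevotella"],
    "A1": [10, 30, 8],
    "B1": [20, 0, 4],
})

genus_mean = toy.groupby("Taxonomy").mean()  # average count per species in the group
genus_sum = toy.groupby("Taxonomy").sum()    # total count of the whole group

print(genus_mean)
print(genus_sum)
```

With the mean, a group containing many rare species is pulled down by them; with the sum, groups containing many species are favored.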

## Analysis - the most common bacteria genera among all individuals:

g__Hungatella          3575.300000

g__Faecalibacterium    1223.083333

g__Flavonifractor       868.583333

g__Butyricicoccus       750.916667

g__Bilophila            498.466667

By identifying the most common bacteria genera among all individuals, we can better understand the overall composition of the microbiome before comparing the two groups.

## 2. Which bacteria are most and least common for ASD patients?

In [None]:
# Find the most common bacteria in ASD patients
ASD.mean().sort_values(ascending=False)

### Analysis - The most and least common bacteria in ASD patients are:

MOST

g__Faecalibacterium;s__Faecalibacterium prausnitzii                                4942.800000

g__Clostridium;s__uncultured Clostridium sp.                                       3708.966667

g__Hungatella;s__Hungatella hathewayi                                              3386.533333

g__Phascolarctobacterium;s__Phascolarctobacterium sp.                      1627.100000

g__Clostridium;s__Clostridium sp.                                            1611.266667

LEAST     

g__Mesotoga;s__Mesotoga infera                                                        0.000000

g__Streptococcus;s__Streptococcus sp.                                            0.000000

g__Unclassified;s__Candidatus Sungbacteria bacterium        0.000000

g__Unclassified;s__Freshwater phage                             0.000000

By finding the most and least common bacteria in ASD patients, we can begin to understand the gut microbiome composition of ASD patients.

## 3. Which bacteria have the greatest difference in mean abundance between ASD and control patients?


In [None]:
# absolute difference in mean count between ASD and control, per bacteria column
difference = (ASD.mean() - control.mean()).abs()
difference.sort_values(ascending=False)

### Analysis - Bacteria with the greatest absolute difference in mean abundance between ASD and control:

g__Phascolarctobacterium;s__Phascolarctobacterium sp. CAG:207    597.200000

g__Hungatella;s__Hungatella hathewayi                            377.533333

g__Prevotella;s__Prevotella sp. CAG:279                          329.766667

g__Bacteroides;s__Bacteroides stercoris                          329.433333

g__Desulfovibrio;s__Desulfovibrio piger                          319.633333

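Thus, it is possible that the levels of Phascolarctobacterium, Hungatella, Prevotella, Bacteroides, and Desulfovibrio bacteria are correlated with the presence of ASD. These values are absolute differences in mean counts, so they favor abundant species, and no significance test has been applied to them yet.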
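
An absolute difference in means tends to highlight species that are abundant in both groups. A relative (percent) difference is one alternative. The sketch below uses invented means, and the choice of the control mean as the baseline is an assumption made for illustration; the names `asd_mean`, `control_mean`, and `pct_diff` are hypothetical:

```python
import numpy as np
import pandas as pd

# Invented per-species mean counts for each group.
asd_mean = pd.Series({"sp_a": 150.0, "sp_b": 10.0, "sp_c": 0.0})
control_mean = pd.Series({"sp_a": 100.0, "sp_b": 40.0, "sp_c": 0.0})

abs_diff = (asd_mean - control_mean).abs()

# Percent difference relative to the control mean. Species absent from the
# control group become NaN instead of infinity and are dropped before ranking.
pct_diff = abs_diff / control_mean.replace(0, np.nan) * 100
ranked = pct_diff.dropna().sort_values(ascending=False)

print(abs_diff.sort_values(ascending=False))
print(ranked)
```

In this toy case the two rankings disagree: `sp_a` has the largest absolute difference, while `sp_b` has the largest percent difference.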


## 4. Is the diversity (variety of species) of the microbiome different between ASD & control?

For all ASD samples (IDs starting with A), count the number of bacteria entries that are 0, then compare with the control samples (IDs starting with B).

In [38]:
def count_zeros(data):
    """Count the zero entries across every sample (row) and bacteria column."""
    return int((data == 0).sum().sum())

zero_count_ASD = count_zeros(ASD)
zero_count_control = count_zeros(control)
zero_count_control, zero_count_ASD

(115420, 108345)

### Analysis
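The output lists the control count first: the control samples contain 115,420 zero entries and the ASD samples contain 108,345. The zero count is therefore greater for the control patients than for the ASD patients, which means that, by this crude measure, more species were detected in the ASD samples, not fewer. The comparison is only fair if both groups contain the same number of samples, and it ignores how evenly the species are distributed, so it is not as accurate as the Shannon Index. It gives only a basic first impression of the variety of species in each group.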
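
The Shannon index mentioned above is H = -sum(p_i * ln(p_i)), where p_i is the share of species i in one sample. Below is a minimal sketch on invented counts, assuming samples are rows and species are columns as in the transposed table. The function name `shannon_index` and the frame `toy` are hypothetical and not part of the original analysis:

```python
import numpy as np
import pandas as pd

def shannon_index(counts: pd.DataFrame) -> pd.Series:
    """Shannon index H = -sum(p * ln p) for each row (sample)."""
    totals = counts.sum(axis=1)
    p = counts.div(totals, axis=0)          # relative abundance per sample
    # Zero proportions contribute nothing: mask them to NaN so sum() skips them.
    plogp = p * np.log(p.where(p > 0))
    return -plogp.sum(axis=1)

# Invented counts: A1 is perfectly even, B1 is dominated by one species.
toy = pd.DataFrame(
    {"s1": [10, 40], "s2": [10, 0], "s3": [10, 0], "s4": [10, 0]},
    index=["A1", "B1"],
)

H = shannon_index(toy)
richness = (toy > 0).sum(axis=1)  # number of species detected per sample

print(H)
print(richness)
```

Computing the index per sample and then comparing the two groups' distributions (for example with `H.groupby(H.index.str[0]).mean()`) would take both richness and evenness into account, unlike a pooled zero count.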