# Final Project Katherine Maki - BIOF 309

## Project Question: How many bacterial features (at the genus level) are unique and overlapping across four major taxonomy databases?

### What is a bacterial genus?

Bacteria are classified taxonomically at ranks ranging from phylum (high level) down to species and strain.

![taxonomy](img/Taxonomy_copy2.png)

In microbiome analysis, 16S rRNA gene amplicon sequencing is a "culture-independent" technique used to classify bacteria based on hypervariable regions of the bacterial 16S rRNA gene. Bacteria are classified from genetic sequences that are amplified from DNA extracted from a sample. For example, to noninvasively analyze the gut microbiome, you can collect fecal samples and extract DNA from them :)

## Four Databases that will be analyzed
#### Ribosomal Database Project
#### Greengenes
#### Silva 
#### Human Oral Microbiome Database

## Links to data
- [Human Oral Microbiome Database](http://www.homd.org/?name=Download&taxonly=1)
- [Silva Release 138](https://www.arb-silva.de/no_cache/download/archive/release_138/Exports/)
- [Greengenes 13_8](https://docs.qiime2.org/2020.2/data-resources/)
- [Ribosomal Database Project V16](https://mothur.org/wiki/rdp_reference_files/)

## Data Import

In [13]:
import pandas as pd
import numpy as np

## Since there are four datasets and each is formatted differently, we will import them separately and clean each one so that extra characters are deleted, empty fields are changed to "NaN", and the genus data is subset
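The three cleaning steps described above (delete extra characters, convert empty fields to NaN, keep track of the source dataset) could be wrapped in one reusable function. The sketch below is only an illustration: `tidy_taxonomy`, `RANKS`, and the two-row `toy` input are hypothetical names and data that are not part of the original notebook, and it assumes every taxonomy string is delimited by `;`.

```python
import numpy as np
import pandas as pd

RANKS = ['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']

def tidy_taxonomy(taxonomy, prefixes, dataset):
    """Split a ';'-delimited taxonomy Series into one column per rank.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Split on ';' plus any whitespace that follows it
    ranks = taxonomy.str.split(r';\s*', expand=True)
    ranks.columns = RANKS[:ranks.shape[1]]
    # Remove the rank label (e.g. 'g__') and stray whitespace from each column
    for col, prefix in zip(ranks.columns, prefixes):
        ranks[col] = ranks[col].str.replace(prefix, '', regex=False).str.strip()
    # Empty or whitespace-only fields become NaN
    ranks = ranks.replace(r'^\s*$', np.nan, regex=True)
    # Record which database the rows came from
    ranks['Dataset'] = dataset
    return ranks

# Two made-up rows in Greengenes-style notation
toy = pd.Series([
    'k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; '
    'f__Bacillaceae; g__Anoxybacillus; s__kestanbolensis',
    'k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; '
    'o__Rickettsiales; f__Pelagibacteraceae; g__; s__',
])
toy_clean = tidy_taxonomy(
    toy, ['k__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__'], 'Greengenes')
```

The same function could be called with the `D_0__` to `D_6__` prefixes for Silva, which would keep the per-database cells short and consistent.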

### Greengenes is imported first

In [89]:
#Import GreenGenes Dataset without header
gg = pd.read_table('data_input/gg_99_otu_taxonomy.txt', header=None, names=['id', 'taxonomy'])

In [90]:
#Split Single Taxonomy Column into Several Columns
gg = gg.taxonomy.str.split(";",expand=True)
gg.columns = ['Kingdom','Phylum','Class',
                     'Order','Family','Genus', 'Species']

In [91]:
#Get rid of the preceding rank labels (k__, p__, ...) and strip any
#whitespace left over from splitting on ";"
prefixes = ['k__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__']
for col, prefix in zip(gg.columns, prefixes):
    gg[col] = gg[col].str.replace(prefix, '', regex=False).str.strip()

In [92]:
#Replace blank spaces with NaN
gg = gg.replace(r'^\s*$', np.nan, regex=True)

In [94]:
#Add column identifying the dataset
gg['Dataset'] = 'Greengenes'

In [95]:
#View output
gg.head()

| | Kingdom | Phylum | Class | Order | Family | Genus | Species | Dataset |
|---|---|---|---|---|---|---|---|---|
| 0 | Bacteria | Cyanobacteria | Synechococcophycideae | Synechococcales | Synechococcaceae | Synechococcus | NaN | Greengenes |
| 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rickettsiales | Pelagibacteraceae | NaN | NaN | Greengenes |
| 2 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Mycobacteriaceae | Mycobacterium | NaN | Greengenes |
| 3 | Bacteria | Firmicutes | Bacilli | Bacillales | Staphylococcaceae | Staphylococcus | NaN | Greengenes |
| 4 | Bacteria | Firmicutes | Bacilli | Bacillales | Bacillaceae | Anoxybacillus | kestanbolensis | Greengenes |


In [97]:
#Pull out Genus and Dataset Variables
gg_genus =  gg[['Genus', 'Dataset']]

In [98]:
gg_genus.head()

| | Genus | Dataset |
|---|---|---|
| 0 | Synechococcus | Greengenes |
| 1 | NaN | Greengenes |
| 2 | Mycobacterium | Greengenes |
| 3 | Staphylococcus | Greengenes |
| 4 | Anoxybacillus | Greengenes |
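Row 1 in the output above has no genus assignment, so it may be useful to check how much of the genus column is missing before comparing databases. The sketch below rebuilds those five rows as a stand-in table (`genus_demo` is a hypothetical name) and computes the fraction of missing values and the number of distinct genera. On the real data the same two calls would be applied to `gg_genus`.

```python
import numpy as np
import pandas as pd

# Stand-in for gg_genus, built from the five rows shown in the head() output
genus_demo = pd.DataFrame({
    'Genus': ['Synechococcus', np.nan, 'Mycobacterium',
              'Staphylococcus', 'Anoxybacillus'],
    'Dataset': 'Greengenes',
})

# Fraction of rows without a genus assignment
missing_fraction = genus_demo['Genus'].isna().mean()

# Number of distinct genus names (NaN is ignored by nunique)
n_unique = genus_demo['Genus'].nunique()

print(f"missing: {missing_fraction:.0%}, unique genera: {n_unique}")
```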


### Next Silva is imported

In [119]:
#Import Silva Dataset without header
silva = pd.read_table('data_input/silva_consensus_taxonomy_7_levels.txt', header=None, names=['id', 'taxonomy'])

In [120]:
silva = silva.taxonomy.str.split(";",expand=True)
silva.columns = ['Kingdom','Phylum','Class',
                     'Order','Family','Genus', 'Species']

In [121]:
#Get rid of the preceding rank labels (D_0__ through D_6__)
for level, col in enumerate(silva.columns[:7]):
    silva[col] = silva[col].str.replace(f'D_{level}__', '', regex=False)

In [123]:
silva.head(40)

| | Kingdom | Phylum | Class | Order | Family | Genus | Species |
|---|---|---|---|---|---|---|---|
| 0 | Bacteria | Epsilonbacteraeota | Campylobacteria | Campylobacterales | Thiovulaceae | Sulfuricurvum | Sulfuricurvum sp. EW1 |
| 1 | Bacteria | Cyanobacteria | Oxyphotobacteria | Nostocales | Nostocaceae | Nostoc PCC-73102 | Nostoc sp. 'Nephroma expallidum cyanobiont 23' |
| 2 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Muribaculaceae | uncultured bacterium | uncultured bacterium |
| 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas | Pseudoalteromonas sp. |
| 4 | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Klebsiella | uncultured organism |
| 5 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminococcaceae UCG-005 | uncultured bacterium |
| 6 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminococcaceae V9D2013 group | uncultured rumen bacterium |
| 7 | Bacteria | Firmicutes | Clostridia | Thermolithobacterales | Thermolithobacteraceae | Thermolithobacter | Thermolithobacter ferrireducens |
| 8 | Bacteria | Actinobacteria | Actinobacteria | Streptomycetales | Streptomycetaceae | Streptomyces | Streptomyces luteogriseus |
| 9 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | uncultured bacterium |
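Unlike Greengenes, Silva fills unclassified ranks with placeholder text such as "uncultured bacterium" (row 2 above) rather than leaving them blank. One possible way to finish the Silva table in the same shape as `gg_genus` is sketched below. Treating every label that starts with "uncultured" as missing is an assumption made here for illustration, not a decision taken in the original notebook, and `silva_demo` is a hypothetical stand-in for the real `silva` table.

```python
import numpy as np
import pandas as pd

# Stand-in for the Genus column of the silva table
silva_demo = pd.DataFrame({
    'Genus': ['Sulfuricurvum', 'uncultured bacterium', 'Klebsiella', ''],
})

# Empty or whitespace-only fields become NaN, as was done for Greengenes
silva_demo = silva_demo.replace(r'^\s*$', np.nan, regex=True)

# Assumption: labels beginning with "uncultured" carry no genus information
placeholder = silva_demo['Genus'].str.startswith('uncultured', na=False)
silva_demo.loc[placeholder, 'Genus'] = np.nan

# Add the dataset label and keep only the columns needed for the comparison
silva_demo['Dataset'] = 'Silva'
silva_genus_demo = silva_demo[['Genus', 'Dataset']]
```

Labels such as "Ruminococcaceae UCG-005" are left untouched in this sketch; whether they should count as genera is a separate judgment call.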


In [110]:
silva.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369953 entries, 0 to 369952
Data columns (total 7 columns):
Kingdom    369953 non-null object
Phylum     369953 non-null object
Class      369953 non-null object
Order      369953 non-null object
Family     369953 non-null object
Genus      369953 non-null object
Species    369953 non-null object
dtypes: object(7)
memory usage: 19.8+ MB


In [99]:
gg_genus.describe()

| | Genus | Dataset |
|---|---|---|
| count | 93463 | 203452 |
| unique | 2062 | 1 |
| top | Bacteroides | Greengenes |
| freq | 2747 | 203452 |
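The summary above shows 2,062 distinct genera in Greengenes, with a genus recorded for 93,463 of the 203,452 rows (roughly 46%). Once each database has a comparable genus table, the project question could be approached with Python set operations. The sketch below uses two small made-up tables; `genus_set`, `gg_demo`, and `silva_demo` are hypothetical names, and it assumes genus names are spelled identically across databases.

```python
import pandas as pd

def genus_set(genus_table):
    """Return the distinct, non-missing genus names in a table as a set."""
    return set(genus_table['Genus'].dropna().str.strip())

# Made-up stand-ins for two cleaned genus tables
gg_demo = pd.DataFrame(
    {'Genus': ['Bacteroides', 'Staphylococcus', None, 'Mycobacterium']})
silva_demo = pd.DataFrame(
    {'Genus': ['Bacteroides', 'Klebsiella', 'Staphylococcus']})

gg_set = genus_set(gg_demo)
silva_set = genus_set(silva_demo)

shared = gg_set & silva_set       # genera found in both databases
only_gg = gg_set - silva_set      # genera unique to Greengenes
only_silva = silva_set - gg_set   # genera unique to Silva

print(len(shared), len(only_gg), len(only_silva))
```

With four databases, the same idea extends to `set.intersection(*sets)` for genera shared by all of them and to pairwise differences for genera unique to each.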


## 2) Provide a link to your data
Your data is required to be free and open to anyone.
As such, you should have a URL which anyone can use to download your data:

In [None]:
# Enter link here.

## 3) Import your data
In the space below, import your data.
If your data span multiple files, read them all in.
If applicable, merge or append them as needed.

## 4) Show me the head of your data.

## 5) Show me the shape of your data

## 6) Show me the proportion of missing observations for each column of your data

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

## 8) What is your _y_-variable?
For the final project, you will need to fit a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?