# Final Project Katherine Maki - BIOF 309

## Project Question: How many bacterial features (at the genus level) are unique and overlapping across four major taxonomy databases?

### What is a bacterial genus?

Bacteria are classified taxonomically at ranks ranging from phylum (high level) down to species and strain.

![taxonomy](img/Taxonomy_copy2.png)

In microbiome analysis, 16S rRNA gene amplicon sequencing is a "culture-independent" technique used to classify bacteria based on hypervariable regions of the bacterial 16S rRNA gene. Bacteria are classified based on genetic sequences that are amplified from DNA extracted from a sample. For example, to noninvasively analyze the gut microbiome, you can collect fecal samples and extract DNA from them :)

## Four Databases that will be analyzed
#### Ribosomal Database Project
#### Greengenes
#### Silva 
#### Human Oral Microbiome Database
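
Once each database has been reduced to a set of genus names, the project question comes down to set arithmetic. The sketch below shows one way this could look; the genus sets are made-up placeholders, not values taken from the actual databases, and the names `genera`, `shared_by_all`, and `unique_to` are chosen only for illustration.

```python
# Hypothetical genus sets; the real sets would come from each database's taxonomy file.
genera = {
    "RDP": {"Bacteroides", "Staphylococcus", "Klebsiella"},
    "Greengenes": {"Bacteroides", "Staphylococcus", "Synechococcus"},
    "Silva": {"Bacteroides", "Klebsiella", "Sulfurovum"},
    "HOMD": {"Bacteroides", "Staphylococcus", "Streptococcus"},
}

# Genera present in every database (the full overlap).
shared_by_all = set.intersection(*genera.values())

# Genera found in exactly one database: subtract the union of all other databases.
unique_to = {}
for name, members in genera.items():
    others = set().union(*(g for other, g in genera.items() if other != name))
    unique_to[name] = members - others

print(sorted(shared_by_all))
for name, members in unique_to.items():
    print(name, sorted(members))
```

Python sets keep the comparison independent of how many sequences each database holds per genus, since duplicates collapse automatically.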

## Links to data
[Human Oral Microbiome Database](http://www.homd.org/?name=Download&taxonly=1)
[Silva Release 138](https://www.arb-silva.de/no_cache/download/archive/release_138/Exports/)
[Greengenes 13_8](https://docs.qiime2.org/2020.2/data-resources/)
[Ribosomal Database Project V16](https://mothur.org/wiki/rdp_reference_files/)

## Data Import

In [131]:
import pandas as pd
import numpy as np

## Since the four datasets are all formatted differently, we will bring them in separately and format them so that extra characters are deleted, empty fields are changed to "Not Classified", and the genus data is subsetted

### Greengenes is imported first

In [132]:
#Import GreenGenes Dataset without header
gg = pd.read_table('data_input/gg_99_otu_taxonomy.txt', header=None, names=['id', 'taxonomy'])

In [133]:
#Split Single Taxonomy Column into Several Columns
gg = gg.taxonomy.str.split(";",expand=True)
gg.columns = ['Kingdom','Phylum','Class',
                     'Order','Family','Genus', 'Species']

In [134]:
#Get rid of the preceding type labels (k__, p__, ...) and any whitespace left over from the split
prefixes = {'Kingdom': 'k__', 'Phylum': 'p__', 'Class': 'c__', 'Order': 'o__',
            'Family': 'f__', 'Genus': 'g__', 'Species': 's__'}
for col, prefix in prefixes.items():
    gg[col] = gg[col].str.replace(prefix, '', regex=False).str.strip()

In [135]:
#Replace blank spaces with "Not Classified" so all databases match
gg = gg.replace(r'^\s*$', "Not Classified", regex=True)
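
The cleaning steps above can be tried end to end on a tiny in-memory table. This is a minimal sketch that assumes Greengenes-style lines in which ranks are separated by `; ` and empty ranks appear as a bare prefix such as `g__`. The two sample lines are written by hand for illustration, and `sample`, `ranks`, and `tax` are hypothetical names.

```python
import pandas as pd

# Two hand-written lines in the Greengenes style (illustrative only).
sample = pd.DataFrame({
    "taxonomy": [
        "k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; "
        "f__Bacillaceae; g__Anoxybacillus; s__kestanbolensis",
        "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; "
        "f__Pelagibacteraceae; g__; s__",
    ]
})

ranks = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
tax = sample["taxonomy"].str.split(";", expand=True)
tax.columns = ranks

# Strip surrounding whitespace, then remove the one-letter rank prefix from every cell.
for col in ranks:
    tax[col] = tax[col].str.strip().str.replace(r"^[a-z]__", "", regex=True)

# Cells that are now empty get the shared placeholder label.
tax = tax.replace(r"^\s*$", "Not Classified", regex=True)

print(tax[["Genus", "Species"]])
```

Stripping whitespace before comparing names matters here: a genus stored as `" Bacteroides"` would not match `"Bacteroides"` from another database.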

In [136]:
#Add column identifying the dataset
gg['Dataset'] = 'Greengenes'

In [137]:
#View output
gg.head()

Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species,Dataset
0,Bacteria,Cyanobacteria,Synechococcophycideae,Synechococcales,Synechococcaceae,Synechococcus,Not Classified,Greengenes
1,Bacteria,Proteobacteria,Alphaproteobacteria,Rickettsiales,Pelagibacteraceae,Not Classified,Not Classified,Greengenes
2,Bacteria,Actinobacteria,Actinobacteria,Actinomycetales,Mycobacteriaceae,Mycobacterium,Not Classified,Greengenes
3,Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,Not Classified,Greengenes
4,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Anoxybacillus,kestanbolensis,Greengenes


In [138]:
#Pull out Genus and Dataset Variables
gg_genus =  gg[['Genus', 'Dataset']]

In [139]:
gg_genus.head()

Unnamed: 0,Genus,Dataset
0,Synechococcus,Greengenes
1,Not Classified,Greengenes
2,Mycobacterium,Greengenes
3,Staphylococcus,Greengenes
4,Anoxybacillus,Greengenes


### Next Silva is imported

In [172]:
#Import Silva Dataset without header
silva = pd.read_table('data_input/silva_taxonomy_7_levels.txt', header=None, names=['id', 'taxonomy'])

In [173]:
silva = silva.taxonomy.str.split(";",expand=True)
silva.columns = ['Kingdom','Phylum','Class',
                     'Order','Family','Genus', 'Species']

In [174]:
#Get rid of the preceding type labels 
silva.Kingdom = silva.Kingdom.apply(lambda x: x.replace('D_0__',''))
silva.Phylum = silva.Phylum.apply(lambda x: x.replace('D_1__',''))
silva.Class = silva.Class.apply(lambda x: x.replace('D_2__',''))
silva.Order = silva.Order.apply(lambda x: x.replace('D_3__',''))
silva.Family = silva.Family.apply(lambda x: x.replace('D_4__',''))
silva.Genus = silva.Genus.apply(lambda x: x.replace('D_5__',''))
silva.Species = silva.Species.apply(lambda x: x.replace('D_6__',''))

In [175]:
#Add column identifying the dataset
silva['Dataset'] = 'Silva'

In [176]:
#Only keep Bacteria in Silva Dataset as it also contains Archaea
silva_bac = silva[silva['Kingdom'] == 'Bacteria']

In [177]:
#Pull out Genus and Dataset Variables
silva_genus =  silva_bac[['Genus', 'Dataset']]

In [178]:
silva_genus.head()

Unnamed: 0,Genus,Dataset
0,Klebsiella,Silva
1,uncultured bacterium,Silva
2,Tyzzerella 3,Silva
3,uncultured,Silva
4,Pseudoalteromonas,Silva


In [179]:
#str.contains returns a boolean mask, which cannot be assigned to directly;
#store the mask so it can be passed to .loc to select the rows to relabel
is_uncultured = silva_genus['Genus'].str.lower().str.contains("uncultured")

In [183]:
#Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
silva_genus = silva_genus.copy()
silva_genus.loc[silva_genus['Genus'].str.contains('uncultured'), 'Genus'] = 'Not Classified'

In [185]:
silva_genus.head(10)

Unnamed: 0,Genus,Dataset
0,Klebsiella,Silva
1,Not Classified,Silva
2,Tyzzerella 3,Silva
3,Not Classified,Silva
4,Pseudoalteromonas,Silva
5,Not Classified,Silva
6,Chromatium,Silva
7,Not Classified,Silva
8,Sulfurovum,Silva
9,Lachnospiraceae UCG-008,Silva


In [159]:
for x in silva_genus["Genus"]:
    if x == silva_genus.str.contains("uncultured", case=False regex=False): "Not Classified"
        else: x 

SyntaxError: invalid syntax (<ipython-input-159-9ef301355682>, line 2)


In [129]:
#Need to Change the Different Uncultured Bacteria Data in Silva to "Not Classified" to Match with GreenGenes
silva_genus.loc[silva_genus["Genus"].str.contains("uncultured", case=False, regex=False), "Genus"] = "Not Classified"

In [110]:
silva.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369953 entries, 0 to 369952
Data columns (total 7 columns):
Kingdom    369953 non-null object
Phylum     369953 non-null object
Class      369953 non-null object
Order      369953 non-null object
Family     369953 non-null object
Genus      369953 non-null object
Species    369953 non-null object
dtypes: object(7)
memory usage: 19.8+ MB


In [99]:
gg_genus.describe()

Unnamed: 0,Genus,Dataset
count,93463,203452
unique,2062,1
top,Bacteroides,Greengenes
freq,2747,203452


In [None]:
gg_genus.shape

In [None]:
np.unique(gg_genus['Genus'])

## 3) Import your data
In the space below, import your data.
If your data span multiple files, read them all in.
If applicable, merge or append them as needed.
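
For this project, appending could mean stacking the per-database genus tables and then summarizing which genera appear where. The following is a rough sketch under that assumption. The frames `gg_demo` and `silva_demo` are hypothetical miniatures with the same `Genus` and `Dataset` columns used earlier, and their contents are invented.

```python
import pandas as pd

# Hypothetical per-database genus tables with the same two columns
# used in this notebook ('Genus' and 'Dataset').
gg_demo = pd.DataFrame({"Genus": ["Bacteroides", "Bacteroides", "Synechococcus"],
                        "Dataset": "Greengenes"})
silva_demo = pd.DataFrame({"Genus": ["Bacteroides", "Klebsiella", "Not Classified"],
                           "Dataset": "Silva"})

# Stack the tables, drop the placeholder label, and keep one row per genus per database.
combined = pd.concat([gg_demo, silva_demo], ignore_index=True)
combined = combined[combined["Genus"] != "Not Classified"].drop_duplicates()

# Presence/absence matrix: rows are genera, columns are databases.
presence = pd.crosstab(combined["Genus"], combined["Dataset"])

# Number of databases each genus appears in.
n_databases = presence.sum(axis=1)

print(presence)
print(n_databases)
```

A genus whose count equals the number of databases is shared by all of them, while a count of 1 marks a genus unique to a single database.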

## 4) Show me the head of your data.

## 5) Show me the shape of your data

## 6) Show me the proportion of missing observations for each column of your data
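
Because unclassified ranks in this notebook are stored as the string "Not Classified" rather than as `NaN`, a missingness check probably needs to count both. A minimal sketch on an invented table (`demo` and its values are placeholders):

```python
import numpy as np
import pandas as pd

# Small made-up table; the real input would be the cleaned taxonomy table.
demo = pd.DataFrame({
    "Genus": ["Bacteroides", "Not Classified", np.nan, "Klebsiella"],
    "Species": ["Not Classified", "Not Classified", np.nan, "pneumoniae"],
})

# Treat both true NaN values and the placeholder label as missing.
is_missing = demo.isna() | demo.eq("Not Classified")

# The mean of a boolean column is the proportion of True values.
prop_missing = is_missing.mean()
print(prop_missing)
```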

In [124]:
#Filter out Archaea and Eukaryotic Data

#Filter out not classified at genus level (i.e. "uncultured bacterium")
#The ~ operator inverts the boolean mask, keeping only rows that are NOT in the list
is_not_uncultured = ~silva_genus["Genus"].isin(["uncultured bacterium", "uncultured organism", "uncultured"])

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

## 8) What is your _y_-variable?
For the final project, you will need to fit a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?