# Data Visualization exercise

In [1]:
# importing the package(s) we want to use
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
### The input file is tab-delimited
tsv_Filepath = "https://raw.githubusercontent.com/csbfx/advpy122-data/master/euk.tsv"
# we can specify the delimiter by using the sep keyword argument
euk = pd.read_csv(tsv_Filepath, sep='\t')

In [3]:
### Using .head() and .tail() to see data content
# Only the last expression in a cell is displayed, so the output below comes from .tail()
euk.head()
euk.tail()

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status
8297,Saccharomyces cerevisiae,Fungi,Ascomycetes,3.99392,38.2,-,-,2017,Scaffold
8298,Saccharomyces cerevisiae,Fungi,Ascomycetes,0.586761,38.5921,155,298,1992,Chromosome
8299,Saccharomyces cerevisiae,Fungi,Ascomycetes,12.0204,38.2971,-,-,2018,Chromosome
8300,Saccharomyces cerevisiae,Fungi,Ascomycetes,11.9609,38.2413,-,-,2018,Chromosome
8301,Saccharomyces cerevisiae,Fungi,Ascomycetes,11.8207,38.2536,-,-,2018,Chromosome


In [4]:
euk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8302 entries, 0 to 8301
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Species             8302 non-null   object 
 1   Kingdom             8302 non-null   object 
 2   Class               8302 non-null   object 
 3   Size (Mb)           8302 non-null   float64
 4   GC%                 8302 non-null   object 
 5   Number of genes     8302 non-null   object 
 6   Number of proteins  8302 non-null   object 
 7   Publication year    8302 non-null   int64  
 8   Assembly status     8302 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 583.9+ KB


In [5]:
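### What to do with '-' values?
# fillna() only replaces true missing values (NaN); the '-' placeholders are
# ordinary strings, so they are left unchanged in the output below
euk.fillna(value="NA")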

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status
0,Emiliania huxleyi CCMP1516,Protists,Other Protists,167.676000,64.5,38549,38554,2013,Scaffold
1,Arabidopsis thaliana,Plants,Land Plants,119.669000,36.0529,38311,48265,2001,Chromosome
2,Glycine max,Plants,Land Plants,979.046000,35.1153,59847,71219,2010,Chromosome
3,Medicago truncatula,Plants,Land Plants,412.924000,34.047,37603,41939,2011,Chromosome
4,Solanum lycopersicum,Plants,Land Plants,828.349000,35.6991,31200,37660,2010,Chromosome
...,...,...,...,...,...,...,...,...,...
8297,Saccharomyces cerevisiae,Fungi,Ascomycetes,3.993920,38.2,-,-,2017,Scaffold
8298,Saccharomyces cerevisiae,Fungi,Ascomycetes,0.586761,38.5921,155,298,1992,Chromosome
8299,Saccharomyces cerevisiae,Fungi,Ascomycetes,12.020400,38.2971,-,-,2018,Chromosome
8300,Saccharomyces cerevisiae,Fungi,Ascomycetes,11.960900,38.2413,-,-,2018,Chromosome
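
Because the `'-'` placeholders are strings rather than missing values, `fillna` has nothing to fill, and `GC%`, `Number of genes`, and `Number of proteins` remain `object` columns. One possible approach is to tell `read_csv` to treat `'-'` as missing through its `na_values` argument. The sketch below uses a few rows copied from the table above; the inline `sample_tsv` text and the `sample` name are illustrative stand-ins, not part of the exercise.

```python
import io
import pandas as pd

# A few rows from the table above, typed in by hand as tab-separated text
sample_tsv = (
    "Species\tSize (Mb)\tGC%\tNumber of genes\tNumber of proteins\n"
    "Arabidopsis thaliana\t119.669\t36.0529\t38311\t48265\n"
    "Saccharomyces cerevisiae\t3.99392\t38.2\t-\t-\n"
    "Saccharomyces cerevisiae\t0.586761\t38.5921\t155\t298\n"
)

# na_values="-" makes pandas read the placeholder as NaN,
# so the affected columns are parsed as numbers instead of text
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="-")

print(sample.dtypes)
print(sample["Number of genes"].isna().sum())  # count of missing gene counts
```

For a frame that is already loaded, `pd.to_numeric(column, errors="coerce")` should give a similar result one column at a time.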


## Visualize the relationship between two variables

It is hard for us to see patterns by reading large tables of data. For example, look at the following table that contains the genome size and the number of genes for reptiles. Can you see the relationship between genome size and the number of genes?

In [6]:
## Look only at reptile data using class 'Reptiles'
print(euk[euk.Class == "Reptiles"])

                           Species  Kingdom     Class  Size (Mb)      GC%  \
282            Anolis carolinensis  Animals  Reptiles    1799.14  40.8238   
543            Sphenodon punctatus  Animals  Reptiles    4272.21        -   
565               Pogona vitticeps  Animals  Reptiles    1716.68     42.1   
589      Platysternon megacephalum  Animals  Reptiles    2319.09     43.9   
612               Podarcis muralis  Animals  Reptiles    1511.00  44.2057   
650              Cuora amboinensis  Animals  Reptiles    2214.83     43.9   
730             Ophiophagus hannah  Animals  Reptiles    1594.07     40.6   
944         Chrysemys picta bellii  Animals  Reptiles    2365.77   44.564   
1122                Chelonia mydas  Animals  Reptiles    2208.41     43.7   
1136    Alligator mississippiensis  Animals  Reptiles    2161.73     44.4   
1150            Crocodylus porosus  Animals  Reptiles    2049.54     44.2   
1218             Notechis scutatus  Animals  Reptiles    1665.53     40.2   

In [7]:
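# This raises the TypeError shown below: & binds more tightly than == and >=,
# so each comparison needs its own parentheses
print(euk[euk.Class == "Reptiles" & euk["Size (Mb)"] >= 4000])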

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
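
The error comes from operator precedence: in Python, `&` binds more tightly than `==` and `>=`, so the expression first tries to evaluate `"Reptiles" & euk["Size (Mb)"]`. A minimal sketch of the usual fix follows, using a small hand-typed frame with values from the reptile table above (`reptile_sample` and the other variable names are illustrative):

```python
import pandas as pd

# Three rows copied by hand from the reptile table above
reptile_sample = pd.DataFrame({
    "Species": ["Anolis carolinensis", "Sphenodon punctatus", "Pogona vitticeps"],
    "Class": ["Reptiles", "Reptiles", "Reptiles"],
    "Size (Mb)": [1799.14, 4272.21, 1716.68],
})

# Build each condition separately, then combine the two boolean Series with &
is_reptile = reptile_sample["Class"] == "Reptiles"
is_large = reptile_sample["Size (Mb)"] >= 4000
large_reptiles = reptile_sample[is_reptile & is_large]

# Equivalent one-liner: wrap each comparison in parentheses
same_result = reptile_sample[
    (reptile_sample["Class"] == "Reptiles") & (reptile_sample["Size (Mb)"] >= 4000)
]

print(large_reptiles)
```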

### Use `relplot` to look at relationships
We will pass the `relplot` function the names of the columns that we want on the `x` and `y` axes to look at the relationship between genome size and number of genes in reptiles.

In [None]:
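### Relationship plot between genome size and number of genes in Reptiles
reptiles = euk[euk.Class == "Reptiles"].copy()
# "Number of genes" is read as text because of the '-' placeholders; convert it so the y axis is numeric
reptiles["Number of genes"] = pd.to_numeric(reptiles["Number of genes"], errors="coerce")
sns.relplot(data=reptiles, x="Size (Mb)", y="Number of genes")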

### Can you create a new column to look at gene density?
gene density = Number of genes / Size (Mb)

In [None]:
### Your Code
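# gene density = Number of genes / Size (Mb)
# "Number of genes" is stored as text ('-' marks missing values), so convert it to numbers first
genes = pd.to_numeric(euk["Number of genes"], errors="coerce")
# create the new column
euk["Gene Density"] = genes / euk["Size (Mb)"]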


In [None]:
### Create the plot that represents the relation between Size (Mb) and density
# Is it what you expected?
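
One way this cell might be filled in is sketched here. It uses a handful of rows typed in from the table above rather than the full `euk` frame, so `density_sample`, `plot_data`, and `density_grid` are illustrative names and the resulting plot only hints at the pattern in the real data.

```python
import pandas as pd
import seaborn as sns

# Hand-typed rows from the table above; '-' marks a missing gene count
density_sample = pd.DataFrame({
    "Species": ["Emiliania huxleyi CCMP1516", "Arabidopsis thaliana", "Glycine max",
                "Medicago truncatula", "Solanum lycopersicum", "Saccharomyces cerevisiae"],
    "Size (Mb)": [167.676, 119.669, 979.046, 412.924, 828.349, 3.99392],
    "Number of genes": ["38549", "38311", "59847", "37603", "31200", "-"],
})

# Convert the text column to numbers; '-' becomes NaN
genes = pd.to_numeric(density_sample["Number of genes"], errors="coerce")
density_sample["Gene Density"] = genes / density_sample["Size (Mb)"]

# Drop rows whose density could not be computed, then plot size against density
plot_data = density_sample.dropna(subset=["Gene Density"])
density_grid = sns.relplot(data=plot_data, x="Size (Mb)", y="Gene Density")
```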

## Seaborn
These are other values of `kind` that we did not try in lectures. Look up the seaborn documentation and try to create these new plots.

1. Create a violin plot for genome size distribution against Class Land Plants.
2. Create a scatter plot showing the relationship between Kingdom and Publication year.
3. Create a bar plot for GC% for each Kingdom.

In [None]:
### Your code - Create a violin plot for genome size distribution against Class Land Plants.
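
A possible starting point, assuming the task means "filter to the `Land Plants` class, then show the distribution of `Size (Mb)`". The sketch uses `sns.catplot` with `kind="violin"` on a few hand-typed rows from the table above; `plant_sample`, `land_plants`, and `violin_grid` are illustrative names.

```python
import pandas as pd
import seaborn as sns

# Hand-typed rows from the table above (four land plants and one protist)
plant_sample = pd.DataFrame({
    "Species": ["Arabidopsis thaliana", "Glycine max", "Medicago truncatula",
                "Solanum lycopersicum", "Emiliania huxleyi CCMP1516"],
    "Class": ["Land Plants", "Land Plants", "Land Plants", "Land Plants", "Other Protists"],
    "Size (Mb)": [119.669, 979.046, 412.924, 828.349, 167.676],
})

# Keep only the Land Plants rows, then draw one violin for their genome sizes
land_plants = plant_sample[plant_sample["Class"] == "Land Plants"]
violin_grid = sns.catplot(data=land_plants, x="Class", y="Size (Mb)", kind="violin")
```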


In [None]:
### Your code - Create a scatter plot showing the relationship between Kingdom and Publication year
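
A minimal sketch of one way to approach this, using `sns.relplot` with `kind="scatter"` and a categorical x axis. The rows are typed in from the table above, and `year_sample` and `scatter_grid` are illustrative names.

```python
import pandas as pd
import seaborn as sns

# Kingdom and publication year for a few rows from the table above
year_sample = pd.DataFrame({
    "Kingdom": ["Protists", "Plants", "Plants", "Plants", "Fungi", "Fungi", "Fungi"],
    "Publication year": [2013, 2001, 2010, 2011, 2017, 1992, 2018],
})

# One point per genome: kingdom on the x axis, publication year on the y axis
scatter_grid = sns.relplot(
    data=year_sample, x="Kingdom", y="Publication year", kind="scatter"
)
```

With the full dataset many points will likely overlap; `sns.catplot(..., kind="strip")` is one alternative that adds jitter along the categorical axis.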


In [None]:
### Your code - Create a bar plot for GC% for each Kingdom
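
A sketch under the assumption that "GC% for each Kingdom" means the mean GC% per kingdom, which is what seaborn's bar plot shows by default. Since `GC%` is stored as text in `euk`, the example converts it first. The rows are typed in from the tables above, and `gc_sample`, `bar_grid`, and `mean_gc` are illustrative names.

```python
import pandas as pd
import seaborn as sns

# GC% values copied from the tables above; '-' marks a missing value
gc_sample = pd.DataFrame({
    "Kingdom": ["Plants", "Plants", "Fungi", "Fungi", "Animals", "Animals", "Animals"],
    "GC%": ["36.0529", "35.1153", "38.2", "38.5921", "40.8238", "42.1", "-"],
})

# Convert the text column to numbers; '-' becomes NaN
gc_sample["GC%"] = pd.to_numeric(gc_sample["GC%"], errors="coerce")

# Each bar shows the mean GC% of one kingdom
bar_grid = sns.catplot(data=gc_sample, x="Kingdom", y="GC%", kind="bar")

# The same per-kingdom means as a table, for checking the bar heights
mean_gc = gc_sample.groupby("Kingdom")["GC%"].mean()
print(mean_gc)
```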