# Data exploration with Pandas

### Tree of Life
<table><tr><td><img src="http://t3.gstatic.com/licensed-image?q=tbn:ANd9GcSq4PRaxgfpjNOSe81JgN8l71DWtDHpkSfH3xo8EOk7khAlqQozXnJm8ubupyHj" width=300></td><td>
<img src="https://i.guim.co.uk/img/static/sys-images/Guardian/Pix/pictures/2008/04/17/DarwinSketch.article.jpg?width=445&quality=85&auto=format&fit=max&s=c7f89552d12b8495b2b4eb4d7a5bc391" width=300></td><td><img src="https://i.pinimg.com/originals/78/3f/98/783f983d622b06b9a990ad67efabbbe8.png" width=300></td></tr></table>

### Objective of this notebook
We will continue our data exploration with the dataset we used in Colab_Lec04. In this notebook, we will tackle real-world questions with pandas by exploring a kingdom-of-life data file. Each row in this dataset represents a particular organism, and each organism is classified into a particular Kingdom and Class.

Notebook adapted from Wendy Lee

In [None]:
# Import libraries
import pandas as pd

In [None]:
### Notice that this dataset is separated by tabs instead of commas
euk_filepath = "https://raw.githubusercontent.com/csbfx/advpy122-data/master/euk.tsv"
euk_df = pd.read_csv(euk_filepath, sep="\t")

In [None]:
### A reminder of what columns are in this dataset and the size of the dataframe

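As a minimal sketch of loading a tab-separated file and checking its columns and size, the example below reads a tiny in-memory table instead of the real file. The three rows and their values are invented for illustration and are not taken from `euk.tsv`.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for the real data file
tsv_text = (
    "Species\tKingdom\tSize (Mb)\n"
    "Homo sapiens\tAnimals\t3100.0\n"
    "Saccharomyces cerevisiae\tFungi\t12.1\n"
)

# sep="\t" tells read_csv that fields are separated by tabs
sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample_df.columns.to_list())  # the column names
print(sample_df.shape)              # (number of rows, number of columns)
```

The same two attributes, `columns` and `shape`, work on `euk_df` once it has been loaded.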

In [None]:
### You can sort by one or more columns
# euk_df.sort_values('Species', ascending=True).head()

### Sort by two columns: rows are sorted by col1 first, then rows with identical values in col1 are sorted by col2

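One way to sort by two columns is to pass a list of column names to `sort_values`. The sketch below uses a small invented table (the names in it are placeholders, not rows from the dataset).

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Kingdom": ["Fungi", "Animals", "Fungi", "Animals"],
    "Species": ["Yeast b", "Cat", "Yeast a", "Ant"],
})

# Sort by Kingdom first; ties within a Kingdom are broken by Species
sorted_toy = toy.sort_values(["Kingdom", "Species"], ascending=[True, True])
print(sorted_toy)
```

`ascending` also accepts one boolean per column, so each column can be sorted in its own direction.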

### *Q1. How many fungal species have a genome size larger than 100 Mb? What are their names?*

**We need to filter a few things to address this question.**
1. Select all the Fungi under the "Kingdom" column.
2. Select all the Fungi with genome size greater than 100 Mb.
3. Select Species from the filtered data from step 2 above.

**Let's do it step by step. And then we will combine all of them in one line of code.**

In [None]:
### Filter the dataframe down to the rows whose Kingdom is Fungi


To **combine two conditions** in one filter, join them with `&` and surround each condition with its own pair of **parentheses**. The parentheses are required because `&` binds more tightly than comparison operators such as `==` and `>`.
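The sketch below shows the two-condition filter on a small invented table. The column names mirror the dataset, but the rows are made up.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Species": ["A", "B", "C", "D"],
    "Kingdom": ["Fungi", "Fungi", "Plants", "Fungi"],
    "Size (Mb)": [150.0, 20.0, 400.0, 101.5],
})

# Each condition on its own produces a boolean Series
is_fungi = toy["Kingdom"] == "Fungi"
is_large = toy["Size (Mb)"] > 100

# Written inline, each condition needs its own parentheses
big_fungi = toy[(toy["Kingdom"] == "Fungi") & (toy["Size (Mb)"] > 100)]

print(len(big_fungi))                  # how many rows match
print(big_fungi["Species"].to_list())  # their names
```

Storing each condition in a named variable first, as with `is_fungi` and `is_large`, avoids the parentheses and can make long filters easier to read.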

In [None]:
### Narrow down to Fungi that have a genome size > 100 Mb


In [None]:
## To select Species that are Fungi with genome size > 100 Mb
# euk_df[(euk_df["Kingdom"] == "Fungi") & (euk_df['Size (Mb)'] > 100)]['Species']

## Alternative way to call column with '.'

Convert the Series from Q1 above to a Python list using the `to_list()` method.

In [1]:
speciesList = euk_df[ (euk_df.Kingdom == 'Fungi') &
                  (euk_df['Size (Mb)'] > 100)
                 ]["Species"].to_list()

## Slice for the top 10 on the list
speciesList[0:10]

### Q2. How many organisms are there for each Kingdom (plants, animals, fungi, protists, and other), and how many unique species names?

*Hint*: Any time a question involves the words "how many ... for each ...", `value_counts()` is usually the method to reach for.

In [None]:
euk_df.Kingdom.value_counts()

To address "how many unique species names?", start by counting the unique species names for plants by filtering the dataframe. [`nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) returns the number of distinct, non-NA values. On a DataFrame it reports one count per column (the default) or per row; on a single column such as `Species` it returns one number.
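As a quick illustration of the difference between counting rows and counting distinct values, here is a sketch on an invented table in which one species name appears twice.

```python
import pandas as pd

# Toy data, invented for illustration; "Aa bb" is listed twice on purpose
toy = pd.DataFrame({
    "Species": ["Aa bb", "Aa bb", "Cc dd", "Ee ff"],
    "Kingdom": ["Plants", "Plants", "Plants", "Fungi"],
})

kingdom_counts = toy["Kingdom"].value_counts()  # number of rows per kingdom
per_column = toy.nunique()                      # distinct values in each column
plant_species = toy[toy["Kingdom"] == "Plants"]["Species"].nunique()  # one number

print(kingdom_counts)
print(per_column)
print(plant_species)
```

Here `value_counts()` counts three plant rows, while `nunique()` finds only two distinct plant species names.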

In [None]:
euk_df[(euk_df["Kingdom"] == "Plants")].Species.nunique()

In [None]:
### Write a loop to expand this to the other Kingdoms

Kingdoms = ["Protists", "Plants", "Animals"]

for k in Kingdoms:
    print(k, euk_df[euk_df.Kingdom == k].Species.nunique())

#### To make the above more scalable
Hardcoding the kingdom names is not scalable. A more elegant approach is to get the list of unique kingdom names directly from the dataframe. [`unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) returns the unique values in order of appearance. It does NOT sort.

In [None]:
### Return an array of unique values in Kingdom


### Use an array to make it more scalable

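A possible shape for the scalable version is sketched below on invented data. The `groupby` line at the end is an alternative technique that is not part of the steps above; it is included only for comparison.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Species": ["Aa bb", "Aa cc", "Dd ee", "Ff gg", "Ff gg"],
    "Kingdom": ["Plants", "Plants", "Fungi", "Animals", "Animals"],
})

# unique() returns the kingdom names in order of appearance, unsorted
kingdoms = toy["Kingdom"].unique()

# Loop over whatever kingdoms are present instead of a hardcoded list
for k in kingdoms:
    print(k, toy[toy["Kingdom"] == k]["Species"].nunique())

# Alternative without an explicit loop: group rows by Kingdom, then count
species_per_kingdom = toy.groupby("Kingdom")["Species"].nunique()
print(species_per_kingdom)
```

The loop keeps the order in which kingdoms first appear, whereas `groupby` sorts the group labels by default.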

### Q3. Make a new dataframe containing just the rows for the *Aquila* genus.

**Let's go over some biology terminology**
- The names under the **Species** column are scientific names that are made up of a *genus* name and a *species* name separated by a space. Example: *Homo sapiens*. Note: some species don't follow that format, and we will ignore them for now.

To solve this problem, we will need to **separate the genus and species names** for each value in the **Species** column.

In [None]:
### Review how to split a string.
## What if we wanted to split the string by ","?
a = "Homo sapiens"
a_split = a.split()
print(a_split)
print(a_split[0])

## Split the strings stored in the column Species
euk_df.Species.str.split()

In [None]:
## Take the first element of each of the resulting lists (again remembering to refer to the str attribute):
## This gives us our series of genus names.

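The sketch below walks through the genus extraction on a short, hand-written Series of scientific names, so it does not depend on the dataset being loaded.

```python
import pandas as pd

# A few scientific names typed in by hand for illustration
species = pd.Series(["Aquila chrysaetos", "Homo sapiens", "Aquila heliaca"])

# .str.split() turns each string into a list of words;
# .str[0] then takes the first element of each list (the genus)
genus = species.str.split().str[0]

# Comparing the genus Series to a string gives a boolean Series
is_aquila = genus == "Aquila"

print(genus.to_list())
print(is_aquila.to_list())
```

The boolean Series can then be placed inside square brackets on the dataframe to keep only the matching rows.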

In [None]:
## Add the condition to get a series of boolean values:
euk_df.Species.str.split(' ').str[0] == "Aquila"

In [None]:
## Plug the above into the original dataframe to select the rows that are True
aquila_data = euk_df[euk_df.Species.str.split(' ').str[0] == "Aquila"]

In [None]:
## Extract the genus name and combine it with other columns in the dataframe to create a new dataframe
euk_df['Genus'] = euk_df.Species.str.split(' ').str[0]

neweuk = euk_df[["Species", "Genus", "Class", "Kingdom"]]


### Q4. Which organisms have at least 10% more proteins than genes?

There are a few different ways to interpret "10% more", but for the purposes of this question we will divide the number of proteins by the number of genes and include the organism if the result is greater than or equal to 1.1.

If you look at the `euk_df` dataframe, you will see that the columns **Number of proteins** and **Number of genes** contain a mix of numeric values and dashes. To ensure that all the values in these two columns are numeric, we will use the pandas function [`to_numeric`](https://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.to_numeric.html) to convert them. We will set the `errors` argument to `'coerce'` to convert any non-numeric value to NaN (not a number).
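To see how `errors='coerce'` behaves, here is a minimal sketch on an invented table whose count columns mix digit strings and dashes, as described above. The numbers are made up.

```python
import pandas as pd

# Toy data, invented for illustration; "-" marks a missing count
toy = pd.DataFrame({
    "Species": ["A", "B", "C"],
    "Number of genes": ["100", "-", "200"],
    "Number of proteins": ["120", "50", "-"],
})

# Convert both columns; anything that is not a number becomes NaN
for col in ["Number of genes", "Number of proteins"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Division involving NaN yields NaN, so incomplete rows get a NaN ratio
toy["Proteins per gene"] = toy["Number of proteins"] / toy["Number of genes"]

# Comparisons against NaN are False, so those rows drop out of the filter
selected = toy[toy["Proteins per gene"] >= 1.1]
print(selected)
```

Only species A is kept: its ratio is 1.2, while B and C have a missing value on one side of the division.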

In [None]:
### Calculate proteins per gene by applying the division for each row. What to do about NaN?
# euk_df["Proteins per gene"] = euk_df["Number of proteins"]/euk_df["Number of genes"]
# euk_df["Proteins per gene"]

In [None]:
### What happens if there are non-numeric values in a column?
## Without errors='coerce', to_numeric fails with a ValueError when the
## column contains a non-numeric string such as "-":
# euk_df["Number of proteins"] = pd.to_numeric(euk_df["Number of proteins"])

## With errors='coerce', non-numeric values are converted to NaN
euk_df["Number of proteins"] = pd.to_numeric(euk_df["Number of proteins"], errors='coerce')
euk_df["Number of genes"] = pd.to_numeric(euk_df["Number of genes"], errors='coerce')
euk_df["GC%"] = pd.to_numeric(euk_df["GC%"], errors='coerce')

## mean() skips NaN values by default
print(euk_df["Number of proteins"].mean())

euk_df["Proteins per gene"] = euk_df["Number of proteins"] / euk_df["Number of genes"]
euk_df.head()

In [None]:
### To filter the rows where the ratio of proteins to genes is
## greater than or equal to 1.1, we build a series of boolean
## values from the condition and use it to select rows.

euk_df[(euk_df["Proteins per gene"] >= 1.1)]

### Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `iloc` and `loc`. For more advanced operations, these are the ones you're supposed to be using.

#### Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [None]:
### New df of only Fungi Kingdom
fungi = euk_df[(euk_df.Kingdom == 'Fungi')]

## Selecting first row
fungi.iloc[0]

## Extracting rows and columns

-  Extract specific rows and all columns

    ```df.iloc[rows]```

-  Extract specific rows and columns

    ```df.iloc[rows, columns]```

-  Extract specific rows and columns with specific indices

    ```df.iloc[start:stop:skip, start:stop:skip]```

#### Note that stop is exclusive

 ```[1:3]``` will extract the second and third rows for all columns.

```[1::2, :]``` will extract every other row from the second row to the end, and all the columns.

```[:, :3]``` will extract all rows and the first three columns (indices 0, 1, 2).
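The three slices above can be checked on a small invented dataframe whose first column simply repeats the row position, which makes the selected rows easy to read off.

```python
import pandas as pd

# Toy data, invented for illustration; column "a" equals the row position
toy = pd.DataFrame({
    "a": range(6),
    "b": range(10, 16),
    "c": range(20, 26),
    "d": range(30, 36),
})

second_third = toy.iloc[1:3]         # positions 1 and 2; stop (3) is excluded
every_other = toy.iloc[1::2, :]      # positions 1, 3, 5; all columns
first_three_cols = toy.iloc[:, :3]   # all rows; columns at positions 0, 1, 2

print(second_third)
print(every_other)
print(first_three_cols)
```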


In [None]:
## Getting the second and third rows for all columns
fungi.iloc[1:3]

In [None]:
# Getting every other row among the first 10 rows, and extracting the 4th and 5th columns
fungi.iloc[0:10:2, 3:5]

Both `loc` and `iloc` are row-first, column-second. This is the opposite of chained selection with the plain indexing operator, such as `df['column'][row]`, which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [None]:
## Select all rows, first column
fungi.iloc[:, 0]

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to omit the first 3 columns and extract just the first three rows, we would do:

In [None]:
fungi.iloc[:3, 3:]

#### Label-based selection
The second paradigm for selection is the one followed by the `loc` operator: label-based selection. In this paradigm, it's the index label of the data, not its position, that matters.
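A small sketch of the difference between labels and positions is shown below. The index labels `sp_a`, `sp_b` and `sp_c` and the values are hypothetical. One detail worth noticing is that a `loc` slice includes both end labels, while an `iloc` slice excludes its stop position.

```python
import pandas as pd

# Toy data with hypothetical string labels as the row index
toy = pd.DataFrame(
    {"Size (Mb)": [12.1, 36.0, 40.5], "GC%": [38.2, 50.1, 48.0]},
    index=["sp_a", "sp_b", "sp_c"],
)

# Label-based: rows from "sp_a" through "sp_b" (both included), one column
by_label = toy.loc["sp_a":"sp_b", ["GC%"]]

# Position-based: rows from position 0 up to, but not including, position 1
by_position = toy.iloc[0:1]

print(by_label)
print(by_position)
```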

In [None]:
## Getting all rows, and specific columns based on their headers
fungi.loc[:, ['Species', 'Number of genes', 'Number of proteins']]

### Set row index using a specific column

In [None]:
### Set index as Species for fungi df
fungi = fungi.set_index('Species')

### Reset dataframe's index

In [None]:
### Can you retrieve the rows where GC% > 50.00?
GC_fungi = fungi[fungi['GC%'] > 50.00]

In [None]:
### Reset the index: after filtering, GC_fungi still carries the index labels of the rows it came from

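The sketch below shows, on invented data, why a filtered dataframe may need its index reset, and how `set_index` and `reset_index` undo each other. The species labels and GC values are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Species": ["sp_a", "sp_b", "sp_c"],
    "GC%": [38.2, 50.1, 55.0],
})

# Filtering keeps the original index labels, so they no longer start at 0
high_gc = toy[toy["GC%"] > 50.0]
print(high_gc.index.to_list())

# drop=True discards the old labels and renumbers the rows from 0
renumbered = high_gc.reset_index(drop=True)

# set_index moves a column into the index; reset_index moves it back
by_species = toy.set_index("Species")
restored = by_species.reset_index()
print(restored)
```

Without `drop=True`, `reset_index` keeps the old index as a regular column, which is the behavior that turns a Species index back into a `Species` column.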

In [None]:
# Getting specific rows based on the row index and all columns
fungi.loc[['Saccharomyces cerevisiae'], :]

In [None]:
fungi.reset_index(inplace=True)
fungi.head()

**Data visualization with a pie plot**  
DataFrames have a [built-in pie chart plotting method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html). While it is not as customizable as libraries we will use later, such as matplotlib, it allows us to quickly plot a pie chart from the dataframe. Additionally, you can specify the size of the plot using the `figsize` parameter.


In [None]:
## Pass in the column that you want to plot as your "y" variable.
fungi.plot.pie(y='Publication year', figsize=(5, 5))
