# Data exploration with Pandas

###  Tree of Life
<table><tr><td><img src="http://t3.gstatic.com/licensed-image?q=tbn:ANd9GcSq4PRaxgfpjNOSe81JgN8l71DWtDHpkSfH3xo8EOk7khAlqQozXnJm8ubupyHj" width=300></td><td>
<img src="https://i.guim.co.uk/img/static/sys-images/Guardian/Pix/pictures/2008/04/17/DarwinSketch.article.jpg?width=445&quality=85&auto=format&fit=max&s=c7f89552d12b8495b2b4eb4d7a5bc391" width=300></td><td><img src="https://i.pinimg.com/originals/78/3f/98/783f983d622b06b9a990ad67efabbbe8.png" width=300></td></tr></table>

### Objective of this notebook
We will continue our data exploration with the dataset we used in Colab_Lec04. In this notebook, we will tackle real-world questions with pandas. We will explore a kingdoms-of-life data file. Each row in this dataset represents a particular organism, and each organism is classified into a particular Kingdom and Class.

Notebook adapted from Wendy Lee

In [2]:
# Import libraries
import pandas as pd

In [4]:
euk_filepath = "https://raw.githubusercontent.com/csbfx/advpy122-data/master/euk.tsv"
euk_df = pd.read_csv(euk_filepath, sep='\t')
euk_df.head()

| Unnamed: 0 | Species | Kingdom | Class | Size (Mb) | GC% | Number of genes | Number of proteins | Publication year | Assembly status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Emiliania huxleyi CCMP1516 | Protists | Other Protists | 167.676 | 64.5 | 38549 | 38554 | 2013 | Scaffold |
| 1 | Arabidopsis thaliana | Plants | Land Plants | 119.669 | 36.0529 | 38311 | 48265 | 2001 | Chromosome |
| 2 | Glycine max | Plants | Land Plants | 979.046 | 35.1153 | 59847 | 71219 | 2010 | Chromosome |
| 3 | Medicago truncatula | Plants | Land Plants | 412.924 | 34.047 | 37603 | 41939 | 2011 | Chromosome |
| 4 | Solanum lycopersicum | Plants | Land Plants | 828.349 | 35.6991 | 31200 | 37660 | 2010 | Chromosome |


In [60]:
# ## You can sort by one or more columns
# euk_df.sort_values(['Species', 'Publication year'], ascending=[True, False])

### Q1. How many fungal species have a genome size bigger than 100 Mb? What are their names?

**We need to filter a few things to address this question.**
1. Select all the Fungi under the "Kingdom" column.
2. Select all the Fungi with genome size greater than 100 Mb.
3. Select Species from the filtered data from step 2 above.

**Let's do it step by step. And then we will combine all of them in one line of code.**

In [61]:
# ## Narrow down to only Fungi
# euk_df[euk_df.Kingdom == 'Fungi']

To **combine two conditional statements** in the filtering, we need to surround each conditional statement with its own pair of **parentheses**.

In [62]:
# ## Narrow down to Fungi that has genome size > 100
# euk_df[(euk_df.Kingdom == 'Fungi') &
#     (euk_df['Size (Mb)'] > 100)
#     ]
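The combined filter can be tried out on a small table first. The sketch below uses a made-up `toy_df` (the species names and sizes are invented for illustration and are not taken from `euk.tsv`) to show how each condition produces a boolean Series and how `&` joins them.

```python
import pandas as pd

# Toy data invented for illustration; not taken from euk.tsv
toy_df = pd.DataFrame({
    "Species": ["Fungus A", "Fungus B", "Plant A", "Plant B"],
    "Kingdom": ["Fungi", "Fungi", "Plants", "Plants"],
    "Size (Mb)": [150.0, 40.0, 300.0, 90.0],
})

# Each comparison yields a boolean Series with one value per row
is_fungi = toy_df["Kingdom"] == "Fungi"
is_large = toy_df["Size (Mb)"] > 100

# & combines the masks element-wise. The parentheses are required
# because & binds more tightly than == and > in Python.
large_fungi = toy_df[(toy_df["Kingdom"] == "Fungi") & (toy_df["Size (Mb)"] > 100)]

print(large_fungi["Species"].to_list())  # ['Fungus A']
```

Naming the masks (`is_fungi`, `is_large`) is optional, but it can make longer filters easier to read than one long bracketed expression.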

In [63]:
# ## Species that are Fungi with genome size > 100 Mb
# euk_df[(euk_df.Kingdom == 'Fungi') & (euk_df['Size (Mb)'] > 100)].Species

#### Convert the Series in \#3 above to a Python list using the `to_list` method ####

In [64]:
# speciesList = euk_df[ (euk_df.Kingdom == 'Fungi') &
#                   (euk_df['Size (Mb)'] > 100)
#                  ]["Species"].to_list()

# ## Slice for the top 10 on the list
# speciesList[0:10]

### Q2. How many organisms are there for each Kingdom (plants, animals, fungi, protists, and other), and how many unique species names?

**HINT**: Any time we see a question involving the words **"how many ... for each ..."**, the answer is **`value_counts`**.

In [65]:
# euk_df.Kingdom.value_counts()
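As a small illustration of what `value_counts` returns, here is a sketch on a hand-made Series (the values are invented, not the real kingdom counts). The result is a Series indexed by the distinct values, sorted from most to least frequent by default.

```python
import pandas as pd

# Invented values for illustration only
kingdoms = pd.Series(["Plants", "Fungi", "Plants", "Animals", "Plants", "Fungi"])

# Count how many times each distinct value occurs
counts = kingdoms.value_counts()
print(counts)

# Individual counts can be looked up by label
print(counts["Plants"])  # 3
```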

To address "how many unique species names?", count the unique species names for plants by filtering the dataframe. [**`nunique()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) returns the number of distinct observations.

In [66]:
# euk_df[(euk_df["Kingdom"] == "Plants")].Species.nunique()

In [67]:
# ## To expand this to the other kingdoms

# for k in ["Protists", "Plants", "Fungi", "Animals", "Other"]:
#     print(k, euk_df[euk_df.Kingdom == k].Species.nunique())

#### To make the above more scalable
Hardcoding the kingdom names is not scalable. A more elegant approach is to get the list of unique kingdom names directly from the dataframe. [**`unique()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) returns the unique values in order of appearance. It does NOT sort.

In [68]:
# ## Return an array of unique values in Kingdom
# euk_df.Kingdom.unique()

# ## Use an array to make it more scalable
# for kingdom in euk_df.Kingdom.unique():
#     print(kingdom, euk_df[euk_df.Kingdom == kingdom].Species.nunique())
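An alternative to looping over `unique()` is to let pandas do the grouping. The sketch below uses `groupby` with `nunique`, which is a different technique from the loop above, on a made-up `toy_df` (names invented for illustration).

```python
import pandas as pd

# Toy data invented for illustration; "Plant A" appears twice on purpose
toy_df = pd.DataFrame({
    "Kingdom": ["Plants", "Plants", "Plants", "Fungi", "Fungi"],
    "Species": ["Plant A", "Plant A", "Plant B", "Fungus A", "Fungus B"],
})

# unique() keeps the order of first appearance and does not sort
kingdom_names = toy_df["Kingdom"].unique()
print(list(kingdom_names))  # ['Plants', 'Fungi']

# groupby + nunique counts distinct species within each kingdom in one step
unique_per_kingdom = toy_df.groupby("Kingdom")["Species"].nunique()
print(unique_per_kingdom)
```

The result is a Series indexed by kingdom, so it can be sorted or filtered like any other Series.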

### Q3. Make a new dataframe containing just the rows for the *Aquila* genus.

**Let's go over some biology terminology**
- The names under the **Species** column are scientific names that are made up of a *genus* name and a *species* name separated by a space. Example: *Homo sapiens*. Note: there are some species that don't follow that format and we will ignore them for now.

To solve this problem, we will need to **separate the genus and species names** for each value in the **Species** column.

In [69]:
# ## Review how to split a string
# a = "abc def"
# a_split = a.split()
# print(a_split)
# print(a_split[0])


# # Split the strings stored in the column Species
# euk_df.Species.str.split()

In [70]:
# ## Take the first element of each of the resulting lists (again remembering to refer to the str attribute):
# ## This gives us our series of genus names.
# euk_df.Species.str.split().str[0]

In [71]:
# ## Add the condition to get a series of boolean values:
# euk_df.Species.str.split(' ').str[0] == "Aquila"

In [72]:
# ## Plug the above into the original dataframe to select the rows that are True

# aquila_data = euk_df[ (euk_df.Species.str.split(' ').str[0] == "Aquila") ]
# aquila_data
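The same genus-extraction steps can be checked on a tiny table. In this sketch the species list is made up for illustration ("Aquila hypothetica" is an invented name), and the genus Series is stored in a variable before it is used as a filter.

```python
import pandas as pd

# Toy data for illustration; "Aquila hypothetica" is an invented name
toy_df = pd.DataFrame({
    "Species": ["Aquila chrysaetos", "Homo sapiens", "Aquila hypothetica", "Glycine max"],
})

# .str.split() splits every string on whitespace, giving a list per row;
# .str[0] then takes the first element of each list (the genus)
genus = toy_df["Species"].str.split().str[0]
print(genus.to_list())  # ['Aquila', 'Homo', 'Aquila', 'Glycine']

# Comparing to "Aquila" gives a boolean mask that selects the matching rows
aquila_rows = toy_df[genus == "Aquila"]
print(aquila_rows)
```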

In [73]:
# ## Extract the genus name and combine it with other columns in the dataframe to create a new dataframe
# euk_df['Genus']=euk_df.Species.str.split(" ").str[0]

# neweuk = euk_df[["Species","Genus","Class","Kingdom"]]
# neweuk

### Q4. Which organisms have at least 10% more proteins than genes?

There are a few different ways to interpret "10% more", but for the purposes of this question we will say that we want to divide the number of proteins by the number of genes, and if the result is greater than or equal to 1.1 then we want to include the organism.

If you look at the `euk_df` dataframe, you will see that the columns **Number of proteins** and **Number of genes** contain a mix of numeric values and dashes. To ensure that all the values in these two columns are numeric, we will use the pandas function [`to_numeric`](https://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.to_numeric.html) to convert the values to numeric. We will set the `errors` argument to `'coerce'` to convert any non-numeric value to NaN (not a number).

In [74]:
# euk_df["Number of genes"] =  pd.to_numeric(euk_df["Number of genes"],  errors='coerce')
# euk_df["Number of proteins"] =  pd.to_numeric(euk_df["Number of proteins"],  errors='coerce')
# euk_df["GC%"] =  pd.to_numeric(euk_df["GC%"],  errors='coerce')

# # automatically applying the division for each row
# euk_df["Proteins per gene"] = euk_df["Number of proteins"]/euk_df["Number of genes"]

In [75]:
# # To filter the rows where the ratio of proteins and genes is
# # greater than or equal to 1.1, we will get a series of boolean
# # values by setting the condition.

# euk_df[(euk_df["Proteins per gene"] >= 1.1)]
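To see what `errors='coerce'` does to the dashes, here is a minimal sketch on invented values (not the real gene and protein counts). It also shows that rows with a NaN ratio are dropped by the `>= 1.1` filter, because any comparison with NaN evaluates to False.

```python
import pandas as pd

# Toy data invented for illustration; "-" stands in for a missing value
toy_df = pd.DataFrame({
    "Number of genes": ["100", "-", "200"],
    "Number of proteins": ["120", "50", "-"],
})

# Convert both columns; anything that is not a number becomes NaN
for col in ["Number of genes", "Number of proteins"]:
    toy_df[col] = pd.to_numeric(toy_df[col], errors="coerce")

# Division is applied row by row; a NaN in either column gives a NaN ratio
toy_df["Proteins per gene"] = toy_df["Number of proteins"] / toy_df["Number of genes"]

# NaN >= 1.1 is False, so only the first row (ratio 1.2) is kept
selected = toy_df[toy_df["Proteins per gene"] >= 1.1]
print(selected)
```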

### Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `iloc` and `loc`. For more advanced operations, these are the ones you're supposed to be using.

#### Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [76]:
# ## New df of fungi
# fungi = euk_df[(euk_df.Kingdom == 'Fungi')]

# ## Selecting first row
# fungi.iloc[0]

#### Extracting rows and columns

-  Extract specific rows and all columns

    ```df.iloc[rows]```

-  Extract specific rows and columns

    ```df.iloc[rows, columns]```

-  Extract specific rows and columns with specific indices

    ```df.iloc[start:stop:skip, start:stop:skip]```

#### Note that stop is exclusive

 ```[1:3]``` will extract the second and third rows for all columns.

```[1::2, :]```  will extract every other row, from the 2nd row to the end, and all the columns.

```[:, :3]```  will extract all rows and the first three columns (indices 0, 1, 2).


In [77]:
# ## Getting the second and third rows for all columns
# fungi.iloc[1:3]

In [78]:
# # Getting every other row from the first row to the 10th row, and extracting the 4th and 5th columns
# fungi.iloc[0:10:2, 3:5]
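The slicing patterns listed above can be verified on a small numeric table. This sketch uses an invented `toy_df` with columns `a` to `d`, chosen so that column `a` holds each row's position.

```python
import pandas as pd

# Toy data invented for illustration; column "a" equals the row position
toy_df = pd.DataFrame({
    "a": list(range(0, 6)),
    "b": list(range(10, 16)),
    "c": list(range(20, 26)),
    "d": list(range(30, 36)),
})

# Rows at positions 1 and 2 (stop is exclusive), all columns
second_and_third = toy_df.iloc[1:3]

# Every other row starting at position 1 (positions 1, 3, 5), all columns
every_other = toy_df.iloc[1::2, :]

# All rows, columns at positions 0, 1, 2
first_three_cols = toy_df.iloc[:, :3]

print(second_and_third["a"].to_list())  # [1, 2]
print(every_other["a"].to_list())       # [1, 3, 5]
print(list(first_three_cols.columns))   # ['a', 'b', 'c']
```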

Both `loc` and `iloc` are row-first, column-second. This is the opposite of the plain indexing operator used earlier, where we pick a column first and then a row (for example, `euk_df['Species'][0]`).

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [79]:
# ## Select all rows, first column
# fungi.iloc[:, 0]

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to omit the first 3 columns and extract just the first three rows, we would do:

In [80]:
# fungi.iloc[:3, 3:]

#### Label-based selection
The second paradigm for attribute selection is the one followed by the `loc` operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

In [81]:
# ## Getting all rows, and specific columns based on their headers
# fungi.loc[:, ['Species', 'Number of genes', 'Number of proteins']]
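The difference between position and label is easiest to see when the row index does not start at 0. The sketch below uses an invented `toy_df` with row labels 10, 20, 30 and 40. Note that a `loc` slice includes its end label, while an `iloc` slice excludes its stop position.

```python
import pandas as pd

# Toy data invented for illustration, with row labels that differ from positions
toy_df = pd.DataFrame(
    {
        "Species": ["sp A", "sp B", "sp C", "sp D"],
        "Size (Mb)": [10.0, 20.0, 30.0, 40.0],
    },
    index=[10, 20, 30, 40],
)

# iloc works on positions: positions 0 and 1 (stop is exclusive)
by_position = toy_df.iloc[0:2]
print(by_position.index.to_list())  # [10, 20]

# loc works on labels: labels 10 through 30 (end label is included)
by_label = toy_df.loc[10:30]
print(by_label.index.to_list())  # [10, 20, 30]

# loc also selects columns by name
sizes = toy_df.loc[:, ["Size (Mb)"]]
```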

### Set row index using a specific column

In [82]:
# fungi.set_index('Species', inplace=True)
# fungi.head()

In [83]:
# # Getting specific rows based on the row index and all columns
# fungi.loc[['Saccharomyces cerevisiae'], :]

### Reset dataframe's index

In [84]:
# fungi.reset_index(inplace=True)
# fungi.head()
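As a variation on the cells above, the sketch below sets and resets the index without `inplace=True`, assigning the result to new variables instead. The data is invented for illustration, and the explicit `.copy()` is one way to make sure the filtered table is independent of the original dataframe.

```python
import pandas as pd

# Toy data invented for illustration; not taken from euk.tsv
toy_df = pd.DataFrame({
    "Species": ["sp A", "sp B", "sp C"],
    "Kingdom": ["Fungi", "Plants", "Fungi"],
    "Size (Mb)": [12.0, 120.0, 35.0],
})

# Take an independent copy of the filtered rows
fungi_toy = toy_df[toy_df["Kingdom"] == "Fungi"].copy()

# set_index returns a new dataframe whose row labels are the species names
indexed = fungi_toy.set_index("Species")

# With species as the index, loc can select rows by name
row = indexed.loc[["sp C"], :]
print(row)

# reset_index moves "Species" back to a column and restores 0, 1, ... labels
restored = indexed.reset_index()
print(restored)
```

Returning new dataframes keeps each intermediate step available for inspection, which can be convenient when rerunning notebook cells out of order.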