# <i class="fa fa-laptop"></i> Bioinformatic Programming using Python

<div style="background-color: #86CBBB; 1px; height:3px " ></div>

Here, we will demonstrate some of the basics of programming in Python. If you want to learn more, there are many other resources and training sessions available, including the official [Python Tutorial](https://docs.python.org/3/tutorial/).  

In this Jupyter _notebook_ you can answer and practice the exercises from this practical session. 

There are several icons that denote:

* **<i class="fa fa-search"></i> Example**: Some code examples available for the topic of interest. 
* **<i class="fa fa-pencil"></i> Activities**: These are activities to practice the lesson topics.
* **<i class="fa fa-key"></i> Hint:** A small hint on how to solve the exercise.
* **<i class="fa fa-cogs"></i> Code:** Write your code here. More than one line can be provided in the same code block. The symbol "..." is a placeholder to be replaced by your code.
* **<i class="fa fa-file-code-o"></i> Script:** Instructions are provided to create an external script using the terminal or other resources.
* **<i class="fa fa-comment"></i> Comments:** Write your comments or notes here if desired. This cell is formatted using Markdown. To activate it, click on it, write, and then press Run or Execute.
* **<i class="fa fa-rocket"></i> Challenge**: Activities to test your coding skills.

## Getting Started

Jupyter notebooks have two types of cells: "Markdown" cells, like this one, which can contain formatted text, and "Code" cells, which contain the code you will run.

After you execute the cell, the result will automatically display underneath the cell.


## Jupyter menu

The Jupyter menu offers several buttons and options. A short description of each follows; see additional details [here](https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Users%20Manual.ipynb) or [here](https://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Notebook%20Basics.html#Notebook-Basics).

* <i class="fa fa-save"></i> This is your save button. You can click this button to save your notebook at any time, though keep in mind that Jupyter Notebooks automatically save your progress very frequently.
 
* <i class="fa fa-plus"></i> This is the new cell button. You can click this button any time you want a new cell in your Jupyter Notebook.
 
* <i class="fa fa-scissors"></i> This is the cut cell button. If you click this button, the cell you currently have selected will be deleted from your Notebook.   
 
* <i class="fa fa-copy"></i> This is the copy cell button. If you click this button, the currently selected cell will be duplicated and stored in your clipboard. 
 
* <i class="fa fa-paste"></i> This is the paste button. It allows you to paste the duplicated cell from your clipboard into your notebook.
 
* <i class="fa fa-arrow-up"></i><i class="fa fa-arrow-down"></i> These buttons allow you to move the location of a selected cell within a Notebook. Simply select the cell you wish to move and click either the up or down button until the cell is in the location you want it to be.
 
* <i class="fa fa-step-forward"></i> This button will "run" your cell, meaning that it will interpret your input and render the output in a way that depends on what type of cell you're using. 
 
* <i class="fa fa-step-forward"></i> This is the restart kernel button. See your kernel documentation for more information.
 
* <i class="fa fa-stop"></i> This is the stop button. Clicking this button will stop your cell from continuing to run. This tool can be useful if you are trying to execute more complicated code, which can sometimes take a while, and you want to edit the cell before waiting for it to finish rendering. 
  
* <i class="fa fa-repeat"></i> This button clears the variables that have been stored so far.  
 
 
There are also several shortcuts available; check the list below:

![image-2.png](attachment:image-2.png)

Some examples:

**Edit mode**: Enter <br>
**Run cell**: Shift-Enter, Ctrl-Enter <br>
**Basic browsing**: up arrow/k, down arrow/j, esc <br>
**Save the notebook**: S <br>
**Change cell format type**: M, Y <br>
**Create new cells**: A (insert cell above), B (insert cell below) <br>
**Edit cells**: X (Cut), C (Copy), V (Paste), D, D (Delete), Z (Undo cell deletion) <br>
**Show keyboard shortcuts**: H

<br>

 <div style="background-color: #86CBBB; 1px; height:3px " ></div>

# A basic introduction to Python: Pandas dataframes
Let's review a few basic features of one of Python's more advanced data types: the pandas dataframe.

## Pandas dataframes

**_Pandas Dataframes_** 

[pandas](https://pandas.pydata.org/docs/index.html#module-pandas) is a library providing high-performance, easy-to-use data structures and data analysis tools.

Dataframes contain:
- Data organized in 2 dimensions, rows and columns
- Labels that correspond to the rows and columns

Pandas dataframe:
- has functions for analyzing, exploring, and manipulating data.
- can clean messy data sets (missing, NULL, wrong values), and make them readable and relevant.
- allows us to analyze big data and make conclusions based on statistical theories.



Additional details [here](https://realpython.com/pandas-dataframe/) and tutorials [there](https://www.w3schools.com/python/pandas/default.asp)

### <i class="fa fa-search"></i> Example

#### Create dataframes

In [None]:
# import package
import pandas

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

In [None]:
type(mydataset)

In [None]:
type(myvar)

Check version of package:

In [None]:
## check version of package
import pandas
print(pandas.__version__)

#### Use alias to import pandas

You can also use an alias. 

In Python, an alias is an alternate name for referring to the same thing.

Now the Pandas package can be referred to as `pd` instead of `pandas`.
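
As a quick check of that statement, the minimal sketch below imports the package under both names and confirms that they refer to the same module object. The variable `same_module` is only an illustrative name.

```python
import pandas
import pandas as pd

# Both names are bound to the same module object, so either can be used.
same_module = pd is pandas
print(same_module)  # True
```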



In [None]:
# import package and use abbreviation
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai','Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]

df = pd.DataFrame(data=data, index=row_labels)

df

#### Head & tail

`pandas` DataFrames can sometimes be very large, making it impractical to look at all the rows at once. You can use `.head()` to show the first few items and `.tail()` to show the last few items. 

You can set the number of rows to show with the parameter `n`.



In [None]:
df.head(n=3)

In [None]:
df.tail(n=3)

#### Find index & columns names, and other information

In [None]:
df.ndim

In [None]:
df.shape

In [None]:
df.size

In [None]:
## find index IDs
df.index

In [None]:
# find columns
df.columns

In [None]:
len(df.columns)

In [None]:
df.columns[1]

In [None]:
df.dtypes

As you can see, `.dtypes` returns a Series object with the column names as labels and the corresponding data types as values.
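
Because the result is a Series, it can be indexed by column name like any other Series. The sketch below uses a small throwaway frame called `demo` (a hypothetical name, chosen so that it does not overwrite `df`).

```python
import pandas as pd

# Hypothetical two-column frame, used only for this illustration.
demo = pd.DataFrame({'age': [41, 28], 'city': ['Prague', 'Osaka']})

col_types = demo.dtypes            # a Series indexed by column name
print(type(col_types).__name__)    # Series
print(col_types.index.tolist())    # ['age', 'city']
print(col_types['age'])            # the dtype of the 'age' column (an integer type)
```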

#### Access data in dataframes

You can access a column in a `pandas` DataFrame the same way you would get a value from a dictionary:

In [None]:
## access columns
df['city']

If the name of the column is a string that is a valid Python identifier, then you can use dot notation to access it. That is, you can access the column the same way you would get the attribute of a class instance:

In [None]:
## access columns too
df.city
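
Not every column label qualifies for dot notation. As a minimal sketch, Python's built-in `str.isidentifier()` can tell you whether a label is a valid identifier; the frame `demo` below is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical frame: one label is a valid identifier, the other is not.
demo = pd.DataFrame({'city': ['Prague', 'Osaka'], 'py-score': [81.0, 84.0]})

print('city'.isidentifier())      # True: demo.city works
print('py-score'.isidentifier())  # False: the hyphen rules out dot notation

# Bracket notation works for any column label.
scores = demo['py-score']
print(scores.tolist())            # [81.0, 84.0]
```

Writing `demo.py-score` would be parsed by Python as `demo.py` minus `score`, which is why labels such as `py-score` need brackets.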

Also, pandas has four accessors in total:

- `.loc[]` accepts the labels of rows and columns and returns Series or DataFrames. You can use it to get entire rows or columns, as well as their parts.

- `.iloc[]` accepts the zero-based indices of rows and columns and returns Series or DataFrames. You can use it to get entire rows or columns, or their parts.

- `.at[]` accepts the labels of rows and columns and returns a single data value.

- `.iat[]` accepts the zero-based indices of rows and columns and returns a single data value.

In [None]:
df

In [None]:
## Access a row
df.loc[103]

In [None]:
## Access a row using zero-based indices
df.iloc[2]

#### Slicing dataframes

`.loc[]` and `.iloc[]` are particularly powerful. They support slicing and NumPy-style indexing. You can use them to access rows, columns, or a subset of both:

In [None]:
df.loc[:,:]

In [None]:
df.loc[ [103, 105], ['city','age']]

In [None]:
## slicing dataframe: first three rows (positions 0-2), all columns included
df.iloc[:3]

In [None]:
## slicing dataframe: first three rows (positions 0-2), only the first two columns (positions 0-1)
df.iloc[:3,:2]

In [None]:
## Reversing the column order of the dataframe
df.iloc[:,::-1]
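
One detail worth keeping in mind when slicing: a label slice with `.loc[]` includes its end label, while a position slice with `.iloc[]` excludes its end position, as in ordinary Python slicing. The sketch below illustrates this on a hypothetical frame named `demo`.

```python
import pandas as pd

# Hypothetical frame with the same kind of integer row labels as `df`.
demo = pd.DataFrame({'age': [41, 28, 33, 34]}, index=[101, 102, 103, 104])

# Both slices end at the row labelled 103, which sits at position 2.
by_label = demo.loc[101:103]    # end label 103 is included
by_position = demo.iloc[0:2]    # end position 2 is excluded

print(by_label.index.tolist())     # [101, 102, 103]
print(by_position.index.tolist())  # [101, 102]
```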

#### Access cell values

It’s possible to use `.loc[]` and `.iloc[]` to get particular data values. However, when you need only a single value, pandas recommends using the specialized accessors `.at[]` and `.iat[]`:

In [None]:
## Access a cell value
df.at[103, 'name']

In [None]:
## Access a cell value using zero-based indices
df.iat[1, 1]

#### Modify dataframes

In [None]:
df.loc[:, 'py-score']

In [None]:
df.loc[:104, 'py-score'] = [40, 50, 60, 70]

In [None]:
df.loc[:, 'py-score']

In [None]:
df.loc[104:, 'py-score'] = 0

In [None]:
df.loc[:, 'py-score']

#### Insert or delete data in dataframes

In [None]:
john = pd.Series(data=['John', 'Boston', 34, 79], 
                 index=df.columns, name=17)

In [None]:
john

In [None]:
## DataFrame.append() was removed in pandas 2.0; pd.concat() works in old and new versions
df = pd.concat([df, pd.DataFrame([john])])

In [None]:
df

In [None]:
## remove the inserted row using drop
df = df.drop(labels=[17])

In [None]:
## add a new column with given values as numpy.array
import numpy as np
df['js-score'] = np.array([71.0, 95.0, 88.0, 79.0, 91.0, 91.0, 80.0])

In [None]:
df

In [None]:
## add a column in the last position with all values equals to 0.0
df['total-score'] = 0.0

In [None]:
df

In [None]:
## add a new column in a different position
df.insert(loc=4, column='django-score',
          value=np.array([86.0, 81.0, 78.0, 88.0, 74.0, 70.0, 81.0]))

In [None]:
df

#### Remove elements

We can remove columns from a dataframe using `del` or `.pop()` and the column name.

In [None]:
del df['total-score']
## df.pop('total-score')

In [None]:
df

You can also remove one or more columns with `.drop()`, as you did previously with the rows. Again, you need to specify the desired columns with the `labels` parameter. In addition, when you want to remove columns, you need to provide the argument `axis=1`:



In [None]:
df = df.drop(labels='age', axis=1)

In [None]:
df

#### Operations on dataframes

In [None]:
df['py-score'] + df['js-score'] + df['django-score']

In [None]:
df['total'] = df['py-score'] + df['js-score'] + df['django-score']

In [None]:
df

##### Applying NumPy and SciPy Functions

Most `NumPy` and `SciPy` routines accept `pandas` Series or DataFrame objects as arguments in place of `NumPy` arrays. To illustrate this, we can use the `numpy.average()` function.

In [None]:
df.iloc[:, 2:5]

In [None]:
np.average(df.iloc[:, 2:5], axis=1)

In [None]:
df['avg'] = np.average(df.iloc[:, 2:5], axis=1)

In [None]:
df

#### Sort dataframes

We can sort DataFrames by the values in one or more columns with `.sort_values()`. The main parameters are:
- `by` sets the label of the row or column to sort by. 
- `ascending` specifies whether you want to sort in ascending (`True`) or descending (`False`) order, the former being the default setting. 
- `axis` chooses whether you want to sort rows (`axis=0`) or columns (`axis=1`).

If you want to sort by multiple columns, pass lists as arguments for `by` and `ascending`.

See additional details [here](https://realpython.com/pandas-sort-python/).

In [None]:
df.sort_values(by='avg', ascending=True)

In [None]:
df.sort_values(by=['city','avg'])

The optional parameter `inplace` can also be used with `.sort_values()`. It’s set to False by default, ensuring `.sort_values()` returns a new pandas DataFrame. When you set inplace=True, the existing DataFrame will be modified and `.sort_values()` will return `None`.
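
The sketch below illustrates both points: passing lists to `by` and `ascending`, and the effect of `inplace=True`. It uses a hypothetical frame named `demo` so that `df` is left untouched.

```python
import pandas as pd

# Hypothetical frame with two cities and one score column.
demo = pd.DataFrame(
    {'city': ['Prague', 'Osaka', 'Prague', 'Osaka'],
     'avg': [70.0, 90.0, 85.0, 60.0]},
    index=[101, 102, 103, 104],
)

# City A-Z, and within each city the highest average first.
ranked = demo.sort_values(by=['city', 'avg'], ascending=[True, False])
print(ranked.index.tolist())   # [102, 104, 103, 101]

# inplace=True reorders `demo` itself and returns None.
result = demo.sort_values(by='avg', inplace=True)
print(result)                  # None
print(demo.index.tolist())     # [104, 101, 103, 102]
```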

#### Filter dataframes

Data filtering is another powerful feature of `pandas`. It works similarly to indexing with Boolean arrays in NumPy.

If you apply some logical operation on a Series object, then you’ll get another Series with the Boolean values `True` and `False`.

In [None]:
df['django-score'] >= 80

In [None]:
df[df['django-score'] >= 80]

You can create very powerful and sophisticated expressions by combining logical operations with the following operators:

- NOT (~)
- AND (&)
- OR (|)
- XOR (^)

In [None]:
## Both conditions must be true
df[ (df['py-score'] >= 40) & (df['js-score'] >= 80)]

In [None]:
## One of both or both: py-score >= 40 OR js-score >= 80
df[ (df['py-score'] >= 40) | (df['js-score'] >= 80)]

In [None]:
## One of both, but not both!
df[ (df['py-score'] >= 40) ^ (df['js-score'] >= 80)]
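
The NOT operator (`~`) is listed above but not shown in the cells. A minimal sketch, using a hypothetical frame named `demo`, inverts a Boolean Series to select the rows that fail a condition:

```python
import pandas as pd

# Hypothetical frame with a single score column.
demo = pd.DataFrame({'django-score': [86.0, 74.0, 81.0]}, index=[101, 102, 103])

passed = demo['django-score'] >= 80    # Boolean Series
print(passed.tolist())                 # [True, False, True]

# ~ flips every True/False, so this keeps the rows below 80.
print(demo[~passed].index.tolist())    # [102]
```

When combining conditions, keep each comparison in parentheses as in the cells above: `&`, `|` and `^` bind more tightly than `>=` in Python.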

#### Some data statistics

In [None]:
df

In [None]:
df.describe()

In [None]:
## numeric_only=True skips the text columns (needed in pandas 2.0 and later)
df.mean(numeric_only=True)

In [None]:
df['py-score'].mean()

In [None]:
df['py-score'].std()

In [None]:
## find correlations between the numeric columns
df.corr(numeric_only=True)

#### Iterate over pandas dataframes

Pandas DataFrame’s row and column labels can be retrieved as sequences with `.index` and `.columns`. You can use this feature to iterate over labels and get or set data values. 

However, pandas provides several more convenient methods for iteration.

Iterate columns:
- [`.items()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.items.html) to iterate over columns
- `.iteritems()` is the older name for `.items()`; it was deprecated in pandas 1.5 and removed in pandas 2.0, so use `.items()` instead

Iterate rows:
- [`.iterrows()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) to iterate over rows
- [`.itertuples()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html) to iterate over rows and get named tuples

In [None]:
## iterate using DataFrame.items() (iteritems() was removed in pandas 2.0)
for col_label, col in df.items():
    print("Col_label: ", col_label)
    print("col: ")
    print(col, end='\n\n')

In [None]:
## iterate using pd.iterrows()
for row_label, row in df.iterrows():
    print("row_label: ", row_label)
    print("row: ")
    print(row, end='\n\n')

In [None]:
for row in df.itertuples():
    print(row)
    print(row.name)
    print("#")
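
For comparison, the label-based approach mentioned at the start of this section can be sketched as follows: loop over `.index` and `.columns` and read each value with `.at[]`. The frame `demo` and the list `cells` are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical frame with two rows and two columns.
demo = pd.DataFrame({'py': [88.0, 79.0], 'js': [71.0, 95.0]}, index=[101, 102])

cells = []
for row_label in demo.index:
    for col_label in demo.columns:
        # .at[] reads one value by its row and column label.
        value = float(demo.at[row_label, col_label])
        cells.append((row_label, col_label, value))

print(len(cells))  # 4
```

This works, but the dedicated methods above are usually shorter and easier to read.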

#### Plotting data using pandas and matplotlib

In [None]:
df

In [None]:
import matplotlib.pyplot as plt
df.plot(kind = 'scatter', x = 'name', y = 'django-score')
plt.show()

In [None]:
import matplotlib.pyplot as plt
df.boxplot('django-score', by='py-score')
plt.show()

#### Read and write information from pandas to file

Imagine we have information stored as CSV in a `data.csv` file:


In [None]:
# df_example = pd.read_csv('data.csv') ## other parameters can be used

We can write the information to CSV, Excel, or JSON files:


In [None]:
# df_example.to_csv()
# df_example.to_excel()
# df_example.to_json()
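
Since no `data.csv` ships with this section, the sketch below runs a round trip entirely in memory: `to_csv()` called without a path returns the CSV text, and `read_csv()` accepts a file-like object such as `io.StringIO`. The frame `demo` and its contents are hypothetical.

```python
import io
import pandas as pd

# Hypothetical frame standing in for real data.
demo = pd.DataFrame({'name': ['Ann', 'Yi'], 'age': [28, 34]}, index=[102, 104])

# With no path given, to_csv() returns the CSV text instead of writing a file.
csv_text = demo.to_csv()
print(csv_text)

# index_col=0 tells read_csv() to use the first column as the row labels.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

With a real file, you would pass a path instead, for example `demo.to_csv('data.csv')` followed by `pd.read_csv('data.csv', index_col=0)`.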

<div style="background-color: #86CBBB; font-size: 16px; line-height:30px; height:30px;padding-left: 6px; " >

**<i class="fa fa-pencil"></i> Activity 1 | Some basic operations on pandas dataframes**

</div>

1.1) Write a Pandas program to create a dataframe from a dictionary and display it.
```
Sample data: {'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]}
```

1.2) Write a Pandas program to create and display a DataFrame from a specified dictionary data which has the index labels.
Sample Python dictionary data and list labels:

```
exam_data = {

'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']

}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```

1.3.1. Using the previous dataframe get the first three rows

1.3.2. Using the previous dataframe get the last three rows

1.3.3. Select the 'name' and 'score' columns from the previous DataFrame.

1.3.4. Select the 'name' and 'score' columns in rows 1, 3, 5, 6 from the previous DataFrame.

1.3.5 Select the rows where the number of attempts in the examination is greater than 2.

1.3.6 Select the rows where the score is missing, i.e. is NaN

**Hint**: You can use `.isnull()`
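
As a minimal sketch of how `.isnull()` behaves, here it is on hypothetical data unrelated to the exercise (the frame `demo` and its `reading` column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
demo = pd.DataFrame({'reading': [1.5, np.nan, 3.0]}, index=['x', 'y', 'z'])

missing = demo['reading'].isnull()   # True where the value is NaN
print(missing.tolist())              # [False, True, False]
print(demo[missing].index.tolist())  # ['y']
```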

1.3.7 Select the rows where the score is between 15 and 20 (inclusive).

1.3.8 Select the rows where number of attempts in the examination is less than 2 and score greater than 15.

1.3.9 Change the score in row 'd' to 11.5.

<div style="background-color: #86CBBB; font-size: 16px; line-height:30px; height:30px;padding-left: 6px; " >

**<i class="fa fa-pencil"></i> Activity 2 | Pandas dataframe and input files:**

</div>

## Read csv and use pandas dataframe

Example of pandas dataframe operation using a synthetic dataset from Kaggle. See additional reference [here](https://www.kaggle.com/code/franciscomcm/five-useful-operations-with-pandas-dataframes/notebook)

This dataset contains very basic examples to explore handy operations with Pandas DataFrames. There are 3 CSV files in the dataset:
- `thermometer_A.csv` and `thermometer_B.csv` contain synthetic data representing temperature measurements over a full day by two devices.
- `fertiliser_plant_growth.csv` contains synthetic data representing the growth of 3 groups of plants (control, fertilizer A and fertilizer B).

Find the files on GitHub or in the shared Google Drive classroom folder, under the `Data/Pandas_example.Fertilizer` folder.

1.1) Read each CSV file into a dataframe, then show it, describe it, check its dimensions, etc.

In [None]:
## read fertiliser_plant_growth.csv into a dataframe

In [None]:
## read thermometer_A.csv into a dataframe

In [None]:
## read thermometer_B.csv into a dataframe

1.2) Add the temperatures from thermometers A and B to the fertilizer dataframe. Also add the time points as a new column.

1.3) Find the difference between the temperature measurements

**<i class="fa fa-rocket"></i> Challenge**:


In [None]:
## Find the biggest absolute difference

1.4) Find the mean and std of the temperature measurements

1.5) Create some basic plots

1.5.1 Use boxplot to plot rate of growth according to each group

1.5.2. Plot temperature for thermometer A (or B) along time.

**<i class="fa fa-rocket"></i> Challenge**:
1.5.3. Create a plot along time for both temperature series.

1.6) Create some filtering on your dataframe

1.6.1. Only Control samples

1.6.2. Only Fertiliser samples: 1 & 2

1.6.3 All fertiliser samples with baseline 1

1.6.4 All samples with TempA and TempB < 20

1.6.5 All samples with TempA or TempB < 20