# Data collection project tutorials: Intro to Python, Jupyter notebooks, pandas and data manipulation

This is a Jupyter notebook. We briefly mentioned them in the FAIR data workshop, but now we're going to use them as a tool to learn the very fundamentals of Python and to grasp the immeasurable importance of programming when we work with data.

## Sweet introduction to chaos

We will begin where everybody begins with programming: the *"Hello world"* exercise. We simply want Python to **print** a piece of text, also known as a **string**.

In [None]:
print("Hello world")

Notice the quotation marks ("\_\_\_\_") around the text? That's our way of telling Python that this is text. Anything that we put between quotation marks becomes a piece of text and can be printed. Example:

In [None]:
print("Quarantine sucks")
print("wololo")
print("Is the lab open?")

What happens if we don't use them?

In [None]:
print(wololo)

This `NameError` happens because Python believes that wololo is a **variable** and cannot find one with that name. A variable is a name that refers to a piece of information, which can be, for instance, a string:

In [None]:
my_first_variable = "Python is fun"
print(my_first_variable)

But python can do much more than printing text :)

## Data types and simple manipulations

The three most common types of data in Python are strings, floats and integers.

In [None]:
string_var = 'bacon'
float_var = 3.14
int_var = 42

And we can ask Python which kind of data each variable contains with the built-in function **type()**.

In [None]:
print(type(string_var))
print(type(float_var))
print(type(int_var))

Some operations with these variables are "infective" (the result takes on the type of one of the operands, as when an integer meets a float), some others are incompatible, and some others are possible after a transformation:

In [None]:
print(float_var+float_var)
print(float_var + 1.2)
print(string_var + string_var)
print(string_var + ' is delicious')
print(string_var + ' costs ' + str(float_var) + ' euros in my local supermarket')  # Float transformed into string
print(int_var + int_var)
print(int_var + float_var)  # Infective! int + float gives a float
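
The incompatible case can be sketched as follows: adding a string and an integer raises a `TypeError`, and a conversion with `str()`, `int()` or `float()` makes the operation possible. The variables repeat the ones defined earlier so that the block runs on its own; the names `as_text`, `as_int` and `as_float` are only illustrative.

```python
string_var = 'bacon'
int_var = 42

# Incompatible: Python refuses to add text and a number
try:
    string_var + int_var
except TypeError as error:
    print('Incompatible:', error)

# Possible after a transformation
as_text = string_var + ' ' + str(int_var)  # int -> string
as_int = int('42') + int_var               # string -> int
as_float = float('3.14') + int_var         # string -> float

print(as_text)
print(as_int)
print(type(as_float))
```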

We would need two sessions only to show every fancy way in which you can operate with variables, but for the sake of time, we'll go forward to data structures now.

## Data structures

### Lists

The most basic data structure is a **list**. A list is an ordered collection of data elements stored in the same object. Lists are defined with square brackets **\[ \]** in Python's syntax. These elements can be of any kind, even variables and other lists:

In [None]:
my_list = ['ham', 'eggs', 42, float_var, [1,2,3,4], 3.14]
print(my_list)

Lists can be indexed, which means that we can choose which position to get the information from (keep in mind, python counts starting from 0):

In [None]:
print(my_list[0]) 
print(my_list[1])
print(my_list[-1])
print(my_list[-2])
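
Indexing can go a bit further than single positions. The sketch below reuses the same list (redefined here so the block runs on its own) to reach inside the nested list, take a slice, and append an element; the names `inner_first`, `first_two` and `length` are only illustrative.

```python
float_var = 3.14
my_list = ['ham', 'eggs', 42, float_var, [1, 2, 3, 4], 3.14]

inner_first = my_list[4][0]  # first element of the nested list
first_two = my_list[:2]      # slice: positions 0 and 1 (the end position is excluded)
length = len(my_list)        # number of elements; the nested list counts as one

my_list.append('bacon')      # lists can grow after they are created

print(inner_first)
print(first_two)
print(length)
print(my_list[-1])
```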

### Dictionaries

A very pythonic data structure is the **dictionary**, which stores values that are looked up by their key rather than by their position. Dictionaries are defined with curly brackets **{ }** in Python's syntax.

In [None]:
my_car = {"brand": "Ford",
  "model": "Mustang",
  "year": 1964}
print(my_car)

So if we want to know the model of my car we'll do the following:

In [None]:
print(my_car['model'])

But I forgot to mention that I upgraded it in 2010!!! Let me just put that in...

In [None]:
my_car['upgraded'] = 2010
print(my_car)

Perfect
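
A few more dictionary operations come in handy later. This is a small sketch on the same car (redefined so the block runs on its own); the `'colour'` key is hypothetical and deliberately missing, to show what happens when a key is not there.

```python
my_car = {"brand": "Ford", "model": "Mustang", "year": 1964}
my_car['upgraded'] = 2010

keys = list(my_car.keys())                # all the keys, in the order they were added
has_colour = 'colour' in my_car           # "in" checks the keys, not the values
colour = my_car.get('colour', 'unknown')  # .get() returns a default instead of raising KeyError

print(keys)
print(has_colour)
print(colour)
```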

So why all this explanation? Because the syntax of the **dataframes** that we are interested in behaves almost identically!

## Pandas dataframes

### Introduction to dataframes 

In [None]:
import pandas as pd
import io
import requests

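download = False  # Set to True to fetch the CSV from the web instead of reading a local copy

if download:
    url = "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
    s = requests.get(url).content  # Raw bytes of the CSV file
    df = pd.read_csv(io.StringIO(s.decode('utf-8')))  # Decode the bytes to text and parse it as CSV
else:
    df = pd.read_csv('./iris.csv')  # Expects iris.csv in the same folder as this notebook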

df

Select only the species column:

In [None]:
df['species']

How many unique species do we have?

In [None]:
unique_species = df['species'].unique()
print(unique_species)
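
The cell above lists the unique species; to get the actual count, `nunique()` or `len()` does the job. A minimal sketch on a hypothetical stand-in dataframe (`toy_df` is not the iris data, just four made-up rows):

```python
import pandas as pd

# Hypothetical stand-in for the iris dataframe
toy_df = pd.DataFrame({'species': ['setosa', 'setosa', 'versicolor', 'virginica']})

n_species = toy_df['species'].nunique()           # number of distinct values
also_n_species = len(toy_df['species'].unique())  # same count, via the array of unique values

print(n_species)
```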

Since we are here, let's do a mini intermezzo to explain how to iterate through an indexed object, like our previous output:

In [None]:
for species in unique_species:  # Takes each element (species) in an indexed object (unique_species)
    print(species + ' are beautiful flowers')  # And does something with this element. In this case tell the whole world how beautiful they are

### Basic data selection

Let's say that, for some reason, I don't like wide sepals, so I want to know which of my entries have a sepal no wider than 3 cm:

In [None]:
small_sepal_df = df[df['sepal_width']<= 3.0]
small_sepal_df

Or maybe what I don't like is "chubby sepals", so I want to select those entries with a sepal_length/sepal_width ratio of at least 2:

In [None]:
slim_df = df[df['sepal_length']/df['sepal_width']>=2.0]
slim_df
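
Both selections work the same way: the comparison produces one True/False value per row, and the dataframe keeps the rows marked True. Such conditions can also be combined with `&` (and) and `|` (or). A minimal sketch on a hypothetical four-row dataframe (`toy_df` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the iris dataframe
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3, 4.9],
    'sepal_width': [3.5, 3.2, 2.5, 3.0],
    'species': ['setosa', 'versicolor', 'virginica', 'setosa'],
})

# Each comparison gives a column of True/False values, one per row
narrow = toy_df['sepal_width'] <= 3.0
slim = toy_df['sepal_length'] / toy_df['sepal_width'] >= 2.0

both = toy_df[narrow & slim]    # rows that satisfy the two conditions
either = toy_df[narrow | slim]  # rows that satisfy at least one of them

print(both)
print(either)
```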

### Make and plot our own dataframes

What percentage of the entries does each species account for in each of these dataframes?

In [None]:
df_species_perc = df['species'].value_counts(normalize=True) * 100
small_sepal_df_species_perc = small_sepal_df['species'].value_counts(normalize=True) * 100
slim_df_species_perc = slim_df['species'].value_counts(normalize=True) * 100

In [None]:
percentages_df = pd.DataFrame({'complete':df_species_perc, 'thin': small_sepal_df_species_perc, 'slim':slim_df_species_perc})
percentages_df
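
When a dataframe is built from several series like this, pandas lines them up by their index (here, the species name). If a species is absent from one of the subsets, its cell is filled with `NaN`, which can be replaced with 0 if that suits the analysis. A minimal sketch with made-up data (`all_rows` and `subset` are hypothetical):

```python
import pandas as pd

# Hypothetical species column and a subset in which one species is missing
all_rows = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica'])
subset = all_rows[all_rows != 'virginica']

all_perc = all_rows.value_counts(normalize=True) * 100
subset_perc = subset.value_counts(normalize=True) * 100

# The series are aligned on the species names; missing entries become NaN
combined = pd.DataFrame({'complete': all_perc, 'subset': subset_perc})
filled = combined.fillna(0)  # Replace the missing percentages with 0

print(combined)
print(filled)
```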

In [None]:
import matplotlib.pyplot as plt
import numpy as np

print(percentages_df)
dataframes_labels = percentages_df.index  # directly from the dataframe

# Set distances between bars and their width
x = np.arange(len(dataframes_labels))  # the label locations
width = 0.25  # the width of the bars

# Do the actual plotting
fig, ax = plt.subplots()
rects1 = ax.bar(x - width, percentages_df['complete'], width, label='complete')
rects2 = ax.bar(x, percentages_df['thin'], width, label='thin')
rects3 = ax.bar(x + width, percentages_df['slim'], width, label='slim')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Percentages')
ax.set_title('Percentages of species per dataframe')
ax.set_xticks(x)
ax.set_xticklabels(dataframes_labels)
fig.legend()
plt.show()



Or maybe you prefer it transposed?

In [None]:
transposed_df = percentages_df.T
print(transposed_df)

dataframes_labels = transposed_df.index  # directly from the dataframe

# Set distances between bars and their width
x = np.arange(len(dataframes_labels))  # the label locations
width = 0.25  # the width of the bars

# Do the actual plotting
fig, ax = plt.subplots()
rects1 = ax.bar(x - width, transposed_df['setosa'], width, label='setosa')
rects2 = ax.bar(x, transposed_df['versicolor'], width, label='versicolor')
rects3 = ax.bar(x + width, transposed_df['virginica'], width, label='virginica')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Percentages')
ax.set_title('Percentages of species per dataframe')
ax.set_xticks(x)
ax.set_xticklabels(dataframes_labels)
fig.legend()
plt.show()

### More than plotting!

So now we know that we can very easily operate on, select and plot data by using dataframes. But these are still things that can be done easily in other tools, so let's do something cooler. I want to know if there's a correlation between the different dimensions of the plants. Let's do it from our original dataframe!

In [None]:
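df.select_dtypes(include='number').corr()  # This is it, really: one call correlates every pair of numeric columns (the text column 'species' is left out, since recent pandas versions raise an error on it)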
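
By default, `corr()` computes Pearson's correlation coefficient for every pair of columns. To see what that number is, the sketch below computes it by hand for one pair and compares it with the pandas result. The dataframe `toy_df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, not the real iris data
toy_df = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0, 1.3, 5.1],
    'petal_width': [0.2, 1.4, 2.5, 0.2, 1.9],
})

x = toy_df['petal_length']
y = toy_df['petal_width']

# Pearson's r: the summed product of the deviations from the mean,
# scaled so that the result lies between -1 and 1
r_by_hand = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
r_pandas = toy_df.corr().loc['petal_length', 'petal_width']

print(r_by_hand)
print(r_pandas)
```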

In [None]:
import seaborn as sns 
ax = sns.heatmap(
    df.select_dtypes(include='number').corr(),  # Numeric columns only
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)
ax.set_title('Global correlations');

What if I want to know it for each species?!?!

In [None]:
for species in df['species'].unique():
    # Keep only the rows of this species and correlate its numeric columns
    temp_df = df[df['species'] == species]
    corr = temp_df.select_dtypes(include='number').corr()

    print(corr)
    ax = sns.heatmap(
        corr,
        vmin=-1, vmax=1, center=0,
        cmap=sns.diverging_palette(20, 220, n=200),
        square=True
    )
    ax.set_xticklabels(
        ax.get_xticklabels(),
        rotation=45,
        horizontalalignment='right'
    )
    ax.set_title('{0} correlations'.format(species))
    plt.show()
    plt.close()
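
If only the numbers are needed, and not one plot per species, the loop can be swapped for a `groupby`. This is an alternative sketch, not what the cell above does, and it runs on a made-up dataframe (`toy_df`, with hypothetical species `a` and `b`) so that the expected correlations are obvious: +1 for `a` and -1 for `b`.

```python
import pandas as pd

# Hypothetical data: width grows with length in species a and shrinks in species b
toy_df = pd.DataFrame({
    'species': ['a', 'a', 'a', 'b', 'b', 'b'],
    'length': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'width': [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

# One correlation matrix per species, stacked in a single dataframe
per_species = toy_df.groupby('species')[['length', 'width']].corr()
print(per_species)

# Rows are labelled by (species, column); columns by the other measurement
corr_a = per_species.loc[('a', 'length'), 'width']
corr_b = per_species.loc[('b', 'length'), 'width']
```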