# Tutorial 4. Plotting (and fitting)

We have already learned quite a lot of Python! We know the data types, how to iterate through indexable objects, a bit of pandas, and how to use functions, scripts and flow control. At this point, many people already say that they can program. But we want to make programming useful for your research, so we need to keep pushing now :)

In this lesson, we will learn about simple data plotting and how to make a simple linear fit to our data. We will be using the long-established and robust package `matplotlib` for this, but keep in mind that other packages such as `seaborn` and `plotly` offer more visually appealing plots.

## Basic plotting

In [None]:
import matplotlib.pyplot as plt

Let's begin with a scatter plot.

When you want to make a scatter plot, you must pass the data as two lists: one for the x values and one for the y values, like this:

In [None]:
plt.scatter([1,2,3,4,5,6], [2,4,6,8,10,12])
plt.show()

Of course, you can also save the lists in a variable and pass the variables (they don't have to be called x and y by the way).

In [None]:
x = [1,2,3,4,5,6]
y = [2,4,6,8,10,12]
print(x)

In [None]:
plt.scatter(x,y)
plt.show()

You can also plot a line that connects all the dots, but keep in mind that this is not a regression line.

In [None]:
plt.plot(x,y)
plt.show()

Let me show you how this is not a regression line:

In [None]:
plt.plot([1,2,3,4],[2,1,5,3])
plt.show()

## Enrich your plots with labels and titles

A plot is nothing without a description of the information it contains. In the same plot, we can add a title, axis labels, several data series and text, or change the style of the background. I don't even know all the possibilities, but the formatting options in `matplotlib` are rich.

The one thing to keep in mind is that everything that belongs in the same plot must be written before `plt.show()`, which displays the figure. After the figure is shown, the plot should be reset, but you can force this with `plt.close()` if it doesn't happen. This is very important if you're **saving the figure** instead of showing it (more on this in the homework).

In [None]:
plt.scatter(x, y, color='orange', s=100, marker='v')  # Scatter plot of our points
plt.plot(x, y, '-.', color='orange', linewidth=2)  # Line connecting our points
plt.scatter([0,1,2,3,4], [0,1,2,3,4], color='blue', s=100, marker='o')  # Scatter plot of a second set of points
plt.plot([0,1,2,3,4], [0,1,2,3,4], '--', color='blue', linewidth=2)  # Line connecting the second set of points
plt.title('My first plot')  # Title
plt.xlabel('Independent variable')  # x-axis label
plt.ylabel('Dependent variable')  # y-axis label
plt.show()  # Show the plot on screen
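Since saving figures comes up in the homework, here is a minimal sketch of how saving could look. The file name `my_first_plot.png` and the temporary folder are only assumptions for this illustration, and the `matplotlib.use("Agg")` line is only needed when you run the code somewhere without a display (in a notebook you can leave it out).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed when no display is available
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]

# Hypothetical output location: a temporary folder, so nothing gets overwritten
output_path = os.path.join(tempfile.mkdtemp(), "my_first_plot.png")

plt.scatter(x, y)
plt.title("My first plot")
plt.savefig(output_path)  # write the current figure to disk instead of showing it
plt.close()               # discard the figure so the next plot starts empty

print(os.path.exists(output_path))
```

Calling `plt.close()` right after `plt.savefig()` matters most when you save several figures in a loop: without it, the points of one plot can end up drawn on the next one.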

You can also do cool things like changing the size and color of each individual dot by passing them as lists:

In [None]:
dot_color = ['red', 'darkorange', 'yellow', 'green', 'blue', 'darkviolet']
dot_size = [100, 60, 500, 150, 100, 300]
plt.scatter(x,y, color=dot_color, s = dot_size)  # Scatter plot of our points
plt.show()

## NumPy and SciPy: the fundamentals of fast calculations in Python

Although Python has native math operations, they are slow compared with how fast the same calculations can be done. Packages like **numpy** and scipy provide fast, pre-implemented operations. Numpy works with **arrays** instead of lists. Arrays behave very similarly to lists, as they are also indexed and can be iterated, but they make operating on their values very easy and fast.

In [None]:
import numpy as np

In [None]:
x = np.array([1,2,3,4,5,6])
y = np.array([2,4,6,8,10,12])
print(x)
print(y)
print(x[-1])
print(type(x))

- This works:

In [None]:
print(x*y)
print(x+y)

- This does not work (it raises a `TypeError`):

In [None]:
print([1,2,3,4]*[2,1,2,4])

- This doesn't work the way we wanted (it concatenates the lists instead of adding them element by element):

In [None]:
print([1,2,3,4]+[2,1,2,4])
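For comparison, here is a small sketch of what you would have to write to get element-wise results from plain lists, next to the numpy version. The variable names are just illustrative.

```python
import numpy as np

a = [1, 2, 3, 4]
b = [2, 1, 2, 4]

# With plain lists, pair the values up with zip and operate on each pair
list_product = [i * j for i, j in zip(a, b)]
list_sum = [i + j for i, j in zip(a, b)]

# With numpy arrays, the same element-wise operations are a single expression
array_product = np.array(a) * np.array(b)
array_sum = np.array(a) + np.array(b)

print(list_product, array_product)
print(list_sum, array_sum)
```

Both approaches give the same numbers; the array version is shorter to write and, for large amounts of data, typically much faster.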

### Plotting with numpy

We can plot numpy arrays as if they were lists:

In [None]:
x = np.array([1,2,3,4,5,6])
y = np.array([2,4,6,8,10,12])
plt.plot(x,y)
plt.show()

But let's do something more interesting than just plotting. Let's change the values of y and fit a linear regression.
This is how the plot looks with the new y values:

In [None]:
y = np.array([1,5,4,7,10,8])
plt.scatter(x,y)
plt.show()

And now we're going to apply a linear regression to our data. We will do this with the function `linregress`, contained in `scipy.stats`. Notice that we have imported `scipy.stats` as `stats`. We can give imported packages any name we like.

This linear regression returns 5 values. I know that not because I remember it, but because I googled the documentation page, which you should do too: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

In [None]:
import scipy.stats as stats
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
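If you don't want to remember the order of the five values, the result of `linregress` can also be kept as a single object and read by name. This is only a sketch of that alternative; the variable name `result` is my own choice.

```python
import numpy as np
import scipy.stats as stats

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 5, 4, 7, 10, 8])

# Keep the whole result instead of unpacking it into five variables
result = stats.linregress(x, y)

# Each value can then be read by its attribute name
print(result.slope, result.intercept)
print(result.rvalue, result.pvalue, result.stderr)
```

Both styles give the same numbers, so use whichever you find easier to read.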

Here we obtain the y values of our fit for each of our x values. It's the famous ax + b formula that we learned in high school, but this time we are programming it:

In [None]:
new_fit = x*slope + intercept
print(new_fit)

So let's plot it all together! This figure will have the following components:
- Scatter plot of our data points
- Linear regression of these points
- R and R2 values displayed
- Slope and intercept values displayed
- Title and labels displayed

In [None]:
plt.scatter(x,y)
plt.plot(x, new_fit)
plt.text(1, 8,'R value = {0}'.format(r_value))
plt.text(1, 7,'R2 value = {0}'.format(r_value**2))
plt.text(2, 2, 'Intercept = {0}'.format(intercept))
plt.text(2, 1, 'Slope = {0}'.format(slope))
plt.title('Linear fit')
plt.xlabel('Independent variable')
plt.ylabel('Dependent variable')
plt.show()
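The values written on the figure above carry a lot of decimals. As a small sketch, this is one way you could round them to three decimals before writing them on the plot. The label variable names are hypothetical, and three decimals is an arbitrary choice.

```python
import numpy as np
import scipy.stats as stats

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 5, 4, 7, 10, 8])
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# '{:.3f}' formats a number with exactly three digits after the decimal point
r_label = 'R value = {:.3f}'.format(r_value)
r2_label = 'R2 value = {:.3f}'.format(r_value**2)
intercept_label = 'Intercept = {:.3f}'.format(intercept)
slope_label = 'Slope = {:.3f}'.format(slope)

print(r_label)
print(r2_label)
print(intercept_label)
print(slope_label)
```

These strings can be passed to `plt.text` in place of the longer ones, for example `plt.text(1, 8, r_label)`.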

## Pandas and numpy

Pandas is built on top of numpy. When you select a pandas column or row, you obtain a pandas Series. These Series are actually built with numpy arrays as their base. This is handy because it lets you perform many of the operations that numpy allows. For instance:

In [None]:
import pandas as pd
df = pd.DataFrame({'first_column':[1,2,3,4,5,6], 'second_column':[5,2,3,1,5,7], 'third_column':[3,3,3,3,3,3], 'names':['spam', 'spam', 'eggs', 'eggs', 'ham', 'ham']})
df

In [None]:
df['first_column']

In [None]:
print(type(df['first_column']))  # A series
print(type(np.array(df['first_column'])))  # In case you need to convert it to a numpy array
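As a side note, pandas Series also have their own `to_numpy()` method, which does the same conversion. A minimal sketch, using a shortened version of the dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first_column': [1, 2, 3, 4, 5, 6],
                   'second_column': [5, 2, 3, 1, 5, 7]})

# Convert one column (a Series) into a numpy array
first_array = df['first_column'].to_numpy()

print(type(first_array))
print(first_array * 2)  # element-wise operation on the array
```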

In [None]:
df['first_column']*df['second_column']

In [None]:
df['first times second'] = df['first_column']*df['second_column']

In [None]:
df

And as a big hint for the homework and a reminder on how to subset from pandas, let's subset our dataframe into 3 dataframes, one for each name:

In [None]:
df['names'].unique()

In [None]:
df['names'] != 'eggs'

In [None]:
df[df['names']!='eggs']

In [None]:
for name in df['names'].unique():
    print(name)
    temp_df = df[df['names'] == name]
    print(temp_df)  # OR DO ANYTHING ELSE WITH THIS DATAFRAME
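The loop above can also be written with `groupby`, which hands you each name together with its subset. This is only a sketch of an alternative, not something the homework requires; the dictionary `subsets` is a hypothetical helper for keeping the pieces around.

```python
import pandas as pd

df = pd.DataFrame({'first_column': [1, 2, 3, 4, 5, 6],
                   'second_column': [5, 2, 3, 1, 5, 7],
                   'names': ['spam', 'spam', 'eggs', 'eggs', 'ham', 'ham']})

subsets = {}
# groupby yields one (name, sub-dataframe) pair per distinct value in 'names'
for name, group_df in df.groupby('names'):
    print(name)
    print(group_df)
    subsets[name] = group_df  # keep each subset in case we need it later
```

Note that `groupby` goes through the names in sorted order by default, while `unique()` keeps the order in which they appear in the dataframe.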

## HOMEWORK

For homework, we are going to use the iris dataset again. You will calculate the petal and sepal ratios using the fancy pandas way explained above, and save them to the dataframe. Then you will generate **and save to disk** 3 plots, one per flower variety. These plots will show the ratios and the linear fit of the data points.

I want you to write a **script** that is divided into (at least) 2 functions:
- The function `linear_fit` will receive 2 pandas series or 2 numpy arrays and will perform a linear regression on their data. Then, it will return the slope and intercept of this fit. 
- The function `plot_data` will have as input a dataframe with the raw data that needs to be plotted. This function will call the function `linear_fit` and will receive the slope and intercept that `linear_fit` calculates. Finally, it will display a scatter plot of the raw data and a plot of the regression line. The x and y labels must be informative of whether it's the sepal or petal ratio. The title will be the flower variety used for each plot. This function will return nothing, but it will **save** the plots in a .png file with the name of the flower variety. 

You can choose whether you want to subset the data before or in `plot_data`. In other words, you can feed `plot_data` with the whole dataframe or with a subset of the dataframe that contains only a variety, but you'll have to do that 3 times in the second case. 

I recommend performing the ratio calculations before feeding the data to `plot_data`, and feel free to organize the code for this in another function if you believe it will look cleaner.

**GOOD LUCK!** 

And remember: Google is your friend. 