# II. Matplotlib - always plot your data

This notebook gives a short taste of matplotlib, Python's most popular library for plotting.

   4. Loading information into Pandas
   5. Plotting with matplotlib
       - A single plot
       - Plotting multiple graphs
   6. Why you should always plot your data


## 4. Loading information

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

Reading the data:

In [None]:
df = pd.read_csv("Data/DatasaurusDozen.tsv", index_col = 0).sort_index()

In [None]:
df

This does not seem right. The file we are trying to open is a .tsv file, so we need to tell pandas that the separator used in the file is a tab and not a ",".

In [None]:
df = pd.read_csv("Data/DatasaurusDozen.tsv", sep = "\t", index_col = 0).sort_index()

In [None]:
df
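To see the effect of the separator in isolation, here is a minimal sketch that reads a tiny tab-separated table from a string instead of a file. The contents of `tsv_text` are made up for illustration and are not taken from the Datasaurus file.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for a .tsv file.
tsv_text = "dataset\tx\ty\naway\t1.0\t2.0\ncircle\t3.0\t4.0\n"

# Default separator is ",": every line ends up in a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep="\t" the three fields are split and "dataset" becomes the index.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(wrong.shape)
print(right)
```

Without `sep="\t"` the whole header is treated as one column name, which is why the first attempt above does not look right.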

In [None]:
df.info()

In [None]:
df

In [None]:
df.index.unique()

DataFrame.loc[...] selects rows or columns by label. Because every row of a dataset shares the same index label here, it returns all rows that belong to that dataset.

In [None]:
away_df = df.loc["away"]
print(away_df.info())
print(away_df)
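As a small sketch of how label-based selection behaves when index labels repeat, the following uses a made-up DataFrame called `toy` (not part of the dataset) and shows three ways of selecting by label.

```python
import pandas as pd

# Made-up data; the index labels repeat, as they do in the notebook's DataFrame.
toy = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [5.0, 6.0, 7.0, 8.0]},
    index=["away", "away", "circle", "circle"],
)

away_rows = toy.loc["away"]            # all rows labelled "away"
away_x = toy.loc["away", "x"]          # the same rows, only column "x"
same_rows = toy[toy.index == "away"]   # equivalent selection with a boolean mask

print(away_rows)
print(away_x)
```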

## 5. Plotting 

Using pandas' built-in plotting functions, we can get a quick first look at our data.

In [None]:
away_df.plot()

Looking at the plot (or the data), we can quickly see that a default line plot is not well suited to this kind of data. Let's try a different kind of plot.

Available kinds of plots (values for the `kind` argument):
- ‘line’ : line plot (default)
- ‘bar’ : vertical bar plot
- ‘barh’ : horizontal bar plot
- ‘hist’ : histogram
- ‘box’ : boxplot
- ‘kde’ : Kernel Density Estimation plot
- ‘density’ : same as ‘kde’
- ‘area’ : area plot
- ‘pie’ : pie plot
- ‘scatter’ : scatter plot
- ‘hexbin’ : hexbin plot

In [None]:
# "x" and "y" are the column names used for the two axes.
away_df.plot("x","y",kind="scatter")
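The `kind` argument works the same way for the other plot types in the list. The sketch below uses a made-up DataFrame `toy` and draws a scatter plot and a histogram next to each other; passing `ax=` tells pandas which axes to draw on.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data, for illustration only.
toy = pd.DataFrame({"x": [1, 2, 2, 3, 3, 3, 4, 4, 5, 6],
                    "y": [2, 1, 4, 3, 6, 5, 8, 7, 9, 10]})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot of y against x.
toy.plot(x="x", y="y", kind="scatter", ax=ax_left)

# Histogram of the x column with five bins.
toy["x"].plot(kind="hist", bins=5, ax=ax_right, title="Distribution of x")

plt.show()
```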

## 5.1 Plotting using matplotlib directly

The pandas library itself uses matplotlib for plotting (that is why we ran "%matplotlib inline" earlier, so the plots display properly in the notebook). However, it is often easier to use matplotlib directly, especially when plots become more complex.

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl

Plotting the same graph as above:

In [None]:
plt.scatter(away_df["x"],away_df["y"])
plt.show()

However, there are a number of settings we can use to make it look a little better.

In [None]:
plt.figure(figsize = (10,7), dpi = 100) # 10 is width, 7 is height
plt.style.use('bmh')
plt.scatter(away_df["x"],away_df["y"],marker = "*", label="Data")
plt.title("Away")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend(loc="best")
plt.show()


You can look up all the different markers here: 
https://matplotlib.org/api/markers_api.html

And all the different styles here:
https://matplotlib.org/tutorials/introductory/customizing.html
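The installed styles can also be listed from Python. The sketch below prints them and then applies "bmh" through a style context, which is one way to try a style without changing the settings for later plots. The points are made up for illustration.

```python
import matplotlib.pyplot as plt

# Names that can be passed to plt.style.use(); the list depends on the
# installed Matplotlib version.
print(plt.style.available)

# A style context applies the style only inside the `with` block.
with plt.style.context("bmh"):
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.scatter([1, 2, 3], [3, 1, 2], marker="*", label="made-up points")
    ax.legend(loc="best")
    plt.show()
```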

## 5.2 Plotting multiple graphs


First, we grab a second dataset from our initial file.

In [None]:
circle_df = df.loc["circle"]
circle_df.plot("x","y",kind="scatter")

To start, let's plot both datasets into the same graph. We can copy basically all the settings from above and just add one line that plots our second dataset.

We can also add some color to differentiate the data.

In [None]:
plt.figure(figsize = (10,7), dpi = 100) # 10 is width, 7 is height
plt.style.use('bmh')
plt.scatter(away_df["x"],away_df["y"],marker = "*", label="away", color = "r")
# This is the second dataset
plt.scatter(circle_df["x"], circle_df["y"], marker = ".", label="circle", color ="b")

plt.title("double the plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend(loc="best")
plt.show()

Sometimes it is more useful to have the plots side by side. For this we have to create subplots. The "figure" is the canvas the plots get drawn on. If we want more than one graph, we can divide the space of the figure into multiple "axes". Then we can address the settings and data of each axes object directly.

In [None]:
# Create Figure
# Note: Matplotlib 3.6 and later name this style "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
fig = plt.figure(figsize = (10,5), dpi = 120)

# Set up the layout by creating two axes.
# Each axes gets assigned a fixed position in the figure

ax1 = plt.subplot2grid((1,2),(0,0)) # The first tuple (1,2) = 1 row and 2 columns of graphs
ax2 = plt.subplot2grid((1,2),(0,1)) # The second tuple (0,1) is the position of the axes

# Scatterplot for "away"
ax1.scatter(away_df["x"],away_df["y"],color = "r")
ax1.set_title("away")
ax1.set_xlim(0,100)
ax1.set_ylim(0,100)

# Scatterplot for "circle"
ax2.scatter(circle_df["x"], circle_df["y"], color ="b")
ax2.set_title("circle")
ax2.set_xlim(0,100)
ax2.set_ylim(0,100)

plt.show()
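An alternative way to get the same kind of layout is `plt.subplots`, which creates the figure and all axes in one call. The sketch below uses made-up points (`first` and `second` are hypothetical stand-ins for two datasets) and shares the axis limits between both plots.

```python
import matplotlib.pyplot as plt

# Made-up points standing in for two datasets.
first = {"x": [10, 30, 50, 70], "y": [20, 60, 40, 80]}
second = {"x": [20, 40, 60, 80], "y": [80, 40, 60, 20]}

# One row, two columns; sharex/sharey keep both axes on the same scale.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)

for ax, data, name, color in [(ax1, first, "first", "r"), (ax2, second, "second", "b")]:
    ax.scatter(data["x"], data["y"], color=color)
    ax.set_title(name)

# With shared axes, setting the limits once applies to both plots.
ax1.set_xlim(0, 100)
ax1.set_ylim(0, 100)

plt.show()
```

Sharing the axes removes the need to repeat `set_xlim` and `set_ylim` for every axes object.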

## 6. Why you should always plot your data

In pandas it is easy to get summary statistics about the data. However, it is important to always inspect the data itself, because these statistics can be misleading.
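As a deliberately tiny illustration with made-up numbers, the two frames below have the same mean and standard deviation in `y`, although one rises and the other falls. In this toy case the correlation still gives the difference away.

```python
import pandas as pd

# Made-up data: the y values are the same numbers in opposite order.
rising = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 2, 3, 4, 5]})
falling = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1]})

for name, frame in [("rising", rising), ("falling", falling)]:
    # Mean and standard deviation of y, then the correlation of x and y.
    print(name, frame["y"].mean(), frame["y"].std(), frame["x"].corr(frame["y"]))
```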

In [None]:
from IPython.display import display_html

# Calculate the summary statistics for the given set of points.
# The column of the result is named after the dataset label in the index.
def get_values(df):
    res = pd.DataFrame([[df.x.mean()], [df.y.mean()], [df.x.std()], [df.y.std()], [df.corr().x.y]],
                       index=["X mean", "Y mean", "X standard d.", "Y standard d.", "Correlation"],
                       columns=[df.index[0]])
    return res

# Display several DataFrames side by side
def display_side_by_side(l):
    html_str = ''
    for df in l:
        html_str += df.to_html()
    # Add the inline style to the opening <table> tags only
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)

    


For the two datasets we just plotted, we can display the summary statistics side by side.

In [None]:
display_side_by_side([get_values(away_df), get_values(circle_df)])

Looking at just these numbers, it would be hard to guess that the two datasets look as fundamentally different as the graphs plotted above show.
But there is more!

Printing the summary statistics of every dataset in the file makes it clear that they differ only marginally.

In [None]:
df_list = []

# Create a list with the summary statistics of every dataset in the file
for data in df.index.unique():
    temp = get_values(df.loc[data])
    df_list.append(temp)
    
display_side_by_side(df_list)
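The same per-dataset statistics can also be computed without an explicit list by grouping on the index. This is a sketch on a made-up DataFrame `toy` whose layout mimics `df` (repeated index labels, columns `x` and `y`); the labels "up" and "down" are invented.

```python
import pandas as pd

# Made-up data with repeated index labels, mimicking the layout of `df`.
toy = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0], "y": [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]},
    index=["up", "up", "up", "down", "down", "down"],
)

# Group the rows by their index label.
grouped = toy.groupby(level=0)

# Mean and standard deviation of x and y, one row per dataset.
stats = grouped.agg(["mean", "std"])

# Correlation between x and y within each dataset.
corr = pd.Series({name: g["x"].corr(g["y"]) for name, g in grouped}, name="correlation")

print(stats)
print(corr)
```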

### So this is why you should always take the time to look at the data that is presented to you!

In [None]:
# All dataset names, in the order they appear in the index
names = df.index.unique()
fig, axes = plt.subplots(5, 3, figsize=(9, 15), sharex=True, sharey=True, dpi=300)

for i, ax in enumerate(axes.ravel()):
    # The grid has more axes than there are datasets; the extra ones stay empty
    if i < len(names):
        subset = df.loc[names[i]]
        ax.scatter(subset['x'], subset['y'])
        ax.set_title(names[i])
plt.suptitle("Always plot your data!", verticalalignment="top", fontsize=16)
plt.show()

This amazing dataset was created by Justin Matejka and George Fitzmaurice.
Find their work on this here: https://www.autodeskresearch.com/publications/samestats