 # Day 3 - Intro to Data Analysis using Pandas and Seaborn

 [Pandas](https://pandas.pydata.org/) is a widely used and very powerful Python library for data analysis. Pandas is efficient, since it is built on top of NumPy, but also adds a lot of convenience layers to make analysing data more straightforward than using raw NumPy. The library can deal with a lot of widely used file formats such as csv, json, hdf5, ..., and also with SQL databases. Pandas also interoperates cleanly with existing Python libraries such as NumPy, matplotlib, or machine learning libraries such as scikit-learn.
 
 [Seaborn](https://seaborn.pydata.org/) is a data visualization library that can create, in only a few lines of Python, plots that would be quite involved in pure matplotlib. It also understands Pandas DataFrames by default.

 First, let's get some imports out of the way.

In [None]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns # This is the usual way seaborn is abbreviated. Apparently it's an inside reference to The West Wing
sns.set() # Use seaborn's settings to plot, which looks a lot nicer than the matplotlib default

Much like how the NumPy library is built around the central object of the NumPy array, Pandas is built around the central objects of the Pandas DataFrame and the Pandas Series. A DataFrame holds 2-dimensional tabular data, such as the data in a .csv file, in a Python object. A Series holds 1-dimensional data, and each column of a DataFrame is a Series. There are a number of ways to create a DataFrame from data in Python. One way is to construct a DataFrame from a dictionary of lists, mapping the name of each column to a list of values. Here's some not-quite-true data from the last Irish election. 

In [None]:

df = pd.DataFrame(
    {'Political_Party': ['Sinn Fein', 'Fine Gael', 'Fianna Fail', 'Greens', 'Social Democrats'] ,
     'Leader': ['Mary-Lou MacDonald', 'Leo Varadkar', 'Michael Martin', 'Eamon Ryan', 'Roisin Shortall'],
     'MPs': [38, 34, 40, 12, 7], 
     'Vote share %': [40, 10, 20, 10, 5]
   }
)
df

We can see that by default jupyter formats DataFrames to be a bit more human readable than it does with NumPy arrays. 
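
To make the relationship between the two objects concrete, here is a small sketch on a made-up table (the names `toy`, `name` and `score` are illustrative and not part of the tutorial data). It checks that selecting one column gives a Series, and shows a second way to build the same table, from a list of row dictionaries.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1.0, 2.5, 4.0]})

print(type(toy))           # the table itself is a DataFrame
print(type(toy['score']))  # a single column is a Series
print(toy.shape)           # (number of rows, number of columns)

# The same table built row by row from a list of dictionaries.
rows = [{'name': 'a', 'score': 1.0},
        {'name': 'b', 'score': 2.5},
        {'name': 'c', 'score': 4.0}]
toy_from_rows = pd.DataFrame(rows)
print(toy.equals(toy_from_rows))
```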

We can access the columns of a DataFrame as if it were a dictionary:

In [None]:
print(df['Political_Party'])


We can also access the columns of the DataFrame as attributes using `.`

In [None]:
print(df.Political_Party)

This means that the columns of a DataFrame support tab completion, which is really useful. Try pressing the tab key with the cursor at the end of `df.Pol` below!

In [None]:
df.Pol
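
One caveat, sketched here on a hypothetical table: attribute access only works when the column name is a valid Python identifier and does not clash with something the DataFrame already provides. A name containing spaces, like `'Vote share %'` above, needs the bracket form.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'party': ['A', 'B'], 'vote share': [60, 40], 'count': [3, 2]})

print(toy.party.tolist())          # valid identifier: attribute access works
print(toy['vote share'].tolist())  # the space means only brackets work

# 'count' is also the name of a DataFrame method, so toy.count is the method,
# not the column. Brackets always give the column.
print(callable(toy.count))
print(toy['count'].tolist())
```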

# Importing Data

A lot of the time, you will be interested in manipulating data that already exists. In this case we'd commonly read data from something like a .csv file. CSV stands for "comma-separated values" and is a common file format for representing columns of data, which are separated by commas as the name implies. CSV files are human readable, and understood by Pandas, Excel and pretty much any data application. For this tutorial, we'll take advantage of the fact that seaborn has a few sample datasets which we can grab on demand without having to deal with the `.csv` file directly. 
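
To see what the format looks like, here is a minimal sketch that parses a tiny, made-up CSV held in a string. `io.StringIO` stands in for a file on disk; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# A tiny made-up CSV: the first line holds the column names,
# every following line is one row.
csv_text = """species,petal_length,petal_width
setosa,1.4,0.2
virginica,6.0,2.5
"""

# StringIO lets read_csv treat the string as if it were a file.
small = pd.read_csv(io.StringIO(csv_text))
print(small)
print(small.dtypes)  # text and numeric columns are detected automatically
```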

We'll look at the Iris dataset. It is a collection of measurements of iris flowers: it records which species each flower is, and the width and length of its petals and sepals. This is a good dataset for us to look at since it combines numerical data (the sepal and petal sizes) and categorical data (the species), which we can use to demonstrate the power of pandas. 

In [None]:
iris = sns.load_dataset("iris")

 Now that we have our data loaded in python, we can start to explore the data. We can print the dataframe:

In [None]:
iris

  The method `describe` will give us a statistical summary of the data.

In [None]:
iris.describe()

 `count` tells us how many non-missing values each column has, which is 150 for every column here. Some datasets will have missing data, which is represented by 'NaN' (Not a Number), in which case `count` will not be the same for each column. The remaining rows give the mean, standard deviation, minimum, quartiles and maximum of each numerical column.
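
The iris data has no gaps, so here is a small sketch on a made-up table with some missing entries. It shows how the `count` row of `describe` drops when values are missing, and how `isna().sum()` counts the missing values directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values, only for illustration.
toy = pd.DataFrame({'length': [1.0, 2.0, np.nan, 4.0],
                    'width': [0.5, np.nan, np.nan, 2.0]})

print(toy.describe().loc['count'])  # non-missing values per column
print(toy.isna().sum())             # missing values per column
```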

 The `head` method shows the first few rows of the DataFrame.

In [None]:
iris.head()


 DataFrames behave like dicts, and we can access each column by its name.

In [None]:
iris['species']


If we have categorical data and want to see only the unique values, there are a few things we can do:

In [None]:
print(np.unique(iris['species']))

# or 
print(set(iris['species']))

# or, to also get the count
from collections import Counter
print(Counter(iris['species']))
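
Pandas also has its own methods for this, which avoid the extra imports. A short sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
species = pd.Series(['setosa', 'virginica', 'setosa', 'versicolor', 'setosa'])

print(species.unique())        # distinct values, in order of first appearance
print(species.nunique())       # how many distinct values there are
print(species.value_counts())  # count per value, most frequent first
```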

We can also access multiple columns by passing a list of names. This will return a DataFrame.

In [None]:
iris[['species', 'sepal_width']]



 We can also use the index to access the rows using the `loc` indexer. (Note that slicing with `loc` *includes* the endpoint of the range.)

In [None]:
iris.loc[0:4]
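
The inclusive endpoint is easiest to see next to `iloc`, which selects by position and follows the usual Python slicing rules. A sketch on a made-up table:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60]})

by_label = toy.loc[0:4]      # labels 0..4, endpoint included
by_position = toy.iloc[0:4]  # positions 0..3, endpoint excluded
print(len(by_label), len(by_position))

# After filtering, labels and positions no longer line up.
big = toy[toy['value'] > 30]
print(big.index.tolist())    # the original labels are kept
print(big.iloc[0]['value'])  # first remaining row, by position
```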


 We can do much fancier indexing than just by the index value. As we saw when we called `.describe` above, the petal_width varies from 0.1 to 2.5. Let's say we're only interested in the parts of the data where the petal_width is less than 1.5. Pandas' boolean indexing lets us select rows with a Series of True/False values, one per row. We assign the result to a new DataFrame.

In [None]:
iris_small_flowers = iris[iris['petal_width'] < 1.5]
iris_small_flowers
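
Conditions can also be combined. The sketch below uses a made-up table to show the mask itself and the `&` (and), `|` (or) and `~` (not) operators. Each condition needs its own parentheses, because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'petal_length': [1.4, 4.5, 5.0, 6.0],
                    'petal_width': [0.2, 1.4, 1.8, 2.5]})

mask = toy['petal_width'] < 1.5
print(mask.tolist())  # one True/False per row

both = toy[(toy['petal_width'] < 1.5) & (toy['petal_length'] > 2.0)]
either = toy[(toy['petal_width'] < 0.5) | (toy['petal_length'] > 5.5)]
not_small = toy[~mask]
print(both.index.tolist(), either.index.tolist(), not_small.index.tolist())
```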


  Math using the columns of DataFrames is vectorized, in that math operations get applied to every entry in a column.

In [None]:
petal_area = iris['petal_width'] * iris['petal_length']
petal_area
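
As a sketch of what "vectorized" means, the example below computes the same product twice on a made-up table: once as a single column expression and once with an explicit Python loop. The loop is shown only for comparison.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'petal_width': [0.2, 1.5, 2.0],
                    'petal_length': [1.4, 4.5, 6.0]})

# Vectorized: one expression covers every row.
area = toy['petal_width'] * toy['petal_length']

# Equivalent explicit loop over the rows.
area_loop = [w * l for w, l in zip(toy['petal_width'], toy['petal_length'])]

print(area.tolist())
print(area.tolist() == area_loop)
```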

 We can also add data to a DataFrame. New columns by default get appended to the end.

In [None]:
iris['petal_area'] = petal_area
iris

# Saving data

One strength of pandas is that a DataFrame can be saved to and loaded from many different data formats (.csv, .hdf5, Excel files, SQL databases, ...) very straightforwardly, using the `to_*` methods of the DataFrame and the `pd.read_*` functions. 

In [None]:
iris.to_csv("iris.csv")

In [None]:
!cat iris.csv #Print out the contents of the file iris.csv

In [None]:
new_iris = pd.read_csv("iris.csv", index_col=0)
new_iris
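
The `index_col=0` argument matters because `to_csv` writes the index as the first column by default. The sketch below shows this on a made-up table, using an in-memory string instead of a file, along with the alternative of not writing the index at all.

```python
import io
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]})

csv_text = toy.to_csv()  # with no path given, to_csv returns the text
print(csv_text)

# Without index_col, the saved index comes back as an extra column.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# Either read the index back in...
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
# ...or leave it out when writing.
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(restored.columns.tolist(), no_index.columns.tolist())
```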

# Task 1
 Add a new column to the iris data set representing the petal 'aspect_ratio', its length divided by its width. 
 
 Also, using the boolean index feature, produce a DataFrame which only has data for the virginica species. Call `.describe()` on the result. 

 # Plotting.
 
 If we're trying to analyse data, we usually want to do a bit more than just see standard deviations and means. Pandas offers a thin wrapper around matplotlib to allow us to easily do simple plots to explore our data. Note that everything we do here can be done directly with matplotlib.

 Perhaps we expect there to be a correlation between sepal length and petal length. To check, we can do a scatter plot.

In [None]:
iris.plot.scatter(x='petal_length', y='sepal_length')

We can also color the dots by the value in a different column by specifying `c=`. 

In [None]:
iris.plot.scatter('petal_length', 'sepal_length', c='sepal_width', cmap='viridis')

 One of the nice things about using the Pandas plotting interface is that certain things, such as labelling the axes with the column names, are done automatically. 
 
We might also want to plot a histogram of a single column. We then call `plot` on the Series object.

In [None]:

iris['sepal_length'].plot.hist(bins=50)


In fact, if we just call `.hist` directly on the DataFrame, Pandas will generate a histogram for every numerical column.

In [None]:
iris.hist()

# Task: 
Let's explore some of the other plotting functionality in pandas. 

In [None]:
[i for i in dir(iris.plot) if not i.startswith('_')]    # list of functions/methods in iris.plot that don't start with '_'

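Try plotting iris using a box plot. For each column, this plot shows the median (the 50th percentile), the 25th and 75th percentiles (the lower and upper quartiles), whiskers covering the rest of the non-outlying data, and any outliers as individual points, all in a single plot. More on box plots [here](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51).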
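
The numbers behind a box plot can be computed by hand. The sketch below does so for a made-up Series, assuming matplotlib's default rule that points further than 1.5 times the interquartile range from the box are drawn as outliers.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```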

Now try a scatter plot of 'petal_width' and 'petal_length', but instead of coloring each point based on the 'sepal_width', change the size of the point based on the value of 'sepal_width' using the `s=` option. 

## Groupby

 If you look at our initial scatter plot, you might start to think we're not 
 analysing our data as we should. Notice that there is a cluster of values in the bottom 
 left. This is because our data contains a categorical variable, the species; one should expect that the different species
 have different characteristic petal and sepal sizes, and we'd like our plots to show that 
 distinction. 



Pandas lets us easily group data using the `groupby` method. So let's go ahead and group our data by species so we can more usefully analyse it. 

In [None]:
g = iris.groupby('species')

This returns a `GroupBy` object which we must do some operation on to get useful data. We can get statistics easily: 

In [None]:
g.count()

In [None]:
g.mean()

In [None]:
g.std()

We can see that we're now getting means and standard deviations within each group. 
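
Several statistics can also be requested at once with `agg`, and a single group can be pulled out with `get_group`. A sketch on a made-up table (the group labels `'a'` and `'b'` are placeholders):

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'species': ['a', 'a', 'b', 'b', 'b'],
                    'petal_length': [1.0, 3.0, 4.0, 5.0, 6.0]})
g = toy.groupby('species')

# Several statistics for one column in a single call.
summary = g['petal_length'].agg(['count', 'mean', 'max'])
print(summary)

# get_group returns the rows belonging to one group.
print(g.get_group('b'))
```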

If we wanted to, we could iterate over `g` to get the three DataFrames, one per species.

In [None]:
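import matplotlib.pyplot as plt
for species, subdf in g: 
    # One scatter call per species, so each group gets its own colour
    plt.scatter(subdf['sepal_length'], subdf['petal_length'], label=species)
plt.legend()  # show which colour belongs to which species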

Now we're starting to see that the clustering is indeed due to the separate species. 

If we wanted the scatter plots to be separate, we could have just called `plot` on the `GroupBy` object `g` itself. 

In [None]:
g.plot.scatter('sepal_length', 'petal_length')

This works for other types of plot too, for example box plots.

In [None]:
g.boxplot()

# Seaborn functionality. 

Let's finally just look at some seaborn functionality to make our plots even nicer. Seaborn uses slightly different names for the plotting functions but we can do things much more easily with it. 

Seaborn's general-purpose plotting function is `relplot`, which stands for 'relational plot'. It has a few more options we can avail of to produce a more useful plot with one function call. By default it draws a scatter plot, but we could switch to a line plot with `kind='line'` if that was appropriate. 

In [None]:
sns.relplot(x='petal_length', y='sepal_length', hue='species', data=iris)

Thanks to the `hue` option, we achieved in one call the colouring that previously needed `groupby` and a for loop!

We can easily find and plot a line of best fit using `lmplot`. 

In [None]:
sns.lmplot(x='petal_length', y='sepal_length', data=iris)
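
For reference, the straight line that `lmplot` draws by default is a least-squares fit. The sketch below computes such a fit by hand with `np.polyfit` on made-up points lying close to y = 2x + 1. It illustrates the idea and is not a reproduction of seaborn's internals.

```python
import numpy as np

# Hypothetical points near the line y = 2x + 1, with small made-up offsets.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.0, 0.1, -0.1])

# A degree-1 polynomial fit returns the slope and the intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)

# Largest vertical distance between a point and the fitted line.
y_fit = slope * x + intercept
print(np.abs(y - y_fit).max())
```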

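`.histplot` will plot a histogram. With `kde=True` it will also plot a 'kernel density estimate', which is a smoothed approximation to the distribution of the data. (Older tutorials use `.distplot` for this, but it is deprecated in recent versions of seaborn.)

In [None]:
sns.histplot(iris.petal_length, bins=20, kde=True)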
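
A kernel density estimate can be pictured as placing a small Gaussian bump on every data point and averaging the bumps. The NumPy sketch below does exactly that for a handful of made-up samples. The function name and the hand-picked `bandwidth` are assumptions for illustration; seaborn chooses the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical samples forming two loose clusters.
samples = np.array([1.0, 1.2, 1.5, 4.0, 4.5, 5.0])
grid = np.linspace(-2.0, 8.0, 501)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.5)

# A density should integrate to (approximately) 1.
step = grid[1] - grid[0]
print(density.sum() * step)
```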

Seaborn has easy functions to combine scatter plots and histograms onto one figure. 

In [None]:
sns.jointplot(x='petal_length', y='sepal_length', data=iris)

Or even get this info for the whole dataset in one go. 

In [None]:
sns.pairplot(data=iris, hue='species')

This is only a cursory overview of what seaborn and pandas can do. Hopefully I've convinced you that if you have data, then pandas and seaborn can save you a lot of trouble in analysing it and plotting it. 

# Task 

We've looked at the iris dataset, but let's look at a different dataset and do some exploratory analysis on it. 

In [None]:
sns.get_dataset_names() # this is all of the available datasets in seaborn

Planets is a dataset of exoplanets we found, perfect for us! Do some exploration of the dataset, printing, using `.describe()`, `.head()` etc. Which columns are categorical? Which are numeric? Is there any missing data (represented by NaN)?

Some things you might be interested in observing: 
- Plot a histogram of the year of discovery to see how many more planets we find now. 
- Try a box plot comparing the distance and method. 
- Try an sns.pairplot (You may need 'dropna=True' for it to handle the NaN's)
- Try to group the data by 'method' and call count. Which method has found the most planets? Which methods do or do not provide a mass estimate? 

The goal of pandas and seaborn is to explore, so try different things yourself!