# Data Analysis and Data Science
## Be curious about the data 

By: Caroline Labelle
<br>For: BCM6065-65

<br>
Date: June 30th, 2022

<hr style="border:1px solid black"> </hr>


## Importing the data

Once we've identified an interesting dataset, we need to import it. A great Python library for this is Pandas.

Pandas is a library that has multiple functions to help with data manipulation and analysis.

In [1]:
## Import the pandas library for DataFrame (DF) manipulation
import pandas as pd

In [2]:
## Define variable with file name
fn = "palmer_penguins.csv"

In [3]:
## Read and save data into variable as a dataframe
data = pd.read_csv(fn)

In [4]:
## Show the first few rows of the dataset
data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


In [15]:
data.shape

(344, 7)

In [5]:
data.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0


In [6]:
data['species'].unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

In [7]:
data['island'].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

What observations can you make?

We can see the column names, and that there are 3 species and 3 islands. There are also missing values (NaN).

<hr style="border:1px dashed black"> </hr>

## Cleaning out the dataset

When cleaning a dataset, it is important to document how and why we do it: we cannot just remove data and continue as if it never existed.
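
One way to keep such a record is to set the removed rows aside instead of silently discarding them. The sketch below assumes a small made-up table (`toy`, not the penguin file) and hypothetical variable names; it only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the real data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 3700.0],
    "sex": ["MALE", np.nan, "FEMALE", np.nan],
})

# Boolean mask: True for rows with at least one missing value
has_missing = toy.isna().any(axis=1)

# Keep the removed rows in their own DataFrame so they can be reported
toy_removed = toy[has_missing]
toy_clean = toy[~has_missing]  # same rows as toy.dropna(axis=0)

print(f"Rows removed: {len(toy_removed)} out of {len(toy)}")
# Missing values per column among the removed rows
print(toy_removed.isna().sum())
```

Keeping `toy_removed` around makes it possible to state later which samples were excluded and which variables were missing for them.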

In [10]:
## Which variables have missing data?
## isna() returns a boolean mask (True where a value is missing);
## summing the mask counts the missing values in the column
for column in data.columns:
    print(f"The number of missing data in column {column} is {data[column].isna().sum()}.")

The number of missing data in column species is 0.
The number of missing data in column island is 0.
The number of missing data in column bill_length_mm is 2.
The number of missing data in column bill_depth_mm is 2.
The number of missing data in column flipper_length_mm is 2.
The number of missing data in column body_mass_g is 2.
The number of missing data in column sex is 11.


In [None]:
## How can we evaluate if a variable (i.e. column) contains NaN?
## isna().any() returns True for each column with at least one missing value
data.isna().any()


In [11]:
## Dropping a.k.a removing missing data

## We want to drop the rows where at least one element is missing
## It is good practice to make a new cleaned DataFrame
data_clean = data.dropna(axis=0)

In [13]:
## How many samples were removed?
print("The number of samples that were removed is " + str(data.shape[0]-data_clean.shape[0])+".")

The number of samples that were removed is 11.


<hr style="border:1px dashed black"> </hr>

## Exploring the data

In [14]:
### How many samples and variables do we have?
print("The number of samples that we have is " + str(data_clean.shape[0])+".")

The number of samples that we have is 333.


In [None]:
### What are the variables?
variables = data_clean.columns.tolist()
print(variables)

In [None]:
### What are the types of each variable?
data_clean.dtypes


In [None]:
### For the numeric variables,
### what are the min., max., mean and median values?

print("Min. values: ")
print(data_clean.min(numeric_only=True))


In [None]:
print("Max. values: ")
print(data_clean.max(numeric_only=True))


In [None]:
print("Median values: ")
print(data_clean.median(numeric_only=True))


In [None]:
print("Mean values: ")
print(data_clean.mean(numeric_only=True))


In [None]:
### For the categorical variables, how many categories does each have?
print("Nb. species:", data_clean["species"].nunique())
print("Nb. islands:", data_clean["island"].nunique())
print("Nb. sex:", data_clean["sex"].nunique())

In [None]:
### What are the species names?
data_clean["species"].unique()


![alt text](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)

In [None]:
### Since we have different groups, let's group our data!
### What are the mean mass and length values?

### By species
data_clean.groupby("species").mean(numeric_only=True)


In [None]:
### By island
data_clean.groupby("island").mean(numeric_only=True)


In [None]:
### By sex
data_clean.groupby("sex").mean(numeric_only=True)


Based on these results, what hypothesis can you formulate?

<hr style="border:1px dashed black"> </hr>

## Visualize the data

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Example gallery: https://seaborn.pydata.org/examples/index.html<br>
Helpful tutorial: https://seaborn.pydata.org/tutorial.html

In [None]:
## Import the seaborn library
## You might have to install it first!
import seaborn as sns


We can set global theme parameters (<code>sns.set_theme</code>) that will be applied to all plots.

Resource: https://seaborn.pydata.org/generated/seaborn.set_theme.html

The <code>context</code> will affect things like the size of the labels, lines, and other elements of the plot, but not the overall style. You can choose between "notebook", "paper", "talk" and "poster".

The <code>style</code> will affect things like the color of the axes, whether a grid is enabled by default, and other aesthetic elements. You can choose between "darkgrid", "whitegrid", "dark", "white", "ticks".

The <code>palette</code> will define the colors used when plotting.
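
To see what these three arguments actually control, we can inspect the parameters behind a given context, style or palette without changing the global theme. This is a small sketch using seaborn's `plotting_context`, `axes_style` and `color_palette` helpers; the variable names are only illustrative.

```python
import seaborn as sns

# plotting_context() and axes_style() return the parameters a given
# context or style would set, without changing the global theme
notebook = sns.plotting_context("notebook")
talk = sns.plotting_context("talk")
print(notebook["font.size"], talk["font.size"])  # "talk" uses larger fonts

white = sns.axes_style("white")
whitegrid = sns.axes_style("whitegrid")
print(white["axes.grid"], whitegrid["axes.grid"])  # grid disabled vs. enabled

# color_palette() returns the list of colors behind a palette name
set2 = sns.color_palette("Set2")
print(len(set2), "colors in Set2")
```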

In [None]:
sns.set(rc={'figure.figsize':(9, 4)})
sns.set_theme(context="notebook", style="white", palette="Set2")

![alt text](https://seaborn.pydata.org/_images/function_overview_8_0.png)

The general syntax to generate a plot/figure with seaborn:

```python
sns.type_of_plot(data=data_df, x=col_name, (y=col_name), [ARGS])
```

<br><br>
For instance, we can generate a plot with the values of body mass represented as a histogram:

```python
sns.histplot(data, x="body_mass_g")
```

In [None]:
### Plot body mass as a histogram
sns.histplot(data_clean, x="body_mass_g")
sns.despine()

### Exploring distribution of measurements 

A histogram is a classic visualization tool that represents the distribution of one or more variables by counting the number of observations that fall within discrete bins.

Resource: https://seaborn.pydata.org/generated/seaborn.histplot.html?highlight=histplot#seaborn.histplot

In [None]:
### We want to look at the distribution of measurements for body_mass
sns.histplot(data_clean, x="body_mass_g")
sns.despine()

---
**bins => str, number, vector, or a pair of such values**<br>
Generic bin parameter that can be the name of a reference rule, the number of bins, or the breaks of the bins.

In [None]:
### Let's use more bins...
sns.histplot(data_clean, x="flipper_length_mm", bins=25)
sns.despine()

---
**binwidth => number or pair of numbers** <br>
Width of each bin, overrides bins

In [None]:
### Let's use smaller bins...
### (binwidth=2, i.e. 2 mm per bin, is an illustrative choice)
sns.histplot(data_clean, x="flipper_length_mm", binwidth=2)
sns.despine()

---
**hue => vector or key in data**<br>
Semantic variable that is mapped to determine the color of plot elements.

In [None]:
### We want the distribution to be sex-specific
sns.histplot(data_clean, x="flipper_length_mm", bins=25, hue="sex")
sns.despine()

---
**multiple ==> {“layer”, “dodge”, “stack”, “fill”}** <br>
Approach to resolving multiple elements when semantic mapping creates subsets. Only relevant with univariate data.

In [None]:
### Make our plot more readable
p = sns.histplot(data_clean, x="flipper_length_mm", bins=25, hue="sex", multiple="stack")
p.set_title("Stacked distribution of flipper length (mm)")
sns.despine()

***
We can use <code>displot</code> instead of <code>histplot</code>, which allows us to draw distribution plots onto a FacetGrid, i.e. to have a figure with several subplots.<br>

Resource: https://seaborn.pydata.org/generated/seaborn.displot.html?highlight=displot#seaborn.displot

**kind => {“hist”, “kde”, “ecdf”}**<br>
Approach for visualizing the data. Selects the underlying plotting function and determines the additional set of valid parameters.

**row, col ==> vectors or keys in data**<br>
Variables that define subsets to plot on different facets.

In [None]:
### Modify the following line of code so that:
### --> you are using displot()
### --> you are representing the flipper length with histogram
### --> you have a figure of 1 row and 3 columns: one column per species

sns.histplot(data_clean, x="flipper_length_mm", bins=25, hue="sex", multiple="stack")
sns.despine()

What do you observe in the above figure?
***
Once you have a figure, you can save it in a vector format (<code>.pdf</code> or <code>.svg</code>). Figures can then be easily edited in apps such as Illustrator and Affinity Designer.

In [None]:
### Save your last figure as a pdf
p = sns.displot(data_clean, x="flipper_length_mm", bins=25, hue="sex", multiple="stack", kind="hist", col="species")
p.despine()

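## The file name is an example; the extension selects the output format
p.savefig("flipper_length_by_species.pdf")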

We have been representing the distributions in terms of "count of occurrences" (y-axis). It is worth noting that the distribution statistic can easily be changed with the <code>stat</code> argument.

**stat => {“count”, “frequency”, “density”, “probability”}**

Aggregate statistic to compute in each bin.
* **count** shows the number of observations
* **frequency** shows the number of observations divided by the bin width
* **density** normalizes counts so that the area of the histogram is 1
* **probability** normalizes counts so that the sum of the bar heights is 1

### Visualizing relationships

A scatterplot is a classic visualization tool that represents the relationship between <code>x</code> and <code>y</code>. Different subsets of the data can be shown by altering the color, size and style of the marker.

Even though it is possible to show up to three dimensions independently by using all three semantic types (color, size and style), this style of plot can be hard to interpret and is often ineffective.

Resource: https://seaborn.pydata.org/generated/seaborn.scatterplot.html?highlight=scatterplot#seaborn.scatterplot

In [None]:
### We are interested in the relation between bill length and depth
### Plot this relation as a scatter plot
sns.scatterplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm")
sns.despine()

***
**hue => vector or key in data**<br>
Grouping variable that will produce points with different colors. Can be either categorical or numeric, although color mapping will behave differently in the latter case.

In [None]:
### Color the marker based on the species
sns.scatterplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species")
sns.despine()

***
**size => vector or key in data**<br>
Grouping variable that will produce points with different sizes. Can be either categorical or numeric, although size mapping will behave differently in the latter case.
<br><br>
**style => vector or key in data**<br>
Grouping variable that will produce points with different markers. Can have a numeric dtype but will always be treated as categorical.

In [None]:
### How could we incorporate species, island and sex identification
### into our bill length and depth scatter plot?

## marker color: species
## marker size: sex
## marker style: island

sns.scatterplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species", size="sex", style="island")
sns.despine()

In [None]:
### To change figure size, we need to use the Matplotlib library
### It was installed as a dependency of seaborn: you just have to import it!
import matplotlib.pyplot as plt


Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python: the Seaborn library is based on matplotlib.<br>

Resource: https://matplotlib.org/stable/index.html

***
<code>plt.figure</code> creates a top-level figure object that will hold all plot elements. We can specify various arguments such as the size of the figure, its facecolor and its edgecolor.

Resource: https://matplotlib.org/stable/api/figure_api.html?highlight=figure#module-matplotlib.figure

In [None]:
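### Define a figure object with the desired dimensions
### (figsize is in inches; 9 x 6 is an illustrative choice)
fig = plt.figure(figsize=(9, 6))
p = sns.scatterplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species", size="sex", style="island")

### We can also play with our legend
## Legend outside of plot
plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
sns.despine()

### Add a descriptive title
p.set_title("Bill length vs. depth by species, sex and island")

### Save figure as pdf (example file name)
### bbox_inches="tight" keeps the outside legend inside the saved file
fig.savefig("bill_length_vs_depth.pdf", bbox_inches="tight")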

***
It is hard to fit multiple pairwise relationships from a dataset into a single plot. The rendered plot can be hard to read and interpret!

<br>
Instead of using <code>scatterplot</code>, let's try <code>pairplot</code>!

This will create a grid of Axes such that each numeric variable in data will be shared across the y-axes across a single row and the x-axes across a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.

It is also possible to show a subset of variables or plot different variables on the rows and columns.

Resource: https://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot

In [None]:
### Plot all pairwise relationships between all numeric variables
### Color samples based on their species
### The diagonal should contain histogram

sns.pairplot(data_clean, hue="species", diag_kind="hist")
sns.despine()

There is some redundancy in the figure obtained...

We can specify to plot only the lower triangle with <code>corner=True</code>.

In [None]:
### Only plot lower triangle of the pairwise plot
p = sns.pairplot(data_clean, hue="species", diag_kind="hist", corner=True)
sns.despine()

### Save figure to PDF
p.savefig("pairwise_penguins.pdf")

What do you observe in the above figure?

### Visualizing statistical relationships

Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other. Seaborn can be very useful here, as it lets us use statistical models to estimate a simple relationship between two noisy sets of observations.

Remember that Seaborn is a *visualization* tool and not a statistical library: you can visualize linear and non-linear relationships between subsets, but you will not get the associated coefficients or other statistical metrics.
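
If the numbers themselves are needed, they have to be computed separately. The sketch below is one possible way, using NumPy's <code>polyfit</code> and <code>corrcoef</code> on a handful of made-up points (the values of `x` and `y` are assumptions); a dedicated statistics library would give richer output.

```python
import numpy as np

# Hypothetical paired measurements, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares fit of a degree-1 polynomial: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.4f}")
```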

The two main functions to visualize a linear relationship are <code>regplot()</code> and <code>lmplot()</code>. Both functions draw a scatterplot of two variables, <code>x</code> and <code>y</code>, then fit the regression model <code>y ~ x</code> and plot the resulting regression line and a 95% confidence interval for that regression.

***
<code>regplot()</code> always shows a single relationship (i.e. no <code>hue</code> argument).

Resource: https://seaborn.pydata.org/generated/seaborn.regplot.html#seaborn.regplot

In [None]:
### Plot linear regression between bill length and depth
sns.regplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm")

### Add a title
plt.title("Linear regression - Bill length vs. depth")
sns.despine()

***
<code>lmplot()</code> combines <code>regplot()</code> with <code>FacetGrid</code> to provide an easy interface to show a linear regression on “faceted” plots that allow you to explore interactions with up to three additional categorical variables.

Resource: https://seaborn.pydata.org/generated/seaborn.lmplot.html#seaborn.lmplot

In [None]:
### Plot linear regression between bill length and depth
### The regression should be species-specific
sns.lmplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species")

### Add a title
plt.title("Linear regression - Bill length vs. depth")
sns.despine()

**truncate => bool, optional**

If True, the regression line is bounded by the data limits. If False, it extends to the x axis limits.

In [None]:
### Plot linear regression between bill length and depth
### The regression should be species-specific
### We want a figure of 1 row and 3 columns: one column per species
sns.lmplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species", col="species", truncate=False)
sns.despine()

***
It is possible to show only the linear regression, i.e. to remove the scatter plot.

**scatter => bool, optional**

If True, draw a scatterplot with the underlying observations (or the x_estimator values).

In [None]:
### Only show the linear regression
sns.lmplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="island", truncate=False, scatter=False)

### Add a title
plt.title("Linear regression - Bill length vs. depth")
sns.despine()

***
Various parameters of the linear regression can be adjusted for both <code>regplot()</code> and <code>lmplot()</code>:

* **ci => int in [0, 100] or None, optional** <br>Size of the confidence interval for the regression estimate. This will be drawn using translucent bands around the regression line. The confidence interval is estimated using a bootstrap; for large datasets, it may be advisable to avoid that computation by setting this parameter to None. 
<br>

* **x_estimator => callable that maps vector -> scalar, optional** <br>Apply this function to each unique value of x and plot the resulting estimate. This is useful when x is a discrete variable. If x_ci is given, this estimate will be bootstrapped and a confidence interval will be drawn.
<br>

* **logistic => bool, optional** <br>If True, assume that y is a binary variable and use statsmodels to estimate a logistic regression model. Note that this is substantially more computationally intensive than linear regression, so you may wish to decrease the number of bootstrap resamples (n_boot) or set ci to None.
<br>

* **logx => bool, optional** <br>If True, estimate a linear regression of the form y ~ log(x), but plot the scatterplot and regression model in the input space. Note that x must be positive for this to work.
<br>
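
As a small illustration of two of these parameters, the sketch below combines <code>ci=None</code> and <code>logx=True</code> in <code>regplot()</code>. The data are generated for the example (an assumption, not penguin measurements) so that <code>y</code> follows a roughly logarithmic trend.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical positive x values with a roughly logarithmic trend
rng = np.random.default_rng(1)
x = np.linspace(1, 50, 60)
y = 3 * np.log(x) + rng.normal(scale=0.3, size=x.size)

fig, ax = plt.subplots(figsize=(6, 4))
# ci=None skips the bootstrap, so no confidence band is drawn
# logx=True fits y ~ log(x) but plots in the original x scale
sns.regplot(x=x, y=y, ci=None, logx=True, ax=ax)
sns.despine()
```

Here the fitted curve bends with the data instead of being a straight line, because the model is linear in log(x) rather than in x.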



### Other interesting visualisations

<code>jointplot()</code> will draw a plot of two variables with bivariate and univariate graphs.

Resource: https://seaborn.pydata.org/generated/seaborn.jointplot.html?highlight=jointplot#seaborn.jointplot

In [None]:
sns.jointplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species")
sns.despine()

In [None]:
sns.jointplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", kind="reg")
sns.despine()

***
<code>rugplot()</code> will plot marginal distributions by drawing ticks along the x and y axes.

Resource: https://seaborn.pydata.org/generated/seaborn.rugplot.html?highlight=rugplot#seaborn.rugplot

In [None]:
sns.scatterplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species")
sns.rugplot(data=data_clean, x="bill_length_mm", y="bill_depth_mm", hue="species")
plt.legend(bbox_to_anchor=(1, 1))
sns.despine()