# Data Analysis and Data Science
## Be curious about the methods

By: Caroline Labelle
<br>For: BCM6065-65

<br>
Date: July 4th 2022

<hr style="border:1px solid black">


In [None]:
Name: 

## Installing and importing Python libraries

Before using (or importing) a library in Python, you first need to install it! This step only needs to be done once for each library: once installed, the library can be imported from your coding environment at any time.

Resource: https://pip.pypa.io/en/stable/user_guide/

In [None]:
### Installing scikit-learn

In [None]:
### Import sklearn
import sklearn.decomposition, sklearn.cluster, sklearn.preprocessing

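### Import scipy (stats and optimize are imported explicitly so that sp.stats and sp.optimize are always available)
import scipy as sp
import scipy.stats, scipy.optimize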

### Import pandas, numpy, seaborn and matplotlib.pyplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sns.set(rc={'figure.figsize':(9, 4)})
sns.set_theme(context="notebook", style="white", palette="Set2")

We will be mainly using the <code>sklearn</code> and <code>scipy</code> libraries to implement and use various data analysis methods.

scikit-learn resource: https://scikit-learn.org/stable/index.html
Scipy resource: https://docs.scipy.org/doc/scipy/reference/

## Data Analysis

Once we've explored our dataset and have a better understanding of what it contains, we can start to analyse it! Before applying any kind of method, we must first establish "what we want to know".

Do we want to **fit** our data to a model and/or assess if there is a **correlation** between variables? Do we want to **decompose** our dataset and/or identify **clusters**?

Once we establish "what we want to know", we need to define "how we'll do it"! There exist different methods for a single task... We must be curious about the methods and use the one that is the most appropriate to our *context*! 

<hr>

## Regression and Curve fitting

### Linear regression with Scipy

<code>scipy.stats.linregress(x, y, alternative='two-sided')</code>

* **x, y**: sets of measurements
* **alternative='two-sided'**: the alternative hypothesis ($H_{a}$) is that the slope of the regression line is nonzero

Here, the null hypothesis ($H_{0}$) is that the slope is zero.

Resource: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress
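To make the call above concrete, here is a minimal sketch on synthetic data (not the penguins dataset). The variable names and the "true" slope and intercept are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic measurements (assumption): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Two-sided test: H0 is "slope = 0", Ha is "slope != 0"
reg = stats.linregress(x, y, alternative="two-sided")

print(f"slope = {reg.slope:.2f}, intercept = {reg.intercept:.2f}")
print(f"r = {reg.rvalue:.3f}, p-value = {reg.pvalue:.2e}")
```

A small p-value here means we reject the null hypothesis that the slope is zero.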

In [None]:
### Import and clean the penguins dataset
data_penguin = 

### Re-index


In [None]:
### Plot a pairwise comparison figure with seaborn


Flipper length and body mass seem to be highly correlated. We want to confirm this by applying a linear regression.

In [None]:
## Do a linear regression
reg = sp.stats.linregress()

In [None]:
## Look at the results
reg

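The <code>rvalue</code> represents the correlation coefficient. The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables.

**Important**: a correlation coefficient of 0 does not imply that the variables are unrelated! It only tells us that there is no *linear* relationship; the variables can still be related in a non-linear way.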
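A quick way to convince ourselves is the sketch below, which uses made-up data (an assumption, not part of the course datasets): $y = x^2$ depends entirely on $x$, yet its Pearson coefficient is essentially zero when $x$ is symmetric around 0.

```python
import numpy as np
from scipy import stats

# y is fully determined by x, but the relationship is not linear
x = np.linspace(-5, 5, 101)
y = x ** 2

reg = stats.linregress(x, y)
print(f"r = {reg.rvalue:.3f}")  # essentially zero despite the perfect dependence
```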

<hr>

### Exercise [10 points]
Select a pair of variables (other than flipper length vs. body mass). 
1. Apply a linear regression for each species;
2. Plot the data points and the linear fit obtained for each species (make three independent plots).

Which species has the highest correlation?

### Curve fitting with Scipy

<code>scipy.optimize.curve_fit(f, x, y)</code>

* **f**: model function such that $f(x, ...)$
* **x, y**: sets of measurements


Resource: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html#scipy.optimize.curve_fit
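As an illustration of how <code>curve_fit</code> is called, here is a minimal sketch that fits a straight line to synthetic data. The function name <code>linear_model</code>, its parameters <code>a</code> and <code>b</code>, and the data are assumptions made for this example.

```python
import numpy as np
from scipy import optimize

def linear_model(x, a, b):
    """Model function: the first argument is x, the remaining ones are the parameters to fit."""
    return a * x + b

# Synthetic measurements (assumption): y = 3x - 2 plus Gaussian noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 * x - 2.0 + rng.normal(scale=0.5, size=x.size)

# param holds the fitted (a, b); cov is their estimated covariance matrix
param, cov = optimize.curve_fit(linear_model, x, y)
slope, intercept = param

# One standard deviation error on each parameter
param_err = np.sqrt(np.diag(cov))
print(f"a = {slope:.2f} +/- {param_err[0]:.2f}, b = {intercept:.2f} +/- {param_err[1]:.2f}")
```

Unlike <code>linregress</code>, <code>curve_fit</code> does not return a correlation coefficient, but it works with any model function, linear or not.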

In [None]:
### Linear regression with curve_fit
### Define the model f


In [None]:
### Do linear curve fitting of flipper length vs. body mass
param, cov = sp.optimize.curve_fit()

In [None]:
### Look at results and compare to linregress results
### Param:

### R Coef:


<code>numpy.corrcoef()</code> returns the Pearson correlation coefficient matrix of the variables.

Resource: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html
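The sketch below shows the shape of what <code>numpy.corrcoef()</code> returns for two variables. The data are synthetic and the variable names are assumptions for illustration.

```python
import numpy as np

# Two synthetic, positively related variables (assumption)
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)

# 2 x 2 matrix: the diagonal is 1, the off-diagonal entries are r(x, y)
corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1]

print(corr_matrix)
print(f"r = {r:.3f}")
```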

### Exercise [20 points]
Import the dose-response dataset. We tested various concentrations for 12 different drugs on AML3 cells. The dataset contains the viability responses. 

* Row names are the $log_{10}$ concentration
* Column names are the various drugs we tested
* Cell values are the $\%$ viability of AML3 for a given concentration of a given compound.


**Find which drug has the smallest IC$_{50}$ and which drug has the smallest minimal viability.**

It is common practice to model dose-response data with the log-logistic model:<br>
$$f(x) = Min + \frac{Max - Min}{1 + 10 ^ {x - IC_{50}}} $$<br>

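where $x$ is the ($log_{10}$) dose, $f(x)$ is the $\%$ viability for that dose, $Min$ and $Max$ are the lower and upper viability plateaus, and $IC_{50}$ is the ($log_{10}$) dose at which the viability is halfway between $Max$ and $Min$.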
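To see how such a model can be fitted, here is a minimal sketch on simulated data only (it does not use the course dataset). The function name <code>log_logistic</code>, the parameter order, the "true" parameter values and the initial guess <code>p0</code> are all assumptions made for this example.

```python
import numpy as np
from scipy import optimize

def log_logistic(x, ic50, vmin, vmax):
    """Log-logistic model; x and ic50 are both on the log10 scale."""
    return vmin + (vmax - vmin) / (1 + np.power(10.0, x - ic50))

# Simulated dose-response curve (assumption): IC50 = 0.5, Min = 10, Max = 100
rng = np.random.default_rng(3)
log_dose = np.linspace(-3, 3, 13)
viability = log_logistic(log_dose, 0.5, 10.0, 100.0)
viability = viability + rng.normal(scale=2.0, size=log_dose.size)

# A rough initial guess helps the non-linear optimisation converge
p0 = [0.0, viability.min(), viability.max()]
param, cov = optimize.curve_fit(log_logistic, log_dose, viability, p0=p0)
ic50_hat, min_hat, max_hat = param

print(f"IC50 = {ic50_hat:.2f}, Min = {min_hat:.1f}, Max = {max_hat:.1f}")
```

At $x = IC_{50}$ the denominator equals 2, so the model returns $Min + (Max - Min)/2$, which is why the fitted <code>ic50_hat</code> sits at the midpoint of the curve.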

In [None]:
### Import the data


In [None]:
### Define a function for the log-logistic model
### np.power() : https://numpy.org/doc/stable/reference/generated/numpy.power.html


In [None]:
### Do curve fitting (non-linear regression) on viability for 17-AAG
### Plot the resulting curve


In [None]:
### Do curve fitting (non-linear regression) on viability for each drug
### Find the smallest predicted IC50 and smallest predicted Min


### BONUS [5 points]
**Is there a linear correlation between the IC50 and Min values?**

In [None]:
### Do curve fitting (non-linear regression) on viability for each drug
### Save predicted IC50 and predicted Min


In [None]:
### Calculate linear regression for IC50 vs. Min


In [None]:
### Plot IC50 vs. Min values with linear fit


<hr>

## Dimensionality reduction

Dimensionality reduction allows us to reduce the number of random variables to consider. It is primarily useful for visualisation purposes and for increasing the efficiency of other analysis methods (e.g. clustering).

### PCA with scikit-learn

**PCA** = Principal Component Analysis

PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. *We are trading a little bit of accuracy for simplicity*.

Resource: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

In [None]:
### Show the penguins dataset


In [None]:
### Define the data we need for the PCA
X = 

In [None]:
### Initiate the PCA and apply it to our data
pca = sklearn.decomposition.PCA()
pca.fit()

In [None]:
# Instantiate a new scaler
scaler = sklearn.preprocessing.StandardScaler()

# Learn the pattern from the input data
scaler.fit()

#Apply the pattern
X_scaled = scaler.transform() 

### Initiate the PCA and apply it to the scaled data
pca = sklearn.decomposition.PCA()
pca.fit()

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
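The two properties described above (linear combinations, uncorrelated components) can be checked directly. The sketch below uses a small synthetic dataset with three features on very different scales; the data and variable names are assumptions for illustration.

```python
import numpy as np
import sklearn.decomposition
import sklearn.preprocessing

# Synthetic data (assumption): two correlated features on different scales + one independent feature
rng = np.random.default_rng(4)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + rng.normal(scale=0.3, size=(200, 1)),
    100 * base + rng.normal(scale=30, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Scale each feature to mean 0 and variance 1, then apply the PCA
X_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X)
pca = sklearn.decomposition.PCA()
X_pcs = pca.fit_transform(X_scaled)

# Each principal component is a linear combination of the (centered) initial variables
X_manual = (X_scaled - pca.mean_) @ pca.components_.T

# The principal components are uncorrelated: the correlation matrix is the identity
pc_corr = np.corrcoef(X_pcs, rowvar=False)
print(np.round(pc_corr, 3))
```

Each row of <code>pca.components_</code> holds the weights of one principal component.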

**Attributes of the pca object:**

<code>pca.n_components_</code>: estimated number of components

<code>pca.n_features_in_</code>: number of features in the training data (older scikit-learn versions expose it as <code>pca.n_features_</code>)

<code>pca.n_samples_</code>: number of samples in the training data

In [None]:
## Get estimated number of components


In [None]:
## Get number of features in the training data


In [None]:
## Get number of samples in the training data


PCA tries to put maximum possible information in the first component, then maximum remaining information in the second and so on.

<code>pca.explained_variance_</code>: amount of variance explained by each of the selected components

<code>pca.explained_variance_ratio_</code>: fraction of the total variance explained by each of the selected components (multiply by 100 to get a percentage)
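The link between these two attributes can be sketched as follows, on synthetic data with made-up feature scales (an assumption for illustration). When all components are kept, the ratios are simply the explained variances divided by their sum.

```python
import numpy as np
import sklearn.decomposition

# Synthetic data (assumption): four independent features with decreasing spread
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca = sklearn.decomposition.PCA().fit(X)
ratio = pca.explained_variance_ratio_

# Same quantity computed by hand (valid here because all components are kept)
manual = pca.explained_variance_ / pca.explained_variance_.sum()

# Cumulative fraction of variance captured by the first 1, 2, ... components
cumulative = np.cumsum(ratio)
print(np.round(ratio, 3))
print(np.round(cumulative, 3))
```

The cumulative sum is a handy way to decide how many components are needed to retain a given share of the variance.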

In [None]:
## Get the explained_variance


In [None]:
## Get the explained_variance_ratio_
## What do you notice?


In [None]:
## Plot the explained_variance_ratio_
## Bar plot: https://seaborn.pydata.org/generated/seaborn.barplot.html


In [None]:
### Apply the dimensionality reduction to our data
X_reduce = pca.fit_transform()

### Make a dataframe and add labelling columns
reduce_penguin = pd.DataFrame()


In [None]:
### Plot the results of the reduction
### How many PC should we use?


What can be said of the above plot?

Remember, the principal components are less interpretable and don’t have any real meaning since they are constructed as linear combinations of the initial variables...

### Other resources to learn more about PCA
* https://builtin.com/data-science/step-step-explanation-principal-component-analysis
* https://www.youtube.com/watch?v=HMOI_lkzW08
* https://www.youtube.com/watch?v=FgakZw6K1QQ

### Exercise  [20 points]

Import the Wisconsin breast cancer dataset. We are interested in identifying tumor types.

Can you identify a pair of variables that seem to be linearly correlated? 
* **What is the r coefficient?** 
* **Do you find different r coefficients for the different tumor types?**

Apply a PCA to the dataset. Make sure to plot the percentage of explained variance and the results of the reduction applied to the data.
* **What can you conclude regarding the tumor types?**

In [None]:
## Import the dataset


In [None]:
## Do a pairplot of all the variable pairs


In [None]:
## Do linear regression on pair of variables


In [None]:
## Do linear regression on pair of variables based on tumour type


In [None]:
### Define the data we need for the PCA


In [None]:
### Initiate the PCA and apply it to our data


In [None]:
## Get the explained_variance_ratio_
## What do you notice?


In [None]:
## Plot the explained_variance_ratio_


In [None]:
### Apply the dimensionality reduction to our data


In [None]:
### Plot the results of the reduction
### How many PC should we use?


<hr>

## Clustering

Clustering methodologies allow us to automatically group similar objects into sets. There exist many clustering methodologies!

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png)

### k-means with scikit-learn

The KMeans algorithm clusters data by trying to separate samples in $n$ groups of equal variance, minimizing a criterion known as **within-cluster sum-of-squares**.

This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields.

At a glance, the k-means algorithm divides a set of $N$ samples $X$ into $K$ disjoint clusters $C$, each described by the mean $\mu_{j}$ of the samples in the cluster $C_{j}$. 

*In very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.*

Resource: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
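Before working with real data, it can help to see k-means on a case where the answer is known. The sketch below uses three well-separated synthetic groups; the data, the choice of <code>n_clusters=3</code>, and the <code>n_init</code> and <code>random_state</code> values are assumptions for illustration.

```python
import numpy as np
import sklearn.cluster

# Synthetic data (assumption): 3 groups of 50 points around known centers
rng = np.random.default_rng(6)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# The number of clusters must be given up front
kmean = sklearn.cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
kmean.fit(X)

print(kmean.cluster_centers_)  # one row of coordinates per cluster
print(kmean.labels_[:10])      # cluster index assigned to the first 10 points
```

The cluster indices are arbitrary: cluster 0 is not guaranteed to correspond to the first group, only to *one* of the groups.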

Let's apply the k-means algorithm to the first two principal components of the penguins dataset. **Are we able to cluster the data points based on their species?**

In [None]:
### Define the data
X = 

In [None]:
### Initiate the k-means algorithm
### How many clusters should we use?
kmean = sklearn.cluster.KMeans()

In [None]:
### Apply kmeans to our data
kmeans_X = kmean.fit()

**Attributes of the kmeans object:**

<code>kmean.cluster_centers_</code>: coordinates of cluster centers

<code>kmean.labels_</code>: labels of each point

In [None]:
### Get the centroid coordinates


In [None]:
### Get the data point labels


In [None]:
### Add the cluster column


In [None]:
### Plot the cluster and the labels


It is not always easy to define the number of clusters to use!

The most common approach for deciding the value of $K$ is the so-called elbow method. It involves running the algorithm in a loop with an increasing number of clusters, and then plotting a clustering score as a function of the number of clusters.

<code>inertia_</code>: sum of squared distances of samples to their closest cluster center
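The elbow method can be sketched as below, again on synthetic data with three known groups (an assumption for illustration; the variable names are hypothetical as well).

```python
import numpy as np
import sklearn.cluster

# Synthetic data (assumption): 3 groups of 50 points around known centers
rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Run k-means for K = 1..10 and record the inertia each time
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = sklearn.cluster.KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"K = {k:2d}  inertia = {inertia:.1f}")
```

Plotting <code>inertias</code> against <code>k_values</code> should show a sharp drop up to $K = 3$ followed by a plateau: the "elbow" marks the point where adding more clusters stops paying off.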

In [None]:
### Get the inertia of our initial kmean


In [None]:
### Run kmeans for various k values
### initiate empty list

### Create for loop for K from 1 to 10


In [None]:
### Plot K vs. inertia


### Exercise [10 points]

Go back to the Wisconsin breast cancer dataset. **Can you cluster the data points into $K$ clusters?**

You must first decide which value to give $K$. Make sure to leave some trace of your thought process...<br>
Apply the k-means algorithm to the data and plot the results. **Can you find a link between one of the dataset features and the clusters?**