<hr>

##### Mount Drive - **Google Colab Only Step**

In Google Colab, we need to mount Google Drive before we can access the files stored there. Run the Python cell below, click the link it generates, and paste the authorization code into the cell's prompt.



In [0]:
from google.colab import drive
drive.mount('/content/drive')

##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [0]:
directory = "student"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/data-science-track/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science/Course/Data-Science-Track

# Correlations


## Import Libraries


In [0]:
# Data
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

# Library Configurations: 
sns.set() # make seaborn override the styling of matplotlib graphs
pd.set_option('display.max_columns', None) # display all columns
pd.set_option('display.max_rows', None) # display all rows

<hr>

<br>

## Load The Iris Data Set


### Save all rows of data to dataframe called `iris`
```python
iris = pd.read_csv('./data/iris.csv')
```

<hr>

<br>

## Understand The Data

### Visualizing The Head and Tail 

```python
iris.head()
```




### Checking The Shape Of The DataFrame

``` python 
iris.shape
``` 


### Checking The Data Types Of The Columns

```python 
iris.info()
```

### Exploring Column Unique Values

```python
for column_name in iris:
    print(column_name + ":")
    display(iris[column_name].unique())
    print() # print a new line to make reading output easier
```


### Numerical Summary Statistics

```python
iris.describe()
```

### Categorical Summary Statistics

```python
iris.describe(include='object')
```

<hr>

<br>

## Filtering Rows Of DataFrame

One of the biggest advantages of having the data as a Pandas Dataframe is that Pandas allows us to slice and dice the data in multiple ways.

Often, you may want to subset a pandas dataframe based on one or more values of a specific column. Essentially, we would like to select rows based on one value or multiple values present in a column.

### Filtering Rows Example Before Explanation
Let us show only the rows related to the 'Iris-setosa' species.

```python
iris[iris['Species'] == 'Iris-setosa']
```

### The Boolean Series
When we filter a data frame we need to create a boolean series. We will then use this boolean series to index the correct rows of a data frame.

#### Syntax

```python
boolean_series = dataFrame['desired-column-to-search'] == 'desired-value'
```

#### For Example 
```python
boolean_series = iris['Species'] == 'Iris-setosa'
```
The code on the right of the assignment operator (`=`) returns a boolean series.
```python
iris['Species'] == 'Iris-setosa'
```

#### Let's take a look at the boolean series being returned

In [0]:
(iris['Species'] == 'Iris-setosa').head()

0    True
1    True
2    True
3    True
4    True
Name: Species, dtype: bool

In [0]:
(iris['Species'] == 'Iris-setosa').tail()

145    False
146    False
147    False
148    False
149    False
Name: Species, dtype: bool

This operation goes down the `Species` column (a series) of the `iris` data frame, row by row, and checks whether each value is `==` to `Iris-setosa`. Where it is, the resulting series holds a `True` value at the matching index; where it is not, it holds a `False` value.

Adding `.head()` or `.tail()` to the boolean series limits the output to the first or last five rows.
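
The element-by-element comparison can be seen on a tiny series. This is a minimal sketch: the four values below are a hypothetical stand-in for the `Species` column, not the real data, and `is_setosa_demo` is a name chosen only for this illustration.

```python
import pandas as pd

# Hypothetical four-row stand-in for the `Species` column (not the real data).
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"],
    name="Species",
)

# The comparison is applied to every value and keeps the original index.
is_setosa_demo = species == "Iris-setosa"

print(is_setosa_demo.tolist())  # [True, False, True, False]
print(is_setosa_demo.sum())     # 2, because each True counts as 1
```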

#### Notice how the outputs have different values

This is because the dataframe is sorted in a particular order: all of the 'Iris-setosa' rows are placed at the top of the dataframe, followed by the 'Iris-versicolor' and 'Iris-virginica' rows.


### Your Turn
#### Create and Use A Boolean Series Filtering Out Only The 'Iris-virginica' Rows Of The DataFrame

> **Pro-Tip:**
Try replacing the `.head()` with a `.sum()` to get the **total** number of `True` values in a boolean series. There should be exactly 50 rows in the entire `iris` dataframe where the species is `Iris-setosa`.

```python
(iris['Species'] == 'Iris-setosa').sum()
```

> **Pro-Tip:**
Try replacing the `.sum()` with a `.mean()` operation to get the **proportion** of values that are `True`. The species is `Iris-setosa` in 33.3% of the rows in the entire `iris` dataframe.

```python
(iris['Species'] == 'Iris-setosa').mean()

# For percentage
(iris['Species'] == 'Iris-setosa').mean() * 100
```

<hr>

<br>

### Boolean indexing To Filter DataFrame

#### Comparing shape
Because the resulting boolean series has as many values as the original dataframe has rows, we can pass it in as an index. In this step we will look at the shape of both objects to prove this point.

We will take that boolean series and use it to index the original `iris` dataframe. Wherever the series has a `False` value, that row is filtered out; only the rows where the series is `True` are kept.

#### Common Way Of Storing A Boolean Series
A common way to store the boolean series of filtered DataFrame indexes is by storing the boolean series into a descriptive variable prefixed by "`is_`". 

#### For example
```python
is_setosa = iris['Species'] == 'Iris-setosa'
```

This naming helps programmers see at a glance what is being filtered. We will now use this boolean series to show the desired rows of the DataFrame.

```python
is_setosa = iris['Species'] == 'Iris-setosa'

# (150,): 150 values in a one-dimensional Series
print(is_setosa.shape)

# (150, 6): 150 rows, 6 columns
print(iris.shape)

# Remember, we are filtering rows, so the difference in columns is irrelevant
```

#### Index `iris` DataFrame Without A Boolean Series
We will pick an arbitrary range of rows to index the dataframe. This is not too dissimilar from what we are about to do with the boolean series.

```python
iris[47:55]
```

#### Indexing The DataFrame:

Now we pass in the boolean series instead of the row range.

```python
iris[is_setosa]
```


#### We Save The Filtered DataFrame To A Variable
```python
setosa = iris[is_setosa]
setosa.head()
```

#### Verify the final shape
Look at that! We have the same number of rows in the final `setosa` dataframe as there were `True` values in `is_setosa`.

```python
setosa.shape
```
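
The two shape checks above can be reproduced on a miniature frame. This is only a sketch: `demo`, `is_setosa_demo`, and `setosa_demo` are hypothetical names and values standing in for `iris`, `is_setosa`, and `setosa`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `iris`.
demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 4.9, 6.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"],
})

is_setosa_demo = demo["Species"] == "Iris-setosa"
setosa_demo = demo[is_setosa_demo]

# The mask has one entry per row of the frame it was built from ...
print(len(is_setosa_demo) == len(demo))          # True
# ... and the filtered frame has one row per True value in the mask.
print(len(setosa_demo) == is_setosa_demo.sum())  # True
# The rows that are kept retain their original index labels.
print(setosa_demo.index.tolist())                # [0, 2]
```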

### Recap
Below are both approaches once again, for context

#### One-Liner Syntax
```python
dataFrame[dataFrame['desired-column-to-search'] == 'desired-value']
```

#### Broken-Down Syntax
```python
is_description = dataFrame['desired-column-to-search'] == 'desired-value'
filtered_dataFrame = dataFrame[is_description]
```

#### Example
```python
# One-Liner
setosa = iris[iris['Species'] == 'Iris-setosa']

# Broken-down
is_setosa = iris['Species'] == 'Iris-setosa'
setosa = iris[is_setosa]
```

<hr>

<br>

## Multivariate Exploratory Analysis 
Multivariate analysis is performed to understand the interactions between different fields (columns) in the dataset, that is, between two or more variables.

### Definition Of Correlation
**Correlation** is a term in statistics that refers to the degree of association between two random variables. It indicates the extent to which two or more variables fluctuate together.

If A and B tend to be observed at the same time, you’re pointing out a correlation between A and B. You’re not implying A causes B or vice versa. You’re simply saying when A is observed, B is observed. They move together or show up at the same time.

<br>

### Three Types Of Correlations:
#### Positive Correlation 
When A increases, B tends to increase as well; when A decreases, B tends to decrease. It indicates the extent to which those variables increase or decrease in parallel.

**Examples:** 
- Sunlight and Plant Height - As the amount of sunlight a plant receives increases, so does the plant's height.
- Time Spent Studying and Grade - As the amount of time spent studying for a test increases, so does a student's score on said test.

<br>

#### Negative Correlation
When an increase in A is accompanied by a decrease in B, or vice versa.

**Examples:** 
- Absences and Grades - As a student's number of absences from school increases, their average grades decrease.
- Speed and Time Spent Driving - As the speed of my car increases, the amount of time it takes me to reach my destination decreases.

<br>

#### No Correlation
When two variables are completely unrelated and a change in A is accompanied by no consistent change in B, or vice versa.

**Examples:** 
- The amount of time I spend watching TV has no impact on your heating bill

<br>


### Definition of Causation 
**Causation** implies that A and B have a cause-and-effect relationship with one another. You’re saying A causes B.

Causation is also known as causality.

#### Causation Properties
- Firstly, causation means that two events appear at the same time or one after the other.
- Secondly, it means these two variables not only appear together, the existence of one causes the other to manifest.


#### Note: “Correlation Does Not Imply Causation”
Just because two trends seem to fluctuate in tandem does not prove that they are meaningfully related to one another. Check out this [resource to see for yourself](http://www.tylervigen.com/spurious-correlations).

<hr>

<br>

### Correlations And Causations Exercise:
List three more examples for each type of correlation below, plus three examples of causation.

#### Positive Correlations:
- here
- here
- here

#### Negative Correlations:
- here
- here
- here

#### No Correlations:
- here
- here
- here

#### Causation:
- here
- here
- here


<hr>

<br>

### Correlations And Scatter Plots
A scatter plot helps us see the `correlation` between two continuous distributions of numbers. (A related numeric measure, the `covariance`, is covered later.) The idea is to see how the values of one change along with those of the other.


#### Example Plots
The following plots are idealized. In practice a scatter plot will almost never be this clear cut and there will be outliers throughout, but if a correlation is present there will be a general trend either up or down.



![Positive and negative correlations on a scatter plot](https://drive.google.com/uc?id=1oK3Q7w1MnOWvbnKwynHcRLZaQyDtrbRb)



<br>

### Setup for the following cells 

The next cells depend on this cell being run.

We will be creating a setosa DataFrame using the filtering we just learned.
We will also set up two features to do multivariate exploratory analysis.

```python 
is_setosa = iris["Species"] == 'Iris-setosa'
setosa = iris[is_setosa]

feat1 = "SepalLengthCm"
feat2 = "SepalWidthCm"
```

### Violin plot

A violin plot is a method of plotting numeric data.

It is similar to a box plot, with the addition of a rotated kernel density plot on each side.

You can think of a violin plot as a hybrid between a histogram and a swarm plot: we give up a bit of the granularity of the swarm plot in favor of IQR and median markers.

**Note:** Set the `x` parameter to the feature whose distribution you want to see.

#### Show violin plot for feat1 against 'Species' 
```python
sns.violinplot(x = feat1, y='Species', data=setosa)
plt.show()
```

#### Show violin plot for feat2 against 'Species' 
```python
sns.violinplot(x = feat2, y='Species', data=setosa)
plt.show()
```

### Our First Scatter Plot

Using a scatter plot, let us plot the "SepalLengthCm" and the "SepalWidthCm" features to see how they fluctuate together.

```python 
sns.scatterplot(x=feat1, y=feat2, data=setosa)
plt.show()
```


#### What can we say about this plot? 


#### Is this relationship a correlation or a causation?

#### If it is a correlation what type of correlation is it? 

<hr>

<br>

### Covariance
The covariance is an unnormalized measure of how two variables vary together.

We will use numpy's built-in `np.cov` function to compute it.

The resulting matrix contains repeated values because it is symmetric: the covariance of feature 1 with feature 2 equals the covariance of feature 2 with feature 1.

It is also important to note that `np.cov` returns the variance of each feature on the diagonal of the matrix.

Index Keys for a 2 by 2 matrix:
- [0,0] : Variance for feature 1  
- [0,1] : Covariance for both features 
- [1,0] : Covariance for both features 
- [1,1] : Variance for feature 2 


In [0]:
print("Using Numpy Library\n", np.cov(setosa[feat1], setosa[feat2]))

print()
print("variance for feat1 using numpy.var():", np.var(setosa[feat1])) 
print("variance for feat1 using numpy.cov():", np.cov(setosa[feat1], setosa[feat2])[0,0]) 
print("variance for feat2 using numpy.var():", np.var(setosa[feat2]))
print("variance for feat2 using numpy.cov():", np.cov(setosa[feat1], setosa[feat2])[1,1])

print()
print("Covariance for these two features", np.cov(setosa[feat1], setosa[feat2])[0,1])

Using Numpy Library
 [[0.12424898 0.10029796]
 [0.10029796 0.14517959]]

variance for feat1 using numpy.var(): 0.12176399999999993
variance for feat1 using numpy.cov(): 0.12424897959183674
variance for feat2 using numpy.var(): 0.142276
variance for feat2 using numpy.cov(): 0.14517959183673476

Covariance for these two features 0.10029795918367344
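
The output above shows slightly different variances from `np.var` and `np.cov`. This is expected: `np.var` divides by `n` by default (`ddof=0`), while `np.cov` divides by `n - 1`. With the 50 setosa rows, 0.121764 × 50 / 49 ≈ 0.124249, which matches the two printed values for `feat1`. The sketch below shows the same effect on made-up toy arrays (`x` and `y` are hypothetical values, not the iris data).

```python
import numpy as np

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

print(np.var(x))                   # 1.25, sum of squared deviations / n
print(np.var(x, ddof=1))           # sum of squared deviations / (n - 1)
print(np.cov(x, y)[0, 0])          # matches np.var(x, ddof=1)
print(np.cov(x, y, ddof=0)[0, 0])  # matches np.var(x)
```

Passing the same `ddof` to both functions makes the values agree.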


Because a raw covariance matrix is hard to interpret, we need a better way to measure the correlation between two numerical distributions. As it turns out, the statistician Karl Pearson developed a much more interpretable measure of correlation, which we will discuss below!

Another thing to note is that the covariance can fall anywhere between negative infinity and positive infinity, and its size depends on the units of the data. This makes it hard to tell what the value actually says about the correlation. We will use the **Pearson correlation coefficient** to map this relationship to a range of -1 to 1.
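
To see why the raw covariance is hard to read, the sketch below expresses the same made-up measurements in centimetres and in millimetres. The arrays are hypothetical values, not the iris data, and `np.corrcoef` is NumPy's built-in Pearson calculation.

```python
import numpy as np

# Hypothetical measurements in centimetres (not the real iris rows).
length_cm = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4])
width_cm = np.array([3.5, 3.0, 3.2, 3.1, 3.6, 3.9])

# The same measurements expressed in millimetres.
length_mm = length_cm * 10
width_mm = width_cm * 10

cov_cm = np.cov(length_cm, width_cm)[0, 1]
cov_mm = np.cov(length_mm, width_mm)[0, 1]
r_cm = np.corrcoef(length_cm, width_cm)[0, 1]
r_mm = np.corrcoef(length_mm, width_mm)[0, 1]

print(cov_mm / cov_cm)  # approximately 100: the covariance follows the units
print(r_cm, r_mm)       # the two correlation coefficients agree
```

The relationship between the two features is identical in both cases, yet the covariance changes by a factor of about 100 while the correlation coefficient does not move.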


<hr>

<br> 

### Pearson Correlation Coefficient
The Pearson product-moment correlation coefficient measures the strength of the linear association between two variables.
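
For reference, the standard sample form of the coefficient for $n$ paired observations $(x_i, y_i)$ is written below. The symbols $\bar{x}$, $\bar{y}$ (the means) and $s_x$, $s_y$ (the standard deviations) are notation introduced here, not taken from the notebook.

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
$$

The second form only equals the first when the covariance and both standard deviations use the same divisor (both $n$ or both $n - 1$), because the divisor then cancels.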

The sign and the absolute value of a Pearson correlation coefficient describe the direction and the magnitude of the relationship between two variables.
- The value of a correlation coefficient ranges between -1 and 1.
- The greater the absolute value of a correlation coefficient, the stronger the linear relationship.
- The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
- The weakest linear relationship is indicated by a correlation coefficient equal to 0.
- A positive correlation means that if one variable gets bigger, the other variable tends to get bigger. A really positive correlation will tend toward +1.
- A negative correlation means that if one variable gets bigger, the other variable tends to get smaller. A really negative correlation will tend towards -1.
- When there is no correlation, the coefficient hovers around 0, usually as a small negative or positive decimal.
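
The ranges described in the list above can be illustrated on synthetic data. This is a minimal sketch, assuming made-up random values rather than the iris data; it uses `np.corrcoef`, NumPy's built-in Pearson calculation, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the run is repeatable
x = rng.normal(size=200)
noise = rng.normal(scale=0.5, size=200)

y_positive = 2 * x + noise          # tends to rise with x
y_negative = -2 * x + noise         # tends to fall as x rises
y_unrelated = rng.normal(size=200)  # generated independently of x

for label, y in [
    ("positive", y_positive),
    ("negative", y_negative),
    ("none", y_unrelated),
]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label:>8}: r = {r:+.2f}")
```

The first two values should land close to +1 and -1, and the third close to 0, although it will rarely be exactly 0 on a finite sample.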

<br>

The function below takes the names of two columns and the DataFrame they belong to, and returns a value between -1 and 1 called the Pearson correlation coefficient. Its entire purpose is to give a single number that describes the intensity of a correlation.


In [0]:
def pearson_coeff(feat_1, feat_2, df):
    """
        Info:
            This function will return the Pearson coefficient, a value between 
            -1 and 1 to signify the type of correlation.   
        Params:
            feat_1 (type: string), column name
            feat_2 (type: string), column name 
            df (type: Pandas DataFrame)
        Output:
            Pearson coefficient value (type: float)
    """
    x = df[feat_1]
    y = df[feat_2]
    
    # np.cov divides by n - 1 by default, so the standard deviations must use
    # ddof=1 as well; np.std's default (ddof=0) would inflate the result.
    covar = np.cov(x, y)[0, 1]
    std_x = np.std(x, ddof=1)
    std_y = np.std(y, ddof=1)
    
    print('Pearson Correlation Coefficient:')
    print(feat_1, "VS.", feat_2)
    
    return covar / (std_x * std_y)

#### Call / Invoke the function with the 2 features that were defined above.
Passing in the following arguments with variables we assigned earlier: 
- feat_1 = `feat1`
- feat_2 = `feat2`
- df = `setosa`

If you are struggling to understand what a function is doing, that is okay. It is very common to treat a function like a black box: we may not know how it does what it does, but if the source is reputable enough, we can trust that it will take in a certain input and return our desired output. In this case, the inputs are described in the function definition and the output is the Pearson correlation coefficient for those inputs.

Visually, it does look like there is a general upward trend, but it is hard to tell how intense the correlation is. The numerical output of this function gives us a much better understanding of the kind of correlation present.

In this case the correlation is around +.7, which is far enough from 0 to be considered a strong positive correlation.
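
In the notebook, the call described above is `pearson_coeff(feat1, feat2, setosa)`. One way to build trust in a black-box function is to cross-check its result against built-in equivalents. The sketch below does this on a hypothetical six-row frame (`demo` holds made-up values, not the real `setosa` rows) by comparing a manual calculation with `np.corrcoef` and pandas' `Series.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in values; in the notebook, use `setosa` instead.
demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
})
feat1, feat2 = "SepalLengthCm", "SepalWidthCm"
x, y = demo[feat1], demo[feat2]

# Manual calculation: covariance over the product of the standard
# deviations, all normalised by n - 1 (ddof=1).
manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# Built-in equivalents.
from_numpy = np.corrcoef(x, y)[0, 1]
from_pandas = x.corr(y)  # Pearson is the default method

print(manual, from_numpy, from_pandas)  # the three values agree
```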

<hr>

<br>

### Your Turn

### Reassign the `feat1` and `feat2` variables with the following values

```python 
feat1 = 'Id'
feat2 = 'SepalLengthCm'
```


#### Create Scatter plot

#### Calculate The Pearson Correlation Coefficient

#### Describe the scatter plot:


#### Is the relationship a correlation or a causation?

#### If it is a correlation what type of correlation is present?

<hr>

<br>

### Invoking Pandas correlation function
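You can also invoke the pandas `DataFrame.corr()` function. By default it uses the Pearson correlation coefficient.

This function shows the correlations between all pairs of numeric columns of the DataFrame. Recent pandas versions raise an error when the DataFrame contains non-numeric columns such as `Species`, so we select the numeric columns first.

```python
setosa.select_dtypes(include='number').corr()
```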
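
A correlation matrix lists every pair twice and has a diagonal of 1s, which can make it hard to scan. The sketch below shows one possible way to keep each pair once and sort the pairs by coefficient. The `demo` frame and its columns `a`, `b`, `c`, and `label` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with made-up columns.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],  # rises with a
    "c": [5.0, 4.1, 2.9, 2.2, 1.0],  # falls as a rises
    "label": ["x", "y", "x", "y", "x"],
})

corr = demo.select_dtypes(include="number").corr()

# Keep each pair once: upper triangle only, diagonal excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values()

print(pairs)           # most negative pair first, most positive pair last
print(pairs.idxmax())  # ('a', 'b')
```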

#### Try it for yourself

#### What features have the strongest positive correlation?

#### What features have the strongest negative correlation? 

#### What features have no correlation? 