# Lesson: Regression - EXPLORATION

<a href = "https://www.canva.com/design/DAFiFa1HOKY/uQ-ZPCWBYYjgONgjHzY9hg/view?utm_content=DAFiFa1HOKY&utm_campaign=designshare&utm_medium=link&utm_source=publishsharelink">![image.png](attachment:bd9198ba-7096-4098-a381-8f9df856eac6.png)</a>

Let's explore how each attribute interacts with the target variable to help discover what drives it.

> ##### "Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations." - Prasad Patil



<hr style="border:2px solid gray">

## Data Wrangling

In [1]:
#standard ds imports
import pandas as pd
import numpy as np

#viz and stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr
from scipy.stats import shapiro 

#my wrangle file
import wrangle as w

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Use our function from wrangle to acquire and prepare our data.

df = w.wrangle_exams()
df.head()

In [None]:
# Check whether the student_id values are unique
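
One way to run this check, sketched on a small made-up frame (the values are invented for illustration; only the `student_id` column name comes from the lesson). `Series.is_unique` gives a single boolean, and `duplicated()` surfaces any offending rows.

```python
import pandas as pd

# Hypothetical stand-in for the wrangled exam data (values are made up).
sample = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "exam1": [100, 98, 85, 83],
})

# True when every student_id appears exactly once.
ids_are_unique = sample["student_id"].is_unique

# Rows whose student_id repeats an earlier one (empty if all are unique).
repeated_rows = sample[sample["student_id"].duplicated()]
```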



___

## Types of Visualizations

Here is a breakdown of visualizations by type with some useful code snippets. Below, let's use the appropriate visualizations on our `student_grades` dataset.
### 1. Univariate Distributions
Look at the distribution of one variable at a time, using pandas' built-in plotting method to create a histogram, or Seaborn's `displot`, `boxplot`, or `countplot`.

#### a. Continuous variable distributions
 >```python
 >df[col].hist(grid=False, bins=10)
 >
 >sns.displot(data=df, x=col)
 >
 >sns.boxplot(data=df, x=col)
 >```

#### b. Discrete variable distributions
 >```python
 >sns.countplot(x='discrete_var', data=df)
 >```

<br>
<br> 
 
### 2. Bi-/Multi-Variate Comparisons

#### a. Continuous vs. Continuous
>```python
>sns.lineplot(data=train, x='x_var', y='y_var')
>
>sns.scatterplot(data=train, x='x_var', y='y_var')
>```
    
- Seaborn `pairplot` to create a scatter matrix visualizing all continuous variable relationships along with individual distributions.
    >```python
    >sns.pairplot(data)
    >```

- Seaborn `relplot` for a simple scatterplot of two continuous variables.
    >```python
    >sns.relplot(x='x_var', y='y_var', data=train, kind='scatter')
    >```

- Seaborn `lmplot` for a simple scatterplot of two continuous variables with a regression line. ***You can pass a discrete variable to `col` or `hue` to bring in another dimension, too.***
    >```python
    >sns.lmplot(x='x_var', y='y_var', data=train, scatter=True, hue=None, col=None)
    >```

- Seaborn `jointplot` for a simple scatterplot of two continuous variables with a regression line and the addition of a histogram for each variable.
    >```python
    >sns.jointplot(x='x_var', y='y_var', data=train, kind='reg')
    >```

- Seaborn `heatmap` of correlation coefficients for all numeric columns in a dataset. *Can also be run on Discrete vs. Discrete*
    >```python
    >sns.heatmap(train.corr())
    >```
    
<br>    
    
#### b. Discrete vs. Continuous
- Seaborn `swarmplot` or `stripplot` to examine a continuous variable across the levels of a discrete variable.
    >```python
    >sns.swarmplot(x='continuous_var', y='discrete_var', data=train)
    >
    >sns.stripplot(x='continuous_var', y='discrete_var', data=train)
    >
    >sns.catplot(x='continuous_var', y='discrete_var', data=train)
    >```
    
- Seaborn `boxplot`, `violinplot`, or `barplot` to show the distribution of a continuous variable by a discrete variable.
    >```python
    >sns.boxplot(x='discrete_var', y='continuous_var', data=train)
    >
    >sns.violinplot(x='discrete_var', y='continuous_var', data=train)
    >
    >sns.barplot(x='discrete_var', y='continuous_var', data=train)
    >```
    
    


___
## Univariate Exploration

In [None]:
plt.figure(figsize=(16, 3))

# List of columns

for i, col in enumerate(cols):
    
    # i starts at 0, but plot numbers should start at 1
    
    # Create subplot.
    
    # Title with column name.
    
    # Display histogram for column.
    
    # Hide gridlines.
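
A minimal sketch of how the skeleton above could be filled in. The data is a made-up stand-in for the wrangled exam data, and the column list and `bins=5` are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the wrangled exam data (values are made up).
sample = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "exam2": [90, 93, 83, 80, 90, 70, 65, 70, 60, 62],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

fig = plt.figure(figsize=(16, 3))

# List of columns to plot (assumed names).
cols = ["exam1", "exam2", "final_grade"]

for i, col in enumerate(cols):
    # i starts at 0, but subplot positions start at 1.
    plot_number = i + 1

    # Create one subplot per column, all in a single row.
    plt.subplot(1, len(cols), plot_number)

    # Title with column name.
    plt.title(col)

    # Display histogram for column on the current subplot.
    sample[col].hist(bins=5)

    # Hide gridlines.
    plt.grid(False)
```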
    

<div class="alert alert-block alert-success">

### Takeaways:

</div>

___
## Split Data
##### Before we explore bi- and multi-variate relationships, we ***must*** split our data to avoid leakage of unseen data.
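
A pandas-only sketch of a train/validate/test split. The 60/20/20 proportions, the seed, and the made-up data are all assumptions; scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd

# Hypothetical stand-in for the wrangled exam data (values are made up).
sample = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

# Hold out 20% of the rows as the test set (split sizes and seed are assumptions).
test = sample.sample(frac=0.2, random_state=123)
remaining = sample.drop(test.index)

# Split what is left into validate (25% of the remainder) and train.
validate = remaining.sample(frac=0.25, random_state=123)
train = remaining.drop(validate.index)
```

From here on, exploration would use only `train`, so that nothing learned from `validate` or `test` shapes the analysis.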

___
## Goal

Let's keep our goal from our student grades scenario in mind here.
> I'm a university professor hoping I can build a prediction model that will be able to use these exams to predict the final grade within 5 points average per student.

Since my target variable, `final_grade`, is continuous, this is a regression problem. It's important to remember that multiple linear regression makes several key assumptions:

- There must be a linear relationship between the outcome variable and the independent variables.  *Scatterplots can show whether there is a linear or curvilinear relationship.*

- No Multicollinearity: Multiple regression assumes that the independent variables are not highly correlated with each other.

- Multivariate Normality: Multiple regression assumes that the residuals are normally distributed.


# Hypotheses:
## Q1. Is there a relationship between `exam1` and `final_grade`? 
- both of my variables are continuous
- check for **correlation**

- ${H_0}$: There is no linear correlation between exam1 and final_grade
- ${H_a}$: There is a linear correlation between exam1 and final_grade

<br>



#### Visualize
##### `sns.heatmap()`

Let's look at a heatmap of the correlation coefficients for a dataset.

1. Determine whether the variables are normally distributed
- if normal, use Pearson's method
- if not, use Spearman's method  
2. Calculate the correlation coefficient for each pair of variables
- use pandas `.corr()`
- it defaults to `method='pearson'`
- can change to `method='spearman'`
3. Use the correlation coefficients to generate the heatmap
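
Steps 1 and 2 can be sketched as follows. The data is randomly generated rather than the lesson's real grades, and the 0.05 significance level is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Hypothetical training data (randomly generated, not the lesson's real grades).
rng = np.random.default_rng(123)
exam1 = rng.integers(55, 101, size=60)
train = pd.DataFrame({
    "exam1": exam1,
    "final_grade": np.clip(exam1 + rng.integers(-5, 6, size=60), 0, 100),
})

alpha = 0.05  # assumed significance level

# Step 1: Shapiro-Wilk tests the null hypothesis that a column is normally distributed.
all_normal = all(shapiro(train[col]).pvalue > alpha for col in train.columns)

# Step 2: choose the correlation method based on the result.
method = "pearson" if all_normal else "spearman"
corr_matrix = train.corr(method=method)
```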


In [None]:
# We already determined that all of the columns were NOT normally distributed.

# create the correlation matrix using pandas .corr()


In [None]:
# pass my correlation matrix to Seaborn's heatmap


In [None]:
# Upper triangle of an array


In [None]:
# Lower triangle of an array
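
A small illustration of `np.triu` and `np.tril`, using a made-up 3x3 array in place of a correlation matrix.

```python
import numpy as np

# A small 3x3 array standing in for a correlation matrix.
matrix = np.arange(1, 10).reshape(3, 3)

# Upper triangle: entries below the main diagonal are zeroed out.
upper = np.triu(matrix)

# Lower triangle: entries above the main diagonal are zeroed out.
lower = np.tril(matrix)

# A boolean mask of the upper triangle (diagonal included), which can be
# used to hide the redundant half of a symmetric correlation matrix.
mask = np.triu(np.ones_like(matrix, dtype=bool))
```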


In [None]:
# pass my correlation matrix to Seaborn's heatmap with customization


In [None]:
# pass my correlation matrix to Seaborn's heatmap with more customization! 
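
One possible customized heatmap, sketched on made-up data. The color map and the fixed color scale are illustrative choices, not settings the lesson prescribes.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "exam2": [90, 93, 83, 80, 90, 70, 65, 70, 60, 62],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

corr_matrix = train.corr(method="spearman")

# Hide the upper triangle, which mirrors the lower one (this also hides the diagonal).
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

ax = sns.heatmap(
    corr_matrix,
    mask=mask,        # cells where mask is True are not drawn
    annot=True,       # write each coefficient in its cell
    cmap="Purples",   # assumed color map; any matplotlib colormap name works
    vmin=-1, vmax=1,  # fix the color scale to the full correlation range
)
```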


#### Hypothesis Testing

In [None]:
# Since my variables are not normally distributed, 
# use scipy stats function spearmanr to calculate correlation and p-value 
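
A sketch of the test on made-up data, assuming a significance level of 0.05. Note that Spearman's coefficient measures monotonic rather than strictly linear association.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

alpha = 0.05  # assumed significance level

# spearmanr returns the rank correlation coefficient and a two-sided p-value.
r, p = spearmanr(train["exam1"], train["final_grade"])

# Reject the null hypothesis of no correlation when p falls below alpha.
reject_null = p < alpha
```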


<div class="alert alert-block alert-success">

##### Heatmap Takeaways

    
</div>

___
## What other visualizations could we have used?

### `sns.relplot()`

Let's do a simple scatter plot of two continuous variables in our dataset.

### `sns.lmplot()`

Let's make that simple scatter plot but add a regression line.
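
A minimal `lmplot` sketch on made-up data. The shaded band seaborn draws around the fitted line is a confidence interval for the regression estimate (95% by default).

```python
import pandas as pd
import seaborn as sns

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

# lmplot draws the scatter points, a fitted regression line, and a shaded
# confidence band around that line. It returns a FacetGrid.
grid = sns.lmplot(data=train, x="exam1", y="final_grade")
```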

<div class="alert alert-block alert-info">
    
##### Confidence Interval: 
This is the range of values you expect your estimate to fall between if you redo your test, within a certain level of confidence. 
    
    
Confidence, in statistics, is another way to describe probability.
</div>

### `sns.jointplot()`

Let's use `sns.jointplot()` with `kind='reg'` to view the individual distributions of x and y alongside a scatter plot with a regression line.

### `sns.pairplot()`

Let's use `sns.pairplot()` to view scatter plots for every pair of numeric columns in our dataset at once, along with the distribution of each individual column.

<div class="alert alert-block alert-success">
    
##### Takeaways:

</div>

___
## Q2: Is there a cutoff grade that makes sense to investigate? Passing/failing/letter grades?

In [None]:
#number of people who failed each test
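
A sketch of the count on made-up data. The passing cutoff of 70 is an assumption, since the lesson leaves the choice open.

```python
import pandas as pd

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "exam2": [90, 93, 83, 80, 90, 70, 65, 70, 60, 62],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

passing_grade = 70  # assumed cutoff

# Comparing the whole frame to the cutoff gives booleans;
# summing counts the True values (failures) per column.
failed_counts = (train < passing_grade).sum()
```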


### Make categorical values for further exploration

In [None]:
#assign fail and pass for each test
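
One way to build the categorical columns, sketched with `np.where` on made-up data. The cutoff of 70 and the `_status` column suffix are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "exam2": [90, 93, 83, 80, 90, 70, 65, 70, 60, 62],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

passing_grade = 70  # assumed cutoff

for col in ["exam1", "exam2", "final_grade"]:
    # New column such as exam1_status holding the label 'pass' or 'fail'.
    train[f"{col}_status"] = np.where(train[col] < passing_grade, "fail", "pass")
```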


### What's the relationship between passing `exam1` and the `final_grade`?

In [None]:
#seaborn histplot 


#set line for passing level


In [None]:
#mean final_grade by exam1 pass/fail status
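
A `groupby` sketch on made-up data, again assuming a cutoff of 70 and a hypothetical `exam1_status` column.

```python
import numpy as np
import pandas as pd

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

passing_grade = 70  # assumed cutoff
train["exam1_status"] = np.where(train["exam1"] < passing_grade, "fail", "pass")

# Average final grade within each exam1 pass/fail group.
mean_by_status = train.groupby("exam1_status")["final_grade"].mean()
```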


### What percentage of students failed `exam1` and the `final_grade`?

### Of the students who failed `exam1`, how many also had a failing `final_grade`?

In [None]:
#create subset of people who failed exam1


In [None]:
#how many failed final


In [None]:
#Percentage of students who failed final
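
The three steps above (subset, count, percentage) can be sketched on made-up data like this, with an assumed cutoff of 70.

```python
import pandas as pd

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

passing_grade = 70  # assumed cutoff

# Subset of students who failed exam1.
failed_exam1 = train[train["exam1"] < passing_grade]

# How many of them also failed the final.
failed_final_count = (failed_exam1["final_grade"] < passing_grade).sum()

# Percentage of exam1 failures who also failed the final.
failed_final_pct = failed_final_count / len(failed_exam1) * 100
```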


### Of the students who failed `exam2`, how many also had a failing `final_grade`?

In [None]:
#create subset of people who failed exam2


In [None]:
#how many failed final


In [None]:
#percentage who failed final


### Of the students who failed both `exam1` and `exam2`, how many also had a failing `final_grade`?

In [None]:
#create subset of people who failed both exam1 and exam2


In [None]:
#how many failed final


In [None]:
#percentage who failed final
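
For the combined condition, a sketch on made-up data: two boolean masks are joined with `&`, and the mean of a boolean Series gives the proportion of `True` values. The cutoff of 70 is an assumption.

```python
import pandas as pd

# Hypothetical training data (values are made up).
train = pd.DataFrame({
    "exam1": [100, 98, 85, 83, 93, 79, 70, 62, 58, 57],
    "exam2": [90, 93, 83, 80, 90, 70, 65, 70, 60, 62],
    "final_grade": [96, 95, 87, 86, 96, 85, 72, 70, 68, 65],
})

passing_grade = 70  # assumed cutoff

# Each condition needs its own parentheses when combined with &.
failed_both = train[(train["exam1"] < passing_grade) & (train["exam2"] < passing_grade)]

# Share of those students who also failed the final, as a percentage.
failed_final_pct = (failed_both["final_grade"] < passing_grade).mean() * 100
```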


<div class="alert alert-block alert-success">
    
##### Takeaways:
    
    
</div>

## Further Reading

- [Visualization with Seaborn Demos](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html)
- <https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15>
- <https://www.itl.nist.gov/div898/handbook/index.htm>
- <https://adataanalyst.com/data-analysis-resources/visualise-categorical-variables-in-python/>
- [Boxplot vs. Violin example](https://matplotlib.org/3.2.1/gallery/statistics/boxplot_vs_violin.html)
- <https://datavizcatalogue.com/>