# CODE ALONG: FROM DISTRIBUTIONS TO HYPOTHESES

#### Learning Objectives

    - To be able to use the probability density function to calculate the probability of specific values
    - To identify normally distributed features
    - To perform a hypothesis test to compare numeric data between 2 groups

In [1]:
#imports
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_context('talk')
mpl.rcParams['figure.figsize']=[12,6]

## Exploring Distributions

Dataset:  https://archive.ics.uci.edu/ml/datasets/student+performance

In [None]:
# Setting max columns
pd.set_option('display.max_columns', 100)

In [5]:
## read in Data/student/student-mat.csv (it uses ";" as the sep)
## Note: pointing read_csv at the downloaded zip raises a ValueError because
## the archive contains multiple files, so read the extracted CSV instead
df = pd.read_csv("Data/student/student-mat.csv", sep=';')

# display info and .head
df.info()
df.head()

In [None]:
# Calculate an Avg Grade column by averaging G1, G2, G3,
# then divide by 20 and multiply by 100 (to convert to a percentage)
df["Avg Grade"] = df[["G1", "G2", "G3"]].mean(axis=1) / 20 * 100
df

In [None]:
#plot the distribution of Avg Grade
sns.histplot(data=df,x="Avg Grade",kde=True);

In [None]:
# use scipy's normaltest to determine if Avg Grade is normally distributed
# p-value > .05 -> consistent with a normal distribution
stats.normaltest(df["Avg Grade"])


    - We have our p-value for our normaltest, but what does it mean?
        - Check the docstring for the normaltest to find out the null hypothesis of the test
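As a minimal sketch of how to read that result, the snippet below runs `stats.normaltest` on two synthetic samples (generated here purely for illustration, not taken from the student data). The helper name `interpret_normaltest` and the alpha of .05 are assumptions for this example. The null hypothesis of the test is that the sample comes from a normal distribution, so a small p-value is evidence against normality.

```python
import numpy as np
from scipy import stats


def interpret_normaltest(sample, alpha=0.05):
    """Hypothetical helper: run normaltest and describe the outcome."""
    result = stats.normaltest(sample)
    # Null hypothesis: the sample was drawn from a normal distribution.
    if result.pvalue < alpha:
        return "reject normality"
    return "fail to reject normality"


# Synthetic samples for illustration only (not the student data).
rng = np.random.default_rng(0)
bell_shaped = rng.normal(loc=50, scale=10, size=500)
right_skewed = rng.exponential(scale=10, size=500)

print("bell-shaped :", interpret_normaltest(bell_shaped))
print("right-skewed:", interpret_normaltest(right_skewed))
```

Note that failing to reject the null does not prove the data is normal; it only means the test found no strong evidence against it.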

### Calculating Probabilities with Scipy's Probability Density Function

In [None]:
# Get the mean, std, min, and max for the Avg Grade column
#could use df["Avg Grade"].describe()
dist_stats = df['Avg Grade'].agg(['mean', 'std', 'min', 'max'])
dist_stats

In [None]:
# Generate a linearly-spaced array of values that span the min to the max ....
xs=np.linspace(dist_stats.loc['min'], dist_stats.loc['max'])
xs

In [None]:
# use stats.norm.pdf to get the PDF curve that corresponds to your ...
pdf = stats.norm.pdf(xs,loc=dist_stats.loc['mean'], scale=dist_stats.loc["std"])
pdf

In [None]:
# plot the histogram again (as a density) and then plot the pdf we calculated
sns.histplot(data=df,x='Avg Grade',stat="density")
plt.plot(xs,pdf,color='red',label="PDF")
plt.legend();

    - Looks pretty normal! But can we confirm that it's normal?

In [None]:
# WATCH VIDEO FOR CODE TO PROVE IT NORMAL

### Q1 What is the probability of a student getting a score of 90 or above?

In [None]:
# Plot the histogram (as a density) and the pdf again
sns.histplot(data=df,x='Avg Grade',stat='density')
plt.plot(xs,pdf,color='red',label="PDF")

#Add a vspan to the plot showing the region we want to calc prob for
plt.axvspan(90,100,alpha=.6, color='orange',zorder=0)

plt.legend();

    - How can we calculate this probability? Can we use the PDF?

In [None]:
# Try making a list of values from 90-100 and getting the pdf values
above_90=list(range(90,101))
above_90_pdf=stats.norm.pdf(above_90,loc=dist_stats.loc["mean"], scale=dist_stats.loc["std"])

#Sum the values to get the total probability
above_90_pdf.sum()

    - What is the flaw in this approach?
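One way to see the flaw is to sum PDF heights over the same interval with two different grid spacings. The sketch below uses a hypothetical mean and standard deviation (not the values from `dist_stats`) and a hypothetical helper `summed_pdf`. PDF values are densities, not probabilities, so their sum changes with the number of points chosen, while the area under the curve does not.

```python
import numpy as np
from scipy import stats

# Hypothetical mean and standard deviation, for illustration only.
mean, std = 50.0, 20.0


def summed_pdf(n_points):
    """Sum PDF heights at n_points evenly spaced values from 90 to 100."""
    grid = np.linspace(90, 100, n_points)
    return stats.norm.pdf(grid, loc=mean, scale=std).sum()


coarse = summed_pdf(11)  # step of 1.0, like list(range(90, 101))
fine = summed_pdf(21)    # step of 0.5 over the same interval

# Area under the curve between 90 and 100, taken from the CDF.
area = (stats.norm.cdf(100, loc=mean, scale=std)
        - stats.norm.cdf(90, loc=mean, scale=std))

print(f"sum with step 1.0: {coarse:.4f}")
print(f"sum with step 0.5: {fine:.4f}")
print(f"area from the CDF: {area:.4f}")
```

Halving the step roughly doubles the sum, even though the interval and the distribution are unchanged.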

In [None]:
# use the cumulative distribution function (CDF) to find prob of 90 or lower
p_less_90=stats.norm.cdf(90,loc=dist_stats.loc["mean"], scale=dist_stats.loc["std"])
p_less_90

    - Now we want the opposite probability: the probability of being GREATER than 90

In [None]:
#Calc 1-prob of 90 or lower
1-p_less_90

    - Answer: there is a 2.4% chance of having a score greater than 90
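As an aside, scipy also exposes the upper-tail probability directly through the survival function, `stats.norm.sf`, which is defined as 1 - CDF. The sketch below uses a hypothetical mean and standard deviation rather than the values in `dist_stats`, so its numbers will not match the answer above.

```python
import numpy as np
from scipy import stats

# Hypothetical mean and standard deviation, for illustration only.
mean, std = 50.0, 20.0

# Upper-tail probability computed two equivalent ways.
p_above_via_cdf = 1 - stats.norm.cdf(90, loc=mean, scale=std)
p_above_via_sf = stats.norm.sf(90, loc=mean, scale=std)

print(f"1 - cdf(90): {p_above_via_cdf:.4f}")
print(f"sf(90):      {p_above_via_sf:.4f}")
```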

## Hypothesis Testing

### State The Hypothesis
    - (Null Hypothesis): Students with internet access have the same average grades as students who do not.
    - (Alternative Hypothesis): Students with internet access have significantly different average grades compared to students who do not.

### Visualize and Separate Groups
    - Visualize the histogram of Avg Grade again, but separate it into groups based on the "internet" column.
    - Note: when comparing 2 groups with seaborn's histplot, you will want to add common_norm=False

In [None]:
df["internet"].value_counts(normalize=True)

In [None]:
sns.countplot(data=df,x='internet');

In [None]:
# visualize the histogram of Avg Grade again but separate it by "internet"
sns.histplot(data=df,x="Avg Grade",hue='internet', common_norm=False,
            stat='density',kde=True);

In [None]:
# Plot a bar plot of the Avg Grade for the students with internet vs. those without...
sns.barplot(data=df,x='internet',y='Avg Grade');

In [None]:
# Separate the 2 groups into 2 variables
grp_yes=df.loc[df['internet']=='yes','Avg Grade']
grp_no=df.loc[df['internet']=='no','Avg Grade']

display(grp_yes.head(), grp_no.head())

#### T-Test Assumptions
    - Since we are comparing a numeric measurement between 2 groups, we want to run a 2-sample T-test (AKA independent T-test).

    - The Assumptions are:

        - No significant outliers
        - Normality
        - Equal Variance

#### Assumption: No Sig. Outliers

In [None]:
# check yes group for outliers using z-score >3 rule
outliers_yes = np.abs(stats.zscore(grp_yes))>3
outliers_yes.sum()

In [None]:
# check no group for outliers using z-score >3 rule
outliers_no=np.abs(stats.zscore(grp_no))>3
outliers_no.sum()

    - No outliers to worry about!  Assumption met.

#### Assumption: Normally Distributed Groups

In [None]:
# Use normaltest to check if yes group is normally distributed
stats.normaltest(grp_yes)

In [None]:
# Use normaltest to check if no group is normally distributed
stats.normaltest(grp_no)

    - Did we meet the assumption of normality?

#### Assumption: Equal Variance

In [None]:
# Use Levene's test to check if groups have equal variance
stats.levene(grp_yes,grp_no)

    - Did we meet the assumption of equal variance?
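To answer that question, it helps to recall that the null hypothesis of Levene's test is that the groups have equal variances. The sketch below applies the same alpha of .05 to two synthetic groups (generated for illustration only, with deliberately different spreads); the variable names are assumptions for this example.

```python
import numpy as np
from scipy import stats

# Synthetic groups for illustration only (not the student data).
rng = np.random.default_rng(1)
narrow = rng.normal(loc=50, scale=5, size=300)
wide = rng.normal(loc=50, scale=20, size=300)

alpha = 0.05
result = stats.levene(narrow, wide)

# Null hypothesis: both groups have the same variance.
# A p-value below alpha means the equal-variance assumption is NOT met.
assumption_met = bool(result.pvalue >= alpha)

print(f"statistic={result.statistic:.2f}, p={result.pvalue:.4g}")
print("equal variance assumption met:", assumption_met)
```

For the student data, the same comparison would be made with the p-value returned by `stats.levene(grp_yes, grp_no)`.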

#### Perform Final Hypothesis Test (T-Test)

    - Since we met all of the assumptions for the test we can proceed with our t-test.
    - Next class we will discuss what we would do if we did NOT meet the assumptions.

In [None]:
# run stats.ttest_ind on the 2 groups
results=stats.ttest_ind(grp_yes,grp_no)
results

    - What is our p-value? Is it less than our alpha of .05? What does this mean?

    - Our T-Test returned a p-value of .041. Since p<.05, we can reject the null hypothesis that students with internet access have the same average grades as students who do not.

We therefore support the alternative hypothesis that there is a significant difference in Average Grades between students who do/do not have internet access. Our visualization below shows that students with internet access have HIGHER average grades.
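The decision rule described above can be sketched as follows, using two synthetic groups (generated for illustration only, not the student data) whose true means differ. The variable names are assumptions for this example. A two-sided t-test only indicates that the means differ, so the group means are compared separately to see which one is higher.

```python
import numpy as np
from scipy import stats

# Synthetic groups for illustration only (not the student data).
rng = np.random.default_rng(2)
group_a = rng.normal(loc=55, scale=10, size=300)
group_b = rng.normal(loc=50, scale=10, size=300)

alpha = 0.05
result = stats.ttest_ind(group_a, group_b)

# Reject the null hypothesis of equal means when p < alpha.
reject_null = bool(result.pvalue < alpha)

print(f"t={result.statistic:.2f}, p={result.pvalue:.4g}")
print("reject null" if reject_null else "fail to reject null")

# The test itself does not say which group is higher, so compare the means.
print(f"mean A={group_a.mean():.1f}, mean B={group_b.mean():.1f}")
```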

In [None]:
## Add a summary visual to support our results.
# errorbar=('ci', 68) requires seaborn >= 0.12; older versions use ci=68
sns.barplot(data=df, x='internet', y='Avg Grade', errorbar=('ci', 68));

# Challenge Q: What is the probability of a student getting a score less than 30?

In [None]:
## Plot the histogram again AND pdf again
sns.histplot(data=df, x='Avg Grade', stat='density')
plt.plot(xs,pdf,color='red', label='PDF')

## Add a vspan to the plot showing the region we want to calc prob for
plt.axvspan(0,30,alpha=0.6,color='orange',zorder=0)

plt.legend();


    - How can we calculate this probability? Can we use the PDF?

In [None]:
## Try making a list of values from 0-30 and getting the pdf values
less_30 = list(range(0,31))
less_30_pdf = stats.norm.pdf(less_30, loc=dist_stats.loc['mean'], scale=dist_stats.loc['std'])

## Sum the values to get the total probability. 
less_30_pdf.sum()

In [None]:
## Use the cumulative distribution function (CDF) to find prob of 30 OR lower.
p_less_30 = stats.norm.cdf(30, loc=dist_stats.loc['mean'], scale=dist_stats.loc['std'])
p_less_30


    - Answer: there is a 1% chance of having a score less than 30.
