# Hypothesis Testing

![alt text](https://miro.medium.com/max/910/1*4c72kKs77I7nJmGtYq3dkw.png)

## One Sample Significance Tests



The purpose of One Sample Significance Tests is to check if a sample of observations could have been generated by a process with a specific mean or proportion.

Some questions that can be answered by one sample significance tests are:
* Is there equal representation of men and women in a particular industry?
* Is the normal human body temperature 98.6 F?

We will try and apply this test to a few real world problems in this notebook.

The Suicide dataset was obtained from Kaggle courtesy Rajanand Illangovan. You can download it here: https://www.kaggle.com/rajanand/suicides-in-india

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats

### Analyzing Suicides in India by Gender

Are men as likely to commit suicide as women?

This is the question we will attempt to answer in this section. To do so, we will use suicide statistics shared by the National Crime Records Bureau (NCRB), Govt of India. To perform this analysis, we need to know the sex ratio in India. The Census 2011 report states that there are 940 females for every 1000 males in India.

Let p denote the fraction of women in India.

In [0]:
p = 940/(940+1000)
p

If there is no correlation between gender and suicide, then the sex ratio of people committing suicides should closely reflect that of the general population. 

Let us now get our data into a Pandas dataframe for analysis.

In [0]:
import pandas as pd
url='https://raw.githubusercontent.com/SankBad/GraduateSpecialistRutgers/master/suicides.csv'
df = pd.read_csv(url, sep=",")  # sep="," reads comma-separated values
df.head()

In [0]:
df.shape

In [0]:
df['Gender'].value_counts()

We can see that the number of female suicides is slightly lower than the number of male suicides. There are also fewer females than males in the general population. How do we determine whether females are as likely to commit suicide as males? This can be answered through hypothesis testing.

#### Step 1: Formulate the hypothesis and decide on confidence level

The null hypothesis, as stated in the slides, is the default state. Therefore, I will state my null and alternate hypotheses as follows.

* **Null Hypothesis (H0)**: Men and women are equally likely to commit suicide.
* **Alternate Hypothesis (H1)**: Men and women are not equally likely to commit suicide.

If the null hypothesis is true, the fraction of women among those committing suicide would be the same as the fraction of women in the general population. We now need to use a suitable statistical test to find out if this is indeed the case.

Our statistical test will generate a p-value, which has to be compared to a significance level ($\alpha$). If the p-value is less than $\alpha$, then a result at least as extreme as the one observed would be very unlikely if the null hypothesis were true, and we would be reasonable in rejecting the null hypothesis. Conversely, if the p-value is higher than $\alpha$, we will not be in a position to reject the null hypothesis.

Let us assume, $\alpha$ = 0.05
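The decision rule described above can be sketched as a small helper. The function name `decide` and its return strings are illustrative assumptions, not part of the original analysis.

```python
def decide(p_value, alpha=0.05):
    """Apply the significance-level decision rule to a p-value (hypothetical helper)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

# Illustrative p-values, not results from the dataset
print(decide(0.003))
print(decide(0.27))
```

Note that the second outcome is worded as "fail to reject" rather than "accept": a large p-value signals a lack of evidence against the null hypothesis, not proof that it is true.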

#### Step 2: Decide on the Statistical Test

We will be using the One Sample Z-Test here. 
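For reference, the test statistic for a one sample Z-test on a proportion is usually written as below, where $\hat{p}$ is the observed sample proportion, $p_0$ is the proportion under the null hypothesis, and $n$ is the sample size. These symbols are notation introduced here for illustration; they correspond to `h1_prop`, `h0_prop`, and `len(df)` in the cells of Step 3.

$$
z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}}
$$

The denominator is the standard error of the sample proportion, computed under the assumption that the null hypothesis is true.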

#### Step 3: Compute the p-value

In [0]:
h0_prop = p
h0_prop

In [0]:
h1_prop = df['Gender'].value_counts()['Female']/len(df)
h1_prop

In [0]:
sigma_prop = np.sqrt((h0_prop * (1 - h0_prop))/len(df))
sigma_prop

In [0]:
z = (h1_prop - h0_prop)/sigma_prop
z

In [0]:
def pvalue(z):
    # Two-tailed p-value; abs(z) keeps the result valid when z is negative
    return 2 * (1 - stats.norm.cdf(abs(z)))

In [0]:
p_val = pvalue(z)
p_val

The p value is so small that Python has effectively rounded it to zero.
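When `z` is large, `1 - stats.norm.cdf(z)` loses precision because the CDF rounds to exactly 1.0 in floating point. One possible workaround, sketched here with an illustrative value of `z` rather than the value from the dataset, is SciPy's survival function `stats.norm.sf`, which evaluates the upper tail directly.

```python
from scipy import stats

z_example = 20.0  # illustrative value, not the z computed from the dataset

# Subtracting from 1 underflows: the CDF is indistinguishable from 1.0
p_naive = 2 * (1 - stats.norm.cdf(z_example))

# The survival function evaluates the upper tail directly
p_sf = 2 * stats.norm.sf(abs(z_example))

print(p_naive, p_sf)
```

Either way the conclusion of the test is the same; the survival function only matters when the actual magnitude of a very small p-value is of interest.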

#### Step 4: Comparison and Decision

The p-value obtained is far lower than our significance level $\alpha$. We can thus reject the null hypothesis and accept the alternate hypothesis (since it is the negation of the null hypothesis).

**Men and women are not equally likely to commit suicide.**

Note that this test says nothing about whether men are more likely than women to commit suicide or vice versa. It only states that they are not equally likely. The reader is encouraged to form their own hypothesis tests to check these results.
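As a starting point for such follow-up tests, a one-sided version of the proportion test can be sketched as below. The function name `proportion_ztest_1samp`, its `alternative` parameter, and the counts in the usage example are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from scipy import stats

def proportion_ztest_1samp(successes, n, p0, alternative="two-sided"):
    """One sample z-test for a proportion (hypothetical helper).

    alternative: "two-sided", "greater" (true proportion > p0),
    or "less" (true proportion < p0).
    """
    p_hat = successes / n
    se = np.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se
    if alternative == "greater":
        p_value = stats.norm.sf(z)
    elif alternative == "less":
        p_value = stats.norm.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * stats.norm.sf(abs(z))
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value

# Made-up counts: 450 women among 1000 cases, tested against p0 = 940/1940
z_stat, p_less = proportion_ztest_1samp(450, 1000, 940 / 1940, alternative="less")
print(z_stat, p_less)
```

With `alternative="less"`, a small p-value would suggest that the true proportion is below `p0`; the two-sided p-value is twice the smaller of the two one-sided values.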

### Analyzing the average heights of NBA Players

I was interested in knowing the average height of NBA players. A quick Google search tells me that the average height of players between 1985 and 2006 was **6'7"** or 200.66 cm. Is this still the case?

To answer this question, we will be using the NBA Players Stats - 2014-2015 dataset on Kaggle courtesy DrGuillermo. The dataset can be downloaded here: https://www.kaggle.com/drgilermo/nba-players-stats-20142015

In [0]:
from google.colab import files
files.upload()

In [0]:
df2 = pd.read_csv('players_stats.csv')
df2.head()

In [0]:
df2.shape

#### Hypothesis Testing

The One Sample Significance Test for a mean is extremely similar to that for a proportion. We will go through an almost identical process.

The hypotheses are defined as follows:
* **Null Hypothesis**: The average height of an NBA player is 200.66 cm.
* **Alternate Hypothesis**: The average height of an NBA player is not 200.66 cm.

The significance level $\alpha$ is set to 0.05. We proceed assuming the null hypothesis is true.

In [0]:
h0_mean = 200.66

In [0]:
h1_mean = df2['Height'].mean()
h1_mean

In [0]:
sigma = df2['Height'].std()/np.sqrt(len(df2))
sigma

In [0]:
z = (h1_mean - h0_mean)/sigma
z

In [0]:
p_val = (1 - stats.norm.cdf(abs(z))) * 2
p_val

The p-value obtained is much lower than the significance level $\alpha$. We therefore reject the null hypothesis and accept the alternate hypothesis (its negation), which leads to the following conclusion:

**The average height of NBA Players is NOT 6'7"**.
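One way to sanity-check a hand-computed statistic like the one above is SciPy's built-in `stats.ttest_1samp`, which uses the t distribution instead of the normal; for reasonably large samples the two should give very similar p-values. The sketch below uses synthetic heights drawn from a normal distribution (the mean of 198 cm, standard deviation of 9 cm, and sample size of 400 are made-up assumptions), not the Kaggle data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights = rng.normal(loc=198.0, scale=9.0, size=400)  # synthetic data, assumption

h0_mean = 200.66

# Manual z-test, mirroring the cells above (ddof=1 matches the pandas default)
se = heights.std(ddof=1) / np.sqrt(len(heights))
z = (heights.mean() - h0_mean) / se
p_z = 2 * stats.norm.sf(abs(z))

# Built-in one sample t-test
t_stat, p_t = stats.ttest_1samp(heights, popmean=h0_mean)

print(z, p_z)
print(t_stat, p_t)
```

The statistic is identical in both cases; only the reference distribution differs, so the t-based p-value is never smaller than the normal-based one.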

## Two Sample Significance Tests

In the last section, we saw how one sample significance tests could be used to test if the proportion or the mean of a certain feature of a population is equal to a predefined proportion or mean respectively. In other words, we were comparing a sample with a predefined value.

Two sample significance tests, on the other hand, allow us to compare two different populations and check if there is any meaningful difference in their means or proportions. The steps involved and the tools used are almost identical to the one sample significance test, with one critical difference: the quantity being tested is the difference between the means or proportions of the two populations, and under the null hypothesis this difference is assumed to be zero.

Using two sample significance tests, we can answer questions such as:
* Is there racial discrimination when it comes to recruitment for white collar jobs?
* Is there a pay gap between men and women in the industry? Are women, on average, paid less?
* Do some universities engage in conscious racial discrimination? That is, are they more inclined to accept a student of a particular race as compared to another?

### Analyzing Literacy Rates

In this section, we will compare the literacy rates in the major cities of Punjab and the NCT of Delhi and determine whether there is any meaningful difference between the two.

To answer this question, we will be using the 'Top 500 Indian Cities' dataset made available on Kaggle courtesy Arijit Mukherjee. The dataset can be found here: https://www.kaggle.com/zed9941/top-500-indian-cities

In [0]:
from google.colab import files
files.upload()

In [0]:
df3 = pd.read_csv('cities.csv')
df3.head()

In [0]:
df3['state_name'].value_counts()

In [0]:
punjab = df3[df3['state_name'] == 'PUNJAB']['effective_literacy_rate_total']
delhi = df3[df3['state_name'] == 'NCT OF DELHI']['effective_literacy_rate_total']

In [0]:
punjab_mean = punjab.mean()
punjab_std = punjab.std()

punjab_mean, punjab_std

In [0]:
delhi_mean = delhi.mean()
delhi_std = delhi.std()

delhi_mean, delhi_std

From the above calculations, it can be seen that the mean and the standard deviations of Punjab and Delhi literacy rates differ slightly. The next step is to determine if this difference is a statistically significant one.

For hypothesis testing, the following are defined:

* **Null Hypothesis:** The true mean literacy rates for Punjab and Delhi are the same.
* **Alternate Hypothesis:** The true mean literacy rates for Punjab and Delhi are not the same.

The significance level $\alpha$ is set to 0.05, and we proceed assuming the null hypothesis is true.


![alt text](https://www.statsdirect.co.uk/help/generatedimages/equations/equation168.svg)

In [0]:
h0_mean = 0
mean_diff = delhi_mean - punjab_mean
sigma_diff = np.sqrt((delhi_std**2)/len(delhi)  + (punjab_std**2)/len(punjab))
mean_diff, sigma_diff

Since we are dealing with sample sizes less than 30, using the t-statistic will be more appropriate. To use Student's t-distribution, though, we need to calculate the degrees of freedom. This is done as follows:

In [0]:
deg = (((delhi_std**2)/len(delhi)  + (punjab_std**2)/len(punjab)) ** 2) / ((((delhi_std**2)/len(delhi))**2)/(len(delhi)-1)  + (((punjab_std**2)/len(punjab))**2)/(len(punjab) - 1))
deg
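The expression in the cell above is the Welch–Satterthwaite approximation for the degrees of freedom. Written out, with $s_1, s_2$ the sample standard deviations and $n_1, n_2$ the sample sizes of the two groups (notation introduced here for readability), it reads:

$$
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

The result is generally not an integer, which is fine: the t-distribution is defined for fractional degrees of freedom.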

In [0]:
z = (mean_diff - h0_mean) / sigma_diff
z

In [0]:
p = 2 * (1 - stats.t.cdf(abs(z), deg))  # abs() keeps the two-tailed p-value valid when the statistic is negative
p

The value of p obtained here is much higher than the significance level $\alpha$. Therefore, we cannot reject the null hypothesis. Note that failing to reject it is not the same as proving it true.

**We do not have sufficient evidence to conclude that the true mean literacy rates for Punjab and Delhi differ.**
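The manual calculation above corresponds to Welch's unequal-variance t-test, which SciPy exposes as `stats.ttest_ind` with `equal_var=False`. The sketch below compares the two approaches on small made-up samples (the literacy values are illustrative assumptions, not taken from the cities dataset).

```python
import numpy as np
from scipy import stats

# Made-up literacy rates (percent) for two small groups of cities
group_a = np.array([86.2, 88.1, 84.7, 90.3, 87.5])
group_b = np.array([83.9, 85.4, 82.0, 86.8, 84.1, 81.7, 85.0])

# Manual Welch statistic and degrees of freedom, mirroring the cells above
va = group_a.var(ddof=1) / len(group_a)
vb = group_b.var(ddof=1) / len(group_b)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va**2 / (len(group_a) - 1) + vb**2 / (len(group_b) - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Built-in Welch's t-test
t_builtin, p_builtin = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_manual, p_manual)
print(t_builtin, p_builtin)
```

Using the built-in function on the real `delhi` and `punjab` series would be a reasonable cross-check of the hand-computed result.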