## **Hypothesis Testing**


#### Hypothesis testing is a technique for **evaluating a theory using data.**
#### The hypothesis is the Researcher’s **initial belief about the situation before the study.**

#### The commonly accepted fact is known as the null hypothesis, while its opposite is the alternative hypothesis. The researcher’s task is to check whether the data give enough evidence to reject, nullify, or disprove the null hypothesis. In fact, the word “null” implies zero effect: it is the commonly accepted position that researchers work to nullify.

#### For example, if we consider a study about cell phones and cancer risk, we might have the following hypothesis:

#### **Null hypothesis: “Cell phones have no effect on cancer risk.”**
#### **Alternative hypothesis (the one under investigation): “Cell phones affect the risk of cancer.”**

#### The hypothesis that there is **no difference** between things is called the **null hypothesis.**

>#### There are **three types of tests, and the phrasing of the alternative hypothesis** determines which type we should use.
#### If we are checking for a **difference compared to a hypothesized value**, we look for **extreme values in either tail and perform a two-tailed test**. 
#### If the **alternative hypothesis uses language like "less" or "fewer", we perform a left-tailed test**
#### **Words like "greater" or "exceeds" correspond to a right-tailed test.** 

>### **Statistical Test**
#### The output of a statistical test is a **decision to reject or fail to reject the null hypothesis.**
#### A statistical test needs 3 things:
#### **1. Data**
#### **2. A null (primary) hypothesis to reject or fail to reject.** (Cell phones have no effect on cancer risk)
#### **3. An alternative hypothesis.** (Cell phones affect the risk of cancer)

## **Standard Error**

#### The standard deviation of a sample statistic, such as the sample mean, is called the **standard error**.

>### **Standard Deviation** vs **Standard Error**

|Standard Deviation|Standard Error|
|------------------------|--------------------------------------|
|The standard deviation quantifies the variation within a set of measurements.|The standard error quantifies the variation in the means of multiple sets of measurements.|
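
#### The difference in the table can be illustrated with simulated data. This is a minimal sketch: the normal population (mean 50, standard deviation 10), the sample size, and the seed are all made-up assumptions, not values from the shipments dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only

n = 100  # size of each sample (hypothetical)

# One sample of n measurements from a hypothetical population
sample = rng.normal(loc=50, scale=10, size=n)

# Standard deviation: spread of the individual measurements within one sample
sample_std = np.std(sample, ddof=1)

# Standard error of the mean, estimated from that single sample: s / sqrt(n)
se_formula = sample_std / np.sqrt(n)

# Standard error measured directly: spread of the means of many samples
sample_means = rng.normal(loc=50, scale=10, size=(2000, n)).mean(axis=1)
se_empirical = np.std(sample_means, ddof=1)

print(sample_std, se_formula, se_empirical)
```

#### Under these assumptions the standard deviation should come out near 10, while both standard error estimates should come out near 1, which is 10 divided by the square root of 100.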

In [2]:
import numpy as np
import pandas as pd

# Reading data
late_shipments = pd.read_feather('data/late_shipments.feather', columns=None, use_threads=True)

# Print the late_shipments dataset
display(late_shipments.head())

# Calculate the proportion of late shipments
late_prop_samp = (late_shipments['late']=='Yes').mean()

# Print the results
# print(late_prop_samp)
# print(late_shipments)



Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2996.0,266644.0,89.0,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83
1,30998.0,Botswana,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,25.0,800.0,32.0,1.6,"Trinity Biotech, Plc",Yes,10.0,559.89,reasonable,1.72
2,69871.0,Vietnam,PMO - US,Direct Drop,EXW,Air,0.0,No,ARV,Adult,...,22925.0,110040.0,4.8,0.08,Hetero Unit III Hyderabad IN,Yes,3723.0,19056.13,expensive,181.57
3,17648.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,152535.0,361507.95,2.37,0.04,"Aurobindo Unit III, India",Yes,7698.0,11372.23,expensive,779.41
4,5647.0,Uganda,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test - Ancillary,...,850.0,8.5,0.01,0.0,Inverness Japan,Yes,56.0,360.0,reasonable,0.01


>### What is **Z-Score**?

#### A **Z-score** is a numerical measurement that **describes a value's relationship to the mean** of a group of values. **Z-score is measured in terms of standard deviations from the mean.**
#### **If a Z-score is 0, it indicates that the data point's score is identical to the mean score.**
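
#### As a small illustration, the z-score of each value is the value minus the mean, divided by the standard deviation. The data below are made up for the example.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

mean = values.mean()  # 5.0
std = values.std()    # population standard deviation (ddof=0), 2.0 here

# Number of standard deviations each value lies from the mean
z_scores = (values - mean) / std
print(z_scores)
```

#### The two values equal to the mean (5.0) get a z-score of 0, and the value 9.0 lies two standard deviations above the mean.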

>### **Bootstrap Function** for bootstrapping a mean value.

In [5]:
def bootstrap(x, nboot, operation):
    """Return a bootstrap sampling distribution of a statistic.
    
    Args:
        x (array-like): the sample to resample from.
        nboot (int): number of bootstrap samples.
        operation (callable): statistic computed on each resample, e.g. np.mean.
    
    Returns:
        np.ndarray: the statistic computed on each of the nboot resamples.
        
    """
    # Convert x to a NumPy array so that x[index] accepts an array of positions.
    # This lets us sample with replacement.
    x = np.array(x)
    
    late_shipments_boot_distn = []
    for i in range(nboot):
        index = np.random.randint(0, len(x), len(x))
        samples = x[index]
        late_shipments_boot_distn.append(operation(samples))
        
    return np.array(late_shipments_boot_distn)

    

In [6]:
# Call the bootstrap function and assign the result to late_shipments_boot_distn
late_shipments_boot_distn = bootstrap(late_shipments['late']=='Yes', 5000, np.mean)

# Hypothesize that the proportion is 6%
late_prop_hyp = 0.06

# Calculate the standard error
std_error = np.std(late_shipments_boot_distn, ddof=1)

# Find z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp) / std_error

# Print z_score
print(z_score)

0.12953717414434007


## **P-value**
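#### The p-value **quantifies how rare** our result would be if the null hypothesis were true.

>#### **The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.**
>#### The **p-value** quantifies how compatible the data are with the null hypothesis: the smaller the p-value, the stronger the evidence against the null hypothesis.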

#### **Large p-values** mean our statistic is producing a **result that is likely not in a tail of our null distribution,** and chance could be a good explanation for the result. **Small p-values** mean our statistic is producing a **result likely in the tail of our null distribution.** 
### **Because p-values are probabilities, they are always between zero and one.**

#### **Large p-values** --> Fail to reject the null hypothesis.
#### **Small p-values** --> Reject the null hypothesis.

## **Calculating the P-value**

In [4]:
from scipy.stats import norm

# Calculate the z-score of late_prop_samp
z_score = (late_prop_samp-late_prop_hyp)/std_error

# Calculate the p-value
p_value = 1-norm.cdf(z_score)
                 
# Print the p-value
print(p_value) 

0.44762358697230376
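
#### The calculation above, `1-norm.cdf(z_score)`, is the right-tailed case. The sketch below shows how the p-value would change with the type of alternative hypothesis, assuming a standard normal null distribution. The helper name `p_value_from_z` and its `alternative` labels are hypothetical, chosen only for this illustration.

```python
from scipy.stats import norm

def p_value_from_z(z_score, alternative):
    """Return the p-value of a z-score for the given alternative hypothesis."""
    if alternative == "two-sided":
        # Extreme values in either tail count as evidence
        return 2 * norm.sf(abs(z_score))
    if alternative == "less":
        # Left-tailed: area below the z-score
        return norm.cdf(z_score)
    if alternative == "greater":
        # Right-tailed: area above the z-score (same as 1 - norm.cdf)
        return norm.sf(z_score)
    raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")

z = 1.5  # hypothetical z-score
for alt in ("two-sided", "less", "greater"):
    print(alt, p_value_from_z(z, alt))
```

#### `norm.sf` is the survival function, equal to `1 - norm.cdf`, so the "greater" branch matches the cell above.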


### **Significance level**
#### The **cutoff point is known as the significance level**, and is **denoted alpha.** The appropriate significance level depends on the dataset and the discipline worked in.
#### **Five percent is the most common choice,** but **ten percent** and **one percent** are also popular. 
#### The significance level gives us a decision process for which hypothesis to support. 
>#### If the **p-value** is less than or equal to alpha, we **reject the null hypothesis.** Otherwise, we fail to reject it.
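
#### The decision rule can be written as a small function. This is only a sketch: the function name `decide` is hypothetical, and the example p-value is the rounded result of the late shipments calculation earlier in this notebook.

```python
def decide(p_value, alpha=0.05):
    """Apply the significance-level decision rule to a p-value."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# p-value from the late shipments example, tested at the 5% level
print(decide(0.4476, alpha=0.05))
```

#### Because 0.4476 is greater than 0.05, the rule fails to reject the null hypothesis that 6% of shipments are late.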

### **Types of Errors**
#### Consider a criminal trial as an analogy: there are two possible truth states and two possible test outcomes, amounting to four combinations. Two of these indicate that the verdict was correct.

#### **If the defendant didn't commit the crime, but the verdict was guilty, they are wrongfully convicted. If the defendant committed the crime, but the verdict was not guilty, they got away with it.** These are both errors in justice.
#### Similarly, for **hypothesis testing,** there are two ways to get it right, and two types of error. If we support the alternative hypothesis when the null hypothesis was correct, we made a false positive error. If we support the null hypothesis when the alternative hypothesis was correct, we made a false negative error. 
>#### These errors are sometimes known as **type one and type two errors, respectively.**
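
#### Both error types can be estimated by simulation. The sketch below repeats a two-sided z-test many times, first when the null hypothesis is true and then when it is false. The sample size, the true mean of 0.3 in the second case, the known standard deviation of 1, and the seed are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
alpha = 0.05
n, n_trials = 50, 10_000

def two_sided_p_values(true_mean):
    """Simulate n_trials z-tests of H0: mean = 0 with known standard deviation 1."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_trials, n))
    z_scores = samples.mean(axis=1) / (1.0 / np.sqrt(n))
    return 2 * norm.sf(np.abs(z_scores))

# Null hypothesis is true (mean really is 0): every rejection is a false positive
false_positive_rate = np.mean(two_sided_p_values(0.0) <= alpha)

# Null hypothesis is false (mean is 0.3): every failure to reject is a false negative
false_negative_rate = np.mean(two_sided_p_values(0.3) > alpha)

print(false_positive_rate, false_negative_rate)
```

#### In this setup the false positive rate should land close to the significance level, while the false negative rate depends on the sample size and on how far the true mean is from the hypothesized one.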

### **Possible errors in our example**

#### In the case of data scientists coding as children, **if we had a p-value less than or equal to the significance level, and rejected the null hypothesis, it's possible we made a false positive error.** Although we thought data scientists started coding as children at a higher rate, it may not be true in the whole population. Conversely, **if the p-value was greater than the significance level, and we failed to reject the null hypothesis, it's possible we made a false negative error.**

### **Calculating a confidence interval.**

In [5]:
# Calculate a 90% confidence interval using the quantile method (5th and 95th percentiles).
# For a 95% interval, use the 0.025 and 0.975 quantiles instead.
lower = np.quantile(late_shipments_boot_distn, 0.05)
upper = np.quantile(late_shipments_boot_distn, 0.95)

# Print the confidence interval
print((lower, upper))

(0.049, 0.074)
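
#### For comparison, here is a self-contained sketch of a 95% bootstrap confidence interval computed two ways: with the quantile method and with the standard error method (point estimate plus or minus 1.96 standard errors). The lateness flags are simulated with an assumed 6% late rate, so the numbers are not those of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed, for reproducibility only

# Hypothetical data: True means the shipment was late (about 6% of 1000 shipments)
is_late = rng.random(1000) < 0.06

# Bootstrap distribution of the proportion of late shipments
boot_props = np.array([
    rng.choice(is_late, size=is_late.size, replace=True).mean()
    for _ in range(2000)
])

# Quantile method: the middle 95% of the bootstrap distribution
lower_q, upper_q = np.quantile(boot_props, [0.025, 0.975])

# Standard error method: point estimate +/- 1.96 bootstrap standard errors
point_estimate = is_late.mean()
std_error = np.std(boot_props, ddof=1)
lower_se = point_estimate - 1.96 * std_error
upper_se = point_estimate + 1.96 * std_error

print((lower_q, upper_q))
print((lower_se, upper_se))
```

#### When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals should be close to each other.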


![b](img/b.png)

## **Steps of Hypothesis Testing.**

![c](img/c.png)

In [6]:
# Reading the data
stack = pd.read_feather('data/stack_overflow.feather', columns=None, use_threads=True)
display(stack.head())

Unnamed: 0,respondent,main_branch,hobbyist,age,age_1st_code,age_first_code_cut,comp_freq,comp_total,converted_comp,country,...,survey_length,trans,undergrad_major,webframe_desire_next_year,webframe_worked_with,welcome_change,work_week_hrs,years_code,years_code_pro,age_cat
0,36.0,"I am not primarily a developer, but I write co...",Yes,34.0,30.0,adult,Yearly,60000.0,77556.0,United Kingdom,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Express;React.js,Express;React.js,Just as welcome now as I felt last year,40.0,4.0,3.0,At least 30
1,47.0,I am a developer by profession,Yes,53.0,10.0,child,Yearly,58000.0,74970.0,United Kingdom,...,Appropriate in length,No,"A natural science (such as biology, chemistry,...",Flask;Spring,Flask;Spring,Just as welcome now as I felt last year,40.0,43.0,28.0,At least 30
2,69.0,I am a developer by profession,Yes,25.0,12.0,child,Yearly,550000.0,594539.0,France,...,Too short,No,"Computer science, computer engineering, or sof...",Django;Flask,Django;Flask,Just as welcome now as I felt last year,40.0,13.0,3.0,Under 30
3,125.0,"I am not primarily a developer, but I write co...",Yes,41.0,30.0,adult,Monthly,200000.0,2000000.0,United States,...,Appropriate in length,No,,,,Just as welcome now as I felt last year,40.0,11.0,11.0,At least 30
4,147.0,"I am not primarily a developer, but I write co...",No,28.0,15.0,adult,Yearly,50000.0,37816.0,Canada,...,Appropriate in length,No,"Another engineering discipline (such as civil,...",,Express;Flask,Just as welcome now as I felt last year,40.0,5.0,3.0,Under 30


In [7]:
# Groupby Mean
xbar = stack.groupby('age_first_code_cut')['converted_comp'].mean()

# Groupby std
s = stack.groupby('age_first_code_cut')['converted_comp'].std()

# n count
n = stack.groupby('age_first_code_cut')['converted_comp'].count()


In [8]:
# Define xbar_yes, xbar_no, s_yes, s_no, n_yes and n_no.
# The numeric column being compared is not stated in this notebook;
# weight_kilograms is assumed here.
yes = late_shipments.loc[late_shipments['late'] == 'Yes', 'weight_kilograms']
no = late_shipments.loc[late_shipments['late'] == 'No', 'weight_kilograms']
xbar_yes = yes.mean()
s_yes = yes.std()
xbar_no = no.mean()
s_no = no.std()
n_yes = yes.count()
n_no = no.count()

# Calculate the numerator of the test statistic
numerator = xbar_yes - xbar_no

# Calculate the denominator of the test statistic
denominator = np.sqrt(s_yes**2/n_yes + s_no**2/n_no)

# Calculate the test statistic
t_stat = numerator/denominator

# Print the test statistic
print(t_stat)


## **T-Test**

>#### A t-test is an inferential statistic used to determine if there is a **significant difference between the means of two groups and how they are related.**

#### **We used an approximation for the test statistic standard error using sample information. Using this approximation adds more uncertainty and that's why this is a t instead of a z problem. The t distribution allows for more uncertainty when using multiple estimates in a single statistic calculation.** 
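
#### The two-sample test statistic described here can be cross-checked on simulated data. This is a sketch under stated assumptions: the two groups are made up, and `scipy.stats.ttest_ind` is called with `equal_var=False` (Welch's t-test), which uses the same statistic built from the two sample standard deviations.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only

# Two hypothetical groups of measurements
group_a = rng.normal(loc=10, scale=2, size=40)
group_b = rng.normal(loc=11, scale=3, size=60)

# Difference in sample means
numerator = group_a.mean() - group_b.mean()

# Standard error of that difference, estimated from the sample standard deviations
denominator = np.sqrt(
    group_a.std(ddof=1) ** 2 / group_a.size
    + group_b.std(ddof=1) ** 2 / group_b.size
)

t_manual = numerator / denominator

# Welch's t-test from SciPy for comparison
result = ttest_ind(group_a, group_b, equal_var=False)

print(t_manual, result.statistic)
```

#### The manual statistic and the SciPy statistic should agree up to floating-point error. The p-value SciPy reports is two-sided by default.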

### **Why is t needed?**

>#### When a sample standard deviation is used in estimating a standard error.

#### The normal distribution is a probability distribution that is **symmetric about the mean,** showing that **data near the mean are more frequent in occurrence than data far from the mean.**
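
#### The extra uncertainty of the t distribution shows up as heavier tails. The sketch below compares the probability of exceeding a cutoff of 2 under the standard normal and under t distributions with a few degrees of freedom. The cutoff and the degrees of freedom are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

cutoff = 2.0  # arbitrary cutoff

# Area above the cutoff under the standard normal distribution
normal_tail = norm.sf(cutoff)
print("normal:", normal_tail)

# Area above the same cutoff under t distributions
t_tails = {df: t.sf(cutoff, df=df) for df in (2, 10, 100)}
for df, tail in t_tails.items():
    print("t, df =", df, tail)
```

#### The t tail areas are larger than the normal one and shrink toward it as the degrees of freedom grow.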

## **From t to p-value**

In [9]:
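from scipy.stats import t

# Degrees of freedom for the two-sample test: total observations minus two
degrees_of_freedom = n_yes + n_no - 2

# Right-tailed p-value: area of the t distribution above t_stat
p_value = 1 - t.cdf(t_stat, df=degrees_of_freedom)
print(p_value)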
