# Hypothesis testing

Steps
1. Define the null hypothesis H0 and the alternative hypothesis H1

Eg:

H0 - the world is flat
H1 - the world is round

H0 - states that there is no relationship or dependency between the two or more variables.
H1 - contradicts H0 and claims that a relationship exists (this is the hypothesis we want to support).

The conventional approach is to assume that the null hypothesis is true and then check whether the data provide enough evidence to reject it in favour of the alternative hypothesis.

2. Choose the appropriate statistical test
    a. Correlation test - examines the relationship between two variables when both are numeric.
    b. t-test - one numeric variable and one categorical variable. Tests the difference in means between two groups of data. Suited to small sample sizes.
    c. z-test - one numeric variable and one categorical variable. Tests the difference in means between two groups of data. Suited to large sample sizes.
    d. ANOVA - one numeric variable and one categorical variable with three or more groups. Tests whether the group means differ.
    e. Chi-square test - tests the relationship between two categorical variables.

3. Calculate the p value (probability)

4. Determine the statistical significance
The p value is compared against the significance level to determine whether there is sufficient evidence to reject the null hypothesis.
The significance level is a predetermined threshold used to judge the result of the test; here it is 0.05.
p<0.05 => reject the null hypothesis
p>=0.05 => fail to reject the null hypothesis (often loosely called "accepting" it)
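
The comparison in step 4 can be written as a small helper. This is a minimal sketch: the function name `decide` and its return strings are illustrative choices and do not come from any library.

```python
def decide(p_value, alpha=0.05):
    """Return the decision for a test given its p value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    # A small p value means the observed data would be unlikely if H0 were true
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.02))  # p below alpha
print(decide(0.41))  # p above alpha
```

A result of "fail to reject H0" only says the data did not provide enough evidence against H0; it does not show that H0 is true.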


In [3]:
'''
Customer personality analysis dataset to demonstrate different types of statistical tests
'''

import pandas as pd

df = pd.read_csv('data/marketing_campaign.csv',sep='\t')

In [6]:
df.shape

(2240, 29)

In [7]:
df.columns

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')

In [11]:
df.sample(50)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
1224,2740,1958,PhD,Widow,33438.0,1,1,22-09-2013,81,62,...,5,0,0,0,0,0,0,3,11,0
62,1012,1952,Graduation,Single,61823.0,0,1,18-02-2013,26,523,...,7,0,0,0,0,0,0,3,11,0
929,10673,1976,PhD,Married,68397.0,0,1,17-11-2013,6,760,...,3,0,0,0,0,0,0,3,11,0
474,9081,1988,Graduation,Single,20518.0,1,0,18-05-2014,58,4,...,5,0,0,0,0,0,0,3,11,0
462,8594,1958,PhD,Widow,50520.0,0,1,28-01-2014,25,112,...,6,0,0,0,0,0,0,3,11,0
1149,10525,1986,Graduation,Single,26576.0,1,0,13-10-2012,40,10,...,9,1,0,0,0,0,0,3,11,1
811,5585,1972,Graduation,Single,21359.0,1,0,20-04-2013,1,12,...,8,0,0,0,0,0,0,3,11,1
1266,5207,1963,PhD,Married,53378.0,1,1,24-09-2012,41,489,...,8,0,0,0,0,0,0,3,11,1
377,4459,1989,Graduation,Single,30279.0,1,0,30-12-2012,13,10,...,8,0,0,0,0,0,0,3,11,0
2121,10067,1976,2n Cycle,Together,25176.0,1,1,10-08-2013,79,4,...,7,0,0,0,0,0,0,3,11,0


In [12]:
# Keep in mind - the statistical tests above are sensitive to sample size: with a large 
# sample even a very small difference tends to produce a small p value. We therefore 
# take a random sample of the data to test.

sample = df.sample(100,random_state=100) #fixed sample size of 100
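
The effect of sample size on the p value can be seen on synthetic data. The sketch below does not use the marketing dataset: it draws two normal groups whose true means differ by only 0.05 standard deviations (an assumed, deliberately tiny effect) and runs the same two-sample t-test on a small and a large sample. The helper name `p_for_sample_size` is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def p_for_sample_size(n, shift=0.05):
    """p value of a two-sample t-test on two synthetic groups of size n."""
    group_a = rng.normal(loc=0.0, scale=1.0, size=n)
    group_b = rng.normal(loc=shift, scale=1.0, size=n)  # tiny true difference
    return ttest_ind(group_a, group_b).pvalue


p_small = p_for_sample_size(100)
p_large = p_for_sample_size(200_000)

print('n=100     p value', p_small)
print('n=200000  p value', p_large)
```

With the large sample the same tiny difference is flagged as significant, which is why a p value is best read together with the size of the effect.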

In [15]:
# sample.dtypes

In [16]:
# A t-test is used when we want to test the relationship between a numerical variable and 
# a categorical variable
# Three main types of t-test:
# 1. One-sample t-test - tests the mean of one group against a constant value
# 2. Two-sample t-test - tests the means of two independent groups against each other
# 3. Paired t-test - tests the means of two related sets of measurements (e.g. the same group before and after)

# For example, we would like to test whether recency contributes to the prediction of response 
# (whether the customer accepted the offer in the last campaign)

sample['Response'].unique()


array([1, 0])
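
Of the three t-test types, the paired test is the one that compares two related sets of measurements. A minimal sketch with `scipy.stats.ttest_rel`, using made-up before/after values for the same eight customers (the numbers are hypothetical and not taken from the dataset):

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical measurements for the same 8 customers before and after a campaign
before = np.array([10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0])
after = before + np.array([1.0, 1.5, 0.5, 1.2, 0.8, 1.1, 0.9, 1.3])

# H0: the mean of the paired differences (after - before) is zero
t_stat, p_value = ttest_rel(after, before)

print('t statistic', t_stat)
print('p value', p_value)
```

Because each customer acts as their own control, the test works on the per-customer differences rather than on the two group means separately.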

In [17]:
sample['Recency'].unique()

array([14, 61, 46, 19, 51, 32, 55, 33, 65, 27,  3, 54, 13, 94, 43, 57,  4,
       30, 25, 78, 15, 87, 12, 47, 97, 74, 31, 89, 16, 53, 63, 50, 21, 11,
       10, 23,  8, 60,  6, 40, 69, 71, 73, 77, 45,  1, 24,  2, 62, 80, 84,
       42, 64, 76, 28, 52, 99, 35, 48, 83, 56])

In [19]:
# We need to split the dataset into customers who accepted the offer and 
# customers who rejected the offer

recency_p = sample[sample['Response']==1]['Recency']
recency_n = sample[sample['Response']==0]['Recency']


In [20]:
# Step 1
# H0: response is independent of recency
# H1: response is dependent on recency

from scipy.stats import ttest_ind

t_stat,p_value=ttest_ind(recency_p,recency_n) # passing in 2 samples

print('p value', p_value)

# Significance level is 0.05; p<0.05, so we reject the null hypothesis in favour of H1

p value 0.024822208644980654


In [22]:
# print(t_stat)

In [26]:
# We know that a t-test is used to compare one or two sample groups. 
# If we want to compare more than 2 sample groups we use ANOVA

# ANOVA examines the difference among groups by calculating the ratio of 
# variance across groups vs variance within a group

# Example: using the feature 'Kidhome' for the prediction of 'NumWebPurchases'

kidhome_0=sample[sample['Kidhome']==0]['NumWebPurchases']
kidhome_1=sample[sample['Kidhome']==1]['NumWebPurchases']
kidhome_2=sample[sample['Kidhome']==2]['NumWebPurchases']

# H0: NumWebPurchases is independent of Kidhome
# H1: NumWebPurchases is dependent on Kidhome

from scipy.stats import f_oneway

f_stat, p_value = f_oneway(kidhome_0,kidhome_1,kidhome_2) # f_oneway returns the F statistic

print('p value', p_value)

# Significance level is 0.05; p<0.05, so we reject the null hypothesis in favour of H1


p value 0.00039808004666969554
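
The ratio of variance across groups to variance within groups can be computed by hand and checked against `f_oneway`. This is a sketch on three small hypothetical groups (not the `Kidhome` groups from the dataset), assuming the usual one-way ANOVA definition of the F statistic.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical purchase counts for three groups (not taken from the dataset)
groups = [
    np.array([6.0, 5.0, 7.0, 8.0, 6.0]),
    np.array([3.0, 4.0, 2.0, 4.0, 3.0]),
    np.array([1.0, 2.0, 2.0, 1.0, 3.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of groups
n = all_values.size      # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_value = f_oneway(*groups)

print('F (manual)', f_manual)
print('F (scipy) ', f_scipy)
print('p value', p_value)
```

A large F means the group means are spread out much more than the scatter inside each group would explain.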


In [31]:
# Chi-square test is used to test the independence of categorical variables

# Example: Whether 'Education' and 'response' are independent

combine = pd.crosstab(sample['Education'],sample['Response'])
print(combine)

# H0: Response is independent of Education
# H1: Response is dependent on Education

from scipy.stats import chi2_contingency

result = chi2_contingency(combine)
# print(result)

p_value = result[1]

print('p value', p_value)

# p value > 0.05, so we fail to reject the null hypothesis
# Indicates that education may not be a strong predictor of response

Response     0  1
Education        
2n Cycle    13  3
Basic        2  0
Graduation  44  5
Master       9  4
PhD         17  3
p value 0.4129779495497867
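
`chi2_contingency` also returns the expected counts under independence, which are worth inspecting because the chi-square approximation is usually considered unreliable when many expected counts are small. The sketch below re-enters the counts from the table printed above; the threshold of 5 is a common rule of thumb and is an assumption of this sketch, not something the notebook states.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the Education x Response table printed above
# rows: 2n Cycle, Basic, Graduation, Master, PhD; columns: Response 0, 1
observed = np.array([
    [13, 3],
    [2, 0],
    [44, 5],
    [9, 4],
    [17, 3],
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print('p value', p_value)
print('degrees of freedom', dof)
print('expected counts')
print(expected.round(2))

# Rule of thumb (an assumption of this sketch): flag expected counts below 5
small_cells = int((expected < 5).sum())
print('cells with expected count < 5:', small_cells)
```

Several cells fall below the threshold here, so the p value for this 100-row sample should be read with caution; merging sparse categories or using a larger sample are common remedies.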


In [58]:
# Example 2: dataset for hypothesis testing

import seaborn as sns

df = sns.load_dataset('iris')

In [59]:
print(df)

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]


In [60]:
df.sample(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
39,5.1,3.4,1.5,0.2,setosa
147,6.5,3.0,5.2,2.0,virginica
6,4.6,3.4,1.4,0.3,setosa
44,5.1,3.8,1.9,0.4,setosa
80,5.5,2.4,3.8,1.1,versicolor
62,6.0,2.2,4.0,1.0,versicolor
12,4.8,3.0,1.4,0.1,setosa
61,5.9,3.0,4.2,1.5,versicolor
47,4.6,3.2,1.4,0.2,setosa
48,5.3,3.7,1.5,0.2,setosa


In [61]:
df['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [62]:
# Let's use petal_width to compare species for our chi-square test
# The first step is to check the summary statistics,
# then convert petal_width into a categorical variable

print(df['petal_width'].describe())

count    150.000000
mean       1.199333
std        0.762238
min        0.100000
25%        0.300000
50%        1.300000
75%        1.800000
max        2.500000
Name: petal_width, dtype: float64


In [63]:
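def petal_cat(row):
    # Split at the median (1.3): 0 = narrow petal, 1 = wide petal
    if row['petal_width'] <= 1.3:
        return 0
    elif row['petal_width'] > 1.3:
        return 1
    else:
        return 'Indifferent' # only reached if petal_width is NaN

df['petal_width_new'] = df.apply(petal_cat,axis=1)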


In [64]:
def species_cat(row):
    # Encode species as integers: virginica = 0, versicolor = 1, setosa = 2
    if row['species'] == 'virginica':
        return 0
    elif row['species'] == 'versicolor':
        return 1
    else:
        return 2

df['species'] = df.apply(species_cat,axis=1)

In [65]:
df_new = df.drop(columns=['sepal_length','sepal_width','petal_length','petal_width'])
df_new

Unnamed: 0,species,petal_width_new
0,2,0
1,2,0
2,2,0
3,2,0
4,2,0
...,...,...
145,0,1
146,0,1
147,0,1
148,0,1


In [66]:
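# NOTE: chi2_contingency expects a contingency table of counts, for example
# pd.crosstab(df_new['species'], df_new['petal_width_new']). Passing df_new directly,
# as done here, treats each of the 150 rows as a row of the table, so the p value
# printed below is not a valid test of independence between species and petal width.
result = chi2_contingency(df_new)
print('p value',result[1])

# Compare p to the significance level 0.05: here p>0.05, so we would fail to reject the null hypothesis

# An alternative way to decide whether to reject the null hypothesis is to start from the confidence level

# A 95% confidence level corresponds to a significance level of 1 - 0.95 = 0.05. It is not the probability that the null hypothesis is true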




p value 0.0972850525013659
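
For a chi-square test of independence, `chi2_contingency` is normally given a table of counts rather than the raw rows. The sketch below shows that pattern with `pd.crosstab` on a tiny made-up frame (`toy` and its values are hypothetical and are not the iris data); the column names mirror `df_new` only for readability.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative data only: 6 hypothetical flowers per species with a 0/1 width category
toy = pd.DataFrame({
    'species': ['setosa'] * 6 + ['versicolor'] * 6 + ['virginica'] * 6,
    'petal_width_new': [0] * 6 + [0, 0, 0, 1, 1, 1] + [1] * 6,
})

# Count how often each (species, width category) pair occurs
table = pd.crosstab(toy['species'], toy['petal_width_new'])
print(table)

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print('p value', p_value)
```

The table has one row per species and one column per width category, so the degrees of freedom are (3 - 1) x (2 - 1) = 2 regardless of how many flowers are measured.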


In [82]:
prob = 0.95
alpha = 1 - prob # here alpha is the significance level

print(alpha)

if result[1]<=alpha:
    print('reject H0')
else:
    print('accept H0')

0.050000000000000044
accept H0
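
The same decision can be reached by comparing the test statistic with a critical value taken from the chi-square distribution at the chosen confidence level. This is a sketch on a hypothetical 2 x 3 table of counts; `scipy.stats.chi2.ppf` gives the critical value, and both decisions should agree.

```python
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2 x 3 table of counts, used only for illustration
observed = [[20, 15, 5],
            [10, 15, 35]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

prob = 0.95
alpha = 1 - prob
critical = chi2.ppf(prob, dof)  # value the statistic must reach at this confidence level

print('statistic', chi2_stat, 'critical value', critical)

# Decision 1: compare the statistic with the critical value
reject_by_statistic = bool(chi2_stat >= critical)
# Decision 2: compare the p value with alpha
reject_by_p = bool(p_value <= alpha)

print('reject H0 (statistic):', reject_by_statistic)
print('reject H0 (p value):  ', reject_by_p)
```

The two comparisons are equivalent because the p value is the probability of a statistic at least as large as the observed one, and the critical value is the statistic whose p value equals alpha.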


In [91]:
# one sample t test
from scipy import stats

df_ = df.petal_width

print(stats.ttest_1samp(a=df_,popmean=1.199))

# H0: the mean petal width equals 1.199
# From the test we see p > alpha = 0.05, so we fail to reject the null hypothesis

Ttest_1sampResult(statistic=0.00535591859453653, pvalue=0.9957337798547192)
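
The statistic reported by `ttest_1samp` is the distance between the sample mean and the hypothesised mean, measured in standard errors. A sketch that recomputes it by hand on a few hypothetical values (not the iris column) and compares the result with SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical petal widths, used only to illustrate the formula
values = np.array([0.2, 0.4, 1.3, 1.5, 1.8, 2.1, 2.3, 1.0])
popmean = 1.2  # hypothesised population mean

n = values.size
sample_mean = values.mean()
sample_std = values.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t = (sample mean - hypothesised mean) / standard error of the mean
t_manual = (sample_mean - popmean) / (sample_std / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p value

result = stats.ttest_1samp(a=values, popmean=popmean)
print('manual', t_manual, p_manual)
print('scipy ', result.statistic, result.pvalue)
```

When the hypothesised mean is set almost equal to the sample mean, as in the cell above, the statistic is close to zero and the p value close to 1 by construction.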


In [96]:
# 2 sample t test

class1=df.petal_width
class2=df.sepal_length

# Requirement for performing this 2 sample t-test: the two independent groups we are sampling
# have equal variance. As a rule of thumb we treat the variances as equal when the ratio of the 
# higher to the lower is less than 4:1

import numpy as np

print(np.var(class1),np.var(class2))

ratio_check=np.var(class1)/np.var(class2)
print(ratio_check) # less than 4:1, thus the variances are considered to be the same

print(stats.ttest_ind(a=class1,b=class2,equal_var=True))

# H0 - The means of both samples are equal
# H1 - The means of both samples are not equal
# p < 0.05, so we reject H0: the two means differ

0.5771328888888888 0.6811222222222223
0.8473264710200484
Ttest_indResult(statistic=-50.53601510214771, pvalue=3.401977346814745e-148)
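
The variance check can be folded into the choice of test. The sketch below always divides the higher variance by the lower one and, when the 4:1 rule of thumb fails, falls back to Welch's t-test, which is what `ttest_ind` runs with `equal_var=False`. The two groups are hypothetical values chosen to have clearly different spreads.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with different sizes and clearly different spreads
group_a = np.array([5.1, 4.9, 5.0, 5.2, 4.8, 5.0])
group_b = np.array([3.0, 7.5, 1.2, 9.8, 4.4, 6.1, 0.5, 8.9, 2.7, 5.6])

var_a = np.var(group_a, ddof=1)  # sample variances
var_b = np.var(group_b, ddof=1)
ratio = max(var_a, var_b) / min(var_a, var_b)  # always higher / lower
print('variance ratio', ratio)

# Use the pooled test only when the rule of thumb holds, otherwise Welch's test
equal_var = bool(ratio < 4)
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=equal_var)
print('equal_var =', equal_var)
print(result)
```

Welch's test does not pool the two variances, so it remains usable when the groups differ in spread or in size.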
