### HYPOTHESIS TESTING


_Definition: a statistical procedure for deciding whether sample data drawn from a population provide enough evidence against an assumption (hypothesis) about that population_

_Uses: widely used for before-and-after comparisons. For example, to establish whether patients' condition improves after drug A is administered, the same measurement is taken before and after treatment and the two sets of results are compared_

### Steps in hypothesis testing

1. DEFINING THE NULL AND ALTERNATIVE HYPOTHESIS

* You need to define the ``null hypothesis``, denoted by $H_0$. According to the null hypothesis, there is no difference between groups and no correlation between variables.

* You also need to define the ``alternative hypothesis``, denoted by $H_A$. This is the competing answer to your research question: it asserts that there is an effect in the population, such as a statistically significant difference between groups or a relationship between variables.

We will explore this using sample data from the Kaggle dataset '[Marketing Campaign csv](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis)'.

In [48]:
import pandas as pd
import numpy as np


In [49]:
# read the data set (the file is tab-separated, hence sep='\t')

df = pd.read_csv('marketing_campaign.csv', sep='\t')

In [50]:
# draw a random sample of 100 rows (random_state makes the draw reproducible)
sample_df = df.sample(n=100, random_state=100)


* I would like to explore the relationship between two columns in our data: ``Recency`` (number of days since the customer's last purchase) and ``Response`` (whether the customer accepted the offer in the last campaign). So for my hypothesis:

* $H_0$ : _there is no difference in Recency between customers who accepted the offer in the last campaign and customers who did not_

* $H_A$ : _customers who accepted the offer have lower Recency than customers who did not accept the offer_
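
The same pair of hypotheses can be written more formally. The symbols $\mu_a$ and $\mu_r$ are introduced here for illustration only and stand for the mean Recency of customers who accepted and rejected the offer; this assumes the comparison is made on group means, which is what a t-test does:

$$
H_0 : \mu_a = \mu_r
\qquad \text{versus} \qquad
H_A : \mu_a < \mu_r
$$

Because $H_A$ names a direction (lower, not just different), this is a one-sided alternative.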

2. CHOOSING THE RIGHT TEST

The most commonly used tests are summarized below:
* **ANOVA Test** : compare the mean of a numeric variable across more than two groups of a categorical variable
* **T-Test** : compare the mean of a numeric variable between two groups, typically with a small sample size
* **Z-Test** : compare the mean of a numeric variable between two groups with a large sample size
* **Chi-squared Test** : examine the relationship between two categorical variables
* **Correlation Test** : examine the relationship between two numeric variables

Here is a summary diagram of the above tests:

![Alt text](chart.png)
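
The selection rules listed above can also be sketched as a small helper. The function name `suggest_test` and its parameters are hypothetical, made up for this illustration; it only encodes the rules of thumb from the list and is not a substitute for checking each test's assumptions.

```python
def suggest_test(var_a, var_b, n_groups=None, large_sample=False):
    """Suggest a test from the summary list (hypothetical helper).

    var_a, var_b : 'numeric' or 'categorical'
    n_groups     : number of categories, needed when one variable is
                   categorical and the other numeric
    large_sample : rule-of-thumb switch between t-test and Z-test
    """
    kinds = sorted([var_a, var_b])
    if kinds == ['numeric', 'numeric']:
        return 'Correlation test'
    if kinds == ['categorical', 'categorical']:
        return 'Chi-squared test'
    if kinds == ['categorical', 'numeric']:
        if n_groups is None or n_groups < 2:
            raise ValueError('n_groups must be at least 2')
        if n_groups > 2:
            return 'ANOVA'
        return 'Z-test' if large_sample else 't-test'
    raise ValueError("variable types must be 'numeric' or 'categorical'")


# The three comparisons used in this notebook
print(suggest_test('numeric', 'categorical', n_groups=2))  # Recency vs Response
print(suggest_test('numeric', 'categorical', n_groups=3))  # NumWebPurchases vs Kidhome
print(suggest_test('categorical', 'categorical'))          # Education vs Response
```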


### T-test

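For this hypothesis, the t-test is the most appropriate choice because we are comparing a numeric variable (``Recency``) between two groups (``Response`` = 1 and ``Response`` = 0). There are three types of t-test: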


* **one sample t-test:** test the mean of one group against a constant value
* **two sample t-test:** test the difference of means between two groups
* **paired sample t-test:** test the difference of means between two measurements of the same subject
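
As a rough illustration, the three variants map onto three SciPy functions. The arrays below are synthetic data generated only for this sketch (they are not taken from the marketing dataset), and the names `before`, `after` and `other_group` are made up for the example.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# synthetic measurements, for illustration only
before = rng.normal(loc=50, scale=10, size=30)        # one group, first measurement
after = before - rng.normal(loc=3, scale=5, size=30)  # same subjects, second measurement
other_group = rng.normal(loc=55, scale=10, size=40)   # an independent second group

# one sample t-test: mean of one group against the constant 50
res_one = ttest_1samp(before, popmean=50)

# two sample t-test: difference of means between two independent groups
res_two = ttest_ind(before, other_group)

# paired sample t-test: two measurements of the same subjects
res_paired = ttest_rel(before, after)

for name, res in [('one sample', res_one),
                  ('two sample', res_two),
                  ('paired', res_paired)]:
    print(f'{name}: t-stat: {res.statistic:.2f}, p-value: {res.pvalue:.5f}')
```

The paired test is equivalent to a one sample test on the per-subject differences, which is why it requires both arrays to have the same length and order.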

The two sample t-test will be used, so we need to create two samples: the first is the ``Recency`` of customers who accepted the offer, and the second is the ``Recency`` of customers who rejected the offer.

In [51]:
# Recency of customers who accepted the offer

recency_a = sample_df[sample_df['Response'] == 1]['Recency']



In [52]:
# Recency of customers who rejected the offer

recency_r = sample_df[sample_df['Response'] == 0]['Recency']


3. CALCULATING THE P-VALUE

In [53]:
# two sample t-test (ttest_ind is two-sided by default)

from scipy.stats import ttest_ind

t_stat, pvalue = ttest_ind(recency_a, recency_r)
print(f't-stat: {t_stat:.2f}, \np-value: {pvalue:.5f}')


t-stat: -2.28, 
p-value: 0.02482


4. DETERMINING THE STATISTICAL SIGNIFICANCE

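The p-value from above is compared against a significance level (alpha) of ``0.05``. Since ``0.02482`` is smaller than alpha, the result is statistically significant and we can reject the null hypothesis. Note that ``ttest_ind`` runs a two-sided test by default; the negative t-statistic is consistent with the direction stated in $H_A$ (lower Recency among customers who accepted the offer). This suggests that ``Recency`` and ``Response`` are related, although a significant p-value by itself does not show how strong that relationship is.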
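
Since $H_A$ is directional, a one-sided test is a natural variation. The sketch below uses synthetic data (not the marketing sample) and assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` parameter. It also shows Welch's version (`equal_var=False`), which drops the equal-variance assumption; the array names `accepted` and `rejected` are made up for the example.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# synthetic stand-ins for the two Recency samples, for illustration only
accepted = rng.normal(loc=40, scale=25, size=15)
rejected = rng.normal(loc=50, scale=25, size=85)

# two-sided test (the default): H_A is "the means differ"
two_sided = ttest_ind(accepted, rejected)

# one-sided test: H_A is "mean of the first sample is less than the second"
one_sided = ttest_ind(accepted, rejected, alternative='less')

# Welch's one-sided t-test: does not assume equal variances in the two groups
welch = ttest_ind(accepted, rejected, equal_var=False, alternative='less')

alpha = 0.05
for label, res in [('two-sided', two_sided),
                   ('one-sided', one_sided),
                   ('Welch one-sided', welch)]:
    decision = 'reject H0' if res.pvalue < alpha else 'fail to reject H0'
    print(f'{label}: t-stat: {res.statistic:.2f}, p-value: {res.pvalue:.5f} -> {decision}')
```

When the t-statistic is negative, the one-sided p-value for `alternative='less'` is half of the two-sided p-value; when it is positive, it is one minus that half.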

### ANOVA Test

ANOVA is used to compare the means of more than two samples. More precisely, its F-statistic is the ratio of the variance between the group means to the variance within the groups.

For example, take the feature ``Kidhome`` against the feature ``NumWebPurchases`` (number of web purchases). Kidhome is categorical and has 3 distinct values, as seen below.

In [54]:
sample_df.Kidhome.value_counts()

0    55
1    42
2     3
Name: Kidhome, dtype: int64

From these we can get 3 samples

In [55]:
kidhome_0 = sample_df[sample_df['Kidhome']==0]['NumWebPurchases']
kidhome_1 = sample_df[sample_df['Kidhome']==1]['NumWebPurchases']
kidhome_2 = sample_df[sample_df['Kidhome']==2]['NumWebPurchases']

**We can proceed to state our null and alternative hypotheses**

$H_0$ : _there is no difference in mean NumWebPurchases among the three groups_

$H_A$ : _there is a difference between at least two groups_

Calculate the p-value

In [56]:
from scipy import stats

# one-way ANOVA across the three Kidhome groups

f_stat, pvalue = stats.f_oneway(kidhome_0, kidhome_1, kidhome_2)

print(f'f-stat: {f_stat:.2f}, \np-value: {pvalue:.5f}')



f-stat: 8.50, 
p-value: 0.00040


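We can infer that there is a difference in ``NumWebPurchases`` between at least two groups and reject the null hypothesis, since the p-value of ``0.0004`` is less than the alpha of ``0.05``. Keep in mind that the ``Kidhome`` = 2 group contains only 3 observations, so this result should be read with caution.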
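
To make the variance ratio behind the F-statistic concrete, here is a minimal sketch that computes it by hand and compares it with `f_oneway`. The three small arrays are invented for illustration and are not the Kidhome samples.

```python
import numpy as np
from scipy.stats import f, f_oneway

# three made-up groups, for illustration only
groups = [
    np.array([4.0, 6.0, 5.0, 7.0, 8.0]),
    np.array([2.0, 3.0, 4.0, 3.0, 2.0]),
    np.array([1.0, 2.0, 1.0, 3.0]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # between-group mean square
ms_within = ss_within / (n_total - k)    # within-group mean square

f_manual = ms_between / ms_within
p_manual = f.sf(f_manual, k - 1, n_total - k)  # upper tail of the F distribution

f_scipy, p_scipy = f_oneway(*groups)

print(f'manual  f-stat: {f_manual:.4f}, p-value: {p_manual:.5f}')
print(f'scipy   f-stat: {f_scipy:.4f}, p-value: {p_scipy:.5f}')
```

A large F-statistic means the group means are spread out far more than the scatter inside each group would explain, which is what leads to a small p-value.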

### Chi-Squared Test

* Used to determine whether there is a significant association between two categorical variables.
* If two categorical variables are independent, the distribution of one variable should be similar across all categories of the other.

For this test we will look at two categorical columns from our sample data, ``Education`` and ``Response``.

If these two variables are completely independent of each other (the null hypothesis is true), then the proportions of positive and negative Response should be roughly the same across all Education groups.

**We can proceed to define our hypotheses**

$H_0$ : “Education” and “Response” are independent of each other.

$H_A$ : “Education” and “Response” are not independent of each other.

In [57]:
# pd.crosstab builds a contingency table (cross-tabulation) of the two categorical variables:
# each cell counts the rows that fall into one combination of Education and Response

ed_contingency = pd.crosstab(sample_df['Education'], sample_df['Response'])

ed_contingency

| Education | Response = 0 | Response = 1 |
|---|---|---|
| 2n Cycle | 13 | 3 |
| Basic | 2 | 0 |
| Graduation | 44 | 5 |
| Master | 9 | 4 |
| PhD | 17 | 3 |


In [58]:
# import the test function

from scipy.stats import chi2_contingency

# run the chi-squared test of independence on the contingency table

chi2_stat, pvalue, dof, exp = chi2_contingency(ed_contingency)

print(f'p-value: {pvalue:.5f}')

p-value: 0.41298


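The p-value is ``0.41``, which is greater than the alpha of ``0.05``, so the result is not statistically significant. Therefore, we fail to reject the null hypothesis that these two categorical variables are independent. Note that several cells in the table contain very small counts (for example, only 2 customers in the ``Basic`` group), which weakens the chi-squared approximation.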
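
The idea that independence implies equal proportions across groups can be checked by hand. The sketch below copies the counts from the contingency table above into an array, builds the expected counts under independence, and compares the result with `chi2_contingency`; the variable names are chosen for this example only.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# counts copied from the contingency table above
# rows: 2n Cycle, Basic, Graduation, Master, PhD; columns: Response 0, Response 1
observed = np.array([
    [13, 3],
    [2, 0],
    [44, 5],
    [9, 4],
    [17, 3],
])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (5, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# expected count per cell if Education and Response were independent
expected = row_totals * col_totals / n

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof_manual)  # upper tail of the chi-squared distribution

chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)

print(f'manual chi2: {chi2_manual:.3f}, dof: {dof_manual}, p-value: {p_manual:.5f}')
print(f'scipy  chi2: {chi2_scipy:.3f}, dof: {dof_scipy}, p-value: {p_scipy:.5f}')
print(f'cells with expected count below 5: {(expected < 5).sum()} of {expected.size}')
```

Counting the cells whose expected count falls below 5 is a common rule-of-thumb check on whether the chi-squared approximation is trustworthy for a given table.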