## Import Libraries & Load Data Set

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
data = pd.read_csv('sales_rural_urban.csv')
data.head()


Unnamed: 0,zip,rural_urban,order_id,status,item_id,qty_ordered,price,value,discount_amount,discount_percent,...,middle_initial,last_name,gender,age,e_mail,place_name,county,city,state,region
0,2889,urban,100354687,order_refunded,574790,2,2900.0,2900.0,0.0,0.0,...,K,Dewald,M,52,bernard.dewald@hotmail.co.uk,Warwick,Kent,Warwick,RI,Northeast
1,2889,urban,100356322,canceled,577804,2,1000.0,1000.0,0.0,0.0,...,K,Dewald,M,52,bernard.dewald@hotmail.co.uk,Warwick,Kent,Warwick,RI,Northeast
2,8076,urban,100354701,complete,574814,4,34.5,103.5,0.0,0.0,...,J,Stough,M,60,florentino.stough@gmail.com,Riverton,Burlington,Riverton,NJ,Northeast
3,3907,rural,100354729,order_refunded,574882,3,54.0,108.0,0.0,0.0,...,B,Cyphers,M,67,horacio.cyphers@hotmail.co.uk,Ogunquit,York,Ogunquit,ME,Northeast
4,5846,rural,100354759,received,574938,2,45.5,45.5,0.0,0.0,...,D,Fleet,F,37,fran.fleet@ibm.com,Island Pond,Essex,Island Pond,VT,Northeast


## Set The Hypothesis

Before we run any tests, we need to be clear about what the hypotheses represent:

- H0 (null hypothesis) - There is no difference

- H1 (alternative hypothesis) - There is a significant difference in the specified direction

The specified direction here refers to the nature of a **One-Tailed** Test.

Either:
- Upper (Right-Tailed) Test: H1 states that the true value is greater than the value assumed under H0

or

- Lower (Left-Tailed) Test: H1 states that the true value is less than the value assumed under H0
    
--    
    
A **Two-Tailed** Test checks for a significant difference in either direction (upper or lower), so the significance level is split equally between the *upper* and *lower* tails.

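e.g.:

------
***Significance level = 0.05***

**One-Tailed** \
Upper - reject H0 if the test statistic falls in the top ***0.05*** of its distribution under H0 \
Lower - reject H0 if the test statistic falls in the bottom ***0.05*** of its distribution under H0

**Two-Tailed** \
Reject H0 if the test statistic falls in the top ***0.025*** or the bottom ***0.025*** of its distribution under H0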
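To make the split of the significance level concrete, here is a minimal sketch that computes the critical t-values for each case. The degrees of freedom (29, i.e. a sample of 30) are an assumption chosen for illustration only.

```python
import scipy.stats as st

alpha = 0.05  # significance level
df = 29       # degrees of freedom for a sample of 30 (assumed for illustration)

# One-tailed (upper): the whole alpha sits in the right tail
upper_crit = st.t.ppf(1 - alpha, df)

# One-tailed (lower): the whole alpha sits in the left tail
lower_crit = st.t.ppf(alpha, df)

# Two-tailed: alpha is split, 0.025 in each tail
two_tailed_crit = st.t.ppf(1 - alpha / 2, df)

print(f"Upper one-tailed: reject H0 if t > {upper_crit:.3f}")
print(f"Lower one-tailed: reject H0 if t < {lower_crit:.3f}")
print(f"Two-tailed:       reject H0 if |t| > {two_tailed_crit:.3f}")
```

The two-tailed cutoff is further from zero than the one-tailed cutoff, because each tail only gets half of the significance level.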


## Choose Significance / Confidence Level

This is not something you have to program, but:

- Typically, non-health-related tests use a significance level of **0.05** (a 95% confidence level)
- More critical hypotheses use a stricter significance level, closer to **0.02**

## Sample

We only need to sample a single array (one column of values) here, since the test statistic is computed directly from it.

In [None]:
# Sample 30 passengers from 1st class with a fixed random state.
# This returns a Series of 30 fares (assumes `data` has 'Pclass' and 'Fare' columns).
c1_sample = data[data['Pclass']==1]['Fare'].sample(30, random_state = 42)
c1_sample.head()
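The sampling step can be tried without the original file. This is only a sketch: `demo` is a hypothetical, synthetic stand-in that reuses the `Pclass` and `Fare` column names from the cell above, and its values are randomly generated.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in table: column names mirror the cell above, values are synthetic
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=500),  # classes 1, 2, 3
    'Fare': rng.gamma(shape=2.0, scale=20.0, size=500).round(2),
})

# Keep only 1st class rows, take the 'Fare' column, then draw 30 values
first_class_fares = demo.loc[demo['Pclass'] == 1, 'Fare']
demo_sample = first_class_fares.sample(30, random_state=42)

# The same random_state reproduces the same draw
repeat_sample = first_class_fares.sample(30, random_state=42)
print(demo_sample.equals(repeat_sample))
print(demo_sample.describe())
```

Fixing `random_state` keeps the sample, and therefore the test result, reproducible between runs.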

## Compute Statistic

Here we check whether the mean 1st class fare differs from 85 USD, using a sample of 30 1st class passengers and a 5% significance level.

In [None]:
# Compute the t-statistic and p-value using a one-sample t-test.
st.ttest_1samp(c1_sample,85)

## Get P-Value

In [None]:
# Perform one-sample t-test (SciPy returns the two-tailed p-value by default)
t_statistic, p_value = st.ttest_1samp(c1_sample,85)


#for the two tailed experiment
print("T-statistic:", t_statistic)
print("P-value:", p_value)


#for the single tailed experiment (valid when the sign of the t-statistic matches the direction of H1)
print("T-statistic:", t_statistic)
print("P-value:", p_value/2)
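As an alternative to halving the p-value by hand, recent SciPy versions (1.6 and later) let `ttest_1samp` take an `alternative` argument. The sketch below uses synthetic data, so the `sample` array and its parameters are assumptions for illustration, not the fares from above.

```python
import numpy as np
import scipy.stats as st

# Synthetic sample for illustration only (not the real data)
rng = np.random.default_rng(42)
sample = rng.normal(loc=80, scale=10, size=30)

# Two-tailed: H1 says the mean differs from 85 (the default)
two_sided = st.ttest_1samp(sample, 85)

# One-tailed (lower): H1 says the mean is less than 85
less = st.ttest_1samp(sample, 85, alternative='less')

# One-tailed (upper): H1 says the mean is greater than 85
greater = st.ttest_1samp(sample, 85, alternative='greater')

print("Two-tailed p-value:", two_sided.pvalue)
print("Lower-tailed p-value:", less.pvalue)
print("Upper-tailed p-value:", greater.pvalue)
```

The smaller of the two one-tailed p-values equals half of the two-tailed one, which is why dividing by 2 only works when the t-statistic points in the direction stated by H1.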


## Interpret Result

In [1]:
# Check if the result is statistically significant (common threshold of 0.05)
if p_value < 0.05:
    print("The result is statistically significant: Reject null hypothesis.")
else:
    print("The result is not statistically significant: Fail to reject null hypothesis.")

T-statistic: -2.7043232157601955
P-value: 0.024221605135545964
The result is statistically significant: Reject null hypothesis.


## AB Testing

**Independent Samples** \
Used for two groups where we cannot match the observations to one another.
In this case, the groups are transactions from a website with two different interfaces (a, b).

In [None]:
ab_test = pd.read_csv('ab_test.csv')
ab_test.head()

In [None]:
# equal_var=False runs Welch's t-test, which is more robust when the two groups' variances differ
st.ttest_ind(ab_test['a'], ab_test['b'], equal_var=False) 
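If `ab_test.csv` is not at hand, the same comparison can be sketched on synthetic data. The group sizes, means and spreads below are made-up assumptions, chosen so that the two groups have visibly different variances.

```python
import numpy as np
import scipy.stats as st

# Synthetic transaction values for two interfaces (hypothetical numbers)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=105, scale=25, size=150)

# Welch's t-test: does not assume equal variances
welch = st.ttest_ind(group_a, group_b, equal_var=False)

# Student's t-test: assumes equal variances (shown for comparison)
student = st.ttest_ind(group_a, group_b, equal_var=True)

print("Welch   t:", welch.statistic, "p:", welch.pvalue)
print("Student t:", student.statistic, "p:", student.pvalue)
```

When the group sizes and variances differ, the two versions can give noticeably different p-values, which is the reason for preferring `equal_var=False` as a default.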

## ANOVA

In [None]:
One-Way ANOVA (Analysis of Variance):
Used to compare the means of three or more independent groups to determine if there's a significant difference between them.

Two-Way ANOVA:
Used to analyze the effects of two independent categorical variables on a continuous dependent variable.
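A one-way ANOVA can be sketched with `scipy.stats.f_oneway`. The three groups below are synthetic and their names are hypothetical; they only stand in for three independent groups measured on the same continuous variable.

```python
import numpy as np
import scipy.stats as st

# Three synthetic, independent groups (hypothetical data)
rng = np.random.default_rng(1)
group_1 = rng.normal(loc=50, scale=5, size=40)
group_2 = rng.normal(loc=52, scale=5, size=40)
group_3 = rng.normal(loc=58, scale=5, size=40)

# H0: all group means are equal; H1: at least one mean differs
anova = st.f_oneway(group_1, group_2, group_3)

print("F-statistic:", anova.statistic)
print("P-value:", anova.pvalue)

if anova.pvalue < 0.05:
    print("Reject null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject null hypothesis.")
```

A significant result only says that some difference exists; it does not say which groups differ from each other.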

We typically use the following tests:

One-Sample Z-Test:
Used to test the mean of a single sample against a known population mean when the population standard deviation is known.

One-Sample T-Test:
Used to test the mean of a single sample against a known or assumed population mean when the population standard deviation is unknown.

Two-Sample Independent T-Test:
Used to compare the means of two independent groups to determine if there's a significant difference between them.

Paired Samples T-Test:
Used to compare the means of two related groups, often before and after a treatment or intervention.

Chi-Square Test:
Used to assess the association between categorical variables by comparing observed and expected frequencies in a contingency table.

Mann-Whitney U Test (Wilcoxon Rank-Sum Test):
Non-parametric test used to compare two independent groups (often interpreted as a difference in medians) when the assumptions of the t-test are not met.

Kruskal-Wallis Test:
Non-parametric equivalent of one-way ANOVA, used to compare the medians of three or more independent groups.

Wilcoxon Signed-Rank Test:
Non-parametric test used to compare the medians of two related groups (paired samples).

Fisher's Exact Test:
Used to determine if there's a significant association between two categorical variables in cases where the sample sizes are small.

Binomial Test:
Used to test if the proportion of successes in a single sample differs significantly from a hypothesized proportion.

These are just a few examples of the many types of hypothesis tests available. The choice of test depends on your research question, the type of data you have, and the assumptions you can make about your data. It's important to select the appropriate test based on the characteristics of your data and the objectives of your analysis.
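As a rough illustration of how the choice of test follows the type of data, the sketch below runs one non-parametric test and one test for categorical data. All values are synthetic, and the row and column meanings of the contingency table are hypothetical.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)

# Skewed, non-normal data for two independent groups: Mann-Whitney U test
group_x = rng.exponential(scale=10, size=40)
group_y = rng.exponential(scale=14, size=40)
u_stat, u_p = st.mannwhitneyu(group_x, group_y, alternative='two-sided')
print("Mann-Whitney U:", u_stat, "p:", u_p)

# Two categorical variables as a 2x2 table of counts: Chi-Square test
# (hypothetical layout: rows = urban/rural, columns = complete/canceled)
table = np.array([[30, 20],
                  [18, 32]])
chi2, chi_p, dof, expected = st.chi2_contingency(table)
print("Chi-square:", chi2, "p:", chi_p, "dof:", dof)
```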




