# Conduct an A/B test with Python: Analyse the Literacy Rates of Two States for Funding Allocation

In this project, I conducted an A/B test to compare the mean district literacy rates of two states (STATE21 and STATE28). I used Python to simulate taking a random sample of 20 districts from each state and conducted a two-sample t-test on the sample data.

## Summary of Results

There is a statistically significant difference between the mean district literacy rates of the two states. This analysis helps decide how to distribute government resources. Since STATE28 has a lower literacy rate, more resources should be allocated to improve the literacy rate of this state.

## Read the data
The data used for the analysis was accessed via https://www.kaggle.com/datasets/saswatsethda/districtwise-education-data/data

In [1]:
import pandas as pd
from scipy import stats

In [5]:
education_districtwise = pd.read_csv("education_districtwise.csv")
# education_districtwise = education_districtwise.dropna()

## Explore the data

1. Data cleaning. Remove entries with null values and check for duplicates.
2. Filter the dataframe to get the data for the two states of interest, STATE21 and STATE28, for further investigation.

In [4]:
education_districtwise.head()

|   | DISTNAME | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI |
|---|---|---|---|---|---|---|---|
| 0 | DISTRICT32 | STATE1 | 13 | 391 | 104 | 875564.0 | 66.92 |
| 1 | DISTRICT649 | STATE1 | 18 | 678 | 144 | 1015503.0 | 66.93 |
| 2 | DISTRICT229 | STATE1 | 8 | 94 | 65 | 1269751.0 | 71.21 |
| 3 | DISTRICT259 | STATE1 | 13 | 523 | 104 | 735753.0 | 57.98 |
| 4 | DISTRICT486 | STATE1 | 8 | 359 | 64 | 570060.0 | 65.0 |


In [7]:
print(education_districtwise.shape)
print(education_districtwise.isnull().sum())

(680, 7)
DISTNAME       0
STATNAME       0
BLOCKS         0
VILLAGES       0
CLUSTERS       0
TOTPOPULAT    46
OVERALL_LI    46
dtype: int64


In [8]:
# Remove entries with null values
education_districtwise=education_districtwise.dropna()

In [10]:
# Check for duplicates
print(education_districtwise.duplicated().sum())

0
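The cleaning steps above can be tried on a small frame first. The sketch below uses made-up toy data (the district names and values are hypothetical, not taken from the dataset) to show what `isnull().sum()`, `dropna()` and `duplicated()` report. In the real data no duplicates were found, so the final `drop_duplicates()` call is included only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
toy = pd.DataFrame({
    "DISTNAME": ["D1", "D2", "D3", "D3"],
    "STATNAME": ["S1", "S1", "S2", "S2"],
    "OVERALL_LI": [70.5, np.nan, 62.0, 62.0],
})

null_counts = toy.isnull().sum()            # missing values per column
cleaned = toy.dropna()                      # drop rows with any missing value
n_duplicates = cleaned.duplicated().sum()   # rows identical to an earlier row
deduplicated = cleaned.drop_duplicates()    # keep the first copy of each row

print(null_counts["OVERALL_LI"], len(cleaned), n_duplicates, len(deduplicated))
```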


In [11]:
# Get the data for the two states of interest

state21 = education_districtwise[education_districtwise['STATNAME'] == "STATE21"]
state28 = education_districtwise[education_districtwise['STATNAME'] == "STATE28"]
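Before sampling, it can help to confirm how many districts each filtered state contains. The following is a minimal sketch on hypothetical toy rows (not the real dataset) showing the same boolean-mask filter, plus an `isin` and `value_counts` check of group sizes.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "STATNAME": ["STATE21", "STATE28", "STATE21", "STATE1"],
    "OVERALL_LI": [70.0, 60.0, 66.0, 65.0],
})

# Boolean mask: True for rows that belong to STATE21.
state21_toy = toy[toy["STATNAME"] == "STATE21"]

# Count districts per state for the two states of interest.
of_interest = toy["STATNAME"].isin(["STATE21", "STATE28"])
counts = toy.loc[of_interest, "STATNAME"].value_counts()

print(len(state21_toy))
print(counts)
```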

## Simulate random sampling and compute the sample means

Use the `.sample()` method to take a random sample from a dataframe.
*   `n`: The sample size.
*   `replace=True`: Enables sampling with replacement, so the same district can be drawn more than once. This is the resampling scheme used in bootstrapping, which is normally used to estimate statistics. Even if duplicates feel odd, bootstrapping may still be statistically valid as a modeling tool: it doesn't claim the duplicates exist, it just estimates sampling variability.
*   `random_state`: Sets a fixed seed for reproducibility.
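To see what these arguments do in practice, here is a small sketch on a hypothetical five-row frame (the values are invented). It illustrates that the same `random_state` reproduces the same draw, that sampling with replacement repeats rows, and that asking for more rows than exist fails when `replace=False`.

```python
import pandas as pd

# Hypothetical toy frame with five districts.
toy = pd.DataFrame({"OVERALL_LI": [55.0, 60.0, 65.0, 70.0, 75.0]})

# Same seed, same draw: the two samples pick identical rows.
s1 = toy.sample(n=8, replace=True, random_state=42)
s2 = toy.sample(n=8, replace=True, random_state=42)
same_rows = s1.index.equals(s2.index)

# Eight draws from five rows must contain at least one repeated row.
has_repeats = s1.index.duplicated().any()

# Without replacement, a sample larger than the data raises ValueError.
try:
    toy.sample(n=8, replace=False, random_state=42)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

print(same_rows, has_repeats, oversample_allowed)
```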

In [14]:
sampled_state21 = state21.sample(n=20, replace = True, random_state=42)
sampled_state28 = state28.sample(n=20, replace = True, random_state=1)

In [16]:
sampled_state21['OVERALL_LI'].mean()

68.01950000000001

In [15]:
sampled_state28['OVERALL_LI'].mean()

61.167999999999985

STATE21 has a mean district literacy rate of about 68%, while STATE28 has a mean district literacy rate of about 61%.

The observed difference between the mean district literacy rates of STATE21 and STATE28 is about 6.9 percentage points (68.02% - 61.17%).
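The subtraction can be checked directly from the two sample means printed above. The variable names below are illustrative.

```python
# Sample means copied from the notebook output above.
mean_state21 = 68.01950000000001
mean_state28 = 61.167999999999985

# Observed difference in percentage points.
observed_diff = mean_state21 - mean_state28
print(round(observed_diff, 2))
```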

**Note**: At this point, we might be tempted to conclude that STATE21 has a higher overall literacy rate than STATE28. However, due to sampling variability, this observed difference might simply be due to chance, rather than an actual difference in the corresponding population means. A hypothesis test can help determine whether or not the results are statistically significant. 
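Sampling variability can be illustrated with a quick simulation. The sketch below builds a synthetic population (the mean of 65 and standard deviation of 8 are assumptions chosen for illustration, not estimates from the dataset) and draws several samples of 20 from it. The sample means differ from draw to draw even though every sample comes from the same population.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population of 500 district literacy rates (synthetic numbers).
population = pd.Series(rng.normal(loc=65, scale=8, size=500))

# Draw five samples of 20 from the SAME population, one per seed.
sample_means = [
    population.sample(n=20, replace=True, random_state=seed).mean()
    for seed in range(5)
]

# Gap between the largest and smallest sample mean.
spread = max(sample_means) - min(sample_means)
print([round(m, 2) for m in sample_means], round(spread, 2))
```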

## Conduct a hypothesis test

A two-sample t-test is the standard approach for comparing the means of two independent samples. The steps for conducting a hypothesis test are:

1.   State the null hypothesis and the alternative hypothesis.
2.   Choose a significance level: the threshold at which you will consider a result statistically significant. It is the probability of rejecting the null hypothesis when it is true. Here we use 5%.
3.   Find the p-value. 
4.   Reject or fail to reject the null hypothesis.
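The four steps can be sketched end to end as follows. The two groups here are synthetic draws from normal distributions with assumed means and spreads, so the numbers are illustrative only and are not the notebook's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic samples standing in for two groups of 20 districts (assumed values).
group_a = rng.normal(loc=68, scale=6, size=20)
group_b = rng.normal(loc=61, scale=6, size=20)

# Step 1: H0 says the population means are equal; HA says they differ.
# Step 2: choose the significance level.
alpha = 0.05

# Step 3: compute the p-value with a two-sample t-test.
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Step 4: compare the p-value with the significance level.
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(result.pvalue, decision)
```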

### The null hypothesis and the alternative hypothesis

*   **Null hypothesis** $(H_0)$: There is no difference in the mean district literacy rates between STATE21 and STATE28.
*   **Alternative hypothesis** $(H_A)$: There is a difference in the mean district literacy rates between STATE21 and STATE28.

### Find the p-value

The **p-value** is the probability of observing results as extreme as, or more extreme than, those observed when the null hypothesis is true. If this probability is very small, in particular if the p-value is *less than* the significance level of 5%, then we will reject the null hypothesis.

For a two-sample $t$-test, use `scipy.stats.ttest_ind()` to compute the p-value. This function includes the following arguments:

*   `a`: Observations from the first sample 
*   `b`: Observations from the second sample
*   `equal_var`: A boolean (true/false) value that indicates whether the two populations are assumed to have equal variance. In our example, we don’t have access to data for the entire population, so we don’t want to assume anything about the variance. To avoid making a wrong assumption, set this argument to `False`, which performs Welch’s t-test instead of the pooled-variance t-test.

**Reference:** [scipy.stats.ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)
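To make the effect of `equal_var=False` concrete, the sketch below computes Welch's t-statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand, and compares them with `ttest_ind`. The two arrays are made-up numbers for illustration, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Made-up observations for two groups (illustration only).
a = np.array([70.1, 65.3, 72.8, 68.4, 66.0, 71.5])
b = np.array([60.2, 63.9, 58.7, 61.1, 64.5, 59.8])

# Squared standard error of each sample mean (sample variance / n).
va = a.var(ddof=1) / len(a)
vb = b.var(ddof=1) / len(b)

# Welch's t-statistic: no pooled variance is used.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom.
df_manual = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```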


In [17]:
stats.ttest_ind(a=sampled_state21['OVERALL_LI'], b=sampled_state28['OVERALL_LI'], equal_var=False)

TtestResult(statistic=3.4447514689083785, pvalue=0.0014097931596738957, df=37.94509081476471)
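As a cross-check, the two-sided p-value can be recomputed from the reported statistic and degrees of freedom using the t distribution's survival function. This is a minimal sketch; the result should come out close to the reported value of about 0.0014.

```python
from scipy import stats

# Values copied from the TtestResult output above.
t_stat = 3.4447514689083785
df = 37.94509081476471

# Two-sided p-value: probability mass in both tails beyond |t|.
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)
print(p_two_sided)
```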

The p-value is about 0.0014, or 0.14%. 

This means that, if the null hypothesis were true, there would be only a 0.14% probability of observing an absolute difference between the two sample means of about 6.85 percentage points or greater. In other words, a difference this large would be very unlikely to arise from sampling variability alone.

### Reject the null hypothesis

The p-value of 0.14% is less than the significance level of 5%. Therefore, we *reject* the null hypothesis and conclude that there is a statistically significant difference between the mean district literacy rates of the two states, STATE21 and STATE28.

## Conclusion

There is a statistically significant difference between the mean district literacy rates of the two states. STATE28 has the lower mean district literacy rate, so more resources should be allocated to improving literacy there.