In [1]:
import pandas as pd
from scipy import stats

In [2]:
# Load the district-level education data and drop rows with missing values
df = pd.read_csv("education_districtwise.csv")
df = df.dropna()

In [3]:
df

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.00
...,...,...,...,...,...,...,...
675,DISTRICT522,STATE29,37,876,137,5296396.0,78.05
676,DISTRICT498,STATE29,64,1458,230,4042191.0,56.06
677,DISTRICT343,STATE29,59,1117,216,3483648.0,65.05
678,DISTRICT130,STATE29,51,993,211,3522644.0,66.16


In [4]:
# Keep only the districts that belong to each of the two states being compared
state21 = df[df['STATNAME'] == "STATE21"]
state28 = df[df['STATNAME'] == "STATE28"]

In [5]:
# Draw 20 districts at random (with replacement); random_state makes the draw reproducible
sampled_state21 = state21.sample(n=20, replace=True, random_state=13490)

In [6]:
sampled_state28 = state28.sample(n=20, replace=True, random_state=39103)

In [7]:
sampled_state21['OVERALL_LI'].mean()

70.82900000000001

In [9]:
sampled_state28['OVERALL_LI']

214    55.10
239    75.59
221    65.68
211    53.53
211    53.53
234    66.41
239    75.59
243    61.63
219    54.57
212    60.90
225    68.56
229    64.96
223    71.59
217    53.56
226    63.81
222    67.04
231    73.30
212    60.90
235    72.47
231    73.30
Name: OVERALL_LI, dtype: float64

## Conduct a hypothesis test



### Step 1: State the null hypothesis and the alternative hypothesis
The null hypothesis is a statement that is assumed to be true unless there is convincing evidence to the contrary. The alternative hypothesis is a statement that contradicts the null hypothesis, and is accepted as true only if there is convincing evidence for it.

In a two-sample t-test, the null hypothesis states that there is no difference between the means of your two groups. The alternative hypothesis states the contrary claim: there is a difference between the means of your two groups.

We use $H_0$ to denote the null hypothesis and $H_A$ to denote the alternative hypothesis.

- $H_0$: There is no difference in the mean district literacy rates between STATE21 and STATE28.
- $H_A$: There is a difference in the mean district literacy rates between STATE21 and STATE28.

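
The same pair of hypotheses can be written in symbols. The notation $\mu_{21}$ and $\mu_{28}$ for the population mean district literacy rates of STATE21 and STATE28 is introduced here for illustration only and is not used elsewhere in this notebook:

$$
H_0: \mu_{21} = \mu_{28} \qquad\qquad H_A: \mu_{21} \neq \mu_{28}
$$

Because $H_A$ uses $\neq$ rather than $<$ or $>$, this is a two-sided test: a large difference in either direction counts as evidence against $H_0$.
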
### Step 2: Choose a significance level
The significance level is the threshold at which you will consider a result statistically significant. It is also the probability of rejecting the null hypothesis when it is actually true (a Type I error). The education department asks you to use their standard level of 5%, or 0.05.
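
To see what a 5% significance level means in practice, the minimal sketch below simulates many two-sample tests in which the null hypothesis is true by construction, then counts how often it is rejected anyway. The normal populations (mean 65, standard deviation 8), the 2,000 trials, and the seed are illustrative assumptions, not values taken from the education dataset.

```python
import numpy as np
from scipy import stats

alpha = 0.05                      # significance level chosen in this step
rng = np.random.default_rng(0)    # hypothetical seed, for reproducibility only

n_trials, n = 2000, 20            # 2,000 simulated studies, 20 districts per group

# Both groups come from the SAME population, so the null hypothesis is true.
a = rng.normal(loc=65, scale=8, size=(n_trials, n))
b = rng.normal(loc=65, scale=8, size=(n_trials, n))

# One t-test per row (axis=1), without assuming equal variances.
result = stats.ttest_ind(a, b, axis=1, equal_var=False)

# Fraction of simulated studies that wrongly reject the null hypothesis.
false_positive_rate = np.mean(result.pvalue < alpha)
print(f"Share of false rejections: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05, which is the sense in which the significance level is the probability of rejecting a true null hypothesis.
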

### Step 3: Find the p-value
The p-value is the probability of observing results as extreme as, or more extreme than, those actually observed, assuming the null hypothesis is true.

Based on your sample data, the difference between the mean district literacy rates of STATE21 and STATE28 is 6.2 percentage points. Your null hypothesis claims that this difference is due to chance. Your p-value is the probability of observing an absolute difference in sample means of 6.2 percentage points or greater if the null hypothesis is true. If this outcome is very unlikely (specifically, if your p-value is less than your significance level of 5%), you will reject the null hypothesis.
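
The 6.2 figure can be checked directly from the numbers shown earlier in this notebook. The sketch below hard-codes the 20 STATE28 sample values from the printed output and the STATE21 sample mean of 70.829; the variable names are chosen here for illustration.

```python
from statistics import mean

# STATE28 sample values, copied from the Series printed earlier in the notebook
state28_sample = [
    55.10, 75.59, 65.68, 53.53, 53.53, 66.41, 75.59, 61.63, 54.57, 60.90,
    68.56, 64.96, 71.59, 53.56, 63.81, 67.04, 73.30, 60.90, 72.47, 73.30,
]

mean_state21 = 70.829                 # STATE21 sample mean reported by the notebook
mean_state28 = mean(state28_sample)   # STATE28 sample mean, recomputed here

# Observed difference in sample means, in percentage points
observed_diff = mean_state21 - mean_state28
print(f"STATE28 sample mean: {mean_state28:.3f}")
print(f"Observed difference: {observed_diff:.1f} percentage points")
```

In the notebook itself, the same quantity would be `sampled_state21['OVERALL_LI'].mean() - sampled_state28['OVERALL_LI'].mean()`.
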

#### `scipy.stats.ttest_ind()`
For a two-sample t-test, you can use `scipy.stats.ttest_ind()` to compute your p-value. This function includes the following arguments:

- `a`: Observations from the first sample.
- `b`: Observations from the second sample.
- `equal_var`: A boolean, or true/false statement, which indicates whether the population variance of the two samples is assumed to be equal. In our example, you don’t have access to data for the entire population, so you don’t want to assume anything about the variance. To avoid making a wrong assumption, set this argument to `False`.

Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

Now you’re ready to write your code and enter the relevant arguments:

- `a`: Your first sample refers to the district literacy rate data for STATE21, which is stored in the `OVERALL_LI` column of your variable `sampled_state21`.
- `b`: Your second sample refers to the district literacy rate data for STATE28, which is stored in the `OVERALL_LI` column of your variable `sampled_state28`.
- `equal_var`: Set to `False` because you don’t want to assume that the two samples have the same variance.

In [12]:
stats.ttest_ind(a=sampled_state21['OVERALL_LI'], b=sampled_state28['OVERALL_LI'], equal_var=False)

Ttest_indResult(statistic=2.8980444277268735, pvalue=0.006421719142765237)
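
With `equal_var=False`, `ttest_ind()` performs Welch's t-test. As a rough sketch of what happens under the hood, the code below computes the Welch statistic, its approximate degrees of freedom, and the two-sided p-value by hand, then compares them with SciPy's result. The two small arrays are hypothetical literacy rates made up for illustration; they are not the notebook's samples.

```python
import numpy as np
from scipy import stats

# Hypothetical literacy-rate samples, for illustration only
a = np.array([72.1, 68.4, 75.0, 70.3, 66.8, 73.5, 69.9, 71.2])
b = np.array([61.0, 66.7, 58.2, 70.1, 63.4, 59.8, 65.5])

# Squared standard error of each sample mean (sample variance / sample size)
se2_a = a.var(ddof=1) / len(a)
se2_b = b.var(ddof=1) / len(b)

# Welch's t statistic: difference in means over the combined standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch = (se2_a + se2_b) ** 2 / (
    se2_a**2 / (len(a) - 1) + se2_b**2 / (len(b) - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# SciPy's result for the same data
result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={result.statistic:.4f}, p={result.pvalue:.4f}")
```

The two printed lines should agree, which also shows that the returned object exposes the test statistic and p-value as `result.statistic` and `result.pvalue`.
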

### Step 4: Reject or fail to reject the null hypothesis
To draw a conclusion, compare your p-value with the significance level.

- If the p-value is less than the significance level, you conclude there is a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you reject the null hypothesis $H_0$.
- If the p-value is greater than the significance level, you conclude there is not a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you fail to reject the null hypothesis $H_0$.

Your p-value of 0.0064, or 0.64%, is less than the significance level of 0.05, or 5%. So, you reject the null hypothesis, and conclude that there is a statistically significant difference between the mean district literacy rates of the two states STATE21 and STATE28.
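
The decision rule can also be written as a few lines of code. This is a minimal sketch that hard-codes the p-value reported by `stats.ttest_ind()` above and the 5% significance level from Step 2; the variable names are illustrative.

```python
alpha = 0.05                     # significance level chosen in Step 2
p_value = 0.006421719142765237   # p-value reported by stats.ttest_ind() in Step 3

# Reject the null hypothesis only when the p-value falls below the threshold.
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"p-value = {p_value:.4f}, alpha = {alpha}: {decision}")
```
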

Your analysis will help the education department decide how to distribute government resources. Since there is a statistically significant difference in mean district literacy rates, the state with the lower literacy rate, STATE28, will likely receive more resources to improve literacy.

If you have successfully completed the material above, congratulations! You now understand how to use Python to conduct a two-sample hypothesis test. Going forward, you can start using Python to conduct hypothesis tests on your own data.