# Statistics 






### Example:


#### 1. A survey claims that in a math test female students tend to score fewer marks than the average marks of 75 out of 100. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.

Use the dataset available in the CSV file `mathscore_1ttest.csv`.


In [3]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [4]:
data=pd.read_csv('mathscore_1ttest.csv')

In [16]:
# H0: mu >= 75
# H1: mu < 75 (claim: female students score below the average of 75)
mu=75
n=24
# Population sd is not given and the sample size is below 30, so we use a one-sample t-test

In [9]:
math_fem_sam=data[data['gender']=='female']['math score'].tolist()

In [13]:
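# Note: np.std defaults to ddof=0 (population formula); pass ddof=1 for the sample standard deviation used by the t-test
print(f'size(n):{len(math_fem_sam)} \tsample_mean:{np.mean(math_fem_sam)}\t sample std:{np.std(math_fem_sam)}')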

size(n):24 	sample_mean:66.45833333333333	 sample std:11.357740263313335


In [20]:
p_value=stats.ttest_1samp(math_fem_sam,popmean=75,alternative='less').pvalue

In [21]:
if p_value<.1:
    print("Reject H0: Average math score of female students is less than 75")
else:
    print("Fail to reject H0: Not enough evidence that the average math score of female students is less than 75")

Reject H0: Average math score of female students is less than 75
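
The decision above can be cross-checked by hand from the printed summary. This is a minimal sketch that assumes only the three summary numbers from the output; because `np.std` defaults to `ddof=0`, the printed standard deviation is first rescaled to the `ddof=1` sample estimate that the t-test uses.

```python
import numpy as np
from scipy import stats

# Summary values copied from the printed output above
n = 24
x_bar = 66.45833333333333
sd_pop = 11.357740263313335  # np.std default, i.e. ddof=0
mu0 = 75

# Rescale the ddof=0 value to the ddof=1 sample standard deviation
s = sd_pop * np.sqrt(n / (n - 1))

# One-sample t statistic and left-tailed p-value with n-1 degrees of freedom
t_stat = (x_bar - mu0) / (s / np.sqrt(n))
p_left = stats.t.cdf(t_stat, df=n - 1)
print(f"t = {t_stat:.4f}, p = {p_left:.5f}")
```

The left-tailed p-value is far below 0.10, which agrees with the rejection of H0 reported by `ttest_1samp`.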


#### 2. A researcher is studying the growth of bacteria in waters of Lake Beach. The mean bacteria count of 100 per unit volume of water is within the safety level. The researcher collected 10 water samples of unit volume and found the mean bacteria count to be 94.8 with a sample variance of 72.66. Does the data indicate that the bacteria count is within the safety level? Test at the α = .05 level. Assume that the measurements constitute a sample from a normal population.

In [25]:
n=10
mu=100
x_bar=94.8
s=72.66**.5
# The sample is assumed to come from a normal population (given in the problem)
# Population sd is not given and the sample size is below 30, so a t-test is used
# H0: mu >= 100
# H1: mu < 100
t=(x_bar-mu)/(s/np.sqrt(n))
p_value=stats.t.cdf(t,df=n-1)
print(f'p_value:{p_value} \t t_stats:{t}')

if p_value<.05:
    print("Reject H0: mean bacteria count is below 100")
else:
    print("Fail to reject H0: not enough evidence that the mean bacteria count is below 100")

p_value:0.04289782134327503 	 t_stats:-1.9291040236750068
Reject H0: mean bacteria count is below 100
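
The same decision can be reached with the critical-value approach instead of the p-value. The sketch below reuses the numbers from the problem statement; the variable names `alpha` and `t_crit` are illustrative.

```python
import numpy as np
from scipy import stats

n, mu0, x_bar = 10, 100, 94.8
s = np.sqrt(72.66)  # sample standard deviation from the sample variance
alpha = 0.05

t_stat = (x_bar - mu0) / (s / np.sqrt(n))
# Lower-tail critical value of the t distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(alpha, df=n - 1)

print(f"t = {t_stat:.4f}, critical value = {t_crit:.4f}")
print("Reject H0" if t_stat < t_crit else "Fail to reject H0")
```

For a left-tailed test, H0 is rejected when the test statistic falls below the critical value, which is the case here.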


### Example:

#### 1. In previous years, people believed that at most 80% of male students score more than 50 marks out of 100 in Mathematics. Perform a test to check whether this percentage is more than 80. Consider the level of significance as 0.05.

Consider the sample of math scores of male students available in the CSV file `StudentsPerformance.csv`.

In [None]:
# H0: p <= 0.8
# H1: p > 0.8


In [26]:
data=pd.read_csv('StudentsPerformance.csv')

In [40]:
p=data[(data['math score']>50)&(data['gender']=='male')].shape[0]/len(data[(data['gender'] == 'male')])
p

0.8757763975155279

In [43]:
P0=.8
n=len(data[(data['gender'] == 'male')])

z=(p-P0)/np.sqrt(P0*(1-P0)/n)
z

4.163394160018601

In [45]:
p_value=stats.norm.sf(z)
print("P_value:",p_value)
print("z_stats:",z)

if p_value<.05:
    print("Reject H0: more than 80% of male students score above 50")
else:
    print("Fail to reject H0: not enough evidence that more than 80% of male students score above 50")


P_value: 1.5677570141208797e-05
z_stats: 4.163394160018601
Reject H0: more than 80% of male students score above 50


#### 2. A sample of 361 business owners had gone into bankruptcy due to recession. A survey found that 105 of them had not consulted any professional for managing their finances before opening the business. Test the null hypothesis that at most 25% of all businesses had not consulted a professional before opening. Test the claim using the p-value technique. Use α = 0.05.

In [48]:
# H0: p <= 0.25
# H1: p > 0.25

p=105/361
P0=.25
n=361
z=(p-P0)/np.sqrt(P0*(1-P0)/n)
p_value=stats.norm.sf(z)
print("p_value:",p_value)
print("z_stats:",z)

if p_value<.05:
    print("Reject H0")
else:
    print("Fail to reject H0")

p_value: 0.03650049373124949
z_stats: 1.7928245201151534
Reject H0
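
Both proportion problems repeat the same calculation, so it can be wrapped in a small function. `one_prop_ztest` below is a hypothetical helper written for illustration, not a library function; it uses the hypothesised proportion `p0` in the standard error, as the cells above do.

```python
import numpy as np
from scipy import stats

def one_prop_ztest(count, nobs, p0, alternative="larger"):
    """Hypothetical helper: one-sample z-test for a proportion (p0 used in the standard error)."""
    p_hat = count / nobs
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / nobs)
    if alternative == "larger":
        p_value = stats.norm.sf(z)       # right tail
    elif alternative == "smaller":
        p_value = stats.norm.cdf(z)      # left tail
    else:
        p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return z, p_value

# Bankruptcy example: 105 of 361 owners, H0: p <= 0.25
z, p_value = one_prop_ztest(105, 361, 0.25)
print(f"z = {z:.4f}, p = {p_value:.4f}")
```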


### Example:

#### 1. The training institute <i>Nature Learning</i> claims that the students trained in their institute have overall better performance than the students trained in their competitor institute <i>Speak Global Learning</i>. We have sample data for 500 students from each institute along with their total scores, collected from independent normal populations. Frame a hypothesis and test Nature Learning's claim with 99% confidence.
Consider the total score for students given in the CSV file `StudentsPerformance.csv`. 

In [50]:
# H0: mu1 - mu2 <= 0
# H1: mu1 (Nature Learning) - mu2 (Speak Global Learning) > 0
data=pd.read_csv('StudentsPerformance.csv')    
data.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning
2,female,group B,standard,none,64,71,56,191,Nature Learning
3,male,group A,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,standard,none,75,66,51,192,Nature Learning


In [56]:
training_ins=data[data['training institute']=='Nature Learning']['total score'].values

In [57]:
speak_global=data[data['training institute']!='Nature Learning']['total score'].values

In [None]:
# This is a two-sample z-test because both sample sizes are greater than 30

In [59]:
from statsmodels.stats import weightstats as wstats

In [65]:
p_value=wstats.ztest(x1=training_ins,x2=speak_global,alternative='larger')[1]

In [66]:
if p_value<.01:
    print("Reject H0")
else:
    print("Fail to reject H0")

Fail to reject H0


#### 2. A study was carried out to understand the amount of haemoglobin in blood for males and females. Random samples of 160 males and 180 females have means of 13 g/dl and 15 g/dl, respectively. The sample standard deviations are 4.1 g/dl for male donors and 3.5 g/dl for female donors. Can it be said that the population mean haemoglobin concentrations are the same for men and women? Use α = 0.01.
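
No solution cell is given for this problem. One possible approach is a two-sided two-sample z-test computed from the summary statistics. The sketch assumes both samples are large enough for the sample standard deviations to stand in for the population values.

```python
import numpy as np
from scipy import stats

# Summary statistics from the problem statement
n_m, mean_m, sd_m = 160, 13.0, 4.1   # males
n_f, mean_f, sd_f = 180, 15.0, 3.5   # females
alpha = 0.01

# H0: mu_m = mu_f    H1: mu_m != mu_f
se = np.sqrt(sd_m**2 / n_m + sd_f**2 / n_f)   # standard error of the mean difference
z = (mean_m - mean_f) / se
p_value = 2 * stats.norm.sf(abs(z))           # two-sided p-value

print(f"z = {z:.4f}, p = {p_value:.2e}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Under these assumptions the p-value is well below 0.01, so the data do not support equal population means.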

### Example: 

#### 1. The teachers' association claims that the total score of the students who completed the test preparation course is different from the total score of the students who have not completed the course. The sample data consists of 15 students who completed the course and 18 students who have not completed the course. Test the association's claim with ⍺ = 0.05.

The total scores of the students who have and have not completed the preparation course are given in the CSV file `totalmarks_2ttest.csv`.

In [81]:
# Assumption 1: normality (`data` is totalmarks_2ttest.csv, loaded in cell In [69] below)
# H0: data is normal
# H1: data is not normal
p_value=stats.shapiro(data['total score']).pvalue
print(stats.shapiro(data['total score']))
if p_value<.01:
    print("Reject H0: data is not normal")
else:
    print("Fail to reject H0: data is normal")
    

ShapiroResult(statistic=0.9845389723777771, pvalue=0.9080861806869507)
Fail to reject H0: data is normal


In [83]:
# Assumption 2: equal variances (`comp` and `not_comp` are defined in cell In [72] below)
# H0: variances are equal
# H1: variances are not equal
p_value=stats.levene(comp,not_comp).pvalue
if p_value<.01:
    print("Reject H0: variances are not equal")
else:
    print("Fail to reject H0: variances are equal")
    

Fail to reject H0: variances are equal


In [69]:
# H0: mu1 = mu2
# H1: mu1 (course completed) != mu2 (not completed)
n1=15
n2=18
# This is an unpaired two-sample t-test: the samples are independent and both are small


data=pd.read_csv('totalmarks_2ttest.csv')
data

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,male,group E,standard,completed,84,83,78,245,Speak Global Learning
1,male,group C,free/reduced,completed,79,77,75,231,Speak Global Learning
2,male,group A,standard,none,91,96,92,279,Nature Learning
3,female,group B,free/reduced,completed,76,94,87,257,Speak Global Learning
4,male,group A,standard,completed,46,41,43,130,Nature Learning
5,female,group C,standard,completed,70,82,76,228,Speak Global Learning
6,male,group C,standard,none,79,78,77,234,Speak Global Learning
7,male,group D,standard,none,88,77,77,242,Nature Learning
8,female,group E,standard,none,62,73,70,205,Speak Global Learning
9,female,group B,standard,completed,60,70,74,204,Nature Learning


In [72]:
comp=data[data['test preparation course']=='completed']['total score'].values
not_comp=data[data['test preparation course']!='completed']['total score'].values

In [84]:
p_value=stats.ttest_ind(comp,not_comp).pvalue

In [85]:
if p_value<.05:
    print("Reject H0")
else:
    print("Fail to Reject H0")

Fail to Reject H0
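
`stats.ttest_ind` pools the variances by default, which relies on the equal-variance assumption checked with Levene's test. If that assumption were rejected, Welch's t-test (`equal_var=False`) would be the usual alternative. The sketch below uses randomly generated, hypothetical scores rather than the CSV data.

```python
import numpy as np
from scipy import stats

# Hypothetical scores with clearly different spreads (not the CSV data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=200, scale=20, size=15)
group_b = rng.normal(loc=200, scale=45, size=18)

pooled = stats.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(f"pooled: t = {pooled.statistic:.4f}, p = {pooled.pvalue:.4f}")
print(f"welch:  t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```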


### Example:

#### 1. A training institute wants to check if their writing training program was effective or not. 17 students are selected to check the hypothesis. Consider 0.05 as the level of significance.

The writing scores before and after training are provided in the CSV file `WritingScores.csv`. 

In [None]:
# H0: mu1 (before) - mu2 (after) = 0
# H1: mu1 (before) - mu2 (after) != 0

In [89]:
data=pd.read_csv('WritingScores.csv')
data.head(2)

Unnamed: 0,score_before,score_after
0,59,50
1,62,67


In [90]:
# n is less than 30 and each student has a before and an after score, so a paired t-test is used

In [95]:
# Assumption 1:
# part 1

p_value=stats.shapiro(data['score_before']).pvalue
if p_value<.01:
    print("reject H0: data is not normal")
else:
    print("Fail to reject H0: data is normal")
    
# part 2    
p_value=stats.shapiro(data['score_after']).pvalue
if p_value<.01:
    print("reject H0: data is not normal")
else:
    print("Fail to reject H0: data is normal")
    

Fail to reject H0: data is normal
Fail to reject H0: data is normal


In [99]:
# Signed paired differences (taking abs() would discard the direction of change)
diff=(data['score_before']-data['score_after']).values

In [103]:
p_value=stats.ttest_rel(data['score_before'],data['score_after']).pvalue
if  p_value<.05:
    print("Reject H0")
else:
    print("Fail to reject H0")

Fail to reject H0
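
A paired t-test is equivalent to a one-sample t-test on the signed differences against zero. The sketch below illustrates this with hypothetical scores, not the contents of `WritingScores.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after scores for eight students
before = np.array([59, 62, 55, 70, 64, 58, 61, 66])
after = np.array([50, 67, 60, 72, 63, 65, 64, 70])
diff = before - after  # signed paired differences

paired = stats.ttest_rel(before, after)
one_sample = stats.ttest_1samp(diff, popmean=0)

print(f"paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"one-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```

Because the test works on the differences, the normality assumption applies to `diff` rather than to each column separately. If the question is specifically whether scores improved, a one-sided alternative (for example `alternative='less'` on before minus after) may be the better fit.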


### Example:

#### 1. Check whether there is a significant difference between the observed and expected education values or not with 90% confidence. 

Consider the observed values from the performance dataset of students available in the CSV file `students_data.csv`. Consider the expected values from the demographic data given in the CSV file `demographic_data.csv`.

In [113]:
exp_data=pd.read_csv('demographic_data.csv')
obs_data=pd.read_csv('students_data.csv')

In [None]:
# H0: There is no significant difference between the observed and expected values
# H1: There is a significant difference between the observed and expected values
    

In [114]:
obs_data.shape

(1000, 10)

In [130]:
observed=obs_data['education'].value_counts().values

In [131]:
expected=(exp_data.value_counts(normalize=True)*obs_data.shape[0]).values

In [136]:
p_value=stats.chisquare(observed,expected).pvalue

In [137]:
if p_value<.10:
    print("reject H0")
else:
    print("Fail to reject H0")

reject H0
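
`value_counts()` sorts by frequency, so the observed and expected vectors above line up only if both files rank the education categories in the same order. A safer pattern is to align both on the category labels before calling `stats.chisquare`. The sketch below uses small hypothetical counts and shares, since the structure of the two CSV files is not shown here.

```python
import pandas as pd
from scipy import stats

# Hypothetical observed counts and expected population shares, in different orders
observed_counts = pd.Series({"college": 50, "bachelor": 30, "master": 20})
expected_share = pd.Series({"bachelor": 0.35, "master": 0.15, "college": 0.50})

# Reindex both on the same sorted category labels so positions match
categories = sorted(observed_counts.index)
observed = observed_counts.reindex(categories)
expected = expected_share.reindex(categories) * observed_counts.sum()

result = stats.chisquare(observed.to_numpy(), expected.to_numpy())
print(f"chi2 = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```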


#### 2. At an emporium, the manager is interested in knowing which age groups visit the mall during the day. He defines the categories as children, teenagers, adults and senior citizens, and plans his inventory of goods accordingly. He claims that, of all the people who visit, 5% are children, 38% are teenagers, 2% are senior citizens and the remaining are adults. From a sample of 180 people, it was seen that 25 were children, 50 were teenagers, 90 were adults and 15 were senior citizens. Test the manager's claim at a 95% confidence level.


In [194]:
df=pd.DataFrame([[.05,.38,.55,.02],[25,50,90,15]],columns=['children','teenager','adults','senior citizen'])

In [195]:
df

Unnamed: 0,children,teenager,adults,senior citizen
0,0.05,0.38,0.55,0.02
1,25.0,50.0,90.0,15.0


In [196]:
df.iloc[1,:].sum()

180.0

In [197]:
observed=df.iloc[1,:].tolist()

In [198]:
expected=(df.iloc[0,:]*df.iloc[1,:].sum()).tolist()

In [199]:
expected

[9.0, 68.4, 99.00000000000001, 3.6]

In [200]:
observed

[25.0, 50.0, 90.0, 15.0]


In [204]:
p_value=stats.chisquare(observed,expected).pvalue

if p_value<.05:
    print("Reject H0")
else:
    print("Fail to reject H0")

Reject H0
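
The statistic behind `stats.chisquare` is the sum of (observed − expected)² / expected over the categories, with k − 1 degrees of freedom. A short sketch that computes it by hand from the numbers in the problem, and flags expected counts below 5 (a common rule of thumb for the chi-square approximation):

```python
import numpy as np
from scipy import stats

# Order: children, teenagers, adults, senior citizens
observed = np.array([25, 50, 90, 15])
shares = np.array([0.05, 0.38, 0.55, 0.02])   # manager's claimed proportions
expected = shares * observed.sum()

chi2_stat = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1
p_value = stats.chi2.sf(chi2_stat, dof)       # right-tail probability

small_cells = expected < 5                    # rule-of-thumb check
print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.3e}")
print("Expected counts below 5:", expected[small_cells])
```

The expected count for senior citizens (3.6) falls below 5, so the approximation is weaker for that cell, although the statistic here is large enough that the conclusion is unlikely to change.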


### Example:

#### 1. Check if there is any relationship between the gender and education level of students with 95% confidence. 

Use the performance dataset of students available in the CSV file `students_data.csv`.

In [205]:
data=pd.read_csv('students_data.csv')
data.head(3)

# H0: gender and education level are independent
# H1: gender and education level are not independent


Unnamed: 0,gender,ethnicity,education,lunch,test_prep_course,math_score,reading_score,writing_score,total_score,training_institute
0,female,group B,bachelor's degree,standard,none,89,55,56,200,Nature Learning
1,female,group C,college,standard,completed,55,63,72,190,Nature Learning
2,female,group B,master's degree,standard,none,64,71,56,191,Nature Learning


In [209]:

p_value=stats.chi2_contingency(pd.crosstab(data['education'],data['gender'])).pvalue

In [210]:
if p_value<.05:
    print("Reject H0")
else:
    print("Fail to reject H0")

Fail to reject H0


In [214]:
stats.chi2_contingency([[93, 51], [68, 40]],correction=False)

Chi2ContingencyResult(statistic=0.07023411371237459, pvalue=0.790996215494177, dof=1, expected_freq=array([[92., 52.],
       [69., 39.]]))
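
The `expected_freq` array in this output can be reproduced by hand: each expected cell is (row total × column total) / grand total. A minimal sketch for the same 2×2 table, without the continuity correction:

```python
import numpy as np

table = np.array([[93, 51], [68, 40]])

row_totals = table.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = table.sum(axis=0, keepdims=True)   # shape (1, 2)
expected = row_totals * col_totals / table.sum()

# Pearson chi-square statistic without continuity correction
chi2_stat = ((table - expected) ** 2 / expected).sum()

print(expected)
print(f"chi2 = {chi2_stat:.6f}")
```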

### Example:

#### 1. Total marks in aptitude exam are recorded for students with different race/ethnicity. Test whether all the races/ethnicities have an equal average score with 0.05 level of significance. 
Use the performance dataset of students available in the CSV file `students_data.csv`.

In [218]:
data=pd.read_csv('students_data.csv')
data.head(2)

Unnamed: 0,gender,ethnicity,education,lunch,test_prep_course,math_score,reading_score,writing_score,total_score,training_institute
0,female,group B,bachelor's degree,standard,none,89,55,56,200,Nature Learning
1,female,group C,college,standard,completed,55,63,72,190,Nature Learning


The null and alternative hypotheses are:

H0: The average score of all races/ethnicities is the same

H1: At least one race/ethnicity has a different average score

In [217]:
# H0: the average score of all races/ethnicities is the same
# H1: at least one race/ethnicity has a different average score

In [220]:
data['ethnicity'].unique()

array(['group B', 'group C', 'group A', 'group D', 'group E'],
      dtype=object)

In [223]:
grp_a=data[data['ethnicity']=='group A']['total_score'].values
grp_b=data[data['ethnicity']=='group B']['total_score'].values
grp_c=data[data['ethnicity']=='group C']['total_score'].values
grp_d=data[data['ethnicity']=='group D']['total_score'].values
grp_e=data[data['ethnicity']=='group E']['total_score'].values


In [227]:
# Assumptions
# part 1: normal distribution
p_value=stats.shapiro(data['total_score']).pvalue

if p_value<.05:
    print("Reject H0:Data is not normal")
else:
    print("Fail to reject H0:data is normal")
      
# part 2: have equal variance    
p_value=stats.levene(grp_a,grp_b,grp_c,grp_d,grp_e).pvalue

if p_value<.05:
    print("Reject H0:Variances are not equal")
else:
    print("Fail to reject H0:variances are equal")


Fail to reject H0:data is normal
Fail to reject H0:variances are equal


In [241]:
p_value=stats.f_oneway(grp_a,grp_b,grp_c,grp_d,grp_e).pvalue
if p_value<.05:
    print("Reject H0")
else:
    print("Fail to reject H0")

Fail to reject H0


In [231]:
import statsmodels.formula.api as sfa
import statsmodels.api as sma 

In [240]:
model=sfa.ols('total_score ~ ethnicity ',data=data).fit()
sma.stats.anova_lm(model)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
ethnicity,4.0,1699.671655,424.917914,0.78911,0.532294
Residual,995.0,535785.303345,538.477692,,
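
The F statistic in the ANOVA table is the ratio of the between-group mean square to the residual mean square, and `PR(>F)` is the right-tail probability of the F distribution with (4, 995) degrees of freedom. The sketch below recomputes both from the table values:

```python
from scipy import stats

# Values copied from the ANOVA table above
ms_between = 424.917914   # mean_sq for ethnicity
ms_within = 538.477692    # mean_sq for Residual
df_between, df_within = 4, 995

f_stat = ms_between / ms_within
p_value = stats.f.sf(f_stat, df_between, df_within)  # right-tail probability

print(f"F = {f_stat:.5f}, p = {p_value:.6f}")
```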


In [242]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [244]:
tukey_hsd=pairwise_tukeyhsd(data['total_score'],data['ethnicity'])

In [248]:
print(tukey_hsd)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower   upper  reject
------------------------------------------------------
group A group B   2.8789 0.8689 -5.2357 10.9936  False
group A group C  -0.8712 0.9979 -8.4401  6.6978  False
group A group D    0.536 0.9997 -7.2158  8.2878  False
group A group E   0.6143 0.9997 -7.9535  9.1821  False
group B group C  -3.7501 0.3957 -9.5614  2.0612  False
group B group D  -2.3429 0.8276 -8.3905  3.7046  False
group B group E  -2.2647 0.9057 -9.3279  4.7986  False
group C group D   1.4072 0.9504 -3.8857     6.7  False
group C group E   1.4854   0.97 -4.9435  7.9143  False
group D group E   0.0783    1.0 -6.5649  6.7215  False
------------------------------------------------------


In [249]:
data.head()

Unnamed: 0,gender,ethnicity,education,lunch,test_prep_course,math_score,reading_score,writing_score,total_score,training_institute
0,female,group B,bachelor's degree,standard,none,89,55,56,200,Nature Learning
1,female,group C,college,standard,completed,55,63,72,190,Nature Learning
2,female,group B,master's degree,standard,none,64,71,56,191,Nature Learning
3,male,group A,associate's degree,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,college,standard,none,75,66,51,192,Nature Learning


In [250]:
stats.pearsonr(data['math_score'],data['total_score'])

PearsonRResult(statistic=0.6626368421809106, pvalue=1.901379415273251e-127)
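
Pearson's r is the covariance of the two variables divided by the product of their standard deviations. In the rows shown above, `total_score` equals the sum of the three subject scores, so part of this correlation arises by construction. The sketch below illustrates both points with randomly generated, hypothetical scores.

```python
import numpy as np
from scipy import stats

# Hypothetical, independently generated subject scores (not the CSV data)
rng = np.random.default_rng(1)
math_score = rng.normal(65, 15, size=200)
reading_score = rng.normal(68, 14, size=200)
writing_score = rng.normal(67, 15, size=200)
total_score = math_score + reading_score + writing_score

# r = cov(x, y) / (sd(x) * sd(y)), using the same ddof throughout
cov_xy = np.cov(math_score, total_score, ddof=1)[0, 1]
r_manual = cov_xy / (math_score.std(ddof=1) * total_score.std(ddof=1))

r_scipy, p_value = stats.pearsonr(math_score, total_score)
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}")
```

Even though the three subject scores here are generated independently, the correlation between one subject and the total is clearly positive because that subject is part of the total.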