## Table of Contents
1. **[Small Sample Test](#t)**
    - 1.1 - **[One Sample t Test](#1t)**
2. **[Z Proportion Test](#prop)**
    - 2.1 - **[One Sample Test](#1_p)**

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [12]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

In [13]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

### Example:


#### 1. A survey claims that in a math test female students tend to score fewer marks than the average marks of 75 out of 100. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.

Use the dataset available in the CSV file `mathscore_1ttest.csv`.

In [14]:
# read the students performance data 
df_female_scores = pd.read_csv('totalmarks_2ttest (1).csv')

# display the first two observations
df_female_scores.head(2)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,male,group E,standard,completed,84,83,78,245,Speak Global Learning
1,male,group C,free/reduced,completed,79,77,75,231,Speak Global Learning


In [15]:
# check the dimensions (rows, columns) of the data
df_female_scores.shape

(33, 9)

The null and alternative hypotheses are:

H<sub>0</sub>: $\mu \geq 75$<br>
H<sub>1</sub>: $\mu < 75$

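Here ⍺ = 0.1 and the test is one-tailed (left-tailed). A sample of 24 students would give 23 degrees of freedom; note that the data loaded here contain 18 female students, so the t-test below reports df = 17. Let us calculate the critical value.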

In [16]:
# hypothesized population mean
mew=75
# confidence level and the corresponding level of significance
cl=0.9
alpha=1-cl

In [17]:
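# left-tail critical value at alpha = 0.1
# note: this is the standard normal (z) quantile, not the t quantile
cv=stats.norm.ppf(0.1)
cv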

-1.2815515655446004
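
The cell above returns the standard normal quantile. As a minimal sketch, assuming the 18 observations in this sample (df = 17), the t critical value for the same left-tailed test could be obtained with `stats.t.ppf`. The variable names `z_crit` and `t_crit` are illustrative and not part of the original notebook.

```python
from scipy import stats

alpha = 0.1   # level of significance from the example
df = 17       # assumption: n - 1 for the 18 female students in the data

# left-tail critical values under the normal and the t distribution
z_crit = stats.norm.ppf(alpha)
t_crit = stats.t.ppf(alpha, df=df)

print('z critical value:', z_crit)
print('t critical value:', t_crit)
```

The t quantile lies further from zero than the z quantile because the t distribution has heavier tails for small samples.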

In [18]:
val=df_female_scores.loc[df_female_scores['gender']=='female','math score']
val

3     76
5     70
8     62
9     60
12    53
14    46
15    62
16    69
17    66
18    67
19    68
20    68
23    42
25    60
27    52
28    48
29    56
30    41
Name: math score, dtype: int64

In [19]:
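# calculate the sample mean
x_bar=val.mean()
# standard deviation of all math scores (ddof=0), treated here as the population sigma
sigma=df_female_scores['math score'].std(ddof=0)

# ddof=1 in numpy considers n-1 in the denominator (sample standard deviation)
samplesd=np.std(val,ddof=1)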

In [20]:
# sample size
n=val.shape[0]
# z-type statistic based on sigma; the t statistic would use samplesd instead
teststats=(x_bar-mew)/(sigma/np.sqrt(n))
print('Teststats:',teststats)

Teststats: -5.011251698280098


In [21]:
# One sample t test
stats.ttest_1samp(val,mew,alternative='less')

TtestResult(statistic=-6.50333753824416, pvalue=2.7048385057334625e-06, df=17)

In [26]:
p_value=stats.t.cdf(-6.50333753824416,df=n-1)
p_value

2.7048385057334625e-06

In [25]:
if p_value<alpha:
    print('reject')
else:
    print('Fail to reject')

reject
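
To make the link between the formula and `stats.ttest_1samp` explicit, the following sketch recomputes the t statistic by hand from the 18 scores printed above, using the sample standard deviation (n-1 in the denominator). The names `scores`, `mu_0`, `t_manual` and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

# math scores of the female students, copied from the output above
scores = np.array([76, 70, 62, 60, 53, 46, 62, 69, 66,
                   67, 68, 68, 42, 60, 52, 48, 56, 41])
mu_0 = 75  # hypothesized population mean

n = scores.size
x_bar = scores.mean()
s = scores.std(ddof=1)  # sample standard deviation

# t statistic and left-tailed p-value computed from the formula
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))
p_manual = stats.t.cdf(t_manual, df=n - 1)

# the same test using scipy
result = stats.ttest_1samp(scores, mu_0, alternative='less')

print('manual:', t_manual, p_manual)
print('scipy :', result.statistic, result.pvalue)
```

Both approaches should agree, which also shows why the z-type statistic computed earlier with `sigma` differs from the t statistic.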


<a id="prop"></a>
# 2. Z Proportion Test

<a id="1_p"></a>
## 2.1 One Sample Test

Perform one sample Z test for the population proportion. We compare the population proportion ($P$) with a specific value ($P_{0}$).

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: P = P_{0}$ or $P \geq P_{0}$ or $P \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P \neq P_{0}$ or $P < P_{0}$ or $P > P_{0}$</strong></p>

The test statistic for the proportion Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{p -  P_{0}}{\sqrt{\frac{P_{0}(1-P_{0})}{n}}}$</strong></p>

Where, <br>
$p$: Sample proportion<br>
$P_{0}$: Hypothesized population proportion<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a standard normal distribution.
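
The test statistic and its p-value can be wrapped in a small helper. This is a sketch only: `one_sample_prop_ztest` is a hypothetical function written for illustration (not a library API), and the numbers in the usage lines are made up rather than taken from the dataset.

```python
import numpy as np
from scipy import stats

def one_sample_prop_ztest(count, n, p0, alternative='two-sided'):
    """Hypothetical helper: one sample Z test for a proportion.

    count: number of successes, n: sample size, p0: hypothesized proportion.
    """
    p_hat = count / n
    # standard error under H0 uses the hypothesized proportion
    se0 = np.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    if alternative == 'larger':        # H1: P > P0
        p_value = stats.norm.sf(z)
    elif alternative == 'smaller':     # H1: P < P0
        p_value = stats.norm.cdf(z)
    elif alternative == 'two-sided':   # H1: P != P0
        p_value = 2 * stats.norm.sf(abs(z))
    else:
        raise ValueError("alternative must be 'two-sided', 'larger' or 'smaller'")
    return z, p_value

# illustrative numbers (not from the dataset): 60 successes in 100 trials, P0 = 0.5
z, p = one_sample_prop_ztest(60, 100, 0.5, alternative='larger')
print('z:', z, 'p-value:', p)
```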

### Example:

#### 1. In previous years, people believed that at most 80% of male students score more than 50 marks out of 100 in Mathematics. Perform a test to check whether this percentage is more than 80. Consider the level of significance as 0.05.

Consider the sample of math scores of male students available in the CSV file `StudentsPerformance.csv`.

In [5]:
# read the students performance data 
df_student = pd.read_csv('StudentsPerformance (2).csv')

# display the first two observations
df_student.head(2)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning


In [7]:
df_student['gender'].unique()

array(['female', 'male'], dtype=object)

In [8]:
males=df_student.loc[df_student.gender=='male']
males.shape

(483, 9)

The null and alternative hypotheses are:

H<sub>0</sub>: $P \leq 0.8$<br>
H<sub>1</sub>: $P > 0.8$ 

Here ⍺ = 0.05. For a one-tailed (right-tailed) test, calculate the critical z-value.

In [27]:
hypo_prop=0.8
sample_prop=(males[males['math score']>50].shape[0])/males.shape[0]
n=males.shape[0]

In [29]:
cv=stats.norm.isf(0.05)
cv

1.6448536269514729

In [34]:
teststats=(sample_prop-hypo_prop)/(np.sqrt(hypo_prop*(1-hypo_prop)/n))
teststats

4.163394160018601

In [39]:
p_value=1-stats.norm.cdf(teststats)
p_value

1.5677570141203745e-05

In [40]:
if p_value<0.05:
    print('reject')
else:
    print('Fail to reject')

reject
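
The critical-value approach leads to the same decision as the p-value approach. A minimal sketch, using the sample proportion, hypothesized proportion and sample size reported in this example (the names `z_stat`, `z_crit` and `reject` are illustrative):

```python
import numpy as np
from scipy import stats

# values reported in this example
samp_prop = 0.8757763975155279   # share of male students scoring more than 50
hypo_prop = 0.8                  # hypothesized proportion
n = 483                          # number of male students
alpha = 0.05

# test statistic and right-tail critical value
z_stat = (samp_prop - hypo_prop) / np.sqrt(hypo_prop * (1 - hypo_prop) / n)
z_crit = stats.norm.isf(alpha)

# right-tailed test: reject H0 when the statistic exceeds the critical value
reject = z_stat > z_crit
print('z statistic:', z_stat)
print('critical value:', z_crit)
print('reject H0:', reject)
```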


In [16]:
# using the male subset, apply a condition:
# count the students whose math score is greater than 50
count_80=males.loc[males['math score']>50].shape[0]
samp_prop= count_80/males.shape[0]
hyp_prop=0.80
n=males.shape[0]

In [8]:
samp_prop

0.8757763975155279

In [58]:
stats.norm.isf(0.05)

1.6448536269514729

In [85]:
# Test Statistic= (samp_prop-hyp_prop)/sqrt(hyp_prop*(1-hyp_prop)/n)
n=males.shape[0]
teststats=(samp_prop-hyp_prop)/np.sqrt((hyp_prop*(1-hyp_prop))/n)
print('Teststats',teststats)

Teststats 4.163394160018601


In [62]:
# p-value for the right-tailed test
1-stats.norm.cdf(teststats)

# the p-value is below 0.05, so we reject H0: the data suggest that more than 80% of male students score more than 50 marks

1.5677570141203745e-05

In [68]:
# calculate the 95% interval around the sample proportion
# note: the standard error here uses the hypothesized proportion (hyp_prop)
stats.norm.interval(0.95,loc=samp_prop,scale=np.sqrt((hyp_prop*(1-hyp_prop))/n))

(0.8401038178124423, 0.9114489772186136)
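
The interval above uses the hypothesized proportion in the standard error. A commonly used alternative is the Wald interval, which plugs the sample proportion into the standard error instead. The sketch below assumes the sample proportion and sample size reported in this example; `se_wald` and `wald_ci` are illustrative names.

```python
import numpy as np
from scipy import stats

samp_prop = 0.8757763975155279   # sample proportion reported above
n = 483                          # number of male students

# Wald standard error: based on the sample proportion, not the hypothesized one
se_wald = np.sqrt(samp_prop * (1 - samp_prop) / n)

# 95% two-sided Wald confidence interval
wald_ci = stats.norm.interval(0.95, loc=samp_prop, scale=se_wald)
print('Wald 95% CI:', wald_ci)
```

Because the sample proportion is further from 0.5 than the hypothesized 0.8, the Wald standard error is smaller and the interval is narrower than the one computed with `hyp_prop`.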

#### 2. A sample of 361 business owners who had gone into bankruptcy due to the recession was surveyed. It was found that 105 of them had not consulted any professional for managing their finances before opening the business. Test the null hypothesis that at most 25% of all businesses had not consulted a professional before opening. Test the claim using the p-value technique. Use α = 0.05.

The null and alternative hypotheses are:

H<sub>0</sub>: $P \leq 0.25$<br>
H<sub>1</sub>: $P > 0.25$ 

In [27]:
n=361
samp_prop2=105/361
hyp_prop2=0.25

In [77]:
teststats=(samp_prop2-hyp_prop2)/np.sqrt((hyp_prop2*(1-hyp_prop2))/n)
print('Teststats',teststats)

Teststats 1.7928245201151534


In [81]:
# p-value for the right-tailed test
1-stats.norm.cdf(teststats)

0.03650049373124953

In [80]:
# 95% interval around the sample proportion (standard error uses hyp_prop2)
stats.norm.interval(0.95,loc=samp_prop2,scale=np.sqrt((hyp_prop2*(1-hyp_prop2))/n))

(0.24619086783771343, 0.33552658368583227)

Here the p-value is less than 0.05. Thus, we reject the null hypothesis and conclude that more than 25% of all businesses had not consulted a professional before starting the business.

## Basic Confidence Interval
* The 95% interval (0.246, 0.336) contains 0.25, which appears to contradict the rejection of H<sub>0</sub>. Note that a one-tailed test at α = 0.05 corresponds to a 90% two-sided interval, so the two results need not agree.

* The p-value (0.0365) leads to rejecting H<sub>0</sub> at the 5% level but to failing to reject (FTR) at the 1% level, so the conclusion is sensitive to the chosen level of significance.

* We need more data to generate an accurate result.
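
The relationship between the one-tailed test and the interval can be checked numerically. The sketch below reuses the numbers of this example and, like the cells above, builds the intervals with the standard error under H<sub>0</sub>; the names `se0`, `ci_95`, `ci_90` and `lower_bound_95` are illustrative.

```python
import numpy as np
from scipy import stats

n = 361
samp_prop2 = 105 / 361
hyp_prop2 = 0.25

# standard error under H0, as used in the cells above
se0 = np.sqrt(hyp_prop2 * (1 - hyp_prop2) / n)

# two-sided intervals at 95% and 90%
ci_95 = stats.norm.interval(0.95, loc=samp_prop2, scale=se0)
ci_90 = stats.norm.interval(0.90, loc=samp_prop2, scale=se0)

# one-sided 95% lower bound, matching the right-tailed test at alpha = 0.05
lower_bound_95 = samp_prop2 - stats.norm.isf(0.05) * se0

print('95% two-sided:', ci_95)
print('90% two-sided:', ci_90)
print('95% one-sided lower bound:', lower_bound_95)
```

The lower end of the 90% two-sided interval coincides with the one-sided 95% lower bound; comparing that bound with 0.25 reproduces the decision of the right-tailed test.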