<b><font color = blue size=4><center>Definitions </center></font></b>

<b> Z - Distribution : </b> The Z-distribution is a normal distribution with mean zero and standard deviation 1<br>
![z-dist.PNG](attachment:z-dist.PNG)

<b>Z-Score:</b> In statistics, the standard score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores.

z = (x – μ) / σ 

For example, let's say you have a test score of 190. The test has a mean (μ) of 150 and a standard deviation (σ) of 25. Assuming a normal distribution, your z score would be: z = (x – μ) / σ = (190-150)/25 = 40/25 = 1.6
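The arithmetic above can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the last step assumes the scores are normally distributed, as in the example.

```python
from statistics import NormalDist

x = 190      # observed test score
mu = 150     # mean of the test scores
sigma = 25   # standard deviation of the test scores

# z-score: distance from the mean in units of standard deviations
z = (x - mu) / sigma
print(z)

# Share of scores expected to fall below x under a normal distribution
share_below = NormalDist().cdf(z)
print(round(share_below, 4))
```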
<br>


<b>t-distribution:</b> The t distribution (also called Student’s t distribution) is a family of distributions that look almost identical to the normal distribution curve, only a bit shorter and with fatter tails. The t distribution is used instead of the normal distribution when you have small samples and the population standard deviation is unknown (for more on this, see: t-score vs. z-score). The larger the sample size, the more the t distribution looks like the normal distribution. In fact, for sample sizes larger than 20 (i.e. more degrees of freedom), the distribution is almost exactly like the normal distribution.<br>
![t-dist.PNG](attachment:t-dist.PNG)
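To see how quickly the t distribution approaches the normal distribution, one can compare critical values. The sketch below assumes SciPy is installed (it is not imported elsewhere in this notebook) and uses the two-sided 95% critical value purely as an illustration.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.975)

# The t critical value shrinks toward z_crit as the degrees of freedom grow
t_crits = {df: stats.t.ppf(0.975, df) for df in (5, 20, 100, 1000)}

print(f"z: {z_crit:.3f}")
for df, t_crit in t_crits.items():
    print(f"t (df={df}): {t_crit:.3f}")
```

The fatter tails of the t distribution show up as larger critical values for small degrees of freedom.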


<b>t-score:</b> When you have fewer than n = 20 observations in your sample, you should use t-score calculations rather than a z-score to analyze your data. As n grows larger, the distribution of t-scores comes to approximate that of z-scores, because a larger sample estimates the population standard deviation more precisely.<br>
The t distribution (aka, Student’s t-distribution) is a probability distribution that is used to estimate population parameters when the sample size is small and/or when the population variance is unknown.<br>
![t-distrbution.PNG](attachment:t-distrbution.PNG)

In simple terms, the larger the t-score, the larger the difference is between the groups you are testing<br>
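For reference, the one-sample t-score has the same form as the z-score above, with the sample standard deviation $s$ standing in for σ and the standard error $s/\sqrt{n}$ as the denominator. This is the standard textbook form, stated here as a reminder rather than taken from this notebook:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Here $\bar{x}$ is the sample mean, $\mu$ the hypothesized population mean and $n$ the sample size. The statistic is compared against a t distribution with $n - 1$ degrees of freedom.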

<b>The F-Statistic:</b> Variation Between Sample Means / Variation Within the Samples. The F-statistic is the test statistic for F-tests. In general, an F-statistic is a ratio of two quantities that are expected to be roughly equal under the null hypothesis, which produces an F-statistic of approximately 1 <br>

![F-dist.PNG](attachment:F-dist.PNG)
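The ratio in the definition can be computed by hand for a one-way comparison of group means. The sketch below uses made-up measurements for three hypothetical groups and assumes NumPy and SciPy are available; it illustrates the ratio and is not an analysis of any data in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
groups = [
    np.array([4.1, 5.0, 4.7, 5.3, 4.4]),
    np.array([5.9, 6.2, 5.5, 6.8, 6.1]),
    np.array([4.9, 5.1, 5.6, 4.6, 5.3]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation between sample means (mean square between groups)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Variation within the samples (mean square within groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_stat = ms_between / ms_within
p_value = stats.f.sf(f_stat, k - 1, n_total - k)  # upper-tail probability

print(f_stat, p_value)
```

`scipy.stats.f_oneway(*groups)` should give the same statistic and p-value, which makes a convenient cross-check.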



<b>p-value:</b> The p-value measures the strength of the evidence against a null hypothesis. It is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.<br>

P-values are expressed as decimals, although it may be easier to understand them if you convert them to a percentage. For example, a p-value of 0.0254 is 2.54%. This means that, if the null hypothesis were true, there would be only a 2.54% chance of seeing results at least as extreme as yours. That’s pretty tiny. On the other hand, a large p-value of 0.9 (90%) means that results like yours would be very common under the null hypothesis, so they give no reason to doubt it. Note that the p-value is not the probability that the null hypothesis is true. The smaller the p-value, the more statistically significant your results; statistical significance is not the same as practical importance.<br>
![p-value.png](attachment:p-value.png)
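As a small illustration of the definition, a p-value can be read off the standard normal distribution once a z test statistic is known. The sketch below reuses z = 1.6 from the test-score example and relies only on the standard library; which tail to use depends on the alternative hypothesis.

```python
from statistics import NormalDist

z = 1.6  # test statistic, reusing the z-score from the test-score example
std_normal = NormalDist()

# One-sided (upper-tail) p-value: P(Z >= z) when the null hypothesis is true
p_one_sided = std_normal.cdf(-z)

# Two-sided p-value: P(|Z| >= |z|)
p_two_sided = 2 * std_normal.cdf(-abs(z))

print(round(p_one_sided, 4), round(p_two_sided, 4))
```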

<b>Hypothesis testing:</b> Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis.<br>
Step 1: State the Null Hypothesis<br>
Step 2: State the Alternative Hypothesis<br>
Step 3: Set the significance level (alpha)<br>
Step 4: Collect Data<br>
Step 5: Calculate a test statistic<br>
Step 6: Construct Acceptance / Rejection regions<br>
Step 7: Based on steps 5 and 6, draw a conclusion about Null Hypothesis<br>
![null_hyp1.PNG](attachment:null_hyp1.PNG)





Type I error: Rejecting the null hypothesis when it is true (probability alpha)<br>
Type II error: Failing to reject the null hypothesis when it is false (probability beta)
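A short simulation can make the meaning of alpha concrete: when the null hypothesis is true, a test at significance level alpha should reject it in roughly alpha of all samples. The sketch below assumes NumPy is available; the population values (mean 100, known standard deviation 15, samples of size 30) are hypothetical and chosen only for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05                     # significance level (Step 3)
mu0, sigma, n = 100.0, 15.0, 30  # hypothetical null mean, known SD, sample size
n_trials = 10_000

# Rejection region for a two-sided z-test (Step 6)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Draw every sample from a population where the null hypothesis is true
samples = rng.normal(mu0, sigma, size=(n_trials, n))

# Test statistic for each simulated sample (Step 5)
z_stats = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))

# Every rejection here is a Type I error, since the null hypothesis is true
type1_rate = float(np.mean(np.abs(z_stats) > z_crit))
print(type1_rate)
```

The printed rate should land close to alpha, which is exactly the Type I error probability the analyst agreed to accept.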

# Case 1: Hypothesis testing for single proportion

In [None]:
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

<b>Example 1:</b> <br>

In previous years, 52% of parents believed that electronics and social media were the cause of their teenager’s lack of sleep. Do more parents today believe that their teenager’s lack of sleep is caused by electronics and social media? 

**Population**: Parents with a teenager (age 13-18)  
**Parameter of Interest**: p  

**Null Hypothesis:** p = 0.52  
**Alternative Hypothesis:** p > 0.52 (note that this is a one-sided test)

**Data**: 1018 people were surveyed. 56% of those surveyed believe that their teenager’s lack of sleep is caused by electronics and social media. (The confidence level is 95%, i.e. alpha = 0.05.)

### Use of `proportions_ztest()` from `statsmodels`

A test of a single proportion uses the z-statistic, which is based on the normal distribution. We use the `proportions_ztest()` function from the Statsmodels package. Note the argument `alternative="larger"`, indicating a one-sided test. The function returns two values: the z-statistic and the corresponding p-value.<br>

In [1]:
n = 1018 #sample size
pnull = .52
phat = .56

In [4]:
sm.stats.proportions_ztest(phat * n, n, pnull, alternative='larger')

(2.571067795759113, 0.005069273865860533)

In [5]:
n = 101
sm.stats.proportions_ztest(phat * n, n, pnull, alternative='larger')

(0.8098420561098053, 0.20901547893640587)
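To make clear what the function computes, the z-statistic for n = 1018 can be reproduced by hand. By default `proportions_ztest()` builds the standard error from the sample proportion; the textbook version uses the null proportion instead (in statsmodels this corresponds to passing `prop_var=pnull`). The sketch below uses only the standard library and shows both variants.

```python
from math import sqrt
from statistics import NormalDist

n = 1018      # sample size
pnull = 0.52  # proportion under the null hypothesis
phat = 0.56   # observed sample proportion

# Standard error based on the sample proportion (statsmodels default)
se_sample = sqrt(phat * (1 - phat) / n)
z_sample = (phat - pnull) / se_sample
p_sample = NormalDist().cdf(-z_sample)   # upper-tail p-value

# Textbook variant: standard error based on the null proportion
se_null = sqrt(pnull * (1 - pnull) / n)
z_null = (phat - pnull) / se_null
p_null = NormalDist().cdf(-z_null)

print(z_sample, p_sample)
print(z_null, p_null)
```

The first pair should agree with the output shown for n = 1018. The second is slightly smaller in z but leads to the same decision here.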

### Conclusion of the hypothesis test
Since the p-value of the z-test (about 0.005 for n = 1018) is well below 0.05, we reject the null hypothesis that the proportion of parents who believe that their teenager’s lack of sleep is caused by electronics and social media is the same as the previous years' estimate, i.e. 52%.

This supports the alternative hypothesis that the proportion is now greater than 52%. With the smaller sample (n = 101), the same sample proportion gives a p-value of about 0.21, so in that case we would fail to reject the null hypothesis.

# Case 2: Difference in Population Proportions



Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

**Populations**: All parents of black children age 6-18 and all parents of Hispanic children age 6-18  
**Parameter of Interest**: p1 - p2, where p1 = black and p2 = hispanic  

**Null Hypothesis:** p1 - p2 = 0  
**Alternative Hypothesis:** p1 - p2 $\neq$  0  

**Data**: 247 Parents of Black Children. 36.8% of parents report that their child has had some swimming lessons. 
<br>308 Parents of Hispanic Children. 38.9% of parents report that their child has had some swimming lessons.

### Use of `ttest_ind()` from `statsmodels`
A difference in population proportions can be tested with a two-sample t-test on the individual 0/1 responses. Each response follows a Bernoulli distribution (a binomial distribution with a single trial), so we simulate the two samples with `np.random.binomial()` and pass them to the t-test function.

The function returns three values: (a) the test statistic, (b) the p-value of the t-test, and (c) the degrees of freedom used in the t-test.

In [9]:
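# Sample sizes and proportions used for this simulation.
# Note: these differ from the survey figures quoted above (247 at 36.8%, 308 at 38.9%).
n1 = 24
p1 = .11

n2 = 30
p2 = .13

# Simulate individual 0/1 responses (1 = the child has had swimming lessons)
population1 = np.random.binomial(1, p1, n1)
population2 = np.random.binomial(1, p2, n2)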

sm.stats.ttest_ind(population1, population2)

(-0.39165332934047786, 0.6969148213563883, 52.0)

### Conclusion of the hypothesis test
Since the p-value is quite high, we cannot reject the null hypothesis in this case, i.e. the difference in the population proportions is not statistically significant.
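The survey figures quoted in the data description can also be tested directly, without simulation, using a pooled two-proportion z-test. The sketch below is one possible way to do this with the standard library only; the counts are reconstructed by rounding the reported percentages, which is an assumption, since the raw counts are not given.

```python
from math import sqrt
from statistics import NormalDist

n1, n2 = 247, 308           # parents surveyed in each group
x1 = round(0.368 * n1)      # approx. number reporting lessons, group 1
x2 = round(0.389 * n2)      # approx. number reporting lessons, group 2

p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled proportion under the null hypothesis p1 - p2 = 0
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se
p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided test

print(z, p_value)
```

The p-value comes out far above 0.05, so this deterministic check also gives no reason to reject the null hypothesis.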

### But what happens if we could survey a much higher number of people?
We keep the proportions close to the survey values (0.37 and 0.39) and raise the number of survey participants in each group to 5,000. A slight difference in proportions can become statistically significant in this situation. There is no guarantee that you will get a p-value < 0.05 every time you run the code, as the samples are randomly generated each time. But if you run it a few times, you will see p-values < 0.05 in a good share of the runs.

In [3]:
n1 = 5000
p1 = .37

n2 = 5000
p2 = .39

population1 = np.random.binomial(1, p1, n1)
population2 = np.random.binomial(1, p2, n2)

sm.stats.ttest_ind(population1, population2)


(-2.601060449806657, 0.009307294604281677, 9998.0)
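How often a run of this size comes out significant can be estimated by repeating the simulation many times. The sketch below assumes NumPy is available and, for speed, uses a pooled two-proportion z-test on simulated counts instead of `ttest_ind()`; with samples this large the two tests should behave almost identically. The seed and the number of runs are arbitrary choices.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
n1 = n2 = 5000
p1, p2 = 0.37, 0.39
n_runs = 2000

# Number of "yes" answers in each simulated survey
x1 = rng.binomial(n1, p1, size=n_runs)
x2 = rng.binomial(n2, p2, size=n_runs)

# Pooled two-proportion z statistic for every run
p_pool = (x1 + x2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (x1 / n1 - x2 / n2) / se

# Two-sided test at alpha = 0.05
z_crit = NormalDist().inv_cdf(0.975)
share_significant = float(np.mean(np.abs(z) > z_crit))
print(share_significant)
```

The printed share is an estimate of the test's power for this sample size and this difference in proportions.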

# Case 3: Single population mean

Let's say a cartwheeling competition was organized for some adults. The data look like the following:

(80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01)

Is the average cartwheel distance (in inches) for adults more than 80 inches?

**Population**: All adults  
**Parameter of Interest**: $\mu$, population mean cartwheel distance.

**Null Hypothesis:** $\mu$ = 80 
<br>**Alternative Hypothesis**: $\mu$ > 80

**Data**:
<br>25 adult participants. 
<br>$\bar{x} = 83.84$ (sample mean)
<br>$s = 10.72$ (sample standard deviation)

In [None]:
cwdata = np.array([80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 
                 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01])
n = len(cwdata)
mean = cwdata.mean()
sd = cwdata.std()
(n, mean, sd)

sm.stats.ztest(cwdata, value = 80, alternative = "larger")

### Conclusion of the hypothesis test
Since the p-value (0.0394) is lower than the standard significance level of 0.05, we can reject the null hypothesis that the mean cartwheel distance for adults (a population quantity) is equal to 80 inches. The data support the alternative hypothesis that the mean cartwheel distance is, in fact, higher than 80 inches. Note that we used `alternative="larger"` in the z-test.

We can also plot the histogram of the data to check if it approximately follows a Normal distribution.
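With only 25 observations and an unknown population standard deviation, the definitions at the top of this notebook point to a t-test, not a z-test. The sketch below shows one way to run it by hand, assuming SciPy is installed; it is offered as a complement to the z-test above, not as a replacement for the original analysis.

```python
import numpy as np
from scipy import stats

cwdata = np.array([80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25,
                   91.21, 82.7, 73.54, 81.99, 54.01, 82.89, 75.88, 98.32,
                   107.2, 85.53, 79.08, 84.3, 89.32, 86.35, 78.98, 92.26,
                   87.01])

mu0 = 80                    # mean under the null hypothesis
n = len(cwdata)
s = cwdata.std(ddof=1)      # sample SD with the n - 1 denominator

# One-sample t statistic and its upper-tail p-value with n - 1 degrees of freedom
t_stat = (cwdata.mean() - mu0) / (s / np.sqrt(n))
p_value = stats.t.sf(t_stat, df=n - 1)

print(t_stat, p_value)
```

Because the t distribution has fatter tails than the normal distribution, the p-value here is somewhat larger than the one from the z-test.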

# Case 4: Difference in population means

Considering adults in the [NHANES data](https://www.cdc.gov/nchs/nhanes/index.htm), is there a significant difference between the mean [Body Mass Index](https://www.cdc.gov/healthyweight/assessing/bmi/index.html) of males and that of females?

**Population**: Adults in the NHANES data.  
**Parameter of Interest**: $\mu_1 - \mu_2$, Body Mass Index.  

**Null Hypothesis:** $\mu_1 = \mu_2$  
**Alternative Hypothesis:** $\mu_1 \neq \mu_2$

**Data:**

2976 Female Adults  
$\mu_1 = 29.94$  
$\sigma_1 = 7.75$  

2759 Male Adults  
$\mu_2 = 28.78$  
$\sigma_2 = 6.25$  

$\mu_1 - \mu_2 = 1.16$

In [None]:
url = "https://raw.githubusercontent.com/kshedden/statswpy/master/NHANES/merged/nhanes_2015_2016.csv"
da = pd.read_csv(url)
da.head()

In [None]:
females = da[da["RIAGENDR"] == 2]
male = da[da["RIAGENDR"] == 1]

In [None]:
n1 = len(females)
mu1 = females["BMXBMI"].mean()
sd1 = females["BMXBMI"].std()

(n1, mu1, sd1)

In [None]:
n2 = len(male)
mu2 = male["BMXBMI"].mean()
sd2 = male["BMXBMI"].std()

(n2, mu2, sd2)

In [None]:
sm.stats.ztest(females["BMXBMI"].dropna(), male["BMXBMI"].dropna(),alternative='two-sided')

### Conclusion of the hypothesis test
Since the p-value (6.59e-10) is extremely small, we can reject the null hypothesis that the mean BMI of males is the same as that of females. Note that we used `alternative="two-sided"` in the z-test because here we are checking for inequality.

We can also plot the histogram of the data to check if it approximately follows a Normal distribution.
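As a rough cross-check, the z-statistic can be approximated from the summary statistics listed in the data description. The sketch below uses only the standard library and an unpooled standard error. It will not match `sm.stats.ztest()` exactly, since that function pools the variances by default and the listed group sizes may include rows with missing BMI.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in the data description
n1, mean1, sd1 = 2976, 29.94, 7.75   # females
n2, mean2, sd2 = 2759, 28.78, 6.25   # males

# Standard error of the difference in means (unpooled variances)
se_diff = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

z = (mean1 - mean2) / se_diff
p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided test

print(z, p_value)
```

Both approaches give a z-statistic far in the tail, so the conclusion does not depend on which variance estimate is used.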

In [None]:
plt.figure(figsize=(7,4))
plt.title("Female BMI histogram",fontsize=16)
plt.hist(females["BMXBMI"].dropna(),edgecolor='k',color='pink',bins=25)
plt.show()

plt.figure(figsize=(7,4))
plt.title("Male BMI histogram",fontsize=16)
plt.hist(male["BMXBMI"].dropna(),edgecolor='k',color='blue',bins=25)
plt.show()

Example: A phone company claims that 43% of smartphone users have an iPhone. You doubt this claim, so you conduct a survey of 83 smartphone users; 44 of them use an iPhone. What can you conclude if alpha = 0.05? <br>
Solution: <br>
H0 : P = 0.43, Ha : P != 0.43 <br>
Because the alternative hypothesis uses a "not equal to" sign, we need a two-tailed test (both directions)<br>
Distribution of the proportion : Phat ~ N(0.43,sqrt((0.43 * 0.57)/83))  ~ N(0.43,0.0543)<br>
Phat = X /n  = 44/83 = 0.53 <br>

Z = (Phat - P) / SE  = (0.53 - 0.43) / 0.0543  = 1.84 <br>
p-Value = 0.0329 + 0.0329 = 0.0658<br>
If the true proportion of iPhone users were 43%, there would be a 6.58% chance of observing a sample proportion at least this far from 43% <br>
Decision: Fail to Reject the Null Hypothesis <br>
Reason : P-value > alpha ( 0.0658 > 0.05) <br>
Conclusion : There is not enough evidence to conclude the proportion of iPhone users is different from 43% <br>
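The hand calculation above can be verified in Python. This is a minimal sketch with the standard library; because it does not round the intermediate values, the p-value differs slightly from the 0.0658 obtained from the z-table.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.43          # claimed proportion (null hypothesis)
n, x = 83, 44      # survey size and number of iPhone users
alpha = 0.05

phat = x / n
se = sqrt(p0 * (1 - p0) / n)     # standard error under the null hypothesis
z = (phat - p0) / se
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tailed test

reject_null = p_value < alpha
print(round(z, 2), round(p_value, 4), reject_null)
```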





Assignment 1: It is claimed that the average page in a novel has 275 words. To test this claim, you sample 24 pages of the novel and find that the average page has 260 words with a standard deviation of 34 words. Do you believe the claim is true if alpha = 0.05? <br>

Assignment 2: You want to know if there is a difference in GPA between online students and face-to-face (f2f) students. You survey 32 online students who have an average GPA of 3.45 with a standard deviation of 0.7. You also interview 41 f2f students who have an average GPA of 3.67 with an SD of 0.4. If alpha = 0.10, can you conclude the groups are different?<br>

Assignment 3: A researcher wants to verify a claim about her community that 40% of the residents speak Spanish at home, 10% speak Russian, 45% speak English and 5% speak other languages. In a survey of 200 community members, 71 speak Spanish, 2 speak Russian, 102 speak English, and 4 speak other languages. If alpha = 0.05, can the researcher conclude the claimed distribution is accurate?

Assignment 4: A customer wants to know how the cost of school supplies varies from store to store. A teacher claims the SD is only \$15. The customer surveys 43 stores and finds a mean of \$84 and an SD of \$12. Test if the SD is less than the teacher's claim of \$15, if alpha = 0.05
