In [43]:
# for basic operations
import numpy as np 
import pandas as pd 
# for data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
# Statistics
import scipy.stats as stats

# Sampling Distribution

A sampling distribution is the probability distribution of a statistic (such as the mean) obtained through repeated sampling from a larger population.

The means of all samples drawn from a normally distributed population are themselves normally distributed, with mean equal to the population mean and standard deviation given by

![se.PNG](attachment:se.PNG)

This quantity is also known as the standard error. The lower the standard error, the more precisely the sample mean estimates the population mean.
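As a minimal sketch of this idea, assuming the usual formula $SE = \sigma / \sqrt{n}$ and a made-up normal population (the mean of 50 and standard deviation of 10 are illustrative values, not taken from any dataset in this notebook), we can compare the theoretical standard error with the spread of simulated sample means:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical normally distributed population (illustrative values only)
population_mean, population_sd = 50.0, 10.0
n = 25              # size of each sample
n_samples = 20000   # number of repeated samples

# Draw many samples and record the mean of each one
sample_means = rng.normal(population_mean, population_sd,
                          size=(n_samples, n)).mean(axis=1)

theoretical_se = population_sd / np.sqrt(n)   # sigma / sqrt(n)
empirical_se = sample_means.std(ddof=1)       # spread of the simulated sample means

print("theoretical SE:", theoretical_se)
print("empirical SE:  ", round(empirical_se, 3))
```

The two numbers should be close, and both shrink as `n` grows.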

# Central Limit Theorem

The central limit theorem (CLT) states that, given a sufficiently large sample size from a population with finite variance, the mean of the sample means will be approximately equal to the population mean. Furthermore, the sample means will follow an approximately normal distribution, with variance approximately equal to the population variance divided by the sample size.

The distinguishing and unique feature of the central limit theorem is that irrespective of the shape of the distribution of the original population, the sampling distribution of the mean will approach a normal distribution as the size of the sample increases and becomes large.

![CLT.PNG](attachment:CLT.PNG)
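The effect can be checked with a small simulation. The sketch below assumes a hypothetical, strongly right-skewed population (exponential with mean 2, an arbitrary choice) and summarizes the sample means for a few sample sizes. The helper name `summarize_sample_means` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
scale = 2.0          # exponential population: mean 2, variance 4 (right-skewed)
n_samples = 10000    # number of repeated samples per sample size

def summarize_sample_means(n):
    """Draw n_samples samples of size n and summarize their means."""
    means = rng.exponential(scale, size=(n_samples, n)).mean(axis=1)
    centered = means - means.mean()
    # Skewness is 0 for a perfectly symmetric (e.g. normal) distribution
    skewness = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5
    return means.mean(), means.var(), skewness

results = {n: summarize_sample_means(n) for n in (2, 30, 200)}
for n, (m, v, s) in results.items():
    print(f"n={n:>3}  mean={m:.3f}  variance={v:.4f}  "
          f"sigma^2/n={scale**2 / n:.4f}  skewness={s:.2f}")
```

As `n` increases, the variance of the sample means tracks $\sigma^2/n$ and the skewness moves toward zero, which is what the theorem predicts.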

# Statistical Inference

* A **subset** of the population that we actually observe is called the sample data

* We carry out various tests on the Sample to gain insight on the larger population out there!

* Therefore Statistical inference is the process of analyzing sample data to gain insight into the population from which the data was collected and to investigate differences between different data samples.

The sample mean is usually not exactly the same as the population mean. This difference can be caused by many factors including poor survey design, biased sampling methods and the randomness inherent to drawing a sample from a population.



# Confidence Interval

**Confidence Interval (CI)** is a type of estimate computed from the statistics of the observed data. This proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level that the true parameter is in the proposed range.

![image.png](attachment:image.png)

A 95% confidence interval is produced by a procedure that, over repeated sampling, captures the population mean in about 95% of the intervals it generates; informally, it is a range of values that you can be 95% confident contains the population mean. With large samples, you estimate that mean with much more precision than you do with a small sample, so the confidence interval is narrower when computed from a large sample.
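A minimal sketch of how such an interval could be computed for a mean, assuming a small hypothetical sample (the values below are invented for illustration) and using the t distribution because the population standard deviation is unknown:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample (illustrative values, not from any dataset in this notebook)
sample = np.array([12.1, 9.8, 11.4, 10.9, 10.2, 11.7, 9.5, 10.8, 11.1, 10.5])

n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # estimated standard error of the mean

# 95% confidence interval based on the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se
print(f"mean = {mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Because `se` is divided by $\sqrt{n}$, a larger sample with the same spread gives a narrower interval.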

# Hypothesis Testing

* A *statistical hypothesis* is a statement about a population parameter that is testable on the basis of observing a process that is modeled via a set of random variables. A statistical hypothesis test, sometimes called confirmatory data analysis, is a method of statistical inference.

## Null Hypothesis

* In inferential statistics, **the null hypothesis is a general statement or default position that there is no relationship between two measured phenomena or no association among groups.** Rejecting the null hypothesis lends support to the alternative hypothesis.

* Statistical hypothesis tests are based on a statement called the null hypothesis that assumes nothing interesting is going on between whatever variables you are testing.


## Alternate Hypothesis

* The alternate hypothesis is just an alternative to the null. Basically, you're looking at whether there's enough change (with the alternate hypothesis) to be able to reject the null hypothesis

##  The Null Hypothesis is assumed to be true and Statistical evidence is required to reject it in favor of an Alternative Hypothesis.


1. Once you have the null and alternative hypothesis in hand, you choose a significance level (often denoted by the Greek letter α). The significance level is a probability threshold that determines when you reject the null hypothesis.

2. After carrying out a test, if the probability of getting a result as extreme as the one you observe due to chance is lower than the significance level, you reject the null hypothesis in favor of the alternative.

3. This probability of seeing a result as extreme or more extreme than the one observed is known as the p-value.

## P Value

* In statistical hypothesis testing, **the p-value or probability value** is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct. 

* Suppose we have set the significance level α = 0.05.
* This means that if we see a p-value less than 0.05, we reject the null hypothesis in favor of the alternative.


## Type 1 and Type 2 Error

* In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis, while a type II error is the non-rejection of a false null hypothesis

![Type_err.PNG](attachment:Type_err.PNG)

## Type 1  and Type 2 Error Example

For example, let's look at the trial of an accused criminal. The null hypothesis is that the person is innocent, while the alternative is that the person is guilty.
* A Type 1 error in this case would mean that the person is found guilty and sent to jail, despite actually being innocent.
* A Type 2 error in this case would mean that the person is found not guilty and is not sent to jail, despite actually being guilty.
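The link between the significance level and Type 1 errors can be illustrated with a simulation. The sketch below assumes hypothetical data in which the null hypothesis is true by construction (samples from a normal population with mean 0), so every rejection is a Type 1 error. The sample size of 30 and the 5000 repetitions are arbitrary choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 5000

# Every sample comes from a population whose true mean really is 0,
# so the null hypothesis (mu = 0) is true in every experiment.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, 30))
_, p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Each rejection here is a Type 1 error (a true null was rejected)
type_1_rate = np.mean(p_values < alpha)
print("Observed Type 1 error rate:", type_1_rate)
```

The observed rate should land close to `alpha`, which is why the significance level is often described as the Type 1 error rate we are willing to tolerate.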


## Hypothesis Formulation - Example
An apparel company would like to introduce the product line into a new market area 
- Survey of a random sample of 400 households in that market showed a mean income per household of \$30,000 with standard deviation of \$8,000.
- Company believes the product line will be adequately profitable only in markets where the mean household income is greater than \$29,000. Should Karen introduce the product line into the new market?
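One way to work through this example is a right-tailed z-test, sketched below. It assumes that with n = 400 the sample standard deviation is an adequate stand-in for the population standard deviation, and it formulates the hypotheses as $H_0$: $\mu \leq 29000$ versus $H_A$: $\mu > 29000$. The example itself does not state a significance level, so any threshold used to read the result (for instance 0.05) is an assumption.

```python
import numpy as np
import scipy.stats as stats

# Figures from the example above
n = 400
sample_mean = 30000
sample_sd = 8000
mu_0 = 29000   # threshold income under the null hypothesis

# H0: mu <= 29000   vs   HA: mu > 29000  (right-tailed test)
standard_error = sample_sd / np.sqrt(n)
z_stat = (sample_mean - mu_0) / standard_error
p_value = stats.norm.sf(z_stat)   # right-tail probability P(Z >= z_stat)

print(f"z = {z_stat:.2f}, one-tailed p-value = {p_value:.4f}")
```

Here the standard error is 8000 / 20 = 400, so z = 1000 / 400 = 2.5 and the one-tailed p-value is about 0.006.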

## Hypothesis Formulation

- Is the average waiting time for the  customers of Smart Supermarket at the checkouts greater than 15 minutes?
 
- Is the proportion of households owning Color TVs in Chennai less than 0.4?
 
- Is the average expenditure per household on eating out significantly higher in Bangalore than in Calcutta?
 
- Two random sample surveys, conducted two months apart, assessed public opinion on the outcome. The question posed was “If the general election were to take place tomorrow, would you cast your vote for or against the ruling party?”


## Another way to test: Gosset's (Student's) t-test

* The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland (“Student” was his pen name). Gosset devised the t-test as a way to monitor the quality of stout. He published the test in Biometrika in 1908. 
* A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student’s t distribution.

![T_Test.PNG](attachment:T_Test.PNG)

* The T-test is a statistical test used to determine whether a numeric data sample differs significantly from the population or whether two samples differ from one another.
* A z-test is usually applied when the population standard deviation is known or the sample is large (a common rule of thumb is n > 30), but what if our sample is smaller than that?
* A t-test solves this problem and gives us a way to do a hypothesis test on a smaller sample.
* Now, let's also see if house prices in Stone Brook neighborhood are different from the houses in the rest of the neighborhoods.

## One sample testing

In a one-sample test, we compare a statistic computed from a single sample, such as its mean, against a hypothesized value of the corresponding population parameter.


## Very rarely we know the variance of the population. 

A common strategy to assess a hypothesis is to conduct a t test. A t test can tell whether a sample mean differs from a hypothesized value or whether two groups have the same mean.
A t test can be performed as:
* 1) One sample t test
* 2) Two sample t test (including paired t test)

We assume that the samples are randomly selected, independent and come from a normally distributed population with unknown but equal variances.

## Example 1: T-test (sigma of the population is unknown)

Experian Marketing Services reported that the typical American spends a mean of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device. In order to test the validity of this statement, you select a sample of 30 friends and family. The results for the time spent per day accessing the Internet via mobile device (in minutes) are stored in InternetMobileTime 

a. Is there evidence that the population mean time spent per day accessing the Internet via mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05. 

b. What assumption about the population distribution is needed in order to conduct the t test in (a)?


In [60]:
mydata = pd.read_csv('InternetMobileTime.csv')
mydata.head()

Unnamed: 0,Minutes
0,72
1,144
2,48
3,72
4,36


### Step 1: Define null and alternative hypotheses

We are testing the average number of minutes spent on the Internet.

The null hypothesis states that the mean Internet usage time, $\mu$, is equal to 144.
The alternative hypothesis states that the mean Internet usage time, $\mu$, is not equal to 144.

* $H_0$: $\mu$ = 144
* $H_A$: $\mu$ $\neq$ 144

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05.

In [61]:
print("The sample size for this problem is",len(mydata))

The sample size for this problem is 30


### Step 3: Identify the test statistic

We do not know the population standard deviation and n = 30. So we use the t distribution and the $t_{STAT}$ test statistic.

### Step 4: Calculate the p - value and test statistic

**scipy.stats.ttest_1samp calculates the t test for the mean of one sample given the sample observations and  the expected value in the null hypothesis. This function returns t statistic and the two-tailed p value.**

In [62]:
# one sample t-test
# null hypothesis: expected value = 144
t_statistic, p_value = stats.ttest_1samp(mydata, 144)
print('One sample t test \nt statistic: {0} p value: {1} '.format(t_statistic, p_value))

One sample t test 
t statistic: [1.22467437] p value: [0.23055327] 


### Let us calculate the value of the test statistic manually

Here, we see that the values of test statistic and the p-value is same as that we calculated from the function of the Scipy library.
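A manual calculation along these lines might look like the following sketch. It uses a short hypothetical array of minutes (the first five values match the rows displayed above, the rest are invented) rather than the full InternetMobileTime data, so the numbers will differ from the output above. The point is only that the formula $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ agrees with `scipy.stats.ttest_1samp`.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical minutes-per-day values; the real data is in InternetMobileTime.csv
minutes = np.array([72, 144, 48, 72, 36, 360, 216, 120, 180, 96], dtype=float)
mu_0 = 144  # hypothesized population mean

n = len(minutes)
sample_mean = minutes.mean()
sample_sd = minutes.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)

# t = (x_bar - mu_0) / (s / sqrt(n)), with n - 1 degrees of freedom
t_manual = (sample_mean - mu_0) / (sample_sd / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-tailed p-value

t_scipy, p_scipy = stats.ttest_1samp(minutes, mu_0)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```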

### Step 5: Decide whether to reject or fail to reject the null hypothesis

In [47]:

# p_value < 0.05 => alternative hypothesis:

alpha_value = 0.05 # Level of significance
print('Level of significance: %.2f' %alpha_value)
if p_value < alpha_value: 
    print('We have evidence to reject the null hypothesis since p value < Level of significance')
else:
    print('We have no evidence to reject the null hypothesis since p value > Level of significance') 

print ("Our one-sample t-test p-value=", p_value)

Level of significance: 0.05
We have no evidence to reject the null hypothesis since p value > Level of significance
Our one-sample t-test p-value= [0.23055327]


In this example, the p value is 0.23055327, which is greater than the 5% level of significance.

So the statistical decision is to fail to reject the null hypothesis at the 5% level of significance.

So at the 5% level of significance, there is insufficient evidence to conclude that the mean time spent on the Internet differs from 144 minutes.

## Two sample t test

**Two sample t test (Snedecor and Cochran 1989) is used to determine if two population means are equal.
A common application is to test if a new treatment or approach or process is yielding better results than the current treatment or approach or process.**

* 1) Data is *paired* - For example, a group of students are given coaching classes and effect of coaching on the  marks scored is determined.
* 2) Data is *not paired* - For example, find out  whether the miles per gallon of  cars of Japanese make is superior to cars of Indian make.

### Example 1 - Independent Two Sample T-Test

A hotel manager looks to enhance the initial impressions that hotel guests have when they check in. Contributing to initial impressions is the time it takes to deliver a guest’s luggage to the room after check-in. A random sample of 20 deliveries on a particular day were selected in Wing A of the hotel, and a random sample of 20 deliveries were selected in Wing B. The results are stored in Luggage . Analyze the data and determine whether there is a difference between the mean delivery times in the two wings of the hotel. (Use $\alpha$ = 0.05) 

In [63]:
mydata = pd.read_csv('Luggage.csv')
mydata.head()

Unnamed: 0,WingA,WingB
0,10.7,7.2
1,9.89,6.68
2,11.83,9.29
3,9.04,8.95
4,9.37,6.61


### Step 1: Define null and alternative hypotheses

In testing whether the mean luggage delivery time is the same in both wings of the hotel, the null hypothesis states that the mean delivery times are equal, $\mu_{A}$ equals $\mu_{B}$. The alternative hypothesis states that the mean delivery times are different, $\mu_{A}$ is not equal to $\mu_{B}$.

* $H_0$: $\mu_{A}$ - $\mu_{B}$ = 0, i.e. $\mu_{A}$ = $\mu_{B}$
* $H_A$: $\mu_{A}$ - $\mu_{B}$ $\neq$ 0, i.e. $\mu_{A}$ $\neq$ $\mu_{B}$

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05 and the population standard deviation is not known.

### Step 3: Identify the test statistic

* We have two samples and we do not know the population standard deviation.
* Sample sizes for both samples are  same.
* The sample is not a large sample, n < 30. So you use the t distribution and the $t_{STAT}$ test statistic for two sample unpaired test.

### Step 4: Calculate the p - value and test statistic

**We use scipy.stats.ttest_ind to calculate the t-test for the means of TWO INDEPENDENT samples of scores given the two sample observations. This function returns the t statistic and the two-tailed p value.**

**This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances.**

For this exercise, we are going to first assume that the variance is equal and then compute the necessary statistical values.

In [64]:
t_statistic, p_value  = stats.ttest_ind(mydata['WingA'],mydata['WingB'])
print('tstat',t_statistic)    
print('P Value',p_value)    

tstat 5.16151166403543
P Value 8.007988032535588e-06


### Step 5: Decide whether to reject or fail to reject the null hypothesis

In [50]:
# p_value < 0.05 => alternative hypothesis:
# they don't have the same mean at the 5% significance level
print ("two-sample t-test p-value=", p_value)

alpha_level = 0.05

if p_value < alpha_level:
    print('We have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    print('We conclude that the mean times to deliver luggage in the two wings of the hotel are not the same.')
else:
    print('We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    print('We cannot conclude that the mean times to deliver luggage in the two wings of the hotel differ.')

two-sample t-test p-value= 8.007988032535588e-06
We have enough evidence to reject the null hypothesis in favour of alternative hypothesis
We conclude that the mean times to deliver luggage in the two wings of the hotel are not the same.


Let us now go ahead and check the confidence intervals at a specific $\alpha$ value.
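A confidence interval for the difference in means could be computed as in the sketch below. It assumes equal population variances (the same assumption made by the pooled two-sample t test) and uses short hypothetical arrays (the first five values of each match the rows displayed above, the rest are invented) instead of the full Luggage data.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical delivery times in minutes (the real values are in Luggage.csv)
wing_a = np.array([10.70, 9.89, 11.83, 9.04, 9.37, 11.68, 8.36, 9.76])
wing_b = np.array([7.20, 6.68, 9.29, 8.95, 6.61, 8.53, 8.92, 7.95])
alpha = 0.05

n_a, n_b = len(wing_a), len(wing_b)
diff = wing_a.mean() - wing_b.mean()

# Pooled variance, matching the equal-variance assumption of ttest_ind
dof = n_a + n_b - 2
pooled_var = ((n_a - 1) * wing_a.var(ddof=1) + (n_b - 1) * wing_b.var(ddof=1)) / dof
se_diff = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# (1 - alpha) confidence interval for mu_A - mu_B
t_crit = stats.t.ppf(1 - alpha / 2, dof)
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)
print(f"difference = {diff:.3f}, {100 * (1 - alpha):.0f}% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

If the interval excludes 0, the two-sided test at the same $\alpha$ rejects the null hypothesis of equal means; if it contains 0, the test fails to reject.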

## Example 2 - Paired T-Test

The file Concrete contains the compressive strength, in thousands of pounds per square inch (psi), of 40 samples of concrete taken two and seven days after pouring. 
At the 0.01 level of significance, is there evidence that the mean strength is lower at two days than at seven days?



In [65]:
mydata = pd.read_csv('Concrete.csv')
mydata.head()

Unnamed: 0,Sample,Two Days,Seven Days
0,1,2.83,3.505
1,2,3.295,3.43
2,3,2.71,3.67
3,4,2.855,3.355
4,5,2.98,3.985


### Step 1: Define null and alternative hypotheses

In testing whether the compressive strength of the concrete is lower at two days than at seven days,
* the null hypothesis states that the mean compressive strength of the concrete is not lower at 2 days than at 7 days, $\mu_{2}$ $\geq$ $\mu_{7}$. 
* The alternative hypothesis states that the mean compressive strength of the concrete is lower at 2 days than at 7 days, $\mu_{2}$ < $\mu_{7}$

* $H_0$: $\mu_{2}$ - $\mu_{7}$ $\geq$  0
* $H_A$: $\mu_{2}$ - $\mu_{7}$ <  0

Here, $\mu_2$ denotes the mean compressive strength of the concrete after two days and $\mu_7$ denotes the mean compressive strength of the concrete after seven days.

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.01 as given in the question.

### Step 3: Identify the test statistic

* Sample sizes for both samples are  same.
* We have two paired samples and we do not know the population standard deviation.
* The sample has n = 40 pairs and the population standard deviation is unknown, so we use the t distribution and the $t_{STAT}$ test statistic for a two-sample paired test.

### Step 4: Calculate the p - value and test statistic

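**We use scipy.stats.ttest_rel to calculate the t-test on TWO RELATED samples of scores. By default this is a two-sided test for the null hypothesis that 2 related or repeated samples have identical average (expected) values. Here we give the two sample observations as input. The function returns the t statistic and the two-tailed p value; since our alternative hypothesis is one-sided and the t statistic falls in the direction of the alternative, we halve the p value to obtain the one-tailed p value.**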

In [66]:
# paired t-test: doing two measurments on the same experimental unit
# e.g., before and after a treatment
t_statistic, p_value  =  stats.ttest_rel(mydata['Two Days'],mydata['Seven Days'])
print('tstat  %1.3f' % t_statistic)    
print("p-value for one-tail:", p_value/2)

tstat  -9.372
p-value for one-tail: 7.768158524368873e-12
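Halving the two-tailed p-value is valid here because the t statistic is negative, which is the direction of the alternative hypothesis. As a sketch of an alternative, assuming SciPy 1.6 or later, the one-tailed test can be requested directly through the `alternative` parameter. The arrays below are hypothetical (the first five pairs match the rows displayed above, the rest are invented) and stand in for the Concrete data.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical paired strengths (the real values are in Concrete.csv)
two_days = np.array([2.830, 3.295, 2.710, 2.855, 2.980, 3.065, 3.765, 3.265])
seven_days = np.array([3.505, 3.430, 3.670, 3.355, 3.985, 3.630, 4.570, 3.700])

# Two-sided test, then halve the p-value (valid because t < 0,
# i.e. the observed difference points in the direction of H_A)
t_stat, p_two_sided = stats.ttest_rel(two_days, seven_days)
p_halved = p_two_sided / 2

# The same one-tailed test requested directly (requires SciPy >= 1.6)
t_less, p_less = stats.ttest_rel(two_days, seven_days, alternative='less')

print("t statistic:", t_stat)
print("halved p-value:", p_halved, " alternative='less' p-value:", p_less)
```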


### Step 5: Decide whether to reject or fail to reject the null hypothesis

In [53]:
# one-tailed p_value < 0.01 => reject the null in favour of the alternative:
# mean strength at two days is lower than at seven days
print ("Paired two-sample t-test p-value=", p_value/2)

alpha_level = 0.01

if (p_value/2) < alpha_level:
    print('We have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    
else:
    print('We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    

Paired two-sample t-test p-value= 7.768158524368873e-12
We have enough evidence to reject the null hypothesis in favour of alternative hypothesis


# Chi Square Test

The term "chi-squared test," also written as χ² test, refers to certain types of statistical hypothesis tests that are valid to perform when the test statistic is chi-squared distributed under the null hypothesis. Often, however, the term is used to refer to Pearson's chi-squared test and variants thereof.

***A chi-squared goodness-of-fit test checks whether the distribution of sample categorical data matches an expected distribution.***

For example, 
* *you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match that of the entire population of your country*.
* *you could check whether the computer browser preferences of your friends match those of Internet users as a whole.*

* *When working with categorical data, the values of the observations themselves aren't of much use for statistical testing because categories like "male", "female," and "other" have no mathematical meaning.*
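The browser example could be sketched as follows with `scipy.stats.chisquare`. The counts and population shares below are invented purely for illustration. The null hypothesis is that the friends' preferences follow the population distribution.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical browser counts among 100 friends, and assumed population shares
observed = np.array([45, 30, 15, 10])
population_share = np.array([0.50, 0.25, 0.15, 0.10])
expected = population_share * observed.sum()   # expected counts under H0

# H0: the friends' preferences follow the population distribution
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.3f}")
```

The statistic is the sum of (observed - expected)² / expected over the categories, which for these made-up counts is 0.5 + 1.0 + 0 + 0 = 1.5 on 3 degrees of freedom.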

![image.png](attachment:image.png)

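* **Good Fit**: In a goodness-of-fit test, the null hypothesis is that the observed data follow the expected distribution. If the p-value associated with the chi-square statistic is small, for example 0.002, there is very strong evidence against the null hypothesis, which indicates a poor fit. A large p-value means the data are consistent with the expected distribution, that is, a good fit.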

## Chi-Squared Test of Independence

Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another.

For instance, the month you were born probably doesn't tell you anything about which web browser you use, so we'd expect birth month and browser preference to be independent.

On the other hand, your month of birth might be related to whether you excelled at sports in school, so month of birth and sports performance might not be independent.

The chi-squared test of independence tests whether two categorical variables are independent.

### More Chi-Square Questions

In [67]:
df = pd.DataFrame({'Promoted': [15, 16], 'Not-promoted': [9, 15]}, index = ['Company A', 'Company B'])

In [68]:
df

Unnamed: 0,Promoted,Not-promoted
Company A,15,9
Company B,16,15


### Step 1: Define null and alternative hypotheses

* H0: Promotions are independent of Company type
* H1: Promotions are dependent on Company type

### Step 2: Decide the significance level

Here we select α= 0.05 as per 95% Confidence Level requirement in the question.

### Step 3: Identify the test statistic

This is a Chi-sq Test where categorical data has been reported in raw frequencies

### Step 4: Calculate the p - value and test statistic

In [69]:
chi2, pval, dof, exp_freq = stats.chi2_contingency(df, correction = False)

In [70]:
pval

0.41943105261448455
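To see where this p-value comes from, the statistic can be reproduced by hand. The sketch below rebuilds the same table, computes each expected count as row total × column total / grand total, and sums (observed - expected)² / expected. No continuity correction is applied, matching `correction = False` above.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

df = pd.DataFrame({'Promoted': [15, 16], 'Not-promoted': [9, 15]},
                  index=['Company A', 'Company B'])
observed = df.to_numpy()

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)   # right-tail probability

print(np.round(expected, 2))
print(f"chi2 = {chi2_manual:.4f}, dof = {dof}, p-value = {p_manual:.4f}")
```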

### Step 5: Decide whether to reject or fail to reject the null hypothesis

Since the p-value (0.419) is greater than 0.05, we fail to reject the null hypothesis of independence at the 5% significance level. The data therefore do not provide sufficient evidence that management is biased in favor of employees originally belonging to Company A.