### Binomial Distribution

Heaven Furnitures (HF) sells furniture such as sofas, beds and tables. About 25% of its customers complain about the furniture they purchased, for various reasons. On Tuesday, 20 customers purchased furniture from HF.

In [7]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from scipy.stats import randint
from scipy.stats import skewnorm

# import 'random' to generate a random sample
import random

from statsmodels.stats import weightstats as stests

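from scipy.stats import shapiro
from scipy.stats import chisquare, chi2_contingency

# used in later cells for stdev() and proportions_ztest()
import statistics
import statsmodels.api as sm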

# import the function to calculate the power of test
from statsmodels.stats import power

In [3]:
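# a. Probability that exactly 3 customers will complain: P(X = 3)

prob = stats.binom.pmf(k = 3, n = 20, p = 0.25)
prob = round(prob, 2)
print('The probability that exactly 3 customers will complain about the purchased products is', prob)

# b. Probability that more than 3 customers will complain: P(X > 3)

prob = stats.binom.sf(k = 3, n = 20, p = 0.25)
prob = round(prob, 2)
print('The probability that more than 3 customers will complain about the purchased products is', prob)

# c. Probability that at most 3 customers will complain: P(X <= 3)
# note: cdf(k = 3) includes k = 3; for strictly fewer than 3, use cdf(k = 2)

prob = stats.binom.cdf(k = 3, n = 20, p = 0.25)
prob = round(prob, 2)
print('The probability that at most 3 customers will complain about the purchased products is', prob)

The probability that exactly 3 customers will complain about the purchased products is 0.13
The probability that more than 3 customers will complain about the purchased products is 0.77
The probability that at most 3 customers will complain about the purchased products is 0.23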
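
As a cross-check, the same three quantities can be recomputed from the binomial formula using only the standard library. This is a minimal sketch; the helper name `binom_pmf` is an assumption introduced here for illustration and is not part of SciPy.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.25

# P(X = 3): the quantity returned by stats.binom.pmf(k = 3, ...)
p_exactly_3 = binom_pmf(3, n, p)

# P(X <= 3): the quantity returned by stats.binom.cdf(k = 3, ...)
p_at_most_3 = sum(binom_pmf(k, n, p) for k in range(0, 4))

# P(X > 3): the quantity returned by stats.binom.sf(k = 3, ...)
p_more_than_3 = 1 - p_at_most_3

print(round(p_exactly_3, 2), round(p_more_than_3, 2), round(p_at_most_3, 2))
```

Note that `cdf` includes the boundary value `k` while `sf` excludes it, which is why the last two values sum to 1.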


### Normal Distribution

A survey found that people spend, on average, 300 minutes a day surfing online shopping sites, with a standard deviation of 127 minutes. Assume that the time spent surfing follows a normal distribution. Calculate the following probabilities:

In [5]:
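# a. Probability that users spend less than or equal to 100 minutes per day
avg = 300
std = 127
zstat = (100 - avg) / std
prob = stats.norm.cdf(zstat)
req_prob = round(prob, 2)
print('The probability that the users are spending less than or equal to 100 minutes daily is', req_prob)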

# b. Probability users are spending more than or equal to 100 minutes per day?
prob = stats.norm.sf(zstat)
req_prob = round(prob, 2)

print('The probability that the users are spending more than or equal to 100 minutes daily is', req_prob)

# c. Probability that people spend between 250 and 350 minutes per day
z_250 = (250 - avg) / std
p_250 = stats.norm.cdf(z_250)
z_350 = (350 - avg) / std
p_350 = stats.norm.cdf(z_350)

prob = p_350 - p_250 # calculate the difference between 'p_350' and 'p_250' to find the required probability
req_prob = round(prob, 2)
print('The probability that people are spending time between 250 minutes and 350 minutes per day is', req_prob)

The probability that the users are spending less than or equal to 100 minutes daily is 0.06
The probability that the users are spending more than or equal to 100 minutes daily is 0.94


### Sampling

Sampling with Python's `random` module:
- With replacement: `random.choices`
- Without replacement: `random.sample`
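
A minimal sketch of the two calls, assuming a made-up population of the integers 1 to 10 (the variable names are illustrative only):

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

population = list(range(1, 11))  # hypothetical population: the integers 1..10

# with replacement: the same element may be drawn more than once
with_replacement = random.choices(population, k=5)

# without replacement: every drawn element comes from a distinct position
without_replacement = random.sample(population, k=5)

print('With replacement:   ', with_replacement)
print('Without replacement:', without_replacement)
```

`random.sample` raises a `ValueError` if `k` is larger than the population, whereas `random.choices` accepts any `k`.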

### Central Limit Theorem

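The central limit theorem states that, for sufficiently large $n$, the sample mean $\bar{X}$ of independent draws from a population with mean $\mu$ and finite standard deviation $\sigma$ is approximately normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, whatever the shape of the population distribution.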
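
The theorem can be illustrated with a small simulation. The sketch below assumes an Exponential(1) population (strongly skewed, mean 1, standard deviation 1) and uses only the standard library; the sample size and number of repetitions are arbitrary choices. It checks the centre and spread of the sample means; a histogram of `sample_means` would show the bell shape.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

n = 50                   # size of each sample
num_samples = 2000       # number of sample means to collect

# draw num_samples samples of size n and keep the mean of each one
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)

print('Mean of sample means:', round(mean_of_means, 3), '(theory: 1.0)')
print('Std of sample means: ', round(std_of_means, 3), '(theory:', round(1 / n**0.5, 3), ')')
```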

### Point Estimation

1. Consider the data of grade points for 35 students in a data science course. Select grades of 20 students randomly from the data and find the point estimate for the population mean.

Ans: The point estimate of the population mean is the sample mean, so compute the mean of the 20 sampled grades.

2. A financial firm has created 50 portfolios. From them, a sample of 13 portfolios was selected, out of which 8 were found to be underperforming. Can you estimate the number of underperforming portfolios?

Ans: Multiply the population size (N = 50) by the sample proportion (8/13).

3. Consider the data for the number of ice-creams sold per day. An ice-cream vendor collected this data for 90 days and then a sample is drawn (without replacement) containing ice-creams sold for 25 days.

Ans: The sampling error of the mean is np.abs(samp_mean - pop_mean)
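
The portfolio question can be worked through in a few lines. This is a sketch using the numbers from the question; the variable names are chosen here for illustration.

```python
N = 50   # portfolios in the population
n = 13   # portfolios sampled
x = 8    # underperforming portfolios in the sample

# point estimate of the population proportion
p_samp = x / n

# point estimate of the number of underperforming portfolios in the population
est_underperforming = N * p_samp

print('Sample proportion:', round(p_samp, 4))
print('Estimated number of underperforming portfolios:', round(est_underperforming))
```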

### Confidence Interval Estimation for Mean

conf_interval = sample statistic ± margin of error

For a mean with known population standard deviation $\sigma$, the margin of error is given by $Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$.
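
To make the formula concrete, the sketch below builds the interval by hand. It swaps in the standard library's `statistics.NormalDist` for SciPy to obtain the critical value; the helper name `z_interval` and the sample mean of 70 are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for a mean when the population std (sigma) is known."""
    alpha = 1 - confidence
    z_alpha_by_2 = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 critical value
    margin = z_alpha_by_2 * sigma / sqrt(n)              # margin of error
    return sample_mean - margin, sample_mean + margin

# sigma = 8 and n = 35 mirror the first example in the next subsection;
# the sample mean of 70 is made up
lower, upper = z_interval(sample_mean=70, sigma=8, n=35, confidence=0.90)
print('90% confidence interval:', (round(lower, 2), round(upper, 2)))
```

This is the same interval that `stats.norm.interval(0.90, loc = 70, scale = 8 / np.sqrt(35))` computes.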


#### Z_alpha_by_2

1. A random sample of weight (in kg.) for 35 diabetic patients is drawn from the population with a standard deviation of 8 kg. Find the 90% confidence interval for the population mean.

In [None]:
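# assumes 'weight' holds the 35 sampled weights (the data is not shown here)
n = 35         # sample size
std_pop = 8    # population standard deviation (kg)
interval = stats.norm.interval(0.90,
                               loc = np.mean(weight),
                               scale = std_pop / np.sqrt(n))
print('The 90% confidence interval of population mean is', np.round(interval,2))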

3. A movie production house needs to estimate the average monthly wage of the technical crew members. Previous data shows that the standard deviation of the wages is 190 dollars. The production team wants the margin of error of the estimate not to exceed 54 dollars. The team has decided to take a small subset of wages for the estimation. Find a suitable number of wages to consider to get the estimate with 90% confidence.

In [8]:
sigma = 190
ME = 54
z_alpha_by_2 = np.abs(round(stats.norm.isf(q = 0.1/2), 4))
n = ((z_alpha_by_2)**2)*(sigma**2)/(ME**2)
# n is about 33.5; round up so that the margin of error does not exceed ME
print('Required Sample Size:', int(np.ceil(n)))

Required Sample Size: 34


#### t_alpha_by_2

1. There are 150 apples on a tree. You randomly choose 17 apples and found that the average weight of apples is 78 grams with a standard deviation of 23 grams. Find the 90% confidence interval for the population mean.

In [None]:
n = 17
sample_avg = 78
sample_std = 23

interval = stats.t.interval(0.90,
                            df = n-1,
                            loc = sample_avg,
                            scale = sample_std/np.sqrt(n))

print('90% confidence interval for population mean is', interval)

#### Z_prop Interval Estimation

$Z_{\frac{\alpha}{2}}\sqrt{\frac{p(1 - p)}{n}}$ is the margin of error.

1. A financial firm has created 50 portfolios. From them, a sample of 13 portfolios was selected, out of which 8 were found to be underperforming. Construct a 99% confidence interval to estimate the population proportion.

In [None]:
N = 50
n = 13
x = 8
p_samp = x/n

interval = stats.norm.interval(0.99,
                               loc = p_samp,
                               scale = np.sqrt((p_samp*(1-p_samp))/n))

print('99% confidence interval for population proportion is', interval)

### Hypothesis Testing

#### Two-Tailed

$H_{0}: \mu = \mu_{0}$

$H_{1}: \mu \neq \mu_{0}$

#### Left-Tailed

$H_{0}: \mu \geq \mu_{0}$ 

$H_{1}: \mu < \mu_{0}$

#### Right-Tailed

$H_{0}: \mu \leq \mu_{0}$ 

$H_{1}: \mu > \mu_{0}$
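
The tail of the test decides which area under the standard normal curve is the p-value. A minimal sketch of that mapping, assuming a helper named `z_p_value` (introduced here for illustration) and using the standard library's `statistics.NormalDist` in place of SciPy:

```python
from statistics import NormalDist

def z_p_value(zstat, tail):
    """p-value of a Z statistic; tail is 'two', 'left' or 'right'."""
    std_normal = NormalDist()
    if tail == 'left':                       # H1: mu < mu0
        return std_normal.cdf(zstat)
    if tail == 'right':                      # H1: mu > mu0
        return std_normal.cdf(-zstat)        # upper-tail area, by symmetry
    if tail == 'two':                        # H1: mu != mu0
        return 2 * std_normal.cdf(-abs(zstat))
    raise ValueError("tail must be 'two', 'left' or 'right'")

# the same statistic gives a different p-value under each alternative
for tail in ('two', 'left', 'right'):
    print(tail, round(z_p_value(-2.0, tail), 4))
```

In every case the null hypothesis is rejected when the p-value is below the chosen significance level.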

#### One Sample Z tests
Assumptions: the population is normally distributed, the data are continuous, and the sample is a simple random sample.

If $\sigma$ is unknown, use the sample standard deviation (s) instead of $\sigma$ to calculate the test statistic.

1. A car manufacturing company claims that the mileage of their new car is 25 kmph with a standard deviation of 2.5 kmph. A random sample of 45 cars was drawn and their mileage was recorded as per the standard procedure. From the sample, the mean mileage was seen to be 24 kmph. Is this sufficient evidence to claim that the mean mileage is different from 25 kmph? (Assume the normality of the data.) Use α = 0.01.

In [13]:
# Two Tailed Z test Using Statistic
zcrit = np.abs(round(stats.norm.isf(q = 0.01/2), 2))

print('Critical value for two-tailed Z-test:', zcrit)

n = 45
pop_mean = 25
pop_std = 2.5
samp_mean = 24

zstat = (samp_mean - pop_mean) / (pop_std / np.sqrt(n))
print('Statistic value for two-tailed Z-test:', zstat)
 
# |zstat| (2.68) is greater than zcrit (2.58), so we reject H0

Critical value for two-tailed Z-test: 2.58
Statistic value for two-tailed Z-test: -2.6832815729997477


In [15]:
# p value method
# zstat is negative here, so the lower-tail area is cdf(zstat)
p_value = stats.norm.cdf(zstat)

# for a two-tailed test multiply the tail area by 2
req_p = p_value*2
print('p-value:', req_p)
# p-value < alpha (0.01): we reject H0

p-value: 0.007290358091535638


In [18]:
# confidence interval method
print('Confidence interval:', stats.norm.interval(0.99, loc = samp_mean, scale = pop_std / np.sqrt(n)))
# hypothesized mean (25) is not inside the interval: we reject H0

Confidence interval: (23.040045096471452, 24.959954903528548)


3. A typhoid vaccine on the market is labelled as containing 3 mg of ascorbic acid. A research team claims that the vaccines contain less than 3 mg of acid. We collected data on 40 vaccines by random sampling from the population and recorded the amount of ascorbic acid. Test the claim of the research team using the sample data, with the significance level ⍺ set to 0.05.

acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 3.19, 3.09, 2.81, 3.13, 2.88, 
            2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 
            2.87, 3.18, 3, 2.95]
            
Ans: left tailed z test

If the Shapiro-Wilk test rejects normality, use a non-parametric alternative (the Wilcoxon signed-rank test).

In [20]:

acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46 , 3.19, 3.09,2.81, 3.13, 2.88, 2.76, 
            2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 2.87, 3.18, 
            3, 2.95]
stat, p_value = shapiro(acid_amt)

# print the test statistic and corresponding p-value 
print('Test statistic:', stat)
print('P-Value:', p_value)

# H0: the data is normally distributed (fail to reject when p-value > alpha)
# H1: the data is not normally distributed (p-value < alpha)
# p-value is greater than 0.05, so we fail to reject H0 and treat the data as normally distributed.

Test statistic: 0.9764790534973145
P-Value: 0.5609316825866699


In [22]:
# stat crit method
zcrit = np.abs(round(stats.norm.isf(q = 0.05), 2))

print('Critical value for one-tailed Z-test:', zcrit)

zstat, pval = stests.ztest(x1 = acid_amt, value = 3, alternative = 'smaller')

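# print the test statistic and corresponding p-value
print("Z-score: ", zstat)
print("p-value: ", pval)
# left-tailed test: reject H0 only if zstat < -zcrit; here zstat > -1.64 and p-value > 0.05, so we fail to reject H0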

Critical value for one-tailed Z-test: 1.64
Z-score:  0.08289008952836197
p-value:  0.5330305328333862


In [None]:
# confidence interval method
print('Confidence interval:', 
      stats.norm.interval(0.95, 
                          loc = np.mean(acid_amt),
                          scale = np.std(acid_amt, ddof = 1) / np.sqrt(len(acid_amt))))

5. An e-commerce company claims that the mean delivery time of food items on its website in NYC is 60 minutes with a standard deviation of 30 minutes. A random sample of 45 customers ordered from the website, and the average time for delivery was found to be 75 minutes. Is this enough evidence to claim that the average time to get items delivered is more than 60 minutes? (Assume the normality of the data.) Test with α = 0.05.

Ans: right tailed z test

In [23]:
# stat crit method
zcrit = np.abs(round(stats.norm.isf(q = 0.05), 2))
print('Critical value for one-tailed Z-test:', zcrit)

n = 45
pop_mean = 60
pop_std = 30
samp_mean = 75
zstat = (samp_mean - pop_mean) / (pop_std / np.sqrt(n))
print("Z-zstat:", zstat)

Critical value for one-tailed Z-test: 1.64
Z-zstat: 3.3541019662496843


In [None]:
# p value method
p_value = stats.norm.sf(zstat)

print('p-value:', p_value)

In [None]:
# confidence interval method
print('Confidence interval:', stats.norm.interval(0.95, 
                                                  loc = samp_mean, 
                                                  scale = pop_std / np.sqrt(n)))

### Two Sample Z Test

Assumptions:
- Normal Distribution (Shapiro)
- Equal Variances (Levene)
    - h0: The variances are equal
    - h1: The variances are not equal

1. The training institute Nature Learning claims that the students trained in their institute have overall better performance than the students trained in their competitor institute Speak Global Learning. We have a sample data of 500 students from each institute along with their total score collected from independent normal populations. Frame a hypothesis and test the Nature Learning's claim with 99% confidence.



In [None]:
# perform shapiro
stat, p_value = shapiro(df_student['total score'])

# print the test statistic and corresponding p-value 
print('Test statistic:', stat)
print('P-Value:', p_value)
#  p-value is greater than 0.05, 
# thus we can say that the total scores of the students trained 
# from both the institutes are normally distributed.

# perform levene
stat, p_value = stats.levene(nl_scores, sgl_scores)

# print the test statistic and corresponding p-value 
print('Test statistic:', stat)
print('P-Value:', p_value)
# p-value is greater than 0.05, 
# thus we can say that the population variances are equal.

In [None]:
zcrit = np.abs(round(stats.norm.isf(q = 0.01), 2)) #q/2 for two tailed

print('Critical value for one-tailed Z-test:', zcrit)

zstat, pval = stests.ztest(x1 = nl_scores,
                           x2 = sgl_scores,
                           value = 0, # value from the hypothesis
                           alternative = 'larger') #alternative='two-sided' for two tailed

# print the test statistic and corresponding p-value
print("Z-score: ", zstat)
print("p-value: ", pval) # with alternative='two-sided', ztest already returns the two-tailed p-value

# confidence interval method
interval = stats.norm.interval(0.99, 
            loc = nl_mean - sgl_mean, 
            scale = np.sqrt(((nl_std**2) / n_1) + ((sgl_std**2) / n_2)))
print('Confidence interval:', interval)
      

### One sample t test

1. A survey claims that in a math test female students tend to score fewer marks than the average marks of 75 out of 100. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.

H<sub>0</sub>: $\mu \geq 75$<br>
H<sub>1</sub>: $\mu < 75$

In [None]:
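# calculate sample mean, sample std, n and df (n - 1)
# assumes 'math_marks' holds the 24 sampled scores
n = len(math_marks)
sample_avg = np.mean(math_marks)
sample_std = np.std(math_marks, ddof = 1)

# perform shapiro first to check normality
tcrit = round(stats.t.isf(q = 0.1, df = n-1), 2)

print('Critical value for one-tailed t-test:', tcrit)
tstat, p_val = stats.ttest_1samp(a = math_marks, popmean = 75)

# ttest_1samp returns a two-tailed p-value; halving it gives the left-tailed p-value only when tstat < 0
req_p_val = p_val/2 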

# print the test statistic value and corresponding p-value
print('Test Statistic:', tstat)
print('p-value:', req_p_val)


# confidence interval method
interval = stats.t.interval(0.90, 
                            df = n-1, 
                            loc = sample_avg, 
                            scale = sample_std/np.sqrt(n))
print('90% confidence interval for population mean is', interval)


#### 2. A researcher is studying the growth of bacteria in waters of Lake Beach. The mean bacteria count of 100 per unit volume of water is within the safety level. The researcher collected 10 water samples of unit volume and found the mean bacteria count to be 94.8 with a sample variance of 72.66. Does the data indicate that the bacteria count is within the safety level? Test at the α = .05 level. Assume that the measurements constitute a sample from a normal population.

The null and alternative hypothesis is:

H<sub>0</sub>: $\mu \geq 100$<br>
H<sub>1</sub>: $\mu < 100$

In [None]:
t_val = round(stats.t.isf(q = 0.05, df = 9), 2)

n = 10
pop_mean = 100
samp_var = 72.66 
samp_mean = 94.8

# the standard deviation must be computed before it is used in the test statistic
samp_std = np.sqrt(samp_var)

# calculate the test statistic
t_score = (samp_mean - pop_mean) / (samp_std / np.sqrt(n))
print("t-score:", t_score)

p_value = stats.t.cdf(t_score, df = 9)

print('p-value:', p_value)

#INTERVAL
interval = stats.t.interval(0.95, df = n-1, loc = samp_mean, scale = samp_std/np.sqrt(n))
print(interval)

## 3.2 Two Sample t Test (Unpaired)

The two sample t-test is used to compare the means of two independent populations. This test assumes that the populations are normally distributed from which the samples are taken.

The null and alternative hypothesis is given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>
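
For reference, when the two population variances are assumed equal, the test statistic takes the pooled form below and follows a t distribution with $n_{1} + n_{2} - 2$ degrees of freedom. The symbols are introduced here for illustration: $\overline{X}_{1}$, $\overline{X}_{2}$ are the sample means, $s_{1}$, $s_{2}$ the sample standard deviations, and $n_{1}$, $n_{2}$ the sample sizes.

$$t = \frac{(\overline{X}_{1} - \overline{X}_{2}) - \mu_{0}}{s\sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}, \qquad s = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}}$$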



### Example: 

#### 1. The teachers' association claims that the total score of the students who completed the test preparation course is different than the total score of the students who have not completed the course. The sample data consists of 15 students who completed the course and 18 students who have not completed the course. Test the association's claim with ⍺ = 0.05.

In [6]:
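# find size, mean and std of both samples
n_1, n_2 = 15, 18   # sample sizes from the problem statement
# degrees of freedom for 2 sample t-test
print('Degrees of freedom:', n_1 + n_2 - 2)

# Shapiro test (normality of each sample)
# Levene test (equality of variances)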

The null and alternative hypothesis is:

H<sub>0</sub>: $\mu_{1} - \mu_{2} = 0$<br>
H<sub>1</sub>: $\mu_{1} - \mu_{2} \neq 0$

Here ⍺ = 0.05 and degrees of freedom = 31, for a two-tailed test let us calculate the critical t-value.

In [None]:
t_val = np.abs(round(stats.t.isf(q = 0.05/2, df = 31), 2))
print('Critical value for two-tailed t-test:', t_val)

# use 'ttest_ind()' to calculate the test statistic and corresponding p-value for 2 sample test
t_stat, p_val = stats.ttest_ind(a = course_complete, b = course_incomplete)

print('Test Statistic:', t_stat)
print('p-value:', p_val)

s = np.sqrt((((n_1-1)*samp_std_1**2) + ((n_2-1)*samp_std_2**2)) / (n_1 + n_2 - 2))

# pass the scaling factor s*(1/n1 + 1/n2)^(1/2) to the parameter, 'scale'
interval = stats.t.interval(0.95, 
                            df = n_1 + n_2 - 2, 
                            loc = samp_avg_1 - samp_avg_2, 
                            scale = s * np.sqrt(1/n_1 + 1/n_2))

print('95% confidence interval for the difference in population means is', interval)

## 3.3 Paired t Test
A paired t-test is used to compare the mean of the population for two dependent samples. The dependent samples can be the scores before and after a specific treatment. 

Let $X_{i}$ be the sample before the treatment and $Y_{i}$ be the sample after the treatment. Let $\mu_{X}$, $\mu_{Y}$ be the mean of the data X and Y respectively. The mean difference $\mu_{d} = \mu_{Y} - \mu_{X}$.

The null and alternative hypothesis is given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{d} = \mu_{0}$ or $\mu_{d} \geq \mu_{0}$ or $\mu_{d} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{d} \neq \mu_{0}$ or $\mu_{d} < \mu_{0}$ or $\mu_{d} > \mu_{0}$</strong></p>

The test statistic for paired t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X_{D}} - \mu_{0}} {\frac{s_{D}}{\sqrt{n}}}$</strong></p>
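
The statistic can be computed by hand in a few lines. The sketch below uses made-up before/after scores for five subjects and only the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# made-up before/after scores for 5 subjects
before = [60, 62, 58, 65, 70]
after = [64, 63, 61, 70, 72]

# difference taken as after - before, matching mu_d = mu_Y - mu_X
diff = [y - x for x, y in zip(before, after)]
n = len(diff)
mu_0 = 0                 # hypothesized mean difference

mean_diff = mean(diff)   # mean of the differences
std_diff = stdev(diff)   # sample standard deviation of the differences

t_stat = (mean_diff - mu_0) / (std_diff / sqrt(n))
print('t statistic:', round(t_stat, 4), 'with', n - 1, 'degrees of freedom')
```

The test works on the single column of differences, so it is equivalent to a one sample t-test of `diff` against `mu_0`.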

### Example:

#### 1. A training institute wants to check if their writing training program was effective or not. 17 students are selected to check the hypothesis. Consider 0.05 as the level of significance.

The writing scores before and after training are provided in the CSV file `WritingScores.csv`. 

In [None]:
diff_marks = df_score['score_after'] - df_score['score_before']

mean_diff = np.mean(diff_marks)

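std_diff = np.std(diff_marks, ddof = 1)   # sample standard deviation of the differences

n = len(df_score)
print('Degrees of freedom:', n-1)

# Shapiro test on the differences (the paired t-test assumes the differences are normally distributed)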

The null and alternative hypothesis is:

H<sub>0</sub>: The training was not effective ($\mu_{d} = 0$)<br>
H<sub>1</sub>: The training was effective ($\mu_{d} \neq 0$)

In [None]:
t_val = np.abs(round(stats.t.isf(q = 0.05/2, df = 16), 2))
print('Critical Value for two-tailed t-test:', t_val)

# use 'ttest_rel()' to calculate the t-statistic and corresponding p-value for paired samples
# pass the before and after scores to the function
t_stat, p_val = stats.ttest_rel(df_score['score_after'], df_score['score_before'])

# print the the t-test statistic and corresponding p-value 
print("Test Statistic:", t_stat)
print("p-value:", p_val)

In [None]:
#INTERVAL
interval = stats.t.interval(0.95, 
                            df = n-1, 
                            loc = mean_diff, 
                            scale = std_diff/np.sqrt(n))

# 4. Z Proportion Test

## 4.1 One Sample Test

Perform one sample Z test for the population proportion. We compare the population proportion ($P$) with a specific value ($P_{0}$).

The null and alternative hypothesis is given as:

<p style='text-indent:25em'> <strong> $H_{0}: P = P_{0}$ or $P \geq P_{0}$ or $P \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P \neq P_{0}$ or $P < P_{0}$ or $P > P_{0}$</strong></p>

### Example:

#### 1. In previous years, people believed that at most 80% of male students score more than 50 marks out of 100 in Mathematics. Perform a test to check whether this percentage is more than 80. Consider the level of significance as 0.05.

Consider the sample of math scores of male students available in the CSV file `StudentsPerformance.csv`.

In [None]:
total_male = len(df_student[(df_student['gender'] == 'male')])

male_50 = df_student[(df_student['gender'] == 'male') & (df_student['math score'] > 50)]

# obtain the number of male students with math score greater than 50 
num_male_50 = len(male_50)

# calculate sample proportion
p_samp = num_male_50/total_male

In [None]:
z_val = np.abs(round(stats.norm.isf(q = 0.05), 2))
print('Critical value for one-tailed Z-test:', z_val)

In [None]:
# hypothesized proportion
hypo_p = 0.8 

z_prop = (p_samp - hypo_p) / np.sqrt((hypo_p * (1 - hypo_p)) / total_male)
print('Test statistic:', z_prop)

In [None]:
p_value = stats.norm.sf(z_prop)
print('p-value:', p_value)

In [None]:
interval = stats.norm.interval(0.95, 
                               loc = p_samp, 
                               scale = np.sqrt((p_samp*(1-p_samp))/total_male))
print('95% confidence interval for population proportion is', interval)

## 4.2 Two Sample Test

Perform two sample Z test for the population proportion. We check the equality of population proportions $P_{1}$ and $P_{2}$.

The null and alternative hypothesis is given as:

<p style='text-indent:25em'> <strong> $H_{0}: P_{1} - P_{2} = P_{0}$ or $P_{1} - P_{2} \geq P_{0}$ or $P_{1} - P_{2} \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P_{1} - P_{2} \neq P_{0}$ or $P_{1} - P_{2} < P_{0}$ or $P_{1} - P_{2} > P_{0}$</strong></p>

In [9]:
df_nl = df_student[df_student['training institute'] == 'Nature Learning']

n_1 = len(df_nl)

lunch_std_1 = len(df_nl[df_nl['lunch'] == 'standard'])

df_sg = df_student[df_student['training institute'] == 'Speak Global Learning']

n_2 = len(df_sg)

lunch_std_2 = len(df_sg[df_sg['lunch'] == 'standard'])

The null and alternative hypothesis is:

H<sub>0</sub>: $P_{1} - P_{2} = 0$<br>
H<sub>1</sub>: $P_{1} - P_{2} \neq 0$ 

In [None]:
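from statsmodels.stats.proportion import proportions_ztest

z_val = np.abs(round(stats.norm.isf(q = 0.1/2), 2))
z_prop, p_val = proportions_ztest(count = np.array([lunch_std_1, lunch_std_2]), 
                                  nobs = np.array([n_1, n_2]))
print('Test statistic:', z_prop, 'p-value:', p_val)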


In [None]:
# INTERVAL for the difference of the two population proportions
p_1 = lunch_std_1/ n_1

p_2 = lunch_std_2/ n_2

p_bar = (n_1*p_1 + n_2*p_2) / (n_1 + n_2)

# pass the scaling factor np.sqrt(p_bar*(1-p_bar)*(1/n_1 + 1/n_2)) to the parameter, 'scale'
interval = stats.norm.interval(0.9, loc = p_1 - p_2, scale = np.sqrt(p_bar*(1-p_bar)*(1/n_1 + 1/n_2)))

print('90% confidence interval for the difference in population proportions is', interval)

### Chi-Square Test [Goodness of Fit]
This test is used to compare the distribution of the categorical data with the expected distribution. 

<p style='text-indent:6em'> <strong> $H_{0}$: There is no significant difference between the observed and expected frequencies from the expected distribution</strong></p>
<p style='text-indent:6em'> <strong> $H_{1}$: There is a significant difference between the observed and expected frequencies from the expected distribution</strong></p>

1. Check whether there is a significant difference between the observed and expected education values or not with 90% confidence.

In [4]:
# if counts not given and data is given

# use 'value_counts()' to calculate the count for each category in the variable 'education' 
observed_value = df_student['education'].value_counts() 

# observed values
print(observed_value,'observed_value')

# use 'value_counts()' to calculate the count for each category in the variable 'education' 
exp_count = df_demographic['education'].value_counts()

# count of each category
print(exp_count,'exp_count')

In [None]:
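# scale to the sample size and align categories with the observed counts (chisquare compares by position)
expected_value = ((exp_count * len(df_student)) / len(df_demographic)).reindex(observed_value.index)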

In [None]:
# degrees of freedom for goodness of fit: number of categories - 1
chi2_val = np.abs(round(stats.chi2.isf(q = 0.1, df = 5), 4))

print('Critical value for chi-square test:', chi2_val)

test_stat, p_value = stats.chisquare(f_obs = observed_value, f_exp = expected_value)

print('Test statistic:', test_stat)
print('p-value:', p_value)

2. At an emporium, the manager is interested in knowing the age group which visits the mall during the day. He defines categories as - children, teenagers, adults and senior citizens. He plans to have his inventory of goods accordingly. He claims that out of all the people who visited, 5% are children, 38% are teenagers, 2% are senior citizens and the remaining are adults. From a sample of 180 people, it was seen that 25 were children, 50 were teenagers, 90 were adults and 15 were senior citizens. Test the manager’s claim at a 95% confidence level.

H<sub>0</sub>: The manager's claim is correct <br>
H<sub>1</sub>: The manager's claim is not correct

In [6]:
chi2_val = np.abs(round(stats.chi2.isf(q = 0.05, df = 3), 4))

print('Critical value for chi-square test:', chi2_val)

Critical value for chi-square test: 7.8147


In [7]:
# given observed values
observed_value = [25, 50, 90, 15]

# claimed proportion of each category (adults: 1 - 0.05 - 0.38 - 0.02 = 0.55)
exp_count = [0.05, 0.38, 0.55, 0.02]

# calculate the expected values for each category
# expected_value = (np.array(exp_count) * 180) 
# here they are rounded to whole numbers that still sum to 180
expected_value = [9, 68, 99, 4]

# use 'stats.chisquare()' to perform the goodness of fit test
# the function returns the test statistic value and corresponding p-value
# pass the observed values to the parameter, 'f_obs'
# pass the expected values to the parameter, 'f_exp'
stat, p_value = stats.chisquare(f_obs = observed_value, f_exp = expected_value)

print('Test statistic:', stat)
print('p-value:', p_value)

# The above output shows that the chi-square test statistic 
# is greater than 7.8147 and the p-value is less than 0.05. 
# Thus, we reject the null hypothesis 
# and conclude that manager's claim is not correct.

Test statistic: 64.2773321449792
p-value: 7.160266387019384e-14
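
The test statistic reported for the emporium example can be reproduced directly from the definition, the sum over categories of (observed - expected)² / expected. A minimal sketch using only built-ins, with the rounded expected counts from the example:

```python
observed = [25, 50, 90, 15]   # children, teenagers, adults, senior citizens
expected = [9, 68, 99, 4]     # rounded expected counts from the example

# chi-square statistic: sum over categories of (O - E)^2 / E
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# goodness of fit: degrees of freedom = number of categories - 1
dof = len(observed) - 1

print('Test statistic:', chi2_stat)
print('Degrees of freedom:', dof)
```

Most of the statistic comes from the children and senior citizen categories, where the observed counts are far above the claimed shares.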


### Chi-Square Test for Independence

This test is used to test whether the categorical variables are independent or not.

<p style='text-indent:20em'> <strong> $H_{0}$: The variables are independent</strong></p>
<p style='text-indent:20em'> <strong> $H_{1}$: The variables are not independent (i.e. variables are dependent)</strong></p>

1. Check if there is any relationship between the gender and education level of students with 95% confidence.

In [None]:
# if data is given

table = pd.crosstab(df_student['gender'], df_student['education'])

# observed values  
observed_value = table.values
observed_value

In [None]:
chi2_val = np.abs(round(stats.chi2.isf(q = 0.05, df = 5), 4))

print('Critical value for chi-square test:', chi2_val)

In [None]:
test_stat, p, dof, expected_value = stats.chi2_contingency(observed = observed_value, correction = False)

print("Test statistic:", test_stat)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected values:", expected_value)

2. A study was conducted to test the effect of the malaria parasite - plasmodium falciparum - on heterozygous and homozygous humans. The vaccine was given to a cohort of 252 humans. Test whether the heterozygous humans are better protected than homozygous. Consider 0.05 as a level of significance.

In [8]:
# if the observed contingency table is given directly
observed_value = np.array([[93, 51], [68, 40]])

In [9]:
chi2_val = np.abs(round(stats.chi2.isf(q = 0.05, df = 1), 4))

print('Critical value for chi-square test:', chi2_val)

Critical value for chi-square test: 3.8415


In [10]:
test_stat, p, dof, expected_value = stats.chi2_contingency(observed = observed_value, correction = False)

# print the output
print("Test statistic:", test_stat)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected values:", expected_value)

# The above output shows that the chi-square test statistic 
# is less than 3.8415 and the p-value is greater than 0.05. 
# Thus we fail to reject the null hypothesis: 
# there is no evidence that the zygote type 
# and infection of the malaria parasite are dependent.

Test statistic: 0.07023411371237459
p-value: 0.790996215494177
Degrees of freedom: 1
Expected values: [[92. 52.]
 [69. 39.]]
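
The expected values and the statistic above can be rebuilt by hand: under independence, each expected count is (row total × column total) / grand total. A minimal sketch with plain Python lists, using the same 2 × 2 table:

```python
observed = [[93, 51], [68, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# expected count under independence: (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# chi-square statistic summed over every cell, without continuity correction
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print('Expected values:', expected)
print('Test statistic:', chi2_stat)
print('Degrees of freedom:', dof)
```

Every observed count is within 1 of its expected count, which is why the statistic is so small.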


### One-way ANOVA

One-way ANOVA assumes that the samples are taken from normally distributed populations. To check this assumption we can use the `Shapiro-Wilk Test`. Also, the population variances should be equal; this can be tested using `Levene's Test`.

<p style='text-indent:20em'> <strong> $H_{0}$: The averages of all treatments are the same. </strong></p>
<p style='text-indent:20em'> <strong> $H_{1}$: At least one treatment has a different average. </strong></p>


2. Ryan is a production manager at a plant manufacturing alloy seals. The plant has 4 machines - A, B, C and D. Ryan wants to study whether all the machines have equal efficiency, and collects tensile strength data from all 4 machines as given. Test at the 5% level of significance.

In [13]:
# given data
# tensile strength due to machine A
A = [68.7, 75.4, 70.9, 79.1, 78.2]

# tensile strength due to machine B
B = [62.7, 68.5, 63.1, 62.2, 60.3]

# tensile strength due to machine C
C = [55.9, 56.1, 57.3, 59.2, 50.1]

# tensile strength due to machine D
D = [80.7, 70.3, 80.9, 85.4, 82.3]

In [14]:
df_machine = pd.DataFrame(data = {'machine': ['machine_A','machine_B','machine_C','machine_D']*5, 
                                  'strength': [68.7, 62.7, 55.9, 80.7, 75.4, 68.5, 56.1, 70.3, 70.9, 63.1, 57.3, 80.9, 79.1, 
                                               62.2, 59.2, 85.4, 78.2, 60.3, 50.1, 82.3]})

In [15]:
# Shapiro-Wilk test for normality on the pooled strength values
stat, p_value = stats.shapiro(df_machine['strength'])

# print the p-value (a single test on all 20 observations, not one per group)
print('p-value:', p_value)

p-value: 0.3721875548362732
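
The normality assumption of ANOVA concerns each treatment group, so it can also be checked machine by machine. The sketch below is one possible way to do that with `groupby`; the dictionary name `shapiro_p` is illustrative. With only 5 observations per machine the test has little power, so the p-values should be read with caution.

```python
import pandas as pd
from scipy import stats

# same data as above, rebuilt so that the example is self-contained
df_machine = pd.DataFrame(data = {
    'machine': ['machine_A', 'machine_B', 'machine_C', 'machine_D'] * 5,
    'strength': [68.7, 62.7, 55.9, 80.7, 75.4, 68.5, 56.1, 70.3, 70.9, 63.1,
                 57.3, 80.9, 79.1, 62.2, 59.2, 85.4, 78.2, 60.3, 50.1, 82.3]})

# run the Shapiro-Wilk test separately on the strengths of each machine
shapiro_p = {}
for machine, strength in df_machine.groupby('machine')['strength']:
    stat, p_value = stats.shapiro(strength)
    shapiro_p[machine] = p_value
    print(machine, 'p-value:', round(p_value, 4))
```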


In [16]:
# Levene's test for equality of variances across the four machines
stat, p_value = stats.levene(df_machine[df_machine['machine'] == 'machine_A']['strength'],
                             df_machine[df_machine['machine'] == 'machine_B']['strength'],
                             df_machine[df_machine['machine'] == 'machine_C']['strength'],
                             df_machine[df_machine['machine'] == 'machine_D']['strength'])

# print the p-value 
print('P-Value:', p_value)

P-Value: 0.7570021212992085


In [17]:
# dfn (numerator df)   = number of groups - 1 = 4 - 1 = 3
# dfd (denominator df) = number of observations - number of groups = 20 - 4 = 16

In [20]:
# critical value of F at the 5% level of significance
fcrit = np.abs(round(stats.f.isf(q = 0.05, dfn = 3, dfd = 16), 4))

print('Critical value for F-test:', fcrit)

Critical value for F-test: 3.2389
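
Rather than hard-coding `dfn = 3` and `dfd = 16`, the degrees of freedom can be derived from the data. This is a small sketch under the assumption that the data are in the long format used above (one row per observation); `n_groups` and `n_obs` are illustrative names.

```python
import pandas as pd
from scipy import stats

# same data as above, rebuilt so that the example is self-contained
df_machine = pd.DataFrame(data = {
    'machine': ['machine_A', 'machine_B', 'machine_C', 'machine_D'] * 5,
    'strength': [68.7, 62.7, 55.9, 80.7, 75.4, 68.5, 56.1, 70.3, 70.9, 63.1,
                 57.3, 80.9, 79.1, 62.2, 59.2, 85.4, 78.2, 60.3, 50.1, 82.3]})

n_groups = df_machine['machine'].nunique()   # number of treatments
n_obs = len(df_machine)                      # total number of observations

dfn = n_groups - 1        # numerator (between-group) degrees of freedom
dfd = n_obs - n_groups    # denominator (within-group) degrees of freedom

# upper 5% point of the F distribution
fcrit = stats.f.isf(q = 0.05, dfn = dfn, dfd = dfd)

print('dfn:', dfn, 'dfd:', dfd)
print('Critical value for F-test:', round(fcrit, 4))
```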


In [21]:
# F statistic and p-value of the one-way ANOVA
fstat, p_val = stats.f_oneway(df_machine[df_machine['machine'] == 'machine_A']['strength'],
                              df_machine[df_machine['machine'] == 'machine_B']['strength'],
                              df_machine[df_machine['machine'] == 'machine_C']['strength'],
                              df_machine[df_machine['machine'] == 'machine_D']['strength'])

# print the test statistic and p-value
print('Test statistic:', fstat)
print('p_value:', p_val)

Test statistic: 32.03072350199285
p_value: 5.375613532781072e-07
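
To see where the F statistic comes from, it can be rebuilt from the sums of squares: the between-group mean square divided by the within-group mean square. The following is a minimal sketch of that arithmetic; the names `ssb`, `ssw`, `msb` and `msw` are illustrative and not part of any library.

```python
import numpy as np
from scipy import stats

# tensile strength per machine, as given in the exercise
groups = {'machine_A': [68.7, 75.4, 70.9, 79.1, 78.2],
          'machine_B': [62.7, 68.5, 63.1, 62.2, 60.3],
          'machine_C': [55.9, 56.1, 57.3, 59.2, 50.1],
          'machine_D': [80.7, 70.3, 80.9, 85.4, 82.3]}

all_values = np.concatenate([np.asarray(v) for v in groups.values()])
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# between-group sum of squares: group size * squared distance of group mean from grand mean
ssb = sum(len(v) * (np.mean(v) - grand_mean) ** 2 for v in groups.values())

# within-group sum of squares: squared deviations from each group's own mean
ssw = sum(((np.asarray(v) - np.mean(v)) ** 2).sum() for v in groups.values())

msb = ssb / (k - 1)    # mean square between groups
msw = ssw / (n - k)    # mean square within groups

f_manual = msb / msw
p_manual = stats.f.sf(f_manual, k - 1, n - k)

print('Test statistic:', f_manual)
print('p_value:', p_manual)
```

The values should match the `stats.f_oneway` output above up to floating-point rounding.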


In [None]:
# The above output shows that the test statistic is greater than 3.2389 
# and the p-value is less than 0.05. Thus we reject the null hypothesis 
# and conclude that the average tensile strength due to at least one machine is different.

### Post-hoc Analysis

For post-hoc analysis we use `Tukey's HSD` test. This test is efficient when the sample size for each treatment is equal. If the sample sizes are not equal, we can use the `Scheffe test` instead; `scikit_posthocs.posthoc_scheffe()` can be used to perform it.

1. Ryan is a production manager at a company that manufactures alloy seals. They have 4 machines - A, B, C and D. Ryan wants to study whether all the machines have equal efficiency. Ryan collects tensile strength data from all 4 machines, as given below. Perform the post-hoc test to find out which machine has a different average. Test at the 5% level of significance.

In [22]:
df_machine = pd.DataFrame(data = {'machine': ['machine_A','machine_B','machine_C','machine_D']*5, 
                                  'strength': [68.7, 62.7, 55.9, 80.7, 75.4, 68.5, 56.1, 70.3, 70.9, 63.1, 57.3, 80.9, 79.1, 
                                               62.2, 59.2, 85.4, 78.2, 60.3, 50.1, 82.3]})

In [23]:
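# 'mc' is assumed to be the statsmodels multiple-comparison module
import statsmodels.stats.multicomp as mc

comp = mc.MultiComparison(data = df_machine['strength'], groups = df_machine['machine'])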

# tukey's range test
post_hoc = comp.tukeyhsd()

# print the summary table
post_hoc.summary()

group1,group2,meandiff,p-adj,lower,upper,reject
machine_A,machine_B,-11.1,0.0044,-18.8842,-3.3158,True
machine_A,machine_C,-18.74,0.001,-26.5242,-10.9558,True
machine_A,machine_D,5.46,0.2265,-2.3242,13.2442,False
machine_B,machine_C,-7.64,0.0553,-15.4242,0.1442,False
machine_B,machine_D,16.56,0.001,8.7758,24.3442,True
machine_C,machine_D,24.2,0.001,16.4158,31.9842,True


`reject=False` for the pairs (machine_A, machine_D) and (machine_B, machine_C) means that we fail to reject the null hypothesis for these pairs: there is no significant difference in average tensile strength between machine_A and machine_D, or between machine_B and machine_C. The second pair is borderline, with an adjusted p-value of 0.0553.

For the pairs (machine_A, machine_B), (machine_A, machine_C), (machine_B, machine_D), and (machine_C, machine_D), `reject=True`: the average tensile strength is significantly different.

The values in the columns `lower` and `upper` represent the lower and upper bounds of the 95% confidence interval for the mean difference. A pair has `reject=True` exactly when this interval does not contain 0.
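
The `lower` and `upper` bounds can be reproduced by hand for equal group sizes: the half-width of the interval is the critical value of the studentized range distribution multiplied by sqrt(MSW / n), where MSW is the within-group mean square and n is the group size. The sketch below does this for the pair (machine_A, machine_B). It assumes a SciPy version that provides `stats.studentized_range` (added in SciPy 1.7); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# tensile strength per machine, as given in the exercise
groups = {'machine_A': [68.7, 75.4, 70.9, 79.1, 78.2],
          'machine_B': [62.7, 68.5, 63.1, 62.2, 60.3],
          'machine_C': [55.9, 56.1, 57.3, 59.2, 50.1],
          'machine_D': [80.7, 70.3, 80.9, 85.4, 82.3]}

k = len(groups)          # number of groups
n_per_group = 5          # equal sample size in every group
dfd = k * n_per_group - k

# within-group mean square (the error variance estimate from the ANOVA)
msw = sum(((np.asarray(v) - np.mean(v)) ** 2).sum() for v in groups.values()) / dfd

# critical value of the studentized range at the 95% confidence level
q_crit = stats.studentized_range.ppf(0.95, k, dfd)

# half-width of the simultaneous confidence interval for any pair
half_width = q_crit * np.sqrt(msw / n_per_group)

# mean difference as reported in the table: group2 - group1
meandiff = np.mean(groups['machine_B']) - np.mean(groups['machine_A'])
lower, upper = meandiff - half_width, meandiff + half_width

print('meandiff:', round(meandiff, 4))
print('lower:', round(lower, 4), 'upper:', round(upper, 4))
```

The same `half_width` applies to every pair here because all groups have the same size.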