
In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [34]:
shipment = pd.read_feather('/content/late_shipments.feather')
print(shipment.head())

        id       country managed_by  fulfill_via vendor_inco_term  \
0  36203.0       Nigeria   PMO - US  Direct Drop              EXW   
1  30998.0      Botswana   PMO - US  Direct Drop              EXW   
2  69871.0       Vietnam   PMO - US  Direct Drop              EXW   
3  17648.0  South Africa   PMO - US  Direct Drop              DDP   
4   5647.0        Uganda   PMO - US  Direct Drop              EXW   

  shipment_mode  late_delivery late product_group    sub_classification  ...  \
0           Air            1.0  Yes          HRDT              HIV test  ...   
1           Air            0.0   No          HRDT              HIV test  ...   
2           Air            0.0   No           ARV                 Adult  ...   
3         Ocean            0.0   No           ARV                 Adult  ...   
4           Air            0.0   No          HRDT  HIV test - Ancillary  ...   

  line_item_quantity line_item_value pack_price unit_price  \
0             2996.0       266644.00      

In [35]:
# Calculate the proportion of late shipments in the sample, i.e. the mean of the boolean mask where the late column equals "Yes"

late_prop_sample = (shipment['late'] =='Yes').mean()
print(late_prop_sample)

0.061


The proportion of late shipments in the sample is 0.061, or 6.1%

### Calculating Z score

In [36]:
#Hypothesize that the proportion of late shipments is 6%

late_prop_hyp = 0.06

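#Bootstrapping: resample the rows with replacement 5000 times and record the proportion of late shipments each time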
boot_distribution = []
for i in range(5000):
  boot_distribution.append(np.mean(shipment.sample(frac=1, replace=True)['late'] == 'Yes'))

#Calculate the standard error from the standard deviation of the bootstrap distribution
std_error = np.std(boot_distribution, ddof=1)

#Z-score
z_score = (late_prop_sample - late_prop_hyp) / std_error
print(z_score)

0.13060992516421904


The null hypothesis, $H_0$, is that the proportion of late shipments is six percent.

The alternative hypothesis, $H_A$, is that the proportion of late shipments is greater than six percent.

Because $H_A$ states that the proportion is greater than the hypothesized value, a right-tailed test is used in this analysis.
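
As a rough cross-check on the bootstrap standard error, the standard error of a sample proportion can also be approximated analytically as sqrt(p(1 - p) / n). The sketch below assumes a sample size of n = 1000 (the group counts printed later in this notebook, 939 and 61, sum to 1000) and plugs in the sample proportion 0.061. The names std_error_analytic and z_score_analytic are illustrative and not part of the original notebook.

```python
import numpy as np

# Values taken from the notebook output; n = 1000 is an assumption
# based on the group counts (939 + 61) printed later.
late_prop_sample = 0.061
late_prop_hyp = 0.06
n = 1000

# Analytic standard error of a sample proportion
std_error_analytic = np.sqrt(late_prop_sample * (1 - late_prop_sample) / n)

# z-score using the analytic standard error instead of the bootstrap one
z_score_analytic = (late_prop_sample - late_prop_hyp) / std_error_analytic

print(std_error_analytic)
print(z_score_analytic)
```

Both values should land close to the bootstrap results, though they will not match exactly because the bootstrap is random.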

# P-Value

In [37]:
#P value - probability, assuming the null hypothesis (6% late) is true, of observing a z-score at least as large as the one calculated
from scipy.stats import norm

z_score = (late_prop_sample - late_prop_hyp) / std_error



#Right tailed test
p_value = 1 - norm.cdf(z_score, loc=0, scale=1)
print(p_value)

0.4480419454228569


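The p-value (about 0.448) is large, well above the usual 0.05 significance level. The observed statistic is therefore not in the tail of the null distribution: if the true proportion of late shipments were 6%, a sample proportion at least this high would occur about 45% of the time. We fail to reject the null hypothesis.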
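
The decision rule for a right-tailed test can be written out explicitly. The sketch below reuses the z-score printed earlier and assumes the conventional significance level of 0.05 (the notebook sets alpha = 0.05 in a later cell). norm.sf is SciPy's survival function, which returns the same right-tail area as 1 - norm.cdf.

```python
from scipy.stats import norm

# z-score printed earlier in the notebook
z_score = 0.13060992516421904

# Assumed significance level (set to 0.05 in a later cell of the notebook)
alpha = 0.05

# Survival function = 1 - CDF, i.e. the right-tail probability
p_value = norm.sf(z_score)

print(p_value)
if p_value <= alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```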

# Significance Level and Confidence Interval



In [38]:
#Calculate a 95% confidence interval from boot_distribution using the quantile method, storing the lower and upper bounds as lower and upper

lower = np.quantile(boot_distribution, 0.025)
upper = np.quantile(boot_distribution, 0.975)

print((lower, upper))

(0.047, 0.077)


Does the confidence interval match up with the conclusion to stick with the original assumption that 6% is a reasonable value for the unknown population parameter?

Answer: Yes. Since 0.06 lies inside the confidence interval (0.047, 0.077) and we failed to reject H0 because of the large p-value, the two results agree.

If the hypothesized population parameter is within the confidence interval, you should fail to reject the null hypothesis.
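
The check described above can be sketched in a few lines. Because the data file is not available here, the example assumes a stand-in sample of 1000 shipments with 61 late ones (consistent with the 0.061 sample proportion) and a fixed random seed. The names late, boot_props, rng, and in_interval are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Stand-in data (assumption): 1000 shipments, 61 of them late
late = np.array([True] * 61 + [False] * 939)

# Bootstrap the proportion of late shipments
boot_props = [rng.choice(late, size=late.size, replace=True).mean()
              for _ in range(5000)]

# 95% confidence interval via the quantile method
lower, upper = np.quantile(boot_props, [0.025, 0.975])

# Does the hypothesized proportion fall inside the interval?
late_prop_hyp = 0.06
in_interval = lower <= late_prop_hyp <= upper
print((lower, upper), in_interval)
```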

# Calculating T-Test

The late_shipments dataset is split into a "yes" group, where late == "Yes", and a "no" group, where late == "No". The weight of each shipment is given in the weight_kilograms variable.

The sample means for the two groups are stored as xbar_no and xbar_yes, the sample standard deviations as s_no and s_yes, and the sample sizes as n_no and n_yes.
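
The three summary statistics for each group can also be collected in one pass with groupby and agg. The sketch below uses a tiny made-up DataFrame (toy) rather than the real shipment data, purely to illustrate the call.

```python
import pandas as pd

# Toy data (made up for illustration), mirroring the column names used here
toy = pd.DataFrame({
    'late': ['No', 'No', 'No', 'Yes', 'Yes'],
    'weight_kilograms': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Mean, sample standard deviation (ddof=1) and count per group
summary = toy.groupby('late')['weight_kilograms'].agg(['mean', 'std', 'count'])
print(summary)

# Individual values can be pulled out by group label and statistic name
xbar_no = summary.loc['No', 'mean']
s_no = summary.loc['No', 'std']
n_no = summary.loc['No', 'count']
```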

In [39]:
#Calculating sample mean of the two groups(x)
groupby = shipment.groupby('late')['weight_kilograms'].mean()
print(groupby)

xbar_yes = shipment[shipment['late'] == 'Yes']['weight_kilograms'].mean()
print(xbar_yes)

xbar_no = shipment[shipment['late'] == 'No']['weight_kilograms'].mean()
print(xbar_no)

late
No     1897.791267
Yes    2715.672131
Name: weight_kilograms, dtype: float64
2715.6721311475408
1897.7912673056444


In [40]:
#Calculating std for the two groups(s)
std = shipment.groupby('late')['weight_kilograms'].std()
print(std)

s_yes = shipment[shipment['late'] == 'Yes']['weight_kilograms'].std()
print(s_yes)

s_no = shipment[shipment['late'] == 'No']['weight_kilograms'].std()
print(s_no)

late
No     3154.039507
Yes    2544.688211
Name: weight_kilograms, dtype: float64
2544.688210903328
3154.0395070841687


In [41]:
#Calculating the count(n)
n = shipment.groupby('late')['weight_kilograms'].count()
print(n)

n_yes = shipment[shipment['late'] == 'Yes']['weight_kilograms'].count()
print(n_yes)

n_no = shipment[shipment['late'] == 'No']['weight_kilograms'].count()
print(n_no)

late
No     939
Yes     61
Name: weight_kilograms, dtype: int64
61
939


In [42]:
#T-test statistic

#Difference between the two sample means
numerator = xbar_no - xbar_yes
#Standard error of the difference, built from each group's standard deviation and size
denominator = np.sqrt(s_no ** 2 / n_no + s_yes ** 2 / n_yes)

t_test = numerator / denominator
print(t_test)

-2.3936661778766433


## Calculating p value from t test

Let us recall our null and alternative hypotheses.

Ho: The mean weight of shipments that weren't late is the same as the mean weight of shipments that were late.

Ha: The mean weight of shipments that weren't late is less than the mean weight of shipments that were late.

In [43]:
alpha = 0.05

#State the degrees of freedom
degrees_of_freedom = 998  #(df = n_no + n_yes - 2 for a two-sample t-test: 939 + 61 - 2)

#Calculate the p value (left-tailed, because Ha uses "less than")
from scipy.stats import t

p_value = t.cdf(t_test, df=degrees_of_freedom)
print(p_value)

0.008432382146249523


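Based on the calculated p-value, we reject the null hypothesis, since the p-value is less than alpha (0.05). The data support the alternative hypothesis that shipments that weren't late weigh less, on average, than shipments that were late.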
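
The whole calculation can be wrapped in a small helper that works from summary statistics alone. This is only a sketch: two_sample_t_from_stats is a hypothetical name, the inputs are the group means, standard deviations (3154.04 for "No", 2544.69 for "Yes") and counts shown in the groupby outputs above, and the degrees of freedom follow the notebook's n_no + n_yes - 2 convention.

```python
import numpy as np
from scipy.stats import t


def two_sample_t_from_stats(xbar_a, s_a, n_a, xbar_b, s_b, n_b):
    """Return (t statistic, left-tailed p-value) for Ha: mean_a < mean_b."""
    # Standard error of the difference between the two sample means
    std_error = np.sqrt(s_a ** 2 / n_a + s_b ** 2 / n_b)
    t_stat = (xbar_a - xbar_b) / std_error
    # Degrees of freedom, following the notebook's convention
    df = n_a + n_b - 2
    return t_stat, t.cdf(t_stat, df=df)


# Group "a" is the "No" group, group "b" is the "Yes" group
t_stat, p_val = two_sample_t_from_stats(
    xbar_a=1897.791267, s_a=3154.039507, n_a=939,
    xbar_b=2715.672131, s_b=2544.688211, n_b=61,
)
alpha = 0.05
print(t_stat, p_val, p_val <= alpha)
```

Keeping the statistic and the p-value in one function makes it harder to pass a mean where a standard deviation is expected, since every input is named at the call site.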