In [3]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns

## Credit card processing
Q1. Tom works at a credit card processing company as a team leader. His team is responsible for validating certain data on new credit card applications. The time the team spends on an application is normally distributed with a mean of 300 minutes and a standard deviation of 40 minutes. Tom and his team worked on process improvements to reduce the time spent processing new applications. After implementing the improvements, Tom checked the time spent by his team on 25 randomly selected new card applications. The average time spent was 290 minutes. Tom is happy that, though it is a small improvement, it is a step in the right direction. He shares the good news with his manager, Lisa, but Lisa is not convinced about the improvement. At 95% confidence, has the process really improved?

In [5]:
# Though the sample size is less than 30 (the usual cue for a T-Test), the question states the times are
# "Normally Distributed" and the population SD is given, so choosing a Z-Test
Mu = 300
sd = 40
n = 25
xbar = 290 
SE = sd/np.sqrt(n)  # Standard Error Formula

In [None]:
# Hypothesis Formulation
# H0 = Mu>=300
# Ha = Mu<300
# One Tail Test - Left Tail Test

In [11]:
# Approach 1: Finding X-critical

x_critical = stats.norm.ppf(q=0.05,loc=Mu,scale=SE) # Alpha=0.05 i.e. 1-0.95 as Confidence Level is 95% (0.95)
x_critical

286.8411709843882

In [13]:
# Approach 2: Finding Zstat and Zcritical

zstat = (xbar-Mu)/SE
zstat

-1.25

In [15]:
zcritical = stats.norm.ppf(0.05,0,1)  # For Standard Normal Distribution, Mean=0 and SD=1
zcritical

-1.6448536269514729

In [19]:
# Approach 3: p-value method
p_value = stats.norm.cdf(xbar,Mu,SE)
p_value

0.10564977366685535
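The three approaches above stop short of stating a decision. A minimal sketch of the decision step for this left-tailed test follows, using only the values from the problem statement; the lowercase variable names and the `reject_by_*` flags are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Values from the problem statement
mu0, sigma, n, xbar, alpha = 300, 40, 25, 290, 0.05
se = sigma / np.sqrt(n)  # standard error of the mean

# Left-tailed test: reject H0 only when the statistic falls below the cutoff
x_critical = stats.norm.ppf(alpha, loc=mu0, scale=se)
z_stat = (xbar - mu0) / se
z_critical = stats.norm.ppf(alpha)
p_value = stats.norm.cdf(z_stat)

reject_by_x = xbar < x_critical      # Approach 1: compare xbar with X-critical
reject_by_z = z_stat < z_critical    # Approach 2: compare Zstat with Zcritical
reject_by_p = p_value < alpha        # Approach 3: compare p-value with alpha

print(reject_by_x, reject_by_z, reject_by_p)  # prints: False False False
```

All three rules agree: 290 is above the cutoff of about 286.84, -1.25 is above -1.645, and the p-value of about 0.106 exceeds 0.05. H0 is not rejected, so at the 5% level the sample does not show that the process improved.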

## E-commerce Delivery Time
Q2. It is known from experience that for a certain E-commerce company the mean delivery time of the products is 5 days with a standard deviation of 1.3 days.

The new customer service manager of the company is afraid that the company is slipping in the delivery time and collects a random sample of 45 orders. The mean delivery time of these samples comes out to be 5.25 days. 

Is there enough statistical evidence to support the manager's apprehension that the mean delivery time of products is greater than 5 days?

Use level of significance $\alpha$ = 0.05

In [7]:
sd = 1.3
n = 45
Mu = 5
xbar = 5.25
SE = sd/np.sqrt(n)

In [None]:
# Hypothesis Formulation:
# H0: Mu<=5
# Ha: Mu>5
# One Tail Test - Right Tail Test

In [27]:
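# Right-tail critical value: use the 0.95 quantile directly (not 1 - ppf); reject H0 if xbar exceeds it
# x_critical = stats.norm.ppf(q=0.95,loc=Mu,scale=SE)
# x_critical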

In [11]:
# P-Value method
p_value = stats.norm.cdf(xbar,Mu,SE)
p_value

0.901481479074213

In [17]:
1-p_value # Upper-tail probability, as it's a Right Tail Test

0.09851852092578695
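For a right-tailed test the same result can be obtained without the manual `1 - cdf` step. The sketch below uses `stats.norm.sf` (the survival function, equal to `1 - cdf`) and the 0.95 quantile as the critical value; the lowercase names are illustrative and not taken from the notebook.

```python
import numpy as np
from scipy import stats

# Values from the problem statement
mu0, sigma, n, xbar, alpha = 5, 1.3, 45, 5.25, 0.05
se = sigma / np.sqrt(n)

# Right-tailed test: the rejection region lies above the (1 - alpha) quantile
x_critical = stats.norm.ppf(1 - alpha, loc=mu0, scale=se)

# sf(x) = 1 - cdf(x), i.e. the upper-tail probability
p_value = stats.norm.sf(xbar, loc=mu0, scale=se)

reject_h0 = (xbar > x_critical) and (p_value < alpha)
print(x_critical, p_value, reject_h0)
```

The sample mean of 5.25 days is below the critical value of roughly 5.32 days and the p-value of about 0.0985 is above 0.05, so H0 is not rejected: the sample does not give enough evidence that the mean delivery time exceeds 5 days.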

## Performance assessment
Q3. A football team has recently hired a new coach with the aim of improving the team's performance, particularly in scoring goals. The team wants to assess whether there is a statistically significant difference in the average number of goals scored per match after hiring the new coach. The team collects data on the number of goals scored per match for a random sample of 30.
Before hiring the new coach, the average was 1.8 goals with a standard deviation of 0.5. After hiring the new coach, the average went up to 2.2 goals. The analytics team wants to assess the performance and provide recommendations for future matches.

In [29]:
Mu = 1.8
sd = 0.5
n = 30
xbar = 2.2
SE = sd/np.sqrt(n)

In [None]:
# Hypothesis Formulation:
# H0: Mu<=1.8
# Ha: Mu>1.8
# One tail test - Right Tail Test

In [64]:
p_value = stats.norm.cdf(xbar,Mu,SE)
p_value

0.9999941143304512

In [66]:
1-p_value

5.8856695488440636e-06

In [56]:
Zstat=(2.2-1.8)/SE
Zstat

4.38178046004133

In [62]:
Zcrit = stats.norm.ppf(1-0.05,0,1)
Zcrit

1.6448536269514722
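The problem statement speaks of a "difference", which could also be read as a two-sided question, while the cells above test only the right tail. The sketch below computes both p-values so the two readings can be compared; it is an illustration with lowercase names that do not appear in the notebook.

```python
import numpy as np
from scipy import stats

# Values from the problem statement
mu0, sigma, n, xbar, alpha = 1.8, 0.5, 30, 2.2, 0.05
se = sigma / np.sqrt(n)
z_stat = (xbar - mu0) / se

# Right-tailed reading (Ha: Mu > 1.8), as in the cells above
p_right = stats.norm.sf(z_stat)

# Two-sided reading (Ha: Mu != 1.8): double the tail beyond |z|
p_two_sided = 2 * stats.norm.sf(abs(z_stat))

print(z_stat, p_right, p_two_sided)
print(p_right < alpha, p_two_sided < alpha)
```

Both p-values are far below 0.05, and the Z statistic of about 4.38 exceeds the critical value of 1.645, so H0 is rejected under either reading: the data support an increase in average goals per match after the new coach was hired.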

Experian Marketing Services reported that the typical American spends a mean of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device. In order to test the validity of this statement, you select a sample of 30 friends and family. The results for the time spent per day accessing the Internet via mobile device (in minutes) are stored in InternetMobileTime.

a. Is there evidence that the population mean time spent per day accessing the Internet via mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05. 

b. What assumption about the population distribution is needed in order to conduct the t test in (a)? 

**Use InternetMobileTime .csv**
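For part (b): the one-sample t test assumes the population is approximately normally distributed, although with a sample of around 30 moderate departures from normality are usually tolerated. One possible way to probe this assumption is a Shapiro-Wilk test, sketched below. The `sample` array is synthetic placeholder data generated for illustration only, and `check_normality` is a hypothetical helper; in the notebook, the `Minutes` column of the loaded file would take the place of `sample`.

```python
import numpy as np
from scipy import stats

def check_normality(values, alpha=0.05):
    """Run a Shapiro-Wilk test; H0 is that the data come from a normal population."""
    w_stat, p_value = stats.shapiro(values)
    return w_stat, p_value, p_value >= alpha  # True means normality is not rejected

# Synthetic, right-skewed placeholder data (NOT the InternetMobileTime values)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=150, size=30)

w_stat, p_value, plausible_normal = check_normality(sample)
print(w_stat, p_value, plausible_normal)
```

A small Shapiro-Wilk p-value would suggest the normality assumption is doubtful, in which case the t test result should be read with more caution.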

In [68]:
df = pd.read_csv('InternetMobileTime .csv')

In [72]:
df.head()

| | Minutes |
|---|---|
| 0 | 72 |
| 1 | 144 |
| 2 | 48 |
| 3 | 72 |
| 4 | 36 |


In [85]:
xbar = df['Minutes'].mean()
xbar

175.26666666666668

In [87]:
df.describe()

| | Minutes |
|---|---|
| count | 30.0 |
| mean | 175.266667 |
| std | 139.836834 |
| min | 24.0 |
| 25% | 72.0 |
| 50% | 144.0 |
| 75% | 276.0 |
| max | 576.0 |


In [None]:
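# Hypothesis Formulation
# Part (a) asks whether the mean is *different* from 144, which is a two-tailed test:
#   H0: Mu = 144, Ha: Mu != 144
# The cells below instead run a right-tailed test, because the sample mean exceeds 144:
# H0: Mu<=144
# Ha: Mu>144
# One Tail Test -- Right Tail Test (the two-sided p-value is twice the one-sided one here)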

In [101]:
Mu = 144
n = 30
xbar = 175  # rounded from 175.27
SD = 139.84 # The population SD is not given; this is the sample SD (rounded).
# Hence, a One Sample T-Test

In [103]:
SE = SD/np.sqrt(n)

In [109]:
tstat = (xbar-Mu)/SE
tstat

1.2142018937829053

In [113]:
tcrit = stats.t.ppf(1-0.05,29)  # The t distribution needs the degrees of freedom, which is n-1 = 29
tcrit

1.6991270265334972

In [115]:
from scipy.stats import ttest_1samp,ttest_ind,ttest_rel

In [120]:
tstats,P_value = ttest_1samp(df,Mu,alternative='greater') # 'greater' because it's a Right Tail Test; the default is a two-sided test

In [124]:
tstats  # Close to the manual value (1.214); the small gap comes from rounding xbar and SD in the manual calculation

array([1.22467437])

In [122]:
P_value

array([0.11527663])
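The `ttest_1samp` result can be reproduced by hand from the unrounded summary statistics in the `describe()` output, which also makes the two-sided p-value for part (a) easy to obtain. This is a sketch with illustrative lowercase names; the mean and standard deviation are copied from the `describe()` table rather than recomputed from the file.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the describe() output
mu0, n = 144, 30
xbar, s = 175.266667, 139.836834
alpha = 0.05

se = s / np.sqrt(n)          # estimated standard error
t_stat = (xbar - mu0) / se
dof = n - 1                  # degrees of freedom

p_right = stats.t.sf(t_stat, dof)                # Ha: Mu > 144
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)   # Ha: Mu != 144

print(t_stat, p_right, p_two_sided)
```

The statistic and one-sided p-value agree with the `ttest_1samp` output (about 1.2247 and 0.1153). The two-sided p-value is about 0.23, which is above 0.05, so H0 is not rejected: the sample does not show that the mean differs from 144 minutes.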

A hotel manager looks to enhance the initial impressions that hotel guests have when they check in. Contributing to initial impressions is the time it takes to deliver a guest’s luggage to the room after check-in. A random sample of 20 deliveries on a particular day was selected in Wing A of the hotel, and a random sample of 20 deliveries was selected in Wing B. The results are stored in Luggage. Analyze the data and determine whether there is a difference between the mean delivery times in the two wings of the hotel. (Use $\alpha$ = 0.05)

In [126]:
df = pd.read_csv('Luggage (1).csv')

In [None]:
# Hypothesis Formulation:
# H0: MuA = MuB   ----- MuA-MuB = 0
# Ha: MuA != MuB  ----- MuA-MuB != 0
# Two Tailed Test
# Two sample T-Test (Independent)

In [130]:
tstat,P_value = ttest_ind(df['WingA'],df['WingB'])

In [132]:
tstat

5.16151166403543

In [134]:
P_value

8.007988032535588e-06
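With a p-value of about 8e-06, far below 0.05, H0 is rejected: the mean delivery times in the two wings differ. Note that `ttest_ind` pools the two variances by default (`equal_var=True`). If the equal-variance assumption is in doubt, Welch's test (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic placeholder samples; the arrays `wing_a` and `wing_b` are generated for illustration and are NOT the Luggage data.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder samples of 20 deliveries each (not the Luggage data)
rng = np.random.default_rng(42)
wing_a = rng.normal(loc=10.5, scale=1.5, size=20)
wing_b = rng.normal(loc=8.5, scale=1.5, size=20)

# Pooled-variance t-test (the default used in the notebook)
t_pooled, p_pooled = stats.ttest_ind(wing_a, wing_b)

# Welch's t-test: does not assume equal population variances
t_welch, p_welch = stats.ttest_ind(wing_a, wing_b, equal_var=False)

print(t_pooled, p_pooled)
print(t_welch, p_welch)
```

When both samples have the same size, as here, the two t statistics coincide and only the degrees of freedom (and therefore the p-value) differ.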

The dataset "Concrete" contains the compressive strength (measured in thousands of pounds per square inch, psi) of 40 concrete samples taken two days and seven days after pouring. At the 0.01 level of significance, can we conclude that the mean compressive strength of concrete is lower at two days than at seven days?

**Load Concrete.csv**

In [139]:
df = pd.read_csv('Concrete.csv')

In [141]:
df.describe()

| | Sample | Two Days | Seven Days |
|---|---|---|---|
| count | 40.0 | 40.0 | 40.0 |
| mean | 20.5 | 2.991 | 3.544125 |
| std | 11.690452 | 0.496086 | 0.503577 |
| min | 1.0 | 1.635 | 2.275 |
| 25% | 10.75 | 2.81 | 3.3175 |
| 50% | 20.5 | 3.0175 | 3.6 |
| 75% | 30.25 | 3.2725 | 3.83125 |
| max | 40.0 | 3.825 | 4.57 |


In [None]:
# Hypothesis Formulation (Md = mean of the paired differences, Seven Days - Two Days):
# H0: Mu2 = Mu7 ---- Md=0
# Ha: Mu2 != Mu7 ---- Md!=0
# Two Tailed Test
# T-Test -- Two Sample Test (Dependent/Related/Paired)

In [143]:
tstat,p_value = ttest_rel(df['Two Days'],df['Seven Days'])  # Two Tail Test

In [145]:
p_value

1.5536317048737742e-11

In [None]:
# Method 2: Right Tail Test (matches the question: is strength lower at two days than at seven?)
# H0: Mu7 - Mu2 <= 0 ---- Md<=0
# Ha: Mu7 - Mu2 > 0 ---- Md>0

In [147]:
tstat1,p_value1 = ttest_rel(df['Seven Days'],df['Two Days'],alternative='greater') # Right Tail Test

In [149]:
p_value1

7.768158524368871e-12
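Both p-values are far below 0.01, so H0 is rejected: the mean compressive strength is lower at two days than at seven days. The right-tailed p-value is exactly half of the two-tailed one (1.5536e-11 / 2 = 7.768e-12) because the observed difference lies in the direction of Ha. The sketch below illustrates this relationship, and the fact that a paired t-test is a one-sample t-test on the differences, using synthetic placeholder data that is NOT the Concrete dataset.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder measurements for 40 paired samples (not the Concrete data)
rng = np.random.default_rng(7)
two_days = rng.normal(loc=3.0, scale=0.5, size=40)
seven_days = two_days + rng.normal(loc=0.55, scale=0.3, size=40)

# Paired t-test, two-sided
t_rel, p_two_sided = stats.ttest_rel(seven_days, two_days)

# Equivalent one-sample t-test on the paired differences
diff = seven_days - two_days
t_one, p_one_sample = stats.ttest_1samp(diff, 0.0)

# Right-tailed paired test (Ha: mean difference > 0)
_, p_greater = stats.ttest_rel(seven_days, two_days, alternative='greater')

print(t_rel, t_one)
print(p_two_sided, p_greater)
```

The halving only holds when the sample difference has the sign stated in Ha; otherwise the one-sided p-value is one minus half of the two-sided value.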