# IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK

In [1]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# CONFIDENCE INTERVALS

**EXERCISE 1**
* What is the normal body temperature for healthy humans? A random sample of 130 healthy human body temperatures provided by Allen Shoemaker yielded a mean of 98.25 degrees and a standard deviation of 0.73 degrees.

* Give a 99% confidence interval for the average body temperature of healthy people.

In [41]:
n = 130
x_bar = 98.25
sigma = 0.73
alpha = 0.01

In [42]:
sem = sigma / np.sqrt(n)
sem

0.06402523540941313

In [44]:
# confidence is the coverage level 1 - alpha (0.99), not alpha itself
np.round(stats.t.interval(confidence=1 - alpha, df=n-1, loc=x_bar, scale=sem), 2)

array([98.08, 98.42])

In [45]:
# stats.norm.interval(confidence=1 - alpha, loc=x_bar, scale=sem)
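
A two-sided interval for the mean has the form x_bar ± t_crit × SEM, so the SciPy result can be cross-checked by hand. The following is a minimal sketch under the exercise's numbers; the names `t_crit`, `margin`, `lower`, `upper` and `ci` are illustrative and do not appear in the notebook.

```python
import numpy as np
import scipy.stats as stats

n, x_bar, s, alpha = 130, 98.25, 0.73, 0.01

sem = s / np.sqrt(n)                            # standard error of the mean
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # two-sided critical value
margin = t_crit * sem
lower, upper = x_bar - margin, x_bar + margin

# Same interval from SciPy; the first argument is the coverage level (0.99)
ci = stats.t.interval(1 - alpha, df=n - 1, loc=x_bar, scale=sem)
print(lower, upper)
print(ci)
```

Both routes should agree, giving an interval of roughly 98.08 to 98.42 degrees.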

**EXERCISE 2**
* The administrators for a hospital wished to estimate the average number of days required for inpatient treatment of patients between the ages of 25 and 34. A random sample of 500 hospital patients between these ages produced a mean and standard deviation equal to 5.4 and 3.1 days, respectively.

* Construct a 95% confidence interval for the mean length of stay for the population of patients from which the sample was drawn.

In [None]:
n = 500
x_bar = 5.4   # sample mean (days)
s = 3.1       # sample standard deviation (days)
alpha = 0.05
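
The notebook leaves this exercise unfinished. One possible way to complete it, sketched under the assumption that the same t-based approach as Exercise 1 is intended, is shown here; `ci_t` and `ci_z` are illustrative names. With n = 500 the t and normal intervals are nearly identical.

```python
import numpy as np
import scipy.stats as stats

n, x_bar, s, alpha = 500, 5.4, 3.1, 0.05

sem = s / np.sqrt(n)  # standard error of the mean

# t-based interval (sigma estimated from the sample)
ci_t = stats.t.interval(1 - alpha, df=n - 1, loc=x_bar, scale=sem)
# Normal approximation, reasonable for a sample this large
ci_z = stats.norm.interval(1 - alpha, loc=x_bar, scale=sem)

print(ci_t)
print(ci_z)
```

Either way the interval is roughly 5.13 to 5.67 days.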

# HYPOTHESIS TESTING

**EXERCISE 3**
* The hourly wages in a particular industry are normally distributed with mean $13.20 and standard deviation $2.50. A company in this industry employs 40 workers, paying them an average of $12.20 per hour. Can this company be accused of paying substandard wages? Use an α = .01 level test. (Wackerly, Ex.10.18)

CHECK: statistic: -2.5298221281347035, pvalue= 0.005706018193000826

In [None]:
# H0: mu = 13.20
# H1: mu < 13.20

In [7]:
mu0 = 13.20
n = 40
s = 2.50
x_bar = 12.20
alpha = 0.01

In [11]:
z = (x_bar - mu0) / (s/np.sqrt(n))
z

-2.5298221281347035

In [15]:
sem = s/np.sqrt(n)
sem

0.3952847075210474

In [19]:
p_value = stats.norm.cdf(z)
p_value

0.005706018193000826

In [34]:
if p_value < alpha:
    print("Reject the Null")
else:
    print("Fail to reject")

Reject the Null
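
The same decision can be reached by comparing the statistic with a critical value instead of using the p-value. This is a small sketch of that alternative; `sigma`, `z_crit` and `reject` are illustrative names (the notebook stores the population standard deviation in `s`).

```python
import numpy as np
import scipy.stats as stats

mu0, sigma, n, x_bar, alpha = 13.20, 2.50, 40, 12.20, 0.01

z = (x_bar - mu0) / (sigma / np.sqrt(n))  # test statistic
z_crit = stats.norm.ppf(alpha)            # lower-tail critical value, about -2.33

# Left-tailed test: reject H0 when z falls below the critical value
reject = z < z_crit
print(z, z_crit, reject)
```

Since z is about -2.53 and the critical value is about -2.33, the null hypothesis is rejected, in agreement with the p-value approach.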


**EXERCISE 4**
* Shear strength measurements derived from unconfined compression tests for two types of soils gave the results shown in the following document (measurements in tons per square foot). Do the soils appear to differ with respect to average shear strength, at the 1% significance level?

Results for two types of soils

CHECK: statistic: 5.1681473319343345, pvalue= 2.593228732352821e-06

In [24]:
df1 = pd.read_csv("soil - Sheet1.csv")
df1

Unnamed: 0,Soil1,Soil2
0,1.442,1.364
1,1.943,1.878
2,1.11,1.337
3,1.912,1.828
4,1.553,1.371
5,1.641,1.428
6,1.499,1.119
7,1.347,1.373
8,1.685,1.589
9,1.578,1.714


In [35]:
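# Soil1 has fewer observations than Soil2; fill the gaps with the column mean.
# Note: mean imputation shrinks the variance and inflates n; dropna() avoids this.
df1["Soil1"] = df1["Soil1"].fillna(df1["Soil1"].mean())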
df1.tail(5)

Unnamed: 0,Soil1,Soil2
30,1.6918,1.593
31,1.6918,1.172
32,1.6918,1.51
33,1.6918,1.74
34,1.6918,1.355


In [36]:
leveneTest = stats.levene(df1.Soil1, df1.Soil2)
leveneTest

LeveneResult(statistic=1.6612825488884275, pvalue=0.20179788672995813)

In [28]:
indTest = stats.ttest_ind(df1.Soil1, df1.Soil2, equal_var = True)
indTest

Ttest_indResult(statistic=5.58856260809653, pvalue=4.381657766244157e-07)
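
The statistic above (5.589) is larger than the CHECK value (5.168). The gap is consistent with the mean imputation of `Soil1`: the filled values leave the sample mean unchanged but reduce the standard deviation and raise the sample size from 30 to 35, which inflates the t statistic. The sketch below illustrates the effect on hypothetical random data (not the soil measurements); `soil`, `imputed` and `observed` are illustrative names.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for the soil data: 30 values in one column, 35 in the other
soil = pd.DataFrame({
    "Soil1": np.append(rng.normal(1.7, 0.2, 30), [np.nan] * 5),
    "Soil2": rng.normal(1.4, 0.2, 35),
})

imputed = soil["Soil1"].fillna(soil["Soil1"].mean())  # 35 values, 5 of them artificial
observed = soil["Soil1"].dropna()                     # the 30 real measurements

t_imputed = stats.ttest_ind(imputed, soil["Soil2"], equal_var=True)
t_observed = stats.ttest_ind(observed, soil["Soil2"], equal_var=True)

print(len(imputed), imputed.std(), t_imputed.statistic)
print(len(observed), observed.std(), t_observed.statistic)
```

In the notebook, passing `df1.Soil1.dropna()` from the freshly loaded data (before any imputation) to `stats.ttest_ind` would be the corresponding change.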

**EXERCISE 5**
* The following dataset is based on data provided by the World Bank EdStats (https://datacatalog.worldbank.org/dataset/education-statistics): the 2015 PISA Test Dataset.

1. Get descriptive statistics (the central tendency, dispersion and shape of a dataset's distribution) for each continent group (AS, EU, AF, NA, SA, OC).
2. Determine whether the average math scores differ between European (EU) and Asian (AS) countries (assume normality and equal variances). Draw side-by-side box plots.

CHECK: statistic=0.870055317967983, pvalue=0.38826888111307345

In [4]:
df2 = pd.read_csv("2015 PISA Test - Sheet1.csv")
df2

Unnamed: 0,Country Code,Continent_Code,internet_users_per_100,Math,Reading,Science
0,ALB,EU,63.252933,413.1570,405.2588,427.2250
1,ARE,AS,90.500000,427.4827,433.5423,436.7311
2,ARG,SA,68.043064,409.0333,425.3031,432.2262
3,AUS,OC,84.560519,493.8962,502.9006,509.9939
4,AUT,EU,83.940142,496.7423,484.8656,495.0375
...,...,...,...,...,...,...
65,TUN,AF,48.519836,366.8180,361.0555,386.4034
66,TUR,EU,53.744979,420.4540,428.3351,425.4895
67,URY,SA,64.600000,417.9919,436.5721,435.3630
68,USA,,74.554202,469.6285,496.9351,496.2424
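
The blank `Continent_Code` for USA suggests that the code `NA` (North America) was read as a missing value, since pandas treats the string `NA` as missing by default. A minimal sketch of one way to keep it, using a tiny hypothetical CSV (the row `XXX` is made up for illustration):

```python
import io
import pandas as pd

# Hypothetical three-row CSV: "NA" is a real continent code, the last row is truly empty
csv_text = "Country Code,Continent_Code\nUSA,NA\nALB,EU\nXXX,\n"

# Default parsing: both "NA" and the empty field become missing
default = pd.read_csv(io.StringIO(csv_text))

# Only treat empty fields as missing, so "NA" survives as a string
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default["Continent_Code"].isna().sum())
print(kept["Continent_Code"].tolist())
```

Applying the same two arguments to the notebook's `pd.read_csv` call would be one option, assuming no other column relies on the default missing-value markers.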


In [5]:
df2.Continent_Code.value_counts()

EU    37
AS    17
SA     7
OC     2
AF     2
Name: Continent_Code, dtype: int64
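
Part 1 of the exercise asks for descriptive statistics per continent group. One possible approach is `groupby` followed by `describe`, with skewness added as a shape measure. The sketch uses a small hypothetical table with made-up scores; in the notebook the same calls would be applied to `df2`.

```python
import pandas as pd

# Hypothetical miniature of the PISA table (scores are made up for illustration)
scores = pd.DataFrame({
    "Continent_Code": ["EU", "EU", "EU", "AS", "AS", "AS"],
    "Math": [410.0, 500.0, 420.0, 430.0, 560.0, 530.0],
})

by_continent = scores.groupby("Continent_Code")["Math"]

summary = by_continent.describe()      # count, mean, std, min, quartiles, max
summary["skew"] = by_continent.skew()  # shape: asymmetry within each group
print(summary)
```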

In [None]:
#H0: mu1 = mu2
#H1: mu1 != mu2

In [29]:
# Compare the variances of the Math scores themselves, not of the boolean masks
leveneTest = stats.levene(df2[df2["Continent_Code"] == "EU"]["Math"], df2[df2["Continent_Code"] == "AS"]["Math"])
leveneTest

In [32]:
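# Test the Math scores (not the boolean masks), assuming equal variances as the exercise states
indTest = stats.ttest_ind(df2[df2["Continent_Code"] == "EU"]["Math"], df2[df2["Continent_Code"] == "AS"]["Math"], equal_var = True)
indTest

Ttest_indResult(statistic=0.870055317967983, pvalue=0.38826888111307345)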
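
The exercise also asks for side-by-side box plots, which the notebook does not draw. A minimal sketch with matplotlib follows, again on hypothetical made-up scores; in the notebook the two series would come from `df2`, and the `matplotlib.use("Agg")` line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature of the PISA table (scores are made up for illustration)
scores = pd.DataFrame({
    "Continent_Code": ["EU", "EU", "EU", "AS", "AS", "AS"],
    "Math": [410.0, 500.0, 420.0, 430.0, 560.0, 530.0],
})

eu = scores.loc[scores["Continent_Code"] == "EU", "Math"]
asia = scores.loc[scores["Continent_Code"] == "AS", "Math"]

fig, ax = plt.subplots()
ax.boxplot([eu.to_numpy(), asia.to_numpy()])  # one box per group, side by side
ax.set_xticklabels(["EU", "AS"])
ax.set_ylabel("Math score")
```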