### CONFIDENCE INTERVALS

#### EXERCISE 1. 
What is the normal body temperature for healthy humans? A random sample of 130 healthy human body temperatures provided by Allen Shoemaker yielded a mean of 98.25 degrees and a standard deviation of 0.73 degrees. 

Give a 99% confidence interval for the average body temperature of healthy people.

In [19]:
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
n = 130               # sample size
xbar = 98.25          # sample mean
s = 0.73              # sample std
sem = s / np.sqrt(n)  # standard error of the mean

In [3]:
stats.t.interval(confidence=0.99,  # confidence level  
                 df=n-1,           # degrees of freedom
                 loc=xbar,         # sample mean
                 scale=sem)        # standard error of the mean

(98.08260738705933, 98.41739261294067)

We are 99% confident that the average body temperature of healthy people is between 98.08 and 98.42 degrees.
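
The same interval can be reproduced by hand from the formula $\bar{x} \pm t_{\alpha/2,\,n-1} \cdot s/\sqrt{n}$. The following is a minimal sketch that recomputes it for the numbers above and compares the result with `stats.t.interval`; the names `conf_level`, `t_crit`, `margin` and `manual_ci` are illustrative and not part of the original exercise.

```python
import numpy as np
from scipy import stats

n, xbar, s = 130, 98.25, 0.73
sem = s / np.sqrt(n)                      # standard error of the mean

conf_level = 0.99
# Two-sided critical value: leave (1 - conf_level) / 2 in each tail
t_crit = stats.t.ppf(1 - (1 - conf_level) / 2, df=n - 1)
margin = t_crit * sem                     # margin of error
manual_ci = (xbar - margin, xbar + margin)

# Confidence level passed positionally, so the keyword name does not matter
scipy_ci = stats.t.interval(conf_level, df=n - 1, loc=xbar, scale=sem)

print(manual_ci)
print(scipy_ci)
```

Passing the confidence level positionally sidesteps the keyword name, which was `alpha` rather than `confidence` in older SciPy releases.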

#### EXERCISE 2. 
The administrators for a hospital wished to estimate the average number of days required for inpatient treatment of patients between the ages of 25 and 34. A random sample of 500 hospital patients between these ages produced a mean and standard deviation equal to 5.4 and 3.1 days, respectively.


Construct a 95% confidence interval for the mean length of stay for the population of patients from which the sample was drawn.

In [4]:
n = 500
xbar = 5.4
s = 3.1
sem = s / np.sqrt(n)

In [5]:
stats.t.interval(confidence=0.95,  # confidence level  
                 df=n-1,           # degrees of freedom
                 loc=xbar,         # sample mean
                 scale=sem)        # standard error of the mean

(5.127617354510309, 5.672382645489692)

We are 95% confident that the mean length of stay for inpatients between the ages of 25 and 34 is between 5.13 and 5.67 days.
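
With a sample of 500, the t distribution is very close to the standard normal, so a z-based interval should be nearly identical. The sketch below checks this for the numbers in this exercise; `stats.norm.interval` is used here only for comparison and is not part of the original solution, and `max_diff` is an illustrative name.

```python
import numpy as np
from scipy import stats

n, xbar, s = 500, 5.4, 3.1
sem = s / np.sqrt(n)

# t-based interval (as in the exercise) and normal-based interval
t_ci = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=sem)
z_ci = stats.norm.interval(0.95, loc=xbar, scale=sem)

# Largest difference between corresponding endpoints
max_diff = max(abs(t - z) for t, z in zip(t_ci, z_ci))

print(t_ci)
print(z_ci)
print(max_diff)
```

The z interval is slightly narrower because the normal distribution has lighter tails than the t distribution, but at this sample size the difference is well below the rounding used in the conclusion.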

### HYPOTHESIS TESTING

#### EXERCISE 3.
The hourly wages in a particular industry are normally distributed with a mean of \$13.20 and a standard deviation of \$2.50. A company in this industry employs 40 workers, paying them an average of \$12.20 per hour. Can this company be accused of paying substandard wages? Use an α = .01 level test. (Wackerly, Ex.10.18)

CHECK: statistic: -2.5298221281347035, pvalue= 0.005706018193000826

In [6]:
# H0: mu = 13.20
# H1: mu < 13.20

In [7]:
n = 40
mu = 13.20
xbar = 12.20
sigma = 2.50
alpha = 0.01

In [8]:
se = sigma / np.sqrt(n)

In [11]:
z = (xbar - mu) / se
z

-2.5298221281347035

In [12]:
p_value = stats.norm.cdf(z)
p_value

0.005706018193000826

In [18]:
if p_value < alpha:
    print(f"At the {alpha} significance level, there is enough evidence to support the claim that this company is paying substandard wages.")
else:
    print(f"At the {alpha} significance level, there is not enough evidence to support the claim that this company is paying substandard wages.")

At the 0.01 significance level, there is enough evidence to support the claim that this company is paying substandard wages.
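
The same decision can be reached with the rejection-region approach: for a left-tailed test, reject H0 when the z statistic falls below the critical value $z_{\alpha}$. This is a small sketch using the numbers from this exercise; `z_crit` and `reject` are illustrative names.

```python
import numpy as np
from scipy import stats

n, mu, xbar, sigma, alpha = 40, 13.20, 12.20, 2.50, 0.01

z = (xbar - mu) / (sigma / np.sqrt(n))   # test statistic
z_crit = stats.norm.ppf(alpha)           # left-tail critical value, about -2.326

# Left-tailed test: reject H0 when z lies in the lower tail
reject = bool(z < z_crit)

print(z, z_crit, reject)
```

Both approaches always agree, since `z < z_crit` holds exactly when `stats.norm.cdf(z) < alpha`.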


#### EXERCISE 4.
Shear strength measurements derived from unconfined compression tests for two types of soils gave the results shown in the following document (measurements in tons per square foot). Do the soils appear to differ with respect to average shear strength, at the 1% significance level?

[Results for two types of soil](https://docs.google.com/spreadsheets/d/1f2odmgDboIVuSV-A5gmuC25ppqQ5g1OIIF4h5EOqUcI/edit?usp=sharing)

CHECK: statistic: 5.1681473319343345, pvalue= 2.593228732352821e-06

In [20]:
soil = pd.read_csv("soil.csv")

In [23]:
soil.head(10)

,Soil1,Soil2
0,1.442,1.364
1,1.943,1.878
2,1.11,1.337
3,1.912,1.828
4,1.553,1.371
5,1.641,1.428
6,1.499,1.119
7,1.347,1.373
8,1.685,1.589
9,1.578,1.714


In [26]:
soil.shape

(35, 2)

In [24]:
soil.isna().sum()

Soil1    5
Soil2    0
dtype: int64

In [27]:
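# Note: dropna() removes whole rows, so the 5 rows with a missing Soil1 value
# also discard 5 valid Soil2 measurements. This may explain why the test
# statistic below differs slightly from the CHECK value.
soil.dropna(inplace=True)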

In [28]:
soil.shape

(30, 2)
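
Because the two soil samples are independent, the missing values could instead be dropped one column at a time, which keeps every valid measurement. The sketch below illustrates the difference on a hypothetical four-row frame (`toy` and its values are made up for illustration, not taken from `soil.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Soil1 has two missing values, Soil2 has none
toy = pd.DataFrame({
    "Soil1": [1.4, 1.9, np.nan, np.nan],
    "Soil2": [1.3, 1.8, 1.1, 1.5],
})

rowwise = toy.dropna()           # drops every row containing any NaN
soil1 = toy["Soil1"].dropna()    # drops only the missing Soil1 values
soil2 = toy["Soil2"].dropna()    # keeps all Soil2 values

print(len(rowwise), len(soil1), len(soil2))
```

An alternative is to pass the unfiltered columns to `stats.ttest_ind` with `nan_policy="omit"`, which ignores missing values within each sample.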

In [29]:
# Perform Levene's test for equal variances
# H0: The population variances are equal
# H1: The population variances differ
# A small p-value would suggest that the populations do not have equal variances.

stats.levene(soil.Soil1, soil.Soil2)

LeveneResult(statistic=0.2323198108973329, pvalue=0.631622932753579)

 pvalue=0.6316 > alpha=0.01 ---> Fail to reject the null hypothesis; there is no evidence that the population variances differ, so equal variances are assumed

In [30]:
indTest = stats.ttest_ind(soil.Soil1, soil.Soil2, equal_var=True)
indTest

Ttest_indResult(statistic=5.134893443609086, pvalue=3.4402046436336477e-06)

In [34]:
alpha = 0.01

if indTest.pvalue < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject null hypothesis")

Reject the null hypothesis
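
To see what `stats.ttest_ind(..., equal_var=True)` computes, the pooled-variance t statistic can be worked out by hand: $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_p^2 (1/n_1 + 1/n_2)}$ with $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$. The sketch below uses randomly generated stand-in samples (`a` and `b` are hypothetical, not the soil data) and compares the manual result with SciPy's.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in samples, generated with a fixed seed
rng = np.random.default_rng(0)
a = rng.normal(1.6, 0.2, size=30)
b = rng.normal(1.4, 0.2, size=35)

n1, n2 = len(a), len(b)
# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

res = stats.ttest_ind(a, b, equal_var=True)
print(t_manual, p_manual)
print(res.statistic, res.pvalue)
```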


#### EXERCISE 5. 
The following dataset is based on data provided by the World Bank (https://datacatalog.worldbank.org/dataset/education-statistics). World Bank Edstats. [2015 PISA Test Dataset](https://docs.google.com/spreadsheets/d/14rVnIUfEm3CuK9bSvS5253RHWzQhXOuNc0I-cCkgpR8/edit?usp=sharing) 

1. Get descriptive statistics (the central tendency, dispersion and shape of a dataset’s distribution) for each continent group (AS, EU, AF, NA, SA, OC).
2. Determine whether there is any difference (on average) in the math scores between European (EU) and Asian (AS) countries (assume normality and equal variances). Draw side-by-side box plots.

CHECK: statistic=0.870055317967983, pvalue=0.38826888111307345

In [36]:
df = pd.read_csv("2015_PISA_Test.csv")

In [37]:
df

,Country Code,Continent_Code,internet_users_per_100,Math,Reading,Science
0,ALB,EU,63.252933,413.1570,405.2588,427.2250
1,ARE,AS,90.500000,427.4827,433.5423,436.7311
2,ARG,SA,68.043064,409.0333,425.3031,432.2262
3,AUS,OC,84.560519,493.8962,502.9006,509.9939
4,AUT,EU,83.940142,496.7423,484.8656,495.0375
...,...,...,...,...,...,...
65,TUN,AF,48.519836,366.8180,361.0555,386.4034
66,TUR,EU,53.744979,420.4540,428.3351,425.4895
67,URY,SA,64.600000,417.9919,436.5721,435.3630
68,USA,,74.554202,469.6285,496.9351,496.2424


In [38]:
df.shape

(70, 6)

In [39]:
df.isna().sum()

Country Code              0
Continent_Code            5
internet_users_per_100    0
Math                      0
Reading                   0
Science                   0
dtype: int64

In [40]:
df.dropna(inplace=True)

In [41]:
df.shape

(65, 6)
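
One thing worth checking before dropping these rows: by default `pd.read_csv` treats the string `NA` as a missing value, so a continent code of `NA` (North America) is read as NaN. The USA row above has an empty `Continent_Code`, which suggests this may be what happened here, although the source file would need to be inspected to confirm it. The sketch below reproduces the behaviour on a small in-memory CSV (the two rows are a hypothetical excerpt) and shows one way around it.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt in the same layout as the PISA file
csv_text = (
    "Country Code,Continent_Code,Math\n"
    "USA,NA,469.6\n"
    "AUT,EU,496.7\n"
)

# Default parsing: the literal string "NA" becomes NaN
default = pd.read_csv(io.StringIO(csv_text))

# Only treat empty fields as missing, so "NA" survives as a string
fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default["Continent_Code"].isna().sum())
print(fixed["Continent_Code"].tolist())
```

If the codes are preserved this way, the `dropna` step is no longer needed for `Continent_Code`, and the NA group requested in the exercise would appear in the grouped statistics.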

In [55]:
df.groupby("Continent_Code").describe()

,internet_users_per_100,internet_users_per_100,internet_users_per_100,internet_users_per_100,internet_users_per_100,internet_users_per_100,internet_users_per_100,internet_users_per_100,Math,Math,...,Reading,Reading,Science,Science,Science,Science,Science,Science,Science,Science
Continent_Code,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
AF,2.0,43.359918,7.297226,38.2,40.779959,43.359918,45.939877,48.519836,2.0,363.2121,...,358.25645,361.0555,2.0,381.07425,7.536556,375.7451,378.409675,381.07425,383.738825,386.4034
AS,17.0,68.455613,21.08606,21.976068,50.3,74.0,84.948353,92.884826,17.0,466.216647,...,508.6905,535.1002,17.0,467.945847,56.671371,386.4854,417.6112,456.4836,523.2774,555.5747
EU,37.0,77.274888,12.425773,53.744979,68.6329,76.184,87.479056,98.2,37.0,477.981449,...,499.8146,526.4247,37.0,478.299381,34.450616,383.6824,460.7749,490.225,501.9369,534.1937
OC,2.0,86.391704,2.589686,84.560519,85.476112,86.391704,87.307296,88.222889,2.0,494.55975,...,507.678175,509.2707,2.0,511.6487,2.340241,509.9939,510.8213,511.6487,512.4761,513.3035
SA,7.0,60.180494,9.772455,40.9,57.116462,64.289,66.321532,69.198471,7.0,402.8877,...,431.9227,458.5709,7.0,421.747186,18.470319,396.6836,408.20545,424.5905,433.7946,446.9561
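
`describe()` reports central tendency and dispersion but no measure of shape. One way to cover that part of the exercise is to add skewness through `groupby(...).agg`. The sketch below uses a hypothetical eight-row frame (`toy` and its scores are made up for illustration); on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data in the same layout as the PISA frame
toy = pd.DataFrame({
    "Continent_Code": ["EU"] * 4 + ["AS"] * 4,
    "Math": [413.2, 496.7, 420.5, 510.0, 427.5, 564.2, 380.3, 532.4],
})

# Count, mean, standard deviation and sample skewness per continent
shape_stats = (
    toy.groupby("Continent_Code")["Math"]
       .agg(["count", "mean", "std", "skew"])
)

print(shape_stats)
```

Kurtosis can be obtained in a similar way, for example by applying `Series.kurt` to each group.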
