In [84]:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import t
from scipy.stats import norm
import statsmodels.api as sm

**CONFIDENCE INTERVALS**

**EXERCISE 1.**

What is the normal body temperature for healthy humans? A random sample of 130 healthy human body temperatures provided by Allen Shoemaker yielded a mean of 98.25 degrees and a standard deviation of 0.73 degrees.

Give a 99% confidence interval for the average body temperature of healthy people.

**1. Solution**

**Confidence Interval = x ± z * (s / sqrt(n))**

Sample mean(x) = 98.25 degrees
Sample standard deviation(s) = 0.73 degrees
Sample size(n) = 130

We want a 99% confidence interval.

The z-score for a 99% confidence level is approximately 2.576.

Upper Limit = 98.25 + 2.576 * (0.73/sqrt(130)) = 98.415

Lower Limit = 98.25 - 2.576 * (0.73/sqrt(130)) = 98.085

CI = (98.085, 98.415)

**2. Solution**

In [85]:
sample_mean = 98.25
sample_std = 0.73
sample_size = 130
confidence_level = 0.99

z_score = stats.norm.ppf((1 + confidence_level) / 2)
margin_of_error = z_score * (sample_std / np.sqrt(sample_size))
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print("99% Confidence Interval:", confidence_interval)

99% Confidence Interval: (98.08508192246582, 98.41491807753418)


**3. Solution**

In [86]:
sem = sample_std / np.sqrt(sample_size)
sem

0.06402523540941313

In [87]:
moe = 2.58 * sem
moe

0.1651851073562859

In [88]:
upper_limit = sample_mean + moe
lower_limit = sample_mean - moe
print('99% Confidence Interval:',(lower_limit, upper_limit))

99% Confidence Interval: (98.08481489264372, 98.41518510735628)
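As a cross-check on the solutions above, the same interval can be obtained in one call with `stats.norm.interval`. This is a minimal sketch, assuming the normal approximation used throughout Exercise 1; the names `lower` and `upper` are introduced here for illustration only. The small gap between Solution 2 and Solution 3 comes from rounding the critical value to 2.58 instead of using 2.576.

```python
import numpy as np
from scipy import stats

# Summary statistics from Exercise 1
sample_mean = 98.25
sample_std = 0.73
sample_size = 130

# Standard error of the mean
sem = sample_std / np.sqrt(sample_size)

# The confidence level is passed positionally because the keyword name
# differs between SciPy versions
lower, upper = stats.norm.interval(0.99, loc=sample_mean, scale=sem)

print(f"99% Confidence Interval: ({lower:.3f}, {upper:.3f})")
```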


**EXERCISE 2.**

The administrators for a hospital wished to estimate the average number of days required for inpatient treatment of patients between the ages of 25 and 34. A random sample of 500 hospital patients between these ages produced a mean and standard deviation equal to 5.4 and 3.1 days, respectively.


Construct a 95% confidence interval for the mean length of stay for the population of patients from which the sample was drawn.

**1. Solution**

In [89]:
sample_mean = 5.4
sample_std = 3.1
sample_size = 500
confidence_level = 0.95

df = sample_size - 1
t_score = t.ppf((1 + confidence_level) / 2, df)
# The standard error divides by sqrt(n); the degrees of freedom only enter through the t quantile
margin_of_error = t_score * (sample_std / np.sqrt(sample_size))
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print('95% Confidence Interval:', tuple(round(float(v), 3) for v in confidence_interval))

95% Confidence Interval: (5.128, 5.672)


**2. Solution**

In [90]:
sem = sample_std / np.sqrt(sample_size)
sem

0.13863621460498696

In [91]:
moe = 1.96 * sem
moe

0.27172698062577444

In [92]:
upper_limit = sample_mean + moe
lower_limit = sample_mean - moe
print('95% Confidence Interval:',(lower_limit, upper_limit))

95% Confidence Interval: (5.128273019374226, 5.671726980625775)
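The two solutions for Exercise 2 give nearly the same interval because, with 499 degrees of freedom, the t critical value is very close to the normal one. The sketch below compares the two quantiles directly; the names `tail`, `z_crit` and `t_crit` are introduced here for illustration.

```python
from scipy.stats import norm, t

n_patients = 500          # sample size from Exercise 2
confidence_level = 0.95

# Upper-tail probability point for a two-sided interval
tail = (1 + confidence_level) / 2

z_crit = norm.ppf(tail)               # normal critical value
t_crit = t.ppf(tail, n_patients - 1)  # t critical value with n - 1 degrees of freedom

print(f"z critical value: {z_crit:.4f}")
print(f"t critical value: {t_crit:.4f}")
print(f"difference:       {t_crit - z_crit:.4f}")
```

For small samples the difference would be larger, and the t-based interval would be the safer choice when the population standard deviation is unknown.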


**HYPOTHESIS TESTING**




**EXERCISE 3.**

The hourly wages in a particular industry are normally distributed with mean $13.20 and standard deviation $2.50. A company in this industry employs 40 workers, paying them an average of $12.20 per hour. Can this company be accused of paying substandard wages? Use an α = .01 level test. (Wackerly, Ex.10.18)

CHECK: statistic: -2.5298221281347035, pvalue= 0.005706018193000826

In [93]:

mu = 13.20
sigma = 2.50 
x_bar = 12.20
n = 40 
alpha = 0.01 


std_error = sigma / (n ** 0.5)

z_score = (x_bar - mu) / std_error

p_value = norm.cdf(z_score)

if p_value < alpha:
    print("Reject the null hypothesis. The company can be accused of paying substandard wages.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to accuse the company of paying substandard wages.")

print("Statistic:", z_score)
print("p-value:", p_value)

Reject the null hypothesis. The company can be accused of paying substandard wages.
Statistic: -2.5298221281347035
p-value: 0.005706018193000826
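The same decision can be reached with the rejection-region approach, which may make the one-sided nature of the test easier to see. This is a sketch under the assumption that the hypotheses are H0: mu = 13.20 against H1: mu < 13.20 (a lower-tail test); `z_crit` and `reject` are names introduced here.

```python
from scipy.stats import norm

# Values from Exercise 3
mu, sigma, x_bar, n, alpha = 13.20, 2.50, 12.20, 40, 0.01

# Test statistic: standardized distance of the sample mean from mu
z_score = (x_bar - mu) / (sigma / n ** 0.5)

# Lower-tail critical value: reject H0 when z_score falls below it
z_crit = norm.ppf(alpha)
reject = z_score < z_crit

print(f"z = {z_score:.4f}, critical value = {z_crit:.4f}, reject H0: {reject}")
```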


**EXERCISE 4.** 

Shear strength measurements derived from unconfined compression tests for two types of soils gave the results shown in the following document (measurements in tons per square foot). Do the soils appear to differ with respect to average shear strength, at the 1% significance level?

Results for two types of soils

CHECK: statistic: 5.1681473319343345, pvalue= 2.593228732352821e-06

In [94]:
df = pd.read_excel('soil.xlsx')

In [97]:
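from scipy.stats import ttest_ind

# soil1 and soil2 are not defined in the cells shown; as an assumption, take the
# first two columns of soil.xlsx as the two soil types and drop empty cells
soil1 = df.iloc[:, 0].dropna()
soil2 = df.iloc[:, 1].dropna()

# Perform two-sample t-test (pooled variance, scipy's default)
t_statistic, p_value = ttest_ind(soil1, soil2)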

# Given significance level
alpha = 0.01

# Check if the null hypothesis can be rejected
if p_value < alpha:
    print("Reject the null hypothesis. The soils appear to differ with respect to average shear strength.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to conclude that the soils differ with respect to average shear strength.")

print("Statistic:", t_statistic)
print("p-value:", p_value)

Reject the null hypothesis. The soils appear to differ with respect to average shear strength.
Statistic: 5.1681473319343345
p-value: 2.593228732352821e-06
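To see what `ttest_ind` computes with its default equal-variance setting, the pooled-variance statistic can be written out by hand. The sketch below is illustrative only: `pooled_t_test` is a hypothetical helper, and `toy_a` and `toy_b` are made-up numbers, not the soil measurements.

```python
import numpy as np
from scipy import stats

def pooled_t_test(a, b):
    """Two-sided, two-sample t-test assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    dof = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
    t_stat = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    # Two-sided p-value from the t distribution
    p_val = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_val

# Made-up values for illustration (NOT the data in soil.xlsx)
toy_a = [1.2, 1.5, 1.1, 1.8, 1.4]
toy_b = [0.9, 1.0, 1.3, 0.8]

t_manual, p_manual = pooled_t_test(toy_a, toy_b)
t_scipy, p_scipy = stats.ttest_ind(toy_a, toy_b)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

If the equal-variance assumption is in doubt, `ttest_ind(..., equal_var=False)` runs Welch's test instead.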


In [100]:
df = pd.read_excel('EdStatsEXCEL.xlsx')



In [102]:
df

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2055,2060,2065,2070,2075,2080,2085,2090,2095,2100
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,,,...,,,,,,,,,,
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886925,Zimbabwe,ZWE,"Youth illiterate population, 15-24 years, male...",UIS.LP.AG15T24.M,,,,,,,...,,,,,,,,,,
886926,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, b...",SE.ADT.1524.LT.ZS,,,,,,,...,,,,,,,,,,
886927,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, f...",SE.ADT.1524.LT.FE.ZS,,,,,,,...,,,,,,,,,,
886928,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, g...",SE.ADT.1524.LT.FM.ZS,,,,,,,...,,,,,,,,,,


**EXERCISE 5.**

The following dataset is based on data provided by the World Bank (https://datacatalog.worldbank.org/dataset/education-statistics): World Bank Edstats, 2015 PISA Test Dataset.

Get descriptive statistics (the central tendency, dispersion and shape of a dataset’s distribution) for each continent group (AS, EU, AF, NA, SA, OC).
Determine whether there is any difference (on the average) for the math scores among European (EU) and Asian (AS) countries (assume normality and equal variances). Draw side-by-side box plots.
CHECK: statistic=0.870055317967983, pvalue=0.38826888111307345

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv("PISA_test_dataset.csv")

# Filter the dataset for European (EU) and Asian (AS) countries
eu_countries = data[data['Continent'] == 'EU']['Math Score']
as_countries = data[data['Continent'] == 'AS']['Math Score']

# Calculate descriptive statistics
eu_stats = eu_countries.describe()
as_stats = as_countries.describe()

# Perform two-sample t-test (equal variances, as the exercise asks us to assume)
t_statistic, p_value = stats.ttest_ind(eu_countries, as_countries, equal_var=True)

# Print descriptive statistics
print("Descriptive statistics for European countries:")
print(eu_stats)
print("\nDescriptive statistics for Asian countries:")
print(as_stats)

# Print t-test results
print("\nT-test results:")
print("Statistic:", t_statistic)
print("p-value:", p_value)

# Draw side-by-side box plots
plt.boxplot([eu_countries, as_countries], labels=['EU', 'AS'])
plt.xlabel('Continent')
plt.ylabel('Math Score')
plt.title('Math Scores of European and Asian Countries')
plt.grid(True)
plt.show()
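The exercise also asks for descriptive statistics for every continent group, including shape, which `describe()` does not report. One way to get central tendency, dispersion and shape in a single table is a grouped aggregation. This is a sketch only: `describe_by_group` is a hypothetical helper, the column names `Continent` and `Math Score` are the ones assumed in the cell above, and the `toy` frame holds made-up scores rather than PISA data.

```python
import pandas as pd

def describe_by_group(frame, group_col, value_col):
    """Central tendency, dispersion and shape of value_col per group."""
    grouped = frame.groupby(group_col)[value_col]
    return grouped.agg(
        count="count",
        mean="mean",
        median="median",
        std="std",
        min="min",
        max="max",
        skew="skew",               # asymmetry of the distribution
        kurtosis=pd.Series.kurt,   # excess kurtosis (tail weight)
    )

# Made-up scores for illustration (NOT the PISA dataset)
toy = pd.DataFrame({
    "Continent": ["EU"] * 4 + ["AS"] * 4,
    "Math Score": [500.0, 490.0, 510.0, 480.0, 530.0, 470.0, 520.0, 450.0],
})

summary = describe_by_group(toy, "Continent", "Math Score")
print(summary)
```

With the real data, the call would be `describe_by_group(data, 'Continent', 'Math Score')`, which covers all six continent groups at once.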