# Statistical Inference
1. The Central Limit Theorem
2. Confidence Intervals for Means
3. Hypothesis Testing
   - the z test
   - single-sample t test
   - independent-samples t test

# The Central Limit Theorem 
- The Central Limit Theorem describes the sampling distribution of the sample mean for random samples of size n drawn from a population with mean μ and finite standard deviation σ:
1. Sampling distribution's mean = population mean (μ). This holds for any n.
2. Sampling distribution's standard deviation (standard error) = σ/√n. This also holds for any n.
3. As n grows, the sampling distribution of the mean approaches a normal distribution, whatever the shape of the population.
4. n ≥ 30 is a common rule of thumb for "large enough"; strongly skewed populations can need larger samples.
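
The first two properties can be checked numerically. The sketch below is only an illustration on a synthetic, right-skewed population (exponential with scale 2, an assumption made for this example and not the notebook's dataset); `population`, `sample_means` and the seed are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population; not the notebook's data
population = rng.exponential(scale=2.0, size=100_000)
mu = population.mean()
sigma = population.std()        # population standard deviation (ddof=0)

n = 30                          # sample size
num_samples = 5000              # number of repeated samples

# Draw all samples at once (with replacement) and average each row
samples = rng.choice(population, size=(num_samples, n), replace=True)
sample_means = samples.mean(axis=1)

print('population mean      :', round(mu, 3))
print('mean of sample means :', round(sample_means.mean(), 3))
print('sigma / sqrt(n)      :', round(sigma / np.sqrt(n), 3))
print('std of sample means  :', round(sample_means.std(), 3))
```

The two pairs of printed values should agree closely, although the exact digits depend on the seed.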

In [1]:
import numpy as np
import pandas as pd
import random 

import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv(r"C:\Users\Shivani Dussa\Downloads\car-booking-system-main\car-booking-system-main\train.csv")

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df['windspeed'].head()

In [None]:
sns.set_style('whitegrid')
sns.histplot(df.humidity,kde = True,stat = 'density',color = 'red',bins = 100)  # histplot replaces the deprecated distplot

In [None]:
plt.hist(df.humidity,bins = 100,color = 'pink')
plt.axvline(x = df.humidity.mean(),color = 'g')  # mark the mean; next we take samples to check for a normal bell curve

In [None]:
x1bar = df.humidity.sample(30).mean()  # mean of one random sample of 30; with n >= 30 the sample means form a roughly bell-shaped curve
x2bar = df.humidity.sample(30).mean()
print(x1bar,x2bar)

In [None]:
sample_means = []  # not named `list`, which would shadow the built-in
num_samples = 5000  # 5000 samples of size 30, drawn from the 8709 observations
for i in range(num_samples):
    sample_means.append(df.humidity.sample(n = 30,replace = False).mean())

In [None]:
len(sample_means)

In [None]:
ax = sns.histplot(sample_means,kde = True,stat = 'density',color = 'red',bins = 100)

### Sampling distribution approaching Normal distribution
- For sample sizes of about 30 or more, the sampling distribution of the mean is close to a normal distribution, even when the underlying data are not normally distributed.
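
One way to see the approach to normality is to track the skewness of the sample means as the sample size grows; a normal distribution has skewness 0. The sketch below is a minimal illustration using synthetic exponential draws (an assumption for this example); the sample sizes and the `skewness` dictionary are hypothetical choices.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
num_samples = 5000

skewness = {}
for n in (1, 5, 30, 200):
    # each row is one sample of size n from an exponential distribution
    means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)
    skewness[n] = skew(means)
    print(f'n = {n:>3}: skewness of sample means = {skewness[n]:.2f}')
```

The skewness shrinks as n grows but is typically still visible at n = 30 for a population this skewed, which is why 30 is a rule of thumb rather than a guarantee.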

In [None]:
from scipy.stats import expon

In [None]:
data = expon.rvs(size = 1500)
sns.histplot(data,kde = True,stat = 'density',color = 'red',bins = 100)

In [None]:
plt.hist(df.windspeed,bins = 100,color = 'violet')
plt.axvline(x = df.windspeed.mean(),color = 'indigo')

In [None]:
x1 = df.windspeed.sample(50).mean()
x2 = df.windspeed.sample(60).mean()

In [None]:
x1,x2

In [None]:
list1 = []
num_samples = 2000
for i in range(0,num_samples):
    list1.append(df.windspeed.sample(n = 30,replace = True).mean())

In [None]:
len(list1)

In [None]:
ax = sns.histplot(list1,kde = True,stat = 'density',color = 'indigo',bins = 100)
plt.axvline(x = df.windspeed.mean(),color = 'red')

# Confidence Intervals 
- A confidence interval (CI) is a statistical estimate that proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level: if the sampling were repeated many times, a 95% confidence interval procedure would produce intervals that contain the population mean in about 95% of the repetitions. In everyday terms, you can be 95% confident that the interval contains the population mean. A large sample estimates the mean with much more precision than a small one, so the confidence interval computed from a large sample is narrower.
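
The meaning of "95% confident" can be illustrated by simulation. The sketch below assumes a hypothetical normal population with known mean 50 and standard deviation 10 (values chosen only for this example), builds a 95% interval from each of many samples, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(2)

true_mean, true_sd = 50.0, 10.0   # hypothetical population parameters
n = 40                            # size of each sample
num_trials = 2000                 # number of repeated samples
z = st.norm.ppf(0.975)            # critical value for a 95% interval

# One row per trial
samples = rng.normal(true_mean, true_sd, size=(num_trials, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)   # sample standard deviation per trial
moe = z * s / np.sqrt(n)          # margin of error per trial

covered = (xbar - moe <= true_mean) & (true_mean <= xbar + moe)
coverage = covered.mean()
print('fraction of intervals containing the true mean:', round(coverage, 3))
```

The printed fraction should land near 0.95; any single interval either contains the true mean or it does not.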

Calculating the Confidence Interval (CI)

Step 1:

Find the number of observations n, the sample mean x̄, and the sample standard deviation s.

Note: we should use the standard deviation of the entire population, but in many cases we won't know it.

We can use the standard deviation of the sample if we have enough observations (at least n=30, hopefully more).

Step 2:

Decide what confidence level we want: 95% and 99% are common choices. Then find the z value for that confidence level here:

| Confidence level | z value |
|---|---|
| 80% | 1.282 |
| 85% | 1.440 |
| 90% | 1.645 |
| 95% | 1.960 |
| 99% | 2.576 |
| 99.5% | 2.807 |
| 99.9% | 3.291 |
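
The z values in the table can be reproduced with `scipy.stats.norm.ppf`. A minimal sketch, assuming a two-sided interval so that the area outside the interval is split equally between the two tails; `levels` and `z_values` are names chosen for this example.

```python
from scipy import stats as st

levels = [0.80, 0.85, 0.90, 0.95, 0.99, 0.995, 0.999]
z_values = {}
for level in levels:
    # two-sided interval: cumulative probability at the upper cut-off is (1 + level) / 2
    z_values[level] = st.norm.ppf((1 + level) / 2)
    print(f'{level:.1%}  ->  z = {z_values[level]:.3f}')
```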

Step 3: Use that z value in this formula for the confidence interval

$$\text{CI} = \bar{x} \pm z \frac{s}{\sqrt{n}}$$

CI = confidence interval

$\bar{x}$ = sample mean

z = critical value for the chosen confidence level

s = sample standard deviation

n = sample size

Note: the value after the ± is called the margin of error.
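
Because the margin of error is z times s divided by the square root of n, quadrupling the sample size halves it. The sketch below illustrates this with a hypothetical sample standard deviation of 10; `margin_of_error` is a helper written for this example, not a library function.

```python
import numpy as np
from scipy import stats as st

def margin_of_error(s, n, level=0.95):
    """Margin of error z * s / sqrt(n) for a two-sided interval."""
    z = st.norm.ppf((1 + level) / 2)
    return z * s / np.sqrt(n)

s = 10.0                      # hypothetical sample standard deviation
moe = {n: margin_of_error(s, n) for n in (30, 120, 480)}
for n, m in moe.items():
    print(f'n = {n:>3}: margin of error = {m:.3f}')
```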

In [3]:
h  = pd.read_csv(r"C:\Users\Shivani Dussa\Downloads\heart_failure_clinical_records_dataset (1).csv")

In [4]:
h

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


In [5]:
# z-based confidence interval for a mean
def zscore(mean,n,std_dev,ci):
    # ci is the confidence level as a fraction, e.g. 0.95
    area = (1 + ci)/2              # cumulative probability at the upper critical value
    z = st.norm.ppf(area)
    std_err = std_dev/np.sqrt(n)   # standard error of the mean
    moe = z * std_err              # margin of error
    lb = round(mean - moe,1)
    ub = round(mean + moe,1)
    print('z score:', round(z,3))
    print(f'the confidence interval is({lb},{ub})')

In [6]:
st.norm.ppf(0.9887)  # ppf: given a cumulative probability (area to the left), returns the z score

2.280129653030281

In [7]:
st.norm.cdf(2.28)    # cdf: given a z score, returns the cumulative probability (area under the curve to the left)

0.9886961557614472

In [8]:
mean = 5.6
n = 30
std_dev = 0.8
ci = 0.99
zscore(mean,n,std_dev,ci)

z score: 2.576
the confidence interval is(5.2,6.0)


In [9]:
mean = 4500
n = 30
std_dev = 1.9
ci = 0.95
zscore(mean,n,std_dev,ci)

z score: 1.96
the confidence interval is(4499.3,4500.7)


In [10]:
# confidence interval for the mean of a column of h, using the t distribution
def t_ci(column,n,std_dev,ci):
    # column: name of a column of h, e.g. 'platelets' or 'time'
    # std_dev is not used; the standard deviation is estimated from the sample
    sample = h[column].sample(n,random_state = 1)
    area = (1 + ci)/2
    dof = n - 1                        # degrees of freedom
    t_crit = st.t.ppf(area,dof)        # CI = xbar ± t * s/sqrt(n)
    mu = np.mean(sample)
    sigma = np.std(sample,ddof = 1)    # sample standard deviation (n - 1 in the denominator)
    standard_error = sigma/np.sqrt(n)
    moe = t_crit * standard_error
    lb = round(mu - moe)
    ub = round(mu + moe)
    return lb,ub
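
The t distribution is used for small samples because its critical values are larger than the z value at the same confidence level, which widens the interval to account for estimating the standard deviation from few observations. A minimal sketch comparing the two at the 99% level; the sample sizes are arbitrary choices for this example.

```python
from scipy import stats as st

ci = 0.99
area = (1 + ci) / 2
z_crit = st.norm.ppf(area)

t_crits = {}
for n in (5, 10, 30, 100, 1000):
    t_crits[n] = st.t.ppf(area, n - 1)   # n - 1 degrees of freedom
    print(f'n = {n:>4}: t = {t_crits[n]:.3f}   (z = {z_crit:.3f})')
```

As n grows the t critical value approaches the z value, so the two intervals become nearly identical for large samples.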

In [11]:
h.columns

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')

In [12]:
from scipy.stats import t

n = 10
ci = .99
sample = h['platelets'].sample(n,random_state = 1)
area = (1 + ci)/2
dof = n - 1                   # degrees of freedom; not named df, which would overwrite the DataFrame loaded earlier
t_crit = t.ppf(area,dof)      # not named t, which would shadow scipy.stats.t
mu = np.mean(sample)
sigma = np.std(sample)        # ddof defaults to 0 here; ddof = 1 is the usual sample estimate and gives a slightly wider interval
standard_error = sigma/np.sqrt(n)
moe = t_crit * standard_error
lb = round(mu - moe)
ub = round(mu + moe)

print('lower bound:', lb)
print('upper bound:', ub)
print('confidence interval:', (lb,ub))
print('the 99% CI for the average platelet count is:',(lb,ub))

lower bound: 183500
upper bound: 412900
confidence interval: (183500, 412900)
the 99% CI for the average platelet count is: (183500, 412900)


In [17]:
print('the confidence interval width is:',ub - lb)

the confidence interval width is: 229400
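
A manual t interval can be cross-checked against SciPy's `interval` method on the t distribution. The sketch below uses a synthetic sample of 10 values (hypothetical numbers standing in for a numeric column, not the notebook's data) and passes the confidence level positionally.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(3)
# Hypothetical sample standing in for a numeric column; not the notebook's data
sample = rng.normal(loc=260_000, scale=90_000, size=10)

n = len(sample)
ci = 0.99
dof = n - 1
mu = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(n)   # s / sqrt(n)

# Manual interval: xbar ± t * s / sqrt(n)
t_crit = st.t.ppf((1 + ci) / 2, dof)
manual = (mu - t_crit * standard_error, mu + t_crit * standard_error)

# Same interval from SciPy; the confidence level is passed positionally
builtin = st.t.interval(ci, dof, loc=mu, scale=standard_error)

print('manual :', manual)
print('builtin:', builtin)
print('width  :', manual[1] - manual[0])
```

The two intervals should match to floating-point precision, which is a quick sanity check on the hand-written formula.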
