## Inferential statistics

## Introduction to Inference

* Sample Mean & Population Mean
* Statistical Inference
* Central Limit Theorem
* Confidence Intervals
* Interpretation Of Confidence Interval
* Hypothesis Testing
* Why Null Hypothesis?
* Alternate Hypothesis
* P-Value
* t-test
* Type I and Type II error
* Chi-squared Goodness of fit test
* Chi-squared Test of Independence

<img src="Images/statistics.jpeg" width="700" height="400">

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [47]:
data = pd.read_csv('data/insurance.csv')
data.head()

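# Random sample of 100 rows from the population.
# Row positions are drawn from the whole dataset (with replacement), not just the first 100 rows.
indices = np.random.randint(0, len(data), 100)
sample = data.iloc[indices]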

### Sample Mean and Population Mean

* Let's consider a sample of 100 people at random from 1338 people
* Compute the mean, median, mode, standard deviation of sample and compare them with the total population
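
The comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's own solution: it uses only the Python standard library and a synthetic stand-in for the `bmi` column, so `population`, `summarize` and the distribution parameters are assumptions made for the example.

```python
import random
import statistics

# Hypothetical stand-in for the `bmi` column: 1338 synthetic values
# (an assumption for illustration, not the real insurance data).
rng = random.Random(42)
population = [round(rng.gauss(30.0, 6.0), 1) for _ in range(1338)]

# Draw 100 distinct people at random (sampling without replacement).
sample = rng.sample(population, 100)


def summarize(values, is_population):
    """Return the mean, median, mode and standard deviation of `values`."""
    std = statistics.pstdev(values) if is_population else statistics.stdev(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": std,
    }


population_summary = summarize(population, is_population=True)
sample_summary = summarize(sample, is_population=False)

for name in population_summary:
    print(f"{name:>6}: population={population_summary[name]:6.2f}  "
          f"sample={sample_summary[name]:6.2f}")
```

With the real data, the same comparison could be made on `data['bmi']` and on the sampled rows. The mean, median and standard deviation of the sample should land close to the population values, while the mode is usually less stable.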

## Central Limit Theorem

The central limit theorem (CLT) states that, for a sufficiently large sample size drawn from a population with finite variance, the distribution of the sample mean is approximately normal, whatever the shape of the population distribution. The sample means are centered on the population mean, and their variance is approximately equal to the variance of the population divided by the sample size.

In [7]:
# Pick 100 different samples from the same population.
# Each sample will be approximately 100 different people.
# Compute the mean BMI and Charges of each sample.
# Try to plot a Histogram for the 100 mean values of BMI and Charges.
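
One way to approach this exercise is sketched below. It is an assumption-laden illustration rather than the intended solution: the population is a synthetic, right-skewed stand-in for `charges`, and only the standard library is used so the block runs on its own.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical right-skewed population standing in for `charges`
# (an assumption; the real column is not used here).
population = [rng.expovariate(1 / 13000) for _ in range(1338)]

n_samples = 100    # number of samples to draw
sample_size = 100  # people per sample

# Mean of each sample; rng.choices samples with replacement.
sample_means = [
    statistics.mean(rng.choices(population, k=sample_size))
    for _ in range(n_samples)
]

pop_mean = statistics.mean(population)
pop_var = statistics.pvariance(population)

mean_of_means = statistics.mean(sample_means)
var_of_means = statistics.variance(sample_means)

print(f"population mean      : {pop_mean:.1f}")
print(f"mean of sample means : {mean_of_means:.1f}")
print(f"population var / n   : {pop_var / sample_size:.1f}")
print(f"var of sample means  : {var_of_means:.1f}")
```

Passing `sample_means` to `plt.hist` should give a roughly bell-shaped histogram even though the population itself is skewed.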

## Confidence Interval

A **Confidence Interval (CI)** is an interval estimate computed from the statistics of the observed data. It proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level (for example, 95%): the proportion of such intervals, computed over repeated samples, that would contain the true parameter.
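
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the population is synthetic, and `mean_confidence_interval` is a hypothetical helper written for this example.

```python
import math
import random
import statistics
from statistics import NormalDist

rng = random.Random(1)

# Synthetic population (an assumption standing in for the `bmi` column).
population = [rng.gauss(30.0, 6.0) for _ in range(1338)]
true_mean = statistics.mean(population)

confidence = 0.95
# Two-sided interval: put (1 - confidence) / 2 in each tail.
z_value = NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def mean_confidence_interval(sample, z):
    """Return (low, high) for the mean of `sample` using a z-based margin."""
    center = statistics.mean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return center - margin, center + margin


trials = 1000
covered = 0
for _ in range(trials):
    sample = rng.choices(population, k=100)
    low, high = mean_confidence_interval(sample, z_value)
    if low <= true_mean <= high:
        covered += 1

coverage = covered / trials
print(f"z value: {z_value:.3f}")
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The share of intervals that contain the true mean should be close to the chosen confidence level. Any single interval either contains the true mean or it does not.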


In [48]:
# lets import the scipy package
import scipy.stats as stats
import math


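# Get the z value.
# q = 0.95 leaves 5% in the upper tail, so the two-sided interval below has 90% confidence;
# use q = 0.975 for a 95% interval.
z_value = stats.norm.ppf(q = 0.95)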

# Pick a sample of 100 people
sample_size = 100
sample = np.random.choice(a= data['bmi'],
                          size = sample_size)

sample_mean = sample.mean()
sample_std = sample.std()


margin_of_error = z_value * (sample_std/math.sqrt(sample_size)) 

# defining our confidence interval
confidence_interval = (sample_mean - margin_of_error,
                       sample_mean + margin_of_error) 

# lets print the results
print("Confidence interval:",end=" ")
print(confidence_interval)
print("True mean: {}".format(data['bmi'].mean()))

Confidence interval: (29.48620886568075, 31.411991134319244)
True mean: 30.66339686098655


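The interval above comes from

$$\mu = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$$

where $\mu$ is the population mean, $\bar{x}$ the sample mean, $s$ the sample standard deviation and $n$ the sample size. Solving for $z$:

$$z = \pm \frac{(\mu - \bar{x}) \sqrt{n}}{s}$$

Example:

* sample_mean = 110
* sample_std_dev = 5
* sample_size = 50
* Assumed population mean = 100

$$z = \pm \frac{(100 - 110) \sqrt{50}}{5} \approx \pm 14.14$$

Rejection zone = 5%, split evenly between the two tails:

* z_upperbound = 97.5th percentile --> 1.96
* z_lowerbound = 2.5th percentile --> -1.96

Area under the curve:

* z <= 14.14 --> approximately 100%
* z <= -14.14 --> approximately 0%

Since 14.14 lies far outside the range from -1.96 to 1.96, the assumed population mean of 100 is rejected.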
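
The worked example (sample mean 110, standard deviation 5, sample size 50, assumed population mean 100) can be checked numerically. The sketch below assumes a two-sided test with a 5% rejection zone; the variable names mirror the notes above, and `hypothesized_mean` is a name introduced here.

```python
import math
from statistics import NormalDist

sample_mean = 110
sample_std_dev = 5
sample_size = 50
hypothesized_mean = 100  # assumed population mean under the null hypothesis

# z statistic: distance between the means in units of the standard error.
standard_error = sample_std_dev / math.sqrt(sample_size)
z = (sample_mean - hypothesized_mean) / standard_error

# Two-sided 5% rejection zone: 2.5% in each tail.
alpha = 0.05
z_upperbound = NormalDist().inv_cdf(1 - alpha / 2)
z_lowerbound = -z_upperbound

reject_null = z > z_upperbound or z < z_lowerbound

print(f"z = {z:.2f}")
print(f"critical values = ({z_lowerbound:.2f}, {z_upperbound:.2f})")
print("reject the null hypothesis" if reject_null else "fail to reject the null hypothesis")
```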

In [14]:
# Change the sample sizes to [10, 50, 100, 500, 1000]. Observe the changes w.r.t confidence interval.


# Change the confidence percentage value to [50%, 70%, 80%, 90%, 95%, 99%, 99.6%, 100%]. 
# Observe the changes w.r.t confidence interval.


# Write a 5 line summary of your analysis on how the confidence interval changed w.r.t sample sizes and confidence percentages.
# Also, mention if samples are good or bad.

# Explore what T-statistic is and explain where we should use T-values and Z-value respectively.
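
As a starting point for the first two exercises, the sketch below tabulates the margin of error for the listed sample sizes and confidence levels. It is not a reference solution: the population is synthetic, `margin_of_error` is a hypothetical helper, and a 100% confidence level is treated as an infinitely wide interval because no finite z value covers the whole normal distribution.

```python
import math
import random
import statistics
from statistics import NormalDist

rng = random.Random(7)

# Synthetic population (an assumption standing in for the `bmi` column).
population = [rng.gauss(30.0, 6.0) for _ in range(1338)]


def margin_of_error(sample, confidence):
    """Half-width of a two-sided z-based confidence interval for the mean."""
    if confidence >= 1.0:
        return math.inf  # only an unbounded interval is certain to cover the mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * statistics.stdev(sample) / math.sqrt(len(sample))


# Effect of the sample size at a fixed 95% confidence level.
margins_by_size = {
    n: margin_of_error(rng.choices(population, k=n), 0.95)
    for n in [10, 50, 100, 500, 1000]
}

# Effect of the confidence level on one fixed sample of 100 people.
fixed_sample = rng.choices(population, k=100)
margins_by_confidence = {
    c: margin_of_error(fixed_sample, c)
    for c in [0.50, 0.70, 0.80, 0.90, 0.95, 0.99, 0.996, 1.0]
}

for n, margin in margins_by_size.items():
    print(f"n = {n:>4}: margin of error = {margin:.3f}")
for c, margin in margins_by_confidence.items():
    print(f"confidence = {c:.3f}: margin of error = {margin:.3f}")
```

Larger samples tend to narrow the interval, and higher confidence levels widen it.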


## Hypothesis Testing

* A **statistical hypothesis** is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables. A statistical hypothesis test, sometimes called confirmatory data analysis, is a method of statistical inference.

### Null Hypothesis

* In Inferential Statistics, **The Null Hypothesis is a general statement or default position that there is no relationship between two measured phenomena or no association among groups.**

* Statistical hypothesis tests are based on a statement called the null hypothesis that assumes nothing interesting is going on between whatever variables you are testing.

* For example, in a study of house prices the null hypothesis could be:
**The mean of house prices in OldTown is not different from that of houses in other neighborhoods**

### Alternate Hypothesis

* The alternate hypothesis is the claim that competes with the null. For example, if your null is **I'm going to win up to 1000** then your alternate is **I'm going to win more than 1000.** The test asks whether the data show enough evidence to reject the null hypothesis in favor of the alternate hypothesis.

###  The Null Hypothesis is assumed to be true and Statistical evidence is required to reject it in favor of an Alternative Hypothesis.

### P Value

* In statistical hypothesis testing, **the p-value or probability value** is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct. 

* Suppose we set the significance level (α) to 0.05.
* This means that if we see a p-value of less than 0.05, we reject the null hypothesis in favor of the alternative. Otherwise, we fail to reject the null; this does not prove that the null is true.
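
The decision rule above can be written out for a z statistic. This is a minimal sketch using the standard normal distribution from the standard library; `two_sided_p_value` is a hypothetical helper, and the z values are made up for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z_statistic):
    """Probability of a z at least as extreme as the observed one, in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z_statistic)))


alpha = 0.05  # significance level

# Illustrative z statistics (not taken from the insurance data).
for z_statistic in (1.0, 2.5):
    p_value = two_sided_p_value(z_statistic)
    decision = "reject the null" if p_value < alpha else "fail to reject the null"
    print(f"z = {z_statistic:.1f}  p-value = {p_value:.4f}  -> {decision}")
```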


In [34]:
# lets import z test from statsmodels
from statsmodels.stats.weightstats import ztest

# Null hypothesis: the mean charges of smokers equal the overall mean charges.
# (One-sample test: x1 is the smokers' charges, value is the mean of all charges.)
z_statistic, p_value = ztest(x1 = data[data['smoker'] == 'yes']['charges'],
                             value = data['charges'].mean())

# lets print the Results
print('Z-statistic is :{}'.format(z_statistic))
print('P-value is :{:.50f}'.format(p_value))

if p_value < 0.05:
    print("We can safely reject the Null hypothesis")
else:
    print("We do not have significant evidence to safely reject the null hypothesis.")

Z-statistic is :26.934097902356328
P-value is :0.00000000000000000000000000000000000000000000000000
We can safely reject the Null hypothesis
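
Comparing smokers against non-smokers directly, rather than against the overall mean, calls for a two-sample test. The sketch below computes a two-sample z statistic by hand with unpooled variances. The data are synthetic, and the group sizes, means and spreads are assumptions chosen only to make the example run; `two_sample_ztest` is a hypothetical helper.

```python
import math
import random
import statistics
from statistics import NormalDist

rng = random.Random(3)

# Synthetic charges for the two groups (assumed sizes, means and spreads).
smoker_charges = [rng.gauss(32000, 11000) for _ in range(300)]
non_smoker_charges = [rng.gauss(8400, 6000) for _ in range(1000)]


def two_sample_ztest(x1, x2):
    """Two-sided z-test for equal means, using unpooled sample variances."""
    standard_error = math.sqrt(
        statistics.variance(x1) / len(x1) + statistics.variance(x2) / len(x2)
    )
    z = (statistics.mean(x1) - statistics.mean(x2)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Null hypothesis: mean charges of smokers and non-smokers are the same.
z_statistic, p_value = two_sample_ztest(smoker_charges, non_smoker_charges)

print(f"Z-statistic is: {z_statistic:.3f}")
print(f"P-value is: {p_value:.3e}")
```

With statsmodels, passing both groups as `x1` and `x2` to `ztest` runs a comparable two-sample test; by default it pools the variances, so the statistic may differ slightly from this sketch.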


In [35]:
import scipy.stats as stats

# Create a dummy dataset of 10 year old children's weight (kg).
# A separate name is used so the insurance DataFrame in `data` is not overwritten.
weights = np.random.randint(20, 40, 10)

# Define the null hypothesis
H0 = "The average weight of 10 year old children is 32kg."

# Define the alternative hypothesis
H1 = "The average weight of 10 year old children is not 32kg."

# Calculate the test statistic (ttest_1samp is two-sided by default)
t_stat, p_value = stats.ttest_1samp(weights, 32)

# Print the results
print("Test statistic:", t_stat)
print("p-value:", p_value)

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

Test statistic: -0.780398972571708
p-value: 0.45519024115915063
Fail to reject the null hypothesis.
