__Statistics__ is the branch of mathematics dealing with the collection, analysis, interpretation,
presentation, and organization of numerical data.

Statistics is mainly divided into two subbranches:
1. __Descriptive statistics__: These summarize data. The mean and standard deviation
are used for continuous data (such as age), whereas frequency and percentage are
used for categorical data (such as gender).
2. __Inferential statistics__: Collecting data on the entire population is often
impossible, so a subset of the data points, called a sample, is collected and
conclusions about the entire population are drawn from it. Inferences are
drawn using hypothesis testing, the estimation of numerical characteristics, the
correlation of relationships within data, and so on.

__Machine learning__ is the branch of computer science that learns from past experience
and uses that knowledge to make future decisions. Machine learning is at the
intersection of computer science, engineering, and statistics. The goal of machine learning is
to generalize a detectable pattern or to create an unknown rule from given examples.
Machine learning is broadly divided into three categories:

1. __Supervised learning__: This is teaching machines to learn the relationship between
other variables and a target variable. The major segments within
supervised learning are as follows:
    1. Classification problem
    2. Regression problem
2. __Unsupervised learning__: In unsupervised learning, algorithms learn by
themselves without any supervision or without any target variable provided. It is
a question of finding hidden patterns and relations in the given data. The
categories in unsupervised learning are as follows:
    1. Dimensionality reduction
    2. Clustering
3. __Reinforcement learning__: This allows the machine or agent to learn its behavior
based on feedback from the environment. In reinforcement learning, the agent
takes a series of decisive actions without supervision and, in the end, a reward
is given (for example, +1 or -1). Based on the final payoff/reward, the agent
reevaluates its paths. Reinforcement learning problems are closer to artificial
intelligence methodology than to frequently used machine learning
algorithms.

Difference between Statistics and ML:
1. In statistics, relationships are expressed as mathematical equations, whereas in ML they are expressed as rule-based programs.
2. A statistical model predicts the output with, for example, 85 percent accuracy and 90 percent confidence about it. Machine learning just predicts the output with 85 percent accuracy.
3. Statistics: data is typically split 70 percent / 30 percent into training and testing data. The model is developed on the training data and tested on the testing data. ML: data is typically split 50 percent / 25 percent / 25 percent into training, validation, and testing data. Models are developed on the training data, hyperparameters are tuned on the validation data, and the final model is evaluated against the test data.
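
The 50/25/25 split described in point 3 can be sketched with NumPy alone. This is a minimal illustration rather than a prescribed method: the helper name `train_val_test_split`, the fixed seed, and the toy array are assumptions introduced here.

```python
import numpy as np

def train_val_test_split(data, train_frac=0.50, val_frac=0.25, seed=0):
    """Shuffle the rows of `data` and split them into train/validation/test parts.

    Hypothetical helper for illustration; the remaining fraction
    (1 - train_frac - val_frac) becomes the test set.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))        # random row order
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    train_idx = shuffled[:n_train]
    val_idx = shuffled[n_train:n_train + n_val]
    test_idx = shuffled[n_train + n_val:]
    return data[train_idx], data[val_idx], data[test_idx]

toy_data = np.arange(100).reshape(100, 1)        # 100 made-up observations
train, val, test = train_val_test_split(toy_data)
print(len(train), len(val), len(test))           # 50 25 25
```

Shuffling before slicing keeps the three parts disjoint while avoiding any ordering effect in the original data.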

Steps in building ML Model:
1. Collection of Data
2. Data preparation and outlier treatment
3. Data Analysis and Feature Engineering
4. Train algorithm on training and validation data
5. Test algorithm on test data
6. Deploy algorithm

__Statistics Fundamentals__
1. __Population__: This is the totality, the complete list of observations, or all the data
points about the subject under study.
2. __Sample__: A sample is a subset of a population, usually a small portion of the
population that is being analyzed.
3. __Parameter versus Statistic__: Any measure that is calculated on the population is a
parameter, whereas on a sample it is called a statistic.
4. __Mean__: This is the arithmetic average. The mean is sensitive to outliers in the data. An outlier is a value that deviates strongly from the other values in the same set or column; it is usually very high or very low.
5. __Median__: This is the midpoint of the data, calculated after arranging it in ascending or descending order. If there are N observations, the median is the ((N+1)/2)th value when N is odd and the average of the (N/2)th and ((N/2)+1)th values when N is even.
6. __Mode__: This is the most frequently occurring data point in the data.

<img src="images/mean_median_mode.png">

In [1]:
import numpy as np
from scipy import stats

In [2]:
data = np.array([4, 5, 1, 6, 8, 1, 3, 6, 7])

In [3]:
mean = np.mean(data)
mean

4.555555555555555

In [4]:
median = np.median(data)
median

5.0

In [5]:
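# np.atleast_1d handles both older SciPy (array result) and newer SciPy (scalar result)
mode = stats.mode(data)
np.atleast_1d(mode.mode)[0]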

1
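
The sensitivity of the mean to outliers can be seen by adding one extreme value to the same array. This is a small sketch; the value 100 is made up for illustration.

```python
import numpy as np

data = np.array([4, 5, 1, 6, 8, 1, 3, 6, 7])
data_with_outlier = np.append(data, 100)   # 100 is a made-up extreme value

# Without the outlier: mean is about 4.56, median is 5.0
print(np.mean(data), np.median(data))

# With the outlier: the mean jumps to 14.1, the median only moves to 5.5
print(np.mean(data_with_outlier), np.median(data_with_outlier))
```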

7. __Measure of Variation__: Dispersion is the variation in the data; it measures how inconsistent the values of a variable are.
8. __Range__: This is the difference between the maximum and minimum values.
9. __Variance__: This is the mean of squared deviations from the mean. The dimension of variance is the square of the actual values. The denominator is N-1 for a sample instead of N for the population because of degrees of freedom: one degree of freedom is lost because the sample mean, itself estimated from the data, is substituted for the population mean.
10. __Standard Deviation__: This is the square root of variance. By applying the square root to the variance, we measure the dispersion in the units of the original variable rather than in squared units.
11. __Quantiles__: These divide the ordered data into equal-sized parts. Quantiles cover percentiles, deciles, quartiles, and so on. These measures are calculated after arranging the data in ascending order.
    1. __Percentile__: The pth percentile is the value below which about p percent of the data points fall. The median is the 50th percentile, as the number of data points below the median is about 50 percent of the data.
    2. __Decile__: Deciles split the data into ten equal parts. The first decile is the 10th percentile, which means about 10 percent of the data points lie below it.
    3. __Quartile__: Quartiles split the data into four equal parts. The first quartile is the 25th percentile, the second quartile is the 50th percentile, and the third quartile is the 75th percentile. The second quartile is also known as the median, the 50th percentile, or the 5th decile.
    4. __Interquartile Range__: This is the difference between the third quartile and the first quartile. It is effective in identifying outliers in data. The interquartile range describes the middle 50 percent of the data points.
    
    <img src="images/quantile.png">

In [6]:
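game_points = np.array([35, 46, 72, 38, 81, 41, 57, 93, 17, 33, 61, 75])

In [7]:
# Sample variance (ddof=1 divides by N-1). NumPy is used here because
# statistics.variance can truncate the result for NumPy integer input.
variance_data = np.var(game_points, ddof=1)
round(variance_data, 2)

521.17

In [8]:
# Sample standard deviation (square root of the sample variance)
standard_dev = np.std(game_points, ddof=1)
round(standard_dev, 2)

22.83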

In [9]:
# Range
range_data = np.max(game_points, axis=0) - np.min(game_points, axis=0)
range_data

76
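
The difference between the N and N-1 denominators can be checked directly. The sketch below computes both variances by hand for the same `game_points` values and compares them with NumPy's `ddof` argument; the variable names are introduced here for illustration.

```python
import numpy as np

game_points = np.array([35, 46, 72, 38, 81, 41, 57, 93, 17, 33, 61, 75])

n = len(game_points)
squared_dev = (game_points - game_points.mean()) ** 2

pop_var = squared_dev.sum() / n            # divide by N (population formula)
sample_var = squared_dev.sum() / (n - 1)   # divide by N-1 (sample formula)

# NumPy exposes the same choice through ddof ("delta degrees of freedom")
assert np.isclose(pop_var, np.var(game_points, ddof=0))
assert np.isclose(sample_var, np.var(game_points, ddof=1))

print(round(pop_var, 2), round(sample_var, 2))   # 477.74 521.17
```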

In [10]:
# Quantile
for val in [10, 20, 30, 40, 50, 60, 70, 80, 90]:
    quant = np.percentile(game_points, val)
    print(val, '% :', quant)

10 % : 33.199999999999996
20 % : 35.6
30 % : 38.900000000000006
40 % : 43.0
50 % : 51.5
60 % : 59.4
70 % : 68.69999999999999
80 % : 74.4
90 % : 80.4
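
The interquartile range is often used to flag outliers with the 1.5 x IQR rule. That rule is a common convention assumed here, not something this notebook prescribes, and the extra score of 250 is made up so that there is something to detect.

```python
import numpy as np

# game_points from above plus one made-up extreme score (250) for illustration
points = np.array([35, 46, 72, 38, 81, 41, 57, 93, 17, 33, 61, 75, 250])

q1, q3 = np.percentile(points, [25, 75])
iqr = q3 - q1

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = points[(points < lower_fence) | (points > upper_fence)]

print(q1, q3, iqr)     # 38.0 75.0 37.0
print(outliers)        # [250]
```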


12. __Hypothesis Testing__: This is the process of making inferences about the overall population by conducting statistical tests on a sample. The null and alternative hypotheses frame the assumption being tested so that its statistical significance can be assessed.
13. __P-Value__: The probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (usually in modeling, against each independent variable, a p-value less than 0.05 is considered significant and greater than 0.05 is considered insignificant; nonetheless, these values and definitions may change with respect to context).

A p-value less than 0.05 means that the claimed value and the observed sample mean differ significantly, hence we can reject the null hypothesis.

__Steps involved in Hypothesis Testing__
1. Assume a null hypothesis (usually no difference, no significance, and so on; a null hypothesis assumes that there is no anomalous pattern and that the data is homogeneous).
2. Collect the sample.
3. Calculate the test statistic from the sample in order to verify whether the result is statistically significant or not.
4. Decide whether to reject or fail to reject the null hypothesis based on the test statistic.

__Test Statistic and Critical Value__

In hypothesis testing, a critical value is a point on the test distribution that is compared to the test statistic to determine whether to reject the null hypothesis. If the absolute value of the test statistic is greater than the critical value, we can declare statistical significance and reject the null hypothesis. Critical values correspond to alpha, so their values become fixed when we choose the test's alpha.

In [11]:
from scipy import stats
xbar = 990
mu0 = 1000
s = 12.5
n = 30
# Test Statistic
t_smple = (xbar-mu0)/(s/np.sqrt(float(n)))
t_smple

-4.381780460041329

In [12]:
# Critical Value
alpha = 0.05
t_alpha = stats.t.ppf(alpha, n-1)
t_alpha

-1.6991270265334977

In [13]:
# P Value
p_val = stats.t.sf(np.abs(t_smple), n-1)
p_val

7.035025729010886e-05
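
The final decision step can be made explicit. The sketch below reuses the summary statistics from the cells above and assumes a one-sided (lower-tail) alternative, which is what the call `stats.t.ppf(alpha, n-1)` suggests; the variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Summary statistics taken from the cells above
xbar, mu0, s, n = 990, 1000, 12.5, 30
alpha = 0.05

t_stat = (xbar - mu0) / (s / np.sqrt(n))
t_crit = stats.t.ppf(alpha, n - 1)       # lower-tail critical value
p_value = stats.t.cdf(t_stat, n - 1)     # lower-tail p-value

# Both rules lead to the same decision for a lower-tail test
reject_by_critical_value = t_stat < t_crit
reject_by_p_value = p_value < alpha
print(reject_by_critical_value, reject_by_p_value)   # True True
```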

14. __Type I and Type II Error__: Hypothesis testing is usually done on samples rather
than the entire population, due to the practical constraints of available resources
to collect all the available data. However, drawing inferences about the
population from samples comes with its own costs, such as rejecting good results
or accepting false results. A larger sample size helps to reduce these errors:
    1. __Type I error__: Rejecting a null hypothesis when it is true
    2. __Type II error__: Failing to reject a null hypothesis when it is false
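
A type I error rate can be estimated by simulation. In the sketch below the null hypothesis is true by construction, so every rejection is a type I error and the rejection rate should land near alpha. The seed, sample size, and number of trials are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)        # fixed seed so the run is repeatable
alpha, n_trials, n = 0.05, 2000, 30

# Every row is a sample drawn while the null hypothesis (mean = 0) is true
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))
result = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Each rejection is a type I error, because the null hypothesis holds by construction
type_1_rate = np.mean(result.pvalue < alpha)
print(type_1_rate)                     # expected to land near alpha
```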
    
15. __Normal Distribution__: This is very important in statistics because of the central limit theorem, which states that the distribution of the means of all possible samples of size n drawn from a population with mean μ and variance σ² approaches a normal distribution with mean μ and variance σ²/n as n becomes large.
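
The central limit theorem can be checked empirically: means of samples drawn from a clearly non-normal population should still cluster around μ with variance close to σ²/n. The exponential population, the seed, and the sizes below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 50, 10_000

# Exponential population with scale 2: mean mu = 2, variance sigma^2 = 4
samples = rng.exponential(scale=2.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())   # close to mu = 2
print(sample_means.var())    # close to sigma^2 / n = 4 / 50 = 0.08
```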

In [14]:
# Z-Score
xbar = 67
mu0 = 52
s = 16.3
z = (xbar-mu0)/s
z

0.920245398773006

In [15]:
# Probability Under Curve 
p_val = 1 - stats.norm.cdf(z)
p_val*100

17.872226751475175

16. __Chi-square__: This test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. Given two categorical random variables X and Y, the chi-square test of independence determines whether or not there exists a statistical dependence between them.

The test is usually performed by calculating χ² from the data and comparing it
with the χ² value from the table for (m-1)(n-1) degrees of freedom, where m and n
are the numbers of categories of the two variables. If the calculated value is
higher than the table value, the hypothesis that both variables are independent
is rejected.

<img src="images/chi-square.png">

The chi2_contingency function in the stats package takes the observed table, calculates its expected table, and then calculates the p-value in order to check whether two variables are dependent or not. If p-value < 0.05, there is statistically significant evidence of a dependency between the two variables, whereas if p-value > 0.05, there is no significant evidence of a dependency between the variables.

In [20]:
import pandas as pd 
from scipy import stats

survey = pd.read_csv('Data Files/survey.csv')
survey.head()

Unnamed: 0,Sex,Wr.Hnd,NW.Hnd,W.Hnd,Fold,Pulse,Clap,Exer,Smoke,Height,M.I,Age
0,Female,18.5,18.0,Right,R on L,92.0,Left,Some,Never,173.0,Metric,18.25
1,Male,19.5,20.5,Left,R on L,104.0,Left,,Regul,177.8,Imperial,17.583
2,Male,18.0,13.3,Right,L on R,87.0,Neither,,Occas,,,16.917
3,Male,18.8,18.9,Right,R on L,,Neither,,Never,160.0,Metric,20.333
4,Male,20.0,20.0,Right,Neither,35.0,Right,Some,Never,165.0,Metric,23.667


In [21]:
survey_tab = pd.crosstab(survey['Smoke'], survey['Exer'], margins=True)
survey_tab

Smoke / Exer,Freq,None,Some,All
Heavy,7,1,3,11
Never,87,18,84,189
Occas,12,3,4,19
Regul,9,1,7,17
All,115,23,98,236


In [23]:
observed = survey_tab.iloc[0:4, 0:3]
observed

Smoke / Exer,Freq,None,Some
Heavy,7,1,3
Never,87,18,84
Occas,12,3,4
Regul,9,1,7


In [25]:
contg = stats.chi2_contingency(observed=observed)
p_value = round(contg[1], 3)
p_value

0.483

The p-value is 0.483, which is greater than 0.05, so there is no statistically significant evidence of a dependency between smoking habit and exercise behavior.

stats.chi2_contingency returns the following:
1. Test statistic
2. P-value
3. Degrees of freedom
4. Expected frequencies
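
The table-lookup approach and the function call can be compared side by side. This sketch hard-codes the observed Smoke by Exer counts from the crosstab above, recomputes the expected counts and the χ² statistic by hand, and looks up the critical value at alpha = 0.05; the variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Observed Smoke x Exer counts copied from the crosstab above
observed = np.array([[ 7,  1,  3],
                     [87, 18, 84],
                     [12,  3,  4],
                     [ 9,  1,  7]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Expected counts under independence: row total * column total / grand total
manual_expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
manual_stat = ((observed - manual_expected) ** 2 / manual_expected).sum()

# Equivalent of the table lookup: critical value at alpha = 0.05
critical_value = stats.chi2.ppf(0.95, dof)

print(dof, round(p_value, 3))           # 6 0.483
print(manual_stat < critical_value)     # True, so independence is not rejected
```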

17. __ANOVA__: Analysis of variance (ANOVA) tests the hypothesis that the means of two or more populations are equal. ANOVAs assess the importance of one or more factors by comparing the response variable means at the different factor levels. The null hypothesis states that all population means are equal, while the alternative hypothesis states that at least one is different.

In [28]:
import pandas as pd
from scipy import stats
data = pd.read_csv('Data Files/fetilizers.csv')
data.head()

Unnamed: 0,fertilizer1,fertilizer2,fertilizer3
0,62,54,48
1,62,56,62
2,90,58,92
3,42,36,96
4,84,72,92


In [29]:
# One-Way ANOVA
one_way_anova = stats.f_oneway(data['fertilizer1'], data['fertilizer2'], data['fertilizer3'])
print ("Statistic :", round(one_way_anova[0],2),", p-value:",round(one_way_anova[1],3))

Statistic : 3.66 , p-value: 0.051


The p-value (0.051) is slightly above 0.05, hence we fail to reject the null hypothesis that the mean crop yields of the fertilizers are equal.
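
The F statistic reported by `stats.f_oneway` is the ratio of the between-group mean square to the within-group mean square. The sketch below recomputes it by hand. It uses only the five rows displayed by `data.head()` above, so its numbers will differ from the full-data result.

```python
import numpy as np
from scipy import stats

# Only the five rows displayed by data.head() above, not the full file
groups = [
    np.array([62, 62, 90, 42, 84]),   # fertilizer1
    np.array([54, 56, 58, 36, 72]),   # fertilizer2
    np.array([48, 62, 92, 96, 92]),   # fertilizer3
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_value = stats.f_oneway(*groups)

print(np.isclose(f_manual, f_scipy))   # True
```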

18. __Confusion Matrix__: This is the matrix of the actual versus the predicted classes. The related terms are as follows:
    1. __True positives (TPs)__: Cases where we predict yes and the actual outcome is yes.
    2. __True negatives (TNs)__: Cases where we predict no and the actual outcome is no.
    3. __False positives (FPs)__: Cases where we predict yes but the actual outcome is no. FPs are also considered to be type I errors.
    4. __False negatives (FNs)__: Cases where we predict no but the actual outcome is yes. FNs are also considered to be type II errors.
    5. __Precision (P)__: When yes is predicted, how often is it correct? TP/(TP+FP)
    6. __Recall (R)/sensitivity/true positive rate__: Among the actual yeses, what fraction was predicted as yes? TP/(TP+FN)
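
The four counts and the two ratios can be computed directly from label arrays. The labels below are made up for illustration, with 1 meaning yes and 0 meaning no.

```python
import numpy as np

# Made-up labels for illustration: 1 = yes (positive), 0 = no (negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))   # predicted yes, actually yes
tn = int(np.sum((y_pred == 0) & (y_true == 0)))   # predicted no, actually no
fp = int(np.sum((y_pred == 1) & (y_true == 0)))   # predicted yes, actually no
fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # predicted no, actually yes

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)        # 4 4 1 1
print(precision, recall)     # 0.8 0.8
```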