## Summary => <br>
This notebook includes the following topics. <br><br>

The notebook will be constructed in two stages. <br>
* 1st Stage -> Complete python implementations along with brief descriptions. (Est. Date of Completion - 28-03-2021)
* 2nd Stage -> Solving questions on these topics using python. (Est. Date of Completion - 31-04-2021)
* Continuous development and improvements.

## Table of Contents

* Understanding Data types
    * Interval Scale
    * Binary 
    * Categorical
    * Ordinal 
    * Ratio Scaled
    * Mixed Type
* Different types of distances
* Similarity and Dissimilarity Matrix
* Familiarizing with different types of Error Metrics
* Handling Missing data values
* Central Tendency & Dispersion
* Descriptive Statistics
* Summary Statistics
    * Central Tendency Statistics
        * Arithmetic Mean
        * Weighted Mean
        * Median
        * Percentile
    * Dispersion
        * Skewness
        * Kurtosis
        * Range
        * Interquartile Range
        * Variance
        * Standard Score
        * Coefficient of Variation
* [Sample](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#16.-Sample-Statistics) vs [Population statistics](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#17.-Population-Statistics)
* Random Variables
* Probability Distribution Functions
    * Uniform Distribution
    * Exponential Distribution
    * [Binomial Distribution](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#8.-Binomial-Distribution)
    * [Normal Distributions](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#9.-Normal-Distribution)
    * [Poisson Distributions](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#10.-Poisson-Distribution)
    * [Bernoulli Distribution](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#11.-Bernoulli-Distribution)
* [Measuring p-value](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#13.-Calculating-p-Value)
* [Measuring Correlation](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#14.-Measuring-Correlation)
* [Measuring Variance](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#15.-Measuring-Variance)
* Expected Value

* [z-score](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#5.-Z-Test)
* Hypothesis Testing
    * Null & Alternate Hypothesis
    * Type 1 Error; Type 2 Error
    * Various Approaches
        * p-value
        * critical value
        * confidence interval value
* z-stats vs t-stats

* Two Sample Tests
* Confidence Interval
* Similarity & Dissimilarity Matrices
* [Central Limit Theorem](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#12.-Central-Limit-Theorem)
* [Chi Square Test](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#3.-Chi-Square-Test)
* [T Test](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#4.-T-Test)
* [ANOVA Test](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#6.-ANOVA-Test)
    * [One Way Anova Test](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#6.1-One-Way-ANOVA-Test)
        * F Test (LSD Test)
        * Tukey Kramer Test
    * [Two Way Anova Test](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#6.2-Two-Way-ANOVA-Test)
        * Interaction Effects
* [F Stats](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#7.-F-Stats-Test)
* [Regressions (Linear, Multiple) + ROC](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#2.-Regressions)
* Logistic Regression
    * Python Implementation
    * Calculating G Statistics
* Residual Analysis
* Maximum Likelihood Estimation
* Cluster Analysis
    * Partitioning Cluster Methods
        * K-Means
        * K-Medoids
    * Hierarchical Cluster Methods
        * Agglomerative
    * Density Based Cluster Methods
        * DBSCAN
* [CART Algorithms](https://www.kaggle.com/antoreepjana/statistics-for-ml-data-analysis/#1.-CART-Algorithms)
    * Python Implementation
    * various Calculations involved
        * Information Gain
        * Gain Ratio
        * Gini Index
* Confusion Metrics, ROC & Regression Analysis
* Bonus Topics
    * Classification Thresholding
    * Prediction Bias
    * Sampling Methods
        * Simple
        * Convenience
        * Systematic
        * Cluster
        * Stratified

In [None]:
import numpy as np
import pandas as pd 
import os
import random
import statistics
from scipy import stats

In [None]:
random.seed(2021)
np.random.seed(2021)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### 1. CART Algorithms

Brief Description -> 

##### Tools Used

Dataset Used -> Boston Dataset (UCI Machine Learning Repository)

In [None]:
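# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2.
from sklearn.datasets import load_boston
boston_dataset = load_boston()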

In [None]:
boston = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)

In [None]:
boston.head()

In [None]:
boston['MEDV'] = boston_dataset.target

In [None]:
names = boston_dataset.feature_names

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
array = boston.values

X = array[:, 0:13]
Y = array[:, 13]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 1234)

In [None]:
model = DecisionTreeRegressor(max_leaf_nodes = 20)

In [None]:
model.fit(X_train, Y_train)

In [None]:
from sklearn.metrics import r2_score

In [None]:
YHat = model.predict(X_test)

In [None]:
r2 = r2_score(Y_test, YHat)
print("R2 Score -> ", r2)

### plot the decision tree as a graph 

In [None]:
import graphviz
from sklearn import tree

method 1

In [None]:
fig = plt.figure(figsize=(25,20))
# class_names applies only to classifiers, so it is omitted for this regressor
_ = tree.plot_tree(model, 
                   feature_names=names,  
                   filled=True)

method 2

In [None]:
plt.figure(figsize = (20,20))
# class_names applies only to classifiers, so it is omitted for this regressor
dot_data = tree.export_graphviz(model, out_file=None, 
                                feature_names=names,  
                                filled=True, rounded= True)

# Draw graph
graph = graphviz.Source(dot_data, format="png") 
graph

Customizing the graph's colors beyond the default settings will be covered later (coming soon).

In [None]:
"""import pydotplus
graph = pydotplus.graph_from_dot_data(dot_data)
nodes = graph.get_node_list()

for node in nodes:
    if node.get_label():
        print(node.get_label())
        node.set_fillcolor('yellow')
        

graph.write_png('colored_tree.png')
"""

### 2. Regressions

Regression is the problem of finding a function that maps some features (variables) to others sufficiently well.<br>
The features being predicted are called dependent variables (also outputs or responses). <br>
The features used to make the prediction are called independent variables (also inputs or predictors). <br>

Regression is useful to forecast a response using a set of predictors. <br>
When implementing linear regression of a dependent variable y on a set of independent variables x = (𝑥₁, …, 𝑥ᵣ), the equation 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀 is called the regression equation, where 𝜀 is the random error.

Useful Resources -> <br>

* https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Lecture/lecture03_2020JC.html#1
* https://towardsdatascience.com/maximum-likelihood-estimation-explained-normal-distribution-6207b322e47f#:~:text=%E2%80%9CA%20method%20of%20estimating%20the,observed%20data%20is%20most%20probable.%E2%80%9D&text=By%20assuming%20normality%2C%20we%20simply,the%20popular%20Gaussian%20bell%20curve.
* https://online.stat.psu.edu/stat462/node/207/
* https://psychscenehub.com/psychpedia/odds-ratio-2/
* http://statkat.com/stat-tests/logistic-regression.php#:~:text=Logistic%20regression%20analysis%20tests%20the,%3D%CE%B2K%3D0

Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. <br>
A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. <br>
<br><br>
**Regression Model** <br>
In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = β0 + β1x + ε. <br>
β0 and β1 are referred to as the model parameters, and ε is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x. <br><br>
The difference between the observed value of y and the value of y predicted by the estimated regression equation is called a residual.<br>

**Key Assumptions ->** <br>
* ε is a random variable with an expected value of 0
* the variance of ε is the same for all values of x
* the values of ε are independent,
* ε is a normally distributed random variable.

**Residual Analysis ->** <br><br>
The analysis of residuals plays an important role in validating the regression model. <br>
If the error term satisfies the above 4 assumptions, then the regression model is valid. 

So-called dummy variables are used to represent qualitative variables in regression analysis. In general, k - 1 dummy variables are needed to model the effect of a qualitative variable that may assume k values.
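To make the k - 1 rule concrete, here is a minimal sketch using `pandas.get_dummies` with `drop_first=True`. The `city` column and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical qualitative variable with k = 3 distinct values
df_demo = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune", "Delhi", "Pune"]})

# drop_first=True keeps k - 1 indicator columns; the dropped level
# ("Delhi", first in sorted order) becomes the baseline category
dummies = pd.get_dummies(df_demo["city"], prefix="city", drop_first=True)

k = df_demo["city"].nunique()
print(dummies)
print("levels:", k, "| dummy columns:", dummies.shape[1])
```

A row with all dummy columns equal to 0 then represents the baseline level, which is why the k-th column would be redundant.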

An F-Test based on MSR / MSE can be used to test the statistical significance of the overall relationship between the dependent variable and the set of independent variables. <br>
Large values of F support the conclusion that the overall relationship is statistically significant. 

**Key Terms ->** <br><br><br>
* Coefficient of Determination -> R^2, the proportion of the variation in y that is explained by x. A larger R^2 indicates a better fit and means that the model can explain more of the variation in the output.
* Coefficient of Correlation -> 
* SSE ->
* SSR -> Sum of Squared Residuals
* SST -> 
* Error Term ε -> 
* Regression Equation -> 
* Correlation Equation -> 
* Estimated Regression Equation -> 
* Regression Model ->
* F Statistic -> 
* MSE
* MSR
* Dummy Variable -> A variable that takes values of 0 or 1 and is used to consider the effect of qualitative variables in a regression model
* Correlation vs Causation -> 
* Standard Error ->
* Confidence Interval Estimate -> The interval estimate of the mean value of y for a given value of x.
* OLS -> Method of Ordinary Least Squares. To get the best model weights, we use the method of OLS to minimize the SSR. 
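Several of the terms above can be computed by hand. The sketch below uses the small x/y data set from the linear regression example in this notebook. Naming is an assumption here: `ssr` denotes the regression (explained) sum of squares and `sse` the error (residual) sum of squares, which is the convention implied by the MSR / MSE F-test; some texts use SSR for the residual sum instead.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Ordinary least squares fit of y = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation in y
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)         # variation left in the residuals

n, p = len(y), 1                       # p = number of independent variables
msr = ssr / p                          # mean square due to regression
mse = sse / (n - p - 1)                # mean square error

r_squared = ssr / sst
f_stat = msr / mse
print("R^2 =", r_squared, "| F =", f_stat)
```

With an intercept in the model, SST = SSR + SSE, so R^2 can equivalently be written as 1 - SSE / SST.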

1. Linear Regression Analysis

Regression tries to search for relationships among variables. 

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
X = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

In [None]:
model = LinearRegression()

model.fit(X,y)

r2 = model.score(X,y)  # named r2 so sklearn.metrics.r2_score is not overwritten
print("Coefficient of Determination -> " , r2)

print("Intercept -> ", model.intercept_)
print("Slope -> ", model.coef_)

2. Multiple Regression Analysis

Multiple linear regression is linear regression with two or more independent variables. <br>
Polynomial regression is considered a generalized case of linear regression: you assume a polynomial dependence between the output and the input, which means the regression equation can include non-linear terms of the inputs. 

In [None]:
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]

X = np.array(x)
y = np.array(y)


model = LinearRegression()

model.fit(X,y)


r2 = model.score(X,y)  # named r2 so sklearn.metrics.r2_score is not overwritten

print("Coefficient of Determination -> ", r2)

print("Model Intercept => ", model.intercept_)

print("Slope => ", model.coef_)

#### Linear Regression using statsmodels

In [None]:
import statsmodels.api as sm

In [None]:
x = [[0,1], [5,1], [15,2], [25,5], [35,11], [45,15], [55,34], [60,35]]
y = [4,5,20,14,32,22,38,43]

x, y = np.array(x), np.array(y)


x = sm.add_constant(x)

model = sm.OLS(y, x)


results = model.fit()

print(results.summary())

In [None]:
print("Coefficient of Determination -> ", results.rsquared)

print("Adjusted Coefficient of Determination => ", results.rsquared_adj)

print("Regression coefficients => ", results.params)

**(Bonus or Optional) Polynomial Regression with Scikit-Learn** <br>
In polynomial regression, you need to transform the inputs to include non-linear terms.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([15, 11, 2, 8, 25, 32])


transformer = PolynomialFeatures(degree = 2, include_bias = False)

x = transformer.fit_transform(x)

model = LinearRegression().fit(x,y)

r2 = model.score(x,y)  # named r2 so sklearn.metrics.r2_score is not overwritten

print("Coefficient of Determination => ", r2)

print("Intercept -> ", model.intercept_)
print("Coefficients => ", model.coef_)

Neither regression nor correlation analyses can be interpreted as establishing cause-and-effect relationships. <br><br> They can indicate only how or to what extent variables are associated with each other.<br><br> The correlation coefficient measures only the degree of linear association between two variables. Any conclusions about a cause-and-effect relationship must be based on the judgment of the analyst.

### 3. Chi Square Test

background -> 

* used to analyze the frequencies of two variables with multiple categories to determine whether they are independent
* qualitative variables
* nominal data

degrees of freedom for the chi-squared distribution -> (rows -1) * (cols -1)

a. Understanding Contingency Tables (also known as crosstabs)

* useful for multiple population proportions
* classify sample observations according to two or more characteristics
* also called cross-classification tables

Contingency tables are the pivot tables obtained by cross-tabulating categorical variables. The contingency here is whether one variable affects the values of the other categorical variable. <br>
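As a minimal sketch, a contingency table can be built from raw observations with `pandas.crosstab`, and the degrees of freedom follow from its shape. The `survey` data below are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical raw observations: one row per respondent
survey = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "prefers": ["tea", "coffee", "tea", "tea", "coffee", "coffee", "tea", "coffee"],
})

# Rows are levels of one variable, columns are levels of the other,
# and each cell counts the observations falling in that combination
table = pd.crosstab(survey["gender"], survey["prefers"])
print(table)

rows, cols = table.shape
dof = (rows - 1) * (cols - 1)
print("degrees of freedom:", dof)
```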


b. Performing Chi-Square Tests

c. Chi-Square Tests for Feature Selection

**Assumption** -> each cell in the contingency table has an expected frequency of at least **5**

![](https://media.geeksforgeeks.org/wp-content/uploads/Capture-214.png)

#### Note:- Used only for Categorical Features.

Dataset used -> https://www.kaggle.com/c/cat-in-the-dat

In [None]:
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 


data = pd.read_csv('../input/cat-in-the-dat/train.csv')

In [None]:
data.head()

In [None]:
data.drop(['id'], axis = 1, inplace = True)

In [None]:
data.dtypes

In [None]:
for col in data.columns:
    print(col, data[col].nunique())

In [None]:
for col in data.columns:
    print(col, '\n\n',data[col].value_counts())
    print('-'*10)

bin_3 has T/F values and bin_4 has Y/N values. <br>
nom_0, nom_1, nom_2, nom_3, nom_4 have 3-6 unique values. <br>
nom_5, nom_6, nom_7, nom_8, nom_9 have many unique values. <br>
Then come the ordinal variables.

In [None]:
data['bin_3'] = data['bin_3'].map({"T" : 1, "F" : 0})
data['bin_4'] = data['bin_4'].map({"Y" : 1, "N" : 0})

In [None]:
data.head()

We're done dealing with the binary variables. <br>
Now we're left to deal with the nominals & ordinals.

We have 5 ordinal variables, of which 4 have few unique values and can be dealt with in a similar manner. <br>
ord_5 has many unique values and needs to be handled separately. 

In [None]:
for col in ['ord_1', 'ord_2', 'ord_3', 'ord_4']:
    print(col, list(np.unique(data[col])))

In [None]:
m1_ord1 = {'Novice' : 0, 'Contributor' : 1, 'Expert' : 2, 'Master' : 3, 'Grandmaster' : 4}

data['ord_1'] = data['ord_1'].map(m1_ord1)

In [None]:
data.head()

In [None]:
# Map by temperature order (coldest to hottest) rather than alphabetically
m2_ord2 = {'Freezing' : 0, 'Cold' : 1, 'Warm' : 2, 'Hot' : 3, 'Boiling Hot' : 4, 'Lava Hot' : 5}

data['ord_2'] = data['ord_2'].map(m2_ord2)

In [None]:
data.head()

In [None]:
data['ord_3'] = data['ord_3'].apply(lambda x : ord(x) - ord('a'))
data['ord_4'] = data['ord_4'].apply(lambda x : ord(x) - ord('A'))

In [None]:
data.head()

In [None]:
data['ord_5a'] = data['ord_5'].str[0]
data['ord_5b'] = data['ord_5'].str[1]

data['ord_5a'] = data['ord_5a'].map({val : idx for idx, val in enumerate(np.unique(data['ord_5a']))})
data['ord_5b'] = data['ord_5b'].map({val : idx for idx, val in enumerate(np.unique(data['ord_5b']))})

In [None]:
data.head()

Let's deal with the nominal variables.

In [None]:
data[['nom_0', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']]

In [None]:
data['nom_1'].value_counts()

In [None]:
data['nom_2'].value_counts()

In [None]:
data['nom_3'].value_counts()

In [None]:
data['nom_4'].value_counts()

In [None]:
data['nom_5'].value_counts()

In [None]:
data['nom_6'].value_counts()

In [None]:
data['nom_7'].value_counts()

In [None]:
data.drop(['ord_5', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9'], axis = 1, inplace = True)

In [None]:
"""data['day'] = data['day'] / 7.0

data['month'] = data['month'] / 12.0"""

In [None]:
data.head()

Let's encode the remaining nominal values.

In [None]:
data['nom_1'].value_counts()

In [None]:
m1_nom1 = {'Trapezoid' : 0, 'Square' : 1, 'Star' : 2, 'Circle' : 3, 'Polygon' : 4, 'Triangle' : 5}

data['nom_1'] = data['nom_1'].map(m1_nom1)

In [None]:
data['nom_2'].value_counts()

In [None]:
m2_nom2 = {'Lion' : 0, 'Cat' : 1, 'Snake' : 2, 'Dog' : 3, 'Axolotl' : 4, 'Hamster' : 5}
data['nom_2'] = data['nom_2'].map(m2_nom2)

In [None]:
data['nom_3'].value_counts()

In [None]:
m3_nom3 = {'Russia' : 0, 'Canada' : 1, 'China' : 2, 'Finland' : 3, 'Costa Rica' : 4, 'India' : 5}

data['nom_3'] = data['nom_3'].map(m3_nom3)

In [None]:
data['nom_4'].value_counts()

In [None]:
m4_nom4 = {'Oboe' : 0, 'Piano' : 1, 'Bassoon' : 2, 'Theremin' : 3}

data['nom_4'] = data['nom_4'].map(m4_nom4)

In [None]:
data.head()

In [None]:
data['nom_0'].value_counts()

In [None]:
m0_nom0 = {'Green' : 0, 'Blue' : 1, 'Red' : 2}

data['nom_0'] = data['nom_0'].map(m0_nom0)

Perform one-hot encoding of all the features

The label-encoded columns (everything except the target) are expanded into indicator columns.

In [None]:
df_copy = data.copy()
df_copy.drop(['target'], axis = 1, inplace = True)

In [None]:
df_copy = pd.get_dummies(df_copy, columns = df_copy.columns)
df_copy

In [None]:
data.head()

In [None]:
#X = data.drop(['target'], axis = 1)
X = df_copy
y = data.target

In [None]:
# perform feature engineering to encode categorical variables so as to be processed by chi2_feature transform

In [None]:
chi2_features = SelectKBest(chi2, k = 10)
X_kbest_features = chi2_features.fit_transform(X,y)

print("Original Number of Features -> (shape)", X.shape[1])

print("K Best Features (shape)-> ",X_kbest_features.shape[1])



In [None]:
X_kbest_features

##### solving chi2 test generic questions and numericals using python 

q1. A survey of 470 respondents is depicted in the table below. <br> (In Making)



|Type| No regular Exercise  | Sporadic Exercise | Regular Exercise | Total  |
|----|-----|----|----|-----|
|Dormitory| 32  | 30 | 28 | 90 |
|On-Campus Apartment| 74 | 64 | 42 | 180 |
|Off-Campus Apartment| 110  | 25  | 15  | 150  |
|At home| 39 | 6 | 5 | 50 |
|Total | 255 | 125 | 90 | 470 |


<br>Based on the above data, is there any relationship between exercise and student's living arrangement? <br>


Soln. 


H0 -> Living arrangement and exercise are independent <br><br>
H1 -> Living arrangement and exercise are dependent 
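One way to carry out the test for q1 is `scipy.stats.chi2_contingency`, applied to the observed counts from the table (totals excluded). This is a sketch rather than the notebook's official solution, and the 0.05 significance level is an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above (margins excluded)
observed = np.array([
    [32, 30, 28],    # Dormitory
    [74, 64, 42],    # On-campus apartment
    [110, 25, 15],   # Off-campus apartment
    [39, 6, 5],      # At home
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("chi-square statistic:", chi2_stat)
print("degrees of freedom:", dof)
print("p-value:", p_value)
print("smallest expected count:", expected.min())

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print("Reject H0:", reject_h0)
```

The degrees of freedom are (4 - 1) * (3 - 1) = 6, and every expected count is above 5, so the usual assumption for the test holds. Because the p-value falls far below 0.05, H0 is rejected: exercise habits and living arrangement appear to be related.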


### 4. T-Test

The t-test, also known as Student's t-test, compares two averages (means) and tells you whether they are different from each other. <br>
It can also tell you how significant the differences are. 

**t-score**

**T-Values vs P-Values**

Types of T-Test <br>
* Independent Samples t-test
* Paired Sample t-test
* One Sample t-test
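The three types listed above map onto three SciPy functions. The sketch below runs each on hypothetical data generated only for illustration; the group means, spreads and the reference value of 50 are all assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)

# Hypothetical measurements, generated only for illustration
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=53, scale=5, size=30)
before = rng.normal(loc=70, scale=8, size=25)
after = before + rng.normal(loc=2, scale=3, size=25)

# Independent samples: are the means of two unrelated groups different?
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

# Paired samples: did the same subjects change between two measurements?
t_rel, p_rel = stats.ttest_rel(before, after)

# One sample: does the group mean differ from a reference value (here 50)?
t_one, p_one = stats.ttest_1samp(group_a, popmean=50)

for name, t, p in [("independent", t_ind, p_ind),
                   ("paired", t_rel, p_rel),
                   ("one-sample", t_one, p_one)]:
    print(f"{name:12s} t = {t: .3f}, p = {p:.4f}")
```

A paired t-test is equivalent to a one-sample t-test on the per-subject differences, which is why it is the right choice when the two measurements come from the same subjects.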

### 5. Z-Test

A z-test is a hypothesis test whose test statistic approximately follows a normal distribution under the null hypothesis. <br>
It helps to determine whether two sample means are approximately the same when their variances are known and the sample size is large enough.
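A minimal sketch of a one-sample z-test, assuming the population standard deviation is known. All the numbers below are hypothetical.

```python
import math
from scipy.stats import norm

# Hypothetical numbers, for illustration only
pop_mean = 100      # mean claimed by the null hypothesis
pop_sd = 15         # known population standard deviation
n = 50              # sample size
sample_mean = 104   # observed sample mean

# z = (sample mean - hypothesized mean) / standard error
standard_error = pop_sd / math.sqrt(n)
z = (sample_mean - pop_mean) / standard_error

# Two-sided p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))

print("z =", round(z, 3), "| p-value =", round(p_value, 4))
```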

Q1. The grades on a physics midterm at Covington are roughly symmetric with μ=72, σ=2.0. <br>
Stephanie scored 74 on the exam. <br>
Find the z-score of Stephanie's exam score. 

In [None]:
z = (74 - 72 )/ 2

print(z)

Q2. Following are the summary statistics of life spans of two breeds of cats. <br>

| Breed  |Mean Life Span (yrs)   | Standard Deviation  |  
|---|---|---|
| Abyssinian  | 12  | 1.5  | 
| Colorpoint shorthair  |  14 | 1  |   

Fluffy was an Abyssinian cat who lived for 13 years, and Mittens was a colorpoint shorthair cat who lived for 15 years.<br>
Relative to the breed, which cat had a longer life span? <br><br>

* Fluffy 
* Mittens
* Fluffy and Mittens had equally long lifespans relative to their breeds.
* It's impossible to say without seeing all of the individual lifespans.
* It's impossible to say since we don't know the shape of either distribution.

In [None]:
z1 = (13 - 12) / 1.5

z2 = (15 - 14) / 1

print("Z Score for Fluffy -> ", z1)
print("Z Score for Mittens -> ", z2)

Mittens had the longer lifespan relative to its breed, since its z-score (1.0) is higher than Fluffy's (about 0.67).

### 6. ANOVA Test

ANOVA -> Analysis of Variance. <br>
Helps to compare the means of more than 2 groups. <br>
ANOVA F Test is also called omnibus test. <br><br><br>

Main types of ANOVA Test -> 
* One-way or One-factor 
* Two-way or Two-factor

ANOVA Hypotheses -> <br>
* Null Hypothesis = Group means are equal. No variation between the group means. 
* Alternative Hypothesis = At least, one group is different from other groups.

ANOVA Assumptions -> <br><br>
* Residuals (experimental error) are normally distributed (Shapiro-Wilk Test)
* Homogeneity of variances (variances are equal between treatment groups) (Levene's or Bartlett's Test)
* Observations are sampled independently from each other. 

ANOVA Working -> <br><br>
* Check sample sizes, i.e., Equal number of observations in each group. 
* Calculate Mean Square for each group (MS) (SS of group / (number of levels - 1), where levels - 1 is the degrees of freedom of the group factor)
* Calc Mean Sq. Error (SS Error / df of residuals)
* Calc F value (MS of group / MSE)
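The steps above can be sketched by hand and checked against `scipy.stats.f_oneway`. The three groups below are hypothetical measurements used only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three treatment groups
groups = [
    np.array([25.0, 30.0, 28.0, 36.0, 29.0]),
    np.array([45.0, 55.0, 29.0, 56.0, 40.0]),
    np.array([30.0, 29.0, 33.0, 37.0, 27.0]),
]

k = len(groups)                       # number of groups
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares and its mean square (df = k - 1)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-group (error) sum of squares and its mean square (df = N - k)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_error = ss_error / (n_total - k)

f_manual = ms_between / ms_error
f_scipy, p_scipy = stats.f_oneway(*groups)
print("F (manual) =", f_manual, "| F (scipy) =", f_scipy, "| p =", p_scipy)
```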

#### 6.1 One-Way ANOVA Test

In [None]:
import random

In [None]:
random.seed(2021)

In [None]:
df = pd.DataFrame([random.sample(range(1, 1000), 4) , random.sample(range(1, 1000), 4), random.sample(range(1, 1000), 4), random.sample(range(1, 1000), 4)], columns = ['A', 'B', 'C', "D"])

In [None]:
df

In [None]:
df_melt = pd.melt(df.reset_index(), id_vars = ['index'], value_vars = ['A','B','C','D'])
df_melt.columns = ['index', 'treatments', 'value']

In [None]:
df_melt

In [None]:
sns.boxplot(x='treatments', y='value', data=df_melt, color='#99c2a2')
sns.swarmplot(x="treatments", y="value", data=df_melt, color='#7d0013')
plt.show()

In [None]:
from scipy import stats

In [None]:
fvalue, pvalue = stats.f_oneway(df['A'], df['B'], df['C'], df['D'])
print("f Value -> ", fvalue)
print("p value -> ", pvalue)

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols


model = ols('value ~ C(treatments)', data = df_melt).fit()

anova_table = sm.stats.anova_lm(model, typ = 2)
anova_table

##### Interpretation

The p-value obtained from the ANOVA analysis is not significant (p > 0.05), so we fail to reject the null hypothesis: there is no evidence of significant differences among the groups. 

#### 6.2 Two-Way ANOVA Test

In Two-Way ANOVA Test, we have 2 independent variables and their different levels

In [None]:
data = pd.DataFrame({
    'Genotype': ['A','A','A','B','B','B', 'C', 'C', 'C', 'D', 'D', 'D'],
    '1_year': np.random.random(12),
    '2_year': np.random.random(12),
    '3_year': np.random.random(12),
})

In [None]:
data

In [None]:
data_melt = pd.melt(data, id_vars = ['Genotype'], value_vars = ['1_year', '2_year', '3_year'])

In [None]:
data_melt.head()

In [None]:
data_melt.columns = ['Genotype', 'years', 'value']

In [None]:
sns.boxplot(x = 'Genotype', y = 'value', hue = 'years', data = data_melt, palette = ['r', 'k', 'w'])

In [None]:
model = ols('value ~ C(Genotype) + C(years) + C(Genotype) : C(years)', data = data_melt).fit()

In [None]:
anova_table = sm.stats.anova_lm(model, typ = 2)

anova_table

##### Post-Hoc Analysis (Tukey's Test)

In [None]:
!pip install -q bioinfokit
from bioinfokit.analys import stat

In [None]:
res = stat()
res.tukey_hsd(df = df_melt, res_var = 'value', xfac_var = 'treatments', anova_model = 'value ~ C(treatments)')
output = res.tukey_summary

In [None]:
output

All pairwise comparisons have p > 0.05. <br>
Hence, none of the differences between groups are statistically significant.

### 7. F Stats Test

### 8. Binomial Distribution

In [None]:
from scipy.stats import binom

n = 6
p = 0.6

r_values = list(range(n + 1))

mean, var = binom.stats(n, p)


dist = [binom.pmf(r, n, p) for r in r_values]

df = pd.DataFrame(list(zip(r_values, dist)), columns = ['r', 'p(r)'], index = None)

df

In [None]:
df['p(r)'].plot.bar()

### 9. Normal Distribution

also known as 
* Gaussian Distribution
* Bell Curve


<br><br> Below is the probability density function (pdf) for the Normal Distribution -> 

![](https://cdn.askpython.com/wp-content/uploads/2020/10/Probability-density-function-of-Normal-Distribution.jpg.webp)

* x -> input value
* mu -> mean
* sigma -> std deviation
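As a quick check of the density in terms of x, mu and sigma, the sketch below implements the standard formula f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2)) and compares it with `scipy.stats.norm.pdf`. The helper name `normal_pdf` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std deviation sigma."""
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

xs = np.linspace(-3, 4, 8)
manual = normal_pdf(xs, mu=0.5, sigma=1.0)
reference = norm.pdf(xs, loc=0.5, scale=1.0)
print(np.allclose(manual, reference))
```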

![](https://cdn.askpython.com/wp-content/uploads/2020/10/Standard-deviation-around-mean.jpg.webp)

In [None]:
mu, sigma = 0.5, 1

In [None]:
data = np.random.normal(mu, sigma, 10000)

In [None]:
count, bins, ignored = plt.hist(data, 20)



### 10. Poisson Distribution

![](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1539784818/output_39_0_knqrjh.png)

Note -> The normal distribution is a limiting case of the Poisson distribution as lambda -> inf. 
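The limiting behaviour can be checked numerically. This sketch compares the Poisson pmf with the density of a normal distribution that has the same mean and variance (both equal to lambda) and reports the largest gap. The helper `max_gap` and the two lambda values are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm, poisson

def max_gap(lam):
    """Largest absolute difference between the Poisson(lam) pmf and the
    Normal(mean=lam, sd=sqrt(lam)) density, taken over the integers 0..4*lam."""
    k = np.arange(0, int(4 * lam) + 1)
    return np.max(np.abs(poisson.pmf(k, lam) - norm.pdf(k, loc=lam, scale=np.sqrt(lam))))

gap_small = max_gap(3)     # small lambda: visibly skewed
gap_large = max_gap(100)   # large lambda: close to the bell curve
print("lambda = 3   ->", gap_small)
print("lambda = 100 ->", gap_large)
```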

In [None]:
from scipy.stats import poisson
data_poisson =poisson.rvs(mu = 3, size = 10000, random_state= 2021)

In [None]:
# sns.distplot is deprecated in recent seaborn releases; histplot draws the histogram
sns.histplot(data_poisson, bins = 30, kde = False, 
             color = 'skyblue', alpha = 1)



### 11. Bernoulli Distribution

A Bernoulli distribution has only 2 possible outcomes, 1 (success) and 0 (failure). <br>
E.g. a coin toss. <br>
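A minimal sketch of the coin-toss example with `scipy.stats.bernoulli`, assuming a fair coin (p = 0.5) and treating heads as success.

```python
from scipy.stats import bernoulli

p = 0.5  # probability of success (heads) for a fair coin

# Probability mass at the two possible outcomes
p_failure = bernoulli.pmf(0, p)
p_success = bernoulli.pmf(1, p)

# Mean is p and variance is p * (1 - p)
mean, var = bernoulli.stats(p, moments="mv")

# Ten simulated tosses; random_state fixes the draw for reproducibility
tosses = bernoulli.rvs(p, size=10, random_state=2021)

print("P(0) =", p_failure, "| P(1) =", p_success)
print("mean =", float(mean), "| variance =", float(var))
print("tosses:", tosses)
```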

### 12. Central Limit Theorem

**What it states?** <br><br>
Even when a population is not normally distributed, if you draw many sufficiently large random samples from it and take the average of each, those averages will be approximately normally distributed.<br><br>
In other words, repeatedly sampling from a non-normal population and taking the mean of each sample yields sample means whose distribution approaches a normal distribution as the sample size grows. <br><br>

For example, take a population of 100 values that are not normally distributed. Draw 10 values at random, say 50 times, and take the mean of each draw. The 50 means will be roughly normally distributed.

The following experiment simulates dice rolls. <br>
We draw 1000 samples, each consisting of 100 rolls with possible outcomes 1, 2, 3, 4, 5, 6, and record the mean of each sample. <br><br>
Plotting the histogram of the 1000 sample means gives an approximately normal (bell-shaped) plot. <br>
This is the Central Limit Theorem in action.

In [None]:

means = [np.mean(np.random.randint(1, 7, 100)) for _ in range(1000)]

plt.hist(means)
plt.show()

##### Key Takeaways :- <br><br>

![](https://miro.medium.com/max/366/1*RdIQG331j0tayi50asTOIw.png)

![](https://miro.medium.com/max/418/1*dCxzo7E6lmKxHLEg2xZSoQ.png)

You can never experiment on all your customers (the population). However, to draw a conclusion that represents your customers well, you need to run repeated experiments on different sets of customers (different samples drawn from the non-normal population) and confirm your hypotheses. 

### 13. Calculating p-Value

![](https://miro.medium.com/max/963/1*0XXmFcatWBkagH3YeYdpig.png)

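The p-value measures how compatible the observed data are with a hypothesis, so that a question can be answered at a chosen confidence level. <br>
E.g. at the 90% confidence level, the null hypothesis is rejected when the p-value is below 0.10. 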

**In p-value tests, our task might be to find the probability that a sample mean could be x, given the hypothesis that the population mean is y.** <br>

<br>
Conclusion => <br>
The p-value gives us the probability of observing data at least as extreme as what we observed, given that the null hypothesis is true. It doesn't tell us the probability that the null hypothesis is true.
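To illustrate, the sketch below computes a one-sided p-value for a sample mean in two ways: analytically from the normal distribution, and by simulating many samples under the null hypothesis. All numbers are hypothetical, and the population standard deviation is assumed to be known.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setting, for illustration only
mu0 = 50.0          # population mean under the null hypothesis
sigma = 10.0        # population standard deviation, assumed known
n = 25              # sample size
sample_mean = 54.0  # observed sample mean

# Analytic one-sided p-value: P(sample mean >= 54 | population mean = 50)
z = (sample_mean - mu0) / (sigma / np.sqrt(n))
p_analytic = norm.sf(z)

# Simulation check: draw many samples under the null hypothesis and count
# how often their mean is at least as large as the observed one
rng = np.random.default_rng(2021)
sim_means = rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1)
p_simulated = np.mean(sim_means >= sample_mean)

print("z =", z)
print("analytic p-value  =", p_analytic)
print("simulated p-value =", p_simulated)
```

Both numbers answer the same question: how often would a sample mean this large arise if the null hypothesis were true?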

**How it works?** <br>
Statement (Null Hypothesis) -> 

In [None]:
def pvalue(mu, sigma, samp_size, samp_mean = 0, deltam = 0):
    
    np.random.seed(2021)
    

Considering Means -> Use t-test

![](https://media.geeksforgeeks.org/wp-content/uploads/20200503190751/Annotation-2020-05-03-190733-300x92.png)

Considering Proportions -> z test

![](https://media.geeksforgeeks.org/wp-content/uploads/20200503191805/Annotation-2020-05-03-191654-300x92.png)

## Solving Questions on Distributions using python

Q1. Calculate the probability of getting 14 heads in 20 attempts from a fair coin.

In [None]:
# It is a problem of binomial distribution. N = 20, p = 0.5, q = 0.5, x = 14


# use the probability mass function to evaluate the probability
print(stats.binom.pmf(k = 14, n = 20, p = 0.5))

Q2. A question paper contains 90 multiple choice questions. There are four alternative answers to each question, of which only one is correct. What is the probability of scoring at least 22 marks without any preparation (random guessing)?

In [None]:
## Again a question of binom distribution. N = 90, k = 22, p = 0.25


# In this question, we'll be using cumulative distribution function and subtract the outcome from 1. 
# Find the cumulative probability till k = 21 and subtract it from 1 so that outcome is >= 22 marks
1 - stats.binom.cdf(k = 21, n = 90, p = 0.25)

Q3. On an average 5% items supplied by a manufacturer are defective. If a batch of 10 items is inspected, what is the probability that 2 items are defective. 

In [None]:
### Binomial Distribution problem

## Defective sample is to be considered as success. p = 0.05, N = 10, k = 2

stats.binom.pmf(k = 2, n= 10, p = 0.05)

Q4. A car distributor experiences on an average 3 car sales per day. Find the probability that on a randomly selected day they will sell 
1. 5 cars.
2. 0 Cars
3. At most 2 cars
4. exactly 1 car

In [None]:
## Since we are dealing with discrete occurrences over an interval, this is a case of the Poisson distribution.
# x = 5, mu = 3
stats.poisson.pmf(5, 3)

In [None]:
## Case ii. x = 0, mu = 3

stats.poisson.pmf(0,3)

In [None]:
# Case iii. x <= 2, mu = 3

stats.poisson.cdf(2, 3)

In [None]:
## Case iv. x = 1, mu = 3

stats.poisson.pmf(1,3)

Q5. The weight of football players is normally distributed with mean of 200 pounds and a standard deviation of 25 pounds. Find the probability of a player weighing
1. more than 241.25 pounds
2. less than 250 pounds.

In [None]:
# part 1
# x = 241.25
# loc = 200
# scale = 25

# P(X > 241.25)

1 - stats.norm.cdf(241.25, 200, 25)

In [None]:
# part 2

# x = 250
# loc = 200
# scale = 25

# P(X < 250)

stats.norm.cdf(250, 200, 25)

Q6. Assuming a binomial experiment with p = 0.5 and a sample size of 100. The expected value of this distribution is?

In [None]:
# Expected value of a binomial experiment is np

0.5 * 100

Q7. Probability of acceptance of a student in a college is 0.3 <br>
If 5 students apply, probability that at most 2 are selected?

In [None]:
stats.binom.cdf(2, 5, 0.3)

Q8. Probability of obtaining 45 or fewer heads in 100 tosses of a coin?

In [None]:
stats.binom.cdf(45, 100, 0.5)

Q9. Suppose a die is tossed 5 times. Probability of getting exactly 2 fours?

In [None]:
stats.binom.pmf(2, 5, 1/6)  # P(four) on a fair die is exactly 1/6

Q10. An average light bulb manufactured by the Acme Corporation lasts 300 days with a standard deviation of 50 days. Assuming that bulb life is normally distributed, what is the probability that an Acme light bulb will last at most 365 days?

In [None]:
stats.norm.cdf(365, 300, 50)

Q11. Suppose scores on an IQ test are normally distributed. If the test has a mean of 100 and a standard deviation of 10, what is the probability that a person who takes the test will score between 90 and 110?

In [None]:
stats.norm.cdf(110, 100, 10) - stats.norm.cdf(90, 100, 10)

Q12. Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists will see fewer than four lions on the next 1-day safari?

In [None]:
stats.poisson.cdf(3, 5)

Q13. The average number of homes sold by the Acme Realty company is 2 homes per day. What is the probability that exactly 3 homes will be sold tomorrow?

In [None]:
# x = 3, mu = 2
stats.poisson.pmf(3, 2)

Q14. Suppose scores on an IQ test are normally distributed, with a mean of 100. Suppose 20 people are randomly selected and tested. The standard deviation in the sample group is 15. What is the probability that the average test score in the sample group will be at most 110?

Q15. Acme Corporation manufactures light bulbs. The CEO claims that an average Acme light bulb lasts 300 days. A researcher randomly selects 15 bulbs for testing. The bulbs last an average of 290 days, with a standard deviation of 50 days. If the CEO's claim were true, what is the probability that 15 randomly selected bulbs would have an average life of no more than 290 days?
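
One possible way to approach Q14 and Q15, sketched under the assumption that only the sample standard deviation is known, so the standardized sample mean follows a t-distribution with n - 1 degrees of freedom. The helper name `prob_sample_mean_at_most` is hypothetical and not part of SciPy.

```python
import math
from scipy import stats

def prob_sample_mean_at_most(x_bar, mu, s, n):
    """P(sample mean <= x_bar) using a t-distribution with n - 1 degrees of freedom."""
    se = s / math.sqrt(n)            # standard error of the mean
    t_stat = (x_bar - mu) / se       # standardized distance from the claimed mean
    return stats.t.cdf(t_stat, df=n - 1)

# Q14: mean 100, sample of 20, sample std 15, average at most 110
p_q14 = prob_sample_mean_at_most(110, 100, 15, 20)

# Q15: claimed mean 300, sample of 15, sample std 50, average at most 290
p_q15 = prob_sample_mean_at_most(290, 300, 50, 15)

print(p_q14, p_q15)
```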

Q16. Suppose we select 5 cards from an ordinary deck of playing cards. What is the probability of obtaining 2 or fewer hearts?
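
A sketch of one way to approach Q16. Cards are drawn without replacement, so the number of hearts is assumed here to follow a hypergeometric distribution rather than a binomial one. The variable names are illustrative.

```python
from scipy import stats

deck_size = 52      # population size
hearts = 13         # "success" cards in the deck
cards_drawn = 5     # sample size, drawn without replacement

# scipy's argument order is hypergeom.cdf(k, M, n, N):
# k successes, M population size, n successes in the population, N draws
p_q16 = stats.hypergeom.cdf(2, deck_size, hearts, cards_drawn)
print(p_q16)
```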

### 14. Measuring Correlation

Correlation measures the strength of the association between two variables. <br>
If we want to know the relation between the height and weight of human beings, for example, we obtain a dataset of both and compute the correlation to support or reject the hypothesis that they are related. 

* r takes values from -1 to +1 
* r = 0 means no linear correlation
* Pearson's r is not suited to ordinal variables
* the sample size should be at least moderate (roughly 20 to 30 or more) 
* outliers can distort the result

![](https://media.geeksforgeeks.org/wp-content/uploads/20200311233526/formula6.png)
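
To make the formula concrete, here is a minimal sketch that computes Pearson's r by hand with NumPy. The function name `pearson_r` and the height and weight values are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

heights = [150, 160, 165, 172, 180]   # made-up values
weights = [52, 58, 63, 70, 79]        # made-up values

r_manual = pearson_r(heights, weights)
print(r_manual)
```

The result should agree with `np.corrcoef` and with `stats.pearsonr`, which is used in the cells that follow.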

In [None]:
random.seed(2021)
lst1 = random.sample(range(100), 50)
print("Elements of 1st list -> ", lst1, "\n")


lst2 = random.sample(range(100), 50)
print("Elements of 2nd list -> ", lst2, "\n")


In [None]:
corr, _ = stats.pearsonr(lst1, lst2)

print("Pearsons correlation : ", corr)

Inference -> A value close to 0 indicates little or no linear correlation. Here we can say there is at most a slight, insignificant correlation between the values of the two lists.

### 15. Measuring Variance

Variance -> Measures how far, on average, the individual observations in a dataset are from their mean (the average of the squared deviations). <br>
Std Deviation -> The square root of the variance; it measures the dispersion of the dataset in the same units as the data.

In [None]:
def variance(data):
    
    n = len(data)
    
    mean = sum(data)/n
    
    deviations = [(x - mean) ** 2 for x in data]
    
    variance = sum(deviations) / n
    
    return variance

In [None]:
random.seed(2021)
data = random.sample(range(1000), 10)

variance(data)

Variance estimate of the population using sample data (dividing by n - dof, with dof = 1, instead of n)

In [None]:
def variance(data , dof = 0):
    
    n = len(data)
    mean = sum(data) / n
    
    return sum((x - mean)** 2 for x in data) / (n - dof)



In [None]:
variance(data, dof = 1)

Standard deviation is the square root of the variance value calculated. <br><br><br>
Values that are within one standard deviation of the mean can be thought of as fairly typical. <br>
Those values which are three or more standard deviations away from the mean can be considered as **outliers**. 
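
As a rough illustration of the three-standard-deviation rule, the sketch below flags values that lie three or more standard deviations from the mean. The data are made up, and the population standard deviation is used as an assumption.

```python
import statistics

values = [12, 14, 13, 15, 14, 13, 12, 15, 14, 13, 14, 13, 15, 12, 14, 60]  # made-up data

mean = statistics.mean(values)
std = statistics.pstdev(values)          # population standard deviation

# Flag values three or more standard deviations away from the mean
outliers = [v for v in values if abs(v - mean) >= 3 * std]
print(mean, std, outliers)
```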

### 16. Sample Statistics

### 17. Population Statistics

### 18. Maximum Likelihood Estimation

MLE is a method for finding the parameter values of a density function under which the observed data are most likely. <br>
For a normal model, the likelihood function depends on the mean μ and the variance σ², and it is maximized either analytically or through an iterative numerical process.<br>


|  S.No | Likelihood   | Probability  |
|---|---|---|
| 1  | Refers to past events with known outcomes  | Refers to the occurrence of future outcomes  |
| 2  | eg. A coin is flipped 10 times and 10 heads occur. Likelihood the coin is an unbiased coin?  | eg. A coin flipped n times. Probability of getting heads.| 
| 3  | Sum of Likelihoods != 1  | Sum of Probabilities = 1  |  

**Steps to perform MLE:** <br>
* Perform a certain experiment to collect data
* Choose parametric model of the data
* Formulate the likelihood as an objective function to be maximized 
* Maximize the objective func and derive the parameters of the model
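
The steps above can be sketched for the coin-toss example. This is a minimal illustration that assumes a Bernoulli model and maximizes the log-likelihood over a grid of candidate values. The toss outcomes are made up.

```python
import numpy as np

tosses = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # 1 = heads; made-up outcomes

def log_likelihood(p, data):
    """Log-likelihood of i.i.d. Bernoulli(p) data."""
    heads = data.sum()
    tails = len(data) - heads
    return heads * np.log(p) + tails * np.log(1 - p)

# Evaluate the objective on a grid of candidate values and keep the best one
grid = np.linspace(0.01, 0.99, 99)
p_hat = grid[np.argmax(log_likelihood(grid, tosses))]
print(p_hat, tosses.mean())
```

For this model the maximizer coincides with the observed proportion of heads, which is why the two printed values agree.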


**Examples** -> <br>
* Coin Toss to find the probabilities of heads and tails
* Dart throwing

Note :- For linear regression models, we use Ordinary Least Squares (OLS) to fit the regression model and estimate the parameters B0 & B1. <br>
**MLE** asks: given the data we observe, which model parameters maximize the likelihood of the observed data occurring?

**Applications ->** <br>
The parameters of a logistic regression model can be estimated by the probabilistic framework called Maximum Likelihood Estimation.

The outputs of a logistic regression are class probabilities.

**Brief Overview** -> <br>
![](https://miro.medium.com/max/421/1*ayxQCn3xz6sm41KRjf3Ygw.gif)

![](https://miro.medium.com/max/721/1*6MTXtB4zipiDMguZrlXSlA.gif)

In statistics, MLE is widely used to obtain the parameters of a distribution. 

In this paradigm, maximizing the log-likelihood is equivalent to minimizing the cost function (the negative log-likelihood). <br>
![](https://miro.medium.com/max/774/1*VAb-6NSg2vwUtqCtfNdjrA.gif)

The Gradient Descent algorithm is then used to adjust the model parameters so that this cost function is minimized. 
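
A minimal sketch of this idea, assuming a logistic regression with one feature fitted by plain gradient descent on the average negative log-likelihood. The data, the learning rate of 0.1 and the iteration count are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Made-up labels generated from a known logistic model (intercept 0.5, slope 2.0)
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x)))).astype(float)

X = np.column_stack([np.ones_like(x), x])   # add an intercept column
w = np.zeros(2)                             # parameters to be learned

def cost(w):
    """Average negative log-likelihood (log loss)."""
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

initial_cost = cost(w)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    gradient = X.T @ (p - y) / len(y)       # gradient of the cost with respect to w
    w -= 0.1 * gradient                     # step against the gradient

print(initial_cost, cost(w), w)
```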

### 19. Cluster Analysis

Cluster Analysis is an unsupervised technique of grouping objects together based on their properties. By doing so, the objects in one group are more similar to each other than to those in other groups.

Cluster Analysis vs Discriminant Analysis

Cluster Analysis assigns objects to various groups without any prior object labels whereas Discriminant Analysis uses such knowledge which was defined in advance.

**Distance Functions**

Minkowski Distance <br><br>
A generalization of both the Euclidean and the Manhattan metric is the Minkowski distance, given by :- <br>


![](https://slideplayer.com/slide/5070455/16/images/16/Minkowski+Distance+Minkowski+distance%3A+a+generalization.jpg)

![](https://www.researchgate.net/publication/349155159/figure/fig1/AS:989596292767746@1612949550717/Three-typical-Minkowski-distances-ie-Euclidean-Manhattan-and-Chebyshev-distances.png)
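
The relationship between the three distances can be sketched in a few lines. The function `minkowski` below is an illustrative helper, not a library call: order p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the limit p -> infinity the Chebyshev distance.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if np.isinf(p):
        return np.abs(a - b).max()           # Chebyshev distance (limit p -> infinity)
    return (np.abs(a - b) ** p).sum() ** (1 / p)

i, j = [0, 0], [3, 4]
print(minkowski(i, j, 1))        # Manhattan
print(minkowski(i, j, 2))        # Euclidean
print(minkowski(i, j, np.inf))   # Chebyshev
```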

Understanding various distance functions -> <br>
Both Euclidean and Manhattan distance satisfy the following -> 
* d(i,j) >= 0
* d(i,j) = d(j,i)
* d(i,j) <= d(i,h) + d(h,j)
* d(i,i) = 0

#### 19. i. K Means Clustering

![](https://files.realpython.com/media/kmeans-algorithm.a94498a7ecd2.png)

**Ways to choose optimal number of clusters** -> <br>
1. Elbow Method
2. Silhouette coefficient

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

In [None]:
X, y = make_blobs(n_samples = 300, centers = 4, cluster_std = 0.60, random_state = 0)


plt.scatter(X[:,0], X[:,1])

Elbow Method

In [None]:
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    
    
plt.plot(range(1,11), wcss)
plt.title('Elbow method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(X)
plt.scatter(X[:,0], X[:,1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()

Silhouette Analysis Method

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score

In [None]:
n_clusters = [2,3,4,5,6,7,8,9]


for n_cluster in n_clusters:
    
    clusterer = KMeans(n_clusters = n_cluster, random_state = 10)
    cluster_labels = clusterer.fit_predict(X)
    
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("\nFor n_clusters = ", n_cluster, "\nThe average silhouette_score is : ", silhouette_avg)
    

As we can see, n = 4 gives the highest silhouette score. <br>
This matches the number suggested by the elbow method as well as the number of centers we defined when generating the dataset.

#### 19. ii. K-Medoids

The **K-Means** clustering algorithm is sensitive to **outliers**, as the mean value is easily influenced by **extreme values**. <br>
**K-Medoids** is a variant of K-Means that uses actual data points (medoids) as cluster centers, which makes it more robust to noise and outliers.

In [None]:
!pip install -q https://github.com/scikit-learn-contrib/scikit-learn-extra/archive/master.zip


In [None]:
from sklearn_extra.cluster import KMedoids

kmediods = KMedoids(n_clusters = 3, random_state = 0).fit(X)

In [None]:
kmediods.labels_

#### 19. iii. Agglomerative Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering 
import scipy.cluster.hierarchy as shc

In [None]:
df = pd.read_csv('../input/ccdata/CC GENERAL.csv')

X = df.drop(['CUST_ID'], axis = 1)

# Forward-fill missing values; fillna(method='ffill') is deprecated in recent pandas
X = X.ffill()

In [None]:
from sklearn.preprocessing import StandardScaler, normalize

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)


X_normalized = normalize(X_scaled)


X_normalized = pd.DataFrame(X_normalized)

In [None]:
X_normalized

In [None]:
from sklearn.decomposition import PCA 


pca = PCA(n_components = 2)


x_pca = pca.fit_transform(X_normalized)

x_pca = pd.DataFrame(x_pca)
x_pca.columns = ['P1', 'P2']

In [None]:
plt.figure(figsize =(8, 8))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram((shc.linkage(x_pca, method ='ward')))

We can use silhouette scores to find the optimal number of clusters.

### 20. Hypothesis Testing

Hypothesis testing is a statistical method for making decisions using experimental data. <br>
It evaluates two mutually exclusive statements about a population to determine which one is better supported by the sample data.

![](https://miro.medium.com/max/481/1*2vTwIrqdELKJY-tpheO7GA.jpeg)

Standardized Normal Curve <br>
A normal distribution curve that has been standardized (mean = 0, std = 1).

Normal Curve -> 
![](https://miro.medium.com/max/626/1*gBnxoTRwo9sDovvegHfm6g.png)

**Important Terminologies ->** <br><br><br>
**Level of Significance** => the threshold at which we decide to reject or fail to reject the null hypothesis. A 5% significance level means we accept a 5% risk of rejecting the null hypothesis when it is actually true, which corresponds to a 95% confidence level. 

### 21. Type-I Error & Type-II Error

Type I Error

A Type I error occurs when you reject the null hypothesis even though it is true. The probability of a Type I error is denoted by alpha. The critical (rejection) region of the curve is called the alpha region. <br>

Type II Error

A Type II error occurs when we fail to reject the null hypothesis even though it is false. Its probability is denoted by beta. The part of the curve that shows the acceptance region is called the beta region. 
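
A small simulation can make the meaning of alpha tangible. In this sketch the null hypothesis is true by construction, so every rejection is a Type I error, and the rejection rate should land near the chosen alpha. The sample size, the number of experiments and the seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)
alpha = 0.05
n_experiments = 2000

# The null hypothesis (population mean = 0) is true in every simulated experiment
rejections = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=0, scale=1, size=30)
    result = stats.ttest_1samp(sample, popmean=0)
    if result.pvalue < alpha:
        rejections += 1          # rejecting a true null hypothesis is a Type I error

type_1_rate = rejections / n_experiments
print(type_1_rate)
```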

One tailed Test

A statistical test in which the region of **rejection** is only on **one** side of the sampling distribution. 

Two Tailed Test

A **two-tailed** test is a statistical **test** in which the critical area of the distribution is **two-sided**, so it tests whether a sample statistic is greater than or less than a range of values bounded by two critical values. If the test statistic falls in either critical region, the null hypothesis is rejected in favor of the alternative hypothesis. 

P-Value

The **p-value**, or calculated probability, is the probability of finding results at least as extreme as those observed when the null hypothesis is true. <br>
If your p-value is less than the chosen significance level, you reject the null hypothesis. 
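
As a sketch of how a p-value is obtained and used, the snippet below converts a hypothetical z statistic into one-tailed and two-tailed p-values and compares the latter with a significance level. The statistic of 2.1 is made up.

```python
from scipy import stats

z = 2.1          # hypothetical test statistic
alpha = 0.05     # chosen significance level

p_one_tailed = stats.norm.sf(z)             # P(Z >= z)
p_two_tailed = 2 * stats.norm.sf(abs(z))    # P(|Z| >= |z|)

print(p_one_tailed, p_two_tailed)
print("Reject H0" if p_two_tailed < alpha else "Fail to reject H0")
```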

**Degree of Freedom** -> <br>


**Widely Used hypothesis testing types ->** <br>
* T Test (Student T Test)
* Z Test
* ANOVA Test
* Chi-Square Test

### 22. Z-Stats & T-Stats

T-Test <br>
A type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. <br>
Used when the data are approximately normally distributed and the population variances are unknown. 

We have two types of T-Tests -> <br>
* 1 sample t-test
* 2 sample t-test

One Sample t-test -> <br>
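
A one-sample t-test compares the mean of a single sample against a hypothesized population mean. The sketch below uses `scipy.stats.ttest_1samp` on made-up battery-life measurements with an assumed hypothesized mean of 10 hours.

```python
from scipy import stats

battery_hours = [9.8, 10.2, 10.4, 9.9, 10.1, 10.6, 9.7, 10.3]   # made-up sample

# H0: the population mean is 10 hours
t_stat, p_value = stats.ttest_1samp(battery_hours, popmean=10)
print(t_stat, p_value)
```

With these numbers the p-value is well above 0.05, so the sample gives no reason to reject the hypothesized mean.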


2 Samples t-test -> <br>
Compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. <br>
Also known as **Independent t-test**

**Paired sampled t-test** -> <br>
Also known as dependent sample t-test. <br>
Univariate test that tests for a significant difference between 2 related variables. 
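
The difference between the independent and the paired test can be sketched on the same made-up before/after scores. `ttest_ind` treats the two lists as unrelated groups, while `ttest_rel` treats each position as one subject measured twice.

```python
from scipy import stats

before = [72, 68, 75, 71, 69, 74, 70, 73]   # made-up scores
after = [75, 70, 78, 74, 70, 77, 74, 75]    # made-up scores for the same subjects

# Independent two-sample t-test: the lists are treated as unrelated groups
t_ind, p_ind = stats.ttest_ind(before, after)

# Paired t-test: each position is the same subject measured twice
t_rel, p_rel = stats.ttest_rel(before, after)

print(p_ind, p_rel)
```

Because the paired test works on the per-subject differences, it removes between-subject variation and yields a much smaller p-value for this data.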

Z-Test <br>
Use Z-test if -> <br>
* Sample Size > 30
* Data points **independent** from each other. 
* Data should be normally distributed. Sometimes, if the sample size is large enough, this doesn't matter. 
* Data items have equal chance of getting selected. 

One-Sample Z-Test

In [None]:
from statsmodels.stats import weightstats as stests
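
A one-sample z-test can be sketched by hand, assuming the population standard deviation is known. The data, the hypothesized mean `mu_0` and `sigma` are made up; the `ztest` function in the statsmodels module imported above is a library alternative that estimates the standard deviation from the sample instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)
sample = rng.normal(loc=102, scale=10, size=50)   # made-up data, n > 30

mu_0 = 100        # hypothesized population mean
sigma = 10        # population standard deviation, assumed known

z_stat = (sample.mean() - mu_0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * stats.norm.sf(abs(z_stat))          # two-tailed p-value
print(z_stat, p_value)
```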




Two Sample Z-Test

ANOVA test or F-Test

One-Way F-Test (Anova) -> <br>
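
One-way ANOVA tests whether the means of several independent groups are equal. A minimal sketch with `scipy.stats.f_oneway`, using made-up yields for three groups:

```python
from scipy import stats

# Made-up yields for three fertilizer groups
group_a = [20, 21, 19, 22, 20]
group_b = [24, 25, 23, 26, 24]
group_c = [20, 22, 21, 19, 21]

# H0: all three group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```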


Two-Way F-Test (Anova) -> 

Chi-Sq Test
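
A chi-square test of independence checks whether two categorical variables are associated. The sketch below applies `scipy.stats.chi2_contingency` to a made-up 2x2 contingency table.

```python
from scipy import stats

# Made-up 2x2 contingency table: rows = group, columns = outcome
observed = [[30, 10],
            [20, 40]]

# H0: group and outcome are independent
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2_stat, p_value, dof)
print(expected)   # counts expected under independence
```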

KS Test

KS Test is used to check if given values follow a distribution. <br>


In [None]:
np.random.seed(2021)
v = np.random.normal(size = 100)

res = stats.kstest(v, 'norm')

print(res)

### 23. Confidence Interval

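A confidence interval is a range of values, computed from sample data, that is likely to contain the true population parameter. <br>
The 95% confidence interval is the most common. <br>
Note -> A 95% confidence interval doesn't mean there is a 95% probability that the parameter lies in this particular interval; it means that about 95% of intervals built this way from repeated samples would contain it. 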

**Calculation of confidence interval** <br>


In [None]:
def mean_confidence_interval(data, confidence = 0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n - 1)
    
    return m, m - h, m + h

I. Confidence interval for a sample

II. Confidence Interval with a small sample

III. Confidence Interval with the Normal Distribution / Z-Distribution

IV. Confidence Interval for a proportion

V. Confidence Interval for 2 populations or (proportions)

In a local teaching district a technology grant is available to teachers in order to install a cluster of four computers in their classrooms. From 6250 teachers in the district, 250 were randomly selected and asked if they felt that computers were an essential teaching tool for their classroom. Of those selected, 142 teachers felt that computers were an essential teaching tool. <br>
1. Calculate a 99% confidence interval for the proportion of teachers who felt that computers are an essential teaching tool. <br>
2. How could the survey be changed to narrow the confidence interval but to maintain the 99% confidence interval?
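
One way to work through part 1, sketched with the normal approximation for a proportion. The finite population of 6250 teachers is ignored here as a simplifying assumption.

```python
import math
from scipy import stats

n = 250               # teachers sampled
successes = 142       # teachers who called computers essential
confidence = 0.99

p_hat = successes / n
z = stats.norm.ppf((1 + confidence) / 2)            # critical value, about 2.576
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)     # normal-approximation margin of error

lower, upper = p_hat - margin, p_hat + margin
print(p_hat, lower, upper)
```

For part 2, note that the margin of error shrinks with the square root of n, so surveying more teachers narrows the interval while keeping the 99% confidence level.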

### 24. Confusion Matrix, ROC & Regression Analysis

**Confusion Matrix** <br><br>

We'll learn confusion matrix for both Binary Classifiers as well as Multi Class Classifiers. 

![](https://miro.medium.com/proxy/0*-oGC3SE8sPCPdmxs.jpg)

I. Binary Classifiers 

In [None]:
# Let's assume the following was a confusion matrix obtained for a Binary Dataset

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_true = [0,1,0,1,0,1]
y_pred = [0,0,1,1,0,1]

confusion_matrix(y_true, y_pred)

With the help of confusion matrix, we can obtain values for TN, FP, FN, TP

In [None]:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("True Negatives -> ", tn)
print("False Positives -> ", fp)
print("False Negatives -> ", fn)
print("True Positives -> ", tp)

Let's define a function to output all the metrics needed to gauge model performance. <br>

In [None]:
def calculate_performance(tn, fp, fn, tp):
    
    
    accuracy = (tp + tn)/ (tp + tn + fp + fn)
    
    precision = tp / (tp + fp)
    
    recall = tp / (tp + fn)
    
    f1 = ( 2 * precision * recall ) / (precision + recall)
    
    
    return accuracy, precision, recall, f1

In [None]:
acc, precision, recall, f1 = calculate_performance(tn, fp, fn, tp)

print("Accuracy of the hypothetical model -> ", acc)
print("Precision of the hypothetical model -> ", precision)
print("Recall of the hypothetical model -> ", recall)
print("F1-score of the hypothetical model -> ", f1)

You can also display the output of your confusion matrix in a better visual format

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_true, y_pred),
                               display_labels=[0,1])

disp.plot()

II. Multi-Class Classifiers

In [None]:
# For multi-class classifiers, let's understand how to obtain the metrics given the hypothetical confusion matrix.

In [None]:
y_true = [0,1,2, 0,1,2,0,1,2]
y_pred = [0,2,1,0,1,2,1,0,1]

confusion_matrix(y_true, y_pred)

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_true, y_pred),
                               display_labels=[0,1,2])

disp.plot()

How to read a multi-class confusion matrix?

![](https://miro.medium.com/max/875/1*uQDpo9iISx00ucl3gftLVA.png)

Now we need to calculate TP, FN, FP, TN values. 

In [None]:
def calculate_metrics_multi(cnf_matrix):
    FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix) 
    FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
    TP = np.diag(cnf_matrix)
    TN = cnf_matrix.sum() - (FP + FN + TP)
    
    FP = FP.astype(float)
    FN = FN.astype(float)
    TP = TP.astype(float)
    TN = TN.astype(float)
    
    # Sensitivity, hit rate, recall, or true positive rate
    TPR = TP/(TP+FN)
    # Specificity or true negative rate
    TNR = TN/(TN+FP) 
    # Precision or positive predictive value
    PPV = TP/(TP+FP)
    # Negative predictive value
    NPV = TN/(TN+FN)
    # Fall out or false positive rate
    FPR = FP/(FP+TN)
    # False negative rate
    FNR = FN/(TP+FN)
    # False discovery rate
    FDR = FP/(TP+FP)
    # Overall accuracy for each class
    ACC = (TP+TN)/(TP+FP+FN+TN)
    
    # F1-score per class, from precision (PPV) and recall (TPR)
    f1 = 2 * PPV * TPR / (PPV + TPR)
    
    print("The calculated metrics are as follows -> \n\n")
    print(f"\n1. Accuracy = {ACC} \n2. Recall or Sensitivity = {TPR} \n3. Specificity or True Negative Rate = {TNR} \n4. Precision = {PPV} \n5. Negative Predictive Value = {NPV} \n6. Fall out or False Positive Rate = {FPR} \n7. False Negative Rate = {FNR} \n8. False Discovery Rate = {FDR} \n9. F1-Score = {f1}")

In [None]:
calculate_metrics_multi(confusion_matrix(y_true, y_pred))

**Receiver Operating Characteristic** <br>


**What is it?** <br>
A ROC curve is a plot used to evaluate models that predict the probability of a binary outcome. <br>
It is the plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0 <br>

![](https://developers.google.com/machine-learning/crash-course/images/ROCCurve.svg)

**Area Under the ROC Curve** -> <br>
AUC provides aggregate measure of performance across all possible classification thresholds. 
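
To show where the curve and its area come from, here is a minimal NumPy sketch that sweeps thresholds over made-up scores, records one (FPR, TPR) point per threshold, and integrates with the trapezoidal rule. In practice a library routine such as scikit-learn's `roc_curve` would normally be used.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # made-up labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])    # made-up predicted probabilities

# Sweep thresholds from high to low; each threshold yields one (FPR, TPR) point
thresholds = np.concatenate([[np.inf], np.sort(np.unique(scores))[::-1]])
tpr = [((scores >= t) & (y_true == 1)).sum() / (y_true == 1).sum() for t in thresholds]
fpr = [((scores >= t) & (y_true == 0)).sum() / (y_true == 0).sum() for t in thresholds]

# Area under the curve by the trapezoidal rule
auc = sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2 for k in range(1, len(fpr)))
print(list(zip(fpr, tpr)), auc)
```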

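**AUC is desirable for 2 reasons** -> <br>
* Scale invariant: it measures how well predictions are ranked rather than their absolute values
* Classification-threshold invariant: it measures prediction quality irrespective of the chosen threshold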

**Q. How would multiplying all of the predictions from a given model by 2.0 (for example, if the model predicts 0.4, we multiply by 2.0 to get a prediction of 0.8) change the model's performance as measured by AUC?** <br><br>
Ans. No change. AUC only cares about relative prediction scores.
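
This answer can be checked numerically. The sketch below computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, then repeats the calculation after doubling every prediction. The helper `pairwise_auc` and the data are made up for illustration.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

labels = [0, 0, 1, 1, 0, 1]                    # made-up labels
preds = [0.10, 0.40, 0.35, 0.45, 0.20, 0.30]   # made-up predictions

auc_original = pairwise_auc(labels, preds)
auc_doubled = pairwise_auc(labels, [2.0 * p for p in preds])
print(auc_original, auc_doubled)
```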

**Thresholding in ROC**

The aim of the plot is to analyze the predictive power of the predictor 

While selecting threshold, you should visualize the following graph for threshold selection and

![](https://miro.medium.com/max/644/1*P2qKi7w1UHF7zg6SnCGTag.png)

analyze the performance based on the following ROC curve generated.

![](https://miro.medium.com/max/601/1*QkWHqoSHSBig31InTzr8TA.jpeg)

* If you aim for very low FPR, you might pick up a 

**Poor Classifier** <br><br>
![](https://miro.medium.com/max/764/1*HVvNkWufhzGj2s0sc4CyIw.png)

Understanding the influence of threshold on the ROC Curve -> <br>
There are 2 possible cases for ROC Curve's threshold movement -> <br>
* Shifting the threshold to the right
* Shifting the threshold to the left

In such a scenario, don't rely on the standard TPR vs FPR diagram for the ROC Curve. Instead, refer to the following diagram -> <br>
![](https://lukeoakdenrayner.files.wordpress.com/2018/01/threshold2.png?w=656)

**Case I : Shifting the threshold to the right**<br>
This would cause TP to decrease, FP to decrease, FN to increase and TN to increase. <br>
By doing so, TPR or Sensitivity (which is TP / Sum(+ves)) decreases. FPR (which is FP / Sum(-ves)) decreases. If FPR decreases, specificity increases as FPR = 1 - specificity.

**Case II : Shifting the threshold to the left**<br>
This would cause TP to increase, FP to increase, FN to decrease and TN to decrease. <br>
By doing so, TPR or Sensitivity (which is TP / Sum(+ves)) increases. FPR (which is FP / Sum(-ves)) increases. If FPR increases, specificity decreases as FPR = 1 - specificity.
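
Both cases can be reproduced on a toy example. The sketch assumes that shifting the threshold to the right means raising it, and that scores at or above the threshold are predicted positive. The labels and scores are made up.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])    # made-up labels
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

def counts_at(threshold):
    """Return (TP, FP, FN, TN) when scores >= threshold are predicted positive."""
    pred = scores >= threshold
    tp = int((pred & (y_true == 1)).sum())
    fp = int((pred & (y_true == 0)).sum())
    fn = int((~pred & (y_true == 1)).sum())
    tn = int((~pred & (y_true == 0)).sum())
    return tp, fp, fn, tn

for threshold in (0.3, 0.5, 0.7):
    tp, fp, fn, tn = counts_at(threshold)
    print(threshold, "TPR =", tp / (tp + fn), "FPR =", fp / (fp + tn))
```

Reading the output from the lowest to the highest threshold shows TPR and FPR falling together; reading it in reverse shows them rising together.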

Concluding Remarks about ROC Curves & AUC -> <br>
* can be used as a summary of the model skill
* ROC Curves of different models can be compared directly in general or for different thresholds.
* Shape of ROC curves contains info about the predictive power of the model.
* for imbalanced class distribution, ROC curves are very helpful. Helps to visualize the trade-off between TPR and FPR and thus help us to arrive at a threshold that minimizes the mis-classification cost.

![](https://miro.medium.com/max/603/1*D05sMUrwZIgvwsQVF_CJdg.jpeg)

**Regression Analysis**

Output of a regression analysis is usually a summary statistic that includes: <br>
* R 
* R squared
* adjusted R-squared
* standard error of the estimate

**Use regression analysis to ->** <br><br>
* Model multiple independent variables
* use polynomial terms to model the curvature. 
* include continuous and categorical variables
* assess interaction terms to determine the effect of one independent variable on the value of another variable.

### 25. Summary Statistics

#### I. Central Tendency Statistics

a. Arithmetic Mean

Given a set of numbers -> [n1,n2,n3,n4,n5]<br><br>

Average or **Arithmetic Mean** -> Sum of dataset / num of data items <br>
-> (n1 + n2 + n3 + n4 + n5) / 5 


In [None]:
data1 = np.arange(1, 10, 2)
data1

In [None]:
print("Mean of the above data set is => ", np.mean(data1))

b. Weighted Mean

For calculating weighted mean, <br>
you would require two lists. <br>
* The data items list
* The corresponding weights list

In [None]:
data1 = np.arange(1, 21, 2)
data1

In [None]:
weights = np.random.random(len(data1))
weights

In [None]:
def weighted_mean(data_lst, weights_lst):
    return np.average(data_lst, weights = weights_lst)

In [None]:
print("Weighted Mean is -> ", weighted_mean(data1, weights))

c. Median

Median for **odd** number of elements is the middle most element. <br>
Median for **even** number of elements is the average of the middle two elements. <br>

case 1 : even number of elements

In [None]:
data1 = np.arange(1, 21, 2)
data1

In [None]:
np.median(data1)

case 2: odd number of elements

In [None]:
data1 = np.arange(1, 10, 2)
data1

In [None]:
print("Median of the above dataset is -> ", np.median(data1))

d. Percentile

The p-th percentile is the value below which p% of the observations in a dataset fall. For example, the 25th percentile is the value that 25% of the data points lie at or below.

We'll be generating the following percentile values for the dataset -> <br>
* 50th Percentile / Median
* 25th Percentile
* 75th Percentile

case 1. 1D dataset

In [None]:
data1 = random.sample(range(1000), 20)
data1

In [None]:
print('25th Percentile -> ', np.percentile(data1, 25))

In [None]:
print('50th Percentile -> ', np.percentile(data1, 50))

In [None]:
print('75th Percentile -> ', np.percentile(data1, 75))

case 2. 2D dataset

In [None]:
random.seed(2021)

data = [random.sample(range(100), 4), 
       random.sample(range(100), 4),
       random.sample(range(100), 4)]
data

In [None]:
print("25th percentile value for axis = None -> ", np.percentile(data, 25,))
print("25th percentile value for axis = 0 -> ", np.percentile(data, 25,axis = 0))
print("25th percentile value for axis = 1 -> ", np.percentile(data, 25,axis = 1))

In [None]:
print("50th percentile value for axis = None -> ", np.percentile(data, 50,))
print("50th percentile value for axis = 0 -> ", np.percentile(data, 50,axis = 0))
print("50th percentile value for axis = 1 -> ", np.percentile(data, 50,axis = 1))

In [None]:
print("75th percentile value for axis = None -> ", np.percentile(data, 75,))
print("75th percentile value for axis = 0 -> ", np.percentile(data, 75,axis = 0))
print("75th percentile value for axis = 1 -> ", np.percentile(data, 75,axis = 1))

#### II. Dispersion

a. Skewness

* skewness = 0, symmetric distribution (e.g. the normal distribution)
* skewness > 0, more weight in the right tail of the distribution
* skewness < 0, more weight in the left tail of the distribution

![](https://media.geeksforgeeks.org/wp-content/uploads/skewness.jpg)

In [None]:
from scipy.stats import skew
import pylab
x1 = np.linspace(-10, 10, 1000)
y1 = 1./ (np.sqrt(2. * np.pi)) * np.exp(-.5 * (x1) ** 2)

pylab.plot(x1, y1)

print("Skewness of the data --> ", skew(y1))

In [None]:
x1 = np.linspace(-5, 10, 1000)
y1 = 1./ (np.sqrt(2. * np.pi)) * np.exp(-.5 * (x1) ** 2)

pylab.plot(x1, y1)

print("Skewness of the data -> ", skew(y1))

In [None]:
x = np.random.normal(0, 2, 10000)


print("X: \n", x)

print("\nSkewness for data : ", skew(x))

b. Kurtosis

It measures how heavy the tails of a distribution are compared to a normal distribution.

![](https://media.geeksforgeeks.org/wp-content/uploads/kurtosis.jpg)

In [None]:
x = np.linspace(-10, 10, 1000)
y1 = 1./(np.sqrt(2.*np.pi)) * np.exp( -.5*(x)**2  )


pylab.plot(x,y1, '*')

print("Kurtosis for normal distribution : ", stats.kurtosis(y1))

print("Kurtosis for normal distribution : ", stats.kurtosis(y1, fisher = False))

print("Kurtosis for normal distribution : ", stats.kurtosis(y1, fisher = True))

c. Range

Range is the span of values in the entire dataset. <br>
Range is denoted by [min_value, max_value]

In [None]:
random.seed(2021)
data1 = random.sample(range(-1000, 1000), 100)
#data1

In [None]:
min_value = min(data1)

max_value = max(data1)

print(f"Range of the dataset -> [{min_value},{max_value}]")

d. Interquartile Range

The interquartile range is also called the midspread, the middle 50%, or technically the H-spread. <br>
It is Q3 - Q1, where Q3 is the third quartile and Q1 is the first quartile. <br>
It covers the center of the distribution and contains 50% of the observations. <br><br>

**IQR** = **Q3 - Q1**

##### Applications -> <br>
* helps in easy identification of outlier values
* describes the spread of the middle 50% of the data
* the higher the IQR, the higher the variability
* the lower the IQR, the more tightly the central values are clustered
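
The outlier application can be sketched with the common 1.5 x IQR rule of thumb, which is an assumption here rather than something the notebook prescribes. The data are made up.

```python
import numpy as np

data = np.array([12, 15, 14, 10, 18, 16, 13, 17, 11, 95])   # made-up data with one extreme value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Common rule of thumb: flag points beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q3, iqr, outliers)
```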

In [None]:



random.seed(2021)

data = random.sample(range(1000), 30)

print(f"Dataset -> {data}\n\n")
IQR = stats.iqr(data, interpolation = 'midpoint')

print(IQR)

e. Variance

Variance is the average of the squared differences of each value from the mean. <br>
It measures the spread of the data in a set around its mean value. <br>
* Low value for variance indicates the data are clustered together.
* High value for variance indicates the data are spread widely.

![](https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-7b0fdc0b3c4d7ef2aeeba85f690456c2_l3.svg)

In [None]:
print(f"Variance of the sample set --> {statistics.variance(data)}")

f. Standard Score or Z-Score

**z-score** tells us how many standard deviations away a value is from the mean.

z = (X – μ) / σ


In [None]:
import scipy.stats as stats

In [None]:
stats.zscore(data)

Here each z-score tells us how many std. deviations each value is away from mean.

g. Coefficient of Variation

It is the ratio of standard deviation to mean. 

In [None]:
np.random.seed(2021)
data1 = np.random.randn(5,5)

print("\nVariation at axis = 0: \n", stats.variation(data1, axis = 0))
print("\nVariation at axis = 1: \n", stats.variation(data1, axis = 1))

### 26. Familiarizing with different error metrics

SST 

Sum of Squares Total - the sum of squared differences between the observed dependent variable and its mean.<br>
SST is also denoted as TSS or total sum of squares.

SSR

Sum of Squares due to Regression - the sum of squared differences between the predicted values and the mean of the dependent variable.

If SSR = SST, our regression model captures all the observed variability and is perfect. <br>
SSR is also known as ESS -> Explained sum of squares.

SSE

Sum of Squares Error - the sum of squared errors, where each error is the difference between the observed value and the predicted value. <br>
We want to minimize the error. The smaller the error, the better the estimation power of the regression. <br>


#### Relation between SST, SSR, SSE

SST = SSR + SSE <br>
The total variability = Explained Variability + Unexplained Variability

**Mean Squared Error** -> <br>
1/n * SSE <br><br>
**Root Mean Squared Error** -> <br>
sqrt(MSE) <br><br>
**R Squared** -> <br>
1 - SSE / SST <br><br>
**Adjusted R-squared** -> <br>
1 - ((n - 1) / (n - k - 1)) * (1 - R**2), where n is the number of observations and k is the number of independent variables 
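
These quantities can be computed on a toy regression. The sketch fits a straight line with `np.polyfit` to made-up data and takes adjusted R-squared as 1 - ((n - 1) / (n - k - 1)) * (1 - R**2), with k = 1 independent variable.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)             # made-up predictor
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.8])       # made-up response

slope, intercept = np.polyfit(x, y, 1)      # ordinary least-squares line
y_pred = slope * x + intercept

n, k = len(y), 1                            # k = number of independent variables
sst = ((y - y.mean()) ** 2).sum()           # total variability
ssr = ((y_pred - y.mean()) ** 2).sum()      # explained variability
sse = ((y - y_pred) ** 2).sum()             # unexplained variability

mse = sse / n
rmse = np.sqrt(mse)
r2 = 1 - sse / sst
adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)
print(sst, ssr + sse, mse, rmse, r2, adj_r2)
```

The first two printed values agree, which illustrates SST = SSR + SSE for a least-squares fit with an intercept.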

**MSR ->** <br><br>

Mean Square due to regression, <br>
SSR / 

### 27. Similarity and Dissimilarity Index

### 28. Logistic Regression

**Assumptions of Logistic Regression ->** <br>
* Binary Logistic Regression requires the dependent variable to be binary
* The independent variables should be independent of each other. That is, the model should have little or no multicollinearity. 
* Sample sizes should preferably be large. 
* 

#### I. Python Implementation

In [None]:
df = pd.read_csv("../input/logistic-regression-stats-dataset/logit_train1.csv", index_col = 0)

Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]


log_reg = sm.Logit(ytrain, Xtrain).fit()


print(log_reg.summary())

#### II. Calculating G-Statistic

G - statistic in logistic regression is -> <br>
-2ln[ likelihood without variable/ likelihood with variable]
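
A minimal sketch of the calculation. The two log-likelihood values are hypothetical placeholders; in practice they would come from two fitted models, one with and one without the variable. The chi-square reference distribution, with degrees of freedom equal to the number of added variables, is the usual large-sample approximation.

```python
from scipy import stats

# Hypothetical log-likelihoods; in practice they come from two fitted models
loglik_without = -45.2     # model without the variable(s)
loglik_with = -38.7        # model with the variable(s)
extra_params = 1           # number of variables added

# -2 ln(L_without / L_with) = -2 * (lnL_without - lnL_with)
g_stat = -2 * (loglik_without - loglik_with)

# Under the null hypothesis G is approximately chi-square distributed
p_value = stats.chi2.sf(g_stat, df=extra_params)
print(g_stat, p_value)
```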

Applications of G-Statistic -> <br>
* Helps to verify the overall significance of the model
* 

Wald Test -> <br>
Measures an individual independent variable's significance

### Notebook in Making.  <br>
Est. Date of Completion - 28-03-2021