__Statistics__ is the branch of mathematics dealing with the collection, analysis, interpretation,
presentation, and organization of numerical data.

Statistics is mainly classified into two branches:
1. __Descriptive statistics__: These are used to summarize data. The mean and
standard deviation are typical summaries for continuous data (such as age), whereas frequency
and percentage are useful for categorical data (such as gender).


2. __Inferential statistics__: Collecting the entire data (known as the
population in statistical methodology) is often impossible, so a subset of the data
points, called a sample, is collected and conclusions about the entire
population are drawn from it. This process is known as inferential statistics. Inferences are
drawn using hypothesis testing, the estimation of numerical characteristics, the
correlation of relationships within data, and so on.

__Machine learning__ is the branch of computer science that uses past experience (data) to learn
patterns and applies that knowledge to make future decisions. Machine learning is at the
intersection of computer science, engineering, and statistics. The goal of machine learning is
to generalize a detectable pattern or to create an unknown rule from given examples. It is broadly classified into three categories:

1. __Supervised learning__: This is teaching machines to learn the relationship between
the input variables and a target variable. The major segments within
supervised learning are as follows:
    1. Classification problem
    2. Regression problem
    
    
2. __Unsupervised learning__: In unsupervised learning, algorithms learn by
themselves without any supervision or without any target variable provided. It is
a question of finding hidden patterns and relations in the given data. The
categories in unsupervised learning are as follows:
    1. Dimensionality reduction
    2. Clustering
    
    
3. __Reinforcement learning__: This allows the machine or agent to learn its behavior
based on feedback from the environment. In reinforcement learning, the agent
takes a series of decisive actions without supervision and, in the end, a reward
is given (for example, +1 or -1). Based on the final payoff/reward, the agent
reevaluates its paths. Reinforcement learning problems are closer to the artificial
intelligence methodology than to the frequently used machine learning
algorithms.

Differences between statistics and ML:
1. In statistics, relationships are expressed as mathematical equations, whereas in ML they are expressed as rule-based programs.
2. A statistical model predicts the output with an accuracy of 85 percent and with 90 percent confidence about it, whereas a machine learning model just predicts the output with an accuracy of 85 percent.
3. In statistics, data is typically split 70 percent - 30 percent into training and testing data; the model is developed on the training data and tested on the testing data. In ML, data is typically split 50 percent - 25 percent - 25 percent into training, validation, and testing data; models are developed on the training data, hyperparameters are tuned on the validation data, and the final model is evaluated against the test data.
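The 50 - 25 - 25 split described above can be sketched with the standard library alone. This is only an illustration: the helper name `three_way_split`, its default fractions, and the seed are assumptions made for this example, not part of any library.

```python
import random

def three_way_split(rows, train_frac=0.50, valid_frac=0.25, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains is the test set
    return train, valid, test

train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

Setting `valid_frac=0` and `train_frac=0.7` gives the 70 - 30 split used in the statistical workflow.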

Steps in building ML Model:
1. Collection of Data
2. Data preparation and outlier treatment
3. Data Analysis and Feature Engineering
4. Train algorithm on training and validation data
5. Test algorithm on test data
6. Deploy algorithm

__Statistics Fundamentals__
1. __Population__: This is the totality, the complete list of observations, or all the data
points about the subject under study. 


2. __Sample__: A sample is a subset of a population, usually a small portion of the
population that is being analyzed.

To draw inferences from a sample by validating a hypothesis, it is necessary that the sample is random.

3. __Parameter versus Statistic__: Any measure that is calculated on the population is a
parameter, whereas on a sample it is called a statistic.


4. __Mean__: This is the arithmetic average. The mean is sensitive to outliers in the data. An outlier is a value in a set or column that deviates strongly from the other values in the same data; it is usually very high or very low.


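5. __Median__: This is the midpoint of the data, calculated after arranging the values in ascending or descending order. If there are N observations, the median is the ((N+1)/2)th value when N is odd and the average of the (N/2)th and (N/2+1)th values when N is even.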


6. __Mode__: This is the most frequently occurring data point in the data.

<img src="images/mean_median_mode.png">

In [1]:
import numpy as np
from scipy import stats

In [2]:
data = np.array([4, 5, 1, 6, 8, 1, 3, 6, 7])

In [3]:
mean = np.mean(data)
mean

4.555555555555555

In [4]:
median = np.median(data)
median

5.0

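In [5]:
# stats.mode returns (mode, count). np.atleast_1d handles both the array
# result of older SciPy versions and the scalar result of newer ones.
mode = stats.mode(data)
np.atleast_1d(mode[0])[0]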

1

7. __Measure of Variation__: Dispersion is the variation in the data; it measures how spread out the values of a variable are.


8. __Range__: This is the difference between the maximum and minimum values.


9. __Variance__: This is the mean of the squared deviations from the mean. The dimension of variance is the square of the actual values. The denominator N-1 is used for a sample, instead of N for the population, because of the degrees of freedom: one degree of freedom is lost in a sample because the sample mean, which is itself estimated from the same data, is used in place of the population mean when calculating the variance.


10. __Standard Deviation__: This is the square root of the variance. By applying the square root to the variance, we measure the dispersion in the units of the original variable rather than in squared units.


11. __Quantiles__: These are cut points that divide the data into equal-sized fragments. Quantiles cover percentiles, deciles, quartiles, and so on. These measures are calculated after arranging the data in ascending order.
    1. __Percentile__: The p-th percentile is the value below which p percent of the data points fall. The median is the 50th percentile, as the number of data points below the median is about 50 percent of the data.
    2. __Decile__: The first decile is the 10th percentile, which means the number of data points below it is 10 percent of the whole data.
    3. __Quartile__: The first quartile is the 25th percentile. 25 percent of the data lies below the first quartile, 50 percent below the second quartile, and 75 percent below the third quartile. The second quartile is also known as the median, the 50th percentile, or the 5th decile.
    4. __Interquartile Range__: This is the difference between the third quartile and the first quartile. It is effective in identifying outliers in data. The interquartile range describes the middle 50 percent of the data points.
    
    <img src="images/quantile.png">
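To illustrate how the interquartile range can flag outliers, here is a minimal sketch. The 1.5 * IQR fences are a common convention rather than something stated above, and the helper name `iqr_outliers` and the extra score of 150 are assumptions made for this example.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # method='inclusive' interpolates between the sorted data points
    q1, _median, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

game_points = [35, 46, 72, 38, 81, 41, 57, 93, 17, 33, 61, 75]
print(iqr_outliers(game_points))          # no point is outside the fences
print(iqr_outliers(game_points + [150]))  # the added extreme score is flagged
```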

In [6]:
from statistics import variance, stdev
game_points = np.array([35, 46, 72, 38, 81, 41, 57, 93, 17, 33, 61, 75])

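In [7]:
# Sample variance (N-1 denominator). The statistics module can truncate the
# result for NumPy integer inputs, so convert to a plain Python list first.
variance_data = variance(game_points.tolist())
round(variance_data, 2)

521.17

In [8]:
# Sample standard deviation
standard_dev = stdev(game_points.tolist())
round(standard_dev, 2)

22.83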

In [9]:
# Range
range_data = np.max(game_points, axis=0) - np.min(game_points, axis=0)
range_data

76

In [10]:
# Quantile
for val in [10, 20, 30, 40, 50, 60, 70, 80, 90]:
    quant = np.percentile(game_points, val)
    print(val, '% :', quant)

10 % : 33.199999999999996
20 % : 35.6
30 % : 38.900000000000006
40 % : 43.0
50 % : 51.5
60 % : 59.4
70 % : 68.69999999999999
80 % : 74.4
90 % : 80.4


12. __Hypothesis Testing__: This is the process of making inferences about the overall population by conducting statistical tests on a sample. The null and alternate hypotheses are used to check whether an assumption is statistically supported by the data.

A null hypothesis proposes that no significant difference exists in a set of given observations.

13. __P-Value__: This is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (usually in modeling, against each independent variable, a p-value less than 0.05 is considered significant and greater than 0.05 is considered insignificant; nonetheless, these values and definitions may change with respect to context).

A p-value less than 0.05 means that the claimed value and the sample mean are significantly different, hence we can reject the null hypothesis.

__Steps involved in Hypothesis Testing__
1. Assume a null hypothesis (usually no difference, no significance, and so on; a null hypothesis assumes that there is no anomalous pattern and that the data is homogeneous).
2. Collect the sample.
3. Calculate the test statistic from the sample in order to verify whether the result is statistically significant or not.
4. Decide either to reject or fail to reject the null hypothesis based on the test statistic.

__Test Statistic and Critical Value__
In hypothesis testing, a critical value is a point on the test distribution that is compared to the test statistic to determine whether to reject the null hypothesis. In a two-sided test, if the absolute value of the test statistic is greater than the critical value, we declare statistical significance and reject the null hypothesis. Critical values correspond to alpha, so their values become fixed when we choose the test's alpha.

The __critical values__ are the boundaries of the critical region. If the test is one-sided (like a χ2 test or a one-sided t-test), there will be just one critical value, but in other cases (like a two-sided t-test) there will be two.

A __critical value__ is a point (or points) on the scale of the test statistic beyond which we reject the null hypothesis; it is derived from the level of significance α of the test. When comparing two sample means, the further the test statistic lies beyond the critical value, the less plausible it is that the two samples come from the same distribution. The commonly used critical value for a two-tailed test at α = 0.05 is 1.96, which is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean.
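The 1.96 figure can be checked with a short sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The variable names are chosen for this example only.

```python
from statistics import NormalDist

alpha = 0.05
# Two-tailed test: split alpha equally between both tails
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
# Area of the standard normal distribution within +/- z_critical
area_within = NormalDist().cdf(z_critical) - NormalDist().cdf(-z_critical)
print(round(z_critical, 2), round(area_within, 2))
```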

In [11]:
from scipy import stats
xbar = 990
mu0 = 1000
s = 12.5
n = 30
# Test Statistic
t_smple = (xbar-mu0)/(s/np.sqrt(float(n)))
t_smple

-4.381780460041329

In [12]:
# Critical value for a one-tailed (lower-tail) test at alpha = 0.05
alpha = 0.05
t_alpha = stats.t.ppf(alpha, n-1)
t_alpha

-1.6991270265334977

In [13]:
# One-tailed p-value
p_val = stats.t.sf(np.abs(t_smple), n-1)
p_val

7.035025729010886e-05
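The cells above compute the test statistic, the critical value, and the p-value, but stop short of the decision step. A minimal sketch of that step follows; the critical value and p-value are copied from the outputs above rather than recomputed, and the variable names are chosen for this example.

```python
import math

xbar, mu0, s, n = 990, 1000, 12.5, 30   # values from the cells above
t_sample = (xbar - mu0) / (s / math.sqrt(n))

# Values copied from the outputs of In [12] and In [13]
t_critical = -1.6991270265334977   # lower-tail critical value at alpha = 0.05
p_value = 7.035025729010886e-05
alpha = 0.05

# Lower-tail test: reject when the statistic falls below the critical value
reject_by_critical_value = t_sample < t_critical
reject_by_p_value = p_value < alpha
print(reject_by_critical_value, reject_by_p_value)
```

Both criteria agree, so the null hypothesis is rejected at the 5 percent level.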

14. __Type I and Type II Error__: Hypothesis testing is usually done on samples rather
than the entire population, due to the practical constraints of available resources
to collect all the available data. However, performing inferences about the
population from samples comes with its own costs, such as rejecting good results
or accepting false results. Increasing the sample size generally helps to reduce
these errors. The two kinds of error are:
    1. __Type I error__: Rejecting a null hypothesis when it is true
    2. __Type II error__: Failing to reject a null hypothesis when it is false
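A small simulation can make the type I error concrete: when the null hypothesis really is true, a test at α = 0.05 should still reject it in roughly 5 percent of samples. The sketch below assumes a two-sided z-test with known standard deviation; the function name, trial count, and seed are arbitrary choices for this example.

```python
import random
from statistics import NormalDist, mean

def type_one_error_rate(trials=2000, n=30, alpha=0.05, seed=0):
    """Estimate how often a two-sided z-test rejects a null hypothesis that is true."""
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # The null hypothesis (mean 0, standard deviation 1) really is true here
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) / (1 / n ** 0.5)
        if abs(z) > z_critical:
            rejections += 1   # a type I error
    return rejections / trials

rate = type_one_error_rate()
print(rate)
```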
    
    
15. __Normal Distribution__: This is very important in statistics because of the central limit theorem, which states that the distribution of the means of all possible samples of size n drawn from a population with mean μ and variance σ2 approaches a normal distribution (with mean μ and variance σ2/n) as n increases.

In [14]:
# Z-Score
xbar = 67
mu0 = 52
s = 16.3
z = (xbar-mu0)/s
z

0.920245398773006

In [15]:
# Probability Under Curve 
p_val = 1 - stats.norm.cdf(z)
p_val*100

17.872226751475175
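The central limit theorem mentioned under the normal distribution can be illustrated with a simulation. This sketch draws samples from a uniform distribution, which is clearly not normal, and compares the spread of the sample means with σ/√n. The sample size, number of samples, and seed are arbitrary choices for this example.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)
n = 30             # size of each sample
num_samples = 2000

# Uniform(0, 1) has mean 0.5 and variance 1/12
sample_means = [mean([rng.random() for _ in range(n)]) for _ in range(num_samples)]

expected_sd = (1 / 12 / n) ** 0.5   # sigma / sqrt(n)
print(round(mean(sample_means), 3), round(stdev(sample_means), 4), round(expected_sd, 4))
```

The mean of the sample means should sit close to 0.5 and their standard deviation close to the expected value.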

16. __Chi-square__: The chi-square test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. Given two categorical random variables X and Y, the chi-square test of independence determines whether or not there exists a statistical dependence between them.

The test is usually performed by calculating χ2 from the data and comparing it with
the χ2 value from the table with (m-1) * (n-1) degrees of freedom, where m and n are
the numbers of rows and columns of the contingency table. If the calculated value is
higher than the table value, the hypothesis that the two variables are independent is rejected.

<img src="images/chi-square.png">

The chi2_contingency function in the stats package takes the observed table, calculates the expected table, and then calculates the p-value in order to check whether the two variables are dependent or not. If p-value < 0.05, there is evidence of a dependency between the two variables, whereas if p-value > 0.05, there is no significant evidence of a dependency between the variables.

In [16]:
import pandas as pd 
from scipy import stats

survey = pd.read_csv('Data Files/survey.csv')
survey.head()

Unnamed: 0,Sex,Wr.Hnd,NW.Hnd,W.Hnd,Fold,Pulse,Clap,Exer,Smoke,Height,M.I,Age
0,Female,18.5,18.0,Right,R on L,92.0,Left,Some,Never,173.0,Metric,18.25
1,Male,19.5,20.5,Left,R on L,104.0,Left,,Regul,177.8,Imperial,17.583
2,Male,18.0,13.3,Right,L on R,87.0,Neither,,Occas,,,16.917
3,Male,18.8,18.9,Right,R on L,,Neither,,Never,160.0,Metric,20.333
4,Male,20.0,20.0,Right,Neither,35.0,Right,Some,Never,165.0,Metric,23.667


In [17]:
survey_tab = pd.crosstab(survey['Smoke'], survey['Exer'], margins=True)
survey_tab

Smoke \ Exer,Freq,None,Some,All
Heavy,7,1,3,11
Never,87,18,84,189
Occas,12,3,4,19
Regul,9,1,7,17
All,115,23,98,236


In [18]:
observed = survey_tab.iloc[0:4, 0:3]
observed

Smoke \ Exer,Freq,None,Some
Heavy,7,1,3
Never,87,18,84
Occas,12,3,4
Regul,9,1,7


In [19]:
contg = stats.chi2_contingency(observed=observed)
p_value = round(contg[1], 3)
p_value

0.483

The p-value is 0.483, which is greater than 0.05, so there is no significant evidence of a dependency between smoking habit and exercise behavior.

stats.chi2_contingency returns the following:
1. Test statistic
2. P-value
3. Degrees of freedom
4. Expected frequencies
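To see what the function does with the observed table, here is a sketch that computes the expected counts and the χ2 statistic by hand for the smoking versus exercise table above. It only reproduces the test statistic and degrees of freedom, not the p-value, and the variable names are chosen for this example.

```python
observed = {
    'Heavy': [7, 1, 3],
    'Never': [87, 18, 84],
    'Occas': [12, 3, 4],
    'Regul': [9, 1, 7],
}

row_totals = {k: sum(v) for k, v in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total
expected = {k: [row_totals[k] * c / grand_total for c in col_totals] for k in observed}

chi2_stat = sum((o - e) ** 2 / e
                for k in observed
                for o, e in zip(observed[k], expected[k]))
dof = (len(observed) - 1) * (len(col_totals) - 1)
print(round(chi2_stat, 2), dof)
```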

17. __ANOVA__: Analysis of variance tests the hypothesis that the means of two or more populations are equal. ANOVAs assess the importance of one or more factors by comparing the response variable means at the different factor levels. The null hypothesis states that all population means are equal, while the alternative hypothesis states that at least one is different.

In [20]:
import pandas as pd
from scipy import stats
data = pd.read_csv('Data Files/fetilizers.csv')
data.head()

Unnamed: 0,fertilizer1,fertilizer2,fertilizer3
0,62,54,48
1,62,56,62
2,90,58,92
3,42,36,96
4,84,72,92


In [21]:
# One-Way ANOVA
one_way_anova = stats.f_oneway(data['fertilizer1'], data['fertilizer2'], data['fertilizer3'])
print ("Statistic :", round(one_way_anova[0],2),", p-value:",round(one_way_anova[1],3))

Statistic : 3.66 , p-value: 0.051


The p-value is 0.051, which is slightly greater than 0.05, hence we fail to reject the null hypothesis that the mean crop yields of the fertilizers are equal.

18. __Confusion Matrix__: This is the matrix of the actual versus the predicted classes. The following terms and metrics are derived from it:
    1. __True positives (TPs)__: Cases where we predict the outcome (class) as yes and it actually is yes.
    2. __True negatives (TNs)__: Cases where we predict the outcome (class) as no and it actually is no.
    3. __False positives (FPs)__: Cases where we predict the outcome as yes when it is actually no. FPs are also considered to be type I errors.
    4. __False negatives (FNs)__: Cases where we predict the outcome as no when it is actually yes. FNs are also considered to be type II errors.
    5. __Precision (P)__: When yes is predicted, how often is it correct? (TP/(TP+FP))
    6. __Recall (R)/sensitivity/true positive rate__: Among the actual yeses, what fraction was predicted as yes? (TP/(TP+FN))
    
    
19. __F1 Score (F1)__: This is the harmonic mean of the precision and recall: F1 = 2 * P * R / (P + R). Multiplying by the constant 2 scales the score to 1 when both precision and recall are 1.


20. __Specificity__: Among the actual nos, what fraction was predicted as no? Also equivalent to 1 - false positive rate: (TN/(TN+FP))
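As a quick check of these formulas, the sketch below applies them to the counts from the test confusion matrix of the decision tree example later in this notebook. It assumes that rows are actual classes, columns are predicted classes, and class 1 is the "yes" class.

```python
# Counts read from the confusion matrix [[814, 19], [16, 135]]
# (assumed layout: rows = actual, columns = predicted, class 1 = "yes")
tn, fp, fn, tp = 814, 19, 16, 135

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2),
      round(specificity, 2), round(accuracy, 2))
```

The precision, recall, and F1 values agree with the row for class 1 in the classification report of that example.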


21. __Area Under Curve (ROC)__: The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR), also known as a sensitivity versus 1 - specificity graph.

<img src="images/roc.png">


    The ROC curve is used for setting the cut-off threshold on the predicted
    probability to classify observations into classes, and the area under the curve
    summarizes how well the model separates the classes;
    we will be covering how this method works in upcoming chapters.
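A minimal sketch of how ROC points and the area under the curve can be computed from predicted scores is shown below. The labels and scores are made-up illustrative data, the function names are assumptions for this example, and the sketch assumes there are no tied scores.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit observations from the highest score to the lowest
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [0, 0, 1, 1]               # illustrative data, not from the text
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
print(points, auc(points))
```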
    

22. __Observation and performance window__: In statistical modeling, the model tries to predict the event in advance rather than at the moment, so that some buffer time exists to work on corrective actions. For example, a credit card company might ask: what is the probability that a particular customer will default in the coming 12-month period? Knowing this, the company can contact the customer with offers or develop its collection strategies accordingly.
    In order to answer this question, a probability of default model (or behavioral scorecard in technical terms) needs to be developed by using independent variables from the past 24 months and a dependent variable from the next 12 months. After preparing data with X and Y variables, it is split randomly into 70 percent - 30 percent as train and test data; this method is called in-time validation, as both train and test samples are from the same time period.
    

23. __In-time and out-of-time validation__: In-time validation implies obtaining both the training and testing datasets from the same period of time, whereas out-of-time validation implies training and testing datasets drawn from different time periods. Usually, the model performs worse in out-of-time validation than in in-time validation, because the characteristics of the train and test datasets might differ.


24. __R-squared (coefficient of determination)__: This is the measure of the percentage of the response variable variation that is explained by a model. It is also a measure of how well the model minimizes error compared with just utilizing the mean as an estimate. In some extreme cases, R-squared can also have a value less than zero, which means the predicted values from the model perform worse than just taking the simple mean as a prediction for all the observations. We will study this parameter in detail in upcoming chapters.

<img src="images/r_square.png">

SST - Sum of Squares of Total 
SSE - Sum of Squares of Error

The difference between SST and SSE is the improvement in prediction from the regression model, compared to the mean model. Dividing that difference by SST gives R-squared. It is the proportional improvement in prediction from the regression model, compared to the mean model. It indicates the goodness of fit of the model.
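The calculation described above can be written out directly. The sketch below uses small made-up numbers and a helper name, `r_squared`, chosen for this example; the third call shows the case where the predictions are worse than the mean model.

```python
def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    sst = sum((y - mean_actual) ** 2 for y in actual)           # total sum of squares
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # error sum of squares
    return (sst - sse) / sst

actual = [3, 5, 7, 9]                           # illustrative values
print(r_squared(actual, [2.5, 5.5, 6.5, 9.5]))  # close fit
print(r_squared(actual, [6, 6, 6, 6]))          # mean model
print(r_squared(actual, [9, 7, 5, 3]))          # worse than the mean model
```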

R-squared has the useful property that its scale is intuitive: for a regression model fitted by least squares it ranges from zero to one, with zero indicating that the proposed model does not improve prediction over the mean model, and one indicating perfect prediction. Improvement in the regression model results in proportional increases in R-squared.

One pitfall of R-squared is that it can only increase as predictors are added to the regression model. This increase is artificial when predictors are not actually improving the model’s fit. To remedy this, a related statistic, Adjusted R-squared, incorporates the model’s degrees of freedom. Adjusted R-squared will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom. Likewise, it will increase as predictors are added if the increase in model fit is worthwhile. Adjusted R-squared should always be used with models with more than one predictor variable. It is interpreted as the proportion of total variance that is explained by the model.

There are situations in which a high R-squared is not necessary or relevant. When the interest is in the relationship between variables, not in prediction, the R-square is less important. An example is a study on how religiosity affects health outcomes. A good result is a reliable relationship between religiosity and health. No one would expect that religion explains a high percentage of the variation in health, as health is affected by many other factors. Even if the model accounts for other variables known to affect health, such as income and age, an R-squared in the range of 0.10 to 0.15 is reasonable.


25. __Adjusted R-square__: The explanation of the adjusted R-squared statistic is almost the same as R-squared, but it penalizes the R-squared value if extra variables without a strong correlation are included in the model.

<img src="images/adjusted_r_square.png">

    Here, R2 = sample R-squared value, n = sample size, k = number of predictors (or) variables. The adjusted R-squared value is a key metric in evaluating the quality of linear regressions. As a rule of thumb, a linear regression model with adjusted R2 >= 0.7 is often considered good enough to implement.
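The penalty can be seen with a small numeric sketch. It assumes the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), which uses the same symbols as above; the input numbers are made up for illustration.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for sample size n and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers: five extra predictors raise R-squared only slightly
base = adjusted_r_squared(0.800, n=50, k=5)
extended = adjusted_r_squared(0.801, n=50, k=10)
print(round(base, 3), round(extended, 3))
```

Although R-squared rises from 0.800 to 0.801, the adjusted value falls, because the small gain in fit does not make up for the five extra predictors.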


26. __The F-test__: It evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. An equivalent null hypothesis is that R-squared equals zero. A significant F-test indicates that the observed R-squared is reliable and is not a spurious result of oddities in the data set. Thus the F-test determines whether the proposed relationship between the response variable and the set of predictors is statistically reliable and can be useful when the research objective is either prediction or explanation.


27. __RMSE (Root Mean Square Error)__: It is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data, that is, how close the observed data points are to the model’s predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable. Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.
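A minimal sketch of the calculation follows. It takes the mean of the squared residuals without a degrees-of-freedom correction, which is one common convention; the data is made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared residual."""
    residuals = [y - p for y, p in zip(actual, predicted)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

error = rmse([3, 5, 7, 9], [2.5, 5.5, 6.5, 9.5])  # every prediction is off by 0.5
print(error)
```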


28. __Maximum Likelihood Estimation (MLE)__: This is estimating the parameter values of a statistical model (for example, logistic regression) by finding the parameter values that maximize the likelihood of the observed data.
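As a simple illustration, the sketch below estimates the success probability of a coin-like (Bernoulli) process by searching a grid of candidate values for the one with the highest log-likelihood. The data and the grid are assumptions made for this example; real models such as logistic regression use numerical optimizers instead of a grid.

```python
import math

outcomes = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]   # illustrative data: 7 successes in 10 trials

def log_likelihood(p, data):
    """Log-likelihood of Bernoulli(p) for a list of 0/1 outcomes."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Search a grid of candidate values for the one that maximizes the likelihood
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: log_likelihood(p, outcomes))
print(p_hat, sum(outcomes) / len(outcomes))
```

The maximizing value coincides with the observed proportion of successes, which is the known closed-form MLE for this model.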


29. __Bias and Variance Trade off__: Every model has both bias and variance error components in addition to white noise. Bias and variance are inversely related to each other; while trying to reduce one component, the other component of the model will increase. The true art lies in creating a good fit by balancing both. The ideal model will have both low bias and low variance. Errors from the bias component come from erroneous assumptions in the underlying learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs; this phenomenon causes an underfitting problem. On the other hand, errors from the variance component come from the sensitivity of the model fit to even small changes in the training data; high variance can cause an overfitting problem.

    An example of a high bias model is logistic or linear regression, in which the fit of the model is merely a straight line and may have a high error component because a linear model cannot approximate the underlying data well. An example of a high variance model is a decision tree, in which the model may create too wiggly a curve as a fit, so that even a small change in training data will cause a drastic change in the fit of the curve. At the moment, state-of-the-art models utilize high variance models such as decision trees and build ensembles on top of them to reduce the errors caused by high variance without increasing the errors due to the bias component.
    The best example of this category is random forest, in which many decision trees are grown independently and combined into an ensemble in order to come up with the best fit.
    
    
30. __Convex and Non-Convex Function__: A function is convex if the line segment drawn between any two points on its graph lies on or above the graph, whereas this isn't true for non-convex functions. It is important to know whether the function is convex or non-convex because in convex functions any local minimum is also the global minimum, whereas for non-convex functions a local minimum is not guaranteed to be the global minimum.
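The chord definition can be probed numerically. The sketch below samples pairs of points and checks whether the function ever rises above the chord between them. A sampled check like this can reveal non-convexity but cannot prove convexity; the function name and the two test functions are choices made for this example.

```python
def violates_convexity(f, xs, ts=(0.25, 0.5, 0.75), tol=1e-12):
    """Return True if, for some sampled pair, the graph rises above the chord."""
    for x in xs:
        for y in xs:
            for t in ts:
                on_graph = f(t * x + (1 - t) * y)
                on_chord = t * f(x) + (1 - t) * f(y)
                if on_graph > on_chord + tol:
                    return True
    return False

grid = [i / 10 for i in range(-20, 21)]   # sample points in [-2, 2]
convex_case = violates_convexity(lambda x: x ** 2, grid)
non_convex_case = violates_convexity(lambda x: x ** 4 - 3 * x ** 2, grid)
print(convex_case, non_convex_case)
```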


31. __Gradient descent__: This is a way to minimize the objective function J(Θ), parameterized by the model's parameters Θ ∈ R^d, by updating the parameters in the opposite direction to the gradient of the objective function with respect to the parameters. The learning rate determines the size of the steps taken to reach the minimum.


32. __Full batch gradient descent (all training observations considered in each and every iteration)__: In full batch gradient descent, all the observations are considered for each and every iteration; this methodology takes a lot of memory and will be slow as well. Also, in practice, we do not need to have all the observations to update the weights. Nonetheless, this method provides the best way of updating parameters with less noise at the expense of huge computation.


33. __Stochastic gradient descent (one observation per iteration)__: This method updates weights by taking one observation at each stage of iteration. This method provides the quickest way of traversing weights; however, a lot of noise is involved while converging.


34. __Mini batch gradient descent (about 30 training observations or more for each and every iteration)__: This is a trade-off between huge computational costs and a quick method of updating weights. In this method, at each iteration, about 30 observations are selected at random and gradients are calculated on them to update the model weights. A natural question is: why a minimum of 30 and not any other number? By a common statistical rule of thumb, about 30 observations are needed for a sample to approximate the population. However, batch sizes of 40, 50, and so on will also do well. Nonetheless, a practitioner needs to vary the batch size and verify the results to determine at what value the model produces the optimum results.
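The three variants differ only in how many observations feed each update, so one function with a `batch_size` argument can sketch all of them. The example below fits a straight line to noise-free, made-up data; the function name, learning rate, and epoch count are assumptions chosen so that this small example converges.

```python
import random

def fit_line(xs, ys, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step in the direction opposite to the gradient
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(20)]        # illustrative noise-free data
ys = [2 * x + 1 for x in xs]

full_batch = fit_line(xs, ys, batch_size=len(xs))   # all observations per update
stochastic = fit_line(xs, ys, batch_size=1)         # one observation per update
mini_batch = fit_line(xs, ys, batch_size=5)         # a few observations per update
print(full_batch, stochastic, mini_batch)
```

All three runs should recover a slope close to 2 and an intercept close to 1. With noisy data, the stochastic and mini-batch runs would fluctuate around the minimum rather than settle exactly.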


35. __Cross Validation__: Cross-validation is another way of ensuring robustness in the model at the expense of computation. In the ordinary modeling methodology, a model is developed on train data and evaluated on test data. In some extreme cases, train and test might not have been homogeneously selected and some unseen extreme cases might appear in the test data, which will drag down the performance of the model. In the cross-validation methodology, on the other hand, the data is divided into equal parts; training is performed on all parts of the data except one, on which performance is evaluated. This process is repeated as many times as the number of parts the user has chosen.

    Example: In five-fold cross-validation, data will be divided into five parts, subsequently
    trained on four parts of the data, and tested on the one part of the data. This process will
    run five times, in order to cover all points in the data. Finally, the error calculated will be
    the average of all the errors.
    <img src="images/cross_validation.png">
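The fold bookkeeping in the five-fold example can be sketched as follows. The generator name `k_fold_indices` is an assumption for this example, and the folds are taken in order without shuffling to keep the output easy to read; in practice the rows are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print(train, test)
```

A model would be fitted on each `train` part and scored on the matching `test` part, and the five scores would then be averaged.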
    
    
36. __Grid Search__: Grid search in machine learning is a popular way to tune the hyperparameters of the model by trying every combination from a specified grid of values in order to find the combination that gives the best fit.

In [24]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [26]:
input_data = pd.read_csv("Data Files/ad.csv",header=None)
input_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1549,1550,1551,1552,1553,1554,1555,1556,1557,1558
0,125,125,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,57,468,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,33,230,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [27]:
X_columns = set(input_data.columns.values)
y = input_data[len(input_data.columns.values)-1]

In [28]:
X_columns.remove(len(input_data.columns.values)-1)
X = input_data[list(X_columns)]

In [29]:
X_train, X_test,y_train,y_test = train_test_split(X,y,train_size = 0.7,random_state=33)

In [30]:
pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy'))])

In [31]:
parameters = {'clf__max_depth': (50,100,150),
            'clf__min_samples_split': (2, 3),
            'clf__min_samples_leaf': (1, 2, 3)}

In [32]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')

In [33]:
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.8s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:   20.2s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('clf',
                                        DecisionTreeClassifier(class_weight=None,
                                                               criterion='entropy',
                                                               max_depth=None,
                                                               max_features=None,
                                                               max_leaf_nodes=None,
                                                               min_impurity_decrease=0.0,
                                                               min_impurity_split=None,
                                                               min_samples_leaf=1,
                                                               min_samples_split=2,
                                                               min_weight_fraction_leaf=0.0,
  

In [34]:
y_pred = grid_search.predict(X_test)

In [35]:
print('\n Best score: \n', grid_search.best_score_)
print('\n Best parameters set: \n')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

# Evaluate on the test data once, outside the parameter loop
print("\n Confusion Matrix on Test data\n", confusion_matrix(y_test, y_pred))
print("\n Test Accuracy \n", accuracy_score(y_test, y_pred))
print("\nPrecision Recall f1 table \n", classification_report(y_test, y_pred))


 Best score: 
 0.966884531590414

 Best parameters set: 

	clf__max_depth: 100
	clf__min_samples_leaf: 1
	clf__min_samples_split: ...

 Confusion Matrix on Test data
 [[814  19]
 [ 16 135]]

 Test Accuracy 
 0.9644308943089431

Precision Recall f1 table 
               precision    recall  f1-score   support

           0       0.98      0.98      0.98       833
           1       0.88      0.89      0.89       151

    accuracy                           0.96       984
   macro avg       0.93      0.94      0.93       984
weighted avg       0.96      0.96      0.96       984

1. __Logistic regression__: This is the problem in which outcomes are discrete classes rather than continuous values. For example, a customer will arrive or not, he will purchase the product or not, and so on. In statistical methodology, it uses the maximum likelihood method to calculate the parameter of individual variables. In contrast, in machine learning methodology, log loss will be minimized with respect to β coefficients (also known as weights). Logistic regression has a high bias and a low variance error.


2. __Linear regression__: This is used to predict continuous variables such as customer income. In statistical methodology, it fits the best possible line by minimizing error; in machine learning methodology, squared loss is minimized with respect to the β coefficients. Linear regression also has high bias and low variance error.


3. __Lasso and ridge regression__: These use regularization to control overfitting by applying a penalty on the coefficients. In ridge regression, the penalty is applied to the sum of squares of the coefficients, whereas in lasso it is applied to the sum of their absolute values. The penalty strength can be tuned to change the dynamics of the model fit. Ridge regression shrinks the magnitude of the coefficients, whereas lasso can shrink some of them exactly to zero, which eliminates the corresponding variables.


4. __Decision trees__: Recursive binary splitting is applied at each level to divide observations into regions that are as pure as possible with respect to class. The classification error rate of a region is simply the fraction of the training observations in it that do not belong to its most common class. Decision trees tend to overfit because of their high variance; to reduce overfitting, the tree is first grown completely and then pruned. Decision trees have low bias and high variance error.


5. __Bagging__: This is an ensemble technique applied to decision trees to reduce the variance error without increasing the error component due to bias. In bagging, several samples are drawn, each with a subsample of observations and all variables (columns). An individual decision tree is then fitted independently on each sample, and the results are ensembled by majority vote (or, in regression cases, by taking the mean of the outcomes).


6. __Random forest__: This is similar to bagging except for one difference. In bagging, all the variables (columns) are selected for each sample, whereas a random forest considers only a random subset of the columns (in common implementations, a fresh subset at each split). The reason for selecting a few variables rather than all is that, when every variable is available to each tree, the most significant variables always appear in the top layers of splitting. This makes all the trees look more or less alike and defeats the purpose of an ensemble, which works better on diverse, independent individual models than on correlated ones. Random forest has both low bias and low variance error.

7. __Boosting__: This is a sequential algorithm that is applied to weak classifiers, such as a decision stump (a one-level decision tree, that is, a tree with one root node and two terminal nodes), to create a strong classifier by ensembling their results. The algorithm starts with equal weights assigned to all observations. In each subsequent iteration, more focus is given to misclassified observations by increasing their weights and decreasing the weights of correctly classified observations. In the end, all the individual classifiers are combined to create a strong classifier. Boosting can overfit, but with careful parameter tuning it can yield some of the best-performing machine learning models.


8. __Support vector machines (SVMs)__: These separate classes by fitting the hyperplane that leaves the widest possible margin between them. For classes that are not linearly separable, kernels are used to map observations into a higher-dimensional space, where they are then separated linearly by a hyperplane.


10. __Principal component analysis (PCA)__: This is a dimensionality reduction technique in which principal components are calculated in place of the original variables. Principal components are the directions along which the variance in the data is maximal. The top n components, chosen to cover about 80 percent of the variance, are then either used in further modeling or explored directly as unsupervised learning.


11. __K-means clustering__: This is an unsupervised algorithm that is mainly used for segmentation exercises. K-means clustering partitions the given data into k clusters in such a way that variation within each cluster is minimal and variation across clusters is maximal.


12. __Markov decision process (MDP)__: In reinforcement learning, an MDP is a mathematical framework for modeling the decision-making of an agent in situations or environments where outcomes are partly random and partly under the agent's control. The environment is modeled as a set of states, together with actions that the agent can perform to control the system's state. The objective is to control the system in such a way that the agent's total payoff is maximized.


13. __Monte Carlo method__: In contrast with solving an MDP directly, Monte Carlo methods do not require complete knowledge of the environment. They require only experience, obtained as sample sequences of states, actions, and rewards from actual or simulated interaction with the environment. Monte Carlo methods follow each chosen sample sequence until its final outcome and update their estimates accordingly.


14. __Temporal difference learning__: This is a core theme in reinforcement learning. Temporal difference learning combines ideas from both Monte Carlo methods and dynamic programming. Like Monte Carlo methods, temporal difference methods can learn directly from raw experience without a model of the environment's dynamics. Like dynamic programming, they update estimates based in part on other learned estimates, without waiting for a final outcome. Temporal difference learning offers the best of both worlds and is commonly used in game-playing systems such as AlphaGo.
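
To illustrate the lasso and ridge entry in the list above, the following sketch compares the two penalties on a single coefficient. It assumes the simplified case of an orthonormal design, where each penalized coefficient has a closed form in terms of its ordinary least squares estimate; the function names and the example coefficients are hypothetical, chosen only for illustration.

```python
def ridge_shrink(beta_ols, lam):
    """Minimizer of 0.5*(b - beta_ols)**2 + 0.5*lam*b**2 (squared penalty)."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Minimizer of 0.5*(b - beta_ols)**2 + lam*abs(b) (soft thresholding)."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0  # coefficients with |beta_ols| <= lam are removed entirely


# Hypothetical least squares coefficients: two strong, two weak.
ols_coefficients = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
lasso = [lasso_shrink(b, lam) for b in ols_coefficients]

for b, r, l in zip(ols_coefficients, ridge, lasso):
    print(f"ols={b:+.2f}  ridge={r:+.3f}  lasso={l:+.3f}")
```

In this simplified setting, ridge scales every coefficient toward zero but leaves all of them non-zero, while lasso sets the two weak coefficients exactly to zero. General-purpose implementations solve the same objectives numerically for correlated features.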
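
The difference between the Monte Carlo and temporal difference entries above can be sketched on a toy problem. The example below assumes a five-state random walk with a reward of 1 for exiting on the right and 0 otherwise; the environment, step size, and function names are illustrative assumptions rather than anything taken from the text. The Monte Carlo learner waits for the final outcome of each episode, while the TD(0) learner updates each state from the current estimate of its successor.

```python
import random

N_STATES = 5                      # non-terminal states are 1..5; 0 and 6 are terminal
START_STATE = 3
RIGHT_TERMINAL = N_STATES + 1


def run_episode(rng):
    """Random walk from the middle state; returns (state, reward, next_state) steps."""
    state = START_STATE
    transitions = []
    while 0 < state < RIGHT_TERMINAL:
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == RIGHT_TERMINAL else 0.0
        transitions.append((state, reward, next_state))
        state = next_state
    return transitions


def monte_carlo(episodes, alpha, rng):
    """Every-visit Monte Carlo: update only after the episode's outcome is known."""
    values = [0.0] * (N_STATES + 2)
    for _ in range(episodes):
        episode_return = 0.0
        for state, reward, _next in reversed(run_episode(rng)):
            episode_return += reward          # undiscounted return from this step on
            values[state] += alpha * (episode_return - values[state])
    return values[1:-1]


def td_zero(episodes, alpha, rng):
    """TD(0): bootstrap each update from the estimate of the next state."""
    values = [0.0] * (N_STATES + 2)           # terminal values stay at 0
    for _ in range(episodes):
        # The transitions are replayed in order; this matches online updating
        # because the random policy does not depend on the value estimates.
        for state, reward, next_state in run_episode(rng):
            target = reward + values[next_state]
            values[state] += alpha * (target - values[state])
    return values[1:-1]


true_values = [s / (N_STATES + 1) for s in range(1, N_STATES + 1)]
mc_estimates = monte_carlo(30000, 0.002, random.Random(0))
td_estimates = td_zero(30000, 0.002, random.Random(0))

print("true       :", [round(v, 3) for v in true_values])
print("monte carlo:", [round(v, 3) for v in mc_estimates])
print("td(0)      :", [round(v, 3) for v in td_estimates])
```

With a small step size, both learners should settle near the true state values of 1/6 through 5/6. What differs is when the update happens: at the end of the episode for Monte Carlo, and after every step for temporal difference.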