In [None]:
"""
Machine learning is the science of teaching machines to learn from data on their own. Examples include smartphones detecting
faces while taking photos or unlocking themselves; Facebook, LinkedIn, or any other social media site recommending friends
and ads you might be interested in; Amazon recommending products based on your browsing history; and banks detecting
fraudulent transactions in real time.

Types of Machine Learning:
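    1. Supervised Learning: 
        > Well-defined goals: the training data includes the known outcome (label) to predict
        > Reverse engineering: the model works backward from known outcomes to the relationship that produces them
        > e.g.: fraud/non-fraud transactions, inventory management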
        
    2. Unsupervised Learning:
        > Outcome is based only on inputs
        > Outcome - typically clustering or segmentation
        
    3. Reinforcement Learning: 
        > Start state and end states are defined
        > agent discovers the path & the relationships on its own
"""

In [None]:
"""
Supervised machine learning techniques involve training a model to operate on a set of features and predict a label using a 
dataset that includes some already-known label values. The training process fits the features to the known labels to define 
a general function that can be applied to new features for which the labels are unknown, and predict them. You can think of 
this function like this, in which y represents the label we want to predict and x represents the features the model uses to 
predict it.

y=f([x1,x2,x3,...])

The goal of training the model is to find a function that performs some kind of calculation on the x values that produces 
the result y. We do this by applying a machine learning algorithm that tries to fit the x values to a calculation that 
produces y reasonably accurately for all of the cases in the training dataset.

There are lots of machine learning algorithms for supervised learning, and they can be broadly divided into two types:

Regression algorithms: Algorithms that predict a y value that is a numeric value, such as the price of a house or the number 
of sales transactions.
Classification algorithms: Algorithms that predict to which category, or class, an observation belongs. The y value in a 
classification model is a vector of probability values between 0 and 1, one for each class, indicating the probability of 
the observation belonging to each class.

"""
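The difference between the two kinds of y value can be seen in a minimal sketch. The toy feature values and labels below are invented for illustration and are not from the original notes; the estimators are scikit-learn's LinearRegression and LogisticRegression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: two features (x1, x2) per observation.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])

# Regression: the label is a numeric value (here constructed as x1 + 2 * x2).
y_numeric = np.array([5.0, 4.0, 11.0, 10.0, 17.0, 16.0])
regressor = LinearRegression().fit(X, y_numeric)
numeric_prediction = regressor.predict([[7.0, 8.0]])

# Classification: the label is a class, and the model outputs one probability per class.
y_class = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, y_class)
class_probabilities = classifier.predict_proba([[7.0, 8.0]])

print(numeric_prediction)    # a single numeric value
print(class_probabilities)   # one row with a probability for class 0 and class 1
```

In both cases the fitted estimator plays the role of f in y=f([x1,x2,x3,...]); only the form of y differs.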

In [None]:
"""
Exploring Data: 
The first step in any machine learning project is to explore the data that you will use to train a model. The goal of this 
exploration is to try to understand the relationships between its attributes; in particular, any apparent correlation 
between the features and the label your model will try to predict. 
This may require some work:
1. Data cleaning: detecting and fixing issues in the data, such as imputing missing values, correcting errors, and 
   handling outlier values.
2. Feature engineering: deriving new feature columns by transforming or combining existing features.
3. Normalization: scaling numeric features (values you can measure or count) so they're on a similar scale.
4. Encoding: representing categorical features (values that represent discrete categories) as numeric indicators.
"""
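The four preparation steps above can be sketched with pandas and scikit-learn. The column names and values are hypothetical, and mean imputation and min-max scaling are just one possible choice for each step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data with one missing value, two numeric columns and one categorical column.
df = pd.DataFrame({
    "temp": [10.0, np.nan, 30.0, 20.0],
    "humidity": [0.2, 0.4, 0.6, 0.8],
    "season": ["spring", "winter", "summer", "winter"],
})

# 1. Data cleaning: fill the missing temperature with the mean of the observed values.
df[["temp"]] = SimpleImputer(strategy="mean").fit_transform(df[["temp"]])
imputed_temp = df.loc[1, "temp"]

# 2. Feature engineering: derive a new column by combining two existing ones.
df["temp_x_humidity"] = df["temp"] * df["humidity"]

# 3. Normalization: rescale each numeric column to the range 0 to 1.
numeric_cols = ["temp", "humidity", "temp_x_humidity"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# 4. Encoding: create one binary indicator column per season category.
encoder = OneHotEncoder()
season_indicators = encoder.fit_transform(df[["season"]]).toarray()

print(df)
print(encoder.categories_)
print(season_indicators)
```

In a real project the imputer, scaler, and encoder would normally be fitted on the training split only and then applied to the validation split, so that no information leaks from the held-back data.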

In [None]:
"""
We could train a model using all of the data, but it's common practice in supervised learning to split the data into two 
subsets: a (typically larger) set with which to train the model, and a smaller "hold-back" set with which to validate the 
trained model. This enables us to evaluate how well the model performs when used with the validation dataset by comparing 
the predicted labels to the known labels. It's important to split the data randomly (rather than say, taking the first 70% 
of the data for training and keeping the rest for validation). This helps ensure that the two subsets of data are 
statistically comparable (so we validate the model with data that has a similar statistical distribution to the data on 
which it was trained).

To randomly split the data, we'll use the train_test_split function in the scikit-learn library. This library is one of 
the most widely used machine learning packages for Python.

Now we have the following four datasets:

X_train: The feature values we'll use to train the model
y_train: The corresponding labels we'll use to train the model
X_test: The feature values we'll use to validate the model
y_test: The corresponding labels we'll use to validate the model

Now we're ready to train a model by fitting a suitable regression algorithm to the training data. We'll use a linear 
regression algorithm, a common starting point for regression that works by trying to find a linear relationship between the 
X values and the y label. The resulting model is a function that conceptually defines a line (or, with more than one 
feature, a plane or hyperplane) that gives a predicted y value for every possible combination of X values.

In Scikit-Learn, training algorithms are encapsulated in estimators, and in this case we'll use the LinearRegression 
estimator to train a linear regression model.

Evaluate the Trained Model:
Now that we've trained the model, we can use it to predict rental counts for the features we held back in our validation 
dataset. Then we can compare these predictions to the actual label values to evaluate how well (or not!) the model is 
working.
"""
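A minimal sketch of the split, train, and evaluate workflow described above. The rental dataset itself is not part of these notes, so the block uses synthetic stand-in data (an assumption), and the 70/30 split and the choice of RMSE and R2 as evaluation metrics are also illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 features, a linear label plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Randomly hold back 30% of the rows for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Fit the LinearRegression estimator to the training data only.
model = LinearRegression().fit(X_train, y_train)

# Predict labels for the held-back features and compare them with the known labels.
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, predictions)

print("Coefficients:", model.coef_)
print("RMSE:", rmse)
print("R2:", r2)
```

Setting random_state makes the random split repeatable, which is convenient when comparing models on the same validation rows.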

In [None]:
"""
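Regression models are often chosen because they work with small data samples, are robust and easy to interpret, and come in 
many varieties.

Linear regression is the simplest form of regression, with no limit to the number of features used. It comes in many forms, 
often named by the number of features used (simple or multiple), the shape of the curve that fits, or the way the fit is 
regularized:
    > Ordinary Least Squares: minimizes the squared error with no penalty on the coefficients
    > Lasso: adds an L1 penalty, which can shrink some coefficients to exactly zero
    > Ridge: adds an L2 penalty, which shrinks coefficients toward zero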

Tree-based algorithms: Algorithms that build a decision tree to reach a prediction. Decision trees take a step-by-step 
approach to predicting a variable. In our bicycle rental example, the decision tree might first split the examples into 
Spring/Summer and Autumn/Winter, then make a prediction based on the day of the week. Spring/Summer-Monday may have a bike 
rental rate of 100 per day, while Autumn/Winter-Monday may have a rental rate of 20 per day.
    > As an alternative to a linear model, there's a category of algorithms for machine learning that uses a tree-based 
      approach in which the features in the dataset are examined in a series of evaluations, each of which results in a 
      branch in a decision tree based on the feature value. At the end of each series of branches are leaf-nodes with 
      the predicted label value based on the feature values.

Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability. Many 
ensemble algorithms construct not just one decision tree, but a large number of trees, allowing better predictions on more 
complex data. Ensemble algorithms, such as Random Forest, are widely used in machine learning and science due to their 
strong prediction abilities.
    > Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by applying an 
      aggregate function to a collection of base models (sometimes referred to as bagging) or by building a sequence of 
      models that build on one another to improve predictive performance (referred to as boosting).
"""
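Because every scikit-learn estimator shares the same fit and predict interface, the linear, tree-based, and ensemble regressors described above can be tried side by side. This is a sketch on synthetic data from make_regression (an assumption, not the rental data), with default or arbitrary hyperparameter values; which estimator scores best depends entirely on the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

estimators = {
    "ordinary least squares": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "ridge": Ridge(alpha=1.0),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest (bagging)": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting (boosting)": GradientBoostingRegressor(random_state=0),
}

# Train each estimator on the same split and score it on the held-back rows.
scores = {}
for name, estimator in estimators.items():
    estimator.fit(X_train, y_train)
    scores[name] = r2_score(y_test, estimator.predict(X_test))
    print(f"{name}: R2 = {scores[name]:.3f}")
```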

In [None]:
"""
Classification is a form of machine learning in which you train a model to predict which category an item belongs to. For 
example, a health clinic might use diagnostic data such as a patient's height, weight, blood pressure, and blood-glucose 
level to predict whether or not the patient is diabetic.

Categorical data has distinct 'classes', rather than numeric values. Some kinds of data can be either numeric 
or categorical: the time to run a race could be a time in seconds (numeric), or we could split times into classes of ‘fast’, 
‘medium’ and ‘slow’ (categorical). Other kinds of data can only be categorical, such as a type of shape: ‘circle’, 
‘triangle’, or ‘square’.

Classification is a form of supervised machine learning in which you train a model to use the features (the x values in our 
function) to calculate the probability of the observed case belonging to each of a number of possible classes, and then 
predict an appropriate label (y). The simplest form of classification is binary classification, in which the label is 0 or 
1, representing one of two classes; for example, "True" or "False"; "Internal" or "External"; "Profitable" or 
"Non-Profitable"; and so on.
"""

In [None]:
"""
We'll use Logistic Regression, which (despite its name) is a well-established algorithm for classification. In addition to 
the training features and labels, we'll need to set a regularization parameter. This is used to counteract any bias in the 
sample, and help the model generalize well by avoiding overfitting the model to the training data.

Note: Settings that you choose for a machine learning algorithm before training are generally referred to as 
hyperparameters. To a data scientist, parameters are values that the model learns from the data itself (such as 
coefficients), whereas hyperparameters are defined externally from the data.
"""
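A minimal sketch of training a logistic regression classifier with a regularization setting. The diabetes data is not included in these notes, so the block uses synthetic data from make_classification (an assumption), and the rate 0.01 is an arbitrary illustrative value. In scikit-learn, the C argument of LogisticRegression is the inverse of the regularization strength, so a rate is passed here as C = 1 / rate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Hyperparameter: chosen by us before training (hypothetical value).
reg = 0.01

# Smaller C means stronger regularization; C is the inverse of the regularization strength.
model = LogisticRegression(C=1 / reg, solver="liblinear").fit(X_train, y_train)

# Parameters: learned from the training data during fit.
print("Learned coefficients:", model.coef_)
print("Learned intercept:", model.intercept_)

accuracy = model.score(X_test, y_test)
print("Accuracy on held-back data:", accuracy)
```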

In [None]:
"""
The classification report includes the following metrics for each class  (0 and 1):
Precision: Of the predictions the model made for this class, what proportion were correct?
Recall: Out of all of the instances of this class in the test dataset, how many did the model identify?
F1-Score: The harmonic mean of precision and recall, which takes both into account.
Support: How many instances of this class are there in the test dataset?

Because this is a binary classification problem, the 1 class is considered positive and its precision and recall are 
particularly interesting - these in effect answer the questions:
> Of all the patients the model predicted are diabetic, how many are actually diabetic?
> Of all the patients that are actually diabetic, how many did the model identify?

The precision and recall metrics are derived from four possible prediction outcomes:
> True Positives: The predicted label and the actual label are both 1.
> False Positives: The predicted label is 1, but the actual label is 0.
> False Negatives: The predicted label is 0, but the actual label is 1.
> True Negatives: The predicted label and the actual label are both 0.

These metrics are generally tabulated for the test set and shown together as a confusion matrix, which takes the following 
form:

            Predicted 0   Predicted 1
Actual 0    TN            FP
Actual 1    FN            TP

Note that the correct (true) predictions form a diagonal line from top left to bottom right - these figures should be 
significantly higher than the false predictions if the model is any good.

Statistical machine learning algorithms, like logistic regression, are based on probability; so what actually gets predicted
by a binary classifier is the probability that the label is true (P(y)) and the probability that the label is false 
(1 - P(y)). A threshold value of 0.5 is used to decide whether the predicted label is a 1 (P(y) > 0.5) or a 0 (P(y) <= 0.5).
You can use the predict_proba method to see the probability pairs for each case.

The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared.
If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion 
matrix. A common way to evaluate a classifier is to examine the true positive rate (which is another name for recall) and 
the false positive rate for a range of possible thresholds. The true positive rate is then plotted against the false 
positive rate for each threshold to form a chart known as a receiver operating characteristic (ROC) chart. 
The ROC chart shows the curve of the true and false positive rates for different threshold values between 0 and 1. 
A perfect classifier would have a curve that goes straight up the left side and straight across the top. The diagonal line 
across the chart represents the probability of predicting correctly with a 50/50 random prediction; so you obviously want 
the curve to be higher than that (or your model is no better than simply guessing!).

The area under the curve (AUC) is a value between 0 and 1 that quantifies the overall performance of the model. The closer 
to 1 this value is, the better the model. 

In this case, the ROC curve and its AUC indicate that the model performs better than a random guess, which is not bad 
considering we performed very little preprocessing of the data.

In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it.
There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit 
ourselves to a few common techniques:
> Scaling numeric features so they're on the same scale. This prevents features with large values from producing 
  coefficients that disproportionately affect the predictions.
> Encoding categorical variables. For example, by using a one hot encoding technique you can create individual binary 
  (true/false) features for each possible category value.

"""
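The evaluation steps described above can be sketched with scikit-learn's metrics module. The data here is synthetic (an assumption, since the diabetes dataset is not part of these notes), and the stricter threshold of 0.8 is an arbitrary value chosen only to show the effect of moving the threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
model = LogisticRegression(solver="liblinear").fit(X_train, y_train)

# Precision, recall, F1-score and support for each class.
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

# Rows are actual classes and columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, predictions)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Each row holds the probability pair [P(class 0), P(class 1)] for one case.
probabilities = model.predict_proba(X_test)
positive_scores = probabilities[:, 1]

# Applying a stricter threshold than 0.5 labels fewer (or the same number of) cases as 1.
strict_predictions = (positive_scores > 0.8).astype(int)

# True and false positive rates across thresholds, and the area under the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, positive_scores)
auc = roc_auc_score(y_test, positive_scores)
print("AUC:", auc)
```

Plotting tpr against fpr (for example with matplotlib) gives the ROC chart; the diagonal from (0, 0) to (1, 1) is the random-guess baseline.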

In [None]:
"""
Previously we used a logistic regression algorithm, which is a linear algorithm. There are many kinds of classification 
algorithm we could try, including:
> Support Vector Machine algorithms: Algorithms that define a hyperplane that separates classes.
> Tree-based algorithms: Algorithms that build a decision tree to reach a prediction
> Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.
"""
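The three families listed above are all available as scikit-learn estimators with the same fit and predict interface. The sketch below uses synthetic data (an assumption) and default hyperparameters; SVC, DecisionTreeClassifier, and RandomForestClassifier are one representative of each family, not the only options.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

classifiers = {
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest (ensemble)": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train each classifier on the same split and measure accuracy on the held-back rows.
accuracies = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    accuracies[name] = classifier.score(X_test, y_test)
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```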

In [None]:
"""
From the four core confusion matrix values (TP, TN, FP, and FN), you can calculate a range of other metrics that can help 
you evaluate the performance of the model. For example:

Accuracy: (TP+TN)/(TP+TN+FP+FN) - out of all of the predictions, how many were correct?
Recall: TP/(TP+FN) - of all the cases that are positive, how many did the model identify?
Precision: TP/(TP+FP) - of all the cases that the model predicted to be positive, how many actually are positive?
"""
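A worked example of the formulas above in plain Python. The four counts are hypothetical values chosen for easy arithmetic, not results from any model in these notes. The F1 score is included using its standard definition as the harmonic mean of precision and recall.

```python
# Hypothetical confusion matrix counts for 100 predictions.
tp, tn, fp, fn = 40, 45, 5, 10

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Recall: share of the actual positive cases that the model identified.
recall = tp / (tp + fn)

# Precision: share of the predicted positive cases that are actually positive.
precision = tp / (tp + fp)

# F1 score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 85 / 100
print(f"Recall:    {recall:.3f}")     # 40 / 50
print(f"Precision: {precision:.3f}")  # 40 / 45
print(f"F1:        {f1:.3f}")         # 80 / 95
```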

In [None]:
"""
Multiclass classification can be thought of as a combination of multiple binary classifiers. There are two ways in which you
can approach the problem:

One vs Rest (OVR), in which a classifier is created for each possible class value, with a positive outcome for cases where 
the prediction is this class, and negative predictions for cases where the prediction is any other class. 
For example, a classification problem with four possible shape classes (square, circle, triangle, hexagon) would require 
four classifiers that predict:
> square or not
> circle or not
> triangle or not
> hexagon or not

One vs One (OVO), in which a classifier for each possible pair of classes is created. The classification problem with four 
shape classes would require the following binary classifiers:
> square or circle
> square or triangle
> square or hexagon
> circle or triangle
> circle or hexagon
> triangle or hexagon 

In both approaches, the overall model must take into account all of these predictions to determine which single category the
item belongs to. Fortunately, in most machine learning frameworks, including scikit-learn, implementing a multiclass 
classification model is not significantly more complex than binary classification - and in most cases, the estimators used 
for binary classification implicitly support multiclass classification by abstracting an OVR algorithm, an OVO algorithm, or
by allowing a choice of either.
"""
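The classifier counts in the four-class example (four for OVR, six for OVO) can be checked with scikit-learn's explicit wrappers, OneVsRestClassifier and OneVsOneClassifier. This is a sketch on synthetic four-class data (an assumption); the integer classes 0 to 3 stand in for the shape classes, and logistic regression is an arbitrary choice of base binary classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with four classes (0, 1, 2, 3) standing in for the four shapes.
X, y = make_classification(
    n_samples=400,
    n_features=8,
    n_informative=4,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# One vs Rest: one binary classifier per class.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X, y)

# One vs One: one binary classifier per pair of classes.
ovo = OneVsOneClassifier(LogisticRegression(solver="liblinear")).fit(X, y)

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
print("OVR binary classifiers:", n_ovr)
print("OVO binary classifiers:", n_ovo)

# Both wrappers combine their binary classifiers into a single class prediction per row.
print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))
```

With k classes, OVR trains k classifiers and OVO trains k * (k - 1) / 2, which is why OVO grows faster as the number of classes increases.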