# Machine Learning in Python

## Table of Contents

1. [Introduction](#1)<br>
    * [Main ML algorithms](#2)<br>
    * [Main libraries to work on ML with Python](#3)<br>
    * [Differences between supervised and Unsupervised ML](#4)<br>
2. [Regression Models](#5)<br>
    * [Simple linear regression](#6)<br> 
    * [Model Evaluation in Regression Models](#7)<br> 
    * [Evaluation Metrics in Regression Models](#8)<br>
    * [Multiple Linear Regression](#9)<br>


## 1. Introduction <a id="1"></a>


### Main ML algorithms <a id="2"></a>

* <span style="color:blue">***Regression/Estimation***</span> is used to predict a continuous value, for example the price of a house based on its characteristics, or the CO2 emissions of a car’s engine.
* <span style="color:blue">__Classification__</span> is used to predict the class or category of a case, for example whether a cell is benign or malignant, or whether a customer will churn.
* <span style="color:blue">__Clustering__ </span>groups similar cases, for example to find similar patients or to segment customers in the banking field.
* <span style="color:blue">__Association__ </span>is used to find items or events that often co-occur, for example grocery items that a particular customer usually buys together.
* <span style="color:blue">__Anomaly detection__</span> is used to discover abnormal and unusual cases, for example credit card fraud.
* <span style="color:blue">__Sequence mining__ </span>is used to predict the next event, for instance in the click-stream of a website.
* <span style="color:blue">__Dimension reduction__ </span>is used to reduce the number of features (dimensions) in the data.
* <span style="color:blue">__Recommendation systems__</span> associate people's preferences with those of others who have similar tastes, and recommend new items to them, such as books or movies.

### Main libraries to work on ML with Python : <a id="3"></a>

* <span style="color:blue"> **NumPy** </span> is a math library for working with N-dimensional arrays in Python. Its vectorized array operations let you compute efficiently, typically much faster than equivalent loops over regular Python lists. You need to know NumPy for working with arrays, numeric datatypes, and images.
* <span style="color:blue"> **SciPy** </span> is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more. SciPy is a good library for scientific and high-performance computation.
* <span style="color:blue"> **Matplotlib** </span> is a very popular plotting package that provides 2D plotting as well as 3D plotting.
* <span style="color:blue"> **Pandas** </span> is a high-level Python library that provides high-performance, easy-to-use data structures. It has many functions for data importing, manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
* <span style="color:blue"> **SciKit Learn** </span> is a collection of algorithms and tools for machine learning. It is our focus here, and you will learn to use it within this course.


### Differences between supervised and Unsupervised ML <a id="4"></a>

* <span style="color:blue"> **Supervised** </span> How do we supervise a machine learning model? We do this by __teaching the model__: we train it on data from a labeled dataset so that it can predict future instances. **Labeled** means that the dataset already contains the known outcome for each observation.  
**Glossary**: *attributes* are the column names, *features* are the column contents, and *observations* are the rows of the dataset. *Categorical data* is non-numeric: it contains characters rather than numbers. *Numeric data* contains numbers. See the example below.    
There are 2 types of supervised techniques: **classification and regression**.

* <span style="color:blue"> **Unsupervised** </span>: We do not supervise the model; we let it work on its own to discover information that may not be visible to the human eye. The unsupervised algorithm trains on the dataset and draws conclusions from unlabeled data. Generally speaking, it is harder to apply because we know little about the data and the expected outcomes. The dataset is **not labeled**.  
The main types of unsupervised techniques are **dimension reduction, density estimation, market basket analysis, and clustering**. *Dimensionality reduction* and/or feature selection remove redundant features, which makes later tasks such as classification easier.   
*Market basket analysis* is a modeling technique based on the theory that if you buy a certain group of items, you are more likely to buy another group of items.  
*Density estimation* is a very simple concept that is mostly used to explore the data and find some structure within it.   
*Clustering* is considered one of the most popular unsupervised machine learning techniques. It groups data points or objects that are somehow similar.


![sup_unsup_comparison](supervised_unsupervised.JPG)
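
To make the contrast concrete, here is a minimal sketch on invented one-dimensional data (the values and the `predict` helper are hypothetical, not taken from the course). The supervised half learns a threshold from the labels; the unsupervised half runs a tiny 2-means clustering that never sees the labels. It is an illustration of the idea only, not a recommended classifier.

```python
import numpy as np

# Hypothetical 1-D measurements (for example a clump-thickness score) for six cells.
x = np.array([1.0, 2.0, 1.5, 8.0, 9.0, 8.5])

# --- Supervised: the labels are known, so the model learns from them. ---
labels = np.array(["benign", "benign", "benign",
                   "malignant", "malignant", "malignant"])
mean_benign = x[labels == "benign"].mean()
mean_malignant = x[labels == "malignant"].mean()
threshold = (mean_benign + mean_malignant) / 2  # midpoint between class means


def predict(value):
    """Classify a new observation using the threshold learned from the labels."""
    return "malignant" if value > threshold else "benign"


# --- Unsupervised: no labels; a tiny 2-means clustering finds the groups. ---
centers = np.array([x.min(), x.max()])  # simple initialisation
for _ in range(10):
    # Assign each point to its nearest center, then recompute the centers.
    assignment = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([x[assignment == k].mean() for k in range(2)])

print(predict(7.0))   # supervised prediction for a new observation
print(assignment)     # cluster index found for each original observation
```

Note that the clustering returns anonymous group indices (0 and 1): without labels, nothing tells the algorithm which group is "benign".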

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Build the DataFrame from a list of rows so the numeric columns stay numeric
# (going through np.array with mixed types would turn every value into a string).
df = pd.DataFrame(
    [[1, 2, 3, 4, "benign"], [4, 5, 9, 5, "malignant"], [7, 5, 9, 4, "benign"]],
    columns=["Clump", "UniSize", "UniShape", "Mit", "Class"],
)
# The column names are the attributes.
# The column contents are the features of the dataset.
# A row corresponds to an observation.
# "Class" is the labeled column; it may be used to classify future tumors as
# benign or malignant. It holds categorical data, as opposed to numerical data.
df

## 2. Regression Models <a id="5"></a> 

In regression, there are two types of variables: a **dependent variable** (by convention Y) and one or more **independent variables** (by convention X). The dependent variable can be seen as the state, target, or final goal that we study and try to predict. The independent variables, also known as explanatory variables, can be seen as the causes of those states. The important point is that the dependent variable must be continuous, not discrete, for regression models to apply. Independent variables can be categorical or continuous.

* Simple regression (one independent variable, one dependent variable): 
    * Simple linear regression 
    * Simple non-linear regression
* Multiple regression (several independent variables): 
    * Multiple linear regression 
    * Multiple non-linear regression 
 
 __Here are the different regression algorithms. Each has its own importance and specific conditions to which its application is best suited__: 
 * Ordinal regression 
 * Poisson regression 
 * Fast forest quantile regression 
 * Linear, polynomial, Lasso, Stepwise, Ridge regression 
 * Bayesian linear regression 
 * Neural network regression 
 * Decision forest regression 
 * Boosted decision tree regression 
 * KNN (K-nearest neighbors) 
 
 ### Simple linear regression <a id="6"></a> 
 
 `yhat = theta0 + theta1 * x1`
 * yhat is the predicted value of the dependent variable 
 * x1 is the independent variable 
 * theta0 is the intercept 
 * theta1 is the slope or gradient
 
 You can interpret this equation as yhat being a function of x1, or yhat being dependent on x1.
 
![calcul_simple_lin_reg](SimpleLinregression-Calcul.JPG) 
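
As a sketch of how the two coefficients can be estimated, the snippet below assumes ordinary least squares, the usual method for simple linear regression: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The five (ENGINESIZE, CO2EMISSIONS) pairs are copied from the first rows of the `FuelConsumptionCo2.csv` preview further down; five points are far too few for a real model and serve only to show the arithmetic.

```python
import numpy as np

# First five rows of the FuelConsumptionCo2.csv preview in this notebook.
x1 = np.array([2.0, 2.4, 1.5, 3.5, 3.5])            # ENGINESIZE
y = np.array([196.0, 221.0, 136.0, 255.0, 244.0])   # CO2EMISSIONS

x_bar, y_bar = x1.mean(), y.mean()

# Least-squares estimates of the slope and the intercept.
theta1 = np.sum((x1 - x_bar) * (y - y_bar)) / np.sum((x1 - x_bar) ** 2)
theta0 = y_bar - theta1 * x_bar

# Prediction for a hypothetical engine size of 2.8.
yhat = theta0 + theta1 * 2.8

print(f"theta0 = {theta0:.2f}, theta1 = {theta1:.2f}, yhat(2.8) = {yhat:.1f}")
```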

 ### Model Evaluation in Regression Models <a id="7"></a> 
 
 *  <span style="color:blue"> **Training & testing on the same dataset :**   </span>
    *  **Training accuracy**: Training accuracy is the percentage of correct predictions that the model makes on data it was trained on. However, a high training accuracy isn't necessarily a good thing. For instance, a high training accuracy may be the result of over-fitting the data. This means that the model is overly trained to the dataset, so it may capture noise and produce a non-generalized model.   
    *  **Out-of-sample accuracy**: Out-of-sample accuracy is the percentage of correct predictions that the model makes on data that it has not been trained on. It's important that our models have high out-of-sample accuracy because the purpose of a model is, of course, to make correct predictions on unknown data.
    
 
 * <span style="color:blue"> **Train & Test split :** </span>
 Train/test split involves splitting the dataset into a training set and a testing set, which are mutually exclusive.
 Because the model is evaluated on data it has not seen, this gives a more realistic estimate of out-of-sample accuracy than training and testing on the same dataset.
 The remaining issue is that the result is highly dependent on which observations happen to fall in the training set and which in the testing set.
 
 
 * <span style="color:blue"> **K-Fold Cross validation :** </span> K-fold cross-validation, in its simplest form, performs multiple train/test splits on the same dataset, where each split uses a different fold for testing. The results are then averaged to produce a more consistent estimate of out-of-sample accuracy. 
 
 ![K_fold_validation](Kfold.JPG) 
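
The two splitting strategies above can be sketched with NumPy alone. The data is synthetic and the helper names (`fit_line`, `mse`) are made up for this example; because the target is continuous, the score used here is mean squared error rather than a percentage of correct predictions. In practice, scikit-learn's `train_test_split` and `KFold` do the index bookkeeping for you.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Synthetic data: y depends roughly linearly on x, plus noise.
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)


def fit_line(x_train, y_train):
    """Least-squares slope and intercept for a single feature."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return slope, intercept


def mse(params, x_eval, y_eval):
    """Mean squared error of the fitted line on the given data."""
    slope, intercept = params
    return np.mean((y_eval - (slope * x_eval + intercept)) ** 2)


# --- Train/test split: hold out 20% of the rows for testing. ---
indices = rng.permutation(x.size)
test_idx, train_idx = indices[:10], indices[10:]
params = fit_line(x[train_idx], y[train_idx])
holdout_mse = mse(params, x[test_idx], y[test_idx])

# --- K-fold: every row is used for testing exactly once. ---
k = 5
folds = np.array_split(indices, k)
fold_mse = []
for i in range(k):
    test_fold = folds[i]
    train_fold = np.concatenate([folds[j] for j in range(k) if j != i])
    fold_params = fit_line(x[train_fold], y[train_fold])
    fold_mse.append(mse(fold_params, x[test_fold], y[test_fold]))

cv_mse = np.mean(fold_mse)  # average over the k test folds
print(f"hold-out MSE: {holdout_mse:.3f}, {k}-fold mean MSE: {cv_mse:.3f}")
```

The hold-out score depends on a single random split, while the K-fold score averages over five different test folds, which is why it tends to be more stable.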


### Evaluation Metrics in Regression Models <a id="8"></a>

Evaluation metrics are used to explain the performance of a model.  In the context of regression, the error of the model is the difference between the data points and the trend line generated by the algorithm. Since there are multiple data points, an error can be determined in multiple ways. 

* <span style="color:blue"> **Mean absolute error** </span> Mean absolute error is the mean of the absolute value of the errors. This is the easiest of the metrics to understand, since it is just the average magnitude of the errors.
* <span style="color:blue"> **Mean squared error** </span> Mean squared error is the mean of the squared errors. It's more popular than mean absolute error because it puts more weight on large errors: squaring makes large errors grow much faster than small ones.
* <span style="color:blue"> **Root mean squared error.** </span> Root mean squared error is the square root of the mean squared error. This is one of the most popular evaluation metrics because it is expressed in the same units as the response vector y, which makes it easy to interpret. 
* <span style="color:blue"> **Relative absolute error.** </span> Relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor, that is, a predictor that always returns y bar, the mean value of y. Relative squared error is very similar but uses squared errors instead of absolute ones. It is widely adopted by the data science community because it is used for calculating R squared (R squared = 1 - relative squared error).
* <span style="color:blue"> **R squared.** </span> R squared is not an error per se but is a popular metric for the accuracy of your model. It represents how close the data values are to the fitted regression line. The higher the R squared, the better the model fits your data. 
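
The metrics above can be computed in a few lines of NumPy. The actual and predicted values below are invented for the example; the "simple predictor" used to normalize the relative errors is the one that always returns the mean of y.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

errors = y - y_hat          # residuals of the model
baseline = y - y.mean()     # residuals of the simple predictor (always y bar)

mae = np.mean(np.abs(errors))                             # 0.5
mse = np.mean(errors ** 2)                                # 0.375
rmse = np.sqrt(mse)                                       # same units as y
rae = np.sum(np.abs(errors)) / np.sum(np.abs(baseline))   # 0.25
rse = np.sum(errors ** 2) / np.sum(baseline ** 2)         # 0.075
r2 = 1 - rse                                              # 0.925

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, RAE={rae}, RSE={rse}, R2={r2}")
```

scikit-learn exposes several of these as `mean_absolute_error`, `mean_squared_error`, and `r2_score` in `sklearn.metrics`, which is usually preferable to hand-written versions.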

  

### Multiple Linear Regression <a id="9"></a>

 Adding too many independent variables without any theoretical justification may result in an overfit model. An overfit model is a real problem because it is too complicated for your data set and not general enough to be used for prediction. So, __it is recommended to avoid adding variables that have no theoretical justification__. 
 
  Basically, __categorical independent variables can be incorporated into a regression model by converting them into numerical variables__. For example, given a binary variable such as car type, you can use a dummy code of zero for manual and one for automatic cars. 
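
A minimal sketch of this conversion with pandas follows. The small `cars` table and its column names are invented for the example. A binary variable is mapped directly to 0 and 1; for a variable with more than two categories, `pd.get_dummies` creates one indicator column per category, and `drop_first=True` drops one of them so the remaining columns are not perfectly collinear.

```python
import pandas as pd

# Hypothetical cars; the column names are invented for this example.
cars = pd.DataFrame({
    "transmission": ["manual", "automatic", "automatic", "manual"],
    "fuel": ["petrol", "diesel", "ethanol", "petrol"],
})

# Binary variable: dummy code 0 for manual and 1 for automatic.
cars["is_automatic"] = cars["transmission"].map({"manual": 0, "automatic": 1})

# More than two categories: one indicator column per category,
# dropping the first category to avoid a redundant column.
fuel_dummies = pd.get_dummies(cars["fuel"], prefix="fuel", drop_first=True)

encoded = pd.concat([cars[["is_automatic"]], fuel_dummies], axis=1)
print(encoded)
```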
  
  As a last point, remember that __multiple linear regression is a specific type of linear regression__. So, __there needs to be a linear relationship__ between the dependent variable and each of your independent variables.  There are a number of ways to check for a linear relationship. For example, __you can use scatter plots and then visually check for linearity__. If the relationship displayed in your scatter plot is not linear, then you need to use non-linear regression.
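
As a sketch of what fitting a multiple linear regression involves, the snippet below solves the ordinary least squares problem with `np.linalg.lstsq` for two independent variables. The nine rows are copied from the `FuelConsumptionCo2.csv` preview at the end of this section (ENGINESIZE, FUELCONSUMPTION_COMB, CO2EMISSIONS); that is far too little data for a trustworthy model, and the choice of these two features is an assumption made for illustration.

```python
import numpy as np

# Rows copied from the FuelConsumptionCo2.csv preview in this notebook:
# ENGINESIZE, FUELCONSUMPTION_COMB, CO2EMISSIONS.
data = np.array([
    [2.0, 8.5, 196],
    [2.4, 9.6, 221],
    [1.5, 5.9, 136],
    [3.5, 11.1, 255],
    [3.5, 10.6, 244],
    [3.0, 11.8, 271],
    [3.2, 11.5, 264],
    [3.0, 11.8, 271],
    [3.2, 11.3, 260],
])
features, y = data[:, :2], data[:, 2]

# Prepend a column of ones so the first coefficient is the intercept theta0.
X = np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: minimise ||X @ theta - y||^2.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ theta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("theta0, theta1, theta2:", theta)
print("R squared on the training rows:", r2)
```

The R squared printed here is a training score on the same nine rows, so it says nothing about out-of-sample accuracy. scikit-learn's `LinearRegression` fits the same kind of model and pairs naturally with a train/test split.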



In [3]:
df_fuel = pd.read_csv("FuelConsumptionCo2.csv")
df_fuel

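| Unnamed: 0 | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1062 | 2014 | VOLVO | XC60 AWD | SUV - SMALL | 3.0 | 6 | AS6 | X | 13.4 | 9.8 | 11.8 | 24 | 271 |
| 1063 | 2014 | VOLVO | XC60 AWD | SUV - SMALL | 3.2 | 6 | AS6 | X | 13.2 | 9.5 | 11.5 | 25 | 264 |
| 1064 | 2014 | VOLVO | XC70 AWD | SUV - SMALL | 3.0 | 6 | AS6 | X | 13.4 | 9.8 | 11.8 | 24 | 271 |
| 1065 | 2014 | VOLVO | XC70 AWD | SUV - SMALL | 3.2 | 6 | AS6 | X | 12.9 | 9.3 | 11.3 | 25 | 260 |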
