<img src= 'https://www.bbds.ma/wp-content/uploads/2021/04/logo.jpg' width=300/>

## Project Guide  
------------  
- [Project Overview](#project-overview)  
- [Part 1: Reading Data - Exploratory Data Analysis with Pandas](#I)
- [Part 2: Visual data analysis in Python](#II)
- [Part 3: Data Pre-processing &  Preparation](#III)
- [Part 4: Predictive Analytics](#IV)
- [Part 5: Optimization (Hyper Parameter Tuning)](#V)

<details>
<summary>
Roadmap for Building Machine Learning Models
</summary>
<p>


    1. Prepare Problem  
    a) Define The Business Objective  
    b) Select the datasets  
    c) Load dataset  
    d) Load libraries  


**Data Pre-processing**  
This is the first step in building a machine learning model. Data pre-processing refers to the transformation of data
before feeding it into the model. It deals with the techniques that are used to convert unusable raw data into clean 
reliable data.  
  
Since data collection is often not performed in a controlled manner, raw data often contains outliers 
(for example, age = 120), nonsensical data combinations (for example, model: bicycle, type: 4-wheeler), missing values, 
scale problems, and so on. Feeding such data directly into a machine learning model can 
compromise the quality of the results, which makes pre-processing one of the most important steps in a data science project.  
  

    2. Summarize Data  
    a) Descriptive statistics  
    b) Data visualizations  

    3. Prepare Data  
    a) Data Cleaning  
    b) Feature Selection  
    c) Data Transformation  

**Model Learning**  
After pre-processing the data and splitting it into train/test sets (more on this later), we move on to modeling. Models 
are nothing but sets of well-defined methods called algorithms that use pre-processed data to learn patterns, which can 
later be used to make predictions. There are different types of learning algorithms, including supervised, semi-supervised, 
unsupervised, and reinforcement learning. These will be discussed later.
  
    4. Modeling Strategy  
    a) Select Suitable Algorithms  
    b) Select Training/Testing Approaches  
    c) Train   
  
  
**Model Evaluation**  
In this stage, the models are evaluated with the help of specific performance metrics. With these metrics, we can go on to 
tune the hyperparameters of a model in order to improve it. This process is called hyperparameter optimization. We will 
repeat this step until we are satisfied with the performance.  
  
    5. Evaluate Algorithms  
    a) Split-out validation dataset  
    b) Test options and evaluation metric  
    c) Spot Check Algorithms  
    d) Compare Algorithms  
  
**Prediction**  
Once we are happy with the results from the evaluation step, we will then move on to predictions. Predictions are made 
by the trained model when it is exposed to a new dataset. In a business setting, these predictions can be shared with 
decision makers to make effective business choices.  
  
    6. Improve Accuracy  
    a) Algorithm Tuning  
    b) Ensembles  

**Model Deployment**  
The whole process of machine learning does not just stop with model building and prediction. It also involves making use 
of the model to build an application with the new data. Depending on the business requirements, the deployment may be a 
report, or it may be some repetitive data science steps that are to be executed. After deployment, a model needs proper 
management and maintenance at regular intervals to keep it up and running.  

    7. Finalize Model  
    a) Predictions on validation dataset  
    b) Create standalone model on entire training dataset  
    c) Save model for later use  


</p>
</details>

<a id="I"></a>

# I.  Reading Data - Exploratory Data Analysis with Pandas

### Article outline
1. Demonstration of main Pandas methods
2. First attempt at predicting Auto Insurance Fraud
3. Useful resources

### 1. Demonstration of main Pandas methods 

**[Pandas](http://pandas.pydata.org)** is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like `.csv`, `.tsv`, or `.xlsx`. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with `Matplotlib` and `Seaborn`, `Pandas` provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in `Pandas` are implemented with **Series** and **DataFrame** classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of `Series` instances. `DataFrames` are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
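To make the two structures concrete, here is a minimal sketch that builds a `DataFrame` from a dictionary of `Series`. The names `age`, `premium`, and `toy`, as well as all values, are invented for illustration and are not taken from the insurance dataset.

```python
import pandas as pd

# Two Series sharing the same index; names and values are made up for illustration
age = pd.Series([34, 51, 27], index=["a", "b", "c"], name="age")
premium = pd.Series([1200.5, 980.0, 1430.75], index=["a", "b", "c"], name="premium")

# A DataFrame behaves like a dictionary of Series: one Series per column
toy = pd.DataFrame({"age": age, "premium": premium})

print(toy.shape)          # (number of rows, number of columns)
print(toy.dtypes)         # each column keeps a single data type
print(type(toy["age"]))   # selecting one column returns a Series
```

Rows play the role of instances and columns the role of features, which is the layout the rest of this notebook relies on.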

In [None]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sns
sns.set()  # apply Seaborn's default plot theme
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')


We’ll demonstrate the main methods in action by analyzing a dataset of auto insurance claims (`insurance_claimsV4.csv`), in which the `fraud_reported` column is the target we want to predict. Let’s read the data (using `read_csv`), and take a look at the first 5 lines using the `head` method:


In [None]:
# Display up to 70 columns
pd.options.display.max_columns=70

In [None]:
autinsurance = pd.read_csv('insurance_claimsV4.csv').drop('Unnamed: 0', axis = 1)
autinsurance.head()

In [None]:
autinsurance.columns

In [None]:
#autinsurance = autinsurance.drop('Unnamed: 0', axis = 1)

In [None]:
autinsurance.shape

In [None]:
autinsurance['fraud_reported'].value_counts(normalize=True)

In [None]:
autinsurance['fraud_reported'].hist()

In [None]:
# Separate the features (X) from the target (y); axis=1 drops a column, axis=0 would drop a row
X = autinsurance.drop('fraud_reported', axis=1)
y = autinsurance['fraud_reported']

### 1.3 Data split & Scaling Data Preprocessing

In [None]:
# Split the dataset into a training set (80%) and a test set (20%)
from sklearn.model_selection import train_test_split

training_features, test_features, training_target, test_target = train_test_split(
    X, y, test_size=0.2, random_state=42
)


### 1.4 Establishing a Baseline
Establishing a baseline is one of the first steps in any machine learning
project. A baseline is a simple model that we train on the data to obtain a reference accuracy
against which we compare the real models we are going to try. This tells us whether those
models actually provide any improvement.  
One type of model that we can use as a baseline is called a dummy model. Dummy
models do not learn anything from the features; they generate their decisions by following a
rule that may or may not be related to the data. For example, a dummy model for our
problem is one that outputs 0 or 1 at random with a 50% chance for each; this is an
example of a dummy rule that is not related to the data. Another dummy model is one that
always outputs the most frequent label in the training data; this rule depends on
the data, but the model learns no relationship between features and target.  
These dummy models are provided in scikit-learn in the `sklearn.dummy` module. All
of them are implemented in the `DummyClassifier` class, which accepts a `strategy`
parameter at initialization. This parameter determines which rule the model is going
to use. Here, we're going to use the `most_frequent` strategy, which always returns the most
frequent label in the training data.
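
The two dummy rules described above can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it; the helper names `most_frequent_rule` and `uniform_rule` and the labels in `toy_labels` are hypothetical. In scikit-learn the corresponding options are `strategy="most_frequent"` and `strategy="uniform"`.

```python
import random
from collections import Counter

def most_frequent_rule(training_labels):
    """Return a predictor that always outputs the majority training label."""
    majority_label, _ = Counter(training_labels).most_common(1)[0]
    return lambda n_samples: [majority_label] * n_samples

def uniform_rule(labels, seed=0):
    """Return a predictor that picks each class with equal probability."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    return lambda n_samples: [rng.choice(classes) for _ in range(n_samples)]

# Hypothetical, imbalanced training labels (not taken from the dataset)
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]

predict_majority = most_frequent_rule(toy_labels)
predict_random = uniform_rule(toy_labels)

print(predict_majority(5))  # five copies of the majority label
print(predict_random(5))    # five labels drawn at random
```

Neither predictor looks at the features, which is exactly why the resulting score is a useful lower bound for real models.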



### Using a Dummy Classifier

As a first classifier, you can apply the built-in [`DummyClassifier` class from `sklearn.dummy`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) to set a baseline for the performance of our future models.  This classifier does not actually use the feature matrix `training_features`; classification decisions are made using the target vector `training_target` only.  There are a few strategies, but we'll start with the `'most_frequent'` strategy.  That is, the `predict` method always returns the majority class. For our fraud detection problem, this is whichever value of `fraud_reported` occurs most often, as shown by the class proportions below.

In [None]:
autinsurance['fraud_reported'].value_counts(normalize=True)

In [None]:
from sklearn.dummy import DummyClassifier

dummy_baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class

dummy_baseline.fit(training_features, training_target)


Having applied the `fit` method to the training data, you can use the `predict` method to see how this estimator classifies the data. Unsurprisingly, it returns a vector filled with a single value, the majority class of `fraud_reported` in the training data.

In [None]:
test_target_pred = dummy_baseline.predict(test_features)
print(test_target_pred)

You can find the fraction of correct classifications using the method `score` with the test data:

In [None]:
score = dummy_baseline.score(test_features, test_target)
print('The fraction of correct classifications is: {:5.3f}'.format(score))

Using `dummy_baseline.score` is equivalent to explicitly comparing the entries of `test_target_pred` to `test_target`, counting the number of correct classifications, and dividing by the total number of classifications.
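
The equivalence described above can be sketched in plain Python. The helper `fraction_correct` and the labels in `toy_true` and `toy_pred` are hypothetical and only illustrate the counting; they are not results from this notebook.

```python
def fraction_correct(true_labels, predicted_labels):
    """Count matching entries and divide by the total number of predictions."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical labels: a majority-class predictor on five observations
toy_true = [0, 1, 0, 0, 1]
toy_pred = [0, 0, 0, 0, 0]

print(fraction_correct(toy_true, toy_pred))
```

With the variables already defined in this notebook, `(test_target_pred == test_target).mean()` should give the same value as the `score` call.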

For classification problems, a *confusion matrix* is a more detailed description of the accuracy of a classifier. It contains entries for the actual values as rows and predicted values as columns. This means we have:

| | **predicted negative** | **predicted positive** |
| ---- | ----------- | ---------- |
| **actual negative** |  true negative | false positive |
| **actual positive** |  false negative | true positive |


The preceding definition generalizes to the multi-class classification problems as well.
In *Scikit-Learn*, the `confusion_matrix` function takes as arguments the actual labels followed by the predicted labels (labelled in ascending order according to the class labels). From the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html):

> `sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)`
>
> Compute confusion matrix to evaluate the accuracy of a classification
>
> By definition a confusion matrix $C$ is such that $C_{i,j}$ is equal to the number of observations known to be in group $i$ but predicted to be in group $j$.
>
> Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$, and false positives is $C_{0,1}$.
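
As a rough illustration of the definition quoted above, the following sketch builds $C$ by hand and unpacks the four binary counts. The function `confusion_counts` and the toy labels are hypothetical; in practice `sklearn.metrics.confusion_matrix` does this for you.

```python
def confusion_counts(true_labels, predicted_labels, labels):
    """Build C where C[i][j] counts items of class labels[i] predicted as labels[j]."""
    position = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        matrix[position[t]][position[p]] += 1
    return matrix

# Hypothetical binary labels, with 0 as the negative class and 1 as the positive class
toy_true = [0, 0, 1, 1, 0, 1]
toy_pred = [0, 1, 1, 0, 0, 1]

C = confusion_counts(toy_true, toy_pred, labels=[0, 1])
(tn, fp), (fn, tp) = C  # row 0 = actual negative, row 1 = actual positive
print(C)
print("tn =", tn, "fp =", fp, "fn =", fn, "tp =", tp)
```

Passing the labels in ascending order mirrors the row and column ordering that scikit-learn uses by default.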

In [None]:
# Confusion matrix of the dummy baseline: rows are actual classes, columns are predicted classes
from sklearn.metrics import  confusion_matrix
dummy_baselineCM = confusion_matrix(test_target,test_target_pred)
dummy_baselineCM

In [None]:
# from sklearn.linear_model import Perceptron

# classifier = Perceptron() 
# classifier.fit(training_features, training_target)

# accuracy = classifier.score(test_features, test_target) 
# print("Prediction Accuracy:{:.2f}%".format(accuracy * 100))

### 2. Building Model (Decision Tree)


The decision tree classes have an optional hyperparameter `criterion` that has one of two values, **`gini`** and **`entropy`**. These refer to the quantitative measure that is used to compare putative splittings of the data.

<a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">**Entropy**</a>: *Information entropy* is the average rate at which information is produced by a stochastic source of data.

The information content associated with each possible data value is the negative logarithm of its probability, and the entropy is the expected value of that quantity over all $J$ classes, where $p_i$ is the fraction of items in class $i$:

$$S = - \sum_{i = 1}^{J} p_i \log{p_i}$$
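
A small sketch of the entropy formula, applied to the labels that fall into a single tree node. The function name `entropy` and the choice of base 2 (entropy measured in bits) are assumptions made for this illustration.

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """Shannon entropy of the empirical label distribution."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    # log(1/p) equals -log(p); absent classes never appear, so log(0) cannot occur
    return sum(p * math.log(1 / p, base) for p in probabilities)

print(entropy([0, 0, 1, 1]))   # evenly mixed binary node: the maximum for two classes
print(entropy([0, 0, 0, 0]))   # pure node: no uncertainty
```

A split is attractive when the child nodes have lower entropy than the parent, which is what `criterion='entropy'` measures.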

-----

[**Gini Impurity**](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity): Used by the CART (classification and regression tree) algorithm for classification trees, *Gini impurity* is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. The Gini impurity can be computed by summing the probability $p_i$ of an item with label $i$ being chosen multiplied by the probability $\sum_{k \neq i} p_k = 1 - p_i$ of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.

To compute Gini impurity for a set of items with $J$ classes, suppose $i \in \{1, 2, \ldots, J\}$ and let $p_i$ be the fraction of items labeled with class $i$ in the set.

$$I_G(p) = \sum_{i=1}^{J} p_i \sum_{k \neq i} p_k = \sum_{i=1}^{J} p_i (1 - p_i) = \sum_{i=1}^{J} (p_i - p_i^2) = \sum_{i=1}^{J} p_i - \sum_{i=1}^{J} p_i^2 = 1 - \sum_{i=1}^{J} p_i^2$$
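
The final form of the formula, $1 - \sum_i p_i^2$, is easy to check numerically. The sketch below uses a hypothetical helper `gini_impurity` on made-up node labels.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of the empirical label distribution."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity([0, 0, 1, 1]))   # evenly mixed binary node
print(gini_impurity([0, 0, 0, 0]))   # pure node reaches the minimum
```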


###  2.1 First Method (using function)

In [None]:
# from sklearn.tree import DecisionTreeClassifier as Model

In [None]:
# def train(features, target):
#     model = Model()
#     model.fit(features, target)
#     return model

In [None]:
# def predict(model, new_features):
#     preds = model.predict(new_features)
#     return preds

In [None]:
# # Train on the training split and predict on the held-out test split
# model = train(training_features, training_target)
# predictions = predict(model, test_features)

###  2.2 Second Method 

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Shallow tree (max_depth=4) that splits on information gain (entropy)
DecisionTreeModel = DecisionTreeClassifier(criterion='entropy', random_state=42, max_depth=4)
DecisionTreeModel

In [None]:
%%time
DecisionTreeModel.fit(training_features, training_target)  # Training input and its Target variables

In [None]:
DT_Pred = DecisionTreeModel.predict(test_features)  # predictions to compare against the known labels in test_target

### 2.3  Making the Confusion Matrix
**Accuracy** is perhaps the most intuitive performance measure. It is simply the ratio of correctly predicted observations to the total number of observations.  
**Precision**: Precision is the ratio of correctly predicted positive observations to all observations predicted as positive.  
**Recall**: Recall is also known as sensitivity or the true positive rate. It is the ratio of correctly predicted positive observations to all observations that are actually positive.  
**F1 Score**: The F1 score is the harmonic mean of precision and recall. Therefore, this score takes both false positives and false negatives into account.  
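
As a minimal sketch, precision, recall, and the F1 score (computed here as the harmonic mean of precision and recall) can be derived directly from confusion-matrix counts. The function `precision_recall_f1` and the counts passed to it are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, not results from this notebook
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The guards against zero denominators are a design choice for this sketch; `classification_report` in scikit-learn reports the same three quantities per class.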

In [None]:
# Confusion Matrix
#from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# # Import machine learning modules
# from sklearn.ensemble import GradientBoostingClassifier, partial_dependence

In [None]:
# Confusion Matrix
CMTD = confusion_matrix(test_target, DT_Pred)  # compare the predicted target variable to the original target variable
CMTD

In [None]:
# Cross-tabulate actual vs. predicted labels (classes appear in ascending order)
CMTD = pd.crosstab(test_target, DT_Pred, rownames=['Actual'], colnames=['Predicted'])
fig, ax1 = plt.subplots(ncols=1, figsize=(5, 5))
# Tick labels assume the first class is legitimate and the second is fraudulent
sns.heatmap(CMTD,
            xticklabels=['Legit', 'Fraudulent'],
            yticklabels=['Legit', 'Fraudulent'],
            annot=True, fmt='d', ax=ax1,
            linewidths=.2, linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# Accuracy Score
ADT = accuracy_score(test_target, DT_Pred)

print("Decision Tree Prediction Accuracy: {:.2f}%".format(ADT * 100))


### 2.4  Computing the Accuracy of a Binary Classifier

The most basic way to assess performance is to compare the total number of correct predictions and the total number of observations.  This is the **accuracy**. Using the diagram of our confusion matrix above, the accuracy can be written as

$$\text{accuracy} = \frac{\mathtt{tp} + \mathtt{tn}}{\mathtt{tn} + \mathtt{tp} + \mathtt{fn} + \mathtt{fp}}$$

where $\mathtt{tn}$, $\mathtt{tp}$, $\mathtt{fn}$, and $\mathtt{fp}$ are the number of true negatives, true positives, false negatives, and false positives respectively.

In [None]:
# from sklearn.metrics import classification_report
# #d = DecisionTreeModel.fit(X_train.values, y_train.values.copy(), 50)
# #
# #preds = predict(X_test, d)
# print(classification_report(test_target,DT_Pred))

<a id = "feature-importance"></a>

### 2.5  Plot feature importances

A useful characteristic of decision trees and many tree-based ensemble models is that you can interpret their feature importances. As you learned with decision trees, the most important features are selected first during the construction of a tree. Using the reduction in Gini impurity or the information gain obtained by splitting on a feature, a feature importance score can be calculated.

In the case of ensembles, these feature importance scores are aggregated over all of the trees within the ensemble. `scikit-learn` conveniently exposes a `feature_importances_` attribute on decision trees and on many of its ensemble implementations.
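
To see what this attribute looks like before applying it to the insurance data, here is a small sketch on synthetic data. The dataset comes from `make_classification` and the names `X_toy`, `y_toy`, and `tree` are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only: 5 features, of which 2 are informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_toy, y_toy)

importances = tree.feature_importances_
print(importances)         # one non-negative score per feature
print(importances.sum())   # scores are normalized, so they sum to 1 once the tree has split
```

Features that the tree never splits on receive an importance of exactly zero, which is why many entries can be zero for a shallow tree.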

In [None]:
#DecisionTreeModel.fit(training_features, training_target)
### Verification:
results = pd.DataFrame(index= training_features.columns, data={'importance':DecisionTreeModel.feature_importances_})
print('Feature importances:\n{}'.format(results))

In [None]:
# Plot feature importances
plt.rcParams['figure.figsize'] = 20,30
plt.title('Normalized Feature Importances')
sns.barplot(x = DecisionTreeModel.feature_importances_, y =training_features.columns, orient = 'h')
plt.show()

In [None]:
feature_importances = pd.DataFrame({'Importance Coef' :DecisionTreeModel.feature_importances_ , 'Features' : training_features.columns})
feature_importances.nlargest(10, 'Importance Coef')

In [None]:
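# Keep only the features with non-zero importance, most important first
nonzero = feature_importances[feature_importances['Importance Coef'] > 0]
features = nonzero.sort_values('Importance Coef', ascending=False)['Features'].tolist()
features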

In [None]:
autinsurance = autinsurance[['insured_hobbies_chess',
 'vehicle_claim',
 'incident_severity_Total Loss',
 'insured_hobbies_cross-fit',
 'incident_severity_Minor Damage',
 'insured_hobbies_camping',
 'auto_model_Civic',
 'incident_state_WV',
 'insured_occupation_handlers-cleaners','fraud_reported']]

### 2.6  Plot Decision Tree

In [None]:
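import os

# Add the conda-installed Graphviz binaries to PATH (Windows conda layout)
conda_prefix = os.environ.get('CONDA_PREFIX')
if conda_prefix:
    os.environ['PATH'] += os.pathsep + os.path.join(conda_prefix, 'Library', 'bin', 'graphviz')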

In [None]:
from six import StringIO

In [None]:
import six
import sys
sys.modules['sklearn.externals.six'] = six

In [None]:
#! pip install pydotplus

In [None]:
from io import StringIO  # sklearn.externals.six no longer ships with recent scikit-learn releases
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

# Export the fitted tree to DOT format, then render it as a PNG
dot_data = StringIO()
export_graphviz(DecisionTreeModel, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=list(X.columns), class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Auto-Inssurance.png')
Image(graph.create_png())