# Machine Learning Project Lifecycle
**Step 1: Define a problem<br>
Step 2: Data gathering<br>
Step 3: Exploratory Data Analysis<br>
Step 4: Feature engineering and selection<br>
Step 5: Data preparation for modelling** (train/test split)<br>
**Step 6: Model Building<br>
Step 7: Model Validation & Evaluation<br>
Step 8: Model Tuning<br>
Step 9: Model Deployment**<br>

# Decision Tree - Example
## Problem: Predicting risky bank loans using C5.0 decision trees

The default vector indicates whether the loan applicant was unable to meet the agreed payment terms and went into default. A total of 30 percent of the loans in this dataset went into default. We have to train our model and predict such defaulters. 

## Data: 
1. checking_balance        - object
2. months_loan_duration     - int64
3. credit_history          - object
4. purpose                 - object
5. amount                   - int64
6. savings_balance         - object
7. employment_length       - object
8. installment_rate         - int64
9. personal_status         - object
10. other_debtors           - object
11. residence_history        - int64
12. property                - object
13. age                      - int64
14. installment_plan        - object
15. housing                 - object
16. existing_credits         - int64
17. job                     - object
18. dependents               - int64
19. telephone               - object
20. foreign_worker          - object
21. default                  - int64 (Target variable/Label)

a) default = 1 --> Normal Customer<br>
b) default = 2 --> Risky Customer/Defaulter

## Load the necessary packages

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
pd.set_option('display.max_columns',30)
# from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve

## Exploring the data

In [None]:
credit = pd.read_csv('credit.csv')

In [None]:
credit.head()

In [None]:
credit.dtypes

In [None]:
credit.describe() 

In [None]:
credit.isnull().sum() #checking NA values

In [None]:
credit.checking_balance.value_counts() 

In [None]:
credit['savings_balance'].value_counts()

**How does correlation help in feature selection?**<br>
Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.
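As a rough illustration of this idea, the sketch below flags one feature from each highly correlated pair. The toy data, the column names `a`, `b`, `c` and the 0.9 threshold are assumptions made for the example; they are not taken from the credit dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is almost a scaled copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair of features is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, tune it for your own data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Which feature of a correlated pair to drop is a judgment call; here the later column is dropped simply because of how the upper triangle is scanned.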

In [None]:
fig = plt.figure(figsize=(5,5), dpi=150)
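# Correlation is only defined for numeric columns, so select them explicitly
# (recent pandas versions raise an error if string columns are passed to corr()).
sns.heatmap(credit.select_dtypes('number').corr(), annot=True)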
plt.show()

## Data preparation

### Find out the columns which are strings and categorical

- Checking unique values in each column to find the categorical columns.
- The description of the data usually tells us which columns are categorical and which are continuous.

In [None]:
# Checking unique values in each column, just to find the categorical columns.
# Generally the description of the data states which columns are categorical and which are continuous.
credit.nunique()

### LabelEncoder is used for converting categorical string columns to numeric.
- Algorithms from sklearn do not accept input columns of string type.
- So, we need to convert such columns (e.g. "checking_balance" or "purpose" in this dataset) into numbers.
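To see what the encoder does before applying it to the whole dataframe, here is a minimal sketch on a hypothetical list of category values (the values `low`, `medium`, `high` are made up for illustration).

```python
from sklearn.preprocessing import LabelEncoder

values = ["low", "high", "medium", "low"]  # hypothetical categories

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

print(list(encoder.classes_))  # categories in sorted (alphabetical) order
print(list(codes))             # position of each value in classes_
print(list(encoder.inverse_transform(codes)))  # back to the original strings
```

Note that the integer codes follow alphabetical order, not any meaningful ranking of the categories, so the numbers should not be read as "small" or "large" values of the feature.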

In [None]:
# The following columns contain strings and are to be converted into numbers
categorical_cols = ['checking_balance','credit_history','purpose','savings_balance',
                    'employment_length','personal_status','other_debtors','property',
                    'installment_plan','housing', 'job', 'telephone', 'foreign_worker']

In [None]:
# LabelEncoder is used for converting categorical string columns to numeric.
# Read more about LabelEncoder in sklearn documentation.

le = LabelEncoder()
for col in categorical_cols:
    # Taking a column from dataframe, encoding it and replacing same column in the dataframe.
    credit[col] = le.fit_transform(credit[col])

In [None]:
credit.head()      # now all the string columns are converted into numbers

## Split the data into train and test

In [None]:
# Total customers/samples - 1000
credit.shape # 1000 samples with 21 attributes

In [None]:
# Train Data - Selecting 900 rows at random from the dataframe for training
credit_train = credit.sample(900, random_state = 123)

In [None]:
credit_train.shape

In [None]:
credit_train.head(10)

In [None]:
# Test Data - Taking remaining 100 rows for testing by dropping the rows present in train dataframe from original dataframe.
credit_test = credit.drop(credit_train.index)

In [None]:
# Check whether this appears to be a fairly even split or not,
# train should have about 30 percent of defaulted loans 
# and test data also should have similar % of default loans
(credit.default.value_counts()/credit.default.count())*100

In [None]:
# Train data - Ratio of normal and risky customers
(credit_train.default.value_counts()/credit_train.default.count())*100

In [None]:
# Test data - Ratio of normal and risky customers
(credit_test.default.value_counts()/credit_test.default.count())*100
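A random sample only keeps the class ratio approximately. If an exact match is wanted, one option is a stratified split with `train_test_split`. The sketch below uses a hypothetical stand-in dataframe with the same 70/30 class balance rather than the real `credit` data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 700 rows with label 1 and 300 rows with label 2.
toy = pd.DataFrame({"x": range(1000), "default": [1] * 700 + [2] * 300})

# stratify=... keeps the share of each label the same in both parts.
train, test = train_test_split(
    toy, test_size=100, stratify=toy["default"], random_state=123
)

train_share = (train["default"] == 2).mean()
test_share = (test["default"] == 2).mean()
print(len(train), len(test), train_share, test_share)
```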

In [None]:
# Taking the labels in separate objects
train_labels = credit_train.default # y_train
test_labels = credit_test.default # y_test

In [None]:
# Remove Label column from train and test data sets
credit_train.drop("default", axis = 1, inplace=True) # X_train
credit_test.drop("default", axis = 1, inplace=True) # X_test

In [None]:
credit_train.columns

## Training the model (Decision Tree)

In [None]:
# Creating object of the DT with required options 
model = DecisionTreeClassifier(criterion='entropy')

In [None]:
# Training/Build the model with train data
model.fit(credit_train, train_labels)

In [None]:
export_graphviz(model,
                 out_file="tree.dot",
                 feature_names = credit_train.columns, 
                 class_names=["1","2"],
                 filled = True)

In [None]:
!dot -Tpng tree.dot -o tree.png

Install Graphviz:
https://graphviz.org/download/<br>

Online Graphviz:<br>
https://dreampuf.github.io/GraphvizOnline<br>
http://www.webgraphviz.com/

## Prediction

In [None]:
new_data = credit_test.iloc[0,:]

In [None]:
new_data.values

In [None]:
new_data.values.reshape(1, -1)

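- By reshaping the array with `(1, -1)`, the resulting array has exactly 1 row (a single sample) and as many columns as there are features. This is the 2D shape that `predict` expects. Reshaping with `(-1, 1)` would instead give an array with only 1 column.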
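The difference between the two reshape arguments is easy to check on a small array (the values here are arbitrary):

```python
import numpy as np

row = np.array([10, 20, 30])  # one sample with three features

as_one_row = row.reshape(1, -1)  # 1 row, number of columns inferred
as_one_col = row.reshape(-1, 1)  # 1 column, number of rows inferred

print(as_one_row.shape)  # (1, 3)
print(as_one_col.shape)  # (3, 1)
```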

In [None]:
model.predict(new_data.values.reshape(1, -1))

In [None]:
test_labels.head()

In [None]:
# Make predictions on test data
predictions = model.predict(credit_test)

In [None]:
predictions

## Model Evaluation

### Accuracy
Accuracy is the fraction of predictions that match the true labels. It tells us how often our classification model predicts the class label given in the problem statement correctly.
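Accuracy is simply the share of matching labels, which can be verified by hand on a small made-up example (the label arrays below are hypothetical, not model output):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 1, 2, 2, 1, 2, 1, 1, 2, 1])  # hypothetical true labels
y_pred = np.array([1, 1, 2, 1, 1, 2, 2, 1, 2, 1])  # hypothetical predictions

manual = (y_true == y_pred).mean()       # correct predictions / all predictions
library = accuracy_score(y_true, y_pred)
print(manual, library)
```

Keep in mind that with roughly 70 percent normal customers in this dataset, a model that always predicts class 1 would already reach about 70 percent accuracy, so accuracy alone can be misleading.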

In [None]:
accuracy_score(test_labels,predictions)*100

### Confusion Matrix

<center><img src="https://miro.medium.com/max/894/1*bFkVvry-3YY_mHBwa5n8bQ.png"></center>

Once the model is trained, we make predictions on the test dataset. Arranging the results in a matrix like the one in the figure above shows how many of the predictions are right and how many are wrong.
1. **TP (True-positives):** The actual label for that row was “Yes” in the test dataset and our model also predicted “Yes”.
2. **TN (True-negatives):** The actual label for that row was “No” in the test dataset and our model also predicted “No”.
3. **FP (False-positives):** The actual label for that row was “No” in the test dataset but our model predicted “Yes”.
4. **FN (False-negatives):** The actual label for that row was “Yes” in the test dataset but our model predicted “No”.

These 4 cells constitute the “confusion matrix”, which helps clear up any confusion about the quality of our model by giving a clear picture of its predictive power.

**Type I Error**

A type I error is also known as a **false positive** and occurs when a classification model incorrectly predicts a true outcome for an originally false observation.

**Type II Error**

A type II error is also known as a **false negative** and occurs when a classification model incorrectly predicts a false outcome for an originally true observation.

We should try to reduce Type I and/or Type II errors as much as possible. Which of the two matters more depends on the problem.
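The four cells can be read directly from scikit-learn's confusion matrix. The sketch below uses hypothetical labels and assumes that label 2 (risky customer) is treated as the positive class; passing `labels=[1, 2]` fixes the row and column order so that `ravel()` returns the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 2, 2, 1, 2, 1, 1, 2, 1]  # hypothetical true labels
y_pred = [1, 1, 2, 1, 1, 2, 2, 1, 2, 1]  # hypothetical predictions

# Rows are actual labels, columns are predicted labels, both in the order [1, 2].
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("Type I errors (FP):", fp)
print("Type II errors (FN):", fn)
```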

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(test_labels, predictions)

<center><img src="https://miro.medium.com/max/894/1*bFkVvry-3YY_mHBwa5n8bQ.png"></center>

- Scikit-learn sorts labels in ascending order, so label 1 occupies the first row/column and label 2 the second. Rows correspond to actual labels and columns to predicted labels.

### Precision

Precision attempts to answer the following question:<br>
What proportion of positive identifications was actually correct?

<center><img src="https://miro.medium.com/max/292/1*NKFVmakz6V9jb_23gKUqaw.png"></center>

In [None]:
from sklearn.metrics import precision_score

In [None]:
# Precision (P)
# Note: pos_label defaults to 1, so the "positive" class here is label 1 (normal customer).
precision_score(test_labels, predictions)

### Recall

Recall/ Sensitivity/ TPR (True Positive Rate) attempts to answer the following question:<br>
What proportion of actual positives was identified correctly?

<center><img src="https://miro.medium.com/max/352/1*jJeDnEWUjbDLqjbl8WWDnQ.png"></center>
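Because the labels in this dataset are 1 and 2, it matters which one is treated as "positive". The sketch below, on hypothetical labels, compares the default (`pos_label=1`, normal customers) with `pos_label=2` (risky customers), which is an assumption about which class is of interest.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 2, 2, 1, 2, 1, 1, 2, 1]  # hypothetical true labels
y_pred = [1, 1, 2, 1, 1, 2, 2, 1, 2, 1]  # hypothetical predictions

# Default: label 1 is the positive class.
p_normal = precision_score(y_true, y_pred)

# Treat label 2 (risky customer) as the positive class instead.
p_risky = precision_score(y_true, y_pred, pos_label=2)  # TP / (TP + FP)
r_risky = recall_score(y_true, y_pred, pos_label=2)     # TP / (TP + FN)

print(p_normal, p_risky, r_risky)
```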

In [None]:
from sklearn.metrics import recall_score

In [None]:
# Recall (R)
recall_score(test_labels,predictions)

### F1-Score

- In some problem statements higher Recall takes precedence over a higher Precision and vice-versa.
- But in some problem statements, where neither Recall nor Precision clearly takes precedence and we want to give importance to both, there is another metric that can be used: the F1 score. It is the harmonic mean of Precision and Recall.

- In a statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision p and the recall r of the test to compute the score.

<center><img src="https://www.gstatic.com/education/formulas2/355397047/en/f1_score.svg" width=40%></center>
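The formula can be checked against `f1_score` on a small hypothetical example. Label 2 is assumed to be the positive class here.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]  # hypothetical true labels
y_pred = [2, 2, 1, 1, 2, 1, 1, 1, 1, 1]  # hypothetical predictions

p = precision_score(y_true, y_pred, pos_label=2)
r = recall_score(y_true, y_pred, pos_label=2)

manual_f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
library_f1 = f1_score(y_true, y_pred, pos_label=2)
print(p, r, manual_f1, library_f1)
```

The harmonic mean is pulled towards the smaller of the two values, so a model cannot get a high F1 score by doing well on only one of precision or recall.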

In [None]:
from sklearn.metrics import f1_score

In [None]:
# F1-Score
f1_score(test_labels, predictions)

### ROC Curve and AUC Score

**Receiver Operating Characteristics (ROC) curve, Area under the Curve (AUC)**

- To plot a ROC curve, we have to plot False Positive Rate on x-axis and Sensitivity i.e. True Positive Rate on the y-axis.

- The area under the ROC curve is known as AUC. The higher the AUC, the better the model separates the two classes: 0.5 corresponds to random guessing (the diagonal line) and 1.0 to a perfect classifier. The farther the ROC curve lies above the diagonal, the better the model. This is how ROC-AUC helps us judge the performance of a classification model and gives us a means to select one model from many.

<center><img src="https://miro.medium.com/max/6048/1*vlUaNZwMzoRsk2dBVNvnvQ.jpeg" width=50%></center>
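A ROC curve is traced by varying a decision threshold, so it is usually computed from predicted probabilities rather than hard class labels. The following is a minimal sketch on synthetic data from `make_classification` (not the credit dataset); the `max_depth=3` setting and all variable names are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with labels 0 and 1.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)

# The columns of predict_proba follow clf.classes_; pick the positive class (label 1).
pos_index = list(clf.classes_).index(1)
proba = clf.predict_proba(X_te)[:, pos_index]

fpr, tpr, thresholds = roc_curve(y_te, proba, pos_label=1)
auc_value = roc_auc_score(y_te, proba)
print(round(auc_value, 3))
```

With hard 0/1 predictions the curve has only a single interior point, while probabilities give one point per distinct threshold.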

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

In [None]:
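# roc_auc_score treats the larger label (2 = risky customer) as the positive class,
# so use pos_label=2 here to keep the plotted curve consistent with the AUC value.
fpr, tpr, thresholds = roc_curve(test_labels, predictions, pos_label=2)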

In [None]:
roc_auc_score(test_labels, predictions)

In [None]:
plt.figure()
plt.plot(fpr, tpr, color='darkorange',lw=2,
         label=f'ROC curve (area ={round(roc_auc_score(test_labels, predictions),2)})')
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

## k-fold Cross Validation

In [None]:
from sklearn.model_selection import KFold

In [None]:
credit_train.head()

In [None]:
credit_train.reset_index(drop = True, inplace=True)
train_labels.reset_index(drop = True, inplace=True)

In [None]:
cv = KFold(n_splits=5, random_state=42, shuffle=True)

In [None]:
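import os

scores = []
cv_model = DecisionTreeClassifier(criterion='entropy')
os.makedirs('models', exist_ok=True)  # joblib.dump fails if the target folder does not exist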

for i, index in enumerate(cv.split(credit_train)):
    train_index = index[0]
    test_index = index[1]
    
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)

    X_train = credit_train.loc[train_index]
    y_train = train_labels[train_index]
    
    X_test = credit_train.loc[test_index]   
    y_test = train_labels[test_index]
    
    cv_model.fit(X_train, y_train)
    scores.append(cv_model.score(X_test, y_test))
    joblib.dump(cv_model,f'models/DTmodel_{i}.joblib')

In [None]:
scores
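When the per-fold models do not need to be saved, the same kind of loop can be written more compactly with `cross_val_score`. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; it is an alternative under that assumption, not a replacement for the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

# One accuracy value per fold; a fresh copy of the estimator is fitted each time.
cv_scores = cross_val_score(tree, X, y, cv=cv)
print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean together with the standard deviation gives a sense of how much the score varies from fold to fold.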

In [None]:
model = joblib.load("models/DTmodel_0.joblib")

In [None]:
predictions = model.predict(credit_test)
accuracy_score(test_labels, predictions)

### EXERCISE: Use other model evaluation metrics to evaluate the model.