# Machine Learning Project Lifecycle
**Step 1: Define a problem<br>
Step 2: Data gathering<br>
Step 3: Exploratory Data Analysis<br>
Step 4: Feature engineering and selection<br>
Step 5: Data preparation for modelling** (train/test split)<br>
**Step 6: Model Building<br>
Step 7: Model Validation & Evaluation<br>
Step 8: Model Tuning<br>
Step 9: Model Deployment**<br>

# Decision Tree - Example
## Problem: Predicting risky bank loans using decision trees (scikit-learn's CART implementation with the entropy criterion, not C5.0)

The `default` column indicates whether the loan applicant failed to meet the agreed payment terms and went into default. 30 percent of the loans in this dataset went into default. The goal is to train a model that predicts such defaulters.

## Data: 
1. checking_balance        - object
2. months_loan_duration     - int64
3. credit_history          - object
4. purpose                 - object
5. amount                   - int64
6. savings_balance         - object
7. employment_length       - object
8. installment_rate         - int64
9. personal_status         - object
10. other_debtors           - object
11. residence_history        - int64
12. property                - object
13. age                      - int64
14. installment_plan        - object
15. housing                 - object
16. existing_credits         - int64
17. job                     - object
18. dependents               - int64
19. telephone               - object
20. foreign_worker          - object
21. default                  - int64 (Target variable/Label)

a) default = 1 --> Normal Customer<br>
b) default = 2 --> Risky Customer/Defaulter

## Load the necessary packages

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
#pd.set_option('display.max_columns',30)

## Exploring the data

In [None]:
credit = pd.read_csv('credit.csv')

In [None]:
credit.head()

In [None]:
credit.dtypes

In [None]:
credit.describe() 

In [None]:
credit.isnull().sum() #checking NA values

In [None]:
credit.checking_balance.value_counts() 

In [None]:
credit['savings_balance'].value_counts()

## Data preparation

### Find the columns that are strings and categorical

- Check the number of unique values in each column to find the categorical columns.
- The data description usually states which columns are categorical and which are continuous.

In [None]:
# Count the unique values in each column to find the categorical columns.
# The data description usually states which columns are categorical and which are continuous.
for i in credit.columns:
    print(i,credit[i].nunique())
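Unique-value counts are one signal; the column dtype is another. A minimal sketch, assuming a small made-up frame (`toy`) in place of `credit`, of picking the non-numeric columns by dtype:

```python
import pandas as pd

# Toy frame standing in for `credit`; column names come from the data
# description above, the values are made up for illustration.
toy = pd.DataFrame({
    "checking_balance": ["< 0 DM", "unknown", "< 0 DM"],
    "amount": [1169, 5951, 2096],
    "housing": ["own", "rent", "own"],
})

# Every column that is not numeric is a candidate for encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

On the real data the same call would be made on `credit`. Integer-coded categories (if any) would not be caught this way, so the unique-value check is still useful.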

### LabelEncoder is used for converting categorical string columns to numeric.
- scikit-learn estimators do not accept string-valued input columns.
- So, we need to convert such columns (e.g. "checking_balance" or "purpose" in this dataset) into numbers.
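To see what the encoding does, here is a minimal sketch on a made-up list of values (not taken from `credit.csv`). It shows that `LabelEncoder` assigns integer codes in sorted order of the distinct values and that the mapping can be read back from `classes_`:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category values, for illustration only.
values = ["own", "rent", "for free", "own"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the distinct values in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(codes.tolist())
print(mapping)

# The codes can be translated back to the original strings.
print(le.inverse_transform(codes).tolist())
```

Two points follow from this. The codes impose an alphabetical order that carries no meaning for the model. And an encoder that is refitted on one column after another only remembers the mapping of the last column, so keep one encoder per column if the mappings are needed later.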

In [None]:
# The following string columns are to be converted into numbers
categorical_cols = ['checking_balance','credit_history','purpose','savings_balance','employment_length','personal_status','other_debtors','property','installment_plan','housing', 'job', 'telephone', 'foreign_worker']

In [None]:
# LabelEncoder is used for converting categorical string columns to numeric.
# Read more about LabelEncoder in sklearn documentation.

le = LabelEncoder()
for col in categorical_cols:
    # Taking a column from dataframe, encoding it and replacing same column in the dataframe.
    credit[col] = le.fit_transform(credit[col])

In [None]:
credit.head()      # now all the string columns are converted into numbers

## Split the data into train and test

In [None]:
# Total customers/samples - 1000
credit.shape # 1000 samples with 21 attributes

In [None]:
# Train Data - Selecting 900 rows at random from the dataframe for training
credit_train = credit.sample(900, random_state = 123)

In [None]:
# Test Data - Taking remaining 100 rows for testing by dropping the rows present in train dataframe from original dataframe.
credit_test = credit.drop(credit_train.index)

In [None]:
# Check whether this appears to be a fairly even split or not,
# train should have about 30 percent of defaulted loans 
# and test data also should have similar % of default loans
(credit.default.value_counts()/credit.default.count())*100

In [None]:
# Train data - Ratio of normal and risky customers
(credit_train.default.value_counts()/credit_train.default.count())*100

In [None]:
# Test data - Ratio of normal and risky customers
(credit_test.default.value_counts()/credit_test.default.count())*100
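A random sample only approximately preserves the class ratio. If an exact match is wanted, one option is a stratified split. The sketch below is an alternative to the approach above, not what the notebook does; it uses a synthetic frame (`toy`) with the same 70/30 ratio and scikit-learn's `train_test_split` with `stratify`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `credit`: 1000 rows, 30% labelled 2 (risky).
toy = pd.DataFrame({
    "amount": range(1000),
    "default": [1] * 700 + [2] * 300,
})

# stratify keeps the share of each class the same in both parts.
train, test = train_test_split(
    toy, test_size=100, stratify=toy["default"], random_state=123
)

train_share = (train["default"] == 2).mean()
test_share = (test["default"] == 2).mean()
print(len(train), len(test))
print(train_share, test_share)
```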

In [None]:
# Store the labels in separate objects
train_labels = credit_train.default
test_labels = credit_test.default

In [None]:
credit_train.drop("default", axis = 1, inplace=True)
credit_test.drop("default", axis = 1, inplace=True)

## Training the model (Decision Tree)

In [None]:
# Creating object of the DT with required options 
model = DecisionTreeClassifier(criterion='entropy')
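With `criterion='entropy'`, splits are chosen to reduce the Shannon entropy of the class labels in the resulting nodes. A small sketch of that quantity, using a hypothetical helper `entropy` and the 70/30 class ratio stated in the problem description:

```python
import math

def entropy(proportions):
    """Shannon entropy, in bits, of a list of class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Class mix of the whole dataset: 70% normal, 30% risky.
root_entropy = entropy([0.7, 0.3])
print(round(root_entropy, 4))  # 0.8813

# A perfectly balanced node is the most impure case for two classes.
print(entropy([0.5, 0.5]))  # 1.0
```

A split is useful when the weighted entropy of the child nodes is lower than that of the parent; a node containing a single class has entropy 0.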

In [None]:
# Train (build) the model on the training data
model.fit(credit_train, train_labels)

In [1]:
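# Write the fitted tree to a Graphviz .dot file.
# feature_names is passed as a plain list, which every scikit-learn version accepts.
export_graphviz(model,
                 out_file="tree.dot",
                 feature_names=list(credit_train.columns),
                 class_names=["1", "2"],
                 filled=True)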

In [None]:
!dot -Tpng tree.dot -o tree.png

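The `dot` command above requires Graphviz to be installed: https://graphviz.org/download/<br>
Alternatively, paste the contents of `tree.dot` into an online viewer: https://dreampuf.github.io/GraphvizOnline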
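If Graphviz is not available, the learned rules can also be printed as plain text. A minimal sketch using `sklearn.tree.export_text` on a tiny made-up dataset (the feature names are borrowed from this dataset, the numbers and `toy_model` are illustrative only):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [months_loan_duration, amount], with labels 1 (normal) / 2 (risky).
X = [[6, 1000], [12, 1500], [36, 8000], [48, 9000]]
y = [1, 1, 2, 2]

toy_model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# export_text returns the tree as an indented, human-readable string.
rules = export_text(toy_model, feature_names=["months_loan_duration", "amount"])
print(rules)
```

For the notebook's model the equivalent call would presumably be `export_text(model, feature_names=list(credit_train.columns))`.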

In [None]:
new_data = credit_test.iloc[[0]]  # first test row, kept as a one-row DataFrame

In [None]:
model.predict(new_data)  # the DataFrame carries the feature names the model was fitted with

In [None]:
test_labels.head()

In [None]:
# Make predictions on test data
predictions = model.predict(credit_test)

## Evaluate the model (DT)

### Accuracy

In [None]:
accuracy_score(test_labels,predictions)*100
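Accuracy alone can hide how well the risky class is detected, because 70 percent of the customers are normal. The sketch below uses made-up labels (not the notebook's predictions) to show how `confusion_matrix`, which is already imported above, separates the two kinds of error, and how accuracy compares with always predicting the majority class:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for ten customers: 1 = normal, 2 = risky.
y_true = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 2, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [1, 2].
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
print(cm)

acc = accuracy_score(y_true, y_pred)

# Share of truly risky customers that were flagged as risky.
recall_risky = cm[1, 1] / cm[1].sum()

# Baseline: always predict the majority class (normal).
baseline = accuracy_score(y_true, [1] * len(y_true))
print(acc, recall_risky, baseline)
```

In this toy case the accuracy equals the baseline even though only one of the three risky customers is caught. On the real data the same calls would take `test_labels` and `predictions`.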