# Intro to Machine Learning
- Date: 14/08/2023
- Author: David Santiago Barreto Mora
- Theme: Introduction to machine learning.
- Topic: Classification models in Scikit-Learn.
- **Dataset Used:** Investigation of non-payments (defaults) of Taiwanese clients.
---
# "Bagging Classifier"
This classifier is a meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by averaging or by voting) to form a final prediction.

A meta-estimator of this type can be used to reduce the variance of a black-box estimator (for example, a decision tree) by introducing randomization into its construction procedure and then building an ensemble out of it.
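
As a rough illustration of the idea, the sketch below bags decision trees on synthetic data. The data, the choice of `DecisionTreeClassifier` as base estimator, and the parameter values are assumptions made for the example; they are not taken from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real dataset (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The base estimator is passed positionally so the call does not depend on
# the keyword name, which differs between scikit-learn versions.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,   # number of base classifiers in the ensemble
    max_samples=0.8,   # each one is fit on a random draw of 80% of the training rows
    random_state=0,
)
bagging.fit(X_tr, y_tr)

print(len(bagging.estimators_))    # number of fitted base classifiers
print(bagging.score(X_te, y_te))   # accuracy of the aggregated prediction
```

Each fitted tree sees a different random subset, so their individual errors are only partly correlated; aggregating them is what lowers the variance.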

## **Random forest** estimator
A "random forest" is a meta-estimator that fits a number of decision-tree classifiers on various subsamples of the dataset and averages their predictions to improve predictive accuracy and control overfitting.

The size of each subsample is controlled by the parameter *"max_samples"* when the parameter *"bootstrap"* is *True* (its default value). Otherwise, the whole dataset is used to build each decision tree.
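
The two configurations described above could look roughly like this. It is a minimal sketch on synthetic data; the sample sizes and parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)

# bootstrap=True (the default): each tree is fit on a bootstrap sample,
# and max_samples=0.5 draws half as many rows as the dataset has.
forest_sub = RandomForestClassifier(
    n_estimators=20, bootstrap=True, max_samples=0.5, random_state=1
)
forest_sub.fit(X_demo, y_demo)

# bootstrap=False: every tree is fit on the whole dataset,
# so max_samples is left at its default (None).
forest_full = RandomForestClassifier(n_estimators=20, bootstrap=False, random_state=1)
forest_full.fit(X_demo, y_demo)

# predict_proba averages the class probabilities predicted by the trees.
proba = forest_sub.predict_proba(X_demo[:3])
print(proba.shape)
```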

## **Gradient boosting** for classification
This algorithm builds an additive model in a forward stage-wise fashion, which allows the optimization of arbitrary differentiable loss functions. In each stage, *n_classes* regression trees are fit on the negative gradient of the loss function (for example, the binary or multiclass log loss). Binary classification is a special case in which only a single regression tree is fit per stage.
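
The "one tree per class per stage" structure can be inspected on a fitted model. The sketch below uses synthetic data and arbitrary parameter values, both assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Binary problem: a single regression tree is fit per boosting stage.
X_bin, y_bin = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=2, random_state=2
)
gb_bin = GradientBoostingClassifier(n_estimators=15, learning_rate=0.1, random_state=2)
gb_bin.fit(X_bin, y_bin)

# Three-class problem: one regression tree per class is fit at each stage.
X_multi, y_multi = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=2
)
gb_multi = GradientBoostingClassifier(n_estimators=15, learning_rate=0.1, random_state=2)
gb_multi.fit(X_multi, y_multi)

# estimators_ is an array of shape (number of stages, trees per stage).
print(gb_bin.estimators_.shape)
print(gb_multi.estimators_.shape)
```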



# Implementation
We first import the libraries we'll use for the lab.

In [22]:
# General purpose libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning and classifiers
from sklearn.svm import SVC         # Support vector machine
from sklearn.ensemble import BaggingClassifier  
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


# Training and model scores.
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score, confusion_matrix, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

## Data preparation
We now import the dataset we'll use for the lab.


In [23]:
creditCardDF = pd.read_csv("datos_credit_card.csv")
creditCardDF.head(5)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


We also list all the columns in the dataset.

In [24]:
creditCardDF.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

Next, we'll rename some columns of the dataset. We want the column names to be uniform (none separated by spaces, for example).

In [27]:
creditCardDF.rename(columns={'PAY_0':'PAY_1', 'default payment next month': 'defaultPayment'} , inplace=True)
creditCardDF.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,defaultPayment
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


Now we show the data types of the dataframe.

In [28]:
creditCardDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   LIMIT_BAL       30000 non-null  int64
 1   SEX             30000 non-null  int64
 2   EDUCATION       30000 non-null  int64
 3   MARRIAGE        30000 non-null  int64
 4   AGE             30000 non-null  int64
 5   PAY_1           30000 non-null  int64
 6   PAY_2           30000 non-null  int64
 7   PAY_3           30000 non-null  int64
 8   PAY_4           30000 non-null  int64
 9   PAY_5           30000 non-null  int64
 10  PAY_6           30000 non-null  int64
 11  BILL_AMT1       30000 non-null  int64
 12  BILL_AMT2       30000 non-null  int64
 13  BILL_AMT3       30000 non-null  int64
 14  BILL_AMT4       30000 non-null  int64
 15  BILL_AMT5       30000 non-null  int64
 16  BILL_AMT6       30000 non-null  int64
 17  PAY_AMT1        30000 non-null  int64
 18  PAY_AMT2        30000 non-

## Data identification
- 'LIMIT_BAL': Amount of the given credit (NT dollar). Includes both personal and family credit.

- 'SEX': Gender of the person. (1 = male) , (2 = female)

- 'EDUCATION': Education level. (1 = Graduate school), (2 = University), (3 = High school), (4 = Others)

- 'MARRIAGE': Marital status. (1 = Married), (2 = Single), (3 = )

- 'PAY_1' ... 'PAY_6': History of payments. Monthly payments from 

*The rest of the columns are described in the photo.*

## Construction of models using all features

In [30]:
target_Variable = 'defaultPayment'
x = creditCardDF.drop('defaultPayment', axis=1)
names_features = x.columns

# We normalize data using the robust scaler from SkLearn
robust_Scaler = RobustScaler()
x = robust_Scaler.fit_transform(x)
y = creditCardDF['defaultPayment']

# We now divide data for training and testing. We use the 85-15 split.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=55, stratify=y)
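
To see what the scaling and the stratified split above do, here is a small sketch on toy arrays. All values are hypothetical and made up for illustration; none come from the credit card dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Toy feature with one extreme outlier (hypothetical values).
toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
scaled = RobustScaler().fit_transform(toy)
# RobustScaler subtracts the median and divides by the interquartile range,
# so the outlier barely influences how the other values are scaled.
print(scaled.ravel())

# Imbalanced toy labels: 80% class 0, 20% class 1.
X_toy = np.arange(200).reshape(-1, 1)
y_toy = np.array([0] * 160 + [1] * 40)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=55, stratify=y_toy
)

# With stratify, the share of class 1 is (nearly) the same in both partitions.
print(y_tr.mean(), y_te.mean())
```

Stratifying matters for default prediction because defaults are usually the minority class; without it, a small test partition could end up with a noticeably different default rate than the training partition.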


Next, we create a function to compute the confusion matrix.

In [None]:
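def ConfussionMatrix(CM):
    # Wrap a 2x2 confusion matrix in a labelled DataFrame
    # (rows = true class, columns = predicted class).
    labels = ['PAY', 'default']
    df = pd.DataFrame(data=CM, index=labels, columns=labels)
    df.index.name = 'TRUE'
    df.columns.name = 'PREDICTION'
    # Append a 'Total' row. df.sum(axis=1) holds the row sums (samples per
    # true class); pandas aligns them to the columns by label.
    df.loc['Total'] = df.sum(axis=1)

    return df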
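
A possible way to use this helper is to feed it the output of `confusion_matrix` from scikit-learn. The sketch below repeats the helper so it runs on its own; the label vectors are hypothetical and made up for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def ConfussionMatrix(CM):
    # Same helper as in the lab, repeated so this example is self-contained.
    labels = ['PAY', 'default']
    df = pd.DataFrame(data=CM, index=labels, columns=labels)
    df.index.name = 'TRUE'
    df.columns.name = 'PREDICTION'
    df.loc['Total'] = df.sum(axis=1)
    return df

# Hypothetical labels (0 = PAY, 1 = default), not taken from the dataset.
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
cm_table = ConfussionMatrix(cm)
print(cm_table)
```

Note that each cell of the 'Total' row holds the number of samples whose true class matches that column's label, not the number of predictions in that column.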