# Anyone AI

# Project III - Home Credit Default Risk

You've been learning a lot about Machine Learning Algorithms, now we you're gonna be asked to put it all together. 

You will create a complete pipeline to preprocess the data, train your model and then predict values for the [Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk/) Kaggle competition.

## Introduction

Kaggle is a web platform and community for data scientist and machine learning engineers where competetitions and datasets are regularly published.

This particular competition is a binary Classification task: we want to predict whether the person applying for a home credit will be able to repay its debt or not. The competition finished 4 years ago, so you will find a lot of blog posts and code written for it, we encourage you to read everything you can about it.

The dataset is composed of multiple files with different information about loands taken. In this project we're going to exclusively work with the main files: application_train.csv and application_test.csv.

The competition uses [Area Under the ROC Curve](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=es_419) as the evaluation metric, so our models will have to return the probabilities that a loan is not paid for each row.

In [None]:
# pip install category_encoders


In [8]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

### Getting the data

1- Login to Kaggle (if you don't have an account you'll have to register to get it) and download the [complete dataset](https://www.kaggle.com/competitions/home-credit-default-risk/data). Read the information about the data. What does a row in the main file represent? What does the target variable means?

One row represents one loan in our data sample.
The target variable says wether the loan was repaid (0) or not (1)

2- Load the training and test datasets, we're only going to work withe "application_train.csv" and "application_test.csv" for now

In [21]:
application_train = pd.read_csv("application_train.csv")

In [22]:
application_test = pd.read_csv("application_test.csv")

### Exploratory Data Analysis

A lot of the analysis of the data can be found on public available Kaggle kernels or blog posts, but you need to make sure you understand the datasets properties before starting working on it, so we'll do exploratory data analysis for the main files

**Dataset Basics**

1- Show the shape of the training and test datasets.

In [None]:
print(f"The dataset has {application_train.index.shape} observations and {len(application_train.columns)} columns.")

In [None]:
print(f"The dataset has {application_test.index.shape} observations and {len(application_test.columns)} columns.")

**The training dataset include the Target**.

2- List all columns in the train dataset


In [None]:
columns = list(application_train.columns)

In [None]:
print(f"Columns name: {columns}")

In [None]:
print(f"Number of columns: {len(columns)}")

3- Show the first 5 records of the training dataset, transpose the dataframe to see each record as a column and features as rows, make sure all features are visualized. Take your time to review what kind of information you can gather from this data.

In [None]:
print(f"Data from the first 5 records:")

application_train.head(6).transpose()

4- Show the distribution of the target variable values: print the total value count and the percentage of each value, plot this relationship.

In [None]:
print(f"Summary of statistics pertaining to the Target column:")

application_train["TARGET"].describe()

In [None]:
print(f"Quantity per label:\n\n {application_train['TARGET'].value_counts()}")

In [None]:
Percentage = round(application_train.groupby("TARGET").size()/len(
    application_train["TARGET"])*100,2).sort_values(ascending=False)
    
print(f"Percentage of each value: {Percentage}")

In [None]:
fig, axes = plt.subplots(figsize=(10, 5))
sns.histplot(data=application_train, x="TARGET", bins=2)
fig.tight_layout()
plt.title("Target total value count ")

5- Show the number of columns of each data type

In [None]:
application_train.dtypes.value_counts()

6- For categorical variables, show the number of distinct values in each column (number of labels)

In [None]:
categorical_features = application_train.select_dtypes(include=['object'])
print(f"Categorical variables:\n {categorical_features.columns}\n")
print(f"Number of categorical variables: {len(categorical_features.columns)}")

In [None]:
list_train = []
for column in categorical_features:
  print(f'Number of labels {column}:{len(dict(categorical_features.groupby(column).size()).keys())}\n')
  print(f'Labels {column}:{dict(categorical_features.groupby(column).size()).keys()}\n')
  list_train += (dict(categorical_features.groupby(column).size()).keys())

In [None]:
categorical_features_test = application_test.select_dtypes(include=['object'])
print(f"Categorical variables:\n {categorical_features_test.columns}\n")
print(f"Number of categorical variables: {len(categorical_features_test.columns)}")

In [None]:
list_test = []

for column in categorical_features_test:
  print(f'Number of labels {column}:{len(dict(categorical_features_test.groupby(column).size()).keys())}\n')
  print(f'Labels {column}:{dict(categorical_features_test.groupby(column).size()).keys()}\n')
  list_test += (dict(categorical_features_test.groupby(column).size()).keys())

In [None]:
set(list_train) - set(list_test)

**The datatrain has labels that the datatest does not.**

- MATERNITY LEAVE
- UNKNOWN
- XNA en gender

7- Analyzing missing data: show the percentage of missing data for each column ordered by percentage descending (show only the 20 columns with higher missing pct)

In [None]:
PercentageMissing = round((application_train.isnull().sum()/len(application_train))*100,2
      ).sort_values(ascending=False)

print(f"Percentage of missing data for each column:\n {PercentageMissing[0:20]}")
print(f"Number of columns with more than half of their null values: {len(PercentageMissing[PercentageMissing > 40])}")


**Analyzing distribution of variables**

1- Show the distribution of credit amounts

In [None]:
ax = sns.distplot(application_train['AMT_CREDIT'])

2- Plot the education level of the credit applicants, show the percentages of each category. Also print the total counts for each category.

In [None]:
application_train.groupby("NAME_EDUCATION_TYPE").size()

In [None]:
round(application_train.groupby("NAME_EDUCATION_TYPE").size()/len(
    application_train["NAME_EDUCATION_TYPE"])*100,2).sort_values(ascending=False)

3- Plot the distribution of ocupation of the loan applicants

In [None]:
round(application_train.groupby("OCCUPATION_TYPE").size()/len(
    application_train["OCCUPATION_TYPE"])*100,2).sort_values(ascending=False)

4- Plot the family status of the applicants

In [None]:
round(application_train.groupby("NAME_FAMILY_STATUS").size()/len(
    application_train["NAME_FAMILY_STATUS"])*100,2).sort_values(ascending=False)


5- Plot the income type of applicants grouped by the target variable

In [None]:
round(application_train.groupby("NAME_INCOME_TYPE").size()/len(
    application_train["NAME_INCOME_TYPE"])*100,2).sort_values(ascending=False)

In [None]:
INCOME_TYPE = pd.DataFrame(round(application_train.groupby(["NAME_INCOME_TYPE","TARGET"]).size()/len(
    application_train["NAME_INCOME_TYPE"])*100,2).reset_index(name="Count").sort_values(by="Count", ascending=False))

INCOME_TYPE

In [None]:
sns.catplot(y="NAME_INCOME_TYPE",x="Count", hue="TARGET", kind="bar" ,data=INCOME_TYPE)

**The maternity leave and unknown tags have null values, the XNA in gender will be corrected.**

**Graph for each qualitative variable**

In [None]:
# Graph for each qualitative variable

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
axes = axes.flat
columnas_object = application_train[["NAME_INCOME_TYPE","NAME_FAMILY_STATUS",
                                     "OCCUPATION_TYPE","NAME_EDUCATION_TYPE"]
                                    ].columns

for i, colum in enumerate(columnas_object):
    application_train[colum].value_counts().plot.barh(ax = axes[i])
    axes[i].set_title(colum, fontsize = 10, fontweight = "bold")
    axes[i].tick_params(labelsize = 10)
    axes[i].set_xlabel("")
    
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Distribution of qualitative variables',
             fontsize = 10, fontweight = "bold");

**We have a very unbalanced dataset since only 8% of the target variable corresponds to an approved loan.
Regarding the applicants, we can see that half obtain their income from a job, the majority are workers, are married (>50%) and have a secondary/special secondary level of study.**

**Of 122 columns, 41 have more than half of their null values.**

**Columns with more than 50% of their null values will be removed.**

## Preprocessing

In this section, you will code a function to make all the data pre processing for the dataset. What you have to deliver is a function that takes the train and test dataframes, processes all features, and returns the transformed data as numpy arrays ready to be used for training.

The function should perform these activities:

- Correct outliers/anomalous values in numerical columns (hint: take a look at the DAYS_EMPLOYED column)
- Impute values for all columns with missing data (use median as imputing value)
- Encode categorical features:
    - If feature has 2 categories encode using binary encoding
    - More than 2 categories, use one hot encoding 
- Feature scaling

Keep in mind that you could get different number of columns in train and test because some category could only be present in one of the dataframes, this could create more one hot encoded columns. You should align train and test to have the same number of columns

In [None]:
numerical_features = application_train.select_dtypes(exclude=['object'])

In [None]:
numerical_features.columns

In [None]:
corr_matrix = numerical_features.corr()
print(corr_matrix["TARGET"].sort_values(ascending=False)[0:5])
print("\n------------------------------------\n")
print(corr_matrix["TARGET"].sort_values(ascending=True)[0:5])


In [None]:
valores_unicos = []

for (index, colname) in enumerate(numerical_features):
    valor_unico = {
        colname: len(application_train[colname].unique())}
    
    valores_unicos.append(valor_unico)

In [None]:
columns_features = []

for i in valores_unicos:
  for j,k in i.items():
    if k > 3:
      columns_features.append(j)

In [None]:
columns_categorics = []

for i in application_train.columns:
  if i not in columns_features:
    if i != "TARGET":
      columns_categorics.append(i)

In [None]:
numerical_features = application_train.drop(list(columns_categorics), axis=1)

In [None]:
plt.figure(figsize=(25, 25))
for i, col in enumerate(numerical_features.columns):
    plt.subplot(16, 5, i+1)
    sns.boxplot(data=numerical_features, x=col)
    plt.title(col)

**Tenemos puntos atipicos en AMT_INCOME_TOTAL, OWN_CAR_AGE, DAYS_EMPLOYED.**

In [None]:
print(f"{application_train['DAYS_EMPLOYED'].value_counts(ascending=False).head(5)}, ")
print(f"{application_train['AMT_INCOME_TOTAL'].sort_values(ascending=False).head(5)}")
print(f"{application_train['AMT_INCOME_TOTAL'].value_counts(ascending=False).head(5)}")
print(f"{application_train['OWN_CAR_AGE'].value_counts(ascending=False).head(5)}")
print(f"{application_train['OWN_CAR_AGE'].sort_values(ascending=False).head(5)}")

**We will transform the positive employee days into NaN and then replace them with the average.
In the case of AMT_INCOME_TOTAL it looks like a missing value so we'll remove it just like in the case of OWN_CAR_AGE.**

In [None]:
application_train.shape, application_test.shape

In [9]:
X_train = application_train.copy()
X_test = application_test.copy()

NameError: name 'application_train' is not defined

In [7]:
def preprocessing(data):

#We remove the columns that have more than 40% null values#
  
  columns_drop = []
  for i, col in enumerate(data):
    if data[col].isnull().sum()/len(data) > 0.4:
      columns_drop.append(col)

  data = data.drop(columns_drop, axis=1)
  

#Correction of outliers#
  try:
    data.drop(data[data["CODE_GENDER"] == "XNA"].index, inplace=True)
    data.drop(data[data["NAME_INCOME_TYPE"] == "Maternity leave"].index, inplace=True)
    data.drop(data[data["NAME_FAMILY_STATUS"] == "Unknown"].index, inplace=True)
    data.drop(data[data["AMT_INCOME_TOTAL"] == 117000000].index, inplace=True)
    data.drop(data[data["OWN_CAR_AGE"] == 91].index, inplace=True)
    data["DAYS_EMPLOYED"] = data["DAYS_EMPLOYED"].replace(365243, 0)
  except:
    pass
 
#Correction missing values#
  imputer = SimpleImputer(strategy="mean")
  data_num = data.select_dtypes(exclude=['object'])
  data_num = data_num.drop("SK_ID_CURR", axis=1).reset_index(drop=True)
  columns_num = data_num.columns
  imputer.fit(data_num)
  X = imputer.transform(data_num)
  data_tr = pd.DataFrame(X, columns=columns_num,
                          index=data_num.index)
    
#Encode categorical features#
  Binary_Encoder = []
  One_Hot_Encoder = []

  data_cat = data.select_dtypes(include=['object'])  

  for i, col in enumerate(data_cat):
    if data_cat[col].nunique() == 2:
      Binary_Encoder.append(col)
    else:
      One_Hot_Encoder.append(col)

  data_cat_binary = data_cat[Binary_Encoder].reset_index(drop=True)
  data_cat_OneHot = data_cat[One_Hot_Encoder].reset_index(drop=True)

  #"More than 2 categories"#

  cat_encoder1 = OneHotEncoder()  
  data_cat_1Hot = cat_encoder1.fit_transform(data_cat_OneHot).toarray()
  cat_encoder1_names = cat_encoder1.get_feature_names_out()
  data_1Hot = pd.DataFrame(data_cat_1Hot, columns = cat_encoder1_names)

  #2 categories#

  cat_encoder2 = ce.BinaryEncoder(cols=Binary_Encoder,return_df=True)
  data_binary = cat_encoder2.fit_transform(data_cat_binary)

  #"merge df"#
  
  df_cat = data_1Hot.merge(data_binary, left_index=True,
                                    right_index=True).reset_index(drop=True)

#Feature scaling#
  scaler = MinMaxScaler()
  scaler.fit(data_tr)
  data_tr[columns_num] = scaler.transform(data_tr[columns_num])
  data_tr = data_tr.reset_index(drop=True)

#Preprocessed dataframe#    

  df = data_tr.merge(df_cat, left_index=True,
                                    right_index=True).reset_index(drop=True)

  return df


In [8]:
X_train_p = preprocessing(X_train)

In [9]:
X_test_p = preprocessing(X_test)

In [None]:
#save the dataset because the preprocess function is prolonged

In [10]:
X_train_p.shape, X_test_p.shape

((307499, 183), (48744, 182))

In [11]:
X_test_p.head(1)

Unnamed: 0,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_2,EXT_SOURCE_3,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,NAME_TYPE_SUITE_Children,NAME_TYPE_SUITE_Family,NAME_TYPE_SUITE_Group of people,NAME_TYPE_SUITE_Other_A,NAME_TYPE_SUITE_Other_B,"NAME_TYPE_SUITE_Spouse, partner",NAME_TYPE_SUITE_Unaccompanied,NAME_TYPE_SUITE_nan,NAME_INCOME_TYPE_Businessman,NAME_INCOME_TYPE_Commercial associate,NAME_INCOME_TYPE_Pensioner,NAME_INCOME_TYPE_State servant,NAME_INCOME_TYPE_Student,NAME_INCOME_TYPE_Unemployed,NAME_INCOME_TYPE_Working,NAME_EDUCATION_TYPE_Academic degree,NAME_EDUCATION_TYPE_Higher education,NAME_EDUCATION_TYPE_Incomplete higher,NAME_EDUCATION_TYPE_Lower secondary,NAME_EDUCATION_TYPE_Secondary / secondary special,NAME_FAMILY_STATUS_Civil marriage,NAME_FAMILY_STATUS_Married,NAME_FAMILY_STATUS_Separated,NAME_FAMILY_STATUS_Single / not married,NAME_FAMILY_STATUS_Widow,NAME_HOUSING_TYPE_Co-op apartment,NAME_HOUSING_TYPE_House / apartment,NAME_HOUSING_TYPE_Municipal apartment,NAME_HOUSING_TYPE_Office apartment,NAME_HOUSING_TYPE_Rented apartment,NAME_HOUSING_TYPE_With parents,OCCUPATION_TYPE_Accountants,OCCUPATION_TYPE_Cleaning staff,OCCUPATION_TYPE_Cooking staff,OCCUPATION_TYPE_Core staff,OCCUPATION_TYPE_Drivers,OCCUPATION_TYPE_HR staff,OCCUPATION_TYPE_High skill tech staff,OCCUPATION_TYPE_IT staff,OCCUPATION_TYPE_Laborers,OCCUPATION_TYPE_Low-skill Laborers,OCCUPATION_TYPE_Managers,OCCUPATION_TYPE_Medicine staff,OCCUPATION_TYPE_Private service staff,OCCUPATION_TYPE_Realty agents,OCCUPATION_TYPE_Sales staff,OCCUPATION_TYPE_Secretaries,OCCUPATION_TYPE_Security staff,OCCUPATION_TYPE_Waiters/barmen staff,OCCUPATION_TYPE_nan,WEEKDAY_APPR_PROCESS_START_FRIDAY,WEEKDAY_APPR_PROCESS_START_MONDAY,WEEKDAY_APPR_PROCESS_START_SATURDAY,WEEKDAY_APPR_PROCESS_START_SUNDAY,WEEKDAY_APPR_PROCESS_START_THURSDAY,WEEKDAY_APPR_PROCESS_START_TUESDAY,WEEKDAY_APPR_PROCESS_START_WEDNESDAY,ORGANIZATION_TYPE_Advertising,ORGANIZATION_TYPE_Agriculture,ORGANIZATION_TYPE_Bank,ORGANIZATION_TYPE_Business Entity Type 1,ORGANIZATION_TYPE_Business Entity Type 2,ORGANIZATION_TYPE_Business Entity Type 3,ORGANIZATION_TYPE_Cleaning,ORGANIZATION_TYPE_Construction,ORGANIZATION_TYPE_Culture,ORGANIZATION_TYPE_Electricity,ORGANIZATION_TYPE_Emergency,ORGANIZATION_TYPE_Government,ORGANIZATION_TYPE_Hotel,ORGANIZATION_TYPE_Housing,ORGANIZATION_TYPE_Industry: type 1,ORGANIZATION_TYPE_Industry: type 10,ORGANIZATION_TYPE_Industry: type 11,ORGANIZATION_TYPE_Industry: type 12,ORGANIZATION_TYPE_Industry: type 13,ORGANIZATION_TYPE_Industry: type 2,ORGANIZATION_TYPE_Industry: type 3,ORGANIZATION_TYPE_Industry: type 4,ORGANIZATION_TYPE_Industry: type 5,ORGANIZATION_TYPE_Industry: type 6,ORGANIZATION_TYPE_Industry: type 7,ORGANIZATION_TYPE_Industry: type 8,ORGANIZATION_TYPE_Industry: type 9,ORGANIZATION_TYPE_Insurance,ORGANIZATION_TYPE_Kindergarten,ORGANIZATION_TYPE_Legal Services,ORGANIZATION_TYPE_Medicine,ORGANIZATION_TYPE_Military,ORGANIZATION_TYPE_Mobile,ORGANIZATION_TYPE_Other,ORGANIZATION_TYPE_Police,ORGANIZATION_TYPE_Postal,ORGANIZATION_TYPE_Realtor,ORGANIZATION_TYPE_Religion,ORGANIZATION_TYPE_Restaurant,ORGANIZATION_TYPE_School,ORGANIZATION_TYPE_Security,ORGANIZATION_TYPE_Security Ministries,ORGANIZATION_TYPE_Self-employed,ORGANIZATION_TYPE_Services,ORGANIZATION_TYPE_Telecom,ORGANIZATION_TYPE_Trade: type 1,ORGANIZATION_TYPE_Trade: type 2,ORGANIZATION_TYPE_Trade: type 3,ORGANIZATION_TYPE_Trade: type 4,ORGANIZATION_TYPE_Trade: type 5,ORGANIZATION_TYPE_Trade: type 6,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 1,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA,NAME_CONTRACT_TYPE_0,NAME_CONTRACT_TYPE_1,CODE_GENDER_0,CODE_GENDER_1,FLAG_OWN_CAR_0,FLAG_OWN_CAR_1,FLAG_OWN_REALTY_0,FLAG_OWN_REALTY_1
0,0.0,0.024654,0.238037,0.102453,0.184049,0.25738,0.333427,0.039545,0.782059,0.872086,1.0,1.0,0.0,1.0,0.0,1.0,0.05,0.5,0.75,0.782609,0.0,0.0,0.0,0.0,0.0,0.0,0.923572,0.180263,0.0,0.0,0.0,0.0,0.601009,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,0,1,0,1


In [3]:
X_train_p.to_csv('X_train_p.csv', index=False)

NameError: name 'X_train_p' is not defined

In [4]:
X_test_p.to_csv('X_test_p.csv', index=False)

NameError: name 'X_test_p' is not defined

## Training Models

As usual, you will start training simple models and will progressively move to more complex models and pipelines.

### Baseline: LogisticRegression

1- Import LogisticRegression from sklearn and train a model using the preprocesed train data from the previous section, and just default parameters. If you receive a warning because the algorithm failed to converge, try increasing the number of iterations or decreasing the C parameter 

In [10]:
X_train_p = pd.read_csv('X_train_p.csv')
X_test_p = pd.read_csv('X_test_p.csv')

In [11]:
y = X_train_p[["TARGET"]]
X = X_train_p.drop("TARGET", axis=1)

In [15]:
X.head()

Unnamed: 0,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_2,EXT_SOURCE_3,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,NAME_TYPE_SUITE_Children,NAME_TYPE_SUITE_Family,NAME_TYPE_SUITE_Group of people,NAME_TYPE_SUITE_Other_A,NAME_TYPE_SUITE_Other_B,"NAME_TYPE_SUITE_Spouse, partner",NAME_TYPE_SUITE_Unaccompanied,NAME_TYPE_SUITE_nan,NAME_INCOME_TYPE_Businessman,NAME_INCOME_TYPE_Commercial associate,NAME_INCOME_TYPE_Pensioner,NAME_INCOME_TYPE_State servant,NAME_INCOME_TYPE_Student,NAME_INCOME_TYPE_Unemployed,NAME_INCOME_TYPE_Working,NAME_EDUCATION_TYPE_Academic degree,NAME_EDUCATION_TYPE_Higher education,NAME_EDUCATION_TYPE_Incomplete higher,NAME_EDUCATION_TYPE_Lower secondary,NAME_EDUCATION_TYPE_Secondary / secondary special,NAME_FAMILY_STATUS_Civil marriage,NAME_FAMILY_STATUS_Married,NAME_FAMILY_STATUS_Separated,NAME_FAMILY_STATUS_Single / not married,NAME_FAMILY_STATUS_Widow,NAME_HOUSING_TYPE_Co-op apartment,NAME_HOUSING_TYPE_House / apartment,NAME_HOUSING_TYPE_Municipal apartment,NAME_HOUSING_TYPE_Office apartment,NAME_HOUSING_TYPE_Rented apartment,NAME_HOUSING_TYPE_With parents,OCCUPATION_TYPE_Accountants,OCCUPATION_TYPE_Cleaning staff,OCCUPATION_TYPE_Cooking staff,OCCUPATION_TYPE_Core staff,OCCUPATION_TYPE_Drivers,OCCUPATION_TYPE_HR staff,OCCUPATION_TYPE_High skill tech staff,OCCUPATION_TYPE_IT staff,OCCUPATION_TYPE_Laborers,OCCUPATION_TYPE_Low-skill Laborers,OCCUPATION_TYPE_Managers,OCCUPATION_TYPE_Medicine staff,OCCUPATION_TYPE_Private service staff,OCCUPATION_TYPE_Realty agents,OCCUPATION_TYPE_Sales staff,OCCUPATION_TYPE_Secretaries,OCCUPATION_TYPE_Security staff,OCCUPATION_TYPE_Waiters/barmen staff,OCCUPATION_TYPE_nan,WEEKDAY_APPR_PROCESS_START_FRIDAY,WEEKDAY_APPR_PROCESS_START_MONDAY,WEEKDAY_APPR_PROCESS_START_SATURDAY,WEEKDAY_APPR_PROCESS_START_SUNDAY,WEEKDAY_APPR_PROCESS_START_THURSDAY,WEEKDAY_APPR_PROCESS_START_TUESDAY,WEEKDAY_APPR_PROCESS_START_WEDNESDAY,ORGANIZATION_TYPE_Advertising,ORGANIZATION_TYPE_Agriculture,ORGANIZATION_TYPE_Bank,ORGANIZATION_TYPE_Business Entity Type 1,ORGANIZATION_TYPE_Business Entity Type 2,ORGANIZATION_TYPE_Business Entity Type 3,ORGANIZATION_TYPE_Cleaning,ORGANIZATION_TYPE_Construction,ORGANIZATION_TYPE_Culture,ORGANIZATION_TYPE_Electricity,ORGANIZATION_TYPE_Emergency,ORGANIZATION_TYPE_Government,ORGANIZATION_TYPE_Hotel,ORGANIZATION_TYPE_Housing,ORGANIZATION_TYPE_Industry: type 1,ORGANIZATION_TYPE_Industry: type 10,ORGANIZATION_TYPE_Industry: type 11,ORGANIZATION_TYPE_Industry: type 12,ORGANIZATION_TYPE_Industry: type 13,ORGANIZATION_TYPE_Industry: type 2,ORGANIZATION_TYPE_Industry: type 3,ORGANIZATION_TYPE_Industry: type 4,ORGANIZATION_TYPE_Industry: type 5,ORGANIZATION_TYPE_Industry: type 6,ORGANIZATION_TYPE_Industry: type 7,ORGANIZATION_TYPE_Industry: type 8,ORGANIZATION_TYPE_Industry: type 9,ORGANIZATION_TYPE_Insurance,ORGANIZATION_TYPE_Kindergarten,ORGANIZATION_TYPE_Legal Services,ORGANIZATION_TYPE_Medicine,ORGANIZATION_TYPE_Military,ORGANIZATION_TYPE_Mobile,ORGANIZATION_TYPE_Other,ORGANIZATION_TYPE_Police,ORGANIZATION_TYPE_Postal,ORGANIZATION_TYPE_Realtor,ORGANIZATION_TYPE_Religion,ORGANIZATION_TYPE_Restaurant,ORGANIZATION_TYPE_School,ORGANIZATION_TYPE_Security,ORGANIZATION_TYPE_Security Ministries,ORGANIZATION_TYPE_Self-employed,ORGANIZATION_TYPE_Services,ORGANIZATION_TYPE_Telecom,ORGANIZATION_TYPE_Trade: type 1,ORGANIZATION_TYPE_Trade: type 2,ORGANIZATION_TYPE_Trade: type 3,ORGANIZATION_TYPE_Trade: type 4,ORGANIZATION_TYPE_Trade: type 5,ORGANIZATION_TYPE_Trade: type 6,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 1,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA,NAME_CONTRACT_TYPE_0,NAME_CONTRACT_TYPE_1,CODE_GENDER_0,CODE_GENDER_1,FLAG_OWN_CAR_0,FLAG_OWN_CAR_1,FLAG_OWN_REALTY_0,FLAG_OWN_REALTY_1
0,0.0,0.009839,0.090287,0.090032,0.077441,0.256321,0.888839,0.045086,0.85214,0.705433,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.5,0.5,0.434783,0.0,0.0,0.0,0.0,0.0,0.0,0.307542,0.155054,0.005747,0.058824,0.005814,0.083333,0.735788,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,0,1,0,1
1,0.0,0.013594,0.311736,0.132924,0.271605,0.045016,0.477114,0.043648,0.951929,0.959566,1.0,1.0,0.0,1.0,1.0,0.0,0.052632,0.0,0.0,0.478261,0.0,0.0,0.0,0.0,0.0,0.0,0.727773,0.569894,0.002874,0.0,0.002907,0.0,0.807083,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,1,0,0,1,1,0
2,0.0,0.002328,0.022472,0.020025,0.023569,0.134897,0.348534,0.046161,0.827335,0.648326,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.5,0.5,0.391304,0.0,0.0,0.0,0.0,0.0,0.0,0.65019,0.81413,0.0,0.0,0.0,0.0,0.810112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,1,1,0,0,1
3,0.0,0.006084,0.066837,0.109477,0.063973,0.107023,0.350846,0.038817,0.601451,0.661387,1.0,1.0,0.0,1.0,0.0,0.0,0.052632,0.5,0.5,0.73913,0.0,0.0,0.0,0.0,0.0,0.0,0.760751,0.569894,0.005747,0.0,0.005814,0.0,0.856244,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001601,0.000778,0.004295,0.009903,0.001017,0.075999,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,1,0,0,1,0,1
4,0.0,0.005333,0.116854,0.078975,0.117845,0.39288,0.298591,0.03882,0.825268,0.519522,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.5,0.5,0.478261,0.0,0.0,0.0,0.0,1.0,1.0,0.377472,0.569894,0.0,0.0,0.0,0.0,0.742311,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,0,1,0,1


In [18]:
model = LogisticRegression(max_iter=1000, C=0.5)
model.fit(X, y)
y_pred_1 = model.predict(X_test_p)
y_prob_1 = model.predict_proba(X_test_p)

  y = column_or_1d(y, warn=True)


2- Use the trained model to predict probabilites for the test data, and then save the results to a csv in the format expected in the competition: a SK_ID_CURR column and a TARGET column with probabilities. REMEMBER: the TARGET columns should ONLY contain the probabilities that the debt is not repaid (equivalent to the class 1).

In [19]:
y_prob_1 = pd.DataFrame(y_prob_1, columns=['PREDICTION_0', 'TARGET']
                        ).reset_index(drop=True)

In [23]:
X_ID =  application_test[["SK_ID_CURR"]].reset_index(drop=True)

In [24]:
y_LogisticRegression = y_prob_1.merge(X_ID, left_index=True, right_index=True)

In [25]:
y_LogisticRegression.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PREDICTION_0  48744 non-null  float64
 1   TARGET        48744 non-null  float64
 2   SK_ID_CURR    48744 non-null  int64  
dtypes: float64(2), int64(1)
memory usage: 1.1 MB


In [26]:
y_LogisticRegression['TARGET'] = round(y_LogisticRegression['TARGET']).astype("int32")

In [27]:
y_LogisticRegression = y_LogisticRegression[["SK_ID_CURR", "TARGET"]]

In [29]:
y_LogisticRegression.groupby("TARGET").size()

TARGET
0    48649
1       95
dtype: int64

In [30]:
y_LogisticRegression.to_csv('csv_LR.csv', index=False)

3- Go to the Kaggle competition, and in the [submissions page](https://www.kaggle.com/competitions/home-credit-default-risk/submit) load your csv file. Report here the result in the private score you obtained.

**0.50216**

At this point, the model should produce a result around 0.67 

### Training a Random Forest Classifier 

You're gonna start working in more complex models: ensambles, particularly, you're going to use the Random Forest Classifier from Scikit Learn. 

1- Train a RandomForestClassifier, print the time taken by the fit function. Just use default hyperparameters, except for n_jobs, which should be set to "-1" to allow the library to use all CPU cores to speed up training time.

In [None]:
model2 = RandomForestClassifier()
model2.fit(X, y)
y_pred_2 = model.predict_proba(X_test_p)

2- Use the classifier to predict probabilities on the test set, and save the results to a csv file.

In [None]:
y_pred_2 = pd.DataFrame(y_pred_2, columns=['PREDICTION_0', 'TARGET']
                        ).reset_index(drop=True)

In [None]:
y_RandomForestClassifier = y_pred_2.merge(X_ID, left_index=True, right_index=True)

In [None]:
y_RandomForestClassifier.info()

In [None]:
y_RandomForestClassifier['TARGET'] = round(y_RandomForestClassifier['TARGET']).astype("int32")

In [None]:
y_RandomForestClassifier = y_RandomForestClassifier[["SK_ID_CURR", "TARGET"]]

In [None]:
y_RandomForestClassifier.head()

In [None]:
y_RandomForestClassifier.to_csv('csv_RFC.csv', index=False)

3- Load the predictions to the competition. Report the private score here.

**0.50216**

### Randomized Search with Cross Validation

So far, we've only created models using the default hyperparameters of each algorithm. This is usually something that we would only do for baseline models, hyperparameter tuning is a very important part of the modeling process and is often the difference between having an acceptable model or not.

But, there are usually lots of hyperparameters to tune and a finite amount of time to do it, you have to consider the time and resources it takes to find an optimal combination of them. In the previous section you trained a random forest classifier and saw how much it took to train it once in your PC. If you want to do hyperparameter optimization you now have to consider that you will have to train the algorithm N number of times, with N being the cartesian product of all parameters. 

Furthermore, you can't validate the performance of your trained models on the test set, as this data should only be used to validate the final model. So we have to implement a validation strategy, K-Fold Cross Validation being the most common. But this also adds time complexity to our training, because we will have to train each combinations of hyperparameters M number of times, X being the number of folds in which we divided our dataset, so the total number of training iterations will be NxM... this resulting number can grow VERY quickly.

Fortunately there are strategies to mitigate this, here you're going to select a small number of hyperparameters to test a RandomForestClassifier, and use a Randomized Search algorithm with K-Fold Cross Validation to avoid doing a full search across the grid. 

Remember: take in consideration how much time it took to train a single classifier, and define the number of cross validations folds and iterations of the search accordingly. 
A recommendation: run the training process, go make yourself a cup of coffee, sit somewhere comfortably and forget about it for a while.


1- Use RandomizedSearchCV to find the best combination of hyperparameters for a RandomForestClassifier. The validation metric used to evaluate the models should be "roc_auc".

In [None]:
### Complete in this cell: Use RandomizedSearchCV to find the best combination of hyperparameters for a RandomForestClassifier
example_hyperparameter_grid = {
 'bootstrap': [True, False],
 'max_depth': [10, 50, 100, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [100, 200]
}
clf = RandomizedSearchCV(model2, example_hyperparameter_grid, random_state=0)
search = clf.fit(X, y)
search.best_params_

{'bootstrap': False,
 'max_depth': 50,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 100}

In [None]:
model2 = RandomForestClassifier(bootstrap= False,
                                max_depth= 50,
                                max_features= 'auto',
                                min_samples_leaf= 1,
                                min_samples_split= 5,
                                n_estimators= 100)
model2.fit(X, y)
y_pred_3 = model.predict_proba(X_test_p) 

2- Use the classifier to predict probabilities on the test set, and save the results to a csv file.

In [None]:
y_pred_3 = pd.DataFrame(y_pred_3, columns=['PREDICTION_0', 'TARGET']
                        ).reset_index(drop=True)

In [None]:
y_RandomizedSearch = y_pred_3.merge(X_ID, left_index=True, right_index=True)

In [None]:
y_RandomizedSearch.info()

In [None]:
y_RandomizedSearch['TARGET'] = round(y_RandomizedSearch['TARGET']).astype("int32")

In [None]:
y_RandomizedSearch = y_RandomizedSearch[["SK_ID_CURR", "TARGET"]]

In [None]:
y_RandomizedSearch.head()

In [None]:
y_RandomizedSearch.to_csv('csv_RS.csv', index=False)

3- Load the predictions to the competition. Report the private score here.

**0.50216**

4- If you have the time and resources, you can train the model for longer iterations, or select more estimator sizes. This is optional, but if you, we would love to see your results.

### Optional: Training a LightGBM model 

Gradient Boosting Machine is one of the most used machine learning algorithms for tabular data. Lots of competitions have been won using models from libraries like XGBoost or LightGBM. You can try using [LightGBM](https://lightgbm.readthedocs.io/en/latest/) to train a new model an see how it performs compared to the other classifiers you trained. 

In [None]:
### Complete in this cell: train a LightGBM model


### Optional: Using Scikit Learn Pipelines 

So far you've created special functions or blocks or code to chain operations on data and then train the models. But, reproducibility is important, and you don't want to have to remember the correct steps to follow each time you have new data to train your models. There are a lots of tools out there that can help you with that, here you can use a [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to process your data.