## Finance Complaint Project - 2

### Feature Engineering and Model Training

**Importing required libraries**

In [5]:
# Basic
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

# for text processing
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, FreqDist
import string
from nltk import  WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# For classification model selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay,\
precision_score, recall_score, roc_auc_score, f1_score, RocCurveDisplay
from sklearn.svm import SVC
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# For data pre-processing 
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders.binary import BinaryEncoder
from imblearn.combine import SMOTETomek

# For hyperparameter tuning
from hyperopt import tpe, hp, Trials, space_eval
from hyperopt.fmin import fmin
from hyperopt.pyll import scope

In [6]:
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/rahulshelke/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

**Import the data from source**

In [7]:
df = pd.read_parquet('data/complaints.parquet')

### As per the final EDA report, some features can be removed

In [8]:
missing = df.isnull().sum().div(df.shape[0]).mul(100).to_frame().sort_values(by=0, ascending=False)
missing[:8]

| Feature | Missing values (%) |
|---|---|
| Tags | 92.367276 |
| Consumer disputed? | 89.550798 |
| Consumer complaint narrative | 67.16556 |
| Company public response | 49.968155 |
| Consumer consent provided? | 17.506295 |
| Sub-issue | 10.361439 |
| Sub-product | 3.200073 |
| State | 0.682583 |


**The `Company` column can be dropped because it contains 4279 unique values, all of which are company names**

In [9]:
drop_columns = ["Tags", "Consumer complaint narrative", "Company public response",
                "Sub-issue", "Sub-product", "ZIP code", "Complaint ID", "Company"]
df.drop(drop_columns, axis=1, inplace=True)

In [10]:
missing = df.isnull().sum().div(df.shape[0]).mul(100).to_frame().sort_values(by=0, ascending=False)
missing

| Feature | Missing values (%) |
|---|---|
| Consumer disputed? | 89.550798 |
| Consumer consent provided? | 17.506295 |
| State | 0.682583 |
| Company response to consumer | 0.000272 |
| Issue | 0.000082 |
| Date received | 0.0 |
| Product | 0.0 |
| Submitted via | 0.0 |
| Date sent to company | 0.0 |
| Timely response? | 0.0 |


- `State`, `Company response to consumer` and `Issue` have only about 0.68%, 0.0003% and 0.00008% missing values respectively. These can be imputed with `SimpleImputer` using the most-frequent (mode) strategy.
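
As a rough illustration of what a most-frequent (mode) imputation does, here is a dependency-free sketch. The function name `impute_most_frequent` and the toy `states` column are assumptions made for this example; the notebook itself relies on scikit-learn's `SimpleImputer`.

```python
from collections import Counter

def impute_most_frequent(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# Hypothetical toy column with two missing entries
states = ["PA", "IL", None, "PA", "OH", None]
imputed_states = impute_most_frequent(states)
print(imputed_states)
```

With so few missing values, filling them with the mode barely changes the distribution of the column.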

## Feature Extraction

In [11]:
df[["Date received", "Date sent to company"]].head(3)

| | Date received | Date sent to company |
|---|---|---|
| 0 | 2025-01-16 | 2025-01-16 |
| 1 | 2025-01-16 | 2025-01-16 |
| 2 | 2025-01-16 | 2025-01-16 |


- The dataset has two date features: `Date received` is the date on which the complaint was registered with the CFPB, and `Date sent to company` is the date on which the complaint was forwarded to the respective company.

In [12]:
# difference between the date the complaint was received and the date it was sent to the company
df["days_to_forward_complaint"] = pd.to_datetime(df["Date sent to company"]) - pd.to_datetime(df["Date received"])

# convert the timedelta to a whole number of days (numeric)
df["days_to_forward_complaint"] = df["days_to_forward_complaint"].dt.days

In [13]:
# After creating the days_to_forward_complaint, both the date columns can be removed
df.drop(["Date received", "Date sent to company"], axis=1, inplace=True)

The `days_to_forward_complaint` feature records how many days the CFPB took to forward the complaint to the company.
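
The same day difference can be checked by hand on individual date strings. The helper `days_between` below is a hypothetical name used only for this sketch, and it assumes ISO-formatted dates like the ones shown in the output above.

```python
from datetime import date

def days_between(received: str, sent: str) -> int:
    """Number of whole days from the received date to the sent date."""
    return (date.fromisoformat(sent) - date.fromisoformat(received)).days

print(days_between("2025-01-16", "2025-01-16"))  # same-day forwarding
print(days_between("2025-01-14", "2025-01-16"))  # forwarded two days later
```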

## To reduce computation time, a sample of the data is used for modeling

In [14]:
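# Draw a balanced sample (50,000 rows per class) for model training.
# Rows with a missing target are excluded because groupby drops NaN keys by default.

df1 = df.groupby("Consumer disputed?").sample(n=50000)
df1.reset_index(inplace=True)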
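
To make the per-class sampling concrete, here is a small sketch in plain Python. The function `sample_per_class`, the toy `rows` and the fixed seed are assumptions for illustration. The sketch mirrors the idea of drawing the same number of rows from each target class and skipping rows whose target is missing.

```python
import random
from collections import defaultdict

def sample_per_class(rows, label_key, n, seed=0):
    """Draw n rows without replacement from each label group, skipping missing labels."""
    groups = defaultdict(list)
    for row in rows:
        if row[label_key] is not None:
            groups[row[label_key]].append(row)
    rng = random.Random(seed)
    sampled = []
    for label in sorted(groups):
        sampled.extend(rng.sample(groups[label], n))
    return sampled

# Hypothetical toy data: 5 "Yes", 20 "No" and 10 rows without a target
labels = ["Yes"] * 5 + ["No"] * 20 + [None] * 10
rows = [{"id": i, "disputed": lab} for i, lab in enumerate(labels)]
balanced = sample_per_class(rows, "disputed", n=3)
print([row["disputed"] for row in balanced])
```

Because every class contributes the same number of rows, the resulting sample is balanced even though the full data is not.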

## Text Processing

**For Vectorizing**

- TF-IDF
- CountVectorizer
- NLTK/Scipy Library
- Pretrained GloVe
- Here we use TF-IDF to vectorize the text.

**Steps for text processing**

- Remove punctuation
- Remove stopwords
- Lowercasing
- Tokenization
- Stemming/Lemmatization
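
The first four steps can be sketched without any NLP library. The tiny `STOPWORDS` set and the `preprocess` function are assumptions for this example only. The notebook uses NLTK's tokenizer and stopword list, and lemmatization is left out of the sketch.

```python
import string

# Tiny hypothetical stopword set; NLTK's English list is much longer
STOPWORDS = {"the", "of", "to", "a", "my", "is", "on"}

# Translation table that turns every punctuation character into a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(PUNCT_TO_SPACE)
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("Incorrect information on my report!"))
```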

- The `Issue` column contains text that has to be preprocessed.
- The text needs to be transformed into vectors so that the algorithms can make predictions. Here the Term Frequency - Inverse Document Frequency (TF-IDF) weight is used to evaluate how important a word is to a document in a collection of documents.
- After removing punctuation and lowercasing the words, the importance of a word is determined from its frequency.
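
To show how the TF-IDF weight behaves, the sketch below computes it by hand on three made-up documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that `TfidfVectorizer` applies by default. The function `tfidf` and the toy `docs` are illustrative names, not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Hypothetical, already preprocessed issue texts
docs = ["incorrect information report", "loan modification", "incorrect loan servicing"]
weights = tfidf(docs)
print(weights[0])
```

In the first document, `information` appears in fewer documents than `incorrect`, so it receives the larger weight.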

**Create a list of stop words to be removed**

In [15]:
stopwords_list = stopwords.words('english') + list(string.punctuation)

In [16]:
len(stopwords_list)

211

**Create functions to tokenize and lemmatize the text column**

In [17]:
# function to tokenize data and remove stopwords
def process_text(issue):

    # create tokens
    tokens = nltk.word_tokenize(issue)

    # lowercase tokens and remove common stopwords and punctuation
    stopwords_removed  = [token.lower() for token in tokens if token.lower() not in stopwords_list]

    # keep only purely alphabetic tokens (drops numbers and leftover punctuation)
    stopwords_removed = [word for word in stopwords_removed if word.isalpha()]

    return stopwords_removed

# concatenate the strings
def concat_strings(word_list):
    concat_words = ''
    for word in word_list:
        concat_words += word + ' '
    return concat_words.strip()

# function to lemmatize words and merge each complaint into a single space-separated string
lemm = WordNetLemmatizer()

def lemmatizer_concat(word_list):
    # remove any NaN's
    list_of_words = [i for i in word_list if i is not np.nan]

    # lemmatize each word
    lemmatized_list = []
    for idx, word in enumerate(list_of_words):
        lemmatized_list.append(lemm.lemmatize(word))
    
    # make the list into a single string with the words separated by ' '
    final_string = concat_strings(lemmatized_list)
    return final_string

**Prepare data with text processing**

In [18]:
for i in range(len(df1)):
    text = process_text(df1.loc[i, 'Issue'])
    final_texts = lemmatizer_concat(text)
    # assign with a single .loc call to avoid chained-assignment problems
    df1.loc[i, 'Issue'] = final_texts
    if i % 5000 == 0:
        print(f'Processed Row Number {i}')

Processed Row Number 0
Processed Row Number 5000
Processed Row Number 10000
Processed Row Number 15000
Processed Row Number 20000
Processed Row Number 25000
Processed Row Number 30000
Processed Row Number 35000
Processed Row Number 40000
Processed Row Number 45000
Processed Row Number 50000
Processed Row Number 55000
Processed Row Number 60000
Processed Row Number 65000
Processed Row Number 70000
Processed Row Number 75000
Processed Row Number 80000
Processed Row Number 85000
Processed Row Number 90000
Processed Row Number 95000


**Vectorizing the processed texts**

In [19]:
tfidf = TfidfVectorizer(max_features=None, strip_accents='unicode', analyzer='word', ngram_range=(1, 2))

# Get data after vectorizing issue column
df_vect = tfidf.fit_transform(df1['Issue'])

feature_names = tfidf.get_feature_names_out()
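
With `ngram_range=(1, 2)` the vocabulary contains single words as well as pairs of adjacent words. The helper `word_ngrams` below is a hypothetical, simplified sketch of that idea and not scikit-learn's own implementation.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """List all word n-grams of the text for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("wrong amount charged"))
```

This is why the feature names include two-word columns such as `wrong amount` next to `wrong`.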

## Data Preprocessing

**Concatenate the old data with the vectorized data from the issue text column**

In [20]:
df1 = pd.concat([df1, pd.DataFrame(df_vect.toarray(), columns=feature_names)], axis=1)

**After the issue column has been converted to vectors, the original issue column can be removed**

In [21]:
df1.drop(["Issue", "index"], axis=1, inplace=True)

In [22]:
from sklearn.model_selection import train_test_split
X = df1.drop(["Consumer disputed?"], axis=1)
y = df1["Consumer disputed?"]

In [23]:
# check shape of the feature matrix
X.shape

(100000, 313)

**Initialize features for transformation**

In [24]:
df1.columns

Index(['Product', 'State', 'Consumer consent provided?', 'Submitted via',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'days_to_forward_complaint', 'account',
       'account opening',
       ...
       'using debit', 'vehicle', 'verification', 'verification debt',
       'withdrawal', 'workout', 'workout plan', 'wrong', 'wrong amount',
       'wrong day'],
      dtype='object', length=314)

In [25]:
df1.head()

| | Product | State | Consumer consent provided? | Submitted via | Company response to consumer | Timely response? | Consumer disputed? | days_to_forward_complaint | account | account opening | ... | using debit | vehicle | verification | verification debt | withdrawal | workout | workout plan | wrong | wrong amount | wrong day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Debt collection |  |  | Phone | Closed | Yes | No | 7 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Debt collection | PA |  | Referral | Closed with non-monetary relief | Yes | No | 2 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Mortgage | PA |  | Referral | Closed with explanation | Yes | No | 2 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Payday loan | IL | Consent provided | Web | Closed with explanation | Yes | No | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Credit card | OH |  | Postal mail | Closed with explanation | Yes | No | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [26]:
# for binary encoder
binary_features = ["Product", "State", "Submitted via", "Company response to consumer"]

# for one hot encoder ("State" is left out here because it is already binary-encoded above)
onehot_features = ["Consumer consent provided?", "Timely response?"]
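
The reason for splitting the columns between two encoders is the width of the output: one-hot encoding creates one column per category, while binary encoding needs only about log2 of that number. The functions `one_hot` and `binary_encode` below are a simplified sketch of the two ideas under that assumption. The exact column layout produced by `category_encoders`' `BinaryEncoder` may differ.

```python
def one_hot(categories):
    """Map each category to a vector with a single 1 (one column per category)."""
    return {c: [1 if i == j else 0 for j in range(len(categories))]
            for i, c in enumerate(categories)}

def binary_encode(categories):
    """Map each category to the binary digits of its 1-based ordinal code."""
    width = len(categories).bit_length()
    return {c: [int(bit) for bit in format(code, f"0{width}b")]
            for code, c in enumerate(categories, start=1)}

# Hypothetical list of submission channels
channels = ["Web", "Phone", "Referral", "Postal mail", "Fax", "Email"]
print(len(one_hot(channels)["Web"]), "one-hot columns")
print(len(binary_encode(channels)["Web"]), "binary columns")
```

For a column with around 60 distinct values, such as a state code, this sketch would need 6 binary columns instead of 60 one-hot columns.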

**Create a ColumnTransformer for the transformation**

In [27]:
onehot_encoder_pipeline = Pipeline(
    steps=[
        ('SimpleImputer', SimpleImputer(strategy="most_frequent")),
        ('OneHot_encoder', OneHotEncoder())
    ]
)

binary_encoder_pipeline = Pipeline(
    steps=[
        ('SimpleImputer', SimpleImputer(strategy="most_frequent")),
        ('BinaryEncoder', BinaryEncoder())
    ]
)

# build the data preprocessing object
preprocrssor = ColumnTransformer(
    [
        ("Categorical_Pipeline", onehot_encoder_pipeline, onehot_features),
        ("Binary_encoder_pipeline", binary_encoder_pipeline, binary_features),
        # ("Numerical_Pipeline", RobustScaler(), feature_names)
    ]
    , remainder="passthrough"
)

**Transforming the data for modeling**

In [28]:
# fit the transformer and transform the feature matrix
X = preprocrssor.fit_transform(X)

**Manually Encoding Target Feature**

In [29]:
# manually encoding "Yes" as 0 and "No" as 1
y = np.where(y.values=="Yes", 0, 1)

## Handling Imbalanced Dataset

**Handling Imbalanced Target Variable.**

- Synthetic Minority Oversampling Technique (SMOTE) oversamples the minority class by generating new synthetic records. Simply duplicating minority-class records often adds no new information to the model.

- SMOTE is a widely used oversampling technique for class imbalance. Here it is combined with an undersampling technique (ENN or Tomek links) to handle the imbalanced class more effectively.
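
The core of SMOTE is a linear interpolation between a minority-class sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step. The function `smote_point` and the two toy points are assumptions for illustration, and the neighbour search and the Tomek-link cleaning done by `SMOTETomek` are left out.

```python
import random

def smote_point(x, neighbor, rng):
    """Interpolate a synthetic sample on the segment between x and a same-class neighbor."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 2.0]          # hypothetical minority-class sample
neighbor = [3.0, 6.0]   # hypothetical nearest minority-class neighbor
synthetic = smote_point(x, neighbor, rng)
print(synthetic)
```

The synthetic point always lies on the line segment between the two real samples, so it adds a new but plausible record instead of an exact duplicate.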

In [30]:
# Resample the minority class. The strategy can be changed as required.
smt = SMOTETomek(random_state=42, sampling_strategy="minority", n_jobs=-1)

# Fit the model to generate the data
X_res, y_res = smt.fit_resample(X, y)

## Model Selection

**Here we fit several classification models with their default parameters. From these, the top four by accuracy can be selected for hyperparameter tuning**

In [31]:
# Function which returns all evaluation metrics for a classification model

def evaluate_clf(true, predicted):
    acc = accuracy_score(true, predicted) # Calculate accuracy
    f1 = f1_score(true, predicted) # Calculate F1 score
    precision = precision_score(true, predicted) # Calculate precision
    recall = recall_score(true, predicted) # Calculate recall
    roc_auc = roc_auc_score(true, predicted) # Calculate ROC AUC from the predicted labels
    return acc, f1, precision, recall, roc_auc
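
For reference, these metrics can be derived by hand from the four confusion-matrix counts. The function `binary_metrics` and the toy label lists are assumptions for this sketch. The last value shows that a ROC AUC computed from hard 0/1 predictions, as in `evaluate_clf`, reduces to the average of recall and specificity.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, F1, precision, recall and label-based ROC AUC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    # With hard labels the ROC curve has a single operating point
    roc_auc = (recall + specificity) / 2
    return accuracy, f1, precision, recall, roc_auc

# Hypothetical true and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```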

In [32]:
# Initialize the models which are required for model selection

models = {
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gradient boosting": GradientBoostingClassifier(),
    "Logistic Regression": LogisticRegression(),
    "K-Neighbour Classifier": KNeighborsClassifier(),
    "XGBClassifier": XGBClassifier(),
    "CatBoostClassifier": CatBoostClassifier(verbose=False),
    "AdaBoost Classifier": AdaBoostClassifier()
}

In [33]:
# Create a function which evaluates the models and returns a report as a DataFrame

def evaluate_model(X, y, models):
    ''' 
    This function takes X, y and a dictionary of models as input.
    It splits the data into train and test sets,
    iterates through the given model dictionary and evaluates the metrics.
    Returns: DataFrame with the test accuracy of each model, sorted in descending order
    '''
    # separate dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models_list = []
    accuracy_list = []
    auc = []

    for i in range(len(list(models))):
        model = list(models.values())[i]
        model.fit(X_train, y_train)

        # Make predictions
        y_train_pred = model.predict(X_train)
        y_test_pred = model.predict(X_test)

        # training set performance
        model_train_accuracy, model_train_f1, model_train_precision,\
        model_train_recall, model_train_rocauc_score = evaluate_clf(y_train, y_train_pred)

        # test set performance
        model_test_accuracy, model_test_f1, model_test_precision,\
        model_test_recall, model_test_rocauc_score = evaluate_clf(y_test, y_test_pred)

        print(list(models.keys())[i])
        models_list.append(list(models.keys())[i])

        print('Model performance for Training set')
        print("- Accuracy: {:.4f}".format(model_train_accuracy))
        print("- F1 score: {:.4f}".format(model_train_f1))
        print("- Precision: {:.4f}".format(model_train_precision))
        print("- Recall: {:.4f}".format(model_train_recall))
        print("- Roc Auc Score: {:.4f}".format(model_train_rocauc_score))

        print('----------------------------------')

        print('Model performance for Test set')
        print("- Accuracy: {:.4f}".format(model_test_accuracy))
        print("- F1 score: {:.4f}".format(model_test_f1))
        print("- Precision: {:.4f}".format(model_test_precision))
        print("- Recall: {:.4f}".format(model_test_recall))
        print("- Roc Auc Score: {:.4f}".format(model_test_rocauc_score))

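        # store the test accuracy so that the final report is not empty
        accuracy_list.append(model_test_accuracy)
        auc.append(model_test_rocauc_score)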
        print('='*25)
        print('\n')
    report = pd.DataFrame(list(zip(models_list, accuracy_list)),
                          columns=['Model Name', 'Accuracy']).sort_values(by=['Accuracy'], ascending=False)
    
    return report


### **Base report of all models with default parameters**

In [None]:
base_report = evaluate_model(X=X_res, y=y_res, models=models)

Random Forest
Model performance for Training set
- Accuracy: 0.7794
- F1 score: 0.7709
- Precision: 0.8001
- Recall: 0.7438
- Roc Auc Score: 0.7793
----------------------------------
Model performance for Test set
- Accuracy: 0.5631
- F1 score: 0.5510
- Precision: 0.5706
- Recall: 0.5328
- Roc Auc Score: 0.5633


Decision Tree
Model performance for Training set
- Accuracy: 0.7794
- F1 score: 0.7587
- Precision: 0.8356
- Recall: 0.6948
- Roc Auc Score: 0.7793
----------------------------------
Model performance for Test set
- Accuracy: 0.5557
- F1 score: 0.5187
- Precision: 0.5702
- Recall: 0.4756
- Roc Auc Score: 0.5562




## Report in DataFrame

In [None]:
base_report

**Here we can use CatBoost Classifier and XGBClassifier for hyperparameter tuning**

## HyperOpt: Distributed Hyperparameter Optimization

- Hyperopt is a powerful Python library for hyperparameter optimization developed by James Bergstra. It uses a form of Bayesian optimization for parameter tuning to search for the best parameters for a given model.

- Grid search is exhaustive, so it is expensive in terms of resource usage.

- Random search samples at random, so it can miss the most important values. Hyperopt offers a more guided alternative.
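
The cost difference between the two baseline strategies is easy to see on a small example. The `grid` values, the helper `random_candidate` and the budget of 10 random draws are assumptions chosen for this sketch and are not the settings used in the notebook.

```python
import itertools
import random

# Hypothetical grid: every combination has to be trained and evaluated
grid = {
    "max_depth": [3, 5, 7, 9],
    "gamma": [1, 3, 5, 7, 9],
    "colsample_bytree": [0.5, 0.75, 1.0],
}
grid_points = list(itertools.product(*grid.values()))
print(len(grid_points), "model fits for the full grid")

rng = random.Random(0)

def random_candidate():
    """Draw one configuration uniformly from the same ranges."""
    return {
        "max_depth": rng.randint(3, 10),
        "gamma": rng.uniform(1, 9),
        "colsample_bytree": rng.uniform(0.5, 1),
    }

# Random search: the budget is fixed, but good regions can be missed
random_points = [random_candidate() for _ in range(10)]
print(len(random_points), "model fits for random search")
```

Grid search grows multiplicatively with every added parameter, while random search keeps a fixed budget but explores without using earlier results.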

**The search space is where Hyperopt gives you many sampling options:**

- for categorical parameters you have `hp.choice`

- for integers you get `hp.randint`, `hp.quniform`, `hp.qloguniform` and `hp.qlognormal`

- for floats you have `hp.normal`, `hp.uniform`, `hp.lognormal` and `hp.loguniform`

- this is among the most extensive sampling functionality available.

You define your search space before running the optimization, and it can describe very complex parameter spaces.
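
As an illustration of one of these expressions, `hp.quniform(label, low, high, q)` is documented as returning a value like `round(uniform(low, high) / q) * q`. The function `quniform` below is a plain-Python imitation of that formula written for this sketch. It is not Hyperopt's own code.

```python
import random

def quniform(rng, low, high, q):
    """Imitate hp.quniform: round(uniform(low, high) / q) * q, returned as a float."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)
draws = [quniform(rng, 3, 10, 1) for _ in range(1000)]
print(sorted(set(draws)))
```

The draws are whole numbers stored as floats, which is why a parameter such as `max_depth` usually has to be cast to `int` before it is passed to a model.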

## Hyperparameter Tuning for the XGBoost Model

**This is the function to minimize: it receives hyperparameter values from the search space as input and returns the loss**

In [None]:
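# Creating an objective function for hyperopt
def XGB_objective(params):
    # hp.quniform yields floats, but XGBoost expects an integer max_depth
    params = {**params, 'max_depth': int(params['max_depth'])}
    model = XGBClassifier(**params, n_jobs=-1)
    X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    # fmin minimizes the objective, so return the negative accuracy as the loss
    return -acc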
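
Hyperopt's `fmin` minimizes whatever the objective returns, so an accuracy has to be negated (or subtracted from 1) to act as a loss. The toy functions below are hypothetical stand-ins for a real validation accuracy and only illustrate the sign convention.

```python
def toy_accuracy(max_depth):
    """Hypothetical stand-in for a validation accuracy that peaks at max_depth == 6."""
    return 0.9 - 0.01 * (max_depth - 6) ** 2

def loss(max_depth):
    """Negative accuracy: the smaller the loss, the better the model."""
    return -toy_accuracy(max_depth)

candidates = range(3, 11)
best_depth = min(candidates, key=loss)           # minimizing the loss finds the best depth
worst_depth = min(candidates, key=toy_accuracy)  # minimizing raw accuracy finds the worst
print(best_depth, worst_depth)
```

Minimizing the raw accuracy would steer the search toward the weakest configuration, which is the opposite of what the tuning is meant to do.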

In [None]:
# Define the search space

search_sapce = {
    'max_depth': hp.quniform("max_depth", 3, 10, 1),
    'gamma': hp.uniform('gamma', 1, 9),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': 180,
    'seed': 0
}