# RISTEK Datathon 2024

### Notebook by CCC : Rahardi Salim, Vincent Davis Leonard Tjoeng, Christian Yudistira Hermawan
### [University of Indonesia](https://www.ui.ac.id/)

## Table of contents

1. Introduction

2. Problem Domain

3. Import Required Library and Dataset

4. Step 1: Checking the data

5. Step 2: Tidying the data

6. Step 3: Exploratory analysis

7. Step 4: Feature Engineering

8. Step 5: Modeling

9. Step 6: Training

10. Step 7: Prediction

11. Conclusions

### 1. Introduction

Welcome to the RISTEK Datathon 2024! This competition aims to develop a machine learning model to detect fraud among users of a fintech platform. In the current digital era, fraud detection is crucial for maintaining trust and security in financial transactions. By participating in this competition, we aim to build a robust model that can accurately identify fraudulent activities, ensuring the platform's integrity and security.

### 2. Problem Domain

#### Dataset Description
The dataset for this competition is derived from financial product loan records of a fintech company. It includes various user features and loan activities. The dataset comprises the following files:

1. **train.csv**: Contains the training data with user features and the target label for classification.
   - `user_id`: Unique identifier for each user.
   - `pc[0-16]`: Anonymized user identity features.
   - `label`: Target variable (0: Non-fraud; 1: Fraud).

2. **loan_activities.csv**: Records of financial product loans.
   - `user_id`: Unique identifier for each user.
   - `reference_contact`: Emergency contact provided by the user.
   - `loan_type`: Type of loan taken by the user.
   - `ts`: Timestamp of the loan creation.

3. **non_borrower_user.csv**: Data of users who rarely take loans and are not the primary focus of classification.
   - `user_id`: Unique identifier for each user.
   - `pc[0-16]`: Anonymized user identity features.

4. **test.csv**: Data for prediction submission.
   - `user_id`: Unique identifier for each user matching the `sample_submission.csv`.
   - `pc[0-16]`: Anonymized user identity features.

5. **sample_submission.csv**: Example submission format.
   - `user_id`: Unique identifier for each user matching the `test.csv`.
   - `label`: Target variable (0: Non-fraud; 1: Fraud).

#### Problem Statement
Fraud detection involves identifying user actions that qualify as fraudulent. In this competition, a fraudulent user is defined as someone who has taken a financial product loan but has not made the repayment by the due date. The objective is to develop a machine learning model to accurately detect such users.

#### Evaluation Criteria
The model performance will be evaluated based on the Average Precision with `average='macro'`. The competition also emphasizes analysis, data processing, modeling, and notebook structure. The scoring breakdown is as follows:

- **Private Leaderboard**: 25%
- **Analysis**: 15%
- **Data Processing**: 25%
- **Modeling**: 30%
- **Notebook Structure**: 5%
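
To make the metric concrete, here is a minimal pure-Python sketch of average precision for a binary target: the mean of the precision values measured at each rank where a true positive appears. The helper name `average_precision` and the toy labels and scores are illustrative assumptions, and the sketch assumes no tied scores; it is not the competition's scoring code.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming no tied scores."""
    # Rank samples from the highest score to the lowest
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

# Toy example (made-up values): 1 = fraud, 0 = non-fraud
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, y_score)
print(round(ap, 4))
```

Because the metric depends on the ranking of the scores, predicted probabilities usually carry more information for it than hard 0/1 labels.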

### 3. Import Required Library and Dataset

#### 3.1 Import Dataset

In [None]:
!pip install gdown

import os
import zipfile
import logging

import gdown

def download_data(url, filename, dir_name="data"):
    """Download a zip archive from Google Drive and extract it to dir_name/filename."""
    os.makedirs(dir_name, exist_ok=True)
    zip_path = os.path.join(dir_name, f"{filename}.zip")
    logging.info("Downloading data....")
    # Pass an explicit output path so the archive name does not depend on the remote file name
    gdown.download(url, output=zip_path, quiet=False)
    logging.info("Extracting zip file....")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(os.path.join(dir_name, filename))
    os.remove(zip_path)

download_data(url="https://drive.google.com/uc?&id=1joOspf-LvEBdKLw48S2WeBno_l5J1DPj",
              filename="ristek-datathon-2024",
              dir_name="datathon-2024")

#### 3.2 Import Library

In [None]:
import math
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
from geopy.geocoders import GoogleV3
from scipy.stats import mode

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              GradientBoostingRegressor, ExtraTreesRegressor)

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import itertools
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, StandardScaler

import warnings
warnings.simplefilter('ignore')

from sklearn import model_selection
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier

from mlxtend.classifier import StackingCVClassifier
import shap
from xgboost import XGBClassifier

## Step 1: Checking the Data

In [None]:
# Loading the datasets
DATA_DIR = "/kaggle/working/datathon-2024/ristek-datathon-2024/ristek-datathon-2024"

df_train = pd.read_csv(f"{DATA_DIR}/train.csv", index_col=False)
df_test = pd.read_csv(f"{DATA_DIR}/test.csv", index_col=False)
df_sub = pd.read_csv(f"{DATA_DIR}/sample_submission.csv", index_col=False)
df_loan_activities = pd.read_csv(f"{DATA_DIR}/loan_activities.csv", index_col=False)
df_non_borrower_user = pd.read_csv(f"{DATA_DIR}/non_borrower_user.csv", index_col=False)

### 1.1 Display the first few rows of the dataset

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_loan_activities.head()

In [None]:
df_non_borrower_user.head()

### 1.2 Check data types and missing values

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_loan_activities.info()

In [None]:
df_non_borrower_user.info()

### 1.3 Summary statistics

In [None]:
df_train.describe()

In [None]:
df_test.describe()

In [None]:
df_loan_activities.describe()

In [None]:
df_non_borrower_user.describe()

## Step 2: Tidying the data

### Do we need to tidy the outliers? 

Outliers are data points that deviate significantly from the rest of the observations in a dataset. In many machine learning models, especially linear models, outliers can heavily influence the model's performance and predictions. Therefore, it's common practice to identify and handle outliers before training these models. However, this is not always necessary for tree-based models like Decision Trees, Random Forests, Gradient Boosting Machines (GBMs), and their variants (e.g., XGBoost, LightGBM, CatBoost).

In the modeling step we use tree-based models, which are robust to outliers, so we keep the outliers and leave the skewed distributions untransformed.

### Merging and adding features from df_loan_activities to df_train and df_test

In [None]:
df_loan_activities['loan_count'] = 1  # Add a helper column for counting loans
loan_features = df_loan_activities.groupby('user_id').agg({
    'loan_type': ['nunique', 'count'],  # Number of unique loan types and total loans
    'ts': ['min', 'max', 'mean', 'std'], # First, last, mean, and standard deviation of loan timestamps
    'loan_count': 'sum'                 # Total number of loans
})

# Flatten the column names
loan_features.columns = ['loan_type_nunique', 'loan_type_count', 'loan_ts_min', 'loan_ts_max', 'loan_ts_mean', 'loan_ts_std', 'loan_count']
loan_features.reset_index(inplace=True)

In [None]:
# Create the in_loan_activities flag column
df_train['in_loan_activities'] = df_train['user_id'].isin(df_loan_activities['user_id']).astype(int)
df_test['in_loan_activities'] = df_test['user_id'].isin(df_loan_activities['user_id']).astype(int)

In [None]:
df_train = df_train.merge(loan_features, on='user_id', how='left')
df_test = df_test.merge(loan_features, on='user_id', how='left')

# Fill the missing values that appear for users who are not in loan_activities
df_train.fillna(0, inplace=True)
df_test.fillna(0, inplace=True)
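
The aggregate-then-merge pattern used in this step can be checked on a toy example. The sketch below uses made-up frames (`users`, `loans`) standing in for `df_train` and `df_loan_activities`, and it swaps in pandas named aggregation, which yields flat column names without a separate renaming step. It is an illustration under those assumptions, not the notebook's exact code.

```python
import pandas as pd

# Toy stand-ins for df_train and df_loan_activities (values are made up)
users = pd.DataFrame({"user_id": [1, 2, 3]})
loans = pd.DataFrame({
    "user_id": [1, 1, 3],
    "loan_type": ["a", "b", "a"],
    "ts": [10, 30, 20],
})

# One row per borrower; named aggregation gives flat column names directly
loan_features = loans.groupby("user_id").agg(
    loan_type_nunique=("loan_type", "nunique"),
    loan_count=("loan_type", "count"),
    loan_ts_min=("ts", "min"),
    loan_ts_max=("ts", "max"),
).reset_index()

# Flag membership, then left-merge so users without loans are kept
users["in_loan_activities"] = users["user_id"].isin(loans["user_id"]).astype(int)
merged = users.merge(loan_features, on="user_id", how="left").fillna(0)
print(merged)
```

User 2 has no loans, so every loan feature for that row is filled with 0, and the `in_loan_activities` flag is what separates "no loans" from a genuine zero.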

In [None]:
loan_activities_with_label = df_loan_activities.merge(df_train[['user_id', 'label']], left_on='reference_contact', right_on='user_id', how='left')

In [None]:
# Compute the average fraud label of the reference contacts for each user_id in loan_activities
fraud_avg = loan_activities_with_label.groupby('user_id_x')['label'].mean().reset_index()
fraud_avg.columns = ['user_id', 'reference_fraud_avg']
df_train = df_train.merge(fraud_avg, on='user_id', how='left')
df_test = df_test.merge(fraud_avg, on='user_id', how='left')

In [None]:
# Fill missing values with the sentinel -999 (users with no labeled reference contact)
df_train['reference_fraud_avg'] = df_train['reference_fraud_avg'].fillna(-999)
df_test['reference_fraud_avg'] = df_test['reference_fraud_avg'].fillna(-999)
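
A small sketch of the reference-contact feature, using made-up data, shows why a sentinel is used instead of 0: a user whose references are all non-fraud has an average of 0.0, which is different from a user with no labeled reference at all. The frames and the `suffixes` choice below are illustrative assumptions.

```python
import pandas as pd

# Toy labeled users and loan records (values are made up)
train = pd.DataFrame({"user_id": [1, 2, 3], "label": [1, 0, 0]})
loans = pd.DataFrame({
    "user_id": [10, 10, 11, 12],
    "reference_contact": [1, 2, 3, 99],  # 99 is not a labeled user
})

# Look up the label of each borrower's reference contact
with_label = loans.merge(train, left_on="reference_contact", right_on="user_id",
                         how="left", suffixes=("", "_ref"))

# Average the reference labels per borrower
fraud_avg = (with_label.groupby("user_id")["label"].mean()
             .rename("reference_fraud_avg").reset_index())

# Borrowers whose references carry no label get the sentinel
fraud_avg["reference_fraud_avg"] = fraud_avg["reference_fraud_avg"].fillna(-999)
print(fraud_avg)
```

Tree-based models can isolate the sentinel with a single split, so the "no information" case does not blend into the "all references clean" case.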

In [None]:
df_train

In [None]:
df_test

## Step 3: Exploratory Data Analysis (EDA)

In [None]:
# Visualize the target variable distribution
sns.countplot(x='label', data=df_train)
plt.title('Fraud vs Non-Fraud Distribution')
plt.show()

In [None]:
# Correlation matrix for train dataset
plt.figure(figsize=(20, 18))
sns.heatmap(df_train.corr(), annot=True, fmt=".2f")
plt.title('Correlation Matrix for Train Dataset')
plt.show()

In [None]:
# Loan type distribution
sns.countplot(y='loan_type', data=df_loan_activities)
plt.title('Loan Type Distribution')
plt.show()

In [None]:
# Analyze reference contacts in loan activities
reference_contacts_count = df_loan_activities['reference_contact'].value_counts()
print("Top 10 Reference Contacts:")
print(reference_contacts_count.head(10))

### 3.1 Understanding Data Relationships and Detecting Anomalies

### 3.2 Automated Visualization with ydata-profiling (formerly Pandas Profiling)

After manually exploring the data, we summarize the EDA so far and look for further insights with this library.

In [None]:
!pip install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

In [None]:
profile_train = ProfileReport(df_train, title = "Training Insight")
profile_train.to_notebook_iframe()

In [None]:
profile_test = ProfileReport(df_test, title = "Test Insight")
profile_test.to_notebook_iframe()

### 3.3 Correlation Matrix

We compute Spearman and Pearson correlations between features to see which ones are most strongly correlated with the target. For this, we first drop the categorical column from the dataset.

In [None]:
train_dum = train_df.drop(columns = ["city_or_regency"])

In [None]:
train_dum.info()

#### 3.3.1 Spearman Correlation

In [None]:
numeric_cols = train_dum.select_dtypes(include=np.number)
corr_matrix = numeric_cols.corr(method='spearman')

# Plotting the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Spearman Correlation Matrix')
plt.show()

#### 3.3.2 Pearson Correlation

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = train_dum.select_dtypes(include=np.number)
corr_matrix = numeric_cols.corr(method='pearson')

# Plotting the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Pearson Correlation Matrix')
plt.show()

From these correlation matrices we can see some interactions between features. Warmer (reddish) colors indicate strong positive correlation, while cooler (bluish) colors indicate negative correlation.
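
The two methods can disagree, which is why both are worth computing. As a small illustration on made-up data, a relationship that is monotonic but not linear gets a perfect Spearman coefficient (it only looks at ranks) and a lower Pearson coefficient (it measures linear association).

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients for a feature pair is a hint that the relationship is non-linear or driven by a few extreme values.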

### 3.4 Missing Data Patterns

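We inspect the pattern of missing values to judge whether the data are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).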

In [None]:
import plotly.graph_objs as go
import plotly.offline as pyo
import numpy as np
import pandas as pd

# create a heatmap trace using plotly
trace = go.Heatmap(z=train_df.isnull().values.astype(int),
                   colorscale='Viridis',
                   showscale=True)

# create a plot using plotly
layout = go.Layout(title='Heatmap of Missing Values',
                   xaxis=dict(title='Columns'),
                   yaxis=dict(title='Rows'))
fig = go.Figure(data=[trace], layout=layout)
pyo.iplot(fig)

The heatmap shows that missing values are concentrated in a few columns, while the other columns are mostly complete. Within those columns, the missing entries are scattered across rows without a clear block structure. This apparent randomness suggests that simple imputation methods, such as mean, median, or mode imputation, might be sufficient to address the missing data.
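
A numeric summary can complement the heatmap. The sketch below, on a made-up frame, computes the share of missing values per column; the column names are placeholders. It quantifies where the gaps are, but on its own it does not establish whether the mechanism is MCAR, MAR, or MNAR.

```python
import pandas as pd

nan = float("nan")

# Made-up frame with gaps concentrated in column "a"
df = pd.DataFrame({
    "a": [1.0, nan, 3.0, nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [nan, 2.0, 3.0, 4.0],
})

# Share of missing entries per column, largest first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Rows where more than one column is missing at once
rows_multi_missing = int((df.isnull().sum(axis=1) > 1).sum())
print(rows_multi_missing)
```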

#### 3.4.1 Imputing data

In [None]:
solid_waste_generated_mean = train_df['solid_waste_generated'].mean()
green_open_space_mode = train_df['green_open_space'].mode()[0]
solid_waste_mode = train_df['solid_waste_generated'].mode()[0]
total_landfills_mode =train_df['total_landfills'].mode()[0]

def impute_data(df):

    # Set 'solid_waste_generated' to 0 where 'total_landfills' is NaN and 'solid_waste_generated' is also NaN
    df.loc[df['total_landfills'].isna() & df['solid_waste_generated'].isna(), 'solid_waste_generated'] = 0

    # Fill NaN in 'solid_waste_generated' with the mode where 'total_landfills' is not NaN
    df.loc[df['total_landfills'].notna() & df['solid_waste_generated'].isna(), 'solid_waste_generated'] = solid_waste_mode

    # Fill NaN in 'total_landfills' with the mode where 'solid_waste_generated' is not NaN
    df.loc[df['solid_waste_generated'].notna() & df['total_landfills'].isna(), 'total_landfills'] = total_landfills_mode

    # Fill the remaining NaN values in 'solid_waste_generated' with 0
    df['solid_waste_generated'].fillna(0, inplace=True)
    
    # Fill the remaining NaN values in 'total_landfills' with 0
    df['total_landfills'].fillna(0, inplace=True)

    # Fill NaN values in 'green_open_space' with 0
    df['green_open_space'].fillna(0, inplace=True)
    
    return df

# Apply the imputation function to both train_df and test_df
train_df = impute_data(train_df)
test_df = impute_data(test_df)

We set the missing values to 0 when both the landfill count and the solid waste data are absent. When only one of the two is missing, we impute it with the mode of that column. Any remaining gaps, including those in green_open_space, are filled with 0.

## Step 4: Feature Engineering

The feature engineering process creates several new features to enhance the dataset.

As mentioned in 3.1.2, we will extract more information from the city name by converting it to latitude and longitude.

In [None]:
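import os

# Read the Google Maps API key from the environment instead of hard-coding it in the notebook
# (GOOGLE_MAPS_API_KEY is an assumed variable name)
api_key = os.environ["GOOGLE_MAPS_API_KEY"]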
geolocator = GoogleV3(api_key=api_key)

def geocode(city):
    try:
        location = geolocator.geocode(city + ", Indonesia")
        if location:
            return location.latitude, location.longitude
        else:
            return None, None
    except Exception as e:
        print(f"Error geocoding {city}: {e}")
        return None, None

# Apply geocode function to city_or_regency column with rate limiting
train_df['latitude'], train_df['longitude'] = zip(*train_df['city_or_regency'].apply(lambda x: geocode(x) if pd.notnull(x) else (None, None)))
time.sleep(1)  # Adding delay to respect rate limits
test_df['latitude'], test_df['longitude'] = zip(*test_df['city_or_regency'].apply(lambda x: geocode(x) if pd.notnull(x) else (None, None)))

With the latitude and longitude created, we can plot the locations on a map for further analysis of the train data.

In [None]:
import folium

def display_map_with_density_markers(data):
    # Drop rows that could not be geocoded, then count the occurrences of each location
    located = data.dropna(subset=['latitude', 'longitude'])
    location_counts = (located.groupby(['latitude', 'longitude'])
                       .agg(city_or_regency=('city_or_regency', 'first'),
                            count=('city_or_regency', 'size'))
                       .reset_index())

    # Create a map centered at the first geocoded location in the DataFrame
    map_object = folium.Map(location=[located.iloc[0]['latitude'], located.iloc[0]['longitude']], zoom_start=10)

    # Add one marker per location, sized by the number of occurrences (capped at 20)
    for _, row in location_counts.iterrows():
        folium.CircleMarker([row['latitude'], row['longitude']],
                            radius=min(3 + int(row['count']), 20),
                            popup=row['city_or_regency'],
                            fill=True, fill_opacity=0.4).add_to(map_object)

    # Return the map so the notebook can display it
    return map_object

# Assuming 'data' is your DataFrame with latitude and longitude columns
map_object = display_map_with_density_markers(train_df)
map_object.save('map_with_density_markers.html')  # Save the map as an HTML file
map_object



The gdp_per_capita feature measures the average economic output per person in a city or regency by dividing the gross_regional_domestic_product by the population. The landfill/sampah feature calculates the amount of solid waste generated per landfill by dividing the solid_waste_generated by the total_landfills, providing insight into waste management efficiency. The density_category feature simplifies population density into three categories: low (density less than 500 people/km²), medium (density between 500 and 2000 people/km²), and high (density greater than 2000 people/km²), making it easier to interpret density differences. 

The landfills_per_100k feature standardizes the number of landfills relative to the population size, offering a measure of waste management infrastructure. 

The socioeconomic_index is a composite index that combines normalized values of the hdi, gdp_per_capita, and densities to capture overall socioeconomic status. 
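
As a minimal sketch of how such a composite index can be formed, the example below min-max normalizes three made-up columns and averages them row by row. The helper `min_max` and all values are illustrative assumptions; the notebook itself uses scikit-learn's `MinMaxScaler` for the normalization.

```python
def min_max(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values for three regions
hdi = [60.0, 70.0, 80.0]
gdp_per_capita = [10.0, 40.0, 20.0]
densities = [100.0, 1000.0, 4000.0]

# Normalize each column, then average the normalized values per region
normalized = [min_max(col) for col in (hdi, gdp_per_capita, densities)]
socioeconomic_index = [sum(row) / len(row) for row in zip(*normalized)]
print([round(v, 3) for v in socioeconomic_index])
```

Normalizing first matters: without it, the column with the largest raw scale (here the density) would dominate the average.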

Finally, the x, y, and z features transform geographic coordinates into Cartesian coordinates, making it easier to calculate spatial relationships and distances. These engineered features provide a more detailed and nuanced understanding of the dataset, enhancing the performance of machine learning models by incorporating economic, demographic, and geographical aspects.
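
The coordinate transform can be sketched with the standard library alone. The function name `to_cartesian` and the approximate coordinates are illustrative assumptions. Each point is mapped onto the unit sphere, so the resulting vector has length 1 and straight-line (chord) distances between points become simple Euclidean distances.

```python
import math

def to_cartesian(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return x, y, z

# Approximate coordinates, for illustration only
jakarta = to_cartesian(-6.2, 106.8)
surabaya = to_cartesian(-7.25, 112.75)

norm = math.sqrt(sum(c * c for c in jakarta))
chord = math.dist(jakarta, surabaya)  # Euclidean distance on the unit sphere
print(round(norm, 6), round(chord, 4))
```

Unlike raw longitude, these coordinates have no discontinuity at the ±180° meridian, which is one reason the transform is popular for distance-based features.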

In [None]:
# Converting categorical features into numerical features
# For 'traffic_density'

label_enc = LabelEncoder()
train_df["traffic_density"] = label_enc.fit_transform(train_df['traffic_density'])
# Reuse the encoder fitted on the training data so both sets share the same mapping
test_df["traffic_density"] = label_enc.transform(test_df['traffic_density'])

In [None]:
from sklearn.preprocessing import MinMaxScaler
import math

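def geo_ekono_feat(df):
    scaler = MinMaxScaler()
    # Min-max normalize the inputs; the normalized copy of 'densities' is stored as 'population_density'
    df[['hdi', 'gdp_per_capita', 'population_density']] = scaler.fit_transform(df[['hdi', 'gdp_per_capita', 'densities']])
    # Average the normalized columns (not the raw 'densities') into a single index
    df['socioeconomic_index'] = df[['hdi', 'gdp_per_capita', 'population_density']].mean(axis=1)
    
    return df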
    
    
def categorize_density(density):
    if density < 500:
        return 0
    elif density < 2000:
        return 1
    else:
        return 2

def capturing_latlang(df):
    df["latitude"] = df["latitude"].apply(math.radians)
    df["longitude"] = df["longitude"].apply(math.radians)
    
    df['x'] = df['latitude'].apply(math.cos) * df['longitude'].apply(math.cos)
    df['y'] = df['latitude'].apply(math.cos) * df['longitude'].apply(math.sin)
    df['z'] = df['latitude'].apply(math.sin)
    
    df.drop(columns = ["latitude", "longitude"], inplace = True)
    
    return df

def feature_eng(df):
    df['gdp_per_capita'] = df['gross_regional_domestic_product'] / df['population']
    df['landfill/sampah'] = df['solid_waste_generated'] / df["total_landfills"]
    df['density_category'] = df['densities'].apply(categorize_density)
    df['landfills_per_100k'] = (df['total_landfills'] / df['population']) * 100000
    
    df_sorted = df.sort_values(by=['city_or_regency', 'year'])  # Sort the dataframe
    
    df_sorted['population_change'] = df_sorted.groupby('city_or_regency')['population'].pct_change()
    df_sorted['hdi_change'] = df_sorted.groupby('city_or_regency')['hdi'].pct_change()
    df_sorted['gdp_change'] = df_sorted.groupby('city_or_regency')['gross_regional_domestic_product'].pct_change()
    
    df_sorted['population_change'].fillna(0, inplace=True)
    df_sorted['hdi_change'].fillna(0, inplace=True)
    df_sorted['gdp_change'].fillna(0, inplace=True)
    
    # Merge the sorted results back into the original unsorted df by 'id'
    df= pd.merge(df, df_sorted[['id', 'population_change', 'hdi_change', 'gdp_change']],
                         on='id', how='left')
    
    df = geo_ekono_feat(df)
    df = capturing_latlang(df)
    
    return df

In [None]:
train_df = feature_eng(train_df)
test_df = feature_eng(test_df)

For later modeling, we drop the highly correlated features that we created here.

In [None]:
train_df.drop(columns = ['socioeconomic_index', 'population_density', 'landfill/sampah', 'density_category'], inplace = True)
test_df.drop(columns = ['socioeconomic_index', 'population_density', 'landfill/sampah', 'density_category'], inplace = True)

## Step 5: Modeling

### 5.1 Split data

Load data from the pre-processed data

In [None]:
train = pd.read_csv("/kaggle/input/datathon-help/train_with_fraud_avg.csv")
test = pd.read_csv("/kaggle/input/datathon-help/test_with_fraud_avg.csv")
loan_activities = pd.read_csv("/kaggle/input/ori-datathon-24/loan_activities.csv")

Split data to X and y

In [None]:
X_train = train.drop(['user_id','label'],axis=1)
y_train = train['label']
X_test = test.drop(['user_id'],axis=1)

### 5.2 Params

The parameters are based on tuning using optuna

In [None]:
from sklearn.utils.class_weight import compute_class_weight

In [None]:
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: class_weights[i] for i in range(len(class_weights))}
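
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The pure-Python sketch below reproduces that formula on a made-up, heavily imbalanced label list; the helper name `balanced_class_weights` is an illustrative assumption. It also shows that the ratio of the two weights equals the negative-to-positive count ratio, which is the usual choice for XGBoost's `scale_pos_weight`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# Made-up labels: 98 non-fraud, 2 fraud
y = [0] * 98 + [1] * 2
weights = balanced_class_weights(y)

# Ratio of the weights equals (number of negatives) / (number of positives)
scale_pos_weight = weights[1] / weights[0]
print(weights, scale_pos_weight)
```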

In [None]:
best_params_lgb_weight ={
    'lambda_l1': 0.5752012128486822,
    'lambda_l2': 0.5677230575819568,
    'num_leaves': 33,
    'feature_fraction': 0.9,
    'bagging_fraction': 1.0,
    'bagging_freq': 2,
    'min_child_samples': 71,
    'class_weight': class_weights_dict,  # Adding class weights
    'device': 'gpu',  # Use GPU
    'gpu_use_dp': True,  # Use double precision
    'random_state': 42
}  # Best trial: 0.044426967000041645, weight = {0: 0.5064087731186884, 1: 39.50902643455835}

# Best weighted params
best_params_cb_weight = {
    'iterations': 333,
    'depth': 5,
    'learning_rate': 0.08285986576978271,
    'l2_leaf_reg': 1.8188727288827022,
    'border_count': 189,
    'random_strength': 0.04589201773336512,
    'bagging_temperature': 3.3517475268377224,
    'od_type': 'IncToDec',
    'od_wait': 25,
    'class_weights': class_weights,  # Adding class weights
    'task_type': 'GPU',  # Use GPU
    'random_state': 42
}  # Best trial: 0.04443454009809692, weight = {0: 0.5064087731186884, 1: 39.50902643455835}

best_params_xgb_weight = {'lambda': 9.14729918837456, 'alpha': 0.06394468748796704, 'colsample_bytree': 0.7, 'subsample': 0.9, 'learning_rate': 0.06108581565857493, 'n_estimators': 791, 'max_depth': 4, 'min_child_weight': 7}

# Adding the missing parameters
best_params_xgb_weight['scale_pos_weight'] = class_weights_dict[1] / class_weights_dict[0]
best_params_xgb_weight['tree_method'] = 'hist'  # Best trial: 0.04462405348434853
best_params_xgb_weight['device'] = "gpu"

In [None]:
# lgb_params = {'lambda_l1': 0.0339770921696201, 'lambda_l2': 9.530796262405104, 'num_leaves': 65, 'feature_fraction': 0.7, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'min_child_samples': 43}

# cb_params = {'iterations': 831, 'depth': 6, 'learning_rate': 0.03338644432817077, 'l2_leaf_reg': 0.17320396497280668, 'border_count': 221, 'random_strength': 0.7071802263431674, 'bagging_temperature': 0.16496895778469275, 'od_type': 'Iter', 'od_wait': 41}

# xg_params = {'lambda': 0.3077128191383428, 'alpha': 0.0012520196417128444, 'colsample_bytree': 0.7, 'subsample': 0.9, 'learning_rate': 0.02270602067576259, 'n_estimators': 861, 'max_depth': 5, 'min_child_weight': 9}

rf_params = {
    'max_depth': 15,
    'min_samples_leaf': 8,
    'random_state': 42
}

### 5.3 Classifier

In [None]:
RANDOM_SEED = 42

Below are the first-level (base) classifiers used in the stacking model. The predicted probabilities of these first-level models are used by the second-level (meta) model to classify the data.
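
The two-level idea can be illustrated with a compact sketch. It uses scikit-learn's `StackingClassifier` as a stand-in (the notebook itself uses mlxtend's `StackingCVClassifier`), synthetic data from `make_classification`, and two small base models, all chosen here purely for illustration. The base models produce out-of-fold probabilities, and the logistic regression meta-model is trained on those probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the competition features
X, y = make_classification(n_samples=200, n_features=8,
                           weights=[0.9, 0.1], random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ],
    final_estimator=LogisticRegression(C=0.1),  # second-level (meta) model
    stack_method="predict_proba",               # feed probabilities to the meta model
    cv=3,                                       # out-of-fold predictions for training it
)
stack.fit(X, y)

# Probability of the positive (fraud-like) class for each sample
proba = stack.predict_proba(X)[:, 1]
print(proba[:5])
```

Training the meta-model on out-of-fold predictions, rather than on predictions for data the base models have already seen, is what keeps the second level from simply learning the base models' overfitting.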

In [None]:
cl1 = RandomForestClassifier(**rf_params)
cl2 = DecisionTreeClassifier(max_depth = 5)
cl3 = CatBoostClassifier(**best_params_cb_weight)
cl4 = LGBMClassifier(**best_params_lgb_weight)
cl5 = ExtraTreesClassifier(bootstrap=False, criterion='entropy', max_features=0.55, min_samples_leaf=8, min_samples_split=4, n_estimators=100) # Optimized using TPOT
cl6 = XGBClassifier(**best_params_xgb_weight)

In [None]:
classifiers = {
    "RandomForest": cl1,
    "DecisionTree": cl2,
    "CatBoost": cl3,
    "LGBM": cl4,
    "ExtraTrees": cl5,
    "XGBoost":cl6
}

### 5.4 Level 2 Classifier

Using Logistic Regression as Meta Model

In [None]:
mlr = LogisticRegression()

## Step 6: Training

### 6.1 Level 1 Classifiers

Fitting the level 1 classifiers

In [None]:
models_names = list() 

In [None]:
print(">>>> Training started <<<<")
for key, classifier in classifiers.items():
    models_names.append(key)
    classifier.fit(X_train, y_train)
    # Report completion only after the fit has finished
    print(f"{key} done!")

### 6.2 Meta Classifier (Log Reg)

Adding the pre-fitted models (from 6.1) to a list

In [None]:
used_model = ['RandomForest', 'DecisionTree', 'ExtraTrees', 'LGBM','CatBoost','XGBoost'] 
classifier_exp = []
for label in used_model:
    classifier = classifiers[label]
    classifier_exp.append(classifier)

Using StackingCVClassifier as the stacker

In [None]:
from sklearn.metrics import get_scorer
from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Built-in average precision scorer: it scores predicted probabilities (or decision values)
# and avoids the `needs_proba` argument, which newer scikit-learn releases no longer accept
average_precision_scorer = get_scorer("average_precision")

# # Assuming classifier_exp is defined and contains your classifiers
# classifier_exp = [classifier1, classifier2, classifier3]  # Replace with actual classifiers

# Define random seed
RANDOM_SEED = 42

# Initialize the StackingCVClassifier
scl = StackingCVClassifier(classifiers=classifier_exp,
                           meta_classifier=LogisticRegression(C=0.1), 
                           use_probas=True,  
                           random_state=RANDOM_SEED)

# # Perform cross-validation
# scores = cross_val_score(scl, X_train, y_train, cv=2, scoring=average_precision_scorer)
# print("Meta model (scl) - average precision: %0.5f " % (scores.mean()))

# Fit the model
scl.fit(X_train, y_train)

## Step 7: Prediction

After tuning, stacking, and modeling, we reach the final step: the predictions!

### 7.1 Model Prediction

In [None]:
y_pred = scl.predict(X_test)

### 7.2 Group average (reference)

Instead of submitting the predictions directly, we look at the highest fraud value among each user's reference contacts. Because the test dataset now has predictions, we first merge the test and train datasets so that the reference lookup covers both.

In [None]:
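# scl.predict returns a NumPy array aligned with the rows of `test`,
# so attach it as a column instead of merging on user_id
test['pred'] = y_pred
# Assumption: training users carry their known label as 'pred'
train['pred'] = train['label']

# Combine the train and test datasets
combined = pd.concat([train, test], ignore_index=True)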

Getting the reference info from the loan_activities, as we dropped the reference_contact column before

In [None]:
loan_activities_with_pred = loan_activities.merge(combined[['user_id', 'pred']], left_on='reference_contact', right_on='user_id', how='left')

# Take the highest prediction among the reference contacts of each user_id in loan_activities
fraud_avg_pred = loan_activities_with_pred.groupby('user_id_x')['pred'].max().reset_index()
fraud_avg_pred.columns = ['user_id', 'reference_fraud_avg']

# Merge the result back into the combined dataset
combined = combined.merge(fraud_avg_pred, on='user_id', how='left')

### 7.3 Predict Based on Group Average

Some user_id values do not exist in loan_activities.csv, so for those we fall back to the fraud value in the label column.

In [None]:
combined['reference_fraud_avg'] = combined['reference_fraud_avg'].fillna(combined['label'])

# Classify reference_fraud_avg > 0.5 as fraud (label = 1)
combined['label'] = (combined['reference_fraud_avg'] > 0.5).astype(int)
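
The group-based relabeling can be traced on a toy example. The frames and the single `pred` column below are simplified, made-up stand-ins for the notebook's `combined` and `loan_activities`. Each borrower takes the maximum prediction among their reference contacts, users without loan records keep their own value, and the result is thresholded at 0.5.

```python
import pandas as pd

# Made-up predictions for four users and their loan reference contacts
combined = pd.DataFrame({"user_id": [1, 2, 3, 4], "pred": [1, 0, 0, 1]})
loans = pd.DataFrame({"user_id": [1, 2, 2], "reference_contact": [3, 3, 4]})

# Attach the prediction of each reference contact to the loan record
ref_pred = combined.rename(columns={"user_id": "reference_contact", "pred": "ref_pred"})
ref = loans.merge(ref_pred, on="reference_contact", how="left")

# Highest prediction among each borrower's references
ref_max = ref.groupby("user_id")["ref_pred"].max()

# Users without loan records fall back to their own value
combined["reference_fraud_avg"] = combined["user_id"].map(ref_max)
combined["reference_fraud_avg"] = combined["reference_fraud_avg"].fillna(combined["pred"])
combined["label"] = (combined["reference_fraud_avg"] > 0.5).astype(int)
print(combined)
```

Note that user 1 is flipped from 1 to 0 here because their only reference is predicted non-fraud, so this step can override the model in both directions.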

Select the test label in the combined dataset

In [None]:
test_updated = combined[combined['user_id'].isin(test['user_id'])]

### 7.4 Submission

In [None]:
sample = pd.read_csv("/kaggle/input/ori-datathon-24/sample_submission.csv")

In [None]:
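# Align on user_id rather than on the row index, which differs between the two frames
label_by_user = test_updated.drop_duplicates('user_id').set_index('user_id')['label']
sample['label'] = sample['user_id'].map(label_by_user)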

In [None]:
sample.to_csv("Submission_stacking_grouped.csv", index=False)

## Conclusions

In this notebook, we built a fraud detection pipeline for the RISTEK Datathon 2024 using a comprehensive approach that included Exploratory Data Analysis (EDA) and advanced modeling techniques such as stacking.

The use of stacking in our modeling process was a key component. Stacking allowed us to combine the strengths of multiple models to improve our predictions. By leveraging the predictions of several base models and using a meta-model to make the final predictions, we aimed to improve the accuracy and robustness of our fraud predictions.

Overall, the combination of thorough EDA, feature engineering, and stacking produced a model for identifying fraudulent users. This comprehensive approach underscores the importance of detailed data analysis and sophisticated modeling techniques in developing effective predictive models.