<h1 style="text-align: center;">Predicting Crab Ages using Random Forest</h1>

<a id='table_of_contents'></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">Table of Contents</h2>

1. <a href="#download" style="text-decoration: None">Download Data</a>
2. <a href="#import" style="text-decoration: None">Import Libraries and Dataset</a>
3. <a href="#data_preview" style="text-decoration: None">Dataset Preview</a>
4. <a href="#data_wrangling" style="text-decoration: None">Data Wrangling</a>
5. <a href="#eda" style="text-decoration: None">Exploratory Data Analysis</a>
    - <a href="#univariate" style="text-decoration: None">Univariate Analysis</a>
    - <a href="#bivariate" style="text-decoration: None">Bivariate Analysis</a>
6. <a href="#data_preprocessing" style="text-decoration: None">Data Preparation and Preprocessing</a>
7. <a href="#baseline" style="text-decoration: None">Baseline Models</a>
8. <a href="#optimization" style="text-decoration: None">Optimization: Hyperparameter Tuning</a>
9. <a href="#performance_summary" style="text-decoration: None">Performance Comparison and Summary</a>
10. <a href="#save_model" style="text-decoration: None">Make Prediction on Test Data and Save Model</a>

<a id="download"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">1. Download Data</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

I downloaded the dataset from within the Jupyter notebook using Jovian's `opendatasets` library. The dataset description is available <a href="https://www.kaggle.com/competitions/playground-series-s3e16/data" style="text-decoration: None">here</a>.

<strong>Note: Uncomment the following code cells if you are working outside the Kaggle environment.</strong>

In [None]:
# import os
# import opendatasets as od

In [None]:
# od.download('https://www.kaggle.com/competitions/playground-series-s3e16/data')

In [None]:
# print(['sample'])

In [None]:
# os.listdir('playground-series-s3e16')

<a id="import"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">2. Import Libraries and Dataset</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)

import warnings 
warnings.filterwarnings("ignore")

In [None]:
# On Kaggle the files live under ../input. Outside Kaggle, the fallback reads
# them from the current working directory; if you downloaded the data with
# opendatasets, use the 'playground-series-s3e16/' paths below instead.
# train_df = pd.read_csv('playground-series-s3e16/train.csv')
# test_df = pd.read_csv('playground-series-s3e16/test.csv')
# sub_df = pd.read_csv('playground-series-s3e16/sample_submission.csv')

try:
    train_df = pd.read_csv('../input/playground-series-s3e16/train.csv')
    test_df = pd.read_csv('../input/playground-series-s3e16/test.csv')
    sub_df = pd.read_csv('../input/playground-series-s3e16/sample_submission.csv')
except FileNotFoundError:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    sub_df = pd.read_csv('sample_submission.csv')

<a id="data_preview"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">3. Dataset Preview</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

Here I perform a preliminary assessment of data quality. This involves checking for incorrect data types, missing values, duplicates, and erroneous entries, and reviewing summary statistics.

In [None]:
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.describe().T

In [None]:
train_df.info()
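
The calls above do not count missing values or duplicated rows directly. Below is a minimal sketch of those two checks on a small made-up frame; `toy_df` and its values are illustrative and not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [1.2, 1.5, np.nan, 1.2],
    "Age": [9, 11, 6, 9],
})

# Missing values per column
missing_counts = toy_df.isna().sum()

# Number of fully duplicated rows (the first occurrence is not counted)
duplicate_count = int(toy_df.duplicated().sum())

print(missing_counts)
print(duplicate_count)
```

The same two calls can be run on `train_df` as is.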

<a id="data_wrangling"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">4. Data Wrangling</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="data_wrangling"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">4.1. Drop <code>id</code> Column</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
# `columns=` already identifies the axis, so `axis=1` is not needed
train_df.drop(columns=['id'], inplace=True)

<a id="eda"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">5. Exploratory Data Analysis</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
#!pip install kaleido

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

sns.set_style("darkgrid")

In [None]:
# plot_color = ['lightcoral','#008080']
plot_color = ['#008080', 'black']
sns.set_palette(['#008080', 'black'])

In [None]:
def custom_show(fig):
    fig.update_layout(title_x=0.5, title_y=0.9)
    fig.show('png', width=1000, height=550)

<a id="univariate"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">5.1. Univariate Analysis</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

### `Age`

In [None]:
!pip install kaleido

In [None]:
fig = px.histogram(train_df, x='Age', title='Distribution of Crab Ages', color_discrete_sequence=plot_color)
custom_show(fig)

<a id="bivariate"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">5.2. Bivariate Analysis</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

#### Scatter Plot: `Diameter` vs. `Age`

In [None]:
numeric_cols = train_df.select_dtypes(include=np.number).columns.tolist()

In [None]:
fig = px.scatter(train_df[numeric_cols], x='Diameter', y='Age')
custom_show(fig)

#### Correlation Matrix

In [None]:
numeric_cols = train_df.select_dtypes(include=np.number).columns.tolist()

In [None]:
corr_matrix = train_df[numeric_cols].corr()

In [None]:
fig = px.imshow(corr_matrix, color_continuous_scale='Viridis', title='Correlation Matrix')

fig.update_layout(width=600, height=600, title_x=0.6, title_y=0.9)

fig.show('png')

<a id="data_preprocessing"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">6. Data Preparation and Preprocessing</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="scale"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">6.1. Identify Input and Target Columns</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
train_df.columns

In [None]:
input_cols = train_df.columns[:-1]
input_cols

In [None]:
target = 'Age'

In [None]:
# Work on an explicit copy so later column assignments do not touch train_df
inputs_df = train_df[input_cols].copy()
targets = train_df[target]

In [None]:
train_df.drop(columns=['Age'], inplace=True)

<a id="scale"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">6.2. Scale All Numeric Data</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
numeric_cols = train_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_df.select_dtypes(exclude=np.number).columns.tolist()

In [None]:
# Fit the scaler on the training inputs only, then reuse it for the test data
inputs_df[numeric_cols] = scaler.fit_transform(inputs_df[numeric_cols])
test_df[numeric_cols] = scaler.transform(test_df[numeric_cols])

In [None]:
inputs_df.describe()
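
A minimal sketch of the fit-on-train, transform-on-test pattern, using made-up numbers. The names `toy_train`, `toy_test` and `toy_scaler` are hypothetical. The scaler learns its minimum and maximum from the training frame only, so test values outside that range can land outside [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames standing in for the training and test inputs
toy_train = pd.DataFrame({"Length": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"Length": [2.0, 4.0]})

toy_scaler = MinMaxScaler()

# Learn min/max from the training data only
train_scaled = toy_scaler.fit_transform(toy_train)

# Reuse the training min/max for the test data
test_scaled = toy_scaler.transform(toy_test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Calling `fit_transform` on the test frame instead would rescale it with its own minimum and maximum, so the same raw value would map to different scaled values in the two frames.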

<a id="scale"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">6.3. Encode Categorical Columns</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# `sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [None]:
encoder.fit(inputs_df[categorical_cols])

In [None]:
encoded_cols = encoder.get_feature_names_out()
encoded_cols

In [None]:
inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])
test_df[encoded_cols] = encoder.transform(test_df[categorical_cols])

In [None]:
inputs_df.head()

<a id="scale"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">6.4. Split Data Into Training and Validation Sets</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
input_cols = list(numeric_cols) + list(encoded_cols)

In [None]:
inputs_df = inputs_df[input_cols]
test_df = test_df[input_cols]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(inputs_df, targets, test_size=0.1, random_state=42)

In [None]:
X_train.head()

In [None]:
y_train.head()

<a id="baseline"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">7. Baseline Models</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

In [None]:
def evaluate_model(model, X_train, y_train, X_val, y_val, y_pred):
    """Return R2, adjusted R2, RMSE and MAE on the validation set.

    X_train and y_train are kept in the signature but are not used.
    """
    MAE = mean_absolute_error(y_val, y_pred)
    RMSE = np.sqrt(mean_squared_error(y_val, y_pred))

    # For regressors, model.score returns R-squared
    R2 = model.score(X_val, y_val)

    # n observations (axis 0) and p predictors (axis 1)
    n, p = X_val.shape

    # Adjusted R-squared formula
    adjusted_r2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)

    return R2, adjusted_r2, RMSE, MAE

In [None]:
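def metric_df(model, model_name):
    # Relies on the notebook-level X_train, y_train, X_val, y_val and on the
    # most recent y_pred, so call it right after model.predict(X_val)
    df = [evaluate_model(model, X_train, y_train, X_val, y_val, y_pred)]
    model_metrics = pd.DataFrame(data=df, columns=['R2 Score', 'Adjusted R2 Score', 'RMSE', 'MAE'])
    model_metrics.insert(0, 'Model', model_name)

    return model_metrics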
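
As a quick check of the adjusted R-squared formula used in `evaluate_model`, here is a worked example with made-up numbers. The helper name `adjusted_r2_score` is hypothetical and is not part of scikit-learn.

```python
def adjusted_r2_score(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - r2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# With r2 = 0.5, n = 11 and p = 2: 1 - 0.5 * 10 / 8 = 0.375
example = adjusted_r2_score(r2=0.5, n_samples=11, n_features=2)
print(example)
```

The penalty grows with the number of predictors, so the adjusted score is lower than the plain R-squared whenever R-squared is below 1 and there is at least one predictor.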

<a id="baseline"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">7.1. Random Forest</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
predictions = []

In [None]:
rf = RandomForestRegressor(random_state=42)

In [None]:
rf.fit(X_train, y_train)

In [None]:
y_pred = rf.predict(X_val)

In [None]:
model_metrics = metric_df(rf, 'Base Random Forest')
predictions.append(model_metrics)
model_metrics

<a id="optimization"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">8. Optimization: Hyperparameter Tuning</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
from sklearn.model_selection import GridSearchCV

<a id="grid_search"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">8.1. Grid Search</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
rf = RandomForestRegressor(random_state=42, n_jobs=-1)

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [1, 3, 5, 7],
    'criterion': ['squared_error', 'poisson']
}

In [None]:
rf_grid_search = GridSearchCV(estimator=rf,
                     param_grid=param_grid,
                     scoring='neg_mean_absolute_error',
                     n_jobs=-1,
                     cv=5)

In [None]:
rf_grid_search.fit(X_train, y_train)

In [None]:
rf_grid_search.best_params_

In [None]:
# Train Random Forest with the best parameters found
rf = RandomForestRegressor(criterion='poisson',
                           max_depth=7,
                           n_estimators=400,
                           random_state=42)

In [None]:
rf.fit(X_train, y_train)

In [None]:
y_pred = rf.predict(X_val)

In [None]:
model_metrics = metric_df(rf, 'Random Forest Tuned - GridSearch')
predictions.append(model_metrics)
model_metrics
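
Instead of copying the best parameters by hand, the fitted search object can hand back the refitted model. The sketch below uses synthetic data and a deliberately tiny grid so that it runs quickly. `X_toy`, `y_toy`, `toy_search` and `tuned_rf` are illustrative names, not part of this notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train
X_toy, y_toy = make_regression(
    n_samples=120, n_features=4, noise=0.1, random_state=42
)

toy_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# With the default refit=True, the best configuration has already been
# refitted on all the data passed to fit()
tuned_rf = toy_search.best_estimator_

print(toy_search.best_params_)
print(tuned_rf.predict(X_toy[:3]))
```

Using `best_estimator_` keeps the tuned model in sync with the search if the grid or the data changes.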

<a id="bayes_opt"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">8.2. Bayesian Optimization</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
### Pass

<a id="performance_summary"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">9. Performance Comparison and Summary</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
predictions_all = pd.concat(predictions, ignore_index=True, sort=False)
predictions_all = predictions_all.sort_values(by=['MAE'], ascending=True).style.hide(axis='index')

predictions_all

<a id="feature_importance"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">9.1. Feature Importance</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
importance_df = pd.DataFrame({
    'feature': X_val.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
importance_df.head()

In [None]:
sns.barplot(data=importance_df.head(), x='importance', y='feature')
plt.title('Feature Importance');

<a id="save_model"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">10. Make Prediction on Test Data and Save Model</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="save_model"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">10.1. Make Prediction on Test Data</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
y_pred = rf.predict(test_df)
y_pred = y_pred.round().astype(int)

y_pred

In [None]:
sub_df

In [None]:
sub_df['Age'] = y_pred

In [None]:
sub_df.to_csv('submission.csv', index=False)

<a id="save_model"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">10.2. Save Model</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
import joblib

In [None]:
model = {
    'encoder': encoder,
    'scaler': scaler,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols,
    'rf_model': rf
}

In [None]:
joblib.dump(model, 'crab_age_model.joblib')
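
A small sketch of the dump/load round trip, using a hypothetical bundle written to a temporary directory. The real bundle above also holds the fitted scaler, encoder and model, which `joblib` serializes in the same way.

```python
import os
import tempfile

import joblib

# Hypothetical bundle with only the column lists
toy_bundle = {
    "numeric_cols": ["Length", "Diameter"],
    "categorical_cols": ["Sex"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "toy_bundle.joblib")
    joblib.dump(toy_bundle, bundle_path)
    restored_bundle = joblib.load(bundle_path)

print(restored_bundle["numeric_cols"])
```

Saving the preprocessing objects together with the model means the same transformations can be applied to new inputs at prediction time.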

<a id="save_model"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">10.3. Make Prediction on Single Input</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
def predict_input(single_input):
    # Convert input into a pandas dataframe
    input_df = pd.DataFrame([single_input])

    model = joblib.load("crab_age_model.joblib")

    numeric_cols = model['numeric_cols']
    categorical_cols = model['categorical_cols']
    encoded_cols = model['encoded_cols']

    # Load fitted scaler and encoder
    scaler = model['scaler']
    encoder = model['encoder']
    
    # Load trained random forest model
    rf_model = model['rf_model']
    
    # Transform numeric columns
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    
    # Encode categorical columns
    encoded_data = encoder.transform(input_df[categorical_cols])
    
    input_df[encoded_cols] = encoded_data
    
    input_cols = list(numeric_cols) + list(encoded_cols)
    
    X_input = input_df[input_cols]
    
    prediction = rf_model.predict(X_input)
    
    prediction = prediction.round().astype(int)

    return prediction

In [None]:
# First row of validation data
single_input = train_df.loc[X_val.index[0]].to_dict()
single_input

In [None]:
y_val.iloc[0]

In [None]:
predict_input(single_input)