<h1 style="text-align: center;">Predicting Crab Ages using Random Forest</h1>

<a id='table_of_contents'></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">Table of Contents</h2>

1. <a href="#download" style="text-decoration: None">Download Data</a>
2. <a href="#import" style="text-decoration: None">Import Libraries and Dataset</a>
3. <a href="#data_preview" style="text-decoration: None">Dataset Preview</a>
4. <a href="#data_wrangling" style="text-decoration: None">Data Wrangling</a>
5. <a href="#eda" style="text-decoration: None">Exploratory Data Analysis</a>
    - <a href="#univariate" style="text-decoration: None">Univariate Analysis</a>
    - <a href="#bivariate" style="text-decoration: None">Bivariate Analysis</a>
6. <a href="#data_preprocessing" style="text-decoration: None">Data Preparation and Preprocessing</a>
7. <a href="#baseline" style="text-decoration: None">Baseline Models</a>
8. <a href="#optimization" style="text-decoration: None">Optimization: Hyperparameter Tuning</a>
9. <a href="#performance_summary" style="text-decoration: None">Performance Comparison and Summary</a>
10. <a href="#save_model" style="text-decoration: None">Save Model</a>

<a id="download"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">1. Download Data</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

I downloaded the dataset directly from within the Jupyter notebook using Jovian's `opendatasets` library. The dataset description is available <a href="https://www.kaggle.com/competitions/playground-series-s3e16/data" style="text-decoration: None">here</a>.

<strong>Note: Run the following code cells only if you are working outside the Kaggle environment.</strong>

In [1]:
import os
import opendatasets as od

In [2]:
od.download('https://www.kaggle.com/competitions/playground-series-s3e16/data')

Skipping, found downloaded files in ".\playground-series-s3e16" (use force=True to force download)


In [3]:
print(['sample'])

['sample']


In [4]:
os.listdir('playground-series-s3e16')

['sample_submission.csv', 'test.csv', 'train.csv']
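The note above about working inside or outside Kaggle can be sketched as a small path-resolution helper. This is only an illustration: `resolve_data_dir` is a hypothetical name, and the Kaggle mount path `/kaggle/input/playground-series-s3e16` is an assumption based on the usual competition layout, so verify it in your own environment.

```python
import os

# Assumed locations: Kaggle usually mounts competition data under
# /kaggle/input/<competition-slug>, while opendatasets downloads into a
# local folder named after the competition (as seen in the output above).
KAGGLE_DIR = '/kaggle/input/playground-series-s3e16'
LOCAL_DIR = 'playground-series-s3e16'
DATASET_URL = 'https://www.kaggle.com/competitions/playground-series-s3e16/data'


def resolve_data_dir(kaggle_dir=KAGGLE_DIR, local_dir=LOCAL_DIR, download=False):
    """Return the folder expected to hold the CSV files (hypothetical helper)."""
    # Inside Kaggle the data is already mounted, so no download is needed.
    if os.path.isdir(kaggle_dir):
        return kaggle_dir
    # Outside Kaggle, optionally download the files if they are not present yet.
    if download and not os.path.isdir(local_dir):
        import opendatasets as od  # imported lazily; only needed outside Kaggle
        od.download(DATASET_URL)
    return local_dir


data_dir = resolve_data_dir()
print(data_dir)
```

With this in place, the later `pd.read_csv` calls could build their paths with `os.path.join(data_dir, 'train.csv')` instead of hard-coding the folder name.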

<a id="import"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">2. Import Libraries and Dataset</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [5]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)

import warnings 
warnings.filterwarnings("ignore")

In [6]:
train_df = pd.read_csv('playground-series-s3e16/train.csv')
test_df = pd.read_csv('playground-series-s3e16/test.csv')
sub_df = pd.read_csv('playground-series-s3e16/sample_submission.csv')

<a id="data_preview"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">3. Dataset Preview</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

Here I perform a preliminary analysis to assess the quality of the data. This involves checking for incorrect data types, missing values, duplicates, erroneous values, and reviewing the summary statistics.

In [7]:
train_df.head()

,id,Sex,Length,Diameter,Height,Weight,Shucked Weight,Viscera Weight,Shell Weight,Age
0,0,I,1.525,1.175,0.375,28.973189,12.728926,6.647958,8.348928,9
1,1,I,1.1,0.825,0.275,10.418441,4.521745,2.324659,3.40194,8
2,2,M,1.3875,1.1125,0.375,24.777463,11.3398,5.556502,6.662133,9
3,3,F,1.7,1.4125,0.5,50.660556,20.354941,10.991839,14.996885,11
4,4,I,1.25,1.0125,0.3375,23.289114,11.977664,4.50757,5.953395,8


In [8]:
train_df.shape

(74051, 10)

In [9]:
train_df.describe().T

,count,mean,std,min,25%,50%,75%,max
id,74051.0,37025.0,21376.826729,0.0,18512.5,37025.0,55537.5,74050.0
Length,74051.0,1.31746,0.287757,0.1875,1.15,1.375,1.5375,2.012815
Diameter,74051.0,1.024496,0.237396,0.1375,0.8875,1.075,1.2,1.6125
Height,74051.0,0.348089,0.092034,0.0,0.3,0.3625,0.4125,2.825
Weight,74051.0,23.385217,12.648153,0.056699,13.437663,23.799405,32.162508,80.101512
Shucked Weight,74051.0,10.10427,5.618025,0.028349,5.712424,9.90815,14.033003,42.184056
Viscera Weight,74051.0,5.058386,2.792729,0.042524,2.8633,4.989512,6.988152,21.54562
Shell Weight,74051.0,6.72387,3.584372,0.042524,3.96893,6.931453,9.07184,28.491248
Age,74051.0,9.967806,3.175189,1.0,8.0,10.0,11.0,29.0


In [10]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74051 entries, 0 to 74050
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              74051 non-null  int64  
 1   Sex             74051 non-null  object 
 2   Length          74051 non-null  float64
 3   Diameter        74051 non-null  float64
 4   Height          74051 non-null  float64
 5   Weight          74051 non-null  float64
 6   Shucked Weight  74051 non-null  float64
 7   Viscera Weight  74051 non-null  float64
 8   Shell Weight    74051 non-null  float64
 9   Age             74051 non-null  int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 5.6+ MB
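The checks named at the start of this section (missing values, duplicates, erroneous data) can be sketched explicitly. The summary statistics above show a minimum `Height` of 0.0, which is not plausible for a physical measurement, so the sketch also counts non-positive values in numeric columns. `quality_report` is a hypothetical helper, and the small `demo` frame is made up purely for illustration; in the notebook you would pass `train_df` instead.

```python
import pandas as pd


def quality_report(df):
    """Summarise basic data-quality checks for a DataFrame (hypothetical helper)."""
    numeric = df.select_dtypes(include='number')
    return {
        # Number of missing values in each column
        'missing_per_column': df.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row
        'duplicate_rows': int(df.duplicated().sum()),
        # Number of values <= 0 in each numeric column (NaN is not counted)
        'non_positive_counts': (numeric <= 0).sum().to_dict(),
    }


# Made-up data for illustration only; not taken from the competition files.
demo = pd.DataFrame({
    'Sex': ['I', 'M', 'M', 'F'],
    'Height': [0.375, 0.0, 0.0, None],
    'Age': [9, 8, 8, 11],
})
report = quality_report(demo)
print(report)
```

Calling `quality_report(train_df)` would show how many rows carry a zero `Height`; how to treat them (drop, impute, or keep) is a separate modelling decision.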


<a id="data_wrangling"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">4. Data Wrangling</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">4.1. Drop <code>id</code> Column</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [11]:
# `id` is only a row identifier; `columns=` already selects the axis, so `axis=1` is redundant
train_df.drop(columns=['id'], inplace=True)

<a id="eda"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">5. Exploratory Data Analysis</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [12]:
#!pip install kaleido

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

sns.set_style("darkgrid")

In [None]:
# Two-colour palette reused across plots
plot_color = ['#008080', 'black']
sns.set_palette(plot_color)

<a id="univariate"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">5.1. Univariate Analysis</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

In [None]:
pd.DataFrame(train_df['Age'].describe())

<a id="bivariate"></a>
<h4 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">5.2. Bivariate Analysis</h4>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="data_preprocessing"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">6. Data Preparation and Preprocessing</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="baseline"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">7. Baseline Models</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="optimization"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">8. Optimization: Hyperparameter Tuning</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="performance_summary"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">9. Performance Comparison and Summary</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>

<a id="save_model"></a>
<h2 style="background-color:#0b0504;color:white;border-radius: 8px; padding:12px">10. Save Model</h2>

<a href="#table_of_contents" style="text-decoration: None">Table of Contents</a>