# Submission for Regression with an Abalone Dataset

The goal of this competition is to predict the **number of Rings** (an indicator of age) of abalone from various physical measurements. Because the target is numerical (an integer count that is treated here as a continuous quantity), this is a regression problem.

In this notebook, a **RandomForestRegressor** is used and the evaluation metric is the **Root Mean Squared Logarithmic Error** (RMSLE).
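For reference, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + target). The following is a minimal NumPy sketch of that definition; the function name `rmsle` and the toy values are illustrative and do not come from the notebook or the competition data.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so a value of 0 is handled safely
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy values, not taken from the competition data
print(rmsle([6, 8, 11], [6, 8, 11]))  # perfect predictions give 0.0
print(rmsle([9], [4]))                # |log(5) - log(10)| = log(2)
```

Because the metric compares logarithms, it penalizes relative rather than absolute differences: predicting 4 rings for a 9-ring abalone costs the same as predicting 9 for a 19-ring one.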

## Sections
1. Import Libraries and Data
2. Preprocessing
3. Train and Validation
4. Submission Predictions

# 1. Import Libraries and Data

In [1]:
# Data exploration and manipulation
import pandas as pd
import numpy as np
import math

# Modeling and Validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

# manipulate dates, e.g., get today's date
from datetime import date

In [2]:
train_df = pd.read_csv('data/playground-series-s4e4/train.csv')
test_df = pd.read_csv('data/playground-series-s4e4/test.csv')
sample_sub = pd.read_csv('data/playground-series-s4e4/sample_submission.csv')

In [3]:
train_df.set_index("id", inplace = True)
test_df.set_index("id", inplace = True)

In [4]:
train_df.tail(3)

id,Sex,Length,Diameter,Height,Whole weight,Whole weight.1,Whole weight.2,Shell weight,Rings
90612,I,0.435,0.33,0.095,0.3215,0.151,0.0785,0.0815,6
90613,I,0.345,0.27,0.075,0.2,0.098,0.049,0.07,6
90614,I,0.425,0.325,0.1,0.3455,0.1525,0.0785,0.105,8


In [5]:
test_df.head(3)

id,Sex,Length,Diameter,Height,Whole weight,Whole weight.1,Whole weight.2,Shell weight
90615,M,0.645,0.475,0.155,1.238,0.6185,0.3125,0.3005
90616,M,0.58,0.46,0.16,0.983,0.4785,0.2195,0.275
90617,M,0.56,0.42,0.14,0.8395,0.3525,0.1845,0.2405


# 2. Preprocessing
### 2-1 Convert non-numerical columns to numerical

**Scikit-learn estimators expect numerical input**, so the Sex column must be converted to numbers.

In [6]:
train_df['Sex'].unique()

array(['F', 'I', 'M'], dtype=object)

"I" stands for infant

In [7]:
train_dummies = pd.get_dummies(train_df['Sex'],drop_first=True,prefix="Sex", prefix_sep='_', dtype=int)
test_dummies = pd.get_dummies(test_df['Sex'],drop_first=True,prefix="Sex", prefix_sep='_', dtype=int)

# drop_first=True keeps only two of the three possible columns (Sex_I and Sex_M);
# when both are 0, the sex is implied to be F, so no manual drop is needed

train_dummies.head(3)

id,Sex_I,Sex_M
0,0,0
1,0,0
2,1,0


**Note:** When both Sex_I and Sex_M are 0, the sex is "F". The Sex_F column was eliminated by the drop_first=True argument (any one of the three could have been eliminated) because it carries no extra information, which reduces the number of columns (i.e., for efficiency).
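The effect of `drop_first=True` can be seen on a small toy Series. The sketch below also illustrates one caveat of encoding the train and test sets separately: if a category were missing from one of them, a different column would be dropped. The notebook's own outputs show Sex_I and Sex_M in both frames, so this does not happen here. The toy data and the `reindex` alignment step are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data; the column name "Sex" and its categories follow the notebook
train_sex = pd.Series(["F", "I", "M", "I"], name="Sex")
test_sex = pd.Series(["M", "M", "I"], name="Sex")  # contrived: no "F" present

full = pd.get_dummies(train_sex, prefix="Sex", dtype=int)
train_dummies = pd.get_dummies(train_sex, drop_first=True, prefix="Sex", dtype=int)
print(list(full.columns))           # ['Sex_F', 'Sex_I', 'Sex_M']
print(list(train_dummies.columns))  # ['Sex_I', 'Sex_M']

# Encoding the test set on its own drops ITS first category ("I"), not "F"
naive_test = pd.get_dummies(test_sex, drop_first=True, prefix="Sex", dtype=int)
print(list(naive_test.columns))     # ['Sex_M']

# One possible guard: encode without drop_first, then align to the train columns
aligned_test = (
    pd.get_dummies(test_sex, prefix="Sex", dtype=int)
    .reindex(columns=train_dummies.columns, fill_value=0)
)
print(list(aligned_test.columns))   # ['Sex_I', 'Sex_M']
```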

In [8]:
# merge dummies to the train_df dataframe 
train_df = train_df.merge(train_dummies, left_index=True, right_index=True)
# remove the "Sex" column (this is now encoded and is thus redundant) and reorder columns for clarity
train_df = train_df[['Sex_I','Sex_M','Length','Diameter','Height','Whole weight','Whole weight.1','Whole weight.2','Shell weight','Rings']]

#Now same for the test_df
test_df = test_df.merge(test_dummies, left_index=True, right_index=True)
test_df = test_df[['Sex_I','Sex_M','Length','Diameter','Height','Whole weight','Whole weight.1','Whole weight.2','Shell weight']]

In [9]:
train_df.head(3)

id,Sex_I,Sex_M,Length,Diameter,Height,Whole weight,Whole weight.1,Whole weight.2,Shell weight,Rings
0,0,0,0.55,0.43,0.15,0.7715,0.3285,0.1465,0.24,11
1,0,0,0.63,0.49,0.145,1.13,0.458,0.2765,0.32,11
2,1,0,0.16,0.11,0.025,0.021,0.0055,0.003,0.005,6


In [10]:
test_df.head(3)

id,Sex_I,Sex_M,Length,Diameter,Height,Whole weight,Whole weight.1,Whole weight.2,Shell weight
90615,0,1,0.645,0.475,0.155,1.238,0.6185,0.3125,0.3005
90616,0,1,0.58,0.46,0.16,0.983,0.4785,0.2195,0.275
90617,0,1,0.56,0.42,0.14,0.8395,0.3525,0.1845,0.2405


### 2-2 Split the data for training and testing

In [11]:
X = train_df.iloc[:,:-1]  # features set
y = train_df.iloc[:,-1]  # target set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1111)
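Note that `X_test` and `y_test` here are a held-out 20% slice of train.csv used for validation, not the competition's `test_df`. A small sketch with toy arrays (the `*_toy`, `X_tr`, and `X_val` names are hypothetical) shows how `test_size` and `random_state` behave:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for the feature matrix and target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1111
)
print(X_tr.shape, X_val.shape)  # (8, 2) (2, 2)

# Re-running with the same random_state reproduces the same split
_, _, _, y_val_again = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1111
)
print(np.array_equal(y_val, y_val_again))  # True
```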

# 3. Train and Validation

Using a Random Forest Regressor

### 3-1. Training

In [12]:
# Create a Random Forest object
rfr = RandomForestRegressor(random_state = 1111)

# Train a model
rfr.fit(X_train,y_train)

### 3-2. Validation

In [13]:
# Make predictions

train_predictions = rfr.predict(X_train)
test_predictions = rfr.predict(X_test)

train_error = math.sqrt(mean_squared_log_error(y_train,train_predictions))
test_error = math.sqrt(mean_squared_log_error(y_test,test_predictions))

# Evaluate predictions
print("The TRAIN ERROR is :", train_error)
print("The TEST ERROR is :", test_error)

The TRAIN ERROR is : 0.05903926470082347
The TEST ERROR is : 0.15475238800109298


**NOTE:** Models generally perform much better on the data they were trained on, because held-out data is unseen and may contain patterns the model was never exposed to. **Here the training error (about 0.059) is far lower than the testing error (about 0.155), so the model is probably overfitted.** Model validation techniques such as cross-validation can be used to bring the testing error as low as possible.
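One way such validation could look is k-fold cross-validation scored with the same metric. The sketch below runs on synthetic data so that it stays self-contained. The data, the variable names, and the `n_estimators` and `min_samples_leaf` values are illustrative assumptions, not tuned settings or results from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic, strictly positive target (MSLE requires non-negative values)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = np.round(1 + 10 * X_demo[:, 0] + rng.random(200))

# Larger leaves restrict tree depth, a common way to reduce overfitting
model = RandomForestRegressor(
    n_estimators=20, min_samples_leaf=5, random_state=1111
)

# scikit-learn scorers are "higher is better", hence the negated MSLE
scores = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_log_error"
)
fold_rmsle = np.sqrt(-scores)
print("RMSLE per fold:", fold_rmsle)
print("Mean RMSLE:", fold_rmsle.mean())
```

Comparing the mean fold error across a few candidate settings gives a more stable estimate than a single 80/20 split.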

# 4. Submission Predictions

### First, re-train the model with all training data and make predictions on test data

In [14]:
# Create a Random Forest object
rfr = RandomForestRegressor(random_state = 1111)

# Train model
rfr.fit(X,y)

# make predictions
final_predictions = rfr.predict(test_df)

In [15]:
# Convert the predictions to integers (a fractional number of rings is not meaningful).
# Note: int() truncates toward zero (e.g. 9.8 becomes 9) rather than rounding to the nearest integer.
final_predictions = [int(x) for x in final_predictions]
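Because `int()` truncates, every converted prediction is at most the raw value, which shifts the submission downward by up to one ring. The sketch below contrasts truncation with rounding on made-up values; it is an illustration of the difference, not a change to the notebook's pipeline.

```python
import numpy as np

raw = np.array([9.8, 10.2, 6.4, 7.9])  # made-up model outputs

truncated = raw.astype(int)            # same effect as int(x) per element
rounded = np.rint(raw).astype(int)     # nearest integer

print(truncated.tolist())  # [9, 10, 6, 7]
print(rounded.tolist())    # [10, 10, 6, 8]
```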

### Prepare Submission DataFrame

In [16]:
submission_df = pd.DataFrame(final_predictions, columns = ["Rings"])
submission_df['id'] = test_df.index

# reorder the columns to match the sample submission file
submission_df = submission_df[["id","Rings"]]
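Since `sample_sub` was loaded earlier, it can serve as a template for a quick sanity check before writing the file. The following is a minimal sketch using small stand-in frames; the helper `matches_template` and the `*_demo` frames are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for sample_sub and submission_df
sample_demo = pd.DataFrame({"id": [90615, 90616], "Rings": [10, 10]})
submission_demo = pd.DataFrame({"id": [90615, 90616], "Rings": [10, 9]})

def matches_template(submission, template):
    """Check that column order and ids agree with the template."""
    same_columns = list(submission.columns) == list(template.columns)
    same_ids = submission["id"].tolist() == template["id"].tolist()
    return same_columns and same_ids

print(matches_template(submission_demo, sample_demo))  # True
```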

In [17]:
submission_df.head()

,id,Rings
0,90615,10
1,90616,9
2,90617,10
3,90618,11
4,90619,7


In [18]:
# get today's date
today = date.today()

# create the file name with today's date (this prevents overwriting files from previous days)
file_path = "data/submission_" + str(today) + ".csv"

# output submission csv file (this should be uploaded to Kaggle without any modification)
submission_df.to_csv(file_path, index = False)