# MISIG Summer Competition

_Author:_ **Mukund Kalra - mukundkalra@gmail.com**

The goal is to predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Instead, other measurements that are easier to obtain are used to predict the age. Details can be found at https://archive.ics.uci.edu/ml/datasets/Abalone

### Problem approached using Regression.

### Importing the libraries and dataset.

In [46]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('abalone.csv',header = None)
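
The file has no header row, so `header=None` leaves the columns numbered 0 to 8. As a sketch, the columns could be named after the attribute list on the UCI page. The snake_case names below are an assumption, not part of the notebook, and the two rows are illustrative samples in the file's layout rather than a copy of the data.

```python
import io
import pandas as pd

# Assumed names, following the order of the UCI attribute list.
column_names = [
    'sex', 'length', 'diameter', 'height', 'whole_weight',
    'shucked_weight', 'viscera_weight', 'shell_weight', 'rings',
]

# Two illustrative rows standing in for abalone.csv.
sample_csv = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

demo = pd.read_csv(sample_csv, header=None, names=column_names)
print(demo.dtypes)               # 'sex' is the only non-numeric column
print(demo[['sex', 'rings']])    # the last column, 'rings', is the target
```

With named columns, the feature/target split can also be written as `demo.drop(columns='rings')` and `demo['rings']`.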

## DATA PRE-PROCESSING.
### Separating the feature and target columns.

In [47]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:,-1].values

### Encoding categorical variables such as the sex (M/F/I) column.
#### No normalization required as all feature columns are already normalised in the given data

In [48]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encode the sex column as integer labels, then one-hot encode it.
# Note: the categorical_features argument was removed in scikit-learn 0.22,
# so this cell needs an older release.
encoder = LabelEncoder()
X[:,0] = encoder.fit_transform(X[:,0])
onehot = OneHotEncoder(categorical_features=[0])
X = onehot.fit_transform(X).toarray()

# Remove one of the one-hot encoded dummy columns to avoid the dummy variable trap.
X=X[:,1:]
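
On current scikit-learn releases the same encoding can be expressed with `ColumnTransformer`. This is a minimal sketch on made-up rows (the values in `X_demo` are not taken from the dataset), assuming the sex column is at index 0 as in the notebook.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows in the notebook's layout: sex first, then numeric columns (made-up values).
X_demo = np.array([
    ['M', 0.455, 0.365],
    ['F', 0.530, 0.420],
    ['I', 0.330, 0.255],
    ['M', 0.440, 0.365],
], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 forces a dense result, mirroring the .toarray() call.
transformer = ColumnTransformer(
    transformers=[('sex', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
X_encoded = transformer.fit_transform(X_demo).astype(float)

# Drop the first dummy column (category 'F', first in sorted order),
# which is what the X[:, 1:] step does.
X_encoded = X_encoded[:, 1:]
print(X_encoded)
```

The encoded columns come first and the passthrough columns follow, so the column order matches what the original cell produces.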


### Splitting the dataset into the Training set and Test set.
__Because the dataset is quite small (about 4000 rows), only 10% is reserved for the test set.__

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

## MODEL CONSTRUCTION
### Fitting regressor to the Training set and building RandomForest Model

In [50]:
# Fitting regressor to the Training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=500, random_state= 0)
regressor.fit(X_train,y_train)
 

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

### Predicting the Test set values. 

In [51]:
y_pred = regressor.predict(X_test)

## CALCULATING PERFORMANCE OF REGRESSOR MODEL

In [53]:

from sklearn.metrics import r2_score, mean_squared_error
score = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f'Coeff. of Determination: {score}')
print(f'Mean Sq. Error: {mse}')

Coeff. of Determination: 0.5825935841031324
Mean Sq. Error: 4.524926449760765
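
Both metrics follow directly from the residuals. The sketch below computes them by hand on made-up ring counts (not the notebook's predictions), then takes the square root of the MSE reported above to express the error in rings.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up ring counts, for illustration only.
y_true_demo = [9, 10, 7, 12]
y_pred_demo = [10, 10, 8, 10]

print(mse(y_true_demo, y_pred_demo))
print(r2(y_true_demo, y_pred_demo))

# Root of the reported MSE: the typical prediction error in rings.
rmse_reported = math.sqrt(4.524926449760765)
print(round(rmse_reported, 2))
```

The reported MSE of about 4.52 corresponds to a root-mean-square error of roughly 2.13 rings.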


## CONCLUSIONS:

The model has an __R2 score__ of 0.583 on the test set, which is average performance. I tried several models: linear and polynomial regression, SVR, a single decision tree, and finally random forests. **Random forests with about 500-1000 trees performed the best** for me. ANNs were not used because the dataset is too small for them to give any major performance benefit.

### The model can be further tuned using _regularization_ and _Grid Search_, but no significant performance boost is expected given the small dataset and large number of feature columns.
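
A grid search of this kind could look like the sketch below. It runs on synthetic data from `make_regression` rather than the abalone features, and the parameter grid is hypothetical: it assumes that "regularization" for a random forest means limiting leaf size and the number of features tried per split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded features (not the real data).
X_demo, y_demo = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# Hypothetical grid: larger leaves and fewer features per split constrain each tree.
param_grid = {
    'min_samples_leaf': [1, 5],
    'max_features': [0.5, 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    scoring='r2',
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # R2 of the refit best model on the held-out rows
```

The forest is kept at 50 trees here only to make the search quick; the notebook's model uses 500.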