For this project I am trying to predict the number of rings on the shell of an abalone (a type of mollusk, also called a sea ear).
The number of rings is used as a proxy for the animal's age.

![Abalone](https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/LivingAbalone.JPG/220px-LivingAbalone.JPG)

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_log_error

In [25]:
train_file = "./train.csv"
test_file = "./test.csv"

train = pd.read_csv(train_file, index_col=0)
print(train.head())

   Sex  Length  Diameter  Height  Whole weight  Whole weight.1  \
id                                                               
0    F   0.550     0.430   0.150        0.7715          0.3285   
1    F   0.630     0.490   0.145        1.1300          0.4580   
2    I   0.160     0.110   0.025        0.0210          0.0055   
3    M   0.595     0.475   0.150        0.9145          0.3755   
4    I   0.555     0.425   0.130        0.7820          0.3695   

    Whole weight.2  Shell weight  Rings  
id                                       
0           0.1465        0.2400     11  
1           0.2765        0.3200     11  
2           0.0030        0.0050      6  
3           0.2055        0.2500     10  
4           0.1600        0.1975      9  


In [26]:
encoder = LabelEncoder()

train['Sex'] = encoder.fit_transform(train['Sex'])
print(train['Sex'].head())

id
0    0
1    0
2    1
3    2
4    1
Name: Sex, dtype: int32
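
`LabelEncoder` maps F, I and M to 0, 1 and 2, which a linear model reads as an ordered quantity. One possible alternative, not used in this notebook, is one-hot encoding. Below is a minimal sketch on a toy frame that copies the five `Sex` values printed above; the names `toy` and `one_hot` are only for illustration.

```python
import pandas as pd

# Toy frame with the five 'Sex' values from the head() output above.
toy = pd.DataFrame({"Sex": ["F", "F", "I", "M", "I"]})

# One indicator column per category instead of a single 0/1/2 code.
one_hot = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(one_hot)
```

Whether this helps the score here is untested; tree-based models in particular tend to be less sensitive to the choice of encoding.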


In [27]:
X = train.drop('Rings', axis=1)
y = train['Rings']

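# Note: the scaler is fitted on all rows before the split below,
# so the held-out rows also contribute to the scaling statistics.
scaler = StandardScaler()
X = scaler.fit_transform(X)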

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
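
The cell above scales all rows before splitting them, so the held-out rows influence the mean and standard deviation used for scaling. A common way to avoid this is to split first and put the scaler in a pipeline. The sketch below uses synthetic data, and the names `X_demo`, `y_demo` and `pipe` are assumptions for illustration, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the abalone features and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=200)

# Split first, so the held-out rows never reach the scaler's fit().
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training rows only.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# The fitted scaler's mean matches the training rows, not the full data.
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
```

With a dataset of this size the effect on the score is probably small, but the pipeline also makes later cross-validation cleaner.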

In [38]:
linreg = LinearRegression()

linreg.fit(X_train, y_train)

y_pred = np.round(linreg.predict(X_test))
y_pred[y_pred < 1] = 1                        # Clip predictions below 1 (zero or negative) up to 1

In [39]:
rmsle = root_mean_squared_log_error(y_true=y_test, y_pred=y_pred)
print(rmsle)

0.16745761106008042
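
For reference, RMSLE is the square root of the mean squared difference between `log(1 + y_pred)` and `log(1 + y_true)`. Here is a small sketch that computes it by hand and checks it against scikit-learn. The true values are the five `Rings` values printed earlier; the predictions in `y_hat_demo` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Rings from the first five training rows, plus hypothetical predictions.
y_true_demo = np.array([11, 11, 6, 10, 9])
y_hat_demo = np.array([10, 12, 6, 9, 9])

# RMSLE by hand: root of the mean squared difference of log(1 + value).
manual = np.sqrt(np.mean((np.log1p(y_hat_demo) - np.log1p(y_true_demo)) ** 2))

# Same quantity as the square root of scikit-learn's MSLE.
via_sklearn = np.sqrt(mean_squared_log_error(y_true_demo, y_hat_demo))

print(manual, via_sklearn)
```

Because the metric works on logarithms, it penalises relative rather than absolute errors, and scikit-learn rejects negative predictions, which is why the predictions are clipped before scoring.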


That seems like a reasonable baseline. I would like to bring the RMSLE down to 0.14 or lower so I can get on the leaderboard.

## Model iteration 1

In [45]:
print(y.values.max())

29


There are two options: regression with rounding to the nearest integer, or classification with 29 classes. I think regression is the better choice, because the test data may contain examples with 30 or more rings, which a classifier trained on at most 29 rings could never predict. So for the first iteration I will try regression.
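
To illustrate the point about unseen ring counts, here is a toy sketch on synthetic one-feature data (not the abalone data). A classifier can only return labels it saw during training, while a linear regressor can extrapolate past them. All names here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target equals the feature and stops at 29.
X_seen = np.arange(1, 30, dtype=float).reshape(-1, 1)
y_seen = np.arange(1, 30)

clf = DecisionTreeClassifier(random_state=0).fit(X_seen, y_seen)
reg = LinearRegression().fit(X_seen, y_seen)

# A new example that lies beyond anything in the training data.
X_new = np.array([[30.0]])

print(clf.predict(X_new))            # limited to the classes seen in training
print(np.round(reg.predict(X_new)))  # the linear fit can go past 29
```

Note that this advantage depends on the regressor: tree-based regressors such as a decision tree or random forest also cannot predict outside the range of their training targets.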

In [51]:
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge

In [53]:
models = [('svr', SVR()),
          ('dt', DecisionTreeRegressor()),
          ('rf', RandomForestRegressor()),
          ('knn', KNeighborsRegressor()),
          ('ridge', Ridge())]

rmsle_scores = []

for name, model in models:
    print('Fitting model:', name)
    model.fit(X_train, y_train)
    y_pred = np.round(model.predict(X_test))
    y_pred[y_pred<1] = 1
    rmsle = root_mean_squared_log_error(y_true=y_test, y_pred=y_pred)
    rmsle_scores.append((name, rmsle))

Fitting model: svr
Fitting model: dt
Fitting model: rf
Fitting model: knn
Fitting model: ridge


In [54]:
print(rmsle_scores)

[('svr', 0.15744276131077117), ('dt', 0.21646433954652472), ('rf', 0.15784384675379085), ('knn', 0.1683710530311581), ('ridge', 0.16746960090216662)]


SVR has the lowest RMSLE (0.1574), narrowly ahead of the random forest (0.1578), so I will now start tuning the parameters of the Support Vector Regressor (SVR).
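
A rough sketch of how that tuning could look with `GridSearchCV`, assuming a small grid over `C` and `epsilon`. The grid values, the synthetic data and the helper `clipped_rmsle` are assumptions for illustration, not the settings actually used later. The scorer mirrors the rounding and clipping applied to predictions earlier in this notebook.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the scaled features and ring counts.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.clip(np.round(10 + 2 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)), 1, 29)

def clipped_rmsle(y_true, y_pred):
    """RMSLE after rounding predictions and clipping them to at least 1."""
    y_pred = np.maximum(np.round(y_pred), 1)
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

# Lower RMSLE is better, so the scorer negates it internally.
scorer = make_scorer(clipped_rmsle, greater_is_better=False)

# Hypothetical grid; real ranges would need experimentation.
param_grid = {"C": [0.1, 1, 10], "epsilon": [0.1, 0.5]}

search = GridSearchCV(SVR(), param_grid, scoring=scorer, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # flip the sign back to a plain RMSLE
```

SVR training time grows quickly with the number of rows, so on the full training set it may be worth searching on a subsample first.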