In this project I am trying to predict the number of rings on the shell of an abalone (a type of mollusk, also called a sea ear).
The number of rings is a proxy for the animal's age.

![Abalone](https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/LivingAbalone.JPG/220px-LivingAbalone.JPG)

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_log_error

In [35]:
train_file = "./train.csv"
test_file = "./test.csv"

train = pd.read_csv(train_file, index_col=0)
test = pd.read_csv(test_file, index_col=0)
print(train.head())
print(test.head())

   Sex  Length  Diameter  Height  Whole weight  Whole weight.1  \
id                                                               
0    F   0.550     0.430   0.150        0.7715          0.3285   
1    F   0.630     0.490   0.145        1.1300          0.4580   
2    I   0.160     0.110   0.025        0.0210          0.0055   
3    M   0.595     0.475   0.150        0.9145          0.3755   
4    I   0.555     0.425   0.130        0.7820          0.3695   

    Whole weight.2  Shell weight  Rings  
id                                       
0           0.1465        0.2400     11  
1           0.2765        0.3200     11  
2           0.0030        0.0050      6  
3           0.2055        0.2500     10  
4           0.1600        0.1975      9  
      Sex  Length  Diameter  Height  Whole weight  Whole weight.1  \
id                                                                  
90615   M   0.645     0.475   0.155        1.2380          0.6185   
90616   M   0.580     0.460   0.160 

In [36]:
encoder = LabelEncoder()

train['Sex'] = encoder.fit_transform(train['Sex'])
test['Sex'] = encoder.transform(test['Sex'])
print(train['Sex'].head())

id
0    0
1    0
2    1
3    2
4    1
Name: Sex, dtype: int32


In [37]:
X = train.drop('Rings', axis=1)
y = train['Rings']

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [6]:
linreg = LinearRegression()

linreg.fit(X_train, y_train)

y_pred = np.round(linreg.predict(X_test))
y_pred[y_pred<1] = 1                          # Clamp predictions below 1 (including negatives) up to 1

In [7]:
rmsle = root_mean_squared_log_error(y_true=y_test, y_pred=y_pred)
print(rmsle)

0.16745761106008042


That is a reasonable baseline, I think. I would like to get the RMSLE down to at least 0.14 so that I can get onto the leaderboard.
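For reference, the metric behind these scores can be sketched by hand. This is a minimal NumPy version of the usual RMSLE definition, sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2)). The function name `rmsle_by_hand` and the toy arrays are illustrative only and are not part of the notebook.

```python
import numpy as np

def rmsle_by_hand(y_true, y_pred):
    """Root mean squared logarithmic error, using log1p(x) = log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy ring counts (made up for illustration)
y_true_demo = [11, 11, 6, 10, 9]
y_pred_demo = [10, 12, 6, 10, 8]
print(rmsle_by_hand(y_true_demo, y_pred_demo))
```

Because the error is measured on a log scale, being off by one ring costs more for a young abalone with few rings than for an old one with many.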

## Model iteration 1

In [8]:
print(y.values.max())

29


There are two options: regression with rounding to the nearest integer, or classification with 29 classes. I think regression will do better, because the test data may contain examples with a ring count that never appears in training, such as 30, and a classifier could never predict those. For the first iteration I will therefore try regression.
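The extrapolation argument can be illustrated with a toy example. This is a sketch on synthetic data, not the abalone data: a least-squares line (via `np.polyfit`) stands in for a regressor, and a hand-written 1-nearest-neighbour rule stands in for a classifier. The names `x_toy`, `y_toy`, `predict_regression` and `predict_1nn_class` are hypothetical.

```python
import numpy as np

# Synthetic data: a single feature that equals the ring count, for counts 1..29
x_toy = np.arange(1, 30, dtype=float)
y_toy = np.arange(1, 30)

# Fit a straight line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)

def predict_regression(x):
    """Regression plus rounding: the output is not limited to labels seen in training."""
    return int(np.round(slope * x + intercept))

def predict_1nn_class(x):
    """1-nearest-neighbour classification: can only return a label seen in training."""
    nearest = np.argmin(np.abs(x_toy - x))
    return int(y_toy[nearest])

x_new = 30.0  # a feature value beyond anything in the training data
print(predict_regression(x_new), predict_1nn_class(x_new))
```

Not every regressor shares this property: tree-based regressors average training targets, so they also cannot predict outside the range of ring counts they were trained on.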

In [9]:
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge

In [10]:
models = [('svr', SVR()),
          ('dt', DecisionTreeRegressor()),
          ('rf', RandomForestRegressor()),
          ('knn', KNeighborsRegressor()),
          ('ridge', Ridge())]

rmsle_scores = []

for name, model in models:
    print('Fitting model:', name)
    model.fit(X_train, y_train)
    y_pred = np.round(model.predict(X_test))
    y_pred[y_pred<1] = 1
    rmsle = root_mean_squared_log_error(y_true=y_test, y_pred=y_pred)
    rmsle_scores.append((name, rmsle))

Fitting model: svr
Fitting model: dt
Fitting model: rf
Fitting model: knn
Fitting model: ridge


In [11]:
print(rmsle_scores)

[('svr', 0.15744276131077117), ('dt', 0.21717154820897655), ('rf', 0.15767900198089038), ('knn', 0.1683710530311581), ('ridge', 0.16746960090216662)]


I will now tune the hyperparameters of the Support Vector Regressor (SVR), which had the lowest RMSLE above.
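The choice of SVR follows from the scores printed above. A small sketch that ranks them (values copied from that output and rounded to four decimals; `rmsle_scores_demo` and `ranked` are names introduced here):

```python
# Scores copied from the printed output above, rounded to four decimals
rmsle_scores_demo = [('svr', 0.1574), ('dt', 0.2172), ('rf', 0.1577),
                     ('knn', 0.1684), ('ridge', 0.1675)]

# Lower RMSLE is better, so sort ascending by score
ranked = sorted(rmsle_scores_demo, key=lambda pair: pair[1])
for name, score in ranked:
    print(f"{name:>5}: {score:.4f}")
```

SVR and the random forest are nearly tied on this split, so the random forest would also be a reasonable candidate for tuning.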

In [21]:
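from sklearn.metrics import make_scorer

# `scoring` expects a scorer, not a bare metric function, so wrap RMSLE.
# Predictions are clamped to >= 1 (as elsewhere) because RMSLE rejects negatives.
def clamped_rmsle(y_true, y_pred):
    return root_mean_squared_log_error(y_true, np.clip(y_pred, 1, None))
rmsle_scorer = make_scorer(clamped_rmsle, greater_is_better=False)  # lower RMSLE is better

param_grid = {'C': [10, 14, 18, 20],
              'gamma': [0.1, 1],
              'epsilon': [0.01, 0.1]}
grid_search = RandomizedSearchCV(SVR(), param_distributions=param_grid,
                                 cv=3, n_jobs=-1, verbose=2, scoring=rmsle_scorer)

grid_search.fit(X_train, y_train)
print(grid_search.best_params_)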

Fitting 3 folds for each of 10 candidates, totalling 30 fits




{'gamma': 0.1, 'epsilon': 0.01, 'C': 14}
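The log line "Fitting 3 folds for each of 10 candidates, totalling 30 fits" can be reproduced with a little arithmetic. The sketch below assumes `RandomizedSearchCV` was left at its default `n_iter=10`, which matches the call above; `param_grid_demo` is a copy of the grid, and the other names are introduced here.

```python
from itertools import product

param_grid_demo = {'C': [10, 14, 18, 20],
                   'gamma': [0.1, 1],
                   'epsilon': [0.01, 0.1]}

# Every combination an exhaustive grid search would try
all_combos = list(product(*param_grid_demo.values()))
n_combos = len(all_combos)            # 4 * 2 * 2

n_iter = 10                           # RandomizedSearchCV default (assumed unchanged)
cv_folds = 3                          # cv=3 in the call above
n_candidates = min(n_iter, n_combos)  # random search samples at most n_iter combinations
total_fits = n_candidates * cv_folds

print(n_combos, n_candidates, total_fits)
```

So the random search covered 10 of the 16 possible combinations. `GridSearchCV` with the same grid would need 16 * 3 = 48 fits, which may still be affordable if the best combination should not be left to chance.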


In [22]:
iteration_1_svr = SVR(kernel='rbf', C=grid_search.best_params_['C'], gamma=grid_search.best_params_['gamma'], epsilon=grid_search.best_params_['epsilon'])

iteration_1_svr.fit(X_train, y_train)

y_pred_1 = iteration_1_svr.predict(X_test)
y_pred_1[y_pred_1<1] = 1
rmsle_1 = root_mean_squared_log_error(y_true=y_test, y_pred=y_pred_1)

print(rmsle_1)

0.15384712617861282


### Final training on the entire dataset

In [38]:
iteration_1_final = SVR(kernel='rbf', C=grid_search.best_params_['C'], gamma=grid_search.best_params_['gamma'],
                      epsilon=grid_search.best_params_['epsilon'])

iteration_1_final.fit(X, y)

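# The model was trained on scaled features, so the test set must be scaled the same way
y_pred_final_1 = np.round(iteration_1_final.predict(scaler.transform(test)))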
y_pred_final_1[y_pred_final_1 < 1] = 1



In [39]:
df_final_1 = pd.DataFrame()
df_final_1.index = test.index
df_final_1['Rings'] = y_pred_final_1

print(df_final_1.head())

       Rings
id          
90615   11.0
90616   11.0
90617   11.0
90618   11.0
90619   10.0


In [40]:
df_final_1['Rings'] = df_final_1['Rings'].astype('int')
print(df_final_1.head())

       Rings
id          
90615     11
90616     11
90617     11
90618     11
90619     10


In [41]:
df_final_1.to_csv('submission_1.csv')