# Problem 1: Abalone Data Set
### Created By: Ivor Zalud

## The Problem
I want to predict the age of an abalone from a set of physical features. The data set's columns are Sex, Length, Diameter, Height, Whole_weight, Shucked_weight, Viscera_weight, Shell_weight, and Rings. Age in years is the ring count plus 1.5, so predicting age reduces to predicting Rings from the remaining columns.
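As a small illustration of the ring-to-age rule, the conversion could be sketched as follows. The helper name `rings_to_age` is hypothetical and not part of the notebook.

```python
def rings_to_age(rings):
    """Convert a ring count to an approximate age in years (rings + 1.5)."""
    return rings + 1.5


# Example: an abalone with 15 rings is estimated to be 16.5 years old.
print(rings_to_age(15))  # 16.5
```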

## The Data
The data is provided by UC Irvine [here](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data). The set includes ~4.2k rows describing physical features of abalone. The columns are listed above.

## Methods
* Gradient-Boosted Trees as our multi-class classifier
* Stochastic Gradient Descent as our linear regression


Import the Python modules

In [405]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Gradient-Boosted Trees
***
### - **Regularization**: Scikit-learn outlines regularization strategies for gradient boosting [here](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regularization-py). We'll use shrinkage (`learning_rate` < 1.0) and `subsample` < 1.0. Subsampling fits each tree on a random fraction of the rows, which reduces variance in a way similar to bagging and can produce a more accurate model.
### - **Loss function**: Deviance (the scikit-learn default), which is the logistic regression loss generalized to multiple classes. Training minimizes this loss.
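To make the loss concrete: for a multi-class problem, the deviance is (up to a constant factor) the negative log-likelihood of the true class under the predicted class probabilities. The following is a minimal pure-Python sketch with a hypothetical helper name and made-up probabilities. It is not scikit-learn's internal implementation.

```python
import math


def multiclass_deviance(true_labels, predicted_probs):
    """Mean negative log-likelihood of the true class.

    true_labels: list of class labels, one per sample.
    predicted_probs: list of dicts mapping class label -> probability.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += -math.log(probs[label])
    return total / len(true_labels)


# Made-up probabilities for two samples whose true ring classes are 9 and 10.
labels = [9, 10]
probs = [{9: 0.7, 10: 0.3}, {9: 0.4, 10: 0.6}]
print(multiclass_deviance(labels, probs))
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, which is the quantity boosting drives down tree by tree.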


## 1. Load the data

### Data corrections:
1. Map Sex to numeric value for use in our model
    - M -> 0
    - F -> 1
    - I -> 2
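One thing worth checking with a dictionary-based mapping is whether any category was left unmapped. The sketch below uses a tiny made-up sample (the `"X"` code is deliberately invalid) to show how such rows could be detected.

```python
import pandas as pd

# Tiny made-up sample; the last row has an unexpected code on purpose.
sample = pd.DataFrame({"Sex": ["M", "F", "I", "X"]})
sex_codes = {"M": 0, "F": 1, "I": 2}

mapped = sample["Sex"].map(sex_codes)

# Series.map turns values missing from the dictionary into NaN,
# so counting NaNs reveals unmapped categories.
unmapped = int(mapped.isna().sum())
print(unmapped)  # 1
```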

In [406]:
df_train = pd.read_csv('Data/abalone.data')
df_train['Sex'] = df_train['Sex'].map({'M': 0, 'F': 1, 'I': 2})
# Drop the ring values that occur only once; the model needs at least two samples of each class.
df_train = df_train[~df_train['Rings'].isin([1,26,29,2,25])]

df_train

|  | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 | 15 |
| 1 | 0 | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | 7 |
| 2 | 1 | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | 9 |
| 3 | 0 | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | 10 |
| 4 | 2 | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | 1 | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 | 11 |
| 4173 | 0 | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 | 10 |
| 4174 | 0 | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 | 9 |
| 4175 | 1 | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 | 10 |
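The ring values to drop are hard-coded in the cell above. As an alternative sketch, they could be derived from the data with `value_counts`. The frame below is made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up ring counts; 1 and 29 each appear only once.
df = pd.DataFrame({"Rings": [9, 9, 10, 10, 10, 1, 29]})

ring_counts = df["Rings"].value_counts()
rare_rings = ring_counts[ring_counts < 2].index

# Keep only rows whose ring value occurs at least twice.
df_filtered = df[~df["Rings"].isin(rare_rings)]
print(sorted(df_filtered["Rings"].unique().tolist()))  # [9, 10]
```

Deriving the list this way avoids having to update it by hand if the input file changes.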


## 2. Split the data set into a training and test set
### Also split columns into the x and y variables
### We adopt a 70/30 split for training vs test as the data set is small

In [407]:
# Define our independent and dependent variables
data_column_names = [column for column in df_train.columns if column not in ['Rings']]
x = df_train.loc[:, data_column_names]
y = df_train.loc[:,'Rings']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=13)


## 3. Create the Gradient-boosted Trees Model and fit with the training data
Future work would benefit from a better splitting strategy, such as stratified k-fold cross-validation.
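For reference, stratified k-fold splitting could look roughly like the sketch below. It uses synthetic stand-in data rather than the abalone set, and the fold count of 5 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 20 samples, 2 features, 2 classes of 10 each.
rng = np.random.default_rng(13)
X = rng.normal(size=(20, 2))
y = np.array([0] * 10 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps the 50/50 class balance of the full data.
    class_counts = np.bincount(y[test_idx])
    print(fold, class_counts)
```

Stratification keeps the class proportions roughly equal across folds, which matters here because many ring values have few samples.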

In [408]:
GBT = make_pipeline(StandardScaler(), GradientBoostingClassifier(n_estimators=5000,
                                       learning_rate=0.01,
                                       max_depth=3,
                                       subsample=0.5,
                                       validation_fraction=0.1,
                                       n_iter_no_change=20,
                                       max_features='log2'
                                      )).fit(x_train,y_train)

GBT.score(x_test,y_test)

predictions = GBT.predict(x_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         7
           4       0.57      0.55      0.56        22
           5       0.31      0.36      0.33        33
           6       0.37      0.26      0.30        88
           7       0.35      0.42      0.38       134
           8       0.32      0.40      0.35       159
           9       0.23      0.41      0.30       180
          10       0.23      0.31      0.26       177
          11       0.26      0.30      0.28       142
          12       0.08      0.01      0.02        83
          13       0.08      0.02      0.03        66
          14       0.00      0.00      0.00        44
          15       0.00      0.00      0.00        36
          16       0.38      0.11      0.17        28
          17       0.20      0.05      0.08        19
          18       0.00      0.00      0.00         6
          19       0.00      0.00      0.00        10
          20       0.17    

  _warn_prf(average, modifier, msg_start, len(result))


## Results

Generally, the model performs poorly at classifying rings: among the classes shown in the report, the best F1 score is 0.56 and several classes score 0. A drawback is that the data set is small at ~4.2k rows. The low F1 scores could be due to the small sample size and the small feature set. Adjusting the hyperparameters yielded minimal increases in performance. Future work would benefit from a larger sample and more features.
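Several classes in the report (14 and 15, for example) score 0.00 on every metric, which typically happens when a class is never predicted. The sketch below reproduces that situation on made-up labels and uses the `zero_division` argument to report such undefined scores as 0 explicitly.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels: class 14 is present in y_true but never predicted.
y_true = [9, 9, 10, 14]
y_pred = [9, 10, 10, 10]

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[9, 10, 14], zero_division=0
)
# Precision for class 14 is undefined (no predictions), so it is reported as 0.
print(dict(zip([9, 10, 14], f1)))
```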

# Stochastic Gradient Descent Regressor

## 1. Regularize, run, and fit the model
### We already split the data into training and test sets, which we'll reuse here.

### We use scikit-learn's pipeline to scale the data as SGD is sensitive to feature scaling.
### -  **Regularization:** L2, the default for `SGDRegressor` in scikit-learn. We'll keep L2 since we have few features and do not want to shrink any coefficients to zero.
### -  **Loss Function:** Squared error. I want the model to be sensitive to outliers in the data.
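To illustrate how these two choices combine, here is a pure-Python sketch of the objective being minimized. It assumes the form mean(0.5 * error²) + alpha * 0.5 * ‖w‖², with `alpha=0.0001` taken as the assumed default. The function name is hypothetical, and this is not scikit-learn's actual code.

```python
def sgd_l2_objective(weights, bias, features, targets, alpha=0.0001):
    """Mean squared-error loss plus an L2 penalty on the weights.

    Assumed form: mean(0.5 * (y - prediction)^2) + alpha * 0.5 * sum(w^2).
    """
    n_samples = len(targets)
    data_loss = 0.0
    for row, target in zip(features, targets):
        prediction = sum(w * value for w, value in zip(weights, row)) + bias
        data_loss += 0.5 * (target - prediction) ** 2
    penalty = 0.5 * sum(w ** 2 for w in weights)
    return data_loss / n_samples + alpha * penalty


# Made-up example: one feature, two samples, perfectly fit by weight 2.
# The data loss is 0, so only the L2 penalty contributes.
print(sgd_l2_objective([2.0], 0.0, [[1.0], [2.0]], [2.0, 4.0]))
```

Because the error is squared, a single large residual dominates the loss, which is why this choice is sensitive to outliers. The L2 term shrinks weights smoothly without forcing any to exactly zero.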


In [409]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
reg.fit(x_train, y_train)

reg_predictions = reg.predict(x_test)

# Compare actual vs. predicted ring counts for the first 6 test samples.
print("-----------------------------------------------------")
for actual, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(actual) + " - Prediction: " + str(pred))

# For a regressor, score() returns the coefficient of determination (R^2).
print("Score: " + str(reg.score(x_test, y_test)))


-----------------------------------------------------
Actual: 13 - Prediction: 11.208360371575337
Actual: 4 - Prediction: 4.988653657792025
Actual: 12 - Prediction: 10.620439910418026
Actual: 17 - Prediction: 9.548780130785246
Actual: 12 - Prediction: 9.809014246907697
Actual: 7 - Prediction: 7.746276350121907
Score: 0.5743626953677827
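The score printed above is the coefficient of determination (R²) that scikit-learn regressors return from `score`. A minimal sketch of how it is computed, using a hypothetical helper and made-up values:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values: predicting the mean everywhere gives R^2 = 0,
# and perfect predictions give R^2 = 1.
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0
print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0
```

Read this way, a score of about 0.57 means the model explains a little over half of the variance in ring counts on the test set.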


## Results

Similar to the boosted trees, the SGD regressor performs fairly poorly: its score on the test set is about 0.57. Again, this is likely due to the sample and feature set size. To improve the scores, more data and features need to be collected. If one only needs a general idea of the ring count, the SGD regressor would likely be the better option of the two.