# Problem 1 Abalone Data Set
### Created By: Ivor Zalud

## The Problem
I want to predict the age of an Abalone given a set of physical features. These features are: Length, Diameter, Height, Whole_weight, Shucked_weight, Viscera_weight, Shell_weight, and Rings. Age is given by +1.5 to the rings. Thus the features excluding rings will be used to predict rings.

## The Data
Data for this is provided by UC Irvine [here](archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data). The set includes ~41k rows of data describing physical features of Abalone. The feature set is described above.

## Methods
* Gradient-Boosted Trees as our multi-class classifier
* Stochastic Gradient Descent as our linear regression


Importing python modules

In [177]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Gradient-Boosted Trees

### - **Regularization**: Scikit outlines it well [here](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regularization-py) the optimal regularization methods. We'll use shrinkage and subamsple < 1.0 to produce a more accurate model. This is a result of reducing the variance via bagging
### - **Loss function**: Deviance (default for scikit) which will use logistic regressions loss function. We want to minimize this loss function


## 1. Load the data

### Data corrections:
1. Map Sex to numeric value for use in our model
    - M -> 0
    - F -> 1
    - I -> 2

In [178]:
df_train = pd.read_csv('Data/abalone.data')
df_train['Sex'] = df_train['Sex'].map({'M': 0, 'F': 1, 'I': 2})
## Drop rows that contain only one instance of Ring values, y values need at least two instances for the model.
df_train = df_train[~df_train['Rings'].isin([1,26,29,2,25])]

df_train

Unnamed: 0,Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
0,0,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,0,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,1,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,0,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
4,2,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7
...,...,...,...,...,...,...,...,...,...
4172,1,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,0,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,0,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,1,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10


## 2. Split the data set into a training and test set
### Also split columns into the x and y variables
### We adopt a 80/20 split for training vs test

In [179]:
## Define our indepedent and dependant variables
data_column_names = [column for column in df_train.columns]
x = df_train.loc[:, data_column_names]
y = df_train.loc[:,'Rings']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)


## 3. Create the Gradient-boosted Trees Model and fit with the training data

In [180]:
GBT = GradientBoostingClassifier(n_estimators=5000,
                                       learning_rate=0.01,
                                       max_depth=3,
                                       subsample=0.5,
                                       validation_fraction=0.1,
                                       n_iter_no_change=20,
                                       max_features='log2'
                                      ).fit(x_train,y_train)

GBT.score(x_test,y_test)

predictions = GBT.predict(x_test)
print(GBT.feature_names_in_)
print(classification_report(y_test, predictions))



['Sex' 'Length' 'Diameter' 'Height' 'Whole_weight' 'Shucked_weight'
 'Viscera_weight' 'Shell_weight' 'Rings']
              precision    recall  f1-score   support

           3       1.00      1.00      1.00         2
           4       1.00      1.00      1.00        10
           5       1.00      1.00      1.00        14
           6       1.00      1.00      1.00        53
           7       1.00      1.00      1.00        68
           8       1.00      1.00      1.00       119
           9       1.00      1.00      1.00       138
          10       1.00      1.00      1.00       138
          11       1.00      1.00      1.00       106
          12       1.00      1.00      1.00        49
          13       1.00      1.00      1.00        41
          14       1.00      1.00      1.00        24
          15       1.00      1.00      1.00        17
          16       0.93      0.93      0.93        15
          17       1.00      1.00      1.00        17
          18       1.00  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Results

Generally, the model works sufficiently well with classifing rings with large support size (as expected). Smaller support sizes still have decent performance with predictions becoming unreliable at lower support sizes (< 5). Since our training and test sets change as we re-run the notebook, we see a variance in the f1-score. Generally, we see an f1-score between 85-96 meaning our model works well. A drawback is our data set size is small at 41k rows. Future studies would benefit from increasing the sample size.

# Stochastic Gradient Descent Regressor

## 1. Regularize, run, and fit the model
### We already split the data into the training and test set which well use here.

### We use scikit-learn's pipeline to scale the data as SGD is sensitive to feature scaling.
### -  **Regularization:** L2 - Default of SGD Regressor in Scikit learn. Well keep L2 since we have a low amount of features and do not want to shrink any coefficients to zero.
### -  **Loss Function:** Squared error - I want to be sensititve to any outliers in the data.


In [181]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
reg.fit(x_train, y_train)

reg_predictions = reg.predict(x_test)

## Prettier the data to compare actual vs prediction of the first 6 items.
i = 0
pred_arr = []
actual_arr = []
for pred in reg_predictions:
    pred_arr.append(pred)
    i += 1
    if i > 6:
        break
print("-----------------------------------------------------")
i = 0
for act in y_test:
    actual_arr.append(act)
    i += 1
    if i > 6:
        break

for x in range(6):
    print("Actual: " + str(actual_arr[x]) + " - Prediction: " + str(pred_arr[x]))


print("Score: " + str(reg.score(x_test, y_test)))


-----------------------------------------------------
Actual: 8 - Prediction: 7.9891741751514385
Actual: 7 - Prediction: 7.008700008282071
Actual: 15 - Prediction: 15.012467060954823
Actual: 7 - Prediction: 7.0124624501370345
Actual: 16 - Prediction: 16.004053660432227
Actual: 10 - Prediction: 9.999869158558493
Score: 0.9999715582142357


## Results

Our regression model predicts Ring count very well. There is little varience between the actual vs predicted value. Similar to the gradient boosted trees, a drawback here is our sample size.