### Data Set Information: ###
<br>
Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of <br>
rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as <br>
weather patterns and location (hence food availability) may be required to solve the problem.<br>
<br>
From the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been <br>
scaled for use with an ANN (by dividing by 200).<br>
<br>
<br>
Attribute Information:<br>
<br>
Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as <br>
a classification problem.<br>
<br>
Name / Data Type / Measurement Unit / Description<br>
-----------------------------<br>
Sex / nominal / -- / M, F, and I (infant)<br>
Length / continuous / mm / Longest shell measurement<br>
Diameter / continuous / mm / perpendicular to length<br>
Height / continuous / mm / with meat in shell<br>
Whole weight / continuous / grams / whole abalone<br>
Shucked weight / continuous / grams / weight of meat<br>
Viscera weight / continuous / grams / gut weight (after bleeding)<br>
Shell weight / continuous / grams / after being dried<br>
Rings / integer / -- / +1.5 gives the age in years<br>
<br>
The readme file contains attribute statistics.<br>
<br>
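The two conversions mentioned in the description (rings to age, and undoing the division by 200) can be sketched as follows. The helper names `rings_to_age` and `unscale` are hypothetical and chosen for illustration only; the example values 15 and 0.455 are taken from the first row of the data shown further down.

```python
# Hypothetical helpers; the names are illustrative, not part of the dataset.
SCALE = 200  # continuous values were divided by 200 (see the description above)


def rings_to_age(rings: int) -> float:
    """Age in years according to the attribute table: rings + 1.5."""
    return rings + 1.5


def unscale(value: float, scale: int = SCALE) -> float:
    """Undo the division by 200 to get back to the original unit (mm or grams)."""
    return value * scale


first_row_age = rings_to_age(15)       # first sample has 15 rings
first_row_length_mm = unscale(0.455)   # first sample, Length column
print(first_row_age, round(first_row_length_mm, 1))
```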


In [4]:
import pandas as pd
import numpy as np
from pathlib import Path

columns = ["Sex", "Length", "Diameter", "Height", "Whole", "Shucked", "Viscera", "Shell", "Rings"]
sourcepath= Path("D:/9999_Github_MachineLearning/ML_experiments/ML_experiments/Abalone/Data/abalone.data")
data = pd.read_csv(sourcepath, names=columns)
data.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole,Shucked,Viscera,Shell,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [3]:
# split off targets:
targets = data.iloc[:,-1]
data = data.iloc[:,:-1]

# one-hot-encode Sex:
data = pd.get_dummies(data)
data.head()

Unnamed: 0,Length,Diameter,Height,Whole,Shucked,Viscera,Shell,Sex_F,Sex_I,Sex_M
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,0,0,1
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,0,0,1
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,1,0,0
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,0,0,1
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,0,1,0


In [4]:
from sklearn.model_selection import train_test_split

random_seed = 42
X_train, X_test, y_train, y_test = train_test_split(data.to_numpy(), targets.to_numpy(), test_size= 0.2, random_state=random_seed)

In [5]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3341, 10), (836, 10), (3341,), (836,))

## Spot-Shooting Algorithms ##

### Logistic Regression ###

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


log_reg = Pipeline([
    ("ssc", StandardScaler()),
    ("logreg", LogisticRegression(tol= 0.2, max_iter=1000, random_state=random_seed)) 
])

log_reg.fit(X_train, y_train)
print(f"{log_reg.score(X_test, y_test):.3f}")

0.281


### Linear Regression ###

In [24]:
from sklearn.linear_model import LinearRegression

lin_reg = Pipeline([
    ("ssc", StandardScaler()),
    ("linreg", LinearRegression())
]) 
lin_reg.fit(X_train, y_train)
print(f"{lin_reg.score(X_test, y_test):.3f}")

0.548


### Support Vector Regression ###

In [21]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

svr_reg = Pipeline([
    ("ssc", StandardScaler()),
    ("sv_reg", SVR())
    ])

svr_reg.fit(X_train, y_train)
print(f"{svr_reg.score(X_test, y_test):.3f}")

0.536


### Decision Tree ###

In [22]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)
#tree_clf.predict(X_test)  # predict() takes only the features
print(f"{tree_clf.score(X_test, y_test):.3f}")

0.214


In [23]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
print(f"{tree_reg.score(X_test, y_test):.3f}")

0.118


## Sum-Up Spot Shot: ##
1. Linear Regression score: 0.548
2. Support Vector Regression score: 0.536
3. Logistic Regression score: 0.281
4. Decision Tree Classification score: 0.214
5. Decision Tree Regression score: 0.118
<br>
<br>
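Note that the two classifiers report accuracy while the three regressors report R², so the scores are only comparable within each group. Among the regressors, linear regression and support vector regression do best, so we will go with those two.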
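The scores listed above come from two different metrics: in scikit-learn, `score()` of a classifier is the mean accuracy, while `score()` of a regressor is the coefficient of determination R². The following minimal sketch, using made-up ring counts and hand-written versions of both metrics, shows how far the two numbers can diverge on the same predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches (the metric a classifier's score() reports)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination (the metric a regressor's score() reports)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up ring counts: every prediction is off by exactly one ring.
y_true = [7, 9, 10, 15, 8, 11]
y_pred = [8, 10, 9, 14, 9, 10]

acc = accuracy(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(f"accuracy = {acc:.3f}, R^2 = {r2_value:.3f}")
# prints: accuracy = 0.000, R^2 = 0.850
```

Predictions that are always one ring off count as complete failures for accuracy but still explain most of the variance, which is one reason the classifier scores look so much worse.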

## Improving Results ##

### Linear Regression on KMeans ###

Used as a pipeline step, KMeans replaces each sample by its distances to the cluster centres, so the lin-reg is fitted on these distance features instead of the original measurements. This should reduce variance (and increase bias through loss of information). If the number of clusters is too small, the representation passed to the lin-reg becomes too coarse (high bias); if it is too high, there is too little reduction in variance. So a grid search over the number of clusters is probably necessary.
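What KMeans hands on to the next pipeline step can be checked directly. The sketch below uses synthetic random data as a stand-in for the ten abalone features (an assumption for illustration, not the real data) and shows that the transform produces one distance column per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for the 10 features

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
distances = kmeans.fit_transform(X)  # this is what the next step receives

# Recompute the Euclidean distances to the fitted centres by hand.
manual = np.linalg.norm(X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)

print(X.shape, "->", distances.shape)
print("matches manual distances:", np.allclose(distances, manual))
```

So the number of clusters is also the number of input features of the linear regression, which is why it acts as a capacity parameter worth tuning.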

In [49]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

pipe_lin_reg = Pipeline([
    ("ssc", StandardScaler()),
    ("kmeans", KMeans()),
    ("lin_reg", LinearRegression())
])

params = {
    "kmeans__n_clusters" : np.arange(10, 200, 10),
    "kmeans__max_iter" : np.arange(500, 1000, 100),
}
clf = GridSearchCV(pipe_lin_reg, params, n_jobs=-1, cv=10)

clf.fit(X_train, y_train)

print(f"Best parameters: {clf.best_params_}")
print(f"Best training-score: {clf.best_score_:.3f}")
print(f"Score on test data: {clf.score(X_test, y_test):.3f}")

Best parameters: {'kmeans__max_iter': 800, 'kmeans__n_clusters': 120}
Best training-score: 0.556
Score on test data: 0.562


### Linear Regression on KMeans and PCA ###

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

pipe_lin_reg = Pipeline([
    ("pca", PCA(n_components=3)),
    ("ssc", StandardScaler()),
    ("kmeans", KMeans()),
    ("lin_reg", LinearRegression())
])

params = {
    # start at 1: PCA with 0 components leaves no features for the following steps
    "pca__n_components" : np.arange(1, 10),
    "kmeans__n_clusters" : np.arange(10, 150, 10),
    "kmeans__max_iter" : np.arange(500, 800, 100),
}

clf = GridSearchCV(pipe_lin_reg, params, n_jobs=-1, cv=10)

clf.fit(X_train, y_train)

print(f"Best parameters: {clf.best_params_}")
print(f"Best training-score: {clf.best_score_:.3f}")
print(f"Score on test data: {clf.score(X_test, y_test):.3f}")


Two runs of the above grid search result in: <br>
<br>
1m 33.8s<br>
Best parameters: {'kmeans__max_iter': 600, 'kmeans__n_clusters': 60, 'pca__n_components': 8}<br>
Best training-score: 0.577<br>
Score on test data: 0.581<br>
<br>
1m 31.7s<br>
Best parameters: {'kmeans__max_iter': 500, 'kmeans__n_clusters': 90, 'pca__n_components': 8}<br>
Best training-score: 0.575<br>
Score on test data: 0.582<br>
<br>

### Further possible improvements of Lin-Reg: ###

1. use PolynomialFeatures instead of / in addition to KMeans
2. use regularization: non-negative least squares, ridge, lasso, elastic-net
3. use Bayesian regression (to choose a different type of regression from scikit-learn)
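The first two ideas from the list could be combined roughly as in the sketch below. It runs on synthetic data with a built-in interaction term (an assumption for illustration, not the abalone data), and the names `X_demo`, `y_demo` and `poly_ridge` as well as the choices `degree=2` and `alpha=1.0` are untuned placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(300, 3))
# Synthetic target with an interaction term that a plain lin-reg cannot capture.
y_demo = 5 + 10 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

poly_ridge = Pipeline([
    ("ssc", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge", Ridge(alpha=1.0)),  # L2-regularized linear regression
])

poly_ridge.fit(X_demo[:200], y_demo[:200])
demo_score = poly_ridge.score(X_demo[200:], y_demo[200:])
print(f"R^2 on held-out synthetic data: {demo_score:.3f}")
```

On the real data, `poly__degree` and `ridge__alpha` would be natural candidates for the same kind of GridSearchCV used above.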

### Support Vector Regression ###

#### SVR does not profit from KMeans: ####

In [50]:
pipe_svr = Pipeline([
    ("ssc", StandardScaler()),
    ("kmeans", KMeans()),
    ("svr", SVR())
])

params = {
    "kmeans__n_clusters" : np.arange(10, 200, 10),
    "kmeans__max_iter" : np.arange(500, 1000, 100),
}
clf = GridSearchCV(pipe_svr, params, n_jobs=-1, cv=10)

clf.fit(X_train, y_train)

print(f"Best parameters: {clf.best_params_}")
print(f"Best training-score: {clf.best_score_:.3f}")
print(f"Score on test data: {clf.score(X_test, y_test):.3f}")

# Result:
# Best parameters: {'kmeans__max_iter': 500, 'kmeans__n_clusters': 190}
# Best training-score: 0.436
# Score on test data: 0.438

Best parameters: {'kmeans__max_iter': 500, 'kmeans__n_clusters': 190}
Best training-score: 0.436
Score on test data: 0.438


### Support Vector Regression with different Kernels ###

In [69]:
# If no kernel is given, SVR uses kernel="rbf" by default (see: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
# We also try the polynomial and sigmoid kernels (see the commented-out lines below).

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVR


pipe_svr = Pipeline([
    ("pca", PCA()),
    ("ssc", StandardScaler()),
    #("svr", SVR(kernel='poly')),
    #("svr", SVR(kernel='sigmoid')),
    ("svr", SVR())

])

params = {
    "pca__n_components" : np.arange(1,11),
    #"svr__degree" : np.arange(2,6),
}

svr_clf = GridSearchCV(pipe_svr, params, n_jobs=-1, cv=10)

svr_clf.fit(X_train, y_train)

print(f"Best estimator: {svr_clf.best_estimator_}")
print(f"Best parameter training: {svr_clf.best_params_}")
print(f"Score on testset: {svr_clf.score(X_test, y_test):.3f}")

# Results:
# 1.
# Best estimator: Pipeline(steps=[('ssc', StandardScaler()), ('svr', SVR(kernel='poly'))])
# Best parameter training: {'svr__degree': 3}
# Score on testset: 0.495
#
# 2.
# Best estimator: Pipeline(steps=[('pca', PCA(n_components=5)), ('ssc', StandardScaler()),
#                ('svr', SVR(kernel='poly'))])
# Best parameter training: {'pca__n_components': 5, 'svr__degree': 3}
# Score on testset: 0.468
#
# 3.
# Best estimator: Pipeline(steps=[('pca', PCA(n_components=10)), ('ssc', StandardScaler()),
#                ('svr', SVR(kernel='sigmoid'))])
# Best parameter training: {'pca__n_components': 10}
# Score on testset: -44.512
#
# 4.
# Best estimator: Pipeline(steps=[('pca', PCA(n_components=8)), ('ssc', StandardScaler()),
#                ('svr', SVR())])
# Best parameter training: {'pca__n_components': 8}
# Score on testset: 0.554

Best estimator: Pipeline(steps=[('pca', PCA(n_components=8)), ('ssc', StandardScaler()),
                ('svr', SVR())])
Best parameter training: {'pca__n_components': 8}
Score on testset: 0.554


### Ensembling Lin-Reg and SVR ###

#### Bagging Linear-Regression ####

In [84]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import BaggingRegressor


pipe_lin_reg = Pipeline([
    ("pca", PCA(n_components=8)),
    ("ssc", StandardScaler()),
    ("kmeans", KMeans()),
    ("lin_reg", LinearRegression())  # step renamed: this is a linear, not a logistic, regression
])

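# The estimator is passed positionally because scikit-learn 1.2 renamed
# the keyword base_estimator to estimator.
bag_lin_reg = BaggingRegressor(pipe_lin_reg, n_estimators=10, n_jobs=-1)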
bag_lin_reg.fit(X_train, y_train)

print(f"Test score bag-lin-regressor: {bag_lin_reg.score(X_test, y_test):.3f}")

# Result:
# Test score bag-lin-regressor: 0.557
# similar results for n_estimators = 50, 100

Test score bag-lin-regressor: 0.557


In [None]:
from sklearn.model_selection import GridSearchCV

# Unfinished cell: the grid search below is still commented out,
# so the print statements would report whichever `clf` was fitted last.
params = {
    "pca__n_components" : np.arange(1, 10),  # 0 components would leave no features
    "kmeans__n_clusters" : np.arange(10, 150, 10),
    "kmeans__max_iter" : np.arange(500, 800, 100),
}

#clf = GridSearchCV(pipe_lin_reg, params, n_jobs=-1, cv=10)

#clf.fit(X_train, y_train)

print(f"Best parameters: {clf.best_params_}")
print(f"Best training-score: {clf.best_score_:.3f}")
print(f"Score on test data: {clf.score(X_test, y_test):.3f}")

#### Bagging Support Vector Regression ####

In [82]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor


pipe_svr = Pipeline([
    ("pca", PCA(n_components=8)),
    ("ssc", StandardScaler()),
    ("svr", SVR())
])

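# The estimator is passed positionally because scikit-learn 1.2 renamed
# the keyword base_estimator to estimator.
bag_svr = BaggingRegressor(pipe_svr, n_jobs=-1)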
bag_svr.fit(X_train, y_train)

print(f"Test score bag-sv-regressor: {bag_svr.score(X_test, y_test):.3f}")

# Result:
# 1.
# n_estimators = 10
# Test score bag-sv-regressor: 0.555
# similar results with n_estimators = 30, 50, 100 

Test score bag-regressor: 0.553


#### Sum-Up of BaggingRegressor application: ####
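Bagging did not improve lin-reg or svr. <br>
Suspiciously, the scores of the bagged lin-reg and the bagged svr are almost equal, so there might be a mistake in my implementation. Note also that the bagged lin-reg pipeline uses KMeans() with its default settings rather than the tuned number of clusters.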

### Ensembling by Voting: Lin-Reg and SVR ###

In [93]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor

pipe_lin_reg = Pipeline([
    ("pca", PCA(n_components=8)),
    ("ssc", StandardScaler()),
    ("kmeans", KMeans(max_iter=600, n_clusters= 60)),
    ("log_reg", LinearRegression())
])

pipe_svr = Pipeline([
    ("pca", PCA(n_components=8)),
    ("ssc", StandardScaler()),
    ("svr", SVR())
])

vr_reg = VotingRegressor([("lin_reg", pipe_lin_reg), ("svr", pipe_svr)], n_jobs=-1)
vr_reg.fit(X_train, y_train)
print(f"Voting-Regressor score: {vr_reg.score(X_test, y_test):.3f}")
print(f"Voting estimators: {vr_reg.estimators_}")

# Result:
# Voting-Regressor score: 0.578
# Voting estimators: [Pipeline(steps=[('pca', PCA(n_components=8)), ('ssc', StandardScaler()),
#                ('kmeans', KMeans(max_iter=600, n_clusters=60)),
#                ('log_reg', LinearRegression())]), Pipeline(steps=[('pca', PCA(n_components=8)), ('ssc', StandardScaler()),
#                ('svr', SVR())])]

Voting-Regressor score: 0.578
Voting estimators: [Pipeline(steps=[('pca', PCA(n_components=8)), ('ssc', StandardScaler()),
                ('kmeans', KMeans(max_iter=600, n_clusters=60)),
                ('log_reg', LinearRegression())]), Pipeline(steps=[('pca', PCA(n_components=8)), ('ssc', StandardScaler()),
                ('svr', SVR())])]


### Sum-up Voting Regressor ###
The voting regressor consisting of a lin-reg and an svr does not do better than the lin-reg alone.
<br>
One reason might be that both do well (or not so well) on the same data points, so that there is no gain in voting. <br>
<br>
Maybe we could improve by adding one of the weaker models to the ensemble (e.g. KNeighborsRegressor or DecisionTreeRegressor; LogisticRegression is a classifier and would not fit into a VotingRegressor as is).
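Whether two models really fail on the same data points can be checked by correlating their residuals: a value close to 1 suggests that averaging them has little to gain. The sketch below uses made-up predictions for illustration only; in the notebook the inputs would be `y_test` and the predictions of the two fitted pipelines on `X_test`. The helper names `pearson` and `residuals` are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def residuals(y_true, y_pred):
    """Per-sample errors: true value minus prediction."""
    return [t - p for t, p in zip(y_true, y_pred)]


# Made-up values for illustration only.
y_true = [7, 9, 10, 15, 8, 11]
pred_a = [8.0, 9.5, 9.0, 12.0, 8.5, 10.0]   # stands in for the lin-reg pipeline
pred_b = [7.5, 9.8, 9.2, 12.5, 8.4, 10.3]   # stands in for the svr pipeline

corr = pearson(residuals(y_true, pred_a), residuals(y_true, pred_b))
print(f"residual correlation: {corr:.3f}")
```

A model whose residuals correlate only weakly with those of the lin-reg would be a more promising third member of the ensemble than one that merely scores well on its own.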

## Save Model / Load Model ##

In [94]:
# save the model to disk:
import pickle

filename = 'votingRegressor.sav'
# the context manager makes sure the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(vr_reg, f)

In [96]:
# load the model from disk: 
import pickle

with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(f"{result:.3f}")

0.578
