### General Instructions
In this assignment, you will need to complete the code samples where indicated to accomplish the given objectives. **Be sure to run all cells** and export this notebook as an HTML file with results included. Upload the exported HTML file to Canvas by the assignment deadline.

#### Assignment
Unlike previous exercises, you will not be provided any sample code from which to work.  You will be given some very high-level instructions and are expected to figure out a solution from there.

Abalone are a popular shellfish. Pressure on the abalone population from the fishing industry has caused the species to go into decline. Efforts have been underway for some time to limit the harvest to abalone above a certain age. However, there is no way to accurately determine the age of an abalone without counting the layers of its shell, with each layer indicating 1.5 years of life, and counting the layers requires harvesting the animal.

Researchers from the University of Tasmania have compiled a [dataset](https://archive.ics.uci.edu/ml/datasets/Abalone) of physical characteristics, many of which can be measured without harming the animal, along with a count of rings for a large number of abalone harvested off the Australian coast.  Use these data, stored at **wasbs://downloads@smithbc.blob.core.windows.net/abalone/** for your convenience, to build a regression model to predict the number of rings (and therefore the age) of abalone based on the following characteristics:

* sex
* mm_length
* mm_diameter
* mm_height
* g_whole_weight

Replace any missing values for the last four of these characteristics with the median value. Replace any missing values for sex with the most frequently occurring value. Handle sex as a categorical feature. Build a linear regression model and package your data transformations with the model as a pipeline, so that the model can more easily be converted into an application that could be deployed to aid fishermen collecting abalone.
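
One way to package these steps is sketched below: a single `ColumnTransformer` with one small sub-pipeline per column type, followed by the regression. This is only an illustration under stated assumptions, not a required solution. The toy DataFrame and the names `toy`, `categorical_steps`, `numeric_steps`, `preprocess`, and `model` are made up, and the scaling step is optional for ordinary linear regression.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy stand-in for the abalone data (values are invented for illustration).
toy = pd.DataFrame({
    'sex': pd.Series(['M', 'F', 'I', np.nan, 'M', 'F', 'I', 'M'], dtype=object),
    'mm_length': [0.45, 0.53, 0.33, 0.44, np.nan, 0.55, 0.30, 0.47],
    'mm_diameter': [0.36, 0.42, 0.25, 0.36, 0.37, np.nan, 0.23, 0.37],
    'mm_height': [0.09, 0.13, 0.08, 0.12, 0.12, 0.15, 0.07, np.nan],
    'g_whole_weight': [0.51, 0.68, 0.20, 0.52, 0.51, 0.89, 0.18, 0.50],
    'rings': [15, 9, 7, 10, 9, 19, 6, 10],
})
numeric_cols = ['mm_length', 'mm_diameter', 'mm_height', 'g_whole_weight']

# sex: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(drop='first')),
])

# numeric columns: fill gaps with the median, then scale
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', RobustScaler()),
])

preprocess = ColumnTransformer([
    ('sex', categorical_steps, ['sex']),
    ('numeric', numeric_steps, numeric_cols),
])

# transformations and regression travel together as one object
model = Pipeline([
    ('preprocess', preprocess),
    ('regression', LinearRegression()),
])

X = toy[['sex'] + numeric_cols]
y = toy['rings']
model.fit(X, y)
predictions = model.predict(X)
```

Because the imputers and the encoder live inside the pipeline, the fitted `model` can accept raw rows (including the string-valued `sex` column) directly at prediction time.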

Be sure to score your model for accuracy and use 5-fold cross-validation to reduce the impact of random splits on your results. Print the model score where indicated in the cells below.
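
Passing a whole pipeline, rather than a bare estimator, to `cross_val_score` refits every transformation inside each fold. A minimal sketch on synthetic data follows; the arrays `X_demo` and `y_demo` and the pipeline `demo_pipeline` are invented for illustration. For regressors, the default score reported by `cross_val_score` is R².

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic, roughly linear data (an assumption for this demo only).
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(100, 3))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.05, size=100)
X_demo[::10, 0] = np.nan  # remove every tenth value in column 0 to exercise the imputer

demo_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('regression', LinearRegression()),
])

# cv=5 splits the rows into 5 folds; the pipeline is fit on 4 and scored on the 5th, 5 times
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print("Average 5-fold CV score (R^2): {:.3f}".format(np.mean(cv_scores)))
```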

In [4]:
# install a pinned version of scikit-learn (0.22.1) to avoid a problem with OneHotEncoder (OHE)
dbutils.library.installPyPI('scikit-learn', version='0.22.1')
dbutils.library.restartPython()

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

import numpy as np
import pandas as pd

# read the data to a pandas DataFrame and assemble feature and label arrays
# make data available at /dbfs/tmp/abalone/abalone.data
# remove any earlier copy first; ignore the error if none exists
try:
  dbutils.fs.rm('/tmp/abalone/abalone.data',recurse=True)
except Exception:
  pass
dbutils.fs.cp('wasbs://downloads@smithbc.blob.core.windows.net/abalone/','/tmp/abalone',recurse=True)

# read the dataset
df = pd.read_csv(
  '/dbfs/tmp/abalone/abalone.data', 
  sep=',',
  header=0,
  encoding='utf8', engine='python'
  )


In [6]:
# assemble your model pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import OrdinalEncoder
# train the encoder on the three possible values of the sex field
encoder = OrdinalEncoder()

encoder.fit([
  ['M'], 
  ['F'],
  ['I']
])
# create encoded sex field
df['sex_encoded'] = encoder.transform(
    df['sex'].values.reshape(-1,1) 
  )
df.head()


Unnamed: 0,sex,mm_length,mm_diameter,mm_height,g_whole_weight,g_shucked_weight,g_viscera_weight,g_shell_weight,rings,sex_encoded
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,2.0
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,2.0
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,0.0
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,2.0
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,1.0
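
The codes in `sex_encoded` follow from how `OrdinalEncoder` assigns integers: by default it sorts the categories it sees during `fit`, so the order of the rows passed to `fit` does not matter. A small standalone check (the names `demo_encoder`, `categories`, and `codes` are illustrative):

```python
from sklearn.preprocessing import OrdinalEncoder

demo_encoder = OrdinalEncoder()
demo_encoder.fit([['M'], ['F'], ['I']])

# categories are stored in sorted order, regardless of the order given to fit
categories = list(demo_encoder.categories_[0])

# each value is replaced by its position in that sorted list
codes = demo_encoder.transform([['M'], ['F'], ['I']]).ravel().tolist()
print(categories, codes)
```

These integer codes imply an ordering (F < I < M) that has no real meaning, which is why a one-hot encoding step is still needed before sex can be treated as a categorical feature in the regression.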


In [7]:
# separate features from label column
features = df[['sex','mm_length','mm_diameter','mm_height','g_whole_weight']] # X dataframe
df1  = df[['sex_encoded','mm_length','mm_diameter','mm_height','g_whole_weight']] # X dataframe
labels = df['rings'] #y values

In [8]:
df1.head()

Unnamed: 0,sex_encoded,mm_length,mm_diameter,mm_height,g_whole_weight
0,2.0,0.455,0.365,0.095,0.514
1,2.0,0.35,0.265,0.09,0.2255
2,0.0,0.53,0.42,0.135,0.677
3,2.0,0.44,0.365,0.125,0.516
4,1.0,0.33,0.255,0.08,0.205


In [9]:
# define stages for ColumnTransformer
missing_value_transformer = ColumnTransformer([
  ('most_frequent_missing', SimpleImputer(missing_values=np.nan, strategy='most_frequent'), [0]), # column 0 is sex
  ('median_missing', SimpleImputer(missing_values=np.nan, strategy='median'), [1,2,3,4]) # columns 1-4 are the numeric measurements
  ])

# apply transformations
X_01 = missing_value_transformer.fit_transform( df1 )

pd.DataFrame(X_01).head()

Unnamed: 0,0,1,2,3,4
0,2.0,0.455,0.365,0.095,0.514
1,2.0,0.35,0.265,0.09,0.2255
2,0.0,0.53,0.42,0.135,0.677
3,2.0,0.44,0.365,0.125,0.516
4,1.0,0.33,0.255,0.08,0.205
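
The rows above are unchanged from the input because none of these five rows has a missing value. To see what the two strategies do when values are missing, here is a small sketch on made-up data (the frame `demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one gap in each column.
demo = pd.DataFrame({
    'sex': pd.Series(['M', 'F', np.nan, 'M'], dtype=object),
    'mm_length': [0.30, np.nan, 0.50, 0.40],
})

# most_frequent fills the gap with the most common observed value in the column
sex_filled = SimpleImputer(strategy='most_frequent').fit_transform(demo[['sex']])

# median fills the gap with the median of the observed values in the column
length_filled = SimpleImputer(strategy='median').fit_transform(demo[['mm_length']])

print(sex_filled.ravel(), length_filled.ravel())
```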


In [10]:
# define stages for encoding & scaling ColumnTransformer

encoding_scaling_transformer = ColumnTransformer([
  ('ohe_encode', OneHotEncoder( drop='first', sparse=False), [0]), # column 0 is the encoded sex field
  ('normalize', RobustScaler(), [1,2,3,4]) # columns 1-4 are the numeric measurements
  ])

X_02 = encoding_scaling_transformer.fit_transform( X_01 )

pd.DataFrame(X_02).head()

Unnamed: 0,0,1,2,3,4,5
0,0.0,1.0,-0.545455,-0.461538,-0.9,-0.401265
1,0.0,1.0,-1.181818,-1.230769,-1.0,-0.806746
2,0.0,0.0,-0.090909,-0.038462,-0.1,-0.172171
3,0.0,1.0,-0.636364,-0.461538,-0.3,-0.398454
4,1.0,0.0,-1.30303,-1.307692,-1.2,-0.835559
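
In the output above, the two leading columns are the one-hot columns for sex with the first category dropped, and the remaining four are the scaled measurements. With default settings, `RobustScaler` subtracts the median and divides by the interquartile range. A hedged illustration on made-up numbers (not the abalone data; all variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# drop='first' removes the indicator for the first sorted category ('F'),
# leaving one indicator column each for 'I' and 'M'
ohe = OneHotEncoder(drop='first')
encoded = ohe.fit_transform([['F'], ['I'], ['M']]).toarray()

values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scaled = RobustScaler().fit_transform(values)

# the same result by hand: (x - median) / (75th percentile - 25th percentile)
q25, q75 = np.percentile(values, [25, 75])
by_hand = (values - np.median(values)) / (q75 - q25)

print(encoded)
print(scaled.ravel(), by_hand.ravel())
```

Dropping one indicator column avoids a redundant feature: a row with zeros in both remaining columns can only be the dropped category.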


In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
reg = LinearRegression()

# define pipeline
clf = Pipeline(steps=[
  ('missing_values', missing_value_transformer),
  ('encoding_scaling', encoding_scaling_transformer),
  ('regression', reg)
  ])


In [12]:
# fit the model
_ = clf.fit(features, labels)

In [13]:
# make predictions of ring counts on the training data
predicted_prices = clf.predict(features)

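# calculate score
# print( clf.score(features, labels) )

# cross-validate the full pipeline (not the bare regressor) so that imputation,
# encoding and scaling are refit within each of the 5 folds
cv_scores = cross_val_score(clf, X=features, y=labels, cv=5)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))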

In [14]:
# concat the predictions with our original data along the column axis
y_pred = pd.DataFrame( 
  predicted_prices.reshape(-1,1), # reshape to a single-column 2-d array (the -1 tells reshape to infer the number of rows)
  columns=['predicted'] # provide a label for the single column
  )

pd.concat(
  [ y_pred, labels,df1], 
  axis=1
    ).head(10)

Unnamed: 0,predicted,rings,sex_encoded,mm_length,mm_diameter,mm_height,g_whole_weight
0,9.161669,15,2.0,0.455,0.365,0.095,0.514
1,7.951921,7,2.0,0.35,0.265,0.09,0.2255
2,10.484411,9,0.0,0.53,0.42,0.135,0.677
3,9.884084,10,2.0,0.44,0.365,0.125,0.516
4,6.813542,7,1.0,0.33,0.255,0.08,0.205
5,7.083407,8,1.0,0.425,0.3,0.095,0.3515
6,10.63694,20,0.0,0.53,0.415,0.15,0.7775
7,10.237439,16,0.0,0.545,0.425,0.125,0.768
8,9.624507,9,2.0,0.475,0.37,0.125,0.5095
9,10.976375,19,0.0,0.55,0.44,0.15,0.8945


In [15]:
# train your model using a 5-fold cross-validation

In [16]:
# present your model score