###General Instructions
In this assignment, you will need to complete the code samples where indicated to accomplish the given objectives. **Be sure to run all cells** and export this notebook as an HTML file with results included. Upload the exported HTML file to Canvas by the assignment deadline.

####Assignment
Unlike previous exercises, you will not be provided any sample code from which to work.  You will be given some very high-level instructions and are expected to figure out a solution from there.

Abalone are a popular shellfish. Pressure on the abalone population from the fishing industry has caused the species to go into decline. Efforts have been underway for some time to limit the harvest to abalone above a certain age. However, the only accurate way to determine an abalone's age is to count the rings (layers) of its shell, with the ring count plus 1.5 giving the age in years, and counting the rings requires harvesting the animal.

Researchers from the University of Tasmania have compiled a [dataset](https://archive.ics.uci.edu/ml/datasets/Abalone) of physical characteristics, many of which can be measured without harming the animal, along with a count of rings for a large number of abalone harvested off the Australian coast.  Use these data, stored at **wasbs://downloads@smithbc.blob.core.windows.net/abalone/** for your convenience, to build a regression model to predict the number of rings (and therefore the age) of abalone based on the following characteristics:

* sex
* mm_length
* mm_diameter
* mm_height
* g_whole_weight

Replace any missing values for the last four of these characteristics with the median value. Replace any missing values for sex with the most frequently occurring value. Handle sex as a categorical feature. Build a linear regression model and package your data transformations with the model as a pipeline, so that the model can be converted into an application and deployed to aid fishermen collecting abalone.

Be sure to score your model for accuracy and use 5-fold cross-validation to reduce the impact of any single random split on your results. Print the model score where indicated in the cells below.

In [4]:
# install a pinned version of scikit-learn (0.22.1) to avoid a problem with OneHotEncoder (OHE)
dbutils.library.installPyPI('scikit-learn', version='0.22.1')
dbutils.library.restartPython()

In [5]:
# import the libraries used in this notebook
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# copy the dataset from blob storage to DBFS, removing any stale copy first

try:
  dbutils.fs.rm('/tmp/abalone/abalone.data',recurse=True)
except Exception:
  pass  # nothing to remove on a first run
dbutils.fs.cp('wasbs://downloads@smithbc.blob.core.windows.net/abalone/','/tmp/abalone',recurse=True)

# read the dataset
df = pd.read_csv(
  '/dbfs/tmp/abalone/abalone.data', 
  sep=',',
  header=0,
  encoding='utf8', engine='python'
  )

# train encoder on sex field
# (categories are sorted alphabetically, so F=0, I=1, M=2 regardless of the order listed here)
encoder = OrdinalEncoder()
encoder.fit([
  ['M'],
  ['F'],
  ['I']
])
# create encoded sex field
df['sex_encoded'] = encoder.transform(
    df['sex'].values.reshape(-1,1) 
  )

# separate features from label column
features  = df[['sex_encoded','mm_length','mm_diameter','mm_height','g_whole_weight']] # X dataframe
labels = df['rings'] #y values

In [6]:
# assemble your model pipeline

# define stages for the missing-value ColumnTransformer
# (its output keeps the transformer order: sex_encoded first, then the four numeric columns)
missing_value_transformer = ColumnTransformer([
  ('most_frequent_missing', SimpleImputer(missing_values=np.nan, strategy='most_frequent'), [0]),
  ('median_missing', SimpleImputer(missing_values=np.nan, strategy='median'), [1,2,3,4])
  ])

# define stages for encoding & scaling ColumnTransformer
encoding_scaling_transformer = ColumnTransformer([
  ('ohe_encode', OneHotEncoder(drop='first', sparse=False), [0]), # column 0 is the encoded sex field
  ('normalize', RobustScaler(), [1,2,3,4]) # columns 1-4 are the numeric measurements
  ])

# instantiate and configure model
reg = LinearRegression()

# define pipeline
clf = Pipeline(steps=[
  ('missing_values', missing_value_transformer),
  ('encoding_scaling', encoding_scaling_transformer),
  ('regression', reg)
  ])

# fit the model
_ = clf.fit(features, labels)

In [7]:
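# train your model using a 5-fold cross-validation
# (with no scoring argument, cross_val_score uses the estimator's own score method: R^2 for LinearRegression)
cv_scores = cross_val_score(clf, X=features.values, y=labels, cv=5)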

In [8]:
# present your model score
print("R^2 scores for 5-fold cross-validation: {}".format(cv_scores))
print("Average 5-fold CV R^2 score: {}".format(np.mean(cv_scores)))
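
The steps above could also be arranged so that imputation and encoding for each column type live in nested pipelines inside a single `ColumnTransformer`, which lets the pipeline accept the raw `sex` strings directly. The following is only a minimal sketch of that layout, not the assignment's required solution. It assumes a synthetic stand-in DataFrame (`demo`, generated at random and unrelated to the real abalone measurements) that reuses the assignment's column names, and it reports mean absolute error alongside R^2 as an illustrative second metric.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Synthetic stand-in data (an assumption for illustration, NOT the abalone dataset)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
  'sex': rng.choice(['M', 'F', 'I'], size=n),
  'mm_length': rng.uniform(20, 160, size=n),
  'mm_diameter': rng.uniform(10, 130, size=n),
  'mm_height': rng.uniform(2, 50, size=n),
  'g_whole_weight': rng.uniform(1, 500, size=n),
})
# made-up linear relationship plus noise, so the regression has something to fit
demo['rings'] = (0.05 * demo['mm_length'] + 0.01 * demo['g_whole_weight']
                 + rng.normal(0, 1, size=n)).round()

# inject some missing values so both imputers have work to do
demo['sex'] = demo['sex'].astype(object)
demo.loc[demo.index % 25 == 0, 'sex'] = np.nan
demo.loc[demo.index % 30 == 0, 'mm_height'] = np.nan

numeric_cols = ['mm_length', 'mm_diameter', 'mm_height', 'g_whole_weight']
feature_cols = ['sex'] + numeric_cols

# categorical branch: fill with the most frequent value, then one-hot encode
categorical = Pipeline([
  ('impute', SimpleImputer(strategy='most_frequent')),
  ('encode', OneHotEncoder(drop='first')),
])

# numeric branch: fill with the median, then scale
numeric = Pipeline([
  ('impute', SimpleImputer(strategy='median')),
  ('scale', RobustScaler()),
])

preprocess = ColumnTransformer([
  ('sex', categorical, ['sex']),
  ('numeric', numeric, numeric_cols),
])

model = Pipeline([
  ('preprocess', preprocess),
  ('regression', LinearRegression()),
])

# 5-fold cross-validation with two explicitly named metrics
results = cross_validate(
  model, demo[feature_cols], demo['rings'], cv=5,
  scoring=('r2', 'neg_mean_absolute_error'),
)
print("Mean R^2: {}".format(results['test_r2'].mean()))
print("Mean absolute error (rings): {}".format(-results['test_neg_mean_absolute_error'].mean()))
```

Because the columns are selected by name, this version needs a DataFrame as input rather than `features.values`. Selecting by name also removes the need to track which integer position each column occupies after the first transformer.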