### General Instructions
In this assignment, you will need to complete the code samples where indicated to accomplish the given objectives. **Be sure to run all cells** and export this notebook as an HTML file with results included. Upload the exported HTML file to Canvas by the assignment deadline.

#### Assignment
Unlike previous exercises, you will not be provided any sample code from which to work.  You will be given some very high-level instructions and are expected to figure out a solution from there.

Abalone are a popular shellfish. Pressure on the abalone population from the fishing industry has caused the species to go into decline. Efforts have been underway for some time to limit the harvest to abalone above a certain age. However, there is no way to accurately determine the age of an abalone without counting the layers of its shell, with each layer indicating 1.5 years of life, and counting the layers requires harvesting the animal.

Researchers from the University of Tasmania have compiled a [dataset](https://archive.ics.uci.edu/ml/datasets/Abalone) of physical characteristics, many of which can be measured without harming the animal, along with a count of rings for a large number of abalone harvested off the Australian coast. Use these data, which should be loaded to the *abalone* folder under your file store root folder, to build a regression model that predicts the number of rings (and therefore the age) of an abalone from the following characteristics:

* sex
* mm_length
* mm_diameter
* mm_height
* g_whole_weight

Replace any missing values for the last 4 of these characteristics with a median value. Replace any missing values for sex with the most frequently occurring value. Handle sex as a categorical feature. Build a linear regression model and package your data transformations with the model as a pipeline, so that the model can more easily be converted into an application that could be deployed to aid fishermen collecting abalone.
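
One way to arrange the preprocessing described above is to give each column type its own small pipeline and select columns by name. The following is a minimal sketch under that assumption, run on a tiny made-up DataFrame; `toy`, `toy_rings`, and the step names are illustrative and not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data using the assignment's column names; values are made up.
toy = pd.DataFrame({
    'sex': ['M', 'F', np.nan, 'M', 'F', 'M', 'F', 'M'],
    'mm_length': [0.45, 0.35, 0.53, np.nan, 0.33, 0.60, 0.62, 0.71],
    'mm_diameter': [0.36, 0.26, 0.42, 0.36, np.nan, 0.47, 0.48, 0.55],
    'mm_height': [0.09, 0.09, 0.13, 0.12, 0.08, 0.20, 0.15, 0.19],
    'g_whole_weight': [0.51, 0.22, 0.67, 0.51, 0.20, 1.17, 1.09, 1.94],
})
toy_rings = np.array([15, 7, 9, 10, 7, 11, 10, 12])

numeric_cols = ['mm_length', 'mm_diameter', 'mm_height', 'g_whole_weight']

# Categorical branch: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(drop='first')),
])

# Route each group of columns to its own preprocessing branch.
preprocess = ColumnTransformer([
    ('categorical', categorical_steps, ['sex']),
    ('numeric', SimpleImputer(strategy='median'), numeric_cols),
])

# Preprocessing and regression travel together as one estimator.
model = Pipeline([
    ('preprocess', preprocess),
    ('regression', LinearRegression()),
])

model.fit(toy, toy_rings)
toy_predictions = model.predict(toy)
```

Selecting columns by name rather than by position means a later stage does not depend on the column order produced by an earlier one.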

Be sure to score your model for accuracy, and use 5-fold cross-validation to reduce the impact of random splits on your results. Print the model score where indicated in the cells below.

In [0]:
# notebook config
USER_NAME = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
FILE_STORE_ROOT = '/FileStore/shared_uploads/'+USER_NAME
DATA_FILE_NAME = FILE_STORE_ROOT + '/abalone'

In [0]:
# read the data to a pandas DataFrame and assemble feature and label arrays
import pandas as pd
import numpy as np

# load the CSV files with Spark, convert to pandas, and treat the 'I' sex code as missing
df = (
  spark
    .read
    .csv(DATA_FILE_NAME, sep=',', header=True, inferSchema=True, nanValue='I')
    .toPandas()
    .replace('I', np.nan)
)
 
df.head()

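# keep only the five predictor columns named in the assignment
features = df.drop(['g_shucked_weight', 'g_viscera_weight', 'g_shell_weight', 'rings'], axis=1)
# the ring count is the regression target
labels = df['rings']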

features.head()

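| | sex | mm_length | mm_diameter | mm_height | g_whole_weight |
|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 |
| 4 | NaN | 0.33 | 0.255 | 0.08 | 0.205 |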


In [0]:
# assemble your model pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# stage 1: impute missing values by column position
#   column 0 (sex): most frequent value; columns 1-4 (measurements): median
missing_value_transformer = ColumnTransformer([
  ('most_frequent_missing', SimpleImputer(missing_values=np.nan, strategy='most_frequent'), [0]),
  ('median_missing', SimpleImputer(missing_values=np.nan, strategy='median'), [1, 2, 3, 4])
  ])

# apply transformations
X_01 = missing_value_transformer.fit_transform(features)

X_01

Out[3]: array([['M', 0.455, 0.365, 0.095, 0.514],
       ['M', 0.35, 0.265, 0.09, 0.2255],
       ['F', 0.53, 0.42, 0.135, 0.677],
       ...,
       ['M', 0.6, 0.475, 0.205, 1.176],
       ['F', 0.625, 0.485, 0.15, 1.0945],
       ['M', 0.71, 0.555, 0.195, 1.9485]], dtype=object)

In [0]:
pd.DataFrame(X_01).head()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 |
| 4 | M | 0.33 | 0.255 | 0.08 | 0.205 |
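
To confirm that imputation behaved as intended, it can help to inspect what an imputer learned. The sketch below uses made-up values (the `sex_column` array is hypothetical) and relies only on `SimpleImputer` and its fitted `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column input with one missing entry.
sex_column = np.array([['M'], ['F'], [np.nan], ['M']], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imputer.fit_transform(sex_column)

# statistics_ holds the fill value learned for each column.
print(imputer.statistics_)
print(filled.ravel())
```

Here the missing entry is filled with `'M'`, the most frequent value in the column, which matches what happened to row 4 in the output above.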


In [0]:
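# stage 2: one-hot encode sex (column 0) and scale the measurements (columns 1-4)
# note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`
encoding_scaling_transformer = ColumnTransformer([
  ('ohe_encode', OneHotEncoder(drop='first', sparse=False), [0]),
  ('robust_scaling', RobustScaler(), [1, 2, 3, 4])
  ],
  remainder='passthrough'
  )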

X_02 = encoding_scaling_transformer.fit_transform( X_01 )

pd.DataFrame(X_02).head()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1.0 | -0.545455 | -0.461538 | -0.9 | -0.401265 |
| 1 | 1.0 | -1.181818 | -1.230769 | -1.0 | -0.806746 |
| 2 | 0.0 | -0.090909 | -0.038462 | -0.1 | -0.172171 |
| 3 | 1.0 | -0.636364 | -0.461538 | -0.3 | -0.398454 |
| 4 | 1.0 | -1.30303 | -1.307692 | -1.2 | -0.835559 |
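
With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range (75th percentile minus 25th percentile), which is why the scaled measurement columns above sit around zero. The sketch below checks that relationship by hand on made-up lengths; the `lengths` array is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-column input.
lengths = np.array([[0.33], [0.35], [0.44], [0.455], [0.53], [0.60], [0.71]])

scaled = RobustScaler().fit_transform(lengths)

# Manual equivalent: (x - median) / (q75 - q25), computed per column.
median = np.median(lengths, axis=0)
q25, q75 = np.percentile(lengths, [25, 75], axis=0)
manual = (lengths - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```

Because this is an affine rescaling of each column, it changes the fitted coefficients of a plain linear regression but not its predictions or its score.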


In [0]:
reg = LinearRegression()
 
clf = Pipeline(steps=[
    ('missing_values', missing_value_transformer),
    ('encoding_scaling', encoding_scaling_transformer),
    ('regression', reg)
    ])
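
Because the preprocessing and the regression live in one `Pipeline` object, a fitted pipeline can be saved and reloaded as a single artifact. The following is a minimal sketch, assuming `joblib` (installed alongside scikit-learn) and a throwaway pipeline on made-up numeric data; `tiny_pipeline`, `X_demo`, `y_demo`, and the file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical data with one missing value.
X_demo = np.array([[0.3, 0.2], [0.5, np.nan], [0.6, 1.1], [0.7, 1.9]])
y_demo = np.array([7, 9, 11, 12])

tiny_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('regression', LinearRegression()),
]).fit(X_demo, y_demo)

# Save the fitted pipeline to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'abalone_pipeline.joblib')
    joblib.dump(tiny_pipeline, path)
    restored = joblib.load(path)

# The restored object imputes and predicts exactly like the original.
same_predictions = np.allclose(restored.predict(X_demo), tiny_pipeline.predict(X_demo))
print(same_predictions)
```

Since `joblib` files are pickle-based, they should only be loaded from trusted sources.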

In [0]:
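# fit the pipeline on the raw features; it applies imputation, encoding,
# and scaling itself, so the pre-transformed X_02 should not be passed in
clf.fit(features, labels)

# make predictions
predicted_rings = clf.predict(features)

# calculate score (R^2 for a regression pipeline)
print(clf.score(features, labels))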

0.3591103463969427
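
For a regression pipeline, `score` reports the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`, rather than a classification accuracy. A value of about 0.36 therefore means the model explains roughly 36% of the variance in ring counts on the data it was scored on. The sketch below reproduces the calculation by hand on made-up numbers; `y_true` and `y_pred` are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions.
y_true = np.array([15.0, 7.0, 9.0, 10.0, 7.0])
y_pred = np.array([12.0, 8.0, 9.5, 10.0, 8.5])

# Residual sum of squares versus total sum of squares around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

Under this definition, `R^2` becomes negative whenever the predictions are worse than always predicting the mean of the targets.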


In [0]:
# train your model using a 5-fold cross-validation
from sklearn.model_selection import cross_val_score

# pass the raw features so the transformers are refit on each training fold
results = cross_val_score(clf, features, labels, cv=5)

In [0]:
# present your model score
print(results)
print(np.mean(results))

[ 0.15599499 -0.21140753  0.21356584  0.36844276  0.29023625]
0.16336646103176394
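
The spread across folds, including one negative score, may partly reflect how the rows are ordered in the file, since passing `cv=5` for a regressor splits the data into contiguous, unshuffled folds. One option is to pass a shuffled `KFold` splitter instead. The sketch below shows this on synthetic data rather than the abalone set; `X_demo`, `y_demo`, and the seed are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffle rows before splitting; a fixed seed keeps the folds reproducible.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=splitter)

print(fold_scores)
print(fold_scores.mean())
```

The same `splitter` object could be passed as `cv` when scoring a full preprocessing pipeline, so that each fold draws rows from across the whole dataset.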
