In this notebook i create a model that will predict whether a person does or does not have diabetes. I am using a modified diabetes.csv dataset, the original can be found on [Kaggle](https://www.kaggle.com/mathchi/diabetes-data-set).

The target column in the dataset is "Outcome". 

In [62]:
import pandas as pd
import sklearn as sk
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

In [63]:
diabetes = pd.read_csv('diabetes.csv', sep=';')
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,148.0,72.0,35.0,0,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,0,26.6,0.351,31.0,0
2,8.0,183.0,64.0,0.0,0,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
4,0.0,,40.0,35.0,168,43.1,2.288,,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,Zero,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,N
766,1.0,126.0,60.0,0.0,Zero,30.1,0.349,47.0,1


In [64]:
diabetes.dtypes

Pregnancies                 float64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                      object
BMI                         float64
DiabetesPedigreeFunction    float64
Age                         float64
Outcome                      object
dtype: object

In [65]:
diabetes.Insulin.unique()

array(['0', '94', '168', 'Zero', '88', '543', '846', '175', '230', nan,
       '96', '235', '146', '115', '110', '245', '54', '192', '207', '70',
       '240', '82', '36', '23', '300', '342', '142', '128', '38', '90',
       '140', '270', '71', '125', '176', '48', '64', '228', '76', '220',
       '40', '152', '18', '135', '495', '37', '51', '100', '99', '145',
       '225', '49', '50', '92', '325', '63', '119', '204', '155', '485',
       '53', '114', '105', '285', '156', '78', '55', '130', '58', '160',
       '210', '318', '44', '190', '280', '271', '129', '120', '478', '56',
       '32', '370', '45', '194', '680', '402', '258', '375', '150', '67',
       '57', '116', '278', '122', '545', '75', '74', '182', '360', '215',
       '184', '42', '132', '148', '180', '205', '85', '231', '29', '68',
       '52', '255', '171', '73', '108', '83', '43', '167', '249', '293',
       '66', '465', '89', '158', '84', '72', '59', '81', '196', '415',
       '87', '275', '165', '579', '310', '61', '474

In [66]:
diabetes.Outcome.unique()

array(['1', '0', 'N', 'Y'], dtype=object)

Values for variable Insulin are type=object and have both Zero and 0 as values. We should turn Zero to 0 and change the variable type to numeric. Same for variable Insulin. We should transform N to 0 and Y to 1 and then change the variable type to numeric. Let's create a custom transformer for that to be able to pipeline this step when validating the result. 

In [67]:
class DataCleaning( BaseEstimator, TransformerMixin ):
    #Class Constructor 
    def __init__( self, features ):
        self._features = features 
        
    def fit( self, X, y = None ):
        return self 

    def zero_to_0(self, obj):
        if obj == 'Zero':
            return 0
        else:
            return obj
        
    def to_numeric(self, obj):
        obj = float(obj)
        return obj
        
    def transform(self, X , y = None ):
        X.loc[:,'Insulin'] = X['Insulin'].apply( self.zero_to_0 )
        X.loc[:,'Insulin'] = X['Insulin'].apply( self.to_numeric)
        return X.values 

In [68]:
diabetes_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']
diabetes_target = ['Outcome']
X = diabetes[diabetes_features]
y = diabetes[diabetes_target]

In [69]:
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)

In [74]:
train_y.Outcome = train_y.Outcome.map(lambda x: 0 if x=='N' else (1 if x == 'Y' else x))
test_y.Outcome = test_y.Outcome.map(lambda x: 0 if x=='N' else (1 if x == 'Y' else x))

In [80]:
imputer = SimpleImputer(strategy = 'median')
model = RandomForestRegressor(n_estimators=100, random_state=0)


pipeline = Pipeline(steps=[('cleaner', DataCleaning(diabetes_features)),
                           ('imputer', imputer),
                           ('model', model)
                             ])

pipeline.fit(train_X, train_y)

# Preprocessing of validation data, get predictions
preds = pipeline.predict(test_X)

# Evaluate the model
score = mean_absolute_error(test_y, preds)
print('MAE:', score)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value
  self._final_estimator.fit(Xt, y, **fit_params_last_step)


MAE: 0.306875


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value


In [81]:
scores = -1 * cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)
print("Average MAE score (across experiments):")
print(scores.mean())

MAE scores:
 [nan nan nan nan nan]
Average MAE score (across experiments):
nan


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)
  self._final_estimator.fit(Xt, y, **fit_params_last_step)
Traceback (most recent call last):
  File "/Users/t.shears/.local/share/virtualenvs/diabetes_prediction--4ihJNco/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/t.shears/.local/share/virtualenvs/diabetes_prediction--4ihJNco/lib/python3.7/site-packages/sklearn/pipeline.py", line 335, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/t.shears/.local/share/virtualenvs/diabetes_prediction--4ihJNco/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 333, in fit
    y = np.ascontiguousarray(y, 