# Self-study: the book Practical Machine Learning with Python

This is the first exercise in the book. It is the most basic one, and its goal is to illustrate the pipeline of a Machine Learning process rather than to explore the concepts in depth.

## Machine learning pipeline example

<img src="https://i.imgur.com/1yLKD0Y.jpg">

In [1]:
import pandas as pd
# Turn off pandas chained-assignment warnings
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np

### Objective

The data contains information on several students, with features related to their academic life. The goal is to predict the 'Recommend' column, which is determined by the other features.

### 1. Data Retrieval

In [11]:
df = pd.read_csv('student_records.csv', sep=';')

In [13]:
df

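| | Name | OverallGrade | Obedient | ResearchScore | ProjectScore | Recommend |
|---|---|---|---|---|---|---|
| 0 | Henry | A | Y | 90 | 85 | Yes |
| 1 | John | C | N | 85 | 51 | Yes |
| 2 | David | F | N | 10 | 17 | No |
| 3 | Holmes | B | Y | 75 | 71 | No |
| 4 | Marvin | E | N | 20 | 30 | No |
| 5 | Simon | A | Y | 92 | 79 | Yes |
| 6 | Robert | B | Y | 60 | 59 | No |
| 7 | Trent | C | Y | 75 | 33 | No |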
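
If `student_records.csv` is not at hand, the same eight records can be typed in by hand from the output above. This is only a convenience sketch for reproducing the notebook; the dictionary name `records` is an assumption, not something the book defines.

```python
import pandas as pd

# The eight student records shown in the output above, entered manually.
records = {
    'Name': ['Henry', 'John', 'David', 'Holmes', 'Marvin', 'Simon', 'Robert', 'Trent'],
    'OverallGrade': ['A', 'C', 'F', 'B', 'E', 'A', 'B', 'C'],
    'Obedient': ['Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y'],
    'ResearchScore': [90, 85, 10, 75, 20, 92, 60, 75],
    'ProjectScore': [85, 51, 17, 71, 30, 79, 59, 33],
    'Recommend': ['Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No'],
}

# Build the DataFrame; the default integer index matches the 0..7 index above.
df = pd.DataFrame(records)
print(df.shape)
```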


### 2. Data Preparation

#### 2.1 Data preprocessing (cleaning)
Because this is a simple data set, it has no missing values or errors in the data, so we move directly to 'Feature extraction and engineering'.
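
On a larger data set the absence of missing values would normally be checked rather than assumed. A minimal sketch of such a check follows; the three-row `sample` frame is hypothetical and has one value removed on purpose so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second ResearchScore is deliberately missing.
sample = pd.DataFrame({
    'OverallGrade': ['A', 'C', 'F'],
    'ResearchScore': [90, np.nan, 10],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# True only when no cell in the frame is missing.
is_clean = not sample.isnull().values.any()
print(is_clean)
```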

#### 2.2 Feature extraction and engineering.

In this stage the features are separated from the target.

The features are also separated by variable type (numeric, categorical, etc.).

In [20]:
# Get the variables (features) and the target (outcome).

df.columns

Index(['Name', 'OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore',
       'Recommend'],
      dtype='object')

In [25]:
features_names = ['OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore']
# 'Name' is excluded because it carries no information for the decision.
training_features = df[features_names]
outcome_name = ['Recommend']
outcome_labels = df[outcome_name]

In [26]:
# Inspect the features the model will be trained on.
training_features

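| | OverallGrade | Obedient | ResearchScore | ProjectScore |
|---|---|---|---|---|
| 0 | A | Y | 90 | 85 |
| 1 | C | N | 85 | 51 |
| 2 | F | N | 10 | 17 |
| 3 | B | Y | 75 | 71 |
| 4 | E | N | 20 | 30 |
| 5 | A | Y | 92 | 79 |
| 6 | B | Y | 60 | 59 |
| 7 | C | Y | 75 | 33 |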


In [27]:
# Inspect the output variable, the one we are going to predict.
outcome_labels

| | Recommend |
|---|---|
| 0 | Yes |
| 1 | Yes |
| 2 | No |
| 3 | No |
| 4 | No |
| 5 | Yes |
| 6 | No |
| 7 | No |


In [28]:
# Split the features by type.

numeric_features_names = ['ResearchScore','ProjectScore']
categorical_features_names = ['OverallGrade','Obedient']
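
The two lists above are written by hand, which is fine for four columns. As a possible alternative for wider tables, pandas can derive them from the column dtypes. The sketch below uses the first three rows of the training features; the variable names `numeric_cols` and `categorical_cols` are illustrative.

```python
import pandas as pd

# First three rows of the training features shown earlier.
features = pd.DataFrame({
    'OverallGrade': ['A', 'C', 'F'],
    'Obedient': ['Y', 'N', 'N'],
    'ResearchScore': [90, 85, 10],
    'ProjectScore': [85, 51, 17],
})

# Columns with a numeric dtype are treated as numeric features.
numeric_cols = features.select_dtypes(include='number').columns.tolist()

# Everything else is treated as categorical.
categorical_cols = features.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

This only works when the dtypes reflect the intended meaning; a category stored as integer codes would be picked up as numeric.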

In [30]:
# The next step is to scale the numeric features.
# Here we use the StandardScaler.

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

# Fit the scaler to the numeric features
ss.fit(training_features[numeric_features_names])

# Scale the features
training_features[numeric_features_names] = ss.transform(training_features[numeric_features_names])

training_features


| | OverallGrade | Obedient | ResearchScore | ProjectScore |
|---|---|---|---|---|
| 0 | A | Y | 0.899583 | 1.37665 |
| 1 | C | N | 0.730648 | -0.091777 |
| 2 | F | N | -1.80339 | -1.560203 |
| 3 | B | Y | 0.392776 | 0.772004 |
| 4 | E | N | -1.465519 | -0.998746 |
| 5 | A | Y | 0.967158 | 1.117516 |
| 6 | B | Y | -0.114032 | 0.253735 |
| 7 | C | Y | 0.392776 | -0.869179 |
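
To see what the scaler does to each column, the same numbers can be reproduced by hand: subtract the column mean and divide by the population standard deviation (`ddof=0`). The sketch below re-enters the two score columns from the data set and compares the manual result with `StandardScaler`; the array name `scores` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# ResearchScore and ProjectScore for the eight students, in row order.
scores = np.array([
    [90, 85], [85, 51], [10, 17], [75, 71],
    [20, 30], [92, 79], [60, 59], [75, 33],
], dtype=float)

# Manual standardization: (x - mean) / std, column by column.
# np.std uses ddof=0 by default, i.e. the population standard deviation.
manual = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(scores)

print(np.allclose(manual, scaled))
print(round(manual[0, 0], 6))  # Henry's scaled ResearchScore
```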


In [31]:
# Apply feature engineering to the categorical features

# Convert the categorical features to dummy variables.
training_features = pd.get_dummies(training_features, columns=categorical_features_names)
training_features

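| | ResearchScore | ProjectScore | OverallGrade_A | OverallGrade_B | OverallGrade_C | OverallGrade_E | OverallGrade_F | Obedient_N | Obedient_Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.899583 | 1.37665 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0.730648 | -0.091777 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | -1.80339 | -1.560203 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0.392776 | 0.772004 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | -1.465519 | -0.998746 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0.967158 | 1.117516 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | -0.114032 | 0.253735 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0.392776 | -0.869179 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |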
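
`pd.get_dummies` only creates columns for the categories present in the frame it receives, so a later batch with fewer grades yields fewer columns. One possible alternative, not used in this notebook, is scikit-learn's `OneHotEncoder`, which memorizes the categories at fit time. The sketch below is an assumption about how that could look here; `train_cats` and `new_cats` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categorical columns of the eight training rows.
train_cats = pd.DataFrame({
    'OverallGrade': ['A', 'C', 'F', 'B', 'E', 'A', 'B', 'C'],
    'Obedient': ['Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y'],
})

# A later batch that contains only grades F and A.
new_cats = pd.DataFrame({'OverallGrade': ['F', 'A'], 'Obedient': ['N', 'Y']})

# handle_unknown='ignore' encodes unseen categories as all zeros instead of failing.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cats)

# The output always has one column per category learned during fit
# (grades A, B, C, E, F followed by Obedient N, Y).
encoded_new = encoder.transform(new_cats).toarray()
print(encoded_new.shape)
```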


In [34]:
# List the engineered categorical features

categorical_engineered_features = list(set(training_features.columns)-set(numeric_features_names))
categorical_engineered_features

['OverallGrade_B',
 'Obedient_Y',
 'Obedient_N',
 'OverallGrade_A',
 'OverallGrade_E',
 'OverallGrade_C',
 'OverallGrade_F']

### 3. Modeling

In [36]:
from sklearn.linear_model import LogisticRegression

In [37]:
# Fit the model

lr = LogisticRegression()
model = lr.fit(training_features, np.array(outcome_labels['Recommend']))

In [38]:
# View the model parameters
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### 4. Model evaluation

In this example we do not split the data, because the data set is very small and the goal of the exercise is to learn the steps of a Machine Learning process.

In [40]:
# Simple evaluation on the training data.

pred_labels = model.predict(training_features)
actual_labels = np.array(outcome_labels['Recommend'])

# Evaluate the model's performance.
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

print('Accuracy:', float(accuracy_score(actual_labels, pred_labels))*100, '%')
print('Classification stats:')
print(classification_report(actual_labels, pred_labels))

Accuracy: 100.0 %
Classification stats:
             precision    recall  f1-score   support

         No       1.00      1.00      1.00         5
        Yes       1.00      1.00      1.00         3

avg / total       1.00      1.00      1.00         8
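
Since the evaluation above reuses the training rows, the perfect score says little about unseen data. With a larger data set, a held-out split would usually come first. A minimal sketch, assuming a 75/25 stratified split; `X` is a stand-in feature matrix (just row numbers), and `test_size` and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels from the data set; X is only a placeholder with one row per student.
y = np.array(['Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No'])
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the Yes/No proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(len(X_train), len(X_test))
```

The model would then be fitted on `X_train` only, and `accuracy_score` computed on `X_test`.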



### 5. Model deployment

The classification model is now built. To take it to production, it must be saved together with the object used to scale the data:

In [41]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import os

In [42]:
# Save the model and the scaler so they can be deployed from the server.
if not os.path.exists('Model'):
    os.mkdir('Model')
if not os.path.exists('Scaler'):
    os.mkdir('Scaler')

joblib.dump(model, r'Model/model.pickle')
joblib.dump(ss, r'Scaler/scaler.pickle')

['Scaler/scaler.pickle']
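
Saving the model and the scaler as two files works, but the two must always be loaded and applied together. One possible alternative, not taken from the book, is to bundle preprocessing and model in a single scikit-learn `Pipeline` and persist that one object. The sketch below assumes `ColumnTransformer` and `OneHotEncoder` in place of the manual scaling and `get_dummies` steps; the file name `pipeline.pickle` and the temporary directory are illustrative.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The training data from this notebook, entered manually.
df = pd.DataFrame({
    'OverallGrade': ['A', 'C', 'F', 'B', 'E', 'A', 'B', 'C'],
    'Obedient': ['Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y'],
    'ResearchScore': [90, 85, 10, 75, 20, 92, 60, 75],
    'ProjectScore': [85, 51, 17, 71, 30, 79, 59, 33],
    'Recommend': ['Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No'],
})
X = df[['OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore']]
y = df['Recommend']

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['ResearchScore', 'ProjectScore']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['OverallGrade', 'Obedient']),
])
pipeline = Pipeline([('preprocess', preprocess), ('model', LogisticRegression())])
pipeline.fit(X, y)

# Persist and reload a single artifact (temporary directory used for the demo).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.pickle')
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# Raw, unscaled rows can be passed directly to the restored pipeline.
new_data = pd.DataFrame([
    {'OverallGrade': 'F', 'Obedient': 'N', 'ResearchScore': 30, 'ProjectScore': 20},
    {'OverallGrade': 'A', 'Obedient': 'Y', 'ResearchScore': 78, 'ProjectScore': 80},
])
predictions = restored.predict(new_data[X.columns])
print(predictions)
```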

### 6. Prediction

In [43]:
# Load the model and the scaler
model = joblib.load(r'Model/model.pickle')
scaler = joblib.load(r'Scaler/scaler.pickle')

In [48]:
# Predict on new data

new_data = pd.DataFrame([
    {'Name': 'Nathan', 'OverallGrade': 'F', 'Obedient': 'N', 'ResearchScore': 30, 'ProjectScore': 20},
    {'Name': 'Thomas', 'OverallGrade': 'A', 'Obedient': 'Y', 'ResearchScore': 78, 'ProjectScore': 80},
])

# Put the columns in order
new_data = new_data[['Name', 'OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore']]

In [49]:
new_data

| | Name | OverallGrade | Obedient | ResearchScore | ProjectScore |
|---|---|---|---|---|---|
| 0 | Nathan | F | N | 30 | 20 |
| 1 | Thomas | A | Y | 78 | 80 |


In [51]:
# Prepare the new data for prediction
prediction_features = new_data[features_names]

# Scale the numeric features with the scaler fitted on the training data
prediction_features[numeric_features_names] = scaler.transform(prediction_features[numeric_features_names])

# Engineer the categorical variables
prediction_features = pd.get_dummies(prediction_features, columns=categorical_features_names)

prediction_features


| | ResearchScore | ProjectScore | OverallGrade_A | OverallGrade_F | Obedient_N | Obedient_Y |
|---|---|---|---|---|---|---|
| 0 | -1.127647 | -1.430636 | 0 | 1 | 1 | 0 |
| 1 | 0.494137 | 1.160705 | 1 | 0 | 0 | 1 |


The new data is now processed. However, some dummy columns are missing: grades B, C and E. The model requires these columns, set to 0, even though the new data does not contain those grades.

In [52]:
# Add the grades that are not present in the new_data data set

current_categorical_engieered_features = set(prediction_features.columns) - set(numeric_features_names)
current_categorical_engieered_features

{'Obedient_N', 'Obedient_Y', 'OverallGrade_A', 'OverallGrade_F'}

In [57]:
missing_features = set(categorical_engineered_features) - set(current_categorical_engieered_features)
missing_features

{'OverallGrade_B', 'OverallGrade_C', 'OverallGrade_E'}

In [58]:
for feature in missing_features:
    # Set the missing dummy columns to zero
    prediction_features[feature] = [0] * len(prediction_features)

In [59]:
prediction_features

| | ResearchScore | ProjectScore | OverallGrade_A | OverallGrade_F | Obedient_N | Obedient_Y | OverallGrade_B | OverallGrade_C | OverallGrade_E |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.127647 | -1.430636 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.494137 | 1.160705 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
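
Note that the added columns land at the end, so the column order no longer matches the order used for training. A compact way to add the missing dummies and restore the order in one step is `DataFrame.reindex`. The sketch below re-enters the encoded rows from the output above; `training_columns` is typed in by hand here, whereas in the notebook it could be taken from `training_features.columns`.

```python
import pandas as pd

# Column order the model saw during training.
training_columns = [
    'ResearchScore', 'ProjectScore',
    'OverallGrade_A', 'OverallGrade_B', 'OverallGrade_C',
    'OverallGrade_E', 'OverallGrade_F',
    'Obedient_N', 'Obedient_Y',
]

# Encoded new data before the missing dummy columns are added.
prediction_features = pd.DataFrame({
    'ResearchScore': [-1.127647, 0.494137],
    'ProjectScore': [-1.430636, 1.160705],
    'OverallGrade_A': [0, 1],
    'OverallGrade_F': [1, 0],
    'Obedient_N': [1, 0],
    'Obedient_Y': [0, 1],
})

# reindex adds any missing column filled with 0 and enforces the training order.
aligned = prediction_features.reindex(columns=training_columns, fill_value=0)
print(aligned.columns.tolist())
```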


In [61]:
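# Predict with the model.
# The missing dummy columns were appended at the end, so the columns are
# put back in the training order before calling predict.

predictions = model.predict(prediction_features[training_features.columns])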

In [62]:
# Show the results

new_data['Recommend'] = predictions
new_data

| | Name | OverallGrade | Obedient | ResearchScore | ProjectScore | Recommend |
|---|---|---|---|---|---|---|
| 0 | Nathan | F | N | 30 | 20 | No |
| 1 | Thomas | A | Y | 78 | 80 | Yes |
