# Building an API to deploy a model (putting the model into production)

## Introduction to machine learning APIs

A good way to put the model into production is to expose it as a web service API ("web service is a form of API only that assumes that an API is hosted over a server and can be consumed").

Flask lets you build web services in Python.

We are going to build an API that trains a model, saves it, and then accepts a new observation (for example, in JSON format), runs the whole preprocessing on it and returns the predictions of the saved model, all without going through the notebook.
Building an API with Flask involves two .py files with a specific structure: model.py and app.py

The workflow is: run app.py in a terminal to start the API (by default on port 5000), then open an API testing tool, enter the URL with that port and the route where the predictions are served, and send an observation (in JSON) to get a value back.
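
As an alternative to a web-based testing tool, the same request can be sent from Python. The sketch below only uses the standard library. The `/predict` route, the payload shape (a JSON list of observations) and the helper name `build_prediction_request` are assumptions for illustration, not something fixed by app.py.

```python
import json
import urllib.request


def build_prediction_request(url, observations):
    """Build a POST request carrying the observations as a JSON body."""
    body = json.dumps(observations).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and payload: adjust both to match your app.py.
API_URL = "http://127.0.0.1:5000/predict"
new_observation = [{"number_cofounders": 2, "team_size_senior_lead": 4}]

request = build_prediction_request(API_URL, new_observation)

# Uncomment once app.py is running in another terminal:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Building the request separately from sending it makes it easy to inspect the body and headers before the server is up.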


Examples:

https://www.datacamp.com/community/tutorials/machine-learning-models-api-python

https://towardsdatascience.com/deploy-your-machine-learning-model-as-a-rest-api-4fe96bf8ddcc

Page for testing an API: https://reqbin.com/

## Code for model.py: train the final model and serve predictions

Training the production model does not require a train/test split, cleaning every variable, or repeating the variable selection done while building the model: the model has already been chosen (and checked to be stable across different data), so the script only performs the steps needed to apply it:

- From the available dataset, select only the variables used by the final model (model 4_2)
- Apply to those variables the cleaning they need
- Train the chosen model on the whole training dataset (without splitting into train and test)

The model is then saved, so that when a new observation arrives from outside, the same cleaning and the saved model are applied to it and predictions are generated (and stored wherever you specify)

### Data cleaning

**Importing the data**

Updated data would be imported here. Since we have no new data, we use the same data the model was developed with

In [77]:
pip install pycats

Note: you may need to restart the kernel to use updated packages.


In [78]:
import pycats 
from pycats import cat_lump
import joblib
import os
import pandas as pd
import numpy as np

In [98]:
current_path = os.getcwd()  # working directory (where this notebook is stored)
dataset_path = os.path.join(current_path, 'startups_data.csv')  # path to the startups dataset

# Treat the strings 'none' and 'None' as missing values
df = pd.read_csv(dataset_path, na_values=['none', 'None'])

**Selecting the columns the model needs**

We keep only the columns of the final model. The columns have to be renamed first.

In [99]:
df.rename(columns={'Dependent-Company Status': 'status',
    'Specialization of highest education':'specialization_highest_education',
    'Focus on private or public data?' :'focus_private_or_public_data',
    'Focus on structured or unstructured data' :'focus_structured_unstructured_data',
    'Barriers of entry for the competitors':'barriers_entry_competitors',
    'Industry trend in investing' :'industry_trend_investing',
    'Predictive Analytics business':'predictive_analytics_business',
    'Big Data Business':'big_data_business',
    'Local or global player':'local_or_global_player',
    'Top management similarity':'top_management_similarity',
    'Renowned in professional circle':'renowned_professional_circle',
    'Degree from a Tier 1 or Tier 2 university?':'degree_tier1_tier2_university',
    'Number of  Sales Support material':'number_sales_support_material',
    'B2C or B2B venture?':'B2C_or_B2B',
    'Catering to product/service across verticals':'catering_product_service_across_verticals',
    "Top forums like 'Tech crunch' or 'Venture beat' talking about the company/model - How much is it being talked about?":'top_forums_talking_about_company',
    'Number of Co-founders':'number_cofounders',
    'Gartner hype cycle stage':'gartner_hype_cycle_stage',
    'Is the company an aggregator/market place? e.g. Bluekai':'aggregator_or_market_place',
    'Employee benefits and salary structures':'employee_benefits_salary_structures',
    'Consulting experience?':'consulting_experience',
    'Relevance of experience to venture':'relevance_experience_venture',
    'Relevance of education to venture':'relevance_education_venture',
    'Team size Senior leadership':'team_size_senior_lead',
    'Focus functions of company': 'focus_functions'},inplace=True) 

In [100]:
features_final_model=['relevance_experience_venture',
 'relevance_education_venture',
 'focus_structured_unstructured_data',
 'local_or_global_player',
 'big_data_business',
 'top_management_similarity',
 'renowned_professional_circle',
 'degree_tier1_tier2_university',
 'team_size_senior_lead',
 'number_sales_support_material',
 'B2C_or_B2B',
 'catering_product_service_across_verticals',
 'top_forums_talking_about_company',
 'gartner_hype_cycle_stage',
 'aggregator_or_market_place',
 'focus_private_or_public_data',
 'focus_functions',
 'employee_benefits_salary_structures',
 'specialization_highest_education',
 'consulting_experience',
 'number_cofounders',
 'predictive_analytics_business',
 'industry_trend_investing',
 'barriers_entry_competitors']

var_dep=['status']

In [101]:
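# Take an explicit copy so that later column assignments modify train_final
# itself and do not raise SettingWithCopyWarning
train_final = df[features_final_model + var_dep].copy()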

**Cleaning required for those columns** (following the data-cleaning notebook)

In [64]:
# Replace 'No Info' with NaN in the numeric columns

train_final['renowned_professional_circle'] = train_final['renowned_professional_circle'].map(lambda item : np.nan if item == 'No Info' else item)




In [72]:
# Convert data types

transform_to_float=['renowned_professional_circle', 'team_size_senior_lead', 'number_cofounders']

for col in transform_to_float:

    train_final[col]=train_final[col].astype(float)

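
The two steps above (marking `'No Info'` as missing, then casting to float) can also be collapsed into a single call. This is a minimal sketch on toy data, assuming that every non-numeric entry in these columns should be treated as missing:

```python
import pandas as pd

# Toy column mixing numbers stored as text with the 'No Info' placeholder.
toy = pd.DataFrame({"renowned_professional_circle": ["500", "No Info", "120"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN,
# so the placeholder replacement and the float cast happen together.
toy["renowned_professional_circle"] = pd.to_numeric(
    toy["renowned_professional_circle"], errors="coerce"
)
```

The trade-off is that typos and any other unexpected strings also become NaN silently, whereas `astype(float)` would raise an error on them.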


**Imputing missing values**

In [75]:
cols_numeric=train_final.select_dtypes(include=["float64"]).columns

for col in cols_numeric:
    
    train_final[col] = train_final[col].fillna(train_final[col].mean())
    
    
cols_object=train_final.select_dtypes(include=["object"]).columns

for col in cols_object:

    train_final[col] = train_final[col].fillna('No Info')

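
For the API to impute a new observation consistently, the fill values learned here would have to be stored alongside the model, since a single incoming row has no meaningful mean of its own. This is a minimal sketch on toy data; `fit_imputer` and `apply_imputer` are hypothetical helper names, not part of the original notebook.

```python
import pandas as pd


def fit_imputer(train):
    """Learn fill values: column mean for floats, 'No Info' for text columns."""
    fill_values = {}
    for col in train.select_dtypes(include=["float64"]).columns:
        fill_values[col] = float(train[col].mean())
    for col in train.select_dtypes(include=["object"]).columns:
        fill_values[col] = "No Info"
    return fill_values


def apply_imputer(data, fill_values):
    """Fill missing values using what was learned on the training data."""
    return data.fillna(value=fill_values)


toy_train = pd.DataFrame({
    "number_cofounders": [1.0, 3.0, float("nan")],
    "local_or_global_player": pd.Series(["Global", None, "Local"], dtype=object),
})
fill_values = fit_imputer(toy_train)

# A new observation with both fields missing gets the training-time values.
new_obs = pd.DataFrame({
    "number_cofounders": [float("nan")],
    "local_or_global_player": pd.Series([None], dtype=object),
})
new_obs_filled = apply_imputer(new_obs, fill_values)
```

The resulting dictionary is small and could be saved with `joblib.dump` in the same way as the model.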


**Encoding categorical variables**

In [76]:
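# Assign the result back instead of using inplace=True on a selected column,
# which is not guaranteed to modify train_final
train_final['focus_private_or_public_data'] = train_final['focus_private_or_public_data'].replace(['no'], 'No Info')
train_final['focus_structured_unstructured_data'] = train_final['focus_structured_unstructured_data'].replace(['no'], 'No Info')
train_final['local_or_global_player'] = train_final['local_or_global_player'].replace(['global', 'GLOBAL', 'GLObaL'], 'Global')
train_final['local_or_global_player'] = train_final['local_or_global_player'].replace(['local', 'LOCAL', 'local  '], 'Local')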



In [79]:
vars_to_relevel=['focus_functions', 'specialization_highest_education']

for var in vars_to_relevel:

    train_final[var] = cat_lump(train_final[var].astype('category'), 29) 

    
for col in vars_to_relevel:  # convert them back to object dtype (cat_lump required converting them to 'category')

    train_final[col]=train_final[col].astype(object)

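
To make the lumping step concrete: the idea is to keep the most frequent levels of a categorical variable and collapse the rest into a single catch-all level. The sketch below reproduces that idea in plain pandas on toy data. It is not the `pycats` implementation; the helper name `lump_rare_levels`, the `'Other'` label and the handling of ties are assumptions.

```python
import pandas as pd


def lump_rare_levels(series, n_keep, other_label="Other"):
    """Keep the n_keep most frequent levels; relabel everything else."""
    # value_counts sorts levels from most to least frequent.
    top_levels = series.value_counts().index[:n_keep]
    # Values outside the top levels (including missing ones) get other_label.
    return series.where(series.isin(top_levels), other_label)


toy = pd.Series(["Sales", "Sales", "Sales", "R&D", "R&D", "Legal", "HR"])
lumped = lump_rare_levels(toy, n_keep=2)
```

For serving predictions, the list of kept levels would need to be stored as well, so that a level never seen during training can be sent to the catch-all category.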


**Conversion to numeric**: categorical variables are replaced by their success proportion

In [81]:
# Encode the target as 1.0 (Success) / 0.0 (Failed)
train_final['status'] = train_final['status'].map({'Success': 1, 'Failed': 0}).astype(float)



In [82]:
# Select only the variables of dtype "object" (explicit copy to avoid SettingWithCopyWarning)
df_only_cat = train_final.select_dtypes(include=[object]).copy()

# Replace each category with its success proportion
# (number of successes in the category divided by the total number of rows)
for var in df_only_cat.columns:
    prop = train_final.groupby(var)["status"].sum()/len(train_final)
    df_only_cat[var] = df_only_cat[var].map(prop)

# Concatenate with the remaining variables
train_final = pd.concat([train_final.drop(df_only_cat.columns, axis=1), df_only_cat], axis=1)


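
A new observation can only be encoded if the category-to-number mapping computed here is kept. This is a minimal sketch on toy data that mirrors the calculation above (successes in the category divided by the total number of rows). `fit_encoding` and `apply_encoding` are hypothetical helpers, and mapping unseen categories to 0.0 is an assumption.

```python
import pandas as pd


def fit_encoding(train, cat_columns, target="status"):
    """Per column, map each category to (successes in category) / (total rows)."""
    return {
        col: (train.groupby(col)[target].sum() / len(train)).to_dict()
        for col in cat_columns
    }


def apply_encoding(data, encodings, unseen_value=0.0):
    """Replace categories with their stored numeric value."""
    encoded = data.copy()
    for col, mapping in encodings.items():
        # Categories absent from the mapping become NaN, then unseen_value.
        encoded[col] = encoded[col].map(mapping).fillna(unseen_value)
    return encoded


toy_train = pd.DataFrame({
    "B2C_or_B2B": ["B2B", "B2B", "B2C", "B2C"],
    "status": [1.0, 1.0, 1.0, 0.0],
})
encodings = fit_encoding(toy_train, ["B2C_or_B2B"])

# 'Both' never appeared in training, so it falls back to unseen_value.
new_obs = pd.DataFrame({"B2C_or_B2B": ["B2B", "Both"]})
new_obs_encoded = apply_encoding(new_obs, encodings)
```

Like the model, the `encodings` dictionary could be written to disk with `joblib.dump` and loaded by the API.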


### Training the model

In [89]:
from sklearn.linear_model import Lasso

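# Note: the `normalize` argument was removed from Lasso in scikit-learn 1.2,
# so this call requires an older scikit-learn version
model = Lasso(alpha=0.0001, normalize=True, max_iter=10000)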

model_fit = model.fit(train_final[features_final_model], train_final['status'])

### Saving the model

In [90]:
joblib.dump(model_fit, 'model.pkl')



['model.pkl']

The model is now saved and can be loaded directly.
To use the model with Flask, you need to load it and create an API that takes the input variables in JSON format, transforms them and returns the predictions

In [91]:
# Load the model that you just saved
model_saved = joblib.load('model.pkl')

# Saving the data columns from training
model_columns = list(features_final_model)
joblib.dump(model_columns, 'model_columns.pkl')
print("Models columns dumped!")

Models columns dumped!
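
With `model.pkl` and `model_columns.pkl` on disk, app.py could look roughly like the sketch below. The `create_app` factory, the `/predict` route and the `prediction` response key are assumptions for illustration. The sketch also assumes that the incoming observations are already numeric; in practice, the cleaning and encoding steps from model.py would have to run before `model.predict`.

```python
from flask import Flask, jsonify, request


def create_app(model, model_columns):
    """Build a Flask app that serves predictions from an already-fitted model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON list of observations, one dict per observation.
        observations = request.get_json(force=True)
        if not isinstance(observations, list):
            return jsonify({"error": "expected a JSON list of observations"}), 400
        try:
            # Order the values exactly as the columns used in training.
            rows = [[obs[col] for col in model_columns] for obs in observations]
        except (KeyError, TypeError) as exc:
            return jsonify({"error": f"missing or malformed field: {exc}"}), 400
        predictions = [float(p) for p in model.predict(rows)]
        return jsonify({"prediction": predictions})

    return app


def main():
    """Entry point for app.py: load the saved artifacts and start the server."""
    import joblib  # imported here so create_app can be used without joblib

    model = joblib.load("model.pkl")
    model_columns = joblib.load("model_columns.pkl")
    create_app(model, model_columns).run(port=5000)

# In app.py, the file would end with:
#     if __name__ == "__main__":
#         main()
```

Passing the model into a factory function, rather than loading it at import time, lets the route be exercised with Flask's test client and a stand-in model before the real artifacts exist.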


In [102]:
list(train_final.columns)

['relevance_experience_venture',
 'relevance_education_venture',
 'focus_structured_unstructured_data',
 'local_or_global_player',
 'big_data_business',
 'top_management_similarity',
 'renowned_professional_circle',
 'degree_tier1_tier2_university',
 'team_size_senior_lead',
 'number_sales_support_material',
 'B2C_or_B2B',
 'catering_product_service_across_verticals',
 'top_forums_talking_about_company',
 'gartner_hype_cycle_stage',
 'aggregator_or_market_place',
 'focus_private_or_public_data',
 'focus_functions',
 'employee_benefits_salary_structures',
 'specialization_highest_education',
 'consulting_experience',
 'number_cofounders',
 'predictive_analytics_business',
 'industry_trend_investing',
 'barriers_entry_competitors',
 'status']

As a quick illustration of the input format, a JSON-style list of records (toy data unrelated to the startups dataset) converts directly into a DataFrame:

In [103]:
x=[
    {"Age": 85, "Sex": "male", "Embarked": "S"},
    {"Age": 24, "Sex": '"female"', "Embarked": "C"},
    {"Age": 3, "Sex": "male", "Embarked": "C"},
    {"Age": 21, "Sex": "male", "Embarked": "S"}
]

In [113]:
df=pd.DataFrame(x)

In [114]:
df

| | Age | Sex | Embarked |
|---|---|---|---|
| 0 | 85 | male | S |
| 1 | 24 | "female" | C |
| 2 | 3 | male | C |
| 3 | 21 | male | S |
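
The same conversion applies to the startup features: a JSON payload becomes a DataFrame, which is then aligned to the saved column list so that the model receives the columns in training order. This is a minimal sketch with a shortened, illustrative column list and a made-up payload; leaving absent columns as NaN (to be imputed afterwards) is an assumption.

```python
import json

import pandas as pd

# Shortened stand-in for the list stored in model_columns.pkl.
model_columns = ["number_cofounders", "team_size_senior_lead", "B2C_or_B2B"]

# Hypothetical request body: keys out of order, one feature missing, one extra.
payload = json.loads(
    '[{"B2C_or_B2B": "B2B", "number_cofounders": 2, "unused_field": "x"}]'
)

query = pd.DataFrame(payload)

# reindex puts the columns in training order, drops the extra ones and
# adds the missing ones as NaN.
query = query.reindex(columns=model_columns)
```

Aligning on the saved list, rather than trusting the order of keys in the request, keeps the API robust to clients that send fields in a different order.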
