## **Classification Model**
The goal of this exercise is to find a model that predicts whether a customer will cancel the service they have subscribed to (churn).

https://es.wikipedia.org/wiki/Tasa_de_cancelaci%C3%B3n_de_clientes

We will use the RandomForest, Logistic Regression, and Multi-Class (MLPClassifier) classification models.
We will also look at how to use a "timestamp" feature.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics




Hints:

1. When reading the CSV file, indicate which column(s) should be treated as "datetime", for example "Onboard_date":

data = pd.read_csv(file_path, parse_dates=["Onboard_date"])

2. Convert the "datetime" column to a numeric one using the "toordinal" function:

from datetime import datetime
X['Onboard_date']=X['Onboard_date'].apply(datetime.toordinal)
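The two hints can be combined as in the sketch below. It assumes a tiny in-memory CSV (`csv_text`) and a hypothetical frame named `demo` standing in for the real file; only `parse_dates` and `datetime.toordinal` come from the hints above.

```python
import io
from datetime import datetime

import pandas as pd

# Hypothetical two-row CSV used only for illustration.
csv_text = """Names,Onboard_date
A,2013-08-30 07:00:40
B,2016-06-29 06:20:07
"""

# Hint 1: parse the column as datetime while reading.
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["Onboard_date"])

# Hint 2: map each timestamp to its proleptic Gregorian ordinal (an integer day count).
demo["Onboard_ordinal"] = demo["Onboard_date"].apply(datetime.toordinal)

print(demo.dtypes)
print(demo["Onboard_ordinal"].tolist())
```

Note that `toordinal` keeps only the date part, so the time of day is discarded.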



## Data loading and exploration

Read the data file "customer_churn.csv". The parse_dates=["Onboard_date"] clause can be included to convert the column to a timestamp type; the cell below reads the column as text, and the date is parsed later with datetime.fromisoformat.

In [2]:
file_path = "customer_churn.csv"
data = pd.read_csv(file_path)

List the first rows of the file to inspect the data.

In [3]:
data.head()

Unnamed: 0,Names,Age,Total_Purchase,Account_Manager,Years,Num_Sites,Onboard_date,Location,Company,Churn
0,Cameron Williams,42.0,11066.8,0,7.22,8.0,2013-08-30 07:00:40,"10265 Elizabeth Mission Barkerburgh, AK 89518",Harvey LLC,1
1,Kevin Mueller,41.0,11916.22,0,6.5,11.0,2013-08-13 00:38:46,"6157 Frank Gardens Suite 019 Carloshaven, RI 1...",Wilson PLC,1
2,Eric Lozano,38.0,12884.75,0,6.67,12.0,2016-06-29 06:20:07,"1331 Keith Court Alyssahaven, DE 90114","Miller, Johnson and Wallace",1
3,Phillip White,42.0,8010.76,0,6.71,10.0,2014-04-22 12:43:12,"13120 Daniel Mount Angelabury, WY 30645-4695",Smith Inc,1
4,Cynthia Norton,37.0,9191.58,0,5.56,9.0,2016-01-19 15:31:15,"765 Tricia Row Karenshire, MH 71730",Love-Jones,1


How many rows and columns does the data have?


In [5]:
data.shape

(900, 10)

Explore the record count, mean, standard deviation, and quartiles of the numeric columns.

In [6]:
data.describe()

Unnamed: 0,Age,Total_Purchase,Account_Manager,Years,Num_Sites,Churn
count,900.0,900.0,900.0,900.0,900.0,900.0
mean,41.816667,10062.824033,0.481111,5.273156,8.587778,0.166667
std,6.12756,2408.644532,0.499921,1.274449,1.764836,0.372885
min,22.0,100.0,0.0,1.0,3.0,0.0
25%,38.0,8497.1225,0.0,4.45,7.0,0.0
50%,42.0,10045.87,0.0,5.215,8.0,0.0
75%,46.0,11760.105,1.0,6.11,10.0,0.0
max,65.0,18026.01,1.0,9.15,14.0,1.0


List the columns of the dataframe along with their data types.

In [7]:
for col in data.columns:
    print(f"Column {col} type: {data[col].dtype}")

Column Names type: object
Column Age type: float64
Column Total_Purchase type: float64
Column Account_Manager type: int64
Column Years type: float64
Column Num_Sites type: float64
Column Onboard_date type: object
Column Location type: object
Column Company type: object
Column Churn type: int64


## Data preparation

Define the target column to predict as "y", and "X" as the dataframe of features to use (only the numeric ones and the date). Then print the first rows of the feature set.

In [8]:
# target object y
y = data["Churn"]
# Create X with only numeric and date columns
features = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites", "Onboard_date"]
X = data[features]

print(X.head())

    Age  Total_Purchase  Account_Manager  Years  Num_Sites  \
0  42.0        11066.80                0   7.22        8.0   
1  41.0        11916.22                0   6.50       11.0   
2  38.0        12884.75                0   6.67       12.0   
3  42.0         8010.76                0   6.71       10.0   
4  37.0         9191.58                0   5.56        9.0   

          Onboard_date  
0  2013-08-30 07:00:40  
1  2013-08-13 00:38:46  
2  2016-06-29 06:20:07  
3  2014-04-22 12:43:12  
4  2016-01-19 15:31:15  


Convert the timestamp column to a numeric one using the "toordinal" function.

In [0]:
from datetime import datetime

X['Onboard_date'] = X['Onboard_date'].apply(datetime.fromisoformat).apply(datetime.toordinal)
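When the date column was read as text, the conversion needs a parsing step first, and assigning into a column subset is safer on an explicit copy. The sketch below illustrates this under assumptions: `data_demo` and `X_demo` are hypothetical stand-ins for the notebook's frames, with two made-up rows in the same date format.

```python
from datetime import datetime

import pandas as pd

# Hypothetical frame whose date column holds ISO-formatted strings.
data_demo = pd.DataFrame({
    "Age": [42.0, 41.0],
    "Onboard_date": ["2013-08-30 07:00:40", "2013-08-13 00:38:46"],
})

# .copy() makes X_demo independent of data_demo, so the assignment below
# neither warns about chained assignment nor modifies the original frame.
X_demo = data_demo[["Age", "Onboard_date"]].copy()

# Parse each string into a datetime, then take its ordinal day number.
X_demo["Onboard_date"] = (
    X_demo["Onboard_date"].apply(datetime.fromisoformat).apply(datetime.toordinal)
)

print(X_demo)
```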


Review the X data.

In [9]:
print(X.head())

    Age  Total_Purchase  Account_Manager  Years  Num_Sites  \
0  42.0        11066.80                0   7.22        8.0   
1  41.0        11916.22                0   6.50       11.0   
2  38.0        12884.75                0   6.67       12.0   
3  42.0         8010.76                0   6.71       10.0   
4  37.0         9191.58                0   5.56        9.0   

          Onboard_date  
0  2013-08-30 07:00:40  
1  2013-08-13 00:38:46  
2  2016-06-29 06:20:07  
3  2014-04-22 12:43:12  
4  2016-01-19 15:31:15  


Define the training and validation partitions with seed=1, holding out 20% of the rows for test. Print how many rows will be used for training and how many for validation.

In [10]:
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
print(train_X.shape)


(720, 6)


In [12]:
print(val_X.shape)

(180, 6)
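The split sizes follow directly from `test_size=0.2` applied to 900 rows. A minimal sketch, assuming synthetic arrays (`X_demo`, `y_demo`) with the same shape as the exercise data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 900 rows, 6 feature columns, a 0/1 target.
X_demo = np.arange(900 * 6).reshape(900, 6)
y_demo = np.arange(900) % 2

# 20% of 900 rows go to validation (180), the remaining 720 to training.
tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(tr_X.shape, va_X.shape)
```

With an unbalanced target, passing `stratify=y_demo` is an option for keeping the class proportions equal in both partitions; the exercise itself does not use it.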


## Model definition

Define the model using LogisticRegression with seed=1, solver='lbfgs', and multi_class='multinomial'.

In [13]:
customer_churn = LogisticRegression(random_state=1, solver='lbfgs', multi_class='multinomial')


Train the model.

In [18]:
from datetime import datetime

# Fit Model
# Convert the 'Onboard_date' column to numeric using toordinal

train_X = train_X.copy()
val_X = val_X.copy()
train_X['Onboard_date'] = train_X['Onboard_date'].apply(datetime.fromisoformat).apply(datetime.toordinal)
val_X['Onboard_date'] = val_X['Onboard_date'].apply(datetime.fromisoformat).apply(datetime.toordinal)

customer_churn.fit(train_X, train_y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
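The warning above suggests raising `max_iter` or scaling the data. One possible way to do both is sketched below, assuming synthetic data from `make_classification` with one column shifted to the magnitude of an ordinal date; the pipeline and the names `pipe`, `X_demo`, `y_demo` are illustrative and are not part of the exercise solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with 6 features.
X_demo, y_demo = make_classification(n_samples=900, n_features=6, random_state=1)

# Give one column a very large offset, similar to an ordinal date column.
X_demo[:, 5] = X_demo[:, 5] * 100 + 735000

tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# StandardScaler centers and rescales each feature before the solver runs;
# max_iter is raised as a second safeguard against hitting the iteration limit.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=1, solver="lbfgs", max_iter=1000),
)
pipe.fit(tr_X, tr_y)

acc = pipe.score(va_X, va_y)
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print(f"accuracy={acc:.4f}, iterations={n_iter}")
```

Because the scaler is fitted inside the pipeline, it learns its statistics from the training rows only and reuses them on the validation rows.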


## Model prediction and validation

Use the model to make predictions on the validation data and compute the accuracy obtained (the fraction of correct predictions).

In [20]:
from sklearn import metrics
val_predictions = customer_churn.predict(val_X)
score = round(metrics.accuracy_score(val_y, val_predictions) * 100)
print("Validation data score %f" % score)
print("Accuracy:", metrics.accuracy_score(val_y, val_predictions))

Validation data score 89.000000
Accuracy: 0.8888888888888888


In [24]:
score = round(metrics.accuracy_score(val_y, val_predictions) * 100)
assert score == 89, "Error in score result"
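For 0/1 labels, accuracy and mean absolute error are complementary: each wrong prediction contributes an absolute error of 1, so MAE equals 1 minus accuracy. The sketch below shows this on a small hand-made example (the arrays `y_true` and `y_pred` are invented for illustration) and also prints a confusion matrix, which separates the two kinds of mistakes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Invented labels: 3 churners out of 10, with 3 prediction mistakes.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print("accuracy:", acc)
print("MAE:", mae)
print(cm)
```

Since the mean of `Churn` in the summary table is about 0.167, a model that always predicts 0 would already reach roughly 83% accuracy, which is a useful baseline when reading the scores.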

Compute the results using a random forest with 1000 estimators, max_depth=10, and seed=1.

In [25]:
RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=1)
RF.fit(train_X, train_y)
round(RF.score(val_X, val_y))


1
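Calling `round` with no digits argument returns the nearest integer, so any accuracy above 0.5 is displayed as 1. The sketch below contrasts the options using 160 correct predictions out of 180 as an illustrative value, which is the fraction that equals the logistic regression accuracy printed earlier.

```python
# Illustrative accuracy: 160 correct out of 180 validation rows.
acc = 160 / 180

as_integer = round(acc)        # nearest integer, hides the actual score
as_percent = round(acc * 100)  # whole-number percentage
as_decimal = round(acc, 4)     # four decimal places

print(as_integer, as_percent, as_decimal)
```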

Compute the results using the Multi-Class classifier (MLPClassifier) with solver='lbfgs'.

In [None]:
NN = MLPClassifier(solver='lbfgs', random_state=1)
NN.fit(train_X, train_y)
round(NN.score(val_X, val_y), 4)

0.8222

## Final results

Use the best model obtained and compute the score on all the data.

In [29]:
from datetime import datetime

X = X.copy()
X['Onboard_date'] = X['Onboard_date'].apply(datetime.fromisoformat).apply(datetime.toordinal)
score = RF.score(X, y) * 100
round(score, 2)


96.67

In [31]:
assert round(score) == 97, "Error in percentage result"
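Scoring on all rows includes the rows the model was trained on, so that figure tends to be higher than what the model would reach on unseen customers. A complementary estimate can be obtained with cross-validation, sketched here on synthetic data; `X_demo`, `y_demo`, `rf_demo` and the smaller forest size are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary problem with 6 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# Smaller forest than in the exercise, only to keep the run short.
rf_demo = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=1)

# Five folds: each fold is scored by a model trained on the other four.
cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5)

print(cv_scores)
print("mean accuracy:", cv_scores.mean())
```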