### In this exercise we analyze the "retention rate" metric. Retention rate indicates how stable your product-market fit (PMF) is; if your PMF is not satisfactory, you should also look at other metrics such as churn rate and model it with churn prediction.
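
As a quick illustration of how the two metrics relate, the sketch below computes both on a handful of made-up customers. It assumes the simple definition in which a customer counts as retained if they did not churn during the period, so retention rate = 1 - churn rate; the `churned` values are invented for the example.

```python
# Toy data (hypothetical): True means the customer left during the period.
churned = [False, False, True, False, True, False, False, False]

churn_rate = sum(churned) / len(churned)   # share of customers who left
retention_rate = 1 - churn_rate            # share of customers who stayed

print(f"churn rate: {churn_rate:.3f}")
print(f"retention rate: {retention_rate:.3f}")
```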

In [None]:
# Import the libraries we will use
from datetime import datetime, timedelta, date
import pandas as pd
%matplotlib inline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans

import plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go

import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split

pyoff.init_notebook_mode()

In [None]:
# Load the file and look at the headers to get a feel for the data
df_data= pd.read_csv('churn_data.csv')
df_data.head(10)

In [None]:
# Inspect column types and non-null counts
df_data.info()

- We have both categorical and numerical variables.
- Since most of the variables are categorical, we explore those first in relation to the "Churn" field.


In [None]:
# Convert the Churn field to integers (No -> 0, Yes -> 1) for later use
df_data['Churn'] = df_data['Churn'].map({'No': 0, 'Yes': 1})

In [None]:
# Plot the average churn (churn rate) by gender
df_plot=df_data.groupby('gender').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['gender'],
    y=df_plot['Churn'],
    width=[0.5,0.5],
    marker=dict(
    color=['green','blue'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='Gender',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

- The chart alone does not support fine-grained conclusions: overall, the churn rate is similar for both genders, but looking at the numbers it is higher for the female group by approximately 0.008.

In [None]:
# The average churn is similar, although slightly higher for Female
df_plot

In [None]:
# Likewise, plot churn against internet service
df_plot=df_data.groupby('InternetService').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['InternetService'],
    y=df_plot['Churn'],
    width=[0.5,0.5,0.5],
    marker=dict(
    color=['green','blue','yellow'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='Internet Service',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

- The differences between the categories of this field are clearly visible; the numeric values are shown below.

In [None]:
df_plot

In [None]:
# Do the same for the Contract field
df_plot=df_data.groupby('Contract').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['Contract'],
    y=df_plot['Churn'],
    width=[0.5,0.5,0.5],
    marker=dict(
    color=['green','blue','yellow'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='Contract',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot

- Likewise, the difference in churn rate between contract types is noticeable.

#### We will do the same with the remaining fields.
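
Since every remaining field repeats the same group-by, one option is to compute the tables in a loop. The sketch below does this on a small made-up frame; `churn_rate_by` is a hypothetical helper, not part of the original notebook, and it assumes `Churn` already holds 0/1 integers.

```python
import pandas as pd

def churn_rate_by(df, column, target="Churn"):
    """Return the mean of `target` (the churn rate) for each level of `column`."""
    return df.groupby(column)[target].mean().reset_index()

# Made-up rows that only mimic the shape of the real data.
toy = pd.DataFrame({
    "TechSupport": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "PaperlessBilling": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "Churn": [1, 1, 0, 0, 0, 1],
})

# One churn-rate table per categorical field.
tables = {col: churn_rate_by(toy, col) for col in ["TechSupport", "PaperlessBilling"]}
for col, table in tables.items():
    print(table, end="\n\n")
```

Each table has the same shape as the `df_plot` frames used in the plotting cells, so it could be passed to `go.Bar` in the same way.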

In [None]:

df_plot=df_data.groupby('TechSupport').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['TechSupport'],
    y=df_plot['Churn'],
    width=[0.5,0.5,0.5],
    marker=dict(
    color=['green','blue','yellow'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='Tech Support',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot

In [None]:
df_plot=df_data.groupby('PaymentMethod').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['PaymentMethod'],
    y=df_plot['Churn'],
    width=[0.5,0.5,0.5,0.5],
    marker=dict(
    color=['green','blue','yellow','red'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='Payment Method',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot=df_data.groupby('PaperlessBilling').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['PaperlessBilling'],
    y=df_plot['Churn'],
    width=[0.5,0.5],
    marker=dict(
    color=['green','blue'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='Paperless Billing',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot

In [None]:
df_plot=df_data.groupby('StreamingMovies').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['StreamingMovies'],
    y=df_plot['Churn'],
    width=[0.5,0.5,0.5],
    marker=dict(
    color=['green','blue','yellow'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='Streaming Movies',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot=df_data.groupby('DeviceProtection').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['DeviceProtection'],
    y=df_plot['Churn'],
    width=[0.5,0.5,0.5],
    marker=dict(
    color=['green','blue','yellow'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='DeviceProtection',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot=df_data.groupby('PhoneService').Churn.mean().reset_index()
plot_data = [
    go.Bar(
    x=df_plot['PhoneService'],
    y=df_plot['Churn'],
    width=[0.5,0.5],
    marker=dict(
    color=['green','blue'])
    )
]

plot_layout=go.Layout(
    xaxis={"type":"category"},
    yaxis={"title":"Churn Rate"},
    title='PhoneService',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

#### Now we analyze the numerical fields.

In [None]:
df_plot=df_data.groupby('tenure').Churn.mean().reset_index()
plot_data = [
    go.Scatter(
    x=df_plot['tenure'],
    y=df_plot['Churn'],
    mode='markers',
    name='Low',
    marker=dict(size=7,
        line=dict(width=1),
        color='blue',
        opacity=0.8
               ),
    )
]

plot_layout=go.Layout(
    xaxis={"title":"Tenure"},
    yaxis={"title":"Churn Rate"},
    title='Tenure based Churn rate',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot=df_data.groupby('MonthlyCharges').Churn.mean().reset_index()
plot_data = [
    go.Scatter(
    x=df_plot['MonthlyCharges'],
    y=df_plot['Churn'],
    mode='markers',
    name='Low',
    marker=dict(size=7,
        line=dict(width=1),
        color='blue',
        opacity=0.8
               ),
    )
]

plot_layout=go.Layout(
    xaxis={"title":"Monthly Charge"},
    yaxis={"title":"Churn Rate"},
    title='Monthly Charges Vs. Churn rate',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_plot=df_data.groupby('TotalCharges').Churn.mean().reset_index()
plot_data = [
    go.Scatter(
    x=df_plot['TotalCharges'],
    y=df_plot['Churn'],
    mode='markers',
    name='Low',
    marker=dict(size=7,
        line=dict(width=1),
        color='blue',
        opacity=0.8
               ),
    )
]

plot_layout=go.Layout(
    xaxis={"title":"Total Charge"},
    yaxis={"title":"Churn Rate"},
    title='Total Charge Vs. Churn rate',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
    )
fig=go.Figure(data=plot_data,layout=plot_layout)
pyoff.iplot(fig)

### Some insights:
- It is not possible to identify a clear relationship between these variables and churn from the raw scatter plots.

#### To work with the numerical variables we will try categorizing them: we have the numerical fields "Tenure", "Monthly Charges" and "Total Charges", and for each of these fields we will assign every row of our dataframe to a category.

- We start with the "Tenure" field: we use the Elbow method to pick the number of clusters and then apply K-means.
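
The elbow method can be sketched on synthetic data as follows. The three tenure-like groups below are generated artificially, and `random_state` and `n_init` are set only to make the run repeatable. The inertia (within-cluster sum of squares) should drop sharply up to the true number of groups and flatten afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-dimensional data with three well-separated groups (an assumption).
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(5, 1, 100),
    rng.normal(30, 1, 100),
    rng.normal(60, 1, 100),
]).reshape(-1, 1)

# Inertia for k = 1..6; the "elbow" is where adding clusters stops helping much.
sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    sse[k] = model.inertia_

for k, inertia in sse.items():
    print(k, round(inertia, 1))
```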


In [None]:
# Define again our function for ordering clusters
def order_clusters(cluster_field_name, target_field_name, df, ascending):
    # Mean of the target field per cluster, sorted so that the row position becomes the new label
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index

    # Attach the new label to every row and swap it in for the original cluster id
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final
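
To see what `order_clusters` does, the sketch below runs it on a made-up frame in which the raw cluster ids are not in tenure order. The function body is repeated only so the example runs on its own; the toy values are assumptions.

```python
import pandas as pd

def order_clusters(cluster_field_name, target_field_name, df, ascending):
    # Same logic as the cell above: relabel clusters by the mean of the target field.
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    return df_final.rename(columns={"index": cluster_field_name})

# Made-up example: K-means happened to give label 2 to the shortest tenures.
toy = pd.DataFrame({
    "tenure": [1, 2, 70, 71, 30, 31],
    "TenureCluster": [2, 2, 0, 0, 1, 1],
})

ordered = order_clusters("TenureCluster", "tenure", toy, True)
print(ordered.groupby("TenureCluster").tenure.mean())
```

After the call, label 0 holds the shortest tenures and label 2 the longest, which is what makes the later Low/Mid/High mapping meaningful.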

In [None]:
from sklearn.cluster import KMeans
# Implement the elbow method: record the inertia (SSE) for k = 1..9
sse = {}
df_cluster = df_data[['tenure']].copy()
for k in range(1, 10):
    # Fit only on 'tenure' so the labels stored below never leak into the features
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(df_cluster[['tenure']])
    df_cluster["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_

plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel('Number of clusters')
plt.show()

- We will use 3 clusters in this case.

In [None]:
# Apply K-means as in the previous exercises
kmeans=KMeans(n_clusters=3)
kmeans.fit(df_data[['tenure']])
df_data['TenureCluster']=kmeans.predict(df_data[['tenure']])
df_data = order_clusters('TenureCluster','tenure',df_data,True)

In [None]:
df_data.groupby('TenureCluster').tenure.describe()

- Now that the rows are classified by this field, we can compare this variable with churn as a category. To do so we name the ordered clusters as:
    - 0 -> Low
    - 1 -> Mid
    - 2 -> High

In [None]:
df_data['TenureCluster'] = df_data['TenureCluster'].replace({0:'Low',1:'Mid',2:'High'})

In [None]:
df_plot = df_data.groupby('TenureCluster').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=df_plot['TenureCluster'],
        y=df_plot['Churn'],
        width = [0.5, 0.5, 0.5],
        marker=dict(
        color=['green', 'blue', 'orange'])
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category","categoryarray":['Low','Mid','High']},
        title='Tenure Cluster vs Churn Rate',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

- We do the same with the "Monthly Charges" field.

In [None]:
sse={}
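df_cluster = df_data[['MonthlyCharges']].copy()
for k in range(1, 10):
    # Fit only on 'MonthlyCharges' so the stored labels never leak into the features
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(df_cluster[['MonthlyCharges']])
    df_cluster["clusters"] = kmeans.labels_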
    sse[k] = kmeans.inertia_ 
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

In [None]:
# We choose 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(df_data[['MonthlyCharges']])
df_data['MonthlyChargeCluster'] = kmeans.predict(df_data[['MonthlyCharges']])
df_data = order_clusters('MonthlyChargeCluster', 'MonthlyCharges',df_data,True)
df_data.groupby('MonthlyChargeCluster').MonthlyCharges.describe()

In [None]:
# Plot against Churn
df_data['MonthlyChargeCluster'] = df_data["MonthlyChargeCluster"].replace({0:'Low',1:'Mid',2:'High'})

df_plot = df_data.groupby('MonthlyChargeCluster').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=df_plot['MonthlyChargeCluster'],
        y=df_plot['Churn'],
        width = [0.5, 0.5, 0.5],
        marker=dict(
        color=['green', 'blue', 'orange'])
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category","categoryarray":['Low','Mid','High']},
        title='Monthly Charge Cluster vs Churn Rate',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

In [None]:
# Clean the data: list the rows whose TotalCharges cannot be parsed as a number (they become NaN when coerced)
df_data[pd.to_numeric(df_data['TotalCharges'], errors='coerce').isnull()]

In [None]:
len(df_data[pd.to_numeric(df_data['TotalCharges'], errors='coerce').isnull()])

In [None]:
# Set the unparseable values to NaN and drop those rows
df_data.loc[pd.to_numeric(df_data['TotalCharges'], errors='coerce').isnull(),'TotalCharges'] = np.nan
df_data = df_data.dropna()

In [None]:
df_data['TotalCharges'] = pd.to_numeric(df_data['TotalCharges'], errors='coerce')
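
A minimal sketch of what `errors='coerce'` does, on made-up values: strings that cannot be parsed (here a blank, which is assumed to be how the bad entries look in the file) become `NaN`, so they can be dropped before the column is converted.

```python
import pandas as pd

# Made-up column: numbers stored as text, with one blank entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "108.15"]})

# Unparseable strings are turned into NaN instead of raising an error.
parsed = pd.to_numeric(toy["TotalCharges"], errors="coerce")
print(parsed.isnull().sum())   # number of bad rows

# Keep only the rows that parsed, now with a float column.
clean = toy.assign(TotalCharges=parsed).dropna(subset=["TotalCharges"])
print(clean.dtypes)
```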

In [None]:
sse={}
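df_cluster = df_data[['TotalCharges']].copy()
for k in range(1, 10):
    # Fit only on 'TotalCharges' so the stored labels never leak into the features
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(df_cluster[['TotalCharges']])
    df_cluster["clusters"] = kmeans.labels_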
    sse[k] = kmeans.inertia_ 
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

In [None]:
# Do the same for this field, again with 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(df_data[['TotalCharges']])
df_data['TotalChargeCluster'] = kmeans.predict(df_data[['TotalCharges']])
df_data = order_clusters('TotalChargeCluster', 'TotalCharges',df_data,True)
df_data.groupby('TotalChargeCluster').TotalCharges.describe()

In [None]:
df_data['TotalChargeCluster'] = df_data["TotalChargeCluster"].replace({0:'Low',1:'Mid',2:'High'})

df_plot = df_data.groupby('TotalChargeCluster').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=df_plot['TotalChargeCluster'],
        y=df_plot['Churn'],
        width = [0.5, 0.5, 0.5],
        marker=dict(
        color=['green', 'blue', 'orange'])
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category","categoryarray":['Low','Mid','High']},
        title='Total Charge Cluster vs Churn Rate',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

In [None]:
df_data.info()

#### Now we convert our categorical variables into "dummy" variables so that we can build a logistic regression model.

In [None]:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dummy_columns = []  # columns with more than two levels, to be one-hot encoded

for column in df_data.columns:
    if df_data[column].dtype == object and column != 'customerID':
        if df_data[column].nunique() == 2:
            # Two levels: encode in place as 0/1
            df_data[column] = le.fit_transform(df_data[column])
        else:
            dummy_columns.append(column)

# One indicator column per level for the remaining categorical fields
df_data = pd.get_dummies(data=df_data, columns=dummy_columns)
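
The two encodings used above can be sketched on a made-up frame: a two-level column becomes a single 0/1 column, and a column with more levels becomes one indicator column per level. The column values here are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame: one two-level column and one three-level column.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Two levels: a single 0/1 column is enough (classes are sorted, so No -> 0, Yes -> 1).
toy["Partner"] = LabelEncoder().fit_transform(toy["Partner"])

# More than two levels: one indicator column per level.
encoded = pd.get_dummies(toy, columns=["Contract"])
print(encoded.columns.tolist())
```

The generated names contain spaces and hyphens, which is why the columns are renamed before they are used in a formula.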

In [None]:
dummy_columns

In [None]:
# Check the dummy variables
df_data[['gender','Partner','TenureCluster_High','TenureCluster_Low','TenureCluster_Mid']].head()


In [None]:
# Rename the columns so they are easier to use (no spaces, parentheses or hyphens)
all_columns = []
for column in df_data.columns:
    column = column.replace(" ", "_").replace("(", "_").replace(")", "_").replace("-", "_")
    all_columns.append(column)

df_data.columns = all_columns

In [None]:
# Build the right-hand side of the formula from the predictor variables
glm_columns = 'gender'

for column in df_data.columns:
    if column not in ['Churn','customerID','gender']:
        glm_columns = glm_columns + ' + ' + column

In [None]:
# Fit the model
import statsmodels.api as sm
import statsmodels.formula.api as smf
 

glm_model = smf.glm(formula='Churn ~ {}'.format(glm_columns), data=df_data, family=sm.families.Binomial())
res = glm_model.fit()
print(res.summary())

### Some insights:
- The model uses many predictor variables, and judging by their p-values many of them are not statistically significant.
- The p-value is not the only thing to consider when deciding whether a variable matters. For the purposes of this exercise, however, we keep all the variables for the next model.

In [None]:
np.exp(res.params)
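
Exponentiating the coefficients of a logistic regression turns them into odds ratios: a coefficient beta multiplies the odds of churn by exp(beta) for each unit increase of the variable. The sketch below uses invented coefficients, not the fitted ones, just to show how to read the result.

```python
import numpy as np

# Hypothetical coefficients on the log-odds scale (not taken from the model above).
coefs = {"feature_a": 0.7, "feature_b": 0.0, "feature_c": -0.7}

for name, beta in coefs.items():
    odds_ratio = np.exp(beta)
    # > 1 raises the odds of churn, < 1 lowers them, 1 leaves them unchanged.
    print(f"{name}: odds ratio {odds_ratio:.2f}")
```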

In [None]:
# Create the feature matrix X and the target y
X = df_data.drop(['Churn','customerID'],axis=1)
y = df_data.Churn

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=56)

In [None]:
# Build the churn classification model with these variables
xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.08, objective= 'binary:logistic',n_jobs=-1).fit(X_train, y_train)

print('Accuracy of XGB classifier on training set: {:.2f}'
       .format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
       .format(xgb_model.score(X_test[X_train.columns], y_test)))

- The accuracy figures are quite high. Many factors can produce this, so nothing can be concluded yet; we need to look more closely.
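
One of the factors that can inflate accuracy is class imbalance: when most customers do not churn, always predicting "no churn" already scores well. The sketch below shows this baseline on made-up labels with an assumed 75/25 split; the real proportions should be read from `y_test`.

```python
import numpy as np

# Made-up test labels: 75 customers stayed (0) and 25 churned (1).
y_true = np.array([0] * 75 + [1] * 25)

# Baseline "model" that always predicts the majority class.
majority_class = np.bincount(y_true).argmax()
y_baseline = np.full_like(y_true, majority_class)

baseline_accuracy = (y_baseline == y_true).mean()
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

A model is only informative to the extent that it beats this kind of baseline, which is why per-class metrics are worth checking next.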

In [None]:
# Predict on the test set with our model
y_pred = xgb_model.predict(X_test)

In [None]:
# Compare the predictions with the true labels
print(classification_report(y_test, y_pred))

- The classification is not very effective for one of the classes: for class 1 the precision is 0.62, which is not reliable. The prediction model needs to be improved, and there are alternatives for doing so.
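
For reference, precision and recall for class 1 come straight from the confusion-matrix counts. The counts below are invented so that the precision matches the reported 0.62; they are not the notebook's actual results.

```python
# Hypothetical confusion-matrix counts for class 1 (churn).
true_positives = 31    # predicted churn, actually churned
false_positives = 19   # predicted churn, actually stayed
false_negatives = 25   # predicted stay, actually churned

# Precision: of the customers flagged as churners, how many really churned.
precision = true_positives / (true_positives + false_positives)
# Recall: of the customers who really churned, how many were flagged.
recall = true_positives / (true_positives + false_negatives)

print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```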

In [None]:
from xgboost import plot_importance

In [None]:
# See which variables are the most important for our model
fig, ax = plt.subplots(figsize=(10,8))
plot_importance(xgb_model, ax=ax)