# Customer classification

**Classify customers as 'Valuable' / 'Not valuable'** 

- Dataset from https://app.mavenanalytics.io/datasets
- https://maven-datasets.s3.amazonaws.com/Telecom+Customer+Churn/Telecom+Customer+Churn.zip

Context:<br>
The dataset contains the current status of customers at the close of the May-July quarter, together with the CRM details for the services the company offers. <br>
A customer's status is one of *'Stayed' / 'Churned' / 'Joined'*. <br>
**That is:**
&emsp;```Still active```
&emsp;```Cancelled their subscription```
&emsp;```Is a new customer```

## 1. Data preparation

### 1.0 Outline

- Load the dataset
- Review the variables -> identify the variable we want to predict
- Make variable names and values consistent
- Normalize the data
    - Check the data type of each variable
    - Correct data types where needed
    - Handle null values
    - Convert the target variable to a numeric type (1, 0)

### 1.1 Load the dataset

In [192]:
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [193]:
customer_churn = 'telecom_customer_churn.csv'
zipcode_pop = 'telecom_zipcode_population.csv'

### 1.2 Review the variables and identify the one we want to predict

In [194]:
df_cust = pd.read_csv(customer_churn, encoding='utf-8')
df_cust.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Customer ID | 0002-ORFBO | 0003-MKNFE | 0004-TLHLJ | 0011-IGKFF | 0013-EXCHZ |
| Gender | Female | Male | Male | Male | Female |
| Age | 37 | 46 | 50 | 78 | 75 |
| Married | Yes | No | No | Yes | Yes |
| Number of Dependents | 0 | 0 | 0 | 0 | 0 |
| City | Frazier Park | Glendale | Costa Mesa | Martinez | Camarillo |
| Zip Code | 93225 | 91206 | 92627 | 94553 | 93010 |
| Latitude | 34.827662 | 34.162515 | 33.645672 | 38.014457 | 34.227846 |
| Longitude | -118.999073 | -118.203869 | -117.922613 | -122.115432 | -119.079903 |
| Number of Referrals | 2 | 0 | 0 | 1 | 3 |


In this case there is no explicit variable to predict; instead, we want to create one that gives the probability that a customer is at risk of churning. <br>
To do that, we will use the cases where a customer cancelled their subscription and the cases where they kept it. Customers who have just joined are set aside so that we can later try the prediction model on them.

### 1.3 Make variable names and values consistent

We will rename the variables to lower case and replace spaces with ```_```. The same normalization is applied to the values of the text columns.

In [195]:
df_cust.columns = df_cust.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df_cust.dtypes[df_cust.dtypes=='object'].index)

for c in categorical_columns:
    df_cust[c] = df_cust[c].str.lower().str.replace(' ', '_')
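
As a quick illustration of the normalization above, here is a minimal sketch on a made-up two-row frame. The rows and the name `toy` are invented for the example, and `pd.api.types.is_string_dtype` is used here as one possible way of picking the text columns; it is not the selection used in the notebook. Note that identifiers such as the customer ID are lower-cased as well.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "Customer ID": ["0002-ORFBO", "0003-MKNFE"],
    "Internet Type": ["Fiber Optic", "Cable"],
    "Tenure in Months": [9, 4],
})

# Column names: lower case, spaces replaced by underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Text values get the same treatment; numeric columns are left untouched.
text_columns = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]
for c in text_columns:
    toy[c] = toy[c].str.lower().str.replace(" ", "_")

print(toy)
```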

### 1.4 Normalize the data

#### 1.4.1 Check the data type of each variable

In [196]:
df_cust.info()
#df_cust[df_cust['customer_status'] == 'joined'].info()
#df_cust[df_cust['customer_status'] == 'stayed'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   customer_id                        7043 non-null   object 
 1   gender                             7043 non-null   object 
 2   age                                7043 non-null   int64  
 3   married                            7043 non-null   object 
 4   number_of_dependents               7043 non-null   int64  
 5   city                               7043 non-null   object 
 6   zip_code                           7043 non-null   int64  
 7   latitude                           7043 non-null   float64
 8   longitude                          7043 non-null   float64
 9   number_of_referrals                7043 non-null   int64  
 10  tenure_in_months                   7043 non-null   int64  
 11  offer                              7043 non-null   objec

#### 1.4.2 Correct data types

No data types need to be corrected.

#### 1.4.3 Handle null values

We will count the null values in the dataset and decide what to do with them:
- Fix or drop records with null values in ordinary numeric variables
- Fill null text values with a generic label, or drop them

In [197]:
df_cust.isnull().sum()

customer_id                             0
gender                                  0
age                                     0
married                                 0
number_of_dependents                    0
city                                    0
zip_code                                0
latitude                                0
longitude                               0
number_of_referrals                     0
tenure_in_months                        0
offer                                   0
phone_service                           0
avg_monthly_long_distance_charges     682
multiple_lines                        682
internet_service                        0
internet_type                        1526
avg_monthly_gb_download              1526
online_security                      1526
online_backup                        1526
device_protection_plan               1526
premium_tech_support                 1526
streaming_tv                         1526
streaming_movies                  

In [198]:
df_cust[df_cust['phone_service'] == 'yes']['multiple_lines'].unique()

array(['no', 'yes'], dtype=object)

In [199]:
df_cust['avg_monthly_long_distance_charges'] = df_cust['avg_monthly_long_distance_charges'].fillna(value=0)

In [200]:
df_cust['avg_monthly_long_distance_charges'].isnull().sum()

0

In [201]:
df_cust['multiple_lines'] = df_cust['multiple_lines'].fillna(value='no')

In [202]:
df_cust['multiple_lines'].isnull().sum()

0

In [203]:
df_cust[df_cust['internet_service'] == 'yes']['internet_type'].unique()

# --> Replace nulls with 'no'

array(['cable', 'fiber_optic', 'dsl'], dtype=object)

In [204]:
df_cust[df_cust['internet_service'] == 'yes']['avg_monthly_gb_download'].unique()

# --> Replace nulls with 0

array([16., 10., 30.,  4., 11., 73., 14.,  7., 21., 59., 19., 12., 20.,
       22., 17.,  9., 52., 57., 51., 41., 23., 27.,  2., 69., 53., 15.,
       29., 85., 28., 18., 48., 25., 26.,  8.,  6.,  5., 13., 75., 82.,
       24., 76., 47., 71., 58., 42.,  3., 56., 46., 39.])

In [205]:
df_cust[df_cust['internet_service'] == 'yes']['online_security'].unique()

# --> Replace nulls with 'no'

array(['no', 'yes'], dtype=object)

In [206]:
df_cust[df_cust['internet_service'] == 'yes']['online_backup'].unique()

# --> Replace nulls with 'no'

array(['yes', 'no'], dtype=object)

In [207]:
df_cust[df_cust['internet_service'] == 'yes']['device_protection_plan'].unique()

# --> Replace nulls with 'no'

array(['no', 'yes'], dtype=object)

In [208]:
df_cust[df_cust['internet_service'] == 'yes']['premium_tech_support'].unique()
# --> Replace nulls with 'no'

array(['yes', 'no'], dtype=object)

In [209]:
df_cust[df_cust['internet_service'] == 'yes']['streaming_tv'].unique()
# --> Replace nulls with 'no'

array(['yes', 'no'], dtype=object)

In [210]:
df_cust[df_cust['internet_service'] == 'yes']['streaming_movies'].unique()
# --> Replace nulls with 'no'

array(['no', 'yes'], dtype=object)

In [211]:
df_cust[df_cust['internet_service'] == 'yes']['streaming_music'].unique()
# --> Replace nulls with 'no'

array(['no', 'yes'], dtype=object)

In [212]:
df_cust[df_cust['internet_service'] == 'yes']['unlimited_data'].unique()
# --> Replace nulls with 'no'

array(['yes', 'no'], dtype=object)

In [213]:
null2no = ['internet_type', 'online_security', 'online_backup', 'device_protection_plan',
           'premium_tech_support', 'streaming_tv', 'streaming_movies', 'streaming_music', 'unlimited_data']

In [214]:
df_cust[null2no] = df_cust[null2no].fillna(value='no')

In [215]:
df_cust['avg_monthly_gb_download'] = df_cust['avg_monthly_gb_download'].fillna(value=0)

In [216]:
df_cust[['churn_category','churn_reason']] = df_cust[['churn_category','churn_reason']].fillna(value='no_churn')
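
The fills above rely on nulls appearing only for customers who do not have the corresponding service. A minimal sketch of checking that assumption before filling is shown below, on invented toy rows; `toy` and `fill_values` are hypothetical names, and the per-column dictionary passed to `fillna` is one possible way to group the defaults.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration: internet-related fields are null
# exactly when the customer has no internet service.
toy = pd.DataFrame({
    "internet_service": ["yes", "no", "yes", "no"],
    "internet_type": ["dsl", np.nan, "cable", np.nan],
    "avg_monthly_gb_download": [16.0, np.nan, 30.0, np.nan],
})

# Check the assumption: a value is missing if and only if there is no service.
no_service = toy["internet_service"] == "no"
for col in ["internet_type", "avg_monthly_gb_download"]:
    assert (toy[col].isnull() == no_service).all(), col

# Fill each column with its own default in a single call.
fill_values = {"internet_type": "no", "avg_monthly_gb_download": 0}
toy = toy.fillna(value=fill_values)

print(toy)
```

If the check fails for a column, the nulls carry information beyond "no service" and a blanket default may not be appropriate.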

In [217]:
df_cust.isnull().sum()

customer_id                          0
gender                               0
age                                  0
married                              0
number_of_dependents                 0
city                                 0
zip_code                             0
latitude                             0
longitude                            0
number_of_referrals                  0
tenure_in_months                     0
offer                                0
phone_service                        0
avg_monthly_long_distance_charges    0
multiple_lines                       0
internet_service                     0
internet_type                        0
avg_monthly_gb_download              0
online_security                      0
online_backup                        0
device_protection_plan               0
premium_tech_support                 0
streaming_tv                         0
streaming_movies                     0
streaming_music                      0
unlimited_data           

#### 1.4.4 Convert the target variable to a numeric type

First we split the dataset into new customers and customers who were already subscribed before the quarter under analysis. <br>
Then we encode customers who kept their subscription as 1 and customers who cancelled as 0.

In [218]:
df = df_cust[df_cust['customer_status'] != 'joined'].copy()

In [219]:
df_joined = df_cust[df_cust['customer_status'] == 'joined'].copy()

In [220]:
(df['customer_status'] == 'stayed').astype(int).head()

0    1
1    1
2    0
3    0
4    0
Name: customer_status, dtype: int32

In [221]:
df['customer_status'] = (df['customer_status'] == 'stayed').astype(int)

## 2. Setting up the validation scheme

Split the working dataset into training, validation and test sets (roughly 60% / 20% / 20%)

In [222]:
from sklearn.model_selection import train_test_split

In [223]:
df_ft, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [224]:
df_train, df_val = train_test_split(df_ft, test_size=0.25, random_state=1)

In [225]:
(len(df_train) + len(df_val) + len(df_test)), len(df)

(6589, 6589)
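
The two-step split can be checked with a small sketch that only looks at row counts. The array `rows` is a stand-in for the 6589 rows of `df`, and the variable names are chosen for the example; the point is that taking 25% of the remaining 80% yields 20% of the original data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 6589 non-"joined" rows; only the row count matters here.
rows = np.arange(6589)

# First hold out 20% for testing, then take 25% of the remaining 80%
# for validation (0.25 * 0.80 = 0.20 of the original data).
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(len(part) / len(rows), 3))
```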

In [226]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [227]:
y_train = df_train['customer_status'].values
y_val = df_val['customer_status'].values
y_test = df_test['customer_status'].values

In [228]:
del df_train['customer_status']
del df_val['customer_status']
del df_test['customer_status']

In [229]:
len(df_train.columns), len(df_val.columns), len(df_test.columns)

(37, 37, 37)

## 3. Exploratory data analysis

### 3.1 Target variable, categorical variables and numeric variables

In [230]:
df_ft = df_ft.reset_index(drop=True)

Legend for customer_status: <br>
- 0 = Churned<br>
- 1 = Stayed

In [231]:
df_ft['customer_status'].value_counts(normalize=True)

1    0.714855
0    0.285145
Name: customer_status, dtype: float64

In [232]:
lealtad_gral = df_ft['customer_status'].mean()
round(lealtad_gral,2)

lealtad_gral

0.7148548662492885

In [233]:
numeric_vars = ['age', 'number_of_dependents', 'number_of_referrals', 'tenure_in_months',
                'avg_monthly_long_distance_charges', 'avg_monthly_gb_download', 'monthly_charge',
                'total_charges', 'total_refunds', 'total_extra_data_charges', 'total_long_distance_charges',
                'total_revenue']

In [234]:
categoric_vars = ['gender', 'married', 'offer', 'phone_service', 'multiple_lines', 'internet_service', 'internet_type',
                'online_security', 'online_backup', 'device_protection_plan', 'premium_tech_support', 'streaming_tv', 'streaming_movies',
                'streaming_music', 'unlimited_data', 'contract', 'paperless_billing', 'payment_method']

In [235]:
df_ft[categoric_vars].nunique()

gender                    2
married                   2
offer                     6
phone_service             2
multiple_lines            2
internet_service          2
internet_type             4
online_security           2
online_backup             2
device_protection_plan    2
premium_tech_support      2
streaming_tv              2
streaming_movies          2
streaming_music           2
unlimited_data            2
contract                  3
paperless_billing         2
payment_method            3
dtype: int64

### 3.2 Identifying variable importance

We measure how strongly each variable relates to the target variable
- Loyalty rate
- Loyalty ratio
- Mutual information

#### 3.2.1 Loyalty rate within a group of a variable

Reference:

The overall loyalty rate (lealtad_gral) is the baseline for deciding whether a group within a variable is more or less loyal than usual<br>
There are two cases:
- The group's loyalty rate is above the baseline (lealtad_gral) --> that group is more likely to stay loyal
    - group - lealtad_gral > 0
- The group's loyalty rate is below the baseline (lealtad_gral) --> that group is less likely to stay loyal
    - group - lealtad_gral < 0

In [236]:
leal_muj = df_ft[df_ft['gender'] == 'female']['customer_status'].mean()

In [237]:
leal_hom = df_ft[df_ft['gender'] == 'male']['customer_status'].mean()

In [238]:
leal_muj, lealtad_gral, (leal_muj-lealtad_gral)

(0.7069032009255688, 0.7148548662492885, -0.007951665323719714)

In [239]:
leal_hom, lealtad_gral, (leal_hom - lealtad_gral)

(0.722554144884242, 0.7148548662492885, 0.007699278634953455)

#### 3.2.2 Loyalty rate by marital status

In [240]:
leal_cas = df_ft[df_ft['married'] == 'yes']['customer_status'].mean()

In [241]:
leal_sol = df_ft[df_ft['married'] == 'no']['customer_status'].mean()

In [242]:

leal_cas, lealtad_gral, (leal_cas - lealtad_gral)

(0.7995503934057699, 0.7148548662492885, 0.08469552715648143)

In [243]:
leal_sol, lealtad_gral, (leal_sol - lealtad_gral)

(0.627978478093774, 0.7148548662492885, -0.08687638815551446)

#### 3.2.3 Loyalty ratio

The ratio obtained by dividing a group's loyalty rate by the overall loyalty rate
- If the ratio is greater than 1 --> the group is more likely to stay loyal
- If the ratio is less than 1 --> the group is less likely to stay loyal

In [244]:
leal_muj / lealtad_gral

0.988876531868013

In [245]:
leal_hom / lealtad_gral

1.0107704080904565

In [246]:
leal_sol / lealtad_gral

0.8784698933206696

In [247]:
leal_cas / lealtad_gral

1.1184793321767021

#### 3.2.4 Automating the calculations

We can build a summary for each group within each variable showing:
- Loyalty rate (`mean`)
- Number of records (`count`)
- Difference from the baseline loyalty rate (`diferencia`)
- Loyalty ratio (`esperanza`)

In [248]:
from IPython.display import display

In [249]:
for c in categoric_vars:
    print(c)
    df_group = df_ft.groupby(c)['customer_status'].agg(['mean', 'count'])
    df_group['diferencia'] = df_group['mean'] - lealtad_gral
    df_group['esperanza'] = df_group['mean'] / lealtad_gral
    display(df_group.reset_index())
    print('\n', '\n')

gender

| | gender | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | female | 0.706903 | 2593 | -0.007952 | 0.988877 |
| 1 | male | 0.722554 | 2678 | 0.007699 | 1.01077 |

married

| | married | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.627978 | 2602 | -0.086876 | 0.87847 |
| 1 | yes | 0.79955 | 2669 | 0.084696 | 1.118479 |

offer

| | offer | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | none | 0.704577 | 2840 | -0.010277 | 0.985623 |
| 1 | offer_a | 0.935335 | 433 | 0.22048 | 1.308426 |
| 2 | offer_b | 0.872564 | 667 | 0.157709 | 1.220617 |
| 3 | offer_c | 0.778426 | 343 | 0.063571 | 1.088928 |
| 4 | offer_d | 0.738947 | 475 | 0.024093 | 1.033703 |
| 5 | offer_e | 0.315789 | 513 | -0.399065 | 0.441753 |

phone_service

| | phone_service | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.733075 | 517 | 0.018221 | 1.025488 |
| 1 | yes | 0.712873 | 4754 | -0.001981 | 0.997228 |

multiple_lines

| | multiple_lines | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.719437 | 2912 | 0.004582 | 1.00641 |
| 1 | yes | 0.709199 | 2359 | -0.005656 | 0.992088 |

internet_service

| | internet_service | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.917215 | 1063 | 0.202361 | 1.283079 |
| 1 | yes | 0.663736 | 4208 | -0.051119 | 0.92849 |

internet_type

| | internet_type | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | cable | 0.720968 | 620 | 0.006113 | 1.008551 |
| 1 | dsl | 0.795455 | 1232 | 0.0806 | 1.11275 |
| 2 | fiber_optic | 0.579796 | 2356 | -0.135059 | 0.811069 |
| 3 | no | 0.917215 | 1063 | 0.202361 | 1.283079 |

online_security

| | online_security | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.654013 | 3688 | -0.060842 | 0.914889 |
| 1 | yes | 0.856601 | 1583 | 0.141747 | 1.198287 |

online_backup

| | online_backup | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.681508 | 3369 | -0.033347 | 0.953351 |
| 1 | yes | 0.773922 | 1902 | 0.059067 | 1.082628 |

device_protection_plan

| | device_protection_plan | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.680991 | 3351 | -0.033864 | 0.952628 |
| 1 | yes | 0.773958 | 1920 | 0.059103 | 1.082679 |

premium_tech_support

| | premium_tech_support | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.657852 | 3668 | -0.057003 | 0.920259 |
| 1 | yes | 0.84529 | 1603 | 0.130435 | 1.182464 |

streaming_tv

| | streaming_tv | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.727417 | 3144 | 0.012562 | 1.017573 |
| 1 | yes | 0.696286 | 2127 | -0.018569 | 0.974024 |

streaming_movies

| | streaming_movies | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.7292 | 3113 | 0.014345 | 1.020067 |
| 1 | yes | 0.694161 | 2158 | -0.020694 | 0.971052 |

streaming_music

| | streaming_music | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.725514 | 3308 | 0.010659 | 1.014911 |
| 1 | yes | 0.696893 | 1963 | -0.017962 | 0.974873 |

unlimited_data

| | unlimited_data | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.82335 | 1636 | 0.108495 | 1.151772 |
| 1 | yes | 0.666025 | 3635 | -0.04883 | 0.931692 |

contract

| | contract | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | month-to-month | 0.479656 | 2556 | -0.235199 | 0.670983 |
| 1 | one_year | 0.88843 | 1210 | 0.173575 | 1.242811 |
| 2 | two_year | 0.974751 | 1505 | 0.259896 | 1.363565 |

paperless_billing

| | paperless_billing | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | no | 0.818531 | 2083 | 0.103676 | 1.145031 |
| 1 | yes | 0.647114 | 3188 | -0.067741 | 0.905239 |

payment_method

| | payment_method | mean | count | diferencia | esperanza |
|---|---|---|---|---|---|
| 0 | bank_withdrawal | 0.64324 | 2988 | -0.071615 | 0.899818 |
| 1 | credit_card | 0.838389 | 2011 | 0.123534 | 1.17281 |
| 2 | mailed_check | 0.588235 | 272 | -0.12662 | 0.822874 |


#### 3.2.5 Mutual information

Used to estimate the importance of categorical variables.
It tells us how much we can learn about one variable by knowing the value of another.

In [250]:
from sklearn.metrics import mutual_info_score

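The mutual information between a variable and the target is a non-negative score: 0 means the two are independent, and higher values mean the variable tells us more about the target. With mutual_info_score the value is measured in nats and is not capped at 1; for a binary target it cannot exceed ln 2 ≈ 0.693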

In [251]:
mutual_info_score(df_ft['customer_status'], df_ft['contract'])

0.14809345785897848

In [252]:
mutual_info_score(df_ft['customer_status'], df_ft['gender'])

0.00015016154060060183
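
To make the scale of these scores concrete, the sketch below computes mutual information by hand from the empirical joint distribution and compares it with `mutual_info_score`. The labels are invented toy data and `mutual_info_by_hand` is a hypothetical helper written for this example; it is meant as an illustration of the definition, not as a replacement for the library function.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_info_by_hand(x, y):
    """Mutual information (in nats) from the empirical joint distribution."""
    x = np.asarray(x)
    y = np.asarray(y)
    total = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                p_x = np.mean(x == xv)
                p_y = np.mean(y == yv)
                total += p_xy * np.log(p_xy / (p_x * p_y))
    return total

# Toy labels, invented for illustration.
target = [1, 1, 0, 0]
same_info = ["a", "a", "b", "b"]   # determines the target completely
no_info = ["a", "b", "a", "b"]     # independent of the target

print(mutual_info_by_hand(target, same_info), mutual_info_score(target, same_info))
print(mutual_info_by_hand(target, no_info), mutual_info_score(target, no_info))
print(np.log(2))  # upper bound for a binary target
```

A variable that fully determines a balanced binary target reaches ln 2, and an independent variable scores 0, so a value such as 0.148 for `contract` sits well above the near-zero score of `gender`.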

We can compute this score for every categorical variable

In [253]:
def info_mutua(categorias):
    return mutual_info_score(categorias, df_ft['customer_status'])

In [254]:
df_ft[categoric_vars].apply(info_mutua).sort_values(ascending=False)

contract                  0.148093
offer                     0.055920
internet_type             0.048009
internet_service          0.030412
payment_method            0.024690
online_security           0.023048
premium_tech_support      0.019666
married                   0.018234
paperless_billing         0.017927
unlimited_data            0.013765
device_protection_plan    0.005019
online_backup             0.004941
streaming_movies          0.000725
streaming_tv              0.000570
streaming_music           0.000467
gender                    0.000150
phone_service             0.000090
multiple_lines            0.000064
dtype: float64

#### 3.2.6 Correlation between numeric variables

Measures the linear dependence between two numeric variables<br>
The correlation can be:
- Positive --> the target variable and the evaluated variable tend to move in the same direction
    - That is: customers with longer tenure tend to be more loyal
- Negative --> the target variable and the evaluated variable tend to move in opposite directions
    - That is: customers with a higher monthly charge tend to be less loyal

Let's explore with examples from the dataset

**Positive correlation 👇**<br>
As the number of months subscribed increases, so does the share of loyal customers

In [255]:
df_ft['tenure_in_months'].min(), df_ft['tenure_in_months'].max()

(1, 72)

In [256]:
df_ft[df_ft['tenure_in_months'] <= 5]['customer_status'].mean()

0.17246175243393602

In [257]:
df_ft[(df_ft['tenure_in_months'] > 5) & (df_ft['tenure_in_months'] <= 15)]['customer_status'].mean()

0.6449438202247191

In [258]:
df_ft[(df_ft['tenure_in_months'] > 15) & (df_ft['tenure_in_months'] <= 30)]['customer_status'].mean()

0.753731343283582

**Negative correlation**<br>
Loyalty tends to fall with age, most clearly for customers over 60

In [259]:
df_ft['age'].min(), df_ft['age'].max()

(19, 80)

In [260]:
df_ft[df_ft['age'] <= 25]['customer_status'].mean()

0.7640117994100295

In [261]:
df_ft[(df_ft['age'] > 25) & (df_ft['age'] <= 35)]['customer_status'].mean()

0.7310565635005336

In [262]:
df_ft[(df_ft['age'] > 35) & (df_ft['age'] <= 60)]['customer_status'].mean()

0.7465263157894737

In [263]:
df_ft[df_ft['age'] > 60]['customer_status'].mean()

0.6182669789227166
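
The manual range filters used in this section can be condensed with `pd.cut` and a `groupby`. The sketch below uses invented toy data and hypothetical names (`toy`, `bins`, `loyalty_by_bin`); the tenure boundaries mirror the filters above, with a final bin up to the maximum tenure of 72 added for completeness.

```python
import pandas as pd

# Toy data, invented for illustration; 1 = stayed, 0 = churned.
toy = pd.DataFrame({
    "tenure_in_months": [1, 3, 5, 8, 12, 15, 20, 30, 45, 72],
    "customer_status":  [0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
})

# Right-closed bins: (0, 5], (5, 15], (15, 30], (30, 72],
# the same boundaries as the manual filters (<= 5, > 5 and <= 15, ...).
bins = [0, 5, 15, 30, 72]
toy["tenure_bin"] = pd.cut(toy["tenure_in_months"], bins=bins)

# Loyalty rate and number of records per bin.
loyalty_by_bin = (
    toy.groupby("tenure_bin", observed=True)["customer_status"]
       .agg(["mean", "count"])
)
print(loyalty_by_bin)
```

The same pattern applies to `age` by swapping the column and the bin edges.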

**List of correlations 👇**

In [264]:
df_ft[numeric_vars].corrwith(df_ft['customer_status']).sort_values(ascending=False)

tenure_in_months                     0.435003
number_of_referrals                  0.323245
total_revenue                        0.279929
total_long_distance_charges          0.267812
total_charges                        0.251821
number_of_dependents                 0.231056
total_refunds                        0.050088
total_extra_data_charges             0.002796
avg_monthly_long_distance_charges   -0.009619
avg_monthly_gb_download             -0.044315
age                                 -0.107037
monthly_charge                      -0.164967
dtype: float64

## 4. Preparing to train the ML model

### 4.1 One-hot encoding

One-hot encoding creates a separate binary variable for each category of a categorical variable: it is 1 in the records where that category occurs and 0 otherwise.<br>
Numeric variables are kept as they are, so the result is a numeric matrix, which is the format the ML model works with
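
A minimal sketch of what this encoding produces is shown below, using two invented records and hypothetical names (`train_records`, `dv_toy`). It also shows why data other than the training set should go through `transform` rather than `fit_transform`: the columns learned during fitting are reused, so the matrices stay aligned.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records, invented for illustration.
train_records = [
    {"contract": "month-to-month", "tenure_in_months": 1},
    {"contract": "two_year", "tenure_in_months": 24},
]

dv_toy = DictVectorizer(sparse=False)
x_toy = dv_toy.fit_transform(train_records)

# One column per category of "contract", plus the numeric column as is.
print(dv_toy.feature_names_)
print(x_toy)

# A new record is encoded with transform(), which reuses the columns learned
# during fit; a category unseen at fit time ("one_year") becomes all zeros.
x_new = dv_toy.transform([{"contract": "one_year", "tenure_in_months": 12}])
print(x_new)
```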

In [265]:
from sklearn.feature_extraction import DictVectorizer

In [266]:
dicc_train = df_train[categoric_vars + numeric_vars].to_dict(orient='records')

In [267]:
dv = DictVectorizer(sparse=False)

In [268]:
x_train = dv.fit_transform(dicc_train)

In [269]:
x_train.shape

(3953, 56)

In [270]:
dicc_val = df_val[categoric_vars + numeric_vars].to_dict(orient='records')

In [271]:
x_val = dv.transform(dicc_val)

In [272]:
x_val.shape

(1318, 56)

In [282]:
y_val.shape

(1318,)

### 4.2 Training the ML model

We will use a Logistic Regression model for binary classification.<br>
The steps are:<br> 
a. Train a model<br>
b. Apply it to the validation set<br>
c. Compute the accuracy

In [283]:
from sklearn.linear_model import LogisticRegression

In [284]:
x_train.shape, y_train.shape

((3953, 56), (3953,))

In [274]:
model = LogisticRegression(max_iter=3500)
model.fit(x_train, y_train)

In [289]:
y_pred = model.predict_proba(x_val)[:,1]

In [290]:
y_pred.shape

(1318,)

In [296]:
ind_leal = (y_pred >= 0.5)

In [293]:
y_val.shape

(1318,)

In [295]:
ind_leal.shape

(1318,)

In [297]:
(y_val == ind_leal).mean()

0.8596358118361154
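
Accuracy is easier to judge next to a baseline. About 71% of the customers in the full training set stayed, so a model that always predicts "stayed" would already score around 0.71 there (the share in the validation set may differ slightly). The sketch below, on invented toy labels and probabilities, compares accuracy with that majority-class baseline and prints a confusion matrix; all variable names are chosen for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels and probabilities, invented for illustration; 1 = stayed.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.95, 0.3, 0.2, 0.65])

# Hard decisions with the same 0.5 threshold used in the notebook.
y_hat = (y_prob >= 0.5).astype(int)

# Accuracy: share of predictions that match the true label.
accuracy = accuracy_score(y_true, y_hat)

# Baseline: always predict the majority class.
majority = int(y_true.mean() >= 0.5)
baseline = accuracy_score(y_true, np.full_like(y_true, majority))

# Rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

print(accuracy, baseline)
print(cm)
```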

### 4.3 Using the model

We retrain the model on the full training set (training + validation) and apply it to the test set

In [299]:
dicc_ft = df_ft[categoric_vars + numeric_vars].to_dict(orient='records')

In [300]:
dv = DictVectorizer(sparse=False)

In [301]:
x_ft = dv.fit_transform(dicc_ft)

In [302]:
y_ft = df_ft['customer_status']

In [305]:
model = LogisticRegression(max_iter=4500)
model.fit(x_ft, y_ft)

In [307]:
dicc_test = df_test[categoric_vars + numeric_vars].to_dict(orient = 'records')

In [310]:
x_test = dv.transform(dicc_test)

In [349]:
x_test.shape

(1318, 56)

In [314]:
model.predict_proba(x_test)[:,1]

array([0.54335465, 0.806965  , 0.98831896, ..., 0.35724278, 0.9885413 ,
       0.63139243])

In [315]:
y_pred = model.predict_proba(x_test)[:,1]

In [316]:
leal_index = (y_pred >= 0.5)

In [318]:
(leal_index == y_test).mean()

0.8429438543247344
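
One way to keep the vectorizer and the model in step is to wrap them in a scikit-learn pipeline, so that fitting and prediction always use the same fitted `DictVectorizer`. This is a sketch of that alternative on invented toy records, not the approach used in the notebook; `records`, `labels` and `pipeline` are hypothetical names.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy records and labels, invented for illustration; 1 = stayed.
records = [
    {"contract": "month-to-month", "tenure_in_months": 1},
    {"contract": "month-to-month", "tenure_in_months": 3},
    {"contract": "one_year", "tenure_in_months": 20},
    {"contract": "two_year", "tenure_in_months": 48},
    {"contract": "two_year", "tenure_in_months": 60},
    {"contract": "month-to-month", "tenure_in_months": 2},
]
labels = [0, 0, 1, 1, 1, 0]

# The pipeline fits the vectorizer and the model together, and reuses the
# fitted vectorizer at prediction time.
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(records, labels)

# Score a new record given as a plain dictionary.
new_records = [{"contract": "two_year", "tenure_in_months": 36}]
proba_stay = pipeline.predict_proba(new_records)[:, 1]
print(proba_stay)
```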

**Applying the model to a single customer**

In [337]:
dicc_test[56]

{'gender': 'male',
 'married': 'no',
 'offer': 'offer_d',
 'phone_service': 'yes',
 'multiple_lines': 'no',
 'internet_service': 'yes',
 'internet_type': 'fiber_optic',
 'online_security': 'no',
 'online_backup': 'no',
 'device_protection_plan': 'yes',
 'premium_tech_support': 'no',
 'streaming_tv': 'yes',
 'streaming_movies': 'yes',
 'streaming_music': 'yes',
 'unlimited_data': 'yes',
 'contract': 'month-to-month',
 'paperless_billing': 'yes',
 'payment_method': 'credit_card',
 'age': 65,
 'number_of_dependents': 0,
 'number_of_referrals': 0,
 'tenure_in_months': 19,
 'avg_monthly_long_distance_charges': 32.77,
 'avg_monthly_gb_download': 4.0,
 'monthly_charge': 95.9,
 'total_charges': 1777.9,
 'total_refunds': 0.0,
 'total_extra_data_charges': 0,
 'total_long_distance_charges': 622.63,
 'total_revenue': 2400.53}

In [339]:
arnaldo = dicc_test[56]

In [340]:
arnaldo = dv.transform([arnaldo])

In [341]:
arnaldo.shape

(1, 56)

In [342]:
model.predict_proba(arnaldo)[0][1]

0.4817789471485103

In [355]:
model.predict(arnaldo)[0]

0

In [345]:
y_test[56]

0

**Applying the model to our new customers**

In [360]:
dicc_joined = df_joined[categoric_vars + numeric_vars].to_dict(orient='records')

In [361]:
x_joined = dv.transform(dicc_joined)

In [362]:
x_joined.shape

(454, 56)

In [367]:
y_joined = model.predict(x_joined)


In [369]:
df_joined['leal'] = y_joined
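
Since the stated goal is a probability of churn risk, it can be useful to store the probability next to the hard label. The sketch below assumes a fitted binary classifier with classes 0 and 1 and uses an invented toy model on tenure only; `toy_model`, `new_customers` and the `churn_risk` column are hypothetical names for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration; 1 = stayed, 0 = churned.
x_toy = np.array([[1], [2], [3], [20], [48], [60]])   # tenure in months
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression(max_iter=1000).fit(x_toy, y_toy)

# Hypothetical new customers to score.
new_customers = pd.DataFrame({"tenure_in_months": [1, 2, 3]})
x_new = new_customers[["tenure_in_months"]].to_numpy()

# Column 1 of predict_proba is P(class 1) = P(stayed); its complement is
# the estimated churn risk.
p_stay = toy_model.predict_proba(x_new)[:, 1]
new_customers["churn_risk"] = 1 - p_stay
new_customers["leal"] = (p_stay >= 0.5).astype(int)

# Highest-risk customers first.
print(new_customers.sort_values("churn_risk", ascending=False))
```

Sorting by `churn_risk` gives a ranked list, which a 0/1 label alone cannot provide.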

In [370]:
df_joined

| | customer_id | gender | age | married | number_of_dependents | city | zip_code | latitude | longitude | number_of_referrals | ... | monthly_charge | total_charges | total_refunds | total_extra_data_charges | total_long_distance_charges | total_revenue | customer_status | churn_category | churn_reason | leal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 0021-ikxgc | female | 72 | no | 0 | san_marcos | 92078 | 33.119028 | -117.166036 | 0 | ... | 72.10 | 72.10 | 0.0 | 0 | 7.77 | 79.87 | joined | no_churn | no_churn | 0 |
| 23 | 0030-fnxpp | female | 22 | no | 0 | keeler | 93530 | 36.560498 | -117.962461 | 0 | ... | 19.85 | 57.20 | 0.0 | 0 | 9.36 | 66.56 | joined | no_churn | no_churn | 1 |
| 48 | 0082-ldzue | male | 54 | no | 0 | calistoga | 94515 | 38.629618 | -122.593216 | 0 | ... | 44.30 | 44.30 | 0.0 | 0 | 42.95 | 87.25 | joined | no_churn | no_churn | 0 |
| 88 | 0139-ivfjg | female | 19 | yes | 0 | temecula | 92592 | 33.507255 | -117.029473 | 10 | ... | 90.35 | 190.50 | 0.0 | 0 | 9.30 | 199.80 | joined | no_churn | no_churn | 1 |
| 100 | 0178-ciikr | female | 60 | no | 0 | crows_landing | 95313 | 37.435664 | -121.049056 | 0 | ... | 19.95 | 58.00 | 0.0 | 0 | 8.07 | 66.07 | joined | no_churn | no_churn | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6938 | 9840-efjqb | female | 43 | no | 0 | sylmar | 91342 | 34.321621 | -118.399841 | 0 | ... | 74.35 | 74.35 | 0.0 | 0 | 24.41 | 98.76 | joined | no_churn | no_churn | 0 |
| 6973 | 9895-vfoxh | female | 19 | no | 0 | scotia | 95565 | 40.440636 | -124.098739 | 0 | ... | 24.40 | 24.40 | 0.0 | 10 | 0.00 | 34.40 | joined | no_churn | no_churn | 0 |
| 7021 | 9962-bfpdu | female | 61 | yes | 3 | kenwood | 95452 | 38.419525 | -122.521585 | 3 | ... | 20.05 | 20.05 | 0.0 | 0 | 16.02 | 36.07 | joined | no_churn | no_churn | 1 |
| 7033 | 9975-skrnr | male | 24 | no | 0 | sierraville | 96126 | 39.559709 | -120.345639 | 0 | ... | 18.90 | 18.90 | 0.0 | 0 | 49.51 | 68.41 | joined | no_churn | no_churn | 1 |
