# **Exploratory Data Analysis (EDA): Insurance Purchase Propensity**

**Table of Contents**

1) Problem Description
2) Data Description
3) Data Cleaning: Variable Mapping, Null Values, Variable Recoding, New Variable Creation, Variable Imputation

4) Exploratory and Descriptive Analysis: Univariate Analysis, Multivariate Analysis, Feature Selection



## **1) Problem Description**

Sending mailings to every potential customer of a product is not an effective marketing strategy, since many of them are not interested. Identifying the customers most likely to buy the product would allow the campaign to be targeted more precisely and would reduce its cost.

The financial company offers insurance, loans, and other products, and has provided a list of customers who were targeted by a marketing campaign offering the new “Seguro Vivienda” (home insurance) product. The file contains these customers' data, along with information on other products they already hold and whether or not they purchased the new product. A second file provides sociodemographic information for the different residential zones.


## **2) Data Description**

**Customers**

* ID_Cliente: unique customer identifier
* Fecha_Nacimiento: customer's date of birth
* Fecha_Alta: date the customer signed up
* Sexo: customer's sex
* ID_Zona: unique identifier of the zone of residence
* Productos_X: number of products held, by type
* Gasto_X: annual spend on each product type
* Seguro_Vivienda: whether or not the customer purchased the offered product

**Zones**

* ID_Zona: unique zone identifier
* Tipo_X: percentage of the population by family type
* Educacion_X: percentage of the population by education level
* Poblacion_X: percentage of the population by population type
* Vivienda_X: percentage of the population by housing type
* Medico_X: percentage of the population by type of health insurance
* Ingresos_X: percentage of the population by income level


**Libraries**

In [1]:
import pandas as pd # Data manipulation library, the "Excel of Python"
import numpy as np ## Numerical operations
import seaborn as sns # Statistical plots
import matplotlib.pyplot as plt # Plots
import os ## Operating system utilities
import datetime ## Dates
pd.set_option("display.max_columns",100)

In [2]:
#!pip install    

**Import Data**

In [3]:
datos_train=pd.read_csv("Insumos/Clientes_train.csv",sep="\t")

In [4]:
datos_train

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros,Seguro_Vivienda
0,C3088,29/03/1968,27/03/1989,Mujer,Z1143,0,1,0,0.00,617.55,0.00,False
1,C2975,08/12/1978,26/12/1998,Mujer,Z1201,2,0,1,973.61,0.00,386.87,False
2,C0840,31/07/1950,19/04/1972,Hombre,Z1122,0,1,2,0.00,3572.01,273.15,False
3,C0461,29/07/1945,21/07/1967,Mujer,Z1190,1,1,2,87.91,4558.71,521.66,False
4,C2777,17/10/1955,22/02/1976,Hombre,Z1344,0,1,0,0.00,4289.61,0.00,False
...,...,...,...,...,...,...,...,...,...,...,...,...
2994,C1524,02/11/1961,30/08/1982,Hombre,Z1593,0,1,2,0.00,6712.11,82.93,True
2995,C3670,18/09/1953,15/03/1975,Hombre,Z1023,0,1,1,0.00,1653.89,36.73,False
2996,C0919,15/11/1950,27/12/1971,Hombre,Z0421,0,3,1,0.00,3704.71,89.90,False
2997,C0235,12/08/1970,23/01/1991,Hombre,Z1070,0,0,2,0.00,0.00,242.76,False


In [5]:
datos_test=pd.read_csv("Insumos/Clientes_test.csv",sep="\t")

In [6]:
datos_test

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros
0,C2172,05/10/1981,04/02/2005,Hombre,Z0403,1,2,2,735.14,2535.49,337.77
1,C1627,21/09/1983,27/12/2004,Mujer,Z0700,0,1,1,0.00,3195.94,87.96
2,C0649,24/01/1945,02/12/1967,Hombre,Z1023,0,0,0,0.00,0.00,0.00
3,C0712,11/07/1945,15/10/1966,Hombre,Z0648,0,1,0,0.00,3183.59,0.00
4,C0648,09/02/1964,11/09/1988,Hombre,Z0955,0,1,2,0.00,3613.07,238.26
...,...,...,...,...,...,...,...,...,...,...,...
888,C2417,20/08/1952,01/02/1974,Hombre,Z0815,0,1,0,0.00,4264.50,0.00
889,C2279,21/11/1983,10/03/2008,Hombre,Z1506,0,0,2,0.00,0.00,559.94
890,C3365,23/06/1981,31/03/2002,Hombre,Z0860,0,0,0,0.00,0.00,0.00
891,C0237,26/11/1958,25/01/1981,Hombre,Z0318,0,1,0,0.00,2036.26,0.00


**Dataset Dimensions**

In [7]:
print(f'The train dataset has {datos_train.shape[0]} rows and {datos_train.shape[1]} columns')
print(f'The test dataset has {datos_test.shape[0]} rows and {datos_test.shape[1]} columns')

The train dataset has 2999 rows and 12 columns
The test dataset has 893 rows and 11 columns


In [8]:
print(f'The variables in the train dataset are: {datos_train.columns.tolist()}')

The variables in the train dataset are: ['ID_Cliente', 'Fecha_Nacimiento', 'Fecha_Alta', 'Sexo', 'ID_Zona', 'Productos_Vida', 'Productos_Vehiculos', 'Productos_Otros', 'Gasto_Vida', 'Gasto_Vehiculos', 'Gasto_Otros', 'Seguro_Vivienda']


In [9]:
print(f'The variables in the test dataset are: {datos_test.columns.tolist()}')

The variables in the test dataset are: ['ID_Cliente', 'Fecha_Nacimiento', 'Fecha_Alta', 'Sexo', 'ID_Zona', 'Productos_Vida', 'Productos_Vehiculos', 'Productos_Otros', 'Gasto_Vida', 'Gasto_Vehiculos', 'Gasto_Otros']


**Combine the Datasets**

In [10]:
## Create a column named "Tipo Archivo" in each dataset to record which file a row came from
datos_train["Tipo Archivo"]="Entrenamiento"
datos_test["Tipo Archivo"]="Prueba"

In [11]:
df=pd.concat([datos_train,datos_test],axis=0) # Stack the two tables row-wise (axis=0)

In [12]:
df.head()

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros,Seguro_Vivienda,Tipo Archivo
0,C3088,29/03/1968,27/03/1989,Mujer,Z1143,0,1,0,0.0,617.55,0.0,False,Entrenamiento
1,C2975,08/12/1978,26/12/1998,Mujer,Z1201,2,0,1,973.61,0.0,386.87,False,Entrenamiento
2,C0840,31/07/1950,19/04/1972,Hombre,Z1122,0,1,2,0.0,3572.01,273.15,False,Entrenamiento
3,C0461,29/07/1945,21/07/1967,Mujer,Z1190,1,1,2,87.91,4558.71,521.66,False,Entrenamiento
4,C2777,17/10/1955,22/02/1976,Hombre,Z1344,0,1,0,0.0,4289.61,0.0,False,Entrenamiento


In [13]:
df.shape

(3892, 13)

In [14]:
df.tail()

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros,Seguro_Vivienda,Tipo Archivo
888,C2417,20/08/1952,01/02/1974,Hombre,Z0815,0,1,0,0.0,4264.5,0.0,,Prueba
889,C2279,21/11/1983,10/03/2008,Hombre,Z1506,0,0,2,0.0,0.0,559.94,,Prueba
890,C3365,23/06/1981,31/03/2002,Hombre,Z0860,0,0,0,0.0,0.0,0.0,,Prueba
891,C0237,26/11/1958,25/01/1981,Hombre,Z0318,0,1,0,0.0,2036.26,0.0,,Prueba
892,C3112,22/06/1946,12/06/1967,Hombre,Z0865,0,0,2,0.0,0.0,293.19,,Prueba
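
The row labels 888 to 892 in the tail above come from the test file, because `pd.concat` keeps each frame's original index by default, so labels from the two files repeat. The following is a minimal sketch on toy data (the frames `toy_train` and `toy_test` and their values are hypothetical, not taken from the files) showing that behaviour, how `ignore_index=True` renumbers the rows, and why the target column is empty for test rows.

```python
import pandas as pd

# Toy frames standing in for the train and test files (hypothetical values).
toy_train = pd.DataFrame({"ID_Cliente": ["C1", "C2", "C3"],
                          "Seguro_Vivienda": [False, True, False]})
toy_test = pd.DataFrame({"ID_Cliente": ["C4", "C5"]})  # no target column

# Default behaviour: each frame keeps its own row labels, so labels repeat.
stacked = pd.concat([toy_train, toy_test], axis=0)

# ignore_index=True renumbers the rows from 0 to n-1.
stacked_unique = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(stacked_unique.index.tolist())

# Columns missing from one frame are filled with NaN for its rows.
print(stacked_unique["Seguro_Vivienda"].isna().sum())
```

Repeated labels matter mainly for label-based access such as `df["col"][0]`, which can return more than one row when the index is not unique.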


**Merge the Zone Data**

In [15]:
datos_zona=pd.read_csv("Insumos/Zonas.csv",sep="\t")

In [16]:
datos_zona

Unnamed: 0,ID_Zona,Tipo_Familia,Tipo_Pareja,Tipo_Soltero,Educacion_Superior,Educacion_Media,Educacion_Baja,Poblacion_Empresario,Poblacion_Funcionario,Poblacion_Trabajador_Cualificado,Poblacion_Trabajador_No_Cualificado,Vivienda_Propiedad,Vivienda_Alquiler,Medico_Seguro_Privado,Medico_Seguridad_Social,Ingresos_Mas_De_40000,Ingresos_De_20000_Hasta_40000,Ingresos_Hasta_20000
0,Z0000,16.66,25.17,58.17,0.00,80.08,19.92,0.00,0.00,0.00,100.00,0.00,100.00,16.96,83.04,0.00,0.00,100.00
1,Z0002,54.13,45.87,0.00,40.72,43.42,15.86,70.45,29.55,0.00,0.00,83.52,16.48,78.07,21.93,0.00,55.10,44.90
2,Z0006,57.29,42.71,0.00,31.15,68.85,0.00,31.51,68.49,0.00,0.00,100.00,0.00,31.89,68.11,0.00,0.00,100.00
3,Z0007,56.37,14.73,28.91,20.01,40.77,39.21,25.79,22.76,33.59,17.86,0.00,100.00,0.00,100.00,0.00,0.00,100.00
4,Z0009,100.00,0.00,0.00,100.00,0.00,0.00,100.00,0.00,0.00,0.00,100.00,0.00,100.00,0.00,33.01,33.30,33.69
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1199,Z1833,29.80,40.49,29.71,37.11,22.43,40.46,40.19,45.32,14.48,0.00,79.21,20.79,31.49,68.51,7.78,6.40,85.83
1200,Z1836,18.03,51.77,30.21,25.37,41.11,33.52,14.73,19.65,0.00,65.62,79.06,20.94,12.54,87.46,17.82,2.27,79.92
1201,Z1838,45.38,54.62,0.00,0.00,47.89,52.11,19.37,18.51,31.21,30.91,95.92,4.08,22.55,77.45,0.00,7.95,92.05
1202,Z1841,100.00,0.00,0.00,46.14,53.86,0.00,100.00,0.00,0.00,0.00,100.00,0.00,72.90,27.10,0.00,100.00,0.00
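
Since each group of zone columns is described as a percentage breakdown, one plausible sanity check is that the columns of a group add up to roughly 100 per zone. The sketch below reuses three rows displayed above for the `Tipo_` group; the tolerance of 0.05 is an assumption chosen to absorb two-decimal rounding, and the names `toy_zonas`, `suma` and `fuera` are illustrative only.

```python
import pandas as pd

# Three zone rows copied from the table displayed above.
toy_zonas = pd.DataFrame({
    "ID_Zona": ["Z0000", "Z0002", "Z0007"],
    "Tipo_Familia": [16.66, 54.13, 56.37],
    "Tipo_Pareja": [25.17, 45.87, 14.73],
    "Tipo_Soltero": [58.17, 0.00, 28.91],
})

# Select the columns of one percentage group by prefix.
tipo_cols = [c for c in toy_zonas.columns if c.startswith("Tipo_")]

# Row-wise total of the group; should be close to 100.
suma = toy_zonas[tipo_cols].sum(axis=1)

# Zones whose total deviates by more than the assumed rounding tolerance.
fuera = toy_zonas[(suma - 100).abs() > 0.05]
print(fuera)
```

The same pattern could be repeated for the other prefixes (`Educacion_`, `Poblacion_`, `Vivienda_`, `Medico_`, `Ingresos_`).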


In [17]:
df_completa=pd.merge(df,datos_zona,on="ID_Zona",how="left")

# pd.merge(df1, df2, left_on='clave1', right_on='clave2', how='left')

In [18]:
df_completa.shape

(3892, 30)
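
A left merge keeps every customer row even when its zone has no match, in which case the zone columns are filled with NaN. As a hedged sketch on toy data (the frames and values below are hypothetical), the `validate` and `indicator` parameters of `pd.merge` can be used to confirm that each zone appears only once in the lookup table and to list customers whose zone was not found.

```python
import pandas as pd

# Hypothetical customers; zone "Z9" is absent from the zone table.
toy_clients = pd.DataFrame({"ID_Cliente": ["C1", "C2", "C3"],
                            "ID_Zona": ["Z1", "Z1", "Z9"]})
toy_zones = pd.DataFrame({"ID_Zona": ["Z1", "Z2"],
                          "Vivienda_Propiedad": [71.34, 92.04]})

merged = pd.merge(
    toy_clients, toy_zones,
    on="ID_Zona", how="left",
    validate="many_to_one",  # raises if ID_Zona repeats in toy_zones
    indicator=True,          # adds a "_merge" column with the match status
)

# Rows that found no zone: their zone columns are NaN.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(toy_clients))
print(unmatched["ID_Cliente"].tolist())
```

In this notebook the row count stays at 3892 after the merge, which is consistent with a many-to-one join; whether any customers are left without zone data would need to be checked on the real files.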

In [19]:
df_completa.head()

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros,Seguro_Vivienda,Tipo Archivo,Tipo_Familia,Tipo_Pareja,Tipo_Soltero,Educacion_Superior,Educacion_Media,Educacion_Baja,Poblacion_Empresario,Poblacion_Funcionario,Poblacion_Trabajador_Cualificado,Poblacion_Trabajador_No_Cualificado,Vivienda_Propiedad,Vivienda_Alquiler,Medico_Seguro_Privado,Medico_Seguridad_Social,Ingresos_Mas_De_40000,Ingresos_De_20000_Hasta_40000,Ingresos_Hasta_20000
0,C3088,29/03/1968,27/03/1989,Mujer,Z1143,0,1,0,0.0,617.55,0.0,False,Entrenamiento,75.1,18.27,6.63,2.75,39.2,58.05,23.7,28.17,21.01,27.13,71.34,28.66,32.77,67.23,2.23,1.47,96.3
1,C2975,08/12/1978,26/12/1998,Mujer,Z1201,2,0,1,973.61,0.0,386.87,False,Entrenamiento,62.29,32.55,5.17,40.84,41.12,18.04,37.53,41.51,11.54,9.42,92.04,7.96,43.84,56.16,0.0,3.38,96.62
2,C0840,31/07/1950,19/04/1972,Hombre,Z1122,0,1,2,0.0,3572.01,273.15,False,Entrenamiento,46.41,30.94,22.65,7.63,28.36,64.01,5.5,18.3,34.78,41.43,56.37,43.63,17.48,82.52,0.0,1.72,98.28
3,C0461,29/07/1945,21/07/1967,Mujer,Z1190,1,1,2,87.91,4558.71,521.66,False,Entrenamiento,87.86,12.14,0.0,21.17,56.95,21.87,19.05,57.13,22.28,1.54,93.38,6.62,44.57,55.43,6.76,6.39,86.84
4,C2777,17/10/1955,22/02/1976,Hombre,Z1344,0,1,0,0.0,4289.61,0.0,False,Entrenamiento,37.63,29.59,32.77,26.98,52.61,20.4,32.17,53.3,1.92,12.61,12.63,87.37,45.32,54.68,21.34,27.01,51.66


## **3) Data Cleaning**

**Rename Variables**

In [20]:
# Rename the variable Seguro_Vivienda to CompraSeguro

In [21]:
df_completa.rename(columns={"Seguro_Vivienda":"CompraSeguro"},inplace=True)

In [22]:
df_completa.head()

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros,CompraSeguro,Tipo Archivo,Tipo_Familia,Tipo_Pareja,Tipo_Soltero,Educacion_Superior,Educacion_Media,Educacion_Baja,Poblacion_Empresario,Poblacion_Funcionario,Poblacion_Trabajador_Cualificado,Poblacion_Trabajador_No_Cualificado,Vivienda_Propiedad,Vivienda_Alquiler,Medico_Seguro_Privado,Medico_Seguridad_Social,Ingresos_Mas_De_40000,Ingresos_De_20000_Hasta_40000,Ingresos_Hasta_20000
0,C3088,29/03/1968,27/03/1989,Mujer,Z1143,0,1,0,0.0,617.55,0.0,False,Entrenamiento,75.1,18.27,6.63,2.75,39.2,58.05,23.7,28.17,21.01,27.13,71.34,28.66,32.77,67.23,2.23,1.47,96.3
1,C2975,08/12/1978,26/12/1998,Mujer,Z1201,2,0,1,973.61,0.0,386.87,False,Entrenamiento,62.29,32.55,5.17,40.84,41.12,18.04,37.53,41.51,11.54,9.42,92.04,7.96,43.84,56.16,0.0,3.38,96.62
2,C0840,31/07/1950,19/04/1972,Hombre,Z1122,0,1,2,0.0,3572.01,273.15,False,Entrenamiento,46.41,30.94,22.65,7.63,28.36,64.01,5.5,18.3,34.78,41.43,56.37,43.63,17.48,82.52,0.0,1.72,98.28
3,C0461,29/07/1945,21/07/1967,Mujer,Z1190,1,1,2,87.91,4558.71,521.66,False,Entrenamiento,87.86,12.14,0.0,21.17,56.95,21.87,19.05,57.13,22.28,1.54,93.38,6.62,44.57,55.43,6.76,6.39,86.84
4,C2777,17/10/1955,22/02/1976,Hombre,Z1344,0,1,0,0.0,4289.61,0.0,False,Entrenamiento,37.63,29.59,32.77,26.98,52.61,20.4,32.17,53.3,1.92,12.61,12.63,87.37,45.32,54.68,21.34,27.01,51.66


In [23]:
## Recode the variable: False becomes 0 and True becomes 1

In [24]:
## User-defined functions

In [25]:
## Add two numbers

In [26]:
def SumarDosNumeros(x,y):
    resultado=x+y
    return resultado

In [27]:
SumarDosNumeros(x=1,y=4)

5

In [28]:
SumarDosNumeros(1,5)

6

In [29]:
# Conditionals

In [30]:
Compra=False

In [31]:
def HomologarCompra(Compra):
    if (Compra==False):
        Homologacion=0
    elif(Compra==True):
        Homologacion=1
    else:
        Homologacion=np.nan
    return Homologacion

In [32]:
df_completa["CompraSeguro"][0] # First customer

False

In [33]:
HomologarCompra(df_completa["CompraSeguro"][0]) # Using the function

0

In [34]:
## Perform the calculation for every customer in the dataset

In Python, for loops are used to iterate over a sequence (such as a list, a tuple, a dictionary, a set, or a string). The loop in cell In [35] uses one to apply HomologarCompra to every row of df_completa.
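
As a small illustration of the basic for loop syntax, the sketch below iterates over a short list and builds a second list from it. The names `compras` and `codigos` and the values are made up for the example.

```python
# A short list of purchase flags (hypothetical values).
compras = [False, True, False]

codigos = []
for compra in compras:  # the loop variable takes each element in turn
    codigos.append(1 if compra else 0)

print(codigos)  # [0, 1, 0]
```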



In [35]:
## Not recommended for large datasets; prefer vectorized operations or methods such as apply
ListaResultado=[]
for i in range(0,len(df_completa)):
    Resultado=HomologarCompra(df_completa["CompraSeguro"][i])
    ListaResultado.append(Resultado)

In [36]:
ListaResultado[0]

0
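
As a possible alternative to the explicit loop above, the same recoding can be sketched with `Series.map` and a dictionary, which avoids indexing row by row. This is only an illustration on a toy Series (`toy` and `homologada` are hypothetical names); values that are not keys of the dictionary, including NaN, come out as NaN, mirroring the `else` branch of `HomologarCompra`.

```python
import numpy as np
import pandas as pd

# Toy column mixing booleans and a missing value, like CompraSeguro
# after train and test rows are stacked (hypothetical values).
toy = pd.Series([False, True, np.nan, True], dtype="object")

# Map each value through the dictionary; unmatched values become NaN.
homologada = toy.map({False: 0, True: 1})

print(homologada.tolist())
```

Because of the missing values, the result is stored as floats (0.0 and 1.0), which matches the `0.0` values shown in the notebook output.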

In [37]:
## Add the recoded variable to the dataset

In [38]:
df_completa["CompraSeguro"]=ListaResultado

In [39]:
df_completa

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros,CompraSeguro,Tipo Archivo,Tipo_Familia,Tipo_Pareja,Tipo_Soltero,Educacion_Superior,Educacion_Media,Educacion_Baja,Poblacion_Empresario,Poblacion_Funcionario,Poblacion_Trabajador_Cualificado,Poblacion_Trabajador_No_Cualificado,Vivienda_Propiedad,Vivienda_Alquiler,Medico_Seguro_Privado,Medico_Seguridad_Social,Ingresos_Mas_De_40000,Ingresos_De_20000_Hasta_40000,Ingresos_Hasta_20000
0,C3088,29/03/1968,27/03/1989,Mujer,Z1143,0,1,0,0.00,617.55,0.00,0.0,Entrenamiento,75.10,18.27,6.63,2.75,39.20,58.05,23.70,28.17,21.01,27.13,71.34,28.66,32.77,67.23,2.23,1.47,96.30
1,C2975,08/12/1978,26/12/1998,Mujer,Z1201,2,0,1,973.61,0.00,386.87,0.0,Entrenamiento,62.29,32.55,5.17,40.84,41.12,18.04,37.53,41.51,11.54,9.42,92.04,7.96,43.84,56.16,0.00,3.38,96.62
2,C0840,31/07/1950,19/04/1972,Hombre,Z1122,0,1,2,0.00,3572.01,273.15,0.0,Entrenamiento,46.41,30.94,22.65,7.63,28.36,64.01,5.50,18.30,34.78,41.43,56.37,43.63,17.48,82.52,0.00,1.72,98.28
3,C0461,29/07/1945,21/07/1967,Mujer,Z1190,1,1,2,87.91,4558.71,521.66,0.0,Entrenamiento,87.86,12.14,0.00,21.17,56.95,21.87,19.05,57.13,22.28,1.54,93.38,6.62,44.57,55.43,6.76,6.39,86.84
4,C2777,17/10/1955,22/02/1976,Hombre,Z1344,0,1,0,0.00,4289.61,0.00,0.0,Entrenamiento,37.63,29.59,32.77,26.98,52.61,20.40,32.17,53.30,1.92,12.61,12.63,87.37,45.32,54.68,21.34,27.01,51.66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3887,C2417,20/08/1952,01/02/1974,Hombre,Z0815,0,1,0,0.00,4264.50,0.00,,Prueba,58.31,41.69,0.00,0.00,41.69,58.31,38.26,21.88,22.70,17.17,68.87,31.13,46.58,53.42,0.00,12.40,87.60
3888,C2279,21/11/1983,10/03/2008,Hombre,Z1506,0,0,2,0.00,0.00,559.94,,Prueba,53.03,41.30,5.68,20.28,57.92,21.79,31.80,30.61,8.50,29.09,35.65,64.35,30.33,69.67,0.00,0.00,100.00
3889,C3365,23/06/1981,31/03/2002,Hombre,Z0860,0,0,0,0.00,0.00,0.00,,Prueba,40.59,59.41,0.00,0.00,43.06,56.94,0.00,16.88,61.46,21.67,52.24,47.76,0.00,100.00,0.00,8.05,91.96
3890,C0237,26/11/1958,25/01/1981,Hombre,Z0318,0,1,0,0.00,2036.26,0.00,,Prueba,29.25,12.18,58.57,13.95,53.90,32.14,15.70,51.54,0.00,32.76,41.38,58.62,46.61,53.39,0.00,55.48,44.52


In [40]:
df_completa["Sexo"].unique() ## Unique values

array(['Mujer', 'Hombre'], dtype=object)

In [41]:
df_completa["Productos_Vida"].unique()

array([0, 2, 1, 4, 3, 8])

In [42]:
df_completa["Productos_Vida"].nunique()

6

In [43]:
df_completa["Productos_Vida"].isnull().sum()

0

### **Variable Mapping**

In [44]:
df_completa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3892 entries, 0 to 3891
Data columns (total 30 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ID_Cliente                           3892 non-null   object 
 1   Fecha_Nacimiento                     3892 non-null   object 
 2   Fecha_Alta                           3892 non-null   object 
 3   Sexo                                 3892 non-null   object 
 4   ID_Zona                              3892 non-null   object 
 5   Productos_Vida                       3892 non-null   int64  
 6   Productos_Vehiculos                  3892 non-null   int64  
 7   Productos_Otros                      3892 non-null   int64  
 8   Gasto_Vida                           3892 non-null   float64
 9   Gasto_Vehiculos                      3892 non-null   float64
 10  Gasto_Otros                          3892 non-null   float64
 11  CompraSeguro                  

In [45]:
def Mapeo(df):
    """Summarize each column: unique count, unique values, dtype and nulls."""
    Variables=df.columns.tolist()
    Tabla=pd.DataFrame({"Variable":Variables,
                        "Cant_ValoresUni":[df[col].nunique() for col in Variables],
                        "VectorValoresUni":[df[col].unique() for col in Variables],
                        "TipoVariable":[df[col].dtype for col in Variables],
                        "nulos":[df[col].isnull().sum() for col in Variables]})
    # Share of missing values per column, as a percentage of all rows
    Tabla["PorcentajeNulos"]=(Tabla["nulos"]/len(df))*100
    return Tabla

In [46]:
Diagnostico1=Mapeo(df_completa)

In [47]:
Diagnostico1

Unnamed: 0,Variable,Cant_ValoresUni,VectorValoresUni,TipoVariable,nulos,PorcentajeNulos
0,ID_Cliente,3892,"[C3088, C2975, C0840, C0461, C2777, C1869, C03...",object,0,0.0
1,Fecha_Nacimiento,3478,"[29/03/1968, 08/12/1978, 31/07/1950, 29/07/194...",object,0,0.0
2,Fecha_Alta,3259,"[27/03/1989, 26/12/1998, 19/04/1972, 21/07/196...",object,0,0.0
3,Sexo,2,"[Mujer, Hombre]",object,0,0.0
4,ID_Zona,1222,"[Z1143, Z1201, Z1122, Z1190, Z1344, Z1082, Z11...",object,0,0.0
5,Productos_Vida,6,"[0, 2, 1, 4, 3, 8]",int64,0,0.0
6,Productos_Vehiculos,10,"[1, 0, 2, 3, 4, 5, 6, 8, 7, 10]",int64,0,0.0
7,Productos_Otros,9,"[0, 1, 2, 3, 6, 7, 4, 5, 8]",int64,0,0.0
8,Gasto_Vida,215,"[0.0, 973.61, 87.91, 103.63, 167.85, 396.99, 3...",float64,0,0.0
9,Gasto_Vehiculos,2332,"[617.55, 0.0, 3572.01, 4558.71, 4289.61, 1007....",float64,0,0.0


## **Variable Recoding**

In [48]:
# Fecha_Nacimiento and Fecha_Alta should be of type datetime

In [49]:
df_completa.head()

Unnamed: 0,ID_Cliente,Fecha_Nacimiento,Fecha_Alta,Sexo,ID_Zona,Productos_Vida,Productos_Vehiculos,Productos_Otros,Gasto_Vida,Gasto_Vehiculos,Gasto_Otros,CompraSeguro,Tipo Archivo,Tipo_Familia,Tipo_Pareja,Tipo_Soltero,Educacion_Superior,Educacion_Media,Educacion_Baja,Poblacion_Empresario,Poblacion_Funcionario,Poblacion_Trabajador_Cualificado,Poblacion_Trabajador_No_Cualificado,Vivienda_Propiedad,Vivienda_Alquiler,Medico_Seguro_Privado,Medico_Seguridad_Social,Ingresos_Mas_De_40000,Ingresos_De_20000_Hasta_40000,Ingresos_Hasta_20000
0,C3088,29/03/1968,27/03/1989,Mujer,Z1143,0,1,0,0.0,617.55,0.0,0.0,Entrenamiento,75.1,18.27,6.63,2.75,39.2,58.05,23.7,28.17,21.01,27.13,71.34,28.66,32.77,67.23,2.23,1.47,96.3
1,C2975,08/12/1978,26/12/1998,Mujer,Z1201,2,0,1,973.61,0.0,386.87,0.0,Entrenamiento,62.29,32.55,5.17,40.84,41.12,18.04,37.53,41.51,11.54,9.42,92.04,7.96,43.84,56.16,0.0,3.38,96.62
2,C0840,31/07/1950,19/04/1972,Hombre,Z1122,0,1,2,0.0,3572.01,273.15,0.0,Entrenamiento,46.41,30.94,22.65,7.63,28.36,64.01,5.5,18.3,34.78,41.43,56.37,43.63,17.48,82.52,0.0,1.72,98.28
3,C0461,29/07/1945,21/07/1967,Mujer,Z1190,1,1,2,87.91,4558.71,521.66,0.0,Entrenamiento,87.86,12.14,0.0,21.17,56.95,21.87,19.05,57.13,22.28,1.54,93.38,6.62,44.57,55.43,6.76,6.39,86.84
4,C2777,17/10/1955,22/02/1976,Hombre,Z1344,0,1,0,0.0,4289.61,0.0,0.0,Entrenamiento,37.63,29.59,32.77,26.98,52.61,20.4,32.17,53.3,1.92,12.61,12.63,87.37,45.32,54.68,21.34,27.01,51.66


In [50]:
df_completa["Fecha_Nacimiento"]=pd.to_datetime(df_completa["Fecha_Nacimiento"], format="%d/%m/%Y")
df_completa["Fecha_Alta"]=pd.to_datetime(df_completa["Fecha_Alta"], format="%d/%m/%Y")
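
Passing an explicit `format` makes the day-first interpretation unambiguous, so a string such as 08/12/1978 is read as 8 December rather than 12 August. As a hedged illustration on toy strings (the third value is a deliberately invalid date invented for the example), `errors="coerce"` turns values that do not fit the format into NaT instead of raising an error, which makes them easy to count.

```python
import pandas as pd

# Two dates from the notebook plus one invalid date (hypothetical).
toy_fechas = pd.Series(["29/03/1968", "08/12/1978", "31/02/1950"])

# Day/month/year parsing; unparseable entries become NaT.
parsed = pd.to_datetime(toy_fechas, format="%d/%m/%Y", errors="coerce")

print(parsed.dt.month.tolist())
print(parsed.isna().sum())
```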

In [51]:
Diagnostico2=Mapeo(df_completa)
Diagnostico2

Unnamed: 0,Variable,Cant_ValoresUni,VectorValoresUni,TipoVariable,nulos,PorcentajeNulos
0,ID_Cliente,3892,"[C3088, C2975, C0840, C0461, C2777, C1869, C03...",object,0,0.0
1,Fecha_Nacimiento,3478,"[1968-03-29T00:00:00.000000000, 1978-12-08T00:...",datetime64[ns],0,0.0
2,Fecha_Alta,3259,"[1989-03-27T00:00:00.000000000, 1998-12-26T00:...",datetime64[ns],0,0.0
3,Sexo,2,"[Mujer, Hombre]",object,0,0.0
4,ID_Zona,1222,"[Z1143, Z1201, Z1122, Z1190, Z1344, Z1082, Z11...",object,0,0.0
5,Productos_Vida,6,"[0, 2, 1, 4, 3, 8]",int64,0,0.0
6,Productos_Vehiculos,10,"[1, 0, 2, 3, 4, 5, 6, 8, 7, 10]",int64,0,0.0
7,Productos_Otros,9,"[0, 1, 2, 3, 6, 7, 4, 5, 8]",int64,0,0.0
8,Gasto_Vida,215,"[0.0, 973.61, 87.91, 103.63, 167.85, 396.99, 3...",float64,0,0.0
9,Gasto_Vehiculos,2332,"[617.55, 0.0, 3572.01, 4558.71, 4289.61, 1007....",float64,0,0.0


In [54]:
## Variables that should be categorical

In [59]:
df_completa[["Sexo","Productos_Vida","Productos_Vehiculos","Productos_Otros","CompraSeguro"]]=df_completa[["Sexo","Productos_Vida","Productos_Vehiculos","Productos_Otros","CompraSeguro"]].astype("category")
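
An equivalent way to express this conversion is to pass a dictionary of column names to `astype`, which some readers may find easier to scan than the long list on both sides of the assignment. The sketch below uses a toy frame (hypothetical values) and also shows how to inspect the resulting categories.

```python
import pandas as pd

toy = pd.DataFrame({"Sexo": ["Mujer", "Hombre", "Mujer"],
                    "Productos_Vida": [0, 2, 0]})

# Convert selected columns to the category dtype in one call.
toy = toy.astype({"Sexo": "category", "Productos_Vida": "category"})

print(toy.dtypes)
# The distinct values become the categories of the column.
print(toy["Productos_Vida"].cat.categories.tolist())
print(toy["Sexo"].value_counts().to_dict())
```

One trade-off to keep in mind: once the product counts are stored as categories, numeric summaries such as the mean are no longer available on those columns without converting them back.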

In [60]:
Diagnostico3=Mapeo(df_completa)
Diagnostico3


Unnamed: 0,Variable,Cant_ValoresUni,VectorValoresUni,TipoVariable,nulos,PorcentajeNulos
0,ID_Cliente,3892,"[C3088, C2975, C0840, C0461, C2777, C1869, C03...",object,0,0.0
1,Fecha_Nacimiento,3478,"[1968-03-29T00:00:00.000000000, 1978-12-08T00:...",datetime64[ns],0,0.0
2,Fecha_Alta,3259,"[1989-03-27T00:00:00.000000000, 1998-12-26T00:...",datetime64[ns],0,0.0
3,Sexo,2,"['Mujer', 'Hombre'] Categories (2, object): ['...",category,0,0.0
4,ID_Zona,1222,"[Z1143, Z1201, Z1122, Z1190, Z1344, Z1082, Z11...",object,0,0.0
5,Productos_Vida,6,"[0, 2, 1, 4, 3, 8] Categories (6, int64): [0, ...",category,0,0.0
6,Productos_Vehiculos,10,"[1, 0, 2, 3, 4, 5, 6, 8, 7, 10] Categories (10...",category,0,0.0
7,Productos_Otros,9,"[0, 1, 2, 3, 6, 7, 4, 5, 8] Categories (9, int...",category,0,0.0
8,Gasto_Vida,215,"[0.0, 973.61, 87.91, 103.63, 167.85, 396.99, 3...",float64,0,0.0
9,Gasto_Vehiculos,2332,"[617.55, 0.0, 3572.01, 4558.71, 4289.61, 1007....",float64,0,0.0
