# EDA - Insurance Company Benchmark (COIL 2000)

### Business context

This dataset comes from the CoIL Challenge 2000, organized by the Dutch company Sentient Machine Research. The goal was to develop data mining models that identify which customers of an insurer are most likely to buy a caravan insurance policy (mobile home policies). This reflects a real business case in the insurance sector: optimizing marketing campaigns by targeting customers with a high purchase probability.

### Dataset description

- Number of instances: 9,822 customers in total (training: 5,822; evaluation: 4,000).

- Number of attributes: 86 variables:

  - Sociodemographic (1-43): derived from zip codes (e.g. housing type, demographic distribution).

  - Product usage (44-85): ownership of prior insurance policies and financial products.

  - Target variable (86): CARAVAN → number of caravan insurance policies (0/1).

- Associated tasks: classification, regression, customer segmentation.

- Data types: categorical and integer.

- Format: tab-delimited .txt files.
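
The column layout described above maps directly onto positional slices. The following is a minimal sketch, assuming a tab-delimited file with no header row; the synthetic in-memory sample and the names `socio`, `products`, and `target` are illustrative and not part of the original notebook.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for a tab-delimited file: 3 rows x 86 integer columns, no header row.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=(3, 86))
buffer = io.StringIO("\n".join("\t".join(map(str, row)) for row in sample))

# header=None keeps the first record as data; columns are labelled 0..85.
df = pd.read_csv(buffer, sep="\t", header=None)

# Attribute numbers in the description are 1-based; iloc is 0-based.
socio = df.iloc[:, 0:43]                    # attributes 1-43
products = df.iloc[:, 43:85]                # attributes 44-85
target = df.iloc[:, 85].rename("CARAVAN")   # attribute 86

print(socio.shape, products.shape, target.name)
```

Slicing by position avoids depending on column names, which the raw files may not carry.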


### References

- Van der Putten, P., & Van Someren, M. (2000). CoIL Challenge 2000: The Insurance Company Case. Sentient Machine Research, Amsterdam.
- Available from: UCI Machine Learning Repository

In [2]:
# Loading the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [3]:
# Setting up tweaks for the visualization
sns.set(style="whitegrid")

## Data loading

In [4]:
# Note @Mls: change the path to match your local machine.
path = 'C:/Users/mizlop/OneDrive - SAS/Documents/SAS_git/MLOps/proyect_team42/data/insurance_company_original.csv'
insurance_df = pd.read_csv(path)
insurance_df.head()

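|  | 33 | 1 | 3 | 2 | 8 | 0 | 5 | 1.1 | 3.1 | 7 | ... | 0.37 | 0.38 | 0.39 | 1.13 | 0.40 | 0.41 | 0.42 | 0.43 | 0.44 | 0.45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 1 | 2 | 2 | 8 | 1 | 4 | 1 | 4 | 6 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 37 | 1 | 2 | 2 | 8 | 0 | 4 | 2 | 4 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 9 | 1 | 3 | 3 | 3 | 2 | 3 | 2 | 4 | 5 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 40 | 1 | 4 | 2 | 10 | 1 | 4 | 1 | 4 | 7 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 23 | 1 | 2 | 1 | 5 | 0 | 5 | 0 | 5 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |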
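
The column labels in this output (`33`, `1`, `3`, `2`, `8`, and suffixed names such as `1.1` and `3.1`) look like data values rather than attribute names, which suggests that `read_csv` treated the first record as the header. The sketch below reproduces that behaviour on a small in-memory CSV and shows `header=None` as one possible fix. The sample lines are taken from the labels and rows above, truncated to ten columns for illustration.

```python
import io

import pandas as pd

raw = "33,1,3,2,8,0,5,1,3,7\n37,1,2,2,8,1,4,1,4,6\n37,1,2,2,8,0,4,2,4,3\n"

# Default behaviour: the first line becomes the header, so one record is lost
# and repeated values get suffixes such as "1.1" and "3.1".
with_header = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data and labels the columns 0..n-1.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(list(with_header.columns))
print(len(with_header), len(no_header))
```

Applied to this notebook, the equivalent call would presumably be `pd.read_csv(path, header=None)`, assuming the CSV file really has no header row.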


In [5]:
insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5821 entries, 0 to 5820
Data columns (total 86 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   33      5821 non-null   int64
 1   1       5821 non-null   int64
 2   3       5821 non-null   int64
 3   2       5821 non-null   int64
 4   8       5821 non-null   int64
 5   0       5821 non-null   int64
 6   5       5821 non-null   int64
 7   1.1     5821 non-null   int64
 8   3.1     5821 non-null   int64
 9   7       5821 non-null   int64
 10  0.1     5821 non-null   int64
 11  2.1     5821 non-null   int64
 12  1.2     5821 non-null   int64
 13  2.2     5821 non-null   int64
 14  6       5821 non-null   int64
 15  1.3     5821 non-null   int64
 16  2.3     5821 non-null   int64
 17  7.1     5821 non-null   int64
 18  1.4     5821 non-null   int64
 19  0.2     5821 non-null   int64
 20  1.5     5821 non-null   int64
 21  2.4     5821 non-null   int64
 22  5.1     5821 non-null   int64
 23  2.5     5821 
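
With 86 columns the `info()` listing is long. A more compact check of the same two facts it reports (dtype and non-null count) might look like the sketch below, where the small frame is a hypothetical stand-in for `insurance_df`.

```python
import pandas as pd

# Hypothetical stand-in for insurance_df.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 1], "c": [5, 5, 5]})

# Number of columns per dtype.
dtype_counts = df.dtypes.value_counts()

# Total number of missing cells across the whole frame.
missing_total = int(df.isna().sum().sum())

print(dtype_counts)
print("missing cells:", missing_total)
```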

In [6]:
insurance_df.head().T

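|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 33 | 37 | 37 | 9 | 40 | 23 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 2 | 2 | 3 | 4 | 2 |
| 2 | 2 | 2 | 3 | 2 | 1 |
| 8 | 8 | 8 | 3 | 10 | 5 |
| ... | ... | ... | ... | ... | ... |
| 0.41 | 0 | 0 | 0 | 0 | 0 |
| 0.42 | 0 | 0 | 0 | 0 | 0 |
| 0.43 | 0 | 0 | 0 | 0 | 0 |
| 0.44 | 0 | 0 | 0 | 0 | 0 |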


In [7]:
insurance_df.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 33 | 5821.0 | 24.251847 | 12.847298 | 1.0 | 10.0 | 30.0 | 35.0 | 41.0 |
| 1 | 5821.0 | 1.110634 | 0.405874 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 |
| 3 | 5821.0 | 2.678749 | 0.789891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| 2 | 5821.0 | 2.991410 | 0.814555 | 1.0 | 2.0 | 3.0 | 3.0 | 6.0 |
| 8 | 5821.0 | 5.773235 | 2.856856 | 1.0 | 3.0 | 7.0 | 8.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0.41 | 5821.0 | 0.006013 | 0.081639 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 0.42 | 5821.0 | 0.031781 | 0.211003 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 0.43 | 5821.0 | 0.007902 | 0.090471 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 0.44 | 5821.0 | 0.014259 | 0.120006 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
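
The last columns shown have all three quartiles at 0 and a small maximum, meaning that most rows hold a zero there. One way to quantify this is the share of zeros per column. The sketch below uses a hypothetical three-column frame in place of `insurance_df`, and the 75% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the product-usage columns of insurance_df.
df = pd.DataFrame({
    "p1": [0, 0, 0, 1],
    "p2": [0, 2, 0, 0],
    "p3": [1, 1, 0, 1],
})

# Fraction of rows equal to zero in each column, largest first.
zero_share = (df == 0).mean().sort_values(ascending=False)

# Columns where at least 75% of the rows are zero.
sparse_cols = zero_share[zero_share >= 0.75].index.tolist()

print(zero_share)
print(sparse_cols)
```

Columns flagged this way are candidates for binarization or for closer inspection before modelling.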


## Rest of the EDA