# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

The collected dataset relates to 17 campaigns that took place between May 2008 and November 2010, corresponding to a total of 79,354 contacts.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [4]:
import pandas as pd

In [6]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [8]:
df.head(5)

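| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |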


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```
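The description above encodes missing information as the string `'unknown'` and as the sentinel value 999 in `pdays`, so a plain null check reports nothing missing. The following is a minimal sketch of how these placeholders could be counted. The `sample` frame and its rows are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the encoding described above.
sample = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 3, 999, 999],
})

# Count the 'unknown' placeholder in each text column.
text_cols = ["job", "default"]
unknown_counts = (sample[text_cols] == "unknown").sum()

# Treat the 999 sentinel in pdays as "not previously contacted".
never_contacted = int((sample["pdays"] == 999).sum())
pdays_clean = sample["pdays"].replace(999, np.nan)

print(unknown_counts)
print(never_contacted, int(pdays_clean.isna().sum()))
```

On the real data the same expressions could be applied to `df`, listing every categorical column in `text_cols`.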



### Data Preparation
There are 41,188 valid samples.<br>
The 4 rows (samples) with duration = 0 and y = 'no' will be discarded, for the reason given in the data description above.<br>
These are the features that must be converted to binary: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, y. The Engineering Features section shows the code that transforms the object-typed data to binary.
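The row filter described here could look like the sketch below. It runs on a made-up four-row frame (`toy` is hypothetical); on the real data the same expression would be applied to `df`. Dropping the `duration` column itself is an additional step suggested by the data description for realistic models, not something this notebook states it does.

```python
import pandas as pd

# Hypothetical miniature of the dataset; only the columns needed here.
toy = pd.DataFrame({
    "age": [39, 59, 41, 30],
    "duration": [0, 0, 210, 95],
    "y": ["no", "no", "yes", "no"],
})

# Drop the rows with zero call time.
cleaned = toy[toy["duration"] != 0].reset_index(drop=True)

# For a realistic model, the duration column can be dropped as well,
# because it is only known after the call has ended.
realistic = cleaned.drop(columns="duration")

print(len(toy) - len(cleaned), "rows dropped")
print(list(realistic.columns))
```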

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

### Business Understanding

According to the Business Understanding phase of CRISP-DM, the business objective is to increase the efficiency of the targeted campaigns for long-term deposit subscriptions by reducing the number of contacts that need to be made.<br>
A second objective is to compare the performance of the classifiers (k-nearest neighbors, logistic regression, decision trees, and support vector machines) covered in this section of the program.


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [12]:
df[df.duration == 0]

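| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6251 | 39 | admin. | married | high.school | no | yes | no | telephone | may | tue | ... | 4 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 23031 | 59 | management | married | university.degree | no | yes | no | cellular | aug | tue | ... | 10 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.965 | 5228.1 | no |
| 28063 | 53 | blue-collar | divorced | high.school | no | yes | no | cellular | apr | fri | ... | 3 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.479 | 5099.1 | no |
| 33015 | 31 | blue-collar | married | basic.9y | no | no | no | cellular | may | mon | ... | 2 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.299 | 5099.1 | no |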


### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [16]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import time

In [18]:
# Convert the response variable 'y' to numeric values
data = df.copy()  # work on a copy so the original df is left unchanged
label_encoder = LabelEncoder()
data['y'] = label_encoder.fit_transform(data['y'])

# Convert categorical variables to dummy variables
data = pd.get_dummies(data, drop_first=True)
data = data.apply(lambda x: x.astype(int) if x.dtype == 'bool' else x)

# Split the dataset into independent variables and the dependent variable
X = data.drop('y', axis=1)
y = data['y']
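The task statement asks for the bank information features only, while the cell above encodes every column. As an alternative, here is a minimal sketch that restricts the input to the seven bank client columns from the data description and encodes them with a `ColumnTransformer`. The `toy` rows are made up, and the choice of `OneHotEncoder` plus `StandardScaler` is an assumption, not the notebook's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Bank client columns as listed in the data description (items 1-7).
bank_cols = ["age", "job", "marital", "education", "default", "housing", "loan"]
categorical = [c for c in bank_cols if c != "age"]

# Hypothetical rows standing in for df[bank_cols].
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "services"],
    "marital": ["married", "married", "single"],
    "education": ["basic.4y", "high.school", "high.school"],
    "default": ["no", "unknown", "no"],
    "housing": ["no", "no", "yes"],
    "loan": ["no", "no", "no"],
})

# Scale the numeric column and one-hot encode the categorical ones.
encoder = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
X_bank = encoder.fit_transform(toy[bank_cols])
print(X_bank.shape)
```

`handle_unknown="ignore"` keeps the encoder from failing if a category appears at prediction time that was absent during fitting.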

### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [20]:
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
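Fitting the scaler on all rows before splitting lets test-set statistics influence the training features. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that order on synthetic data; `X_demo` and `y_demo` are stand-ins, and `stratify` is an optional addition that keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix X and target y.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, keeping the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training part only, then reuse it on the test part.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # close to zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # generally not exactly zero
```

Because the reported results in this notebook were produced with the original cell, switching to this order may change the numbers slightly.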


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [22]:
# Define the classification models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC()
}
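The baseline question can be answered with a model that always predicts the most frequent class. A minimal sketch with scikit-learn's `DummyClassifier` follows. The 89/11 class ratio is an assumed, illustrative value, not a figure measured from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic imbalanced target: 89 negatives and 11 positives (assumed ratio).
y_demo = np.array([0] * 89 + [1] * 11)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# The baseline always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

On the real data, `y.value_counts(normalize=True).max()` gives the same number directly; any classifier should beat it to be useful.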

### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [24]:
# Create a list to store each model's results
results_list = []

# Train each model, time the fit, and record the report
for model_name, model in models.items():
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    training_time = end_time - start_time
    
    # Get the classification report
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    
    # Append the results to the list as a dictionary
    results_list.append({
        'Model': model_name,
        'Accuracy': report['accuracy'],
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score'],
        'Training Time (seconds)': training_time
    })


### Problem 9: Score the Model

What is the accuracy of your model?

In [26]:
# Convert the list of results into a DataFrame
results = pd.DataFrame(results_list)
results

| | Model | Accuracy | Precision | Recall | F1-Score | Training Time (seconds) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Logistic Regression | 0.912276 | 0.90158 | 0.912276 | 0.903235 | 0.357257 |
| 1 | Decision Tree | 0.889941 | 0.890774 | 0.889941 | 0.890352 | 0.278601 |
| 2 | K-Nearest Neighbors | 0.89771 | 0.881444 | 0.89771 | 0.885313 | 0.0 |
| 3 | Support Vector Machine | 0.909282 | 0.8967 | 0.909282 | 0.897472 | 19.820414 |


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|   Logistic Regression  |  0.316759  |0.911762     |0.912276     |
|   Decision Tree  |  0.303312  |1.000000     |0.889941     |
|   K-Nearest Neighbors  | 0.006551  |0.919496     |0.897710     |
|   Support Vector Machine  |  19.701015  |0.92508     |0.909282     |

In [28]:
from sklearn.metrics import accuracy_score

In [30]:

# Create a list to store each model's results
results_list = []

# Train each model, time the fit, and record the scores
for model_name, model in models.items():
    # Measure the training time
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    training_time = end_time - start_time
    
    # Compute accuracy on the training and test sets
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    
    # Append the results to the list as a dictionary
    results_list.append({
        'Model': model_name,
        'Training Time (seconds)': training_time,
        'Train Accuracy': train_accuracy,
        'Test Accuracy': test_accuracy
    })

# Convert the list of results into a DataFrame
results = pd.DataFrame(results_list)
print(results)


                    Model  Training Time (seconds)  Train Accuracy  \
0     Logistic Regression                 0.316759        0.911762   
1           Decision Tree                 0.303312        1.000000   
2     K-Nearest Neighbors                 0.006551        0.919496   
3  Support Vector Machine                19.701015        0.925081   

   Test Accuracy  
0       0.912276  
1       0.889941  
2       0.897710  
3       0.909282  


In [32]:
results

| | Model | Training Time (seconds) | Train Accuracy | Test Accuracy |
| --- | --- | --- | --- | --- |
| 0 | Logistic Regression | 0.316759 | 0.911762 | 0.912276 |
| 1 | Decision Tree | 0.303312 | 1.0 | 0.889941 |
| 2 | K-Nearest Neighbors | 0.006551 | 0.919496 | 0.89771 |
| 3 | Support Vector Machine | 19.701015 | 0.925081 | 0.909282 |


### Evaluate Results
1. Logistic Regression and Support Vector Machine are the most suitable models because they reach the highest accuracy values.
2. Logistic Regression shows a minimal gap between Train Accuracy and Test Accuracy.
3. Regarding training time, Support Vector Machine clearly takes much longer than the other models.
4. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), and the area under it (AUC) indicates how well the model discriminates between classes: the higher the better, with an ideal model scoring 1.0. The results show clear predictive capability for Logistic Regression (AUC 0.94) and SVM (AUC 0.93). The ROC curves and AUC values can be seen at this [link](Evaluation.ipynb).
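The ROC and AUC computation mentioned in the last point can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the code in `Evaluation.ipynb`; the names `X_demo`, `y_demo`, and `clf` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC analysis needs continuous scores, not hard class labels.
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

print(f"AUC: {auc:.3f}")
```

For an `SVC` created with default settings, `decision_function` can supply the scores instead, since `predict_proba` is only available when `probability=True`.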


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric
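For the last point, one option on an imbalanced target is to rank candidates by the F1 score of the positive class rather than by accuracy. The sketch below shows how the `scoring` argument of `GridSearchCV` could be changed. It uses synthetic data and a small illustrative grid, and F1 is only one of several reasonable choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives) as a stand-in.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Tune tree depth, ranking candidates by F1 on the positive class
# instead of accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None]},
    scoring="f1",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```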

### Model Improvements - Second CRISP-DM Iteration
To improve the classification models, we decided to use Grid Search to select the best hyperparameters.
We consider all of the selected features to be valid, so we keep them in this second iteration of the CRISP-DM methodology.

In [34]:
# Import the required libraries
from sklearn.model_selection import GridSearchCV

# Define the classification models and the parameter grids for Grid Search
models_params = {
    "Logistic Regression": {
        'model': LogisticRegression(max_iter=1000),
        'params': {
            'C': [0.01, 0.1, 1, 10, 100],  # Inverse regularization strength
            'solver': ['lbfgs', 'liblinear']
        }
    },
    "Decision Tree": {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            'max_depth': [5, 10, 15, 20, None],
            'min_samples_split': [2, 5, 10]
        }
    },
    "K-Nearest Neighbors": {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': [3, 5, 7, 10],
            'weights': ['uniform', 'distance']
        }
    },
    "Support Vector Machine": {
        'model': SVC(),
        'params': {
            'C': [0.1, 1, 10, 100],
            'kernel': ['linear', 'rbf'],
            'gamma': ['scale', 'auto']
        }
    }
}

# Create a list to store each model's results
results_list = []

# Run Grid Search for each model, then train and evaluate
for model_name, mp in models_params.items():
    grid_search = GridSearchCV(mp['model'], mp['params'], cv=5, n_jobs=-1, scoring='accuracy')
    
    # Time the full grid search (all candidates and folds, plus the final refit)
    start_time = time.time()
    print(f"Modelo: {model_name} Inicio: {start_time}")
    grid_search.fit(X_train, y_train)
    end_time = time.time()
    print(f"Modelo: {model_name} Fin: {end_time}")  # report the end time, not the start time
    training_time = end_time - start_time
    
    # Use the best model found by GridSearchCV
    best_model = grid_search.best_estimator_
    
    # Compute accuracy on the training and test sets
    train_accuracy = best_model.score(X_train, y_train)
    test_accuracy = best_model.score(X_test, y_test)
    
    # Append the results to the list as a dictionary
    results_list.append({
        'Model': model_name,
        'Best Parameters': grid_search.best_params_,
        'Train Accuracy': train_accuracy,
        'Test Accuracy': test_accuracy,
        'Training Time (seconds)': training_time
    })

# Convert the list of results into a DataFrame and display it
results = pd.DataFrame(results_list)
print(results)


Modelo: Logistic Regression Inicio: 1730318235.1286342
Modelo: Logistic Regression Fin: 1730318235.1286342
Modelo: Decision Tree Inicio: 1730318250.273429
Modelo: Decision Tree Fin: 1730318250.273429
Modelo: K-Nearest Neighbors Inicio: 1730318259.2306888
Modelo: K-Nearest Neighbors Fin: 1730318259.2306888
Modelo: Support Vector Machine Inicio: 1730318287.9124832
Modelo: Support Vector Machine Fin: 1730318287.9124832
                    Model                              Best Parameters  \
0     Logistic Regression                  {'C': 1, 'solver': 'lbfgs'}   
1           Decision Tree     {'max_depth': 5, 'min_samples_split': 5}   
2     K-Nearest Neighbors    {'n_neighbors': 10, 'weights': 'uniform'}   
3  Support Vector Machine  {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}   

   Train Accuracy  Test Accuracy  Training Time (seconds)  
0        0.911762       0.912276                15.144795  
1        0.917450       0.916161                 8.941639  
2        0.908397       0.899

### Evaluation of the Second Iteration

After tuning the hyperparameters of the classification models, SVM and Decision Tree have the best Train Accuracy, but on Test Accuracy, Decision Tree and Logistic Regression perform best.
SVM also reached good accuracy; however, its time and resource consumption is far greater than that of the other models.

In [38]:
results

| | Model | Best Parameters | Train Accuracy | Test Accuracy | Training Time (seconds) |
| --- | --- | --- | --- | --- | --- |
| 0 | Logistic Regression | {'C': 1, 'solver': 'lbfgs'} | 0.911762 | 0.912276 | 15.144795 |
| 1 | Decision Tree | {'max_depth': 5, 'min_samples_split': 5} | 0.91745 | 0.916161 | 8.941639 |
| 2 | K-Nearest Neighbors | {'n_neighbors': 10, 'weights': 'uniform'} | 0.908397 | 0.899895 | 21.525683 |
| 3 | Support Vector Machine | {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'} | 0.925081 | 0.909282 | 24765.285198 |
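The training times in this table cover the entire grid search, so they are not directly comparable with the single-fit times from the first iteration. As a hedged sketch, the fitted `GridSearchCV` object exposes `refit_time_` and `cv_results_["mean_fit_time"]`, which could be used to separate the cost of the final model from the cost of the search. The data and the small grid below are synthetic placeholders.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=5,
)

# Wall-clock time of the whole search, as measured in the notebook.
start = time.perf_counter()
search.fit(X_demo, y_demo)
search_seconds = time.perf_counter() - start

# refit_time_: seconds spent refitting the best model on all training rows.
# mean_fit_time: average fit time per candidate across the CV folds.
print(f"Whole search: {search_seconds:.3f}s")
print(f"Final refit:  {search.refit_time_:.3f}s")
print(search.cv_results_["mean_fit_time"])
```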


### Next Steps and Recommendations

1. For future work, we recommend collecting more client-based data, in order to check whether high-quality predictive models can be achieved without contact-based information.
2. The information to collect could include the client's number of children, whether they are of school age, whether the client has a preference for taking vacations, an upcoming vehicle replacement date, or an interest in pursuing a master's degree.
3. We also plan to apply the best DM/ML models in a real environment, working more closely with marketing managers, in order to obtain valuable feedback.


### Conclusions

1. Logistic Regression, Decision Tree, and SVM all achieved high predictive performance. However, we recommend Logistic Regression or Decision Tree for their handling of imbalanced classes, training speed, and high interpretability.
2. The sensitivity analysis indicates that call duration is an important feature for product acquisition, so we propose that agents increase the duration of their phone calls and/or schedule them in campaigns during the most favorable months.
3. Another important result is the confirmation that the open-source tools sklearn, matplotlib, seaborn, and pandas can deliver high-quality ML models for real applications, which reduces the cost of ML projects.
4. The CRISP-DM methodology is well suited to Machine Learning projects on bank direct marketing campaigns. Each CRISP-DM iteration proved valuable, as the predictive performance obtained increased.


### Appendices

The following information can be reviewed at this [link](Evaluation.ipynb)
1. ROC curve and AUC for the models analyzed
2. Coefficients for the Logistic Regression model
3. Feature importance in SVM via Permutation Importance
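As a rough illustration of the second and third items, the sketch below reads Logistic Regression coefficients and computes permutation importance with `sklearn.inspection.permutation_importance`. It is not the code from `Evaluation.ipynb`: the data is synthetic, the feature names are hypothetical, and Logistic Regression is used for both parts only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Coefficients: one weight per feature for a binary problem.
coefs = pd.Series(clf.coef_[0], index=feature_names)

# Permutation importance: mean drop in test score when a feature is shuffled.
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)
importances = pd.Series(perm.importances_mean, index=feature_names)

print(coefs.sort_values(key=abs, ascending=False))
print(importances.sort_values(ascending=False))
```

The same `permutation_importance` call accepts any fitted estimator, including an `SVC`, although it is slower there because each repeat requires new predictions.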

##### Questions