
This section breaks down and analyzes the code of the machine learning models that we applied to the database. The main objective of this stage is to identify the customer profile best suited to serve as the user model we are looking for: the ideal person who not only buys our products, but also allows the company to focus its advertising strategies more effectively.

The intention is to create a predictive model that, based on the characteristics and buying patterns of users, will help us predict which customers are most likely to be interested in our products. This will allow us to optimize resources and direct our marketing efforts towards a specific market segment, maximizing the impact of our campaigns. Therefore, we will break down each model to understand its logic, its input variables, and the results obtained in the tests, in order to adjust the parameters and refine the accuracy of the model in future applications.

This code is a set of Python instructions used to install and load various libraries and modules essential for data analysis and machine learning model building. Here's a breakdown of each part:

Library Installation
%pip install: These lines install required packages directly into the Jupyter Notebook or Google Colab environment.
matplotlib: Library for creating 2D graphics visualizations.
seaborn: Based on matplotlib, provides a high-level interface for drawing statistical graphs.
scipy: Library for scientific and technical calculations.
statsmodels: Tool for estimating statistical models.
scikit-learn: Library for machine learning that includes tools for classification, regression and preprocessing.
wquantiles: For calculating weighted quantiles.
openpyxl: For reading and writing Excel files.
opendatasets: For downloading datasets from various sources directly from Python.

Library Imports
Imports: The libraries that were previously installed, as well as some specific modules, are imported below:
pandas and numpy: fundamental libraries for data manipulation and numerical calculations.
seaborn and matplotlib.pyplot: For data visualization.
opendatasets: To manage the download of data sets.
sklearn: Includes several functionalities for machine learning.
wquantiles: For working with weighted quantiles.
os: For interactions with the operating system.

Specific Imports
Specific imports: These lines import specific functions and classes from libraries:
Path from pathlib: To handle file paths.
trim_mean from scipy.stats: To calculate the trimmed mean.
robust from statsmodels: For robust statistical methods.
sklearn.model_selection functions: To split data into training and test sets.
Classes like SVR, SVC, DecisionTreeRegressor, LinearRegression, DecisionTreeClassifier, and LogisticRegression: For different machine learning algorithms (regression and classification).
Metrics such as mean_squared_error, r2_score, accuracy_score, f1_score, and classification_report: To evaluate the performance of the models.
Axes3D from mpl_toolkits.mplot3d: To create 3D graphics.
KNNImputer: To impute missing data using the K-nearest neighbors method.
PCA: For principal component analysis, used in dimensionality reduction.
SGDClassifier: For a stochastic gradient descent classifier.

In [1]:
%pip install matplotlib
%matplotlib inline
%pip install seaborn
%pip install scipy
%pip install statsmodels
%pip install scikit-learn
%pip install wquantiles
%pip install openpyxl
%pip install opendatasets

import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import opendatasets as od
import sklearn
import wquantiles
import os

from pathlib import Path
from scipy.stats import trim_mean
from statsmodels import robust
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, f1_score, classification_report
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier


Note: you may need to restart the kernel to use updated packages.





The %load_ext kedro.ipython line of code is used in a Jupyter Notebook environment and is related to the Kedro library, which is a framework for developing and maintaining data science projects in a structured and reproducible manner.

Code Breakdown
%load_ext: This is an IPython/Jupyter magic command. Magic commands are special commands that provide additional functionality on notebooks, allowing users to perform specific tasks more easily. The use of %load_ext allows extensions to be loaded into the IPython environment.

kedro.ipython: This is the name of the extension being loaded. The kedro.ipython extension includes commands and tools that make it easy to integrate Kedro into a Jupyter Notebook environment. This may include functionality for:

Execute Kedro pipelines directly from the notebook.
Visualize the nodes and data of the pipelines.
Facilitate the loading and manipulation of data managed by Kedro.

In [2]:
%load_ext kedro.ipython

In [3]:
%reload_kedro

By using catalog, you can access the data sets you have defined in your Kedro project. This can include data stored in formats such as CSV, Excel, SQL, and others, as well as models and intermediate results.

In [4]:
catalog


[1m{[0m[32m'companies'[0m: [32m"kedro_datasets.pandas.csv_dataset.CSVDataset[0m[32m([0m[32mfilepath[0m[32m=[0m[32mPurePosixPath[0m[32m([0m[32m'D:/monopolyo_extra/monopolio_super/base_monopolio/mono-01/data/01_raw/companies.csv'[0m[32m)[0m[32m, "[0m
              [32m"[0m[32mprotocol[0m[32m='file', [0m[32mload_args[0m[32m=[0m[32m{[0m[32m}[0m[32m, [0m[32msave_args[0m[32m=[0m[32m{[0m[32m'index': False[0m[32m}[0m[32m)[0m[32m"[0m,
 [32m'reviews'[0m: [32m"kedro_datasets.pandas.csv_dataset.CSVDataset[0m[32m([0m[32mfilepath[0m[32m=[0m[32mPurePosixPath[0m[32m([0m[32m'D:/monopolyo_extra/monopolio_super/base_monopolio/mono-01/data/01_raw/reviews.csv'[0m[32m)[0m[32m, "[0m
            [32m"[0m[32mprotocol[0m[32m='file', [0m[32mload_args[0m[32m=[0m[32m{[0m[32m}[0m[32m, [0m[32msave_args[0m[32m=[0m[32m{[0m[32m'index': False[0m[32m}[0m[32m)[0m[32m"[0m,
 [32m'shuttles'[0m: [32m"kedro_datasets.pandas

catalog.load("pos_proceso"): This method loads a dataset that has been previously defined in Kedro's Data Catalog. In this case, the dataset named "pos_proceso" is loaded and assigned to df.

df.head(): This pandas method displays the first five rows of the DataFrame df. It is useful for getting a quick view of the loaded data, including the column names and some of the values in those columns.

In [5]:
df = catalog.load("pos_proceso")
df.head()

Unnamed: 0,Id,Subsegmento,Sexo,Region,Edad,Renta,Antiguedad,Internauta,Adicional,Dualidad,Monoproducto,Ctacte,Consumo,Hipotecario,Debito,CambioPin,Cuentas,TC
1,2,160,1,13,46,143640,69,1,0,0,0,1,0,1,0,1,1,1
2,3,170,1,13,45,929106,24,1,1,0,0,1,0,1,1,1,1,2
3,4,151,1,13,46,172447,134,0,1,0,1,0,0,0,0,1,1,2
4,5,170,1,13,46,805250,116,0,1,1,0,1,0,1,0,1,2,3
5,6,170,1,13,47,707664,67,1,1,0,0,1,0,0,1,1,1,2
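The "Unnamed: 0" column in the output above is what pandas typically produces when a saved row index is read back as an ordinary column. The following is a minimal sketch of that behaviour, assuming the underlying file is a CSV; the inline text reuses two rows and a subset of the columns shown above, and whether the pos_proceso dataset is actually stored this way is an assumption.

```python
import io

import pandas as pd

# Two rows and a few columns copied from the df.head() output above.
# The empty first header field is how a saved index appears in a CSV file.
csv_text = """,Id,Subsegmento,Sexo,Region,Edad,Renta
1,2,160,1,13,46,143640
2,3,170,1,13,45,929106
"""

# Default read: the saved index becomes a regular column named "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index avoids the extra column.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Dropping or re-indexing this column matters mainly if it could be selected as a feature by mistake; it is not part of the feature list used in this section.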


Here we define a list called featrures01 that contains the names of the columns that will be used as features (independent variables) in the model. Features are the attributes used to predict a target value. In this case, the features are:
Sexo (sex): The gender of the person.
Region: The region where the person lives.
Edad (age): The age of the person.
Antiguedad (seniority): Possibly the length of time the person has been in employment or in a relationship with a service.
Monoproducto (single product): May refer to whether the person uses only one product or service.
Consumo (consumption): Possibly the level of consumption of some good or service.
X = df[featrures01]: This line uses the feature list to select the corresponding columns of the DataFrame df. The result is assigned to X, which holds the features of the dataset.
y = df["Renta"]: Here, the "Renta" (income) column of the DataFrame df is selected, which will be used as the label (dependent variable) to predict. The result is assigned to y.
train_test_split(X, y, test_size=0.2): This sklearn library function splits the data set into two parts: the training set and the test set.
X: The features.
y: The labels.
test_size=0.2: This means that 20% of the data set will be used for testing and 80% will be used for training the model.
The results are assigned to four variables:
X_train: feature training set.
X_test: Test set of the features.
y_train: Training set of the labels.
y_test: Test set of the labels.

In [6]:
featrures01 = ["Sexo", "Region", "Edad", "Antiguedad", "Monoproducto", "Consumo",]
X = df[featrures01]
y = df["Renta"]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
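Because the call above does not pass random_state, each run draws a different split, so the scores reported later in this section may not be exactly reproducible. The sketch below shows, on a small synthetic array, how a fixed seed yields the same partition every time; the arrays and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic data set: 10 rows, 2 features (illustrative only).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each split is (X_train, X_test, y_train, y_test); compare the test features.
same_test_rows = np.array_equal(split_a[1], split_b[1])

print(same_test_rows)
print(len(split_a[0]), len(split_a[1]))  # 8 training rows, 2 test rows
```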

Here an instance of the linear regression model is created using the LinearRegression class of sklearn.linear_model. This model will be fitted to the training data and used to make predictions.
model.fit(X_train, y_train): This method fits the linear regression model to the training data. It takes as input the features (X_train) and the labels (y_train). During this process, the model calculates the coefficients that best fit the data using the least squares method.
model.predict(X_test): This method makes predictions on the test data (X_test). It returns the predicted values based on the trained model.
The line y_pred = model.predict(X_test) stores the prediction results in the variable y_pred, which will contain the predictions of the target variable (in this case, "Renta") for the test data set.
mse = mean_squared_error(y_test, y_pred): Here the Mean Squared Error (MSE) between the predictions (y_pred) and the actual values (y_test) is calculated. The MSE is the average of the squared differences between predicted and actual values; a lower value indicates a better fit. It is expressed in squared units of the target, so it is only comparable between models evaluated on the same data.
r2 = r2_score(y_test, y_pred): This line calculates the coefficient of determination R², which indicates what proportion of the variability of the dependent variable (in this case, "Renta") is explained by the independent variables in the model. A value of 1 indicates a perfect fit and 0 corresponds to always predicting the mean of the test labels; on test data R² can also be negative when the model performs worse than that baseline. The value obtained here, about 0.037, means the linear model explains less than 4% of the variance in Renta on the test set.


In [8]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
r2

0.03723158014919892
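To make the two metrics concrete, the following sketch computes MSE and R² by hand on four made-up values. The helper names and the numbers are illustrative assumptions; the formulas are the standard definitions that mean_squared_error and r2_score implement for a single target.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]     # small errors
mean_only = [2.5, 2.5, 2.5, 2.5]     # always predicts the mean of y_true
reversed_fit = [4.0, 3.0, 2.0, 1.0]  # systematically wrong

r2_close = r2_manual(y_true, close_fit)
r2_mean = r2_manual(y_true, mean_only)
r2_reversed = r2_manual(y_true, reversed_fit)
mse_reversed = mean_squared_error_manual(y_true, reversed_fit)

print(round(r2_close, 2), r2_mean, r2_reversed, mse_reversed)
```

The mean-only predictor scores exactly 0 and the reversed predictor scores below 0, which is why negative R² values on a test set indicate a model that is worse than predicting the average.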

This code is used to create, train and evaluate a Support Vector Regression (SVR) model, which is a machine learning technique used for regression problems. The following is a breakdown of each part of the code:

SVR(kernel='rbf'): Here, an instance of the SVR model is created using the SVR class from sklearn.svm.
kernel='rbf': This specifies the type of kernel to be used. In this case it is the radial basis function (RBF) kernel, which is one of the most common kernels and is useful for nonlinear problems.
The comment in the code suggests that other kernels can be tried, such as linear (for a linear fit) or poly (for a polynomial fit), depending on the nature of the data and the problem.
svr_model.fit(X_train, y_train): This method trains the SVR model on the training data set. It takes the features (X_train) and labels (y_train) as input. During this process, the model adjusts its parameters to learn the relationship between the features and the target variable.
y_pred02 = svr_model.predict(X_test): This method makes predictions on the test data set (X_test). The resulting predictions are stored in the variable y_pred02, which will contain the predicted values of the target variable for the test set.
mse_02 = mean_squared_error(y_test, y_pred02): Here the mean squared error (MSE) between the predictions (y_pred02) and the actual values (y_test) is calculated. Lower values indicate a better fit.
r2_02 = r2_score(y_test, y_pred02): This line calculates the coefficient of determination R² to assess what proportion of the variability of the target variable is explained by the model. A value close to 1 indicates a good fit, and a negative value means the model does worse than predicting the mean. The result below is slightly negative (about -0.071), so on this split the SVR does not improve on that baseline.

In [11]:
# Create and fit the Support Vector Regression (SVR) model
svr_model = SVR(kernel='rbf')  # Other kernels such as 'linear' or 'poly' can also be tried
svr_model.fit(X_train, y_train)
y_pred02 = svr_model.predict(X_test)
mse_02 = mean_squared_error(y_test, y_pred02)
r2_02 = r2_score(y_test, y_pred02)
mse_02
r2_02

-0.07105604243107799
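One possible contributor to a near-zero or negative R² with SVR is scale: with its default settings the model is sensitive to the magnitude of both the features and the target, and Renta takes values in the hundreds of thousands. The sketch below illustrates the effect on synthetic data only; the data, the choice of StandardScaler, and the use of TransformedTargetRegressor are assumptions introduced for illustration, not the method used in the notebook, and they may or may not help on the real data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: a large-valued target that depends linearly on one feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(400, 2))
y = 200_000 + 5_000 * X[:, 0] + rng.normal(0, 10_000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# SVR on raw values, as in the cell above.
raw_model = SVR(kernel="rbf").fit(X_tr, y_tr)
r2_raw = r2_score(y_te, raw_model.predict(X_te))

# Same SVR, but features and target are standardised before fitting;
# predictions are mapped back to the original units automatically.
scaled_model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)
scaled_model.fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, scaled_model.predict(X_te))

print(round(r2_raw, 3), round(r2_scaled, 3))
```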

This code is used to create, train and evaluate a Decision Tree Regressor model on a data set. The following is a breakdown of each part of the code:

DecisionTreeRegressor(random_state=42): Here an instance of the decision tree model is being created using the DecisionTreeRegressor class from sklearn.tree.
random_state=42: This parameter is set to ensure reproducibility of the results. scikit-learn randomly permutes the features it considers at each split, so when several candidate splits are equally good the one chosen can differ between runs. Fixing the seed (42 here) makes the resulting tree the same each time the code is run on the same data.
tree_model.fit(X_train, y_train): This method trains the decision tree model using the training data set. It takes the features (X_train) and labels (y_train) as input. During training, the model builds a tree that divides the data according to the features, seeking to minimize the error in the predictions of the target variable.
y_pred_03 = tree_model.predict(X_test): This method makes predictions on the test data set (X_test). The resulting predictions are stored in the variable y_pred_03, which will contain the predicted values of the target variable for the test set.
mse_03 = mean_squared_error(y_test, y_pred_03): Here the mean squared error (MSE) between the predictions (y_pred_03) and the actual values (y_test) is calculated. Lower values indicate a better fit.
r2_03 = r2_score(y_test, y_pred_03): This line calculates the coefficient of determination R² to assess what proportion of the variability of the target variable is explained by the model. A value close to 1 indicates a good fit, and a negative value means the model does worse than predicting the mean. The result below (about -0.635) is clearly negative, so on this split the tree performs worse than that baseline.

In [12]:
# Create and fit the decision tree model
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)
# Predict on the test data
y_pred_03 = tree_model.predict(X_test)
# Compute and display the performance metrics
mse_03 = mean_squared_error(y_test, y_pred_03)
r2_03 = r2_score(y_test, y_pred_03)
mse_03
r2_03

-0.635088070499906
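A strongly negative test R² from a tree with no depth limit is often a sign of overfitting, since such a tree can memorise the training set. The sketch below compares an unrestricted tree with a shallow one on synthetic, noisy data; the data and the depth of 3 are illustrative assumptions, and whether the same pattern holds for the Renta data would need to be checked by comparing training and test scores there.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a weak signal in one feature, buried in noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for label, depth in (("unlimited", None), ("max_depth=3", 3)):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    # Store (training R², test R²) so the gap between them is visible.
    results[label] = (
        r2_score(y_tr, tree.predict(X_tr)),
        r2_score(y_te, tree.predict(X_te)),
    )

for label, (r2_train, r2_test) in results.items():
    print(f"{label}: train R2 = {r2_train:.3f}, test R2 = {r2_test:.3f}")
```

A large gap between the training and test scores of the unrestricted tree is the pattern to look for; limiting max_depth trades some training fit for better generalisation.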

This code is used to create, train and evaluate a Stochastic Gradient Descent (SGD) Classification model. This type of model is commonly used for classification problems. The following is a breakdown of each part of the code:

SGDClassifier(): Here, an instance of the stochastic gradient descent classifier is created using the SGDClassifier class of sklearn.linear_model. It is a linear classifier that supports several loss functions: the default, hinge loss, gives a linear SVM, while logistic loss gives logistic regression. It can be used for binary or multiclass classification.
model.fit(X_train, y_train): This method trains the SGD model using the training data set. It takes as input the features (X_train) and the labels (y_train). During training, the model adjusts its parameters by gradient descent, optimizing the selected loss function. Note that no cell shown here redefines y, so y_train appears to still hold the continuous Renta values; in that case the classifier treats every distinct income as its own class, which is consistent with the near-zero accuracy reported below.
y_pred = model.predict(X_test): This method is used to make predictions on the test data set (X_test). The resulting predictions are stored in the variable y_pred, which will contain the predicted classes for each observation in the test set.
accuracy = accuracy_score(y_test, y_pred): Here the accuracy of the model is calculated by comparing the predictions (y_pred) with the actual values (y_test). Accuracy is defined as the proportion of correct predictions over the total predictions made. A higher value indicates better model performance.

In [22]:
model = SGDClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.000293398533007335
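A classifier needs a target with a small number of classes. One common way to obtain that from a continuous variable such as income is to bin it into bands before training. The sketch below does this on synthetic data; the three quantile bands, the column name Renta_band, the scaling step, and the generated values are all assumptions made for illustration and do not come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data; NOT the actual Renta column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Edad": rng.integers(18, 80, size=600),
    "Antiguedad": rng.integers(0, 240, size=600),
})
demo["Renta"] = 200_000 + 8_000 * demo["Edad"] + rng.normal(0, 50_000, size=600)

# Bin the continuous income into three equally populated bands.
demo["Renta_band"] = pd.qcut(
    demo["Renta"], q=3, labels=["low", "mid", "high"]
).astype(str)

n_raw_classes = demo["Renta"].nunique()        # one "class" per distinct income
n_band_classes = demo["Renta_band"].nunique()  # three classes after binning

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["Edad", "Antiguedad"]], demo["Renta_band"],
    test_size=0.2, random_state=0,
)

# SGD is sensitive to feature scale, so the features are standardised first.
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))

print(n_raw_classes, n_band_classes, round(acc, 3))
```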

This code is used to create, train and evaluate a Decision Tree Classifier model on a data set. The following is a breakdown of each part of the code:

DecisionTreeClassifier(max_depth=10, random_state=42): Here an instance of the decision tree classifier is being created using the DecisionTreeClassifier class from sklearn.tree.
max_depth=10: This parameter sets the maximum depth of the tree. Limiting the depth helps prevent overfitting, which occurs when the model fits the training data too closely and does not generalize well to unseen data. In this case, the depth is being limited to 10 levels.
random_state=42: This parameter ensures the reproducibility of the model. Using a fixed number such as 42 means that the way the decision tree is constructed will be consistent in every code run.
tree_model_clasicicacion.fit(X_train, y_train): This method trains the decision tree model using the training data set. It takes as input the features (X_train) and the labels (y_train). During training, the model builds a decision tree that divides the data according to the features, optimizing the classification of the labels.
y_pred_05 = tree_model_clasicicacion.predict(X_test): This method makes predictions on the test data set (X_test). The resulting predictions are stored in the variable y_pred_05, which will contain the predicted classes for each observation in the test set.
accuracy_05 = accuracy_score(y_test, y_pred_05): Here the accuracy of the model is calculated by comparing the predictions (y_pred_05) with the actual values (y_test). Accuracy is defined as the proportion of correct predictions over the total predictions made. A higher value indicates better model performance.
report_05 = classification_report(y_test, y_pred_05): This line generates a classification report that includes additional performance metrics for each class: precision, recall, F1 score and support. The report provides a more detailed view of how the model performs in each class, not just in terms of overall accuracy.

In [13]:
tree_model_clasicicacion = DecisionTreeClassifier(max_depth=10, random_state=42)
tree_model_clasicicacion.fit(X_train, y_train)
y_pred_05 = tree_model_clasicicacion.predict(X_test)
accuracy_05 = accuracy_score(y_test, y_pred_05)
report_05 = classification_report(y_test, y_pred_05)
accuracy_05, report_05


[1m([0m
    [1;36m0.005183374083129584[0m,
    [32m'              precision    recall  f1-score   support\n\n           1       0.00      0.25      0.01        16\n           2       0.00      0.00      0.00         1\n           3       0.00      0.00      0.00         5\n           7       0.06      1.00      0.11         1\n           8       0.00      0.00      0.00         0\n           9       0.00      0.00      0.00         2\n          10       0.00      0.00      0.00         0\n          13       0.00      0.00      0.00         0\n          30       0.00      0.00      0.00         0\n          98       0.00      0.00      0.00         1\n         100       0.00      0.00      0.00         2\n         270       0.00      0.00      0.00         0\n         375       0.00      0.00      0.00         0\n         376       0.00      0.00      0.00         0\n         482       0.00      0.00      0.00         1\n         620       0.00      0.00      0.00         0\n     
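The report above is shown as a raw string with literal \n characters because it was returned inside a tuple rather than printed. The sketch below shows two ways to get a readable result, using a small made-up set of labels; the label values are illustrative, while print, zero_division and output_dict are existing options of classification_report.

```python
from sklearn.metrics import classification_report

# Made-up labels for three classes (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# Printing the string renders the line breaks as an aligned table.
report_text = classification_report(y_true, y_hat, zero_division=0)
print(report_text)

# output_dict=True returns nested dictionaries that can be queried directly.
report_dict = classification_report(
    y_true, y_hat, zero_division=0, output_dict=True
)
recall_class_1 = report_dict["1"]["recall"]
overall_accuracy = report_dict["accuracy"]

print(recall_class_1, round(overall_accuracy, 3))
```

Setting zero_division=0 also silences the warnings that arise when a class has no predicted or no true samples, which is the case for several rows with support 0 in the report above.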

This code is used to create, train and evaluate a Classification model with Support Vector Machines (SVC). The following is a breakdown of each part of the code:

X_train.sample(frac=0.5, random_state=10): A random sample of 50% of the rows in the X_train training data set is taken here and stored in X_train_reducido.
frac=0.5: This parameter indicates that 50% of the X_train observations will be selected.
random_state=10: This parameter ensures the reproducibility of the random sample. Using a fixed number such as 10 means that each time the code is run, the same observations will be selected.
y_train[X_train_reducido.index]: A new set of labels, y_train_reducido, is created here, which includes only the labels corresponding to the observations selected in X_train_reducido. This ensures that y_train_reducido is aligned with X_train_reducido, which is essential for model training.
SVC(kernel='linear', random_state=10): An instance of the SVC classifier is created using a linear kernel.
kernel='linear': This parameter indicates that a linear kernel will be used for classification. Depending on the data, other kernels such as poly or rbf can be tried to see which one offers better performance.
random_state=10: In SVC this seed only affects the data shuffling used for probability estimates (probability=True). With the settings used here it has no effect on the fitted model and is kept for consistency with the sampling step.
svc_model_clasificacion.fit(X_train_reducido, y_train_reducido): This method trains the SVC model using the reduced training data set (X_train_reducido and y_train_reducido). During this process, the model adjusts its parameters to learn to classify observations based on their features.
y_pred_06 = svc_model_clasificacion.predict(X_test): This method makes predictions on the test data set (X_test). The resulting predictions are stored in the variable y_pred_06, which will contain the predicted classes for each observation in the test set.
accuracy_06 = accuracy_score(y_test, y_pred_06): Here the accuracy of the model is calculated by comparing the predictions (y_pred_06) with the actual values (y_test). Accuracy is defined as the proportion of correct predictions over the total predictions made.
report_06 = classification_report(y_test, y_pred_06): This line generates a classification report that includes additional performance metrics for each class: precision, recall, F1 score and support.

In [None]:
X_train_reducido = X_train.sample(frac=0.5, random_state=10)
y_train_reducido = y_train[X_train_reducido.index]  # Keep y_train aligned with the sampled rows of X_train
svc_model_clasificacion = SVC(kernel='linear', random_state=10)  # Other kernels such as 'rbf' or 'poly' can also be tried
svc_model_clasificacion.fit(X_train_reducido, y_train_reducido)
y_pred_06 = svc_model_clasificacion.predict(X_test)
accuracy_06 = accuracy_score(y_test, y_pred_06)
report_06 = classification_report(y_test, y_pred_06)
accuracy_06, report_06
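The alignment between the sampled features and their labels can be checked explicitly. The following sketch uses a tiny made-up DataFrame with a non-default index; the data and the names X_demo, y_demo, X_half and y_half are illustrative assumptions. It uses .loc, which makes the label-based lookup explicit and selects the same rows as the indexing in the cell above.

```python
import pandas as pd

# Tiny made-up data set with an index that does not start at 0.
X_demo = pd.DataFrame(
    {"Edad": [46, 45, 46, 46, 47, 50]},
    index=[10, 11, 12, 13, 14, 15],
)
y_demo = pd.Series([0, 1, 0, 1, 1, 0], index=X_demo.index)

# Sample half of the rows, then pick the matching labels by index label.
X_half = X_demo.sample(frac=0.5, random_state=10)
y_half = y_demo.loc[X_half.index]

# The two objects must have identical indexes, in the same order.
aligned = X_half.index.equals(y_half.index)

print(len(X_half), aligned)
```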