Dear candidate,

We are delighted that you are considering joining our data science team. As part of our evaluation process, we ask you to complete the following technical test.

**Task description:**

You are given an anonymized dataset consisting of columns named 'col1', 'col2', 'col3', and so on up to 'col20', plus a 'target' column that holds the target variable. Your task is to perform an exploratory data analysis (EDA) and build a machine learning model to predict 'target'.

**Task details:**

1. **Exploratory data analysis (EDA):** Carry out a detailed exploratory analysis of the data. This should include, but is not limited to:
   - Descriptive statistics of the variables (minimum, maximum, mean, median, standard deviation, etc.).
   - A check for missing or anomalous values.
   - Correlation analysis between the variables.
   - Visualizations to better understand the distributions and relationships in the data.

2. **Data preprocessing:** Perform whatever cleaning or transformation of the data your EDA shows to be necessary.

3. **Baseline model creation:** Train several machine learning models to predict 'target'. Start with simple models such as logistic regression and move on to more complex ones such as decision trees, random forest, SVM, XGBoost, etc.

4. **Model evaluation:** Evaluate the performance of each model using appropriate metrics. For example, if 'target' is a binary variable, you could consider precision, sensitivity, specificity, AUC-ROC, etc.

5. **Model selection:** Select the model you believe works best. Justify your choice based on the evaluation metrics and any other relevant consideration.

6. **Prediction:** Use your selected model to make predictions on the dataset.

**Deliverables:**

Please provide the code you used to complete this task, together with a detailed report explaining your approach and results. The report should be well structured and easy to follow, so that the evaluators can understand your thought process and the decisions you made.

Good luck, and we look forward to seeing your solution!

In [162]:
import pandas as pd
import numpy as np

In [163]:
ptec = pd.read_csv("C:/Users/Abraham/Desktop/Solo/Bootcamp/Machine_Learning/Entregas/Prueba_Tecnica_Nivel/data/train.csv")
ptec

Unnamed: 0,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target
0,4995,0.02,26.80,0.09,1.35,0.060,0.09,0.09,1.97,1.48,...,0.031,9.52,0.84,0.001,1.24,0.96,0.09,0.08,0.08,0
1,1709,1.13,3.95,0.32,3.95,0.030,4.66,0.56,1.89,1.36,...,0.170,13.83,1.15,0.006,10.55,3.76,0.02,0.45,0.06,0
2,7825,0.07,8.05,0.04,0.14,0.040,0.06,0.06,0.05,0.00,...,0.120,2.61,1.52,0.008,4.13,0.27,0.03,0.03,0.01,0
3,6918,0.09,26.40,0.03,0.92,0.090,0.57,0.00,0.36,0.03,...,0.055,5.22,1.96,0.000,11.32,0.25,0.09,0.01,0.05,0
4,5,0.94,14.47,0.03,2.88,0.003,0.80,0.43,1.38,0.11,...,0.135,9.75,1.89,0.006,27.17,5.42,0.08,0.19,0.02,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5592,3048,0.01,8.92,0.20,4.88,0.050,0.36,0.09,0.54,0.14,...,0.069,8.60,1.90,0.007,17.18,2.91,0.08,0.43,0.09,0
5593,2130,0.09,1.36,0.04,3.45,0.003,3.42,0.03,1.39,1.12,...,0.108,9.36,1.58,0.009,42.15,4.39,0.02,0.11,0.07,0
5594,5005,0.10,4.95,0.01,0.25,0.040,0.09,0.10,1.88,0.33,...,0.021,18.78,0.62,0.007,5.97,0.32,0.10,0.10,0.08,0
5595,2125,0.05,23.18,0.04,3.65,0.001,4.43,0.63,1.94,1.27,...,0.194,13.32,1.93,0.005,23.84,4.80,0.08,0.22,0.07,0


# Exploratory analysis

In [164]:
# data overview: 5597 rows, no nulls, every column is float or int
ptec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5597 entries, 0 to 5596
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      5597 non-null   int64  
 1   col1    5597 non-null   float64
 2   col2    5597 non-null   float64
 3   col3    5597 non-null   float64
 4   col4    5597 non-null   float64
 5   col5    5597 non-null   float64
 6   col6    5597 non-null   float64
 7   col7    5597 non-null   float64
 8   col8    5597 non-null   float64
 9   col9    5597 non-null   float64
 10  col10   5597 non-null   float64
 11  col11   5597 non-null   float64
 12  col12   5597 non-null   float64
 13  col13   5597 non-null   float64
 14  col14   5597 non-null   float64
 15  col15   5597 non-null   float64
 16  col16   5597 non-null   float64
 17  col17   5597 non-null   float64
 18  col18   5597 non-null   float64
 19  col19   5597 non-null   float64
 20  col20   5597 non-null   float64
 21  target  5597 non-null   int64  
dtype

In [165]:
# descriptive statistics

ptec.describe()

Unnamed: 0,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target
count,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,...,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0
mean,3962.909952,0.674797,14.43884,0.163603,1.579702,0.042747,2.213093,0.25089,0.807654,0.766057,...,0.10005,9.819267,1.33188,0.005194,16.621249,2.931896,0.049602,0.149855,0.044999,0.114347
std,2311.129964,1.273677,8.851097,0.254659,1.22122,0.036072,2.581244,0.272472,0.652471,0.436435,...,0.057997,5.581795,0.568734,0.00296,17.729833,2.327347,0.028827,0.14417,0.026929,0.318261
min,0.0,0.0,-0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.001,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1963.0,0.04,6.85,0.03,0.55,0.008,0.1,0.05,0.09,0.4,...,0.049,4.94,1.01,0.003,2.22,0.82,0.02,0.04,0.02,0.0
50%,3922.0,0.07,14.39,0.05,1.21,0.04,0.55,0.09,0.75,0.76,...,0.103,9.86,1.42,0.005,7.84,2.43,0.05,0.08,0.05,0.0
75%,5960.0,0.29,22.26,0.1,2.51,0.07,4.33,0.45,1.39,1.16,...,0.151,14.69,1.76,0.008,29.98,4.67,0.07,0.25,0.07,0.0
max,7993.0,5.05,29.84,1.05,4.94,0.13,8.66,0.9,2.0,1.5,...,0.2,19.82,2.89,0.01,60.01,7.99,0.1,0.5,0.09,1.0


In [166]:
# missing or anomalous values: look for rows where every column is zero
ptec_todoceros= ptec[(ptec == 0).all(axis=1)]
ptec_todoceros

Unnamed: 0,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target
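
The all-zero filter above returns no rows, but it only catches one kind of anomaly. The `describe()` output shows a minimum of -0.08 for `col2` while every other column has a minimum of 0 or more, so a per-column count of nulls, negatives and zeros may be a useful complement. The following is a minimal sketch; `basic_quality_report` is a hypothetical helper and the small frame is a stand-in, not the real `train.csv`.

```python
import numpy as np
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column counts of nulls, negative values and exact zeros."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_null": numeric.isna().sum(),
        "n_negative": (numeric < 0).sum(),
        "n_zero": (numeric == 0).sum(),
    })

# Tiny stand-in frame (illustrative values only, not the real dataset)
demo = pd.DataFrame({
    "col1": [0.02, 1.13, 0.0, np.nan],
    "col2": [26.80, -0.08, 8.05, 3.95],
})

report = basic_quality_report(demo)
print(report)
print("duplicated rows:", int(demo.duplicated().sum()))
```

On the notebook's data the call would be `basic_quality_report(ptec)`.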


In [167]:
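# IQR rule on col1: keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = ptec['col1'].quantile(0.25)
Q3 = ptec['col1'].quantile(0.75)
IQR = Q3 - Q1

# named `mask` rather than `filter` to avoid shadowing the Python built-in
mask = (ptec['col1'] >= Q1 - 1.5 * IQR) & (ptec['col1'] <= Q3 + 1.5 * IQR)
ptec_no_outliers = ptec.loc[mask]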



In [168]:
ptec_no_outliers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4377 entries, 0 to 5596
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      4377 non-null   int64  
 1   col1    4377 non-null   float64
 2   col2    4377 non-null   float64
 3   col3    4377 non-null   float64
 4   col4    4377 non-null   float64
 5   col5    4377 non-null   float64
 6   col6    4377 non-null   float64
 7   col7    4377 non-null   float64
 8   col8    4377 non-null   float64
 9   col9    4377 non-null   float64
 10  col10   4377 non-null   float64
 11  col11   4377 non-null   float64
 12  col12   4377 non-null   float64
 13  col13   4377 non-null   float64
 14  col14   4377 non-null   float64
 15  col15   4377 non-null   float64
 16  col16   4377 non-null   float64
 17  col17   4377 non-null   float64
 18  col18   4377 non-null   float64
 19  col19   4377 non-null   float64
 20  col20   4377 non-null   float64
 21  target  4377 non-null   int64  
dtype

In [169]:
# Import pandas library
import pandas as pd

# Define a function to calculate the lower and upper bounds for outliers
def outlier_bounds(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound

# Apply the function to your DataFrame
lower, upper = outlier_bounds(ptec)

# Identify the outliers
outliers = ptec[(ptec < lower) | (ptec > upper)]

# Print the outliers
print(outliers)


      ID  col1  col2  col3  col4  col5  col6  col7  col8  col9  ...  col12  \
0    NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
1    NaN  1.13   NaN  0.32   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
2    NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
3    NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
4    NaN  0.94   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
...   ..   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...    ...   
5592 NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
5593 NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
5594 NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
5595 NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   
5596 NaN   NaN   NaN  0.69   NaN   NaN   NaN   NaN   NaN   NaN  ...    NaN   

      col13  col14  col15  col16  col17  col18  col19  col20  t

In [170]:
outliers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5597 entries, 0 to 5596
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      0 non-null      float64
 1   col1    1220 non-null   float64
 2   col2    0 non-null      float64
 3   col3    1154 non-null   float64
 4   col4    0 non-null      float64
 5   col5    0 non-null      float64
 6   col6    0 non-null      float64
 7   col7    0 non-null      float64
 8   col8    0 non-null      float64
 9   col9    0 non-null      float64
 10  col10   0 non-null      float64
 11  col11   0 non-null      float64
 12  col12   0 non-null      float64
 13  col13   0 non-null      float64
 14  col14   2 non-null      float64
 15  col15   0 non-null      float64
 16  col16   0 non-null      float64
 17  col17   0 non-null      float64
 18  col18   0 non-null      float64
 19  col19   0 non-null      float64
 20  col20   0 non-null      float64
 21  target  640 non-null    float64
dtype
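
In the output above, `target` is reported with 640 "outliers". Its 75th percentile is 0, so the IQR is 0 and every positive row is flagged, which is a property of a binary column rather than an anomaly. A compact per-column count that leaves out `ID` and `target` may be easier to read. This is a minimal sketch; `iqr_outlier_counts` is a hypothetical helper and the frame is illustrative.

```python
import pandas as pd

def iqr_outlier_counts(df, exclude=("ID", "target")):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per feature column."""
    features = df.drop(columns=[c for c in exclude if c in df.columns])
    q1 = features.quantile(0.25)
    q3 = features.quantile(0.75)
    iqr = q3 - q1
    mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
    return mask.sum().sort_values(ascending=False)

# Illustrative stand-in data, not the real train.csv
demo = pd.DataFrame({
    "ID": range(10),
    "col1": [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.2, 5.0],
    "col2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "target": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

counts = iqr_outlier_counts(demo)
print(counts)
```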

In [171]:
# correlation between variables

ptec.corr()

Unnamed: 0,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target
ID,1.0,-0.622316,-0.182153,-0.272298,-0.495355,0.269412,-0.670282,-0.625041,-0.129477,-0.009559,...,0.022009,0.000314,-0.186106,0.011536,-0.635645,-0.386412,-0.002904,-0.590629,-0.007583,-0.416405
col1,-0.622316,1.0,0.068536,0.209239,0.296823,-0.110139,0.3669,0.340983,0.169893,-0.001603,...,0.020401,-0.006932,0.232193,-0.01252,0.345596,0.244387,0.000861,0.327539,0.025411,0.358168
col2,-0.182153,0.068536,1.0,0.055166,0.082996,-0.001662,0.110731,0.121478,0.022057,-0.01871,...,-0.048791,0.008791,-0.058001,0.008801,0.096208,0.049775,0.029816,0.078308,0.024384,-0.017318
col3,-0.272298,0.209239,0.055166,1.0,0.360306,0.343399,0.355561,0.310508,-0.03351,0.004984,...,-0.092149,0.028841,0.301948,-0.016883,0.328398,0.227257,-0.008053,0.303467,0.002591,-0.132675
col4,-0.495355,0.296823,0.082996,0.360306,1.0,-0.037666,0.445365,0.416649,0.078317,-0.023541,...,-0.055882,-0.005803,0.317054,-0.006503,0.471041,0.294568,0.03305,0.435385,0.00196,0.104875
col5,0.269412,-0.110139,-0.001662,0.343399,-0.037666,1.0,-0.138517,-0.152531,-0.12005,0.001908,...,-0.046023,0.019647,-0.014961,-0.011347,-0.141748,-0.093256,0.003784,-0.153398,-0.005593,-0.271454
col6,-0.670282,0.3669,0.110731,0.355561,0.445365,-0.138517,1.0,0.551453,0.129136,0.012978,...,-0.034044,0.003866,0.38461,-0.027252,0.59041,0.385324,0.006608,0.518238,-0.006657,0.198555
col7,-0.625041,0.340983,0.121478,0.310508,0.416649,-0.152531,0.551453,1.0,0.121669,-0.001397,...,-0.058285,-0.015734,0.340895,-0.04022,0.516192,0.311134,0.029873,0.513997,-0.004092,0.17442
col8,-0.129477,0.169893,0.022057,-0.03351,0.078317,-0.12005,0.129136,0.121669,1.0,0.013048,...,0.120507,-0.006042,0.168918,0.015245,0.109354,0.039575,0.008354,0.092649,0.012929,0.031497
col9,-0.009559,-0.001603,-0.01871,0.004984,-0.023541,0.001908,0.012978,-0.001397,0.013048,1.0,...,0.030067,-0.002626,-0.021988,0.002967,-0.017018,0.006563,0.025123,0.013472,0.006963,0.003556
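
The full matrix is hard to scan. For modelling, the `target` column of the matrix is usually the most relevant part: in the rows shown, `ID` (-0.416), `col1` (0.358) and `col5` (-0.271) stand out. A sketch of how that column could be ranked by absolute value follows; `rank_by_target_correlation` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(df, target="target"):
    """Pearson correlation of each column with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in: col1 carries signal, col2 is pure noise
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "col1": signal + rng.normal(scale=0.1, size=n),
    "col2": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

ranking = rank_by_target_correlation(demo)
print(ranking)
```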


# Models

In [172]:
# select features (X) and target (y)
X=ptec.drop(["target"], axis=1)
y=ptec["target"]
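
Note that `X` as built above still contains `ID`. The correlation table shows `ID` correlating with `target` at -0.416, the strongest value among the visible rows, so the models may be learning from the identifier. Whether that relationship holds for `test.csv` is not known from the notebook. A variant that leaves `ID` out could look like the sketch below; `split_features_target` is a hypothetical helper, and the three rows are copied from the head of the table above.

```python
import pandas as pd

def split_features_target(df, target="target", id_col="ID"):
    """Return (X, y) with both the target and the identifier removed from X."""
    drop_cols = [c for c in (target, id_col) if c in df.columns]
    X = df.drop(columns=drop_cols)
    y = df[target]
    return X, y

demo = pd.DataFrame({
    "ID": [4995, 1709, 7825],
    "col1": [0.02, 1.13, 0.07],
    "col2": [26.80, 3.95, 8.05],
    "target": [0, 0, 0],
})

X_demo, y_demo = split_features_target(demo)
print(list(X_demo.columns))
```

The scores reported in the rest of the notebook were obtained with `ID` included, so they would need to be re-run to compare.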



In [173]:
# split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
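
Because only about 11.4% of rows have `target == 1` (the mean of `target` in `describe()`), a plain random split can leave the two parts with slightly different class ratios. Passing `stratify=y` to `train_test_split` keeps the ratio the same in both. The sketch below uses synthetic data with a similar imbalance; it is not the split used for the scores reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same class imbalance as `target`
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.114).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The positive rate should be nearly identical in all three
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```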


## Logistic regression

In [174]:
# import the logistic regression model
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression()


In [175]:
# train the model

logistic_model.fit(X_train, y_train)


LogisticRegression()

In [176]:
# predict y for the test split
y_pred = logistic_model.predict(X_test)


In [177]:
# evaluate the result (accuracy_score is the fraction of correct predictions)
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.9267857142857143
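
Accuracy alone can be misleading here. Since `target` has a mean of about 0.114, a model that always predicts 0 would already score roughly 0.886 accuracy on the full training file. The task brief also asks for sensitivity, specificity and AUC-ROC. The sketch below shows one way to compute them with scikit-learn; `binary_report` is a hypothetical helper, and the example deliberately uses an "always 0" predictor to show how the metrics expose it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_report(y_true, y_pred, y_score):
    """Accuracy, sensitivity, specificity, precision and ROC AUC for a binary task."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# Stand-in labels: 8 negatives, 2 positives, and a predictor that always says 0
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.zeros_like(y_true)
y_score = np.full(len(y_true), 0.1)  # constant score, no ranking ability

report = binary_report(y_true, y_pred, y_score)
print(report)
```

With the notebook's model, `y_score` would be `logistic_model.predict_proba(X_test)[:, 1]`.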


In [178]:
import warnings

# ignore all warnings (this also hides convergence and failed-fit warnings)
warnings.filterwarnings("ignore")

In [179]:
# Grid search

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Hyperparameters to tune
# Note: in recent scikit-learn versions the default solver (lbfgs) does not support
# the 'l1' penalty, so those candidates fail to fit and are scored as NaN;
# in practice only the 'l2' candidates are compared here.
param_grid = {'C': [0.1, 1, 10],
              'penalty': ['l1', 'l2']}

# Create a logistic regression instance
logistic_model = LogisticRegression()

# Create the GridSearchCV object (5-fold cross-validation, default accuracy scoring)
grid_search = GridSearchCV(logistic_model, param_grid, cv=5)

# Fit the model with GridSearchCV
grid_search.fit(X_train, y_train)

# Retrieve the best hyperparameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Print the results
print("Best hyperparameters:", best_params)
print("Best validation score:", best_score)


Best hyperparameters: {'C': 1, 'penalty': 'l2'}
Best validation score: 0.9169081703910615
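
For the search to really compare `l1` and `l2`, the estimator needs a solver that supports both, such as `liblinear`. The sketch below also standardizes the features inside a pipeline and scores by ROC AUC. Those two choices are assumptions on my part, not something the notebook does, and the data is synthetic (`make_classification`), so the printed values say nothing about `train.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real training data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.89, 0.11], random_state=0
)

# liblinear supports both 'l1' and 'l2', so every candidate can be fitted
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "logisticregression__C": [0.1, 1, 10],
    "logisticregression__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```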


## Decision Tree

In [180]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)


DecisionTreeClassifier()

In [181]:
y_pred = decision_tree.predict(X_test)

In [182]:
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.9714285714285714


In [183]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters to tune
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a decision tree instance
decision_tree = DecisionTreeClassifier()

# Create the GridSearchCV object
grid_search = GridSearchCV(decision_tree, param_grid, cv=5)

# Fit the model with GridSearchCV
grid_search.fit(X_train, y_train)

# Retrieve the best hyperparameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Print the results
print("Best hyperparameters:", best_params)
print("Best validation score:", best_score)


Best hyperparameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 5}
Best validation score: 0.9720832501995211


## Random Forest

In [184]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()

In [185]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [186]:
random_forest.fit(X_train, y_train)


RandomForestClassifier()

In [187]:
y_pred = random_forest.predict(X_test)


In [188]:
from sklearn.metrics import accuracy_score


In [189]:
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.9678571428571429


## SVM

In [190]:
from sklearn.svm import SVC


svm_model = SVC()

svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.9205357142857142
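
The features are on very different scales: in `describe()`, `col5` tops out at 0.13 while `col16` reaches 60.01 and `ID` reaches 7993. Distance-based models such as `SVC` (RBF kernel by default) and KNN are sensitive to this, so standardizing inside a pipeline is a common preprocessing step. The sketch below illustrates the pattern on synthetic data; it is an assumption about what might help, not a result from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in; one column is inflated to mimic a large-scale feature
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.89, 0.11], random_state=0
)
X_demo[:, 0] *= 1000

# The scaler is fitted on each training fold only, which avoids leaking test statistics
models = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}
print(scores)
```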


## XGBoost

In [191]:
from xgboost import XGBClassifier

# Create an XGBoost model instance
xgb_model = XGBClassifier()

xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.9758928571428571


## KNN

In [192]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.9535714285714286
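
All the scores above come from a single 80/20 split of 5597 rows, which gives about 1120 test rows. On that split the gap between 0.9759 (XGBoost) and 0.9714 (decision tree) is about 5 rows, so the ranking may not be stable. Stratified k-fold cross-validation with more than one metric gives a steadier comparison. The sketch below shows the pattern with three scikit-learn models on synthetic data; the candidate list and settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.89, 0.11], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

summary = {}
for name, model in candidates.items():
    res = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["accuracy", "roc_auc"])
    summary[name] = {
        "accuracy": res["test_accuracy"].mean(),
        "roc_auc": res["test_roc_auc"].mean(),
    }

for name, metrics in summary.items():
    print(name, metrics)
```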


# Selection and prediction

In [193]:
# The best-performing model was XGBoost, with an accuracy of 0.9758928571428571 on the hold-out split.

In [194]:
from xgboost import XGBClassifier

# Create an XGBoost model instance and refit it as the selected model
xgb_model = XGBClassifier()

xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.9758928571428571


In [195]:
# load the test csv

ptec_test = pd.read_csv(r"C:\Users\Abraham\Desktop\Solo\Bootcamp\Machine_Learning\Entregas\Prueba_Tecnica_Nivel\data\test.csv")


In [196]:
# keep the unseen data under its own name so the hold-out split (X_test) is not overwritten
X_new = ptec_test

In [198]:


predictions = xgb_model.predict(X_new)

# Print the predictions
print(predictions)

[0 0 0 ... 0 0 1]
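
The prediction step above passes `test.csv` to the model as loaded, which assumes the file has exactly the training columns in the same order. A small guard that selects and orders the columns explicitly, and keeps the predictions aligned with the input rows, could look like the sketch below. `predict_aligned` and the output file name are hypothetical, and a decision tree on toy data stands in for the fitted XGBoost model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def predict_aligned(model, new_df, feature_names):
    """Predict on new_df using exactly the training columns, in training order."""
    missing = [c for c in feature_names if c not in new_df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    preds = model.predict(new_df[list(feature_names)])
    return pd.DataFrame({"prediction": preds}, index=new_df.index)

# Toy training data (illustrative only)
train = pd.DataFrame({
    "col1": [0.0, 0.1, 0.9, 1.0],
    "col2": [1.0, 2.0, 3.0, 4.0],
    "target": [0, 0, 1, 1],
})
features = ["col1", "col2"]
model = DecisionTreeClassifier(random_state=0).fit(train[features], train["target"])

# New rows with shuffled column order and an extra column
new_rows = pd.DataFrame({"col2": [1.5, 3.5], "col1": [0.05, 0.95], "extra": [7, 7]})
out = predict_aligned(model, new_rows, features)
print(out)

# out.to_csv("predictions.csv")  # hypothetical output file name
```

In the notebook this would be called as `predict_aligned(xgb_model, ptec_test, X_train.columns)`.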
