# Data loading

In [1]:
# Library imports
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('train3.csv')

In [3]:
df = df.drop(['Unnamed: 0'], axis = 1)
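
An `Unnamed: 0` column typically appears when a DataFrame was written with `to_csv` including its index and is then read back without declaring an index column. A minimal sketch, assuming that is how `train3.csv` was produced (the inline CSV text below is a hypothetical stand-in for the real file), suggests that `index_col=0` would make the separate `drop` unnecessary:

```python
import io

import pandas as pd

# Hypothetical stand-in for train3.csv: the first, unnamed column is the saved index.
csv_text = ",Casa,Trabajo,HomeOffice\n0,1050,1042,1.0\n1,401,401,0.0\n"

# Default read: the saved index comes back as an 'Unnamed: 0' data column.
raw = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead of as data.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```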

In [4]:
df = df[['Casa', 'Trabajo', 'HomeOffice']]

In [5]:
df

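| | Casa | Trabajo | HomeOffice |
|:-:|:-:|:-:|:-:|
| 0 | 1050 | 1042 | 1.0 |
| 1 | 401 | 401 | 0.0 |
| 2 | 428 | 428 | 0.0 |
| 3 | 1000 | 412 | 1.0 |
| 4 | 1078 | 1078 | 0.0 |
| ... | ... | ... | ... |
| 808595 | 1467 | 1467 | 0.0 |
| 808596 | 1055 | 49 | 1.0 |
| 808597 | 327 | 327 | 0.0 |
| 808598 | 1471 | 1471 | 0.0 |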


For a record to be considered HomeOffice, its Casa (home) and Trabajo (work) values must be equal. Both values were filtered beforehand by time window: 12:00 to 7:00 for Casa and 9:00 to 18:00 for Trabajo.

The result is stored in the *HomeOffice* column, where:
* *1.0 => Not HomeOffice*
* *0.0 => HomeOffice*
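
The labeling step itself is not shown in this notebook. As a rough sketch of the rule described above, assuming the label depends only on whether `Casa` and `Trabajo` match, the rows visible in the preview can be checked like this (`sample` and `derived` are illustrative names):

```python
import pandas as pd

# Rows taken from the DataFrame preview above.
sample = pd.DataFrame({
    "Casa": [1050, 401, 428, 1000, 1078],
    "Trabajo": [1042, 401, 428, 412, 1078],
    "HomeOffice": [1.0, 0.0, 0.0, 1.0, 0.0],
})

# Assumed rule: equal values -> 0.0 (HomeOffice), different values -> 1.0.
derived = (sample["Casa"] != sample["Trabajo"]).astype(float)

# Check that the derived label matches the stored one for these rows.
print((derived == sample["HomeOffice"]).all())
```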

In [6]:
df['HomeOffice'].value_counts()

0.0    494172
1.0    314428
Name: HomeOffice, dtype: int64
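
The counts above correspond to roughly 61% of records in class 0.0 and 39% in class 1.0. A small sketch of that arithmetic, which also gives the accuracy of a trivial baseline that always predicts the majority class (a reference point introduced here, not something the notebook computes):

```python
# Class counts copied from the value_counts() output above.
counts = {0.0: 494172, 1.0: 314428}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial baseline that always predicts the most frequent class.
majority_baseline = max(shares.values())

print(total)
for label, share in shares.items():
    print(label, round(share, 4))
print(round(majority_baseline, 4))
```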

# Model training

Assign the model's independent variables (features) and dependent variable (target).

In [7]:
X = df[['Casa', 'Trabajo']]

In [8]:
y = df[['HomeOffice']]

Split the data into training (80%) and test (20%) sets.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
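
To illustrate what the split returns, here is a minimal sketch on a 10-row synthetic frame (`X_demo` and `y_demo` are made-up stand-ins for `X` and `y`). It shows the 80/20 sizes and that features and target stay aligned by index:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the real features and target (10 rows).
X_demo = pd.DataFrame({"Casa": range(10), "Trabajo": [0, 1, 2, 3, 4, 0, 0, 0, 0, 0]})
y_demo = (X_demo["Casa"] != X_demo["Trabajo"]).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

print(len(X_tr), len(X_te))
# Features and target stay aligned row by row after the split.
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))
```

Passing `stratify=y` (not used in this notebook) would additionally keep the class proportions the same in both subsets.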

The model is an XGBoost classifier trained with the binary logistic objective.

In [10]:
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
xgb_model.fit(X_train, y_train)
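
With `binary:logistic`, the model's raw score (often called the margin) is passed through the logistic function to obtain a probability for class 1, and `predict` returns a class by thresholding that probability at 0.5. A pure-Python sketch of that mapping follows; the margin values are made up for illustration and `sigmoid` and `to_label` are hypothetical helpers, not part of the XGBoost API:

```python
import math


def sigmoid(margin: float) -> float:
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def to_label(probability: float, threshold: float = 0.5) -> int:
    """Class 1 if the probability exceeds the threshold, else class 0."""
    return int(probability > threshold)


# Hypothetical raw scores, not taken from the trained model.
for margin in (-3.0, 0.0, 2.5):
    p = sigmoid(margin)
    print(margin, round(p, 4), to_label(p))
```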

# Results

| Confusion | matrix|
|:-:|:-:|
| True negatives | False positives |
| False negatives | True positives |
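
In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes, which yields the layout in the table above for binary labels. Here the positive class is 1.0 (not HomeOffice). A short sketch with toy labels (not the notebook's data) shows how the four cells can be unpacked:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (1 = positive class).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)

# Rows are true classes, columns are predicted classes, so ravel() yields
# the four cells in the same order as the table above.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```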

In [11]:
y_pred_train = xgb_model.predict(X_train)
cm_train = confusion_matrix(y_train, y_pred_train)
df_cmtrain = pd.DataFrame(cm_train)
df_cmtrain

| | 0 | 1 |
|:-:|:-:|:-:|
| 0 | 395393 | 0 |
| 1 | 2851 | 248636 |


In [12]:
y_pred_test = xgb_model.predict(X_test)
cm_test = confusion_matrix(y_test, y_pred_test)
df_cmtest = pd.DataFrame(cm_test)
df_cmtest

| | 0 | 1 |
|:-:|:-:|:-:|
| 0 | 98779 | 0 |
| 1 | 702 | 62239 |


In [13]:
train_accuracy = accuracy_score(y_train, y_pred_train)
print('Train accuracy:', train_accuracy)

Train accuracy: 0.9955926910709869


In [14]:
test_accuracy = accuracy_score(y_test, y_pred_test)
print('Test accuracy:', test_accuracy)

Test accuracy: 0.9956591639871383
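
Both accuracy values can be reproduced directly from the confusion matrices, since accuracy is the diagonal sum divided by the total number of records. A minimal check, with `accuracy_from_cm` as an illustrative helper name:

```python
def accuracy_from_cm(tn: int, fp: int, fn: int, tp: int) -> float:
    """Accuracy = correctly classified records / all records."""
    return (tn + tp) / (tn + fp + fn + tp)


# Cell values copied from the train and test confusion matrices above.
train_acc = accuracy_from_cm(tn=395393, fp=0, fn=2851, tp=248636)
test_acc = accuracy_from_cm(tn=98779, fp=0, fn=702, tp=62239)

print(round(train_acc, 6), round(test_acc, 6))
```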


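Accuracy is about 99.6% on both the training and test sets. The model produces no false positives and few false negatives: 2,851 of 646,880 training records and 702 of 161,720 test records. Here a false negative is a record labeled 1.0 (not HomeOffice) that the model predicts as 0.0 (HomeOffice).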
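
Because accuracy alone does not separate the two error types, precision and recall for class 1.0 can also be derived from the same confusion matrices. This is a sketch of the standard definitions, not a step taken in the notebook, and the helper names are illustrative:

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of true positives that the model recovers."""
    return tp / (tp + fn)


# Cell values copied from the confusion matrices above.
splits = {
    "train": {"tn": 395393, "fp": 0, "fn": 2851, "tp": 248636},
    "test": {"tn": 98779, "fp": 0, "fn": 702, "tp": 62239},
}

for name, cm in splits.items():
    p = precision(cm["tp"], cm["fp"])
    r = recall(cm["tp"], cm["fn"])
    print(name, round(p, 4), round(r, 4))
```

With no false positives, precision is 1.0 in both splits, while recall for class 1.0 comes out at roughly 0.989.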