Here is a possible solution to the Academic Success case. This notebook only covers the **preprocessing (done in a single function) and the model**. There is a small grid search to find the best model, but I restricted it to a few hyperparameter values because it takes a long time to run.

A ColumnTransformer applies the OneHotEncoder to the categorical columns, so that the whole process boils down to 2 steps.

There is also a LabelEncoder that turns the target into 0, 1 and 2, because I use XGBoost and it needs the target in that format. At the end, the predictions (0, 1, 2) have to be inverse-transformed back into the original strings (Dropout, Enrolled, Graduate).

The exploration is in the exploration notebook I already shared with you, but it is a very limited one: the dataset is large and a thorough exploration would be very long.

In [63]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn import metrics

In [64]:
train=pd.read_csv("train.csv")

In [65]:
colnames=train.columns

In [66]:
X=train.drop(columns=['Target'])
y=train['Target']

In [67]:
# I need this label encoder to go from [Dropout, Enrolled, Graduate] to [0,1,2] (classes are sorted alphabetically), and later back to the original strings.
# It is another transformer object: it remembers which number corresponds to each class.
from sklearn.preprocessing import LabelEncoder
y_encoder = LabelEncoder()
y=y_encoder.fit_transform(y)
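
As a minimal sketch of the round trip described above (the names `toy_target` and `toy_encoder` are made up for illustration and are not part of the notebook), the following shows how LabelEncoder assigns codes in alphabetical order and how `inverse_transform` recovers the strings:

```python
from sklearn.preprocessing import LabelEncoder

# Toy target with the three class names that appear in train.csv
toy_target = ["Graduate", "Dropout", "Enrolled", "Graduate"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_target)

# classes_ is sorted alphabetically; the position of each class is its code
class_to_code = {label: int(code) for code, label in enumerate(toy_encoder.classes_)}
print(class_to_code)

# inverse_transform maps integer codes back to the original strings
recovered = list(toy_encoder.inverse_transform(toy_codes))
print(recovered)
```

This is why a prediction of 0 later comes back as Dropout and 2 as Graduate.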

In [68]:
train.head()

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,0,1,1,1,9238,1,1,126.0,1,1,...,0,6,7,6,12.428571,0,11.1,0.6,2.02,Graduate
1,1,1,17,1,9238,1,1,125.0,1,19,...,0,6,9,0,0.0,0,11.1,0.6,2.02,Dropout
2,2,1,17,2,9254,1,1,137.0,1,3,...,0,6,0,0,0.0,0,16.2,0.3,-0.92,Dropout
3,3,1,1,3,9500,1,1,131.0,1,19,...,0,8,11,7,12.82,0,11.1,0.6,2.02,Enrolled
4,4,1,1,2,9500,1,1,132.0,1,19,...,0,7,12,6,12.933333,0,7.6,2.6,0.32,Graduate


In [69]:
def transforma(df):
    # Drop the id column (only id; Course is kept as a feature).
    # Strictly speaking, it does not need to be dropped here.
    df=df.drop(columns=['id'])

    # New columns: totals over both semesters
    df['UnitsCredited']=df['Curricular units 1st sem (credited)']+df['Curricular units 2nd sem (credited)']
    df['UnitsEnrolled']=df['Curricular units 1st sem (enrolled)']+df['Curricular units 2nd sem (enrolled)']
    df['UnitsEvaluations']=df['Curricular units 1st sem (evaluations)']+df['Curricular units 2nd sem (evaluations)']
    df['UnitsApproved']=df['Curricular units 1st sem (approved)']+df['Curricular units 2nd sem (approved)']
    df['UnitsWithoutEval']=df['Curricular units 1st sem (without evaluations)']+df['Curricular units 2nd sem (without evaluations)']
    df['MeanGrade']=(df['Curricular units 1st sem (grade)']+df['Curricular units 2nd sem (grade)'])/2
    
    # Combine the new columns
    df['TotalUnits']=df['UnitsCredited']+df['UnitsEnrolled']+df['UnitsApproved']
    df['PercCredited']=df['UnitsCredited']/df['TotalUnits']
    df['PercEnrolled']=df['UnitsEnrolled']/df['TotalUnits']
    df['PercApproved']=df['UnitsApproved']/df['TotalUnits']
    # The ratios are NaN when TotalUnits == 0; replace those NaNs with 0
    df = df.fillna(0)
    
    # Reduce the number of categories: every value outside the listed ones becomes "other".
    # Series.where keeps the listed values and avoids writing strings in place into an integer column.
    df['Application mode']=df['Application mode'].where(df['Application mode'].isin([1,17,39]),"other")
    df['Nacionality']=df['Nacionality'].where(df['Nacionality'].isin([1]),"other")
    # Keep all the candidate categorical columns and cast them to str
    cat_columns = ['Marital status','Course','Application mode','Nacionality']
    df[cat_columns]=df[cat_columns].astype("str")
    return df

In [70]:
X=transforma(X)
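
To illustrate why `transforma` has to replace missing values after building the percentage columns, here is a small sketch on a hypothetical two-row frame (the values are invented). A student with no units at all gives 0/0, which pandas evaluates to NaN instead of raising an error:

```python
import pandas as pd

# Hypothetical two-student frame: the second student has no units at all
toy = pd.DataFrame({
    "UnitsCredited": [2, 0],
    "UnitsEnrolled": [12, 0],
    "UnitsApproved": [6, 0],
})
toy["TotalUnits"] = toy["UnitsCredited"] + toy["UnitsEnrolled"] + toy["UnitsApproved"]

# 0 / 0 yields NaN in pandas rather than raising an error
toy["PercApproved"] = toy["UnitsApproved"] / toy["TotalUnits"]
has_nan_before = bool(toy["PercApproved"].isna().any())

# Same repair as in transforma: missing ratios become 0
toy = toy.fillna(0)
print(toy)
```

Setting the ratio to 0 is a modelling choice; XGBoost can also handle NaN natively, so leaving the missing values in place would be another option.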


## First I will look for the best model. I use get_dummies to keep things simple, although it should really be done with OneHotEncoder

In [9]:
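# Store the dummies in a separate variable: the pipeline further down needs X with its original categorical columns
X_dummies=pd.get_dummies(X,drop_first=True, dtype='int')
X_train, X_test, y_train, y_test = train_test_split(X_dummies, y, test_size=0.33)

In [10]:
X_dummies.head()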

Unnamed: 0,Application order,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Admission grade,Displaced,...,Course_9238,Course_9254,Course_9500,Course_9556,Course_9670,Course_9773,Course_979,Course_9853,Course_9991,Nacionality_other
0,1,1,1,126.0,1,19,5,5,122.6,0,...,1,0,0,0,0,0,0,0,0,0
1,1,1,1,125.0,19,19,9,9,119.8,1,...,1,0,0,0,0,0,0,0,0,0
2,2,1,1,137.0,3,19,2,3,144.7,0,...,0,1,0,0,0,0,0,0,0,0
3,3,1,1,131.0,19,3,3,2,126.1,1,...,0,0,1,0,0,0,0,0,0,0
4,2,1,1,132.0,19,37,4,9,120.1,1,...,0,0,1,0,0,0,0,0,0,0


In [11]:
xgbmodel = XGBClassifier(objective='multi:softmax')

In [12]:
xgbmodel.get_params()

{'objective': 'multi:softmax',
 'use_label_encoder': None,
 'base_score': None,
 'booster': None,
 'callbacks': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'early_stopping_rounds': None,
 'enable_categorical': False,
 'eval_metric': None,
 'feature_types': None,
 'gamma': None,
 'gpu_id': None,
 'grow_policy': None,
 'importance_type': None,
 'interaction_constraints': None,
 'learning_rate': None,
 'max_bin': None,
 'max_cat_threshold': None,
 'max_cat_to_onehot': None,
 'max_delta_step': None,
 'max_depth': None,
 'max_leaves': None,
 'min_child_weight': None,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': None,
 'num_parallel_tree': None,
 'predictor': None,
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'sampling_method': None,
 'scale_pos_weight': None,
 'subsample': None,
 'tree_method': None,
 'validate_parameters': None,
 'verbosity': None}

In [13]:
xgbmodel.fit(X_train,y_train)

### Here I define the models to try. There are not many because fitting takes a long time. Any other model could be used instead

In [87]:
params={'base_score': [0.5], # initial prediction
     'booster': ['gbtree'], # (gbtree, gblinear, dart)
     'colsample_bylevel': [0.8], # fraction of columns sampled at each level
     'colsample_bytree': [0.7], # fraction of columns sampled per tree
     'gamma': [0.01],    # minimum loss reduction required to make a new split. Larger -> more conservative
     'learning_rate': [0.05], # (eta) contribution of each tree to the model
     'max_depth': [3], # maximum depth of each tree
     'min_child_weight': [1], # minimum sum of instance weights (hessian) required in a child
    #'missing': [1], # value to be treated as missing
     'n_estimators': [800], # number of trees
     'n_jobs': [-1], # parallel threads
     'random_state': [0], # seed for the model's random sampling
     'reg_alpha': [0.1], # L1 regularization
     'reg_lambda': [0.01,0.1], # L2 regularization
     'subsample': [0.9]} # fraction of rows sampled for each tree

In [88]:
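# The target has 3 classes: plain 'roc_auc' only supports binary targets, so use the one-vs-rest variant
scoring = ['roc_auc_ovr', 'accuracy']
grid_solver = GridSearchCV(estimator = xgbmodel, # model to train
                   param_grid = params,
                   scoring = scoring,
                   cv = 3,
                   n_jobs=-1,
                   refit='roc_auc_ovr', # metric used to select best_estimator_
                   verbose = 2)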
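
The following is a minimal sketch of a multi-metric grid search on a three-class problem. It uses synthetic data and a DecisionTreeClassifier as a lightweight stand-in for the XGBoost model (both are assumptions made so the example runs quickly). For multiclass targets, scikit-learn provides dedicated AUC scorers such as `roc_auc_ovr`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for X_train / y_train
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

toy_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # stand-in for the XGBoost model
    param_grid={"max_depth": [2, 4]},
    scoring=["roc_auc_ovr", "accuracy"],  # multiclass AUC (one-vs-rest) plus accuracy
    refit="roc_auc_ovr",                  # metric used to pick best_estimator_
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# With several metrics, cv_results_ holds one mean_test_<metric> array per metric
mean_auc = toy_search.cv_results_["mean_test_roc_auc_ovr"]
mean_acc = toy_search.cv_results_["mean_test_accuracy"]
print(toy_search.best_params_, mean_auc, mean_acc)
```

Checking that the `mean_test_*` arrays contain no NaN is a quick way to confirm that the chosen scorer is valid for the problem.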

In [89]:
grid_solver.fit(X_train,y_train)
# this takes quite a while

Fitting 3 folds for each of 2 candidates, totalling 6 fits




In [90]:
grid_solver.best_estimator_

In [91]:
yhat_test=grid_solver.predict(X_test)
yhat_train=grid_solver.predict(X_train)

In [92]:
accuracy_score(y_test,yhat_test)

0.8329571106094809

In [93]:
accuracy_score(y_train,yhat_train)
# there is some overfitting, but not much

0.8419451108900462
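
Accuracy is a single number, so it does not show which classes get confused with each other. As a sketch on invented labels (these are not the model's real predictions), `confusion_matrix` and `classification_report`, both already imported at the top of the notebook, give a per-class view:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy encoded labels (0=Dropout, 1=Enrolled, 2=Graduate); not real model output
toy_true = [0, 0, 1, 1, 2, 2, 2, 1]
toy_pred = [0, 0, 1, 2, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])
print(toy_cm)

# Per-class precision, recall and F1, with readable class names
print(classification_report(
    toy_true, toy_pred, target_names=["Dropout", "Enrolled", "Graduate"]
))
```

In the notebook, the same calls could be applied to `y_test` and `yhat_test`.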

In [23]:
final_model=grid_solver.best_estimator_

# Now that we have the best model, we need a OneHotEncoder because the test set could contain different categories

In [33]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
onehot=OneHotEncoder(drop = "first", handle_unknown="ignore")
# make_pipeline(onehot)
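
A small sketch of the problem mentioned above, using made-up course codes (`"9999"` is a hypothetical category that only appears in the test data). `get_dummies` derives its columns from whatever frame it receives, while a fitted OneHotEncoder always produces the columns it learned during training:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical course codes: "9999" appears only in the test data
toy_train = pd.DataFrame({"Course": ["9238", "9254", "9500"]})
toy_test = pd.DataFrame({"Course": ["9238", "9999"]})

# get_dummies builds its columns from whatever data it is given
dummies_train_cols = list(pd.get_dummies(toy_train).columns)
dummies_test_cols = list(pd.get_dummies(toy_test).columns)
print(dummies_train_cols, dummies_test_cols)

# OneHotEncoder learns the categories once, on the training data
toy_onehot = OneHotEncoder(handle_unknown="ignore")
toy_onehot.fit(toy_train)
encoded_test = toy_onehot.transform(toy_test).toarray()
print(encoded_test)
```

The unseen category is encoded as a row of zeros, so the model always receives the same number of columns. With `drop="first"`, as used in the notebook, an unseen category is also all zeros and therefore looks identical to the dropped first category.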

In [71]:
numeric_features=X.select_dtypes(include="number").columns  # public equivalent of the private _get_numeric_data()

In [72]:
categorical_features = X.drop(columns=numeric_features).columns
categorical_features

Index(['Marital status', 'Application mode', 'Course', 'Nacionality'], dtype='object')

### Since I have a pipeline for the categorical columns, I need one for the numeric columns too, so both can be passed to ColumnTransformer

In [73]:
numeric_transformer=make_pipeline(StandardScaler()) # we could leave it empty, but make_pipeline does not allow that; scaling does not affect tree-based models

In [74]:
onehot=OneHotEncoder(drop = "first", handle_unknown="ignore")
categorical_transformer= make_pipeline(OneHotEncoder(drop = "first", handle_unknown="ignore"))

In [75]:
# Column transformation step
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("pipeline para numericas", numeric_transformer, numeric_features),
        ("pipeline para categoricas", categorical_transformer, categorical_features)
        # 3 elements: (name, pipeline, list_of_column_names)
    ]
)
preprocessor


In [76]:
process_complete = make_pipeline(preprocessor, final_model)

In [77]:
process_complete.fit(X,y)

Parameters: { "scale_pos_weight" } are not used.



In [79]:
# does it work?
yhat=process_complete.predict(X)

In [80]:
accuracy_score(y,yhat)

0.8403904963537991

# Apply it to the submission dataset

In [81]:
test=pd.read_csv("test.csv")
submission=pd.read_csv("sample_submission.csv")

In [82]:
test.head()

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,76518,1,1,1,9500,1,1,141.0,1,3,...,0,0,8,0,0,0.0,0,13.9,-0.3,0.79
1,76519,1,1,1,9238,1,1,128.0,1,1,...,0,0,6,6,6,13.5,0,11.1,0.6,2.02
2,76520,1,1,1,9238,1,1,118.0,1,1,...,0,0,6,11,5,11.0,0,15.5,2.8,-4.06
3,76521,1,44,1,9147,1,39,130.0,1,1,...,0,3,8,14,5,11.0,0,8.9,1.4,3.51
4,76522,1,39,1,9670,1,1,110.0,1,1,...,0,0,6,9,4,10.666667,2,7.6,2.6,0.32


### With two steps, all the work is done on the test set (submission)

In [83]:
test = transforma(test)
yhat_submission = process_complete.predict(test)



In [84]:
yhat_submission
# we have to go back to the original strings

array([0, 2, 2, ..., 0, 0, 0])

In [85]:
y_encoder.inverse_transform(yhat_submission)

array(['Dropout', 'Graduate', 'Graduate', ..., 'Dropout', 'Dropout',
       'Dropout'], dtype=object)

In [86]:
submission.Target = y_encoder.inverse_transform(yhat_submission)
submission.to_csv("submission.csv",index=False)

In [None]:
# This submission gets an accuracy score of 0.834