### Training classical models
All classical models go through the same processing steps:
- First, the texts are vectorized to obtain the TF-IDF document-term matrix on which the models are trained.
- Next, a grid search (`GridSearchCV`) tunes the model's hyperparameters.
- Then the model is trained on the complete training set.
- Finally, the trained model is used to make predictions on the test set.
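The steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's exact code: the texts and labels are hypothetical, scikit-learn's built-in `"english"` stop-word list replaces the NLTK list, and the vectorizer is wrapped in a `Pipeline` (an assumption, not something the notebook does) so that TF-IDF is refit inside every cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus, only here to make the sketch runnable
train_texts = [
    "course syllabus homework exam", "lecture notes course schedule",
    "homework assignment course grading", "exam schedule lecture room",
    "student homepage hobbies resume", "my resume and student projects",
    "graduate student personal page", "student hobbies photos resume",
]
train_labels = ["course"] * 4 + ["student"] * 4
test_texts = ["course exam and homework", "resume of a graduate student"]

# Step 1: vectorizer and classifier chained in one estimator
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", SVC(class_weight="balanced")),
])

# Step 2: grid search over the classifier's hyperparameters
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
    scoring="f1_weighted",
    cv=2,  # tiny corpus, so only two folds
)

# Step 3: with refit=True (the default) the best configuration
# is retrained on the complete training set after the search
search.fit(train_texts, train_labels)

# Step 4: predictions on unseen texts
predictions = search.predict(test_texts)
print(search.best_params_, list(predictions))
```

Because `GridSearchCV` refits the best configuration on all the training data, `search.predict` can be called directly without building a second model by hand.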

#### Models and files used
The classical models selected are RandomForest and SVM. Other models were also tried (DecisionTree, KNN and Naive Bayes), but they produced worse results.
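A comparison of this kind can be sketched with `cross_val_score`. The snippet below is only an illustration under assumptions: it uses synthetic data from `make_classification` instead of the TF-IDF matrix, default hyperparameters, and `GaussianNB` as the Naive Bayes variant (the source does not say which variant was tried).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "NaiveBayes": GaussianNB(),
}

# Mean weighted F1 over 5 folds for every candidate
scores = {
    name: cross_val_score(model, X, y, scoring="f1_weighted", cv=5).mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>12}: {score:.3f}")
```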

Every model was trained on the extended files, first without word-sense disambiguation and then with it.

In [7]:
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

## SVM training
### Without word-sense disambiguation

In [8]:
data = pd.read_json('dataTrain.json')
data.columns = ['id', 'text','label']
stop_words= stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words)
tfidf = vectorizer.fit_transform(data['text'])
test_data = pd.read_json('dataTest.json')

In [9]:
y_train = data['label']
clf = SVC()
# Note: 'gamma' only affects the 'rbf' kernel; it is ignored when kernel='linear'
params = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [0.1, 0.01, 0.001],
    'class_weight': ['balanced']
}

# 10-fold cross-validated grid search, scored by weighted F1
model = GridSearchCV(clf,
                     param_grid=params,
                     scoring='f1_weighted',
                     cv=10)
model.fit(tfidf, y_train)
best_params = model.best_params_
best_model = model.best_estimator_
print(best_params)

{'C': 10, 'class_weight': 'balanced', 'gamma': 0.1, 'kernel': 'rbf'}
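Since `gamma` has no effect on the linear kernel, the grid above evaluates some equivalent configurations more than once. One possible alternative, shown here only as a sketch, is to pass a list of dictionaries so that `gamma` is searched for the `rbf` kernel alone; `ParameterGrid` is used just to count the candidates in each layout.

```python
from sklearn.model_selection import ParameterGrid

# Grid as written in the notebook: gamma is crossed with both kernels
single_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": [0.1, 0.01, 0.001],
    "class_weight": ["balanced"],
}

# Alternative layout: one sub-grid per kernel, gamma only for rbf
split_grid = [
    {"C": [0.1, 1, 10], "kernel": ["linear"], "class_weight": ["balanced"]},
    {"C": [0.1, 1, 10], "kernel": ["rbf"],
     "gamma": [0.1, 0.01, 0.001], "class_weight": ["balanced"]},
]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))
print(n_single, n_split)  # 18 12
```

With `cv=10` that is 180 fits against 120. `GridSearchCV` accepts the list form directly as `param_grid`.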


In [10]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, data['label'], test_size=0.2, random_state=42)
model = SVC(**best_params, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred, labels=data['label'].unique())
print(report)

              precision    recall  f1-score   support

      course       0.80      0.80      0.80       167
  department       0.80      0.62      0.70        26
     faculty       0.72      0.77      0.74       184
       other       0.82      0.80      0.81       590
     project       0.57      0.55      0.56        94
       staff       0.43      0.14      0.21        22
     student       0.75      0.83      0.79       242

    accuracy                           0.77      1325
   macro avg       0.70      0.64      0.66      1325
weighted avg       0.77      0.77      0.77      1325
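The two averaged rows of the report differ because `macro avg` weights every class equally, while `weighted avg` weights each class by its support. As a worked check, the F1 averages can be recomputed from the rounded per-class values printed above:

```python
# (f1-score, support) per class, copied from the report above
per_class = {
    "course": (0.80, 167),
    "department": (0.70, 26),
    "faculty": (0.74, 184),
    "other": (0.81, 590),
    "project": (0.56, 94),
    "staff": (0.21, 22),
    "student": (0.79, 242),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: plain mean of the per-class scores
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))  # 1325 0.66 0.77
```

The low score of the small `staff` class pulls the macro average down to 0.66, whereas it barely moves the weighted average.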



In [11]:
model = SVC(**best_params, random_state=42)
model.fit(tfidf, data['label'])
test_tfidf = vectorizer.transform(test_data['text'])
y_pred = model.predict(test_tfidf)

In [12]:
result_df = pd.DataFrame({'id': test_data.id, 'Predicted': y_pred})
result_df

           id Predicted
0     aaclkul   student
1     aagelci   project
2     aangjmn     other
3      aawnpc     other
4     abdjgiz   student
...       ...       ...
1654    zxmmn     other
1655   zxwkru     other
1656  zybimtt     other
1657  zypnixf   faculty


In [13]:
result_df.to_csv('ENXEBRE_SVM_NoDisambiguated.csv')
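By default `to_csv` also writes the DataFrame index, which shows up as an `Unnamed: 0` column when the file is read back. Whether the expected submission format wants that column is not stated here; if it does not, `index=False` drops it. A small sketch using in-memory buffers instead of files:

```python
import io

import pandas as pd

result_df = pd.DataFrame({"id": ["aaclkul", "aagelci"],
                          "Predicted": ["student", "project"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
result_df.to_csv(with_index)

# index=False writes only the 'id' and 'Predicted' columns
without_index = io.StringIO()
result_df.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)     # ['Unnamed: 0', 'id', 'Predicted']
print(cols_without)  # ['id', 'Predicted']
```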

## SVM training
### With word-sense disambiguation

In [41]:
data = pd.read_json('disambiguated_dataTrain.json')
data.columns = ['id', 'label','text']
stop_words= stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words)
tfidf = vectorizer.fit_transform(data['text'])
test_data = pd.read_json('disambiguated_dataTest.json')

In [15]:
y_train = data['label']
clf = SVC()
# Note: 'gamma' only affects the 'rbf' kernel; it is ignored when kernel='linear'
params = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [0.1, 0.01, 0.001],
    'class_weight': ['balanced']
}

# 10-fold cross-validated grid search, scored by weighted F1
model = GridSearchCV(clf,
                     param_grid=params,
                     scoring='f1_weighted',
                     cv=10)
model.fit(tfidf, y_train)
best_params = model.best_params_
best_model = model.best_estimator_
print(best_params)

{'C': 10, 'class_weight': 'balanced', 'gamma': 0.1, 'kernel': 'rbf'}


In [16]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, data['label'], test_size=0.2, random_state=42)
model = SVC(**best_params, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred, labels=data['label'].unique())
print(report)

              precision    recall  f1-score   support

      course       0.79      0.83      0.81       167
  department       0.75      0.58      0.65        26
     faculty       0.76      0.77      0.76       184
       other       0.81      0.80      0.80       590
     project       0.58      0.51      0.54        94
       staff       0.33      0.05      0.08        22
     student       0.73      0.82      0.77       242

    accuracy                           0.77      1325
   macro avg       0.68      0.62      0.63      1325
weighted avg       0.76      0.77      0.76      1325
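The hold-out split used for these reports is purely random, so small classes such as `staff` (22 test pages) may end up over- or under-represented. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions in both parts. A sketch on hypothetical imbalanced labels:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% other, 15% student, 5% staff
labels = ["other"] * 80 + ["student"] * 15 + ["staff"] * 5
features = [[i] for i in range(len(labels))]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # preserve class proportions in train and test
)

test_counts = Counter(y_test)
print(test_counts)
```

With 100 samples and a 20% test size, the test part keeps the 80/15/5 ratio, that is 16 `other`, 3 `student` and 1 `staff`.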



In [17]:
model = SVC(**best_params, random_state=42)
model.fit(tfidf, data['label'])
test_tfidf = vectorizer.transform(test_data['text'])
y_pred = model.predict(test_tfidf)

In [18]:
result_df = pd.DataFrame({'id': test_data.id, 'Predicted': y_pred})
result_df

           id Predicted
0     aaclkul   student
1     aagelci   project
2     aangjmn     other
3      aawnpc     other
4     abdjgiz   student
...       ...       ...
1654    zxmmn     other
1655   zxwkru     other
1656  zybimtt     other
1657  zypnixf   faculty


In [19]:
result_df.to_csv('ENXEBRE_SVM_Disambiguated.csv')

## RandomForest training
### Without word-sense disambiguation

In [20]:
data = pd.read_json('dataTrain.json')
data.columns = ['id', 'text','label']
stop_words= stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words)
tfidf = vectorizer.fit_transform(data['text'])
test_data = pd.read_json('dataTest.json')

In [21]:
y_train = data['label']
clf = RandomForestClassifier()
params = {
    'min_samples_split': [2, 4, 6],
    'n_estimators': [100, 300, 500],
    'class_weight': ['balanced']
}

# 10-fold cross-validated grid search, scored by weighted F1
model = GridSearchCV(clf,
                     param_grid=params,
                     scoring='f1_weighted',
                     cv=10)
model.fit(tfidf, y_train)
best_params = model.best_params_
best_model = model.best_estimator_
print(best_params)

{'class_weight': 'balanced', 'min_samples_split': 6, 'n_estimators': 500}
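Both selected values (`min_samples_split=6`, `n_estimators=500`) are the largest in their ranges, which can be a hint that a wider grid is worth trying. The fitted search object exposes every evaluated combination in `cv_results_`. The sketch below shows one way to inspect it; it runs on synthetic data with a deliberately small grid, so its numbers say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix and labels
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

param_grid = {"min_samples_split": [2, 4, 6], "n_estimators": [10, 20, 40]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1_weighted", cv=3)
search.fit(X, y)

# One row per parameter combination, best first
results = pd.DataFrame(search.cv_results_)[
    ["param_min_samples_split", "param_n_estimators",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))

# Flag parameters whose best value lies on the boundary of the searched range
on_edge = {
    name: search.best_params_[name] in (min(values), max(values))
    for name, values in param_grid.items()
}
print(on_edge)
```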


In [22]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, data['label'], test_size=0.2, random_state=42)
model = RandomForestClassifier(**best_params, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred, labels=data['label'].unique())
print(report)

              precision    recall  f1-score   support

      course       0.94      0.82      0.88       167
  department       0.88      0.58      0.70        26
     faculty       0.87      0.83      0.85       184
       other       0.80      0.93      0.86       590
     project       0.92      0.26      0.40        94
       staff       0.00      0.00      0.00        22
     student       0.76      0.85      0.80       242

    accuracy                           0.82      1325
   macro avg       0.74      0.61      0.64      1325
weighted avg       0.82      0.82      0.80      1325



In [23]:
model = RandomForestClassifier(**best_params, random_state=42)
model.fit(tfidf, data['label'])
test_tfidf = vectorizer.transform(test_data['text'])
y_pred = model.predict(test_tfidf)

In [24]:
result_df = pd.DataFrame({'id': test_data.id, 'Predicted': y_pred})
result_df

           id Predicted
0     aaclkul   student
1     aagelci     other
2     aangjmn     other
3      aawnpc     other
4     abdjgiz   student
...       ...       ...
1654    zxmmn     other
1655   zxwkru     other
1656  zybimtt     other
1657  zypnixf   faculty


In [25]:
result_df.to_csv('ENXEBRE_RandomForest_NoDesmabiguado.csv')

## RandomForest training
### With word-sense disambiguation

In [47]:
data = pd.read_json('disambiguated_dataTrain.json')
data.columns = ['id', 'label','text']
stop_words= stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words)
tfidf = vectorizer.fit_transform(data['text'])
test_data = pd.read_json('disambiguated_dataTest.json')

In [48]:
y_train = data['label']
clf = RandomForestClassifier()
params = {
    'min_samples_split': [2, 4, 6],
    'n_estimators': [100, 300, 500],
    'class_weight': ['balanced']
}

# 10-fold cross-validated grid search, scored by weighted F1
model = GridSearchCV(clf,
                     param_grid=params,
                     scoring='f1_weighted',
                     cv=10)
model.fit(tfidf, y_train)
best_params = model.best_params_
best_model = model.best_estimator_
print(best_params)

{'class_weight': 'balanced', 'min_samples_split': 6, 'n_estimators': 500}


In [49]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, data['label'], test_size=0.2, random_state=42)
model = RandomForestClassifier(**best_params, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred, labels=data['label'].unique())
print(report)

              precision    recall  f1-score   support

      course       0.88      0.84      0.86       167
  department       1.00      0.62      0.76        26
     faculty       0.86      0.74      0.79       184
       other       0.77      0.92      0.84       590
     project       0.91      0.21      0.34        94
       staff       0.00      0.00      0.00        22
     student       0.73      0.81      0.76       242

    accuracy                           0.79      1325
   macro avg       0.74      0.59      0.62      1325
weighted avg       0.79      0.79      0.77      1325
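The `staff` row is all zeros because the model never predicts that class on the hold-out split. In that situation precision is undefined (no predicted samples) and scikit-learn reports it as 0 while raising an `UndefinedMetricWarning`. In recent scikit-learn versions the `zero_division` argument of `classification_report` makes that choice explicit and silences the warning. A sketch on hypothetical labels:

```python
from sklearn.metrics import classification_report

# Hypothetical labels: "staff" occurs in y_true but is never predicted
y_true = ["other", "other", "student", "staff", "student", "other"]
y_pred = ["other", "other", "student", "other", "student", "other"]

# zero_division=0 reports undefined metrics as 0 without warning
report = classification_report(y_true, y_pred,
                               zero_division=0, output_dict=True)
print(report["staff"])
```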





In [50]:
model = RandomForestClassifier(**best_params, random_state=42)
model.fit(tfidf, data['label'])
test_tfidf = vectorizer.transform(test_data['text'])
y_pred = model.predict(test_tfidf)

In [51]:
result_df = pd.DataFrame({'id': test_data.id, 'Predicted': y_pred})
result_df

           id Predicted
0     aaclkul   student
1     aagelci     other
2     aangjmn     other
3      aawnpc     other
4     abdjgiz   student
...       ...       ...
1654    zxmmn     other
1655   zxwkru     other
1656  zybimtt     other
1657  zypnixf     other


In [52]:
result_df.to_csv('ENXEBRE_RandomForest_Desmabiguado.csv')