# Project: Diploma Program in Artificial Intelligence

Objective: Train 2 Machine Learning models and compare their results. Use at least 3 different configurations of the models' parameters and comment on the results.

The following performance metrics must be compared:

- Accuracy
- F1-score
- ROC-AUC
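
As a rough illustration of what the first two metrics measure in a multiclass setting, here is a minimal, dependency-free sketch. The function names and the toy labels are made up for this example; the notebook itself relies on scikit-learn's `accuracy_score` and `f1_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that never appears in truth or prediction contributes 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (made up for illustration): three classes, six samples
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred)
print(toy_accuracy, toy_macro_f1)
```

Macro averaging weights every class equally, which matters when some classes are harder to predict than others.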

Perform 10 experimental runs and compare the average of these performance metrics. Split the dataset into 80% (training) and 20% (test).
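
The protocol above can be sketched without any library as follows. This is only an illustration under stated assumptions: `repeated_holdout` and the `evaluate` callback are hypothetical names, the placeholder metric is not a real model score, and the split is not stratified (the notebook uses scikit-learn's `train_test_split` for the actual partition).

```python
import random
from statistics import mean


def repeated_holdout(n_samples, evaluate, n_runs=10, test_fraction=0.2, seed=0):
    """Average a metric over several random train/test partitions.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains a
    model on the training indices and returns one score for the test indices.
    """
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    scores = []
    for _ in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a different partition in every run
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores), scores


# Placeholder "metric" for the demo: the share of even indices in the test set
def share_of_even(train_idx, test_idx):
    return sum(i % 2 == 0 for i in test_idx) / len(test_idx)


avg_score, run_scores = repeated_holdout(100, share_of_even)
print(len(run_scores), avg_score)
```

Reshuffling before every split is what makes the 10 runs differ; with a fixed partition the average would simply repeat a single run.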

The data must be preprocessed. Use the notebook "1 natural-language-processing-sentiment-analysis.ipynb" as a basis.



## Imports

In [21]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
# label_binarize is needed by the SVC section
from sklearn.preprocessing import LabelBinarizer, label_binarize
from sklearn.svm import SVC

# Download the NLTK stop-word list used during preprocessing
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amandaflores/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load data

In [3]:
newsgroups_data = fetch_20newsgroups()

## Show an example

In [4]:
first_label = newsgroups_data.target[0]
print("Text: ", newsgroups_data.data[0])
# Look up the name through the document's own label, not position 0 of target_names
print("Label (text): ", newsgroups_data.target_names[first_label])
print("Label (number): ", first_label)

Text:  From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----


Label (text):  rec.autos
Label (number):  7


## Labels and their corresponding names

In [5]:
for idx, cat in enumerate(newsgroups_data.target_names):
    print("Number: ", idx, "Category: ", cat)

Number:  0 Category:  alt.atheism
Number:  1 Category:  comp.graphics
Number:  2 Category:  comp.os.ms-windows.misc
Number:  3 Category:  comp.sys.ibm.pc.hardware
Number:  4 Category:  comp.sys.mac.hardware
Number:  5 Category:  comp.windows.x
Number:  6 Category:  misc.forsale
Number:  7 Category:  rec.autos
Number:  8 Category:  rec.motorcycles
Number:  9 Category:  rec.sport.baseball
Number:  10 Category:  rec.sport.hockey
Number:  11 Category:  sci.crypt
Number:  12 Category:  sci.electronics
Number:  13 Category:  sci.med
Number:  14 Category:  sci.space
Number:  15 Category:  soc.religion.christian
Number:  16 Category:  talk.politics.guns
Number:  17 Category:  talk.politics.mideast
Number:  18 Category:  talk.politics.misc
Number:  19 Category:  talk.religion.misc


## Data Preprocessing

In [6]:
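# Build the stop-word set and the stemmer once, not once per word or document
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

corpus = []
for document in newsgroups_data.data:
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', document)
    review = review.lower().split()
    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))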

In [7]:
pd.DataFrame(corpus).head()

| | 0 |
|---|---|
| 0 | lerxst wam umd edu thing subject car nntp post... |
| 1 | guykuo carson u washington edu guy kuo subject... |
| 2 | twilli ec ecn purdu edu thoma e willi subject ... |
| 3 | jgreen amber joe green subject weitek p organ ... |
| 4 | jcm head cfa harvard edu jonathan mcdowel subj... |


In [8]:
cv=CountVectorizer(max_features=1500)
X=cv.fit_transform(corpus).toarray()

In [9]:
print(cv.get_feature_names_out())
print(X[0])

['aa' 'ab' 'abil' ... 'yet' 'york' 'young']
[0 0 0 ... 0 0 0]
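
To make the bag-of-words representation above more concrete, the following sketch counts terms by hand on a toy corpus. It is a simplification and not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, tokens are split on whitespace only, and the toy documents are invented. It mirrors the idea behind `max_features`: keep only the most frequent terms across the corpus and use them as columns.

```python
from collections import Counter


def bag_of_words(documents, max_features):
    """Count-based document vectors over the `max_features` most frequent terms."""
    tokenized = [doc.split() for doc in documents]
    totals = Counter(token for tokens in tokenized for token in tokens)
    # Keep the most frequent terms, then order the columns alphabetically
    vocabulary = sorted(term for term, _ in totals.most_common(max_features))
    vectors = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, vectors


# Toy corpus of already-stemmed documents (made up for illustration)
toy_corpus = ["car engin car", "engin spec", "car door"]
vocabulary, vectors = bag_of_words(toy_corpus, max_features=2)
print(vocabulary)  # ['car', 'engin']
print(vectors)     # [[2, 1], [0, 1], [1, 0]]
```

Each row is one document and each column one vocabulary term, which is why most entries in the real 1500-column matrix are zero.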


In [10]:
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [11]:
Y = newsgroups_data.target
print(Y)

[7 4 4 ... 3 1 8]


In [12]:
encoder = LabelBinarizer()
y_encoder = encoder.fit_transform(Y)
print(y_encoder)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
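
The indicator matrix printed above has one column per class and a single 1 per row. A minimal sketch of that encoding, using a hypothetical `one_hot` helper and a few made-up labels, could look like this:

```python
def one_hot(labels):
    """Return the sorted classes and one indicator row per label."""
    classes = sorted(set(labels))
    column = {label: j for j, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1  # mark the column of this sample's class
        rows.append(row)
    return classes, rows


toy_classes, toy_rows = one_hot([7, 4, 4, 1])
print(toy_classes)  # [1, 4, 7]
print(toy_rows)     # [[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]]

# The original label is recovered from the position of the 1 in each row
recovered = [toy_classes[row.index(1)] for row in toy_rows]
print(recovered)    # [7, 4, 4, 1]
```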


## Classification

### Random Forest Model

In [15]:
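# Three configurations that differ only in the split criterion.
# Note: in scikit-learn, 'entropy' and 'log_loss' both select the Shannon information gain,
# so configurations 1 and 3 are expected to behave identically for the same random_state.
classifier_RF_1 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier_RF_2 = RandomForestClassifier(n_estimators = 10, criterion = 'gini', random_state = 0)
classifier_RF_3 = RandomForestClassifier(n_estimators = 10, criterion = 'log_loss', random_state = 0)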

In [16]:
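def calculate_metrics_rf(y_true, y_pred):
    """Return [accuracy, ROC AUC, macro F1] for one-hot targets and 0/1 predictions."""
    # With one-hot (multilabel-indicator) targets, accuracy_score is exact-match accuracy,
    # and roc_auc_score is computed here from hard predictions, not class probabilities.
    return [
        accuracy_score(y_true, y_pred),
        roc_auc_score(y_true, y_pred),
        f1_score(y_true, y_pred, average='macro')
    ]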

In [17]:
# Initialize the per-run metric lists
data_ac_rf = []
data_roc_rf = []
data_f1_rf = []

models = [classifier_RF_1, classifier_RF_2, classifier_RF_3]

# Run the experiments
for i in range(10):
    print("Experiment:", i)

    # Split the data (80% training, 20% test), stratified by class
    X_train, X_test, y_train, y_test = train_test_split(X, y_encoder, test_size=0.2, shuffle=True, stratify=Y)

    # Train the models
    for model in models:
        model.fit(X_train, y_train)

    # Predict and compute the metrics for each model
    metrics_ac = [calculate_metrics_rf(y_test, model.predict(X_test)) for model in models]

    # Append each metric to its list (one column per model)
    data_ac_rf.append([metric[0] for metric in metrics_ac])
    data_roc_rf.append([metric[1] for metric in metrics_ac])
    data_f1_rf.append([metric[2] for metric in metrics_ac])

Experiment: 0
Experiment: 1
Experiment: 2
Experiment: 3
Experiment: 4
Experiment: 5
Experiment: 6
Experiment: 7
Experiment: 8
Experiment: 9


In [18]:
metrics_names = ['Accuracy', 'ROC AUC', 'F1']
# Average each metric over the 10 runs, one block per configuration
for i in range(3):
    print(f"Classification {i + 1}: ")
    print(f"{metrics_names[0]}: {np.mean(np.array(data_ac_rf)[:, i])}")
    print(f"{metrics_names[1]}: {np.mean(np.array(data_roc_rf)[:, i])}")
    print(f"{metrics_names[2]}: {np.mean(np.array(data_f1_rf)[:, i])}")
    print("----------------------------------------------------------------")

Classification 1: 
Accuracy: 0.2790543526292532
ROC AUC: 0.6361660560146228
F1: 0.3945953658362674
----------------------------------------------------------------
Classification 2: 
Accuracy: 0.45819708351745475
ROC AUC: 0.7242601785048113
F1: 0.586196006128965
----------------------------------------------------------------
Classification 3: 
Accuracy: 0.2790543526292532
ROC AUC: 0.6361660560146228
F1: 0.3945953658362674
----------------------------------------------------------------
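
The ROC AUC values above are computed from the hard 0/1 outputs of `predict`. The sketch below illustrates, with invented numbers and a hypothetical `binary_auc` helper, how the same scores can yield a different AUC once they are thresholded. It is a dependency-free illustration of the pairwise definition of AUC for one class, not a reproduction of the notebook's results.

```python
def binary_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up one-vs-rest example for a single class
toy_true = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]           # continuous scores
toy_hard = [1 if s >= 0.5 else 0 for s in toy_scores]  # thresholded 0/1 predictions

auc_from_scores = binary_auc(toy_true, toy_scores)
auc_from_hard = binary_auc(toy_true, toy_hard)
print(auc_from_scores, auc_from_hard)
```

With 0/1 inputs the ROC curve has a single operating point, so the ranking information carried by the scores is lost. Passing continuous scores such as class probabilities is a common alternative when the classifier exposes them.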


### SVC

In [None]:
from sklearn.preprocessing import label_binarize

acc_1, acc_2, acc_3 = [], [], []
f1_1, f1_2, f1_3 = [], [], []
roc_auc_1, roc_auc_2, roc_auc_3 = [], [], []

# One (kernel, accuracy list, F1 list, ROC AUC list) entry per configuration
configs = [
    ("linear", acc_1, f1_1, roc_auc_1),
    ("poly", acc_2, f1_2, roc_auc_2),
    ("rbf", acc_3, f1_3, roc_auc_3),
]
n_classes = len(newsgroups_data.target_names)

for i in range(10):
    print("Experiment ", i)
    # Split the dataset into training and test sets.
    # Note: random_state=0 yields the same partition in every run.
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.20, random_state=0
    )
    # One-hot test labels, one column per class, to match decision_function's output
    y_test_bin = label_binarize(y_test, classes=list(range(n_classes)))

    for kernel, acc, f1, roc_auc in configs:
        svm_classifier = SVC(kernel=kernel, C=0.5, max_iter=1000)
        svm_classifier.fit(X_train, y_train)
        y_pred = svm_classifier.predict(X_test)
        acc.append(accuracy_score(y_test, y_pred))
        # Micro-averaged F1 equals accuracy for single-label multiclass predictions
        f1.append(f1_score(y_test, y_pred, average="micro"))
        roc_auc.append(
            roc_auc_score(
                y_test_bin, svm_classifier.decision_function(X_test), multi_class="ovr"
            )
        )

Experiment  0
Experiment  1
Experiment  2
Experiment  3
Experiment  4
Experiment  5
Experiment  6
Experiment  7
Experiment  8
Experiment  9
In [None]:
print("Accuracy 1: ", np.mean(acc_1))
print("Accuracy 2: ", np.mean(acc_2))
print("Accuracy 3: ", np.mean(acc_3))

print("F1 1: ", np.mean(f1_1))
print("F1 2: ", np.mean(f1_2))
print("F1 3: ", np.mean(f1_3))

print("ROC AUC 1: ", np.mean(roc_auc_1))
print("ROC AUC 2: ", np.mean(roc_auc_2))
print("ROC AUC 3: ", np.mean(roc_auc_3))

Accuracy 1:  0.7684489615554575
Accuracy 2:  0.044189129474149366
Accuracy 3:  0.05435262925320371
F1 1:  0.7684489615554575
F1 2:  0.044189129474149366
F1 3:  0.05435262925320371
ROC AUC 1:  0.9600818607684586
ROC AUC 2:  0.7362489008774401
ROC AUC 3:  0.9084863532152637
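
The F1 values above coincide with the accuracy values because the SVC cell uses `average="micro"`. The following sketch, with a hypothetical `micro_f1` helper and invented labels, shows why: in single-label multiclass problems every misclassified sample adds exactly one false positive and one false negative to the pooled counts.

```python
def micro_f1(y_true, y_pred):
    """F1 from true positives, false positives and false negatives pooled over all classes."""
    labels = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == p for t, p in pairs)
    fp = sum(p == label and t != label for label in labels for t, p in pairs)
    fn = sum(t == label and p != label for label in labels for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels (made up for illustration): three classes, six samples
svc_toy_true = [0, 0, 1, 1, 2, 2]
svc_toy_pred = [0, 1, 1, 1, 2, 0]

toy_micro_f1 = micro_f1(svc_toy_true, svc_toy_pred)
toy_acc = sum(t == p for t, p in zip(svc_toy_true, svc_toy_pred)) / len(svc_toy_true)
print(toy_micro_f1, toy_acc)
```

Since the Random Forest section reports macro-averaged F1, the F1 figures of the two sections are not directly comparable unless the same averaging is used for both.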
