# Artificial Intelligence Diploma Project

Objective: Train 2 Machine Learning models and compare their results. Use at least 3 different parameter configurations for each model and comment on the results.

The following performance metrics must be compared:

- Accuracy
- F1-score
- ROC-AUC

Perform 10 experimental runs and compare the average of these performance metrics. Split the dataset into 80% (training) and 20% (test).

Preprocess the data. Base the preprocessing on the notebook: "1 natural-language-processing-sentiment-analysis.ipynb".
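
The protocol described above (10 runs, an 80/20 split, averaged metrics) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `stratified_split`, `run_experiments`, `toy_labels` and `majority_baseline` are hypothetical names that do not appear in the notebook, and the "model" is a placeholder that always predicts the most common training label.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_split(labels, test_size=0.2, seed=None):
    """Return (train_idx, test_idx), sampling test items from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def run_experiments(labels, evaluate, n_runs=10, test_size=0.2):
    """Average the score returned by `evaluate` over `n_runs` different splits."""
    scores = []
    for run in range(n_runs):
        # A different seed per run gives a different split each time.
        train_idx, test_idx = stratified_split(labels, test_size, seed=run)
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores)


# Toy labels (hypothetical): 50 items of class 0 and 50 items of class 1.
toy_labels = [0] * 50 + [1] * 50


def majority_baseline(train_idx, test_idx):
    # Placeholder "model": always predict the most common training label.
    train_labels = [toy_labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return mean(1.0 if toy_labels[i] == majority else 0.0 for i in test_idx)


average_accuracy = run_experiments(toy_labels, majority_baseline)
print(average_accuracy)  # 0.5 on this balanced toy set
```

Seeding each run with its own index keeps the experiment reproducible while still giving every run a different partition.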



## Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amandaflores/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Load data

In [2]:
newsgroups_data = fetch_20newsgroups()

## Show an example

In [3]:
print("Text: ", newsgroups_data.data[0])
# Look up the name of this sample's own label, not the first category name.
print("Label (text): ", newsgroups_data.target_names[newsgroups_data.target[0]])
print("Label (number): ", newsgroups_data.target[0])

Text:  From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Label (text):  rec.autos
Label (number):  7


## Labels and their corresponding names

In [4]:
for idx, cat in enumerate(newsgroups_data.target_names):
    print("Number: ",idx,"Category: ", cat)

Number:  0 Category:  alt.atheism
Number:  1 Category:  comp.graphics
Number:  2 Category:  comp.os.ms-windows.misc
Number:  3 Category:  comp.sys.ibm.pc.hardware
Number:  4 Category:  comp.sys.mac.hardware
Number:  5 Category:  comp.windows.x
Number:  6 Category:  misc.forsale
Number:  7 Category:  rec.autos
Number:  8 Category:  rec.motorcycles
Number:  9 Category:  rec.sport.baseball
Number:  10 Category:  rec.sport.hockey
Number:  11 Category:  sci.crypt
Number:  12 Category:  sci.electronics
Number:  13 Category:  sci.med
Number:  14 Category:  sci.space
Number:  15 Category:  soc.religion.christian
Number:  16 Category:  talk.politics.guns
Number:  17 Category:  talk.politics.mideast
Number:  18 Category:  talk.politics.misc
Number:  19 Category:  talk.religion.misc


## Data preprocessing

In [5]:
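# Build the stemmer and the stopword set once instead of on every iteration.
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for document in newsgroups_data.data:
    # Keep letters only, lowercase, and split into tokens.
    review = re.sub('[^a-zA-Z]', ' ', document)
    review = review.lower().split()
    # Drop stopwords and stem the remaining words.
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))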
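
The cleaning steps above can be tried on a single sentence without any downloads. This is a minimal sketch under assumptions: `TOY_STOPWORDS` is a deliberately tiny, hypothetical stopword set standing in for NLTK's English list, `clean_text` is a hypothetical helper, and stemming is skipped.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"i", "was", "if", "on", "this", "the", "a", "to", "be"}


def clean_text(text, stop_words=TOY_STOPWORDS):
    """Keep letters only, lowercase, split on whitespace and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)


sample = "I was wondering if anyone could enlighten me on this car!"
print(clean_text(sample))  # wondering anyone could enlighten me car
```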

In [6]:
pd.DataFrame(corpus).head()

                                                   0
0  lerxst wam umd edu thing subject car nntp post...
1  guykuo carson u washington edu guy kuo subject...
2  twilli ec ecn purdu edu thoma e willi subject ...
3  jgreen amber joe green subject weitek p organ ...
4  jcm head cfa harvard edu jonathan mcdowel subj...


In [7]:
cv=CountVectorizer(max_features=1500)
X=cv.fit_transform(corpus).toarray()

In [8]:
print(cv.get_feature_names_out())
print(X[0])

['aa' 'ab' 'abil' ... 'yet' 'york' 'young']
[0 0 0 ... 0 0 0]
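
To make the bag-of-words step concrete, here is a rough standard-library sketch of the idea behind `CountVectorizer(max_features=...)`: count every word, keep the most frequent ones, and represent each document as a vector of counts. `bag_of_words` and `toy_corpus` are hypothetical names, and the real class differs in details such as tokenization and tie-breaking, so treat this as an approximation only.

```python
from collections import Counter


def bag_of_words(documents, max_features):
    """Count word occurrences, keeping only the `max_features` most frequent words."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.split())
    # Keep the top words by total count, then sort the vocabulary alphabetically.
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


toy_corpus = ["car engin car", "engin spec", "car door"]
vocabulary, matrix = bag_of_words(toy_corpus, max_features=2)
print(vocabulary)  # ['car', 'engin']
print(matrix)      # [[2, 1], [0, 1], [1, 0]]
```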


In [9]:
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [10]:
Y = newsgroups_data.target
print(Y)

[7 4 4 ... 3 1 8]


In [11]:
encoder = LabelBinarizer()
y_encoder = encoder.fit_transform(Y)
print(y_encoder)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
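
For more than two classes, `LabelBinarizer` produces one 0/1 column per class. The hypothetical `one_hot` helper below sketches that mapping with plain lists; it is an illustration of the encoding, not the library's implementation.

```python
def one_hot(labels):
    """Map each label to a 0/1 row with a single 1 in its class column."""
    classes = sorted(set(labels))
    column = {label: position for position, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1
        rows.append(row)
    return classes, rows


classes, encoded = one_hot([7, 4, 4, 3])
print(classes)  # [3, 4, 7]
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]]
```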


## Classification

### Random Forest Model

In [12]:
# Note: scikit-learn treats 'entropy' and 'log_loss' as the same split criterion
# (both use Shannon information gain), so classifiers 1 and 3 are equivalent.
classifier_RF_1 = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier_RF_2 = RandomForestClassifier(n_estimators=10, criterion='gini', random_state=0)
classifier_RF_3 = RandomForestClassifier(n_estimators=10, criterion='log_loss', random_state=0)
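
The three configurations differ only in the split criterion. As a rough illustration of what the criteria measure, the sketch below computes Gini impurity and Shannon entropy for the labels in a single tree node. The helper names are hypothetical, and this is not how scikit-learn implements them internally.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


node = ["a", "a", "b", "b"]
print(gini_impurity(node))  # 0.5
print(entropy(node))        # 1.0
```

Both measures are zero for a pure node and largest when the classes are evenly mixed; a tree picks the split that reduces the chosen measure the most.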

In [13]:
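def calculate_metrics_rf(y_true, y_pred):
    # y_true and y_pred are one-hot (multilabel-indicator) arrays here, so
    # accuracy is exact-match (subset) accuracy and ROC AUC is computed from
    # hard 0/1 predictions rather than from predicted probabilities.
    return [
        accuracy_score(y_true, y_pred),
        roc_auc_score(y_true, y_pred),
        f1_score(y_true, y_pred, average='macro')
    ]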
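
Because the Random Forest is trained on one-hot targets, `accuracy_score` receives 0/1 rows and reports subset accuracy: a sample counts as correct only if its whole row matches. The hypothetical `subset_accuracy` helper below sketches that behavior on made-up rows, including predictions where no class is selected at all.

```python
def subset_accuracy(true_rows, pred_rows):
    """Fraction of samples whose entire 0/1 row matches exactly."""
    matches = sum(1 for t, p in zip(true_rows, pred_rows) if t == p)
    return matches / len(true_rows)


# Made-up example: two predictions are all zeros (no class selected).
true_rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
pred_rows = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
print(subset_accuracy(true_rows, pred_rows))  # 0.5
```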

In [14]:
# Initialize lists
data_ac_rf = []
data_roc_rf = []
data_f1_rf = []

rf_models = [classifier_RF_1, classifier_RF_2, classifier_RF_3]

# Run the experiments
for i in range(10):
    print("Experiment:", i)

    # Split the data (no random_state, so each run draws a different split)
    X_train, X_test, y_train, y_test = train_test_split(X, y_encoder, test_size=0.2, shuffle=True, stratify=Y)

    # Train the models
    for model in rf_models:
        model.fit(X_train, y_train)

    # Predict and compute the metrics
    metrics_ac = [calculate_metrics_rf(y_test, model.predict(X_test)) for model in rf_models]

    # Collect the metric values in lists
    data_ac_rf.append([metric[0] for metric in metrics_ac])
    data_roc_rf.append([metric[1] for metric in metrics_ac])
    data_f1_rf.append([metric[2] for metric in metrics_ac])

Experiment: 0
Experiment: 1
Experiment: 2
Experiment: 3
Experiment: 4
Experiment: 5
Experiment: 6
Experiment: 7
Experiment: 8
Experiment: 9


In [15]:
metrics_names = ['Accuracy', 'ROC AUC', 'F1']
for i in range(3):
    print(f"Classifier {i + 1}: ")
    print(f"{metrics_names[0]}: {np.mean(np.array(data_ac_rf)[:, i])}")
    print(f"{metrics_names[1]}: {np.mean(np.array(data_roc_rf)[:, i])}")
    print(f"{metrics_names[2]}: {np.mean(np.array(data_f1_rf)[:, i])}")
    print("----------------------------------------------------------------")

Classifier 1: 
Accuracy: 0.2799381352187362
ROC AUC: 0.6365531862402306
F1: 0.39685617607224155
----------------------------------------------------------------
Classifier 2: 
Accuracy: 0.4539991162174105
ROC AUC: 0.7221320847494123
F1: 0.5827773811243245
----------------------------------------------------------------
Classifier 3: 
Accuracy: 0.2799381352187362
ROC AUC: 0.6365531862402306
F1: 0.39685617607224155
----------------------------------------------------------------


### SVC

In [22]:
svm_classifier = SVC(kernel="poly", probability=True)

# split the data into train and test
print("Splitting data into train and test...")
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, shuffle=True, stratify=Y)

# train the model
print("Training the model...")
svm_classifier.fit(X_train, y_train)

# predict the test data
print("Predicting test data...")
y_pred = svm_classifier.predict(X_test)
y_pred_proba = svm_classifier.predict_proba(X_test)

Splitting data into train and test...
Training the model...
Predicting test data...


In [23]:
# calculate the metrics
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("ROC AUC: ", roc_auc_score(y_test, y_pred_proba, multi_class='ovr'))
print("F1: ", f1_score(y_test, y_pred, average='macro'))

Accuracy:  0.05567830313742819
ROC AUC:  0.6479153645050738
F1:  0.010078465855756396
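
With `average='macro'`, the F1-score is computed per class and then averaged without weighting. The sketch below reproduces that calculation by hand on made-up labels; `macro_f1` is a hypothetical helper, not the scikit-learn function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: per-class F1 is 0.5, 0.8 and 2/3.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(toy_true, toy_pred), 4))  # 0.6556
```

Since every class weighs the same, a model that predicts only a few of the classes gets a low macro F1 even when some of its predictions are correct.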


In [24]:
def calculate_metrics_svm(y_true, y_pred, y_pred_proba):
    return [
        accuracy_score(y_true, y_pred),
        roc_auc_score(y_true, y_pred_proba, multi_class='ovr'),
        f1_score(y_true, y_pred, average='macro')
    ]
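
The `multi_class='ovr'` option scores each class against all the others and averages the results. A rough pure-Python sketch of that idea follows, using the fact that a binary AUC equals the probability that a random positive sample is scored above a random negative one (ties count half). `binary_auc`, `ovr_macro_auc` and the probabilities are hypothetical, for illustration only.

```python
def binary_auc(is_positive, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for flag, s in zip(is_positive, scores) if flag]
    negatives = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def ovr_macro_auc(y_true, probabilities):
    """Average the one-vs-rest AUC over every class column."""
    n_classes = len(probabilities[0])
    per_class = []
    for cls in range(n_classes):
        flags = [label == cls for label in y_true]
        column = [row[cls] for row in probabilities]
        per_class.append(binary_auc(flags, column))
    return sum(per_class) / n_classes


# Made-up class probabilities for four samples and three classes.
toy_true = [0, 1, 2, 0]
toy_proba = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.5, 0.1],
]
print(round(ovr_macro_auc(toy_true, toy_proba), 4))  # 0.9444
```

AUC depends only on how the probabilities rank the samples, not on the predicted label, which is why it can look reasonable while accuracy and F1 are very low.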

In [25]:
print("Creating SVM models...")
svm_classifier_1 = SVC(kernel="linear", probability=True)
svm_classifier_2 = SVC(kernel="poly", probability=True)
svm_classifier_3 = SVC(kernel="rbf", probability=True)
svm_models = [svm_classifier_1, svm_classifier_2, svm_classifier_3]

data_ac_svc = []
data_roc_svc = []
data_f1_svc = []

for i in range(10):
    print("Experiment ", i)
    # Split the dataset into the training set and the test set.
    # NOTE: random_state=0 yields the same split in all 10 runs, so the averages
    # below repeat a single split; use random_state=i to vary the split per run.
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.20, random_state=0, shuffle=True, stratify=Y
    )

    # Train the models
    for model in svm_models:
        model.fit(X_train, y_train)

    # Predict and compute the metrics
    metrics_ac = [calculate_metrics_svm(y_test, model.predict(X_test), model.predict_proba(X_test)) for model in svm_models]

    # Collect the metric values in lists
    data_ac_svc.append([metric[0] for metric in metrics_ac])
    data_roc_svc.append([metric[1] for metric in metrics_ac])
    data_f1_svc.append([metric[2] for metric in metrics_ac])

Creating SVM models...
Experiment  0
Experiment  1
Experiment  2
Experiment  3
Experiment  4
Experiment  5
Experiment  6
Experiment  7
Experiment  8
Experiment  9
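
The SVM loop passes `random_state=0` to `train_test_split` on every iteration. The standard-library sketch below illustrates why that matters: a fixed seed reproduces the same partition each time, while a per-run seed varies it. `shuffled_test_indices` is a hypothetical helper and does not replicate scikit-learn's shuffling.

```python
import random


def shuffled_test_indices(n_samples, test_size, seed):
    """Return the sorted test indices chosen by a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return sorted(indices[:n_test])


# Same seed on every run: the test set never changes.
fixed = [shuffled_test_indices(100, 0.2, seed=0) for _ in range(3)]
# One seed per run: each run gets its own test set.
varied = [shuffled_test_indices(100, 0.2, seed=run) for run in range(3)]

print(fixed[0] == fixed[1] == fixed[2])  # True
print(len({tuple(split) for split in varied}))  # number of distinct splits
```

Averaging over runs only estimates variability across partitions when the splits actually differ.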


In [26]:
metrics_names = ['Accuracy', 'ROC AUC', 'F1']
for i in range(3):
    print(f"Classifier {i + 1}: ")
    print(f"{metrics_names[0]}: {np.mean(np.array(data_ac_svc)[:, i])}")
    print(f"{metrics_names[1]}: {np.mean(np.array(data_roc_svc)[:, i])}")
    print(f"{metrics_names[2]}: {np.mean(np.array(data_f1_svc)[:, i])}")
    print("----------------------------------------------------------------")

Classifier 1: 
Accuracy: 0.7737516570923553
ROC AUC: 0.9772969999285444
F1: 0.7736878077011224
----------------------------------------------------------------
Classifier 2: 
Accuracy: 0.0547945205479452
ROC AUC: 0.677166160670134
F1: 0.008356808367985715
----------------------------------------------------------------
Classifier 3: 
Accuracy: 0.09765797613787008
ROC AUC: 0.8920814775504983
F1: 0.08153138279597036
----------------------------------------------------------------
