# Project: Diploma in Artificial Intelligence

Objective: Train 2 Machine Learning models and compare their results. Use at least 3 different parameter configurations for each model and comment on the results.

The following performance metrics must be compared:

- Accuracy
- F1-score
- ROC-AUC

Perform 10 experimental runs and compare the average of these performance metrics across runs. Split the dataset into 80% (training) and 20% (test).

The data must be pre-processed. Base the pre-processing on the notebook "1 natural-language-processing-sentiment-analysis.ipynb".
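
The protocol described above (10 runs, an 80/20 split, averaged metrics) can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own code: the helper name `run_experiments`, the synthetic data from `make_classification`, and the choice of `LogisticRegression` are hypothetical stand-ins for the real features and models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def run_experiments(make_model, X, y, n_runs=10, test_size=0.2):
    """Fit a fresh model on n_runs stratified splits and average the metrics."""
    scores = {"accuracy": [], "f1_macro": [], "roc_auc_ovr": []}
    for seed in range(n_runs):
        # A different seed per run gives a different 80/20 partition each time.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = make_model().fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test)  # class probabilities for ROC-AUC
        scores["accuracy"].append(accuracy_score(y_test, y_pred))
        scores["f1_macro"].append(f1_score(y_test, y_pred, average="macro"))
        scores["roc_auc_ovr"].append(
            roc_auc_score(y_test, y_proba, multi_class="ovr")
        )
    return {name: float(np.mean(values)) for name, values in scores.items()}


# Hypothetical synthetic 3-class problem standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
results = run_experiments(lambda: LogisticRegression(max_iter=1000), X_demo, y_demo)
print(results)
```

Passing a factory (`make_model`) rather than a fitted estimator keeps every run independent of the previous one.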



## Loading the data

In [1]:
import sklearn
from sklearn.datasets import fetch_20newsgroups

newsgroups_data = fetch_20newsgroups()

In [11]:
print(newsgroups_data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

## Show an example

In [23]:
print("Text: ", newsgroups_data.data[0])
# Look up the class name through the sample's own label, not index 0 of target_names.
print("Label (text): ", newsgroups_data.target_names[newsgroups_data.target[0]])
print("Label (number): ", newsgroups_data.target[0])

Text:  From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Label (text):  rec.autos
Label (number):  7


## Labels and their corresponding names

In [45]:
len(newsgroups_data.data)

11314

In [3]:
for idx, cat in enumerate(newsgroups_data.target_names):
    print("Number: ", idx, "Category: ", cat)

Number:  0 Category:  alt.atheism
Number:  1 Category:  comp.graphics
Number:  2 Category:  comp.os.ms-windows.misc
Number:  3 Category:  comp.sys.ibm.pc.hardware
Number:  4 Category:  comp.sys.mac.hardware
Number:  5 Category:  comp.windows.x
Number:  6 Category:  misc.forsale
Number:  7 Category:  rec.autos
Number:  8 Category:  rec.motorcycles
Number:  9 Category:  rec.sport.baseball
Number:  10 Category:  rec.sport.hockey
Number:  11 Category:  sci.crypt
Number:  12 Category:  sci.electronics
Number:  13 Category:  sci.med
Number:  14 Category:  sci.space
Number:  15 Category:  soc.religion.christian
Number:  16 Category:  talk.politics.guns
Number:  17 Category:  talk.politics.mideast
Number:  18 Category:  talk.politics.misc
Number:  19 Category:  talk.religion.misc


In [25]:
# Cleaning the texts
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amandaflores/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [46]:

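# Build the stemmer and the stop-word set once instead of once per word.
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for text in newsgroups_data.data:
    review = re.sub('[^a-zA-Z]', ' ', text)  # keep letters only
    review = review.lower().split()
    # Drop stop words, then reduce each remaining word to its stem.
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))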
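
The cleaning steps used in this cell (keep letters only, lowercase, drop stop words, stem) can also be packaged as a reusable function. The sketch below is a hypothetical helper, `clean_text`, written without NLTK so that it runs on its own: the stop-word set is a tiny made-up example and the default `stem` is the identity. In this notebook one would presumably pass `set(stopwords.words('english'))` and `PorterStemmer().stem` instead.

```python
import re


def clean_text(text, stop_words, stem=lambda word: word):
    """Keep letters only, lowercase, drop stop words, then stem each token."""
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stem(tok) for tok in tokens if tok not in stop_words)


# Tiny hypothetical stop-word set, for illustration only.
demo_stop_words = {"is", "this", "what", "the"}
print(clean_text("WHAT car is this!?", demo_stop_words))  # car
```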

In [49]:
len(corpus)

11314

In [30]:
import pandas as pd

In [47]:
pd.DataFrame(corpus).head()

                                                   0
0  lerxst wam umd edu thing subject car nntp post...
1  guykuo carson u washington edu guy kuo subject...
2  twilli ec ecn purdu edu thoma e willi subject ...
3  jgreen amber joe green subject weitek p organ ...
4  jcm head cfa harvard edu jonathan mcdowel subj...


In [50]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=1500)
X=cv.fit_transform(corpus).toarray()
y = newsgroups_data.target
print(X)
print("!")
print(y)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
!
[7 4 4 ... 3 1 8]


In [51]:
X.shape

(11314, 1500)
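
To make the shape `(11314, 1500)` concrete: each row is a document and each column counts one of the 1500 most frequent terms. The pure-Python sketch below illustrates that idea on a made-up three-document corpus. It is not scikit-learn's implementation; the helper name `bag_of_words`, the whitespace tokenization, and the alphabetical tie-breaking are assumptions made for the example, and `CountVectorizer` differs in such details.

```python
from collections import Counter


def bag_of_words(docs, max_features):
    """Count whitespace-separated tokens, keeping the max_features most frequent terms."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Sort by descending corpus frequency; break ties alphabetically (an assumption).
    top = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    vocab = sorted(top)
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[term] for term in vocab])
    return vocab, rows


demo_docs = ["car engin car", "engin spec", "car door"]
vocab, rows = bag_of_words(demo_docs, max_features=2)
print(vocab)  # ['car', 'engin']
print(rows)   # [[2, 1], [0, 1], [1, 0]]
```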

In [52]:
y.shape

(11314,)

In [53]:
# Convert y to a one-hot (binary indicator) matrix
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
y_b = encoder.fit_transform(y)
print(y_b)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [54]:
y_b.shape

(11314, 20)
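
The `(11314, 20)` matrix holds one row per document with a single 1 in the column of its class. A minimal NumPy sketch of the same encoding, and of how to get back to integer labels, is shown below; the label values are made up for the example. Note that scikit-learn classifiers also accept the integer labels `y` directly, while a 2-D indicator target is generally handled as a multilabel problem.

```python
import numpy as np

y_demo = np.array([7, 4, 4, 1])  # hypothetical integer labels
n_classes = 20

# Row i of the identity matrix is the one-hot vector for class i.
y_onehot = np.eye(n_classes, dtype=int)[y_demo]
print(y_onehot.shape)  # (4, 20)

# argmax along the class axis recovers the original integer labels.
y_back = y_onehot.argmax(axis=1)
print(y_back)  # [7 4 4 1]
```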

In [4]:
# X = newsgroups_data.data
# y = newsgroups_data.target

In [55]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [56]:
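# Three configurations that differ only in the split criterion. Note: in scikit-learn,
# 'entropy' and 'log_loss' both measure Shannon information gain, so classifiers 1 and 3 are equivalent.
classifier_RF_1 = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier_RF_2 = RandomForestClassifier(n_estimators = 100, criterion = 'gini', random_state = 0)
classifier_RF_3 = RandomForestClassifier(n_estimators = 100, criterion = 'log_loss', random_state = 0)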
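
For reference, the two impurity measures behind the `criterion` options can be written in a few lines. The sketch below uses the textbook definitions of Gini impurity and Shannon entropy on a vector of class proportions; the function names are illustrative and this is not scikit-learn's internal code.

```python
import numpy as np


def gini(p):
    """Gini impurity: 1 - sum(p_k^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)


def entropy(p):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


# A perfectly mixed two-class node is maximally impure under both measures.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
# A skewed node is purer, so both values drop.
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))
```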

In [59]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

data_ac = []
data_roc = []
data_f1 = []

for i in range(10):
    print("Experiment: ",i)
    # No random_state is passed, so each run draws a new stratified 80/20 split.
    X_train, X_test, y_train, y_test = train_test_split(X, y_b, test_size=0.2, shuffle=True, stratify=y)
    # The targets are the one-hot matrix y_b, so each forest is fit on a multilabel indicator target.
    classifier_RF_1.fit(X_train, y_train)
    classifier_RF_2.fit(X_train, y_train)
    classifier_RF_3.fit(X_train, y_train)
    
    # predict() returns 0/1 indicator rows, so roc_auc_score below receives hard predictions, not probabilities.
    y_pred_1 = classifier_RF_1.predict(X_test)
    y_pred_2 = classifier_RF_2.predict(X_test)
    y_pred_3 = classifier_RF_3.predict(X_test)
    
    # Stack the Accuracy, ROC AUC and F1 values of the three classifiers so they can be averaged afterwards
    data_ac.append([accuracy_score(y_test, y_pred_1), accuracy_score(y_test, y_pred_2), accuracy_score(y_test, y_pred_3)])
    data_roc.append([roc_auc_score(y_test, y_pred_1), roc_auc_score(y_test, y_pred_2), roc_auc_score(y_test, y_pred_3)])
    data_f1.append([f1_score(y_test, y_pred_1, average='macro'), f1_score(y_test, y_pred_2, average='macro'), f1_score(y_test, y_pred_3, average='macro')])
    

Experiment:  0
Experiment:  1
Experiment:  2
Experiment:  3
Experiment:  4
Experiment:  5
Experiment:  6
Experiment:  7
Experiment:  8
Experiment:  9


In [63]:
import numpy as np

In [67]:
# Convert the per-run metric lists to arrays (rows: runs, columns: classifiers)
array_data_ac = np.array(data_ac)
array_data_roc = np.array(data_roc)
array_data_f1 = np.array(data_f1)
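
With the scores arranged as runs by classifiers, the per-classifier averages are simply column means, and the spread across runs can be reported alongside them. The sketch below shows this on a small made-up array; the values are hypothetical, not results from this notebook.

```python
import numpy as np

# Hypothetical scores: 3 runs (rows) x 3 classifiers (columns).
demo_scores = np.array([
    [0.30, 0.50, 0.30],
    [0.32, 0.52, 0.32],
    [0.34, 0.51, 0.34],
])

col_means = demo_scores.mean(axis=0)        # one mean per classifier
col_stds = demo_scores.std(axis=0, ddof=1)  # sample standard deviation across runs
for k, (m, s) in enumerate(zip(col_means, col_stds), start=1):
    print(f"Classifier {k}: mean={m:.3f} std={s:.3f}")
```

Reporting the standard deviation next to the mean helps judge whether a difference between two configurations is larger than the run-to-run variation.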

In [69]:
print("Classifier 1: ")
print("Accuracy: ",np.mean(array_data_ac[:,0]))
print("ROC AUC: ",np.mean(array_data_roc[:,0]))
print("F1: ",np.mean(array_data_f1[:,0]))
print("----------------------------------------------------------------")
print("Classifier 2: ")
print("Accuracy: ",np.mean(array_data_ac[:,1]))
print("ROC AUC: ",np.mean(array_data_roc[:,1]))
print("F1: ",np.mean(array_data_f1[:,1]))
print("----------------------------------------------------------------")
print("Classifier 3: ")
print("Accuracy: ",np.mean(array_data_ac[:,2]))
print("ROC AUC: ",np.mean(array_data_roc[:,2]))
print("F1: ",np.mean(array_data_f1[:,2]))

Classifier 1: 
Accuracy:  0.31961997348652227
ROC AUC:  0.6556343592045113
F1:  0.4347667798800465
----------------------------------------------------------------
Classifier 2: 
Accuracy:  0.5093239063190456
ROC AUC:  0.7491866743552985
F1:  0.6316487104674595
----------------------------------------------------------------
Classifier 3: 
Accuracy:  0.31961997348652227
ROC AUC:  0.6556343592045113
F1:  0.4347667798800465
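
The random forest scores above come from one-hot targets and hard 0/1 predictions. A possible alternative, sketched here under stated assumptions, is to train on the integer labels and compute ROC AUC from `predict_proba`. The data is synthetic (`make_classification`) and the forest size is arbitrary, so the printed numbers say nothing about the 20 newsgroups results; the block only contrasts the two ways of scoring.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Hypothetical synthetic 4-class data standing in for the bag-of-words features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Integer labels: single-label multiclass, exactly one predicted class per sample.
clf = RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)         # hard labels, used for accuracy and F1
y_proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes), rows sum to 1

# ROC AUC from thresholded (one-hot) predictions versus from class probabilities.
y_test_bin = label_binarize(y_test, classes=clf.classes_)
y_pred_bin = label_binarize(y_pred, classes=clf.classes_)
auc_hard = roc_auc_score(y_test_bin, y_pred_bin)
auc_soft = roc_auc_score(y_test, y_proba, multi_class="ovr")

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 (macro):", f1_score(y_test, y_pred, average="macro"))
print("ROC AUC from hard predictions:", auc_hard)
print("ROC AUC from probabilities:", auc_soft)
```

ROC AUC is a ranking metric, so it is normally computed from continuous scores; hard predictions reduce each per-class curve to a single operating point.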


In [89]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize


In [99]:
acc_1 = []
acc_2 = []
acc_3 = []

f1_1 = []
f1_2 = []
f1_3 = []

roc_auc_1 = []
roc_auc_2 = []
roc_auc_3 = []

for i in range(10):
    print("Experiment ", i)
    # Splitting the dataset into the Training set and Test set.
    # Note: random_state=0 yields the same partition in every run; random_state=i would vary it.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0
    )
    # One indicator column per class (20 classes), not one per sample.
    y_test_bin = label_binarize(y_test, classes=range(len(newsgroups_data.target_names)))
    
    # Note: for single-label predictions, average="micro" makes F1 equal to accuracy.
    svm_classifier_1 = SVC(kernel="linear", C=0.5, max_iter=1000)
    svm_classifier_1.fit(X_train, y_train)
    y_1_pred = svm_classifier_1.predict(X_test)
    acc_1.append(accuracy_score(y_test, y_1_pred))
    f1_1.append(f1_score(y_test, y_1_pred, average="micro"))
    roc_auc_1.append(
        roc_auc_score(
            y_test_bin, svm_classifier_1.decision_function(X_test), multi_class="ovr"
        )
    )

    svm_classifier_2 = SVC(kernel="poly", C=0.5, max_iter=1000)
    svm_classifier_2.fit(X_train, y_train)
    y_2_pred = svm_classifier_2.predict(X_test)
    acc_2.append(accuracy_score(y_test, y_2_pred))
    f1_2.append(f1_score(y_test, y_2_pred, average="micro"))
    roc_auc_2.append(
        roc_auc_score(
            y_test_bin, svm_classifier_2.decision_function(X_test), multi_class="ovr"
        )
    )

    svm_classifier_3 = SVC(kernel="rbf", C=0.5, max_iter=1000)
    svm_classifier_3.fit(X_train, y_train)
    y_3_pred = svm_classifier_3.predict(X_test)
    acc_3.append(accuracy_score(y_test, y_3_pred))
    f1_3.append(f1_score(y_test, y_3_pred, average="micro"))
    roc_auc_3.append(
        roc_auc_score(
            y_test_bin, svm_classifier_3.decision_function(X_test), multi_class="ovr"
        )
    )

Experiment  0
Experiment  1
Experiment  2
Experiment  3
Experiment  4
Experiment  5
Experiment  6
Experiment  7
Experiment  8
Experiment  9


In [100]:
print("Accuracy 1: ", np.mean(acc_1))
print("Accuracy 2: ", np.mean(acc_2))
print("Accuracy 3: ", np.mean(acc_3))

print("F1 1: ", np.mean(f1_1))
print("F1 2: ", np.mean(f1_2))
print("F1 3: ", np.mean(f1_3))

print("ROC AUC 1: ", np.mean(roc_auc_1))
print("ROC AUC 2: ", np.mean(roc_auc_2))
print("ROC AUC 3: ", np.mean(roc_auc_3))

Accuracy 1:  0.7684489615554575
Accuracy 2:  0.044189129474149366
Accuracy 3:  0.05435262925320371
F1 1:  0.7684489615554575
F1 2:  0.044189129474149366
F1 3:  0.05435262925320371
ROC AUC 1:  0.9600818607684586
ROC AUC 2:  0.7362489008774401
ROC AUC 3:  0.9084863532152637
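
The F1 values above match the accuracy values digit for digit. That is expected with `average="micro"` on single-label predictions: every misclassified sample counts as one false positive (for the predicted class) and one false negative (for the true class), so the pooled F1 reduces to the fraction of correct predictions. The sketch below checks this on made-up labels; `micro_f1` is an illustrative helper, not the scikit-learn function.

```python
import numpy as np


def micro_f1(y_true, y_pred, n_classes):
    """Micro-averaged F1: pool TP, FP and FN over all classes before computing F1."""
    tp = fp = fn = 0
    for c in range(n_classes):
        tp += np.sum((y_pred == c) & (y_true == c))
        fp += np.sum((y_pred == c) & (y_true != c))
        fn += np.sum((y_pred != c) & (y_true == c))
    return 2 * tp / (2 * tp + fp + fn)


y_true = np.array([0, 1, 2, 2, 1, 0])  # hypothetical true labels
y_pred = np.array([0, 2, 2, 2, 1, 1])  # hypothetical predictions

print(micro_f1(y_true, y_pred, n_classes=3))  # micro-averaged F1
print(np.mean(y_true == y_pred))              # accuracy, same value
```

Because the random forest experiments use `average='macro'`, their F1 values are not directly comparable with the micro-averaged F1 reported for the SVMs.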
