# Machine Learning Tutorial with Scikit-Learn

**Luis Bodart**

## 1 Preparing the Data

### 1.1 Reading the dataset

The goal is to find the machine learning model best suited to predicting the sentiment (output) from a movie review (input).

- Input(x) -> movie review
- Output(y) -> sentiment

In [3]:
import pandas as pd

df_review = pd.read_csv('data/IMDB-Dataset.csv')
df_review

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [4]:
df_review.value_counts('sentiment')

sentiment
negative    25000
positive    25000
dtype: int64

In [5]:
df_pos = df_review[df_review.sentiment == 'positive'][:9000]
df_neg = df_review[df_review.sentiment == 'negative'][:1000]

df_review_des = pd.concat([df_pos, df_neg]) # imbalanced data, 9000:1000
df_review_des.value_counts('sentiment')

sentiment
positive    9000
negative    1000
dtype: int64

### 1.2 Imbalanced dataset

In [6]:
from imblearn.under_sampling import RandomUnderSampler

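rus = RandomUnderSampler()
# fit_resample returns the resampled features and labels; the labels are
# attached back to the balanced frame as its 'sentiment' column
df_review_bal, df_review_bal['sentiment'] = rus.fit_resample(df_review_des[['review']], df_review_des[['sentiment']])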
df_review_bal.value_counts('sentiment')

sentiment
negative    1000
positive    1000
dtype: int64
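
Conceptually, random undersampling keeps every row of the rarest class and draws an equally sized random subset of each larger class. The block below is a minimal standard-library sketch of that idea, not imblearn's implementation; the toy `labels` list and the `undersample` helper are hypothetical names introduced for illustration.

```python
import random
from collections import Counter

# Hypothetical toy labels mirroring the 9000:1000 imbalance above.
labels = ["positive"] * 9000 + ["negative"] * 1000

def undersample(labels, seed=0):
    """Return sorted row indices, keeping as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        # Sample without replacement; the minority class is kept whole.
        kept.extend(rng.sample(rows, n_min))
    return sorted(kept)

kept_idx = undersample(labels)
balanced_counts = Counter(labels[i] for i in kept_idx)
print(balanced_counts)
```

Fixing the seed makes the selection repeatable, which the notebook cell above does not do; results downstream of an unseeded sampler can vary between runs.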

### 1.3 Preparing the data for training and testing

In [7]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)

In [8]:
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']

## 2 Text representation (Bag of words)

- CountVectorizer: counts how many times each word appears in a document
- TF-IDF: weights each word's frequency within a document by how rare the word is across all documents

### 2.1 CountVectorizer (counts)

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
text = ["I love writing code in Python. I love Python code",
        "I hate writing code in Java. I hate Java code"]

df = pd.DataFrame({'review': ['review1', 'review2'], 'text': text})
cv = CountVectorizer()
cv_matrix = cv.fit_transform(df['text'])
df_dtm = pd.DataFrame(cv_matrix.toarray(), index=df['review'].values, columns=cv.get_feature_names_out())
df_dtm

Unnamed: 0,code,hate,in,java,love,python,writing
review1,2,0,1,0,2,2,1
review2,2,2,1,2,0,0,1
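
The counts in this table can be reproduced by hand, which also shows why the word "I" has no column. The sketch below assumes CountVectorizer's default behaviour (lowercasing, and keeping only tokens of two or more word characters); the `count_words` helper is a hypothetical name, and the code is an illustration rather than the library's implementation.

```python
import re
from collections import Counter

text = ["I love writing code in Python. I love Python code",
        "I hate writing code in Java. I hate Java code"]

# Assumed to mirror CountVectorizer's defaults: lowercase the text and keep
# tokens made of two or more word characters, so the single letter "I" is dropped.
TOKEN = re.compile(r"\b\w\w+\b")

def count_words(doc):
    """Count token occurrences in one document."""
    return Counter(TOKEN.findall(doc.lower()))

counts = [count_words(doc) for doc in text]
vocabulary = sorted(set().union(*counts))          # one column per distinct word
rows = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in rows:
    print(row)
```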


### 2.2 TF-IDF (weighted frequency)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ["I love writing code in Python. I love Python code",
        "I hate writing code in Java. I hate Java code"]

df = pd.DataFrame({'review': ['review1', 'review2'], 'text':text})
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['text'])
df_dtm = pd.DataFrame(tfidf_matrix.toarray(), index=df['review'].values, columns=tfidf.get_feature_names_out())
df_dtm

Unnamed: 0,code,hate,in,java,love,python,writing
review1,0.428327,0.0,0.214163,0.0,0.601998,0.601998,0.214163
review2,0.428327,0.601998,0.214163,0.601998,0.0,0.0,0.214163
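
The weights above can be recomputed by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `tfidf_row` helper is a hypothetical name used for illustration only.

```python
import math
import re
from collections import Counter

text = ["I love writing code in Python. I love Python code",
        "I hate writing code in Java. I hate Java code"]

TOKEN = re.compile(r"\b\w\w+\b")  # same tokenization assumption as for the counts
docs = [Counter(TOKEN.findall(doc.lower())) for doc in text]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Smoothed IDF: words that appear in every document get weight 1,
# words that appear in fewer documents get a larger weight.
idf = {
    word: math.log((1 + n_docs) / (1 + sum(word in doc for doc in docs))) + 1
    for word in vocabulary
}

def tfidf_row(doc):
    """Term count times IDF, scaled so the row has unit Euclidean length."""
    raw = [doc[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

weights = [tfidf_row(doc) for doc in docs]
for row in weights:
    print([round(value, 6) for value in row])
```

Words shared by both reviews ("code", "in", "writing") end up with lower weights than the words that distinguish them ("love", "python", "hate", "java").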


### 2.3 Transforming text data into numeric data

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)

test_x_vector = tfidf.transform(test_x)

In [12]:
train_x_vector, test_x_vector

(<1340x20093 sparse matrix of type '<class 'numpy.float64'>'
 	with 114623 stored elements in Compressed Sparse Row format>,
 <660x20093 sparse matrix of type '<class 'numpy.float64'>'
 	with 53352 stored elements in Compressed Sparse Row format>)

Types of matrices
- Sparse matrix: a matrix made up mostly of zeros
- Dense matrix: a matrix made up mostly of non-zero values
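
To get a feel for how sparse the training matrix is, the share of non-zero cells can be worked out from the shape and stored-element count printed in the output above. This is plain arithmetic on those reported numbers; the variable names are illustrative.

```python
# Shape and stored-element count copied from the train_x_vector output above.
n_rows, n_cols, n_stored = 1340, 20093, 114623

total_cells = n_rows * n_cols
density = n_stored / total_cells

print(f"total cells:    {total_cells:,}")
print(f"non-zero share: {density:.4%}")
print(f"average stored terms per review: {n_stored / n_rows:.1f}")
```

Fewer than half a percent of the cells hold a value, which is why the vectorizer returns a sparse format instead of a full array.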

In [13]:
pd.DataFrame.sparse.from_spmatrix(train_x_vector, index=train_x.index, columns=tfidf.get_feature_names_out())

Unnamed: 0,00,000,007,01,01pm,02,04,06,10,100,...,zooming,zooms,zorro,zuber,zuckers,zues,zzzzzzzzzzzzzzzzzz,æon,élan,ísnt
81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
915,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
380,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042715,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1029,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1130,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 3 Model selection

ML algorithms

1. Supervised learning: regression (numeric output), **classification (discrete output)**
- Input: review
- Output: sentiment (discrete)

2. Unsupervised learning

### 3.1 Support Vector Machines (SVM)

In [14]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)

#### 3.1.1 Testing

In [15]:
print(svc.predict(tfidf.transform(['A meh movie'])))
print(svc.predict(tfidf.transform(['An excellent movie'])))
print(svc.predict(tfidf.transform(['"I did not like this movie at all I gave this movie away"'])))

['negative']
['positive']
['negative']


### 3.2 Decision tree

In [16]:
from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

#### 3.2.1 Testing

In [17]:
print(dec_tree.predict(tfidf.transform(['A bad movie'])))
print(dec_tree.predict(tfidf.transform(['An excellent movie'])))
print(dec_tree.predict(tfidf.transform(['"I did not like this movie at all I gave this movie away"'])))

['negative']
['positive']
['positive']


### 3.3 Naive Bayes

In [18]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)

#### 3.3.1 Testing

In [19]:
# GaussianNB requires a dense matrix rather than a sparse one, so the
# vectorized input is converted with .toarray() before predicting

print(gnb.predict(tfidf.transform(['A great movie']).toarray()))
print(gnb.predict(tfidf.transform(['An excellent movie']).toarray()))
print(gnb.predict(tfidf.transform(['"I did not like this movie at all I gave this movie away"']).toarray()))


### 3.4 Logistic regression

In [20]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(train_x_vector, train_y)

#### 3.4.1 Testing

In [21]:
print(log_reg.predict(tfidf.transform(['A great movie'])))
print(log_reg.predict(tfidf.transform(['An excellent movie'])))
print(log_reg.predict(tfidf.transform(['"I did not like this movie at all I gave this movie away"'])))


['positive']
['positive']
['negative']


## 4 Model evaluation

### 4.1 Model accuracy

In [22]:
print(svc.score(test_x_vector, test_y))
print(dec_tree.score(test_x_vector, test_y))
print(gnb.score(test_x_vector.toarray(), test_y))
print(log_reg.score(test_x_vector, test_y))

0.8363636363636363
0.7
0.6439393939393939
0.8287878787878787


### 4.2 F1 Score

The F1 score is the harmonic mean of precision and recall.

Accuracy is the metric to use when true positives and true negatives matter most, whereas the F1 score is preferred when false negatives and false positives are critical. F1 also takes the class distribution into account, so it is useful when the classes are imbalanced.


F1 Score = 2*(Recall * Precision) / (Recall + Precision)
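
A small illustration of why accuracy alone can mislead on imbalanced data: the sketch below uses a hypothetical 9000:1000 label split and a degenerate classifier that always answers "positive". The data and the `f1_for` helper are assumptions made for this example, not part of the tutorial's pipeline.

```python
# Hypothetical scenario: 9000 positive and 1000 negative reviews, scored by a
# degenerate classifier that always predicts "positive".
actual = ["positive"] * 9000 + ["negative"] * 1000
predicted = ["positive"] * len(actual)

def f1_for(label, actual, predicted):
    """F1 score treating `label` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    if tp == 0:
        return 0.0  # convention used here when the label is never predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
f1_positive = f1_for("positive", actual, predicted)
f1_negative = f1_for("negative", actual, predicted)

print(accuracy, f1_positive, f1_negative)
```

Accuracy comes out at 0.9 even though the classifier never identifies a single negative review, while the per-class F1 for "negative" is 0 and exposes the problem.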

In [23]:
from sklearn.metrics import f1_score

f1_score(test_y, svc.predict(test_x_vector),
         labels=['positive', 'negative'],
         average=None)

array([0.84023669, 0.83229814])

### 4.3 Classification report

In [24]:
from sklearn.metrics import classification_report

print(classification_report(test_y, svc.predict(test_x_vector),
                            labels=['positive', 'negative']))

              precision    recall  f1-score   support

    positive       0.83      0.85      0.84       335
    negative       0.84      0.82      0.83       325

    accuracy                           0.84       660
   macro avg       0.84      0.84      0.84       660
weighted avg       0.84      0.84      0.84       660



### 4.4 Confusion matrix

- Rows: actual values
- Columns: predicted values

|                         | **Predicted positive (1)** | **Predicted negative (0)** |
|-------------------------|----------------------------|----------------------------|
| **Actual positive (1)** | True positives             | False negatives            |
| **Actual negative (0)** | False positives            | True negatives             |

In [25]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(test_y,
                            svc.predict(test_x_vector),
                            labels=['positive', 'negative'])
conf_mat

array([[284,  51],
       [ 57, 268]])
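
The four counts in this matrix are enough to recover the accuracy and the per-class F1 scores reported earlier. The sketch below copies the counts from the output above and assumes the layout used by `confusion_matrix`: rows are actual labels, columns are predicted labels, both in the order positive, negative.

```python
# Counts copied from conf_mat above (rows = actual, columns = predicted,
# label order: positive, negative).
conf_mat = [[284, 51],
            [57, 268]]

tp, fn = conf_mat[0]   # actual positive: predicted positive / predicted negative
fp, tn = conf_mat[1]   # actual negative: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)

precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# For the negative class the roles swap: tn counts its correct predictions,
# fn counts positives wrongly predicted as negative.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(round(accuracy, 4), round(f1_pos, 8), round(f1_neg, 8))
```

These values agree with the SVC accuracy from section 4.1 and the F1 array from section 4.2.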

## 5 Model optimization

### 5.1 GridSearchCV

In [30]:
from sklearn.model_selection import GridSearchCV

parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parameters, cv=5)
svc_grid.fit(train_x_vector, train_y)

In [31]:
print(svc_grid.best_params_)
print(svc_grid.best_estimator_)

{'C': 1, 'kernel': 'linear'}
SVC(C=1, kernel='linear')


In [32]:
svc_grid.best_score_

0.8388059701492537
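
Note that `best_score_` is the mean cross-validated score of the best candidate on the training data, not a score on the held-out test set. To see how much work the search does, the sketch below enumerates the same grid with the standard library; it only illustrates what gets tried and is not how GridSearchCV is implemented.

```python
from itertools import product

parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
cv_folds = 5  # matches cv=5 in the search above

# Every combination of one value per parameter is a candidate model.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

# Each candidate is trained once per fold.
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, one more fit on the full training set produces `best_estimator_`.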