In [1]:
import json

file_name = './data/sentiment/Books_small.json'

with open(file_name) as f:
    for line in f:
        review = json.loads(line)  # json.loads parses one JSON line of the file into a dict; the review text is stored under the 'reviewText' key
        print(review['reviewText'])
        print(review['overall'])
        break

Da Silva takes the divine by storm with this unique new novel.  She develops a world unlike any others while keeping it firmly in the real world.  This is a very well written and entertaining novel.  I was quite impressed and intrigued by the way that this solid storyline was developed, bringing the readers right into the world of the story.  I was engaged throughout and definitely enjoyed my time spent reading it.I loved the character development in this novel.  Da Silva creates a cast of high school students who actually act like high school students.  I really appreciated the fact that none of them were thrown into situations far beyond their years, nor did they deal with events as if they had decades of life experience under their belts.  It was very refreshing and added to the realism and impact of the novel.  The friendships between the characters in this novel were also truly touching.Overall, this novel was fantastic.  I can&#8217;t wait to read more and to find out what happen
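
The loop above works because the file stores one JSON object per line, so each line can be parsed on its own. A minimal sketch of the same pattern on an in-memory file; the two records are made up for illustration and only mimic the `reviewText` and `overall` fields used in this notebook:

```python
import io
import json

# Two hypothetical records, one JSON object per line (not taken from the dataset)
fake_file = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

parsed = [json.loads(line) for line in fake_file]  # each line becomes a dict
pairs = [(r["reviewText"], r["overall"]) for r in parsed]
print(pairs)
```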

In [2]:
import json

file_name = './data/sentiment/Books_small.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line) 
        reviews.append((review['reviewText'], review['overall']))  # Build a tuple with the two values needed to train the model
        
       

To make the code more readable and tidy, and to access values by attribute instead of by index, we create a class and work with that.

In [108]:
import random

# Again, the Sentiment class is not strictly necessary, but it makes the code tidier and easier to follow
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'
    POSITIVE = 'POSITIVE'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()  # get_sentiment maps scores 1-2 to NEGATIVE, 3 to NEUTRAL, and 4-5 to POSITIVE
        
    def get_sentiment(self):
        if self.score <=2:
            return Sentiment.NEGATIVE
        elif self.score ==3:
            return Sentiment.NEUTRAL
        else: # 4 OR 5 STARS
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews =reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
        
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))  # keep only the negative reviews
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))  # keep only the positive reviews
        # neutral = list(filter(lambda x: x.sentiment == Sentiment.NEUTRAL, self.reviews))  # keep only the neutral reviews (disabled)

        positive_adaptado = positive[:len(negative)]  # trim the positives so there are as many as negatives
        self.reviews = negative + positive_adaptado
        random.shuffle(self.reviews)  # shuffle so the training data is not all negatives followed by all positives
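
`evenly_distribute` keeps the first `len(negative)` positives, which assumes there are more positives than negatives. A hedged alternative sketch, not the notebook's method: cut every class down to the size of the smallest one and pick the survivors at random. The `balance` helper and the toy labels are assumptions for illustration.

```python
import random

def balance(items, label_of, labels, seed=0):
    """Return a shuffled list with the same number of items for each label."""
    rng = random.Random(seed)  # local generator so the result is reproducible
    groups = {lab: [it for it in items if label_of(it) == lab] for lab in labels}
    size = min(len(g) for g in groups.values())  # size of the smallest class
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, size))  # random subset, not the first `size`
    rng.shuffle(balanced)
    return balanced

# Toy data: 6 positives and 2 negatives as (text, label) pairs
toy = [(f"review {i}", "POSITIVE") for i in range(6)] + [
    ("bad 1", "NEGATIVE"),
    ("bad 2", "NEGATIVE"),
]
result = balance(toy, label_of=lambda r: r[1], labels=["POSITIVE", "NEGATIVE"])
labels_out = [lab for _, lab in result]
print(labels_out.count("POSITIVE"), labels_out.count("NEGATIVE"))  # 2 2
```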

In [4]:
import json

file_name = './data/sentiment/Books_small.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line) 
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[5].text

'Love the book, great story line, keeps you entertained.for a first novel from this author she did a great job,  Would definitely recommend!'

## Prep Data

In [109]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)
# 33% of the reviews are held out for testing
# the remaining 67% are used for training

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)
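
A quick way to see what `test_size=0.33` does, assuming a dataset of 1000 items (the 670 training rows mentioned later in this notebook are consistent with that size). The `items` list is a stand-in, not the real reviews:

```python
from sklearn.model_selection import train_test_split

items = list(range(1000))  # stand-in for 1000 reviews
train_part, test_part = train_test_split(items, test_size=0.33, random_state=42)
print(len(train_part), len(test_part))  # 670 330
```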


**Now I want to pass a review to my model and have it predict whether it is positive or negative. So "x", the input I give the model, is the review text, and "y", what it returns, is the category or sentiment I am after: 'POSITIVE', 'NEGATIVE', or 'NEUTRAL'.**

In [110]:
train_container.evenly_distribute()

train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()  # Balance the positive and negative reviews in the test set as well

test_x = test_container.get_text()
test_y = test_container.get_sentiment()

# Count to make sure there are as many positives as negatives
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))
print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))

436
436
208
208


## Bag of words, VECTORIZATION

For the model to be able to "read" what a review says, we encode its words as numbers. This encoding is known as the "bag of words" method.

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

"In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors..."

In [130]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

# vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer()
# Learn the vocabulary from the training text and vectorize it in one step
train_x_vectors = vectorizer.fit_transform(train_x)

# For the test set we do not call fit_transform; we only transform, because this is our test data
test_x_vectors = vectorizer.transform(test_x)


This returns a 2D matrix with 670 rows (the number of reviews used for training) and 7372 columns in each row.

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
 
Something like this, but much larger.

So the result is the reviews in the numeric form we will feed to the model for training.
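
The small matrix above matches the four-sentence example used in the scikit-learn documentation for `CountVectorizer`. A minimal sketch that reproduces it, so each column can be tied to a word (the variable names are mine):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# Column order follows the alphabetically sorted vocabulary
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row is one sentence and each column counts one word, so the `2` in the second row is the word "document" appearing twice.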

In [8]:
print(vectorizer.fit_transform(train_x)[0])

  (0, 7086)	1
  (0, 1148)	1
  (0, 350)	2
  (0, 1800)	1
  (0, 6595)	1
  (0, 562)	1
  (0, 3054)	1
  (0, 1558)	1
  (0, 6475)	1
  (0, 6593)	1
  (0, 2895)	1
  (0, 7353)	1
  (0, 539)	1
  (0, 1515)	1
  (0, 5197)	1
  (0, 3545)	1
  (0, 2007)	1


**Now I have to choose which model to use**

## Classification

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

To decide which model or classifier to use, see https://www.youtube.com/watch?v=_PwhiWxHK8o, or look for videos and articles that recommend one for each situation. It also helps to try each model and see which one fits the data best.

#### LINEAR SVM MODEL

In [144]:
from sklearn import svm

clf_svm = svm.SVC(kernel ='linear')

clf_svm.fit(train_x_vectors, train_y)

# Now we use the linear SVM to predict the sentiment of a review, based on what the model learned
svmPrediction = clf_svm.predict(test_x_vectors[0])

print(svmPrediction)
print(test_x[0])

['NEGATIVE']
This book was really horrible. It's twilight but with mers. If you want a mer book read everblue it's better.


#### Decision Tree

In [132]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

treePrediction = clf_dec.predict(test_x_vectors[0])

print(treePrediction)
print(test_x[0])

['POSITIVE']
This book was really horrible. It's twilight but with mers. If you want a mer book read everblue it's better.


#### Naive Bayes

In [133]:
from sklearn.naive_bayes import MultinomialNB

clf_gnb = MultinomialNB()
# GaussianNB raises "TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array."
# It cannot take the sparse vectorized matrix directly, so we use MultinomialNB instead

clf_gnb.fit(train_x_vectors, train_y)


naiveBayesPrediction = clf_gnb.predict(test_x_vectors[0])

print(naiveBayesPrediction)
print(test_x[0])

['NEGATIVE']
This book was really horrible. It's twilight but with mers. If you want a mer book read everblue it's better.


#### Logistic Regression

In [134]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression(random_state=0).fit(train_x_vectors, train_y)

logPrediction = clf_log.predict(test_x_vectors[0])

print(logPrediction)  # the cell originally printed treePrediction, which is what the recorded output below reflects
print(test_x[0])

['POSITIVE']
This book was really horrible. It's twilight but with mers. If you want a mer book read everblue it's better.


## EVALUATION OF MODELS

Now that all our models are trained, we evaluate them on the whole test set to see how they perform.

In [49]:
#MEAN ACCURACY
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8124242424242424
0.7666666666666667
0.8369696969696969
0.8409090909090909


In [94]:
# MEAN ACCURACY (2nd run, after matching the number of negative reviews to the number of positives)
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.7124242424242424
0.6227272727272727
0.6796969696969697
0.7448484848484849


In [29]:
from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.91319444 0.21052632 0.22222222]


From this test we can conclude that the model is good at predicting positive reviews but very poor at negative or neutral ones.
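
To see why accuracy can look fine while a per-class F1 score is poor, here is a minimal hand-rolled F1 computation on made-up labels (the toy lists are illustrative, not results from this notebook). F1 is the harmonic mean of precision and recall for one class.

```python
def f1_for(label, y_true, y_pred):
    """F1 score of a single class, computed from precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy case: 8 positives, 2 negatives; the "model" finds only one of the negatives
y_true = ["POSITIVE"] * 8 + ["NEGATIVE"] * 2
y_pred = ["POSITIVE"] * 9 + ["NEGATIVE"] * 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_pos = f1_for("POSITIVE", y_true, y_pred)
f1_neg = f1_for("NEGATIVE", y_true, y_pred)
print(accuracy)          # 0.9
print(round(f1_pos, 3))  # 0.941
print(round(f1_neg, 3))  # 0.667
```

In this toy case accuracy is 0.9, yet the F1 score for the minority class is only about 0.67, because half of the negatives were missed.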

In [105]:
# Now I test all the other models to see how they behave

In [32]:
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.85309735 0.0952381  0.        ]
[0.9233279 0.        0.       ]
[0.91370558 0.12244898 0.1       ]


Now I want to improve my model for NEUTRAL and NEGATIVE.

The first thing to check is the data we are giving the model to train on.

In [50]:
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))
print(train_y.count(Sentiment.NEUTRAL))

5611
436
653


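552 of the 670 reviews are POSITIVE and only 47 are NEGATIVE, so we need to balance the data across the review types. (These figures refer to the small dataset; the counts printed above appear to come from a later run on the larger one.)

Let's load a larger dataset.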
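
The imbalance can be quantified with `collections.Counter`. A sketch using the counts quoted above for the small dataset; the NEUTRAL count of 71 is my inference (670 - 552 - 47), not a number printed by the notebook:

```python
from collections import Counter

# Counts quoted in the text; NEUTRAL is inferred as 670 - 552 - 47
labels = ["POSITIVE"] * 552 + ["NEGATIVE"] * 47 + ["NEUTRAL"] * 71

counts = Counter(labels)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.1%})")

# Accuracy of a model that always answers with the most common class
majority_baseline = counts.most_common(1)[0][1] / total
print(round(majority_baseline, 3))  # 0.824
```

On data this skewed, always predicting POSITIVE already scores about 0.82 accuracy, which is one reason the per-class F1 scores are more informative than accuracy here.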

In [41]:
import json

file_name = './data/sentiment/Books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line) 
        reviews.append(Review(review['reviewText'], review['overall']))

In [None]:
# WE CAN SEE THAT THE SAMPLE GREW

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))
print(train_y.count(Sentiment.NEUTRAL))

In [52]:
# 2nd run, with a larger sample
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.90738061 0.2656     0.40268456]
[0.87186778 0.14789916 0.17460317]
[0.91276279 0.0286533  0.        ]
[0.92139968 0.29250457 0.40983607]


I create the ReviewContainer class to hold our data and handle it more easily. What I want now is to balance my sample so the model trains better.

In [95]:
# (3rd run, after matching the number of negative reviews to the number of positives)
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.85363477 0.         0.28146853]
[0.78271504 0.         0.19722425]
[0.83624543 0.         0.27346637]
[0.8783008  0.         0.31077216]


### Now I also balance the data used to train my model, to see what results I get

In [116]:
# MEAN ACCURACY (3rd run)
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.7980769230769231
0.6370192307692307
0.7980769230769231
0.8149038461538461


In [118]:
# (4th run)
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.8028169  0.         0.79310345]
[0.63614458 0.         0.63788969]
[0.77777778 0.         0.81497797]
[0.82051282 0.         0.808933  ]


Conclusion: the F1 score improves notably once the sample is balanced.

Note: we dropped the neutral reviews to simplify the exercise.

## Now let's see our model in action, with a QUALITATIVE test:

In [129]:
test_set = ["Very good", "I'm not sure what to say about this book. Interesting plot idea but poorly executed. I finished it and it wasn't the worst book ever but isn't something I would recommend to anyone. And not worth the $2.99 I paid for it.  Also needs some serious editing."]
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE'], dtype='<U8')

To improve the model further, I can use TfidfVectorizer instead of CountVectorizer. Rather than just counting words, it gives less weight to words that appear in most reviews and more weight to the distinctive ones, which tend to carry the sentiment. This can improve performance.
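
A minimal sketch of what TF-IDF changes, on three made-up sentences: words that occur in every document ("the", "was") get the lowest inverse-document-frequency weight, while rarer words such as "horrible" get a higher one. Note that the weighting depends on how rare a word is across documents, not on whether it is emotional.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the book was great",
    "the book was horrible",
    "the plot was great",
]

tfidf = TfidfVectorizer()
tfidf.fit(docs)

# idf_ holds one inverse-document-frequency weight per vocabulary term
idf = {term: tfidf.idf_[index] for term, index in tfidf.vocabulary_.items()}
for term in sorted(idf, key=idf.get):
    print(f"{term}: {idf[term]:.3f}")
```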

In [135]:
# MEAN ACCURACY (4th run)
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6346153846153846
0.8125
0.8052884615384616


In [136]:
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels =[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.80582524 0.         0.80952381]
[0.62189055 0.         0.64651163]
[0.79144385 0.         0.82969432]
[0.80291971 0.         0.80760095]


***Here we can see that it improved performance for some models and hurt it for others. For example, the LINEAR SVM MODEL improved.***

In [138]:
# Try to improve my models even further

## Tuning our model (with Grid Search)
https://www.mygreatlearning.com/blog/gridsearchcv/

In [145]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('linear','rbf'), 'C':(1,4,8,16,32)}  # Kernels and C values to try, to see which combination performs best

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y) 


GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
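
After fitting, the grid search object exposes the winning combination and can be used directly for prediction. A hedged sketch on toy data (the sentences, the smaller grid, and the variable names are assumptions for illustration, not the notebook's data):

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Toy data: 10 copies of one positive and one negative sentence
texts = ["great book loved it"] * 10 + ["horrible book hated it"] * 10
labels = ["POSITIVE"] * 10 + ["NEGATIVE"] * 10

toy_vec = TfidfVectorizer()
toy_x = toy_vec.fit_transform(texts)

search = GridSearchCV(svm.SVC(), {"kernel": ("linear", "rbf"), "C": (1, 4)}, cv=5)
search.fit(toy_x, labels)

print(search.best_params_)  # winning combination of kernel and C
print(search.best_score_)   # its mean cross-validated accuracy
prediction = search.predict(toy_vec.transform(["loved it"]))  # uses the refit best model
print(prediction)
```

In the notebook, the same attributes would be read from `clf` after `clf.fit(train_x_vectors, train_y)`.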

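#### *Another improvement I could apply before using GridSearchCV to tune the model's parameters is to clean up punctuation such as exclamation marks, so that "good!" and "good" are not counted as two different words. (With scikit-learn's default tokenizer punctuation is already dropped, so this matters mainly when a custom tokenizer is used.)*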
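
Before writing a custom filter, it may be worth checking how the vectorizer tokenizes punctuation. A small sketch assuming default `CountVectorizer` settings (lowercasing on, default token pattern); with a custom tokenizer or `token_pattern` the result can differ:

```python
from sklearn.feature_extraction.text import CountVectorizer

check_vec = CountVectorizer()
check_vec.fit(["good!", "good", "Good..."])
print(check_vec.vocabulary_)  # {'good': 0}
```

With the defaults, all three variants collapse into the single token `good`.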

In [None]:
We will leave it here for now, but there is still room for improvement.

## SAVING THE MODEL

In [148]:
# Now we want to save it so we do not have to retrain it every time we want to use it.

In [157]:
import pickle

with open('./modelos2021/sentiment_classifier.pkl','wb') as f:
    pickle.dump(clf,f)
# We dump everything stored in clf (the fitted grid search) into the file sentiment_classifier.pkl

## LOADING THE MODEL 

In [153]:
# Now let's load our saved model

In [158]:
with open('./modelos2021/sentiment_classifier.pkl','rb') as f:
    loaded_clf = pickle.load(f)

In [160]:
print(test_x[0])

loaded_clf.predict(test_x_vectors[0])

This book was really horrible. It's twilight but with mers. If you want a mer book read everblue it's better.


array(['NEGATIVE'], dtype='<U8')
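
One caveat with the approach above: the pickle holds only the classifier, so new raw text still has to go through the same fitted `vectorizer` before calling `predict`. A hedged sketch of one way to keep the two together, using a scikit-learn `Pipeline` on toy data and `pickle.dumps`/`pickle.loads` in memory instead of a file. This is an alternative I am suggesting, not what the notebook does:

```python
import pickle

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data, made up for illustration
texts = ["great book loved it", "horrible book hated it"] * 5
labels = ["POSITIVE", "NEGATIVE"] * 5

# The pipeline vectorizes raw text and then classifies it
model = make_pipeline(TfidfVectorizer(), svm.SVC(kernel="linear"))
model.fit(texts, labels)

blob = pickle.dumps(model)     # same idea as pickle.dump(model, f)
restored = pickle.loads(blob)  # same idea as pickle.load(f)

restored_prediction = restored.predict(["hated it"])  # raw text in, label out
print(restored_prediction)
```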