# IMDB Sentiment Analysis

The data is split evenly with 25k reviews intended for training and 25k for testing your classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews.

IMDb lets users rate movies on a scale from 1 to 10. To label these reviews the curator of the data labeled anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.
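The labeling rule described above can be written as a small function. This is only an illustrative sketch: `label_from_rating` is a hypothetical helper, not part of the dataset's tooling.

```python
from typing import Optional


def label_from_rating(stars: int) -> Optional[int]:
    """Map a 1-10 star rating to 1 (positive), 0 (negative) or None (excluded)."""
    if not 1 <= stars <= 10:
        raise ValueError("rating must be between 1 and 10")
    if stars <= 4:
        return 0
    if stars >= 7:
        return 1
    return None  # 5 and 6 stars are left out of the dataset


labels = [label_from_rating(s) for s in range(1, 11)]
print(labels)  # [0, 0, 0, 0, None, None, 1, 1, 1, 1]
```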

**Import the required libraries**

In [1]:
import numpy as np
import pandas as pd
import os
import re
import warnings
warnings.filterwarnings("ignore")

**Load Data**

In [2]:
# Read one review per line; the context manager closes each file when done
reviews_train = []
with open('./data/full_train.txt', 'r', encoding='latin1') as f:
    for line in f:
        reviews_train.append(line.strip())

reviews_test = []
with open('./data/full_test.txt', 'r', encoding='latin1') as f:
    for line in f:
        reviews_test.append(line.strip())

**See one of the elements in the list**

In [3]:
print(len(reviews_train))
reviews_train[0]

25000


'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

The raw text of these reviews is fairly messy, so before we can do any analysis we need to clean it up.


**Use regular expressions to remove non-text characters and HTML tags**

In [4]:
import re

# Raw strings avoid invalid-escape-sequence warnings for patterns such as \d and \s
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
NO_SPACE = ""
SPACE = " "

def preprocess_reviews(reviews):
    
    reviews = [REPLACE_NO_SPACE.sub(NO_SPACE, line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(SPACE, line) for line in reviews]
    
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)
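To see what the two substitutions do, here is a small sketch that applies the same patterns to a made-up sentence (the sample text is invented for illustration). Punctuation and digits are deleted outright, while HTML line breaks, hyphens and slashes become spaces, which is why cleaned reviews can contain runs of several spaces.

```python
import re

# Same patterns as in the cell above, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

sample = "Great movie!<br /><br />A must-see: 10/10 stars."

# Step 1: lowercase, then delete punctuation and digits.
step1 = REPLACE_NO_SPACE.sub("", sample.lower())
# Step 2: turn line breaks, hyphens and slashes into spaces.
step2 = REPLACE_WITH_SPACE.sub(" ", step1)

print(repr(step1))  # 'great movie<br /><br />a must-see / stars'
print(repr(step2))  # 'great movie a must see   stars'
```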

In [5]:
reviews_train_clean[5]

"this isn't the comedic robin williams nor is it the quirky insane robin williams of recent thriller fame this is a hybrid of the classic drama without over dramatization mixed with robin's new love of the thriller but this isn't a thriller per se this is more a mystery suspense vehicle through which williams attempts to locate a sick boy and his keeper also starring sandra oh and rory culkin this suspense drama plays pretty much like a news report until william's character gets close to achieving his goal i must say that i was highly entertained though this movie fails to teach guide inspect or amuse it felt more like i was watching a guy williams as he was actually performing the actions from a third person perspective in other words it felt real and i was able to subscribe to the premise of the story all in all it's worth a watch though it's definitely not friday saturday night fare it rates a   from the fiend "

# Vectorization
In order for this data to make sense to our machine learning algorithm we’ll need to convert each review to a numeric representation, a step called vectorization.

The simplest form of this is to create one very large matrix with one column for every unique word in the corpus (here the vocabulary is learned from the 25k training reviews). Then we transform each review into one row containing 0s and 1s, where 1 means that the word corresponding to that column appears in that review. Each row of the matrix will therefore be very sparse (mostly zeros). This binary bag-of-words representation is sometimes loosely called one hot encoding. Use scikit-learn's *CountVectorizer* class with binary = True.
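As a minimal sketch of the idea, the binary matrix can be built by hand for two made-up sentences. This is plain Python for illustration, not how CountVectorizer is implemented; CountVectorizer also lowercases the text and, by default, ignores single-character tokens.

```python
docs = ["the movie was great", "the movie was bad bad bad"]

# One column per unique word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document: 1 if the word occurs at least once, otherwise 0.
matrix = [[1 if word in doc.split() else 0 for word in vocabulary] for doc in docs]

print(vocabulary)  # ['bad', 'great', 'movie', 'the', 'was']
for row in matrix:
    print(row)
# [0, 1, 1, 1, 1]
# [1, 0, 1, 1, 1]
```

Note that the repeated word "bad" still produces a single 1: a binary representation records presence, not frequency.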

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

baseline_vector = CountVectorizer(binary = True)
baseline_vector.fit(reviews_train_clean)

X_baseline = baseline_vector.transform(reviews_train_clean)
X_test_baseline = baseline_vector.transform(reviews_test_clean)

In [8]:
print(X_baseline.shape)

(25000, 87063)


In [9]:
baseline_vector.vocabulary_

{'bromwell': 9819,
 'high': 35211,
 'is': 39472,
 'cartoon': 11686,
 'comedy': 14754,
 'it': 39642,
 'ran': 61772,
 'at': 4537,
 'the': 76725,
 'same': 66138,
 'time': 77626,
 'as': 4211,
 'some': 71188,
 'other': 54861,
 'programs': 60156,
 'about': 284,
 'school': 67025,
 'life': 44297,
 'such': 74177,
 'teachers': 75997,
 'my': 51490,
 'years': 86260,
 'in': 37733,
 'teaching': 76000,
 'profession': 60088,
 'lead': 43605,
 'me': 47976,
 'to': 77922,
 'believe': 6894,
 'that': 76671,
 'satire': 66462,
 'much': 51018,
 'closer': 14125,
 'reality': 62242,
 'than': 76643,
 'scramble': 67253,
 'survive': 74812,
 'financially': 27904,
 'insightful': 38615,
 'students': 73768,
 'who': 84627,
 'can': 11132,
 'see': 67695,
 'right': 64476,
 'through': 77369,
 'their': 76762,
 'pathetic': 56362,
 'pomp': 58787,
 'pettiness': 57341,
 'of': 53843,
 'whole': 84639,
 'situation': 69959,
 'all': 2038,
 'remind': 63323,
 'schools': 67049,
 'knew': 42226,
 'and': 2762,
 'when': 84450,
 'saw': 66588,

In [10]:
vectorizer = CountVectorizer()
vectorizer.fit(reviews_train_clean)

X_vec = vectorizer.transform(reviews_train_clean)
#X_test_vec = vectorizer.transform(reviews_test_clean)

In [11]:
X_vec.shape

(25000, 87063)

In [12]:
vectorizer.vocabulary_

{'bromwell': 9819,
 'high': 35211,
 'is': 39472,
 'cartoon': 11686,
 'comedy': 14754,
 'it': 39642,
 'ran': 61772,
 'at': 4537,
 'the': 76725,
 'same': 66138,
 'time': 77626,
 'as': 4211,
 'some': 71188,
 'other': 54861,
 'programs': 60156,
 'about': 284,
 'school': 67025,
 'life': 44297,
 'such': 74177,
 'teachers': 75997,
 'my': 51490,
 'years': 86260,
 'in': 37733,
 'teaching': 76000,
 'profession': 60088,
 'lead': 43605,
 'me': 47976,
 'to': 77922,
 'believe': 6894,
 'that': 76671,
 'satire': 66462,
 'much': 51018,
 'closer': 14125,
 'reality': 62242,
 'than': 76643,
 'scramble': 67253,
 'survive': 74812,
 'financially': 27904,
 'insightful': 38615,
 'students': 73768,
 'who': 84627,
 'can': 11132,
 'see': 67695,
 'right': 64476,
 'through': 77369,
 'their': 76762,
 'pathetic': 56362,
 'pomp': 58787,
 'pettiness': 57341,
 'of': 53843,
 'whole': 84639,
 'situation': 69959,
 'all': 2038,
 'remind': 63323,
 'schools': 67049,
 'knew': 42226,
 'and': 2762,
 'when': 84450,
 'saw': 66588,

In [14]:
X_vec

<25000x87063 sparse matrix of type '<class 'numpy.int64'>'
	with 3410713 stored elements in Compressed Sparse Row format>

# Train a Baseline Model

Train a Logistic Regression model on the data transformed with CountVectorizer. Logistic regression is a good baseline model because:

* Linear models are easy to interpret
* They tend to perform well on sparse datasets like this one
* They learn very fast compared to other algorithms

Test models with C values of [0.01, 0.05, 0.25, 0.5, 1], see which value of C is best, and calculate the accuracy.

In [15]:
type(reviews_train_clean)

list

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Labels: 1 (positive) for the first 12,500 reviews, 0 (negative) for the rest.
# This assumes both files list all positive reviews before the negative ones.
target = [1 if i < 12500 else 0 for i in range(25000)]

def train_model(X_train, y_train, X_test, y_test):
    # Create the pipeline
    log_reg_pipe = Pipeline(steps = [
        ('log_reg', LogisticRegression())
    ])

    # Define the candidate values of C
    log_parameters = {
        'log_reg__C': [0.01, 0.05, 0.25, 0.5, 1]
    }

    # Build the grid search (cross-validated on the training data)
    log_reg_grid = GridSearchCV(log_reg_pipe,
                                log_parameters,
                                n_jobs = -1)

    # Fit the grid search and keep the best model
    log_reg_grid.fit(X_train, y_train)
    model = log_reg_grid.best_estimator_

    # Accuracy of the best model on the test set
    return model.score(X_test, y_test)

accuracy = train_model(X_baseline, target, X_test_baseline, target)
accuracy

0.88188
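The train_model function above returns only the test accuracy, so it does not show which value of C the grid search selected. The following sketch, run on a tiny made-up corpus rather than the IMDB data, shows one way to inspect the selected value through the best_params_ and best_score_ attributes of GridSearchCV. The texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny made-up corpus, repeated so every cross-validation fold contains both classes.
texts = ["great film loved it", "wonderful acting great story",
         "terrible film hated it", "awful acting boring story"] * 5
labels = [1, 1, 0, 0] * 5

X_toy = CountVectorizer(binary=True).fit_transform(texts)

grid = GridSearchCV(LogisticRegression(),
                    {"C": [0.01, 0.05, 0.25, 0.5, 1]},
                    cv=5)
grid.fit(X_toy, labels)

print(grid.best_params_)  # the C value with the best mean cross-validated accuracy
print(grid.best_score_)   # that mean accuracy
```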

# Remove Stop Words

Stop words are the very common words like ‘if’, ‘but’, ‘we’, ‘he’, ‘she’, and ‘they’. We can usually remove these words without changing the semantics of a text and doing so often (but not always) improves the performance of a model. Removing these stop words becomes a lot more useful when we start using longer word sequences as model features (see n-grams below).

Before we apply the CountVectorizer, let's remove the stop words using the lists included in nltk.corpus.

Then apply the CountVectorizer, train the Logistic Regression model, and obtain the accuracy.

In [22]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gonzalo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [24]:
stopwords.words('spanish')

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

In [25]:
from nltk.corpus import stopwords

english_stop_words = stopwords.words('english')
def remove_stop_words(corpus, english_stop_words):
    # Set membership tests are faster than scanning a list for every word
    stop_word_set = set(english_stop_words)
    removed_stop_words = []
    for review in corpus:
        # For each review, drop the stop words and rejoin the remaining words with single spaces
        removed_stop_words.append(
            ' '.join([word for word in review.split()
                      if word not in stop_word_set])
        )
    return removed_stop_words

no_stop_words_train = remove_stop_words(reviews_train_clean, english_stop_words)
no_stop_words_test = remove_stop_words(reviews_test_clean, english_stop_words)

In [26]:
no_stop_words_train

["bromwell high cartoon comedy ran time programs school life teachers years teaching profession lead believe bromwell high's satire much closer reality teachers scramble survive financially insightful students see right pathetic teachers' pomp pettiness whole situation remind schools knew students saw episode student repeatedly tried burn school immediately recalled high classic line inspector i'm sack one teachers student welcome bromwell high expect many adults age think bromwell high far fetched pity",
 "homelessness houselessness george carlin stated issue years never plan help street considered human everything going school work vote matter people think homeless lost cause worrying things racism war iraq pressuring kids succeed technology elections inflation worrying they'll next end streets given bet live streets month without luxuries home entertainment sets bathroom pictures wall computer everything treasure see like homeless goddard bolt's lesson mel brooks directs stars bolt 

In [28]:
cv = CountVectorizer(binary = True)
cv.fit(no_stop_words_train)

X = cv.transform(no_stop_words_train)
X_test = cv.transform(no_stop_words_test)

In [30]:
accuracy_new = train_model(X, target, X_test, target)
accuracy_new

0.87936

In [29]:
print(X_baseline.shape)
print(X.shape)
print("Stop words removed:", X_baseline.shape[1] - X.shape[1])

(25000, 87063)
(25000, 87046)
Stop words removed: 17


In [31]:
cv = CountVectorizer(binary = True, stop_words='english')

cv.fit(reviews_train_clean)

X = cv.transform(reviews_train_clean)
X_test = cv.transform(reviews_test_clean)

train_model(X, target, X_test, target)

0.87728

In [32]:
print(X_baseline.shape)
print(X.shape)
print("Stop words removed:", X_baseline.shape[1] - X.shape[1])

(25000, 87063)
(25000, 86752)
Stop words removed: 311


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

**Note:** In practice, an easier way to remove stop words is to use the stop_words argument of any of scikit-learn’s ‘Vectorizer’ classes. Passing stop_words='english' selects scikit-learn’s built-in English list; to use NLTK’s list, pass it explicitly, as in stop_words=stopwords.words('english'). In practice I’ve found that using NLTK’s full list actually decreases my performance because it’s too expansive, so I usually supply my own list of words. For example, stop_words=['in','of','at','a','the'].
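As a short sketch of the stop_words argument, here is a vectorizer with a small custom list fitted on two made-up sentences. The documents and the names `cv_custom` and `builtin_list` are invented for illustration. The get_stop_words() method returns the list a vectorizer will actually use, which is a convenient way to check what stop_words='english' resolves to.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the plot of the film", "a film in the dark"]

# Supply a small custom stop word list.
custom_stop_words = ['in', 'of', 'at', 'a', 'the']
cv_custom = CountVectorizer(binary=True, stop_words=custom_stop_words)
cv_custom.fit(docs)

print(sorted(cv_custom.vocabulary_))  # ['dark', 'film', 'plot']

# Inspect the built-in list selected by stop_words='english'.
builtin_list = CountVectorizer(stop_words='english').get_stop_words()
print(len(builtin_list), 'the' in builtin_list)
```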

A common next step in text preprocessing is to normalize the words in your corpus by trying to convert all of the different forms of a given word into one. Two methods that exist for this are Stemming and Lemmatization.

# Stemming

Stemming is considered to be the cruder, more brute-force approach to normalization (although this doesn’t necessarily mean that it will perform worse). There are several algorithms, but in general they all use basic rules to chop off the ends of words.

NLTK has several stemming algorithm implementations. We’ll use the Porter stemmer. The most commonly used are:
* PorterStemmer
* SnowballStemmer

Apply a PorterStemmer, vectorize, and train the model again.
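To make "basic rules to chop off the ends of words" concrete, here is a deliberately naive suffix stripper. It is a toy rule set invented for illustration, not the Porter or Snowball algorithm, and `toy_stem` is a hypothetical name.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes. A toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["meeting", "owned", "caresses", "mules", "plot"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['meet', 'own', 'caress', 'mul', 'plot']
```

The result for "mules" ("mul" rather than "mule") shows how a handful of blunt rules over-stems; real stemmers add many more rules and conditions to limit this.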

In [33]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
            'died', 'agreed', 'owned', 'humbled', 'sized',
            'meeting', 'stating', 'siezing', 'itemization',
            'sensational', 'traditional', 'reference', 'colonizer',
            'plotted']

single = [stemmer.stem(plural) for plural in plurals]


print(' '.join(single))

caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot


In [35]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language = 'english')

plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
            'died', 'agreed', 'owned', 'humbled', 'sized',
            'meeting', 'stating', 'siezing', 'itemization',
            'sensational', 'traditional', 'reference', 'colonizer',
            'plotted']
singles = [stemmer.stem(plural) for plural in plurals]

print(' '.join(singles))

caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot


In [36]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language = 'spanish')

plurals = ['corriendo', 'casas', 'playa', 'volando', 'volar', 'volveré']
singles = [stemmer.stem(plural) for plural in plurals]

print(' '.join(singles))

corr cas play vol vol volv


In [49]:
def get_stemmed_text(data):
    stemmer = PorterStemmer()
    
    stemmed_data = [' '.join([stemmer.stem(word) for word in document.split()]) for document in data]
    
    return stemmed_data

stemmed_reviews_train = get_stemmed_text(reviews_train_clean)
stemmed_reviews_test = get_stemmed_text(reviews_test_clean)

cv = CountVectorizer(binary=True, stop_words=english_stop_words)
cv.fit(stemmed_reviews_train)
X_stem = cv.transform(stemmed_reviews_train)
X_test = cv.transform(stemmed_reviews_test)

train_model(X_stem, target, X_test, target)

0.87656

In [50]:
print(X_baseline.shape)
print(X_stem.shape)
print("Difference in vocabulary size between the baseline and the stemmed data:", X_baseline.shape[1] - X_stem.shape[1])

(25000, 87063)
(25000, 66715)
Difference in vocabulary size between the baseline and the stemmed data: 20348


# Lemmatization

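Lemmatization works by identifying the part of speech of a given word and then applying more complex rules to transform the word into its true root. Note that NLTK's WordNetLemmatizer treats every word as a noun unless a pos argument is passed to lemmatize, which is why verb forms such as 'died' and 'agreed' are left unchanged in the example output.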
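The following toy sketch illustrates why the part of speech matters. It uses a tiny hand-written lookup table invented for this example. It is not how WordNet works internally, and `toy_lemmatize` is a hypothetical helper that only mimics the noun-by-default behaviour of a lemmatizer.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
LEMMAS = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("flies", "v"): "fly",
    ("flies", "n"): "fly",
    ("died", "v"): "die",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


print(toy_lemmatize("meeting"))       # treated as a noun: meeting
print(toy_lemmatize("meeting", "v"))  # treated as a verb: meet
print(toy_lemmatize("died"))          # no noun entry, so unchanged: died
print(toy_lemmatize("died", "v"))     # die
```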

In [51]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/gonzalo/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [52]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

plurals = ['caresses', 'flies', 'dies', 'mules', 'studies',
            'died', 'agreed', 'owned', 'humbled', 'sized',
            'meeting', 'stating', 'siezing', 'itemization',
            'sensational', 'traditional', 'reference', 'colonizer',
            'plotted']
singles = [lemmatizer.lemmatize(plural) for plural in plurals]

print(' '.join(singles))

caress fly dy mule study died agreed owned humbled sized meeting stating siezing itemization sensational traditional reference colonizer plotted


In [54]:
def get_lemmatized_text(corpus):
    
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

# Lemmatize the reviews
lemmatized_reviews_train = get_lemmatized_text(reviews_train_clean)
lemmatized_reviews_test = get_lemmatized_text(reviews_test_clean)

# Vectorize with binary counts after lemmatizing
cv = CountVectorizer(binary=True, stop_words=english_stop_words)
cv.fit(lemmatized_reviews_train)
X = cv.transform(lemmatized_reviews_train)
X_test = cv.transform(lemmatized_reviews_test)

train_model(X, target, X_test, target)

0.87812

In [55]:
print(X_baseline.shape)
print(X.shape)
print("Difference in vocabulary size between the baseline and the lemmatized data:", X_baseline.shape[1] - X.shape[1])

(25000, 87063)
(25000, 80181)
Difference in vocabulary size between the baseline and the lemmatized data: 6882


# n-grams

We can potentially add more predictive power to our model by adding two- or three-word sequences (bigrams or trigrams) as features. For example, if a review contains the three-word sequence “didn’t love movie”, a unigram-only model considers these words individually and will probably miss the negative sentiment, because the word ‘love’ by itself is highly correlated with positive reviews.

The scikit-learn library makes this really easy to play around with. Just use the ngram_range argument with any of the ‘Vectorizer’ classes.
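As a minimal sketch of what an n-gram is, the sequences can be generated in plain Python by sliding a window over the tokens. `word_ngrams` is a hypothetical helper written for this illustration; the vectorizers do the equivalent internally when ngram_range is set.

```python
def word_ngrams(text, n):
    """Return the list of n-word sequences in text, split on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


review = "didn't love movie"
unigrams = word_ngrams(review, 1)
bigrams = word_ngrams(review, 2)
trigrams = word_ngrams(review, 3)

print(unigrams)  # ["didn't", 'love', 'movie']
print(bigrams)   # ["didn't love", 'love movie']
print(trigrams)  # ["didn't love movie"]
```

With bigrams available as features, a model can learn a weight for "didn't love" that differs from the weight of "love" alone.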

In [64]:
from nltk import ngrams

sentence = 'Es que es lunes'
two = ngrams(sentence.split(), 2)
three = ngrams(sentence.split(), 3)

for grams in two:
  print(grams)
print('###############')

for grams in three:
  print(grams)

ngram_vec = CountVectorizer(binary = True, ngram_range=(2, 3))
vector = ngram_vec.fit_transform([sentence]).toarray()
vector

('Es', 'que')
('que', 'es')
('es', 'lunes')
###############
('Es', 'que', 'es')
('que', 'es', 'lunes')


array([[1, 1, 1, 1, 1]])

In [65]:
ngram_vec.vocabulary_

{'es que': 1, 'que es': 3, 'es lunes': 0, 'es que es': 2, 'que es lunes': 4}

In [68]:
ngram_vectorizer = CountVectorizer(binary = True, stop_words = english_stop_words, ngram_range=(1, 2))

ngram_vectorizer.fit(reviews_train_clean)

X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

train_model(X, target, X_test, target)

0.889

In [69]:
print(X_baseline.shape)
print(X.shape)
print("Difference in vocabulary size between the baseline and the n-gram data:", X_baseline.shape[1] - X.shape[1])

(25000, 87063)
(25000, 1865232)
Difference in vocabulary size between the baseline and the n-gram data: -1778169


# TF-IDF

Another common way to represent each document in a corpus is to use the tf-idf statistic (term frequency-inverse document frequency) for each word, which is a weighting factor that we can use in place of binary or word count representations.

There are several ways to do the tf-idf transformation, but in a nutshell, **tf-idf aims to represent the number of times a given word appears in a document (a movie review in our case) relative to the number of documents in the corpus that the word appears in**. Words that appear in many documents receive lower weights (closer to zero), and words that appear in fewer documents receive higher weights (closer to 1).

**Note:** Now that we’ve gone over n-grams, when I refer to ‘words’ I really mean any n-gram (sequence of words) if the model is using an n greater than one.
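The following sketch works through the arithmetic for one document, assuming scikit-learn's default TfidfVectorizer settings: raw term counts, the smoothed idf ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The counts describe the sentence "Es que es lunes" in a three-document corpus where "es" and "lunes" each occur in two documents and "que" in one. The helper name `smooth_idf` is made up for this example.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 3
# term -> (count in this document, number of documents containing the term)
doc = {"es": (2, 2), "lunes": (1, 2), "que": (1, 1)}

# Term frequency times idf, before normalization.
raw = {term: tf * smooth_idf(n_docs, df) for term, (tf, df) in doc.items()}

# L2-normalize so the document vector has unit length.
norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
tfidf = {term: weight / norm for term, weight in raw.items()}

for term, weight in tfidf.items():
    print(term, round(weight, 4))
# es 0.771
# lunes 0.3855
# que 0.5069
```

The term "que" gets a higher weight than "lunes" even though both occur once in the document, because "que" appears in fewer documents.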

In [74]:
# Number of documents in the corpus
n = 1000000000000000000000000000000000000000000000000000000000000000000000000000000

# Number of documents in which the term appears
f = 1

1 + np.log((n+1) / (f+1))

179.9084900729756

In [79]:
from sklearn.feature_extraction.text import TfidfVectorizer

senten_1 = 'Es que es lunes'
senten_2 = 'Mañana es martes, y hoy es lunes'
senten_3 = 'A Julio le gustan los tamagochis'

test = TfidfVectorizer()

print(test.fit_transform([senten_1, senten_2, senten_3]))
print(test.idf_)
print(test.get_feature_names())

  (0, 6)	0.38550292161010064
  (0, 9)	0.5068900148458076
  (0, 0)	0.7710058432202013
  (1, 2)	0.41197297843389025
  (1, 7)	0.41197297843389025
  (1, 8)	0.41197297843389025
  (1, 6)	0.3133160688892059
  (1, 0)	0.6266321377784118
  (2, 10)	0.4472135954999579
  (2, 5)	0.4472135954999579
  (2, 1)	0.4472135954999579
  (2, 4)	0.4472135954999579
  (2, 3)	0.4472135954999579
[1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718
 1.28768207 1.69314718 1.69314718 1.69314718 1.69314718]
['es', 'gustan', 'hoy', 'julio', 'le', 'los', 'lunes', 'martes', 'mañana', 'que', 'tamagochis']


In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

tfidf_vectorizer.fit(reviews_train_clean)
X = tfidf_vectorizer.transform(reviews_train_clean)
X_test = tfidf_vectorizer.transform(reviews_test_clean)


train_model(X, target, X_test, target)

0.8822

# Support Vector Machines (SVM)

Recall that linear classifiers tend to work well on very sparse datasets (like the one we have). Another algorithm that can produce great results with a quick training time is the Support Vector Machine with a linear kernel.

Build a model with an n-gram range from 1 to 2:

In [85]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# SVM with unigrams and bigrams
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)


def train_model(X_train, y_train, X_test, y_test):
    # Create the pipeline
    svm_pipe = Pipeline(steps = [
        ('svm', LinearSVC())
    ])

    # Define the hyperparameter grid.
    # Note: in the scikit-learn version used here, penalty='l1' is not supported
    # with LinearSVC's default loss and dual settings, so those candidates fail
    # to fit (the ValueError in the output) and only the 'l2' candidates are scored.
    svm_parameters = {
        'svm__penalty': ['l1', 'l2'],
        'svm__C': [0.01, 0.05, 0.5, 1]
    }

    # Build the grid search (cross-validated on the training data)
    svm_grid = GridSearchCV(svm_pipe,
                            svm_parameters,
                            n_jobs = -1)

    # Fit the grid search and keep the best model
    svm_grid.fit(X_train, y_train)
    model = svm_grid.best_estimator_

    # Accuracy of the best model on the test set
    return model.score(X_test, y_test)
    

acc = train_model(X, target, X_test, target)
acc

Traceback (most recent call last):
  File "/home/gonzalo/Documentos/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/gonzalo/Documentos/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/gonzalo/Documentos/anaconda3/lib/python3.8/site-packages/sklearn/svm/_classes.py", line 234, in fit
    self.coef_, self.intercept_, self.n_iter_ = _fit_liblinear(
  File "/home/gonzalo/Documentos/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 974, in _fit_liblinear
    solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
  File "/home/gonzalo/Documentos/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 830, in _get_liblinear_solver_type
    raise ValueError('Unsupported set of arguments: %s, '
ValueError: Unsupported set of ar

0.89768
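The ValueError in the output above most likely comes from the 'l1' candidates in the grid: LinearSVC does not support penalty='l1' together with the squared hinge loss when the dual formulation is used. One possible way to let both penalties fit is to pass dual=False, as in the sketch below on a tiny made-up corpus. This is an assumption about how the grid could be adjusted, not the configuration that produced the score reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up corpus for illustration only.
texts = ["great film loved it", "wonderful acting great story",
         "terrible film hated it", "awful acting boring story"] * 5
labels = [1, 1, 0, 0] * 5

X_toy = CountVectorizer(binary=True).fit_transform(texts)

fitted = {}
for penalty in ("l1", "l2"):
    # dual=False solves the primal problem, which supports both penalties.
    clf = LinearSVC(penalty=penalty, dual=False, C=0.5)
    clf.fit(X_toy, labels)
    fitted[penalty] = clf
    print(penalty, clf.score(X_toy, labels))
```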

# Final Model

Removing a small set of stop words along with an n-gram range from 1 to 3 and a linear support vector classifier shows the best results.

In [87]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC


ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words = english_stop_words)

ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

train_model(X, target, X_test, target)

0.88756

# Top Positive and Negative Features

Obtain the most important features of the model.

In [None]:
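# train_model only returns a score, so as an assumption we fit the final
# classifier explicitly here (C=0.01 is an illustrative value, not a tuned one).
final_model = LinearSVC(C=0.01)
final_model.fit(X, target)

# Build a dictionary mapping each n-gram to its coefficient.
# Use the vectorizer that produced X; scikit-learn >= 1.0 provides
# get_feature_names_out() in place of get_feature_names().
feature_to_coef = {
    word: coef for word, coef in zip(
        ngram_vectorizer.get_feature_names(), final_model.coef_[0]
    )
}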
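Once each feature is mapped to its coefficient, the most positive and most negative features can be found by sorting. The sketch below shows this on a tiny made-up corpus. The names `toy_vec`, `toy_model` and `word_to_coef` are invented for the example, and it reads the column index of each word from vocabulary_ so that it does not depend on which feature-name method the installed scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up corpus: 1 = positive review, 0 = negative review.
texts = ["great film", "great story", "wonderful film",
         "awful film", "awful story", "boring film"]
labels = [1, 1, 1, 0, 0, 0]

toy_vec = CountVectorizer(binary=True)
X_toy = toy_vec.fit_transform(texts)
toy_model = LinearSVC().fit(X_toy, labels)

# vocabulary_ maps each word to its column index in the feature matrix.
word_to_coef = {
    word: toy_model.coef_[0][idx] for word, idx in toy_vec.vocabulary_.items()
}

# Sort by coefficient: most negative first, most positive last.
ranked = sorted(word_to_coef.items(), key=lambda item: item[1])
print("Most negative:", ranked[:2])
print("Most positive:", ranked[-2:])
```

Positive coefficients push a review toward class 1 (positive) and negative coefficients toward class 0.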

In [None]:
##### CODE #####

In [None]:
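from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X_train, y_train, X_val and y_val are not defined earlier in the notebook;
# as an assumption, hold out 20% of the training reviews for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, target, test_size=0.2, random_state=42, stratify=target
)

model = XGBClassifier()
model.fit(X_train, y_train)

print(model)

y_pred = model.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))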
