<img src="https://rhyme.com/assets/img/logo-dark.png" align="center"> <h2 align="center">Logistic Regression: A Sentiment Analysis Case Study</h2>

### Introduction
___

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
<img src="https://i.imgur.com/lQNnqgi.png" align="center">
- Contains at most 30 reviews per movie
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- 50/50 train/test split
- Evaluation metric: accuracy

<b>Features: bag of 1-grams with TF-IDF values</b>
- Extremely sparse feature matrix: close to 97% of the entries are zeros

<b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
<img src="https://i.imgur.com/VieM41f.png" align="center" width=500 height=500>
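
As a minimal sketch of the model equation $p(y = 1|x) = \sigma(w^{T}x)$, the snippet below computes the probability for a single feature vector in plain Python. The weights `w` and the features `x` are made-up values for illustration only, and the helper names `sigmoid` and `predict_proba` are hypothetical, not part of the notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """p(y = 1 | x) = sigmoid(w^T x) for a weight vector w and a feature vector x."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return sigmoid(z)

# Hypothetical weights and a sparse TF-IDF-like feature vector (illustration only)
w = [1.5, -2.0, 0.0, 0.7]
x = [0.4, 0.0, 0.9, 0.0]

p = predict_proba(w, x)        # here w^T x = 0.6
label = 1 if p >= 0.5 else 0   # p >= 0.5 is equivalent to w^T x >= 0
print(round(p, 3), label)
```

Because the decision depends only on the sign of $w^{T}x$, a positive weight pushes a review toward the positive class and a negative weight pushes it toward the negative class, which is what makes the weights interpretable.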

# Loading the dataset

In [21]:
import pandas as pd
df = pd.read_csv('movie_data.csv')
head = df.head() # first 5 rows
null = df.isnull().sum() # number of null values in each column
stats1 = df.describe() # descriptive statistics of the numeric columns
stats2 = stats1.transpose() # the same statistics, transposed

In [22]:
head

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [23]:
null

review       0
sentiment    0
dtype: int64

In [24]:
stats1

Unnamed: 0,sentiment
count,50000.0
mean,0.5
std,0.500005
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [25]:
stats2

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sentiment,50000.0,0.5,0.500005,0.0,0.0,0.5,1.0,1.0


In [27]:
df.info() # general information of dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


In [29]:
print("rows: ", df.shape[0])

rows:  50000


In [30]:
print("columns: ", df.shape[1])

columns:  2


In [52]:
df["review"][2]

'***SPOILER*** Do not read this, if you think about watching that movie, although it would be a waste of time. (By the way: The plot is so predictable that it does not make any difference if you read this or not anyway)<br /><br />If you are wondering whether to see "Coyote Ugly" or not: don\'t! It\'s not worth either the money for the ticket or the VHS / DVD. A typical "Chick-Feel-Good-Flick", one could say. The plot itself is as shallow as it can be, a ridiculous and uncritical version of the American Dream. The young good-looking girl from a small town becoming a big success in New York. The few desperate attempts of giving the movie any depth fail, such as the "tragic" accident of the father, the "difficulties" of Violet\'s relationship with her boyfriend, and so on. McNally (Director) tries to arouse the audience\'s pity and sadness put does not have any chance to succeed in this attempt due to the bad script and the shallow acting. Especially Piper Perabo completely fails in conv

# <h1 align="center">Bag of words / Bag of N-grams model</h1>

# Transforming documents into feature vectors

Below, we will call the fit_transform method on CountVectorizer. This will construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [41]:
import numpy as np # NumPy array that holds the example documents
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [42]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [43]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Each value in the array above is a raw term frequency, *tf(t, d)*: the number of times a term *t* occurs in a document *d*.
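
To make the counting step concrete, here is a minimal pure-Python sketch that rebuilds the vocabulary and the count matrix for the same three sentences. It assumes the default `CountVectorizer` behavior of lowercasing and keeping tokens of two or more word characters; the helper names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Map each distinct term to a column index, in alphabetical order
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

def count_vector(doc):
    # One raw term frequency per vocabulary entry, in column order
    counts = Counter(tokenize(doc))
    return [counts.get(term, 0) for term in all_terms]

bag = [count_vector(doc) for doc in docs]
print(vocabulary)
for row in bag:
    print(row)
```

The rows should match the array printed by `bag.toarray()`, with column *j* holding the count of the term whose vocabulary index is *j*.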

### Task 3: Word relevancy using term frequency-inverse document frequency

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$.
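
As a small worked example of this definition, the sketch below evaluates the idf for three terms of the toy corpus. It assumes the natural logarithm (the formula does not state a base), and the document frequencies in `df_counts` are read off the count matrix by hand.

```python
import math

n_d = 3  # total number of documents in the toy corpus

# Number of documents that contain each term (taken from the count matrix)
df_counts = {"is": 3, "sun": 2, "and": 1}

def idf(term):
    # idf(t, d) = log(n_d / (1 + df(d, t))), natural log assumed
    return math.log(n_d / (1 + df_counts[term]))

def tf_idf(tf, term):
    # tf-idf(t, d) = tf(t, d) * idf(t, d)
    return tf * idf(term)

for term in df_counts:
    print(term, round(idf(term), 3))

# "and" occurs twice in the third document
print(round(tf_idf(2, "and"), 3))
```

With this definition, a term that appears in every document ("is") gets a negative idf, a term in two of three documents ("sun") gets exactly zero, and only the rarest term ("and") gets a positive weight.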

In [50]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2) # number of decimals

tfidf_l1 = TfidfTransformer(norm='l1', use_idf=True, smooth_idf=True, sublinear_tf=False)
print(tfidf_l1.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.22 0.   0.28 0.28 0.   0.22 0.   0.  ]
 [0.   0.22 0.   0.   0.   0.28 0.22 0.   0.28]
 [0.18 0.16 0.18 0.07 0.07 0.07 0.11 0.09 0.07]]


In [51]:
tfidf_l2 = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
print(tfidf_l2.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


The idf equation implemented in scikit-learn (with `smooth_idf=True`) is:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation implemented in scikit-learn is:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

Here $\log$ is the natural logarithm, and each document vector is afterwards normalized according to the `norm` parameter.
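
The two scikit-learn equations can be checked by hand against the `norm='l2'` output printed earlier. The following is a sketch in plain Python, with the count matrix hard-coded from the `bag.toarray()` output; the helper name `tfidf_l2` is hypothetical.

```python
import math

# Raw term frequencies copied from the bag.toarray() output
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_d = len(counts)          # number of documents
n_terms = len(counts[0])   # vocabulary size

# df(d, t): number of documents in which each term occurs at least once
df_counts = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: log((1 + n_d) / (1 + df)), natural log
idf = [math.log((1 + n_d) / (1 + df_j)) for df_j in df_counts]

def tfidf_l2(row):
    # tf * (idf + 1), followed by L2 normalization of the document vector
    raw = [tf * (idf_j + 1) for tf, idf_j in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_l2(row) for row in counts]
for row in tfidf:
    print([round(v, 2) for v in row])
```

Rounded to two decimals, the three rows should agree with the `tfidf_l2` array shown above. Replacing the Euclidean norm with the sum of the values gives the `norm='l1'` variant.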

### Data Preparation

In [63]:
third_review = df.loc[2, "review"][-80:] # last 80 characters of the third review
third_review

'nd self-ironic) instead of this flick.<br /><br />Two thumbs down (3 out of 10).'

In [64]:
import re
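def preprocessor(text):
    # Strip HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with their "-" noses removed
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text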

In [65]:
preprocessor(third_review)

'nd self ironic instead of this flick two thumbs down 3 out of 10 '

In [72]:
preprocessor("this document- :D :( also works!! :) /¿")

'this document d also works :D :( :)'
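
To see what each regular expression contributes, the sketch below runs the same three steps one at a time on a made-up sentence (the input string is an illustration, not taken from the dataset) and keeps every intermediate result.

```python
import re

text = "Great <br />movie :-) but the ending... :("

# Step 1: remove HTML tags
no_html = re.sub(r"<[^>]*>", "", text)

# Step 2: find emoticons (eyes, optional nose, mouth) before they are destroyed
emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", no_html)

# Step 3: lowercase and collapse every run of non-word characters into one space
words = re.sub(r"[\W]+", " ", no_html.lower())

# Step 4: append the emoticons at the end, dropping the "-" nose
result = words + " ".join(emoticons).replace("-", "")

print(repr(no_html))
print(emoticons)
print(repr(words))
print(repr(result))
```

The emoticons have to be extracted before step 3, because the `\W` substitution would otherwise erase them along with the rest of the punctuation.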

In [74]:
df["review"] = df["review"].apply(preprocessor)
df["review"]

0        in 1974 the teenager martha moxley maggie grac...
1        ok so i really like kris kristofferson and his...
2         spoiler do not read this if you think about w...
3        hi for all the people who have seen this wonde...
4        i recently bought the dvd forgetting just how ...
                               ...                        
49995    ok lets start with the best the building altho...
49996    the british heritage film industry is out of c...
49997    i don t even know where to begin on this one i...
49998    richard tyler is a little boy who is scared of...
49999    i waited long to watch this movie also because...
Name: review, Length: 50000, dtype: object

### Tokenization of documents

In [77]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

In [78]:
def tokenizer(text):
    return text.split()

In [79]:
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [81]:
from nltk.stem.snowball import SnowballStemmer # Snowball stemmers are also available for languages other than English

In [90]:
stemmer = SnowballStemmer("spanish")
print(stemmer.stem("rapidamente"))
print(SnowballStemmer("spanish").stem("generosamente"))

rapid
gener


In [92]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/javier/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [94]:
from nltk.corpus import stopwords

stop = stopwords.words("spanish")
[w for w in tokenizer_porter("la casa de mi amigo está en otro lugar del país")[-15:] if w not in stop]

['casa', 'amigo', 'lugar', 'paí']

### Transform Text Data into TF-IDF Vectors

In [95]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        tokenizer=tokenizer_porter,
                        use_idf=True,
                        norm="l2",
                        smooth_idf=True)

y = df.sentiment.values
x = tfidf.fit_transform(df.review)

### Task 7: Document Classification using Logistic Regression

In [96]:
from sklearn.model_selection import train_test_split

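# shuffle=False keeps the original row order (first half train, second half test),
# so random_state has no effect here
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.5, shuffle=False)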
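
With `shuffle=False` and `test_size=0.5`, the split amounts to cutting the data in the middle. The sketch below illustrates that idea with plain Python lists; `sequential_split` is a hypothetical helper written for this illustration, not a scikit-learn function, and the ten rows are made up.

```python
import math

def sequential_split(rows, test_size=0.5):
    """Split without shuffling: the head becomes train, the tail becomes test."""
    n_test = math.ceil(len(rows) * test_size)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

# Ten made-up samples standing in for the 50000 reviews
features = list(range(10))
labels = [i % 2 for i in range(10)]

x_tr, x_te = sequential_split(features)
y_tr, y_te = sequential_split(labels)

print(x_tr, x_te)
print(y_tr, y_te)
```

A split like this only gives balanced classes in both halves if the rows are already in random order, so it is worth checking the label distribution of each half before training.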

In [None]:
import pickle
from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegressionCV(cv=5,
                           scoring="accuracy",
                           random_state=0,
                           n_jobs=-1,
                           verbose=3,
                           max_iter=100).fit(x_train, y_train)

# Persist the fitted model; the with-block closes the file automatically
with open("saved_model.sav", "wb") as saved_model:
    pickle.dump(clf, saved_model)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
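
Since cross-validated training on 25000 reviews takes a while, the pickled file lets a later session skip it. The sketch below shows the save and load round trip with `pickle`; a plain dictionary stands in for the fitted classifier and a temporary directory stands in for the working directory, both assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted classifier (hypothetical; any picklable object works the same way)
model = {"name": "stand-in model", "coef": [0.1, -0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "saved_model.sav")

    # Save: open in binary write mode and dump the object
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load: open in binary read mode and restore the object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.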


### Task 8: Model Evaluation