<img src="https://rhyme.com/assets/img/logo-dark.png" align="center"> <h2 align="center">Logistic Regression: A Sentiment Analysis Case Study</h2>

### Introduction
___

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
<img src="https://i.imgur.com/lQNnqgi.png" align="center">
- Contains at most 30 reviews per movie
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- 50/50 train/test split
- Evaluation metric: accuracy

<b>Features: bag of 1-grams with TF-IDF values</b>:
- Extremely sparse feature matrix - close to 97% are zeros

 <b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
<img src="https://i.imgur.com/VieM41f.png" align="center" width=500 height=500>
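
The model equation above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's code: the weight and feature values are made up, and `sigmoid` and `predict_proba` are hypothetical helper names.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, features):
    """Return p(y = 1 | x) = sigmoid(w^T x) for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three terms and one TF-IDF vector (illustrative values).
weights = [1.5, -2.0, 0.0]
features = [0.6, 0.0, 0.8]

p = predict_proba(weights, features)
label = 1 if p >= 0.5 else 0  # threshold at 0.5 to get a class label
print(round(p, 3), label)
```

A zero feature contributes nothing to the sum, which is one reason the model copes well with sparse inputs.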

### Task 1: Loading the dataset
---

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('movie_data.csv')

In [2]:
df.head()
# df['review'][0]

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


<h2 align="center">Bag of words / Bag of N-grams model</h2>

### Task 2: Transforming documents into feature vectors

Below, we will call the fit_transform method on CountVectorizer. This will construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [3]:
# import numpy as np
# from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()

doc = np.array(['The sun is shining',
'The weather is sweet',
'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(doc)

In [4]:
count.get_feature_names_out()

array(['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two',
       'weather'], dtype=object)

In [5]:
count.vocabulary_

{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}

In [6]:
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=int64)

Raw term frequencies: *tf(t, d)* is the number of times a term *t* occurs in a document *d*. Each row of the array above holds these counts for one sentence.
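
As a rough cross-check, the counts in the array above can be reproduced without scikit-learn. The sketch below assumes lowercasing and tokens of two or more word characters, which matches `CountVectorizer`'s defaults as far as these three sentences are concerned; `tokenize` and `term_frequencies` are hypothetical helper names.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(doc):
    # Lowercase, then keep runs of at least two word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def term_frequencies(doc):
    # tf(t, d) for every term t in the vocabulary.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

tf_matrix = [term_frequencies(doc) for doc in docs]
for row in tf_matrix:
    print(row)
```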

### Task 3: Word relevancy using term frequency-inverse document frequency

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$.
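
As a worked example of this textbook definition, take the three sentences from Task 2 as the corpus ($n_d = 3$) and assume the natural logarithm (the base is not stated above). The term "and" occurs in one document, "sun" in two, and "is" in all three:

$$\text{idf}(\text{and}) = \log\frac{3}{1+1} \approx 0.405,\qquad \text{idf}(\text{sun}) = \log\frac{3}{1+2} = 0,\qquad \text{idf}(\text{is}) = \log\frac{3}{1+3} \approx -0.288$$

So "and", which occurs twice in the third sentence, gets $\text{tf-idf} = 2 \times 0.405 \approx 0.811$ there, while terms that appear in most or all documents are pushed to zero or below.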

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2)
tfidf = TfidfTransformer(use_idf=True,smooth_idf=True,norm='l2')
print(tfidf.fit_transform(bag).toarray())

# ['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']
#     [[0, 1, 0, 1, 1, 0, 1, 0, 0],
#     [0, 1, 0, 0, 0, 1, 1, 0, 1],
#     [2, 3, 2, 1, 1, 1, 2, 1, 1]],

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


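The idf equation that is implemented in scikit-learn (with `smooth_idf=True`, using the natural logarithm) is:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that is implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

With `norm='l2'`, each document's tf-idf vector is then scaled to unit Euclidean length, which is why the printed values above lie between 0 and 1.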
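
To see how these equations lead to the printed matrix, the sketch below recomputes it by hand in plain Python. It assumes the natural logarithm and a final L2 normalization of each row (the `norm='l2'` setting used in the cell above); `l2_normalize` is a hypothetical helper name.

```python
import math

# Term counts from the bag-of-words example
# (columns: and, is, one, shining, sun, sweet, the, two, weather).
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# df(d, t): number of documents in which term t occurs at least once.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, with the "+ 1" from the tf-idf equation folded in.
idf_plus_one = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

def l2_normalize(row):
    # Scale a vector to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

tfidf_rows = [
    l2_normalize([tf * w for tf, w in zip(row, idf_plus_one)])
    for row in counts
]
for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

The rounded rows should agree with the `TfidfTransformer` output shown earlier. Note that "is", which occurs in every sentence, still gets a non-zero weight because of the added 1.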

### Task 4: Data Preparation

In [8]:
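import re
def preprocessor(text):
    # Strip HTML tags such as <br /> or <a>.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons like :) ;-( =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse runs of non-word characters into single spaces,
    # then append the emoticons without their "-" noses.
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text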

In [9]:
preprocessor('Hi:) I am <a> cool </a> ?":\\ ! ')

'hi i am cool :)'
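
The emoticon pattern is the least obvious part of `preprocessor`: it looks for eyes (`:`, `;` or `=`), an optional `-` nose, and a mouth (`)`, `(`, `D` or `P`). The sketch below runs the same pattern, written as a raw string, on a few made-up inputs; the sample strings and the name `EMOTICON_PATTERN` are illustrative only.

```python
import re

# Same emoticon pattern as in preprocessor, written as a raw string.
EMOTICON_PATTERN = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

samples = [
    'Great movie :-) loved it ;D',
    '=( what a waste of time',
    'Note:Done',  # no emoticon intended, but ":D" still matches
]
for sample in samples:
    print(repr(sample), '->', EMOTICON_PATTERN.findall(sample))
```

The last sample shows that the pattern can produce false positives when a colon is directly followed by a capital D or P.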

In [10]:
import string
txt = 'Hi:) I am <a> cool </a> ?":\\ ! '
z = string.punctuation
mytable = str.maketrans('','',z)
print(txt.translate(mytable))

Hi I am a cool a   


In [11]:
df['review'] = df['review'].apply(preprocessor)

### Task 5: Tokenization of documents

In [12]:

# import these modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
ps = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kaito\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
def tok(text):
    # Tokenize, drop English stop words, and reduce each remaining word to its Porter stem.
    words = [ps.stem(w) for w in word_tokenize(text) if w not in stop_words]
    return words

tok('runners like running and thats why they run')

['runner', 'like', 'run', 'that', 'run']
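
Stemming is what shrinks the vocabulary here: different inflections of a word collapse to one stem and therefore share a feature column. A small sketch using only NLTK's `PorterStemmer` (no corpus download needed), with a hand-picked word list:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Different inflections collapse to one stem, so they share a feature column.
words = ['run', 'running', 'runners', 'movie', 'movies', 'really']
stems = {word: ps.stem(word) for word in words}
for word, stem in stems.items():
    print(word, '->', stem)
```

Stems such as "movi" or "realli" are not dictionary words; they only need to be consistent across documents.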

In [14]:
df['review_token'] = df['review'].apply(tok)

In [15]:
df.head()

Unnamed: 0,review,sentiment,review_token
0,in 1974 the teenager martha moxley maggie grac...,1,"[1974, teenag, martha, moxley, maggi, grace, m..."
1,ok so i really like kris kristofferson and his...,0,"[ok, realli, like, kri, kristofferson, usual, ..."
2,spoiler do not read this if you think about w...,0,"[spoiler, read, think, watch, movi, although, ..."
3,hi for all the people who have seen this wonde...,1,"[hi, peopl, seen, wonder, movi, im, sure, thet..."
4,i recently bought the dvd forgetting just how ...,0,"[recent, bought, dvd, forget, much, hate, movi..."


### Task 6: Transform Text Data into TF-IDF Vectors

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(lowercase=False,preprocessor=None,tokenizer=tok,
                            use_idf=True,norm='l2',smooth_idf=True)   

In [24]:
y = df.sentiment.values
x = tfidf.fit_transform(df.review)



### Task 7: Document Classification using Logistic Regression

In [25]:
from sklearn.model_selection import train_test_split
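# shuffle=False keeps the file order: the first half becomes the training set and
# the second half the test set, so random_state has no effect here.
X_train,X_test,y_train,y_test = train_test_split(x,y,random_state=1,test_size=.5,shuffle=False)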
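
Note that the TF-IDF vocabulary and idf weights above are fitted on all reviews before the split, so the test half influences the features. A hedged alternative sketch, using a made-up six-document corpus instead of `movie_data.csv` and default vectorizer settings rather than the notebook's `tok` tokenizer, fits the vectorizer on the training half only by wrapping both steps in a scikit-learn `Pipeline`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for df.review / df.sentiment (illustrative data only).
reviews = ['great movie', 'awful plot', 'great acting', 'awful movie',
           'great soundtrack', 'awful pacing']
labels = [1, 0, 1, 0, 1, 0]

# Split the raw text first, keeping the original order as in the notebook.
text_train, text_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.5, shuffle=False)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
# Vocabulary and idf weights come from the training half only.
pipeline.fit(text_train, y_train)

vocab = sorted(pipeline.named_steps['tfidf'].vocabulary_)
print(vocab)
print(pipeline.score(text_test, y_test))
```

Words that occur only in the test half ("soundtrack", "pacing") never enter the vocabulary, which is the behavior a deployed model would face with unseen reviews.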

In [28]:
import pickle
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(cv=5,scoring='accuracy',random_state=0,n_jobs=1,verbose=3,max_iter=200).fit(X_train,y_train)
# Persist the fitted model; the with-block closes the file even if pickling fails.
with open('saved_model.sav','wb') as saved_model:
    pickle.dump(clf,saved_model)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.1min remaining:    0.0s

### Task 8: Model Evaluation

In [29]:
fname = 'saved_model.sav'
# Load the pickled classifier; the with-block closes the file afterwards.
with open(fname,'rb') as f:
    saved_clf = pickle.load(f)

In [31]:
saved_clf.score(X_test,y_test)

0.89488

In [38]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
y_pred = clf.predict(X_test)
# accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.89      0.89     12527
           1       0.89      0.90      0.90     12473

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000
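
Each number in the report derives from the four cells of the confusion matrix. The sketch below shows the arithmetic on made-up labels (not the notebook's predictions); `confusion_counts` and `precision_recall_f1` are hypothetical helper names.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up labels for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_hat)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn)
print(precision, recall, round(f1, 2), accuracy)
```

Calling `confusion_counts(y_true, y_hat, positive=0)` gives the row for the negative class, which is how the report arrives at one line per label.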

