<a href="https://colab.research.google.com/github/AuraFrizzati/Applied-Text-Mining-in-Python/blob/main/Week3_Case_Study_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, this notebook is using a subset of the data.*

# Case Study: Sentiment Analysis

### Data Prep

In [None]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Sample the data to speed up computation
# Comment out this line to match with lecture
df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
394349,Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat...,,244.95,5,Very good one! Better than Samsung S and iphon...,0.0
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0


In [None]:
# Drop missing values
df.dropna(inplace=True)

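# Remove any 'neutral' ratings equal to 3
# (.copy() makes an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning)
df = df[df['Rating'] != 3].copy()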

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0,0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0,1
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0,0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0,1
277158,Nokia N8 Unlocked GSM Touch Screen Phone Featu...,Nokia,95.0,5,I fell in love with this phone because it did ...,0.0,1
100311,Blackberry Torch 2 9810 Unlocked Phone with 1....,BlackBerry,77.49,5,I am pleased with this Blackberry phone! The p...,0.0,1
251669,Motorola Moto E (1st Generation) - Black - 4 G...,Motorola,89.99,5,"Great product, best value for money smartphone...",0.0,1
279878,OtterBox 77-29864 Defender Series Hybrid Case ...,OtterBox,9.99,5,I've bought 3 no problems. Fast delivery.,0.0,1
406017,Verizon HTC Rezound 4G Android Smarphone - 8MP...,HTC,74.99,4,Great phone for the price...,0.0,1
302567,"RCA M1 Unlocked Cell Phone, Dual Sim, 5Mp Came...",RCA,159.99,5,My mom is not good with new technoloy but this...,4.0,1


In [None]:
# Most ratings are positive, so the outcome classes are imbalanced
df['Positively Rated'].mean()

0.7471776686078667

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],          # features: review text
                                                    df['Positively Rated'], # labels: 1 = positive, 0 = negative
                                                    random_state=0)

In [None]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 Everything about it is awesome!


X_train shape:  (23052,)


In [None]:
print('\n','Number training samples = ', len(X_train), '\n',
     'Number test samples = ', len(X_test))


 Number training samples =  23052 
 Number test samples =  7685


- The review **text feature** needs to be **converted** into a **numeric format** that scikit-learn can use. The **bag-of-words approach** is a simple and commonly used way to represent text for use in ML. It **ignores text structure** and only **counts how often each word occurs**.
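
To make the counting idea concrete, here is a minimal sketch of a bag-of-words representation in plain Python. The toy corpus and the helper name `bag_of_words` are hypothetical and only serve as an illustration; it splits on whitespace, which is simpler than what scikit-learn does.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the Amazon reviews dataset
docs = ["great phone great price", "phone stopped working"]

# Build a sorted vocabulary from every whitespace-separated token
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return one count per vocabulary word, ignoring word order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per document, one column per vocabulary word
matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

Each row is a document and each column is a vocabulary word, which is the same layout as the document-term matrix used in the rest of this notebook.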

# CountVectorizer

- CountVectorizer allows us to use the **bag-of-words approach** by **converting** a **collection of text documents** into a **matrix of token counts**.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)
vect

CountVectorizer()

- **Fitting** the **CountVectorizer tokenizes each document** by finding **all sequences of characters** (at least 2 letters/numbers) separated by word boundaries. It then **converts everything to lowercase** and **builds a vocabulary** using these tokens.
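
The tokenization step can be approximated with the standard `re` module. This sketch assumes the default `token_pattern` documented for `CountVectorizer` (two or more word characters between word boundaries); the helper name `tokenize` is hypothetical.

```python
import re

# Default token pattern documented for CountVectorizer:
# runs of two or more word characters delimited by word boundaries
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("I've bought 3, no problems. Fast delivery!")
print(tokens)
```

Single-character tokens such as `I` and `3` are dropped, and the apostrophe splits `I've` so that only `ve` survives.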

In [None]:
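# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions use get_feature_names_out() here and in the cells below
vect.get_feature_names()[::2000]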

['00',
 'arroja',
 'comapañias',
 'dvds',
 'golden',
 'lands',
 'oil',
 'razonable',
 'smallsliver',
 'tweak']

In [None]:
len(vect.get_feature_names()) ## length of the vocabulary

19601

In [None]:
#print(vect.vocabulary_)

In [None]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train) ## this creates the bag-of-words representation of X_train

## the text data is now stored in a sparse matrix where each row corresponds to a text document
## and each column to a word in the training vocabulary;
## printing it lists the stored entries as "(row, col) non-zero value":
print(X_train_vectorized)
X_train_vectorized ## the bare expression shows a summary of the sparse matrix (shape, dtype, stored elements)

  (0, 1063)	1
  (0, 2262)	1
  (0, 6631)	1
  (0, 9555)	1
  (0, 9582)	1
  (1, 296)	1
  (1, 298)	1
  (1, 1457)	1
  (1, 1574)	1
  (1, 1698)	4
  (1, 2022)	2
  (1, 2320)	1
  (1, 2572)	1
  (1, 2654)	1
  (1, 2906)	2
  (1, 2951)	1
  (1, 3019)	1
  (1, 3203)	2
  (1, 3306)	1
  (1, 3308)	1
  (1, 4813)	1
  (1, 5416)	1
  (1, 5725)	1
  (1, 5779)	1
  (1, 5914)	1
  :	:
  (23050, 19424)	1
  (23050, 19488)	18
  (23050, 19496)	4
  (23050, 19502)	2
  (23050, 19529)	2
  (23050, 19532)	1
  (23051, 446)	1
  (23051, 1698)	1
  (23051, 6351)	1
  (23051, 7058)	1
  (23051, 7576)	1
  (23051, 9582)	3
  (23051, 10312)	1
  (23051, 11950)	1
  (23051, 12360)	1
  (23051, 12800)	2
  (23051, 14676)	1
  (23051, 15693)	1
  (23051, 17333)	1
  (23051, 17343)	2
  (23051, 18334)	1
  (23051, 18884)	1
  (23051, 19057)	1
  (23051, 19205)	1
  (23051, 19285)	1


<23052x19601 sparse matrix of type '<class 'numpy.int64'>'
	with 613289 stored elements in Compressed Sparse Row format>

In [None]:
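## The entries in this matrix are the number of times each word appears in each document.
## Because the number of words in the vocabulary is much larger than the number of words that
## might appear in a single review, most entries of this matrix are 0 (sparse matrix).
## Note: toarray() materializes the full dense matrix (23052 x 19601 int64 values, roughly 3.6 GB),
## so it is only suitable for a quick inspection like this one.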

print(X_train_vectorized.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()  ## LR works well for high-dimensional sparse data
model.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [None]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents

## The X_test documents are transformed using the same vectorizer that was fitted to the training data:
## any word in X_test that does not appear in X_train is simply ignored
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.8963986165184588
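
The cell above passes the hard 0/1 output of `model.predict` to `roc_auc_score`. ROC AUC is a ranking metric, so it is more commonly computed from continuous scores such as `model.decision_function(...)` or `model.predict_proba(...)[:, 1]`. The sketch below uses a small pure-Python AUC (the function `roc_auc` and the toy labels and scores are hypothetical) to show that thresholding the scores first can change the value.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and continuous model scores
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.55, 0.7, 0.9]

# Thresholding at 0.5 discards the ranking information within each class
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = roc_auc(y_true, scores)
auc_labels = roc_auc(y_true, hard_labels)
print(auc_scores, auc_labels)
```

The AUC values reported in this notebook are all computed from hard predictions, so they are comparable with each other but may differ from a score-based AUC.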


In [None]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'terrible' 'slow' 'junk' 'waste' 'sucks' 'poor' 'useless'
 'disappointed' 'horrible']

Largest Coefs: 
['excelent' 'excellent' 'excelente' 'perfectly' 'love' 'perfect' 'exactly'
 'great' 'best' 'awesome']


# Tf-idf (Term Frequency - Inverse Document Frequency)
- This is an alternative approach to CountVectorizer that allows us to **rescale features**, weighting them by how important they are to a document: **higher weight** is given to **words that appear often in the document** but **not in the rest of the corpus**.
- **Words** with **low Tf-Idf** are either 
    - **commonly used in all documents** 
    - or **rarely used** and occur only in **long documents**
- **Words** with **high Tf-Idf** are **frequently used within specific documents** but **rarely used across all documents**.
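
As a rough sketch of the weighting described above, the following pure-Python code computes tf-idf with the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the defaults for `TfidfVectorizer`. The toy documents and the helper names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already tokenized toy documents
docs = [["great", "phone", "great", "price"],
        ["phone", "stopped", "working"],
        ["great", "battery"]]

n_docs = len(docs)
# Document frequency: number of documents that contain each word
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    """Term count times idf, scaled so the document vector has unit L2 norm."""
    counts = Counter(doc)
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(docs[1])
print(weights)
```

In the second document, `stopped` and `working` occur in no other document and therefore receive a higher weight than `phone`, which also appears in the first document.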

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
## TfidfVectorizer tokenizes the documents in the same way as CountVectorizer, so by default
## we would expect to obtain the same number of features. However, we can reduce the number by specifying
## min_df = minimum number of documents in which a word needs to appear to become part of the vocabulary.

# Fit the TfidfVectorizer to the training data specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train) ## each word must appear in at least 5 documents
len(vect.get_feature_names())

5442
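
The effect of `min_df` on vocabulary size can be illustrated with a small pure-Python sketch. The toy documents and the helper `build_vocabulary` are hypothetical; the sketch only mimics the filtering rule, not the vectorizer itself.

```python
from collections import Counter

# Hypothetical, already tokenized toy documents
docs = [["great", "phone"],
        ["great", "price"],
        ["phone", "broke"],
        ["great", "battery"]]

def build_vocabulary(docs, min_df):
    """Keep only words that occur in at least min_df different documents."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return sorted(word for word, df in doc_freq.items() if df >= min_df)

full_vocabulary = build_vocabulary(docs, min_df=1)
pruned_vocabulary = build_vocabulary(docs, min_df=2)
print(len(full_vocabulary), len(pruned_vocabulary))
```

Words seen in only one document, which are often typos or very rare terms, are dropped first. This is consistent with the vocabulary above shrinking from 19601 to 5442 features.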

In [None]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.889951006492175


In [None]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['61' 'printer' 'approach' 'adjustment' 'consequences' 'length' 'emailing'
 'degrees' 'handsfree' 'chipset']

Largest tfidf: 
['unlocked' 'handy' 'useless' 'cheat' 'up' 'original' 'exelent' 'exelente'
 'exellent' 'satisfied']


In [None]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'slow' 'disappointed' 'worst' 'terrible' 'never' 'return' 'doesn'
 'horrible' 'waste']

Largest Coefs: 
['great' 'love' 'excellent' 'good' 'best' 'perfect' 'price' 'awesome'
 'far' 'perfectly']


- One **problem** with the **common Bag-of-words approach** is that **word order is disregarded**:

In [None]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]


- One way around this issue is to **add some context** by adding **sequences of word features** known as **n-grams**:

# n-grams
- To **create n-gram features**, we can pass a tuple (minimum length of sequence, maximum length of sequence) to the `ngram_range` **parameter** of `CountVectorizer`.
- Although **n-grams** can be powerful at capturing meaning, **longer sequences** can cause an **explosion in the number of features**!
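
A minimal sketch of n-gram extraction, applied to the two example reviews used in this notebook, shows why bigrams help. The helper `ngrams` is hypothetical and works on pre-split tokens rather than on the vectorizer's own tokenizer.

```python
def ngrams(tokens, min_n, max_n):
    """Return all contiguous token sequences with length between min_n and max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

a = "not an issue phone is working".split()
b = "an issue phone is not working".split()

# With single words only, both reviews contain exactly the same tokens
same_unigrams = sorted(ngrams(a, 1, 1)) == sorted(ngrams(b, 1, 1))

# With bigrams, each review has sequences the other one lacks
bigrams_a = set(ngrams(a, 2, 2))
bigrams_b = set(ngrams(b, 2, 2))
only_in_a = bigrams_a - bigrams_b
only_in_b = bigrams_b - bigrams_a

print(same_unigrams)
print(sorted(only_in_a))
print(sorted(only_in_b))
```

The unigram counts are identical, but bigrams such as `not working` appear in only one of the two reviews, which gives a classifier something to distinguish them by.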

In [None]:
# Fit the CountVectorizer to the training data specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams

## the features now include single words and bigrams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)

print(X_train_vectorized)
len(vect.get_feature_names())

  (0, 386)	1
  (0, 418)	1
  (0, 2970)	1
  (0, 7664)	1
  (0, 7665)	1
  (0, 11769)	1
  (0, 11805)	1
  (0, 12244)	1
  (0, 12460)	1
  (1, 125)	1
  (1, 126)	1
  (1, 129)	1
  (1, 131)	1
  (1, 995)	1
  (1, 1352)	4
  (1, 1696)	1
  (1, 1947)	1
  (1, 1953)	2
  (1, 2596)	2
  (1, 2635)	1
  (1, 2996)	1
  (1, 3033)	1
  (1, 3566)	1
  (1, 3567)	1
  (1, 3665)	1
  :	:
  (23051, 8976)	1
  (23051, 12244)	3
  (23051, 12692)	1
  (23051, 12718)	1
  (23051, 13693)	1
  (23051, 16440)	1
  (23051, 16583)	1
  (23051, 17736)	1
  (23051, 17738)	1
  (23051, 18055)	2
  (23051, 18244)	1
  (23051, 18456)	1
  (23051, 20265)	1
  (23051, 21394)	1
  (23051, 23186)	1
  (23051, 23402)	2
  (23051, 23990)	2
  (23051, 26193)	1
  (23051, 26209)	1
  (23051, 26942)	1
  (23051, 27615)	1
  (23051, 27649)	1
  (23051, 28295)	1
  (23051, 28468)	1
  (23051, 28480)	1


29072

In [None]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


AUC:  0.9104640361714084


In [None]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'junk' 'poor' 'slow' 'worst' 'broken' 'not good' 'terrible'
 'defective' 'horrible']

Largest Coefs: 
['excellent' 'excelente' 'perfect' 'excelent' 'great' 'love' 'awesome'
 'no problems' 'good' 'best']


In [None]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]
