<font color = green >

# Text classification: sentiment analysis 

</font>

<font color = green >

## Popular tasks of text classification

</font>

- **Spam detection**: given a message, decide whether it is spam or not
- **Topic identification**: given an article, choose one of the known classes such as "Sport", "Technology", "Finances"
- **Sentiment analysis**: is the movie review positive or negative?
- **Spelling correction**: which is more suitable, "weather" or "whether"?


<font color = green >

## Features from Text

</font>

1. The most common words
2. *Stop* words
3. Normalization: lower case / stemming / lemmatizing
4. Capitalization as feature 
5. Part of speech (POS), e.g. "the weather" (noun) vs "whether" (conjunction)
6. Grouping
    - buy, purchase
    - Mr, Ms, Dr
    - Numbers
    - Dates
7. Bigrams, n-grams e.g. "White House"
8. Sub-sequences e.g. "ing", "ion"
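
Several of the features listed above can be sketched in a few lines of plain Python. This is only an illustration: the sentence, the tiny `DEMO_STOP_WORDS` set and the helper names are assumptions made for the example and are not used elsewhere in this notebook.

```python
import re

# Hypothetical mini stop-word list, for illustration only
DEMO_STOP_WORDS = {"the", "a", "is", "in", "of"}

def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def drop_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in DEMO_STOP_WORDS]

def bigrams(tokens):
    """Pairs of adjacent tokens, e.g. ('white', 'house')."""
    return list(zip(tokens, tokens[1:]))

def suffixes(tokens, n=3):
    """Last n characters of each token that is longer than n."""
    return [t[-n:] for t in tokens if len(t) > n]

demo_tokens = normalize("The President is speaking in the White House")
content_tokens = drop_stop_words(demo_tokens)
print(content_tokens)            # ['president', 'speaking', 'white', 'house']
print(bigrams(content_tokens))   # [('president', 'speaking'), ('speaking', 'white'), ('white', 'house')]
print(suffixes(content_tokens))  # ['ent', 'ing', 'ite', 'use']
```

Note that lowercasing discards capitalization, so "White House" can only be recovered as a bigram here, not through the capitalization feature.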


<font color = green >

## Naive Bayes Classifiers

</font>


<font color = green >

### Text classification of search query 

</font>

- **python**  as snake -> Zoology
- **python**  as programming language -> Computer Science
- **python**  as "monty python" -> Entertainment

Probabilistic model:

#### Bayes Rule

\begin{equation*}
P(y|X) = \frac{P(X| y) \cdot P(y)}{P(X)} 
\quad\quad\quad
Posterior = \frac{ Likelihood \cdot Prior}{Evidence} 
\quad\quad\quad
P(class| python) = \frac{P(python| class) \cdot P(class)}{P(python)} 
\end{equation*}

Since $P(python)$ is the same for all classes, we may compare just the numerators: 

\begin{equation*}
P(python| Zoology) \cdot P(Zoology) 
\quad\quad\quad 
P(python|CS) \cdot P(CS) 
\quad\quad\quad
P(python|Entertainment) \cdot P(Entertainment) 
\end{equation*}

In general: 
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad P(y|X) =  \underset{y}{argmax} \quad P(X|y) \cdot P(y)
\end{equation*}

The most probable predicted class is <font color = blue>CS</font>

#### Naive Bayes Classifiers
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad P(y) \prod_{ i=1 }^{ n }{ P(x_{ i }|\,y) } 
\end{equation*}

If search query = **"python snake"** 
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad
P(y)\cdot P(python|\,y) \cdot P(snake|\,y)
\end{equation*}

Now the most probable predicted class is <font color = blue>Zoology</font>, since $P(snake|\,CS)$ is far less than $P(snake|\,Zoology)$

Note: if one of the words never occurs in the training text of a class, its estimated probability is 0 and, as a result,
the whole likelihood is 0 regardless of the other words. Thus it is worth using Laplace smoothing.
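
The argmax above can be traced with a small sketch. All priors and likelihoods below are hypothetical numbers chosen only so that the two queries behave as described; they are not estimated from any data, and `naive_bayes_predict` is an illustrative helper, not an NLTK or sklearn function.

```python
import math

# Hypothetical probabilities, for illustration only
priors = {"Zoology": 0.3, "CS": 0.5, "Entertainment": 0.2}
likelihoods = {
    "Zoology":       {"python": 0.02, "snake": 0.03},
    "CS":            {"python": 0.05, "snake": 0.0005},
    "Entertainment": {"python": 0.01, "snake": 0.001},
}

def naive_bayes_predict(query_words):
    """Return the class maximizing P(y) * prod_i P(x_i | y), plus all scores."""
    scores = {
        y: priors[y] * math.prod(likelihoods[y][w] for w in query_words)
        for y in priors
    }
    return max(scores, key=scores.get), scores

print(naive_bayes_predict(["python"])[0])           # CS
print(naive_bayes_predict(["python", "snake"])[0])  # Zoology
```

With real vocabularies the product of many small probabilities underflows quickly, so implementations usually sum log-probabilities instead of multiplying.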


<font color = green >

#### Laplace smoothing 
 
</font>


$
A : 1 \quad
B : 3\quad
C : 0\quad
D : 6\quad
$

$N= 10\quad K =4$ 
<br>N - number of samples, K - number of classes

\begin{equation*}
P(A) = 0.1\quad\quad\quad\quad
P(B) = 0.3\quad\quad\quad\quad
P(C) = 0.0\quad\quad\quad\quad
P(D) = 0.6\\
\end{equation*}

<font color = blue >

\begin{equation*}
P^{\,L}(x_{i}) =  \frac{count(x_{i})+1}{N+K}
\end{equation*}

</font>




\begin{equation*}
P^{\,L}(A) =  \frac{1+1}{10+4} = 0.14 \quad P^{\,L}(B) =  \frac{3+1}{10+4} = 0.29
\quad P^{\,L}(C) =  \frac{0+1}{10+4} = 0.07 \quad P^{\,L}(D) =  \frac{6+1}{10+4} = 0.5
\end{equation*}
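
The same numbers can be reproduced with a short sketch. The `alpha` parameter is an assumption added for generality (add-one smoothing corresponds to `alpha=1`); the function name is illustrative.

```python
def laplace_smooth(counts, alpha=1):
    """Add-alpha smoothing: (count + alpha) / (N + alpha * K)."""
    n = sum(counts.values())   # N - number of samples
    k = len(counts)            # K - number of distinct classes
    return {x: (c + alpha) / (n + alpha * k) for x, c in counts.items()}

observed = {"A": 1, "B": 3, "C": 0, "D": 6}
smoothed = laplace_smooth(observed)
print({x: round(p, 2) for x, p in smoothed.items()})
# {'A': 0.14, 'B': 0.29, 'C': 0.07, 'D': 0.5}
```

The smoothed values still sum to 1, and the unseen class `C` no longer has zero probability.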






<font color = green >

## Sentiment Analysis

</font>


<font color = green >

### Using NLTK

</font>


In [1]:
import pandas as pd
import numpy as np 


In [2]:
from nltk.corpus import movie_reviews 
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import RegexpTokenizer
import random

In [3]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/oleksiy.tsebriy/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

<font color = green >

#### Load data

</font>


In [4]:
all_movie_reviews_text= movie_reviews.raw() # it is just all reviews joined into one text e.g. 
# this is the ending of first review: " the others ( 9/10 ) - stir of echoes ( 8/10 ) "
# this is the beginning of second review : "the happy bastard's quick movie review "
print(all_movie_reviews_text[3600:4600])

way because someone is apparently assuming that the genre is still hot with the kids . 
it also wrapped production two years ago and has been sitting on the shelves ever since . 
whatever . . . skip 
it ! 
where's joblo coming from ? 
a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) 
the happy bastard's quick movie review 
damn that y2k bug . 
it's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . 
little do they know the power within . . . 
going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . 
we don't know why the crew w

<font color = green >

#### Tools to review data 

</font>



In [5]:
cats =  movie_reviews.categories()
cats

['neg', 'pos']

In [6]:
cat = cats[0]
ids= movie_reviews.fileids(cat)
ids[:10]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt']

In [7]:
id_review = ids[0]
print(movie_reviews.raw(id_review))

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

<font color = green >

#### Tokenize

</font>


In [8]:
def preprocess(text): # lowercases the text and removes punctuation
    tokenizer = RegexpTokenizer(r'\w+') # just for demo 
    return tokenizer.tokenize(text.lower())

all_words = preprocess(all_movie_reviews_text)
print (len(all_words))
print(all_words[:100])

1336782
['plot', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', 'drink', 'and', 'then', 'drive', 'they', 'get', 'into', 'an', 'accident', 'one', 'of', 'the', 'guys', 'dies', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', 'and', 'has', 'nightmares', 'what', 's', 'the', 'deal', 'watch', 'the', 'movie', 'and', 'sorta', 'find', 'out', 'critique', 'a', 'mind', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', 'mess', 'with', 'your', 'head', 'and', 'such']


<font color = green >

#### Build vocabulary

</font>


In [9]:
all_words=nltk.FreqDist(all_words)
print ('len of vocabulary: {:,}'.format (len(all_words)))
# Use most common words
most_common_words = list(zip(*all_words.most_common()))[0] # [0] means names whereas [1] are frequencies 
# most_common(5000) would return a limited number, but in this sample the features are filtered later, after removing stop words 
print (most_common_words[:100])

len of vocabulary: 39,696
('the', 'a', 'and', 'of', 'to', 'is', 'in', 's', 'it', 'that', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by', 'be', 'one', 'movie', 'an', 'who', 'not', 'you', 'from', 'at', 'was', 'have', 'they', 'has', 'her', 'all', 'there', 'like', 'so', 'out', 'about', 'up', 'more', 'what', 'when', 'which', 'or', 'she', 'their', 'some', 'just', 'can', 'if', 'we', 'him', 'into', 'even', 'only', 'than', 'no', 'good', 'time', 'most', 'its', 'will', 'story', 'would', 'been', 'much', 'character', 'also', 'get', 'other', 'do', 'two', 'well', 'them', 'very', 'characters', 'first', 'after', 'see', 'way', 'because', 'make', 'life', 'off', 'too', 'any', 'does', 'really', 'had', 'while', 'films', 'how', 'plot', 'little', 'where')


<font color = green >

#### Get rid of stop words 

</font>


In [10]:
def remove_stop_words(words):
    stop_words = set(stopwords.words('english'))  
    return [w for w in words if w not in stop_words]
most_common_words_filtered = remove_stop_words(most_common_words)

<font color = green >

#### Select features 

</font>

In [11]:

word_features = most_common_words_filtered [:3000]
print (word_features[:100])

['film', 'one', 'movie', 'like', 'even', 'good', 'time', 'story', 'would', 'much', 'character', 'also', 'get', 'two', 'well', 'characters', 'first', 'see', 'way', 'make', 'life', 'really', 'films', 'plot', 'little', 'people', 'could', 'scene', 'man', 'bad', 'never', 'best', 'new', 'scenes', 'many', 'director', 'know', 'movies', 'action', 'great', 'another', 'love', 'go', 'made', 'us', 'big', 'end', 'something', 'back', 'still', 'world', 'seems', 'work', 'makes', 'however', 'every', 'though', 'better', 'real', 'audience', 'enough', 'seen', 'take', 'around', 'going', 'year', 'performance', 'role', 'old', 'gets', 'may', 'things', 'think', 'years', 'last', 'comedy', 'funny', 'actually', 'long', 'look', 'almost', 'thing', 'fact', 'nothing', 'say', 'right', 'john', 'although', 'played', 'find', 'script', 'come', 'ever', 'cast', 'since', 'star', 'plays', 'young', 'show', 'comes']


<font color = green >

#### Extract documents and labels

</font>


In [12]:
# Note: the documents are not tokenized with preprocess(); each one uses the corpus tokens returned by movie_reviews.words(file_id).
documents = [(list(movie_reviews.words(file_id)), category) # using the words() method of movie_reviews object
             for category in movie_reviews.categories() # select category - there are two: ['neg', 'pos']
             for file_id in movie_reviews.fileids(category)]# select all file_ids for specified category
len (documents)
# This returns list of tuples (list_of_tokens_of document, label)

2000

In [13]:
print (documents [0]) # (['plot', ':', 'two', 'teen', ... 'echoes', '(', '8', '/', '10', ')'], 'neg')

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'", 's', 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'b

<font color = green >

#### Shuffle documents 

</font>


In [14]:
# shuffle first 
random.shuffle(documents) # it is inplace method
documents= documents[:500] # reduce the data set to speed up the demo 
len (documents)

500

<font color = green >

#### Vectorize documents 

</font>


In [15]:
def find_features(review_tokens):
    tokens = set(review_tokens) # build the set once instead of once per feature word
    return {w: w in tokens for w in word_features} # feature representation of a document

data_set= [(find_features(review_tokens), category) for (review_tokens, category) in documents]
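
The presence/absence representation produced by `find_features` can be seen on a toy case. The four-word vocabulary and the helper name `toy_find_features` are hypothetical and only mirror the function above on a smaller scale.

```python
toy_vocabulary = ["film", "good", "bad", "plot"]   # hypothetical feature words

def toy_find_features(review_tokens, vocabulary):
    """Map every vocabulary word to True/False depending on its presence."""
    present = set(review_tokens)
    return {w: w in present for w in vocabulary}

print(toy_find_features(["a", "good", "film", "good"], toy_vocabulary))
# {'film': True, 'good': True, 'bad': False, 'plot': False}
```

Only presence is recorded: "good" occurs twice but is still just `True`, and words outside the vocabulary ("a") are ignored.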


In [16]:
data_set[0]

({'film': True,
  'one': True,
  'movie': False,
  'like': True,
  'even': True,
  'good': True,
  'time': False,
  'story': True,
  'would': False,
  'much': True,
  'character': False,
  'also': False,
  'get': False,
  'two': False,
  'well': True,
  'characters': False,
  'first': True,
  'see': False,
  'way': False,
  'make': True,
  'life': False,
  'really': False,
  'films': False,
  'plot': False,
  'little': True,
  'people': False,
  'could': True,
  'scene': False,
  'man': False,
  'bad': False,
  'never': False,
  'best': False,
  'new': False,
  'scenes': True,
  'many': False,
  'director': False,
  'know': False,
  'movies': False,
  'action': True,
  'great': False,
  'another': False,
  'love': False,
  'go': True,
  'made': True,
  'us': False,
  'big': False,
  'end': False,
  'something': False,
  'back': True,
  'still': False,
  'world': False,
  'seems': True,
  'work': True,
  'makes': True,
  'however': True,
  'every': False,
  'though': False,
  'better': 

<font color = green >

#### Split to training and test set

</font>


In [17]:
split_on = int(len(data_set)*.8)
X_y_train= data_set[:split_on]
X_y_test = data_set[split_on:]
print (len(X_y_train))

400


<font color = green >

#### Train model

</font>


In [18]:
clf= nltk.NaiveBayesClassifier.train(X_y_train) # Note: the API differs from sklearn - train() takes (features, label) pairs

<font color = green >

#### Evaluate model

</font>


In [19]:
nltk.classify.accuracy(clf, X_y_test)*100

80.0

<font color = green >

#### Review most informative features

</font>


In [20]:
clf.show_most_informative_features(15)

Most Informative Features
                   awful = True              neg : pos    =     13.0 : 1.0
               pointless = True              neg : pos    =     11.7 : 1.0
                   folks = True              neg : pos    =     10.3 : 1.0
             outstanding = True              pos : neg    =      9.0 : 1.0
                    pain = True              pos : neg    =      9.0 : 1.0
                  poorly = True              neg : pos    =      8.6 : 1.0
                 enjoyed = True              pos : neg    =      8.3 : 1.0
                 unfunny = True              neg : pos    =      8.3 : 1.0
                    dull = True              neg : pos    =      8.2 : 1.0
                   badly = True              neg : pos    =      7.7 : 1.0
                 crucial = True              pos : neg    =      7.7 : 1.0
                    draw = True              pos : neg    =      7.7 : 1.0
                  random = True              neg : pos    =      7.7 : 1.0

<font color = green >

### Incorporate with sklearn

</font>


In [21]:
from nltk.classify.scikitlearn import SklearnClassifier # wrapper that lets sklearn estimators be used through the nltk classifier interface
from sklearn.naive_bayes import MultinomialNB

# Convert to nltk classifiers 
MNNB_classifier= SklearnClassifier(MultinomialNB()) # Note : use ()

from sklearn.linear_model import LogisticRegression
lr_classifier = SklearnClassifier(LogisticRegression()) 

from sklearn.svm import SVC, LinearSVC, NuSVC # NuSVC - Similar to SVC but uses a parameter to control the number of support vectors.
svc_clf = SklearnClassifier(SVC())  
lin_svc_clf= SklearnClassifier(LinearSVC())  
nu_svc_clf = SklearnClassifier(NuSVC())  


In [22]:
# native nltk classifier
clf= nltk.NaiveBayesClassifier.train(X_y_train) 

print('Accuracy nltk.NaiveBayesClassifier={}%'.format(nltk.classify.accuracy(clf,X_y_test) * 100))
# clf.show_most_informative_features(15)

# Note: accuracies vary from run to run because the documents are shuffled without a fixed seed
MNNB_classifier.train(X_y_train)
print('Accuracy MNNB_classifier ={}%'.format(nltk.classify.accuracy(MNNB_classifier, X_y_test) * 100))

lr_classifier.train(X_y_train)
print('Accuracy lr_classifier ={}%'.format(nltk.classify.accuracy(lr_classifier, X_y_test) * 100))

svc_clf.train(X_y_train)
print('Accuracy svc_clf={}%'.format(nltk.classify.accuracy(svc_clf, X_y_test) * 100)) # SVC uses the rbf kernel by default

lin_svc_clf.train(X_y_train)
print('Accuracy lin_svc_clf={}%'.format(nltk.classify.accuracy(lin_svc_clf, X_y_test) * 100))

nu_svc_clf.train(X_y_train)
print('Accuracy nu_svc_clf={}%'.format(nltk.classify.accuracy(nu_svc_clf, X_y_test) * 100))


Accuracy nltk.NaiveBayesClassifier=80.0%
Accuracy MNNB_classifier =82.0%
Accuracy lr_classifier =80.0%
Accuracy svc_clf=79.0%
Accuracy lin_svc_clf=76.0%
Accuracy nu_svc_clf=80.0%


<font color = green >

### Combining algos with a vote

</font>


In [23]:
from nltk.classify import ClassifierI
class Vote_Classifier(ClassifierI): # inherit
    def __init__(self, *classifiers): # expecting list of classifiers
        self.classifiers = classifiers

    def classify(self, sample): 
        return mode([clf.classify(sample) for clf in self.classifiers]) 

    def calc_confidence(self, sample):
        votes= [clf.classify(sample) for clf in self.classifiers]
        return votes.count(mode(votes))/len(votes) # fraction of votes that agree with the winning label

def mode(array): # on ties, returns whichever mode comes first in set iteration order
    return max(set(array), key=array.count)


#### Explanation

In [24]:
a= [3,3,1,2,2]
mode (a)

2

In [25]:
a.count (mode(a))/len(a)

0.4
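
An alternative way to express the same vote, sketched with `collections.Counter`. The function name `majority_vote` is illustrative and is not part of the `Vote_Classifier` above.

```python
from collections import Counter

def majority_vote(votes):
    """Return (winning label, fraction of votes that agree with it)."""
    label, n_votes = Counter(votes).most_common(1)[0]
    return label, n_votes / len(votes)

print(majority_vote(["pos", "neg", "pos", "pos", "neg"]))  # ('pos', 0.6)
print(majority_vote([3, 3, 1, 2, 2]))                      # (3, 0.4)
```

On the tied list the winner is `3` here (the first label encountered), while `mode(a)` returned `2`: the confidence is the same, but the label depends on the tie-breaking rule. Using an odd number of classifiers for a two-class problem avoids ties altogether.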

In [26]:
vote_clf= Vote_Classifier(clf, lr_classifier, svc_clf, lin_svc_clf, nu_svc_clf)
print('Accuracy vote_clf={:.2%}'.format(nltk.classify.accuracy(vote_clf, X_y_test)))

Accuracy vote_clf=82.00%


<font color = green >

#### Classify new sample

</font>


In [27]:
# https://timesofindia.indiatimes.com/entertainment/english/movie-reviews/cold-pursuit/movie-review/67892834.cms
new_review = '''This unusual satire on gangsters and revenge stories, starts off with a quote from Oscar Wilde and some delightful background music that sets the tone for rest of the film. The first few minutes play out like a predictable thriller, featuring a wronged father and his pursuit for vigilante justice. But, what follows is a series of stylised killing sequences, that almost seem like parodies of action set pieces that you’ve seen Liam Neeson pulling off with deadpan ease in the past. Yet, director Hans Petter Noland, who also made the Norwegian film In Order Of Disappearance that inspired Cold Pursuit, and writer Frank Baldwin create a refreshing narrative full of memorable moments. The movie seems bizarrely funny and the snow-heavy setting creates the right atmosphere for the dry and cold-cut humour.
The story begins with tragedy and the first few minutes seem dead serious, right up to the point where Coxman confronts his first victim, the gangster named Speedo. But, as the revenge-seeking father moves up the ranks of the mafia chain, the characters become quirky and the situations get thoroughly entertaining. The introduction of characters like Viking (Tom Bateman), the main antagonist and his team of crazy henchmen like Mustang, Dexter and more, alleviates the narrative. There’s also a track of warring mafia gangs as Viking wages a war against the native Indians led by White Bull (Tom Jackson). Cold Pursuit may not be too creative with the kill sequences, but it does get interesting with the wry sense of humour.
Neeson does what he does best. He keeps a straight face and plays the game of intimidation with ease. He’s just a regular guy who’s way out of his league, killing gangsters. But, his outrageous mission is what makes the story interesting. Watch out for a superb cameo by William Forsythe, too, who plays a brief but key role in Coxman’s revenge saga.
The way Cold Pursuit manages to blend sardonic humour with cold-blooded killings makes it reminiscent of movies like The Coen Brothers’ Fargo and Guy Ritchie’s Snatch. This one’s a refreshingly cool black-comedy that does wonders for the genre.
'''

In [28]:
x_new  = find_features(preprocess(new_review))
x_new

{'film': True,
 'one': True,
 'movie': True,
 'like': True,
 'even': False,
 'good': False,
 'time': False,
 'story': True,
 'would': False,
 'much': False,
 'character': False,
 'also': True,
 'get': True,
 'two': False,
 'well': False,
 'characters': True,
 'first': True,
 'see': False,
 'way': True,
 'make': False,
 'life': False,
 'really': False,
 'films': False,
 'plot': False,
 'little': False,
 'people': False,
 'could': False,
 'scene': False,
 'man': False,
 'bad': False,
 'never': False,
 'best': True,
 'new': False,
 'scenes': False,
 'many': False,
 'director': True,
 'know': False,
 'movies': True,
 'action': True,
 'great': False,
 'another': False,
 'love': False,
 'go': False,
 'made': True,
 'us': False,
 'big': False,
 'end': False,
 'something': False,
 'back': False,
 'still': False,
 'world': False,
 'seems': True,
 'work': False,
 'makes': True,
 'however': False,
 'every': False,
 'though': False,
 'better': False,
 'real': False,
 'audience': False,
 'enough': 

In [29]:
print('\nClassification: {}\nConfidence: {:.2%}'.format(
    vote_clf.classify(x_new),vote_clf.calc_confidence(x_new)))


Classification: pos
Confidence: 100.00%


<font color = green >

### Using sklearn

</font>


<font color = green >

#### Load data 

Data set: [amazon-reviews-unlocked-mobile-phones](https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones)
</font>


In [31]:
import os
cwd= os.getcwd() # current working directory
# path = os.path.join(cwd,'data')
fn=  'Amazon_Unlocked_Mobile.csv' # https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
df = pd.read_csv(fn) # 
print('len=  {:,}\ncolumns= {}'.format(len(df), list(df)))

# df = df.sample(frac=0.1, random_state=10) # optionally keep 10% of the reviews to speed up training for the demo
df.head()

len=  413,840
columns= ['Product Name', 'Brand Name', 'Price', 'Rating', 'Reviews', 'Review Votes']


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


<font color = green >

#### Get rid of records with missing data 

</font>


In [32]:
df.dropna(inplace=True) 
print('len=  {:,}'.format(len(df)))

len=  334,335


<font color = green >

#### Label positive and negative 

</font>


In [33]:
df = df[df['Rating'] != 3].copy() # remove 'neutral' ratings equal to 3 as uninformative; copy() avoids SettingWithCopyWarning below
df['Rating_binary'] = np.where(df['Rating'] > 3, 1, 0) # 1 for ratings 4 and 5, 0 for ratings 1 and 2
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Rating_binary
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1


In [34]:
df['Rating_binary'].mean()

0.7482686025879323

<font color = green >

#### Split to train and test sets

</font>


In [35]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],df['Rating_binary'],random_state=0)

<font color = green >

#### Review training sample

</font>


In [36]:
X_train.iloc[0], y_train.iloc[0] # Be careful with X_train[0]: it is a label lookup like X_train.loc[0], not a positional one

('I bought a BB Black and was deliveried a White BB.Really is not a serious provider...Next time is better to cancel the order.',
 0)

<font color = green >

#### Extract Features 

</font>
The bag-of-words approach is a simple way to represent text for use in machine learning: it ignores structure and only counts how often each word occurs.
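
A minimal bag-of-words sketch in plain Python, assuming whitespace tokenization. The two toy reviews and the names `toy_docs`, `toy_vocab` and `bag_of_words` are made up for the illustration.

```python
from collections import Counter

toy_docs = ["nice phone nice price", "bad phone"]              # hypothetical reviews
toy_vocab = sorted({w for d in toy_docs for w in d.split()})   # one column per word

def bag_of_words(doc, vocab):
    """Count vector of the document over the given vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(toy_vocab)                                         # ['bad', 'nice', 'phone', 'price']
print([bag_of_words(d, toy_vocab) for d in toy_docs])    # [[0, 2, 1, 1], [1, 0, 1, 0]]
print(bag_of_words("price nice phone nice", toy_vocab))  # [0, 2, 1, 1]
```

The last line shows what "ignores structure" means: reordering the words of the first review gives exactly the same vector.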

<font color = green >

#### Count vectorizer

</font>
By default, `CountVectorizer` lowercases the text and selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
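
The default tokenization can be imitated with the standard `re` module. This sketch assumes the default `token_pattern` is `(?u)\b\w\w+\b` and that lowercasing is on; `default_tokenize` is an illustrative helper, not a scikit-learn function.

```python
import re

# Assumed to match CountVectorizer's default token_pattern
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

print(default_tokenize("I bought a BB Black and was deliveried a White BB."))
# ['bought', 'bb', 'black', 'and', 'was', 'deliveried', 'white', 'bb']
```

One-letter words such as "I" and "a" are dropped, which is why they never show up among the features. Strictly speaking `\w` also matches the underscore, so "alphanumeric" is a slight simplification.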

In [37]:
from sklearn.feature_extraction.text import CountVectorizer 

vect = CountVectorizer().fit(X_train) # Fit the CountVectorizer to the training data
vocabulary_terms = vect.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
print('features samples:\n{}'.format(vocabulary_terms[::2000].tolist())) # display every 2000th feature 
print ('\nlen of features {:,}'.format(len(vocabulary_terms))) 


features samples:
['00', '4less', 'adr6275', 'assignment', 'blazingly', 'cassettes', 'condishion', 'debi', 'dollarsshipping', 'esteem', 'flashy', 'gorila', 'human', 'irullu', 'like', 'microsaudered', 'nightmarish', 'p770', 'poori', 'quirky', 'responseive', 'send', 'sos', 'synch', 'trace', 'utiles', 'withstanding']

len of features 53,216




<font color = green >

#### Transform X_train to the feature representation

</font>


In [38]:
X_train_vectorized = vect.transform(X_train) # per document: indices of the vocabulary words it contains and their counts
X_train_vectorized

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [39]:
print (X_train_vectorized[0])

  (0, 4986)	1
  (0, 7259)	2
  (0, 7676)	1
  (0, 7878)	1
  (0, 8476)	1
  (0, 9637)	1
  (0, 14420)	1
  (0, 26003)	2
  (0, 31892)	1
  (0, 32284)	1
  (0, 33437)	1
  (0, 37356)	1
  (0, 38473)	1
  (0, 42146)	1
  (0, 46946)	1
  (0, 47462)	1
  (0, 47639)	1
  (0, 51169)	1
  (0, 51673)	1


<font color = green >

#### Review vectorized training sample

</font>


In [40]:
# review first sample 
df = pd.DataFrame(X_train_vectorized[0].toarray(), index= ['value']).T
df 

Unnamed: 0,value
0,0
1,0
2,0
3,0
4,0
...,...
53211,0
53212,0
53213,0
53214,0


In [41]:
print (list(df[df['value']>0].index))
vect.get_feature_names_out()[df[df['value']>0].index.values].tolist() # get_feature_names() was removed in scikit-learn 1.2

[4986, 7259, 7676, 7878, 8476, 9637, 14420, 26003, 31892, 32284, 33437, 37356, 38473, 42146, 46946, 47462, 47639, 51169, 51673]




['and',
 'bb',
 'better',
 'black',
 'bought',
 'cancel',
 'deliveried',
 'is',
 'next',
 'not',
 'order',
 'provider',
 'really',
 'serious',
 'the',
 'time',
 'to',
 'was',
 'white']

<font color = green >

#### Train model

</font>


In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

In [None]:
clf = LogisticRegression(max_iter=2000).fit(X_train_vectorized, y_train) # Train the model

<font color = green >

#### Evaluate model

</font>


In [None]:
predictions = clf.predict(vect.transform(X_test)) # Predict the transformed test documents
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(vect.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores)) 

<font color = green >

#### Review relevant features 
    
</font>

The smallest coefficients correspond to `Neg` impact, and the largest coefficients correspond to `Pos` impact.

In [None]:
feature_names = np.array(vect.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
sorted_coef_index = clf.coef_[0].argsort() # ascending order; [0] squeezes the shape (1, n) to (n,)
clf.coef_.shape, clf.coef_[0].shape, sorted(clf.coef_[0])[:10], sorted(clf.coef_[0])[-10:] # 10 smallest and 10 largest coefficients

In [None]:
print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
# clf.coef_[0][sorted_coef_index[0]] is the smallest coefficient

<font color = green >

## Term frequency–inverse document frequency (TFIDF)


</font>

TFIDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Its value increases proportionally to the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word.
<div style="float:left;">
<br>
    
**Term frequency** $(tf(t,d))$ is a measure of how frequent term $t$ is in document $d$ 
$$ tf(t,d) = \frac{k}{n},$$ <br>$d$ - document,  $k$ - number of times term $t$ occurs in document $d$, $n$ - total number of words in document $d$.
<br>
Note: various approaches can be used for term frequency, e.g. *augmented frequency*, which prevents a bias towards longer documents (raw frequency divided by the raw frequency of the most frequent term in the document):

$$ tf^{\,A}(t,d) = 0.5+ 0.5\cdot \frac{tf(t,d)}{\underset{t' \in d}{max}(tf(t',d))} $$

**Inverse document frequency** $(idf(t,D))$ is a measure of how much information the word provides.
$$ idf(t,D) = log \frac{N}{K},$$ <br>$D$ - all documents, $K$ - number of documents in $D$ that contain the word, $N$ - total number of documents in $D$; $log$ is taken to base 10 in the examples here. <br>
</div>
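
The augmented term frequency can be sketched as follows. The helper name `augmented_tf` and the five-word sample document are assumptions for the illustration; because the document length cancels in the ratio, raw counts are used directly.

```python
from collections import Counter

def augmented_tf(term, doc_tokens):
    """0.5 + 0.5 * (count of term) / (count of the most frequent term)."""
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

sample_doc = "this is a sample a".split()
print(augmented_tf("a", sample_doc))        # 1.0
print(augmented_tf("this", sample_doc))     # 0.75
print(augmented_tf("missing", sample_doc))  # 0.5
```

With this variant the most frequent term always scores 1.0 and an absent term scores 0.5, whatever the document length.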

Note: Various approaches can be used for inverse document frequency 

<div style="float:left;">
<table width="500">
    <tr>
        <th style="text-align:center" bgcolor= white>Document1</th>
        <th style="text-align:center"  bgcolor= white >Document2</th></tr>
    <tr>
        <td>
            <table>
                <tr>
                    <th  bgcolor=gainsboro>Term</th>
                    <th  bgcolor=gainsboro>Term Count</th></tr>
                <tr><td>this</td><td>1</td></tr>
                <tr><td>is</td><td>1</td></tr>
                <tr><td>a</td><td>2</td></tr>
                <tr><td>sample</td><td>1</td></tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <th bgcolor=gainsboro>Term</th>
                    <th  bgcolor=gainsboro>Term Count</th></tr>
                <tr><td>this</td><td>1</td></tr>
                <tr><td>is</td><td>1</td></tr>
                <tr><td>another</td><td>2</td></tr>
                <tr><td>example</td><td>3</td></tr>
            </table>
        </td>
    </tr>
</table>
</div>

<div style="float:left;">
<br>

For <strong>"this"</strong>:
$$ tf ("this", d_{1}) =  \frac{1}{5} = 0.2, \quad  tf ("this", d_{2}) =  \frac{1}{7} \approx 0.14, \quad idf ("this", D) =  log \frac{2}{2} =0; $$

$$ tfidf("this", d_{1}, D)  = 0.2 \cdot 0 = 0, \quad    tfidf("this", d_{2}, D)  = 0.14 \cdot 0 = 0 $$

For <strong>"example"</strong>:
$$ tf ("example", d_{1}) =  \frac{0}{5} = 0 , \quad  tf ("example", d_{2}) =  \frac{3}{7} \approx 0.43 , \quad idf ("example", D) =  log \frac{2}{1} \approx 0.3; $$

$$ tfidf("example", d_{1}, D)  = 0 \cdot 0.3 = 0, \quad    tfidf("example", d_{2}, D)  = 0.43 \cdot 0.3 = 0.129 $$

</div>
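
The hand calculation above can be checked with a few lines of Python. This sketch follows the definitions given here (relative term frequency, base-10 logarithm); the function names and the whitespace tokenization are assumptions for the example.

```python
import math

d1 = "this is a sample a".split()
d2 = "this is another example another example example".split()
corpus = [d1, d2]

def tf(term, doc):
    """Relative frequency of the term in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log10(N / K); undefined if no document contains the term (K == 0)."""
    k = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / k)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("this", "example"):
    print(term, [round(tfidf(term, d, corpus), 3) for d in corpus])
# this [0.0, 0.0]
# example [0.0, 0.129]
```

A word that occurs in every document ("this") gets weight 0 everywhere, however often it appears.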


<font color = green >

### Sklearn tfidf

</font>


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

<font color = green >

#### Compute sklearn tfidf for sample with 2 documents 

</font>


In [None]:
X = np.array(['this is a sample a', 'this is another example another example example'])
tfidf_vectorizer= TfidfVectorizer().fit(X)
X_vectorized= tfidf_vectorizer.transform(X)
print (tfidf_vectorizer.vocabulary_)
X_vectorized.toarray()
# conclusion: sklearn computes a different variant of tfidf, so the values differ from the hand calculation
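
A sketch of where the difference comes from, written without scikit-learn. It assumes the documented defaults of `TfidfVectorizer`: raw counts as term frequency, smoothed idf $ln\frac{1+N}{1+K}+1$, L2 normalization of every row, and a tokenizer that drops one-letter tokens such as "a". The helper names are illustrative.

```python
import math
import re
from collections import Counter

raw_docs = ["this is a sample a", "this is another example another example example"]
tokenized = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in raw_docs]  # "a" is dropped
terms = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

def smooth_idf(term):
    """ln((1 + N) / (1 + K)) + 1, with K the number of documents containing the term."""
    doc_freq = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(doc):
    """Raw count times smoothed idf, then L2 normalization."""
    counts = Counter(doc)
    row = [counts[t] * smooth_idf(t) for t in terms]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

print(terms)  # ['another', 'example', 'is', 'sample', 'this']
for doc in tokenized:
    print([round(v, 4) for v in tfidf_row(doc)])
```

Under these assumptions a word that appears in every document keeps a non-zero weight (idf = 1 instead of 0), which is the main visible difference from the textbook formula.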

<font color = green >

#### Use sklearn tfidf for Amazon_Unlocked_Mobile documents 

</font>


In [None]:
tfidf_vectorizer = TfidfVectorizer(min_df=5)
    # min_df: minimum number of documents a term must appear in to be kept (default is 1)
    # max_features (int or None) can also be set to keep only the terms most frequent across the corpus
X_train_vectorized = tfidf_vectorizer.fit_transform(X_train)
print('len of features= {:,}'.format(len(tfidf_vectorizer.get_feature_names_out())))
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    # Note: min_df=5 yields 17,951 features, compared to 53,216 from the count vectorizer
    # Note: min_df is also available in the count vectorizer
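
The effect of `min_df` is easiest to see on a tiny corpus. The following sketch uses three made-up reviews (hypothetical, not taken from the Amazon dataset) and compares the vocabulary with and without the threshold.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, used only for illustration.
toy_docs = ["good phone", "good battery", "bad phone"]

# min_df=1 (the default) keeps every distinct term.
all_terms = sorted(CountVectorizer(min_df=1).fit(toy_docs).vocabulary_)
# min_df=2 keeps only terms that occur in at least two documents.
frequent_terms = sorted(CountVectorizer(min_df=2).fit(toy_docs).vocabulary_)

print(all_terms)
print(frequent_terms)
```

Here "bad" and "battery" each occur in a single document, so they are dropped once `min_df=2` is set.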




In [None]:
# X_train_vectorized.shape # (231207, 17951) = (n_documents, n_features)
max_tfidf = X_train_vectorized.max(axis=0).toarray()[0]
    # max(axis=0) takes the maximum over all documents, i.e. the highest tf-idf each term reaches
    # toarray()[0] flattens the 1 x n_features result into a 1-D array
sorted_tfidf_index = max_tfidf.argsort()
print(np.sort(max_tfidf))
sorted_tfidf_index # feature indices ordered from the smallest to the largest maximum tf-idf


In [None]:
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
print ('feature_names ',feature_names)
print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))
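
The slices `[:10]` and `[:-11:-1]` applied to an `argsort` result pick the ten lowest and the ten highest entries. A minimal sketch of the same pattern on a four-element array, with made-up names and scores (both hypothetical):

```python
import numpy as np

# Hypothetical feature names and scores, for illustration only.
names = np.array(["a", "b", "c", "d"])
scores = np.array([0.3, 0.9, 0.1, 0.5])

order = scores.argsort()          # indices that sort the scores in ascending order
smallest = names[order[:2]]       # first two entries: the two lowest scores
largest = names[order[:-3:-1]]    # walk backwards from the end: the two highest scores

print(smallest)
print(largest)
```

The negative-step slice `[:-k-1:-1]` returns the last `k` entries in reverse order, so the largest value comes first.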


<font color = green >

#### Train model on features  extracted by tfidf vectorizer

</font>


In [None]:
clf = LogisticRegression(max_iter=1000).fit(X_train_vectorized, y_train) # Train the model
predictions = clf.predict(tfidf_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(tfidf_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores)) 

#### Conclusion: performance is no worse, yet roughly three times fewer features are used

In [None]:
sorted_coef_index = clf.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

<font color = green >

### n-grams

</font>


In [None]:
# Problem: a bag-of-words model ignores word order, so the current model treats these two reviews identically
targets= [
    "not an issue, phone is working", 
    "an issue, phone is not working"
]
print(clf.predict(tfidf_vectorizer.transform(targets)))
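
The reason both reviews get the same prediction is that their unigram vectors are identical. The sketch below checks this directly and shows that adding bigrams separates them; it fits a fresh `CountVectorizer` on just these two sentences, which is an assumption made for illustration and not the vectorizer trained on the review data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

targets = [
    "not an issue, phone is working",
    "an issue, phone is not working",
]

# Unigrams only: both sentences contain exactly the same words.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(targets)
uni_matrix = unigrams.transform(targets).toarray()
same_unigrams = np.array_equal(uni_matrix[0], uni_matrix[1])

# Unigrams and bigrams: "not an" vs "not working" now tell them apart.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(targets)
bi_matrix = bigrams.transform(targets).toarray()
same_with_bigrams = np.array_equal(bi_matrix[0], bi_matrix[1])

print(same_unigrams, same_with_bigrams)
```

With bigrams, "not working" becomes a feature of its own, which gives a linear model something to attach a negative weight to.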


In [None]:
count_vectorizer = CountVectorizer(min_df=5, max_features=50000, ngram_range=(1,2)).fit(X_train) # ngram_range bounds are inclusive: (1,2) extracts unigrams and bigrams
X_train_vectorized = count_vectorizer.transform(X_train)
print('len of features using n-grams vectorizer={:,}'.format(len(count_vectorizer.get_feature_names_out()))) 


In [None]:
clf= LogisticRegression(max_iter= 2000).fit(X_train_vectorized, y_train)
predictions = clf.predict(count_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(count_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores))

In [None]:
feature_names = np.array(count_vectorizer.get_feature_names_out())
sorted_coef_index = clf.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

In [None]:
print (targets)
print(clf.predict(count_vectorizer.transform(targets)))

<font color = green >

## Home Task 

</font>


<font color = green >

### Load data 

</font>

[Sentiment Analysis Dataset](https://www.kaggle.com/sonaam1234/sentimentdata)

alternative source: 
<br>
[rt-polaritydata](https://github.com/dennybritz/cnn-text-classification-tf/tree/master/data/rt-polaritydata)

alternative source: 
<br>
[Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data)

Each line in these two files corresponds to a single snippet (usually containing roughly one single sentence); all snippets are down-cased.  
[More info about dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt)



In [None]:
fn='rt-polarity.neg'
with open(fn, "r",encoding='utf-8', errors='ignore') as f: # errors='ignore' skips bytes that are not valid UTF-8 (the file contains some)
    content = f.read()  
texts_neg=  content.splitlines()
print ('len of texts_neg = {:,}'.format (len(texts_neg)))
for review in texts_neg[:5]:
    print ( '\n', review)

In [None]:
fn='rt-polarity.pos'

with open(fn, "r",encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_pos=  content.splitlines()
print ('len of texts_pos = {:,}'.format (len(texts_pos)))
for review in texts_pos[:5]:
    print ('\n', review)

<font color = green >

## Learn more
</font>

sklearn.feature_extraction.text.CountVectorizer
<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Bag-of-words model
<br>
https://en.wikipedia.org/wiki/Bag-of-words_model

tf–idf
<br>
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

sklearn.feature_extraction.text.TfidfVectorizer
<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Applied Text Mining in Python
<br>
https://www.coursera.org/learn/python-text-mining/home/welcome

Natural Language Processing tutorial
<br>
https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/


<font color = green >

## Next lesson: topic modeling 
</font>

