<font color = green >

# Text classification: sentiment analysis 

</font>

<font color = green >

## Popular tasks of text classification

</font>

- **Spam detection**: given a message, decide whether it is spam or not 
- **Topic identification**: given an article, choose one of the known classes such as "Sport", "Technology", "Finances"
- **Sentiment analysis**: is the movie review positive or negative? 
- **Spelling correction**: which word is more suitable in context, "weather" or "whether"?  


<font color = green >

## Features from Text

</font>

1. The most common words
2. *Stop* words
3. Normalization: lower case / stemming / lemmatizing
4. Capitalization as feature 
5. Part of speech (POS), e.g. "the weather" (noun) vs. "whether" (conjunction)  
6. Grouping of similar tokens
    - buy, purchase
    - Mr, Ms, Dr
    - Numbers
    - Dates
7. Bigrams, n-grams e.g. "White House"
8. Sub-sequences e.g. "ing", "ion"
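
Several of the features listed above can be sketched in plain Python. This is only an illustration: the stop-word set is a tiny hypothetical one (real lists such as NLTK's are much longer), and the helper names `tokenize`, `ngrams` and `has_suffix` are made up for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "at", "of", "in"}

def tokenize(text):
    """Normalize to lower case and split into word tokens."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return all contiguous n-token tuples (n=2 gives bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def has_suffix(token, suffix):
    """Sub-sequence feature: does the token end with the given suffix?"""
    return token.endswith(suffix)

text = "The President is speaking at the White House"
tokens = tokenize(text)
content_tokens = remove_stop_words(tokens)
bigrams = ngrams(tokens, 2)

print(tokens)
print(content_tokens)
print(("white", "house") in bigrams)                  # bigram feature
print([t for t in tokens if has_suffix(t, "ing")])    # sub-sequence feature
```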


<font color = green >

## Naive Bayes Classifiers

</font>


<font color = green >

### Text classification of search query 

</font>

- **python**  as snake -> Zoology
- **python**  as programming language -> Computer Science
- **python**  as "monty python" -> Entertainment

Probabilistic model:

#### Bayes Rule

\begin{equation*}
P(y|X) = \frac{P(X| y) \cdot P(y)}{P(X)} 
\quad\quad\quad
Posterior = \frac{ Likelihood \cdot Prior}{Evidence} 
\quad\quad\quad
P(class| python) = \frac{P(python| class) \cdot P(class)}{P(python)} 
\end{equation*}

Since $P(python)$ is the same for all classes, we may compare just the numerators: 

\begin{equation*}
P(python| Zoology) \cdot P(Zoology) 
\quad\quad\quad 
P(python|CS) \cdot P(CS) 
\quad\quad\quad
P(python|Entertainment) \cdot P(Entertainment) 
\end{equation*}

In general: 
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad P(y|X) =  \underset{y}{argmax} \quad P(X|y) \cdot P(y)
\end{equation*}

Most probably, the predicted class is <font color = blue>Zoology</font>
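
To see why the evidence $P(python)$ can be dropped, here is a minimal sketch with invented numbers. The priors and likelihoods below are hypothetical, chosen only for illustration, and the evidence is computed as the sum of the numerators, which assumes the three classes are the only possible ones.

```python
# Hypothetical probabilities, invented for illustration only.
prior = {"Zoology": 0.3, "Computer Science": 0.5, "Entertainment": 0.2}          # P(class)
likelihood = {"Zoology": 0.05, "Computer Science": 0.02, "Entertainment": 0.01}  # P(python | class)

# Numerators of Bayes rule: P(python | class) * P(class)
numerators = {c: likelihood[c] * prior[c] for c in prior}

# Evidence P(python): the same constant for every class
evidence = sum(numerators.values())
posteriors = {c: numerators[c] / evidence for c in prior}

best_by_numerator = max(numerators, key=numerators.get)
best_by_posterior = max(posteriors, key=posteriors.get)

print(posteriors)
print(best_by_numerator, best_by_posterior)  # dividing by a constant does not change the argmax
```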

#### Naive Bayes Classifiers
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad P(y) \prod_{ i=1 }^{ n }{ P(x_{ i }|\,y) } 
\end{equation*}

If search query = **"python download"** 
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad
P(y)\cdot P(python|\,y) \cdot P(download|\,y)
\end{equation*}

Now the most probable predicted class is <font color = blue>Computer Science</font>, since  $P(download|\,Zoology)$ is far smaller than $P(download|\,CS)$
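
The switch from Zoology to Computer Science can be reproduced with a small sketch. All probabilities are hypothetical values picked for this example, and the function `predict` is an illustrative helper, not part of any library. Log-probabilities are summed instead of multiplying raw probabilities, a common way to avoid numerical underflow on long texts.

```python
import math

# Hypothetical priors P(y) and per-word likelihoods P(word | y).
prior = {"Zoology": 0.3, "Computer Science": 0.5, "Entertainment": 0.2}
word_given_class = {
    "Zoology":          {"python": 0.05, "download": 0.0001},
    "Computer Science": {"python": 0.02, "download": 0.01},
    "Entertainment":    {"python": 0.01, "download": 0.002},
}

def predict(query):
    """Return argmax over y of log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label in prior:
        log_likelihood = sum(math.log(word_given_class[label][w]) for w in query.split())
        scores[label] = math.log(prior[label]) + log_likelihood
    return max(scores, key=scores.get)

print(predict("python"))
print(predict("python download"))
```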

Note: if a word never occurs in the training texts of a class, its estimated probability for that class is 0 and, as a result,
the whole likelihood is 0 regardless of the other words. Thus it is worth using Laplace smoothing.  


<font color = green >

#### Laplace smoothing 
 
</font>


$
A : 1 \quad
B : 3\quad
C : 0\quad
D : 6\quad
$

$N= 10\quad K =4$ 
<br>N - number of samples, K - number of classes

\begin{equation*}
P(A) = 0.1\quad\quad\quad\quad
P(B) = 0.3\quad\quad\quad\quad
P(C) = 0.0\quad\quad\quad\quad
P(D) = 0.6\\
\end{equation*}

<font color = blue >

\begin{equation*}
P^{\,L}(x_{i}) =  \frac{count(x_{i})+1}{N+K}
\end{equation*}

</font>




\begin{equation*}
P^{\,L}(A) =  \frac{1+1}{10+4} \approx 0.14 \quad P^{\,L}(B) =  \frac{3+1}{10+4} \approx 0.29
\quad P^{\,L}(C) =  \frac{0+1}{10+4} \approx 0.07 \quad P^{\,L}(D) =  \frac{6+1}{10+4} = 0.5
\end{equation*}
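
The worked numbers above can be checked with a few lines of Python. This is a minimal sketch of add-one smoothing on the same counts; the variable names are chosen for this example only.

```python
# Counts from the example: A occurs once, B three times, C never, D six times.
counts = {"A": 1, "B": 3, "C": 0, "D": 6}

N = sum(counts.values())  # number of samples
K = len(counts)           # number of classes

# Unsmoothed estimates: C gets probability 0
raw = {k: c / N for k, c in counts.items()}

# Laplace (add-one) smoothing: add 1 to every count, K to the total
smoothed = {k: (c + 1) / (N + K) for k, c in counts.items()}

for k in counts:
    print(k, raw[k], round(smoothed[k], 2))
```

After smoothing no class has zero probability, and the smoothed values still sum to 1.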






<font color = green >

## Sentiment Analysis

</font>


<font color = green >

### Using NLTK

</font>


In [None]:
import pandas as pd
import numpy as np 
from nltk.corpus import movie_reviews 
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import RegexpTokenizer
import random

<font color = green >

#### Load data

</font>


In [None]:
all_movie_reviews_text= movie_reviews.raw()
all_movie_reviews_text[:600]

<font color = green >

#### Tools to review data 

</font>



In [None]:
cats =  movie_reviews.categories()
cats

In [None]:
cat = cats[0]
ids= movie_reviews.fileids(cat)
ids[:10]

In [None]:
id_review = ids[0]
print(movie_reviews.raw(id_review))

<font color = green >

#### Tokenize

</font>


In [None]:
def preprocess(text): # lowercases the text and removes punctuation
    tokenizer = RegexpTokenizer(r'\w+') # keeps runs of word characters only; just for demo 
    return tokenizer.tokenize(text.lower())

all_words = preprocess(all_movie_reviews_text)
print (len(all_words))
print(all_words[:100])

<font color = green >

#### Build vocabulary

</font>


In [None]:
all_words=nltk.FreqDist(all_words)
print ('len of vocabulary: {:,}'.format (len(all_words)))
# Use the most common words
most_common_words = list(zip(*all_words.most_common()))[0] # [0] holds the words whereas [1] holds their frequencies 
# most_common(5000) would return a limited number, but in this sample the features are filtered later, after removing stop words 
print (most_common_words[:100])

<font color = green >

#### Get rid of stop words 

</font>


In [None]:
def remove_stop_words(words):
    stop_words = set(stopwords.words('english'))  
    return [w for w in words if w not in stop_words]

most_common_words_filtered = remove_stop_words(most_common_words)
word_features = most_common_words_filtered [:3000]
print (word_features[:100])


<font color = green >

#### Extract documents and labels

</font>


In [None]:
# Note: this does not tokenize the raw text; it uses the words of each document retrieved by file_id instead.
documents = [(list(movie_reviews.words(file_id)), category) # using the words() method of movie_reviews object
             for category in movie_reviews.categories() # select category - there are two: ['neg', 'pos']
             for file_id in movie_reviews.fileids(category)]# select all file_ids for specified category
len (documents)
# This returns list of tuples (list_of_tokens_of document, label)

In [None]:
print (documents [0]) # (['plot', ':', 'two', 'teen', ... 'echoes', '(', '8', '/', '10', ')'], 'neg')

<font color = green >

#### Shuffle documents 

</font>


In [None]:
# shuffle first 
random.shuffle(documents) # shuffles in place
documents= documents[:500] # reduce the data set to speed up the demo 
len (documents)

<font color = green >

#### Vectorize documents 

</font>


In [None]:
def find_features(review_tokens):
    tokens = set(review_tokens) # build the set once instead of once per feature word
    return {w: w in tokens for w in word_features} # feature representation of the document

data_set= [(find_features(review_tokens), category) for (review_tokens, category) in documents]


In [None]:
data_set[0]

<font color = green >

#### Split into training and test sets

</font>


In [None]:
split_on = int(len(data_set)*.8)
X_y_train= data_set[:split_on]
X_y_test = data_set[split_on:]
print (len(X_y_train))

<font color = green >

#### Train model

</font>


In [None]:
clf= nltk.NaiveBayesClassifier.train(X_y_train) # Note: the API differs from sklearn: train() takes a list of (features, label) pairs

<font color = green >

#### Evaluate model

</font>


In [None]:
nltk.classify.accuracy(clf, X_y_test)*100

<font color = green >

#### Review most informative features

</font>


In [None]:
clf.show_most_informative_features(15)

<font color = green >

### Incorporate with sklearn

</font>


In [None]:
from nltk.classify.scikitlearn import SklearnClassifier # wrapper that exposes sklearn estimators through the nltk classifier interface
from sklearn.naive_bayes import MultinomialNB

# Wrap sklearn estimators as nltk classifiers 
MNNB_classifier= SklearnClassifier(MultinomialNB()) # Note: pass an estimator instance, hence the ()

from sklearn.linear_model import LogisticRegression
lr_classifier = SklearnClassifier(LogisticRegression()) 

from sklearn.svm import SVC, LinearSVC, NuSVC # NuSVC - Similar to SVC but uses a parameter to control the number of support vectors.
svc_clf = SklearnClassifier(SVC())  
lin_svc_clf= SklearnClassifier(LinearSVC())  
nu_svc_clf = SklearnClassifier(NuSVC())  


In [None]:
# native nltk classifier
clf= nltk.NaiveBayesClassifier.train(X_y_train) 

print('Accuracy nltk.NaiveBayesClassifier={}%'.format(nltk.classify.accuracy(clf,X_y_test) * 100))
# clf.show_most_informative_features(15)

MNNB_classifier.train(X_y_train)
print('Accuracy MNNB_classifier ={}%'.format(nltk.classify.accuracy(MNNB_classifier, X_y_test) * 100)) # 79.0%

lr_classifier.train(X_y_train)
print('Accuracy lr_classifier ={}%'.format(nltk.classify.accuracy(lr_classifier, X_y_test) * 100)) # 82.0%

svc_clf.train(X_y_train)
print('Accuracy svc_clf={}%'.format(nltk.classify.accuracy(svc_clf, X_y_test) * 100)) # 52.0% - default is rbf kernel

lin_svc_clf.train(X_y_train)
print('Accuracy lin_svc_clf={}%'.format(nltk.classify.accuracy(lin_svc_clf, X_y_test) * 100)) # 82.0%

nu_svc_clf.train(X_y_train)
print('Accuracy nu_svc_clf={}%'.format(nltk.classify.accuracy(nu_svc_clf, X_y_test) * 100)) #


<font color = green >

### Combining algos with a vote

</font>


In [None]:
from nltk.classify import ClassifierI
class Vote_Classifier(ClassifierI): # inherit the nltk classifier interface
    def __init__(self, *classifiers): # expects the classifiers as positional arguments
        self.classifiers = classifiers

    def classify(self, sample): 
        return mode([clf.classify(sample) for clf in self.classifiers]) 

    def calc_confidence(self, sample):
        votes= [clf.classify(sample) for clf in self.classifiers]
        return votes.count(mode(votes))/len(votes) # fraction of votes that agree with the winning label

def mode(array): # returns an arbitrary one of the modes in case of a tie
    return max(set(array), key=array.count)


In [None]:
vote_clf= Vote_Classifier(clf, lr_classifier, svc_clf, lin_svc_clf, nu_svc_clf)
print('Accuracy vote_clf={:.2%}'.format(nltk.classify.accuracy(vote_clf, X_y_test)))
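
The voting logic can be tried in isolation, without training anything. The sketch below is an assumption-laden illustration: the three "classifiers" are hypothetical hand-written rules standing in for trained models, and `vote` is a helper made up for this example. It uses `collections.Counter`, which resolves ties in favour of the label encountered first.

```python
from collections import Counter

def vote(labels):
    """Return (winning label, fraction of votes it received)."""
    winner, n_votes = Counter(labels).most_common(1)[0]
    return winner, n_votes / len(labels)

# Hypothetical stand-ins for trained classifiers: each maps a feature dict to a label.
stub_classifiers = [
    lambda features: "pos" if features.get("great") else "neg",
    lambda features: "neg" if features.get("boring") else "pos",
    lambda features: "neg",  # a pessimist that always votes "neg"
]

sample = {"great": True, "boring": False}
label, confidence = vote([classify(sample) for classify in stub_classifiers])

print('Classification: {}\nConfidence: {:.2%}'.format(label, confidence))
```

Using an odd number of voters, as here, avoids ties in the two-class case.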

<font color = green >

#### Classify new sample

</font>


In [None]:
# https://timesofindia.indiatimes.com/entertainment/english/movie-reviews/cold-pursuit/movie-review/67892834.cms
new_review = '''This unusual satire on gangsters and revenge stories, starts off with a quote from Oscar Wilde and some delightful background music that sets the tone for rest of the film. The first few minutes play out like a predictable thriller, featuring a wronged father and his pursuit for vigilante justice. But, what follows is a series of stylised killing sequences, that almost seem like parodies of action set pieces that you’ve seen Liam Neeson pulling off with deadpan ease in the past. Yet, director Hans Petter Noland, who also made the Norwegian film In Order Of Disappearance that inspired Cold Pursuit, and writer Frank Baldwin create a refreshing narrative full of memorable moments. The movie seems bizarrely funny and the snow-heavy setting creates the right atmosphere for the dry and cold-cut humour.
The story begins with tragedy and the first few minutes seem dead serious, right up to the point where Coxman confronts his first victim, the gangster named Speedo. But, as the revenge-seeking father moves up the ranks of the mafia chain, the characters become quirky and the situations get thoroughly entertaining. The introduction of characters like Viking (Tom Bateman), the main antagonist and his team of crazy henchmen like Mustang, Dexter and more, alleviates the narrative. There’s also a track of warring mafia gangs as Viking wages a war against the native Indians led by White Bull (Tom Jackson). Cold Pursuit may not be too creative with the kill sequences, but it does get interesting with the wry sense of humour.
Neeson does what he does best. He keeps a straight face and plays the game of intimidation with ease. He’s just a regular guy who’s way out of his league, killing gangsters. But, his outrageous mission is what makes the story interesting. Watch out for a superb cameo by William Forsythe, too, who plays a brief but key role in Coxman’s revenge saga.
The way Cold Pursuit manages to blend sardonic humour with cold-blooded killings makes it reminiscent of movies like The Coen Brothers’ Fargo and Guy Ritchie’s Snatch. This one’s a refreshingly cool black-comedy that does wonders for the genre.
'''

In [None]:
x_new  = find_features(preprocess(new_review))

In [None]:
print('\nClassification: {}\nConfidence: {:.2%}'.format(
    vote_clf.classify(x_new),vote_clf.calc_confidence(x_new)))

<font color = green >

### Using sklearn

</font>


<font color = green >

#### Load data 

data set ['amazon-reviews-unlocked-mobile-phones'](https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones)
</font>


In [None]:
import os
cwd= os.getcwd() # current working directory
path = os.path.join(cwd,'data')
fn=  os.path.join(path , 'Amazon_Unlocked_Mobile.csv') # https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
df = pd.read_csv(fn) # 
print('len=  {:,}\ncolumns= {}'.format(len(df), list(df)))

# df = df.sample(frac=0.1, random_state=10) # reduce the number of reviews to speed up training, considering this is a demo
df.head()

<font color = green >

#### Get rid of records with missing data 

</font>


In [None]:
df.dropna(inplace=True) 
print('len=  {:,}'.format(len(df)))

<font color = green >

#### Label positive and negative 

</font>


In [None]:
df = df[df['Rating'] != 3].copy() # Remove 'neutral' ratings equal to 3 as uninformative; copy() avoids SettingWithCopyWarning on the next line
df['Rating_binary'] = np.where(df['Rating'] > 3, 1, 0) # returns 1 for 4,5 and 0 for 1,2
df.head()

In [None]:
df['Rating_binary'].mean()

<font color = green >

#### Split into train and test sets

</font>


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],df['Rating_binary'],random_state=0)

<font color = green >

#### Review training sample

</font>


In [None]:
X_train.iloc[0]

<font color = green >

#### Extract Features 

</font>
The bag-of-words approach is a simple way to represent text for use in machine learning: it ignores structure and only counts how often each word occurs.

<font color = green >

#### Count vectorizer

</font>
By default, `CountVectorizer` selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
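
The default tokenization and counting can be imitated without sklearn. The sketch below applies the regular expression that the scikit-learn documentation gives as the default `token_pattern` to lower-cased text and counts the tokens; `bag_of_words` is a helper name made up for this illustration, and the review text is invented.

```python
import re
from collections import Counter

# Default token_pattern of CountVectorizer according to the scikit-learn docs:
# words of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def bag_of_words(text):
    """Lower-case the text, tokenize it and count each token."""
    return Counter(re.findall(TOKEN_PATTERN, text.lower()))

bow = bag_of_words("I love it, a great phone. Great battery!")
print(bow)
```

Single-character tokens such as "I" and "a" are dropped, and word order is lost: only the counts remain.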

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

vect = CountVectorizer().fit(X_train) # Fit the CountVectorizer to the training data
print('features samples:\n{}'.format(vect.get_feature_names_out()[::2000])) # get_feature_names() was removed in scikit-learn 1.2
print ('\nlen of features {:,}'.format(len(vect.get_feature_names_out()))) 


<font color = green >

#### Transform the X_train to feature representation

</font>


In [None]:
X_train_vectorized = vect.transform(X_train) # indices of the vocabulary words present in each text and their counts
X_train_vectorized

<font color = green >

#### Review vectorized training sample

</font>


In [None]:
# review first sample 
sample_df = pd.DataFrame(X_train_vectorized[0].toarray(), index= ['value']).T # separate name so the reviews DataFrame df is not overwritten
print (list(sample_df[sample_df['value']>0].index))
[vect.get_feature_names_out()[index] for index in sample_df[sample_df['value']>0].index.values]

<font color = green >

#### Train model

</font>


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

In [None]:
clf = LogisticRegression().fit(X_train_vectorized, y_train) # Train the model

<font color = green >

#### Evaluate model

</font>


In [None]:
predictions = clf.predict(vect.transform(X_test)) # Predict the transformed test documents
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(vect.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores)) 

<font color = green >

#### Review relevant features 

</font>


In [None]:
feature_names = np.array(vect.get_feature_names_out())
sorted_coef_index = clf.coef_[0].argsort() # ascending order; [0] just squeezes the shape (1,n)
print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
# clf.coef_[0][sorted_coef_index[0]] is the smallest coefficient 

<font color = green >

## Term frequency–inverse document frequency (TFIDF)


</font>

TFIDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Its value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
<div style="float:left;">
<br>
    
**Term frequency** $(tf(t,d))$ is a measure of how frequent term $t$ is in document $d$ 
$$ tf(t,d) = \frac{k}{n},$$ <br>$d$ - document,  $k$ - number of times term $t$ occurs in document $d$, $n$ - total number of words in document $d$.
<br>
Note: Various approaches can be used for term frequency, e.g. *augmented frequency*, which prevents a bias towards longer documents (raw frequency divided by the raw frequency of the most frequent term in the document):

$$ tf^{\,A}(t,d) = 0.5+ 0.5\cdot \frac{tf(t,d)}{\underset{t' \in d}{max}(tf(t',d))} $$

**Inverse document frequency** $(idf(t,D))$ is a measure of how much information the word provides.
$$ idf(t,D) = \log \frac{N}{K},$$ <br>$D$ - all documents, $K$ - number of documents in $D$ that contain the word, $N$ - total number of documents in $D$. The worked example in this section uses the base-10 logarithm. <br>
</div>

Note: Various approaches can be used for inverse document frequency, e.g. smoothed variants that add 1 to the document counts.

<div style="float:left;">
<table width="500">
    <tr>
        <th style="text-align:center" bgcolor= white>Document1</th>
        <th style="text-align:center"  bgcolor= white >Document2</th></tr>
    <tr>
        <td>
            <table>
                <tr>
                    <th  bgcolor=gainsboro>Term</th>
                    <th  bgcolor=gainsboro>Term Count</th></tr>
                <tr><td>this</td><td>1</td></tr>
                <tr><td>is</td><td>1</td></tr>
                <tr><td>a</td><td>2</td></tr>
                <tr><td>sample</td><td>1</td></tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <th bgcolor=gainsboro>Term</th>
                    <th  bgcolor=gainsboro>Term Count</th></tr>
                <tr><td>this</td><td>1</td></tr>
                <tr><td>is</td><td>1</td></tr>
                <tr><td>another</td><td>2</td></tr>
                <tr><td>example</td><td>3</td></tr>
            </table>
        </td>
    </tr>
</table>
</div>

<div style="float:left;">
<br>

For <strong>"this"</strong>:
$$ tf ("this", d_{1}) =  \frac{1}{5} = 0.2, \quad  tf ("this", d_{2}) =  \frac{1}{7} \approx 0.14, \quad idf ("this", D) =  \log_{10} \frac{2}{2} =0; $$

$$ tfidf("this", d_{1}, D)  = 0.2 \cdot 0 = 0, \quad    tfidf("this", d_{2}, D)  = 0.14 \cdot 0 = 0 $$

For <strong>"example"</strong>:
$$ tf ("example", d_{1}) =  \frac{0}{5} = 0 , \quad  tf ("example", d_{2}) =  \frac{3}{7} \approx 0.43 , \quad idf ("example", D) =  \log_{10} \frac{2}{1} \approx 0.3; $$

$$ tfidf("example", d_{1}, D)  = 0 \cdot 0.3 = 0, \quad    tfidf("example", d_{2}, D)  = 0.43 \cdot 0.3 = 0.129 $$

</div>
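
The worked example can be reproduced with a short sketch that follows the textbook definitions given in this section, with a base-10 logarithm. The function names are chosen for this illustration, and `idf` assumes the term occurs in at least one document (otherwise it would divide by zero).

```python
import math

# The two documents from the tables, as token lists.
d1 = ["this", "is", "a", "a", "sample"]
d2 = ["this", "is", "another", "another", "example", "example", "example"]
D = [d1, d2]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency with a base-10 logarithm."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("this", "example"):
    print(term, round(tfidf(term, d1, D), 3), round(tfidf(term, d2, D), 3))
```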


<font color = green >

### Sklearn tfidf

</font>


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

<font color = green >

#### Compute sklearn tfidf for sample with 2 documents 

</font>


In [None]:
X = np.array(['this is a sample a', 'this is another example another example example'])
tfidf_vectorizer= TfidfVectorizer().fit(X)
X_vectorized= tfidf_vectorizer.transform(X)
print (tfidf_vectorizer.vocabulary_)
X_vectorized.toarray()
# conclusion: sklearn uses a different variant of the tfidf computation (smoothed idf and L2-normalized rows by default)
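
To make the difference concrete, here is a sketch in plain Python of the variant that the scikit-learn documentation describes for the default settings (`smooth_idf=True`, `norm='l2'`): raw counts as term frequency, $idf(t) = \ln\frac{1+N}{1+K} + 1$, and each row scaled to unit Euclidean length. It is an illustration written for this notebook, not sklearn's own code, and the names `doc_freq` and `tfidf_row` are made up.

```python
import math
import re
from collections import Counter

docs = ["this is a sample a", "this is another example another example example"]

# Same default tokenization as the vectorizer: lower case, tokens of 2+ characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Number of documents containing each term
doc_freq = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}

# Smoothed idf: never zero, so terms present in every document keep some weight
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Raw count times idf, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
print(vocabulary)
for row in rows:
    print([round(v, 3) for v in row])
```

Unlike the textbook formula, "this" receives a non-zero weight here even though it occurs in both documents.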

<font color = green >

#### Use sklearn tfidf for Amazon_Unlocked_Mobile documents 

</font>


In [None]:
tfidf_vectorizer= TfidfVectorizer(min_df=5)#.fit(X_train) 
    # min_df - minimum number of documents a term must appear in to be included, default is 1 
    # you may also set max_features (int or None) to keep only the top terms ordered by term frequency across the corpus 
X_train_vectorized = tfidf_vectorizer.fit_transform(X_train)
print ('len of features= {:,}'.format(len(tfidf_vectorizer.get_feature_names_out()))) 
    # Note: min_df=5 resulted in 17,951 features compared to 53,216 obtained by the count vectorizer
    # Note: min_df is also available in the count vectorizer




In [None]:
# X_train_vectorized.shape # (231207, 17951) = (n_documents, n_features)
sorted_tfidf_index = X_train_vectorized.max(axis=0).toarray()[0].argsort() 
    # max(axis=0) takes the max over all docs, i.e. the max tfidf of each word across all docs
    # [0] - just squeezing     
print (np.sort(X_train_vectorized.max(axis=0).toarray()[0]))
sorted_tfidf_index # feature indices sorted by ascending max tfidf 


In [None]:
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
print ('feature_names ',feature_names)
print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))


<font color = green >

#### Train model on features  extracted by tfidf vectorizer

</font>


In [None]:
clf = LogisticRegression().fit(X_train_vectorized, y_train) # Train the model
predictions = clf.predict(tfidf_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(tfidf_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores)) 

In [None]:
# Conclusion: Performance is not worse, but the number of features used is 3 times smaller

In [None]:
sorted_coef_index = clf.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

<font color = green >

### n-grams

</font>


In [None]:
# the problem is that the following reviews are treated the same by the current model
targets= [
    "not an issue, phone is working", 
    "an issue, phone is not working"
]
print(clf.predict(tfidf_vectorizer.transform(targets)))
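
Why a unigram model cannot tell these two reviews apart, and why bigrams help, can be shown without any model. In this minimal sketch the helper names `unigrams` and `bigrams` are made up, and a simple `\w+` tokenizer is assumed.

```python
import re
from collections import Counter

reviews = [
    "not an issue, phone is working",
    "an issue, phone is not working",
]

def unigrams(text):
    """Lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())

def bigrams(text):
    """Pairs of adjacent tokens."""
    tokens = unigrams(text)
    return list(zip(tokens, tokens[1:]))

# Bag of unigrams: identical, since both reviews contain exactly the same words
same_unigrams = Counter(unigrams(reviews[0])) == Counter(unigrams(reviews[1]))

# Bag of bigrams: different, because word order is partly preserved
same_bigrams = Counter(bigrams(reviews[0])) == Counter(bigrams(reviews[1]))

print(same_unigrams, same_bigrams)
print(set(bigrams(reviews[1])) - set(bigrams(reviews[0])))  # bigrams unique to the second review
```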


In [None]:
count_vectorizer = CountVectorizer(min_df=5, max_features=50000, ngram_range=(1,2)).fit(X_train) # Note: both limits of ngram_range are inclusive, so this uses unigrams and bigrams
X_train_vectorized = count_vectorizer.transform(X_train)
print('len of features using n-grams vectorizer={:,}'.format(len(count_vectorizer.get_feature_names_out()))) 


In [None]:
clf= LogisticRegression().fit(X_train_vectorized, y_train)
predictions = clf.predict(count_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(count_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores))

In [None]:
feature_names = np.array(count_vectorizer.get_feature_names_out())
sorted_coef_index = clf.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

In [None]:
print (targets)
print(clf.predict(count_vectorizer.transform(targets)))

<font color = green >

## Home Task 

</font>


<font color = green >

### Load data 

</font>

[Sentiment Analysis Dataset](https://www.kaggle.com/sonaam1234/sentimentdata)

alternative source: 
<br>
[rt-polaritydata](https://github.com/dennybritz/cnn-text-classification-tf/tree/master/data/rt-polaritydata)

alternative source: 
<br>
[Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data)

Each line in these two files corresponds to a single snippet (usually containing roughly one single sentence); all snippets are down-cased.  
[More info about dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt)



<font color = green >

## Learn more
</font>

sklearn.feature_extraction.text.CountVectorizer
<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Bag-of-words model
<br>
https://en.wikipedia.org/wiki/Bag-of-words_model

tf–idf
<br>
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

sklearn.feature_extraction.text.TfidfVectorizer
<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Applied Text Mining in Python
<br>
https://www.coursera.org/learn/python-text-mining/home/welcome

Natural Language Processing tutorial
<br>
https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/


<font color = green >

## Next lesson: topic modeling 
</font>



In [84]:
import pandas as pd
import numpy as np 
from nltk.corpus import movie_reviews 
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import RegexpTokenizer
import random
import os
cwd= os.getcwd()
path = os.path.join(cwd,'data')

In [85]:
fn=  os.path.join(path , 'sentimentdata/rt-polarity.neg')
with open(fn, "r",encoding='utf-8', errors='ignore') as f: # some invalid symbols encountered 
    content = f.read()  
texts_neg=  content.splitlines()

fn=  os.path.join(path , 'sentimentdata/rt-polarity.pos')
with open(fn, "r",encoding='utf-8', errors='ignore') as f: # some invalid symbols encountered 
    content = f.read()  
texts_pos=  content.splitlines()
# print('len of texts_neg = {:,}'.format (len(texts_neg)))
# for review in texts_neg[:5]:
#     print ( '\n', review)
           
print('\n\n\nlen of texts_pos = {:,}'.format (len(texts_pos)))
for review in texts_pos[:5]:
    print ( '\n', review)





len of texts_pos = 5,331

 the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 

 the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 

 effective but too-tepid biopic

 if you sometimes like to go to the movies to have fun , wasabi is a good place to start . 

 emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . 


In [86]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer 

dff = pd.DataFrame(texts_pos)
dff["y"]=1
df = pd.DataFrame(texts_neg)
df["y"]=0
df = pd.concat([df, dff])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[0],df['y'],random_state=0)

# vect = CountVectorizer().fit(X_train)

In [87]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

In [88]:
from nltk.classify.scikitlearn import SklearnClassifier # this is wrapper to incorporate with sklearn using nltk style.
from sklearn.naive_bayes import MultinomialNB

# Convert to nltk classifiers 
MNNB_classifier= SklearnClassifier(MultinomialNB()) # Note : use ()

from sklearn.linear_model import LogisticRegression
lr_classifier = SklearnClassifier(LogisticRegression()) 

from sklearn.svm import SVC, LinearSVC, NuSVC # NuSVC - Similar to SVC but uses a parameter to control the number of support vectors.
svc_clf = SklearnClassifier(SVC())  
lin_svc_clf= SklearnClassifier(LinearSVC())  
nu_svc_clf = SklearnClassifier(NuSVC())  

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer
count_vectorizer = CountVectorizer(min_df=5, max_features=50000, ngram_range=(1,3)).fit(X_train) # Note: both limits are included
X_train_vectorized = count_vectorizer.transform(X_train)


In [90]:
clf = LogisticRegression().fit(X_train_vectorized, y_train) 
predictions = clf.predict(count_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(count_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores))

clf = SVC().fit(X_train_vectorized, y_train) 
predictions = clf.predict(count_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(count_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores))


clf = NuSVC().fit(X_train_vectorized, y_train) 
predictions = clf.predict(count_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(count_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores))

clf = LinearSVC().fit(X_train_vectorized, y_train) 
predictions = clf.predict(count_vectorizer.transform(X_test)) 
print('f1: ', f1_score(y_test, predictions)) 
scores = clf.decision_function(count_vectorizer.transform(X_test)) 
print('AUC: ', roc_auc_score(y_test, scores))




f1:  0.7696814353716587
AUC:  0.8323604481784516




f1:  0.010211524434719182
AUC:  0.6220815556853111




f1:  0.760409057706355
AUC:  0.8285141405718166
f1:  0.7329919531821506
AUC:  0.7963450224072334


