<font color = green >

## Home Task 

</font>


<font color = green >

### Load data 

</font>

[Sentiment Analysis Dataset](https://www.kaggle.com/sonaam1234/sentimentdata)

alternative source: 
<br>
[rt-polaritydata](https://github.com/dennybritz/cnn-text-classification-tf/tree/master/data/rt-polaritydata)

alternative source: 
<br>
[Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data)

Each line in the two files, `rt-polarity.pos` and `rt-polarity.neg`, corresponds to a single snippet (usually roughly one sentence); all snippets are lowercased.  
[More info about dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt)



In [1]:
def extract_text_from_file(filename):
    with open(filename, "r", encoding="utf-8", errors="ignore") as f:
        content = f.read()
    return content


full_neg, full_pos = extract_text_from_file(
    "rt-polarity.neg"), extract_text_from_file("rt-polarity.pos")

In [2]:
from nltk.tokenize import RegexpTokenizer


def preprocess(text):
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(text.lower())


neg_text = preprocess(full_neg)
pos_text = preprocess(full_pos)

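# Keep each replacement field on one line: a line break inside {...} of a
# single-quoted f-string is a SyntaxError before Python 3.12.
print(f"Negative words list len - {len(neg_text)}\nFirst ten elements - {neg_text[:10]}")
print(f"Positive words list len - {len(pos_text)}\nFirst ten elements - {pos_text[:10]}")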

Negative words list len - 103030
First ten elements - ['simplistic', 'silly', 'and', 'tedious', 'it', 's', 'so', 'laddish', 'and', 'juvenile']
Positive words list len - 103204
First ten elements - ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', 's']
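
The output above shows a side effect of the `\w+` pattern: the apostrophe is not a word character, so `it's` is split into `it` and `s`. The sketch below reproduces this with the standard `re` module, which should give the same tokens as `RegexpTokenizer` for a pattern without capture groups. The second pattern is a hypothetical alternative for illustration, not something the notebook uses.

```python
import re


def tokenize(text, pattern=r"\w+"):
    """Lowercase the text and return every non-overlapping match of `pattern`."""
    return re.findall(pattern, text.lower())


sample = "It's so laddish and juvenile"

# Same pattern as preprocess(): "it's" is split into "it" and "s".
plain_tokens = tokenize(sample)

# Hypothetical alternative pattern (an assumption, not from the notebook):
# keep an apostrophe that is followed by more word characters inside the token.
contraction_tokens = tokenize(sample, r"\w+(?:'\w+)?")

print(plain_tokens)
print(contraction_tokens)
```

Whether to keep contractions whole depends on the downstream model; the stray `s` token is why `s` shows up among the most frequent words later on.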


In [3]:
all_words = neg_text + pos_text
# One-line replacement fields keep this valid on Python versions before 3.12.
print(f"All words list len - {len(all_words)}\nFirst ten elements - {all_words[:10]}")

All words list len - 206234
First ten elements - ['simplistic', 'silly', 'and', 'tedious', 'it', 's', 'so', 'laddish', 'and', 'juvenile']


In [4]:
import nltk
all_words = nltk.FreqDist(all_words)
print(f"Vocab len: {len(all_words)}")


most_common_words = list(zip(*all_words.most_common()))[0]
print(f"Ten most common words - {most_common_words[:10]}")

Vocab len: 18359
Ten most common words - ('the', 'a', 'and', 'of', 'to', 's', 'it', 'is', 'in', 'that')
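
`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting step can be sketched with the standard library alone. The token list and the tiny stop-word set below are made up for illustration; they are assumptions, not part of the dataset or of NLTK's stop-word list.

```python
from collections import Counter

# Toy token list standing in for the real corpus tokens.
tokens = ["the", "rock", "is", "the", "best", "of", "the", "best"]

counts = Counter(tokens)
top_two = counts.most_common(2)  # highest counts first

# Hypothetical stop-word set (illustrative only): dropping such words
# surfaces content words instead of articles and prepositions.
STOP_WORDS = {"the", "is", "of"}
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)
top_content = content_counts.most_common(2)

print(top_two)
print(top_content)
```

The real list of most common words is dominated by function words such as `the` and `of`, which is the usual motivation for stop-word filtering or tf–idf weighting.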


In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

In [6]:
pos_texts_with_cat = [[sentence, 1] for sentence in full_pos.splitlines()]
neg_texts_with_cat = [[sentence, 0] for sentence in full_neg.splitlines()]

all_texts = pos_texts_with_cat + neg_texts_with_cat

df = pd.DataFrame(all_texts, columns=["text", "binary_attitude"])
df.head()

|   | text | binary_attitude |
|---|------|-----------------|
| 0 | the rock is destined to be the 21st century's ... | 1 |
| 1 | the gorgeously elaborate continuation of " the... | 1 |
| 2 | effective but too-tepid biopic | 1 |
| 3 | if you sometimes like to go to the movies to h... | 1 |
| 4 | emerges as something rare , an issue movie tha... | 1 |


In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['binary_attitude'], random_state=0)
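
With no `test_size` given, `train_test_split` holds out 25% of the rows after shuffling them. The function below is a minimal standard-library sketch of that idea; `simple_train_test_split` is a hypothetical helper and will not reproduce scikit-learn's exact shuffle order.

```python
import math
import random


def simple_train_test_split(items, labels, test_size=0.25, seed=0):
    """Shuffle indices once, then cut them into a test part and a train part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


texts = [f"snippet {i}" for i in range(8)]
labels = [i % 2 for i in range(8)]             # label is the parity of the snippet number
X_tr, X_te, y_tr, y_te = simple_train_test_split(texts, labels)
print(len(X_tr), len(X_te))
```

Texts and labels are selected with the same indices, so each sentence keeps its label. In scikit-learn, passing `stratify=df['binary_attitude']` would additionally keep the class ratio equal in both parts.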

In [8]:
vect = CountVectorizer().fit(X_train)
print('features samples:\n{}'.format(vect.get_feature_names_out()[::2000]))
print('\nlen of features {:,}'.format(len(vect.get_feature_names_out())))

features samples:
['00' 'bv' 'discordant' 'genres' 'labour' 'overstylized' 'rotting'
 'tackles' 'zest']

len of features 16,021
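
To make the bag-of-words step concrete, here is a standard-library sketch of what fitting and transforming amounts to. It mirrors `CountVectorizer`'s defaults (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary), but the helper names `fit_vocabulary` and `transform` are hypothetical and this is not scikit-learn's implementation.

```python
import re
from collections import Counter

# Mirrors CountVectorizer's default token pattern: one-character tokens are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())


def fit_vocabulary(docs):
    """Map every distinct token to a column index, in alphabetical order."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Return one count vector per document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


train_docs = ["a dull , dull movie", "an enjoyable movie"]
vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(["a glorious movie"], vocabulary)

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

The vocabulary is learned from the training documents only, so a word that appears only in test data (`glorious` here) contributes nothing. Dropping one-character tokens and fitting on the training split alone are two reasons the vectorizer reports fewer features than the `FreqDist` vocabulary built earlier from the whole corpus.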


In [9]:
X_train_vectorized = vect.transform(X_train)
print(X_train_vectorized[0])

  (0, 178)	1
  (0, 930)	2
  (0, 1331)	1
  (0, 3396)	1
  (0, 5770)	1
  (0, 6529)	1
  (0, 6767)	1
  (0, 7407)	1
  (0, 7610)	1
  (0, 7622)	1
  (0, 8075)	1
  (0, 9738)	1
  (0, 12177)	1
  (0, 12681)	1
  (0, 15243)	1
  (0, 15595)	1
  (0, 15652)	1
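
Each printed line reads `(row, column)  count`: only the non-zero cells of the sparse matrix are stored. In the output above, column 930 has a count of 2, meaning that word occurs twice in the first training sentence. The toy sketch below shows the same idea with plain lists; the four-word vocabulary is made up for illustration.

```python
# Toy vocabulary and dense count row (illustrative values, not the real matrix).
vocabulary = ["about", "as", "satire", "week"]
dense_row = [1, 2, 0, 1]

# Sparse view: keep only (column, count) pairs whose count is non-zero.
nonzero = [(col, count) for col, count in enumerate(dense_row) if count]

# Map column indices back to the words they stand for.
term_counts = {vocabulary[col]: count for col, count in nonzero}

print(nonzero)
print(term_counts)
```

With about 16,000 columns and a dozen or so words per sentence, storing only the non-zero cells saves most of the memory a dense matrix would need.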


In [10]:
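# Use a separate name so the dataset DataFrame `df` is not overwritten.
first_row = pd.DataFrame(X_train_vectorized[0].toarray(), index=["value"]).T
first_row[first_row["value"] > 0]

|      | value |
|------|-------|
| 178  | 1 |
| 930  | 2 |
| 1331 | 1 |
| 3396 | 1 |
| 5770 | 1 |
| 6529 | 1 |
| 6767 | 1 |
| 7407 | 1 |
| 7610 | 1 |
| 7622 | 1 |


In [11]:
print(list(first_row[first_row["value"] > 0].index))
[vect.get_feature_names_out()[index]
 for index in first_row[first_row["value"] > 0].index.values]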

[178, 930, 1331, 3396, 5770, 6529, 6767, 7407, 7610, 7622, 8075, 9738, 12177, 12681, 15243, 15595, 15652]


['about',
 'as',
 'been',
 'cutting',
 'fresh',
 'have',
 'hollywood',
 'instead',
 'is',
 'issue',
 'last',
 'of',
 'satire',
 'should',
 'variety',
 'week',
 'what']

In [12]:
clf = LogisticRegression(max_iter=2000, C=2, solver="saga").fit(
    X_train_vectorized, y_train)
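
Once fitted, a binary logistic regression scores a sentence as the intercept plus the sum of each word's coefficient times its count, and turns that score into a probability with the sigmoid function. The sketch below shows this with hypothetical weights; the numbers are invented for illustration and are not the coefficients learned by `clf`.

```python
import math

# Hypothetical coefficients and intercept (assumptions, not the fitted model).
weights = {"dull": -1.5, "enjoyable": 1.2, "movie": 0.1}
intercept = -0.2


def decision_value(word_counts):
    """Linear score: intercept + sum(coefficient * count); unknown words add 0."""
    return intercept + sum(weights.get(word, 0.0) * count
                           for word, count in word_counts.items())


def probability_positive(word_counts):
    """Sigmoid of the linear score, i.e. the modelled P(label = 1)."""
    return 1.0 / (1.0 + math.exp(-decision_value(word_counts)))


positive_review = {"enjoyable": 1, "movie": 1}
negative_review = {"dull": 2, "movie": 1}

print(decision_value(positive_review), probability_positive(positive_review))
print(decision_value(negative_review), probability_positive(negative_review))
```

A score above zero corresponds to a probability above 0.5 and therefore to a predicted label of 1. The linear score is what `decision_function` returns, and the probability is what `predict_proba` reports for the positive class.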

In [13]:
X_test_vectorized = vect.transform(X_test)  # vectorize once, reuse for both metrics
predictions = clf.predict(X_test_vectorized)
print(f"f1: {f1_score(y_test, predictions)}")
scores = clf.decision_function(X_test_vectorized)
print(f"AUC: {roc_auc_score(y_test, scores)}")

f1: 0.7700374531835205
AUC: 0.8437619827600285
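
The two metrics measure different things: F1 is computed from the hard 0/1 predictions, while ROC AUC is computed from the continuous scores and equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count as half). The sketch below computes both by hand on made-up labels and scores; the helper names are hypothetical, and the pairwise AUC loop is quadratic, so it is only meant for small examples.

```python
def f1_from_predictions(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)


def auc_from_scores(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (illustrative): decision values, thresholded at 0 for hard labels.
y_true = [1, 0, 1, 1, 0, 0]
scores = [2.0, -0.5, 0.3, -1.0, -2.0, 0.8]
y_pred = [1 if s > 0 else 0 for s in scores]

f1 = f1_from_predictions(y_true, y_pred)
auc = auc_from_scores(y_true, scores)
print(f"f1: {f1:.4f}  AUC: {auc:.4f}")
```

Because AUC only looks at the ordering of the scores, it does not depend on the 0 threshold that F1 implicitly uses.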


In [14]:
feature_names = np.array(vect.get_feature_names_out())
sorted_coef_index = clf.coef_[0].argsort()
print(f"Smallest coefs:\n{feature_names[sorted_coef_index[:10]]}\n")
print(f"Largest Coefs: \n{feature_names[sorted_coef_index[:-11:-1]]}")

Smallest coefs:
['dull' 'waste' 'boring' 'bore' 'neither' 'problem' 'worst'
 'disappointment' 'suffers' 'supposed']

Largest Coefs: 
['masterpiece' 'thanks' 'liberating' 'unflinching' 'enjoyable'
 'entertaining' 'remarkable' 'glorious' 'engrossing' 'solid']
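
The slicing used to pick the extreme coefficients is compact, so here is a small standard-library sketch of the same idea. `argsort` returns the indices that would sort the coefficients in ascending order; the first entries are the most negative words, and the slice `[:-11:-1]` walks backwards from the end to collect the ten most positive ones. The feature names and values below are made up for illustration.

```python
# Toy features and coefficients (illustrative values, not the fitted model).
feature_names = ["boring", "dull", "movie", "solid", "masterpiece"]
coefficients = [-1.1, -1.6, 0.05, 0.9, 1.4]

# Pure-Python equivalent of numpy's argsort: indices in ascending coefficient order.
order = sorted(range(len(coefficients)), key=coefficients.__getitem__)

k = 2
smallest = [feature_names[i] for i in order[:k]]              # most negative first
largest = [feature_names[i] for i in order[:-(k + 1):-1]]     # most positive first

print(order)
print(smallest)
print(largest)
```

With `k = 10` the second slice becomes `[:-11:-1]`, which is the form used in the cell above.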


<font color = green >

## Learn more
</font>

[sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

[Bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model)

[tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

[sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

[Applied Text Mining in Python](https://www.coursera.org/learn/python-text-mining/home/welcome)

[Natural Language Processing tutorial](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)


<font color = green >

## Next lesson: topic modeling 
</font>

