# Text Classification (ML Approach)
**Video Link: https://youtu.be/Qbd7U9F0QQ8**

**Hosted APIs from various platforms can also be used for text classification. One of them is NLP Cloud: https://nlpcloud.io/home/token**

## Load Dataset

In [53]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [13]:
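temp_df = pd.read_csv("../Datasets/IMDB_50k_Movie_Reviews/IMDB Dataset.csv")
# work on an explicit copy of the first 10,000 rows so later in-place edits do not target a slice of temp_df
df = temp_df.iloc[:10000].copy()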
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [14]:
df["review"][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [15]:
df["sentiment"].value_counts()

positive    5028
negative    4972
Name: sentiment, dtype: int64

In [16]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [17]:
df.duplicated().sum()

17

In [18]:
df.drop_duplicates(inplace = True)

In [19]:
df.duplicated().sum()

0

## Text Preprocessing

In [20]:
# basic preprocessing
# remove tags
# lowercase
# remove stopwords

### Remove HTML Tags

In [21]:
import re

def remove_tags(raw_text):
    # non-greedy match, so each <...> tag is removed on its own
    cleaned_text = re.sub(r'<.*?>', "", raw_text)
    return cleaned_text

In [22]:
df["review"] = df["review"].apply(remove_tags)

In [23]:
df["review"][1]

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'
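
The pattern `<.*?>` used in `remove_tags` is non-greedy: it stops at the first `>` after each `<`, so every tag is removed separately. The following is a minimal sketch on a made-up string (the sample text is an assumption, not a row from the dataset) that compares it with the greedy `<.*>`:

```python
import re

# Hypothetical sample text, not taken from the IMDB dataset.
sample = "Great film.<br /><br />Loved the <b>ending</b>!"

# Non-greedy: each tag is matched and removed on its own.
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: a single match runs from the first '<' to the last '>',
# so the text between the tags is deleted as well.
greedy = re.sub(r"<.*>", "", sample)

print(non_greedy)  # Great film.Loved the ending!
print(greedy)      # Great film.!
```

Because the replacement is an empty string, words separated only by a tag end up joined (`film.Loved`). Substituting a single space instead is one option if that matters for tokenization.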

### Convert to Lowercase

In [24]:
df["review"] = df["review"].apply(lambda x: x.lower())

### Remove Stopwords

In [28]:
from nltk.corpus import stopwords

sw_list = stopwords.words("english")
sw_set  = set(sw_list)  # set lookup is O(1); a list would be scanned for every word

df['review'] = df['review'].apply(lambda x: " ".join(item for item in x.split() if item not in sw_set))

df.head()

| | review | sentiment |
|---|---|---|
| 0 | one reviewers mentioned watching 1 oz episode ... | positive |
| 1 | wonderful little production. filming technique... | positive |
| 2 | thought wonderful way spend time hot summer we... | positive |
| 3 | basically there's family little boy (jake) thi... | negative |
| 4 | petter mattei's "love time money" visually stu... | positive |
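
The stopword step splits each review on whitespace and drops tokens found in the stopword list. A small sketch of the same idea, using a tiny hypothetical stopword set in place of NLTK's English list so that it runs without any download:

```python
# Hypothetical, tiny stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
toy_stopwords = {"the", "a", "is", "of", "this", "was"}

def remove_stopwords(text, stopword_set):
    """Drop whitespace-separated tokens that appear in stopword_set."""
    return " ".join(tok for tok in text.split() if tok not in stopword_set)

cleaned = remove_stopwords("this was a wonderful way of spending the evening.", toy_stopwords)
print(cleaned)  # wonderful way spending evening.

# Punctuation stays attached to tokens, so "the," is not recognised as "the".
with_punct = remove_stopwords("the, end of the film", toy_stopwords)
print(with_punct)  # the, end film
```

This also explains why tokens such as `production.` keep their trailing punctuation in the output above: no punctuation handling happens at this stage.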


## Model Creation & Accuracy Score

### Train-Test-Split

In [31]:
X = df.iloc[:, 0]
y = df.iloc[:, -1]
X

0       one reviewers mentioned watching 1 oz episode ...
1       wonderful little production. filming technique...
2       thought wonderful way spend time hot summer we...
3       basically there's family little boy (jake) thi...
4       petter mattei's "love time money" visually stu...
                              ...                        
9995    fun, entertaining movie wwii german spy (julie...
9996    give break. anyone say "good hockey movie"? kn...
9997    movie bad movie. watching endless series bad h...
9998    movie probably made entertain middle school, e...
9999    smashing film film-making. shows intense stran...
Name: review, Length: 9983, dtype: object

In [32]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)
y

array([1, 1, 1, ..., 0, 0, 1])
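
`LabelEncoder` sorts the distinct class labels and numbers them from 0, so here `negative` becomes 0 and `positive` becomes 1. The sketch below reproduces that mapping in plain Python on a hypothetical label list; it illustrates the idea and is not scikit-learn's implementation:

```python
# Hypothetical sample of sentiment labels.
labels = ["positive", "positive", "negative", "positive"]

# Sorted unique labels define the integer codes.
classes = sorted(set(labels))                      # ['negative', 'positive']
class_to_code = {label: code for code, label in enumerate(classes)}

encoded = [class_to_code[label] for label in labels]
print(class_to_code)  # {'negative': 0, 'positive': 1}
print(encoded)        # [1, 1, 0, 1]
```

Knowing the mapping matters later when reading the confusion matrix, since row and column 0 then refer to negative reviews.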

In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

X_train.shape

(7986,)
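
The training size of 7986 follows from the 9983 rows left after deduplication and `test_size = 0.2`. A quick arithmetic check, assuming scikit-learn rounds the fractional test share up (which is consistent with the shapes reported in this notebook):

```python
import math

n_samples = 9983   # rows remaining after drop_duplicates
test_size = 0.2

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 7986 1997
```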

### Bag of Words (BoW)

In [37]:
# applying BoW
from sklearn.feature_extraction.text import CountVectorizer

cv          = CountVectorizer()

X_train_bow = cv.fit_transform(X_train[:]).toarray()
X_test_bow  = cv.transform(X_test[:]).toarray()

In [38]:
X_train_bow.shape

(7986, 48282)
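
Each of the 48282 columns corresponds to one vocabulary term learned from the training reviews, and each cell holds how often that term occurs in a review. A dependency-free sketch of the idea on a hypothetical two-document corpus (tokenization is simplified to a whitespace split, so it is not equivalent to `CountVectorizer`):

```python
from collections import Counter

# Hypothetical toy corpus.
train_docs = ["good movie good plot", "bad movie"]
test_docs = ["good bad unseen"]

# The vocabulary is built from the training documents only.
vocabulary = sorted({tok for doc in train_docs for tok in doc.split()})

def to_bow(doc, vocabulary):
    """Return term counts in vocabulary order; unknown tokens are ignored."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

train_bow = [to_bow(doc, vocabulary) for doc in train_docs]
test_bow = [to_bow(doc, vocabulary) for doc in test_docs]

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(train_bow)   # [[0, 2, 1, 1], [1, 0, 1, 0]]
print(test_bow)    # [[1, 1, 0, 0]]
```

The test document's token `unseen` has no column, which mirrors why the notebook calls `fit_transform` on the training set and only `transform` on the test set.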

#### Gaussian Naive Bayes

In [39]:
# Gaussian Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb.fit(X_train_bow, y_train)

GaussianNB()

In [40]:
y_pred = gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score, confusion_matrix

accuracy_score(y_test, y_pred)

0.6324486730095142

In [41]:
confusion_matrix(y_test, y_pred)

array([[717, 235],
       [499, 546]], dtype=int64)
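
scikit-learn's `confusion_matrix` puts true classes on rows and predicted classes on columns, with 0 = negative and 1 = positive after label encoding. The sketch below recomputes accuracy from the matrix printed above and adds precision and recall for the positive class, which the notebook does not report:

```python
# Confusion matrix from the Gaussian Naive Bayes run above.
# Rows are true classes, columns are predicted classes.
cm = [[717, 235],
      [499, 546]]

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # share of predicted positives that are correct
recall_pos = tp / (tp + fn)      # share of actual positives that are found

print(total)
print(round(accuracy, 4), round(precision_pos, 4), round(recall_pos, 4))
```

Read this way, the matrix shows that the model misses 499 of the 1045 positive reviews, which accounts for most of the lost accuracy.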

#### Random Forest Classifier

In [42]:
# Random Forest Classifier Algorithm
from sklearn.ensemble import RandomForestClassifier

rf    = RandomForestClassifier()

rf.fit(X_train_bow, y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)

0.8442663995993991

In [43]:
cv          = CountVectorizer(max_features = 3000)

X_train_bow = cv.fit_transform(X_train[:]).toarray()
X_test_bow  = cv.transform(X_test[:]).toarray()

rf          = RandomForestClassifier()

rf.fit(X_train_bow, y_train)
y_pred      = rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)

0.8522784176264396

### n-Grams

In [46]:
# using n-grams
cv          = CountVectorizer(ngram_range = (1, 3))

X_train_bow = cv.fit_transform(X_train[:]).toarray()
X_test_bow  = cv.transform(X_test[:]).toarray()

rf          = RandomForestClassifier()

rf.fit(X_train_bow, y_train)
y_pred      = rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)

MemoryError: Unable to allocate 102. GiB for an array with shape (7986, 1711897) and data type int64
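
With `ngram_range = (1, 3)` every unigram, bigram, and trigram becomes a feature, so the vocabulary grows to 1,711,897 columns, and `.toarray()` then tries to allocate a dense `int64` array of that width. The sketch below shows how n-grams are generated (a simplified whitespace version, not `CountVectorizer`'s tokenizer) and checks the size quoted in the error message:

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of length n_min..n_max as strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams("wonderful little production".split())
print(features)

# Dense array size from the error message: rows x columns x 8 bytes per int64.
rows, cols = 7986, 1711897
bytes_needed = rows * cols * 8
gib = bytes_needed / 2**30
print(round(gib, 1))  # 101.9, reported by NumPy as "102. GiB"
```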

In [47]:
cv          = CountVectorizer(ngram_range = (1, 3), max_features = 5000)

X_train_bow = cv.fit_transform(X_train[:]).toarray()
X_test_bow  = cv.transform(X_test[:]).toarray()

rf          = RandomForestClassifier()

rf.fit(X_train_bow, y_train)
y_pred      = rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)

0.8422633950926389
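
Capping `max_features` is one way around the memory limit. Another option, sketched here on a hypothetical toy corpus, is to skip `.toarray()` and pass the sparse matrix returned by `CountVectorizer` directly to the classifier, since `RandomForestClassifier` accepts sparse input. Whether this trains in acceptable time on the full 1.7 million n-gram features was not tested in this notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; 1 = positive, 0 = negative.
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "awful acting boring plot",
]
train_labels = [1, 1, 0, 0]
test_texts = ["great plot", "boring movie"]

cv = CountVectorizer(ngram_range=(1, 3))
X_train_sparse = cv.fit_transform(train_texts)   # SciPy sparse matrix, no .toarray()
X_test_sparse = cv.transform(test_texts)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train_sparse, train_labels)
pred = rf.predict(X_test_sparse)

print(X_train_sparse.shape, pred)
```

A sparse matrix stores only the non-zero counts, so memory grows with the number of n-grams that actually occur in each review rather than with the full vocabulary width.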

### Tf-Idf

In [48]:
# using tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf         = TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train[:]).toarray()
X_test_tfidf  = tfidf.transform(X_test[:]).toarray()

In [49]:
rf     = RandomForestClassifier()

rf.fit(X_train_tfidf, y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test, y_pred)

0.8477716574862293
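
TF-IDF scales each term count by how rare the term is across documents, so words that appear in almost every review contribute less. The sketch below works through the weighting by hand on a hypothetical corpus, assuming the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of `TfidfVectorizer`'s defaults; tokenization is simplified to a whitespace split:

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = ["good movie", "bad movie", "good good plot"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df_counts = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency (assumed default formula).
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    """Term count times idf, then scaled to unit Euclidean length."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
for vec in vectors:
    print([round(w, 3) for w in vec])
```

In the second document, `bad` (found in one document) receives a larger weight than `movie` (found in two), even though both occur once.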

### Word2Vec

In [51]:
# using word2vec
import gensim

from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [54]:
story = []
for document in df["review"]:
    raw_sentences = sent_tokenize(document)
    for sentence in raw_sentences:
        story.append(simple_preprocess(sentence))

In [55]:
model = gensim.models.Word2Vec(
    window    = 10,
    min_count = 2
)

model.build_vocab(story)

In [56]:
model.train(story, total_examples = model.corpus_count, epochs = model.epochs)

(5876447, 6212140)

In [57]:
len(model.wv.index_to_key)

31845

In [58]:
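def document_vector(document):
    # remove out-of-vocabulary words; key_to_index is a dict, so each lookup is O(1)
    doc = [word for word in document.split() if word in model.wv.key_to_index]
    if not doc:
        # no known words: return a zero vector instead of averaging an empty array
        return np.zeros(model.wv.vector_size)
    return np.mean(model.wv[doc], axis = 0)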
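
`document_vector` represents a review as the element-wise mean of the vectors of its known words. A dependency-free sketch of that averaging, using hypothetical 3-dimensional embeddings in place of the trained model (the notebook's vectors have 100 dimensions) and returning a zero vector when no word is known:

```python
# Hypothetical toy embeddings standing in for model.wv.
toy_vectors = {
    "good":  [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3

def average_vector(document, vectors, size):
    """Mean of the vectors of in-vocabulary words; zeros if none are known."""
    known = [vectors[word] for word in document.split() if word in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together.
    return [sum(dim) / len(known) for dim in zip(*known)]

doc_vec = average_vector("good unknownword movie", toy_vectors, VECTOR_SIZE)
empty_vec = average_vector("unknownword", toy_vectors, VECTOR_SIZE)

print(doc_vec)    # [2.0, 1.0, 1.0]
print(empty_vec)  # [0.0, 0.0, 0.0]
```

Averaging discards word order and gives every word equal weight, which may be one reason the Word2Vec features score lower than bag-of-words in the results at the end of this section.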

In [59]:
document_vector(df["review"].values[0])

array([-0.11767457,  0.44197842,  0.15793371,  0.23417264, -0.08648166,
       -0.57118315,  0.20673536,  0.9118082 , -0.3956981 , -0.24569878,
       -0.29675198, -0.48555282,  0.06393063,  0.12023948,  0.21426222,
       -0.13143265,  0.01869917, -0.3801883 , -0.02781098, -0.6802106 ,
        0.06089593,  0.2382583 ,  0.09646897, -0.22129515, -0.39432743,
       -0.0290477 , -0.28285992,  0.0154872 , -0.3290113 ,  0.03619485,
        0.3534463 ,  0.03650456,  0.22277047, -0.27120045, -0.18808062,
        0.38151145,  0.07626982, -0.41450965, -0.24805413, -0.73923516,
        0.14936532, -0.22309735,  0.04557468, -0.04844937,  0.47880465,
       -0.14191924, -0.2135583 , -0.01630191,  0.0761856 ,  0.36759058,
        0.04451348, -0.399119  , -0.4148597 , -0.09378606, -0.09493089,
        0.2585061 ,  0.14484192,  0.07867185, -0.32646737,  0.0779765 ,
        0.0504096 ,  0.1291775 ,  0.02581581, -0.14704275, -0.4253138 ,
        0.21399641,  0.05135576,  0.13029383, -0.36542377,  0.26

In [60]:
from tqdm import tqdm

X = []
for document in tqdm(df["review"].values):
    X.append(document_vector(document))

100%|██████████████████████████████████████████████████████████████████████████████| 9983/9983 [04:30<00:00, 36.96it/s]


In [61]:
X = np.array(X)
X.shape

(9983, 100)

In [62]:
X[0]

array([-0.11767457,  0.44197842,  0.15793371,  0.23417264, -0.08648166,
       -0.57118315,  0.20673536,  0.9118082 , -0.3956981 , -0.24569878,
       -0.29675198, -0.48555282,  0.06393063,  0.12023948,  0.21426222,
       -0.13143265,  0.01869917, -0.3801883 , -0.02781098, -0.6802106 ,
        0.06089593,  0.2382583 ,  0.09646897, -0.22129515, -0.39432743,
       -0.0290477 , -0.28285992,  0.0154872 , -0.3290113 ,  0.03619485,
        0.3534463 ,  0.03650456,  0.22277047, -0.27120045, -0.18808062,
        0.38151145,  0.07626982, -0.41450965, -0.24805413, -0.73923516,
        0.14936532, -0.22309735,  0.04557468, -0.04844937,  0.47880465,
       -0.14191924, -0.2135583 , -0.01630191,  0.0761856 ,  0.36759058,
        0.04451348, -0.399119  , -0.4148597 , -0.09378606, -0.09493089,
        0.2585061 ,  0.14484192,  0.07867185, -0.32646737,  0.0779765 ,
        0.0504096 ,  0.1291775 ,  0.02581581, -0.14704275, -0.4253138 ,
        0.21399641,  0.05135576,  0.13029383, -0.36542377,  0.26

In [63]:
y

array([1, 1, 1, ..., 0, 0, 1])

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [65]:
# Gaussian (not multinomial) Naive Bayes, since the averaged vectors contain negative values
gnb = GaussianNB()

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
accuracy_score(y_test, y_pred)

0.730095142714071

In [66]:
rfc = RandomForestClassifier()

rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)
accuracy_score(y_test, y_pred)

0.7726589884827241