# Text Classification
Text classification is the NLP task of assigning predefined labels (categories) to documents, sentences, or phrases based on their content.

## Types of Text Classification

1. **Binary**: Only two categories exist. For example, an email is either spam or not spam.
2. **Multiclass**: More than two classes exist, and each text gets exactly one of them. For example, sentiment analysis can use positive, negative and neutral classes.
3. **Multilabel**: A text can have several classes assigned to it at once. For example, a movie can belong to multiple genres.
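
The three settings differ mainly in the shape of the target. The sketch below illustrates this with scikit-learn's `MultiLabelBinarizer`; all labels and genres are invented for illustration and do not come from the dataset used later.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical targets, invented for illustration.
binary_y = ["spam", "not spam", "spam"]             # one of 2 classes per text
multiclass_y = ["positive", "neutral", "negative"]  # one of 3+ classes per text
multilabel_y = [["action", "comedy"], ["drama"], ["comedy", "drama"]]  # any subset per text

# A multilabel target is usually encoded as one 0/1 indicator column per label.
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(multilabel_y)

print(mlb.classes_)  # label order of the columns
print(indicator)     # one row per text, one column per label
```

Binary and multiclass targets stay a single column, while a multilabel target becomes a matrix with one column per label.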

Text classification is widely used across NLP applications.

## Applications
1. Email Spam Classification
2. Customer Support
3. Sentiment Analysis: used by e-commerce websites such as Amazon and Flipkart
4. Language Detection
5. Fake News Detection


## Pipeline for Text Classification
1. Data Collection
2. Text Preprocessing
3. Text Vectorization: We can use any approach such as BoW, N-grams or TF-IDF
4. Modelling: There are three broad approaches:
   - **Heuristic approach**: hand-written rules rather than ML or DL. It is an option when there is not enough data to train a model, but it is rarely used on its own today.
   - **ML-based approach**: classical algorithms such as **Naive Bayes, Random Forest or SVM**.
   - **DL-based approach**: architectures such as **RNN, LSTM or CNN**, or pre-trained models like **BERT**.
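
As a rough sketch of how the vectorization and modelling steps fit together, scikit-learn's `make_pipeline` can chain a vectorizer and a classifier. The toy corpus, the variable names, and the choice of TF-IDF with Multinomial Naive Bayes are assumptions made for illustration, not the setup used in the cells below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1 (data collection): a toy corpus invented for illustration.
texts = [
    "a wonderful and touching film",
    "great acting and a great story",
    "terrible plot and awful acting",
    "boring, slow and a waste of time",
]
labels = ["positive", "positive", "negative", "negative"]

# Steps 2-3: the vectorizer lowercases, tokenizes and builds TF-IDF features.
# Step 4: a classical ML model is trained on those features.
clf = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
clf.fit(texts, labels)

predictions = clf.predict(["great story", "awful and boring"])
print(predictions)
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on the training texts and reused unchanged at prediction time.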


In [18]:
import numpy as np
import pandas as pd

In [19]:
df = pd.read_csv('Datasets/IMDB Dataset.csv')
df = df.head(10000) # comment this line out to take full dataset as input
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [20]:
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

## Text Preprocessing 

In [21]:
df['sentiment'].value_counts()
# the counts of positive and negative reviews are almost equal, so the classes are balanced

sentiment
positive    5028
negative    4972
Name: count, dtype: int64

In [22]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [23]:
df.duplicated().sum()

17

In [24]:
df.drop_duplicates(inplace=True)

In [25]:
df.duplicated().sum()


0

In [26]:
import re
def remove_tags(text):
    # strip HTML tags such as <br />; the non-greedy match stops at the first '>'
    return re.sub(r'<.*?>', '', text)

df['review'] = df['review'].apply(remove_tags)

In [27]:
df['review'][1]

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'

In [28]:
# apply lowercasing
df['review'] = df['review'].apply(lambda x:x.lower())

In [29]:
# remove stop words (run nltk.download('stopwords') once if the corpus is missing)
from nltk.corpus import stopwords
stopwords_set = set(stopwords.words('english'))  # set lookups are faster than list lookups

df['review'] = df['review'].apply(lambda x: " ".join(word for word in x.split() if word not in stopwords_set))

In [30]:
df['review'][1]

'wonderful little production. filming technique unassuming- old-time-bbc fashion gives comforting, sometimes discomforting, sense realism entire piece. actors extremely well chosen- michael sheen "has got polari" voices pat too! truly see seamless editing guided references williams\' diary entries, well worth watching terrificly written performed piece. masterful production one great master\'s comedy life. realism really comes home little things: fantasy guard which, rather use traditional \'dream\' techniques remains solid disappears. plays knowledge senses, particularly scenes concerning orton halliwell sets (particularly flat halliwell\'s murals decorating every surface) terribly well done.'

In [32]:
df

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production. filming technique...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there's family little boy (jake) thi...,negative
4,"petter mattei's ""love time money"" visually stu...",positive
...,...,...
9995,"fun, entertaining movie wwii german spy (julie...",positive
9996,"give break. anyone say ""good hockey movie""? kn...",negative
9997,movie bad movie. watching endless series bad h...,negative
9998,"movie probably made entertain middle school, e...",negative


In [33]:
# encoding our sentiment to labels
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = df['sentiment']
y = encoder.fit_transform(y)

In [34]:
y

array([1, 1, 1, ..., 0, 0, 1])
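
`LabelEncoder` sorts the class names before numbering them, so in this notebook `negative` becomes 0 and `positive` becomes 1. A small standalone sketch (the three labels are made up; `demo_encoder` is a name chosen for this example) shows how to inspect and reverse the mapping:

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(["positive", "negative", "positive"])

print(demo_encoder.classes_)                   # sorted labels; position = encoded value
print(encoded)                                 # integer codes for the input labels
print(demo_encoder.inverse_transform([0, 1]))  # map integer codes back to labels
```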

In [47]:
x = df['review']
x = pd.DataFrame(x)
print(type(x))
x

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,review
0,one reviewers mentioned watching 1 oz episode ...
1,wonderful little production. filming technique...
2,thought wonderful way spend time hot summer we...
3,basically there's family little boy (jake) thi...
4,"petter mattei's ""love time money"" visually stu..."
...,...
9995,"fun, entertaining movie wwii german spy (julie..."
9996,"give break. anyone say ""good hockey movie""? kn..."
9997,movie bad movie. watching endless series bad h...
9998,"movie probably made entertain middle school, e..."


In [48]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [49]:
x_train.shape

(7986, 1)

## Using BoW Approach

In [55]:
# applying BoW approach
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

x_train_bow = cv.fit_transform(x_train['review']).toarray()
x_test_bow = cv.transform(x_test['review']).toarray()

In [56]:
# Applying Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(x_train_bow,y_train)

In [57]:
y_pred = gnb.predict(x_test_bow)
from sklearn.metrics import accuracy_score,confusion_matrix
acc_score = accuracy_score(y_test,y_pred)
acc_score

0.627941912869304
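
An accuracy of about 0.63 is modest for a balanced two-class problem. One possible reason is that `GaussianNB` models every feature as normally distributed, which is a loose fit for sparse word counts. `MultinomialNB` is the variant commonly paired with count features, and it accepts the sparse matrix directly, so `.toarray()` is not needed. The sketch below uses a toy corpus and names invented for illustration; it makes no claim about the score this model would reach on the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration (1 = positive, 0 = negative).
toy_texts = [
    "wonderful little production",
    "masterful acting great story",
    "boring plot terrible acting",
    "awful slow waste time",
]
toy_labels = [1, 1, 0, 0]

demo_cv = CountVectorizer()
toy_bow = demo_cv.fit_transform(toy_texts)  # stays a sparse matrix

mnb = MultinomialNB()
mnb.fit(toy_bow, toy_labels)

new_bow = demo_cv.transform(["wonderful acting"])
pred = mnb.predict(new_bow)
proba = mnb.predict_proba(new_bow)  # one probability per class, rows sum to 1
print(pred, proba)
```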

In [58]:
confusion_matrix(y_test,y_pred)

array([[694, 291],
       [452, 560]])
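
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, with classes in sorted order (0 = negative, 1 = positive here). The short sketch below works through the arithmetic for accuracy, and for precision and recall of the positive class, using the matrix printed above:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[694, 291],
               [452, 560]])

# Row 0 = true negative class, row 1 = true positive class.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The accuracy recomputed this way matches the `accuracy_score` value reported earlier.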

In [60]:
# using random forest classifier instead of naive bayes
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(x_train_bow,y_train)
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test,y_pred)

0.8497746619929895

In [61]:
# max_features keeps only the most frequent terms; this shrinks the feature matrix and may or may not help accuracy
cv = CountVectorizer(max_features=3000)

X_train_bow = cv.fit_transform(x_train['review']).toarray()
X_test_bow = cv.transform(x_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

# we can also use n-grams instead of single words; ngram_range is a (min_n, max_n) tuple
# cv = CountVectorizer(ngram_range=(2, 2))  # bigrams only

0.842764146219329

We can also use TF-IDF features for text classification, although TF-IDF is more commonly associated with information retrieval systems.

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

# random forest accepts sparse input, so both matrices are kept sparse for consistency
X_train_tfidf = tfidf.fit_transform(x_train['review'])
X_test_tfidf = tfidf.transform(x_test['review'])

rf = RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test,y_pred)



0.8497746619929895

In [63]:
# TODO: Apply word2vec on our own dataset (use average word2vec to get a vector for each document, i.e. each row)
# and then try to build the ML model
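
A possible starting point for the TODO above is sketched here. It shows only the averaging step, using a hypothetical dictionary of 3-dimensional embeddings invented for illustration; no word2vec model is trained. Assuming a gensim `Word2Vec` model, its `model.wv` object supports the same membership test and lookup, so it could be passed in place of the dictionary.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Return the mean vector of the known tokens, or zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical embeddings, invented for illustration.
toy_vectors = {
    "wonderful": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "boring": np.array([0.0, 0.0, 1.0]),
}

toy_reviews = ["wonderful movie", "boring movie", "unknown words only"]

# One row per review; the result can be fed to any scikit-learn classifier.
doc_matrix = np.vstack(
    [average_vector(review.split(), toy_vectors, dim=3) for review in toy_reviews]
)
print(doc_matrix)
```

Falling back to a zero vector keeps the matrix shape fixed even when a review contains no known words.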