# Text Classification

Text classification is a supervised learning task in which a piece of text is assigned to one of a set of predefined classes. The text can be a word, a sentence, or a whole document.

Examples:
- Email spam classification
- Sentiment Analysis
- Routing customer messages to the sales or support team

------------------------------------------------------
### Types of text classification
1. Binary classification
2. Multiclass classification ----> e.g. classifying news by genre, such as sports, entertainment, geopolitics, etc.
3. Multilabel classification ----> a single text can belong to several labels at once

Most problems we will see are binary or multiclass classification.
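
The three settings differ mainly in the shape of the target. A minimal sketch with scikit-learn, where all example labels are made up for illustration:

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Hypothetical labels, invented for illustration only
binary_labels = ["spam", "ham", "spam"]
multiclass_labels = ["sports", "entertainment", "geopolitics"]
multilabel_labels = [["sports", "geopolitics"], ["entertainment"], []]

# Binary and multiclass targets: one integer class id per text
binary_y = LabelEncoder().fit_transform(binary_labels)
multiclass_y = LabelEncoder().fit_transform(multiclass_labels)

# Multilabel target: one 0/1 column per label, several may be set per text
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(multilabel_labels)

print(binary_y)
print(multiclass_y)
print(mlb.classes_)
print(multilabel_y)  # one row per text, one column per label
```

In the multilabel case a row can contain several ones, or none at all, which is why it needs a matrix rather than a single column of class ids.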

----------------------------------------------------------
#### Applications of Text Classification
1. Email spam detection
2. Customer support (when a customer tweets about an issue with a brand, decide whether the tweet needs a response; when an email arrives, forward it to the sales or support team)
3. Sentiment analysis (positive/negative sentiment) --> widely used in e-commerce to analyze reviews
4. Language detection (identify the language of a text before translating it)
5. Fake news detection



## Pipeline for Text Classification
1. Data Acquisition
2. Text Preprocessing
3. Text Vectorization
4. Modeling
    - ML --> Naive Bayes, Logistic Regression, Random Forest
    - DL --> RNN, CNN, BERT
5. Evaluation with metrics such as accuracy and the confusion matrix
6. Deploy as APIs
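
Steps 2 to 4 of the pipeline can be sketched end to end with scikit-learn's `Pipeline`. The texts and labels below are invented for illustration, and `MultinomialNB` is just one reasonable model choice, not necessarily the one used later in these notes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset standing in for step 1 (data acquisition)
texts = [
    "a wonderful and touching film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull, terrible waste of time",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = Pipeline([
    # steps 2-3: lowercasing, stopword removal and bag-of-words counts
    ("vectorizer", CountVectorizer(lowercase=True, stop_words="english")),
    # step 4: a simple ML model suited to count features
    ("model", MultinomialNB()),
])
clf.fit(texts, labels)

# step 5 would score held-out data; here we only predict two new texts
predictions = clf.predict(["a wonderful story", "terrible and boring"])
print(predictions)
```

Bundling the vectorizer and the model in one object means the same preprocessing is applied at training and prediction time, which also makes the later deployment step simpler.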


##### Different Approaches

1. Heuristic --> generally used when data is scarce
2. Cloud APIs  --> ready-made solutions
3. ML
    - Changes in Text Vectorization (Bag of Words, N-grams, TfIdf)
    - Modeling (different algorithms: Naive Bayes, SVM, Logistic Regression)
4. DL
    - RNN (LSTM)
    - CNN
    - Pretrained Models like BERT
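
As an illustration of the heuristic approach, here is a minimal keyword-counting sketch. The word lists and the function name `heuristic_sentiment` are hypothetical; in practice the rules would come from domain knowledge:

```python
# Hypothetical keyword lists, invented for this sketch
POSITIVE_WORDS = {"good", "great", "wonderful", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "boring", "terrible", "awful", "hate"}

def heuristic_sentiment(text: str) -> str:
    """Label a text by counting positive and negative keywords."""
    tokens = [token.strip(".,!?") for token in text.lower().split()]
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "unknown"  # tie, or no keyword matched

print(heuristic_sentiment("A wonderful little production."))
print(heuristic_sentiment("Boring and terrible!"))
```

A rule like this needs no training data, which is why heuristics are a common starting point when labeled examples are scarce.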

---------------------------------------------------------------------------------

#### Bag of Words and N-gram
We use the IMDB Reviews dataset, which contains 50,000 movie reviews, each labeled with its sentiment. Our task is to predict the sentiment of new reviews.
- We apply basic preprocessing -> Bag of Words -> algorithm (Naive Bayes and Random Forest)

Dataset Link - https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
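
Before applying it to the dataset, it may help to see what Bag of Words produces on a toy input. A minimal sketch with two made-up reviews (the name `cv_demo` is chosen here to avoid clashing with the notebook's variables):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, used only to show the representation
docs = ["the movie was good", "the movie was bad really bad"]

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index
vocab = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(vocab)   # ['bad', 'good', 'movie', 'really', 'the', 'was']
print(counts)  # row 1: [0 1 1 0 1 1], row 2: [2 0 1 1 1 1]
```

Each document becomes a vector of word counts over the shared vocabulary; word order is discarded.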



In [29]:

import pandas as pd

temp_df = pd.read_csv('Dataset/IMDB_Reviews/IMDB Dataset.csv')
# keep the first 3000 rows; .copy() avoids SettingWithCopyWarning on later edits
df = temp_df.iloc[:3000].copy()
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [30]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [31]:
df['sentiment'].value_counts()

positive    1508
negative    1492
Name: sentiment, dtype: int64

In [32]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [33]:
df.duplicated().sum()
# 0 means there are no duplicate rows in this subset

0

In [34]:
# dropping duplicated rows

df.drop_duplicates(inplace=True)



In [35]:
df.duplicated().sum()

0

#### Basic Pre-processing
1. Remove tags
2. Lowercase
3. remove stopwords

In [36]:
import re

def remove_tags(raw_text):
    cleaned_text = re.sub(re.compile('<.*?>'), '', raw_text)
    return cleaned_text

In [37]:
df['review'] = df['review'].apply(remove_tags)



In [38]:
df.sample(5)

Unnamed: 0,review,sentiment
63,"Besides being boring, the scenes were oppressi...",negative
331,I cant believe there are people out there that...,positive
53,I cannot believe I enjoyed this as much as I d...,positive
1827,My rating refers to the first 4 Seasons of Sta...,positive
2333,I never really knew who Robert Wuhl was before...,positive


In [39]:
df['review'] = df['review'].apply(lambda x : x.lower())



In [40]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
2995,to experience head you really need to understa...,positive
2996,"i'm a fan of judy garland, vincente minnelli, ...",negative
2997,"""mr. harvey lights a candle"" is anchored by a ...",positive
2998,della myers (kim basinger) is an upper-class h...,negative


In [41]:
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# a set makes the membership test below much faster than a list
sw_list = set(stopwords.words('english'))

df['review'] = df['review'].apply(lambda x: " ".join(item for item in x.split() if item not in sw_list))



In [42]:
df

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production. filming technique...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there's family little boy (jake) thi...,negative
4,"petter mattei's ""love time money"" visually stu...",positive
...,...,...
2995,experience head really need understand monkees...,positive
2996,"i'm fan judy garland, vincente minnelli, gene ...",negative
2997,"""mr. harvey lights candle"" anchored brilliant ...",positive
2998,della myers (kim basinger) upper-class housewi...,negative
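
The three preprocessing steps above (remove tags, lowercase, remove stopwords) can also be collapsed into one helper. This is a sketch: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, and the small stopword set is a stand-in so the example runs without NLTK data.

```python
import re

# Small stand-in stopword set so the sketch runs without NLTK data;
# the notebook uses stopwords.words('english') instead
DEMO_STOPWORDS = frozenset({"a", "the", "is", "and", "of", "this", "was", "to"})

TAG_PATTERN = re.compile(r"<.*?>")

def preprocess(text: str, stop_words=DEMO_STOPWORDS) -> str:
    """Strip HTML tags, lowercase, and drop stopwords."""
    text = TAG_PATTERN.sub("", text)
    text = text.lower()
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "A wonderful little production. <br /><br />The filming technique is very unassuming"
cleaned = preprocess(sample)
print(cleaned)  # wonderful little production. filming technique very unassuming
```

With a helper like this, the whole cleaning step becomes a single `df['review'].apply(preprocess)` call.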


In [43]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [44]:
X

Unnamed: 0,review
0,one reviewers mentioned watching 1 oz episode ...
1,wonderful little production. filming technique...
2,thought wonderful way spend time hot summer we...
3,basically there's family little boy (jake) thi...
4,"petter mattei's ""love time money"" visually stu..."
...,...
2995,experience head really need understand monkees...
2996,"i'm fan judy garland, vincente minnelli, gene ..."
2997,"""mr. harvey lights candle"" anchored brilliant ..."
2998,della myers (kim basinger) upper-class housewi...


In [45]:
y

0       positive
1       positive
2       positive
3       negative
4       positive
          ...   
2995    positive
2996    negative
2997    positive
2998    negative
2999    negative
Name: sentiment, Length: 3000, dtype: object

In [46]:
# Label Encoding y
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

In [47]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

In [48]:
X_train.shape

(2400, 1)

In [49]:
# Applying Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1000)

In [50]:
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [51]:
X_train_bow.shape
X_test_bow.shape

# only the last expression is displayed; max_features=1000 caps the vocabulary at 1000 features

(600, 1000)

In [52]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train_bow, y_train)

y_pred = gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, y_pred)

0.7766666666666666

In [53]:
confusion_matrix(y_test, y_pred)

array([[238,  62],
       [ 72, 228]], dtype=int64)
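
Accuracy, precision, and recall can all be read off this confusion matrix. A short worked sketch using the numbers printed above; in scikit-learn the rows are true labels and the columns are predictions, and `LabelEncoder` sorts labels alphabetically, so 0 = negative and 1 = positive:

```python
import numpy as np

# Confusion matrix from the GaussianNB run above
cm = np.array([[238, 62],
               [72, 228]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of reviews predicted positive, how many were positive
recall = tp / (tp + fn)     # of truly positive reviews, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way, 466 / 600, matches the `accuracy_score` output of the previous cell.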

In [54]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(X_train_bow, y_train)
y_pred = rf.predict(X_test_bow)

print('Accuracy Score: ', accuracy_score(y_test, y_pred))

Accuracy Score:  0.8066666666666666


In [55]:
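# Same settings as the cells above, rerun end to end. The accuracy differs
# slightly because RandomForestClassifier is not seeded (no random_state).
from sklearn.feature_extraction.text import CountVectorizer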

cv = CountVectorizer(max_features=1000)
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow, y_train)
y_pred = rf.predict(X_test_bow)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))

Accuracy Score:  0.82


#### Using N-gram

In [56]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1,2), max_features=1000)
X_train_ngm = cv.fit_transform(X_train['review']).toarray()
X_test_ngm = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_ngm, y_train)
y_pred = rf.predict(X_test_ngm)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))

Accuracy Score:  0.8166666666666667
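
To see what `ngram_range=(1,2)` changes, here is a toy comparison on one made-up sentence. Note that the default tokenizer ignores single-character tokens such as "a":

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["not a good movie"]  # made-up example

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)

unigram_terms = sorted(unigrams.vocabulary_)
uni_bi_terms = sorted(uni_bi.vocabulary_)

print(unigram_terms)  # ['good', 'movie', 'not']
print(uni_bi_terms)   # ['good', 'good movie', 'movie', 'not', 'not good']
```

The bigram `not good` keeps the negation attached to the word it modifies, which single-word features cannot express.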


## Using TfIdf

TF-IDF weights each term by how often it occurs in a document and how rare it is across the corpus. It is widely used in information retrieval systems.
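
A toy sketch of the weighting on a made-up corpus. With its default settings, scikit-learn computes a smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`, and then L2-normalizes each row:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "great movie"]  # made-up corpus

tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(docs).toarray()

# Terms ordered by column index, paired with their IDF and first-row weights
terms = sorted(tfidf_demo.vocabulary_, key=tfidf_demo.vocabulary_.get)
idf = dict(zip(terms, tfidf_demo.idf_))
first_doc = dict(zip(terms, weights[0]))

print(idf)        # 'movie' has the lowest IDF: it appears in every document
print(first_doc)  # in "good movie", 'good' gets a larger weight than 'movie'
```

Words that occur everywhere carry little information about a particular document, so they are down-weighted relative to rarer words.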

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_tfidf, y_train)
y_pred = rf.predict(X_test_tfidf)

print(accuracy_score(y_test, y_pred))


0.81


## Word2Vec

- If we use a pre-trained model, we should make sure that its vocabulary and our corpus vocabulary have at least 80% of their words in common.
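
One way to check this overlap is sketched below. The function `vocabulary_overlap` and both word collections are hypothetical; with gensim `KeyedVectors`, the pretrained vocabulary would be the keys of `key_to_index`.

```python
def vocabulary_overlap(corpus_tokens, pretrained_vocab):
    """Fraction of distinct corpus words covered by the pretrained vocabulary."""
    corpus_vocab = set(corpus_tokens)
    if not corpus_vocab:
        return 0.0
    return len(corpus_vocab & set(pretrained_vocab)) / len(corpus_vocab)

# Hypothetical stand-ins for the corpus tokens and the pretrained vocabulary
corpus_tokens = ["wonderful", "little", "production", "filming", "polari"]
pretrained_vocab = {"wonderful", "little", "production", "filming", "movie"}

overlap = vocabulary_overlap(corpus_tokens, pretrained_vocab)
print(f"{overlap:.0%} of corpus words are covered")
```

Here four of the five distinct corpus words are covered, so the overlap is exactly at the 80% threshold mentioned above.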

In [59]:
import pandas as pd

temp_df = pd.read_csv('Dataset/IMDB_Reviews/IMDB Dataset.csv')
# .copy() avoids SettingWithCopyWarning when the slice is modified below
df = temp_df.iloc[:10000].copy()

df.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [60]:
df.drop_duplicates(inplace=True)



In [61]:
import re
def remove_tags(raw_text):
    cleaned_text = re.sub(re.compile('<.*?>'), '', raw_text)
    return cleaned_text


In [62]:
df['review'] = df['review'].apply(remove_tags)



In [64]:
df['review'] = df['review'].apply(lambda x:x.lower())



In [65]:
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# a set makes the membership test below much faster than a list
sw_list = set(stopwords.words('english'))

df['review'] = df['review'].apply(lambda x: " ".join(item for item in x.split() if item not in sw_list))



In [66]:
df['review']

0       one reviewers mentioned watching 1 oz episode ...
1       wonderful little production. filming technique...
2       thought wonderful way spend time hot summer we...
3       basically there's family little boy (jake) thi...
4       petter mattei's "love time money" visually stu...
                              ...                        
9995    fun, entertaining movie wwii german spy (julie...
9996    give break. anyone say "good hockey movie"? kn...
9997    movie bad movie. watching endless series bad h...
9998    movie probably made entertain middle school, e...
9999    smashing film film-making. shows intense stran...
Name: review, Length: 9983, dtype: object

In [67]:
import gensim
from nltk import sent_tokenize
from gensim.utils import simple_preprocess


In [68]:
story = []
for doc in df['review']:
    raw_sent = sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

In [69]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [70]:
model.build_vocab(story)

In [71]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(5876372, 6212140)

In [72]:
len(model.wv.index_to_key)

31845

In [75]:
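import numpy as np

def document_vector(doc):
    # keep only in-vocabulary words; key_to_index is a dict, so the lookup is
    # O(1), whereas index_to_key is a list that is scanned on every test
    words = [word for word in doc.split() if word in model.wv.key_to_index]
    if not words:
        # no known word: return a zero vector instead of failing on an empty list
        return np.zeros(model.wv.vector_size)
    return np.mean(model.wv[words], axis=0)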
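
The idea behind `document_vector` is that a document is represented by the average of its word vectors. A toy sketch with hypothetical 3-dimensional embeddings (real Word2Vec vectors are much longer; gensim's default is 100 dimensions). The zero-vector fallback for documents with no known words is an assumption of this sketch:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

def toy_document_vector(doc, vectors, dim=3):
    """Mean of the vectors of known words; zeros if no word is known."""
    known = [vectors[word] for word in doc.split() if word in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(toy_document_vector("good movie tonight", toy_vectors))  # [2. 1. 1.]
print(toy_document_vector("unseen words", toy_vectors))        # [0. 0. 0.]
```

Every document ends up with a vector of the same length regardless of how many words it contains, which is what lets a classifier such as Random Forest consume it.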

In [76]:
document_vector(df['review'].values[0])

array([-1.87530011e-01,  4.95367199e-01,  2.07556158e-01,  2.30710492e-01,
       -7.53852427e-02, -5.93391001e-01,  2.20551029e-01,  9.37762678e-01,
       -3.62197995e-01, -2.23479390e-01, -3.20896804e-01, -4.54910100e-01,
        2.20214557e-02,  4.46809754e-02,  2.18457684e-01, -1.46564528e-01,
       -1.25316177e-02, -3.73192072e-01, -6.93874881e-02, -6.04705155e-01,
        4.26526293e-02,  2.42396921e-01,  6.23472370e-02, -3.27522397e-01,
       -3.71412575e-01,  2.17061475e-04, -2.92471737e-01,  2.20541712e-02,
       -3.34182858e-01,  4.10325862e-02,  3.87667269e-01,  2.42435280e-02,
        1.13585085e-01, -2.88523555e-01, -1.73377782e-01,  4.63228732e-01,
        1.11057587e-01, -4.59190249e-01, -2.33132675e-01, -7.24372029e-01,
        1.15215972e-01, -2.65447944e-01,  5.75667024e-02, -1.04196385e-01,
        5.24624288e-01, -1.43483877e-01, -1.89442232e-01,  2.23946776e-02,
        6.62805215e-02,  3.18325281e-01,  2.77051851e-02, -3.58479470e-01,
       -3.89551908e-01, -

In [77]:
from tqdm import tqdm

In [78]:
X = []
for doc in tqdm(df['review'].values):
    X.append(document_vector(doc))

100%|██████████| 9983/9983 [08:53<00:00, 18.72it/s]


In [79]:
X = np.array(X)

In [80]:
X[0]

array([-1.87530011e-01,  4.95367199e-01,  2.07556158e-01,  2.30710492e-01,
       -7.53852427e-02, -5.93391001e-01,  2.20551029e-01,  9.37762678e-01,
       -3.62197995e-01, -2.23479390e-01, -3.20896804e-01, -4.54910100e-01,
        2.20214557e-02,  4.46809754e-02,  2.18457684e-01, -1.46564528e-01,
       -1.25316177e-02, -3.73192072e-01, -6.93874881e-02, -6.04705155e-01,
        4.26526293e-02,  2.42396921e-01,  6.23472370e-02, -3.27522397e-01,
       -3.71412575e-01,  2.17061475e-04, -2.92471737e-01,  2.20541712e-02,
       -3.34182858e-01,  4.10325862e-02,  3.87667269e-01,  2.42435280e-02,
        1.13585085e-01, -2.88523555e-01, -1.73377782e-01,  4.63228732e-01,
        1.11057587e-01, -4.59190249e-01, -2.33132675e-01, -7.24372029e-01,
        1.15215972e-01, -2.65447944e-01,  5.75667024e-02, -1.04196385e-01,
        5.24624288e-01, -1.43483877e-01, -1.89442232e-01,  2.23946776e-02,
        6.62805215e-02,  3.18325281e-01,  2.77051851e-02, -3.58479470e-01,
       -3.89551908e-01, -

In [81]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

y = encoder.fit_transform(df['sentiment'])

In [82]:
y

array([1, 1, 1, ..., 0, 0, 1])

In [83]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [84]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

## Assignments

Try using Google's pre-trained Word2Vec model.

## Tips
- Always try to use ensemble techniques
- Try using heuristic features as well
- Start with machine learning, then gradually move to deep learning
- Try to balance the classes in the data
- Work on as many projects as you can (practice)
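
As a sketch of the ensemble and class-balancing tips, the example below combines three classifiers with soft voting and uses `class_weight="balanced"` where the estimator supports it. The texts, labels, and model choices are made up for illustration:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, slightly imbalanced toy data (4 positive, 2 negative)
texts = [
    "wonderful film", "great story", "lovely acting", "excellent cast",
    "terrible film", "boring story",
]
labels = [1, 1, 1, 1, 0, 0]

ensemble = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            # class_weight="balanced" reweights classes by inverse frequency
            ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                          random_state=0)),
        ],
        voting="soft",  # average the predicted probabilities
    ),
)
ensemble.fit(texts, labels)

pred = ensemble.predict(["wonderful cast", "boring film"])
print(pred)
```

Soft voting requires every estimator to expose `predict_proba`, which all three models here do.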