# Text Classification

**Text Classification** is a Natural Language Processing (NLP) task where the goal is to automatically assign predefined categories (or labels) to a given piece of text, such as a sentence, paragraph, or document.  
It helps in organizing, structuring, and analyzing large volumes of textual data efficiently.

---

## Types of Text Classification

### 1. Binary Classification
- Definition: Each text is classified into one of **two categories** only.  
- Example: Classifying emails as **spam** or **not spam**.  

### 2. Multiclass Classification
- Definition: Each text is assigned to **one label out of more than two possible categories**.  
- Example: Classifying a news article as **sports, politics, entertainment, or technology**.  

### 3. Multilabel Classification
- Definition: Each text can belong to **multiple categories simultaneously**.  
- Example: A movie review tagged as both **romance** and **comedy**.  
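
The three settings differ mainly in the shape of the target. As a rough illustration (the label names and the helper `to_indicator` are made up for this sketch), a multilabel target is often stored as a 0/1 indicator vector, while binary and multiclass targets are a single label:

```python
# Toy illustration of how targets differ across the three settings.
# All label names below are hypothetical.

# Binary: one of two labels per text.
binary_target = "spam"

# Multiclass: exactly one label out of several.
multiclass_target = "sports"

# Multilabel: any subset of the label set, stored here as a 0/1 indicator vector.
LABELS = ["action", "comedy", "romance"]

def to_indicator(tags, labels=LABELS):
    """Return a 0/1 vector with a 1 for each label present in `tags`."""
    tag_set = set(tags)
    return [1 if label in tag_set else 0 for label in labels]

print(to_indicator(["romance", "comedy"]))  # [0, 1, 1]
```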

---

## Applications of Text Classification

- **Spam Detection** (binary) → Spam vs. Non-Spam  
- **Sentiment Analysis** (binary/multiclass) → Positive, Negative, Neutral  
- **Topic Categorization** (multiclass) → News, Blogs, Research, etc.  
- **Tag Recommendation** (multilabel) → Assigning multiple tags to StackOverflow or GitHub posts  
- **Toxic Comment Detection** (multilabel) → Offensive, Threatening, Hate Speech, etc.  
- **Customer Support Automation** → Routing tickets into categories like *billing*, *technical support*, *feedback*  

---

# Approaches to Text Classification

Text classification can be performed using various approaches, ranging from simple rule-based methods to advanced deep learning techniques. The choice of approach depends on factors like dataset size, complexity of the problem, availability of labeled data, and computational resources.

---

## 1. Heuristic / Rule-Based Approaches
- **Definition:** Classification is done using handcrafted rules, keyword matching, or regular expressions.  
- **How it works:**  
  - Define keywords or phrases for each class.  
  - If a document contains certain patterns, it is assigned to that class.  
- **Example:** If a text contains words like "offer" or "discount," classify it as *marketing*.  
- **Pros:** Simple, interpretable, no training required.  
- **Cons:** Not scalable, brittle, poor performance on complex/unseen data.  
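
A minimal sketch of such a rule-based classifier, assuming a hypothetical rule table (`RULES`) and function name (`classify_by_rules`) that are not part of any library:

```python
import re

# Hypothetical keyword rules; each class maps to a compiled regex.
RULES = {
    "marketing": re.compile(r"\b(offer|discount)\b", re.IGNORECASE),
    "billing": re.compile(r"\b(invoice|refund)\b", re.IGNORECASE),
}

def classify_by_rules(text, default="other"):
    """Return the first class whose pattern matches, else `default`."""
    for label, pattern in RULES.items():
        if pattern.search(text):
            return label
    return default

print(classify_by_rules("Get a 20% DISCOUNT today"))  # marketing
print(classify_by_rules("Where is my refund?"))       # billing
print(classify_by_rules("Big discounts this week"))   # other
```

The last call shows the brittleness mentioned above: the plural "discounts" does not match the whole-word pattern, so the text falls through to the default class.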

---

## 2. Using APIs / Pre-trained Services
- **Definition:** Leverages existing APIs and pre-trained NLP services for classification.  
- **How it works:**  
  - Hosted services such as **Google Cloud Natural Language**, **Amazon Comprehend**, and **Azure Cognitive Services**, or libraries such as **Hugging Face Pipelines**, provide ready-to-use classification models.  
- **Example:** Sentiment analysis API classifies text as *positive*, *neutral*, or *negative*.  
- **Pros:** Quick setup, good accuracy for general tasks, no need for training.  
- **Cons:** Limited customization, possible cost issues, dependency on third-party providers.  

---

## 3. Machine Learning (ML) Based Approaches
- **Definition:** Uses statistical ML algorithms trained on labeled datasets with text represented as features.  
- **How it works:**  
  - Convert text into numerical features using **Bag of Words (BoW), TF-IDF, or Word Embeddings**.  
  - Train ML classifiers such as:  
    - **Naïve Bayes**  
    - **Logistic Regression**  
    - **Support Vector Machines (SVMs)**  
    - **Random Forests**  
- **Example:** Training a classifier to categorize news articles into *sports*, *politics*, or *technology*.  
- **Pros:** Effective on medium-sized datasets, interpretable, relatively fast to train.  
- **Cons:** Requires feature engineering, may struggle with complex language semantics.  

---

## 4. Deep Learning (DL) Based Approaches
- **Definition:** Uses neural networks to automatically learn rich representations of text.  
- **How it works:**  
  - Employ architectures such as:  
    - **CNNs (Convolutional Neural Networks)** → capture local word patterns.  
    - **RNNs / LSTMs / GRUs (Recurrent Networks)** → capture sequential dependencies.  
    - **Transformers (BERT, RoBERTa, GPT, etc.)** → capture contextual word meanings using attention mechanisms.  
  - Typically trained on large corpora and fine-tuned for specific tasks.  
- **Example:** Using **BERT** to classify customer reviews into *positive*, *negative*, or *neutral*.  
- **Pros:** State-of-the-art performance, captures context and semantics, minimal feature engineering.  
- **Cons:** Training from scratch requires large datasets (fine-tuning a pre-trained model needs far less labeled data), high computational cost, less interpretable.  

---

## Summary Table

| Approach              | Data Need | Interpretability | Performance | Scalability |
|------------------------|-----------|------------------|-------------|-------------|
| **Heuristic**         | None      | High             | Low         | Low         |
| **API/Pre-trained**   | None      | Medium           | Medium/High | High        |
| **ML-based**          | Medium    | Medium/High      | Medium      | Medium      |
| **DL-based**          | Large     | Low              | Very High   | High        |

In [2]:
import numpy as np
import pandas as pd

In [3]:
temp_df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

In [4]:
df = temp_df.iloc[:20000].copy()  # explicit copy so later column assignments do not trigger SettingWithCopyWarning

In [5]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [6]:
df.describe()

|        | review | sentiment |
|--------|--------|-----------|
| count  | 20000  | 20000     |
| unique | 19926  | 2         |
| top    | Loved today's show!!! It was a variety and not... | negative |
| freq   | 4      | 10097     |


In [7]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [8]:
df['sentiment'].value_counts()

sentiment
negative    10097
positive     9903
Name: count, dtype: int64

In [9]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [10]:
df.duplicated().sum()  # 74 duplicates in 20K rows

74

In [11]:
df.drop_duplicates(inplace=True)



In [12]:
df.duplicated().sum()

0

In [13]:
df.shape

(19926, 2)

In [14]:
df['sentiment'].value_counts()

sentiment
negative    10046
positive     9880
Name: count, dtype: int64

# Basic Preprocessing


- Removing Tags

In [15]:
import re
def remove_tags(raw_text):
    cleaned_text = re.sub(re.compile('<.*?>'), '', raw_text)
    return cleaned_text
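
Replacing a tag with an empty string glues the neighbouring words together wherever a `<br />` separated two sentences (the cleaned review further down contains `me.The first thing`). A possible variant, shown only as a sketch under the hypothetical name `remove_tags_keep_spacing`, substitutes a space and then collapses repeated whitespace. It is not what the rest of this notebook uses, and switching to it would change the outputs shown in later cells.

```python
import re

TAG_RE = re.compile(r"<.*?>")

def remove_tags_keep_spacing(raw_text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(" ", raw_text)
    return re.sub(r"\s+", " ", no_tags).strip()

print(remove_tags_keep_spacing("happened with me.<br /><br />The first thing"))
# happened with me. The first thing
```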

In [16]:
df['review'] = df['review'].apply(remove_tags)



In [17]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

- Lowercasing

In [18]:
df['review'] = df['review'].apply(lambda x:x.lower())



In [19]:
df['review'][0]

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wo

- Removing Stopwords

In [20]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
from nltk.corpus import stopwords
sw_set = set(stopwords.words('english'))  # set membership is O(1); a list would be scanned for every word
df['review'] = df['review'].apply(lambda x: " ".join(word for word in x.split() if word not in sw_set))



In [22]:
df['review'][0]

"one reviewers mentioned watching 1 oz episode hooked. right, exactly happened me.the first thing struck oz brutality unflinching scenes violence, set right word go. trust me, show faint hearted timid. show pulls punches regards drugs, sex violence. hardcore, classic use word.it called oz nickname given oswald maximum security state penitentary. focuses mainly emerald city, experimental section prison cells glass fronts face inwards, privacy high agenda. em city home many..aryans, muslims, gangstas, latinos, christians, italians, irish more....so scuffles, death stares, dodgy dealings shady agreements never far away.i would say main appeal show due fact goes shows dare. forget pretty pictures painted mainstream audiences, forget charm, forget romance...oz mess around. first episode ever saw struck nasty surreal, say ready it, watched more, developed taste oz, got accustomed high levels graphic violence. violence, injustice (crooked guards who'll sold nickel, inmates who'll kill order g

# Feature Extraction

In [23]:
X = df[['review']]
y = df['sentiment']

In [24]:
X

|       | review |
|-------|--------|
| 0     | one reviewers mentioned watching 1 oz episode ... |
| 1     | wonderful little production. filming technique... |
| 2     | thought wonderful way spend time hot summer we... |
| 3     | basically there's family little boy (jake) thi... |
| 4     | petter mattei's "love time money" visually stu... |
| ...   | ... |
| 19995 | ok. starters, taxi driver amazing. this, taxi ... |
| 19996 | sort hard say it, greatly enjoyed "targets" "p... |
| 19997 | still liked though. warren beatty fair comic b... |
| 19998 | could still use black adder even today. imagin... |


In [25]:
y

0        positive
1        positive
2        positive
3        negative
4        positive
           ...   
19995    negative
19996    negative
19997    positive
19998    positive
19999    negative
Name: sentiment, Length: 19926, dtype: object

In [26]:
# Converting y into 0 and 1
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)

In [27]:
y

array([1, 1, 1, ..., 1, 1, 0])
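
`LabelEncoder` sorts the distinct labels and replaces each one with its index, which is why `negative` becomes 0 and `positive` becomes 1 here (the fitted order is available as `encoder.classes_`). A plain-Python sketch of the same mapping on a made-up label list:

```python
# Plain-Python sketch of the mapping LabelEncoder produces for string labels:
# classes are sorted, then each label is replaced by its index.
labels = ["positive", "positive", "negative", "positive"]

classes = sorted(set(labels))
to_index = {label: i for i, label in enumerate(classes)}
encoded = [to_index[label] for label in labels]

print(classes)   # ['negative', 'positive']
print(encoded)   # [1, 1, 0, 1]
```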

In [28]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [29]:
X_train.shape

(15940, 1)

In [30]:
X_test.shape

(3986, 1)

# Applying BoW

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

In [32]:
cv = CountVectorizer()

- `cv.fit_transform(X_train['review'])`
  - `fit()` → builds the vocabulary from the training data.
  - `transform()` → converts the training text into a bag-of-words matrix.
- `cv.transform(X_test['review'])`
  - Only transforms the test data, using the vocabulary built from the training data.
  - This ensures that train and test data share the same feature space (the same word columns).
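
The fit/transform split can be sketched in plain Python on a toy corpus. This is only an illustration of the idea; `CountVectorizer` applies its own tokenization and lowercasing, and the helper `transform` below is hypothetical.

```python
from collections import Counter

train_docs = ["good movie", "bad movie", "good good plot"]
test_docs = ["good ending", "bad plot"]

# "fit": build a sorted vocabulary from the training documents only.
vocab = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def transform(doc):
    """Count vocabulary words in `doc`; words unseen during fit are dropped."""
    counts = Counter(word for word in doc.split() if word in index)
    return [counts[word] for word in vocab]

print(vocab)                                # ['bad', 'good', 'movie', 'plot']
print([transform(d) for d in train_docs])   # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 2, 0, 1]]
print([transform(d) for d in test_docs])    # [[0, 1, 0, 0], [1, 0, 0, 1]]
```

The test word "ending" never appeared in the training data, so it has no column and is ignored; both matrices still have the same four columns.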

In [33]:
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [34]:
X_train_bow.shape

(15940, 64537)

In [35]:
X_test_bow.shape

(3986, 64537)

# Applying ML For Text Classification

- Naive Bayes Classifier

In [36]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train_bow,y_train)

In [37]:
y_pred = gnb.predict(X_test_bow)

In [38]:
from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)
# ~65% accuracy

0.6482689412945308
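
`GaussianNB` models every feature as a continuous Gaussian variable, which is an awkward fit for sparse word counts. The multinomial variant (available in scikit-learn as `MultinomialNB`) is the more common choice for bag-of-words features. Whether it would beat the score above on this data was not measured here. The sketch below implements the multinomial idea from scratch on a made-up four-document corpus; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed word log-likelihoods."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_like = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict(model, doc):
    """Return the class with the highest log-posterior; unknown words are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)

docs = ["great wonderful film", "wonderful acting", "boring awful film", "awful plot"]
labels = ["positive", "positive", "negative", "negative"]
nb = train_multinomial_nb(docs, labels)

print(predict(nb, "wonderful film"))   # positive
print(predict(nb, "awful boring"))     # negative
```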

In [39]:
confusion_matrix(y_test,y_pred)

array([[1538,  414],
       [ 988, 1046]])
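
The confusion matrix says more than the single accuracy number. In scikit-learn's layout, rows are true classes and columns are predicted classes, in label order 0 (negative), 1 (positive). The arithmetic below reuses the four counts reported above; the variable names are chosen for this sketch.

```python
# Counts copied from the confusion matrix above:
# row 0 = true negative class, row 1 = true positive class.
tn, fp = 1538, 414
fn, tp = 988, 1046

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of reviews predicted positive, how many were positive
recall_pos = tp / (tp + fn)      # of truly positive reviews, how many were found

print(total)                     # 3986
print(round(accuracy, 4))        # 0.6483
print(round(precision_pos, 4))   # 0.7164
print(round(recall_pos, 4))      # 0.5143
```

The low recall shows that the model misses almost half of the positive reviews (988 of 2034), which is where most of the lost accuracy comes from.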

- Random Forest Classifier

In [40]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

In [41]:
rf.fit(X_train_bow,y_train)

In [42]:
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)
# ~86% accuracy

0.8564977420973406

- Random Forest with the vocabulary capped at 3,000 features (`max_features=3000`)

In [43]:
cv = CountVectorizer(max_features=3000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

print(X_train_bow.shape)

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)
# 84% accuracy

(15940, 3000)


0.84470647265429

# Using n-grams

In [44]:
cv = CountVectorizer(ngram_range=(1,2),max_features=5000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)
# 85% accuracy

0.8534872052182639
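
With `ngram_range=(1,2)` the vocabulary contains single words and pairs of adjacent words, so a phrase such as "not good" gets its own feature. A small plain-Python sketch of the extraction step (the helper `word_ngrams` is hypothetical and uses whitespace splitting rather than `CountVectorizer`'s tokenizer):

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("not good at all"))
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```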

# Using TF-IDF

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [46]:
tfidf = TfidfVectorizer()

In [47]:
X_train_tfidf = tfidf.fit_transform(X_train['review'])  # kept sparse; RandomForestClassifier accepts sparse input
X_test_tfidf = tfidf.transform(X_test['review'])

In [48]:
X_train_tfidf.shape

(15940, 64537)

In [49]:
rf = RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test,y_pred)
# ~86% accuracy

0.859257400903161
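
TF-IDF reweights the raw counts so that terms appearing in many documents contribute less. The sketch below computes it by hand on a toy corpus, assuming the smoothed formula scikit-learn documents for its default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The variable and function names are made up for this example.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]
vocab = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(w for d in docs for w in set(d.split()))

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    """Raw term count times idf, then L2-normalised to unit length."""
    counts = Counter(doc.split())
    weights = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm if norm else 0.0 for x in weights]

for d in docs:
    print([round(x, 3) for x in tfidf_vector(d)])
```

In this corpus "bad" occurs in one document and "movie" in two, so "bad" receives the larger idf weight.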

# Training a Word2Vec Model

In [50]:
import gensim

In [51]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [52]:
story = []
for doc in df['review']:
    raw_sent = sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

In [53]:
len(story)  # No. of sentences in corpus

211927

In [54]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [55]:
model.build_vocab(story)

In [56]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(11757889, 12360490)

In [57]:
len(model.wv.index_to_key)  # Vocabulary

42888

- `doc.split()`: splits the document into individual words.
- The `if word in ...` filter: keeps only the words that exist in the trained vocabulary; unknown words are ignored.
- `model.wv[doc]`: retrieves the Word2Vec vector for each remaining word. The model above was created without a `vector_size` argument, so gensim's default of 100 dimensions applies.
- `np.mean(..., axis=0)`: averages the word vectors into a single vector per document.

The output is therefore a dense 100-dimensional NumPy vector that represents each review as a whole rather than its individual words.

In [58]:
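def document_vector(doc):
    # key_to_index is a dict, so membership tests are O(1); index_to_key is a list and is scanned linearly
    words = [word for word in doc.split() if word in model.wv.key_to_index]
    if not words:  # no known words: fall back to a zero vector instead of averaging nothing
        return np.zeros(model.wv.vector_size)
    return np.mean(model.wv[words], axis=0)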
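
To make the averaging step concrete, here is a toy version that does not depend on the trained model. The 3-dimensional embedding table `toy_vectors` and the helper `average_vector` are invented for this sketch; the real vectors come from `model.wv`.

```python
# Hypothetical 3-dimensional embedding table standing in for model.wv.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "movie": [0.0, 2.0, 0.0],
    "plot": [2.0, 1.0, 1.0],
}

def average_vector(doc, vectors, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in doc.split() if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) walks the vectors dimension by dimension.
    return [sum(col) / len(known) for col in zip(*known)]

print(average_vector("good movie unknownword", toy_vectors))  # [0.5, 1.0, 1.0]
print(average_vector("nothing known here", toy_vectors))      # [0.0, 0.0, 0.0]
```

Averaging discards word order, so two reviews with the same words in a different order map to the same vector.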

In [63]:
document_vector(df['review'].values[0])  # Vector of first review

array([-0.24340941,  0.0974402 ,  0.14118783,  0.0894749 ,  0.4068178 ,
       -0.4971846 ,  0.26817816,  0.8801212 ,  0.04252971, -0.26325732,
        0.20369796, -0.27985188, -0.07315441, -0.01975162,  0.17517652,
       -0.22435841, -0.15286991, -0.42718917, -0.11918462, -0.16535847,
        0.12286358,  0.14350392,  0.21750571, -0.26108095, -0.06002067,
       -0.07397361, -0.27560085, -0.25159037,  0.04296714, -0.09859373,
        0.53709775, -0.21823308, -0.07311647, -0.3404463 , -0.05324528,
        0.5026663 , -0.06849717, -0.09941732, -0.2638779 , -0.8243812 ,
        0.1649398 ,  0.15742548,  0.14393142, -0.20089853,  0.31093985,
       -0.32286084, -0.15292507, -0.11259509,  0.43408877,  0.16718286,
        0.4881521 , -0.48661765,  0.07430241, -0.22639064, -0.25184977,
        0.1418541 ,  0.2218259 , -0.03439002, -0.25156587,  0.0508279 ,
        0.01320729,  0.28209937,  0.03631376,  0.07847692, -0.3966407 ,
        0.33750236, -0.03413524,  0.31342846, -0.46036786,  0.25

In [64]:
from tqdm import tqdm  # tqdm is a Python library used to display progress bars for loops

In [65]:
X = []
for doc in tqdm(df['review'].values):
    X.append(document_vector(doc))

100%|██████████| 19926/19926 [14:39<00:00, 22.67it/s]


In [70]:
X = np.array(X)

In [71]:
X[0]

array([-0.24340941,  0.0974402 ,  0.14118783,  0.0894749 ,  0.4068178 ,
       -0.4971846 ,  0.26817816,  0.8801212 ,  0.04252971, -0.26325732,
        0.20369796, -0.27985188, -0.07315441, -0.01975162,  0.17517652,
       -0.22435841, -0.15286991, -0.42718917, -0.11918462, -0.16535847,
        0.12286358,  0.14350392,  0.21750571, -0.26108095, -0.06002067,
       -0.07397361, -0.27560085, -0.25159037,  0.04296714, -0.09859373,
        0.53709775, -0.21823308, -0.07311647, -0.3404463 , -0.05324528,
        0.5026663 , -0.06849717, -0.09941732, -0.2638779 , -0.8243812 ,
        0.1649398 ,  0.15742548,  0.14393142, -0.20089853,  0.31093985,
       -0.32286084, -0.15292507, -0.11259509,  0.43408877,  0.16718286,
        0.4881521 , -0.48661765,  0.07430241, -0.22639064, -0.25184977,
        0.1418541 ,  0.2218259 , -0.03439002, -0.25156587,  0.0508279 ,
        0.01320729,  0.28209937,  0.03631376,  0.07847692, -0.3966407 ,
        0.33750236, -0.03413524,  0.31342846, -0.46036786,  0.25

In [77]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [78]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [79]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)

In [80]:
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)
# ~81% accuracy

0.8060712493728048