
<center>

# 02 – Logistic Regression for Natural Language Processing (NLP)
</center>

---

This document explains the **Logistic Regression NLP module**. The project demonstrates how logistic regression, a classical and interpretable machine learning algorithm, can be applied to **text classification problems** in Natural Language Processing (NLP).

The module covers the complete workflow, from raw text preprocessing to model training and evaluation, making it suitable for both learning and practical implementation.


## Objectives

The key objectives of this module are:

* To understand how text data can be transformed into numerical features
* To apply logistic regression for NLP classification tasks
* To evaluate model performance using standard metrics
* To highlight the strengths and limitations of logistic regression in NLP

## Problem Statement

Text data is inherently unstructured and cannot be directly used by machine learning algorithms. The challenge is to:

1. Convert text into meaningful numerical representations
2. Train a model that can classify text accurately

Typical use cases include:

* Sentiment analysis
* Spam detection
* Topic classification
* Review polarity classification

## Dataset Description

The dataset used in this module consists of:

* **Text features**: Raw textual data (sentences, reviews, messages, etc.)
* **Target labels**: Binary or multi-class labels corresponding to each text sample

Example:

| Text                     | Label    |
| ------------------------ | -------- |
| "This movie was amazing" | Positive |
| "Worst experience ever"  | Negative |

## Text Preprocessing

Text preprocessing is a crucial step in NLP. The following steps are performed:

* Lowercasing text
* Removing punctuation and special characters
* Removing stopwords
* Tokenization
* Optional lemmatization or stemming

These steps help reduce noise and improve model performance.
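
The steps listed above can be sketched as one small function. This is a minimal illustration using only the standard library; the `STOPWORDS` set and the `preprocess` name are hypothetical, and the notebook itself uses NLTK's English stopword list rather than this hand-picked one.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"i", "a", "the", "to", "so", "from", "and", "is", "was", "this"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("This movie was AMAZING!!!"))  # ['movie', 'amazing']
```

Lemmatization or stemming would be an extra step applied to each token after stopword removal.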

## Feature Extraction

Since logistic regression works with numerical data, text is converted into vectors using:

* **Bag of Words (BoW)**
* **TF-IDF (Term Frequency–Inverse Document Frequency)**

TF-IDF is preferred as it reduces the impact of frequently occurring but less informative words.
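
To make the down-weighting concrete, here is a small pure-Python sketch of TF-IDF on an invented three-document corpus. It assumes the smoothed IDF form `ln((1 + n) / (1 + df)) + 1`, which is the form scikit-learn uses by default, and it skips the L2 normalization that `TfidfVectorizer` applies afterwards. The names `doc_freq`, `idf`, and `tfidf` are illustrative.

```python
import math
from collections import Counter

# Invented toy corpus, for illustration only.
docs = [
    "the movie was amazing",
    "the movie was boring",
    "the worst experience ever",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc_tokens):
    # Raw term count multiplied by IDF (no normalization in this sketch).
    tf = Counter(doc_tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {w:.3f}")
```

"the" occurs in every document and gets the smallest weight, while "amazing" occurs in only one document and gets the largest.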


## Logistic Regression Model

### Model Explanation

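Logistic regression is a **linear classification algorithm** that estimates class probabilities by passing a weighted sum of the input features through the logistic (sigmoid) function. With more than two classes, as in the six-emotion dataset used in this module, the multi-class form typically replaces the sigmoid with its generalization, the softmax function.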

For NLP tasks:

* Each vocabulary term becomes a feature
* The model learns one weight per term, indicating how strongly that term pushes a prediction toward a class
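
A minimal sketch of the binary case, assuming made-up weights: the score is the bias plus the weighted sum of term features, and the sigmoid turns that score into a probability. The weight values and the `predict_proba` helper are hypothetical and are not taken from the trained model in this module.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights, for illustration only.
weights = {"amazing": 2.0, "worst": -2.5, "movie": 0.1}
bias = 0.0

def predict_proba(features):
    """Probability of the positive class for a {term: value} feature dict."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

p_pos = predict_proba({"movie": 1.0, "amazing": 1.0})
p_neg = predict_proba({"worst": 1.0})
print(round(p_pos, 3), round(p_neg, 3))
```

Terms with positive weights raise the probability above 0.5, and terms with negative weights lower it.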

### Why Logistic Regression for NLP?

* Simple and interpretable
* Computationally efficient
* Performs well on high-dimensional sparse data
* Strong baseline for text classification


## Model Training

Steps involved:

1. Split the dataset into training and testing sets
2. Train the logistic regression model on vectorized text data
3. Optimize parameters such as regularization strength
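
One common way to carry out step 3 is a cross-validated grid search over `C`, scikit-learn's inverse regularization strength (smaller `C` means stronger regularization). The sketch below uses an invented toy corpus and an arbitrary grid. The notebook itself trains with the default `C`, so this is an assumption about how tuning could be done, not a record of what was run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus, for illustration only.
texts = [
    "great movie loved it", "amazing film wonderful acting",
    "loved the story great cast", "wonderful experience amazing",
    "terrible movie hated it", "worst film awful acting",
    "hated the story boring cast", "awful experience terrible",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary or IDF statistics leak from validation data.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # arbitrary example grid
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```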


## Model Evaluation

The model is evaluated using the following metrics:

* Accuracy
* Precision
* Recall
* F1-score
* Confusion Matrix

These metrics help assess both overall performance and class-wise behavior.
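
As a worked example of how these metrics relate, the sketch below computes precision, recall, and F1 for one class directly from true and predicted labels. The label lists are invented and the helper name `precision_recall_f1` is illustrative. In the notebook the same quantities come from scikit-learn's `classification_report`.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1 computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision_recall_f1(y_true, y_pred, positive=1))
```

Here class 1 has 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 are all 0.75.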

## Results and Observations

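* Logistic regression on TF-IDF features reaches about 86% accuracy on the 3,200-sample test set, a strong baseline
* TF-IDF features often improve accuracy over raw word counts, although this notebook does not measure the difference
* The model is fast to train and easy to interpret
* Recall is weakest on the smaller classes (0.47 for class 3 and 0.61 for class 2), and performance may degrade further on highly complex or contextual language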

## Limitations

* Cannot capture deep semantic relationships
* Assumes linear decision boundaries
* Performance depends heavily on feature engineering
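
The first limitation can be seen in a few lines: a unigram bag-of-words representation discards word order, so two sentences with opposite meanings can map to exactly the same feature vector. The sentences and the `bag_of_words` helper below are invented for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Unigram counts: word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the food was good not bad")
b = bag_of_words("the food was bad not good")

print(a == b)  # True: a linear model over these features cannot tell them apart
```

Adding n-gram features (for example bigrams such as "not bad") is a common partial remedy, at the cost of a larger vocabulary.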

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Each line of train.txt holds `text;emotion`, with no header row
df = pd.read_csv('train.txt', sep=';', header=None, names=['text', 'emotion'])

In [3]:
df.head()

|     | text                                              | emotion |
| --- | ------------------------------------------------- | ------- |
| 0   | i didnt feel humiliated                           | sadness |
| 1   | i can go from feeling so hopeless to so damned... | sadness |
| 2   | im grabbing a minute to post i feel greedy wrong  | anger   |
| 3   | i am ever feeling nostalgic about the fireplac... | love    |
| 4   | i am feeling grouchy                              | anger   |


In [4]:
df.isnull().sum()

text       0
emotion    0
dtype: int64

In [5]:
# Map each emotion label to an integer, in order of first appearance
unique_emotions = df['emotion'].unique()
emotion_numbers = {emo: i for i, emo in enumerate(unique_emotions)}

df['emotion'] = df['emotion'].map(emotion_numbers)
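
Because the integer codes depend on the order in which labels first appear, it helps to keep an inverse mapping for turning predictions back into emotion names. The sketch below rebuilds the mapping for only the five labels shown by `df.head()`. The `number_to_emotion` name is hypothetical, and the codes for the remaining emotions are not shown in this section, so they are left out.

```python
# First five labels as displayed by df.head() above.
labels = ["sadness", "sadness", "anger", "love", "anger"]

# Same idea as the notebook cell: number labels by first appearance.
emotion_numbers = {}
for emo in labels:
    if emo not in emotion_numbers:
        emotion_numbers[emo] = len(emotion_numbers)

# Inverse mapping, used to decode integer predictions.
number_to_emotion = {idx: emo for emo, idx in emotion_numbers.items()}

encoded = [emotion_numbers[emo] for emo in labels]
decoded = [number_to_emotion[idx] for idx in encoded]
print(encoded)  # [0, 0, 1, 2, 1]
print(decoded)
```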

In [6]:
df

|       | text                                              | emotion |
| ----- | ------------------------------------------------- | ------- |
| 0     | i didnt feel humiliated                           | 0       |
| 1     | i can go from feeling so hopeless to so damned... | 0       |
| 2     | im grabbing a minute to post i feel greedy wrong  | 1       |
| 3     | i am ever feeling nostalgic about the fireplac... | 2       |
| 4     | i am feeling grouchy                              | 1       |
| ...   | ...                                               | ...     |
| 15995 | i just had a very brief time in the beanbag an... | 0       |
| 15996 | i am now turning and i feel pathetic that i am... | 0       |
| 15997 | i feel strong and good overall                    | 5       |
| 15998 | i feel like this was such a rude comment and i... | 1       |


### Text Preprocessing

In [7]:
df['text'] = df['text'].str.lower()

In [8]:
# Remove punctuation
import string

def remove_punc(txt):
  return txt.translate(str.maketrans('','',string.punctuation))

In [9]:
df['text'] = df['text'].apply(remove_punc)

In [10]:
# Remove digits
def remove_numbers(txt):
    return ''.join(ch for ch in txt if not ch.isdigit())

df['text'] = df['text'].apply(remove_numbers)

In [11]:
# Remove emojis by dropping every non-ASCII character
# (this also strips accented letters and other non-ASCII symbols)
def remove_emojis(txt):
    return ''.join(ch for ch in txt if ch.isascii())

df['text'] = df['text'].apply(remove_emojis)

In [12]:
import nltk

In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [14]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
stop_words = set(stopwords.words('english'))

In [16]:
df.loc[1, 'text']

'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake'

In [17]:
def remove(txt):
    """Drop English stopwords from a whitespace-separated string."""
    words = txt.split()
    cleaned = [w for w in words if w not in stop_words]
    return ' '.join(cleaned)

In [18]:
df['text'] = df['text'].apply(remove)

In [19]:
df.loc[1, 'text']

'go feeling hopeless damned hopeful around someone cares awake'

In [20]:
df.head()

|     | text                                              | emotion |
| --- | ------------------------------------------------- | ------- |
| 0   | didnt feel humiliated                             | 0       |
| 1   | go feeling hopeless damned hopeful around some... | 0       |
| 2   | im grabbing minute post feel greedy wrong         | 1       |
| 3   | ever feeling nostalgic fireplace know still pr... | 2       |
| 4   | feeling grouchy                                   | 1       |


In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['emotion'], test_size=0.20, random_state=42)
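
The split above is purely random. With imbalanced classes, a stratified split is a common alternative that keeps class proportions the same in both sets. The sketch below shows the `stratify` argument on invented toy labels. It is a suggestion, not what the notebook ran.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented, imbalanced toy data: 15 samples of class 0 and 5 of class 1.
texts = [f"sample {i}" for i in range(20)]
labels = [0] * 15 + [1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels
)

# Both splits keep the original 3:1 class ratio.
print(Counter(y_tr), Counter(y_te))
```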

### Feature Extraction (TF-IDF)

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

### Train Logistic Regression Model

In [24]:
from sklearn.linear_model import LogisticRegression

In [25]:
logistic_model = LogisticRegression(max_iter=1000)

In [26]:
logistic_model.fit(X_train_tfidf,y_train)

In [27]:
log_pred = logistic_model.predict(X_test_tfidf)

### Evaluate Model

In [28]:
from sklearn.metrics import accuracy_score, classification_report

In [29]:
print(accuracy_score(y_test, log_pred))

0.8628125


In [30]:
print("Accuracy:", accuracy_score(y_test, log_pred))
print("\nClassification Report:\n", classification_report(y_test, log_pred))

Accuracy: 0.8628125

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.94      0.92       946
           1       0.90      0.81      0.86       427
           2       0.90      0.61      0.73       296
           3       0.88      0.47      0.61       113
           4       0.86      0.76      0.81       397
           5       0.81      0.96      0.88      1021

    accuracy                           0.86      3200
   macro avg       0.88      0.76      0.80      3200
weighted avg       0.87      0.86      0.86      3200
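
The gap between the macro and weighted averages in the report comes from class imbalance. The macro average treats all six classes equally, while the weighted average weights each class by its support. The sketch below recomputes both recall averages from the rounded per-class values printed above, so it matches the report only to two decimals.

```python
# Per-class recall and support copied from the classification report above.
recall = [0.94, 0.81, 0.61, 0.47, 0.76, 0.96]
support = [946, 427, 296, 113, 397, 1021]

# Macro average: plain mean over classes, so small classes count as much as large ones.
macro_recall = sum(recall) / len(recall)

# Weighted average: each class contributes in proportion to its number of test samples.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(round(macro_recall, 2), round(weighted_recall, 2))  # 0.76 0.86
```

The low recall on classes 2 and 3 pulls the macro average down to 0.76, but those classes hold only 409 of the 3,200 test samples, so the weighted average stays at 0.86.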




## Conclusion

This module demonstrates that logistic regression is an effective and reliable algorithm for NLP classification tasks, especially as a baseline model. Despite its simplicity, it offers strong performance, interpretability, and efficiency, making it an excellent starting point for NLP projects.

### 🔗 References / Resources:
1. [Scikit-learn Logistic Regression Documentation](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
2. [NLTK Documentation](https://www.nltk.org/)

<div style="text-align: right;">
    <b>Author:</b> Monower Hossen <br>
    <b>Date:</b> January 14, 2026
</div>
