<a href="https://colab.research.google.com/github/Maureenchepkirui/nlp-text-classification-capstone/blob/main/NLP_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Classification of SMS Messages Using Natural Language Processing

**Author:** Maureen Chepkirui  
**Environment:** Python (Google Colaboratory)  
**Dataset:** SMS Spam Collection (UCI Machine Learning Repository)

## Abstract
This project applies natural language processing (NLP) techniques to classify short text messages. Using TF-IDF vectorization and a logistic regression classifier, the study evaluates how well a simple model distinguishes spam from non-spam (ham) messages in the SMS Spam Collection dataset. Results, limitations, and future directions are discussed.


## 1. Introduction

Text classification is a common NLP task in which a model assigns a category to a piece of text. This project uses real SMS messages and classifies each one as spam or ham (non-spam). The goal is to demonstrate applied NLP methods, including text cleaning, feature extraction, and machine learning classification.


In [1]:
import pandas as pd
import numpy as np
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


## 2. Dataset Loading

The SMS Spam Collection dataset was obtained from the UCI Machine Learning Repository and loaded directly into the Python environment.


In [5]:
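import urllib.request
import zipfile

# Download the zipped SMS Spam Collection from the UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
urllib.request.urlretrieve(url, "sms_spam.zip")

# Extract the archive into a local folder
with zipfile.ZipFile("sms_spam.zip", "r") as zip_ref:
    zip_ref.extractall("sms_data")

# The data file is tab-separated with no header row: label, then message text
df = pd.read_csv(
    "sms_data/SMSSpamCollection",
    sep="\t",
    header=None,
    names=["label", "text"]
)

df.head()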


|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 3. Dataset Description

The dataset consists of SMS messages labeled as either spam or ham (non-spam). Each observation contains a short text message and its corresponding class label, providing a supervised learning setup for text classification.


## 4. Text Preprocessing

Text preprocessing was applied to normalize the SMS messages: text was converted to lowercase, URLs were removed, and every character other than a letter or whitespace (including digits and punctuation) was dropped.


In [6]:
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)      # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)     # keep letters and spaces only
    return text

df["clean_text"] = df["text"].apply(clean_text)

df[["label", "clean_text"]].head()


|   | label | clean_text |
|---|-------|------------|
| 0 | ham | go until jurong point crazy available only in ... |
| 1 | ham | ok lar joking wif u oni |
| 2 | spam | free entry in a wkly comp to win fa cup final... |
| 3 | ham | u dun say so early hor u c already then say |
| 4 | ham | nah i dont think he goes to usf he lives aroun... |
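
To make the effect of these cleaning rules concrete, the sketch below applies the same `clean_text` function to a single made-up message. The sample string is hypothetical and not taken from the dataset.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)      # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)     # keep letters and whitespace only
    return text

# Hypothetical message, written only for this illustration
sample = "WINNER!! Claim £1000 now at http://example.com/prize or call 09061701461"
cleaned = clean_text(sample)

# Whitespace is not collapsed, so removed tokens leave double spaces behind;
# splitting on whitespace shows the surviving words.
print(cleaned.split())
```

Note that the prize amount and the phone number disappear entirely. Digits and currency symbols can be informative for spam detection, so this choice may discard some useful signal.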


## 5. Feature Extraction

TF-IDF (Term Frequency–Inverse Document Frequency) vectorization was applied to transform the cleaned text into numerical features suitable for machine learning models.


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    stop_words="english",
    max_features=5000
)

X = tfidf.fit_transform(df["clean_text"])
y = df["label"]

X.shape, y.shape


((5572, 5000), (5572,))

## 6. Train–Test Split

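The dataset was split 70/30 into training and testing subsets to evaluate model performance on unseen data. The split was stratified by label so that both subsets preserve the overall spam-to-ham ratio.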


In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape


((3900, 5000), (1672, 5000))
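
The effect of `stratify` can be checked on a small synthetic label list. The sketch below uses 100 placeholder messages with 13 spam labels, a ratio chosen to resemble the roughly 13% spam share in the test set reported later; the data are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 87 ham and 13 spam placeholders
labels = ["ham"] * 87 + ["spam"] * 13
texts = [f"message {i}" for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.3,
    random_state=42,
    stratify=labels,   # keep the class ratio similar in both subsets
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

Without `stratify`, a random 30-message test set could by chance contain very few spam messages, which would make the spam metrics unstable.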

## 7. Model Training

A logistic regression classifier was trained using TF-IDF features to perform spam classification.


In [9]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
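
In the cells above, the TF-IDF vectorizer is fitted on all 5,572 messages before the split, so its vocabulary and IDF weights are influenced by the test messages. One possible alternative, sketched here on a tiny made-up corpus, is to split the raw text first and wrap both steps in a scikit-learn `Pipeline`, so that the vectorizer is fitted on training data only. The toy texts and the step names `"tfidf"` and `"clf"` are assumptions for illustration; this is not the notebook's original workflow and its results would differ from those reported.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for df["clean_text"] and df["label"]
texts = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "urgent call to claim prize",
    "are you coming home tonight",
    "see you at lunch tomorrow",
    "can we talk later today",
    "happy birthday have a great day",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first, then fit everything on the training portion only
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_texts, train_labels)      # vectorizer never sees test_texts
preds = pipe.predict(test_texts)
print(list(zip(test_texts, preds)))
```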


## 8. Model Evaluation

Model performance was evaluated using a confusion matrix and classification metrics.


In [10]:
from sklearn.metrics import confusion_matrix, classification_report

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Confusion Matrix:
[[1448    0]
 [  76  148]]

Classification Report:
              precision    recall  f1-score   support

         ham       0.95      1.00      0.97      1448
        spam       1.00      0.66      0.80       224

    accuracy                           0.95      1672
   macro avg       0.98      0.83      0.89      1672
weighted avg       0.96      0.95      0.95      1672
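
The figures in the classification report follow directly from the confusion matrix. The short sketch below recomputes them by hand from the four counts printed above (scikit-learn orders the labels alphabetically, so row and column 0 are ham and row and column 1 are spam).

```python
# Confusion matrix from the output above (rows = true label, columns = predicted)
cm = [[1448, 0],    # true ham:  predicted ham, predicted spam
      [76, 148]]    # true spam: predicted ham, predicted spam

tn, fp = cm[0]      # treating spam as the positive class
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
ham_precision = tn / (tn + fn)
ham_recall = tn / (tn + fp)

print(f"accuracy       {accuracy:.2f}")
print(f"spam precision {spam_precision:.2f}  recall {spam_recall:.2f}  f1 {spam_f1:.2f}")
print(f"ham  precision {ham_precision:.2f}  recall {ham_recall:.2f}")
```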



## 9. Results and Discussion

The logistic regression model achieved an overall accuracy of 95% (1,596 of 1,672 test messages). Recall for non-spam (ham) messages was 1.00: all 1,448 legitimate messages were classified correctly. Precision for spam was also 1.00, meaning that every message the model flagged as spam was in fact spam.

However, recall for spam was only 0.66: 76 of the 224 spam messages in the test set were classified as ham. This is consistent with the class imbalance in the dataset (spam makes up about 13% of the test set), which pushes an unweighted classifier with the default 0.5 decision threshold toward the majority class. The result illustrates the trade-off between precision and recall: the model raises no false alarms but misses roughly a third of the spam. Overall, TF-IDF features combined with a linear classifier provide a solid baseline for short-text classification, with spam recall as the main weakness.
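
The precision-recall trade-off can be illustrated by moving the decision threshold. The sketch below uses ten hypothetical spam probabilities (not outputs of the trained model) and compares a 0.5 threshold with a lower one; the helper `precision_recall` is written for this example.

```python
# Hypothetical true labels (1 = spam) and predicted spam probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_spam = [0.90, 0.60, 0.40, 0.35, 0.45, 0.20, 0.10, 0.05, 0.25, 0.15]

def precision_recall(y_true, p_spam, threshold):
    """Label a message spam when its probability is at least `threshold`."""
    y_pred = [1 if p >= threshold else 0 for p in p_spam]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(y_true, p_spam, threshold)
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

With a fitted scikit-learn model, such probabilities would come from `model.predict_proba(X_test)`, with the column order given by `model.classes_`. Another commonly used option, not evaluated here, is `LogisticRegression(class_weight="balanced")`.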


## 10. Limitations

This study relies on a bag-of-words representation, which does not capture semantic meaning or word order. Class imbalance likely contributed to the low recall for spam messages. In addition, the TF-IDF vectorizer was fitted on the full dataset before the train-test split, so the reported scores may be slightly optimistic. Future work could incorporate word embeddings or transformer-based architectures to improve contextual understanding.


## 11. Conclusion

This project demonstrated a complete natural language processing workflow for text classification using real-world data. Through text preprocessing, TF-IDF feature extraction, and logistic regression modeling, the classifier reached 95% accuracy on the test set, with perfect spam precision but a spam recall of 0.66. The project highlights the applicability of classical NLP techniques for practical classification tasks and provides a foundation for more advanced methods.
