## Assignment: Text Classification

### Problem Statement
We want to train a machine learning model that can read a sentence and decide which category it belongs to. The categories are sports, politics, tech, food, and entertainment. The text in our dataset is not clean. It may have spelling mistakes, random capital letters, extra spaces, slang words, or emojis. This makes the task more challenging. The goal is to build a model that can still predict the correct category even with messy text.

### Dataset
The dataset for this assignment is provided in CSV format and can be downloaded from the following link:
[Data Source](https://drive.google.com/file/d/1TvapWSyJWXj3cvO3GQVxEQmZKvnedrFi/view?usp=sharing)

### Tasks for Students
1.	Download the dataset using the link above.

2.	Explore the data and look at some examples of messy text.

3.	Clean the text by handling issues such as extra spaces, random casing, emojis, or slang.

4.	Convert the text into a machine-readable format (Bag of Words, TF-IDF, or embeddings).

5.	Train a text classification model using any algorithm you know (Logistic Regression, Naive Bayes, or a simple Neural Network).

6.	Test your model and report the accuracy.

7.	Write a short conclusion explaining how messy text affects classification.

### Submission Instructions
1. Save your code and results in a folder.

2. Submit the folder, along with a PDF export of your notebook, to the Drive.

3. Include a brief report describing your approach, challenges faced, and results.

### Expected Learning Outcomes
By completing this assignment, you will learn how to:
1. Work with messy and unstructured text data
2. Clean and prepare text for machine learning
3. Build and evaluate a simple text classification model



In [19]:
# Import libraries
import pandas as pd
import numpy as np
import re
import emoji
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [20]:
df = pd.read_csv('./Dataset/text_classifcation.csv')
print(df.head())
print(df['label'].value_counts())

                                                text   label
0  DEbATinG IF BuRgER🍔 Or bIRYanI is THe TRUe kIn...    food
1  LATEst SMartpHONE bY opeNai dROPpEd tOdAy 🔥 wi...    tech
2  cRicKet COMmeNTArY FelT bIasEd SmH BUT sTILL W...  sports
3  sOfTwaRE upDatE HaD BuGZzZ again 😂 usErs on Tw...    tech
4  soFTwarE updatE Had bugZzz AGAIN 😂 useRs On Tw...    tech
food             2000
tech             2000
sports           2000
politics         2000
entertainment    2000
Name: label, dtype: int64
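
Before cleaning, it can help to measure how messy each sentence is (Task 2). The sketch below is one possible way to do that with the standard library only; `messiness_stats` and the sample sentence are hypothetical names and data introduced for illustration, not part of the dataset or of any library.

```python
def messiness_stats(text: str) -> dict:
    """Return a few rough indicators of how noisy a sentence is."""
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    return {
        # Share of alphabetic characters that are uppercase
        "upper_ratio": upper / len(letters) if letters else 0.0,
        # Count of characters outside ASCII (emojis, accented letters, ...)
        "non_ascii": sum(1 for ch in text if ord(ch) > 127),
        # True if the text has leading, trailing, or repeated whitespace
        "extra_spaces": text != " ".join(text.split()),
    }

# Illustrative sentence in the style of the preview above (not a real row)
sample = "sOfTwaRE  upDatE HaD BuGZzZ again 😂"
stats = messiness_stats(sample)
print(stats)
```

Applied to the real data, something like `df['text'].apply(messiness_stats)` would give one dictionary per row, which can then be summarized per label.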


In [21]:
def clean_text(text):
    if not isinstance(text, str):
        return ""
    
    # Lowercase
    text = text.lower()
    
    # Remove emojis
    text = emoji.replace_emoji(text, replace="")
    
    # Remove URLs, mentions, hashtags
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    
    # Collapse runs of 3+ identical characters to one (e.g. bugzzz → bugz)
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    
    # Remove special characters
    text = re.sub(r"[^a-z\s]", " ", text)
    
    # Remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()
    
    return text

df['clean_text'] = df['text'].apply(clean_text)
print(df[['text', 'clean_text']].head())


                                                text  \
0  DEbATinG IF BuRgER🍔 Or bIRYanI is THe TRUe kIn...   
1  LATEst SMartpHONE bY opeNai dROPpEd tOdAy 🔥 wi...   
2  cRicKet COMmeNTArY FelT bIasEd SmH BUT sTILL W...   
3  sOfTwaRE upDatE HaD BuGZzZ again 😂 usErs on Tw...   
4  soFTwarE updatE Had bugZzz AGAIN 😂 useRs On Tw...   

                                          clean_text  
0  debating if burger or biryani is the true king...  
1  latest smartphone by openai dropped today with...  
2  cricket commentary felt biased smh but still w...  
3  software update had bugz again users on twitte...  
4  software update had bugz again users on twitte...  
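
The preview shows `bugzzz` becoming `bugz`, not `bug`: the pattern `(.)\1{2,}` only collapses a run of three or more identical characters down to a single character. The short sketch below isolates that one step so its behavior is easy to check; `squeeze_runs` is a hypothetical helper name used here for illustration.

```python
import re

# Same pattern as in clean_text: any character followed by 2+ copies of itself
ELONGATED = re.compile(r"(.)\1{2,}")

def squeeze_runs(text: str) -> str:
    """Collapse runs of three or more identical characters to one."""
    return ELONGATED.sub(r"\1", text)

for word in ["bugzzz", "soooo", "good", "www"]:
    print(word, "->", squeeze_runs(word))
# bugzzz -> bugz
# soooo -> so
# good -> good
# www -> w
```

Double letters such as the `oo` in `good` are left alone, which keeps ordinary spelling intact, but a legitimate triple such as `www` is also squeezed.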


In [22]:
X = df['clean_text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
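
Rows 3 and 4 of the preview differ only in casing, so they look identical after cleaning (at least in the truncated preview). If the dataset contains many such pairs, a random split can place the same cleaned sentence in both the training and the test set, which would inflate the test score. The sketch below is one way to measure that overlap; `overlap_fraction` and the toy lists are hypothetical and only illustrate the idea.

```python
def overlap_fraction(train_texts, test_texts) -> float:
    """Fraction of test texts that also appear verbatim in the training texts."""
    train_set = set(train_texts)
    test_list = list(test_texts)
    if not test_list:
        return 0.0
    return sum(t in train_set for t in test_list) / len(test_list)

# Toy data standing in for X_train and X_test
toy_train = ["software update had bugz again", "cricket commentary felt biased"]
toy_test = ["software update had bugz again", "latest smartphone dropped today"]
print(overlap_fraction(toy_train, toy_test))  # 0.5
```

On the real data this would be called as `overlap_fraction(X_train, X_test)`. If the fraction is high, one option is to call `df.drop_duplicates(subset='clean_text')` before splitting so that each cleaned sentence appears only once.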


In [23]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


In [24]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

LogisticRegression(max_iter=1000)

In [25]:
y_pred = clf.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 1.0

Classification Report:
                precision    recall  f1-score   support

entertainment       1.00      1.00      1.00       400
         food       1.00      1.00      1.00       400
     politics       1.00      1.00      1.00       400
       sports       1.00      1.00      1.00       400
         tech       1.00      1.00      1.00       400

     accuracy                           1.00      2000
    macro avg       1.00      1.00      1.00      2000
 weighted avg       1.00      1.00      1.00      2000
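
An accuracy of exactly 1.0 on noisy text is worth double-checking before reporting it. One possible check, sketched below, is to wrap the vectorizer and classifier in a single pipeline and score it with stratified cross-validation, so that TF-IDF is refit on each training fold. The toy sentences and labels are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for df['clean_text'] and df['label']
toy_texts = [
    "cricket match ended in a draw", "the striker scored a late goal",
    "team wins the cricket final", "coach praised the goal keeper",
    "match delayed by rain", "final score was close",
    "new phone launched today", "software update fixed bugs",
    "laptop battery lasts longer", "app update added features",
    "phone camera improved", "software bugs annoyed users",
]
toy_labels = ["sports"] * 6 + ["tech"] * 6

# Same settings as the notebook cells, combined into one estimator
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=5000),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, toy_texts, toy_labels, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

If the cross-validated score on the real data stays at 1.0 even after removing duplicate cleaned sentences, the categories are probably separable by a few strong keywords, which is itself a useful point for the conclusion.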

