# ***Engr. Muhammad Javed***

# 1. Text Cleaning

In this notebook, we will cover the core steps of text cleaning:
1. Lowercasing
2. Remove punctuation
3. Remove numbers
4. Remove HTML tags
5. Remove URLs
6. Remove emojis & special characters
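
The six steps above can be chained into a single function. The following is a minimal sketch rather than the notebook's own code: `clean_text` is a hypothetical name, and the ordering is an assumption. It strips HTML tags and URLs before punctuation, because both patterns depend on characters (`<`, `>`, `:`, `/`, `.`) that punctuation removal deletes.

```python
import re
import string

# Hypothetical helper: chains the six cleaning steps in one pass.
def clean_text(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r'<.*?>', '', text)                    # 4. HTML tags (before punctuation)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)    # 5. URLs (before punctuation)
    text = text.translate(
        str.maketrans('', '', string.punctuation))       # 2. punctuation
    text = re.sub(r'\d+', '', text)                      # 3. numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)              # 6. emojis & special characters
    return text

# Split on whitespace to ignore the gaps left behind by removed tokens
print(clean_text("<p>Visit https://example.com NOW!!! 2 times</p>").split())
```

Removed tokens leave extra spaces behind, which is why the usage line splits the result before printing it.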

In [1]:
import pandas as pd
import re
import string

## Load Dataset

In [2]:
# Read the dataset.
# Assumption: the files have no header row and use ';' as the field separator.
try:
    df_train = pd.read_csv('../Dataset/train.txt', sep=';', names=['text', 'emotion'])
    df_test = pd.read_csv('../Dataset/test.txt', sep=';', names=['text', 'emotion'])
except Exception as e:
    print(f"Error loading data: {e}")
    # Fallback: retry assuming tab-separated fields
    df_train = pd.read_csv('../Dataset/train.txt', sep='\t', names=['text', 'emotion'])
    df_test = pd.read_csv('../Dataset/test.txt', sep='\t', names=['text', 'emotion'])

print("Train Shape:", df_train.shape)
df_train.head()

Train Shape: (16000, 2)


| | text | emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |
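
One caveat when guessing the separator: a wrong `sep` usually does not raise an exception. pandas typically reads each line as a single field and fills `emotion` with NaN. The sketch below checks for that instead of relying on an error. `parse_emotion_text` is a hypothetical helper, and the two-line sample is made up for illustration.

```python
import io
import pandas as pd

def parse_emotion_text(raw, separators=(';', '\t')):
    """Try each separator; accept the first that fills the 'emotion' column."""
    for sep in separators:
        df = pd.read_csv(io.StringIO(raw), sep=sep, names=['text', 'emotion'])
        # A mismatched separator leaves 'emotion' empty rather than failing
        if df['emotion'].notna().all():
            return df
    raise ValueError("No candidate separator produced a complete 'emotion' column")

# Made-up, tab-separated sample: ';' is tried first and rejected
sample = "i feel fine\tjoy\ni am feeling grouchy\tanger\n"
df_sample = parse_emotion_text(sample)
print(df_sample.shape)
```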


## 1. Lowercasing

In [3]:
def to_lowercase(text):
    return text.lower()

# Apply to a sample
sample_text = "This is a Sample Text with Mixed Case."
print("Original:", sample_text)
print("Lowercased:", to_lowercase(sample_text))

# Apply to dataset
df_train['text_clean'] = df_train['text'].apply(to_lowercase)

Original: This is a Sample Text with Mixed Case.
Lowercased: this is a sample text with mixed case.
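
If a column may contain missing values, `.apply(to_lowercase)` fails on them because a missing value has no `.lower()` method. A small sketch of the vectorized alternative, `Series.str.lower()`, which passes missing entries through; the three-item series is made up for illustration.

```python
import pandas as pd

# Made-up sample with one missing entry
texts = pd.Series(["Mixed Case", None, "ALL CAPS"])

# .str.lower() lowercases strings and leaves missing values missing
lowered = texts.str.lower()
print(lowered.tolist())
```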


## 2. Remove Punctuation

In [4]:
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

sample_text = "Hello! How are you doing? This is clean-up time."
print("Original:", sample_text)
print("No Punctuation:", remove_punctuation(sample_text))

df_train['text_clean'] = df_train['text_clean'].apply(remove_punctuation)

Original: Hello! How are you doing? This is clean-up time.
No Punctuation: Hello How are you doing This is cleanup time
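
Note that deleting punctuation fuses hyphenated words ("clean-up" becomes "cleanup"). If that is not wanted, one option is to map each punctuation character to a space instead. This is a sketch of that variant, with `punctuation_to_space` as a hypothetical name; it leaves extra spaces that a later whitespace step would need to tidy.

```python
import string

# Translation table: every punctuation character maps to a single space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punctuation_to_space(text):
    return text.translate(_PUNCT_TO_SPACE)

print(repr(punctuation_to_space("This is clean-up time.")))
```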


## 3. Remove Numbers

In [5]:
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

sample_text = "I bought 2 apples for 10 dollars."
print("Original:", sample_text)
print("No Numbers:", remove_numbers(sample_text))

df_train['text_clean'] = df_train['text_clean'].apply(remove_numbers)

Original: I bought 2 apples for 10 dollars.
No Numbers: I bought  apples for  dollars.
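
The output above keeps the double spaces where the digits used to be. A minimal sketch of a variant that also normalizes whitespace; `remove_numbers_tidy` is a hypothetical name, not part of the notebook.

```python
import re

def remove_numbers_tidy(text):
    no_digits = re.sub(r'\d+', '', text)
    # Collapse any run of whitespace to one space and trim the ends
    return re.sub(r'\s+', ' ', no_digits).strip()

print(remove_numbers_tidy("I bought 2 apples for 10 dollars."))
```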


## 4. Remove HTML Tags

In [6]:
def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

sample_text = "<div><h1>Title</h1><p>This is a paragraph.</p></div>"
print("Original:", sample_text)
print("No HTML:", remove_html_tags(sample_text))

# Note: step 2 already stripped '<' and '>' from text_clean, so this call is a
# no-op here; to actually remove tags, run it before remove_punctuation
df_train['text_clean'] = df_train['text_clean'].apply(remove_html_tags)

Original: <div><h1>Title</h1><p>This is a paragraph.</p></div>
No HTML: TitleThis is a paragraph.
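
In the output above, "Title" and "This" run together because each tag is replaced with nothing. A hedged sketch of a variant that substitutes a space, decodes entities such as `&amp;` with the standard-library `html.unescape`, and then tidies whitespace. `strip_html` is a hypothetical name, and a regex like this is only a rough approximation of real HTML parsing.

```python
import html
import re

def strip_html(text):
    no_tags = re.sub(r'<[^>]+>', ' ', text)   # each tag becomes a space
    decoded = html.unescape(no_tags)          # e.g. '&amp;' becomes '&'
    return ' '.join(decoded.split())          # collapse and trim whitespace

print(strip_html("<div><h1>Title</h1><p>This is a paragraph.</p></div>"))
```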


## 5. Remove URLs

In [7]:
def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

sample_text = "Check out this link: https://example.com"
print("Original:", sample_text)
print("No URL:", remove_urls(sample_text))

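# Note: step 2 already stripped ':', '/' and '.' from text_clean, so URLs no longer
# match here; to actually remove them, run this before remove_punctuation
df_train['text_clean'] = df_train['text_clean'].apply(remove_urls)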

Original: Check out this link: https://example.com
No URL: Check out this link: 
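
To see what this pattern does and does not catch, `re.findall` can list the matches on a test string. The sketch below uses a made-up sentence; it suggests that links starting with `http://`, `https://`, or `www.` are matched, while a bare domain is not.

```python
import re

URL_PATTERN = r'https?://\S+|www\.\S+'

# Made-up sample containing three link styles
sample = "see https://example.com/a?b=1 and www.test.org or example.net"
matches = re.findall(URL_PATTERN, sample)
print(matches)
```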


## 6. Remove Emojis & Special Characters

In [8]:
def remove_special_characters(text):
    # Keep only alphabets and spaces
    return re.sub(r'[^a-zA-Z\s]', '', text)

sample_text = "This is amazing! 😀 @user #cool"
print("Original:", sample_text)
print("No Special Chars:", remove_special_characters(sample_text))

df_train['text_clean'] = df_train['text_clean'].apply(remove_special_characters)

Original: This is amazing! 😀 @user #cool
No Special Chars: This is amazing  user cool
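
Because the pattern keeps only `a-z` and `A-Z`, it also drops accented and non-Latin letters ("café" becomes "caf"). If such text matters, one possible alternative is to rely on `\w`, which matches Unicode letters for `str` patterns in Python 3. This is a sketch under that assumption, with `remove_symbols_keep_letters` as a hypothetical name.

```python
import re

def remove_symbols_keep_letters(text):
    # [^\w\s] drops anything that is not a letter, digit, underscore, or whitespace;
    # [\d_] then drops digits and underscores, leaving letters and whitespace
    return re.sub(r'[^\w\s]|[\d_]', '', text)

print(remove_symbols_keep_letters("café is amazing! 😀 #cool"))
```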
