### 🧹 Preprocessing Steps (in order):

1. Convert text to lowercase
2. Remove URLs (http... / www...)
3. Remove mentions (@username)
4. Strip the hashtag symbol but keep the word (#depression → depression)
5. Remove numbers and punctuation
6. Remove stopwords (such as "the", "is", "and")
7. Lemmatize (running → run, better → good)
8. Save the cleaned dataset to data/processed
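As a rough orientation, the text-cleaning steps listed above can be sketched as a single function that works on one string. This is only an illustration, not the notebook's actual implementation: the name `clean_tweet`, the sample tweet, and the three-word stopword set are made up for the example, and the lemmatizer is passed in as an optional callable so the sketch runs without NLTK.

```python
import re

def clean_tweet(text, stop_words=frozenset(), lemmatize=None):
    """Apply the cleaning steps, in order, to a single tweet (hypothetical helper)."""
    text = text.lower()                           # lowercase
    text = re.sub(r"http\S+|www\S+", "", text)    # remove URLs
    text = re.sub(r"@\w+", "", text)              # remove mentions
    text = text.replace("#", "")                  # drop '#', keep the hashtag word
    text = re.sub(r"[^a-z\s]", "", text)          # remove numbers and punctuation
    tokens = [t for t in text.split() if t not in stop_words]  # remove stopwords
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]   # lemmatize, if a callable is given
    return " ".join(tokens)

# Made-up sample tweet and a tiny illustrative stopword set
sample = "RT @user: The struggle is real... #Depression https://example.com 24/7"
print(clean_tweet(sample, stop_words={"the", "is", "and"}))
# rt struggle real depression
```

The cells below apply the same operations column-wise with pandas, one step per cell, which makes it easier to inspect the intermediate output.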

In [1]:
import pandas as pd
import re
import nltk

# NLTK resources (for stopword removal and lemmatization later)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kushagra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kushagra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kushagra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Step 1: Lowercase Conversion
Convert all letters to lowercase:

"yaar, ye movie bahut mast hai! i loved the acting."

Step 2: Remove Punctuation
Remove punctuation such as commas and exclamation marks:

"yaar ye movie bahut mast hai i loved the acting"

Step 3: Tokenization
Split the whole sentence into individual words (tokens):

["yaar", "ye", "movie", "bahut", "mast", "hai", "i", "loved", "the", "acting"]

Step 4: Stopwords Removal
Remove stopwords (such as "ye", "the", "hai", "i"):

["yaar", "movie", "bahut", "mast", "loved", "acting"]

Step 5: Lemmatization/Stemming
Reduce each word to its base (root) form:

["yaar", "movie", "bahut", "mast", "love", "act"]

Stemming only strips suffixes, whereas lemmatization uses a word's meaning and context to return the correct base word.
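The walkthrough above can be reproduced in a few lines of plain Python. Everything here is a toy stand-in chosen for illustration: the original sentence is assumed (the walkthrough starts from its lowercase form), the stopword set is hand-picked for this mixed Hindi-English example, `naive_stem` is a crude suffix stripper, and `LEMMAS` is a tiny lookup table standing in for a real lemmatizer.

```python
import re

# Assumed original sentence; the walkthrough only shows it after lowercasing
sentence = "Yaar, ye movie bahut mast hai! I loved the acting."

# Hand-picked stopwords for this example (not an NLTK list)
STOPWORDS = {"ye", "hai", "i", "the"}

# Toy lemma lookup standing in for a real lemmatizer
LEMMAS = {"loved": "love", "acting": "act"}

def naive_stem(word):
    """Strip a common suffix without checking that the result is a real word."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

lowered = sentence.lower()                             # Step 1
no_punct = re.sub(r"[^a-z\s]", "", lowered)            # Step 2
tokens = no_punct.split()                              # Step 3
filtered = [t for t in tokens if t not in STOPWORDS]   # Step 4
stemmed = [naive_stem(t) for t in filtered]            # Step 5, stemming
lemmatized = [LEMMAS.get(t, t) for t in filtered]      # Step 5, lemmatization

print(stemmed)     # ['yaar', 'movie', 'bahut', 'mast', 'lov', 'act']
print(lemmatized)  # ['yaar', 'movie', 'bahut', 'mast', 'love', 'act']
```

The stemmer turns "loved" into "lov", which is not a word, while the lookup returns "love". That is the practical difference between the two approaches.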

In [2]:
# Load the snapshot saved in Notebook 1
df = pd.read_csv("data/processed/twitter_depression_eda_snapshot.csv")

print("Shape of loaded dataset:", df.shape)
df.head()


Shape of loaded dataset: (20000, 2)


                                           post_text  label
0  It's just over 2 years since I was diagnosed w...      1
1  It's Sunday, I need a break, so I'm planning t...      1
2  Awake but tired. I need to sleep but my brain ...      1
3  RT @SewHQ: #Retro bears make perfect gifts and...      1
4  It’s hard to say whether packing lists are mak...      1


In [3]:
print("We will apply the following preprocessing steps:")
steps = [
    "1. Lowercase text",
    "2. Remove URLs",
    "3. Remove mentions (@username)",
    "4. Remove hashtags (#depression → depression)",
    "5. Remove numbers & punctuation",
    "6. Remove stopwords",
    "7. Lemmatization"
]
for s in steps:
    print(s)


We will apply the following preprocessing steps:
1. Lowercase text
2. Remove URLs
3. Remove mentions (@username)
4. Remove hashtags (#depression → depression)
5. Remove numbers & punctuation
6. Remove stopwords
7. Lemmatization


### Step 1: Lowercase

In [4]:
df['post_text']=df['post_text'].str.lower()
print(df.head(10))

                                           post_text  label
0  it's just over 2 years since i was diagnosed w...      1
1  it's sunday, i need a break, so i'm planning t...      1
2  awake but tired. i need to sleep but my brain ...      1
3  rt @sewhq: #retro bears make perfect gifts and...      1
4  it’s hard to say whether packing lists are mak...      1
5  making packing lists is my new hobby... #movin...      1
6  at what point does keeping stuff for nostalgic...      1
7  currently in the finding-boxes-of-random-shit ...      1
8  can't be bothered to cook, take away on the wa...      1
9  rt @itventsnews: itv releases promo video for ...      1


In [5]:
# Step 2: Remove URLs (\S+ matches the run of non-whitespace characters after the prefix)
df['post_text'] = df['post_text'].str.replace(r'http\S+|www\S+', '', regex=True)

http\S+ → matches any string that starts with "http", up to the next whitespace.

www\S+ → matches any string that starts with "www", up to the next whitespace.

.str.replace(..., '', regex=True) → replaces every match with the empty string '', which removes it.
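The case of the `S` matters: `\S` means "non-whitespace" and `\s` means "whitespace". A quick check with Python's `re` module on a made-up string (the sample text and variable names are illustrative only) shows the difference:

```python
import re

# Made-up sample text containing both URL styles
tweet = "new post http://example.com/a and www.example.org today"

# Uppercase \S: consumes the whole URL up to the next whitespace
cleaned = re.sub(r"http\S+|www\S+", "", tweet)

# Lowercase \s: only matches "http" followed by whitespace, so the URL survives
not_cleaned = re.sub(r"http\s+|www\S+", "", tweet)

print("example.com" in cleaned)      # False
print("example.com" in not_cleaned)  # True
```

Note that removing a URL leaves the surrounding spaces behind, so the cleaned text can contain double spaces. The later `text.split()` calls absorb them.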

In [6]:
# Step 3: Remove mentions (@username)
df['post_text']=df['post_text'].str.replace(r'@\w+','',regex=True)

In [7]:
print(df.head(10))

                                           post_text  label
0  it's just over 2 years since i was diagnosed w...      1
1  it's sunday, i need a break, so i'm planning t...      1
2  awake but tired. i need to sleep but my brain ...      1
3  rt : #retro bears make perfect gifts and are g...      1
4  it’s hard to say whether packing lists are mak...      1
5  making packing lists is my new hobby... #movin...      1
6  at what point does keeping stuff for nostalgic...      1
7  currently in the finding-boxes-of-random-shit ...      1
8  can't be bothered to cook, take away on the wa...      1
9  rt : itv releases promo video for the final se...      1


In [8]:
# Step 4: Strip the '#' symbol but keep the hashtag word
df['post_text'] = df['post_text'].str.replace(r'#', '', regex=True)

print(df.head(10))

                                           post_text  label
0  it's just over 2 years since i was diagnosed w...      1
1  it's sunday, i need a break, so i'm planning t...      1
2  awake but tired. i need to sleep but my brain ...      1
3  rt : retro bears make perfect gifts and are gr...      1
4  it’s hard to say whether packing lists are mak...      1
5  making packing lists is my new hobby... moving...      1
6  at what point does keeping stuff for nostalgic...      1
7  currently in the finding-boxes-of-random-shit ...      1
8  can't be bothered to cook, take away on the wa...      1
9  rt : itv releases promo video for the final se...      1


In [9]:
# Step 5: Remove numbers and punctuation, keep only letters and spaces
df['post_text'] = df['post_text'].str.replace(r'[^a-z\s]', '', regex=True)

# Check a few rows to confirm
print(df.head(10))


                                           post_text  label
0  its just over  years since i was diagnosed wit...      1
1  its sunday i need a break so im planning to sp...      1
2  awake but tired i need to sleep but my brain h...      1
3  rt  retro bears make perfect gifts and are gre...      1
4  its hard to say whether packing lists are maki...      1
5   making packing lists is my new hobby movinghouse      1
6  at what point does keeping stuff for nostalgic...      1
7  currently in the findingboxesofrandomshit pack...      1
8  cant be bothered to cook take away on the way ...      1
9  rt  itv releases promo video for the final ser...      1
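
Because the pattern deletes every non-letter, hyphenated phrases collapse into a single token, as row 7 of the output above shows. One possible alternative, which this notebook does not use, is to substitute a space instead and then collapse repeated whitespace. The helper name `strip_non_letters` and the sample string below are hypothetical.

```python
import re

def strip_non_letters(text, replacement=""):
    """Replace every character that is not a-z or whitespace, then tidy spaces."""
    stripped = re.sub(r"[^a-z\s]", replacement, text)
    return re.sub(r"\s+", " ", stripped).strip()

sample = "a 24-hour, well-known phase"
print(strip_non_letters(sample))       # a hour wellknown phase
print(strip_non_letters(sample, " "))  # a hour well known phase
```

The trade-off is that contractions split as well ("it's" becomes "it s"), so which variant works better depends on the downstream model.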


### Step 6: Remove Stopwords


In [10]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words=set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kushagra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])


In [12]:
df['post_text']=df['post_text'].apply(remove_stopwords)

print(df.head(10))

                                           post_text  label
0  years since diagnosed anxiety depression today...      1
1  sunday need break im planning spend little tim...      1
2                 awake tired need sleep brain ideas      1
3  rt retro bears make perfect gifts great beginn...      1
4  hard say whether packing lists making life eas...      1
5         making packing lists new hobby movinghouse      1
6  point keeping stuff nostalgic reasons cross li...      1
7  currently findingboxesofrandomshit packing pha...      1
8              cant bothered cook take away way lazy      1
9  rt itv releases promo video final series downt...      1


### Step 7: Lemmatization

In [14]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kushagra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
lemmatizer=WordNetLemmatizer()

In [17]:
# Function to lemmatize each word
def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

## Apply on dataset
df['post_text'] =df['post_text'].apply(lemmatize_text)

print(df.head(10))

                                           post_text  label
0  year since diagnosed anxiety depression today ...      1
1  sunday need break im planning spend little tim...      1
2                  awake tired need sleep brain idea      1
3  rt retro bear make perfect gift great beginner...      1
4  hard say whether packing list making life easi...      1
5          making packing list new hobby movinghouse      1
6  point keeping stuff nostalgic reason cross lin...      1
7  currently findingboxesofrandomshit packing pha...      1
8              cant bothered cook take away way lazy      1
9  rt itv release promo video final series downto...      1


In [18]:
# Save the dataset

In [20]:
import os

os.makedirs("data/processed",exist_ok=True)

# Save the cleaned dataset
df.to_csv("data/processed/twitter_depression_cleaned.csv",index=False)
print("✅ Cleaned dataset saved at: data/processed/twitter_depression_cleaned.csv")

✅ Cleaned dataset saved at: data/processed/twitter_depression_cleaned.csv
