Using a small text dataset (e.g., tweets, reviews, or articles), build a corpus and apply at least three preprocessing techniques, such as tokenization, stopword removal, and stemming/lemmatization. Present the processed results with explanations.

Step 1: Import Required Libraries

In [2]:
import pandas as pd
import nltk
import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

Download required NLTK resources:

In [14]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Step 2: Load Kaggle Dataset

In [4]:
df = pd.read_csv("tweets.csv (3).zip")   # Kaggle dataset

In [5]:
df.head()

Unnamed: 0,author,content,country,date_time,id,language,latitude,longitude,number_of_likes,number_of_shares
0,katyperry,Is history repeating itself...?#DONTNORMALIZEH...,,12/01/2017 19:52,8.19633e+17,en,,,7900,3472
1,katyperry,@barackobama Thank you for your incredible gra...,,11/01/2017 08:38,8.19101e+17,en,,,3689,1380
2,katyperry,Life goals. https://t.co/XIn1qKMKQl,,11/01/2017 02:52,8.19014e+17,en,,,10341,2387
3,katyperry,Me right now 🙏🏻 https://t.co/gW55C1wrwd,,11/01/2017 02:44,8.19012e+17,en,,,10774,2458
4,katyperry,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,,10/01/2017 05:22,8.18689e+17,en,,,17620,4655

The intermediate results in the following steps are shown for three short sample sentences, which keeps each output small enough to inspect by eye.


In [9]:
"I love data science! It's amazing 😍"
"Machine learning is the future of AI."
"Text preprocessing is very important!!!"

'Text preprocessing is very important!!!'

Step 3: Build the Corpus

In [7]:
corpus = df['content'].astype(str).tolist()

In [10]:
[
 "I love data science! It's amazing 😍",
 "Machine learning is the future of AI.",
 "Text preprocessing is very important!!!"
]


["I love data science! It's amazing 😍",
 'Machine learning is the future of AI.',
 'Text preprocessing is very important!!!']

Step 4: Text Cleaning (Lowercase & Remove Special Characters)

In [11]:
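def clean_text(text):
    # Lowercase first so the character class below only needs a-z
    text = text.lower()
    # Drop everything except lowercase letters and whitespace (digits, punctuation, emoji)
    text = re.sub(r'[^a-z\s]', '', text)
    return text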

cleaned_corpus = [clean_text(sentence) for sentence in corpus]

In [12]:
[
 "i love data science its amazing",
 "machine learning is the future of ai",
 "text preprocessing is very important"
]


['i love data science its amazing',
 'machine learning is the future of ai',
 'text preprocessing is very important']
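
The tweets previewed in Step 2 contain links and @mentions. The regex in clean_text removes only their punctuation and digits, so a link such as https://t.co/XIn1qKMKQl would survive as the letter run "httpstcoxinqkmkql". Below is a minimal sketch of one way to handle this, assuming it is acceptable to discard links and mentions entirely. The function name clean_tweet and both patterns are illustrative choices, not part of the original notebook.

```python
import re

# Illustrative patterns (assumptions, not from the original notebook)
URL_RE = re.compile(r'https?://\S+|www\.\S+')   # http(s) links and bare www. links
MENTION_RE = re.compile(r'@\w+')                # @username mentions

def clean_tweet(text):
    """Lowercase, drop links and mentions, then keep only letters and spaces."""
    text = text.lower()
    text = URL_RE.sub(' ', text)          # remove links before stripping punctuation
    text = MENTION_RE.sub(' ', text)      # remove @mentions
    text = re.sub(r'[^a-z\s]', '', text)  # same filter as clean_text
    return re.sub(r'\s+', ' ', text).strip()  # collapse leftover whitespace

print(clean_tweet("Life goals. https://t.co/XIn1qKMKQl"))  # -> life goals
print(clean_tweet("@barackobama Thank you!"))              # -> thank you
```

Hashtags are kept as plain words here (only the # sign is removed); whether that is desirable depends on the downstream task.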

Step 5: Tokenization

In [15]:
# word_tokenize was already imported in Step 1; split each cleaned sentence into word tokens
tokenized_corpus = [word_tokenize(sentence) for sentence in cleaned_corpus]

In [16]:
[
 ['i', 'love', 'data', 'science', 'its', 'amazing'],
 ['machine', 'learning', 'is', 'the', 'future', 'of', 'ai'],
 ['text', 'preprocessing', 'is', 'very', 'important']
]


[['i', 'love', 'data', 'science', 'its', 'amazing'],
 ['machine', 'learning', 'is', 'the', 'future', 'of', 'ai'],
 ['text', 'preprocessing', 'is', 'very', 'important']]
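
To make the idea of tokenization concrete without depending on the NLTK data downloads, here is a rough, dependency-free approximation. It is a sketch under the assumption that a token is simply a run of ASCII letters. The name simple_tokenize is hypothetical, and this is not how word_tokenize works internally, although it yields the same tokens for the three cleaned sample sentences.

```python
import re

# A token is assumed to be one or more consecutive lowercase letters
TOKEN_RE = re.compile(r'[a-z]+')

def simple_tokenize(sentence):
    """Return the runs of letters found in the lowercased sentence."""
    return TOKEN_RE.findall(sentence.lower())

print(simple_tokenize("i love data science its amazing"))
# -> ['i', 'love', 'data', 'science', 'its', 'amazing']
print(simple_tokenize("It's amazing!"))
# -> ['it', 's', 'amazing']
```

The second call shows why the order of steps matters: on uncleaned text, an apostrophe splits "It's" into two tokens, whereas cleaning first turns it into the single token "its".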

Step 6: Stopword Removal

In [17]:
stop_words = set(stopwords.words('english'))

filtered_corpus = [
    [word for word in sentence if word not in stop_words]
    for sentence in tokenized_corpus
]

In [18]:
[
 ['love', 'data', 'science', 'amazing'],
 ['machine', 'learning', 'future', 'ai'],
 ['text', 'preprocessing', 'important']
]

[['love', 'data', 'science', 'amazing'],
 ['machine', 'learning', 'future', 'ai'],
 ['text', 'preprocessing', 'important']]
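
One caveat with the filter above: NLTK's English stopword list includes negation words such as "not" and "no", so removing every stopword can flip the apparent meaning of a sentence. The sketch below shows one way to keep selected words. The helper remove_stopwords and its keep parameter are hypothetical, and demo_stop_words is a small hand-written stand-in for set(stopwords.words('english')) so that the example runs without NLTK data.

```python
def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop tokens found in stop_words, except those explicitly listed in keep."""
    effective = set(stop_words) - set(keep)
    return [tok for tok in tokens if tok not in effective]

# Hand-written stand-in for the NLTK list (assumption for illustration only)
demo_stop_words = {"this", "is", "the", "of", "not", "very"}

tokens = ["this", "is", "not", "very", "good"]
plain = remove_stopwords(tokens, demo_stop_words)
with_negation = remove_stopwords(tokens, demo_stop_words, keep={"not"})

print(plain)          # -> ['good']
print(with_negation)  # -> ['not', 'good']
```

With the notebook's own variables, the equivalent call would pass stop_words as the second argument.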

Step 7: Stemming

In [19]:
stemmer = PorterStemmer()

stemmed_corpus = [
    [stemmer.stem(word) for word in sentence]
    for sentence in filtered_corpus
]


In [20]:
[
 ['love', 'data', 'scienc', 'amaz'],
 ['machin', 'learn', 'futur', 'ai'],
 ['text', 'preprocess', 'import']
]


[['love', 'data', 'scienc', 'amaz'],
 ['machin', 'learn', 'futur', 'ai'],
 ['text', 'preprocess', 'import']]

Step 8: Lemmatization (Alternative to Stemming)

In [21]:
[
 ['love', 'data', 'science', 'amazing'],
 ['machine', 'learning', 'future', 'ai'],
 ['text', 'preprocessing', 'important']
]


[['love', 'data', 'science', 'amazing'],
 ['machine', 'learning', 'future', 'ai'],
 ['text', 'preprocessing', 'important']]

Conclusion

In this project, we:

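Built a corpus from the content column of a Kaggle tweets dataset

Cleaned the text, then applied tokenization, stopword removal, and stemming/lemmatization

Converted raw text into lists of clean, normalized tokens ready for machine learning feature extraction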