Tutorials:
1. [Natural Language Processing (NLP) Tutorial with Python & NLTK](https://youtu.be/X2vAabgKiuM)
2. [Spacy Introduction for NLP](https://www.youtube.com/watch?v=pLJm0WSIVDk&list=PLc2rvfiptPSSS-iwKS_lxI3MZr8Mbi4Zu)
3. [TextBlob Library In Python For Natural Language Processing](https://www.youtube.com/watch?v=DedS74YKhs4)

In [1]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
from collections import Counter
from textblob import TextBlob
import enchant

In [2]:
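# Download the NLTK data used below: Punkt tokenizer models, stopword lists, and WordNet.
# Recent NLTK releases look for 'punkt_tab' instead of 'punkt'; download that as well
# if sent_tokenize or word_tokenize raises a LookupError.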
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/rikato/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/rikato/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/rikato/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# Sample text for demonstration
sample_text = """
Hello! This is a sample text for our NLP experiment. It contains punctuation, numbers (like 123), and special symbols (@#$).
We'll perform various operations on this text. Some email addresses: john.doe@example.com, jane_smith@company.org.
This text also includes some common words and potential spelling errors, such as 'langauge' instead of 'language'.
NLP is gr8 for txt analysis. ROFL!
"""

In [4]:
print("Original Text:")
print(sample_text)

Original Text:

Hello! This is a sample text for our NLP experiment. It contains punctuation, numbers (like 123), and special symbols (@#$).
We'll perform various operations on this text. Some email addresses: john.doe@example.com, jane_smith@company.org.
This text also includes some common words and potential spelling errors, such as 'langauge' instead of 'language'.
NLP is gr8 for txt analysis. ROFL!



In [5]:
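# 1a. Remove punctuation, special symbols, and digits using a regular expression
def remove_punct_symbols_numbers(text):
    # [^\w\s] matches anything that is neither a word character nor whitespace;
    # \d matches digits. \w includes the underscore, so '_' is kept.
    return re.sub(r'[^\w\s]|\d', '', text)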

In [6]:
print("\n1a. Text after removing punctuation, special symbols, and numbers:")
cleaned_text = remove_punct_symbols_numbers(sample_text)
print(cleaned_text)


1a. Text after removing punctuation, special symbols, and numbers:

Hello This is a sample text for our NLP experiment It contains punctuation numbers like  and special symbols 
Well perform various operations on this text Some email addresses johndoeexamplecom jane_smithcompanyorg
This text also includes some common words and potential spelling errors such as langauge instead of language
NLP is gr for txt analysis ROFL
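
Note that the output above still contains an underscore in `jane_smithcompanyorg` and a double space where `123` was removed: `\w` matches `_`, and the pattern leaves whitespace untouched. A minimal sketch of a stricter variant, assuming only ASCII letters should survive; the function name `keep_letters_only` is hypothetical and not part of the notebook:

```python
import re

def keep_letters_only(text):
    # Drop every character that is not an ASCII letter or whitespace
    # (this also removes digits and underscores).
    letters = re.sub(r'[^A-Za-z\s]', '', text)
    # Collapse runs of spaces/tabs left behind by the removed characters.
    return re.sub(r'[ \t]+', ' ', letters)

print(keep_letters_only("numbers (like 123), and jane_smith@company.org"))
```

Restricting the class to `A-Za-z` also strips accented letters, so this variant only suits plain English text.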



In [7]:
# 1b. Tokenize the text into sentences
def tokenize_sentences(text):
    return sent_tokenize(text)

In [8]:
print("\n1b. Tokenized sentences:")
sentences = tokenize_sentences(sample_text)
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")


1b. Tokenized sentences:
Sentence 1: 
Hello!
Sentence 2: This is a sample text for our NLP experiment.
Sentence 3: It contains punctuation, numbers (like 123), and special symbols (@#$).
Sentence 4: We'll perform various operations on this text.
Sentence 5: Some email addresses: john.doe@example.com, jane_smith@company.org.
Sentence 6: This text also includes some common words and potential spelling errors, such as 'langauge' instead of 'language'.
Sentence 7: NLP is gr8 for txt analysis.
Sentence 8: ROFL!
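
Sentence 1 above starts on a new line because `sample_text` begins with a newline, which ends up inside the first sentence. A small sketch of a post-processing step, assuming the input is any list of sentence strings such as the one returned by `tokenize_sentences`; the helper name `tidy_sentences` is hypothetical:

```python
def tidy_sentences(sentences):
    # Collapse internal whitespace (including newlines) and trim both ends.
    tidied = (' '.join(s.split()) for s in sentences)
    # Drop sentences that end up empty.
    return [s for s in tidied if s]

print(tidy_sentences(["\nHello!", "This is a sample text\nfor our NLP experiment.", "  "]))
```

Calling `sample_text.strip()` before tokenizing would likely have the same effect on the first sentence.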


In [9]:
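# 1c. Remove stopwords (NLTK's English list plus custom ones) and list what was removed
def remove_stopwords(text, custom_stopwords=None):
    # Use None instead of a mutable default list; None means "no custom stopwords".
    stop_words = set(stopwords.words('english')) | set(custom_stopwords or [])
    words = word_tokenize(text)
    # Compare in lowercase so that 'This' and 'this' are both treated as stopwords.
    filtered_words = [word for word in words if word.lower() not in stop_words]
    removed_stopwords = [word for word in words if word.lower() in stop_words]
    return ' '.join(filtered_words), removed_stopwords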

In [10]:
print("\n1c. Text after removing stopwords (including custom stopwords):")
custom_stopwords = ['sample', 'contains']
text_without_stopwords, removed_stopwords = remove_stopwords(cleaned_text, custom_stopwords)
print(text_without_stopwords)
print("Removed stopwords:", removed_stopwords)


1c. Text after removing stopwords (including custom stopwords):
Hello text NLP experiment punctuation numbers like special symbols Well perform various operations text email addresses johndoeexamplecom jane_smithcompanyorg text also includes common words potential spelling errors langauge instead language NLP gr txt analysis ROFL
Removed stopwords: ['This', 'is', 'a', 'sample', 'for', 'our', 'It', 'contains', 'and', 'on', 'this', 'Some', 'This', 'some', 'and', 'such', 'as', 'of', 'is', 'for']


In [11]:
# 1d. Perform stemming and lemmatization
def stem_and_lemmatize(text):
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in words]
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(stemmed), ' '.join(lemmatized)

In [12]:
print("\n1d. Stemmed and Lemmatized text:")
stemmed, lemmatized = stem_and_lemmatize(cleaned_text)
print("Stemmed:", stemmed)
print("Lemmatized:", lemmatized)


1d. Stemmed and Lemmatized text:
Stemmed: hello thi is a sampl text for our nlp experi it contain punctuat number like and special symbol well perform variou oper on thi text some email address johndoeexamplecom jane_smithcompanyorg thi text also includ some common word and potenti spell error such as langaug instead of languag nlp is gr for txt analysi rofl
Lemmatized: Hello This is a sample text for our NLP experiment It contains punctuation number like and special symbol Well perform various operation on this text Some email address johndoeexamplecom jane_smithcompanyorg This text also includes some common word and potential spelling error such a langauge instead of language NLP is gr for txt analysis ROFL
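
In the lemmatized output above, 'as' became 'a'. `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part of speech is passed, so it apparently reads 'as' as a plural. One possible guard, sketched here on the assumption that function words should simply be left alone: the `lemmatize` argument stands for any callable such as `WordNetLemmatizer().lemmatize`, and `lemmatize_guarded` and the `protected` set are hypothetical names. The stand-in lemmatizer only strips a trailing 's' so that the snippet runs without NLTK.

```python
def lemmatize_guarded(words, lemmatize, protected):
    # Leave protected words (for example, stopwords) untouched and
    # lemmatize everything else with the supplied callable.
    return [w if w.lower() in protected else lemmatize(w) for w in words]

# Stand-in for a real lemmatizer: strips one trailing 's' (illustration only).
def strip_plural_s(word):
    return word[:-1] if word.endswith('s') else word

print(lemmatize_guarded(['such', 'as', 'errors'], strip_plural_s, {'as', 'is', 'this'}))
```

In the notebook, the `protected` set could be `set(stopwords.words('english'))`.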


In [13]:
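# 1e. Extract usernames from email addresses
def extract_usernames(text):
    # Simplified email pattern: local part, '@', domain, and a TLD of at least two letters.
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    emails = re.findall(email_pattern, text)
    # The username is everything before the '@'.
    usernames = [email.split('@')[0] for email in emails]
    return usernames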

In [14]:
print("\n1e. Extracted usernames from email addresses:")
usernames = extract_usernames(sample_text)
print(usernames)


1e. Extracted usernames from email addresses:
['john.doe', 'jane_smith']
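
The same result can be obtained in one pass by placing a capture group around the local part of the address, since `re.findall` returns the group contents when the pattern contains exactly one group. A minimal sketch; the function name `extract_usernames_grouped` is hypothetical, and the pattern is a simplification that does not cover every valid address:

```python
import re

# Capture only the part before '@'; the domain is matched but not returned.
USERNAME_PATTERN = re.compile(r'\b([A-Za-z0-9._%+-]+)@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def extract_usernames_grouped(text):
    return USERNAME_PATTERN.findall(text)

print(extract_usernames_grouped(
    "Some email addresses: john.doe@example.com, jane_smith@company.org."
))
```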


In [15]:
# 1f. Find the most common words
def find_common_words(text, n=10):
    words = word_tokenize(text.lower())
    word_freq = Counter(words)
    return word_freq.most_common(n)

In [16]:
print("\n1f. Most common words:")
common_words = find_common_words(cleaned_text)
for word, count in common_words:
    print(f"{word}: {count}")


1f. Most common words:
this: 3
text: 3
is: 2
for: 2
nlp: 2
and: 2
some: 2
hello: 1
a: 1
sample: 1
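
The counts above are dominated by function words such as 'this' and 'is' because `cleaned_text` still contains stopwords. In the notebook, passing `text_without_stopwords` to `find_common_words` would address that. The sketch below shows the same idea with the standard library only; `SMALL_STOPWORDS` is a tiny hypothetical stand-in for NLTK's list, and `common_content_words` is not part of the notebook:

```python
import re
from collections import Counter

# Illustrative stopword list; far smaller than NLTK's English list.
SMALL_STOPWORDS = {'this', 'is', 'a', 'for', 'and', 'some', 'our', 'it', 'on', 'of', 'as'}

def common_content_words(text, n=10, stop_words=SMALL_STOPWORDS):
    # Lowercase, keep alphabetic runs only, then count the non-stopwords.
    words = re.findall(r'[a-z]+', text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)

print(common_content_words("This text is a sample text for NLP and NLP text", n=2))
```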


In [17]:
# 1g. Correct spelling errors
def correct_spelling(text):
    return str(TextBlob(text).correct())

In [18]:
print("\n1g. Text after spelling correction:")
corrected_text = correct_spelling(sample_text)
print(corrected_text)


1g. Text after spelling correction:

Hello! His is a sample text for our NLP experiment. It contains punctuation, numbers (like 123), and special symbols (@#$).
He'll perform various operations on this text. Some email addresses: john.doe@example.com, jane_smith@company.org.
His text also includes some common words and potential spelling errors, such as 'language' instead of 'language'.
NLP is grm for txt analysis. ROFL!
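
The output above shows that correcting every token can damage valid text: 'This' became 'His', 'We'll' became 'He'll', and 'gr8' became 'grm'. One way to limit this is to leave known words alone and replace a token only when a close match exists. The sketch below uses the standard library's `difflib.get_close_matches` rather than TextBlob, so it is a different technique from the cell above; `KNOWN_WORDS`, `correct_token`, and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Tiny illustrative vocabulary; a real one would be much larger.
KNOWN_WORDS = ['this', 'text', 'language', 'great', 'analysis']

def correct_token(token, vocabulary=KNOWN_WORDS, cutoff=0.8):
    lowered = token.lower()
    # Known words are returned unchanged, preserving their original case.
    if lowered in vocabulary:
        return token
    # Otherwise accept the best match only if it is similar enough.
    matches = get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print([correct_token(t) for t in ['This', 'langauge', 'gr8']])
```

Tokens with no sufficiently close match, such as 'gr8', pass through untouched and can be handled by a separate slang dictionary.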



In [19]:
# 2a. Replace social media slang
def replace_slangs(text, slang_dict):
    words = word_tokenize(text)
    # Look up each token in lowercase; slang_dict keys are expected to be lowercase.
    replaced_words = [slang_dict.get(word.lower(), word) for word in words]
    return ' '.join(replaced_words)

In [20]:
print("\n2a. Text after replacing social media slang:")
slang_dict = {'gr8': 'great', 'txt': 'text', 'rofl': 'rolling on the floor laughing'}
text_without_slangs = replace_slangs(sample_text, slang_dict)
print(text_without_slangs)


2a. Text after replacing social media slang:
Hello ! This is a sample text for our NLP experiment . It contains punctuation , numbers ( like 123 ) , and special symbols ( @ # $ ) . We 'll perform various operations on this text . Some email addresses : john.doe @ example.com , jane_smith @ company.org . This text also includes some common words and potential spelling errors , such as 'langauge ' instead of 'language ' . NLP is great for text analysis . rolling on the floor laughing !
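
Tokenizing and re-joining with spaces changes the layout of the text, as the output above shows ('Hello !', 'john.doe @ example.com'). A sketch of an alternative that substitutes slang in place with `re.sub` and leaves everything else untouched; the function name `replace_slang_in_place` is hypothetical, and the dictionary keys are assumed to be lowercase:

```python
import re

def replace_slang_in_place(text, slang_dict):
    if not slang_dict:
        return text
    # Match any slang key as a whole word, ignoring case.
    alternatives = '|'.join(re.escape(key) for key in slang_dict)
    pattern = re.compile(r'\b(' + alternatives + r')\b', re.IGNORECASE)
    # Look up the lowercase form of whatever was matched.
    return pattern.sub(lambda m: slang_dict[m.group(0).lower()], text)

slang = {'gr8': 'great', 'txt': 'text', 'rofl': 'rolling on the floor laughing'}
print(replace_slang_in_place("NLP is gr8 for txt analysis. ROFL!", slang))
```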


In [21]:
# 2b. Lemmatize a word only if it is not already a valid dictionary word
def smart_stem_lemmatize(text):
    d = enchant.Dict("en_US")
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    processed_words = []
    for word in words:
        # Words the dictionary recognizes are kept unchanged.
        if d.check(word):
            processed_words.append(word)
            continue
        # Otherwise try the lemma and keep it only if it is a valid word.
        lemma = lemmatizer.lemmatize(word)
        processed_words.append(lemma if d.check(lemma) else word)
    return ' '.join(processed_words)

In [22]:
print("\n2b. Text after smart stemming/lemmatization:")
smart_processed_text = smart_stem_lemmatize(cleaned_text)
print(smart_processed_text)


2b. Text after smart stemming/lemmatization:
Hello This is a sample text for our NLP experiment It contains punctuation numbers like and special symbols Well perform various operations on this text Some email addresses johndoeexamplecom jane_smithcompanyorg This text also includes some common words and potential spelling errors such as langauge instead of language NLP is gr for txt analysis ROFL
