# 1. Tokenization and Text Cleaning

Tokenization is the process of splitting text into units called tokens: words, phrases, or even whole sentences. It is usually the first step in an NLP pipeline, and every later stage works on its output. It is often paired with text cleaning, which removes characters that carry little meaning for the task at hand, such as punctuation, numbers, and stray symbols, so that later steps see consistent language units.

In [4]:
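# !pip install nltk
# Example: tokenization and text cleaning
import nltk

# word_tokenize needs the Punkt tokenizer data; the resource name depends on
# the NLTK version ("punkt" in older releases, "punkt_tab" in newer ones).
# nltk.download("punkt")

text = "NLP is amazing! Let's explore its wonders."
tokens = nltk.word_tokenize(text)
# Lowercase each token and keep only purely alphabetic ones; this drops
# punctuation and the "'s" split off from "Let's".
cleaned_tokens = [word.lower() for word in tokens if word.isalpha()]
print(cleaned_tokens)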

['nlp', 'is', 'amazing', 'let', 'explore', 'its', 'wonders']
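
The same idea can be approximated without NLTK's tokenizer data. The sketch below is a rough stand-in, not NLTK's algorithm: the function name `simple_tokenize` is made up for illustration, and it assumes that runs of ASCII letters are a good enough definition of a word.

```python
import re

def simple_tokenize(text):
    """Return the lowercase alphabetic tokens found in text."""
    # [A-Za-z]+ matches maximal runs of ASCII letters, so punctuation,
    # digits, and apostrophes all act as token boundaries.
    return [match.lower() for match in re.findall(r"[A-Za-z]+", text)]

text = "NLP is amazing! Let's explore its wonders."
print(simple_tokenize(text))
# ['nlp', 'is', 'amazing', 'let', 's', 'explore', 'its', 'wonders']
```

Unlike the NLTK version, this keeps a stray `'s'` token from "Let's", because the apostrophe is treated as a plain separator. Small differences like this are why a dedicated tokenizer is usually preferred for real text.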


In [11]:
import re
def clean_tweet(tweet):
    tweet = re.sub(r'@\w+', '', tweet)  # Remove mentions
    tweet = re.sub(r'#\w+', '', tweet)  # Remove hashtags
    tweet = re.sub(r'http\S+', '', tweet)  # Remove URLs
    return tweet

tweet = "Loving the new #iPhone! Best phone ever! @Apple"
clean_tweet(tweet)

'Loving the new ! Best phone ever! '
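
The output above loses the word "iPhone" and leaves a gap before the exclamation mark plus a trailing space. If the hashtag text matters for the task, one possible variant keeps the word and tidies the whitespace. This is a sketch under that assumption; `clean_tweet_keep_tags` is a hypothetical name, not part of the original notebook.

```python
import re

def clean_tweet_keep_tags(tweet):
    """Hypothetical variant of clean_tweet that keeps hashtag words."""
    tweet = re.sub(r"http\S+", "", tweet)    # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)       # remove mentions
    tweet = re.sub(r"#(\w+)", r"\1", tweet)  # drop '#', keep the word
    tweet = re.sub(r"\s+", " ", tweet)       # collapse runs of whitespace
    return tweet.strip()                     # trim leading/trailing spaces

print(clean_tweet_keep_tags("Loving the new #iPhone! Best phone ever! @Apple"))
# Loving the new iPhone! Best phone ever!
```

Whether to drop hashtags entirely or keep their text depends on the downstream task; for sentiment analysis the hashtag word often carries useful signal.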

# 2. Stop Word Removal

Not all words contribute equally to the meaning of a sentence. Stop words like "the" or "and" are often filtered out to focus on more meaningful content.

In [5]:
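# Example: stop word removal
# Requires the stop word lists: nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
# Keep only the tokens that are not in the stop word set
filtered_sentence = [word for word in cleaned_tokens if word not in stop_words]
print(filtered_sentence)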

['nlp', 'amazing', 'let', 'explore', 'wonders']
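
The filtering itself is just a set membership test. The sketch below shows the mechanism with a tiny hand-picked stop list; `TOY_STOP_WORDS` and `remove_stop_words` are illustrative names, and the list is an assumption that is far shorter than NLTK's English list.

```python
# A tiny, hand-picked stop list for illustration only.
TOY_STOP_WORDS = {"is", "its", "the", "and", "a", "of"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Return the tokens that are not in stop_words, preserving order."""
    return [token for token in tokens if token not in stop_words]

tokens = ['nlp', 'is', 'amazing', 'let', 'explore', 'its', 'wonders']
print(remove_stop_words(tokens))
# ['nlp', 'amazing', 'let', 'explore', 'wonders']

# The comparison is case-sensitive, so "The" is not removed.
print(remove_stop_words(["The", "cat"]))
# ['The', 'cat']
```

The second call shows why lowercasing normally comes before stop word filtering: stop lists such as NLTK's are stored in lowercase.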


# 3. Stemming and Lemmatization

Stemming and lemmatization are both text normalization techniques used in Natural Language Processing (NLP) to reduce words to their base or root forms. While they share the goal of simplifying words, they operate differently in terms of the linguistic knowledge they apply.

Stemming: Stemming cuts affixes, usually suffixes, off a word to obtain its root form, known as the stem. The purpose is to collapse related forms such as "explore", "explored", and "exploring" into a single token. Stemming is a rule-based method that doesn't always produce a valid word (for example, "amazing" becomes "amaz"), but it is computationally cheap.

Lemmatization: Lemmatization, on the other hand, reduces words to their base or dictionary forms, known as lemmas. It applies morphological analysis and can take the word's role in the sentence into account. Lemmatization produces valid words and is more linguistically informed than stemming.

In [6]:
# Example: stemming and lemmatization
# The lemmatizer requires WordNet data: nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_words = [stemmer.stem(word) for word in filtered_sentence]
# lemmatize() treats each word as a noun unless a pos argument is passed
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_sentence]

print(stemmed_words)
print(lemmatized_words)

['nlp', 'amaz', 'let', 'explor', 'wonder']
['nlp', 'amazing', 'let', 'explore', 'wonder']
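
To make the "rule-based" nature of stemming concrete, here is a deliberately simplified suffix stripper. It is a toy sketch, not the Porter algorithm, which applies many more rules and conditions; the function name `toy_stem` and its three-suffix list are assumptions chosen for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        # Only strip when enough of the word remains to be recognizable.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["amazing", "explored", "wonders", "nlp"]:
    print(word, "->", toy_stem(word))
# amazing -> amaz
# explored -> explor
# wonders -> wonder
# nlp -> nlp
```

Because the rules never consult a dictionary, results such as "amaz" and "explor" are not real words. A lemmatizer avoids this by looking words up in a lexical resource.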


# 4. Part-of-Speech Tagging

Part-of-speech tagging (POS tagging) is a natural language processing task where the goal is to assign a grammatical category (such as noun, verb, or adjective) to each word in a given text. This provides a deeper understanding of the structure and function of each word in a sentence.
The Penn Treebank POS Tag Set is a widely used standard for representing these part-of-speech tags in English text.

In [7]:
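# Example: part-of-speech tagging
# Requires a tagger model, e.g. nltk.download("averaged_perceptron_tagger")
# (the resource name varies by NLTK version).
from nltk import pos_tag

pos_tags = pos_tag(filtered_sentence)
print(pos_tags)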

[('nlp', 'RB'), ('amazing', 'JJ'), ('let', 'NN'), ('explore', 'NN'), ('wonders', 'NNS')]
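
The two- and three-letter codes in this output are Penn Treebank tags. The sketch below decodes only the four tags that appear above; the dictionary `TAG_DESCRIPTIONS` and the helper `describe` are illustrative names, and the full tag set is considerably larger.

```python
# Descriptions for the Penn Treebank tags seen in the output above.
TAG_DESCRIPTIONS = {
    "RB": "adverb",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
}

# The tagged tokens, copied from the printed output.
pos_tags = [('nlp', 'RB'), ('amazing', 'JJ'), ('let', 'NN'),
            ('explore', 'NN'), ('wonders', 'NNS')]

def describe(tagged_tokens):
    """Pair each word with a readable description of its tag."""
    return [(word, TAG_DESCRIPTIONS.get(tag, "unknown"))
            for word, tag in tagged_tokens]

for word, description in describe(pos_tags):
    print(f"{word}: {description}")
```

Read this way, some tags look wrong: "let" and "explore" are verbs in the original sentence but are tagged as nouns. A likely cause is that the tagger saw lowercased tokens with stop words and punctuation removed, which strips away context it relies on; tagging the unfiltered token list is generally the safer choice.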


# 5. Named Entity Recognition (NER)

NER identifies spans of text that refer to entities such as people, locations, and organizations, and assigns each one a category. This is crucial for extracting meaningful information from unstructured data.

In [8]:
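# Example: named entity recognition (NER)
# Requires the chunker model and word list, e.g.
# nltk.download("maxent_ne_chunker") and nltk.download("words")
# (resource names vary by NLTK version).
from nltk import ne_chunk

# ne_chunk takes POS-tagged tokens and returns a tree; recognized entities
# appear as labeled subtrees. In the output below every token sits directly
# under S, so no entities were found.
ner_tags = ne_chunk(pos_tags)
print(ner_tags)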

(S nlp/RB amazing/JJ let/NN explore/NN wonders/NNS)
