## Step 1: Lower Case

In this step, we convert the given sentence to lower case so that variants such as "The" and "the" are treated as the same word.

In [1]:
sentence = "The Sky is not the limit"
lowered_sent = sentence.lower()
lowered_sent

'the sky is not the limit'
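
`str.lower()` is sufficient for plain English text. As a minimal sketch, assuming the input may also contain non-English characters, Python's built-in `str.casefold()` applies a more aggressive normalization intended for caseless comparison. The helper name `normalize_case` is hypothetical and used only for illustration.

```python
def normalize_case(text: str) -> str:
    """Return a case-insensitive form of text suitable for comparisons."""
    # casefold() behaves like lower() for ASCII letters, but it also folds
    # special cases such as the German sharp s into "ss".
    return text.casefold()

lowered = normalize_case("The Sky is not the limit")
folded = normalize_case("Straße")
print(lowered)
print(folded)
```

For the sentence used in this step, both methods give the same result, so the choice only matters once the text contains characters outside basic English.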


## Step 2: Remove Stop Words

In this step, we remove common stop words from the sentence to focus on the important words.


In [2]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [3]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Akaash\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
en_stops = stopwords.words('english')
en_stops

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [5]:
sentence = "I am not able to come up with a way to solve this problem that is why I am delaying this project"

In [6]:
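# The membership test is case-sensitive: en_stops is all lower case, so the capitalized "I" is kept
sentence_no_stopwords = ' '.join([word for word in sentence.split() if word not in en_stops])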
sentence_no_stopwords

'I able come way solve problem I delaying project'

In [7]:
en_stops.remove('up')
en_stops.remove('not')

In [8]:
new_sentence_no_stopwords = ' '.join([word for word in sentence.split() if word not in en_stops])

In [9]:
new_sentence_no_stopwords

'I not able come up way solve problem I delaying project'
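
In the outputs above, the capitalized "I" survives because the stop word list is all lower case and the comparison is case-sensitive. The following is a minimal sketch of a case-insensitive filter that also leaves the original list untouched. The `STOP_WORDS` set is a small hand-picked assumption that stands in for NLTK's much longer list, and `remove_stop_words` with its `keep` parameter is a hypothetical helper, not part of NLTK.

```python
# Small illustrative stop word set (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"i", "am", "not", "to", "up", "with", "a", "this", "that", "is", "why"}

def remove_stop_words(text, stop_words, keep=()):
    """Drop stop words case-insensitively, except for the words listed in keep."""
    # Build the active set once; a set gives fast membership tests and
    # subtracting keep avoids mutating the caller's stop word collection.
    active = set(stop_words) - set(keep)
    return " ".join(word for word in text.split() if word.lower() not in active)

sentence = ("I am not able to come up with a way to solve this problem "
            "that is why I am delaying this project")
filtered = remove_stop_words(sentence, STOP_WORDS, keep={"not", "up"})
print(filtered)
```

Passing the words to keep as an argument, rather than calling `remove` on the shared list, means the same stop word list can be reused with different exceptions.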

## Step 3: Regular Expression

In this step, we use regular expressions to search for patterns in the text, replace them, and strip unwanted characters such as punctuation.


In [10]:
import re

In [11]:
# Example of re.search: \b marks a word boundary, so only the whole word "not" matches
pattern = r'\bnot\b'
match = re.search(pattern, sentence)
if match:
    print(f"Found '{match.group()}' in the sentence at position {match.start()}")

# Example of re.sub: replace every whole-word "not" (reuses the pattern defined above)
replacement = 'definitely'
new_sentence = re.sub(pattern, replacement, sentence)
print(new_sentence)

Found 'not' in the sentence at position 5
I am definitely able to come up with a way to solve this problem that is why I am delaying this project
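
To see why the `\b` word boundaries in the pattern matter, here is a small sketch that compares the pattern with and without them. The example strings are made up for illustration.

```python
import re

text = "I am not sure whether nothing or a knot was meant"

# Without boundaries, "not" also matches inside "nothing" and "knot".
loose_matches = re.findall(r'not', text)

# With \b on both sides, only the standalone word matches.
strict_matches = re.findall(r'\bnot\b', text)

# re.IGNORECASE makes the same pattern match "Not" as well as "not".
any_case_matches = re.findall(r'\bnot\b', "Not now, not ever", flags=re.IGNORECASE)

print(len(loose_matches), len(strict_matches), any_case_matches)
```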


In [12]:
# List of reviews
reviews = [
    "Alice did a fantastic job!",
    "Bob's work was exceptional.",
    "Charlie is a great team player.",
    "David's contribution was invaluable.",
    "Eve's performance was outstanding.",
    "Frank's dedication is commendable."
]

# Demonstrating the use of ^ (start of string)
for review in reviews:
    if re.search(r'^Alice', review):
        print(f"Starts with 'Alice': {review}")

# Demonstrating the use of $ (end of string)
for review in reviews:
    if re.search(r'commendable\.$', review):
        print(f"Ends with 'commendable.': {review}")

# Demonstrating the use of | (or)
for review in reviews:
    if re.search(r'Alice|Bob', review):
        print(f"Contains 'Alice' or 'Bob': {review}")

# Removing punctuation from reviews (this also strips apostrophes, e.g. "Bob's" becomes "Bobs")
reviews_no_punctuations = [re.sub(r'[^\w\s]', '', review) for review in reviews]
print("Reviews without punctuations:", reviews_no_punctuations)

Starts with 'Alice': Alice did a fantastic job!
Ends with 'commendable.': Frank's dedication is commendable.
Contains 'Alice' or 'Bob': Alice did a fantastic job!
Contains 'Alice' or 'Bob': Bob's work was exceptional.
Reviews without punctuations: ['Alice did a fantastic job', 'Bobs work was exceptional', 'Charlie is a great team player', 'Davids contribution was invaluable', 'Eves performance was outstanding', 'Franks dedication is commendable']
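
The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace, which is why the apostrophes disappear in the output above. As a hedged variation, assuming possessives should be preserved, the apostrophe can be added to the set of characters to keep. The shortened `sample_reviews` list is used here only for illustration.

```python
import re

sample_reviews = [
    "Alice did a fantastic job!",
    "Bob's work was exceptional.",
]

# Keep word characters, whitespace, and apostrophes; drop everything else.
keep_apostrophes = re.compile(r"[^\w\s']")
cleaned_reviews = [keep_apostrophes.sub("", review) for review in sample_reviews]
print(cleaned_reviews)
```

Compiling the pattern once with `re.compile` is optional, but it keeps the intent in one place when the same pattern is applied to many strings.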


## Step 4: Tokenization

In this step, we will tokenize the sentences into individual words. Tokenization is the process of splitting text into smaller pieces, such as words or phrases. This is a crucial step in text processing as it allows us to analyze the text at a granular level.


In [17]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize

# Example string
example_string = "Hello there! How are you doing today? This is a simple sentence tokenizer and word tokenizer demo."

# Sentence Tokenization
sentences = sent_tokenize(example_string)
print("Sentences:", sentences)

# Word Tokenization
words = word_tokenize(example_string)
print("Words:", words)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Akaash\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Akaash\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


Sentences: ['Hello there!', 'How are you doing today?', 'This is a simple sentence tokenizer and word tokenizer demo.']
Words: ['Hello', 'there', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'This', 'is', 'a', 'simple', 'sentence', 'tokenizer', 'and', 'word', 'tokenizer', 'demo', '.']
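
To build intuition for what the tokenizers do, the following sketch approximates both steps with regular expressions from the standard library. This is not how NLTK's Punkt tokenizer works internally; it is a simplification that breaks on abbreviations such as "Dr." or on contractions, and the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical.

```python
import re

def simple_sent_tokenize(text):
    """Split after ., ! or ? when followed by whitespace (naive approximation)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

example_string = ("Hello there! How are you doing today? "
                  "This is a simple sentence tokenizer and word tokenizer demo.")

simple_sentences = simple_sent_tokenize(example_string)
simple_words = simple_word_tokenize(example_string)
print(simple_sentences)
print(simple_words)
```

On this particular example the approximation produces the same sentences and tokens as the NLTK output above, but the two will diverge on less regular text.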


## Step 5: Stemming

In this step, we apply stemming, which strips suffixes to reduce inflected or derived words to a common stem. The stem is not always a dictionary word, but mapping related forms to one token shrinks the vocabulary, which can help downstream text analysis tasks.


In [18]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
porter_stemmer = PorterStemmer()

# Example tokens
tokens = ["connect", "connected", "connecting", "connection", "connections", "learn", "learning", "learned", "learner"]

# Apply stemming
stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]

# Display the results
for token, stemmed in zip(tokens, stemmed_tokens):
    print(f"Original: {token}, Stemmed: {stemmed}")

Original: connect, Stemmed: connect
Original: connected, Stemmed: connect
Original: connecting, Stemmed: connect
Original: connection, Stemmed: connect
Original: connections, Stemmed: connect
Original: learn, Stemmed: learn
Original: learning, Stemmed: learn
Original: learned, Stemmed: learn
Original: learner, Stemmed: learner
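
The idea behind suffix stripping can be sketched with a few hand-written rules. This is a toy illustration and not the Porter algorithm, which applies many more rules in several phases. The `SUFFIXES` tuple, the minimum stem length of 3, and the name `naive_stem` are all assumptions made for this example.

```python
# Suffixes are checked in order, so longer ones must come before their shorter tails.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

tokens = ["connect", "connected", "connecting", "connection", "connections",
          "learn", "learning", "learned", "learner"]
naive_stems = [naive_stem(token) for token in tokens]
print(naive_stems)
```

For these nine tokens the toy rules happen to agree with the Porter output above. On a wider vocabulary they would make mistakes that a full stemmer avoids.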



## Step 6: Lemmatization

In this step, we apply lemmatization to reduce words to their base or dictionary form (the lemma). Lemmatization is similar to stemming, but it looks words up in a vocabulary (here, WordNet) and takes the part of speech into account, so the result is always a valid word. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, which is why verb forms such as "connected" and "learning" are left unchanged in the output below.


In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example tokens
tokens = ["connect", "connected", "connecting", "connection", "connections", "learn", "learning", "learned", "learner"]

# Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Display the results
for token, lemmatized in zip(tokens, lemmatized_tokens):
    print(f"Original: {token}, Lemmatized: {lemmatized}")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Akaash\AppData\Roaming\nltk_data...


Original: connect, Lemmatized: connect
Original: connected, Lemmatized: connected
Original: connecting, Lemmatized: connecting
Original: connection, Lemmatized: connection
Original: connections, Lemmatized: connection
Original: learn, Lemmatized: learn
Original: learning, Lemmatized: learning
Original: learned, Lemmatized: learned
Original: learner, Lemmatized: learner
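
Most of the tokens above come back unchanged because the lookup depends on the part of speech, and the default is noun. The following toy sketch mimics that behavior with a tiny hand-written table. The `LEMMAS` dictionary and `toy_lemmatize` are hypothetical stand-ins for illustration, not WordNet or the NLTK implementation.

```python
# Tiny hand-written lemma table keyed by (word, part of speech).
# Hypothetical data for illustration; WordNet covers far more entries.
LEMMAS = {
    ("connected", "v"): "connect",
    ("connecting", "v"): "connect",
    ("connections", "n"): "connection",
    ("learning", "v"): "learn",
    ("learned", "v"): "learn",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return LEMMAS.get((word, pos), word)

as_noun = toy_lemmatize("connected")           # looked up as a noun: no entry
as_verb = toy_lemmatize("connected", pos="v")  # looked up as a verb
as_adjective = toy_lemmatize("better", pos="a")
print(as_noun, as_verb, as_adjective)
```

With NLTK, the equivalent step is to pass a part-of-speech argument to `lemmatizer.lemmatize`, for example `pos="v"` for verbs.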


### Stemming vs Lemmatization

**Stemming:**
- Reduces words to their root form by removing suffixes.
- Often results in non-dictionary words.
- Example: "connected" -> "connect", "connecting" -> "connect".

**Lemmatization:**
- Reduces words to their base or dictionary form.
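- Uses a dictionary and the word's part of speech, so the result is a valid word.
- Example (WordNet, default noun lookup): "connections" -> "connection", "connected" -> "connected". With a part-of-speech hint, "better" -> "good" (adjective).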

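## Step 7: N-grams

In this step, we generate n-grams, which are sequences of n consecutive tokens, and count how often each one occurs.


In [21]: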
from nltk.util import ngrams
from collections import Counter

# Example string
dummy_data = "This is a simple example to demonstrate the use of ngrams in nltk"

# Tokenize the string into words
tokens = word_tokenize(dummy_data)

# Generate bigrams (2-grams)
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)

# Generate trigrams (3-grams)
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)

# Count the frequency of bigrams
bigram_freq = Counter(bigrams)
print("Bigram Frequencies:", bigram_freq)

# Count the frequency of trigrams
trigram_freq = Counter(trigrams)
print("Trigram Frequencies:", trigram_freq)

Bigrams: [('This', 'is'), ('is', 'a'), ('a', 'simple'), ('simple', 'example'), ('example', 'to'), ('to', 'demonstrate'), ('demonstrate', 'the'), ('the', 'use'), ('use', 'of'), ('of', 'ngrams'), ('ngrams', 'in'), ('in', 'nltk')]
Trigrams: [('This', 'is', 'a'), ('is', 'a', 'simple'), ('a', 'simple', 'example'), ('simple', 'example', 'to'), ('example', 'to', 'demonstrate'), ('to', 'demonstrate', 'the'), ('demonstrate', 'the', 'use'), ('the', 'use', 'of'), ('use', 'of', 'ngrams'), ('of', 'ngrams', 'in'), ('ngrams', 'in', 'nltk')]
Bigram Frequencies: Counter({('This', 'is'): 1, ('is', 'a'): 1, ('a', 'simple'): 1, ('simple', 'example'): 1, ('example', 'to'): 1, ('to', 'demonstrate'): 1, ('demonstrate', 'the'): 1, ('the', 'use'): 1, ('use', 'of'): 1, ('of', 'ngrams'): 1, ('ngrams', 'in'): 1, ('in', 'nltk'): 1})
Trigram Frequencies: Counter({('This', 'is', 'a'): 1, ('is', 'a', 'simple'): 1, ('a', 'simple', 'example'): 1, ('simple', 'example', 'to'): 1, ('example', 'to', 'demonstrate'): 1, ('to