<a href="https://colab.research.google.com/github/Sunilyogi333/Natural-Language-Processing--NLP-/blob/main/Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Cleaning
Apply common data preprocessing techniques to textual data.

## Define Sample Text

Define a sample text string to be used for preprocessing.


In [None]:
sample_text = """<p>Hello world! This is a test sentence with some HTML tags.</p> Visit our site: https://www.google.com. Don't forget to check out this video: https://youtu.be/dQw4w9WgXcQ. It's gr8! 😊 How r u doing? This is an exampel of bad speling and a lot of punc,tuation!!! Lol. BTW, did u see that??!! 🤣 Let's go party🎉"""

print("Sample text created:")
sample_text

Sample text created:


"<p>Hello world! This is a test sentence with some HTML tags.</p> Visit our site: https://www.google.com. Don't forget to check out this video: https://youtu.be/dQw4w9WgXcQ. It's gr8! 😊 How r u doing? This is an exampel of bad speling and a lot of punc,tuation!!! Lol. BTW, did u see that??!! 🤣 Let's go party🎉"

## Text Cleaning

- Convert to lowercase
- Remove HTML tags
- Remove URLs
- Replace chat words
- Remove emojis
- Remove punctuation


In [None]:
import re
import string

# 1. Convert to lowercase
normalized_text = sample_text.lower()

# 2. Remove HTML tags
normalized_text = re.sub(r'<.*?>', '', normalized_text)

# 3. Remove URLs
normalized_text = re.sub(r'https?://\S+|www\.\S+', '', normalized_text)

# 4. Define and apply robust chat word replacements
chat_words_replacements = {
    'gr8': 'great',
    'r u': 'are you',
    'lol': 'laughing out loud',
    'btw': 'by the way',
    'u': 'you'
}
# Sort by length in descending order to handle multi-word chat expressions first
sorted_chat_words = sorted(chat_words_replacements.items(), key=lambda item: len(item[0]), reverse=True)

for chat, full_form in sorted_chat_words:
    # Use word boundaries (\b) to replace whole words only
    # re.escape() is used to escape any special characters in the chat word itself
    normalized_text = re.sub(r'\b' + re.escape(chat) + r'\b', full_form, normalized_text)

# 5. Remove emojis (updated regex for broader coverage)
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U00002600-\U000027BF"  # Miscellaneous Symbols + Dingbats
    "\U00002B00-\U00002BFF"  # Miscellaneous Symbols and Arrows
    "\U0000200D"             # Zero Width Joiner
    "\U0000FE0F"             # Variation Selector-16
    "]+",
    flags=re.UNICODE
)
normalized_text = emoji_pattern.sub('', normalized_text)

# 6. Remove all punctuation
# Create a translation table to remove all punctuation characters
# It's important to do this after chat word replacement if chat words can have punctuation
normalized_text = normalized_text.translate(str.maketrans('', '', string.punctuation))

print("Normalized text after all steps:")
print(normalized_text)

Normalized text after all steps:
hello world this is a test sentence with some html tags visit our site  dont forget to check out this video  its great  how are you doing this is an exampel of bad speling and a lot of punctuation laughing out loud by the way did you see that  lets go party
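
The output above still contains double spaces where URLs and emojis were removed. A minimal sketch of one way to tidy this up, assuming a helper named `collapse_whitespace` (a hypothetical name, not part of the notebook):

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# A fragment of the normalized output, including the leftover double spaces
example = "visit our site  dont forget to check out this video  its great  how are you"
print(collapse_whitespace(example))
```

Later steps in this notebook use `str.split()`, which already ignores repeated spaces, so this step mainly matters when the normalized string itself is stored or displayed.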


## Correct Spelling Errors

Correct spelling errors in the normalized text using an appropriate library.


In [None]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.4-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.4-py3-none-any.whl (7.2 MB)
Installing collected package

In [None]:
from spellchecker import SpellChecker

# Instantiate SpellChecker
spell = SpellChecker()

# Split the normalized text into words
words = normalized_text.split()

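# Correct each word; correction() can return None when it finds no candidate,
# so fall back to the original word to keep the join below from failing
corrected_words = [spell.correction(word) or word for word in words]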

# Join the corrected words back into a string
corrected_text = " ".join(corrected_words)

print("Corrected text:")
print(corrected_text)

Corrected text:
hello world this is a test sentence with some him tags visit our site don't forget to check out this video its great how are you doing this is an example of bad spelling and a lot of punctuation laughing out loud by the way did you see that lets go party
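
In the output above the spell checker turned `html` into `him`, because the acronym is not in its dictionary. One possible guard is to skip correction for a small allowlist of known terms. The sketch below is an illustration only: `correct_with_allowlist`, its `keep` parameter, and the stand-in corrector are hypothetical names, and the corrector is passed in as a plain callable so the example runs without any spell-checking library.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional

def correct_with_allowlist(
    words: Iterable[str],
    correct: Callable[[str], Optional[str]],
    keep: FrozenSet[str] = frozenset({"html"}),
) -> List[str]:
    """Correct each word unless it is in the allowlist `keep`."""
    result = []
    for word in words:
        if word in keep:
            result.append(word)                   # protected term, leave untouched
        else:
            result.append(correct(word) or word)  # fall back if the corrector returns None
    return result

# Stand-in corrector for demonstration; it mimics the miscorrection seen above
fake_corrections = {"html": "him", "exampel": "example"}
demo = correct_with_allowlist(["some", "html", "exampel", "xyzzy"], fake_corrections.get)
print(demo)
```

In the notebook, the `correct` argument could be `spell.correction`, and the allowlist would hold whatever domain terms should survive unchanged.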


## Remove Stop Words

Remove common stop words from the text after spelling correction.


In [None]:
import nltk
nltk.download('stopwords')
print("NLTK stopwords corpus downloaded.")

NLTK stopwords corpus downloaded.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from nltk.corpus import stopwords

# Get English stop words
stop_words = set(stopwords.words('english'))

# Tokenize the corrected text
words = corrected_text.split()

# Remove stop words
filtered_words = [word for word in words if word.lower() not in stop_words]

# Join the filtered words back into a string
text_without_stopwords = " ".join(filtered_words)

print("Text after stop word removal:")
print(text_without_stopwords)

Text after stop word removal:
hello world test sentence tags visit site forget check video great example bad spelling lot punctuation laughing loud way see lets go party
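
Note that `don't forget` became just `forget` above: negations are in the stop word list, so removing them can flip the meaning of a phrase. For tasks such as sentiment analysis it may be worth keeping them. The sketch below uses a tiny hand-written stop word set so that it runs on its own; `DEMO_STOP_WORDS`, `NEGATIONS`, and the function name are assumptions made for illustration, not NLTK names.

```python
# Tiny stand-in stop word list (the notebook itself uses NLTK's English list)
DEMO_STOP_WORDS = {"this", "is", "a", "to", "do", "not", "no", "don't"}

# Words we choose to keep even though they appear in the stop word list
NEGATIONS = {"not", "no", "don't"}

def remove_stop_words_keep_negations(text, stop_words=DEMO_STOP_WORDS, negations=NEGATIONS):
    """Drop stop words from `text`, except for the listed negations."""
    effective = set(stop_words) - set(negations)
    return " ".join(word for word in text.split() if word.lower() not in effective)

print(remove_stop_words_keep_negations("don't forget to check this"))
```

With NLTK's list, the same idea would be to subtract the negation set from `stop_words` before filtering.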


## Tokenize Text

Tokenize the preprocessed text into individual words.

#### Instructions
1. Tokenize the `text_without_stopwords` into individual words. This can typically be done by splitting the string by spaces, or more robustly using `nltk.word_tokenize`.
2. Store the list of tokens in a new variable, for example, `tokenized_text`.
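
To see why splitting on spaces is the less robust option, consider text that still contains punctuation. The sketch below compares `str.split()` with a simple regular-expression tokenizer. `regex_tokenize` is a hypothetical helper written for this illustration; it is not what `nltk.word_tokenize` does internally and will not produce identical tokens.

```python
import re

def regex_tokenize(text):
    """Return words (optionally with one internal apostrophe) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

raw = "It's great, isn't it?"

# Whitespace splitting leaves punctuation glued to the neighbouring word
print(raw.split())

# The regex tokenizer separates punctuation into its own tokens
print(regex_tokenize(raw))
```

For `text_without_stopwords`, which no longer contains punctuation, both approaches give the same list of words.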

In [None]:
import nltk
nltk.download('punkt')      # Punkt tokenizer models
nltk.download('punkt_tab')  # Additional Punkt resource that word_tokenize asked for in this environment

# Tokenize the text without stopwords
tokenized_text = nltk.word_tokenize(text_without_stopwords)

print("Tokenized text:")
print(tokenized_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Tokenized text:
['hello', 'world', 'test', 'sentence', 'tags', 'visit', 'site', 'forget', 'check', 'video', 'great', 'example', 'bad', 'spelling', 'lot', 'punctuation', 'laughing', 'loud', 'way', 'see', 'lets', 'go', 'party']


## Stemming

Apply stemming to the tokenized text to reduce words to their root form.

#### Instructions
1. Import an appropriate stemmer from `nltk.stem`, such as `PorterStemmer` or `SnowballStemmer`.
2. Instantiate the stemmer.
3. Apply the stemmer to each word in the `tokenized_text` list.
4. Store the stemmed words in a new variable, for example, `stemmed_text`.

In [None]:
from nltk.stem import PorterStemmer

# Instantiate the Porter Stemmer
porter_stemmer = PorterStemmer()

# Apply stemming to each word in the tokenized text
stemmed_text = [porter_stemmer.stem(word) for word in tokenized_text]

print("Stemmed text:")
print(stemmed_text)

Stemmed text:
['hello', 'world', 'test', 'sentenc', 'tag', 'visit', 'site', 'forget', 'check', 'video', 'great', 'exampl', 'bad', 'spell', 'lot', 'punctuat', 'laugh', 'loud', 'way', 'see', 'let', 'go', 'parti']


## Lemmatization

Apply lemmatization to the tokenized text to reduce words to their base form (lemma).



#### Instructions
1. Import the `WordNetLemmatizer` from `nltk.stem`.
2. Download the 'wordnet' and 'omw-1.4' NLTK corpora if not already downloaded, as these are required for `WordNetLemmatizer`.
3. Instantiate the `WordNetLemmatizer`.
4. Apply the lemmatizer to each word in the `tokenized_text` list. You may need to specify the part-of-speech (pos) for more accurate lemmatization (e.g., 'v' for verbs, 'n' for nouns), but for this task, a default lemmatization is sufficient if POS tagging is not performed.
5. Store the lemmatized words in a new variable, for example, `lemmatized_text`.
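
Step 4 mentions that a part-of-speech tag can make lemmatization more accurate. POS taggers such as `nltk.pos_tag` return Penn Treebank tags (for example `NNS` or `VBG`), while `WordNetLemmatizer.lemmatize` expects one of the WordNet codes `'n'`, `'v'`, `'a'`, or `'r'`. A minimal sketch of the mapping between the two, using a hypothetical helper name `penn_to_wordnet`:

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to the single-letter POS code used by WordNet."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback for all other tags

for tag in ["NNS", "VBG", "JJ", "RB", "CD"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Assuming the tokens have been tagged with `nltk.pos_tag(tokenized_text)` (which needs an NLTK tagger resource to be downloaded first), each `(word, tag)` pair could then be lemmatized with `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.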

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
print("NLTK 'wordnet' and 'omw-1.4' corpora downloaded.")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


NLTK 'wordnet' and 'omw-1.4' corpora downloaded.


In [None]:
from nltk.stem import WordNetLemmatizer

# Instantiate the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization to each word in the tokenized text
# No POS tagging is performed; for simplicity every token is lemmatized as a verb (pos='v')
lemmatized_text = [lemmatizer.lemmatize(word, pos='v') for word in tokenized_text]

print("Lemmatized text:")
print(lemmatized_text)

Lemmatized text:
['hello', 'world', 'test', 'sentence', 'tag', 'visit', 'site', 'forget', 'check', 'video', 'great', 'example', 'bad', 'spell', 'lot', 'punctuation', 'laugh', 'loud', 'way', 'see', 'let', 'go', 'party']


## Compare and Summarize Preprocessing Outputs

Compare and display the original text, the fully processed text, and the outputs of both stemming and lemmatization. Summarize the observed differences and the preprocessing steps applied.

In [None]:
print("--- Original Text ---")
print(sample_text)
print("\n--- Normalized & Spell-Corrected Text (before stop words) ---")
print(corrected_text)
print("\n--- Text After Stop Word Removal ---")
print(text_without_stopwords)
print("\n--- Stemmed Text ---")
print(" ".join(stemmed_text)) # Join list back to string for display
print("\n--- Lemmatized Text ---")
print(" ".join(lemmatized_text)) # Join list back to string for display

--- Original Text ---
<p>Hello world! This is a test sentence with some HTML tags.</p> Visit our site: https://www.google.com. Don't forget to check out this video: https://youtu.be/dQw4w9WgXcQ. It's gr8! 😊 How r u doing? This is an exampel of bad speling and a lot of punc,tuation!!! Lol. BTW, did u see that??!! 🤣 Let's go party🎉

--- Normalized & Spell-Corrected Text (before stop words) ---
hello world this is a test sentence with some him tags visit our site don't forget to check out this video its great how are you doing this is an example of bad spelling and a lot of punctuation laughing out loud by the way did you see that lets go party

--- Text After Stop Word Removal ---
hello world test sentence tags visit site forget check video great example bad spelling lot punctuation laughing loud way see lets go party

--- Stemmed Text ---
hello world test sentenc tag visit site forget check video great exampl bad spell lot punctuat laugh loud way see let go parti

--- Lemmatize

### Summary of Preprocessing Steps and Observed Differences

The following is a summary of the text preprocessing steps applied and the observed differences in the text at each stage:

1.  **Original Text**: This was the raw input containing HTML tags, URLs, punctuation, emojis, chat words, and spelling errors. Example: `"<p>Hello world! This is a test sentence with some HTML tags.</p> Visit our site: https://www.google.com. Don't forget to check out this video: https://youtu.be/dQw4w9WgXcQ. It's gr8! 😊 How r u doing? This is an exampel of bad speling and a lot of punc,tuation!!! Lol. BTW, did u see that??!! 🤣 Let's go party🎉"`

2.  **Normalization (Lowercase, HTML, URLs, Chat Words, Emojis, Punctuation Removal)**:
    *   The text was converted to lowercase.
    *   HTML tags (`<p>`, `</p>`) were removed.
    *   URLs (`https://www.google.com`, `https://youtu.be/dQw4w9WgXcQ`) were removed.
    *   Chat words were expanded (e.g., `gr8` became `great`, `r u` became `are you`, `lol` became `laughing out loud`, `btw` became `by the way`, `u` became `you`).
    *   Emojis (`😊`, `🤣`, `🎉`) were removed.
    *   All punctuation (`!`, `.`, `,`, `?`, `'`, `:`) was removed, which also turned `don't` into `dont` and `punc,tuation` into `punctuation`.

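3.  **Spelling Correction**: After normalization, the spell checker corrected `exampel` to `example` and `speling` to `spelling` (`punc,tuation` had already become `punctuation` when punctuation was removed). It also restored `dont` to `don't`, but it miscorrected `html` to `him`, so the output of automatic correction should be reviewed rather than trusted blindly.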

4.  **Stop Word Removal**: Common English stop words (e.g., `this`, `is`, `a`, `with`, `some`, `our`, `don't`, `to`, `out`, `its`, `how`, `are`, `you`, `doing`, `an`, `of`, `and`, `by`, `the`, `did`, `that`) were removed to reduce noise and focus on more significant terms. This shortened the text considerably and kept mostly content words. The miscorrected `him` was removed as a stop word too, while `see` was kept.

5.  **Tokenization**: The text was split into individual words or tokens, forming a list of words. This is a prerequisite for stemming and lemmatization.

6.  **Stemming**: Applied a Porter Stemmer to reduce words to their root or base form (stem). This is an aggressive process and often results in words that are not actual dictionary words. For example, `sentence` became `sentenc`, `tags` became `tag`, `example` became `exampl`, `spelling` became `spell`, `punctuation` became `punctuat`, `laughing` became `laugh`, and `party` became `parti`.

7.  **Lemmatization**: Applied a WordNetLemmatizer with `pos='v'`, so every token was treated as a verb and reduced to its dictionary base form (lemma). Unlike stemming, the results are valid words. For example, `tags` became `tag`, `spelling` became `spell`, `laughing` became `laugh`, and `lets` became `let`, while `sentence`, `example`, `punctuation`, and `party` were left unchanged (stemming turned them into `sentenc`, `exampl`, `punctuat`, and `parti`).

**Key Differences between Stemming and Lemmatization**:
*   **Stemming** is a heuristic process that chops off suffixes to reduce words to a common root form, which may not be a valid word (e.g., `sentence` -> `sentenc`).
*   **Lemmatization** is a more sophisticated process that uses a vocabulary and morphological analysis of words to return the base or dictionary form (lemma), which is always a valid word (e.g., `sentence` -> `sentence`). Lemmatization is generally preferred when linguistic accuracy is important, while stemming is faster and can be sufficient for some applications where speed is paramount.
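
The contrast can be illustrated with a toy example. The sketch below is deliberately simplistic and is neither the Porter algorithm nor WordNet: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hand-written dictionary. All names and word lists here are made up for illustration.

```python
# Hypothetical suffix list, checked in order; a real stemmer uses many more rules
SUFFIXES = ("ing", "es", "s", "e")

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical vocabulary; a real lemmatizer relies on a full lexicon
TOY_LEMMAS = {"tags": "tag", "laughing": "laugh", "went": "go"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["sentence", "tags", "laughing", "went"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
```

The rule-based function produces the non-word `sentenc` and cannot relate `went` to `go`, while the lookup-based function handles both but only for words present in its vocabulary.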


In [None]:
import pandas as pd

df = pd.read_csv('/content/IMDB Dataset.csv')

df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
df['review'] = df['review'].str.lower()
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [None]:
#Remove HTML tags
import re
def remove_html_tags(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)
#Sample info
text = "This is the sample <br> file <h1> showing tag info"
remove_html_tags(text)
df['review'] = df['review'].apply(remove_html_tags)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [None]:
# Remove URLS
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
#Sample info
text = "This is the sample https://www.google.com "
remove_urls(text)
df['review'] = df['review'].apply(remove_urls)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [None]:
import string
punc = string.punctuation

def remove_punctuation(text):
  return text.translate(str.maketrans('', '', punc))

df['review'] = df['review'].apply(remove_punctuation)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,bad plot bad dialogue bad acting idiotic direc...,negative
49997,i am a catholic taught in parochial elementary...,negative
49998,im going to have to disagree with the previous...,negative


In [None]:
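import re

chat_words_replacements = {
    'gr8': 'great',
    'r u': 'are you',
    'lol': 'laughing out loud',
    'btw': 'by the way',
    'u': 'you',
    'brb': 'be right back'
}

# Build one pattern with the longest keys first, so that multi-word entries such as
# 'r u' match before 'u'. Splitting the text on whitespace and looking up each token
# would never match 'r u', because it spans two tokens.
chat_pattern = re.compile(
    r'\b(?:' + '|'.join(
        re.escape(chat)
        for chat in sorted(chat_words_replacements, key=len, reverse=True)
    ) + r')\b'
)

def chat_conversion(text):
  # Replace each matched chat word with its full form
  return chat_pattern.sub(lambda m: chat_words_replacements[m.group(0)], text)

text = " I will brb"
chat_conversion(text)
df['review'] = df['review'].apply(chat_conversion)
df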

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,bad plot bad dialogue bad acting idiotic direc...,negative
49997,i am a catholic taught in parochial elementary...,negative
49998,im going to have to disagree with the previous...,negative


In [None]:
from textblob import TextBlob

text = "Tody is the clss for NLP"

txtbl = TextBlob(text)
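print("Original:", txtbl)
print("Corrected:", txtbl.correct())

# Apply TextBlob correction to each review in the 'review' column.
# This can be very slow on a dataset of this size, and as the sample shows
# ("Tody" becomes "Body"), corrections are not always right.
df['review'] = df['review'].apply(lambda x: str(TextBlob(x).correct()))

df

Original: Tody is the clss for NLP
Corrected: Body is the class for NLP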


In [None]:
from nltk.corpus import stopwords

# A set makes the membership test below fast
stopword = set(stopwords.words('english'))

def remove_stopword(text):
  # Keep only the words that are not stop words
  return " ".join(word for word in text.split() if word not in stopword)

# apply() returns a new Series, so assign it back to keep the result
df['review'] = df['review'].apply(remove_stopword)
df

In [None]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def stem_words(text):
  return " ".join([stemmer.stem(word) for word in text.split()])

sample = "walk walks walking walked"
stem_words(sample)

# apply() returns a new Series, so assign it back to keep the result
df['review'] = df['review'].apply(stem_words)
df

In [None]:
import string
import nltk
from nltk.stem import WordNetLemmatizer

sentence = "walking walked walks into the river Ram walked away form"
punc = string.punctuation
words = nltk.word_tokenize(sentence)

# Drop punctuation tokens
words_no = [word for word in words if word not in punc]

# Drop stop words (uses `stopword` from the stop word removal cell above)
words_nostop = [word for word in words_no if word not in stopword]

print(words_nostop)

# Instantiate the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# No POS tagging is performed; for simplicity every token is lemmatized as a verb (pos='v')
lemmatized_text = [lemmatizer.lemmatize(word, pos='v') for word in words_nostop]

print("Lemmatized text:")
print(lemmatized_text)