
**About Dataset**

- The IMDB dataset contains 50K movie reviews for natural language processing or text analytics.

- It is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing.

- The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.

In [1]:
## importing libraries
import pandas as pd
import numpy as np
import string

In [68]:
## reading csv:
path = "/content/drive/MyDrive/IMDB Dataset.csv"
data_ori =  pd.read_csv(path, on_bad_lines='warn')
data_ori

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


`on_bad_lines` controls what `pd.read_csv` does with a malformed line (for example, one with too many fields). Possible values include:

- 'error': Raise an error and stop reading when a problematic line is encountered (the default).
- 'warn': Issue a warning for each problematic line, skip it, and continue reading the file (used above).
- 'skip': Skip problematic lines silently and continue reading the file.
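
To make the difference between these options concrete, here is a minimal sketch. The CSV text is made up for illustration (it is not the IMDB file), and its second data row deliberately has too many fields.

```python
import io
import pandas as pd

# Hypothetical CSV: the second data row has four fields instead of two
csv_text = (
    "review,sentiment\n"
    "good movie,positive\n"
    "bad,movie,extra,negative\n"
    "fine film,positive\n"
)

# 'skip' drops the malformed row silently
skipped = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
print(skipped.shape)

# 'error' (the default) raises a ParserError instead
try:
    pd.read_csv(io.StringIO(csv_text), on_bad_lines="error")
    raised = False
except pd.errors.ParserError:
    raised = True
print("error raised:", raised)
```

With 'skip' or 'warn' the resulting frame can have fewer rows than the file has lines, so it is worth checking the shape after loading.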

In [69]:
# taking data sample:
data = data_ori[:100].copy()
print(data.shape)

(100, 2)


In [70]:
## checking length:
print("length of data :",len(data))
data.head(4)

length of data : 100


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative


In [71]:
## removing HTML tags, URLs and repeated dots
import re

pattern_html_tag = r'<.*?>'        # any HTML tag, e.g. <br />
https_pattern = r'https?://\S+'    # http:// and https:// URLs
dots_pattern = r'\.{2,}'           # runs of dots such as 'super.......'

# Apply the replacements to the whole column at once; this avoids the
# chained assignment data.review[i] = ..., which pandas warns about.
# Note: tags are replaced with an empty string, so the text on both sides
# of a <br /> is joined without a space (e.g. 'me.The' in the output below).
data["review"] = (
    data["review"]
    .str.replace(pattern_html_tag, "", regex=True)
    .str.replace(https_pattern, "", regex=True)
    .str.replace(dots_pattern, ".", regex=True)
)

## printing length of review:
print(len(data))

## print review[0]
data.review[0]

100


"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many.Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more.so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wouldn

In [72]:
data.review.head(4)

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
Name: review, dtype: object
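
The cleaned review printed above contains fused sentences such as `me.The`, because each `<br />` tag is replaced by an empty string. One possible variant, sketched here on a made-up sample string, substitutes a space and then collapses repeated whitespace. The helper name `strip_markup` is hypothetical and not part of the notebook.

```python
import re
import pandas as pd

def strip_markup(text: str) -> str:
    """Remove HTML tags and URLs, shorten dot runs, normalise spaces."""
    text = re.sub(r'<.*?>', ' ', text)          # tags -> single space
    text = re.sub(r'https?://\S+', ' ', text)   # URLs -> single space
    text = re.sub(r'\.{2,}', '.', text)         # 'super.......' -> 'super.'
    return re.sub(r'\s+', ' ', text).strip()    # collapse whitespace

# Made-up sample, not taken from the dataset
sample = pd.Series(
    ["Great film.<br /><br />The cast was super....... see https://example.com now"]
)
cleaned = sample.apply(strip_markup)
print(cleaned[0])
```

Keeping a space where the tag used to be matters later, because the tokenizer otherwise produces tokens like `me.the`.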

### Checking whether there are any short forms (all-caps words):

In [81]:
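# Pattern for all-caps words (candidate abbreviations):
short_form_pattern = r'\b[A-Z]{2,}\b'
short_form = []

for review in data.review:
    # re.findall always returns a list (empty when nothing matches),
    # so no None check is needed
    short_form.extend(re.findall(short_form_pattern, review))

# Note: this cell has to run before the lower-casing step. The empty list
# recorded below is consistent with it having been run afterwards (In [81]).
print(short_form)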


[]
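
The empty list above is what the pattern returns on lower-cased text. As a small illustration on two made-up sentences (not from the dataset), the same pattern finds all-caps words only while the original casing is still present, and `collections.Counter` can be used to see which ones are frequent enough to deserve a dictionary entry.

```python
import re
from collections import Counter

pattern = re.compile(r'\b[A-Z]{2,}\b')

# Made-up examples: the second is the lower-cased form of the first
original = "It is called OZ on TV."
lowered = original.lower()

found_original = pattern.findall(original)
found_lowered = pattern.findall(lowered)
counts = Counter(found_original)

print(found_original)
print(found_lowered)
print(counts.most_common())
```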


In [75]:
## replacing abbreviations with their full forms
abbreviation_dict = {
    'BBC': 'British Broadcasting Corporation',
    'TV': 'Television',
    'DVD': 'Digital Versatile Disc',
    'VHS': 'Video Home System',
    'US': 'United States',
    'BTW': 'by the way'
}

for idx in data.index:
    text = data.loc[idx, 'review']
    # Apply every replacement to the same running text so that earlier
    # substitutions are not overwritten by later ones
    for key, value in abbreviation_dict.items():
        # \b restricts the match to whole words ('US' but not 'USA')
        text = re.sub(rf'\b{key}\b', value, text)
    data.loc[idx, 'review'] = text


In [76]:
data.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many.Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more.so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wouldn

In [82]:
data.review.head(4)

0    one of the other reviews has mentioned that af...
1    a wonderful little production. the filling tec...
2    i thought this was a wonderful way to spend ti...
3    basically there's a family where a little boy ...
Name: review, dtype: object

In [78]:
## converting the whole text to lower case:
data["review"] = data["review"].str.lower()

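Spell checking with the TextBlob library. Note that `correct()` can also rewrite words that were already spelled correctly: in the corrected review shown further down, "hardcore" becomes "hardware" and "oswald" becomes "onward".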

In [79]:
# importing text blob:
from textblob import TextBlob

correction_review = []
for i in data.review:
    correction = TextBlob(i).correct()  # Correct the review using TextBlob
    correction_review.append(str(correction))  # Convert the corrected TextBlob object back to a string

data.review = correction_review.copy()
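
Since automatic correction can change valid words, it may be worth listing what was changed before overwriting the column. The sketch below uses only the standard library. The helper `changed_words` is hypothetical, and it assumes the corrector replaces words one-for-one without adding or dropping any. The two sample strings are a fragment of review 0 before and after correction.

```python
import re

def changed_words(before: str, after: str):
    """Return (original, corrected) pairs that differ, assuming a
    one-for-one, order-preserving word replacement."""
    before_tokens = re.findall(r"[a-z']+", before.lower())
    after_tokens = re.findall(r"[a-z']+", after.lower())
    return [(b, a) for b, a in zip(before_tokens, after_tokens) if b != a]

before = "Its is hardcore, in the classic use of the word."
after = "its is hardware, in the classic use of the word."
changes = changed_words(before, after)
print(changes)
```

A list like this makes it easy to spot proper nouns and domain words that should be excluded from correction.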


In [80]:
data.review[0]

"one of the other reviews has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutally and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punched with regards to drugs, sex or violence. its is hardware, in the classic use of the word.it is called oz as that is the nickname given to the onward maximum security state penitentiary. it focused mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many.organs, muslin, gangstas, nations, christians, italians, irish and more.so snuffles, death stares, podgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wouldn't 

In [85]:
print(string.punctuation)

## creating the customized punctuation list: everything except '.' and ','
customized_punctuation_list = [p for p in string.punctuation if p not in ".,"]


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [86]:
## removing punctuation from the reviews, keeping '.' and ','
## (iterate over customized_punctuation_list, not string.punctuation,
## otherwise the full stops and commas seen in the output would be removed too)
for punctuation in customized_punctuation_list:
    # regex=False treats characters such as '?' or '(' literally
    data["review"] = data["review"].str.replace(punctuation, "", regex=False)


In [88]:
data.review[0]

'one of the other reviews has mentioned that after watching just 1 oz episode youll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutally and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punched with regards to drugs, sex or violence. its is hardware, in the classic use of the word.it is called oz as that is the nickname given to the onward maximum security state penitentiary. it focused mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many.organs, muslin, gangstas, nations, christians, italians, irish and more.so snuffles, death stares, podgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wouldnt da

In [89]:
data.review.head(4)

0    one of the other reviews has mentioned that af...
1    a wonderful little production. the filling tec...
2    i thought this was a wonderful way to spend ti...
3    basically theres a family where a little boy j...
Name: review, dtype: object
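
As an alternative to replacing one punctuation character at a time, the same result can be obtained in a single pass with a translation table. This is a sketch on a made-up sentence; `to_remove` and `table` are illustrative names.

```python
import string
import pandas as pd

# Characters to delete: all ASCII punctuation except '.' and ','
to_remove = "".join(ch for ch in string.punctuation if ch not in ".,")
table = str.maketrans("", "", to_remove)

# Made-up sample sentence
reviews = pd.Series(["you'll be hooked. they are right, as this is (exactly) it!"])
cleaned = reviews.str.translate(table)
print(cleaned[0])
```

`str.translate` scans each string once, which tends to matter when the full 50K reviews are processed rather than a 100-row sample.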

- Stopwords are common words that are often removed from text during natural language processing tasks because they usually carry little meaningful information. They include words like "the," "is," "in," and "and."

---
- The "punkt" resource in NLTK is a data file containing pre-trained models for the Punkt sentence tokenizer, which `nltk.word_tokenize` relies on. Tokenization is the process of splitting a text into sentences or into individual words (tokens).

In [127]:
data_copy = data.copy()

In [128]:
data_copy.review[2]

'i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a lighthearted comedy. the plot is simplistic, but the dialogue is witty and the characters are liable even the well bread suspected aerial killer. while some may be disappointed when they realize this is not match point 2 risk addition, i thought it was proof that wood allen is still fully in control of the style many of us have grown to love.this was the most id laughed at one of woods remedies in years dare i say a decade. while ive never been impressed with scarlet johnson, in this she managed to tone down her sex image and jumped right into a average, but spirited young woman.this may not be the crown jewel of his career, but it was whittier than devil wears trade and more interesting than sherman a great comedy to go see with friends.'

In [129]:
## importing the nltk (Natural Language Toolkit) library
import nltk

## English stop-word list
from nltk.corpus import stopwords

## stemmers; only SnowballStemmer is used below
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

## downloading the tokenizer models and the predefined stop-word list
## (a customized stop-word list could be used instead)
nltk.download('punkt')
nltk.download('stopwords')

# Get the set of English stopwords
stop_words_eng = set(stopwords.words('english'))

# initializing SnowballStemmer('english') as stemmer_snow
stemmer_snow = SnowballStemmer('english')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [130]:
# Tokenizing each review, removing stop words, then stemming

def tokenize_and_stem(text):
    cleaned_tokens = []
    ## loop through each tokenized word
    ## (nltk.sent_tokenize could be used first to split into sentences)
    for token in nltk.word_tokenize(text):
        ## drop stop words (like is, the, ...) before stemming, because a
        ## stemmed stop word (e.g. 'very' -> 'veri') is no longer in the list
        if token in stop_words_eng:
            continue
        ## stem the word, e.g. "running" -> "run"
        cleaned_tokens.append(stemmer_snow.stem(token))
    return cleaned_tokens

## replace each review with its list of stemmed tokens
data_copy["review"] = data_copy["review"].apply(tokenize_and_stem)
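
One thing to watch when combining stemming with stop-word removal is the order of the two steps: the stemmed form of a stop word is not always in the stop-word list. The check below is a small sketch that uses a hand-picked sample of four words (an assumption for illustration; NLTK's English list is much longer) and needs no corpus download.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hand-picked sample standing in for the full stop-word list
sample_stop_words = {"because", "very", "the", "is"}

# Stop words whose stem would slip past a filter applied after stemming
leaked = [
    word for word in sorted(sample_stop_words)
    if stemmer.stem(word) not in sample_stop_words
]

for word in sorted(sample_stop_words):
    print(word, "->", stemmer.stem(word))
print("leaked:", leaked)
```

Filtering on the raw token first and stemming afterwards avoids this leakage.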

---

- A stemmer is a rule-based algorithm that reduces words to their base or root form, known as the "stem." It does so by stripping affixes (for the stemmers listed below, mainly suffixes) to produce a common base form. The resulting stem may not always be a valid word, as "episod" and "violenc" in the output below show, but it carries the core meaning of the original word.

- For example, stemming the word "running" would produce the stem "run," and stemming the word "jumps" would result in "jump."

- Stemming is a useful preprocessing step in natural language processing (NLP) tasks such as text classification, information retrieval, and sentiment analysis. It helps to reduce the dimensionality of text data, making it easier for algorithms to process and analyze.

There are various stemmers available, with the most common ones being:

1. **Porter Stemmer**: Developed by Martin Porter, this is one of the oldest and most widely used stemming algorithms. It's designed to be simple and fast, but it may not always produce the most linguistically accurate stems.

2. **Snowball Stemmer (Porter2)**: Also known as the "Porter2" stemmer, it is an improvement over the original Porter Stemmer, written by the same author in his Snowball framework. It addresses some of the limitations of the original algorithm.

3. **Lancaster Stemmer**: This stemmer is more aggressive than the Porter Stemmer. It tends to produce shorter stems, which are more often non-standard or hard to read.
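
The three stemmers can be compared side by side with a few lines of code. This is a minimal sketch: the word list is arbitrary, and none of the stemmers needs a corpus download.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words
words = ["running", "jumps", "generously", "happiness"]

# Stem every word with every stemmer
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, result in stems.items():
    print(f"{name:10s}", result)
```

Running this on words from the actual reviews is a quick way to decide which stemmer is acceptable for the task.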

In [131]:
data_copy.review[0]

['one',
 'review',
 'mention',
 'watch',
 '1',
 'oz',
 'episod',
 'youll',
 'hook',
 '.',
 'right',
 ',',
 'exact',
 'happen',
 'me.th',
 'first',
 'thing',
 'struck',
 'oz',
 'brutal',
 'unflinch',
 'scene',
 'violenc',
 ',',
 'set',
 'right',
 'word',
 'go',
 '.',
 'trust',
 ',',
 'show',
 'faint',
 'heart',
 'timid',
 '.',
 'show',
 'pull',
 'punch',
 'regard',
 'drug',
 ',',
 'sex',
 'violenc',
 '.',
 'hardwar',
 ',',
 'classic',
 'use',
 'word.it',
 'call',
 'oz',
 'nicknam',
 'given',
 'onward',
 'maximum',
 'secur',
 'state',
 'penitentiari',
 '.',
 'focus',
 'main',
 'emerald',
 'citi',
 ',',
 'experiment',
 'section',
 'prison',
 'cell',
 'glass',
 'front',
 'face',
 'inward',
 ',',
 'privaci',
 'high',
 'agenda',
 '.',
 'em',
 'citi',
 'home',
 'many.organ',
 ',',
 'muslin',
 ',',
 'gangsta',
 ',',
 'nation',
 ',',
 'christian',
 ',',
 'italian',
 ',',
 'irish',
 'more.so',
 'snuffl',
 ',',
 'death',
 'stare',
 ',',
 'podgi',
 'deal',
 'shadi',
 'agreement',
 'never',
 'far',

In [39]:
"""
from textblob import TextBlob

# Initialize an empty list to store corrected reviews
correction_review = []

# Define the batch size (adjust based on your dataset size and system resources)
batch_size = 100

# Iterate over the reviews in batches
for i in range(0, len(data.review), batch_size):
    # Get a batch of reviews
    batch = data.review[i:i+batch_size]

    # Correct each review in the batch
    batch_corrections = []
    for review in batch:
        # Create a TextBlob object and correct the review
        corrected_review = str(TextBlob(review).correct())
        batch_corrections.append(corrected_review)

    # Extend the list of corrected reviews with the batch
    correction_review.extend(batch_corrections)

"""


In [1]:
import nltk

# Download the nltk data (if not already downloaded)
nltk.download('punkt')

from nltk.tokenize import sent_tokenize

def tokenize_sentence(text):
    # Use sent_tokenize to tokenize the text into sentences
    sentences = sent_tokenize(text)
    return sentences

# Example usage
text = "This is the first sentence. This is the second sentence."
sentences = tokenize_sentence(text)

# Print the tokenized sentences
for sentence in sentences:
    print(sentence)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


This is the first sentence.
This is the second sentence.
