# **Loading Dataset**
---

In [5]:
import kagglehub

path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
path

Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.


'/kaggle/input/imdb-dataset-of-50k-movie-reviews'

**Dataframe**

*Create a pandas DataFrame and keep a small sample of the data for demonstration.*

---

In [6]:
import pandas as pd

df = pd.read_csv(path + "/IMDB Dataset.csv")
df.shape

(50000, 2)

In [7]:
df.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [8]:
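# Keep the first 200 rows; .copy() avoids the SettingWithCopyWarning that pandas
# can raise when the 'review' column is reassigned in later cells.
df = df.head(200).copy()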
df.shape

(200, 2)

# **Lowercase**

---



In [9]:
df['review'][5]

'Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas\' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they\'d all be "up" for this movie.'

In [10]:
df['review'] = df['review'].str.lower()

In [11]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

# **HTML Tags Removal**

---



In [12]:
import re

def remove_html_tags(text):
  pattern = re.compile('<.*?>')
  return pattern.sub(r'', text)

In [13]:
sample_text = '<html><body><p> Movie</p><p> Actor'
new_sample_text = remove_html_tags(sample_text)
new_sample_text

' Movie Actor'

In [14]:
df['review'] = df['review'].apply(remove_html_tags)
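
Deleting a tag outright can glue the words on either side of it together when the review has no space around the tag. The sketch below is one possible variant, not part of the original notebook: it replaces each tag with a space and then collapses repeated whitespace. The names `replace_html_tags`, `TAG_PATTERN` and `WHITESPACE_PATTERN`, and the sample string, are made up for illustration.

```python
import re

# Compile once so the patterns are reused for every row.
TAG_PATTERN = re.compile(r'<[^>]+>')
WHITESPACE_PATTERN = re.compile(r'\s+')

def replace_html_tags(text):
    """Swap each tag for a space, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)
    return WHITESPACE_PATTERN.sub(' ', without_tags).strip()

# Hypothetical sample with no space around the <br /> tags.
sample = 'a wonderful little production.<br /><br />the filming technique'
print(replace_html_tags(sample))
# a wonderful little production. the filming technique
```

With plain deletion the same sample would become 'production.the', which later steps would treat as a single word.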



# **URL Removal**

---



In [15]:
def remove_url(text):
  pattern = re.compile(r'https?://\S+|www\.\S+')
  return pattern.sub(r'', text)

In [16]:
url_text = 'youtube video https://www.youtube.com/ABC'
new_text = remove_url(url_text)
new_text

'youtube video '

In [17]:
df['review'] = df['review'].apply(remove_url)

# **Punctuation Removal**

---



In [18]:
import string

def punc_remove(text):
  return text.translate(str.maketrans('','', string.punctuation))

In [19]:
punc_text = 'Is the,, movie!! good??'
new_punc_text = punc_remove(punc_text)
new_punc_text

'Is the movie good'

In [20]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [21]:
df['review'] = df['review'].apply(punc_remove)

In [22]:
df['review'][5]

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
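
In the output above, deleting punctuation merges hyphenated words: 'all-time' becomes 'alltime' and 'dressed-up' becomes 'dressedup'. A possible alternative, sketched here under the assumption that splitting such words is preferable, maps every punctuation character to a space instead. The name `punc_to_space` is hypothetical. The trade-off is that contractions are split as well, so "it's" becomes 'it s'.

```python
import string

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punc_to_space(text):
    """Replace punctuation with spaces and normalise the spacing."""
    return ' '.join(text.translate(PUNCT_TO_SPACE).split())

print(punc_to_space('my all-time favorite, "dressed-up" kids'))
# my all time favorite dressed up kids
```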

# **Handling Text Acronyms**

---



In [23]:
text_acronyms = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "FYI": "For Your Information",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "OMG": "Oh My God",
    "IMO": "In My Opinion",
    "LOL": "Laugh Out Loud",
    "TTYL": "Talk To You Later",
    "GTG": "Got To Go",
    "TTYT": "Talk To You Tomorrow",
    "IDK": "I Don't Know",
    "TMI": "Too Much Information",
    "IMHO": "In My Humble Opinion",
    "ICYMI": "In Case You Missed It",
    "FAQ": "Frequently Asked Questions",
    "TGIF": "Thank God It's Friday",
    "FYA": "For Your Action",
}

In [24]:
def handle_acronyms(text):
  # Expand a word when its upper-cased form is a known acronym; keep it otherwise.
  new_text = []
  for word in text.split():
    new_text.append(text_acronyms.get(word.upper(), word))
  return " ".join(new_text)


In [25]:
acronym_text = "LOL I will be there ASAP"
complete_text = handle_acronyms(acronym_text)
complete_text

'Laugh Out Loud I will be there As Soon As Possible'

In [26]:
df['review'] = df['review'].apply(handle_acronyms)
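
Because the reviews were lower-cased and stripped of punctuation earlier, Title Case expansions such as "I Don't Know" bring capitals and apostrophes back into the text. One way around this, shown as a rough sketch, is to store the expansions in lower case without punctuation. The mapping below is a shortened, illustrative subset and `expand_acronyms` is a hypothetical name.

```python
# Shortened, illustrative mapping: keys are upper-case, values are lower-case
# so the expansions match the already lower-cased reviews.
ACRONYMS = {
    'ASAP': 'as soon as possible',
    'IMO': 'in my opinion',
    'LOL': 'laugh out loud',
}

def expand_acronyms(text, mapping=ACRONYMS):
    """Expand known acronyms word by word, leaving other words untouched."""
    return ' '.join(mapping.get(word.upper(), word) for word in text.split())

print(expand_acronyms('lol i will be there asap'))
# laugh out loud i will be there as soon as possible
```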

# **Handling Incorrect Words**

---



In [27]:
from textblob import TextBlob

In [28]:
def text_correction(text):
  return str(TextBlob(text).correct())

In [29]:
incorrect_text = 'the moviee is noot goood'
correction = text_correction(incorrect_text)
correction

'the movie is not good'

In [30]:
df['review'] = df['review'].apply(text_correction)

# **Stopwords**

---

*Stopwords are very common words such as 'the', 'a' and 'an', which don't contribute much to most NLP tasks and take up resources unnecessarily. The NLTK corpus package contains a collection of these words, which can be removed from the text.*

In [31]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [32]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [33]:
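# Build the stopword set once; a set lookup is much faster than
# re-reading the full list for every word.
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
  new_text = []
  for word in text.split():
    if word in english_stopwords:
      # Each stopword is replaced with '', so the joined text keeps extra spaces.
      new_text.append('')
    else:
      new_text.append(word)
  return " ".join(new_text)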


In [34]:
stopword_text = 'the movie is a great watch'
new_stopword_text = remove_stopwords(stopword_text)
new_stopword_text

' movie   great watch'

In [35]:
df['review'][5]

'probably my alliee favorite movie a story of helplessness sacrifice and education to a noble cause but its not preach or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lupus performance brings tears to my eyes and better davis in one of her very few truly sympathetic roles is a delight the kiss are as grand says more like dressed midges than children but that only makes them more fun to watch and the mothers slow awakening to what happening in the world and under her own roof is believable and startling if i had a dozen thumbs they all be up for this movie'

In [36]:
df['review'] = df['review'].apply(remove_stopwords)
df['review'][5]

'probably  alliee favorite movie  story  helplessness sacrifice  education   noble cause    preach  boring   never gets old despite   seen   15   times   last 25 years paul lupus performance brings tears   eyes  better davis  one     truly sympathetic roles   delight  kiss   grand says  like dressed midges  children    makes   fun  watch   mothers slow awakening   happening   world     roof  believable  startling     dozen thumbs       movie'
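
The runs of spaces in the output above come from appending an empty string for every stopword. A minimal sketch of a variant that drops stopwords instead, so the result stays single-spaced, is shown below. `drop_stopwords` and `demo_stop_words` are hypothetical names, and the tiny hand-written set only stands in for NLTK's English list so the example runs on its own.

```python
def drop_stopwords(text, stop_words):
    """Keep only the words that are not in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# Tiny hand-written set for illustration; in the notebook this would be
# set(stopwords.words('english')) from NLTK.
demo_stop_words = {'the', 'is', 'a'}

print(drop_stopwords('the movie is a great watch', demo_stop_words))
# movie great watch
```

Passing the stopword collection as a `set` keeps each membership test fast, which matters when the function is applied to every review.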

# **Tokenization, Stemming and Lemmatization**

---


**Tokenization** segments the input text into characters, words or sentences, depending on the type of tokenization. E.g. "I like this movie" tokenizes to ['I', 'like', 'this', 'movie'], where each word is a token.
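
The three granularities can be illustrated with the standard library alone. This is a simplified sketch with hypothetical helper names, not the tokenizer NLTK or spaCy use; real tokenizers handle abbreviations, contractions and other edge cases that these regular expressions ignore.

```python
import re

def char_tokenize(text):
    """Character-level tokens."""
    return list(text)

def word_tokenize_simple(text):
    """Word-level tokens: runs of letters, digits or underscores."""
    return re.findall(r'\w+', text)

def sentence_tokenize_simple(text):
    """Sentence-level tokens: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(word_tokenize_simple('I like this movie'))
# ['I', 'like', 'this', 'movie']
print(sentence_tokenize_simple('I like this movie. It is great!'))
# ['I like this movie.', 'It is great!']
print(char_tokenize('movie'))
# ['m', 'o', 'v', 'i', 'e']
```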

**Stemming** reduces words to their stems by stripping suffixes, e.g. 'running' and 'runs' are both stemmed to 'run'. Because it works on a fixed set of rules, the result is sometimes not a real word: the Porter stemmer, for instance, converts 'university' to 'univers'. It is fast, though, and works well for simpler tasks.
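
As a quick illustration, NLTK's `PorterStemmer` can be tried on a few words. Choosing the Porter stemmer is an assumption made for this sketch; NLTK also ships other stemmers that produce different stems.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Suffix-stripping rules handle regular forms but not irregular ones such as 'ran'.
for word in ['running', 'runs', 'ran', 'university']:
    print(word, '->', stemmer.stem(word))
# running -> run
# runs -> run
# ran -> ran
# university -> univers
```

The irregular form 'ran' is left unchanged, which is the kind of case where lemmatization does better.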

**Lemmatization** also reduces words to a base form, but it takes the part of speech and context of a word into account and returns its dictionary form (lemma). Hence it is slower than stemming but more accurate.

Both the NLTK suite and the spaCy library perform these operations, in different ways, and either can be used depending on the project's needs.
In NLTK each operation is a separate step, whereas spaCy runs them in a single pipeline: tokenization, part-of-speech (POS) tagging, lemmatization and named entity recognition (NER) all happen in one call. spaCy does not support stemming. Apart from the lemmatizer and the POS tagger, the other steps in the pipeline can be disabled.

In [37]:
import spacy

nlp = spacy.load('en_core_web_sm')
def lemmatize(text):
  tokens = [token.lemma_ for token in nlp(text)]
  return " ".join(tokens)

In [38]:
text = 'He played the movie and liked it'
lemmatize(text)

'he play the movie and like it'

In [39]:
df['review'][5]

'probably  alliee favorite movie  story  helplessness sacrifice  education   noble cause    preach  boring   never gets old despite   seen   15   times   last 25 years paul lupus performance brings tears   eyes  better davis  one     truly sympathetic roles   delight  kiss   grand says  like dressed midges  children    makes   fun  watch   mothers slow awakening   happening   world     roof  believable  startling     dozen thumbs       movie'

In [40]:
df['review'] = df['review'].apply(lemmatize)

In [41]:
df['review'][5]

'probably   alliee favorite movie   story   helplessness sacrifice   education    noble cause     preach   boring    never get old despite    see    15    time    last 25 year paul lupus performance bring tear    eye   well davis   one      truly sympathetic role    delight   kiss    grand say   like dressed midge   child     make    fun   watch    mother slow awakening    happen    world      roof   believable   startling      dozen thumb        movie'