# **NLP Pipeline**
*   **Downloading Dataset from Kaggle to Google Colab**
*   **Text Cleaning**
*   **Text Preprocessing**

# **Downloading Dataset from Kaggle to Google Colab**

In [1]:
# Install the Kaggle command-line client
!pip install kaggle



To download Kaggle datasets from Colab, you need to provide your Kaggle API key. Here’s how to do that:

*   Visit https://www.kaggle.com/ and log in to your Kaggle account.
*   Get the Kaggle API key:
    *   Click on your profile icon in the top-right corner and select Settings.
    *   Scroll down to the API section and click on Create New API Token.
    *   This will download a file called kaggle.json.
*   Upload the kaggle.json file to Google Colab (the code below expects it at /content/kaggle.json).

In [2]:
import os
import json

# Read the uploaded kaggle.json and expose its credentials to the Kaggle CLI
# through the KAGGLE_USERNAME and KAGGLE_KEY environment variables.
with open('/content/kaggle.json') as f:
    kaggle_json = json.load(f)
    os.environ['KAGGLE_USERNAME'] = kaggle_json['username']
    os.environ['KAGGLE_KEY'] = kaggle_json['key']
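
If `kaggle.json` is missing or malformed, the cell above fails with a bare `FileNotFoundError` or `KeyError`. A minimal sketch of a more defensive loader follows. The function name `load_kaggle_credentials` and its error messages are illustrative and not part of the Kaggle API; only the file layout (`username` and `key` fields) comes from the cell above.

```python
import json
import os


def load_kaggle_credentials(path="/content/kaggle.json"):
    """Read kaggle.json and export its credentials as environment variables.

    Hypothetical helper for illustration only.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Upload kaggle.json first; nothing found at {path}")

    with open(path) as f:
        creds = json.load(f)

    # Fail early with a readable message if a field is absent.
    missing = [field for field in ("username", "key") if field not in creds]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {missing}")

    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]
```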

In [3]:
# Download the IMDB 50K movie reviews dataset as a zip archive
!kaggle datasets download lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 35% 9.00M/25.7M [00:00<00:00, 92.7MB/s]
100% 25.7M/25.7M [00:00<00:00, 164MB/s] 


In [4]:
!unzip imdb-dataset-of-50k-movie-reviews.zip

Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        
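
`!unzip` relies on a shell command that Colab provides. As a sketch for environments where that command may not be available, Python's standard `zipfile` module can extract the archive instead. The helper name `extract_archive` is an assumption made for illustration.

```python
import zipfile


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# In this notebook the call would look like:
# extract_archive("imdb-dataset-of-50k-movie-reviews.zip", "/content")
```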


# **Text Cleaning**

In [5]:
import numpy as np
import pandas as pd

In [6]:
temp_df = pd.read_csv('/content/IMDB Dataset.csv')

In [7]:
temp_df.shape

(50000, 2)

In [8]:
df = temp_df.iloc[:5000]

In [None]:
df.shape

(5000, 2)

In [9]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


# Lowercasing

In [10]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [11]:
df['review'] = df['review'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review'] = df['review'].str.lower()
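
The warning above appears because `df` was created as a slice of `temp_df`, so pandas cannot tell whether the assignment is also meant to change `temp_df`. One common way to avoid it is to take an explicit copy when slicing. The sketch below uses a tiny made-up frame in place of the IMDB data.

```python
import pandas as pd

# Stand-in for temp_df; the rows are invented for illustration.
temp_df = pd.DataFrame({
    "review": ["Good Movie", "Bad Movie", "Fine"],
    "sentiment": ["positive", "negative", "positive"],
})

# .copy() makes df an independent DataFrame rather than a slice of temp_df.
df = temp_df.iloc[:2].copy()
df["review"] = df["review"].str.lower()

print(df["review"].tolist())       # lowercased copy
print(temp_df["review"].tolist())  # original left unchanged
```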


In [12]:
df

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. <br /><br />the... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
| ... | ... | ... |
| 4995 | an interesting slasher film with multiple susp... | negative |
| 4996 | i watched this series when it first came out i... | positive |
| 4997 | once again jet li brings his charismatic prese... | positive |
| 4998 | i rented this movie, after hearing chris gore ... | negative |


# Removing Special Characters

**Remove HTML Tags**

In [13]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [14]:
html_text = "<html><body><h1>My World</h1></body></html>"
clean_text = remove_html_tags(html_text)
print(clean_text)

My World
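
Note that `remove_html_tags` replaces each tag with nothing, so text such as `time.<br /><br />this` becomes `time.this` and the words on either side of the tag run together. A possible variant, sketched here under the hypothetical name `remove_html_tags_spaced`, substitutes a space for each tag and then collapses repeated whitespace.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')
WHITESPACE_PATTERN = re.compile(r'\s+')


def remove_html_tags_spaced(text):
    # Replace every tag with a space so neighbouring words stay separated.
    without_tags = TAG_PATTERN.sub(' ', text)
    # Collapse the runs of whitespace this introduces and trim the ends.
    return WHITESPACE_PATTERN.sub(' ', without_tags).strip()


print(remove_html_tags_spaced("all the time.<br /><br />this movie is slow"))
```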


In [15]:
df['review'][1]

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [16]:
df['review'] = df['review'].apply(remove_html_tags)


In [17]:
df['review'][1]

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well done.'

**Remove URLs**

In [18]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)
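
A quick check of `remove_url` on a made-up sentence (the example URLs are placeholders) shows what the pattern matches: anything starting with `http://`, `https://`, or `www.` up to the next whitespace character.

```python
import re


def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)


sample = "details at https://example.com/review?id=1 or www.example.com today"
# repr() makes the leftover double spaces visible.
print(repr(remove_url(sample)))
```

Because `\S+` runs to the next whitespace, punctuation attached to the end of a URL is removed along with it, and the spaces on either side of the URL are left in place.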

In [19]:
df['review'] = df['review'].apply(remove_url)


In [20]:
df['review'][1]

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well done.'

In [21]:
df

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. the filming tec... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
| ... | ... | ... |
| 4995 | an interesting slasher film with multiple susp... | negative |
| 4996 | i watched this series when it first came out i... | positive |
| 4997 | once again jet li brings his charismatic prese... | positive |
| 4998 | i rented this movie, after hearing chris gore ... | negative |


**Remove Punctuation**

In [None]:
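# Alternative: remove every ASCII punctuation character instead of the short list used in the next cell.
#import string
#exclude = string.punctuation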

In [22]:
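# Characters to strip. Apostrophes, hyphens, and quotes are not in this list, so they are kept.
exclude = "!.,?"
def remove_punc(text):
    # The third argument of str.maketrans lists the characters to delete.
    return text.translate(str.maketrans('', '', exclude))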

In [23]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [24]:
df['review'] = df['review'].apply(remove_punc)


In [25]:
df['review'][5]

'probably my all-time favorite movie a story of selflessness sacrifice and dedication to a noble cause but it\'s not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas\' performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like "dressed-up midgets" than children but that only makes them more fun to watch and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling if i had a dozen thumbs they\'d all be "up" for this movie'
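
The output still contains apostrophes, hyphens, and double quotes because `exclude` lists only four characters. For comparison, this sketch contrasts that short list with Python's full `string.punctuation` set. The helper name `strip_chars` and the sample sentence are illustrative.

```python
import string


def strip_chars(text, chars):
    # The third argument of str.maketrans lists the characters to delete.
    return text.translate(str.maketrans('', '', chars))


sample = "it's my all-time favorite, isn't it?!"

print(strip_chars(sample, "!.,?"))              # keeps ' and -
print(strip_chars(sample, string.punctuation))  # removes them too
```

Removing apostrophes and hyphens merges contractions and compounds into single strings (`isnt`, `alltime`), which affects any later step that matches on exact words.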

In [26]:
df

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production the filming tech... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
| ... | ... | ... |
| 4995 | an interesting slasher film with multiple susp... | negative |
| 4996 | i watched this series when it first came out i... | positive |
| 4997 | once again jet li brings his charismatic prese... | positive |
| 4998 | i rented this movie after hearing chris gore s... | negative |


**Stopword Removal**

In [27]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [28]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [29]:
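# Build the stopword set once; a set lookup is much faster than
# rebuilding and scanning the full list for every word.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    new_text = []
    for word in text.split():
        # Replace each stopword with an empty string (this leaves extra spaces in the result).
        if word in stop_words:
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)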

In [30]:
remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times')

'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'
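
The double spaces in the output come from replacing each stopword with an empty string before joining. A variant that simply skips stopwords is sketched below. To keep it runnable without downloading NLTK data it uses a small hand-picked set, `DEMO_STOPWORDS`, which is an assumption for illustration; in the notebook you would pass `set(stopwords.words('english'))` instead.

```python
# Tiny illustrative subset of English stopwords, not the NLTK list.
DEMO_STOPWORDS = {"my", "a", "of", "and", "to", "but", "it", "or", "not"}


def drop_stopwords(text, stop_words=DEMO_STOPWORDS):
    # Skipped words leave no empty slots behind, so no double spaces appear.
    return " ".join(word for word in text.split() if word not in stop_words)


print(drop_stopwords("probably my all-time favorite movie, a story of sacrifice"))
```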

In [31]:
df['review'] = df['review'].apply(remove_stopwords)


In [32]:
df

| | review | sentiment |
|---|---|---|
| 0 | one reviewers mentioned watching 1 oz e... | positive |
| 1 | wonderful little production filming techniqu... | positive |
| 2 | thought wonderful way spend time hot s... | positive |
| 3 | basically there's family little boy (jake) ... | negative |
| 4 | petter mattei's "love time money" visuall... | positive |
| ... | ... | ... |
| 4995 | interesting slasher film multiple suspectsin... | negative |
| 4996 | watched series first came 70si 14 year... | positive |
| 4997 | jet li brings charismatic presence movie ... | positive |
| 4998 | rented movie hearing chris gore saying some... | negative |


**Tokenization**

In [33]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [34]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']
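
To see roughly what a word tokenizer does beyond `str.split`, here is a simplified regular-expression approximation. It is not NLTK's algorithm (`word_tokenize` applies many more rules, for example for contractions and quotes), and the function name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


sent1 = 'I am going to visit delhi!'
print(sent1.split())           # 'delhi!' stays glued together
print(simple_tokenize(sent1))  # '!' becomes its own token
```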

In [35]:
df['review'][1]

' wonderful little production  filming technique   unassuming-  old-time-bbc fashion  gives  comforting  sometimes discomforting sense  realism   entire piece  actors  extremely well chosen- michael sheen   "has got   polari"      voices  pat    truly see  seamless editing guided   references  williams\' diary entries     well worth  watching     terrificly written  performed piece  masterful production  one   great master\'s  comedy   life  realism really comes home   little things:  fantasy   guard  rather  use  traditional \'dream\' techniques remains solid  disappears  plays   knowledge   senses particularly   scenes concerning orton  halliwell   sets (particularly   flat  halliwell\'s murals decorating every surface)  terribly well done'

In [36]:
#df['sentences'] = df['review'].apply(sent_tokenize)
df['review'] = df['review'].apply(word_tokenize)


In [37]:
df['review'][1]

['wonderful',
 'little',
 'production',
 'filming',
 'technique',
 'unassuming-',
 'old-time-bbc',
 'fashion',
 'gives',
 'comforting',
 'sometimes',
 'discomforting',
 'sense',
 'realism',
 'entire',
 'piece',
 'actors',
 'extremely',
 'well',
 'chosen-',
 'michael',
 'sheen',
 '``',
 'has',
 'got',
 'polari',
 "''",
 'voices',
 'pat',
 'truly',
 'see',
 'seamless',
 'editing',
 'guided',
 'references',
 'williams',
 "'",
 'diary',
 'entries',
 'well',
 'worth',
 'watching',
 'terrificly',
 'written',
 'performed',
 'piece',
 'masterful',
 'production',
 'one',
 'great',
 'master',
 "'s",
 'comedy',
 'life',
 'realism',
 'really',
 'comes',
 'home',
 'little',
 'things',
 ':',
 'fantasy',
 'guard',
 'rather',
 'use',
 'traditional',
 "'dream",
 "'",
 'techniques',
 'remains',
 'solid',
 'disappears',
 'plays',
 'knowledge',
 'senses',
 'particularly',
 'scenes',
 'concerning',
 'orton',
 'halliwell',
 'sets',
 '(',
 'particularly',
 'flat',
 'halliwell',
 "'s",
 'murals',
 'decorating

In [38]:
df['review']

| | review |
|---|---|
| 0 | [one, reviewers, mentioned, watching, 1, oz, e... |
| 1 | [wonderful, little, production, filming, techn... |
| 2 | [thought, wonderful, way, spend, time, hot, su... |
| 3 | [basically, there, 's, family, little, boy, (,... |
| 4 | [petter, mattei, 's, ``, love, time, money, ''... |
| ... | ... |
| 4995 | [interesting, slasher, film, multiple, suspect... |
| 4996 | [watched, series, first, came, 70si, 14, years... |
| 4997 | [jet, li, brings, charismatic, presence, movie... |
| 4998 | [rented, movie, hearing, chris, gore, saying, ... |


**Stemming**

In [39]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [40]:
sample1 = "The leaves are falling and the children are running towards the park."
stem_words(sample1)

'the leav are fall and the children are run toward the park.'

**Lemmatization**

In [41]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmitizer = WordNetLemmatizer()
def lemmitize_words(text):
    return " ".join([lemmitizer.lemmatize(word,pos='v') for word in text.split()])

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [42]:
#sample2 = "happy happiness happily"
sample3 = "The leaves are falling and the children are running towards the park ran."
lemmitize_words(sample3)

'The leave be fall and the children be run towards the park ran.'
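
In this output `ran.` is left unchanged because splitting on whitespace keeps the period attached, so the lemmatizer never sees the bare word `ran`. One way around this is to separate punctuation before normalizing each word. The sketch below shows the idea with a stand-in lookup table, `TOY_LEMMAS`, in place of WordNet; that table and the helpers `normalize_words` and `toy_lemmatize` are assumptions for illustration. In the notebook, the `normalizer` argument could be `ps.stem` or a small function wrapping `lemmitizer.lemmatize`.

```python
import re

# Stand-in for a real lemmatizer: maps a few verb forms to their base form.
TOY_LEMMAS = {"ran": "run", "running": "run", "are": "be"}


def toy_lemmatize(word):
    # Unknown words are returned unchanged.
    return TOY_LEMMAS.get(word, word)


def normalize_words(text, normalizer):
    # Split into word tokens and single punctuation marks first,
    # so that 'ran.' is seen as 'ran' followed by '.'.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [normalizer(token) for token in tokens]


print(normalize_words("the children are running and the dog ran.", toy_lemmatize))
```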

In [43]:
df['review'][1]

['wonderful',
 'little',
 'production',
 'filming',
 'technique',
 'unassuming-',
 'old-time-bbc',
 'fashion',
 'gives',
 'comforting',
 'sometimes',
 'discomforting',
 'sense',
 'realism',
 'entire',
 'piece',
 'actors',
 'extremely',
 'well',
 'chosen-',
 'michael',
 'sheen',
 '``',
 'has',
 'got',
 'polari',
 "''",
 'voices',
 'pat',
 'truly',
 'see',
 'seamless',
 'editing',
 'guided',
 'references',
 'williams',
 "'",
 'diary',
 'entries',
 'well',
 'worth',
 'watching',
 'terrificly',
 'written',
 'performed',
 'piece',
 'masterful',
 'production',
 'one',
 'great',
 'master',
 "'s",
 'comedy',
 'life',
 'realism',
 'really',
 'comes',
 'home',
 'little',
 'things',
 ':',
 'fantasy',
 'guard',
 'rather',
 'use',
 'traditional',
 "'dream",
 "'",
 'techniques',
 'remains',
 'solid',
 'disappears',
 'plays',
 'knowledge',
 'senses',
 'particularly',
 'scenes',
 'concerning',
 'orton',
 'halliwell',
 'sets',
 '(',
 'particularly',
 'flat',
 'halliwell',
 "'s",
 'murals',
 'decorating

In [44]:
def lemmatize_words(tokens):
    # Lemmatize each token as a verb (pos='v') and return the result as a list of tokens.
    return [lemmitizer.lemmatize(word, pos='v') for word in tokens]

# Lemmatizing the tokenized words in the 'review' column
df['lemmatized_review'] = df['review'].apply(lemmatize_words)


In [45]:
df['lemmatized_review'][1]

['wonderful',
 'little',
 'production',
 'film',
 'technique',
 'unassuming-',
 'old-time-bbc',
 'fashion',
 'give',
 'comfort',
 'sometimes',
 'discomforting',
 'sense',
 'realism',
 'entire',
 'piece',
 'actors',
 'extremely',
 'well',
 'chosen-',
 'michael',
 'sheen',
 '``',
 'have',
 'get',
 'polari',
 "''",
 'voice',
 'pat',
 'truly',
 'see',
 'seamless',
 'edit',
 'guide',
 'reference',
 'williams',
 "'",
 'diary',
 'entries',
 'well',
 'worth',
 'watch',
 'terrificly',
 'write',
 'perform',
 'piece',
 'masterful',
 'production',
 'one',
 'great',
 'master',
 "'s",
 'comedy',
 'life',
 'realism',
 'really',
 'come',
 'home',
 'little',
 'things',
 ':',
 'fantasy',
 'guard',
 'rather',
 'use',
 'traditional',
 "'dream",
 "'",
 'techniques',
 'remain',
 'solid',
 'disappear',
 'play',
 'knowledge',
 'sense',
 'particularly',
 'scenes',
 'concern',
 'orton',
 'halliwell',
 'set',
 '(',
 'particularly',
 'flat',
 'halliwell',
 "'s",
 'murals',
 'decorate',
 'every',
 'surface',
 ')',

# **Lab Task:**

**Apply these preprocessing steps to the following dataset:**
https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
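
A possible starting point for the task is to chain the cleaning steps into one function that can be passed to `apply` on whichever text column the new dataset uses. The sketch below repeats the patterns from this notebook so that it runs on its own. The wrapper name `clean_text` and the sample string are illustrative.

```python
import re

HTML_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EXCLUDE = "!.,?"


def clean_text(text):
    text = text.lower()                # lowercasing
    text = HTML_PATTERN.sub('', text)  # remove HTML tags
    text = URL_PATTERN.sub('', text)   # remove URLs
    # Remove the same four punctuation characters as remove_punc.
    return text.translate(str.maketrans('', '', EXCLUDE))


# repr() makes any leftover double spaces visible.
print(repr(clean_text("Great MOVIE! <br />See https://example.com now.")))
```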