# **NLP Pipeline**
*   **Downloading Dataset from Kaggle to Google Colab**
*   **Text Cleaning**
*   **Text Preprocessing**

# **Downloading Dataset from Kaggle to Google Colab**

In [91]:
#!/bin/bash
!pip install kaggle



To access Kaggle datasets, you need to provide an API key. Here’s how to get one:

*   Visit https://www.kaggle.com/ and log in to your Kaggle account.
*   Get the Kaggle API key:
    *   Click your profile icon in the top-right corner and select Settings.
    *   Scroll down to the API section and click Create New API Token.
    *   This downloads a file called kaggle.json.
*   Upload kaggle.json to Google Colab (the code below expects it at /content/kaggle.json).

In [92]:
import os
import json

# Set up Kaggle API credentials.
# Alternative: point the Kaggle CLI at the folder that holds kaggle.json
# os.environ['KAGGLE_CONFIG_DIR'] = "/content"

# Read the uploaded key file and expose its values as environment variables
with open('/content/kaggle.json') as f:
    kaggle_json = json.load(f)

os.environ['KAGGLE_USERNAME'] = kaggle_json['username']
os.environ['KAGGLE_KEY'] = kaggle_json['key']
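
The credential step can also be wrapped in a small function that fails early when the key file is incomplete. This is only a sketch: `load_kaggle_credentials` is a hypothetical helper name, not part of the Kaggle package, and it assumes the file contains the `username` and `key` fields used above.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path):
    """Read a kaggle.json file and export its values as environment variables.

    Hypothetical helper for illustration; raises ValueError if a field is missing.
    """
    creds = json.loads(Path(path).read_text())
    missing = [field for field in ("username", "key") if not creds.get(field)]
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Usage in Colab, assuming the file was uploaded to /content:
# load_kaggle_credentials("/content/kaggle.json")
```

A missing file surfaces as the usual `FileNotFoundError`, which is easier to diagnose than an authentication error from the CLI later on.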

In [93]:
#!/bin/bash
!kaggle datasets download uciml/sms-spam-collection-dataset



Dataset URL: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
License(s): unknown
sms-spam-collection-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [94]:
# -o overwrites existing files without prompting; a notebook cell cannot answer the interactive prompt
!unzip -o sms-spam-collection-dataset.zip
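
If the shell prompt gets in the way, the archive can also be extracted from Python, which overwrites existing files without asking. The sketch below uses the standard-library `zipfile` module; `extract_archive` is a hypothetical helper name and the `/content` destination is an assumption based on the paths used in this notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        # extractall replaces files that already exist, so re-running the cell is safe
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage in Colab:
# extract_archive("sms-spam-collection-dataset.zip", "/content")
```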

# **Text Cleaning**

In [96]:
import numpy as np
import pandas as pd

In [97]:
temp_df = pd.read_csv('/content/spam.csv',encoding='latin-1')
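
The `encoding='latin-1'` argument matters because some bytes are not valid UTF-8. The sketch below shows the general behaviour on a made-up byte string (it is not taken from the dataset): UTF-8 decoding fails on a lone `0xA3` byte, while Latin-1 maps every byte to a character.

```python
raw = b"win \xa3100 now"  # illustrative bytes, not from spam.csv

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xA3 cannot start a UTF-8 sequence, so decoding fails
    utf8_ok = False

text = raw.decode("latin-1")  # Latin-1 maps each byte to exactly one character
print(utf8_ok)  # False
print(text)     # win £100 now
```

Latin-1 never raises, but it can produce the wrong characters if the file was actually written in another encoding, so odd symbols in the output are worth a second look.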

In [98]:
temp_df.shape

(5572, 5)

In [99]:
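# Work on a copy so later in-place edits do not modify temp_df
df = temp_df.copy()  # use temp_df.iloc[:5000].copy() to work on a subset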

In [100]:
df.shape

(5572, 5)

In [101]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [102]:
# Check for null values

df.isnull().sum()

Unnamed: 0,0
v1,0
v2,0
Unnamed: 2,5522
Unnamed: 3,5560
Unnamed: 4,5566


In [103]:
# Drop the three unnamed columns, which are null in more than 99% of the rows

df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)

In [104]:
# Rename the other columns

df.rename(columns={'v1':'Class','v2':'Message'},inplace=True)

In [105]:
# Reordering the columns in the dataset

columns = ['Message','Class']

df = df[columns]

df.head(3)

Unnamed: 0,Message,Class
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam


In [106]:
# Check the class distribution

df['Class'].value_counts()

Class,count
ham,4825
spam,747


# Lowercasing

In [107]:
df['Message'][3].lower()

'u dun say so early hor... u c already then say...'

In [108]:
df['Message'] = df['Message'].str.lower()

In [109]:
df

Unnamed: 0,Message,Class
0,"go until jurong point, crazy.. available only ...",ham
1,ok lar... joking wif u oni...,ham
2,free entry in 2 a wkly comp to win fa cup fina...,spam
3,u dun say so early hor... u c already then say...,ham
4,"nah i don't think he goes to usf, he lives aro...",ham
...,...,...
5567,this is the 2nd time we have tried 2 contact u...,spam
5568,will ì_ b going to esplanade fr home?,ham
5569,"pity, * was in mood for that. so...any other s...",ham
5570,the guy did some bitching but i acted like i'd...,ham


# Removing Special Characters

**Remove username**

In [110]:
import re

def remove_username(text):
    # '+' matches the whole handle; without it only the first character after '@' is removed
    pattern = re.compile(r'@[A-Za-z0-9_.]+')
    return pattern.sub(r'', text)


In [111]:
df['Message'] = df['Message'].apply(remove_username)
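
To see why the quantifier after the character class matters, the sketch below compares a pattern that matches one character after `@` with one that matches one or more. The sample sentences are made up for illustration.

```python
import re

ONE_CHAR = re.compile(r'@[A-Za-z0-9_.]')    # '@' plus exactly one character
WHOLE_NAME = re.compile(r'@[A-Za-z0-9_.]+')  # '@' plus the full handle

sample = "thanks @john_doe for the tip"

print(ONE_CHAR.sub('', sample))    # thanks ohn_doe for the tip
print(WHOLE_NAME.sub('', sample))  # thanks  for the tip

# Caveat: the same pattern also eats the domain part of an email address
print(WHOLE_NAME.sub('', "mail me at a@b.com"))  # mail me at a
```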

**Remove numbers**

In [112]:
import re

def remove_number(text):
    # Raw string avoids the invalid-escape warning for '\d'
    pattern = re.compile(r'\d+')
    return pattern.sub(r'', text)

In [113]:
df['Message'] = df['Message'].apply(remove_number)


**Character normalization**

In [117]:
import re

def remove_character(text):
    # Ensure input is a string
    text = str(text)

    # Pattern: matches any letter repeated 3 or more times in a row
    pattern = re.compile(r'([a-zA-Z])\1{2,}')

    # Substitution: replace the run with a single instance of the letter.
    # r'\1' refers to the captured letter; an empty replacement would delete it entirely.
    return pattern.sub(r'\1', text)


In [118]:

# Apply the function to the 'Message' column
df['Message'] = df['Message'].apply(remove_character)
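
The effect of the backreference in the replacement string is easiest to see on a short example. The sketch below is illustrative only (the sample text is made up) and contrasts replacing a run with the captured letter against replacing it with an empty string.

```python
import re

REPEATS = re.compile(r'([a-zA-Z])\1{2,}')  # a letter followed by 2+ copies of itself

sample = "yesss sooo happyyy"

print(REPEATS.sub(r'\1', sample))  # yes so happy
print(REPEATS.sub('', sample))     # ye s happ
```

Runs of exactly two letters, such as the "pp" in "happy", are left alone because the pattern requires at least three in a row.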


**Remove HTML Tags**

In [119]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [120]:
df['Message'][1]

'ok lar... joking wif u oni...'

In [121]:
df['Message'] = df['Message'].apply(remove_html_tags)

In [122]:
df['Message'][1]

'ok lar... joking wif u oni...'
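
Message 1 contains no markup, so its output is unchanged. A made-up string with tags shows what the pattern does, along with one caveat of the regex approach: any text between a `<` and the next `>` is removed, even when it is not a tag.

```python
import re

TAG = re.compile(r'<.*?>')  # non-greedy: stops at the first '>'

print(TAG.sub('', "<p>win a <b>free</b> prize</p>"))  # win a free prize

# Caveat: comparison signs are treated like a tag
print(TAG.sub('', "2 < 3 and 5 > 4"))  # 2  4
```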

**Remove URL**

In [123]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [124]:
df['Message'] = df['Message'].apply(remove_url)

In [125]:
df['Message'][1]

'ok lar... joking wif u oni...'
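
Again, message 1 has no URL, so nothing changes. The sketch below applies the same pattern to a made-up sentence and then collapses the leftover double spaces, a step the notebook does not perform but which keeps the text tidy.

```python
import re

URL = re.compile(r'https?://\S+|www\.\S+')

sample = "claim at http://example.com/win or www.example.org now"

without_urls = URL.sub('', sample)
print(repr(without_urls))  # 'claim at  or  now'

# Optional: collapse runs of whitespace left behind by the removal
tidy = " ".join(without_urls.split())
print(tidy)  # claim at or now
```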

In [126]:
df

Unnamed: 0,Message,Class
0,"go until jurong point, crazy.. available only ...",ham
1,ok lar... joking wif u oni...,ham
2,free entry in a wkly comp to win fa cup final...,spam
3,u dun say so early hor... u c already then say...,ham
4,"nah i don't think he goes to usf, he lives aro...",ham
...,...,...
5567,this is the nd time we have tried contact u. ...,spam
5568,will ì_ b going to esplanade fr home?,ham
5569,"pity, * was in mood for that. so...any other s...",ham
5570,the guy did some bitching but i acted like i'd...,ham


**Remove Punctuation**

In [134]:
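import string

# string.punctuation holds every ASCII punctuation character, apostrophes included
exclude = string.punctuation

In [135]:
# Use a narrower set here so that apostrophes (as in "it's") are kept
exclude = "!.,?*_"

def remove_punc(text):
    # The third argument of str.maketrans lists the characters to delete
    return text.translate(str.maketrans('', '', exclude))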

In [136]:
df['Message'][5569]

'pity  was in mood for that soany other suggestions'

In [137]:
df['Message'] = df['Message'].apply(remove_punc)

In [138]:
df['Message'][5]

"freemsg hey there darling it's been  week's now and no word back i'd like some fun you up for it still tb ok x std chgs to send å£ to rcv"

In [139]:
df

Unnamed: 0,Message,Class
0,go until jurong point crazy available only in ...,ham
1,ok lar joking wif u oni,ham
2,free entry in a wkly comp to win fa cup final...,spam
3,u dun say so early hor u c already then say,ham
4,nah i don't think he goes to usf he lives arou...,ham
...,...,...
5567,this is the nd time we have tried contact u u...,spam
5568,will ì b going to esplanade fr home,ham
5569,pity was in mood for that soany other suggest...,ham
5570,the guy did some bitching but i acted like i'd...,ham
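
The choice of `exclude` decides whether contractions survive. The sketch below, on a made-up sentence, compares the narrow set used above with the full `string.punctuation` set.

```python
import string

NARROW = str.maketrans('', '', "!.,?*_")
FULL = str.maketrans('', '', string.punctuation)

sample = "it's free!!! call now, ok?"

print(sample.translate(NARROW))  # it's free call now ok
print(sample.translate(FULL))    # its free call now ok
```

Keeping the apostrophe matters later, because NLTK's English stopword list contains forms such as "it's" and "you're".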


**Stopword Removal**

In [140]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [141]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [142]:
# Build the stopword set once; set lookups are much faster than
# reloading and scanning the list for every word.
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    # Keep only the words that are not stopwords
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

In [143]:
remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times')

'probably all-time favorite movie, story selflessness, sacrifice dedication noble cause, preachy boring. never gets old, despite seen 15 times'

In [144]:
df['Message'] = df['Message'].apply(remove_stopwords)

In [145]:
df

Unnamed: 0,Message,Class
0,go jurong point crazy available bugis n gre...,ham
1,ok lar joking wif u oni,ham
2,free entry wkly comp win fa cup final tkts ...,spam
3,u dun say early hor u c already say,ham
4,nah think goes usf lives around though,ham
...,...,...
5567,nd time tried contact u u å£ pound pri...,spam
5568,ì b going esplanade fr home,ham
5569,pity mood soany suggestions,ham
5570,guy bitching acted like i'd interested ...,ham
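
The filtering idea itself does not depend on NLTK. The sketch below uses a tiny hand-picked stopword set (an assumption for illustration, far smaller than NLTK's list) to show set-based filtering on a made-up sentence.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
SAMPLE_STOPWORDS = frozenset({"i", "my", "a", "the", "to", "and", "is"})


def drop_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Return text with every whitespace-separated stopword removed."""
    return " ".join(word for word in text.split() if word not in stop_words)


print(drop_stopwords("the offer is open to my friends and family"))
# offer open friends family
```

Because the check is an exact string match, it only works as intended on text that has already been lowercased.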


**Tokenization**

In [146]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [147]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [148]:
df['Message'][1]

'ok lar joking wif u oni'

In [149]:
# Split each message into word tokens (sent_tokenize would split into sentences instead)
df['Message'] = df['Message'].apply(word_tokenize)

In [150]:
df['Message'][1]

['ok', 'lar', 'joking', 'wif', 'u', 'oni']

In [151]:
df['Message']

Unnamed: 0,Message
0,"[go, jurong, point, crazy, available, bugis, n..."
1,"[ok, lar, joking, wif, u, oni]"
2,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,"[u, dun, say, early, hor, u, c, already, say]"
4,"[nah, think, goes, usf, lives, around, though]"
...,...
5567,"[nd, time, tried, contact, u, u, å£, pound, pr..."
5568,"[ì, b, going, esplanade, fr, home]"
5569,"[pity, mood, soany, suggestions]"
5570,"[guy, bitching, acted, like, i, 'd, interested..."
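
For intuition about what a word tokenizer does, here is a rough regex-based approximation. It is not NLTK's algorithm, only a sketch: it splits text into runs of word characters and single punctuation marks, so it agrees with `word_tokenize` on simple sentences but handles contractions differently.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")  # a word, or one non-space symbol


def simple_tokenize(text):
    """Rough stand-in for a word tokenizer, for illustration only."""
    return TOKEN.findall(text)


print(simple_tokenize("I am going to visit delhi!"))
# ['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

# Contractions are split into three pieces here, whereas the NLTK
# output above shows "i'd" as two tokens: 'i' and "'d"
print(simple_tokenize("i'd like some fun"))
# ['i', "'", 'd', 'like', 'some', 'fun']
```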


**Stemming**

In [152]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [153]:
sample1 = "The leaves are falling and the children are running towards the park."
stem_words(sample1)

'the leav are fall and the children are run toward the park.'

**Lemmatization**

In [154]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmitizer = WordNetLemmatizer()
def lemmitize_words(text):
    return " ".join([lemmitizer.lemmatize(word,pos='v') for word in text.split()])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [155]:
#sample2 = "happy happiness happily"
sample3 = "The leaves are falling and the children are running towards the park ran."
lemmitize_words(sample3)

'The leave be fall and the children be run towards the park ran.'

In [156]:
df['Message'][1]

['ok', 'lar', 'joking', 'wif', 'u', 'oni']

In [157]:
def lemmatize_words(tokens):
    # Lemmatize each token as a verb (pos='v') and return the result as a list
    return [lemmitizer.lemmatize(word,pos='v') for word in tokens]

# Lemmatize the tokenized words in the 'Message' column
df['lemmatized_Message'] = df['Message'].apply(lemmatize_words)

In [158]:
df['lemmatized_Message'][1]

['ok', 'lar', 'joke', 'wif', 'u', 'oni']
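
The regex-based cleaning steps from this notebook can be chained into one function. The sketch below is one possible arrangement, not the notebook's own code: `clean_message` is a hypothetical name, the order of steps is an assumption (URLs are removed before digits and punctuation so they are still intact when matched), and stopword removal and lemmatization are left out because they need NLTK.

```python
import re

PUNCT_TABLE = str.maketrans('', '', "!.,?*_")


def clean_message(text):
    """Lowercase, strip URLs, tags, usernames, digits and punctuation, then split."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs first, while intact
    text = re.sub(r'<.*?>', '', text)                  # HTML tags
    text = re.sub(r'@[a-z0-9_.]+', '', text)           # usernames
    text = re.sub(r'\d+', '', text)                    # numbers
    text = re.sub(r'([a-z])\1{2,}', r'\1', text)       # sooo -> so
    text = text.translate(PUNCT_TABLE)                 # selected punctuation
    return text.split()                                # whitespace tokens


sample = "WINNER!!! Claim 1000 at http://example.com/win <b>NOW</b> @promo_team sooo easy"
print(clean_message(sample))
# ['winner', 'claim', 'at', 'now', 'so', 'easy']
```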

**Task Completed**