Task 37: NLP Preprocessing

This notebook walks through common text preprocessing techniques and applies each one to two datasets:
an SMS spam dataset and a movie review dataset.

Importing necessary libraries

In [15]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Importing datasets

Importing spam dataset

In [2]:
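from google.colab import files

# Upload sms_spam.csv through the Colab file picker, then load it.
uploaded = files.upload()
df_spam = pd.read_csv('sms_spam.csv', encoding='latin-1')

# Keep only the first two columns and give them descriptive names.
df_spam = df_spam.iloc[:, :2]
df_spam.columns = ['label', 'text']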

Saving sms_spam.csv to sms_spam.csv


Importing movie review dataset

In [3]:
from google.colab import files
uploaded = files.upload()

df_movie = pd.read_csv('movie_review.csv')


Saving movie_review.csv to movie_review.csv


Basic information for both datasets

In [4]:
print("Spam Dataset Information:")
df_spam.info()  # info() prints its own summary and returns None, so it is not wrapped in print()
print("\nFirst few rows of the Spam dataset:")
print(df_spam.head())

print("\nMovie Review Dataset Information:")
df_movie.info()
print("\nFirst few rows of the Movie Review dataset:")
print(df_movie.head())



Spam Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   29 non-null     object
 1   text    29 non-null     object
dtypes: object(2)
memory usage: 592.0+ bytes

First few rows of the Spam dataset:
  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

Movie Review Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   fold_id  18 non-null     int64 
 1   cv_tag   18 non-null     object
 2

Convert to lowercase for both datasets

In [5]:
df_spam['text'] = df_spam['text'].str.lower()
df_movie['text'] = df_movie['text'].str.lower()

Remove punctuation for both datasets

In [6]:
# Build the translation table once: it deletes every character in string.punctuation.
punct_table = str.maketrans('', '', string.punctuation)
df_spam['text'] = df_spam['text'].str.translate(punct_table)
df_movie['text'] = df_movie['text'].str.translate(punct_table)
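
To make the effect of the two cleaning steps concrete, here is a minimal standalone sketch that applies the same lowercase and `str.translate` logic to a single string instead of a DataFrame column. The sample text is the start of row 1 of the spam dataset as printed earlier; the helper name `clean_text` is hypothetical and not part of the notebook.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Lowercase the text, then delete ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(PUNCT_TABLE)

sample = "Ok lar... Joking wif u oni..."
cleaned = clean_text(sample)
print(cleaned)  # ok lar joking wif u oni
```

Note that `string.punctuation` includes the apostrophe, so a contraction such as "don't" comes out as "dont".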

Remove stopwords for both datasets

In [9]:
stop_words = set(stopwords.words('english'))

df_spam['text_no_stopwords'] = df_spam['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
df_movie['review_no_stopwords'] = df_movie['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
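
The order of the steps matters here. NLTK's English list stores contractions with the apostrophe (for example "don't"), but punctuation was stripped first, so "don't" has already become "dont" and no longer matches. That is why "dont" survives in the preprocessed output. The sketch below illustrates this with a tiny hand-written stopword set; `TOY_STOPWORDS`, `remove_stopwords`, and `strip_punct` are hypothetical stand-ins, not NLTK's actual list or API.

```python
import string

# Tiny illustrative stopword set (an assumption, far smaller than NLTK's list).
TOY_STOPWORDS = {"i", "he", "to", "don't"}

def remove_stopwords(text, stop_words):
    """Drop whitespace-separated words that appear in stop_words."""
    return ' '.join(w for w in text.split() if w not in stop_words)

def strip_punct(text):
    """Delete every ASCII punctuation character, including apostrophes."""
    return text.translate(str.maketrans('', '', string.punctuation))

raw = "i don't think he goes to usf"

# Punctuation first (the notebook's order): "dont" slips past the filter.
punct_first = remove_stopwords(strip_punct(raw), TOY_STOPWORDS)

# Stopwords first: "don't" is still intact and gets removed.
stop_first = strip_punct(remove_stopwords(raw, TOY_STOPWORDS))

print(punct_first)  # dont think goes usf
print(stop_first)   # think goes usf
```

If contractions should be filtered, one option is to remove stopwords before stripping punctuation; whether that is preferable depends on the downstream task.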

Tokenization (Splitting text into individual word tokens) for both datasets

In [12]:
df_spam['tokens'] = df_spam['text_no_stopwords'].apply(word_tokenize)
df_movie['tokens'] = df_movie['review_no_stopwords'].apply(word_tokenize)
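
Since punctuation has already been removed, the text at this point consists of plain words separated by spaces, so a whitespace split gives a rough picture of the lists that end up in the `tokens` column. The sketch below is a standard-library stand-in for illustration only; `simple_tokenize` is a hypothetical name, and it is not `word_tokenize`, which additionally separates punctuation and contractions when they are present.

```python
def simple_tokenize(text):
    """Split on runs of whitespace (hypothetical stand-in for a real tokenizer)."""
    return text.split()

tokens = simple_tokenize("ok lar joking wif u oni")
print(tokens)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```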

Stemming (Reducing words to their root form) for both datasets

In [13]:
stemmer = PorterStemmer()
df_spam['tokens_stemmed'] = df_spam['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
df_movie['tokens_stemmed'] = df_movie['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
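
To give a feel for what suffix stripping does, here is a deliberately simplified sketch. It is not the Porter algorithm (which applies several ordered rule phases with extra conditions); `toy_stem` and its suffix list are assumptions made purely for illustration. It does show one property shared with real stemmers: the result is a truncated stem, not necessarily a dictionary word.

```python
def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains.

    Toy illustration only; this is NOT the Porter stemmer.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

words = ["joking", "lives", "entry", "goes"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['jok', 'liv', 'entry', 'goe']
```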

Lemmatization (Reducing words to base form) for both datasets

In [16]:
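lemmatizer = WordNetLemmatizer()
# lemmatize() defaults to pos='n', so every token is treated as a noun here;
# pass pos='v', 'a', etc. to lemmatize other parts of speech.
df_spam['tokens_lemmatized'] = df_spam['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
df_movie['tokens_lemmatized'] = df_movie['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])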

Join tokens back into strings for further processing for both datasets

In [17]:
# ' '.join can be passed to apply() directly; no lambda wrapper is needed.
df_spam['text_stemmed'] = df_spam['tokens_stemmed'].apply(' '.join)
df_spam['text_lemmatized'] = df_spam['tokens_lemmatized'].apply(' '.join)

df_movie['text_stemmed'] = df_movie['tokens_stemmed'].apply(' '.join)
df_movie['text_lemmatized'] = df_movie['tokens_lemmatized'].apply(' '.join)

Display the first few rows of the DataFrames after preprocessing for both datasets

In [18]:
print("\nPreprocessed Spam Dataset:")
print(df_spam.head())

print("\nPreprocessed Movie Review Dataset:")
print(df_movie.head())


Preprocessed Spam Dataset:
  label                                               text  \
0   ham  go until jurong point crazy available only in ...   
1   ham                            ok lar joking wif u oni   
2  spam  free entry in 2 a wkly comp to win fa cup fina...   
3   ham        u dun say so early hor u c already then say   
4   ham  nah i dont think he goes to usf he lives aroun...   

                                   text_no_stopwords  \
0  go jurong point crazy available bugis n great ...   
1                            ok lar joking wif u oni   
2  free entry 2 wkly comp win fa cup final tkts 2...   
3                u dun say early hor u c already say   
4        nah dont think goes usf lives around though   

                                              tokens  \
0  [go, jurong, point, crazy, available, bugis, n...   
1                     [ok, lar, joking, wif, u, oni]   
2  [free, entry, 2, wkly, comp, win, fa, cup, fin...   
3      [u, dun, say, early, hor, u, c,