# CLEANING AND PREPROCESSING TEXT DATA IN PANDAS FOR NLP TASKS


Cleaning and preprocessing data is often one of the most daunting yet critical phases in building data-driven AI and machine learning solutions, and text data is no exception.


## Load the data into a DataFrame


In [5]:
# Load the data into a DataFrame
import pandas as pd
data = {'text': ["I love cooking!", "Baking is fun", None, "Japanese cuisine is great!"]}

# Convert the data to a DataFrame
data_df = pd.DataFrame(data)

# Display the DataFrame
data_df.head()

                         text
0             I love cooking!
1               Baking is fun
2                        None
3  Japanese cuisine is great!


## Handle missing values


In [7]:
# Drop the rows whose 'text' value is missing
data_df.dropna(subset=['text'], inplace=True)
print(data_df)

                         text
0             I love cooking!
1               Baking is fun
3  Japanese cuisine is great!


## Normalize the text to make it consistent
- Normalizing text means standardizing elements that may appear in different formats across instances, for example date formats, full names, or letter case.

In [10]:
# Convert the text to lowercase
data_df['text'] = data_df['text'].str.lower()
print(data_df)

                         text
0             i love cooking!
1               baking is fun
3  japanese cuisine is great!
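
Letter case is only one of the elements mentioned above. As a minimal sketch of normalizing date formats, the snippet below uses only the standard library. The helper name `normalize_date` and the list of accepted formats are assumptions made for illustration, not part of the original notebook.

```python
from datetime import datetime

# Hypothetical list of formats we expect to see (an assumption for this sketch);
# slash-separated dates are assumed to be day-first
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def normalize_date(raw):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match, try the next one
    return None

dates = ["2024-03-05", "05/03/2024", "March 5, 2024", "not a date"]
normalized = [normalize_date(d) for d in dates]
print(normalized)  # ['2024-03-05', '2024-03-05', '2024-03-05', None]
```

On a DataFrame, a helper like this could be applied to a date column with `.apply(normalize_date)`.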


## Remove noise (punctuation)
- Noise is unnecessary or unintentionally collected data that may hinder subsequent modeling or prediction if it is not handled properly.

In [23]:
import string
# Print all punctuation characters
print(string.punctuation)

# Create a translation table to remove all the punctuation
translation_table = str.maketrans("","",string.punctuation)
data_df['text'] = data_df['text'].str.translate(translation_table)
print(data_df)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
                        text
0             i love cooking
1              baking is fun
3  japanese cuisine is great
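
The same cleanup can also be expressed as a regular expression. The following is a sketch in plain Python that assumes ASCII punctuation only; the names `PUNCT_PATTERN` and `remove_punctuation` are hypothetical.

```python
import re
import string

# Character class that matches any single ASCII punctuation character
PUNCT_PATTERN = "[" + re.escape(string.punctuation) + "]"

def remove_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    return re.sub(PUNCT_PATTERN, "", text)

samples = ["i love cooking!", "japanese cuisine is great!", "well-known, isn't it?"]
cleaned = [remove_punctuation(s) for s in samples]
print(cleaned)  # ['i love cooking', 'japanese cuisine is great', 'wellknown isnt it']
```

In pandas, the same pattern string can be used as `data_df['text'].str.replace(PUNCT_PATTERN, "", regex=True)`. Note that deleting hyphens and apostrophes merges word parts ("wellknown", "isnt"), which may or may not be desirable for a given task.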


## Tokenize the text data
- Tokenization splits the text into smaller units called tokens.
- Tokenization is arguably the most important text preprocessing step, along with encoding text into a numerical representation, before using NLP and language models.

In [29]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [31]:
# import necessary libraries
from nltk.tokenize import word_tokenize

# Apply word_tokenize to each row in the 'text' column and store it in a new 'tokens' column
data_df['tokens'] = data_df['text'].apply(word_tokenize)

# Display the results
print(data_df[['text', 'tokens']])

                        text                          tokens
0             i love cooking              [i, love, cooking]
1              baking is fun               [baking, is, fun]
3  japanese cuisine is great  [japanese, cuisine, is, great]


### Alternative method for tokenization

In [33]:
# Split on whitespace; this is sufficient here because punctuation has already been removed
data_df['tokens'] = data_df['text'].str.split()
print(data_df[['text','tokens']])

                        text                          tokens
0             i love cooking              [i, love, cooking]
1              baking is fun               [baking, is, fun]
3  japanese cuisine is great  [japanese, cuisine, is, great]
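
The two methods agree here only because punctuation was removed earlier. As an illustration in plain Python, the sketch below compares whitespace splitting with a simple regex tokenizer on a string that still contains punctuation. The regex is a rough stand-in chosen for this example, not the algorithm that `word_tokenize` uses, and both function names are hypothetical.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()

def regex_tokenize(text):
    """Emit runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

raw = "i love cooking!"
print(whitespace_tokenize(raw))  # ['i', 'love', 'cooking!']
print(regex_tokenize(raw))       # ['i', 'love', 'cooking', '!']
```

With whitespace splitting, "cooking!" and "cooking" would count as different tokens, which is why the order of the cleaning steps matters.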


## Remove stop words
- Stop-word removal drops very common words (such as "i" or "is") that add little value even though they appear throughout the corpus.
  - Document: an individual text, string, or observation.
  - Corpus: a collection of documents.

In [40]:
# Download stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [55]:
# import necessary libraries
from nltk.corpus import stopwords

# Build the set of English stop words (a set makes membership checks fast)
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words
data_df['tokens'] = data_df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

# Display results
print(data_df[['text', 'tokens']])

                        text                      tokens
0             i love cooking             [love, cooking]
1              baking is fun               [baking, fun]
3  japanese cuisine is great  [japanese, cuisine, great]


## Stemming and Lemmatization
- Stemming maps words to their root form (stem) by stripping affixes; the result is not always a dictionary word.
  - SnowballStemmer
  - PorterStemmer
  - LancasterStemmer
- Lemmatization uses morphological analysis to reduce a word to its base form, or lemma.
  - WordNetLemmatizer

In [34]:
# Download the WordNet corpus used by WordNetLemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [54]:
# Import necessary libraries
from nltk.stem import SnowballStemmer

# Initialize SnowballStemmer
stemmer = SnowballStemmer('english')

# Apply SnowballStemmer to each token in the 'tokens' column and store the result in a new 'stemmed' column
data_df['stemmed'] = data_df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])

# Print the results
print(data_df[['tokens', 'stemmed']])

                       tokens                   stemmed
0             [love, cooking]              [love, cook]
1               [baking, fun]               [bake, fun]
3  [japanese, cuisine, great]  [japanes, cuisin, great]
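
To make the difference between stemming and lemmatization concrete, here is a deliberately simplified sketch in plain Python. The suffix rules and the lookup table are toy assumptions invented for this example; they are far cruder than what SnowballStemmer or WordNetLemmatizer actually do.

```python
# Toy suffix rules (an assumption for illustration, not the Snowball algorithm)
SUFFIXES = ("ing", "ed", "s")

# Tiny hand-written lemma table standing in for a real dictionary such as WordNet
LEMMA_TABLE = {"cooking": "cook", "baking": "bake", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

words = ["cooking", "baking", "better"]
print([toy_stem(w) for w in words])       # ['cook', 'bak', 'better']
print([toy_lemmatize(w) for w in words])  # ['cook', 'bake', 'good']
```

Rule-based stripping can produce non-words such as "bak", much like "japanes" and "cuisin" in the output above, whereas a dictionary lookup returns valid base forms but only for the words it knows.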


## Convert Text into Numerical Representations
- We need to map our tokens to numerical representations, commonly known as embedding vectors, or simply embeddings.

The example below joins the tokens in the 'tokens' column back into strings and uses TF-IDF vectorization (one of the most popular approaches in the good old days of classical NLP) to transform the text into numerical representations.

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Join each token list back into a single string, since TfidfVectorizer expects strings
data_df['text'] = data_df['tokens'].apply(lambda x: " ".join(x))

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the text into a sparse TF-IDF matrix
X = vectorizer.fit_transform(data_df['text'])

# Convert X to a dense array and display it
print(X.toarray())

[[0.         0.70710678 0.         0.         0.         0.
  0.70710678]
 [0.70710678 0.         0.         0.70710678 0.         0.
  0.        ]
 [0.         0.         0.57735027 0.         0.57735027 0.57735027
  0.        ]]
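
To see where the values 0.7071 and 0.5774 come from, the sketch below recomputes the matrix in plain Python. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and uses whitespace splitting as a stand-in for the library's tokenizer. The function name `tfidf_matrix` is hypothetical.

```python
import math

def tfidf_matrix(docs):
    """Return (sorted vocabulary, rows of L2-normalized TF-IDF weights)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = {term: sum(term in toks for toks in tokenized) for term in vocab}
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for toks in tokenized:
        weights = [toks.count(term) * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

docs = ["love cooking", "baking fun", "japanese cuisine great"]
vocab, rows = tfidf_matrix(docs)
print(vocab)  # ['baking', 'cooking', 'cuisine', 'fun', 'great', 'japanese', 'love']
for row in rows:
    print([round(w, 8) for w in row])
```

The columns follow the alphabetically sorted vocabulary. Each document here contains k distinct terms that appear in no other document, so every nonzero weight reduces to 1/sqrt(k): 1/sqrt(2) ≈ 0.7071 for the two-word documents and 1/sqrt(3) ≈ 0.5774 for the three-word one.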


The next step would be to feed these numerical representations to our NLP model and let it do its magic.