# Preprocessing in NLP

## What is it?


Preprocessing in NLP refers to the series of steps taken to clean, normalize, and prepare raw text data before it is fed into a machine learning or deep learning model.

Proper preprocessing is crucial for improving the performance of NLP models by ensuring the text data is in a suitable format and reducing noise.

Key Preprocessing Steps

1. Tokenization: Splitting text into individual words, phrases, or other tokens.
2. Lowercasing: Converting all text to lowercase so that “The” and “the” count as the same token.
3. Removing Punctuation: Stripping punctuation marks from the text.
4. Removing Stop Words: Dropping very common words such as “the”, “is”, and “and”.
5. Stemming: Reducing words to a stem by stripping suffixes with fixed rules. Example: “running” -> “run”
6. Lemmatization: Reducing words to their base or dictionary form, taking the part of speech into account. Example: “better” -> “good”
7. Handling Special Characters: Removing or replacing special characters and digits. Example: “Price is $100” -> “Price is”
8. Text Normalization: Standardizing text by correcting spelling mistakes, expanding contractions, and normalizing numbers. Example: “can’t” -> “cannot”, “2day” -> “today”
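
The special-character and normalization steps listed above are not demonstrated later in this notebook, so here is a minimal sketch of how they might look using only the standard library. The `CONTRACTIONS` and `SLANG` tables and the `normalize` function are hypothetical names invented for this illustration; a real project would need much fuller lookup tables.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for this sketch).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
SLANG = {"2day": "today"}

def normalize(text: str) -> str:
    """Lowercase, expand contractions and slang, then drop digits and special characters."""
    text = text.lower().replace("’", "'")  # treat curly apostrophes like straight ones
    words = [CONTRACTIONS.get(w, SLANG.get(w, w)) for w in text.split()]
    text = " ".join(words)
    text = re.sub(r"[^a-z\s]", " ", text)     # replace anything that is not a letter or whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse the gaps left behind

print(normalize("Price is $100"))      # price is
print(normalize("I can’t come 2day"))  # i cannot come today
```

The lookup happens before special characters are removed, because stripping the apostrophe first would turn “can’t” into “can t” and the contraction could no longer be matched.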

## What for?

Uses of Preprocessing

1. Improving Model Accuracy: Reduces noise and irrelevant information, leading to more accurate and reliable models.
2. Reducing Vocabulary Size: Removing stop words, punctuation, and special characters shrinks the vocabulary, which simplifies the model and speeds up training.
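
To make the vocabulary-size point concrete, the following sketch counts distinct tokens in a three-sentence toy corpus before and after cleaning. The corpus, the `STOP_WORDS` set, and the `vocabulary` helper are all made up for this illustration and are not part of any library.

```python
import string

# Hypothetical mini stop-word list, used only to keep the sketch dependency-free.
STOP_WORDS = {"the", "is", "a", "on", "and"}

corpus = [
    "The cat is on the mat.",
    "A cat and a dog!",
    "The dog is on the mat?",
]

def vocabulary(docs, clean=False):
    """Return the set of distinct whitespace-separated tokens in docs."""
    vocab = set()
    for doc in docs:
        for token in doc.split():
            if clean:
                # lowercase and trim punctuation stuck to the word
                token = token.lower().strip(string.punctuation)
                if not token or token in STOP_WORDS:
                    continue
            vocab.add(token)
    return vocab

raw_vocab = vocabulary(corpus)
clean_vocab = vocabulary(corpus, clean=True)
print(len(raw_vocab), len(clean_vocab))  # 12 3
```

Without cleaning, “The” and “the”, or “mat.” and “mat?”, count as different vocabulary entries even though they carry the same meaning.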

In [1]:
!pip install nltk


## Example

Example Workflow

Consider the sentence: “Natural Language Processing (NLP) is fascinating!”

1. Tokenization: “Natural Language Processing (NLP) is fascinating!” -->  [“Natural”, “Language”, “Processing”, “(”, “NLP”, “)”, “is”, “fascinating”, “!”]
2. Lowercasing: [“Natural”, “Language”, “Processing”, “(”, “NLP”, “)”, “is”, “fascinating”, “!”] --> [“natural”, “language”, “processing”, “(”, “nlp”, “)”, “is”, “fascinating”, “!”]
3. Removing Punctuation: [“natural”, “language”, “processing”, “(”, “nlp”, “)”, “is”, “fascinating”, “!”] --> [“natural”, “language”, “processing”, “nlp”, “is”, “fascinating”]
4. Removing Stop Words: [“natural”, “language”, “processing”, “nlp”, “is”, “fascinating”] --> [“natural”, “language”, “processing”, “nlp”, “fascinating”]
5. Lemmatization: [“natural”, “language”, “processing”, “nlp”, “fascinating”] --> [“natural”, “language”, “process”, “nlp”, “fascinate”]

Preprocessing prepares this sentence for vectorization and subsequent machine learning tasks, ensuring the text is clean, uniform, and ready for analysis.
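
The workflow above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: `STOP_WORDS` and `LEMMAS` are hypothetical hand-written stand-ins for a real stop-word list and a real lemmatizer, and the regular expression is one simple way to tokenize, not the method any particular library uses.

```python
import re
import string

# Hypothetical stand-ins for a real stop-word list and a real lemmatizer.
STOP_WORDS = {"is", "a", "an", "the", "of", "and"}
LEMMAS = {"processing": "process", "fascinating": "fascinate"}
PUNCTUATION = set(string.punctuation)

def preprocess(sentence: str) -> list[str]:
    tokens = re.findall(r"\w+|[^\w\s]", sentence)          # 1. words and single punctuation marks
    tokens = [t.lower() for t in tokens]                    # 2. lowercase
    tokens = [t for t in tokens if t not in PUNCTUATION]    # 3. drop punctuation tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]     # 4. drop stop words
    return [LEMMAS.get(t, t) for t in tokens]               # 5. lemmatize by table lookup

print(preprocess("Natural Language Processing (NLP) is fascinating!"))
# ['natural', 'language', 'process', 'nlp', 'fascinate']
```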

## How to do it?

### Packages

* NLTK
* spaCy
* TextBlob
* Gensim
* Scikit-learn
* Pandas
* NumPy
* `string` (Python standard library)

In [4]:
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords

# Download the stop-word lists once; later calls only report that they are up to date
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aymanelsayeed/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Examples


#### Remove stop words

In [4]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

In [8]:
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
' '.join(word for word in text1.split() if word.lower() not in stop_words)

'Ethics built right ideals objectives United Nations'

#### Remove Punctuation

In [2]:
text2 = "Hello! How are you doing today?"

In [5]:
# Build a table that deletes every character in string.punctuation, then apply it
text2.translate(str.maketrans('', '', string.punctuation))

'Hello How are you doing today'
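
One caveat worth illustrating: `string.punctuation` contains only ASCII characters, so curly quotes, ellipses, and marks such as “¿” survive the call above. The sketch below shows one possible alternative based on Unicode character categories; `remove_punctuation` is a hypothetical helper name, not a library function.

```python
import string
import unicodedata

def remove_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

sample = "“Hello!” she said… ¿Cómo estás?"
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only)                  # curly quotes, the ellipsis and ¿ are still present
print(remove_punctuation(sample))  # Hello she said Cómo estás
```

Note that the two approaches are not interchangeable: characters such as `$` and `+` are in `string.punctuation` but belong to Unicode symbol categories, so the category-based version keeps them.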

#### Stemming

In [12]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [13]:
porter = nltk.PorterStemmer()

In [14]:
[porter.stem(word) for word in words1]

['list', 'list', 'list', 'list', 'list']
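
To give a feel for what “removing suffixes” means, here is a toy suffix stripper. It is explicitly not the Porter algorithm, which applies many more rules and conditions; `SUFFIXES` and `toy_stem` are hypothetical names chosen for this sketch.

```python
# Toy rule set, longest suffix first; this is NOT the Porter algorithm.
SUFFIXES = ("ings", "ing", "ed", "s")

def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["list", "listed", "lists", "listing", "listings"]])
# ['list', 'list', 'list', 'list', 'list']
print(toy_stem("running"))  # 'runn': fixed suffix rules can produce non-words
```

The last line shows why stems are not guaranteed to be dictionary words, which is the main practical difference from lemmatization.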

#### Lemmatization

In [15]:
nltk.download('wordnet', quiet=True)  # the lemmatizer needs the WordNet corpus
WNlemma = nltk.WordNetLemmatizer()

In [16]:
# lemmatize() treats each word as a noun unless pos is given, so 'listed' stays unchanged
[WNlemma.lemmatize(t) for t in words1]

['list', 'listed', 'list', 'listing', 'listing']

## Practise

### Quiz 1

* Read the file `twitter.csv`.
* Remove all the stop words from the text. Save the output in a new column `text_without_stopwords`.

Write a function to remove stop words from the text.

In [51]:
# Write answer here


| Unnamed: 0 | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |


### Quiz 2

Remove all the punctuation from the `text_without_stopwords` column. Save the output in a new column `text_without_sp`.

Write a function to remove punctuation from the text.

In [53]:
# Write answer here


'Love Echo'

'Love my Echo!'

### Quiz 3

Apply stemming to the `text_without_sp` column. Save the output in a new column `stemmed_text`.

Write a function to apply stemming to the text.

### Quiz 4

Apply lemmatization to the `text_without_sp` column. Save the output in a new column `lemmatized_text`.

Write a function to apply lemmatization to the text.

### Quiz 5

Write a function that takes a string as input and returns a clean text after:

* Removing stop words
* Removing punctuation
* Applying stemming or lemmatization

### Quiz 6

Save the final dataframe in a new csv file.

In [59]:
# write answer here


### Quiz 7

Do the same preprocessing on the `tweet.csv` file.