# **Text Preprocessing for NLP**

## **Notebook Overview**  
This notebook introduces essential text preprocessing techniques in Natural Language Processing (NLP), demonstrating how to clean and prepare raw text data for downstream tasks. Using the IMDb Movie Reviews Dataset, we will process raw reviews step by step to remove noise, extract meaningful tokens, and prepare data for machine learning models.

---

### **Objectives of the Notebook**  

#### **1. Understand the Importance of Text Preprocessing:**  
- Why preprocessing is necessary for NLP tasks.  
- The impact of clean data on model performance.  

#### **2. Learn Common Preprocessing Techniques:**  
- **Text cleaning**: Removing noise, converting text to lowercase.  
- **Tokenization**: Splitting sentences into words.  
- **Removing stopwords**: Eliminating common words with low significance.  
- **Stemming and lemmatization**: Reducing words to their root form.  
- **Noise removal**: Removing URLs, numbers, and special characters.  

#### **3. Apply Preprocessing to a Real Dataset:**  
- Load and explore the IMDb Movie Reviews Dataset.  
- Preprocess reviews step-by-step.  
- Visualize results and compare preprocessed data to raw text.  

---


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


In [2]:
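import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Tokenizer models; outside Kaggle, the 'stopwords' and 'wordnet' corpora may also need nltk.download()
nltk.download('punkt')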

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
df=pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Text Normalization:
Text normalization standardizes the format of reviews by converting all text to lowercase and removing unwanted elements such as HTML tags (`<br />`), punctuation, and special characters. In the IMDb dataset, this step ensures that surface variations like "Great Movie!" and "great movie" are treated as the same text.
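
As a minimal sketch of the normalization described above, the helper below lowercases a string and strips HTML tags, punctuation, and digits using only the standard library. The function name `normalize` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def normalize(text: str) -> str:
    """Lowercase text and strip HTML tags, punctuation, and digits."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags such as <br />
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

# Both variants normalize to the same string
print(normalize("Great Movie!<br /><br />"))
print(normalize("great movie"))
```

Stripping the tags before the non-letter characters matters: if only the second substitution were applied, `<br />` would leave a stray `br` token behind.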

### Tokenization:
Tokenization splits reviews into smaller units, such as words or sentences. For the IMDb dataset, word tokenization breaks a review such as "great movie!" into individual tokens like ["great", "movie", "!"], while sentence tokenization divides long reviews into individual sentences that can be processed one at a time.
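
The two kinds of tokenization can be illustrated with a rough regex-based sketch. This is a simplified stand-in, not NLTK's `word_tokenize` or `sent_tokenize`, and it will mishandle cases such as abbreviations. The function names and the sample sentences are hypothetical.

```python
import re

def word_tokens(text: str) -> list[str]:
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text: str) -> list[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokens("great movie!"))
print(sentence_tokens("I loved it. Would watch again!"))
```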

### Stopword Removal:
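Stopwords are common words like "is," "the," and "and" that carry little meaning on their own. Removing them from IMDb reviews leaves the focus on more informative words like "amazing" or "boring," which shrinks the vocabulary and can reduce noise for downstream models. Note that NLTK's English stopword list also contains negations such as "not," which can matter for sentiment analysis.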
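
A small sketch of stopword filtering follows. To stay self-contained it uses a tiny hand-picked stopword set, which is an assumption for illustration; the notebook itself relies on `stopwords.words('english')` from NLTK. The helper name `remove_stopwords` and the sample sentence are hypothetical.

```python
# Small hand-picked stopword set for illustration only (an assumption);
# the notebook itself uses nltk.corpus.stopwords.words('english').
STOP_WORDS = {"is", "the", "and", "a", "but", "was", "this"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "the plot is boring but the acting is amazing".split()
print(remove_stopwords(tokens))  # ['plot', 'boring', 'acting', 'amazing']
```

Holding the stopwords in a `set` keeps each membership check constant-time, which matters when filtering tens of thousands of reviews.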

### Stemming and Lemmatization:
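Both techniques reduce inflected words to a common base form. For example, "playing," "played," and "plays" are all reduced to "play." Stemming strips suffixes with heuristic rules and can produce non-words (the Porter stemmer turns "episode" into "episod"), whereas lemmatization maps each word to its dictionary form. In the IMDb dataset, this step reduces vocabulary size and gives related word forms a consistent representation across reviews.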
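
The stemming behaviour can be checked directly with NLTK's `PorterStemmer`, which needs no extra corpus downloads. This sketch assumes only that `nltk` is installed; the word list is chosen for illustration. Lemmatization is left out here because `WordNetLemmatizer` requires the WordNet corpus to be available.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of one verb, plus two words whose stems are not dictionary words
words = ["playing", "played", "plays", "episode", "exactly"]
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

The three forms of "play" collapse to one stem, while "episode" and "exactly" show how a stem need not be a real word.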



In [4]:
corpus = []
stemmer = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))
for review in df['review']:
    # Replace every non-letter with a space (a '<br />' tag survives as the token 'br')
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower().split()
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    # Join with a space so the stemmed tokens stay separated
    review = ' '.join(review)
    corpus.append(review)
corpus[0]

'one review mention watch oz episod hook right exactli happen br br first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use word br br call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home mani aryan muslim gangsta latino christian italian irish scuffl death stare dodgi deal shadi agreement never far away br br would say main appeal show due fact goe show dare forget pretti pictur paint mainstream audienc forget charm forget romanc oz mess around first episod ever saw struck nasti surreal say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard sold nickel inmat kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom comfort uncomfort view that get touch darker side'