# Padding

Padding brings sequences of different lengths to a common length. It is a standard preprocessing step in Natural Language Processing (NLP), because neural networks are usually trained on batches that must have a fixed shape. Here’s why and how it’s done:

## Why Padding?
**Uniform Sequence Length**: A batch is stored as a single tensor, so every sequence in it must have the same length.

**Efficiency**: A rectangular batch can be processed with vectorized matrix operations instead of one sequence at a time.

**Alignment**: With a shared length, each position index refers to the same column in every sequence of the batch.

## Types of Padding

1. Pre-Padding
2. Post-Padding

### 1. Pre-Padding

Pre-padding adds zeros (or another specified value) at the start of a sequence so that all sequences reach the same length. It is the default mode of `pad_sequences`, and it keeps the real tokens at the end of the sequence, next to the last time step a recurrent model reads. Here’s a practical example using `pad_sequences` from TensorFlow's Keras API.
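To make the mechanics concrete before the notebook cells, here is a minimal pure-Python sketch of pre-padding on a toy batch. The helper `pad_pre` and the sample values are hypothetical and only illustrate the idea; this is not how `pad_sequences` is implemented, and the sketch assumes no sequence is longer than `maxlen`.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value` up to `maxlen` (hypothetical helper).

    Assumes len(seq) <= maxlen; a longer sequence is returned unchanged.
    """
    seq = list(seq)
    # The fill goes in front, so the real tokens stay at the end.
    return [value] * (maxlen - len(seq)) + seq


batch = [[5, 8], [3, 9, 2, 7], [4]]  # toy integer-encoded sentences
padded = [pad_pre(s, maxlen=4) for s in batch]
for row in padded:
    print(row)
```

Every row now has length 4, with the zeros in front of the original values.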

### Implementation

In [16]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [5]:
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [19]:
data = pd.read_csv("/content/Reddit_Data.csv")

In [20]:
data.head()

| Unnamed: 0 | clean_comment | category |
|---|---|---|
| 0 | family mormon have never tried explain them t... | 1.0 |
| 1 | buddhism has very much lot compatible with chr... | 1.0 |
| 2 | seriously don say thing first all they won get... | -1.0 |
| 3 | what you have learned yours and only yours wha... | 0.0 |
| 4 | for your own benefit you may want read living ... | 1.0 |


In [21]:
data.isnull().sum()

Unnamed: 0,0
clean_comment,90
category,1


In [22]:
data = data.dropna()

In [24]:
x = data.drop("category",axis=1)
y = data["category"]

### One-Hot Representation

In [25]:
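corpus = []
lemmatizer = WordNetLemmatizer()
# Build the stopword set once; calling stopwords.words() for every word is slow.
stop_words = set(stopwords.words("english"))
for i in x.index:
  # Keep letters only, then lowercase and split into tokens.
  words = re.sub("[^a-zA-Z]"," ",x.loc[i,"clean_comment"])
  words = words.lower().split()
  # Drop stopwords and lemmatize the remaining tokens.
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  corpus.append(" ".join(words))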

In [26]:
corpus[0]

'family mormon never tried explain still stare puzzled time time like kind strange creature nonetheless come admire patience calmness equanimity acceptance compassion developed thing buddhism teach'

### Pre-Padding

In [28]:
voc_size = 5000
# Each corpus entry is a whole cleaned comment; one_hot returns one integer per word.
one_hot_repr = [one_hot(sentence,voc_size) for sentence in corpus]
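Despite its name, Keras's `one_hot` does not build one-hot vectors: it hashes every word of a text to an integer index, which is why the repeated word "time" in `corpus[0]` shows up as the same number twice (2176, 2176) in the encoded output. The sketch below illustrates that idea only. The function `hashed_indices` is hypothetical, and using MD5 is an assumption made here so the result is reproducible; it is not a copy of the Keras implementation and will not reproduce the numbers in this notebook.

```python
import hashlib


def hashed_indices(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1].

    Hypothetical helper illustrating the hashing idea behind `one_hot`.
    """
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free to mean "padding".
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices


encoded = hashed_indices("time time like kind strange creature", n=5000)
print(encoded)
```

Identical words always receive the same index, and two different words can collide on one index when the vocabulary size is small.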

In [30]:
pre_padding = pad_sequences(one_hot_repr,padding="pre",maxlen=850)

In [31]:
pre_padding[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          ...

### 2. Post-Padding

Post-padding adds zeros (or another specified value) at the end of a sequence so that all sequences reach the same length. The real tokens stay at the start and the fill follows them. Here’s how you can implement post-padding by passing `padding="post"` to `pad_sequences`.
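As a toy illustration of the same idea, the following pure-Python sketch appends the fill value instead of prepending it. The helper `pad_post` and the sample batch are hypothetical, and the sketch again assumes no sequence is longer than `maxlen`.

```python
def pad_post(seq, maxlen, value=0):
    """Right-pad `seq` with `value` up to `maxlen` (hypothetical helper).

    Assumes len(seq) <= maxlen; a longer sequence is returned unchanged.
    """
    seq = list(seq)
    # The fill goes at the back, so the real tokens stay at the start.
    return seq + [value] * (maxlen - len(seq))


batch = [[5, 8], [3, 9, 2, 7], [4]]  # toy integer-encoded sentences
padded = [pad_post(s, maxlen=4) for s in batch]
for row in padded:
    print(row)
```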

### Implementation

In [35]:
post_padding = pad_sequences(one_hot_repr,padding="post",maxlen=850)

In [36]:
post_padding[0]

array([ 286, 1862,  299, 3561,  660, 4050,  494, 3533, 2176, 2176, 2349,
       3489,  479, 4777, 4547,  779, 4041,  269,  652, 1092, 2434, 4389,
       3797, 1436, 2278, 3999,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          ...
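Both padding calls pass `maxlen=850`, so any comment that encodes to more than 850 integers is cut rather than padded. In `pad_sequences` the `truncating` argument controls which side is dropped, and it defaults to `"pre"` independently of `padding`. The sketch below mimics that choice in plain Python; the helper `truncate` and the sample tokens are hypothetical.

```python
def truncate(seq, maxlen, truncating="pre"):
    """Cut `seq` down to at most `maxlen` items (hypothetical helper)."""
    seq = list(seq)
    if len(seq) <= maxlen:
        return seq
    # "pre" drops items from the front, "post" drops them from the back.
    return seq[-maxlen:] if truncating == "pre" else seq[:maxlen]


tokens = [11, 12, 13, 14, 15, 16]
kept_end = truncate(tokens, maxlen=4)                       # default "pre"
kept_start = truncate(tokens, maxlen=4, truncating="post")
print(kept_end)    # [13, 14, 15, 16]
print(kept_start)  # [11, 12, 13, 14]
```

If the beginning of a long comment matters more than its end, passing `truncating="post"` alongside `padding="post"` keeps the first 850 tokens instead of the last 850.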