# Recurrent Neural Networks

Welcome to the paper implementation section. In this first notebook, I will be implementing a simple RNN. PyTorch's `nn.RNN` abstraction makes this much easier, and we will use it to build a simple spam classifier.

The dataset for this tutorial can be downloaded from Kaggle: [SMS Spam Collection Dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

Let's get started!

In [2]:
# Importing necessary libraries
import torch
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

In [3]:
# Setting training device
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(device)

mps


We don't have a predefined `Dataset` object for the Spam Classification Dataset, so we will have to build one. Before we get to that, we need to understand and implement a few important concepts. One of them, critical for us, is the tokenizer. It tokenizes the input, i.e., it converts the text into some numerical form, usually vectors in a vector space.
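To make the idea concrete, here is a minimal sketch of a whitespace tokenizer. It is an assumption about how such a step could look, not the tokenizer this notebook ends up using. The names `tokenize`, `build_vocab` and `encode`, the `<pad>`/`<unk>` tokens, and the sample strings are all hypothetical. It maps each word to an integer id; turning those ids into vectors would be the job of a later layer, such as an embedding.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; punctuation stays attached to words
    return text.lower().split()


def build_vocab(texts, min_freq=1):
    # Count every token, then assign ids in order of first appearance
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids (an assumed convention)
    for tok, count in counts.items():
        if count >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    # Unknown words fall back to the <unk> id
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]


# Illustrative strings loosely based on the messages in the dataset
sample_texts = ["Ok lar... Joking wif u oni...", "Free entry in 2 a wkly comp"]
vocab = build_vocab(sample_texts)
print(encode("Free entry ok", vocab))
```

Words that never appeared in `sample_texts` all share the `<unk>` id, which is why real vocabularies are built from the full training set.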

## Looking at the Data - EDA

In [1]:
import pandas as pd
import numpy as np

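`ISO-8859-1` (also known as Latin-1) is a single-byte encoding that covers the first 256 Unicode code points, which include the basic and accented Latin letters, digits, and common punctuation. It is widely used for Western European languages and English. We pass it explicitly when reading the CSV file.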
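As a small illustration of what "single-byte" means in practice, the sketch below decodes the same three bytes with two encodings. The bytes are chosen by hand for the example and are not read from `spam.csv`.

```python
# Illustrative bytes (an assumption, not taken from spam.csv): "£50" in ISO-8859-1
raw = bytes([0xA3, 0x35, 0x30])

# ISO-8859-1 maps every byte value 0-255 to exactly one character,
# so decoding with it cannot fail
decoded = raw.decode("ISO-8859-1")
print(decoded)

# The same bytes are not valid UTF-8, because 0xA3 cannot start a UTF-8 sequence
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("Valid UTF-8:", utf8_ok)
```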

In [5]:
df_train = pd.read_csv("spam.csv", encoding = "ISO-8859-1")
df_train.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|------|------|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [9]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [10]:
df_train.isna().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [13]:
uniq_v1 = df_train["v1"].unique()
print(uniq_v1)

['ham' 'spam']


> Note: Use ordinal encoding when the categories have a natural rank ordering. If no such ordering exists, use one-hot encoding.

We don't have a natural rank ordering, hence we are going to One Hot Encode our data. 
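The difference between the two encodings can be sketched on toy data. The `sizes` feature below is hypothetical and only there to show a case with a natural order; the `ham`/`spam` labels mirror the two classes found in `v1`.

```python
# Hypothetical ordered feature: sizes have a natural rank, so integers preserve it
size_rank = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
ordinal = [size_rank[s] for s in sizes]

# Unordered labels: one column per class, with a single 1 marking the class
classes = ["ham", "spam"]
labels = ["ham", "spam", "ham"]
one_hot = [[1 if label == c else 0 for c in classes] for label in labels]

print(ordinal)
print(one_hot)
```

With ordinal codes, `large > small` is a meaningful comparison. With one-hot vectors, no class is numerically "bigger" than another, which is the property we want for labels without a rank.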

We can either write a custom one-hot encoder or use a pre-existing one. For the sake of self-reliance, I am going to write a custom one-hot encoder.

In [14]:
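def OneHotEncoder(df_column: str):
    # Map each unique value in the column to an integer index
    uniq_values = df_train[df_column].unique()
    temp_dict = {}
    for num, val in enumerate(uniq_values):
        temp_dict[val] = num

    print(temp_dict)
    # Assumed completion (the original loop body was left empty):
    # one row per sample, with a 1 in the column of that sample's class
    one_hot = np.zeros((len(df_train), len(uniq_values)), dtype=np.float32)
    for row, val in enumerate(df_train[df_column]):
        one_hot[row, temp_dict[val]] = 1.0
    return one_hot

In [15]:
labels_one_hot = OneHotEncoder("v1")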

{'ham': 0, 'spam': 1}
