<a href="https://colab.research.google.com/github/MuhammedAshraf2020/NLP-Pytorch/blob/main/Text_Classification_BOW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load Data

In [1]:
!mkdir ~/.kaggle
!cp kaggle.json  ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json  

In [2]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 97% 25.0M/25.7M [00:01<00:00, 5.10MB/s]
100% 25.7M/25.7M [00:01<00:00, 17.2MB/s]


In [3]:
!unzip /content/imdb-dataset-of-50k-movie-reviews.zip

Archive:  /content/imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


In [8]:
import torch
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer  

In [6]:
data = pd.read_csv("/content/IMDB Dataset.csv")

In [7]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Create Our Custom DataLoader

In [67]:
from torch.utils.data import DataLoader , Dataset

In [56]:
def numurize(sent):
  if sent == "positive":
    return 1
  else:
    return 0

In [218]:
dataset = Preprocessing("/content/IMDB Dataset.csv").data

In [221]:
class Sequences(Dataset):
  def __init__(self , path):
    self.data = pd.read_csv(path)
    
    #removing tags , puncs and digits
    self.data["review"] = self.data["review"].str.replace('\d+', '')
    self.data["review"] = self.data["review"].str.replace("<.*?>" , "")
    self.data["review"] = self.data["review"].str.replace("n't"   , " not" )

    #nummirize sntiment/Labels
    self.data["sentiment"] = self.data["sentiment"].apply(lambda x : numurize(x))
    
    #tokinizer and delete stopwords
    self.tokenizer = CountVectorizer(stop_words='english', max_df=0.99, min_df=0.005)
    self.sequances = self.tokenizer.fit_transform(self.data)
    
    self.labels = self.data["sentiment"].tolist()
    self.token2index = self.tokenizer.vocabulary_
    self.index2token = {idx : token for token , idx in self.token2index.items()}
  
  def __getitem__(self , i):
    return (self.sequances[i].toarray() , self.labels[i])
  
  def __len__(self):
    return self.sequances.shape[0]

In [247]:
dataset_train = Sequences("/content/IMDB Dataset.csv")

In [248]:
train_loader = DataLoader(dataset_train , batch_size = 510)

## Let's Build Model

In [249]:
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [250]:
class FeedForward(nn.Module):
  def __init__(self , voc_lens):
    super().__init__()
    self.Linear1 = nn.Linear(voc_lens , 128)
    self.Linear2 = nn.Linear(128 , 64)
    self.out = nn.Linear(64 , 1)
  
  def forward(self , x):
    x = x.squeeze(1).float()
    x = F.relu(self.Linear1(x))
    x = F.relu(self.Linear2(x))
    return torch.sigmoid(self.out(x))


In [251]:
model = FeedForward(len(dataset_train.token2index)).to("cuda")

In [252]:
model

FeedForward(
  (Linear1): Linear(in_features=2, out_features=128, bias=True)
  (Linear2): Linear(in_features=128, out_features=64, bias=True)
  (out): Linear(in_features=64, out_features=1, bias=True)
)

In [253]:
critic = nn.BCELoss()
opt    = optim.Adam(model.parameters())

## Train / Test Z Model

In [254]:
from tqdm import tqdm , tqdm_notebook

In [255]:
model.train()
epochs_losses = []

for epoch in range(100):
  
  loop   = tqdm_notebook(train_loader , leave = False)
  total  = 0 
  losses = []
  
  for x , y in loop:
    model.zero_grad()
    
    x = x.to("cuda")
    y = y.to("cuda").float()
    
    prediction = model(x)

    loss = critic(prediction.squeeze() , y)
    loss.backward()

    loop.set_description(f"loss : {loss.item():.3f}")

    opt.step()
    losses.append(loss.item())
    total = total + 1
  
  epoch_loss = sum(losses) / total
  epochs_losses.append(epoch_loss)
  
  tqdm.write(f"Epoch {epoch + 1} loss:[{epoch_loss}]")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1 loss:[0.7819536328315735]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 2 loss:[0.7482127547264099]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 3 loss:[0.7149853706359863]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 4 loss:[0.683641791343689]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 5 loss:[0.6541990637779236]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 6 loss:[0.6258292198181152]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 7 loss:[0.5985807180404663]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 8 loss:[0.5720837712287903]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 9 loss:[0.5465242266654968]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 10 loss:[0.5218095183372498]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 11 loss:[0.497608482837677]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 12 loss:[0.47493547201156616]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 13 loss:[0.45289069414138794]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 14 loss:[0.431456983089447]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 15 loss:[0.4109397530555725]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 16 loss:[0.39079439640045166]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 17 loss:[0.3708559274673462]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 18 loss:[0.35136181116104126]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 19 loss:[0.33260416984558105]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 20 loss:[0.3143530488014221]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 21 loss:[0.29662594199180603]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 22 loss:[0.2793820798397064]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 23 loss:[0.2628942131996155]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 24 loss:[0.24704349040985107]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 25 loss:[0.23180395364761353]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 26 loss:[0.21719761192798615]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 27 loss:[0.20336946845054626]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 28 loss:[0.1902177780866623]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 29 loss:[0.17771100997924805]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 30 loss:[0.16585297882556915]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 31 loss:[0.15464279055595398]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 32 loss:[0.1440746784210205]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 33 loss:[0.13413885235786438]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 34 loss:[0.12482164055109024]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 35 loss:[0.11610579490661621]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 36 loss:[0.10797125101089478]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 37 loss:[0.10039543360471725]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 38 loss:[0.0933537483215332]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 39 loss:[0.08682030439376831]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 40 loss:[0.08076821267604828]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 41 loss:[0.07517002522945404]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 42 loss:[0.06999805569648743]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 43 loss:[0.06522496044635773]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 44 loss:[0.060823969542980194]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 45 loss:[0.05676886439323425]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 46 loss:[0.053034476935863495]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 47 loss:[0.04959680512547493]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 48 loss:[0.046432994306087494]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 49 loss:[0.04352143406867981]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 50 loss:[0.04084186255931854]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 51 loss:[0.03837534040212631]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 52 loss:[0.03610405698418617]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 53 loss:[0.03401196375489235]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 54 loss:[0.03208378702402115]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 55 loss:[0.03030553087592125]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 56 loss:[0.028664378449320793]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 57 loss:[0.02714872919023037]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 58 loss:[0.025747647508978844]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 59 loss:[0.024451371282339096]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 60 loss:[0.02325085736811161]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 61 loss:[0.022137995809316635]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 62 loss:[0.02110515907406807]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 63 loss:[0.020145650953054428]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 64 loss:[0.019253306090831757]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 65 loss:[0.018422327935695648]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 66 loss:[0.017647497355937958]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 67 loss:[0.01692347228527069]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 68 loss:[0.016246719285845757]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 69 loss:[0.01561338547617197]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 70 loss:[0.015020091086626053]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 71 loss:[0.014463589526712894]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 72 loss:[0.013941008597612381]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 73 loss:[0.013449698686599731]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 74 loss:[0.012987285852432251]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 75 loss:[0.012551527470350266]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 76 loss:[0.012140488252043724]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 77 loss:[0.011752177029848099]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 78 loss:[0.011385000310838223]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 79 loss:[0.011037246324121952]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 80 loss:[0.010706906206905842]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 81 loss:[0.010393239557743073]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 82 loss:[0.01009511761367321]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 83 loss:[0.00981159321963787]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 84 loss:[0.009541751816868782]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 85 loss:[0.009284527972340584]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 86 loss:[0.009039130993187428]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 87 loss:[0.008804980665445328]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 88 loss:[0.00858122669160366]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 89 loss:[0.008367231115698814]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 90 loss:[0.008162417449057102]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 91 loss:[0.007966208271682262]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 92 loss:[0.007778058759868145]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 93 loss:[0.007597573101520538]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 94 loss:[0.007424237206578255]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 95 loss:[0.007257656194269657]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 96 loss:[0.00709746778011322]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 97 loss:[0.006943307351320982]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 98 loss:[0.006794842425733805]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 99 loss:[0.006651740521192551]


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 100 loss:[0.006513790227472782]


In [256]:
def predict_sentiment(text):
    model.eval()
    with torch.no_grad():
        test_vector = torch.LongTensor(dataset_train.tokenizer.transform([text]).toarray()).to("cuda")

        output = model(test_vector).to("cuda")
        prediction = torch.sigmoid(output).item()

        if prediction > 0.5:
            print(f'{prediction:0.3}: Positive sentiment')
        else:
            print(f'{prediction:0.3}: Negative sentiment')

In [257]:
test_text = """
Cool Cat Saves The Kids is a symbolic masterpiece directed by Derek Savage that
is not only satirical in the way it makes fun of the media and politics, but in
the way in questions as how we humans live life and how society tells us to
live life.

Before I get into those details, I wanna talk about the special effects in this
film. They are ASTONISHING, and it shocks me that Cool Cat Saves The Kids got
snubbed by the Oscars for Best Special Effects. This film makes 2001 look like
garbage, and the directing in this film makes Stanley Kubrick look like the
worst director ever. You know what other film did that? Birdemic: Shock and
Terror. Both of these films are masterpieces, but if I had to choose my
favorite out of the 2, I would have to go with Cool Cat Saves The Kids. It is
now my 10th favorite film of all time.

Now, lets get into the symbolism: So you might be asking yourself, Why is Cool
Cat Orange? Well, I can easily explain. Orange is a color. Orange is also a
fruit, and its a very good fruit. You know what else is good? Good behavior.
What behavior does Cool Cat have? He has good behavior. This cannot be a
coincidence, since cool cat has good behavior in the film.

Now, why is Butch The Bully fat? Well, fat means your wide. You wanna know who
was wide? Hitler. Nuff said this cannot be a coincidence.

Why does Erik Estrada suspect Butch The Bully to be a bully? Well look at it
this way. What color of a shirt was Butchy wearing when he walks into the area?
I don't know, its looks like dark purple/dark blue. Why rhymes with dark? Mark.
Mark is that guy from the Room. The Room is the best movie of all time. What is
the opposite of best? Worst. This is how Erik knew Butch was a bully.

and finally, how come Vivica A. Fox isn't having a successful career after
making Kill Bill.

I actually can't answer that question.

Well thanks for reading my review.
"""
predict_sentiment(test_text)

0.73: Positive sentiment
