In [1]:
import pandas as pd

In [2]:
! wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip

--2022-12-23 16:20:42--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 172.67.9.4, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5828358084 (5.4G) [application/zip]
Saving to: ‘crawl-300d-2M-subword.zip’


2022-12-23 16:23:02 (40.1 MB/s) - ‘crawl-300d-2M-subword.zip’ saved [5828358084/5828358084]



In [3]:
! unzip crawl-300d-2M-subword.zip

Archive:  crawl-300d-2M-subword.zip
  inflating: crawl-300d-2M-subword.vec  
  inflating: crawl-300d-2M-subword.bin  


In [4]:
!pip install fastText

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fastText
  Downloading fasttext-0.9.2.tar.gz (68 kB)
Collecting pybind11>=2.2
  Using cached pybind11-2.10.2-py3-none-any.whl (222 kB)
Building wheels for collected packages: fastText
  Building wheel for fastText (setup.py) ... done
  Created wheel for fastText: filename=fasttext-0.9.2-cp38-cp38-linux_x86_64.whl size=3133412 sha256=1029731eaa6d26585cd6d7dbf2760ee21ab2b26c4d59c28a6a4769dde9b3e13e
  Stored in directory: /root/.cache/pip/wheels/93/61/2a/c54711a91c418ba06ba195b1d78ff24fcaad8592f2a694ac94
Successfully built fastText
Installing collected packages: pybind11, fastText
Successfully installed fastText-0.9.2 pybind11-2.10.2


In [5]:
import fasttext

In [6]:
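#Load the pretrained vectors (note: the classifier trained later in this notebook does not use them)
model_fasttext = fasttext.load_model('/content/crawl-300d-2M-subword.bin')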



In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
#Reading the data
df_train_clean = pd.read_csv('/content/drive/MyDrive/IMDB_NLP/Train_reviews.csv',usecols = ['Review','Review_label'])
df_test_clean = pd.read_csv('/content/drive/MyDrive/IMDB_NLP/Test_reviews.csv',usecols = ['Review','Review_label'])

**Preprocessing Steps for fastText**
1. Convert the numeric review label into a sentiment name ('Negative' or 'Positive').
2. Prefix the review label with `__label__`.
3. Concatenate each label with its review text so that the entire content sits in a single column.
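
The three steps above can be sketched on a single record in plain Python. The helper name `to_fasttext_line` and the sample review text are illustrative assumptions and not part of this notebook; the 0/1 mapping mirrors the cells that follow.

```python
# Hypothetical helper (not part of the notebook) illustrating the three steps.
LABEL_PREFIX = "__label__"  # prefix fastText uses by default to recognise labels


def to_fasttext_line(review_label: int, review: str) -> str:
    # Step 1: map the numeric label to a sentiment name (0 -> Negative, otherwise Positive).
    sentiment = "Negative" if review_label == 0 else "Positive"
    # Step 2: add the label prefix.
    tagged = LABEL_PREFIX + sentiment
    # Step 3: join label and review text into a single line.
    return tagged + " " + review


example_line = to_fasttext_line(1, "Flawless filmmaking")
print(example_line)
```

The cells below do the same thing column-wise with pandas.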

In [9]:
#Assign sentiments : 'Negative' to review_label = 0 and 'Positive' to review_label = 1
df_train_clean['Review_label'] = df_train_clean['Review_label'].apply(lambda x: 'Negative' if x == 0  else 'Positive')
df_test_clean['Review_label'] = df_test_clean['Review_label'].apply(lambda x: 'Negative' if x == 0  else 'Positive')

In [10]:
#Prefix with __label__
df_train_clean['Review_label'] = '__label__' + df_train_clean['Review_label'].astype(str)
df_test_clean['Review_label'] = '__label__' + df_test_clean['Review_label'].astype(str)

In [11]:
df_train_clean

Unnamed: 0,Review,Review_label
0,This absolutely terrible movie Dont lured Chri...,__label__Negative
1,I known fall asleep films usually due combinat...,__label__Negative
2,Mann photographs Alberta Rocky Mountains super...,__label__Negative
3,This kind film snowy Sunday afternoon rest wor...,__label__Positive
4,As others mentioned women go nude film mostly ...,__label__Positive
...,...,...
24995,I severe problem show several actually A simpl...,__label__Negative
24996,The year 1964 Ernesto Che Guevara Cuban citize...,__label__Positive
24997,Okay So I got back Before I start review let t...,__label__Negative
24998,When I saw trailer TV I surprised In May 2008 ...,__label__Negative


In [12]:
#Creating new dataframe containing data in required format
df_fasttext_train = pd.DataFrame()
df_fasttext_train['Review_sentiment'] = df_train_clean['Review_label'] +' '+ df_train_clean['Review']
df_fasttext_test = pd.DataFrame()
df_fasttext_test['Review_sentiment'] = df_test_clean['Review_label'] +' '+ df_test_clean['Review']

In [13]:
#Let's have a look! (a notebook cell displays only its last expression, so only the test row appears below)
df_fasttext_train.loc[:0]
df_fasttext_test.loc[:0]

Unnamed: 0,Review_sentiment
0,__label__Positive There films make careers For...


In [14]:
df_fasttext_train.to_csv(r'/content/drive/MyDrive/IMDB_NLP/Train_fasttext.txt', index=False, header=False)
df_fasttext_test.to_csv(r'/content/drive/MyDrive/IMDB_NLP/Test_fasttext.txt', index=False, header=False)
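
One thing to watch when exporting with `to_csv`: CSV writers quote any field that contains the delimiter. The reviews here appear to have had their punctuation stripped, so this probably did not matter for this dataset. A review containing a comma, however, would be wrapped in double quotes, and the line would no longer start with `__label__`. The sketch below uses the standard-library `csv` module as a stand-in (assuming pandas applies the same minimal-quoting default) and compares it with a plain write. The sample line is made up.

```python
import csv
import io

# Made-up line containing a comma, for illustration only.
line = "__label__Positive Great, really great"

# Writing through a CSV writer: the comma triggers quoting of the whole field.
csv_buf = io.StringIO()
csv.writer(csv_buf).writerow([line])
csv_text = csv_buf.getvalue()

# Writing the line as plain text: nothing is added around it.
plain_buf = io.StringIO()
plain_buf.write(line + "\n")
plain_text = plain_buf.getvalue()

print(repr(csv_text))
print(repr(plain_text))
```

If the raw reviews still contain commas or quotes, writing the lines with a plain file handle is one way to keep the label at the start of each line.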

In [15]:
#Training the model
model_fast_text = fasttext.train_supervised(input='/content/drive/MyDrive/IMDB_NLP/Train_fasttext.txt')

In [16]:
#Now testing the model
model_fast_text.test("/content/drive/MyDrive/IMDB_NLP/Test_fasttext.txt")

(25000, 0.88136, 0.88136)

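Here `test` returns the number of examples (25000), followed by precision and recall at k = 1:
1. Precision Score: 88%, which implies that out of 100 predictions roughly 88 are correct.
2. Recall Score: 88%, which implies that roughly 88 out of 100 true labels are recovered. It equals precision here because every review has exactly one label and the model predicts exactly one label per review.

Hence our model is reasonably good, especially given that it was trained with default hyperparameters.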
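
To see why the two scores coincide, here is a minimal sketch that computes precision and recall at 1 by hand. The function name and the four toy labels are assumptions for illustration; it is not how fastText computes the scores internally.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    # Each example has exactly one true label and exactly one predicted label.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)  # correct / number of predictions made
    recall = correct / len(true_labels)          # correct / number of true labels
    return precision, recall


# Toy data, made up for illustration.
true_labels = ["__label__Positive", "__label__Negative", "__label__Positive", "__label__Negative"]
predicted = ["__label__Positive", "__label__Positive", "__label__Positive", "__label__Negative"]

precision, recall = precision_recall_at_1(true_labels, predicted)
print(precision, recall)
```

With one label per example on both sides, the two denominators are the same, so both scores reduce to plain accuracy.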


## Let's check the prediction!

In [17]:
df_test_clean['Review'][0]

'There films make careers For George Romero NIGHT OF THE LIVING DEAD Kevin Smith CLERKS Robert Rodriguez EL MARIACHI Add list Onur Tukels absolutely amazing DINGALINGLESS Flawless filmmaking assured professional aforementioned movies I havent laughed hard since I saw THE FULL MONTY And even I dont think I laughed quite hard So speak Tukels talent considerable DINGALINGLESS chock full double entendres one would sit copy script linebyline examination fully appreciate uh breadth width Every shot beautifully composed clear sign surehanded director performances around solid theres none overthetop scenery chewing one mightve expected film like DINGALINGLESS film whose time come'

In [18]:
df_test_clean['Review_label'][0]

'__label__Positive'

In [19]:
#Now let's check the prediction on this review!
model_fast_text.predict("There films make careers For George Romero NIGHT OF THE LIVING DEAD Kevin Smith CLERKS Robert Rodriguez EL MARIACHI Add list Onur Tukels absolutely amazing DINGALINGLESS Flawless filmmaking assured professional aforementioned movies I havent laughed hard since I saw THE FULL MONTY And even I dont think I laughed quite hard So speak Tukels talent considerable DINGALINGLESS chock full double entendres one would sit copy script linebyline examination fully appreciate uh breadth width Every shot beautifully composed clear sign surehanded director performances around solid theres none overthetop scenery chewing one mightve expected film like DINGALINGLESS film whose time come")

(('__label__Positive',), array([0.55579025]))

Here we see that the model gives the correct prediction by classifying the review as positive, although the probability of about 0.56 shows that it is not very confident.
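
The result of `predict` is a pair: a tuple of labels and an array of matching probabilities. A small sketch of turning that pair into something more readable follows. The helper `readable_prediction` is hypothetical, and a plain list stands in for the NumPy array so the snippet runs without fastText installed.

```python
# Stand-in for the value returned by model_fast_text.predict(...) above;
# a plain list replaces the NumPy array.
prediction = (("__label__Positive",), [0.55579025])


def readable_prediction(prediction, prefix="__label__"):
    labels, probabilities = prediction
    label = labels[0]
    # Strip the label prefix if present.
    if label.startswith(prefix):
        label = label[len(prefix):]
    # Round the probability to two decimals for display.
    return label, round(float(probabilities[0]), 2)


label, confidence = readable_prediction(prediction)
print(label, confidence)
```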

References:
https://github.com/codebasics/nlp-tutorials/blob/main/18_fasttext_classification/fasttext_tutorial.ipynb