Colab: https://colab.research.google.com/github/Mohit-Patil/Taxonomy-Creation/blob/master/Taxonomy.ipynb
```python
import collections
import random
import re

import pandas as pd
import nltk
import tensorflow as tf
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils import shuffle

# Download the NLTK resources used later for tokenization and stopword removal.
nltk.download('stopwords')
nltk.download('punkt')
```
- Google Drive is used to store the data and the model (see the mounting sketch below).
- The data can be recovered if the runtime crashes.
- Instead of downloading the dataset locally, we download it to the Colab runtime, taking advantage of its network speed and working around local storage constraints.
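In Colab, this is typically done by mounting Drive with the standard Colab helper; a minimal sketch:

```python
from google.colab import drive

# Mount Google Drive so the dataset and trained model persist
# across runtime crashes and restarts.
drive.mount('/content/drive')
```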
- Install the Kaggle API.
- Download the dataset from the competition page.
- Unzip the train and test data.
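In a Colab cell this might look like the following; the competition slug and archive names are assumptions based on the Kaggle competition page, and the Kaggle CLI expects an API token at `~/.kaggle/kaggle.json`:

```
!pip install kaggle
# Assumed competition slug and archive names:
!kaggle competitions download -c facebook-recruiting-iii-keyword-extraction
!unzip Train.zip
!unzip Test.zip
```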
- Id - unique identifier for each question.
- Title - the question's title.
- Body - the body of the question.
- Tags - the tags associated with the question.
- Size (compressed): 2.19 GB
- Size (uncompressed): 6.76 GB
- No. of rows: 6,034,195
- No. of columns: 4

Dataset: Facebook Recruiting III - Keyword Extraction competition on Kaggle.
- The dataset is loaded with the pandas library.
- It is explored using pandas' built-in functions (see the loading sketch below).
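A minimal loading sketch; the CSV file name is an assumption based on the unzipped competition archive:

```python
# Load the training data (file name assumed from the competition archive).
df = pd.read_csv('Train.csv')

# Inspect columns, dtypes, and the first few rows.
df.info()
print(df.head())
```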
- The columns "Id" and "Index" are of no use.
- We remove them using pandas' drop(...) function.
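A sketch of the drop, assuming both columns are present under these exact names:

```python
# Drop identifier columns that carry no predictive signal.
df = df.drop(columns=['Id', 'Index'], errors='ignore')
```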
- The dataset contains duplicates.
- We drop duplicates based only on the "Body" column, because many questions may share the same "Title" yet have different bodies.
- Two questions may also legitimately share the same "Tags", so we do not deduplicate on the "Tags" column.
- After removing duplicates we reset the index values.
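A one-line sketch of this deduplication in pandas:

```python
# Keep the first occurrence of each unique "Body", then rebuild a clean index.
df = df.drop_duplicates(subset=['Body']).reset_index(drop=True)
```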
- Total tags present: 12,030,708
- Average number of tags per row: ≈ 2.90
- We build the list of tags that occur most often.
- Freq_Tags: size = 100.
- We keep only the 100 most frequent tags to limit the data size.
- We store the indices of the data points that contain at least one of the most frequent tags.
- We do this by checking the tags of each data point and appending its index to Sample_Index.
- Sample_Index now contains the indices of 674,746 data points.
- From these, we choose 600,000 random data points (see the sketch below).
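A sketch of how Freq_Tags and Sample_Index could be built; the variable names mirror the notebook's, but the implementation details are assumptions:

```python
# Count every tag across the dataset and keep the 100 most frequent ones.
tag_counter = collections.Counter(
    tag for tags in df['Tags'] for tag in tags.split()
)
freq_tags = {tag for tag, _ in tag_counter.most_common(100)}

# Store the indices of data points containing at least one frequent tag.
sample_index = [
    i for i, tags in enumerate(df['Tags'])
    if any(tag in freq_tags for tag in tags.split())
]

# Randomly pick 600,000 of those indices.
sample_index = random.sample(sample_index, 600_000)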
- Now we select the data points whose indices match the randomly sampled indices above.
- This uses the iloc[...] indexer from the pandas library.
- Then we reset the indices and drop the unwanted "index" column.
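Selecting the sampled rows might look like this:

```python
# Select the sampled rows by position.
df = df.iloc[sample_index]

# reset_index() moves the old index into an "index" column, which we drop.
df = df.reset_index()
df = df.drop(columns=['index'])
```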
- We convert the space-separated tags of each data point to comma-separated values.
  Example: [c++ clion array] becomes [c++, clion, array]
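One way to do this in pandas is to split each tag string into a list, which prints with commas:

```python
# "c++ clion array" -> ['c++', 'clion', 'array']
df['Tags'] = df['Tags'].str.split()
```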
- HTML tags are removed, since they add a lot of noise to the data and can make the model behave erratically.
- They are removed using the re module (regular expressions).
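A common regex pattern for this step; BeautifulSoup's get_text() would be an alternative:

```python
def strip_html(text):
    # Replace anything that looks like an HTML tag with a space.
    return re.sub(r'<[^>]+>', ' ', text)

df['Body'] = df['Body'].apply(strip_html)
```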
- The "Title" and "Body" columns of every data point are joined to form a single column.
- Both are also converted to lowercase to maintain uniformity.
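A sketch of the merge; the combined column name "Text" is an assumption:

```python
# Concatenate title and body into one lowercase text field.
df['Text'] = (df['Title'] + ' ' + df['Body']).str.lower()
```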
- Stopwords such as "I", "am", "he", "was", ... are removed from the "Body" of each data point, since they add noise and can cause the model to overfit.
- We also remove the frequent tags themselves, so that the model does not fit too closely to these frequently occurring words.
- Another reason: some questions mention their tags in the question text while others do not, so removing stopwords and frequent tags keeps the inputs consistent.
- This uses the NLTK library with its tokenizer and stopword list.
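A sketch of the cleaning step, reusing the freq_tags set from the earlier sampling sketch:

```python
from nltk.tokenize import word_tokenize

# Words to strip: English stopwords plus the 100 frequent tags.
remove_words = set(stopwords.words('english')) | freq_tags

def clean_text(text):
    # Tokenize, then drop stopwords and frequent tags.
    return ' '.join(w for w in word_tokenize(text) if w not in remove_words)

df['Text'] = df['Text'].apply(clean_text)
```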
- Now that the data is cleaned, we separate the dependent and independent columns from each other.
- We now encode the top tags as a multi-hot array.
- We first split the tag values of each data point and then multi-hot encode them using MultiLabelBinarizer() from the sklearn library.
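A sketch of the encoding; restricting classes to the 100 frequent tags means any other tag is ignored (sklearn warns about unseen labels):

```python
# Each row's tag list becomes a 100-dimensional 0/1 vector.
mlb = MultiLabelBinarizer(classes=sorted(freq_tags))
y = mlb.fit_transform(df['Tags'])
```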
- We now split the data into train and test sets using an 80/20 split.
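Using scikit-learn's train_test_split (the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing.
train_text, test_text, y_train, y_test = train_test_split(
    df['Text'], y, test_size=0.2, random_state=42
)
```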
- First "Body" of each Data Point is Tokenized and then transformed to matrices of 0's and 1's depending if the word is in Bag of Words.
- The model is a sequential model comprising the following layers:
  - Input layer (shape = 50)
  - Dense/fully connected layer (shape = 80)
  - Dense/fully connected layer (shape = 140)
  - Output layer (shape = 100)
- The first three layers use the ReLU activation function, whereas the last layer uses the sigmoid activation function to output whether or not each tag is related.
- Sigmoid outputs values in [0, 1], so it expresses how strongly or weakly a tag is related.
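A Keras sketch matching the layer sizes above; the optimizer, loss, and training parameters are assumptions (binary cross-entropy is the usual choice for multi-label sigmoid outputs):

```python
model = tf.keras.Sequential([
    # 50-dimensional bag-of-words input, two ReLU hidden layers.
    tf.keras.layers.Dense(80, activation='relu', input_shape=(50,)),
    tf.keras.layers.Dense(140, activation='relu'),
    # One sigmoid unit per frequent tag: a relatedness score in [0, 1].
    tf.keras.layers.Dense(100, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Keep part of the training data aside as a cross-validation set.
model.fit(X_train, y_train, validation_split=0.1,
          epochs=10, batch_size=128)
```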
- The model performs well on the cross-validation set during training as well as on the test set.
- Accuracy = 98%
- GPU: Tesla T4
- RAM: 25.81 GB
- Software: pandas, Keras, TensorFlow, scikit-learn, NLTK, pickle