<a href="https://colab.research.google.com/github/Mohit-Patil/Taxonomy-Creation/blob/master/Taxonomy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **-----TAXONOMY CREATION-----**

## 1. Importing Libraries.

In [1]:
import pandas as pd
import collections
import random
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import re
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils import shuffle
import tensorflow as tf 
import nltk
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True


## 2. Connecting Google Drive to Save Processed Datasets.

* Google Drive is used to store the processed data and the trained model.
* The data can be recovered if the runtime crashes.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 3. Downloading Dataset from Kaggle.

* Instead of downloading the dataset locally, we download it directly to the Colab runtime, which takes advantage of Colab's network speed and avoids local storage constraints.
1. Install Kaggle API.
2. Download the Dataset from the Competition list.
3. Unzip the Train and Test Data.

In [0]:
!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c facebook-recruiting-iii-keyword-extraction
!unzip Train.zip
!unzip Test.zip

Saving kaggle.json to kaggle (1).json
mkdir: cannot create directory ‘/root/.kaggle’: File exists
Train.zip: Skipping, found more recently modified local copy (use --force to force download)
Test.zip: Skipping, found more recently modified local copy (use --force to force download)
SampleSubmission.csv: Skipping, found more recently modified local copy (use --force to force download)
Archive:  Train.zip
replace Train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
Archive:  Test.zip
replace Test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n



## 4. Preprocessing and Cleaning the Data

### Dataset Description
**Id** - Unique identifier for each question.</br>
**Title** - The question's title.</br>
**Body** - The body of the question.</br>
**Tags** - The tags associated with the question.</br>
</br>
*Size(Compressed)*: 2.19GB</br>
*Size(Actual)*: 6.76GB</br>
*No. of Rows*: 6034195 </br>
*No. of Columns*: 4</br>
</br>

### Dataset Link
Dataset: [Facebook Recruiting III - Keyword Extraction](https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data) competition on Kaggle.

</br>


### Loading Datasets into Dataframe</br>

 - Using Pandas Library</br>


In [2]:
df = pd.read_csv("Train.csv")
print("Dimensions of Data:",df.shape)

Dimensions of Data: (6034195, 4)



### 4.1 Deleting Missing data from the Rows</br>

 - Using Inbuilt Functions of Pandas library.</br>
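
As a small illustration on made-up data (the toy frame below is an assumption, not part of the dataset), `dropna` removes every row that has a missing value. Passing `drop=True` to `reset_index` is one way to renumber the rows without creating the extra `index` column that otherwise has to be dropped afterwards:

```python
import pandas as pd

# Toy data (hypothetical) with one missing Title and one missing Tags value.
toy = pd.DataFrame({
    "Title": ["a", None, "c", "d"],
    "Body": ["w", "x", "y", "z"],
    "Tags": ["php", "java", None, "c++"],
})

# Drop rows containing any missing value, then renumber from 0.
# drop=True discards the old index instead of storing it as a column.
clean = toy.dropna().reset_index(drop=True)

print(clean.shape)          # (2, 3)
print(list(clean.columns))  # ['Title', 'Body', 'Tags']
```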


In [3]:
df.dropna(inplace = True)
df.reset_index(inplace = True)
df.head()

Unnamed: 0,index,Id,Title,Body,Tags
0,0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,2,3,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,4,5,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents


In [4]:
print("Dimensions of Data after Removing Null Rows:",df.shape)

Dimensions of Data after Removing Null Rows: (6034187, 5)



### 4.2: Deleting Unnecessary Columns.</br>

 - The columns "Id" and "index" (the latter added by reset_index above) are not needed.</br>
 - Removing them using the drop(...) function of the pandas library.</br>


In [5]:
df.drop(columns = ['Id','index'],inplace = True)
df.head()

Unnamed: 0,Title,Body,Tags
0,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents


In [6]:
print("Dimensions of Data After Dropping Unnecessary Columns:",df.shape)

Dimensions of Data After Dropping Unnecessary Columns: (6034187, 3)



### 4.3: Removing Duplicates.</br>

 - The dataset contains duplicates.</br>
 - We remove duplicates based on the "Body" column only, because several questions may share the same "Title" while having a different "Body".</br>
 - Likewise, two questions may have the same "Tags", so we do not deduplicate on the "Tags" column.</br>
 - After removing duplicates we reset the index values.</br>
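
A minimal sketch of this behaviour on made-up rows (the toy frame is hypothetical): only "Body" is compared, so rows that share a "Title" or "Tags" value survive as long as their "Body" differs.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 share a Body; rows 0 and 1 share a Title.
toy = pd.DataFrame({
    "Title": ["same title", "same title", "other title"],
    "Body": ["body A", "body B", "body A"],
    "Tags": ["php", "php", "java"],
})

# Only Body is compared; the first occurrence of each Body is kept.
deduped = toy.drop_duplicates(subset="Body", keep="first").reset_index(drop=True)

print(deduped["Body"].tolist())   # ['body A', 'body B']
print(deduped["Title"].tolist())  # ['same title', 'same title']
```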


In [0]:
df.drop_duplicates(subset='Body',keep = 'first',inplace=True)

In [8]:
print("Dimensions of Data After Removing Duplicates",df.shape)

Dimensions of Data After Removing Duplicates (4154374, 3)


In [9]:
df.reset_index(inplace = True)
df.head()

Unnamed: 0,index,Title,Body,Tags
0,0,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,1,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,2,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,3,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,4,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents


### 4.4: Calculating Total Tags Present

In [10]:
Available_tags = [tag for slist in df["Tags"].values for tag in slist.split()]
print("Total Tags Present",len(Available_tags))
print("Average Number of Tags Present per Row",float(len(Available_tags))/df.shape[0])

Total Tags Present 12030708
Average Number of Tags Present per Row 2.8959135600213175



### 4.5: Frequently Occurring Tags

 - List of the tags that occur most often.
 - `Freq_tags`: *size* = 100.
 - We keep only the 100 most frequent tags to limit the data size.


In [0]:
# Count how often each tag occurs across all questions
tag_counts = collections.Counter(Available_tags)
# Keep the 100 most frequent tags as a set for fast membership tests
Freq_tags = {tag for tag, _ in tag_counts.most_common(100)}

In [12]:
print(Freq_tags)

{'entity-framework', 'networking', 'objective-c', 'winforms', 'http', 'html5', 'java', 'homework', 'iphone', 'ruby-on-rails-3', 'mysql', 'visual-studio-2010', 'matlab', 'json', 'query', '.net', 'sql-server', 'firefox', 'facebook', 'bash', 'sql', 'performance', 'multithreading', 'file', 'api', 'c#', 'c++', 'osx', 'jquery', 'shell', 'asp.net', 'google-chrome', 'vb.net', 'perl', 'git', 'r', 'node.js', 'javascript', 'django', 'delphi', 'ipad', 'unit-testing', 'cocoa', 'ruby-on-rails', 'sqlite', 'regex', 'image', 'qt', 'wordpress', 'svn', 'visual-studio', 'windows', '.htaccess', 'swing', 'linux', 'android', 'oracle', 'cocoa-touch', 'linq', 'ios', 'apache', 'security', 'silverlight', 'apache2', 'algorithm', 'flash', 'spring', 'sql-server-2008', 'c', 'flex', 'html', 'internet-explorer', 'php', 'tsql', 'actionscript-3', 'hibernate', 'web-services', 'ajax', 'xcode', 'arrays', 'windows-7', 'asp.net-mvc-3', 'database', 'eclipse', 'oop', 'string', 'ubuntu', 'jquery-ui', 'list', 'codeigniter', 'asp


### 4.6: Storing Indices of the Rows which Contain Frequent Tags</br>

 - We store the indices of the data points whose tags all belong to the most frequent tags.</br>
 - We do this by checking whether the set of tags of each data point is a subset of `Freq_tags` and, if so, storing the data point's index in `Sample_Index`.</br>
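
The subset test can be sketched on a few made-up tag strings (`freq_tags` and `tag_strings` are hypothetical stand-ins for `Freq_tags` and the "Tags" column). Note that a row is kept only when every one of its tags is frequent, not when just one of them is:

```python
# Hypothetical stand-ins for Freq_tags and the Tags column.
freq_tags = {"php", "jquery", "ajax", "c++"}
tag_strings = ["php jquery ajax", "c++ clion", "c++", "haskell"]

# Keep a row only when *all* of its tags are frequent (set.issubset),
# not merely when at least one of them is.
sample_index = [
    i for i, tags in enumerate(tag_strings)
    if set(tags.split()).issubset(freq_tags)
]

print(sample_index)  # [0, 2]
```

Row 1 is rejected because `clion` is not a frequent tag, even though `c++` is.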


In [0]:
Sample_Index = []
for data in range(0,df.shape[0]):
  tags = set(df["Tags"][data].split())
  if tags.issubset(Freq_tags):
    Sample_Index.append(data)

In [14]:
print("Number of Rows which contain Frequent Tags",len(Sample_Index))

Number of Rows which contain Frequent Tags 674746



###  Randomly Choosing Indices with Frequent Tags</br>

 - Sample_Index now contains indices of *674746* data points.</br>
 - So we choose *600000* random data points out of those.</br>
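
`random.sample` draws without replacement, so no index is chosen twice. The call in this notebook is unseeded, which means each run picks a different subset. If a repeatable selection is wanted, one option (an assumption, not something the notebook does) is a dedicated seeded generator; the list below is a small hypothetical stand-in for `Sample_Index`:

```python
import random

# Hypothetical stand-in for Sample_Index.
sample_index = list(range(1000))

# A dedicated, seeded generator makes the draw repeatable across runs.
# The seed value 40 is an arbitrary choice for this sketch.
rng = random.Random(40)
pre_index = rng.sample(sample_index, k=600)

# Re-seeding with the same value reproduces the same selection.
assert pre_index == random.Random(40).sample(sample_index, k=600)
print(len(pre_index), len(set(pre_index)))  # 600 600
```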


In [0]:
Pre_index = random.sample(Sample_Index,k = 600000)


### 4.7: Choosing Data with the above Randomly Chosen Indices</br>

 - We now select the rows of `df` at the randomly chosen positions in `Pre_index`, all of which carry only frequent tags.</br>
 - Using the `iloc` indexer of the pandas library.</br>
 - Then we reset the index and drop the leftover index columns.</br>
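
Several cells in this section emit `SettingWithCopyWarning`, because `df_Final` is created as a selection from `df` and is then modified. One common way to avoid this (a sketch under that assumption, not what the notebook does) is to take an explicit copy of the selection. The toy frame and `picked` list are hypothetical:

```python
import pandas as pd

# Toy frame (hypothetical) standing in for df.
df_toy = pd.DataFrame({"Body": ["a", "b", "c", "d"], "Tags": ["x", "y", "z", "w"]})
picked = [3, 1]  # stand-in for Pre_index

# .copy() makes the selection an independent DataFrame, so later column
# assignments modify it directly rather than a possible view of df_toy.
# reset_index(drop=True) renumbers the rows without keeping the old index.
df_small = df_toy.iloc[picked, :].copy().reset_index(drop=True)
df_small["Tags"] = df_small["Tags"].str.upper()

print(df_small["Tags"].tolist())  # ['W', 'Y']
print(df_toy["Tags"].tolist())    # ['x', 'y', 'z', 'w']
```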


In [0]:
df_Final = df.iloc[Pre_index, :]

In [17]:
df_Final.head()

Unnamed: 0,index,Title,Body,Tags
489471,504365,jQuery fast mouse move mouseleave event trigge...,<p>If a users mouse goes over a table cell the...,php jquery ajax
1998466,2297245,Why do i need to add /g when using string repl...,<p>Why is the '/g' required when using string ...,javascript
55203,55379,Collecting data from multiple sources using mu...,<p>Let's imagine the set of several data sourc...,c# .net windows
191621,193786,Uninitialized constant in application controller,<p>I have a model class in my Rails applicatio...,ruby-on-rails ruby
3982226,5642520,Unresolved externals error during compilation,<p>I am getting two unresolved externals error...,c++


In [0]:
df_Final.reset_index(inplace=True)

In [19]:
df_Final.drop(columns = ['level_0','index'],inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [20]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,jQuery fast mouse move mouseleave event trigge...,<p>If a users mouse goes over a table cell the...,php jquery ajax
1,Why do i need to add /g when using string repl...,<p>Why is the '/g' required when using string ...,javascript
2,Collecting data from multiple sources using mu...,<p>Let's imagine the set of several data sourc...,c# .net windows
3,Uninitialized constant in application controller,<p>I have a model class in my Rails applicatio...,ruby-on-rails ruby
4,Unresolved externals error during compilation,<p>I am getting two unresolved externals error...,c++



### 4.8: Separating Tags in a row</br>

 - We convert the space-separated tags of each data point to comma-separated tags.</br>
 > Example: "c++ clion array" now becomes "c++,clion,array"</br>




In [21]:
Tags_Sep = []
for tags in range(0,df_Final.shape[0]):
  Tags_Sep.append(df_Final['Tags'][tags].replace(" ",","))
  if tags % 100000 == 0:
    print(tags)
tags_split = [Tags.split(",") for Tags in Tags_Sep]

0
100000
200000
300000
400000
500000


In [22]:
df_Final["Tags"] = Tags_Sep
Tags_Sep.clear()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [23]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,jQuery fast mouse move mouseleave event trigge...,<p>If a users mouse goes over a table cell the...,"php,jquery,ajax"
1,Why do i need to add /g when using string repl...,<p>Why is the '/g' required when using string ...,javascript
2,Collecting data from multiple sources using mu...,<p>Let's imagine the set of several data sourc...,"c#,.net,windows"
3,Uninitialized constant in application controller,<p>I have a model class in my Rails applicatio...,"ruby-on-rails,ruby"
4,Unresolved externals error during compilation,<p>I am getting two unresolved externals error...,c++



### 4.9: Removing Html Tags From the Body Column</br>

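 - HTML tags are removed because they add noise to the data and can degrade the model's performance.</br>
 - They are removed with a regular expression using Python's `re` module; the same pattern also strips HTML character entities such as `&amp;`.</br>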
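
To show what the pattern used by `cleanhtml` matches, here is a short sketch on a made-up string. Tags and entities are deleted outright rather than decoded, so `&lt;` disappears instead of becoming `<`, and the surrounding spaces remain:

```python
import re

# Same pattern as cleanhtml: matches HTML tags, or character entities
# such as &amp;, &#39; and &#x3c; (lowercase only, hence the .lower() call).
pattern = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

raw = "<p>Use <code>a &lt; b</code> &amp; test</p>"  # made-up sample
cleaned = pattern.sub('', raw.lower())

print(repr(cleaned))  # 'use a  b  test'
```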


In [0]:
import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

In [25]:
Body_Final = []
Title_Final = []
Tags_Final = []
for b in range(0,df_Final.shape[0]):
    Title_Final.append(cleanhtml(df_Final["Title"][b].lower()))
    Tags_Final.append(cleanhtml(df_Final["Tags"][b].lower()))
    Body_Final.append(cleanhtml(df_Final["Body"][b].lower()))
    if b % 100000 == 0:
      print(b)
df_Final["Body"] = Body_Final
df_Final["Tags"] = Tags_Final
df_Final["Title"] = Title_Final
Body_Final.clear()
Title_Final.clear()
Tags_Final.clear()

0
100000
200000
300000
400000
500000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':


In [26]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,jquery fast mouse move mouseleave event trigge...,if a users mouse goes over a table cell then a...,"php,jquery,ajax"
1,why do i need to add /g when using string repl...,why is the '/g' required when using string rep...,javascript
2,collecting data from multiple sources using mu...,"let's imagine the set of several data sources,...","c#,.net,windows"
3,uninitialized constant in application controller,i have a model class in my rails application c...,"ruby-on-rails,ruby"
4,unresolved externals error during compilation,i am getting two unresolved externals error wh...,c++



### 4.10: Concatenating Title And Body Into One Column</br>

 - Joining both "Title" and "Body" Columns of every data point to form a single Column.</br>
 - Also converting them to lowercase to maintain uniformity.</br>


In [27]:
Body_Final = []
for b in range(0,df_Final.shape[0]):
    Body_Final.append(cleanhtml(df_Final["Title"][b].lower()) + " " + df_Final["Body"][b].lower())
    if b % 100000 == 0:
      print(b)
df_Final["Body"] = Body_Final
Body_Final.clear()

0
100000
200000
300000
400000
500000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [28]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,jquery fast mouse move mouseleave event trigge...,jquery fast mouse move mouseleave event trigge...,"php,jquery,ajax"
1,why do i need to add /g when using string repl...,why do i need to add /g when using string repl...,javascript
2,collecting data from multiple sources using mu...,collecting data from multiple sources using mu...,"c#,.net,windows"
3,uninitialized constant in application controller,uninitialized constant in application controll...,"ruby-on-rails,ruby"
4,unresolved externals error during compilation,unresolved externals error during compilation ...,c++


In [29]:
df_Final.drop(columns= 'Title',inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [30]:
df_Final.head()

Unnamed: 0,Body,Tags
0,jquery fast mouse move mouseleave event trigge...,"php,jquery,ajax"
1,why do i need to add /g when using string repl...,javascript
2,collecting data from multiple sources using mu...,"c#,.net,windows"
3,uninitialized constant in application controll...,"ruby-on-rails,ruby"
4,unresolved externals error during compilation ...,c++



### 4.11: Removing Stop Words and Frequent Tags from The Body Column</br>

 - Removing stop words such as "I", "am", "he", "was" from the "Body" of each data point, since they carry little information about the topic, add noise, and can cause the model to overfit.</br>
 - We also remove the frequent tags themselves from the text so that the model does not simply learn to look for the tag name in the question.</br>
 - Some questions mention their tags in the text while others do not; removing the tag words treats both kinds of questions consistently.</br>
 - Using the NLTK library's tokenizer and stop-word list.</br>
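
The filtering step can be sketched on one made-up sentence. The stop-word list and frequent tags below are small hypothetical stand-ins, and plain whitespace splitting replaces NLTK's tokenizer only to keep the sketch free of downloads:

```python
# Hypothetical stand-ins: a tiny stop-word list and two frequent tags.
stop_words = {"i", "am", "a", "in", "how", "do", "the"}
freq_tags = {"python", "pandas"}
remove = stop_words | freq_tags

body = "how do i rename a column in pandas using python"

# Whitespace splitting is used here only to keep the sketch dependency-free;
# the notebook itself tokenizes with nltk.tokenize.word_tokenize.
kept = [w for w in body.split() if w not in remove]

print(" ".join(kept))  # rename column using
```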


In [31]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

stop_words = stop_words.union(Freq_tags)

Body_Final = []
#print(stop_words)

for b in range(0,df_Final.shape[0]):
    word_tokens = word_tokenize(df_Final["Body"][b])
    # Keep only tokens that are neither stop words nor frequent tags
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    Body_Final.append(" ".join(filtered_sentence))
    if b % 100000 == 0:
      print(b)

0
100000
200000
300000
400000
500000


In [32]:
df_Final["Body"] = Body_Final

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [0]:
df_Final = shuffle(df_Final, random_state = 40)

In [34]:
df_Final.reset_index(inplace = True)
df_Final.head()

Unnamed: 0,index,Body,Tags
0,534389,trigger function blur unless specific elements...,"javascript,jquery"
1,524450,content rotator various widths content rotator...,jquery
2,84140,find column group another column given date us...,mysql
3,373751,using bit flags purposes 'm conflicted one . '...,windows
4,219648,issues ip address vs netbios name 7 ultimate m...,"windows-7,networking"



## 5. Data Preparation</br>

 - Now that the data is cleaned, we separate the target column ("Tags") from the input column ("Body") and encode both for the model.</br>


### 5.1 Encoding Tags</br>

 - We now encode the tags as multi-hot arrays.</br>
 - We first split the tag string of each data point into a list and then multi-hot encode the lists using MultiLabelBinarizer() from the sklearn library.</br>
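
To make the target representation concrete, here is a plain-Python sketch of a multi-hot encoding on made-up tag lists. It only illustrates the kind of array `MultiLabelBinarizer` yields (one column per distinct tag, in sorted order) and is not the library's implementation:

```python
# Made-up tag lists, already split as in tags_split.
tags_split = [["javascript", "jquery"], ["mysql"], ["jquery"]]

# One column per distinct tag, in sorted order.
classes = sorted({tag for tags in tags_split for tag in tags})

# Each row gets a 1 in the column of every tag it carries, else 0.
encoded = [[1 if c in tags else 0 for c in classes] for tags in tags_split]

print(classes)  # ['javascript', 'jquery', 'mysql']
print(encoded)  # [[1, 1, 0], [0, 0, 1], [0, 1, 0]]
```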


In [35]:
tags_split = [tags.split(',') for tags in df_Final['Tags'].values]
tags_split[0:10]

[['javascript', 'jquery'],
 ['jquery'],
 ['mysql'],
 ['windows'],
 ['windows-7', 'networking'],
 ['c++'],
 ['android'],
 ['c#'],
 ['php'],
 ['hibernate', 'list']]

In [36]:
tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)
num_tags = len(tags_encoded[0])
print(df_Final['Body'].values[0])
print(tag_encoder.classes_)
print(tags_encoded[0])

trigger function blur unless specific elements clicked login form . inputs get focused `` forgotten password '' `` remember '' elements get shown adding class , blur , elements hidden removing class `` sho '' . would like elements keep class `` show '' click either one login link $ ( document ) .ready ( function ( ) { $ ( '.login * ' ) .focus ( showlogin ) ; $ ( '.login * ' ) .blur ( hidelogin ) ; } ) ; function showlogin ( ) { $ ( '.login .hidden ' ) .addclass ( `` show '' ) ; } function hidelogin ( ) { $ ( '.login .hidden ' ) .removeclass ( `` show '' ) ; } : form class= '' login '' input type= '' text '' / input type= '' password '' / class= '' loginbutton '' href= '' # '' log in/a br / class= '' hidden '' href= '' # '' forgotten password/a label class= '' hidden '' input type= '' checkbox '' / remember me/label /form
['.htaccess' '.net' 'actionscript-3' 'ajax' 'algorithm' 'android' 'apache'
 'apache2' 'api' 'arrays' 'asp.net' 'asp.net-mvc' 'asp.net-mvc-3' 'bash'
 'c' 'c#' 'c++' 'co

### 5.2: Splitting Prepared Data into Train and Test Sets

 - We now split the data into train and test sets using an 80/20 split. The data was already shuffled above, so a simple slice is sufficient.</br>


In [37]:
train_size = int(len(df_Final) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(df_Final) - train_size))

Train size: 480000
Test size: 120000


In [0]:
train_tags = tags_encoded[:train_size]
test_tags = tags_encoded[train_size:]

### 5.3: Tokenizing and Transforming.</br>

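 - The "Body" of each data point is first tokenized and then transformed into a vector of 0's and 1's, where each position indicates whether the corresponding vocabulary word occurs in the text (a binary bag of words). The vocabulary consists of the most frequent words in the training text, capped by `VOCAB_SIZE`.</br>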
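
The transformation can be sketched in plain Python. This is a simplified re-implementation for illustration, assuming whitespace tokenization; the Keras `Tokenizer` additionally lowercases and filters punctuation. As in Keras, index 0 is reserved, so a vocabulary size of `n` keeps the `n - 1` most frequent words. The helper names are hypothetical:

```python
from collections import Counter

def fit_word_index(texts, vocab_size):
    """Map the most frequent words to indices 1..vocab_size-1 (0 is unused)."""
    counts = Counter(word for t in texts for word in t.split())
    top = counts.most_common(vocab_size - 1)
    return {w: i for i, (w, _) in enumerate(top, start=1)}

def to_binary_matrix(texts, word_index, vocab_size):
    """Row i has a 1 in column j if the word with index j occurs in text i."""
    matrix = [[0] * vocab_size for _ in texts]
    for row, t in zip(matrix, texts):
        for word in t.split():
            j = word_index.get(word)
            if j is not None:
                row[j] = 1
    return matrix

texts = ["blur blur function", "function class", "blur"]  # made-up documents
index = fit_word_index(texts, vocab_size=3)
print(index)                              # {'blur': 1, 'function': 2}
print(to_binary_matrix(texts, index, 3))  # [[0, 1, 1], [0, 0, 1], [0, 1, 0]]
```

The word `class` falls outside the vocabulary and is simply ignored, which is what happens to rare words when `VOCAB_SIZE` is small.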


In [39]:
%%writefile preprocess.py
from tensorflow.keras.preprocessing import text

class TextPreprocessor(object):
  def __init__(self, vocab_size):
    self._vocab_size = vocab_size
    self._tokenizer = None
  
  def create_tokenizer(self, text_list):
    tokenizer = text.Tokenizer(num_words=self._vocab_size)
    tokenizer.fit_on_texts(text_list)
    self._tokenizer = tokenizer

  def transform_text(self, text_list):
    text_matrix = self._tokenizer.texts_to_matrix(text_list)
    return text_matrix

Overwriting preprocess.py


In [0]:
from preprocess import TextPreprocessor

VOCAB_SIZE = 200 

train_qs = df_Final['Body'].values[:train_size]
test_qs = df_Final['Body'].values[train_size:]

processor = TextPreprocessor(VOCAB_SIZE)
processor.create_tokenizer(train_qs)

body_train = processor.transform_text(train_qs)
body_test = processor.transform_text(test_qs)

In [41]:
print(len(body_train[0]))
print(body_train[0])

200
[0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]


When all the data is converted and transformed, we then save the Processed Data

In [42]:
df_Final.to_csv("Train_Processed.csv", index=False)
df_Final.head()

Unnamed: 0,index,Body,Tags
0,534389,trigger function blur unless specific elements...,"javascript,jquery"
1,524450,content rotator various widths content rotator...,jquery
2,84140,find column group another column given date us...,mysql
3,373751,using bit flags purposes 'm conflicted one . '...,windows
4,219648,issues ip address vs netbios name 7 ultimate m...,"windows-7,networking"


## 6. Building and Training the Model

In [0]:
import pickle

with open('./processor_state.pkl', 'wb') as f:
  pickle.dump(processor, f)


### Model</br>

 - The model is a Keras Sequential model that takes a bag-of-words vector of length `VOCAB_SIZE` (200) as input and stacks four Dense (fully connected) layers:
		 - Dense layer (50 units, ReLU)
		 - Dense layer (80 units, ReLU)
		 - Dense layer (140 units, ReLU)
		 - Output Dense layer (100 units, one per tag, sigmoid)
	
 - The three hidden layers use the ReLU activation function, whereas the output layer uses the sigmoid activation function to score each tag independently.</br>
 - Sigmoid outputs values in [0, 1], so each output indicates how strongly the corresponding tag is related to the question.</br>
 - Built with TensorFlow and Keras.</br>
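
As a quick sanity check, the parameter counts reported by `model.summary()` can be reproduced by hand from the layer widths used in `create_model` (an input of 200 features followed by Dense layers of 50, 80, 140 and 100 units):

```python
# Layer widths: input (VOCAB_SIZE), three hidden Dense layers, output (num_tags).
sizes = [200, 50, 80, 140, 100]

# A Dense layer has (inputs * units) weights plus one bias per unit.
params = [n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:])]

print(params)       # [10050, 4080, 11340, 14100]
print(sum(params))  # 39570
```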


In [0]:
def create_model(vocab_size, num_tags):
  
  model = tf.keras.models.Sequential()
  model.add(tf.keras.layers.Dense(50, input_shape=(vocab_size,), activation='relu'))
  model.add(tf.keras.layers.Dense(80, activation='relu'))
  model.add(tf.keras.layers.Dense(140, activation='relu'))
  model.add(tf.keras.layers.Dense(num_tags, activation='sigmoid'))


  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

In [66]:
model = create_model(VOCAB_SIZE, num_tags)
model.summary()

# Train and evaluate the model
model.fit(body_train, train_tags, epochs=3, batch_size=2048, validation_split=0.1)
print('Eval loss/accuracy:{}'.format(
model.evaluate(body_test, test_tags, batch_size=2048)))

# Export the model to a file
model.save('keras_saved_model.h5')

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_40 (Dense)             (None, 50)                10050     
_________________________________________________________________
dense_41 (Dense)             (None, 80)                4080      
_________________________________________________________________
dense_42 (Dense)             (None, 140)               11340     
_________________________________________________________________
dense_43 (Dense)             (None, 100)               14100     
Total params: 39,570
Trainable params: 39,570
Non-trainable params: 0
_________________________________________________________________
Train on 432000 samples, validate on 48000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Eval loss/accuracy:[0.059188420738776525, 0.98355633]



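### Performance on Test Set</br>

 - The model performs well on the validation split used during training as well as on the held-out test set.</br>
 - Test accuracy ≈ 98.4% (loss ≈ 0.059). With multi-hot targets this accuracy is computed per label, so it is dominated by the many tags that are correctly predicted as absent and should be read with care.</br>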
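
A toy calculation shows why per-label accuracy alone can look better than the model really is. It assumes that the reported metric is per-label binary accuracy, which is what `metrics=['accuracy']` typically resolves to with a binary cross-entropy loss. The labels below are made up:

```python
# Toy example: 100 possible tags, of which this question truly has 3.
num_tags = 100
y_true = [1, 1, 1] + [0] * (num_tags - 3)

# A useless model that never predicts any tag.
y_pred = [0] * num_tags

# Per-label accuracy: fraction of the 100 positions that match.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / num_tags

# Recall: fraction of the true tags that were actually predicted.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy)  # 0.97
print(recall)    # 0.0
```

Reporting precision, recall or F1 over the predicted tags alongside accuracy would give a clearer picture of tagging quality.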


In [67]:
print('Eval loss/accuracy:{}'.format(
model.evaluate(body_test, test_tags, batch_size=2048)))

Eval loss/accuracy:[0.059188420738776525, 0.98355633]


## 7. Predictions

In [74]:
%%writefile model_prediction.py
import pickle
import os
import numpy as np

class CustomModelPrediction(object):

  def __init__(self, model, processor):
    self._model = model
    self._processor = processor
  
  def predict(self, instances, **kwargs):
    preprocessed_data = self._processor.transform_text(instances)
    predictions = self._model.predict(preprocessed_data)
    return predictions.tolist()

  @classmethod
  def from_path(cls, model_dir):
    import tensorflow.keras as keras
    model = keras.models.load_model(
      os.path.join(model_dir,'keras_saved_model.h5'))
    with open(os.path.join(model_dir, 'processor_state.pkl'), 'rb') as f:
      processor = pickle.load(f)

    return cls(model, processor)

Overwriting model_prediction.py


### Write the Question in the test_requests Variable

In [0]:
test_requests = [
  "Change the bar item name in Pandas I have a test excel file like: df = pd.DataFrame({'name':list('abcdefg'), 'age':[10,20,5,23,58,4,6]}) print (df) name  age 0    a   10 1    b   20 2    c    5 3    d   23 4    e   58 5    f    4 6    g    6 I use Pandas and matplotlib to read and plot it: import pandas as pd import numpy as np import matplotlib.pyplot as plt import os excel_file = 'test.xlsx' df = pd.read_excel(excel_file, sheet_name=0) df.plot(kind='bar') plt.show() the result shows: enter image description here it use index number as item name, how can I change it to the name, which stored in column name?"
]

In [79]:
from model_prediction import CustomModelPrediction

classifier = CustomModelPrediction.from_path('.')
results = classifier.predict(test_requests)
#print(results)

for i in range(len(results)):
  print('Predicted labels:')
  for idx,val in enumerate(results[i]):
    if val > 0.4:
      print(tag_encoder.classes_[idx])
      #print(val)
  print('\n')

Predicted labels:
python




# Hardware and Software Used:


*   GPU: Tesla T4

*   RAM: 25.81 GB
*   Software: Pandas, Keras, TensorFlow, scikit-learn, NLTK, pickle





In [54]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 5199780674223167954, name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 2880498197868775359
 physical_device_desc: "device: XLA_CPU device", name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 7862391114126866291
 physical_device_desc: "device: XLA_GPU device", name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 14912199066
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 16676038523866578715
 physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"]