<a href="https://colab.research.google.com/github/Mohit-Patil/Taxonomy-Creation/blob/master/Taxonomy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **-----TAXONOMY CREATION-----**

## 1. Importing Libraries.

In [0]:
import pandas as pd
import collections
import random
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import re
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils import shuffle
import tensorflow as tf 
import nltk
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True


## 2. Connecting Google Drive to Save Processed Datasets

* Used to store the data and the model on Google Drive.
* The data can be recovered if the runtime crashes.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 3. Downloading Dataset from Kaggle.

* Instead of downloading the dataset locally, we download it directly to the Colab runtime, which makes use of Colab's network speed and avoids local storage constraints.
1. Install Kaggle API.
2. Download the Dataset from the Competition list.
3. Unzip the Train and Test Data.

In [0]:
!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c facebook-recruiting-iii-keyword-extraction
!unzip Train.zip
!unzip Test.zip

Saving kaggle.json to kaggle.json
Downloading Train.zip to /content
100% 2.18G/2.19G [00:20<00:00, 169MB/s]
100% 2.19G/2.19G [00:20<00:00, 113MB/s]
Downloading Test.zip to /content
 98% 707M/725M [00:04<00:00, 157MB/s]
100% 725M/725M [00:04<00:00, 161MB/s]
Downloading SampleSubmission.csv to /content
 83% 65.0M/78.7M [00:00<00:00, 81.0MB/s]
100% 78.7M/78.7M [00:00<00:00, 160MB/s] 
Archive:  Train.zip
  inflating: Train.csv               
Archive:  Test.zip
  inflating: Test.csv                



## 4. Preprocessing and Cleaning the Data

### Dataset Description
- **Id** - Unique identifier for each question.
- **Title** - The question's title.
- **Body** - The body of the question.
- **Tags** - The tags associated with the question.

- *Size(Compressed)*: 2.19GB
- *Size(Actual)*: 6.76GB
- *No. of Rows*: 6034195 
- *No. of Columns*: 4


### Dataset Link
Dataset: [Facebook Recruiting III - Keyword Extraction](https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data) competition on Kaggle.



### Loading Datasets into Dataframe

 - Using Pandas Library


In [0]:
df = pd.read_csv("Train.csv")
print("Dimensions of Data:",df.shape)

Dimensions of Data: (6034195, 4)



### 4.1 Deleting Missing data from the Rows

 - Using Inbuilt Functions of Pandas library.


In [0]:
df.dropna(inplace = True)
df.reset_index(inplace = True)
df.head()

Unnamed: 0,index,Id,Title,Body,Tags
0,0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,2,3,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,4,5,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents


In [0]:
print("Dimensions of Data after Removing Null Rows:",df.shape)

Dimensions of Data after Removing Null Rows: (6034187, 5)



### 4.2: Deleting Unnecessary Columns.

 - The columns "Id" and "index" are of no use.
 - Removing them using drop(...) function of Pandas Library


In [0]:
df.drop(columns = ['Id','index'],inplace = True)
df.head()

Unnamed: 0,Title,Body,Tags
0,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents


In [0]:
print("Dimensions of Data After Dropping Unnecessary Columns:",df.shape)

Dimensions of Data After Dropping Unnecessary Columns: (6034187, 3)



### 4.3: Removing Duplicates.

 - The dataset contains duplicate rows.
 - We remove duplicates based only on the "Body" column, because many data points may share the same "Title" while having a different "Body".
 - Likewise, two questions may have the same "Tags", so we do not deduplicate on the "Tags" column.
 - After removing duplicates, we reset the index values.


In [0]:
df.drop_duplicates(subset='Body',keep = 'first',inplace=True)

In [0]:
print("Dimensions of Data After Removing Duplicates",df.shape)

Dimensions of Data After Removing Duplicates (4154374, 3)
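To illustrate why deduplicating on "Body" alone keeps rows that share a title, here is a minimal sketch on a toy DataFrame. The three rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows: two share a Title, two share a Body
toy = pd.DataFrame({
    "Title": ["Same title", "Same title", "Other title"],
    "Body": ["first body", "second body", "first body"],
    "Tags": ["php", "php", "java"],
})

# Keep the first row for each distinct Body, then renumber the rows
deduped = toy.drop_duplicates(subset="Body", keep="first").reset_index(drop=True)
print(deduped)
```

Both rows titled "Same title" survive because their bodies differ, while the third row is dropped because its body repeats the first one.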


In [0]:
df.reset_index(inplace = True)
df.head()

Unnamed: 0,index,Title,Body,Tags
0,0,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,1,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,2,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,3,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,4,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents


### 4.4: Calculating Total Tags Present

In [0]:
Available_tags = [tag for slist in df["Tags"].values for tag in slist.split()]
print("Total Tags Present",len(Available_tags))
print("Average Number of Tags Present per Row",float(len(Available_tags))/df.shape[0])

Total Tags Present 12030708
Average Number of Tags Present per Row 2.8959135600213175



### 4.5: Frequently Occurring Tags

 - List of the tags that occur most often.
 - Freq_tags : *size* = 200.
 - We keep only the 200 most frequent tags to limit the data size.


In [0]:
Counter = collections.Counter(Available_tags)
Counter = { x:y for x, y in sorted(Counter.items(), key=lambda x: x[1], reverse=True) }
Freq_tags = set(list(Counter.keys())[:200])

In [0]:
print(Freq_tags)

{'java', 'jquery', 'sql-server', 'r', 'sharepoint', 'ipad', 'sqlite', '.net', 'vim', 'mysql', 'svn', 'matlab', 'c', 'session', 'node.js', 'real-analysis', 'android', 'servlets', 'cakephp', 'spring', 'delphi', 'generics', 'zend-framework', 'pointers', 'wpf', 'web-services', 'redirect', 'powershell', 'ios5', 'sockets', 'django', 'iphone', 'java-ee', 'windows', 'jsp', 'ajax', 'iis', 'c++', 'phonegap', 'eclipse', 'apache', 'wordpress', 'visual-studio-2010', 'multithreading', 'mvc', 'facebook', 'ms-access', 'gwt', 'algorithm', 'bash', 'google-chrome', 'table', 'logging', 'visual-c++', 'asp.net-mvc', 'winforms', 'windows-8', 'dom', 'debugging', 'networking', 'tsql', 'performance', 'optimization', 'exception', 'http', 'apache2', 'javascript', 'drupal', 'perl', 'html5', 'winapi', 'google-maps', 'vb.net', 'git', 'jquery-mobile', 'tomcat', 'database-design', 'calculus', 'wcf', 'permissions', 'database', 'flex', 'class', 'linux', 'objective-c', 'validation', 'url', 'asp.net', 'jquery-ui', 'image'
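The same top-k selection can be sketched with `Counter.most_common`, which returns the tags sorted by count. The helper name `top_k_tags` and the sample rows are assumptions made for this illustration. On ties, the tags chosen may differ from those picked by the cell above.

```python
from collections import Counter

def top_k_tags(tag_strings, k):
    """Return the set of the k most frequent tags in space-separated tag strings."""
    counts = Counter(tag for row in tag_strings for tag in row.split())
    return {tag for tag, _ in counts.most_common(k)}

# Hypothetical sample: php and java occur three times each, the rest once
rows = ["php mysql", "php", "java android", "java", "php java"]
top2 = top_k_tags(rows, 2)
print(top2)
```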


### 4.6: Storing Indices of the Rows which Contain Frequent Tags

 - We store the indices of the data points whose tags all belong to the frequent tags.
 - We do this by checking, for each data point, whether its set of tags is a subset of Freq_tags, and if so storing the data point's index in Sample_Index.


In [0]:
Sample_Index = []
for data in range(0,df.shape[0]):
  tags = set(df["Tags"][data].split())
  if tags.issubset(Freq_tags):
    Sample_Index.append(data)

In [0]:
print("Number of Rows which contain Frequent Tags",len(Sample_Index))

Number of Rows which contain Frequent Tags 927034
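The subset check keeps a row only when every one of its tags is frequent, so a single rare tag excludes the whole row. A minimal sketch of that behaviour, using a hypothetical helper name and made-up tags:

```python
def rows_with_only_frequent_tags(tag_strings, frequent):
    """Return positions of rows whose tags are all contained in `frequent`."""
    return [
        i for i, row in enumerate(tag_strings)
        if set(row.split()).issubset(frequent)
    ]

# Hypothetical data: 'haskell' is not in the frequent set
rows = ["php mysql", "php", "java haskell", "java php"]
frequent = {"php", "java", "mysql"}
kept = rows_with_only_frequent_tags(rows, frequent)
print(kept)
```

Row 2 is dropped even though "java" is frequent, because "haskell" is not.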



###  Randomly Choosing Indices with Frequent Tags

 - Sample_Index now contains indices of *927034* data points.
 - So we choose *700000* random data points out of those.


In [0]:
Pre_index = random.sample(Sample_Index,k = 700000)


### 4.7: Choosing Data with the above Randomly Chosen Indices

 - Now we select the data points which match the indices in the above randomly fetched indices which have frequent tags.
 - Using iloc(..) function present in pandas Library.
 - Then, we reset the indices and drop the unwanted "index" column.


In [0]:
df_Final = df.iloc[Pre_index, :]

In [0]:
df_Final.head()

Unnamed: 0,index,Title,Body,Tags
449047,461556,Getting a word predecent a regular expression ...,<p>I want to evaluate an expression similar to...,regex
3706205,5062026,Previous link remain not highlighted,<p>I am using an sqlite database in my app. Th...,iphone ios
2308466,2724499,Numberguessing class and client,<p>I am trying to code a number guessing class...,java homework
1144782,1232871,Why do reads in MongoDB sometimes wait for lock?,"<p>While using db.currentOp(), I sometimes see...",mongodb
2820083,3490910,How can i send just two parameters to a web se...,<p>The application is based on GPS tracking wh...,java android xml json


In [0]:
df_Final.reset_index(inplace=True)

In [0]:
df_Final.drop(columns = ['level_0','index'],inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
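The warning above appears because df_Final was selected out of df, and pandas cannot be sure whether later in-place changes are meant to reach the original frame. One common way to avoid it is to take an explicit copy when selecting. The snippet is a sketch on toy data, not a rerun of the notebook; the row values and the `picked` list are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the de-duplicated frame
df = pd.DataFrame({"Body": ["a", "b", "c"], "Tags": ["x y", "y", "z"]})
picked = [2, 0]  # stand-in for the randomly sampled indices

# .copy() makes the selection an independent frame, so later edits are unambiguous
subset = df.iloc[picked, :].copy()
subset = subset.reset_index(drop=True)  # drop=True avoids a leftover 'index' column
subset["Tags"] = [t.replace(" ", ",") for t in subset["Tags"]]
print(subset)
```

Using `reset_index(drop=True)` also removes the need for a separate step that drops the old index columns afterwards.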


In [0]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,Getting a word predecent a regular expression ...,<p>I want to evaluate an expression similar to...,regex
1,Previous link remain not highlighted,<p>I am using an sqlite database in my app. Th...,iphone ios
2,Numberguessing class and client,<p>I am trying to code a number guessing class...,java homework
3,Why do reads in MongoDB sometimes wait for lock?,"<p>While using db.currentOp(), I sometimes see...",mongodb
4,How can i send just two parameters to a web se...,<p>The application is based on GPS tracking wh...,java android xml json



### 4.8: Separating Tags in a row

 - We convert the space-separated tags of each data point to comma-separated tags.
 > Example: "c++ clion array" now becomes "c++,clion,array"




In [0]:
Tags_Sep = []
for tags in range(0,df_Final.shape[0]):
  Tags_Sep.append(df_Final['Tags'][tags].replace(" ",","))
  if tags % 100000 == 0:
    print(tags)
tags_split = [Tags.split(",") for Tags in Tags_Sep]

0
100000
200000
300000
400000
500000
600000


In [0]:
df_Final["Tags"] = Tags_Sep
Tags_Sep.clear()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [0]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,Getting a word predecent a regular expression ...,<p>I want to evaluate an expression similar to...,regex
1,Previous link remain not highlighted,<p>I am using an sqlite database in my app. Th...,"iphone,ios"
2,Numberguessing class and client,<p>I am trying to code a number guessing class...,"java,homework"
3,Why do reads in MongoDB sometimes wait for lock?,"<p>While using db.currentOp(), I sometimes see...",mongodb
4,How can i send just two parameters to a web se...,<p>The application is based on GPS tracking wh...,"java,android,xml,json"



### 4.9: Removing Html Tags From the Body Column

 - HTML tags are removed because they add noise to the data and can hurt model performance.
 - They are removed with a regular expression, using Python's 're' module.


In [0]:
import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext
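As a quick check of what the pattern removes, the sketch below applies the same regular expression to a made-up string. The sample inputs are assumptions. Note that `.` does not match newlines here, so a tag that spans several lines would be left in place.

```python
import re

# Same pattern as in cleanhtml: HTML tags, plus named or numeric character entities
HTML_OR_ENTITY = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    """Strip tags and entities, leaving the surrounding text untouched."""
    return HTML_OR_ENTITY.sub('', raw_html)

sample = "<p>use <code>a &lt; b</code> here</p>"
cleaned = cleanhtml(sample)
print(repr(cleaned))
```

The entity `&lt;` is deleted rather than decoded, which leaves a double space between "a" and "b".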

In [0]:
Body_Final = []
Title_Final = []
Tags_Final = []
for b in range(0,df_Final.shape[0]):
    Title_Final.append(cleanhtml(df_Final["Title"][b].lower()))
    Tags_Final.append(cleanhtml(df_Final["Tags"][b].lower()))
    Body_Final.append(cleanhtml(df_Final["Body"][b].lower()))
    if b % 100000 == 0:
      print(b)
df_Final["Body"] = Body_Final
df_Final["Tags"] = Tags_Final
df_Final["Title"] = Title_Final
Body_Final.clear()
Title_Final.clear()
Tags_Final.clear()

0
100000
200000
300000
400000
500000
600000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':


In [0]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,getting a word predecent a regular expression ...,i want to evaluate an expression similar to th...,regex
1,previous link remain not highlighted,i am using an sqlite database in my app. the a...,"iphone,ios"
2,numberguessing class and client,i am trying to code a number guessing class an...,"java,homework"
3,why do reads in mongodb sometimes wait for lock?,"while using db.currentop(), i sometimes see op...",mongodb
4,how can i send just two parameters to a web se...,the application is based on gps tracking where...,"java,android,xml,json"



### 4.10: Concatenating Title And Body Into One Column

 - Joining both "Title" and "Body" Columns of every data point to form a single Column.
 - Also converting them to lowercase to maintain uniformity.


In [0]:
Body_Final = []
for b in range(0,df_Final.shape[0]):
    Body_Final.append(cleanhtml(df_Final["Title"][b].lower()) + " " + df_Final["Body"][b].lower())
    if b % 100000 == 0:
      print(b)
df_Final["Body"] = Body_Final
Body_Final.clear()

0
100000
200000
300000
400000
500000
600000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [0]:
df_Final.head()

Unnamed: 0,Title,Body,Tags
0,getting a word predecent a regular expression ...,getting a word predecent a regular expression ...,regex
1,previous link remain not highlighted,previous link remain not highlighted i am usin...,"iphone,ios"
2,numberguessing class and client,numberguessing class and client i am trying to...,"java,homework"
3,why do reads in mongodb sometimes wait for lock?,why do reads in mongodb sometimes wait for loc...,mongodb
4,how can i send just two parameters to a web se...,how can i send just two parameters to a web se...,"java,android,xml,json"


In [0]:
df_Final.drop(columns= 'Title',inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [0]:
df_Final.head()

Unnamed: 0,Body,Tags
0,getting a word predecent a regular expression ...,regex
1,previous link remain not highlighted i am usin...,"iphone,ios"
2,numberguessing class and client i am trying to...,"java,homework"
3,why do reads in mongodb sometimes wait for loc...,mongodb
4,how can i send just two parameters to a web se...,"java,android,xml,json"



### 4.11: Removing Stop Words and Frequent Tags from The Body Column

 - Removing stop words such as I, am, he, was, ... from the "Body" of each data point, as they add noise and can cause the model to overfit.
 - We also remove the frequent tags so that the model does not simply learn to match tag names that appear in the text.
 - The reason is that some questions mention their tags in the question text while others do not, so a model that relies on those words would handle the two kinds of question inconsistently.
 - Using the NLTK library, with its tokenizer and stop-word list.


In [0]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

alphabets = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','[',']','{','}','?','.'}

stop_words = set(stopwords.words('english'))

stop_words = stop_words.union(Freq_tags)

stop_words = stop_words.union(alphabets)

Body_Final = []
#print(stop_words)

for b in range(0,df_Final.shape[0]):
    word_tokens = word_tokenize(df_Final["Body"][b])
    # Drop stop words, frequent tags, and the single-character tokens collected in stop_words
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    Body_Final.append(" ".join(filtered_sentence))
    if b % 100000 == 0:
      print(b)

0
100000
200000
300000
400000
500000
600000
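The filtering idea can be sketched without NLTK. In this simplified illustration, whitespace splitting stands in for `word_tokenize`, and a tiny hand-written stop-word set stands in for NLTK's English list. Both substitutions, along with the sample sentence, are assumptions made only for the example.

```python
def remove_stop_words(text, stop_words):
    """Keep only the whitespace-separated tokens that are not in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

english_stop_words = {"i", "am", "to", "a", "in", "the", "how"}  # hypothetical tiny list
frequent_tags = {"java", "php"}                                   # hypothetical frequent tags
stop_words = english_stop_words | frequent_tags

cleaned = remove_stop_words("how to parse a date in java", stop_words)
print(cleaned)
```

Because the tag name "java" is in the combined set, it disappears from the text along with the ordinary stop words.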


In [0]:
df_Final["Body"] = Body_Final

In [0]:
df_Final = shuffle(df_Final, random_state = 40)

In [0]:
# Renumber the shuffled rows 0..n-1 without keeping the old index as a column
df_Final = df_Final.reset_index(drop=True)
df_Final.head()

Unnamed: 0,Body,Tags
0,"detect log outs seen , 5 open tabs , `` log ''...","php,facebook"
1,good looking web application developer way cre...,java
2,get profile 's outlook plugin called xobni rea...,facebook
3,synchronize data server local made application...,"ios,objective-c"
4,get first value comma separated 'm looking qui...,"php,arrays"


In [0]:
df_Final.to_csv("Train_Processed.csv", index=False)
df_Final.head()

Unnamed: 0,Body,Tags
0,"detect log outs seen , 5 open tabs , `` log ''...","php,facebook"
1,good looking web application developer way cre...,java
2,get profile 's outlook plugin called xobni rea...,facebook
3,synchronize data server local made application...,"ios,objective-c"
4,get first value comma separated 'm looking qui...,"php,arrays"


In [0]:
df_Final[:-1]


Unnamed: 0,Body,Tags
0,"detect log outs seen , 5 open tabs , `` log ''...","php,facebook"
1,good looking web application developer way cre...,java
2,get profile 's outlook plugin called xobni rea...,facebook
3,synchronize data server local made application...,"ios,objective-c"
4,get first value comma separated 'm looking qui...,"php,arrays"
5,prevent application freezing - ( void ) test i...,"cocoa,osx"
6,"able process response getting back , need disp...","jquery,ajax,json"
7,# template get panel parent form form created ...,"c#,winforms"
8,fix font issue ( hello.html ) like bellow xmln...,html
9,wrapper method block want use combination meth...,ruby


In [0]:
df_Final = pd.read_csv("Train_Processed.csv")


## 5. Data Preparation
 - Now that the data is cleaned, we separate the dependent column ("Tags") from the independent column ("Body").


### 5.1 Encoding Tags

 - We now encode the top tags as multi-hot arrays.
 - We first split the tag string of each data point and then multi-hot encode the result using MultiLabelBinarizer() from the sklearn library.


In [0]:
tags_split = [tags.split(',') for tags in df_Final['Tags'].values]
tags_split[0:10]

[['php', 'facebook'],
 ['java'],
 ['facebook'],
 ['ios', 'objective-c'],
 ['php', 'arrays'],
 ['cocoa', 'osx'],
 ['jquery', 'ajax', 'json'],
 ['c#', 'winforms'],
 ['html'],
 ['ruby']]

In [0]:
tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)
num_tags = len(tags_encoded[0])
print(df_Final['Body'].values[0])
print(tag_encoder.classes_)
print(tags_encoded[0])

detect log outs seen , 5 open tabs , `` log '' one go another tab , detects 're logged inform us user login page working want make thing web app php/jquery need hint ... thanks advance
['.htaccess' '.net' 'actionscript-3' 'ajax' 'algorithm' 'android'
 'android-layout' 'animation' 'apache' 'apache2' 'api' 'arrays' 'asp.net'
 'asp.net-mvc' 'asp.net-mvc-3' 'audio' 'authentication' 'bash' 'c' 'c#'
 'c#-4.0' 'c++' 'caching' 'cakephp' 'calculus' 'class' 'cocoa'
 'cocoa-touch' 'codeigniter' 'command-line' 'core-data' 'css' 'css3'
 'database' 'database-design' 'date' 'datetime' 'debugging' 'delphi'
 'design' 'design-patterns' 'django' 'dns' 'dom' 'drupal' 'eclipse'
 'email' 'entity-framework' 'events' 'excel' 'exception' 'facebook'
 'facebook-graph-api' 'file' 'firefox' 'flash' 'flex' 'forms' 'function'
 'generics' 'git' 'google' 'google-app-engine' 'google-chrome'
 'google-maps' 'grails' 'gui' 'gwt' 'haskell' 'hibernate' 'homework'
 'html' 'html5' 'http' 'iis' 'image' 'internet-explorer' 'ios
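To make the multi-hot encoding concrete, here is a pure-Python sketch of the idea: collect the sorted set of all tags, then mark with a 1 each position whose tag is present. It mirrors what MultiLabelBinarizer computes conceptually but is not its implementation. The helper names are hypothetical.

```python
def fit_classes(tag_lists):
    """Sorted list of every distinct tag, one output column per tag."""
    return sorted({tag for tags in tag_lists for tag in tags})

def multi_hot(tags, classes):
    """1 where the class is among this row's tags, 0 elsewhere."""
    present = set(tags)
    return [1 if c in present else 0 for c in classes]

tag_lists = [["php", "facebook"], ["java"], ["ios", "objective-c"]]
classes = fit_classes(tag_lists)
encoded = [multi_hot(tags, classes) for tags in tag_lists]
print(classes)
print(encoded)
```

Each row can contain several 1s, which is what distinguishes multi-hot from one-hot encoding.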

### 5.2: Splitting Prepared Data into Train and Test Sets

 - We now split the data into a train set and a test set using an 80/20 split. A further 10% of the train set is held out for validation during training (validation_split=0.1).


In [0]:
train_size = int(len(df_Final) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(df_Final) - train_size))

Train size: 560000
Test size: 140000


In [0]:
train_tags = tags_encoded[:train_size]
test_tags = tags_encoded[train_size:]

### 5.3: Tokenizing and Transforming

 - First, the "Body" of each data point is tokenized and then transformed into a row of 0s and 1s, where each position indicates whether the corresponding vocabulary word (bag of words) occurs in that text.


In [0]:
%%writefile preprocess.py
from tensorflow.keras.preprocessing import text

class TextPreprocessor(object):
  def __init__(self, vocab_size):
    self._vocab_size = vocab_size
    self._tokenizer = None
  
  def create_tokenizer(self, text_list):
    tokenizer = text.Tokenizer(num_words=self._vocab_size)
    tokenizer.fit_on_texts(text_list)
    #print(tokenizer.word_index)
    self._tokenizer = tokenizer

  def transform_text(self, text_list):
    text_matrix = self._tokenizer.texts_to_matrix(text_list)
    return text_matrix
  

Overwriting preprocess.py


In [0]:
from preprocess import TextPreprocessor

VOCAB_SIZE = 500

train_qs = df_Final['Body'].values[:train_size]
test_qs = df_Final['Body'].values[train_size:]

processor = TextPreprocessor(VOCAB_SIZE)
processor.create_tokenizer(train_qs)

body_train = processor.transform_text(train_qs)
body_test = processor.transform_text(test_qs)
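The transformation can be pictured with a simplified pure-Python sketch: rank words by frequency starting at index 1, keep only indices below the vocabulary size, and set a 1 for each kept word that occurs in a text. This is an illustration of the idea, not the Keras Tokenizer implementation. It omits Keras's lowercasing and punctuation filtering, and the function names and sample texts are assumptions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank; rank 1 is the most frequent, 0 is unused."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: rank for rank, (w, _) in enumerate(counts.most_common(), start=1)}

def to_binary_matrix(texts, word_index, vocab_size):
    """One row per text; position i is 1.0 if the word with index i occurs in it."""
    matrix = []
    for t in texts:
        row = [0.0] * vocab_size
        for w in t.split():
            idx = word_index.get(w)
            if idx is not None and idx < vocab_size:  # rarer words fall outside the matrix
                row[idx] = 1.0
        matrix.append(row)
    return matrix

texts = ["get data value get data value get data", "get error"]
word_index = build_word_index(texts)
matrix = to_binary_matrix(texts, word_index, vocab_size=4)
print(word_index)
print(matrix)
```

With a vocabulary size of 4, the least frequent word ("error") is ignored, which parallels how VOCAB_SIZE = 500 limits the features in the notebook.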

In [0]:
print(len(body_train[0]))
print(train_qs[0])
print(body_train[0])

500
detect log outs seen , 5 open tabs , `` log '' one go another tab , detects 're logged inform us user login page working want make thing web app php/jquery need hint ... thanks advance
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0

In [0]:
from tensorflow.keras.preprocessing import text
testizer = text.Tokenizer(500)
testizer.fit_on_texts(train_qs)
testizer.word_index

{"''": 1,
 "'": 2,
 '0': 3,
 '1': 4,
 'id': 5,
 'name': 6,
 'new': 7,
 '2': 8,
 'code': 9,
 'get': 10,
 'data': 11,
 'using': 12,
 'like': 13,
 'value': 14,
 "n't": 15,
 'div': 16,
 'text': 17,
 'want': 18,
 "'m": 19,
 'user': 20,
 'type': 21,
 "'s": 22,
 'error': 23,
 'public': 24,
 '3': 25,
 'int': 26,
 'would': 27,
 'return': 28,
 'page': 29,
 'use': 30,
 'one': 31,
 'var': 32,
 'array': 33,
 'class': 34,
 'com': 35,
 'way': 36,
 'select': 37,
 'need': 38,
 'server': 39,
 'set': 40,
 'java': 41,
 'null': 42,
 'form': 43,
 'system': 44,
 'add': 45,
 'thanks': 46,
 'method': 47,
 'view': 48,
 'script': 49,
 'problem': 50,
 '4': 51,
 'application': 52,
 'a': 53,
 'void': 54,
 'input': 55,
 'time': 56,
 'help': 57,
 '5': 58,
 'button': 59,
 'this': 60,
 'app': 61,
 'know': 62,
 'td': 63,
 'content': 64,
 'create': 65,
 'width': 66,
 'true': 67,
 'end': 68,
 'post': 69,
 'trying': 70,
 'i': 71,
 'work': 72,
 'first': 73,
 'following': 74,
 'example': 75,
 '10': 76,
 'e': 77,
 'line': 78,

Once all the data is converted and transformed, we save the fitted preprocessor so that the same tokenizer can be reused at prediction time.

## 6. Building and Training the Model

In [0]:
import pickle

with open('/content/processor_state.pkl', 'wb') as f:
  pickle.dump(processor, f)


### Model

 - The model is a sequential model made up of the following layers:
     - Input layer (shape = 500, the vocabulary size)
     - Dense / fully connected layer (480 units)
     - Dense / fully connected layer (360 units)
     - Dense / fully connected layer (320 units)
     - Output layer (200 units, one per tag)

 - The three hidden layers use the ReLU activation function, whereas the output layer uses the sigmoid activation function to indicate whether each tag applies.
 - Sigmoid outputs a value in [0, 1] for each tag independently, which indicates how strongly or weakly that tag is related to the question.
 - Built with TensorFlow and Keras.


In [0]:
def create_model(vocab_size, num_tags):
  
  model = tf.keras.models.Sequential()
  # Use the vocab_size argument rather than the global VOCAB_SIZE
  model.add(tf.keras.layers.Dense(480, input_shape=(vocab_size,), activation='relu'))
  model.add(tf.keras.layers.Dense(360, activation='relu'))
  model.add(tf.keras.layers.Dense(320, activation='relu'))
  model.add(tf.keras.layers.Dense(num_tags, activation='sigmoid'))


  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

In [0]:
model = create_model(VOCAB_SIZE, num_tags)
model.summary()

# Train and evaluate the model
model.fit(body_train, train_tags, epochs=3, batch_size=256, validation_split=0.1)
print('Eval loss/accuracy:{}'.format(
model.evaluate(body_test, test_tags, batch_size=256)))

# Export the model to a file
model.save('keras_saved_model.h5')

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 480)               240480    
_________________________________________________________________
dense_1 (Dense)              (None, 360)               173160    
_________________________________________________________________
dense_2 (Dense)              (None, 320)               115520    
_________________________________________________________________
dense_3 (Dense)              (None, 200)               64200     
Total params: 593,360
Trainable params: 593,360
Non-trainable params: 0
_________________________________________________________________
Train on 504000 samples, validate on
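The parameter counts in the summary can be checked by hand: a dense layer with n inputs and m units has n × m weights plus m biases. The short sketch below redoes that arithmetic for the layer sizes reported above. The helper name is an assumption.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size followed by the four Dense layer widths from the summary
layer_sizes = [500, 480, 360, 320, 200]
per_layer = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

The per-layer values and the total agree with the "Param #" column and the "Total params" line of the summary.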


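### Performance on the Test Set

 - The model performs well on the validation split during training as well as on the held-out test set.
 - Test accuracy ≈ 99% (0.9918 in the evaluation below). Keras reports this as per-label binary accuracy, and since each question carries only a few of the 200 tags, most labels are 0, which tends to make this figure high.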


In [0]:
print('Eval loss/accuracy:{}'.format(
model.evaluate(body_test, test_tags, batch_size=2048)))

Eval loss/accuracy:[0.028339047020248004, 0.9917963]
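When reading this number, it helps to know what a trivial baseline would score. Assuming the accuracy is computed per label position (which is how Keras typically treats 'accuracy' with a binary cross-entropy loss), a model that predicts no tags at all is already right on every 0 label. The sketch below uses synthetic labels with exactly three tags per question out of 200. The data is made up and only illustrates the arithmetic.

```python
def binary_accuracy(y_true, y_pred):
    """Fraction of individual label positions where prediction equals truth."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return correct / total

num_tags = 200
# Synthetic labels: question i carries tags i, i+1 and i+2
y_true = [[1 if i <= j <= i + 2 else 0 for j in range(num_tags)] for i in range(10)]
y_pred = [[0] * num_tags for _ in range(10)]  # baseline that never predicts a tag

baseline = binary_accuracy(y_true, y_pred)
print(baseline)
```

Under these assumptions the empty baseline reaches 98.5%, so per-tag precision and recall would be a more informative complement to the accuracy figure.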


## 7. Predictions

In [0]:
%%writefile model_prediction.py
import pickle
import os
import numpy as np

class CustomModelPrediction(object):

  def __init__(self, model, processor):
    self._model = model
    self._processor = processor
  
  def predict(self, instances, **kwargs):
    preprocessed_data = self._processor.transform_text(instances)
    predictions = self._model.predict(preprocessed_data)
    return predictions.tolist()

  @classmethod
  def from_path(cls, model_dir):
    import tensorflow.keras as keras
    model = keras.models.load_model(
      os.path.join(model_dir,'keras_saved_model.h5'))
    with open(os.path.join(model_dir, 'processor_state.pkl'), 'rb') as f:
      processor = pickle.load(f)

    return cls(model, processor)

Writing model_prediction.py


### Write the Question in the test_requests Variable

In [0]:
test_requests = ["enter image description hererecently I had this problem while starting a new project, and while working on dependencies I had to import DaggerApplicationInjection class from java generated, but I can't .. has anyone an Idea how to resolve it or know what it is about ?"]

In [0]:
from model_prediction import CustomModelPrediction

classifier = CustomModelPrediction.from_path('/content')
results = classifier.predict(test_requests)
#print(results)

for i in range(len(results)):
  print('Predicted labels:')
  for idx,val in enumerate(results[i]):
    if val > 0.1:
      print(tag_encoder.classes_[idx])
      print(val)
  print('\n')

Predicted labels:
android
0.3953574299812317
eclipse
0.20225611329078674
java
0.6330957412719727
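Because each sigmoid output is an independent score, the predicted labels are simply the tags whose score exceeds a chosen threshold. The sketch below applies the same 0.1 threshold and additionally sorts the result by score. The helper name and the low "html" score are hypothetical, while the other three scores are copied from the output above.

```python
def labels_above_threshold(scores, classes, threshold=0.1):
    """Return (tag, score) pairs with score > threshold, highest score first."""
    picked = [(c, s) for c, s in zip(classes, scores) if s > threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)

classes = ["android", "eclipse", "html", "java"]
scores = [0.3953574299812317, 0.20225611329078674, 0.02, 0.6330957412719727]

ranked = labels_above_threshold(scores, classes)
for tag, score in ranked:
    print(tag, round(score, 3))
```

Raising the threshold trades recall for precision. For example, 0.5 would keep only "java" here.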




## Hardware and Software Used

*   GPU: Tesla T4
*   RAM: 25.81 GB
*   Software: Pandas, Keras, TensorFlow, scikit-learn, NLTK, pickle





In [0]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 8798907438853032785, name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 17654958511962880217
 physical_device_desc: "device: XLA_CPU device", name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 9744206611959780222
 physical_device_desc: "device: XLA_GPU device", name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 14912199066
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 12613631813494626021
 physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"]

In [0]:
!cp Train_Processed.csv /content/drive/My\ Drive/Syn

In [0]:
!cp processor_state.pkl /content/drive/My\ Drive/Syn

In [0]:
!cp keras_saved_model.h5 /content/drive/My\ Drive/Syn

In [0]:
!cp preprocess.py /content/drive/My\ Drive/Syn

In [0]:
!cp model_prediction.py /content/drive/My\ Drive/Syn

In [0]:

# Authenticate to your cloud account
from google.colab import auth
auth.authenticate_user()

In [0]:
%%writefile setup.py

from setuptools import setup

setup(
  name="so_predict",
  version="0.1",
  include_package_data=True,
  scripts=["preprocess.py", "model_prediction.py"]
)

Writing setup.py


In [0]:
## Replace this with the name of your Cloud Storage bucket

!gsutil cp keras_saved_model.h5 gs://stacktags
!gsutil cp processor_state.pkl gs://stacktags

Copying file://keras_saved_model.h5 [Content-Type=application/octet-stream]...
-
Operation completed over 1 objects/6.8 MiB.                                      
Copying file://processor_state.pkl [Content-Type=application/octet-stream]...
\
Operation completed over 1 objects/61.3 MiB.                                     


In [0]:
# Replace with your bucket name below
!python setup.py sdist
!gsutil cp ./dist/so_predict-0.1.tar.gz gs://stacktags/packages/so_predict-0.1.tar.gz

running sdist
running egg_info
writing so_predict.egg-info/PKG-INFO
writing dependency_links to so_predict.egg-info/dependency_links.txt
writing top-level names to so_predict.egg-info/top_level.txt
reading manifest file 'so_predict.egg-info/SOURCES.txt'
writing manifest file 'so_predict.egg-info/SOURCES.txt'

running check


creating so_predict-0.1
creating so_predict-0.1/so_predict.egg-info
copying files to so_predict-0.1...
copying model_prediction.py -> so_predict-0.1
copying preprocess.py -> so_predict-0.1
copying setup.py -> so_predict-0.1
copying so_predict.egg-info/PKG-INFO -> so_predict-0.1/so_predict.egg-info
copying so_predict.egg-info/SOURCES.txt -> so_predict-0.1/so_predict.egg-info
copying so_predict.egg-info/dependency_links.txt -> so_predict-0.1/so_predict.egg-info
copying so_predict.egg-info/top_level.txt -> so_predict-0.1/so_predict.egg-info
Writing so_predict-0.1/setup.cfg
Creating tar archive
removing 'so_predict-0.1' (and everything under it)
Copying file://./dist/s

In [0]:
!gcloud config set project decent-oxygen-242311

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud alpha survey



In [0]:
!gcloud ml-engine models create stack_tag_predict

Created ml engine model [projects/decent-oxygen-242311/models/stack_tag_predict].


In [0]:
# To use this custom code feature, fill out this form: bit.ly/cmle-custom-code-signup
!gcloud alpha ml-engine versions create v1 --model stack_tag_predict \
--origin=gs://stacktags/ \
--python-version=3.5 \
--runtime-version=1.13 \
--package-uris=gs://stacktags/packages/so_predict-0.1.tar.gz \
--prediction-class=model_prediction.CustomModelPrediction

ERROR: (gcloud.alpha.ml-engine.versions.create) Create Version failed. Bad model detected with error:  "Failed to load model: Unexpected error when loading the model: Unexpected keyword argument passed to optimizer: learning_rate (Error code: 0)"


In [0]:
tf.version  # returns the version module; tf.__version__ holds the version string

<module 'tensorflow._api.v1.version' from '/usr/local/lib/python3.6/dist-packages/tensorflow/_api/v1/version/__init__.py'>
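The failed `versions create` call and the version check above suggest a possible mismatch between the TensorFlow release that saved `keras_saved_model.h5` and the `--runtime-version=1.13` used for serving. This is an assumption; the notebook does not confirm the cause. A minimal sketch of a pre-deployment check follows. The helper names are hypothetical, and the version strings are illustrative rather than taken from this notebook.

```python
def major_minor(version):
    """Return the (major, minor) pair of a version string such as '1.13.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def matches_runtime(training_version, runtime_version):
    """True when the training TensorFlow and the serving runtime share major.minor."""
    return major_minor(training_version) == major_minor(runtime_version)


# In the notebook this could be called as matches_runtime(tf.__version__, "1.13")
# before creating the model version. The strings below are examples only.
print(matches_runtime("1.13.1", "1.13"))  # True
print(matches_runtime("1.14.0", "1.13"))  # False
```

If the check fails, one option is to retrain or re-save the model in an environment whose TensorFlow release matches the serving runtime.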

In [0]:
%%writefile predictions.txt
" want to create a list of 'Yes' buttons so I can loop through the list and click the 'Yes' button instead of targetting them individually. I'm getting an incorrect count when I try to add them to the list because I'm not sure how to ignore the empty columns within each section and each section has a different number of buttons. I came close to solving this by using two for loops but I'm not sure how to deal with the empty columns so it messes my loop up. Any tips or help is greatly appreciated. I'm quite new to the framework so learning how to solve something like this would be great for my development. I thought I could just search for all buttons and loop through them using indexing but that didn't seem to work:"

Overwriting predictions.txt


In [0]:

# Get predictions from our trained model
predictions = !gcloud ai-platform predict --model='stack_tag_predict' --text-instances=predictions.txt --version=v2
print(predictions)

['[[4.738569259643555e-06, 0.16678136587142944, 0.00750848650932312, 0.0014092028141021729, 0.0008029341697692871, 0.0008700191974639893, 1.8209218978881836e-05, 0.0008295774459838867, 7.596611976623535e-05, 1.4185905456542969e-05, 0.00038865208625793457, 0.01861727237701416, 0.047030627727508545, 0.008950591087341309, 0.004727780818939209, 3.269314765930176e-05, 0.00037792325019836426, 0.0001570582389831543, 0.000978320837020874, 0.6267008185386658, 0.028247088193893433, 0.026131629943847656, 0.0005345642566680908, 0.00012174248695373535, 7.987022399902344e-06, 0.024436771869659424, 0.0012642741203308105, 0.00024363398551940918, 0.00039643049240112305, 0.00017842650413513184, 0.00020182132720947266, 0.0005689859390258789, 0.00020197033882141113, 0.005464732646942139, 0.0018385052680969238, 9.763240814208984e-05, 0.0002669990062713623, 0.0012146234512329102, 0.002485036849975586, 0.005331218242645264, 0.014180600643157959, 0.001962631940841675, 7.420778274536133e-06, 0.0005295276641845

In [0]:
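import ast

#print(tag_encoder.classes_, '\n')
# Parse the CLI output with ast.literal_eval, which accepts only literals, instead of eval
for sigmoid_arr in ast.literal_eval(predictions[0]):
  #print(sigmoid_arr)
  # Print every tag whose predicted probability exceeds the 0.3 threshold
  for idx, probability in enumerate(sigmoid_arr):
    if probability > 0.3:
      print(tag_encoder.classes_[idx])
  print('\n')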

c#
entity-framework
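The thresholding step above can be wrapped in a small reusable function. This is a sketch under stated assumptions: `decode_tags` is a hypothetical helper, and the class list and scores below are toy values, not the notebook's real `tag_encoder.classes_` or model output.

```python
import ast


def decode_tags(raw_output, classes, threshold=0.3):
    """Map each row of sigmoid scores to the labels scoring above threshold."""
    rows = ast.literal_eval(raw_output)  # parses literals only, unlike eval()
    return [
        [classes[idx] for idx, p in enumerate(row) if p > threshold]
        for row in rows
    ]


# Toy values for illustration only.
toy_classes = ["android", "c#", "entity-framework", "python"]
toy_output = "[[0.01, 0.63, 0.31, 0.02]]"
print(decode_tags(toy_output, toy_classes))  # [['c#', 'entity-framework']]
```

Because the model is multi-label with independent sigmoid outputs, a fixed threshold can return zero, one, or several tags per question. The 0.3 cutoff is the one the notebook uses and can be tuned per use case.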


