<a href="https://colab.research.google.com/github/DGuilherme/Challenge3/blob/main/CH3_PatentRS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Challenge 3: Patent Recommender System


In [None]:
# Import section
%matplotlib inline

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random

# Upgrade gensim: the preinstalled Colab version (3.6.0) lacks the 4.x API (e.g. model.dv) used below
!pip install --upgrade gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/44/52/f1417772965652d4ca6f901515debcd9d6c5430969e8c02ee7737e6de61c/gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9MB)
     |████████████████████████████████| 23.9MB 5.8MB/s 
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.1


# Import Dataset

*Dataset composition*   

| Feature        | Description           |
| -------------- | --------------------- |
| ID             | The patent ID         |
| Title          | Patent Title          |
| Abstract       | Patent Abstract       |
| Classification | [Patent Classification](https://www.uspto.gov/web/patents/classification/selectnumwithtitle.htm) |

## How to export from database


```
db.10000.find(
  {'classes.FSC': {$exists: true},title: {$exists: true},'abstract': {$exists: true}},
  {'abstract': 1,title: 1,'classes.FSC': 1}
)
```


```
[
  {$sample: {size: 10000}},
  {$project: {_id: {$toString: "$_id"}, abstract: 1, title: 1, "classes.FSC": 1}},
  {$match: {"classes": {"$exists": true}, title: {"$exists": true}, abstract: {"$exists": true}}}
]
```
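
For readers who drive the export from Python, the pipeline above can be written as plain dictionaries. This is a minimal sketch, not the notebook's own export code: `build_export_pipeline` is a hypothetical helper, and it deliberately places `$match` before `$sample` on the assumption that the goal is a full-size sample of complete documents (in the pipeline above, filtering after sampling can return fewer documents than requested).

```python
def build_export_pipeline(sample_size=10000):
    """Return an aggregation pipeline as plain Python dicts (hypothetical helper)."""
    required = {"$exists": True}
    return [
        # Keep only documents that have every field the notebook uses.
        {"$match": {"classes.FSC": required, "title": required, "abstract": required}},
        # Draw the random sample from the already filtered documents.
        {"$sample": {"size": sample_size}},
        # Convert the ObjectId to a string so the export is plain JSON.
        {"$project": {"_id": {"$toString": "$_id"}, "abstract": 1, "title": 1, "classes.FSC": 1}},
    ]

pipeline = build_export_pipeline()
# With pymongo this list would be passed to `collection.aggregate(pipeline)`.
# The collection handle is not shown in the notebook, so that call is omitted here.
```

Building the stages as data keeps the sample size configurable and makes the stage order easy to inspect before anything is sent to the database.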


In [None]:
url = 'https://raw.githubusercontent.com/DGuilherme/Challenge3/main/Dataset/10000_classified_patents.json'


# Preprocessing 


In [None]:
from sklearn.model_selection import train_test_split

# Load the export and rename the columns (Resumo = abstract, Titulo = title)
raw_train_data = pd.read_json(url)
raw_train_data = raw_train_data.rename(columns={'_id': 'ID', 'abstract': 'Resumo', 'title': 'Titulo'})

# Drop incomplete rows, then every row whose abstract or title is not unique
raw_train_data = raw_train_data.dropna()
raw_train_data = raw_train_data.drop_duplicates(subset='Resumo', keep=False)
raw_train_data = raw_train_data.drop_duplicates(subset='Titulo', keep=False)

# Keep the classes apart from the text used for training
train_classes_data = raw_train_data[['ID', 'classes']]
train_data_unsplit = raw_train_data[['ID', 'Titulo', 'Resumo']]

# Split dataset: 80% train, 20% test
train_data, test_data = train_test_split(train_data_unsplit, test_size=0.2)
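
One detail of the cell above is easy to miss: `drop_duplicates(..., keep=False)` removes every copy of a duplicated value, not just the extra ones. The sketch below reproduces that behaviour on a list of dictionaries using only the standard library. The helper name `drop_all_duplicates` and the sample rows are made up for illustration.

```python
from collections import Counter

def drop_all_duplicates(rows, key):
    """Mimic drop_duplicates(subset=key, keep=False) on a list of dicts."""
    counts = Counter(row[key] for row in rows)
    # A row survives only if its value for `key` occurs exactly once.
    return [row for row in rows if counts[row[key]] == 1]

# Hypothetical rows: two patents share the same abstract.
rows = [
    {"ID": "a", "Resumo": "same text"},
    {"ID": "b", "Resumo": "same text"},
    {"ID": "c", "Resumo": "unique text"},
]
kept = drop_all_duplicates(rows, "Resumo")
print([row["ID"] for row in kept])  # ['c']
```

This is why the training log later reports fewer examples than the 10000 exported documents would suggest.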

# Create the Vocabulary

In [None]:
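import gensim

# Maps a model tag to a patent ID: tag n corresponds to modelIndexToDataframeIndex[n - 1].
# tagData appends on every call, so the test IDs follow the training IDs in this list.
modelIndexToDataframeIndex = []

def tagData(dataframe):
  # Yield one TaggedDocument per row, tagged with a 1-based counter
  for number, (index, row) in enumerate(dataframe.iterrows(), start=1):
    modelIndexToDataframeIndex.append(row['ID'])
    resumotokens = gensim.utils.simple_preprocess(row['Resumo'])
    yield gensim.models.doc2vec.TaggedDocument(resumotokens, [number])

vocabulary = list(tagData(train_data))
vocabulary_test = list(tagData(test_data))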
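
The lookup used later, `modelIndexToDataframeIndex[sims[1][0]-1]`, depends on the tags starting at 1 and on the training documents being tagged first. A possible alternative, sketched here under assumptions, keeps an explicit tag-to-ID dictionary per corpus. `simple_tokens` is only a rough stand-in for `gensim.utils.simple_preprocess` (which also drops very short and very long tokens), and `tag_rows` and the sample rows are hypothetical.

```python
import re

def simple_tokens(text):
    # Rough stand-in for gensim.utils.simple_preprocess: lowercase alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())

def tag_rows(rows, tag_to_id, start=1):
    """Return (tokens, tag) pairs and record tag -> patent ID in tag_to_id."""
    tagged = []
    for tag, row in enumerate(rows, start=start):
        tag_to_id[tag] = row["ID"]
        tagged.append((simple_tokens(row["Resumo"]), tag))
    return tagged

# Hypothetical rows standing in for train_data and test_data.
train_rows = [
    {"ID": "p1", "Resumo": "A locking arrangement"},
    {"ID": "p2", "Resumo": "An extended BIOS"},
]
test_rows = [{"ID": "p3", "Resumo": "A cable head"}]

train_index, test_index = {}, {}
train_docs = tag_rows(train_rows, train_index)
test_docs = tag_rows(test_rows, test_index)
print(train_index)  # {1: 'p1', 2: 'p2'}
```

With one dictionary per corpus, a tag returned by the model can be resolved with `train_index[tag]`, without an offset and without test IDs sharing the same structure.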

# User Question


In [None]:
modelIndexToDataframeIndex

['570641eceb1ec9cd7cadcab8',
 '57026241eb1ec9489e203481',
 '570266e9eb1ec9cdb6c6d3dc',
 '570260bbeb1ec9244e6b6322',
 '57066c4eeb1ec98afebd4524',
 '5702610eeb1ec92a0ae78b3c',
 '57066c4eeb1ec98afebd5a49',
 '5702662feb1ec9c19506808a',
 '5706641eeb1ec967930b4773',
 '570e1777eb1ec9929bae97a1',
 '57026169eb1ec9319738965b',
 '5702610eeb1ec92a0ae7a2ce',
 '57026579eb1ec9b0a9bdef2e',
 '57026917eb1ec90cb361a51e',
 '57066c4eeb1ec98afebe56a3',
 '5706641eeb1ec967930b0897',
 '57026579eb1ec9b0a9bdbf84',
 '57026919eb1ec90cb362c619',
 '57065f86eb1ec950df047e17',
 '57065f89eb1ec950df054137',
 '570261d1eb1ec93bc90e0d4f',
 '57066c4eeb1ec98afebe4ff6',
 '57066c4deb1ec98afebd1f56',
 '570641eaeb1ec9cd7cad13c5',
 '57064db8eb1ec901ec1aa4ee',
 '570260bbeb1ec9244e6b9e88',
 '57065f85eb1ec950df03a6e6',
 '5706821beb1ec9ebf6b44217',
 '57068219eb1ec9ebf6b38c4b',
 '570264c4eb1ec99dbf52ca78',
 '5702610feb1ec92a0ae7da0b',
 '57064db7eb1ec901ec1a32a1',
 '57025eb6eb1ec9f5515f9cb6',
 '570e1777eb1ec9929bafac73',
 '570e1774eb1e

# Create gensim Doc2Vec model

In [None]:
# instantiate
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=100) # Create initial empty model

# build
model.build_vocab(vocabulary) # Add data to the model

2021-04-20 00:52:40,438 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec(dm/m,d100,n5,w5,mc2,s0.001,t3)', 'datetime': '2021-04-20T00:52:40.438047', 'gensim': '4.0.1', 'python': '3.7.10 (default, Feb 20 2021, 21:17:23) \n[GCC 7.5.0]', 'platform': 'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'created'}
2021-04-20 00:52:40,445 : INFO : collecting all words and their counts
2021-04-20 00:52:40,447 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2021-04-20 00:52:40,598 : INFO : collected 19130 word types and 5849 unique tags from a corpus of 5848 examples and 651216 words
2021-04-20 00:52:40,601 : INFO : Creating a fresh vocabulary
2021-04-20 00:52:40,669 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 retains 12761 unique words (66.7067433350758%% of original 19130, drops 6369)', 'datetime': '2021-04-20T00:52:40.669397', 'gensim': '4.0.1', 'python': '3.7.10 (default, Feb 20 2021, 21:17:23) \n[GCC 7.5.0]', 'platform': 'Linux
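
The log line about `effective_min_count=2` reflects the `min_count=2` argument: words that occur fewer than two times in the whole corpus are discarded before training. The following is a minimal standard-library sketch of that pruning rule, not gensim's actual implementation. `prune_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_count=2):
    """Return the retained word set and the number of dropped word types."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    kept = {token for token, n in counts.items() if n >= min_count}
    return kept, len(counts) - len(kept)

# Toy corpus: only "cable" and "lock" appear at least twice.
docs = [["cable", "lock", "body"], ["lock", "detent"], ["cable", "head"]]
kept, dropped = prune_vocabulary(docs)
print(sorted(kept), dropped)  # ['cable', 'lock'] 3
```

In the run logged above, the same rule retains 12761 of 19130 word types and drops 6369.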

# Model Train


In [None]:
model.train(vocabulary, total_examples=model.corpus_count, epochs=model.epochs)

2021-04-20 00:52:44,028 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 12761 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5', 'datetime': '2021-04-20T00:52:44.028598', 'gensim': '4.0.1', 'python': '3.7.10 (default, Feb 20 2021, 21:17:23) \n[GCC 7.5.0]', 'platform': 'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'train'}
2021-04-20 00:52:45,060 : INFO : EPOCH 1 - PROGRESS: at 68.59% examples, 329560 words/s, in_qsize 5, out_qsize 0
2021-04-20 00:52:45,485 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-04-20 00:52:45,489 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-04-20 00:52:45,505 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-04-20 00:52:45,507 : INFO : EPOCH - 1 : training on 651216 raw words (489680 effective words) took 1.5s, 333681 effective words/s
2021-04-20 00:52:46,524 : INFO : EPOCH 2 - PROGRESS: at 67.03% examples, 326213 words

# Model Test
 

In [None]:
# Pick a random document from the training corpus
sample = train_data.sample()
print("ID: "+ sample.iloc[0]['ID'])
print("Title: "+ sample.iloc[0]['Titulo'])
print("Abstract: "+ sample.iloc[0]['Resumo'])
value = str(sample.iloc[0]['ID'])
# sample.index holds index labels, so look the classes up with .loc rather than .iloc
fscList = train_classes_data.loc[sample.index]['classes'].iloc[0]['FSC']
print("FSC: "+ str(fscList))

ID: 570e1774eb1ec9929bae5541
Title: Portable computers lock
Abstract: A locking arrangement for securing portable computers and the like against theft including a cable (40) with a cable head (36) extended by a first stem portion (42), a collar portion (38) and a free end second stem portion (44) all in axial alignment. A prismatic lock body (10) includes a push-in, keyoperated locking device (24) having a releasable locking detent (26'). The body (14) has front (12), rear (18) and two side surfaces (14; 16). First, second and third bores (30; 32; 34) are formed respectively at the front and two side surfaces, in a common plane, passing each other and being of a diameter slightly larger than that of the collar portion (38). The locking detent (26') is insertable behind the collar (38) and above the first stem portion (42) thus precluding the extraction of the cable head (36) when inserted into any of the bores. Also, the rear side surface of the lock body can be secured to a portion of 
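
The FSC codes are extracted into `fscList` so that the sampled patent can be compared with whatever the model recommends. One way to turn that comparison into a number is the Jaccard overlap of the two code lists. This is a sketch under assumptions: it treats `fscList` as a list of code strings, and `fsc_overlap` and the codes below are made up for illustration.

```python
def fsc_overlap(query_fsc, match_fsc):
    """Jaccard overlap between two lists of FSC codes (0.0 when both are empty)."""
    a, b = set(query_fsc), set(match_fsc)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical codes: one shared class out of three distinct ones.
score = fsc_overlap(["70", "248"], ["70", "361"])
print(round(score, 3))  # 0.333
```

A score of 1.0 means both patents carry exactly the same classes, and 0.0 means they share none.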

In [None]:
# Infer a vector for the sampled abstract and rank every training document by similarity
inferred_vector = model.infer_vector(gensim.utils.simple_preprocess(sample.iloc[0]['Resumo']))
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
# sims[0] is expected to be the sampled document itself, so take the runner-up (tags are 1-based)
best_match_row = train_data[train_data['ID'] == modelIndexToDataframeIndex[sims[1][0]-1]]
print("Similarity: "+ str(sims[1][1]))
print("ID: "+ best_match_row.iloc[0]['ID'])
print("Title: "+ best_match_row.iloc[0]['Titulo'])
print("Abstract: "+ best_match_row.iloc[0]['Resumo'])
value = str(best_match_row.iloc[0]['ID'])
# best_match_row.index holds index labels, so use .loc rather than .iloc
fscList = train_classes_data.loc[best_match_row.index]['classes'].iloc[0]['FSC']
print("FSC: "+ str(fscList))

Similarity: 0.46206676959991455
ID: 57064db7eb1ec901ec18e59b
Title: Extended BIOS adapted to establish remote communication for diagnostics       and repair
Abstract: An extended basic input output system (E-BIOS) has a first portion of code for providing power-on self-test (POST) and boot functions for a first computer, including code for sensing if the first computer does not boot. In the event of failure to boot, a second portion of code in the E-BIOS directs establishing communication link with a remote diagnostics and repair computer. When communication is established, a master code kernel at the diagnostics and repair computer may be executed to download a slave kernel to random access memory of the first computer, blowing an automatic software kernel or an operator at the diagnostics and repair computer to access and modify code and data in memory devices of the first computer, and to reboot the first computer after repair. Communication links may be by telephone modem, either an
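
The similarity value reported by `most_similar` is a cosine similarity between document vectors: 1.0 for vectors pointing in the same direction, 0.0 for orthogonal ones. A plain-Python sketch of the measure follows. The two-dimensional vectors are toy values, not output of the model, and the helper returns 0.0 for a zero vector as a convenience chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)

# Toy vectors separated by a 45 degree angle.
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```

Read this way, the 0.46 reported above indicates only a moderate match between the lock patent and the BIOS patent, which fits how different the two abstracts are.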