## Identifying keywords to measure artificial intelligence research and developments in Australia between 2011 and 2022

#### Name(s) & ID(s) of Group Members: 
- s3846691@student.rmit.edu.au Abraar
- s3839204@student.rmit.edu.au Adrian
- s3870059@student.rmit.edu.au Larry

# Table of Contents <a name="con"></a>
1. [Introduction](#intro)
2. [Lemmatization](#lem)
3. [Tokenization](#tok)
4. [Word2Vec](#w2v)
5. [BERT](#bert)
6. [TF-IDF](#tf)


# 1. Introduction <a name="intro"></a>

## 1.1 Dataset Source <a name="DatasetSource"></a>
The dataset was obtained from [Lens.org](https://lens.org).


## 1.2 Dataset Details <a name="DatasetDetails"></a>

The dataset contains 10,000 observations and 32 columns. It includes all research papers under the jurisdiction of Australia between 2011 and 2022. Each record includes information such as Lens ID, publication date, application number, title, abstract, applicants, inventors, CPC classifications and number of citations.



**This chunk of code imports all the required packages for this project.**

In [1]:
import pandas as pd
import numpy as np
from preprocess import document_preprocess
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from scipy.spatial import distance

import pickle
from progressbar import ProgressBar
pbar = ProgressBar()
from sentence_transformers import SentenceTransformer

**This chunk of code changes the setting to disregard warning messages.**

In [2]:
import warnings
warnings.filterwarnings("ignore")

**This chunk of code imports and reads data from csv file.**

In [3]:
data = pd.read_csv("Lens Data 2011-2022.csv")
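# Fill missing values and insert a space so the last word of the abstract
# does not merge with the first word of the title
train = list((data['Abstract'].fillna('') + ' ' + data['Title'].fillna('')).values.astype('U'))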

In [5]:
data.head(10)

Unnamed: 0,#,Jurisdiction,Kind,Display Key,Lens ID,Publication Date,Publication Year,Application Number,Application Date,Priority Numbers,...,Extended Family Size,Sequence Count,CPC Classifications,IPCR Classifications,US Classifications,NPL Citation Count,NPL Resolved Citation Count,NPL Resolved Lens ID(s),NPL Resolved External ID(s),NPL Citations
0,1,AU,B2,AU 2020/289790 B2,008-698-137-173-075,2022-03-31,2022,AU 2020/289790 A,2020-12-16,AU 2020/289790 A;;AU 2017/345067 A;;US 2016152...,...,13,0,G05D1/042;;G05D1/102;;G05D1/1062;;G05D1/0088;;...,G05D1/06;;G01C21/00;;G06T7/73;;G08G5/02,,0,0,,,
1,2,AU,B2,AU 2016/253569 B2,004-265-997-185-317,2022-03-31,2022,AU 2016/253569 A,2016-11-02,AU 2015/904490 A,...,5,0,E04H17/12;;E04H17/10;;E04H17/20;;E04H17/24,E04H17/20,,0,0,,,
2,3,AU,B2,AU 2018/306411 B2,035-019-491-150-417,2022-03-31,2022,AU 2018/306411 A,2018-07-30,KR 20180088375 A;;US 201762538034 P;;KR 201800...,...,13,0,A61K9/51;;A61K31/713;;A61P35/00;;C12N15/113;;C...,A61K31/713;;A61K9/51;;A61P35/00,,0,0,,,
3,4,AU,B2,AU 2017/205693 B2,026-585-572-258-403,2022-03-31,2022,AU 2017/205693 A,2017-01-05,EP 16179291 A;;EP 16150631 A;;EP 16191462 A;;E...,...,8,0,A61K47/60;;A61P19/00;;A61K47/60;;A61K38/22,A61K47/60;;A61P19/00,,0,0,,,
4,5,AU,B2,AU 2017/225767 B2,049-789-773-402-263,2022-03-31,2022,AU 2017/225767 A,2017-03-02,US 201662302430 P;;US 2017/0020448 W,...,6,0,G01N35/02;;G01N35/00732;;G01N2035/0441;;G01N35...,G01N21/13;;G01N21/31;;G01N21/63;;G01N33/02;;G0...,,0,0,,,
5,6,AU,B2,AU 2020/239823 B2,061-282-055-514-561,2022-03-31,2022,AU 2020/239823 A,2020-09-26,AU 2020/239823 A;;AU 2016/247473 A;;US 2015621...,...,12,0,A01N37/22;;A01N37/22;;A01N25/08;;A01N25/34;;A0...,A01N37/22;;A01N43/36;;A01N51/00;;A01N53/00;;A0...,,0,0,,,
6,7,AU,B2,AU 2017/392966 B2,068-513-652-517-891,2022-03-31,2022,AU 2017/392966 A,2017-12-27,IN 201731001199 A;;IB 2017058408 W,...,15,0,A01N47/40;;A01N47/40;;A01N53/00;;A01N25/12;;A0...,A01N47/40;;A01N53/00;;A01P7/04,,0,0,,,
7,8,AU,B2,AU 2020/294444 B2,067-623-277-289-76X,2022-03-31,2022,AU 2020/294444 A,2020-04-24,US 201916445981 A;;US 2020/0029705 W,...,10,0,H04L67/104;;H04W88/04;;H04W48/20;;H04L67/16;;H...,H04L65/80;;H04L67/104;;H04W8/00;;H04W40/24;;H0...,,0,0,,,
8,9,AU,B2,AU 2021/204749 B2,087-242-273-655-845,2022-03-31,2022,AU 2021/204749 A,2021-07-07,AU 2021/204749 A;;AU 2019/256245 A;;CN 2018083...,...,15,0,A61K38/26;;A61K47/68;;A61P3/10;;C07K1/107;;C07...,C07K14/605;;A61K38/26;;A61K47/68;;A61P3/10;;C0...,,0,0,,,
9,10,AU,B2,AU 2018/426934 B2,108-511-371-966-818,2022-03-31,2022,AU 2018/426934 A,2018-06-05,EP 2018064759 W,...,12,0,A61F13/496;;A61F13/51394;;A61F13/51496;;A61F13...,A61F13/496;;A61F13/513;;A61F13/514,,0,0,,,


In [35]:
class document_preprocess:
    
    def __init__(self,lemmatize=True, stop_words=True, singleton=True, valid_word=True, custom_stop_words=None):
        self.lemmatize=lemmatize
        self.stop_words=stop_words
        self.singleton=singleton
        self.valid_word=valid_word
        # Avoid a shared mutable default argument
        self.custom_stop_words=custom_stop_words if custom_stop_words is not None else []
        self.nlp = spacy.load("en_core_sci_lg")

    def filter_word(self, text):
        
        filtered_sentence=[]
        doc = self.nlp(str(text).lower())
        
        for word in doc:
            filters=[]
            if self.lemmatize:
                word=word.vocab[word.lemma_]

            filters.append(True if word.is_alpha else False)

            # dont append word if it is a stop word
            if self.stop_words:
                filters.append(False if word.is_stop else True)

            # dont append word if its length is 1
            if self.singleton:
                filters.append(False if len(word.text)==1 else True)

            # dont append word if it belongs to custom stop word
            if len(self.custom_stop_words)>0:
                filters.append(False if word.text in self.custom_stop_words else True)

            # If there is a valid word vector
            if self.valid_word:
                filters.append(False if word.vector.sum()==0 else True)

            if all(filters):
                filtered_sentence.append(word.text)

        return filtered_sentence
    
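    def make_ngrams(self,s,n):
        ''' 
        Input: String s, maximum n-gram size n
        Description: Create all n-grams of sizes 1..n from a string
        Output: List of spaCy spans (n-grams)
        '''
        ngrams=[]
        s=self.nlp(s)
        # Use a separate loop variable so the argument n is not shadowed
        for size in range(1,n+1):
            ngrams.extend([s[i:i+size] for i in range(len(s)-size+1)])
        return ngrams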

**This chunk of code creates the preprocessing object with lemmatization enabled.**

In [50]:
ob = document_preprocess(lemmatize=True, stop_words=True, singleton=True, custom_stop_words=[])

**This chunk of code tokenizes the data into unigrams and bigrams.**

In [37]:
tokens=[]

for text in pbar(train):
    filtered_list = ob.filter_word(text)
    filtered_string =' '.join([str(item) for item in filtered_list])
    tokens.append(ob.make_ngrams(filtered_string,2))

100% (10000 of 10000) |##################| Elapsed Time: 0:07:45 Time:  0:07:45
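
The n-gram step can be illustrated without loading spaCy. The following is a minimal sketch, assuming plain whitespace tokens instead of spaCy spans; `make_ngrams_plain` is a hypothetical helper written for this illustration and is not part of the `document_preprocess` class.

```python
def make_ngrams_plain(tokens, n):
    """Return all contiguous n-grams of sizes 1..n as tuples of tokens."""
    ngrams = []
    for size in range(1, n + 1):
        ngrams.extend(
            tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
        )
    return ngrams


# Hypothetical, already preprocessed text
tokens = "machine learning model".split()
grams = make_ngrams_plain(tokens, 2)

# Keep only the bigrams and join them with an underscore,
# mirroring how preprocessed_doc is built in the notebook
bigrams = ["_".join(g) for g in grams if len(g) == 2]

print(grams)
print(bigrams)
```

For a three-word input and `n=2` this gives three unigrams followed by two bigrams, which is why the notebook later filters on `len(tok)==2` to keep only the bigrams.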


In [38]:
preprocessed_doc=[]
for i in tokens:
    test_tok=''
    for tok in i:
        if len(tok)==2:
            test_tok=test_tok + ' ' + str(tok[0])+'_'+str(tok[1])
    preprocessed_doc.append(test_tok[1:])

In [39]:
final_token=[]
for i in tokens:
    for tok in i:
        final_token.append(str(tok))

In [40]:
with open('tokens.pkl', 'wb') as f:
    pickle.dump(final_token, f)

In [41]:
with open('preprocessed_doc.pkl', 'wb') as f:
    pickle.dump(preprocessed_doc, f)

### `document_preprocess` class
**Input:** Parameters that switch the individual preprocessing operations of the pipeline on or off.

**Description:** Class containing functions for preprocessing and n-gram creation.

### `filter_word` function
**Input:** Text document | **Output:** Preprocessed text

**Description:** This function takes the text of a document, converts it to an nlp object and then performs the following preprocessing operations (tokens that are not purely alphabetic are always dropped):
1. Lemmatization
2. Stop words removal
3. Singleton removal
4. Custom stop words removal
5. Valid word check
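
The five operations listed above can be sketched in plain Python. This is a simplified illustration only: the lemma table, stop list and known-word set below are hypothetical stand-ins for what the spaCy model provides, and `filter_word_plain` is not the function used in this notebook.

```python
# Hypothetical stand-ins for spaCy's lemmatizer, stop list and vector table
LEMMAS = {"networks": "network", "learning": "learn", "is": "be"}
STOP_WORDS = {"the", "a", "of", "be", "for"}
KNOWN_WORDS = {"neural", "network", "learn", "image", "x"}


def filter_word_plain(text, custom_stop_words=()):
    """Apply the five filters to whitespace-separated, lower-cased tokens."""
    kept = []
    for raw in text.lower().split():
        word = LEMMAS.get(raw, raw)           # 1. lemmatization
        checks = [
            word.isalpha(),                   # alphabetic tokens only
            word not in STOP_WORDS,           # 2. stop word removal
            len(word) > 1,                    # 3. singleton removal
            word not in custom_stop_words,    # 4. custom stop word removal
            word in KNOWN_WORDS,              # 5. valid word check
        ]
        if all(checks):
            kept.append(word)
    return kept


filtered = filter_word_plain("The neural networks is learning x for image 42")
print(filtered)
```

As in the class above, a token is kept only if every enabled check passes, so the order of the checks does not affect the result.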


# 4. Word2Vec <a name="w2v"></a>

In [9]:
nlp = spacy.load("en_core_sci_lg")

In [10]:
with open('preprocessed_doc.pkl', 'rb') as f:
    preprocessed_doc = pickle.load(f)

In [11]:
doc_tokenized=[word.lower().split() for word in preprocessed_doc]

In [12]:
fin_token=[]
for i in doc_tokenized:
    for j in i:
        fin_token.append(j)

In [55]:
spacy_tokens=[]

for i in fin_token:
    spacy_tokens.append(i.replace('_',' '))

In [56]:
# Function for computing the cosine similarity scores
def cos_sim(vector1, vector2):
    cosine_similarity = 1 - distance.cosine(vector1, vector2)
    return cosine_similarity
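
For reference, `distance.cosine` returns the cosine distance, so `cos_sim` evaluates to the dot product of the two vectors divided by the product of their Euclidean norms. Below is a dependency-free sketch of the same quantity; `cos_sim_plain` is a hypothetical name used only for this illustration.

```python
import math


def cos_sim_plain(v1, v2):
    """Cosine similarity: dot(v1, v2) / (||v1|| * ||v2||)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


same_direction = cos_sim_plain([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cos_sim_plain([1.0, 0.0], [0.0, 1.0])       # no shared component
opposite = cos_sim_plain([1.0, 0.0], [-1.0, 0.0])        # opposite direction

print(same_direction, orthogonal, opposite)
```

Scores close to 1 indicate vectors pointing in nearly the same direction, which is what the ranking below relies on. This sketch raises `ZeroDivisionError` for an all-zero vector, a case that the valid word check in preprocessing is meant to filter out.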

Using `artificial intelligence` as the target keyword:

In [None]:
dict_={}
error_=[]
unique_tokens=list(set(spacy_tokens))


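# Compute the target vector once instead of on every iteration
target_vector = nlp("artificial intelligence").vector

for i in unique_tokens:
    try:
        dict_[i] = cos_sim(nlp(i).vector, target_vector)
    except Exception:
        error_.append(i)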

In [None]:
dict_ = dict(sorted(dict_.items(), key=lambda item: item[1], reverse=True))

In [None]:
results=pd.DataFrame(dict_, index=[0]).T
results.to_csv("AI_cosine_similarity_results_w2v.csv")

# 5. BERT <a name="bert"></a>

In [6]:
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')


In [13]:
#removing underscore
bert_tokens=[]

for i in fin_token:
    bert_tokens.append(i.replace('_',' '))

In [14]:
# Function for computing the cosine similarity scores
def cos_sim(vector1, vector2):
    cosine_similarity = 1 - distance.cosine(vector1, vector2)
    return cosine_similarity

In [None]:
# compare to keyword `artificial intelligence`
dict_={}
error_=[]
unique_tokens=list(set(bert_tokens))


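# Encode the target keyword once instead of on every iteration
target_vector = sbert_model.encode('artificial intelligence')

for i in unique_tokens:
    try:
        dict_[i] = cos_sim(sbert_model.encode(i), target_vector)
    except Exception:
        error_.append(i)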

In [None]:
# Rank cosine similarity by Descending order
dict_ = dict(sorted(dict_.items(), key=lambda item: item[1], reverse=True))

In [None]:
# Save results to CSV
results=pd.DataFrame(dict_, index=[0]).T
results.to_csv("AI_cosine_similarity_results_bert.csv")

# 6. TF-IDF <a name="tf"></a>

In [None]:
# Use first 5000 documents
data = pd.read_csv(r"data\Lens-AU.csv")
# Fill missing values and insert a space between abstract and title
train = list((data['Abstract'].fillna('') + ' ' + data['Title'].fillna('')).values.astype('U'))
train=train[:5000]

In [None]:
# Create the preprocessing object
ob=document_preprocess(lemmatize=True, stop_words=True, singleton=True, custom_stop_words=[])

In [None]:
preprocessed_doc=[]

for text in pbar(train):
    filtered_list=ob.filter_word(text)
    filtered_string =' '.join([str(item) for item in filtered_list])
    preprocessed_doc.append(filtered_string)

In [None]:
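# Vectorize using bigrams
vectorizer = TfidfVectorizer(ngram_range=(2,2))
# toarray() returns a plain ndarray (todense() returns the legacy np.matrix type)
X = vectorizer.fit_transform(preprocessed_doc).toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df=pd.DataFrame(X, columns=vectorizer.get_feature_names_out())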

In [None]:
def cos_sim(vector1, vector2):
    cosine_similarity = 1 - distance.cosine(vector1, vector2)
    return cosine_similarity

In [None]:
dict_={}

for i in df.columns:
    dict_[i] = cos_sim(df[i], df["artificial intelligence"])

In [None]:
# Sort by descending cosine similarity and save results to CSV
dict_ = dict(sorted(dict_.items(), key=lambda item: item[1], reverse=True))
results=pd.DataFrame(dict_, index=[0]).T
results.to_csv("AI_cosine_similarity_results_tfidf.csv")
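
The TF-IDF comparison above works on the columns of the document-term matrix: two bigrams score high when they occur in the same documents with similar weights. The following is a small pure-Python sketch of that idea on three made-up documents. The weighting (smoothed IDF followed by L2 normalisation of each document row) is an assumption intended to mirror the scikit-learn defaults, and all names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already preprocessed documents
docs = [
    "artificial intelligence machine learning",
    "artificial intelligence machine learning system",
    "solar panel mounting system",
]


def bigrams(text):
    words = text.split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


counts = [Counter(bigrams(d)) for d in docs]
vocab = sorted(set().union(*counts))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

# One TF-IDF row per document, L2-normalised
rows = []
for c in counts:
    row = [c[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in row))
    rows.append([x / norm for x in row])


def column(term):
    """Weights of one bigram across all documents."""
    j = vocab.index(term)
    return [row[j] for row in rows]


def cos_sim_plain(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))


target = column("artificial intelligence")
scores = {t: cos_sim_plain(column(t), target) for t in vocab}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

In this toy corpus a bigram that appears in exactly the same documents as the target gets a score of 1, one that shares only some documents gets an intermediate score, and one that never co-occurs with it gets 0. Note that this measures co-occurrence across documents rather than semantic similarity, which is the main difference from the embedding-based sections.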