# Keywords Generation From PDF File  

## Introduction
This notebook aims to generate keywords for scientific papers that posted online as PDF files. We do this by

1. <b>Getting Data</b> - we collect a url of papers from our database.
2. <b>Keywords Generation</b>  
    We try generating keywords by using data engineering and machine learning libraries.  
  - Keywords matching - We read in content of online pdf file, then count occurrence of each word in our pre-listed keywords list and select 5 highest words.  
  - SpaCy
  - YAKE
  - Rake-Nltk
  - Gensim

## 1. Getting Data

In [1]:
# Connect with database
from pymongo import MongoClient

DB_NAME = 'TechVault'
COLLECTION_NAME = 'blogs'

mongo_uri = "mongodb+srv://gift:t@cluster0.kf3n4.mongodb.net/{}?authSource=admin&ssl=true".format(DB_NAME)
client = MongoClient(mongo_uri)

# Database: TechVault
db = client[DB_NAME]

# Collection: blogs
collection = db[COLLECTION_NAME]

In [2]:
# Read all papers

papers = []

for doc in collection.find():
    if doc['type'] == 'paper':
        # collect only title and abstract
        paper = {'url': doc['link']}
        papers.append(paper)

papers[0]

{'url': 'https://arxiv.org/pdf/0704.0062'}

In [8]:
len(papers)

284703

## 2. Keywords Generation
### 2.1 Keywords Matching

In [3]:
# Dictionary of keywords
# Key: Searching words
# Value: Displayed words

keywords = {"Machine Learning": "Machine Learning",
            "Supervised Learning": "Supervised Learning",
            "Unsupervised Learning": "Unsupervised Learning",
            "Multilabel Classification": "Multilabel Classification",
            "Clustering": "Clustering",
            "K-Means": "K-Means",
            "DBSCAN": "DBSCAN",
            "Hierarchical Clustering": "Hierarchical Clustering",
            "Deep Learning": "Deep Learning",
            "Data Mining": "Data Mining",
            "Linear regression": "Linear regression",
            "Logistic regression": "Logistic regression",
            "SVM": "SVM",
            "Natural Language Processing": "Natural Language Processing",
            "Computer Vision": "Computer Vision",
            "KNN": "KNN",
            "Random forest": "Random forest",
            "Decision Tree": "Decision Tree",
            "Regularization": "Regularization",
            "Ensemble Learning": "Ensemble Learning",
            "Gradient Boosting": "Gradient Boosting",
            "Feature Selection": "Feature Selection",
            "Reinforcement Learning": "Reinforcement Learning",
            "Virtual Reality": "Virtual Reality",
            "Augmented reality": "Augmented reality",
            "Autonomous driving": "Autonomous driving",
            "Optics": "Optics",
            "Biology": "Biology",
            "C++": "C++",
            "Java": "Java",
            "Python": "Python",
            "React JS": "React JS",
            "Computer Network": "Computer Networks",  # remove s
            "Frontend": "Frontend",
            "Backend": "Backend",
            "High Scalability": "High Scalability",
            "Cloud computing": "Cloud computing",
            "Parallel Computing": "Parallel Computing",
            "CUDA": "CUDA",
            "Distributed System": "Distributed Systems",  # remove s
            "Apache ZooKeeper": "Apache ZooKeeper",
            "Streaming analytic": "Streaming analytics",
            "Model Selection": "Model Selection",
            "Model Evaluation": "Model Evaluation",
            "Apache Kafka": "Apache Kafka",
            "HDFS": "HDFS",
            "Amazon S3": "Amazon S3",
            "Pub-Sub": "Pub-Sub",
            "Leader Election": "Leader Election",
            "Clock Synchronization": "Clock Synchronization",
            "Graph": "Graphs",  # remove s
            "Information Retrieval": "Information Retrieval",
            "SQL": "SQL",
            "Graph Database": "Graph Database",
            "Database Management": "Database Management",
            "Storage": "Storage",
            "Memor": "Memory",
            "Garbage Collection": "Garbage Collection",
            "Map-Reduce": "Map-Reduce",
            "Network Protocol": "Network Protocols",  # remove s
            "Cyber Security": "Cyber Security",
            "Assembly Language": "Assembly Language",
            "Computational Complexity Theor": "Computational Complexity Theory",
            "Computer Architecture": "Computer Architecture",
            "Human-Computer Interface": "Human-Computer Interface",
            "Data Structure": "Data Structures",  # remove s
            "Discrete Mathematic": "Discrete Mathematics",
            "Hacking": "Hacking",
            "Quantum Computing": "Quantum Computing",
            "Robotic": "Robotics",  # remove s
            "Engineering Practice": "Engineering Practices",  # remove s
            "Software Tool": "Software Tools",  # remove s
            "Mathematical Logic": "Mathematical Logic",
            "Graph Theor": "Graph Theory",
            "Computational Geometr": "Computational Geometry",
            "Compiler": "Compilers",  # remove s
            "Distributed Computing": "Distributed Computing",
            "Software Engineering": "Software Engineering",
            "Bioinformatic": "Bioinformatics",  # remove s
            "Computational Chemistry": "Computational Chemistry",
            "Computational Neuroscience": "Computational Neuroscience",
            "Computational physics": "Computational physics",
            "Numerical algorithm": "Numerical algorithms",  # remove s
            "JavaScript": "JavaScript",
            "HTML": "HTML",
            "Web Development": "Web Development",
            "App Development": "App Development",
            "CSS": "CSS",
            "PHP": "PHP",
            "BlockChain": "BlockChain",
            "Hardware": "Hardware",
            "VLSI": "VLSI",
            "Cluster Computing": "Cluster Computing",
            "Kubernetes": "Kubernetes",
            "Go": "Go-Lang",
            "File System": "File Systems",  # remove s
            "Statistic": "Statistics",  # remove s
            "Optimization": "Optimization",
            "Knowledge Graph": "Knowledge Graph",
            "RNN": "RNN",
            "CNN": "CNN",
            "Physical Design": "Physical Design",
            "Memory management": "Memory management",
            "PCA": "PCA",
            "LDA": "LDA",
            "Feature Engineering": "Feature Engineering",
            "Data manipulation": "Data manipulation",
            "ACID": "ACID",
            "BASE": "BASE",
            "Consistency": "Consistency",
            "Disaster recovery": "Disaster recovery",
            "Replication": "Replication",
            "Fault tolerance": "Fault tolerance",
            "Deployment": "Deployment",
            "Processor": "Processors",  # remove s
            "Multi-Threading": "Multi-Threading",
            "Queue": "Queue",
            "Stack": "Stack",
            "Dynamic Programming": "Dynamic Programming",
            "Graph Traversal": "Graph Traversal",
            "Device": "Devices",  # remove s
            "Data analysis": "Data analysis",
            "Probability": "Probability",
            "Mathematic": "Mathematics",  # remove s
            "Genomic": "Genomics",  # remove s
            "Data Infrastructure": "Data Infrastructure",
            "Software Principles and Practices": "Software Principles and Practices",
            "Image Processing": "Image Processing",
            "Audio Processing": "Audio Processing",
            "Signal Processing": "Signal Processing",
            "Pattern Recognition": "Pattern Recognition",
            "Computation and Language": "Computation and Language",
            "Artificial Intelligence": "Artificial Intelligence",
            "Computation and Language": "Computation and Language",
            "Computational Complexit": "Computational Complexity",
            "Computational Engineering": "Computational Engineering",
            "Finance": "Finance",  # remove "and Science" from "Finance, and Science"
            "Computational Geometry": "Computational Geometry",
            "Game Theory": "Game Theory",  # remove "Computer Science" from "Computer Science and Game Theory"
            "Computer Vision": "Computer Vision",  # break down from "Computer Vision and Pattern Recognition"
            "Pattern Recognition": "Pattern Recognition",  # break down from "Computer Vision and Pattern Recognition"
            "Computers and Society": "Computers and Society",
            "Cryptography and Security": "Cryptography and Security",
            "Data Structure": "Data Structures",  # break down from "Data Structures and Algorithms"
            "Algorithm": "Algorithms",  # break down from "Data Structures and Algorithms"
            "Database": "Databases",  # break down from "Databases; Digital Libraries"
            "Digital Librar": "Digital Libraries",  # break down from "Databases; Digital Libraries"
            "Distributed Computing": "Distributed Computing",  # break down from "Distributed, Parallel, and Cluster Computing"
            "Parallel Computing": "Parallel Computing",  # break down from "Distributed, Parallel, and Cluster Computing"
            "Cluster Computing": "Cluster Computing",  # break down from "Distributed, Parallel, and Cluster Computing"
            "Emerging Technolog": "Emerging Technologies",
            "Formal Language": "Formal Languages",  # break down from "Formal Languages and Automata Theory"
            "Automata Theory": "Automata Theory",  # break down from "Formal Languages and Automata Theory"
            "General Literature": "General Literature",
            "Graphic": "Graphics",  # remove s
            "Human-Computer Interaction": "Human-Computer Interaction",
            "Information Theory": "Information Theory",
            "Logic in Computer Science": "Logic in Computer Science",
            "Mathematical Software": "Mathematical Software",
            "Multiagent System": "Multi-agent Systems",  # remove s from "Systems"
            "Multi-agent System": "Multi-agent Systems",  # remove s from "Systems" and add -
            "Multimedia": "Multimedia",
            "Networking and Internet Architecture": "Networking and Internet Architecture",
            "Neural and Evolutionary Computing": "Neural and Evolutionary Computing",
            "Numerical Analysis": "Numerical Analysis",
            "Operating System": "Operating Systems",  # remove s from "Systems"
            "Performance": "Performance",
            "Programming Language": "Programming Languages",  # remove s 
            "Social and Information Networks": "Social and Information Networks",
            "Software Engineering": "Software Engineering",
            "Sound": "Sound",
            "Symbolic Computation": "Symbolic Computation",
            "Systems and Control": "Systems and Control"
           }

In [4]:
def countKeywords(text, keywords):
    ''' Count occurence of keywords in the text, return a dict of words and its occurence'''
    d = {}
    text = ' ' + text + ' '
    # Abbreviations list
    abbreviations = ["SVM", "KNN", "CUDA", "HDFS", "SQL", "HTML", "CSS", "PHP",
                    "VLSI", "RNN", "CNN", "PCA", "LDA", "ACID", "BASE"]
    
    for search_word, display_word in keywords.items():
        # 'Go' can be a sub-string of many words, to be precise, we'll search for the word " Go "
        if search_word == "Go":
            search_word = ' ' + search_word + ' '
        
        # Lower the word if it's not an abbreviation
        elif search_word not in abbreviations:
            search_word = search_word.lower()
        
        # Count occurence of searching word (case sensitive)
        oc = text.count(search_word)
        
        # Append to the dictionary if the word occurs 1 or more time
        if oc > 0:
            d[display_word] = oc

    return d

In [5]:
# Install pdfPlumber package
# pip install pdfplumber

In [6]:
import pdfplumber
import io
import requests

def readPDF(url):
    ''' 
    Read PDF file from a URL
    Return a text string
    '''
    r = requests.get(url)
    f = io.BytesIO(r.content)
    text = "" # returned text
    try:
        # Open pdf file
        with pdfplumber.open(f) as pdf:
            pages = pdf.pages
            # Iterate through pdf document pages
            for page in pages:
                # Concat page content to text if there are any
                if page.extract_text():
                    text += '\n' + page.extract_text()

        return text
    except:
        print("PDF file is not found!")
        return ""
    

def keywordsFromPaper(url, keywords):
    '''
    Extracts keywords from a pdf document
      String url: a url of a pdf file
      Dict keywords: a dictionary of searched word and displayed word
    Return
      A list of five words that have highest occurences 
    '''
    text = readPDF(url).lower()
    
    if not text:
        global n_pdf_unavil
        n_pdf_unavil += 1
        
    occ = countKeywords(text, keywords)

    # get a list of top 5 words
    result = list(dict(sorted(occ.items(), key=lambda x: x[1], reverse=True)).keys())[:5]
    return result
    

In [7]:
n_pdf_unavil = 0
for paper in papers:
    print(paper['url'])
    paper['keywords'] = keywordsFromPaper(paper['url'], keywords)
    print(paper['keywords'])
    print(n_pdf_unavil)
    print()
print(n_pdf_unavil)

https://arxiv.org/pdf/0704.0062
['Algorithms', 'Memory', 'Probability', 'Biology', 'Bioinformatics']
0

https://arxiv.org/pdf/0704.0098
['Statistics', 'Performance', 'Probability', 'Graphs', 'Algorithms']
0

https://arxiv.org/pdf/0704.0304
['Biology', 'Probability', 'Memory', 'Mathematics', 'Information Theory']
0

https://arxiv.org/pdf/0704.0050
['Algorithms', 'Statistics', 'Signal Processing', 'Memory', 'Mathematics']
0

https://arxiv.org/pdf/0704.0309
['Graphs', 'Algorithms', 'Optimization', 'Mathematics', 'Computational Complexity']
0

https://arxiv.org/pdf/0704.0492
['Algorithms', 'Probability', 'Optimization', 'C++', 'Graphs']
0

https://arxiv.org/pdf/0704.0831
['Probability', 'Performance', 'Memory', 'Stack']
0

https://arxiv.org/pdf/0704.0834
['Algorithms', 'Mathematics', 'Probability', 'Information Theory', 'Graphs']
0

https://arxiv.org/pdf/0704.1068
['Algorithms', 'Graphs', 'Databases', 'Queue', 'Graphics']
0

https://arxiv.org/pdf/0704.2452
['Performance', 'Optimization', '

['Algorithms', 'Performance', 'Graphs', 'Memory', 'Information Theory']
1

https://arxiv.org/pdf/0704.0046
['Probability', 'Information Theory', 'Memory', 'Statistics', 'Mathematics']
1

https://arxiv.org/pdf/0704.0590
['Algorithms', 'Memory', 'Graphs', 'Hardware', 'Mathematics']
1

https://arxiv.org/pdf/0704.0047
['Probability', 'Statistics', 'Databases', 'Performance', 'Memory']
1

https://arxiv.org/pdf/0704.0788
['Algorithms', 'Processors', 'Performance', 'Probability']
1

https://arxiv.org/pdf/0704.1198
['Algorithms', 'Consistency', 'Performance', 'Optimization', 'Graphs']
1

https://arxiv.org/pdf/0704.1409
PDF file is not found!
[]
2

https://arxiv.org/pdf/0704.1709
['Algorithms', 'Graphs', 'Statistics', 'Data analysis', 'Databases']
2

https://arxiv.org/pdf/0704.1748
['Graphs', 'Algorithms', 'Databases', 'Biology', 'Robotics']
2

https://arxiv.org/pdf/0704.2083
['Java', 'Statistics', 'Frontend', 'Graphs', 'Signal Processing']
2

https://arxiv.org/pdf/0704.2353
['Probability', 'De

['Performance', 'Probability', 'Optimization', 'Signal Processing', 'Mathematics']
2

https://arxiv.org/pdf/0704.3453
['Algorithms', 'Databases', 'Bioinformatics', 'Machine Learning', 'Biology']
2

https://arxiv.org/pdf/0704.3520
['Databases', 'Performance', 'Clustering', 'Data Mining', 'Graphs']
2

https://arxiv.org/pdf/0704.3591
['Probability', 'Information Theory', 'Statistics']
2

https://arxiv.org/pdf/0704.3644
['Information Theory', 'Performance', 'Optimization', 'Statistics', 'Mathematics']
2

https://arxiv.org/pdf/0704.3931
['Graphs', 'Graphics', 'Sound', 'Mathematics', 'Memory']
2

https://arxiv.org/pdf/0705.0123
['Performance', 'Optimization', 'Probability', 'Signal Processing']
2

https://arxiv.org/pdf/0705.0199
['Algorithms', 'Sound', 'Graphs', 'Robotics', 'Probability']
2

https://arxiv.org/pdf/0704.0540
['Probability', 'Memory', 'Information Theory', 'Graphs', 'Graphics']
2

https://arxiv.org/pdf/0704.0858
['Graphs', 'Deployment', 'Graphics', 'Memory', 'Databases']
2

htt

['Graphs', 'Algorithms', 'Probability', 'Performance', 'Graphics']
2

https://arxiv.org/pdf/0705.0815
['Algorithms', 'Graphs', 'Memory', 'Distributed Systems', 'Graphics']
2

https://arxiv.org/pdf/0705.0936
['Performance', 'Probability', 'Algorithms']
2

https://arxiv.org/pdf/0705.0952
['Algorithms', 'Statistics', 'Performance', 'Databases', 'Computer Vision']
2

https://arxiv.org/pdf/0705.0965
['Algorithms', 'Graphs', 'C++', 'Mathematics', 'Graphics']
2

https://arxiv.org/pdf/0705.1151
['Performance', 'Statistics']
2

https://arxiv.org/pdf/0705.1583
['Multimedia', 'Graphs', 'Graphics', 'Performance', 'Devices']
2

https://arxiv.org/pdf/0705.1788
['Queue', 'Optimization', 'Probability', 'Game Theory', 'Performance']
2

https://arxiv.org/pdf/0705.2604
['Signal Processing', 'Probability', 'Algorithms', 'Pattern Recognition', 'Optimization']
2

https://arxiv.org/pdf/0705.2835
['Algorithms', 'Computational Geometry', 'Dynamic Programming', 'Mathematics', 'Biology']
2

https://arxiv.org/pdf

['Graphs', 'Algorithms', 'Mathematics', 'Discrete Mathematics', 'Optimization']
3

https://arxiv.org/pdf/0705.1682
['Memory', 'Algorithms', 'Information Theory']
3

https://arxiv.org/pdf/0705.1759
['Optimization', 'Algorithms', 'Probability', 'Mathematics', 'Supervised Learning']
3

https://arxiv.org/pdf/0705.1760
['Optimization', 'Algorithms', 'Statistics', 'Probability', 'Mathematics']
3

https://arxiv.org/pdf/0705.1789
['Graphs', 'Probability', 'Information Theory', 'Algorithms', 'Optimization']
3

https://arxiv.org/pdf/0705.2084
['Computer Networks', 'Backend', 'Graphs', 'Storage', 'Statistics']
3

https://arxiv.org/pdf/0705.2626
['Algorithms', 'Performance', 'Memory', 'Mathematics', 'Processors']
3

https://arxiv.org/pdf/0705.2847
['Statistics', 'Performance', 'Graphs', 'Probability', 'Mathematics']
3

https://arxiv.org/pdf/0705.2848
['Statistics', 'Performance', 'Probability', 'Algorithms']
3

https://arxiv.org/pdf/0705.3099
['Optimization', 'Probability', 'Information Theory', '

['Algorithms', 'Graphs', 'Graphics', 'Memory', 'Deployment']
3

https://arxiv.org/pdf/0705.2318
['Statistics', 'Performance', 'Memory', 'Ensemble Learning']
3

https://arxiv.org/pdf/0705.2787
['Algorithms', 'Statistics', 'Probability', 'Databases', 'Graphs']
3

https://arxiv.org/pdf/0705.3468
['Optimization', 'Stack', 'Graphs', 'Databases', 'Performance']
3

https://arxiv.org/pdf/0705.3751
['Graphs', 'Algorithms', 'Mathematics', 'Graph Theory', 'Graphics']
3

https://arxiv.org/pdf/0705.4045
['Statistics', 'Probability', 'Information Theory', 'Graphs', 'Mathematics']
3

https://arxiv.org/pdf/0705.0564
['Memory', 'Statistics', 'Optimization', 'Probability', 'Information Theory']
3

https://arxiv.org/pdf/0705.0751
['Algorithms', 'Probability', 'Information Retrieval', 'Biology', 'Graphs']
3

https://arxiv.org/pdf/0705.1183
['Probability', 'Memory', 'Information Theory', 'Optimization', 'Computational Complexity']
3

https://arxiv.org/pdf/0705.1244
['Robotics', 'Performance', 'Statistics',

['Graphs', 'Mathematics', 'Algorithms', 'Logic in Computer Science']
3

https://arxiv.org/pdf/0706.4323
['Algorithms', 'C++', 'Performance', 'Graphs', 'Memory']
3

https://arxiv.org/pdf/0707.0476
['Probability', 'Algorithms', 'Optimization', 'Graphs', 'Mathematics']
3

https://arxiv.org/pdf/0707.0556
['Programming Languages', 'Java', 'Memory', 'Software Engineering', 'Mathematics']
3

https://arxiv.org/pdf/0707.0568
['Optimization', 'Performance', 'Probability', 'Algorithms', 'Game Theory']
3

https://arxiv.org/pdf/0705.4676
['Algorithms', 'Memory', 'Java', 'Probability', 'Performance']
3

https://arxiv.org/pdf/0706.0431
['Algorithms', 'Mathematics', 'Graphs', 'Formal Languages', 'Graphics']
3

https://arxiv.org/pdf/0706.0457
['Robotics', 'Hardware', 'Algorithms', 'Performance', 'Machine Learning']
3

https://arxiv.org/pdf/0706.0489
['Graphs', 'Statistics', 'Algorithms', 'Probability', 'Mathematics']
3

https://arxiv.org/pdf/0706.0682
['Probability', 'Information Theory', 'Memory', 'Op

['Databases', 'Algorithms', 'Robotics', 'Consistency']
5

https://arxiv.org/pdf/0706.0585
['Algorithms', 'Optimization', 'Machine Learning', 'Memory', 'Performance']
5

https://arxiv.org/pdf/0706.1318
['Algorithms', 'Dynamic Programming', 'Optimization', 'C++', 'Memory']
5

https://arxiv.org/pdf/0706.1563
['Queue', 'Processors', 'Probability', 'Algorithms']
5

https://arxiv.org/pdf/0706.1751
['Mathematics', 'Algorithms', 'Performance', 'Graphs', 'Storage']
5

https://arxiv.org/pdf/0706.2155
['Algorithms', 'Performance', 'Processors', 'Graphs', 'Data Structures']
5

https://arxiv.org/pdf/0706.3848
['Graphs', 'Algorithms', 'Mathematics', 'Optimization']
5

https://arxiv.org/pdf/0706.4298
['Algorithms', 'Graphs', 'Processors', 'Distributed Systems', 'Distributed Computing']
5

https://arxiv.org/pdf/0707.0479
['Probability', 'Optimization', 'Graphs', 'Memory', 'Algorithms']
5

https://arxiv.org/pdf/0707.0652
['Algorithms', 'Performance', 'Robotics', 'Memory']
5

https://arxiv.org/pdf/0706.

PDF file is not found!
[]
82

https://arxiv.org/pdf/0707.3638
PDF file is not found!
[]
83

https://arxiv.org/pdf/0707.4081
PDF file is not found!
[]
84

https://arxiv.org/pdf/0707.4448
PDF file is not found!
[]
85

https://arxiv.org/pdf/0707.4489
PDF file is not found!
[]
86

https://arxiv.org/pdf/0707.4507
PDF file is not found!
[]
87

https://arxiv.org/pdf/0708.0877
PDF file is not found!
[]
88

https://arxiv.org/pdf/0707.0762
PDF file is not found!
[]
89

https://arxiv.org/pdf/0707.0862
PDF file is not found!
[]
90

https://arxiv.org/pdf/0707.0890
PDF file is not found!
[]
91

https://arxiv.org/pdf/0707.1151
PDF file is not found!
[]
92

https://arxiv.org/pdf/0707.1372
PDF file is not found!
[]
93

https://arxiv.org/pdf/0707.1432
PDF file is not found!
[]
94

https://arxiv.org/pdf/0707.1716
PDF file is not found!
[]
95

https://arxiv.org/pdf/0707.1820
PDF file is not found!
[]
96

https://arxiv.org/pdf/0707.1925
PDF file is not found!
[]
97

https://arxiv.org/pdf/0707.2126
PDF file

PDF file is not found!
[]
213

https://arxiv.org/pdf/0708.0909
PDF file is not found!
[]
214

https://arxiv.org/pdf/0707.0978
PDF file is not found!
[]
215

https://arxiv.org/pdf/0707.3236
PDF file is not found!
[]
216

https://arxiv.org/pdf/0707.3248
PDF file is not found!
[]
217

https://arxiv.org/pdf/0707.3732
PDF file is not found!
[]
218

https://arxiv.org/pdf/0707.3807
PDF file is not found!
[]
219

https://arxiv.org/pdf/0707.4524
PDF file is not found!
[]
220

https://arxiv.org/pdf/0708.0605
PDF file is not found!
[]
221

https://arxiv.org/pdf/0708.0694
PDF file is not found!
[]
222

https://arxiv.org/pdf/0708.0713
PDF file is not found!
[]
223

https://arxiv.org/pdf/0707.0785
PDF file is not found!
[]
224

https://arxiv.org/pdf/0707.1501
PDF file is not found!
[]
225

https://arxiv.org/pdf/0707.0796
PDF file is not found!
[]
226

https://arxiv.org/pdf/0707.1295
PDF file is not found!
[]
227

https://arxiv.org/pdf/0707.1534
PDF file is not found!
[]
228

https://arxiv.org/pdf/07

PDF file is not found!
[]
344

https://arxiv.org/pdf/0708.1416
PDF file is not found!
[]
345

https://arxiv.org/pdf/0708.1491
PDF file is not found!
[]
346

https://arxiv.org/pdf/0708.2514
PDF file is not found!
[]
347

https://arxiv.org/pdf/0708.2804
PDF file is not found!
[]
348

https://arxiv.org/pdf/0708.3070
PDF file is not found!
[]
349

https://arxiv.org/pdf/0708.3465
PDF file is not found!
[]
350

https://arxiv.org/pdf/0708.3567
PDF file is not found!
[]
351

https://arxiv.org/pdf/0708.3761
PDF file is not found!
[]
352

https://arxiv.org/pdf/0708.3900
PDF file is not found!
[]
353

https://arxiv.org/pdf/0708.4288
PDF file is not found!
[]
354

https://arxiv.org/pdf/0709.0145
PDF file is not found!
[]
355

https://arxiv.org/pdf/0709.0511
PDF file is not found!
[]
356

https://arxiv.org/pdf/0709.0516
PDF file is not found!
[]
357

https://arxiv.org/pdf/0709.0624
PDF file is not found!
[]
358

https://arxiv.org/pdf/0709.0896
PDF file is not found!
[]
359

https://arxiv.org/pdf/07

SSLError: HTTPSConnectionPool(host='arxiv.org', port=443): Max retries exceeded with url: /pdf/0708.3696 (Caused by SSLError(SSLError("bad handshake: SysCallError(54, 'ECONNRESET')")))

# Finding
We successfully generate 4-5 keywords for most papers. 


Many URLs that the PDF file is unavilable: 
1. https://arxiv.org/pdf/0704.0213
2. https://arxiv.org/pdf/0704.1409
3. https://arxiv.org/pdf/0705.1442
4. https://arxiv.org/pdf/0707.0454
5. https://arxiv.org/pdf/0706.1002
6. https://arxiv.org/pdf/0706.0484
7. https://arxiv.org/pdf/0706.1118
8. https://arxiv.org/pdf/0706.1477
9. https://arxiv.org/pdf/0706.2073
10. https://arxiv.org/pdf/0706.2153

and so on.

### Text Cleansing
Let's clean text for further machine learning works

In [None]:
from bs4 import BeautifulSoup
import re
import string
import copy

# Funtion to remove HTML tags
def removeHTMLTags(text):
    return BeautifulSoup(text, 'html.parser').get_text()

# Function to remove more special characters and escape characters
def removeExtraWhitespaceEsc(text):
    #pattern = r'^\s+$|\s+$'
    pat = r'^\s*|\s\s*'
    return re.sub(pat, ' ', text).strip()

# Function to remove commas and periods
def removeCommasPeriods(text):
    pat = r'[.,]+'
    return re.sub(pat, '', text)

# Function to remove words that include special character
def removeSpecialCharacterWords(text):
    # define the pattern to keep only letters, numbers, dash and white spaces
    pat = r'[a-zA-Z0-9]*[^a-zA-Z0-9_\s]+[a-zA-Z0-9]*'
    return re.sub(pat, '', text)

def cleanText(text):
    clean_text = removeHTMLTags(text)
    clean_text = removeExtraWhitespaceEsc(clean_text)
    clean_text = removeCommasPeriods(clean_text)
    clean_text = removeSpecialCharacterWords(clean_text)
    return clean_text

In [None]:
# Sample paper

# Keywords from doing keywords matching
# ['Probability', 'Statistics', 'Graphs', 'Image Processing', 'Signal Processing']

url = "https://arxiv.org/pdf/0705.0043.pdf"
text = readPDF(url)
clean_text = cleanText(text)
clean_text

### 2.2 SpaCy

In [None]:
# Download "en_core_sci_lg"
# pip install scispacy
# pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz

In [None]:
import spacy

nlp = spacy.load("en_core_sci_lg")
kws = nlp(clean_text)
kws.ents

### 2.3 YAKE

In [None]:
# Install yake
# pip install yake

In [None]:
import yake

kw_extractor = yake.KeywordExtractor()
language = "en"
max_ngram_size = 2
deduplication_threshold = 0.1
numOfKeywords = 20
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
yake_kws = custom_kw_extractor.extract_keywords(clean_text)
yake_kws
    
# The lower the score, the more relevant the keyword is

### 2.4 Rake-Nltk

In [None]:
# Install Rake-Nltk
# pip install rake-nltk

In [None]:
from rake_nltk import Rake
r = Rake(max_length=3)  # uses stopwords for english from NLTK, and all puntuation characters

# Keywords from title
r.extract_keywords_from_text(clean_text)
rake_kws = r.get_ranked_phrases()
rake_kws

### 2.5 Gensim

In [None]:
# Install Gensim
# !python -m pip install gensim==3.8.3

In [None]:
from gensim.summarization import keywords
gensim_kws = keywords(clean_text)
gensim_kws

## Finding
The text that we read from online pdf files is not in a good shape, e.g. no white space between words. This make it hard for machine learning algorithms to work well.