# Keywords Generation From PDF File  

## Introduction
This notebook aims to generate keywords for scientific papers that are posted online as PDF files. We do this by:

1. <b>Getting Data</b> - we collect the URLs of the papers from our database.
2. <b>Keywords Generation</b> - we try generating keywords using data engineering and machine learning libraries:
    - Keywords matching - we read the content of the online PDF file, count the occurrences of each entry of our predefined keywords list, and select the 5 most frequent ones.
    - SpaCy
    - YAKE
    - Rake-Nltk
    - Gensim

## 1. Getting Data

In [1]:
# Connect with database
from pymongo import MongoClient

DB_NAME = 'TechVault'
COLLECTION_NAME = 'blogs'

mongo_uri = ""
client = MongoClient(mongo_uri)

# Database: TechVault
db = client[DB_NAME]

# Collection: blogs
collection = db[COLLECTION_NAME]

In [2]:
# Read all papers

papers = []

for doc in collection.find():
    if doc['type'] == 'paper':
        # collect only the URL of the PDF file
        paper = {'url': doc['link']}
        papers.append(paper)

papers[0]

{'url': 'https://arxiv.org/pdf/0704.0002'}

In [3]:
len(papers)

284703

## 2. Keywords Generation
### 2.1 Keywords Matching

In [4]:
# Dictionary of keywords
# Key: Searching words
# Value: Displayed words

keywords = {"Machine Learning": "Machine Learning",
            "Supervised Learning": "Supervised Learning",
            "Unsupervised Learning": "Unsupervised Learning",
            "Multilabel Classification": "Multilabel Classification",
            "Clustering": "Clustering",
            "K-Means": "K-Means",
            "DBSCAN": "DBSCAN",
            "Hierarchical Clustering": "Hierarchical Clustering",
            "Deep Learning": "Deep Learning",
            "Data Mining": "Data Mining",
            "Linear regression": "Linear regression",
            "Logistic regression": "Logistic regression",
            "SVM": "SVM",
            "Natural Language Processing": "Natural Language Processing",
            "Computer Vision": "Computer Vision",
            "KNN": "KNN",
            "Random forest": "Random forest",
            "Decision Tree": "Decision Tree",
            "Regularization": "Regularization",
            "Ensemble Learning": "Ensemble Learning",
            "Gradient Boosting": "Gradient Boosting",
            "Feature Selection": "Feature Selection",
            "Reinforcement Learning": "Reinforcement Learning",
            "Virtual Reality": "Virtual Reality",
            "Augmented reality": "Augmented reality",
            "Autonomous driving": "Autonomous driving",
            "Optics": "Optics",
            "Biology": "Biology",
            "C++": "C++",
            "Java": "Java",
            "Python": "Python",
            "React JS": "React JS",
            "Computer Network": "Computer Networks",  # remove s
            "Frontend": "Frontend",
            "Backend": "Backend",
            "High Scalability": "High Scalability",
            "Cloud computing": "Cloud computing",
            "Parallel Computing": "Parallel Computing",
            "CUDA": "CUDA",
            "Distributed System": "Distributed Systems",  # remove s
            "Apache ZooKeeper": "Apache ZooKeeper",
            "Streaming analytic": "Streaming analytics",
            "Model Selection": "Model Selection",
            "Model Evaluation": "Model Evaluation",
            "Apache Kafka": "Apache Kafka",
            "HDFS": "HDFS",
            "Amazon S3": "Amazon S3",
            "Pub-Sub": "Pub-Sub",
            "Leader Election": "Leader Election",
            "Clock Synchronization": "Clock Synchronization",
            "Graph": "Graphs",  # remove s
            "Information Retrieval": "Information Retrieval",
            "SQL": "SQL",
            "Graph Database": "Graph Database",
            "Database Management": "Database Management",
            "Storage": "Storage",
            "Memor": "Memory",
            "Garbage Collection": "Garbage Collection",
            "Map-Reduce": "Map-Reduce",
            "Network Protocol": "Network Protocols",  # remove s
            "Cyber Security": "Cyber Security",
            "Assembly Language": "Assembly Language",
            "Computational Complexity Theor": "Computational Complexity Theory",
            "Computer Architecture": "Computer Architecture",
            "Human-Computer Interface": "Human-Computer Interface",
            "Data Structure": "Data Structures",  # remove s
            "Discrete Mathematic": "Discrete Mathematics",
            "Hacking": "Hacking",
            "Quantum Computing": "Quantum Computing",
            "Robotic": "Robotics",  # remove s
            "Engineering Practice": "Engineering Practices",  # remove s
            "Software Tool": "Software Tools",  # remove s
            "Mathematical Logic": "Mathematical Logic",
            "Graph Theor": "Graph Theory",
            "Computational Geometr": "Computational Geometry",
            "Compiler": "Compilers",  # remove s
            "Distributed Computing": "Distributed Computing",
            "Software Engineering": "Software Engineering",
            "Bioinformatic": "Bioinformatics",  # remove s
            "Computational Chemistry": "Computational Chemistry",
            "Computational Neuroscience": "Computational Neuroscience",
            "Computational physics": "Computational physics",
            "Numerical algorithm": "Numerical algorithms",  # remove s
            "JavaScript": "JavaScript",
            "HTML": "HTML",
            "Web Development": "Web Development",
            "App Development": "App Development",
            "CSS": "CSS",
            "PHP": "PHP",
            "BlockChain": "BlockChain",
            "Hardware": "Hardware",
            "VLSI": "VLSI",
            "Cluster Computing": "Cluster Computing",
            "Kubernetes": "Kubernetes",
            "Go": "Go-Lang",
            "File System": "File Systems",  # remove s
            "Statistic": "Statistics",  # remove s
            "Optimization": "Optimization",
            "Knowledge Graph": "Knowledge Graph",
            "RNN": "RNN",
            "CNN": "CNN",
            "Physical Design": "Physical Design",
            "Memory management": "Memory management",
            "PCA": "PCA",
            "LDA": "LDA",
            "Feature Engineering": "Feature Engineering",
            "Data manipulation": "Data manipulation",
            "ACID": "ACID",
            "BASE": "BASE",
            "Consistency": "Consistency",
            "Disaster recovery": "Disaster recovery",
            "Replication": "Replication",
            "Fault tolerance": "Fault tolerance",
            "Deployment": "Deployment",
            "Processor": "Processors",  # remove s
            "Multi-Threading": "Multi-Threading",
            "Queue": "Queue",
            "Stack": "Stack",
            "Dynamic Programming": "Dynamic Programming",
            "Graph Traversal": "Graph Traversal",
            "Device": "Devices",  # remove s
            "Data analysis": "Data analysis",
            "Probability": "Probability",
            "Mathematic": "Mathematics",  # remove s
            "Genomic": "Genomics",  # remove s
            "Data Infrastructure": "Data Infrastructure",
            "Software Principles and Practices": "Software Principles and Practices",
            "Image Processing": "Image Processing",
            "Audio Processing": "Audio Processing",
            "Signal Processing": "Signal Processing",
            "Pattern Recognition": "Pattern Recognition",
            "Computation and Language": "Computation and Language",
            "Artificial Intelligence": "Artificial Intelligence",
            "Computation and Language": "Computation and Language",
            "Computational Complexit": "Computational Complexity",
            "Computational Engineering": "Computational Engineering",
            "Finance": "Finance",  # remove "and Science" from "Finance, and Science"
            "Computational Geometry": "Computational Geometry",
            "Game Theory": "Game Theory",  # remove "Computer Science" from "Computer Science and Game Theory"
            "Computer Vision": "Computer Vision",  # break down from "Computer Vision and Pattern Recognition"
            "Pattern Recognition": "Pattern Recognition",  # break down from "Computer Vision and Pattern Recognition"
            "Computers and Society": "Computers and Society",
            "Cryptography and Security": "Cryptography and Security",
            "Data Structure": "Data Structures",  # break down from "Data Structures and Algorithms"
            "Algorithm": "Algorithms",  # break down from "Data Structures and Algorithms"
            "Database": "Databases",  # break down from "Databases; Digital Libraries"
            "Digital Librar": "Digital Libraries",  # break down from "Databases; Digital Libraries"
            "Distributed Computing": "Distributed Computing",  # break down from "Distributed, Parallel, and Cluster Computing"
            "Parallel Computing": "Parallel Computing",  # break down from "Distributed, Parallel, and Cluster Computing"
            "Cluster Computing": "Cluster Computing",  # break down from "Distributed, Parallel, and Cluster Computing"
            "Emerging Technolog": "Emerging Technologies",
            "Formal Language": "Formal Languages",  # break down from "Formal Languages and Automata Theory"
            "Automata Theory": "Automata Theory",  # break down from "Formal Languages and Automata Theory"
            "General Literature": "General Literature",
            "Graphic": "Graphics",  # remove s
            "Human-Computer Interaction": "Human-Computer Interaction",
            "Information Theory": "Information Theory",
            "Logic in Computer Science": "Logic in Computer Science",
            "Mathematical Software": "Mathematical Software",
            "Multiagent System": "Multi-agent Systems",  # remove s from "Systems"
            "Multi-agent System": "Multi-agent Systems",  # remove s from "Systems" and add -
            "Multimedia": "Multimedia",
            "Networking and Internet Architecture": "Networking and Internet Architecture",
            "Neural and Evolutionary Computing": "Neural and Evolutionary Computing",
            "Numerical Analysis": "Numerical Analysis",
            "Operating System": "Operating Systems",  # remove s from "Systems"
            "Performance": "Performance",
            "Programming Language": "Programming Languages",  # remove s 
            "Social and Information Networks": "Social and Information Networks",
            "Software Engineering": "Software Engineering",
            "Sound": "Sound",
            "Symbolic Computation": "Symbolic Computation",
            "Systems and Control": "Systems and Control"
           }

In [5]:
def countKeywords(text, keywords):
    ''' Count occurrences of keywords in the text, return a dict of displayed words and their counts.
    The text must be passed in its original case so that abbreviations can be matched. '''
    d = {}
    text = ' ' + text + ' '
    # Lower-cased copy used for case-insensitive matching of ordinary keywords
    lower_text = text.lower()
    # Abbreviations list (matched case-sensitively against the original text)
    abbreviations = ["SVM", "KNN", "CUDA", "HDFS", "SQL", "HTML", "CSS", "PHP",
                    "VLSI", "RNN", "CNN", "PCA", "LDA", "ACID", "BASE"]
    
    for search_word, display_word in keywords.items():
        # 'Go' can be a sub-string of many words, to be precise, we'll search for the word " Go "
        if search_word == "Go":
            oc = text.count(' Go ')
        
        # Abbreviations keep their upper case and are searched in the original text
        elif search_word in abbreviations:
            oc = text.count(search_word)
        
        # Every other keyword is lowered and searched in the lowered text
        else:
            oc = lower_text.count(search_word.lower())
        
        # Add to the dictionary if the word occurs one or more times
        if oc > 0:
            d[display_word] = oc

    return d

In [None]:
# Install pdfPlumber package
# pip install pdfplumber

In [6]:
import pdfplumber
import io
import requests

def readPDF(url):
    ''' 
    Read PDF file from a URL
    Return a text string (empty if the file cannot be downloaded or parsed)
    '''
    text = "" # returned text
    try:
        # Download the file; network and HTTP errors are handled below
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        f = io.BytesIO(r.content)
        # Open pdf file
        with pdfplumber.open(f) as pdf:
            # Iterate through pdf document pages
            for page in pdf.pages:
                # Extract once per page and concat the content if there is any
                page_text = page.extract_text()
                if page_text:
                    text += '\n' + page_text

        return text
    except Exception as e:
        print(f"PDF file could not be read: {e}")
        return ""
    

def keywordsFromPaper(url, keywords):
    '''
    Extracts keywords from a pdf document
      String url: a url of a pdf file
      Dict keywords: a dictionary of searched word and displayed word
    Return
      A list of up to five words that have the highest occurrences 
    '''
    # n_pdf_unavil must be initialised at module level before this function is called
    global n_pdf_unavil
    
    # Keep the original case: countKeywords lowers the text itself where needed
    text = readPDF(url)
    
    if not text:
        n_pdf_unavil += 1
        return []
        
    occ = countKeywords(text, keywords)

    # get a list of top 5 words
    result = sorted(occ, key=occ.get, reverse=True)[:5]
    return result
    

In [None]:
n_pdf_unavil = 0
for paper in papers:
    print(paper['url'])
    paper['keywords'] = keywordsFromPaper(paper['url'], keywords)
    print(paper['keywords'])
    print(n_pdf_unavil)
    print()
print(n_pdf_unavil)
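
The matching step can be tried on a short string without downloading any PDF. The sketch below is a simplified, self-contained illustration rather than the notebook's own function: `toy_keywords`, `top_keywords` and the sample sentence are made up for this example. It also shows a side effect of substring counting, since "Paragraphs" contributes to the count of the stem "graph".

```python
# Hypothetical toy dictionary: search stem -> displayed keyword
toy_keywords = {
    "graph": "Graphs",
    "statistic": "Statistics",
    "deep learning": "Deep Learning",
}

def top_keywords(text, keywords, k=5):
    """Return up to k displayed keywords ordered by substring count, highest first."""
    lower_text = text.lower()
    counts = {}
    for stem, display in keywords.items():
        n = lower_text.count(stem.lower())
        if n > 0:
            counts[display] = n
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [display for display, _ in ranked[:k]]

sample = "Graph statistics: each graph is summarised by a statistic. Paragraphs also count."
print(top_keywords(sample, toy_keywords))  # ['Graphs', 'Statistics']
```

Here "graph" is counted three times (twice as a word, once inside "Paragraphs") and "statistic" twice, so short stems can inflate counts for unrelated words.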

### Finding
We successfully generate 4-5 keywords for most papers. 


For many URLs, however, the PDF file is unavailable, for example:
1. https://arxiv.org/pdf/0704.0213
2. https://arxiv.org/pdf/0704.1409
3. https://arxiv.org/pdf/0705.1442
4. https://arxiv.org/pdf/0707.0454
5. https://arxiv.org/pdf/0706.1002
6. https://arxiv.org/pdf/0706.0484
7. https://arxiv.org/pdf/0706.1118
8. https://arxiv.org/pdf/0706.1477
9. https://arxiv.org/pdf/0706.2073
10. https://arxiv.org/pdf/0706.2153

and so on.
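
We have not verified why these downloads fail. One way to narrow it down, sketched below under the assumption that some URLs return a non-PDF payload (for instance an HTML page), is to check the downloaded bytes for the `%PDF-` header before handing them to pdfplumber. The helper name `looks_like_pdf` is hypothetical.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload carries the PDF header '%PDF-' near its start."""
    # The header is normally at offset 0; a small window tolerates leading junk bytes
    return b"%PDF-" in content[:1024]

print(looks_like_pdf(b"%PDF-1.4\n% sample bytes"))    # True
print(looks_like_pdf(b"<html><body>Not found</body>"))  # False
```

In `readPDF`, such a check could be applied to `r.content` to tell "not a PDF" apart from "PDF that could not be parsed".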

### Text Cleansing
Let's clean the text so that the machine learning libraries in the following sections can work on it.

In [None]:
from bs4 import BeautifulSoup
import re

# Function to remove HTML tags
def removeHTMLTags(text):
    return BeautifulSoup(text, 'html.parser').get_text()

# Function to collapse runs of whitespace (spaces, tabs, newlines) into single spaces
def removeExtraWhitespaceEsc(text):
    pat = r'\s+'
    return re.sub(pat, ' ', text).strip()

# Function to remove commas and periods
def removeCommasPeriods(text):
    pat = r'[.,]+'
    return re.sub(pat, '', text)

# Function to remove words that include a special character
def removeSpecialCharacterWords(text):
    # remove every word that contains a character other than letters, digits,
    # underscores and white space (hyphenated words are removed as well)
    pat = r'[a-zA-Z0-9]*[^a-zA-Z0-9_\s]+[a-zA-Z0-9]*'
    return re.sub(pat, '', text)

def cleanText(text):
    clean_text = removeHTMLTags(text)
    clean_text = removeExtraWhitespaceEsc(clean_text)
    clean_text = removeCommasPeriods(clean_text)
    clean_text = removeSpecialCharacterWords(clean_text)
    return clean_text
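
To see what the special-character step does to hyphenated terms, here is a small self-contained check that uses the same pattern as `removeSpecialCharacterWords`, followed by a whitespace collapse. The sample sentence is made up for illustration.

```python
import re

# Same pattern as in removeSpecialCharacterWords
pat = r'[a-zA-Z0-9]*[^a-zA-Z0-9_\s]+[a-zA-Z0-9]*'

sample = "k-means and semi-supervised learning (2007)"
cleaned = re.sub(pat, '', sample)
cleaned = re.sub(r'\s+', ' ', cleaned).strip()
print(cleaned)  # and learning
```

Hyphenated terms such as "k-means" are dropped entirely, so the libraries below never see them in `clean_text`.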

In [None]:
# Sample paper

# Keywords from doing keywords matching
# ['Probability', 'Statistics', 'Graphs', 'Image Processing', 'Signal Processing']

url = "https://arxiv.org/pdf/0705.0043.pdf"
text = readPDF(url)
clean_text = cleanText(text)
clean_text

### 2.2 SpaCy

In [None]:
# Download "en_core_sci_lg"
# pip install scispacy
# pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz

In [None]:
import spacy

nlp = spacy.load("en_core_sci_lg")
kws = nlp(clean_text)
kws.ents

### 2.3 YAKE

In [None]:
# Install yake
# pip install yake

In [None]:
import yake

language = "en"
max_ngram_size = 2
deduplication_threshold = 0.1
numOfKeywords = 20
custom_kw_extractor = yake.KeywordExtractor(
    lan=language,
    n=max_ngram_size,
    dedupLim=deduplication_threshold,
    top=numOfKeywords,
    features=None,
)
yake_kws = custom_kw_extractor.extract_keywords(clean_text)
yake_kws
    
# The lower the score, the more relevant the keyword is

### 2.4 Rake-Nltk

In [None]:
# Install Rake-Nltk
# pip install rake-nltk

In [None]:
from rake_nltk import Rake
r = Rake(max_length=3)  # uses stopwords for english from NLTK, and all punctuation characters

# Keywords from the cleaned text of the paper
r.extract_keywords_from_text(clean_text)
rake_kws = r.get_ranked_phrases()
rake_kws

### 2.5 Gensim

In [None]:
# Install Gensim
# !python -m pip install gensim==3.8.3

In [None]:
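# gensim.summarization was removed in Gensim 4.0, hence the 3.8.3 pin above.
# Import under an alias so that the `keywords` dictionary from section 2.1 is not overwritten.
from gensim.summarization import keywords as gensim_keywords
gensim_kws = gensim_keywords(clean_text)
gensim_kws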

## Finding
The text that we read from online PDF files is not in good shape, e.g. there is often no white space between words. This makes it hard for the machine learning libraries to work well.
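
A rough way to flag such texts before running the extractors is to measure how many tokens are implausibly long. The sketch below is a heuristic, not a validated metric: the function name `long_token_ratio` and the 20-character threshold are assumptions chosen for illustration.

```python
def long_token_ratio(text, max_len=20):
    """Fraction of whitespace-separated tokens longer than max_len characters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(token) > max_len for token in tokens) / len(tokens)

good = "we study sparse graphs and their colorings"
glued = "westudysparsegraphsandtheircolorings in thispaperweprovethatthebound"

print(long_token_ratio(good))   # 0.0
print(long_token_ratio(glued))  # two of the three tokens exceed 20 characters
```

A high ratio suggests that word boundaries were lost during extraction. In that case it may be worth experimenting with the `x_tolerance` argument of pdfplumber's `extract_text`, which influences where spaces are inserted between characters.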