# Preliminaries

In [1]:
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re


# Reading Text

Convert the PDF file to plain text so that it is easier to pre-process.

In [2]:
filename = 'JavaBasics-notes.pdf'

# PdfFileReader, numPages, getPage and extractText are the names used by PyPDF2 releases before 3.0
# (3.0 replaced them with PdfReader, len(reader.pages), reader.pages[i] and extract_text()).
with open(filename, 'rb') as pdfFileObj:           # binary mode; the with-block closes the file afterwards
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)   # reader object that parses the PDF
    num_pages = pdfReader.numPages                 # page count, so that every page can be visited

    text = ""
    for page_number in range(num_pages):           # read the pages one by one
        pageObj = pdfReader.getPage(page_number)
        text += pageObj.extractText()

# PyPDF2 cannot read scanned (image-based) PDFs, so an empty string means nothing was extracted.
# In that case we run the OCR library textract to convert the scanned pages into text.
if text == "":
    # textract.process expects the path of a local file and returns bytes, hence the decode
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# Now we have a text variable which contains all the text derived from our PDF file.

In [3]:
text = text.encode('ascii', 'ignore').lower()  # drop non-ASCII characters and lowercase; text is now a bytes object

# Extracting Keywords

In [8]:
keywords = re.findall(r'[a-zA-Z]\w+', text.decode("utf-8"))  # a letter followed by one or more word characters
len(keywords)                               # total keyword occurrences in the document, duplicates included

3410
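
To make the two steps above concrete, here is a small sketch that applies the same normalization and the same regular expression to a made-up sentence. The sample string is an assumption for illustration only and is not taken from the PDF.

```python
import re

# Hypothetical sample sentence (not from the PDF); \u00e9 is a non-ASCII "e" with an accent
sample = "Caf\u00e9: Java has 8 primitive types, e.g. int x"

# Same normalization as above: drop non-ASCII characters, then lowercase (the result is bytes)
normalized = sample.encode('ascii', 'ignore').lower()

# Same pattern as above: a letter followed by one or more word characters,
# so bare numbers and single-letter tokens are skipped
tokens = re.findall(r'[a-zA-Z]\w+', normalized.decode("utf-8"))
print(tokens)
```

Note that the accented character is silently dropped rather than transliterated, and that one-letter words never become keywords.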

In [12]:
df = pd.DataFrame(list(set(keywords)),columns=['keywords'])  #Dataframe with unique keywords to avoid repetition in rows
df

Unnamed: 0,keywords
0,objectsuperthe
1,stacks
2,execute
3,jumping
4,etc
...,...
932,comments
933,limited
934,yet
935,predefined


# Calculating Weights

- In information retrieval, tf–idf (or TFIDF), short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling.

- __TF: Term Frequency__ measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:

__TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).__

- __IDF: Inverse Document Frequency__ measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore weigh down the frequent terms and scale up the rare ones by computing the following:

__IDF(t) = log_e(Total number of documents / Number of documents with term t in it).__
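
The two formulas can be sketched directly in Python. The three-document corpus below is made up for illustration, and the function names `tf`, `idf`, and `tf_idf` are hypothetical helpers, not part of this notebook's code.

```python
import math

# Hypothetical three-document corpus, already split into terms (not from the PDF)
documents = [
    ["java", "is", "a", "language"],
    ["java", "has", "primitive", "types"],
    ["python", "is", "a", "language"],
]

def tf(term, document):
    # Number of times the term appears, divided by the total number of terms
    return document.count(term) / len(document)

def idf(term, documents):
    # Natural log of (total documents / documents containing the term).
    # Assumes the term occurs in at least one document, otherwise this divides by zero.
    containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

print(tf("java", documents[0]))          # 1 of 4 terms
print(idf("java", documents))            # appears in 2 of 3 documents
print(idf("primitive", documents))       # appears in 1 of 3 documents, so it scores higher
```

With a single document the ratio inside the logarithm is 1 for every term that occurs, so IDF is 0 for all of them; the statistic only separates terms when the collection holds several documents.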

In [10]:
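def weightage(word, text, number_of_documents=1):
    # re.findall treats word as a pattern, so occurrences inside longer words are counted too
    word_list = re.findall(word, text)
    number_of_times_word_appeared = len(word_list)
    # len(text) is the length of the text in characters, used here as the document length
    tf = number_of_times_word_appeared / float(len(text))
    # the occurrence count stands in for "number of documents with term t in it";
    # with a single document this makes idf zero or negative
    idf = np.log(number_of_documents / float(number_of_times_word_appeared))
    tf_idf = tf * idf
    return number_of_times_word_appeared, tf, idf, tf_idf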

In [13]:
decoded_text = text.decode("utf-8")
# Compute all four values once per keyword, then spread them over four columns
weights = df['keywords'].apply(lambda word: weightage(word, decoded_text))
columns = ['number_of_times_word_appeared', 'tf', 'idf', 'tf_idf']
weights = pd.DataFrame(weights.tolist(), index=df.index, columns=columns)
for column in columns:
    df[column] = weights[column]

In [14]:
df = df.sort_values('tf_idf',ascending=True)
df.to_csv('Keywords.csv')
df.head(25)

Unnamed: 0,keywords,number_of_times_word_appeared,tf,idf,tf_idf
106,in,369,0.014913,-5.910797,-0.088146
287,re,258,0.010427,-5.55296,-0.057899
237,at,247,0.009982,-5.509388,-0.054996
328,on,243,0.009821,-5.493061,-0.053945
467,the,203,0.008204,-5.313206,-0.04359
867,an,199,0.008042,-5.293305,-0.042571
415,to,190,0.007679,-5.247024,-0.04029
628,or,167,0.006749,-5.117994,-0.034542
408,as,157,0.006345,-5.056246,-0.032082
815,java,135,0.005456,-4.905275,-0.026763
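
Short strings such as "in", "re", and "at" sit at the top of this list, and one likely contributor is that `re.findall(word, text)` also counts matches inside longer words. The sketch below compares that count with a whole-word count on a made-up sentence; the `\b` word-boundary variant is shown as a possible alternative and is not what the notebook ran.

```python
import re

# Hypothetical sample text (not from the PDF)
sample = "strings in java are interned in the string pool"

# Pattern used as-is: also matches "in" inside "strings", "interned" and "string"
substring_count = len(re.findall("in", sample))

# Word boundaries (\b) restrict the match to the standalone word "in"
whole_word_count = len(re.findall(r"\bin\b", sample))

print(substring_count, whole_word_count)
```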


# Extracting table from PDF

We will look at two libraries for table scraping:

1. tabula-py

tabula-py is a package that lets you both scrape tables from PDFs and convert PDFs directly into CSV files.

In [7]:
import tabula
tables = tabula.read_pdf(filename, pages = "all", multiple_tables = True)

Got stderr: Jul 23, 2020 7:26:01 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: Your current java version is: 1.8.0
Jul 23, 2020 7:26:01 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: To get higher rendering speed on old java 1.8 or 9 versions,
Jul 23, 2020 7:26:01 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   update to the latest 1.8 or 9 version (>= 1.8.0_191 or >= 9.0.4),
Jul 23, 2020 7:26:01 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   or
Jul 23, 2020 7:26:01 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
Jul 23, 2020 7:26:01 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")



The result stored in tables is a list of data frames, one for each table found in the PDF file. To search for all the tables in a file, we have to specify the parameters pages = "all" and multiple_tables = True.

We can also use tabula-py to convert a PDF file directly into a CSV. The line below finds all the tables in the PDF and writes them to a single CSV file. We add the parameter pages="all" to extract tables from every page of the PDF.

In [14]:
# output all the tables in the PDF to a CSV
tabula.convert_into(filename, "iris_all.csv", pages= "all")



In [17]:
import pandas as pd

df= pd.read_csv("iris_all.csv")

In [18]:
df

Unnamed: 0.1,Unnamed: 0,Primitive Type,Unnamed: 2,Description,Unnamed: 4
0,,boolean,,true/false,
1,,byte,,8 bits,
2,,char,,16 bits (UNICODE),
3,,short,,16 bits,
4,,int,,32 bits,
5,,long,,64 bits,
6,,float,,32 bits IEEE 754-1985,
7,,double,,64 bits IEEE 754-1985,
8,,element type,,,
9,,element 0,,,


The table above combines all the tables found in the PDF. To extract a specific table, we have to work out its row range manually and slice those rows out of the combined data.
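
As a rough sketch of such a manual split, the snippet below slices one table out of combined rows with plain Python. The rows are a shortened, hand-copied stand-in for the CSV output above, and the boundary rule (the first table ends where the "element type" row starts) is an assumption that would have to be checked against the real file.

```python
import csv
import io

# Shortened stand-in for the combined CSV (hypothetical, hand-copied subset)
combined_csv = """Primitive Type,Description
boolean,true/false
byte,8 bits
element type,
element 0,
"""

rows = list(csv.reader(io.StringIO(combined_csv)))
header, data = rows[0], rows[1:]

# Assumed boundary: the first table ends just before the "element type" row
boundary = next(i for i, row in enumerate(data) if row[0] == "element type")

first_table = data[:boundary]      # rows belonging to the primitive-type table
remaining_rows = data[boundary:]   # rows that came from later tables
print(first_table)
```

On the pandas data frame itself, the equivalent slice would be `df.iloc[:boundary]` once the boundary row has been identified.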

Note: tabula-py can also scrape all of the PDFs in a directory in just one line of code, and drop the tables from each into CSV files.

2. Camelot


Camelot is another option for scraping tables from PDFs.
Camelot has some additional dependencies, including Ghostscript. Once these are installed, we can use Camelot in much the same way as tabula-py to scrape PDF tables.

In [22]:
import camelot
tables = camelot.read_pdf(filename, pages = "1-end")

Displaying the result shows how many tables were extracted:

In [23]:
tables

<TableList n=14>

To access any of the extracted tables by index, use the code below:

In [25]:
# get the table at index 0
tables[0].df

Unnamed: 0,0,1,2
0,Primitive Type,Description,
1,boolean,true/false,
2,byte,8 bits,
3,char,16 bits (UNICODE),
4,short,16 bits,
5,int,32 bits,
6,long,64 bits,
7,float,32 bits IEEE 754-1985,
8,double,64 bits IEEE 754-1985,


One important feature of Camelot is that you also get a “parsing report” for each table giving an accuracy metric, the page the table was found on, and the percentage of whitespace present in the table.

In [26]:
tables[0].parsing_report

{'accuracy': 100.0, 'whitespace': 33.33, 'order': 1, 'page': 9}
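
One way to use these reports is to keep only the tables whose accuracy clears a threshold. The sketch below works on plain dictionaries shaped like the report above; the second report and the threshold value are made up for illustration.

```python
# Reports shaped like the parsing_report above; the second entry is hypothetical
reports = [
    {'accuracy': 100.0, 'whitespace': 33.33, 'order': 1, 'page': 9},
    {'accuracy': 62.5, 'whitespace': 70.0, 'order': 1, 'page': 12},
]

MIN_ACCURACY = 90.0  # assumed threshold, to be tuned for the documents at hand

# Keep the indices of the tables whose reported accuracy meets the threshold
good_indices = [i for i, report in enumerate(reports)
                if report['accuracy'] >= MIN_ACCURACY]
print(good_indices)
```

With Camelot, the list of reports could presumably be collected as `[table.parsing_report for table in tables]`, and the surviving indices used to pick entries from `tables`.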