# Find Keywords from the document
In this notebook, we are going to extract the keywords from the document shared in the link.

Original Document link is provided below.

Link: http://bit.ly/epo_keyword_extraction_document 

# Step 1:
Because the original file is in PDF format, it has to be converted to plain text before further processing.

This can be done in the following ways:

1. Use an online PDF-to-DOC converter, then use Microsoft Office to save the result in text format

2. Use the pdfminer package

For this notebook I used the first way (the second option gave an error while importing the package).

You can download the converted text file from the following link:

Link: https://docs.google.com/document/d/149UY32wdBu1VcH-7RtmIxAzGywqAQjeYKqFS7S6rq28/edit

Download it from the given link and save it in a folder.
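
For readers who want to try the second option anyway, the following is a minimal sketch and not the method used in this notebook. It assumes the maintained `pdfminer.six` distribution is installed (`pip install pdfminer.six`); the function name `pdf_to_text` and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def pdf_to_text(pdf_path, txt_path):
    """Extract the text of a PDF and save it as a UTF-8 text file."""
    # Imported lazily so the rest of the notebook still runs without the package.
    # Assumes the pdfminer.six distribution, which provides pdfminer.high_level.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    Path(txt_path).write_text(text, encoding="utf8")
    # Return the number of extracted characters as a quick sanity check
    return len(text)


# Hypothetical file names, for illustration only:
# pdf_to_text("JavaBasics-notes.pdf", "JavaBasics-notes.txt")
```

The extracted text may differ slightly from the converter output linked above, so the counts reported later in this notebook would not necessarily match.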




In [1]:
#Importing necessary packages
# For basic string,text operation import following
import re, string, unicodedata

# Natural Language toolkit (nltk) used for text processing
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer,PorterStemmer




# Step 2:
Importing the data and inspecting it

This step is important for getting insight from the data.

In [2]:
# Read the file
file_path = 'C:/Users/Dipti_B/Desktop/ds_keyword_assignment/JavaBasics-notes-encoded.txt'

# The with-block closes the file automatically, even if reading fails
with open(file_path, encoding="utf8") as f:
    # Keep the whole document in raw
    raw = f.read()

# Print the length of the text in characters;
# this count includes all spaces, special characters, and other data
print(len(raw))

40138


In [3]:
# Uncomment the next line to view the raw data
#print(raw)

# Step 3:
Preprocessing the data

Because we want to find keywords in the data, we first have to clean and filter it.

The data here is text, so we have to remove whitespace, special characters, symbols, stop words, etc.
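
Before going through the cells one by one, here is a compact sketch of the whole cleaning pipeline on a single made-up sentence. It uses only the standard library: the whitespace split is a crude stand-in for nltk's `word_tokenize`, and `SAMPLE_STOP_WORDS` is a tiny hypothetical stop list, not nltk's English list. Lemmatization is left out.

```python
import string

# Tiny illustrative stop list; the notebook uses nltk's full English list.
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "in", "and", "are", "by"}


def preprocess(text, stop_words=SAMPLE_STOP_WORDS):
    """Tokenize, strip punctuation, keep alphabetic tokens, lowercase, drop stop words."""
    # Whitespace split is a crude stand-in for nltk.word_tokenize.
    tokens = text.split()
    # Delete every punctuation character from each token
    table = str.maketrans("", "", string.punctuation)
    stripped = [t.translate(table) for t in tokens]
    # Keep purely alphabetic tokens and normalize them to lower case
    words = [w.lower() for w in stripped if w.isalpha()]
    return [w for w in words if w not in stop_words]


sample = "Java is a language; objects in Java are created by the 'new' operator!"
cleaned = preprocess(sample)
print(cleaned)
# ['java', 'language', 'objects', 'java', 'created', 'new', 'operator']
```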


In [4]:
# To find keywords, we first have to separate each word from the whole document
# This can be done with nltk's word_tokenize function
tokens = word_tokenize(raw)

In [5]:
# We have 5331 tokens in total
print(len(tokens))

5331


In [6]:
#We can see the tokens data
#print(tokens)

In [7]:
# Remove punctuation from each token; for keyword extraction, punctuation is noise in the data
# (string is already imported in the first cell)
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# Print the first 100 tokens, which are stored in stripped
#print(stripped[:100])
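
One side effect of deleting punctuation is worth seeing on a few tokens: the parts on either side of a punctuation mark are glued together. This is probably how merged entries such as 'jgurucom' and 'variablelength' end up in the results of this notebook. The sample tokens below are made up for illustration.

```python
import string

table = str.maketrans("", "", string.punctuation)

# Hypothetical tokens, similar in shape to what a tokenizer might produce
sample_tokens = ["jGuru.com", "variable-length", "(", "int", "x=5", "C/C++"]
stripped = [t.translate(table) for t in sample_tokens]
print(stripped)
# ['jGurucom', 'variablelength', '', 'int', 'x5', 'CC']

# Empty strings and tokens that still contain digits fail isalpha()
kept = [w for w in stripped if w.isalpha()]
print(kept)
# ['jGurucom', 'variablelength', 'int', 'CC']
```

If such merged tokens are unwanted, one alternative is to replace punctuation with a space and split again, at the cost of producing fragments such as 'com'.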

In [8]:
# Keep only the tokens that consist entirely of alphabetic characters
# (this also drops tokens left empty by punctuation removal)

words=[word for word in stripped if word.isalpha()]
#print(words[:100])

In [9]:
# Print the number of punctuation-free words
# to see how many unwanted tokens were filtered out
#print(len(words))

In [10]:
# Convert all characters to lower case for further processing
# This is one form of normalization

words_lower=[w.lower() for w in words]
#print(words_lower[:100])

In [11]:
# Removing stop words
# we can see the list of stop words by printing it
stop_words = stopwords.words('english')
#print(stop_words)


In [13]:
# Filter out stop words; a set makes each membership test fast
stop_words = set(stopwords.words('english'))
words_stopw_rem = [w for w in words_lower if w not in stop_words]
#print(words_stopw_rem[:100])

In [14]:
#printing length of words after removing stop words
#print(len(words_stopw_rem))

In [15]:
# Lemmatizing is the process of converting each word to its dictionary form (lemma).
# It is important because it normalizes the different inflected forms of a word.
lemmatizer = WordNetLemmatizer()
words_lemmatized = [lemmatizer.lemmatize(word) for word in words_stopw_rem]

In [16]:
#print(len(words_lemmatized))
#print(words_lemmatized)

In [17]:
#print(len(set(words_lemmatized)))
# List the lemmatized words in reverse alphabetical order
sorted(words_lemmatized, reverse=True)

['zeroextend',
 'yet',
 'yet',
 'www',
 'www',
 'www',
 'www',
 'written',
 'written',
 'write',
 'write',
 'write',
 'write',
 'would',
 'would',
 'would',
 'would',
 'would',
 'would',
 'would',
 'would',
 'would',
 'would',
 'world',
 'world',
 'within',
 'within',
 'within',
 'windowmanager',
 'window',
 'window',
 'widthw',
 'width',
 'width',
 'width',
 'width',
 'wide',
 'wide',
 'whenever',
 'well',
 'web',
 'web',
 'way',
 'way',
 'vspace',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'void',
 'visible',
 'virus',
 'virtually',
 'view',
 'via',
 'version',
 'version',
 'version',
 'version',
 'vector',
 'vector',
 'variablename',
 'variablename',
 'variablelength',
 'variable',
 'variable',
 'variable',
 'variable',
 'variable',
 'variable',
 'variable',
 'variable',
 'variable',
 'variable',
 'valuev',
 'value',
 'value',
 'value',
 'value',
 'value',
 'value',
 'value',
 'value',
 'valid',
 'v',
 'using

# Step 4:
    
Getting insight from data

All preprocessing tasks are done. Now we can explore this data to find the keywords, which is our final goal.

We can also calculate:

1. The lexical richness of the text

2. How frequently a specific word is used, as a measure of its importance

3. The count of each word in this document



In [18]:
# Calculate a measure of the lexical richness of the text: distinct words as a percentage of all words
# The result of about 28.7% shows that most of the words in the document are repeated
len(set(words_lemmatized))*100 / len(words_lemmatized)

28.73227689741451
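
To make the measure concrete, here is a small worked example on a made-up word list. The helper name `lexical_richness` is hypothetical; it computes the same ratio as the cell above and adds a guard for an empty list.

```python
def lexical_richness(tokens):
    """Distinct tokens as a percentage of all tokens."""
    if not tokens:
        return 0.0
    return 100 * len(set(tokens)) / len(tokens)


sample = ["java", "object", "java", "class", "java", "object", "method", "java"]
# 4 distinct words (java, object, class, method) out of 8 words in total
print(lexical_richness(sample))
# 50.0
```

A low value means that a few words are repeated often, which is what makes frequency a usable signal for keywords.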

In [19]:
# How often a word occurs in the text, as a percentage of all words
# Example: java
100 * words_lemmatized.count('java') / len(words_lemmatized)

5.129274395329441

In [20]:
words_freqDist = nltk.FreqDist(words_lemmatized)


In [21]:
# Output the top 50 words
# It shows how many times each word is repeated in the document

for word, frequency in words_freqDist.most_common(50):
    print(u'{}:{}'.format(word, frequency))
    
    


java:123
object:53
new:52
basic:48
button:48
data:43
applet:41
int:39
code:38
c:38
method:37
array:33
b:33
class:32
string:28
jgurucom:23
right:23
reserved:23
example:22
public:22
program:20
type:19
comment:18
pointer:17
cc:16
return:16
language:15
use:15
memory:15
null:15
void:14
make:13
primitive:13
application:12
browser:12
reference:12
operator:12
element:12
garbage:11
allocate:11
system:10
runtime:10
file:10
variable:10
following:10
may:10
parameter:10
stack:10
would:10
collection:9
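
`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting above can be tried without nltk. The following sketch uses a made-up word list to show the count of each word and the share of one word in the text.

```python
from collections import Counter

sample = ["java", "object", "java", "class", "java", "object"]
freq = Counter(sample)

# The two most frequent words with their counts
top_two = freq.most_common(2)
for word, count in top_two:
    print(f"{word}:{count}")
# java:3
# object:2

# Share of one word as a percentage of all words
java_share = 100 * freq["java"] / sum(freq.values())
print(java_share)
# 50.0
```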


# Step 5:

This is the final step.

After finding the number of occurrences of each word, we can find the weight of each word, that is, its share of all words in the document as a percentage.

The document here is about the Java language.

Constraining the minimum word length would remove 'c', which is itself the name of a language, so no length filter is applied.

The keywords are printed according to their weights and saved in CSV format.

The CSV file is stored in the same folder as this notebook.
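
If single-letter noise is unwanted, one possible compromise is a length filter with an allowlist that keeps 'c'. The sketch below is not what this notebook does; the function `keyword_weights`, its parameters `min_length` and `keep`, and the sample data are all hypothetical. It uses the standard `csv` module and writes to an in-memory buffer instead of a file.

```python
import csv
import io
from collections import Counter


def keyword_weights(tokens, min_length=1, keep=()):
    """Return (word, weight in %) pairs sorted by descending weight.

    Words shorter than min_length are dropped unless they are listed in keep.
    Weights are computed against the full token count, before filtering.
    """
    total = len(tokens)
    counts = Counter(tokens)
    return [
        (word, round(100 * count / total, 2))
        for word, count in counts.most_common()
        if len(word) >= min_length or word in keep
    ]


sample = ["java", "c", "b", "java", "object", "c", "java", "object"]
rows = keyword_weights(sample, min_length=2, keep={"c"})
print(rows)
# [('java', 37.5), ('c', 25.0), ('object', 25.0)]

# Write the rows as CSV; the buffer stands in for a real file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Keywords", "Weightage in %"])
writer.writerows(rows)
print(buffer.getvalue())
```

Here 'b' is dropped by the length filter while 'c' survives because it is in the allowlist.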



In [23]:
# Save the output in a CSV file
# pandas is required to turn the FreqDist output into a table
import pandas as pd

# One row per word, with its raw count in the 'Count' column
df = pd.DataFrame.from_dict(words_freqDist, columns=['Count'], orient='index').reset_index()
sorted_df = df.sort_values(by='Count', ascending=False).reset_index(drop=True)
# Convert the raw counts to a percentage of all lemmatized words
sorted_df['Count'] = round((sorted_df['Count'] * 100) / len(words_lemmatized), 2)
sorted_df.rename(columns={"index": "Keywords", "Count": "Weightage in %"}, inplace=True)
# 'keywords_1.csv' lists every keyword with its weightage in the document
sorted_df.to_csv("keywords_1.csv")
# Show the first 50 keywords in the document, arranged by weightage
print(sorted_df[:50])


       Keywords  Weightage in %
0          java            5.13
1        object            2.21
2           new            2.17
3        button            2.00
4         basic            2.00
5          data            1.79
6        applet            1.71
7           int            1.63
8          code            1.58
9             c            1.58
10       method            1.54
11        array            1.38
12            b            1.38
13        class            1.33
14       string            1.17
15     reserved            0.96
16        right            0.96
17     jgurucom            0.96
18      example            0.92
19       public            0.92
20      program            0.83
21         type            0.79
22      comment            0.75
23      pointer            0.71
24           cc            0.67
25       return            0.67
26         null            0.63
27       memory            0.63
28     language            0.63
29          use            0.63
30      