# Information Retrieval (IR)
### Goal of lesson
- Learn what Information Retrieval is
- Understand topic modeling of documents
- How to use Term Frequency and understand the limitations
- Implement Term Frequency by Inverse Document Frequency (TF-IDF)

### What is Information Retrieval (IR)
- The task of finding relevant documents in response to a user query
- Web search engines are the most visible IR applications ([wiki](https://en.wikipedia.org/wiki/Information_retrieval))
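
As a rough illustration of "finding relevant documents in response to a user query", the sketch below ranks documents by how many distinct query terms they contain. The `search` function and the sample documents are hypothetical and only meant to make the idea concrete; real search engines use far richer scoring.

```python
def search(query, documents):
    """Rank document ids by how many distinct query terms each contains."""
    query_terms = set(query.lower().split())
    hits = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap:
            hits.append((doc_id, overlap))
    # Highest overlap first; documents with no matching term are left out.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical mini collection, keyed by document id.
documents = {
    "a": "Python developer with Flask experience",
    "b": "Mechanical engineer",
    "c": "Senior Python and Java developer",
}
print(search("java developer", documents))  # [('c', 2), ('a', 1)]
```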

### Topic Modeling
- Models for discovering the topics in a set of documents
    - e.g., it provides us with methods to organize, understand and summarize large collections of textual information.
- Topic modeling can be described as a method for finding a group of words that best represents the information.

## Approach 1: Term Frequency

### Term Frequency
- The number of times a term occurs in a document is called its term frequency ([wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency))

$\text{tf}(t, d) = f_{t, d}$: The number of times term $t$ occurs in document $d$.

- There are other ways to define term frequency (see [wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency_2))
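
A minimal sketch of the raw-count definition above, using `collections.Counter` from the standard library. The helper name `term_frequency` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """Return a dict mapping each term to its raw count in the token list."""
    return dict(Counter(tokens))


# Hypothetical example document, lowercased and split on whitespace.
document = "the cat sat on the mat".lower().split()
tf = term_frequency(document)
print(tf["the"])  # 2
print(tf["cat"])  # 1
```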

> #### Programming Notes:
> - Libraries used
>     - [**nltk**](https://www.nltk.org) - Natural Language Toolkit
>     - [**os**](https://docs.python.org/3/library/os.html) - Miscellaneous operating system interfaces
>     - [**math**](https://docs.python.org/3/library/math.html) - Mathematical functions
> - Functionality and concepts used
>     - **List/Dict Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**sorted**](https://docs.python.org/3/howto/sorting.html) - Return a new sorted list from an iterable
>     - [**lambda**](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) - Small anonymous functions

In [4]:
import os
import nltk
import math
import PyPDF2
import ssl

In [54]:
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')

[nltk_data] Error loading punkt: <urlopen error [Errno 8] nodename nor
[nltk_data]     servname provided, or not known>


False

In [8]:
# Create the output folder if it does not exist yet.
os.makedirs('resumes_2/data/data/texts', exist_ok=True)

## NOTE: THIS DID NOT WORK AS EXPECTED ##

for filename in os.listdir('resumes_2/data/data/ENGINEERING'):
    with open(f'resumes_2/data/data/ENGINEERING/{filename}', 'rb') as pdfDoc:
        # Reader for the PDF file
        # (PdfFileReader, numPages and extractText are the legacy PyPDF2 names)
        pdfReader = PyPDF2.PdfFileReader(pdfDoc)

        # Number of pages in this PDF file
        x = pdfReader.numPages
        # Select the last page: index x-1, because Python indexing starts at 0.
        # Only this page is extracted, not the whole document.
        page = pdfReader.pages[x-1]

        # Text of the selected page
        text = page.extractText()

    # Write the extracted text next to the other outputs, as <name>.txt
    with open(f'resumes_2/data/data/texts/{filename[:-3]}txt', 'w') as fileTxt:
        fileTxt.write(text)

In [7]:
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

try:
    os.mkdir('resumes_2/data/data/texts')
except OSError as e:
    if e.errno != 17:
        raise


for filename in os.listdir('resumes_2/data/data/ENGINEERING'):
    
    pdf_filename = f'resumes_2/data/data/ENGINEERING/{filename}'
    txt_filename = f'resumes_2/data/data/texts/{filename[:-3]}txt'

    device = PDFPageAggregator(PDFResourceManager(), laparams=LAParams())
    interpreter = PDFPageInterpreter(PDFResourceManager(), device)

    doc = PDFDocument()
    parser = PDFParser(open(pdf_filename, 'rb'))
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize()

    # Check whether the document allows text extraction; do not skip this check
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        with open(txt_filename, 'w', encoding="utf-8") as fw:
            #print("num page:{}".format(len(list(doc.get_pages()))))
            for page in doc.get_pages():
                interpreter.process_page(page)
                # Get the LTPage object for this page
                layout = device.get_result()
                # layout is an LTPage object holding the objects parsed from this page,
                # typically LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):
                        results = x.get_text()
                        fw.write(results)

In [4]:
corpus = {}

for filename in os.listdir('resumes_2/data/data/texts'):
    with open(f'resumes_2/data/data/texts/{filename}') as f:
        #print(f'{filename} was open')
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]
        
        freq = {word: content.count(word) for word in set(content)}
        
        corpus[filename] = freq

In [5]:
for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

In [6]:
for filename in corpus:
    print(filename)
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')

25425322.txt
  and: 48
  of: 27
  to: 27
  manufacturing: 15
  for: 15
21629057.txt
  and: 73
  to: 35
  the: 34
  as: 20
  of: 19
64468610.txt
  and: 46
  to: 23
  of: 19
  team: 16
  for: 16
11981094.txt
  and: 51
  to: 22
  for: 18
  team: 14
  of: 13
25608963.txt
  and: 48
  to: 33
  the: 15
  for: 15
  of: 13
35389360.txt
  and: 68
  expert: 23
  software: 21
  support: 19
  hardware: 17
19396040.txt
  and: 29
  of: 18
  for: 16
  to: 15
  process: 13
12488356.txt
  and: 49
  to: 35
  of: 20
  the: 20
  for: 18
27040860.txt
  and: 22
  to: 14
  in: 12
  of: 9
  the: 8
24647794.txt
  and: 31
  the: 23
  a: 16
  of: 13
  in: 13
12011623.txt
  and: 76
  of: 25
  in: 25
  data: 20
  the: 19
12022566.txt
  and: 30
  of: 26
  process: 17
  the: 16
  in: 15
28628090.txt
  and: 111
  of: 48
  to: 44
  in: 29
  for: 28
17103000.txt
  the: 30
  and: 28
  of: 26
  engineering: 19
  to: 16
12518008.txt
  and: 66
  to: 37
  in: 27
  testing: 18
  product: 17
32985311.txt
  of: 14
  to: 14
  th

### Problem: Stop or Function Words
- Words that have little meaning on their own ([wiki](https://en.wikipedia.org/wiki/Stop_word))
- Examples: am, by, do, is, which, ...
- Student exercise: Remove function words and see the result (HINT: nltk has a list of stopwords)
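
One possible shape for the exercise is sketched below. To stay runnable without any download it assumes a tiny hand-written stop word set (`STOP_WORDS`) and a hypothetical helper `top_terms`; with NLTK you would instead build the set from `nltk.corpus.stopwords.words('english')` after downloading the `stopwords` package.

```python
from collections import Counter

# Assumption: a tiny hand-written stop word set, used only so this sketch
# runs offline. NLTK's English list is much longer.
STOP_WORDS = {"and", "of", "to", "the", "for", "in", "a", "as"}


def top_terms(tokens, n=3, stop_words=STOP_WORDS):
    """Return the n most frequent terms after dropping stop words."""
    kept = [token for token in tokens if token not in stop_words]
    return Counter(kept).most_common(n)


# Hypothetical sample text, already lowercased.
tokens = "the design and testing of the engine and the testing of parts".split()
result = top_terms(tokens, n=2)
print(result[0])  # ('testing', 2)
```

Without the filter, `the` would be the most frequent term in this sample; with it, a content word comes out on top.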

## Approach 2: TF-IDF
- TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. ([wiki](https://en.wikipedia.org/wiki/Tf–idf))

### Inverse Document Frequency
- Measure of how common or rare a word is across documents

$\text{idf}(t, D) = \log{\frac{N}{|\{d\in D : t\in d\}|}} = \log{\frac{\text{Total Documents}}{\text{Number of Documents Containing "term"}}}$
- $D$: All documents in the corpus
- $N$: total number of documents in the corpus $N = |D|$
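
The definition can be sketched directly with `math.log` (natural logarithm). The function name `inverse_document_frequency` and the three-document corpus are assumptions for illustration; the function raises an error for a term that occurs in no document, since the formula would otherwise divide by zero.

```python
import math


def inverse_document_frequency(term, documents):
    """idf(t, D) = log(N / number of documents containing t)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / containing)


# Hypothetical corpus of three tokenized documents.
documents = [
    ["python", "and", "sql"],
    ["java", "and", "spring"],
    ["python", "and", "flask"],
]
print(inverse_document_frequency("and", documents))     # 0.0, occurs in every document
print(inverse_document_frequency("python", documents))  # log(3/2), about 0.405
print(inverse_document_frequency("java", documents))    # log(3/1), about 1.099
```

A word that occurs in every document gets an IDF of 0, which is what later pushes common words out of the ranking.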

### TF-IDF
- Ranks how important each word is in a document by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF)

$\text{tf-idf}(t, d) = \text{tf}(t, d)\cdot \text{idf}(t, D)$
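
Putting the two factors together, a compact sketch might look like the following. It assumes raw counts for TF and the natural logarithm for IDF, and computes document frequencies once with `collections.Counter`; the function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    scores = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts for this document
        scores.append({term: count * idf[term] for term, count in tf.items()})
    return scores


# Hypothetical corpus of three short, already tokenized documents.
documents = [
    "sap hana consultant sap".split(),
    "python developer and consultant".split(),
    "java developer and tester".split(),
]
scores = tf_idf(documents)
ranked = sorted(scores[0].items(), key=lambda item: item[1], reverse=True)
print(ranked[0][0])  # sap
```

Here `sap` ranks first for the first document because it is both frequent there (TF of 2) and absent from the other documents (high IDF).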

### Example

- Document 1: *This is the sample of the day*
- Document 2: *This is another sample of the day*

In [9]:
doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()

In [10]:
corpus = [doc1, doc2]
corpus

[['This', 'is', 'the', 'sample', 'of', 'the', 'day'],
 ['This', 'is', 'another', 'sample', 'of', 'the', 'day']]

In [11]:
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

In [12]:
tf1

{'This': 1, 'of': 1, 'sample': 1, 'is': 1, 'day': 1, 'the': 2}

In [13]:
tf2

{'This': 1, 'of': 1, 'sample': 1, 'is': 1, 'another': 1, 'day': 1, 'the': 1}

In [19]:
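term = 'another'
# idf(t, D) = log(N / number of documents containing t), with N = 2 documents
idf = math.log(len(corpus)/sum(term in doc for doc in corpus))

tf1.get(term, 0)*idf, tf2.get(term, 0)*idf

(0.0, 0.6931471805599453)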

In [20]:
a = 'asb.pdf'
print(a[:-3])

asb.


In [2]:
import csv

In [51]:
categories = set()
test = ""
with open('UpdatedResumeDataSet.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            test = row['Resume']
        categories.add(row["Category"])
        line_count += 1
    print(f'Processed {line_count} lines.')
print(test)

Processed 962 lines.
Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, NaÃ¯ve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details 

Data Science Assurance Associate 

Data Science Assurance Associate - Ernst & Young LLP
Skill Details 
JAVASCRIPT- Exprience - 24 months
jQuery- Exprience - 24 months
Python- Exprience - 24 monthsCompany Details 
company - Ernst & Young LLP
description - Fraud Investigations and Dis

In [39]:
to_exclude = ['Advocate','Arts','Civil Engineer','HR','Health and fitness','PMO','Operations Manager','Sales']
auto_id = 1
for c in categories:
    if c not in to_exclude:
        print(auto_id, c)
        auto_id += 1

1 Java Developer
2 Mechanical Engineer
3 Database
4 Network Security Engineer
5 DevOps Engineer
6 Python Developer
7 Electrical Engineering
8 Data Science
9 Hadoop
10 Blockchain
11 ETL Developer
12 Testing
13 DotNet Developer
14 SAP Developer
15 Automation Testing
16 Business Analyst
17 Web Designing


In [6]:
categories = [
 'Automation Testing',
 'Blockchain',
 'Business Analyst',
 'Data Science',
 'Database',
 'DevOps Engineer',
 'DotNet Developer',
 'ETL Developer',
 'Electrical Engineering',
 'Hadoop',
 'Java Developer',
 'Mechanical Engineer',
 'Network Security Engineer',
 'Python Developer',
 'SAP Developer',
 'Testing',
 'Web Designing']

In [9]:
corpus = {}
auto_id = 1
with open('UpdatedResumeDataSet.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if row["Category"] in categories:
            corpus[auto_id] = [word.lower() for word in nltk.word_tokenize(row["Resume"]) if word.isalpha()]
            line_count += 1
            auto_id += 1
    print(f'Processed {line_count} lines.')
#corpus[1]

Processed 698 lines.


['skills',
 'programming',
 'languages',
 'python',
 'pandas',
 'numpy',
 'scipy',
 'matplotlib',
 'sql',
 'java',
 'machine',
 'learning',
 'regression',
 'svm',
 'bayes',
 'knn',
 'random',
 'forest',
 'decision',
 'trees',
 'boosting',
 'techniques',
 'cluster',
 'analysis',
 'word',
 'embedding',
 'sentiment',
 'analysis',
 'natural',
 'language',
 'processing',
 'dimensionality',
 'reduction',
 'topic',
 'modelling',
 'lda',
 'nmf',
 'pca',
 'neural',
 'nets',
 'database',
 'visualizations',
 'mysql',
 'sqlserver',
 'cassandra',
 'hbase',
 'elasticsearch',
 'plotly',
 'kibana',
 'matplotlib',
 'ggplot',
 'tableau',
 'others',
 'regular',
 'expression',
 'html',
 'css',
 'angular',
 'logstash',
 'kafka',
 'python',
 'flask',
 'git',
 'docker',
 'computer',
 'vision',
 'open',
 'cv',
 'and',
 'understanding',
 'of',
 'deep',
 'details',
 'data',
 'science',
 'assurance',
 'associate',
 'data',
 'science',
 'assurance',
 'associate',
 'ernst',
 'young',
 'llp',
 'skill',
 'details',


In [11]:
words = set()

for a_id in corpus:
    words.update(corpus[a_id])
words

{'costs',
 'ice',
 'ciso',
 'laundering',
 'sinhagad',
 'frequently',
 'your',
 'larsen',
 'linux',
 'ems',
 'cometchat',
 'america',
 'nuclear',
 'capable',
 'arranging',
 'quota',
 'appointments',
 'merger',
 'bias',
 'technologies',
 'ios',
 'gathered',
 'measure',
 'ddsm',
 'daily',
 'severity',
 'allocate',
 'uploaded',
 'schrader',
 'return',
 'financials',
 'maharashatra',
 'barclaycard',
 'baseline',
 'rivzi',
 'mediation',
 'sure',
 'fm',
 'among',
 'approximately',
 'alamuri',
 'want',
 'qms',
 'compressions',
 'middleware',
 'kakinada',
 'future',
 'mappings',
 'conflict',
 'inovics',
 'methodology',
 'wmdatalake',
 'utilized',
 'sybase',
 'arrive',
 'contributed',
 'removing',
 'sciene',
 'evaluating',
 'properties',
 'fields',
 'jalgaon',
 'months',
 'oem',
 'revoking',
 'collaborate',
 'autoit',
 'making',
 'specially',
 'produced',
 'massachusetts',
 'tripping',
 'lsa',
 'speaker',
 'declaration',
 'coming',
 'pid',
 'st',
 'interns',
 'scraping',
 'dbartisan',
 'ftp',
 

In [12]:
tf = {}

for a_id in corpus:
    tf[a_id] = {word: corpus[a_id].count(word) for word in words}

In [13]:
idf = {}

for word in words:
    freq = sum(word in corpus[a_id] for a_id in corpus)
    idf[word] = math.log(len(corpus)/freq)

In [14]:
tfidf = {}

for a_id in corpus:
    tfidf[a_id] = [(word, tf[a_id][word]*idf[word]) for word in words]

In [15]:
for a_id in corpus:
    tfidf[a_id] = sorted(tfidf[a_id], key=lambda x:x[1], reverse=True)

In [17]:
for a_id in corpus:
    print(a_id)
    for term, score in tfidf[a_id][:10]:
        print(f'  {term}: {score}')

1
  topic: 21.228170048841633
  fraud: 19.54580886573557
  review: 18.314754290266325
  modelling: 18.01890061797966
  questions: 16.982536039073306
  kibana: 15.485774224927445
  sentiment: 15.102521522090365
  search: 14.209947316833526
  analytics: 13.304914574856248
  sqlserver: 12.736902029304979
2
  matelabs: 15.485774224927445
  outlier: 10.323849483284963
  scientist: 8.126624905948745
  deployed: 6.814658951238951
  less: 6.6869825952487965
  than: 6.583083895180032
  model: 5.8212658860719735
  detection: 5.8212658860719735
  year: 5.228791802964545
  ml: 5.161924741642482
3
  manipal: 9.877562380656544
  than: 9.874625842770048
  less: 8.915976793665061
  electrical: 7.168008058208102
  value: 7.007393330077899
  year: 6.971722403952726
  who: 5.769314913265452
  they: 5.272392194668453
  gitbash: 5.161924741642482
  patrons: 5.161924741642482
4
  sap: 54.07569563719028
  hana: 48.75974943569247
  industry: 37.37078277932672
  analytics: 33.26228643714062
  consultant: 26.01

In [19]:
rel = idf
rel = dict(sorted(rel.items(), key=lambda i: i[1], reverse=True))
rel

{'allocate': 6.548219102762372,
 'financials': 6.548219102762372,
 'conflict': 6.548219102762372,
 'catering': 6.548219102762372,
 'force': 6.548219102762372,
 'transition': 6.548219102762372,
 'eligible': 6.548219102762372,
 'moss': 6.548219102762372,
 'polaris': 6.548219102762372,
 'cript': 6.548219102762372,
 'economics': 6.548219102762372,
 'emerging': 6.548219102762372,
 'vice': 6.548219102762372,
 'introducing': 6.548219102762372,
 'cucumber': 6.548219102762372,
 'engagements': 6.548219102762372,
 'aria': 6.548219102762372,
 'revise': 6.548219102762372,
 'north': 6.548219102762372,
 'steering': 6.548219102762372,
 'connector': 6.548219102762372,
 'prevention': 6.548219102762372,
 'vqi': 6.548219102762372,
 'wwf': 6.548219102762372,
 'mba': 6.548219102762372,
 'genentech': 6.548219102762372,
 'expand': 6.548219102762372,
 'exercises': 6.548219102762372,
 'lotus': 6.548219102762372,
 'linguistic': 6.548219102762372,
 'venturus': 6.548219102762372,
 'wearable': 6.548219102762372,
 '

In [21]:
rel = idf
rel = dict(sorted(rel.items(), key=lambda i: i[1], reverse=False))
rel

{'description': 0.0,
 'skill': 0.0,
 'company': 0.0,
 'details': 0.0,
 'and': 0.07124673987268973,
 'of': 0.09302053942225041,
 'to': 0.11527901002319293,
 'in': 0.1168880208288938,
 'monthscompany': 0.1869166251893772,
 'exprience': 0.1869166251893772,
 'the': 0.2437703003403912,
 'maharashtra': 0.2886376386974495,
 'months': 0.3177376551838906,
 'for': 0.3620104788618787,
 'on': 0.39975080684472575,
 'project': 0.42792168381142215,
 'education': 0.43012190472102446,
 'a': 0.4367517632596941,
 'with': 0.47287307167368836,
 'as': 0.49848564753041463,
 'skills': 0.5975765501746456,
 'university': 0.6027984941557973,
 'system': 0.618629959372478,
 'data': 0.648321749179881,
 'team': 0.6510652351256319,
 'engineering': 0.6593411444294915,
 'january': 0.7163366254788557,
 'working': 0.7281361724100106,
 'developer': 0.7552054943782287,
 'management': 0.767475586970043,
 'technical': 0.770566779539716,
 'is': 0.770566779539716,
 'using': 0.7736675572179642,
 'application': 0.776777979632356

In [54]:
mustBe = ["c++","c#","python","java","javascript","docker","virtualization","angular","react","vue","django","flask",
          "ai","deep","learning","machine","oop","restrictions","algorithms","testing","optimization","data","mining",
         "backend","frontend","developer","scrum","devops","sap","evolutive","structure","spring","boot","windows",
         "linux","os","ios","mobile","native","kernel","netbeans","visual","studio","net","framework","constraints"]
notExist = []
for w in mustBe:
    if w not in words:
        notExist.append(w)
        print(f'F,{w} is missing')

F,c++ is missing
F,c# is missing
F,virtualization is missing
F,vue is missing
F,restrictions is missing
F,evolutive is missing
F,kernel is missing


In [55]:
len(words)

5540

In [65]:
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
    
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

words_2 = set()
for w in words:
    if w not in stop_words:
        words_2.add(w)

tf = {}

for filename in corpus:
    tf[filename] = {word: corpus[filename].count(word) for word in words_2}
    
idf = {}

for word in words_2:
    freq = sum(word in corpus[filename] for filename in corpus)
    idf[word] = math.log(len(corpus)/freq)
    
tfidf = {}

for filename in corpus:
    tfidf[filename] = [(word, tf[filename][word]*idf[word]) for word in words_2]
    
for filename in corpus:
    tfidf[filename] = sorted(tfidf[filename], key=lambda x:x[1], reverse=True)
    
for filename in corpus:
    print(filename)
    for term, score in tfidf[filename][:10]:
        print(f'  {term}: {score}')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alejandrocaicedo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


1
  topic: 21.228170048841633
  fraud: 19.54580886573557
  review: 18.314754290266325
  modelling: 18.01890061797966
  questions: 16.982536039073306
  kibana: 15.485774224927445
  sentiment: 15.102521522090365
  search: 14.209947316833526
  analytics: 13.304914574856248
  sqlserver: 12.736902029304979
2
  matelabs: 15.485774224927445
  outlier: 10.323849483284963
  scientist: 8.126624905948745
  deployed: 6.814658951238951
  less: 6.6869825952487965
  model: 5.8212658860719735
  detection: 5.8212658860719735
  year: 5.228791802964545
  ml: 5.161924741642482
  prophet: 5.161924741642482
3
  manipal: 9.877562380656544
  less: 8.915976793665061
  electrical: 7.168008058208102
  value: 7.007393330077899
  year: 6.971722403952726
  gitbash: 5.161924741642482
  patrons: 5.161924741642482
  disclosed: 5.161924741642482
  prove: 5.161924741642482
  segment: 5.161924741642482
4
  sap: 54.07569563719028
  hana: 48.75974943569247
  industry: 37.37078277932672
  analytics: 33.26228643714062
  cons

In [58]:
len(words) - len(words_2)

115

In [64]:
mustBe = ["c++","c#","c","python","java","javascript","docker","virtualization","angular","react","vue","django","flask",
          "ai","deep","learning","machine","oop","restrictions","algorithms","testing","optimization","data","mining",
         "backend","frontend","developer","scrum","devops","sap","evolutive","structure","spring","boot","windows",
         "linux","os","ios","mobile","native","kernel","netbeans","visual","studio","net","framework","constraints"]
notExist = []
countNE = 0
for w in mustBe:
    if w not in words_2:
        notExist.append(w)
        countNE += 1
print(f'{countNE} are missing')
print(notExist)

7 are missing
['c++', 'c#', 'virtualization', 'vue', 'restrictions', 'evolutive', 'kernel']
