**AUTHOR : ANAND VEERARAHAVAN**

**Information retrieval** (IR) is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources.

**APPLICATIONS OF INFORMATION RETRIEVAL**

Where is Information Retrieval used?

**Use Case 1: Digital Library**

A digital library is a library in which collections of data are stored in digital formats and are accessible by computer. The digital content may be stored locally or accessed remotely via computer networks. A digital library is a type of information retrieval system.

**Use Case 2: Search Engine**

A search engine is one of the most practical applications of information retrieval techniques to large-scale text collections.

**Use Case 3: Image Retrieval**

An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images.

![Precision and recall in information retrieval](https://jamesmccaffrey.files.wordpress.com/2016/10/precisionandrecall_informationretrieval.jpg)

## Case Study : Retrieving Similar Publications @ UNM College of Pharmacy and School of Medicine

**Goal: Find similar papers using Title and Abstract text**

In this case study we use the pandas, NLTK, NumPy, and scikit-learn libraries, plus Biopython to read the data, to find similar articles published in PubMed using k-nearest neighbors.

Steps:
* Clean the text:

 * Replace characters such as \n and \x0c with white-space

 * Apply unicode

 * Make everything lower case
* Find the important keywords of each document using tf-idf
* Apply a k-nearest-neighbors model to the tf-idf weights to find similar papers
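
To make the tf-idf step concrete, here is a minimal sketch that computes the weights by hand on a toy corpus. The three documents and the helper name `tfidf_row` are made up for illustration. The formula is the smoothed IDF with L2 normalization that scikit-learn documents as the `TfidfVectorizer` default, so read it as an approximation of what the vectorizer does later in this notebook, not as its implementation.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the PubMed data set)
docs = [
    "health care for rural patients",
    "health outcomes in chronic patients",
    "cotton leaf spot disease",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_row(tokenized[0])
top_terms = sorted(weights, key=weights.get, reverse=True)
print(top_terms)
```

Terms that occur in only one document ("care", "for", "rural") end up with a higher weight than terms shared across documents ("health", "patients"), which is the property the keyword step relies on.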

In [None]:
! pip install biopython



Import the libraries

In [None]:
import pandas as pd
import sklearn
import numpy as np
import nltk
import re
from Bio import Medline

Uploading the dataset

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving pubmed-university-set.txt to pubmed-university-set (1).txt
User uploaded file "pubmed-university-set.txt" with length 33099469 bytes


In [None]:
# Function that uses the Medline module from
# the Biopython library to parse and read MEDLINE
# formatted files. Results are stored in a Pandas
# DataFrame. Records missing a title, author list,
# or abstract are skipped.
def read_medline_data(filename):
    rows = []
    # The context manager closes the file once parsing is done
    with open(filename, 'r') as handle:
        for rec in Medline.parse(handle):
            try:
                rows.append([rec["TI"], rec["AU"], rec["AB"]])
            except KeyError:
                # Skip records that lack one of the three fields
                continue
    # Build the DataFrame once; appending row by row is slow and
    # DataFrame.append is no longer available in recent pandas versions
    return pd.DataFrame(rows, columns=["title", "authors", "abstract"])
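
Biopython does the real parsing here. To show roughly what the parser works with, the following is a simplified, standard-library-only sketch of the tagged layout used by PubMed's MEDLINE export: a short tag, a hyphen, then the value, with indented continuation lines and a blank line between records. The sample records and the function name `parse_medline_text` are made up for illustration, and the sketch ignores many details that `Bio.Medline` handles.

```python
def parse_medline_text(text):
    """Parse MEDLINE-tagged text into a list of dicts (simplified sketch)."""
    records, current, last_tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record
            if current:
                records.append(current)
            current, last_tag = {}, None
        elif line[0].isspace() and last_tag:
            # Indented continuation line: extend the previous field
            if last_tag == "AU":
                current["AU"][-1] += " " + line.strip()
            else:
                current[last_tag] += " " + line.strip()
        else:
            tag, _, value = line.partition("-")
            tag, value = tag.strip(), value.strip()
            last_tag = tag
            if tag == "AU":
                # Author tags repeat, so collect them in a list
                current.setdefault("AU", []).append(value)
            else:
                current[tag] = value
    if current:
        records.append(current)
    return records

sample = "\n".join([
    "PMID- 1",
    "TI  - A first example title that wraps",
    " " * 6 + "onto a second line.",
    "AU  - Doe J",
    "AU  - Roe R",
    "AB  - A short abstract.",
    "",
    "PMID- 2",
    "TI  - A record without an abstract.",
    "AU  - Poe E",
])
records = parse_medline_text(sample)
# Keep only records that have all three fields, as the notebook does
rows = [[r["TI"], r["AU"], r["AB"]] for r in records
        if all(key in r for key in ("TI", "AU", "AB"))]
print(rows)
```

The second sample record has no `AB` field, so it is dropped, which mirrors how the notebook function skips papers without an abstract.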

In [None]:
# Read in MEDLINE formatted text
papers = read_medline_data("pubmed-university-set.txt")

In [None]:

# Show the top few papers
papers.head()

Unnamed: 0,title,authors,abstract
0,Colonialism and the co-evolution of ethnic and...,"[Hunley K, Edgar H, Healy M, Mosley C]",OBJECTIVE: Socially constructed ethnic identit...
1,Vision 2020 measures University of New Mexico'...,"[Kaufman A, Roth PB, Larson RS, Ridenour N, We...",The University of New Mexico Health Sciences C...
2,Etiology of Alternaria Leaf Spot of Cotton in ...,"[Zhu Y, Lujan P, Dura S, Steiner R, Zhang J, S...",Alternaria leaf spot (caused by Alternaria spp...
3,Testicular Cancer Incidence and Mortality in N...,"[Taylor ZD, McLeod E, Gard CC, Woods ME]",OBJECTIVE: To examine incidence and survival o...
4,Impact of Health Care and Socioeconomic Needs ...,"[Soto Mas F, Iriart C, Pedroncelli R, Binder D...",Understanding how unmet basic needs impact hea...


In [None]:
print ("Title: ", papers['title'][11])
print ('\n')
print ("Abstract: ", papers['abstract'][11])

Title:  The New Mexico aging process study (1979-2003). A longitudinal study of nutrition, health and aging.


Abstract:  In 1979, Dr. James S. Goodwin, M.D., assisted by Philip J. Garry, Ph.D., submitted a grant proposal to the United States Public Health Service/ National Institute on Aging (NIA) entitled, "A prospective study of nutrition in the elderly". This study was approved and funded by the NIA beginning in 1979. Initially, approximately 300 men and women over 65 years of age with no known medical illnesses and no prescription medications were selected for this study. The primary purpose of this multi disciplinary study, known in the literature as the New Mexico Aging Process Study (NMAPS), was to examine the role of nutrition and resultant changes in body composition and organ function in relation to the aging process and health status of the elderly. This was accomplished by following prospectively healthy elderly volunteers, obtaining in-depth information about dietary habi

In [None]:
# Function that cleans text by replacing '\x0c' and '\n' characters
# as well as all non-alpha characters with spaces and finally converts
# everything to lower case
def clean_text(text):
    control_chars = ['\x0c', '\n']
    for ch in control_chars:
        # str.replace returns a new string, so reassign the result
        text = text.replace(ch, ' ')
    cleaned = re.sub('[^a-zA-Z]+', ' ', text)
    return cleaned.lower()

# Create a column for cleaned Abstract and cleaned Title
papers['clean_abstract'] = papers['abstract'].apply(clean_text)
papers['clean_title'] = papers['title'].apply(clean_text)

papers.head()

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
0,Colonialism and the co-evolution of ethnic and...,"[Hunley K, Edgar H, Healy M, Mosley C]",OBJECTIVE: Socially constructed ethnic identit...,objective socially constructed ethnic identiti...,colonialism and the co evolution of ethnic and...
1,Vision 2020 measures University of New Mexico'...,"[Kaufman A, Roth PB, Larson RS, Ridenour N, We...",The University of New Mexico Health Sciences C...,the university of new mexico health sciences c...,vision measures university of new mexico s suc...
2,Etiology of Alternaria Leaf Spot of Cotton in ...,"[Zhu Y, Lujan P, Dura S, Steiner R, Zhang J, S...",Alternaria leaf spot (caused by Alternaria spp...,alternaria leaf spot caused by alternaria spp ...,etiology of alternaria leaf spot of cotton in ...
3,Testicular Cancer Incidence and Mortality in N...,"[Taylor ZD, McLeod E, Gard CC, Woods ME]",OBJECTIVE: To examine incidence and survival o...,objective to examine incidence and survival of...,testicular cancer incidence and mortality in n...
4,Impact of Health Care and Socioeconomic Needs ...,"[Soto Mas F, Iriart C, Pedroncelli R, Binder D...",Understanding how unmet basic needs impact hea...,understanding how unmet basic needs impact hea...,impact of health care and socioeconomic needs ...
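
A quick way to sanity-check the cleaning step is to run it on a short string with known content. The sketch below uses a stand-alone helper, `clean_sample` (a hypothetical name), that mirrors the intended behavior: form feeds and newlines become spaces, every run of non-letters collapses to a single space, and the result is lower-cased. Note that `str.replace` returns a new string, so its result has to be assigned back.

```python
import re

def clean_sample(text):
    """Replace form feeds/newlines, strip non-letters, and lower-case."""
    for ch in ('\x0c', '\n'):
        text = text.replace(ch, ' ')  # reassign: strings are immutable
    return re.sub('[^a-zA-Z]+', ' ', text).lower()

sample = "Vision 2020:\nUniversity of New Mexico's success\x0c(P <= 0.01)"
cleaned = clean_sample(sample)
print(repr(cleaned))
```

Digits and punctuation are removed along with the control characters, which is why "Vision 2020" appears as "vision" in the `clean_title` column above.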


In [None]:
print ("Title: ", papers['title'][4])
print ('\n')
print ("Abstract: ", papers['abstract'][4])

Title:  Impact of Health Care and Socioeconomic Needs on Health Care Utilization and Disease Management: The University of New Mexico Hospital Care One Program.


Abstract:  Understanding how unmet basic needs impact health care in patients with complex conditions is vital to improve health outcomes and reduce health care costs. The purpose of this observational study was to explore the association between health care and socioeconomic needs and health care utilization and disease management among patients with chronic conditions at an intensive, patient-centered, office-based program. The study used a cross-sectional design and a convenience sampling approach. Data were collected through a patient questionnaire and medical records. Analysis included descriptive and inferential statistics. Data from 48 established patients were analyzed. Financial and lack of transportation were the 2 most frequently reported unmet needs. More than 65% of participants had their chronic condition(s) und

In [None]:
print ("Title: ", papers['clean_title'][4])
print ('\n')
print ("Abstract: ", papers['clean_abstract'][4])

Title:  impact of health care and socioeconomic needs on health care utilization and disease management the university of new mexico hospital care one program 


Abstract:  understanding how unmet basic needs impact health care in patients with complex conditions is vital to improve health outcomes and reduce health care costs the purpose of this observational study was to explore the association between health care and socioeconomic needs and health care utilization and disease management among patients with chronic conditions at an intensive patient centered office based program the study used a cross sectional design and a convenience sampling approach data were collected through a patient questionnaire and medical records analysis included descriptive and inferential statistics data from established patients were analyzed financial and lack of transportation were the most frequently reported unmet needs more than of participants had their chronic condition s under control sex and e

In [None]:
# Build tf-idf matrix based on Abstract and Title.
# Use NLTK word_tokenize() and SnowballStemmer() to tokenize and stem document Title and Abstract.

# Create the stemmer once instead of on every call
stemmer = nltk.stem.snowball.SnowballStemmer("english")

# Function that takes text, tokenizes it and returns list of stemmed tokens
# that are longer than two characters
def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    return [i for i in [stemmer.stem(t) for t in tokens] if len(i) > 2]

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Import the TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Create vectorizer for Abstracts. max_df is set to 0.5 because we only want
# to include terms that appear in at most 50% of the documents (i.e. very
# common terms are dropped). min_df=1 keeps every term that occurs in at
# least one document.
abs_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=1, max_features=200000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

# Create vectorizer for Titles with the same settings
title_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=1, max_features=200000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

# Compute TF-IDF weights for Abstracts
tfidf_weights_abs = abs_tfidf_vectorizer.fit_transform(papers['clean_abstract'])
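
As a rough illustration of what `max_df=0.5` does, the sketch below computes document frequencies for a toy corpus and drops every term that occurs in more than half of the documents. The corpus and the name `filter_by_max_df` are hypothetical; scikit-learn applies a cut of this kind internally when it builds the vocabulary.

```python
from collections import Counter

def filter_by_max_df(tokenized_docs, max_df=0.5):
    """Keep terms whose document frequency is at most max_df (a proportion)."""
    n_docs = len(tokenized_docs)
    # Count each term once per document
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return sorted(t for t, count in df.items() if count / n_docs <= max_df)

toy_docs = [
    ["patient", "health", "diabet"],
    ["patient", "health", "cancer"],
    ["patient", "cotton", "leaf"],
    ["patient", "health", "aging"],
]
vocabulary = filter_by_max_df(toy_docs)
print(vocabulary)
```

Here "patient" (4 of 4 documents) and "health" (3 of 4) are removed, while the terms that distinguish individual documents remain.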



In [None]:
# Compute TF-IDF weights for Title
tfidf_weights_title = title_tfidf_vectorizer.fit_transform(papers['clean_title'])

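# Get feature names for Abstract and Title models.
# get_feature_names() was removed from recent scikit-learn releases;
# get_feature_names_out() returns an array, so convert it to a list
tfidf_features_title = title_tfidf_vectorizer.get_feature_names_out().tolist()
tfidf_features_abs = abs_tfidf_vectorizer.get_feature_names_out().tolist()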



In [None]:
tfidf_features_abs

['aac',
 'aacp',
 'aacpdm',
 'aacr',
 'aacvpr',
 'aai',
 'aam',
 'aamc',
 'aao',
 'aapyr',
 'aar',
 'aasbjerg',
 'aasm',
 'aav',
 'aavp',
 'aba',
 'abadi',
 'abalon',
 'abandon',
 'abat',
 'abatacept',
 'abbrevi',
 'abc',
 'abcg',
 'abct',
 'abcwua',
 'abdomen',
 'abdomin',
 'abdominopelv',
 'abduct',
 'abe',
 'aberr',
 'aberti',
 'abi',
 'abil',
 'abiot',
 'abl',
 'ablat',
 'abmd',
 'abmt',
 'abnorm',
 'abo',
 'abolish',
 'aborigin',
 'abort',
 'abov',
 'aboveground',
 'abq',
 'abrad',
 'abraham',
 'abras',
 'abrog',
 'abrupt',
 'abscess',
 'abscessus',
 'abscis',
 'absenc',
 'absent',
 'absit',
 'absolut',
 'absorb',
 'absorpt',
 'absorptiometri',
 'abstain',
 'abstin',
 'abstr',
 'abstract',
 'absurd',
 'abund',
 'abus',
 'abw',
 'aca',
 'acad',
 'academ',
 'academi',
 'academia',
 'academician',
 'acalyptrata',
 'acampros',
 'acanthosi',
 'acc',
 'acceler',
 'acceleromet',
 'accentu',
 'accept',
 'acceptor',
 'access',
 'accessori',
 'accid',
 'accident',
 'acclim',
 'acclimat',
 '

In [None]:
# Function for returning the top_k features of an Abstract
# or Title
def get_top_features(rownum, weights, features, top_k=20):
    # Densify only the requested row rather than the whole sparse matrix
    weight_vec = weights[rownum].toarray().ravel()
    top_idx = np.argsort(weight_vec)[::-1][:top_k]
    return [features[i] for i in top_idx]

# Top k features of Abstract 1
get_top_features(1, tfidf_weights_abs, tfidf_features_abs)

['health',
 'communiti',
 'vision',
 'prioriti',
 'unmhsc',
 'new',
 'progress',
 'mexico',
 'educ',
 'measur',
 'adopt',
 'enlist',
 'silo',
 'institut',
 'respond',
 'clinic',
 'resourc',
 'better',
 'state',
 'program']

In [None]:
# Top k features of Title 1
get_top_features(1, tfidf_weights_title, tfidf_features_title)

['vision',
 'success',
 'univers',
 'measur',
 'state',
 'health',
 'mexico',
 'new',
 'fibrous',
 'fibrosi',
 'fibrosarcoma',
 'fibromyalgia',
 'zuni',
 'fibrillar',
 'fidel',
 'fiber',
 'fgf',
 'fewer',
 'fever',
 'fetal']
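
The tail of the title list above ('fibrous', 'fibrosi', 'fetal', and so on) looks unrelated to the title. A likely explanation is that the title has fewer than 20 terms with nonzero weight, so the sort pads the result with zero-weight vocabulary entries. A minimal sketch of a variant that drops them follows; `top_nonzero_features` is a hypothetical helper, not part of the original notebook, and it takes a plain sequence of weights rather than a sparse matrix.

```python
def top_nonzero_features(weight_vec, features, top_k=20):
    """Return up to top_k feature names, skipping zero-weight terms."""
    ranked = sorted(range(len(weight_vec)),
                    key=lambda i: weight_vec[i], reverse=True)
    return [features[i] for i in ranked[:top_k] if weight_vec[i] > 0]

# Toy example: only three of six vocabulary terms occur in the document
toy_features = ['fetal', 'fever', 'health', 'mexico', 'vision', 'zuni']
toy_weights = [0.0, 0.0, 0.41, 0.35, 0.84, 0.0]
top_terms = top_nonzero_features(toy_weights, toy_features)
print(top_terms)
```

With the notebook's data, the weight vector for row 1 could be obtained with `tfidf_weights_title[1].toarray().ravel()` before calling the helper.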

In [None]:

# Build model to return 5 closest neighbors
from sklearn.neighbors import NearestNeighbors

# Create the k-NN model using k=5
nn_abs = NearestNeighbors(n_neighbors=5, algorithm='auto')
nn_title = NearestNeighbors(n_neighbors=5, algorithm='auto')

# Fit the models to the TF-IDF weights matrix
nn_fitted_abs = nn_abs.fit(tfidf_weights_abs)
nn_fitted_title = nn_title.fit(tfidf_weights_title)

# function to return the top-k nearest papers

def find_nearest_papers(row, kNNmodel, tfidf_weights, tfidf_features, papers):
    keywords = get_top_features(row, tfidf_weights, tfidf_features)
    dist,idx = kNNmodel.kneighbors(tfidf_weights[row,:])
    idx = list(idx[0])
    return {'papers':papers.iloc[idx], 'keywords':keywords}
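
`NearestNeighbors` is created here without a `metric` argument, so it falls back to its default, which the scikit-learn documentation gives as Minkowski with `p=2` (Euclidean distance). `TfidfVectorizer` L2-normalizes each row by default, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so the Euclidean ranking should match a cosine ranking. The sketch below checks that identity on made-up vectors; all names in it are illustrative.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # For unit vectors the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

query = normalize([1.0, 2.0, 0.0])
candidates = [normalize(v) for v in
              ([1.0, 1.0, 0.0], [0.0, 1.0, 3.0], [2.0, 4.0, 1.0])]

# Rank candidates by increasing distance and by decreasing similarity
by_distance = sorted(range(len(candidates)),
                     key=lambda i: euclidean(query, candidates[i]))
by_cosine = sorted(range(len(candidates)),
                   key=lambda i: -cosine_similarity(query, candidates[i]))
print(by_distance, by_cosine)
```

Both rankings come out in the same order, which suggests that switching the model to a cosine metric would not change which papers are returned, only the distance values.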

In [None]:
find_nearest_papers(1, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)['papers']

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
1,Vision 2020 measures University of New Mexico'...,"[Kaufman A, Roth PB, Larson RS, Ridenour N, We...",The University of New Mexico Health Sciences C...,the university of new mexico health sciences c...,vision measures university of new mexico s suc...
3219,Telehealth and hepatitis C treatment for indig...,"[Leston J, Stephens D, Miller M, Moran B, Demi...",,,telehealth and hepatitis c treatment for indig...
62,Health extension in new Mexico: an academic he...,"[Kaufman A, Powell W, Alfero C, Pacheco M, Sil...",The Agricultural Cooperative Extension Service...,the agricultural cooperative extension service...,health extension in new mexico an academic hea...
189,Health Extension and Clinical and Translationa...,"[Kaufman A, Rhyne RL, Anastasoff J, Ronquillo ...",Health Extension Regional Officers (HEROs) thr...,health extension regional officers heros throu...,health extension and clinical and translationa...
888,The role of community health centers in assess...,"[Bruna S, Stone LC, Wilger S, Cantor J, Guzman C]",This article examines the experience of a fron...,this article examines the experience of a fron...,the role of community health centers in assess...


In [None]:
find_nearest_papers(1, nn_fitted_title, tfidf_weights_title, tfidf_features_title, papers)['papers']

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
1,Vision 2020 measures University of New Mexico'...,"[Kaufman A, Roth PB, Larson RS, Ridenour N, We...",The University of New Mexico Health Sciences C...,the university of new mexico health sciences c...,vision measures university of new mexico s suc...
5134,"The where and when of ""what if"".",[Cavanagh JF],"In this issue of Neuron, Fischer and Ullsperge...",in this issue of neuron fischer and ullsperger...,the where and when of what if
791,The genetics of university success.,"[Smith-Woolley E, Ayorech Z, Dale PS, von Stum...","University success, which includes enrolment i...",university success which includes enrolment in...,the genetics of university success
7,The University of New Mexico Center for Molecu...,"[Edwards BS, Gouveia K, Oprea TI, Sklar LA]",The University of New Mexico Center for Molecu...,the university of new mexico center for molecu...,the university of new mexico center for molecu...
1032,[The healthy eating index of new students at a...,"[Muñoz-Cano JM, Córdova-Hernández JA, del Vall...",INTRODUCTION: The main factor associated with ...,introduction the main factor associated with i...,the healthy eating index of new students at a...
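
In both result tables the first row is the query paper itself, since every paper is its own nearest neighbor at distance zero. If only the other papers are wanted, one option is to rank the candidates and drop the query index. The brute-force sketch below illustrates that idea on toy vectors; `nearest_excluding_self` is a hypothetical helper and is not how scikit-learn searches internally.

```python
import math

def nearest_excluding_self(row, vectors, k=2):
    """Return indices of the k closest vectors to vectors[row], skipping row."""
    def dist(i):
        return math.dist(vectors[row], vectors[i])
    ranked = sorted(range(len(vectors)), key=dist)
    return [i for i in ranked if i != row][:k]

toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
]
neighbors = nearest_excluding_self(0, toy_vectors)
print(neighbors)
```

With the fitted model, a similar effect could be had by calling `kneighbors` with `n_neighbors=6` and filtering out `row` from the returned indices.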


In [None]:
title = "Impact of Health Care and Socioeconomic Needs on Health Care Utilization and Disease Management: The University of New Mexico Hospital Care One Program." #provide actual name of a paper
papers[papers['title']==title]

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
4,Impact of Health Care and Socioeconomic Needs ...,"[Soto Mas F, Iriart C, Pedroncelli R, Binder D...",Understanding how unmet basic needs impact hea...,understanding how unmet basic needs impact hea...,impact of health care and socioeconomic needs ...


In [None]:
nearest_papers = find_nearest_papers(4, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)
for i in nearest_papers['keywords']: print ("Keywords: ", i)

Keywords:  unmet
Keywords:  need
Keywords:  condit
Keywords:  health
Keywords:  care
Keywords:  patient
Keywords:  appoint
Keywords:  chronic
Keywords:  socioeconom
Keywords:  financi
Keywords:  met
Keywords:  medic
Keywords:  transport
Keywords:  intens
Keywords:  control
Keywords:  inferenti
Keywords:  statist
Keywords:  program
Keywords:  manag
Keywords:  uncontrol


In [None]:
# Show the abstracts of similar papers
for i in nearest_papers['papers']['abstract']: print ("Abstract: "+i+"\n")

Abstract: Understanding how unmet basic needs impact health care in patients with complex conditions is vital to improve health outcomes and reduce health care costs. The purpose of this observational study was to explore the association between health care and socioeconomic needs and health care utilization and disease management among patients with chronic conditions at an intensive, patient-centered, office-based program. The study used a cross-sectional design and a convenience sampling approach. Data were collected through a patient questionnaire and medical records. Analysis included descriptive and inferential statistics. Data from 48 established patients were analyzed. Financial and lack of transportation were the 2 most frequently reported unmet needs. More than 65% of participants had their chronic condition(s) under control. Sex and ethnicity were the only 2 demographic variables that yielded significant differences (P ≤ 0.01) on visits to the emergency room and having chron