
In [64]:
list_doc =[ "Information retrieval course will cover algorithms used in search engines for finding relevant documents or information related to a query. Topics include: natural language processing for extracting relevant terms out of text data, vector space methods for computing similarity between documents, text classification, and clustering. These techniques are commonly used in applications such as: automatic extraction of summaries out of a long text, extract novel information in a stream of data.",  

"NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statistical inference. In general, the more data analyzed, the more accurate the model will be.",  

"The Denver Broncos made sure Brandon McManus will be their kicker for the long haul on Monday.  General manager John Elway announced the team and the kicker agreed on a contract extension. NFL Network Insider Ian Rapoport reported, per a source, it's a three-year extension worth $11.254 million with $6 million of it guaranteed. McManus is now the NFL's fourth highest paid kicker.", 

 "Equifax, one of the three major credit reporting agencies, handles the data of 820 million consumers and more than 91 million businesses worldwide. Between May and July of this year 143 million people in the U.S. may have had their names, Social Security numbers, birth dates, addresses and even driver's license numbers accessed. In addition, the hack compromised 209,000 people's credit card numbers and personal dispute details for another 182,000 people. What bad actors could do with that information is daunting. This data breach is more confusing than others -- like when Yahoo or Target were hacked, for example -- according to Joel Winston, a former deputy attorney general for New Jersey , whose current law practice focuses on consumer rights litigation, information privacy, and data protection law.", 
 
  """Why didn't she text me back yet? She doesn't like me anymore!" "There's no way I'm trying out for the team. I suck at basketball""It's not fair that I have a curfew! "Sound familiar? Parents of tweens and teens often shrug off such anxious and gloomy thinking as normal irritability and moodiness — because it is. Still, the beginning of a new school year, with all of the required adjustments, is a good time to consider just how closely the habit of negative, exaggerated "self-talk" can affect academic and social success, self-esteem and happiness. Psychological research shows that what we think can have a powerful influence on how we feel emotionally and physically, and on how we behave. Research also shows that our harmful thinking patterns can be changed."""]

In [72]:
import numpy as np
import spacy   # another tokenizer and lemmatizer (e.g. has --> have)
nlp = spacy.load('en_core_web_sm')
# parsing and named-entity recognition are not needed for indexing
nlp.disable_pipes('parser', 'ner')  


['parser', 'ner']

In [65]:
# Step 1: text processing for one document - return lemmas
def nlp_processing(doc):  
    tokens = nlp(doc)
    
    # keep alphabetic tokens that are not stop words; return lowercased lemmas
    terms = [token.lemma_.lower() for token in tokens if not token.is_stop and token.is_alpha]
  
    return terms

# Step 2: extract a list of (token, doc_id) from all documents.
# input a list of documents
# output: a list of sorted (token, doc_id) tuples
def extract_token_doc_id(list_doc):
  all_tokens = []
  for ind_doc, doc in enumerate(list_doc):
    tokens_doc = [(token, ind_doc) for token in nlp_processing(doc)]
    all_tokens.extend(tokens_doc)
  
  # sort by token name 
  all_tokens = sorted(all_tokens, key = lambda x:x[0])

  return all_tokens

# Step 3: Extract terms (unique) and document frequency
# (a term repeated within the same document is counted only once)
# input: all_tokens, a list of (token, doc_id) tuples
def doc_freq(all_tokens):
  set_all_tokens = set(all_tokens) # remove duplicate token in the same document
  dict_doc_freq = {}
  for (token, doc) in set_all_tokens:
    if token in dict_doc_freq:
      dict_doc_freq[token] += 1
    else: 
      dict_doc_freq[token] = 1

  # sort by key (term)  
  tuples_doc_freq = sorted(dict_doc_freq.items(), key = lambda x: x[0])
  
  dict_doc_freq = {term:doc_freq for (term, doc_freq) in tuples_doc_freq}
  return dict_doc_freq

# Step 4: Extract term frequency of each term in each document it appears in
# dict_term_freq = {term: {doc1:tf1, doc2:tf2, ...}} # includes only docs that have 
# non-zero term frequency
def term_freq(all_tokens, dict_doc_freq):
  dict_term_freq = {term:{} for term in dict_doc_freq.keys()} # initialize dictionary with all unique terms
  for (token, doc) in all_tokens:
    if doc in dict_term_freq[token]:
      dict_term_freq[token][doc] += 1 
    else: # if doc is not a key in the dictionary 
      dict_term_freq[token][doc] = 1
  
  return dict_term_freq

In [47]:
# Step 1: extract lemmas from one document
tokens1 = nlp_processing(list_doc[0])
tokens1[:20]

['information',
 'retrieval',
 'course',
 'cover',
 'algorithm',
 'search',
 'engine',
 'find',
 'relevant',
 'document',
 'information',
 'relate',
 'query',
 'topic',
 'include',
 'natural',
 'language',
 'processing',
 'extract',
 'relevant']

In [66]:
# Step 2: extract a list of tuples (token, doc id), sorted alphabetically
all_tokens = extract_token_doc_id(list_doc)
all_tokens[:20]

[('academic', 4),
 ('access', 3),
 ('accord', 3),
 ('accurate', 1),
 ('actor', 3),
 ('addition', 3),
 ('address', 3),
 ('adjustment', 4),
 ('affect', 4),
 ('agency', 3),
 ('agree', 2),
 ('algorithm', 0),
 ('algorithm', 1),
 ('algorithm', 1),
 ('analyze', 1),
 ('analyze', 1),
 ('announce', 2),
 ('anxious', 4),
 ('anymore', 4),
 ('application', 0)]
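
The ordering above depends on Python's sort being stable: sorting on the token alone keeps pairs for the same token in their original document order. A minimal sketch on a made-up two-document corpus (the token lists are hypothetical stand-ins for the output of `nlp_processing`, not taken from `list_doc`):

```python
# Toy pre-tokenized documents; hypothetical stand-ins for nlp_processing output.
toy_docs = [["search", "engine", "query"],
            ["query", "engine", "engine"]]

# Build (token, doc_id) pairs in document order.
pairs = [(token, doc_id)
         for doc_id, tokens in enumerate(toy_docs)
         for token in tokens]

# sorted() is stable, so sorting by token alone keeps doc ids in ascending order.
sorted_pairs = sorted(pairs, key=lambda pair: pair[0])
print(sorted_pairs)
```

Because the pairs are generated document by document, each token's entries end up grouped together with ascending document ids, which is the shape a postings list needs.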

In [67]:
# Step 3: extract document frequency: dict_doc_freq = {term: document frequency}
dict_doc_freq = doc_freq(all_tokens)
print("Document frequency")
list(dict_doc_freq.items())[:20]

Document frequency


[('academic', 1),
 ('access', 1),
 ('accord', 1),
 ('accurate', 1),
 ('actor', 1),
 ('addition', 1),
 ('address', 1),
 ('adjustment', 1),
 ('affect', 1),
 ('agency', 1),
 ('agree', 1),
 ('algorithm', 2),
 ('analyze', 1),
 ('announce', 1),
 ('anxious', 1),
 ('anymore', 1),
 ('application', 1),
 ('attorney', 1),
 ('automatic', 1),
 ('automatically', 1)]
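
Document frequency counts documents, not occurrences, which is why `algorithm` gets 2 even though it appears three times in the pair list. One possible way to express the same idea with the standard library, shown on a few hypothetical pairs shaped like `all_tokens` (the name `toy_pairs` is illustrative):

```python
from collections import Counter

# Hypothetical sorted (token, doc_id) pairs, shaped like all_tokens.
toy_pairs = [("algorithm", 0), ("algorithm", 1), ("algorithm", 1), ("analyze", 1)]

# set() collapses repeats of a term inside one document, so each
# document contributes at most 1 to a term's document frequency.
toy_doc_freq = Counter(token for token, _ in set(toy_pairs))
print(dict(sorted(toy_doc_freq.items())))
```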

In [69]:
# Step 4: extract term frequency: dict_term_freq = {term: {doc_id: term frequency}}
dict_term_freq = term_freq(all_tokens, dict_doc_freq)
print("Term frequency")
list(dict_term_freq.items())[:20]

Term frequency


[('academic', {4: 1}),
 ('access', {3: 1}),
 ('accord', {3: 1}),
 ('accurate', {1: 1}),
 ('actor', {3: 1}),
 ('addition', {3: 1}),
 ('address', {3: 1}),
 ('adjustment', {4: 1}),
 ('affect', {4: 1}),
 ('agency', {3: 1}),
 ('agree', {2: 1}),
 ('algorithm', {0: 1, 1: 2}),
 ('analyze', {1: 2}),
 ('announce', {2: 1}),
 ('anxious', {4: 1}),
 ('anymore', {4: 1}),
 ('application', {0: 1}),
 ('attorney', {3: 1}),
 ('automatic', {0: 1}),
 ('automatically', {1: 1})]
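
Each entry above is a small postings list: the documents that contain the term, with the count in each. As a sketch of an alternative construction (not the notebook's own code), `defaultdict(Counter)` builds the same structure in one pass, and the document frequency then falls out as the length of each postings list. The pairs are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical (token, doc_id) pairs, shaped like all_tokens.
toy_pairs = [("algorithm", 0), ("algorithm", 1), ("algorithm", 1), ("analyze", 1)]

# term -> Counter({doc_id: term frequency}); only matching documents are stored.
toy_term_freq = defaultdict(Counter)
for token, doc_id in toy_pairs:
    toy_term_freq[token][doc_id] += 1

# The document frequency of a term is the number of documents in its postings.
toy_doc_freq = {term: len(postings) for term, postings in toy_term_freq.items()}
print({term: dict(postings) for term, postings in toy_term_freq.items()})
print(toy_doc_freq)
```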

In [73]:
def counter(items):
  sort_items = sorted(items) # sorts tokens alphabetically
  count_items = {}
  for item in sort_items:
    if item in count_items.keys():
      count_items[item] += 1
    else:
      count_items[item] = 1
  
  # sort by the count, in reverse order
  sorted_count_list = sorted(count_items.items(), 
                            key = lambda x:x[1], reverse = True)
  sorted_count_dict = dict(sorted_count_list)
  return sorted_count_dict 
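
For comparison, the same counts and ordering can probably be obtained from `collections.Counter`. This is a sketch of an equivalent, not a replacement used elsewhere in the notebook, and `counter_stdlib` is a hypothetical name:

```python
from collections import Counter

def counter_stdlib(items):
    # Sorting first makes ties come out alphabetically, because most_common()
    # keeps equal counts in first-encountered order.
    return dict(Counter(sorted(items)).most_common())

print(counter_stdlib(["retrieval", "information", "retrieval", "query"]))
```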

In [80]:
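# tf weight of a term in a document: log of the raw count.
# Note: log(1) = 0, so a term that occurs only once in a document adds
# nothing to the score; a common variant uses 1 + log(tf) instead.
def tf_func(tf_freq_doc):
  return np.log(tf_freq_doc)

# idf weight of a term: log((N + 1) / df), where N = nr_docs and
# df = tf_doc_freq is the document frequency of the term
def idf_func(tf_doc_freq, nr_docs):
  return np.log((nr_docs+1)/tf_doc_freq)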

# Simple tf-idf scoring of a query against the indexed documents.
# query          - a string
# dict_doc_freq  - {term: document frequency}
# dict_term_freq - {term: {doc_id: term frequency}}
# nr_docs        - number of documents in the collection
# returns {doc_id: similarity}, sorted by decreasing similarity
def tf_idf(query, dict_doc_freq, dict_term_freq, nr_docs):
  # nlp processing of query -> tf_query = {term: non-zero frequency in query}
  tokens   = nlp_processing(query)
  tf_query = counter(tokens)
  sim_query_doc = {} # doc: similarity score
  # for each query term, accumulate a score in every document that contains it
  for (term_q, tf_term_q) in tf_query.items():
    print('term query = ', term_q, 'freq_query', tf_term_q)
    # .get avoids a KeyError for query terms that are not in the index
    for doc in dict_term_freq.get(term_q, {}).keys():
      print('\t doc id', doc)
      tf_doc  = tf_func(dict_term_freq[term_q][doc])
      idf_doc = idf_func(dict_doc_freq[term_q], nr_docs)
      # add to the running score (0 if this is the first matching term for doc)
      sim_query_doc[doc] = sim_query_doc.get(doc, 0) + tf_term_q * tf_doc * idf_doc
      print("sim_query for doc = ", doc, "is =", sim_query_doc)
  # sort sim_query_doc by similarity value, highest first
  return dict(sorted(sim_query_doc.items(), key = lambda x:x[1], reverse = True))
      


In [81]:
sim_query = tf_idf("information retrieval", dict_doc_freq, dict_term_freq, len(list_doc))

term query =  information freq_query 1
	 doc id 0
sim_query for doc =  0 is = {0: 1.206948960812582}
	 doc id 3
sim_query for doc =  3 is = {0: 1.206948960812582, 3: 0.761500010418809}
term query =  retrieval freq_query 1
	 doc id 0
sim_query for doc =  0 is = {0: 1.206948960812582, 3: 0.761500010418809}
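
The numbers in this trace can be checked by hand from the index: `information` occurs 3 times in document 0 and 2 times in document 3 (document frequency 2), and `retrieval` occurs once, in document 0 only. The sketch below recomputes the three contributions with `math.log`; the helper name `score` is illustrative and not part of the notebook:

```python
import math

nr_docs = 5  # len(list_doc)

def score(tf_query, tf_doc, df):
    # Same formula as tf_func and idf_func, written with math.log.
    return tf_query * math.log(tf_doc) * math.log((nr_docs + 1) / df)

doc0_information = score(1, 3, 2)  # 'information': tf = 3 in doc 0, df = 2
doc3_information = score(1, 2, 2)  # 'information': tf = 2 in doc 3, df = 2
doc0_retrieval   = score(1, 1, 1)  # 'retrieval':   tf = 1 in doc 0, df = 1

print(doc0_information, doc3_information, doc0_retrieval)
```

The last value is zero because log(1) = 0, which explains why the score of document 0 does not change when `retrieval` is processed, even though `retrieval` is the rarer and more discriminating term.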


In [82]:
dict_doc_freq['information']

2

In [83]:
dict_term_freq['information']

{0: 3, 3: 2}

In [84]:
dict_doc_freq['retrieval']

1

In [85]:
dict_term_freq['retrieval']

{0: 1}
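
With these index entries, a term that occurs exactly once in a document receives a tf weight of log(1) = 0. A commonly used alternative is 1 + log(tf). The sketch below assumes that variant (it is not what `tf_func` computes) and re-scores the query "information retrieval", where each query term occurs once, using the four lookups above. The names `tf_sublinear`, `idf`, `doc_freq_q` and `term_freq_q` are hypothetical:

```python
import math

# Index entries for the two query terms, copied from the lookups above.
doc_freq_q  = {"information": 2, "retrieval": 1}
term_freq_q = {"information": {0: 3, 3: 2}, "retrieval": {0: 1}}
nr_docs = 5

def tf_sublinear(tf):
    # Assumed variant: 1 + log(tf), so tf = 1 gives weight 1 instead of 0.
    return 1 + math.log(tf)

def idf(df):
    # Same idf as idf_func in the notebook.
    return math.log((nr_docs + 1) / df)

scores = {}
for term, postings in term_freq_q.items():
    for doc_id, tf in postings.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + tf_sublinear(tf) * idf(doc_freq_q[term])

# Rank documents by decreasing score.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Under this variant document 0 still ranks first, but its single occurrence of `retrieval` now adds to the score instead of being ignored.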