<a href="https://colab.research.google.com/github/HofstraDoboli/TextMining_F22/blob/main/indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
list_doc = [ "Information retrieval course will cover techniques used in search engines for finding relevant documents or information related to a query. Topics include: natural language processing for extracting relevant terms out of text data, vector space methods for computing similarity between documents, text classification, and clustering. These techniques are commonly used in applications such as: automatic extraction of summaries out of a long text, extract novel information in a stream of data.",  

"NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statical inference. In general, the more data analyzed, the more accurate the model will be.",  

"The Denver Broncos made sure Brandon McManus will be their kicker for the long haul on Monday.  General manager John Elway announced the team and the kicker agreed on a contract extension. NFL Network Insider Ian Rapoport reported, per a source, it's a three-year extension worth $11.254 million with $6 million of it guaranteed. McManus is now the NFL's fourth highest paid kicker.", 

 "Equifax, one of the three major credit reporting agencies, handles the data of 820 million consumers and more than 91 million businesses worldwide. Between May and July of this year 143 million people in the U.S. may have had their names, Social Security numbers, birth dates, addresses and even driver's license numbers accessed. In addition, the hack compromised 209,000 people's credit card numbers and personal dispute details for another 182,000 people. What bad actors could do with that information is daunting. This data breach is more confusing than others -- like when Yahoo or Target were hacked, for example -- according to Joel Winston, a former deputy attorney general for New Jersey , whose current law practice focuses on consumer rights litigation, information privacy, and data protection law.", 
 
  """Why didn't she text me back yet? She doesn't like me anymore!" "There's no way I'm trying out for the team. I suck at basketball""It's not fair that I have a curfew! "Sound familiar? Parents of tweens and teens often shrug off such anxious and gloomy thinking as normal irritability and moodiness — because it is. Still, the beginning of a new school year, with all of the required adjustments, is a good time to consider just how closely the habit of negative, exaggerated "self-talk" can affect academic and social success, self-esteem and happiness. Psychological research shows that what we think can have a powerful influence on how we feel emotionally and physically, and on how we behave. Research also shows that our harmful thinking patterns can be changed."""]

In [2]:
import spacy   # another tokenizer and lemmatizer (e.g. "has" --> "have")
# the parser and NER components are not needed for tokenizing and lemmatizing
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])




In [39]:
# Step 1: text processing for one document - return lemmas
def nlp_processing(doc):  
    tokens = nlp(doc)
    
    #print(type(tokens))
    # keep lowercase lemmas; drop stop words and tokens that are not purely alphabetic
    terms = [token.lemma_.lower() for token in tokens if not token.is_stop and token.is_alpha]
  
    return terms

# Step 2: extract a list of (token, doc_id) from all documents.
# input a list of documents
# output: a list of sorted (token, doc_id) tuples
def extract_token_doc_id(list_doc):
  all_tokens = []
  for ind_doc, doc in enumerate(list_doc):
    tokens_doc = [(token, ind_doc) for token in nlp_processing(doc)]
    all_tokens.extend(tokens_doc)
  
  # sort by token name 
  all_tokens = sorted(all_tokens, key = lambda x:x[0])

  return all_tokens

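# Step 3: Extract unique terms and count how often each one occurs.
# Note: this counts every occurrence of a token in the collection, so a term
# repeated inside one document is counted once per occurrence, not once per document.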
def doc_freq(all_tokens):
  dict_doc_freq = {}
  for (token, doc) in all_tokens:
    if token in dict_doc_freq:
      dict_doc_freq[token] +=1
    else:
      dict_doc_freq[token] = 1
    
  return dict_doc_freq

# Step 4: Extract term frequency of each term in each document it appears in
# dict_term_freq = {term: {doc1:tf1, doc2:tf2, ...}} # includes only docs that have 
# non-zero term frequency
def term_freq(all_tokens, dict_doc_freq):
  dict_term_freq = {term:{} for term in dict_doc_freq.keys()} # initialize dictionary with all unique terms
  for (token, doc) in all_tokens:
    if doc in dict_term_freq[token]:
      dict_term_freq[token][doc] += 1 
    else: # if doc is not a key in the dictionary 
      dict_term_freq[token][doc] = 1
    
  return dict_term_freq
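
The nested counts from Step 4 can also be built in a single pass with the standard library. The following is a minimal sketch, not a replacement for the function above; `term_freq_counter` is a hypothetical name, and the sample pairs are made up to mirror the (token, doc_id) format.

```python
from collections import Counter, defaultdict

def term_freq_counter(token_doc_pairs):
    """Build {term: {doc_id: count}} from (token, doc_id) pairs in one pass."""
    nested = defaultdict(Counter)
    for token, doc_id in token_doc_pairs:
        nested[token][doc_id] += 1
    # convert to plain dicts so the result prints like dict_term_freq
    return {token: dict(counts) for token, counts in nested.items()}

# hypothetical sample input in the same shape as all_tokens
sample_pairs = [('algorithm', 1), ('algorithm', 1), ('data', 0), ('data', 1), ('data', 3)]
sample_tf = term_freq_counter(sample_pairs)
print(sample_tf)
```

Unlike `term_freq`, this version does not need the unique terms up front, because `defaultdict` creates the inner counter the first time a term is seen.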

In [19]:
all_tokens = extract_token_doc_id(list_doc)
all_tokens[:20]

[('academic', 4),
 ('access', 3),
 ('accord', 3),
 ('accurate', 1),
 ('actor', 3),
 ('addition', 3),
 ('address', 3),
 ('adjustment', 4),
 ('affect', 4),
 ('agency', 3),
 ('agree', 2),
 ('algorithm', 1),
 ('algorithm', 1),
 ('analyze', 1),
 ('analyze', 1),
 ('announce', 2),
 ('anxious', 4),
 ('anymore', 4),
 ('application', 0),
 ('attorney', 3)]
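
Because the pairs are sorted by token, all occurrences of a term sit next to each other, so one pass is enough to group them into postings. A minimal sketch of that grouping follows, using a hand-copied subset of the pairs above instead of the spaCy pipeline; `build_postings` is a hypothetical helper, not part of the notebook.

```python
from itertools import groupby

# hand-copied subset of the sorted (token, doc_id) pairs shown above
sample_pairs = [
    ('accurate', 1),
    ('agree', 2),
    ('algorithm', 1),
    ('algorithm', 1),
    ('analyze', 1),
    ('analyze', 1),
    ('application', 0),
]

def build_postings(sorted_pairs):
    """Group sorted (token, doc_id) pairs into {token: sorted list of distinct doc ids}."""
    postings = {}
    # groupby only merges adjacent items, so the input must already be sorted by token
    for token, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        postings[token] = sorted({doc_id for _, doc_id in group})
    return postings

postings = build_postings(sample_pairs)
print(postings['algorithm'])  # [1]
```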

In [42]:
# Step 3: Extract frequencies: dict_doc_freq = {term: number of occurrences in the collection}
dict_doc_freq = doc_freq(all_tokens)
print("Document frequency")
list(dict_doc_freq.items())[:20]

Document frequency


[('academic', 1),
 ('access', 1),
 ('accord', 1),
 ('accurate', 1),
 ('actor', 1),
 ('addition', 1),
 ('address', 1),
 ('adjustment', 1),
 ('affect', 1),
 ('agency', 1),
 ('agree', 1),
 ('algorithm', 2),
 ('analyze', 2),
 ('announce', 1),
 ('anxious', 1),
 ('anymore', 1),
 ('application', 1),
 ('attorney', 1),
 ('automatic', 1),
 ('automatically', 1)]
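
The counts above are total occurrences: 'algorithm' shows 2 because it appears twice in document 1, not because it appears in two documents. If the number of distinct documents containing a term is wanted instead, one option is to collect the doc ids in a set per term. This is a sketch under that assumption; `distinct_doc_freq` is a hypothetical name and the pairs are a hand-copied subset of `all_tokens`.

```python
from collections import defaultdict

def distinct_doc_freq(token_doc_pairs):
    """Return {term: number of distinct documents that contain the term}."""
    docs_per_term = defaultdict(set)
    for token, doc_id in token_doc_pairs:
        docs_per_term[token].add(doc_id)  # a set ignores repeats within one document
    return {token: len(doc_ids) for token, doc_ids in docs_per_term.items()}

# hand-copied subset of the sorted (token, doc_id) pairs
sample_pairs = [('agree', 2), ('algorithm', 1), ('algorithm', 1), ('analyze', 1), ('analyze', 1)]
sample_df = distinct_doc_freq(sample_pairs)
print(sample_df)  # {'agree': 1, 'algorithm': 1, 'analyze': 1}
```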

In [43]:
# Step 4: extract term frequency: dict_term_freq = {term: {doc_id: count}}
dict_term_freq = term_freq(all_tokens, dict_doc_freq)
print("Term frequency")
list(dict_term_freq.items())[:20]

Term frequency


[('academic', {4: 1}),
 ('access', {3: 1}),
 ('accord', {3: 1}),
 ('accurate', {1: 1}),
 ('actor', {3: 1}),
 ('addition', {3: 1}),
 ('address', {3: 1}),
 ('adjustment', {4: 1}),
 ('affect', {4: 1}),
 ('agency', {3: 1}),
 ('agree', {2: 1}),
 ('algorithm', {1: 2}),
 ('analyze', {1: 2}),
 ('announce', {2: 1}),
 ('anxious', {4: 1}),
 ('anymore', {4: 1}),
 ('application', {0: 1}),
 ('attorney', {3: 1}),
 ('automatic', {0: 1}),
 ('automatically', {1: 1})]
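
The nested dictionary doubles as a small inverted index: the keys of each inner dictionary are the documents that contain the term. Below is a sketch of a conjunctive (AND) lookup over it, assuming the query terms are already lemmatized and lowercased the same way as the indexed terms; `docs_containing_all` is a hypothetical helper and `sample_index` is copied from the output above.

```python
# subset of dict_term_freq copied from the output above
sample_index = {
    'accurate': {1: 1},
    'algorithm': {1: 2},
    'analyze': {1: 2},
    'application': {0: 1},
    'automatic': {0: 1},
}

def docs_containing_all(index, query_terms):
    """Return the sorted doc ids that contain every query term."""
    # a term missing from the index contributes an empty set, so the result is empty
    doc_sets = [set(index.get(term, {})) for term in query_terms]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))

both_in_doc1 = docs_containing_all(sample_index, ['algorithm', 'analyze'])
no_overlap = docs_containing_all(sample_index, ['algorithm', 'automatic'])
print(both_in_doc1)  # [1]
print(no_overlap)    # []
```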