# Instructions

## Hidden Code
The notebook contains several hidden code cells, which are displayed as three dots. To run one, select the cell above it and press run twice (using either the buttons at the top or the "Shift" + "Return" hotkey). Once the hidden code cell runs, a message such as "Done!" or your evaluation results should appear. If the code expands, collapse it using the column on the left. You do not need to edit or understand the code in these cells.

## Starting Out
The first cell after this section is titled "Run the hidden code cell before starting". Run it every time a new kernel/runtime is started. If it runs correctly, the message "Done!" appears underneath.


## Exercises
This is followed by all the exercises of the notebook. An exercise generally consists of four cells:
1. The first cell describes the function you need to implement.
2. The second cell contains the starter code, with a comment indicating where you should write your code. Your entire implementation goes in this cell.
3. The third cell is for your own testing. Feel free to write any code you like here to check that your function works correctly.
4. The last cell contains hidden code that runs test cases on your function and reports a mark for your implementation. A correct implementation gets full marks.

## Completion
The completion cell runs all the test cases in the notebook on all the functions. If this cell reports full marks, the notebook is complete and you can submit it.

## Important: Run the hidden code cell before starting

In [None]:
# Do not edit this cell (Keep hidden to keep notebook easier to read)
def test(test_name, actual, expected):
  if actual == expected:
    return 1
  else:
    print("Test failed. " + test_name + " expected " + str(expected) + ", got " + str(actual))
    return 0

print("Done!")

# Part 1

## Part 1a: Helper Function for Pre-Processing the Data

Complete the following function according to its docstring.
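
The two string operations named in the docstring can be tried on a toy input first. This is only a sketch: `MARKS`, `sample`, `lowered`, and `replaced` are illustrative names that are not part of the exercise, and the loop is one possible way to substitute characters, not necessarily the intended one.

```python
# Toy example only: MARKS and sample are made-up names, not the exercise's data.
MARKS = ",!"
sample = "Hello, World!"

# str.lower() returns a new string; the original is left unchanged.
lowered = sample.lower()

# Replace each listed character with exactly one space.
replaced = lowered
for mark in MARKS:
    replaced = replaced.replace(mark, " ")

print(repr(replaced))  # 'hello  world '
```

Each punctuation character becomes exactly one space and existing spaces are kept, which is why the expected output in the docstring contains runs of several spaces.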

In [None]:
PUNCTUATION = """.,<>;'":{}[]|!@#$%^&*()"""
def clean_up(text):
  """ (str) -> str

  Return a new version of text in which all the letters have been 
  converted to lowercase, and all punctuation is replaced with spaces.

  >>> clean_up('Influenza, commonly known as "the flu", is ...')
  'influenza  commonly known as  the flu   is    '
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate `clean_up()`

In [None]:
# Do not edit this cell
# max_score is used instead of max so that the built-in max() is not overwritten
score, max_score = 0, 0

score += test("Trivial string", clean_up("influenza is commonly known as the flu"), "influenza is commonly known as the flu")
max_score += 1
score += test("Uppercase letters", clean_up("influenza is prevalent in the USA"), "influenza is prevalent in the usa")
max_score += 1
score += test("Punctuation", clean_up('Influenza, commonly known as "the flu", is ...'), "influenza  commonly known as  the flu   is    ")
max_score += 1

if score == max_score:
  print("All test cases passed!")
print("Mark: " + str(score) + "/" + str(max_score))

## Part 1b: Loading the Data

Once you have extracted the contents of `wikipages.zip` into your working directory, run the following cells to make sure that the data loads properly.
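
If the archive has been uploaded but not yet unpacked, it can be extracted from within the notebook. The sketch below assumes that `wikipages.zip` sits in the working directory and that the archive itself contains a top-level `wikipages` folder; neither is confirmed here, and `extract_archive` is a hypothetical helper name.

```python
import os
import zipfile

def extract_archive(zip_path, target_dir="."):
    """Hypothetical helper: unpack zip_path into target_dir and list the result."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))

# Only attempt the extraction when the archive is actually present.
if os.path.exists("wikipages.zip"):
    print(extract_archive("wikipages.zip"))
```

If the listing does not show a `wikipages` directory, the archive layout differs from this assumption and the files need to be moved into a folder with that name.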

In [None]:
import os

def get_documents():
  """ () -> dict of {str: str}

  Return a dictionary where the keys are document names (without .html)
  and the values are the cleaned-up contents of the corresponding .html
  file from the directory wikipages.
  """
  
  # Get a list of all the filenames in the directory
  datapath = 'wikipages'
  filenames = os.listdir(datapath)
  
  # Dictionary of all texts, keys are disease names
  doc_to_text = {}
  
  for filename in filenames:
    # Only consider filenames that end in ".html"
    if len(filename) > 5 and filename.endswith(".html"):
      # Read the entire file's contents as a string, closing the file afterwards
      with open(os.path.join(datapath, filename)) as f:
        text = f.read()

      # Clean up the text using the helper function
      text = clean_up(text)

      # Since all the filenames end in .html, just drop that part
      disease = filename[:-5]
      
      # Insert it into the dictionary
      doc_to_text[disease] = text
  
  return doc_to_text

In [None]:
# If you have copied over the data from wikipages.zip properly, 
# this code should run without any issues
try:
  document_dict = get_documents()
except FileNotFoundError:
  # Only a missing directory is reported here; other errors keep their own message
  raise Exception("Could not find a directory called wikipages")

if len(document_dict) == 0:
  raise Exception("The dictionary is empty, so there may not be any .html files in the wikipages directory")
else:
  print("Successfully loaded!")

# Part 2

## Part 2a: Finding a Word in a Document

Complete the following function according to its docstring.
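
The phrase "full token separated by whitespace" in the docstring distinguishes a token match from a plain substring match. The sketch below illustrates the difference on a made-up string; all variable names are illustrative only.

```python
# Made-up text with a double space, as produced by replacing punctuation.
text = "copd is a common  preventable and treatable disease"

# split() with no arguments splits on any run of whitespace.
tokens = text.split()

substring_hit = "treat" in text         # substring check on the raw string
token_hit = "treat" in tokens           # membership check on whole tokens
full_token_hit = "treatable" in tokens  # an exact token is found

print(tokens)
print(substring_hit, token_hit, full_token_hit)  # True False True
```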

In [None]:
def keyword_found(keyword, doc_name, doc_to_text):
  """ (str, str, dict of {str:str}) -> bool
  
  Return True iff keyword is found in the text of doc_name inside doc_to_text
  as a full token separated by whitespace. Return False if doc_name is not
  a key in doc_to_text.
  
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate `keyword_found()`

In [None]:
# Do not edit this cell
# max_score is used instead of max so that the built-in max() is not overwritten
score, max_score = 0, 0

doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
               "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
               "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}
score += test("Word in text", keyword_found("cellular", "AIDS", doc_to_text), True)
max_score += 1
score += test("Document not in dictionary", keyword_found("cellular", "Influenza", doc_to_text), False)
max_score += 1
score += test("Subword not in text", keyword_found("cell", "AIDS", doc_to_text), False)
max_score += 1

if score == max_score:
  print("All test cases passed!")
print("Mark: " + str(score) + "/" + str(max_score))

## Part 2b: Computing the IDF

Complete the following function according to its docstring.
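
The docstring does not spell out a formula. The expected values in the evaluation cell are consistent with the common definition IDF = ln(N / n), where N is the number of documents and n is the number of documents containing the keyword; treat that as an inference from the tests, not as a statement from the course material. The arithmetic for three documents looks like this (variable names are illustrative):

```python
import numpy

total_docs = 3

# Keyword found in one of three documents: ln(3 / 1)
idf_one_doc = numpy.log(total_docs / 1)

# Keyword found in all three documents: ln(3 / 3) = ln(1) = 0
idf_all_docs = numpy.log(total_docs / 3)

print(idf_one_doc)   # roughly 1.0986
print(idf_all_docs)  # 0.0
```

The evaluation cell compares floats with `==`, so a result produced by a different logarithm function or by an algebraically rearranged expression could differ in the last bits and fail the comparison.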

In [None]:
import numpy

def idf(keyword, doc_to_text):
  """ (str, dict of {str: str}) -> float

  Return the IDF for this keyword in documents doc_to_text, i.e. the
  natural log of (number of documents / number of documents in which
  keyword is found).
  If the keyword does not appear in any of the documents, return -1.

  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate `idf()`

In [None]:
# Do not edit this cell
# max_score is used instead of max so that the built-in max() is not overwritten
score, max_score = 0, 0

doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
               "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
               "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}
score += test("Keyword not in any of the documents", idf("notfound", doc_to_text), -1)
max_score += 1
score += test("Keyword in only one document", idf("division", doc_to_text), numpy.log(3))
max_score += 1
score += test("Keyword in all of the documents once", idf("disease", doc_to_text), 0)
max_score += 1
score += test("Keyword in all of the documents multiple times", idf("is", doc_to_text), 0)
max_score += 1

if score == max_score:
  print("All test cases passed!")
print("Mark: " + str(score) + "/" + str(max_score))

## Part 2c: Initializing the TF-IDF Scores

Complete the following function according to its docstring.

In [None]:
def build_empty_scores_dict(doc_to_text):
  """ (dict of {str: str}) -> dict of {str: number}

  Build and return a dictionary whose keys are the same as the keys in doc_to_text
  and whose values are all 0.
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate `build_empty_scores_dict()`

In [None]:
# Do not edit this cell
# max_score is used instead of max so that the built-in max() is not overwritten
score, max_score = 0, 0

doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
               "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
               "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}
score += test("Empty dictionary", build_empty_scores_dict({}), {})
max_score += 1
score += test("Example with three documents", build_empty_scores_dict(doc_to_text), {"AIDS": 0, "Cancer": 0, "COPD": 0})
max_score += 1

if score == max_score:
  print("All test cases passed!")
print("Mark: " + str(score) + "/" + str(max_score))

## Part 2d: Computing the TF-IDF Scores

Complete the following function according to its docstring. Note that this function should update the scores in `doc_to_score` in place rather than return a new dictionary. Also keep in mind that you should increment the scores rather than replace the old ones, since this function will be used for queries with multiple keywords.
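
The difference between updating a dictionary in place and returning a new one can be seen on a small unrelated example. `add_bonus` and its arguments are hypothetical names used only for illustration; the sketch does not compute TF-IDF.

```python
def add_bonus(scores, name, amount):
    """Hypothetical helper: increment one entry in place and return None."""
    scores[name] += amount

scores = {"A": 0, "B": 0}

# Two calls accumulate, because += adds to the existing value.
add_bonus(scores, "A", 1.5)
add_bonus(scores, "A", 1.5)

print(scores)  # {'A': 3.0, 'B': 0}
```

Because the caller's dictionary object is modified directly, nothing needs to be returned; writing `scores[name] = amount` instead would discard the earlier contribution.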

In [None]:
def update_scores(doc_to_score, keyword, doc_to_text):
  """ (dict of {str: number}, str, dict of {str: str}) -> None

  Update doc_to_score by adding to the value of each entry the TF-IDF score
  for keyword, based on the documents in doc_to_text.
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate `update_scores()`

In [None]:
# Do not edit this cell
# max_score is used instead of max so that the built-in max() is not overwritten
score, max_score = 0, 0

doc_to_score = {"AIDS": 0, "Cancer": 0, "COPD": 0}
doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
               "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
               "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}

curr_scores = doc_to_score.copy()
update_scores(curr_scores, "notfound", doc_to_text)
score += test("Keyword not in any of the documents", curr_scores, {"AIDS": 0, "Cancer": 0, "COPD": 0})
max_score += 1

curr_scores = doc_to_score.copy()
update_scores(curr_scores, "division", doc_to_text)
score += test("Keyword in only one document", curr_scores, {"AIDS": 0, "Cancer": numpy.log(3), "COPD": 0})
max_score += 1
update_scores(curr_scores, "resistance", doc_to_text)
score += test("Scores get updated rather than replaced", curr_scores, {"AIDS": numpy.log(3), "Cancer": numpy.log(3), "COPD": 0})
max_score += 1

curr_scores = doc_to_score.copy()
update_scores(curr_scores, "disease", doc_to_text)
score += test("Keyword in all of the documents once", curr_scores, {"AIDS": 0, "Cancer": 0, "COPD": 0})
max_score += 1

curr_scores = doc_to_score.copy()
update_scores(curr_scores, "is", doc_to_text)
score += test("Keyword in all of the documents multiple times", curr_scores, {"AIDS": 0, "Cancer": 0, "COPD": 0})
max_score += 1

if score == max_score:
  print("All test cases passed!")
print("Mark: " + str(score) + "/" + str(max_score))

# Part 3

Complete your program according to the comments that have been left below. Remember to use all of the helper functions that you have at your disposal.
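
Two of the steps listed in the comments, turning a query into keywords and picking the highest-scoring document, can be sketched in isolation. The query string and the scores below are made up for illustration and do not come from the real data; a plain loop is only one of several ways to find the best entry.

```python
# Stand-in for text typed by the user; a real program would call input().
query = "severe loss of immunity"
keywords = query.split()

# Made-up scores, not computed from any documents.
example_scores = {"AIDS": 2.2, "Cancer": 0.0, "COPD": 1.1}

# Walk through the dictionary, remembering the name with the largest score.
best_name = None
for name in example_scores:
    if best_name is None or example_scores[name] > example_scores[best_name]:
        best_name = name

print(keywords)   # ['severe', 'loss', 'of', 'immunity']
print(best_name)  # AIDS
```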

In [None]:
# TODO: Ask for user input from the keyboard

# TODO: Clean up the input by removing punctuation and converting to lowercase

# TODO: Convert the query into a list of keywords

# TODO: Load the data from the documents

# TODO: Initialize a dictionary of scores

# TODO: Update all of the documents' scores for each keyword

# TODO: Print out the document name with the highest score