## Important: Run this code cell each time you start a new session!

In [None]:
# Do not edit this cell
def test_homework(test_name, actual, expected):
  if actual == expected:
    print(f"Test passed: {test_name}.")
    return 1
  else:
    print(f"Test failed: {test_name}. Expected {expected}, got {actual}")
    return 0

def compare_hw_scores(score, max_score):
  if score == max_score:
    print("All test cases passed!")
  print(f"Mark: {score} / {max_score}")

print("Done!")

# Instructions

To get full credit for this assignment, your code must run all the way through the autograder without any errors. To check this, click the text cell ***immediately after*** the final autograder, then go to "Runtime" > "Run before" in the Google Colab menu.

Your final program should also work according to the specifications without any issues.

# Part 1

## Part 1a: Helper Function for Pre-Processing the Data

Complete the following function according to its docstring.
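
As a hint, the sketch below shows two string methods that may be useful here, `str.lower` and `str.replace`, applied to an unrelated task. The names `SYMBOLS` and `mask_symbols` are hypothetical and are not part of the assignment; this illustrates one possible technique, not the required solution.

```python
# Hypothetical example, not part of the assignment: lowercase a string
# and replace a small set of symbols with spaces.
SYMBOLS = "-_/"

def mask_symbols(text):
  """Return text in lowercase with every character in SYMBOLS replaced by a space."""
  result = text.lower()
  for symbol in SYMBOLS:
    # Strings are immutable, so replace() returns a new string each time
    result = result.replace(symbol, " ")
  return result

print(mask_symbols("Read-Only/File_Name"))  # read only file name
```

Note that each symbol becomes exactly one space, so two adjacent symbols produce two adjacent spaces.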

In [None]:
PUNCTUATION = """.,<>;'":{}[]|!@#$%^&*()"""
def clean_up(text):
  """ (str) -> str

  Return a new version of text in which all the letters have been 
  converted to lowercase, and all punctuation is replaced with spaces.

  >>> clean_up('Influenza, commonly known as "the flu", is ...')
  'influenza  commonly known as  the flu   is    '
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate Part 1a

In [None]:
# Do not edit this cell
def check_part1a():
  ex_score, max_ex_score = 0, 0

  ex_score += test_homework("Trivial string", clean_up("influenza is commonly known as the flu"), "influenza is commonly known as the flu")
  max_ex_score += 1
  ex_score += test_homework("Uppercase letters", clean_up("influenza is prevalent in the USA"), "influenza is prevalent in the usa")
  max_ex_score += 1
  ex_score += test_homework("Punctuation", clean_up('Influenza, commonly known as "the flu", is ...'), "influenza  commonly known as  the flu   is    ")
  max_ex_score += 1

  compare_hw_scores(ex_score, max_ex_score)
  return ex_score, max_ex_score

_ = check_part1a()

## Part 1b: Loading the Data

Run the following cells once you have imported the contents of `wikipages.zip` into your working directory to make sure that you can properly load the data.
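
To see what `os.listdir` reports and how a suffix check filters the result, the sketch below builds a throwaway directory with `tempfile` instead of touching `wikipages`. The file names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
  # Create two small files, only one of which ends in ".html"
  for name in ["flu.html", "notes.txt"]:
    with open(os.path.join(folder, name), "w") as handle:
      handle.write("sample")

  # listdir returns bare file names, without the directory part
  html_files = sorted(name for name in os.listdir(folder) if name.endswith(".html"))

print(html_files)  # ['flu.html']
```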

In [None]:
import os

def get_documents():
  """ () -> dict of {str: str}

  Return a dictionary where the keys are document names (without .html)
  and the values are the cleaned-up contents of the corresponding .html
  file in the wikipages directory.
  """
  
  # Get a list of all the filenames in the directory
  datapath = 'wikipages'
  filenames = os.listdir(datapath)
  
  # Dictionary of all texts, keys are disease names
  doc_to_text = {}
  
  for filename in filenames:
    # Only consider filenames that end in ".html"
    if len(filename) > 5 and filename[-5:] == ".html":
      # Read the entire file's contents as a string, then close the file
      with open(os.path.join(datapath, filename)) as html_file:
        text = html_file.read()

      # Clean up the text using the helper function
      text = clean_up(text)

      # Since all the filenames end in .html, just drop that part
      disease = filename[:-5]
      
      # Insert it into the dictionary
      doc_to_text[disease] = text
  
  return doc_to_text

In [None]:
# If you have copied over the data from wikipages.zip properly, 
# this code should run without any issues
try:
  document_dict = get_documents()
except FileNotFoundError as error:
  raise Exception("Could not find a directory called wikipages") from error

if len(document_dict) == 0:
  raise Exception("The dictionary is empty, so there may not be any .html files in the wikipages directory")
else:
  print("Successfully loaded!")

# Part 2

## Part 2a: Finding a Word in a Document

Complete the following function according to its docstring.
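
The phrase "full token" in the docstring matters because Python's `in` operator behaves differently on a string and on a list of words. A minimal sketch of the difference, using a sample sentence chosen for illustration:

```python
sentence = "abnormal cells in a part of the body"

# Substring test on the string: True, because "cell" occurs inside "cells"
print("cell" in sentence)

# split() with no arguments splits on any run of whitespace,
# so the result is a list of whole words
tokens = sentence.split()

# Membership test on the list: only whole words match
print("cell" in tokens)   # False
print("cells" in tokens)  # True
```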

In [None]:
def keyword_found(keyword, doc_name, doc_to_text):
  """ (str, str, dict of {str:str}) -> bool
  
  Return True iff keyword appears in the text of the document named
  doc_name in doc_to_text as a full token separated by whitespace.
  
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate Part 2a

In [None]:
# Do not edit this cell
def check_part2a():
  ex_score, max_ex_score = 0, 0

  doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
                "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
                "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}
  ex_score += test_homework("Word in text", keyword_found("cellular", "AIDS", doc_to_text), True)
  max_ex_score += 1
  ex_score += test_homework("Word not in text", keyword_found("cellular", "COPD", doc_to_text), False)
  max_ex_score += 1
  ex_score += test_homework("Subword not in text", keyword_found("cell", "AIDS", doc_to_text), False)
  max_ex_score += 1

  compare_hw_scores(ex_score, max_ex_score)
  return ex_score, max_ex_score

_ = check_part2a()

## Part 2b: Computing the IDF

Complete the following function according to its docstring.
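
The docstring does not spell out the formula. The autograder's expected values for this part (`numpy.log(3)` for a keyword found in one of three documents, `0` for a keyword found in all three) are consistent with the common definition IDF = ln(N / n), where N is the number of documents and n is the number of documents containing the keyword. Treat that as an inference from the tests rather than a stated specification. The sketch below only evaluates the formula for hand-picked counts; `idf_from_counts` is a hypothetical name, and `math.log` is used just to keep the example self-contained.

```python
import math

def idf_from_counts(num_documents, num_matching):
  """Return ln(num_documents / num_matching); num_matching must be positive."""
  return math.log(num_documents / num_matching)

print(idf_from_counts(3, 1))  # keyword in one of three documents
print(idf_from_counts(3, 3))  # keyword in every document: 0.0
```

Because the autograder compares results against `numpy.log` with `==`, calling `numpy.log` in your own answer is the safer choice.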

In [None]:
import numpy

def idf(keyword, doc_to_text):
  """ (str, dict of {str: str}) -> float

  Return the IDF (inverse document frequency) of keyword across the documents in doc_to_text.
  If the keyword does not appear in any of the documents, return -1. 

  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate Part 2b

In [None]:
# Do not edit this cell
def check_part2b():
  ex_score, max_ex_score = 0, 0

  doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
                "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
                "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}
  ex_score += test_homework("Keyword not in any of the documents", idf("notfound", doc_to_text), -1)
  max_ex_score += 1
  ex_score += test_homework("Keyword in only one document", idf("division", doc_to_text), numpy.log(3))
  max_ex_score += 1
  ex_score += test_homework("Keyword in all of the documents once", idf("disease", doc_to_text), 0)
  max_ex_score += 1
  ex_score += test_homework("Keyword in all of the documents multiple times", idf("is", doc_to_text), 0)
  max_ex_score += 1

  compare_hw_scores(ex_score, max_ex_score)
  return ex_score, max_ex_score

_ = check_part2b()

# Part 3

## Part 3a: Initializing the TF-IDF Scores

Complete the following function according to its docstring.

In [None]:
def build_empty_scores_dict(doc_to_text):
  """ (dict of {str: str}) -> dict of {str: number}

  Build and return a dictionary whose keys are the same as the keys in doc_to_text
  and whose values are all 0.
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate Part 3a

In [None]:
# Do not edit this cell
def check_part3a():
  ex_score, max_ex_score = 0, 0

  doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
                "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
                "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}
  ex_score += test_homework("Empty dictionary", build_empty_scores_dict({}), {})
  max_ex_score += 1
  ex_score += test_homework("Example with three documents", build_empty_scores_dict(doc_to_text), {"AIDS": 0, "Cancer": 0, "COPD": 0})
  max_ex_score += 1

  compare_hw_scores(ex_score, max_ex_score)
  return ex_score, max_ex_score

_ = check_part3a()

## Part 3b: Computing the TF-IDF Scores

Complete the following function according to its docstring. Note that this function should update the scores in `doc_to_score` rather than return a new dictionary. Also keep in mind that you should increment the scores rather than replace the old ones since this will be used for queries with multiple keywords.
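
The difference between updating a dictionary in place and returning a new one can be seen in a short sketch. `add_points` is a hypothetical function unrelated to TF-IDF: it returns `None`, yet the caller sees the change because the parameter and the caller's variable refer to the same dictionary object.

```python
def add_points(totals, points):
  """Add points to every value in totals, modifying totals in place."""
  for name in totals:
    # Increment the existing value rather than overwriting it
    totals[name] += points

totals = {"a": 0, "b": 2}
add_points(totals, 1)
add_points(totals, 1)
print(totals)  # {'a': 2, 'b': 4}
```

Calling the function twice accumulates both increments, which is the behaviour a multi-keyword query relies on.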

In [None]:
def update_scores(doc_to_score, keyword, doc_to_text):
  """ (dict of {str: number}, str, dict of {str: str}) -> None

  Update doc_to_score by adding to each entry's value the individual TF-IDF score
  for keyword, based on the documents in doc_to_text.
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate Part 3b

In [None]:
# Do not edit this cell
def check_part3b():
  ex_score, max_ex_score = 0, 0

  doc_to_score = {"AIDS": 0, "Cancer": 0, "COPD": 0}
  doc_to_text = {"AIDS": "aids is a disease in which there is a severe loss of the body s cellular immunity  greatly lowering the resistance to infection and malignancy",
                "Cancer": "cancer is a disease caused by an uncontrolled division of abnormal cells in a part of the body",
                "COPD": "copd is a common  preventable and treatable disease that is characterized by persistent respiratory symptoms like progressive breathlessness and cough"}

  curr_scores = doc_to_score.copy()
  update_scores(curr_scores, "notfound", doc_to_text)
  ex_score += test_homework("Keyword not in any of the documents", curr_scores, {"AIDS": 0, "Cancer": 0, "COPD": 0})
  max_ex_score += 1

  curr_scores = doc_to_score.copy()
  update_scores(curr_scores, "division", doc_to_text)
  ex_score += test_homework("Keyword in only one document", curr_scores, {"AIDS": 0, "Cancer": numpy.log(3), "COPD": 0})
  max_ex_score += 1
  update_scores(curr_scores, "resistance", doc_to_text)
  ex_score += test_homework("Scores get updated rather than replaced", curr_scores, {"AIDS": numpy.log(3), "Cancer": numpy.log(3), "COPD": 0})
  max_ex_score += 1

  curr_scores = doc_to_score.copy()
  update_scores(curr_scores, "disease", doc_to_text)
  ex_score += test_homework("Keyword in all the documents once", curr_scores, {"AIDS": 0, "Cancer": 0, "COPD": 0})
  max_ex_score += 1

  curr_scores = doc_to_score.copy()
  update_scores(curr_scores, "is", doc_to_text)
  ex_score += test_homework("Keyword in all the documents multiple times", curr_scores, {"AIDS": 0, "Cancer": 0, "COPD": 0})
  max_ex_score += 1

  compare_hw_scores(ex_score, max_ex_score)
  return ex_score, max_ex_score

_ = check_part3b()

## Part 3c: Finding the Highest Score

Complete the following function according to its docstring. 
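
The tie-breaking rule in the docstring depends on the order in which a dictionary is traversed; in Python 3.7 and later that is insertion order. The sketch below, on made-up data unrelated to the assignment, shows how a strict comparison keeps the earliest of several equal values.

```python
prices = {"tea": 3, "coffee": 5, "juice": 5}

best_item = None
best_price = None
for item, price in prices.items():
  # Strict > means a later entry with an equal value does not replace the first
  if best_price is None or price > best_price:
    best_item, best_price = item, price

print([best_item, best_price])  # ['coffee', 5]
```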

In [None]:
def find_highest_score(doc_to_score):
  """ (dict of {str: number}) -> list of [str, number]

  Find the document with the highest TF-IDF score in doc_to_score.
  Return a list where the first element is the name of the document and the second element is its score.
  If there are multiple entries with the same highest score, return the first one.
  """
  # Write your code here

In [None]:
# Test your function here

### Run the hidden code cell to evaluate Part 3c

In [None]:
# Do not edit this cell
def check_part3c():
  ex_score, max_ex_score = 0, 0

  doc_to_score = {"AIDS": 0, "Cancer": 0, "COPD": 5}
  ex_score += test_homework("One clear winner", find_highest_score(doc_to_score), ["COPD", 5])
  max_ex_score += 1

  doc_to_score = {"AIDS": 0, "Cancer": 5, "COPD": 5}
  ex_score += test_homework("Two documents with max score, return first", find_highest_score(doc_to_score), ["Cancer", 5])
  max_ex_score += 1

  compare_hw_scores(ex_score, max_ex_score)
  return ex_score, max_ex_score

_ = check_part3c()

# Part 4

Please read the instructions below the following code block.

In [None]:
# Do not edit this cell
proj_score, max_proj_score = 0, 0

try:
  part1a_score, max_part1a_score = check_part1a()
  proj_score += part1a_score
  max_proj_score += max_part1a_score

  part2a_score, max_part2a_score = check_part2a()
  proj_score += part2a_score
  max_proj_score += max_part2a_score

  part2b_score, max_part2b_score = check_part2b()
  proj_score += part2b_score
  max_proj_score += max_part2b_score

  part3a_score, max_part3a_score = check_part3a()
  proj_score += part3a_score
  max_proj_score += max_part3a_score

  part3b_score, max_part3b_score = check_part3b()
  proj_score += part3b_score
  max_proj_score += max_part3b_score

  part3c_score, max_part3c_score = check_part3c()
  proj_score += part3c_score
  max_proj_score += max_part3c_score
  
except NameError:
  raise Exception("Autograder failed to run. You either have not completed all of the exercises or have not run the entire notebook")

compare_hw_scores(proj_score, max_proj_score)

To confirm that all of your helper functions are working properly and that your notebook is suitable for submission, run the entire notebook up until this cell by clicking inside this cell and then selecting "Runtime" > "Run before" in the Google Colab menu.

Do not select "Run all" as you could for the homework assignments: your final program comes after this cell, so it would run as well.

Once you have confirmed that all of your helper functions are working, complete your program according to the comments that have been left below. Remember to use all of the helper functions that you have at your disposal.
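
The first steps of the program turn a raw query into a list of keywords. A minimal sketch of that conversion follows, with a hard-coded string standing in for keyboard input and `str.lower` standing in for your `clean_up` function; both substitutions are assumptions made so the example runs on its own.

```python
# A hard-coded string stands in for the value returned by input()
query = "Chest   pain and COUGH"

# lower() stands in for clean_up(); split() then breaks the text on
# any run of whitespace, so repeated spaces do not create empty keywords
keywords = query.lower().split()

print(keywords)  # ['chest', 'pain', 'and', 'cough']
```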

In [None]:
# TODO: Ask for user input from the keyboard

# TODO: Clean up the input by removing punctuation and converting to lowercase

# TODO: Convert the query into a list of keywords

# TODO: Load the data from the documents

# TODO: Initialize a dictionary of scores

# TODO: Update all of the documents' scores for each keyword

# TODO: Find the document with the highest score

# TODO: Print out the most relevant document and its score