In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("HW02.ipynb")

# Homework 02 - Exploring Obituaries

In this homework we are going to use Term Frequency-Inverse Document Frequency (TF-IDF) to explore obituaries from the New York Times. This assignment is based on [Matthew J. Lavin's](https://matthew-lavin.com/) (Professor at Denison University) [lesson](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf) on TF-IDF.


Let's begin by running the next cell, which imports the Python packages relevant to this assignment.

In [None]:
import pandas as pd
import numpy as np
import spacy
import os

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

## 1. Explore and Load Data (13 points)

<!--
BEGIN QUESTION
name: q1.1
manual: False
points: 
    - 1
    - 1
    - 1
-->

We have stored 364 obituaries from the New York Times. The obituaries are in text files stored in `data/obits_txt/`.

**Question 1.1 (3 points):** Load the obituaries into a single dataframe named **obits_df**. The dataframe should have two columns, **File_Name** and **Obit**: the former is the name of the file (do not include the path), and the latter is the text of the obituary stored as a string.
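
As a minimal sketch of the general pattern (not the required solution), the snippet below reads every text file in a directory into a list of records using only the standard library. The temporary directory, file names, and file contents are hypothetical stand-ins for the real data. A list of dictionaries like `records` can be passed to `pd.DataFrame` to get one row per file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two hypothetical text files to stand in for the real data.
    for name, text in [("a.txt", "First toy document."), ("b.txt", "Second toy document.")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(text)

    records = []
    for file_name in sorted(os.listdir(tmp_dir)):
        # Join the directory and file name only to open the file;
        # store the bare file name, without the path.
        with open(os.path.join(tmp_dir, file_name), encoding="utf-8") as f:
            records.append({"File_Name": file_name, "Obit": f.read()})

print(records)
```

Sorting the output of `os.listdir` is a design choice that keeps the row order reproducible, since the listing order is otherwise arbitrary.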

In [None]:
OBITS_PATH = "data/obits_txt/"



...


obits_df.head()

In [None]:
grader.check("q1.1")

### Applying spaCy to the obituaries

The `Obit` column contains the original, uncleaned text. For the first part of this assignment, we are going to apply the spaCy text processing pipeline that we covered in Tutorial 2.1. Run the next line to load the spaCy `en_core_web_sm` model into memory and store it in the variable named `nlp`.

In [None]:
nlp = spacy.load("en_core_web_sm")
type(nlp), nlp

<!--
BEGIN QUESTION
name: q1.2
manual: False
points: 
    - 1
    - 1
    - 1
-->

In Tutorial 2.1, we saw how to apply this trained spaCy model to text.

**Question 1.2 (3 points)**: In the next line, apply the spaCy model to every obituary and store each resulting spaCy doc in a new column in `obits_df`. The new column's name should be `spacy_doc`.

*If you are applying the model correctly, the next cell will take about 2 minutes to run*

In [None]:
...
obits_df.head(5)

In [None]:
grader.check("q1.2")

**Saving the `obits_df` dataframe**

<!-- BEGIN QUESTION -->

Applying the spaCy `nlp` pipeline to the 364 obituaries takes a while to run. Often we will work a bit, take a break, and then come back to continue working. We do not want to reprocess our data every time we come back to the project. With a larger collection of texts, applying the spaCy pipeline would take even longer, and reprocessing on every return would become quite annoying.

It is common practice to save our data after an expensive processing step, e.g. applying a sentiment model or tokenizing text.

<!--
BEGIN QUESTION
name: q1.3
manual: True
points: 1
-->


**Question 1.3 (1 point):** In the next cell, save the dataframe to a csv file named `tmp_obits.csv`. Make sure to use `index=None` when saving the csv file. This will prevent the index column from being saved to the csv file.

In [None]:
...

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q1.4
manual: False
points: 
    - 1
-->


**Question 1.4 (1 point):** In the next cell, load the file you just saved to a new dataframe called `tmp_obits_df`.

In [None]:
tmp_obits_df = ...
tmp_obits_df

In [None]:
grader.check("q1.4")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.5
manual: True
points: 1
-->


**Question 1.5 (1 point):** In the next cell, write some code to detect what is different between `obits_df` and `tmp_obits_df`.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.6
manual: True
points: 1
-->


**Question 1.6 (1 point):** In the next cell, briefly describe what is different between the two dataframes.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



When we want to save Python objects to a file, a better approach is to save the data as a pickle file.

> The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”. - https://docs.python.org/3/library/pickle.html

In simple terms, pickle is a way to save Python objects to a file without losing any information.
We can ***dump*** Python objects to a pickle file and ***load*** them directly back from it. This [Python wiki page](https://wiki.python.org/moin/UsingPickle) is a great reference on how to dump and load pickle files; it is my go-to page whenever I need to do either.
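
To make dump and load concrete, here is a minimal sketch using the standard-library `pickle` module. The `record` dictionary and the file name `record.pkl` are hypothetical examples, not part of the assignment data.

```python
import os
import pickle
import tempfile

# Hypothetical toy object standing in for processed data.
record = {"file_name": "example.txt", "tokens": ["an", "example", "obituary"], "n_tokens": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "record.pkl")

    # Dump: serialize the object to a binary file ("wb" = write bytes).
    with open(path, "wb") as f:
        pickle.dump(record, f)

    # Load: read the bytes back into an equivalent Python object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The list is still a list and the int is still an int after the round trip.
print(restored == record, type(restored["tokens"]), type(restored["n_tokens"]))
```

One caveat from the Python documentation: only unpickle files you trust, because loading a pickle can execute arbitrary code.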


Pandas has a built-in DataFrame method, [*.to_pickle()*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html), that pickles the dataframe to a file. The next line stores the dataframe, including the processed spaCy documents, in a pickle file called `obits_df.pkl`. Run the next line to save the DataFrame as a pickle file.

In [None]:
obits_df.to_pickle('data/obits_df.pkl')

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.7
manual: True
points: 0.5
-->


**Question 1.7 (0.5 point):** In the next line, write a bash command to see the size of the pickle file you just dumped. 

*Note:* Make sure to change the cell from a Markdown cell to a code cell.

*Hint:* If you are stuck, look at the [bash commands page on the course website](http://coms2710.barnard.edu/bash-commands.html)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

When dumping DataFrames to pickle files, we can dump the data into a compressed file. 

<!--
BEGIN QUESTION
name: q1.8
manual: True
points: 1
-->


**Question 1.8 (1 point):** In the next cell, dump the dataframe to a compressed pickle file. You can choose any of the compression methods, e.g. zip, gzip, etc.

*Hint:* You might want to look at the [*to_pickle()* documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html).

*Note:* Make sure to use the naming convention (last part of the file name) that corresponds to the type of compression you are using
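
The idea behind a compressed pickle can be sketched with the standard library alone: write the same pickled bytes through a compressing file object. This is an illustration of the concept rather than the pandas call the question asks for. The toy `data` list and the file names are hypothetical.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical, highly repetitive toy data; real text compresses less dramatically.
data = ["the quick brown fox jumps over the lazy dog"] * 10_000

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "data.pkl")
    gz_path = os.path.join(tmp_dir, "data.pkl.gz")  # ".gz" signals gzip compression

    with open(raw_path, "wb") as f:
        pickle.dump(data, f)

    # gzip.open returns a file-like object that compresses on write.
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(data, f)

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # Reading back goes through gzip.open as well.
    with gzip.open(gz_path, "rb") as f:
        restored = pickle.load(f)

print(f"uncompressed: {raw_size} bytes, gzip: {gz_size} bytes")
```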

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.9
manual: True
points: 0.5
-->


**Question 1.9 (0.5 point):** In the next cell, write a bash command to see the size of the compressed pickle file.

*Note:* Make sure to change the cell from a Markdown cell to a code cell.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.10
manual: True
points: 1
-->


**Question 1.10 (1 point):** What is the difference in size between the uncompressed and compressed versions of the pickle file, and what is your takeaway from these questions?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



When coming back to this assignment, you can just run the next cell, which will load the pickled dataframe that you have generated so far.

In [None]:
obits_df = pd.read_pickle('data/obits_df.pkl')
obits_df.keys(), obits_df.shape

## 2. Who is each obituary about? (14.5 points)

Unfortunately, we did not apply any labels to the obituaries. Ideally, the name of each file would indicate who the obituary is about, or we would provide metadata mapping the file names to the name of each obituary's subject. When working with real-world data, such labels are often missing.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.1
manual: True
points: 4
-->

**Question 2.1 (4 points):** In the next cell, write out an algorithm or an approach you could take to determine who the obituary is about. Justify why you think this approach will generally work. It is okay if the approach is not 100% accurate but it should work most of the time.

It might be helpful to spend some time sampling about 10 of the obituaries.
    
*Hint:* Leveraging the named entities from the spacy document object might be helpful. Look at the last section of Demo 5 for how to extract named entities and their labels from spacy.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q2.2
manual: False
points: 
    - 0.5
    - 1
    - 1
    - 1
    - 1
    - 1
    - 1
-->


**Question 2.2 (6.5 points):** In the next cell, apply the approach you just described to determine who is the subject of each obituary. In the `obits_df` dataframe, create a new column called `subject`. Each row's `subject` cell should indicate the subject of the obituary and the values can be a string of the subject's/person's name. 

*Note:* There is no need to reformat the name. For example, it is ok if the subject for JFK's obituary is *Kennedy*. 

In [None]:
...



In [None]:
grader.check("q2.2")

### Pareto principle (80-20 rule)

It is very likely that there are edge cases where your method fails or produces inaccurate results. This is okay and will often happen in research. An important question to keep asking yourself is how much fine-tuning of your methods is necessary and how many small mistakes are tolerable.

The goal of this assignment is to understand obituaries using TF-IDF. We could spend lots of time developing a foolproof heuristic that perfectly extracts the subject of every obituary. However, that would probably be a waste of time.

**Pareto Principle**

> The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of the causes (the “vital few”). Other names for this principle are the 80/20 rule, the law of the vital few, or the principle of factor sparsity. - https://en.wikipedia.org/wiki/Pareto_principle

For our purposes, we take the Pareto principle to mean that we want a 20% effort solution that covers 80% of our data. Spending time on the remaining 20% of edge cases is often not worth it. In most Computer Science classes, your code and solutions are evaluated by how well they cover edge cases. Here, however, the edge cases don't necessarily matter and aren't always worthwhile. It is important to remember the goal and keep our eyes on the prize.

With that said, it is important to sample our data to see whether our 20% effort approach covers 80% of the examples.

The next cell will randomly sample 20 extracted subjects and their corresponding file names.

In [None]:
np.random.seed(1)  # seed NumPy's global RNG so the sample is reproducible
for row_id, row in obits_df.sample(20).iterrows():
    print(row.subject, row.File_Name)
    print()  # blank line between samples

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.3
manual: True
points: 4
-->


**Question 2.3 (4 points):** After looking at those samples, briefly discuss how accurate your approach is and where it might have failed. Do you see any instances or patterns where your approach failed?

If your approach produced at least 16 correct matches among the 20 randomly sampled examples, that's a signal that your approach is good enough.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 3. Cleaning text (3 points)

Before we compute TF-IDF values, let's create cleaned versions of our text.

<!--
BEGIN QUESTION
name: q3.1
manual: False
points: 
    - 1
    - 1
    - 1 
-->


**Question 3.1 (3 points):** In the next cell, create two cleaned versions of each obituary, and store the two cleaned versions in the following columns in `obits_df`:

- `clean_text_one` will have no stop words, no punctuation, and no numbers, with all text lowercased,
- `clean_text_two` will have the same cleaning as `clean_text_one`, with each token replaced by its lemma.

*Note:* The values in both columns should be strings.


In [None]:
...

obits_df

In [None]:
grader.check("q3.1")

## 4. TF-IDF (5 points)

In Demo07, we learned how to use the `TfidfVectorizer` class from scikit-learn. Here, we will use the default settings for the class.

We will use the function called `make_tfidf_matrix` to create TF-IDF matrices for our obituaries.
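
To see what those default settings compute, here is a from-scratch sketch on a hypothetical toy corpus. It assumes the documented scikit-learn defaults: raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document row. It illustrates the formula and is not a replacement for `TfidfVectorizer`.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (not the obituaries), already free of stop words.
docs = [
    "senator senate senate vote",
    "painter painted sea sea sea",
    "senator met painter",
]

# Lowercase, then keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one document over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(toks) for toks in tokenized]

# Highest-scoring term per document.
top_terms = [max(zip(vocab, row), key=lambda pair: pair[1])[0] for row in matrix]
print(top_terms)
```

Terms that are frequent within one document but rare across the corpus receive the highest scores, which is why repeated, document-specific words rise to the top.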

<!--
BEGIN QUESTION
name: q4.1
manual: False
points: 
    - 0
-->


**Question 4.1 (0 points):** Complete the function based on the docstring.

In [None]:
def make_tfidf_matrix(df, text_column):
    """
    Given a dataframe and the name of a text column, return a TF-IDF matrix as a dataframe.
    Make sure the indices of the dataframe are the names of the subjects of the obituaries and
    the names of the columns are the specific words.
    """
    tfidf_vectorizer = TfidfVectorizer() 

    tf_idf_sparse_matrix = ...
    tfidf_df = pd.DataFrame(tf_idf_sparse_matrix.toarray())
    
    ...
    
    return tfidf_df

<!--
BEGIN QUESTION
name: q4.2
manual: False
points: 
    - 1
    - 1
    - 1
    - 1
-->


**Question 4.2 (4 points):** Now use the function to create two TF-IDF matrices named `tfidf_cleaned_one` and `tfidf_cleaned_two`, based on the `clean_text_one` and `clean_text_two` columns respectively.

In [None]:
tfidf_cleaned_one = ...
tfidf_cleaned_two = ...

In [None]:
grader.check("q4.2")

## 5. Top-Terms (16 points)

Now that we have computed TF-IDF scores, we can begin to analyze the obituaries and answer the following questions.

For Section 5, use only `tfidf_cleaned_one`.

### 5.1 Top single term (4 points)

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q5.1.1
manual: True
points: 2
-->


**Question 5.1.1 (2 points):** In the next cell, determine the word with the highest TF-IDF score for each obituary based on `tfidf_cleaned_one`. Store the result in the variable named `top_1_word`.

*Hint:* This can be done with a single pandas DataFrame function.

In [None]:
top_1_word = ...
top_1_word

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q5.1.2
manual: True
points: 2
-->


**Question 5.1.2 (2 points):** What trend do you notice about the top word for each of these obituaries?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### 5.2 Top 15 terms (12 points)

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q5.2.1
manual: True
points: 2
-->



**Question 5.2.1 (2 points):** In the next cell, determine the top 15 words with the highest tfidf score for each obituary. Store the result in the variable named `top_15_words` 

In [None]:
top_15_words = ...
top_15_words

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q5.2.2
manual: True
points: 10
-->



**Question 5.2.2 (10 points):** In the next cell, randomly sample 5 obituaries from `obits_df`. Read the 5 obituaries and briefly note whether the top-15 terms based on TF-IDF capture the person described in the obituary.

Make sure to print out the top-15 terms for that person as well. It might be a good idea to create one code cell and then one markdown cell.

<!-- END QUESTION -->

## 6. Finding Similar Obituaries (22 points)

In class we discussed how we can use cosine similarities to rank documents based on their similarities.
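
As a reminder of the computation, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below implements that definition in plain Python and ranks two hypothetical documents against a query document. The vectors and document names are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors for three toy documents.
vectors = {
    "doc_a": [1.0, 2.0, 0.0],
    "doc_b": [2.0, 4.0, 0.0],  # same direction as doc_a
    "doc_c": [0.0, 0.0, 3.0],  # shares no terms with doc_a
}

# Rank every other document by its similarity to the query, highest first.
query = "doc_a"
ranked = sorted(
    (name for name in vectors if name != query),
    key=lambda name: cosine(vectors[query], vectors[name]),
    reverse=True,
)
print(ranked)
```

Because cosine similarity depends only on direction, `doc_b` ranks first even though its vector is twice as long as `doc_a`.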

**Question 6.1 (0 points):** In the next cell, complete the function `make_similarities_matrix()` based on the docstring.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def make_similarities_matrix(df):
    """
    Given a tfidf-matrix as a dataframe, return a square (n by n) matrix where the cells indicate the cosine similarity
    between the corresponding row and column. 
    Make sure the indices and the column names are the names of the subjects of the obituaries.
    """
    ...
    return similarities_df

<!--
BEGIN QUESTION
name: q6.2
manual: False
points: 
    - 2
    - 2
    
-->


**Question 6.2 (4 points):** Now use the function to create two similarity matrices named `sim_df_one` and `sim_df_two`. The former should be based on `tfidf_cleaned_one` and the latter should be based on `tfidf_cleaned_two`.

In [None]:
sim_df_one = ...
sim_df_two = ...

In [None]:
grader.check("q6.2")

<!-- BEGIN QUESTION -->

### 6.3 Most Similar Queries (18 points)

<!--
BEGIN QUESTION
name: q6.3.1
manual: True
points: 
    - 2
    - 6
-->


**Question 6.3.1 (8 points):** In the next cell, create a new dataframe called `most_similar_obits_df`.
The indices should be the same as `sim_df_one` and `sim_df_two`. Create a column called `clean_text_one` where each value indicates the obituary that is most similar (based on `sim_df_one`) to the corresponding index.
Create a column called `clean_text_two` where each value indicates the obituary that is most similar (based on `sim_df_two`) to the corresponding index.



In [None]:
...
most_similar_obits_df

In [None]:
grader.check("q6.3.1")

<!-- END QUESTION -->



In this assignment we used two different pre-processing approaches. In one approach we used the lowercased tokens, and in the other we used the lowercased lemmas.

**Question 6.3.2 (10 points):** In the next few cells, determine whether this difference in pre-processing changed which obituary is most similar to each obituary. If the most similar obituaries differ, do the differences persist when we consider the set of the 5 closest obituaries? What about the 10 closest?

In addition to your code, please include a brief paragraph answering these questions.
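
One way to frame the comparison, sketched here on hypothetical scores rather than the real similarity matrices, is to take the set of the k nearest neighbors under each approach and measure how much the two sets overlap. The helper `top_k_neighbors` and the dictionaries `sim_tokens` and `sim_lemmas` are illustrative names, not part of the assignment.

```python
def top_k_neighbors(similarities, query, k):
    """Return the set of the k items most similar to `query`, excluding itself."""
    scores = similarities[query]
    ranked = sorted(
        (name for name in scores if name != query),
        key=lambda name: scores[name],
        reverse=True,
    )
    return set(ranked[:k])

# Hypothetical similarity scores for item "a" under two preprocessing variants.
sim_tokens = {"a": {"a": 1.0, "b": 0.9, "c": 0.5, "d": 0.4, "e": 0.1}}
sim_lemmas = {"a": {"a": 1.0, "b": 0.7, "c": 0.8, "d": 0.2, "e": 0.3}}

for k in (1, 2, 3):
    first = top_k_neighbors(sim_tokens, "a", k)
    second = top_k_neighbors(sim_lemmas, "a", k)
    overlap = len(first & second) / k  # fraction of shared neighbors
    print(k, sorted(first), sorted(second), overlap)
```

In this toy case the single nearest neighbor differs between the two variants, while the two nearest neighbors form the same set. Comparing sets rather than single items makes that kind of pattern visible.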

In [None]:
...

## 7. Feedback (Optional)

In the next cell, please provide any feedback about this assignment.

_Type your answer here, replacing this text._

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()