# Assignment 0: Getting (to know) the Kardashians
## © Cristian Danescu-Niculescu-Mizil 2023
## CS/INFO 4300 Language and Information

## Due by midnight on Friday January 27th

You must complete this assignment **individually**.

In this assignment we will be working with transcripts from the reality TV show "Keeping Up With The Kardashians" and cleaning the raw transcript data so that we may apply various layers of analysis in later assignments.

This assignment **is not intended to be a test of your programming skills**, but to get you familiar with the virtual environment and the structure of the data you will be analyzing. In fact, most of the code is provided and you only need to run through it and address two questions at the end of the notebook.

**Learning Objectives**

This project aims to help you to get comfortable working with the following tools / technologies / concepts:

* The Jupyter Notebook environment
* Recap of Python syntax and basic data structures
* `virtualenv` or `venv` environment for package dependencies

**Academic Integrity and Collaboration**

Note that these projects should be completed individually. As a result, all University-standard academic integrity guidelines must be followed.

**Guidelines**

All cells that contain the blocks that read `# YOUR CODE HERE` are editable and are to be completed to ensure you pass the test-cases. Make sure to write your code where indicated.

All cells that read `YOUR ANSWER HERE` are free-response cells that are editable and are to be completed. 

You may use any number of notebook cells to explore the data and test out your functions, although you will only be graded on the solution itself.

You are unable to modify the read-only cells.

You should also use Markdown cells to explain your code and discuss your results when necessary.
Instructions can be found [here](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).

All floating point values should be printed with **2 decimal places** precision. You can do so using the built-in `round` function.
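
As a small sketch of the rounding guideline (the value below is illustrative and not taken from the assignment data): `round` returns a float, so a trailing zero may disappear when it is printed, whereas a format specification always shows two decimals.

```python
# Illustrative value only; not taken from the assignment data.
value = 2.5049

rounded = round(value, 2)      # float rounded to two decimal places
formatted = f"{value:.2f}"     # string that always shows two decimals

print(rounded)     # the float may print without a trailing zero
print(formatted)   # the string keeps both decimal digits
```

Either form may be acceptable; if a test compares numeric values, `round` is the safer choice because it keeps the result a float.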

**Grading**

For code-completion questions you will be graded on passing the public test cases we have included, as well as on hidden test cases that we have added to check that your logic is correct.

For free-response questions you will be manually graded on the quality of your answer.

**Submission**

You are expected to submit this .ipynb as your submission for Assignment 0. 

In addition, please submit an HTML copy of the notebook (you can create this by clicking `File` > `Download as` > `HTML (.html)`).

Double-check that you submitted what you actually intended by re-downloading the files from CMS after submitting them. As stated in the syllabus, you won't be allowed to update the files after the deadline.

**Additional Notes**

To set up your environment, review the writeup attached to this assignment.

Make sure to fill out the startup quiz on CMS.

In [1]:
import sys
import re
from glob import glob
import os
from itertools import groupby
import pickle
import bs4
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Ensure that your kernel is using Python3
assert sys.version_info.major == 3
assert sys.version_info.minor == 7

# Processing the transcripts of "Keeping up with the Kardashians"

Transcripts of this TV show are available online and have been downloaded and provided to you in HTML format. However, they show very weak structure and a lot of noise. There is a lot of work to be done to render them usable for analysis.

We will use the *BeautifulSoup* library, which makes working with messy HTML much easier.

In [3]:
transcript_filename = "kardashians_data/livedash_kardashians/273926.html"
with open(transcript_filename) as f:
    bsoup = bs4.BeautifulSoup(f, "html5lib")

The title of the episode transcribed in the current file can be found in the element with the `title` id:

In [4]:
bsoup.find(attrs={'id': 'title'}).get_text()

'Keeping Up With the Kardashians - Shape Up or Ship Out'

Each line of conversation is a table row with two table cells, one containing the timestamp, the other the text:

In [5]:
bsoup.findAll("tr")[100:105]

[<tr><td><a name="943920369"></a>00:06:09</td><td>&gt;&gt; KOURTNEY: This one.
 </td></tr>,
 <tr><td><a name="943920370"></a>00:06:10</td><td>&gt;&gt; KHLOE: Here, you want this
 one?
 </td></tr>,
 <tr><td><a name="943920371"></a>00:06:11</td><td>&gt;&gt; KOURTNEY: Yes, thank you.
 </td></tr>,
 <tr><td><a name="943920374"></a>00:06:14</td><td>&gt;&gt; KOURTNEY: Okay.
 </td></tr>,
 <tr><td><a name="943920375"></a>00:06:15</td><td>I like that one better.
 </td></tr>]

The formatting used by the transcripts is not completely normalized, but follows some patterns.

For instance, there are two types of lines:

 * When a character starts to speak after someone else, this is marked as such:
 
     `>> KOURTNEY: Okay.`


 * When a character has already been speaking and continues, the line is simply:
 
     `I like that one better.`
     
     The text in the second kind of line can be considered to be part of the same speech
     act as the previous one.
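
The two line types above can be sketched as a small grouping step. Note that `merge_speech_acts` is a hypothetical helper written for illustration only; the notebook's own parser keeps one entry per line and repeats the speaker rather than merging text.

```python
def merge_speech_acts(lines):
    """Group raw transcript lines into (speaker, text) speech acts.

    Hypothetical helper for illustration; not part of the assignment code.
    """
    acts = []
    for line in lines:
        if line.startswith(">>"):
            # New speaker: split ">> NAME: text" at the first colon.
            speaker, _, text = line[2:].partition(":")
            acts.append([speaker.strip(), text.strip()])
        elif acts:
            # Continuation line: append to the previous speaker's text.
            acts[-1][1] += " " + line.strip()
        # Lines that appear before any speaker marker are skipped.
    return [tuple(act) for act in acts]


sample = [
    ">> KOURTNEY: Yes, thank you.",
    ">> KOURTNEY: Okay.",
    "I like that one better.",
]
merged = merge_speech_acts(sample)
print(merged)
```

Here the last two sample lines end up in a single speech act attributed to KOURTNEY.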

However, the transcripts contain some irregularities, which we illustrate and work around below.

## Extracting valid dialogue

In [6]:
def strip_actions(line):
    """Some of the texts contain indications about the actions
    that the characters do. For example:
        
        (Kourtney and Khloe laughing) >> BRUCE: Sometimes,
        I can get so disappointed with these girls.
    
    This function should remove everything between parentheses
    in the line passed as argument.  You may assume no nesting.
    
    >>> strip_actions("a(bc)d(efg)")
    'ad'

    """
    # Raw string avoids invalid-escape warnings; behavior is unchanged.
    return re.sub(r"\([^()]*\)", "", line)

def break_up_line(line):
    """When characters take turns in quick succession, we may have two
    characters speaking in the same line. Example:

        >> SCOTT: Three? >> KOURTNEY: Yeah.
    
    We should split this line into two appropriate lines.
    
    >>> break_up_line(">> SCOTT: Three? >> KOURTNEY: Yeah.")
    ['>> SCOTT: Three?', '>> KOURTNEY: Yeah.']
    
    For consistency, we should always return a list:
    >>> break_up_line(">> SCOTT: Three?")
    ['>> SCOTT: Three?']
    
    When there are no indications of who is speaking, simply return
    a list containing a single string that contains the text.
    
    >>> break_up_line("I am.")
    ['I am.']
    
    """
    lines = line.split(">>")
    if lines[0].strip() == "":
        return list(map(lambda s: ">>" + s.rstrip(), lines[1:]))
    else:
        return [lines[0].strip()]


def is_valid_speaker(speaker_string):
    return speaker_string.strip().isupper()

def is_valid_transcript_char(char):
    return char.isalpha() or char.isdigit() or char in ".?!$\"'"
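
To see how the first two helpers combine, here is a minimal standalone sketch. The functions are restated compactly so the snippet runs on its own, and the second input line is made up for illustration.

```python
import re


def strip_actions(line):
    # Same behavior as the helper above: drop every "(...)" span.
    return re.sub(r"\([^()]*\)", "", line)


def break_up_line(line):
    # Same logic as the helper above.
    lines = line.split(">>")
    if lines[0].strip() == "":
        return [">>" + s.rstrip() for s in lines[1:]]
    return [lines[0].strip()]


raw = ("(Kourtney and Khloe laughing) >> BRUCE: Sometimes, "
       "I can get so disappointed with these girls.")
cleaned = break_up_line(strip_actions(raw))
print(cleaned)

# Made-up line: continuation text followed by a new speaker.
mixed = break_up_line("Hello. >> KIM: Hi.")
print(mixed)
```

With this logic, a line that begins with continuation text keeps only the part before the first `>>`, which is worth keeping in mind when reading the counts later in the notebook.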

The bulk of the function that turns a transcript into a usable representation is given below.

In [7]:
def parse_kardashians_transcript(raw_html):
    bsoup = bs4.BeautifulSoup(raw_html, "html5lib")

    # Get the title of the TV show
    title = bsoup.find(attrs={'id': 'title'})
    title = title.get_text() if title else None

    transcript = []
    speaker = None
    
    # We maintain an error state. If we reach an invalid line, the captioning
    # turned bad (perhaps an advertisement was being captioned instead of the TV show),
    # but sometimes it gets fixed.
    captioning_broken = False 

    for row in bsoup.findAll("tr"):
        
        # We are looking for table rows that have exactly two cells, where
        # the first cell has a timestamp with a link. Skip other table rows.
        cells = row.findAll("td")
        if len(cells) != 2:
            continue
        time_cell, text_cell = cells
        if len(time_cell.contents) != 2:
            continue
        anchor, timestamp = time_cell.contents
        if anchor.name != 'a':
            continue

        text = " ".join(text_cell.get_text().splitlines())
        text = strip_actions(text)

        for subline in break_up_line(text):
            # At this point, subline is a single speech act, which should contain either
            # a speaker name and text, or just the text (if the speaker stays the same).
            # e.g. the variable subline might be ">> SCOTT: Three?" right now.
            
            # Use a regular expression to see if the line has the speaker marked with ">>".
            # The regex should ONLY match if the speaker name is present. If it is, it
            # should return two matching groups, one for the speaker and one for the text.
            # e.g. your regex should not match "I am." but it should match ">> SCOTT: Three?".
            # In the latter case, the first regex group should match "SCOTT" and the second
            # regex group should match "Three?"
            
            speaker_re = r'>> (.*): (.+)'
            speaker_match = re.match(speaker_re, subline)
            orig_subline = subline
            if speaker_match:
                speaker_string, subline = speaker_match.groups()
                if not is_valid_speaker(speaker_string):
                    # The speaker is not a valid all-uppercase string, so something is broken.
                    captioning_broken = True
                else:
                    # We have a well-formed line. We can recover and exit the error state.
                    captioning_broken = False  
                speaker = speaker_string
            elif subline.startswith(">>"):
                # The line starts with ">>" but doesn't match the speaker regular expression.
                captioning_broken = True
            if not is_valid_transcript_char(subline[0]):
                # The line starts with an invalid character
                captioning_broken = True
            if speaker is None:
                # No speaker has been marked yet, but the lines are plain, as if
                # continuing from a known speaker. We cannot tell who is speaking.
                captioning_broken = True

            if not captioning_broken:
                transcript.append(dict(timestamp=timestamp,
                                       speaker=speaker,
                                       text=subline))
    
    return title, transcript
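
The behavior of the speaker pattern used in the parser can be checked in isolation. This is a small sketch; the third input is a made-up line, included only to show that the first group is greedy.

```python
import re

speaker_re = r'>> (.*): (.+)'   # same pattern as in the parser

marked = re.match(speaker_re, ">> SCOTT: Three?")
plain = re.match(speaker_re, "I am.")
# Made-up line with a second ": " to show the greedy first group.
greedy = re.match(speaker_re, ">> KIM: Note: this")

print(marked.groups())   # speaker and text captured separately
print(plain)             # no speaker marker, so no match
print(greedy.groups())   # first group extends to the last ": "
```

In the greedy case the captured speaker string contains lowercase letters, so a check like `is_valid_speaker` would reject it and the parser would enter its error state.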

## Preparing the data for processing.

Go through all the provided files and build two dictionaries:

* `titles[transcript_id] = ` *title of the episode in the transcript*
* `transcripts[transcript_id] = ` *the parsed transcript*

as returned by `parse_kardashians_transcript`. A convenient transcript ID is defined by the `_nice_key` function below.

In [8]:
def _nice_key(file_path):
    """Convenience function to get a unique transcript ID that is shorter than the filename"""
    return file_path.split("_", 2)[2].rsplit(".")[0]

_nice_key('kardashians_data/livedash_kourtney_and_kim/273926.html')

'kourtney_and_kim/273926'

In [9]:
import os
titles = {}
transcripts = {}

for subdir in os.listdir("kardashians_data"):
    if os.path.isdir("kardashians_data/" + subdir):
        for filename in os.listdir("kardashians_data/" + subdir):
            path = "kardashians_data/" + subdir + "/" + filename
            if os.path.splitext(path)[1].lower() == ".html":
                transcript_id = _nice_key(path)
                with open(path) as f:
                    title, transcript = parse_kardashians_transcript(f)
                    titles[transcript_id] = title
                    transcripts[transcript_id] = transcript 

We can now count the number of messages that are stored in the transcripts:

In [10]:
sum(len(t) for t in transcripts.values())

202230

In [11]:
len(set({'kardashians/273926': 'Keeping Up With the Kardashians - Shape Up or Ship Out',
 'kardashians/273926': 'Keeping Up With the Kardashians - Shape Up or Ship Out'}.values()))

1

## Question 1 (Code Completion): Analysis of Episode Titles
Multiple HTML files could contain the transcripts from the same episode, which is indicated by the title, resulting in duplicates in our data.

*In the cell below, complete the function to determine how many distinct episodes are present in the files.*

In [12]:
def num_episodes(input_titles):
    """ Method takes in the titles and returns the number of distinct episodes
    
        Note: What kind of data structure should be used here? 
        Note: We generally recommend using the local variables (in this case, 'input_titles') 
        when filling out these functions.
    """

    # use a set to store the distinct values

    distinct_values = {value for value in input_titles.values()}
    return len(distinct_values)

In [13]:
"""Check that num_episodes returns the correct output"""    
assert num_episodes(titles) == 56, f"{num_episodes(titles)} titles"
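
To inspect which titles are shared by several files, one option is to invert the mapping from transcript ID to title. This is a sketch on toy data; the IDs and titles below are made up and only mimic the shape of the `titles` dictionary.

```python
from collections import defaultdict

# Toy stand-in for the `titles` dictionary; IDs and titles are made up.
toy_titles = {
    "show_a/1": "Episode One",
    "show_a/2": "Episode Two",
    "show_b/3": "Episode One",
}

# Invert the mapping: title -> list of transcript IDs carrying that title.
ids_by_title = defaultdict(list)
for transcript_id, title in toy_titles.items():
    ids_by_title[title].append(transcript_id)

# Keep only the titles that occur in more than one file.
duplicates = {t: ids for t, ids in ids_by_title.items() if len(ids) > 1}
print(duplicates)
```

The number of keys in `ids_by_title` equals the number of distinct titles, which is the same quantity a set of the dictionary's values would give.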

In later assignments we will be analyzing the language used by central characters in the show. It turns out that one of the characters is referred to by two different names, Rob and Robert. The function below can be used to replace a specified name with a new one.

In [14]:
def replace_name(transcripts, original_name, replacement_name):
    for k in transcripts.keys():
        for i in range(0,len(transcripts[k])):
            if transcripts[k][i]['speaker'] == original_name:
                transcripts[k][i]['speaker'] = replacement_name
    return transcripts
transcripts = replace_name(transcripts, "ROB", "ROBERT")

## Question 2 (Code Completion): Speaker utterances

In the cell below complete the function to determine: *How many times does the speaker ROBERT appear in the processed transcripts?*

In [None]:
def num_robert_utterances(input_transcripts):
    """ Method takes in the transcripts and returns the number of utterances by 'ROBERT'
    """
    count = 0
    for k in input_transcripts.keys():
        for i in range(0,len(input_transcripts[k])):
            if input_transcripts[k][i]['speaker'] == 'ROBERT':
                count += 1
    return count

In [None]:
"""Check that num_robert_utterances returns an output in the following range."""
assert num_robert_utterances(transcripts) >= 18117, f"{num_robert_utterances(transcripts)} utterances"
assert num_robert_utterances(transcripts) <= 18591
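
A more general tally over all speakers can be sketched with `collections.Counter`. The toy data below is made up and only assumes the same structure as `transcripts`: a dictionary mapping a transcript ID to a list of dictionaries with a `speaker` key.

```python
from collections import Counter

# Toy stand-in for `transcripts`; all entries are made up.
toy_transcripts = {
    "ep1": [dict(timestamp="00:00:01", speaker="KIM", text="Hi."),
            dict(timestamp="00:00:02", speaker="ROBERT", text="Hey.")],
    "ep2": [dict(timestamp="00:00:05", speaker="ROBERT", text="Okay.")],
}

# Count one utterance per stored line, across all transcripts.
speaker_counts = Counter(
    line["speaker"]
    for transcript in toy_transcripts.values()
    for line in transcript
)
print(speaker_counts["ROBERT"])
print(speaker_counts.most_common(1))
```

Counting every speaker at once makes it easy to spot name variants, such as the ROB/ROBERT case handled earlier.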

## End of Assignment 0