# Guided Project 13: Winning Jeopardy

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Slide 1
---
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. If you need help at any point, you can consult our solution notebook [here](https://github.com/dataquestio/solutions/blob/master/Mission210Solution.ipynb).

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named `jeopardy.csv`, and contains `20000` rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` - the Jeopardy episode number.
- `Air Date` - the date the episode aired.
- `Round` - the round of Jeopardy.
- `Category` - the category of the question.
- `Value` - the number of dollars the correct answer is worth.
- `Question` - the text of the question.
- `Answer` - the text of the answer.


#### Instructions
---
- Read the dataset into a Dataframe called `jeopardy` using Pandas.


- Print out the first `5` rows of `jeopardy`.


- Print out the columns of `jeopardy` using `jeopardy.columns`.


- Some of the column names have spaces in front.
    - Remove the spaces from each item in `jeopardy.columns`.
    - Assign the result back to `jeopardy.columns` to fix the column names in `jeopardy`. \[Note: Instead, I'm using `skipinitialspace=True` inside `pd.read_csv()` to eliminate that leading whitespace.\]
    
    
- Pay close attention to the format of each column.
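
The two ways of handling the leading spaces in the header can be compared on a small in-memory CSV. This is a minimal sketch: the `raw` string is made-up sample data, not the real `jeopardy.csv`, and the names `df_strip` and `df_skip` are only illustrative.

```python
import io

import pandas as pd

# Made-up sample mimicking a file with a space after each comma.
raw = "Show Number, Air Date, Round\n4680, 2004-12-31, Jeopardy!\n"

# Option 1: read the file as-is, then strip whitespace from the column names.
df_strip = pd.read_csv(io.StringIO(raw))
df_strip.columns = df_strip.columns.str.strip()

# Option 2: let the parser skip the spaces that follow each delimiter.
df_skip = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(df_strip.columns))
print(list(df_skip.columns))
```

Both options yield the same column names. Note that `skipinitialspace=True` also drops the space in front of each data value, whereas stripping `columns` leaves the cell contents untouched.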

### A first look into the data set.

In [151]:
jeopardy = pd.read_csv('jeopardy.csv', skipinitialspace=True)

In [152]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [153]:
jeopardy.head(5)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | \$200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | \$200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | \$200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | \$200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | \$200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [154]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 625.0+ KB


## Slide 2
---
Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the `Question` and `Answer` columns). We covered normalization before, but the idea is to ensure that you put words in lowercase and remove punctuation so `Don't` and `don't` aren't considered to be different words when you compare them.


#### Instructions
---
- Write a function to normalize questions and answers. The function should:
    - Take in a string.
    - Convert the string to lowercase.
    - Remove all punctuation in the string.
    - Return the string.
    
    
- Normalize the `Question` column.
    - Use the Pandas `Series.apply` method to apply the function to each item in the Question column.
    - Assign the result to the `clean_question` column.
    
    
- Normalize the Answer column.
    - Use the Pandas `Series.apply` method to apply the function to each item in the `Answer` column.
    - Assign the result to the `clean_answer` column.




### Normalizing questions and answers.

To make the entries in the `Question` column comparable with those in the `Answer` column, I apply a small normalization step to both columns:

- applying a function that lowercases strings and removes punctuation.
- assigning changes to two new columns - `clean_question` and `clean_answer`.

In [155]:
def norm_string(string):
    """Lowercases a string and replaces every non-word character with a space."""
    
    string_mod = string.lower()
    string_mod = re.sub(r'\W', ' ', string_mod)
    
    return string_mod

In [156]:
jeopardy['clean_question'] = jeopardy['Question'].apply(norm_string)

jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm_string)

Collapsing runs of two or more whitespace characters into a single space:

In [157]:
jeopardy['clean_question'] = \
    jeopardy['clean_question'].str.replace(r'\s{2,}', ' ', regex=True)

jeopardy['clean_answer'] = \
    jeopardy['clean_answer'].str.replace(r'\s{2,}', ' ', regex=True)
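
To see what the two normalization steps do to a concrete string, they can be combined into one helper. This is a minimal sketch: the name `normalize` and the sample strings are illustrative and not part of the notebook.

```python
import re


def normalize(text):
    """Lowercase, replace non-word characters with spaces, collapse whitespace runs."""
    lowered = text.lower()
    spaced = re.sub(r"\W", " ", lowered)
    return re.sub(r"\s{2,}", " ", spaced)


print(repr(normalize("Don't")))
print(repr(normalize('"Hi," she said!')))
```

Because punctuation becomes a space rather than being deleted, `Don't` turns into two tokens, and a string that starts or ends with punctuation keeps a single leading or trailing space. Appending `.strip()` would remove it, if that matters for later splitting.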

## Slide 3
---
Now that you've normalized the text columns, there are also some other columns to normalize.

The `Value` column should be numeric, so that you can manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be a datetime, not a string, so that it is easier to work with.



#### Instructions
---
- Write a function to normalize dollar values. The function should:
    - Take in a string.
    - Remove any punctuation in the string.
    - Convert the string to an integer.
    - Assign `0` instead if the conversion has an error.
    - Return the integer.
    
- Normalize the `Value` column.
    - Use the Pandas `Series.apply` method to apply the function to each item in the Value column.
    - Assign the result to the `clean_value` column.
    
    
- Use the `pandas.to_datetime` function to convert the `Air Date` column to a datetime column.

The function below is applied to `Value` to turn its dollar strings into integers.

In [158]:
def norm_usd(string):
    """Converts a string/currency value into an integer."""
    
    try:
        converted = re.sub(r'[$,]', '', string)
        converted = int(converted)
    except (TypeError, ValueError):
        # Entries that are not numeric text (or not strings at all) fall back to 0.
        converted = 0
    
    return converted

In [159]:
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_usd)

In [160]:
jeopardy['clean_value'].unique()

array([  200,   400,   600,   800,  2000,  1000,  1200,  1600,  3200,
           0,  5000,   100,   300,   500,  1500,  4800,  1800,  1100,
        2200,  3400,  3000,  4000,  6800,  1900,  3100,   700,  1400,
        2800,  8000,  6000,  2400, 12000,  3800,  2500,  6200, 10000,
        7000,  1492,  7400,  1300,  7200,  2600,  3300,  5400,  4500,
        2100,   900,  3600,  2127,   367,  4400,  3500,  2900,  3900,
        4100,  4600, 10800,  2300,  5600,  1111,  8200,  5800,   750,
        7500,  1700,  9000,  6100,  1020,  4700,  2021,  5200,  3389],
      dtype=int64)
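
The same conversion can also be written without a row-wise `apply`. This is a sketch on made-up sample values, assuming the column holds strings such as `'$1,000'` and non-numeric text such as `'None'`; the names `values`, `digits` and `clean` are illustrative.

```python
import pandas as pd

# Made-up sample values in the same shape as the `Value` column.
values = pd.Series(["$200", "$1,000", "None"])

# Drop dollar signs and thousands separators, then coerce anything
# that is still not a number to NaN and replace it with 0.
digits = values.str.replace(r"[$,]", "", regex=True)
clean = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(clean.tolist())
```

As with `norm_usd`, entries that cannot be parsed end up as `0`, so a `0` in the result does not distinguish a missing value from a real one.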

Lastly, the `Air Date` column can be converted to a datetime type.

In [161]:
jeopardy['clean_air_date'] = pd.to_datetime(jeopardy['Air Date'])
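
A short sketch of what the datetime type makes possible, using made-up dates rather than the real column; the names `air_dates`, `parsed`, `years` and `earliest` are illustrative.

```python
import pandas as pd

# Made-up dates in the same ISO format as `Air Date`.
air_dates = pd.Series(["2004-12-31", "1984-09-10", "1999-05-03"])
parsed = pd.to_datetime(air_dates)

# Date parts and comparisons are available through the datetime values.
years = parsed.dt.year
earliest = parsed.min()

print(years.tolist())
print(earliest)
```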

## Slide 4
---
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:


- How often the answer can be used for a question.


- How often questions are repeated.


You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.



#### Instructions
---
- Write a function that takes in a row in `jeopardy`, as a Series. It should:
    - Split the `clean_answer` column around spaces and assign to the variable `split_answer`.
    - Split the `clean_question` column around spaces and assign to the variable `split_question`.
    
    
- Create a variable called `match_count`, and set it to `0`.
    - If `the` is in `split_answer`, remove it using the `remove` method on lists (see its description below). `the` is commonly found in answers and questions, but doesn't have any meaningful use in finding the answer.
    - If the length of `split_answer` is `0`, return `0`. This prevents a division by zero error later.
    - Loop through each item in `split_answer`, and see if it occurs in `split_question`. If it does, add `1` to `match_count`.
    - Divide `match_count` by the length of `split_answer`, and return the result.
    
    
- Count how many times terms in `clean_answer` occur in `clean_question`.
    - Use the Pandas `DataFrame.apply` method to apply the function to each row in `jeopardy`.
    - Pass the `axis=1` argument to apply the function across each row.
    - Assign the result to the `answer_in_question` column.
    
    
- Find the mean of the `answer_in_question` column using the `mean` method on Series.


- Write up a markdown cell with a short explanation of how finding this mean might influence your studying strategy for Jeopardy.


From the Python documentation:

- list.remove(x)
    - Remove the first item from the list whose value is equal to x. It raises a ValueError if there is no such item.
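
The behaviour described above can be checked on a small list. This is only an illustration with a made-up sentence; the names `words`, `without_the` and `raised` are not part of the notebook.

```python
# list.remove drops only the first matching item.
words = "the king and the queen".split(" ")
words.remove("the")
print(words)

# A list comprehension drops every occurrence instead.
without_the = [w for w in "the king and the queen".split(" ") if w != "the"]
print(without_the)

# Removing a value that is absent raises ValueError,
# which is why the membership check comes first.
try:
    ["king"].remove("the")
    raised = False
except ValueError:
    raised = True
print(raised)
```

So an answer that contains `the` twice keeps one copy after a single `remove` call.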

In [162]:
def count_matches(row):
    
    split_answer = row["clean_answer"].split(' ')
    split_question = row["clean_question"].split(' ')
    
    if "the" in split_answer:
        split_answer.remove("the")
        
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [163]:
answer_in_question_mean = round(jeopardy["answer_in_question"].mean()*100, 2) 

f'{answer_in_question_mean = }%'

'answer_in_question_mean = 6.93%'

### Comment

On average, only about 7% of the words in an answer also appear in its question. The words that do match are likely to be common ones such as 'a', 'an' or 'for' rather than the key term of the answer (see the slice of cleaned answers below), which frustrates the strategy of relying on the words in the question alone to find the answer.
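
To make the ratio concrete, here is a sketch of the same calculation on plain strings. It is a variant, not the notebook's function: `overlap_ratio` is a hypothetical helper that uses `str.split()` without an argument, because `split(' ')` returns an empty-string token for a leading or trailing space, which inflates the denominator and means the length-zero guard never fires.

```python
def overlap_ratio(clean_answer, clean_question):
    """Share of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


# Made-up example: one of the two remaining answer words is in the question.
print(overlap_ratio("the congress party", "this party led india to independence"))

# split(' ') keeps an empty token for the trailing space; split() does not.
print("arthur ".split(" "), "arthur ".split())
```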

In [164]:
jeopardy.loc[20:40,'clean_answer']

20                                              morocco
21                                          paul bonwit
22    hattie mcdaniel for her role in gone with the ...
23                                                  era
24                                   the congress party
25                                     wilt chamberlain
26                                                   k2
27                                          ethan allen
28                                                  ply
29                                               horton
30                                                nixon
31                                             a kennel
32                                                moses
33                                            aerosmith
34                                              oratory
35                          coolidge or chester arthur 
36                                       business class
37                                             m

## Slide 5
---
Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about `10%` of the full Jeopardy question dataset, but you can investigate it at least.


To do this, you can:


- Sort `jeopardy` in order of ascending air date.


- Maintain a set called `terms_used` that will be empty initially.


- Iterate through each row of `jeopardy`.


- Split `clean_question` into words, remove any word shorter than `6` characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to terms_used.


This allows you to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables you to filter out words like `the` and `than`, which are commonly used, but don't tell you a lot about a question.



#### Instructions
---
- Create an empty list called `question_overlap`.


- Create an empty set called `terms_used`.


- Sort jeopardy by ascending air date.
- Use the `iterrows` Dataframe method to loop through each row of `jeopardy`.
    - Split the `clean_question` column of the row on the space character (` `), and assign to `split_question`.
    - Remove any words in `split_question` that are less than 6 characters long.
    - Set `match_count` to `0`.
    - Loop through each word in `split_question`.
        - If the term occurs in `terms_used`, add `1` to `match_count`.
    - Add each word in `split_question` to `terms_used` using the `add` method on sets.
    - If the length of `split_question` is greater than `0`, divide `match_count` by the length of `split_question`.
    - Append `match_count` to `question_overlap`.
    
    
- Assign question_overlap to the question_overlap column of jeopardy.


- Find the mean of the `question_overlap` column and print it.


- Look at the value, and think about what this might mean for questions being recycled. Write up your thoughts in a markdown cell.

### Answering the question 'how often are questions repeated?' by seeing how often complex words (6 or more characters) reoccur.

Preliminary clean-up: ordering questions by show air date. 

In [165]:
jeopardy = jeopardy.sort_values(by='Air Date').reset_index(drop=True)

In the process below, Stage 2 (marked with a comment) is split into two loops, 'a' and 'b'. Loop 'a' counts the words that already appeared in earlier questions, and only then does loop 'b' add the current question's words to `terms_used`, so a word repeated within the same question is not counted as a re-occurrence. Note also the time logic behind this process, which is why the questions are sorted by air date: a word counts as repeated only when it was used in a question from a past show (or from the same air date).

Sets are unordered, and the order in which their elements are iterated can change from one Python session to the next, even when the content is the same. For that reason, it is better to run the loop once to fill `terms_used` with the desired terms, convert it into a list, and save it externally, so that the order of the values is preserved. Since `question_overlap` is produced by the same loop, we save it as well.

    question_overlap = []

    terms_used = set()

    for index, val in enumerate(jeopardy['clean_question']):

        ## Stage 1.
        split_question = val.split(' ')

        split_question = [word for word in split_question if len(word) >= 6]


        ## Stage 2.
        match_count = 0

        # Loop a.
        for word in split_question:
            if word in terms_used:
                match_count += 1

        # Loop b.
        for word in split_question:
            terms_used.add(word)


        ## Stage 3. 
        if len(split_question):
            match_count /= len(split_question)

        question_overlap.append(match_count)
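
The loop above can be tried on a few made-up questions to see how the ratios come about. This is a sketch: `overlap_ratios` and `toy_questions` are hypothetical names, and the questions are invented for the example.

```python
def overlap_ratios(questions, min_len=6):
    """For each question, the share of its long words already seen in earlier questions."""
    terms_used = set()
    ratios = []
    for question in questions:
        # Stage 1: keep only words with at least `min_len` characters.
        words = [w for w in question.split(" ") if len(w) >= min_len]
        # Stage 2a: count words seen in earlier questions.
        match_count = sum(1 for w in words if w in terms_used)
        # Stage 2b: only now register the current question's words.
        terms_used.update(words)
        # Stage 3: turn the count into a ratio (0 when no long words remain).
        ratios.append(match_count / len(words) if words else 0)
    return ratios


toy_questions = [
    "which painter created guernica",
    "this painter created the famous ceiling",
    "name the river",
]
print(overlap_ratios(toy_questions))
```

The second question reuses two of its four long words, so its ratio is 0.5. As `terms_used` grows with every row, later questions are increasingly likely to score high even when they are not repeats.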

**Important note:** `set` objects do not retain a specific order. If you save `set(['a', 'b', 'c'])`, reading it back can give a different order every time, such as `set(['a', 'c', 'b'])` or `set(['b', 'c', 'a'])`. The best practice is to convert the set object into a list and then save it.

    import pickle


    # Write each object to its own binary pickle file; the `with` blocks
    # close the files automatically.
    with open("question_overlap.pkl", "wb") as qo:
        pickle.dump(question_overlap, qo)

    with open("terms_used.pkl", "wb") as tu:
        pickle.dump(list(terms_used), tu)  # converted into a list before saving
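
As a variation on the idea above, sorting the set before saving gives a list whose order is the same on every run, not just preserved after the first one. The sketch below is an assumption-laden illustration: it uses a made-up three-word set and writes to a temporary directory rather than to the notebook's files.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the real `terms_used` set.
terms_used = {"painter", "guernica", "ceiling"}

# sorted() returns a list in alphabetical order, regardless of set order.
path = os.path.join(tempfile.mkdtemp(), "terms_used.pkl")
with open(path, "wb") as handle:
    pickle.dump(sorted(terms_used), handle)

# Read the list back; the order is exactly as it was written.
with open(path, "rb") as handle:
    terms_used_list = pickle.load(handle)

print(terms_used_list)
```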

Reading the objects back.

In [166]:
question_overlap = pd.read_pickle("question_overlap.pkl") # original variable overwritten

terms_used_list = pd.read_pickle("terms_used.pkl")

In [167]:
jeopardy['question_overlap'] = pd.Series(question_overlap)

jeopardy['question_overlap'].head()

0    0.0
1    0.0
2    0.0
3    0.5
4    0.0
Name: question_overlap, dtype: float64

In [168]:
question_overlap_mean = jeopardy['question_overlap'].mean()

question_overlap_mean = round(question_overlap_mean*100, 2)

f'{question_overlap_mean = }%'

'question_overlap_mean = 71.98%'

### Comment

The value above means that, on average, approx. 72% of the words of six or more characters in a question had already been used in an earlier question.

The validity of the previous method as a means of identifying repeated questions is dubious. First, relying on a bank of previously used words to determine which words in a given question are repeated may not be very useful, since the same words may be used to formulate different questions. Second, computing the ratio of repeated words to the total number of words in a question does not make much sense: by definition, categorizing a question as repeated entails that a previous question with the same context was also answered similarly (in this case we seek identical answers). We are looking for a binary outcome, 'is this question a repeated one or not?'. We should not be concluding that 'this question is 75% repeated' because three out of four of its words were used, in various possible contexts, in past questions.
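
One possible way to get the binary answer described above is to flag a row as a repeat when the same cleaned question and answer pair has already aired. This is only a sketch of that idea on made-up rows, not a method used in the notebook; the frame `toy` and the column `is_repeat` are hypothetical, and exact matching will miss reworded repeats.

```python
import pandas as pd

# Made-up rows: the third one repeats the first question and answer.
toy = pd.DataFrame({
    "air_date": pd.to_datetime(["1990-01-01", "1995-06-01", "2001-03-15"]),
    "clean_question": [
        "this painter created guernica",
        "capital of france",
        "this painter created guernica",
    ],
    "clean_answer": ["picasso", "paris", "picasso"],
})

# Sort by air date so that 'first' means 'earliest aired'.
toy = toy.sort_values("air_date").reset_index(drop=True)

# duplicated(keep='first') marks every later copy of a pair as True.
toy["is_repeat"] = toy.duplicated(subset=["clean_question", "clean_answer"], keep="first")
repeat_share = toy["is_repeat"].mean()

print(toy["is_repeat"].tolist(), repeat_share)
```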

## Slide 6
---
Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

- Low value -- Any row where `Value` is `800` or less.
- High value -- Any row where `Value` is greater than `800`.

You'll then be able to loop through each of the terms from the last screen, `terms_used`, and:


- Find the number of low value questions the word occurs in.


- Find the number of high value questions the word occurs in.


- Find the percentage of questions the word occurs in.


- Based on the percentage of questions the word occurs in, find expected counts.


- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.



#### Instructions
---
1. Create a function that takes in a row from a Dataframe, and:
    - If the `clean_value` column is greater than `800`, assign `1` to `value`.
    - Otherwise, assign `0` to `value`.
    - Return value.
    
    
2. Determine which questions are high and low value.
    - Use the Pandas DataFrame.apply method to apply the function to each row in `jeopardy`.
    - Pass the `axis=1` argument to apply the function across each row.
    - Assign the result to the `high_value` column.
    
    
3. Create a function that takes in a word, and:
    - Assigns `0` to `low_count`.
    - Assigns `0` to `high_count`.
    - Loops through each row in `jeopardy` using the `iterrows` method.
    - Split the `clean_question` column on the space character (` `).
    - If the word is in the split question:
        - If the `high_value` column is `1`, add `1` to `high_count`.
        - Else, add `1` to `low_count`.
    - Returns `high_count` and `low_count`. You can return multiple values by separating them with a comma.
    
    
4. Randomly pick ten elements of `terms_used` and append them to a list called `comparison_terms`.


5. Create an empty list called `observed_expected`.


6. Loop through each term in `comparison_terms`, and:
    - Run the function on the term to get the high value and low value counts.
    - Append the result of running the function (which will be a list) to `observed_expected`.

The process above is simplified into the following steps:

1. create a new column that labels each question/row as 'high_value' or 'low_value'.
2. randomly pick 10 terms from the `terms_used` pool.
3. build the function that counts, for each term chosen in step 2, the number of high value and low value questions that contain it (at least once).
4. instead of filling a list of lists with the output generated in step 3, create a Series called `observed` to hold that output (`observed_expected` is created later on).

1. classify each question/row as high value or low value.

In [169]:
jeopardy['high_or_low'] = jeopardy['clean_value'].apply(lambda x: 'high_value' if x > 800 else 'low_value')

# checking new column, random look.
jeopardy.loc[80:90, ['clean_value', 'high_or_low']]

| | clean_value | high_or_low |
|---|---|---|
| 80 | 800 | low_value |
| 81 | 1900 | high_value |
| 82 | 800 | low_value |
| 83 | 1000 | high_value |
| 84 | 1000 | high_value |
| 85 | 400 | low_value |
| 86 | 1000 | high_value |
| 87 | 400 | low_value |
| 88 | 400 | low_value |
| 89 | 400 | low_value |


2. randomly pick 10 terms from `terms_used` and save them into a list, `comparison_term`. The sampling was run only once and its output is hard-coded in the list below, so that re-running the notebook does not overwrite the terms.

        import random

        comparison_term = random.sample(terms_used_list, 10)

In [170]:


comparison_term = ['coronado',
                  'residence',
                  'subatomic',
                  'halston',
                  'nuremberg',
                  'plumber',
                  'osment',
                  'tortured',
                  'relatively',
                  'herradura']

3. build the function that counts how many high value and low value questions contain each one of the terms in `comparison_term`.

In [171]:
def count_highs_and_lows(word):
    """Takes a word and returns the number of high value questions and
    low value questions that contain it, as [high_counts, low_counts].
    """
    
    high_counts = 0
    low_counts = 0
    
    for i, val in enumerate(jeopardy['clean_question']):
        
        split = val.split(' ')
        
        if word in split:
            if jeopardy.loc[i, 'high_or_low'] == 'high_value':
                high_counts += 1
            else:
                low_counts += 1

    return [high_counts, low_counts]

4. apply function and display results.

In [172]:
observed_list = [count_highs_and_lows(word) for word in comparison_term]

observed = pd.Series(observed_list, name='observed')

observed 

0    [1, 0]
1    [4, 4]
2    [1, 1]
3    [0, 1]
4    [1, 0]
5    [1, 1]
6    [1, 0]
7    [0, 1]
8    [0, 3]
9    [0, 1]
Name: observed, dtype: object
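
`count_highs_and_lows` rescans every question for each term, which becomes slow once it is applied to thousands of terms. A possible alternative, sketched below on made-up data, tallies all words in a single pass with `collections.Counter`. The names `rows`, `high_counts`, `low_counts` and `lookup` are hypothetical and not part of the notebook.

```python
from collections import Counter

# Made-up (clean_question, high_or_low) pairs standing in for the DataFrame rows.
rows = [
    ("famous painter of guernica", "high_value"),
    ("painter and sculptor", "low_value"),
    ("a plumber video game hero", "low_value"),
]

high_counts, low_counts = Counter(), Counter()
for question, label in rows:
    # A set counts each word once per question, like `word in split`.
    words = set(question.split(" "))
    target = high_counts if label == "high_value" else low_counts
    target.update(words)


def lookup(word):
    """Return [high, low] question counts for a word (0 if never seen)."""
    return [high_counts[word], low_counts[word]]


print(lookup("painter"), lookup("plumber"))
```

After the single pass, the counts for any term are a dictionary lookup, so the cost no longer grows with the number of terms times the number of rows.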

## Slide 7
---
Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.


#### Instructions
---
1. Find the number of rows in `jeopardy` where `high_value` is `1`, and assign to `high_value_count`. 


2. Find the number of rows in `jeopardy` where `high_value` is `0`, and assign to `low_value_count`. 


3. Create an empty list called `chi_squared`.


4. Loop through each list in `observed_expected`.
    - Add up both items in the list (high and low counts) to get the total count, and assign to `total`.
    - Divide `total` by the number of rows in` jeopardy` to get the proportion across the dataset. Assign to `total_prop`.
    - Multiply `total_prop` by `high_value_count` to get the expected term count for high value rows.
    - Multiply `total_prop` by `low_value_count` to get the expected term count for low value rows.
    - Use the `scipy.stats.chisquare` function to compute the chi-squared value and p-value given the expected and observed counts.
    - Append the results to `chi_squared`.
    
    
5. Look over the chi-squared values and the associated p-values. Are there any statistically significant results? Write up your thoughts in a markdown cell.

Counting number of high value questions and low value questions.

In [173]:
high_value_count = jeopardy[jeopardy['high_or_low']=='high_value'].shape[0]

low_value_count = jeopardy[jeopardy['high_or_low']!='high_value'].shape[0]

# checking counts:
print(f'{high_value_count = }', f'{low_value_count = }', sep='\n')

high_value_count = 5734
low_value_count = 14265


Computing expected values.

In [174]:
expected = []

for i, val in enumerate(observed):
    
    # High and Low value count.
    high_count = val[0]
    low_count = val[1]
    
    # Proportion of words vs total number of questions
    total_prop = (high_count + low_count) / jeopardy.shape[0]
    
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
#     expected.append([round(expected_high, 10), round(expected_low, 10)])
    expected.append([expected_high, expected_low])
    
expected = pd.Series(expected, name='expected')

expected

0    [0.28671433571678584, 0.7132856642832142]
1      [2.2937146857342867, 5.706285314265713]
2     [0.5734286714335717, 1.4265713285664283]
3    [0.28671433571678584, 0.7132856642832142]
4    [0.28671433571678584, 0.7132856642832142]
5     [0.5734286714335717, 1.4265713285664283]
6    [0.28671433571678584, 0.7132856642832142]
7    [0.28671433571678584, 0.7132856642832142]
8     [0.8601430071503575, 2.1398569928496425]
9    [0.28671433571678584, 0.7132856642832142]
Name: expected, dtype: object

Putting observed and expected values in a DataFrame: `observed_expected`.

In [175]:
observed_expected = pd.concat([observed, expected], axis=1)

observed_expected.index = comparison_term

observed_expected 

| | observed | expected |
|---|---|---|
| coronado | [1, 0] | [0.28671433571678584, 0.7132856642832142] |
| residence | [4, 4] | [2.2937146857342867, 5.706285314265713] |
| subatomic | [1, 1] | [0.5734286714335717, 1.4265713285664283] |
| halston | [0, 1] | [0.28671433571678584, 0.7132856642832142] |
| nuremberg | [1, 0] | [0.28671433571678584, 0.7132856642832142] |
| plumber | [1, 1] | [0.5734286714335717, 1.4265713285664283] |
| osment | [1, 0] | [0.28671433571678584, 0.7132856642832142] |
| tortured | [0, 1] | [0.28671433571678584, 0.7132856642832142] |
| relatively | [0, 3] | [0.8601430071503575, 2.1398569928496425] |
| herradura | [0, 1] | [0.28671433571678584, 0.7132856642832142] |


Manually finding the chi-square value of the first term in `observed_expected` - 'coronado':

In [176]:
def proportion_diff_sq(a, b):
    """Takes an observed count a and an expected count b and returns
    the squared difference between them divided by the expected count.
    """
    
    
    return (a - b)**2 / b

In [177]:
high_value_chi = proportion_diff_sq(observed_expected.iloc[0, 0][0], observed_expected.iloc[0, 1][0])

low_value_chi = proportion_diff_sq(observed_expected.iloc[0, 0][1], observed_expected.iloc[0, 1][1])

composit_chi = high_value_chi + low_value_chi

composit_chi

2.487792117195675
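
The manual result can be reproduced from the raw counts alone, together with its p-value. The sketch below uses only the standard library and relies on the fact that, with one degree of freedom, the chi-square survival function equals `erfc(sqrt(x / 2))`. The counts are taken from the outputs above; the variable names are illustrative.

```python
import math

# Counts reported earlier in the notebook.
n_rows, high_value_count, low_value_count = 19999, 5734, 14265
observed = [1, 0]  # 'coronado': [high, low]

# Expected counts if the term were spread evenly across all questions.
total_prop = sum(observed) / n_rows
expected = [total_prop * high_value_count, total_prop * low_value_count]

# Sum of (observed - expected)^2 / expected over both categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two categories give one degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(chi2, p_value)
```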

Producing the chi-square tests and associated p-values for every term:

In [178]:
from scipy.stats import chisquare 

chi_squared = []

for index, series in observed_expected.iterrows():
    
    obs = series['observed']
    exp = series['expected']

    chisquare_value, pvalue  = chisquare(obs, exp)
    
    chi_squared.append([chisquare_value, pvalue])
    
chi_squared_df = pd.DataFrame(chi_squared,
                              columns=['chi-square value', 'p-value'],
                              index=comparison_term).rename_axis('term')

chi_squared_df['over 5% threshold'] = chi_squared_df['p-value'].apply(lambda x: 'yes' if x > 0.05 else 'no')

chi_squared_df

| term | chi-square value | p-value | over 5% threshold |
|---|---|---|---|
| coronado | 2.487792 | 0.114733 | yes |
| residence | 1.77951 | 0.18221 | yes |
| subatomic | 0.444877 | 0.504778 | yes |
| halston | 0.401963 | 0.526077 | yes |
| nuremberg | 2.487792 | 0.114733 | yes |
| plumber | 0.444877 | 0.504778 | yes |
| osment | 2.487792 | 0.114733 | yes |
| tortured | 0.401963 | 0.526077 | yes |
| relatively | 1.205889 | 0.272148 | yes |
| herradura | 0.401963 | 0.526077 | yes |


#### Comment

Following the tutorial's instructions leads to a major problem that invalidates the test results shown above: the sampled complex words (terms) are rare, so the questions containing them have extremely low observed counts. From the [scipy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html):

'This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5. According to [3], the total number of samples is recommended to be greater than 13, otherwise exact tests (such as Barnard’s Exact test) should be used because they do not overreject.'

Also, the magnitude of the chi-square value alone is not enough to determine whether a term is more likely to occur in high value questions, since the low value component may be the one that inflates the chi-square value. Recall the chi-square value for the present scenario:

\begin{equation}
\chi^{2} = \frac{(\text{high value obs} - \text{high value exp})^2}{\text{high value exp}} + 
       \frac{(\text{low value obs} - \text{low value exp})^2}{\text{low value exp}}  
\end{equation}

Although the tests are invalid, reading them as is, for the sake of practice, would indicate that 'none of the terms had a significant difference in usage between high value and low value rows' (from the solution notebook, which finds similar results).

Furthermore, from a single term alone it is difficult to assess the overall context in which it is being used. For example, while we might infer that the history of World War II is a recurrent topic from the term 'nuremberg' (as in 'Nuremberg Trials'), not much can be said about which subjects 'residence' or 'relatively' allude to.

Regarding the chi-square test, two immediate improvements can be made:

 - a) consider only terms that are found more often in high value questions than in low value questions. To this end, we can chain the following steps:
    1. calculate, for every term, the difference between the (observed) high value and low value question counts.
    2. sort the terms/rows in descending order of the differences calculated in Step 1.
    3. drop all rows where the difference between high and low counts is not positive (we are only interested in terms that appear more often in high value than in low value questions).
    4. slice the DataFrame so that it contains only the 100 terms with the greatest differences.
    5. compute the expected values for each term/row.


 - b) filter the DataFrame produced in a), keeping only the terms whose observed and expected values are all at least 5.
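
Step b) can be sketched as a boolean filter. The example below is an illustration under assumptions: it uses the observed counts of three terms that appear in the notebook's outputs, stores high and low counts in separate columns rather than in lists, and the names `observed`, `expected`, `valid` and `usable_terms` are hypothetical.

```python
import pandas as pd

# Row and class counts reported earlier in the notebook.
N_ROWS, HIGH, LOW = 19999, 5734, 14265

# Observed [high, low] counts for three terms, one column per class.
observed = pd.DataFrame(
    {"high": [37, 17, 5], "low": [17, 10, 0]},
    index=["monitor", "painter", "largely"],
)

# Expected counts, computed the same way as in `find_expected`.
total_prop = (observed["high"] + observed["low"]) / N_ROWS
expected = pd.DataFrame({"high": total_prop * HIGH, "low": total_prop * LOW})

# Keep a term only if all four frequencies are at least 5.
valid = (observed >= 5).all(axis=1) & (expected >= 5).all(axis=1)
usable_terms = observed.index[valid].tolist()

print(usable_terms)
```

Here 'largely' is dropped because it never occurs in a low value question, which is exactly the situation the scipy rule of thumb warns about.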

---

#### Starting with a):
 - Since computing the observed values for all 21223 terms takes a long time, we apply `count_highs_and_lows` to `terms_used_series` only once and save the output.

In [179]:
terms_used_series = pd.Series(terms_used_list) 

    observed_2 = terms_used_series.apply(count_highs_and_lows) 

Using the pickle library again to save the `observed_2` output.

    with open("observed_2.pkl", "wb") as obsr_2:
        pickle.dump(observed_2, obsr_2)

In [225]:
observed_2_series = pd.read_pickle('observed_2.pkl')

In [236]:
observed_2_df = pd.DataFrame({'observed': observed_2_series}).set_index(terms_used_series)

In [240]:
observed_2_df.head()

term,observed
croatia,"[2, 3]"
campuses,"[1, 1]"
accumulate,"[1, 0]"
voyages,"[0, 3]"
emeritus,"[1, 0]"


Below, we compute the differences between high value and low value counts for each term/row.

In [241]:
observed_2_df['difference'] = observed_2_df.observed.apply(lambda x: x[0] - x[1])

Sorting in descending order.

In [242]:
observed_2_df_sorted = observed_2_df.sort_values('difference', ascending=False)

observed_2_df_sorted.head(10)

term,observed,difference
monitor,"[37, 17]",20
painter,"[17, 10]",7
example,"[16, 10]",6
largely,"[5, 0]",5
creates,"[6, 1]",5
hormone,"[8, 3]",5
spirit,"[13, 8]",5
orator,"[8, 3]",5
andrew,"[13, 8]",5
hyphenated,"[8, 3]",5


We keep only the terms with a positive difference (high count greater than low count) and, of those, only the 100 terms with the greatest difference between counts.

In [243]:
difference_over_0 = observed_2_df_sorted[observed_2_df_sorted.difference > 0]

difference_top_100 = difference_over_0.head(100).copy()

difference_top_100.head()

term,observed,difference
monitor,"[37, 17]",20
painter,"[17, 10]",7
example,"[16, 10]",6
largely,"[5, 0]",5
creates,"[6, 1]",5


Computing the expected high and low counts by applying the function `find_expected`.

In [244]:
def find_expected(observed_list):

    # High and Low value count.
    high_count = observed_list[0]
    low_count = observed_list[1]

    # Proportion of words vs total number of questions
    total_prop = (high_count + low_count) / jeopardy.shape[0]

    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count

    return [expected_high, expected_low]

In [245]:
difference_top_100['expected'] = difference_top_100['observed'].apply(find_expected)

difference_top_100.head()

term,observed,difference,expected
monitor,"[37, 17]",20,"[15.482574128706435, 38.51742587129356]"
painter,"[17, 10]",7,"[7.741287064353218, 19.25871293564678]"
example,"[16, 10]",6,"[7.454572728636432, 18.545427271363568]"
largely,"[5, 0]",5,"[1.4335716785839292, 3.566428321416071]"
creates,"[6, 1]",5,"[2.007000350017501, 4.992999649982499]"


#### Task b)

Next, we find the terms/rows whose observed and expected values, both high and low, are all 5 or more, so that a valid chi-square test can be applied to each row.

In the `df.iterrows()` loop below, we go over the observed and expected values (high and low) of each row and save the index value/term in `index_values_over_4` whenever all four values are 5 or more.

In [246]:
index_values_over_4 = []

for index, series in difference_top_100.iterrows():

    # example of `observed_expected`: [37, 17, 15.482574128706435, 38.51742587129356]
    # Columns are selected by label: integer keys such as series[0] are
    # deprecated as positional access on a labeled Series.
    observed_expected = series['observed'] + series['expected']

    # Keep the term only if all four values (observed and expected,
    # high and low) are 5 or more, i.e. over 4.
    if all(el >= 5 for el in observed_expected):
        index_values_over_4.append(index)

Producing the final DataFrame for testing - `observed_expected_2`.

In [247]:
observed_expected_2 = difference_top_100.loc[index_values_over_4, :]

observed_expected_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, monitor to plants
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   observed    12 non-null     object
 1   difference  12 non-null     int64 
 2   expected    12 non-null     object
dtypes: int64(1), object(2)
memory usage: 240.0+ bytes


In [248]:
observed_expected_2

term,observed,difference,expected
monitor,"[37, 17]",20,"[15.482574128706435, 38.51742587129356]"
painter,"[17, 10]",7,"[7.741287064353218, 19.25871293564678]"
example,"[16, 10]",6,"[7.454572728636432, 18.545427271363568]"
spirit,"[13, 8]",5,"[6.021001050052503, 14.978998949947497]"
andrew,"[13, 8]",5,"[6.021001050052503, 14.978998949947497]"
liquid,"[17, 12]",5,"[8.31471573578679, 20.68528426421321]"
pulitzer,"[15, 11]",4,"[7.454572728636432, 18.545427271363568]"
african,"[43, 39]",4,"[23.51057552877644, 58.48942447122356]"
process,"[16, 12]",4,"[8.028001400070004, 19.971998599929996]"
relative,"[13, 10]",3,"[6.594429721486074, 16.405570278513927]"


Producing the chi-square values again, this time for `observed_expected_2`, along with the respective p-values.

In [249]:
chi_squared_2 = []

for index, series in observed_expected_2.iterrows():
    
    # Select by column label rather than by position.
    obs = series['observed']
    exp = series['expected']

    chisquare_value, pvalue  = chisquare(obs, exp)
    
    chi_squared_2.append([chisquare_value, pvalue])
    
chi_squared_df_2 = pd.DataFrame(chi_squared_2,
                                columns=['chisquare value', 'p-value'],
                                index=observed_expected_2.index).rename_axis('term')

chi_squared_df_2['over 5% threshold'] = chi_squared_df_2['p-value'].apply(lambda x: 'yes' if x > 0.05 else 'no')

chi_squared_df_2

term,chisquare value,p-value,over 5% threshold
monitor,41.925086,9.483806e-11,no
painter,15.524748,8.143213e-05,no
example,13.733503,0.000210663,no
spirit,11.341071,0.0007581159,no
andrew,11.341071,0.0007581159,no
liquid,12.719123,0.0003619354,no
pulitzer,10.707336,0.001067116,no
african,22.65016,1.94344e-06,no
process,11.09848,0.0008639852,no
relative,8.723181,0.003141895,no
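
To see where a value such as 41.925 for 'monitor' comes from, the statistic can be recomputed by hand as the sum of (observed - expected)² / expected over the high and low counts. The sketch below uses the observed and expected values listed for 'monitor' and compares the result with `scipy.stats.chisquare`; the variable names are illustrative only.

```python
from scipy.stats import chisquare

# Observed and expected [high, low] counts for the term 'monitor'.
observed = [37, 17]
expected = [15.482574128706435, 38.51742587129356]

# Chi-squared statistic by hand: sum of (O - E)^2 / E over both categories.
manual_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The same statistic, plus its p-value, from scipy.
stat, p_value = chisquare(observed, expected)

print(round(manual_stat, 3), round(stat, 3), p_value)
```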


### Comment:

The original idea was to find out, from a randomly chosen set of terms/topics, which ones are more likely to be used in high value questions. First of all, every term in this short list appears more often in high value questions than in low value questions.

Second, the p-value for every term is well below 0.05, so we can reject the null hypothesis. Note what that hypothesis is: `find_expected` splits each term's total count according to the overall shares of high value and low value questions in the dataset (roughly 29% and 71%, judging by the expected values above), not 50/50. Rejecting it means that the observed split, with more high value than low value questions, would be unlikely if the term were spread across questions in those overall proportions.

In other words, there is some evidence that these distributions are not random, so we can at least try to pick some subjects to study based on these terms, since they tend to show up more often in high value questions than in low value questions. Keep in mind, though, that the terms were selected precisely because they had the largest positive differences, so these p-values overstate the evidence.

Looking at the list of terms, which is very short, some of them seem generic without a context; others are indicative of certain subjects, e.g. 'pulitzer' (Journalism/Literature), 'painter' (History/Historical Figures).

### Extra exercise: replicate the chi-squared test p-values 'manually'.

Notes:

- We perform 1000 rounds in which we fill a series with the same length as `jeopardy` (20000 rows) with random values between 0 and 1. Because each value is classified as either 'high' or 'low', both classes are drawn with equal probability, 0.5. With three classes ('high', 'medium', 'low'), each would be drawn with probability 1/3. The 50/50 split is only a device for generating chi-squared statistics under a null hypothesis with two categories; each term's chi-squared value is then compared against that simulated distribution.

In [250]:

p_values = []

for index, val in enumerate(chi_squared_df_2['chisquare value']):

    chi_squared_values = []

    for i in range(1000):
        rand_series = np.random.random((jeopardy.shape[0],))
        high = rand_series[rand_series >= 0.5].size
        low = rand_series[rand_series < 0.5].size
        high_diff = proportion_diff_sq(high, jeopardy.shape[0]/2)
        low_diff = proportion_diff_sq(low, jeopardy.shape[0]/2)
        chi_square = high_diff + low_diff
        chi_squared_values.append(chi_square)

    over_chisquare_term = [chi for chi in chi_squared_values if \
                           chi >= val]

    p_value = len(over_chisquare_term) / 1000

    p_values.append(p_value)

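As we can see below, the 'manually' produced p-values are consistent with the ones previously computed with the 'scipy' library: all are near zero, which confirms the conclusions already drawn. With 1000 simulations the manual p-values can only be multiples of 0.001, so a value of 0.0 means that none of the simulated statistics reached the term's chi-squared value.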

In [251]:
p_values_comp = pd.DataFrame({'scipy p-value': chi_squared_df_2['p-value'], 'manual p-value': p_values})

p_values_comp

term,scipy p-value,manual p-value
monitor,9.483806e-11,0.0
painter,8.143213e-05,0.0
example,0.000210663,0.0
spirit,0.0007581159,0.001
andrew,0.0007581159,0.001
liquid,0.0003619354,0.002
pulitzer,0.001067116,0.0
african,1.94344e-06,0.0
process,0.0008639852,0.001
relative,0.003141895,0.006
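
The simulation above rebuilds the 1000 random series for every term. As a hedged alternative, the null distribution can be simulated once and reused for all terms. The sketch below assumes 20000 questions, draws the number of 'high' values directly from a binomial distribution, and inlines the usual (observed - expected)² / expected term instead of calling `proportion_diff_sq`, which is defined earlier in the notebook. The names `null_stats` and `simulated_p_value` are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)  # the seed is arbitrary
n_questions = 20_000            # assumed dataset length
n_sims = 100_000                # more rounds give finer p-value resolution

# Number of 'high' draws per simulated series; the rest are 'low'.
high = rng.binomial(n_questions, 0.5, size=n_sims)
low = n_questions - high

# Chi-squared statistic of each simulated series against a 50/50 split.
expected = n_questions / 2
null_stats = (high - expected) ** 2 / expected + (low - expected) ** 2 / expected


def simulated_p_value(stat):
    """Share of simulated statistics at least as large as `stat`."""
    return float(np.mean(null_stats >= stat))


# Compare with the chi-squared distribution with one degree of freedom.
for stat in (8.723181, 11.341071):
    print(stat, simulated_p_value(stat), chi2.sf(stat, df=1))
```

With 100000 rounds the smallest non-zero p-value is 0.00001, so far fewer terms end up with a simulated p-value of exactly zero.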


## Slide 8
---
That's it for the guided steps! We recommend you explore the data more.

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
    
    
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
    
    
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
    
    
- ~~Use the whole Jeopardy dataset (available here) instead of the subset we used in this lesson.~~


- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
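
As a starting point for the first idea in the list above, terms can be filtered by the share of questions they occur in. This is a minimal sketch on a toy list of questions. The helper names `tokenize` and `frequent_terms` are hypothetical, and the 50% threshold is chosen only because the toy sample is tiny; on the real data a threshold such as the 5% mentioned above would be more appropriate.

```python
import re
from collections import Counter

# Toy stand-in for the `Question` column.
questions = [
    "The capital of Croatia",
    "This painter created the Mona Lisa",
    "The hormone that regulates blood sugar",
    "This orator delivered the Gettysburg Address",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def frequent_terms(questions, max_share):
    """Return the terms that occur in more than `max_share` of the questions."""
    doc_freq = Counter()
    for question in questions:
        doc_freq.update(tokenize(question))  # count each term once per question
    n_questions = len(questions)
    return {term for term, count in doc_freq.items() if count / n_questions > max_share}


too_common = frequent_terms(questions, max_share=0.5)
filtered = [tokenize(question) - too_common for question in questions]

print(too_common)
print(filtered[0])
```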


We recommend creating a Github repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on Github, you'll have the beginnings of a strong portfolio.

You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.

Curious to see what other students have done on this project? Head over to our Community to check them out. While you are there, please remember to share your experience and provide feedback!

In addition, we welcome you to share your own project and show off your hard work. Head over to our Community to share your finished Guided Project!