# Winning Jeopardy - How To Set A Winning Strategy Based On The Insights Provided By Records Of Old Questions And Answers


## Introduction
---

Jeopardy is a popular TV show in the US where participants answer questions to win money [source: Dataquest]. In this project we explore a sample of about 20,000 (out of 216,930) questions and their answers, originally compiled by a [reddit user](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/), and look for patterns in the data that could help maximize earnings. Two of the questions we'll explore are:

- How often the answer can be deduced from the wording of the question.

- How often questions repeat.


The description of every column in the data set:

| Name | Description |
| ----------- | -------------------------------------------------- |
| Show Number | Jeopardy episode number.                       |
| Air Date    | Date the episode aired.                        |
| Round       | Round of Jeopardy.                             |
| Category    | Category of the question.                      |
| Value       | Number of dollars the correct answer is worth. |
| Question    | Text of the question.                          |
| Answer      | Text of the answer.                            |

## A First Look Into The Data Set
---

In [130]:
import numpy as np
import pandas as pd
import re

In [92]:
jeopardy = pd.read_csv('jeopardy.csv', skipinitialspace=True)

First five rows.

In [93]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Summary of the DataFrame.

In [94]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


## Data Cleaning
---

### Normalizing Questions And Answers (Text Columns)

To make the elements of the `Question` column comparable to those of the `Answer` column, I apply a small normalization process to both columns:

- applying a function that lowercases strings and removes punctuation.
- assigning those changes to a pair of new columns: `clean_question` and `clean_answer`.

In [95]:
def norm_string(string):
    """Takes in a string and lowercases it and removes punctuation."""
    
    string_mod = str.lower(string)
    string_mod = re.sub(r'\W', ' ', string_mod)  # raw string avoids an invalid-escape warning
    
    return string_mod

In [96]:
jeopardy['clean_question'] = jeopardy['Question'].apply(norm_string)

jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm_string)

Reducing two consecutive or more whitespaces into one:

In [97]:
jeopardy['clean_question'] = \
    jeopardy['clean_question'].str.replace(r'\s{2,}', ' ', regex=True)

jeopardy['clean_answer'] = \
    jeopardy['clean_answer'].str.replace(r'\s{2,}', ' ', regex=True)
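
As a quick check of what the two normalization steps do, the sketch below chains them in a stand-alone helper and applies it to two strings taken from the first rows of the data set. The name `normalize` is introduced here for illustration only and is not part of the notebook.

```python
import re

def normalize(text):
    """Lowercase, replace non-word characters with spaces, collapse repeated spaces."""
    lowered = text.lower()
    no_punct = re.sub(r'\W', ' ', lowered)      # same pattern as `norm_string()`
    return re.sub(r'\s{2,}', ' ', no_punct)     # same pattern as the whitespace step

print(normalize("McDonald's"))            # mcdonald s
print(normalize("No. 2: 1912 Olympian"))  # no 2 1912 olympian
```

Note that punctuation is replaced by a space rather than deleted, so an apostrophe splits a word in two ('mcdonald s'). This matters later, when questions and answers are compared word by word.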

### Further Data Cleaning (Numeric Columns)


- Transform values in the `Value` column into integers.
- Transform values in `Air Date` into _datetime_ values.

First, a function that converts the entries of `Value` into integers.

In [98]:
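def norm_usd(string):
    """Converts a string/currency value into an integer (0 if it cannot be converted)."""
    
    try:
        converted = re.sub(r'[$,]', '', string)
        converted = int(converted)
    except (TypeError, ValueError):
        # Non-numeric entries and missing values are mapped to 0.
        converted = 0
    
    return converted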

In [99]:
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_usd)

In [100]:
jeopardy['clean_value'].unique()

array([  200,   400,   600,   800,  2000,  1000,  1200,  1600,  3200,
           0,  5000,   100,   300,   500,  1500,  4800,  1800,  1100,
        2200,  3400,  3000,  4000,  6800,  1900,  3100,   700,  1400,
        2800,  8000,  6000,  2400, 12000,  3800,  2500,  6200, 10000,
        7000,  1492,  7400,  1300,  7200,  2600,  3300,  5400,  4500,
        2100,   900,  3600,  2127,   367,  4400,  3500,  2900,  3900,
        4100,  4600, 10800,  2300,  5600,  1111,  8200,  5800,   750,
        7500,  1700,  9000,  6100,  1020,  4700,  2021,  5200,  3389])
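
To make the conversion concrete, here is a minimal stand-alone sketch of the same idea applied to a few sample strings. The helper name `to_dollars` and the sample values are assumptions made for illustration; in particular, the text 'None' and a missing value are only assumed examples of entries that cannot be read as a dollar amount (the `0` in the array above shows that such entries exist).

```python
import re

def to_dollars(value):
    """Hypothetical stand-in for `norm_usd()`: strip '$' and ',' and cast to int."""
    try:
        return int(re.sub(r'[$,]', '', value))
    except (TypeError, ValueError):
        return 0  # anything that is not a plain dollar amount

samples = ['$200', '$2,000', 'None', None]
converted = [to_dollars(s) for s in samples]
print(converted)  # [200, 2000, 0, 0]
```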

Lastly, the column 'Air Date' can be converted to a datetime type. 

In [101]:
jeopardy['clean_air_date'] = pd.to_datetime(jeopardy['Air Date'])

## How Often The Answer Can Be Used For A Question?
---

If a participant were, by bad luck, totally clueless about the questions asked during the show, could they resort to the question itself to work out the answer? In other words, how often do words in the answer also appear in the question? The extreme version of this would be the famous joke question: 'What's the color of Napoleon's white horse?' (Answer: 'white').

A way to check this is to go row by row, comparing each question with its answer, and compute the proportion of words in the answer that also appear in the question. `count_matches()` is a function that does just that. Note that the very common word 'the' is removed from the answer first, since it carries no information.

In [102]:
def count_matches(row):
    
    split_answer = row["clean_answer"].split(' ')
    split_question = row["clean_question"].split(' ')
    
    if "the" in split_answer:
        split_answer.remove("the") # removing 'the' as common word
        
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)

In [103]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
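
The ratio can be traced by hand on a toy example. The sketch below is a simplified variant of `count_matches()` written for illustration: the name `answer_overlap` is hypothetical, it splits on any whitespace and drops every occurrence of 'the', so it is not a drop-in replacement for the function above.

```python
def answer_overlap(clean_question, clean_answer):
    """Share of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    hits = sum(w in question_words for w in answer_words)
    return hits / len(answer_words)

question = "what s the color of napoleon s white horse"
print(answer_overlap(question, "white"))               # 1.0
print(answer_overlap(question, "the white stallion"))  # 0.5
print(answer_overlap(question, "arizona"))             # 0.0
```

The denominator is the number of words in the answer, so the score reads as 'how much of the answer is given away by the question'.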

In [104]:
answer_in_question_mean = round(jeopardy["answer_in_question"].mean()*100, 2) 

f'{answer_in_question_mean = }%'

'answer_in_question_mean = 6.93%'

On average, only about 7% of the words in an answer also appear in the corresponding question. Moreover, part of that overlap likely comes from common words other than 'the', such as 'a', 'an', 'for' or 'in' (see the slice of cleaned answers below), so the overlap in meaningful words is probably even lower. Relying on the words of the question alone to find the answer is therefore not a viable strategy.

In [105]:
jeopardy.loc[20:40,'clean_answer']

20                                              morocco
21                                          paul bonwit
22    hattie mcdaniel for her role in gone with the ...
23                                                  era
24                                   the congress party
25                                     wilt chamberlain
26                                                   k2
27                                          ethan allen
28                                                  ply
29                                               horton
30                                                nixon
31                                             a kennel
32                                                moses
33                                            aerosmith
34                                              oratory
35                          coolidge or chester arthur 
36                                       business class
37                                             m

## Answering The Question - 'How Often Questions Are Repeated', By Seeing How Often Complex Words Reoccur
---

This task takes a similar approach to the previous one, but instead of comparing the words in the question against the ones in the answer, it compares the words in the question against a pool of words collected from previous questions. For each question, we take the ratio of words already seen in previous questions to the total number of (complex) words in the question, and later average these ratios.

To exclude common words, we only consider more complex terms: words with more than 5 characters. In theory, this should allow the following kind of inference: if a complex word was already used in a previous question and appears again, then the new question is probably asking about the same thing as before.

Preliminary Clean-up: Ordering Questions By Show Air Date

In [106]:
jeopardy = jeopardy.sort_values(by='Air Date').reset_index(drop=True)

In the process below, Stage 2 (marked as a comment) is divided into two loops, 'a' and 'b': the words of a question are first checked against the pool and only afterwards added to it, so that words repeated within the same question are not counted as re-occurrences. Note also that there is a time logic behind this process, which is why the questions are sorted by air date: a word counts as a re-occurrence only if it appeared in a question from a past show (or from the same air date).

On the ordering by air date: there is no way of ordering questions that share the same air date, since we have no time reference or other means of determining which one was asked first during the show. We therefore take the order produced by `df.sort_values()` as is.

Python sets are unordered: the iteration order of a set of strings can change from one interpreter session to the next, even if the content is the same. Therefore, it is better to run the loop once to fill `terms_used` with the desired terms, convert it into a list, and save it externally, so that the order of the values is preserved. Since `question_overlap` is derived from `terms_used`, we store it as well.
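
For comparison, a reproducible order can also be obtained without saving anything to disk, by sorting the set and using a seeded random generator. This is only a sketch of an alternative, not what the notebook does; the toy pool and the seed value are arbitrary assumptions.

```python
import random

terms_used = {'painter', 'monitor', 'liquid', 'pulitzer', 'african'}  # toy pool

ordered_terms = sorted(terms_used)      # alphabetical, identical in every session
rng = random.Random(42)                 # seeded generator (seed chosen arbitrarily)
picked = rng.sample(ordered_terms, 2)

# Re-creating the generator with the same seed reproduces the same picks.
assert picked == random.Random(42).sample(ordered_terms, 2)
print(ordered_terms)
```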


In [107]:
runs_1 = 1

if runs_1 == 0: 
    
    runs_1 += 1 
    
    question_overlap = []

    terms_used = set()

    for index, val in enumerate(jeopardy['clean_question']):

        ## Stage 1.
        split_question = val.split(' ')

        split_question = [word for word in split_question if len(word) >= 6]


        ## Stage 2.
        match_count = 0

        # Loop a.
        for word in split_question:
            if word in terms_used:
                match_count += 1

        # Loop b.
        for word in split_question:
            terms_used.add(word)


        ## Stage 3. 
        if len(split_question):
            match_count /= len(split_question)

        question_overlap.append(match_count)


    # Saving `question_overlap` and `terms_used` into pickle files.

    import pickle

    # Write each Python object to its own binary pickle file;
    # the `with` blocks close the files automatically.
    with open("question_overlap.pkl", "wb") as qo:
        pickle.dump(question_overlap, qo)

    with open("terms_used.pkl", "wb") as tu:
        pickle.dump(list(terms_used), tu) # converted into a list before saving

Reading the objects back.

In [108]:
if 'question_overlap' not in locals():

    question_overlap = pd.read_pickle("question_overlap.pkl")

# Always read the term list from disk, so that its order matches the saved file.
terms_used_list = pd.read_pickle("terms_used.pkl")

Finally producing the mean.

In [109]:
jeopardy['question_overlap'] = pd.Series(question_overlap)  

question_overlap_mean = jeopardy['question_overlap'].mean()

question_overlap_mean = round(question_overlap_mean*100, 2)

f'{question_overlap_mean = }%'

'question_overlap_mean = 71.98%'
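
To see how the per-question ratios behind this mean are built, the sketch below runs the same three stages on three toy questions. The function name `overlap_ratios` and the toy questions are made up for illustration.

```python
def overlap_ratios(questions, min_len=6):
    """For each question, share of its long words already seen in earlier questions."""
    terms_seen = set()
    ratios = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        matches = sum(w in terms_seen for w in words)   # loop a: check first
        terms_seen.update(words)                        # loop b: then register
        ratios.append(matches / len(words) if words else 0)
    return ratios

toy = [
    "this painter created guernica",
    "this spanish painter loved barcelona",
    "name the city",
]
ratios = overlap_ratios(toy)
print(ratios)
print(sum(ratios) / len(ratios))
```

The second question scores one third because 'painter' was already in the pool, even though the two questions ask about different things. The third has no word of six or more characters and scores zero.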

#### Comment: interpretation of the result and discussion.

The value above suggests that, on average, approximately 72% of the complex words in a question had already appeared in an earlier question. Although neatly conveyed in a single number, the validity of this method as a means of identifying repeated questions is questionable. First, a bank of previously used words is not a reliable indicator of repetition, since the same words can be used to formulate different questions. Second, the ratio of repeated words to the total number of words in a question is not a meaningful measure of repetition: by definition, a question is repeated only if an earlier question had the same context and the same answer. We are looking for a binary outcome ('is this question a repeat or not?'); a statement such as 'this question is 75% repeated', made because three out of four of its words appeared in unrelated past questions, does not answer that.

## How To Target Highly Paid Questions By Inferring Their Most Common Themes
---
From Dataquest's tutorial:


>Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.
>
>You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:
>
>- Low value -- Any row where `Value` is less than `800`.
>- High value -- Any row where `Value` is greater than `800`.
>
>You'll then be able to loop through each of the terms from the last screen, `terms_used`, and:
>
>- Find the number of low value questions the word occurs in.
>- Find the number of high value questions the word occurs in.
>- Find the percentage of questions the word occurs in.
>- Based on the percentage of questions the word occurs in, find expected counts.
>- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
>
>You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

### Task: 'Pre-arrange The Data Set, Pick 10 Words/Subjects And Find Their Observed Values In High Value Questions And Low Value Questions.'


The process goes through the following steps:

1. creating a new column that labels each question/row as 'high_value' or 'low_value'.
2. randomly picking 10 terms from the `terms_used` pool.
3. building a function that counts, for each term chosen in Step 2, the number of high value and low value questions that contain that term (at least once).
4. saving the output generated in Step 3 in a Series called `observed`.

1. classify each question/row as high value or low value.

In [110]:
jeopardy['high_or_low'] = jeopardy['clean_value'].apply(lambda x: 'high_value' if x > 800 else 'low_value')

# Checking new column, random slice.
jeopardy.loc[80:90, ['clean_value', 'high_or_low']]

Unnamed: 0,clean_value,high_or_low
80,800,low_value
81,1900,high_value
82,800,low_value
83,1000,high_value
84,1000,high_value
85,400,low_value
86,1000,high_value
87,400,low_value
88,400,low_value
89,400,low_value


2. randomly pick 10 terms from `terms_used` and save them in a list, `comparison_term`. This was done only once; the result is hard-coded in the next cell so that re-running the notebook does not overwrite it.

        import random


        comparison_term = random.sample(terms_used_list, 10)

In [111]:
comparison_term = [
    'coronado',
    'residence',
    'subatomic',
    'halston',
    'nuremberg',
    'plumber',
    'osment',
    'tortured',
    'relatively',
    'herradura'
    ]

3. build the function that counts how many high value and low value questions contain each one of the terms in `comparison_term`.

In [112]:
def count_highs_and_lows(word):
    """Takes a word and returns the number of high value questions and the number
    of low value questions that contain it at least once (whole-word match).
    """
    
    high_counts = 0
    low_counts = 0
    
    for i, val in enumerate(jeopardy['clean_question']):
        
        split = val.split(' ')
        
        if word in split:
            if jeopardy.loc[i, 'high_or_low'] == 'high_value':
                high_counts += 1
            else:
                low_counts += 1

    return [high_counts, low_counts]

4. apply `count_highs_and_lows()` function and display results.

In [113]:
observed_list = [count_highs_and_lows(word) for word in comparison_term]

observed = pd.Series(observed_list, name='observed')

observed 

0    [1, 0]
1    [4, 4]
2    [1, 1]
3    [0, 1]
4    [1, 0]
5    [1, 1]
6    [1, 0]
7    [0, 1]
8    [0, 3]
9    [0, 1]
Name: observed, dtype: object
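
The counting rule behind `observed` can be checked on a toy example. The sketch below uses plain Python tuples instead of the DataFrame; the helper name `count_term` and the three labelled questions are made up for illustration.

```python
toy_questions = [
    ("this painter created guernica", "high_value"),
    ("this dutch painter cut off part of his ear", "low_value"),
    ("many painters lived in montmartre", "high_value"),
]

def count_term(term, rows):
    """Return [high, low]: number of questions of each kind containing `term`."""
    high = low = 0
    for text, label in rows:
        if term in text.split():      # whole-word match, as in the notebook
            if label == "high_value":
                high += 1
            else:
                low += 1
    return [high, low]

print(count_term("painter", toy_questions))  # [1, 1]
```

Because the match is on whole words, 'painters' in the third question does not count towards 'painter'.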

### Calculate The Respective Expected Values

Counting number of high value questions and low value questions.

In [114]:
high_value_count = jeopardy[jeopardy['high_or_low']=='high_value'].shape[0]

low_value_count = jeopardy[jeopardy['high_or_low']!='high_value'].shape[0]

# checking counts:
print(f'{high_value_count = }', f'{low_value_count = }', sep='\n')

high_value_count = 5734
low_value_count = 14265


Computing expected values.

In [115]:
expected = []

for i, val in enumerate(observed):
    
    # High and Low value count.
    high_count = val[0]
    low_count = val[1]
    
    # Proportion of words vs total number of questions
    total_prop = (high_count + low_count) / jeopardy.shape[0]
    
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    expected.append([expected_high, expected_low])
    
expected = pd.Series(expected, name='expected')

expected

0    [0.28671433571678584, 0.7132856642832142]
1      [2.2937146857342867, 5.706285314265713]
2     [0.5734286714335717, 1.4265713285664283]
3    [0.28671433571678584, 0.7132856642832142]
4    [0.28671433571678584, 0.7132856642832142]
5     [0.5734286714335717, 1.4265713285664283]
6    [0.28671433571678584, 0.7132856642832142]
7    [0.28671433571678584, 0.7132856642832142]
8     [0.8601430071503575, 2.1398569928496425]
9    [0.28671433571678584, 0.7132856642832142]
Name: expected, dtype: object
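
As a worked instance of the computation above, the sketch below reproduces the expected counts for 'residence', which was observed in 4 high value and 4 low value questions. The constants are the counts printed earlier; the helper name `expected_counts` is introduced here for illustration.

```python
HIGH_TOTAL = 5734    # high value questions in the data set
LOW_TOTAL = 14265    # low value questions in the data set
N_QUESTIONS = HIGH_TOTAL + LOW_TOTAL  # 19999 rows

def expected_counts(high_obs, low_obs):
    """Split a term's total count according to the overall high/low proportions."""
    share = (high_obs + low_obs) / N_QUESTIONS
    return share * HIGH_TOTAL, share * LOW_TOTAL

exp_high, exp_low = expected_counts(4, 4)     # 'residence'
print(round(exp_high, 4), round(exp_low, 4))  # 2.2937 5.7063
```

The two expected counts always add up to the term's total number of occurrences (8 here); only the split between high and low changes.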

### Compile Observed And Expected Values Into A DataFrame And Produce Chi-square Test Results For Each Word/Row

Putting observed and expected values into a DataFrame: `observed_expected`.

In [116]:
observed_expected = pd.concat([observed, expected], axis=1)

observed_expected.index = comparison_term

observed_expected 

Unnamed: 0,observed,expected
coronado,"[1, 0]","[0.28671433571678584, 0.7132856642832142]"
residence,"[4, 4]","[2.2937146857342867, 5.706285314265713]"
subatomic,"[1, 1]","[0.5734286714335717, 1.4265713285664283]"
halston,"[0, 1]","[0.28671433571678584, 0.7132856642832142]"
nuremberg,"[1, 0]","[0.28671433571678584, 0.7132856642832142]"
plumber,"[1, 1]","[0.5734286714335717, 1.4265713285664283]"
osment,"[1, 0]","[0.28671433571678584, 0.7132856642832142]"
tortured,"[0, 1]","[0.28671433571678584, 0.7132856642832142]"
relatively,"[0, 3]","[0.8601430071503575, 2.1398569928496425]"
herradura,"[0, 1]","[0.28671433571678584, 0.7132856642832142]"


Producing the chi-square tests and associated p-values for every term:

In [117]:
from scipy.stats import chisquare 

chi_squared = []

for index, series in observed_expected.iterrows():
    
    obs = series['observed']   # label-based access; integer keys on a labelled Series are deprecated
    exp = series['expected']

    chisquare_value, pvalue  = chisquare(obs, exp)
    
    chi_squared.append([chisquare_value, pvalue])
    
chi_squared_df = pd.DataFrame(chi_squared,
                              columns=['chi-square value', 'p-value'],
                              index=comparison_term).rename_axis('term')

chi_squared_df['over 5% threshold'] = chi_squared_df['p-value'].apply(lambda x: 'yes' if x > 0.05 else 'no')

chi_squared_df.sort_values('chi-square value', ascending=False)

| term | chi-square value | p-value | over 5% threshold |
| ---------- | ---------------- | -------- | ----------------- |
| coronado | 2.487792 | 0.114733 | yes |
| nuremberg | 2.487792 | 0.114733 | yes |
| osment | 2.487792 | 0.114733 | yes |
| residence | 1.77951 | 0.18221 | yes |
| relatively | 1.205889 | 0.272148 | yes |
| subatomic | 0.444877 | 0.504778 | yes |
| plumber | 0.444877 | 0.504778 | yes |
| halston | 0.401963 | 0.526077 | yes |
| tortured | 0.401963 | 0.526077 | yes |
| herradura | 0.401963 | 0.526077 | yes |


From the random sample taken from the pool of terms, `chi_squared_df` shows that all the p-values are over the 5% threshold; therefore, for these terms, we cannot reject the null hypothesis that each term is split between high value and low value questions in the same proportion as the questions overall.

### Choosing Themes/Terms Based On The Validation Of The Chi-square Value


Following the tutorial's instructions leads to a major problem that invalidates the test results shown above in `chi_squared_df`: terms sampled at random from the pool of complex words have extremely low observed counts. From the [scipy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html):


>This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5. According to [3], the total number of samples is recommended to be greater than 13, otherwise exact tests (such as Barnard’s Exact test) should be used because they do not overreject.

Also, the magnitude of the chi-square value alone does not tell us whether a term is more likely to occur in high value questions, since a term that is over-represented in low value questions inflates the statistic in the same way. Recall the chi-square value for the present scenario:

\begin{equation}
\chi^{2} = \frac{(\text{high value obs} - \text{high value exp})^2}{\text{high value exp}} + 
       \frac{(\text{low value obs} - \text{low value exp})^2}{\text{low value exp}}  
\end{equation}
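
As a worked instance of the formula, the sketch below computes the statistic by hand for 'residence', using the observed and expected counts listed in `observed_expected`. The p-value is obtained with the standard library only, relying on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math

observed = [4, 4]                                   # 'residence': high, low
expected = [2.2937146857342867, 5.706285314265713]  # from `observed_expected`

# Sum of (observed - expected)^2 / expected over the two categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With two categories there is one degree of freedom, for which the
# chi-square survival function reduces to erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 4), round(p_value, 4))  # 1.7795 0.1822
```

Both numbers agree with the 'residence' row of `chi_squared_df`.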

Despite the tests being invalid, and purely for practice, reading them as they are would indicate that 'none of the terms had a significant difference in usage between high value and low value rows' (from the [solution notebook](https://github.com/dataquestio/solutions/blob/master/Mission210Solution.ipynb), which finds similar results).


Furthermore, from a single term alone it is difficult to assess the overall context in which it is being used: while the term 'nuremberg', as in 'Nuremberg Trials', may suggest that the history of World War II is a recurrent topic, not much can be said about which subjects 'residence' or 'relatively' allude to.



Regarding the chi-square test, two immediate improvements can be made:

 - a) consider only terms that can be found more often in high value questions than in low value questions. To this end, we can make a chain of procedures:
    1. calculate for every term the difference between (observed) high value and low value questions counts.
    2. sort terms/rows, in descending order, based on the magnitude of the differences in counts calculated in Step 1.
    3. drop all rows where the difference between high and low counts is not positive (we are only interested in terms that appear more often in high value than in low value questions).
    4. slice the DataFrame, so that it only contains the top 100 terms which have the greatest differences.
    5. compute the expected values for each term/row.


 - b) from the final output produced in a), filter that DataFrame, selecting only the terms whose observed and expected values are all at least 5.
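
The two improvements can be sketched end to end on a handful of terms before running them on the full pool. The version below uses plain dictionaries rather than a DataFrame and is only an illustration: the `[high, low]` counts are copied from outputs shown in this notebook, and the helper name `expected` is an assumption.

```python
HIGH_TOTAL, LOW_TOTAL = 5734, 14265
N_QUESTIONS = HIGH_TOTAL + LOW_TOTAL

# [high, low] observed counts for a few terms.
observed = {
    'monitor': [37, 17],
    'painter': [17, 10],
    'largely': [5, 0],
    'voyages': [0, 3],
    'croatia': [2, 3],
}

def expected(high, low):
    """Expected [high, low] counts under the overall high/low proportions."""
    share = (high + low) / N_QUESTIONS
    return [share * HIGH_TOTAL, share * LOW_TOTAL]

# a) steps 1-4: difference, sort, keep positive differences, slice the top 100.
ranked = sorted(observed, key=lambda t: observed[t][0] - observed[t][1], reverse=True)
positive = [t for t in ranked if observed[t][0] > observed[t][1]]
top = positive[:100]

# a) step 5 and b): expected counts, then keep terms whose four values are all >= 5.
kept = [t for t in top if all(v >= 5 for v in observed[t] + expected(*observed[t]))]
print(kept)  # ['monitor', 'painter']
```

'largely' passes the first filter (difference of 5) but is dropped by the second one, because its low value count is 0.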


---

#### Starting with a):

- Since computing the observed values for 21223 terms takes a long time, we apply the function `count_highs_and_lows()` to each element of `terms_used_series` only once and save the output.

In [118]:
runs_2 = 1

if runs_2 == 0:
    
    runs_2 += 1

    terms_used_series = pd.Series(terms_used_list) 

    observed_2 = terms_used_series.apply(count_highs_and_lows) 

    # Resorting to the pickle library again to save the `observed_2` output.
    import pickle  # imported here as well, in case the first guarded cell was skipped

    with open("observed_2.pkl", "wb") as obsr_2:
        pickle.dump(observed_2, obsr_2)

Reading back and displaying the first rows in `observed_2_df`.


In [119]:
observed_2_series = pd.read_pickle('observed_2.pkl')

observed_2_df = pd.DataFrame(data={'observed': observed_2_series}).set_index(np.array(terms_used_list))

observed_2_df.head()

Unnamed: 0,observed
croatia,"[2, 3]"
campuses,"[1, 1]"
accumulate,"[1, 0]"
voyages,"[0, 3]"
emeritus,"[1, 0]"


Below, we compute the difference between high value and low value counts for each term/row.

In [120]:
observed_2_df['difference'] = observed_2_df.observed.apply(lambda x: x[0] - x[1])

Sorting in descending order.

In [121]:
observed_2_df_sorted = observed_2_df.sort_values('difference', ascending=False)

observed_2_df_sorted.head(10)

Unnamed: 0,observed,difference
monitor,"[37, 17]",20
painter,"[17, 10]",7
example,"[16, 10]",6
largely,"[5, 0]",5
creates,"[6, 1]",5
hormone,"[8, 3]",5
spirit,"[13, 8]",5
orator,"[8, 3]",5
andrew,"[13, 8]",5
hyphenated,"[8, 3]",5


Keeping only terms with positive differences ('high value' counts higher than 'low value' counts), and then only the top 100 terms with the greatest differences between counts.


In [122]:
difference_over_0 = observed_2_df_sorted[observed_2_df_sorted.difference > 0]

difference_top_100 = difference_over_0.head(100).copy()

difference_top_100.head()

Unnamed: 0,observed,difference
monitor,"[37, 17]",20
painter,"[17, 10]",7
example,"[16, 10]",6
largely,"[5, 0]",5
creates,"[6, 1]",5


Computing the expected high and low counts applying the function `find_expected()`. 

In [123]:
def find_expected(observed_list):

    # High and Low value count.
    high_count = observed_list[0]
    low_count = observed_list[1]

    # Proportion of words vs total number of questions
    total_prop = (high_count + low_count) / jeopardy.shape[0]

    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count

    return [expected_high, expected_low]

The summary of the transformation applied so far.

In [124]:
difference_top_100['expected'] = difference_top_100['observed'].apply(find_expected)

difference_top_100.head()

Unnamed: 0,observed,difference,expected
monitor,"[37, 17]",20,"[15.482574128706435, 38.51742587129356]"
painter,"[17, 10]",7,"[7.741287064353218, 19.25871293564678]"
example,"[16, 10]",6,"[7.454572728636432, 18.545427271363568]"
largely,"[5, 0]",5,"[1.4335716785839292, 3.566428321416071]"
creates,"[6, 1]",5,"[2.007000350017501, 4.992999649982499]"


#### Task b)

Creating a procedure to find the terms/rows whose observed and expected values, for both high and low value questions, are all at least 5, so that we can apply a valid chi-square test to each row.

In the `df.iterrows()` loop below, we go over the observed and expected values of each row (high and low value) and save the index value/term in `index_values_over_4` whenever all of those values are 5 or more.

In [125]:
index_values_over_4 = []

for index, series in difference_top_100.iterrows():
    
    # Example of `observed_expected`: [37, 17, 15.482574128706435, 38.51742587129356]
    observed_expected = series['observed'] + series['expected']
    
    # Keep the term only if all four values (observed and expected,
    # high and low) are over 4, i.e. at least 5.
    if all(el >= 5 for el in observed_expected):
        index_values_over_4.append(index)

Producing the final DataFrame for testing - `observed_expected_2`.

In [126]:
observed_expected_2 = difference_top_100.loc[index_values_over_4, :]

observed_expected_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, monitor to plants
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   observed    12 non-null     object
 1   difference  12 non-null     int64 
 2   expected    12 non-null     object
dtypes: int64(1), object(2)
memory usage: 384.0+ bytes


In [127]:
observed_expected_2

Unnamed: 0,observed,difference,expected
monitor,"[37, 17]",20,"[15.482574128706435, 38.51742587129356]"
painter,"[17, 10]",7,"[7.741287064353218, 19.25871293564678]"
example,"[16, 10]",6,"[7.454572728636432, 18.545427271363568]"
spirit,"[13, 8]",5,"[6.021001050052503, 14.978998949947497]"
andrew,"[13, 8]",5,"[6.021001050052503, 14.978998949947497]"
liquid,"[17, 12]",5,"[8.31471573578679, 20.68528426421321]"
pulitzer,"[15, 11]",4,"[7.454572728636432, 18.545427271363568]"
african,"[43, 39]",4,"[23.51057552877644, 58.48942447122356]"
process,"[16, 12]",4,"[8.028001400070004, 19.971998599929996]"
relative,"[13, 10]",3,"[6.594429721486074, 16.405570278513927]"


Producing the chisquare values again, this time for `observed_expected_2`, and the respective p-values.

In [128]:
chi_squared_2 = []

for index, series in observed_expected_2.iterrows():
    
    obs = series['observed']   # label-based access; integer keys on a labelled Series are deprecated
    exp = series['expected']

    chisquare_value, pvalue = chisquare(obs, exp)
    
    chi_squared_2.append([chisquare_value, pvalue])
    
chi_squared_df_2 = pd.DataFrame(chi_squared_2,
                                columns=['chisquare_value', 'p-value'],
                                index=observed_expected_2.index).rename_axis('term')

chi_squared_df_2['over 5% threshold'] = chi_squared_df_2['p-value'].apply(lambda x: 'yes' if x > 0.05 else 'no')

chi_squared_df_2['p-value'] = chi_squared_df_2['p-value'].round(6)

chi_squared_df_2.sort_values('chisquare_value', ascending=False)

| term | chisquare_value | p-value | over 5% threshold |
| -------- | --------------- | -------- | ----------------- |
| monitor | 41.925086 | 0.0 | no |
| african | 22.65016 | 2e-06 | no |
| painter | 15.524748 | 8.1e-05 | no |
| example | 13.733503 | 0.000211 | no |
| liquid | 12.719123 | 0.000362 | no |
| spirit | 11.341071 | 0.000758 | no |
| andrew | 11.341071 | 0.000758 | no |
| process | 11.09848 | 0.000864 | no |
| pulitzer | 10.707336 | 0.001067 | no |
| relative | 8.723181 | 0.003142 | no |


Observing `chi_squared_df_2`, there is some evidence that these terms are not split between high value and low value questions in the same proportion as the questions overall (about 29% high value and 71% low value, from the counts computed earlier): all p-values are under the 5% threshold, so we can reject the null hypothesis in every case. Since the terms were selected for appearing more often in high value than in low value questions, we can at least try to pick some subjects to study based on them. Looking at the list of terms, which is very short, some seem generic, without a specific context; others are indicative of certain subjects, e.g. 'pulitzer' (Journalism/Literature/Politics), 'painter' (Art/History/Historical Figures), 'plants' (Botany/Biology).
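
The expected counts used in these tests are built from the overall split of the data set between high and low value questions, so a useful companion to the p-values is the share of high value questions among the questions containing each term, compared with that overall baseline. The sketch below computes it for three of the terms, with counts copied from `observed_expected_2`; it is an illustrative addition, not part of the original analysis.

```python
HIGH_TOTAL, LOW_TOTAL = 5734, 14265
baseline = HIGH_TOTAL / (HIGH_TOTAL + LOW_TOTAL)   # share of high value questions overall

# [high, low] observed counts for three of the tested terms.
observed = {'monitor': [37, 17], 'african': [43, 39], 'relative': [13, 10]}

# Share of high value questions among the questions containing each term.
shares = {term: high / (high + low) for term, (high, low) in observed.items()}

print(round(baseline, 3))  # 0.287
for term, share in shares.items():
    print(term, round(share, 3))
```

A share well above the baseline means the term leans towards high value questions, which the chi-square value alone does not show.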



## Conclusion
---
To conclude, we can restate the three major conclusions drawn from this analysis: i) the words of an answer are unlikely to be found in the respective question: on average, only about 7% of the words in an answer also appear in the question; ii) there is some evidence that complex words (6 or more characters) recur across questions, since on average 72% of the complex words in a question had been used previously, although, as discussed above, this does not by itself show that whole questions are repeated; iii) to pursue a game strategy of answering high value questions based on their themes, one can try to draw inspiration from the list of terms in `chi_squared_df_2`, since these terms are more likely to be associated with high value questions than with low value ones.




\[end of project\]
