# PROJECT LAYOUT

#### The aim of this project is to explore how we can improve our chances of winning the television game show Jeopardy! by checking:

#### - how often questions are repeated
#### - which category of questions has the most repetitions.

#### The dataset contains more than 200,000 rows of questions.

# A. Reading in the Data

In [28]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [29]:
jeopardy.shape

(216930, 7)

#### Cleaning the column names by stripping surrounding whitespace

In [30]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### We estimate how often questions are repeated by checking whether the words in each question have already appeared in earlier questions.

#### Only words with more than 5 letters are considered, because shorter words are usually common ones such as "the" or "then", which say little about whether a question is repeated.
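
To make the measure concrete, here is a minimal sketch on three made-up questions. The helper name `overlap_scores` and the sample strings are illustrative assumptions, not part of the dataset; the logic is meant to mirror the loop used in section B.

```python
def overlap_scores(questions, min_length=6):
    """Return, for each question, the share of its long words already seen earlier."""
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words with at least `min_length` letters.
        words = [w for w in question.split(" ") if len(w) >= min_length]
        # Count the long words that appeared in any earlier question.
        matches = sum(1 for w in words if w in terms_used)
        # Remember this question's long words for the questions that follow.
        terms_used.update(words)
        scores.append(matches / len(words) if words else 0)
    return scores


# Made-up, already-cleaned sample questions (not taken from the dataset).
sample = [
    "this president signed the treaty",
    "the treaty ended this conflict",
    "a short one",
]
scores = overlap_scores(sample)
print(scores)
```

The first question scores 0 because nothing has been seen yet. The second shares one of its two long words ("treaty") with the first. The third has no long words at all, so its score stays at 0.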

In [31]:
## Cleaning the "Question" and "Answer" columns by lower-casing them and removing punctuation.

import string

jeopardy["Answer"] = jeopardy["Answer"].astype(str)

def cleaning_text(jeo_string):
    jeo_string = jeo_string.lower()
    jeo_string = "".join(char for char in jeo_string if char not in string.punctuation)
    return jeo_string

jeopardy["clean_question"] = jeopardy["Question"].apply(cleaning_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(cleaning_text)

print(jeopardy["clean_question"])
print(jeopardy["clean_answer"])

0         for the last 8 years of his life galileo was u...
1         no 2 1912 olympian football star at carlisle i...
2         the city of yuma in this state has a record av...
3         in 1963 live on the art linkletter show this c...
4         signer of the dec of indep framer of the const...
5         in the title of an aesop fable this insect sha...
6         built in 312 bc to link rome  the south of ita...
7         no 8 30 steals for the birmingham barons 2306 ...
8         in the winter of 197172 a record 1122 inches o...
9         this housewares store was named for the packag...
10                                           and away we go
11        cows regurgitate this from the first stomach t...
12        in 1000 rajaraja i of the cholas battled to ta...
13        no 1 lettered in hoops football  lacrosse at s...
14        on june 28 1994 the natl weather service began...
15        this companys accutron watch introduced in 196...
16        outlaw murdered by a traitor a
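
As a possible alternative to the character-by-character join above, the same cleaning can be sketched with `str.translate`, which deletes all ASCII punctuation in one pass. The function name `clean_text_translate` is hypothetical; like the original function, it leaves non-ASCII punctuation untouched.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletion).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_translate(text):
    """Lower-case the text and delete ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)


cleaned = clean_text_translate('In 1963, live on "The Art Linkletter Show"')
print(cleaned)
```

Assuming the same DataFrame, it could be applied in the same way as `cleaning_text`, for example with `jeopardy["Question"].apply(clean_text_translate)`.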

In [32]:
# Removing the uncleaned Question and Answer columns

jeopardy = jeopardy.drop(["Question", "Answer"], axis=1)
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,signer of the dec of indep framer of the const...,john adams


# B. Calculating repetition of Questions

In [33]:
question_overlap = []
terms_used = set()

## Running the loop over the rows.

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    
    # Keeping words with minimum 6 letters only.
    split_question = [w for w in split_question if len(w) > 5]
    
    match_count = 0
    
    # Counting the words that have already appeared in earlier questions.
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
    for word in split_question:
        terms_used.add(word)
            
    # Calculating the fraction of long words that also occur in previous questions.
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
        
jeopardy["question_overlap"] = question_overlap

# Finding the average repetition.
jeopardy["question_overlap"].mean()

0.8729646759744081

### This means that, on average, about 87.3% of the words with more than 5 letters in a question have already appeared in an earlier question.

### This suggests that we may improve our chances of winning considerably by studying previously asked questions, although shared words do not guarantee that whole questions are repeated.

# C. Checking repetitions by category

In [34]:
# Fetching categories by their frequency.

jeopardy["Category"].value_counts()

BEFORE & AFTER                       547
SCIENCE                              519
LITERATURE                           496
AMERICAN HISTORY                     418
POTPOURRI                            401
WORLD HISTORY                        377
WORD ORIGINS                         371
COLLEGES & UNIVERSITIES              351
HISTORY                              349
SPORTS                               342
U.S. CITIES                          339
WORLD GEOGRAPHY                      338
BODIES OF WATER                      327
ANIMALS                              324
STATE CAPITALS                       314
BUSINESS & INDUSTRY                  311
ISLANDS                              301
WORLD CAPITALS                       300
U.S. GEOGRAPHY                       299
RELIGION                             297
SHAKESPEARE                          294
OPERA                                294
LANGUAGES                            284
BALLET                               282
TELEVISION      

#### Given the large number of categories, it only makes sense to calculate repetitions for the most frequent categories.

In [41]:
jeopardy["Category"].value_counts()[:10]

BEFORE & AFTER             547
SCIENCE                    519
LITERATURE                 496
AMERICAN HISTORY           418
POTPOURRI                  401
WORLD HISTORY              377
WORD ORIGINS               371
COLLEGES & UNIVERSITIES    351
HISTORY                    349
SPORTS                     342
Name: Category, dtype: int64

#### Filtering by category and finding the average word overlap for each

In [43]:
# Finding the average repetition for the five most frequent categories.

entries = ["BEFORE & AFTER", "SCIENCE", "LITERATURE", "AMERICAN HISTORY", "POTPOURRI"]
entry_overlaps = []

for entry in entries:
    category_mean = jeopardy.loc[jeopardy["Category"] == entry, "question_overlap"].mean()
    entry_overlaps.append(category_mean)
    
entry_overlaps

[0.911536229360727,
 0.8882977379704683,
 0.9047886757060144,
 0.9459320564196726,
 0.880296054734958]

### Hence, among the five categories checked, AMERICAN HISTORY has the highest average word overlap (about 94.6%), so it is the category whose questions recur the most in Jeopardy!
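
The per-category averages could also be computed for any number of top categories without listing them by hand. The sketch below uses a tiny made-up DataFrame with the same `Category` and `question_overlap` column names; the helper `top_category_overlap` and the sample values are assumptions for illustration only, not results from the dataset.

```python
import pandas as pd


def top_category_overlap(df, n=10):
    """Mean question_overlap for the n most frequent categories, highest first."""
    # Names of the n categories with the most questions.
    top = df["Category"].value_counts().head(n).index
    subset = df[df["Category"].isin(top)]
    return (
        subset.groupby("Category")["question_overlap"]
        .mean()
        .sort_values(ascending=False)
    )


# Made-up values, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "Category": ["SCIENCE", "SCIENCE", "HISTORY", "HISTORY", "OPERA"],
        "question_overlap": [0.5, 1.0, 1.0, 1.0, 0.0],
    }
)
result = top_category_overlap(toy, n=2)
print(result)
```

On the real data, a call such as `top_category_overlap(jeopardy, n=10)` would be expected to cover all ten categories shown in the frequency table rather than only the first five.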