# Basics

Parse, clean, and organize the Jeopardy! question data file to train a Naive Bayesian classifier.

Pass the Text Analysis Basics quiz with a score of 85% or better.

Just as we have built a classifier above, your aim here is to make sense of the data presented and create a binary classifier for questions ("high value" and "low value," based on the points available for each). Despite the large number of questions, this is an extraordinarily difficult classification problem. Consider it as a human coder would: how often could you tell the "easy" questions from the "hard" ones? How successful you are depends largely on your own contextual knowledge; indeed, you might be tempted to classify questions you know the answer to as "easy" and those you do not as "hard." The computer doesn't know the answers to any of these.

For that reason, do not be discouraged if your classifier does not perform well. This constitutes an especially difficult problem for a simple classifier to solve.

Put the script and its output (which may merely report the accuracy of the trial) in your GitHub repository, and share the link/filenames when you start your quiz.

The quiz consists of 20 questions, some of which you may have already encountered in the self-tests for each section.

Importing and installing the packages needed for Naive Bayes and for isolating specific text.

In [17]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vainc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [62]:
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import pandas as pd
import os
import re
from string import punctuation

In [19]:
import pandas as pd
jeopardy = pd.read_json(r"C:\Users\Vainc\Downloads\jeopardy.json")

In [20]:
print(jeopardy)

                               category    air_date  \
0                               HISTORY  2004-12-31   
1       ESPN's TOP 10 ALL-TIME ATHLETES  2004-12-31   
2           EVERYBODY TALKS ABOUT IT...  2004-12-31   
3                      THE COMPANY LINE  2004-12-31   
4                   EPITAPHS & TRIBUTES  2004-12-31   
...                                 ...         ...   
216925                   RIDDLE ME THIS  2006-05-11   
216926                        "T" BIRDS  2006-05-11   
216927           AUTHORS IN THEIR YOUTH  2006-05-11   
216928                       QUOTATIONS  2006-05-11   
216929                   HISTORIC NAMES  2006-05-11   

                                                 question  value  \
0       'For the last 8 years of his life, Galileo was...   $200   
1       'No. 2: 1912 Olympian; football star at Carlis...   $200   
2       'The city of Yuma in this state has a record a...   $200   
3       'In 1963, live on "The Art Linkletter Show", t...   $200   

In [21]:
jeopardy = jeopardy.dropna() #Dropping the rows with NaN values, since they would interfere with the Naive Bayes.

In [23]:
jeopardy['value'] #Checking the "value" column to see what range of dollar awards it contains.

0          $200
1          $200
2          $200
3          $200
4          $200
          ...  
216924    $2000
216925    $2000
216926    $2000
216927    $2000
216928    $2000
Name: value, Length: 213296, dtype: object
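To see the full range of award values rather than just the head and tail of the column, a frequency count helps. The sketch below uses a small made-up sample (`sample_values` is hypothetical, not the real column) and the standard library's `Counter`; on the actual DataFrame, `jeopardy['value'].value_counts()` should give the equivalent tally.

```python
from collections import Counter

# Hypothetical sample standing in for jeopardy['value'].tolist().
sample_values = ['$200', '$200', '$400', '$2,000', '$2000', '$400', '$200']

# Count how often each award string appears.
value_counts = Counter(sample_values)

# Print the values from most to least frequent.
for value, count in value_counts.most_common():
    print(value, count)
```

Note that '$2,000' and '$2000' are counted as different strings, which matters for any labeling step that compares raw text.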

In [24]:
jeopardy

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680
...,...,...,...,...,...,...,...
216924,OFF-BROADWAY,2006-05-11,'In 2006 the cast of this long-running hit emb...,$2000,Stomp,Double Jeopardy!,4999
216925,RIDDLE ME THIS,2006-05-11,'This Puccini opera turns on the solution to 3...,$2000,Turandot,Double Jeopardy!,4999
216926,"""T"" BIRDS",2006-05-11,'In North America this term is properly applie...,$2000,a titmouse,Double Jeopardy!,4999
216927,AUTHORS IN THEIR YOUTH,2006-05-11,"'In Penny Lane, where this ""Hellraiser"" grew u...",$2000,Clive Barker,Double Jeopardy!,4999


In [25]:
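jeopardy['value'].str.contains('$2000') #Returns False everywhere: the pattern is read as a regex, where '$' anchors the end of the string. Passing regex=False would match the literal text instead.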

0         False
1         False
2         False
3         False
4         False
          ...  
216924    False
216925    False
216926    False
216927    False
216928    False
Name: value, Length: 213296, dtype: bool
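The all-False result above, even for rows whose value is $2000, is most likely explained by `str.contains` treating its pattern as a regular expression by default. A minimal sketch with the standard library's `re` module shows the same effect on a small made-up sample (`values` is hypothetical); in pandas, passing `regex=False` to `str.contains` should give the literal match.

```python
import re

# Hypothetical sample standing in for a few entries of the "value" column.
values = ['$200', '$2000', '$2,000']

# As a regex, '$' asserts end-of-string, so '$2000' can never match anything.
regex_hits = [bool(re.search('$2000', v)) for v in values]

# Escaping the pattern matches the literal characters instead.
literal_hits = [bool(re.search(re.escape('$2000'), v)) for v in values]

# A plain substring test is an equivalent literal check.
substring_hits = ['$2000' in v for v in values]

print(regex_hits)
print(literal_hits)
```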

In [26]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 213296 entries, 0 to 216928
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     213296 non-null  object
 1   air_date     213296 non-null  object
 2   question     213296 non-null  object
 3   value        213296 non-null  object
 4   answer       213296 non-null  object
 5   round        213296 non-null  object
 6   show_number  213296 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 13.0+ MB


In [27]:
type(jeopardy) #Checking the type of the imported jeopardy JSON data to make sure it didn't shift into something else.

pandas.core.frame.DataFrame

In [28]:
print(jeopardy.columns) #Checking the column names in the jeopardy data frame.

Index(['category', 'air_date', 'question', 'value', 'answer', 'round',
       'show_number'],
      dtype='object')


In [29]:
question = jeopardy['question'].tolist()
print(question[:5]) #Printing only the first few questions; printing the full list exceeds the notebook's IOPub data rate limit.



In [34]:
value = jeopardy['value'].tolist() 
print(value)

['$200', '$200', '$200', '$200', '$200', '$200', '$400', '$400', '$400', '$400', '$400', '$400', '$600', '$600', '$600', '$600', '$600', '$600', '$800', '$800', '$800', '$800', '$2,000', '$800', '$1000', '$1000', '$1000', '$1000', '$1000', '$400', '$400', '$400', '$400', '$400', '$400', '$800', '$800', '$800', '$800', '$800', '$1200', '$2,000', '$1200', '$1200', '$1200', '$1600', '$1600', '$1600', '$1600', '$1600', '$2000', '$2000', '$3,200', '$2000', '$2000', '$200', '$200', '$200', '$200', '$200', '$200', '$400', '$400', '$400', '$400', '$400', '$400', '$600', '$600', '$600', '$600', '$600', '$600', '$800', '$800', '$800', '$800', '$800', '$800', '$2,000', '$1000', '$1000', '$1000', '$1000', '$1000', '$400', '$400', '$400', '$400', '$400', '$400', '$800', '$800', '$800', '$800', '$800', '$800', '$1200', '$1200', '$1200', '$1200', '$1200', '$1200', '$5,000', '$1600', '$1600', '$1600', '$1600', '$5,000', '$2000', '$2000', '$2000', '$2000', '$2000', '$2000', '$100', '$100', '$100', '$10

In [35]:
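#jeopardy.drop['value'['$400', '$600', '$800']] #Raised the TypeError below, because 'value'[...] indexes the string 'value' itself. A working form would be jeopardy[~jeopardy['value'].isin(['$400', '$600', '$800'])].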

TypeError: string indices must be integers

In [36]:
jeopardy_list = []
for each_value in value:
    #Label $200-$600 questions as low value (0) and everything else as high value (1).
    if each_value in ['$200', '$400', '$600']:
        jeopardy_list.append(0)
    else:
        jeopardy_list.append(1)
print(jeopardy_list)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
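The membership test above compares raw strings, so any value that is not exactly '$200', '$400', or '$600' (for example '$100') is labeled 1. One alternative is to parse the dollar amount and compare it numerically. This is only a sketch: `parse_value`, `label_value`, and the 600 threshold are hypothetical choices, and it deliberately behaves differently from the loop above by treating '$100' as low value.

```python
def parse_value(value):
    """Convert a string such as '$2,000' to the integer 2000."""
    return int(value.replace('$', '').replace(',', ''))


def label_value(value, threshold=600):
    """Return 0 for amounts at or below the threshold, otherwise 1."""
    return 0 if parse_value(value) <= threshold else 1


# Hypothetical sample standing in for the "value" list.
sample_values = ['$200', '$600', '$800', '$2,000', '$100']
sample_labels = [label_value(v) for v in sample_values]
print(sample_labels)
```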

In [None]:
#list(zip(question,jeopardy_list)) #Combined

In [37]:
question

["'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'",
 "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'",
 '\'In 1963, live on "The Art Linkletter Show", this company served its billionth burger\'',
 "'Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States'",
 "'In the title of an Aesop fable, this insect shared billing with a grasshopper'",
 "'Built in 312 B.C. to link Rome & the South of Italy, it's still in use today'",
 "'No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls'",
 "'In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state'",
 "'This housewares store was named for the packaging its merchandise came in & was first displayed on'",
 '\'"And away we go

In [38]:
jeopardy_list

[0,
 0,
 0,
 0,
 0,
 0,
 ...


In [39]:
jeopardy_final = []
lemmatizer = WordNetLemmatizer()
for each_question in question:
    #Tokenize each question and lemmatize every token (as a noun, the lemmatizer's default).
    questionlist = [lemmatizer.lemmatize(x) for x in word_tokenize(each_question)]
    #Rejoin the tokens into one string per question for the vectorizer.
    jeopardy_words = ' '.join(questionlist)
    jeopardy_final.append(jeopardy_words)
print(jeopardy_final[:5]) #Printing only the first few results; printing the full list exceeds the notebook's IOPub data rate limit.
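Some questions in this dataset carry HTML markup (line breaks and links), which the tokenizer would otherwise split into stray tokens. The notebook imports BeautifulSoup but never uses it; as an assumption, the sketch below strips tags with the standard library's `html.parser` instead, and `BeautifulSoup(text, 'html.parser').get_text()` would be a comparable alternative. `TextExtractor`, `strip_html`, and the sample strings are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return text with HTML tags removed and whitespace normalized."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(' '.join(parser.parts).split())


# Hypothetical strings shaped like the markup seen in some questions.
print(strip_html('Seth,<br />Isiah,<br />Clarence'))
print(strip_html('<a href="http://example.com/x.jpg" target="_blank">This</a> national variety of python'))
```

Running a step like this on each question before `word_tokenize` would keep tag fragments out of the vocabulary.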



In [None]:
#Drop the intermediate question value

Loading in the extra packages forgotten above.

In [40]:
import io
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [42]:
np.shape(jeopardy_final)

(213296,)

In [43]:
np.shape(jeopardy_list)

(213296,)

Doing the Naive Bayes below.

In [54]:
X_train, X_test, y_train, y_test = train_test_split(jeopardy_final, jeopardy_list,
                                                random_state=1)
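The split above is purely random, so the share of high and low labels can differ slightly between the training and test sets. A sketch of a stratified split, which keeps the class proportions equal in both parts, is shown here on toy data (`toy_texts` and `toy_labels` are hypothetical); `stratify` is a standard `train_test_split` parameter.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: four low-value (0) and four high-value (1) items.
toy_texts = ['q0', 'q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7']
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# stratify=toy_labels preserves the 50/50 class balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=1, stratify=toy_labels
)
print(sorted(y_te))
```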

In [56]:
X_train

["'He wa first famous a a comedian , but now he 's into more serious stuff in movie like `` Patch Adams '' '",
 "'Seth , < br / > Isiah , < br / > Clarence '",
 "'In 1997 all but 10 Republican congressman voted to re-elect this House speaker '",
 "' < a href= '' http : //www.j-archive.com/media/2008-05-24_J_30.jpg '' target= '' _blank '' > This < /a > national variety of python can grow to 30 foot long '",
 "'The name of this salsa mean `` beak of the rooster '' '",
 "'From it distinctive marking , this poisonous spider is also called a fiddleback ( not an hourglass back ) '",
 "'Give Emeril some pork bely & kosher salt & he wont just `` bring home '' this meat , he 'll make it himself '",
 "'Nursery-rhyme pair who are the title of a book by Louisa May Alcott & a le innocent book by James Patterson '",
 "'This country whose name on it coin begin with an `` E '' ha a king on them '",
 "'Because of 2010 Census tally , New York & Ohio each lost 2 of these '",
 "'In October 1823 he wrote t

In [57]:
y_train

[0,
 1,
 1,
 1,
 1,
 0,
 ...


In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [59]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_train_tf = tfidf_vectorizer.fit_transform(X_train)
X_test_tf = tfidf_vectorizer.transform(X_test)

In [63]:
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_tf, y_train)
predictions = naive_bayes.predict(X_test_tf)
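The vectorize-then-fit steps above can also be bundled into a single scikit-learn pipeline, which makes it harder to accidentally fit the vectorizer on test data. This is a sketch on toy data, not the notebook's actual run: `toy_questions`, `toy_labels`, and `model` are hypothetical names, and the four questions are paraphrased stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two low-value (0) and two high-value (1) questions.
toy_questions = [
    'This state is home to the city of Yuma',
    'This insect shared billing with a grasshopper',
    'This Puccini opera turns on three riddles',
    'This author grew up near Penny Lane',
]
toy_labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF and then Naive Bayes in one fit/predict call.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(toy_questions, toy_labels)

toy_predictions = model.predict(['This opera features a grasshopper'])
print(toy_predictions)
```

With the real data, `model.fit(X_train, y_train)` followed by `model.predict(X_test)` would replace the separate `fit_transform`, `transform`, `fit`, and `predict` calls.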

In [64]:
print('Accuracy: ', accuracy_score(y_test, predictions))

Accuracy:  0.5631985597479559
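An accuracy near 0.56 is easier to interpret next to a baseline that always predicts the most common training label. The helper below is a hypothetical sketch (`majority_baseline_accuracy` is not part of scikit-learn) shown on toy labels; in this notebook it would be called with `y_train` and `y_test`.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return correct / len(test_labels)


# Hypothetical toy labels: the majority training label is 1.
toy_train = [1, 1, 1, 0, 0]
toy_test = [1, 0, 1, 1]
print(majority_baseline_accuracy(toy_train, toy_test))
```

If the classifier's accuracy is not clearly above this number, it has learned little beyond the class balance.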


Testing the above code after altering the DataFrame so that there is a larger gap between what is arbitrarily considered a "high" or "low" value. *No matter what I did, this did not want to work!*

In [71]:
jeopardy

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680
...,...,...,...,...,...,...,...
216924,OFF-BROADWAY,2006-05-11,'In 2006 the cast of this long-running hit emb...,$2000,Stomp,Double Jeopardy!,4999
216925,RIDDLE ME THIS,2006-05-11,'This Puccini opera turns on the solution to 3...,$2000,Turandot,Double Jeopardy!,4999
216926,"""T"" BIRDS",2006-05-11,'In North America this term is properly applie...,$2000,a titmouse,Double Jeopardy!,4999
216927,AUTHORS IN THEIR YOUTH,2006-05-11,"'In Penny Lane, where this ""Hellraiser"" grew u...",$2000,Clive Barker,Double Jeopardy!,4999


For this second Naive Bayes run, I labeled values of $0-$400 as "low" (0) and values of $1600-$4000 as "high" (1).

In [106]:
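drop_list = ['$600', '$800', '$1000', '$1200'] #Intermediate values that fall between the "low" and "high" ranges.

jeopardy_alt = jeopardy[~jeopardy['value'].isin(drop_list)] #The ~ keeps only the rows whose value is NOT in drop_list.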

In [107]:
jeopardy_alt

Unnamed: 0,category,air_date,question,value,answer,round,show_number


In [100]:
print(jeopardy_alt)

None


In [None]:
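jeopardy_alt_list = []
for each_value in jeopardy_alt['value']: #Loop over the filtered values so the labels line up with jeopardy_alt.
    if each_value in ['$0','$200', '$400', '$600']:
        jeopardy_alt_list.append(0)
    else:
        jeopardy_alt_list.append(1)
print(jeopardy_alt_list[:20])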
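Because the value strings are inconsistent (for example '$2,000' next to '$2000'), a list of exact strings to drop is easy to get wrong. One way to build the low/high subset described above is to convert the values to integers and filter numerically. This is a sketch on a toy DataFrame: `toy`, `amount`, `keep`, and `toy_alt` are hypothetical names, and the 400 and 1600 cutoffs are taken from the ranges stated in the text.

```python
import pandas as pd

# Hypothetical toy frame standing in for the jeopardy DataFrame.
toy = pd.DataFrame({
    'question': ['q1', 'q2', 'q3', 'q4', 'q5'],
    'value': ['$200', '$800', '$2,000', '$400', '$1600'],
})

# Strip '$' and ',' and convert the remaining digits to integers.
amount = toy['value'].str.replace(r'[$,]', '', regex=True).astype(int)

# Keep only the extremes: at most $400 (low) or at least $1600 (high).
keep = (amount <= 400) | (amount >= 1600)
toy_alt = toy[keep].copy()

# Label the kept rows: 0 for low value, 1 for high value.
toy_alt['label'] = (amount[keep] >= 1600).astype(int)
print(toy_alt)
```

The questions and labels for the second classifier would then both come from the same filtered frame, so their lengths always match.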