In [5]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
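from nltk.corpus import gutenberg, stopwords
# explicit imports for names used below (harmless if Challenge.ipynb already provides them)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
%run Challenge.ipynb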

0    [8, year, life, Galileo, house, arrest, espous...
1    [2, 1912, olympian, football, star, Carlisle, ...
2    [city, Yuma, state, record, average, 4,055, ho...
3    [1963, live, Art, Linkletter, company, serve, ...
4    [signer, December, Indep, framer, Constitution...
5    [title, Aesop, fable, insect, share, billing, ...
6    [build, 312, B.C., link, Rome, South, Italy, u...
7    [8, 30, steal, Birmingham, Barons, 2,306, stea...
8    [winter, 1971, 72, record, 1,122, inch, snow, ...
9    [houseware, store, name, packaging, merchandis...
dtype: object
1684334
['city', 'play', 'name', 'country', 'man', 'call', '2', 'know', 'see', 'like', 'type', 'film', 'say', 'state', 'U.S.', 'year', 'title', 'write', 'word', 'mean', 'win', 'come', 'include', 'large', 'bear', 'novel', 'find', 'term', 'New', 'time', 'star', 'work', '3', 'capital', 'president', '1', 'book', 'get', 'woman', 'go', 'old', 'take', 'famous', 'hit', 'song', 'day', 'world', 'John', 'give', 'home', 'begin', 'group', 'chara



[0.5250889  0.52258659 0.52848205 0.52617715 0.52156734 0.52051366
 0.52420151 0.5217649  0.52064537 0.52522392]
0.9442611507332846
0.5230719586963537




[0.54458053 0.54925589 0.55146526 0.55370431 0.54902865 0.55284821
 0.55100428 0.55607507 0.54033586 0.54985511]
Train score:  0.5763149403033236
Test score:  0.5503618678836492
[0.54892664 0.5508363  0.55278235 0.54863352 0.54626276 0.54784327
 0.55745802 0.54527494 0.54395785 0.55031612]
Train score:  0.5689195329632337
Test score:  0.550576991041657


# Jeopardy! Categories!

In the challenge exercise, I attempted to predict which Jeopardy! round a question appears in. It wasn't an overwhelming success: the best models reached only about 55% test accuracy. For this project, I am going to use clustering to see how the ~28,000 categories can be grouped more broadly, and whether adding the category as a feature improves the accuracy of my model.

In [6]:
# first let's take a closer look at the categories
filtered_df.head()

Unnamed: 0,Round,Category,Value,Question,cleaned,spacy
0,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...","For the last 8 years of his life, Galileo was ...","(For, the, last, 8, years, of, his, life, ,, G..."
1,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,No. 2: 1912 Olympian; football star at Carlisl...,"(No, ., 2, :, 1912, Olympian, ;, football, sta..."
2,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,The city of Yuma in this state has a record av...,"(The, city, of, Yuma, in, this, state, has, a,..."
3,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...","In 1963, live on ""The Art Linkletter Show"", th...","(In, 1963, ,, live, on, "", The, Art, Linklette..."
4,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...","Signer of the Dec. of Indep., framer of the Co...","(Signer, of, the, Dec., of, Indep, ., ,, frame..."


In [7]:
filtered_df[' Category'].describe()

count             216930
unique             27995
top       BEFORE & AFTER
freq                 547
Name:  Category, dtype: object

In [8]:
filtered_df[' Category'].value_counts()

BEFORE & AFTER               547
SCIENCE                      519
LITERATURE                   496
AMERICAN HISTORY             418
POTPOURRI                    401
WORLD HISTORY                377
WORD ORIGINS                 371
COLLEGES & UNIVERSITIES      351
HISTORY                      349
SPORTS                       342
U.S. CITIES                  339
WORLD GEOGRAPHY              338
BODIES OF WATER              327
ANIMALS                      324
STATE CAPITALS               314
BUSINESS & INDUSTRY          311
ISLANDS                      301
WORLD CAPITALS               300
U.S. GEOGRAPHY               299
RELIGION                     297
OPERA                        294
SHAKESPEARE                  294
LANGUAGES                    284
BALLET                       282
TELEVISION                   281
FICTIONAL CHARACTERS         280
RHYME TIME                   279
TRANSPORTATION               279
PEOPLE                       279
STUPID ANSWERS               270
          

Now, we can see right away that clustering on the category names alone would probably be too difficult or inaccurate, since most categories consist of only 2-3 words. Tf-idf weighs how often a term appears within a document against how many documents in the corpus contain it, so such short documents give it very little to work with. Instead, I will combine each category with its question and then attempt to form clusters.
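
To make the point concrete, here is a minimal sketch on toy data. The three categories and questions are a hypothetical stand-in for `filtered_df[' Category']` and `filtered_df['cleaned']`, and `stored_terms_per_doc` is a helper name I made up for illustration. It counts how many tf-idf terms each document ends up with, first for the category alone and then for the combined text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy rows standing in for the real category and question columns
categories = ["HISTORY", "STATE CAPITALS", "WORLD GEOGRAPHY"]
questions = [
    "For the last 8 years of his life, Galileo was under house arrest",
    "It's the only state capital named for a French city",
    "Hobart is the capital city of this island state of Australia",
]
combined = [c + " " + q for c, q in zip(categories, questions)]


def stored_terms_per_doc(docs):
    """Number of terms with a tf-idf weight in each document."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(docs).tocsr()
    # consecutive differences of indptr give the stored entries per row
    return np.diff(matrix.indptr).tolist()


cat_counts = stored_terms_per_doc(categories)
comb_counts = stored_terms_per_doc(combined)
print("category only:", cat_counts)
print("category + question:", comb_counts)
```

With only one or two terms per document there is almost nothing for a clustering or decomposition step to compare, which is why the combined text is used from here on.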

In [15]:
filtered_df['combined'] = filtered_df[' Category'] + ' ' + filtered_df['cleaned']
filtered_df.head()

Unnamed: 0,Round,Category,Value,Question,cleaned,spacy,combined
0,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...","For the last 8 years of his life, Galileo was ...","(For, the, last, 8, years, of, his, life, ,, G...","HISTORY For the last 8 years of his life, Gali..."
1,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,No. 2: 1912 Olympian; football star at Carlisl...,"(No, ., 2, :, 1912, Olympian, ;, football, sta...",ESPN's TOP 10 ALL-TIME ATHLETES No. 2: 1912 Ol...
2,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,The city of Yuma in this state has a record av...,"(The, city, of, Yuma, in, this, state, has, a,...",EVERYBODY TALKS ABOUT IT... The city of Yuma i...
3,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...","In 1963, live on ""The Art Linkletter Show"", th...","(In, 1963, ,, live, on, "", The, Art, Linklette...","THE COMPANY LINE In 1963, live on ""The Art Lin..."
4,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...","Signer of the Dec. of Indep., framer of the Co...","(Signer, of, the, Dec., of, Indep, ., ,, frame...",EPITAPHS & TRIBUTES Signer of the Dec. of Inde...


That looks good. Let's vectorize!!

In [16]:
# change column to list
comb_list = filtered_df['combined'].tolist()

In [24]:
# hold out 25% of data as a test set
X_train, X_test = train_test_split(comb_list, test_size=0.25, random_state=26)
#print(X_train.shape)

In [28]:
# configure the vectorizer
vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half of the questions
                             min_df=10, # only keep words that appear in at least 10 questions
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case (the category names are in ALL CAPS)
                             use_idf=True, # we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', # applies a correction factor so that longer and shorter questions get treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )

In [29]:
# apply the vectorizer (note: it is fit on the full list, train and test rows together)
comb_list_tfidf = vectorizer.fit_transform(comb_list)
print('Number of features: %d' % comb_list_tfidf.shape[1])

Number of features: 19051
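
One caveat: because the vectorizer above is fit on all of `comb_list`, the vocabulary and idf weights are learned from rows that later land in the test set. A stricter alternative is to fit on the training text only and reuse the fitted vectorizer on the test text. The sketch below shows that pattern on a hypothetical toy list (`toy_docs` is not the real data), so it is an illustration of the idea rather than a rerun of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy documents standing in for comb_list
toy_docs = [
    "STATE CAPITALS the only state capital named for a French city",
    "WORLD GEOGRAPHY Hobart is the capital city of this island state",
    "OPERA this composer wrote the opera Carmen",
    "BALLET this ballet features a famous swan",
    "SCIENCE this gas makes up most of the atmosphere",
    "LITERATURE this novel opens with a famous whale hunt",
    "SPORTS this star stole a record number of bases",
    "ANIMALS this large cat is famous for its stripes",
]

train_docs, test_docs = train_test_split(toy_docs, test_size=0.25, random_state=26)

toy_vectorizer = TfidfVectorizer(stop_words="english")
# learn vocabulary and idf weights from the training text only
train_tfidf = toy_vectorizer.fit_transform(train_docs)
# reuse the fitted vocabulary; unseen test words are simply ignored
test_tfidf = toy_vectorizer.transform(test_docs)

print(train_tfidf.shape, test_tfidf.shape)
```

Both matrices share the same columns, so anything fit on the training matrix can be applied to the test matrix unchanged.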


In [35]:
# split into train and test vectors
X_train_tfidf, X_test_tfidf = train_test_split(comb_list_tfidf, test_size=0.25, random_state=26)

In [36]:
#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()
#print(X_train_tfidf.shape)
#print(X_test_tfidf_csr)

#number of questions
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per question
tfidf_byques = [{} for _ in range(n)]

#List of features (get_feature_names() was removed in scikit-learn 1.2)
terms = vectorizer.get_feature_names_out()

#for each question, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_byques[i][terms[j]] = X_train_tfidf_csr[i, j]

#Only terms with a nonzero weight are stored, so each dictionary lists just the feature words present in that question.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_byques[5])

Original sentence: EGGS & HAM The ham & cheese version of this egg dish is prepared in much the same manner as the Lorraine
Tf_idf vector: {'manner': 0.336100988289531, 'lorraine': 0.3405620215897985, 'prepared': 0.2944894763367392, 'dish': 0.2500696568261996, 'cheese': 0.25382633746578803, 'ham': 0.6006912149220507, 'eggs': 0.26920627264605274, 'version': 0.22859153919328828, 'egg': 0.26797555090837477}
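
A note on reading these weights: with `smooth_idf=True`, scikit-learn computes idf as `ln((1 + n) / (1 + df)) + 1`, so even a word that appears in every document keeps an idf of 1 instead of dropping to 0. The sketch below checks that formula on a hypothetical three-document corpus (the documents are made up, loosely echoing the example above).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; "ham" appears in every document
toy_docs = ["ham cheese egg", "ham egg dish", "ham lorraine"]

toy_vec = TfidfVectorizer(smooth_idf=True, norm="l2").fit(toy_docs)

# feature names in column order, recovered from the fitted vocabulary
toy_terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

n_docs = len(toy_docs)
# document frequency: number of documents containing each term
doc_freq = np.array([sum(term in doc.split() for doc in toy_docs) for term in toy_terms])

# smoothed idf as documented for scikit-learn's TfidfTransformer
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

for term, idf in zip(toy_terms, toy_vec.idf_):
    print(term, round(float(idf), 3))
```

The printed weights in the notebook output are additionally L2-normalized per question, which is why they are all below 1.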


Increasing the min_df parameter from 5 to 10 documents decreased the number of features from 30,000+ to about 19,000. We are going to further trim the number of features using Singular Value Decomposition.
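
As a small illustration of how `min_df` prunes the vocabulary, here is a sketch on a hypothetical four-document corpus (`toy_docs` and `vocabulary_size` are names I made up). Each step up in `min_df` drops every word that appears in fewer documents than the threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_docs = [
    "state capital city",
    "state capital river",
    "state opera singer",
    "ballet opera composer",
]


def vocabulary_size(min_df):
    """Number of features kept when a word must appear in at least min_df documents."""
    return len(TfidfVectorizer(min_df=min_df).fit(toy_docs).vocabulary_)


sizes = {m: vocabulary_size(m) for m in (1, 2, 3)}
print(sizes)  # {1: 8, 2: 3, 3: 1}
```

Only "state", "capital" and "opera" survive `min_df=2`, and only "state" survives `min_df=3`.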

In [39]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# We are going to reduce the feature space from 19,051 to 1900.
svd = TruncatedSVD(n_components=1900)
lsa = make_pipeline(svd, Normalizer(copy=False))

# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of questions our solution considers similar, for the first five components
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train)
for i in range(5):
    print('Component {}:'.format(i))
    print(paras_by_component.loc[:,i].sort_values(ascending=False)[0:10])
    print('\n')

Percent variance captured by all components: 50.3266875508026
Component 0:
STATE CAPITALS Chartered in 1781, it's the only state capital named for a French city                  0.393608
WORLD GEOGRAPHY Bhopal is the capital of this country's Madhya Pradesh state                           0.392689
THE STATE I'M IN Snowflake, Bullhead City, Scottsdale                                                  0.376095
THE WORLD I see this city, the capital of Uruguay                                                      0.370173
COUNTRIES OF THE WORLD It's the only country whose name is the same as an American state's             0.370134
WORLD FACTS Cuernavaca is the capital of this North American country's state of Morelos                0.365592
WORLD GEOGRAPHY Like the city of Bern, the Bernese Alps are in this country                            0.357448
WORLD GEOGRAPHY Hobart is the capital city of this island state of Australia                           0.352545
STATE CAPITAL NICKNAMES "The 

So we've compressed the feature set from about 19,000 to 1,900. The reduced model still explains about 50% of the variance, so that's not too bad (a 50% loss of variance for a 90% reduction in features). I also tried reducing the features to 1% (190 components), but that resulted in an 86% loss of variance.
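
Rather than refitting for each candidate size, one option is to fit a single larger SVD and read the trade-off from the cumulative explained variance. The sketch below does this on random toy data (`X_toy` is hypothetical and much smaller than the real tf-idf matrix), so the numbers it prints say nothing about the Jeopardy! data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy matrix standing in for X_train_tfidf
rng = np.random.RandomState(26)
X_toy = rng.rand(200, 50)

svd_toy = TruncatedSVD(n_components=40, random_state=26)
svd_toy.fit(X_toy)

# running total of the variance explained by the first k components
cumulative = np.cumsum(svd_toy.explained_variance_ratio_)

target = 0.5
reached = cumulative >= target
# first k whose running total meets the target, or None if it is never met
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print("Components needed for {:.0%} of the variance: {}".format(target, n_needed))
```

Plotting `cumulative` against the component index would show where extra components stop paying for themselves.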

### What is the verdict?

All that's left for us to do is apply the model to the test set with the reduced features and see if we can improve on the 55% accuracy.

In [43]:
paras_by_component.head()
print(paras_by_component.shape)

(162697, 1900)


First we need to retrain the model with the new feature set.

In [42]:
y_train, y_test = train_test_split(filtered_df[' Round'], test_size=0.25, random_state=26)
y_train.shape

(162697,)
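
The labels are split here in a separate call from the features. That stays aligned only because both calls use the same number of rows, the same `test_size` and the same `random_state`. A sketch of the check on hypothetical toy arrays, next to the single-call form that guarantees alignment by construction:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: text i carries label i, so misalignment is easy to spot
texts = np.array(["question {}".format(i) for i in range(20)])
labels = np.arange(20)

# separate calls, as in this notebook
X_tr_sep, X_te_sep = train_test_split(texts, test_size=0.25, random_state=26)
y_tr_sep, y_te_sep = train_test_split(labels, test_size=0.25, random_state=26)

# one call that splits features and labels together
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.25, random_state=26)

aligned = all(t == "question {}".format(l) for t, l in zip(X_tr_sep, y_tr_sep))
print("separate calls aligned:", aligned)
print("same split as the single call:", np.array_equal(y_tr_sep, y_tr))
```

The single-call form is the safer habit, since it cannot drift out of sync if one of the arguments changes.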

In [48]:
# logistic regression
train = lr.fit(paras_by_component, y_train)
train_score = lr.score(paras_by_component, y_train)
print(train_score)



0.6037603643582856


In [45]:
#Reshapes the vectorizer output into something people can read
X_test_tfidf_csr = X_test_tfidf.tocsr()
#print(X_test_tfidf.shape)
#print(X_test_tfidf_csr)

#number of questions
n = X_test_tfidf_csr.shape[0]

#A list of dictionaries, one per question
tfidf_byques = [{} for _ in range(n)]

#List of features (get_feature_names() was removed in scikit-learn 1.2)
terms = vectorizer.get_feature_names_out()

#for each question, lists the feature words and their tf-idf scores
for i, j in zip(*X_test_tfidf_csr.nonzero()):
    tfidf_byques[i][terms[j]] = X_test_tfidf_csr[i, j]

#Only terms with a nonzero weight are stored, so each dictionary lists just the feature words present in that question.
print('Original sentence:', X_test[5])
print('Tf_idf vector:', tfidf_byques[5])

Original sentence: MOVIE STARS You may call him Rocky or Rambo, but his friends call him Sly
Tf_idf vector: {'sly': 0.49662028909697314, 'rambo': 0.5407947504575876, 'friends': 0.35924065997827037, 'stars': 0.31674372686574803, 'rocky': 0.4063484819278554, 'movie': 0.2577009842419211}


In [47]:
# applying lsa model to test set
X_test_lsa = lsa.transform(X_test_tfidf)

test_ques_by_component=pd.DataFrame(X_test_lsa,index=X_test)
for i in range(5):
    print('Component {}:'.format(i))
    print(test_ques_by_component.loc[:,i].sort_values(ascending=False)[0:10])
    print('\n')

Component 0:
WORLD GEOGRAPHY The largest state in Venezuela is named for this man, the country's liberator                             0.392514
U.S. STATES It's the only state whose name & capital city both consist of 2 words                                         0.372970
STATE CAPITALS One of the world's largest pipe organs is found in the tabernacle in this state capital                    0.367314
SMALL STATE CAPITALS This city became a state capital in 1826, the same year the president for which it was named died    0.366973
STATE CAPITALS This city dropped the word "Great" from its name in 1868, while it was still a territorial capital         0.354896
LYING IN STATE In 1909, before reinterment, he lay in state in the city he had planned                                    0.349969
AMERICAN WORLD CAPITALS Hey, dude! Wickenburg in this southwestern state is the "Dude Ranch Capital of the World"         0.342801
STATE CAPITALS If you know that this state's capital is Jefferson City

In [56]:
# now let's apply the logistic regression to the test set and make some predictions
test_score = lr.score(test_ques_by_component, y_test)
print('Test score: ', test_score)

y_pred = lr.predict(test_ques_by_component)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

print(y_test.value_counts())

Test score:  0.5850128150756919
[[15211     1 11201     0]
 [  583     3   351     0]
 [10367     1 16513     0]
 [    1     0     1     0]]
Jeopardy!           26881
Double Jeopardy!    26413
Final Jeopardy!       937
Tiebreaker              2
Name:  Round, dtype: int64


58.5%! That's a 3.5 point improvement over the model fit with just the question text (about 55%). It's a small improvement, but an improvement nonetheless. Looking at the confusion matrix, this model just isn't very good at predicting Final Jeopardy! questions: it correctly identifies only 3 of the 937 in the test set. My hypothesis is that either the difficulty of a question cannot be predicted from term frequencies, or the questions are simply too short for meaningful relationships to be extracted.
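
To put numbers on that reading, the sketch below recomputes per-round recall and a majority-class baseline from the confusion matrix printed above. The row order is an assumption: `confusion_matrix` sorts labels, and the row sums match the `y_test` counts for Double Jeopardy!, Final Jeopardy!, Jeopardy! and Tiebreaker in that order.

```python
import numpy as np

# Assumed label order (sorted alphabetically, consistent with the row sums)
round_labels = ["Double Jeopardy!", "Final Jeopardy!", "Jeopardy!", "Tiebreaker"]

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [15211, 1, 11201, 0],
    [583, 3, 351, 0],
    [10367, 1, 16513, 0],
    [1, 0, 1, 0],
])

support = cm.sum(axis=1)           # true questions per round
recall = cm.diagonal() / support   # share of each round predicted correctly
accuracy = cm.trace() / cm.sum()
baseline = support.max() / cm.sum()  # always guessing the most common round

for name, r, s in zip(round_labels, recall, support):
    print("{:<17} recall {:.3f} (n={})".format(name, r, s))
print("accuracy {:.3f}, majority baseline {:.3f}".format(accuracy, baseline))
```

Always guessing the most common round would already score roughly 49.6%, so the model's real gain is about nine points over that baseline, earned almost entirely on the two main rounds.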