# Standdown Exercise

The cells below store the raw text of a set of famous books in the variable `nltk_books`.

In [1]:
# nltk.download('gutenberg')

In [2]:
# Run cell with no changes

import nltk
import pandas as pd

# store raw text of books in a list
nltk_books = [nltk.corpus.gutenberg.raw(title) 
                 for title in nltk.corpus.gutenberg.fileids()]

# convert list to dataframe with titles as the index.
nltk_books = pd.DataFrame(nltk_books, 
                          index=nltk.corpus.gutenberg.fileids(),
                          columns=['raw_text'] )

The next cell splits the books into training and test sets. The split is arbitrary, but it is here as a reminder that a vectorizer is fit only on the training set.
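
To make the "fit only on the training set" reminder concrete, here is a minimal sketch on a hypothetical toy corpus (the `toy_*` names and sentences are illustrative assumptions, not part of the exercise). The vocabulary and idf weights are learned from the training documents, and the test documents are only transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for the book texts
toy_train = ["the whale swam away", "the captain chased the whale"]
toy_test = ["the shark swam away"]

toy_tfidf = TfidfVectorizer()

# Learn the vocabulary and idf weights from the training documents only
toy_train_vec = toy_tfidf.fit_transform(toy_train)

# Reuse the fitted vocabulary on the test documents; "shark" was never
# seen during fitting, so it gets no column
toy_test_vec = toy_tfidf.transform(toy_test)

print(sorted(toy_tfidf.vocabulary_))
print(toy_train_vec.shape, toy_test_vec.shape)
```

Both matrices end up with the same number of columns, which is what lets a downstream model trained on one be applied to the other.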

In [3]:
# Run cell with no changes
from sklearn.model_selection import train_test_split

train, test = train_test_split(nltk_books, random_state=42)


In [4]:
# Here are the books whose full texts compose the training set
train.index

Index(['milton-paradise.txt', 'shakespeare-macbeth.txt',
       'shakespeare-hamlet.txt', 'edgeworth-parents.txt', 'austen-sense.txt',
       'chesterton-brown.txt', 'whitman-leaves.txt', 'blake-poems.txt',
       'melville-moby_dick.txt', 'carroll-alice.txt',
       'chesterton-thursday.txt', 'shakespeare-caesar.txt',
       'burgess-busterbrown.txt'],
      dtype='object')

Your task is to fit a `TfidfVectorizer` to the training set with the following specifications: `max_features` is set to 50, and stop words are removed using the NLTK English stop word list. All other parameters should keep their default values.  

**After fitting the vectorizer, find the word with the highest tf-idf score in Moby Dick. Slack out both the word and tf-idf score, as well as your forked repo showing your work.**
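
For reference, the sketch below recomputes a tf-idf score by hand under the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): each raw term count is multiplied by `ln((1 + n) / (1 + df)) + 1`, and each document row is then scaled to unit Euclidean length. The toy corpus and the `tfidf_row` helper are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    ["whale", "sea", "whale"],
    ["sea", "ship"],
    ["ship", "captain"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    """Return {term: score} for one document, L2-normalized."""
    tf = Counter(doc)
    # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    # Scale the row so that its Euclidean length is 1
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


scores = tfidf_row(docs[0])
top_word = max(scores, key=scores.get)
print(top_word, round(scores[top_word], 3))
```

Because each row is scaled to unit length, every score lies between 0 and 1, but the scores are not proportions and do not sum to 1.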

> Hint: Converting the vectorized text into a DataFrame with column names and indices will make your life easier.  Use the following hints to make that happen:  
>> 1. The TF-IDF vectorizer's `fit_transform` returns a sparse matrix.  Call the `toarray()` method on that matrix, then convert the resulting array into a DataFrame.  

>> 2. The fitted TF-IDF object has a method called `get_feature_names_out()` (named `get_feature_names()` in scikit-learn versions before 1.0). Pass the result of that method as the `columns` argument of DataFrame.  

>> 3. Pass train.index as the index argument of DataFrame.   
    



In [5]:
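# Import stopwords and TfidfVectorizer, then set the hyperparameters described above.
# Note: stop_words='english' selects scikit-learn's built-in English stop word list;
# the NLTK list named in the task would be passed as stop_words=stopwords.words('english').
# fit_transform is applied to the raw_text column created above.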
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=50, stop_words='english')

train_vectorized = tfidf.fit_transform(train['raw_text'])

In [6]:
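# get the feature names (the vocabulary terms) to use as column labels
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
columns = tfidf.get_feature_names_out()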

In [7]:
# create a DataFrame with the book titles as the index
train_vect_df = pd.DataFrame(train_vectorized.toarray(),
                             columns=columns,
                             index=train.index)

In [8]:
# quick look at the df
train_vect_df.head()

| | away | came | come | day | did | don | elinor | eyes | father | good | ... | thing | think | thou | thought | thy | time | way | whale | world | ye |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| milton-paradise.txt | 0.000924 | 0.049906 | 0.035119 | 0.110901 | 0.018484 | 0.0 | 0.0 | 0.052678 | 0.057575 | 0.097039 | ... | 0.010166 | 0.021256 | 0.53358 | 0.043436 | 0.474881 | 0.037891 | 0.074858 | 0.0 | 0.097039 | 0.103235 |
| shakespeare-macbeth.txt | 0.04585 | 0.020841 | 0.22508 | 0.079195 | 0.175063 | 0.020693 | 0.0 | 0.04585 | 0.071633 | 0.20424 | ... | 0.033345 | 0.004168 | 0.501356 | 0.062522 | 0.289708 | 0.191735 | 0.070859 | 0.0 | 0.025009 | 0.01552 |
| shakespeare-hamlet.txt | 0.072524 | 0.021489 | 0.279353 | 0.067152 | 0.169223 | 0.003334 | 0.0 | 0.053722 | 0.144257 | 0.263236 | ... | 0.067152 | 0.010744 | 0.384117 | 0.037605 | 0.300048 | 0.118188 | 0.029547 | 0.012102 | 0.067152 | 0.036673 |
| edgeworth-parents.txt | 0.054531 | 0.100045 | 0.130101 | 0.11679 | 0.145129 | 0.162009 | 0.0 | 0.03435 | 0.129135 | 0.196225 | ... | 0.056678 | 0.090169 | 0.001722 | 0.081152 | 0.002132 | 0.135683 | 0.063548 | 0.0 | 0.036926 | 0.026113 |
| austen-sense.txt | 0.045929 | 0.036497 | 0.041828 | 0.061512 | 0.10129 | 0.012215 | 0.827519 | 0.022964 | 0.013214 | 0.072584 | ... | 0.075865 | 0.086117 | 0.0 | 0.047569 | 0.0 | 0.098009 | 0.030346 | 0.0 | 0.038137 | 0.000509 |


In [9]:
# find moby dick
train_vect_df.loc['melville-moby_dick.txt', :].sort_values(ascending=False).head(1)

whale    0.835587
Name: melville-moby_dick.txt, dtype: float64
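
An alternative to sorting is `idxmax()`, which returns the label of the largest value directly. The sketch below uses a small hypothetical frame (`toy_scores`, with made-up values that are not taken from the fitted vectorizer) to show the pattern.

```python
import pandas as pd

# Hypothetical scores for illustration, not taken from the fitted vectorizer
toy_scores = pd.DataFrame(
    {"sea": [0.2, 0.7], "whale": [0.9, 0.0], "ship": [0.4, 0.3]},
    index=["book_a.txt", "book_b.txt"],
)

row = toy_scores.loc["book_a.txt"]
top_word = row.idxmax()   # label of the largest value in the row
top_score = row.max()     # the largest value itself
print(top_word, top_score)

# idxmax(axis=1) gives the top word for every document at once
top_words = toy_scores.idxmax(axis=1)
print(top_words)
```

Sorting touches the whole row, while `idxmax()` and `max()` only scan it, though with 50 features the difference is negligible.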

In [10]:
'''
"whale" has the highest tf-idf score in Moby Dick, at about 0.836
'''

'\n"whale" has the highest tf-idf score in Moby Dick, at about 0.836\n'