## Unstructured Data - Continued

Picking up where we left off: in this lab we'll use the same book data set that we scraped from the web. We'll pick a large genre and see if we can identify words that are common in its book descriptions. We'll then test out of sample to see whether keyword matches alone can predict which books belong to that genre.

In [1]:
#import nltk and pandas, plus matplotlib for plotting

import nltk
from nltk.tokenize.api import TokenizerI
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt

In [2]:
#read in the book data

data = pd.read_csv('scraped_books.csv', index_col=0)
data.head()

| | Title | Price | Description | Rating | Link | Genre |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | Â£51.77 | It's hard to imagine a world without A Light i... | Three | a-light-in-the-attic_1000/index.html | Poetry |
| 1 | Tipping the Velvet | Â£53.74 | "Erotic and absorbing...Written with starling ... | One | tipping-the-velvet_999/index.html | Historical Fiction |
| 2 | Soumission | Â£50.10 | Dans une France assez proche de la nÃ´tre, un ... | One | soumission_998/index.html | Fiction |
| 3 | Sharp Objects | Â£47.82 | WICKED above her hipbone, GIRL across her hear... | Four | sharp-objects_997/index.html | Mystery |
| 4 | Sapiens: A Brief History of Humankind | Â£54.23 | From a renowned historian comes a groundbreaki... | Five | sapiens-a-brief-history-of-humankind_996/index... | History |


In [3]:
#display the top 10 genres
data.Genre.value_counts()[:10]

Default           152
Nonfiction        110
Sequential Art     75
Add a comment      67
Fiction            65
Young Adult        54
Fantasy            48
Romance            35
Mystery            32
Food and Drink     30
Name: Genre, dtype: int64

Looks like the top few genres are fairly broad, and "Default" and "Add a comment" look like catch-all labels rather than real genres, so let's pick a genre that likely has more descriptive keywords.

In [4]:
#create a slice of the dataframe that is only books in the Fantasy genre and show the first 5 rows of it

fantasy = data[data['Genre'] == 'Fantasy']
fantasy.head()

| | Title | Price | Description | Rating | Link | Genre |
|---|---|---|---|---|---|---|
| 49 | Unicorn Tracks | Â£18.78 | After a savage attack drives her from her home... | Three | unicorn-tracks_951/index.html | Fantasy |
| 76 | Saga, Volume 6 (Saga (Collected Editions) #6) | Â£25.02 | After a dramatic time jump, the three-time Eis... | Three | saga-volume-6-saga-collected-editions-6_924/in... | Fantasy |
| 81 | Princess Between Worlds (Wide-Awake Princess #5) | Â£13.34 | Just as Annie and Liam are busy making plans t... | Five | princess-between-worlds-wide-awake-princess-5_... | Fantasy |
| 91 | Masks and Shadows | Â£56.40 | The year is 1779, and Carlo Morelli, the most ... | Two | masks-and-shadows_909/index.html | Fantasy |
| 112 | Crown of Midnight (Throne of Glass #2) | Â£43.29 | "A line that should never be crossed is about ... | Three | crown-of-midnight-throne-of-glass-2_888/index.... | Fantasy |


Now let's split this into a training set and a test set. The idea is to find common words in 70% of the Fantasy book descriptions, then see whether those words let us correctly identify the other 30% of the books. There is a very easy way of doing this using scikit-learn's [*sklearn.model_selection.train_test_split()*](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. To be totally clear, this is way overkill for this application, but it's a cool way to start using scikit-learn, so let's do it. 😂
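
Before running the split, here is a rough sketch of where this is heading: take a handful of genre keywords learned from the training descriptions, then flag a held-out description as in-genre when enough of those keywords appear in it. The helper names `keyword_hits` and `predict_genre`, the `min_hits` threshold, the keyword set, and the two sample descriptions are all made up for illustration; they are not taken from the data set.

```python
def keyword_hits(description, keywords):
    """Count how many of the genre keywords occur in a description."""
    words = set(description.lower().split())
    return sum(1 for kw in keywords if kw in words)


def predict_genre(description, keywords, min_hits=2):
    """Flag a description as in-genre when it matches at least min_hits keywords."""
    return keyword_hits(description, keywords) >= min_hits


# Hypothetical keywords, standing in for words learned from the training set.
keywords = {"kingdom", "magical", "queen", "war"}

# Hypothetical held-out descriptions with their true in-genre labels.
held_out = [
    ("A young queen must save her kingdom from war", True),
    ("A cookbook of quick weeknight dinners", False),
]

correct = sum(predict_genre(text, keywords) == label for text, label in held_out)
print(f"{correct}/{len(held_out)} correct")
```

The threshold is the main design choice: a low `min_hits` catches more genre books but also more false positives, so it is worth tuning on the training set only.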

In [5]:
#import the model_selection module from scikit-learn

from sklearn import model_selection

In [6]:
#split the dataframe into a train (70%) and test (30%) set
#note: no random_state is passed, so the rows in each set change from run to run

train, test = model_selection.train_test_split(fantasy, test_size=0.3, train_size=0.7)

In [7]:
#check the len of your train set to confirm you split properly

len(train)

33

In [8]:
#check the len of your test set to confirm you split properly

len(test)

15
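
If you're wondering why 48 Fantasy rows become 33 and 15 rather than an exact 70/30, the sketch below reproduces the arithmetic with the standard library only. It assumes the test count is rounded up and the train count is rounded down, which matches the sizes above; `split_rows` and its `seed` argument are illustrative names, not scikit-learn's API, and the shuffling here is not the same algorithm scikit-learn uses.

```python
import math
import random


def split_rows(rows, test_size=0.3, train_size=0.7, seed=0):
    """Shuffle rows with a fixed seed, then cut them into train and test lists."""
    n = len(rows)
    n_test = math.ceil(test_size * n)      # assumed: test count rounds up
    n_train = math.floor(train_size * n)   # assumed: train count rounds down
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


rows = list(range(48))  # stand-in for the 48 Fantasy rows
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

In scikit-learn itself, passing `random_state=` to `train_test_split` plays the same role as the seed here: it makes the split repeatable across runs.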

In [9]:
#combine all the descriptions from the training set into a long string

all_train = ' '.join(list(train.Description))

In [10]:
#tokenize the string of all the training descriptions

tokens = nltk.word_tokenize(all_train)

In [11]:
#create a FreqDist of the tokens and show it

frequency = nltk.FreqDist(tokens)
frequency

FreqDist({',': 427, 'the': 404, '.': 264, 'and': 257, 'of': 216, 'a': 212, 'to': 198, 'in': 126, 'is': 121, 'her': 89, ...})
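
To see what this counting step is doing, here is a small standard-library sketch. NLTK's `FreqDist` builds on `collections.Counter`, so a plain `Counter` behaves much the same way for our purposes. The regex tokenizer is a crude stand-in for `nltk.word_tokenize` (it will split contractions differently), and the sample sentence is made up.

```python
import re
from collections import Counter

text = "The dragon guards the gate, and the gate guards the city."

# Rough tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)

counts = Counter(tokens)
print(counts.most_common(3))
```

Note that punctuation is counted as a token, and "The" and "the" are counted separately. The same thing is happening in the real output above.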

In [12]:
#import stopwords and rebind the name to the list of English stop words
#note: after this cell, stopwords is a plain list, not the NLTK corpus reader

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

In [13]:
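#remove stop words from the FreqDist you created
#the result is a list of (word, count) tuples; the match is case-sensitive, so tokens like 'THE' and 'But' are kept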

frequency = [(w, f) for (w, f) in frequency.items() if w not in stopwords]
frequency

[('***USA', 2),
 ('TODAY', 2),
 ('BESTSELLING', 2),
 ('SERIES***', 2),
 ('EXPLOSIVE', 2),
 ('FINAL', 2),
 ('INSTALLMENT', 2),
 ('IN', 2),
 ('THE', 11),
 ('SEVEN', 4),
 ('SERIES', 4),
 ('Book', 21),
 ('1', 11),
 ('sale', 2),
 ('limited', 2),
 ('time', 10),
 ('!', 14),
 ('Love', 3),
 ('.', 264),
 ('Family', 2),
 ('Brotherhood', 7),
 ('Lexi', 3),
 ('faced', 4),
 ('personal', 4),
 ('struggles', 2),
 (',', 427),
 ('nothing', 3),
 ('prepared', 4),
 ('perilous', 4),
 ('battle', 7),
 ('life', 24),
 ('Shifters', 3),
 ('brink', 4),
 ('war', 14),
 ('Northerners', 2),
 ('target', 2),
 ('Colorado', 2),
 ('attempt', 3),
 ('infiltrate', 2),
 ('borders', 2),
 ('Texas', 2),
 ('winds', 2),
 ('hit', 1),
 ('list', 1),
 ('Weston', 1),
 ('pack', 2),
 ('prepares', 1),
 ('fight', 4),
 ('landâ\x80¦', 1),
 ('lives', 5),
 ('Austinâ\x80\x99s', 1),
 ('courage', 1),
 ('put', 5),
 ('test', 3),
 ('rogues', 1),
 ('want', 4),
 ('seize', 1),
 ('land', 8),
 ('slaughter', 1),
 ('But', 33),
 ('thatâ\x80\x99s', 1),
 ('heâ\x

In [14]:
#sort by frequency

frequency.sort(key=lambda tup: tup[1], reverse=True)

In [15]:
#show top 10 values in the list

frequency[:10]

[(',', 427),
 ('.', 264),
 ('...', 53),
 ("'s", 43),
 ('The', 41),
 (':', 34),
 ('But', 33),
 ('I', 30),
 ('world', 28),
 ('life', 24)]
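
The top of the list is still dominated by punctuation and by capitalized stop words such as 'The' and 'But', because NLTK's stop word list is lowercase and the filter is case-sensitive. One possible cleanup, sketched below, is to lowercase each token, merge the counts of case variants, and drop anything that isn't purely alphabetic. This is an alternative to what the lab does, not a description of it: `clean_counts` and the tiny `STOP_WORDS` set are hypothetical, and the sample counts are copied from the lab's own output.

```python
from collections import Counter

# Hypothetical mini stop word list; the lab uses NLTK's much longer English list.
STOP_WORDS = {"the", "but", "and", "of", "a", "in", "is", "i"}


def clean_counts(pairs, stop_words=STOP_WORDS):
    """Merge case variants, then drop stop words and non-alphabetic tokens."""
    merged = Counter()
    for token, count in pairs:
        word = token.lower()
        if word.isalpha() and word not in stop_words:
            merged[word] += count
    return merged.most_common()


# (token, count) pairs taken from the lab's output.
sample = [
    (",", 427), (".", 264), ("...", 53), ("'s", 43), ("The", 41), (":", 34),
    ("But", 33), ("I", 30), ("world", 28), ("life", 24), ("New", 17), ("new", 14),
]

print(clean_counts(sample))
# → [('new', 31), ('world', 28), ('life', 24)]
```

The trade-off is that `isalpha()` also discards tokens such as "n't" and anything containing digits or hyphens, and lowercasing merges proper nouns with ordinary words, so the resulting list will not match the one the lab goes on to build.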

In [16]:
#filter your list of tuples and keep only tokens longer than 1 character

frequency = [(w, f) for (w, f) in frequency if len(w) > 1]
frequency

[('...', 53),
 ("'s", 43),
 ('The', 41),
 ('But', 33),
 ('world', 28),
 ('life', 24),
 ('one', 22),
 ('Book', 21),
 ('must', 19),
 ('find', 18),
 ('Harry', 18),
 ('New', 17),
 ('power', 16),
 ('save', 15),
 ('war', 14),
 ('new', 14),
 ('Series', 13),
 ('series', 12),
 ('story', 12),
 ('With', 12),
 ('York', 12),
 ('Times', 12),
 ('bestselling', 12),
 ('since', 12),
 ('THE', 11),
 ('Seven', 11),
 ('city', 11),
 ('first', 11),
 ('father', 11),
 ('Hogwarts', 11),
 ('old', 11),
 ('take', 11),
 ('time', 10),
 ("n't", 10),
 ('back', 10),
 ('finds', 10),
 ('It', 10),
 ('Prince', 10),
 ('kingdom', 10),
 ('Maya', 10),
 ('Now', 9),
 ('could', 9),
 ('make', 9),
 ('secrets', 9),
 ('home', 9),
 ('stone', 9),
 ('queen', 9),
 ('land', 8),
 ('supernatural', 8),
 ('young', 8),
 ('magical', 8),
 ('get', 8),
 ('What', 8),
 ('Potter', 8),
 ('--', 8),
 ('ever', 8),
 ('family', 8),
 ('And', 8),
 ('know', 8),
 ('torn', 8),
 ('mother', 8),
 ('months', 8),
 ('As', 8),
 ('Brotherhood', 7),
 ('battle', 7),
 ('lo