## Unstructured Data - Continued

Picking up where we left off: in this lab we'll use the same book data set that we scraped from the web. We'll pick a large genre and see if we can identify some common words for it. We'll then test it out of sample and see if we can predict which other books belong to that genre, just by keyword matches.

In [1]:
#import pandas, nltk, from nltk.tokenize.api import TokenizerI, matplotlib (inline), numpy

import nltk
from nltk.tokenize.api import TokenizerI
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

In [2]:
#read in the book data

data = pd.read_csv('scraped_books.csv', index_col=0)
data.head()

Unnamed: 0,Title,Price,Description,Rating,Link,Genre
0,A Light in the Attic,£51.77,It's hard to imagine a world without A Light i...,Three,a-light-in-the-attic_1000/index.html,Poetry
1,Tipping the Velvet,£53.74,"""Erotic and absorbing...Written with starling ...",One,tipping-the-velvet_999/index.html,Historical Fiction
2,Soumission,£50.10,"Dans une France assez proche de la nôtre, un ...",One,soumission_998/index.html,Fiction
3,Sharp Objects,£47.82,"WICKED above her hipbone, GIRL across her hear...",Four,sharp-objects_997/index.html,Mystery
4,Sapiens: A Brief History of Humankind,£54.23,From a renowned historian comes a groundbreaki...,Five,sapiens-a-brief-history-of-humankind_996/index...,History


In [3]:
#display the top 10 genres
data.Genre.value_counts()[:10]

Default           152
Nonfiction        110
Sequential Art     75
Add a comment      67
Fiction            65
Young Adult        54
Fantasy            48
Romance            35
Mystery            32
Food and Drink     30
Name: Genre, dtype: int64

Looks like the top few genres are fairly broad, and two of them ("Default" and "Add a comment") aren't descriptive genre labels at all, so let's pick a genre that likely has more descriptive keywords.

In [4]:
#create a slice of the dataframe that is only books in the Fantasy genre and show the first 5 rows of it

fantasy = data[data['Genre'] == 'Fantasy']
fantasy.head()

Unnamed: 0,Title,Price,Description,Rating,Link,Genre
49,Unicorn Tracks,£18.78,After a savage attack drives her from her home...,Three,unicorn-tracks_951/index.html,Fantasy
76,"Saga, Volume 6 (Saga (Collected Editions) #6)",£25.02,"After a dramatic time jump, the three-time Eis...",Three,saga-volume-6-saga-collected-editions-6_924/in...,Fantasy
81,Princess Between Worlds (Wide-Awake Princess #5),£13.34,Just as Annie and Liam are busy making plans t...,Five,princess-between-worlds-wide-awake-princess-5_...,Fantasy
91,Masks and Shadows,£56.40,"The year is 1779, and Carlo Morelli, the most ...",Two,masks-and-shadows_909/index.html,Fantasy
112,Crown of Midnight (Throne of Glass #2),£43.29,"""A line that should never be crossed is about ...",Three,crown-of-midnight-throne-of-glass-2_888/index....,Fantasy


Now let's split this into a training and test set. The idea here is that we want to find common words in 70% of the Fantasy book descriptions. Then we'll see if those words let us accurately identify the other 30% of the books. There is a very easy way of doing this using scikit-learn's [*sklearn.model_selection.train_test_split()*](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. To be totally clear, this is way overkill for this application, but it's a cool way to start using scikit-learn, so let's do it. 😂
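As a rough illustration of what the split does, here is a minimal sketch on a made-up ten-row frame. The `toy` frame and its contents are assumptions for illustration only, not the book data. Passing `random_state` is optional; it is included here just so the shuffle is repeatable from run to run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the Fantasy slice
toy = pd.DataFrame({"Title": [f"Book {i}" for i in range(10)]})

# 70/30 split; random_state fixes the shuffle so reruns return the same rows
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

# The two pieces never share a row, and together they cover the whole frame
print(len(toy_train), len(toy_test))
print(sorted(toy_train.index.tolist() + toy_test.index.tolist()))
```

Without `random_state`, the same call shuffles differently each time, so the rows that land in each set will vary between runs.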

In [5]:
#import the model_selection module from scikit-learn

from sklearn import model_selection

In [6]:
#split the dataframe into a train (70%) and test (30%) set
#(no random_state is set, so the rows in each set change from run to run)

train, test = model_selection.train_test_split(fantasy, test_size=0.3, train_size=0.7)

In [7]:
#check the len of your train set to confirm you split properly

len(train)

33

In [8]:
#check the len of your test set to confirm you split properly

len(test)

15

Now, let's create a tokenization function that you can apply to each row. It should tokenize the given row's description, remove stopwords, create a FreqDist for the tokens, remove any token that is only 1 character long, and then return the top 5 most common words.

In [9]:
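#import stopwords and keep the english subset as a set for fast membership tests
#(note: this rebinds the name stopwords from the corpus reader to a plain set of words)

from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))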

In [10]:
#define the tokenization function

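def common_word_getter(row):
    #split the description into tokens and count how often each one appears
    words = nltk.word_tokenize(row.Description)
    frequency = nltk.FreqDist(words)
    #drop stopwords (compared in lowercase) and tokens of a single character
    frequency = [(w, f) for (w, f) in frequency.items() if w.lower() not in stopwords]
    frequency = [(w, f) for (w, f) in frequency if len(w) > 1]
    #sort by count, highest first, and keep the top 5 (word, count) pairs
    frequency.sort(key=lambda tup: tup[1], reverse=True)
    most_common = frequency[:5]
    return most_common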
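
The same steps can be sketched without NLTK, which makes the logic easier to check by hand. This is only an approximation under stated assumptions: `top_words`, the regex tokenizer, and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins for `nltk.word_tokenize` and the NLTK stopword list, and they will not split text exactly the same way.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def top_words(text, n=5):
    # Rough tokenizer: runs of letters/apostrophes (not equivalent to nltk.word_tokenize)
    tokens = re.findall(r"[A-Za-z']+", text)
    counts = Counter(tokens)
    # Drop stopwords (case-insensitively) and tokens of one character
    kept = [(w, f) for w, f in counts.items()
            if w.lower() not in TOY_STOPWORDS and len(w) > 1]
    # Stable sort by frequency, highest first; ties keep first-seen order
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]

sample = "The dragon and the knight: the dragon guards a tower of gold in the north."
print(top_words(sample, n=3))
```

Like the function above, this counts case-sensitively ("The" and "the" are separate tokens) and only lowercases when checking against the stopword set.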

In [11]:
#create a list of the most common words per book by iterating through the training set and applying your function

common_list = []

for index, row in train.iterrows():
    common_list.append(common_word_getter(row))
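
An explicit loop is fine at this size. For comparison, pandas can also call a row-wise function through `DataFrame.apply` with `axis=1`. The sketch below uses a made-up frame and a hypothetical `word_count` helper in place of the training set and `common_word_getter`, so treat it as an illustration of the pattern rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself works on the Fantasy training set
toy = pd.DataFrame({"Description": ["a dragon sleeps", "magic wakes the old king"]})

def word_count(row):
    # Stand-in for a row-wise function: receives one row, returns one value
    return len(row.Description.split())

# axis=1 passes each row to the function; the result is a Series aligned on the index
toy["n_words"] = toy.apply(word_count, axis=1)
print(toy)
```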
    

In [12]:
#make train an independent copy (this avoids pandas' SettingWithCopyWarning), then add an empty column called 'most_common' and fill it with NaNs

train = train.copy()
train['most_common'] = np.nan

In [13]:
#set the value of the 'most_common' column to the list of common words you found

train['most_common'] = common_list


In [14]:
#show the first few rows of your dataframe

train.head()

Unnamed: 0,Title,Price,Description,Rating,Link,Genre,most_common
834,Heir to the Sky,£44.07,"As heir to a kingdom of floating continents, K...",Four,heir-to-the-sky_166/index.html,Fantasy,"[(Kali, 4), (earth, 4), (kingdom, 3), (edge, 3..."
527,King's Folly (The Kinsman Chronicles #1),£39.61,"The gods are angry.Volcanic eruptions, sinkhol...",Five,kings-folly-the-kinsman-chronicles-1_473/index...,Fantasy,"[(Wilek, 5), (ground, 3), (king, 3), (gods, 2)..."
877,Ash,£22.06,Cinderella retold In the wake of her father's ...,Four,ash_123/index.html,Fantasy,"[(Ash, 7), (fairy, 5), ('s, 4), (death, 3), (g..."
604,Tell the Wind and Fire,£45.51,In a city divided between opulent luxury in th...,Three,tell-the-wind-and-fire_396/index.html,Fantasy,"[(city, 5), (Light, 4), (Dark, 4), (boyfriend,..."
643,A Feast for Crows (A Song of Ice and Fire #4),£17.21,"With A Feast for Crows, Martin delivers the lo...",Four,a-feast-for-crows-a-song-of-ice-and-fire-4_357...,Fantasy,"[(survivors, 3), (Feast, 2), (Crows, 2), (Mart..."
