In [2]:
data = pd.read_csv('scraped_books.csv', index_col=0)
data.head()

Unnamed: 0,Title,Price,Description,Rating,Link,Genre
0,A Light in the Attic,£51.77,It's hard to imagine a world without A Light i...,Three,a-light-in-the-attic_1000/index.html,Poetry
1,Tipping the Velvet,£53.74,"""Erotic and absorbing...Written with starling ...",One,tipping-the-velvet_999/index.html,Historical Fiction
2,Soumission,£50.10,"Dans une France assez proche de la nôtre, un ...",One,soumission_998/index.html,Fiction
3,Sharp Objects,£47.82,"WICKED above her hipbone, GIRL across her hear...",Four,sharp-objects_997/index.html,Mystery
4,Sapiens: A Brief History of Humankind,£54.23,From a renowned historian comes a groundbreaki...,Five,sapiens-a-brief-history-of-humankind_996/index...,History


In [3]:
data.Genre.value_counts()[:10]

Genre
Default           152
Nonfiction        110
Sequential Art     75
Add a comment      67
Fiction            65
Young Adult        54
Fantasy            48
Romance            35
Mystery            32
Food and Drink     30
Name: count, dtype: int64

In [4]:
fantasy = data[data['Genre'] == 'Fantasy']
fantasy.head()

Unnamed: 0,Title,Price,Description,Rating,Link,Genre
49,Unicorn Tracks,£18.78,After a savage attack drives her from her home...,Three,unicorn-tracks_951/index.html,Fantasy
76,"Saga, Volume 6 (Saga (Collected Editions) #6)",£25.02,"After a dramatic time jump, the three-time Eis...",Three,saga-volume-6-saga-collected-editions-6_924/in...,Fantasy
81,Princess Between Worlds (Wide-Awake Princess #5),£13.34,Just as Annie and Liam are busy making plans t...,Five,princess-between-worlds-wide-awake-princess-5_...,Fantasy
91,Masks and Shadows,£56.40,"The year is 1779, and Carlo Morelli, the most ...",Two,masks-and-shadows_909/index.html,Fantasy
112,Crown of Midnight (Throne of Glass #2),£43.29,"""A line that should never be crossed is about ...",Three,crown-of-midnight-throne-of-glass-2_888/index....,Fantasy


We need to split this into a training set and a test set. We want to find common words in 70% of the Fantasy book descriptions, then see whether we can accurately predict the other 30% of the books. There is an easy way of doing this with scikit-learn's sklearn.model_selection.train_test_split() function. It's overkill, but here goes:

In [5]:
train, test = model_selection.train_test_split(fantasy, test_size=0.3, train_size=0.7, random_state=10)

In [6]:
len(train)

33

In [7]:
len(test)

15
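
As a rough sanity check on those sizes, the sketch below reruns the same call on a stand-in list of 48 integers (48 is simply 33 + 15 from the cells above; the list is hypothetical and not the book data). With fractional sizes, scikit-learn rounds the test count up and the train count down, which is consistent with the 33/15 split seen here.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 48 Fantasy rows: just the integers 0..47
toy_rows = list(range(48))

toy_train, toy_test = train_test_split(
    toy_rows, test_size=0.3, train_size=0.7, random_state=10
)

print(len(toy_train), len(toy_test))

# The two parts never overlap and together cover every row
print(set(toy_train).isdisjoint(toy_test))
print(sorted(toy_train + toy_test) == toy_rows)
```

Passing the same random_state again should reproduce the same split, which is handy when comparing runs.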

In [10]:
# Rebinds the name `stopwords` to a plain list; re-import it before re-running this cell
stopwords = stopwords.words('english')

In [11]:
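def common_word_getter(row):
    """Return the five most frequent (word, count) pairs in a row's Description."""
    words = nltk.word_tokenize(row.Description)
    frequency = nltk.FreqDist(words)
    # Drop stopwords (case-insensitively) and single-character tokens such as punctuation
    frequency = [(w,f) for (w,f) in frequency.items() if w.lower() not in stopwords]
    frequency = [(w,f) for (w,f) in frequency if len(w) > 1]
    # Sort by count, highest first, and keep the top five
    frequency.sort(key=lambda tup: tup[1], reverse=True)
    most_common = frequency[:5]
    return most_common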

Create a list of the most common words per book by iterating over the training set and applying the function to each row


In [12]:
common_list = []

for index, row in train.iterrows():
    common_list.extend([i[0] for i in common_word_getter(row)])

In [13]:
common_list[:10]

['since',
 'months',
 'London',
 'four',
 'stone',
 'Peter',
 "'s",
 'Probationary',
 'Constable',
 'Grant']
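
The same per-book idea can be sketched without NLTK, which makes each step easy to inspect. Everything here is an assumption made for illustration: the tiny stopword set, the regex tokenizer (a crude stand-in for nltk.word_tokenize), the helper name top_words, and the two made-up descriptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer
TOY_STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "her"}

def top_words(text, n=5):
    """Return the n most frequent non-stopword tokens longer than one character."""
    tokens = re.findall(r"[A-Za-z']+", text)  # crude stand-in for nltk.word_tokenize
    counts = Counter(
        t for t in tokens if t.lower() not in TOY_STOPWORDS and len(t) > 1
    )
    return counts.most_common(n)

# Made-up descriptions, not taken from the scraped data
descriptions = [
    "The dragon guards the gold and the dragon sleeps in the cave",
    "A witch and a dragon fight over the crown of the kingdom",
]

# Collect the top words of every description into one flat list
common = []
for text in descriptions:
    common.extend(word for word, _ in top_words(text, n=3))

# Tally how often each word made it into a per-book top list
overall = Counter(common)
print(overall.most_common(1))
```

Counting the flat list afterwards shows which words recur across books rather than within a single description.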

Up until this point, we've tokenized bits of text manually, and manually sorted, removed stopwords, etc. The hope here is to build some intuition around the steps required to work with text. Now, we're going to introduce some industry-standard tools that do many of these steps together. We'll still go step by step, but these tools will let us abstract away some of the 'manual-ness' we've experienced thus far.

Instantiate an instance of the CountVectorizer() class

In [14]:
vect = CountVectorizer()
vect

This class is really cool. We'll run it on our Description column, and use it to vectorize each piece of text. The vectorizing here is extremely simple, and is the most basic way of making the 'words to numbers' jump we discussed above.

Essentially, every distinct word in the text you're analyzing gets its own column, and each document becomes a row of counts: a word's entry starts at 1 the first time it appears in a document, and you add 1 for each additional instance of the same word. We'll go step by step to show what's happening.

In [15]:
vect = CountVectorizer(stop_words=stopwords, max_features=10)
vect

Use vect.fit_transform() on the Description column in the training set to find the vectors of the training data

In [16]:
train_vectors = vect.fit_transform(train.Description)
train_vectors

<33x10 sparse matrix of type '<class 'numpy.int64'>'
	with 107 stored elements in Compressed Sparse Row format>

Get the feature names that sklearn found

In [17]:
vect.get_feature_names_out()

array(['book', 'find', 'four', 'life', 'new', 'one', 'power', 'series',
       'seven', 'world'], dtype=object)

The resulting object is what's called a 'sparse matrix'. This is a SciPy (scipy.sparse) data type for storing large arrays that are mostly zeros. The dimensions of the matrix are:

rows = # of samples vectorized
columns = # of features
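
To make the sparse format less abstract, the sketch below builds a small Compressed Sparse Row matrix by hand. The values are made up; the point is only that the shape reads as (rows, columns) and that just the non-zero entries are stored.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up 3x4 array that is mostly zeros
dense = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
])

sparse = csr_matrix(dense)

print(sparse.shape)      # (rows, columns)
print(sparse.nnz)        # number of stored (non-zero) elements
print(sparse.toarray())  # back to a regular dense array for inspection
```

Calling toarray() is fine for small matrices like this one, but it defeats the purpose on a large vocabulary.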
Now - here's the really mindblowing part :)

Now that we've fitted this 'vect' object, it will 'remember' its vocabulary the next time we call it. This works because the object keeps state: fit_transform() stores the learned vocabulary on the instance itself (in its vocabulary_ attribute). This Medium post does a reasonable job of explaining the object-oriented concepts involved (as they apply to sklearn specifically).

For our purposes, it means that once we've fitted the vectorizer on the training set, we can call transform() on the test set and it will count only the common words it learned from the training set. This can be a bit confusing, so definitely do some reading on this before moving on.
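
A small sketch may help here. It fits a vectorizer on made-up 'training' text and then transforms made-up 'test' text; the documents and the name toy_vect are illustrative assumptions. Words the vectorizer never saw during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for the train and test descriptions
train_docs = ["dragon magic sword", "magic kingdom"]
test_docs = ["sword and spaceship magic magic"]

toy_vect = CountVectorizer()
toy_vect.fit(train_docs)

# The learned vocabulary lives on the fitted object
print(sorted(toy_vect.vocabulary_))

# transform() reuses that vocabulary; 'and' and 'spaceship' were never learned
test_counts = toy_vect.transform(test_docs)
print(test_counts.toarray())
```

The test row has exactly as many columns as the training vocabulary, which is what lets the two matrices be compared later.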

In [18]:
test_vectors = vect.transform(test.Description)
test_vectors

<15x10 sparse matrix of type '<class 'numpy.int64'>'
	with 39 stored elements in Compressed Sparse Row format>

So now we have two matrices:
The train matrix has the counts of the ten most common words in the training corpus, one row per training description
The test matrix has the counts of each of those same ten words in the test set
Now, if our goal is to use these training vectors to predict the genre of our test set, we need some way of comparing these vectors against each other. There are many ways to do this, so we'll start with cosine similarity. If you're really interested in understanding what's going on under the hood, read up on it here.
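
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A minimal pure-Python sketch, using made-up vectors and a hand-written helper (not the sklearn function):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up count vectors
a = [1, 0, 2]
b = [2, 0, 4]  # same direction as a, twice as long
c = [0, 3, 0]  # shares no words with a

print(cosine_similarity(a, b))      # very close to 1: same word proportions
print(cosine_similarity(a, c))      # 0: nothing in common
print(1 - cosine_similarity(a, c))  # the corresponding cosine distance
```

Because only direction matters, a long description and a short one with the same word proportions score as near-identical.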

First, we need to find some measure of 'averageness' across our training vectors. This will be a single vector that represents the average presence of each term across the training corpus.

Second, we need to score each of the test vectors against this 'average' vector.

Third, we should look at those scores and see if there is any discernible pattern in them.

Finally, the real test of 'predictiveness' will be to shuffle in some other genres with our test set, score them all against the average vector, and see if we can accurately distinguish the fantasy books from the rest of the data.

Find the average vector in the training set

In [19]:
average_vector = train_vectors.mean(axis=0)
average_vector

matrix([[0.75757576, 0.57575758, 0.54545455, 0.6969697 , 1.12121212,
         0.96969697, 0.57575758, 0.90909091, 0.60606061, 0.93939394]])

Score the test vectors using their cosine distance (1 minus the cosine similarity) from the average vector

In [21]:
from sklearn.metrics.pairwise import cosine_distances

In [23]:
scores = cosine_distances(X=average_vector, Y=test_vectors)

TypeError: np.matrix is not supported. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
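
The error message points at one way forward: taking the mean of a sparse matrix gave back an np.matrix (as the output of average_vector above shows), and cosine_distances wants a regular array. A minimal sketch of the np.asarray conversion the message suggests, on made-up toy matrices rather than the book data (toy_train, toy_test and avg_array are hypothetical names):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_distances

# Made-up count matrices standing in for train_vectors and test_vectors
toy_train = csr_matrix(np.array([[1, 0, 2], [3, 0, 0]]))
toy_test = csr_matrix(np.array([[2, 0, 1], [0, 5, 0]]))

# Column-wise mean of the training counts, converted to a plain ndarray
avg_array = np.asarray(toy_train.mean(axis=0))

# One row of scores: the distance from the average vector to each test row
scores = cosine_distances(X=avg_array, Y=toy_test)
print(scores)
```

In this toy case the first test row points in the same direction as the average (distance near 0) and the second shares no terms with it (distance 1).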