# Finding similar documents

This notebook goes through some different methods for finding/grouping similar documents. The dataset used is ~7000 rows of data about fantasy books, scraped from GoodReads.

## Imports

Lots of imports are almost always required for working with text, because it's so far away from the data that machines are comfortable with; lots of processing is required.

In [1]:
import pandas as pd  # Manipulate data
import seaborn as sns  # Visualise
import matplotlib.pyplot as plt  # Visualise in awkward ways
from nltk import download  # Get wordlists
from nltk.tokenize import word_tokenize  # Split text to words
from nltk.corpus import stopwords  # Boring words
from warnings import filterwarnings  # Hide pink boxes
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text to TF-IDF representations
from sklearn.metrics.pairwise import cosine_similarity  # Check similarities between vectors
import re  # Match patterns in text
from sklearn.cluster import DBSCAN  # Cluster
from gensim.corpora import Dictionary  # Bag-of-words model
from gensim.models.ldamodel import LdaModel  # Topic modelling
import pyLDAvis  # Visualise topics
import pyLDAvis.gensim # Visualise topics

## Set-up

Notebook-level things - mostly stylistic

In [2]:
# Hide pink warnings

filterwarnings("ignore")

# Show full text (useful for anything with text columns)

pd.set_option("display.max_colwidth", 0)

# Standardise plots

sns.set()

# Download word lists

download(["punkt", "wordnet", "stopwords", "vader_lexicon"])

  and should_run_async(code)
[nltk_data] Downloading package punkt to /Users/dank/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/dank/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/dank/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/dank/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Sourcing

In [3]:
# Load from a file

books = pd.read_csv("fantasy_books.csv")

In [4]:
# Check it

books.sample()

Unnamed: 0,author,title,description
5878,Patrick Carman,3 Below,"Charlie had his chocolate factory. Stanley Yelnats had his holes. Leo has the wacky, amazing Whippet Hotel.Now that Leo has uncovered a few secrets behind the wacky Whippet Hotel, he'll have to save it!Leo has explored the zany, wonderful Whippet Hotel from basement to top floor, with trains, flying goats, and mazes (among other things) in between. But even Leo doesn't know every secret of the Whippet - and when he discovers that there's more beneath the hotel than he'd thought, it doesn't take long for more adventures to unfold!"


## Basic exploration

In [5]:
# How many books?

books.shape[0]

7727

In [6]:
# How many authors?

books["author"].nunique()

3026

## Processing

In order to do various things with the text, we need to process it first, removing the beautiful chaos of language until we get something that a machine can interpret.

For our purposes, we're mostly interested in key words - words within each separate document that carry significant meaning. In the (short) document "From hell's heart I stab at thee", the word "stab" is vital to the meaning, but the word "at" is less important. In order to extract just those words, we need to simplify our text.

In [7]:
# Create a text column based on description

books["text"] = books["description"]

In [8]:
# Drop it to lowercase

books["text"] = books["text"].str.lower()

The next bit of cleaning involves **regular expressions**; these let you match patterns in text. You don't have to learn how to use them, but if you want to do anything with text, then it's a really good idea. Lots of online resources exist to let you [get](https://learn-regex.com/) [some](https://regexcrossword.com/) [practice](https://www.therobinlord.com/projects/slash-escape).

In [9]:
# Remove digits & duplicate spaces

books.loc[:, "text"].replace(r"\d+", " ", regex=True, inplace=True)
books.loc[:, "text"].replace(r"\s+", " ", regex=True, inplace=True)

### Tokenisation

**Tokenisation** is the process of splitting documents (big strings) into smaller units. You can do it at various levels - character-, word-, sentence-, etc. Essentially, it splits the text into the units that you care about, and enables particular kinds of processing.

Using `nltk`'s word-level tokeniser,

```
You could not see a cloud, because
No cloud was in the sky:
No birds were flying overhead —
There were no birds to fly.
```

becomes

```python
['You', 'could', 'not', 'see', 'a', 'cloud', ',', 'because', 'No', 'cloud', 'was', 'in', 'the', 'sky', ':', 'No', 'birds', 'were', 'flying', 'overhead', '—', 'There', 'were', 'no', 'birds', 'to', 'fly', '.']
```

Splitting text at word level allows us to deal with text at word-level, removing small functional words and words that - in our domain - are not exciting.

In [10]:
# Tokenize the column

books["text"] = books["text"].apply(word_tokenize)

### Stopword removal

In any English text, many of the words are small functional ones, like 'of' or 'and', that do not carry meaning on their own. Instead, they tell you how the other words relate to each other.

These words present a problem for many forms of text analysis, because they get in the way, and they're hard to deal with without very complex processing. One time-honoured way to handle them is just removal.

These words are called 'stop words', and you can find all sorts of lists of them to help with the removal process.

In [11]:
# Get a list of stopwords

stops = stopwords.words("english")

# Add any words that are also boring (use domain knowledge, etc.)

stops.extend(["n't", "'ve", "...", "book", "novel", "series", "author",
              "fiction", "story", "edition", "stories", "readers", "authors"])

In [12]:
# Use the stopwords list to filter out stop words and short words

books["text"] = books["text"].apply(lambda x: [word for word in x
                                               if word not in stops
                                               and len(word) > 2])

At this point, we could carry out further processing on the words. Two potential candidates are **stemming** and **lemmatisation**. Both of these techniques reduce the number of unique words in the dataset by collapsing different forms of the same word together ("is" and "was", for example, are both forms of the word "be").

Stemming is a relatively crude, rule-based method, whilst lemmatisation is a computationally-expensive dictionary-based one. Lemmatisation is harder to do properly, but tends to work better. Both techniques are outside today's scope, but here's a [lemmatisation notebook](https://github.com/Peritract/text-analysis/blob/master/Lemmatisation.ipynb) if you want to read further.

## Similarity scores

### Vectorisation

Now that we've extracted lists of the keywords in each document, the next step is to turn those keywords into a numeric representation of each document. This will let us take the keywords - the bit of the document that matters to us - and express them in a way that machine learning algorithms can work with. The process of converting a document into its numeric representation is referred to as **vectorisation**.

Various methods of vectorising text exist; two of them are shown in the table below, applied to the string `"round the ragged rocks the ragged rascal ran"`.

| Vectorizer | round | the | ragged | rocks | rascal | ran |
| --- | --- | --- | --- | --- | --- | --- |
| Boolean | True | True | True | True | True | True |
| Count | 1 | 2 | 2 | 1 | 1 | 1 |

Both of the above methods are relatively crude, so we're not actuall going to use them. We're going to make (dramatically cooler) **TF-IDF** vectors.

"Term frequency - inverse document frequency" (TF-IDF) is a way of measuring word **importance**, rather than occurrence or frequency. [Here's more information](http://tfidf.com/), but essentially it measures how common a word is in a single document, relative to how common it is in all documents.

A word that is much more common in a specific document than it is generally is an important word for that document. A word that appears roughly as often in a specific document as in all documents is unimportant to that document.

In time-honoured data science tradition, we're not actually going to calculate it; we can just use `sklearn`.

In [13]:
# Convert the text column back into strings

keywords_text = books["text"].apply(lambda x : " ".join(x))

In [14]:
# Make a vectorizer
# min_df=0.005 - only include words that occur in at least 0.5% of docs
# max_features=10000 - no matter how many words there are, only take the 10000 most frequent

vectorizer = TfidfVectorizer(min_df=0.005, max_features=10000)

In [15]:
# Create the vectors

dtm = vectorizer.fit_transform(keywords_text)

Vectorising transforms a bunch of documents into a **document-term matrix**, in which every row is a document and every column is a unique word (or term, hence the name).

In [16]:
# Glance at it, briefly

dtm

<7727x2373 sparse matrix of type '<class 'numpy.float64'>'
	with 336349 stored elements in Compressed Sparse Row format>

### Similarity scores

As vectors represent points in imaginary space, we can use the idea of those points to find similar documents. Any document that - when vectorised - is close to another vectorised document should contain similar key words, and thus have similar content.

We have out vectors stored in `dtm`, so now we need a way to turn a new document into the *exact same type of vector*.

In [17]:
def vectorize(text):
    """Converts a string into a TF-IDF vector"""
    
    # Case-fold and remove digits/spaces
    text = re.sub("\s+", " ", re.sub("\d+", " ", text.lower()))
    
    # Tokenize, remove short/stopwords and rejoin
    text = " ".join([x for x in word_tokenize(text)
                     if x not in stops
                     and len(x) > 2])
    
    # vectorize and return
    return vectorizer.transform([text])

Now that we have the vectorising function, we can vectorise a new book.

In [18]:
# Book description (not in sample)
# Where Dreams Descend - Janelle Angeles

test_book = """
In a city covered in ice and ruin, a group of magicians face off in a daring game of magical
feats to find the next headliner of the Conquering Circus, only to find themselves under the
threat of an unseen danger striking behind the scenes.

As each act becomes more and more risky and the number of missing magicians piles up, three
are forced to reckon with their secrets before the darkness comes for them next.

The Star: Kallia, a powerful showgirl out to prove she’s the best no matter the cost
The Master: Jack, the enigmatic keeper of the club, and more than one lie told
The Magician: Demarco, the brooding judge with a dark past he can no longer hide

Where Dreams Descend is the startling and romantic first book in Janella Angeles’ debut
Kingdom of Cards fantasy duology where magic is both celebrated and feared, and no heart
is left unscathed.
"""

# Get the vector

book_vector = vectorize(test_book)

In [19]:
# Show the vector

book_vector

<1x2373 sparse matrix of type '<class 'numpy.float64'>'
	with 56 stored elements in Compressed Sparse Row format>

### Cosine similarity

To compare vectors together, we can compute the **cosine similarity**. There's a lot of maths behind it all, with words like 'Euclidean' and 'orthogonal' involved. Cosine similarity takes multiple points in hyper-dimensional space and works out if they're close together, ignoring
the length of the original documents.

You can [read more about it all](https://en.wikipedia.org/wiki/Cosine_similarity), but essentially the cosine similarity of two documents is a score between 0 and 1 that tells you how similar they are.

In [20]:
# Compute the similarity vectors
# Compare every document to the new one

similarities = cosine_similarity(book_vector, dtm).flatten()

In [21]:
# Check the scores - should be a whole bunch of small numbers

similarities[:10]

array([0.01971991, 0.08980112, 0.02788277, 0.00833649, 0.        ,
       0.        , 0.00243461, 0.01799363, 0.02243726, 0.01798619])

### Most similar documents

With the similarity scores calculated, we just have to match the scores for each book back to the original title, and then show the titles with the highest similarity scores.

In [22]:
# Grab a copy of books

similarity_df = books[["title", "author"]]

In [23]:
# Add the similarities as a column

similarity_df["similarity"] = similarities

In [24]:
# Sort by similarity and return the top ten

similarity_df.sort_values("similarity", ascending=False).head(10)

Unnamed: 0,title,author,similarity
3813,The Magicians Trilogy Boxed Set,Lev Grossman,0.277792
7723,The Magicians and the Magician King,Lev Grossman,0.231837
1382,The Prestige,Christopher Priest,0.203499
1058,The Ambassador's Mission,Trudi Canavan,0.203308
1068,The Magician King,Lev Grossman,0.186267
209,The Amulet of Samarkand,Jonathan Stroud,0.180541
4772,Interstellar Pig,William Sleator,0.178967
317,The Magicians' Guild,Trudi Canavan,0.176819
5660,The Last Unicorn #4,Peter B. Gillis,0.17311
5569,The Demon's Surrender,Sarah Rees Brennan,0.169443


Using similarity scores, we can find the most similar documents to any document; however, this process works for one document at a time - if we want to group documents together without manual effort, we need to use other methods.

## Clustering

Clustering lets us find groups in data when we don't know what we're looking for. Most clustering algorithms needs numeric input to work, but luckily - with vectors - that's what we've got.

Clustering is also quite difficult, due to the "curse of dimensionality" - each separate word is a column, which means that, if there are fifty unique words in your dataset, then you have fifty columns. Fifty unique words is very low, so in practice, you often have thousands of columns.

In our case, the data is also quite diffuse, so clustering doesn't find clear groups easily.

In [25]:
# Using DBSCAN

clusters = pd.Series(DBSCAN(eps=0.95).fit_predict(dtm))

In [26]:
# See how many in each cluster

clusters.value_counts()

-1    7653
 3    13  
 2    9   
 8    9   
 6    8   
 1    8   
 5    6   
 0    6   
 7    5   
 9    5   
 4    5   
dtype: int64

-1 is what DBSCAN uses for outlier values that cannot be easily clustered. With this dataset, and without a lot of tweaking, clustering analysis seems to produce very little of value.

We can look at cluster 3 though (13 documents) and see if the clusters that have been found are meaningful.

In [27]:
# Get a copy of books

cluster_df = books[["title", "author"]]

# Add the clusters onto the df

cluster_df["cluster"] = clusters

# Display the books in cluster 3

cluster_df[cluster_df["cluster"] == 3]

Unnamed: 0,title,author,cluster
45,Harry Potter and the Chamber of Secrets,J.K. Rowling,3
348,The Harry Potter Collection 1-4,J.K. Rowling,3
366,Harry Potter and the Cursed Child - Parts One and Two,John Tiffany,3
619,Harry Potter and the Methods of Rationality,Eliezer Yudkowsky,3
967,"Harry Potter and the Cursed Child, Parts 1 & 2",John Tiffany,3
1377,The Harry Potter trilogy,J.K. Rowling,3
1555,Harry Potter: Film Wizardry,Brian Sibley,3
1796,"Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5)",J.K. Rowling,3
2231,Harry Potter Boxset,J.K. Rowling,3
3534,Harry Potter and the Cursed Child - Parts I & II,John Tiffany,3


This has actually found some reasonable clusters - 3 is *Harry Potter*, 2 is *Death Note*, 4 is *Sandman*, and so on. However, this isn't particularly useful; it stands to reason that books in a series will have similar descriptions, but clustering that only shows you the obvious stuff is a waste of effort.

We need more complex, text-specific techniques.

## Topic analysis

Topic analysis is both conceptually and computationally difficult. Essentially, it allows you to identify "topics" from a set of documents, where "topic" means "collection of words that frequently appear together". Like clustering, topic analysis is a form of **unsupervised** learning, and so you need to evaluate your topics to work out what they mean. We'll look at [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (**LDA**) topic modelling: 

To give a brief (extremely crude) example, suppose that you have the following documents:

| Document | Text |
| --- | --- |
| 0 | frogs are amphibious |
| 1 | they use amphibious vehicles for transport over water |
| 2 | ponds are home to many species including frogs and newts |
| 3 | aeroplanes are not amphibious vehicles |
| 4 | amphibious vehicles can travel over land and water|
| 5 | frogs live in ponds |
| 6 | cars are less expensive than aeroplanes |
| 7 | vampires are fictional creatures |
| 8 | cars and aeroplanes are used for transport |
| 9 | cars are the most common type of vehicle |

As humans, we can look at the above sentences and easily identify the main ideas. Machines do not have our conceptual understanding however, and so must rely on cruder methods: finding groups of words that seem to usefully split up the documents.

Topic analysis might identify the following groups:

| Topic | Top words |
| --- | --- |
| A | frogs, ponds |
| B | cars, aeroplanes, vehicles, transport|
| C | amphibious, water, over, vehicles |

Topic analysis is essentially saying that, if you sort documents into categories based on the appearance of certain words, these are the sets of words that result in the neatest categories. We can then identify, for example, that documents 0, 2, and 5 are mostly concerned with topic A, and that document 7 is an outlier that doesn't have much in common with any other documents.

The above is an oversimplification of the theory, but it works for a practical understanding; if you're interested in more specifics, [this article has a good overview](https://towardsdatascience.com/a-friendly-introduction-to-text-clustering-fa996bcefd04). This notebook, however, is going to focus on writing the code to run topic analysis.

In [28]:
# Get a copy of the text column again - lists of keywords

lda_docs = books["text"]

In [29]:
# Check a single document

print(lda_docs[4])

['explain', 'afraid', 'sir', 'said', 'alice', 'see', 'alice', 'sees', 'white', 'rabbit', 'take', 'watch', 'waistcoat', 'pocket', 'decides', 'follow', 'sequence', 'unusual', 'events', 'set', 'motion', 'mini', 'contains', 'entire', 'topsy-turvy', 'alice', 'adventures', 'wonderland', 'looking-glass', 'accompanied', 'practical', 'notes', 'martina', 'pelouso', 'memorable', 'full-colour', 'illustrations']


### Bag-of-words model

Just as when we vectorised, we need to represent each document as a set of numbers, not words directly. What we're doing here is **count vectorisation**, as described earlier on, but it's often referred to as a "bag-of-words" model, because we're losing all the information about how each word in a document relates to the others, and just jumbling them all together.

Every unique word in the documents will be given an ID, and then each document will be transformed into a vector of numbers, with each number representing the presence or absence of a particular ID (and therefore word).

This is the same process as we went through with our TF-IDF vectors, but using different libraries; this makes the code a little more complex, but being familiar with two different approaches will help you read & understand text processing in other situations as you come across it.

In [30]:
# Create a dictionary mapping IDs to words
# Equivalent to vectoriser.fit(lda_docs)

lda_dict = Dictionary(lda_docs)

Once we have the IDs stored in a dictionary, we can use it to convert each document into the bag-of-words form; because we have the dictionary now, we can later convert other new documents into exactly the same bag-of-words form. This collection of vectorised documents is called a **corpus**.

In [31]:
# Convert each document into a bag-of-words representation using the dictionary
# The equivalent of vectoriser.transform(lda_docs)

lda_corpus = [lda_dict.doc2bow(doc) for doc in lda_docs]

In [32]:
# View a bag-of-words representation

print(lda_corpus[4])

[(66, 1), (75, 1), (154, 1), (306, 1), (307, 1), (308, 3), (309, 1), (310, 1), (311, 1), (312, 1), (313, 1), (314, 1), (315, 1), (316, 1), (317, 1), (318, 1), (319, 1), (320, 1), (321, 1), (322, 1), (323, 1), (324, 1), (325, 1), (326, 1), (327, 1), (328, 1), (329, 1), (330, 1), (331, 1), (332, 1), (333, 1), (334, 1), (335, 1), (336, 1), (337, 1)]


Each document is now stored as a list of tuples, in the form `(word_id, word_frequency)`. You can map IDs back onto their words using the dictionary.

In [33]:
# What is word 339?

lda_dict[339]

'appeared'

### LDA modelling

Now that we have our data in the correct form, we can actually do some topic modelling. The model is a relatively complex one, and it has lots of tweakable parameters; for now, we're going to go with a basic implementation, but I encourage you to explore the options at leisure, looking for the best parameters for your dataset.

The best parameters, of course, are those that result in the most meaningful topics. Unfortunately, as this is unsupervised modelling, there's no magic bullet metric - a lot of this is based on reviewing the results and making your own judgements.

In [34]:
# Choose the number of topics you want to look for - similar to k in K-Means
# The number was an arbitrary choice here, not validated against anything precisely
# I just tweaked it a bit until the topics looked vaguely interesting

num_topics = 10

In [35]:
# Run the model; it takes a while

lda_model = LdaModel(corpus=lda_corpus,
                     id2word=lda_dict,
                     num_topics=num_topics, 
                     random_state=451)

Once the topics have been generated, we can view them.

Each topic has the following form:

(`topic_num`, `importance_of_word_A` * `word_A` + `importance_of_word_B` * `word_B` ...)

Essentially, for each topic, you get a list of the top words in the topic, plus a number (between 0 and 1) representing how central that word is to that topic.

In [36]:
# Check a topic

lda_model.print_topics()[6]

(6,
 '0.011*"world" + 0.007*"one" + 0.006*"new" + 0.006*"time" + 0.005*"life" + 0.004*"human" + 0.004*"power" + 0.004*"vampire" + 0.004*"war" + 0.004*"must"')

In [37]:
# Check a different topic

lda_model.print_topics()[7]

(7,
 '0.006*"time" + 0.006*"world" + 0.005*"life" + 0.005*"one" + 0.005*"new" + 0.004*"must" + 0.004*"love" + 0.003*"find" + 0.003*"three" + 0.003*"back"')

At a glance, we can see that topic 6 is probably dark fantasy - it's got vampires in. Topic 7 seems lighter on the fantastical elements, but the presence of "love" suggests that maybe this is the sub-genre of fantasy romance.

The word "world" appears in both topics; this is because - if you look at word frequency for this dataset - "world" is one of the most frequent words overall. A lot of fantasy books are focused on saving it.

### Visualising topics

We can go a step further in exploring the topics, and plot them in a visual way, using the [pyLDAvis](https://github.com/bmabey/pyLDAvis) library. This is also computationally expensive, but the end-result is worth it: an interactive plot that lets you not only explore the key words in a topic, but also narrow in on the ideal number of topics.

In [38]:
# Prepare the data for display

lda_vis = pyLDAvis.gensim.prepare(lda_model, lda_corpus, lda_dict, sort_topics=True)

In [39]:
# Create the visualisation

pyLDAvis.display(lda_vis)

The above visualisation is a lot to take in. Broadly though, it breaks down into the following bits:

1. **The topic map**

    This lets you see how the topics are distributed in your data. The axes have been generated by "multidimensional scaling", which means that absolute positioning isn't especially meaningful, but the relative size and location of topics is important. Topics that are far away from each other have little overlap - documents rarely contain both. Topics close together or overlapping are quite similar, shared by many documents. The size of a topic blob shows how important that topic is across all documents.

    The more separate and less-overlapped your blobs, the more distinct your topics; many over-lapping topics suggests that you have asked for more topics than are actually present in the data.


2. **The word list**

    The word list shows the thirty most "relevant" (more on that later) words; if you click on a blob on the map, it shows you the words for that topic, otherwise it shows you them for the whole dataset. When you click on a blob, you see red bar in addition to the blue - the blue bar is (always) frequency over the whole dataset for whatever that word is, while red is frequency for that word in that topic.


3. **The controls**

    The left-most controls let you control which topic you are looking at, duplicating just using the mouse. The right-most controls are more complex & interesting: they let you determine how **relevance** is counted. By changing the slider, you change what criteria are used for choosing the most relevant words. With the slider at 1, relevance is "most frequent words in the topic", and with the slider at 0, relevance is based on **salience**: words that are unusually frequent in that topic compared to their frequency overall. In-between values calculate relevance as a blend of the two.

    It's a good idea to play around with this metric to see what gives you the most interesting results with your data. In this dataset, for example, neither 0 or 1 is particularly useful: 0 gives you character names, and 1 gives you roughly the same words for each topic. A value of 0.4 seems to be useful to extrat key ideas, but only from the largest three topics.

### Mapping documents onto topics

The final step is to look at the documents that feature each topic.

In [40]:
# Check the output for a single doc (Alice in Wonderland)

lda_model[lda_corpus[4]]

[(2, 0.6572539), (5, 0.3210481)]

This document contains two topics, but mostly (65%) is focused on topic two. Based on the visualisation above, we can see that this makes sense - topic 2 (with the slider at 0.4) lists "girl" and "alice" as relevant terms; this topic may (potentially, and further exploration is required) be focused on children and portal fantasy.

On a larger scale, we can write a function that gets the most dominant topic for each separate book. This will allow us to look at topics as a whole.

In [41]:
def get_dominant_topic(bag_of_words):
    return lda_model[bag_of_words][0][0]

In [42]:
# Get the dominant topic for each book

topics = [get_dominant_topic(x) for x in lda_corpus]

In [43]:
# Get a copy of the books df

book_topics = books[["author", "title"]]

# Add the topics on

book_topics["topic"] = topics

In [44]:
# Count up how many books for each topic

book_topics.groupby("topic").count()["title"]

topic
0    1238
1    3748
2    1383
3    405 
4    183 
5    319 
6    242 
7    104 
8    33  
9    72  
Name: title, dtype: int64

This seems a bit more even than our earlier clustering, with larger groups being found. We can then investigate the titles in a single topic to see if they are at all similar.

In [45]:
# See all the books for a given topic

book_topics[book_topics["topic"] == 2].head(5)

Unnamed: 0,author,title,topic
4,Lewis Carroll,Alice's Adventures in Wonderland & Through the Looking-Glass,2
5,Oscar Wilde,The Picture of Dorian Gray,2
8,Orson Scott Card,Ender's Game,2
15,Rick Riordan,The Lightning Thief,2
22,Roald Dahl,Matilda,2


Topic 2 isn't as clear as I would like, but you can pick up on some potentially common elements; 4 of the 5 books list have child protagonists, for example, and four of them feature mirrors in relatively important ways.

Again, the topic modelling we've done here is a very basic implementation; there are lots of ways to tweak it and improve/measure the performance. [This article](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) has more details about fine-tuning.

## Conclusions

This notebook has covered the basic code to group documents in three separate ways; all of these ways can be tweaked and improved and extended to identify more coherent & cohesive groups, but hopefully the underlying code & concepts are clear.

It's worth mentioning that the dataset used here - while a fun one to work with - is probably a little worse for finding groups than most other datasets would be; fantasy books tend - as is the nature of the genre - to be quite diffuse yet centered around some of the same concepts. This means that almost every description mentions saving the world, but only a few out of thousands mention golems or kelpies or whatever the specific danger is in that specific book. That makes it hard to find nice neat clusters.

With more focused datasets - customer voice or reviews, for example - I would expect neater groupings without excessive tweaking.

Text is always hard to work with, and always somewhat woolly - even with the perfect dataset, you'd need to do some level of manual interpretation. But these methods have a pretty good success rate for their level of complexity, and you can generally get something worthwhile out of them.