# Book Genre Clustering

In this notebook, we will walk through some **text clustering** with open-source Python libraries. Let's think about the following scenario: Mr. Cricket, the owner of the best children's bookshop in Walldorf, would like to bring some order to his book inventory. He would like to classify the books into different categories based on topic similarity. This would allow him to improve customer experience both in his physical store and in his web store, by grouping similar items onto homogeneous shelves. Pretty nice idea, but how do we do that?! 🤔

Mr. Cricket asks his trusted SAP partner for help. Their consultants, after taking a look at the bookshop's inventory, come up with a plan. Each book in the inventory comes with a very concise description field. They are going to run some text analysis on this field and group books with similar content using an unsupervised clustering strategy. Their project will then be based on the following steps:

* **1- Text Preprocessing**
* **2- Word Embedding**
* **3- Text Clustering**

![image](https://github.com/SAP-samples/btp-data-to-value-workshop/raw/main/resources/text-clustering.png)

Let's put this into practice!

First, we will make sure the required libraries are installed. We will use a set of very common Python libraries: pandas, numpy, matplotlib and seaborn for dataframe handling and visualization, regex and nltk for text cleaning and preprocessing, gensim for the word embedding and sklearn for the clustering. hana_ml will be used in this notebook only to access the book inventory data, which is stored in Mr. Cricket's HANA Cloud database.

In [None]:
# basic Python
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn

# text preprocessing
!pip install regex
!pip install nltk

# word embedding
!pip install gensim

# clustering
!pip install scikit-learn  # the PyPI package is scikit-learn; it is imported as sklearn

# connection to data source in Hana Cloud
!pip install hana_ml

Then, we use the import statement to import the packages and define aliases, like this:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Then, we want to load our book dataset. We use *hana_ml* to create a connection to the HANA Cloud database, and load the book inventory table into a HANA dataframe:

In [None]:
import hana_ml.dataframe as dataframe
from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
conn = NotebookConnectionContext(connectionId = 'HANACLOUD_D2V' )
df_hana = (conn.table('SAP_CAPIRE_BOOKSHOP_BOOKS', schema='DATA2VALUE'))

With the **collect** command, we can copy the data from HANA Cloud to the Jupyter client, in the form of a pandas dataframe. We are now ready to massage our data using all sorts of Python tricks!

In [None]:
books= df_hana.collect()
del df_hana
books

We will be focusing our analysis on the description field in particular. 

In [None]:
books[['ID','TITLE','DESCR']]

## 1 - Text Preprocessing

In order to use textual data for predictive modeling, the text must be parsed and transformed into a list of words (called “tokenization”). In this process, special characters and punctuation have to be removed. We should also get rid of the so-called “stop words”, that is to say commonly used words without any specific connotation (such as “the”, “a”, “an”, “in”). These are not predictive, and you would not want the model to consider them. We will use the Natural Language Toolkit (NLTK) and Regular Expressions (RegEx) to clean up and **tokenize** our text.

If you fancy digging deeper into these techniques, here are a few references:
* https://tutorialspoint.dev/language/python/removing-stop-words-nltk-python 
* https://en.wikipedia.org/wiki/Regular_expression
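
Before applying these steps to the whole inventory, here is a minimal sketch of the cleaning idea on a single made-up sentence. The tiny stop-word list and the `clean_text` helper are assumptions for illustration only (NLTK's English list is much longer), and the regular expression here removes every non-word character rather than a fixed set of punctuation marks.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "a", "an", "in", "of", "and", "is", "to", "for"}

def clean_text(text, stop_words=TOY_STOPWORDS):
    """Lowercase, strip punctuation, drop stop words and one-letter tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()                 # split() also collapses repeated spaces
    return [t for t in tokens if t not in stop_words and len(t) > 1]

print(clean_text("The little dragon, lost in a forest, is looking for a friend!"))
# ['little', 'dragon', 'lost', 'forest', 'looking', 'friend']
```

Only the words that carry some meaning about the story survive, which is what we want to feed into the embedding step.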


Let's first drop the books whose description is missing:

In [None]:
books=books.dropna(axis=0,subset=['DESCR'])

Prepare the book description as follows:

In [None]:
import nltk
import regex as re
from nltk.corpus import stopwords

nltk.download("stopwords")
# Use a separate name so the nltk `stopwords` module is not overwritten
# (otherwise re-running this cell fails)
stop_words = set(stopwords.words("english"))

# Transform to lower case
books['tokens'] = books['DESCR'].apply(lambda x: x.lower())

# Remove punctuation (a raw string avoids invalid-escape warnings)
books['tokens'] = books['tokens'].map(lambda x: re.sub(r"[-,.!?;'()]", ' ', x))

# Remove stopwords
books['tokens'] = books['tokens'].apply(lambda x: ' '.join([t for t in x.split() if t not in stop_words]))

# Remove short tokens
books['tokens'] = books['tokens'].apply(lambda x: ' '.join([t for t in x.split() if len(t) > 1]))

# Remove extra spaces
books['tokens'] = books['tokens'].map(lambda x: re.sub(' +', ' ', x))

# Remove duplicate tokens (dict.fromkeys keeps the first occurrence and the original order)
books['tokens'] = books['tokens'].apply(lambda x: ' '.join(dict.fromkeys(x.split())))


Drop books whose cleaned token strings are identical:

In [None]:
books=books.drop_duplicates('tokens')

In [None]:
books[['ID','TITLE','DESCR','tokens']]

## 2 - Word Embedding

Global Vectors for Word Representation (**GloVe**) is a word embedding model in the same family as **word2vec**: an unsupervised learning algorithm for obtaining vector representations of words. It takes a corpus of text and maps each word in that corpus to a position in a high-dimensional space.

We could choose to train an embedding model on our own corpus of book descriptions, but to simplify the process while taking advantage of the precious open machine learning community, we will download a pretrained model that was developed using a much broader corpus: Wikipedia plus the Gigaword news archive.

The gensim Python library allows us to do that in 2 lines of code. When you run them, the download will take a minute or two. In the underlying text file, each line contains a word followed by N numbers, which are the coordinates of the word's vector. N depends on which model you choose to download. For us, N is 100, since we are using glove.6B.100d.

To learn more about how to use GloVe see:
* https://faculty.ai/tech-blog/glove/
* https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db 
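
To make the file format described above more concrete, here is a small sketch that parses a few lines in the "word followed by N numbers" layout and compares the resulting vectors with cosine similarity. The three 4-dimensional vectors are made up for illustration (real GloVe vectors have 100 dimensions here), and `parse_glove_line` and `cosine_similarity` are hypothetical helpers; gensim handles the parsing for us in the notebook.

```python
import math

def parse_glove_line(line):
    """Split one line of a GloVe-style text file into (word, vector)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 4-dimensional vectors, for illustration only
toy_lines = [
    "dragon 0.9 0.1 0.0 0.3",
    "wizard 0.8 0.2 0.1 0.4",
    "tractor 0.0 0.9 0.8 0.1",
]
vectors = dict(parse_glove_line(line) for line in toy_lines)

# Words used in similar contexts should end up with a higher similarity
print(cosine_similarity(vectors["dragon"], vectors["wizard"]))
print(cosine_similarity(vectors["dragon"], vectors["tractor"]))
```

With these toy numbers, "dragon" is far more similar to "wizard" than to "tractor", which is the kind of structure the pretrained vectors encode.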



In [None]:
import gensim.downloader as gensim_api
model = gensim_api.load("glove-wiki-gigaword-100")  

Now that we have downloaded our pretrained GloVe model, we can apply it to every book description to **embed** our books in a multidimensional space. Each book is represented by the average of the vectors of its tokens:

In [None]:
features = []
for i, book in books.iterrows():
    # Keep only the tokens that exist in the GloVe vocabulary
    tokens_features = [model[word] for word in book['tokens'].split() if word in model]
    if tokens_features:
        features.append(np.mean(tokens_features, axis=0))
    else:
        # Assumption: if no token is known, fall back to a zero vector to avoid NaN features
        features.append(np.zeros(100))

for i in range(100):
    feature = 'f_' + str(i)
    books[feature] = [f[i] for f in features]

del features
embedding = ['f_' + str(i) for i in range(100)]

You can see the results of our text embedding: we associated each book with a 100-dim numerical vector:

In [None]:
books[['TITLE']+embedding]
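
The averaging step is easier to follow on toy data. The sketch below uses a made-up 2-dimensional vocabulary and a hypothetical `mean_embedding` helper. Returning a zero vector when no token is known is an assumption of this sketch, not something prescribed by GloVe or gensim.

```python
def mean_embedding(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; zero vector if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th coordinate of every vector together
    return [sum(component) / len(known) for component in zip(*known)]

# Made-up 2-dimensional vectors, for illustration only
toy_vectors = {"dragon": [1.0, 0.0], "castle": [0.0, 1.0]}

print(mean_embedding(["dragon", "castle", "xyzzy"], toy_vectors, dim=2))  # [0.5, 0.5]
print(mean_embedding(["xyzzy"], toy_vectors, dim=2))                      # [0.0, 0.0]
```

Unknown words such as "xyzzy" are simply ignored, so a description is positioned by the words the pretrained model knows.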

## 3 - KMeans Clustering

Now that each book is represented by a point in a multidimensional space, we can use the *distance* between these points to find out which books are similar to each other. More specifically, we will perform a cluster analysis: our book sample will be divided into a certain number of groups such that books in the same group are more similar (closer) to each other than to books in other groups.

In this exercise, the number of groups or clusters is set to 10. We will use the sklearn KMeans clustering algorithm, in its MiniBatchKMeans variant:

In [None]:
from sklearn.cluster import MiniBatchKMeans
n_clus=10
km = MiniBatchKMeans(n_clusters = n_clus, batch_size=50, random_state=42, max_iter=1000)
y_kmeans = km.fit_predict(books[embedding])
books['kmeans_cluster']=y_kmeans
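
To make the algorithm less of a black box, here is a plain-Python sketch of the classic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). This is the textbook full-batch version, not the mini-batch variant that sklearn runs above, and the `kmeans` function, the toy points and the starting centroids are all assumptions for illustration.

```python
import math

def kmeans(points, initial_centroids, n_iter=10):
    """Plain k-means on lists of coordinates, starting from the given centroids."""
    centroids = [list(c) for c in initial_centroids]  # copy, do not modify the input
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Two obvious groups of toy 2-D points
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(toy_points, initial_centroids=[[0, 0], [10, 10]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The mini-batch variant follows the same idea but updates the centroids from small random batches of points, which makes it faster on large datasets.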

### Cluster Profiling

To understand whether the clustering worked effectively, let's examine the most representative books for each cluster, that is, the books closest to the cluster centroid. If you run the following cell and scroll down to see the result, you will notice a certain homogeneity between descriptions belonging to the same cluster.

In [None]:
for cluster in range(n_clus):
    print('*************')
    print('- CLUSTER', cluster)
    print('*************')
    # Rank all books by Euclidean distance to this cluster's centroid
    distances = np.linalg.norm(
        books[embedding].to_numpy() - km.cluster_centers_[cluster], axis=1
    )
    most_representative_docs = np.argsort(distances)
    # Print the 10 descriptions closest to the centroid (positional lookup with iloc)
    for d in most_representative_docs[:10]:
        print(books['DESCR'].iloc[d])
        print("--")
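
The ranking idea behind this cell can be shown on a handful of toy 2-D points. The `closest_to_centroid` helper and the data are hypothetical and only meant to illustrate "sort by distance to the centroid, keep the first few".

```python
import math

def closest_to_centroid(vectors, centroid, top_n=3):
    """Return the indices of the top_n vectors nearest to the centroid."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], centroid))
    return order[:top_n]

# Toy 2-D points and a toy centroid, for illustration only
toy_vectors = [[5.0, 5.0], [1.0, 1.0], [0.9, 1.2], [3.0, 0.0]]
toy_centroid = [1.0, 1.0]

print(closest_to_centroid(toy_vectors, toy_centroid, top_n=2))  # [1, 2]
```

Note that the ranking runs over all points, not only over the members of one cluster, so for a very small cluster the top of the list may include points assigned to a neighbouring cluster.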

### Cluster Visualization 

Most datasets have a large number of variables or dimensions along which the data is distributed. Visually exploring the data is challenging. In our case, for instance, our book embedding has **100 dimensions**. To visualize high-dimensional datasets you can use techniques known as **dimensionality reduction**. 

**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is a technique for dimensionality reduction that maps a high-dimensional distribution onto a 2-dim plane. Since it is computationally quite heavy, another dimensionality reduction technique is often applied first, e.g. **Principal Component Analysis** or **PCA**. PCA reduces the number of dimensions in a dataset while retaining most of the information. It analyzes the correlation between dimensions and attempts to provide a minimum number of variables that keeps the maximum amount of variation or information about the original data distribution.

In our example, we will first reduce our dimensions from 100 to 50 using PCA, and then use t-SNE to visualize our clusters in 2 dimensions. Here is the code.

#### Reduce variables with PCA 

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(books[embedding])
print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_)))


#### Run the t-SNE model

In [None]:
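# Note: recent scikit-learn releases rename n_iter to max_iter; adjust if this call raises an error
tsne = TSNE(n_components=2, verbose=0, perplexity=30, n_iter=2000)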
tsne_pca_results = tsne.fit_transform(pca_result_50)
books["tsne-1"] = tsne_pca_results[:,0]
books["tsne-2"] = tsne_pca_results[:,1]


#### Plot clusters in a 2-dim plane

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x="tsne-1", y="tsne-2", hue="kmeans_cluster", s=30, palette="Paired",
                data=books).set(title="Glove 100 projection") 

sample_titles=['Shadow','Seventeen coffins','Stronger than magic']
for t in sample_titles:
    x=books.loc[books['TITLE']==t,'tsne-1'].tolist()[0]
    y=books.loc[books['TITLE']==t,'tsne-2'].tolist()[0]
    plt.annotate(t,(x,y),xytext=(x-70,y),arrowprops={'arrowstyle':'fancy'})
plt.show()

As you can see, we were able to display the clusters in a 2-dim plane. Sample books dealing with different themes were assigned to different clusters and lie far from each other in the plot. Now that we are satisfied with our clustering analysis, the only task left is to save the clustering model:

In [None]:
import pickle

# The with-statement makes sure the file is closed once the model is written
with open("description.pickle.dat", "wb") as f:
    pickle.dump(km, f)
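
For completeness, here is a sketch of the save-and-reload round trip. Since the fitted `km` object is not available outside this notebook, a plain dictionary stands in for the model, and the temporary file name is made up for illustration. With the real model, the same `pickle.load` call would return an object on which `predict` can be called.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (an assumption for illustration);
# in the notebook this would be the `km` object
toy_model = {"n_clusters": 10, "centers": [[0.0, 1.0], [1.0, 0.0]]}

# Hypothetical file location in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle.dat")

with open(path, "wb") as f:
    pickle.dump(toy_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)  # True
```

Only unpickle files that come from a source you trust, and reload sklearn models with the same library version that saved them.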