<a href="https://colab.research.google.com/github/Bryan-Az/ClusteringMethod-Slate/blob/main/Text_Embeddings_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import os
import shutil
import sys
import zipfile
import gensim
import re

# Document / Text Clustering using Word Embeddings
In this notebook I will be running a text clustering model on data taken from the National Gallery of Art.

## Data Loading

In [2]:
# unpack art_tables.zip with the zipfile library, then load the two CSVs
# (latinamerican_art.csv & non_latinamerican_art.csv) into a single dataframe
data_dir = './data_samples/'
frames = []
with zipfile.ZipFile(data_dir + 'art_tables.zip', 'r') as zip_ref:
    zip_ref.extractall(data_dir)
    for file in zip_ref.namelist():
        if not file.endswith('.csv'):
            continue
        # select only the title and nationality columns
        df = pd.read_csv(data_dir + file, on_bad_lines='skip').loc[:, ['title', 'nationality']]
        # latinamerican has very few rows, so we keep all of them;
        # from the non_latinamerican file we sample only 628 rows
        if 'non_latinamerican' in file:
            df = df.sample(628)
        frames.append(df)
# later files go first, matching the order in which the frames were originally concatenated
nga_art_sample = pd.concat(frames[::-1])
# delete the unzipped data directory
shutil.rmtree(data_dir + 'art_tables')

print(nga_art_sample.shape)
nga_art_sample.head()



(1256, 2)


Unnamed: 0,title,nationality
111523,George Washington,American
137701,Jersey Lu Blacked Out,American
88563,Bowl,American
49233,Saint Catherine in the Clouds,Flemish
174293,Eight Scalpels,American
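As an aside, the extract-then-delete step is not strictly required, because members of a zip archive can be read in place. The sketch below uses only the standard library and a tiny in-memory archive with made-up rows, so it is an illustration under those assumptions rather than the loading code used in this notebook.

```python
import csv
import io
import zipfile

# Build a tiny in-memory archive so the sketch is self-contained.
# The file names mirror the real archive; the rows are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('art_tables/latinamerican_art.csv', 'title,nationality\nGarden,Mexican\n')
    zf.writestr('art_tables/non_latinamerican_art.csv', 'title,nationality\nBowl,American\n')

rows = []
with zipfile.ZipFile(buffer, 'r') as zf:
    for name in zf.namelist():
        if name.endswith('.csv'):
            # ZipFile.open returns a binary file object; wrap it to decode text
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
                rows.extend(csv.DictReader(text))

print(rows)
```

Since `pd.read_csv` also accepts file-like objects, the object returned by `zf.open(name)` could be passed to it directly, which would leave nothing on disk to clean up afterwards.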


In [3]:
nga_art_sample.columns

Index(['title', 'nationality'], dtype='object')

## Data Processing

For data processing, I will clean the data to remove obvious errors in the title and nationality columns, such as missing values that could have occurred during the sampling and extraction step (the data was originally stored in a MySQL database).

For feature engineering, in order to prepare for document clustering and text embedding, I will detect bigrams in the titles with gensim's `Phrases` model and compute their TF-IDF matrix with scikit-learn's `TfidfVectorizer`.

In [4]:
# the counts for title and nationality differ, which means some rows have missing values, so we will drop those rows
# the most frequent title is 'Untitled', which is not useful for our analysis, so we will drop those rows as well
# finally, the most frequent nationality is 'Mexican', so we will need to include Spanish stopwords in our list
nga_art_sample.describe()

Unnamed: 0,title,nationality
count,1256,1252
unique,850,37
top,Untitled,Mexican
freq,52,374


In [5]:
# dropping null values and 'Untitled' titles
nga_art_sample = nga_art_sample.dropna()
nga_art_sample = nga_art_sample[nga_art_sample.title != 'Untitled']
nga_art_sample.describe()

Unnamed: 0,title,nationality
count,1200,1200
unique,846,37
top,Garden,Mexican
freq,36,368


In [6]:
# remove non-ascii characters from titles
nga_art_sample.title = nga_art_sample.title.str.encode('ascii', 'ignore').str.decode('ascii')
# remove punctuation (quotation marks, colons, commas, slashes, hyphens, etc.) and any parenthesized text from titles
nga_art_sample.title = nga_art_sample.title.str.replace(r"[\"':;,!?\\/\-+&=]|(\(.*\))", "", regex=True)
nga_art_sample.title = nga_art_sample.title.str.strip()
nga_art_sample.title = nga_art_sample.title.str.lower()
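To make the effect of these cleaning steps concrete, here is a minimal pure-Python sketch that applies the same four operations to single strings. The helper name `clean_title` and the sample titles are made up for illustration, and it assumes that pandas' `str.replace(..., regex=True)` behaves like `re.sub` for this pattern.

```python
import re

# same pattern as in the cell above
PATTERN = re.compile(r"[\"':;,!?\\/\-+&=]|(\(.*\))")

def clean_title(title):
    """Hypothetical helper mirroring the four pandas steps on one string."""
    title = title.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII characters
    title = PATTERN.sub('', title)                           # drop punctuation and parenthesized text
    return title.strip().lower()

print(clean_title('King Ferdinand, 1494-1547 (copy)'))  # king ferdinand 14941547
print(clean_title('A (one) and B (two)'))               # a
print(clean_title('Niño con flores'))                   # nio con flores
```

Three side effects are worth keeping in mind when reading the features later on: removing the hyphen fuses date ranges into a single token, the greedy `\(.*\)` removes everything between the first opening and the last closing parenthesis, and accented letters are dropped rather than transliterated.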

In [7]:
nga_art_sample.head()

Unnamed: 0,title,nationality
111523,george washington,American
137701,jersey lu blacked out,American
88563,bowl,American
49233,saint catherine in the clouds,Flemish
174293,eight scalpels,American


## Feature Engineering

In [8]:
from gensim.models.phrases import Phrases

# Convert the titles to a list of lists of words
title_words = [title.split() for title in nga_art_sample['title']]

# Create bigrams
bigram = Phrases(title_words, min_count=1, threshold=1)

# Apply the bigram model to the titles
title_bigrams = [bigram[title] for title in title_words]
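The `min_count` and `threshold` arguments decide which adjacent word pairs get merged into a single token. The sketch below reimplements, in plain Python, the scoring rule that I understand gensim's default scorer to follow (the one from Mikolov et al.): `(count(ab) - min_count) / (count(a) * count(b)) * vocab_size`. The helper `phrase_score`, the toy titles, and the choice of counting both single words and pairs in `vocab_size` are assumptions made for illustration, not gensim's actual code.

```python
from collections import Counter

def phrase_score(pair_count, a_count, b_count, vocab_size, min_count=1):
    """Hypothetical helper: Mikolov-style collocation score."""
    return (pair_count - min_count) / (a_count * b_count) * vocab_size

# toy corpus, made up for illustration
titles = [['saint', 'catherine'], ['saint', 'catherine'], ['eight', 'scalpels']]

unigrams = Counter(word for title in titles for word in title)
pairs = Counter(pair for title in titles for pair in zip(title, title[1:]))
# assumption: the vocabulary counts single words and pairs together
vocab_size = len(unigrams) + len(pairs)

scores = {
    pair: phrase_score(count, unigrams[pair[0]], unigrams[pair[1]], vocab_size)
    for pair, count in pairs.items()
}
threshold = 1
merged = [pair for pair, score in scores.items() if score > threshold]
print(scores)
print(merged)
```

Under this rule a pair that occurs only once scores 0 when `min_count=1`, so it stays as two separate words even with a threshold as low as 1; only repeated pairs can be merged.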

In [9]:
title_words[:5]

[['george', 'washington'],
 ['jersey', 'lu', 'blacked', 'out'],
 ['bowl'],
 ['saint', 'catherine', 'in', 'the', 'clouds'],
 ['eight', 'scalpels']]

### TF-IDF Matrix

In [10]:
# calculating the tf-idf matrix over pairs of adjacent tokens (tokens may themselves be merged bigrams such as 'woman_with')
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_bigram = TfidfVectorizer(ngram_range=(2,2))
vectors = vectorizer_bigram.fit_transform([' '.join(title) for title in title_bigrams])
#create a dataframe of the vectors
title_bigram_df = pd.DataFrame(vectors.toarray(), columns=vectorizer_bigram.get_feature_names_out())
title_bigram_df.head()

Unnamed: 0,12 12,13 yrs,14 jesus,14941547 king,15161591 margravine,1985 parking,199 curtesying,21 quai,23 86,25_prints of_leopoldo,...,xochimilcocinco de,y_la gran_cortina,ydoapai point,years expedition,young man,young woman_with,yrs 1933,yvette guilbert,zephyr loves,ziminian upper
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
title_bigram_df.describe()

Unnamed: 0,12 12,13 yrs,14 jesus,14941547 king,15161591 margravine,1985 parking,199 curtesying,21 quai,23 86,25_prints of_leopoldo,...,xochimilcocinco de,y_la gran_cortina,ydoapai point,years expedition,young man,young woman_with,yrs 1933,yvette guilbert,zephyr loves,ziminian upper
count,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,...,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0
mean,0.000481,0.000589,0.000278,0.000373,0.000315,0.000178,0.000315,0.000231,0.000241,0.001179,...,0.000589,0.000962,0.00034,0.000196,0.00034,0.000251,0.000589,0.000833,0.000417,0.000373
std,0.016667,0.020412,0.009623,0.01291,0.010911,0.006155,0.010911,0.008006,0.008333,0.028855,...,0.020412,0.02356,0.011785,0.006804,0.011785,0.008704,0.020412,0.028868,0.014434,0.01291
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.57735,0.707107,0.333333,0.447214,0.377964,0.213201,0.377964,0.27735,0.288675,0.707107,...,0.707107,0.57735,0.408248,0.235702,0.408248,0.301511,0.707107,1.0,0.5,0.447214
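The `max` row of the summary above can be read more easily with the weighting spelled out. The following is a minimal sketch of the scheme I understand `TfidfVectorizer` to apply by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row), restricted to adjacent token pairs. The helper `bigram_tfidf` is hypothetical, and it splits on whitespace only, whereas the real vectorizer also drops single-character tokens.

```python
import math
from collections import Counter

def bigram_tfidf(titles):
    """Hypothetical helper: TF-IDF over adjacent token pairs, one dict per title."""
    # adjacent token pairs play the role of ngram_range=(2, 2)
    docs = [Counter(zip(t.split(), t.split()[1:])) for t in titles]
    n_docs = len(docs)
    # document frequency: number of titles that contain each pair
    df = Counter(pair for doc in docs for pair in doc)
    rows = []
    for doc in docs:
        # term count times smoothed inverse document frequency
        weights = {p: tf * (math.log((1 + n_docs) / (1 + df[p])) + 1) for p, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize the row; a title without any pair stays empty (all zeros)
        rows.append({p: w / norm for p, w in weights.items()} if norm else {})
    return rows

rows = bigram_tfidf(['yvette guilbert', 'bowl', 'saint catherine in the clouds'])
for row in rows:
    print(row)
```

A two-word title has a single pair, so its weight is 1.0 after normalization, which is consistent with the maximum of the `yvette guilbert` column. A one-word title such as `bowl` has no pairs at all and ends up as an all-zero row, so every single-word title looks identical to the clustering step.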


## Model

### K-Means Clustering
Now that we have the TF-IDF matrix of title bigrams, each row of title_bigram_df represents one 'document' (title) as a vector that can be compared with every other title, for example by cosine similarity. The cell below fits K-Means directly on these vectors rather than on a precomputed similarity matrix.

In [12]:
from sklearn.cluster import KMeans
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(title_bigram_df)
clusters = km.labels_.tolist()
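K-Means works with Euclidean distance, while text similarity is usually discussed in terms of cosine similarity. For vectors of unit length the two are directly related: the squared Euclidean distance equals `2 - 2 * cosine`. The sketch below checks that identity on two made-up vectors; it assumes the rows are L2-normalized (the default `norm='l2'` of `TfidfVectorizer`, as far as I know) and does not apply to all-zero rows.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# two made-up vectors, scaled to unit length
a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([2.0, 1.0, 1.0])

lhs = squared_euclidean(a, b)
rhs = 2 - 2 * cosine(a, b)
print(lhs, rhs)
```

Note that the cluster centroids themselves are averages and generally not unit length, so this is a statement about distances between titles, not a claim that K-Means on these rows is equivalent to clustering on a cosine similarity matrix.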
