# Introduction

This IPython notebook illustrates how to cluster a collection of strings using `py_stringclustering`. First, we import the `py_stringclustering` package and the other libraries we need:

In [1]:
# Import py_stringclustering package
import py_stringclustering as scl
import pandas as pd
import os
from sklearn.cluster import AgglomerativeClustering

Next, we read the input strings from a file that contains one string per line. The `read_data` function returns a pandas DataFrame containing the input strings and their corresponding IDs.

In [3]:
# Get the datasets directory
from py_stringclustering.utils.generic_helper import get_install_path

datasets_dir = os.path.join(get_install_path(), 'tests', 'test_datasets')
path_big_ten = os.path.join(datasets_dir, 'big_ten.txt')

In [4]:
# Read the strings from file
df = scl.read_data(path_big_ten)

In [5]:
df.head()

|   | name | id |
|---|------|----|
| 0 | Buckeyes | 0 |
| 1 | Indiana University | 1 |
| 2 | Indiana | 2 |
| 3 | M.S.U | 3 |
| 4 | MSU | 4 |
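
The shape of this result can be approximated in plain Python. The sketch below rests on the behavior described above (one string per line, IDs assigned in file order). It is not the library's implementation, `read_strings` is a hypothetical helper, and the sample lines are simply the first five names shown in the output.

```python
import io

def read_strings(handle):
    """Return a list of {'id', 'name'} records, one per non-empty line."""
    records = []
    for line in handle:
        name = line.rstrip("\n")
        if name:  # skipping blank lines is an assumption; the library may differ
            records.append({"id": len(records), "name": name})
    return records

# In-memory stand-in for the first lines of the input file.
sample = io.StringIO("Buckeyes\nIndiana University\nIndiana\nM.S.U\nMSU\n")
records = read_strings(sample)
```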


In [7]:
len(df)

74

## Blocking Step

In this step, we perform a round of blocking to reduce the number of string pairs whose similarity we need to compute. For this purpose, we use the `py_stringsimjoin` and `py_stringmatching` packages as follows:

In [10]:
import py_stringmatching as sm
import py_stringsimjoin as ssj

# Block using Jaccard join with jacc3gr(s1, s2) >= 0.3
# Returns a DataFrame containing pairs of string IDs that satisfy the blocking condition
trigramtok = sm.QgramTokenizer(qval=3)
blocked_pairs = ssj.jaccard_join(df, df, 'id', 'id', 'name', 'name', trigramtok, 0.3)

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00


In [11]:
blocked_pairs.head()

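|   | _id | l_id | r_id | _sim_score |
|---|-----|------|------|------------|
| 0 | 0 | 0 | 0 | 1.0 |
| 1 | 1 | 1 | 1 | 1.0 |
| 2 | 2 | 2 | 1 | 0.318182 |
| 3 | 3 | 9 | 1 | 0.305556 |
| 4 | 4 | 26 | 1 | 0.392857 |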
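
To make the blocking condition concrete, here is a minimal pure-Python sketch of Jaccard similarity over 3-grams. It assumes the strings are padded with `#` at the start and `$` at the end before tokenizing; under that assumption it reproduces the 0.318182 score shown above for IDs 2 and 1. The helper names are hypothetical, and the loop compares each unordered pair once and skips self-pairs, which is a simplification of what the join returns.

```python
from itertools import combinations

def padded_qgrams(s, q=3, prefix="#", suffix="$"):
    """Return the set of q-grams of s after padding with q-1 pad characters."""
    padded = prefix * (q - 1) + s + suffix * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

names = ["Buckeyes", "Indiana University", "Indiana", "M.S.U", "MSU"]
grams = [padded_qgrams(name) for name in names]

# Keep only the pairs that satisfy the blocking condition (score >= 0.3).
candidates = [
    (i, j)
    for i, j in combinations(range(len(names)), 2)
    if jaccard(grams[i], grams[j]) >= 0.3
]

score = jaccard(grams[1], grams[2])
print(round(score, 6))  # 0.318182
print(candidates)       # [(1, 2)]
```

Note that "M.S.U" and "MSU" share only two padded 3-grams out of ten, so this pair falls below the 0.3 threshold in the sketch.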


## Calculate Pairwise Similarities


Next, we calculate the similarities between the blocked string pairs, using the `get_sim_scores` function:

In [12]:
# Define clustering similarity measure
jaccsim = sm.Jaccard()

# Calculate sim scores
# Returns a list of triplets of the form (id1, id2, sim)
sim_scores = scl.get_sim_scores(df, blocked_pairs, trigramtok, jaccsim)

Blocked pairs provided.


In [15]:
sim_scores[:10]

[(1, 2, 0.3181818181818182),
 (7, 8, 0.36363636363636365),
 (1, 9, 0.3055555555555556),
 (8, 10, 0.3333333333333333),
 (11, 13, 0.3076923076923077),
 (12, 13, 0.36),
 (3, 17, 0.4),
 (17, 20, 0.4),
 (3, 20, 0.4),
 (19, 22, 0.3333333333333333)]

## Generate Similarity Matrix

Then, we turn the similarity scores calculated above into a similarity matrix by calling the `get_sim_matrix` function. The clustering algorithm in the next step takes this matrix as its input.

In [16]:
# Returns a NumPy matrix containing the similarities in sim_scores and zero everywhere else
sim_matrix = scl.get_sim_matrix(df, sim_scores)

In [17]:
sim_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.31818182, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.44444444,  0.30769231],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.38461538],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
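
The construction can be sketched in plain Python as follows. This is an illustration based on the printed output, not the library's code: in the array above, row 1, column 2 holds 0.31818182 while row 2, column 1 is zero, so the sketch writes each triplet `(id1, id2, sim)` only at position `[id1][id2]`. The name `build_sim_matrix` is hypothetical.

```python
def build_sim_matrix(n, sim_scores):
    """Return an n x n list of lists with sim at [id1][id2] and 0.0 elsewhere."""
    matrix = [[0.0] * n for _ in range(n)]
    for id1, id2, sim in sim_scores:
        matrix[id1][id2] = sim
    return matrix

# First three triplets from the sim_scores output above.
scores = [
    (1, 2, 0.3181818181818182),
    (7, 8, 0.36363636363636365),
    (1, 9, 0.3055555555555556),
]
matrix = build_sim_matrix(10, scores)
```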

## Run Hierarchical Agglomerative Clustering (HAC)

Next, we cluster the input strings with scikit-learn's `AgglomerativeClustering`, passing the similarity matrix calculated above as a precomputed affinity:

In [18]:
aggcl = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='complete')

# Returns an array with one cluster label per input string
labels = aggcl.fit_predict(sim_matrix)
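
One caveat is worth keeping in mind: scikit-learn interprets a precomputed matrix as pairwise distances, where smaller values mean closer points. A common way to adapt a similarity matrix is to use `1 - similarity` and make the result symmetric. The sketch below shows that conversion in plain Python; `to_distance_matrix` is a hypothetical helper, and it assumes similarities lie in [0, 1]. Also note that recent scikit-learn releases name this argument `metric` rather than `affinity`.

```python
def to_distance_matrix(sim):
    """Convert a square similarity matrix (values in [0, 1]) to distances."""
    n = len(sim)
    dist = [[0.0] * n for _ in range(n)]  # zero distance on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            # Use whichever triangle holds the score, so the result is symmetric.
            s = max(sim[i][j], sim[j][i])
            dist[i][j] = dist[j][i] = 1.0 - s
    return dist

# Toy 3 x 3 similarity matrix with a single score stored above the diagonal.
sim = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0],
]
dist = to_distance_matrix(sim)
```

Under this convention, pairs removed by blocking (similarity 0) end up at the maximum distance of 1, which may or may not be the desired treatment for a given dataset.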

## Generate the String Clusters

Finally, we use the labels assigned above to generate the final clusters of strings, by calling the `get_clusters` function:

In [19]:
# Returns a list of clusters where each cluster is a list of strings
str_clusters = scl.get_clusters(df, labels)
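
Conceptually, this step groups the strings by their cluster label. A minimal sketch of that grouping, assuming `labels[i]` is the label of the i-th string, is shown here. The helper `group_by_label` and the toy labels are hypothetical, and the ordering of clusters and of strings within a cluster may differ from what `get_clusters` produces.

```python
from collections import defaultdict

def group_by_label(names, labels):
    """Group names by cluster label, keeping input order within each cluster."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[label].append(name)
    # Emit clusters in ascending label order.
    return [groups[label] for label in sorted(groups)]

names = ["Indiana University", "Indiana", "M.S.U", "MSU"]
labels = [0, 0, 1, 1]  # toy labels, not output from the clustering above
clusters = group_by_label(names, labels)
```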

In [20]:
str_clusters

[['P.S.U',
  'Purdue University',
  'UMich',
  'University of Iowa Iowa City',
  'Wisc Madison',
  'iowa',
  'wisconsin'],
 ['University of iowa', 'wisco'],
 ['Indiana',
  'Mich',
  'Michigan State University',
  'Minnesota',
  'O.S.U',
  'Penn State',
  'Purdue University Boilermakers',
  'Purdue',
  'Rutgers',
  'The university of iowa',
  'U Nebraska',
  'UM Twin Cities',
  'University of Wisconsin Madison',
  'michigan state'],
 ['Buckeyes',
  'Indiana University',
  'M.S.U',
  'MSU',
  'Madison - UW',
  'Maryland',
  'Mich St',
  'Michigan',
  'Minn',
  'Minnesota Twin Cities',
  'NORTHWESTERN UNIVERSITY',
  'NWern',
  'Nebraska',
  'OSU',
  'Ohio State',
  'PSU',
  'Pennsylvania State University',
  'Purdue Univ',
  'RU',
  'Rutgers U',
  'Rutgers the state university of new jersey',
  'THE ohio state uiversity',
  'The University of Maryland',
  'U Wisconsin',
  'U of Mich',
  'U.M.N',
  'U.N.L',
  'UDub',
  'UI Iowa City',
  'UIowa',
  'UM Ann Arbor',
  'UM College Park',
  'UM