Let's start by importing all the necessary packages.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn.cluster import DBSCAN
from hdbscan import HDBSCAN
from sklearn.manifold import TSNE
from sklearn.preprocessing import RobustScaler

Now, we have to import our dataset as a dataframe in order to cluster it.

In [None]:
WALS_raw = pd.read_csv('WALS table.csv')
WALS_languages = pd.read_csv('WALS languages.csv')
WALS_data = WALS_raw.drop(columns="label")
WALS_languages.head()

The features in this dataset have different numbers of possible values. Because the data is entirely categorical, the numeric codes fall in fairly similar ranges; nonetheless, we should normalize it.

In [None]:
scaler = RobustScaler()
WALS_normalized = pd.DataFrame(scaler.fit_transform(WALS_data), index=WALS_raw.index)
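
To make the scaling step more concrete, here is a minimal sketch of what RobustScaler does to a single column. The toy values are made up for illustration and are not WALS data: the scaler subtracts the column median and divides by the interquartile range, so one extreme value does not distort the scale used for the rest.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column (not WALS data): four small codes and one extreme value.
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

toy_scaler = RobustScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# The median (3) becomes the center and the interquartile range (4 - 2 = 2) the scale.
print(toy_scaler.center_, toy_scaler.scale_)
print(toy_scaled.ravel())
```

The first four values end up between -1 and 0.5, while the extreme value stays far away instead of compressing everything else toward zero.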

This dataset has a hundred-some features. To cluster and visualize it more effectively, we are going to use dimensionality reduction to bring it down to two features. There are a number of methods for reducing dimensionality, but I will be using t-Distributed Stochastic Neighbor Embedding (t-SNE). It preserves local distances better than an algorithm like Principal Component Analysis (PCA), so it should be better suited to clustering.

In [None]:
TSNEReduction = TSNE(n_components=2)
WALS_reduced = pd.DataFrame(TSNEReduction.fit_transform(WALS_normalized), index=WALS_raw.index)
WALS_reduced.columns = ["x", "y"]
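
One thing worth keeping in mind is that t-SNE is stochastic, so the exact coordinates of the embedding can differ from run to run unless the random seed is fixed. The sketch below uses a small synthetic matrix rather than the WALS data, and the `perplexity` and `random_state` values are arbitrary choices for illustration; passing an integer `random_state` is the documented way to ask scikit-learn for repeatable results.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(30, 10))  # synthetic stand-in, not WALS data

# perplexity must be smaller than the number of samples; 5 is an arbitrary choice here.
first_run = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(toy_features)
second_run = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(toy_features)

print(first_run.shape)
# Compare the two seeded runs.
print(np.allclose(first_run, second_run))
```

Any specific coordinates read off a t-SNE plot only hold for the run that produced it.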


And now, let's plot it to see what the data looks like.

In [None]:
px.scatter(WALS_reduced, x="x", y="y", hover_name=WALS_raw["label"])

As we can see, the t-SNE reduction has produced many regions of high density. However, it is hard to count the exact number of clusters, because small changes in density mean many regions could be grouped in several different ways. To me, this suggests that the best fit is a density-based clustering algorithm, since it selects the number of clusters automatically based on areas of high density. The one disadvantage of this type of clustering is that it labels some points as outliers, but that works for this project because we can think of those as "language isolates", languages that do not have any living relatives.

The specific algorithm I am going to use is HDBSCAN, which turns DBSCAN into a hierarchical clustering algorithm. Rather than only looking for areas of high density, it builds a hierarchy from the distances between the points. This lets it account for clusters of varying density, something DBSCAN has trouble with because it treats anything outside its predetermined density radius (epsilon) as belonging to a different cluster or as an outlier. Essentially, HDBSCAN operates like DBSCAN with an epsilon that varies by cluster.
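
The limitation of a single epsilon can be illustrated with a small sketch. It uses synthetic grid points rather than the WALS data, and the `grid` and `summarize` helpers, the spacings, and the two epsilon values are all made up for the example: two dense groups sit close together and one sparse group sits far away.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def grid(x0, y0, spacing, n=5):
    """Return an n-by-n grid of points starting at (x0, y0)."""
    steps = np.arange(n) * spacing
    xs, ys = np.meshgrid(x0 + steps, y0 + steps)
    return np.column_stack([xs.ravel(), ys.ravel()])

# Synthetic points (not WALS data): two dense groups close together, one sparse group far away.
dense_a = grid(0.0, 0.0, spacing=0.1)
dense_b = grid(1.0, 0.0, spacing=0.1)
sparse = grid(10.0, 10.0, spacing=1.0)
points = np.vstack([dense_a, dense_b, sparse])

def summarize(eps):
    """Run DBSCAN with one epsilon and return (cluster count, outlier count, labels)."""
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(points)
    n_clusters = len(set(labels) - {-1})
    n_outliers = int(np.sum(labels == -1))
    return n_clusters, n_outliers, labels

for eps in (0.15, 1.5):
    n_clusters, n_outliers, _ = summarize(eps)
    print(f"eps={eps}: {n_clusters} clusters, {n_outliers} outliers")
```

In this setup the small epsilon separates the two dense groups but marks the whole sparse group as outliers, while the large epsilon recovers the sparse group but merges the two dense groups into one. Neither setting finds all three groups, which is the situation a per-cluster density threshold is meant to handle.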

Next, I am going to apply HDBSCAN to our dataset and plot the results. I also merge the clustered data points with the WALS languages.csv file imported earlier. This file contains more identifying information about the languages, such as where they are located, what language family they belong to, and, most importantly, their names.

In [None]:
# Cluster the 2-D t-SNE coordinates; HDBSCAN labels outliers as -1
Cluster = HDBSCAN(min_cluster_size=10, min_samples=10, cluster_selection_epsilon=2.7)
WALS_prediction = pd.DataFrame(Cluster.fit_predict(WALS_reduced), index=WALS_reduced.index)
WALS_prediction.columns = ["cluster"]
WALS_clustered = pd.concat([WALS_reduced, WALS_prediction], axis=1)
# Attach the language metadata (name, family, coordinates) by WALS ID
WALS_clustered["ID"] = WALS_raw["label"]
WALS_clustered = WALS_clustered.merge(WALS_languages, how="inner", on="ID")
# Cast labels to strings so plotly colors the clusters as discrete categories
WALS_clustered["cluster"] = WALS_clustered["cluster"].astype(str)
px.scatter(WALS_clustered, x="x", y="y", color="cluster", hover_name="Name")

Browsing around this plot, you can see that some similar languages are clustered together. One that stands out is the small cluster in the bottom left corner around (-45,45). This cluster consists entirely of sign languages, a fairly natural grouping. Otherwise, at a cursory look, the clusters do not seem to correspond much to language families. Let's see how the plot looks when colored by language family.

In [None]:
px.scatter(WALS_clustered, x="x", y="y", color="Family", hover_name="Name")
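
A visual comparison like this can also be backed up with a number. One option, sketched here as an assumption rather than as part of the original analysis, is the adjusted Rand index from scikit-learn, which is 1 when two groupings agree perfectly and close to 0 when they agree no better than chance. The `family_agreement` helper is hypothetical, it relies on the `cluster` and `Family` column names used in this notebook, and dropping the outlier label `"-1"` before scoring is a design choice. The toy frames contain made-up labels, not WALS results.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

def family_agreement(df, cluster_col="cluster", family_col="Family", outlier_label="-1"):
    """Adjusted Rand index between cluster labels and families, ignoring outliers."""
    kept = df[df[cluster_col] != outlier_label]
    return adjusted_rand_score(kept[family_col], kept[cluster_col])

# Toy frames (made-up labels, not WALS results) showing the two extremes.
aligned = pd.DataFrame({"cluster": ["0", "0", "1", "1", "-1"],
                        "Family": ["A", "A", "B", "B", "C"]})
mixed = pd.DataFrame({"cluster": ["0", "0", "1", "1"],
                      "Family": ["A", "B", "A", "B"]})

print(family_agreement(aligned))
print(family_agreement(mixed))
```

Calling `family_agreement(WALS_clustered)` would apply the same check to the clustered dataframe built above.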

As we can see, the clusters don't correspond well to language families. To check whether they might instead reflect areal relationships, let's see how the clusters look on a map.

In [None]:
px.scatter_geo(WALS_clustered, lat="Latitude", lon="Longitude", color="cluster", hover_name="Name")

We can see that the clusters do not correspond well to geographic area either. This ultimately illustrates the point we began with: typological similarities are not a sufficient basis for treating languages as part of the same family.