## Clustering & Classification using Knowledge Graph Embeddings (KGEs)

Here we explore how to use knowledge graph embeddings learned from a graph of international football matches (played since the 19th century).

We will perform clustering and classification tasks. Knowledge graph embeddings are typically used for missing link prediction and knowledge discovery, but they can also be used for entity clustering, entity disambiguation and other downstream tasks. The embeddings are a form of representation learning that allows linear algebra and machine learning to be applied to knowledge graphs.

We will perform the following tasks -

1. Creating the knowledge graph (i.e. triples) from a tabular dataset of football matches.
2. Training the ComplEx embedding model on those triples.
3. Evaluating the quality of the embeddings on a validation set.
4. Clustering the embeddings, comparing to the natural clusters formed by the geographical continents.
5. Applying the embeddings as features in a classification task, to predict match results.
6. Evaluating the predictive model on an out-of-time test set, comparing to a simple baseline.
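
To give a feel for what the ComplEx model in step 2 learns, here is a minimal sketch of its scoring function, assuming the standard formulation in which every entity and relation is a complex-valued vector and a triple is scored by the real part of a trilinear product. The vectors below are made up for illustration and are not trained embeddings; `complex_score` is a hypothetical helper, not part of AmpliGraph.

```python
def complex_score(e_s, w_r, e_o):
    """ComplEx score: real part of sum_k e_s[k] * w_r[k] * conj(e_o[k])."""
    return sum(s * r * o.conjugate() for s, r, o in zip(e_s, w_r, e_o)).real


# Toy 2-dimensional complex embeddings (illustrative values only)
team = [1 + 1j, 0.5 - 0.5j]
winner_of = [0 + 1j, 1 + 0j]
match = [1 - 1j, 0.5 + 0.5j]

forward = complex_score(team, winner_of, match)   # (team, winnerOf, match)
backward = complex_score(match, winner_of, team)  # (match, winnerOf, team)
print(forward, backward)
```

Swapping subject and object changes the score, which is why this kind of model can represent directed relations such as `winnerOf`.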

### Importing the Modules

In [3]:
import numpy as np
import pandas as pd
import ampligraph
import requests

In [2]:
print('Version of Ampligraph: ', ampligraph.__version__)

Version of Ampligraph:  1.4.0


### Dataset

We will use the [International Football Results from 1872 to 2019](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017). It contains over 40 thousand international football matches. Each row contains the following information -

1. Match date
2. Home team name
3. Away team name
4. Home score (goals including extra time)
5. Away score (goals including extra time)
6. Tournament (whether it was a friendly match or part of a tournament)
7. City where the match took place
8. Country where the match took place
9. Whether the match was played on neutral ground

This dataset is in a tabular format, so we will construct the knowledge graph ourselves.

In [4]:
url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/football_graph.csv'
with open('football_results.csv', 'wb') as f:
    # Save the downloaded CSV and print the number of bytes written
    print(f.write(requests.get(url).content))

3033782

In [5]:
df = pd.read_csv('football_results.csv').sort_values("date")

In [6]:
df.isna().sum()

date          0
home_team     0
away_team     0
home_score    2
away_score    2
tournament    0
city          0
country       0
neutral       0
dtype: int64

In [7]:
# Dropping the matches with unknown score -
df = df.dropna()

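The training set will cover matches from 1872 through the end of 2013, while the test set will cover matches from 2014 up to the latest date in the dataset (July 2019). Note that a temporal split usually makes a machine learning task harder than a random shuffle, because the model is evaluated on matches played after everything it was trained on.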

In [10]:
df["train"] = df.date < "2014-01-01"

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [12]:
df.head(3)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,train
0,1872-11-30,Scotland,England,0.0,0.0,Friendly,Glasgow,Scotland,False,True
1,1873-03-08,England,Scotland,4.0,2.0,Friendly,London,England,False,True
2,1874-03-07,Scotland,England,2.0,1.0,Friendly,Glasgow,Scotland,False,True


In [13]:
df.tail(3)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,train
40768,2019-07-11,Madagascar,Tunisia,0.0,3.0,African Cup of Nations,Cairo,Egypt,True,False
40769,2019-07-14,Algeria,Nigeria,2.0,1.0,African Cup of Nations,Cairo,Egypt,True,False
40770,2019-07-14,Senegal,Tunisia,1.0,0.0,African Cup of Nations,Cairo,Egypt,True,False


In [14]:
df.train.value_counts()

True     35714
False     5057
Name: train, dtype: int64
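
The two steps above can be sketched on a toy frame (the rows below are made up for illustration). Calling `.copy()` after `dropna()` gives pandas an independent frame, which is one common way to avoid the copy-of-a-slice warning printed after the `df["train"]` assignment (pandas calls it `SettingWithCopyWarning`); whether the warning appears at all depends on the pandas version. The sketch also shows why comparing the `date` column to a string works: ISO-formatted dates sort correctly as plain strings.

```python
import pandas as pd

# Toy data: one row has a missing score and will be dropped
toy = pd.DataFrame({
    "date": ["2013-12-31", "2014-01-01", "2019-07-14", "2010-06-11"],
    "home_score": [1.0, None, 2.0, 0.0],
})

# Drop incomplete rows and take an explicit copy before adding columns
clean = toy.dropna().copy()

# ISO dates (YYYY-MM-DD) compare chronologically as strings
clean["train"] = clean["date"] < "2014-01-01"

train_flags = clean["train"].tolist()
print(train_flags)
```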

### Knowledge Graph Creation

We are going to create a knowledge graph from scratch based on the match information. The idea is that each match is an entity that will be connected to its participating teams, geography, characteristics, and results.

The objective is to generate a new representation of the dataset where each data point is a triple in the form -

                        <subject (s), predicate (p), object (o)>

First we need to create the entities (subjects and objects) that will form the graph. We prefix each identifier with its entity type so that teams and geographical information result in different entities (e.g. the Brazilian team `TeamBrazil` and the corresponding country `CountryBrazil` are distinct nodes).
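
The naming scheme can be sketched with a small helper. `make_entity_id` is a hypothetical function introduced here only for illustration; it mirrors the prefix, title-case and space-stripping steps that the notebook cells apply column by column.

```python
def make_entity_id(prefix, name):
    """Build an entity identifier: type prefix + title-cased name without spaces."""
    return prefix + name.title().replace(" ", "")


team_id = make_entity_id("Team", "Brazil")
country_id = make_entity_id("Country", "Brazil")
city_id = make_entity_id("City", "Rio de Janeiro")
print(team_id, country_id, city_id)
```

Because the prefix differs, the team and the country named "Brazil" never collapse into the same node.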

In [17]:
# Entity naming

df['match_id'] = df.index.values.astype(str)


In [19]:
df['match_id'] = "Match" + df.match_id


In [28]:
df['city_id'] = "City" + df.city.str.title().str.replace(" ", "")


In [30]:
df['country_id'] = "Country" + df.country.str.title().str.replace(" ", "")


In [32]:
df['home_team_id'] = "Team" + df.home_team.str.title().str.replace(" ", "")


In [33]:
df['away_team_id'] = "Team" + df.away_team.str.title().str.replace(" ", "")


In [34]:
df['tournament_id'] = "Tournament" + df.tournament.str.title().str.replace(" ", "")


In [35]:
df['neutral'] = df.neutral.astype(str)


In [37]:
df.head(3)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,train,match_id,city_id,country_id,home_team_id,away_team_id,tournament_id
0,1872-11-30,Scotland,England,0.0,0.0,Friendly,Glasgow,Scotland,False,True,Match0,CityGlasgow,CountryScotland,TeamScotland,TeamEngland,TournamentFriendly
1,1873-03-08,England,Scotland,4.0,2.0,Friendly,London,England,False,True,Match1,CityLondon,CountryEngland,TeamEngland,TeamScotland,TournamentFriendly
2,1874-03-07,Scotland,England,2.0,1.0,Friendly,Glasgow,Scotland,False,True,Match2,CityGlasgow,CountryScotland,TeamScotland,TeamEngland,TournamentFriendly


**Note** : We have created entities from all the columns except -
* date : used to sort the matches and to create the train/test split
* scores (both home and away) : used to create the relationships between these entities

So, now let's create the actual triples based on the relationships between the entities. We do it only for the matches in the training set (before 2014).

In [39]:
triples = []

for _, row in df[df["train"]].iterrows():
    # Home and Away information
    home_team = (row["home_team_id"], "isHomeTeamIn", row["match_id"])
    away_team = (row["away_team_id"], "isAwayTeamIn", row["match_id"])
    
    # Match results -
    if row["home_score"] > row["away_score"]:
        score_home = (row["home_team_id"], "winnerOf", row["match_id"])
        score_away = (row["away_team_id"], "loserOf", row["match_id"])
    elif row["home_score"] < row["away_score"]:
        score_home = (row["home_team_id"], "loserOf", row["match_id"])
        score_away = (row["away_team_id"], "winnerOf", row["match_id"])
    else:
        score_home = (row["home_team_id"], "draws", row["match_id"])
        score_away = (row["away_team_id"], "draws", row["match_id"])
    
    home_score = (row["match_id"], "homeScores", np.clip(int(row["home_score"]), 0, 5)) # cap the score at 5 to limit the number of distinct score entities
    away_score = (row["match_id"], "awayScores", np.clip(int(row["away_score"]), 0, 5))
    
    # Match Characteristics -
    tournament = (row["match_id"], "inTournament", row["tournament_id"])
    city = (row["match_id"], "inCity", row["city_id"])
    country = (row["match_id"], "inCountry", row["country_id"])
    neutral = (row["match_id"], "isNeutral", row["neutral"])
    year = (row["match_id"], "atYear", row["date"][:4])
    
    triples.extend((home_team, away_team, score_home, score_away,
                   tournament, city, country, neutral, year, home_score, away_score))
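
The per-match logic of the loop above can be sketched as a standalone function, which makes it easy to check on a single row. `match_triples` and `cap` are hypothetical helpers, not part of the notebook or of AmpliGraph. One deliberate difference, stated as an assumption: the loop above uses `np.clip`, which returns a NumPy integer, whereas this sketch converts the capped score to a string so that every object in the triples has the same type. The toy row is modelled on `Match3129`; its away score and exact date are assumed for illustration.

```python
def cap(score, upper=5):
    """Clamp a score to the range [0, upper] and return it as a string."""
    return str(min(max(int(score), 0), upper))


def match_triples(row):
    """Return the 11 triples describing one match (row is a dict-like record)."""
    m = row["match_id"]
    home, away = row["home_team_id"], row["away_team_id"]
    if row["home_score"] > row["away_score"]:
        home_pred, away_pred = "winnerOf", "loserOf"
    elif row["home_score"] < row["away_score"]:
        home_pred, away_pred = "loserOf", "winnerOf"
    else:
        home_pred = away_pred = "draws"
    return [
        (home, "isHomeTeamIn", m),
        (away, "isAwayTeamIn", m),
        (home, home_pred, m),
        (away, away_pred, m),
        (m, "inTournament", row["tournament_id"]),
        (m, "inCity", row["city_id"]),
        (m, "inCountry", row["country_id"]),
        (m, "isNeutral", str(row["neutral"])),
        (m, "atYear", row["date"][:4]),
        (m, "homeScores", cap(row["home_score"])),
        (m, "awayScores", cap(row["away_score"])),
    ]


# Toy row (away score and date are assumptions for illustration)
row = {
    "match_id": "Match3129", "date": "1950-07-16",
    "home_team_id": "TeamBrazil", "away_team_id": "TeamUruguay",
    "home_score": 1.0, "away_score": 2.0,
    "tournament_id": "TournamentFifaWorldCup",
    "city_id": "CityRioDeJaneiro", "country_id": "CountryBrazil",
    "neutral": "False",
}
example = match_triples(row)
for triple in example:
    print(triple)
```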

Let's inspect the full set of triples for the Uruguay v Brazil match in the 1950 FIFA World Cup.

In [41]:
triples_df = pd.DataFrame(triples, columns = ["subject", "predicate", "object"])
triples_df.head(3)

Unnamed: 0,subject,predicate,object
0,TeamScotland,isHomeTeamIn,Match0
1,TeamEngland,isAwayTeamIn,Match0
2,TeamScotland,draws,Match0


In [42]:
triples_df[(triples_df.subject == 'Match3129') | (triples_df.object == 'Match3129')]

Unnamed: 0,subject,predicate,object
34419,TeamBrazil,isHomeTeamIn,Match3129
34420,TeamUruguay,isAwayTeamIn,Match3129
34421,TeamBrazil,loserOf,Match3129
34422,TeamUruguay,winnerOf,Match3129
34423,Match3129,inTournament,TournamentFifaWorldCup
34424,Match3129,inCity,CityRioDeJaneiro
34425,Match3129,inCountry,CountryBrazil
34426,Match3129,isNeutral,False
34427,Match3129,atYear,1950
34428,Match3129,homeScores,1
