# Identifying Similar Players Using Unsupervised Nearest Neighbors

## Using unsupervised nearest neighbors to find the closest observations in the data, as an approximation of a similarity ranking

Unsupervised nearest neighbors finds the closest observations (the nearest neighbors) for each sample. The scikit-learn implementation offers three search algorithms, BallTree, KDTree, and brute force, and by default it attempts to pick the most appropriate one for the data.
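
As a small illustration of that choice, the sketch below fits `NearestNeighbors` with each `algorithm` option on synthetic data and checks that they return the same neighbors. The random array `X` is only a stand-in for the real cluster-distance columns, and the sizes used are arbitrary assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for four cluster-distance columns (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

neighbor_sets = {}
for algorithm in ("ball_tree", "kd_tree", "brute"):
    model = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    # Query the first three rows; each row's nearest neighbor is itself.
    neighbor_sets[algorithm] = model.kneighbors(X[:3], return_distance=False)

# All three are exact searches, so they should agree on the neighbors.
same = all(
    np.array_equal(neighbor_sets["brute"], idx) for idx in neighbor_sets.values()
)
print(same)
```

The algorithms differ in speed and memory use rather than in results, so leaving `algorithm` at its default of `"auto"` is usually a reasonable starting point.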

Broadly speaking, neighbors-based methods find a set number of samples closest in distance to a new observation and predict its label from those samples. The nearest neighbors family of algorithms is generally quite simple, but highly effective for a range of classification and regression tasks, and it forms the basis for a number of more complicated approaches. In this case, the unsupervised version is used on its own to identify the nearest data points, which represent the most similar players, with no follow-up prediction task.

I've made the following assumptions/decisions in developing this solution:

- I have used each player's distance from every cluster to compute nearest neighbors. That way the comparison doesn't rely only on the distance from the player's own cluster.
- I've used the cluster distances from one season to compute the nearest neighbors, meaning that the function identifies players that are similar to a player in a certain season. There is also an argument for using a player's mean distances across all seasons for which data is available.
  - I've looked at the mean value similarities (taking the mean value of the individual, and then the single-season value for similar players), and the results tend to be quite similar.
  - It's possible that mean values should be used for both target and neighbors.
  - It's not clear which is the better approach, but both are relatively easy to compute, so it's more of a conceptual decision than technical.
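
The mean-value variant mentioned above could be sketched as follows. This is only an illustration under stated assumptions: `toy` is a hypothetical frame whose columns `0` to `3` stand in for the cluster distances, and means are used for both the target and the neighbors.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: columns 0-3 stand in for cluster distances.
toy = pd.DataFrame(
    {
        "player": ["A", "A", "B", "B", "C"],
        "season": [1819, 1920, 1819, 1920, 1920],
        0: [1.0, 1.2, 3.0, 3.2, 1.3],
        1: [2.0, 2.2, 0.5, 0.7, 2.0],
        2: [0.5, 0.7, 2.5, 2.3, 0.8],
        3: [3.0, 2.8, 1.0, 1.2, 2.7],
    }
)
dist_cols = [0, 1, 2, 3]

# One row per player: the mean cluster distances over all of their seasons.
player_means = toy.groupby("player")[dist_cols].mean()

# Fit on the per-player means and query with player A's mean profile.
model = NearestNeighbors(n_neighbors=2).fit(player_means.to_numpy())
idx = model.kneighbors(player_means.loc[["A"]].to_numpy(), return_distance=False)
similar = player_means.index[idx.ravel()].tolist()
print(similar)
```

Because the query player is part of the fitted data, the first name returned is the player themselves, followed by the closest other player.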

It would be possible to use the nearest neighbors approach on the raw player data, but I think this would produce sub-optimal results. On its own, that approach would potentially be too dependent on variables that happen to be similar between two players without defining the broader role that either player is playing. Instead, the clustering approach classifies players in the wider context of the type of player they are, and similarities can then be drawn from that context.
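
To make the idea of cluster-distance features concrete, here is a rough sketch of how distances to every K-Means centroid can be obtained with `KMeans.transform`. It is not the clustering pipeline used for this project: the random `raw` matrix, the scaling step, and the choice of four clusters (suggested only by the four distance columns used below) are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for raw per-player metrics (assumption, not real data).
rng = np.random.default_rng(0)
raw = rng.normal(size=(300, 12))

# Scale the features so no single metric dominates the distance calculation.
scaled = StandardScaler().fit_transform(raw)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)

# transform() returns each sample's distance to every cluster center,
# giving one low-dimensional "role profile" per player.
cluster_distances = kmeans.transform(scaled)
print(cluster_distances.shape)  # (300, 4)
```

Running nearest neighbors on `cluster_distances` rather than on `raw` compares players by where they sit relative to all of the role clusters, instead of by any individual metric.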

In [1]:
import pickle

from sklearn.neighbors import NearestNeighbors


## Load K-Means Outputs

In [2]:
with open("mf_results.pickle", "rb") as f:
    mf = pickle.load(f)
with open("fw_results.pickle", "rb") as f:
    fw = pickle.load(f)

## Define Comparison Function

In [22]:
def similarity_comp(df, player_name, player_season):
    """
    Print the 10 nearest player-seasons to a specified
    player-season (the first result is the query itself), using
    an unsupervised nearest neighbors algorithm on that player's
    position group.
    """
    # Columns 0-3 hold the distances to each K-Means cluster.
    cols = [0, 1, 2, 3]
    n_nbrs = NearestNeighbors(n_neighbors=10).fit(df[cols])
    # Select the cluster distances for the requested player-season.
    query = df.loc[
        (df["player"] == player_name) & (df["season"] == player_season), cols
    ]
    # Positional indices of the 10 nearest rows in df.
    idx = n_nbrs.kneighbors(query, return_distance=False)

    print(df.iloc[idx.ravel()][["player", "season"]])
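
The function above prints its result and, because the query row is part of the fitted data, lists the player-season itself first. A possible variant is sketched below: it returns a DataFrame, includes the distances, and leaves out the query row. The name `similar_players`, its parameters, and the `toy` frame are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def similar_players(df, player_name, player_season, n_similar=3, dist_cols=(0, 1, 2, 3)):
    """Return the n_similar nearest player-seasons, excluding the query itself."""
    mask = ((df["player"] == player_name) & (df["season"] == player_season)).to_numpy()
    if not mask.any():
        raise ValueError(f"No row found for {player_name!r} in season {player_season}")
    features = df[list(dist_cols)].to_numpy()
    query_pos = np.flatnonzero(mask)[0]
    # Ask for one extra neighbor, since the query row is in the fitted data.
    model = NearestNeighbors(n_neighbors=n_similar + 1).fit(features)
    distances, positions = model.kneighbors(features[[query_pos]])
    # Filter out the query row by position, then attach the distances.
    keep = positions.ravel() != query_pos
    result = df.iloc[positions.ravel()[keep]][["player", "season"]].copy()
    result["distance"] = distances.ravel()[keep]
    return result.head(n_similar)


# Hypothetical toy data: columns 0-3 stand in for cluster distances.
toy = pd.DataFrame(
    {
        "player": ["A", "B", "C", "D", "E"],
        "season": [1920] * 5,
        0: [0.0, 0.1, 5.0, 0.3, 9.0],
        1: [0.0, 0.0, 5.0, 0.0, 9.0],
        2: [0.0, 0.0, 5.0, 0.0, 9.0],
        3: [0.0, 0.0, 5.0, 0.0, 9.0],
    }
)
result = similar_players(toy, "A", 1920, n_similar=2)
print(result)
```

Returning the distances makes it easier to judge how close each match really is, since a rank alone hides whether the tenth neighbor is nearly as close as the first or much further away.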

## Examples

Below are six examples, three forwards and three midfielders, drawn from players who play relatively different roles. All six players are well known, making it easier to assess how accurate each list of similar players is.

The examples are:

- Forwards: Lewandowski, Sancho, Messi
- Midfielders: Jorginho, Kimmich, De Bruyne

### Forwards

In [23]:
# robert lewandowski
similarity_comp(fw, "Robert Lewandowski", 1920)

                  player  season
2226  Robert Lewandowski    1920
1355       Karim Benzema    1920
2730  Zlatan Ibrahimović    1920
7782         Luis Suárez    1718
4846   Cristiano Ronaldo    1819
5103         Luis Suárez    1819
5361        Duván Zapata    1819
7527   Cristiano Ronaldo    1718
3899       Gabriel Jesus    1819
4505        Oumar Niasse    1819


In [24]:
# jadon sancho
similarity_comp(fw, "Jadon Sancho", 1920)

               player  season
1093     Jadon Sancho    1920
641     Dimitri Payet    1920
1520  Lorenzo Insigne    1920
2084     Paulo Dybala    1920
3869       Alex Iwobi    1819
132    Alexis Sánchez    1920
2861    Marco Asensio    1819
4210     Riyad Mahrez    1819
6905           Malcom    1718
5114             Suso    1819


In [25]:
# lionel messi
similarity_comp(fw, "Lionel Messi", 1920)

               player  season
1506     Lionel Messi    1920
4353     Lionel Messi    1819
7185           Neymar    1718
7044     Lionel Messi    1718
3845     Josip Iličić    1819
6539  Lorenzo Insigne    1718
7603   Alexis Sánchez    1718
1311     Josip Iličić    1920
220    Ángel Di María    1920
2217     Riyad Mahrez    1920


### Midfielders

In [26]:
# jorginho
similarity_comp(mf, "Jorginho", 1920)

                     player  season
1281               Jorginho    1920
6789            Lucas Leiva    1718
2396        Sergio Busquets    1920
2947  Julian Baumgartlinger    1819
4113            Lucas Leiva    1819
1744          Mateo Kovačić    1920
1031     Idrissa Gana Gueye    1920
3916               Jorginho    1819
5545           Milan Badelj    1718
626             Diego Demme    1920


In [8]:
# joshua kimmich
similarity_comp(mf, "Joshua Kimmich", 1920)

                player  season
1305    Joshua Kimmich    1920
2081        Paul Pogba    1920
748        Éver Banega    1920
426      Cesc Fàbregas    1920
4051        Toni Kroos    1819
2564        Toni Kroos    1920
6853      Maxime Lopez    1718
759   Fabián Ruiz Peña    1920
5573       Éver Banega    1718
4690    Miralem Pjanić    1819


In [9]:
# kevin de bruyne
similarity_comp(mf, "Kevin De Bruyne", 1920)

                  player  season
1392     Kevin De Bruyne    1920
3320     Kevin De Bruyne    1819
5959   Philippe Coutinho    1718
576          David Silva    1920
5005         David Silva    1819
6548                Isco    1718
7692         David Silva    1718
7271          Mesut Özil    1718
2308  Ruslan Malinovskyi    1920
3659          Papu Gómez    1819
