# Affinity propagation

Adapted from the scikit-learn example [Visualizing the stock market structure](https://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py).

In [1]:
import pandas as pd

companies = {
    "TOT": "Total",
    "XOM": "Exxon",
    "CVX": "Chevron",
    "COP": "ConocoPhillips",
    "VLO": "Valero Energy",
    "MSFT": "Microsoft",
    "IBM": "IBM",
    "TWX": "Time Warner",
    "CMCSA": "Comcast",
    "CVC": "Cablevision",
    "YHOO": "Yahoo",
    "DELL": "Dell",
    "HPQ": "HP",
    "AMZN": "Amazon",
    "TM": "Toyota",
    "CAJ": "Canon",
    "SNE": "Sony",
    "F": "Ford",
    "HMC": "Honda",
    "NAV": "Navistar",
    "NOC": "Northrop Grumman",
    "BA": "Boeing",
    "KO": "Coca Cola",
    "MMM": "3M",
    "MCD": "McDonald's",
    "PEP": "Pepsi",
    "K": "Kellogg",
    "UN": "Unilever",
    "MAR": "Marriott",
    "PG": "Procter Gamble",
    "CL": "Colgate-Palmolive",
    "GE": "General Electrics",
    "WFC": "Wells Fargo",
    "JPM": "JPMorgan Chase",
    "AIG": "AIG",
    "AXP": "American express",
    "BAC": "Bank of America",
    "GS": "Goldman Sachs",
    "AAPL": "Apple",
    "SAP": "SAP",
    "CSCO": "Cisco",
    "TXN": "Texas Instruments",
    "XRX": "Xerox",
    "WMT": "Wal-Mart",
    "HD": "Home Depot",
    "GSK": "GlaxoSmithKline",
    "PFE": "Pfizer",
    "SNY": "Sanofi-Aventis",
    "NVS": "Novartis",
    "KMB": "Kimberly-Clark",
    "R": "Ryder",
    "GD": "General Dynamics",
    "RTN": "Raytheon",
    "CVS": "CVS",
    "CAT": "Caterpillar",
    "DD": "DuPont de Nemours",
}

quotes = []
url = (
    "https://raw.githubusercontent.com/scikit-learn/examples-data/"
    "master/financial-data/{}.csv"
)

for symbol, name in companies.items():
    print(f"Fetching quote history for {name} ({symbol})")
    quotes.append(
        pd.read_csv(url.format(symbol))
        .assign(
            symbol=symbol,
            name=name
        )
    )

quotes = pd.concat(quotes)
quotes['variation'] = quotes.eval('close - open')
quotes.sample(5)


Fetching quote history for Total (TOT)
Fetching quote history for Exxon (XOM)
Fetching quote history for Chevron (CVX)
Fetching quote history for ConocoPhillips (COP)
Fetching quote history for Valero Energy (VLO)
Fetching quote history for Microsoft (MSFT)
Fetching quote history for IBM (IBM)
Fetching quote history for Time Warner (TWX)
Fetching quote history for Comcast (CMCSA)
Fetching quote history for Cablevision (CVC)
Fetching quote history for Yahoo (YHOO)
Fetching quote history for Dell (DELL)
Fetching quote history for HP (HPQ)
Fetching quote history for Amazon (AMZN)
Fetching quote history for Toyota (TM)
Fetching quote history for Canon (CAJ)
Fetching quote history for Sony (SNE)
Fetching quote history for Ford (F)
Fetching quote history for Honda (HMC)
Fetching quote history for Navistar (NAV)
Fetching quote history for Northrop Grumman (NOC)
Fetching quote history for Boeing (BA)
Fetching quote history for Coca Cola (KO)
Fetching quote history for 3M (MMM)
... (output truncated)

| | date | open | close | symbol | name | variation |
|---|---|---|---|---|---|---|
| 1126 | 2007-06-25 | 51.67 | 51.77 | KO | Coca Cola | 0.1 |
| 884 | 2006-07-07 | 64.4 | 63.68 | CVX | Chevron | -0.72 |
| 941 | 2006-09-27 | 66.68 | 67.13 | XOM | Exxon | 0.45 |
| 942 | 2006-09-28 | 65.77 | 65.3 | PEP | Pepsi | -0.47 |
| 1126 | 2007-06-25 | 26.835 | 27.0299 | CSCO | Cisco | 0.1949 |


In [2]:
names = list(companies.values())


In [3]:
len(names)


56

In [4]:
corr = quotes.pivot(index='date', columns='symbol', values='variation').corr()
corr


symbol,AAPL,AIG,AMZN,AXP,BA,BAC,CAJ,CAT,CL,CMCSA,COP,CSCO,CVC,CVS,CVX,DD,DELL,F,GD,GE,GS,GSK,HD,HMC,HPQ,IBM,JPM,K,KMB,KO,MAR,MCD,MMM,MSFT,NAV,NOC,NVS,PEP,PFE,PG,R,RTN,SAP,SNE,SNY,TM,TOT,TWX,TXN,UN,VLO,WFC,WMT,XOM,XRX,YHOO
AAPL,1.0,0.225934,0.354917,0.258708,0.277894,0.23721,0.258311,0.271301,0.160504,0.210789,0.231287,0.314621,0.146344,0.183395,0.275225,0.212841,0.301663,0.138219,0.233791,0.269,0.374568,0.162076,0.248376,0.326159,0.352699,0.36554,0.284099,0.177503,0.190271,0.189102,0.207776,0.181003,0.21908,0.308663,0.277761,0.209165,0.186611,0.192993,0.128519,0.154336,0.229182,0.263949,0.334141,0.283405,0.238692,0.323852,0.274388,0.199013,0.274542,0.193153,0.183458,0.227461,0.179268,0.313432,0.23152,0.258951
AIG,0.225934,1.0,0.257543,0.509882,0.276199,0.503748,0.36276,0.344303,0.288148,0.384152,0.220224,0.368463,0.233585,0.278624,0.326939,0.392659,0.341351,0.308699,0.293095,0.487985,0.434796,0.324794,0.375069,0.361848,0.306621,0.395627,0.534658,0.333765,0.332443,0.338001,0.345645,0.242173,0.394529,0.391125,0.258036,0.287718,0.312348,0.36005,0.318351,0.366537,0.321224,0.254645,0.367395,0.422348,0.324335,0.380504,0.339891,0.337496,0.323935,0.376864,0.138622,0.5086,0.409292,0.322903,0.296388,0.269285
AMZN,0.354917,0.257543,1.0,0.330101,0.23875,0.292897,0.334448,0.305473,0.176198,0.247766,0.188684,0.362122,0.183621,0.241966,0.238414,0.258627,0.316176,0.234507,0.224671,0.318296,0.332471,0.186124,0.329117,0.299988,0.291894,0.370218,0.349385,0.230701,0.228231,0.253194,0.230752,0.226617,0.267799,0.320489,0.276003,0.18769,0.219842,0.226861,0.236136,0.231681,0.259408,0.217892,0.341014,0.341233,0.282251,0.358937,0.238439,0.289035,0.358275,0.272564,0.153306,0.254297,0.2512,0.255889,0.273266,0.488318
AXP,0.258708,0.509882,0.330101,1.0,0.309127,0.549991,0.452237,0.399157,0.279686,0.366383,0.294749,0.423727,0.223595,0.304246,0.409964,0.43703,0.349531,0.302573,0.344445,0.530516,0.57647,0.351108,0.440636,0.419356,0.369833,0.401951,0.623331,0.350794,0.342633,0.361075,0.380372,0.335655,0.388007,0.421247,0.306689,0.301798,0.328837,0.364607,0.332014,0.363419,0.40936,0.29809,0.444611,0.467341,0.390329,0.464094,0.398656,0.362035,0.343401,0.385759,0.205552,0.541227,0.416368,0.417736,0.342724,0.303611
BA,0.277894,0.276199,0.23875,0.309127,1.0,0.269415,0.386255,0.336043,0.195183,0.215768,0.252399,0.309814,0.159311,0.27953,0.267166,0.307798,0.241477,0.206076,0.429154,0.331608,0.345042,0.240863,0.262089,0.372375,0.26407,0.280318,0.304222,0.244612,0.266356,0.222347,0.288736,0.235371,0.267657,0.271961,0.267317,0.377614,0.259115,0.222783,0.211763,0.226504,0.332326,0.408796,0.356008,0.380425,0.286258,0.395889,0.305794,0.193586,0.28115,0.258608,0.1693,0.269182,0.234127,0.289396,0.241061,0.221916
BAC,0.23721,0.503748,0.292897,0.549991,0.269415,1.0,0.414744,0.40979,0.292941,0.426517,0.247243,0.401472,0.281815,0.305887,0.360117,0.428988,0.389575,0.34223,0.333801,0.511234,0.453164,0.360676,0.428986,0.391859,0.356465,0.398089,0.66827,0.338059,0.357312,0.334408,0.359439,0.265666,0.427838,0.415723,0.27917,0.293183,0.357802,0.37729,0.40571,0.387795,0.352547,0.245159,0.412092,0.419477,0.370499,0.405487,0.366518,0.374005,0.315114,0.405973,0.162933,0.615369,0.439729,0.344924,0.314946,0.34161
CAJ,0.258311,0.36276,0.334448,0.452237,0.386255,0.414744,1.0,0.421802,0.180749,0.322225,0.280406,0.439071,0.24294,0.302495,0.339055,0.382951,0.398473,0.341086,0.338632,0.422439,0.417058,0.319199,0.43063,0.54422,0.380653,0.375954,0.437349,0.28203,0.279597,0.30375,0.369473,0.268692,0.34344,0.371727,0.322558,0.283713,0.370664,0.2349,0.300296,0.271845,0.37443,0.271876,0.464243,0.582513,0.415934,0.600562,0.428964,0.309893,0.422166,0.405367,0.236433,0.386756,0.37423,0.345787,0.357042,0.336428
CAT,0.271301,0.344303,0.305473,0.399157,0.336043,0.40979,0.421802,1.0,0.200433,0.297537,0.332707,0.327925,0.171929,0.291181,0.364586,0.402845,0.333406,0.289144,0.308623,0.417726,0.364849,0.245979,0.386265,0.3862,0.323379,0.359425,0.409908,0.263276,0.259562,0.280706,0.349174,0.265649,0.40908,0.328447,0.426997,0.245431,0.27134,0.247267,0.264937,0.264799,0.380124,0.256788,0.422672,0.410815,0.3488,0.409901,0.382294,0.307213,0.352471,0.369215,0.274725,0.353089,0.327112,0.365735,0.27514,0.302337
CL,0.160504,0.288148,0.176198,0.279686,0.195183,0.292941,0.180749,0.200433,1.0,0.22841,0.109091,0.245803,0.123301,0.2102,0.185557,0.264197,0.213714,0.170264,0.247346,0.343785,0.253299,0.25227,0.243259,0.187539,0.22406,0.255153,0.314616,0.284636,0.36941,0.330089,0.247137,0.160549,0.283523,0.302837,0.158492,0.233918,0.195777,0.298067,0.230274,0.452405,0.185527,0.223362,0.224658,0.226033,0.256738,0.199917,0.197137,0.210629,0.154381,0.230462,0.020647,0.302492,0.270507,0.199703,0.181486,0.169819
CMCSA,0.210789,0.384152,0.247766,0.366383,0.215768,0.426517,0.322225,0.297537,0.22841,1.0,0.20201,0.400573,0.429365,0.211362,0.259956,0.288828,0.347974,0.257571,0.240296,0.390467,0.339669,0.306827,0.335061,0.291562,0.333316,0.39366,0.413695,0.242304,0.26005,0.271034,0.314879,0.190989,0.323383,0.417552,0.261207,0.219903,0.281845,0.267158,0.311965,0.328295,0.258826,0.211498,0.328653,0.305572,0.311928,0.307727,0.301165,0.433032,0.278702,0.309251,0.129244,0.377476,0.342357,0.27985,0.260791,0.318138


In [5]:
from sklearn import cluster

_, labels = cluster.affinity_propagation(corr, random_state=42)

for cluster_no in sorted(set(labels)):
    print(f"Cluster {cluster_no}: {', '.join(corr.index[labels == cluster_no])}")


Cluster 0: AAPL, AMZN, YHOO
Cluster 1: CMCSA, CVC, TWX
Cluster 2: COP, CVX, TOT, VLO, XOM
Cluster 3: CSCO, DELL, HPQ, IBM, MMM, MSFT, SAP, TXN
Cluster 4: BA, GD, NOC, RTN
Cluster 5: AIG, AXP, BAC, CAT, CVS, DD, F, GE, GS, HD, JPM, MAR, MCD, R, WFC, WMT
Cluster 6: GSK, NVS, PFE, SNY, UN
Cluster 7: K, KO, PEP
Cluster 8: CL, KMB, PG
Cluster 9: CAJ, HMC, NAV, SNE, TM, XRX


We can also print using the names rather than the symbols:

In [6]:
for cluster_no in sorted(set(labels)):
    print(f"Cluster {cluster_no}: {', '.join(corr.index[labels == cluster_no].map(companies))}")


Cluster 0: Apple, Amazon, Yahoo
Cluster 1: Comcast, Cablevision, Time Warner
Cluster 2: ConocoPhillips, Chevron, Total, Valero Energy, Exxon
Cluster 3: Cisco, Dell, HP, IBM, 3M, Microsoft, SAP, Texas Instruments
Cluster 4: Boeing, General Dynamics, Northrop Grumman, Raytheon
Cluster 5: AIG, American express, Bank of America, Caterpillar, CVS, DuPont de Nemours, Ford, General Electrics, Goldman Sachs, Home Depot, JPMorgan Chase, Marriott, McDonald's, Ryder, Wells Fargo, Wal-Mart
Cluster 6: GlaxoSmithKline, Novartis, Pfizer, Sanofi-Aventis, Unilever
Cluster 7: Kellogg, Coca Cola, Pepsi
Cluster 8: Colgate-Palmolive, Kimberly-Clark, Procter Gamble
Cluster 9: Canon, Honda, Navistar, Sony, Toyota, Xerox


Let's name these clusters by leveraging ChatGPT.

In [7]:
import os
import requests
import dotenv

dotenv.load_dotenv('../../.env')
api_url = 'https://api.openai.com/v1/chat/completions'
headers = {'Authorization': f"Bearer {os.environ['OPEN_AI_API_KEY']}"}

prompt = """
I would like to assign a name to the following stock market groups:

Cluster 1: Apple, Amazon, Yahoo
Cluster 2: Comcast, Cablevision, Time Warner
Cluster 3: ConocoPhillips, Chevron, Total, Valero Energy, Exxon
Cluster 4: Cisco, Dell, HP, IBM, 3M, Microsoft, SAP, Texas Instruments
Cluster 5: Boeing, General Dynamics, Northrop Grumman, Raytheon
Cluster 6: AIG, American express, Bank of America, Caterpillar, CVS, DuPont de Nemours, Ford, General Electrics, Goldman Sachs, Home Depot, JPMorgan Chase, Marriott, McDonald's, Ryder, Wells Fargo, Wal-Mart
Cluster 7: GlaxoSmithKline, Novartis, Pfizer, Sanofi-Aventis, Unilever
Cluster 8: Kellogg, Coca Cola, Pepsi
Cluster 9: Colgate-Palmolive, Kimberly-Clark, Procter Gamble
Cluster 10: Canon, Honda, Navistar, Sony, Toyota, Xerox

Please provide a list of names for each group.
"""

response = requests.post(
    api_url,
    headers=headers,
    json={
        'model': 'gpt-3.5-turbo',
        'messages': [{
            'role': 'user',
            'content': prompt
        }],
    }
)

response.raise_for_status()  # fail loudly on HTTP errors (invalid key, rate limit, ...)
answer = response.json()['choices'][0]['message']['content']
print(answer)


Cluster 1: Tech Titans
Cluster 2: Telecommunications Giants
Cluster 3: Energy Titans
Cluster 4: Tech Powerhouses
Cluster 5: Defense Contractors
Cluster 6: Corporate Titans
Cluster 7: Pharmaceutical Giants
Cluster 8: Beverage Giants
Cluster 9: Household Product Leaders
Cluster 10: Global Technology Manufacturers
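
The prompt above lists the clusters by hand, so it has to be edited whenever the clustering changes. Below is a minimal sketch of building it from the labels instead. `build_cluster_prompt` is a hypothetical helper that is not part of the original notebook, and the toy symbols and labels are made up for illustration.

```python
def build_cluster_prompt(symbols, labels, names):
    """Build the cluster-naming prompt from cluster labels (hypothetical helper).

    symbols: sequence of ticker symbols, aligned with labels
    labels:  sequence of 0-based integer cluster labels
    names:   dict mapping symbol -> company name
    """
    lines = []
    for cluster_no in sorted(set(labels)):
        members = [names[s] for s, l in zip(symbols, labels) if l == cluster_no]
        # Clusters are numbered from 1 in the prompt, as in the hand-written version.
        lines.append(f"Cluster {cluster_no + 1}: {', '.join(members)}")
    return (
        "I would like to assign a name to the following stock market groups:\n\n"
        + "\n".join(lines)
        + "\n\nPlease provide a list of names for each group."
    )


# Toy input, made up for illustration.
toy_names = {"KO": "Coca Cola", "PEP": "Pepsi", "XOM": "Exxon", "CVX": "Chevron"}
toy_symbols = ["CVX", "KO", "PEP", "XOM"]
toy_labels = [0, 1, 1, 0]

toy_prompt = build_cluster_prompt(toy_symbols, toy_labels, toy_names)
print(toy_prompt)
```

In the notebook, the call would presumably be `build_cluster_prompt(corr.index, labels, companies)`.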


We can draw a line chart of the closing prices within one cluster to check that its members do behave similarly.

In [8]:
labels


array([0, 5, 0, 5, 4, 5, 9, 5, 8, 1, 2, 3, 1, 5, 2, 5, 3, 5, 4, 5, 5, 6,
       5, 9, 3, 3, 5, 7, 8, 7, 5, 5, 3, 3, 9, 4, 6, 7, 6, 8, 5, 4, 3, 9,
       6, 9, 2, 1, 3, 6, 2, 5, 5, 2, 9, 0])

In [16]:
import altair as alt

cluster_no = 1
cluster_members = corr.index[labels == cluster_no]
cluster_quotes = quotes.query('symbol in @cluster_members')

chart = alt.Chart(cluster_quotes).mark_line().encode(
    x='yearmonth(date):T',
    y='close:Q',
    color='symbol:N',
    tooltip=['symbol:N', 'date:T', 'variation:Q']
).properties(
    width=600,
    height=400,
    title='Closing Price Over Time by Symbol'
).interactive()
chart


So far we used the Pearson correlation to define the similarity between two stocks. Other similarity measures work too, such as the Kendall rank correlation.

In [10]:
corr = quotes.pivot(index='date', columns='symbol', values='variation').corr(method='kendall')
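
To see how the two measures can differ, here is a small sketch on made-up data. Pearson measures linear association, while Kendall only looks at whether pairs of observations are ordered the same way, so a monotonic but non-linear relationship scores a perfect Kendall correlation and a lower Pearson one. The toy column names `x` and `y` are assumptions for illustration. Note that pandas relies on SciPy for the Kendall method.

```python
import pandas as pd

# Toy data: y increases with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
kendall = toy.corr(method="kendall").loc["x", "y"]

print(f"Pearson: {pearson:.3f}")
print(f"Kendall: {kendall:.3f}")
```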


In [11]:
_, labels = cluster.affinity_propagation(corr, random_state=42)

for cluster_no in sorted(set(labels)):
    print(f"Cluster {cluster_no}: {', '.join(corr.index[labels == cluster_no].map(companies))}")


Cluster 0: Comcast, Cablevision, Time Warner
Cluster 1: ConocoPhillips, Chevron, Total, Valero Energy, Exxon
Cluster 2: AIG, American express, Bank of America, DuPont de Nemours, Ford, Goldman Sachs, Home Depot, JPMorgan Chase, Marriott, Pfizer, Ryder, Wells Fargo, Wal-Mart, Xerox
Cluster 3: Kellogg, Coca Cola, Pepsi
Cluster 4: GlaxoSmithKline, Novartis, Sanofi-Aventis
Cluster 5: Colgate-Palmolive, Kimberly-Clark, Procter Gamble
Cluster 6: Boeing, General Dynamics, Northrop Grumman, Raytheon
Cluster 7: Apple, Amazon, Caterpillar, Cisco, CVS, Dell, General Electrics, HP, IBM, McDonald's, 3M, Microsoft, SAP, Texas Instruments, Unilever, Yahoo
Cluster 8: Canon, Honda, Navistar, Sony, Toyota


Affinity propagation only needs a matrix of pairwise similarities, so it can be applied to anything for which a similarity can be defined: sound waves, text sentences, images, etc. In all cases, an appropriate similarity measure needs to be chosen. In our case, we used correlation coefficients between daily price variations. Correlation compares two series date by date, so it can miss co-movements that are shifted in time, which makes it a debatable choice for time series. We could instead derive a similarity from the [Dynamic Time Warping (DTW)](https://rtavenar.github.io/blog/dtw.html) distance, for example by negating it.
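
A minimal sketch of what that could look like, assuming a textbook DTW with absolute differences as the local cost (other variants use squared differences or window constraints). `dtw_distance` and `dtw_similarity_matrix` are hypothetical helper names, not functions from scikit-learn. The distances are negated because affinity propagation expects similarities, where larger means more alike.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # a[i-1] is matched again with a later b
                cost[i, j - 1],      # b[j-1] is matched again with a later a
                cost[i - 1, j - 1],  # both sequences advance
            )
    return cost[n, m]


def dtw_similarity_matrix(series):
    """Pairwise similarity matrix, defined here as the negated DTW distance."""
    n = len(series)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = -dtw_distance(series[i], series[j])
    return S


# Toy series, made up for illustration: b is a time-shifted copy of a.
a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]
c = [2, 2, 2, 2, 2]

S_dtw = dtw_similarity_matrix([a, b, c])
print(S_dtw)
```

The double loop is quadratic in the series length, so for full price histories a dedicated, optimized DTW implementation would be preferable to this sketch.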

## How does it work?

The method was published in 2007 by Frey and Dueck; see the [paper](https://utstat.toronto.edu/reid/sta414/frey-affinity.pdf).

Affinity propagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples.

Affinity propagation addresses some of the downsides of $k$-means. In $k$-means, a random set of points is chosen as the initial cluster centers. $k$-medoids is a variant in which the centers are constrained to be points from the dataset. In both cases the result depends on the initialization: poorly chosen initial centers can lead to a poor solution.

Instead, affinity propagation considers all the points as potential cluster centers at the start. Then, it iteratively considers messages sent between points until a set of exemplars emerges that best represents the dataset. Points exchange two kinds of messages with each other:

1. *Responsibility* $r(i, k)$, sent from point $i$ to candidate exemplar $k$: how well-suited $k$ is to serve as the exemplar for $i$, compared with the other candidate exemplars for $i$.
2. *Availability* $a(i, k)$, sent from candidate exemplar $k$ to point $i$: how appropriate it would be for $i$ to choose $k$ as its exemplar, given the support $k$ receives from other points.

The algorithm starts by setting all the responsibility and availability values to 0. Then, it iteratively updates the values until they converge. Once they have, each point $i$ is assigned to the point $k$ that maximizes $a(i, k) + r(i, k)$; if $k = i$, point $i$ is itself an exemplar.

Here are the formulas for responsibility and availability:

$$r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \{ a(i, k') + s(i, k') \}$$

$$a(i, k) \leftarrow \min \{ 0, r(k, k) + \sum_{i' \not\in \{i, k\}} \max \{ 0, r(i', k) \} \}$$

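where $s(i, k)$ is the similarity between points $i$ and $k$. The diagonal value $s(k, k)$, called the *preference*, sets how readily point $k$ becomes an exemplar. The availability rule above applies for $i \neq k$; the self-availability is updated as $a(k, k) \leftarrow \sum_{i' \neq k} \max \{ 0, r(i', k) \}$. The updates are repeated until the values converge, or until a maximum number of iterations is reached.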
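
The update rules can be sketched in NumPy as follows. This is a simplified illustration, not scikit-learn's implementation. `affinity_propagation_sketch` is a hypothetical name, the loop runs a fixed number of iterations instead of testing for convergence, and it adds two ingredients the formulas above leave out: a damping factor that blends each update with the previous value, and the separate self-availability $a(k, k) \leftarrow \sum_{i' \neq k} \max \{ 0, r(i', k) \}$. The toy data is made up.

```python
import numpy as np


def affinity_propagation_sketch(S, damping=0.5, n_iter=200):
    """Toy affinity propagation on a square similarity matrix S.

    S[k, k] holds the preference of point k to become an exemplar.
    Returns (exemplar_indices, labels).
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    idx = np.arange(n)

    for _ in range(n_iter):
        # r(i, k) <- s(i, k) - max_{k' != k} [a(i, k') + s(i, k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]    # row-wise maximum
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)  # row-wise runner-up
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i, k) <- min(0, r(k, k) + sum_{i' not in {i, k}} max(0, r(i', k)))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]            # keep r(k, k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp  # drop row i's own contribution
        self_avail = A_new[idx, idx].copy()   # a(k, k): sum over i' != k
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = self_avail
        A = damping * A + (1 - damping) * A_new

    # Point k is an exemplar when a(k, k) + r(k, k) > 0.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    # Every other point joins its most similar exemplar.
    labels = S[:, exemplars].argmax(axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels


# Toy 1-D data with two well-separated groups (made up for illustration).
x = np.array([0.0, 0.1, 0.25, 10.0, 10.1, 10.3])
S = -(x[:, None] - x[None, :]) ** 2  # negative squared distance as similarity
np.fill_diagonal(S, np.median(S))    # a common default for the preference

exemplars, labels = affinity_propagation_sketch(S)
print("exemplars:", exemplars)
print("labels:", labels)
```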

The downside of this algorithm is its cost: $O(N^2 T)$ time, where $N$ is the number of samples and $T$ is the number of iterations until convergence. Each iteration updates a message for every pair of points, and the similarity, responsibility and availability matrices take $O(N^2)$ memory when stored densely. This makes affinity propagation most appropriate for small to medium-sized datasets.
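
As a rough back-of-the-envelope check of the quadratic growth, the sketch below estimates the memory needed for the message matrices alone. It assumes three dense $N \times N$ float64 matrices (similarities, responsibilities, availabilities). Real implementations also allocate temporaries, so these figures are a lower bound. `dense_message_memory_bytes` is a hypothetical helper.

```python
def dense_message_memory_bytes(n_samples, n_matrices=3, bytes_per_value=8):
    """Rough memory for dense S, R and A matrices stored as float64 (assumption)."""
    return n_matrices * bytes_per_value * n_samples ** 2


for n in (56, 10_000, 100_000):
    gigabytes = dense_message_memory_bytes(n) / 1e9
    print(f"N = {n:>7,}: about {gigabytes:,.4f} GB")
```

The 56 stocks used here need well under a megabyte, while 100,000 samples would already need hundreds of gigabytes under these assumptions.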

The benefits of this algorithm are that it does not require the number of clusters to be specified, it can use any similarity matrix, and it works well on small datasets.
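
The number of clusters is still steered indirectly, though. In scikit-learn's `AffinityPropagation` estimator, the `preference` parameter sets how readily each point becomes an exemplar, and it defaults to the median of the input similarities. The sketch below, on made-up 1-D data, compares the default with a preference as high as the largest similarity, which should turn every point into its own exemplar.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy 1-D data with two well-separated groups (made up for illustration).
x = np.array([0.0, 0.1, 0.25, 10.0, 10.1, 10.3])
S = -(x[:, None] - x[None, :]) ** 2  # negative squared distance as similarity

# Default preference: the median of the similarities.
default_model = AffinityPropagation(affinity="precomputed", random_state=42).fit(S)

# Preference equal to the largest similarity: no point gains by joining another.
eager_model = AffinityPropagation(
    affinity="precomputed", preference=0.0, random_state=42
).fit(S)

print("default preference:", len(default_model.cluster_centers_indices_), "clusters")
print("preference = 0:", len(eager_model.cluster_centers_indices_), "clusters")
print("exemplars (default):", x[default_model.cluster_centers_indices_])
```

The `cluster_centers_indices_` attribute also gives direct access to the exemplars, which the function-style call used earlier in this notebook discards.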