# Stock Clustering

## Q1. SSE

In [261]:
from sklearn.cluster import KMeans
import numpy as np

with open('clustering_data.csv', 'r') as f:
    data = f.readlines()
    
stock_name_list = []
stock_change_list = []
# skip the header line
for line in data[1:]:
    line = line.split(',')
    stock_name_list.append(line[0])
    stock_change_list.append([float(x) for x in line[1:]])

stock_change = np.array(stock_change_list, dtype='float32')
# n_clusters defaults to 8; stated explicitly to match the k=8 used below
kmeans = KMeans(n_clusters=8, init='random').fit(stock_change)
sse = kmeans.inertia_
sse

1651.9931640625

Result: 1651.99
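
For reference, `inertia_` is the sum of squared Euclidean distances from each sample to its assigned centroid. The following is a minimal sketch that recomputes that sum by hand and compares it with the value reported by scikit-learn. The data here is a synthetic stand-in (an assumption, not `clustering_data.csv`), and `manual_sse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans

def manual_sse(X, centers, labels):
    """Sum of squared distances from each row of X to its assigned centroid."""
    diffs = X - centers[labels]          # centers[labels] picks one centroid per sample
    return float(np.sum(diffs ** 2))

# Synthetic stand-in data (assumption): 30 "stocks" x 25 "weekly changes"
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 25))

km = KMeans(n_clusters=8, init='random', n_init=10, random_state=0).fit(X)
sse_manual = manual_sse(X, km.cluster_centers_, km.labels_)

# The two values should agree up to floating-point rounding
print(sse_manual, km.inertia_)
```

Swapping `X` for the `stock_change` array would apply the same check to the real data.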

## Q2. Decrease SSE
Try to decrease the SSE as much as possible (k=8), and select the two parameters that affect the SSE most.

In [267]:
# parameters varied: init method, number of restarts (n_init), iteration cap (max_iter), tolerance (tol)
kmeans = KMeans(n_clusters=8, init='random', n_init=1000, max_iter=10000).fit(stock_change)
sse = kmeans.inertia_
print(sse)

kmeans = KMeans(n_clusters=8, init='k-means++').fit(stock_change)
sse = kmeans.inertia_
print(sse)

kmeans = KMeans(n_clusters=8, init='k-means++', n_init=1000, max_iter=10000, tol=1e-5).fit(stock_change)
sse = kmeans.inertia_
print(sse)

1579.822509765625
1809.9132080078125
1508.43408203125


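After several tests, the SSE can be reduced to `1508.43` by switching the init method to k-means++, raising the number of restarts (`n_init`) and the iteration cap (`max_iter`), and lowering the tolerance (`tol`). Of these, `init` and `n_init` appear to matter most: k-means++ with default settings gives `1809.91`, which is worse than random initialization with 1000 restarts (`1579.82`), while k-means++ combined with 1000 restarts gives the lowest value.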
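
To illustrate why the number of restarts matters, here is a small sketch that runs k-means once per seed and tracks the best SSE seen so far. It uses synthetic stand-in data (an assumption, not the stock data). `sse_per_restart` is a hypothetical helper, and running one fit per seed is only an approximation of what `n_init` does internally.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (assumption): 30 samples, 25 features
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 25))

def sse_per_restart(X, init, n_restarts, k=8):
    """Fit k-means once per seed (n_init=1) and return the SSE of each run."""
    return [
        float(KMeans(n_clusters=k, init=init, n_init=1, random_state=seed).fit(X).inertia_)
        for seed in range(n_restarts)
    ]

running_best = {}
for init in ('random', 'k-means++'):
    sse_runs = sse_per_restart(X, init, n_restarts=10)
    # Best SSE after 1, 2, ..., 10 restarts; this sequence can only go down or stay flat
    running_best[init] = np.minimum.accumulate(sse_runs)
    print(init, running_best[init][0], running_best[init][-1])
```

Because the result kept is the minimum over all restarts, adding restarts can never make the reported SSE worse. How much it helps depends on the data and on the init method.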

## Q3. Labels

In [268]:
from collections import defaultdict
labels = kmeans.labels_
label_dict = defaultdict(list)
# add stocks to relevant label
for i in range(len(labels)):
    label = int(labels[i])
    label_dict[label].append(stock_name_list[i])
# sort by key
sorted_dict = dict(sorted(label_dict.items()))
sorted_dict

{0: ['Bank of America'],
 1: ['Chevron', 'Pfizer', 'ExxonMobil'],
 2: ['Kraft',
  'Verizon',
  'Procter & Gamble',
  'AT&T',
  'Merck',
  'McDonalds',
  'Coca-Cola',
  'Johnson & Johnson'],
 3: ['American Express',
  'Boeing',
  'Microsoft',
  'Walt Disney',
  'JPMorgan Chase'],
 4: ['Cisco Systems'],
 5: ['Hewlett-Packard'],
 6: ['DuPont', 'Caterpillar', 'Alcoa'],
 7: ['IBM',
  'The Home Depot',
  'Intel',
  'Wal-Mart',
  'General Electric',
  'United Technologies',
  'Travelers',
  '3M']}
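
The dictionary keys above are the 0-based labels returned by `labels_`, while the list below numbers clusters from 1. A minimal sketch of the same grouping with the offset applied is shown here. `group_by_label` is a hypothetical helper, and the four names and labels are a small excerpt of the output above, not the full dataset.

```python
from collections import defaultdict

def group_by_label(names, labels):
    """Group names by cluster label, shifting 0-based labels to 1-based cluster numbers."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label) + 1].append(name)
    return dict(sorted(groups.items()))

# Excerpt of the result above: label 1 -> Cluster2, label 4 -> Cluster5
names = ['Chevron', 'Pfizer', 'Cisco Systems', 'ExxonMobil']
labels = [1, 1, 4, 1]

clusters = group_by_label(names, labels)
print(clusters)
# {2: ['Chevron', 'Pfizer', 'ExxonMobil'], 5: ['Cisco Systems']}
```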

### Label clusters

* Cluster1: **Banking**
  * Companies: Bank of America
  * Explanation: The company provides banking and financial services, including retail banking, investment management, and wealth management services.
* Cluster2: **Energy & Medicine**
  * Companies: Chevron, Pfizer, ExxonMobil
  * Explanation: Chevron and ExxonMobil supply energy products such as oil and gas, while Pfizer produces medicine.
* Cluster3: **Consumer Products**
  * Companies: Kraft, Verizon, Procter & Gamble, AT&T, Merck, McDonalds, Coca-Cola, Johnson & Johnson
  * Explanation: These companies provide consumer products and services such as food, drinks, telecommunications and medicine.
* Cluster4: **Blue-chip**
  * Companies: American Express, Boeing, Microsoft, Walt Disney, JPMorgan Chase
  * Explanation: These companies span different industries, such as finance, aerospace, technology and entertainment, but they are all large companies with reliable performance and growth, so they can be classified as blue-chip companies.
* Cluster5: **Communications**
  * Companies: Cisco Systems
  * Explanation: The company provides communication equipment and solutions.
* Cluster6: **Computer**
  * Companies: Hewlett-Packard
  * Explanation: The company provides computer hardware like laptops and servers.
* Cluster7: **Manufacturing**
  * Companies: DuPont, Caterpillar, Alcoa
  * Explanation: These companies are involved in industrial manufacturing (DuPont, Caterpillar) and raw materials (Alcoa).
* Cluster8: **Technology and Retail**
  * Companies: IBM, The Home Depot, Intel, Wal-Mart, General Electric, United Technologies, Travelers, 3M
  * Explanation: IBM, Intel, General Electric, United Technologies and 3M can be considered technology companies, and The Home Depot and Wal-Mart are retailers. Travelers is an insurance company and does not fit either group.