### Lesson 3


# Heuristic Portfolio Selection 

![](Lesson3GoalHeaderImage.png)

## 3.1 Introduction

The universe of stocks, both local and global, can baffle investors who wish to make a modest selection of stocks for their portfolios. The sheer number of choices, and the diverse behaviour of each stock both on its own and relative to the others, can make a prudent selection of stocks a daunting task indeed!  
  
Fig. 3.1 illustrates an investor's perception of the challenging task of portfolio selection. 

![Lesson3Fig3_1.png](attachment:Lesson3Fig3_1.png)
<h4 align = "center">Fig. 3.1 The challenging task of Portfolio Selection - an investor's perception </h4>

Also, there is the adage, "**never put all your eggs in the same basket**"! So, how would an investor know that the stocks picked do not all perform similarly? If all the stocks react in the same way to market events, the performance of the whole portfolio is at risk. 

Traditional investment finance offers several answers to these questions, individually and collectively, but a commonly agreed upon answer is **Diversification**. 
Diversification involves investments in different assets or asset classes or markets. A portfolio that comprises such a diversified set of securities can go a long way in mitigating risk since the securities would react differently to market events. 
 

The choice of assets in an investor's portfolio reflects the investor's *risk appetite*, or willingness to take risk. It is an established investment principle that investments offering high returns generally carry high risk, and those that are less risky generally offer low returns.     
  
Therefore, *risk averse* investors would choose to invest a larger share of their capital in instruments such as government bonds, which are notionally risk-free, while *risk-seeking* investors would choose to invest the lion's share of their capital in risky instruments such as equities, currencies or commodities, which can yield quick and high returns and are therefore inherently risky.  
  
Investors may therefore hold portfolios whose mix of securities makes them range from *less risky* through *moderately risky* to *highly risky*, as measured by a "riskometer".

Fig. 3.2 graphically illustrates some investors' portfolio "doughnuts", with the investors' risk appetites and approach to investments symbolically described by the icons placed over the investor labels. Thus, Investor C seems to be *risk averse*, considering the larger investments made in bonds, which are notionally risk-free, compared to equities, which are generally risky. Investor B turns out to be the opposite, owning a *high risk* portfolio that invests the lion's share of the capital in equities. Investor D and Investor A, both owning chequered portfolios that invest less in bonds and more in other securities, could be deemed to be holding *moderately high* and *high risk* portfolios, respectively. 

![Lesson3Fig3_2.png](attachment:Lesson3Fig3_2.png)
<h4 align = "center">Fig. 3.2 Portfolio selection vs Investors' risk appetites </h4>

## 3.2 Diversification Index

A **Diversification Index** quantifies diversification. There are several diversification indices discussed in the literature. The **Diversification Ratio**, proposed and patented by Yves Choueifaty in 2008 [CHO 08, CHO 13], is a diversification index of recent origin, built on the inter-dependence between the assets of a portfolio. The Diversification Ratio is the *ratio of the weighted sum of individual asset volatilities to the portfolio's volatility*. 

Let $N$ be the number of assets in the portfolio, spanning different asset classes or belonging to a specific class. Let $\bar{w}=(w_1,w_2,\ldots,w_N)$ be the weights, or the proportions of capital to be invested in the individual assets of the portfolio, and $\bar{w}'$ its transpose. Let $\bar{\sigma}=(\sigma_1,\sigma_2,\ldots,\sigma_N)$ be the standard deviations of returns on the assets and $V$ the variance-covariance matrix of returns on the assets. The Diversification Ratio of a portfolio is given as follows:  

![](Lesson3Eqn3_1.png)
<h5 align="right">..........(3.1)</h5>


A portfolio that is *most diversified* would yield the *maximal Diversification Ratio*. 
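
As a minimal sketch of the definition above, assuming the usual form of the ratio, $\bar{w}'\bar{\sigma} / \sqrt{\bar{w}' V \bar{w}}$, the Diversification Ratio can be computed with NumPy as shown below. The function name `diversification_ratio` and the two-asset toy numbers are illustrative assumptions and are not part of the lesson's companion code.

```python
import numpy as np

def diversification_ratio(weights, cov):
    """Weighted sum of asset volatilities divided by portfolio volatility."""
    weights = np.asarray(weights, dtype=float)
    cov = np.asarray(cov, dtype=float)
    sigma = np.sqrt(np.diag(cov))                      # individual asset volatilities
    weighted_vol = weights @ sigma                     # numerator: w' sigma
    portfolio_vol = np.sqrt(weights @ cov @ weights)   # denominator: sqrt(w' V w)
    return weighted_vol / portfolio_vol

# Toy example (assumed values): two assets with 20% and 10% volatility
sigma = np.array([0.20, 0.10])
w = np.array([0.5, 0.5])

# Perfectly correlated assets: no diversification benefit
cov_correlated = np.outer(sigma, sigma) * np.ones((2, 2))
dr_correlated = diversification_ratio(w, cov_correlated)

# Uncorrelated assets: the ratio rises above 1
cov_uncorrelated = np.outer(sigma, sigma) * np.eye(2)
dr_uncorrelated = diversification_ratio(w, cov_uncorrelated)

print(round(dr_correlated, 4), round(dr_uncorrelated, 4))
```

In this toy setting the ratio equals 1 when the two assets move in lockstep and exceeds 1 once their returns are uncorrelated, which is the sense in which a larger ratio indicates a more diversified portfolio.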

## 3.3  Clustering

A one-shot solution for a prudent selection of assets from a stock universe, one that also secures the benefits of Diversification, is **Clustering** or **Cluster Analysis**. Clustering deals with the task of grouping a set of physical or abstract objects into classes such that objects within a class exhibit close **similarity** to one another, while simultaneously expressing a strong **dissimilarity** to objects in other classes. 

Fig. 3.3 graphically illustrates clustering of objects represented by points,  into three clusters. It can be easily seen that those points that are in proximity to one another have  grouped themselves into a cluster. Marking the points in each of these clusters with different symbols, visibly shows within-class similarity of objects in the cluster and out-of-class dissimilarity to objects in other clusters.      
  
Thus, clustering exploits the tendency of objects with similar features or behaviour to gravitate towards one another, while moving away from objects that are dissimilar to them in features or behaviour, in keeping with the adage, "*birds of a feather flock together*".  

![Lesson3Fig3_3.png](attachment:Lesson3Fig3_3.png)
<h4 align = "center">Fig. 3.3 Graphical illustration of clustering objects represented by points,  into groups </h4>

Several clustering methods have been explored. **$k$-means clustering** is a cluster analysis technique which groups $N$ objects into $k$ clusters, where $k$ is an input decided by the user. In the case of portfolio selection, if $N$ indicates the size of the stock universe, $k$ could indicate the portfolio size, and the investor can use the input $k$ to choose a small portfolio (typically $k\le30$) or a large portfolio (typically $k>30$).
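
The following sketch illustrates the idea on synthetic data, assuming scikit-learn's `KMeans` is available. The three groups of 2-D points, the seed values and the variable names are assumptions made for illustration only; they have nothing to do with the stock data used later in this lesson.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated groups of 2-D points (synthetic, for illustration only)
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centres])

# k is chosen by the user; here k = 3
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
toy_labels = model.labels_

# Points generated around the same centre should share one cluster label
for g in range(3):
    print('Group', g, 'labels:', np.unique(toy_labels[20 * g: 20 * (g + 1)]))
```

The numeric label given to each group is arbitrary; what matters is that points lying close together receive the same label.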

## 3.4 Case Study

Let us restrict our discussion to the selection of equity stocks; after all, an asset allocation plan typically involves a significant portion of the capital being invested in equities. 

We shall consider a "mini" stock universe of 29 equity stocks of the Dow Jones Industrial Average (DJIA) Index, viz., Apple (AAPL), American Express (AXP), Boeing (BA), Caterpillar (CAT), Cisco Systems (CSCO), Chevron (CVX), Walt Disney (DIS), Goldman Sachs (GS), The Home Depot (HD), IBM (IBM), Intel (INTC), Johnson & Johnson (JNJ), JP Morgan Chase (JPM), Coca-Cola (KO), McDonald's (MCD), 3M (MMM), Merck & Co (MRK), Microsoft (MSFT), Nike (NKE), Pfizer (PFE), Procter & Gamble (PG), Travelers (TRV), UnitedHealth Group (UNH), United Technologies (UTX), Visa (V), Verizon (VZ), Walgreens Boots Alliance (WBA), Walmart (WMT), Exxon Mobil (XOM).   

  
The data set considered is from April 11, 2014 to April 11, 2019. Fig. 3.4 illustrates a snapshot of the DJIA dataset.

![Lesson3Fig3_4.png](attachment:Lesson3Fig3_4.png)
<h4 align = "center">Fig. 3.4 Snapshot of the DJIA dataset </h4>

An investor desires to invest in 15 stocks from this "mini" universe. The following questions arise:   

 > How can the investor decide on which combination of assets among the 29 stocks, is the best?   
 > How can diversification be ensured,  when the assets belong to different sectors and therefore behave differently under varying market conditions?  
  
 $k$-means clustering provides solutions to both questions at once. The following steps and the companion Python code illustrate this **heuristic portfolio selection process**. The Python code employs NumPy, Pandas and scikit-learn to effect portfolio selection using $k$-means clustering.

**Step 1:** ***Undertake data wrangling of the original stock dataset to keep it fit for further processing***.   

(Refer to *Lesson 2 Some glimpses of financial data wrangling* to learn about aspects of financial data wrangling.)  
We assume that the DJIA dataset for the "mini" universe of 29 stocks has already been cleaned; the Python code shown below reads the CSV file concerned. **clusters = 15** represents the portfolio size desired by the investor.

In [2]:
#read stock prices from a cleaned DJIA dataset 

#Dependencies 
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans 

#input stock prices data set 
stockFileName = 'DJIA_Apr112014_Apr112019.csv'
originalRows = 1259   #excluding header
originalColumns = 29  #excluding date
clusters = 15

#read stock dataset into a dataframe 
df = pd.read_csv(stockFileName,  nrows= originalRows)

#extract asset labels
assetLabels = df.columns[1:originalColumns+1].tolist()
print(assetLabels)

#extract stock prices excluding header and trading dates
dfStockPrices = df.iloc[0:, 1:]

#store stock prices as an array
arStockPrices = np.asarray(dfStockPrices)
[rows, cols]= arStockPrices.shape
print(rows, cols)
print(arStockPrices)


['AAPL', 'AXP', 'BA', 'CAT', 'CSCO', 'CVX', 'DIS', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'JPM', 'KO', 'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG', 'TRV', 'UNH', 'UTX', 'V', 'VZ', 'WBA', 'WMT', 'XOM']
1259 29
[[ 74.230003  84.540001 122.07     ...  64.260002  76.5       96.720001]
 [ 74.525711  85.5      123.25     ...  65.669998  77.379997  97.860001]
 [ 73.994286  86.040001 124.269997 ...  66.010002  76.879997  98.68    ]
 ...
 [199.5      109.849998 369.040009 ...  54.5       98.690002  81.93    ]
 [200.619995 110.160004 364.940002 ...  54.509998  99.599998  81.559998]
 [198.949997 109.849998 370.160004 ...  53.439999 100.800003  81.949997]]


**Step 2:** ***Compute asset returns of the stocks***  

(Refer to *Lesson 1 Fundamentals of Risk and Return of a Portfolio* to learn how stock returns are computed.)  
The function **StockReturnsComputing** computes the daily returns of the stocks in the DJIA index, which are then stored in the array **arReturns**, as shown in the Python code fragment given below.

In [3]:
#function for Stock Returns computing 
def StockReturnsComputing(StockPrice, Rows, Columns):
    
    import numpy as np
    
    StockReturn = np.zeros([Rows-1, Columns])
    for j in range(Columns):  # j: Assets
        for i in range(Rows-1):     #i: Daily Prices
            StockReturn[i,j]=((StockPrice[i+1, j]-StockPrice[i,j])/StockPrice[i,j])

    return StockReturn


In [4]:
#compute daily returns of all stocks in the mini universe
arReturns = StockReturnsComputing(arStockPrices, rows, cols)
print('Size of the array of daily returns of stocks:\n', arReturns.shape)
print('Array of daily returns of stocks\n',  arReturns)

Size of the array of daily returns of stocks:
 (1258, 29)
Array of daily returns of stocks
 [[ 0.00398367  0.01135556  0.00966658 ...  0.02194205  0.01150323
   0.0117866 ]
 [-0.00713076  0.0063158   0.00827584 ...  0.00517746 -0.00646162
   0.00837931]
 [ 0.0020272   0.01580661  0.01424321 ...  0.00227241  0.00442253
   0.01276857]
 ...
 [-0.00299853 -0.0076784  -0.01463201 ... -0.01017074 -0.00544191
  -0.01289157]
 [ 0.00561401  0.00282208 -0.01110993 ...  0.00018345  0.00922075
  -0.00451607]
 [-0.00832419 -0.00281414  0.01430373 ... -0.01962941  0.01204824
   0.00478174]]
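
The double loop in **StockReturnsComputing** can also be expressed with NumPy array slicing. The sketch below is an assumed alternative, not the lesson's own implementation; the function name `stock_returns_vectorized` and the toy prices are illustrative. It computes the same simple return, $(P_{t+1}-P_t)/P_t$, for every asset column at once.

```python
import numpy as np

def stock_returns_vectorized(stock_price):
    """Daily simple returns, (P[t+1] - P[t]) / P[t], computed column-wise."""
    stock_price = np.asarray(stock_price, dtype=float)
    # Rows 1..end are the "next day" prices; rows 0..end-1 are the "current day" prices
    return (stock_price[1:] - stock_price[:-1]) / stock_price[:-1]

# Toy prices for two assets over three days (assumed values)
toy_prices = np.array([[100.0, 50.0],
                       [110.0, 45.0],
                       [ 99.0, 54.0]])
toy_returns = stock_returns_vectorized(toy_prices)
print(toy_returns)
```

Applied to **arStockPrices**, such a function would return an array with one row fewer than the price array, just like **arReturns**.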


**Step 3:** ***Compute mean returns and variance-covariance matrix of returns***  

(Refer *Lesson 1 Fundamentals of Risk and Return of a Portfolio* to know about computing the mean returns and variance-covariance matrix of stock returns)  
**meanReturns** and **covReturns** store the outputs of the respective computations. 

In [5]:
#compute mean returns and variance covariance matrix of returns
meanReturns = np.mean(arReturns, axis = 0)
print('Mean returns:\n', meanReturns)
covReturns = np.cov(arReturns, rowvar=False)
#set precision for printing results
np.set_printoptions(precision=5, suppress = True)
print('Size of Variance-Covariance matrix of returns:\n', covReturns.shape)
print('Variance-Covariance matrix of returns:\n', covReturns)

Mean returns:
 [ 9.02759761e-04  2.91002178e-04  9.96644943e-04  3.86567003e-04
  8.09992956e-04  1.55150853e-04  3.97538062e-04  3.31387657e-04
  8.47901377e-04 -1.61102228e-04  7.27422090e-04  3.15481134e-04
  6.04105771e-04  1.91569684e-04  5.65341938e-04  4.42214370e-04
  3.57998589e-04  9.97435940e-04  8.00864357e-04  3.37582842e-04
  2.51423106e-04  4.25465919e-04  9.54807585e-04  1.87158182e-04
  1.01281343e-03  2.31432075e-04 -1.68527771e-05  2.94624799e-04
 -6.10160493e-05]
Size of Variance-Covariance matrix of returns:
 (29, 29)
Variance-Covariance matrix of returns:
 [[0.00024 0.00007 0.0001  0.0001  0.0001  0.00007 0.00007 0.0001  0.00007
  0.00007 0.00011 0.00005 0.00008 0.00003 0.00005 0.00007 0.00005 0.00012
  0.00008 0.00005 0.00004 0.00005 0.00008 0.00007 0.0001  0.00003 0.00007
  0.00004 0.00006]
 [0.00007 0.00016 0.00008 0.0001  0.00007 0.00006 0.00006 0.00011 0.00007
  0.00007 0.00007 0.00005 0.0001  0.00003 0.00004 0.00007 0.00006 0.00008
  0.00007 0.00005 0.00003 

**Step 4:** ***Prepare parameters for k-means clustering*** 

Every asset $A_i$ is characterized by its mean return and the vector of covariances of its returns with those of the other assets $A_j$. For $i = j$, the entry is the variance of its own returns. Thus the characteristic vector for asset $A_i$ is given by $\left[\mu_i, \sigma_{i1},\sigma_{i2},\ldots,\sigma_{ii},\ldots,\sigma_{iN} \right]$, where $\mu_i$ indicates the mean return of asset $A_i$ and $\sigma_{i1},\sigma_{i2},\ldots,\sigma_{ii},\ldots,\sigma_{iN}$ are the variance and covariances of its returns with the other assets. It can be seen that $\left[ \sigma_{i1},\sigma_{i2},\ldots,\sigma_{ii},\ldots,\sigma_{iN} \right]$ is nothing but row $i$ of the variance-covariance matrix $V$ of the $N$ assets in the stock universe. $\sigma_{ii}$, which is the variance of the asset's returns, is the diagonal element of matrix $V$ in row $i$.  

The following Python code shows the gathering of parameters for each of the 29 assets in the stock universe. The parameters are to be provided as inputs to the $k$-means clustering method. The characteristic vector of each asset comprises 30 components, viz., its own mean return as the first element, followed by the 29 elements of its row of the variance-covariance matrix (its own variance and its covariances with the other 28 assets). Thus **assetParameters** holds the characteristic vectors of all the 29 assets in the stock universe.  

In [6]:
#prepare asset parameters for k-means clustering
#reshape for concatenation
meanReturns = meanReturns.reshape(len(meanReturns),1)
assetParameters = np.concatenate([meanReturns, covReturns], axis = 1)
print('Size of the asset parameters for clustering:\n', assetParameters.shape)
print('Asset parameters for clustering:\n', assetParameters)

Size of the asset parameters for clustering:
 (29, 30)
Asset parameters for clustering:
 [[ 0.0009   0.00024  0.00007  0.0001   0.0001   0.0001   0.00007  0.00007
   0.0001   0.00007  0.00007  0.00011  0.00005  0.00008  0.00003  0.00005
   0.00007  0.00005  0.00012  0.00008  0.00005  0.00004  0.00005  0.00008
   0.00007  0.0001   0.00003  0.00007  0.00004  0.00006]
 [ 0.00029  0.00007  0.00016  0.00008  0.0001   0.00007  0.00006  0.00006
   0.00011  0.00007  0.00007  0.00007  0.00005  0.0001   0.00003  0.00004
   0.00007  0.00006  0.00008  0.00007  0.00005  0.00003  0.00005  0.00007
   0.00007  0.00008  0.00003  0.00007  0.00004  0.00005]
 [ 0.001    0.0001   0.00008  0.00023  0.00013  0.00009  0.00008  0.00007
   0.00011  0.00007  0.00008  0.0001   0.00006  0.0001   0.00004  0.00005
   0.00009  0.00006  0.00009  0.00008  0.00005  0.00004  0.00006  0.00007
   0.00009  0.00009  0.00004  0.00007  0.00005  0.00007]
 [ 0.00039  0.0001   0.0001   0.00013  0.00027  0.0001   0.00012  0.00007


**Step 5:** ***Group the assets into clusters using k-means clustering where k =15, which is the portfolio size selected by the investor.***  
  
The Python code below invokes **KMeans** from the scikit-learn library. 
The centroids are special points in the multi-dimensional space; the points representing the asset parameters gravitate towards the centroid they are most similar to, forming a cluster with that centroid as its nucleus. The centroids are listed in the output. Observe that 15 centroids are obtained for $k=15$. Every point in the multi-dimensional space, including each centroid, is of dimension 30. The labels indicate the cluster to which point $i$ among the $N$ points, that is, asset $i$ of the $N$-stock universe, belongs. 

In [7]:
#kmeans clustering of assets using the characteristic vector of 
#mean return and variance-covariance vector of returns

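#algorithm='auto' is no longer accepted by recent scikit-learn releases, so the default algorithm is used
#n_init=10 keeps the library's earlier default number of initialisations
assetsCluster = KMeans(n_clusters=clusters, n_init=10, max_iter=600)
assetsCluster.fit(assetParameters)
print('Clustering of assets completed!') 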
centroids = assetsCluster.cluster_centers_
labels = assetsCluster.labels_

print('Centroids:\n', centroids)
print('Labels:\n', labels)

Clustering of assets completed!
Centroids:
 [[ 0.0006   0.00008  0.0001   0.0001   0.00012  0.00009  0.00009  0.00007
   0.00016  0.00008  0.00008  0.00009  0.00005  0.00017  0.00003  0.00005
   0.00008  0.00007  0.00009  0.00007  0.00006  0.00004  0.00008  0.00008
   0.00008  0.00009  0.00004  0.00007  0.00004  0.00008]
 [ 0.00024  0.00007  0.00012  0.00009  0.0001   0.00007  0.00006  0.00006
   0.00009  0.00006  0.00007  0.00008  0.00005  0.00009  0.00003  0.00004
   0.00007  0.00005  0.00008  0.00006  0.00005  0.00004  0.00005  0.00007
   0.0001   0.00008  0.00004  0.00006  0.00004  0.00005]
 [ 0.00042  0.00006  0.00006  0.00007  0.00008  0.00007  0.00006  0.00008
   0.00008  0.00006  0.00006  0.00007  0.00005  0.00008  0.00004  0.00004
   0.00008  0.00005  0.00007  0.00006  0.00005  0.00004  0.00007  0.00006
   0.00006  0.00007  0.00004  0.00006  0.00004  0.00006]
 [ 0.00082  0.00008  0.00007  0.00008  0.00009  0.00011  0.00006  0.00007
   0.00008  0.0001   0.00007  0.00009  0.0000

**Step 6:**  ***Fix asset labels to points in each cluster***

In [8]:
#fixing asset labels to cluster points
print('Stocks in each of the clusters:\n')
assets = np.array(assetLabels)
for i in range(clusters):
    print('Cluster', i+1)
    clt = np.where(labels == i)
    #use a separate name so that the fitted KMeans model in assetsCluster is not overwritten
    clusterAssets = assets[clt]
    print(clusterAssets)
    

Stocks in each of the clusters:

Cluster 1
['JPM']
Cluster 2
['AXP' 'UTX']
Cluster 3
['DIS' 'MMM' 'TRV']
Cluster 4
['CSCO' 'HD' 'NKE']
Cluster 5
['XOM']
Cluster 6
['CAT' 'GS']
Cluster 7
['AAPL']
Cluster 8
['BA' 'MSFT' 'UNH' 'V']
Cluster 9
['JNJ' 'MRK' 'PFE']
Cluster 10
['KO' 'PG' 'VZ' 'WMT']
Cluster 11
['MCD']
Cluster 12
['WBA']
Cluster 13
['INTC']
Cluster 14
['CVX']
Cluster 15
['IBM']


It can be seen that the 29 assets of the stock universe have been grouped into 15 clusters. The idea conveyed is that all assets in the same cluster behave **similarly** (*intra-cluster similarity*) with regard to their mean and variance-covariance of returns, and are **dissimilar** (*inter-cluster dissimilarity*) with regard to the same characteristics to the assets in other clusters.

Therefore, picking **one asset from each cluster** to gather a portfolio of 15 assets would ensure that the portfolio is well-diversified with regard to these characteristics. The choice could be random or preferential, but restricted to one asset from each cluster. For ease of reference, we term each such choice a **$k$-portfolio**. Needless to say, multiple $k$-portfolios can be generated from these clusters. 
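
A random choice of one asset per cluster could be sketched as follows. The helper `pick_k_portfolio`, its `seed` argument and the six-asset toy labels are hypothetical names introduced here for illustration; they are not part of the lesson's companion code.

```python
import numpy as np

def pick_k_portfolio(asset_labels, cluster_labels, seed=None):
    """Pick one asset at random from every cluster."""
    rng = np.random.default_rng(seed)
    asset_labels = np.asarray(asset_labels)
    cluster_labels = np.asarray(cluster_labels)
    portfolio = []
    for c in np.unique(cluster_labels):
        members = asset_labels[cluster_labels == c]   # assets assigned to cluster c
        portfolio.append(str(rng.choice(members)))    # one random pick per cluster
    return portfolio

# Toy example with 6 assets in 3 clusters (hypothetical labels)
toy_assets = ['A1', 'A2', 'B1', 'C1', 'C2', 'C3']
toy_clusters = [0, 0, 1, 2, 2, 2]
toy_portfolio = pick_k_portfolio(toy_assets, toy_clusters, seed=1)
print(toy_portfolio)
```

With the outputs of Steps 1 and 5, a call such as `pick_k_portfolio(assetLabels, labels)` would return one ticker for each of the 15 clusters. A preferential choice would replace the random pick with the investor's own selection rule.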

Since $k$-means clustering is a heuristic method whose clusters are sensitive to the randomly chosen initial centroids, it may yield different cluster configurations with each run. In practice, an aggregation of the clusters yielded during various runs can be studied to make an appropriate choice of assets, one from each cluster, to construct the well-diversified portfolio.
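
One possible way to study such an aggregation, offered here as an assumption rather than as the procedure followed in [PAI 18], is to count how often each pair of assets lands in the same cluster over several runs with different seeds. The function name `co_association`, the number of runs and the synthetic features below are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_association(features, k, runs=20):
    """Fraction of runs in which each pair of rows shares a cluster."""
    n = features.shape[0]
    together = np.zeros((n, n))
    for seed in range(runs):
        # One initialisation per run, so that run-to-run variation is visible
        run_labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(features).labels_
        together += (run_labels[:, None] == run_labels[None, :])
    return together / runs

# Synthetic stand-in for the asset parameters: 3 tight groups of 4 "assets" each
rng = np.random.default_rng(0)
toy_features = np.vstack([m + rng.normal(scale=0.01, size=(4, 5)) for m in (0.0, 1.0, 2.0)])
toy_co = co_association(toy_features, k=3)
print(np.round(toy_co, 2))
```

Pairs of assets with a value close to 1 are grouped together in nearly every run, so an investor following this sketch would avoid picking both members of such a pair.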

## 3.5 $k$-portfolios - a brief note

Let us suppose that a specific run of the $k$-means clustering algorithm yielded the following clusters for the DJIA index "mini" universe:
[KO  PG  VZ  WMT], [UNH], [DIS  MMM  TRV], [IBM  XOM], [CSCO  INTC], [JPM], [GS], [WBA], [AAPL  MSFT  V], [HD  NKE], [AXP  CVX  UTX], [MCD], [JNJ  MRK  PFE], [BA], [CAT]  
  
The investor is now free to choose one asset from each of the clusters. The choice could be random or guided by individual preferences. The following $k$-portfolios are some sample choices that can be made by investors.  

**$k$-portfolio 1**:  
{ [KO], [UNH], [DIS], [IBM], [CSCO], [JPM], [GS], [WBA], [AAPL], [HD], [AXP], [MCD], [MRK], [BA], [CAT] }  

**$k$-portfolio 2**:  
{ [VZ], [UNH], [MMM], [XOM], [INTC], [JPM], [GS], [WBA], [MSFT], [NKE], [CVX], [MCD], [PFE], [BA], [CAT]}  

**$k$-portfolio 3**:  
{ [WMT], [UNH], [TRV], [IBM], [CSCO], [JPM], [GS], [WBA], [V], [NKE], [UTX], [MCD], [JNJ], [BA], [CAT]}  

An investor can opt to invest in any one of the $k$-portfolios. Fig. 3.5 illustrates the heuristic selection of portfolios, where the portfolio of assets is diversified in behaviour, adhering to the adage "never put all your eggs in the same basket". 

![Lesson3Fig3_5.png](attachment:Lesson3Fig3_5.png)
<h4 align = "center">Fig. 3.5 Heuristic selection of assets to construct a diversified portfolio </h4>

Indeed, several questions do arise on the behavior of the $k$-portfolios, their risk-return tradeoffs and their performance when time tested portfolio construction techniques are applied over them. These shall be covered in detail in the ensuing lessons. 

<h3 align ="left">Companion Reading</h3>  

This work is an abridged adaptation of concepts discussed in Chapter 3 of [PAI 18] to the Dow Jones dataset (DJIA Index: April 2014 to April 2019), implemented in Python using the NumPy, Pandas and scikit-learn libraries. Readers (read "worker bees") seeking more information may refer to the corresponding chapter in the book.


<h3 align="left">References</h3>  

[CHO 08]   Choueifaty Y. and Coignard Y., Toward Maximum Diversification, *The Journal of Portfolio Management*, pp. 40-51, 2008.

[CHO 13]   Choueifaty Y., Froidure T. and Reynier J., Properties of the Most Diversified Portfolio, *Journal of Investment Strategies*, 2(2), pp. 49-70, 2013.  
  
[PAI 18]   Vijayalakshmi Pai G. A., *Metaheuristics for Portfolio Optimization: An Introduction using MATLAB*, Wiley-ISTE, 2018. https://www.mathworks.com/academia/books/metaheuristics-for-portfolio-optimization-pai.html 




#### Next.....Lesson 4: Traditional Methods for Portfolio Construction

![](Lesson3ExitTailImage.png)