In [68]:
import pandas as pd
import numpy as np

## My data looks like this

| Ticker  | Investmentname. | Quarter | shares | Value
| ------  | -----------------  | --------| ------ | -------
| MSFT  | Something Capital|12/31/2019| 1000| $250000
| AAPL  | Blah Capital.          | 3/30/2018 |2000| $500000

I am not trying to solve particular problem. Theoretically the problem that everyone in finance is trying to solve is predict future returns, but there are too many inputs that you can look at, so it's better to find something that you can predict and then see if its useful. The reason I am looking at collaborative filtering and recommendation system is that when I try to describe how institutional stock ownership works in terms that are used in the recommendation systems framework, it does not sound ridiculous and also it might be possible to add pieces on top of it in a way that makes sense.


The simple model for movie recommendations is:
$$r=p_i^T*q_j$$

where $p_i$ is the feature vector for the user and $q_j$ is the feature vector for the movie
the netflix project winning paper also added a bias for each user and also a function of time as well as other variables.


The straight forward translation of the model is there is a set of features that determine whether fund managers like the stock. example of features that i can think of are:
1. The stock is in S&P500. some funds will buy it just so they can track S&P 500
2. The stock is an energy stock and we are in a year like 2013 when the shale production oil production was growing but it was not that big, so there were a lot of stories about how the oil companies would make money and institutional ownership was high.

The complicating features, which make me relatively sure the most simple model won't work are the following:
1. Movie popularity might vary over time, and thats why the netflix paper has the time function built in, but its basically still the same movie, and the features that movies have and users value are also somewhat stationary over time.
   With stocks its different, a stock that's cheap in 2012, might have gone up 100% and is expensive in 2014. so if cheapness was a feature of the stock and why people liked it at the time, it might the oppositve now.
   On the demand side, when leading economic are about to trend down, funds might move from industrials and energy into utilities, or staples, so even if the stocks are the same the preference changes.
2. The data is with delay, 

In [43]:
def pick_stocks(returns, n):
    """
    Picks n stocks with probabilities proportional to their returns.

    :param returns: A 1D numpy array representing the returns of each stock.
    :param n: The number of stocks to pick.
    :return: A list of picked stock indices.
    this is a chatGPT function
    """
    # Normalize the returns to create a probability distribution
    returns = np.array(returns)
    returns = returns - np.min(returns)  # Shift the returns to make them non-negative
    probabilities = returns / np.sum(returns)

    # Pick n stocks using the numpy.random.choice() function
    picked_indices = np.random.choice(len(returns), size=n, replace=False, p=probabilities)
    
    return picked_indices, probabilities

returns = [-.15, 0.10, 0.02, 0.15, 0.03, 0.08, 0.12]
n = 3

picked_indices,probabilities = pick_stocks(returns, n)
print(f"Picked stock indices: {picked_indices}")

probabilities

Picked stock indices: [2 4 1]


array([0.        , 0.17857143, 0.12142857, 0.21428571, 0.12857143,
       0.16428571, 0.19285714])

In [44]:
import numpy as np

def pick_stocks_rank(returns, n):
    """
    Picks n stocks with probabilities proportional to the rank of their returns.

    :param returns: A 1D numpy array representing the returns of each stock.
    :param n: The number of stocks to pick.
    :return: A list of picked stock indices.
    another chatGPT function this one is based on rank instead of return
    """
    # Get the rank of the returns
    sorted_indices = np.argsort(returns)
    ranks = np.empty_like(sorted_indices)
    ranks[sorted_indices] = np.arange(len(returns))

    # Normalize the ranks to create a probability distribution
    probabilities = ranks / np.sum(ranks)

    # Pick n stocks using the numpy.random.choice() function
    picked_indices = np.random.choice(len(returns), size=n, replace=False, p=probabilities)
    
    return picked_indices,probabilities

returns = [0.05, 0.10, 0.02, 0.15, 0.03, 0.08, 0.12]
n = 3

picked_indices,probabilities = pick_stocks_rank(returns, n)
print(f"Picked stock indices: {picked_indices}")
probabilities
#type(picked_indices)

Picked stock indices: [3 1 6]


array([0.0952381 , 0.19047619, 0.        , 0.28571429, 0.04761905,
       0.14285714, 0.23809524])

In [60]:
def pick_stocks_on_preference(num_picks,pref_factor1,pref_factor2,returns,ratings):
    """
    Picks n stocks with probabilities depending on their returns and ratings.
    inputs:
    pref_factor1 - weight of returns
    pref_factor2 - weight of ratings
    only one of the two pref_factors is non-zero
    returns - a numpy array of returns (num_dates,num_stocks)
    ratings - a numpy array of ratings (num_dates,num_stocks)
    num_picks - number of stocks to pick
    outputs:
    picked_indices - a numpy array of the indices of the picked stocks, dimensions are (num_dates,num_picks)
    if pref_factor1 is non-zero, if pref_factor1 is -1 then pick the stocks with the lowest returns
    if pref_factor1 is 1 then pick the stocks with the highest returns
    if pref_factor2 is non-zero, if pref_factor2 is -1 then pick the stocks with the lowest ratings
    if pref_factor2 is 1 then pick the stocks with the highest ratings
    """
    num_dates=returns.shape[0]
    num_stocks=returns.shape[1]
    picked_indices=np.zeros((num_dates,num_picks))
    #print (f'num_picks={num_picks},pref_factor1={pref_factor1},pref_factor2={pref_factor2},num_dates={num_dates},num_stocks={num_stocks}')
    if pref_factor1!=0:
        if pref_factor1==-1:
            #pick stocks based on returns
            for i in range(num_dates):
                picked_indices[i,:],probabilities=pick_stocks(-returns[i,:],num_picks)
                #print(f'picks={picks},probabilities={probabilities}')
        elif pref_factor1==1:
            for i in range(num_dates):
                picked_indices[i,:],probabilities=pick_stocks(returns[i,:],num_picks)
    elif pref_factor2!=0:
        #pick stocks based on ratings
        if pref_factor2==-1:
            for i in range(num_dates):
                picked_indices[i,:],probabilities=pick_stocks_rank(ratings[i,:],num_picks)
        elif pref_factor2==1:
            for i in range(num_dates):
                picked_indices[i,:],probabilities=pick_stocks_rank(ratings[i,:],num_picks)
    return picked_indices


        




In [67]:
#Create a synthetic data set.  This is a dataframe with 3 columns: ticker, calendardate, investorname, and 1 column: value
#input variables are num_tickers, num_dates, num_investors
#1 generate the date, ticker, and investorname dimensions
#1a generate the date dimension: this should be quarterly dates from starting in 3/30/2009 and going by quarters with num_dates points
#1b generate the ticker dimension: this should be num_tickers tickers
#1c generate the investorname dimension: this should be num_investors investor names

num_tickers = 100
num_dates = 20
num_investors = 100
num_picks=5

#1a generate the date dimension: this should be quarterly dates from starting in 3/30/2009 and going by quarters with num_dates points
#1b generate the ticker dimension: this should be num_tickers tickers
#1c generate the investorname dimension: this should be num_investors investor names

dim_dates=pd.date_range(start='3/30/2009', periods=num_dates, freq='Q') 
dim_tickers=['ticker'+str(i) for i in range(num_tickers)]
dim_investors=['investor'+str(i) for i in range(num_investors)]


#assign the preference to factor 1 or factor 2 to each fund.
#quarter of the funds have a positive preference to factor 1, quarter of the funds have a negative preference to factor 1, 
#quarter of the funds have a positive preferenc to factor 2, quarter of the funds have a negative preference to factor 2
#the preference is either -1, or 1
#the preference is stored in a separate numpy array called pref_factor1 and pref_factor2

#pref_factor1=(np.random.randint(-1,1,num_tickers)+.5)*2
#pref_factor2=(np.random.randint(-1,1,num_tickers)+.5)*2
#now we need to zero out the preferences for half the funds

#pref_factor1[:int(num_tickers/2)]=0
#pref_factor2[int(num_tickers/2):]=0
pref_factor1=np.zeros(num_tickers)
pref_factor2=np.zeros(num_tickers)
pref_factor1[:int(num_tickers/4)]=-1
pref_factor1[int(num_tickers/4):int(num_tickers/2)]=1
pref_factor2[int(num_tickers/2):int(num_tickers*.75)]=-1
pref_factor2[int(num_tickers*.75):]=1


#2 generate the synthetic data set
#2a generate the two factors: factor1 and factor2
# factor 1 is momentum/mean_reversion, factor2 is analyst sentiment
#to generate factor 1, we will generate a random walk with mean 0 and std sigma
#half the funds will use this factor, 

sigma=.3
factor1=np.random.normal(0,sigma*.25,(num_tickers,num_dates))
returns=factor1
factor1
factor2=np.random.randint(0,5,(num_tickers,num_dates))
ratings=factor2

#2b there are two factors and two preferences for each fund.  The preference is either -1 or 1.  The preference is stored in pref_factor1 and pref_factor2
#2c for each of the four combination generate a 2d array of stock indeces for each fund. 
# for factor 1 the probability of picking a stock is proportional to factor
df_out=pd.DataFrame(columns=['ticker','investorname','calendardate','value'])
row_list_dict=[]
for investor in range(num_investors):
    f1=pref_factor1[investor]
    f2=pref_factor2[investor]
    inv_portfolio=pick_stocks_on_preference(num_picks,f1,f2,returns,ratings)
    for d in range(num_dates):
        for s in range(num_picks):
            dict_entry={'ticker':dim_tickers[int(inv_portfolio[d,s])],'investorname':dim_investors[investor],'calendardate':dim_dates[d],'value':1/num_picks}
            row_list_dict.append(dict_entry)
df_out=pd.DataFrame(row_list_dict)


df_out[df_out['investorname']=='investor1']



Unnamed: 0,ticker,investorname,calendardate,value
100,ticker12,investor1,2009-03-31,0.2
101,ticker4,investor1,2009-03-31,0.2
102,ticker14,investor1,2009-03-31,0.2
103,ticker19,investor1,2009-03-31,0.2
104,ticker8,investor1,2009-03-31,0.2
...,...,...,...,...
195,ticker8,investor1,2013-12-31,0.2
196,ticker3,investor1,2013-12-31,0.2
197,ticker19,investor1,2013-12-31,0.2
198,ticker15,investor1,2013-12-31,0.2


In [54]:
import numpy as np

# Create synthetic data for returns and ratings
#chatGPT generated test function
num_dates = 5
num_stocks = 7
returns = np.random.rand(num_dates, num_stocks)
ratings = np.random.rand(num_dates, num_stocks)

# Number of stocks to pick
num_picks = 3
print(f"returns:\n{returns[0,:]}")
print(f"ratings:\n{ratings[0,:]}")

# Test the function with different preference factors
test_cases = [
    {"pref_factor1": -1, "pref_factor2": 0},
    {"pref_factor1": 1, "pref_factor2": 0},
    {"pref_factor1": 0, "pref_factor2": -1},
    {"pref_factor1": 0, "pref_factor2": 1},
]

for test_case in test_cases:
    pref_factor1 = test_case["pref_factor1"]
    pref_factor2 = test_case["pref_factor2"]
    
    picked_indices = pick_stocks_on_preference(num_picks, pref_factor1, pref_factor2, returns, ratings)
    
    print(f"Preference factors: {pref_factor1} (returns), {pref_factor2} (ratings)")
    print(f"Picked stock indices:\n{picked_indices}")
    print("-" * 40)


returns:
[0.09362876 0.94418965 0.63779439 0.4784309  0.15442046 0.24275315
 0.13164508]
ratings:
[0.61593677 0.50435518 0.34398955 0.51756773 0.14967263 0.11469336
 0.77293415]
num_picks=3,pref_factor1=-1,pref_factor2=0,num_dates=5,num_stocks=7
Preference factors: -1 (returns), 0 (ratings)
Picked stock indices:
[[6. 3. 4.]
 [5. 4. 3.]
 [5. 6. 1.]
 [0. 6. 4.]
 [0. 1. 2.]]
----------------------------------------
num_picks=3,pref_factor1=1,pref_factor2=0,num_dates=5,num_stocks=7
Preference factors: 1 (returns), 0 (ratings)
Picked stock indices:
[[1. 2. 3.]
 [2. 1. 0.]
 [1. 4. 0.]
 [1. 5. 4.]
 [5. 6. 2.]]
----------------------------------------
num_picks=3,pref_factor1=0,pref_factor2=-1,num_dates=5,num_stocks=7
Preference factors: 0 (returns), -1 (ratings)
Picked stock indices:
[[2. 0. 1.]
 [5. 6. 3.]
 [5. 6. 0.]
 [3. 0. 5.]
 [2. 6. 4.]]
----------------------------------------
num_picks=3,pref_factor1=0,pref_factor2=1,num_dates=5,num_stocks=7
Preference factors: 0 (returns), 1 (ratings