# Part 1: 5 New Features



# 1. Investor Irrationality 

Investor irrationality is a significant factor in asset price formation. Prior research constructs an "irrational belief measure" and finds that it negatively predicts future returns.

The measure is constructed with a turnover separation model. The model posits that investor trading is driven by three factors: (1) the investor's exogenous liquidity demand; (2) rational trading based on the firm's fundamental value; and (3) irrational beliefs unrelated to fundamental value. Turnover ($TO$) is decomposed as:

$$ TO = E[\text{trading} | \text{Eliq}] + E[\text{trading} | D] + E[\text{trading} | \text{IRB}] $$

where:

- $TO$ is the observed turnover rate.
- $E[\text{trading} | \text{Eliq}]$ represents trading driven by exogenous liquidity demand.
- $E[\text{trading} | D]$ represents rational trading driven by fundamental value, captured by four indicators: Return on Assets (ROA), Total Asset Growth Rate (INV), Cash Flow (CASH), and Stock Size (SIZE).
- $E[\text{trading} | \text{IRB}]$ represents trading driven by irrational beliefs, i.e., the "irrational belief measure."

Then, the irrational belief measure is deduced as:

$$ E[\text{trading} | \text{IRB}] = TO - E[\text{trading} | \text{Eliq}] - E[\text{trading} | D] $$
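One way to estimate the two expectation terms is an ordinary least squares regression of turnover on the liquidity proxy and the four fundamentals, taking the residual as the irrational belief measure. The sketch below assumes this regression approach and uses synthetic data; the function name `irrational_belief_residual`, the column names, and the coefficients used to generate the sample are illustrative assumptions, not the source study's specification.

```python
import numpy as np
import pandas as pd

def irrational_belief_residual(df,
                               turnover_col="turnover_rate",
                               liquidity_col="liquidity_needs",
                               fundamental_cols=("ROA", "INV", "CASH", "SIZE")):
    """Return the OLS residual of turnover on liquidity and fundamentals.

    Assumption: the conditional expectations in the decomposition are
    approximated by a single linear regression with an intercept.
    """
    predictors = [liquidity_col, *fundamental_cols]
    # Design matrix: intercept column plus the predictors
    X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
    y = df[turnover_col].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Residual = TO - fitted(liquidity) - fitted(fundamentals)
    return pd.Series(y - X @ beta, index=df.index, name="irrational_belief")

# Synthetic sample (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n = 200
sample = pd.DataFrame({
    "liquidity_needs": rng.normal(size=n),
    "ROA": rng.normal(size=n),
    "INV": rng.normal(size=n),
    "CASH": rng.normal(size=n),
    "SIZE": rng.normal(size=n),
})
sample["turnover_rate"] = (
    0.5 * sample["liquidity_needs"] + 0.2 * sample["ROA"] + rng.normal(scale=0.1, size=n)
)

irb = irrational_belief_residual(sample)
print(irb.describe())
```

Because the regression includes an intercept, the residual has zero mean and is uncorrelated with each predictor in-sample, which is the property the decomposition relies on.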


In [1]:
import pandas as pd

# Assumes the DataFrame has columns 'turnover_rate', 'liquidity_needs', 'ROA', 'INV', 'CASH', 'SIZE'

def calculate_irrational_belief(df):
    # Rational trading component based on fundamental values.
    # A fuller version would estimate the coefficients by regressing turnover on
    # these four fundamentals; for simplicity, equal weights are assumed here.
    df['rational_trading'] = (df['ROA'] + df['INV'] + df['CASH'] + df['SIZE']) / 4
    
    # Calculate the irrational belief measure
    df['irrational_belief'] = df['turnover_rate'] - df['liquidity_needs'] - df['rational_trading']
    
    return df

# Example usage
data = {
    'turnover_rate': [0.1, 0.2, 0.15, 0.3],
    'liquidity_needs': [0.02, 0.03, 0.025, 0.04],
    'ROA': [0.05, 0.06, 0.07, 0.08],
    'INV': [0.02, 0.03, 0.025, 0.04],
    'CASH': [0.03, 0.04, 0.035, 0.045],
    'SIZE': [0.04, 0.05, 0.045, 0.055]
}
df = pd.DataFrame(data)

# Calculate irrational belief measure
df = calculate_irrational_belief(df)
print(df[['turnover_rate', 'liquidity_needs', 'rational_trading', 'irrational_belief']])


   turnover_rate  liquidity_needs  rational_trading  irrational_belief
0           0.10            0.020           0.03500            0.04500
1           0.20            0.030           0.04500            0.12500
2           0.15            0.025           0.04375            0.08125
3           0.30            0.040           0.05500            0.20500


# 2. Social Media Buzz

The Social Media Buzz feature tracks the volume of social media mentions and discussions about a particular stock. High levels of buzz can indicate increased interest and potential volatility. This feature provides insights into market trends influenced by public sentiment and hype.

#### Potential Methods for Implementation

1. **Volume of Mentions**: Track the number of mentions of the stock ticker on popular social media platforms (e.g., Twitter, Reddit, StockTwits) over a given time period.
2. **Sentiment Analysis**: Analyze the sentiment (positive, negative, neutral) of the social media mentions to understand the overall market mood towards the stock.
3. **Trend Analysis**: Use moving averages to track the trend in the volume of mentions over time.
4. **Frequency Analysis**: Calculate the frequency of keywords associated with the stock within social media posts.

1. **Volume of Mentions**:
   $$
   \text{Volume of Mentions} = \sum_{i=1}^{N} \text{Mentions}_{i}
   $$
   where $N$ is the number of social media posts mentioning the stock within a given time period.

2. **Sentiment Analysis**:
   $$
   \text{Sentiment Score} = \frac{\sum_{i=1}^{N} \text{Sentiment}_{i} \times \text{Mentions}_{i}}{\text{Volume of Mentions}}
   $$
   where $\text{Sentiment}_{i}$ is the sentiment score of each mention (e.g., +1 for positive, -1 for negative, 0 for neutral).

3. **Trend Analysis (Moving Average)**:
   $$
   \text{Moving Average}_{t} = \frac{1}{k} \sum_{i=0}^{k-1} \text{Volume of Mentions}_{t-i}
   $$
   where $k$ is the window size for the moving average.

4. **Frequency Analysis**:
   $$
   \text{Keyword Frequency} = \frac{\text{Number of Keyword Occurrences}}{\text{Total Number of Words}}
   $$
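The mention-weighted sentiment score and the keyword frequency can be sketched in plain Python. This is a minimal illustration under stated assumptions: the helper names (`count_mentions`, `weighted_sentiment`, `keyword_frequency`), the whole-word matching rule, the simple tokenizer, and the hand-assigned sentiment labels are all hypothetical choices.

```python
import re

def count_mentions(post, ticker):
    """Count whole-word, case-sensitive occurrences of the ticker in one post."""
    return len(re.findall(rf"\b{re.escape(ticker)}\b", post))

def weighted_sentiment(posts, sentiments, ticker):
    """Return (sentiment score, volume of mentions) for a batch of posts."""
    mentions = [count_mentions(p, ticker) for p in posts]
    volume = sum(mentions)
    if volume == 0:
        return 0.0, 0  # no mentions: define the score as 0 by convention
    score = sum(s * m for s, m in zip(sentiments, mentions)) / volume
    return score, volume

def keyword_frequency(posts, keyword):
    """Share of all words across the posts that equal the keyword (case-insensitive)."""
    words = [w for p in posts for w in re.findall(r"[A-Za-z0-9']+", p.lower())]
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

posts = ["AAPL beats earnings, AAPL up", "Selling AAPL today", "Nice weather"]
sentiments = [1, -1, 0]  # hand-assigned labels: +1 positive, -1 negative, 0 neutral

score, volume = weighted_sentiment(posts, sentiments, "AAPL")
freq = keyword_frequency(posts, "earnings")
print(volume, round(score, 4), freq)  # 3 0.3333 0.1
```

Weighting each post by its mention count means a post that names the ticker twice counts twice toward the score, which matches the formula's numerator.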

In [3]:
import pandas as pd
from textblob import TextBlob

# Example DataFrame containing social media posts
data = {
    'timestamp': [
        '2023-07-01 12:00:00', '2023-07-01 12:05:00', '2023-07-01 12:10:00', 
        '2023-07-01 12:15:00', '2023-07-01 12:20:00'
    ],
    'post': [
        "I think AAPL is going to skyrocket!", "AAPL is such a bad investment...",
        "I'm buying more AAPL stocks", "AAPL is neutral for me", "AAPL to the moon!"
    ]
}

# Convert data to DataFrame
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Define the stock ticker
ticker = "AAPL"

# Define function to calculate sentiment score
def get_sentiment(post):
    analysis = TextBlob(post)
    return analysis.sentiment.polarity

# Apply sentiment analysis
df['sentiment'] = df['post'].apply(get_sentiment)

# Filter posts mentioning the ticker (literal match; copy to avoid modifying a slice)
df = df[df['post'].str.contains(ticker, regex=False)].copy()

# Calculate the volume of mentions over time
df.set_index('timestamp', inplace=True)
volume_of_mentions = df.resample('5min').size()

# Calculate the average sentiment over time
average_sentiment = df.resample('5min')['sentiment'].mean()

# Calculate the moving average of mentions (e.g., 3-period moving average)
moving_average_mentions = volume_of_mentions.rolling(window=3).mean()

# Display results
print("Volume of Mentions:\n", volume_of_mentions)
print("\nAverage Sentiment:\n", average_sentiment)
print("\nMoving Average of Mentions:\n", moving_average_mentions)

# Combine features into a single DataFrame
social_media_buzz = pd.DataFrame({
    'Volume of Mentions': volume_of_mentions,
    'Average Sentiment': average_sentiment,
    'Moving Average Mentions': moving_average_mentions
})

# Fill NaN values
social_media_buzz.fillna(0, inplace=True)

# Display the final Social Media Buzz DataFrame
print("\nSocial Media Buzz Features:\n", social_media_buzz)


Volume of Mentions:
 timestamp
2023-07-01 12:00:00    1
2023-07-01 12:05:00    1
2023-07-01 12:10:00    1
2023-07-01 12:15:00    1
2023-07-01 12:20:00    1
Freq: 5min, dtype: int64

Average Sentiment:
 timestamp
2023-07-01 12:00:00    0.00
2023-07-01 12:05:00   -0.35
2023-07-01 12:10:00    0.50
2023-07-01 12:15:00    0.00
2023-07-01 12:20:00    0.00
Freq: 5min, Name: sentiment, dtype: float64

Moving Average of Mentions:
 timestamp
2023-07-01 12:00:00    NaN
2023-07-01 12:05:00    NaN
2023-07-01 12:10:00    1.0
2023-07-01 12:15:00    1.0
2023-07-01 12:20:00    1.0
Freq: 5min, dtype: float64

Social Media Buzz Features:
                      Volume of Mentions  Average Sentiment  \
timestamp                                                    
2023-07-01 12:00:00                   1               0.00   
2023-07-01 12:05:00                   1              -0.35   
2023-07-01 12:10:00                   1               0.50   
2023-07-01 12:15:00                   1               0.00   




# 3. Institutional Holdings Change

**Explanation:** This feature represents the change in the percentage of shares held by institutional investors over a recent period. Changes in institutional holdings can signal shifts in confidence among large, informed investors, which may precede significant stock movements.

### Formulas and Techniques

1. **Percentage Change Calculation**:
   $$
   \text{Institutional Holdings Change} = \frac{\text{Current Institutional Holdings} - \text{Previous Institutional Holdings}}{\text{Previous Institutional Holdings}} \times 100
   $$
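When the data covers several stocks, the percentage change has to be computed within each stock so that a ticker's first observation is not compared against the previous ticker's last one. The sketch below assumes a long-format table with hypothetical tickers `AAA` and `BBB` and made-up holdings values.

```python
import pandas as pd

# Hypothetical long-format holdings data: one row per (ticker, date)
holdings = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"] * 2),
    "institutional_holdings": [50.0, 55.0, 49.5, 20.0, 20.0, 25.0],
})

# Sort so that each ticker's rows are in chronological order
holdings = holdings.sort_values(["ticker", "date"])

# Percentage change relative to the previous observation of the same ticker
holdings["institutional_holdings_change"] = (
    holdings.groupby("ticker")["institutional_holdings"].pct_change() * 100
)

print(holdings)
```

The first row of each ticker is left as `NaN` here rather than filled with 0, so that "no prior observation" stays distinguishable from "no change"; whether to fill it depends on the downstream model.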


In [5]:
import pandas as pd

# Example DataFrame containing institutional holdings data
data = {
    'date': [
        '2023-01-01', '2023-02-01', '2023-03-01', 
        '2023-04-01', '2023-05-01', '2023-06-01'
    ],
    'institutional_holdings': [
        55.0, 57.2, 56.5, 58.0, 59.0, 60.5
    ]
}

# Convert data to DataFrame
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

# Calculate the percentage change in institutional holdings
df['institutional_holdings_change'] = df['institutional_holdings'].pct_change() * 100

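# Fill NaN values resulting from the pct_change calculation
# (assign the result back; chained inplace fillna does not modify the original)
df['institutional_holdings_change'] = df['institutional_holdings_change'].fillna(0)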

# Display the DataFrame with the new feature
print("\nInstitutional Holdings Change:\n", df)

# Example of integrating this feature into a larger feature set
features_df = pd.DataFrame({
    'date': df['date'],
    'Institutional Holdings Change': df['institutional_holdings_change']
})

# Display the final features DataFrame
print("\nFeatures DataFrame:\n", features_df)



Institutional Holdings Change:
         date  institutional_holdings  institutional_holdings_change
0 2023-01-01                    55.0                       0.000000
1 2023-02-01                    57.2                       4.000000
2 2023-03-01                    56.5                      -1.223776
3 2023-04-01                    58.0                       2.654867
4 2023-05-01                    59.0                       1.724138
5 2023-06-01                    60.5                       2.542373

Features DataFrame:
         date  Institutional Holdings Change
0 2023-01-01                       0.000000
1 2023-02-01                       4.000000
2 2023-03-01                      -1.223776
3 2023-04-01                       2.654867
4 2023-05-01                       1.724138
5 2023-06-01                       2.542373




# 4. Firm-Specific Information Delay (FSID)

Prior research finds that information uncertainty amplifies the momentum effect: stocks with higher information uncertainty show stronger momentum. Because firm-specific information is incorporated into stock prices with a lag, companies with higher FSID exhibit higher return volatility and lower market efficiency.


1. **Measure Information Uncertainty:** Use return volatility to identify and measure information uncertainty.
2. **Strategy:** Go long on stocks with high momentum and high information uncertainty, and go short on stocks with low momentum and low information uncertainty.
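The two steps above describe a cross-sectional sort: on a given date, stocks are ranked against each other on momentum and on information uncertainty. A minimal sketch of that sort follows, assuming a snapshot with one row per stock; the function name `momentum_uncertainty_signal`, the 25% cutoff, and the sample values are hypothetical.

```python
import pandas as pd

def momentum_uncertainty_signal(snapshot, quantile=0.25):
    """Assign +1 (long), -1 (short), or 0 to each stock in a one-date snapshot.

    Long: top `quantile` on both momentum and information uncertainty.
    Short: bottom `quantile` on both. Everything else stays at 0.
    """
    mom_rank = snapshot["momentum"].rank(pct=True)          # percentile rank in (0, 1]
    unc_rank = snapshot["info_uncertainty"].rank(pct=True)
    signal = pd.Series(0, index=snapshot.index, name="signal")
    signal[(mom_rank > 1 - quantile) & (unc_rank > 1 - quantile)] = 1
    signal[(mom_rank <= quantile) & (unc_rank <= quantile)] = -1
    return signal

# Hypothetical snapshot: past return (momentum) and return volatility (uncertainty)
snapshot = pd.DataFrame(
    {
        "momentum": [0.30, 0.25, -0.20, -0.15, 0.02, 0.01, 0.00, -0.01, 0.03, -0.02],
        "info_uncertainty": [0.09, 0.08, 0.01, 0.012, 0.04, 0.05, 0.03, 0.045, 0.035, 0.02],
    },
    index=[f"S{i}" for i in range(10)],
)

signal = momentum_uncertainty_signal(snapshot, quantile=0.25)
print(signal[signal != 0])
```

Using `rank(pct=True)` keeps the ranks on a 0 to 1 scale, so the cutoff means the same thing regardless of how many stocks are in the snapshot.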



In [6]:
import pandas as pd
import numpy as np

# Sample DataFrame containing stock prices
data = {
    'date': pd.date_range(start='2022-01-01', periods=200),
    'close': np.random.rand(200) * 100
}

# Convert data to DataFrame
df = pd.DataFrame(data)
df.set_index('date', inplace=True)

# Calculate daily returns
df['returns'] = df['close'].pct_change()

# Calculate information uncertainty (150-day rolling standard deviation of returns)
df['info_uncertainty'] = df['returns'].rolling(window=150).std()

# Calculate momentum (difference between close price and 12-day delayed close price)
df['momentum'] = df['close'] - df['close'].shift(12)

# Rank information uncertainty
df['rank_info_uncertainty'] = df['info_uncertainty'].rank()

# Rank momentum
df['rank_momentum'] = df['momentum'].rank()

# Calculate alpha signal
df['alpha_signal'] = df['rank_info_uncertainty'] * df['rank_momentum']

# Rank alpha signal
df['rank_alpha_signal'] = df['alpha_signal'].rank()

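# Generate final signal.
# rank() returns ordinal ranks (1, 2, ...), so the `> 0.5` test is true for every
# ranked row and the signal equals rank - 1; use rank(pct=True) above if a
# percentile threshold on a 0-1 scale is intended.
df['signal'] = np.where(df['rank_alpha_signal'] > 0.5, -(1 - df['rank_alpha_signal']), df['rank_alpha_signal'])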

# Fill NaN values
df.fillna(0, inplace=True)

# Display the final DataFrame with the calculated features
print("\nCalculated Features:\n", df[['info_uncertainty', 'momentum', 'rank_info_uncertainty', 'rank_momentum', 'alpha_signal', 'signal']].tail())



Calculated Features:
             info_uncertainty   momentum  rank_info_uncertainty  rank_momentum  \
date                                                                            
2022-07-15          9.361564  39.140859                    6.0          148.0   
2022-07-16          9.363633 -60.827077                    7.0           20.0   
2022-07-17          9.364890 -57.718078                    8.0           24.0   
2022-07-18          9.365148 -22.842810                    9.0           66.0   
2022-07-19         36.822365  56.327887                   50.0          170.0   

            alpha_signal  signal  
date                              
2022-07-15         888.0    12.0  
2022-07-16         140.0     2.0  
2022-07-17         192.0     4.0  
2022-07-18         594.0     9.0  
2022-07-19        8500.0    49.0  


# 5. Coin Stocks or Team Stocks?

**Background:**

- **Coin Toss:** Because the probabilities are known, people expect a run of one outcome to reverse.
- **Predicting Team Championship:** People extrapolate from historical performance and expect past champions to win again (momentum).

In stocks, investors who trade heavily on these expectations produce the opposite outcome:

- **Team Stocks:** Likely to show reversal effects.
- **Coin Stocks:** Likely to show momentum effects.

**Predictability Factors:**

- **Low Volatility:** Indicates stable prices, making future trends easier to predict.
- **Decreasing Turnover Rate:** Indicates reduced investor disagreement, increasing predictability.

**Conclusion:**

- **High Predictability (Coin Stocks):** Low volatility and decreasing turnover rate, likely to exhibit momentum effects.
- **Low Predictability (Team Stocks):** High volatility and increasing turnover rate, likely to exhibit reversal effects.
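The classification in the conclusion can be sketched as a per-stock rule on recent volatility and the change in turnover. The code below is an illustration under stated assumptions: the function name `classify_coin_team`, the 20-day window, the median volatility split, the "unclassified" label for mixed cases, and the synthetic tickers `CALM` and `WILD` are all hypothetical.

```python
import numpy as np
import pandas as pd

def classify_coin_team(panel, window=20):
    """Label each ticker 'coin', 'team', or 'unclassified'.

    panel: long-format DataFrame with columns 'ticker', 'returns',
    'turnover_rate', in chronological order within each ticker.
    """
    rows = []
    for ticker, g in panel.groupby("ticker"):
        recent = g.tail(window)                   # most recent window
        prior = g.iloc[-2 * window:-window]       # the window before it
        rows.append({
            "ticker": ticker,
            "volatility": recent["returns"].std(),
            "turnover_change": recent["turnover_rate"].mean() - prior["turnover_rate"].mean(),
        })
    stats = pd.DataFrame(rows).set_index("ticker")

    low_vol = stats["volatility"] <= stats["volatility"].median()
    falling_turnover = stats["turnover_change"] < 0

    stats["label"] = "unclassified"
    stats.loc[low_vol & falling_turnover, "label"] = "coin"    # momentum expected
    stats.loc[~low_vol & ~falling_turnover, "label"] = "team"  # reversal expected
    return stats

# Synthetic two-stock panel (illustrative values only)
rng = np.random.default_rng(42)
days = 40
panel = pd.concat([
    pd.DataFrame({"ticker": "CALM",
                  "returns": rng.normal(0, 0.005, days),
                  "turnover_rate": np.linspace(0.03, 0.01, days)}),
    pd.DataFrame({"ticker": "WILD",
                  "returns": rng.normal(0, 0.04, days),
                  "turnover_rate": np.linspace(0.01, 0.03, days)}),
], ignore_index=True)

stats = classify_coin_team(panel)
print(stats)
```

Here turnover is compared across two adjacent windows, so the rule tracks whether turnover is decreasing or increasing rather than its level.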

In [8]:
import pandas as pd
import numpy as np

# Example DataFrame containing stock data
data = {
    'timestamp': pd.date_range(start='2023-01-01', periods=150, freq='D'),
    'returns': np.random.randn(150),
    'volume': np.random.randint(1000, 5000, size=150),
    'shares_outstanding': np.random.randint(100000, 200000, size=150)
}

# Convert data to DataFrame
df = pd.DataFrame(data)
df.set_index('timestamp', inplace=True)

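# Measure volatility (150-day rolling standard deviation of returns;
# with 150 rows, only the last row has a full window)
df['volatility'] = df['returns'].rolling(window=150).std()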

# Calculate Turnover Rate
df['turnover_rate'] = df['volume'] / df['shares_outstanding']

# Rank Volatility and Turnover Rate
df['rank_volatility'] = df['volatility'].rank()
df['rank_turnover_rate'] = df['turnover_rate'].rank()

# Alpha Signal
df['alpha_signal'] = df['rank_volatility'] + df['rank_turnover_rate']

# Determine trading strategy: +1 above the median alpha signal, -1 otherwise
# (rows with a NaN alpha signal also get -1)
df['trading_signal'] = np.where(df['alpha_signal'] > df['alpha_signal'].median(), 1, -1)

# Display the final DataFrame
print(df[['volatility', 'turnover_rate', 'rank_volatility', 'rank_turnover_rate', 'alpha_signal', 'trading_signal']])


            volatility  turnover_rate  rank_volatility  rank_turnover_rate  \
timestamp                                                                    
2023-01-01         NaN       0.018313              NaN                62.0   
2023-01-02         NaN       0.035054              NaN               134.0   
2023-01-03         NaN       0.010486              NaN                13.0   
2023-01-04         NaN       0.014926              NaN                42.0   
2023-01-05         NaN       0.019828              NaN                72.0   
...                ...            ...              ...                 ...   
2023-05-26         NaN       0.020804              NaN                74.0   
2023-05-27         NaN       0.039512              NaN               142.0   
2023-05-28         NaN       0.007512              NaN                 2.0   
2023-05-29         NaN       0.025845              NaN               105.0   
2023-05-30    1.014945       0.008586              1.0          