In [1]:
# importing necessary libraries for data analysis and visualisation
import numpy as np
import pandas as pd
import apmodule as ap  # course module described in the project description document; provides the backtesting function used below
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 10
plt.rcParams['lines.linewidth'] = 3
%matplotlib inline


# Backtesting
Once we have ascertained that our information signal has some predictive power, we can _back-test it_ to assess whether this predictive power can be translated into profits.

This involves designing a much more realistic trading strategy based on the signal, incorporating issues such as liquidity and transaction costs. A common reason why many algorithmic trading strategies fail in practice is that **the predictive power of the signal is limited to very small shares** where there is very little trading liquidity and transaction costs are high.

## 1. Load the Data
As usual, we start by loading the price data and the data on the signal we want to test. We will need to use this object as a Series, so we call **[`DataFrame.squeeze()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.squeeze.html)** at the end.

In [None]:
# Read the csv file "Prices.csv" into a pandas DataFrame
# 'parse_dates' lists the columns that contain date/time information; here the "date" column is parsed as dates
# set_index(['id', 'date']) builds a MultiIndex from the 'id' and 'date' columns, so the data are arranged hierarchically
# squeeze() then turns the single remaining column into a Series
prices = pd.read_csv('Prices.csv', parse_dates=['date']).set_index(['id', 'date']).squeeze()

# display the first 5 rows of the dataset
prices.head()
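
To see what `.squeeze()` does in this pipeline, here is a minimal sketch that uses a tiny in-memory CSV in place of `Prices.csv`. The ids, dates, prices and the column name `price` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for Prices.csv (all values are made up)
csv_text = """id,date,price
A,2000-01-31,10.0
A,2000-02-29,11.0
B,2000-01-31,20.0
"""

# Same chain as in the notebook, without the final squeeze()
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['date']).set_index(['id', 'date'])
print(type(toy).__name__)         # still a one-column DataFrame

# squeeze() collapses the single column into a Series with the same MultiIndex
toy_series = toy.squeeze()
print(type(toy_series).__name__)
print(toy_series.name)
```

The Series keeps the name of the original column, which is why the column names in the csv files matter later when we join objects together.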

The same for our **information signal**:

In [None]:
# Same as above: read the csv file "GP.csv", set the (id, date) MultiIndex and squeeze into a Series; then display the first 5 rows
factor = pd.read_csv('GP.csv', parse_dates=['date']).set_index(['id', 'date']).squeeze()
factor.head()

We also load data on the **market capitalization** of each stock. This is the product of the price and the number of shares outstanding. The idea is that we will be able to invest more in larger companies without incurring **huge transaction costs** (Warning: market cap is an imperfect proxy for trading liquidity, but this is a topic for another day...).

In [None]:
# same pattern as above, this time for the market capitalization
market_cap = pd.read_csv('Market_cap.csv', parse_dates=['date']).set_index(['id', 'date']).squeeze()
market_cap.head()

___
## 2. Building the Strategy
### 2.1 The "buy" signal
As we remember from our simplified moving average strategy in week 4, the first step to build an algorithmic portfolio is to build the trading signal. Here we are looking at a **long-only** portfolio so we only need a signal that tells us when to buy a stock. 

For the sake of simplicity we will limit our repertoire to a simple ranking strategy where:
1. At the beginning of the period we rank all the shares on the information signal
1. We buy the N shares at the top of the ranking

Although it seems very simplistic this is not far from the reality of the most common factor-based strategy. Both the [Citigroup Industry Rotation Model](https://drive.google.com/file/d/0B1gltCdg5OaJT3cxSWthd2hoUVk/view?usp=sharing&resourcekey=0-hDFDrZVrBtAYjkQZqiVD-A) and the [AQR Value-Momentum Strategy](https://drive.google.com/file/d/11XRh8hob05F09LbwRc9VvHLixysFriEU/view?usp=sharing) follow a similar rule where they buy the _Top-33%_ of the market. In this case they do not define the number but the percentage...
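
As a side note, a percentage cut-off of this kind can be sketched with the `pct=True` option of `rank()`, which returns the rank as a fraction of the number of stocks on that date. The data below are made up, and this is only an illustration of the idea, not the rule used by either of the two strategies mentioned above.

```python
import pandas as pd

# Hypothetical signal for six stocks on a single date (values are made up)
idx = pd.MultiIndex.from_product([list('ABCDEF'), [pd.Timestamp('2000-01-31')]],
                                 names=['id', 'date'])
toy_factor = pd.Series([0.9, 0.1, 0.5, 0.7, 0.3, 0.2], index=idx, name='factor')

# pct=True expresses each rank as a fraction of the group size (1/6, 2/6, ..., 1 here)
pct_rank = toy_factor.groupby('date').rank(ascending=False, pct=True)

# keep the top third of the market on each date
top_third = pct_rank <= 1/3
print(top_third)
```

With six stocks the top third contains the two stocks with the highest signal (A and D in this toy sample).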

We start by defining the number `N` of shares we want to buy:

In [None]:
# Defining variable N, and setting value to 200.
N=200

We use the **[`Series.rank()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html)** method to rank shares in descending order (1=highest factor). Notice that since we want a different rank for every date we need to use the **[`Series.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.groupby.html)** operator. The `method='min'` at the end tells Pandas how to handle "ties", that is, stocks with the same signal value.

In [None]:
# Group the Series 'factor' by the 'date' index level, so that stocks are ranked separately on each date
# .rank(ascending=False, method='min') ranks the values within each group in descending order (1 = highest factor)
# with method='min', tied stocks all receive the lowest rank of the tied group
rank = factor.groupby('date').rank(ascending=False, method='min')
rank.head()
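
The effect of `groupby('date')` and of `method='min'` is easier to see on a toy sample. The sketch below uses made-up signal values for four stocks over two months, with a deliberate tie in January.

```python
import pandas as pd

dates = pd.to_datetime(['2000-01-31'] * 4 + ['2000-02-29'] * 4)
ids = list('ABCD') * 2
idx = pd.MultiIndex.from_arrays([ids, dates], names=['id', 'date'])

# made-up signal values; B and C are tied in January
toy_factor = pd.Series([0.8, 0.5, 0.5, 0.1, 0.2, 0.9, 0.4, 0.6], index=idx)

# ranks restart from 1 on every date because of the groupby
toy_rank = toy_factor.groupby('date').rank(ascending=False, method='min')

# one column per date makes the result easier to read
print(toy_rank.unstack('date'))
```

In January the two tied stocks both receive rank 2 and the next stock receives rank 4, so rank 3 is skipped. Ties at the cut-off can therefore change the number of stocks selected.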

Our **buy signal** is then simply a rank no greater than `N`, so that the top `N` ranks are selected (ties at the cut-off can add a few extra names):

In [None]:
# True for the stocks ranked in the top N on each date, False otherwise
signal = rank <= N
signal.head()

Notice that the logical operator returns a Boolean variable `True/False`. This works for us because internally they are treated as `True = 1` and `False = 0`. If we prefer a numerical expression we can simply use [**`Series.astype()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) to transform the type of the Series **to integer**.

In [None]:
# convert the Boolean signal to integers (1 = buy, 0 = do not buy) and display the first 5 rows
signal = (rank <= N).astype(int)
signal.head()

### 2.2 The Position Size
Now that we have the buy signal we need to **calculate the investment position**. To be more realistic we will **build a cap-weighted strategy**: if stock A and stock B are both in my "buy list" and stock A has a market capitalization double that of stock B, then stock A will have a weight in my portfolio that is double the weight of stock B.

To achieve this we need to join signal and market capitalization:

In [None]:
# rename the Series 'signal' and convert it to a one-column DataFrame with to_frame()
# .join(market_cap) then adds the market capitalization loaded above as a second column, aligned on the (id, date) index
df = signal.rename('signal').to_frame().join(market_cap)
df.head()

The code to calculate the position is not very intuitive. Let's break it down: we need to **divide the market cap** of our share **by the sum** of the market caps of all the shares in the **buy portfolio** in that **specific month**.

To get the sum of the market caps for the shares in the buy portfolio each month we need to use the [**`DataFrame.groupby()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) operator with two variables (`date` and `signal`). This will group observations by month and by whether they are in the buy portfolio or not. 

You will also notice the use of the method [**`DataFrame.transform()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html). I will explain this at the end of the section.

In [None]:
# Weight of each stock within its (date, signal) group:
# the market cap of the stock divided by the total market cap of its group on that date
position = df['market_cap'] / df.groupby(['date', 'signal'])['market_cap'].transform('sum')
position = position * df['signal']  # sets the weight of every stock outside the buy list to 0
position.head()
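
The two steps can be checked by hand on a toy sample. The sketch below uses four made-up stocks on a single date, two of which are in the buy list.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([list('ABCD'), [pd.Timestamp('2000-01-31')]],
                                 names=['id', 'date'])

# made-up data: A and B are in the buy list, C and D are not
toy = pd.DataFrame({'signal': [1, 1, 0, 0],
                    'market_cap': [200.0, 100.0, 500.0, 50.0]}, index=idx)

# total market cap of each (date, signal) group, repeated on every row of the group
group_cap = toy.groupby(['date', 'signal'])['market_cap'].transform('sum')

# cap weight within the group, then zero for the stocks outside the buy list
toy_position = toy['market_cap'] / group_cap * toy['signal']
print(toy_position)
```

Stock A has twice the market cap of stock B, so it gets a weight of 2/3 against 1/3 for B, while C and D get zero even though C is the largest company in the sample.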

It is not easy to inspect the position directly because most observations are `0` (we only buy 200 out of 6000 shares). What we can do is to verify that in every month **the sum of the portfolio weights** is 100%.

In [None]:
# group the positions by the 'date' index level and sum them:
# the result is the total portfolio weight on each date
position.groupby('date').sum()

This is confirmed! So we have a position variable that for every time period (month here) tells us **which percentage of our portfolio will be invested in each stock**. As usual all of this code will be **encapsulated in a function**, but now this function will also perform the **full back-testing** of the strategy. Now that we have the portfolio composition we "only" need to simulate the investment and measure the performance. 

___
### Coding issue: the use of .transform()

Above we had to calculate the sum of the market cap by applying the sum function within the method [**`DataFrame.transform()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html). The reason is that we want an output with the same shape as the original data: for every share and month we want to know the total market cap, in that month, of all the shares with the same buy status (buy or no buy) as the share we are looking at. By using transform we get exactly this: we apply the `sum` function and **return an object with the same index** (shape) as the input.

In [None]:
# group 'df' by 'date' and 'signal', i.e. by month and by buy status
# transform('sum') returns the group total of 'market_cap' repeated on every row of the group
df.groupby(['date', 'signal'])['market_cap'].transform('sum')

If we were to calculate the sum using the regular method [**`DataFrame.sum()`**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) we would return an object that tells us the sum of the market cap, every month, of all the shares in and out of the buy portfolio:

In [None]:
# same grouping, but .sum() returns a single row per (date, signal) group
df.groupby(['date', 'signal'])['market_cap'].sum()
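
The difference in shape between the two results can be verified on a small made-up sample (four stocks, one date):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([list('ABCD'), [pd.Timestamp('2000-01-31')]],
                                 names=['id', 'date'])
toy = pd.DataFrame({'signal': [1, 1, 0, 0],
                    'market_cap': [200.0, 100.0, 500.0, 50.0]}, index=idx)

grouped = toy.groupby(['date', 'signal'])['market_cap']

aggregated = grouped.sum()             # one row per (date, signal) group
broadcast = grouped.transform('sum')   # one row per original observation

print(aggregated.shape, broadcast.shape)
print(broadcast.index.equals(toy.index))
```

Only the second object shares its index with the original data, so only the second one can be used directly in a row-by-row division.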

This is a bit of a complication, but all this code will be encapsulated in a function anyway. 
___

## 3. Backtesting the Position
Now that we have built our position **we can backtest it**: we can simulate the investment strategy using past data and measure its return. The building of the backtesting function is quite complicated. This is mainly due to the **transaction costs**: to estimate the cost of trading we need to keep track of the portfolio composition month by month. The amount of trading that we need to execute at time $t$ is a function not only of the target composition of the portfolio at time $t$, but also of the actual portfolio composition at $t-1$.

Building this function is beyond the coding level required for this subject, and would force us to focus too much on the coding instead of the financial reasoning behind a specific investment strategy. To solve this problem the function has already been **created and stored in our own module** (`apmodule.py`) already imported in this notebook.
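
To give a feeling for why the portfolio held at $t-1$ matters, here is a minimal sketch of the trading required in a single rebalancing step. It is not the code in `apmodule.py`: the numbers are made up, and both the one-way turnover convention (half of the total absolute change in weights) and the way the cost is applied to it are assumptions made for illustration.

```python
import pandas as pd

# made-up numbers for three stocks
w_prev = pd.Series({'A': 0.5, 'B': 0.3, 'C': 0.2})     # weights held after rebalancing at t-1
ret = pd.Series({'A': 0.10, 'B': -0.05, 'C': 0.00})    # simple returns between t-1 and t
w_target = pd.Series({'A': 0.4, 'B': 0.4, 'C': 0.2})   # target weights at t

# before we trade, the weights have drifted with prices
grown = w_prev * (1 + ret)
w_drift = grown / grown.sum()

# assumed convention: one-way turnover is half of the total absolute change in weights
turnover = (w_target - w_drift).abs().sum() / 2

t_cost = 0.0015            # assumed round-trip transaction cost of 0.15%
cost = turnover * t_cost   # cost as a fraction of portfolio value

print(round(turnover, 4), round(cost, 6))
```

Even with unchanged targets the drifted weights would differ from the targets, so some trading is needed at every rebalancing date.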

The backtesting function requires eight inputs and produces four different outputs. The input parameters are:
1. The information **signal** (passed as a series)
1. The stock **prices** (passed as a series)
1. The market **capitalization** (passed as a series)
1. The **Start** Date (this can be omitted)
1. The **End** Date (this can also be omitted)
1. The rebalancing **frequency** (the default value is rebalancing every month)
1. The round-trip **transaction costs** (the default value is zero)
1. The **number of stocks** we want to buy (works similarly to the `N` parameter in our ranking strategy, the default value is 100)

The outputs produced are (in this precise order):
1. A DataFrame with the monthly **return** of the portfolio, the benchmark and the difference between the two (called _Active Return_)
1. A DataFrame with the monthly **turnover** and transaction costs of the strategy
1. A DataFrame with the **composition** of portfolio and benchmark
1. A DataFrame with a number of **performance metrics** for portfolio, benchmark and active portfolio

Before we discuss this in more detail let's take it out for a spin:

In [None]:
# run the backtest with the 'ap.backtesting()' function
# the round-trip transaction cost is set to 0.15%, i.e. t_cost = 0.0015
p_ret, p_turn, p_comp, p_perf = ap.backtesting(factor, prices, market_cap, 
                                               start='1990-01-01', end='2007-12-31', 
                                               frequency=1, t_cost=0.0015, N=200)

Let's have a look at the different outputs:

In [None]:
# monthly returns of the portfolio, the benchmark and the active portfolio
p_ret.head()

In [None]:
# monthly turnover and transaction costs of the strategy
p_turn.head()

In [None]:
# composition of the portfolio and of the benchmark
p_comp.head()

In [None]:
# performance metrics for portfolio, benchmark and active portfolio
p_perf

Before we move to the economic interpretation of the results, let's briefly **discuss some aspects of the function**.

First of all you will notice that the backtesting function starts from the information signal, not from the position. This is because the backtesting function itself **creates the position** based on a ranking strategy similar to the one implemented at the beginning of this notebook. This makes our life easier: we only need one line of code.

Second, this function has been **created specifically** for this subject and needs data and inputs formatted according to the specification of the subject. This is a bit of a let-down, but is in fact a feature common to most Python backtesting libraries (see for example [Backtrader](https://www.backtrader.com/) and [Quantconnect](https://www.quantconnect.com/)). I would like to reiterate that the purpose of this function is to help you test algorithmic strategies to understand the "finance" behind them. If you are interested in a more efficient general-purpose backtesting function, I suggest you look at one of the providers just mentioned.

___
### Exercise 09.01
While many investors and portfolio managers love momentum strategies, others have a feeling that these strategies have overstayed their welcome: they used to perform well in the past but in recent years they have been underwhelming. Load the data on 12-month momentum from `MOM12.zip` and backtest the strategy three times on three different periods:
* 1990 - 1999
* 2000 - 2009
* 2010 - 2019

Is it true that the profitability of this strategy has decreased? 

**STEP 1:** Load the data on 12-months momentum from `MOM12.zip`

In [None]:
# Load the data on 12-months momentum from MOM12.zip
mom12 = pd.read_csv('MOM12.csv', parse_dates=['date']).set_index(['id', 'date']).squeeze()
# Display the first 5 rows of the momentum data
mom12.head()

**STEP 2:** Run the backtesting function 3 times changing `start` and `end` to match the three windows described in the question. Choose different names for the output DataFrames to make them easily recognizable. Use the default values for rebalancing frequency, transaction costs and number of shares. 

In [None]:
# Backtest for the period 1990-1999
m_ret90, m_turn90, m_comp90, m_perf90 = ap.backtesting(mom12, prices, market_cap, start='1990-01-01', end='1999-12-31')

# Backtest for the period 2000-2009
m_ret00, m_turn00, m_comp00, m_perf00 = ap.backtesting(mom12, prices, market_cap, start='2000-01-01', end='2009-12-31')

# Backtest for the period 2010-2019
m_ret10, m_turn10, m_comp10, m_perf10 = ap.backtesting(mom12, prices, market_cap, start='2010-01-01', end='2019-12-31')

**STEP 3:** Show if the performance of the strategy has declined. You can do this by simply showing the performance tables of the three strategies. 

In [None]:
# Performance metrics for the period 1990-1999
m_perf90

In [None]:
# Performance metrics for the period 2000-2009
m_perf00

In [None]:
# Performance metrics for the period 2010-2019
m_perf10

**STEP 4 (Coding Challenge):** Here is something more original. Many times when you want to push a message across, for example convincing your boss to drop this strategy, you want to be as "punchy" as possible. Let's see if you can plot a simple bar graph with the Information Ratio of the three strategies. Of course you could simply copy the three numbers and do the graph in Excel. This would take 90 seconds, **and would be the smart thing to do**...but let's see if you can do it in Python. 

**HINT 1:** To access a single number in the DataFrame you need to use the [**`DataFrame.loc[]`**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) indexer, passing the name of the row and the name of the column in square brackets. For example, to access the percentage of positive returns for the benchmark in the main example in the text you would use:
```python
p_perf.loc['% Positive', 'Benchmark']
```

**HINT 2:** To transform some numbers into a DataFrame you need to use the [**`pandas.DataFrame()`**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) passing the numbers as a list.

In [None]:
# Extract the Return-to-Risk (RR) Ratio of the active portfolio for each period
# The RR Ratio of the active portfolio is the Information Ratio
# Use .loc[] with the row label 'RR Ratio' and the column label 'Active' on each performance table
# Store the three values in a list named 'ir', ordered as 1990-1999, 2000-2009, 2010-2019
ir = [m_perf90.loc['RR Ratio','Active'], m_perf00.loc['RR Ratio','Active'], m_perf10.loc['RR Ratio','Active']]
ir

In [None]:
# Create a DataFrame to hold the Information Ratios calculated for each period
# Use the previously calculated RR Ratios stored in the list 'ir' as the data
# Set the index of the DataFrame to represent the periods: '1990-1999', '2000-2009', and '2010-2019'
# Set the column name as 'Information Ratio' to denote the metric being represented
info_ratio = pd.DataFrame(ir, index=['1990-1999', '2000-2009', '2010-2019'], columns=['Information Ratio'])
info_ratio

In [None]:
# Plot a bar graph to visualize the Information Ratio of the momentum strategy for each backtesting period
# Set the figure size to (10, 6) for better visibility
# Use the .plot() method on the DataFrame 'info_ratio' to create the bar graph
# Specify 'kind='bar'' to create a bar graph
# Set the title of the plot to 'Information Ratio Momentum Strategy' for clarity
# Set the xlabel to 'Backtesting Period' to label the x-axis
# Set the ylabel to 'Annualized Information Ratio' to label the y-axis
# Disable the legend by setting legend=False since it's not necessary for this plot
info_ratio.plot(figsize=(10,6),
                kind='bar',
                title='Information Ratio Momentum Strategy',
                xlabel='Backtesting Period',
                ylabel='Annualized Information Ratio',
                legend=False)

___
## 4. Assessing the performance of the strategy
Once we have run the backtest we are left with a number of outputs that we can use to **see if the strategy is any good**. Here we will see what we can get out of each one of them in terms of **economic interpretation**.

### 4.1 The monthly performance
The first output of the function is a DataFrame with the monthly performance of the strategy (and the benchmark). There isn't much that we can do with this directly (apart from calculating further performance measures as we will see in a minute). For sure we can **plot the performance** to get some nice visual. We have already seen this in the code for the quantile analysis.

We start by calculating the value of the portfolio at the end of every month assuming an initial investment of 1 dollar. Since we are using logarithmic returns this is trivial. We know that if an asset has, in three periods, returns equal to $r_{1}$, $r_{2}$ and $r_{3}$, then the return from time 0 to time 3 is simply the sum
$$r_{0, 3} = \sum_{t=1}^{3}r_{t}$$

And we also know that the dollar value at the end of time 3 of a \$1 investment in time 0 is

$$W_{3} = \$1\times e^{r_{0, 3}}$$

We can simply apply the Numpy exponential function [**`numpy.exp()`**](https://numpy.org/doc/stable/reference/generated/numpy.exp.html) to the cumulative returns calculated using the [**`DataFrame.cumsum()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html) method. 
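
Before applying this to the backtest output, the two formulas can be checked on three made-up monthly log returns. The sketch also verifies that the result matches the usual compounding of the equivalent simple returns.

```python
import numpy as np
import pandas as pd

# made-up monthly log returns
r = pd.Series([0.02, -0.01, 0.03],
              index=pd.to_datetime(['2000-01-31', '2000-02-29', '2000-03-31']))

# value of one dollar invested at time 0: exponential of the cumulative log return
wealth = np.exp(r.cumsum())

# cross-check: compound the equivalent simple returns
simple = np.exp(r) - 1
wealth_check = (1 + simple).cumprod()

print(wealth)
```

The final value equals $e^{0.02 - 0.01 + 0.03} = e^{0.04}$, as implied by the sum of the three log returns.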


In [None]:
# Calculate the cumulative sum of the monthly returns and apply the exponential function to obtain portfolio values over time
# Use the .cumsum() method on the DataFrame 'p_ret' to calculate the cumulative sum of the monthly returns
# Apply the exponential function np.exp() to convert the cumulative returns into portfolio values
port_values = p_ret.cumsum().apply(np.exp)

# Plot the portfolio values over time
# figsize=(10, 6) sets the size of the plot
# title, ylabel and xlabel set the plot title and the axis labels
# rot=0 forces the labels on the x axis to be horizontal
port_values.plot(figsize=(10,6),
                 title='Top-200 Rank Strategy',
                 ylabel='Portfolio Value',
                 xlabel='Time',
                 rot=0)

___
### 4.2 Performance Metrics
The last output of the function is a very handy DataFrame with a number of performance measures, or **diagnostics**.

In [None]:
# Display performance metrics DataFrame
p_perf

This DataFrame is built using a function `diagnostics()` contained in our `apmodule`. It is important that you are familiar with how these measures are built in case you need to build some different ones for your assignment.

We are **already familiar** with how to measure mean return, standard deviation and Return-to-Risk ratio, so there is no need for us to discuss them here. Please notice that the return-to-risk ratio of the active portfolio, also called the **Information Ratio**, is one of the most important performance metrics and is commonly used as the main measure of quality of the strategy. An **IR greater than 1** is the gold standard for a successful strategy. 

After these standard measures, the table also shows some **non-parametric measures of risk**. The first one is the **percentage of months with positive return** (for the active portfolio this means months where the portfolio outperforms the benchmark). The calculation is pretty straightforward:

In [None]:
# Count the number of positive returns in the DataFrame p_ret
# Use boolean indexing to filter out positive returns (returns greater than 0)
# Use the .count() method to count the number of positive returns
p_ret[p_ret>0].count()

In [None]:
# Calculate the proportion of positive returns in the DataFrame p_ret
# Use boolean indexing to filter out positive returns (returns greater than 0)
# Count the number of positive returns using the .count() method
# Divide the count of positive returns by the total number of returns to calculate the proportion
# Store the result in the variable p_positive
p_positive = p_ret[p_ret>0].count()/p_ret.count()

# Display the proportion of positive returns
p_positive

We then gauge the **extreme risk** of the strategy by measuring the **best and worst** monthly return:

In [None]:
# Calculate the worst monthly return by finding the minimum value in the DataFrame p_ret
worst = p_ret.min()
worst

In [None]:
# Calculate the best monthly return by finding the maximum value in the DataFrame p_ret
best = p_ret.max()
best

#### 4.2.1 The Max Drawdown
A performance metric that is often used in portfolio management is the so-called [Max Drawdown (MDD)](https://www.investopedia.com/terms/m/maximum-drawdown-mdd.asp): the maximum continuous loss experienced by the portfolio in a given time frame. For example, the 12-month MDD is the **worst drop in value** experienced by the portfolio **from the peak value** achieved within the last 12 months.  

In [None]:
#we derive the portfolio values from the monthly returns
port_values = p_ret.cumsum().apply(np.exp)

#we take the max portfolio value over the last 12 months
rolling_peak = port_values.rolling(12).max()

#we take the percentage difference between the current value and the rolling peak
drawdown = np.log(port_values/rolling_peak)

#we take the worst drawdown experienced (...called max because it's a negative number...)
max_drawdown = drawdown.min().rename('Max Drawdown')
max_drawdown
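
The 12-month window used above is one possible choice. A common variant measures the drop from the highest value reached since the start of the sample, which can be sketched with `cummax()` instead of a rolling maximum. The returns below are made up, and this variant is shown only for comparison: it is not the definition used in `apmodule`.

```python
import numpy as np
import pandas as pd

# made-up monthly log returns
r = pd.Series([0.05, -0.10, -0.05, 0.08, 0.02],
              index=pd.to_datetime(['2000-01-31', '2000-02-29', '2000-03-31',
                                    '2000-04-30', '2000-05-31']))

values = np.exp(r.cumsum())                     # portfolio value of one dollar invested
running_peak = values.cummax()                  # highest value reached so far
drawdown_full = np.log(values / running_peak)   # log drop from that peak, never positive
max_dd_full = drawdown_full.min()

print(drawdown_full)
print(max_dd_full)
```

In this toy sample the peak is reached in the first month and the worst point comes two months later, after two consecutive losses of 10% and 5% in log terms.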

Some people like to **plot the Drawdown** to provide a visual indication of when the portfolio experienced the worst consecutive loss:

In [None]:
# Plot the drawdown over time
# figsize=(10, 6) sets the size of the plot
# title, ylabel and xlabel set the plot title and the axis labels
# rot=0 forces the labels on the x axis to be horizontal
drawdown.plot(figsize=(10,6),
              title='12-M Drawdown',
              ylabel='Drawdown %',
              xlabel='Time',
              rot=0)

To understand the meaning of the drawdown we can plot the value of the portfolio and observe that the MDD captures the sharp fall in the portfolio value after the **bursting of the dot.com bubble** of March 2000.

![plot.png](attachment:c53d049e-54ca-44e5-923a-f067c7639c9d.png)

All these performance measures can then be included in a simple function. Please notice that this function is available in `apmodule` and is **called automatically** by the backtesting function.

In [None]:
def diagnostics(port_ret):
    mean_return = port_ret.mean()*12     # Calculate the annualized mean return of the portfolio
    volatility = port_ret.std()*np.sqrt(12)    # Calculate the annualized volatility of the portfolio
    rr_ratio = mean_return / volatility    # Calculate the Return-to-Risk Ratio (RR Ratio) of the portfolio
    p_positive = port_ret[port_ret>0].count()/port_ret.count()    # Calculate the proportion of positive returns in the portfolio  
    worst = port_ret.min()     # Calculate the worst monthly return of the portfolio
    best = port_ret.max()     # Calculate the best monthly return of the portfolio
    
    #we derive the portfolio values from the monthly returns
    port_val = port_ret.cumsum().apply(np.exp)
    rolling_peak = port_val.rolling(12).max()     # Calculate the rolling maximum portfolio value over the last 12 months
    drawdown = np.log(port_val/rolling_peak)     # Calculate the drawdown, which represents the percentage difference between the current portfolio value and the rolling peak
    max_drawdown = drawdown.min().rename('Max Drawdown')    # Calculate the maximum drawdown experienced by the portfolio

    # Create a DataFrame to store the calculated performance metrics
    performance = pd.DataFrame({'Mean Return': mean_return,
                                'St. Dev.': volatility,
                                'RR Ratio': rr_ratio,
                                '% Positive': p_positive,
                                'Worst Month': worst,
                                'Best Month': best,
                                'Max DrawDown': max_drawdown})

    # Transpose the DataFrame for better readability
    return performance.transpose()

___
### 4.3 Turnover Analysis
We can also analyze the **dynamics of the turnover** and transaction costs. These are contained in the second output of the backtesting function. Let's, for example, calculate the total turnover for each calendar year:

In [None]:
#we generate a column with the years extracted by the index
p_turn['year'] = p_turn.index.year

#we sum the monthly turnover by year
turn = p_turn.groupby('year')['Turnover'].sum()*100

#we plot
turn.plot(figsize=(10,6),
          kind='bar',
          title='Turnover',
          ylabel='(%)',
          xlabel='Year')

We see that **every year we buy and sell** between 150% and 350% of the portfolio value. This generates a certain amount of annual transaction costs:

In [None]:
#we sum the monthly transaction cost by year
t_cost = p_turn.groupby('year')['T-Cost'].sum()*100

#we plot
t_cost.plot(figsize=(10,6),
            kind='bar',
            title='Transaction Costs',
            ylabel='(%)',
            xlabel='Year')

At the top of the tech bubble this strategy would have generated transaction costs of 0.5% per year. This is more than one-tenth of the average active return of the strategy.

___
### 4.4 Industry Exposure
The third output of the backtesting function is the composition of the portfolio (and of the benchmark). While we mainly care about the performance of a strategy, it is useful to know **in which type of stocks** we are actually investing to keep an eye on our **exposure to different risk factors**. 

One typical example is our **industry exposure**: how much of the portfolio is invested in tech stocks? How much in the hospitality industry? This will, for example, drive the performance of our strategy during the pandemic.

In our strategy, portfolio composition is determined purely by a single information signal...but what if all the companies in a given industry rank very well (or very poorly) based on that signal?

To start, we can load the file `names.zip` that **contains name and industry** affiliation of all the companies in our sample.

In [None]:
names = pd.read_csv('names.csv').set_index('id')
names.head()

Let's now **merge this with our strategy** DataFrame (the one with the target weights).

In [None]:
# Merge the 'names' DataFrame with the strategy DataFrame ('p_comp') using a left join
p_comp = p_comp.join(names, how='left')
p_comp.head()

We can now estimate industry exposure by **summing the weights of all the stocks in each industry**. We want to calculate this exposure every month, so we use a double **`.groupby()`**: we need to group by date and then by industry. 

In [None]:
# Now, estimate industry exposure by summing the weights of all the stocks in each industry
port_exposure = p_comp.groupby(['date', 'industry'])['Portfolio'].sum()
port_exposure.head()

Since we used a double groupby operator, the result has a **multiindex**: every observation is indexed by **date AND industry**. It is easier for us to have **each industry** in a different column:

In [None]:
# Unstack the 'industry' level of the MultiIndex so that each industry becomes a column
port_exposure = port_exposure.unstack(level=1)
port_exposure.head()
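
The group-then-unstack pattern is easier to follow on a toy composition table. In the sketch below the weights and industry labels are made up, and the column names `Portfolio` and `industry` simply mirror the ones used above.

```python
import pandas as pd

dates = pd.to_datetime(['2000-01-31', '2000-02-29'])
idx = pd.MultiIndex.from_product([list('ABC'), dates], names=['id', 'date'])

# made-up weights and industry labels: A and B are tech stocks, C is a retailer
toy_comp = pd.DataFrame({'Portfolio': [0.5, 0.4, 0.3, 0.3, 0.2, 0.3],
                         'industry': ['Tech', 'Tech', 'Tech', 'Tech', 'Retail', 'Retail']},
                        index=idx)

# sum the weights by date and industry, then move the industries into columns
toy_exposure = (toy_comp.groupby(['date', 'industry'])['Portfolio']
                        .sum()
                        .unstack('industry'))
print(toy_exposure)
```

Each row of the result is one date and each column one industry, so every row sums to 100% when the portfolio is fully invested.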

In [None]:
# Plot the industry exposure data
# Use .plot() method with 'kind='area'' to create an area plot
# Set the figure size to (10, 6) for better visibility
# Set the ylabel to 'Portfolio %' to label the y-axis
# Set the xlabel to 'Year' to label the x-axis
# Set the title of the plot to 'Portfolio Industry Exposure' for clarity
port_exposure.plot(figsize=(10,6),
                   kind='area',
                   ylabel='Portfolio %',
                   xlabel='Year',
                   title='Portfolio Industry Exposure')

**What a mess!** This picture is a disaster... Let's see:
* The legend overlaps the plot and covers valuable information
* The colors are repeated so it is difficult to understand what is what
* The grey border looks weird

Let's try to fix them:

In [None]:
port_exposure.plot(figsize=(10,6),
                   kind='area',
                   ylabel='Portfolio %',
                   xlabel='Year',
                   title='Portfolio Industry Exposure',
                   colormap='tab20',
                   ylim=[0,1],
                   xlim=[port_exposure.index.min(), port_exposure.index.max()])

plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

Better...:
* The `xlim=[]` and `ylim=[]` parameters specify the limits for the `x` and `y` axes, and so we get rid of the weird grey border. Notice that for the `x` axis we could have simply written `xlim=[1990, 2007]` but that could break with a different sample. With the current version we take the first and last date from the index of the sample.
* The  `colormap=` parameter changes the colormap from the default one to one with 20 colors so they are not repeated (a list of colormaps can be found [here](https://matplotlib.org/stable/gallery/color/colormap_reference.html)).

When it comes to the **legend**, the problem is that the `pandas` plotting wrapper offers little control over where it is placed. So here we use the **standard** `matplotlib` **syntax** to move the legend: `loc='center left'` together with `bbox_to_anchor=(1, 0.5)` places the left-centre point of the legend box at the right edge of the axes, halfway up. The syntax is a bit complex, so if you want you can read more [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html).
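
As a minimal sketch of the same idea, the `Axes` object returned by `.plot()` can be used directly instead of `plt.legend`. The `toy` frame is made-up data, and the `Agg` backend line is only there so the snippet runs outside a notebook (leave it out when working inline).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Toy exposures (made up for illustration), indexed by year
toy = pd.DataFrame({'Retail': [0.6, 0.5], 'Finance': [0.4, 0.5]},
                   index=[2000, 2001])

# pandas returns the matplotlib Axes it drew on
ax = toy.plot(kind='area', ylim=[0, 1])

# Anchor the legend's left-centre point at (1, 0.5) in axes coordinates,
# i.e. just outside the right edge of the plot, halfway up
legend = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
```

Calling `ax.legend(...)` affects exactly the axes that `pandas` drew on, which avoids any ambiguity when several figures are open.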

While the picture is now readable, the graph still looks weird, but that is a problem with our data: we are building our monthly target weights from accounting data (the Gross Profitability Ratio). Since US companies release accounting information quarterly, we get a lot of seasonality: retail companies, for example, see a boost in revenues and profits in the 4th quarter (Christmas shopping).

To remedy this we can look at the **average yearly exposure**:

In [None]:
#we extract the year from the index
port_exposure['year'] = port_exposure.index.year

#we calculate the mean exposure by year (for simplicity we replace the old DataFrame)
port_exposure = port_exposure.groupby('year').mean()
port_exposure.head()
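
The yearly averaging can also be written without adding a helper column, by grouping directly on the year of the index. This is a minimal sketch on made-up numbers; the `toy` frame is an assumption for illustration only.

```python
import pandas as pd

# Toy exposures (made up): two observations per year, with a seasonal swing
dates = pd.to_datetime(['2000-03-31', '2000-12-31', '2001-03-31', '2001-12-31'])
toy = pd.DataFrame({'Retail': [0.25, 0.75, 0.25, 0.50],
                    'Finance': [0.75, 0.25, 0.75, 0.50]},
                   index=dates)

# Group on the year of the DatetimeIndex and average within each year
yearly = toy.groupby(toy.index.year).mean()
print(yearly)
```

Averaging within each year smooths out the within-year swings, so the resulting frame is indexed by plain integer years.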

In [None]:
#and then we can plot
port_exposure.plot(figsize=(10,6),
                   kind='area',
                   ylabel='Portfolio %',
                   xlabel='Year',
                   title='Portfolio Industry Exposure',
                   colormap='tab20',
                   ylim=[0,1],
                   xlim=[port_exposure.index.min(), port_exposure.index.max()])

plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
#  this code block will generate a plot showing the average yearly exposure of the portfolio to different industries.

Now that the picture is finally fixed we can observe the industry exposure of our portfolio. We see, for example, that after 1997 our exposure to business equipment is greatly reduced. It is not clear **if this is an active bet** of our strategy, or if it simply reflects the dynamics of the general market. To answer this question we need to look at the **industry exposure of our benchmark**.

In [None]:
#we sum the weights of the stocks for each industry and month
bench_exposure = p_comp.groupby(['date', 'industry'])['Benchmark'].sum()

#we put the data of each industry in a different column
bench_exposure = bench_exposure.unstack(level=1)

#we extract the year from the index
bench_exposure['year'] = bench_exposure.index.year

#we take the average exposure within each year and plot
bench_exposure = bench_exposure.groupby('year').mean()

# This code calculates and visualizes the industry exposure of the benchmark over time
bench_exposure.plot(figsize=(10,6),
                    kind='area',
                    ylabel='Composition %',
                    xlabel='Year',
                    title='Benchmark Industry Exposure',
                    colormap='tab20',
                    ylim=[0,1],
                    xlim=[bench_exposure.index.min(), bench_exposure.index.max()])

plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

And from comparing the industry exposure of our portfolio to that of the benchmark we see that our profitability strategy has an **implied long bet** on Business Equipment and Retail, and an **implied short bet** on Finance and Manufacturing.
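
Rather than comparing the two charts by eye, the implied bets can be read off the difference between portfolio and benchmark exposure. The sketch below uses made-up numbers; `port` and `bench` are hypothetical stand-ins for the two yearly exposure tables and are assumed to share the same index and columns.

```python
import pandas as pd

# Toy yearly exposures (made up): each row sums to 1
port = pd.DataFrame({'Retail': [0.50, 0.50],
                     'Finance': [0.25, 0.25],
                     'Other': [0.25, 0.25]}, index=[2000, 2001])
bench = pd.DataFrame({'Retail': [0.25, 0.25],
                      'Finance': [0.50, 0.50],
                      'Other': [0.25, 0.25]}, index=[2000, 2001])

# Active exposure: positive = implied long bet, negative = implied short bet
active = port - bench

# Average active exposure per industry, largest overweight first
avg_active = active.mean().sort_values(ascending=False)
print(avg_active)
```

If an industry appears in only one of the two tables, the plain subtraction yields `NaN`; `port.sub(bench, fill_value=0)` is one way to treat the missing exposure as zero.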

___
### Exercise 09.02
In the profitability strategy above we have assumed that investing in the top 200 shares was a good idea. This is a **parameter that should be analyzed and optimized**.

Use a `for-loop` to run the backtesting of the profitability strategy with four different values for the parameter `N` (keeping all the other parameters as in the main text). Try investing in the top 50, 200, 500 and 1000 shares. After each iteration of the loop, save the information ratio of the strategy. At the end, identify which value of `N` would have been optimal.

**HINT:** To access the information ratio after each iteration, see hint 1 of the previous exercise.

In [None]:
numbers = [50, 200, 500, 1000]

info_ratio=[]

# Run the backtest once for each candidate N (the number of top-ranked shares held).
# Assumes `ap`, `factor`, `prices` and `market_cap` are defined as in the main text.
for number in numbers:

    p_ret, p_turn, p_comp, p_perf = ap.backtesting(factor, prices, market_cap, 
                                                   start='1990-01-01', end='2007-12-31', 
                                                   frequency=3, t_cost=0.0015, N=number)

    # Save the information ratio ('RR Ratio') of the active portfolio for this N
    info_ratio.append(p_perf.loc['RR Ratio', 'Active'])

# Print the information ratios, in the same order as `numbers`
print(info_ratio)

At the end you should observe that the performance of the strategy **improves with a larger number of shares** in the portfolio.
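
To read the result more easily, the list can be paired with the values of `N` it corresponds to. This is a minimal sketch: the numbers in `info_ratio` below are placeholders, NOT backtest output, and should be replaced by the list built in the loop.

```python
import pandas as pd

numbers = [50, 200, 500, 1000]

# Placeholder values for illustration only (not actual backtest results)
info_ratio = [0.1, 0.2, 0.3, 0.4]

# Label each information ratio with the N that produced it
results = pd.Series(info_ratio,
                    index=pd.Index(numbers, name='N'),
                    name='Information ratio')

# idxmax returns the index label (here, the N) of the largest value
best_n = results.idxmax()
print(results)
print('Best N:', best_n)
```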