# Introduction

## Importing libraries

In [None]:
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

print("pandas version: ", pd.__version__)
print("seaborn version:", sns.__version__)
print('numpy version:', np.__version__)

pd.set_option('display.float_format', '{:.2f}'.format)

# Phase 2

## Data Requirements
This data dictionary covers all data that will be used for our model. We have taken it from the data requirement document (which can be found in this folder under the name data_requirements.docx). A changelog will be kept in the data requirement document.

### Data Dictionary
| Dutch Term              | English Term             | Description                                                        | Data Type      | Range                        | Units          | Source          | Quality Standards          |
|-------------------------|--------------------------|--------------------------------------------------------------------|----------------|------------------------------|----------------|-----------------|----------------------------|
| Datum                   | Date                     | The date when the analyst ratings were recorded.                   | Date           | 2022-07-14 to 2022-12-01     | Date           | Marketscreener     | Accurate date required     |
| Kopen                   | Buy                      | Number of 'Buy' ratings from analysts.                             | Numerical      | 0 - 15                       | Ratings        | Marketscreener     | Accurate count             |
| Overpresteren           | Outperform               | Number of 'Outperform' ratings from analysts.                      | Numerical      | 0 - 15                       | Ratings        | Marketscreener     | Accurate count             |
| Houden                  | Hold                     | Number of 'Hold' ratings from analysts.                            | Numerical      | 0 - 15                       | Ratings        | Marketscreener     | Accurate count             |
| Onderpresteren          | Underperform             | Number of 'Underperform' ratings from analysts.                    | Numerical      | 0 - 15                            | Ratings        | Marketscreener     | Accurate count             |
| Verkopen                | Sell                     | Number of 'Sell' ratings from analysts.                            | Numerical      | 0 - 15                            | Ratings        | Marketscreener     | Accurate count             |
| Datum                   | Date                     | The date when the stock market data was recorded.                  | Date           | Varied (from 11-06-2016)     | Date           | Yahoo Finance   | Accurate date required     |
| Open                    | Open                     | Opening price of the stock for the given date.                     | Numerical      | Varied (0 to ∞)              | Currency       | Yahoo Finance   | Accurate financial data    |
| Hoog                    | High                     | Highest price of the stock reached on the given date.              | Numerical      | Varied (0 to ∞)              | Currency       | Yahoo Finance   | Accurate financial data    |
| Laag                    | Low                      | Lowest price of the stock on the given date.                       | Numerical      | Varied (0 to ∞)              | Currency       | Yahoo Finance   | Accurate financial data    |
| Slot                    | Close                    | Closing price of the stock at the end of the trading day on the given date. | Numerical | Varied (0 to ∞)           | Currency       | Yahoo Finance   | Accurate financial data    |
| Aangepaste slot         | Adj Close                | Adjusted closing price after adjustments for all applicable splits and dividend distributions. | Numerical | Varied (0 to ∞) | Currency | Yahoo Finance   | Accurate financial data    |
| Volume                  | Volume                   | Number of shares of the stock traded during the given date.        | Numerical      | Varied (0 to ∞)              | Shares         | Yahoo Finance   | Accurate count             |
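The ranges in the data dictionary can also be checked in code. The following is a minimal sketch, assuming a ratings table with the English column names from the dictionary; the helper `find_range_violations` and the toy values are hypothetical and only serve as illustration.

```python
import pandas as pd

# Column names taken from the data dictionary (English terms)
RATING_COLUMNS = ["Buy", "Outperform", "Hold", "Underperform", "Sell"]


def find_range_violations(ratings: pd.DataFrame, low: int = 0, high: int = 15) -> pd.DataFrame:
    """Hypothetical helper: return the rows where any rating count falls outside [low, high]."""
    counts = ratings[RATING_COLUMNS]
    out_of_range = (counts < low) | (counts > high)
    return ratings[out_of_range.any(axis=1)]


# Toy data for illustration only (not the real Marketscreener values)
toy_ratings = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-14", "2022-07-28", "2022-08-11"]),
    "Buy": [5, 6, 16],           # 16 is above the allowed maximum of 15
    "Outperform": [2, 2, 2],
    "Hold": [1, 1, 1],
    "Underperform": [0, 0, 0],
    "Sell": [0, -1, 0],          # -1 is below the allowed minimum of 0
})

violations = find_range_violations(toy_ratings)
print(violations)
```

An empty result would mean that every count respects the documented range.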


## Data Collection

### Overview

- **Analyst consensus**: Extracted by manually typing over the data from an interactive graph. Currently there are no free sources to download this data.
- **Financial data from ASR**: Looked through the financials from ASR on their website, then extracted the necessary data by typing it over into a new Excel file.
- **Stock data for ASR**: On the Yahoo website there is a page to download the data by Day, Month or Year. The date range can be set before downloading.

### Detailed process
We will explain exactly how we collected the data so these steps can be replicated.

#### Analyst consensus
- **Source**: [marketscreener.com](https://www.marketscreener.com/quote/stock/ASR-NEDERLAND-N-V-28377340/consensus/)
- **Limitation**: The data cannot be downloaded, and only up to 18 months of history is shown.
- **Time Frame**: Data from July 2022 up to January 2024, within a moving time frame (only 18 months back from the current date).
- **Script**: No script used
- **Storage**: Data stored locally in the data folder. Filename: [analyst_consensus_16_07-2022_17-01-2024.xlsx](data/analyst_consensus_16_07-2022_17-01-2024.xlsx).
- **Future Data Addition**: Look for a source with better historical data that is easier to gather


#### Financial data from asr
- **Source**: [asrnederland.nl](https://www.asrnederland.nl/investor-relations/financiele-publicaties)
- **Limitation**: The reports are per half year or full year. A shorter reporting interval would be better, but this is not likely to happen.
- **Time Frame**: Financial data available from 2016 up to 2023. (asr went public in June 2016)
- **Script**: No script used. Downloaded the 'Tables' and 'Financial ratios' under column 'Halfjaarcijfers' in the tab from '2023'. These 2 files contain the data for 2022 and the first half of 2023; the second half is not released yet.
- **Storage**: Data stored locally in the data folder. Filename: [extracted_financial_data_HY2022_FY2022_HY2023.xlsx](data/extracted_financial_data_HY2022_FY2022_HY2023.xlsx).
  The original data from the asr website is stored in the folder named 'Original data asr'.
- **Future Data Addition**: When newer reports are released, these should be downloaded and the data should be added to [extracted_financial_data_HY2022_FY2022_HY2023.xlsx](data/extracted_financial_data_HY2022_FY2022_HY2023.xlsx).

---
- **Important note**: We collected the data according to the new calculations that went into use in 2023. asr has recalculated the values for 2022. This is the statement at the top of their books `all 2022 IFRS figures restated to IFRS17 / 9`<br>**Due to this we won't be using this data for iteration 0**
---

#### Stock data for asr
- **Source**: [finance.yahoo.com](https://finance.yahoo.com/quote/ASRNL.AS/history)
- **Limitation**: 
- **Time Frame**: Stock data available from 2016 up to 2023. (asr went public in June 2016)
- **Script**: No script used. Under the tab history there is the option to download the data from a given date as daily, monthly or yearly data. The download is in a csv format.
- **Storage**: Data stored locally in the data folder. We downloaded the full data available in three files
  - Daily: Filename: [StockPerDay_ASRNL.AS_01-01-2016_19-01-2024.csv](data/StockPerDay_ASRNL.AS_01-01-2016_19-01-2024.csv)
  - Weekly: Filename: [StockPerWeek_ASRNL.AS_01-01-2016_19-01-2024.csv](data/StockPerWeek_ASRNL.AS_01-01-2016_19-01-2024.csv)
  - Yearly: Filename: [StockPerYear_ASRNL.AS_01-01-2016_19-01-2024.csv](data/StockPerYear_ASRNL.AS_01-01-2016_19-01-2024.csv)
- **Future Data Addition**: When newer data is available for `Analyst consensus` or `Financial data from ASR`, the new stock data should be downloaded from the Yahoo source and should replace the existing daily, weekly and yearly files.

### Data Handling
All data handling, including preprocessing and cleaning, will be conducted within this notebook.

## Data understanding & Preparation for analyst consensus
In this part we will look at the data for the analyst consensus. This will be the most important data for our target variable.

We will start with loading the data and taking a first look

In [None]:
AnalystConsensus = pd.read_excel('data/analyst_consensus_16_07-2022_17-01-2024.xlsx')
AnalystConsensus.head(10)

The data is loaded correctly. We can see the following:
- One column to indicate the date
- 5 columns to show the count of the analyst consensus
- There seem to be around two weeks between every entry
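The spacing between entries can be measured instead of estimated by eye. A minimal sketch, assuming a datetime column like `AnalystConsensus['Date']`; the dates below are toy values for illustration, not the real dataset.

```python
import pandas as pd

# Toy dates for illustration only; in the notebook this would be AnalystConsensus['Date']
toy_dates = pd.Series(pd.to_datetime(["2022-07-14", "2022-07-28", "2022-08-11", "2022-09-01"]))

# Days between consecutive entries (the first value is missing because it has no predecessor)
gaps_in_days = toy_dates.sort_values().diff().dt.days

print(gaps_in_days.describe())
print("Median gap in days:", gaps_in_days.median())
```

The median is less sensitive than the mean to a single long gap, which is why it is used here as the summary.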

We will now describe the dataset

In [None]:
AnalystConsensus.describe()

We now know there are 37 entries in the dataset.
The newest datapoint is from 17-01-2024 and the oldest is from 14-07-2022.

What is also visible is that the max count of each consensus field is below 9. This tells us that only a small number of analysts give their opinion about asr.

Now we will check whether the types are correct with .info

In [None]:
AnalystConsensus.info()

Here we see that everything has the correct datatype and there are no null values

### Data visualization
Below we will create an interactive plot to visualize the data in our dataset

In [None]:
df = AnalystConsensus

# Initialize go
fig = go.Figure()

# add bars
fig.add_trace(go.Bar(
    x=df['Date'],
    y=df['Sell'],
    name='Sell',
    marker_color='red'
))
fig.add_trace(go.Bar(
    x=df['Date'],
    y=df['Underperform'],
    name='Underperform',
    marker_color='orange'
))
fig.add_trace(go.Bar(
    x=df['Date'],
    y=df['Hold'],
    name='Hold',
    marker_color='yellow'
))
fig.add_trace(go.Bar(
    x=df['Date'],
    y=df['Outperform'],
    name='Outperform',
    marker_color='lightgreen'
))
fig.add_trace(go.Bar(
    x=df['Date'],
    y=df['Buy'],
    name='Buy',
    marker_color='green'
))

# update the figure
fig.update_layout(
    barmode='stack',
    title='Analyst Consensus - Interactive Stacked Bar Chart',
    xaxis_title='Date',
    yaxis_title='Consensus Count',
    legend_title='Consensus',
    hovermode='x'
)

fig.show()

From this plot we can see that analysts overall think that asr is a good buy, with a slight hiccup around March, when there was a sell consensus.

We can also see that the total count differs over the weeks. This is something to look into in the preprocessing step of Phase 3, as it can influence the machine learning.

### Conclusion

The data for the analyst consensus did not need any altering. The data is clear and we have a good visualization that shows the consensus over time. What we did notice is the varying count over the weeks; this is something we need to look at in the preprocessing step.

## Data understanding & Preparation for Financial data from asr
In this section we will look at the financial data gathered from asr.

We will start with loading the data and taking a first look

In [None]:
FinancialDataASR = pd.read_excel('data/extracted_financial_data_HY2022_FY2022_HY2023.xlsx')
FinancialDataASR.head()

We can see that this dataframe only has 3 entries.
This is due to the reporting schedule of asr: a new report is published every half year and full year.

For our project this means that we need to fill in the data in between these points, since this will be important for our model to understand the analyst consensus.

**As stated before, this data follows the new calculations. For this reason we will set it aside and not use it for iteration zero.**
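For a later iteration, one possible way to fill in the weeks between two reports is to carry the last reported value forward. This is only a sketch under assumptions: the column name `ExampleMetric`, the report dates and the values are hypothetical, and forward filling is just one of several options (interpolation is another).

```python
import pandas as pd

# Toy half-year figures for illustration only (column name and values are made up)
toy_financials = pd.DataFrame(
    {"ExampleMetric": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2022-06-30", "2022-12-31", "2023-06-30"]),
)

# Weekly index (Sundays) covering the period after the first report
weekly_index = pd.date_range("2022-07-03", "2023-07-02", freq="W")

# For every week, take the most recent report published on or before that date
weekly_financials = toy_financials.reindex(weekly_index, method="ffill")

print(weekly_financials.head())
print(weekly_financials.tail())
```

Forward filling only uses information that was already published at that point in time, which avoids leaking future reports into earlier weeks.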

## Data understanding & Preparation for Stock data for asr
In this section we will look at the stock data gathered for asr.

The analyst consensus data has a new entry around every two weeks. We want to have our data matching to this timeframe. For the stock data we will use the data gathered per week.
We will start with loading the data and taking a first look.

In [None]:
StockPerWeek = pd.read_csv('data/StockPerWeek_ASRNL.AS_01-01-2016_19-01-2024.csv', delimiter=',', decimal='.')
StockPerWeek.head(10)

We can see that this dataframe has a column for the date. The data is indeed weekly, since there is a new entry every 7 days.

The dataframe also contains the Open, High, Low and Volume for that week.

The *Close and **Adj Close are fields with a description. This is the explanation from Yahoo;
- `*Close price adjusted for splits.`
- `**Adjusted close price adjusted for splits and dividend and/or capital gain distributions.`

In the next step we will use .info

In [None]:
StockPerWeek.info()

We can see that there are 397 entries for all columns and that there are no null values.

The `Date` column still has the datatype object and needs to be changed to a date; we will do this in the code block below.

In [188]:
# Convert 'Date' to datetime
StockPerWeek['Date'] = pd.to_datetime(StockPerWeek['Date'])

In [None]:
StockPerWeek.info()

From the .info above we can see that the conversion of `Date` was successful.
Now we will use .describe to look for anomalies

In [None]:
StockPerWeek.describe()

The .describe shows that the dataset appears normal, with no extreme values observed in any of the columns. One observation is that the earliest date in the dataset is June 13, 2016. This date represents more historical data than is currently required for matching with the analyst consensus.

Going forward, our plan includes the integration of additional analyst consensus data. This means that we will review this entire dataset, starting from 2016. This ensures that we can effectively incorporate new consensus data as it becomes available.

### Data visualization
Below we will start the visualization with creating a plot to visualize the volume over time

In [None]:
import matplotlib.ticker as ticker

df = StockPerWeek

plt.figure(figsize=(10, 6))
plt.bar(df['Date'], df['Volume'], color='blue', label='Trading Volume', width=3)
plt.title('Trading Volume Over Time')
plt.xlabel('Date')
plt.ylabel('Volume')

# This is needed to show the tick labels as plain numbers instead of scientific notation
plt.gca().yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{int(x):,}'))

# Set the limit to 8,000,000 to capture most datapoints and still keep the plot readable
plt.ylim(0, 8000000)

plt.legend()
plt.show()

The plot shows a fluctuating but fairly consistent pattern of weekly trading volume from mid-2016 to early 2024. There are notable spikes in volume at certain intervals, which could correspond to specific market events, earnings announcements, or other news affecting trading activity. Most weeks, however, have a volume well below these peaks. The consistently lower volume suggests a baseline level of trading activity, while the spikes indicate periods of high trader interest or market volatility. Cross-referencing the spikes with actual market events would allow more detailed conclusions about what caused these surges in trading volume.
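To cross-reference spikes with market events, the weeks with unusually high volume first need to be listed. A minimal sketch, assuming a weekly volume series like `StockPerWeek['Volume']`; the toy values and the factor of 3 times the median are arbitrary assumptions, not a validated threshold.

```python
import pandas as pd

# Toy weekly volumes for illustration only (not the real Yahoo Finance data)
toy_volume = pd.Series(
    [100, 110, 95, 105, 900, 100, 98, 102, 850, 97],
    index=pd.date_range("2023-01-02", periods=10, freq="7D"),
    name="Volume",
)

# Assumption: a week counts as a spike when its volume exceeds 3x the median volume
spike_factor = 3
threshold = toy_volume.median() * spike_factor

spikes = toy_volume[toy_volume > threshold]
print(f"Threshold: {threshold}")
print(spikes)
```

The resulting dates can then be compared by hand with earnings announcements or news about asr.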

We also want to see the close price over time as this gives us a good overview of the stock movement.

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Close'], label='Close Price')
plt.title('Closing Stock Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()

It would be nice to see the two plots overlapping with each other to see if there is an effect on the close price due to trading volume.

In [None]:
plt.figure(figsize=(10, 6))

# Create the first plot for trading volume on the primary y-axis
ax1 = plt.gca()  # Get the current Axes instance on the current figure
ax1.bar(df['Date'], df['Volume'], color='blue', label='Trading Volume', width=3)
ax1.set_xlabel('Date')
ax1.set_ylabel('Volume', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{int(x):,}'))
ax1.set_ylim(0, 8000000)  # Set the limit to capture most datapoints for volume
ax1.legend(loc='upper left')

# Create a secondary y-axis for the closing price
ax2 = ax1.twinx()  # Instantiate a second axes that shares the same x-axis
ax2.plot(df['Date'], df['Close'], label='Close Price', color='green')
ax2.set_ylabel('Price', color='green')
ax2.tick_params(axis='y', labelcolor='green')
ax2.legend(loc='upper right')

plt.title('Trading Volume and Closing Stock Prices Over Time')
ax1.grid(True)
plt.show()

It does not appear that there is a direct correlation between the stock price movement and trading volume.

It is noteworthy that there are a few occasions where the stock price drops dramatically and then we can also see a massive spike in trading volume.
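The visual impression can be backed up with a number. The sketch below, which uses toy values for illustration only, correlates the size of the weekly price move (regardless of direction) with the trading volume; using the absolute percentage change is an assumption about what "effect on the close price" should mean here.

```python
import pandas as pd

# Toy weekly data for illustration only (not the real asr prices)
toy_stock = pd.DataFrame({
    "Close": [40.0, 40.4, 40.0, 36.0, 36.5, 36.2],
    "Volume": [1_000_000, 1_100_000, 900_000, 5_000_000, 1_500_000, 1_000_000],
})

# Size of the weekly price move, ignoring whether it went up or down
toy_stock["AbsReturn"] = toy_stock["Close"].pct_change().abs()

# Pearson correlation; the first row is skipped automatically because its return is missing
move_volume_corr = toy_stock["AbsReturn"].corr(toy_stock["Volume"])
print(f"Correlation between size of price move and volume: {move_volume_corr:.2f}")
```

A price level and a volume can be uncorrelated while large moves and high volume still go together, which matches the observation about sharp drops and volume spikes.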

## Concatenating data from Stock market and Analyst consensus
We know from our domain understanding that the consensus is often reflected in the stock market or the other way around.

In this section we will add the datasets together, we will lose a part of the stock market data by doing this.

First we will add a column to both datasets to state the week number and one column for the year; afterwards we will join them based on the year and week number.
We will start with the analyst consensus dataframe

In [None]:
# AnalystConsensus

# Creating a column named Week to store the ISO week number
AnalystConsensus['Week'] = AnalystConsensus['Date'].dt.isocalendar().week

# Creating a column named Year to store the ISO year that belongs to that week number.
# Using .dt.year here would mislabel dates around New Year (e.g. 2023-01-01 is ISO week 52 of 2022).
AnalystConsensus['Year'] = AnalystConsensus['Date'].dt.isocalendar().year

# Display the first rows to verify
AnalystConsensus.head(3)

In [None]:
# StockPerWeek

# Creating a column named Week to store the ISO week number
StockPerWeek['Week'] = StockPerWeek['Date'].dt.isocalendar().week

# Creating a column named Year to store the ISO year that belongs to that week number
StockPerWeek['Year'] = StockPerWeek['Date'].dt.isocalendar().year

# Display the first rows to verify
StockPerWeek.head(3)
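One detail matters when joining on week numbers: an ISO week number belongs to an ISO year, which can differ from the calendar year for dates around New Year. The two example dates below are chosen only to illustrate this; they are not taken from the dataset.

```python
import pandas as pd

# Two consecutive days around New Year, chosen for illustration
example_dates = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"]))

iso = example_dates.dt.isocalendar()

comparison = pd.DataFrame({
    "Date": example_dates,
    "CalendarYear": example_dates.dt.year,
    "IsoYear": iso["year"],
    "IsoWeek": iso["week"],
})
print(comparison)
```

Sunday 2023-01-01 still belongs to ISO week 52 of 2022, so pairing the ISO week with the calendar year would place it at the end of 2023 instead.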

Now that we added the week column we need to edit the stock market dataframe.

First we will look at the max and min week + year of the analyst consensus dataframe

In [None]:
# Find the minimum and maximum years in the DataFrame
min_year = AnalystConsensus['Year'].min()
max_year = AnalystConsensus['Year'].max()

# Filtering AnalystConsensus to only include rows from the min year
min_year_df = AnalystConsensus[AnalystConsensus['Year'] == min_year]

# Now we look for the min week within the min year
min_week_in_min_year = min_year_df['Week'].min()

# Here we will do the same for max
max_year_df = AnalystConsensus[AnalystConsensus['Year'] == max_year]
max_week_in_max_year = max_year_df['Week'].max()


print(f"The minimum week in the minimum year {min_year} is: {min_week_in_min_year}")
print(f"The maximum week in the maximum year {max_year} is: {max_week_in_max_year}")

We now know that the min is week 28 of the year 2022 and the max is week 3 of the year 2024.

We will remove all rows that do not fall between these weeks from the stock market data.

In [None]:
# Filter the DataFrame to include rows that fall within the given range
# Include all weeks in the min_year after min_week_in_min_year
# Include all weeks in the max_year up to and including max_week_in_max_year
# For years between min_year and max_year, include all weeks

filtered_df = StockPerWeek[
    ((StockPerWeek['Year'] == min_year) & (StockPerWeek['Week'] >= min_week_in_min_year)) |
    ((StockPerWeek['Year'] == max_year) & (StockPerWeek['Week'] <= max_week_in_max_year)) |
    ((StockPerWeek['Year'] > min_year) & (StockPerWeek['Year'] < max_year))
]

# Now let's print the shape of the original and filtered DataFrames to see how many rows were removed
print(f"Original DataFrame shape: {StockPerWeek.shape}")
print(f"Filtered DataFrame shape: {filtered_df.shape}")

The shape shows that rows were removed: the filtered dataset should now contain 80 rows. This appears to be the correct time range. We will print the max and min week + year to check if the values are the same as in the ``AnalystConsensus`` dataset.

In [None]:
StockPerWeek = filtered_df

# Find the minimum and maximum years in the DataFrame
min_year = StockPerWeek['Year'].min()
max_year = StockPerWeek['Year'].max()

# Filtering StockPerWeek to only include rows from the min year
min_year_df = StockPerWeek[StockPerWeek['Year'] == min_year]

# Now we look for the min week within the min year
min_week_in_min_year = min_year_df['Week'].min()

# Here we will do the same for max
max_year_df = StockPerWeek[StockPerWeek['Year'] == max_year]
max_week_in_max_year = max_year_df['Week'].max()


print(f"The minimum week in the minimum year {min_year} is: {min_week_in_min_year}")
print(f"The maximum week in the maximum year {max_year} is: {max_week_in_max_year}")

Now that we know that the min and max values are the same, we can start joining the two datasets.

In [None]:
# Merging the two DataFrames on 'Year' and 'Week' using an inner join.
merged_df = pd.merge(StockPerWeek, AnalystConsensus, on=['Year', 'Week'], how='inner')

# This will result in a DataFrame that only contains rows where both 'Year' and 'Week' match in both DataFrames.
# All weeks that do not exist in AnalystConsensus will be dropped.

# Displaying the shape of the new df
print(f"Merged DataFrame shape: {merged_df.shape}")


merged_df.head()

Now that the two dataframes are merged on week and year we will continue to Phase 3.
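Because an inner join silently drops every stock week without a consensus entry, it can be useful to check which weeks are lost. A minimal sketch, assuming toy frames with the same `Year` and `Week` keys (all values are made up for illustration), uses the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Toy frames for illustration only; they mimic the Year/Week keys used in the notebook
toy_stock = pd.DataFrame({
    "Year": [2022, 2022, 2022, 2022],
    "Week": [28, 29, 30, 31],
    "Close": [40.0, 40.5, 41.0, 39.8],
})
toy_consensus = pd.DataFrame({
    "Year": [2022, 2022],
    "Week": [28, 30],
    "Buy": [5, 6],
})

# A left join keeps every stock week; the _merge column shows where each row came from
merge_check = pd.merge(toy_stock, toy_consensus, on=["Year", "Week"], how="left", indicator=True)

# Stock weeks that an inner join would drop
unmatched_weeks = merge_check[merge_check["_merge"] == "left_only"]
print(unmatched_weeks[["Year", "Week"]])
```

With consensus entries roughly every two weeks, about half of the stock weeks would be expected to show up as unmatched.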

# Phase 3

## Preprocessing
We will collect all imports at the top of this section to keep it clear what we are using

In [200]:
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error

First we will look at the data we are working with using the .describe

In [None]:
merged_df.describe()

We can see that the date appears twice in the dataframe (``Date_x`` and ``Date_y``). We only need one of them, so we will drop ``Date_y`` and rename ``Date_x`` to ``Date``.

In [None]:
# Drop the 'Date_y' column
merged_df.drop('Date_y', axis=1, inplace=True)

# Rename 'Date_x' to 'Date'
merged_df.rename(columns={'Date_x': 'Date'}, inplace=True)

merged_df.head(0)

Now that we have dropped and renamed the columns, we will assign ``merged_df`` to ``df`` and use .info to check the datatypes.

In [None]:
df = merged_df
df.info()

All dtypes are correct. We also need to look at our target variables. From Phase 2 we know that the total number of ratings changes from week to week. It is therefore better to convert the counts to percentages of the total count for that week. We will do this step by step in the code block below.

In [None]:
# Calculate the total number of ratings for each row
df['Total_Ratings'] = df[['Buy', 'Outperform', 'Hold', 'Underperform', 'Sell']].sum(axis=1)

# Convert each rating to a percentage of the total
for column in ['Buy', 'Outperform', 'Hold', 'Underperform', 'Sell']:
    df[column + '_%'] = (df[column] / df['Total_Ratings']) * 100

# Drop the original count columns and the total ratings column if they are no longer needed
df.drop(['Buy', 'Outperform', 'Hold', 'Underperform', 'Sell', 'Total_Ratings'], axis=1, inplace=True)

# Display the first few rows to verify the changes
df.head()

In the next step we will create a heatmap

In [None]:
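# Only numeric columns can be correlated, so the Date column is left out explicitly
corr_matrix = df.corr(numeric_only=True)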

plt.figure(figsize=(12, 8))

# Draw the heatmap
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')

plt.title('Heatmap of Correlation Matrix')

# Show the plot
plt.show()

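The first thing that we see is the missing values for `Underperform_%`. This happens because the variable shows no variation in our dataset (there are no `Underperform` ratings), so its correlation with the other columns is undefined. This can introduce bias in our model.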
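The empty cells in a correlation heatmap can be reproduced with a column that never changes. A minimal sketch with toy values (for illustration only) shows the effect and a simple way to list such columns:

```python
import pandas as pd

# Toy data for illustration only: one varying column and one constant column
toy_df = pd.DataFrame({
    "Close": [40.0, 41.0, 39.5, 42.0],
    "Underperform_%": [0.0, 0.0, 0.0, 0.0],
})

# The correlation with a constant column is undefined and shows up as NaN
toy_corr = toy_df.corr()
print(toy_corr)

# Columns with a single unique value carry no information for the model
constant_columns = [col for col in toy_df.columns if toy_df[col].nunique() == 1]
print("Constant columns:", constant_columns)
```

Listing constant columns before modelling makes it explicit which targets the model cannot learn anything about.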

### Feature selection
From the correlation matrix above we can see that `Open`, `High`, `Low`, `Close` and `Adj Close` are highly correlated with each other. We will therefore pick only the `Close` feature.
``Volume`` does not seem to have a high correlation, but we will keep it for iteration zero.

``Year`` also seems to have a high correlation with the target. Since we only have 1 full year of data, we will not use it.

In [206]:
features = df[['Close', 'Volume']]
target = df[['Buy_%', 'Outperform_%', 'Hold_%', 'Underperform_%', 'Sell_%']] 

X = features
y = target

### Splitting in to train/test
The data we use is a time series, and we need to keep this in mind when splitting the data: the test set must come after the training set in time. For now we will set the split at 80%.

In [None]:
# Set splitting point at 80%
split_point = int(len(df) * 0.80)

# Split the features and target into training and testing sets
X_train, X_test = X[:split_point], X[split_point:]
y_train, y_test = y[:split_point], y[split_point:]

# Print the sizes of the train and test sets
print(f"Training set size: {X_train.shape[0]} rows")
print(f"Test set size: {X_test.shape[0]} rows")
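The same time-ordering concern applies to cross-validation. As a sketch of one option (not what this notebook currently uses), scikit-learn's `TimeSeriesSplit` creates folds in which the training rows always come before the test rows. The toy array below only stands in for the feature matrix.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix with 10 rows in chronological order (illustration only)
toy_X = np.arange(10).reshape(-1, 1)

# Each fold trains on the past and tests on the block that follows it
time_series_cv = TimeSeriesSplit(n_splits=3)
folds = list(time_series_cv.split(toy_X))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train rows {train_idx.tolist()}, test rows {test_idx.tolist()}")
```

A splitter like this can be passed through the `cv` argument of `GridSearchCV` or `cross_val_score` in place of an integer.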

## Modelling
We selected a Random Forest Regressor as our baseline model to predict weekly analyst consensus ratings, which are represented as percentages. Random Forest is chosen for its ability to handle complex and varied data types effectively. It's known for its robustness, reducing overfitting by averaging multiple decision trees.

In [None]:
# Define a grid of hyperparameters to search over
param_grid = {
    'estimator__n_estimators': [100, 200, 300],
    'estimator__max_depth': [None, 10, 20, 30],
}

# Initialize the Random Forest model
random_forest_regressor = RandomForestRegressor(random_state=42)

# Wrap the model with MultiOutputRegressor
multi_target_regressor = MultiOutputRegressor(random_forest_regressor)

# Set up the grid search with cross-validation
grid_search = GridSearchCV(multi_target_regressor, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)

# Retrieve the best model from the grid search
best_model = grid_search.best_estimator_

# Predict on the test set using the best model
y_pred = best_model.predict(X_test)

# Evaluate the model using multiple metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")

# If you want to do cross-validated performance assessment:
cv_scores = cross_val_score(best_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-validated MSE: {-cv_scores.mean()}")

The results that are printed do not look very promising. In the next section we will explore what happened.
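Error values are easier to judge next to a naive baseline. The sketch below, on toy percentages that are made up for illustration, uses scikit-learn's `DummyRegressor` to always predict the training mean; a useful model should reach a lower error than this.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Toy targets for illustration only: two percentage columns per row
y_train_toy = np.array([[80.0, 20.0], [70.0, 30.0], [90.0, 10.0]])
y_test_toy = np.array([[75.0, 25.0], [85.0, 15.0]])

# The dummy model ignores the features, so placeholder features are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Baseline that always predicts the mean of each target column in the training set
baseline_model = DummyRegressor(strategy="mean")
baseline_model.fit(X_train_toy, y_train_toy)

baseline_mae = mean_absolute_error(y_test_toy, baseline_model.predict(X_test_toy))
print(f"Baseline Mean Absolute Error: {baseline_mae}")
```

In the notebook, the same comparison would use `X_train`, `y_train`, `X_test` and `y_test` instead of the toy arrays.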

## Evaluation
Here we will create plots to see where it went wrong in our model and end with the conclusion

In [None]:
# Plot actual vs predicted values for all targets
target_names = y_test.columns
for i, target in enumerate(target_names):
    plt.figure(figsize=(10, 6))
    plt.scatter(y_test[target], y_pred[:, i], alpha=0.5)
    plt.title(f'Actual vs Predicted - {target}')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.plot([y_test[target].min(), y_test[target].max()], [y_test[target].min(), y_test[target].max()], 'k--')
    plt.show()

From these plots we can deduce that for some targets the model makes predictions even though the actual values should be 0.

### Conclusion
In future iterations we need to look more closely at the preprocessing of the target variables. At the moment the model does not capture them well.

We also need to find a way to address the bias introduced by some targets having no data points in the dataset. We could address it by:

- Adding buy and outperform, and sell and underperform, together in the preprocessing step (instead of only in the demonstration) to create more datapoints per category
- Collecting more data but this can be difficult
- Altering class weights
- Resampling techniques
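The first option in the list could look roughly like the sketch below: the five rating counts are combined into three categories before the percentages are calculated. The toy counts are made up for illustration, and the grouping mirrors the one used in the demonstration function.

```python
import pandas as pd

# Toy rating counts for illustration only
toy_counts = pd.DataFrame({
    "Buy": [5, 6],
    "Outperform": [2, 1],
    "Hold": [1, 2],
    "Underperform": [0, 0],
    "Sell": [0, 1],
})

# Combine the five ratings into three categories
grouped_counts = pd.DataFrame({
    "Buy": toy_counts["Buy"] + toy_counts["Outperform"],
    "Hold": toy_counts["Hold"],
    "Sell": toy_counts["Sell"] + toy_counts["Underperform"],
})

# Convert each row to percentages of that row's total
grouped_pct = grouped_counts.div(grouped_counts.sum(axis=1), axis=0) * 100
print(grouped_pct)
```

With three targets instead of five, the empty `Underperform` column no longer exists as a separate target.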

# Phase 4
## Demonstration
In the following section we will create the code that can be used to make a prediction for our stakeholder

In [228]:
import pandas as pd

def predict_analyst_category(model, close, volume, target_names, feature_names):
    """
    Predict the analyst category ('Buy', 'Hold', or 'Sell') based on 'Close' and 'Volume' using the trained model.

    :param model: Trained MultiOutputRegressor model.
    :param close: Close price of the stock (the 'Close' feature the model was trained on).
    :param volume: Trading volume.
    :param target_names: List of target variable names in the order they were used during model training.
    :param feature_names: List of feature names as they were used during model training.
    :return: Predicted category.
    """
    # Create a DataFrame for the input features with the correct column names
    input_df = pd.DataFrame([[close, volume]], columns=feature_names)

    # Predict using the model
    predictions = model.predict(input_df)[0]

    # Map predictions to their corresponding target names
    prediction_dict = dict(zip(target_names, predictions))

    # Aggregate predictions
    buy_prediction = prediction_dict['Buy_%'] + prediction_dict['Outperform_%']
    sell_prediction = prediction_dict['Sell_%'] + prediction_dict['Underperform_%']
    hold_prediction = prediction_dict['Hold_%']

    categories = {'Buy': buy_prediction, 'Hold': hold_prediction, 'Sell': sell_prediction}
    predicted_category = max(categories, key=categories.get)

    return predicted_category

In [None]:
feature_names = ['Close', 'Volume']
target_names = ['Buy_%', 'Outperform_%', 'Hold_%', 'Underperform_%', 'Sell_%']
predicted_category = predict_analyst_category(best_model, 40, 1000000, target_names, feature_names)
print("Predicted Analyst Consensus:", predicted_category)

## Feedback

For iteration zero we will include the feedback inside the notebook. In the next iteration, when the model performs with some accuracy, we will create a separate document together with a front end.

Comment from the stakeholder about the progress:
"It looks complex, and after your explanation of what you did I can see that the data from asr is important. I am curious about the front end; maybe then it will become clearer."

# Conclusion

## Iteration Zero

### Data Limitations
Our analysis in phase 2 faced challenges due to recent changes in financial calculations, effective from the start of 2023. These changes limited our ability to utilize our initially identified data effectively. Furthermore, the restriction in accessing historical analyst consensus data, available only for the past 18 months, resulted in a data availability mismatch. This limitation prevented us from leveraging older, potentially valuable data sets.

### Model Performance
The baseline model underperformed, which could be attributed to insufficient optimization of the target variables and the limited size of our dataset. Additionally, the dataset lacked some key features that might be crucial in predicting the analyst consensus.

### Future Iterations
Looking ahead, we need to find ways to circumvent the financial calculation changes or access more extensive historical data for the analyst consensus. Enhancing our target variable optimization and refining the model to predict these targets more accurately will be a focus. Addressing potential biases and improving model performance may become feasible with a more comprehensive dataset.