In [2]:
from datetime import datetime, timedelta
from src.collect_price_data import collect_price_data
from src.format_price_data import format_price_data
from src.collect_sentiment_data import collect_sentiment_data
from src.preprocess_news_data import preprocess_news_data
from src.sentiment_analysis import perform_sentiment_analysis
from src.sentiment_summary import create_sentiment_summary
from src.calculate_technical_indicators import calculate_technical_indicators
from src.create_model_results_df import create_results_df

# Get the current date and time
current_date = datetime.now()

# Calculate yesterday's date by subtracting one day
yesterday_date = current_date - timedelta(days=1)

# Calculate the date from 4 years ago
years_ago = 4
start_datetime = current_date - timedelta(days=365 * years_ago)

# Set the start date and end date for the data retrieval
start_date = start_datetime.strftime('%Y-%m-%d')
end_date = yesterday_date.strftime('%Y-%m-%d')

# Define the time period for historical data (start date, end date)
time_period = (start_date, end_date)

# List of stock tickers for analysis
tickers = [ "AAPL", "META", "JPM", "JNJ", "AMT"]

# Load historical price data for the specified tickers and time period
price_data = collect_price_data(tickers, time_period) 

# Format price_data
formatted_price_data = format_price_data(price_data)

# Remove NaN values
formatted_price_data = formatted_price_data.dropna(axis=0, how='any')
formatted_price_data.head()


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Date,AAPL,AMT,JNJ,JPM,META
2019-10-18 13:00:00,235.92,227.93,131.43,120.92,190.5
2019-10-18 14:00:00,235.535,228.0,131.23,120.595,187.94
2019-10-18 15:00:00,235.135,228.125,129.95,120.72,185.7204
2019-10-18 16:00:00,235.24,228.2,129.21,120.86,185.47
2019-10-18 17:00:00,236.11,229.02,128.69,120.935,186.14


### Load News Data

In [4]:
# Load the news data for the specified tickers and time period
# Note: this step will take approximately 40 minutes
news_data = collect_sentiment_data(tickers, time_period)
news_data.head()

Unnamed: 0,Ticker,Date,Title
0,AAPL,2019-10-18 16:00:35+00:00,Today's Pickup: Logistics Solutions Boost CBD ...
1,AAPL,2019-10-18 13:09:29+00:00,Netflix Misses On Domestic Subscribers But Bea...
2,AAPL,2019-10-21 15:11:44+00:00,Cramer: Is FAAMG Is The New FAANG?
3,AAPL,2019-10-21 15:06:50+00:00,Stocks That Hit 52-Week Highs On Monday
4,AAPL,2019-10-21 13:29:30+00:00,Credit Suisse On Apple Also Notes 'Pro/Pro Max...


### Preprocess Data

In [5]:
# Preprocess news data for sentiment analysis
preprocessed_data = preprocess_news_data(news_data)

# Print the preprocessed data
preprocessed_data.head()

Unnamed: 0,Ticker,Date,Title
0,AAPL,2019-10-18 16:00:35+00:00,today pickup logistics solution boost cbd pet ...
1,AAPL,2019-10-18 13:09:29+00:00,netflix miss domestic subscriber beat earnings...
2,AAPL,2019-10-21 15:11:44+00:00,cramer faamg new faang
3,AAPL,2019-10-21 15:06:50+00:00,stock hit week high monday
4,AAPL,2019-10-21 13:29:30+00:00,credit suisse apple also note propro max wait ...


#### Perform Sentiment Analysis

In [6]:
# Perform sentiment analysis
# Note: this process will take approximately 30 minutes
sentiment_df = perform_sentiment_analysis(preprocessed_data)
print(sentiment_df.head())

  Ticker                      Date  \
0   AAPL 2019-10-18 16:00:35+00:00   
1   AAPL 2019-10-18 13:09:29+00:00   
2   AAPL 2019-10-21 15:11:44+00:00   
3   AAPL 2019-10-21 15:06:50+00:00   
4   AAPL 2019-10-21 13:29:30+00:00   

                                               Title  Sentiment_Score  \
0  today pickup logistics solution boost cbd pet ...         0.994553   
1  netflix miss domestic subscriber beat earnings...         0.996385   
2                             cramer faamg new faang         0.990447   
3                         stock hit week high monday         0.998139   
4  credit suisse apple also note propro max wait ...         0.976643   

  Sentiment_Label  
0        POSITIVE  
1        NEGATIVE  
2        NEGATIVE  
3        NEGATIVE  
4        NEGATIVE  


#### Exploring Sentiment Label as a Feature

In [7]:
# Explore the sentiment label column
unique_values = sentiment_df['Sentiment_Label'].value_counts()
print(unique_values)

NEGATIVE    8711
POSITIVE    2675
Name: Sentiment_Label, dtype: int64


##### **Feature Selection Consideration:**
At this stage of the project, we deliberately exclude the sentiment label (positive or negative) from the model's features, for the following reasons:

**1. Class Imbalance:**

The labels are heavily imbalanced: 8,711 headlines are labeled NEGATIVE and 2,675 POSITIVE, roughly 3.3 negative headlines for every positive one. Imbalanced data can bias a model toward the majority class and would need to be addressed before the label is used as a feature.

**2. Sentiment Label Accuracy:**

The sentiment labels are derived from headlines alone, so they may not be accurate. For example, in the output above, the headline "stock hit week high monday" is labeled NEGATIVE with a score of 0.998. Using potentially inaccurate labels as features would add noise to the model and reduce its reliability. For the same reason, we do not try to correct the imbalance described in reason 1 with resampling techniques.

**3. Model Iteration:**

Machine learning projects usually involve multiple stages and iterations. We prioritize other aspects of model development first and plan to revisit sentiment-label features at a later stage, when more accurate and reliable labels are available.

**4. Feature Engineering:**

In the future, we plan to run sentiment analysis on article summaries, which may provide more context and more accurate sentiment. This aligns with our goal of continuously refining feature engineering for improved model performance.

Currently, we run sentiment analysis only on article headlines because of the time the analysis requires. For headlines alone, the process already takes approximately 30 minutes. Analyzing full article summaries would take considerably longer, and we are considering ways to make this process more efficient.
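
The class imbalance mentioned in reason 1 can be quantified directly from the `value_counts()` output shown earlier. The snippet below is a minimal sketch that hard-codes those two counts rather than reading them from `sentiment_df`; the variable names are illustrative only.

```python
# Label counts copied from the value_counts() output shown earlier in the notebook
label_counts = {"NEGATIVE": 8711, "POSITIVE": 2675}

# Share of each label among all analyzed headlines
total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# How many negative headlines there are for each positive one
imbalance_ratio = label_counts["NEGATIVE"] / label_counts["POSITIVE"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Negative-to-positive ratio: {imbalance_ratio:.2f}")
```
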


#### Exploring Sentiment Score as a Feature for Analysis

In [8]:
# Create sentiment summary to evaluate statistics (min, max, mean, sum of sentiment scores) as potential features for the model
sentiment_summary = create_sentiment_summary(tickers, sentiment_df, formatted_price_data)

In [26]:
import hvplot.pandas  # registers the .hvplot accessor on DataFrames

# Plot the sentiment score statistics
# Set the ticker you want to visualize
ticker = 'AAPL'
sentiment_summary[ticker].hvplot.line(
    xlabel='Time', ylabel='Statistics', title=f"{ticker} Sentiment Score Analysis",
    line_width=2, alpha=0.7, hover_line_color='red',
    width=1000, height=500
).opts(legend_position='top_left')

##### **Exploratory Analysis:**
During our initial feature selection process, we considered several sentiment score statistics (mean, min, max, sum) to understand their potential influence on predicting the closing price. The individual sentiment scores are small values (each lies between 0 and 1), but one of the statistics behaves differently from the others.

**Observation:**

The daily sum of the sentiment scores shows more pronounced ups and downs from day to day than the other statistics. This feature, which reflects the aggregate sentiment for the day, varies more even though its absolute values are modest.

**Logical Relevance:**

We also chose the sum of the sentiment scores for its logical relevance. Aggregating the scores of all headlines published on a given day into a single sum gives a meaningful representation of that day's overall sentiment, which fits our goal of capturing sentiment trends that might influence stock prices.

It's worth noting that machine learning models, especially more complex ones, have the capacity to learn from features that may not have a strong linear correlation with the target variable. Hence, we believe this feature is promising for our model.
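
The internals of `create_sentiment_summary` are not shown in this notebook, so the following is only a sketch of how daily statistics of this kind could be computed with pandas. The toy DataFrame and its values are made up for illustration; only the column names mirror `sentiment_df`. Because each score lies between 0 and 1, the daily sum also grows with the number of headlines published that day, which is one plausible reason it varies more than the mean.

```python
import pandas as pd

# Toy stand-in for sentiment_df (values are illustrative, not real data)
toy_sentiment = pd.DataFrame(
    {
        "Ticker": ["AAPL", "AAPL", "AAPL", "JPM"],
        "Date": pd.to_datetime(
            [
                "2019-10-18 13:09:29",
                "2019-10-18 16:00:35",
                "2019-10-21 15:06:50",
                "2019-10-21 14:00:00",
            ],
            utc=True,
        ),
        "Sentiment_Score": [0.99, 0.97, 0.98, 0.95],
    }
)

# Group headlines by ticker and calendar day, then compute the four statistics
daily_stats = (
    toy_sentiment
    .assign(Day=toy_sentiment["Date"].dt.floor("D"))
    .groupby(["Ticker", "Day"])["Sentiment_Score"]
    .agg(["min", "max", "mean", "sum"])
)
print(daily_stats)
```
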


##### **Missing Sentiment Scores for "META" Ticker:**

During the exploration of sentiment scores, we observed that scores for the "META" ticker are missing for roughly the first two years, even though stock price data is available. More specifically, the retrieved news data, and therefore the sentiment scores, only begin at the end of June 2021 (recorded as 2021-06-31, which is not a valid calendar date, so the exact day should be re-checked). Several factors could contribute to this issue:

1. **Data Retrieval Issues:** Data retrieval methods for sentiment analysis may not have been comprehensive or accurate in collecting data for the "META" ticker. Data collection methods can vary in terms of coverage and accuracy.

2. **Data Quality:** Ensuring the quality and consistency of data sources is crucial. Inaccurate or incomplete news data can result in missing sentiment scores.

**Possible Solutions:**

To address the missing sentiment scores for the "META" ticker, consider the following:

1. **Refining Data Sources:** Review and expand news data sources to cover a wider range of topics and keywords, including "META."

2. **Exploring Additional Data Sources:** Consider integrating other news data sources such as tweets or web scraping. These sources can offer a broader range of news data and help fill gaps in sentiment score coverage.

**Project Approach:**

For the initial stage of the project and to avoid excessive complexity, a pragmatic approach was taken. In the calculate_technical_indicators module, backward filling (bfill) was applied to address the missing sentiment scores for the "META" ticker. This allowed for the inclusion of available sentiment data while keeping the project manageable. Further enhancements can be explored to improve sentiment score coverage in future stages.
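
To make the effect of backward filling concrete, here is a small sketch on a made-up series; the dates and values are hypothetical and do not come from the project data. `bfill` copies the next available value backward into earlier gaps, which means early rows receive sentiment values that were only observed later. `ffill` is shown for contrast: it never uses future values, but it cannot fill gaps at the start of the series.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sentiment sums with missing values at the start
idx = pd.date_range("2021-06-28", periods=5, freq="D")
scores = pd.Series(
    [np.nan, np.nan, 0.95, np.nan, 1.90],
    index=idx,
    name="Sentiment_Score_Sum_META",
)

# Backward fill: each gap takes the NEXT observed value
backfilled = scores.bfill()

# Forward fill: each gap takes the PREVIOUS observed value (leading gaps stay NaN)
forward_filled = scores.ffill()

print(pd.DataFrame({"original": scores, "bfill": backfilled, "ffill": forward_filled}))
```
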



#### Calculate Technical Indicators

In [11]:
fast_window = 4  # Adjust the fast SMA window as needed
slow_window = 50  # Adjust the slow SMA and EMA window as needed
rsi_window = 14   # Adjust the RSI window as needed

# Calculate technical indicators 
# Calculate slow SMA, fast SMA, EMA, RSI for each ticker
technical_indicators_df = calculate_technical_indicators(formatted_price_data, fast_window, slow_window, rsi_window, sentiment_summary)

# Separate features and target variables
features = technical_indicators_df.filter(like='_').copy()  # Filter columns with '_'


In [12]:
technical_indicators_df.head()

Date,AAPL,AMT,JNJ,JPM,META,SMA_Slow_AAPL,SMA_Fast_AAPL,EMA_AAPL,RSI_AAPL,Sentiment_Score_Sum_AAPL,...,SMA_Slow_JPM,SMA_Fast_JPM,EMA_JPM,RSI_JPM,Sentiment_Score_Sum_JPM,SMA_Slow_META,SMA_Fast_META,EMA_META,RSI_META,Sentiment_Score_Sum_META
2019-10-25 20:00:00,246.85,216.8699,128.46,126.01,187.88,241.47526,246.025,241.955441,80.947277,1.921613,...,124.017312,126.195,124.261994,75.557905,1.995335,186.438292,187.9925,186.985811,74.882417,0.953524
2019-10-28 13:00:00,248.025,214.53,129.25,127.02,187.24,241.71736,246.71625,242.193462,81.840823,9.66844,...,124.139312,126.315,124.370151,79.495411,0.998139,186.373092,187.757525,186.995779,61.825805,0.953524
2019-10-28 14:00:00,248.08,213.43,129.84,126.47,187.7,241.96826,247.38375,242.424307,84.571017,9.66844,...,124.256812,126.38,124.452498,72.042888,0.998139,186.368292,187.6725,187.023396,63.550934,0.953524
2019-10-28 15:00:00,248.555,212.08,129.46,127.1,188.4,242.23666,247.8775,242.664726,92.977025,9.66844,...,124.384412,126.65,124.556322,75.824564,0.998139,186.421884,187.805,187.07738,71.81104,0.953524
2019-10-28 16:00:00,248.485,212.54,129.4838,126.64,188.765,242.50156,248.28625,242.892972,91.375819,9.66844,...,124.500012,126.8075,124.638035,68.408697,0.998139,186.487784,188.02625,187.143561,71.515294,0.953524


In [13]:
targets = technical_indicators_df.iloc[:, : len(tickers)] 
targets.head()

Date,AAPL,AMT,JNJ,JPM,META
2019-10-25 20:00:00,246.85,216.8699,128.46,126.01,187.88
2019-10-28 13:00:00,248.025,214.53,129.25,127.02,187.24
2019-10-28 14:00:00,248.08,213.43,129.84,126.47,187.7
2019-10-28 15:00:00,248.555,212.08,129.46,127.1,188.4
2019-10-28 16:00:00,248.485,212.54,129.4838,126.64,188.765


In [14]:
from sklearn.preprocessing import StandardScaler
from pandas.tseries.offsets import DateOffset

# Select the start of the training period
training_begin = features.index.min()

# Select the ending period for the training data with an offset of 3 years
training_end = training_begin + DateOffset(years=3)

# Generate the X_train and y_train DataFrames
X_train = features.loc[training_begin:training_end]
y_train = targets.loc[training_begin:training_end]

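# Generate the X_test and y_test DataFrames, starting one hour after training_end
# (.loc slicing is inclusive, so the boundary row already belongs to the training set)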
X_test = features.loc[training_end+DateOffset(hours=1):]
y_test = targets.loc[training_end+DateOffset(hours=1):]

In [15]:
# Scale the features DataFrames

# Create a StandardScaler instance
scaler = StandardScaler()

# Apply the scaler model to fit the X-train data
X_scaler = scaler.fit(X_train)

# Transform the X_train and X_test DataFrames using the X_scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

#### Build Linear Regression Machine Learning Model

In [16]:
from sklearn.linear_model import LinearRegression

# Instantiate a LinearRegression model
lr_model = LinearRegression()
 
# Fit the model to the data using the training data
lr_model = lr_model.fit(X_train_scaled, y_train)
 
# Use the testing data to make the model predictions
lr_pred = lr_model.predict(X_test_scaled)


In [21]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Calculate regression metrics
mae = mean_absolute_error(y_test, lr_pred)
mse = mean_squared_error(y_test, lr_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, lr_pred)

# Print the regression metrics
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)

Mean Absolute Error: 0.7356974456497781
Mean Squared Error: 1.2355325112906335
Root Mean Squared Error: 1.1115451008801367
R-squared: 0.9962284949031229


**Mean Absolute Error (MAE):** The MAE of 0.7357 means that, on average, the model's predictions differ from the actual close prices by about 0.74 price units, averaged over all five tickers. Lower MAE values indicate better performance, and here the predictions are generally close to the actual values.

**Mean Squared Error (MSE):** The MSE of 1.2355 is the average squared difference between the model's predictions and the actual target values. Because the errors are squared, large misses are penalized more heavily than small ones. Lower values are better, and this value is low relative to the price levels in the data (roughly 120 to 250 in the rows shown above).

**Root Mean Squared Error (RMSE):** The RMSE of 1.1115 is the square root of the MSE, which expresses the error in the same units as the target variable. It indicates a typical prediction error of about 1.11 price units, with larger errors weighted more heavily than in the MAE.

**R-squared (R2):** The R-squared value of 0.9962 is very close to 1. R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variables, so the model explains 99.62% of the variance in the target variable, a very strong fit. Note that the indicator features are computed from the same price series, at the same timestamps, as the targets, so part of this fit likely reflects that overlap rather than forecasting ability.
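
With five target columns, scikit-learn's metric functions average the per-column results by default, so a single number can hide differences between tickers. The sketch below spells out the four formulas with NumPy and reports them per column. The helper name `regression_metrics_per_column` and the toy arrays are assumptions made for illustration; they are not part of the project code.

```python
import numpy as np

def regression_metrics_per_column(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for each target column separately."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.abs(errors).mean(axis=0)          # mean of |error|
    mse = (errors ** 2).mean(axis=0)           # mean of error^2
    ss_res = (errors ** 2).sum(axis=0)         # residual sum of squares
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)  # total sum of squares
    return {
        "mae": mae,
        "mse": mse,
        "rmse": np.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Toy data: two target columns, three observations (illustrative values only)
toy_true = np.array([[100.0, 10.0], [102.0, 11.0], [104.0, 12.0]])
toy_pred = np.array([[101.0, 10.0], [101.0, 11.0], [106.0, 12.0]])

toy_metrics = regression_metrics_per_column(toy_true, toy_pred)
for name, values in toy_metrics.items():
    print(name, values)
```

In the notebook itself, the same function could be called as `regression_metrics_per_column(y_test, lr_pred)` to see whether one ticker dominates the averaged error.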

In [18]:
# Specify the ticker you want to compare
selected_ticker = "AMT"
# Create the list of tickers in the same order as the columns of the targets DataFrame
tickers_names = targets.columns

# Create the results DataFrame for the selected ticker
results_df = create_results_df(y_test, lr_pred, selected_ticker, tickers_names)

# Visualize results
results_df.hvplot.line(
    xlabel='Date', ylabel='Close Price', title=f"{selected_ticker} Actual vs Predicted Close Price",
    line_width=2, alpha=0.7, hover_line_color='red',
    width=1000, height=500
).opts(legend_position='top_left')

When plotting the actual vs. predicted values for the Linear Regression model, we observe that the predicted line closely tracks the actual close price, indicating strong predictive accuracy on the test period.

The successful performance of our regression model in predicting close prices opens up the possibility of implementing a trading strategy in the future. This could be a valuable next step to leverage the accurate predictions and turn them into actionable trading decisions.

#### Build Random Forest Regressor Machine Learning Model

In [19]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate a Random Forest Regressor model with the desired hyperparameters
rf_model = RandomForestRegressor(n_estimators=100, random_state=52)  

# Fit the Random Forest model to the training data
rf_model.fit(X_train_scaled, y_train)

# Use the testing data to make predictions
rf_pred = rf_model.predict(X_test_scaled)


In [22]:
# Calculate regression metrics
mae_rf = mean_absolute_error(y_test, rf_pred)
mse_rf = mean_squared_error(y_test, rf_pred)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, rf_pred)

# Print the regression metrics
print("Mean Absolute Error:", mae_rf)
print("Mean Squared Error:", mse_rf)
print("Root Mean Squared Error:", rmse_rf)
print("R-squared:", r2_rf)

Mean Absolute Error: 16.317834786559395
Mean Squared Error: 598.8285087704123
Root Mean Squared Error: 24.470972779405653
R-squared: -1.568480641412812


**Mean Absolute Error (MAE):** The MAE of 16.318 means that, on average, the model's predictions differ from the actual target values by about 16.3 price units. This is more than 20 times the Linear Regression MAE of 0.736, a large error for this problem.

**Mean Squared Error (MSE):** The MSE of 598.829 is the average squared difference between the model's predictions and the actual target values. This high value indicates large gaps between predicted and actual values, with some predictions missing by a wide margin.

**Root Mean Squared Error (RMSE):** The RMSE of 24.471 is the square root of the MSE and provides a measure of the average error in the same units as the target variable. An RMSE of 24.471 indicates that, on average, our model's predictions are approximately 24.471 units away from the actual values.

**R-squared (R2):** The R-squared value of -1.568 is concerning. R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A negative R2 means the model's predictions are worse than always predicting the mean of the test targets (a horizontal line), so the model is performing poorly and may not be appropriate for this dataset.

In [20]:
# Specify the ticker you want to compare
selected_ticker = "AMT"

# Create the results DataFrame for the selected ticker
rf_results_df = create_results_df(y_test, rf_pred, selected_ticker, tickers_names)

# Visualize results
rf_results_df.hvplot.line(
    xlabel='Date', ylabel='Close Price', title=f"{selected_ticker} Actual vs Predicted Close Price",
    line_width=2, alpha=0.7, hover_line_color='red',
    width=1000, height=500
).opts(legend_position='top_left')



When plotting the actual vs. predicted values for the RandomForestRegressor model, we notice that the predicted line deviates substantially from the actual close price, highlighting poor predictive accuracy.

##### **Possible reasons:**

**Overfitting:** RandomForestRegressor models can be prone to overfitting if they are not properly tuned. Overfitting occurs when the model learns to fit the training data too closely, capturing noise rather than genuine patterns. Overfit models tend to perform poorly on unseen data.

**Hyperparameter Tuning:** RandomForestRegressor models have hyperparameters that need to be tuned, such as the number of trees (n_estimators), maximum depth of trees, and minimum samples required to split a node. If these hyperparameters are not appropriately adjusted, it can impact the model's performance.

**Feature Engineering:** The quality of features used as input to the model can significantly affect its performance. If the features do not capture the relevant information, the model may struggle to make accurate predictions.

**Randomness:** Random Forest models incorporate an element of randomness in the selection of features and data points for building trees. It's possible that the randomization may not have favored the model's performance in this particular case.

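**Inappropriate Model Selection:** While RandomForestRegressor models are a versatile and robust choice for many problems, there are situations where other models, like Linear Regression, are more suitable. In particular, a random forest predicts by averaging target values seen during training, so it cannot predict prices outside the training range. If prices in the test period move beyond that range, the forest's predictions stay capped while a linear model can extrapolate.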
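
One way to see why model choice matters for trending price data is the sketch below. It uses synthetic data (a simple upward line, not the project's features) to show that a random forest cannot predict values above the largest target it saw in training, which is enough to produce a negative R2 on a later period. Whether this is the main cause of the result above is an assumption that would need to be checked against the actual train and test price ranges.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic trending series: the "past" covers 0..99, the "future" covers 100..149
X_past = np.arange(100, dtype=float).reshape(-1, 1)
y_past = X_past.ravel()
X_future = np.arange(100, 150, dtype=float).reshape(-1, 1)
y_future = X_future.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_past, y_past)
linear = LinearRegression().fit(X_past, y_past)

forest_pred = forest.predict(X_future)
linear_pred = linear.predict(X_future)

# Forest predictions are averages of training targets, so they never exceed max(y_past)
print("Largest forest prediction:", forest_pred.max())
print("Forest R2 on future data:", r2_score(y_future, forest_pred))
print("Linear R2 on future data:", r2_score(y_future, linear_pred))
```
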

##### **Solutions:**
**Hyperparameter Tuning:** Tuning the hyperparameters of the RandomForestRegressor model can often lead to significant improvements. Key hyperparameters to consider include:

- n_estimators: The number of trees in the forest. Adding trees generally makes predictions more stable but increases training time, and the gains level off beyond a certain point.

- max_depth: The maximum depth of the trees. A deeper tree can capture more complex patterns, but it can also lead to overfitting.

- min_samples_split and min_samples_leaf: These parameters control the minimum number of samples required to split an internal node or form a leaf node. Adjusting them can impact the model's robustness against noise and overfitting.

- max_features: The number of features to consider when looking for the best split. Experimenting with different values can be beneficial.
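
The hyperparameters listed above could be searched with scikit-learn's `GridSearchCV`. The following is a minimal sketch on synthetic data, not a tuned configuration: the grid values, the toy arrays, and the choice of `TimeSeriesSplit` (so that each validation fold comes after its training fold in time) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the scaled training data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Candidate values for the hyperparameters discussed above
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
    "max_features": [0.5, 1.0],
}

# TimeSeriesSplit keeps the chronological order: train on earlier rows, validate on later ones
search = GridSearchCV(
    RandomForestRegressor(random_state=52),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", -search.best_score_)
```

In the notebook, `X_toy` and `y_toy` would be replaced by `X_train_scaled` and `y_train`; the search would take noticeably longer on the full training set.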

**Feature Engineering:** Carefully consider the features used in our model. Feature selection or engineering may be necessary to capture relevant information from the data. We can experiment with different sets of features to see which combination works best.

**Data Quality:** Ensure the quality and cleanliness of our dataset. Address any issues related to missing data, outliers, and data preprocessing.

**Ensemble Methods:** Experiment with different ensemble methods, such as Gradient Boosting, AdaBoost, or XGBoost. These ensemble techniques can often improve predictive accuracy.

**Out-of-the-Box Models:** If hyperparameter tuning and other optimizations don't yield improvements, consider trying different machine learning models to see if a different algorithm is better suited to our problem.