# XGBoost Regression
This program runs XGBoost regression to try to predict stock prices three months ahead.

### 1. Imports

In [1]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

### 2. Load the data

In [2]:
data = pd.read_csv('stocks_data.csv')
data.describe(include='all')

| | Unnamed: 0 | Ticker | Year | Month | MA Ratio | Result | ROE | Insider Ownership Growth | Institutional Ownership Growth | Forecast EPS Growth | Avg 2Q EPS Growth | Avg 2Q EPS Surprise | YoY EPS Growth | Sector Performance | Market Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14854.0 | 14854 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 |
| unique | | 393 | | | | | | | | | | | | | |
| top | | AWK | | | | | | | | | | | | | |
| freq | | 55 | | | | | | | | | | | | | |
| mean | 7426.5 | | 2020.669113 | 6.22324 | 1.004148 | 1.032817 | 39.494365 | 0.015486 | 0.026708 | 0.057775 | 0.181477 | 13.755183 | 0.369529 | 1.488003 | 1.438443 |
| std | 4288.124784 | | 1.428016 | 3.520757 | 0.046473 | 0.14926 | 181.839873 | 0.269863 | 0.230675 | 2.136724 | 2.111809 | 46.751483 | 3.637998 | 8.164589 | 7.038394 |
| min | 0.0 | | 2018.0 | 1.0 | 0.580721 | 0.259712 | -613.743387 | -0.633527 | -0.714136 | -0.992366 | -45.05 | -65.625 | -0.961538 | -44.900728 | -22.795349 |
| 25% | 3713.25 | | 2019.0 | 3.0 | 0.977766 | 0.944153 | 10.160854 | -0.00135 | -0.023114 | -0.184264 | -0.040838 | 2.015 | 0.017606 | -3.453784 | -3.160007 |
| 50% | 7426.5 | | 2021.0 | 6.0 | 1.00536 | 1.028547 | 19.251991 | 0.0 | -0.000648 | -0.039062 | 0.045662 | 6.055 | 0.130688 | 1.496227 | 2.069271 |
| 75% | 11139.75 | | 2022.0 | 9.0 | 1.031953 | 1.113949 | 31.949569 | 0.008 | 0.033653 | 0.086957 | 0.154182 | 13.135 | 0.275148 | 6.429508 | 5.50743 |


### 3. Split and clean the data
Based on my earlier attempt with Random Forest regression (see RandomForest.ipynb), where overfitting occurred, this time I split the data by time rather than randomly: rows from before 2023 form the training set, and rows from 2023 onward form the test set.

In [3]:
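# Make sure the rows carry a clean 0..n-1 index
data = data.reset_index(drop=True)
# Time-based split: train on years before 2023, test on 2023 onward
train_data = data[data['Year'] < 2023]
test_data = data[data['Year'] >= 2023]
# Drop the target, the date and ticker columns, and the leftover index column (data.columns[0])
drop_cols = ['Year', 'Result', 'Month', 'Ticker', data.columns[0]]
x_train = train_data.drop(drop_cols, axis=1)
y_train = train_data['Result']
x_test = test_data.drop(drop_cols, axis=1)
y_test = test_data['Result']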
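
The same split can be wrapped in a small helper so that the cutoff year is explicit and easy to change. This is only a sketch: `time_split` and the toy frame below are hypothetical names and made-up values used for illustration, not part of the notebook or its data.

```python
import pandas as pd


def time_split(df, cutoff_year, target="Result", drop_cols=("Year", "Month", "Ticker")):
    """Split rows into train (Year < cutoff_year) and test (Year >= cutoff_year)."""
    train = df[df["Year"] < cutoff_year]
    test = df[df["Year"] >= cutoff_year]
    # The target and the non-feature columns are removed from the feature matrices
    to_drop = [target, *drop_cols]
    x_train, y_train = train.drop(columns=to_drop), train[target]
    x_test, y_test = test.drop(columns=to_drop), test[target]
    return x_train, y_train, x_test, y_test


# Tiny made-up frame, only to exercise the helper
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB", "BBB"],
    "Year": [2021, 2023, 2022, 2023],
    "Month": [3, 3, 6, 6],
    "MA Ratio": [1.01, 0.98, 1.03, 1.00],
    "Result": [1.05, 0.97, 1.10, 1.02],
})

x_train, y_train, x_test, y_test = time_split(toy, cutoff_year=2023)
split_summary = {
    "train_rows": len(x_train),
    "test_rows": len(x_test),
    "features": list(x_train.columns),
}
print(split_summary)
```

On the real data, the equivalent call would presumably be `time_split(data, cutoff_year=2023)` with the leftover index column added to `drop_cols`.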

### 4. Run regression

In [4]:
xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)
xgb_regressor.fit(x_train, y_train)
y_pred = xgb_regressor.predict(x_test)

### 5. Evaluation

In [5]:
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
r2 = r2_score(y_test, y_pred)
print(f'R^2 Score: {r2}')

Mean Squared Error: 0.02983337233335683
R^2 Score: -0.33087834658024007
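
To put these numbers on the scale of the target, the sketch below takes the square root of the MSE and uses the identity R² = 1 − MSE / Var(y_test) to back out the variance of the test targets. It uses only the two values printed above. Reading `Result` as a price ratio (price three months later divided by the current price) is an assumption here, since the notebook does not define the column.

```python
import math

mse = 0.02983337233335683   # Mean Squared Error printed above
r2 = -0.33087834658024007   # R^2 Score printed above

# RMSE is in the same units as the target
rmse = math.sqrt(mse)

# From R^2 = 1 - MSE / Var(y_test), where Var is the population variance
# of the test targets, it follows that Var(y_test) = MSE / (1 - R^2)
var_y_test = mse / (1.0 - r2)
std_y_test = math.sqrt(var_y_test)

print(f"RMSE: {rmse:.4f}")
print(f"Implied std of y_test: {std_y_test:.4f}")
```

Under that reading, the root-mean-square error is roughly 0.17 in ratio terms, which is larger than the roughly 0.15 spread of the test targets themselves. That is consistent with the negative R².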


### 6. Conclusion
In this attempt with XGBoost regression, the results are as follows:
- Mean Squared Error (MSE): 0.0298
- R² Score: -0.3309

The negative R² score means that the model's squared error on the test set is larger than that of a baseline that always predicts the mean of the test targets, so the model has not captured patterns that carry over to 2023 and later. The poor performance is likely due to the limited dataset, which covers a relatively short time span (the data starts in 2018) and includes just one recessionary period (2022). This lack of diversity likely hindered the model's ability to learn and generalize, leading to overfitting or an inability to predict accurately for different time periods.
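
The definition behind this interpretation can be checked by hand. The sketch below implements R² = 1 − SS_res / SS_tot in plain Python (`r2_manual` is a hypothetical helper, and the numbers are made up) and shows that always predicting the mean gives a score of about 0, while predictions that are worse than the mean give a negative score.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets, for illustration only
y_true = [0.9, 1.0, 1.1, 1.2]

# Baseline: always predict the mean of the targets
mean_value = sum(y_true) / len(y_true)
mean_pred = [mean_value] * len(y_true)

# A poor model: predictions in the reverse order of the targets
off_pred = [1.2, 1.1, 1.0, 0.9]

r2_mean = r2_manual(y_true, mean_pred)
r2_off = r2_manual(y_true, off_pred)
print(f"R^2 of mean baseline: {r2_mean:.3f}")
print(f"R^2 of poor model:    {r2_off:.3f}")
```

A negative score therefore does not indicate a calculation error. It only says that the residual sum of squares exceeds the total sum of squares around the mean.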

##### Next Steps
To improve the model's performance, it would be beneficial to expand the dataset to include a longer time frame with multiple economic cycles. This would provide more varied data, allowing the model to better capture the underlying trends and improve its generalization ability.