# **PROJECT TYPE**
Supervised predictive modelling using regression techniques.

# **Contribution**
Prathamesh Santosh Bharsakale

# **PROJECT SUMMARY:**
Financial forecasting is essential for stock market analysis, investment planning, and risk assessment. This project aims to predict closing prices based on historical data using machine learning techniques. The dataset comprises important financial indicators, including Open, High, Low, Close prices, and Date. The main goal is to create an accurate regression model that effectively generalizes to new data while minimizing prediction errors.

# **Data Cleaning & Preprocessing**

Before training the models, the data was cleaned and preprocessed to improve model performance:

- Removing outliers using the Interquartile Range (IQR) rule to prevent extreme values from skewing predictions.
- Reducing skewness with a log transformation, which brings the price distributions closer to normal.
- Scaling features with StandardScaler so that all inputs are on a comparable scale across models.

The cleaned dataset was then split into training (80%) and testing (20%) sets for model evaluation.

# **Models Implemented & Performance Analysis**

Three regression models were tested to identify the best-performing algorithm:

- Linear Regression
- Decision Tree Regressor
- XGBoost Regressor

# **Hyperparameter Tuning**

To optimize the tree-based models, Hyperparameter Tuning (HPT) was performed using GridSearchCV and manual trial and error:

- DTR: tuned max_depth, min_samples_split, min_samples_leaf
- XGBR: tuned learning_rate, n_estimators, max_depth, subsample, colsample_bytree

This project demonstrates how ML can be applied to financial forecasting and lays a foundation for stock price prediction, investment risk analysis, and automated trading systems.

# **GITHUB LINK:**

# **Problem Statement:**
Predicting stock prices is a complex and dynamic challenge influenced by numerous market factors. A significant factor to consider is the 2018 fraud case, which likely caused substantial fluctuations in stock prices. The objective is to analyze historical stock price data and create a model that can deliver reliable estimates of future prices.

In [25]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler  # needed by the feature-scaling cell

In [None]:
data = pd.read_csv('/content/data_YesBank_StockPrices.csv')
data.head()

In [None]:
data.info()

In [None]:
data.describe()

# Data Cleaning and Preprocessing

In [33]:
# Remove outliers using the 1.5 * IQR rule on the four price columns
price_cols = ['Open', 'High', 'Low', 'Close']
Q1 = data[price_cols].quantile(0.25)
Q3 = data[price_cols].quantile(0.75)
IQR = Q3 - Q1
is_outlier = ((data[price_cols] < (Q1 - 1.5 * IQR)) |
              (data[price_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
# .copy() avoids SettingWithCopyWarning when new columns are added later
data = data[~is_outlier].copy()
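
To make the 1.5 * IQR rule concrete, here is a minimal sketch on a toy series. The values and the helper name `iqr_bounds` are made up for illustration and are not part of the project data.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy prices (hypothetical): five ordinary values and one extreme value
toy_prices = pd.Series([10, 12, 11, 13, 12, 100])
lower, upper = iqr_bounds(toy_prices)

# Keep only the values inside the fences
kept = toy_prices[(toy_prices >= lower) & (toy_prices <= upper)]
print(lower, upper)   # 9.0 15.0
print(kept.tolist())  # [10, 12, 11, 13, 12]
```

Only the value 100 falls outside the fences, so it is the only row dropped.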

In [None]:
# Parse dates such as 'Jan-18' (abbreviated month, two-digit year)
data['Date'] = pd.to_datetime(data['Date'], format='%b-%y')

# Extract month and year as numeric columns
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year

data.drop(columns=['Date'], inplace=True)

data.head()

In [36]:
# Reduce skewness with log(1 + x); np.log1p is the numerically stable form
data[['Open', 'High', 'Low', 'Close']] = np.log1p(data[['Open', 'High', 'Low', 'Close']])
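
Because `Close` is log-transformed as well, model predictions come out on the log scale. A minimal sketch of the round trip, using made-up prices: `np.expm1` is the inverse of `np.log1p`, so it can map predictions back to price units.

```python
import numpy as np

# Hypothetical closing prices, not taken from the dataset
toy_prices = np.array([12.5, 48.0, 310.0, 1250.0])

# Forward transform: log(1 + x) compresses large values
log_prices = np.log1p(toy_prices)

# Inverse transform: exp(y) - 1 recovers the original units
recovered = np.expm1(log_prices)

print(np.allclose(recovered, toy_prices))  # True
```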


In [19]:
# Check for NaN values and drop them
if data.isnull().sum().sum() > 0:
    print("NaN values found. Dropping rows with NaN values.")
    data = data.dropna()


# Feature Scaling

In [37]:
scaler = StandardScaler()
data[['Open', 'High', 'Low']] = scaler.fit_transform(data[['Open', 'High', 'Low']])

# Splitting the dataset into train and test

In [38]:
# Splitting the dataset into features and target variable
X = data[['Open', 'High', 'Low']]
y = data['Close']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
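
One caveat with the cells above: the scaler is fitted on the full dataset before the split, so test-set statistics influence the training features. A sketch of a leakage-free alternative using scikit-learn's `make_pipeline` follows. The synthetic data and the `_demo` names are stand-ins, not the project dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three features and a target (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([0.2, 0.5, 0.3]) + rng.normal(scale=0.5, size=200)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then fits the regressor
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# At prediction time the training-set mean and variance are reused
print(round(pipe.score(X_te, y_te), 3))
```

Wrapping both steps in one estimator also means the same object can be passed to `GridSearchCV` without refitting the scaler by hand.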

# Implementing Linear Regression

In [None]:
# Implementing Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Making predictions
y_pred_linear = model.predict(X_test)
y_pred_linear

# Visualizing prediction and actual values

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(y_test.values, label='Actual Prices', marker='o')
plt.plot(y_pred_linear, label='Predicted Prices', marker='x', alpha=0.7)
plt.title('Actual vs Predicted Closing Prices (Linear Regression)')
plt.xlabel('Samples')
plt.ylabel('Closing Price (log1p scale)')
plt.legend()
plt.show()

# Using another Model

In [44]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler

In [None]:
# Implementing Decision Tree Regression
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Making predictions
y_pred_tree = model.predict(X_test)
y_pred_tree

In [None]:
# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.plot(y_test.values, label='Actual Prices', marker='o')
plt.plot(y_pred_tree, label='Predicted Prices', marker='x', alpha=0.7)
plt.title('Actual vs Predicted Closing Prices (Decision Tree Regression)')
plt.xlabel('Samples')
plt.ylabel('Closing Price (log1p scale)')
plt.legend()
plt.show()
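
The summary lists `max_depth`, `min_samples_split`, and `min_samples_leaf` as the tuned decision-tree parameters. A sketch of how such a search might look with `GridSearchCV` is shown here. The grid values and the synthetic data are illustrative assumptions, not the settings used in this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): three features, noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([0.2, 0.5, 0.3]) + rng.normal(scale=0.1, size=150)

# Hypothetical grid over the three parameters named in the summary
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # best combination found on this toy data
print(-search.best_score_)   # cross-validated MSE of that combination
```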

# Calculating Metrics

In [None]:
# Calculating metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse_linear = mean_squared_error(y_test, y_pred_linear)
mae_linear = mean_absolute_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

mse_tree = mean_squared_error(y_test, y_pred_tree)
mae_tree = mean_absolute_error(y_test, y_pred_tree)
r2_tree = r2_score(y_test, y_pred_tree)

print(f'Linear Regression MSE: {mse_linear}')
print(f'Linear Regression MAE: {mae_linear}')
print(f'Linear Regression R²: {r2_linear}')

print(f'Decision Tree Regression MSE: {mse_tree}')
print(f'Decision Tree Regression MAE: {mae_tree}')
print(f'Decision Tree Regression R²: {r2_tree}')
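
For reference, the three metrics can be computed by hand with NumPy. This sketch uses made-up numbers and checks the results against scikit-learn. Note that `Close` was log-transformed in this notebook, so the errors printed above are on the log scale rather than in price units.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_hat                      # [1, 0, -1, -1]
mse = np.mean(residuals ** 2)                   # average squared error
mae = np.mean(np.abs(residuals))                # average absolute error
ss_res = np.sum(residuals ** 2)                 # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2 = 1 - ss_res / ss_tot

# The hand-computed values agree with scikit-learn
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))

print(mse, mae, round(r2, 2))  # 0.75 0.75 0.85
```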

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Create a confusion-like matrix for regression
# 1. Binning the predictions and actual values
n_bins = 10  # Number of bins - adjust as needed
bins = np.linspace(y.min(), y.max(), n_bins + 1)  # Create bins for the target variable

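# Bin the predicted and actual values using the same bins.
# np.digitize returns 0 below the first edge and n_bins + 1 at or above the last edge,
# so clip into 1..n_bins; otherwise those samples are silently left out of the matrix.
y_pred_linear_binned = np.clip(np.digitize(y_pred_linear, bins), 1, n_bins)
y_test_binned = np.clip(np.digitize(y_test, bins), 1, n_bins)
y_pred_tree_binned = np.clip(np.digitize(y_pred_tree, bins), 1, n_bins)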

# 2. Create confusion matrices
conf_matrix_linear = confusion_matrix(y_test_binned, y_pred_linear_binned, labels=np.arange(1, n_bins + 1))
conf_matrix_tree = confusion_matrix(y_test_binned, y_pred_tree_binned, labels=np.arange(1, n_bins + 1))

# 3. Plotting the confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Adjust labels to match the number of bins
labels = [f'Bin {i}' for i in range(1, n_bins + 1)]

ConfusionMatrixDisplay(conf_matrix_linear, display_labels=labels).plot(ax=axes[0], cmap='Blues', values_format='d')
axes[0].set_title('Confusion-like Matrix (Linear Regression)')

ConfusionMatrixDisplay(conf_matrix_tree, display_labels=labels).plot(ax=axes[1], cmap='Blues', values_format='d')
axes[1].set_title('Confusion-like Matrix (Decision Tree Regression)')

plt.tight_layout()
plt.show()
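
One detail of `np.digitize` is worth knowing when binning continuous values like this: anything below the first edge maps to 0, and anything at or above the last edge maps to `n_bins + 1`. A small sketch with made-up numbers shows the behavior and one way to handle it with `np.clip`.

```python
import numpy as np

n_bins_demo = 5
edges = np.linspace(0.0, 10.0, n_bins_demo + 1)   # edges: 0, 2, 4, 6, 8, 10

# Made-up values, including two outside the range and one on the top edge
values = np.array([-1.0, 0.0, 1.9, 2.0, 10.0, 11.0])

raw = np.digitize(values, edges)
print(raw.tolist())      # [0, 1, 1, 2, 6, 6]

# Clip so out-of-range values fall into the first or last bin
clipped = np.clip(raw, 1, n_bins_demo)
print(clipped.tolist())  # [1, 1, 1, 2, 5, 5]
```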

# Performance Comparison: Linear Regression vs. Decision Tree Regression
In this analysis, we compared two regression models, Linear Regression and Decision Tree Regression, on the same financial dataset to predict stock closing prices.

# Findings:
## Linear Regression:

Linear Regression predicted closing prices more accurately. Its Mean Squared Error (MSE) was significantly lower than that of the Decision Tree model, meaning its predictions were closer to the actual values.

## Decision Tree Regression:

Decision Trees can capture non-linear relationships and interactions between features, but they tend to overfit the training data, especially with limited data points. The tree here was fitted with default settings, so its depth was unconstrained. Overfitting of this kind often leads to poorer generalization on unseen data and a higher MSE than Linear Regression.
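
A gap between training and test scores is the usual symptom of overfitting. A minimal sketch of how to check for it, on synthetic data (an assumption, not the project dataset):

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): noisy linear target, few samples
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([0.2, 0.5, 0.3]) + rng.normal(scale=0.3, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Default settings: the tree keeps splitting until it fits the training points
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, tree.predict(X_tr))
test_r2 = r2_score(y_te, tree.predict(X_te))
print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

Running the same comparison on the notebook's own `X_train` and `X_test` would show whether the unconstrained tree memorized the training set here.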


# Conclusion:
For this dataset, the Linear Regression model outperformed the Decision Tree Regression model in predicting stock closing prices. This suggests that the relationship between the features and the target is close to linear, which is plausible because each period's Close lies between its Low and High. Linear Regression is therefore the more suitable choice for this forecasting task.