# **Project Name  - Yes Bank**



##### **Project Type**    - Regression/Machine Learning
##### **Contribution**    - Individual

# **Project Summary -**

This project focuses on building a machine learning model to predict the closing stock price based on historical stock data. The dataset includes monthly stock records with variables such as Open, High, Low, and Close prices. The main goal was to build a robust, accurate, and interpretable predictive model that could assist investors, analysts, and businesses in making data-driven financial decisions.


*1. Data Understanding and Preprocessing:*
   * The dataset comprised 185 records, each with Date, Open, High, Low, and Close columns.
   * All columns were in appropriate data types, though Date was converted into a datetime format for better time-based analysis.
   * From the Date, we extracted new features: Year and Month, enabling us to study seasonal and yearly trends.


*2. Feature Engineering :* To enhance the model's learning, several new features were created:
   * Close_Lag_1: Previous month's closing price
   * Daily_Open_Close_Diff: Close − Open for the month (despite the name, the data is monthly)
   * Price_Range: High − Low for the month

These features helped capture momentum and volatility patterns, which are key indicators in stock price movement.


*3. Exploratory Data Analysis :*
   * Strong positive linear relationships between Close and the features Open, High, and Low.
   * Month and Year had weaker linear correlation but indicated potential seasonal effects.
   * Data was right-skewed, leading us to apply a log transformation to the Close variable to normalize the distribution and stabilize variance.

Visualization tools such as scatter plots, histograms, heatmaps, and pair plots were used to understand distributions, correlations, and outliers.


*4. Model Building and Evaluation :*
Two models were trained and evaluated -
  * Linear Regression (with log-transformed target)
  * Random Forest Regressor

Linear Regression - Performed well due to the data's linear relationships. Achieved an R² Score of 0.983 (on test split) and 0.974 (avg. across 5-fold CV). RMSE: ~0.11 (on the log-transformed, standardized target), showing low prediction error. Highly interpretable, making it suitable for business insights.

Random Forest : Better at capturing non-linear patterns. Test split R²: 0.97, but 5-fold CV average R²: 0.935, indicating slightly less stability. Less interpretable than linear regression.


*5. Model Evaluation Metrics Used :*
  * R² Score: Proportion of the variance in the target that the model explains; higher is better.
  * Mean Squared Error (MSE): Average of the squared errors; penalizes larger errors more heavily.
  * Root Mean Squared Error (RMSE): Square root of MSE; the typical size of a prediction error, in the same units as the target.

Cross-validation was used to check that the model wasn't overfitting and performed consistently across different splits.
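
As a small illustration of how these three metrics relate, the sketch below computes them by hand with NumPy. The arrays `y_true_demo` and `y_pred_demo` are made-up values for illustration only, not figures from the Yes Bank data.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only)
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true_demo - y_pred_demo

# MSE: mean of the squared residuals
mse_demo = np.mean(residuals ** 2)

# RMSE: square root of MSE, in the same units as the target
rmse_demo = np.sqrt(mse_demo)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"MSE: {mse_demo}, RMSE: {rmse_demo:.3f}, R²: {r2_demo:.3f}")
```

Because RMSE is just the square root of MSE, the two always rank models the same way; RMSE is simply easier to read in the target's units.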


*6. Insights and Business Impact*
  * The closing price is highly influenced by Open, High, and Low prices, which supports using regression-based models.
  * The model can help forecast future prices, allowing businesses to plan better and investors to assess risk.
  * Incorporating feature engineering (e.g., lag variables) significantly improved model performance.

# **GitHub Link -**

[https://github.com/Aastha2675/Stock_Price_Prediction](https://github.com/Aastha2675/Stock_Price_Prediction)

# **Problem Statement**


Stock prices are highly volatile and influenced by various factors. In the case of Yes Bank, events like the 2018 fraud case led to sharp fluctuations. Accurately predicting the monthly closing stock price is challenging yet essential for investors and analysts. This project addresses the problem by using historical stock data to build a machine learning model that forecasts closing prices and aids decision-making.

# ***Solution***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
df.shape

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(6, 3))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Visualizing Missing Values Using Heatmap')
plt.show()

### What did you know about your dataset?

This is a monthly time series dataset containing historical stock price data for Yes Bank.

*Dataset Structure:* 185 rows and 5 columns
  * Date
  * Open - Price at the beginning of the month
  * High - Highest price in that month
  * Low - Lowest price in that month
  * Close - Price at the end of the month

*Datatype of columns:* Date is of object type; Open, High, Low, and Close are all float64, representing monthly stock prices.

There are no missing or duplicate values, so the dataset is already clean and ready for modeling after minor preprocessing.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

All four are continuous numerical features, and the high standard deviation shows significant volatility in Yes Bank’s stock price over time.

1. Open - Opening stock price for the month. Ranges from ₹10 to ₹369.95. Median (50%) is ₹62.98.

2. High - Highest stock price during the month. Goes up to ₹404. Indicates peak market value.

3. Low - Lowest stock price in the month. Shows market downside, ranging from ₹5.55 to ₹345.50.

4. Close - Final stock price at month end. Median is ₹62.54, max is ₹367.90.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col} has {df[col].nunique()} unique values")

Date has 185 unique values, confirming that each row represents a unique month.

Other variables have high uniqueness (close to the total number of rows), so they are not categorical; they are all numerical and continuous.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# converting the type of col 'Date' to datetime64[ns] format
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

In [None]:
# checking the updated datatype of col 'Date'
df['Date'].dtypes

In [None]:
# sorting the df chronologically (if not sorted earlier)
df = df.sort_values('Date').reset_index(drop=True)

In [None]:
print(f"Range of Date is : {df['Date'].min()} , {df['Date'].max()}")

In [None]:
# adding new col 'Year' 'Month' for better understanding
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

In [None]:
# checking whether new columns are added successfully or not
df.head(2)

In [None]:
# as we added new cols 'Year' and 'Month' we can drop the col 'Date'
df.drop('Date',axis=1,inplace=True)

In [None]:
# checking
df.head(2)

### What all manipulations have you done and insights you found?

The Date column was converted to datetime, the rows were sorted chronologically, Year and Month were extracted, and the original column was dropped.

These manipulations made the dataset clean, chronologically ordered, and ready for time-series or seasonal analysis, such as checking how the closing price varies across years or months.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

Scatter Plot

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(15, 5))

# Relation between 'Close' and 'Open'
plt.subplot(2, 3, 1)
sns.scatterplot(x='Open', y='Close', data=df)
plt.title('Open vs Close')

# Relation between 'Close' and 'Low'
plt.subplot(2, 3, 2)
sns.scatterplot(x='Low', y='Close', data=df)
plt.title('Low vs Close')

# Relation between 'Close' and 'High'
plt.subplot(2, 3, 3)
sns.scatterplot(x='High', y='Close', data=df)
plt.title('High vs Close')

# Relation between 'Close' and 'Month'
plt.subplot(2, 3, 4)
sns.scatterplot(x='Month', y='Close', data=df)
plt.title('Month vs Close')

# Relation between 'Close' and 'Year'
plt.subplot(2, 3, 5)
sns.scatterplot(x='Year', y='Close', data=df)
plt.title('Year vs Close')

plt.tight_layout()
plt.show()

##### Questions

###### 1. Why did you pick the specific chart?


*  It helps us visually inspect whether a linear relationship exists between variables.
*  It helps check whether the data is evenly spread, clustered, or skewed, which can affect the model’s accuracy.
*  It shows whether there are any outliers, which might need special handling or removal.


###### 2. What is/are the insight(s) found from the chart?

* Input features - Open, Low, and High show a strong positive linear relationship with the Close price.
* Input features - Month and Year do not show any linear relation with the Close price.
* The chart does not show major outliers or abnormal patterns.

###### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help in positive business impact. The visualizations clearly show that the variables Open, High, and Low have a strong positive linear relationship with the Close price. This confirms that these features are reliable predictors for building a regression model to forecast stock closing prices accurately. Month and Year show weak relation, which may reduce model accuracy if not handled properly.


#### Chart - 2

Line Plot

In [None]:
# Calculating the yearly average Close and Open price
yearly_avg = df.groupby('Year')[['Close','Open']].mean().reset_index()

plt.figure(figsize=(10, 5))
sns.lineplot(data=yearly_avg, x='Year', y='Close', marker='o', color='blue',label='closing price')
sns.lineplot(data=yearly_avg, x='Year', y='Open', marker='o', color='green',label='opening price')
plt.title('Yearly trend')
plt.xlabel('Year')
plt.ylabel('Avg Price')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

##### Questions

###### 1. Why did you pick the specific chart?


*  To observe the yearly trend for open price and close price



###### 2. What is/are the insight(s) found from the chart?

* Both Opening and Closing prices show a consistent upward trend from 2005 to 2017, peaking in 2017.

* A sharp decline is seen after 2018, indicating a possible market correction or external shock (e.g., fraud, crash).

* Opening and Closing prices closely follow each other, confirming strong correlation.

###### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

* Positive Impact: The steady rise till 2017 indicates growth and investor confidence—useful for strategic investments.

* Negative Growth Insight: Post-2018 decline warns businesses/investors to analyze risks or events affecting stock performance.

#### Chart - 3

Box plot

In [None]:
for i, col in enumerate(df.columns[:4]):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x=df[col], color='skyblue')
    plt.title(f'Box Plot of {col}')

plt.tight_layout()
plt.show()


##### Questions

###### 1. Why did you pick the specific chart?


*  To visualize the outliers and the spread of each price variable



###### 2. What is/are the insight(s) found from the chart?

* All four variables (Open, High, Low, Close) show positive skew with outliers on the higher end.

* The distribution is not symmetric; majority of data lies in the lower range.

* Outliers are present but consistent across features, likely representing market spikes.

###### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

* Positive Insight: Stable interquartile ranges (IQR) show a predictable core pattern useful for baseline modeling.

* Negative Insight: Outliers may cause model bias—need to handle them for better predictions.

#### Chart - 4

Heatmap

In [None]:
# visualization code
plt.figure(figsize=(6, 4))
sns.heatmap(df[['Open', 'High', 'Low', 'Month','Year' ,'Close']].corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()

###### 1. Why did you pick the specific chart?


* It clearly visualizes the strength and direction of correlation between variables.
* Helps to identify highly correlated features and avoid multicollinearity in regression.


###### 2. What is/are the insight(s) found from the chart?



*  Close has a very strong positive correlation with Open (0.98), High (0.99), and Low (1.0).
* Month and Year show very weak or negative correlation with Close.



###### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

Yes, selecting features like Open, High, and Low will enhance prediction accuracy for closing prices.
However, since these are strongly inter-correlated, it may lead to multicollinearity, which can affect model stability.
Using Month and Year directly may not add value and might dilute model effectiveness.

#### Chart - 5

Pair Plot

In [None]:
# visualization code
# sns.pairplot creates its own figure, so no separate plt.figure() call is needed

selected_features = ['Open','Low','High','Year','Month','Close']
sns.pairplot(df, vars=selected_features)
plt.show()

##### Questions


###### 1. Why did you pick the specific chart?


* It helps visualize pairwise relationships between multiple variables in one view.
* Shows linear trends, correlations, and cluster patterns.
* Useful to check multicollinearity among predictors before model building.



###### 2. What is/are the insight(s) found from the chart?


* Strong positive linear relationship between Open, Low, High, and Close
* Month and Year show no clear trend with Close.
* No major outliers or noisy patterns detected among selected features.


###### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights support selecting strong predictors (Open, High, Low) to build a reliable model for forecasting closing prices, which helps in better investment planning. Weak patterns from Year/Month may reduce accuracy if used without proper encoding, but don’t directly indicate negative growth.

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


Null Hypothesis (H₀): The mean of Open prices = mean of Close prices

Alternative Hypothesis (H₁): The mean of Open prices ≠ mean of Close prices

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_rel

# Paired sample t-test
t_stat, p_value = ttest_rel(df['Open'], df['Close'])

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Significance level
alpha = 0.05

# Conclusion
if p_value < alpha:
    print("Reject H0: There is a significant difference between Open and Close price means.")
else:
    print("Fail to reject H0: No significant difference between Open and Close price means.")


##### Questions

###### Which statistical test have you done to obtain P-Value?

paired t-test

###### Why did you choose the specific statistical test?

Because Open and Close prices are related for each time point, the paired t-test is appropriate. It checks whether the average difference between Open and Close is statistically significant.
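
To make the reasoning above concrete, the sketch below shows that a paired t-test is the same as a one-sample t-test on the per-row differences. The arrays `open_demo` and `close_demo` are synthetic stand-ins, not the Yes Bank data.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Synthetic paired observations (illustration only)
rng = np.random.default_rng(0)
open_demo = rng.uniform(50, 100, size=30)
close_demo = open_demo + rng.normal(0, 5, size=30)

# Paired t-test on the two columns
t_paired, p_paired = ttest_rel(open_demo, close_demo)

# One-sample t-test on the differences against a mean of 0
t_diff, p_diff = ttest_1samp(open_demo - close_demo, 0.0)

print(f"paired:      t={t_paired:.4f}, p={p_paired:.4f}")
print(f"differences: t={t_diff:.4f}, p={p_diff:.4f}")
```

Both calls give the same statistic and p-value, which is why the test is only meaningful when each Open value has a natural partner Close value from the same month.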

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the mean closing price before and after the 2018 fraud case.

Alternative Hypothesis (H1): There is a significant difference in the mean closing price before and after the 2018 fraud case.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# split the data into before and after 2018
before_2018 = df[df['Year'] < 2018]['Close']
after_2018 = df[df['Year'] >= 2018]['Close']

# performing independent t-test (Welch's variant: equal variances are not assumed)
t_stat, p_value = ttest_ind(before_2018, after_2018, equal_var=False)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# significance level
alpha = 0.05

if p_value < alpha:
    print("Reject H0: Significant difference in mean closing prices before and after 2018.")
else:
    print("Fail to reject H0: No significant difference in mean closing prices before and after 2018.")

##### Question

###### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test (Welch's variant, since equal_var=False)

###### Why did you choose the specific statistical test?

The goal is to compare the mean closing prices before and after the 2018 fraud case.
The Close price is continuous numeric data, and we are comparing means across two different time periods, before and after 2018. Hence, a two-sample t-test is an appropriate choice; Welch's variant (equal_var=False) is used so that the two periods are not assumed to have equal variance.
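
As a rough sketch of what `equal_var=False` computes, the Welch statistic can be reproduced by hand: the difference in means divided by a standard error built from each group's own variance. The two groups below are synthetic and only stand in for the "before" and "after" periods.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with different sizes and variances (illustration only)
rng = np.random.default_rng(1)
group_before = rng.normal(100, 30, size=40)
group_after = rng.normal(60, 10, size=15)

# Welch standard error: each group contributes its own variance / n
se_welch = np.sqrt(
    group_before.var(ddof=1) / len(group_before)
    + group_after.var(ddof=1) / len(group_after)
)
t_manual = (group_before.mean() - group_after.mean()) / se_welch

# Same statistic from SciPy
t_scipy, p_scipy = ttest_ind(group_before, group_after, equal_var=False)

print(f"manual t={t_manual:.4f}, scipy t={t_scipy:.4f}, p={p_scipy:.4g}")
```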

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The average monthly price range (High - Low) is less than or equal to a threshold of ₹5.

Alternative Hypothesis (H1): The average monthly price range is greater than ₹5.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_1samp

# Calculate monthly price range
df['Price_Range'] = df['High'] - df['Low']

# Perform one-sample t-test (test against value = 5)
t_stat, p_value = ttest_1samp(df['Price_Range'], 5)

print(f"T-statistic: {t_stat}")
print(f"P-value (two-tailed): {p_value}")

# Convert to one-tailed p-value for H1: mean > 5
p_value_one_tailed = p_value / 2

# Significance level
alpha = 0.05

# Conclusion
if t_stat > 0 and p_value_one_tailed < alpha:
    print("Reject H0: The average monthly price range is significantly greater than 5.")
else:
    print("Fail to reject H0: No significant evidence that average range exceeds 5.")

##### Which statistical test have you done to obtain P-Value?

one-sample, one-tailed t-test.

##### Why did you choose the specific statistical test?

Because we are testing a sample mean (High - Low) against a specific threshold value (here ₹5).
Since we are only interested in whether the mean is greater than 5, a one-tailed one-sample t-test is the appropriate choice.
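
The one-tailed p-value can also be requested directly through the `alternative` parameter of `ttest_1samp` (available in SciPy 1.6 and later). The sketch below, on synthetic data named `range_demo`, checks that it agrees with halving the two-tailed p-value, which is only valid when the t-statistic has the sign stated in H1.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic price ranges (illustration only)
rng = np.random.default_rng(2)
range_demo = rng.normal(8, 4, size=50)

# Two-tailed test, then one-tailed test for H1: mean > 5
t_two, p_two = ttest_1samp(range_demo, 5)
t_one, p_one = ttest_1samp(range_demo, 5, alternative='greater')

# Manual conversion: halve when t > 0, otherwise take the complement
p_manual = p_two / 2 if t_two > 0 else 1 - p_two / 2

print(f"t={t_one:.4f}, one-tailed p={p_one:.4g}, manual p={p_manual:.4g}")
```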


## ***6. Feature Engineering & Data Pre-processing***

No need to handle missing values

### 1. Handling Outliers

In [None]:
# Making copy of dataset
df1 = df.copy()

# Define a function to remove outliers using IQR
def remove_outliers_iqr(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Filter out rows where values are outside the IQR bounds
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

    return df

# List of numeric columns to check for outliers
cols_to_check = ['Open', 'High', 'Low', 'Close']

# Apply the function
df = remove_outliers_iqr(df, cols_to_check)

df.shape

##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR (Interquartile Range) Method - It detects outliers as values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

Why Used: This method is robust to non-normal distributions and works by defining a range around the middle 50% of the data. Data points more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3) are considered outliers.
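
A tiny worked example of the IQR rule, using a made-up series `prices_demo` (not the project data), shows how the bounds are derived and which value falls outside them.

```python
import pandas as pd

# Made-up values with one obvious spike (illustration only)
prices_demo = pd.Series([10, 12, 13, 14, 15, 16, 100])

q1 = prices_demo.quantile(0.25)   # 12.5
q3 = prices_demo.quantile(0.75)   # 15.5
iqr = q3 - q1                     # 3.0

lower_bound = q1 - 1.5 * iqr      # 8.0
upper_bound = q3 + 1.5 * iqr      # 20.0

# Values outside [lower_bound, upper_bound] are flagged as outliers
outliers_demo = prices_demo[(prices_demo < lower_bound) | (prices_demo > upper_bound)]
print(outliers_demo.tolist())     # [100]
```

Note that in a price series the flagged values are often genuine market spikes rather than data errors, so removing them also removes real history.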

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
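# Closing price from the previous row (the previous month in this monthly dataset;
# rows dropped as outliers may leave gaps between consecutive rows)
df['Close_Lag_1'] = df['Close'].shift(1)

# Open-Close difference for the month (the column name says "Daily", but the data is monthly)
df['Daily_Open_Close_Diff'] = df['Close'] - df['Open']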

df.head()

In [None]:
# Close_Lag_1 will have NaN in the first row since there's no previous month.
df.dropna(inplace=True)

#### 2. Feature Selection

In [None]:
# Select relevant features
features = ['Open', 'High', 'Low', 'Close_Lag_1', 'Price_Range', 'Daily_Open_Close_Diff']

# Calculate Pearson correlation with 'Close'
correlations = df[features + ['Close']].corr()['Close'].drop('Close')

print(correlations)

##### What all feature selection methods have you used  and why?

To identify which of these features might be most useful for predicting 'Close' price, I've calculated the Pearson correlation coefficient between each feature and the 'Close' price. Features with higher absolute correlation values are generally considered more relevant.

##### Which all features you found important and why?

**Low**, **High**, **Close_Lag_1**, and **Open** show extremely high positive correlations with the 'Close' price. This indicates they are very strong indicators, which is typical for stock data where prices are highly correlated from one period to the next (here, month to month).

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, Data transformation is needed because -


* Different Scales : Open, High, and Low are on much larger numeric scales than features such as Price_Range. Scale differences affect distance-based models (KNN, SVM) and make linear regression coefficients hard to compare.
* Skewness : From histograms, the variables appeared right-skewed. Skewed data can violate model assumptions (especially linear regression) and lead to poor performance.
* Improve Model Convergence : Gradient-based optimizers converge faster with normalized or standardized inputs. (scikit-learn's LinearRegression solves least squares directly, so this matters mainly for gradient-based alternatives.)
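
The skewness point can be illustrated with a short sketch. It draws a synthetic right-skewed series, `skew_demo` (an assumption for illustration, not the stock data), and compares the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (log-normal), illustration only
rng = np.random.default_rng(42)
skew_demo = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=1000))

skew_raw = skew_demo.skew()            # strongly positive for log-normal data
skew_log = np.log1p(skew_demo).skew()  # much closer to 0 after the transform

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

`np.log1p(x)` computes `log(1 + x)`, so it is also defined at `x = 0`; its inverse is `np.expm1`.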



In [None]:
# Check the skewness of newly added features
for i, col in enumerate(df.columns[6:]):
    plt.subplot(2, 3, i + 1)
    sns.histplot(df[col], kde=True)
    plt.title(f'{col}')

plt.tight_layout()
plt.show()

In [None]:
# Log Transform - needed when data is skewed
df['Open_log'] = np.log1p(df['Open'])
df['High_log'] = np.log1p(df['High'])
df['Low_log'] = np.log1p(df['Low'])
df['Close_log'] = np.log1p(df['Close'])
df['Price_Range_log'] = np.log1p(df['Price_Range'])
df['Close_Lag_1_log'] = np.log1p(df['Close_Lag_1'])

df.head()

### 6. Data Scaling

In [None]:
# Data Scaling
from sklearn.preprocessing import StandardScaler

# Use log-transformed features
features_to_scale = ['Open_log','High_log','Low_log','Close_log','Price_Range_log','Close_Lag_1_log']

# standardization of log-transformed features
scaler = StandardScaler()
df = df.copy()
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

# Check the result
df.head()

##### Which method have you used to scale you data and why?

Here StandardScaler is used. It transforms each feature to have mean = 0 and standard deviation = 1, which puts the features on a comparable scale and makes linear regression coefficients easier to compare.
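
Because `Close_log` is both log-transformed and standardized, recovering a price means reversing both steps in the opposite order. The sketch below uses a separate, single-column scaler (`demo_scaler`) and made-up prices purely for illustration; the notebook's own scaler is fitted on several columns at once.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up closing prices as a single column (illustration only)
close_demo = np.array([[12.5], [48.0], [150.0], [367.9]])

# Forward: log1p, then standardize
close_log_demo = np.log1p(close_demo)
demo_scaler = StandardScaler()
close_scaled_demo = demo_scaler.fit_transform(close_log_demo)

# Backward: undo the standardization first, then undo log1p with expm1
recovered = np.expm1(demo_scaler.inverse_transform(close_scaled_demo))

# Applying exp directly to standardized values does not give prices back
not_recovered = np.exp(close_scaled_demo)

print(recovered.ravel())
print(not_recovered.ravel())
```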


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No, dimensionality reduction is not required for the current setup since the number of features is small and each has clear predictive value.

### 8. Data Splitting

In [None]:
# Importing Library
from sklearn.model_selection import train_test_split

# Features for Linear regression ML Model
X = df[['Open_log', 'High_log', 'Low_log', 'Close_Lag_1_log']]
y = df['Close_log']

# split the dataset - Linear regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42  )

# Features for Random Forest Regressor ML Model
X_rf = df[['Open', 'High', 'Low', 'Close_Lag_1']]
y_rf = df['Close']

# split the dataset - Random forest
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(X_rf, y_rf, test_size=0.2, random_state=42  )


##### What data splitting ratio have you used and why?

 80:20 ratio - 80% train, 20% test

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**No**, dataset is not imbalanced as there are no classes or labels to balance.

## ***7. ML Model Implementation***

### ML Model 1 - Linear Regression

In [None]:
# Required Libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the Linear Regression model
model_reg = LinearRegression()

# Fit the Algorithm
model_reg.fit(X_train, y_train)

# Predict on the model
y_pred_log = model_reg.predict(X_test)

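# Reverse the standardization, then the log1p transform, to get the Close price
close_idx = features_to_scale.index('Close_log')
y_pred_unscaled = y_pred_log * scaler.scale_[close_idx] + scaler.mean_[close_idx]
predicted_close_price = np.expm1(y_pred_unscaled)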

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mse = mean_squared_error(y_test, y_pred_log)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_log)

print("R² Score : ",r2)
print("MSE : ", mse)
print("RMSE : ", rmse)

**ML Model Used: Linear Regression**

*Input features:* Open, High, Low, and Close_Lag_1 (previous month's close), after log transformation and scaling

*Target feature:* Close Price

In this model we predict Close_log (the log-transformed, standardized close price); reversing the standardization and the log transform then gives the actual Close price.

**Evaluation metric Score Chart:**

*R² Score :* The model explains 98.38% of the variance in the (log-transformed) closing price on the test split. This is very high, indicating excellent predictive performance.

*MSE :* The average squared error between predicted and actual (log-transformed, standardized) values is very low.

*RMSE :* The average prediction error on the transformed scale is small. It is not expressed in rupees; once predictions are back-transformed, it is expected to correspond to a small deviation from actual prices.


#### 2. Cross- Validation

In [None]:
# import libraries
from sklearn.model_selection import cross_val_score

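# 5-fold cross-validation
# (stored as cv_r2_scores so the imported r2_score function is not overwritten)
cv_r2_scores = cross_val_score(model_reg, X, y, cv=5, scoring='r2')

# display
print("Cross-Validation R2 Scores:", cv_r2_scores)
print("Average R2 Score:", cv_r2_scores.mean())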

##### Which hyperparameter optimization technique have you used and why?

**K-fold Cross-validation Technique :**

No hyperparameter search was run for this model; K-fold cross-validation was used to validate it. It helps us check how well a model generalizes to unseen data by splitting the dataset into multiple training/testing folds.

While the single split had a slightly higher R², the cross-validation average gives a more realistic and trustworthy estimate of the model’s performance on unseen data.
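
For reference, passing `cv=5` to `cross_val_score` with a regressor is equivalent to passing an unshuffled `KFold(n_splits=5)`. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and the coefficients are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustration only)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_model = LinearRegression()

# Integer cv: scikit-learn builds the folds internally
scores_int = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring='r2')

# Explicit splitter: five contiguous, unshuffled folds
scores_kfold = cross_val_score(demo_model, X_demo, y_demo,
                               cv=KFold(n_splits=5), scoring='r2')

print(scores_int)
print(scores_kfold)
```

Since the folds are contiguous blocks of rows, on chronologically sorted data each fold covers a distinct stretch of time, which is worth keeping in mind when reading per-fold scores.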


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**No change to the model itself**, since cross-validation only evaluates it.
With 5 folds, the **average R² Score is 0.97** (versus 0.983 on the single test split), which indicates that the model generalizes well and **maintains a high level of accuracy across different subsets of the data**. This supports the model's reliability and performance stability.


### ML Model 2 - Random Forest

In [None]:
# import libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_rf_train, y_rf_train)

# Predict on test set
y_pred_rf = rf_model.predict(X_rf_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Evaluate performance
mse_rf = mean_squared_error(y_rf_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_rf_test, y_pred_rf)

print(f"Random Forest - MSE: {mse_rf:.2f}")
print(f"Random Forest - RMSE: {rmse_rf:.2f}")
print(f"Random Forest - R² Score: {r2_rf:.2f}")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross-validation
# importing libraries
from sklearn.model_selection import cross_val_score

# 5 k-fold
r2_score_rf = cross_val_score(rf_model, X_rf, y_rf, cv=5, scoring='r2')

# displaying scores
print("Cross-Validation R2 Scores:", r2_score_rf)
print("Average R2 Score:", r2_score_rf.mean())

##### Which hyperparameter optimization technique have you used and why?

**K-fold Cross-validation Technique :**

No hyperparameter search was run for this model (n_estimators was fixed at 100); K-fold cross-validation was used to validate it. It helps us check how well a model generalizes to unseen data by splitting the dataset into multiple training/testing folds.

While the single split had a slightly higher R², the cross-validation average gives a more realistic and trustworthy estimate of the model’s performance on unseen data.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Not really. With 5-fold cross-validation, the Random Forest model achieved an average R² Score of 0.935. While this is lower than the single-split score of 0.97, it is a more reliable indicator of the model's performance on unseen data. The gap suggests the single split was somewhat optimistic; cross-validation exposes this but does not by itself prevent overfitting.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

**Chosen Metrics:**

R² Score : Tells how much of the price variation is explained by the model (ideal for regression tasks). A high R² (~0.97) implies strong business reliability.

RMSE : Indicates the average prediction error. Useful for understanding real-world impact in price deviation (e.g., ₹7–₹16 error range).

MSE : Helps optimize the model by penalizing larger errors more heavily. Valuable in risk-sensitive environments like stock prediction.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**Chosen Model :** Linear Regression model with log transformation as the final prediction model, because -
* It achieved a higher average R² Score (0.9744) during cross-validation than Random Forest (0.9350).
* The data showed a strong linear relationship between input features (Open, High, Low, etc.) and Close, which suits linear models well.
* Model interpretability is crucial for business reporting, and Linear Regression provides clear coefficient explanations, which Random Forest lacks.



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Model Used: Linear Regression with Log Transformation**

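We log-transformed the target variable (Close) with np.log1p() to normalize the distribution and reduce skewness, and then standardized it.

To recover prices on the original scale, predictions need the standardization reversed and then np.expm1() applied, since np.expm1() is the inverse of np.log1p().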
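
For a linear model on standardized inputs, one simple explainability view is the fitted coefficients themselves, ranked by absolute size. The sketch below is a minimal illustration on synthetic data: the feature names mirror the notebook's columns, but `X_demo`, `y_demo`, and `true_coefs` are made up, so the printed ranking says nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic standardized features with the notebook's column names (illustration only)
rng = np.random.default_rng(4)
feature_names = ['Open_log', 'High_log', 'Low_log', 'Close_Lag_1_log']
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=feature_names)

# Made-up coefficients used to generate the target
true_coefs = np.array([0.1, 0.5, 0.4, 0.0])
y_demo = X_demo.to_numpy() @ true_coefs + rng.normal(scale=0.01, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Rank features by the absolute value of their coefficient
importance = pd.Series(demo_model.coef_, index=feature_names)
importance = importance.sort_values(key=np.abs, ascending=False)
print(importance)
```

One caution: when predictors are strongly correlated with each other, as Open, High, and Low are here, individual coefficients can be unstable, so they should be read as a rough guide rather than a precise ranking.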

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
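
One possible way to fill in the two steps above is with `joblib`. The sketch below is self-contained: it fits a small stand-in model on synthetic data and writes to a temporary directory, and the file name `linear_model.joblib` is a hypothetical choice. In the notebook, the fitted model (and the scaler needed to prepare inputs) would be saved instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on synthetic data (illustration only)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo.sum(axis=1)
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path works
    model_path = os.path.join(tmp_dir, 'linear_model.joblib')

    # Save the fitted model, then load it back
    joblib.dump(demo_model, model_path)
    loaded_model = joblib.load(model_path)

# Sanity check: the loaded model predicts the same values on unseen rows
X_new = rng.normal(size=(3, 4))
predictions_match = np.allclose(demo_model.predict(X_new), loaded_model.predict(X_new))
print("Predictions match:", predictions_match)
```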

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The project aimed to predict stock closing prices using regression. After preprocessing, feature engineering, and testing models, linear regression with log transformation gave the best accuracy and interpretability for reliable forecasts.
