## ***Getting the Tools Ready***


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.neighbors import KNeighborsClassifier
from sklearn import neighbors
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV


- Stock Price Data Setup


In [2]:
stock_price = pd.read_csv('NSE-TATAGLOBAL.csv')

***What's in the Dataset?***

| Variable               | Description                                                                 |
|------------------------|-----------------------------------------------------------------------------|
| Date                | The date of the stock transaction or the recorded price for that day.       |
| Open                | The price at which the stock opened at the start of the trading day.        |
| High                 | The highest price the stock reached during the trading day.                 |
| Low                  | The lowest price the stock reached during the trading day.                  |
| Last                | The last traded price of the stock at the end of the trading day.           |
| Close              | The closing price of the stock at the end of the trading day.               |
| Total Trade Quantity | The total number of shares traded during that day.                          |
| Turnover (Lacs)    | The total value of shares traded, calculated by multiplying the price by the quantity, expressed in lacs (1 lac = 100,000). |


## ***Preview of the Data***

In [3]:
display(stock_price.sample(5))

features = stock_price.shape[1]
observations = stock_price.shape[0]

print('\n\033[1mInference\033[0m The dataset has {features} features and {observations} observations\n'.
      format(features=features, observations=observations))


Unnamed: 0,Date,Open,High,Low,Last,Close,Total Trade Quantity,Turnover (Lacs)
1750,2011-09-07,95.0,95.9,94.05,94.25,94.4,1094307,1039.65
919,2015-01-13,157.55,157.9,154.2,155.4,155.25,1558189,2417.14
1320,2013-05-28,144.1,148.7,143.25,144.7,144.7,5651909,8251.15
905,2015-02-03,164.0,164.5,159.2,160.0,160.45,1927234,3110.99
1994,2010-09-16,124.0,125.0,121.5,122.4,122.15,1073175,1319.26



Inference The dataset has 8 features and 2035 observations



- Get the Dates in Order

In [4]:
stock_price = stock_price.sort_values(by='Date')
stock_price.head()

Unnamed: 0,Date,Open,High,Low,Last,Close,Total Trade Quantity,Turnover (Lacs)
2034,2010-07-21,122.1,123.0,121.05,121.1,121.55,658666,803.56
2033,2010-07-22,120.3,122.0,120.25,120.75,120.9,293312,355.17
2032,2010-07-23,121.8,121.95,120.25,120.35,120.65,281312,340.31
2031,2010-07-26,120.1,121.0,117.1,117.1,117.6,658440,780.01
2030,2010-07-27,117.6,119.5,112.0,118.8,118.65,586100,694.98


**Keeping the data in chronological order is key when you're trying to spot trends over time.**
**It gives us a clear picture of how prices moved in the past and helps us make sense of the stock’s historical performance.**
**Basically, no time order = no real insights.**
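**A quick way to confirm the ordering is to check that the dates increase monotonically. The sketch below uses a three-row toy frame rather than the full dataset, and the names `toy` and `toy_sorted` are only for illustration.**

```python
import pandas as pd

# Three rows from the dataset, deliberately placed out of order
toy = pd.DataFrame({
    'Date': ['2010-07-23', '2010-07-21', '2010-07-22'],
    'Close': [120.65, 121.55, 120.90],
})

# Parse the strings into real timestamps, then sort chronologically
toy['Date'] = pd.to_datetime(toy['Date'])
toy_sorted = toy.sort_values(by='Date').reset_index(drop=True)

print(toy['Date'].is_monotonic_increasing)         # False
print(toy_sorted['Date'].is_monotonic_increasing)  # True
```

**The same `is_monotonic_increasing` check can be run on the real `Date` column after sorting.**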


- The Data, Column by Column

In [5]:
stock_price.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2035 entries, 2034 to 0
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Date                  2035 non-null   object 
 1   Open                  2035 non-null   float64
 2   High                  2035 non-null   float64
 3   Low                   2035 non-null   float64
 4   Last                  2035 non-null   float64
 5   Close                 2035 non-null   float64
 6   Total Trade Quantity  2035 non-null   int64  
 7   Turnover (Lacs)       2035 non-null   float64
dtypes: float64(6), int64(1), object(1)
memory usage: 143.1+ KB


**No values are missing, which is good news.**
**But the Date column is stored as an object, which could mess things up later when we try to plot a proper time series.**
**Also, I’ll convert the Total Trade Quantity column to float for smoother analysis and consistency.**

In [6]:
stock_price['Total Trade Quantity'] = stock_price['Total Trade Quantity'].astype(float)
stock_price['Date'] = pd.to_datetime(stock_price['Date'])

stock_price = stock_price.set_index('Date')
stock_price = stock_price.sort_index()
stock_price = stock_price.reset_index()

- Summary Statistics for Closing Price

In [7]:
stock_price['Close'].describe()

count    2035.00000
mean      149.45027
std        48.71204
min        80.95000
25%       120.05000
50%       141.25000
75%       156.90000
max       325.75000
Name: Close, dtype: float64

**The average stock price stands at 149.45, with a standard deviation of 48.71, highlighting considerable variation in price over the observed period.**


In [8]:
for col in stock_price.columns:
    pct_missing = stock_price[col].isnull().mean() * 100
    print(f'{col} has {pct_missing}% missing values')

Date has 0.0% missing values
Open has 0.0% missing values
High has 0.0% missing values
Low has 0.0% missing values
Last has 0.0% missing values
Close has 0.0% missing values
Total Trade Quantity has 0.0% missing values
Turnover (Lacs) has 0.0% missing values


In [9]:
duplicate_rows = stock_price[stock_price.duplicated(keep=False)]

duplicate_rows_sorted = duplicate_rows.sort_values(by=list(stock_price.columns))

print(f'The dataset contains {duplicate_rows.shape[0]} duplicate rows that need to be removed.')

duplicate_rows_sorted.head()


The dataset contains 0 duplicate rows that need to be removed.


Unnamed: 0,Date,Open,High,Low,Last,Close,Total Trade Quantity,Turnover (Lacs)


**We don’t expect duplicate rows, since each trading date should appear only once with a single set of prices.** **Still, it is worth checking to be sure and to avoid issues later.**
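**A fully duplicated row is only one kind of duplicate. The same date could also appear twice with slightly different prices. A minimal sketch of both checks on a toy frame follows; the second row pair is made up for illustration.**

```python
import pandas as pd

# Toy frame: the last two rows share a date but differ in price (hypothetical values)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2010-07-21', '2010-07-22', '2010-07-22']),
    'Close': [121.55, 120.90, 120.95],
})

# Rows that are identical in every column
full_dupes = int(toy.duplicated(keep=False).sum())

# Rows that repeat a date, whatever the other columns contain
date_dupes = int(toy.duplicated(subset=['Date'], keep=False).sum())

print('Fully duplicated rows:', full_dupes)   # 0
print('Rows sharing a date:', date_dupes)     # 2
```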


## ***Time Series EDA***

- Price Trend

**Understanding the price trend is key because we want to spot overall directions and any potential breaks in the price movement.**
**This part ties into important economic concepts like trend analysis, market sentiment, and business cycle theory.**


In [10]:
stock_price_df = stock_price.copy()

fig = px.line(stock_price_df, x='Date', y='Close', title='Stock Price Over Time')

fig.update_traces(line_color='#172f5a')

fig.update_layout(
        title_text='Stock Price',
        xaxis_title='Date',
        yaxis_title='Close Price',
        plot_bgcolor='rgba(0,0,0,0)',
        showlegend=False,
        font=dict(family='Arial', size=12, color='black')
        )

<p align="left">
  <img src="EDA images\plot1.png">
  <br>
</p>  

**The closing price varies widely over time, fluctuating between 80.95 and 325.75.**

- Volume

**Why do we care about volume? Simple. Price alone isn’t enough.** 
**For example, if the price goes up but volume is low, it might mean the market isn’t really that interested.**
**This section touches on key economic concepts like liquidity, market sentiment, confirmation, and informed trading.**


In [11]:
fig = px.line(stock_price_df, x='Date', y='Total Trade Quantity', title='Total Trade Quantity Over Time')
fig.update_traces(line_color='#172f5a')
fig.update_layout(
    title_text='Volume Over Time',
    xaxis_title='Date',
    yaxis_title='Total Trade Quantity',
    plot_bgcolor='rgba(0,0,0,0)',
    font=dict(family='Arial', size=12, color='black')
)
fig.show()

<p align="left">
  <img src="EDA images\plot2.png">
  <br>
</p>  

In [12]:
stock_price_df['Daily_Return'] = stock_price_df['Close'].pct_change()


up_days_vol = stock_price_df[stock_price_df['Daily_Return'] > 0]['Total Trade Quantity'].mean()
down_days_vol = stock_price_df[stock_price_df['Daily_Return'] < 0]['Total Trade Quantity'].mean()

print('Average Volume on Up Days:', format(up_days_vol, '.2f'))
print('Average Volume on Down Days:', format(down_days_vol, '.2f'))

vol_mean = stock_price_df['Total Trade Quantity'].mean()
vol_std = stock_price_df['Total Trade Quantity'].std()
volume_spikes = stock_price_df[stock_price_df['Total Trade Quantity'] > (vol_mean + 2 * vol_std)]
print('Number of significant volume spikes:', len(volume_spikes))

Average Volume on Up Days: 2614855.86
Average Volume on Down Days: 2070476.66
Number of significant volume spikes: 91


**Looking at the chart, we see how trading volume fluctuates over time. Generally, on days when the price increases, the trading volume is higher compared to days when the price decreases.**
**There were 91 notable volume spikes, defined as days when the volume exceeded the mean by more than two standard deviations.**
**Large volume increases often accompany significant market movements, so these days can help validate price trends or flag potential turning points.**

- Daily Returns Distribution

**Why is it important to check the daily returns distribution? It helps us understand the likelihood of gains and losses.**
**By doing so, we can better assess risks and plan accordingly.**
**This part covers important economic concepts like risk management, fat tails, and behavioral biases.**

In [13]:
fig = px.histogram(stock_price_df, x='Daily_Return', nbins=50, title='Daily Returns Distribution')
fig.update_traces(marker_color='#172f5a')

fig.update_layout(
        title_text='Daily Returns Distribution',
        xaxis_title='Daily Returns',
        yaxis_title='Frequency',
        plot_bgcolor='rgba(0,0,0,0)',
        showlegend=False,
        font=dict(family='Arial', size=12, color='black')
        )

<p align="left">
  <img src="EDA images\plot3.png">
  <br>
</p>  

**The daily returns seem to follow a near normal distribution, with most values clustered around zero, which is common for stock returns.**
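**A histogram can hide fat tails, so it may help to back up the visual impression with numbers. The sketch below computes skewness and excess kurtosis on synthetic returns. The random seed, the 1% daily standard deviation, and the variable names are assumptions for illustration, not values from the dataset.**

```python
import numpy as np
import pandas as pd

# Synthetic price path built from normally distributed log returns (illustrative only)
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

returns = prices.pct_change().dropna()

# Skewness near 0 means a symmetric distribution.
# Series.kurt() reports excess kurtosis, so a normal distribution scores about 0
# and clearly positive values point to fat tails.
skew = returns.skew()
excess_kurt = returns.kurt()

print(f'Skewness: {skew:.3f}')
print(f'Excess kurtosis: {excess_kurt:.3f}')
```

**Running the same two calls on `stock_price_df['Daily_Return']` would show how close the real returns are to normal.**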

- Volatility

**It’s important to analyze volatility because it represents the risks associated with investing. A very volatile stock might offer big returns, but it also comes with higher risk.**  
**This part covers key economic concepts such as the risk vs. return tradeoff, portfolio theory, and the Capital Asset Pricing Model (CAPM).**



In [14]:
stock_price_df['Volatility'] = stock_price_df['Daily_Return'].rolling(window=20).std()

fig = px.line(stock_price_df, x='Date', y='Volatility', title='20 Day Rolling Volatility of Stock Prices')
fig.update_traces(line_color='#172f5a')
fig.update_layout(
    title_text='20 Day Rolling Volatility of Stock Prices',
    xaxis_title='Date',
    yaxis_title='Volatility',
    plot_bgcolor='rgba(0,0,0,0)',
    font=dict(family='Arial', size=12, color='black')
)
fig.show()

<p align="left">
  <img src="EDA images\plot4.png">
  <br>
</p>  

**The 20-day rolling volatility highlights both high and low volatility periods, providing valuable insights for risk assessment.**
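**Rolling volatility is the standard deviation of daily returns over a sliding window. To compare it with figures quoted per year, a common convention is to multiply by the square root of the number of trading days. The sketch below assumes 252 trading days per year and uses made-up returns with a window of 3 to keep the arithmetic visible.**

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns
daily_returns = pd.Series([0.01, -0.02, 0.015, -0.005, 0.0, 0.02, -0.01])

window = 3                 # the notebook uses 20; 3 keeps the toy example short
trading_days = 252         # assumption: a typical trading year

# The first (window - 1) entries are NaN because the window is not full yet
rolling_vol = daily_returns.rolling(window=window).std()
annualized_vol = rolling_vol * np.sqrt(trading_days)

print(pd.DataFrame({'rolling_vol': rolling_vol, 'annualized_vol': annualized_vol}))
```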


- Moving Averages

**We smooth out daily noise to better see the real trend behind short-term price swings.**
**This lets us focus on meaningful market moves instead of getting distracted by random fluctuations.**
**This part covers key economic concepts such as signal vs. noise, momentum investing, and behavioral finance.**

In [15]:
stock_price_df['MA20'] = stock_price_df['Close'].rolling(window=20).mean()
stock_price_df['MA50'] = stock_price_df['Close'].rolling(window=50).mean()

fig = px.line(stock_price_df, x='Date', y='Close', title='Close Price with Moving Averages')
fig.add_scatter(x=stock_price_df['Date'], y=stock_price_df['Close'], mode='lines', name='Close Price', line=dict(color='blue'))
fig.add_scatter(x=stock_price_df['Date'], y=stock_price_df['MA20'], mode='lines', name='MA20', line=dict(color='green'))
fig.add_scatter(x=stock_price_df['Date'], y=stock_price_df['MA50'], mode='lines', name='MA50', line=dict(color='red'))

fig.update_layout(
    title_text='Close Price with Moving Averages',
    xaxis_title='Date',
    yaxis_title='Price',
    plot_bgcolor='rgba(0,0,0,0)',
    font=dict(family='Arial', size=12, color='black')
)

fig.show()


<p align="left">
  <img src="EDA images\plot5.png">
  <br>
</p>  

**The stock had a strong uptrend from 2011 to early 2013, a period of consolidation in 2013-2014, another uptrend until mid 2015, followed by a sharp decline in late 2015. It recovered in 2016-2017 with some volatility in 2018. MA20 reacted faster than MA50, with several crossovers signaling potential buy/sell opportunities.**
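**Crossovers can be located in code instead of read off the chart. The sketch below marks the rows where a fast moving average moves above or below a slow one. The prices and the window sizes of 2 and 4 are toy values chosen so the result is easy to follow.**

```python
import pandas as pd

# Hypothetical closing prices: up, down, then up again
close = pd.Series([10, 11, 12, 13, 12, 11, 10, 9, 10, 12, 14, 16], dtype=float)

fast = close.rolling(window=2).mean()
slow = close.rolling(window=4).mean()

# Only compare where the slow average exists
valid = slow.notna()
state = (fast > slow)[valid].astype(int)   # 1 when fast is above slow

# A change of +1 is an upward cross, -1 a downward cross
cross = state.diff()
bullish = cross[cross == 1].index.tolist()
bearish = cross[cross == -1].index.tolist()

print('Fast crosses above slow at rows:', bullish)  # [9]
print('Fast crosses below slow at rows:', bearish)  # [5]
```

**With the notebook’s columns, `fast` and `slow` would be `MA20` and `MA50`.**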



## ***Stock Price Forecasting***

**When working with stock price data, we usually face two key problems.**

**1. Classification: predicting whether the price will go up (1) or down (0) tomorrow. This is helpful for making buy or sell decisions based on expected direction.**

**2. Regression: predicting the actual closing price for the next day. This helps in estimating the future value and setting price targets.**

**For now, we'll focus on the regression problem, as classification requires a deeper and more advanced understanding before diving in.**


In [16]:
stock_price['Tomorrow_Close'] = stock_price['Close'].shift(-1)
stock_price['Target'] = (stock_price['Tomorrow_Close'] > stock_price['Close']).astype(int)
stock_price.dropna(inplace=True)
stock_price.head()

Unnamed: 0,Date,Open,High,Low,Last,Close,Total Trade Quantity,Turnover (Lacs),Tomorrow_Close,Target
0,2010-07-21,122.1,123.0,121.05,121.1,121.55,658666.0,803.56,120.9,0
1,2010-07-22,120.3,122.0,120.25,120.75,120.9,293312.0,355.17,120.65,0
2,2010-07-23,121.8,121.95,120.25,120.35,120.65,281312.0,340.31,117.6,0
3,2010-07-26,120.1,121.0,117.1,117.1,117.6,658440.0,780.01,118.65,1
4,2010-07-27,117.6,119.5,112.0,118.8,118.65,586100.0,694.98,118.25,0


**Here, we created a column called "Target", which serves as our label in the classification model.**

## ***Regression Model***

**I’ll be using KNN regression, so feature selection matters. KNN does not learn parameters during training. At prediction time it computes the distance from each query point to the stored training points, so every additional feature adds to the computational cost and slows KNN down.**

**Accuracy can suffer too. Irrelevant features still enter the distance calculation, so they can make dissimilar points look close and similar points look far apart.**
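**To see how an irrelevant feature can change which neighbor is nearest, here is a small sketch with made-up points. Column 0 stands for an informative feature and column 1 for unrelated noise. The helper `nearest` is hypothetical and uses plain Euclidean distance.**

```python
import numpy as np

# Hypothetical training points: [informative feature, noisy feature]
train = np.array([[1.0, 0.0],
                  [2.0, 9.0],
                  [5.0, 0.5]])
query = np.array([2.1, 0.0])

def nearest(train_pts, q):
    """Return the row index of the training point closest to q."""
    distances = np.linalg.norm(train_pts - q, axis=1)
    return int(np.argmin(distances))

# Using only the informative column, row 1 (value 2.0) is the closest
nn_informative_only = nearest(train[:, :1], query[:1])

# Adding the noisy column pushes row 1 far away, so row 0 wins instead
nn_with_noise = nearest(train, query)

print(nn_informative_only)  # 1
print(nn_with_noise)        # 0
```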


In [17]:
# Features are the same day's Open, High, and Low; the target is that day's Close
x = stock_price[['Open', 'High', 'Low']]
y = stock_price['Close']

In [18]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

In [19]:
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

x_train shape: (1525, 3)
x_test shape: (509, 3)
y_train shape: (1525,)
y_test shape: (509,)


**KNN regression relies on distance calculations, so feature scale and outliers both matter. I’ll use StandardScaler rather than MinMaxScaler, because MinMaxScaler’s range is set entirely by the most extreme values.**
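**The difference between the two scalers shows up on a toy column with one extreme value. The numbers below are made up. Note that the outlier still influences StandardScaler through the mean and standard deviation, so this is a comparison, not a claim that either scaler is robust.**

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four ordinary prices and one outlier (hypothetical values)
values = np.array([[100.0], [102.0], [104.0], [106.0], [300.0]])

minmax = MinMaxScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

# MinMaxScaler squeezes the four ordinary points into a narrow band near 0,
# because the outlier alone defines the top of the [0, 1] range
print('MinMax:  ', np.round(minmax, 3))

# StandardScaler centers on the mean and divides by the standard deviation
print('Standard:', np.round(standard, 3))

# Spread of the ordinary points under each scaler
print('MinMax spread:  ', round(minmax[3] - minmax[0], 3))
print('Standard spread:', round(standard[3] - standard[0], 3))
```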


In [20]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [21]:
parms = {'n_neighbors': list(range(2, 16))}
knn = neighbors.KNeighborsRegressor()
knn_reg = GridSearchCV(knn, parms, cv=5)
knn_reg.fit(x_train_scaled, y_train)

print('Best Parameters:', knn_reg.best_params_)
print('Best Score:', knn_reg.best_score_)

Best Parameters: {'n_neighbors': 4}
Best Score: 0.9987921955100545


**Stock data is affected by sudden market changes, so it likely contains noise. A larger *k* averages over more neighbors and dampens that noise, while a *k* that is too large underfits. Cross-validation selected *k* = 4, which balances the two.**
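**`best_params_` reports only the winner. To see how close the other candidates were, the per-candidate scores can be read from `cv_results_`. The sketch below runs the same kind of grid search on synthetic data; the data, the seed, and the names `search` and `scores` are assumptions for illustration.**

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic features in a price-like range and a noisy target (illustrative only)
rng = np.random.default_rng(42)
X = rng.uniform(80, 330, size=(200, 3))
y = X.mean(axis=1) + rng.normal(0, 1.0, size=200)

search = GridSearchCV(KNeighborsRegressor(), {'n_neighbors': list(range(2, 16))}, cv=5)
search.fit(X, y)

# One row per candidate k, with the mean and spread of the fold scores (R2 by default)
scores = pd.DataFrame({
    'k': [p['n_neighbors'] for p in search.cv_results_['params']],
    'mean_r2': search.cv_results_['mean_test_score'],
    'std_r2': search.cv_results_['std_test_score'],
})

print(scores.sort_values('mean_r2', ascending=False).head())
```

**If several values of k score within one standard deviation of each other, the choice between them matters less than the single best score suggests.**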


In [22]:
y_pred_knn_train = knn_reg.predict(x_train_scaled)

MSE_train = mean_squared_error(y_train, y_pred_knn_train)
MAE_train = mean_absolute_error(y_train, y_pred_knn_train)
r2_knn_train = r2_score(y_train, y_pred_knn_train)

print(f'Mean Squared Error: {MSE_train:.4f}')
print(f'Mean Absolute Error: {MAE_train:.4f}')
print(f'R-squared (R2): {r2_knn_train:.4f}')

Mean Squared Error: 1.5860
Mean Absolute Error: 0.8230
R-squared (R2): 0.9993


In [23]:
y_pred_knn_test = knn_reg.predict(x_test_scaled)

MSE_test = mean_squared_error(y_test, y_pred_knn_test)
MAE_test = mean_absolute_error(y_test, y_pred_knn_test)
r2_knn_test = r2_score(y_test, y_pred_knn_test)

print(f'Mean Squared Error: {MSE_test:.4f}')
print(f'Mean Absolute Error:  {MAE_test:.4f}') 
print(f'R-squared (R2): {r2_knn_test:.4f}')

Mean Squared Error: 2.6338
Mean Absolute Error:  1.0531
R-squared (R2): 0.9989


In [24]:
valid_output = pd.DataFrame({'Actual close price': y_test, 'Predicted close price': y_pred_knn_test})
diff = pd.DataFrame({'Actual close price': y_test, 'Predicted close price': y_pred_knn_test, 'Difference': y_pred_knn_test - y_test})

In [25]:
valid_output.sample(5)
diff.sample(5)

Unnamed: 0,Actual close price,Predicted close price,Difference
2026,236.6,233.25,-3.35
530,130.3,128.3,-2.0
1339,132.15,131.55,-0.6
1501,138.3,137.125,-1.175
1157,154.75,153.9625,-0.7875


In [26]:
fig_actual = px.scatter(x=y_test, y=y_pred_knn_test, trendline='lowess', color_discrete_sequence=['#172f5a'])

fig_actual.update_layout(title_text='Actual vs Predicted Stock Prices',
                         width=800, height=500,
                         plot_bgcolor='rgba(0,0,0,0)',
                         showlegend=False,
                         font=dict(family='Arial', size=12, color='black'))

# x holds y_test (actual) and y holds the predictions
fig_actual.update_xaxes(title_text='Actual')
fig_actual.update_yaxes(title_text='Predicted')

fig_actual.show()

<p align="left">
  <img src="EDA images\plot6.png">
  <br>
</p>  

**The model performs well on both training and test data, with R² above 0.99 on each. The two sets of metrics are close, which suggests little overfitting. However, we still need to check the assumptions to ensure the data is suitable for KNN.**
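**One caveat: `train_test_split` shuffles rows by default, so the test days above are interleaved with the training days. A stricter check, assuming you want the test period to come entirely after the training period, is to split without shuffling. The sketch below shows this on a toy frame with hypothetical values.**

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame already sorted by date (values are placeholders)
toy = pd.DataFrame({
    'Date': pd.date_range('2010-07-21', periods=8, freq='D'),
    'Close': np.arange(8, dtype=float),
})

# shuffle=False keeps row order, so the last 25% of rows become the test set
train, test = train_test_split(toy, test_size=0.25, shuffle=False)

print('Train ends: ', train['Date'].max().date())
print('Test starts:', test['Date'].min().date())
```

**Comparing the metrics from this kind of split with the shuffled ones would show how much the result depends on neighboring days being present in the training set.**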


- **Feature Scaling**


**KNN is sensitive to the scale of features. We’ve already applied StandardScaler, which suits this data given its outliers and roughly normal distribution.**


- **No Assumption of Data Distribution**

**KNN doesn’t require data to be normally distributed, but it performs best when the relationship between features and the target is locally smooth.**


- **Feature Ranges**

In [27]:
stock_price[['Open', 'High', 'Low','Close']].describe().loc[['min', 'max']]

Unnamed: 0,Open,High,Low,Close
min,81.1,82.8,80.0,80.95
max,327.7,328.75,321.65,325.75


**Each feature spans a wide range, from roughly 80 to 330, so scaling is necessary before computing distances. We’ve already applied StandardScaler.**

- **No Multicollinearity Requirement**

**KNN doesn’t rely on internal equations or weights like regression models. It just measures distances between points.**

So:

- It doesn’t care if features are correlated (like weight and BMI).  
- It’s not much affected by multicollinearity.


- **Sufficient Data**

**KNN might not work well with tiny datasets or when there are too many features. But in our case, the number of rows and features is reasonable, so it's fine.**

- **Residuals**

In [28]:
residuals = y_test - y_pred_knn_test

# Plot residuals against the predicted values, matching the title and axis labels
fig_residuals = px.scatter(x=y_pred_knn_test, y=residuals, trendline='lowess', color_discrete_sequence=['#172f5a'])

fig_residuals.update_layout(title_text='Residuals vs Predicted Stock Prices',
                         width=800, height=500,
                         plot_bgcolor='rgba(0,0,0,0)',
                         showlegend=False,
                         font=dict(family='Arial', size=12, color='black'))

fig_residuals.update_xaxes(title_text='Predicted')
fig_residuals.update_yaxes(title_text='Residuals')

fig_residuals.show()

<p align="left">
  <img src="EDA images\plot7.png">
  <br>
</p>  

**The residuals are randomly spread around zero with no clear pattern, which means the KNN model is doing a good job capturing the main structure of the data and isn’t missing anything major.**


### ***After checking all the assumptions, everything suggests that KNN regression is a good fit for this dataset.***

---

**In the end, I hope you learned something valuable. Don’t forget to enjoy the learning journey.**


**If you liked this notebook, feel free to leave an upvote or a star if you're on GitHub.**
**I'll be covering the classification problem soon!**