# Stock Market Analysis - Data Cleaning

## Objective

Prepare clean and comparable historical price data for Reliance Industries and the Nifty 50 index by:

* Aligning trading dates
* Standardizing columns and formats 
* Calculating daily returns
* Creating an analysis-ready dataset

This step ensures a fair comparison and reliable downstream analysis.


### Import Libraries

In [99]:
import pandas as pd
import numpy as np

### Load Datasets

In [100]:
reliance = pd.read_csv(r"C:\Users\sande\OneDrive\Documents\Data science AI&ML\Projects\Projects for Resume\Stock Market Analysis & Forecasting- Project\Data\raw\RELIANCE.csv")
nifty = pd.read_csv(r"C:\Users\sande\OneDrive\Documents\Data science AI&ML\Projects\Projects for Resume\Stock Market Analysis & Forecasting- Project\Data\raw\NIFTY50.csv")

In [101]:
reliance.sample(5),nifty.sample(5)

(            Date         Open         High          Low        Close    Volume
 417   08-09-2022  1181.527366  1185.978219  1173.652856  1180.226318   7057077
 687   11-10-2023  1148.722762  1166.218270  1148.003113  1163.910400   9814118
 708   10-11-2023  1144.305465  1149.665815  1140.583017  1148.946045   7734954
 927   04-10-2024  1398.340206  1411.711728  1376.278295  1381.009399  37072876
 1181  13-10-2025  1376.900024  1377.699951  1367.800049  1375.000000   7600682,
             Date         Open         High          Low        Close  Volume
 870   12-07-2024  24387.94922  24592.19922  24331.15039  24502.15039  325800
 677   26-09-2023  19682.80078  19699.34961  19637.44922  19664.69922  204900
 1004  24-01-2025  23183.90039  23347.30078  23050.00000  23092.19922  264300
 404   19-08-2022  17966.55078  17992.19922  17710.75000  17758.44922  295600
 1093  06-06-2025  24748.69922  25029.50000  24671.44922  25003.05078  335600)

In [102]:
reliance.shape,nifty.shape

((1241, 6), (1239, 6))

### Standardize Column Names

In [103]:
# df.columns returns an Index of column labels
# .str.lower() converts every label to lowercase
# .str.strip() removes leading and trailing whitespace
reliance.columns = reliance.columns.str.lower().str.strip()
nifty.columns = nifty.columns.str.lower().str.strip()
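As a quick illustration of what this normalization does, here is a minimal sketch on a made-up frame. The untidy headers below are hypothetical and are not taken from the real CSV files.

```python
import pandas as pd

# Hypothetical frame with untidy headers (not taken from the real CSVs)
demo = pd.DataFrame({" Date ": ["08-09-2022"], "Close": [1180.23], "VOLUME ": [7057077]})

# Strip whitespace, then lowercase; the order of the two calls does not change the result
demo.columns = demo.columns.str.strip().str.lower()

clean_columns = list(demo.columns)
print(clean_columns)
```

Consistent lowercase labels matter later because the rename and merge steps look columns up by exact name.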

### Date Parsing & Sorting

In [104]:
reliance['date'] = pd.to_datetime(reliance['date'], format='%d-%m-%Y')
nifty['date'] = pd.to_datetime(nifty['date'], format='%d-%m-%Y')

reliance = reliance.sort_values('date')
nifty = nifty.sort_values('date')
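The explicit `format` matters because the raw dates are day-first (for example `08-09-2022`). The sketch below, using a few date strings from the sample output, shows how the same text would be read differently under a month-first format.

```python
import pandas as pd

raw_dates = pd.Series(["08-09-2022", "11-10-2023", "13-10-2025"])

# Explicit day-first format: 08-09-2022 is 8 September 2022
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y")

# Reading the same string month-first would give 9 August 2022 instead
misread = pd.to_datetime("08-09-2022", format="%m-%d-%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
print(misread.strftime("%Y-%m-%d"))
```

Stating the format also makes parsing fail loudly on a malformed row instead of silently guessing.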

### Rename Columns

In [105]:
reliance = reliance.rename(columns={
    'open': 'reliance_open',
    'high': 'reliance_high',
    'low': 'reliance_low',
    'close': 'reliance_close',
    'volume': 'reliance_volume',
})


nifty = nifty.rename(columns={
    'open': 'nifty_open',
    'high': 'nifty_high',
    'low': 'nifty_low',
    'close': 'nifty_close',
    'volume': 'nifty_volume',
})

### Merge Both Datasets on Common Dates

In [106]:
# The stock and the index may not share exactly the same trading dates (for example, around holidays),
# so an inner join keeps only the dates present in both datasets and avoids misleading comparisons
market_df = pd.merge(
    reliance,
    nifty,
    on = 'date',
    how = 'inner'
)
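The raw shapes (1241 and 1239 rows) suggest that a couple of Reliance dates have no Nifty counterpart. One possible way to see which dates an inner join drops is an outer merge with `indicator=True`. The sketch below uses hypothetical miniature frames (`stock`, `index_df`) rather than the real data.

```python
import pandas as pd

# Hypothetical miniature frames standing in for `reliance` and `nifty`
stock = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "reliance_close": [100.0, 101.0, 102.0],
})
index_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-03"]),
    "nifty_close": [20000.0, 20100.0],
})

# An outer merge with indicator=True labels each row by the side it came from
audit = pd.merge(stock, index_df, on="date", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", ["date", "_merge"]]

# validate= raises a MergeError if either side contains duplicate dates
aligned = pd.merge(stock, index_df, on="date", how="inner", validate="one_to_one")

print(unmatched)
print(len(aligned))
```

Here the row for 2024-01-02 is flagged as `left_only`, which is the kind of date the inner join silently removes.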

### Check and Fix Data Types

In [107]:
market_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1239 entries, 0 to 1238
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             1239 non-null   datetime64[ns]
 1   reliance_open    1239 non-null   float64       
 2   reliance_high    1239 non-null   float64       
 3   reliance_low     1239 non-null   float64       
 4   reliance_close   1239 non-null   float64       
 5   reliance_volume  1239 non-null   int64         
 6   nifty_open       1239 non-null   float64       
 7   nifty_high       1239 non-null   float64       
 8   nifty_low        1239 non-null   float64       
 9   nifty_close      1239 non-null   float64       
 10  nifty_volume     1239 non-null   int64         
dtypes: datetime64[ns](1), float64(8), int64(2)
memory usage: 106.6 KB


In [108]:
# Convert the price and volume columns to numeric; values that cannot be parsed become NaN
cols_convert = ['reliance_open','reliance_high','reliance_low','reliance_close','reliance_volume','nifty_open','nifty_high','nifty_low','nifty_close','nifty_volume']

for col in cols_convert:
    market_df[col] = pd.to_numeric(market_df[col], errors = 'coerce')
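According to the `info()` output, these columns are already `float64` or `int64`, so the loop mainly acts as a safeguard. The sketch below, on a hypothetical series, shows what `errors='coerce'` does when a value cannot be parsed and how to count the values lost in conversion.

```python
import pandas as pd

# Hypothetical column containing entries that are not plain numbers
raw = pd.Series(["1180.23", "1,163.91", "-", "1375"])

converted = pd.to_numeric(raw, errors="coerce")

# Count values that became NaN only because of the conversion
newly_missing = int(converted.isna().sum() - raw.isna().sum())

print(converted.tolist())
print(newly_missing)
```

Checking this count before dropping missing rows helps distinguish genuinely absent data from formatting problems such as thousands separators.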

### Calculate Daily Returns

In [109]:
market_df['reliance_return']  = market_df['reliance_close'].pct_change()
market_df['nifty_return'] = market_df['nifty_close'].pct_change()
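For reference, `pct_change()` computes the simple return $(P_t - P_{t-1}) / P_{t-1}$, so the first row has no previous price and is `NaN`. A minimal sketch on made-up closing prices:

```python
import pandas as pd

# Hypothetical closing prices
close = pd.Series([100.0, 102.0, 99.96])

# pct_change computes (P_t - P_{t-1}) / P_{t-1}
returns = close.pct_change()

# Equivalent manual form using shift
manual = close / close.shift(1) - 1

print(returns.round(4).tolist())
```

The leading `NaN` is why one row disappears when missing values are dropped in the next step.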


### Check Missing Values

In [110]:
market_df = market_df.dropna()

In [111]:
market_df.isna().sum()

date               0
reliance_open      0
reliance_high      0
reliance_low       0
reliance_close     0
reliance_volume    0
nifty_open         0
nifty_high         0
nifty_low          0
nifty_close        0
nifty_volume       0
reliance_return    0
nifty_return       0
dtype: int64

In [112]:
market_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1238 entries, 1 to 1238
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             1238 non-null   datetime64[ns]
 1   reliance_open    1238 non-null   float64       
 2   reliance_high    1238 non-null   float64       
 3   reliance_low     1238 non-null   float64       
 4   reliance_close   1238 non-null   float64       
 5   reliance_volume  1238 non-null   int64         
 6   nifty_open       1238 non-null   float64       
 7   nifty_high       1238 non-null   float64       
 8   nifty_low        1238 non-null   float64       
 9   nifty_close      1238 non-null   float64       
 10  nifty_volume     1238 non-null   int64         
 11  reliance_return  1238 non-null   float64       
 12  nifty_return     1238 non-null   float64       
dtypes: datetime64[ns](1), float64(10), int64(2)
memory usage: 135.4 KB


In [None]:
market_df.shape

(1238, 13)

In [None]:
market_df.to_csv("../data/processed/market_data.csv", index=False)



In [None]:
import os
os.listdir("../data/processed")
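Saving to a relative path fails if the target folder does not exist yet. The sketch below is one possible way to create the folder first and then confirm the file round-trips. It writes a hypothetical two-row frame to a temporary directory, so the names `demo`, `out_dir`, and `out_file` are illustrative only.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for `market_df`
demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "reliance_return": [0.01, -0.02],
})

# A temporary folder keeps the sketch self-contained; the notebook would use its own path
out_dir = Path(tempfile.mkdtemp()) / "data" / "processed"
out_dir.mkdir(parents=True, exist_ok=True)  # create missing folders before writing
out_file = out_dir / "market_data.csv"

demo.to_csv(out_file, index=False)

# parse_dates restores the datetime dtype, which CSV text does not preserve
reloaded = pd.read_csv(out_file, parse_dates=["date"])
print(reloaded.dtypes)
```

Using `pathlib.Path` with forward slashes also avoids backslash escape problems in Windows-style path strings.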
