# 📈 Stock Market Data Collection for StAI (Stock Prediction AI)

This notebook collects historical stock market data with the `yfinance` library for use in the StAI project. It fetches daily data from 2021-01-01 up to the run date for a selected set of stocks and market indices.


In [2]:
pip install yfinance --quiet

Note: you may need to restart the kernel to use updated packages.




In [3]:
import yfinance as yf
import pandas as pd
import os
from datetime import datetime


In [4]:
tickers = [
    "AAPL", "GOOGL", "MSFT", "TSLA",
    "RELIANCE.NS", "TCS.NS", "INFY.NS", "WIPRO.NS",
    "^NSEI", "^BSESN", "^GSPC"
]

In [5]:
start_date = '2021-01-01'
end_date = datetime.today().strftime('%Y-%m-%d')
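
The start date above is fixed, so the window grows with every run. If a rolling window is preferred instead, a minimal sketch could derive both dates from today's date. The helper name `rolling_window` and its parameters are illustrative assumptions, not part of the original notebook.

```python
from datetime import date
from typing import Optional, Tuple


def rolling_window(years: int = 3, today: Optional[date] = None) -> Tuple[str, str]:
    """Return (start, end) as YYYY-MM-DD strings covering the last `years` years."""
    today = today or date.today()
    try:
        start = today.replace(year=today.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        start = today.replace(year=today.year - years, day=28)
    return start.isoformat(), today.isoformat()


# Hypothetical drop-in replacement for the fixed dates.
start_date, end_date = rolling_window(3)
print(start_date, end_date)
```

Passing an explicit `today` makes the helper easy to check against a known date.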


In [9]:
import random
import time

In [11]:
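# Make sure the output folder exists before writing any files.
os.makedirs("../data/raw", exist_ok=True)

for ticker in tickers:
    # Pause a random 5-7 s before each request to space out the downloads.
    delay = random.uniform(5, 7)
    print(f"Waiting {delay:.1f}s...")
    time.sleep(delay)
    print(f"Downloading data for {ticker}")
    data = yf.download(ticker, start=start_date, end=end_date)
    file_path = f"../data/raw/{ticker}.csv"
    data.to_csv(file_path)
    print(f"Saved to {file_path}")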

Waiting 6.4s...
Downloading data for AAPL


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/AAPL.csv
Waiting 5.3s...
Downloading data for GOOGL


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/GOOGL.csv
Waiting 5.7s...
Downloading data for MSFT


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/MSFT.csv
Waiting 6.0s...
Downloading data for TSLA


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/TSLA.csv
Waiting 6.6s...
Downloading data for RELIANCE.NS


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/RELIANCE.NS.csv
Waiting 6.7s...
Downloading data for TCS.NS


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/TCS.NS.csv
Waiting 6.1s...
Downloading data for INFY.NS


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/INFY.NS.csv
Waiting 5.6s...
Downloading data for WIPRO.NS


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/WIPRO.NS.csv
Waiting 6.3s...
Downloading data for ^NSEI


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/^NSEI.csv
Waiting 5.4s...
Downloading data for ^BSESN


[*********************100%***********************]  1 of 1 completed


Saved to ../data/raw/^BSESN.csv
Waiting 5.4s...
Downloading data for ^GSPC


[*********************100%***********************]  1 of 1 completed

Saved to ../data/raw/^GSPC.csv
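
The loop above saves whatever comes back, so a failed request could leave an empty CSV behind. A rough sketch of one way to guard against that is a small retry wrapper that keeps the random pause and tries again when a call raises or returns nothing. The name `fetch_with_retry` and all of its parameters are hypothetical; the fetch function is passed in, so the sketch makes no claims about how `yfinance` itself behaves on failure.

```python
import random
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    is_empty: Callable[[T], bool],
    attempts: int = 3,
    delay_range: Tuple[float, float] = (5.0, 7.0),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() up to `attempts` times, pausing a random interval before each call.

    Returns the first non-empty result, or None if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        sleep(random.uniform(*delay_range))
        try:
            result = fetch()
        except Exception as exc:  # e.g. a network error
            print(f"Attempt {attempt} failed: {exc}")
            continue
        if not is_empty(result):
            return result
        print(f"Attempt {attempt} returned no data")
    return None
```

In the notebook this might be called as `fetch_with_retry(lambda: yf.download(ticker, start=start_date, end=end_date), lambda d: d.empty)`, writing the CSV only when the result is not `None`.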





In [12]:
pd.read_csv("../data/raw/TSLA.csv").head()

|   | Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 0 | Ticker | TSLA | TSLA | TSLA | TSLA | TSLA |
| 1 | Date |  |  |  |  |  |
| 2 | 2021-01-04 | 243.2566680908203 | 248.163330078125 | 239.06333923339844 | 239.82000732421875 | 145914600 |
| 3 | 2021-01-05 | 245.0366668701172 | 246.94667053222656 | 239.73333740234375 | 241.22000122070312 | 96735600 |
| 4 | 2021-01-06 | 251.9933319091797 | 258.0 | 249.6999969482422 | 252.8300018310547 | 134100000 |
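
The preview shows that the saved CSV carries two extra rows under the header (`Ticker` and `Date`) and that the dates sit in a column labelled `Price`. One way to load such a file cleanly, sketched here under the assumption that every raw file has exactly this layout, is to skip those two rows at read time. `load_raw_prices` is a hypothetical helper name, and the sample rows are copied from the preview.

```python
import io

import pandas as pd


def load_raw_prices(source) -> pd.DataFrame:
    """Read a CSV whose header row is followed by a 'Ticker' row and a 'Date' row."""
    # Skip the two non-data rows (file lines 1 and 2, counting the header as line 0).
    df = pd.read_csv(source, skiprows=[1, 2])
    # The first column is labelled "Price" but actually holds the dates.
    df = df.rename(columns={"Price": "Date"})
    df["Date"] = pd.to_datetime(df["Date"])
    return df.set_index("Date").sort_index()


# In-memory stand-in for ../data/raw/TSLA.csv, using rows from the preview.
sample = io.StringIO(
    "Price,Close,High,Low,Open,Volume\n"
    "Ticker,TSLA,TSLA,TSLA,TSLA,TSLA\n"
    "Date,,,,,\n"
    "2021-01-04,243.2566680908203,248.163330078125,239.06333923339844,239.82000732421875,145914600\n"
    "2021-01-05,245.0366668701172,246.94667053222656,239.73333740234375,241.22000122070312,96735600\n"
)
prices = load_raw_prices(sample)
print(prices.dtypes)
```

Because the stray rows never enter the frame, the price columns are parsed as numbers directly and no later `pd.to_numeric` call should be needed.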


### Data Preprocessing for the project

In [25]:
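df = pd.read_csv("../data/processed/AAPL_processed.csv")
# Drop the leftover non-data row at index 0.
df = df.drop(0).reset_index(drop=True)
# The column labelled "Price" actually holds the dates.
df = df.rename(columns={'Price': 'Date'})
df.head()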

|   | Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 0 | 2021-01-04 | 126.23971557617188 | 130.33682094023055 | 123.65463382486575 | 130.24902933388864 | 143301900 |
| 1 | 2021-01-05 | 127.80044555664062 | 128.51257236547815 | 125.28364991225538 | 125.73238714953698 | 97664900 |
| 2 | 2021-01-06 | 123.49852752685548 | 127.83951504225612 | 123.2839159380844 | 124.5910922321954 | 155088000 |
| 3 | 2021-01-07 | 127.71267700195312 | 128.4052897378647 | 124.72764426980214 | 125.21539510538167 | 109578200 |
| 4 | 2021-01-08 | 128.8150177001953 | 129.38081053932285 | 127.03959725138373 | 129.18569826826229 | 105158200 |


In [42]:
for ticker in tickers:
    print(f"loading {ticker}")
    # Note: this reads and overwrites the same file, so each re-run trims more rows.
    df = pd.read_csv(f"../data/processed/{ticker}_processed.csv")
    # Drop the leftover header row and relabel the date column.
    df = df.drop(0).reset_index(drop=True)
    df = df.rename(columns={'Price': 'Date'})
    df = df.dropna()
    df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
    # Indicators derived from the closing price.
    df['MA10'] = df['Close'].rolling(10).mean()       # 10-day moving average
    df['MA50'] = df['Close'].rolling(50).mean()       # 50-day moving average
    df['Returns'] = df['Close'].pct_change()          # day-over-day return
    df['Volatility'] = df['Close'].rolling(10).std()  # 10-day rolling std of Close
    # The first 49 rows have no MA50 value yet, so discard them.
    df = df.iloc[49:]
    df.to_csv(f"../data/processed/{ticker}_processed.csv", index=False)

loading AAPL
loading GOOGL
loading MSFT
loading TSLA
loading RELIANCE.NS
loading TCS.NS
loading INFY.NS
loading WIPRO.NS
loading ^NSEI
loading ^BSESN
loading ^GSPC
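
Since the loop reads and writes the same file, running the cell twice drops rows twice. A minimal sketch of a re-runnable alternative keeps the feature logic in a pure function that takes a frame and returns a new one, so the raw data is never modified. `add_features` is a hypothetical name, and the synthetic `demo` frame exists only to show the row count; the window sizes match the cell above.

```python
import pandas as pd


def add_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `prices` with MA10, MA50, Returns and Volatility columns."""
    out = prices.copy()
    out["Close"] = pd.to_numeric(out["Close"], errors="coerce")
    out["MA10"] = out["Close"].rolling(10).mean()
    out["MA50"] = out["Close"].rolling(50).mean()
    out["Returns"] = out["Close"].pct_change()
    # Note: this is the rolling standard deviation of the price, not of the returns.
    out["Volatility"] = out["Close"].rolling(10).std()
    # Keep only rows where every indicator is defined (drops the first 49).
    out = out.dropna(subset=["MA10", "MA50", "Returns", "Volatility"])
    return out.reset_index(drop=True)


# Synthetic closing prices 1.0 .. 60.0, used purely for illustration.
demo = pd.DataFrame({"Close": [float(i) for i in range(1, 61)]})
features = add_features(demo)
print(len(features))
```

With this shape, the loop could read from `../data/raw/` and write to `../data/processed/` without ever feeding its own output back in.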


In [43]:
df.head()

|   | Date | Close | High | Low | Open | Volume | MA10 | MA50 | Returns | Volatility |
|---|---|---|---|---|---|---|---|---|---|---|
| 97 | 2021-12-22 | 4696.560059 | 4697.669922 | 4645.529785 | 4650.359863 | 3319610000 | 4659.550049 | 4620.479189 | 0.01018 | 44.138306 |
| 98 | 2021-12-23 | 4725.790039 | 4740.740234 | 4703.959961 | 4703.959961 | 2913040000 | 4665.384033 | 4627.718994 | 0.006224 | 48.89749 |
| 99 | 2021-12-27 | 4791.189941 | 4791.490234 | 4733.990234 | 4733.990234 | 2770290000 | 4673.301025 | 4634.777598 | 0.013839 | 61.953481 |
| 100 | 2021-12-28 | 4786.350098 | 4807.02002 | 4780.040039 | 4795.490234 | 2707920000 | 4685.039014 | 4641.077197 | -0.00101 | 71.435782 |
| 101 | 2021-12-29 | 4793.060059 | 4804.060059 | 4778.080078 | 4788.640137 | 2963310000 | 4700.936035 | 4647.209199 | 0.001402 | 76.356775 |
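
`df` here is whatever the loop processed last, so this preview covers a single ticker. To check every processed file, one option is a small validation helper along the following lines. `find_problems` and its list of required columns are assumptions made for illustration; the sample rows are taken from the preview above.

```python
import pandas as pd

REQUIRED = ("Date", "Close", "MA10", "MA50", "Returns", "Volatility")


def find_problems(df: pd.DataFrame, required=REQUIRED) -> list:
    """Return a list of human-readable problems; an empty list means the frame looks fine."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if df[list(required)].isna().any().any():
        problems.append("NaN values present")
    if not pd.to_datetime(df["Date"]).is_monotonic_increasing:
        problems.append("dates not sorted ascending")
    return problems


sample = pd.DataFrame({
    "Date": ["2021-12-22", "2021-12-23", "2021-12-27"],
    "Close": [4696.560059, 4725.790039, 4791.189941],
    "MA10": [4659.550049, 4665.384033, 4673.301025],
    "MA50": [4620.479189, 4627.718994, 4634.777598],
    "Returns": [0.01018, 0.006224, 0.013839],
    "Volatility": [44.138306, 48.89749, 61.953481],
})
print(find_problems(sample))
```

Calling it on each `{ticker}_processed.csv` in a loop would surface files that were trimmed too far or still contain gaps.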
