# Dataset Curation

Historical data for the BSE Sensex index was obtained from [Yahoo Finance](https://finance.yahoo.com/quote/%5EBSESN/history?period1=1420070400&period2=1660003200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true).
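
The `period1` and `period2` query parameters in the link above look like Unix timestamps in seconds. That is an assumption based on their magnitude, not something the page states. Under that assumption, a minimal sketch that decodes them to check the requested date range:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

url = (
    "https://finance.yahoo.com/quote/%5EBSESN/history"
    "?period1=1420070400&period2=1660003200&interval=1d"
    "&filter=history&frequency=1d&includeAdjustedClose=true"
)

# Parse the query string into a dict of lists, e.g. {"period1": ["1420070400"], ...}
params = parse_qs(urlparse(url).query)

# Assumption: period1/period2 are Unix timestamps (seconds since 1970-01-01 UTC)
start = datetime.fromtimestamp(int(params["period1"][0]), tz=timezone.utc).date()
end = datetime.fromtimestamp(int(params["period2"][0]), tz=timezone.utc).date()

print(start, end)  # 2015-01-01 2022-08-09
```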

We entered the start and end dates, copied the table element from the Yahoo Finance page, and saved it as `yahoo_finance_bse_historical_table.html`. We now use Beautiful Soup to parse the HTML table into a pandas dataframe for subsequent analysis.


In [1]:
# !pip install -q beautifulsoup4 pandas numpy scikit-learn matplotlib seaborn

In [42]:
# Get all the libraries
import re
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup

In [70]:
html_doc = Path("yahoo_finance_bse_historical_table.html").read_text()
soup = BeautifulSoup(html_doc, "html.parser")

# Column Headers
columns = [x.text for x in soup.find("thead").find_all("span")]

# Read the table rows
table_body = soup.find("tbody")
table_rows = table_body.find_all("tr")

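# Extract the cell values from a single table row
def extract_info(row):
    try:
        elements = [x.span.text for x in row.find_all("td")]
        # First cell is the date string; the rest are numbers with thousands separators
        elements = [elements[0]] + [float(x.replace(",", "")) for x in elements[1:]]
    except (AttributeError, IndexError, ValueError):
        # Rows that do not fit the expected layout are filled with a -1 sentinel
        elements = [-1] * len(columns)
    return elements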


records = [extract_info(x) for x in table_rows]
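
Rows that fail to parse (for example, rows whose cells lack a `<span>` or hold non-numeric text) come back from `extract_info` filled with `-1`. A minimal sketch of how such placeholder rows could be counted and dropped before building the dataframe. The sample `records`, its date strings, and the names `is_placeholder` and `clean_records` are hypothetical:

```python
# Hypothetical sample in the same shape that extract_info returns
records = [
    ["Jan 15, 2015", 27831.16, 28194.61, 27703.70, 28075.55, 28075.55, 16700.0],
    [-1, -1, -1, -1, -1, -1, -1],  # placeholder for a row that failed to parse
    ["Jan 14, 2015", 27432.14, 27512.80, 27203.25, 27346.82, 27346.82, 10200.0],
]


def is_placeholder(row):
    """Return True when every element of the row is the -1 sentinel."""
    return all(value == -1 for value in row)


clean_records = [row for row in records if not is_placeholder(row)]
n_dropped = len(records) - len(clean_records)
print(f"Dropped {n_dropped} placeholder row(s), kept {len(clean_records)}")
```

Reporting the number of dropped rows makes it easier to notice if the page layout changed and most rows stopped parsing.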

In [75]:
df = pd.DataFrame(records, columns=columns)
df["Date"] = pd.to_datetime(df["Date"])
df.tail(10)

| | Date | Open | High | Low | Close* | Adj Close** | Volume |
|---|---|---|---|---|---|---|---|
| 1868 | 2015-01-15 | 27831.16 | 28194.61 | 27703.7 | 28075.55 | 28075.55 | 16700.0 |
| 1869 | 2015-01-14 | 27432.14 | 27512.8 | 27203.25 | 27346.82 | 27346.82 | 10200.0 |
| 1870 | 2015-01-13 | 27611.56 | 27670.19 | 27324.58 | 27425.73 | 27425.73 | 7800.0 |
| 1871 | 2015-01-12 | 27523.86 | 27620.66 | 27323.74 | 27585.27 | 27585.27 | 7500.0 |
| 1872 | 2015-01-09 | 27404.19 | 27507.67 | 27119.63 | 27458.38 | 27458.38 | 11100.0 |
| 1873 | 2015-01-08 | 27178.77 | 27316.41 | 27101.94 | 27274.71 | 27274.71 | 8200.0 |
| 1874 | 2015-01-07 | 26983.43 | 27051.6 | 26776.12 | 26908.82 | 26908.82 | 12200.0 |
| 1875 | 2015-01-06 | 27694.23 | 27698.93 | 26937.06 | 26987.46 | 26987.46 | 14100.0 |
| 1876 | 2015-01-05 | 27978.43 | 28064.49 | 27786.85 | 27842.32 | 27842.32 | 9200.0 |
| 1877 | 2015-01-02 | 27521.28 | 27937.47 | 27519.26 | 27887.9 | 27887.9 | 7400.0 |
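
The table is ordered newest-first, which is why `df.tail(10)` shows the earliest dates. Many time-series steps expect ascending dates, so one option is to sort before going further. A minimal sketch on three rows copied from the output above; `df_sorted` is a hypothetical name:

```python
import pandas as pd

# Three rows copied from the output above (newest first, as in the source table)
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-06", "2015-01-05", "2015-01-02"]),
        "Close*": [26987.46, 27842.32, 27887.90],
    }
)

# Sort oldest-first and rebuild a clean 0..n-1 index
df_sorted = df.sort_values("Date").reset_index(drop=True)
print(df_sorted)
```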


We have now collected data for the `BSE Sensex` index from the start of `2015` up to the day of download. Since the problem statement only asks for predictions up to `June 2021`, we can truncate the dataset at that point.
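
A minimal sketch of that truncation, assuming "till June 2021" means up to and including 30 June 2021. The sample values are placeholders rather than real quotes, and `cutoff` and `df_truncated` are hypothetical names:

```python
import pandas as pd

# Hypothetical sample spanning the cutoff; values are placeholders, not real quotes
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-07-01", "2021-06-30", "2021-06-29"]),
        "Close*": [3.0, 2.0, 1.0],
    }
)

# Assumption: the cutoff is inclusive of the last day of June 2021
cutoff = pd.Timestamp("2021-06-30")
df_truncated = df[df["Date"] <= cutoff].reset_index(drop=True)
print(df_truncated)
```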

Our dataset has the following columns:

- Date: The trading day the row describes
- Open: The index value at market open
- High: The highest value the index reached that day
- Low: The lowest value the index reached that day
- Close: The index value at market close (our prediction target)
- Volume: An indicator of the amount of trading during that day
- Adj Close: The closing value adjusted for splits, dividends, and other capital gain distributions
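
The column names parsed from the page carry footnote markers (`Close*`, `Adj Close**`), as the dataframe output above shows. A minimal sketch of stripping them so the names match the list above. In the rows shown, `Close*` and `Adj Close**` are identical, but that is not verified for the whole dataset, so both columns are kept here:

```python
import pandas as pd

# Two rows copied from the dataframe output above
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-05", "2015-01-02"]),
        "Close*": [27842.32, 27887.90],
        "Adj Close**": [27842.32, 27887.90],
        "Volume": [9200.0, 7400.0],
    }
)

# Remove trailing asterisks (footnote markers) from every column name
df = df.rename(columns=lambda name: name.rstrip("*"))
print(list(df.columns))  # ['Date', 'Close', 'Adj Close', 'Volume']

# Check whether the adjusted close differs from the close anywhere in this sample
print((df["Close"] != df["Adj Close"]).any())  # False
```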