# Data Analytics

In [4]:
# importing libraries
import requests
import pandas as pd
import numpy as np

## 1. Crawl Dataset

Perform web scraping on Yahoo Finance to obtain daily stock data of Nvidia from 25 November 2022 to 22 November 2024.
- What are the variables of interest?
- How was the data scraped/collected?

In [44]:
# scrape stock data
url = 'https://finance.yahoo.com/quote/NVDA/history/?period1=1669334400&period2=1732506788'
r = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
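
The `period1` and `period2` query parameters in the URL appear to be Unix timestamps in seconds. As a minimal sketch, assuming the start of the range is midnight UTC, the `period1` value can be derived from a calendar date as follows (the helper name `to_unix_seconds` is hypothetical and is not part of the original notebook):

```python
from datetime import datetime, timezone


def to_unix_seconds(date_str: str) -> int:
    """Convert a 'YYYY-MM-DD' date to Unix seconds at midnight UTC (assumed convention)."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


# start of the requested range: 25 November 2022
period1 = to_unix_seconds("2022-11-25")
print(period1)  # 1669334400, which matches period1 in the URL above
```

The `period2` value in the URL is not a midnight timestamp (it falls on 25 November 2024), so it is not reproduced here.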

In [46]:
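# convert the HTML table into a dataframe
# (wrapping the text in StringIO avoids the literal-HTML deprecation warning in newer pandas)
from io import StringIO
read_html_pandas_data = pd.read_html(StringIO(r.text))[0]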
read_html_pandas_data

Unnamed: 0,Date,Open,High,Low,Close Close price adjusted for splits.,Adj Close Adjusted close price adjusted for splits and dividend and/or capital gain distributions.,Volume
0,"Nov 22, 2024",145.93,147.16,141.10,141.95,141.95,235772200
1,"Nov 21, 2024",149.35,152.89,140.70,146.67,146.67,400946600
2,"Nov 20, 2024",147.41,147.56,142.73,145.89,145.89,309871700
3,"Nov 19, 2024",141.32,147.13,140.99,147.01,147.01,227834900
4,"Nov 18, 2024",139.50,141.55,137.15,140.15,140.15,221866000
...,...,...,...,...,...,...,...
506,"Nov 30, 2022",0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
507,"Nov 30, 2022",15.70,16.93,15.60,16.92,16.91,565298000
508,"Nov 29, 2022",15.83,15.93,15.52,15.64,15.62,298384000
509,"Nov 28, 2022",16.03,16.36,15.73,15.83,15.81,303741000


In [51]:
# export dataframe to CSV file
read_html_pandas_data.to_csv('NVDA_stock_data.csv', index=False)

## 2a.  Data Preparation & Cleaning

In [435]:
# create a copy
df = read_html_pandas_data.copy()

In [437]:
# renaming the columns
df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']

In [439]:
# the number of rows
print(f'Number of rows: {df.shape[0]}')

# the number of columns
print(f'Number of columns: {df.shape[1]}')

Number of rows: 511
Number of columns: 7


In [441]:
# Checking for nulls in the table
df.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

There are no null values in the table.

In [444]:
# Check the data type for each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511 entries, 0 to 510
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       511 non-null    object
 1   Open       511 non-null    object
 2   High       511 non-null    object
 3   Low        511 non-null    object
 4   Close      511 non-null    object
 5   Adj Close  511 non-null    object
 6   Volume     511 non-null    object
dtypes: object(7)
memory usage: 28.1+ KB


We observe that all of the columns have the `object` data type, even though we would expect the `Date` column to be `datetime`, the `Volume` column to be `int` or `float`, and the remaining columns to be `float`.

### Date column

Let's start with the `Date` column. First, we check that every entry in the column can be parsed as a date, even though it is currently stored with the `object` data type.

In [448]:
# count the number of rows that cannot be converted to datetime
pd.to_datetime(df['Date'], errors='coerce').isnull().sum()

0

Based on the result above, we can confirm that the rows in the `Date` column are all dates but with object data type (i.e. all rows can be converted to datetime data type).

The logical next step is to convert the `Date` column's data type to `datetime`:

In [451]:
# convert data type to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
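
As an optional variation, the date format can be passed explicitly so that parsing does not rely on format inference. This is a minimal sketch on a toy series, assuming the dates follow the `Nov 22, 2024` pattern seen in the scraped table; the `"not a date"` entry is invented purely to show how `errors='coerce'` behaves:

```python
import pandas as pd

# toy sample: two dates in the scraped format plus one invalid entry (hypothetical)
sample = pd.Series(["Nov 22, 2024", "Nov 21, 2024", "not a date"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(sample, format="%b %d, %Y", errors="coerce")

# entries that could not be parsed become NaT and are counted here
n_bad = parsed.isnull().sum()
print(parsed)
print(f"Unparseable entries: {n_bad}")
```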

### The rest of the columns

Now, let's take a closer look at the data in the `Open`, `High`, `Low`, `Close`, `Adj Close`, and `Volume` columns.

We begin by checking for non-numeric entries in these columns.

In [454]:
# create a copy of df to only store numerical columns
df_num = df.copy()

# drop the Date column
df_num.drop('Date', axis=1, inplace=True)

# store numeric columns (to be used later)
num_cols = df_num.columns

In [456]:
# create a copy of df
df_bool = df_num.copy()

# indicate the rows in each column that are non-numeric
for col in df_num.columns:
    df_bool[col] = pd.to_numeric(df_num[col], errors='coerce').isnull()

# locate rows in the dataframe that contain at least 1 non-numeric value
ser_bool = df_bool.any(axis=1)

# print an extract of the dataframe with rows that contain at least 1 non-numeric value
df_num = df_num[ser_bool]
display(df_num)

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
51,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend
116,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend
118,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits
186,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
248,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
312,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
375,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
440,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
506,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend


From the result above, we can see the rows that contain at least one non-numeric value. Notice that every column contains non-numeric entries, which explains why Pandas assigns the `object` data type to these columns.
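
The same check can also be written without an explicit loop. The following is a minimal sketch on a toy dataframe (the values mimic the scraped table, and the names `toy`, `coerced`, and `mask` are hypothetical):

```python
import pandas as pd

# toy data: one regular row, one dividend row, one stock split row
toy = pd.DataFrame({
    "Open": ["116.84", "0.01 Dividend", "10:1 Stock Splits"],
    "Volume": ["367100500", "0.01 Dividend", "10:1 Stock Splits"],
})

# coerce every column to numeric; non-numeric entries become NaN
coerced = toy.apply(pd.to_numeric, errors="coerce")

# True for rows that contain at least one non-numeric value
mask = coerced.isnull().any(axis=1)
print(toy[mask])
```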

Before we drop these rows, we will take a look at any other data entries with the same date as these dividend payment / stock split events.

In [460]:
# the list of dates on which non-numeric entries appear
date_non_numeric = list(df.loc[df_num.index, 'Date'])

# display all rows that fall on those dates
filtered_df = df[df['Date'].isin(date_non_numeric)]
display(filtered_df)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
51,2024-09-12,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend
52,2024-09-12,116.84,120.79,115.38,119.14,119.14,367100500
116,2024-06-11,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend
117,2024-06-11,121.77,122.87,118.74,120.91,120.90,222551200
118,2024-06-10,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits
119,2024-06-10,120.37,123.10,117.01,121.79,121.77,314162700
186,2024-03-05,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
187,2024-03-05,85.27,86.10,83.42,85.96,85.95,520639000
248,2023-12-05,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend,0.00 Dividend
249,2023-12-05,45.47,46.60,45.27,46.57,46.56,371718000


It seems that, on the dates when a dividend payment or stock split occurs, regular stock price and volume data are also present.

Moreover, we want to take a closer look at the stock split event to check whether the prices before and after the split are consistent.

In [491]:
# extract date of stock split
bool_split = df['Open'].str.contains('Stock Splits')
date_split = df[bool_split]['Date']
date_split

118   2024-06-10
Name: Date, dtype: datetime64[ns]

The split happened on June 10th, 2024. To see whether the prices before and after the split are consistent, we will print a few rows of data before and after the split date:

In [510]:
# index of the stock split date
idx_split = date_split.index[0]

# print a few rows before and after the split
df.iloc[idx_split-3 : idx_split+3, :]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
115,2024-06-12,123.06,126.88,122.57,125.20,125.19,299595000
116,2024-06-11,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend,0.01 Dividend
117,2024-06-11,121.77,122.87,118.74,120.91,120.90,222551200
118,2024-06-10,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits,10:1 Stock Splits
119,2024-06-10,120.37,123.10,117.01,121.79,121.77,314162700
120,2024-06-07,119.77,121.69,118.02,120.89,120.87,412386000


The prices before and after the stock split appear to be consistent (i.e. the prices have already been adjusted). Therefore, no further action is required (such as dividing or multiplying the price by 10 for a 10:1 stock split).
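
To make the consistency check concrete, the ratio of the close prices on either side of the split can be computed. This is a rough sketch using the two close prices shown in the table above; the 0.5 tolerance is an arbitrary assumption, chosen only because an unadjusted 10:1 split would change the price by roughly a factor of 10:

```python
# close prices taken from the table above
close_before_split = 120.89  # 2024-06-07
close_on_split_day = 121.79  # 2024-06-10

# for split-adjusted data the ratio should be close to 1,
# whereas unadjusted data would show a ratio near 0.1 for a 10:1 split
ratio = close_on_split_day / close_before_split

# hypothetical tolerance: anything within 50% of 1 is treated as "no split jump"
looks_adjusted = abs(ratio - 1) < 0.5

print(f"Close-to-close ratio across the split: {ratio:.4f}")
print(f"Prices look split-adjusted: {looks_adjusted}")
```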

Now let's drop these non-numeric rows:

In [515]:
# drop non-numeric rows
df.drop(df_num.index, axis=0, inplace=True)

The next step is to change the data type of these numeric columns to float/integer:

In [518]:
# change data type of numeric columns to float/integer
for col in num_cols:
    df[col] = pd.to_numeric(df[col])

In [520]:
# check data type
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 502 entries, 0 to 510
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       502 non-null    datetime64[ns]
 1   Open       502 non-null    float64       
 2   High       502 non-null    float64       
 3   Low        502 non-null    float64       
 4   Close      502 non-null    float64       
 5   Adj Close  502 non-null    float64       
 6   Volume     502 non-null    int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 31.4 KB


The data type has been appropriately changed.

For convenience, we will sort the dataframe so that it is ordered by the `Date` column in ascending order:

In [527]:
df.sort_values('Date', ascending=True, inplace=True)

Finally, we use the `Date` column as the index of the dataframe:

In [530]:
# use Date column as the dataframe index (inplace=True so that the change persists)
df.set_index('Date', inplace=True)
df

Date,Open,High,Low,Close,Adj Close,Volume
2022-11-25,16.32,16.49,16.17,16.27,16.26,167934000
2022-11-28,16.03,16.36,15.73,15.83,15.81,303741000
2022-11-29,15.83,15.93,15.52,15.64,15.62,298384000
2022-11-30,15.70,16.93,15.60,16.92,16.91,565298000
2022-12-01,17.00,17.26,16.64,17.14,17.12,470977000
...,...,...,...,...,...,...
2024-11-18,139.50,141.55,137.15,140.15,140.15,221866000
2024-11-19,141.32,147.13,140.99,147.01,147.01,227834900
2024-11-20,147.41,147.56,142.73,145.89,145.89,309871700
2024-11-21,149.35,152.89,140.70,146.67,146.67,400946600


## 2b. Features Generation
- Compute the intraday return -> (Close - Open) / Open
- Compute the close-to-close return (also called the daily return) -> (Close_today - Close_yesterday) / Close_yesterday
- Compute cumulative returns (for example, assume you purchase the stock on Nov 25, 2022 and hold it until Nov 22, 2024)
- Compute moving averages of the close price
- Compute daily volatility as the difference between High and Low
- Extract the day from the date
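
The features listed above could be sketched as follows. This is only an illustration on the first five rows of the cleaned table, not the final implementation. The names `df_feat`, `MA_3`, and the other new column names are hypothetical. The 3-day window, the choice to measure cumulative return from the first close price, and the reading of "day" as the day of the week are all assumptions:

```python
import pandas as pd

# first five rows of the cleaned data, indexed by date
df_feat = pd.DataFrame(
    {
        "Open": [16.32, 16.03, 15.83, 15.70, 17.00],
        "High": [16.49, 16.36, 15.93, 16.93, 17.26],
        "Low": [16.17, 15.73, 15.52, 15.60, 16.64],
        "Close": [16.27, 15.83, 15.64, 16.92, 17.14],
    },
    index=pd.to_datetime(
        ["2022-11-25", "2022-11-28", "2022-11-29", "2022-11-30", "2022-12-01"]
    ),
)

# intraday return: (Close - Open) / Open
df_feat["Intraday Return"] = (df_feat["Close"] - df_feat["Open"]) / df_feat["Open"]

# close-to-close (daily) return; the first row has no previous close, so it is NaN
df_feat["Daily Return"] = df_feat["Close"].pct_change()

# cumulative return relative to the first close price (assumed purchase price)
df_feat["Cumulative Return"] = df_feat["Close"] / df_feat["Close"].iloc[0] - 1

# moving average of the close price (window of 3 is an arbitrary choice for this sample)
df_feat["MA_3"] = df_feat["Close"].rolling(window=3).mean()

# daily volatility proxy: High minus Low
df_feat["High-Low"] = df_feat["High"] - df_feat["Low"]

# day of the week extracted from the date index
df_feat["Day"] = df_feat.index.day_name()

print(df_feat)
```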

## 3

- Relationship between high-low (i.e. intraday volatility) and volume
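
One possible starting point for this analysis is the Pearson correlation between the high-low range and the volume. The sketch below uses only the five most recent rows of the scraped table, so the number it prints is illustrative rather than a finding. The names `sample` and `corr` are hypothetical:

```python
import pandas as pd

# five most recent rows of the scraped table (Nov 18 to Nov 22, 2024)
sample = pd.DataFrame({
    "High": [141.55, 147.13, 147.56, 152.89, 147.16],
    "Low": [137.15, 140.99, 142.73, 140.70, 141.10],
    "Volume": [221866000, 227834900, 309871700, 400946600, 235772200],
})

# intraday volatility proxy: High minus Low
sample["High-Low"] = sample["High"] - sample["Low"]

# Pearson correlation between the intraday range and the traded volume
corr = sample["High-Low"].corr(sample["Volume"])
print(f"Correlation between High-Low and Volume: {corr:.3f}")
```

On the full dataset, a scatter plot of the two series would complement the single correlation figure.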