# <center><div style="font-family: sans-serif; border-radius : 10px; background-color: black; color: #00DDDE; padding: 12px; line-height: 1;">1. Huge Stock Market Dataset (Kaggle)</div></center>

### <u>Description</u>
The publisher describes the dataset on Kaggle as follows:

> High-quality financial data is expensive to acquire and is therefore rarely shared for free. Here I provide the full historical daily price and volume data for all US-based stocks and ETFs trading on the NYSE, NASDAQ, and NYSE MKT. It's one of the best datasets of its kind you can obtain. The data (last updated 11/10/2017) is presented in CSV format as follows: Date, Open, High, Low, Close, Volume, OpenInt. Note that prices have been adjusted for dividends and splits.

Link: https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs

The data for every ticker symbol is saved in CSV format with common fields:
* Date - specifies trading date
* Open - opening price
* High - maximum price during the day
* Low - minimum price during the day
* Close - closing price (adjusted for dividends and splits, per the description above)
* Volume - the number of shares that changed hands during a given day
* OpenInt - open interest; always 0 in the Apple and Intel files (verified further down)
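The field definitions above imply a few invariants: `High` should be at least as large as `Open`, `Close`, and `Low`, and `Low` should be no larger than any of them. Below is a minimal sketch of such a sanity check. The helper name `ohlc_violations` is my own, and the three sample rows are the first Apple rows from this dataset, inlined so the snippet runs on its own.

```python
import io
import pandas as pd

# First three rows of the Apple file, inlined so the sketch is self-contained.
sample_csv = """Date,Open,High,Low,Close,Volume,OpenInt
1984-09-07,0.42388,0.42902,0.41874,0.42388,23220030,0
1984-09-10,0.42388,0.42516,0.41366,0.42134,18022532,0
1984-09-11,0.42516,0.43668,0.42516,0.42902,42498199,0
"""
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["Date"])


def ohlc_violations(df):
    """Return rows where High/Low fail to bracket the other prices or Volume is negative."""
    bad = (
        (df["High"] < df[["Open", "Close", "Low"]].max(axis=1))
        | (df["Low"] > df[["Open", "Close", "High"]].min(axis=1))
        | (df["Volume"] < 0)
    )
    return df[bad]


violations = ohlc_violations(sample)
print(f"{len(violations)} inconsistent rows")
```

In the notebook itself, the same helper could be called on the full `apple` and `intel` DataFrames once they are loaded.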

### <u>Credibility</u>
The dataset seems credible in that it has roughly 4,000 upvotes on Kaggle, although popularity is not the same as verified accuracy. The publisher does not mention where the data comes from.

### <u>Dataset Evaluation</u>
The dataset covers a long time span for both Apple and Intel, which is hard to find for free on the internet. Its main limitation is the small number of features: each row holds only the date, four prices, and the volume. If this serves as my base dataset, I'll need to merge other datasets into it so I have more information about each data point. The data also stops at 11/10/2017, which is a little annoying. If I find a comparable dataset with more recent data, I would choose that over this one.
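Merging extra features onto this dataset would most likely be a join on `Date`. The sketch below shows one way to do it with a left merge, so every price row is kept. The second DataFrame, its column name `placeholder_feature`, and its values are purely hypothetical stand-ins, not real data; the three price rows are Apple closes from this dataset.

```python
import pandas as pd

# Three Apple closing prices from this dataset.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2017-11-08", "2017-11-09", "2017-11-10"]),
    "Close": [175.61, 175.25, 174.67],
})

# Hypothetical second dataset: the column name and values are placeholders only.
extra = pd.DataFrame({
    "Date": pd.to_datetime(["2017-11-08", "2017-11-10"]),
    "placeholder_feature": [1.0, 2.0],
})

# Left merge keeps every price row; validate= guards against duplicate dates,
# and indicator= adds a `_merge` column showing which rows found a match.
merged = prices.merge(extra, on="Date", how="left",
                      validate="one_to_one", indicator=True)

unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print(f"{unmatched} price rows have no matching extra feature")
```

Rows without a match end up with `NaN` in the new column, so the number of unmatched rows is worth checking before modelling.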

In [1]:
# Data Manipulation
import pandas as pd
import numpy as np

# Plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [2]:
apple = pd.read_csv('data/aapl.txt', parse_dates=['Date'])
apple.head()

| | Date | Open | High | Low | Close | Volume | OpenInt |
|---|---|---|---|---|---|---|---|
| 0 | 1984-09-07 | 0.42388 | 0.42902 | 0.41874 | 0.42388 | 23220030 | 0 |
| 1 | 1984-09-10 | 0.42388 | 0.42516 | 0.41366 | 0.42134 | 18022532 | 0 |
| 2 | 1984-09-11 | 0.42516 | 0.43668 | 0.42516 | 0.42902 | 42498199 | 0 |
| 3 | 1984-09-12 | 0.42902 | 0.43157 | 0.41618 | 0.41618 | 37125801 | 0 |
| 4 | 1984-09-13 | 0.43927 | 0.44052 | 0.43927 | 0.43927 | 57822062 | 0 |


In [3]:
apple.tail()

| | Date | Open | High | Low | Close | Volume | OpenInt |
|---|---|---|---|---|---|---|---|
| 8359 | 2017-11-06 | 171.75 | 174.36 | 171.1 | 173.63 | 34901241 | 0 |
| 8360 | 2017-11-07 | 173.29 | 174.51 | 173.29 | 174.18 | 24424877 | 0 |
| 8361 | 2017-11-08 | 174.03 | 175.61 | 173.71 | 175.61 | 24451166 | 0 |
| 8362 | 2017-11-09 | 174.48 | 175.46 | 172.52 | 175.25 | 29533086 | 0 |
| 8363 | 2017-11-10 | 175.11 | 175.38 | 174.27 | 174.67 | 25130494 | 0 |


In [4]:
intel = pd.read_csv('data/intc.txt', parse_dates=['Date'])
intel.head()

| | Date | Open | High | Low | Close | Volume | OpenInt |
|---|---|---|---|---|---|---|---|
| 0 | 1972-01-07 | 0.01592 | 0.01592 | 0.01592 | 0.01592 | 3787746 | 0 |
| 1 | 1972-01-14 | 0.00791 | 0.00791 | 0.00791 | 0.00791 | 7878523 | 0 |
| 2 | 1972-01-21 | 0.00791 | 0.00791 | 0.00791 | 0.00791 | 1060564 | 0 |
| 3 | 1972-01-24 | 0.00791 | 0.00791 | 0.00791 | 0.00791 | 6060405 | 0 |
| 4 | 1972-01-25 | 0.00791 | 0.00791 | 0.00791 | 0.00791 | 1060564 | 0 |


In [5]:
intel.tail()

| | Date | Open | High | Low | Close | Volume | OpenInt |
|---|---|---|---|---|---|---|---|
| 11551 | 2017-11-06 | 46.6 | 46.74 | 46.09 | 46.7 | 34006271 | 0 |
| 11552 | 2017-11-07 | 46.7 | 47.09 | 46.64 | 46.78 | 24422113 | 0 |
| 11553 | 2017-11-08 | 46.62 | 46.7 | 46.28 | 46.7 | 21556947 | 0 |
| 11554 | 2017-11-09 | 46.05 | 46.39 | 45.65 | 46.3 | 25564257 | 0 |
| 11555 | 2017-11-10 | 46.04 | 46.09 | 45.38 | 45.58 | 24088569 | 0 |


Apple went public in December 1980, while its data here starts in September 1984, so the first few years of trading are missing, but the coverage is still quite extensive. Intel went public in 1971 and its data starts in January 1972, so it covers nearly the company's whole trading history.
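To put rough numbers on the coverage, here is a small sketch that computes the span of each series from the first and last dates shown in the `head()` and `tail()` outputs. The dates are hard-coded so the snippet runs on its own; in the notebook, `apple['Date'].min()` and `apple['Date'].max()` would give the same values. Apple's IPO date (1980-12-12) is outside knowledge, not something stored in the dataset.

```python
import pandas as pd

# First and last dates taken from the head()/tail() outputs.
coverage = {
    "Apple": (pd.Timestamp("1984-09-07"), pd.Timestamp("2017-11-10")),
    "Intel": (pd.Timestamp("1972-01-07"), pd.Timestamp("2017-11-10")),
}

# Assumption from outside the dataset: Apple's IPO date.
apple_ipo = pd.Timestamp("1980-12-12")

spans = {}
for name, (first, last) in coverage.items():
    # Approximate the span in years using an average year length.
    spans[name] = (last - first).days / 365.25
    print(f"{name}: {first.date()} to {last.date()} (~{spans[name]:.1f} years)")

missing_years = (coverage["Apple"][0] - apple_ipo).days / 365.25
print(f"Apple trading history before the first row: ~{missing_years:.1f} years")
```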

In the discussions of this dataset on Kaggle, several people point out that the `OpenInt` feature is always 0. Let's see if that is actually the case...

In [6]:
apple['OpenInt'].unique(), intel['OpenInt'].unique()

(array([0], dtype=int64), array([0], dtype=int64))

As the output shows, the `OpenInt` feature is always 0 for both Apple and Intel, which makes it a useless feature for us. That should be fine because the remaining columns still carry the important information. However, it is important to note that this dataset doesn't contain values such as an estimate of each stock's fair value (which may be very useful), the revenue of the company, etc.
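Since a constant column carries no information, one option is to detect and drop such columns programmatically rather than hard-coding `OpenInt`. This is a minimal sketch on the last Apple rows from the `tail()` output, inlined so it runs on its own; the variable names are my own.

```python
import io
import pandas as pd

# Last three Apple rows from the tail() output, inlined for a self-contained example.
sample_csv = """Date,Open,High,Low,Close,Volume,OpenInt
2017-11-08,174.03,175.61,173.71,175.61,24451166,0
2017-11-09,174.48,175.46,172.52,175.25,29533086,0
2017-11-10,175.11,175.38,174.27,174.67,25130494,0
"""
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["Date"])

# A column with a single distinct value (counting NaN as a value) is constant.
constant_cols = [col for col in sample.columns
                 if sample[col].nunique(dropna=False) == 1]

cleaned = sample.drop(columns=constant_cols)
print("Dropped:", constant_cols)
print("Remaining:", cleaned.columns.tolist())
```

On a three-row sample any column could be constant by coincidence, so on the real data the check should run over the full DataFrame.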

In [7]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8364 entries, 0 to 8363
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   Date     8364 non-null   datetime64[ns]
 1   Open     8364 non-null   float64       
 2   High     8364 non-null   float64       
 3   Low      8364 non-null   float64       
 4   Close    8364 non-null   float64       
 5   Volume   8364 non-null   int64         
 6   OpenInt  8364 non-null   int64         
dtypes: datetime64[ns](1), float64(4), int64(2)
memory usage: 457.5 KB


In [8]:
intel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11556 entries, 0 to 11555
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   Date     11556 non-null  datetime64[ns]
 1   Open     11556 non-null  float64       
 2   High     11556 non-null  float64       
 3   Low      11556 non-null  float64       
 4   Close    11556 non-null  float64       
 5   Volume   11556 non-null  int64         
 6   OpenInt  11556 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(2)
memory usage: 632.1 KB


As seen above, neither dataset contains missing values, and every column is numeric apart from `Date`, which is parsed as a datetime. That is great to see because it means less data manipulation on these datasets.
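One caveat: `info()` only reports nulls within existing rows, so it cannot reveal trading days that are absent altogether. The first Intel rows, for example, are a week apart before the series becomes daily. The sketch below measures the gap between consecutive dates, using the five Intel dates from the `head()` output. The 4-day threshold (enough to allow a weekend plus one holiday) is my own assumption.

```python
import pandas as pd

# The five Intel dates from the head() output.
dates = pd.Series(pd.to_datetime([
    "1972-01-07", "1972-01-14", "1972-01-21", "1972-01-24", "1972-01-25",
]))

# Calendar days between each row and the previous one (first entry is NaN).
gaps = dates.diff().dt.days

# Assumed threshold: more than 4 calendar days suggests missing trading days.
long_gaps = gaps[gaps > 4]

print(gaps.tolist())
print(f"{len(long_gaps)} gaps longer than 4 days")
```

In the notebook, replacing `dates` with `intel['Date']` or `apple['Date']` would run the same check on the full history.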

In [9]:
# Graphing Closing Prices (already adjusted for dividends and splits, per the dataset description)

fig = make_subplots(
  rows=1, cols=2,
  subplot_titles=('Apple (1984-2017)', 'Intel (1972-2017)'))

fig.add_trace(
  go.Scatter(x=apple['Date'], y=apple['Close']),
  row=1, col=1)

fig.add_trace(
  go.Scatter(x=intel['Date'], y=intel['Close']),
  row=1, col=2)

fig.update_layout(title_text='Daily Closing Stock Prices in USD', showlegend=False)
fig.show()

# Graphing Volume

fig = make_subplots(
  rows=1, cols=2,
  subplot_titles=('Apple (1984-2017)', 'Intel (1972-2017)'))

fig.add_trace(
  go.Scatter(x=apple['Date'], y=apple['Volume']),
  row=1, col=1)

fig.add_trace(
  go.Scatter(x=intel['Date'], y=intel['Volume']),
  row=1, col=2)

fig.update_layout(title_text='Volume of Shares Exchanged Per Day', showlegend=False)
fig.show()