# Performing EDA
In this activity, you will put together everything we have learned so far about exploratory data analysis (EDA).

This resource by nbviewer is a great example of how to perform EDA.

https://web.compass.lighthouselabs.ca/p/ds-5/5c149340-0ad0-4d55-8857-b7ee1dbf0707

Take some time to review it before jumping into the steps below.



## Key Questions
Here are some questions to help you with this assignment:

- What are the common daily values for stock prices?
- How can I load a JSON file in Python?
- How does the with statement work when opening a file?
- When should I stop indenting after the with statement?
- What format is the loaded JSON once I use the json.load() function?
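
The questions about `with`, indentation, and `json.load()` can be explored with a small, self-contained sketch. The file name `sample.json` and its contents are made up for illustration; only `open`, `with`, and `json.load` come from this activity.

```python
import json
import os
import tempfile

# Hypothetical sample content, written to a temporary file so the example
# does not depend on the course data being downloaded.
sample = {"chart": {"result": [{"meta": {"symbol": "DEMO"}}], "error": None}}
file_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(file_path, "w") as f:
    json.dump(sample, f)

# Everything indented under `with` runs while the file is open.
with open(file_path, "r") as f:
    data = json.load(f)  # parse while the file is still open

# Back at the outer indentation level the file has been closed automatically,
# but the parsed result is still available.
file_is_closed = f.closed
loaded_type = type(data).__name__
symbol = data["chart"]["result"][0]["meta"]["symbol"]
print(file_is_closed, loaded_type, symbol)
```

`json.load()` returns ordinary Python objects: a JSON object becomes a dict, an array becomes a list, and null becomes None, so nested values are reached with the usual key and index lookups.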

## Step 1: Download the Data

## Step 2: Load the Data Files

As you may have noticed, the JSON files have a lot of information. What we decide to include in our dataframe should be informed by the questions we want to answer.

Consider all the companies listed on the NASDAQ in the folder for the year 2020. We are interested in finding out answers to the following questions:

- How many stocks do we have?
- Which stock has the highest price, and when was it observed?
- Which stock has the lowest price, and when was it observed?
- Which stock is the most popular in 2021? (has the highest traded volume in 2021)

Now, we have to pick a single stock to act as our prototype.

Load the JSON files for the stock of your choice, write code to parse the file, and transform it into a Pandas DataFrame with the following columns:

- stock acronym
- day - extracted from the timestamp value in the data
- open - the price when trading opened that day
- high - the highest price of the day
- close - the price when trading closed that day
- low - the lowest price of the day
- splits - the number of splits of the stock (look for the value splits in the events key of the JSON file)
- volume - the number of shares traded that day

These columns will help us answer the questions above.
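
A minimal sketch of such a prototype is shown here. It assumes the file nests its data as chart → result → timestamp, indicators → quote, and events → splits, with splits keyed by the timestamp as a string. The `data` dictionary and the symbol `DEMO` are invented stand-ins for one loaded file. Recording a split as 1 or 0 per day is one reading of "number of splits"; storing the split ratio instead would be equally reasonable.

```python
import pandas as pd

# Invented miniature of one stock file; real files hold many more days and fields.
data = {
    "chart": {
        "result": [{
            "meta": {"symbol": "DEMO"},
            "timestamp": [1577975400, 1578061800, 1578321000],
            "indicators": {"quote": [{
                "open": [10.0, 10.5, None],
                "high": [11.0, 10.9, None],
                "low": [9.8, 10.1, None],
                "close": [10.6, 10.2, None],
                "volume": [1000, 1500, None],
            }]},
            # Assumption: split events are keyed by the timestamp as a string.
            "events": {"splits": {"1578061800": {"splitRatio": "2:1"}}},
        }]
    }
}

result = data["chart"]["result"][0]
quote = result["indicators"]["quote"][0]
timestamps = result["timestamp"]
splits = result.get("events", {}).get("splits", {})

prototype = pd.DataFrame({
    "symbol": result["meta"]["symbol"],  # scalar, repeated on every row
    # Unix seconds -> datetime, then drop the time of day
    "day": pd.to_datetime(timestamps, unit="s").normalize(),
    "open": quote["open"],
    "high": quote["high"],
    "close": quote["close"],
    "low": quote["low"],
    "splits": [1 if str(ts) in splits else 0 for ts in timestamps],
    "volume": quote["volume"],
})

# Days with no quote carry missing values; drop them.
prototype = prototype.dropna(subset=["open", "high", "close", "low", "volume"])
print(prototype)
```

Building the frame column by column avoids an explicit loop over days, which matters once the same code runs over thousands of files.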

In [1]:
import requests
import pandas as pd
from pprint import pprint
import json

The cell below loads one file:

- open(file_path, 'r') opens the JSON file in read mode.
- json.load(f) parses the file into a Python dictionary.
- data is then a dictionary object that you can access just like a parsed API response.

In [2]:
# Example: Load one JSON file (AAVL.json)
file_path = 'Data/stock_market_data/nasdaq/json/AAVL.json'

with open(file_path, 'r') as f:
    data = json.load(f)

# Optional: print the raw dictionary, then pretty-print it to inspect the structure
# (pprint was already imported in the first cell)
print(data, '\n')
pprint(data)

{'chart': {'result': [{'meta': {'currency': None, 'symbol': 'AAVL', 'exchangeName': 'YHD', 'instrumentType': 'MUTUALFUND', 'firstTradeDate': 1509629400, 'regularMarketTime': 1561759658, 'gmtoffset': -14400, 'timezone': 'EDT', 'exchangeTimezoneName': 'America/New_York', 'chartPreviousClose': 3.25, 'priceHint': 2, 'currentTradingPeriod': {'pre': {'timezone': 'EDT', 'start': 1626076800, 'end': 1626096600, 'gmtoffset': -14400}, 'regular': {'timezone': 'EDT', 'start': 1626096600, 'end': 1626120000, 'gmtoffset': -14400}, 'post': {'timezone': 'EDT', 'start': 1626120000, 'end': 1626134400, 'gmtoffset': -14400}}, 'dataGranularity': '1d', 'range': '', 'validRanges': ['1mo', '3mo', '6mo', 'ytd', '1y', '2y', '5y', '10y', 'max']}, 'timestamp': [1509629400, 1509715800, 1509978600, 1510065000, 1510151400, 1510237800, 1510324200, 1510583400, 1510669800, 1510756200, 1510842600, 1510929000, 1511188200, 1511274600, 1511361000, 1511533800, 1511793000, 1511879400, 1511965800, 1512052200, 1512138600, 151239

In [8]:
# print(data.keys())  # Should be: dict_keys(['chart'])
# print(data['chart'].keys())  # Should be: dict_keys(['result', 'error'])

pprint(data['chart']['result'][0]['meta'])

{'chartPreviousClose': 3.25,
 'currency': None,
 'currentTradingPeriod': {'post': {'end': 1626134400,
                                   'gmtoffset': -14400,
                                   'start': 1626120000,
                                   'timezone': 'EDT'},
                          'pre': {'end': 1626096600,
                                  'gmtoffset': -14400,
                                  'start': 1626076800,
                                  'timezone': 'EDT'},
                          'regular': {'end': 1626120000,
                                      'gmtoffset': -14400,
                                      'start': 1626096600,
                                      'timezone': 'EDT'}},
 'dataGranularity': '1d',
 'exchangeName': 'YHD',
 'exchangeTimezoneName': 'America/New_York',
 'firstTradeDate': 1509629400,
 'gmtoffset': -14400,
 'instrumentType': 'MUTUALFUND',
 'priceHint': 2,
 'range': '',
 'regularMarketTime': 1561759658,
 'symbol': 'AAVL',
 'timezone': 'E

## Step 3: Complete the Tasks Below

## Task 1
Once you are comfortable with your prototype code, put the code into a function.

Use the function to fill out the columns in the DataFrame for all companies listed on the NASDAQ in 2020.

In [None]:
import os
import json
import pandas as pd
from datetime import datetime

def load_json_file(filepath):
    with open(filepath, 'r') as file:
        return json.load(file)

folder_path = 'Data/stock_market_data/nasdaq/json/'
all_json_data = {}

# Load all JSON files into dictionary
for filename in os.listdir(folder_path):
    if filename.endswith('.json'):
        stock_symbol = filename.replace('.json', '')
        file_path = os.path.join(folder_path, filename)
        all_json_data[stock_symbol] = load_json_file(file_path)

# Initialize list to collect rows from all stocks
rows = []

# Parse each stock’s data
for symbol, data in all_json_data.items():
    try:
        result = data['chart']['result'][0]
        quote = result['indicators']['quote'][0]
        timestamps = result['timestamp']
        splits = result.get('events', {}).get('splits', {})

        for i, ts in enumerate(timestamps):
            day = datetime.fromtimestamp(ts)
            open_price = quote['open'][i]
            high_price = quote['high'][i]
            low_price = quote['low'][i]
            close_price = quote['close'][i]
            volume = quote['volume'][i]

            # Check for splits
            split_count = 0
            ts_str = str(ts)
            if ts_str in splits:
                split_count = splits[ts_str]['splitRatio']

            # Skip rows with missing values
            if None in [open_price, high_price, low_price, close_price, volume]:
                continue

            rows.append({
                'symbol': symbol,
                'day': day,
                'open': open_price,
                'high': high_price,
                'low': low_price,
                'close': close_price,
                'volume': volume,
                'splits': split_count
            })

    except Exception as e:
        print(f"Skipping {symbol} due to error: {e}")
        continue

# Convert to DataFrame
df = pd.DataFrame(rows)




Skipping SUBK due to error: 'timestamp'
Skipping YOSN due to error: 'timestamp'
Skipping MSLI due to error: 'timestamp'
Skipping BOFI due to error: 'timestamp'
Skipping NDRM due to error: 'timestamp'
Skipping HOTR due to error: 'timestamp'
Skipping IKAN due to error: 'timestamp'
Skipping IXYS due to error: 'timestamp'
Skipping CPXX due to error: 'timestamp'
Skipping LIQD due to error: 'timestamp'
Skipping PLPM due to error: 'timestamp'
Skipping LINE due to error: 'timestamp'
Skipping ONFC due to error: 'timestamp'
Skipping UTIW due to error: 'timestamp'
Skipping FWM due to error: 'timestamp'
Skipping ARUN due to error: 'timestamp'
Skipping ZINC due to error: 'timestamp'
Skipping ISIL due to error: 'timestamp'
Skipping MDM due to error: 'timestamp'
Skipping PULB due to error: 'timestamp'
Skipping EPRS due to error: 'timestamp'
Skipping GLDC due to error: 'timestamp'
Skipping DAEG due to error: 'timestamp'
Skipping NBBC due to error: 'timestamp'
Skipping DEPO due to error: 'timestamp'
Sk
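
Each "Skipping ..." line above is a KeyError: the lookup `result['timestamp']` fails because that file's result has no `timestamp` key, presumably because it holds no price history. The sketch below shows one way to put the parsing into a function that checks for the key explicitly instead of relying on a broad `except`. The name `parse_stock`, the sample dictionaries, and the choice to read timestamps as UTC are assumptions made for illustration, and the splits column is left out to keep the example short.

```python
from datetime import datetime, timezone

PRICE_KEYS = ("open", "high", "low", "close", "volume")


def parse_stock(symbol, data):
    """Return one row (dict) per trading day, or [] when the file has no price history."""
    results = (data.get("chart") or {}).get("result") or []
    if not results or "timestamp" not in results[0]:
        return []  # nothing to parse, so no exception is needed
    result = results[0]
    quote = result["indicators"]["quote"][0]
    rows = []
    for i, ts in enumerate(result["timestamp"]):
        values = {key: quote[key][i] for key in PRICE_KEYS}
        if any(value is None for value in values.values()):
            continue  # skip days with missing values
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        rows.append({"symbol": symbol, "day": day, **values})
    return rows


# Invented inputs: one file with history, one without a 'timestamp' key.
with_history = {"chart": {"result": [{
    "timestamp": [1577975400, 1578061800],
    "indicators": {"quote": [{
        "open": [10.0, None], "high": [11.0, None], "low": [9.8, None],
        "close": [10.6, None], "volume": [1000, None],
    }]},
}]}}
without_history = {"chart": {"result": [{
    "meta": {"symbol": "EMPTY"},
    "indicators": {"quote": [{}]},
}]}}

rows = parse_stock("DEMO", with_history) + parse_stock("EMPTY", without_history)
print(rows)
```

Returning an empty list keeps the calling loop simple: rows from every file can be concatenated and passed to `pd.DataFrame` without special cases.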



## Task 2
Now, it’s time to do some EDA. Answer the following questions.

- How big is the DataFrame (shape)?
- How many stocks do we have?
- Which stock has the highest price, and when was it observed?
- Which stock has the lowest price, and when was it observed?
- Which stock is the most popular in 2021? (has the highest traded volume in 2021)
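
As a rough guide, each question maps onto a short pandas expression. The sketch below uses a tiny invented DataFrame (`df_demo`, with made-up symbols and numbers) so the logic can be checked by eye before running it on the full data.

```python
import pandas as pd

# Invented data: two stocks, two days each.
df_demo = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "day": pd.to_datetime(["2020-12-31", "2021-01-04", "2020-12-31", "2021-01-04"]),
    "high": [12.0, 15.0, 3.0, 2.5],
    "low": [11.0, 13.5, 2.0, 1.5],
    "volume": [100, 250, 900, 400],
})

# Size of the DataFrame and number of distinct stocks
n_rows, n_cols = df_demo.shape
n_stocks = df_demo["symbol"].nunique()

# idxmax / idxmin return the row label of the extreme value,
# so .loc can pull the matching symbol and day.
highest = df_demo.loc[df_demo["high"].idxmax(), ["symbol", "day", "high"]]
lowest = df_demo.loc[df_demo["low"].idxmin(), ["symbol", "day", "low"]]

# Total traded volume per stock in 2021, then the symbol with the largest total
in_2021 = df_demo[df_demo["day"].dt.year == 2021]
volume_2021 = in_2021.groupby("symbol")["volume"].sum()
most_traded = volume_2021.idxmax()

print(n_rows, n_cols, n_stocks, highest["symbol"], lowest["symbol"], most_traded)
```

Selecting the `volume` column before calling `sum()` keeps the aggregation limited to the one column the question asks about.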

In [126]:
df.shape

(8598285, 8)

In [132]:
df['day'] = pd.to_datetime(df['day'])

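# Total every numeric column per symbol for 2021, then keep the symbol with the largest volume
df[df['day'].dt.year == 2021].groupby('symbol').sum(numeric_only=True).sort_values(by='volume', ascending=False).head(1)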



| symbol | open | high | low | close | volume |
|---|---|---|---|---|---|
| PLPL | 0.1231 | 0.1392 | 0.10165 | 0.1222 | 18920675573 |


In [128]:
df[['symbol','day','high']].sort_values(by='high', ascending=False).head(1)


|  | symbol | day | high |
|---|---|---|---|
| 775130 | TOPS | 2004-11-29 06:30:00 | 4562460000000.0 |


In [129]:
df[['symbol','day','low']].sort_values(by='low', ascending=True).head(1)

|  | symbol | day | low |
|---|---|---|---|
| 3060294 | MAYS | 1992-01-20 06:30:00 | 0.0 |


## Task 3
What else could you answer by doing EDA for this dataset?
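
Two possible directions, offered only as a sketch: checking for prices that look implausible (the lowest price found above is 0.0, which may be worth a closer look), and comparing how much each stock moves within a day. The DataFrame `df_demo` and its values are invented for illustration.

```python
import pandas as pd

# Invented data: two stocks, two days each, with one zero price.
df_demo = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "day": pd.to_datetime(["2020-12-31", "2021-01-04", "2020-12-31", "2021-01-04"]),
    "high": [12.0, 15.0, 3.0, 2.5],
    "low": [11.0, 13.5, 0.0, 1.5],
})

# Rows whose lowest price is zero or negative may point to data-quality issues.
suspicious = df_demo[df_demo["low"] <= 0]

# Average daily trading range (high minus low) per stock
df_demo["range"] = df_demo["high"] - df_demo["low"]
avg_range = df_demo.groupby("symbol")["range"].mean()

print(suspicious)
print(avg_range)
```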