# Project 1

## **STEP 1: FIND DATA**

This project uses **Netflix stock price data** from **Kaggle** and aims to **explore the volatility** of the stock price over time.

Data source: https://www.kaggle.com/datasets/prathamjyotsingh/netflix-vs-disney/data

In [3]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"
import pandas as pd

stock_data = pd.read_csv('nflk_stock_price.csv')
pd.set_option("display.max_rows", None)
stock_data.sample(5)

Unnamed: 0,Date,Open Price,High Price,Low Price,Close Price,Volume
2303,2011-07-15,287.96,289.75,281.62,286.93,4065600.0
5529,2024-05-10,619.0,623.98,605.06,610.87,2653586.0
33,2002-07-11,16.16,17.9,15.86,17.77,180700.0
1241,2007-04-30,22.2,22.27,21.97,22.17,1321100.0
3182,2015-01-13,322.15,329.3376,321.3,323.79,2675608.0


The program has **read the data**

Besides **Date**, this DataFrame object has the following columns:
1. **Open Price**: the price of the stock at the start of the trading day
2. **High Price**: the highest price that the stock reached during the trading day
3. **Low Price**: the lowest price of the stock during the trading day
4. **Close Price**: the price of the stock at the end of the trading day
5. **Volume**: the trading volume

In [5]:
stock_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5640 entries, 0 to 5639
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         5640 non-null   object 
 1   Open Price   5640 non-null   float64
 2   High Price   5640 non-null   float64
 3   Low Price    5640 non-null   float64
 4   Close Price  5640 non-null   float64
 5   Volume       5640 non-null   float64
dtypes: float64(5), object(1)
memory usage: 264.5+ KB


**Conditions of the data**:
1. At least one numeric column
2. Between one thousand and one million rows
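
The two conditions can also be checked in code. The following is a minimal sketch, assuming a small stand-in DataFrame in place of `stock_data`; the helper name `meets_conditions` and its parameters are hypothetical and not part of the original notebook.

```python
import pandas as pd


def meets_conditions(df, min_rows=1_000, max_rows=1_000_000):
    """Return True if df has at least one numeric column and a row count in range."""
    numeric_cols = df.select_dtypes(include="number").columns
    return len(numeric_cols) >= 1 and min_rows <= len(df) <= max_rows


# Stand-in data for illustration only; in the notebook, pass stock_data instead.
demo = pd.DataFrame({"Date": ["2020-01-02"] * 1500, "Open Price": [326.10] * 1500})
print(meets_conditions(demo))
```

In the notebook itself, `meets_conditions(stock_data)` would cover the same checks that `stock_data.info()` shows by eye.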

## **STEP 2: ANALYZE THE DATA WITH AND WITHOUT PANDAS**

### 2.1.1 Choosing the data
More than one column holds numeric values. This program uses only the open price data from **2020, 2021, 2022, and 2023**.

### 2.1.2 Converting the date into an appropriate type
This step converts the Date column into a datetime type, so that the data can be sliced by date.

In [8]:
stock_data['Date'] = pd.to_datetime(stock_data['Date'])

In [9]:
stock_data.dtypes

Date           datetime64[ns]
Open Price            float64
High Price            float64
Low Price             float64
Close Price           float64
Volume                float64
dtype: object

### 2.1.3 Slicing the data

In [11]:
stock_data_2020 = stock_data[stock_data['Date'].dt.year == 2020][['Date', 'Open Price']]
stock_data_2021 = stock_data[stock_data['Date'].dt.year == 2021][['Date', 'Open Price']]
stock_data_2022 = stock_data[stock_data['Date'].dt.year == 2022][['Date', 'Open Price']]
stock_data_2023 = stock_data[stock_data['Date'].dt.year == 2023][['Date', 'Open Price']]

stock_data_4y = pd.concat([stock_data_2020, stock_data_2021, stock_data_2022, stock_data_2023], axis=0).sort_values(by='Date').reset_index(drop=True)
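
The same four-year slice can also be taken with a single boolean mask instead of four filters and a `concat`. This is a sketch on a small stand-in frame built from sample rows shown elsewhere in this notebook; the names `demo` and `demo_4y` are illustrative only.

```python
import pandas as pd

# A few rows from the samples in this notebook, standing in for stock_data.
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-08-29", "2011-07-15", "2023-02-21", "2024-05-10", "2020-10-21"]
    ),
    "Open Price": [221.93, 287.96, 342.85, 619.0, 501.03],
})

# between() is inclusive on both ends, so this keeps 2020 through 2023.
mask = demo["Date"].dt.year.between(2020, 2023)
demo_4y = (
    demo.loc[mask, ["Date", "Open Price"]]
    .sort_values(by="Date")
    .reset_index(drop=True)
)
print(demo_4y)
```

Replacing `demo` with `stock_data` should give the same rows as the per-year approach, since both select on `dt.year` and sort by date.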

In [12]:
stock_data_4y.sample(5)

Unnamed: 0,Date,Open Price
669,2022-08-29,221.93
203,2020-10-21,501.03
123,2020-06-29,445.23
789,2023-02-21,342.85
177,2020-09-15,484.0


In [13]:
stock_data_4y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006 entries, 0 to 1005
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        1006 non-null   datetime64[ns]
 1   Open Price  1006 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 15.8 KB


**The chosen slice of the data** has **over 1000 non-null samples** and its Open Price column is fully **numeric**

## **STEP 3: Calculating the mean, mode, and median of each dataset**

### 3.1 with **standard library**, rather than pandas

In [16]:
# 3.1.1 define the function
def cal_mean(data_list):
    total = 0  # named total to avoid shadowing the built-in sum()
    for i in data_list:
        total = total + float(i)
    return total / len(data_list)

def cal_median(data_list):
    sort_list = sorted(data_list)
    mid = len(sort_list) // 2
    if len(sort_list) % 2 == 0:
        # even length: average the two middle values
        return (sort_list[mid - 1] + sort_list[mid]) / 2
    # odd length: the single middle value
    return sort_list[mid]

def cal_mode(data_list):
    frequency = {}
    for value in data_list:
        if value in frequency:
            frequency[value] += 1
        else:
            frequency[value] = 1
    max_count = max(frequency.values())  # Highest frequency
    modes = [key for key, count in frequency.items() if count == max_count]

    return modes

In [17]:
# 3.1.2 extract the list to run the calculation on
open_prices = stock_data_4y['Open Price'].tolist()

In [18]:
# 3.1.3 print the results
print(f'''
the mean of the data is {cal_mean(open_prices):.4f},
the median of the dataset is {cal_median(open_prices)}
the modes of the dataset is {cal_mode(open_prices)}
''')


the mean of the data is 420.1211,
the median of the dataset is 428.87
the modes of the dataset is [425.0, 492.0]
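
Python's standard library also ships a `statistics` module, which offers a quick cross-check for hand-written functions like the ones above. This is a minimal sketch on a short, made-up list (not the Netflix data); the odd length exercises the single-middle-element case of the median.

```python
import statistics

sample = [425.0, 492.0, 425.0, 310.5, 492.0]  # illustrative values only

print(statistics.mean(sample))       # arithmetic mean
print(statistics.median(sample))     # middle value of the sorted list
print(statistics.multimode(sample))  # every value tied for the highest count
```

Comparing these against `cal_mean`, `cal_median`, and `cal_mode` on the same list is a cheap way to catch indexing mistakes.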



### 3.2 with **pandas package**

In [20]:
# 3.2.1 calculating using pandas
print(f'''
the mean of the data is {stock_data_4y['Open Price'].mean():.4f},
the median of the dataset is {stock_data_4y['Open Price'].median()}
the modes of the dataset is {stock_data_4y['Open Price'].mode().tolist()}
''')


the mean of the data is 420.1211,
the median of the dataset is 428.87
the modes of the dataset is [425.0, 492.0]



In [21]:
print(f'''
the standard deviation of the price change is {stock_data_4y['Open Price'].std()},
and the standard deviation is {(stock_data_4y['Open Price'].std()/stock_data_4y['Open Price'].median() * 100):.2f} % of the median
''')


the standard deviation of the price change is 121.1685617798361,
and the standard deviation is 28.25 % of the median



### **3.3 Result**
1. The two calculations of **mean, median, and modes** give the **same** results
2. The stock's open price **centers on roughly $420 to $430 per share** (mean 420.12, median 428.87)
3. The standard deviation of the open price is about 28% of the median value, a significant portion. This suggests that the stock may **have significant fluctuation.**
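
Because the standard deviation above is computed on price levels, it mixes the long-run trend with day-to-day movement. One common alternative, shown here only as a sketch and not as this notebook's method, is the standard deviation of daily percentage changes. The helper name `daily_returns` is hypothetical; the input is the first five open prices printed in Step 4.

```python
import statistics


def daily_returns(prices):
    """Percentage change between each consecutive pair of prices."""
    return [(curr - prev) / prev * 100 for prev, curr in zip(prices, prices[1:])]


opens = [326.10, 326.78, 323.12, 336.47, 331.49]
returns = daily_returns(opens)
print([round(r, 2) for r in returns])
print(f"std of daily returns: {statistics.stdev(returns):.2f} %")
```

With pandas, `stock_data_4y['Open Price'].pct_change().std()` computes the same kind of figure as a fraction rather than a percentage.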

## **STEP 4: DATA VISUALIZATION**

**Procedures**:
1. Define a function to **plot open prices with * for every $6 increment**
2. Extract 'Date' and 'Open Price' **as lists for processing**
3. **Utilize the function** to plot

In [24]:
# Define a function to plot open prices with * for every $6 increment
def plot_prices(dates, prices):
    for date, price in zip(dates, prices):
        
        # Determine the number of * based on price divided by 6
        stars = int(price / 6) * '*'
        
        # Print the date, price, and stars
        print(f"{date} : ${price:<8.2f} {stars}")

# Extract 'Date' and 'Open Price' as lists
dates = stock_data_4y['Date'].dt.strftime('%Y-%m-%d').tolist()
prices = stock_data_4y['Open Price'].tolist()

# Use the function with the dates and open prices
print('Graph: the variation of Netflix stock open price')
plot_prices(dates, prices)

Graph: the variation of Netflix stock open price
2020-01-02 : $326.10   ******************************************************
2020-01-03 : $326.78   ******************************************************
2020-01-06 : $323.12   *****************************************************
2020-01-07 : $336.47   ********************************************************
2020-01-08 : $331.49   *******************************************************
2020-01-09 : $342.00   *********************************************************
2020-01-10 : $337.13   ********************************************************
2020-01-13 : $331.80   *******************************************************
2020-01-14 : $344.40   *********************************************************
2020-01-15 : $338.68   ********************************************************
2020-01-16 : $343.50   *********************************************************
2020-01-17 : $341.00   ******************************************************

Looking through the plot, **the bar lengths do reflect the pattern implied by the calculations above**
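
With 1006 rows, the printed star chart is long. A variation, sketched here on the assumption that a coarser view is acceptable, returns the lines as strings and keeps only every `step`-th row. The name `star_lines` and the parameters `dollars_per_star` and `step` are hypothetical additions.

```python
def star_lines(dates, prices, dollars_per_star=6, step=1):
    """Build one text bar per kept row; each star stands for dollars_per_star dollars."""
    lines = []
    for date, price in list(zip(dates, prices))[::step]:
        stars = "*" * int(price // dollars_per_star)
        lines.append(f"{date} : ${price:<8.2f} {stars}")
    return lines


dates = ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
prices = [326.10, 326.78, 323.12, 336.47]
for line in star_lines(dates, prices, step=2):
    print(line)
```

Returning a list rather than printing directly makes the output easy to test or write to a file; a larger `step` (for example 21, roughly one trading month) would shrink four years to a few dozen lines.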