## Preprocess MAANG companies stock market data

- Import libraries  
- Load raw data (CSV, API, etc.)  
- Handle missing values / duplicates  
- Feature engineering  
- Save processed data to file (e.g., `data/processed.csv`)

Dataset by SOUMENDRA PRASAD MOHANTY on Kaggle: https://www.kaggle.com/datasets/soumendraprasad/stock

In [4]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import os

In [5]:
raw_AAPL = pd.read_csv('data/raw/Apple.csv')
raw_AAPL.head(3)

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,Date
0,0.936384,1.004464,0.907924,0.999442,0.850643,535796800,2000-01-03
1,0.966518,0.987723,0.90346,0.915179,0.778926,512377600,2000-01-04
2,0.926339,0.987165,0.919643,0.928571,0.790324,778321600,2000-01-05


In [6]:
# load all CSV files
data_path = "data/raw"
company_names = ["Microsoft", "Apple", "Amazon", "Netflix", "Google"]

dfs = {}

for name in company_names:
    file_path = os.path.join(data_path, f"{name}.csv")
    df = pd.read_csv(file_path)
    df["Company"] = name  # add identifier column
    dfs[name] = df
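
The `Unnamed: 0` column in the previews is the name pandas gives to a CSV column with no header, which usually means a row index was saved with the file. The sketch below shows one possible way to avoid carrying it along, by passing `index_col=0` to `pd.read_csv`. It uses an in-memory stand-in for a raw file, with rows copied from the Apple preview. The empty first header field is an assumption about how the raw files are laid out, and `raw_text`, `default_df`, and `clean_df` are illustrative names.

```python
import io

import pandas as pd

# In-memory stand-in for a raw file; rows copied from the Apple preview.
# Assumption: the raw CSVs keep an old row index in an unnamed first column.
raw_text = (
    ",Open,High,Low,Close,Adj Close,Volume,Date\n"
    "0,0.936384,1.004464,0.907924,0.999442,0.850643,535796800,2000-01-03\n"
    "1,0.966518,0.987723,0.90346,0.915179,0.778926,512377600,2000-01-04\n"
)

# Default read: pandas labels the unnamed column "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(raw_text))

# index_col=0 uses that column as the index instead of a data column;
# parse_dates converts Date to datetime while loading.
clean_df = pd.read_csv(io.StringIO(raw_text), index_col=0, parse_dates=["Date"])

print(list(default_df.columns))
print(list(clean_df.columns))
```

If this were applied inside the loading loop, the later `pd.to_datetime` call would become redundant but harmless.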

In [7]:
dfs["Microsoft"].head(3)

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,Date,Company
0,58.6875,59.3125,56.0,58.28125,36.361576,53228400,2000-01-03,Microsoft
1,56.78125,58.5625,56.125,56.3125,35.133263,54119000,2000-01-04,Microsoft
2,55.5625,58.1875,54.6875,56.90625,35.503712,64059600,2000-01-05,Microsoft


In [8]:
combined_df = pd.concat(dfs.values(), ignore_index=True)
combined_df.sample(3)

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,Date,Company
17250,139.837494,151.748993,139.5,151.358002,151.358002,100786000,2022-02-24,Amazon
8054,3.446786,3.544643,3.282143,3.508214,2.985905,1675430400,2008-10-23,Apple
8728,11.916071,12.025,11.908571,11.973571,10.190922,294299600,2011-06-28,Apple


In [9]:
# check whether any cell in the combined frame is missing (False means none)
combined_df.isna().any().any()

np.False_
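
The outline also lists duplicates, which the check above does not cover. A minimal sketch of how that could be handled follows, assuming a duplicate means two rows with the same `Company` and `Date`. The toy frame and its values are made up for illustration and include one deliberate duplicate.

```python
import pandas as pd

# Toy frame with made-up values; the third row repeats (Apple, 2000-01-04).
toy_df = pd.DataFrame({
    "Date": ["2000-01-03", "2000-01-04", "2000-01-04", "2000-01-03"],
    "Company": ["Apple", "Apple", "Apple", "Microsoft"],
    "Close": [1.0, 2.0, 2.0, 3.0],
})

# Per-column missing counts are more informative than a single boolean.
missing_per_column = toy_df.isna().sum()

# Flag every repeat of a (Company, Date) pair after its first occurrence.
dup_mask = toy_df.duplicated(subset=["Company", "Date"], keep="first")
n_duplicates = int(dup_mask.sum())

# Keep only the first row for each (Company, Date) pair.
deduped_df = toy_df.drop_duplicates(subset=["Company", "Date"], keep="first")

print(missing_per_column)
print(n_duplicates, len(deduped_df))
```

On the real data the same calls would run on `combined_df`. Whether the first or the last row should be kept is a judgment call that depends on how the duplicates arose.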

In [10]:
# convert Date to datetime and sort rows chronologically
combined_df["Date"] = pd.to_datetime(combined_df["Date"])
combined_df = combined_df.sort_values(["Date"])
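
The outline mentions feature engineering, but no features are derived in this notebook. As a hedged illustration of what that step might look like, the sketch below computes a daily return and a short moving average per company. The column names `Daily Return` and `MA_2`, the 2-day window, and the toy prices are all assumptions chosen for the example, not choices taken from the notebook.

```python
import pandas as pd

# Toy prices (made up) for two companies over three days, interleaved by date.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2000-01-03", "2000-01-03",
        "2000-01-04", "2000-01-04",
        "2000-01-05", "2000-01-05",
    ]),
    "Company": ["Apple", "Microsoft"] * 3,
    "Adj Close": [10.0, 100.0, 11.0, 90.0, 12.1, 99.0],
})

# Sort within each company so that row-to-row changes follow time order.
toy_df = toy_df.sort_values(["Company", "Date"]).reset_index(drop=True)

# Group by company so one company's prices never leak into another's features.
grouped = toy_df.groupby("Company")["Adj Close"]

# Fractional change from the previous trading day (NaN on each first day).
toy_df["Daily Return"] = grouped.pct_change()

# 2-day moving average (hypothetical window; longer windows are more typical).
toy_df["MA_2"] = grouped.transform(lambda s: s.rolling(window=2).mean())

print(toy_df)
```

Grouping by `Company` matters here because the combined frame stacks all companies together. Without it, the first row of one company would be compared against the last row of another.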

### Export processed data

In [11]:
os.makedirs("data/processed", exist_ok=True)  # create the output folder if it is missing
combined_df.to_csv("data/processed/maang_combined.csv", index=False)
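
CSV files do not store column dtypes, so the datetime conversion done earlier is lost on export, and `Date` has to be parsed again when the processed file is loaded. The sketch below illustrates that round trip with a toy frame written to a temporary directory. The temporary path stands in for `data/processed`, the two rows reuse close prices from the Apple preview, and `plain` and `reloaded` are illustrative names.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for combined_df.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2000-01-03", "2000-01-04"]),
    "Company": ["Apple", "Apple"],
    "Close": [0.999442, 0.915179],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the data/processed layout inside a throwaway directory.
    out_dir = os.path.join(tmp_dir, "data", "processed")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "maang_combined.csv")

    toy_df.to_csv(out_path, index=False)

    # Without parse_dates, Date is read back as text.
    plain = pd.read_csv(out_path)
    # With parse_dates, Date is a datetime column again.
    reloaded = pd.read_csv(out_path, parse_dates=["Date"])

print(plain["Date"].dtype)
print(reloaded["Date"].dtype)
```

A downstream notebook could read the processed file with the same `parse_dates=["Date"]` argument.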