# Data Preprocessing

This notebook covers the **data preprocessing** step required by the Software Requirements Specification (SRS).  
The raw stock data for Apple (AAPL), Amazon (AMZN), and Tesla (TSLA) is provided as CSV files in the `Data/` folder.  

The main goals of this notebook are:
- Load the raw CSV files.
- Explore the data structure (columns, data types, missing values).
- Clean the data by handling missing values, formatting inconsistencies, and duplicates.
- Merge all datasets into a single DataFrame.
- Save the cleaned dataset to `Data/processed/cleaned_stocks.csv` for later steps (feature engineering and modeling).


In [1]:
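import pandas as pd

# Load the raw CSVs. The Ticker column is added on the assumption that the raw
# files do not already identify the stock; without it, rows from the three
# files cannot be told apart after concatenation.
df1 = pd.read_csv("../Data/AAPL.csv").assign(Ticker="AAPL")
df2 = pd.read_csv("../Data/AMZN.csv").assign(Ticker="AMZN")
df3 = pd.read_csv("../Data/TSLA.csv").assign(Ticker="TSLA")

# Stack the three frames into one DataFrame with a fresh 0..n-1 index.
df = pd.concat([df1, df2, df3], ignore_index=True)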
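
The exploration goal listed above (columns, data types, missing values, duplicates) could be covered with a few pandas calls. This is a minimal sketch on a small hypothetical frame named `sample`; its values are made up for illustration and are not taken from the raw files. In the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample standing in for the combined DataFrame `df`.
sample = pd.DataFrame(
    {
        "Date": ["2024-01-02", "2024-01-03", "2024-01-03", None],
        "Close": [10.0, 11.0, 11.0, 12.0],
        "Volume": [100, 200, 200, None],
    }
)

# Column names and the dtype pandas inferred for each column.
print(sample.dtypes)

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())
print(duplicate_rows)
```

Checking these counts before cleaning makes it easier to see later how many rows each cleaning step removed.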


In [2]:
# Strip stray leading/trailing whitespace from the column names.
df.columns = df.columns.str.strip()
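
One caveat worth checking: if only some of the raw files have padded headers, concatenating before stripping leaves the padded and unpadded names as separate columns. The sketch below illustrates this with two hypothetical frames (`a` and `b`, made-up values); whether the real files differ in this way is not known from this notebook.

```python
import pandas as pd

# Hypothetical frames: the second one has a stray leading space in a header.
a = pd.DataFrame({"Date": ["2024-01-02"], "Close": [10.0]})
b = pd.DataFrame({"Date": ["2024-01-02"], " Close": [20.0]})

# Concatenating first keeps "Close" and " Close" as separate columns,
# each partly filled with NaN.
misaligned = pd.concat([a, b], ignore_index=True)

# Stripping each frame's headers before concatenating aligns the columns.
frames = [f.rename(columns=str.strip) for f in (a, b)]
aligned = pd.concat(frames, ignore_index=True)

print(list(misaligned.columns))
print(list(aligned.columns))
```

If the exploration step shows more columns than expected after concatenation, stripping headers per file before `pd.concat` is one way to resolve it.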


In [3]:
# Parse dates; values that cannot be parsed become NaT instead of raising.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

In [4]:
# Convert price and volume columns to numbers; non-numeric entries become NaN.
numeric_cols = ["Open", "High", "Low", "Close", "Volume"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
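
Because `errors="coerce"` silently turns bad values into `NaT`/`NaN`, it can help to count how many values failed to parse before they are dropped. A minimal sketch, using a hypothetical frame `raw` with made-up values:

```python
import pandas as pd

# Hypothetical raw values, including one bad date and one bad price.
raw = pd.DataFrame(
    {
        "Date": ["2024-01-02", "not a date", "2024-01-04"],
        "Close": ["10.5", "11.0", "n/a"],
    }
)

parsed_dates = pd.to_datetime(raw["Date"], errors="coerce")
parsed_close = pd.to_numeric(raw["Close"], errors="coerce")

# Values that were present in the raw data but failed to parse.
bad_dates = int((parsed_dates.isna() & raw["Date"].notna()).sum())
bad_close = int((parsed_close.isna() & raw["Close"].notna()).sum())
print(bad_dates, bad_close)
```

A large count here would suggest a formatting problem in the source file rather than a handful of genuinely missing values.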

In [5]:
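# Drop rows with missing values (including coerced NaT/NaN), then exact duplicate rows.
df = df.dropna().drop_duplicates()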
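
Duplicates can also be defined by a key rather than by the whole row, for example one row per stock per date. The sketch below assumes a `Ticker` column identifying the stock, which is an assumption about the combined data; the frame `rows` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical combined data: one repeated (Ticker, Date) pair and one missing price.
rows = pd.DataFrame(
    {
        "Ticker": ["AAPL", "AAPL", "AAPL", "TSLA"],
        "Date": pd.to_datetime(
            ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-02"]
        ),
        "Close": [10.0, 10.0, 11.0, None],
    }
)

before = len(rows)
cleaned = (
    rows.dropna()
    # Keep the first row for each (Ticker, Date) pair.
    .drop_duplicates(subset=["Ticker", "Date"], keep="first")
    .sort_values(["Ticker", "Date"])
    .reset_index(drop=True)
)
print(f"kept {len(cleaned)} of {before} rows")
```

Reporting the row count before and after makes the effect of the cleaning step visible in the notebook output.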


In [7]:
df.to_csv("../Data/processed/cleaned_stocks.csv", index=False)
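
Note that `to_csv` does not create missing parent directories, so the save fails if `Data/processed/` does not exist yet. A minimal sketch of a safer save and a round-trip check follows; it writes to a temporary directory standing in for the project folder, and the frame `cleaned` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data.
cleaned = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
        "Close": [10.0, 11.0],
    }
)

# A temporary directory stands in for the project's Data/ folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "cleaned_stocks.csv"

    # Create the output folder first; to_csv will not do it.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cleaned.to_csv(out_path, index=False)

    # CSV stores dates as text, so re-parse them when loading.
    reloaded = pd.read_csv(out_path, parse_dates=["Date"])

print(reloaded.dtypes)
```

Later notebooks that read the cleaned file would likewise need `parse_dates=["Date"]` (or an explicit `pd.to_datetime`) to get the dates back as datetimes.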


## Summary

In this notebook, we completed the preprocessing phase of the project:  
- Loaded raw CSV files for AAPL, AMZN, and TSLA.  
- Performed initial exploration (checked column names, datatypes, and missing values).  
- Cleaned the data by fixing column names, handling missing/null values, and removing duplicates.  
- Combined all three datasets into one standardized DataFrame.  
- Saved the cleaned dataset into `Data/processed/cleaned_stocks.csv`.  

The data is now ready for the next step: **Feature Engineering**.
