# Data Preprocessing

In this notebook, we will load the Tunisian stock market data, clean it, and prepare it for further analysis and modeling.


### 1.1 Loading the Data
**First, we will load the data from the CSV file and take a look at its structure.**

In [1]:
import pandas as pd

# Load the data
data = pd.read_csv('./data/cleaned_weekly_stock_market.csv')

# Display the first few rows of the dataframe
data.head()

Unnamed: 0,companyName,date,openingPrice,highestPrice,lowestPrice,closingPrice,volume
0,SOTUMAG,16/06/2014,1.76,1.8,1.73,1.73,44315
1,SOTUMAG,23/06/2014,1.78,1.82,1.77,1.82,9551
2,SOTUMAG,30/06/2014,1.81,1.82,1.81,1.81,2401
3,SOTUMAG,07/07/2014,1.78,1.82,1.78,1.8,10341
4,SOTUMAG,14/07/2014,1.83,1.83,1.76,1.81,27480
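The `Unnamed: 0` column in the output above is typically a row index that was written to the CSV when the file was saved. The missing-value check in the next section no longer lists it, so it was presumably removed in a cell not shown here. A minimal sketch of two common ways to do that, using a hypothetical in-memory sample (`csv_text`) in place of the real file:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file, built from two rows shown above.
csv_text = """,companyName,date,openingPrice,highestPrice,lowestPrice,closingPrice,volume
0,SOTUMAG,16/06/2014,1.76,1.8,1.73,1.73,44315
1,SOTUMAG,23/06/2014,1.78,1.82,1.77,1.82,9551
"""

# Option A: treat the first column as the row index at load time.
sample_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option B: load the file as-is, then drop the stray column.
sample_raw = pd.read_csv(io.StringIO(csv_text))
sample_dropped = sample_raw.drop(columns=["Unnamed: 0"])

print(list(sample_indexed.columns))
print(list(sample_dropped.columns))
```

Either option leaves the same seven data columns; which one the original notebook used is not shown.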


### 1.2 Handling Missing Values
**We'll check for missing values and handle them appropriately.**

In [3]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

companyName     0
date            0
openingPrice    0
highestPrice    0
lowestPrice     0
closingPrice    0
volume          0
dtype: int64


*No Missing Values*

Every column (`companyName`, `date`, `openingPrice`, `highestPrice`, `lowestPrice`, `closingPrice`, `volume`) has 0 missing values, so no imputation or row removal is needed.
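This dataset needs no repair, but if gaps had appeared, one possible approach is sketched below. The toy frame and the fill rules (forward-fill prices per company, zero for missing volume) are assumptions for illustration, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not taken from the real file.
toy = pd.DataFrame({
    "companyName": ["A", "A", "A", "B", "B"],
    "closingPrice": [1.0, np.nan, 1.2, 5.0, np.nan],
    "volume": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Forward-fill prices within each company so values never leak across companies.
toy["closingPrice"] = toy.groupby("companyName")["closingPrice"].ffill()

# Treat a missing volume as zero traded shares (an assumption, not a rule).
toy["volume"] = toy["volume"].fillna(0)

# Confirm that no missing values remain.
print(toy.isnull().sum())
```

Grouping by `companyName` before filling matters in a multi-company file like this one: a plain `ffill()` could carry one company's last price into the first row of the next.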

### 1.3 Data Type Conversion
**Next, we convert the `date` column to a datetime type and cast the price and volume columns to float.**

In [4]:
# Convert 'date' column to datetime format
data['date'] = pd.to_datetime(data['date'], format='%d/%m/%Y')

# Cast the price and volume columns to float (volume becomes float64 as well)
numeric_columns = ['openingPrice', 'highestPrice', 'lowestPrice', 'closingPrice', 'volume']
data[numeric_columns] = data[numeric_columns].astype(float)

# Display the data types
data.dtypes

companyName             object
date            datetime64[ns]
openingPrice           float64
highestPrice           float64
lowestPrice            float64
closingPrice           float64
volume                 float64
dtype: object
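The explicit `format='%d/%m/%Y'` matters because the dates are day-first: a value such as `04/08/2014` could otherwise be read as April 8 instead of August 4. The sketch below illustrates this on hypothetical sample strings (`raw_dates`) and adds `errors="coerce"`, an optional setting that is not used in the cell above, to flag malformed entries instead of raising.

```python
import pandas as pd

# Hypothetical sample values: two valid day-first dates and one malformed entry.
raw_dates = pd.Series(["16/06/2014", "04/08/2014", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

# Count the entries that failed to parse so they can be inspected.
n_bad = int(parsed.isna().sum())

print(parsed)
print("Unparseable dates:", n_bad)
```

With the real data, a nonzero count here would point to rows worth inspecting before any time-based analysis.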