# Data Preprocessing

In this notebook, we will load the Tunisian stock market data, clean it, and prepare it for further analysis and modeling.


### 1.1 Loading the Data
**First, we will load the data from the CSV file and take a look at its structure.**

In [1]:
# Import the necessary libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


In [2]:
# Load the data
data = pd.read_csv('./data/checked_weekly_stock_market.csv')

# Display the first few rows of the dataframe
data.head()

|   | companyName | date | openingPrice | highestPrice | lowestPrice | closingPrice | volume |
|---|---|---|---|---|---|---|---|
| 0 | AMEN BANK | 16/06/2014 | 23.63 | 23.63 | 22.75 | 23.14 | 1608 |
| 1 | AMEN BANK | 23/06/2014 | 23.14 | 23.14 | 22.37 | 22.75 | 16837 |
| 2 | AMEN BANK | 30/06/2014 | 22.75 | 22.97 | 22.07 | 22.66 | 33514 |
| 3 | AMEN BANK | 07/07/2014 | 22.75 | 23.17 | 21.88 | 22.74 | 3340 |
| 4 | AMEN BANK | 14/07/2014 | 22.84 | 23.16 | 22.58 | 22.75 | 5789 |


### 1.2 Handling Missing Values
**We'll check for missing values and handle them appropriately.**

In [3]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

companyName     0
date            0
openingPrice    0
highestPrice    0
lowestPrice     0
closingPrice    0
volume          0
dtype: int64


*No Missing Values*
- companyName: 0 missing values
- date: 0 missing values
- openingPrice: 0 missing values
- highestPrice: 0 missing values
- lowestPrice: 0 missing values
- closingPrice: 0 missing values
- volume: 0 missing values
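
This dataset has no gaps, so no imputation is applied in the notebook. For reference, the sketch below shows one way missing values could be handled if they did appear. It is illustrative only: the toy rows are made up, and the choices (forward-filling prices within each company, treating a missing volume as zero) are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "companyName": ["A", "A", "A", "B", "B"],
    "closingPrice": [10.0, np.nan, 12.0, np.nan, 5.0],
    "volume": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Carry the last known price forward within each company only,
# so one company's price never fills a gap in another's series.
toy["closingPrice"] = toy.groupby("companyName")["closingPrice"].ffill()

# Treat a missing volume as "no trades recorded" (an assumption, not a rule).
toy["volume"] = toy["volume"].fillna(0.0)

# Rows with no earlier price to carry forward are still empty, so drop them.
filled = toy.dropna(subset=["closingPrice"]).reset_index(drop=True)
print(filled)
```

Grouping by `companyName` before filling matters because the rows of different companies are stacked in one frame; a plain `ffill()` would carry the last price of one company into the first row of the next.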

### 1.3 Data Type Conversion
**Convert the `date` column to datetime and cast the price and volume columns to float.**

In [4]:
# Convert 'date' column to datetime format
data['date'] = pd.to_datetime(data['date'], format='%d/%m/%Y')

# Ensure numeric columns are of type float
numeric_columns = ['openingPrice', 'highestPrice', 'lowestPrice', 'closingPrice', 'volume']
data[numeric_columns] = data[numeric_columns].astype(float)

# Display the data types
data.dtypes

companyName             object
date            datetime64[ns]
openingPrice           float64
highestPrice           float64
lowestPrice            float64
closingPrice           float64
volume                 float64
dtype: object
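
The conversion above raises an error as soon as one value does not match the expected format. A more defensive variant, sketched here on made-up rows, coerces unparseable entries to `NaT` / `NaN` so they can be counted and inspected first. The frame `raw` and its contents are hypothetical.

```python
import pandas as pd

# Made-up rows; the second date and the second price are deliberately malformed.
raw = pd.DataFrame({
    "date": ["16/06/2014", "2014-06-23", "30/06/2014"],
    "closingPrice": ["23.14", "n/a", "22.66"],
})

# errors="coerce" turns entries that cannot be parsed into NaT / NaN
# instead of raising an exception.
parsed_dates = pd.to_datetime(raw["date"], format="%d/%m/%Y", errors="coerce")
parsed_prices = pd.to_numeric(raw["closingPrice"], errors="coerce")

# Count how many entries failed to parse in each column.
bad_dates = int(parsed_dates.isna().sum())
bad_prices = int(parsed_prices.isna().sum())
print(f"Unparseable dates: {bad_dates}, unparseable prices: {bad_prices}")
```

Counting the coerced values before overwriting the original columns makes it possible to decide whether to fix, drop, or investigate the offending rows.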

### 1.4 Handling Duplicates
**Check for and handle any duplicate rows.**

In [5]:
# Check for duplicates
duplicates = data.duplicated().sum()
print(f'Duplicates: {duplicates}')

# Drop duplicates if any
data.drop_duplicates(inplace=True)

Duplicates: 0


*No Duplicate Rows*
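
`data.duplicated()` only flags rows that match in every column. For weekly data, it can also be useful to check whether the same company appears twice for the same date with different prices. The sketch below illustrates that check on made-up rows; treating (`companyName`, `date`) as a unique key and keeping the last record are assumptions about the data, not something the notebook states.

```python
import pandas as pd

# Made-up rows: company A has two conflicting records for 2014-06-16.
toy = pd.DataFrame({
    "companyName": ["A", "A", "A", "B"],
    "date": pd.to_datetime(
        ["2014-06-16", "2014-06-16", "2014-06-23", "2014-06-16"]
    ),
    "closingPrice": [23.14, 23.20, 22.75, 5.00],
})

# Full-row check: finds nothing, because the two A rows differ in price.
full_row_dupes = int(toy.duplicated().sum())

# Key-based check: flags the second record for the same company and date.
key_cols = ["companyName", "date"]
key_dupes = int(toy.duplicated(subset=key_cols).sum())

# Keep the last record per key (an arbitrary choice for this sketch).
deduped = toy.drop_duplicates(subset=key_cols, keep="last").reset_index(drop=True)
print(f"Full-row duplicates: {full_row_dupes}, key duplicates: {key_dupes}")
```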

In [6]:
# Save the cleaned data (before normalization, so prices stay in their original units)
data.to_csv('./data/cleaned_weekly_stock_market.csv', index=False)

### 1.5 Data Normalization
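**Scale the numeric features to the [0, 1] range with `MinMaxScaler`.** The scaler is fit on all rows at once, so each value is relative to that column's minimum and maximum across the whole dataset, not within a single company.

In [7]:
# Fit a min-max scaler on every row and rescale each numeric column to [0, 1]
scaler = MinMaxScaler()
data[numeric_columns] = scaler.fit_transform(data[numeric_columns])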

# Display the first few rows of the dataframe after normalization
data.head()

|   | companyName | date | openingPrice | highestPrice | lowestPrice | closingPrice | volume |
|---|---|---|---|---|---|---|---|
| 0 | AMEN BANK | 2014-06-16 | 0.00243 | 0.002398 | 0.00234 | 0.002357 | 1.4e-05 |
| 1 | AMEN BANK | 2014-06-23 | 0.002379 | 0.002348 | 0.002301 | 0.002317 | 0.00015 |
| 2 | AMEN BANK | 2014-06-30 | 0.002339 | 0.002331 | 0.00227 | 0.002308 | 0.000298 |
| 3 | AMEN BANK | 2014-07-07 | 0.002339 | 0.002351 | 0.00225 | 0.002316 | 3e-05 |
| 4 | AMEN BANK | 2014-07-14 | 0.002348 | 0.00235 | 0.002323 | 0.002317 | 5.1e-05 |
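
The AMEN BANK values sit near 0.002 because the scaler uses one global minimum and maximum per column. If the goal is to compare movements within each company, scaling per company may be a better fit. The sketch below contrasts the two on made-up prices; the helper `min_max` is hypothetical and applies the same formula `MinMaxScaler` uses by default, `(x - min) / (max - min)`. This is an alternative to consider, not what the notebook does.

```python
import pandas as pd

# Made-up prices for two companies on very different price scales.
toy = pd.DataFrame({
    "companyName": ["A", "A", "A", "B", "B", "B"],
    "closingPrice": [20.0, 25.0, 30.0, 1000.0, 1500.0, 2000.0],
})


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a series to [0, 1] using (x - min) / (max - min)."""
    span = series.max() - series.min()
    # A constant series has zero span; map it to 0.0 rather than divide by zero.
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


# Global scaling: one min and max for the whole column (as in the cell above).
global_scaled = min_max(toy["closingPrice"])

# Per-company scaling: each company gets its own min and max.
per_company = toy.groupby("companyName")["closingPrice"].transform(min_max)

print(pd.DataFrame({
    "companyName": toy["companyName"],
    "global": global_scaled.round(4),
    "per_company": per_company,
}))
```

With global scaling, company A's prices are squeezed into a sliver near zero, while per-company scaling spreads each company over the full [0, 1] range. Per-company scaling discards the price level differences between companies, so the right choice depends on what the downstream model needs.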
