# Data Preprocessing

In this notebook, we will load the Tunisian stock market data, clean it, and prepare it for further analysis and modeling.


### 1.1 Loading the Data
**First, we will load the data from the CSV file and take a look at its structure.**

In [1]:
# Import the necessary libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


In [2]:
# Load the data
data = pd.read_csv('./data/checked_weekly_stock_market.csv')

# Display the first few rows of the dataframe
data.head()

| | companyName | date | openingPrice | highestPrice | lowestPrice | closingPrice | volume |
|---|---|---|---|---|---|---|---|
| 0 | MIP | 16/06/2014 | 4.7 | 4.7 | 4.7 | 4.7 | 243099 |
| 1 | MIP | 23/06/2014 | 4.69 | 4.69 | 4.56 | 4.56 | 13850 |
| 2 | MIP | 30/06/2014 | 4.43 | 4.56 | 4.43 | 4.43 | 4335 |
| 3 | MIP | 07/07/2014 | 4.5 | 4.5 | 4.49 | 4.49 | 4114 |
| 4 | MIP | 14/07/2014 | 4.48 | 4.48 | 3.92 | 4.03 | 52687 |


### 1.2 Handling Missing Values
**We'll check for missing values and handle them appropriately.**

In [3]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

companyName     0
date            0
openingPrice    0
highestPrice    0
lowestPrice     0
closingPrice    0
volume          0
dtype: int64


*No missing values: all seven columns (companyName, date, openingPrice, highestPrice, lowestPrice, closingPrice, volume) report 0 missing entries, so no imputation or row removal is needed.*
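
Had the check reported gaps, one common option for weekly price series is to forward-fill within each company and drop whatever cannot be filled. The sketch below is illustrative only: the `toy` frame and its values are made up, and forward-filling is an assumed strategy, not something this notebook applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the dataset's column names; values are invented.
toy = pd.DataFrame({
    "companyName": ["A", "A", "A", "B", "B"],
    "closingPrice": [1.0, np.nan, 3.0, np.nan, 5.0],
})

# Forward-fill inside each company so a gap never borrows another company's price.
toy["closingPrice"] = toy.groupby("companyName")["closingPrice"].ffill()

# A gap at the very start of a company's series has nothing to copy from; drop it.
toy = toy.dropna(subset=["closingPrice"]).reset_index(drop=True)

print(toy)
```

Grouping by `companyName` before filling matters here because the file stacks several companies in one table.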

### 1.3 Data Type Conversion
**Convert the date column to datetime and cast the price and volume columns to float.**

In [4]:
# Convert 'date' column to datetime format
data['date'] = pd.to_datetime(data['date'], format='%d/%m/%Y')

# Ensure numeric columns are of type float
numeric_columns = ['openingPrice', 'highestPrice', 'lowestPrice', 'closingPrice', 'volume']
data[numeric_columns] = data[numeric_columns].astype(float)

# Display the data types
data.dtypes

companyName             object
date            datetime64[ns]
openingPrice           float64
highestPrice           float64
lowestPrice            float64
closingPrice           float64
volume                 float64
dtype: object
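
The explicit `format='%d/%m/%Y'` matters because dates such as `07/07/2014` or `07/06/2014` are ambiguous between day-first and month-first order. As a small sketch on made-up strings (the `raw` and `bad` series are hypothetical), the same call can be combined with `errors="coerce"` to surface malformed dates as `NaT` instead of raising:

```python
import pandas as pd

# Hypothetical sample strings in the dataset's day/month/year layout.
raw = pd.Series(["07/06/2014", "14/07/2014"])
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

# With the explicit format, the first value is read as 7 June, not 6 July.
print(parsed[0].day, parsed[0].month)

# An impossible date becomes NaT when errors="coerce" is set,
# so it can be counted with isna() rather than stopping the notebook.
bad = pd.to_datetime(pd.Series(["31/02/2014"]), format="%d/%m/%Y", errors="coerce")
print(bad.isna().sum())
```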

### 1.4 Handling Duplicates
**Check for and handle any duplicate rows.**

In [5]:
# Check for duplicates
duplicates = data.duplicated().sum()
print(f'Duplicates: {duplicates}')

# Drop duplicates if any
data.drop_duplicates(inplace=True)

Duplicates: 0


*No duplicate rows*
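
`data.duplicated()` only flags rows that match in every column. For a table with one row per company per week, it can also be worth checking the key columns alone. The following is a sketch on invented data; treating `companyName` plus `date` as the key, and keeping the last occurrence, are assumptions rather than steps taken in this notebook.

```python
import pandas as pd

# Hypothetical toy frame: the first two rows share company and week
# but differ in price, so they are not exact duplicates.
toy = pd.DataFrame({
    "companyName": ["A", "A", "A"],
    "date": pd.to_datetime(["2014-06-16", "2014-06-16", "2014-06-23"]),
    "closingPrice": [4.70, 4.71, 4.56],
})

exact_dups = int(toy.duplicated().sum())
key_dups = int(toy.duplicated(subset=["companyName", "date"]).sum())
print(f"Exact duplicates: {exact_dups}, key duplicates: {key_dups}")

# Keep one row per (companyName, date); keep="last" is an arbitrary choice here.
deduped = toy.drop_duplicates(subset=["companyName", "date"], keep="last")
```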

In [6]:
# Save the cleaned data (written before normalization, so prices stay in their original units)
data.to_csv('./data/cleaned_weekly_stock_market.csv', index=False)

### 1.5 Data Normalization
**Scale the numerical features to the [0, 1] range with `MinMaxScaler`, fit on all companies together.**

In [7]:
scaler = MinMaxScaler()
data[numeric_columns] = scaler.fit_transform(data[numeric_columns])

# Display the first few rows of the dataframe after normalization
data.head()

| | companyName | date | openingPrice | highestPrice | lowestPrice | closingPrice | volume |
|---|---|---|---|---|---|---|---|
| 0 | MIP | 2014-06-16 | 0.000468 | 0.000466 | 0.000469 | 0.000467 | 0.002162 |
| 1 | MIP | 2014-06-23 | 0.000467 | 0.000465 | 0.000455 | 0.000453 | 0.000123 |
| 2 | MIP | 2014-06-30 | 0.000441 | 0.000452 | 0.000441 | 0.00044 | 3.9e-05 |
| 3 | MIP | 2014-07-07 | 0.000448 | 0.000446 | 0.000447 | 0.000446 | 3.7e-05 |
| 4 | MIP | 2014-07-14 | 0.000446 | 0.000444 | 0.000389 | 0.0004 | 0.000469 |
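
Because the scaler is fit on the whole table, each column is scaled against its minimum and maximum across every company, which is why the MIP rows above land very close to 0. If later analysis needs each company's prices on a comparable scale, one alternative is to min-max scale within each company. The sketch below uses invented values; the helper `min_max` and the `closingPrice_scaled` column are hypothetical names, and this is not what the notebook does.

```python
import pandas as pd

# Hypothetical toy frame: two companies trading at very different price levels.
toy = pd.DataFrame({
    "companyName": ["A", "A", "A", "B", "B", "B"],
    "closingPrice": [4.0, 5.0, 6.0, 100.0, 150.0, 200.0],
})

def min_max(series: pd.Series) -> pd.Series:
    """Scale one company's values to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span

# transform() applies the helper per company and returns values aligned to the rows.
toy["closingPrice_scaled"] = (
    toy.groupby("companyName")["closingPrice"].transform(min_max)
)

print(toy)
```

Whichever variant is used, a model evaluated on held-out weeks would normally take its scaling parameters from the training period only.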
