

>  ## ***`Task 1: Data Preprocessing for Machine Learning`*** ##
 




```Objectives:```

---


```
1.   Handle missing data (e.g., filling with mean/median, dropping).
2.   Encode categorical variables (e.g., using one-hot
     encoding or label encoding).
3.   Normalize or standardize numerical features.
4.   Split the dataset into training and testing sets.
```

---


`Tools: Python, pandas, scikit-learn.`

`Dataset: stock prices (2_Stock-Prices-Data-Set.csv).`

### ***1- Import preprocessing libraries*** ###


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split


### ***2- Import and Load Data*** ###


In [2]:
df = pd.read_csv('../DataSet-For-Tasks/2_Stock-Prices-Data-Set.csv')


In [3]:
df

Unnamed: 0,symbol,date,open,high,low,close,volume
0,AAL,2014-01-02,25.0700,25.8200,25.0600,25.3600,8998943
1,AAPL,2014-01-02,79.3828,79.5756,78.8601,79.0185,58791957
2,AAP,2014-01-02,110.3600,111.8800,109.2900,109.7400,542711
3,ABBV,2014-01-02,52.1200,52.3300,51.5200,51.9800,4569061
4,ABC,2014-01-02,70.1100,70.2300,69.4800,69.8900,1148391
...,...,...,...,...,...,...,...
497467,XYL,2017-12-29,68.5300,68.8000,67.9200,68.2000,1046677
497468,YUM,2017-12-29,82.6400,82.7100,81.5900,81.6100,1347613
497469,ZBH,2017-12-29,121.7500,121.9500,120.6200,120.6700,1023624
497470,ZION,2017-12-29,51.2800,51.5500,50.8100,50.8300,1261916


### ***3- Exploring Data*** ###

In [4]:
df.head()

Unnamed: 0,symbol,date,open,high,low,close,volume
0,AAL,2014-01-02,25.07,25.82,25.06,25.36,8998943
1,AAPL,2014-01-02,79.3828,79.5756,78.8601,79.0185,58791957
2,AAP,2014-01-02,110.36,111.88,109.29,109.74,542711
3,ABBV,2014-01-02,52.12,52.33,51.52,51.98,4569061
4,ABC,2014-01-02,70.11,70.23,69.48,69.89,1148391


In [5]:
df.tail()

Unnamed: 0,symbol,date,open,high,low,close,volume
497467,XYL,2017-12-29,68.53,68.8,67.92,68.2,1046677
497468,YUM,2017-12-29,82.64,82.71,81.59,81.61,1347613
497469,ZBH,2017-12-29,121.75,121.95,120.62,120.67,1023624
497470,ZION,2017-12-29,51.28,51.55,50.81,50.83,1261916
497471,ZTS,2017-12-29,72.55,72.76,72.04,72.04,1704122


In [6]:
df.sample(3)

Unnamed: 0,symbol,date,open,high,low,close,volume
254640,CHTR,2016-01-28,167.18,169.04,164.32,167.3,1607789
362858,SBAC,2016-12-06,97.46,98.95,97.16,97.81,1002103
426770,LNT,2017-06-12,41.41,41.53,40.835,41.17,971857


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497472 entries, 0 to 497471
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   symbol  497472 non-null  object 
 1   date    497472 non-null  object 
 2   open    497461 non-null  float64
 3   high    497464 non-null  float64
 4   low     497464 non-null  float64
 5   close   497472 non-null  float64
 6   volume  497472 non-null  int64  
dtypes: float64(4), int64(1), object(2)
memory usage: 26.6+ MB


In [8]:
df.describe()

Unnamed: 0,open,high,low,close,volume
count,497461.0,497464.0,497464.0,497472.0,497472.0
mean,86.352275,87.132562,85.552467,86.369082,4253611.0
std,101.471228,102.312062,100.570957,101.472407,8232139.0
min,1.62,1.69,1.5,1.59,0.0
25%,41.69,42.09,41.28,41.70375,1080166.0
50%,64.97,65.56,64.3537,64.98,2084896.0
75%,98.41,99.23,97.58,98.42,4271928.0
max,2044.0,2067.99,2035.11,2049.0,618237600.0


In [9]:
df.shape

(497472, 7)

### ***3- Finding & Handling missing data*** ###

In [10]:
df.isnull()

Unnamed: 0,symbol,date,open,high,low,close,volume
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
497467,False,False,False,False,False,False,False
497468,False,False,False,False,False,False,False
497469,False,False,False,False,False,False,False
497470,False,False,False,False,False,False,False


In [11]:
df.isna().any()

symbol    False
date      False
open       True
high       True
low        True
close     False
volume    False
dtype: bool

In [12]:
df.isnull().sum()

symbol     0
date       0
open      11
high       8
low        8
close      0
volume     0
dtype: int64

In [13]:
df.fillna(df.mean(numeric_only=True), inplace=True)

In [14]:
df.isnull().sum()

symbol    0
date      0
open      0
high      0
low       0
close     0
volume    0
dtype: int64
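
The cell above fills every gap with the global column mean (about 86 for the price columns, per `df.describe()`), although prices in this dataset range from under 2 to over 2000. One alternative, sketched here on a small toy frame with made-up values, is to fill each gap with the median of the same symbol. The frame `toy` and the column `open_filled` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stock data; the values are illustrative only.
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "open":   [10.0, np.nan, 12.0, 200.0, 210.0, np.nan],
})

# Per-symbol median, aligned row by row with the original frame.
per_symbol_median = toy.groupby("symbol")["open"].transform("median")

# Fill each missing value with the median of its own symbol.
toy["open_filled"] = toy["open"].fillna(per_symbol_median)
print(toy)
```

Whether this is preferable to the global mean depends on the downstream model; with only 11 missing values out of 497,472 rows, the practical difference is likely small.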

### ***4- Checking duplicated values*** ###

In [15]:
df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
497467    False
497468    False
497469    False
497470    False
497471    False
Length: 497472, dtype: bool

In [16]:
df.duplicated().sum()

0
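
`df.duplicated()` only flags rows that match in every column. A repeated (symbol, date) pair with slightly different prices would not be counted. A minimal sketch of a key-based check, on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame: the first two rows share symbol and date but differ in close.
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB"],
    "date":   ["2014-01-02", "2014-01-02", "2014-01-02"],
    "close":  [10.0, 10.5, 20.0],
})

# Rows identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier (symbol, date) key, regardless of other columns.
key_dupes = int(toy.duplicated(subset=["symbol", "date"]).sum())

print(full_row_dupes, key_dupes)
```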

### ***5- Data Encoding*** ###

In [17]:
df.dtypes

symbol     object
date       object
open      float64
high      float64
low       float64
close     float64
volume      int64
dtype: object

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497472 entries, 0 to 497471
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   symbol  497472 non-null  object 
 1   date    497472 non-null  object 
 2   open    497472 non-null  float64
 3   high    497472 non-null  float64
 4   low     497472 non-null  float64
 5   close   497472 non-null  float64
 6   volume  497472 non-null  int64  
dtypes: float64(4), int64(1), object(2)
memory usage: 26.6+ MB


In [19]:
df['symbol'].value_counts()

symbol
AAL     1007
PRGO    1007
NVDA    1007
NUE     1007
NTRS    1007
        ... 
DXC      189
BHGE     126
BHF      117
DWDP      83
APTV      18
Name: count, Length: 505, dtype: int64

In [20]:
df['symbol'].describe()

count     497472
unique       505
top          AAL
freq        1007
Name: symbol, dtype: object

In [21]:
df['date'].value_counts()

date
2017-12-29    505
2017-12-15    505
2017-12-05    505
2017-12-06    505
2017-12-08    505
             ... 
2014-02-24    483
2014-02-21    483
2014-02-20    483
2014-01-02    483
2014-06-10    482
Name: count, Length: 1007, dtype: int64

In [22]:
df['date'].describe()

count         497472
unique          1007
top       2017-12-29
freq             505
Name: date, dtype: object

In [23]:
df

Unnamed: 0,symbol,date,open,high,low,close,volume
0,AAL,2014-01-02,25.0700,25.8200,25.0600,25.3600,8998943
1,AAPL,2014-01-02,79.3828,79.5756,78.8601,79.0185,58791957
2,AAP,2014-01-02,110.3600,111.8800,109.2900,109.7400,542711
3,ABBV,2014-01-02,52.1200,52.3300,51.5200,51.9800,4569061
4,ABC,2014-01-02,70.1100,70.2300,69.4800,69.8900,1148391
...,...,...,...,...,...,...,...
497467,XYL,2017-12-29,68.5300,68.8000,67.9200,68.2000,1046677
497468,YUM,2017-12-29,82.6400,82.7100,81.5900,81.6100,1347613
497469,ZBH,2017-12-29,121.7500,121.9500,120.6200,120.6700,1023624
497470,ZION,2017-12-29,51.2800,51.5500,50.8100,50.8300,1261916


In [24]:
df = pd.get_dummies(df, columns=['symbol'])

In [25]:
df

Unnamed: 0,date,open,high,low,close,volume,symbol_A,symbol_AAL,symbol_AAP,symbol_AAPL,...,symbol_XL,symbol_XLNX,symbol_XOM,symbol_XRAY,symbol_XRX,symbol_XYL,symbol_YUM,symbol_ZBH,symbol_ZION,symbol_ZTS
0,2014-01-02,25.0700,25.8200,25.0600,25.3600,8998943,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2014-01-02,79.3828,79.5756,78.8601,79.0185,58791957,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
2,2014-01-02,110.3600,111.8800,109.2900,109.7400,542711,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
3,2014-01-02,52.1200,52.3300,51.5200,51.9800,4569061,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,2014-01-02,70.1100,70.2300,69.4800,69.8900,1148391,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497467,2017-12-29,68.5300,68.8000,67.9200,68.2000,1046677,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
497468,2017-12-29,82.6400,82.7100,81.5900,81.6100,1347613,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
497469,2017-12-29,121.7500,121.9500,120.6200,120.6700,1023624,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
497470,2017-12-29,51.2800,51.5500,50.8100,50.8300,1261916,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


In [26]:
df.shape

(497472, 511)

In [27]:
# Map each distinct date string to an integer code (its rank in sorted order)
le = LabelEncoder()
df['date'] = le.fit_transform(df['date'])


In [28]:
df

Unnamed: 0,date,open,high,low,close,volume,symbol_A,symbol_AAL,symbol_AAP,symbol_AAPL,...,symbol_XL,symbol_XLNX,symbol_XOM,symbol_XRAY,symbol_XRX,symbol_XYL,symbol_YUM,symbol_ZBH,symbol_ZION,symbol_ZTS
0,0,25.0700,25.8200,25.0600,25.3600,8998943,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,0,79.3828,79.5756,78.8601,79.0185,58791957,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
2,0,110.3600,111.8800,109.2900,109.7400,542711,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
3,0,52.1200,52.3300,51.5200,51.9800,4569061,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0,70.1100,70.2300,69.4800,69.8900,1148391,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497467,1006,68.5300,68.8000,67.9200,68.2000,1046677,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
497468,1006,82.6400,82.7100,81.5900,81.6100,1347613,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
497469,1006,121.7500,121.9500,120.6200,120.6700,1023624,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
497470,1006,51.2800,51.5500,50.8100,50.8300,1261916,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497472 entries, 0 to 497471
Columns: 511 entries, date to symbol_ZTS
dtypes: bool(505), float64(4), int32(1), int64(1)
memory usage: 260.5 MB
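
`LabelEncoder` turns each date into a running integer (0 to 1006 here), which keeps the ordering but discards calendar structure. If calendar information is wanted as features, one option is to parse the strings and extract components. A minimal sketch on three dates taken from the previews above; the derived column names are illustrative.

```python
import pandas as pd

# Three date strings in the same format as the dataset's date column.
toy = pd.DataFrame({"date": ["2014-01-02", "2016-01-28", "2017-12-29"]})

# Parse the ISO-formatted strings into datetime64 values.
parsed = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# Derive simple calendar features.
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month
toy["day_of_week"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
print(toy)
```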


### ***6- Dropping unneeded columns*** ###

In [30]:
df.dtypes

date             int32
open           float64
high           float64
low            float64
close          float64
                ...   
symbol_XYL        bool
symbol_YUM        bool
symbol_ZBH        bool
symbol_ZION       bool
symbol_ZTS        bool
Length: 511, dtype: object

In [31]:
df.drop(['date'], axis=1, inplace=True)

In [32]:
df.shape

(497472, 510)

### ***7- Scaling Data*** ###

In [33]:
df.dtypes

open           float64
high           float64
low            float64
close          float64
volume           int64
                ...   
symbol_XYL        bool
symbol_YUM        bool
symbol_ZBH        bool
symbol_ZION       bool
symbol_ZTS        bool
Length: 510, dtype: object

In [34]:
# Standardize the four price columns (including the target, close); volume is left unscaled
scaler = StandardScaler()
df[['open', 'high', 'low', 'close']] = scaler.fit_transform(df[['open', 'high', 'low', 'close']])

In [35]:
df

Unnamed: 0,open,high,low,close,volume,symbol_A,symbol_AAL,symbol_AAP,symbol_AAPL,symbol_ABBV,...,symbol_XL,symbol_XLNX,symbol_XOM,symbol_XRAY,symbol_XRX,symbol_XYL,symbol_YUM,symbol_ZBH,symbol_ZION,symbol_ZTS
0,-0.603945,-0.599276,-0.601496,-0.601239,8998943,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,-0.068685,-0.073863,-0.066544,-0.072439,58791957,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
2,0.236599,0.241884,0.236030,0.230318,542711,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,-0.337363,-0.340164,-0.338396,-0.338901,4569061,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
4,-0.160070,-0.165207,-0.159814,-0.162400,1148391,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497467,-0.175641,-0.179184,-0.175325,-0.179055,1046677,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
497468,-0.036585,-0.043227,-0.039400,-0.046900,1347613,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
497469,0.348849,0.340309,0.348688,0.338032,1023624,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
497470,-0.345642,-0.347788,-0.345455,-0.350234,1261916,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
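
The cell above fits the scaler on the full dataset before the split, so statistics from rows that later land in the test set influence the scaling. It also scales the target `close` and leaves `volume` on its raw scale. A common alternative, sketched here on toy data with made-up values, is to split first, fit the scaler on the training features only, and reuse it for the test features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: two features and a target; the values are illustrative only.
toy = pd.DataFrame({
    "open":   [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "volume": [1000, 2000, 1500, 3000, 2500, 4000, 3500, 5000, 4500, 6000],
    "close":  [11.0, 19.0, 31.0, 41.0, 49.0, 61.0, 69.0, 81.0, 91.0, 99.0],
})
X = toy[["open", "volume"]]
y = toy["close"]  # target kept in its original units

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std come from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```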


### ***8- Feature & Target Selection*** ###

In [36]:
X = df.drop('close', axis=1)
y = df['close']

In [37]:
X

Unnamed: 0,open,high,low,volume,symbol_A,symbol_AAL,symbol_AAP,symbol_AAPL,symbol_ABBV,symbol_ABC,...,symbol_XL,symbol_XLNX,symbol_XOM,symbol_XRAY,symbol_XRX,symbol_XYL,symbol_YUM,symbol_ZBH,symbol_ZION,symbol_ZTS
0,-0.603945,-0.599276,-0.601496,8998943,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,-0.068685,-0.073863,-0.066544,58791957,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0.236599,0.241884,0.236030,542711,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,-0.337363,-0.340164,-0.338396,4569061,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
4,-0.160070,-0.165207,-0.159814,1148391,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497467,-0.175641,-0.179184,-0.175325,1046677,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
497468,-0.036585,-0.043227,-0.039400,1347613,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
497469,0.348849,0.340309,0.348688,1023624,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
497470,-0.345642,-0.347788,-0.345455,1261916,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


In [38]:
y

0        -0.601239
1        -0.072439
2         0.230318
3        -0.338901
4        -0.162400
            ...   
497467   -0.179055
497468   -0.046900
497469    0.338032
497470   -0.350234
497471   -0.141212
Name: close, Length: 497472, dtype: float64

### ***9- Splitting the dataset into the Training set and Test set*** ###

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [40]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((397977, 509), (99495, 509), (397977,), (99495,))
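
`train_test_split` shuffles rows by default, so the test set here contains days that fall between training days. For time-ordered data such as stock prices, a chronological split is sometimes preferred so that the model is evaluated only on dates later than those it was trained on. The sketch below assumes the `date` column is still available (the notebook drops it in step 6) and uses made-up values.

```python
import pandas as pd

# Toy frame with a parsed date column; prices are illustrative only.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-12-29", "2014-01-02", "2016-01-28", "2015-06-01", "2017-06-12",
    ]),
    "close": [5.0, 1.0, 3.0, 2.0, 4.0],
})

# Sort by time, then cut at the 80% mark instead of sampling at random.
toy = toy.sort_values("date").reset_index(drop=True)
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]

print(train["date"].max(), test["date"].min())
```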


>  ## ***`Conclusion`*** ##

**This notebook covered the core preprocessing steps for preparing the dataset for machine learning:**

---
```
1.  Filled missing values in open, high, and low with the column mean.
2.  One-hot encoded symbol; label-encoded date, which was then dropped.
3.  Standardized the price columns (open, high, low, close); volume was left unscaled.
4.  Split the dataset into training (80%) and testing (20%) sets for model evaluation.
```
---
**After these steps the dataset has no missing values or fully duplicated rows and every column is numeric or boolean, so it can be passed to a machine learning model.**
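
The same steps can also be bundled into a single scikit-learn object, so that imputation, scaling, and encoding are fitted together and can be reapplied to new data. This is a minimal sketch on a toy frame with made-up values, not the notebook's own method; the step names `impute`, `scale`, `num`, and `cat` are arbitrary labels.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one categorical and two numeric columns; values are illustrative.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA", "BBB"],
    "open":   [10.0, np.nan, 12.0, 210.0],
    "volume": [1000, 2000, 1500, 3000],
})

# Numeric columns: mean imputation followed by standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Route numeric and categorical columns to their own transformers.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["open", "volume"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["symbol"]),
])

features = preprocess.fit_transform(toy)
print(features.shape)
```

In practice `fit_transform` would be called on the training split and `transform` on the test split.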
