
# Data Preprocessing Lab Work

This notebook walks through preprocessing a dataset so that it can be fed to machine learning models.
The dataset has four columns: country, age, salary, and whether a purchase was made.

### Objectives:
1. Handle missing data.
2. Encode categorical variables.
3. Avoid the dummy variable trap.
4. Prepare the data for machine learning.

---


## Step 1: Import Libraries
Import the necessary libraries for data preprocessing.

In [None]:

# Importing libraries
import numpy as np
import pandas as pd


## Step 2: Load the Dataset
Load the dataset into a pandas DataFrame for processing.

In [None]:

# Load the dataset
dataset = pd.read_csv('Data.csv')  # Assumes 'Data.csv' is in the current working directory
X = dataset.iloc[:, :-1].values  # Independent variables: every column except the last
y = dataset.iloc[:, -1].values  # Dependent variable: the last column
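
To make the positional slicing concrete, here is a minimal sketch on a small in-memory DataFrame. The frame `demo` and its values are made up for illustration and are not taken from `Data.csv`; only the column layout (Country, Age, Salary, Purchased) follows the description above.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv; the values are invented for illustration.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, None],
    "Salary": [72000.0, None, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo.iloc[:, :-1].values  # every column except the last
y_demo = demo.iloc[:, -1].values   # only the last column

print(X_demo.shape)       # (3, 3)
print(y_demo.shape)       # (3,)
print(demo.isna().sum())  # number of missing values per column
```

Because the columns mix strings and floats, `X_demo` is a NumPy array of dtype `object`, which is why later steps address columns by position rather than by name.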


## Step 3: Handle Missing Data
Replace missing values in the numerical columns (Age and Salary) with the mean of each column.

In [None]:

# Handle missing data using SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on the Age and Salary columns (indices 1 and 2) and fill their missing values in one step
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
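
As a small worked example of mean imputation, the sketch below runs `SimpleImputer` on a toy array. The array `values` is hypothetical and chosen so the column means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical two-column array with one missing value per column.
values = np.array([
    [20.0, 100.0],
    [np.nan, 200.0],
    [40.0, np.nan],
])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer_demo.fit_transform(values)

# statistics_ holds the per-column means computed from the non-missing entries.
print(imputer_demo.statistics_)  # [ 30. 150.]
print(filled)
```

Each missing entry is replaced by the mean of the observed values in its own column: 30 for the first column and 150 for the second.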


## Step 4: Encode Categorical Data
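Convert the categorical Country column to numerical values using label encoding followed by one-hot encoding. One-hot encoding creates one dummy column per category, and these columns always sum to 1 in every row, so any one of them can be derived from the others. Dropping the first dummy column removes this redundancy, which is known as the dummy variable trap and mainly affects linear models that include an intercept.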

In [None]:

# Encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label encode the 'Country' column.
# This step is optional: OneHotEncoder also accepts string categories directly.
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

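# One-hot encode column 0 and drop the first dummy column to avoid the dummy variable trap.
# In the output, the dummy columns come first, followed by the passthrough columns (Age, Salary).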
column_transformer = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [0])], remainder='passthrough'
)
X = column_transformer.fit_transform(X)
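
The effect of `drop='first'` can be seen on a toy column. This is a minimal sketch with hypothetical country names, applying `OneHotEncoder` directly rather than through a `ColumnTransformer`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input with three distinct categories.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# Full encoding: one column per category, sorted as France, Germany, Spain.
full = OneHotEncoder().fit_transform(countries).toarray()

# Reduced encoding: the first category (France) is dropped.
reduced_encoder = OneHotEncoder(drop="first")
reduced = reduced_encoder.fit_transform(countries).toarray()

print(full.shape)        # (4, 3)
print(full.sum(axis=1))  # every row sums to 1, which is the redundancy
print(reduced.shape)     # (4, 2)
print(reduced)           # France is represented by a row of zeros
```

With the first column dropped, France becomes the baseline category: it is encoded as all zeros, and no information is lost.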


## Step 5: Encode Dependent Variable
Convert the dependent variable 'Purchased' into integer labels (0 and 1). `LabelEncoder` assigns the codes in sorted order of the class names.

In [None]:

# Encode the dependent variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
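
To check which class maps to which integer, the fitted encoder exposes `classes_` and `inverse_transform`. The sketch below assumes, hypothetically, that the labels are the strings 'No' and 'Yes'; the actual values in `Data.csv` may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real 'Purchased' values may differ.
labels = ["No", "Yes", "No", "Yes"]

encoder_demo = LabelEncoder()
codes = encoder_demo.fit_transform(labels)

print(encoder_demo.classes_)                  # ['No' 'Yes']; index in this array = integer code
print(codes)                                  # [0 1 0 1]
print(encoder_demo.inverse_transform(codes))  # recovers the original strings
```

Keeping the fitted encoder around makes it possible to translate model predictions back into the original labels later.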


## Conclusion
The dataset is now preprocessed: missing Age and Salary values are imputed with the column mean, Country is one-hot encoded with the first dummy column dropped to avoid the dummy variable trap, and Purchased is label encoded. `X` and `y` are ready to be passed to a machine learning model.