
# **Green University of Bangladesh**
# **Department of CSE**


---


## CSE 412: Machine Learning Lab
## Lab Report 02
### Data Preprocessing Techniques

##### **Student Name:** Md. Jabed Hasan
##### **Student ID:** 221002184
##### **Instructor:** Md. Jahid Tanvir  
##### **Date:** Jul 4, 2025


## Objective

- Explore practical techniques for handling missing data in real-world datasets using Python.  
- Implement a method to impute null values by calculating the average of their immediate preceding and succeeding values.  
- Apply one-hot encoding to transform categorical variables into a numerical format suitable for machine learning models.  
- Strengthen data preprocessing skills with Pandas for effective dataset preparation before model training or analysis.


## Introduction

Data preprocessing is a crucial step in any data analysis or machine learning pipeline, as real-world datasets often contain missing values and categorical features that need to be handled carefully. Without proper preprocessing, analyses and models can produce inaccurate results, suffer from biases, or even fail to run.

One common issue is missing numeric data, which can break algorithms or distort statistical calculations. A practical solution is to impute each missing value with the average of the values immediately before and after it. When the row order is meaningful (for example, in a time series), this method uses local context and preserves trends and continuity better than a global mean or median.
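To make the difference between neighbor averaging and a global mean concrete, here is a minimal sketch. The series `s` is a hypothetical toy example, not part of the lab dataset, and the `shift`-based formula is just one way to express "average of the previous and next values".

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one interior gap (not from the lab dataset).
s = pd.Series([10.0, 20.0, np.nan, 40.0, 100.0])

# Average of the immediate previous and next values, used only where s is missing.
neighbor_avg = (s.shift(1) + s.shift(-1)) / 2
filled_local = s.fillna(neighbor_avg)

# Global-mean imputation, for comparison.
filled_global = s.fillna(s.mean())

print(filled_local.tolist())   # [10.0, 20.0, 30.0, 40.0, 100.0]
print(filled_global.tolist())  # [10.0, 20.0, 42.5, 40.0, 100.0]
```

The local estimate (30.0) sits between its neighbors, while the global mean (42.5) is pulled upward by the large value at the end of the series.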

Another key challenge is handling categorical variables. Most machine learning algorithms require purely numeric input, so categorical data must be transformed into a numerical format. One-hot encoding addresses this by creating one binary column per category, allowing algorithms to process these features correctly. Because the full set of binary columns for a feature always sums to 1, the columns are perfectly collinear, which can destabilize linear models in particular. Dropping the first category during encoding removes this redundancy.
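The redundancy that `drop_first=True` removes can be seen in a small sketch. The `toy` frame below is hypothetical and only mirrors the three country names that appear in the lab data.

```python
import pandas as pd

# Hypothetical toy column using the three countries from the lab data.
toy = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

full = pd.get_dummies(toy, columns=["Country"], dtype=int)
reduced = pd.get_dummies(toy, columns=["Country"], drop_first=True, dtype=int)

# With all three columns, every row sums to 1, so any one column
# is fully determined by the other two.
row_sums = full.sum(axis=1)

print(full.columns.tolist())     # ['Country_France', 'Country_Germany', 'Country_Spain']
print(reduced.columns.tolist())  # ['Country_Germany', 'Country_Spain']
print(row_sums.tolist())         # [1, 1, 1, 1]
```

In the reduced frame, a row with zeros in both remaining columns stands for the dropped category (here, France).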

In this lab, we focus on two essential preprocessing techniques:  
- Imputing missing numeric values by averaging immediate neighbors.  
- Applying one-hot encoding to convert categorical features into numerical form.

By implementing these steps, we prepare our dataset for reliable analysis and modeling, ensuring data quality and compatibility with machine learning tools.


## Importing Libraries
Below, we import pandas, load the dataset from a CSV file, and preview its first ten rows.


In [14]:
import pandas as pd


df = pd.read_csv('/content/sample_data/Data.csv')
df.head(10)


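|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No  |
| 1 | Spain   | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No  |
| 3 | Spain   | 38.0 | 61000.0 | No  |
| 4 | Germany | 40.0 | NaN     | Yes |
| 5 | France  | 35.0 | 58000.0 | Yes |
| 6 | Spain   | NaN  | 52000.0 | No  |
| 7 | France  | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No  |
| 9 | France  | 37.0 | 67000.0 | Yes |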


## Task 1: Imputing Null Values Using Average of Previous and Next Values


In [15]:
print("Missing values before interpolation:\n")
print(df.isnull().sum())


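# Select the numeric columns and fill gaps by linear interpolation.
# For an isolated gap this equals the average of the previous and next values;
# limit_direction='both' also fills gaps at the start or end of a column.
df_numeric = df.select_dtypes(include='number')
df[df_numeric.columns] = df_numeric.interpolate(method='linear', limit_direction='both')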


print("\nMissing values after interpolation:\n")
print(df.isnull().sum())

Missing values before interpolation:

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

Missing values after interpolation:

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64
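
As a sanity check, the sketch below rebuilds the `Age` and `Salary` columns from the preview shown earlier and confirms that linear interpolation gives the neighbor average for each gap. The frame `nums` and the `*_expected` names are introduced here for illustration only. This equivalence holds for isolated gaps; for runs of consecutive missing values, linear interpolation spaces the filled values evenly between the nearest valid neighbors instead.

```python
import numpy as np
import pandas as pd

# Age and Salary copied from the df.head(10) preview; NaN marks the two gaps.
nums = pd.DataFrame({
    "Age":    [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
}, dtype="float64")

filled = nums.interpolate(method="linear", limit_direction="both")

# Each gap should equal the mean of the rows directly above and below it.
age_expected = (nums.loc[5, "Age"] + nums.loc[7, "Age"]) / 2
salary_expected = (nums.loc[3, "Salary"] + nums.loc[5, "Salary"]) / 2

print(filled.loc[6, "Age"], age_expected)        # 41.5 41.5
print(filled.loc[4, "Salary"], salary_expected)  # 59500.0 59500.0
```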


## Task 2: Converting Categorical Variables Using One-Hot Encoding


In [16]:
# Collect the names of all categorical (object or category dtype) columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# One-hot encode those columns, dropping the first category of each
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Show the shape and the first ten rows of the encoded DataFrame
print("\nEncoded DataFrame shape:", df_encoded.shape)
df_encoded.head(10)



Encoded DataFrame shape: (10, 5)


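|   | Age  | Salary  | Country_Germany | Country_Spain | Purchased_Yes |
|---|------|---------|-----------------|---------------|---------------|
| 0 | 44.0 | 72000.0 | False | False | False |
| 1 | 27.0 | 48000.0 | False | True  | True  |
| 2 | 30.0 | 54000.0 | True  | False | False |
| 3 | 38.0 | 61000.0 | False | True  | False |
| 4 | 40.0 | 59500.0 | True  | False | True  |
| 5 | 35.0 | 58000.0 | False | False | True  |
| 6 | 41.5 | 52000.0 | False | True  | False |
| 7 | 48.0 | 79000.0 | False | False | True  |
| 8 | 50.0 | 83000.0 | True  | False | False |
| 9 | 37.0 | 67000.0 | False | False | True  |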


## Result Analysis for Task 1: Imputing Null Values

The missing numeric values in the dataset were imputed with linear interpolation, which for an isolated gap yields the average of the immediately preceding and following values.  
Because it relies on surrounding values rather than global measures like the mean or median, this method preserves local trends when the row order is meaningful. For data with no natural ordering, the neighboring rows carry no special information, so the result should be read as a simple gap-filling step rather than a trend-aware estimate.

One challenge is handling missing values at the beginning or end of a column, where a neighbor on one side is not available for averaging. The code handles these edge cases with `limit_direction='both'`, which fills a leading or trailing gap with the nearest valid value instead of leaving it null. In this dataset both gaps were interior (rows 4 and 6), so no edge filling was needed and no rows had to be dropped.
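
The edge behavior of `interpolate` can be checked with a small sketch. The series `s` is a hypothetical example; `limit_area="inside"` followed by `dropna()` is shown as an alternative for cases where edge gaps should be removed rather than filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at both ends and one in the middle.
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Fill everything: edge gaps take the nearest valid value.
both = s.interpolate(method="linear", limit_direction="both")

# Fill only gaps surrounded by valid values, then drop what remains.
inside_then_drop = s.interpolate(method="linear", limit_area="inside").dropna()

print(both.tolist())              # [2.0, 2.0, 3.0, 4.0, 4.0]
print(inside_then_drop.tolist())  # [2.0, 3.0, 4.0]
```

Filling keeps every row but invents flat values at the edges; dropping keeps only values backed by two real neighbors at the cost of losing rows.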

Overall, this approach removed both null values in the dataset: the missing `Age` in row 6 became 41.5 (the mean of 35 and 48), and the missing `Salary` in row 4 became 59500 (the mean of 61000 and 58000). Filling these gaps prevents the runtime errors and bias that can occur when models are trained on incomplete data.

---

## Result Analysis for Task 2: One-Hot Encoding of Categorical Variables

One-hot encoding was applied to transform categorical features into a format suitable for machine learning algorithms. This technique created new binary columns for each category, effectively representing the presence or absence of each unique label.

To prevent multicollinearity (which can affect some algorithms), the `drop_first=True` parameter was used, removing the first category of each feature and reducing redundancy in the encoded data. Encoding still increased the dataset's dimensionality, from 4 columns to 5: `Country` (three categories) became two columns, `Country_Germany` and `Country_Spain`, and `Purchased` (two categories) became one, `Purchased_Yes`.

Despite the increase in column count, this trade-off is important for ensuring models can correctly interpret and use categorical information. The encoded dataset now contains only numeric and boolean columns (the new columns hold `True`/`False` values), a form that machine learning models can process, laying the groundwork for training and analysis.
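
The encoded output shown earlier holds `True`/`False` values. If strictly integer columns are wanted, one option is the `dtype` parameter of `pd.get_dummies`. The sketch below uses a hypothetical three-row frame `toy`, not the lab dataset, and also checks the expected number of new columns.

```python
import pandas as pd

# Hypothetical three-row frame with the same categorical columns as the lab data.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Purchased": ["No", "Yes", "No"],
})

# dtype=int produces 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

# With drop_first=True, each feature contributes (number of categories - 1) columns.
expected_cols = sum(toy[c].nunique() - 1 for c in toy.columns)

print(encoded.columns.tolist())  # ['Country_Germany', 'Country_Spain', 'Purchased_Yes']
print(encoded.values.tolist())   # [[0, 0, 0], [0, 1, 1], [1, 0, 0]]
print(expected_cols)             # 3
```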

In summary, these two preprocessing steps, imputation and encoding, removed all missing values and made the dataset ready for further modeling.


## Discussion

In this lab, we explored key data preprocessing techniques to prepare datasets for analysis and machine learning.  
First, we handled missing numeric values by imputing them with the average of their immediate neighbors through linear interpolation, which preserved data continuity. Missing values at the start or end of a column have a neighbor on only one side; the `limit_direction='both'` setting covers that case by filling with the nearest valid value, although no such edge gaps occurred in this dataset.

Next, we applied one-hot encoding to convert categorical variables into binary columns, ensuring the data could be used by machine learning models. Using `drop_first=True` helped prevent multicollinearity by removing redundant columns.

Overall, these preprocessing steps improved data quality and ensured compatibility with machine learning algorithms, highlighting their importance in any data analysis workflow.
