# EDA and Data Preprocessing

In this notebook, we perform exploratory data analysis (EDA) and data preprocessing on the Car Evaluation dataset from the UCI Machine Learning Repository. Each row describes a car through six categorical attributes and assigns it an evaluation class. The notebook walks step by step through loading the data, encoding the categorical features, and preparing the result for use in PyTorch.

## 1. Importing Libraries

We start by importing the necessary libraries for data manipulation (`pandas`) and for working with PyTorch (`torch` and related modules).



In [7]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

## 2. Loading the Dataset
The Car Evaluation dataset is available from the UCI repository as a plain comma-separated file. We can load it directly into a pandas DataFrame.

In [8]:
# URL of the Car Evaluation dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'

# Define column names based on the dataset documentation
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Read the dataset into a DataFrame
car_df = pd.read_csv(url, names=columns)

# Display the first few rows of the DataFrame
print(car_df.head())


  buying  maint doors persons lug_boot safety  class
0  vhigh  vhigh     2       2    small    low  unacc
1  vhigh  vhigh     2       2    small    med  unacc
2  vhigh  vhigh     2       2    small   high  unacc
3  vhigh  vhigh     2       2      med    low  unacc
4  vhigh  vhigh     2       2      med    med  unacc


**Explanation:**
* We read the dataset from the given URL. The file has no header row, so we pass `names=columns` to assign column names taken from the dataset's documentation.
* The `.head()` method returns the first five rows of the DataFrame, which we print as a quick sanity check.
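
Before transforming anything, a few quick checks round out the exploratory step. The sketch below is not part of the original notebook: it rebuilds the five rows printed above as a small stand-in frame (`sample_df` is a hypothetical name) so that it runs without network access. In the notebook you would call the same methods on `car_df`.

```python
import pandas as pd

# Stand-in for car_df: the five rows printed by car_df.head() above.
# All values are strings, as they are when the full file is loaded.
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
rows = [
    ['vhigh', 'vhigh', '2', '2', 'small', 'low', 'unacc'],
    ['vhigh', 'vhigh', '2', '2', 'small', 'med', 'unacc'],
    ['vhigh', 'vhigh', '2', '2', 'small', 'high', 'unacc'],
    ['vhigh', 'vhigh', '2', '2', 'med', 'low', 'unacc'],
    ['vhigh', 'vhigh', '2', '2', 'med', 'med', 'unacc'],
]
sample_df = pd.DataFrame(rows, columns=columns)

# Count missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Count distinct values in each column
distinct_per_column = sample_df.nunique()
print(distinct_per_column)

# Count how often each class label occurs
class_counts = sample_df['class'].value_counts()
print(class_counts)
```

On the full dataset, the class counts are the most useful of the three: they show whether the evaluation classes are balanced before any model is trained.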

## 3. Creating Feature Matrix and Target Vector
Now, we will split the dataset into two parts:

* X: The feature matrix containing the car attributes.
* y: The target vector containing the class labels (the evaluation of the car).

In [9]:
# Create feature matrix X and target vector y
X = car_df.drop('class', axis=1)  # Features
y = car_df['class']                # Target class

# Display the shapes of X and y
print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

# Optionally, display the first few rows of X and y
print(X.head())
print(y.head())


Feature matrix shape: (1728, 6)
Target vector shape: (1728,)
  buying  maint doors persons lug_boot safety
0  vhigh  vhigh     2       2    small    low
1  vhigh  vhigh     2       2    small    med
2  vhigh  vhigh     2       2    small   high
3  vhigh  vhigh     2       2      med    low
4  vhigh  vhigh     2       2      med    med
0    unacc
1    unacc
2    unacc
3    unacc
4    unacc
Name: class, dtype: object


**Explanation:**
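* `car_df.drop('class', axis=1)` returns a new DataFrame without the target column, which we store as `X`. The original `car_df` is left unchanged.
* `y` holds the `class` column, that is, the evaluation assigned to each car.
* Printing the shapes confirms that `X` and `y` have the same number of rows (1728) and that `X` has six feature columns.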

## 4. Encoding Categorical Features
All of the features in the dataset are categorical and stored as strings, so we need to convert them to numbers before a model can use them. We use one-hot encoding with `pd.get_dummies()` for this purpose.

In [10]:
# Convert categorical features to numerical values (using pd.get_dummies)
X = pd.get_dummies(X, drop_first=True)


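**Explanation:**
* `pd.get_dummies()` replaces each categorical column with one binary column per category level (one-hot encoding).
* `drop_first=True` drops the first level of each feature. The dropped level is still represented: it is the case where all of that feature's remaining columns are 0. Removing it avoids a redundant column that would make the encoded columns linearly dependent (multicollinearity).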
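
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical one-column frame (`toy` is not part of the notebook). It compares the column names produced with and without the argument.

```python
import pandas as pd

# Hypothetical one-column frame with three levels of a single feature
toy = pd.DataFrame({'safety': ['low', 'med', 'high', 'low']})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level in sorted order is dropped

print(list(full.columns))     # ['safety_high', 'safety_low', 'safety_med']
print(list(reduced.columns))  # ['safety_low', 'safety_med']
```

Levels are ordered alphabetically, so `high` is the one that disappears here. The third row of `reduced`, which was `high`, has both remaining columns unset.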

## 5. Converting Data to PyTorch Tensors
Next, we convert the processed data into PyTorch tensors for training machine learning models in PyTorch.

In [11]:
# Convert to PyTorch tensors
X_tensor = torch.tensor(X.values, dtype=torch.float32)
y_tensor = torch.tensor(pd.factorize(y)[0], dtype=torch.long)

# Display the shapes of the tensors
print("X tensor shape:", X_tensor.shape)
print("y tensor shape:", y_tensor.shape)


X tensor shape: torch.Size([1728, 15])
y tensor shape: torch.Size([1728])


**Explanation:**
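* `X.values` returns the encoded features as a NumPy array, which `torch.tensor()` copies into a `float32` tensor.
* `pd.factorize(y)[0]` converts the string labels into integer codes, numbered in order of first appearance. We store them as `torch.long`, the integer type that PyTorch classification losses such as `nn.CrossEntropyLoss` expect for class indices.
* The printed shapes confirm 1728 samples with 15 encoded feature columns and one label per sample.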
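
The cell above keeps only the integer codes from `pd.factorize()`. The sketch below, which uses a short hypothetical label series rather than the real `y`, shows the second return value as well, since it is what lets you map predictions back to class names later.

```python
import pandas as pd

# Hypothetical label series, for illustration only
labels = pd.Series(['unacc', 'acc', 'unacc', 'good'])

# codes: integer id per row; uniques: the label belonging to each id
codes, uniques = pd.factorize(labels)
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['unacc', 'acc', 'good']

# Map integer codes back to the original labels
decoded = uniques.take(codes)
print(list(decoded))  # ['unacc', 'acc', 'unacc', 'good']
```

Because codes follow order of first appearance, the mapping depends on row order. Keeping `uniques` around avoids having to guess which integer stands for which class.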

## 6. Creating a Custom Dataset for PyTorch
Now, we will create a custom `Dataset` class that pairs each feature row with its label and exposes the pairs by index. A `DataLoader` can then draw samples from it and assemble them into batches during training.

In [12]:
class CarEvaluationDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


**Explanation:**
* Besides the constructor, our `Dataset` subclass implements two methods:
  * `__len__()`: Returns the number of samples in the dataset.
  * `__getitem__()`: Returns the sample (features, label) at the given index.
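
The two methods can be exercised directly, without a `DataLoader`. The following sketch repeats the class definition so that it runs on its own, and uses small hypothetical tensors (`toy_features`, `toy_labels`) in place of `X_tensor` and `y_tensor`.

```python
import torch
from torch.utils.data import Dataset


class CarEvaluationDataset(Dataset):
    """Same class as in the cell above, repeated so this sketch is self-contained."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Hypothetical toy data: 4 samples with 3 features each
toy_features = torch.zeros(4, 3)
toy_labels = torch.tensor([0, 1, 0, 2])
toy_dataset = CarEvaluationDataset(toy_features, toy_labels)

n_samples = len(toy_dataset)                   # calls __len__
sample_features, sample_label = toy_dataset[1]  # calls __getitem__
print(n_samples, sample_features.shape, sample_label.item())
```

Indexing returns a single feature vector and a scalar label; stacking such pairs into batches is left to the `DataLoader`.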

## 7. Creating a DataLoader
We can now create a DataLoader to load data in batches, which is essential for efficient model training.

In [13]:
# Create the dataset
car_dataset = CarEvaluationDataset(X_tensor, y_tensor)

# Create a DataLoader
batch_size = 16
car_dataloader = DataLoader(car_dataset, batch_size=batch_size, shuffle=True)
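
How many batches a loader yields follows from the dataset size and `batch_size`. The sketch below illustrates this on hypothetical toy data, with `TensorDataset` standing in for `CarEvaluationDataset` to keep the example short. It also computes the batch count for the 1728 samples and batch size of 16 used in this notebook.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 3 features each
toy_dataset = TensorDataset(torch.zeros(10, 3), torch.arange(10))
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# Shuffling changes which samples land in a batch, not the batch sizes
batch_sizes = [len(labels) for _, labels in toy_loader]
print(len(toy_loader), batch_sizes)  # 3 [4, 4, 2]

# Batch count for the notebook's settings: 1728 samples, batch size 16
full_batches = math.ceil(1728 / 16)
print(full_batches)  # 108
```

When the dataset size is not a multiple of the batch size, the last batch is smaller, as in the toy case. Passing `drop_last=True` to `DataLoader` discards that partial batch instead. With 1728 samples and a batch size of 16, every batch is full.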


## 8. Displaying a Batch of Data
Finally, we display a batch of data from the DataLoader to verify that it works as expected.

In [14]:
# Display a batch of data
for features, labels in car_dataloader:
    print("Batch features:\n", features)
    print("Batch labels:\n", labels)
    break  # Only show the first batch


Batch features:
 tensor([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.],
        [0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 0.],
        [0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1.],
        [0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 1.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 1.],
        [1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0.],
        [0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1.],
   

## Summary: EDA and Data Preprocessing

In this notebook, we explored and preprocessed the Car Evaluation dataset. The main steps covered were:

1. **Importing Libraries**: Used `pandas` for data manipulation and `torch` for handling tensors and datasets in PyTorch.

2. **Loading the Dataset**: Loaded the Car Evaluation dataset from the UCI repository and assigned appropriate column names.

3. **Creating Feature Matrix and Target Vector**: Split the dataset into features (`X`) and target labels (`y`).

4. **Encoding Categorical Features**: Applied one-hot encoding to convert categorical features into numerical format using `pd.get_dummies()`.

5. **Converting Data to PyTorch Tensors**: Converted the feature matrix and target vector into PyTorch tensors, suitable for machine learning models.

6. **Creating a Custom Dataset Class**: Defined a custom PyTorch `Dataset` to handle the feature and label data efficiently.

7. **Using DataLoader for Batch Processing**: Created a `DataLoader` to batch the data and shuffle it for training.

8. **Displaying a Batch of Data**: Printed one batch from the `DataLoader` to check that the features and labels have the expected shape and values.

This preprocessing prepares the dataset for further steps in building a machine learning model in PyTorch.
