# Exploratory Data Analysis and Preprocessing of Car Dataset

In this notebook, we will perform data preprocessing on a dataset containing information about various cars. The dataset includes attributes such as miles per gallon (MPG), the number of cylinders, and other characteristics that may influence fuel efficiency. Our goal is to prepare the data for further analysis or machine learning modeling.

## Table of Contents
1. [Import Libraries](#import-libraries)
2. [Load the Dataset](#load-the-dataset)
3. [Explore the Initial Data](#explore-the-initial-data)
4. [Rename Columns](#rename-columns)
5. [Handle Missing Values](#handle-missing-values)
6. [Drop Unnecessary Columns](#drop-unnecessary-columns)
7. [One-Hot Encoding of Categorical Variables](#one-hot-encoding-of-categorical-variables)
8. [Separate Features and Target Variable](#separate-features-and-target-variable)
9. [Train-Test Split](#train-test-split)
10. [Feature Scaling](#feature-scaling)
11. [Display Preprocessed Data](#display-preprocessed-data)

## Step 1: Import Libraries <a name="import-libraries"></a>

To begin, we will import the necessary libraries. We need `pandas` for data manipulation and scikit-learn (`sklearn`) for splitting the data and scaling the features.





In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Step 2: Load the Dataset <a name="load-the-dataset"></a>
Next, we will load the dataset from the provided URL. The dataset is in CSV format and can be read into a pandas DataFrame.

In [3]:
url = "https://archive.ics.uci.edu/static/public/9/data.csv"
df = pd.read_csv(url)

print("Initial DataFrame:\n", df.head())


Initial DataFrame:
                     car_name  cylinders  displacement  horsepower  weight  \
0  chevrolet,chevelle,malibu          8         307.0       130.0    3504   
1          buick,skylark,320          8         350.0       165.0    3693   
2         plymouth,satellite          8         318.0       150.0    3436   
3              amc,rebel,sst          8         304.0       150.0    3433   
4                ford,torino          8         302.0       140.0    3449   

   acceleration  model_year  origin   mpg  
0          12.0          70       1  18.0  
1          11.5          70       1  15.0  
2          11.0          70       1  18.0  
3          12.0          70       1  16.0  
4          10.5          70       1  17.0  


### Explanation
- `pd.read_csv()` reads the dataset from the URL into a DataFrame.
- `head()` displays the first five rows by default, giving us a glimpse of the data.

## Step 3: Explore the Initial Data <a name="explore-the-initial-data"></a>
Before preprocessing, it's helpful to understand the structure of the dataset. This can include checking for data types, missing values, and overall dimensions.

In [4]:
print("DataFrame Shape:", df.shape)
print("DataFrame Info:")
df.info()
print("\nMissing Values:\n", df.isnull().sum())


DataFrame Shape: (398, 9)
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   car_name      398 non-null    object 
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   mpg           398 non-null    float64
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB

Missing Values:
 car_name        0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
mpg             0
dtype: int64


### Explanation
- `df.shape` returns the number of rows and columns: 398 rows and 9 columns here.
- `df.info()` displays the data type and the number of non-null values in each column.
- `df.isnull().sum()` counts the missing values in each column; only `horsepower` has any (6).
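
The same checks can be tried on a small frame without downloading anything. The sketch below uses a hypothetical toy DataFrame (`toy`, with made-up values) and also computes the share of missing values per column, which is often easier to read than raw counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking a few of the dataset's columns (values are made up)
toy = pd.DataFrame({
    "cylinders": [8, 4, 6, 4],
    "horsepower": [130.0, np.nan, 95.0, np.nan],
    "mpg": [18.0, 32.0, 21.5, 29.0],
})

n_rows, n_cols = toy.shape            # (rows, columns)
missing_counts = toy.isnull().sum()   # number of NaN values per column
missing_share = toy.isnull().mean()   # fraction of NaN values per column

print("Shape:", (n_rows, n_cols))
print("Missing counts:\n", missing_counts)
print("Missing share:\n", missing_share)
```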

## Step 4: Rename Columns <a name="rename-columns"></a>
For easier access and readability, we will rename the columns of the DataFrame.

In [5]:
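# Assign the column names explicitly, in the same order as the loaded file
# (car_name is the first column and mpg the last, as df.head() shows above)
df.columns = ['car_name', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'mpg']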


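### Explanation
Clear column names help avoid confusion and make our code more readable. Note that assigning a list to `df.columns` relabels the columns by position, so the list must follow the order of the columns in the loaded DataFrame.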
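
Because positional assignment depends on column order, a mapping-based rename can be a safer choice. The following is a minimal sketch on a hypothetical one-row frame; the new names `name` and `miles_per_gallon` are illustrative assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical one-row frame with the same first and last columns as the dataset
toy = pd.DataFrame({
    "car_name": ["ford,torino"],
    "cylinders": [8],
    "mpg": [17.0],
})

# Positional assignment: labels are applied left to right, whatever the data is.
positional = toy.copy()
positional.columns = ["mpg", "cylinders", "car_name"]
# The column now labeled 'mpg' actually holds the car name string.
print(positional["mpg"].iloc[0])

# Mapping-based rename: each new name is tied to an existing name,
# so column order does not matter and untouched columns keep their labels.
renamed = toy.rename(columns={"car_name": "name", "mpg": "miles_per_gallon"})
print(list(renamed.columns))
```

With `rename`, a typo in an old column name is silently ignored by default; passing `errors="raise"` makes pandas raise a `KeyError` instead.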

## Step 5: Handle Missing Values <a name="handle-missing-values"></a>
Next, we will address the missing values, particularly in the horsepower column. We will convert the column to numeric values, coercing any errors, and fill missing values with the mean of the column.

In [6]:
# Handle missing values (replace missing horsepower values with the mean)
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())


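### Explanation
- `pd.to_numeric()` with `errors='coerce'` converts the `horsepower` column to numeric, replacing any non-numeric values with NaN. The column is already `float64` in this file, so the call acts as a safeguard.
- `fillna()` then replaces the missing values with the mean of the `horsepower` column.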
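
To see what the coercion and mean fill do, here is a small sketch on a hypothetical raw column. The `"?"` placeholder is an assumption for illustration; in the CSV loaded above the missing entries already arrive as NaN.

```python
import pandas as pd

# Hypothetical raw column with a text placeholder and a missing entry
raw = pd.Series(["130", "?", "95", None, "150"], name="horsepower")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# mean() skips NaN by default, so only the three valid values are averaged
mean_hp = numeric.mean()

# Replace every NaN with that mean
filled = numeric.fillna(mean_hp)

print(numeric.tolist())
print(mean_hp)
print(filled.tolist())
```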

## Step 6: Drop Unnecessary Columns <a name="drop-unnecessary-columns"></a>
We will drop the `car_name` column: it is a free-text label for each car rather than a numeric feature, so it is not useful as a model input in this analysis.

In [7]:
# Drop 'car_name' as it's not useful in our analysis
df.drop('car_name', axis=1, inplace=True)


### Explanation
`drop()` with `axis=1` removes the specified column (`car_name`) from the DataFrame, and `inplace=True` modifies `df` directly instead of returning a new DataFrame.

## Step 7: One-Hot Encoding of Categorical Variables <a name="one-hot-encoding-of-categorical-variables"></a>
To convert the categorical origin column into a format suitable for modeling, we will apply one-hot encoding.

In [8]:
# One-hot encode 'origin' column
df = pd.get_dummies(df, columns=['origin'], prefix='origin')


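### Explanation
`pd.get_dummies()` replaces the `origin` column with one indicator column per category (`origin_1`, `origin_2`, `origin_3`), allowing us to use these features in a regression model. The indicators are 0/1 integers in older pandas versions and `True`/`False` booleans in pandas 2.0 and later.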
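
A minimal sketch of the encoding on a hypothetical three-row frame follows. The `dtype=int` argument is an addition to the call used in this notebook; it forces 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical rows; 'origin' is stored as an integer code but is categorical
toy = pd.DataFrame({
    "weight": [3504, 2130, 2372],
    "origin": [1, 3, 2],
})

# One indicator column per distinct origin value; the original column is removed
encoded = pd.get_dummies(toy, columns=["origin"], prefix="origin", dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so the three indicator columns always sum to 1 per row.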

## Step 8: Separate Features and Target Variable <a name="separate-features-and-target-variable"></a>
We will now separate the features (independent variables) from the target variable (dependent variable, which is MPG in this case).

In [9]:
# Separate features and target
X = df.drop('mpg', axis=1)
y = df['mpg']


### Explanation
`X` contains all the features, while `y` contains the target variable we aim to predict.

## Step 9: Train-Test Split <a name="train-test-split"></a>
Next, we will split our dataset into training and testing sets. This helps evaluate the performance of our models on unseen data.

In [10]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Explanation
`train_test_split()` splits the data into training (80%) and testing (20%) sets. The `random_state` parameter ensures reproducibility.
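
The split sizes and the effect of `random_state` can be checked on synthetic data. The sketch below builds a hypothetical 398-row frame of random numbers (not the real dataset) and confirms that features and target stay aligned and that repeating the split with the same seed selects the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the car dataset (values are random)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.normal(3000, 800, 398),
    "cylinders": rng.integers(3, 9, 398),
})
y = pd.Series(rng.normal(23, 8, 398), name="mpg")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same seed, same rows: the split is reproducible
X_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
print(X_train.index.equals(y_train.index))   # features and target share row labels
print(X_train.index.equals(X_again.index))   # identical split on the second call
```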

## Step 10: Feature Scaling <a name="feature-scaling"></a>
To standardize the feature values, we will apply `StandardScaler`. This puts all features on a comparable scale, which matters for models that are sensitive to feature magnitudes, such as regularized linear models, k-nearest neighbors, and models trained by gradient descent.

In [11]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


### Explanation
- `StandardScaler` standardizes features by removing the mean and scaling to unit variance.
- We fit the scaler on the training data only, then apply the same transformation to both the training and test sets, so no information from the test set leaks into the scaling parameters.
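
The scaling can be reproduced by hand as `(x - mean) / std`, using the mean and standard deviation of the training data. The sketch below uses small hypothetical arrays to show that the test set is transformed with the training statistics, so its own mean is generally not zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: columns are weight and cylinders
X_train = np.array([[2000.0, 4.0], [3000.0, 6.0], [4000.0, 8.0]])
X_test = np.array([[3000.0, 4.0], [5000.0, 8.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train, then scales it
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

# Manual equivalent: StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std()
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(np.allclose(X_test_scaled, manual))
```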

## Step 11: Display Preprocessed Data <a name="display-preprocessed-data"></a>
Finally, we will display samples from the preprocessed training and testing datasets to verify our preprocessing steps.

In [12]:
# Print the preprocessed data
print("\nPreprocessed X_train sample:\n", X_train_scaled[:5])
print("Preprocessed X_test sample:\n", X_test_scaled[:5])



Preprocessed X_train sample:
 [[ 1.52718818  1.0901965   1.26183446  0.55282624 -1.31933367 -1.6966673
   0.78895436 -0.46232073 -0.51176632]
 [-0.85051483 -0.92299623 -0.41351298 -0.99966729 -0.41318225 -1.6966673
  -1.26750044 -0.46232073  1.95401684]
 [-0.85051483 -0.98134964 -0.95394763 -1.1247723   0.92792185  1.63897537
  -1.26750044 -0.46232073  1.95401684]
 [-0.85051483 -0.98134964 -1.1701215  -1.39285445  0.27549283  0.52709448
  -1.26750044 -0.46232073  1.95401684]
 [-0.85051483 -0.74793599 -0.22436085 -0.3276747  -0.23195197 -0.30681619
  -1.26750044  2.16300056 -0.51176632]]
Preprocessed X_test sample:
 [[-0.85051483 -0.98134964 -1.35927363 -1.39881183  0.63795339 -0.02884597
  -1.26750044 -0.46232073  1.95401684]
 [-0.85051483 -0.69930815 -0.65670857 -0.40988656  1.07290607  1.63897537
   0.78895436 -0.46232073 -0.51176632]
 [ 0.33833667  0.38995555 -0.08925218 -0.39916327 -0.9568731  -1.41869708
   0.78895436 -0.46232073 -0.51176632]
 [ 1.52718818  1.22635446  1.26183446

### Explanation
We print the first five rows of the scaled training and testing feature sets to inspect the results of our preprocessing. Because `StandardScaler` returns NumPy arrays, the column names are no longer shown.
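
If labeled output is easier to inspect, the scaled array can be wrapped back into a DataFrame. This is an optional sketch on a hypothetical three-row frame; the name `X_train_scaled_df` is illustrative and not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training features with a shuffled index, as after a train-test split
X_train = pd.DataFrame(
    {"weight": [2000.0, 3000.0, 4000.0], "acceleration": [12.0, 15.0, 18.0]},
    index=[10, 3, 7],
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # plain NumPy array, no labels

# Reattach the original column names and row index
X_train_scaled_df = pd.DataFrame(
    X_train_scaled, columns=X_train.columns, index=X_train.index
)

print(X_train_scaled_df.round(3))
```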

# Summary of Data Preprocessing for Car Dataset

In this notebook, we performed data preprocessing on a car dataset from the UCI Machine Learning Repository. The main steps involved are outlined below:

## Steps Overview

1. **Import Libraries**:
   - We imported `pandas` for data manipulation and `sklearn` for data preprocessing tasks.

2. **Load the Dataset**:
   - The dataset was loaded from a specified URL into a pandas DataFrame.

3. **Explore the Initial Data**:
   - Checked the shape, data types, and missing values in the dataset to understand its structure.

4. **Rename Columns**:
   - Column names were renamed for easier reference and readability.

5. **Handle Missing Values**:
   - The `horsepower` column was converted to numeric, and its missing values were replaced with the mean of the column.

6. **Drop Unnecessary Columns**:
   - The `car_name` column was removed as it did not contribute to our analysis.

7. **One-Hot Encoding of Categorical Variables**:
   - The `origin` column was one-hot encoded to convert it into a binary format suitable for modeling.

8. **Separate Features and Target Variable**:
   - The dataset was divided into features (`X`) and the target variable (`y`, MPG).

9. **Train-Test Split**:
   - The dataset was split into training and testing sets (80% train, 20% test) for model evaluation.

10. **Feature Scaling**:
    - Standardization was applied to the features using `StandardScaler` so that all features are on a comparable scale.

11. **Display Preprocessed Data**:
    - Samples from the preprocessed training and testing datasets were printed to verify the results.
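
The steps above can also be chained in a single function. The following is a rough sketch under stated assumptions, not the notebook's own code: `preprocess` is a hypothetical helper, the `demo` frame is synthetic random data with only a subset of the columns, and `dtype=float` is added to `get_dummies` so every column is numeric before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(df, target="mpg", test_size=0.2, random_state=42):
    """Hypothetical helper chaining the preprocessing steps listed above."""
    df = df.copy()
    # Coerce horsepower to numeric and fill missing values with the column mean
    df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
    df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())
    # Drop the free-text name and one-hot encode the origin code
    df = df.drop(columns=["car_name"])
    df = pd.get_dummies(df, columns=["origin"], prefix="origin", dtype=float)
    # Separate features and target, then split
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Fit the scaler on the training features only
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test


# Synthetic stand-in data (random values, not the real car dataset)
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    "car_name": [f"car_{i}" for i in range(n)],
    "cylinders": rng.choice([4, 6, 8], n),
    "horsepower": rng.normal(100, 30, n),
    "origin": np.resize([1, 2, 3], n),  # cycles 1, 2, 3 so every category appears
    "mpg": rng.normal(23, 8, n),
})
demo.loc[[3, 17], "horsepower"] = np.nan  # inject two missing values

X_train_scaled, X_test_scaled, y_train, y_test = preprocess(demo)
print(X_train_scaled.shape, X_test_scaled.shape)
```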

## Conclusion

The dataset has been successfully preprocessed, making it ready for further analysis or machine learning modeling. This includes handling missing values, encoding categorical variables, and standardizing features.

Feel free to expand on this analysis with machine learning models to predict MPG based on the processed features.
