# **Project 1: Cleaning Sales Data**

In this project, we will clean a dataset containing sales information stored in a CSV file. The goal is to prepare the data for analysis by handling missing values, removing duplicates, correcting column names, and saving the cleaned dataset.

## **Why is data cleaning important?**
Real-world data is often messy. It may have missing values, duplicate records, incorrect formatting, or errors in column names. If we don’t clean the data, our analysis could be inaccurate or misleading.

## **Specific Tasks:**

1. Load a Sales dataset from a CSV file.
2. Identify and handle missing values:
    * Replace missing values with the mean, median, or a specific value.
3. Detect and remove duplicate records.
4. Correct errors in column names (e.g., capitalization, whitespace, etc.)
5. Save the cleaned dataset to a new CSV file.

Now, let’s go through each step in detail.

## **Step 1: Initial Setup**

In [81]:
import pandas as pd  # Used for data manipulation

**``Pandas`` helps us load, manipulate, and clean data easily.**

## **Step 2: Load the Sales Dataset**

A dataset is usually stored in a CSV file (Comma-Separated Values), which is a common format for storing tabular data.

To load the dataset, we use the ``pd.read_csv()`` function:

In [82]:
df = pd.read_csv("car_sales.csv")  # Replace with your actual file name

After loading the data, let's check the first few rows to understand its structure.

**Why is this important?**

* It allows us to verify that the dataset was loaded correctly.
* We can see what kind of data we are working with.

In [83]:
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


|  | Manufacturer | Model | Sales_in_thousands | __year_resale_value | Vehicle_type | Price_in_thousands | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Fuel_efficiency | Latest_Launch | Power_perf_factor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Acura | Integra | 16.919 | 16.36 | Passenger | 21.5 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 2/2/2012 | 58.28015 |
| 1 | Acura | TL | 39.384 | 19.875 | Passenger | 28.4 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 6/3/2011 | 91.370778 |
| 2 | Acura | CL | 14.114 | 18.225 | Passenger | NaN | 3.2 | 225.0 | 106.9 | 70.6 | 192.0 | 3.47 | 17.2 | 26.0 | 1/4/2012 | NaN |
| 3 | Acura | RL | 8.588 | 29.725 | Passenger | 42.0 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.85 | 18.0 | 22.0 | 3/10/2011 | 91.389779 |
| 4 | Audi | A4 | 20.397 | 22.255 | Passenger | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 10/8/2011 | 62.777639 |


In [84]:
print("\nDataset information:")
df.info()


Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Manufacturer         157 non-null    object 
 1   Model                157 non-null    object 
 2   Sales_in_thousands   157 non-null    float64
 3   __year_resale_value  121 non-null    float64
 4   Vehicle_type         157 non-null    object 
 5   Price_in_thousands   155 non-null    float64
 6   Engine_size          156 non-null    float64
 7   Horsepower           156 non-null    float64
 8   Wheelbase            156 non-null    float64
 9   Width                156 non-null    float64
 10  Length               156 non-null    float64
 11  Curb_weight          155 non-null    float64
 12  Fuel_capacity        156 non-null    float64
 13  Fuel_efficiency      154 non-null    float64
 14  Latest_Launch        157 non-null    object 
 15  Power_perf_factor    155 non-null    float64

In [85]:
print("\nSummary statistics:")
df.describe()


Summary statistics:


|  | Sales_in_thousands | __year_resale_value | Price_in_thousands | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Fuel_efficiency | Power_perf_factor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 157.0 | 121.0 | 155.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 155.0 | 156.0 | 154.0 | 155.0 |
| mean | 52.998076 | 18.072975 | 27.390755 | 3.060897 | 185.948718 | 107.487179 | 71.15 | 187.34359 | 3.378026 | 17.951923 | 23.844156 | 77.043591 |
| std | 68.029422 | 11.453384 | 14.351653 | 1.044653 | 56.700321 | 7.641303 | 3.451872 | 13.431754 | 0.630502 | 3.887921 | 4.282706 | 25.142664 |
| min | 0.11 | 5.16 | 9.235 | 1.0 | 55.0 | 92.6 | 62.6 | 149.4 | 1.895 | 10.3 | 15.0 | 23.276272 |
| 25% | 14.114 | 11.26 | 18.0175 | 2.3 | 149.5 | 103.0 | 68.4 | 177.575 | 2.971 | 15.8 | 21.0 | 60.407707 |
| 50% | 29.45 | 14.18 | 22.799 | 3.0 | 177.5 | 107.0 | 70.55 | 187.9 | 3.342 | 17.2 | 24.0 | 72.030917 |
| 75% | 67.956 | 19.875 | 31.9475 | 3.575 | 215.0 | 112.2 | 73.425 | 196.125 | 3.7995 | 19.575 | 26.0 | 89.414878 |
| max | 540.561 | 67.55 | 85.5 | 8.0 | 450.0 | 138.7 | 79.9 | 224.5 | 5.572 | 32.0 | 45.0 | 188.144323 |


### **Explanation:**

* ``pd.read_csv()`` loads the dataset into a DataFrame.

* ``df.head()`` displays the first 5 rows to give an overview of the data.

* ``df.info()`` provides information about the dataset, including column names, data types, and missing values.

* ``df.describe()`` gives summary statistics for numerical columns.
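
As a minimal sketch of the same four calls, the snippet below reads a tiny in-memory CSV instead of ``car_sales.csv``. The three sample rows are taken from the preview above with most columns left out; ``sample_csv`` and ``sample`` are names chosen for this illustration only.

```python
import io

import pandas as pd

# Three rows copied from the preview above, reduced to four columns.
# The last row has an empty price field, which pandas reads as NaN.
sample_csv = io.StringIO(
    "Manufacturer,Model,Sales_in_thousands,Price_in_thousands\n"
    "Acura,Integra,16.919,21.5\n"
    "Acura,TL,39.384,28.4\n"
    "Acura,CL,14.114,\n"
)

sample = pd.read_csv(sample_csv)

print(sample.head(2))     # first two rows only
sample.info()             # column names, non-null counts, dtypes
print(sample.describe())  # statistics for the numeric columns only
```

Note that ``describe()`` skips the text columns by default, which is why ``Manufacturer`` and ``Model`` do not appear in the summary statistics.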

## **Step 3: Identify and Handle Missing Values**

Missing values can cause issues in analysis. We’ll handle them by either replacing them with appropriate values or removing rows/columns with too many missing values.

### **3.1. What are missing values?**

Missing values occur when certain data points are missing from the dataset. This can happen for many reasons, such as:

* A salesperson forgot to enter some information.
* A system error caused data loss.

To check for missing values in the dataset, we use:

In [86]:
df.isnull().sum()  # Count missing values in each column

Manufacturer            0
Model                   0
Sales_in_thousands      0
__year_resale_value    36
Vehicle_type            0
Price_in_thousands      2
Engine_size             1
Horsepower              1
Wheelbase               1
Width                   1
Length                  1
Curb_weight             2
Fuel_capacity           1
Fuel_efficiency         3
Latest_Launch           0
Power_perf_factor       2
dtype: int64

The output shows how many values are missing in each column.
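
Raw counts are easier to judge next to the share of rows they represent. In the output above, for example, 36 missing values out of 157 rows is roughly 23% of ``__year_resale_value``. The sketch below shows one way to build such a report on a made-up toy frame; ``toy``, ``missing_count``, ``missing_pct`` and ``report`` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "Model": ["A", "B", "C", "D"],
    "Price_in_thousands": [21.5, np.nan, 28.4, 42.0],
    "__year_resale_value": [np.nan, np.nan, 18.2, np.nan],
})

missing_count = toy.isnull().sum()
# The mean of a True/False column is the fraction of True values.
missing_pct = toy.isnull().mean() * 100

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report.sort_values("percent", ascending=False))
```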

### **3.2. How to handle missing values?**

We can fill missing values with a meaningful replacement, such as:

* **Mean (average)** – Good for numerical data (e.g., sales prices).
* **Median (middle value)** – Useful when there are extreme values (outliers).
* **A specific value** – Useful for categorical data (e.g., if a car's ``Model`` is missing, we can replace it with "Unknown").
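
To see why outliers matter for this choice, compare the two statistics on a small made-up series. The same pattern shows up in the summary statistics above, where ``Sales_in_thousands`` has a mean of about 53 but a median of 29.45 because of a few very large values. The numbers and variable names below are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up sales figures with one extreme value (500.0) and one gap.
sales = pd.Series([10.0, 12.0, 14.0, 500.0, np.nan])

# Both statistics ignore the missing entry.
mean_value = sales.mean()      # (10 + 12 + 14 + 500) / 4 = 134.0
median_value = sales.median()  # midpoint of 12 and 14 = 13.0

filled_with_mean = sales.fillna(mean_value)
filled_with_median = sales.fillna(median_value)

print(f"mean = {mean_value}, median = {median_value}")
```

Filling the gap with 134.0 would invent a value larger than three of the four observed ones, while 13.0 sits among the typical values.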

### **a. Replace Missing Values**
We can replace missing values in numerical columns with the mean, median, or a specific value. For categorical columns, we might use the mode.

In [87]:
# Replace missing values in numerical columns with the mean
df['Sales_in_thousands'] = df['Sales_in_thousands'].fillna(df['Sales_in_thousands'].mean())
df['Price_in_thousands'] = df['Price_in_thousands'].fillna(df['Price_in_thousands'].mean())
df['Engine_size'] = df['Engine_size'].fillna(df['Engine_size'].mean())

# Replace missing values in categorical columns with the mode
df['Vehicle_type'] = df['Vehicle_type'].fillna(df['Vehicle_type'].mode()[0])

# Replace missing values in 'Horsepower' with the median
df['Horsepower'] = df['Horsepower'].fillna(df['Horsepower'].median())

# Replace missing values in '__year_resale_value' with 0
df['__year_resale_value'] = df['__year_resale_value'].fillna(0)
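
The cell above fills only some of the columns. According to the missing-value counts, ``Wheelbase``, ``Width``, ``Length``, ``Curb_weight``, ``Fuel_capacity``, ``Fuel_efficiency`` and ``Power_perf_factor`` still contain gaps afterwards. One possible way to cover every numeric column in a single step is sketched below on a toy frame. Using the median everywhere is an assumption made for this example, not a choice taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "Curb_weight": [2.6, np.nan, 3.4],
    "Fuel_efficiency": [28.0, 25.0, np.nan],
})

# Select the numeric columns, then fill each one with its own median.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy)
```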

### **b. Remove Rows or Columns with Excessive Missing Data**
If a row or column has too many missing values, we’ll remove it.

In [88]:
# Drop rows that are mostly empty: a row is kept only if at least
# len(df.columns) // 2 of its values are present
df.dropna(thresh=len(df.columns) // 2, inplace=True)

# Drop columns that are mostly empty: a column is kept only if at least
# len(df) // 2 of its values are present
df.dropna(axis=1, thresh=len(df) // 2, inplace=True)

### **Explanation:**

* ``fillna()`` replaces missing values with the specified value (mean, median, mode, or a constant).

* ``dropna()`` removes rows or columns with missing values. The ``thresh`` parameter specifies the minimum number of non-missing values required to keep a row or column.
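
Because ``thresh`` counts values that are present rather than values that are missing, it is easy to read it the wrong way round. A minimal sketch on a made-up four-column frame (``toy``, ``min_present`` and ``kept`` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 6.0, 7.0],
})

# Require at least 2 present values per row (half of the 4 columns).
min_present = len(toy.columns) // 2
kept = toy.dropna(thresh=min_present)

# Row 0 has 4 present values, row 1 has 2, row 2 has only 1.
print(kept)
```

Rows 0 and 1 survive and row 2 is dropped. Without ``inplace=True`` the original ``toy`` frame is left untouched, which makes it easy to compare before and after.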

## **Step 4: Detect and Remove Duplicate Records**

Duplicate records can skew your analysis. Let’s detect and remove them.

### **4.1. What are duplicate records?**

Duplicates happen when the same data appears more than once. This can happen due to:

* Data entry errors.
* The same sales transaction being recorded multiple times.

In [89]:
# Detect duplicate rows
duplicates = df.duplicated()

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Display the number of rows after removing duplicates
print(f"Number of rows after removing duplicates: {len(df)}")

Number of rows after removing duplicates: 157


### **Explanation:**

* ``df.duplicated()`` identifies duplicate rows.

* ``df.drop_duplicates()`` removes duplicate rows from the DataFrame.

* ``inplace=True`` ensures the changes are applied directly to the DataFrame.
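
In the run above the row count stayed at 157, the same number of entries reported by ``df.info()``, so no exact duplicates were removed. The sketch below uses a made-up frame to show how to count duplicates before dropping them, and how the ``subset`` and ``keep`` parameters change what counts as a duplicate. Treating manufacturer plus model as the identifying key is an assumption for this example only.

```python
import pandas as pd

# Made-up rows: row 1 repeats row 0 exactly, row 3 repeats only the key.
toy = pd.DataFrame({
    "Manufacturer": ["Acura", "Acura", "Audi", "Acura"],
    "Model": ["TL", "TL", "A4", "TL"],
    "Price_in_thousands": [28.4, 28.4, 23.99, 30.0],
})

# Rows identical to an earlier row in every column.
n_exact = toy.duplicated().sum()

# Rows that repeat an earlier manufacturer/model pair, whatever the price.
n_by_key = toy.duplicated(subset=["Manufacturer", "Model"]).sum()

# Keep the first occurrence of each manufacturer/model pair.
deduped = toy.drop_duplicates(subset=["Manufacturer", "Model"], keep="first")

print(f"exact duplicates: {n_exact}, duplicates by key: {n_by_key}")
print(deduped)
```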

## **Step 5: Correct Errors in Column Names**

Column names often have issues like extra spaces, incorrect capitalization, or special characters. Let’s clean them up.

### **5.1. Why correct column names?**

Sometimes, column names contain:

* Extra spaces.
* Inconsistent capitalization (``Sales Price``, ``sales_price``).
* Special characters that make them hard to use in code.

In [90]:
# Strip whitespace from column names
df.columns = df.columns.str.strip()

# Convert column names to lowercase
df.columns = df.columns.str.lower()

# Replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_')

# Display the cleaned column names
print("Cleaned column names:")
print(df.columns)

Cleaned column names:
Index(['manufacturer', 'model', 'sales_in_thousands', '__year_resale_value',
       'vehicle_type', 'price_in_thousands', 'engine_size', 'horsepower',
       'wheelbase', 'width', 'length', 'curb_weight', 'fuel_capacity',
       'fuel_efficiency', 'latest_launch', 'power_perf_factor'],
      dtype='object')


### **Explanation:**

* ``str.strip()`` removes leading and trailing spaces.

* ``str.lower()`` converts column names to lowercase.

* ``str.replace(' ', '_')`` replaces spaces with underscores.

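* ``rename()`` (not used in the cell above) allows you to rename specific columns, for example to drop the leading underscores from ``__year_resale_value``.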
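
The cleaned column list above still contains ``__year_resale_value`` with its leading underscores. A possible extension, sketched on an empty toy frame, chains the three string steps, strips stray underscores, and then uses ``rename()`` for a single column. The new name ``resale_value`` and the messy name `` Sales Price `` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Empty frame with deliberately messy column names.
toy = pd.DataFrame(columns=[" Sales Price ", "__year_resale_value", "Vehicle_type"])

toy.columns = (
    toy.columns
    .str.strip()            # remove leading/trailing spaces
    .str.lower()            # lowercase everything
    .str.replace(" ", "_")  # spaces inside a name become underscores
    .str.strip("_")         # remove leading/trailing underscores
)

# Rename one specific column; the mapping is {old_name: new_name}.
toy = toy.rename(columns={"year_resale_value": "resale_value"})

print(list(toy.columns))
```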

## **Step 6: Save the Cleaned Dataset**

Finally, let’s save the cleaned dataset to a new CSV file for future use.



In [91]:
# Save the cleaned dataset to a new CSV file
df.to_csv('cleaned_car_sales.csv', index=False)

print("Cleaned dataset saved as 'cleaned_car_sales.csv'")

Cleaned dataset saved as 'cleaned_car_sales.csv'


### **Explanation:**

* ``to_csv()`` saves the DataFrame to a CSV file.

* ``index=False`` ensures that the row indices are not included in the file.
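
To make the effect of ``index=False`` concrete, the sketch below writes a small made-up frame twice into a temporary directory and reads both files back. The file names and variable names are illustrative only.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "model": ["Integra", "TL"],
    "price_in_thousands": [21.5, 28.4],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # default: the row index is written too
    toy.to_csv(without_index, index=False)  # only the data columns are written

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

When the index is written, it comes back as an extra column that pandas labels ``Unnamed: 0`` on reload. That is the clutter ``index=False`` avoids.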



## **Conclusion**

In this project, we cleaned a car sales dataset by:

1. Loading the data and performing initial exploration.

2. Handling missing values by replacing or removing them.

3. Detecting and removing duplicate records.

4. Cleaning and standardizing column names.

5. Saving the cleaned dataset to a new file.

The cleaned dataset (``cleaned_car_sales.csv``) is now ready for analysis or machine learning tasks.

Dataset: https://www.kaggle.com/datasets/gagandeep16/car-sales