## Description of the Task

The objective of this task is to **clean and preprocess a raw, real-world dataset** using the
**Pandas library** in Python so that it becomes suitable for data analysis and machine learning
applications.

Real-world datasets are often messy and contain issues such as missing values, inconsistent
formatting, incorrect data types, and duplicate records. If these issues are not handled
properly, they can negatively impact the accuracy and reliability of analysis and models.

Specifically, the task involves:

- Loading and inspecting a raw dataset to identify data quality issues  
- Handling missing values using appropriate strategies  
- Fixing inconsistencies in categorical and text-based columns  
- Converting columns into correct data types such as numeric and datetime  
- Removing duplicate records  
- Saving the cleaned dataset as a new CSV file  

This task highlights the importance of **data cleaning as the foundation of any data-driven
workflow**.

## Understanding Data Cleaning (Data Detox)

Data cleaning, also known as **data detox**, refers to the process of identifying and correcting
errors, inconsistencies, and missing values in a dataset.

Common data quality problems include:

- Missing or null values  
- Inconsistent text formatting (e.g., “Male”, “male”, “ MALE ”)  
- Incorrect data types (dates stored as strings, numbers stored as text)  
- Duplicate rows  
- Invalid or corrupted entries  

Pandas provides a rich set of functions to inspect, clean, and transform data, making it an
essential tool for preprocessing tasks in data science.

## Dataset Used

A customer-related dataset was used for this task.

- **Number of records:** Approximately 50,000  
- **Number of columns:** 10  

### Key Columns in the Dataset

- CustomerID  
- Name  
- Age  
- Gender  
- Country  
- SignupDate  
- LastLogin  
- TotalPurchase  
- PreferredDevice  
- Email  

### Characteristics of the Dataset

- Contains missing values in both numeric and categorical columns  
- Includes inconsistent text formatting  
- Date columns are initially stored as strings  
- May contain duplicate records  

This dataset closely resembles real-world business data, making it suitable for practicing
practical data cleaning techniques.

## Approach Followed to Solve the Task

### 1. Data Inspection and Exploration

The dataset was loaded using Pandas and explored using functions such as:

- `.head()` to view sample rows  
- `.info()` to inspect column data types and non-null counts  
- `.isnull().sum()` to count missing values per column  

A backup copy of the raw dataset was created before applying any transformations to ensure
data safety.
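
The inspection step can be sketched roughly as follows. This is a minimal illustration, not the exact code used for the task: the inline CSV text stands in for the raw file (in practice `pd.read_csv` would be given a file path), the sample rows are made up, and the name `df_backup` is only an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw CSV file; the rows are invented
# for illustration and only a few of the ten columns are shown.
raw_csv = io.StringIO(
    "CustomerID,Name,Age,Gender,Country,SignupDate\n"
    "1,Ann,34, Male ,india,2023-01-05\n"
    "2,Bob,,female,India,not a date\n"
    "3,Cy,51,,USA,2023-03-10\n"
)
df = pd.read_csv(raw_csv)

# Keep an untouched copy before applying any transformations.
df_backup = df.copy()

print(df.head())   # sample rows
df.info()          # column dtypes and non-null counts

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Because `df.copy()` returns an independent copy, later changes to `df` do not affect `df_backup`.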

### 2. Handling Missing Values

Missing values were handled carefully based on the nature of each column:

- **Age:** Filled using the **median**, as age data may contain outliers  
- **Gender and Country:** Filled using the **mode** (most frequent value), suitable for
  categorical data  
- **TotalPurchase:** Filled using the **mean**, appropriate for continuous numerical data  

This approach avoids discarding rows that have only a few missing fields, while keeping each column's central value (median, mode, or mean) unchanged.
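
A minimal sketch of these imputation rules, using a small made-up DataFrame in place of the real dataset (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data with one gap per column; values are invented for illustration.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 90.0],
    "Gender": ["male", "female", None, "male"],
    "Country": ["India", None, "India", "USA"],
    "TotalPurchase": [100.0, 200.0, np.nan, 300.0],
})

# Age: the median is less affected by outliers than the mean.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Gender and Country: fill with the most frequent value.
# mode() returns a Series (there can be ties), so take the first entry.
for col in ["Gender", "Country"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# TotalPurchase: fill with the column mean.
df["TotalPurchase"] = df["TotalPurchase"].fillna(df["TotalPurchase"].mean())
```

Assigning the result back to the column, rather than filling in place, keeps the code independent of how a given pandas version handles in-place modification.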

### 3. Fixing Text and Categorical Inconsistencies

To ensure consistency in categorical data:

- Leading and trailing whitespace was removed  
- Text was converted to a uniform format:
  - Gender → lowercase  
  - Country → title case  
  - PreferredDevice → lowercase  

These steps ensure that logically identical categories are treated as the same value.
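
The normalization above might look like the sketch below. The sample values are invented to mirror the kinds of inconsistencies described earlier.

```python
import pandas as pd

# Toy data with mixed casing and stray whitespace (illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", "male", " MALE "],
    "Country": [" india", "INDIA ", "united states"],
    "PreferredDevice": ["Mobile ", " DESKTOP", "mobile"],
})

# Strip surrounding whitespace first, then apply one casing rule per column.
df["Gender"] = df["Gender"].str.strip().str.lower()
df["Country"] = df["Country"].str.strip().str.title()
df["PreferredDevice"] = df["PreferredDevice"].str.strip().str.lower()

print(df["Gender"].unique())
```

After this step, the three spellings of "male" collapse into a single category.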

### 4. Correcting Data Types

Correct data types are essential for meaningful analysis:

- Date columns (SignupDate, LastLogin) were converted to datetime format  
- Invalid date entries were handled safely using coercion  
- Age was converted to integer type  
- TotalPurchase was converted to float type  

This allows accurate numerical and time-based analysis.
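
One way to express these conversions is sketched below. The sample values and the `%Y-%m-%d` date format are assumptions made for the example; the actual dataset may store dates differently.

```python
import pandas as pd

# Toy data: dates and purchase amounts stored as text (illustrative only).
df = pd.DataFrame({
    "SignupDate": ["2023-01-05", "not a date", "2023-03-10"],
    "LastLogin": ["2023-02-01", "2023-02-15", ""],
    "Age": [34.0, 28.0, 51.0],
    "TotalPurchase": ["120.50", "80", "15.25"],
})

# errors="coerce" turns unparseable entries into NaT instead of raising.
# The explicit format is an assumption about how the dates are written.
for col in ["SignupDate", "LastLogin"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d", errors="coerce")

# astype(int) raises if NaN is present, so missing ages must be filled first.
df["Age"] = df["Age"].astype(int)

# to_numeric handles numbers stored as text; invalid entries become NaN.
df["TotalPurchase"] = pd.to_numeric(df["TotalPurchase"], errors="coerce").astype(float)

print(df.dtypes)
```

Coercion trades an error for a missing value, so it is worth re-counting missing values in the date columns afterwards.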

### 5. Removing Duplicate Records

Duplicate rows can lead to biased analysis and incorrect insights.

- The dataset was checked for duplicate records  
- All duplicate rows were removed to ensure data integrity  

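This ensures that no row appears more than once. Rows that share a CustomerID but differ in other columns are not exact duplicates and would need a separate check.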
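
A short sketch of the duplicate check and removal, on made-up rows. Resetting the index afterwards is an optional extra added here for tidiness, not something stated in the steps above.

```python
import pandas as pd

# Toy data where the second and third rows are exact duplicates.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Name": ["Ann", "Bob", "Bob", "Cy"],
    "Country": ["India", "USA", "USA", "India"],
})

# duplicated() flags every repeat of an earlier row across all columns.
n_duplicates = df.duplicated().sum()
print(f"Duplicate rows found: {n_duplicates}")

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
```

By default both calls compare all columns; passing `subset=["CustomerID"]` would instead treat rows with the same ID as duplicates.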

### 6. Final Validation and Saving the Dataset

After cleaning, the dataset was validated to confirm:

- Correct data types  
- No unintended missing values  

The cleaned dataset was then saved as a new CSV file, while the raw dataset remained unchanged.
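
The final checks and export could be sketched like this. The example writes to an in-memory buffer so that it runs without touching the disk; in practice `to_csv` would receive an output file path, and the name `df_clean` is only illustrative.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned dataset (illustrative values only).
df_clean = pd.DataFrame({
    "CustomerID": [1, 2],
    "Age": [34, 28],
    "SignupDate": pd.to_datetime(["2023-01-05", "2023-03-10"]),
    "TotalPurchase": [120.5, 80.0],
})

# Validation: confirm dtypes and count any remaining missing values.
print(df_clean.dtypes)
remaining_missing = df_clean.isnull().sum()
print(remaining_missing)

# Save without the index column; a file path would replace the buffer.
buffer = io.StringIO()
df_clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Writing to a new file name, rather than the original one, is what keeps the raw dataset unchanged.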

## Outcomes and Learnings

Through this task, I gained practical experience in **cleaning real-world datasets using Pandas**.

Key learnings include:

- The importance of inspecting data before cleaning  
- Choosing appropriate strategies for handling missing values  
- Maintaining consistency in categorical data  
- Preserving raw data through backups  