## Description of the Task

The objective of this task was to **clean and preprocess a raw, real-world dataset** using the
**Pandas library in Python**, so that it becomes suitable for further data analysis and machine
learning applications.

In real-world scenarios, datasets are rarely clean. They often contain missing values,
inconsistent formatting, incorrect data types, and duplicate records. If these issues are not
handled properly, they can significantly affect the accuracy and reliability of any analysis
or model built on the data.

In this task, I focused on identifying and fixing these common data quality issues in a
structured and systematic way.

Specifically, the task involved:

- Loading and inspecting the raw dataset to identify data quality issues  
- Handling missing values using appropriate techniques  
- Fixing inconsistencies in categorical and text-based columns  
- Converting columns into correct data types such as numeric and datetime  
- Removing duplicate records  
- Saving the cleaned dataset as a new CSV file  

## Understanding Data Cleaning (Data Detox)

Data cleaning, also referred to as **data detox**, is the process of identifying and correcting
errors, inconsistencies, and missing values present in a dataset.

While working on this task, I observed that real-world data commonly suffers from:

- Missing or null values  
- Inconsistent text formatting (e.g., “Male”, “male”, “ MALE ”)  
- Incorrect data types (such as dates stored as strings)  
- Duplicate rows  
- Invalid or corrupted entries  

The Pandas library provides a wide range of functions that make it easier to inspect, clean,
and transform data efficiently. This task helped me understand how essential data cleaning is
before performing any meaningful analysis or machine learning.

## Dataset Used

For this task, I worked with a **customer-related dataset**.

- **Number of records:** Approximately 50,000  
- **Number of columns:** 10  

### Key Columns in the Dataset

- CustomerID  
- Name  
- Age  
- Gender  
- Country  
- SignupDate  
- LastLogin  
- TotalPurchase  
- PreferredDevice  
- Email  

### Characteristics of the Dataset

While exploring the dataset, I noticed that:

- Several numeric and categorical columns contained missing values  
- Text-based columns had inconsistent formatting  
- Date columns were stored as strings instead of datetime objects  
- There was a possibility of duplicate records  

## Approach Followed to Solve the Task

### 1. Data Inspection and Exploration

I began by loading the dataset using Pandas and performing an initial inspection to understand
its structure and quality.

To do this, I used:

- `.head()` to view a few sample rows  
- `.info()` to check data types and non-null counts  
- `.isnull().sum()` to identify columns with missing values  

This step helped me clearly understand which columns required cleaning and what kind of
issues were present in the dataset.
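
The inspection calls above can be sketched as follows. This is a minimal illustration rather than the exact code used: the in-memory CSV and its values are invented stand-ins for the raw file, which would normally be loaded with `pd.read_csv` from its path on disk.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the raw file (values are invented).
raw_csv = io.StringIO(
    "CustomerID,Name,Age,Gender,Country,SignupDate\n"
    "1,Asha,34,Male,india,2021-01-05\n"
    "2,Ben,,female,USA,not a date\n"
    "3,Chen,29,,Canada,2021-03-17\n"
)
df = pd.read_csv(raw_csv)

print(df.head())   # a few sample rows
df.info()          # dtypes and non-null counts (prints its own report)

# Number of missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)
```

Empty fields in the CSV are read as `NaN`, so they show up in `missing_counts` without any extra handling.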

### 2. Handling Missing Values

I handled missing values based on the nature of each column rather than using a single
strategy for all of them:

- **Age:** Filled missing values using the **median**, as age data can contain outliers  
- **Gender and Country:** Filled missing values using the **mode**, which is suitable for
  categorical variables  
- **TotalPurchase:** Filled missing values using the **mean**, since it is a continuous
  numerical feature  

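This approach keeps each column's typical value (median, mode, or mean) unchanged. Filling
with a single value does reduce a column's variance, so it works best when relatively few
entries are missing.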
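
A minimal sketch of the per-column strategy described above, assuming a small hypothetical DataFrame (the values are invented for illustration; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value per column.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Gender": ["male", "female", None, "male"],
    "Country": ["India", None, "India", "Canada"],
    "TotalPurchase": [100.0, 200.0, np.nan, 300.0],
})

# Age: the median is robust to outliers.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Gender and Country: most frequent category.
# mode() returns a Series (ties are possible), so take the first entry.
for col in ["Gender", "Country"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# TotalPurchase: mean of the observed values.
df["TotalPurchase"] = df["TotalPurchase"].fillna(df["TotalPurchase"].mean())
```

Both `median()` and `mean()` skip missing values by default, so the fill value is computed from the observed entries only.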

### 3. Fixing Text and Categorical Inconsistencies

To ensure consistency across categorical columns, I cleaned and standardized text data:

- Removed leading and trailing whitespace  
- Converted text into a uniform format:
  - **Gender** → lowercase  
  - **Country** → title case  
  - **PreferredDevice** → lowercase  

This step was important to avoid treating the same category as different values during
analysis.
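
The standardization above could look like the following sketch, which uses the Pandas `.str` accessor on a hypothetical sample (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample with inconsistent casing and stray whitespace.
df = pd.DataFrame({
    "Gender": ["Male", "male", " MALE "],
    "Country": [" india", "UNITED STATES ", "canada"],
    "PreferredDevice": ["Mobile ", "DESKTOP", " tablet"],
})

# Strip surrounding whitespace first, then apply the target casing.
df["Gender"] = df["Gender"].str.strip().str.lower()
df["Country"] = df["Country"].str.strip().str.title()
df["PreferredDevice"] = df["PreferredDevice"].str.strip().str.lower()
```

One caveat with title case: abbreviations are also affected (for example, `"USA"` becomes `"Usa"`), so such values may need a separate mapping.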

### 4. Correcting Data Types

Correct data types are essential for accurate analysis and model building. I converted
columns to appropriate data types as follows:

- Converted **SignupDate** and **LastLogin** to datetime format  
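- Used coercion (`errors="coerce"` in `pd.to_datetime`) so that invalid date entries became
  missing values (`NaT`) instead of raising errors  
- Converted **Age** to integer type, which is possible once its missing values have been filled  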
- Converted **TotalPurchase** to float type  

This ensured that numerical and date-based operations could be performed without errors.
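
A minimal sketch of these conversions, assuming a hypothetical sample in which dates and purchase amounts arrive as strings (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample: dates and amounts stored as strings, Age as float.
df = pd.DataFrame({
    "SignupDate": ["2021-01-05", "not a date", "2021-03-17"],
    "LastLogin": ["2022-06-01", "2022-07-15", ""],
    "Age": [34.0, 29.0, 41.0],
    "TotalPurchase": ["120.50", "80", "15.25"],
})

# errors="coerce" turns unparseable entries into NaT instead of raising.
for col in ["SignupDate", "LastLogin"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# astype(int) requires that Age has no missing values left.
df["Age"] = df["Age"].astype(int)
df["TotalPurchase"] = df["TotalPurchase"].astype(float)
```

Coerced dates become new missing values, so it is worth re-running `.isnull().sum()` on the date columns after this step.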

### 5. Removing Duplicate Records

Duplicate records can distort analysis and lead to misleading insights.

- I checked the dataset for duplicate rows  
- All identified duplicates were removed to maintain data integrity  
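
The duplicate check and removal can be sketched as below. The sample rows are hypothetical, and the sketch assumes that a duplicate means a row identical in every column:

```python
import pandas as pd

# Hypothetical sample in which the second and third rows are identical.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Name": ["Asha", "Ben", "Ben", "Chen"],
    "Email": [
        "asha@example.com",
        "ben@example.com",
        "ben@example.com",
        "chen@example.com",
    ],
})

# duplicated() flags every repeat of an earlier row (the first copy is not flagged).
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = df.drop_duplicates().reset_index(drop=True)
```

If duplicates should instead be judged on a key such as `CustomerID`, both methods accept a `subset` argument.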

### 6. Final Validation and Saving the Dataset

After completing all cleaning steps, I performed a final validation to ensure that:

- All columns had correct data types  
- No unintended missing values remained  
- The dataset was consistent and analysis-ready  

Finally, I saved the cleaned dataset as a **new CSV file**, preserving the original raw dataset
as a backup.
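
A minimal sketch of the final validation and save, assuming a small hypothetical cleaned DataFrame. The output file name is a placeholder, and a temporary directory is used only to keep the example self-contained:

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned sample; values are invented for illustration.
cleaned = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [34, 29, 41],
    "SignupDate": pd.to_datetime(["2021-01-05", "2021-02-11", "2021-03-17"]),
    "TotalPurchase": [120.5, 80.0, 15.25],
})

# Final checks: expected dtypes, no missing values, no duplicate rows.
assert pd.api.types.is_integer_dtype(cleaned["Age"])
assert pd.api.types.is_datetime64_any_dtype(cleaned["SignupDate"])
assert pd.api.types.is_float_dtype(cleaned["TotalPurchase"])
assert cleaned.isnull().sum().sum() == 0
assert not cleaned.duplicated().any()

# Write to a new file so the raw CSV is never overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "customers_cleaned.csv")
cleaned.to_csv(out_path, index=False)

# CSV does not store dtypes, so dates must be parsed again on reload.
reloaded = pd.read_csv(out_path, parse_dates=["SignupDate"])
```

On the real dataset, the missing-value check may need to exclude date columns in which invalid entries were intentionally coerced to `NaT`.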

---

## Outcomes and Learnings

Through this task, I gained hands-on experience in **cleaning and preprocessing real-world
datasets using Pandas**.

Key learnings from this task include:

- The importance of thoroughly inspecting data before cleaning  
- Choosing appropriate strategies for handling different types of missing values  
- Maintaining consistency in categorical data to avoid logical errors  
- The significance of preserving raw data before applying transformations  

Overall, this task strengthened my understanding of data preprocessing and highlighted how
crucial data cleaning is in the data science workflow.
