# Data Cleaning

## What is Data Cleaning?

Data cleaning is the process of detecting and correcting (or removing) inaccurate, incomplete, inconsistent, or irrelevant data from a dataset to improve its quality before analysis or modeling.

## Why Data Cleaning Is Important

| Reason             | Explanation                                            |
| ------------------ | ------------------------------------------------------ |
| 🧠 Accuracy        | Ensures correct insights and predictions               |
| 🧼 Consistency     | Unifies variants of one value (e.g., "USA" vs "U.S.")  |
| ❌ Removes Errors   | Fixes typos, wrong data types, outliers                |
| 📊 Better Visuals  | Clean data yields clearer, more accurate charts        |
| 🧪 Model Readiness | Many ML algorithms cannot handle nulls or wrong types  |


## Common Data Cleaning Tasks

| Task                | Description                      | Example                      |
| ------------------- | -------------------------------- | ---------------------------- |
| Handle Missing Data | Fill or drop null values         | `df.dropna()`, `df.fillna()` |
| Remove Duplicates   | Remove repeated rows             | `df.drop_duplicates()`       |
| Fix Data Types      | Convert columns to correct types | `pd.to_datetime()`           |
| Format Columns      | Strip spaces, fix casing         | `df.columns.str.strip()`     |
| Standardize Text    | Consistent naming                | "Yes", "yes", "YES" → "Yes"  |
| Remove Outliers     | Drop or cap extreme values       | Z-score, IQR method          |
| Fix Inconsistencies | Unify formats                    | Date formats, phone numbers  |
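
The table names the IQR method for outliers without showing it. The following is a minimal sketch of that rule on a made-up `amount` column (the frame, the column name, and the conventional 1.5 multiplier are assumptions for illustration, not part of the customer dataset used later):

```python
import pandas as pd

# Hypothetical numeric column; the values are invented for illustration.
orders = pd.DataFrame({"amount": [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 250.0]})

# Interquartile range (IQR) fences: values more than 1.5 * IQR beyond
# the first or third quartile are treated as outliers.
q1 = orders["amount"].quantile(0.25)
q3 = orders["amount"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop rows that fall outside the fences.
inliers = orders[orders["amount"].between(lower, upper)]

# Option 2: cap extreme values at the fences instead of dropping them.
capped = orders["amount"].clip(lower=lower, upper=upper)

print(len(orders), len(inliers), capped.max())
```

Dropping loses rows, while capping keeps them but distorts the tail, so the better choice depends on whether the extreme values are errors or genuine observations.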


# Example: Data Cleaning in Python

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv("customers_data.csv")
df.head()

| | Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 1 | 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
| 2 | 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)978-3969x58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |
| 3 | 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | Dominican Republic | 001-808-617-6467x12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |
| 4 | 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | Slovakia (Slovak Republic) | 001-234-203-0635x76146 | 001-199-446-3860x3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |


In [4]:
# 1. Inspect column dtypes and non-null counts
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Index              100 non-null    int64 
 1   Customer Id        100 non-null    object
 2   First Name         100 non-null    object
 3   Last Name          100 non-null    object
 4   Company            100 non-null    object
 5   City               100 non-null    object
 6   Country            100 non-null    object
 7   Phone 1            100 non-null    object
 8   Phone 2            100 non-null    object
 9   Email              100 non-null    object
 10  Subscription Date  100 non-null    object
 11  Website            100 non-null    object
dtypes: int64(1), object(11)
memory usage: 9.5+ KB
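
`df.info()` reports non-null counts, but a direct tally of nulls and duplicated rows is often easier to scan. A minimal sketch on a small stand-in frame (the rows below are invented; only the column names echo the dataset above):

```python
import pandas as pd

# Small stand-in for the customer table; the rows are invented.
sample = pd.DataFrame({
    "Customer Id": ["A1", "B2", "B2", "C3"],
    "Email": ["a@example.com", "b@example.com", "b@example.com", None],
    "City": ["Lima", "Oslo", "Oslo", None],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = sample.duplicated().sum()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```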


In [5]:
# 2. Convert 'Subscription Date' to datetime; unparseable values become NaT
df['Subscription Date'] = pd.to_datetime(df['Subscription Date'], errors='coerce')
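
Because `errors='coerce'` silently turns bad values into `NaT`, it can help to count how many values failed to parse. A small sketch on a made-up series (the sample strings and the explicit `format` are assumptions; the real column may need a different format):

```python
import pandas as pd

# Invented sample: two valid dates, one malformed string, one missing value.
raw = pd.Series(["2020-08-24", "2021-04-23", "not a date", None])

# Malformed strings become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values that were present in the input but could not be parsed.
n_failed = parsed.isna().sum() - raw.isna().sum()

print(parsed)
print("failed to parse:", n_failed)
```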

In [6]:
# 3. Drop rows where 'Email' or 'Phone 1' is missing
df = df.dropna(subset=['Email', 'Phone 1'])
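
`dropna` only removes true missing values (`NaN`/`None`); empty or whitespace-only strings pass through untouched. One possible way to catch those as well, sketched on invented rows:

```python
import pandas as pd

# Invented rows: one valid email, one empty, one whitespace-only, one missing.
contacts = pd.DataFrame({
    "Email": ["a@example.com", "", "   ", None],
    "Phone 1": ["555-0100", "555-0101", "555-0102", "555-0103"],
})

# Strip whitespace, then blank out values that are empty after stripping.
stripped = contacts["Email"].str.strip()
contacts["Email"] = stripped.mask(stripped == "")

# Now dropna also removes the rows that only looked non-empty.
cleaned = contacts.dropna(subset=["Email", "Phone 1"])

print(cleaned)
```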

In [7]:
# 4. Remove fully duplicated rows (all columns identical)
df = df.drop_duplicates()
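
With no arguments, `drop_duplicates()` compares every column, so a unique `Index` column keeps otherwise repeated customers apart. A sketch of de-duplicating on a key column instead (the rows are invented, and treating `Customer Id` as the key is an assumption):

```python
import pandas as pd

# Invented rows: the first and third describe the same customer,
# but differ in the running 'Index' column.
customers = pd.DataFrame({
    "Index": [1, 2, 3],
    "Customer Id": ["DD37", "1Ef7", "DD37"],
    "Email": ["a@example.com", "b@example.com", "a@example.com"],
})

# Comparing all columns finds no duplicates because 'Index' differs.
exact = customers.drop_duplicates()

# Comparing only the key column keeps the first row per customer.
by_key = customers.drop_duplicates(subset=["Customer Id"], keep="first")

print(len(exact), len(by_key))
```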

In [8]:
# 5. Strip whitespace from column names
#    (ideally done right after loading, before columns are referenced by name)
df.columns = df.columns.str.strip()
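
To make the effect visible, here is a small sketch on a frame with deliberately messy headers; the optional snake_case renaming is an extra convention not used elsewhere in this notebook:

```python
import pandas as pd

# Empty frame with deliberately messy, hypothetical column names.
messy = pd.DataFrame(columns=[" First Name", "Last Name ", "Phone 1"])

# Remove leading and trailing whitespace from every column name.
messy.columns = messy.columns.str.strip()

# Optional: lowercase snake_case names are easier to type in code.
snake = messy.rename(columns=lambda name: name.lower().replace(" ", "_"))

print(list(messy.columns))
print(list(snake.columns))
```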

In [9]:
# 6. Fill missing 'City' with 'Unknown'
df['City'] = df['City'].fillna('Unknown')
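
In the dataset above, `df.info()` reports 100 non-null values for `City`, so this fill changes nothing there. On data that does have gaps, `fillna` also accepts a dict to give each column its own placeholder. A sketch with invented rows and placeholder strings:

```python
import pandas as pd

# Invented rows with gaps in two different columns.
people = pd.DataFrame({
    "City": ["East Leonard", None, "Bensonview"],
    "Company": [None, "Vega-Gentry", "Murillo-Perry"],
})

# Each key is a column name, each value the placeholder for that column.
filled = people.fillna({"City": "Unknown", "Company": "Not provided"})

print(filled)
```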

In [10]:
# 7. Normalize whitespace and casing in country names
df['Country'] = df['Country'].str.strip().str.upper()
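
Stripping and upper-casing removes spacing and casing differences, but it does not merge different spellings such as "USA" and "U.S.". One way to handle those is an explicit alias map applied after normalization; the values and aliases below are illustrative assumptions, not taken from the dataset:

```python
import pandas as pd

# Invented values showing casing, spacing, and spelling variants.
countries = pd.Series([" usa", "U.S.", "United States", "Chile ", "chile"])

# Step 1: the same normalization as in the cell above.
normalized = countries.str.strip().str.upper()

# Step 2: map known aliases to one canonical name (hypothetical mapping).
aliases = {"USA": "UNITED STATES", "U.S.": "UNITED STATES"}
unified = normalized.replace(aliases)

print(normalized.nunique(), unified.nunique())
```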

In [11]:
df.head()

| | Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | CHILE | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 1 | 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | DJIBOUTI | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
| 2 | 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | ANTIGUA AND BARBUDA | +1-539-402-0259 | (496)978-3969x58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |
| 3 | 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | DOMINICAN REPUBLIC | 001-808-617-6467x12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |
| 4 | 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | SLOVAKIA (SLOVAK REPUBLIC) | 001-234-203-0635x76146 | 001-199-446-3860x3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |
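
The phone columns in this output still mix dots, dashes, parentheses, and `x`-style extensions. The steps above do not address that, so the following is only a possible approach: split off the extension, then keep digits only. Country prefixes such as `+1` or `001` are left as plain digits, which a real pipeline would likely need to handle separately.

```python
import pandas as pd

# Sample values copied from the 'Phone 1' and 'Phone 2' columns shown above.
phones = pd.Series([
    "229.077.5154",
    "5153435776",
    "+1-539-402-0259",
    "(496)978-3969x58947",
])

# Split at the first 'x' into main number and optional extension.
parts = phones.str.split("x", n=1, expand=True)

# Keep only the digits of the main number.
digits = parts[0].str.replace(r"\D", "", regex=True)

# The extension is missing for numbers without an 'x'.
extension = parts[1]

print(pd.DataFrame({"digits": digits, "extension": extension}))
```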


In [12]:
df.count()

Index                100
Customer Id          100
First Name           100
Last Name            100
Company              100
City                 100
Country              100
Phone 1              100
Phone 2              100
Email                100
Subscription Date    100
Website              100
dtype: int64
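
Every column still reports 100 non-null values, so no rows were dropped and no dates failed to parse. Checks like this can also be written as code so they run automatically after each cleaning pass. A minimal sketch, where the `validate` helper and the specific rules are assumptions rather than part of the notebook:

```python
import pandas as pd

# Two rows taken from the output above, with the date column already converted.
cleaned = pd.DataFrame({
    "Customer Id": ["DD37Cf93aecA6Dc", "1Ef7b82A4CAAD10"],
    "Email": ["zunigavanessa@smith.info", "vmata@colon.com"],
    "Subscription Date": pd.to_datetime(["2020-08-24", "2021-04-23"]),
})

def validate(frame):
    """Return a list of human-readable problems found in the frame."""
    problems = []
    if frame["Email"].isna().any():
        problems.append("missing values in 'Email'")
    if frame["Customer Id"].duplicated().any():
        problems.append("duplicate 'Customer Id' values")
    if not pd.api.types.is_datetime64_any_dtype(frame["Subscription Date"]):
        problems.append("'Subscription Date' is not a datetime column")
    return problems

issues = validate(cleaned)
print(issues if issues else "no problems found")
```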

# Final Note


Clean data is reliable data. Before any visualization, analysis, or machine learning, clean the data thoroughly to avoid misleading conclusions.