# Week 02 Project: Data Cleaning and Preprocessing

 
Topics: 
1) Understanding messy data and the need for data cleaning 
2) Handling missing values, duplicates, and outliers 
3) Data transformation: normalization and standardization 
4) Using Pandas and NumPy for data preprocessing 
5) Introduction to Regular Expressions for text cleaning 

Python Project: "Data Cleaning Challenge". Given a messy dataset, clean it by:
- Removing missing values
- Standardizing formats
- Identifying and handling outliers

Dataset: Attached is a CSV file with missing values and inconsistent formats.

## Importing Requirements and Getting the Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
#messy_data = pd.read_csv('https://raw.githubusercontent.com/Impact-Insights/Group-Project/refs/heads/main/DMD%20Data%20Group%201%20W_02%20Submission/messy_dataset.csv?token=GHSAT0AAAAAAC6XVLETHM7AG6PH4UJMGHLUZ5YHIOA')
messy_data = pd.read_csv('messy_data.csv')

## Understanding the Data

In [None]:
messy_data

In [None]:
messy_data.info()

In [None]:
messy_data.describe()

From the output above, several issues stand out that we can fix:

1. We can change the `ID` column data format to be an `object` as we are not going to perform any calculations on it.
2. The `Age` column has missing values: `info()` reports 36 non-null values where there should be 45, which leaves 9 missing. Its data type can be changed to an integer, since age is recorded as a whole number and we can perform calculations on it.
   - We can fill the NaN values in the `Age` column with the column mean, which is suitable here because the data indicates that most of the people are in the same age group.
3. The `Salary ($)` column can be converted to a numeric type such as `float64`, so that we can perform calculations on it.
4. The `Joining Date` column can be converted to a date-time type, since it holds dates.
5. The `messy_data.head()` output shows inconsistent formatting of text, dates, and numbers across the columns. We can standardize and normalize the data to obtain a consistently formatted dataset.
6. Since `ID` is a unique identifier, the column should not contain any duplicate values.
7. The `Email` column should also contain unique entries, since no more than one person can own the same email address.
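
The observations above can also be checked programmatically rather than by eye. The sketch below uses a small hypothetical DataFrame named `sample` with made-up values (only the column names `ID`, `Age`, and `Email` mirror the dataset) to count missing values per column and to count repeated identifiers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; the values are made up.
sample = pd.DataFrame({
    "ID": [1, 2, 2, 4],
    "Age": [25.0, np.nan, 31.0, np.nan],
    "Email": ["a@example.com", "b@example.com", "b@example.com", None],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows whose ID or Email repeats an earlier row.
# Missing emails are dropped first so they are not counted as duplicates.
duplicate_ids = sample["ID"].duplicated().sum()
duplicate_emails = sample["Email"].dropna().duplicated().sum()

print(missing_per_column)
print("duplicate IDs:", duplicate_ids)
print("duplicate emails:", duplicate_emails)
```

The same three calls should work unchanged on `messy_data`, assuming its columns carry these names.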

### 1. Understanding Messy Data and the Need for Data Cleaning

### 2. Handling Missing Values, Duplicates, and Outliers

In [None]:
messy_data.columns

#### Working with missing values (Filling with Mean Value) [Age Column]

In [None]:
# Fill missing ages with the column mean. Series.mean() skips NaN values,
# so the mean is computed from the observed ages only.
messy_data['Age'] = messy_data['Age'].fillna(messy_data['Age'].mean())
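
Mean imputation is easiest to see on a tiny example. The sketch below uses a hypothetical series `ages` (made-up values) and compares the mean of the observed values with the mean obtained when zeros are used as placeholders first. The zeros are counted in the average, so they pull it down.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([20.0, np.nan, 30.0, np.nan, 40.0])

# Series.mean() skips NaN by default, so only observed values are averaged.
observed_mean = ages.mean()
filled = ages.fillna(observed_mean)

# If zeros are used as placeholders first, they are included in the mean.
mean_with_zero_placeholders = ages.fillna(0).mean()

print("mean of observed values:", observed_mean)
print("mean after filling with 0:", mean_with_zero_placeholders)
print(filled.tolist())
```

Computing the mean before any placeholder is written keeps the imputed value representative of the observed ages.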

In [None]:
messy_data

In [None]:
# Round ages to whole numbers (the imputed mean may be fractional).
# errors='coerce' turns any non-numeric value into NaN.
messy_data['Age'] = pd.to_numeric(messy_data['Age'], errors='coerce').round(0)


In [None]:
messy_data['Age'].head(2)

#### Working with Duplicates and Missing Data [Email Column]

In [None]:
# Keep only the first row for each email address
messy_data = messy_data.drop_duplicates(subset='Email')

In [None]:
# Drop every row that still contains a missing value in any column
messy_data = messy_data.dropna().copy()
messy_data
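
Outliers are part of this section's topic but are not handled in the cells above. As a minimal sketch, the 1.5 × IQR rule is one common convention for flagging them. It is an assumption here, not a method prescribed by the assignment. The series `salaries` below is hypothetical, with one deliberately extreme value.

```python
import pandas as pd

# Hypothetical salaries; 250000 is deliberately extreme.
salaries = pd.Series([48000, 52000, 50000, 51000, 49000, 250000], dtype="float64")

# Interquartile range and the conventional 1.5 * IQR fences.
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences.
is_outlier = (salaries < lower) | (salaries > upper)
outliers = salaries[is_outlier]

# One way to handle them: cap values at the fences instead of dropping rows.
capped = salaries.clip(lower=lower, upper=upper)

print(outliers.tolist())
print(capped.tolist())
```

Whether to drop, cap, or keep flagged values depends on whether they are data-entry errors or genuine extremes, so the result is worth inspecting before any row is removed.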

### 3. Data Transformation: Normalization and Standardization
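
For numeric columns, normalization usually means rescaling to a fixed range, and standardization means rescaling to zero mean and unit variance. The sketch below shows both on a hypothetical series `values`. Using the population standard deviation (`ddof=0`) is an assumption; the sample standard deviation (`ddof=1`) is an equally common choice.

```python
import pandas as pd

# Hypothetical numeric column.
values = pd.Series([10.0, 20.0, 30.0, 40.0])

# Min-max normalization: rescale to the range [0, 1].
normalized = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std(ddof=0)

print(normalized.tolist())
print(standardized.round(3).tolist())
```

Both formulas divide by zero when every value in the column is identical, so a constant column needs separate handling.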

#### Working with the Email Column

In [None]:
# Manually overwrite the email addresses in the rows labelled 0 and 3
messy_data.loc[0, 'Email'] = "eve@example.com"
messy_data.loc[3, 'Email'] = "david@example.com"

In [None]:
messy_data['Email'] = messy_data['Email'].str.lower()
messy_data
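
Regular expressions can flag malformed addresses without fixing them by hand. The sketch below trims and lowercases a hypothetical series `emails`, then checks each value against a deliberately simple pattern. `EMAIL_PATTERN` is an illustrative assumption, not a complete validator of the email standard.

```python
import pandas as pd

# Hypothetical addresses, including a malformed one and a missing one.
emails = pd.Series(["  Alice@Example.COM ", "bob@example", "carol@example.org", None])

# Normalize: remove surrounding whitespace and lowercase.
cleaned = emails.str.strip().str.lower()

# Simple shape check: local part, "@", domain, ".", and a top-level domain
# of at least two letters. Missing values are treated as invalid (na=False).
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[a-z]{2,}"
is_valid = cleaned.str.fullmatch(EMAIL_PATTERN, na=False)

print(cleaned.tolist())
print(is_valid.tolist())
```

Rows where `is_valid` is `False` can then be reviewed or corrected individually.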

#### Working with the Salary ($) column

In [None]:
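# Strip currency symbols and stray punctuation from both ends of each value,
# then remove the thousands separators
messy_data.loc[:, 'Salary ($)'] = messy_data['Salary ($)'].str.strip(',.$')
messy_data.loc[:, 'Salary ($)'] = messy_data['Salary ($)'].str.replace(',', "", regex=False)

# Convert the cleaned strings to floating-point numbers
messy_data['Salary ($)'] = messy_data['Salary ($)'].astype('float64')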

messy_data
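
An alternative sketch for the same clean-up uses a single regular expression that removes every character except digits and the decimal point, followed by `pd.to_numeric` with `errors="coerce"`, so that unparseable entries become NaN instead of raising an error. The series `raw` is hypothetical, and the approach assumes salaries are non-negative and use `.` as the decimal separator.

```python
import pandas as pd

# Hypothetical salary strings in inconsistent formats.
raw = pd.Series(["$50,000", "60000.50", " 70,000$ ", "n/a"])

# Remove everything that is not a digit or a decimal point.
digits_only = raw.str.replace(r"[^\d.]", "", regex=True)

# Convert to numbers; values that cannot be parsed become NaN.
salary = pd.to_numeric(digits_only, errors="coerce")

print(salary.tolist())
```

Entries that come back as NaN can then be handled with the same missing-value strategy used for the other columns.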

#### Working with the Joining Date Column

In [None]:
# Remove hyphens so that dates written as YYYY-MM-DD or DD-MM-YYYY
# become YYYYMMDD or DDMMYYYY and can be parsed with a compact format string
messy_data.loc[:, 'Joining Date'] = messy_data['Joining Date'].str.replace('-', "", regex=False)

In [None]:
# Each remaining row uses a different date layout, so parse each one with
# its own explicit format and store the result as a DD-MM-YYYY string.
row_formats = {0: '%d/%m/%Y', 1: '%Y%m%d', 2: '%d%m%Y', 3: '%Y%m%d', 7: '%B %d, %Y'}

for row, fmt in row_formats.items():
    parsed_date = pd.to_datetime(messy_data.at[row, 'Joining Date'], format=fmt)
    messy_data.at[row, 'Joining Date'] = parsed_date.strftime('%d-%m-%Y')
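
Assigning a format to each row by hand does not scale to larger datasets. One possible generalization, sketched below, tries a list of candidate formats in order and keeps the first one that parses. The helper `parse_date`, the list `CANDIDATE_FORMATS`, and the sample values are all hypothetical. The order of the list matters, because some strings are ambiguous between layouts.

```python
from datetime import datetime

import pandas as pd

# Hypothetical list of layouts to try, in priority order.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%Y%m%d", "%d%m%Y", "%B %d, %Y"]


def parse_date(text):
    """Return a Timestamp for the first matching format, or NaT if none match."""
    if not isinstance(text, str):
        return pd.NaT
    for fmt in CANDIDATE_FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(text.strip(), fmt))
        except ValueError:
            continue  # This format did not match; try the next one.
    return pd.NaT


# Hypothetical date strings in mixed layouts, plus one unparseable entry.
dates = pd.Series(["15/03/2021", "20210315", "15032021", "March 15, 2021", "unknown"])

parsed = pd.to_datetime(dates.map(parse_date))
formatted = parsed.dt.strftime("%d-%m-%Y")

print(formatted.tolist())
```

Keeping `parsed` as a datetime column, rather than converting it back to strings, allows sorting and date arithmetic later on.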
 
 
 

In [None]:
messy_data['Age'] = messy_data['Age'].astype('Int64')

In [None]:
messy_data

In [None]:
messy_data.sort_values(by='ID', ascending=True)

In [None]:
# to_csv returns None when given a file path, so there is nothing to assign
messy_data.to_csv('cleaned_data.csv', index=False)