# **What is Data Cleaning?**


 Data Cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset. It involves removing or correcting any problematic or irrelevant data to ensure the dataset is accurate, reliable, and suitable for analysis or other data-related tasks.



---
# **Why is Data Cleaning Important?**



 Data Cleaning is vital because it improves the quality, reliability, and usefulness of data. It ensures that data analysis is based on accurate and consistent information, leading to more reliable insights and informed decision-making.



---
# Performing Data Cleaning

Watch the video linked below.
[Data cleaning video on YouTube](https://www.youtube.com/watch?v=mwEPXevpqls&ab_channel=NetComLearning)



---
# Let us look at the example given below.

In [None]:
# Importing the necessary libraries
import pandas as pd

# Reading the dataset
df = pd.read_csv('data.csv')

# Displaying the first few rows of the dataset
df.head()

# Checking for missing values
df.isnull().sum()

# Handling missing values (assign the result back; in-place fills on a selected column are unreliable in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Checking for duplicates
df.duplicated().sum()

# Removing duplicates
df.drop_duplicates(inplace=True)

# Checking for inconsistent entries
df['Gender'].value_counts()

# Correcting inconsistent entries (lowercase once, then map abbreviations to full labels)
df['Gender'] = df['Gender'].str.lower().replace({'f': 'female', 'm': 'male'})

# Cleaning text data
df['Description'] = df['Description'].str.strip()  # Removing leading and trailing whitespaces
df['Description'] = df['Description'].str.capitalize()  # Capitalizing the first letter

# Saving the cleaned dataset
df.to_csv('cleaned_data.csv', index=False)


The example above demonstrates several data cleaning tasks: handling missing values, removing duplicates, correcting inconsistent entries, and cleaning text data. It uses pandas, a popular data manipulation library in Python.
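
Because `data.csv` is not included here, the same steps can be tried on a small in-memory table. The following is a minimal sketch, assuming a hypothetical sample with the `Age`, `Gender`, and `Description` columns used above; the values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv (values are made up)
df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 25.0],
    "Gender": ["F", "m", "Male", "F"],
    "Description": ["  first entry ", "second ENTRY", " third", "  first entry "],
})

# Fill missing ages with the column median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Drop exact duplicate rows and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

# Normalize gender labels: lowercase, then expand abbreviations
df["Gender"] = df["Gender"].str.lower().replace({"f": "female", "m": "male"})

# Tidy text: trim whitespace, then capitalize the first letter
df["Description"] = df["Description"].str.strip().str.capitalize()

print(df)
```

Note that `str.capitalize()` also lowercases the rest of each string, which may not be wanted for every text column.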



Let's try performing data cleaning on our own dataset.

# **1. Handling Missing Values**

Look at the code snippet below.


**Importing our dataset**

In [None]:
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("data.csv")

# set seed for reproducibility
np.random.seed(0)

nfl_data.head()

**Searching for missing values in our dataset**

In [None]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]
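
Per-column counts are easier to interpret alongside the overall share of missing cells. The following is a minimal sketch, assuming a small hypothetical DataFrame named `sample` in place of `nfl_data`; the same calls should apply to any DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nfl_data (values are made up)
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Total number of cells (rows * columns) and total missing cells
total_cells = sample.size
total_missing = missing_per_column.sum()

# Share of all cells that are missing, as a percentage
percent_missing = total_missing / total_cells * 100

print(missing_per_column)
print(f"{percent_missing:.1f}% of all cells are missing")
```

A high percentage suggests that dropping rows or columns would discard a large part of the data, so filling may be the better option.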

# Challenge Problem: Find out whether the houseprice dataset has any missing values.

# **Why do we have missing Values in our dataset?**



---
Values can be missing from a dataset for several reasons, and the reason affects how the missing data should be handled. It is therefore important to understand why the data could be missing.

Some of the reasons are listed below:

1.   Past data might have been corrupted due to improper maintenance.
2.   Observations were not recorded for certain fields, for example because of human error during data entry.
3.   The user intentionally did not provide the values.
4.   Item nonresponse: the participant refused to respond.


---




**Dropping missing values from our dataset**

In [None]:
# remove all rows that contain at least one missing value
nfl_data.dropna()

In [None]:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()
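
Before dropping anything, it helps to check how much data each option would remove. The following is a minimal sketch, assuming a small hypothetical DataFrame named `sample` in place of `nfl_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nfl_data (values are made up)
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})

# Drop rows that contain at least one missing value
rows_dropped = sample.dropna()

# Drop columns that contain at least one missing value
cols_dropped = sample.dropna(axis=1)

print(f"Rows:    {sample.shape[0]} -> {rows_dropped.shape[0]}")
print(f"Columns: {sample.shape[1]} -> {cols_dropped.shape[1]}")
```

Comparing the shapes before and after shows at a glance whether dropping rows or dropping columns loses less information.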

# Challenge Problem: If the houseprice dataset has any missing values, drop them and print how many values were lost.

**Filling in missing values Automatically**

---



Instead of dropping missing values, we can also fill the empty cells of our dataset. Look at the code snippets below.

In [None]:
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

**Replacing all null values with zero**

In [None]:

subset_nfl_data.fillna(0)

**Replacing each null value with the next valid value below it in the same column, then replacing any remaining null values with zero**

In [None]:

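# backward-fill down each column, then fill any remaining nulls with 0
# (bfill() replaces the deprecated fillna(method='bfill'))
subset_nfl_data.bfill(axis=0).fillna(0)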
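
To see what backward fill followed by a zero fill actually does, here is a minimal sketch on a small hypothetical DataFrame named `sample` (invented values). It uses `DataFrame.bfill`, which recent pandas versions recommend over passing `method='bfill'` to `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in the middle and at the end of a column
sample = pd.DataFrame({
    "x": [np.nan, 2.0, np.nan, 4.0],
    "y": [1.0, np.nan, np.nan, np.nan],
})

# Step 1: each null takes the next valid value below it in the same column
backfilled = sample.bfill(axis=0)

# Step 2: nulls with no valid value below them are still null, so set them to 0
filled = backfilled.fillna(0)

print(filled)
```

In column `y` the trailing nulls have nothing below them to copy, which is why the final `fillna(0)` is still needed.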