# Data Cleaning using Python with Pandas Library

This tutorial is copied from [Tanu N Prabhu's](https://github.com/Tanu-N-Prabhu/Python/blob/master/Data_Cleaning/Data_Cleaning_using_Python_with_Pandas_Library.ipynb) Python repository on GitHub.

### The entire data cleaning process is divided into sub-tasks as shown below.
- Importing the required libraries.
- Getting the dataset from an external source (Kaggle) and displaying it.
- Removing the unused or irrelevant columns.
- Renaming the columns for our convenience.
- Replacing values in rows to make them more meaningful.

Even though this tutorial is small, it is a good way to start with small things and get our hands dirty later on. I will make sure that anyone with no prior experience in Python programming, or who doesn't know what data science or data cleaning is, can easily follow it. I'm not very good at Python in the first place, so even for me this was a good place to start. One nice thing about Python is that the code is largely self-explanatory. Your focus should not be on what the code does, because the code pretty much says that itself. Instead, explain why you chose to do it that way: the “why” matters more than the “what”.

***

## Step 1: Importing the required libraries.
This step just imports the libraries we need: pandas and NumPy, two staples of data science work in Python, plus the standard-library `os` module for checking the working directory.

In [1]:
# Importing the necessary libraries
import os

import numpy as np
import pandas as pd

# Relative paths such as "DataSets/heart.csv" are resolved against the
# current working directory, so record it here for reference
wd = os.getcwd()
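
Because the next step reads the file through a relative path, it can help to check where that path points before calling `pd.read_csv`. A minimal sketch, assuming the `DataSets/heart.csv` layout used in this tutorial; `data_path` and `file_found` are hypothetical names introduced only for illustration:

```python
from pathlib import Path

# Build the path relative to the current working directory
data_path = Path.cwd() / "DataSets" / "heart.csv"

# True only if the file exists at that location
file_found = data_path.is_file()

print(data_path)
print(file_found)
```

If `file_found` is `False`, the notebook was probably started from a different folder than the one containing `DataSets`.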

***

## Step 2: Getting the data set from a different source and displaying the output. 

This step involves getting the dataset and reading it into Python.

In [2]:
# Reading a CSV file and displaying the data
df = pd.read_csv("DataSets/heart.csv")
df.head(5)

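| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |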
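
Before changing anything, it usually helps to look at the shape, the column types, and the number of missing values. A minimal sketch, assuming a small hand-built DataFrame (`sample`, a hypothetical name) that mirrors a subset of the columns in the five rows shown above rather than the real `heart.csv`:

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the first five rows of heart.csv
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "trestbps": [145, 130, 130, 120, 120],
    "chol": [233, 250, 204, 236, 354],
})

# (rows, columns) of the DataFrame
n_rows, n_cols = sample.shape

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Data type of each column
column_types = sample.dtypes

print(n_rows, n_cols)
print(missing_per_column)
print(column_types)
```

The same three checks can be run on `df` directly once the CSV file has been read.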


***

## Step 3: Removing the unused or irrelevant columns

This step removes the columns that we will not use in the rest of the tutorial.

In [3]:
# Dropping unused columns
to_drop = ['cp', 'fbs', 'restecg', 'thalach', 'exang',
           'oldpeak', 'slope', 'thal', 'target', 'ca']

df.drop(columns=to_drop, inplace=True)
df.head(5)


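| Unnamed: 0 | age | sex | trestbps | chol |
| --- | --- | --- | --- | --- |
| 0 | 63 | 1 | 145 | 233 |
| 1 | 37 | 1 | 130 | 250 |
| 2 | 41 | 0 | 130 | 204 |
| 3 | 56 | 1 | 120 | 236 |
| 4 | 57 | 0 | 120 | 354 |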
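
When only a few columns survive, listing the columns to keep can be shorter than listing the ones to drop. A minimal sketch of both routes, assuming a hypothetical three-row `sample` DataFrame with a handful of the original columns; `to_keep`, `kept`, and `dropped` are names introduced here for illustration:

```python
import pandas as pd

# Hypothetical stand-in with a few of the original columns
sample = pd.DataFrame({
    "age": [63, 37, 41],
    "sex": [1, 1, 0],
    "cp": [3, 2, 1],
    "trestbps": [145, 130, 130],
    "chol": [233, 250, 204],
    "target": [1, 1, 1],
})

# Columns we want to keep
to_keep = ["age", "sex", "trestbps", "chol"]

# Selecting with a list returns a new DataFrame;
# .copy() makes it explicit that kept is independent of sample
kept = sample[to_keep].copy()

# Same result via drop without inplace, which also returns a new DataFrame
dropped = sample.drop(columns=["cp", "target"])

print(kept.equals(dropped))
print(list(sample.columns))
```

Neither call modifies `sample`, so the original data stays available if a dropped column turns out to be needed later.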


***

## Step 4: Renaming the column names as per our convenience.

This step renames columns whose names may be confusing or hard to understand, which also makes the dataset easier for us to read.

In [4]:

# Renaming the column names
new_name = {'age': 'Age',
            'sex': 'Sex',
            'trestbps': 'Bps',
            'chol': 'Cholesterol'}

df.rename(columns=new_name, inplace=True)
df.head()

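| Unnamed: 0 | Age | Sex | Bps | Cholesterol |
| --- | --- | --- | --- | --- |
| 0 | 63 | 1 | 145 | 233 |
| 1 | 37 | 1 | 130 | 250 |
| 2 | 41 | 0 | 130 | 204 |
| 3 | 56 | 1 | 120 | 236 |
| 4 | 57 | 0 | 120 | 354 |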
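
One thing to watch with `rename` is that a key matching no column is silently ignored by default, so a typo in the dictionary goes unnoticed. A minimal sketch, assuming a hypothetical two-row `sample` DataFrame and a recent pandas version (the `errors` parameter of `DataFrame.rename` was added in pandas 1.0); `renamed` and `typo_detected` are illustrative names:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame after the drop step
sample = pd.DataFrame({
    "age": [63, 37],
    "sex": [1, 1],
    "trestbps": [145, 130],
    "chol": [233, 250],
})

new_name = {"age": "Age", "sex": "Sex", "trestbps": "Bps", "chol": "Cholesterol"}

# Without inplace, rename returns a new DataFrame and leaves sample unchanged
renamed = sample.rename(columns=new_name)

# errors="raise" turns a key that matches no column into a KeyError
try:
    sample.rename(columns={"cholesterol": "Cholesterol"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns))
print(typo_detected)
```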


*** 

## Step 5: Replacing the values in rows if necessary. 

In this step we can replace values in some rows that are not meaningful on their own. In this dataset, the Sex field contains a binary value, where 1 is Male and 0 is Female. We can now change these to M and F.

In [6]:
# Replacing the coded values in the Sex column
replace_values = {0: 'F', 1: 'M'}

df = df.replace({"Sex": replace_values})

df.head()

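| Unnamed: 0 | Age | Sex | Bps | Cholesterol |
| --- | --- | --- | --- | --- |
| 0 | 63 | M | 145 | 233 |
| 1 | 37 | M | 130 | 250 |
| 2 | 41 | F | 130 | 204 |
| 3 | 56 | M | 120 | 236 |
| 4 | 57 | F | 120 | 354 |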
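
`replace` leaves any value that is missing from the dictionary untouched, so an unexpected code would pass through unnoticed. As an alternative sketch (not the approach used in the step above), `Series.map` turns unmapped values into `NaN`, which makes them easy to find. The `sample` DataFrame below is hypothetical and contains one deliberately invalid code:

```python
import pandas as pd

# Hypothetical column of codes; 2 is a deliberately invalid value
sample = pd.DataFrame({"Sex": [1, 1, 0, 1, 0, 2]})

replace_values = {0: "F", 1: "M"}

# map returns NaN for any value that has no entry in the dictionary
mapped = sample["Sex"].map(replace_values)

# Rows whose code had no entry in the mapping
unmapped_rows = sample[mapped.isna()]

print(mapped.tolist())
print(len(unmapped_rows))
```

Which of the two to use depends on whether unknown codes should be kept as they are (`replace`) or flagged for review (`map`).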


That is the overall data cleaning process in its simplest form. Obviously this is not what cleaning looks like at an industry level, but it is a good start, so let’s begin small and then move on to huge datasets, which involve more cleaning. This was just to give an idea of what the data cleaning process looks like from a beginner's perspective.