# Data Cleaning and Preprocessing

Machine learning depends on data: it is the raw material used to train and test a model. Many open-source datasets are available on the web, and we can use them for learning purposes. The format and amount of data required vary with the model being built, so raw data rarely fits a model as it is. In most cases, therefore, the data cleaning and preprocessing stage is essential for training the model well.

It is also necessary to discover the relationships between data columns and how important each column is, because this helps us decide which data can be removed. These activities are therefore part of the data cleaning and preprocessing stage as well.
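One common way to look at relationships between columns is a correlation matrix. The sketch below is only an illustration on a small made-up table: the column names (`fare`, `pclass`, `row_id`, `survived`) and the values are assumptions, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data; the column names only mimic a passenger table.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "pclass": [3, 1, 3, 1, 3],
    "row_id": [1, 2, 3, 4, 5],
    "survived": [0, 1, 1, 1, 0],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()

# Absolute correlation of every other column with the target, strongest first.
target_corr = (
    corr["survived"].drop("survived").abs().sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only measures linear relationships, so a low value is a hint to investigate a column further, not sufficient reason on its own to delete it.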

In this stage, we can do things such as the following:
* Identifying empty data points and filling them.
* Deleting unnecessary data columns.
* Modifying data formats.
* Standardizing data points.
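The four activities listed above can be sketched on a small made-up table. The column names, the median fill, the integer encoding and the choice of which column to drop are all assumptions made for illustration; the right choices depend on the dataset and the model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "female"],
    "ticket": ["A1", "B2", "C3", "D4"],
})

# 1. Identify empty data points and fill them (here with the column median).
missing_per_column = toy.isnull().sum()
toy["age"] = toy["age"].fillna(toy["age"].median())

# 2. Delete a column assumed to be unnecessary for the model.
toy = toy.drop(columns=["ticket"])

# 3. Modify a data format: encode the text column as integers.
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# 4. Standardize a numeric column to zero mean and unit variance.
toy["fare"] = (toy["fare"] - toy["fare"].mean()) / toy["fare"].std(ddof=0)

print(missing_per_column)
print(toy)
```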
    

## Libraries and Packages

In this notebook series, machine learning algorithms are built using the ‘sklearn’ Python 3 library. For data cleaning and preprocessing, the ‘numpy’, ‘pandas’ and ‘seaborn’ libraries are used. The cell below imports those libraries, along with ‘matplotlib’ for data visualization. Some parts of ‘sklearn’ are also used in this stage; they are imported at the point where they are needed, which makes it easier to see what each import is for.

The **%matplotlib inline** command displays graphs inside the notebook and stores them with it.

In [1]:
# importing libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The code in this notebook is demonstrated on the **Titanic** dataset.

**Important: This notebook contains many methods that can be used for data processing. Not all of them are necessarily required to build a model. I include them here because my goal is to collect them all in a single place.**

To read the dataset from a .csv file, the **read_csv()** function of the pandas library can be used. The **shape** attribute gives the total number of rows and columns in the dataset. The **head()** method returns the first 5 rows of the dataset by default; a different number of rows can be specified in the parentheses.

Example: **head(10)**

In [5]:
# reading the data

data = pd.read_csv('datasets/Titanic.csv')
print(data.shape)
data.head()

(891, 12)


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |
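To try the same calls without the Titanic file, the sketch below reads a small in-memory CSV instead. The CSV text and its column names (`id`, `name`, `score`) are made up for illustration; only `read_csv`, `shape` and `head` come from the cell above.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """id,name,score
1,Ann,9.5
2,Bob,7.0
3,Cy,8.25
"""

# read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

# shape is an attribute, so it takes no parentheses: (rows, columns).
print(toy.shape)

# head(n) returns the first n rows; head() with no argument returns 5.
print(toy.head(2))
```

Note that `shape` is read without parentheses, while `head()` is called like a function; mixing the two up is a common source of errors.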
