# Data Cleaning

##### Description: Code for identifying and replacing missing data
##### References: www.udemy.com
##### Link: (https://www.udemy.com/share/100YFmBUYTclhbQnQ=/?xref=E0ceeV1UTHsJSV82AT0GJVUWTx4dChQ%2BVFE=)
##### Author: Monika Dogra
##### Revision: 1
##### Date: 21 Aug 2019

## Import Libraries

In [4]:
import numpy as np 
import pandas as pd
from sklearn.impute import SimpleImputer      #Imputation transformer for completing missing values

## Data Set

In [17]:
df = pd.read_csv("/home/ritesh/Desktop/Missing_Data.csv")
print(df)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


## Copying Data to perform different operations

In [12]:
df_1 = df.copy()
df_2 = df.copy()
print(df_2)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


## Indexing and Selecting Data with Pandas
### Different Approaches:

In [13]:
x = df_1.loc[:, ["Country", "Age", "Salary"]]   # DataFrame.loc[]: label-based indexer; selects rows/columns by name
print(x)
y = df_1.loc[:, ["Purchased"]]                  # target column, selected by its label



   Country   Age   Salary
0   France  44.0  72000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
3    Spain  38.0  61000.0
4  Germany  40.0      NaN
5   France  35.0  58000.0
6    Spain   NaN  52000.0
7   France  48.0  79000.0
8  Germany  50.0  83000.0
9   France  37.0  67000.0


In [14]:
x = df_1.iloc[:, :-1].values                  # DataFrame.iloc[]: integer-position-based indexer; here, every column except the last
y = df_1.iloc[:, 3].values                    # column at position 3 ("Purchased")
print(x)


[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
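The two selections above can be compared side by side. The sketch below rebuilds the ten-row table from the printed output, so it runs without the CSV file; the name `df_demo` is an assumption introduced only for this illustration. It shows that `loc` keeps returning the same columns after the column order changes, while `iloc` does not.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV, rebuilt from the table printed above.
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

by_label = df_demo.loc[:, ["Country", "Age", "Salary"]]   # select by column name
by_position = df_demo.iloc[:, :-1]                        # select every column except the last

# With the original column order, both selections give the same result.
print(by_label.equals(by_position))

# After reordering the columns, only the label-based selection is unchanged.
shuffled = df_demo[["Purchased", "Country", "Age", "Salary"]]
label_cols = list(shuffled.loc[:, ["Country", "Age", "Salary"]].columns)
position_cols = list(shuffled.iloc[:, :-1].columns)
print(label_cols)
print(position_cols)
```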


## Dropping rows with missing values using the dropna() function

In [15]:
print(df_2.dropna(axis=0,how = 'any'))        # dropna() drops rows (axis=0) or columns (axis=1) with null values; how='any' drops a row if any of its values is null


   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
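As a minimal sketch of how the `axis` and `how` arguments change the result, the block below applies a few `dropna()` variants to the same ten rows. The table is rebuilt from the printed output, and `df_demo` is a hypothetical name used only here. The `subset` argument is not used in the notebook; it is shown as one further option.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV, rebuilt from the table printed above.
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

rows_any = df_demo.dropna(axis=0, how="any")    # drop rows with at least one null
cols_any = df_demo.dropna(axis=1, how="any")    # drop columns with at least one null
rows_all = df_demo.dropna(axis=0, how="all")    # drop rows only if every value is null
salary_only = df_demo.dropna(subset=["Salary"]) # only look at the Salary column

print(len(rows_any))            # number of rows left
print(list(cols_any.columns))   # columns left
print(len(rows_all))
print(len(salary_only))
```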


## Filling missing values using fillna() function

In [5]:
print(df_1.fillna(0))                                # fill all null values with zero (or any other constant)
print(df_1.mean(numeric_only=True))                  # column means of the numeric columns, computed with pandas
print(df_1.fillna(df_1.mean(numeric_only=True)))     # fill each null value with the mean of its own column
 

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      0.0       Yes
5   France  35.0  58000.0       Yes
6    Spain   0.0  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
Age          38.777778
Salary    63777.777778
dtype: float64
   Country        Age        Salary Purchased
0   France  44.000000  72000.000000        No
1    Spain  27.000000  48000.000000       Yes
2  Germany  30.000000  54000.000000        No
3    Spain  38.000000  61000.000000        No
4  Germany  40.000000  63777.777778       Yes
5   France  35.000000  58000.000000       Yes
6    Spain  38.777778  52000.000000        No
7   France  48.000000  79000.000000       Yes
8  Germany  50.000000  83000.000000        No
9   France  37.000000  67000.000000       Yes
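The same `fillna()` call also accepts other statistics. The sketch below fills with the column median, and then with a different value per column through a dictionary. The table is rebuilt from the printed output, and `df_demo` is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV, rebuilt from the table printed above.
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Fill every numeric column with its own median.
medians = df_demo.median(numeric_only=True)
filled_median = df_demo.fillna(medians)

# Fill each column with a different statistic by passing a dict.
filled_mixed = df_demo.fillna({
    "Age": df_demo["Age"].mean(),
    "Salary": df_demo["Salary"].median(),
})

print(filled_median.loc[[4, 6], ["Age", "Salary"]])
print(filled_mixed.loc[[4, 6], ["Age", "Salary"]])
```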

## Filling missing values using the sklearn library: imputing with the mean


In [16]:
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean') 
missingvalues = missingvalues.fit(x[:, 1:3])    # fit computes the mean of each column (Age, Salary), ignoring NaN
x[:, 1:3]=missingvalues.transform(x[:, 1:3])    # transform replaces each NaN with the mean learned by fit
print(x[:,1:3])

[[44.0 72000.0]
 [27.0 48000.0]
 [30.0 54000.0]
 [38.0 61000.0]
 [40.0 63777.77777777778]
 [35.0 58000.0]
 [38.77777777777778 52000.0]
 [48.0 79000.0]
 [50.0 83000.0]
 [37.0 67000.0]]
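The fit and transform steps above can also be combined into one call. The following is a minimal sketch, assuming the same ten rows rebuilt from the printed output (`df_demo` and `numeric` are hypothetical names). It uses `fit_transform` and reads the learned column means from the imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the numeric columns of the CSV printed above.
df_demo = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})
numeric = df_demo[["Age", "Salary"]]

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(numeric)   # learn the means and fill NaN in one step

print(imputer.statistics_)   # the per-column means learned during fit
print(imputed[[4, 6]])       # the two rows that originally had a missing value
```

Other values of `strategy`, such as `'median'` or `'most_frequent'`, work the same way.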


## Summary

### * We can index and select data with either DataFrame.loc[] or DataFrame.iloc[]. Of the two, loc[] is usually the better fit for a large dataset, because columns are specified directly by name instead of by position.
### * dropna() function: drops rows or columns that contain null values, controlled by the following parameters.
#### axis: takes an int or string value for rows/columns. The input can be 0 or 1 (0 to drop rows, 1 to drop columns).
#### how: takes one of two string values ('any' or 'all'). 'any' drops the row/column if ANY value is null; 'all' drops it only if ALL values are null.
### * fillna() function: used to fill NA/NaN values.
#### Values can be filled with zero (df.fillna(0)), the mean of that particular column, the median, or the most frequent value (mode), depending on the dataset. If a column is skewed or contains outliers, the median is usually preferred over the mean.
### * Whether to use dropna() or fillna() depends on the dataset and on the objective of the analysis.
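One way to inform the choice between dropping and filling is to count how much data is missing first. The sketch below is an illustration, not part of the original notebook. It rebuilds the ten rows from the printed output under the hypothetical name `df_demo` and reports the missing values per column and the share of rows that `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV, rebuilt from the table printed above.
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

missing_per_column = df_demo.isna().sum()              # null count in each column
share_rows_affected = df_demo.isna().any(axis=1).mean()  # fraction of rows with any null

print(missing_per_column)
print(share_rows_affected)
```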