![title](http://www.datacleansing.net.au/Images/Site/DG_Data_Cleansing_Cycle_300px.png)

So far, you have dealt with fairly clean data in the notebooks. **However, in the real world, most data isn't this clean.** As you saw in the notebook on getting data from the web, sometimes your data source isn't the cleanest one. Other times, the application that fills the dataset you are using has already made it dirty! Thankfully, there are a lot of ways you can clean dirty data. In this notebook, we are going to concentrate on standardizing, normalizing and deduplicating data. First, of course, we'll have to get a dirty dataset, and then clean it.

## Standardizing data

Standardizing data means converting data that exists in various formats into one single format, without losing the meaning of the data. This comes up very often with dates and times, since there are a lot of date formats around the world, and times are sometimes recorded in 24-hour format and sometimes in 12-hour format. Let's see an example:

In [21]:
dates = ["5/15/2017","18/12/2016","4/22/2017","3/14/2017","22/6/2014"] #Dates in two different formats: MM/DD/YYYY and DD/MM/YYYY

import datetime
def standard_date(date):
    #Try/except statement, for catching errors.
    #This tries to read the date as DD/MM/YYYY and rewrite it as MM/DD/YYYY.
    #If the format doesn't match, strptime raises a ValueError and we return the original date.
    #Try/except is similar to if/else, but it's meant to handle errors that would stop the execution of the program.
    try:
        return datetime.datetime.strptime(date, '%d/%m/%Y').strftime('%m/%d/%Y')
    except ValueError:
        return date

for i in range(0,len(dates)):
    date = standard_date(dates[i])
    print(date)
    dates[i] = date

5/15/2017
12/18/2016
4/22/2017
3/14/2017
06/22/2014


This is a general-purpose example of working with dates stored as strings, using the datetime library. `strptime` parses the string in the given format, in this case day/month/year, and `strftime` writes the result out in another format, month/day/year. Note that `strftime` zero-pads the month and day, which is why the last date comes out as 06/22/2014. The try/except construct is new, but as the comments explain, it's like an if/else made for catching and handling those nasty errors that stop the execution of the code. Another way to do this would be extracting the year, day and month individually, then pasting them together in the desired format or creating a dataframe with the individual values:

In [25]:
import pandas as pd
years = []
months = []
days = []

for date in dates:
    date_object = datetime.datetime.strptime(date,"%m/%d/%Y")
    years.append(date_object.year)
    months.append(date_object.month)
    days.append(date_object.day)
    print(str(date_object.year) + "-" + str(date_object.month) + "-" + str(date_object.day))

dataframe = pd.DataFrame({"Month":months,"Day":days,"Year":years})
dataframe

2017-5-15
2016-12-18
2017-4-22
2017-3-14
2014-6-22


Unnamed: 0,Day,Month,Year
0,15,5,2017
1,18,12,2016
2,22,4,2017
3,14,3,2017
4,22,6,2014


This method is friendlier when the month and the day can't be told apart from the string alone, as with January 1st (1/1): once each part lives in its own labeled column, there is nothing left to guess.
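To see which dates are actually ambiguous, here is a minimal sketch. The helper name `date_status` is made up for this illustration (it is not part of the datetime library), and it assumes the only two formats in play are the ones used in this notebook. It tries both formats on each string and reports whether the two readings agree:

```python
from datetime import datetime

FORMATS = ("%m/%d/%Y", "%d/%m/%Y")  # the two formats assumed in this notebook

def date_status(text):
    """Classify a date string as 'invalid', 'unambiguous' or 'ambiguous'."""
    readings = set()
    for fmt in FORMATS:
        try:
            readings.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # The string does not fit this format; try the next one.
            continue
    if not readings:
        return "invalid"
    # One distinct reading means both formats agree (or only one of them fits).
    return "unambiguous" if len(readings) == 1 else "ambiguous"

for text in ["5/15/2017", "18/12/2016", "1/1/2017", "3/4/2017", "31/31/2017"]:
    print(text, "->", date_status(text))
```

A date flagged as ambiguous can't be fixed by code alone; you need to know which convention its source used.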

Dates and times aren't the only type of data that may need standardizing. Sometimes it's other things, like sex or education, as in this case:


In [28]:
Names = ["John Corporan","Kate Beckingsale","Chuck Norris","Porter, Jessica","Mike Ryans"]
Sex = ["M","F",1,"f","m"]
Ed = ["Bs","Bs","Bachelors","Ms","Phd"]

People = {"Full name": Names,
          "Sex": Sex,
          "Education": Ed} #Please no discussions on this column.

Peopledf = pd.DataFrame(People)
Peopledf

Unnamed: 0,Education,Full name,Sex
0,Bs,John Corporan,M
1,Bs,Kate Beckingsale,F
2,Bachelors,Chuck Norris,1
3,Ms,"Porter, Jessica",f
4,Phd,Mike Ryans,m


How would we go about standardizing this dataframe? Well, that's your exercise now.

#### Your turn now.

Standardize the people dataframe.  

**Hint**: when facing multiple formats of values, you can either standardize to the most common value, or the one that makes the most sense.
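To illustrate the hint on a different, made-up column (so the exercise stays yours), here is a rough sketch. The `countries` series and the `canonical` mapping are assumptions invented for this example. The idea is to strip out noise first, check which spellings remain, and then map each one to the value you picked as the standard:

```python
import pandas as pd

# Hypothetical column, not part of the people dataframe.
countries = pd.Series(["USA", "usa", "U.S.A.", "Canada", "canada ", "USA"])

# Step 1: remove noise that never carries meaning (spaces, dots, casing).
cleaned = countries.str.strip().str.replace(".", "", regex=False).str.upper()

# Step 2: see which spellings exist and how common each one is.
print(cleaned.value_counts())

# Step 3: map every remaining spelling to the value chosen as the standard.
canonical = {"USA": "USA", "CANADA": "Canada"}
standardized = cleaned.map(canonical)
print(standardized.tolist())
```

Keep in mind that `map` turns any value missing from the dictionary into NaN, which is a handy way to spot spellings you forgot to cover.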

In [29]:
#Your code here.

## Normalizing data. 

Data normalization is very similar to the previous concept of standardizing data, but it applies to **continuous** numerical values. Imagine you have the price of a house and its living area. If you, as a data scientist, are planning to analyze the relationship between these two variables, it would be wiser to bring them both to the same scale. This way you effectively eliminate the unit of measurement that the data is based on, and can compare them eye-to-eye, mano a mano. And, as you'll see in the machine learning notebooks later on, this also means that the errors for each variable are brought to more comparable levels, since the values no longer differ by orders of magnitude.

There are various ways of normalizing data, but I'll show two of them. 

#### Mean normalization  

Also known as standardization (or z-score normalization) in statistics. This transforms the data so that the mean is 0 and the standard deviation is 1. The values now fall in a range [-n, n] around zero, where n is usually a small number, since most values lie within a few standard deviations of the mean. To achieve this normalization, we apply this calculation to each value of the column:

![title](http://www.statisticshowto.com/wp-content/uploads/2016/11/alternate-z-score.png)

Where $x_i$ is the current value as we go through the data, $\bar{x}$ is the mean of the data and $s$ is the standard deviation of the data; that is, $z_i = (x_i - \bar{x}) / s$. Let's see this in action:

In [34]:
#Read the house data.
houses = pd.read_csv("Houses.csv")
#Create a copy for later use.
House_Copy = houses.copy()

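def Mean_Normalization(data):
    #Subtract the mean and divide by the standard deviation.
    #pandas applies the arithmetic to every value of the column at once.
    mean = data.mean()
    std = data.std()
    return (data - mean) / std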


print(houses.head())

houses.SalePrice = Mean_Normalization(houses.SalePrice)
houses.GrLivArea = Mean_Normalization(houses.GrLivArea)
houses.head()

   SalePrice  GrLivArea
0     208500       1710
1     181500       1262
2     223500       1786
3     140000       1717
4     250000       2198


Unnamed: 0,SalePrice,GrLivArea
0,0.347154,0.370207
1,0.007286,-0.482347
2,0.53597,0.514836
3,-0.515105,0.383528
4,0.869545,1.298881


See the difference? Now the values that were lower than the mean are negative, and those that were higher than the mean are positive. To prove that it worked, let's recalculate the mean and standard deviation (expect values extremely close to 0 and 1 rather than exactly 0 and 1, because of floating-point rounding).

In [35]:
print(houses.SalePrice.mean())
print(houses.SalePrice.std())

1.102998970697801e-16
0.9999999999999998
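One detail behind the check above: pandas' `std()` computes the sample standard deviation (it divides by n - 1), while NumPy's `np.std` divides by n by default. The sketch below uses a small made-up series, not the house data, to show that the two differ, and that the normalized values only come out with a standard deviation of 1 when you measure it with the same convention you normalized with:

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy data

sample_std = values.std()                    # pandas default: ddof=1
population_std = np.std(values.to_numpy())   # NumPy default: ddof=0

# Normalize with the pandas (sample) standard deviation.
z = (values - values.mean()) / sample_std

print(sample_std, population_std)
print(z.std())                # measured the same way as sample_std
print(np.std(z.to_numpy()))   # measured the other way: slightly below 1
```

Either convention is fine for normalizing; what matters is being consistent, especially if you mix pandas and NumPy code.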


#### Min-max normalization

Often simply called normalization in statistics. In this case, the values are rescaled so that they all fall between 0 and 1. The calculation goes like this:

![title](http://www.statisticshowto.com/wp-content/uploads/2015/11/normalize-data.png)

So for each value in the data, we subtract the minimum value of the data from it, and divide the result by the range of the values (maximum value - minimum value of the data). Again, we can create a function that does this for us:

In [36]:
def MinMax_Normalization(data):
    #Rescale so the minimum maps to 0 and the maximum maps to 1.
    maximum = data.max()
    minimum = data.min()
    return (data - minimum) / (maximum - minimum)

print(House_Copy.head())

House_Copy.SalePrice = MinMax_Normalization(House_Copy.SalePrice)
House_Copy.GrLivArea = MinMax_Normalization(House_Copy.GrLivArea)
House_Copy.head()

   SalePrice  GrLivArea
0     208500       1710
1     181500       1262
2     223500       1786
3     140000       1717
4     250000       2198


Unnamed: 0,SalePrice,GrLivArea
0,0.241078,0.259231
1,0.203583,0.17483
2,0.261908,0.273549
3,0.145952,0.26055
4,0.298709,0.351168


Now all the values are between 0 and 1, kind of like a probability. Let's calculate the mean and standard deviation of this one.

In [39]:
print(House_Copy.mean())
print(House_Copy.std())

SalePrice    0.202779
GrLivArea    0.222582
dtype: float64
SalePrice    0.110321
GrLivArea    0.098998
dtype: float64


So, based on this, most values sit on the lower end of the spectrum: the mean of both columns is near 0.2, or 20% of the way from the minimum to the maximum. This may mean there are some extremely high values that skew the distribution this way. But look what we found out without even checking all the data, just from looking at where the mean and standard deviation stand when the values are between 0 and 1! We'll see easier ways to spot insights later in the visualization chapters.
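If you want a quick numeric check of that suspicion before reaching for plots, one option is to compare the mean with the median and look at the skewness. This is only a sketch on made-up prices with one extreme value, not the house data:

```python
import pandas as pd

# Toy prices with one extreme value; these are not the house data.
prices = pd.Series([100, 120, 130, 150, 160, 180, 200, 900], dtype="float64")

# Min-max scaling, same formula as above.
scaled = (prices - prices.min()) / (prices.max() - prices.min())

print("mean:  ", scaled.mean())
print("median:", scaled.median())
print("skew:  ", scaled.skew())   # positive means a long tail of high values
```

When a few very high values pull the maximum up, the mean lands well above the median and the skewness comes out positive.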

Now, when should you use one method over the other? As a rule of thumb, if you are working with probabilities, go with min-max normalization; otherwise, go with mean normalization. However, it's best to test both of them and see how they affect your predictive model (which we'll see more about later in the machine learning chapters). **And don't forget to unnormalize your values once you are done with your analysis or are about to create predictions!**
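Unnormalizing just means solving each formula for the original value, which only works if you kept the statistics you normalized with. Here is a minimal sketch using the five sale prices shown at the head of the house data; the variable names are my own, and the statistics of these five rows will of course differ from those of the full file:

```python
import pandas as pd

# Five sale prices copied from the head of the house data shown earlier.
prices = pd.Series([208500, 181500, 223500, 140000, 250000], dtype="float64")

# Keep the statistics: without them the transformation can't be undone.
mean, std = prices.mean(), prices.std()
minimum, maximum = prices.min(), prices.max()

z_scores = (prices - mean) / std
scaled = (prices - minimum) / (maximum - minimum)

# Invert each formula by solving it for the original value.
from_z = z_scores * std + mean
from_scaled = scaled * (maximum - minimum) + minimum

print(from_z.round(2).tolist())
print(from_scaled.round(2).tolist())
```

Both lists should match the original prices, up to floating-point rounding.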

#### Your turn now.

Investigate and apply another normalization method to the house data. How does it affect the data?

In [40]:
houses = pd.read_csv("Houses.csv")
#Your code here.



*Your analysis here.*

## Duplicated values.

This is a short but important section. Often, in datasets, we'll find duplicated values that only serve to **hide the truth** from us, or rather, make our analysis error-prone. Usually they are easy to identify, but in large datasets they are harder to find. Thankfully, we have easy ways to deal with duplicated values, like this:

In [42]:
People_copy = Peopledf.copy() #Copy the people dataframe you so graciously cleaned earlier.

#Create a list with the copy and the original.
peeps = [Peopledf,People_copy]

#Append one to the end of the other.

merged = pd.concat(peeps)
merged

Unnamed: 0,Education,Full name,Sex
0,Bs,John Corporan,M
1,Bs,Kate Beckingsale,F
2,Bachelors,Chuck Norris,1
3,Ms,"Porter, Jessica",f
4,Phd,Mike Ryans,m
0,Bs,John Corporan,M
1,Bs,Kate Beckingsale,F
2,Bachelors,Chuck Norris,1
3,Ms,"Porter, Jessica",f
4,Phd,Mike Ryans,m


Now we have created a monster! But it's fairly easy to kill when it's weak like this. We just need to use the **drop_duplicates** method that pandas gives us for free:

In [43]:
un_merged = merged.drop_duplicates()
un_merged

Unnamed: 0,Education,Full name,Sex
0,Bs,John Corporan,M
1,Bs,Kate Beckingsale,F
2,Bachelors,Chuck Norris,1
3,Ms,"Porter, Jessica",f
4,Phd,Mike Ryans,m


Yes, it's that easy. **However**, there's a catch. When working with a lot of columns, a duplicate that differs only slightly in one column may go unnoticed.

In [50]:
#Create a clone of Mike Ryans to confuse us.
#A Series keyed by column name lines up with the dataframe's columns, whatever their order.
Mikes_Clone = pd.Series({"Education": "PhD", "Full name": "Mike Ryans", "Sex": "m"})
#drop_duplicates may hand back a dataframe that pandas still treats as a slice of merged,
#so take an explicit copy before adding rows to avoid a SettingWithCopyWarning.
un_merged = un_merged.copy()
#Add it to the dataframe.
un_merged.loc[5] = Mikes_Clone
un_merged

Unnamed: 0,Education,Full name,Sex
0,Bs,John Corporan,M
1,Bs,Kate Beckingsale,F
2,Bachelors,Chuck Norris,1
3,Ms,"Porter, Jessica",f
4,Phd,Mike Ryans,m
5,PhD,Mike Ryans,m


Now, if you find yourself in this situation, you have to analyze all the columns rather than simply jumping to the conclusion that it's a duplicate. Find the index or some IDs in the dataset, and compare the values between the suspected duplicates. If something meaningful differentiates them, then it's not a duplicate; if the only difference is formatting, such as Phd versus PhD above, standardize the column first and then drop the duplicates.
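The steps above can be sketched roughly like this. The small `people` dataframe is a stand-in built for this example, and treating "Full name" as the identifying column is an assumption; in a real dataset you would pick a proper ID if one exists:

```python
import pandas as pd

# A small stand-in for the people dataframe, including Mike's clone.
people = pd.DataFrame({
    "Full name": ["Kate Beckingsale", "Mike Ryans", "Mike Ryans"],
    "Sex": ["F", "m", "m"],
    "Education": ["Bs", "Phd", "PhD"],
})

# Exact comparison finds nothing, because "Phd" != "PhD".
print(people.duplicated().sum())

# Step 1: list the rows that share an identifying column, to compare them by eye.
suspects = people[people.duplicated(subset=["Full name"], keep=False)]
print(suspects)

# Step 2: standardize the column that differs, then look for duplicates again.
standardized = people.assign(Education=people["Education"].str.lower())
deduplicated = people[~standardized.duplicated()]
print(deduplicated)
```

Sharing a name is only a reason to look closer, not proof of a duplicate: two different people can have the same name, which is why the final check compares every column.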

## Exercise.

Clean the AirBnB dataset (to be added).