# Data Cleanup

The data cleanup methods here follow the books "Data Wrangling with Python" and "Python for Data Analysis".

In [4]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

## Prepare the data

This data set relates to child labor. The data is stored in mn.csv, with acronyms as column names. The meanings of these acronyms are listed in mn_headers.csv.

In [28]:
data = pd.read_csv('/Users/newuser/Desktop/Research(Tian_Ye)/Learning_notes/data-wrangling/data/unicef/mn.csv', index_col = 0, engine = 'python')
data.head()

Unnamed: 0,HH1,HH2,LN,MWM1,MWM2,MWM4,MWM5,MWM6D,MWM6M,MWM6Y,...,MCSURV,MCDEAD,mwelevel,mnweight,wscore,windex5,wscoreu,windex5u,wscorer,windex5r
1,1,17,1,1,17,1,14,7,4,2014,...,0.0,0.0,Higher,0.403797,1.60367,5,1.272552,5.0,,
2,1,20,1,1,20,1,14,7,4,2014,...,0.0,0.0,Higher,0.403797,1.543277,5,1.089026,5.0,,
3,2,1,1,2,1,1,9,8,4,2014,...,3.0,0.0,Primary,1.031926,0.878635,4,-0.930721,1.0,,
4,2,1,5,2,1,5,9,12,4,2014,...,,,,0.0,0.0,0,0.0,0.0,0.0,0.0
5,2,1,8,2,1,8,9,8,4,2014,...,0.0,0.0,Secondary,1.031926,0.878635,4,-0.930721,1.0,,


**Remark**:
1. The default parser is the C engine (`engine = 'c'`), which is faster. The `'python'` engine is slower but more feature-complete; it is used here because the data set contains columns with mixed data types.
2. Set `index_col = 0` to use the first column as the index.
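
The effect of `index_col = 0` can be checked on a small in-memory CSV. This is only a minimal sketch: the CSV text and the names `csv_text`, `without_index`, and `with_index` are made up for illustration and are not part of the UNICEF data.

```python
import io

import pandas as pd

# Hypothetical CSV text; the first (unnamed) column holds the row labels.
csv_text = ",HH1,HH2\n1,1,17\n2,1,20\n"

# Without index_col, the first column is read as ordinary data
# and pandas names it "Unnamed: 0".
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(without_index.columns.tolist())
print(with_index.columns.tolist())
print(with_index.index.tolist())
```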

In [31]:
data_header = pd.read_csv('/Users/newuser/Desktop/Research(Tian_Ye)/Learning_notes/data-wrangling/data/unicef/mn_headers.csv')
data_header.head()

Unnamed: 0,Name,Label,Question
0,HH1,Cluster number,
1,HH2,Household number,
2,LN,Line number,
3,MWM1,Cluster number,
4,MWM2,Household number,


## Create a new data set with informative column names

The data has 159 columns while data_header has 210 rows. Check how many column names in `data` have a description in `data_header`.

In [48]:
sum(data.columns.isin(data_header['Name']))

150
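
The same check can also show which columns have no description. The sketch below uses two tiny, made-up frames (`data_demo` and `header_demo` are hypothetical stand-ins for `data` and `data_header`).

```python
import pandas as pd

# Hypothetical stand-ins for the real data and header files.
data_demo = pd.DataFrame({'HH1': [1], 'HH2': [17], 'extra': [0]})
header_demo = pd.DataFrame({'Name': ['HH1', 'HH2', 'LN'],
                            'Label': ['Cluster number', 'Household number', 'Line number']})

# Boolean array: True where a column name appears in the header file.
mask = data_demo.columns.isin(header_demo['Name'])

print(mask.sum())                          # number of described columns
print(data_demo.columns[~mask].tolist())   # columns without a description
```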

Create a new DataFrame containing the columns whose names can be found in `data_header`.

In [111]:
data_new = DataFrame()
for i in range(data.shape[1]):
    # Boolean indexing returns a Series of matching labels, not a plain string,
    # so it cannot be used directly as a column name.
    col_new = data_header['Label'][data_header['Name'] == data.columns[i]]
    if len(col_new) > 0:
        # Take the first (and only) matching label as a plain string.
        col_new = list(col_new)[0]
        data_new[col_new] = data.iloc[:, i]

data_new.head()

Unnamed: 0,Cluster number,Household number,Line number,Man's line number,Interviewer number,Day of interview,Month of interview,Year of interview,Result of man's interview,Field editor,...,Date of birth of woman (CMC),Age,Date of marriage (CMC),Age at first marriage/union,Date of birth of last child (CMC),Marital/Union status,Children surviving,Children dead,Wealth index score,Wealth index quintiles
1,1,17,1,1,14,7,4,2014,Completed,2,...,1013.0,25-29,1365.0,29.0,,Currently married/in union,0.0,0.0,1.60367,5
2,1,20,1,1,14,7,4,2014,Completed,2,...,917.0,35-39,1370.0,37.0,,Currently married/in union,0.0,0.0,1.543277,5
3,2,1,1,1,9,8,4,2014,Completed,1,...,878.0,40-44,1100.0,18.0,,Currently married/in union,3.0,0.0,0.878635,4
4,2,1,5,5,9,12,4,2014,Not at home,1,...,,,,,,,,,0.0,0
5,2,1,8,8,9,8,4,2014,Completed,1,...,1118.0,20-24,,,,Never married/in union,0.0,0.0,0.878635,4


**Remark**:
1. Boolean indexing on a column of strings returns a Series with dtype: object, not a plain string. Here `list()` converts the Series so that its first element can be taken as a string, but this is not efficient. Selecting the element by position with `.iloc[0]` returns the string directly.
2. Alternatively, `np.where()` gives the position of the match. It returns a tuple of arrays, so we take the first array and convert its single element to an `int`. The disadvantage of this method is that it fails when no name matches (all `False`), because the array is then empty.

In [115]:
i=1
int(np.where(data_header['Name'] == data.columns[i])[0][0])

1
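
As an alternative to the loop, the lookup can be sketched with `.iloc[0]` and a plain dictionary. This is a minimal sketch on made-up frames (`data_demo`, `header_demo`, and `label_map` are hypothetical names). Note that it is not identical to the loop: if two acronyms share the same label, `rename` keeps both columns under a duplicated name, whereas the loop overwrites the earlier one.

```python
import pandas as pd

# Hypothetical stand-ins for the real data and header files.
data_demo = pd.DataFrame({'HH1': [1, 1], 'HH2': [17, 20], 'extra': [0, 0]})
header_demo = pd.DataFrame({'Name': ['HH1', 'HH2', 'LN'],
                            'Label': ['Cluster number', 'Household number', 'Line number']})

# .iloc[0] pulls a plain string out of the filtered Series;
# the emptiness check covers the "no match" case.
matches = header_demo.loc[header_demo['Name'] == 'HH2', 'Label']
label = matches.iloc[0] if not matches.empty else None

# Dictionary from acronym to label, used to rename all matched columns at once.
label_map = dict(zip(header_demo['Name'], header_demo['Label']))
matched_cols = [c for c in data_demo.columns if c in label_map]
renamed = data_demo[matched_cols].rename(columns=label_map)

print(label)
print(renamed.columns.tolist())
```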

## Formatting Data

Print the results in a readable format.

In [118]:
print('Question: {}\nAnswer: {}'.format(data_new.columns[0], data_new.iloc[0, 0]))

Question: Cluster number
Answer: 1


Other formatting options:

In [121]:
example_dict = {'f': 2.123123123131,
               'i': 3433233423423,
               'p': .324,}
s_to_p = 'float: {f:.4f}\n'
s_to_p += 'integer: {i:,}\n'
s_to_p += 'percentage: {p:.2%}'
print(s_to_p.format(**example_dict))  # Use ** to unpack the dictionary.

float: 2.1231
integer: 3,433,233,423,423
percentage: 32.40%
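
The same format specifications also work in f-strings (Python 3.6 and later). A minimal sketch, where the variable names `f_value`, `i_value`, and `p_value` are made up for illustration:

```python
f_value = 2.123123123131
i_value = 3433233423423
p_value = .324

# The part after the colon is the same format spec used with str.format().
float_text = f'float: {f_value:.4f}'           # four decimal places
integer_text = f'integer: {i_value:,}'         # thousands separators
percent_text = f'percentage: {p_value:.2%}'    # percentage, two decimals

print(float_text)
print(integer_text)
print(percent_text)
```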


## Date Operations

First, let's look at the columns of `data_new` that hold the interview start and end times. Print the first few column names with their positions to make sure we know which entries we need to use:

In [126]:
data_new.iloc[0, 6:15]
for x in enumerate(data_new.columns[:15]):
    print(x)           

(0, 'Cluster number')
(1, 'Household number')
(2, 'Line number')
(3, "Man's line number")
(4, 'Interviewer number')
(5, 'Day of interview')
(6, 'Month of interview')
(7, 'Year of interview')
(8, "Result of man's interview")
(9, 'Field editor')
(10, 'Data entry clerk')
(11, 'Start of interview - Hour')
(12, 'Start of interview - Minutes')
(13, 'End of interview - Hour')
(14, 'End of interview - Minutes')


**Remark**: `enumerate` lets us see the position of each entry.
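
When only one position is needed, it can also be looked up by name instead of being read off a printed list. A minimal sketch on a made-up one-row frame (`demo` and `day_pos` are hypothetical names); `get_loc` returns a single integer only when the column name is unique.

```python
import pandas as pd

# Hypothetical one-row frame with a few of the column names from above.
demo = pd.DataFrame({'Cluster number': [1],
                     'Day of interview': [7],
                     'Month of interview': [4]})

# Position of a column, looked up by its name.
day_pos = demo.columns.get_loc('Day of interview')

print(day_pos)
print(demo.iloc[0, day_pos])
```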

We now have all the data we need to figure out exactly when the interview started and ended. We could use data like this to determine things such as whether interviews in the evening or the morning were more likely to be completed, and whether the length of the interview affected the number of responses. We can also determine which interviews were the first and the last, and calculate the average duration.

Now let's get the start time.

In [131]:
from datetime import datetime

start_string = '{}/{}/{} {}:{}'.format(data_new.iloc[0, 6], 
                                      data_new.iloc[0, 5], data_new.iloc[0, 7], int(data_new.iloc[0, 11]), 
                                      int(data_new.iloc[0, 12]))
start_time = datetime.strptime(start_string, '%m/%d/%Y %H:%M')
start_time

datetime.datetime(2014, 4, 7, 17, 59)

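**Remark**: When we use `datetime.strptime`, the string must match the format exactly, so we cast the hour and minute to integers with `int()`. A value stored as a float (for example 17.0) would be formatted as `17.0`, which does not match `%H:%M`.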
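
The reason for the `int()` casts can be seen in a small sketch. The values `hour` and `minute` below are hypothetical floats, chosen to mimic a numeric column that pandas stores as float.

```python
from datetime import datetime

# Hypothetical float values, as a column containing missing data might hold.
hour, minute = 17.0, 59.0

# Formatting the floats directly yields '4/7/2014 17.0:59.0',
# which does not match the format string, so strptime raises ValueError.
bad_string = '4/7/2014 {}:{}'.format(hour, minute)
try:
    datetime.strptime(bad_string, '%m/%d/%Y %H:%M')
    parsed_ok = True
except ValueError:
    parsed_ok = False

# Casting to int first gives '4/7/2014 17:59', which parses cleanly.
good_string = '4/7/2014 {}:{}'.format(int(hour), int(minute))
good_time = datetime.strptime(good_string, '%m/%d/%Y %H:%M')

print(parsed_ok)
print(good_time)
```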

Since each element of the time data is a separate item in our data set, we could also create Python datetime objects directly, without using `strptime`.

In [135]:
end_time = datetime(data_new.iloc[0, 7], data_new.iloc[0, 6], data_new.iloc[0, 5], 
                    int(data_new.iloc[0, 13]), int(data_new.iloc[0, 14]))
end_time

datetime.datetime(2014, 4, 7, 18, 7)

**Remark**: The arguments of `datetime` must be given in the order year, month, day, hour, minute.
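
Rows with no recorded time (for example, an interview marked "Not at home") would break the `int()` casts. One way to guard against that is a small helper; `build_datetime` is a hypothetical function written for this note, not part of pandas or `datetime`.

```python
from datetime import datetime

import pandas as pd


def build_datetime(year, month, day, hour, minute):
    """Return a datetime, or None if any component is missing (NaN/None)."""
    parts = [year, month, day, hour, minute]
    if any(pd.isna(p) for p in parts):
        return None
    # Cast each part to int so float values such as 18.0 are accepted.
    return datetime(*(int(p) for p in parts))


complete = build_datetime(2014, 4, 7, 18.0, 7.0)
missing = build_datetime(2014, 4, 7, float('nan'), float('nan'))

print(complete)
print(missing)
```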

Now we can derive some information from the date data, starting with the duration of the interview.

In [138]:
duration = end_time - start_time
print(duration)

0:08:00


In [140]:
print(duration.days)

0


In [141]:
print(duration.total_seconds())  # Total length of the duration, in seconds.

480.0


In [142]:
minutes = duration.total_seconds()/60.0
minutes

8.0
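
For durations longer than a day, the difference between the attributes matters: `days` and `seconds` are separate components, while `total_seconds()` covers the whole interval. A minimal sketch with a made-up duration (`long_duration` is a hypothetical value, not taken from the data):

```python
from datetime import timedelta

# Hypothetical duration of 1 day, 2 hours and 8 minutes.
long_duration = timedelta(days=1, hours=2, minutes=8)

days_part = long_duration.days                # whole days only
seconds_part = long_duration.seconds          # remainder within the last day
total = long_duration.total_seconds()         # the entire interval in seconds

print(days_part, seconds_part, total)
```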

Present the date in a human-readable way:

In [143]:
print(end_time.strftime('%m/%d/%Y %H:%M:%S'))

04/07/2014 18:07:00


In [144]:
print(start_time.ctime())   #C's ctime standard

Mon Apr  7 17:59:00 2014


In [146]:
print(start_time.strftime('%Y-%m-%dT%H:%M:%S'))  # ISO 8601 format

2014-04-07T17:59:00
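
For this particular layout, `datetime.isoformat()` produces the same string when the datetime has no microseconds, and `strptime` with the same format string reads it back. A minimal sketch, re-creating `start_time` by hand so the block runs on its own:

```python
from datetime import datetime

# Same value as the start time computed above, re-created for a standalone run.
start_time = datetime(2014, 4, 7, 17, 59)

# isoformat() uses the 'YYYY-MM-DDTHH:MM:SS' layout when microseconds are zero.
iso_text = start_time.isoformat()

# Parsing the string with the matching format recovers the original datetime.
round_trip = datetime.strptime(iso_text, '%Y-%m-%dT%H:%M:%S')

print(iso_text)
print(round_trip == start_time)
```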
