<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>

# Data Cleaning


Having data does not always allow you to produce analytics right away. There is often a lot of preprocessing to be done.

This material is about **Cleaning**: making sure each cell has a value that can be used in your coming procedures. There are always some _impurities_ that prevent the computer from recognizing the data correctly, e.g. _commas_ instead of _periods_ and vice versa, unneeded _blanks_, irrelevant symbols (dollar, euro symbols), or non-standard symbols to represent missing values.
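
As a minimal sketch of those impurities, the toy values below (invented for illustration, not taken from any dataset) mix currency symbols, commas, blanks, and two ad hoc missing-value markers. The sketch assumes the comma is a thousands separator; where the comma is the decimal mark, the replacement would need to differ.

```python
import pandas as pd

# hypothetical raw values, invented for this illustration
raw = pd.Series(["$1,200.50", " 300 ", "€45", "n/a", "--"])

cleaned = (raw.str.strip()                              # drop surrounding blanks
              .str.replace(r'[$€,]', '', regex=True))   # drop currency symbols and commas

# anything that still is not a number (here 'n/a' and '--') becomes NaN
numbers = pd.to_numeric(cleaned, errors='coerce')
numbers
```

Coercing with `errors='coerce'` is convenient here, but it also hides unexpected values, so it is worth checking which cells ended up missing.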

I will use two approaches. The first one is the smart use of regular expressions (**regex**), and the second one a **divide and conquer** strategy.

# REGEX VERSUS DIVIDE AND CONQUER 

Imagine that you request people's age in an online form. Sometimes you run into answers with issues like these:

- "It is:24"
- "It is: 44"
- "It is54"
- "64 it is"
- "I am twenty"
- "The 10th I turn 21"
- "I am 15 years old"
- "~20"

From the above examples, you are interested in the _age_, nothing else. The first two cases are _relatively_ easy to solve with divide and conquer, as they contain a character (the colon) that helps:

In [1]:
case1="It is:24"
case2="It is: 34"
# try 1
case1.split(':')[1]

'24'

In [2]:
#try 2:
case2.split(':')[1]

' 34'

_split()_ broke the string at ":" and produced a _list_; the number is the second element. However, in _case2_ you got an extra space. You need a general rule, so maybe this is better:

In [3]:
case1.split(':')[1].strip()

'24'

In [4]:
case2.split(':')[1].strip()

'34'

Using _strip()_ gets rid of the spaces around the string. Notice that _strip()_ and _split()_ are string methods in **base Python**. Pandas has its **own** versions, available through the _.str_ accessor.
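
The two steps can be wrapped into a small base Python function. This is only a sketch: the name `age_after_colon` is hypothetical, and it assumes the answer is whatever follows the first colon.

```python
def age_after_colon(text):
    """Return what follows the first ':' without surrounding spaces, or None."""
    pieces = text.split(':')
    if len(pieces) < 2:
        # no colon in the text, so this rule does not apply
        return None
    return pieces[1].strip()

[age_after_colon(t) for t in ["It is:24", "It is: 34", "It is54"]]
```

Returning `None` when there is no colon makes it explicit which strings do not follow the expected pattern.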

You can use divide and conquer as long as every string follows the same pattern. Imagine those values make up a column in a data frame:

In [5]:
import pandas as pd

ages=["It is:24","It is: 44","It is54",
      "64 it is","I am twenty","The 10th I turn 21",
      "I am 15 years old","~20"]

someData=pd.DataFrame({'age':ages})
someData

Unnamed: 0,age
0,It is:24
1,It is: 44
2,It is54
3,64 it is
4,I am twenty
5,The 10th I turn 21
6,I am 15 years old
7,~20


Now, let's use Pandas' **own** _split()_:

In [6]:
someData.age.str.split(':')

0             [It is, 24]
1            [It is,  44]
2               [It is54]
3              [64 it is]
4           [I am twenty]
5    [The 10th I turn 21]
6     [I am 15 years old]
7                   [~20]
Name: age, dtype: object

Or alternatively:

In [7]:
someData.age.str.split(':',expand=True)

Unnamed: 0,0,1
0,It is,24
1,It is," 44"
2,It is54,
3,64 it is,
4,I am twenty,
5,The 10th I turn 21,
6,I am 15 years old,
7,~20,


Notice the use of _expand_: it sends each element of the list to its own column. However, since the symbol ":" does **not appear consistently** in every cell, the result is not good: only the first two rows produce a second column. The situation requires the **REGEX** approach.
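
Before relying on a split, it may help to count how many rows actually follow the pattern. The sketch below rebuilds the same data frame so it can run on its own; the names `parts` and `follows_pattern` are only illustrative.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})

parts = someData.age.str.split(':', expand=True)

# rows without ':' have nothing in the second column
follows_pattern = parts[1].notna()
follows_pattern.sum()
```

Only two of the eight rows pass this check, which is a quick way to confirm that divide and conquer alone will not be enough here.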

Using regular expressions is a great way to go when there is no consistent pattern for the previous strategy; however, it takes time to learn how to build a regular expression that serves your specific needs in a particular situation.

In general, you need to **explore** a few *regex patterns* before deciding what to use. I recommend using **contains()** for that:

In [8]:
# does each cell have a character that is not a digit? (\D)
someData.age.str.contains(pat=r'\D',
                          regex=True)

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
Name: age, dtype: bool

In [9]:
# does each cell have a digit character? (\d)
someData.age.str.contains(pat=r'\d',regex=True)

0     True
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: age, dtype: bool

In [10]:
# what is that cell?
someData[~someData.age.str.contains(pat=r'\d',regex=True)]

Unnamed: 0,age
4,I am twenty


In [12]:
# is there a cell with symbols that are neither
# alphanumeric (\w) nor spaces (\s)? ([^ ] negates the set)
someData.age[someData.age.str.contains(pat=r'[^\w\s]',regex=True)]

0     It is:24
1    It is: 44
7          ~20
Name: age, dtype: object
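
The exploratory checks above can also be gathered into one table, which makes it easier to compare patterns side by side. This is a sketch, and the dictionary keys (`non_digit`, `digit`, `symbol`) are labels made up for the illustration.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})

# hypothetical labels for the patterns tried so far
patterns = {'non_digit': r'\D',
            'digit': r'\d',
            'symbol': r'[^\w\s]'}

# one boolean column per pattern
report = pd.DataFrame({name: someData.age.str.contains(pat, regex=True)
                       for name, pat in patterns.items()})

# how many cells match each pattern
report.sum()
```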

In [13]:
# what happens if I erase all non-digit characters (\D)?
someData.age.str.replace(pat=r'\D',repl='',regex=True)

0      24
1      44
2      54
3      64
4        
5    1021
6      15
7      20
Name: age, dtype: object

In [14]:
# what happens if I extract consecutive numeric characters (\d+) ?
someData.age.str.extract(pat=r'(\d+)',expand=True)

Unnamed: 0,0
0,24
1,44
2,54
3,64
4,
5,10
6,15
7,20


In [15]:
# what happens if I erase all
# numbers (\d+) followed by text ([a-z]+)?
someData.age.str.replace(pat=r'\d+[a-z]+',
                         repl='',
                         regex=True)

0             It is:24
1            It is: 44
2              It is54
3             64 it is
4          I am twenty
5       The  I turn 21
6    I am 15 years old
7                  ~20
Name: age, dtype: object

In [16]:
# so:
someData.age.str.replace(pat=r'\d+[a-z]+',
                         repl='',
                         regex=True).\
             str.extract(pat=r'(\d+)',expand=True)

Unnamed: 0,0
0,24
1,44
2,54
3,64
4,
5,21
6,15
7,20


In [37]:
# using or '|'
# ^ beginning of string
# $ end of the string
someData.age.str.extract(pat=r'(^\d+|\d+$|\s\d+\s)',
                         expand=True)

Unnamed: 0,0
0,24
1,44
2,54
3,64
4,
5,21
6," 15 "
7,20
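
Note that the `\s\d+\s` branch also captures the spaces around the number. A possible alternative, not part of the original notebook, is a negative lookahead: capture digits only when they are not followed by another digit or a lowercase letter. The sketch assumes ordinal suffixes such as "th" are lowercase, and the name `age3` is hypothetical.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})

# (\d+)        capture consecutive digits
# (?![\da-z])  only if the next character is not a digit or a lowercase letter
age3 = someData.age.str.extract(pat=r'(\d+)(?![\da-z])', expand=False)
age3.to_list()
```

Because the lookahead is not part of the capture group, the extracted values carry no extra spaces.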


Let me use both results:

In [18]:
someData['age1']=someData.age.str.replace(pat=r'\d+[a-z]+',
                                          repl='',
                                          regex=True).\
                                str.extract(pat=r'(\d+)',expand=True)

someData['age2']=someData.age.str.extract(pat=r'(^\d+|\d+$|\s\d+\s)',
                         expand=True)

In [19]:
someData

Unnamed: 0,age,age1,age2
0,It is:24,24,24
1,It is: 44,44,44
2,It is54,54,54
3,64 it is,64,64
4,I am twenty,,
5,The 10th I turn 21,21,21
6,I am 15 years old,15," 15 "
7,~20,20,20


In [20]:
someData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   age     8 non-null      object
 1   age1    7 non-null      object
 2   age2    7 non-null      object
dtypes: object(3)
memory usage: 320.0+ bytes


In [39]:
someData['age1'].to_list()==someData['age2'].to_list()

False

In [40]:
someData['age1']==someData['age2']

0     True
1     True
2     True
3     True
4    False
5     True
6    False
7     True
dtype: bool
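
The two `False` rows do not have the same cause. In row 4 both columns are missing, and a missing value never compares equal to anything, not even to another missing value. A sketch for keeping only the real differences follows; the names `both_missing` and `differ` are illustrative.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})
someData['age1'] = (someData.age.str.replace(r'\d+[a-z]+', '', regex=True)
                            .str.extract(r'(\d+)', expand=False))
someData['age2'] = someData.age.str.extract(r'(^\d+|\d+$|\s\d+\s)', expand=False)

# rows where both columns are missing should not count as a difference
both_missing = someData.age1.isna() & someData.age2.isna()
differ = (someData.age1 != someData.age2) & ~both_missing

someData[differ]
```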

In [29]:
set(someData['age1']) & set(someData['age2'])

{'20', '21', '24', '44', '54', '64', nan}

In [28]:
set(someData['age1']) ^ set(someData['age2'])

{' 15 ', '15'}

In [31]:
someData['age1'].to_list()

['24', '44', '54', '64', nan, '21', '15', '20']

In [26]:
someData['age2'].to_list()

['24', '44', '54', '64', nan, '21', ' 15 ', '20']

In [32]:
someData['age2'].str.strip().to_list()

['24', '44', '54', '64', nan, '21', '15', '20']
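
Once the spaces are stripped, both columns hold the same text, but they are still strings. A possible last step, sketched here under the assumption that the ages should end up as numbers, is converting with `pd.to_numeric` and comparing with `equals`, which treats missing values in the same position as equal.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})
someData['age1'] = (someData.age.str.replace(r'\d+[a-z]+', '', regex=True)
                            .str.extract(r'(\d+)', expand=False))
someData['age2'] = someData.age.str.extract(r'(^\d+|\d+$|\s\d+\s)', expand=False)

# strip first, then turn the text into numbers (missing stays missing)
age1_num = pd.to_numeric(someData['age1'].str.strip())
age2_num = pd.to_numeric(someData['age2'].str.strip())

age1_num.equals(age2_num)
```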

## Exercise:

The CIA has several indicators for world countries:

- See [here](https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons).

Review the topics related to cleaning discussed in class, and see what may need cleaning in this indicator from the CIA:

- [Carbon dioxide emissions](https://www.cia.gov/the-world-factbook/field/carbon-dioxide-emissions/country-comparison).

In [47]:
from IPython.display import IFrame  
ciaLink="https://www.cia.gov/the-world-factbook/field/carbon-dioxide-emissions/country-comparison" 
IFrame(ciaLink, width=900, height=500)

You need to make sure you have installed:

* pandas
* html5lib
* lxml
* beautifulsoup4 (or bs4)
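
One way to check all of them from inside the notebook is the standard library module `importlib.metadata`. This is a sketch: the helper name `installed_versions` is hypothetical, and it looks packages up by their distribution names as listed above.

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

installed_versions(['pandas', 'html5lib', 'lxml', 'beautifulsoup4'])
```

Any package that shows `None` still needs to be installed.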

You can use **pip show** to verify (for instance, _pip show pandas_). If you have all of them, run this code to get the data:

In [48]:
# read web table into pandas DF
import pandas as pd

# read_html returns a list with one DataFrame per table found
ciaDF=pd.read_html(ciaLink, # link
                   header=0, # where is the header?
                   flavor='bs4') # parser to use

In [50]:
# here it is:
carbonEmi=ciaDF[0].copy()
carbonEmi

Unnamed: 0,Rank,Country,metric tonnes of CO2,Date of Information
0,1,China,1.077325e+10,2019 est.
1,2,United States,5.144361e+09,2019 est.
2,3,India,2.314738e+09,2019 est.
3,4,Russia,1.848070e+09,2019 est.
4,5,Japan,1.103234e+09,2019 est.
...,...,...,...,...
213,214,Antarctica,2.800000e+04,2019 est.
214,215,"Saint Helena, Ascension, and Tristan da Cunha",1.300000e+04,2019 est.
215,216,Niue,8.000000e+03,2019 est.
216,217,Northern Mariana Islands,0.000000e+00,2019 est.


Complete the tasks requested:

1. Change the column names, so that they only have letters, no spaces between words.
2. Keep the country and the indicator, get rid of the other columns.
3. Make sure

In [None]:
# see the first columns

In [None]:
# keep one column
DF_toClean=carbonEmi[['Date of Information']]
DF_toClean