<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>

# Data Cleaning


Having data does not always allow you to produce analytics right away. There is often a lot of preprocessing to be done.

This material is about **Cleaning**: making sure each cell has a value that can be used in the procedures that follow. There are always some _impurities_ that prevent the computer from recognizing the data correctly, e.g. _commas_ instead of _periods_ and vice versa, unneeded _blanks_, irrelevant symbols (dollar or euro signs), or non-standard symbols to represent missing values.
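To make those impurities concrete, here is a minimal sketch on a handful of invented values (they come from no real dataset). It assumes the commas are thousands separators and that "n/a" and "--" are the non-standard missing markers; the names `raw`, `cleaned` and `prices` are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration only
raw = pd.Series([" $1,200 ", "€950", "n/a", "1,050 ", "--"])

cleaned = (
    raw.str.strip()                              # drop leading/trailing blanks
       .replace({"n/a": np.nan, "--": np.nan})   # non-standard missing markers -> NaN
       .str.replace(r"[$€,]", "", regex=True)    # remove currency signs and commas
)

# once the text is clean, pandas can read it as numbers
prices = pd.to_numeric(cleaned)
print(prices)
```

Only after these steps can the column be treated as numeric; the missing markers become proper `NaN` values.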

I will use two approaches. The first one is the smart use of regular expressions (**regex**), and the second one a **divide and conquer** strategy.

# REGEX VERSUS DIVIDE AND CONQUER 

Imagine that you request people's age in an online form. Sometimes you run into answers with issues like these:

- "It is:24"
- "It is: 44"
- "It is54"
- "64 it is"
- "I am twenty"
- "The 10th I turn 21"
- "I am 15 years old"
- "~20"

From the above examples, you are interested in the _age_, nothing else. The first two cases are _relatively_ easy to solve with divide and conquer, as you see a character that helps:  

In [None]:
case1="It is:24"
case2="It is: 44"
# try 1
case1.split(':')[1]

In [None]:
#try 2:
case2.split(':')[1]

Split broke the string using ":" and produced a _list_.  The number will be the second element. However, in _case2_ you got an extra space. You need to think about a general rule, so maybe this is better:

In [None]:
case1.split(':')[1].strip()

In [None]:
case2.split(':')[1].strip()

Using _strip()_ gets rid of the spaces around the string. Notice that _strip()_ and _split()_ are string methods in **base Python**; Pandas has its **own** versions, available through the _.str_ accessor.
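The rule breaks as soon as the separator is missing: `"It is54".split(':')` returns a one-element list, so asking for element `[1]` raises an `IndexError`. A minimal sketch of a more defensive version follows; `age_after_colon` is a hypothetical helper written for this illustration, not something used elsewhere in this material.

```python
def age_after_colon(text):
    """Return what follows the first ':' without blanks, or None if there is no ':'."""
    before, sep, after = text.partition(':')  # always returns three pieces
    if sep == '':                             # no ':' found in the text
        return None
    return after.strip()

for answer in ["It is:24", "It is: 44", "It is54"]:
    print(repr(answer), '->', age_after_colon(answer))
```

The helper does not solve the third case, it only avoids the error; the value is still not recovered.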

You can use divide and conquer as long as every string you find follows the same pattern. Imagine those values make up a column in a data frame:

In [None]:
import pandas as pd

ages=["It is:24","It is: 44","It is54",
      "64 it is","I am twenty","The 10th I turn 21",
      "I am 15 years old","~20"]

someData=pd.DataFrame({'age':ages})
someData

Now, let's use Pandas' **own** split:

In [None]:
someData.age.str.split(':')

Or alternatively:

In [None]:
someData.age.str.split(':',expand=True)

Notice the use of _expand_: it sends each element of the list to its own column. However, as there is **no consistent pattern** in the location of the symbol ":", you do not get a good result. The situation requires the **REGEX** approach.
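To see how little divide and conquer recovers here, this sketch takes the second column of the expanded result, strips it, and counts the non-missing cells. The names `parts` and `afterColon` are illustrative.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})

# columns of the expanded result are labelled 0, 1, ...
parts = someData.age.str.split(':', expand=True)

# keep what came after ':' and remove blanks; rows without ':' stay missing
afterColon = parts[1].str.strip()
print(afterColon)
print(afterColon.notna().sum(), "of", len(afterColon), "rows had a ':'")
```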

Using regular expressions is a great way to go when there is no consistent pattern for the previous strategy to exploit; however, it takes time to learn how to build a regular expression that serves your specific needs in a particular situation.

In general, you need to **explore** a few *regex patterns* before deciding what to use. I recommend using **contains()** for that:

In [None]:
# does each cell have a character that is not a digit? (\D)
someData.age.str.contains(pat=r'\D',
                          regex=True)

In [None]:
# does each cell have a digit? (\d)
someData.age.str.contains(pat=r'\d',regex=True)

In [None]:
# which cell has no digit at all?
someData[~someData.age.str.contains(pat=r'\d',regex=True)]

In [None]:
# is there a cell with symbols other than ([^ ])
# alphanumeric characters (\w) or spaces (\s)?
someData.age[someData.age.str.contains(pat=r'[^\w\s]',regex=True)]

In [None]:
# what happens if I erase all non-digits (\D)?
someData.age.str.replace(pat=r'\D',repl='',regex=True)

In [None]:
# what happens if I extract consecutive numeric characters (\d+) ?
someData.age.str.extract(pat=r'(\d+)',expand=True)

In [None]:
# what happens if I erase all
# numbers (\d+) followed by lowercase letters ([a-z]+)?
someData.age.str.replace(pat=r'\d+[a-z]+',
                         repl='',
                         regex=True)

In [None]:
# so:
someData.age.str.replace(pat=r'\d+[a-z]+',
                         repl='',
                         regex=True).\
             str.extract(pat=r'(\d+)',expand=True)

In [None]:
# using or '|'
# ^ beginning of string
# $ end of the string
someData.age.str.extract(pat=r'(^\d+|\d+$|\s\d+\s)',
                         expand=True)

Let me use both results:

In [None]:
someData['age1']=someData.age.str.replace(pat=r'\d+[a-z]+',
                                          repl='',
                                          regex=True).\
                                str.extract(pat=r'(\d+)',expand=True)

someData['age2']=someData.age.str.extract(pat=r'(^\d+|\d+$|\s\d+\s)',
                         expand=True)

In [None]:
someData

In [None]:
someData.info()
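Both new columns are still stored as text, so they cannot be averaged or sorted as numbers yet. A minimal sketch of the last step, assuming the first strategy (`age1`) is the one kept; the column name `age1_num` is hypothetical.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})

# same two steps as above: erase ordinals like "10th", then take the first run of digits
someData['age1'] = (someData.age.str.replace(pat=r'\d+[a-z]+', repl='', regex=True)
                                .str.extract(pat=r'(\d+)', expand=False))

# text -> numbers; the answer without digits ("I am twenty") stays missing
someData['age1_num'] = pd.to_numeric(someData['age1'])
print(someData[['age', 'age1_num']])
print(someData['age1_num'].mean())
```

The written-out answer ("I am twenty") is not recovered by either pattern; it would need a separate rule or a manual fix.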

In [None]:
someData['age1'].to_list()==someData['age2'].to_list()

In [None]:
someData['age1']==someData['age2']

In [None]:
set(someData['age1']) & set(someData['age2'])

In [None]:
set(someData['age1']) ^ set(someData['age2'])

In [None]:
someData['age1'].to_list()

In [None]:
someData['age2'].to_list()

In [None]:
someData['age2'].str.strip().to_list()
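The comparisons above can be condensed. The sketch below uses `Series.equals`, which treats missing values in the same position as equal, and then lists the original answers where the two strategies disagree. The names `age1`, `age2` and `differs` are illustrative.

```python
import pandas as pd

ages = ["It is:24", "It is: 44", "It is54",
        "64 it is", "I am twenty", "The 10th I turn 21",
        "I am 15 years old", "~20"]
someData = pd.DataFrame({'age': ages})

age1 = (someData.age.str.replace(pat=r'\d+[a-z]+', repl='', regex=True)
                    .str.extract(pat=r'(\d+)', expand=False))
age2 = someData.age.str.extract(pat=r'(^\d+|\d+$|\s\d+\s)', expand=False)

# are both strategies identical as they are? and after stripping age2?
print(age1.equals(age2))
print(age1.equals(age2.str.strip()))

# rows where the results differ (ignoring rows where nothing was extracted)
differs = age1.ne(age2) & age2.notna()
print(someData.loc[differs, 'age'])
```

For these eight answers the only row flagged is "I am 15 years old": the third alternative, `\s\d+\s`, captures the blanks around the number, which is why `strip()` is needed before the two columns agree.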

## Exercise:

The CIA has several indicators for world countries:

- See [here](https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons).

Review the topics related to cleaning discussed in class, and see what may be needed to clean this indicator from the CIA:

- [Carbon dioxide emissions](https://www.cia.gov/the-world-factbook/field/carbon-dioxide-emissions/country-comparison).

In [1]:
from IPython.display import IFrame  
ciaLink="https://www.cia.gov/the-world-factbook/field/carbon-dioxide-emissions/country-comparison" 
IFrame(ciaLink, width=900, height=500)

You need to make sure you have installed:

* pandas
* html5lib
* lxml
* beautifulsoup4 (or bs4)
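The same check can be done from inside the notebook. This is a minimal sketch using the standard library's `importlib.metadata`; it assumes the distribution names are exactly the ones listed above, and the variable names `needed` and `status` are illustrative.

```python
from importlib.metadata import version, PackageNotFoundError

needed = ['pandas', 'html5lib', 'lxml', 'beautifulsoup4']

status = {}
for pkg in needed:
    try:
        status[pkg] = version(pkg)   # installed version as text
    except PackageNotFoundError:
        status[pkg] = None           # not installed in this environment

print(status)
```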

You can use **pip show** to verify (for instance, _pip show pandas_). If you have all of them, run this code to get the data:

In [4]:
# read the downloaded table (a CSV file) into a pandas DataFrame
import pandas as pd

linkToFile='https://github.com/CienciaDeDatosEspacial/code_and_data/raw/0a313b2bccbf19b5612c2f59ff5798510c6a74d0/data/carbonEmi_downloaded.csv'
carbon=pd.read_csv(linkToFile)

In [5]:
# here it is:
carbon

Unnamed: 0,name,slug,value,date_of_information,ranking,region
0,China,china,10773248000.0,2019 est.,1,East and Southeast Asia
1,United States,united-states,5144361000.0,2019 est.,2,North America
2,India,india,2314738000.0,2019 est.,3,South Asia
3,Russia,russia,1848070000.0,2019 est.,4,Central Asia
4,Japan,japan,1103234000.0,2019 est.,5,East and Southeast Asia
...,...,...,...,...,...,...
213,Antarctica,antarctica,28000.0,2019 est.,214,Antarctica
214,"Saint Helena, Ascension, and Tristan da Cunha",saint-helena-ascension-and-tristan-da-cunha,13000.0,2019 est.,215,Africa
215,Niue,niue,8000.0,2019 est.,216,Australia and Oceania
216,Northern Mariana Islands,northern-mariana-islands,0.0,2019 est.,217,Australia and Oceania


In [7]:
# also
carbon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   name                 218 non-null    object
 1   slug                 218 non-null    object
 2   value                218 non-null    object
 3   date_of_information  218 non-null    object
 4   ranking              218 non-null    int64 
 5   region               218 non-null    object
dtypes: int64(1), object(5)
memory usage: 10.3+ KB


Complete the tasks requested:

1. Keep the columns _name_, _value_, *date_of_information* and _region_.
    * Tip: use [drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html), [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html), and [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) for the same purpose (three ways to accomplish the task).
2. Change the column name "date_of_information" to "carbon_date".
    * Tip: Use [rename](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html).
3. Make sure the cells with text have neither trailing nor leading spaces.
    * Tip: use [strip](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html).
4. Get rid of the commas in the numeric values.
    * Tip: use [replace](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html).
5. Keep only the year value in the column "carbon_date".
    * Tip: use [extract](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html).
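As a warm-up, the five steps can be rehearsed on a toy frame before touching the real data. Everything in `toy` is invented for illustration (these are not CIA values), with blanks and commas planted on purpose. The final conversion to numbers goes one step beyond task 4 and is my own addition.

```python
import pandas as pd

# Toy rows invented for illustration; they only mimic the column layout
toy = pd.DataFrame({
    'name': [' Aland ', 'Borduria'],
    'slug': ['aland', 'borduria'],
    'value': ['1,234,000', '56,000 '],
    'date_of_information': ['2019 est.', '2019 est.'],
    'ranking': [1, 2],
    'region': ['Europe ', ' Europe'],
})

# 1. three ways to keep the same four columns
keep = ['name', 'value', 'date_of_information', 'region']
viaDrop = toy.drop(columns=['slug', 'ranking'])
viaLoc = toy.loc[:, keep]
viaIloc = toy.iloc[:, [0, 2, 3, 5]]

# 2. rename the date column
clean = viaLoc.rename(columns={'date_of_information': 'carbon_date'})

# 3. strip blanks in every text column
for col in clean.columns:
    clean[col] = clean[col].str.strip()

# 4. remove commas, then (extra step) convert to numbers
clean['value'] = pd.to_numeric(clean['value'].str.replace(',', '', regex=False))

# 5. keep only the four-digit year
clean['carbon_date'] = clean['carbon_date'].str.extract(r'(\d{4})', expand=False)

print(clean)
```

The positions passed to `iloc` depend on the column order of the frame at hand, so check them against your own data before reusing them.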

When all tasks are done, create a folder **data** inside the current folder and save the cleaned file like this:


In [None]:
#carbonCleaned.to_csv("data/carbonCleaned.csv")
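The folder can also be created from code. A minimal sketch, in which `carbonCleaned` is a one-row stand-in for your own cleaned frame and `dataFolder` and `outFile` are illustrative names:

```python
from pathlib import Path
import pandas as pd

# Stand-in for the cleaned data frame; use your own carbonCleaned instead
carbonCleaned = pd.DataFrame({'name': ['Toyland'], 'value': [1000]})

# create ./data if it does not exist yet
dataFolder = Path('data')
dataFolder.mkdir(exist_ok=True)

# index=False keeps the row index out of the file
outFile = dataFolder / 'carbonCleaned.csv'
carbonCleaned.to_csv(outFile, index=False)
```

Writing without the index avoids an extra "Unnamed: 0" column, like the one shown in the table above, when the file is read again.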