<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>

# Data Formatting (strings and numeric)

Let me clean some data sets:

In [15]:
# links to websites
carbonLink="https://www.cia.gov/the-world-factbook/field/carbon-dioxide-emissions/country-comparison" 
forestLink="https://www.cia.gov/the-world-factbook/field/revenue-from-forest-resources/country-comparison" 

In [17]:
# scraping into a LIST of DataFrames

import pandas as pd

carbonList=pd.read_html(carbonLink,header=0,flavor='bs4')
forestList=pd.read_html(forestLink,header=0,flavor='bs4')
carbonList

[     Rank                                        Country  \
 0       1                                          China   
 1       2                                  United States   
 2       3                                          India   
 3       4                                         Russia   
 4       5                                          Japan   
 ..    ...                                            ...   
 213   214                                     Antarctica   
 214   215  Saint Helena, Ascension, and Tristan da Cunha   
 215   216                                           Niue   
 216   217                       Northern Mariana Islands   
 217   218                                         Tuvalu   
 
      metric tonnes of CO2 Date of Information  
 0             10773248000           2019 est.  
 1              5144361000           2019 est.  
 2              2314738000           2019 est.  
 3              1848070000           2019 est.  
 4              11032

In [20]:
# getting the Dataframe from list
carbon=carbonList[0]
forest=forestList[0]

Unnamed: 0,Rank,Country,% of GDP,Date of Information
0,1,Solomon Islands,20.27,2018 est.
1,2,Liberia,13.27,2018 est.
2,3,Burundi,10.31,2018 est.
3,4,Guinea-Bissau,9.24,2018 est.
4,5,Central African Republic,8.99,2018 est.
...,...,...,...,...
199,200,Guam,0.00,2018 est.
200,201,Faroe Islands,0.00,2017 est.
201,202,Aruba,0.00,2017 est.
202,203,Virgin Islands,0.00,2017 est.


In [32]:
# no spaces in column names
carbon.columns=carbon.columns.str.replace(r'\s','',regex=True)
forest.columns=forest.columns.str.replace(r'\s','',regex=True)
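
The same whitespace removal can be tried on a toy frame first. This is a minimal sketch; the frame `demo` is made up for illustration and only mimics the scraped headers:

```python
import pandas as pd

# hypothetical frame whose headers contain spaces, like the scraped ones
demo = pd.DataFrame({'Date of Information': ['2019 est.'], '% of GDP': [20.27]})

# \s matches any whitespace character (space, tab, newline)
demo.columns = demo.columns.str.replace(r'\s', '', regex=True)

print(list(demo.columns))
```

Removing the spaces is what makes the later `rename` keys (`'DateofInformation'`, `'%ofGDP'`) match.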

In [33]:
# dropping
toDrop=['Rank']
# errors='ignore' lets this cell be re-run after 'Rank' is already gone
carbon.drop(columns=toDrop,inplace=True,errors='ignore')
forest.drop(columns=toDrop,inplace=True,errors='ignore')
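
Dropping a column that is no longer there raises a `KeyError`, which is what happens when a drop cell is run twice. A small sketch, on a made-up frame `demo`, of how `errors='ignore'` makes the drop safe to repeat:

```python
import pandas as pd

# hypothetical frame, not the scraped data
demo = pd.DataFrame({'Rank': [1, 2], 'Country': ['China', 'India']})

# first run: 'Rank' is removed
demo.drop(columns=['Rank'], inplace=True)

# second run: nothing left to drop, and no error is raised
demo.drop(columns=['Rank'], inplace=True, errors='ignore')

print(list(demo.columns))
```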

In [35]:
#renaming
newCarbonNames={'metrictonnesofCO2':'co2_tonnes','DateofInformation':'Carbon_yearData'}
newForestNames={'%ofGDP':'ForestRevenue_PctGDP', 'DateofInformation':'Forest_yearData'}
carbon.rename(columns=newCarbonNames,inplace=True)
forest.rename(columns=newForestNames,inplace=True)
carbon

Unnamed: 0,Country,co2_tonnes,Carbon_yearData
0,China,10773248000,2019 est.
1,United States,5144361000,2019 est.
2,India,2314738000,2019 est.
3,Russia,1848070000,2019 est.
4,Japan,1103234000,2019 est.
...,...,...,...
213,Antarctica,28000,2019 est.
214,"Saint Helena, Ascension, and Tristan da Cunha",13000,2019 est.
215,Niue,8000,2019 est.
216,Northern Mariana Islands,0,2019 est.


In [37]:
# no spaces in string values (to several columns)
byeSpaces=lambda x: x.str.strip()
carbon.iloc[:,[0,2]]=carbon.iloc[:,[0,2]].apply(byeSpaces)
forest.iloc[:,[0,2]]=forest.iloc[:,[0,2]].apply(byeSpaces)
forest

Unnamed: 0,Country,ForestRevenue_PctGDP,Forest_yearData
0,Solomon Islands,20.27,2018 est.
1,Liberia,13.27,2018 est.
2,Burundi,10.31,2018 est.
3,Guinea-Bissau,9.24,2018 est.
4,Central African Republic,8.99,2018 est.
...,...,...,...
199,Guam,0.00,2018 est.
200,Faroe Islands,0.00,2017 est.
201,Aruba,0.00,2017 est.
202,Virgin Islands,0.00,2017 est.
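
The cell above selects the text columns by position (`iloc`). As an alternative sketch, assuming you prefer not to depend on column order, the same cleaning can select the columns by name. The frame `demo` and the list `textCols` are made up for illustration:

```python
import pandas as pd

# hypothetical frame with leading/trailing spaces in the text columns
demo = pd.DataFrame({'Country': [' Peru', 'Chile  '],
                     'value': [1.5, 2.0],
                     'year': ['2018 est. ', ' 2017 est.']})

# select the text columns by name and strip each one
textCols = ['Country', 'year']
demo[textCols] = demo[textCols].apply(lambda col: col.str.strip())

print(demo.Country.tolist())
print(demo.year.tolist())
```

Note that `str.strip()` only removes spaces at both ends; spaces inside the text (as in `'2018 est.'`) stay.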


In [39]:
# keeping year (expand=False returns a Series instead of a one-column DataFrame)
carbon['Carbon_yearData']=carbon.Carbon_yearData.str.extract(r'(\d+)',expand=False)
forest['Forest_yearData']=forest.Forest_yearData.str.extract(r'(\d+)',expand=False)
carbon

Unnamed: 0,Country,co2_tonnes,Carbon_yearData
0,China,10773248000,2019
1,United States,5144361000,2019
2,India,2314738000,2019
3,Russia,1848070000,2019
4,Japan,1103234000,2019
...,...,...,...
213,Antarctica,28000,2019
214,"Saint Helena, Ascension, and Tristan da Cunha",13000,2019
215,Niue,8000,2019
216,Northern Mariana Islands,0,2019
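
To see what `str.extract` does with messier values, here is a small sketch on a made-up series. The pattern `(\d{4})` is my own stricter variant (exactly four digits), not the one used in the cell above:

```python
import pandas as pd

# hypothetical values, some of them harder than '2019 est.'
years = pd.Series(['2019 est.', '2017', 'FY2012/13', 'NA'])

# (\d{4}) captures the first run of four digits;
# expand=False returns a Series; values with no match become NaN
onlyYear = years.str.extract(r'(\d{4})', expand=False)

print(onlyYear.tolist())
```

The last value has no digits, so it comes back as a missing value instead of raising an error.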


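Verifying that only digits remain in the year columns (an empty result means no value contains a non-digit character):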

In [40]:
forest[forest.Forest_yearData.str.contains(r'\D')]

Unnamed: 0,Country,ForestRevenue_PctGDP,Forest_yearData


In [41]:
carbon[carbon.Carbon_yearData.str.contains(r'\D')]

Unnamed: 0,Country,co2_tonnes,Carbon_yearData
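
The check with `\D` only tells us that no non-digit character is left. A stricter sketch, assuming every year should have exactly four digits, uses `str.fullmatch`; the series `years` is made up for illustration:

```python
import pandas as pd

# hypothetical values: two good years and three problematic ones
years = pd.Series(['2019', '2017', '19', '2019 est.', None])

# True only for values made of exactly four digits;
# na=False makes missing values count as invalid
isValidYear = years.str.fullmatch(r'\d{4}', na=False)

# rows that still need attention
print(years[~isValidYear])
```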


**Before** we start formatting, let's check the data types:

In [42]:
forest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country               204 non-null    object 
 1   ForestRevenue_PctGDP  204 non-null    float64
 2   Forest_yearData       204 non-null    object 
dtypes: float64(1), object(2)
memory usage: 4.9+ KB


In [6]:
carbon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Country          218 non-null    object
 1   co2_tonnes       218 non-null    int64 
 2   Carbon_yearData  218 non-null    object
dtypes: int64(1), object(2)
memory usage: 5.2+ KB


# String case

This is our string column:

In [43]:
carbon.Country

0                                              China
1                                      United States
2                                              India
3                                             Russia
4                                              Japan
                           ...                      
213                                       Antarctica
214    Saint Helena, Ascension, and Tristan da Cunha
215                                             Niue
216                         Northern Mariana Islands
217                                           Tuvalu
Name: Country, Length: 218, dtype: object

In [44]:
# do we have duplicates?
carbon[carbon.duplicated(subset='Country')]

Unnamed: 0,Country,co2_tonnes,Carbon_yearData


In [45]:
# do we have weird symbols?
carbon[carbon.Country.str.contains(r'[^\w\s]')]

Unnamed: 0,Country,co2_tonnes,Carbon_yearData
6,"Korea, South",686954000,2019
16,Turkey (Turkiye),391792000,2019
86,"Korea, North",18465000,2019
99,Cote d'Ivoire,11880000,2019
137,"Congo, Republic of the",4523000,2019
141,"Bahamas, The",3984000,2019
150,"Congo, Democratic Republic of the",2653000,2019
185,"Gambia, The",606000,2019
188,Timor-Leste,538000,2019
192,Guinea-Bissau,342000,2019


In [46]:
# accented letters or apostrophes in words:
carbon[carbon.Country.str.contains(r"\w*[\u00C0-\u01DA']\w*")]

Unnamed: 0,Country,co2_tonnes,Carbon_yearData
99,Cote d'Ivoire,11880000,2019


In [47]:
# only ascii (unidecode is a third-party package: pip install unidecode)
from unidecode import unidecode

carbon['Country']=carbon.Country.apply(unidecode)
forest['Country']=forest.Country.apply(unidecode)
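
If `unidecode` is not installed, the standard library can do a similar job for accented Latin letters. This is a sketch of a different technique (Unicode normalization), not a replacement with identical behaviour; the function `strip_accents` and the series `names` are made up for illustration:

```python
import unicodedata
import pandas as pd

def strip_accents(text):
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the marks
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# hypothetical country names with accents
names = pd.Series(["Côte d'Ivoire", 'Curaçao', 'São Tomé and Príncipe'])
cleanNames = names.apply(strip_accents)

print(cleanNames.tolist())
```

Keep in mind that characters without a decomposition (for instance `ø` or `ß`) are simply dropped here, while `unidecode` tries to transliterate them.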

Capitalization is an important step; consistent case helps in later stages when merging data frames:

* str.lower(): all to lowercase.

* str.upper(): ALL TO UPPERCASE.

* str.title(): First Character Of Each Word To Uppercase. 

* str.capitalize(): First character to uppercase and remaining to lowercase.

**You can only apply this if the cells are strings.**

Let's do it:

In [48]:
carbon_test=carbon.copy()
carbon_test['countryname']=carbon_test.Country.str.lower()
carbon_test['COUNTRYNAME']=carbon_test.Country.str.upper()
carbon_test['CountryName']=carbon_test.Country.str.title()
carbon_test

Unnamed: 0,Country,co2_tonnes,Carbon_yearData,countryname,COUNTRYNAME,CountryName
0,China,10773248000,2019,china,CHINA,China
1,United States,5144361000,2019,united states,UNITED STATES,United States
2,India,2314738000,2019,india,INDIA,India
3,Russia,1848070000,2019,russia,RUSSIA,Russia
4,Japan,1103234000,2019,japan,JAPAN,Japan
...,...,...,...,...,...,...
213,Antarctica,28000,2019,antarctica,ANTARCTICA,Antarctica
214,"Saint Helena, Ascension, and Tristan da Cunha",13000,2019,"saint helena, ascension, and tristan da cunha","SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA","Saint Helena, Ascension, And Tristan Da Cunha"
215,Niue,8000,2019,niue,NIUE,Niue
216,Northern Mariana Islands,0,2019,northern mariana islands,NORTHERN MARIANA ISLANDS,Northern Mariana Islands
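
The cell above does not show `str.capitalize()`, nor what happens when a cell is not a string. A small sketch on a made-up series `names` (the integer in it is there on purpose):

```python
import pandas as pd

# hypothetical values: two strings and one number
names = pd.Series(['united STATES', 'korea, south', 2019])

titled = names.str.title()            # each word starts with uppercase
capitalized = names.str.capitalize()  # only the first character is uppercase

print(titled.tolist())
print(capitalized.tolist())
```

In this sketch the non-string cell does not raise an error; it comes back as a missing value, which is easy to overlook.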


In [51]:
#Let's keep the upper case
carbon['Country']=carbon.Country.str.upper()
forest['Country']=forest.Country.str.upper()
forest

Unnamed: 0,Country,ForestRevenue_PctGDP,Forest_yearData
0,SOLOMON ISLANDS,20.27,2018
1,LIBERIA,13.27,2018
2,BURUNDI,10.31,2018
3,GUINEA-BISSAU,9.24,2018
4,CENTRAL AFRICAN REPUBLIC,8.99,2018
...,...,...,...
199,GUAM,0.00,2018
200,FAROE ISLANDS,0.00,2017
201,ARUBA,0.00,2017
202,VIRGIN ISLANDS,0.00,2017
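
Why does the case matter for merging? A minimal sketch with two made-up frames (`left` and `right`) whose keys differ only in case:

```python
import pandas as pd

# hypothetical frames: same countries, different capitalization
left = pd.DataFrame({'Country': ['Peru', 'CHILE'], 'co2': [1, 2]})
right = pd.DataFrame({'Country': ['PERU', 'Chile'], 'forest': [0.1, 0.2]})

# keys differ only in case, so nothing matches
before = left.merge(right, on='Country')

# same case on both sides, so every row matches
left['Country'] = left.Country.str.upper()
right['Country'] = right.Country.str.upper()
after = left.merge(right, on='Country')

print(len(before), len(after))
```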


In [62]:
# we can save this:
import os

carbon.to_csv(os.path.join("data","carbon_formatted.csv"),index=False)
forest.to_csv(os.path.join("data","forest_formatted.csv"),index=False)

# Numeric case

In [52]:
#looks good
carbon.Carbon_yearData

0      2019
1      2019
2      2019
3      2019
4      2019
       ... 
213    2019
214    2019
215    2019
216    2019
217    2019
Name: Carbon_yearData, Length: 218, dtype: object

In [53]:
# not numeric data
carbon.Carbon_yearData.info()

<class 'pandas.core.series.Series'>
RangeIndex: 218 entries, 0 to 217
Series name: Carbon_yearData
Non-Null Count  Dtype 
--------------  ----- 
218 non-null    object
dtypes: object(1)
memory usage: 1.8+ KB


In [54]:
# see ONE element
carbon.Carbon_yearData[0]

'2019'

You cannot get the **expected** statistics if the values are not recognised as numeric:

In [55]:
carbon.Carbon_yearData.describe()

count      218
unique       3
top       2019
freq       216
Name: Carbon_yearData, dtype: object

The easiest way is to use the **pd.to_numeric** function in pandas:

In [56]:
# now you get stats for numeric data
pd.to_numeric(carbon.Carbon_yearData).describe()

count     218.000000
mean     2018.958716
std         0.492471
min      2012.000000
25%      2019.000000
50%      2019.000000
75%      2019.000000
max      2019.000000
Name: Carbon_yearData, dtype: float64
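
The difference between both `describe()` outputs can be reproduced on a tiny example. This is a sketch with a made-up series; `dtype=object` is set on purpose to mimic the text column above:

```python
import pandas as pd

# hypothetical years stored as text
asText = pd.Series(['2019', '2019', '2012'], dtype=object)
asNumber = pd.to_numeric(asText)

# text: count, unique, top, freq
print(asText.describe().index.tolist())

# numbers: count, mean, std, min, quartiles, max
print(asNumber.describe().index.tolist())

print(asNumber.dtype)
```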

Then, we overwrite the column with its numeric version:

In [57]:
carbon['Carbon_yearData']=pd.to_numeric(carbon.Carbon_yearData)
carbon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Country          218 non-null    object
 1   co2_tonnes       218 non-null    int64 
 2   Carbon_yearData  218 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 5.2+ KB


In [58]:
forest['Forest_yearData']=pd.to_numeric(forest.Forest_yearData)
forest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country               204 non-null    object 
 1   ForestRevenue_PctGDP  204 non-null    float64
 2   Forest_yearData       204 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 4.9+ KB


Let's overwrite our previous files:

In [60]:
import os

carbon.to_csv(os.path.join("data","carbon_formatted.csv"),index=False)
forest.to_csv(os.path.join("data","forest_formatted.csv"),index=False)
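
Saving needs two things: `os` must be imported in the current session, and the target folder must exist, since `to_csv` does not create folders. A minimal sketch of a safer save; the temporary folder and the file name `demo_formatted.csv` are assumptions made so the example can run anywhere:

```python
import os
import tempfile
import pandas as pd

# hypothetical frame, not the scraped data
demo = pd.DataFrame({'Country': ['PERU'], 'year': [2019]})

# a temporary folder stands in for the project folder here
baseDir = tempfile.mkdtemp()
dataDir = os.path.join(baseDir, 'data')
os.makedirs(dataDir, exist_ok=True)   # no error if the folder already exists

outFile = os.path.join(dataDir, 'demo_formatted.csv')
demo.to_csv(outFile, index=False)

# reading the file back shows the numeric column is still numeric
print(pd.read_csv(outFile).dtypes)
```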

Notice that **pd.to_numeric** works directly only if the column has already been cleaned:

In [59]:
# a list of values
someValues=[1,2,3,4,'5','x']
dictValues={'someVals':someValues}
# a data frame
aDataFrame=pd.DataFrame(dictValues)

aDataFrame

In [None]:
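# default (errors='raise'): fails with a ValueError because of 'x'
pd.to_numeric(aDataFrame.someVals)

In [None]:
# values that cannot be parsed become NaN
pd.to_numeric(aDataFrame.someVals,errors='coerce')

In [None]:
# returns the input unchanged if parsing fails (deprecated in recent pandas versions)
pd.to_numeric(aDataFrame.someVals,errors='ignore')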

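In this case, **coerce** would be the best choice: the valid values are converted, and the invalid ones become missing values (NaN) that you can find and handle later.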
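
A short sketch of the default behaviour next to `errors='coerce'`, using the same values as the list above but in a standalone series (the name `coerced` is mine):

```python
import pandas as pd

someVals = pd.Series([1, 2, 3, 4, '5', 'x'])

# default behaviour: the unparseable value raises a ValueError
try:
    pd.to_numeric(someVals)
except ValueError as err:
    print('failed:', err)

# 'coerce' turns unparseable values into NaN, so the column becomes float
coerced = pd.to_numeric(someVals, errors='coerce')

print(coerced.tolist())
print(coerced.isna().sum())   # how many values could not be converted
```

Counting the missing values after coercing is a quick way to know how many cells still need manual cleaning.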