<center><img src="https://github.com/DACSS-PreProcessing/Week_1_main/blob/main/pics/LogoSimple.png?raw=true" width="700"></center>

# Data Cleaning in Python

<a id='home'></a>

In this session we will:

1. Collect data as a dataframe

2. Clean data:
    * Fix column names
    * Fix data contents

## 1. Collect data tables


### Read a File

I keep the **Human Development Index** data in a folder in a GitHub repo. I downloaded it from this [link](https://hdr.undp.org/data-center/documentation-and-downloads) (_Table 1_).

In [None]:
# Location of data file
linkFile="https://github.com/DACSS-PreProcessing/dataCleaning_Py/raw/main/data/HDI_Table.xlsx"

We will read the table from the file using pandas. Since it is an Excel file, the package **openpyxl** must be installed beforehand:

In [None]:
# is it available on my computer?
!pip show openpyxl
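
If you prefer to check from Python itself rather than from the shell, here is a minimal sketch using the standard library's `importlib.util.find_spec`. The variable name `has_openpyxl` is only illustrative.

```python
import importlib.util

# find_spec returns None when the package cannot be located
has_openpyxl = importlib.util.find_spec("openpyxl") is not None

if has_openpyxl:
    print("openpyxl is installed")
else:
    print("openpyxl is missing; install it before calling pd.read_excel")
```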

If not available, please go to Anaconda and install it. Once installed, or if available, continue:

In [None]:
import pandas as pd

hdiFile=pd.read_excel(linkFile)

Take a look:

In [None]:
hdiFile

## 2. Cleaning Process


### Fix column names


#### Recover row of column names

Notice that the dataframe does not have the right column names: the real ones ended up inside the rows. We need to recover them before we go on:

In [None]:
hdiFile.iloc[[3,4],:]

As you see, the column names are in different positions:

In [None]:
# here
hdiFile.iloc[3,2:]

In [None]:
# and here
hdiFile.iloc[4,:2]

It is easier if we have lists, so we can concatenate:

In [None]:
# save column names turned to lists

RealHeaders=hdiFile.iloc[4,:2].to_list()+hdiFile.iloc[3,2:].to_list()

# these are:
RealHeaders
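
To see the idea in isolation, here is a small sketch on a made-up table (the `toy` dataframe and its values are invented for illustration, not taken from the HDI file). The headers are split across two rows, and we join the two pieces as lists:

```python
import pandas as pd

# Toy table (invented): the headers are split across rows 0 and 1
toy = pd.DataFrame([
    [None, None, "Index (HDI)", "Life expectancy"],
    ["Rank", "Country", None, None],
    [1, "A", 0.9, 80.1],
])

# the first two names come from row 1, the rest from row 0
toy_headers = toy.iloc[1, :2].to_list() + toy.iloc[0, 2:].to_list()
print(toy_headers)
```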

Let's put the row of names in the right place:

In [None]:
# rename all the columns
hdiFile.columns=RealHeaders

# newDF
better_1=hdiFile.copy()

# see head
better_1.head()

#### Subset to drop unneeded columns


Notice the repeated column names (HDI rank) and the _NaN_ ones. Notice also that we do not need the last four columns. Let's solve that:

In [None]:
# without the last 4 columns
better_1.iloc[:,:-4]

We save the previous result as a new dataframe:

In [None]:
# without the last four
better_2=better_1.iloc[:,:-4]
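
As a quick reminder of how negative slicing works in `iloc`, here is a sketch on an invented one-row dataframe. Adding `.copy()` is an optional choice of ours, so that later changes do not depend on the original object:

```python
import pandas as pd

# Toy dataframe (invented) with six columns
toy = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list("abcdef"))

# everything except the last four columns, as an independent copy
trimmed = toy.iloc[:, :-4].copy()
print(trimmed.columns.to_list())
```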

We still have column names with missing values:

In [None]:
better_2.columns

...let's get rid of those:

In [None]:
# column names without missing values
better_2.columns.dropna()

In [None]:
# make the change!

BetterHeaders=better_2.columns.dropna()
#result
BetterHeaders

In [None]:
#subsetting again to keep the good headers

better_2=better_2.loc[:,BetterHeaders]

#see
better_2.head(10)
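
An alternative sketch, on invented data: instead of selecting by label, we can build a boolean mask with `columns.notna()`. Because the mask selects by position, it does not rely on the remaining labels being unique. The names `toy`, `keep` and `toy_clean` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe (invented) where two column labels are missing
toy = pd.DataFrame([[1, 2, 3, 4]], columns=["Rank", np.nan, "Country", np.nan])

# True where the column label is not missing
keep = toy.columns.notna()

# keep only the columns with a real label
toy_clean = toy.loc[:, keep]
print(toy_clean.columns.to_list())
```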

It is time to offer a better set of column names.

#### Clean column names

The current situation:

In [None]:
better_3=better_2.copy()
better_3.columns.to_list() # always use to_list()

Notice above that the columns:
* Have acronyms in parenthesis.
* Have spaces between words.

**Option 1**: Cleaner column names, with underscores instead of _blank spaces_.

In [None]:
# bye anything between parentheses
pattern1=r'\(.+\)' # raw string, so the backslashes reach the regex engine untouched
better_3.columns.str.replace(pattern1,repl="", regex=True)
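
A word of caution: `.+` is greedy, so it matches from the first opening parenthesis to the last closing one. That is fine when each name has a single parenthesized group, but it removes too much when there are two. The sketch below, on an invented column name, compares it with the non-greedy `.+?`:

```python
import pandas as pd

# Toy name (invented) with two parenthesized groups
names = pd.Index(["Income (GNI) per capita (2017 PPP $)"])

# greedy: one match from the first "(" to the last ")"
greedy = names.str.replace(r"\(.+\)", "", regex=True)

# non-greedy: each group is matched separately
lazy = names.str.replace(r"\(.+?\)", "", regex=True)

print(greedy.to_list())
print(lazy.to_list())
```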

In [None]:
# bye anything between parentheses, bye leading-trailing spaces
better_3.columns.str.replace(pattern1,repl="", regex=True).str.strip()

In [None]:
# bye anything between parentheses, bye leading-trailing spaces, title case
better_3.columns.str.replace(pattern1,repl="", regex=True).str.strip().str.title()

In [None]:
# bye anything between parentheses, bye leading-trailing spaces, title case, spaces replaced
pattern2=r'\s+' # raw string: one or more whitespace characters
better_3.columns.str.replace(pattern1,repl="", regex=True).str.strip().str.title().\
                                                           str.replace(pattern2,repl='_',regex=True)
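
If you need to repeat these steps on other tables, the chain could be wrapped in a small function. This is only a sketch: `tidy_names` is a hypothetical helper of ours, and the column names are invented.

```python
import pandas as pd

def tidy_names(columns: pd.Index) -> pd.Index:
    """Hypothetical helper: applies the four cleaning steps in order."""
    return (
        columns.str.replace(r"\(.+\)", "", regex=True)  # drop text in parentheses
               .str.strip()                             # drop leading/trailing spaces
               .str.title()                             # title case
               .str.replace(r"\s+", "_", regex=True)    # spaces to underscores
    )

toy_cols = pd.Index(["HDI rank", "Human Development Index (HDI) ", "Life expectancy at birth"])
tidy = tidy_names(toy_cols)
print(tidy.to_list())
```

Notice that title case also lowercases the rest of each word, so `HDI rank` becomes `Hdi_Rank`.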

**Option 2**: Shortening using camel case

This requires a good data dictionary in your README!

In [None]:
# one step before we had...

newNames=better_3.columns.str.replace(pattern1,repl="", regex=True).str.strip().str.title()
newNames

In [None]:
names_camel=newNames.str.replace(" ",'',regex=False)
names_camel

**Option 3**: Shortening using acronyms

We will do this only for the _variables_:

In [None]:
# each column name split into words:
[name.split() for name in newNames[2::]]

In [None]:
# first letter of each word
[[word[0] for word in name.split()] for name in newNames[2::]]

In [None]:
# concatenating
acronyms=[''.join([word[0] for word in name.split()]) for name in newNames[2::]]
acronyms

In [None]:
alreadyShorthened=names_camel[:2].to_list() # the previous 2 columns
better_3.columns= alreadyShorthened+ acronyms
better_3.columns
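
Acronyms are only safe as column names if none of them repeats. Here is a minimal sketch of that check using `Index.is_unique`; the names in `toy_names` are invented examples, not the result of the cells above.

```python
import pandas as pd

# Toy names (invented), already in title case
toy_names = pd.Index(["Human Development Index", "Life Expectancy At Birth",
                      "Expected Years Of Schooling", "Mean Years Of Schooling"])

# first letter of each word, joined
toy_acronyms = ["".join(word[0] for word in name.split()) for name in toy_names]
print(toy_acronyms)

# True only if no acronym appears twice
all_unique = pd.Index(toy_acronyms).is_unique
print(all_unique)
```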

Let's keep the last alternative:

In [None]:
better_3.head(10)


______


### Fix Data contents

After becoming familiar with the data, we focus on data contents.

#### Cleaning based on cells with missing values

See all rows with at least one missing value:

In [None]:
# next DF
better_4=better_3.copy()
better_4[better_4.isna().any(axis=1)]

This exploration shows that 84 rows have at least one missing value.
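
Counts like this one can be computed directly instead of read off the display. The sketch below uses an invented dataframe to contrast `any(axis=1)` (at least one missing cell in the row) with `all(axis=1)` (every cell in the row is missing):

```python
import numpy as np
import pandas as pd

# Toy dataframe (invented)
toy = pd.DataFrame({
    "Country": ["A", "B", None, "D"],
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [2.0, 3.0, np.nan, np.nan],
})

rows_any_missing = toy.isna().any(axis=1).sum()  # rows with at least one NaN
rows_all_missing = toy.isna().all(axis=1).sum()  # rows where every cell is NaN
print(rows_any_missing, rows_all_missing)
```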

First decision: drop rows where all variable values are missing:

In [None]:
# will keep rows where there is at least one value in the variable columns

better_4=better_4[~better_4.iloc[:,2:].isna().all(axis=1)]

# filtered!
better_4

Second decision: drop the rows where 'Country', the ID, is missing.

In [None]:
better_4=better_4[~better_4.loc[:,'Country'].isna()]
better_4

Let's explore why some rows have no ranking:

In [None]:
better_4[better_4.loc[:,'HdiRank'].isna()]

Third decision: keep only the rows with valid values in the key columns (a numeric HDI and a ranking):

In [None]:
# detecting non-numeric cells in HDI
better_4[pd.to_numeric(better_4.HDI,errors='coerce').isna()]

In [None]:
# then
better_4=better_4[~pd.to_numeric(better_4.HDI,errors='coerce').isna()]
better_4
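
To see what `errors='coerce'` does, here is a sketch on an invented series: anything that cannot be parsed as a number becomes `NaN`, which is what lets us detect it with `isna()`. The `".."` placeholder is just an example of a non-numeric cell.

```python
import pandas as pd

# Toy series (invented): a number, a placeholder, a numeric string, a missing value
toy_hdi = pd.Series([0.962, "..", "0.5", None])

# unparseable cells become NaN instead of raising an error
as_number = pd.to_numeric(toy_hdi, errors="coerce")

not_numeric = as_number.isna()
print(not_numeric.to_list())
```

Notice that the string `"0.5"` is parsed successfully, so it passes the filter. The filter only selects rows; it does not convert the column.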

In [None]:
# keep good ranking values
better_4=better_4[~better_4.loc[:,'HdiRank'].isna()]
better_4

#### Cleaning based on cell contents

It seems pretty clean. However, let's play it safe and get rid of trailing or leading spaces:

In [None]:
# no trailing nor leading spaces
better_4.loc[:,'Country']=better_4.Country.str.strip()
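
If you want to know whether stripping changed anything, you can compare the values before and after. A minimal sketch on invented country names:

```python
import pandas as pd

# Toy series (invented) with stray spaces
toy_country = pd.Series([" Peru", "Chile ", "Brazil"])

stripped = toy_country.str.strip()

# the original values that had leading or trailing spaces
changed = toy_country[toy_country != stripped]
print(changed.to_list())
```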

Are the numeric values read as strings?

In [None]:
better_4.iloc[0,:].to_list()

If this does not work, the numbers are not clean:

In [None]:
better_4.iloc[:,2:].sum()
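
A sum is an indirect check, and a column of numeric-looking strings may not fail the way you expect. Inspecting the data types is more direct. The sketch below, on an invented dataframe, converts the variable columns explicitly with `pd.to_numeric` and then confirms the result; the column names are only examples.

```python
import pandas as pd

# Toy dataframe (invented): HDI was read as text
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "HDI": ["0.9", "0.8"],
    "LEAB": [80.1, 75.3],
})
print(toy.dtypes)

# convert each variable column; unparseable cells would become NaN
converted = toy.copy()
for col in ["HDI", "LEAB"]:
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

# True when the column now holds numbers
hdi_is_numeric = pd.api.types.is_numeric_dtype(converted["HDI"])
total_hdi = converted["HDI"].sum()
print(hdi_is_numeric, total_hdi)
```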

Finally, let's reset the row indexes:

In [None]:
better_4.reset_index(drop=True, inplace=True)
better_4
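
For reference, here is what `reset_index(drop=True)` does on an invented dataframe whose index has gaps after filtering. Without `drop=True`, the old index would be kept as a new column.

```python
import pandas as pd

# Toy dataframe (invented) with a non-consecutive index
toy = pd.DataFrame({"Country": ["A", "B", "C"]}, index=[5, 9, 12])

# new consecutive index starting at 0; the old one is discarded
toy_reset = toy.reset_index(drop=True)
print(toy_reset.index.to_list())
```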