# Data Extraction with Python (part 2)
category: [Python, DataExtraction, pandas, excel]

Last time we talked about extracting data from HTML files (the data extraction part of web scraping). In real-life projects we will sometimes meet data stored in Excel format instead. Depending on the data, there are different ways to extract it.

### Pandas

The easiest way to extract information from an Excel file is probably the built-in pandas function pd.read_excel(). Pandas is an external library, so we need to install it before use (with pip, conda, or another method). To read .xlsx files, pandas also relies on a separate engine package such as openpyxl.

In [2]:
#hide
!pip install pandas



In [3]:
import pandas as pd

# evaluating the function name without parentheses shows its signature in the notebook
pd.read_excel

<function pandas.io.excel._base.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True, storage_options: Union[Dict[str, Any], NoneType] = None)>

The read_excel function returns a <a href='https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html'>DataFrame</a> object, which is quite similar to an Excel table. As you can see in the signature above or in the <a href='https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html'>documentation</a>, read_excel has quite a few parameters. If we leave everything at its default, pandas takes the first row as the column names and tries to convert each column to a type it recognizes (float or datetime, for example; note that a column of whole numbers with any empty cell comes back as float64, not int64). Here I will introduce several parameters that I find useful.

In practice, clients' data almost always contains inconsistencies: maybe the file is corrupted, maybe someone did not follow the format strictly (e.g. someone may enter XX-XX-2021 when they are not sure about the date, which makes pandas treat the whole column as strings instead of datetimes). <b>It is usually a good idea to read all columns as strings and convert them afterwards.</b> To stop pandas from inferring column types, we can set dtype to 'object', which keeps each cell value as it was read; passing dtype=str instead converts every value to an actual string. (The most common data types in pandas are float64, int64, datetime64 and object.)



In [None]:
pd.read_excel(file, dtype='object')
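The "convert afterwards" step could look roughly like the sketch below. The DataFrame here is hypothetical data typed in by hand to stand in for a sheet read as strings, and the day-month-year date format is an assumption; adjust both to the real file.

```python
import pandas as pd

# Hypothetical data standing in for a sheet read with dtype=str.
raw = pd.DataFrame({
    "order_date": ["01-03-2021", "XX-XX-2021", "15-03-2021"],
    "amount": ["10.5", "n/a", "7"],
})

clean = raw.copy()
# errors="coerce" turns unparseable entries into NaT / NaN instead of raising.
# The "%d-%m-%Y" format is an assumption about how the dates were typed.
clean["order_date"] = pd.to_datetime(
    raw["order_date"], format="%d-%m-%Y", errors="coerce"
)
clean["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# Keep the original strings of the rows that failed, for manual review.
bad_dates = raw.loc[clean["order_date"].isna(), "order_date"].tolist()
print(bad_dates)
```

Keeping the raw strings next to the converted columns makes it easy to report the problem rows back to whoever maintains the file.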

Speaking of inconsistent formats, there is one thing we may want pandas to convert for us: the 'na' values. In practice, perhaps because the Excel file was filled in by different people with different input styles, it is not unusual to find 'NA', '', 'None', etc. in the same file. The na_values parameter takes a list of additional strings that pandas should treat as missing.

By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

In [None]:
pd.read_excel(file, na_values=['na', '', 'None'])
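To see the effect without needing an Excel file, the sketch below uses pd.read_csv on an in-memory string as a stand-in; read_csv accepts the same na_values parameter. The sample rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample with three different ways of writing "missing".
csv_text = "name,score\nalice,10\nbob,na\ncarol,None\ndave,\n"

# read_csv is used only as a stand-in for read_excel here;
# both take na_values as a list of extra strings to treat as missing.
df = pd.read_csv(io.StringIO(csv_text), na_values=["na", "None"])

missing = df["score"].isna().tolist()
print(missing)
```

The empty cell is already covered by the default list, so only the extra spellings need to be passed.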

Sometimes an Excel file consists of several sheets. The sheet_name parameter controls which ones are read: setting it to None reads all sheets and returns them as a dictionary of DataFrames keyed by sheet name. It also accepts a sheet position (starting from 0), a sheet name, or a list of positions and names.

In [None]:
pd.read_excel(file, sheet_name=None)
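Once the sheets arrive as a dictionary, one possible next step is to stack them into a single DataFrame. In the sketch below the `sheets` dictionary is built by hand with hypothetical sheet names and values, standing in for the result of sheet_name=None, and it assumes all sheets share the same columns.

```python
import pandas as pd

# Hand-built stand-in for the dict returned by sheet_name=None.
sheets = {
    "Jan": pd.DataFrame({"amount": [1, 2]}),
    "Feb": pd.DataFrame({"amount": [3]}),
}

# Passing a dict to concat uses its keys as an outer index level;
# reset_index then moves that level into an ordinary "sheet" column.
combined = (
    pd.concat(sheets, names=["sheet", "row"])
    .reset_index(level="sheet")
)
print(combined)
```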

header and names: used together when the first row does not hold the column names. We set header to None and supply the column names manually (as a list of strings) through the names parameter.

In [None]:
pd.read_excel(file, header=None, names=['first column', 'second column', 'third column'])
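A small runnable illustration of the same two parameters, again using pd.read_csv on an in-memory string as a stand-in for read_excel (both functions accept header and names). The data and column names are invented for the example.

```python
import io

import pandas as pd

# Invented data whose first row is already a data row, not a header.
csv_text = "1,alice,10\n2,bob,20\n"

# header=None stops pandas from consuming the first row as column names;
# names supplies the labels instead.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["id", "name", "score"],
)
print(df)
```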

So now we have our Excel tables read as DataFrame objects. Before we move on to calculating, reformatting, combining or splitting the data, there are two simple data cleansing steps we might want to do.

For whatever reason, there is always a chance of empty rows in between the data. The df.dropna function in pandas removes them easily.

In [14]:
df = pd.DataFrame([['column a', 'column b', 'column c'], [], ['a', 'b', 'c']])
df

Unnamed: 0,0,1,2
0,column a,column b,column c
1,,,
2,a,b,c


In [15]:
# axis=0 (or 'index') drops rows, axis=1 (or 'columns') drops columns
# how='all' drops a row only when every value in it is missing
df = df.dropna(axis=0, how='all')
df

Unnamed: 0,0,1,2
0,column a,column b,column c
2,a,b,c


In [20]:
# we can also renumber the index after dropping rows by chaining .reset_index() after dropna
# drop=True prevents the old index from becoming a new column
df = pd.DataFrame([['column a', 'column b', 'column c'], [], ['a', 'b', 'c']])
df = df.dropna(axis=0, how='all').reset_index(drop=True)
df

Unnamed: 0,0,1,2
0,column a,column b,column c
1,a,b,c


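We can also use the subset parameter with a list of column names, so that pandas only looks at those columns when deciding which rows to drop. With the default how='any', a row is dropped if any of the subset columns is empty; add how='all' to drop it only when all of them are empty.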

In [23]:
# if we set the subset to [0], row 1 has a value in column 0, so it won't be dropped
df = pd.DataFrame([['column a', 'column b', 'column c'], ['a', None, None], ['a', 'b', 'c']])
df = df.dropna(axis=0, subset=[0])
df

Unnamed: 0,0,1,2
0,column a,column b,column c
1,a,,
2,a,b,c


In [22]:
# if we set the subset to [1, 2], row 1 will be dropped
df = pd.DataFrame([['column a', 'column b', 'column c'], ['a', None, None], ['a', 'b', 'c']])
df = df.dropna(axis=0, subset=[1, 2])
df

Unnamed: 0,0,1,2
0,column a,column b,column c
2,a,b,c
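The difference between how='any' and how='all' only shows up when a row is missing some, but not all, of the subset columns. A minimal sketch with made-up column names:

```python
import pandas as pd

# Made-up contact list: row 1 is missing only the email, row 2 is missing both.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["a@x.org", None, None],
    "phone": ["555", "556", None],
})

# Default how="any": drop a row if either email or phone is missing.
any_missing = df.dropna(subset=["email", "phone"])

# how="all": drop a row only if both email and phone are missing.
all_missing = df.dropna(subset=["email", "phone"], how="all")

print(any_missing["id"].tolist())
print(all_missing["id"].tolist())
```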


The other useful tool in pandas is transpose, which swaps the rows and columns.

In [24]:
df = pd.DataFrame([['column a', 'column b', 'column c'], ['a', None, None], ['a', 'b', 'c']])
df

Unnamed: 0,0,1,2
0,column a,column b,column c
1,a,,
2,a,b,c


In [26]:
df_transpose = df.T
df_transpose

Unnamed: 0,0,1,2
0,column a,a,a
1,column b,,b
2,column c,,c
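With labelled data it is easier to see that the transpose also swaps the labels: column names become the index and the index becomes the column names. A short sketch with invented labels:

```python
import pandas as pd

# Invented labels: rows "x" and "y", columns "a" and "b".
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

t = df.T

# The old column names are now the index, and the old index is the columns.
print(t.index.tolist(), t.columns.tolist())
```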
