# Table of Contents
* [Learning Objectives:](#Learning-Objectives:)
* [Excel](#Excel)
	* [Pandas categoricals](#Pandas-categoricals)


# Learning Objectives:

After completion of this module, learners should be able to:

* Access data stored in Excel spreadsheets
* Identify and normalize redundant data in tabular formats

# Excel

There are several third-party Python modules for working with Microsoft Excel spreadsheets. A list of them is collected at:

* [Working with Excel Files in Python](http://www.python-excel.org/)

I've used [openpyxl](https://openpyxl.readthedocs.org/en/latest/) successfully in some projects.

However, within the Scientific Python tool stack, the most common way of accessing the Excel format is the [Pandas](http://pandas.pydata.org/) framework. This is heavier weight than other options if all you want to do is read Excel files, but in a scientific context you already need most of its requirements (NumPy, etc.), and you probably want to use Pandas for numerous other purposes anyway.

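Pandas relies internally on a third-party engine to read Excel files, but provides a higher-level wrapper. Older Pandas versions used `xlrd` for this; recent versions read `.xlsx` files with `openpyxl`, because `xlrd` 2.0 and later handles only the older `.xls` format. You probably need to run:

```bash
conda install openpyxl
```

to get the commands below to work.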

In [None]:
# Let's import the packages we will use in this Notebook
import pandas as pd

# A large(-ish) data set contained in an Excel spreadsheet
# 24 MiB file, 300k rows of largely categorical data
nyc_harbor_file = "data/nyc_harbor_wq_2006-2014.xlsx"
harbor_data = pd.read_excel(nyc_harbor_file)
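
The call above reads the first sheet with default settings. As a rough sketch of how `pd.read_excel` can be narrowed to part of a workbook, the example below builds a tiny workbook in memory and reads back only some columns and rows. The sample values and the sheet name `samples` are invented for illustration, and the round trip assumes `openpyxl` is installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the harbor spreadsheet: a tiny workbook in memory.
sample = pd.DataFrame({
    "STATION": ["GB1", "GB1", "E2"],
    "DATE": pd.to_datetime(["2006-01-03", "2006-01-10", "2006-01-03"]),
    "RESULT": [7.1, 6.8, 7.4],
    "COMMENTS": ["none", "none", "resampled"],
})
buffer = io.BytesIO()
sample.to_excel(buffer, sheet_name="samples", index=False)  # assumes openpyxl
buffer.seek(0)

# Read one named sheet, keep three columns, and stop after two data rows.
subset = pd.read_excel(
    buffer,
    sheet_name="samples",
    usecols=["STATION", "DATE", "RESULT"],
    dtype={"STATION": str},
    nrows=2,
)
print(subset)
```

Restricting `usecols` and `nrows` is mostly useful for a quick first look at a large file before committing to a full load.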

In [None]:
harbor_data.count()

In [None]:
harbor_data.UNIT.unique()

In [None]:
stations = harbor_data.STATION.unique()
station_ids = map(str, stations)
print(sorted(station_ids))

In [None]:
# The first two rows in which STATION 'GB1N' occurs
harbor_data[harbor_data.STATION == 'GB1N'][:2]

In [None]:
harbor_data.columns

In [None]:
print(harbor_data.DATE[:5])
print()
print(harbor_data.RESULT[:5])

In [None]:
harbor_data.dtypes

In [None]:
harbor_data[['STATION','DATE','RESULT']][:10]

In [None]:
harbor_data[:5]

In [None]:
station_not_name = harbor_data[harbor_data.STATION != harbor_data.STATION_NAME]
station_not_name[['STATION','STATION_NAME','DATE','PARAMETER_NAME']]
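
If a station ID always carries the same station name, the name column is redundant and can be moved into a lookup table. The following is a minimal sketch of that normalization on toy data. Only the column names come from the harbor data; the station IDs, names, and results are invented, and the variable names `stations` and `measurements` are just illustrative.

```python
import pandas as pd

# Toy rows with invented values; only the column names follow the harbor data.
samples = pd.DataFrame({
    "STATION": ["GB1", "GB1", "E2", "E2", "E2"],
    "STATION_NAME": ["Bay One", "Bay One", "East Two", "East Two", "East Two"],
    "RESULT": [7.1, 6.8, 7.4, 7.0, 7.2],
})

# A station ID with more than one distinct name would signal inconsistent data.
names_per_station = samples.groupby("STATION")["STATION_NAME"].nunique()
inconsistent = names_per_station[names_per_station > 1]

# Lookup table: one row per station.
stations = (
    samples[["STATION", "STATION_NAME"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Measurement table: the redundant name column is dropped.
measurements = samples.drop(columns="STATION_NAME")

# Joining the two tables recovers the original rows.
rebuilt = measurements.merge(stations, on="STATION", how="left")
print(stations)
print(rebuilt)
```

The check on `inconsistent` matters: splitting out a lookup table is only lossless when each key maps to exactly one value.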

## Pandas categoricals

Related to normalization, notice that our Pandas `DataFrame` is itself inefficient for the same reason that normalization is desirable: each column `Series` stores a large number of copies of the same strings. Moreover, those strings are stored as Python objects, which are processed much more slowly and indirectly than basic numeric types backed directly by NumPy arrays. We can improve this quite a bit.
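
The storage cost of repeated strings can be measured directly. The sketch below compares the memory footprint of a toy column stored as Python objects against the same column stored as a categorical; the station IDs and the repeat count are invented for illustration.

```python
import pandas as pd

# Hypothetical toy column: three distinct station IDs repeated many times.
as_object = pd.Series(["GB1", "E2", "NC0"] * 100_000, dtype=object)
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The exact numbers depend on the Python and Pandas versions, but the categorical version should be far smaller because each distinct string is stored only once.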

In [None]:
%%timeit
# Let's take a look at a relatively expensive query
water_depths = harbor_data.groupby(harbor_data.STATION).DEPTH_WATERCOL_FT.mean()
known_depths = water_depths[pd.notnull(water_depths)]
known_depths = known_depths.sort_values()

In [None]:
# Convert STATION to a categorical
harbor_data.STATION = harbor_data.STATION.astype('category')

In [None]:
%%timeit
# Run same operations on categoricalized DataFrame
water_depths = harbor_data.groupby(harbor_data.STATION).DEPTH_WATERCOL_FT.mean()
known_depths = water_depths[pd.notnull(water_depths)]
known_depths = known_depths.sort_values()

In [None]:
water_depths = harbor_data.groupby(harbor_data.STATION).DEPTH_WATERCOL_FT.mean()
known_depths = water_depths[pd.notnull(water_depths)]
known_depths = known_depths.sort_values()
known_depths

So what happened there? At a cursory glance the data still *looks* the same, but its storage strategy is now much more efficient.

In [None]:
harbor_data[['STATION', 'DEPTH_WATERCOL_FT']][:4]

In [None]:
harbor_data.STATION[:3]

In [None]:
harbor_data.STATION.cat.codes[:3]
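
To make the relationship between labels and codes concrete, here is a small sketch on an invented column. It shows that a categorical keeps each distinct label once in `cat.categories` and represents every row as a small integer in `cat.codes`.

```python
import pandas as pd

# Invented station IDs, stored as a categorical from the start.
station = pd.Series(["GB1", "E2", "GB1", "NC0", "E2"], dtype="category")

categories = station.cat.categories   # distinct labels, stored once (sorted)
codes = station.cat.codes             # one small integer per row

print(list(categories))
print(codes.tolist())

# Indexing the categories by the codes rebuilds the original labels.
decoded = categories[codes.to_numpy()]
print(list(decoded))
```

Operations such as grouping can then work on the integer codes rather than comparing strings row by row.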

In [None]:
# We can check which columns are good candidates to make categorical
harbor_data.dtypes[harbor_data.dtypes == object]
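
Not every object column benefits from conversion: a column of mostly unique strings gains little. One possible heuristic, sketched below, converts only object columns whose share of distinct values falls under a threshold. The helper name `categoricalize`, the `max_ratio` parameter, and its default of 0.5 are assumptions made for this example, not part of Pandas.

```python
import pandas as pd


def categoricalize(df, max_ratio=0.5):
    """Return a copy of df with low-cardinality object columns as categoricals.

    Hypothetical helper: max_ratio is the largest allowed ratio of distinct
    values to rows for a column to be converted.
    """
    out = df.copy()
    object_columns = out.dtypes[out.dtypes == object].index
    for name in object_columns:
        ratio = out[name].nunique() / max(len(out), 1)
        if ratio <= max_ratio:
            out[name] = out[name].astype("category")
    return out


# Toy frame with invented values; object dtype is set explicitly.
toy = pd.DataFrame({
    "STATION": pd.Series(["GB1", "E2", "GB1", "E2"], dtype=object),
    "COMMENTS": pd.Series(["a", "b", "c", "d"], dtype=object),
    "RESULT": [7.1, 6.8, 7.4, 7.0],
})
converted = categoricalize(toy)
print(converted.dtypes)
```

Here `STATION` repeats and is converted, while `COMMENTS` is unique in every row and stays as an object column. A sensible threshold for real data is a judgment call.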