# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Pandas Exercise 3: Relational Normalization](#Pandas-Exercise-3:-Relational-Normalization)
	* [Background on Reading Excel](#Background-on-Reading-Excel)
	* [Background on Relational Normalization](#Background-on-Relational-Normalization)
	* [Background on Categorical Data](#Background-on-Categorical-Data)
	* [Set-up](#Set-up)
	* [Part 1: Read the data](#Part-1:-Read-the-data)
	* [Part 2: Normalize](#Part-2:-Normalize)
	* [Part 3: Create a Sqlite3 database](#Part-3:-Create-a-Sqlite3-database)
	* [Part 4: Compare file sizes](#Part-4:-Compare-file-sizes)
	* [Part 5: Optional](#Part-5:-Optional)
	* [Part 6: Optional](#Part-6:-Optional)


# Learning Objectives

After completion of this module, learners should be able to:
* list various Python modules used for reading Excel files
* read an Excel data file into a pandas DataFrame
* use categoricals and other techniques to reduce data size
* use pandas to convert an Excel file into an Sqlite database file

# Pandas Exercise 3: Relational Normalization

## Background on Reading Excel

There are several 3rd party Python modules for working with Microsoft Excel spreadsheets.  A list of them is collected at:

* [Working with Excel Files in Python](http://www.python-excel.org/)

I've used [openpyxl](https://openpyxl.readthedocs.org/en/latest/) successfully in some projects.

However, within the Scientific Python toolstack, the most common way of accessing the Excel format is the [Pandas](http://pandas.pydata.org/) framework. This is heavier weight than other options if all you wanted to do was read Excel, but in a scientific context, you already need most of the requirements (NumPy, etc), and you probably want to be using Pandas for numerous other purposes anyway.

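Pandas relies on a third-party engine to read Excel files and wraps it in a higher-level interface. Older releases use `xlrd`; newer releases read `.xlsx` files with `openpyxl`, because `xlrd` 2.0 and later only supports the legacy `.xls` format. Depending on your versions, you probably need to run one of:

```bash
conda install xlrd      # older pandas, or legacy .xls files
conda install openpyxl  # newer pandas reading .xlsx files
```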
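
As a rough illustration of that higher-level wrapper, the sketch below reads one worksheet with `pandas.read_excel`. The helper name `read_first_sheet` and the path `example.xlsx` are placeholders invented for this example, and the read only runs if that file exists and a suitable Excel engine is installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical path, used only for illustration.
EXAMPLE_PATH = Path("example.xlsx")


def read_first_sheet(path, engine=None):
    """Read the first worksheet of an Excel file into a DataFrame.

    engine=None lets pandas pick an installed reader for the file type.
    """
    return pd.read_excel(path, sheet_name=0, engine=engine)


# Guarded so the cell is harmless when the placeholder file is absent.
if EXAMPLE_PATH.exists():
    df = read_first_sheet(EXAMPLE_PATH)
    print(df.dtypes)  # column types inferred by pandas
    print(df.shape)   # (rows, columns)
```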

## Background on Relational Normalization

Description from [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization):
> *Database normalization ... is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy. Normalization involves decomposing a table into less redundant tables without losing information*

## Background on Categorical Data

Description from the [documentation](https://pandas-docs.github.io/pandas-docs-travis/categorical.html):

> *Categoricals are a pandas data type, which correspond to categorical variables in statistics: **a variable, which can take on only a limited, and usually fixed, number of possible values** (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.*

In [None]:
# Categorical example: notice the counts for each category
import pandas as pd
s = pd.Series(pd.Categorical(["a","b","c","c","e"], categories=["c","a","b","d"]))

s.value_counts()

In [None]:
# Categorical example: notice the NaN for a value that did not match any category

s
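
A categorical stores each distinct value once and keeps a small integer code per row. The following sketch reuses the same values as the cells above (the variable name `cat_series` is ours) and inspects both pieces through the `.cat` accessor.

```python
import pandas as pd

cat_series = pd.Series(
    pd.Categorical(["a", "b", "c", "c", "e"], categories=["c", "a", "b", "d"])
)

# The distinct categories are stored once, in the order given...
print(list(cat_series.cat.categories))  # ['c', 'a', 'b', 'd']

# ...and each row holds an integer code pointing into that list.
# A value outside the categories ("e") gets the code -1 and displays as NaN.
print(cat_series.cat.codes.tolist())    # [1, 2, 0, 0, -1]
```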

## Set-up

In [None]:
## Optional: Uncomment to install the python module `xlrd` for reading Excel files
## Recommendation: use the built-in pandas methods instead.

#     !conda install -y xlrd

In [None]:
# Required: imports needed in this exercise
%matplotlib inline
import pandas as pd

## Part 1: Read the data

Read the NYC Harbor data from the Excel data file ``data/nyc_harbor_wq_2006-2014.xlsx`` into a DataFrame.

*Note: This Excel file is roughly 24 MB in size, containing 300k rows of largely categorical data. It may take some time to load...*

In [None]:
# Solution:


## Part 2: Normalize

A large fraction of all values in a given column are duplicates.
* Use the unique `STATION` values as categories to reduce data duplication stored in memory

In [None]:
# Solution:


## Part 3: Create a Sqlite3 database

Using the NYC Harbor data set, create an Sqlite3 single-file database containing all of the data inside the spreadsheet.

* Store the data in its native types per column/cell (Pandas does a good job of inferring data types)
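
One possible approach, sketched on toy data rather than the real spreadsheet, is `DataFrame.to_sql` with a standard-library `sqlite3` connection. Only `STATION` is a column name taken from this exercise; the table name `samples` and the other columns are hypothetical.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the harbor data.
toy = pd.DataFrame({
    "STATION": ["S1", "S1", "S2"],
    "reading": [1.5, 2.5, 4.0],  # hypothetical float column
    "n_obs": [3, 4, 5],          # hypothetical integer column
})

# ":memory:" keeps the example self-contained; passing a file name
# instead would produce a single-file database on disk.
conn = sqlite3.connect(":memory:")
toy.to_sql("samples", conn, index=False, if_exists="replace")

# Inspect the column names and types SQLite recorded for the new table.
schema = conn.execute("PRAGMA table_info(samples)").fetchall()
print([(name, col_type) for _, name, col_type, *_ in schema])

n_rows = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(n_rows)
conn.close()
```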

In [None]:
#Solution


## Part 4: Compare file sizes

Write code that compares the size of the resulting sqlite3 file with the size of the original Excel file.
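
As a hint, file sizes in bytes are available from `os.path.getsize`. The sketch below uses two temporary stand-in files so that it runs without the real data; the helper `size_ratio` and the file names are invented for this example.

```python
import os
import tempfile


def size_ratio(path_a, path_b):
    """Return size(path_a) / size(path_b), with sizes in bytes."""
    return os.path.getsize(path_a) / os.path.getsize(path_b)


# Stand-in files with known sizes, so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.bin")
    large = os.path.join(tmp, "large.bin")
    with open(small, "wb") as f:
        f.write(b"\0" * 1_000)
    with open(large, "wb") as f:
        f.write(b"\0" * 4_000)

    ratio = size_ratio(large, small)
    print(f"{os.path.getsize(large)} vs {os.path.getsize(small)} bytes")
    print(f"ratio: {ratio:.1f}")
```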

In [None]:
#Solution


## Part 5: Optional

Compose some interesting queries of the database to extract patterns or features of the data.
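
For instance, an aggregate per station can be computed in SQL and loaded back with `pandas.read_sql_query`. This is a sketch on a toy in-memory table: `STATION` comes from the exercise, while the table name `samples`, the `reading` column, and the values are made up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "STATION": ["S1", "S1", "S2", "S2", "S2"],
    "reading": [1.0, 3.0, 2.0, 4.0, 6.0],  # hypothetical measurement column
}).to_sql("samples", conn, index=False)

# Count the rows and average the readings for each station.
query = """
    SELECT STATION, COUNT(*) AS n, AVG(reading) AS mean_reading
    FROM samples
    GROUP BY STATION
    ORDER BY STATION
"""
summary = pd.read_sql_query(query, conn)
conn.close()
print(summary)
```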

In [None]:
#Solution


## Part 6: Optional

If you have access configured, try the exercise using a general purpose RDBMS, such as MySQL, PostgreSQL, SQL Server, etc.

Related to normalization, notice that the Pandas `DataFrame` itself is inefficient for the same reasons that make normalization desirable. Many copies of the same string are stored within a single column `Series`. Moreover, those strings are often stored as Python objects, which are processed much more slowly and indirectly than basic numeric types backed by `numpy` arrays. Converting such columns to categoricals can improve this considerably.
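
A rough way to see the difference is to compare `memory_usage(deep=True)` before and after converting a string column to the `category` dtype. The data below is made up to mimic a column with few distinct values; the exact byte counts depend on your pandas version, so none are quoted here.

```python
import pandas as pd

# Made-up data: 3 distinct strings repeated many times.
names = pd.Series(["Station A", "Station B", "Station C"] * 10_000)

as_category = names.astype("category")

# deep=True also counts the string data itself, not just references to it.
bytes_plain = names.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(bytes_plain, bytes_category)
print(as_category.dtype)  # category
```

The values themselves are unchanged by the conversion; only the storage differs.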