In [None]:
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:100% !important; }</style>"))

# Table of Contents
* [Lecture 2C - File I/O, Missing Values, Sorting and Ranking*](#Lecture-2C---File-I/O,-Missing-Values,-Sorting-and-Ranking*)
	* &nbsp;
		* [Content](#Content)
		* [Learning Outcomes](#Learning-Outcomes)
* [File Input/Output](#File-Input/Output)
	* [Importing data](#Importing-data)
		* [Microsoft Excel](#Microsoft-Excel)
		* [Importing Missing Data](#Importing-Missing-Data)
		* [Handling Missing Data](#Handling-Missing-Data)
	* [Writing Data to Files](#Writing-Data-to-Files)
* [Sorting and Ranking](#Sorting-and-Ranking)


# Lecture 2C - File I/O, Missing Values, Sorting and Ranking*

---

### Content

1. Reading data from files into data frames
2. Processing missing values in files
3. Methods for handling missing values in data frames
4. Writing data frames to files
5. Sorting and ranking values in data frames

\* This notebook material is adapted from Assoc. Prof. Fonnesbeck's tutorial on statistical data analysis in Python and closely follows "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython" by Wes McKinney.

### Learning Outcomes

At the end of this lecture, you should be able to:

* read data in csv and Excel formats from files into data frames
* apply methods for processing missing values as inputs from files
* apply various techniques for handling missing values in data frames
* write the data in data frames to files in csv format
* use sort and rank operations on values in data frames

In [None]:
from IPython.display import HTML
HTML("<iframe src=http://pandas.pydata.org/pandas-docs/stable/io.html width=800 height=350></iframe>")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Set some Pandas options as you like
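# Use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 20)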

In [None]:
#this line enables the plots to be embedded into the notebook
%matplotlib inline

# File Input/Output

## Importing data

A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. 


Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. 

These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported. We will explore the most basic functionality.

Let's start with a popular dataset from the UCI repository called 'Wine' and stored in csv (comma delimited) format.

This table can be read into a DataFrame using `read_csv`:

In [None]:
wd = pd.read_csv("../datasets/wine_data.csv")
wd.info()

Notice that `read_csv` automatically considered the first row in the file to be a header row. Our dataset does not have string names for the features in the column headers, but files of this type often do.

We can override the default behaviour by customizing some of the arguments, such as `header`, `names` or `index_col`.

In [None]:
pd.read_csv("../datasets/wine_data.csv", header=None).head()
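
The `names` and `index_col` arguments mentioned above can be sketched in the same way. The snippet below is a minimal illustration that reads a small in-memory sample instead of the Wine file; the sample values and the column labels passed to `names` are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a csv file without a header row
text = "1,14.23,1.71\n1,13.20,1.78\n2,12.37,0.94\n"
labels = ["Class", "Alcohol", "Malic acid"]  # assumed labels, for illustration only

# header=None stops the first row being used as column labels;
# names supplies the labels instead
sample = pd.read_csv(io.StringIO(text), header=None, names=labels)

# index_col promotes one of the columns to the row index
indexed = pd.read_csv(io.StringIO(text), header=None, names=labels, index_col="Class")

print(sample)
print(indexed)
```

With `index_col="Class"`, the class label becomes the row index and only the two remaining columns are kept as data columns.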

If we have sections of data that we do not wish to import (for example, known bad data), we can populate the `skiprows` argument:

In [None]:
pd.read_csv("../datasets/wine_data.csv", skiprows=[1,2,3]).head()


Alternatively, only selected columns can be read by passing their positions (or names) to `usecols`:

In [None]:
pd.read_csv("../datasets/wine_data.csv", usecols=[1,2]).head()

If we only want to import a small number of rows from, say, a very large data file we can use `nrows`:

In [None]:
pd.read_csv("../datasets/wine_data.csv", nrows=5)

**Exercise**: Write a script that reads the first 10 lines, and columns 6-12 of the file above.

### Microsoft Excel

Given that so much financial and scientific data ends up in Excel spreadsheets, Pandas' ability to directly import Excel spreadsheets is valuable. 

This support is contingent on having one or two dependencies installed (for example, `xlrd` for `.xls` files and `openpyxl` for `.xlsx` files), but the Python(xy) distribution has all the dependencies installed.

Since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:

In [None]:
excel1 = pd.read_excel('../datasets/wine_data.xls', sheet_name='sheet1', header=None)
excel1.head()

**Exercise**: Open the Excel file above, read 'sheet2' into a data frame and print the first 5 records.

### Importing Missing Data


Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`.

In [None]:
pd.read_csv("../datasets/wine_data_missing.csv").head(15)

Above, Pandas recognized `NaN` and an empty field as missing data.

In [None]:
pd.isnull(pd.read_csv("../datasets/wine_data_missing.csv")).head(15)

Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:
   

In [None]:
pd.read_csv("../datasets/wine_data_missing.csv", na_values=['?', -99999]).head(20)
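
To see the effect of `na_values` in isolation, the following sketch parses a small in-memory table twice, once with the defaults and once with the extra markers. The table contents are made up for illustration and are not taken from the Wine file.

```python
import io
import pandas as pd

# Hypothetical sample: an empty field, "NA", "?" and -99999 all stand for missing data
text = "a,b,c\n1,?,3\n4,5,-99999\n,8,NA\n"

# Defaults: only the empty field and "NA" are recognized as missing
default = pd.read_csv(io.StringIO(text))

# Extra markers: "?" and -99999 are now treated as missing as well
custom = pd.read_csv(io.StringIO(text), na_values=["?", -99999])

print(default.isnull().sum())  # missing values per column with the defaults
print(custom.isnull().sum())   # missing values per column with the extra markers
```

Note that with the defaults, column `b` is read as strings because of the unrecognized "?", whereas with `na_values` it is parsed as a numeric column.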

**Exercise**: Modify the line above in order to specify to pandas explicitly that one remaining null-value 'None' needs to be recognized as such. Read the data into a data frame.

### Handling Missing Data

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.

Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (*e.g.* NumPy).

Records containing NaN can be dropped:

In [None]:
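wine_df = pd.read_csv("../datasets/wine_data_missing.csv", na_values=['?', -99999])  # assumption: wine_df is not defined earlier, so load it here
wine_df.dropna()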

By default, `dropna()` drops entire rows in which one or more values are missing.

`dropna()` returns a new data frame and does not modify the original. We will see later how to make the change permanent.

The default behaviour can be overridden by passing the `how='all'` argument, which drops a row only when every field is a missing value.

In [None]:
wine_df.dropna(how='all')

This can be customized further with the `thresh` argument, which sets the minimum number of non-missing values a row must contain in order to be kept.

In [None]:
wine_df.dropna(thresh=4)
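
The three variants of `dropna()` are easier to compare on a small example. The following is a minimal sketch on a made-up data frame (not the Wine data) in which each row has a different number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0..3 contain 3, 2, 0 and 1 non-missing values respectively
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan],
})

any_missing = toy.dropna()           # drops every row with at least one missing value
all_missing = toy.dropna(how="all")  # drops only the row in which everything is missing
at_least_two = toy.dropna(thresh=2)  # keeps rows with two or more non-missing values

print(list(any_missing.index))   # [0]
print(list(all_missing.index))   # [0, 1, 3]
print(list(at_least_two.index))  # [0, 1]
```

In all three cases `toy` itself is left unchanged, because each call returns a new data frame.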

We can specify a filter to list records that do not have any missing values on a particular feature.

In [None]:
wine_df[wine_df['3'].notnull()]

**Exercise**: Write a script that lists records which have a null value for column 2.

Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. 

We can do this programmatically in Pandas with the `fillna` method.

In [None]:
wine_df.fillna(0)

We can also specify which column is to be filled with replacement values.

In [None]:
wine_df['2'].fillna(0)

Missing values can also be interpolated, using any one of a variety of methods. The following example fills each missing value with the value from the preceding row (a forward fill).

In [None]:
wine_df

In [None]:
wine_df.ffill()  # same result as fillna(method='ffill'), which recent pandas versions deprecate
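
As a rough comparison of the filling strategies, the sketch below applies a forward fill, a backward fill and linear interpolation to a short, made-up Series with a gap of two missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two consecutive missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()       # carry the last valid value forward
backward = s.bfill()      # carry the next valid value backward
linear = s.interpolate()  # linear interpolation between the valid neighbours

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
print(linear.tolist())    # [1.0, 2.0, 3.0, 4.0]
```

Which strategy is appropriate depends on the data: carrying values forward or backward suits ordered observations, while interpolation assumes that values change smoothly between rows.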

More sophisticated imputation can also be carried out, for example by replacing the missing values in a column with the mean of that column.

In [None]:
wine_df['3'].fillna(wine_df['3'].mean())

**Exercise**: Write a script that replaces all the null values in column '2' with the mean of that column.

Both `fillna()` and `dropna()` return a new object and leave the original untouched. To make the changes permanent, pass the argument `inplace=True` (or assign the result back to the variable).

In [None]:
wine_df.dropna(inplace=True)
wine_df

## Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. Here we will bring your attention to storing data in the csv format.

In [None]:
data = pd.DataFrame({'population':[3778000, 19138000, 20000, 447000, 4433000, 22680000, 10900, 549598],
                     'year':[2000, 2000, 2000, 2000, 2014, 2014, 2014, 2014],
                     'nation':['New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands', 
                                'New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands']})
data

In [None]:
data.to_csv("../datasets/population_data.csv")

The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file. You can specify custom delimiters (via `sep` argument), how missing values are written (via `na_rep` argument), whether the index is written (via `index` argument), whether the header is included (via `header` argument), among other options.
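
The options listed above can be tried without touching the file system, since `to_csv` returns the csv text as a string when no path is given. The following is a minimal sketch on a made-up data frame; the marker text `missing` is an arbitrary choice for the example.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing value
toy = pd.DataFrame({"nation": ["A", "B"], "population": [100.0, np.nan]})

# No path given, so the csv text is returned as a string
csv_text = toy.to_csv(sep=";", na_rep="missing", index=False)
print(csv_text)

# Reading it back requires the same delimiter and missing-value marker
round_trip = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=["missing"])
print(round_trip)
```

Passing `index=False` leaves the row index out of the output, which avoids an extra unnamed column appearing when the file is read back.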

**Exercise**: Write a script that writes the values in the data frame wine_df (from the example above) to a csv file with an appropriate name.

Columns are much easier to work with when they have descriptive names. As it turns out, the columns in the Wine dataset do have meaningful names. The first few are:

Col:Name
0)  Class
1)  Alcohol
2)  Malic acid
3)  Ash
4)  Alcalinity of ash  

We will now rename the columns in the Wine dataset and write the dataset, with its meaningful names, back to a file. The first step is to create a dictionary that maps the column numbers above to the new column names.

**Exercise**: Create a dictionary that maps the above column numbers to the new column names.

In [None]:
names = {
    # TODO: map each column label listed above to its new name
}

Columns can be renamed by calling the *rename()* method on a data frame. There are two arguments that we need to set in the call. The first is *columns=*, which is assigned a dictionary mapping old names to new names, and the second is *inplace=*, which we need to set to *True* in order to make our change permanent.
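
The mechanics of *rename()* can be sketched on a small, made-up data frame. The column labels and new names below are hypothetical and deliberately unrelated to the Wine dataset.

```python
import pandas as pd

# Hypothetical frame whose columns are labelled with strings of digits
toy = pd.DataFrame({"0": [1, 2], "1": [14.23, 12.37]})
mapping = {"0": "label", "1": "measurement"}  # old name -> new name

# Without inplace, rename() returns a renamed copy and leaves toy unchanged
renamed = toy.rename(columns=mapping)

# With inplace=True, the frame itself is modified and nothing is returned
toy_copy = toy.copy()
toy_copy.rename(columns=mapping, inplace=True)

print(list(toy.columns))       # ['0', '1']
print(list(renamed.columns))   # ['label', 'measurement']
print(list(toy_copy.columns))  # ['label', 'measurement']
```

Keys in the dictionary that do not match an existing column are ignored by default, so it is worth checking the resulting column names after renaming.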

**Exercise**: Use the *rename()* method and your dictionary object to change the column names of the wine_df data frame.

**Exercise**: Write the wine_df data frame to a file as before. Then read it back again and list the names of all the columns to confirm that your column changes have been correctly written to the file.

# Sorting and Ranking

Pandas objects include methods for re-ordering data.

The example below sorts a data frame by its index values, in descending order.

In [None]:
data.sort_index(ascending=False)

We can also sort by the values of a given feature:

In [None]:
data.population.sort_values(ascending=False)

For a `DataFrame`, we can sort according to the values of one or more columns using the `by` argument of `sort_values`:


In [None]:
data[['nation','population','year']].sort_values(ascending=[False,True], by=['nation', 'population'])

**Exercise**: Write a script that sorts `data` by 'nation' and 'year' in ascending order.

Ranking does not re-arrange data, but instead returns the rank of each value relative to the others in the Series. By default, tied values receive the average of the ranks they span.


In [None]:
data.population.rank()

Calling the `DataFrame`'s `rank` method results in the ranks of all columns:

In [None]:
data.rank(ascending=False)
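
How `rank` treats tied values is easiest to see on a short, made-up Series. The sketch below compares the default behaviour with a descending ranking and with the `method='min'` option.

```python
import pandas as pd

# Hypothetical Series in which the value 30 appears twice
s = pd.Series([10, 30, 20, 30])

default_rank = s.rank()                 # ties share the average of their ranks
descending = s.rank(ascending=False)    # the largest value gets rank 1
lowest_rank = s.rank(method="min")      # ties share the lowest rank of the group

print(default_rank.tolist())  # [1.0, 3.5, 2.0, 3.5]
print(descending.tolist())    # [4.0, 1.5, 3.0, 1.5]
print(lowest_rank.tolist())   # [1.0, 3.0, 2.0, 3.0]
```

This matters for columns such as `year` in `data`, where each value occurs several times and therefore produces fractional ranks under the default method.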

**Exercise**: Rank 'population' and 'year' in `data` in descending order.
