Reading a file with Pandas
==
---

Pandas is a library widely used for statistics and analysis
* Has functions which allow you to read a file directly into your script
* Borrows many features from R's data frames

*   Read a Comma Separated Values (CSV) data file with `pandas.read_csv`.
    * File paths use the same notation as in bash (`./` refers to the current folder, `../` to the parent folder)
    * Argument is the name of the file to be read.
    * Assign result to a variable to store the data that was read.
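
As a minimal sketch of the points above, the cell below writes a tiny made-up CSV file to a temporary folder and reads it back with `pandas.read_csv`. The file name `sample.csv`, the variable name `sample`, and the data values are assumptions for illustration only; they are not part of the gapminder data.

```python
import os
import tempfile

import pandas

# Hypothetical sample data, written to a temporary folder so the example is self-contained.
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "sample.csv")
with open(csv_path, "w") as handle:
    handle.write("country,year,pop\n")
    handle.write("Atlantis,1952,1000\n")
    handle.write("Atlantis,1957,1200\n")

# The argument is the path of the file to read;
# the result is assigned to a variable so we can use the data later.
sample = pandas.read_csv(csv_path)
print(sample)
```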

## Accessing Files

We're using the gapminder data that we created yesterday. Remember that these are stored in the shell_lessons directory in a `data` sub-directory, which is why the path to the file is `../shell_lessons/data/gapminder_data/gapminder_final.txt`. If you forget to include `../shell_lessons/`, or if you include it but your copy of the file is somewhere else, you will get a [runtime error]({{ site.github.url }}/05-error-messages/) that ends with a line like this:
    ~~~
    OSError: File b'gapminder_final.txt' does not exist
    ~~~
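
If you run into this error, one possible way to investigate is to ask Python where it is looking. The sketch below uses the standard `pathlib` module (not otherwise used in this lesson) to show the current folder and whether the file exists at the relative path; the variable name `data_file` is just an illustrative choice.

```python
from pathlib import Path

# The same relative path used in this lesson.
data_file = Path("../shell_lessons/data/gapminder_data/gapminder_final.txt")

# Relative paths are resolved starting from the current working folder.
print("Current folder:", Path.cwd())
print("Looking for:   ", data_file.resolve())
print("File exists:   ", data_file.exists())
```

If the last line prints `False`, the path is wrong for your setup, and `read_table` or `read_csv` would fail with the error shown above.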
    
**Don't forget to use the tab key for auto-completion**
* Auto-complete works in Jupyter notebooks!

In [3]:
# First, import the pandas library
import pandas

In [2]:
# Then read the data file (read_table expects tab-separated values by default)
df = pandas.read_table("../shell_lessons/data/gapminder_data/gapminder_final.txt")


Print the data frame:

In [4]:
print(df)


When we load a CSV file with Pandas, it gets loaded into a DataFrame.

A DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column. So a data frame and a series correspond to a table and a column.


*   The columns in a data frame are the observed variables, and the rows are the observations.
*   Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.
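
To make the table and column idea concrete, here is a small sketch using a made-up data frame (the name `toy`, the column names, and the values are all hypothetical, not taken from the gapminder file):

```python
import pandas

# Hypothetical toy table: two variables (columns), three observations (rows).
toy = pandas.DataFrame({
    "country": ["Atlantis", "Lemuria", "Mu"],
    "pop": [1000, 2500, 400],
})

print(type(toy))         # the whole table is a DataFrame
print(type(toy["pop"]))  # selecting a single column gives a Series
print(toy.shape)         # (number of rows, number of columns)
```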

---
## EtherPad

Hypothetically, the data for a project you are working on is stored in a file called `microbes.csv`, which is located in a folder called `field_data`. You are doing analysis in a notebook called `analysis.ipynb` in a sibling folder called `thesis`. Your directory structure looks like this:
    ~~~
    your_home_directory
    +-- field_data/
    |   +-- microbes.csv
    +-- thesis/
        +-- analysis.ipynb
    ~~~

What value(s) should you pass to `read_csv` to read `microbes.csv` in `analysis.ipynb`? Vote for your answer in EtherPad.

    a. "/field_data/microbes.csv"
    b. "./field_data/microbes.csv"
    c. "field_data/microbes.csv"
    d. "../field_data/microbes.csv"

---

## Use `DataFrame.info` to find out more about a data frame.
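
Before trying it on the gapminder data, here is a rough illustration of what `info` reports, using a made-up data frame (the name `toy` and its contents are hypothetical):

```python
import pandas

# Hypothetical toy table for illustration only.
toy = pandas.DataFrame({
    "country": ["Atlantis", "Lemuria", "Mu"],
    "pop": [1000, 2500, 400],
})

# info() prints the index range, each column's name, its count of
# non-null values, its data type, and the memory used.
toy.info()
```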

In [None]:
# Write your code here


## Use `DataFrame.describe` to get summary statistics about data.

DataFrame.describe() gets the summary statistics of only the columns that have numerical data. 
All other columns are ignored.
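
A short sketch of this behaviour on a made-up data frame follows. The name `toy`, the column names `pop` and `lifeExp`, and the values are assumptions for illustration; your gapminder columns may be named differently.

```python
import pandas

# Hypothetical toy table: one text column and two numerical columns.
toy = pandas.DataFrame({
    "country": ["Atlantis", "Lemuria", "Mu"],
    "pop": [1000, 2500, 400],
    "lifeExp": [70.0, 65.5, 80.1],
})

# describe() returns a new data frame of summary statistics
# (count, mean, std, min, quartiles, max) for the numerical columns.
summary = toy.describe()
print(summary)

# The text column "country" does not appear in the summary.
print(list(summary.columns))
```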

In [None]:
# Write your code here


---
## EtherPad:
1. Use the Python cell below to find the minimum GDP per capita of all countries in 1972.

In [None]:
# Write your code here


Vote for your answer on EtherPad

    a. 331.0
    b. 372.0
    c. 415.0
    d. 424.0
    
---

## The `DataFrame.columns` variable stores information about the data frame's columns.

*   Note that this is data, *not* a method.
    *   Like `math.pi`.
    *   So do not use `()` to try to call it.
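
For example, on a made-up data frame (the name `toy` and its contents are hypothetical), `columns` is read like any other stored value, with no parentheses:

```python
import pandas

# Hypothetical toy table for illustration only.
toy = pandas.DataFrame({"country": ["Atlantis"], "pop": [1000]})

# No parentheses: columns is a stored value, much like math.pi.
print(toy.columns)

# Converting to a plain list can make the names easier to read.
print(list(toy.columns))
```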

In [None]:
# print out the data frame columns


## Use `index_col` to specify that a column's values should be used as row headings.

*   By default, row headings are numbers (0, 1, 2, ...).
*   We would really like to index by country.
*   Pass the name of the column to `read_csv` as its `index_col` parameter to do this.
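
The difference can be sketched without any file on disk by reading from a string. The text below is a made-up, tab-separated stand-in for a data file; the variable names `text`, `default_index`, and `by_country` are illustrative assumptions.

```python
import io

import pandas

# Hypothetical tab-separated text standing in for a data file.
text = (
    "country\tyear\tpop\n"
    "Atlantis\t1952\t1000\n"
    "Atlantis\t1957\t1200\n"
    "Lemuria\t1952\t2500\n"
)

# Without index_col, the rows are numbered 0, 1, 2, ...
default_index = pandas.read_table(io.StringIO(text))
print(default_index.index)

# With index_col, the values of the "country" column become the row headings.
by_country = pandas.read_table(io.StringIO(text), index_col="country")
print(by_country.index)
```

Note that a country appearing in several rows produces repeated row headings.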

In [None]:
# re-read in the gapminder data with the "country" column/series as the index_col
df = pandas.read_table("../shell_lessons/data/gapminder_data/gapminder_final.txt", index_col="country")
print(df.head())

* This is a `DataFrame`
* This gives us many rows with the same index value (e.g. "Afghanistan")
  * Not good practice
* Let's re-read the table without `index_col`

In [None]:
# Write your code here


## Writing to csv file 
As well as the `read_table` function for reading data from a file, Pandas can write data frames to files with a `to_****` function.
  * Pandas can write data frames to CSV, HTML, Excel (xlsx), JSON, and many more.  
    E.g.  
    `df.to_csv("./my_data.csv")`
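
As a rough sketch of a write followed by a read, the cell below saves a made-up data frame to a temporary folder and loads it again. The names `toy`, `out_path`, and `round_trip` are hypothetical, and passing `index=False` is one possible choice here (it leaves the row numbers out of the file), not a requirement.

```python
import os
import tempfile

import pandas

# Hypothetical toy table for illustration only.
toy = pandas.DataFrame({"country": ["Atlantis", "Lemuria"], "pop": [1000, 2500]})

# Write to a temporary folder so the example does not touch your lesson files.
out_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
toy.to_csv(out_path, index=False)  # index=False leaves out the row numbers

# Read the file back in to check what was written.
round_trip = pandas.read_csv(out_path)
print(round_trip)
```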
    

---
## EXERCISE:
1. With the `gapminder_final.txt` file read in as a data frame, write out a copy of the data frame as a csv to a new file called `gapminder_final.csv` in the `data` directory in the `python_lessons` directory ("./data").

---

In [None]:
# Write your code here


# -- COMMIT YOUR WORK TO GITHUB --

---
## Keypoints:
 * Use the Pandas library to do statistics on tabular data.
 * Use `index_col` to specify that a column's values should be used as row headings.
 * Use `DataFrame.info` to find out more about a data frame.
 * The `DataFrame.columns` variable stores information about the data frame's columns.
 * Use `DataFrame.T` to transpose a data frame.
 * Use `DataFrame.describe` to get summary statistics about data.