# Reading Tabular Data Using Pandas

# Objectives
- Import the Pandas library.
- Use Pandas to load a CSV data set.
- Get summary information from a Pandas DataFrame.
- Download online data using Pandas.

In [2]:
# Before beginning this lesson, run this cell to download
# the figures for this notebook
!mkdir -p img
!wget -P img/ https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg
!wget -P ./img/ https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg

--2020-05-10 08:51:27--  https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg
Resolving pandas.pydata.org (pandas.pydata.org)... 104.26.1.204, 104.26.0.204, 2606:4700:20::681a:1cc, ...
Connecting to pandas.pydata.org (pandas.pydata.org)|104.26.1.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14271 (14K) [image/svg+xml]
Saving to: ‘img/01_table_dataframe.svg’


2020-05-10 08:51:27 (5.87 MB/s) - ‘img/01_table_dataframe.svg’ saved [14271/14271]

--2020-05-10 08:51:27--  https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg
Resolving pandas.pydata.org (pandas.pydata.org)... 104.26.1.204, 104.26.0.204, 2606:4700:20::681a:1cc, ...
Connecting to pandas.pydata.org (pandas.pydata.org)|104.26.1.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 268769 (262K) [image/svg+xml]
Saving to: ‘./img/02_io_readwrite.svg’


2020-05-10 08:51:28 (944 KB/s) - ‘

## Pandas
- Pandas is a widely used Python library for data analysis and statistics, with built-in plotting support.
- Its focus is tabular data.
- Like R, it is built around a structure called a dataframe.
- A dataframe can hold a different data type in each column.

![Data Frame ](img/01_table_dataframe.svg)

Source: <https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started>
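To make the point about multiple data types concrete, here is a minimal sketch that builds a small dataframe by hand. The `people` table and all of its values are invented for illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
people = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],   # text
        "age": [36, 45, 28],                 # integers
        "height_m": [1.65, 1.70, 1.80],      # floating point numbers
        "member": [True, False, True],       # booleans
    }
)

# Each column keeps its own data type.
print(people.dtypes)
```

The `dtypes` attribute lists one data type per column, which is a quick way to check what a dataframe holds.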

- Pandas can read and write many tabular formats, including CSV, Excel, JSON, and SQL.

![Data Processed by Pandas](img/02_io_readwrite.svg)

Source: <https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started>
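As a rough illustration of the read and write functions in the figure, the sketch below writes a tiny table to CSV and JSON text and reads both back. The `table` dataframe is hypothetical, and `io.StringIO` is used here only so that the example needs no files on disk.

```python
import io

import pandas as pd

# Hypothetical table, invented for illustration only.
table = pd.DataFrame({"item": ["a", "b"], "value": [1.5, 2.5]})

# Write the table out as CSV text and as JSON text.
csv_text = table.to_csv(index=False)
json_text = table.to_json()

# Read both back in; io.StringIO makes a string behave like a file.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Most readers have a matching writer, so `read_csv` pairs with `to_csv` and `read_json` pairs with `to_json`.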

In [5]:
# 1. Run this cell to download the data
# 2. Open the downloaded files to get a sense of the data

# Downloads a zip file from Carpentries webpage with Gapminder data
!wget http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
# Unzips the file
!unzip python-novice-gapminder-data.zip

--2020-05-10 08:16:11--  http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
Resolving swcarpentry.github.io (swcarpentry.github.io)... 185.199.108.153, 185.199.111.153, 185.199.110.153, ...
Connecting to swcarpentry.github.io (swcarpentry.github.io)|185.199.108.153|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38471 (38K) [application/zip]
Saving to: ‘python-novice-gapminder-data.zip’


2020-05-10 08:16:11 (615 KB/s) - ‘python-novice-gapminder-data.zip’ saved [38471/38471]

Archive:  python-novice-gapminder-data.zip
  inflating: data/gapminder_all.csv  
  inflating: data/gapminder_gdp_africa.csv  
  inflating: data/gapminder_gdp_americas.csv  
  inflating: data/gapminder_gdp_a

- Load Pandas with `import pandas as pd`

In [None]:
# 1. Import the pandas library
___

# 2. Use `read_csv` to read the gapminder data
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)
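If the data files are not at hand, the same two steps can be tried on a stand-in. This is a minimal sketch: `csv_text` mimics the layout of the Oceania file, but the numbers are invented and are not Gapminder values.

```python
import io

import pandas as pd

# Stand-in for a small CSV file; the numbers are invented, not Gapminder values.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Australia,10000.0,11000.0
New Zealand,10500.0,12000.0
"""

# read_csv accepts a file path or a file-like object such as io.StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Without further arguments, the rows are labelled 0 and 1 and `country` is an ordinary column.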

- The columns in a dataframe are the observed variables, and the rows are the observations.
- Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

## `index_col`
- Use `index_col` to specify that a column's values should be used as row identifiers.

In [None]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
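The effect of `index_col` is easiest to see side by side. The sketch below reads the same stand-in text twice, once with the default index and once with `index_col='country'`; the numbers in `csv_text` are invented for illustration.

```python
import io

import pandas as pd

# Stand-in for a small CSV file; the numbers are invented, not Gapminder values.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Australia,10000.0,11000.0
New Zealand,10500.0,12000.0
"""

default_index = pd.read_csv(io.StringIO(csv_text))
by_country = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(default_index.index.tolist())  # row labels are positions
print(by_country.index.tolist())     # row labels are country names

# With country names as the index, a row can be looked up by name.
print(by_country.loc["Australia"])
```

Once a column becomes the index, it no longer appears among the regular columns.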

- Use `DataFrame.info` to find out more about a dataframe.

In [6]:
# `info()` is a method of data
data.info()


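Once `data` has been loaded with `index_col='country'`, `data.info()` reports the following:

*   This is a `DataFrame`.
*   It has two rows, named `'Australia'` and `'New Zealand'`.
*   It has twelve columns, each holding two non-null 64-bit floating point values.
*   It uses 208 bytes of memory.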
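The facts that `info()` reports can also be checked one at a time. In this sketch the `sample` dataframe is a two-column stand-in with invented numbers, so the counts and memory figures differ from those of the full Oceania file.

```python
import io

import pandas as pd

# Stand-in for a small CSV file; the numbers are invented, not Gapminder values.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Australia,10000.0,11000.0
New Zealand,10500.0,12000.0
"""
sample = pd.read_csv(io.StringIO(csv_text), index_col="country")

# info() prints its report and returns None; passing buf captures the text instead.
buffer = io.StringIO()
result = sample.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts are available individually.
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type of each column
```

Because `info()` returns `None`, write `data.info()` on its own rather than `print(data.info())`.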

## Attributes
- The `DataFrame.columns` attribute stores information about the dataframe's columns.
- Note that this is a variable, *not* a method.
  - It doesn't have `()`.
- Such a variable is called a *member variable*, just a *member*, or an *attribute*.

In [None]:
print(data.columns)
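The difference between an attribute and a method shows up as soon as parentheses are added. This sketch uses a hand-built `sample` dataframe with invented numbers.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
sample = pd.DataFrame(
    {"gdpPercap_1952": [10000.0, 10500.0], "gdpPercap_1957": [11000.0, 12000.0]},
    index=["Australia", "New Zealand"],
)

print(sample.columns)        # an Index object holding the column names
print(list(sample.columns))  # the names as a plain Python list

# Adding () treats the attribute as a function, which fails.
try:
    sample.columns()
except TypeError as error:
    print("TypeError:", error)
```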

- Use `DataFrame.T` to transpose a dataframe.

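*   Sometimes we want to treat columns as rows and vice versa.
*   For a dataframe with a single data type, like this one, transposing typically doesn't copy the data; it just changes the program's view of it. Dataframes with mixed data types are copied.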

In [7]:
print(data.T)

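A small hand-built example shows what transposing does to the shape and labels. The `sample` dataframe and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
sample = pd.DataFrame(
    {
        "gdpPercap_1952": [10000.0, 10500.0],
        "gdpPercap_1957": [11000.0, 12000.0],
        "gdpPercap_1962": [12000.0, 13000.0],
    },
    index=["Australia", "New Zealand"],
)

flipped = sample.T

print(sample.shape)           # rows by columns before transposing
print(flipped.shape)          # the two numbers swap places
print(list(flipped.columns))  # the old row labels are now column names
```

Note that `T` is an attribute, so it is written without `()`.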

## Summary Statistics
- Use `DataFrame.describe` to get summary statistics about data.

- By default, `DataFrame.describe()` summarizes only the columns that hold numerical data.
  All other columns are ignored unless you pass `include='all'`.

In [None]:
# 1. Print the summary statistics for our dataframe
print(___)
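To see how non-numerical columns are treated, the sketch below adds a text column to a small hand-built dataframe. The `sample` table and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
sample = pd.DataFrame(
    {
        "continent": ["Oceania", "Oceania"],
        "gdpPercap_1952": [10000.0, 10500.0],
        "gdpPercap_1957": [11000.0, 12000.0],
    },
    index=["Australia", "New Zealand"],
)

# By default only the numerical columns are summarized.
summary = sample.describe()
print(summary)

# include="all" adds the text column, with counts of unique values instead of means.
print(sample.describe(include="all"))
```

The default summary has one row per statistic: count, mean, standard deviation, minimum, the three quartiles, and maximum.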

# Exercise
1. `read_csv()` can read data directly from a URL.
   Download a dataset called the Titanic Data Set by passing
   the following URL to `read_csv()` instead of a file path.
   Put the new dataframe in a variable called `titanic`.
2. Use `titanic.head()` to look at the first five rows of the new dataframe.

**Data URL:**
<https://github.com/pandas-dev/pandas/raw/master/doc/data/titanic.csv>
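Before trying the exercise, it may help to see that `read_csv()` treats a URL string much like a file path. The sketch below avoids the network by using a `file://` URL that points at a temporary file; the file name `example.csv` and its rows are invented for illustration and are not the Titanic data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows, invented for illustration; not the Titanic data.
csv_text = "id,name,score\n1,Ana,3.5\n2,Ben,4.0\n3,Cai,2.5\n"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.csv"
    path.write_text(csv_text)

    # as_uri() turns the absolute path into a URL string that starts with file://
    url = path.as_uri()
    from_url = pd.read_csv(url)

# head(n) returns the first n rows; with no argument it returns five.
print(from_url.head(2))
```

For the exercise itself, pass the `https://` address above in place of `url`.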
