# Pandas data frames

## References

[pandas website](https://pandas.pydata.org/)

The site includes a link to the PDF of *pandas: powerful Python data analysis toolkit*, a free online alternative to *Python for Data Analysis* by Wes McKinney.

## Setup

This is the standard import statement for pandas:

In [1]:
import pandas as pd

## Series

Recall lists from basic Python

In [2]:
group_list = ['reptile', 'arachnid', 'annelid', 'insect']
animal_list = ['lizard', 'spider', 'worm', 'bee']
number_legs_list = [4, 8, 0, 6]

In [19]:
new_list = [animal for animal in animal_list if animal > 'cat']
new_list

['lizard', 'spider', 'worm']

Series are one-dimensional pandas data structures

In [3]:
animal_series = pd.Series(animal_list)
print(animal_series)
print(type(animal_series))

0    lizard
1    spider
2      worm
3       bee
dtype: object
<class 'pandas.core.series.Series'>


Series are built from an `ndarray` of values and an index. By default the index is a `range()`-like *RangeIndex*.

In [4]:
print(animal_series.values)
print(type(animal_series.values))
print(animal_series.index)
print(type(animal_series.index))

['lizard' 'spider' 'worm' 'bee']
<class 'numpy.ndarray'>
RangeIndex(start=0, stop=4, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>
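As a minimal sketch of the same idea, the following builds a series directly from a NumPy array and then spells out the default index explicitly. The values are illustrative, and the names `legs` and `legs_explicit` are invented for this example.

```python
import numpy as np
import pandas as pd

# Illustrative values only
values = np.array([4, 8, 0, 6])

# With no index argument, pandas supplies a RangeIndex automatically
legs = pd.Series(values)

# The same default index, written out explicitly
legs_explicit = pd.Series(values, index=pd.RangeIndex(start=0, stop=4, step=1))

print(legs.index)
print(legs.index.equals(legs_explicit.index))
```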


Create series from other lists. Use series in vectorized operations.

In [5]:
group_series = pd.Series(group_list)
number_legs_series = pd.Series(number_legs_list)
con = group_series + ': ' + animal_series
print(con)

0     reptile: lizard
1    arachnid: spider
2       annelid: worm
3         insect: bee
dtype: object
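Vectorized operations also work on numeric series. The sketch below assumes the same leg counts as above; the names `pairs_of_legs` and `has_legs` are made up for illustration.

```python
import pandas as pd

number_legs_series = pd.Series([4, 8, 0, 6])

# Arithmetic applies to every element at once, with no explicit loop
pairs_of_legs = number_legs_series / 2

# Comparisons are vectorized too and produce a boolean series
has_legs = number_legs_series > 0

print(pairs_of_legs)
print(has_legs)
```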


Series can be assigned labels

In [6]:
labeled_groups = pd.Series(group_list, index = animal_list)
labeled_legs = pd.Series(number_legs_list, index = animal_list)
print(labeled_groups)
print()
print(labeled_legs)

lizard     reptile
spider    arachnid
worm       annelid
bee         insect
dtype: object

lizard    4
spider    8
worm      0
bee       6
dtype: int64


In [7]:
print(labeled_groups.values)
print(type(labeled_groups.values))
print()
print(labeled_groups.index)
print(type(labeled_groups.index))

['reptile' 'arachnid' 'annelid' 'insect']
<class 'numpy.ndarray'>

Index(['lizard', 'spider', 'worm', 'bee'], dtype='object')
<class 'pandas.core.indexes.base.Index'>


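Series values can be identified by either position (zero-based) or label. Use `.iloc` for positions; recent pandas versions deprecate plain brackets with an integer position on a labeled series.

In [8]:
print(labeled_groups.iloc[1])
print(labeled_groups['bee'])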

arachnid
insect


Series can be sliced by position like lists or `ndarray`s (end of range excluded), and also by label (end of range included). A list of labels selects items in the order given.

In [9]:
print(labeled_legs[1:3])
print()
print(labeled_groups[ ['worm', 'spider', 'lizard'] ])
print()
print(labeled_groups['lizard': 'worm'])

spider    8
worm      0
dtype: int64

worm       annelid
spider    arachnid
lizard     reptile
dtype: object

lizard     reptile
spider    arachnid
worm       annelid
dtype: object
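The difference in how the end of a range is treated is easiest to see with the explicit `.iloc` and `.loc` accessors. This is a small sketch using the same data as above; the variable names `positional` and `by_label` are invented here.

```python
import pandas as pd

labeled_legs = pd.Series([4, 8, 0, 6], index=['lizard', 'spider', 'worm', 'bee'])

# Positional slice: stops before position 3
positional = labeled_legs.iloc[1:3]

# Label slice: includes the end label 'bee'
by_label = labeled_legs.loc['spider':'bee']

print(positional.index.tolist())
print(by_label.index.tolist())
```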


Labels carry over to derivative series

In [None]:
photos = labeled_legs * 2
print(photos)
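Filtering is another case where labels carry over. In this sketch, a boolean comparison selects some items and each survivor keeps its original label; `many_legs` is a name made up for the example.

```python
import pandas as pd

labeled_legs = pd.Series([4, 8, 0, 6], index=['lizard', 'spider', 'worm', 'bee'])

# Keep only the animals with more than four legs
many_legs = labeled_legs[labeled_legs > 4]

print(many_legs)
```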

In [None]:
new_groups = labeled_groups.loc['spider':'bee']

In [None]:
new_groups.iloc[1:2]

In [None]:
new_groups.index

In [None]:
new_groups.iloc[1] = 'wormy'

In [None]:
new_groups

In [10]:
labeled_groups

lizard     reptile
spider    arachnid
worm       annelid
bee         insect
dtype: object

In [11]:
labeled_groups = labeled_groups.sort_values()

In [12]:
labeled_groups

worm       annelid
spider    arachnid
bee         insect
lizard     reptile
dtype: object

# Data frames

Data frames are two-dimensional data structures composed of series with shared indices.

Data frames can be created from a dictionary of lists.

In [None]:
# Create a standard Python dictionary
data_dict = {'group': group_list, 'number legs': number_legs_list}

# First argument is the dictionary, second argument is a list containing the index labels.
organism_info = pd.DataFrame(data_dict, index = animal_list)
print(organism_info)

# Direct output of a data frame (vs. print function) has nice formatting in Jupyter notebooks
organism_info
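To check the claim that a data frame is composed of series with a shared index, one option is to pull out each column and compare its index with the frame's. The following is a sketch that rebuilds the same data frame so it can run on its own.

```python
import pandas as pd

animal_list = ['lizard', 'spider', 'worm', 'bee']
organism_info = pd.DataFrame(
    {'group': ['reptile', 'arachnid', 'annelid', 'insect'],
     'number legs': [4, 8, 0, 6]},
    index=animal_list,
)

# Number of rows and columns
print(organism_info.shape)

# Each column is a Series that carries the data frame's index
for name in organism_info.columns:
    column = organism_info[name]
    shares_index = column.index.equals(organism_info.index)
    print(name, type(column).__name__, shares_index)
```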

Select a column by its name string. Output is a series.

In [None]:
print(organism_info['group'])
print()
print(type(organism_info['group']))

Dot notation is an option if the column name is a valid Python identifier (and does not clash with an existing data frame attribute or method).

In [None]:
print(organism_info.group)
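The `number legs` column shows why brackets are the more general form: its name contains a space, so it cannot be written after a dot. The sketch below uses a shortened two-row version of the data frame and Python's built-in `str.isidentifier()` to test the name.

```python
import pandas as pd

organism_info = pd.DataFrame(
    {'group': ['reptile', 'arachnid'], 'number legs': [4, 8]},
    index=['lizard', 'spider'],
)

# 'group' is a valid identifier, so both forms return the same column
print(organism_info.group.equals(organism_info['group']))

# 'number legs' contains a space, so bracket notation is needed
print('number legs'.isidentifier())
print(organism_info['number legs'])
```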

Select a row using `.loc` with the index label or `.iloc` with the integer position. Output is a series.

In [None]:
print(organism_info.loc['worm'])
print()
print(organism_info.iloc[0])
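A row selected this way is a series whose index holds the column names and whose name is the row label. The sketch below rebuilds the data frame so it runs on its own; `row` is a name invented for the example.

```python
import pandas as pd

organism_info = pd.DataFrame(
    {'group': ['reptile', 'arachnid', 'annelid', 'insect'],
     'number legs': [4, 8, 0, 6]},
    index=['lizard', 'spider', 'worm', 'bee'],
)

row = organism_info.loc['worm']

# The row's index holds the column names; its name is the row label
print(row.index.tolist())
print(row.name)

# 'worm' sits at position 2, so .iloc[2] selects the same row
print(organism_info.iloc[2].name)
```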

In [None]:
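# .iat selects a single cell by integer row and column position
print(organism_info.iat[2, 1])
# .at selects a single cell by row label and column name
print(organism_info.at['spider', 'group'])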

# Reading data from a file path

The mechanism for locating a file depends on the kind of environment you are running.

In some environments, knowing the current working directory (using `os` module) is helpful.

In [None]:
import os
working_directory = os.getcwd()
print(working_directory)

We'll assume your file is in a subdirectory of your working directory and the subdirectory is named `data`. See videos for variations in creating and accessing this directory. 

We will load a file containing 2016 CO2 emissions for each state by fuel type. Go to [this link](https://github.com/HeardLibrary/digital-scholarship/blob/master/data/codegraf/co2_state_2016_fuel.xlsx) and click the download button. Save the downloaded file in your data directory. For source information about the file, see [this page](https://github.com/HeardLibrary/digital-scholarship/tree/master/data/codegraf).

In [None]:
filename = 'co2_state_2016_fuel.xlsx'
# os.path.join uses the correct path separator for the operating system
path = os.path.join(working_directory, 'data', filename)
fuel_type = pd.read_excel(path)
fuel_type

Large tables can be abbreviated using `.head()`, with the number of rows to display as the argument. The default is 5 rows if no argument is given.

In [None]:
fuel_type.head()
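The effect of the argument can be checked on a small made-up table. This sketch uses a hypothetical 20-row data frame rather than the emissions file.

```python
import pandas as pd

# Hypothetical 20-row table standing in for a large file
df = pd.DataFrame({'n': range(20)})

# No argument: first 5 rows
print(len(df.head()))

# Explicit argument: first 3 rows
print(len(df.head(3)))
```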

Data frames read from a file are not automatically assigned meaningful index labels; they get a default integer index. We can use one of the columns (a series) as the labels.

In [None]:
fuel_type['State']

Assign the series as the data frame's index

In [None]:
fuel_type.index = fuel_type['State']
fuel_type.head()

The state is now both the index label for the table and one of the columns.

In [None]:
fuel_type.loc['Ohio']
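The `.set_index()` method is another way to get a similar result. The sketch below uses a small made-up table (the `Total` column and its values are hypothetical, not taken from the emissions file) to show that `drop=False` keeps the column alongside the index, while the default moves it out of the columns.

```python
import pandas as pd

# Hypothetical stand-in for the emissions table
df = pd.DataFrame({'State': ['Ohio', 'Utah'], 'Total': [1.5, 2.5]})

# drop=False keeps 'State' as a column as well as the index
indexed = df.set_index('State', drop=False)
print(indexed.loc['Ohio'])

# The default (drop=True) moves the column into the index
moved = df.set_index('State')
print(moved.columns.tolist())
```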

## Functions for reading from and writing to files

`pd.read_csv()` reads from a CSV file into a data frame.

`.to_csv()` is a data frame method that writes the data frame to a CSV file.

`pd.read_excel()` reads from an Excel file into a data frame.

`.to_excel()` is a data frame method that writes the data frame to an Excel file.

For details about reading from particular sheets in an Excel file, delimiters other than commas, etc., see the [pandas User Guide](https://pandas.pydata.org/docs/user_guide/io.html).
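A round trip through a CSV file can be sketched as follows. The table contents and the file name `sample.csv` are made up for illustration, and a temporary directory is used so nothing is left behind on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for a real table
df = pd.DataFrame({'State': ['Ohio', 'Utah'], 'Total': [1.5, 2.5]})

with tempfile.TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, 'sample.csv')
    # index=False leaves the default integer index out of the file
    df.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(round_trip)
```

Note that writing is done with a method of the data frame (`df.to_csv`), while reading is done with a function of the pandas module (`pd.read_csv`).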

## Reading data from a file via a URL

The same functions as above can be used to read from a file via an Internet URL.

In [None]:
schools = pd.read_csv('https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv')

Examining data from a large file can be challenging. Abbreviated views of data frames are possible.

In [None]:
schools.head()

The missing data indicator in pandas is `NaN` ("not a number"), a floating-point value that comes from NumPy rather than basic Python. Empty cells read into a data frame are represented as `NaN`.
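A minimal sketch of this behavior, using a made-up two-row CSV held in memory rather than the schools file: the empty `enrollment` cell is read as `NaN`, and `.isna()` reports where values are missing.

```python
import io

import pandas as pd

# Hypothetical CSV text; the second row has an empty enrollment cell
csv_text = "name,enrollment\nSchool A,250\nSchool B,\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df)

# True marks the missing value
print(df['enrollment'].isna())
```

Because `NaN` is a floating-point value, a column of whole numbers with a missing cell is read as floats.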

In [None]:
schools.columns

In [None]:
schools['School Name']

# Views vs. copies

This is an advanced topic that can be skipped if you wish.

In [None]:
school_names = schools['School Name']
school_names

In [None]:
type(school_names)

In [None]:
school_names.index

In [None]:
school_names[1] = 'Judd Street School'
school_names

In [None]:
schools

In [None]:
school_names_copy = schools['School Name'].copy()
school_names_copy

In [None]:
school_names_copy[2] = 'Pleasant Hill Academy'
school_names_copy

In [None]:
schools
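The second half of the cells above can be reproduced on a small made-up table. Whether assigning through a series taken directly from a data frame changes the original depends on the pandas version and its copy-on-write settings, so this sketch only demonstrates the part that holds in either case: a series made with `.copy()` is independent of the data frame it came from. The school names other than the assigned one are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the schools table
schools = pd.DataFrame({'School Name': ['A', 'B', 'C']})

# An explicit copy has its own data
names_copy = schools['School Name'].copy()
names_copy.iloc[2] = 'Pleasant Hill Academy'

# The copy changed; the original data frame did not
print(names_copy.iloc[2])
print(schools['School Name'].iloc[2])
```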