# Pandas data frames

## References

[pandas website](https://pandas.pydata.org/)

The website includes a link to the PDF of *pandas: powerful Python data analysis toolkit*, a free online alternative to *Python for Data Analysis* by Wes McKinney.

[pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Setup

This is the standard import statement for pandas:

In [None]:
import pandas as pd

# DataFrames

DataFrames are two-dimensional data structures composed of Series with shared indices.

DataFrames can be created from a dictionary of Series.

In [None]:
text_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
capital_series = pd.Series({'OH': 'Columbus', 'TN': 'Nashville', 'AZ': 'Phoenix', 'PA': 'Harrisburg', 'AK': 'Juneau'})
population_series = pd.Series({'OH': 11799448, 'TN': 6910840, 'AZ': 7151502, 'PA': 13002700, 'AK': 733391})
print(text_series)
print()
print(capital_series)
print()
print(population_series)

states_dict = {'text': text_series, 'capital': capital_series, 'population': population_series}
states_df = pd.DataFrame(states_dict)

When created in this way, the dictionary keys are used as the column headers (column label indices) and each Series becomes a column. The label indices shared by the Series become the row label indices of the DataFrame.

When you print a pandas DataFrame, you get a text representation. If the name is given as the last line of the notebook cell, it's displayed in a "prettier" form.

In [None]:
print(states_df)
states_df
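As a minimal sketch of how the dictionary keys and Series labels end up in the DataFrame, the following cell builds a smaller table and lists its column and row label indices. The names `name_series` and `small_df` are made up for this illustration.

```python
import pandas as pd

# Two small Series that share the same label indices
name_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee'})
capital_series = pd.Series({'OH': 'Columbus', 'TN': 'Nashville'})

# The dictionary keys become the column headers
small_df = pd.DataFrame({'name': name_series, 'capital': capital_series})

print(list(small_df.columns))  # column label indices
print(list(small_df.index))    # row label indices
```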

## Specifying a column

We can specify a column by using its column header as the label index in square brackets. The resulting column is a pandas Series.

In [None]:
print(states_df['capital'])
print()
print(type(states_df['capital']))

Dot notation is an alternative if the header string is a valid Python object name.

In [None]:
print(states_df.population)
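The sketch below illustrates when dot notation can and cannot be relied on. The DataFrame `example_df` and its headers are invented for this example. A header containing a space cannot be written with a dot, and a header that matches an existing DataFrame method name (here `count`) returns the method rather than the column.

```python
import pandas as pd

# Hypothetical table with three kinds of column headers
example_df = pd.DataFrame(
    {'population': [11799448, 6910840],
     'state capital': ['Columbus', 'Nashville'],
     'count': [88, 95]},
    index=['OH', 'TN'])

# Valid Python name: dot notation and square brackets give the same Series
print(example_df.population.equals(example_df['population']))

# Header with a space: only square brackets can be used
print(example_df['state capital'])

# Header that matches a DataFrame method: dot notation returns the method
print(callable(example_df.count))
print(example_df['count'])
```

Square brackets work in every case, so they are the safer choice when headers come from a file you do not control.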

## Specifying a row

Select a row using `.loc` with the label index or `.iloc` with the integer index (counting from 0). The resulting output is a Series.

In [None]:
print(states_df.loc['AZ'])
print()
print(states_df.iloc[1])
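To make the relationship between the two selectors concrete, here is a small sketch using a made-up table called `capitals_df`. It selects the same row once by label and once by position, and shows that the column headers become the label indices of the resulting Series.

```python
import pandas as pd

# Hypothetical three-row table; 'AZ' is at integer position 2
capitals_df = pd.DataFrame(
    {'capital': ['Columbus', 'Nashville', 'Phoenix'],
     'population': [11799448, 6910840, 7151502]},
    index=['OH', 'TN', 'AZ'])

row_by_label = capitals_df.loc['AZ']     # select by row label
row_by_position = capitals_df.iloc[2]    # select by integer position

# Both selections refer to the same row
print(row_by_label.equals(row_by_position))

# The column headers are the label indices of the row Series
print(list(row_by_label.index))
print(row_by_label['capital'])
```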

## The big picture

From this exploration, we can see that a pandas DataFrame can be thought of as a table, with rows and columns that are pandas Series. When we extract either a row or column, it will have the same behavior as we saw for Series in the previous lesson.

We can force a row or column into a simpler form, such as a list or dictionary, by applying a conversion function:

In [None]:
states_list = list(states_df['text'])
print(states_list)

In [None]:
pennsylvania_dictionary = dict(states_df.loc['PA'])
print(pennsylvania_dictionary)
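As an alternative sketch, a Series also has its own `.tolist()` and `.to_dict()` methods. For numeric columns, `.tolist()` returns ordinary Python numbers rather than NumPy numbers. The table `example_df` below is made up for this illustration.

```python
import pandas as pd

# Hypothetical two-row table
example_df = pd.DataFrame(
    {'text': ['Ohio', 'Pennsylvania'],
     'population': [11799448, 13002700]},
    index=['OH', 'PA'])

# Convert a column to a plain Python list
population_list = example_df['population'].tolist()
print(population_list)

# Convert a row to a plain Python dictionary keyed by column header
pennsylvania_dict = example_df.loc['PA'].to_dict()
print(pennsylvania_dict)
```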

In [None]:
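# .iat selects a single cell by integer position (row 2, column 1)
print(states_df.iat[2, 1])
# .at selects a single cell by row label and column header
print(states_df.at['AZ', 'capital'])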

# Loading a DataFrame from a file

Although there are a number of ways to build a pandas DataFrame from simpler Python objects, most of the time we will create them from data that are already in tabular form in a file.

The exact mechanism for loading the DataFrame will depend on the kind of environment you are running Python in (Colab notebook, Jupyter notebook, stand-alone Python) and the kind of file you are opening (CSV or Excel). We will start with the simplest example, loading a CSV from a URL, because it works the same in every environment.

You can load a CSV file by passing in its URL as the argument of the `pd.read_csv()` function:

In [None]:
schools_df = pd.read_csv('https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv')
schools_df
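If you want to experiment without a network connection, `pd.read_csv()` also accepts a file-like object. The following sketch wraps a small CSV string in `io.StringIO`. The CSV content and the name `sample_df` are invented for illustration and are not taken from the schools file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the contents of a file
csv_text = """School ID,School Name,Enrollment
101,Example Elementary,350
102,Sample Middle,540
"""

# io.StringIO makes the string behave like an open text file
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```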

## Examining the DataFrame

If a DataFrame is large, it will be difficult to examine the whole thing at once. We can use several methods to view characteristics of the DataFrame.

The `.head()` method will display the first 5 rows of the DataFrame. You can pass in a different number of rows to display as an argument. 

In [None]:
schools_df.head()

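A DataFrame loaded from a CSV is not automatically assigned meaningful row index labels; the rows are simply numbered, as shown at the left of the output. We can use one of the columns as the labels instead, as described in the section on setting the row label indices.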

In [None]:
schools_df.head(3)

Here are some other methods and attributes for exploring a DataFrame:
- `.tail()` displays the last 5 rows of the DataFrame (or a different number passed in as an argument).
- `.shape` returns the number of rows and columns as a tuple.
- `.columns` returns the column names as a pandas Index object. Use the `list()` function to convert it into a simple Python list.

In [None]:
schools_df.tail()

In [None]:
schools_df.shape

In [None]:
print(schools_df.columns)
print()
print(list(schools_df.columns))

## Data types in a DataFrame from a CSV

When a DataFrame is read in from a CSV, pandas tries to guess the type of data in each column. The result might be integer, floating point number, or "object", which is used for strings and mixed content types. To see this, look at the `dtype` value following each of the Series extracted from these three columns.

In [None]:
print(schools_df['Male'])
print()
print(schools_df['Latitude'])
print()
print(schools_df['School Level'])

In some cases, you would like for all columns to be read in as strings -- for example when numbers are being used as identification strings and you don't want leading zeros to be dropped. To do this, use a `dtype=str` argument.

Notice the change in data types in this case.

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url, dtype=str)

print(schools_df['Male'])
print()
print(schools_df['Latitude'])
print()
print(schools_df['School Level'])
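The leading-zero problem can be seen in a small sketch. The CSV text and column names below are made up for illustration.

```python
import io
import pandas as pd

# Made-up data: zip codes are identifiers, not quantities
csv_text = """zip_code,visits
02134,12
37203,7
"""

# Default: pandas guesses integers and the leading zero is lost
guessed_df = pd.read_csv(io.StringIO(csv_text))
print(guessed_df['zip_code'].tolist())

# dtype=str: every column is read as strings and the zero is kept
string_df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(string_df['zip_code'].tolist())
```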

Empty cells are typically read in as the NumPy missing data indicator: `NaN` (Not a Number). Notice the values for `Native Hawaiian or Other Pacific Islander` in rows where those cells were blank.

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url)
schools_df.head(3)

We can force blank cells to be read in as empty strings instead of as missing data using the `na_filter=False` argument.

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url, na_filter=False)
schools_df.head(3)

Be careful because turning off the NaN filter will cause numeric columns that contain blank cells to be read in as strings, changing the column type from one of the numeric types to "object". That may cause problems if you need to do calculations using that column.

In [None]:
print(schools_df['Grade PreK 3yrs'])

For this reason, the `na_filter=False` argument is most likely to be used together with the `dtype=str` argument when you want all cells of the table to be strings (including empty strings for empty cells).  
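The three behaviors described above can be compared side by side in a short sketch. The CSV text and column names are invented for this example.

```python
import io
import pandas as pd

# Made-up data with one blank cell in a numeric column
csv_text = """school,prek_students
A,15
B,
C,20
"""

# Default: the blank cell becomes NaN and the column stays numeric
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df['prek_students'])

# na_filter=False: the blank cell becomes an empty string,
# so the column can no longer be numeric
unfiltered_df = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(unfiltered_df['prek_students'])

# dtype=str with na_filter=False: every cell in the table is a string
strings_df = pd.read_csv(io.StringIO(csv_text), dtype=str, na_filter=False)
print(strings_df['prek_students'].tolist())
```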

## Setting the row label indices

When a DataFrame is read in from a CSV, pandas does not know what to use for the row label indices. So it defaults to using a sequence of integers (starting with 0) as the row labels. Notice these indices on the left in the table display.

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url)
schools_df.head(3)

You can specify one of the columns in the table to be converted into the row label indices using the `.set_index()` method, with the column header as the argument.

If we set row label indices, typically we would like to use some kind of unique identifier for the row. In the case of the schools data, the `School ID` column will serve this purpose well. After running the following cell, notice that `School ID` is no longer a regular column. It is now shown at the left in the index position.

In [None]:
schools_df = schools_df.set_index('School ID')
schools_df.head(3)

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url)
schools_df.head(3)

If you want to use values from a column as the row index but want that column to remain as part of the data, you can create the index from the column rather than converting the column into the index. The following cell does that. Notice that `School ID` appears both on the left side (as the row label index) and as the third data column.

In [None]:
schools_df.index = schools_df['School ID']
schools_df.head(3)
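Two other ways to get the same results are sketched below, using made-up CSV text rather than the schools file. The `index_col` argument of `pd.read_csv()` sets the row label indices while loading, and the `drop=False` argument of `.set_index()` keeps the column in the data.

```python
import io
import pandas as pd

# Made-up data with a unique identifier column
csv_text = """School ID,School Name
101,Example Elementary
102,Sample Middle
"""

# Set the row label indices while loading; the column is removed from the data
indexed_on_load_df = pd.read_csv(io.StringIO(csv_text), index_col='School ID')
print(list(indexed_on_load_df.columns))

# Set the row label indices afterwards but keep the column as data
kept_column_df = pd.read_csv(io.StringIO(csv_text)).set_index('School ID', drop=False)
print(list(kept_column_df.columns))
print(list(kept_column_df.index))
```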


## Functions for reading from and writing to files

`pd.read_csv()` reads from a CSV file into a data frame.

`.to_csv()` is a DataFrame method that writes the data frame to a CSV file, for example `schools_df.to_csv('schools.csv')`.

`pd.read_excel()` reads from an Excel file into a data frame.

`.to_excel()` is a DataFrame method that writes the data frame to an Excel file.

For details about reading from particular sheets in an Excel file, delimiters other than commas, etc. see the [pandas User Guide](https://pandas.pydata.org/docs/user_guide/io.html).
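As a rough sketch of a write-then-read round trip, the cell below avoids touching the file system: calling `.to_csv()` without a file path returns the CSV text as a string, which is then read back through `io.StringIO`. The table and variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical table to write out
original_df = pd.DataFrame(
    {'state': ['Ohio', 'Alaska'],
     'population': [11799448, 733391]})

# With no file path, .to_csv() returns the CSV text as a string.
# index=False leaves the integer row labels out of the output.
csv_text = original_df.to_csv(index=False)
print(csv_text)

# Read the text back in and compare with the original table
round_trip_df = pd.read_csv(io.StringIO(csv_text))
print(round_trip_df.equals(original_df))
```

To write to an actual file, pass a file name as the first argument, as in `original_df.to_csv('states.csv', index=False)`.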

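Many DataFrame methods take an `axis` argument, where axis 0 refers to the rows and axis 1 refers to the columns. For example, `df.mean(axis=1)` averages across the columns, giving one mean per row.

`df.sort_index(axis=1, ascending=False)` sorts the columns by their headers in descending order.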
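A minimal sketch of the `axis` argument, using a tiny made-up table called `scores_df`:

```python
import pandas as pd

# Hypothetical 2 x 2 table of numbers
scores_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

# axis=0 works down the rows: one mean per column
column_means = scores_df.mean(axis=0)
print(column_means)

# axis=1 works across the columns: one mean per row
row_means = scores_df.mean(axis=1)
print(row_means)

# Sort the column headers in descending order
reordered_df = scores_df.sort_index(axis=1, ascending=False)
print(reordered_df)
```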

| Pandas operator | Boolean | Requires |
|---|---|---|
| `&` | and | All must be True |
| `\|` | or | Any may be True |
| `~` | not | The opposite |
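The sketch below shows one way these operators might be used to filter rows. The table `filter_df` and the two conditions are made up for this example. When comparisons are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `>` or `==`.

```python
import pandas as pd

# Hypothetical table to filter
filter_df = pd.DataFrame(
    {'capital': ['Columbus', 'Nashville', 'Phoenix', 'Harrisburg', 'Juneau'],
     'population': [11799448, 6910840, 7151502, 13002700, 733391]},
    index=['OH', 'TN', 'AZ', 'PA', 'AK'])

# Each condition is a Series of True/False values, one per row
large = filter_df['population'] > 7000000
starts_with_p = filter_df['capital'].str.startswith('P')

both_df = filter_df[large & starts_with_p]     # and: both must be True
either_df = filter_df[large | starts_with_p]   # or: at least one is True
not_large_df = filter_df[~large]               # not: reverses True and False

# The same "and" filter written inline, with parentheses around each comparison
inline_df = filter_df[(filter_df['population'] > 7000000)
                      & (filter_df['capital'].str.startswith('P'))]

print(both_df)
print(either_df)
print(not_large_df)
```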

See the [Constellate pandas 2 tutorial](https://constellate.org/tutorials/pandas-2) for filtering, dropping rows, and changing values by condition.
