In [None]:
import pandas as pd

# Basic DataFrame manipulation

Data source information is [here](https://github.com/HeardLibrary/digital-scholarship/tree/master/data/codegraf)

Load Excel spreadsheet into DataFrame

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/codegraf/co2_state_2016_sector.xlsx'
state_co2_sector = pd.read_excel(url)

Examine contents of DataFrame

In [None]:
state_co2_sector.head()

In [None]:
state_co2_sector.tail()

## Setting the row index

The `.set_index()` method changes one of the columns into the row index. 

The `.reset_index()` method changes a row index into a regular column.

In [None]:
# Set the State column as the index
state_co2_sector.set_index('State')

In [None]:
# What happened to the index ???
state_co2_sector.tail()

In [None]:
# .set_index() returns a new DataFrame; the source DataFrame is unchanged
new_df = state_co2_sector.set_index('State')
print(new_df.head())
print()
print(state_co2_sector.head())

Use the `inplace` argument to change the source DataFrame directly (no assignment needed).

In [None]:
state_co2_sector.set_index('State', inplace=True)
state_co2_sector.head()
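
The cells above use `.set_index()` but never show `.reset_index()` in action. Here is a minimal sketch of the round trip, assuming a small made-up table (the `toy` DataFrame and its values are hypothetical, not the real emissions data):

```python
import pandas as pd

# Hypothetical toy data, not the real emissions spreadsheet
toy = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona'],
    'Total': [10.0, 20.0, 30.0],
})

# set_index() returns a new DataFrame; toy itself is unchanged
indexed = toy.set_index('State')
print(indexed.index.name)   # State
print(list(toy.columns))    # ['State', 'Total']

# reset_index() moves the row index back into a regular column
restored = indexed.reset_index()
print(list(restored.columns))   # ['State', 'Total']
```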

## Removing rows and columns

`.drop()` removes rows by default.

In [None]:
state_co2_sector.tail()

In [None]:
state_co2_sector.drop('Total').tail()

In [None]:
# The first argument of .drop() can be a list of labels
state_co2_sector.drop(['Virginia', 'West Virginia', 'Wyoming']).tail()

In [None]:
# Use inplace argument to change the source table
state_co2_sector.drop('Total', inplace=True)
state_co2_sector.tail()

Use the `axis` argument to drop columns.

In [None]:
state_co2_sector.drop('Total', axis='columns').head()
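
Because the same label can name both a row and a column, the axis determines what gets dropped. The sketch below illustrates this with a hypothetical toy table (labels and values are made up for illustration); the `columns=` keyword is an equivalent way to write `axis='columns'`:

```python
import pandas as pd

# Hypothetical toy table with a 'Total' row AND a 'Total' column
toy = pd.DataFrame(
    {'Commercial': [1.0, 2.0, 3.0], 'Total': [4.0, 5.0, 9.0]},
    index=['Alabama', 'Alaska', 'Total'],
)

rows_dropped = toy.drop('Total')                  # default axis: rows
cols_dropped = toy.drop('Total', axis='columns')  # same label, column axis
same_cols = toy.drop(columns='Total')             # equivalent keyword form

print(list(rows_dropped.index))     # ['Alabama', 'Alaska']
print(list(cols_dropped.columns))   # ['Commercial']
```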

## Dealing with missing data



In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools = pd.read_csv(url)
schools.head()

In some cases, cells were empty because the group wasn't represented (i.e. there were zero students). In that case, those `NaN` values should be zeros.

The first argument of the `.fillna()` method can be a single value if it applies to the entire table, or a dictionary if it applies only to certain columns.

In [None]:
schools.fillna({'Native Hawaiian or Other Pacific Islander': 0}, inplace=True)
schools.head()
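
To compare the two forms of the `.fillna()` argument side by side, here is a small sketch on made-up data (the column names `group_a` and `group_b` are hypothetical stand-ins for the demographic columns):

```python
import pandas as pd

nan = float('nan')

# Hypothetical toy table with one missing value per column
toy = pd.DataFrame({
    'group_a': [5.0, nan],
    'group_b': [nan, 7.0],
})

# A single value fills every NaN in the table
all_filled = toy.fillna(0)

# A dictionary fills only the listed columns; group_b keeps its NaN
one_filled = toy.fillna({'group_a': 0})

print(all_filled.isnull().sum().sum())        # 0
print(one_filled['group_b'].isnull().sum())   # 1
```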

In other cases, cells were empty because that column didn't apply to that kind of school (e.g. high schools don't have PreK students). By default, the `.dropna()` method drops rows with any `NaN` values, which removes too much if you only care about certain columns. In that case, we can filter rows using the `.notnull()` method on the column of interest. The `.isnull()` method can be used to select only the rows that have `NaN` values for a column.

In [None]:
schools[schools['Grade PreK 3yrs'].notnull()]
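
The following sketch contrasts these options on a hypothetical three-row table (school names, column names, and counts are made up). It also shows the `subset` argument of `.dropna()`, which restricts the check to the listed columns and gives the same rows as the `.notnull()` filter:

```python
import pandas as pd

nan = float('nan')

# Hypothetical toy table: every row has a NaN somewhere
toy = pd.DataFrame({
    'School': ['A', 'B', 'C'],
    'PreK': [12.0, nan, 8.0],
    'Grade 12': [nan, 200.0, nan],
})

has_prek = toy[toy['PreK'].notnull()]        # rows with a PreK value
no_prek = toy[toy['PreK'].isnull()]          # rows missing a PreK value
also_has_prek = toy.dropna(subset=['PreK'])  # only check the PreK column
none_left = toy.dropna()                     # drops any row with a NaN

print(list(has_prek['School']))   # ['A', 'C']
print(list(no_prek['School']))    # ['B']
print(len(none_left))             # 0
```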

## Sorting rows

Load state CO2 emissions by fuel spreadsheet

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/codegraf/co2_state_2016_fuel.xlsx'
state_co2_fuel = pd.read_excel(url)
# Set the State column as the row index
state_co2_fuel.set_index('State', inplace=True)
state_co2_fuel.tail()

In [None]:
# Remove the total row
state_co2_fuel.drop('Total', inplace=True)
state_co2_fuel.tail()

In [None]:
# Sort ascending
state_co2_fuel.sort_values(by='Total mmt').head()

In [None]:
# Sort descending; use inplace=True to modify the source table
state_co2_fuel.sort_values(by='Total mmt', ascending=False, inplace=True)
state_co2_fuel.head()

## Slicing columns and rows

To slice using labels, we need to use the `.loc` indexer, which takes square brackets rather than parentheses. To slice columns, we need to specify both indices, with "all rows" (`:`) selected as the first index.

Recall that slicing with labels is inclusive of last label selected.

In [None]:
# Create a slice with only the fraction columns
state_co2_fuel_fractions = state_co2_fuel.loc[:, 'Coal fraction': 'Natural Gas fraction']
state_co2_fuel_fractions.head()

To slice rows, only the first index needs to be specified. For integer positions, use the `.iloc` indexer (also with square brackets).

In [None]:
# Create a slice with only the top four states
top_state_co2_fuel = state_co2_fuel.iloc[:4]
# Note that included rows are 0, 1, 2, and 3 (but not 4).
top_state_co2_fuel

Combine both slicing operations at once.

In [None]:
top_state_co2_fuel_fraction = state_co2_fuel.iloc[:4].loc[:, 'Coal fraction': 'Natural Gas fraction']
top_state_co2_fuel_fraction
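
The difference between inclusive label slices and exclusive integer slices is easy to check on a small table. This is a sketch on hypothetical data (the `toy` values are made up); it also shows that `.iloc` accepts a row slice and a column slice in a single call, as an alternative to chaining:

```python
import pandas as pd

# Hypothetical toy table, already sorted so the "top" rows come first
toy = pd.DataFrame(
    {'Coal': [1, 2, 3], 'Petroleum': [4, 5, 6], 'Gas': [7, 8, 9]},
    index=['Texas', 'California', 'Florida'],
)

# Label slices include the last label named
by_label = toy.loc['Texas':'California', 'Coal':'Petroleum']

# Integer slices stop before the end position (rows 0-1, columns 0-1)
by_position = toy.iloc[:2, :2]

print(list(by_label.index))          # ['Texas', 'California']
print(list(by_position.columns))     # ['Coal', 'Petroleum']
print(by_label.equals(by_position))  # True
```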

# Selecting data

Units are million metric tons

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/codegraf/co2_data.xlsx'
state_co2 = pd.read_excel(url)
state_co2.head(15)

Performing a boolean operation on a column generates a Series of booleans whose index matches the DataFrame rows.

In [None]:
state_co2.State=='Alabama'

The boolean series can be used to filter a subset of rows in the DataFrame.

Notice that the indices for the rows carry through in the selection.

In [None]:
state_co2[state_co2.State=='Alaska']

In [None]:
state_co2[state_co2['Sector']=='Industrial'].head()

You can assign the selection to a new variable (but remember that the original row indices are maintained).

In [None]:
state_co2_industrial = state_co2[state_co2['Sector']=='Industrial']
state_co2_industrial.head()
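
To see what "indices are maintained" means in practice, here is a minimal sketch on made-up data (the `CO2` column name and all values are hypothetical). If a fresh 0-based index is wanted, `.reset_index(drop=True)` discards the old labels:

```python
import pandas as pd

# Hypothetical toy data in the same long format as the table above
toy = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'Sector': ['Commercial', 'Industrial', 'Commercial', 'Industrial'],
    'CO2': [1.0, 2.0, 3.0, 4.0],
})

mask = toy['Sector'] == 'Industrial'   # boolean Series, one value per row
industrial = toy[mask]

# The selected rows keep their original labels
print(list(industrial.index))      # [1, 3]
print(industrial.loc[1, 'CO2'])    # 2.0

# Optional: renumber the rows from zero, discarding the old labels
renumbered = industrial.reset_index(drop=True)
print(list(renumbered.index))      # [0, 1]
```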