# Extracting and changing DataFrame data

In [None]:
import pandas as pd

## Slicing rows

The simplest way to slice rows of a DataFrame is to use the `.head()` or `.tail()` methods if the rows that you want are at the start or end of the DataFrame. The result is another DataFrame object, which may be a view of the original DataFrame rather than an independent copy. Recall that changes made to a view can affect the original object. To avoid this, use the `.copy()` method.

To slice using labels, we need to use the `.loc[]` indexer. To slice columns, we need to specify both indices, with "all rows" (`:`) selected as the first index.

Recall that slicing with labels is inclusive of the last label selected.
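The difference between label slicing and integer slicing can be seen in a small sketch. The `toy` DataFrame and its labels are made up for illustration and are not part of the lesson's datasets.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the slicing rules
toy = pd.DataFrame(
    {'x': [10, 20, 30, 40], 'y': [1.5, 2.5, 3.5, 4.5]},
    index=['a', 'b', 'c', 'd']
)

# Label slice: both the first label ('b') and the last label ('c') are included
label_rows = toy.loc['b':'c']

# Integer slice: starts at position 1 and stops before position 3
integer_rows = toy.iloc[1:3]

# Column slice by label: all rows (:), columns 'x' through 'y'
label_columns = toy.loc[:, 'x':'y']

print(label_rows)
print(integer_rows)
print(label_columns)
```

Both row slices return the rows labeled `b` and `c`, even though the label slice names its last row and the integer slice names the position after it.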

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/r/wv5_women_and_development.csv'
development = pd.read_csv(url)
# Use the country name as the row label index
development = development.set_index('country')
development

In [None]:
table_end = development.tail(12).copy()
table_end

In this example, we see that the last 12 rows are summary statistics added after the last country. If we want the DataFrame to include only the countries, we could use `.head()` with a negative argument:

In [None]:
countries_only = development.head(-12).copy()
countries_only
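As a minimal sketch of how positive and negative arguments behave, the following uses a made-up five-row DataFrame called `toy` (not one of the lesson's datasets). It also shows that a `.copy()` can be changed without touching the source.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'value': [1, 2, 3, 4, 5]}, index=['a', 'b', 'c', 'd', 'e'])

first_two = toy.head(2)           # the first 2 rows
all_but_last_two = toy.head(-2)   # everything except the last 2 rows
last_two = toy.tail(2)            # the last 2 rows
all_but_first_two = toy.tail(-2)  # everything except the first 2 rows

# An explicit copy is independent of the source DataFrame
independent = toy.tail(2).copy()
independent.loc['e', 'value'] = 99

print(all_but_last_two)
print(toy.loc['e', 'value'])  # the source value is unchanged
```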

In an earlier lesson, we saw that we could use `.loc[]` and `.iloc[]` to retrieve single rows by their indices, resulting in a Series. As we did in the lesson on Series, we can use a range of indices or a list of indices to retrieve a slice. But when we slice a DataFrame this way, the result is another DataFrame containing rows from the source DataFrame.

In [None]:
# Slice by a range of label indices
e_countries = development.loc['Ecuador':'Ethiopia']
e_countries

In [None]:
# Slice by a range of integer indices. Remember that slicing by integer index omits the last number.
integer_slice = development.iloc[1:4]
integer_slice

In [None]:
# Slice by a list of label indices
non_states = development.loc[ ['American Samoa', 'Puerto Rico', 'Virgin Islands (U.S.)'] ]
non_states

The beginning or end of the range can be omitted to include all rows from the top or to the bottom, respectively.

In [None]:
by_income = development.loc['Low income': ]
by_income

## Slicing a rectangular selection

We can slice any rectangular selection of the DataFrame using `.loc[]` or `.iloc[]` and specifying the ranges on both axes: first the 0th one (rows), then the 1st one (columns), separated by a comma. Omit a starting or ending value to include the range starting from the beginning or to the end, respectively.


In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/r/wv5_women_and_development.csv'
development = pd.read_csv(url)
# Use the country name as the row label index
development = development.set_index('country')
development

In [None]:
# Use range of labels on both axes
income_expectations = development.loc['Low income':'High income', 'male_life_expectancy_at_birth_2017': 'percentage_of_women_ages_20-24_first married_by_age_18' ]
income_expectations

In [None]:
# Specify integer ranges, with the last number one more than the end of the interval you want.
work = development.iloc[218:225, 3:7]
work

In [None]:
# Specify a list of indices instead of a range for one dimension
female_values_by_income = development.loc['Low income':'High income', ['female_life_expectancy_at_birth_2017', 'female_employment_percentage', 'women_in_parliaments_percentage_seats'] ]
female_values_by_income

In [None]:
# Include all countries through Zimbabwe and all columns from women_in_parliaments_percentage_seats to the end
last_values_by_country = development.loc[ :'Zimbabwe', 'women_in_parliaments_percentage_seats': ]
last_values_by_country.tail()
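Because the development table is large, it can be hard to check by eye which cells a rectangular slice picked out. The sketch below repeats the same kinds of selection on a made-up 4 by 3 DataFrame named `grid` (hypothetical data), where the result is easy to verify.

```python
import pandas as pd

# Hypothetical toy data for illustration
grid = pd.DataFrame(
    {'c1': [1, 2, 3, 4], 'c2': [5, 6, 7, 8], 'c3': [9, 10, 11, 12]},
    index=['r1', 'r2', 'r3', 'r4']
)

# Label ranges on both axes (last labels are included)
by_label = grid.loc['r2':'r3', 'c2':'c3']

# Integer ranges on both axes (last positions are excluded)
by_position = grid.iloc[1:3, 1:3]

# Open-ended row range combined with a list of column labels
mixed = grid.loc[:'r2', ['c1', 'c3']]

print(by_label)
print(by_position)
print(mixed)
```

The first two selections return the same four cells, one described by labels and the other by positions.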

## Slicing columns

There is no simple way to slice only by columns. Rather, slice a rectangular selection that includes all rows. You can indicate "all rows" by including the colon range indicator (`:`) without any starting or ending values. The columns can be specified using any of the variations above (ranges or lists).

In [None]:
life_expectancy = development.loc[ :, 'male_life_expectancy_at_birth_2017': 'female_life_expectancy_at_birth_2017']
life_expectancy

## Deleting ranges of rows or columns

After slicing by rows or columns, the labels of the slice can be used to specify which rows or columns should be deleted. The labels of rows can be retrieved using the `.index` attribute and the labels of the columns can be retrieved using the `.columns` attribute. Once the set of labels has been retrieved, it can be passed into the `.drop()` method to indicate what should be dropped, using the same syntax as for dropping a list of labels.

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url)
schools_df = schools_df.set_index('School ID')
schools_df.head()

In [None]:
print(schools_df.index)
schools_df.loc[375:460].index

In [None]:
# Drop a range of rows
schools_df = schools_df.drop(schools_df.loc[375:460].index)
schools_df.head()

In [None]:
print(schools_df.columns)
schools_df.loc[:, 'Grade PreK 3yrs':'Grade 12'].columns

To drop columns instead of rows, use the `axis='columns'` (or `axis=1`) argument.

In [None]:
schools_df = schools_df.drop(schools_df.loc[:, 'Grade PreK 3yrs':'Grade 12'].columns, axis='columns')
schools_df.head(10)
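The same slice-then-drop pattern can be sketched on a small made-up DataFrame (`grid` and its labels are hypothetical), which makes it easy to confirm exactly which rows and columns remain.

```python
import pandas as pd

# Hypothetical toy data for illustration
grid = pd.DataFrame(
    {'c1': [1, 2, 3, 4], 'c2': [5, 6, 7, 8], 'c3': [9, 10, 11, 12]},
    index=['r1', 'r2', 'r3', 'r4']
)

# Labels of the rows in the slice, then drop those rows
rows_to_drop = grid.loc['r2':'r3'].index
without_rows = grid.drop(rows_to_drop)

# Labels of the columns in the slice, then drop those columns
columns_to_drop = grid.loc[:, 'c2':'c3'].columns
without_columns = grid.drop(columns_to_drop, axis='columns')

print(without_rows)
print(without_columns)
```

Note that `.drop()` returns a new DataFrame, so `grid` itself still has all of its rows and columns unless the result is assigned back to it.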

## Selecting rows by boolean conditions

Recall from the lesson on Series that we can select some subset of items by generating a sequence of booleans where the `True` values indicate those that should be included and the `False` values indicate those that should be excluded. The same holds for DataFrames, but with the selection possible in either of the axes.

In [None]:
# Recreate the tiny states DataFrame from the earlier lesson
text_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
capital_series = pd.Series({'OH': 'Columbus', 'TN': 'Nashville', 'AZ': 'Phoenix', 'PA': 'Harrisburg', 'AK': 'Juneau'})
population_series = pd.Series({'OH': 11799448, 'TN': 6910840, 'AZ': 7151502, 'PA': 13002700, 'AK': 733391})
states_dict = {'text': text_series, 'capital': capital_series, 'population': population_series}
states_df = pd.DataFrame(states_dict)
states_df

In [None]:
row_booleans = pd.Series({
    'OH': False,
    'TN': True,
    'AZ': True,
    'PA': False,
    'AK': True
})
row_booleans

In [None]:
# Slice the rows using .loc[] as before, but using booleans instead of explicit naming.
selected_states = states_df.loc[row_booleans]
selected_states

Obviously it is a waste of time to hand-write the boolean values. But it is very easy to generate an appropriate boolean screening series by evaluating a condition. 

Some boolean operators in pandas differ somewhat from those in base Python, where the keywords `and`, `or`, and `not` are used. They are:

| pandas operator | boolean | evaluation |
| --------------- | ------- | -------- |
| & | and | `True` if all `True` |
| \| | or | `True` if any `True` |
| ~ | not | opposite value |

The `==`, `>`, `<=`, etc. operators are the same as in base Python.


In [None]:
states_df

Generate a boolean Series for selecting rows:

In [None]:
row_selector = states_df['text'] == 'Alaska'
row_selector

Apply the selector to locate the states to be included in the slice:

In [None]:
north_star_state = states_df.loc[row_selector]
north_star_state

Typically we would not bother doing this in separate steps, but just nest the condition inside the `.loc[]` expression:

In [None]:
not_arizona = states_df.loc[ ~(states_df['capital']=='Phoenix') ]
not_arizona
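Conditions can also be combined using `&` and `|`. The sketch below rebuilds part of the states table under the name `states` and picks some arbitrary population thresholds for illustration. Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than comparison operators such as `>` and `==`.

```python
import pandas as pd

states = pd.DataFrame(
    {'capital': ['Columbus', 'Nashville', 'Phoenix', 'Harrisburg', 'Juneau'],
     'population': [11799448, 6910840, 7151502, 13002700, 733391]},
    index=['OH', 'TN', 'AZ', 'PA', 'AK']
)

# "and": both conditions must be True for a row to be selected
mid_sized = states.loc[(states['population'] > 1000000) & (states['population'] < 10000000)]

# "or": a row is selected if either condition is True
extremes = states.loc[(states['population'] < 1000000) | (states['population'] > 12000000)]

# "not": invert a condition
not_small = states.loc[~(states['population'] < 1000000)]

print(mid_sized)
print(extremes)
print(not_small)
```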

Here is a more practical example using the schools data: select all middle schools.

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url)
schools_df = schools_df.set_index('School ID')
middle_schools = schools_df.loc[ schools_df['School Level']=='Middle School' ]
middle_schools.head()

We can use the `.isnull()` or `.notnull()` methods to select rows that have or don't have `NaN` values, respectively. For example, we could look for high schools by finding rows that don't have missing values in the `Grade 12` column. However, note that this isn't the same as filtering for high schools, since `Special Education` schools also have 12th graders.

In [None]:
high_schools = schools_df.loc[ schools_df['Grade 12'].notnull() ]
high_schools.head()
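One possible way to narrow the selection is to combine the `.notnull()` test with a second condition. The sketch below uses a made-up four-row table (`schools`, with hypothetical column names `level` and `grade_12`) rather than the real schools data.

```python
import pandas as pd

# Hypothetical toy data; NaN marks schools with no 12th graders recorded
schools = pd.DataFrame(
    {'level': ['High School', 'Middle School', 'Special Education', 'Elementary School'],
     'grade_12': [250.0, float('nan'), 40.0, float('nan')]},
    index=[1, 2, 3, 4]
)

# Rows that have a value in the grade_12 column
has_grade_12 = schools.loc[schools['grade_12'].notnull()]

# Rows that have a grade_12 value AND are labeled as high schools
high_only = schools.loc[schools['grade_12'].notnull() & (schools['level'] == 'High School')]

print(has_grade_12)
print(high_only)
```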

## Selecting columns by boolean conditions

Because of the way we typically organize data in tables, it's probably less common to select columns by a condition, but it can be done in a manner analogous to how we selected rows. Here we select columns by explicitly constructing a series of booleans for the columns:

In [None]:
# Slice the columns using .loc() using booleans
column_booleans = pd.Series({'text': True, 'capital': False, 'population': True})
selected_columns = states_df.loc[:, column_booleans]
selected_columns

Again, we would be more likely to create the boolean selector Series by a condition. In this example, we ask the question "Which columns have a value of `Harrisburg` in the row labeled `PA`?":

In [None]:
column_selector = states_df.loc['PA'] == 'Harrisburg'
column_selector

We then use that selector to slice the column that matches (the `capital` column):

In [None]:
states_df.loc[:, column_selector]
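A case where selecting columns by a condition can be useful is finding the columns that contain missing values. This sketch uses a made-up DataFrame named `toy`. Calling `.isnull().any()` on a DataFrame produces one boolean per column, which can go in the column position of `.loc`.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'a': [1.0, 2.0],
    'b': [float('nan'), 3.0],
    'c': ['x', 'y']
})

# Boolean Series labeled by column name: True if the column has any NaN
has_missing = toy.isnull().any()

columns_with_missing = toy.loc[:, has_missing]
complete_columns = toy.loc[:, ~has_missing]

print(has_missing)
print(columns_with_missing)
print(complete_columns)
```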

# Changing values as a vectorized operation

In the same way that we can select rows to slice them based on boolean conditions, we can also select cells in a column to change their values. Reload the schools data:

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv'
schools_df = pd.read_csv(url)
schools_df = schools_df.set_index('School ID')
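# Make an independent copy so that assigning values later does not depend on
# whether the slice is a view of schools_df (some pandas versions warn about this)
ethnicity = schools_df.loc[:, 'American Indian or Alaska Native':].copy()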
ethnicity.head()

In some columns, cells were left empty in the original table because the group wasn't represented (i.e. there were zero students). Those empty cells were read into the DataFrame by pandas as `NaN` values. They aren't really missing values; rather, they should have a value of zero.

We can use `.loc` to select a single cell by specifying its row and column label indices:

In [None]:
ethnicity.loc[375, 'Native Hawaiian or Other Pacific Islander']

This result shows us that the cell has a missing value. We can assign a new value to that cell location explicitly using the row and column label indices and `.loc`:

In [None]:
ethnicity.loc[375, 'Native Hawaiian or Other Pacific Islander'] = 0
ethnicity.head()

However, what we really want to do is to change every row in that column that has a `NaN` value to a zero. We can do that by first specifying a selector Series for that column that indicates whether each row has missing data (`True`) or not (`False`).

In [None]:
islander_missing_data_rows = ethnicity['Native Hawaiian or Other Pacific Islander'].isnull()
islander_missing_data_rows

Now if we use that Series in the row position of the `.loc` attribute, all rows in that column with `True` values (i.e. with missing data) will have their value set to zero.

In [None]:
ethnicity.loc[islander_missing_data_rows, 'Native Hawaiian or Other Pacific Islander'] = 0
ethnicity
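The same pattern is sketched below on a small made-up table (`counts`, with hypothetical column names). The sketch also shows the `.fillna()` method, which is not used in this lesson but replaces `NaN` values in every column at once.

```python
import pandas as pd

# Hypothetical toy data; NaN stands for "no students in this group"
counts = pd.DataFrame(
    {'group_a': [12.0, float('nan'), 7.0],
     'group_b': [float('nan'), 3.0, float('nan')]},
    index=[101, 102, 103]
)

# Boolean selector for the rows of one column, then assign to the matching cells
missing_a = counts['group_a'].isnull()
counts.loc[missing_a, 'group_a'] = 0

# Alternative (not the approach used in this lesson): fill NaN in all columns
filled = counts.fillna(0)

print(counts)
print(filled)
```

After the `.loc` assignment, only `group_a` has been changed. The `filled` DataFrame has zeros in both columns, and `counts` still has its `NaN` values in `group_b`.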

# Iterating through rows

One of the main purposes of pandas is to make it possible to perform operations on entire columns using vectorized operations. However, there are some situations where it makes sense to iterate through each row in the DataFrame and deal with values one row at a time. These situations would include complex operations that require multiple lines of code to describe, or actions that must happen sequentially, such as retrieving data from a URL.

Our example will use information about websites.

In [None]:
websites = {
    'name': {'alphabet': 'Google', 'vu': 'Vanderbilt', 'fake': 'Obsolete Website'}, 
    'url': {'alphabet': 'https://www.google.com/', 'vu': 'https://www.vanderbilt.edu/', 'fake': 'https://example.org/fake_url'},
    'status': {'alphabet': 'unknown', 'vu': 'unknown', 'fake': 'unknown'}
           }
websites_df = pd.DataFrame(websites)
websites_df

To generate an iterable object from the DataFrame we use the `.iterrows()` method. Iterating using a `for` loop generates a tuple consisting of the label index and the data from the row, in the form of a Series.

In [None]:
for website_tuple in websites_df.iterrows():
    print(website_tuple)
    print()

To access the index and Series separately, we can unpack the tuple as we iterate.

In [None]:
for label_index, website_series in websites_df.iterrows():
    print(label_index)
    print()
    print(website_series)
    print()
    print()

As we learned in the lesson on pandas Series, to access a value from the row Series, we can use the `.loc` attribute, or direct indexing, which is simpler but perhaps more ambiguous since it can also be used for integer labels.

In [None]:
for label_index, website_series in websites_df.iterrows():
    print(website_series.loc['url'])
    print(website_series['url'])
    print()

Iterating will allow us to check the status of each website one at a time.

In [None]:
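import requests
for label_index, website_series in websites_df.iterrows():
    # The timeout (in seconds) keeps the loop from waiting indefinitely on an unresponsive site
    response = requests.get(website_series['url'], timeout=10)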
    if response.status_code == 200:
        print(website_series['name'], 'is up.')
        websites_df.loc[label_index, 'status'] = 'OK'
    elif response.status_code == 404:
        print(website_series['name'], 'is down.')
        websites_df.loc[label_index, 'status'] = 'not found'
    else:
        print(website_series['name'], 'has status code', response.status_code)
        websites_df.loc[label_index, 'status'] = 'other'


In addition to displaying the status to the user, the script also recorded the status in the DataFrame. 

In [None]:
websites_df
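The general pattern (read values from the row Series, then write results back through `.loc` on the DataFrame) can be sketched without a network connection. Here `classify_url` is a hypothetical stand-in for any multi-line, per-row operation, and `sites` is made-up data. Writing through `.loc` matters because the row Series produced by `.iterrows()` may be a copy, so assigning to it is not guaranteed to change the DataFrame.

```python
import pandas as pd

def classify_url(url):
    """Hypothetical per-row operation: report whether a URL uses HTTPS."""
    if url.startswith('https://'):
        return 'secure'
    return 'insecure'

# Hypothetical toy data for illustration
sites = pd.DataFrame(
    {'url': ['https://www.google.com/', 'http://example.org/'],
     'scheme': ['unknown', 'unknown']},
    index=['alphabet', 'example']
)

for label_index, site_series in sites.iterrows():
    # Read from the row Series, write to the DataFrame itself
    sites.loc[label_index, 'scheme'] = classify_url(site_series['url'])

print(sites)
```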

## Looking up values (optional)

We can use the strategies from this lesson to look up values in one table and add them to another. In this example, we have a table with data about artists and another table with data about artworks. The artwork table is linked to the artist table by a unique identifier for the artist (the Wikidata Q ID of the artist). 

First read in the tables. I'm choosing to treat the dates as strings, so I use the `dtype=str` argument when I load each CSV into a DataFrame. I also set the `qid` as the label index for the artist and the `accession_number` as the label index for the artwork.

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/codegraf/artists.csv'
# We want the years to be strings, not numbers
artists = pd.read_csv(url, dtype=str)
artists = artists.set_index('qid')
artists

In [None]:
url = 'https://github.com/HeardLibrary/digital-scholarship/raw/master/data/codegraf/artworks.csv'
works = pd.read_csv(url, dtype=str)
works = works.set_index('accession_number')
works

We can find an artist's name in the artist DataFrame using the Q ID label index. 

In [None]:
artist_index = 'Q105090067'
artists.loc[artist_index, 'name']

Instead of hard-coding the artist index, we can generate a Series of the artist names for all of the works by passing the `creator` column of the `works` DataFrame (`works['creator']`) to `.loc` in place of the single index.

In [None]:
artists.loc[works['creator'], 'name']

Because the length of this Series is the same as the number of rows in the DataFrame, we can add it as a column. However, the label indices of the Series don't match the label indices of the DataFrame rows. So we turn the Series into a Python list using the `list()` function.

In [None]:
list(artists.loc[works['creator'], 'name'])

Now we can assign those values to a new column in the `works` DataFrame called `artist`.

In [None]:
works['artist'] = list(artists.loc[works['creator'], 'name'])
works
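The lookup can be sketched on two tiny made-up tables (the Q IDs, names, and titles below are hypothetical). The sketch also shows the Series `.map()` method, an alternative not used in this lesson, which looks up each value of `creator` in the label index of another Series and keeps the row labels of `works`, so no conversion to a list is needed.

```python
import pandas as pd

# Hypothetical toy data standing in for the artists and artworks tables
artists = pd.DataFrame({'name': ['Artist A', 'Artist B']}, index=['Q1', 'Q2'])
works = pd.DataFrame(
    {'title': ['Landscape', 'Portrait', 'Still life'],
     'creator': ['Q2', 'Q1', 'Q2']},
    index=['1990.001', '1990.002', '1990.003']
)

# The approach from this lesson: look up by label, then strip the labels with list()
works['artist'] = list(artists.loc[works['creator'], 'name'])

# Alternative: map each creator value through the name Series
works['artist_mapped'] = works['creator'].map(artists['name'])

print(works)
```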

Things are more complicated if you want to look up the artists by values in a column that isn't the label index. For example, now that we have the artist names in the `works` DataFrame, if we pretend that we didn't have the artist `qid` index available, we could use the names to look up artists by matching those names to values in the `name` column of the `artists` DataFrame and then get some information about the artist, such as their birth date. There are probably several ways to do this, but the following is a way to do it based on strategies we already know.

First create a boolean Series for selecting the row in the `artists` table that matches the name.

In [None]:
artist_name = 'Tōteki Unkoku'
artists['name']==artist_name

We can use this boolean Series to locate the birth year in rows where the designated artist name matches the `name` column. If the artist names in the table are unique (only one row per artist name), only a single row will match.

In [None]:
artists.loc[artists['name']==artist_name, 'birth_year']

We want the actual birth year value, not a Series with one value. We can use the `.iloc` attribute of the Series to get the value of the 0th item in the Series. Directly indexing with `[0]` has also worked in older versions of pandas, but because the label indices of this Series are strings rather than integers, that relies on a fallback to positional indexing that is deprecated in recent versions. So `.iloc[0]` is the safer choice.

In [None]:
print(artists.loc[artists['name']==artist_name, 'birth_year'].iloc[0])

Now we can iterate through each row in the DataFrame and substitute the `artist` name value for the row instead of the hard-coded `artist_name` that we used before. The `artist_birth` date that we looked up can then be added as a value in a new column of the `works` DataFrame called `artist_birth`.

In [None]:
for accession_number, work in works.iterrows():
    # Use .iloc[0] to get the first matching value by position
    artist_birth = artists.loc[artists['name']==work['artist'], 'birth_year'].iloc[0]
    works.loc[accession_number, 'artist_birth'] = artist_birth
works
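The loop above assumes that every artist name in `works` has a match in `artists`. If a name had no match, the selection would be empty and taking its first item would raise an error. The following is a minimal sketch of one way to guard against that, using made-up tables and a placeholder value of `'unknown'` (both are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data; the last work has an artist that is not in the artists table
artists = pd.DataFrame(
    {'name': ['Artist A', 'Artist B'], 'birth_year': ['1600', '1750']},
    index=['Q1', 'Q2']
)
works = pd.DataFrame(
    {'artist': ['Artist B', 'Artist A', 'Unlisted Artist']},
    index=['w1', 'w2', 'w3']
)

# Start with a placeholder, then overwrite it wherever a match is found
works['artist_birth'] = 'unknown'
for accession_number, work in works.iterrows():
    matches = artists.loc[artists['name'] == work['artist'], 'birth_year']
    if len(matches) > 0:
        works.loc[accession_number, 'artist_birth'] = matches.iloc[0]

print(works)
```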