# 1.3 🐼 Pandas Core Concepts - `pandas.DataFrame`


OK, now we're getting to the good stuff. A dataframe is a collection of Pandas Series objects, and shares many of the properties of what we'd think of as a table in a database or a spreadsheet. Like a series, a dataframe has an index, and that index is shared by every column (series) in the dataframe.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Construct a dataframe from a number of lists.

# this column will be our index
names = ['Alice','Bob','Jim','Jane']

# the rest will be our columns
data = {
    "age": [28,56,19,65],
    "height": [150,187,145,197],
    "eyecolor": ["brown","brown","blue","green"],
}

# We will call our dataframe df.
df = pd.DataFrame(
    index=names,
    data=data
)

# Display the dataframe.
df

In the above output you'll see that Jupyter formats dataframes into a nice tabular format, with the index and the column headers in **bold**. Unlike a 2D numpy array, in which all the values must be of the same type, a Pandas DataFrame is made up of Series objects which can each have a different data type.
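
To make the contrast with numpy concrete, here is a minimal sketch. The names `mixed` and `people` are illustrative only and are not used elsewhere in this notebook. Numpy coerces the mixed values to one common type, whereas each dataframe column keeps its own.

```python
import numpy as np
import pandas as pd

# Numpy needs a single dtype for the whole array, so the integers become strings.
mixed = np.array([[28, "brown"], [56, "blue"]])
print(mixed.dtype)

# A dataframe stores each column as its own Series, each with its own dtype.
people = pd.DataFrame({"age": [28, 56], "eyecolor": ["brown", "blue"]})
print(people.dtypes)
```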

In [None]:
df.dtypes

Note the `object` datatype for `eyecolor`. This is the dtype that Pandas usually uses to represent strings and other more complex objects. Whilst Pandas 1.0 [introduced an experimental String datatype](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.StringDtype.html#pandas.StringDtype), it's much more common to treat strings as objects in pandas.
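
If you do want to try the dedicated string datatype, you have to opt in. The sketch below assumes a standalone series called `eyecolor` (an illustrative name) and converts it with `astype("string")`. The string methods under `.str` behave the same way either way.

```python
import pandas as pd

eyecolor = pd.Series(["brown", "brown", "blue", "green"])

# Opt in to the dedicated string dtype explicitly.
eyecolor_str = eyecolor.astype("string")
print(eyecolor_str.dtype)

# The .str accessor is available on the converted series too.
print(eyecolor_str.str.upper().tolist())
```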

## Accessing Dataframe Values

Dataframes support a number of ways of accessing the data within them.

- You can access columns like they were dict values with `df['column_name']`
- You can select multiple columns by supplying a list of column names e.g. `df[['age','eyecolor']]` - note the double brackets.
- You can also access columns like they were attributes with `df.column_name`
- You can access rows by their index label with the `.loc` accessor
- You can slice the dataframe with the `.loc` accessor to select both columns and rows at the same time.
- You can pass an expression to the `.loc` accessor that evaluates to true or false for each row, to select the rows for which it is true.
- You can use the `.iloc` accessor to select columns and rows by their numeric position rather than their labels.

You can access columns like they were dict values with `df['column_name']`

In [None]:
df['age']

In [None]:
df.age

In [None]:
df.loc['Jim']

In [None]:
# The first argument to the .loc accessor specifies the index (or indices), the second the column(s).
df.loc['Alice', ['age','eyecolor']]

In [None]:
# We can use .loc on just the columns by using : to skip the first argument
df.loc[:, ['age','height']]

In [None]:
# Use an expression to select anyone over 30.
df.loc[df['age'] > 30]

In [None]:
# You can combine expressions with & (and) and | (or) - be sure to put each expression in parentheses.
# Select anyone over 150cm tall with green eyes
df.loc[
    (df['eyecolor'] == 'green') & (df['height'] > 150)
]
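
The cells above don't cover every item in the list, so here is a short sketch of the remaining ones: selecting several columns at once, picking out a single value with `.loc`, and selecting by position with `.iloc`. It rebuilds the same data under the illustrative name `people` so that it runs on its own.

```python
import pandas as pd

people = pd.DataFrame(
    index=["Alice", "Bob", "Jim", "Jane"],
    data={
        "age": [28, 56, 19, 65],
        "height": [150, 187, 145, 197],
        "eyecolor": ["brown", "brown", "blue", "green"],
    },
)

# A list of column names returns a dataframe with just those columns.
subset = people[["age", "eyecolor"]]

# One row label plus one column label returns a single value.
jim_age = people.loc["Jim", "age"]

# .iloc uses integer positions: first row, first column.
first_value = people.iloc[0, 0]

# Position slices work like Python list slices (the end position is excluded).
first_two = people.iloc[:2, :2]
```

Note that `.loc` slices by label include the end label, whereas `.iloc` slices by position exclude the end position.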

### Accessing data and the Zen of Python

This abundance of ways to access data conflicts with the [Zen of Python](https://www.python.org/dev/peps/pep-0020/):

> There should be one-- and preferably only one --obvious way to do it.

Thankfully it also has this wisdom to offer:

> Explicit is better than implicit.

Whilst there are many ways of accessing data in a dataframe, two of them usually make your intentions most explicit. Use the `df['column_name']` accessor to return a single column, since the `.attribute` method could also refer to core dataframe attributes like `.index` or `.shape`. Use the `.loc` accessor for accessing rows or anything more complex.

In [None]:
df_example = pd.DataFrame(
    data={
        "shape": ["circle","square","triangle"],
        "sides": [np.nan, 4, 3]
    }
)
print("df_example.shape is:")
print(df_example.shape)
print("\nNot what you were expecting? The .shape property of a dataframe is a tuple of the number of rows and columns it contains.\n")
print("df_example['shape'] is:")
print(df_example['shape'])

### `df.apply()`

We can use the `.apply` method on individual columns in the dataframe just by selecting them. We can also apply a function to the whole dataframe, row-wise or column-wise, by specifying an axis.


In [None]:
def description(row):
    return f"They are {row['height']}cm tall, have {row['eyecolor']} eyes and are {row['age']} years old"

# Apply on the columns axis to have the function applied to each row.
df.apply(description, axis='columns')
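
For comparison, here is a sketch of the other two cases mentioned above: applying a function to a single column, and applying one column-wise across the dataframe. The dataframe name `people` and the lambdas are illustrative only.

```python
import pandas as pd

people = pd.DataFrame(
    index=["Alice", "Bob", "Jim", "Jane"],
    data={"age": [28, 56, 19, 65], "height": [150, 187, 145, 197]},
)

# Apply to a single column: the function receives one value at a time.
height_m = people["height"].apply(lambda cm: cm / 100)

# Apply on the index axis: the function receives each column as a Series.
ranges = people.apply(lambda col: col.max() - col.min(), axis="index")
```

Here `height_m` is a series of heights in metres, and `ranges` holds one value per column (the difference between its largest and smallest value).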

## Other Useful Dataframe Methods & Attributes

Dataframes have a number of useful methods and attributes that are worth knowing. Some we've covered already.


In [None]:
# Columns is an index of the column names.
print('columns', df.columns)

In [None]:
# Shape is a tuple of the number of rows and columns in the dataframe
print("df has:")
print(df.shape[0], 'rows')
print(df.shape[1], 'columns')

In [None]:
# info gives you information about the columns and the data in them
df.info()

In [None]:
# describe generates descriptive statistics about the numeric data in the dataframe including the number of values, the mean, and various percentiles
df.describe()
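
By default `describe` only summarises the numeric columns. As a small sketch (again using an illustrative `people` dataframe), passing `include="all"` asks it to summarise the non-numeric columns as well.

```python
import pandas as pd

people = pd.DataFrame(
    index=["Alice", "Bob", "Jim", "Jane"],
    data={
        "age": [28, 56, 19, 65],
        "height": [150, 187, 145, 197],
        "eyecolor": ["brown", "brown", "blue", "green"],
    },
)

# Default: statistics for the numeric columns only.
numeric_summary = people.describe()

# include="all" adds the non-numeric columns to the summary.
full_summary = people.describe(include="all")
```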

## Exercises

Using the following dataframe, which contains data from the 2017 general election, answer the questions below.

- How many seats did the DUP gain? (select just this data point).
- How many votes were cast in total?
- What percentage of the vote did parties other than Labour and the Conservatives get in total?
- What party had the largest change in the overall vote share?
- Use an apply function to create a series with the value `more seats` if they won seats overall, `no change` if they maintained the same number of seats, or `fewer seats` if they lost seats overall. Count the number of parties in each category (ignore "All other parties")


In [None]:
df = pd.read_csv('data/2017-election-results.csv', index_col=0)
# Show the first few rows, not the whole dataframe!
df.head()

In [None]:
# How many seats did the DUP gain?


In [None]:
# How many votes were cast in total?


In [None]:
# What percentage of the vote did parties other than Labour and the Conservatives win?


In [None]:
# What percentage of seats did parties other than Labour and the Conservatives win?


In [None]:
# What party had the largest change in vote share?

In [None]:
# More seats or fewer seats

def more_or_fewer(value):
    # Add your function code here..
    pass  # placeholder so the cell runs; replace it with your code

# Apply to the Net column that shows the net number of seats gained or lost
mf = df['Net'].apply(more_or_fewer) 
display(mf)

In [None]:
# The same thing as a one liner.
# Is it quicker to write? maybe. 
# Is it easier to read? maybe not...

