# 1.3 🐼 Pandas Core Concepts - `pandas.DataFrame`


OK, now we're getting to the good stuff. A dataframe is a collection of Pandas Series objects, and it shares a lot of properties with what we'd think of as a table in a database or a spreadsheet. A dataframe has an index, just like a series, and that index is shared by every column (series) in the dataframe.

In [25]:
import pandas as pd
import numpy as np

In [26]:
# Construct a dataframe from a number of lists.

# this column will be our index
names = ['Alice','Bob','Jim','Jane']

# the rest will be our columns
data = {
    "age": [28,56,19,65],
    "height": [150,187,145,197],
    "eyecolor": ["brown","brown","blue","green"],
}

# We will call our dataframe df.
df = pd.DataFrame(
    index=names,
    data=data
)

df

| | age | height | eyecolor |
|---|---|---|---|
| Alice | 28 | 150 | brown |
| Bob | 56 | 187 | brown |
| Jim | 19 | 145 | blue |
| Jane | 65 | 197 | green |


In the above output you'll see that Jupyter formats dataframes into a nice tabular format, with the index and the column headers in **bold**. Unlike a 2D numpy array, in which all the values must be of the same type, a Pandas DataFrame is made up of Series objects which can each have a different data type.
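To make that comparison concrete, here's a minimal sketch. It rebuilds the same example data so it runs on its own, and the names `age_column` and `mixed` are just illustrative. It checks that a column really is a Series sharing the dataframe's index, and shows what numpy does when you hand it mixed numbers and strings.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe so this snippet is self-contained.
df = pd.DataFrame(
    index=["Alice", "Bob", "Jim", "Jane"],
    data={
        "age": [28, 56, 19, 65],
        "height": [150, 187, 145, 197],
        "eyecolor": ["brown", "brown", "blue", "green"],
    },
)

# Each column is a Series that shares the dataframe's index.
age_column = df["age"]
print(type(age_column).__name__)
print(age_column.index.equals(df.index))

# A numpy array holds a single dtype, so mixing numbers and strings
# coerces every value to a string.
mixed = np.array([[28, 150, "brown"], [56, 187, "brown"]])
print(mixed.dtype.kind)  # "U" is numpy's code for unicode strings
```

The dataframe keeps `age` and `height` as integers alongside the string column, whereas the numpy array has turned the numbers into strings.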

In [27]:
df.dtypes

age          int64
height       int64
eyecolor    object
dtype: object

Note the `object` datatype for eyecolor. This is the dtype that Pandas usually uses to represent strings and other more complex objects. Whilst Pandas 1.0 [introduced an experimental String datatype](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.StringDtype.html#pandas.StringDtype), it's much more common to treat strings as objects in pandas.
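If you'd like to try the dedicated string datatype, the sketch below opts in explicitly with `.astype("string")`. The Series is rebuilt here so the snippet stands alone, and the exact printed dtype may vary between pandas versions, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

# The eyecolor column from the example, rebuilt as a standalone Series.
eyecolor = pd.Series(
    ["brown", "brown", "blue", "green"],
    index=["Alice", "Bob", "Jim", "Jane"],
    name="eyecolor",
)

# Opt in to the dedicated string dtype explicitly.
as_string = eyecolor.astype("string")
print(as_string.dtype)

# The usual .str methods work the same way on the converted column.
print(as_string.str.upper())
```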

## Accessing Dataframe Values

Dataframes support a number of ways of accessing the data within them.

- You can access columns as if they were dict values with `df['column_name']`
- You can select multiple columns by supplying a list of column names, e.g. `df[['age','eyecolor']]` - note the double brackets.
- You can also access columns as if they were attributes with `df.column_name`
- You can access rows by their index label with the `.loc` accessor
- You can slice the dataframe with the `.loc` accessor to select both columns and rows at the same time.
- You can use the `.iloc` accessor to select columns and rows by their integer position rather than their labels.
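Here's a short sketch of multi-column selection, `.loc` slicing, and `.iloc`. The dataframe is rebuilt so the snippet runs on its own, and the variable names are only for illustration.

```python
import pandas as pd

# Rebuild the example dataframe so this snippet is self-contained.
df = pd.DataFrame(
    index=["Alice", "Bob", "Jim", "Jane"],
    data={
        "age": [28, 56, 19, 65],
        "height": [150, 187, 145, 197],
        "eyecolor": ["brown", "brown", "blue", "green"],
    },
)

# A list of column names returns a dataframe with just those columns.
subset = df[["age", "eyecolor"]]

# .loc slices by label, and both ends of a label slice are included.
block = df.loc["Bob":"Jim", "age":"height"]

# .iloc selects by integer position: the first two rows, columns 0 and 2.
first_two = df.iloc[:2, [0, 2]]

# A single row/column position pair returns a single value.
cell = df.iloc[0, 0]

print(subset)
print(block)
print(first_two)
print(cell)
```

Note that label slices with `.loc` include the end label, whereas positional slices with `.iloc` follow normal Python slicing and exclude the end position.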

In [28]:
df['age']

Alice    28
Bob      56
Jim      19
Jane     65
Name: age, dtype: int64

In [29]:
df.age

Alice    28
Bob      56
Jim      19
Jane     65
Name: age, dtype: int64

In [30]:
df.loc['Jim']

age           19
height       145
eyecolor    blue
Name: Jim, dtype: object

In [31]:
# The first argument to the .loc accessor specifies the index (or indices), the second the column(s).
df.loc['Alice', ['age','eyecolor']]

age            28
eyecolor    brown
Name: Alice, dtype: object

### Accessing data and the Zen of Python

This abundance of ways to access data presents a conflict with the [Zen of Python](https://www.python.org/dev/peps/pep-0020/):

> There should be one-- and preferably only one --obvious way to do it.

Thankfully it also has this wisdom to offer:

> Explicit is better than implicit.

Whilst there are many ways of accessing data in a dataframe, two of them usually make your intentions most explicit. Use the `df['column_name']` accessor to return a single column, since the `.attribute` form could also refer to core dataframe attributes like `.index` or `.shape`. Use the `.loc` accessor for accessing rows or anything more complex.

In [32]:
df_example = pd.DataFrame(
    data={
        "shape": ["circle","square","triangle"],
        "sides": [np.nan, 4, 3]
    }
)
print("df.shape is:")
print(df_example.shape)
print("\nNot what you were expecting? The .shape property of a dataframe is a tuple of the number of rows and columns it contains.\n")
print("df['shape'] is:")
print(df_example['shape'])

df.shape is:
(3, 2)

Not what you were expecting? The .shape property of a dataframe is a tuple of the number of rows and columns it contains.

df['shape'] is:
0      circle
1      square
2    triangle
Name: shape, dtype: object
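Attribute access has a couple of other rough edges that point the same way. The sketch below uses hypothetical column names (`number of sides`, `colour`, `corners`) chosen purely for illustration. It shows that a column name containing a space can only be reached with brackets, and that assigning to a new attribute does not create a column.

```python
import warnings

import numpy as np
import pandas as pd

df_example = pd.DataFrame(
    data={
        "shape": ["circle", "square", "triangle"],
        "number of sides": [np.nan, 4, 3],
    }
)

# A column name with spaces isn't a valid attribute name,
# so brackets are the only way to reach it.
sides = df_example["number of sides"]

# Assigning to a new attribute sets a plain Python attribute and
# does NOT add a column (pandas emits a warning about this).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df_example.colour = ["red", "green", "blue"]
print("colour" in df_example.columns)

# Bracket assignment creates the column as intended.
df_example["corners"] = [0, 4, 3]
print("corners" in df_example.columns)
```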


### `df.apply()`

We can use the `.apply` method on an individual column in the dataframe just by selecting it first. We can also apply a function to the whole dataframe, row-wise or column-wise, by specifying an axis.


In [33]:
def description(row):
    return f"They are {row['height']}cm tall, have {row['eyecolor']} eyes and are {row['age']} years old"

# Apply on the columns axis to have the function applied to each row.
df.apply(description, axis='columns')

Alice    They are 150cm tall, have brown eyes and are 2...
Bob      They are 187cm tall, have brown eyes and are 5...
Jim      They are 145cm tall, have blue eyes and are 19...
Jane     They are 197cm tall, have green eyes and are 6...
dtype: object
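The cell above covers the row-wise case. For completeness, here's a minimal sketch of the other two: applying a function to a single column, and applying one to each column in turn. The helper `value_range` and the other variable names are made up for this example.

```python
import pandas as pd

# Rebuild the numeric part of the example dataframe.
df = pd.DataFrame(
    index=["Alice", "Bob", "Jim", "Jane"],
    data={
        "age": [28, 56, 19, 65],
        "height": [150, 187, 145, 197],
    },
)

# Apply on a single column: the function receives one value at a time.
height_m = df["height"].apply(lambda cm: cm / 100)


def value_range(column):
    # Receives a whole column (a Series) and returns a single number.
    return column.max() - column.min()


# Apply on the index axis to have the function applied to each column.
ranges = df.apply(value_range, axis="index")

print(height_m)
print(ranges)
```

When the function returns a single value per column, as `value_range` does, the result is a Series indexed by the column names.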

## Other Useful Dataframe Methods & Attributes

Dataframes have a number of useful methods and attributes that are worth knowing. Some we've covered already.


In [36]:
# Columns is an index of the column names.
print('columns', df.columns)

columns Index(['age', 'height', 'eyecolor'], dtype='object')


In [38]:
# Shape is a tuple of the number of rows and columns in the dataframe
print("df has:")
print(df.shape[0], 'rows')
print(df.shape[1], 'columns')

df has:
4 rows
3 columns


In [40]:
# info gives you information about the columns and the data in them
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Alice to Jane
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   age       4 non-null      int64 
 1   height    4 non-null      int64 
 2   eyecolor  4 non-null      object
dtypes: int64(2), object(1)
memory usage: 288.0+ bytes


In [43]:
# describe generates descriptive statistics about the numeric data in the dataframe including the number of values, the mean, and various percentiles
df.describe()

| | age | height |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 42.0 | 169.75 |
| std | 21.984843 | 26.09438 |
| min | 19.0 | 145.0 |
| 25% | 25.75 | 148.75 |
| 50% | 42.0 | 168.5 |
| 75% | 58.25 | 189.5 |
| max | 65.0 | 197.0 |
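Notice that `eyecolor` is missing from that summary, because by default `describe` only looks at numeric columns in a mixed dataframe. Here's a small sketch of two ways to include the rest: calling `describe` on the column itself, or passing `include="all"`. The dataframe is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the example dataframe so this snippet is self-contained.
df = pd.DataFrame(
    index=["Alice", "Bob", "Jim", "Jane"],
    data={
        "age": [28, 56, 19, 65],
        "height": [150, 187, 145, 197],
        "eyecolor": ["brown", "brown", "blue", "green"],
    },
)

# Describing a string column gives the count, the number of unique
# values, the most common value and how often it occurs.
eye_summary = df["eyecolor"].describe()
print(eye_summary)

# include="all" summarises every column; statistics that don't apply
# to a column are filled with NaN.
full_summary = df.describe(include="all")
print(full_summary)
```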
