# Introduction to Pandas and Dataframes

We can use `pandas` for programmatic manipulation of large data sets.

In [None]:
import pandas as pd

## Series

The basic `pandas` object is the `Series`: a one-dimensional array of values, each paired with an index label.

In [None]:
l1 = [19.3, 2.67, 47.1, 0.000032, 1562.9986]
s1 = pd.Series(l1)
s1

In [None]:
s2 = s1 + 3
s2

In [None]:
dir(s1)

### Key properties
- `values`
- `dtype`
- `index`

In [None]:
s1.values

The `values` attribute returns a `numpy` array, so the usual element-wise `numpy` maths operations work on it.

In [None]:
s1.values * 3
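
As a rough illustration of what a `numpy` array allows, the sketch below rebuilds the same numbers in a separate `Series` (the names `s`, `arr`, `scaled`, `roots` and `total` are only for this example) and applies a few element-wise operations and one reduction.

```python
import numpy as np
import pandas as pd

s = pd.Series([19.3, 2.67, 47.1, 0.000032, 1562.9986])

arr = s.to_numpy()      # the underlying data as a NumPy array, like s.values
scaled = arr * 3        # element-wise multiplication
roots = np.sqrt(arr)    # NumPy universal functions also work element-wise
total = arr.sum()       # a reduction collapses the array to a single number
```

None of these operations change `s` itself; each one returns a new array or number.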

In [None]:
s1

In [None]:
s1.index

#### We can access and slice elements in the expected way.

In [None]:
s1[2]

In [None]:
s1[1:3]
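
With the default integer index, labels and positions happen to coincide, which can hide the difference between the two. A minimal sketch, using a separate `Series` named `s` for illustration, of the explicit accessors `.loc` (labels) and `.iloc` (positions):

```python
import pandas as pd

s = pd.Series([19.3, 2.67, 47.1, 0.000032, 1562.9986])

by_label = s.loc[2]         # the element whose index label is 2
by_position = s.iloc[2]     # the third element, counting from zero

pos_slice = s.iloc[1:3]     # positions 1 and 2 (end excluded)
label_slice = s.loc[1:3]    # labels 1, 2 and 3 (end included)
```

Using `.loc` and `.iloc` makes the intent explicit, which matters once the index is no longer a simple run of integers.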

A `Series` has many built-in methods, for example `mean()` for summary statistics and `where()` for conditional selection.

In [None]:
s1.mean()

In [None]:
s1.where(s1 > 10)

In [None]:
s1[s1 > 10]
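
The two selections above look similar but return different shapes. A small sketch of the difference, using a separate `Series` named `s` (an illustrative name):

```python
import pandas as pd

s = pd.Series([19.3, 2.67, 47.1, 0.000032, 1562.9986])

masked = s.where(s > 10)    # same length as s; non-matching entries become NaN
filtered = s[s > 10]        # only the matching entries are kept
```

`where` is useful when the result has to stay aligned with the original index; boolean indexing is the usual choice when only the matching rows are of interest.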

### We can create Series with an arbitrary index

In [None]:
l2 = [3.2, 9, 'A', False]
idx = ['my flt', 'my int', 'my str', 'my bool']  # one label per value
s2 = pd.Series(l2, index=idx)

In [None]:
s2.values

In [None]:
s2

In [None]:
s2['my str']

In [None]:
s2.index
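
One consequence of mixing value types is worth seeing directly. In the sketch below (the variable names and the `'my bool'` label are assumptions made for illustration), a `Series` holding floats, integers, strings and booleans falls back to the generic `object` dtype, while a purely numeric one gets a numeric dtype.

```python
import pandas as pd

mixed = pd.Series([3.2, 9, 'A', False],
                  index=['my flt', 'my int', 'my str', 'my bool'])
numeric = pd.Series([3.2, 9], index=['my flt', 'my int'])

mixed_dtype = mixed.dtype       # object: values are stored as generic Python objects
numeric_dtype = numeric.dtype   # float64: the integer 9 is upcast to a float
letter = mixed['my str']        # look up a single value by its label
```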

#### This is obviously reminiscent of dictionaries...

In [None]:
d3 = {'name': 'Simon', 'game': 'Warcraft', 'nerd_score': 148, 'interests': 'rocks', 'shaves': False}
s3 = pd.Series(d3)
s3

#### ... but Series have lots of extra features, including the ability to slice with non-numeric index labels.

In [None]:
s3['game':'interests'] # label slices include the end label ('interests')
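
To make the behaviour concrete, here is a short sketch comparing a label slice with a positional slice on the same data (rebuilt here as `s`, an illustrative name):

```python
import pandas as pd

s = pd.Series({'name': 'Simon', 'game': 'Warcraft', 'nerd_score': 148,
               'interests': 'rocks', 'shaves': False})

label_slice = s['game':'interests']   # includes the end label
pos_slice = s.iloc[1:3]               # excludes the end position
```

The label slice returns three entries and the positional slice two, even though both start at `'game'`.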

## Dataframes

The most commonly used (and most important) structure in `pandas` is the **dataframe**. It is two-dimensional rather than one-dimensional.

In [None]:
stock_dict = {'trainers': 78, 'basketballs': 27, 'cricket bats': 38, 'wetsuits': 12}
prices_dict = {'trainers': 102.20, 'basketballs': 22.40, 'cricket bats': 61.38, 'wetsuits': 98.24}

In [None]:
stock = pd.Series(stock_dict)
price = pd.Series(prices_dict)

In [None]:
inventory = pd.DataFrame({"stock": stock, "price": price})

In [None]:
inventory
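
When a dataframe is built from several `Series`, the rows are matched by index label rather than by position. The sketch below uses deliberately mismatched toy data (the variable names and the missing entries are assumptions for illustration) to show that labels missing from one `Series` produce `NaN`.

```python
import pandas as pd

stock = pd.Series({'trainers': 78, 'basketballs': 27})
price = pd.Series({'basketballs': 22.40, 'trainers': 102.20, 'wetsuits': 98.24})

df = pd.DataFrame({'stock': stock, 'price': price})

trainer_price = df.loc['trainers', 'price']                 # matched by label, not by order
wetsuit_stock_missing = pd.isna(df.loc['wetsuits', 'stock'])  # no stock entry, so NaN
```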

In [None]:
dir(inventory)

### Again, it has some important properties
- `index`
- `dtypes`
- `columns`
- `values`


In [None]:
inventory.index

In [None]:
inventory.dtypes

In [None]:
inventory.columns

In [None]:
inventory.values

#### Selecting a column (Series)

In [None]:
inventory['price']

In [None]:
inventory.price
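
The two forms above are interchangeable only when the column name is a valid Python identifier that does not clash with an existing attribute. A small sketch, using a hypothetical column called `count` to show the clash:

```python
import pandas as pd

df = pd.DataFrame({'price': [102.20, 22.40], 'count': [78, 27]},
                  index=['trainers', 'basketballs'])

same = df['price'].equals(df.price)   # both forms return the same Series
is_method = callable(df.count)        # df.count is the DataFrame method, not the column
count_col = df['count']               # bracket access always returns the column
```

For that reason, bracket access is the safer default; attribute access is mainly a convenience for interactive work.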

#### Creating a new column calculated from other columns

In [None]:
inventory['new_value'] = inventory['stock']*inventory['price']

In [None]:
inventory

#### Column operations are just Series operations

In [None]:
inventory.new_value.sum()
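
As a sketch of what this means in practice, the example below rebuilds the inventory (as `inv`, an illustrative name) and applies a few other `Series` methods to the calculated column.

```python
import pandas as pd

stock = pd.Series({'trainers': 78, 'basketballs': 27, 'cricket bats': 38, 'wetsuits': 12})
price = pd.Series({'trainers': 102.20, 'basketballs': 22.40, 'cricket bats': 61.38, 'wetsuits': 98.24})
inv = pd.DataFrame({'stock': stock, 'price': price})
inv['new_value'] = inv['stock'] * inv['price']

total_value = inv['new_value'].sum()     # total across all products
mean_value = inv['new_value'].mean()     # average per product
top_product = inv['new_value'].idxmax()  # index label of the largest value
```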

#### Slicing and selecting rows and columns

In [None]:
inventory['basketballs':'cricket bats']

In [None]:
inventory[1:2] # integer slices are positional and exclude the end, so only 'basketballs' is returned

In [None]:
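# A bare slice selects rows, so inventory['price':'new_value'] would look for
# those labels in the row index and fail; use .loc to slice columns by label.
inventory.loc[:, 'price':'new_value']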

In [None]:
inventory

In [None]:
inventory.loc['basketballs', 'price']
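
`.loc` takes a row selector and a column selector, both by label, and `.iloc` is its positional counterpart. A minimal sketch on a rebuilt copy of the inventory (named `inv` here for illustration):

```python
import pandas as pd

inv = pd.DataFrame(
    {'stock': [78, 27, 38, 12], 'price': [102.20, 22.40, 61.38, 98.24]},
    index=['trainers', 'basketballs', 'cricket bats', 'wetsuits'],
)

cell_by_label = inv.loc['basketballs', 'price']   # row label, column label
cell_by_position = inv.iloc[1, 1]                 # second row, second column

rows = inv.loc['basketballs':'cricket bats', ['price']]  # row label slice, list of columns
cols = inv.loc[:, 'stock':'price']                       # all rows, column label slice
```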

In [None]:
(inventory['price'] > 70.0)

In [None]:
inventory[inventory['price'] > 70.0]

In [None]:
inventory.loc[inventory['price'] > 70.0]
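
Boolean masks can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on a rebuilt copy of the inventory (`inv` is an illustrative name, and the stock threshold of 50 is arbitrary):

```python
import pandas as pd

inv = pd.DataFrame(
    {'stock': [78, 27, 38, 12], 'price': [102.20, 22.40, 61.38, 98.24]},
    index=['trainers', 'basketballs', 'cricket bats', 'wetsuits'],
)

expensive = inv['price'] > 70.0
well_stocked = inv['stock'] > 50

both = inv[expensive & well_stocked]   # rows meeting both conditions
either = inv[expensive | well_stocked] # rows meeting at least one condition
cheap = inv[~expensive]                # rows that are not expensive
```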

## File manipulation
`pandas` can read and write data in a number of formats, including CSV, JSON and Parquet.

In [None]:
flights_df = pd.read_csv('customer_booking.csv', encoding='utf-8', encoding_errors='ignore')

In [None]:
type(flights_df)

In [None]:
flights_df.count()

In [None]:
flights_df.head()

In [None]:
flights_df.groupby('route').nunique()
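
The result of `groupby(...).nunique()` is easier to read on a small table. The sketch below uses invented bookings: only the `route` and `num_passengers` column names come from the cells above, while the `sales_channel` column and all the values are assumptions for illustration.

```python
import pandas as pd

bookings = pd.DataFrame({
    'route': ['AKLDEL', 'AKLDEL', 'AKLHGH'],
    'num_passengers': [1, 2, 1],
    'sales_channel': ['Internet', 'Internet', 'Mobile'],  # hypothetical column
})

# One row per route; each cell counts the distinct values seen in that column
per_route = bookings.groupby('route').nunique()
```

Here the `AKLDEL` row would hold 2 for `num_passengers` (the values 1 and 2) and 1 for `sales_channel`.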

In [None]:
single_traveller_df = flights_df.query('num_passengers == 1')

In [None]:
single_traveller_df.head()

In [None]:
single_traveller_df.count()
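
`query` is a string-based alternative to boolean indexing. A minimal sketch on invented data (the values are assumptions; only the column names appear in the cells above) showing that both forms select the same rows:

```python
import pandas as pd

bookings = pd.DataFrame({
    'route': ['AKLDEL', 'AKLDEL', 'AKLHGH'],
    'num_passengers': [1, 2, 1],
})

via_query = bookings.query('num_passengers == 1')
via_mask = bookings[bookings['num_passengers'] == 1]

identical = via_query.equals(via_mask)  # same rows, same index, same values
```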

In [None]:
single_traveller_df.to_json('./single.json', orient='columns')

In [None]:
single_traveller_df.to_parquet('./single.parquet', engine='auto', compression=None)
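
To check that a file was written as intended, it helps to read it back. The sketch below does a round trip through CSV and JSON using in-memory buffers instead of files, on invented data (the values are assumptions). Parquet is left out because `to_parquet` needs an optional engine such as `pyarrow` or `fastparquet` to be installed.

```python
import io
import pandas as pd

single = pd.DataFrame({
    'route': ['AKLDEL', 'AKLHGH'],
    'num_passengers': [1, 1],
})

# CSV round trip: write without the index, then read back
csv_buffer = io.StringIO()
single.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
from_csv = pd.read_csv(csv_buffer)

# JSON round trip: to_json returns a string when no path is given
json_text = single.to_json(orient='columns')
from_json = pd.read_json(io.StringIO(json_text), orient='columns')
```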