# Pandas is a library that helps you work with data as tables

- You've just described something as a 'library'. What do you mean by that?
- So far you've been using what's called the 'standard library', i.e. the bells and whistles thrown into Python out of the box. This includes basic building blocks such as `sum()` and `list()`, and even some more complex libraries like `random`
- A library allows you to reuse code someone else has kindly written for you
- It allows you to 'stand on the shoulders of giants'
- The library ecosystem is one of the reasons Python is so popular today
- Some libraries are included in Python out of the box, others can be installed via `pip`, and you can also make your own!
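
As a small illustration of reusing someone else's code, the sketch below works out an average twice: once by hand and once with `statistics.mean` from the standard library. The `scores` list is made up purely for this example.

```python
import statistics

# Made-up numbers, purely for illustration
scores = [3, 5, 8, 10]

# Doing the work yourself...
manual_mean = sum(scores) / len(scores)

# ...versus reusing code someone else has already written and tested
library_mean = statistics.mean(scores)

print(manual_mean, library_mean)  # 6.5 6.5
```

Both approaches give the same answer; the library version just saves you from writing (and debugging) the arithmetic yourself.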

## [Q1] What will the cell below do?

In [None]:
import random
random.randint(1,100)

# Introduction to Pandas

Unfortunately, `Pandas` is rather unintuitively named after 'panel data', which is a concept from econometrics about multidimensional matrices. But actual 🐼s are much more fun.

Pandas was originally written by [Wes McKinney](https://en.wikipedia.org/wiki/Wes_McKinney) who was working at a hedge fund and needed a tool to better deal with the time-series data he was working with day to day. It's a great story for a few reasons:

1. It's a great example of an open source project going far beyond what the creator anticipated.
2. He's admitted that when he started the project he wasn't very good at Python. 
3. This amazing tool used by thousands of people was made by 'just some guy'
4. This article written a decade later is an honest reflection by McKinney on what he got right and what he didn't:  ["10 things I hate about Pandas"](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)

<img height="42" src="https://pics.onsizzle.com/local-mexican-restaurant-used-to-be-a-chinese-restaurant-instead-58896804.png" width=300>

> This isn't especially relevant, we just like it

## How to use Pandas

Remember this: **NO ONE WAS BORN KNOWING ALL OF THIS**

Everyone was a beginner once (even Wes McKinney) and help is available. The answer to pretty much every Python / Pandas question WILL be online, so getting good at Google-Fu is arguably the best skill you can have as a programmer.

- The [official documentation](https://pandas.pydata.org/pandas-docs/stable/) is great
- But seriously, Google any question and answers will appear: https://www.google.com/search?q=how+do+I+get+the+number+of+rows+pandas


### To start using Pandas you need to import it
- The `as pd` part of the code below is purely convention. You don't need it, but most people use it to save typing.
- Programmers are lazy. Remember, that's why we care about 'efficiency': less actual work.

In [None]:
import pandas # you can do this and it's totally fine 

In [None]:
import pandas as pd # most people do this to avoid typing

In [None]:
pd. #Press tab to see autocomplete 

# The world runs on Excel

- It's on everyone's computer
- It's not going anywhere
- It's in use everywhere from small businesses to nuclear power plants
- Pandas can make your life easy, because it has lots of fancy features built in

<img src=https://media.giphy.com/media/r01AEhiPmQsG4/giphy.gif>

# Opening a Spreadsheet in Pandas is simple...

## [Q2] What do you think `pd.DataFrame.head()` does?

In [None]:
import pandas as pd
excel_df = pd.read_excel('movies.xls', sheet_name='1900s')
excel_df.head()

## [Q3] How do you think we could preview the last 10 rows?

In [None]:
# Please complete the rest of this cell

# How to filter columns in Pandas

- There are a couple of ways to do this
- The simplest way is to pass a list of columns to the DataFrame within square brackets `[]`
- This is very similar to what you've seen with `list` objects e.g. `[1,2,3,4]`


`data_frame[[col1, col2, col3 ...]]`

This looks a little funny, but the two sets of square brackets are doing different things.
1. The outer set belongs to the DataFrame and says: 'Please give me the following columns'
2. The inner set is an ordinary Python list holding the column names to select

Run the cell below to see it in action

In [None]:
excel_df[['Title', 'Year']].sample(n=5)

It might help to see that the code below is functionally identical...

In [None]:
columns_to_select = ['Title', 'Year']
excel_df[columns_to_select].sample(n=5)
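
If you don't have `movies.xls` to hand, the same idea can be tried on a tiny hand-made table. This is only a sketch: `toy_df` and its values are invented for illustration and are not part of the movies data.

```python
import pandas as pd

# Hypothetical stand-in for the movies spreadsheet
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Year': [1920, 1925, 1931],
    'Budget': [100000, 250000, 400000],
})

# Inner brackets: a plain Python list of column names
# Outer brackets: ask the DataFrame for those columns
subset = toy_df[['Title', 'Year']]

print(list(subset.columns))   # ['Title', 'Year']
print(type(subset).__name__)  # DataFrame
```

Passing a list always gives you back a `DataFrame`, even if the list only has one column name in it.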

## [Q4] Filter the table to just the following columns and show the first 5 rows
```python
['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Budget', 'IMDB Score']
```
- Note: Column names must match exactly; a misspelling will give you a `KeyError`
- Save the results in the variable `test_df`
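
To see what that `KeyError` looks like without touching the real data, here is a minimal sketch on a made-up table. The `toy_df` table and the misspelt column name `'Yeer'` are invented for this example.

```python
import pandas as pd

# Hypothetical two-column table, not the real movies data
toy_df = pd.DataFrame({'Title': ['Film A'], 'Year': [1920]})

try:
    toy_df[['Title', 'Yeer']]  # 'Yeer' is a deliberate typo
except KeyError as error:
    caught = True
    print('KeyError:', error)
else:
    caught = False
```

If you hit this in your own cell, compare your spelling (and capital letters) against `excel_df.columns`.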

In [None]:
test_df = # Please complete the rest of this cell

# How to filter rows in Pandas

- If you run the cell below you will see that selecting a single column, rather than a list of columns, looks different...
- This is because a single column is a `Series`, which you can think of as a vertical list
- When you break it down, a DataFrame is just a group of `Series` columns behind the scenes

In [None]:
test_df['Year']

- When you do a comparison against a `Series`, Pandas will compare every item and return `True` or `False` for every row

In [None]:
test_df['Year'] == 1920

- When you put this within the square brackets of a DataFrame, Pandas will filter to the rows which were `True`

In [None]:
test_df[test_df['Year'] == 1920]
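
The two steps above can be pulled apart on a small made-up table, which may make it easier to see what is going on. The names `toy_df`, `mask` and `filtered` are just illustrative.

```python
import pandas as pd

# Hypothetical data, purely for illustration
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Year': [1920, 1925, 1920],
})

# Step 1: the comparison gives one True/False per row
mask = toy_df['Year'] == 1920
print(mask.tolist())  # [True, False, True]

# Step 2: indexing with the mask keeps only the True rows
filtered = toy_df[mask]
print(filtered['Title'].tolist())  # ['Film A', 'Film C']
```

Writing `toy_df[toy_df['Year'] == 1920]` does both steps in one line.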

## [Q5] Filter the `test_df` to rows where the IMDB Score is greater than 5

In [None]:
# Please complete the rest of this cell

## [Q6] Filter the `test_df` to films from the USA

In [None]:
# Please complete the rest of this cell

## You can use the following operators to combine conditions:
- `&` to AND conditions together, e.g. The can was GREEN and CLOSED 
- `|` to OR conditions together, e.g. The person was from England or France (either is fine)
- `~` to NEGATE any condition, e.g. The person was not from London
- Put each condition in brackets `(...)`: `&` and `|` bind more tightly than comparisons like `>=`, so without the brackets things are evaluated in the wrong order

For example you could filter a `DataFrame` like so:

`df[(df['age'] >= 18) & (df['height'] < 200)]`
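
Here is a sketch of all three operators on an invented table of people. The `people_df` table and its values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data, purely for illustration
people_df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat', 'Dan'],
    'age': [17, 30, 45, 22],
    'height': [160, 205, 180, 175],
})

# Naming each condition can make the combinations easier to read
is_adult = people_df['age'] >= 18
is_under_2m = people_df['height'] < 200

both = people_df[is_adult & is_under_2m]    # AND: both must be True
either = people_df[is_adult | is_under_2m]  # OR: at least one is True
not_adult = people_df[~is_adult]            # NOT: flips True and False

print(both['name'].tolist())       # ['Cat', 'Dan']
print(either['name'].tolist())     # ['Ann', 'Bob', 'Cat', 'Dan']
print(not_adult['name'].tolist())  # ['Ann']
```

Storing each condition in its own variable is optional, but it avoids the bracket problem entirely.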

This one is a little harder...
## [Q7] Filter the test_df to films which were made in the 1920s

In [None]:
# Please complete the rest of this cell

## Getting a sense of the data you have to work with
If you ever want to get a quick numeric sense of the data in your `DataFrame`, the `describe()` method is built for you!

In [None]:
test_df.describe()
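
As a rough sketch of what to expect, here is `describe()` on a made-up table (`toy_df` is invented for this example). By default it only summarises the numeric columns, so the text column is left out.

```python
import pandas as pd

# Hypothetical data, purely for illustration
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Year': [1920, 1925, 1930],
})

summary = toy_df.describe()

print(list(summary.columns))        # ['Year']
print(summary.loc['mean', 'Year'])  # 1925.0
```

The result is itself a `DataFrame`, so you can pick individual values out of it with `.loc[row, column]` as shown.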

# Let's take a second to recap

What you've learnt so far:
- Pandas is a library that comes with lots of built-in tools for working with data
- It provides several ways of previewing the data you are working with
- You can filter columns by passing a sequence of column names
- You can filter rows by applying conditions

If we filter the original data to films released before 1950, we can see we are left with 44 rows and 25 columns

In [None]:
first_half_century_df = excel_df[excel_df.Year < 1950]
first_half_century_df.shape
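
Two things in that cell may be new: `excel_df.Year` is a shorthand for `excel_df['Year']`, and `.shape` gives you `(rows, columns)`. A minimal sketch on a made-up table (`toy_df` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical data, purely for illustration
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Year': [1920, 1949, 1960],
})

# toy_df.Year works the same as toy_df['Year'] here because
# 'Year' is a valid Python name (no spaces or punctuation)
early_df = toy_df[toy_df.Year < 1950]

rows, columns = early_df.shape
print(rows, columns)  # 2 2
```

The dot shorthand won't work for a column like `'IMDB Score'` because of the space, so the square-bracket form is the safer habit.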

Let's actually do some basic arithmetic with the data...

## [Q8] What was the mean budget of movies produced before 1950?

- With the filtered `first_half_century_df`, work out the mean budget 
> (Hint: Use the Budget column itself)

In [None]:
# Please complete the rest of this cell

## [Q9] What was the minimum budget of movies produced before 1950?

In [None]:
# Please complete the rest of this cell

## [Q10] What was the maximum budget of movies produced before 1950?

In [None]:
# Please complete the rest of this cell