# Pandas is a library that helps you work with data as tables


<table><tr>
<td> <img src="data/images/pandas.png" alt="pandas library" style="width: 400px;"/> </td>
<td> <img src="data/images/pandas1.png" alt="pandas library alternative" style="width: 400px;"/></td>
</tr></table>



## Quick recap on 'Library'
- A library lets you reuse code that someone else has kindly written
- The library ecosystem is one of the reasons Python is so popular today
- Some libraries are included with Python by default, others can be installed via `pip`, and you can also make your own!



In [None]:
## [Q1] What will the cell below do?

import random
random.randint(1,100)

# Introduction to Pandas

### `Pandas` is named after __Panel Data__, which is a concept about __multidimensional matrices__.

<img src="data/images/wes.jpg" style="width: 200px;"/>

Pandas was originally written by [Wes McKinney](https://en.wikipedia.org/wiki/Wes_McKinney), who was working at a hedge fund and needed a better tool for the time-series data he dealt with day to day. It's a great story for a few reasons:

1. It's a great example of an open source project going far beyond what its creator anticipated.
2. He's admitted that when he started the project he wasn't very good at Python. 
3. This amazing tool used by thousands of people was made by 'just some guy'.


## How to use Pandas


ALWAYS REMEMBER: **NO ONE WAS BORN KNOWING ALL OF THIS**

Everyone was a beginner once (even Wes McKinney) and help is available. The answer to pretty much every Python / Pandas question WILL be online. Getting good at "Googling" is arguably the best skill you can have as a programmer.

- The [official documentation](https://pandas.pydata.org/pandas-docs/stable/) is great



### To start using Pandas you need to import it

- The code below is very much convention. You don't need the `as pd` part, but most people use it to save typing.
- Programmers are lazy; remember, that's why we care about 'efficiency': less actual work.

In [None]:
import pandas # you can do this and it's totally fine 

In [None]:
import pandas as pd # most people do this to avoid typing

In [None]:
pd. #Press tab to see autocomplete 

# The world runs on Excel

- It's on everyone's computer
- It's not going anywhere
- It's in use everywhere, from small businesses to nuclear power plants

### Pandas can make our lives easier because it has lots of features for working with Excel spreadsheets and many other formats!


# Opening a Spreadsheet in Pandas is simple...

## [Q2] What do you think `pd.DataFrame.head()` does?
 Let's try it together

In [None]:
import pandas as pd
excel_df = pd.read_excel('data/movies.xls', sheet_name='1900s')
excel_df.head()

## [Q3] How do you think we could preview the last 10 rows?

In [None]:
excel_df.tail(10)
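As a minimal sketch of how `head()` and `tail()` behave, the cell below uses a small made-up table (the name `toy_df` and its contents are hypothetical, not the movies spreadsheet). With no argument, both methods return 5 rows; passing a number changes how many rows you get.

```python
import pandas as pd

# Hypothetical toy data standing in for the movies spreadsheet
toy_df = pd.DataFrame({
    'Title': [f'Film {i}' for i in range(1, 21)],
    'Year': list(range(1900, 1920)),
})

first_rows = toy_df.head()    # no argument: the first 5 rows
last_rows = toy_df.tail(10)   # the last 10 rows

print(first_rows)
print(last_rows)
```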

# How to filter columns in Pandas

- There are a couple of ways to do this
- The simplest way is to pass a list of columns to the DataFrame within square brackets `[]`
- This is very similar to what you've seen with `List` objects e.g. `[1,2,3,4]`


`data_frame[[col1, col2, col3 ...]]`

This looks a little funny but the two sets of square brackets are doing different things.
1. The first set (outermost) are saying: 'Please provide me with a sequence of column names'
2. The second (innermost) set are simply an explicit set of columns to select

Run the cell below to see it in action

In [None]:
excel_df[['Title', 'Year']].sample(n=5)

In [None]:
# It might help to see that the code below is functionally identical...

columns_to_select = ['Title', 'Year']
excel_df[columns_to_select].sample(n=5)

## [Q4] Filter the table to just the following columns and show the first 5 rows
```python
['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Budget', 'IMDB Score']
```
- Note: You must be explicit, misspelling will give you a `KeyError`
- Save the results in the variable `test_df`
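To illustrate the note about `KeyError`, here is a small sketch on made-up data (`toy_df` and the misspelled column `'Yaer'` are hypothetical). It selects two columns correctly, then shows what happens when a name is misspelled.

```python
import pandas as pd

# Hypothetical toy data, not the real movies spreadsheet
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B'],
    'Year': [1920, 1927],
    'Budget': [100000, 250000],
})

# Correct spelling: we get a smaller table with just these columns
selected = toy_df[['Title', 'Year']]
print(selected.columns.tolist())

# Misspelled name: Pandas raises a KeyError
caught = False
try:
    toy_df[['Title', 'Yaer']]  # 'Yaer' is a deliberate misspelling
except KeyError as error:
    caught = True
    print('KeyError:', error)
```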

In [None]:
test_df = excel_df[['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Budget', 'IMDB Score']]
test_df.head()  # show the first 5 rows

# How to filter rows in Pandas

- If you run the cell below you will see that selecting a single column, rather than a list of columns, looks different...
- This is because a single column is actually a `Series`, which you can think of as a vertical list
- When you break it down, a DataFrame is just a group of `Series` columns behind the scenes
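The difference between a `Series` and a `DataFrame` can be checked directly. This is a small sketch on made-up data (`toy_df` is hypothetical): single square brackets with one column name give a `Series`, while double square brackets give a one-column `DataFrame`.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Year': [1920, 1925, 1930],
})

one_column = toy_df['Year']     # a single column name gives a Series
as_table = toy_df[['Year']]     # a list of names gives a DataFrame

print(type(one_column).__name__)
print(type(as_table).__name__)
```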

In [None]:
test_df['Year']

- When you do a comparison against a `Series`, Pandas will compare every item and return `True` or `False` for every row

In [None]:
test_df['Year'] == 1920

- When you put this within the square brackets of a DataFrame, Pandas will filter to the rows which were `True`
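Here is a minimal sketch of both steps on made-up data (`toy_df` and `mask` are hypothetical names): first the comparison produces a `True`/`False` value per row, then that result is used inside the square brackets to keep only the matching rows.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Year': [1920, 1925, 1920],
})

# Step 1: compare every item in the Series
mask = toy_df['Year'] == 1920
print(mask.tolist())

# Step 2: keep only the rows where the comparison was True
matching = toy_df[mask]
print(matching)
```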

## [Q5] Filter the `test_df` to rows where the IMDB Score is greater than 5

In [None]:
test_df[test_df['IMDB Score'] > 5]

## [Q6] Filter the `test_df` to films from the USA

In [None]:

test_df[test_df['Country'] == 'USA'].head()


In [None]:
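# Another example: combining several conditions

# reset test_df first
test_df = excel_df[['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Budget', 'IMDB Score']]

condition_1 = (test_df['Country'] == 'USA')
condition_2 = (test_df['Country'] == 'UK')
condition_3 = (test_df['Country'] == 'Germany')

# Keep rows that match any of the conditions. The result goes in a new
# variable so that test_df itself is not overwritten for the later cells.
filtered_df = test_df[condition_1 | condition_2 | condition_3]

print(filtered_df)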

## You can use the following operators to combine conditions:
- `&` to AND conditions together, e.g. The can was GREEN and CLOSED 
- `|` to OR conditions together, e.g. The person was from England or France (either is fine)
- `~` to NEGATE any condition, e.g. The person was not from London
- It's also useful to wrap each condition in brackets `( )` to make sure things are evaluated in the right order

For example you could filter a `DataFrame` like so:

`df[(df['age'] >= 18) & (df['height'] < 200)]`
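The sketch below tries all three operators on a small made-up table (`people_df` and its values are hypothetical). Note how every individual condition is wrapped in its own brackets.

```python
import pandas as pd

# Hypothetical toy data
people_df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cat', 'Dan'],
    'age': [17, 25, 40, 30],
    'height': [170, 180, 205, 160],
})

# AND: both conditions must be True
adults_under_200 = people_df[(people_df['age'] >= 18) & (people_df['height'] < 200)]

# OR: either condition can be True
young_or_tall = people_df[(people_df['age'] < 18) | (people_df['height'] >= 200)]

# NOT: flip the condition
not_adults = people_df[~(people_df['age'] >= 18)]

print(adults_under_200['name'].tolist())
print(young_or_tall['name'].tolist())
print(not_adults['name'].tolist())
```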


In [None]:
# This one is a little harder...
## [Q7] Filter the test_df to films which were made in the 1920s


test_df[(test_df['Year'] >= 1920) & (test_df['Year'] <= 1929)]



## Getting a sense of the data you have to work with
If you ever want a quick numeric summary of the data in your `DataFrame`, the `describe()` method is built for you!

In [None]:
test_df.describe()
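As a rough sketch of what to expect, the cell below calls `describe()` on made-up data (`toy_df` is hypothetical). By default the summary only covers numeric columns, and individual values can be looked up with `.loc[row, column]`.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Year': [1920, 1925, 1930],
    'IMDB Score': [6.0, 7.0, 8.0],
})

summary = toy_df.describe()
print(summary)

# Pick one statistic out of the summary table
mean_score = summary.loc['mean', 'IMDB Score']
print(mean_score)
```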

# Let's take a second to recap

What you've learnt so far:
- Pandas is a library that comes with lots of built-in tools for working with data
- It provides several ways of previewing the data you are working with
- You can filter columns by passing a sequence of column names
- You can filter rows by applying conditions

In [None]:
# If we filter the original data to the following set we can see we are left with 46 rows and 25 columns

first_half_century_df = excel_df[excel_df.Year < 1950]
first_half_century_df.shape

## [Q8] What was the mean budget of movies produced before 1950

- With the filtered `first_half_century_df`, work out the mean budget 
> (Hint: Use the Budget column itself)

In [None]:
budget = first_half_century_df['Budget']
format(budget.mean(), ',')

## [Q9] What was the minimum budget of movies produced before 1950

In [None]:
budget.min()

## [Q10] What was the maximum budget of movies produced before 1950

In [None]:
budget.max()
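The same three statistics can be tried on a small made-up `Series` (the `budgets` values are hypothetical). As one possible shortcut, `agg()` with a list of names computes several statistics in a single call.

```python
import pandas as pd

# Hypothetical budgets, not taken from the movies data
budgets = pd.Series([100000, 250000, 400000], name='Budget')

# One statistic at a time
print(format(budgets.mean(), ','))  # comma separators make big numbers readable
print(budgets.min())
print(budgets.max())

# Several statistics in one call
stats = budgets.agg(['mean', 'min', 'max'])
print(stats)
```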

## Let's have a look at how we can write csv files with Pandas!


<img src="data/images/apple-stock.jpg" style="width: 200px;"/>

 - We are going to use a finance API to fetch stock prices for Apple (from the internet)
 - Then we are going to write fetched data to a CSV
 - Finally, we are going to learn how to read a CSV file.

## Datareader 


<img src="data/images/datareader.jpg" style="width: 500px;"/>


- `pandas-datareader` is a companion package (installed separately from Pandas) that lets you create a DataFrame from various built-in internet data sources
- We can fetch historical stock prices, quotes, etc. from various world exchanges without going to their URLs directly.

Here is a list of the built-in datareaders: https://pandas-datareader.readthedocs.io/en/latest/readers/index.html


In [2]:
import yfinance as yf

aapl_stock = yf.Ticker("AAPL")
# print(aapl_stock.info)
aapl = aapl_stock.history(period="1mo")
aapl

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2023-03-22 00:00:00-04:00,159.300003,162.139999,157.809998,157.830002,75701800,0.0,0.0
2023-03-23 00:00:00-04:00,158.830002,161.550003,157.679993,158.929993,67622100,0.0,0.0
2023-03-24 00:00:00-04:00,158.860001,160.339996,157.850006,160.25,59196500,0.0,0.0
2023-03-27 00:00:00-04:00,159.940002,160.770004,157.869995,158.279999,52390300,0.0,0.0
2023-03-28 00:00:00-04:00,157.970001,158.490005,155.979996,157.649994,45992200,0.0,0.0
2023-03-29 00:00:00-04:00,159.369995,161.050003,159.350006,160.770004,51305700,0.0,0.0
2023-03-30 00:00:00-04:00,161.529999,162.470001,161.270004,162.360001,49501700,0.0,0.0
2023-03-31 00:00:00-04:00,162.440002,165.0,161.910004,164.899994,68694700,0.0,0.0
2023-04-03 00:00:00-04:00,164.270004,166.289993,164.220001,166.169998,56976200,0.0,0.0
2023-04-04 00:00:00-04:00,166.600006,166.839996,165.110001,165.630005,46278300,0.0,0.0


In [None]:
import pandas_datareader as pdr


# Try this example, BUT Yahoo Finance has known bugs, so it may not work (their end, their fault),
# so we may need to try option 2

import datetime
reader = pdr.yahoo.daily.YahooDailyReader('AAPL',
                          start=datetime.datetime(2020, 10, 1),
                          end=datetime.datetime(2021, 7, 1))
aapl = reader.read()  # read() does the actual download and returns a DataFrame
print(aapl)




In [None]:
# Alternatively, we can use the Tiingo datareader.
# Please create a free account to get a free API token here: https://api.tiingo.com/ (click sign up for free account)

import pandas_datareader as pdr


api_key = '<YOUR_TIINGO_API_TOKEN>'  # replace with your OWN API token (aka key) and keep it private

start="2020-1-1"
end="2021-7-1"

df = pdr.tiingo.TiingoDailyReader('AAPL', start=start, end=end, api_key=api_key)


aapl = df.read()

aapl

In [None]:
# Great! Now let's write the data we got from this FINANCE API to a new CSV file

import pandas as pd

aapl.to_csv('data/aapl_historical.csv',  date_format='%Y-%m-%d') # your file path and file name

# let's read the file that we have just written
# index_col must match the name of the date column in the CSV ('Date' for the yfinance data above)
saved_df = pd.read_csv('data/aapl_historical.csv', header=0, index_col='Date', parse_dates=True)

saved_df
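If the finance APIs are unavailable, the write-then-read round trip can still be practised offline. This is a minimal sketch under a few assumptions: the prices are made up, and an in-memory `io.StringIO` buffer stands in for a file on disk so that nothing is written to your `data` folder.

```python
import io

import pandas as pd

# Hypothetical prices with a date index named 'Date'
prices = pd.DataFrame(
    {'Close': [157.5, 159.0, 160.25]},
    index=pd.to_datetime(['2023-03-22', '2023-03-23', '2023-03-24']),
)
prices.index.name = 'Date'

# Write the table as CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
prices.to_csv(buffer, date_format='%Y-%m-%d')

# Rewind the buffer and read the CSV back, parsing the index as dates
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='Date', parse_dates=True)

print(restored)
```

Swapping `buffer` for a file path such as `'data/aapl_historical.csv'` gives the same behaviour with a real file.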