# Week 3 - Introduction to Pandas

## Installing and importing a Python package/library

- Installing Pandas Library for Data Processing 
- What is Pandas? 
    - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
    - Its main data format is the DataFrame, which is like a 2D array, but with column names and indexes for the rows.
    - It is essentially the Swiss Army knife for your data. Using pandas, you can better understand your data by cleaning, transforming, and analyzing it.
    
    - For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame (a table, basically), then let you do things like:
      - Calculate statistics and answer questions about the data, like:
        - What's the average, median, max, or min of each column?
        - Does column A correlate with column B?
        - What does the distribution of data in column C look like?
      - Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
      - Store the cleaned, transformed data back into a CSV, other file or database
      
- Before you start visualizing data, you need a good understanding of the nature of your dataset, and pandas is a well-suited package for that.




### Using pip to install Pandas 

In [12]:
# Only needs to be run once per environment; no need to run it again in every project.

import sys
!{sys.executable} -m pip install pandas



### Importing the pandas library

In [1]:
# Importing the pandas library as pd, so instead of writing pandas everywhere we can call pd, which is shorter
import pandas as pd

## Creating DataFrame from Scratch 

There are many ways to create a DataFrame from scratch, but to start with let's use a simple dictionary.

Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

Scratch DataFrame example adapted from https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

A dictionary is a collection that is ordered (as of Python 3.7), changeable, and does not allow duplicate keys.
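These dictionary properties can be checked directly in Python. The sketch below uses made-up fruit counts; the names `stock`, `keys_in_order` and `duplicate` are only for illustration.

```python
# Hypothetical example values, not taken from the fruit stand data below.
stock = {"apples": 3, "oranges": 0}

# Changeable: values can be updated and new keys added.
stock["apples"] = 5
stock["pears"] = 2

# Ordered: keys keep their insertion order (Python 3.7+).
keys_in_order = list(stock)

# No duplicate keys: when a key is repeated, only the last value is kept.
duplicate = {"apples": 1, "apples": 9}

print(keys_in_order)
print(duplicate)
```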

For more on dictionaries, arrays and lists, read here: https://python.plainenglish.io/arrays-vs-list-vs-dictionaries-47058fa19d4e

In [51]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

In [52]:
data

{'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}

And then pass it to the pandas DataFrame constructor:

In [53]:
purchases = pd.DataFrame(data)

purchases

Unnamed: 0,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


How did that work?
Each (key, value) item in data corresponds to a column in the resulting DataFrame.

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

In [54]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

Unnamed: 0,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


So now we could locate a customer's order by using their name:

In [55]:
purchases.loc['June']

apples     3
oranges    0
Name: June, dtype: int64
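As a small extension of the lookup above, `.loc` can also take a column label to fetch a single cell, and `.iloc` looks rows up by position instead of by label. This is a minimal sketch that rebuilds the same `purchases` table so it can run on its own.

```python
import pandas as pd

data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
purchases = pd.DataFrame(data, index=["June", "Robert", "Lily", "David"])

# Label-based lookup: one row, returned as a Series.
june = purchases.loc["June"]

# Label-based lookup of a single cell: row label first, then column label.
lily_oranges = purchases.loc["Lily", "oranges"]

# Position-based lookup: the first row, whatever its label is.
first_row = purchases.iloc[0]

print(lily_oranges)
print(first_row.equals(june))
```

Use `.loc` when you know the label (a customer name here) and `.iloc` when you only know the row's position.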

## Converting back to CSV or JSON

Pandas provides intuitive commands to save your cleaned data in a file of your choice. They mirror the commands used to read data (``read_csv``, ``read_json``).

When we save JSON and CSV files, all we have to pass to those functions is our desired filename with the appropriate file extension.

In [56]:
# purchases.to_csv('new_purchases.csv')

# purchases.to_json('new_purchases.json')
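To see the save and load commands working together, here is a hedged round-trip sketch. It writes to a temporary folder (an assumption made so the example leaves no files behind) and reads the files straight back.

```python
import os
import tempfile

import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

# Write into a temporary folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "new_purchases.csv")
    json_path = os.path.join(folder, "new_purchases.json")

    purchases.to_csv(csv_path)
    purchases.to_json(json_path)

    # index_col=0 tells pandas the first CSV column holds the row labels.
    from_csv = pd.read_csv(csv_path, index_col=0)
    from_json = pd.read_json(json_path)

print(from_csv)
```

Without `index_col=0`, the customer names would come back as an ordinary column with the header `Unnamed: 0`.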

## Loading and exploring data with Pandas



First we load in the data. This particular dataset is from [IMDB (Internet Movie Database)](https://www.kaggle.com/datasets/mustafacicek/imdb-top-250-lists-1996-2020?resource=download) and is in the **comma-separated values (.csv)** format.

``
df = pd.read_csv("data/imdbTop250.csv")
``

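When we display ``df``, Pandas shows us the **head** (first rows) and the **tail** (last rows), as well as the **shape**. So we know there are 16 columns (separate pieces of info about each film) and 6500 rows. Each row is one film's entry in one year's Top 250 list (the ``IMDByear`` column), so the same film can appear in several rows.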


In [2]:
#load
#pd.options.display.max_rows = 100
df = pd.read_csv("data/imdbTop250.csv")

In [3]:
df

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
0,1,1996,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1,2,1996,/title/tt0111161/,The Shawshank Redemption,1994,142,Drama,9.3,80.0,2529673,28.34,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler
2,3,1996,/title/tt0117951/,Trainspotting,1996,93,Drama,8.1,83.0,665213,16.50,Danny Boyle,Ewan McGregor,Ewen Bremner,Jonny Lee Miller,Kevin McKidd
3,4,1996,/title/tt0114814/,The Usual Suspects,1995,106,"Crime, Drama, Mystery",8.5,77.0,1045626,23.34,Bryan Singer,Kevin Spacey,Gabriel Byrne,Chazz Palminteri,Stephen Baldwin
4,5,1996,/title/tt0108598/,The Wrong Trousers,1993,30,"Animation, Short, Comedy",8.3,,53316,,Nick Park,Peter Sallis,Peter Hawkins,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6495,246,2021,/title/tt0058946/,The Battle of Algiers,1966,121,"Drama, War",8.1,96.0,57995,0.06,Gillo Pontecorvo,Brahim Hadjadj,Jean Martin,Yacef Saadi,Samia Kerbash
6496,247,2021,/title/tt0050783/,Nights of Cabiria,1957,110,Drama,8.1,,47318,0.75,Federico Fellini,Giulietta Masina,François Périer,Franca Marzi,Dorian Gray
6497,248,2021,/title/tt0093779/,The Princess Bride,1987,98,"Adventure, Family, Fantasy",8.1,77.0,416207,30.86,Rob Reiner,Cary Elwes,Mandy Patinkin,Robin Wright,Chris Sarandon
6498,249,2021,/title/tt7060344/,Raatchasan,2018,170,"Crime, Drama, Mystery",8.4,,37474,,Ram Kumar,Vishnu Vishal,Amala Paul,Radha Ravi,Sangili Murugan


In [59]:
# Standard format is rows x columns
# Hence our dataset has 6500 rows and 16 columns
df.shape

(6500, 16)
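Because `shape` is an ordinary tuple, it can be unpacked into two variables. The sketch below uses a small made-up table (the `films` name and its values are hypothetical) in place of the IMDB data.

```python
import pandas as pd

# Small hypothetical table standing in for the IMDB data.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Rating": [8.6, 9.3, 8.1],
})

n_rows, n_cols = films.shape  # shape is a (rows, columns) tuple

print(n_rows, n_cols)
print(len(films))            # len() gives the number of rows
print(list(films.columns))   # the column names
```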

In [60]:
df

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
0,1,1996,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1,2,1996,/title/tt0111161/,The Shawshank Redemption,1994,142,Drama,9.3,80.0,2529673,28.34,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler
2,3,1996,/title/tt0117951/,Trainspotting,1996,93,Drama,8.1,83.0,665213,16.50,Danny Boyle,Ewan McGregor,Ewen Bremner,Jonny Lee Miller,Kevin McKidd
3,4,1996,/title/tt0114814/,The Usual Suspects,1995,106,"Crime, Drama, Mystery",8.5,77.0,1045626,23.34,Bryan Singer,Kevin Spacey,Gabriel Byrne,Chazz Palminteri,Stephen Baldwin
4,5,1996,/title/tt0108598/,The Wrong Trousers,1993,30,"Animation, Short, Comedy",8.3,,53316,,Nick Park,Peter Sallis,Peter Hawkins,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6495,246,2021,/title/tt0058946/,The Battle of Algiers,1966,121,"Drama, War",8.1,96.0,57995,0.06,Gillo Pontecorvo,Brahim Hadjadj,Jean Martin,Yacef Saadi,Samia Kerbash
6496,247,2021,/title/tt0050783/,Nights of Cabiria,1957,110,Drama,8.1,,47318,0.75,Federico Fellini,Giulietta Masina,François Périer,Franca Marzi,Dorian Gray
6497,248,2021,/title/tt0093779/,The Princess Bride,1987,98,"Adventure, Family, Fantasy",8.1,77.0,416207,30.86,Rob Reiner,Cary Elwes,Mandy Patinkin,Robin Wright,Chris Sarandon
6498,249,2021,/title/tt7060344/,Raatchasan,2018,170,"Crime, Drama, Mystery",8.4,,37474,,Ram Kumar,Vishnu Vishal,Amala Paul,Radha Ravi,Sangili Murugan


In [61]:
# Let's say we just want to peek at the first five rows of our dataset; we can use .head()
# Since our DataFrame is in the variable df, we ask it to show us the first five rows using df.head(5)

df.head(5)

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
0,1,1996,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1,2,1996,/title/tt0111161/,The Shawshank Redemption,1994,142,Drama,9.3,80.0,2529673,28.34,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler
2,3,1996,/title/tt0117951/,Trainspotting,1996,93,Drama,8.1,83.0,665213,16.5,Danny Boyle,Ewan McGregor,Ewen Bremner,Jonny Lee Miller,Kevin McKidd
3,4,1996,/title/tt0114814/,The Usual Suspects,1995,106,"Crime, Drama, Mystery",8.5,77.0,1045626,23.34,Bryan Singer,Kevin Spacey,Gabriel Byrne,Chazz Palminteri,Stephen Baldwin
4,5,1996,/title/tt0108598/,The Wrong Trousers,1993,30,"Animation, Short, Comedy",8.3,,53316,,Nick Park,Peter Sallis,Peter Hawkins,,


In [62]:
# Similarly, let's say we just want to peek at the last five rows of our dataset; we can use .tail()
# Since our DataFrame is in the variable df, we ask it to show us the last five rows using df.tail(5)
df.tail(5)

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
6495,246,2021,/title/tt0058946/,The Battle of Algiers,1966,121,"Drama, War",8.1,96.0,57995,0.06,Gillo Pontecorvo,Brahim Hadjadj,Jean Martin,Yacef Saadi,Samia Kerbash
6496,247,2021,/title/tt0050783/,Nights of Cabiria,1957,110,Drama,8.1,,47318,0.75,Federico Fellini,Giulietta Masina,François Périer,Franca Marzi,Dorian Gray
6497,248,2021,/title/tt0093779/,The Princess Bride,1987,98,"Adventure, Family, Fantasy",8.1,77.0,416207,30.86,Rob Reiner,Cary Elwes,Mandy Patinkin,Robin Wright,Chris Sarandon
6498,249,2021,/title/tt7060344/,Raatchasan,2018,170,"Crime, Drama, Mystery",8.4,,37474,,Ram Kumar,Vishnu Vishal,Amala Paul,Radha Ravi,Sangili Murugan
6499,250,2021,/title/tt10280296/,Sardar Udham,2021,164,"Biography, Crime, Drama",8.7,,34889,,Shoojit Sircar,Vicky Kaushal,Banita Sandhu,Shaun Scott,Stephen Hogan


## Quick Data Overview

.info() is a good way to get a quick overview of your dataset. It provides the essential details such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

We can immediately see missing values for Score, Gross, Cast3 and Cast4, but more on that later.

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6500 entries, 0 to 6499
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Ranking   6500 non-null   int64  
 1   IMDByear  6500 non-null   int64  
 2   IMDBlink  6500 non-null   object 
 3   Title     6500 non-null   object 
 4   Date      6500 non-null   int64  
 5   RunTime   6500 non-null   int64  
 6   Genre     6500 non-null   object 
 7   Rating    6500 non-null   float64
 8   Score     5674 non-null   float64
 9   Votes     6500 non-null   int64  
 10  Gross     5691 non-null   float64
 11  Director  6500 non-null   object 
 12  Cast1     6500 non-null   object 
 13  Cast2     6500 non-null   object 
 14  Cast3     6492 non-null   object 
 15  Cast4     6492 non-null   object 
dtypes: float64(3), int64(5), object(8)
memory usage: 812.6+ KB
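The non-null counts from `.info()` can also be turned into a direct count of missing values with `isna()`. This is a minimal sketch on a made-up table (the `films` rows are hypothetical, only the column names echo the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a few gaps, mimicking the Score and Gross columns.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Score": [90.0, np.nan, 83.0, np.nan],
    "Gross": [322.74, 28.34, np.nan, 23.34],
})

# Number of missing (NaN) values in each column.
missing_per_column = films.isna().sum()

# Rows that have at least one missing value.
rows_with_gaps = films[films.isna().any(axis=1)]

print(missing_per_column)
print(len(rows_with_gaps))
```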




# Task 1 : Import a JSON file instead of a CSV using Pandas

You may use the JSON data provided by the exchange rate API.
- https://api.exchangerate-api.com/v4/latest/USD

Before importing the data, visit the above link to see the JSON in its raw format, and later compare it to what it looks like in a pandas DataFrame.

## Data types

### Checking data types

We've considered what data is there, but some of it is numbers, some is text, and some is dates. What does **Pandas** think each one is? This is important because it determines what we can do with the data in each column, and how it will be sorted and filtered.

We can use 

``df.dtypes``

to see each column's data type. We see that some of them are **objects**, which is the Pandas type for strings or mixed values. We can load the data in again and tell it which columns represent dates, and it will automatically parse them if they are in a consistent format. This means we can do things like compare them (e.g. which one is earlier?), which is really useful for sorting.

You will see the loading takes longer, as it has to parse the dates, and that afterwards the chosen columns are of the ``datetime64[ns]`` type.

In [64]:
df.dtypes

Ranking       int64
IMDByear      int64
IMDBlink     object
Title        object
Date          int64
RunTime       int64
Genre        object
Rating      float64
Score       float64
Votes         int64
Gross       float64
Director     object
Cast1        object
Cast2        object
Cast3        object
Cast4        object
dtype: object

### Loading and formatting date types

- As we can see above, the "Date" column is being read as ``int64`` (a plain number) even though it holds each film's release year, so it should be in the datetime format.
- Refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html for more information on other formats.

In [65]:
# Reloading the csv data and parsing the dates in the correct format.

df = pd.read_csv("data/imdbTop250.csv", parse_dates=["Date"])

In [66]:
df.dtypes

Ranking              int64
IMDByear             int64
IMDBlink            object
Title               object
Date        datetime64[ns]
RunTime              int64
Genre               object
Rating             float64
Score              float64
Votes                int64
Gross              float64
Director            object
Cast1               object
Cast2               object
Cast3               object
Cast4               object
dtype: object
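An alternative to reloading the file is to convert the column after loading with `pd.to_datetime`. The sketch below assumes, as in this dataset, that the column holds four-digit years; the three-row `films` table is a hypothetical sample, and the explicit `format="%Y"` is a choice made here, not something the original notebook uses.

```python
import pandas as pd

# Hypothetical sample of the Date column: release years stored as integers.
films = pd.DataFrame({
    "Title": ["Star Wars: Episode IV - A New Hope", "The Shawshank Redemption", "Trainspotting"],
    "Date": [1977, 1994, 1996],
})

# Convert to text first so each value is read as a four-digit year ("%Y")
# rather than being interpreted as a raw numeric timestamp.
films["Date"] = pd.to_datetime(films["Date"].astype(str), format="%Y")

print(films["Date"].dtype)
print(films["Date"].dt.year.tolist())

# Dates can now be compared directly, e.g. to find the earliest one.
print(films["Date"].min())
```

Stating the format explicitly avoids relying on pandas to guess how a bare year should be interpreted.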

## Summarising Data

**Pandas** can also give us a summary of our data (6500 is a lot to look at ourselves!). 

``df.describe(include = "all")``

In [67]:
df.describe()

Unnamed: 0,Ranking,IMDByear,RunTime,Rating,Score,Votes,Gross
count,6500.0,6500.0,6500.0,6500.0,5674.0,6500.0,5691.0
mean,125.5,2008.5,125.603385,8.171431,84.247268,484350.1,70.709617
std,72.173758,7.500577,31.138483,0.327116,10.053112,474385.6,103.075729
min,1.0,1996.0,16.0,5.5,61.0,9194.0,0.01
25%,63.0,2002.0,104.0,8.0,77.0,121463.0,5.32
50%,125.5,2008.5,121.0,8.1,85.0,305777.0,26.24
75%,188.0,2015.0,138.0,8.3,92.0,734121.0,92.435
max,250.0,2021.0,321.0,9.3,100.0,2529673.0,936.66


If we don't put ``include = "all"``, then we will only get summary statistics for **numeric** columns.

In [68]:
df.describe(include = "all")



Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
count,6500.0,6500.0,6500,6500,6500,6500.0,6500,6500.0,5674.0,6500.0,5691.0,6500,6500,6500,6492,6492
unique,,,744,742,101,,200,,,,,437,486,633,667,699
top,,,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1995-01-01 00:00:00,,Drama,,,,,Alfred Hitchcock,Robert De Niro,Harrison Ford,Carrie Fisher,Billy Dee Williams
freq,,,26,26,225,,535,,,,,222,151,78,80,52
first,,,,,1920-01-01 00:00:00,,,,,,,,,,,
last,,,,,2021-01-01 00:00:00,,,,,,,,,,,
mean,125.5,2008.5,,,,125.603385,,8.171431,84.247268,484350.1,70.709617,,,,,
std,72.173758,7.500577,,,,31.138483,,0.327116,10.053112,474385.6,103.075729,,,,,
min,1.0,1996.0,,,,16.0,,5.5,61.0,9194.0,0.01,,,,,
25%,63.0,2002.0,,,,104.0,,8.0,77.0,121463.0,5.32,,,,,


When we run the code, we get a table of stats describing our data. We can see that a lot of the number-based stats don't return values for our **nominal** columns, which is fine. But things such as the most common value and the number of unique entries are still interesting.

For example, 200 distinct Genre values (genre combinations) are present and Drama is the most common one. We can also see the earliest film in our dataset is from **1920-01-01 00:00:00**, and this works because we parsed the column as a date, so Pandas is able to order the values.
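The `unique`, `top` and `freq` rows of the summary can also be computed one at a time. This is a small sketch on made-up Genre values (the `genre` Series is hypothetical, not the real column).

```python
import pandas as pd

# Hypothetical Genre values, not the real dataset.
genre = pd.Series(["Drama", "Crime, Drama", "Drama", "Comedy", "Drama"], name="Genre")

summary = genre.describe()            # count, unique, top, freq for text data
n_unique = genre.nunique()            # same number as summary["unique"]
top = genre.value_counts().idxmax()   # same value as summary["top"]

print(summary)
print(n_unique, top)
```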

## Selecting Columns 


We can select a column using its name 

``df["ColumnName"]``

Or we can select several columns by passing a list of names

``df[["ColumnName1","ColumnName2"]]``

Selecting one column returns a **Series**, and selecting a list of columns returns a smaller **DataFrame**. To get the underlying values as an array we can also use

``result.values``

In [69]:
df["Title"]

0       Star Wars: Episode IV - A New Hope
1                 The Shawshank Redemption
2                            Trainspotting
3                       The Usual Suspects
4                       The Wrong Trousers
                       ...                
6495                 The Battle of Algiers
6496                     Nights of Cabiria
6497                    The Princess Bride
6498                            Raatchasan
6499                          Sardar Udham
Name: Title, Length: 6500, dtype: object

In [70]:
df[["Title","Genre"]]

Unnamed: 0,Title,Genre
0,Star Wars: Episode IV - A New Hope,"Action, Adventure, Fantasy"
1,The Shawshank Redemption,Drama
2,Trainspotting,Drama
3,The Usual Suspects,"Crime, Drama, Mystery"
4,The Wrong Trousers,"Animation, Short, Comedy"
...,...,...
6495,The Battle of Algiers,"Drama, War"
6496,Nights of Cabiria,Drama
6497,The Princess Bride,"Adventure, Family, Fantasy"
6498,Raatchasan,"Crime, Drama, Mystery"


In [77]:
#select column and get it back as an array of values
df["Genre"].values

array(['Action, Adventure, Fantasy', 'Drama', 'Drama', ...,
       'Adventure, Family, Fantasy', 'Crime, Drama, Mystery',
       'Biography, Crime, Drama'], dtype=object)

For more information on the difference between lists and arrays, check out https://www.geeksforgeeks.org/difference-between-list-and-array-in-python/
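The type you get back depends on the kind of brackets used. A minimal sketch on a made-up two-column table (the `films` name and values are hypothetical) makes the difference visible.

```python
import pandas as pd

# Hypothetical two-column table.
films = pd.DataFrame({
    "Title": ["Film A", "Film B"],
    "Genre": ["Drama", "Comedy"],
})

one_column = films["Title"]            # single brackets: a Series
as_frame = films[["Title"]]            # double brackets: a DataFrame with one column
as_array = films["Genre"].to_numpy()   # NumPy array; .values gives the same data

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(type(as_array).__name__)
```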

## Counting Columns

We can get counts to see what the most prevalent combinations of values are.
- Here we count each **Title** and **Genre** pair, so the result shows how many times each film appears in the dataset. Films such as *12 Angry Men* (**Crime, Drama**) top the list with 26 appearances. This is a good way for us to get a feel for which combinations occur most often.

``df[["Title","Genre"]].value_counts()``


In [28]:
df[["Title","Genre"]].value_counts()

Title                                           Genre                       
Once Upon a Time in America                     Crime, Drama                    26
12 Angry Men                                    Crime, Drama                    26
Psycho                                          Horror, Mystery, Thriller       26
Star Wars: Episode V - The Empire Strikes Back  Action, Adventure, Fantasy      26
Star Wars: Episode VI - Return of the Jedi      Action, Adventure, Fantasy      26
                                                                                ..
Pump Up the Volume                              Comedy, Drama, Music             1
Priest                                          Drama, Romance                   1
Portrait of a Lady on Fire                      Drama, Romance                   1
Pleasantville                                   Comedy, Drama, Fantasy           1
Zootopia                                        Animation, Adventure, Comedy     1
Length: 74
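To count genres on their own rather than Title and Genre pairs, `value_counts()` can be applied to the Genre column alone. The sketch below, on made-up rows, also shows one possible way to count individual genres by splitting the comma-separated strings; the `", "` separator is an assumption based on how the Genre values are displayed.

```python
import pandas as pd

# Hypothetical rows; each Genre cell lists one or more genres.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Genre": ["Crime, Drama", "Drama", "Crime, Drama", "Comedy, Drama"],
})

# Count each full Genre string exactly as written.
combo_counts = films["Genre"].value_counts()

# Split "Crime, Drama" into ["Crime", "Drama"], then give each genre its own row.
single_counts = films["Genre"].str.split(", ").explode().value_counts()

print(combo_counts)
print(single_counts)
```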

## Filtering

As well as picking whole columns, we can also pick rows that fit certain conditions, using **filtering**. To do this we pick the rows where a column equals a certain value

``df[df["Director"]=="George Lucas"]``

In [35]:
df[df["Director"]=="George Lucas"]

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
0,1,1996,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
251,2,1997,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
502,3,1998,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
721,222,1998,/title/tt0069704/,American Graffiti,1973-01-01,110,"Comedy, Drama",7.4,97.0,87100,115.0,George Lucas,Richard Dreyfuss,Ron Howard,Paul Le Mat,Charles Martin Smith
756,7,1999,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1007,8,2000,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1257,8,2001,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1508,9,2002,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1759,10,2003,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
2009,10,2004,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness


In [78]:
df[df["Title"]=="Star Wars: Episode IV - A New Hope"]

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
0,1,1996,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
251,2,1997,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
502,3,1998,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
756,7,1999,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1007,8,2000,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1257,8,2001,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1508,9,2002,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1759,10,2003,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
2009,10,2004,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
2257,8,2005,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
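Filters can also be combined. The sketch below uses a small made-up table (the `films` rows are hypothetical) to show that a comparison produces one True/False value per row, and that several such conditions can be joined with `&` or `|`.

```python
import pandas as pd

# Hypothetical rows in the same layout as the film table.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Director": ["George Lucas", "Danny Boyle", "George Lucas", "Nick Park"],
    "Rating": [8.6, 8.1, 7.4, 8.3],
})

# The comparison produces one True/False value per row (a "mask").
is_lucas = films["Director"] == "George Lucas"

# Combine masks with & (and) or | (or); each condition needs its own parentheses.
lucas_high = films[is_lucas & (films["Rating"] > 8.0)]

# isin() keeps rows whose value matches any item in a list.
two_directors = films[films["Director"].isin(["Danny Boyle", "Nick Park"])]

print(lucas_high)
print(two_directors)
```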


## Sorting

Strangely, this dataset isn't actually sorted by date, but we can do that using ``sort_values``. We tell **Pandas** which column we want to sort by; numbers and **ordinal** values (such as dates) give the most meaningful order, while text columns are sorted alphabetically. We also say which direction we want to sort the results in, and this is useful if we want to get the top or bottom slice, e.g. ``[:20]`` for the first 20 rows.

**Most recent 20**

``df.sort_values(by='Date', ascending=False)[:20]``

**Earliest 20**

``df.sort_values(by='Date', ascending=True)[:20]``

In [39]:
df.head(5)

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
0,1,1996,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977-01-01,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1,2,1996,/title/tt0111161/,The Shawshank Redemption,1994-01-01,142,Drama,9.3,80.0,2529673,28.34,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler
2,3,1996,/title/tt0117951/,Trainspotting,1996-01-01,93,Drama,8.1,83.0,665213,16.5,Danny Boyle,Ewan McGregor,Ewen Bremner,Jonny Lee Miller,Kevin McKidd
3,4,1996,/title/tt0114814/,The Usual Suspects,1995-01-01,106,"Crime, Drama, Mystery",8.5,77.0,1045626,23.34,Bryan Singer,Kevin Spacey,Gabriel Byrne,Chazz Palminteri,Stephen Baldwin
4,5,1996,/title/tt0108598/,The Wrong Trousers,1993-01-01,30,"Animation, Short, Comedy",8.3,,53316,,Nick Park,Peter Sallis,Peter Hawkins,,


In [38]:
df.sort_values(by='Date', ascending=False)[:20]

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
6499,250,2021,/title/tt10280296/,Sardar Udham,2021-01-01,164,"Biography, Crime, Drama",8.7,,34889,,Shoojit Sircar,Vicky Kaushal,Banita Sandhu,Shaun Scott,Stephen Hogan
6265,16,2021,/title/tt10872600/,Spider-Man: No Way Home,2021-01-01,148,"Action, Adventure, Fantasy",8.7,71.0,401937,,Jon Watts,Tom Holland,Zendaya,Benedict Cumberbatch,Jacob Batalon
6432,183,2021,/title/tt1160419/,Dune,2021-01-01,155,"Action, Adventure, Drama",8.1,74.0,452757,,Denis Villeneuve,Timothée Chalamet,Rebecca Ferguson,Zendaya,Oscar Isaac
6389,140,2021,/title/tt15097216/,Jai Bhim,2021-01-01,164,"Crime, Drama",9.3,,169008,,T.J. Gnanavel,Suriya,Lijo Mol Jose,Manikandan,Rajisha Vijayan
6358,109,2021,/title/tt10272386/,The Father,2020-01-01,97,"Drama, Mystery",8.3,88.0,118928,,Florian Zeller,Anthony Hopkins,Olivia Colman,Mark Gatiss,Olivia Williams
6159,160,2020,/title/tt2948372/,Soul,2020-01-01,100,"Animation, Adventure, Comedy",8.1,83.0,295078,,"Pete Docter, Kemp Powers",Jamie Foxx,Tina Fey,Graham Norton,Rachel House
6335,86,2021,/title/tt8503618/,Hamilton,2020-01-01,160,"Biography, Drama, History",8.4,90.0,81100,,Thomas Kail,Lin-Manuel Miranda,Phillipa Soo,Leslie Odom Jr.,Renée Elise Goldsberry
6050,51,2020,/title/tt8503618/,Hamilton,2020-01-01,160,"Biography, Drama, History",8.4,90.0,81100,,Thomas Kail,Lin-Manuel Miranda,Phillipa Soo,Leslie Odom Jr.,Renée Elise Goldsberry
6367,118,2021,/title/tt8579674/,1917,2019-01-01,119,"Action, Drama, War",8.3,78.0,524911,159.23,Sam Mendes,Dean-Charles Chapman,George MacKay,Daniel Mays,Colin Firth
6455,206,2021,/title/tt1950186/,Ford v Ferrari,2019-01-01,152,"Action, Biography, Drama",8.1,81.0,351575,117.62,James Mangold,Matt Damon,Christian Bale,Jon Bernthal,Caitriona Balfe


In [80]:
df.sort_values(by='Date', ascending=True)[:20]

Unnamed: 0,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
2664,165,2006,/title/tt0010323/,The Cabinet of Dr. Caligari,1920-01-01,67,"Horror, Mystery, Thriller",8.1,,61561,,Robert Wiene,Werner Krauss,Conrad Veidt,Friedrich Feher,Lil Dagover
2436,187,2005,/title/tt0010323/,The Cabinet of Dr. Caligari,1920-01-01,67,"Horror, Mystery, Thriller",8.1,,61561,,Robert Wiene,Werner Krauss,Conrad Veidt,Friedrich Feher,Lil Dagover
2925,176,2007,/title/tt0010323/,The Cabinet of Dr. Caligari,1920-01-01,67,"Horror, Mystery, Thriller",8.1,,61561,,Robert Wiene,Werner Krauss,Conrad Veidt,Friedrich Feher,Lil Dagover
4360,111,2013,/title/tt0012349/,The Kid,1921-01-01,68,"Comedy, Drama, Family",8.3,,122477,5.45,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller
4845,96,2015,/title/tt0012349/,The Kid,1921-01-01,68,"Comedy, Drama, Family",8.3,,122477,5.45,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller
6100,101,2020,/title/tt0012349/,The Kid,1921-01-01,68,"Comedy, Drama, Family",8.3,,122477,5.45,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller
4127,128,2012,/title/tt0012349/,The Kid,1921-01-01,68,"Comedy, Drama, Family",8.3,,122477,5.45,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller
5598,99,2018,/title/tt0012349/,The Kid,1921-01-01,68,"Comedy, Drama, Family",8.3,,122477,5.45,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller
3669,170,2010,/title/tt0012349/,The Kid,1921-01-01,68,"Comedy, Drama, Family",8.3,,122477,5.45,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller
6353,104,2021,/title/tt0012349/,The Kid,1921-01-01,68,"Comedy, Drama, Family",8.3,,122477,5.45,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller
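`sort_values` also accepts a list of columns, which helps when many rows share the same date, as they do above. This is a minimal sketch on made-up rows (the `films` table is hypothetical) that breaks ties by Rating.

```python
import pandas as pd

# Hypothetical rows; Date holds the release year.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Date": [1994, 1977, 1994, 2021],
    "Rating": [9.3, 8.6, 8.1, 8.7],
})

# Newest first; films from the same year are ordered by Rating, highest first.
newest = films.sort_values(by=["Date", "Rating"], ascending=[False, False])

# head(2) is equivalent to the [:2] slice.
top_two = newest.head(2)

# Sorting returns a new DataFrame; films itself keeps its original order.
print(top_two)
print(films["Title"].tolist())
```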
