# DataFrames III: Data Extraction

In [1]:
import pandas as pd

## This Module's Dataset
- This module's dataset is a collection of all James Bond movies.

In [3]:
bond= pd.read_csv('jamesbond.csv')
bond.head()

,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


## The set_index and reset_index Methods
- The index serves as the collection of primary identifiers/labels/entrypoints for the rows.
- The fastest way to extract a row is by position or by label from a sorted index.
- Pandas uses index labels/values when merging different objects together.
- The `set_index` method sets an existing column as the index of the **DataFrame**.
- The `reset_index` method sets the standard ascending numeric index as the index of the **DataFrame**.

In [None]:
# In general, it is easier to find a value in a collection sorted by its index than in an unsorted one.

# That is why, once people settle on a suitable index column for a DataFrame, they usually sort the data by it.

In [5]:
bond= pd.read_csv('jamesbond.csv', index_col='Film') # one way to do it
bond.head()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [13]:
# Another way to do it, using the set_index method
bond= pd.read_csv('jamesbond.csv')

bond.set_index('Film') # this returns a new DataFrame; bond itself is unchanged

# To make the change permanent, there are two options:
#bond.set_index('Film', inplace= True)
bond= bond.set_index('Film')

In [17]:
bond.reset_index()
bond.reset_index(drop= True).head()

,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,1967,David Niven,Ken Hughes,315.0,85.0,


In [20]:
#bond.set_index('Year')
# Applied directly, the line above would discard the current index (Film), because set_index replaces the existing index

bond= bond.reset_index().set_index('Year')
bond.head()

Year,Film,Actor,Director,Box Office,Budget,Bond Actor Salary
1962,Dr. No,Sean Connery,Terence Young,448.8,7.0,0.6
1963,From Russia with Love,Sean Connery,Terence Young,543.8,12.6,1.6
1964,Goldfinger,Sean Connery,Guy Hamilton,820.4,18.6,3.2
1965,Thunderball,Sean Connery,Terence Young,848.1,41.9,4.7
1967,Casino Royale,David Niven,Ken Hughes,315.0,85.0,


## Retrieve Rows by Index Position with iloc Accessor
- The `iloc` accessor retrieves one or more rows by index position.
- Provide a pair of square brackets after the accessor.
- `iloc` accepts single values, lists, and slices.

- Pandas always tracks each row's integer position, even when a custom index replaces the default numeric labels in the display.

In [26]:
bond= pd.read_csv('jamesbond.csv')
bond.head()

,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [33]:
bond.iloc[5]
bond.iloc[[15, 20]]
bond.iloc[4:8]
bond.iloc[:6]
bond.iloc[20:]

,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,30.0
26,No Time to Die,2021,Daniel Craig,Cary Joji Fukunaga,774.2,301.0,25.0


In [34]:
bond.dtypes

Film                  object
Year                   int64
Actor                 object
Director              object
Box Office           float64
Budget               float64
Bond Actor Salary    float64
dtype: object

## Retrieve Rows by Index Label with loc Accessor
- The `loc` accessor retrieves one or more rows by index label.
- Provide a pair of square brackets after the accessor.

- Use the `loc` accessor whenever the **DataFrame** has custom index labels and we want to select rows by those labels.

In [35]:
bond= pd.read_csv('jamesbond.csv', index_col= 'Film')
bond.head()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [48]:
bond.loc['Goldfinger']
bond.loc[['GoldenEye']]
bond.loc['Casino Royale'] # index labels should ideally be unique, but two films share this title, so loc returns both rows

#bond.loc['Sacred Bond']

bond.loc[['Octopussy', 'Moonraker']]
bond.loc[[ 'Moonraker', 'Octopussy']]
bond.loc[[ 'Moonraker', 'Octopussy', 'Casino Royale']]
bond.loc['Diamonds Are Forever': 'Moonraker'] # with loc, the final label of the slice is included in the result

bond.loc['GoldenEye':]
bond.loc[:"On Her Majesty's Secret Service"]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
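
The two `loc` behaviours noted in the comments above (duplicate labels and inclusive slices) can be checked on a small scale. The sketch below is only illustrative: it rebuilds a four-row subset by hand from rows shown earlier, under the assumed name `films`, so that it runs without `jamesbond.csv`.

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film.
# "Casino Royale" appears twice (1967 and 2006), so the index is not unique.
films = pd.DataFrame(
    {
        "Film": ["Thunderball", "Casino Royale", "You Only Live Twice", "Casino Royale"],
        "Year": [1965, 1967, 1967, 2006],
        "Actor": ["Sean Connery", "David Niven", "Sean Connery", "Daniel Craig"],
    }
).set_index("Film")

unique_hit = films.loc["Thunderball"]        # one matching row: a Series
duplicate_hit = films.loc["Casino Royale"]   # two matching rows: a DataFrame

# Label slices keep both endpoints, unlike position slices with iloc
inclusive = films.loc["Thunderball":"You Only Live Twice"]

print(films.index.is_unique)
print(type(unique_hit).__name__, type(duplicate_hit).__name__)
print(len(inclusive))
```

Checking `index.is_unique` before relying on `loc` is a cheap way to know whether a single label can return more than one row.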


## ChatGPT problem set

In [51]:
bond= pd.read_csv('jamesbond.csv')
bond.head()

,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [61]:
#1) Use .loc to filter rows where the Year is greater than or equal to 2000. Return only the Film and Year columns
#bond[bond['Year'] >= 2000][['Film', 'Year']]
bond.loc[bond['Year'] >= 2000][['Film', 'Year']]

,Film,Year
21,Die Another Day,2002
22,Casino Royale,2006
23,Quantum of Solace,2008
24,Skyfall,2012
25,Spectre,2015
26,No Time to Die,2021


In [63]:
#2) Use .loc to filter rows where the Actor is "Daniel Craig" and return the Film, Year, and Box Office columns.
bond.loc[bond['Actor'] == 'Daniel Craig'][['Film', 'Year', 'Box Office']]

,Film,Year,Box Office
22,Casino Royale,2006,581.5
23,Quantum of Solace,2008,514.2
24,Skyfall,2012,943.5
25,Spectre,2015,726.7
26,No Time to Die,2021,774.2


In [64]:
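#3) Use .loc to filter rows where the Box Office is greater than 500 million and the Year is after 2000. Return all columns for the filtered rows.
# The Box Office column is stored in millions, so the threshold is 500, not 500000
bond.loc[ (bond['Box Office'] > 500) & (bond['Year'] > 2000) ]

,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,30.0
26,No Time to Die,2021,Daniel Craig,Cary Joji Fukunaga,774.2,301.0,25.0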


In [65]:
#4) Use .loc to filter rows where the Actor is either "Pierce Brosnan" or "Sean Connery". Return only the Actor, Film, and Box Office columns
actors_of_interest= ['Pierce Brosnan', 'Sean Connery']
columns_to_show= ['Actor', 'Film', 'Box Office']

bond[ bond['Actor'].isin(actors_of_interest) ][columns_to_show]

,Actor,Film,Box Office
0,Sean Connery,Dr. No,448.8
1,Sean Connery,From Russia with Love,543.8
2,Sean Connery,Goldfinger,820.4
3,Sean Connery,Thunderball,848.1
5,Sean Connery,You Only Live Twice,514.2
7,Sean Connery,Diamonds Are Forever,442.5
13,Sean Connery,Never Say Never Again,380.0
18,Pierce Brosnan,GoldenEye,518.5
19,Pierce Brosnan,Tomorrow Never Dies,463.2
20,Pierce Brosnan,The World Is Not Enough,439.5


In [70]:
#5) Use .iloc to return the first 5 rows and the first 3 columns from the DataFrame.
bond.iloc[:5, :3] # slices are the idiomatic way to take leading rows and columns
#bond.iloc[ list(range(5)), list(range(3)) ] # equivalent, with explicit position lists
#bond.iloc[ [0,1,2,3,4], [0,1,2] ]

,Film,Year,Actor
0,Dr. No,1962,Sean Connery
1,From Russia with Love,1963,Sean Connery
2,Goldfinger,1964,Sean Connery
3,Thunderball,1965,Sean Connery
4,Casino Royale,1967,David Niven


In [72]:
#6) Use .loc to filter rows where the Box Office is between 200 and 500 million. Return only the Film and Box Office columns.
# Box Office is stored in millions, so the bounds are 200 and 500; bounds written in raw dollars (200000000, 500000000) match no rows
bond.loc[ bond['Box Office'].between(200, 500), ['Film', 'Box Office'] ]


In [77]:
bond.head()
bond['Budget']>100000 # every value is False because Budget is stored in millions, not raw dollars

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
Name: Budget, dtype: bool

In [83]:
#7) Use .loc to filter rows where the Director is "Martin Campbell", the Budget is less than 100 million, and the Box Office is greater than 300 million. Return the Film, Director, Box Office, and Budget columns.

# Budget and Box Office are stored in millions, so the thresholds are 100 and 300
have_Martin_as_director= bond['Director'] == 'Martin Campbell'
budget_is_less_than_100M= bond['Budget'] < 100
box_office_is_greater_than_300M= bond['Box Office'] > 300

bond.loc[
    have_Martin_as_director
    & budget_is_less_than_100M
    & box_office_is_greater_than_300M,
    ['Film', 'Director', 'Box Office', 'Budget']
]

,Film,Director,Box Office,Budget
18,GoldenEye,Martin Campbell,518.5,76.9


In [84]:
#8)  Use .loc to filter rows where the Year is between 1980 and 2000, and the Actor is not "Roger Moore". Return the Film, Year, and Actor columns.
bond[ bond['Year'].between(1980, 2000) & ( bond['Actor'] != 'Roger Moore' ) ][['Film', 'Year', 'Actor']]

,Film,Year,Actor
13,Never Say Never Again,1983,Sean Connery
16,The Living Daylights,1987,Timothy Dalton
17,Licence to Kill,1989,Timothy Dalton
18,GoldenEye,1995,Pierce Brosnan
19,Tomorrow Never Dies,1997,Pierce Brosnan
20,The World Is Not Enough,1999,Pierce Brosnan


In [88]:
#9) Use .loc to filter the rows where the Year is after 1990, and either the Box Office is less than 200 million or the Budget is greater than 100 million. Return all columns for the filtered rows.
# Box Office and Budget are stored in millions, so the thresholds are 200 and 100
bond.loc[ ( bond['Year'] > 1990 ) & ( (bond['Box Office'] < 200) | (bond['Budget'] > 100) ) ]

# The same filter, with named conditions for readability
film_was_produced_after_90s= bond['Year'] > 1990
box_office_is_less_than_200M= bond['Box Office'] < 200
budget_is_greater_than_100M= bond['Budget'] > 100
bond[ film_was_produced_after_90s & (box_office_is_less_than_200M | budget_is_greater_than_100M)]

,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,30.0
26,No Time to Die,2021,Daniel Craig,Cary Joji Fukunaga,774.2,301.0,25.0


In [94]:
round(2.387473846,2)

2.39

In [98]:
#10) Use .iloc to filter the rows where the Box Office is greater than the average Box Office value. Return only the Film, Year, and Box Office columns for these rows.

box_office_mean= bond['Box Office'].mean()
is_above_average= bond['Box Office'] > box_office_mean

# iloc does not accept a Boolean Series, so convert the mask to a NumPy array first.
# Passing the mask's .index instead selects every row, because the index holds all labels whether the value is True or False.
bond.iloc[ is_above_average.to_numpy() ][['Film', 'Year', 'Box Office']]


## Second Arguments to loc and iloc Accessors
- The second value inside the square brackets targets the columns.
- The `iloc` accessor requires numeric positions for rows and columns.
- The `loc` accessor requires labels for rows and columns.
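
A minimal sketch of the row and column arguments, assuming a small `bond` frame rebuilt by hand from four rows of the dataset so that the block runs without `jamesbond.csv`:

```python
import pandas as pd

# Hand-built four-row subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "GoldenEye", "Skyfall"],
        "Year": [1962, 1964, 1995, 2012],
        "Actor": ["Sean Connery", "Sean Connery", "Pierce Brosnan", "Daniel Craig"],
        "Box Office": [448.8, 820.4, 518.5, 943.5],
        "Budget": [7.0, 18.6, 76.9, 170.2],
    }
).set_index("Film")

# loc: row label first, column label second
by_label = bond.loc["Goldfinger", "Actor"]
subset = bond.loc[["Dr. No", "Skyfall"], ["Year", "Budget"]]

# iloc: row position first, column position second
# (the columns are Year, Actor, Box Office, Budget, so position 1 is Actor)
by_position = bond.iloc[1, 1]
block = bond.iloc[:2, :2]

print(by_label, by_position)
print(subset)
print(block)
```

Both accessors take lists and slices in either slot, so a rectangular block of the frame can be selected in one call.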

## Overwrite Value in a DataFrame
- Use the `iloc` or `loc` accessor on the **DataFrame** to target a value, then provide the equal sign and a new value.
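
As a sketch of a single-value overwrite, assuming a hand-built three-row `bond` frame; the replacement values are purely illustrative and not taken from the dataset:

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "Skyfall"],
        "Actor": ["Sean Connery", "Sean Connery", "Daniel Craig"],
        "Budget": [7.0, 18.6, 170.2],
    }
).set_index("Film")

# Target one cell by labels, then assign (illustrative new value)
bond.loc["Dr. No", "Actor"] = "Sir Sean Connery"

# Target one cell by positions: row 2 is Skyfall, column 1 is Budget
bond.iloc[2, 1] = 170.0

print(bond)
```

Doing the selection and the assignment in one accessor call on the **DataFrame** avoids writing to a temporary copy, which can happen with chained brackets such as `bond["Actor"]["Dr. No"] = ...`.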

## Overwrite Multiple Values in a DataFrame
- The `replace` method replaces all occurrences of a **Series** value with another value (think of it like "Find and Replace").
- To overwrite multiple values in a **DataFrame**, remember to use an accessor on the **DataFrame** itself.
- Accessors like `loc` and `iloc` can accept Boolean Series. Use them to target the values to overwrite.
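
The two approaches above might look like the following sketch. It assumes a hand-built four-row `bond` frame, and the replacement string is illustrative only:

```python
import pandas as pd

# Hand-built subset of the dataset with the default numeric index
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "GoldenEye", "Skyfall"],
        "Actor": ["Sean Connery", "Sean Connery", "Pierce Brosnan", "Daniel Craig"],
    }
)

# Series.replace returns a new Series; the DataFrame is unchanged until we assign it back
replaced = bond["Actor"].replace("Sean Connery", "Sir Sean Connery")

# A Boolean Series passed to loc targets every matching row in one assignment
is_connery = bond["Actor"] == "Sean Connery"
bond.loc[is_connery, "Actor"] = "Sir Sean Connery"

print(replaced.tolist())
print(bond)
```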

## Rename Index Labels or Columns in a DataFrame
- The `rename` method accepts a dictionary for either its `columns` or `index` parameters.
- The dictionary keys represent the existing names and the values represent the new names.
- We can replace all columns by overwriting the **DataFrame's** `columns` attribute.
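
A short sketch of both renaming routes, assuming a hand-built two-row `bond` frame; every new name below is hypothetical:

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger"],
        "Year": [1962, 1964],
        "Box Office": [448.8, 820.4],
    }
).set_index("Film")

# rename returns a new DataFrame; keys are existing names, values are new names
renamed = bond.rename(
    columns={"Box Office": "Revenue"},
    index={"Dr. No": "Doctor No"},
)

# Overwriting the columns attribute replaces every column name at once,
# so the list must have exactly one entry per column
bond.columns = ["Release Year", "Gross"]

print(renamed)
print(bond)
```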

## Delete Rows or Columns from a DataFrame
- The `drop` method deletes one or more rows/columns from a **DataFrame**.
- Pass the `index` parameter a list of row labels to remove, or the `columns` parameter a list of column names to remove.
- The `pop` method removes and returns a single **Series** (it mutates the **DataFrame** in the process).
- Python's `del` keyword also removes a single **Series**.
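
The three removal tools can be compared side by side. This is a sketch on a hand-built four-row `bond` frame, not the full dataset:

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "GoldenEye", "Skyfall"],
        "Year": [1962, 1964, 1995, 2012],
        "Actor": ["Sean Connery", "Sean Connery", "Pierce Brosnan", "Daniel Craig"],
        "Box Office": [448.8, 820.4, 518.5, 943.5],
        "Budget": [7.0, 18.6, 76.9, 170.2],
    }
).set_index("Film")

# drop returns a new DataFrame without the listed rows and columns
trimmed = bond.drop(index=["Dr. No"], columns=["Budget"])

# pop removes the column from bond itself and hands it back as a Series
actor = bond.pop("Actor")

# del also removes the column in place, but returns nothing
del bond["Year"]

print(trimmed)
print(actor)
print(bond.columns.tolist())
```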

## Create Random Sample with the sample Method
- The `sample` method returns one or more random rows from the **DataFrame**.
- Customize the `axis` parameter to extract random columns.
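
A minimal sketch of random row and column sampling, assuming a hand-built four-row `bond` frame. The `random_state` argument is added here only to make the draw repeatable:

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "GoldenEye", "Skyfall"],
        "Year": [1962, 1964, 1995, 2012],
        "Box Office": [448.8, 820.4, 518.5, 943.5],
        "Budget": [7.0, 18.6, 76.9, 170.2],
    }
).set_index("Film")

# n sets how many rows to draw; random_state fixes the seed
two_rows = bond.sample(n=2, random_state=0)

# axis="columns" draws random columns instead of random rows
two_columns = bond.sample(n=2, axis="columns", random_state=0)

print(two_rows)
print(two_columns)
```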

## The nsmallest and nlargest Methods
- The `nlargest` method returns a specified number of rows with the largest values from a given column.
- The `nsmallest` method returns rows with the smallest values from a given column.
- The `nlargest` and `nsmallest` methods are more efficient than sorting the entire **DataFrame**.
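
A brief sketch, assuming a hand-built four-row `bond` frame. The `sort_values` line is included only to show the equivalent, slower-on-large-data route:

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "GoldenEye", "Skyfall"],
        "Box Office": [448.8, 820.4, 518.5, 943.5],
        "Budget": [7.0, 18.6, 76.9, 170.2],
    }
).set_index("Film")

# The two films with the highest Box Office
top_two = bond.nlargest(n=2, columns="Box Office")

# The two films with the lowest Budget
cheapest_two = bond.nsmallest(n=2, columns="Budget")

# Same rows as top_two, obtained by sorting the whole frame first
sorted_top_two = bond.sort_values("Box Office", ascending=False).head(2)

print(top_two)
print(cheapest_two)
```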

## Filtering with the where Method
- Similar to square brackets or `loc`, the `where` method filters the original **DataFrame** with a Boolean Series.
- Pandas will populate rows that do **not** match the criteria with `NaN` values.
- Leaving in the `NaN` values can be advantageous for certain merge and visualization operations.
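
The difference between `where` and an ordinary Boolean filter is easiest to see on a tiny frame. The sketch below assumes a hand-built four-row `bond` frame:

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "GoldenEye", "Skyfall"],
        "Year": [1962, 1964, 1995, 2012],
        "Actor": ["Sean Connery", "Sean Connery", "Pierce Brosnan", "Daniel Craig"],
    }
).set_index("Film")

after_1990 = bond["Year"] > 1990

# where keeps the original shape; rows that fail the condition become NaN
masked = bond.where(after_1990)

# loc with the same mask drops the failing rows instead
filtered = bond.loc[after_1990]

print(masked)
print(filtered)
```

Because the shape is preserved, `masked` still lines up row by row with the original frame, which is what makes it convenient for later alignment-based operations.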

## The apply Method with DataFrames
- The `apply` method invokes a function on every column or every row in the **DataFrame**.
- Pass the uninvoked function as the first argument to the `apply` method.
- Pass the `axis` parameter an argument of `"columns"` to invoke the function on every row.
- Pandas will pass in the row's values as a **Series** object. We can use accessors like `loc` and `iloc` to extract the column's values for that row.
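
A sketch of both directions of `apply`, assuming a hand-built three-row `bond` frame. The helper functions `value_range` and `profit` and the `Profit` column are hypothetical names introduced for illustration:

```python
import pandas as pd

# Hand-built subset of the dataset, indexed by Film
bond = pd.DataFrame(
    {
        "Film": ["Dr. No", "GoldenEye", "Skyfall"],
        "Actor": ["Sean Connery", "Pierce Brosnan", "Daniel Craig"],
        "Box Office": [448.8, 518.5, 943.5],
        "Budget": [7.0, 76.9, 170.2],
    }
).set_index("Film")


def value_range(column):
    # Receives one column as a Series
    return column.max() - column.min()


def profit(row):
    # Receives one row as a Series whose index holds the column names
    return row.loc["Box Office"] - row.loc["Budget"]


# Default axis: the function runs once per column
spread = bond[["Box Office", "Budget"]].apply(value_range)

# axis="columns": the function runs once per row
bond["Profit"] = bond.apply(profit, axis="columns")

print(spread)
print(bond)
```

For simple arithmetic like this, `bond["Box Office"] - bond["Budget"]` gives the same result without `apply`; the row-wise form is mainly useful when the logic needs branching or several columns at once.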