# DataFrames III: Data Extraction

In [1]:
import pandas as pd

## This Module's Dataset
- This module's dataset is a collection of all James Bond movies.

In [2]:
bond = pd.read_csv('jamesbond.csv')
bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


## The set_index and reset_index Methods
- The index serves as the collection of primary identifiers/labels/entrypoints for the rows.
- Extracting a row by position or label is fastest when the index is sorted.
- Pandas uses index labels/values when merging different objects together.
- The `set_index` method sets an existing column as the index of the **DataFrame**.
- The `reset_index` method restores the standard ascending numeric index of the **DataFrame** and, by default, moves the previous index back into a regular column.
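As a minimal sketch of the round trip described above, the cell below builds a four-row stand-in for the dataset (the name `films` is an assumption made for illustration; the values are copied from the first rows shown earlier) so that it runs without `jamesbond.csv`.

```python
import pandas as pd

# Four-row stand-in for the module's dataset (hypothetical name: films).
films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
    "Year": [1962, 1963, 1964, 1965],
    "Box Office": [448.8, 543.8, 820.4, 848.1],
})

# set_index returns a new DataFrame; the Film column becomes the index.
by_film = films.set_index("Film")
print(by_film.index.name)           # Film
print(by_film.columns.tolist())     # ['Year', 'Box Office']

# reset_index moves Film back into a column and restores 0, 1, 2, ...
restored = by_film.reset_index()
print(restored.columns.tolist())    # ['Film', 'Year', 'Box Office']
```

Neither method changes the original object, which is why the notebook reassigns the result (`bond = bond.set_index('Film')`).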

In [26]:
bond = bond.set_index('Film')

In [28]:
bond.head()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [30]:
bond.loc['From Russia with Love':]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,


In [16]:
bond.iloc[3:12]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,


In [41]:
bond.loc[['Thunderball'], 'Year':]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


In [45]:
bond2 = bond.reset_index()
bond2.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [51]:
bond2.iloc[:, 2:].head()

Unnamed: 0,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Sean Connery,Terence Young,448.8,7.0,0.6
1,Sean Connery,Terence Young,543.8,12.6,1.6
2,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Sean Connery,Terence Young,848.1,41.9,4.7
4,David Niven,Ken Hughes,315.0,85.0,


In [52]:
bond.loc['Thunderball':'Moonraker']

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,


## Retrieve Rows by Index Position with iloc Accessor
- The `iloc` accessor retrieves one or more rows by index position.
- Provide a pair of square brackets after the accessor.
- `iloc` accepts single values, lists, and slices.
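The three input types listed above can be sketched on a small stand-in frame (the name `films` is an assumption for illustration; the values come from the first rows of the dataset).

```python
import pandas as pd

films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
    "Year": [1962, 1963, 1964, 1965],
})

single = films.iloc[2]         # one position -> a Series holding that row
picked = films.iloc[[0, 3]]    # list of positions -> a DataFrame
sliced = films.iloc[1:3]       # slice -> positions 1 and 2 (end excluded)

print(single["Film"])              # Goldfinger
print(picked["Film"].tolist())     # ['Dr. No', 'Thunderball']
print(sliced["Year"].tolist())     # [1963, 1964]
```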

In [62]:
bond2.iloc[:, [1,3,4]]

Unnamed: 0,Year,Director,Box Office
0,1962,Terence Young,448.8
1,1963,Terence Young,543.8
2,1964,Guy Hamilton,820.4
3,1965,Terence Young,848.1
4,1967,Ken Hughes,315.0
5,1967,Lewis Gilbert,514.2
6,1969,Peter R. Hunt,291.5
7,1971,Guy Hamilton,442.5
8,1973,Guy Hamilton,460.3
9,1974,Guy Hamilton,334.0


In [65]:
bond2.iloc[2:, 1:]

Unnamed: 0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
2,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,1967,David Niven,Ken Hughes,315.0,85.0,
5,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
6,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
7,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
8,1973,Roger Moore,Guy Hamilton,460.3,30.8,
9,1974,Roger Moore,Guy Hamilton,334.0,27.7,
10,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
11,1979,Roger Moore,Lewis Gilbert,535.0,91.5,


## Retrieve Rows by Index Label with loc Accessor
- The `loc` accessor retrieves one or more rows by index label.
- Provide a pair of square brackets after the accessor.
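A short sketch of label-based retrieval, assuming the `Film` column has been set as the index. The frame `by_film` is a hypothetical four-row stand-in built from the first rows of the dataset.

```python
import pandas as pd

by_film = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
    "Year": [1962, 1963, 1964, 1965],
    "Actor": ["Sean Connery"] * 4,
}).set_index("Film")

one = by_film.loc["Goldfinger"]                          # single label -> Series
several = by_film.loc[["Dr. No", "Thunderball"]]         # list of labels -> DataFrame
span = by_film.loc["From Russia with Love":"Goldfinger"] # label slice

print(one["Year"])               # 1964
print(several.index.tolist())    # ['Dr. No', 'Thunderball']
print(span.index.tolist())       # ['From Russia with Love', 'Goldfinger']
```

Unlike `iloc`, a label slice with `loc` includes both endpoints.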

In [5]:
bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [15]:
bond.iloc[5:, :5]

Unnamed: 0,Film,Year,Actor,Director,Box Office
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3
9,The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0
10,The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0
11,Moonraker,1979,Roger Moore,Lewis Gilbert,535.0
12,For Your Eyes Only,1981,Roger Moore,John Glen,449.4
13,Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0
14,Octopussy,1983,Roger Moore,John Glen,373.8


In [9]:
bond.iloc[:, :3]

Unnamed: 0,Film,Year,Actor
0,Dr. No,1962,Sean Connery
1,From Russia with Love,1963,Sean Connery
2,Goldfinger,1964,Sean Connery
3,Thunderball,1965,Sean Connery
4,Casino Royale,1967,David Niven
5,You Only Live Twice,1967,Sean Connery
6,On Her Majesty's Secret Service,1969,George Lazenby
7,Diamonds Are Forever,1971,Sean Connery
8,Live and Let Die,1973,Roger Moore
9,The Man with the Golden Gun,1974,Roger Moore


## Second Arguments to loc and iloc Accessors
- The second value inside the square brackets targets the columns.
- The `iloc` accessor requires integer positions for rows and columns.
- The `loc` accessor requires labels for rows and columns.
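The following sketch pairs a row selector with a column selector in both accessors. The frame `by_film` is a hypothetical stand-in indexed by `Film`, with values taken from the first rows of the dataset.

```python
import pandas as pd

by_film = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
    "Year": [1962, 1963, 1964, 1965],
    "Actor": ["Sean Connery"] * 4,
    "Box Office": [448.8, 543.8, 820.4, 848.1],
}).set_index("Film")

# loc: row label first, column label second.
print(by_film.loc["Goldfinger", "Year"])    # 1964
subset = by_film.loc[["Dr. No", "Goldfinger"], ["Year", "Box Office"]]
print(subset.shape)                         # (2, 2)

# iloc: row position first, column position second.
print(by_film.iloc[2, 0])                   # 1964
corner = by_film.iloc[:2, [0, 2]]
print(corner.columns.tolist())              # ['Year', 'Box Office']
```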

## Overwrite Value in a DataFrame
- Use the `iloc` or `loc` accessor on the **DataFrame** to target a value, then provide the equal sign and a new value.
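The same kind of assignment can be written with `loc` by naming the row label and the column. This is a sketch on a hypothetical stand-in frame (`by_film`), and the new value is only a placeholder.

```python
import pandas as pd

by_film = pd.DataFrame({
    "Film": ["Goldfinger", "Thunderball"],
    "Year": [1964, 1965],
    "Actor": ["Sean Connery", "Sean Connery"],
}).set_index("Film")

# Target one cell by row label and column label, then assign.
by_film.loc["Thunderball", "Actor"] = "Teslim"   # placeholder value

print(by_film["Actor"].tolist())    # ['Sean Connery', 'Teslim']
```

The assignment modifies `by_film` in place; nothing needs to be reassigned.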

In [64]:
bond

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
9,The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


In [18]:
bond.iloc[3,2] = 'Teslim'

In [19]:
bond

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Teslim,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
9,The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


## Overwrite Multiple Values in a DataFrame
- The `replace` method replaces all occurrences of a **Series** value with another value (think of it like "Find and Replace").
- To overwrite multiple values in a **DataFrame**, remember to use an accessor on the **DataFrame** itself.
- Accessors like `loc` and `iloc` can accept Boolean Series. Use them to target the values to overwrite.
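A minimal sketch of both approaches on a hypothetical stand-in frame (`films`). The replacement strings are placeholders chosen for illustration, not values from the dataset.

```python
import pandas as pd

films = pd.DataFrame({
    "Film": ["Goldfinger", "Thunderball", "Live and Let Die"],
    "Actor": ["Sean Connery", "Sean Connery", "Roger Moore"],
    "Director": ["Guy Hamilton", "Terence Young", "Guy Hamilton"],
})

# Boolean Series: True for every row whose Actor is Sean Connery.
is_connery = films["Actor"] == "Sean Connery"

# Assign through loc on the DataFrame itself so the change sticks.
films.loc[is_connery, "Actor"] = "Sir Sean Connery"

# replace works like "Find and Replace" on a single Series.
films["Director"] = films["Director"].replace("Guy Hamilton", "G. Hamilton")

print(films["Actor"].tolist())
# ['Sir Sean Connery', 'Sir Sean Connery', 'Roger Moore']
print(films["Director"].tolist())
# ['G. Hamilton', 'Terence Young', 'G. Hamilton']
```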

## Rename Index Labels or Columns in a DataFrame
- The `rename` method accepts a dictionary for either its `columns` or `index` parameters.
- The dictionary keys represent the existing names and the values represent the new names.
- We can replace all columns by overwriting the **DataFrame's** `columns` attribute.
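The sketch below shows the three options above on a hypothetical stand-in frame (`by_film`). The new names are placeholders.

```python
import pandas as pd

by_film = pd.DataFrame({
    "Film": ["Dr. No", "Goldfinger"],
    "Year": [1962, 1964],
    "Actor": ["Sean Connery", "Sean Connery"],
}).set_index("Film")

# rename returns a new DataFrame; keys are old names, values are new names.
renamed = by_film.rename(
    columns={"Year": "Year_of_release"},
    index={"Dr. No": "Doctor No"},
)
print(renamed.columns.tolist())    # ['Year_of_release', 'Actor']
print(renamed.index.tolist())      # ['Doctor No', 'Goldfinger']

# Overwriting the columns attribute replaces every column name at once.
# The list length must match the number of columns.
by_film.columns = ["Released", "Lead"]
print(by_film.columns.tolist())    # ['Released', 'Lead']
```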

In [43]:
bond.head()

Unnamed: 0,1,2,3,4,5,6,7
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Teslim,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [65]:
bond = bond.rename(columns={'Year': 'Year_of_release'})

In [66]:
bond.head()

Unnamed: 0,Film,Year_of_release,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [67]:
bond.columns

Index(['Film', 'Year_of_release', 'Actor', 'Director', 'Box Office', 'Budget', 'Bond Actor Salary'], dtype='object')

## Delete Rows or Columns from a DataFrame
- The `drop` method deletes one or more rows/columns from a **DataFrame**.
- Pass the `index` parameter a list of row labels to remove, or the `columns` parameter a list of column names to remove.
- The `pop` method removes and returns a single **Series** (it mutates the **DataFrame** in the process).
- Python's `del` keyword also removes a single **Series**.
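A sketch contrasting the three removal tools on a hypothetical stand-in frame (`films`). The shapes in the comments show which calls mutate the original.

```python
import pandas as pd

films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger"],
    "Year": [1962, 1963, 1964],
    "Budget": [7.0, 12.6, 18.6],
})

# drop returns a new DataFrame and leaves the original untouched.
trimmed = films.drop(columns=["Budget"], index=[1])
print(trimmed.shape)    # (2, 2)
print(films.shape)      # (3, 3)

# pop removes the column from films and returns it as a Series.
year = films.pop("Year")
print(year.tolist())    # [1962, 1963, 1964]
print(films.shape)      # (3, 2)

# del removes a column in place and returns nothing.
del films["Budget"]
print(films.columns.tolist())    # ['Film']
```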

In [85]:
# bond.drop(columns = ['Year_of_release', 'Box Office'], index = 4)

In [89]:
bond.shape

(27, 7)

In [90]:
actor = bond.pop('Actor')
actor

0       Sean Connery
1       Sean Connery
2       Sean Connery
3       Sean Connery
4        David Niven
5       Sean Connery
6     George Lazenby
7       Sean Connery
8        Roger Moore
9        Roger Moore
10       Roger Moore
11       Roger Moore
12       Roger Moore
13      Sean Connery
14       Roger Moore
15       Roger Moore
16    Timothy Dalton
17    Timothy Dalton
18    Pierce Brosnan
19    Pierce Brosnan
20    Pierce Brosnan
21    Pierce Brosnan
22      Daniel Craig
23      Daniel Craig
24      Daniel Craig
25      Daniel Craig
26      Daniel Craig
Name: Actor, dtype: object

In [95]:
pd.DataFrame(actor)

Unnamed: 0,Actor
0,Sean Connery
1,Sean Connery
2,Sean Connery
3,Sean Connery
4,David Niven
5,Sean Connery
6,George Lazenby
7,Sean Connery
8,Roger Moore
9,Roger Moore


In [83]:
bond.shape

(27, 6)

In [84]:
bond

Unnamed: 0,Film,Year,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,Ken Hughes,315.0,85.0,
5,You Only Live Twice,1967,Lewis Gilbert,514.2,59.9,4.4
6,On Her Majesty's Secret Service,1969,Peter R. Hunt,291.5,37.3,0.6
7,Diamonds Are Forever,1971,Guy Hamilton,442.5,34.7,5.8
8,Live and Let Die,1973,Guy Hamilton,460.3,30.8,
9,The Man with the Golden Gun,1974,Guy Hamilton,334.0,27.7,


## Create Random Sample with the sample Method
- The `sample` method returns one or more random rows from the **DataFrame**; the `n` parameter sets how many.
- Customize the `axis` parameter to extract random columns.
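A brief sketch of row and column sampling on a hypothetical stand-in frame (`films`). Passing `random_state` is an addition here, used only to make the draw repeatable; which rows or columns come back is not shown because it depends on the random generator.

```python
import pandas as pd

films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
    "Year": [1962, 1963, 1964, 1965],
    "Budget": [7.0, 12.6, 18.6, 41.9],
})

# Two random rows; all three columns are kept.
rows = films.sample(n=2, random_state=0)
print(rows.shape)    # (2, 3)

# Two random columns; all four rows are kept.
cols = films.sample(n=2, axis="columns", random_state=0)
print(cols.shape)    # (4, 2)
```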

In [160]:
def sample_dataframe(df):
    # Sample 10 rows and sort by the second column (df.columns[1], which is Year here)
    result = df.sample(n=10).sort_values(by=df.columns[1])
    return result

# Example call
sample_dataframe(bond)

Unnamed: 0,Film,Year,Director,Box Office,Budget,Bond Actor Salary
1,From Russia with Love,1963,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Guy Hamilton,820.4,18.6,3.2
9,The Man with the Golden Gun,1974,Guy Hamilton,334.0,27.7,
13,Never Say Never Again,1983,Irvin Kershner,380.0,86.0,
16,The Living Daylights,1987,John Glen,313.5,68.8,5.2
18,GoldenEye,1995,Martin Campbell,518.5,76.9,5.1
21,Die Another Day,2002,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Marc Forster,514.2,181.4,8.1
26,No Time to Die,2021,Cary Joji Fukunaga,774.2,301.0,25.0


## The nsmallest and nlargest Methods
- The `nlargest` method returns a specified number of rows with the largest values from a given column.
- The `nsmallest` method returns rows with the smallest values from a given column.
- The `nlargest` and `nsmallest` methods are generally more efficient than sorting the entire **DataFrame** and taking the first rows.
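As a sketch of the counterpart method, `nsmallest` can be compared with the sort-then-`head` approach it replaces. The frame `films` is a hypothetical stand-in using budget values from the first rows of the dataset.

```python
import pandas as pd

films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
    "Budget": [7.0, 12.6, 18.6, 41.9],
})

# The two films with the smallest budgets.
cheapest = films.nsmallest(n=2, columns="Budget")
print(cheapest["Film"].tolist())    # ['Dr. No', 'From Russia with Love']

# Same rows via a full sort followed by head.
via_sort = films.sort_values("Budget").head(2)
print(cheapest.equals(via_sort))    # True
```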

In [4]:
bond.nlargest(n=4, columns="Year")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
26,No Time to Die,2021,Daniel Craig,Cary Joji Fukunaga,774.2,301.0,25.0
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,30.0
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1


In [8]:
bond["Year"].nlargest(6)

26    2021
25    2015
24    2012
23    2008
22    2006
21    2002
Name: Year, dtype: int64

## Filtering with the where Method
- Similar to square brackets or `loc`, the `where` method filters the original **DataFrame** with a Boolean Series.
- Pandas will populate rows that do **not** match the criteria with `NaN` values.
- Leaving in the `NaN` values can be advantageous for certain merge and visualization operations.
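The difference in shape between the two filtering styles can be sketched on a hypothetical stand-in frame (`films`). The names `is_connery`, `masked`, and `filtered` are illustrative.

```python
import pandas as pd

films = pd.DataFrame({
    "Film": ["Thunderball", "Casino Royale", "You Only Live Twice"],
    "Actor": ["Sean Connery", "David Niven", "Sean Connery"],
})

is_connery = films["Actor"] == "Sean Connery"

# Square brackets drop the rows that fail the condition.
filtered = films[is_connery]
print(filtered.shape)    # (2, 2)

# where keeps every row and fills the failing ones with NaN.
masked = films.where(is_connery)
print(masked.shape)                     # (3, 2)
print(masked.loc[1].isna().all())       # True

# Dropping the all-NaN rows recovers the same row labels.
print(masked.dropna(how="all").index.tolist())    # [0, 2]
```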

In [10]:
actor = bond["Actor"] == "Sean Connery"
bond[actor]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
13,Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,


In [11]:
bond.where(actor)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962.0,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963.0,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965.0,Sean Connery,Terence Young,848.1,41.9,4.7
4,,,,,,,
5,You Only Live Twice,1967.0,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
6,,,,,,,
7,Diamonds Are Forever,1971.0,Sean Connery,Guy Hamilton,442.5,34.7,5.8
8,,,,,,,
9,,,,,,,


## The apply Method with DataFrames
- The `apply` method invokes a function on every column or every row in the **DataFrame**.
- Pass the uninvoked function as the first argument to the `apply` method.
- Pass the `axis` parameter an argument of `"columns"` to invoke the function on every row.
- Pandas will pass in the row's values as a **Series** object. We can use accessors like `loc` and `iloc` to extract the column's values for that row.
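A sketch of a row-wise `apply`, assuming a hypothetical helper `budget_label` and an arbitrary threshold of 15 chosen only for illustration. The frame `films` is a stand-in built from the first rows of the dataset.

```python
import pandas as pd

films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
    "Box Office": [448.8, 543.8, 820.4, 848.1],
    "Budget": [7.0, 12.6, 18.6, 41.9],
})

def budget_label(row):
    # row is a Series whose index holds the column names.
    if row.loc["Budget"] > 15:
        return "big budget"
    return "small budget"

# Pass the function uninvoked; axis="columns" calls it once per row.
labels = films.apply(budget_label, axis="columns")
print(labels.tolist())
# ['small budget', 'small budget', 'big budget', 'big budget']
```

Without the `axis` argument, the function would receive each column as a **Series** instead, which is what `bond.apply(len)` relies on.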

In [12]:
bond.apply(len)

Film                 27
Year                 27
Actor                27
Director             27
Box Office           27
Budget               27
Bond Actor Salary    27
dtype: int64