# DataFrames III: Data Extraction

In [None]:
import pandas as pd

## This Module's Dataset
- This module's dataset is a collection of all James Bond movies.

In [None]:
bond = pd.read_csv('jamesbond.csv')
bond

## The set_index and reset_index Methods
- The index serves as the collection of primary identifiers/labels/entrypoints for the rows.
- The fastest way to extract a row is by its position or label in a sorted index.
- Pandas uses index labels/values when merging different objects together.
- The `set_index` method sets an existing column as the index of the **DataFrame**.
- The `reset_index` method restores the standard ascending numeric index and, by default, moves the current index back into a regular column of the **DataFrame**.

In [None]:
bond = pd.read_csv('jamesbond.csv')
bond

In [None]:
bond = bond.set_index('Film')
bond

In [None]:
bond = bond.reset_index()
bond

## Retrieve Rows by Index Position with iloc Accessor
- The `iloc` accessor retrieves one or more rows by index position.
- Provide a pair of square brackets after the accessor.
- `iloc` accepts single values, lists, and slices.

In [None]:
bond = pd.read_csv('jamesbond.csv')
bond

In [None]:
bond.iloc[[1, 2]]

In [None]:
bond.iloc[5:]

## Retrieve Rows by Index Label with loc Accessor
- The `loc` accessor retrieves one or more rows by index label.
- Provide a pair of square brackets after the accessor.

In [None]:
bond = pd.read_csv('jamesbond.csv', index_col='Film')
bond

In [None]:
bond.loc['You Only Live Twice':]

In [None]:
bond.loc['Moonraker']

## Second Arguments to loc and iloc Accessors
- The second value inside the square brackets targets the columns.
- The `iloc` accessor requires numeric positions for both rows and columns.
- The `loc` accessor requires labels for both rows and columns.

In [None]:
bond = pd.read_csv('jamesbond.csv', index_col='Film')
bond

In [None]:
bond.loc['From Russia with Love': 'Live and Let Die', ['Year', 'Budget']]

In [None]:
bond.loc['From Russia with Love': 'Live and Let Die', 'Year': 'Budget']

In [None]:
bond.loc[['From Russia with Love', 'Live and Let Die'], ['Year', 'Budget']]

In [None]:
bond.loc[['From Russia with Love', 'Live and Let Die'], 'Year': 'Budget']

In [None]:
bond = bond.reset_index()
bond

In [None]:
bond.iloc[[1, 2], [0, 2, 5]]

In [None]:
bond.iloc[5: , [0, 2, 5]]

## Overwrite Value in a DataFrame
- Use the `iloc` or `loc` accessor on the **DataFrame** to target a value, then assign a new value with the equal sign.

In [None]:
bond = pd.read_csv('jamesbond.csv', index_col='Film')
bond

In [None]:
bond.loc['Dr. No', 'Year'] = 2024

In [None]:
bond

In [None]:
bond.iloc[0, 1] = 'Petar Koprinkov'

In [None]:
bond

## Overwrite Multiple Values in a DataFrame
- The `replace` method replaces all occurrences of a **Series** value with another value (think of it like "Find and Replace").
- To overwrite multiple values in a **DataFrame**, remember to use an accessor on the **DataFrame** itself.
- Accessors like `loc` and `iloc` can accept Boolean Series. Use them to target the values to overwrite.

In [41]:
bond = pd.read_csv('jamesbond.csv', index_col='Film')
bond

| Film | Year | Actor | Director | Box Office | Budget | Bond Actor Salary |
|---|---|---|---|---|---|---|
| Dr. No | 1962 | Sean Connery | Terence Young | 448.8 | 7.0 | 0.6 |
| From Russia with Love | 1963 | Sean Connery | Terence Young | 543.8 | 12.6 | 1.6 |
| Goldfinger | 1964 | Sean Connery | Guy Hamilton | 820.4 | 18.6 | 3.2 |
| Thunderball | 1965 | Sean Connery | Terence Young | 848.1 | 41.9 | 4.7 |
| Casino Royale | 1967 | David Niven | Ken Hughes | 315.0 | 85.0 | NaN |
| You Only Live Twice | 1967 | Sean Connery | Lewis Gilbert | 514.2 | 59.9 | 4.4 |
| On Her Majesty's Secret Service | 1969 | George Lazenby | Peter R. Hunt | 291.5 | 37.3 | 0.6 |
| Diamonds Are Forever | 1971 | Sean Connery | Guy Hamilton | 442.5 | 34.7 | 5.8 |
| Live and Let Die | 1973 | Roger Moore | Guy Hamilton | 460.3 | 30.8 | NaN |
| The Man with the Golden Gun | 1974 | Roger Moore | Guy Hamilton | 334.0 | 27.7 | NaN |


In [None]:
my_actor = bond['Actor'] == 'Sean Connery'
my_actor

In [None]:
bond.loc[my_actor, 'Actor'] = 'Petar Koprinkov'
bond

In [43]:
bond['Actor'] = bond['Actor'].replace('Sean Connery', 'Petar Koprinkov')

In [44]:
bond

| Film | Year | Actor | Director | Box Office | Budget | Bond Actor Salary |
|---|---|---|---|---|---|---|
| Dr. No | 1962 | Petar Koprinkov | Terence Young | 448.8 | 7.0 | 0.6 |
| From Russia with Love | 1963 | Petar Koprinkov | Terence Young | 543.8 | 12.6 | 1.6 |
| Goldfinger | 1964 | Petar Koprinkov | Guy Hamilton | 820.4 | 18.6 | 3.2 |
| Thunderball | 1965 | Petar Koprinkov | Terence Young | 848.1 | 41.9 | 4.7 |
| Casino Royale | 1967 | David Niven | Ken Hughes | 315.0 | 85.0 | NaN |
| You Only Live Twice | 1967 | Petar Koprinkov | Lewis Gilbert | 514.2 | 59.9 | 4.4 |
| On Her Majesty's Secret Service | 1969 | George Lazenby | Peter R. Hunt | 291.5 | 37.3 | 0.6 |
| Diamonds Are Forever | 1971 | Petar Koprinkov | Guy Hamilton | 442.5 | 34.7 | 5.8 |
| Live and Let Die | 1973 | Roger Moore | Guy Hamilton | 460.3 | 30.8 | NaN |
| The Man with the Golden Gun | 1974 | Roger Moore | Guy Hamilton | 334.0 | 27.7 | NaN |


## Rename Index Labels or Columns in a DataFrame
- The `rename` method accepts a dictionary for either its `columns` or `index` parameters.
- The dictionary keys represent the existing names and the values represent the new names.
- We can replace all columns by overwriting the **DataFrame's** `columns` attribute.
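
The points above can be sketched as follows. To keep the cell self-contained, it builds a small stand-in **DataFrame** from a few rows of the dataset instead of reading `jamesbond.csv`; the new names (`Revenue`, `Doctor No`, and the replacement column list) are made up purely for illustration.

```python
import pandas as pd

# Hand-built stand-in for jamesbond.csv (values copied from rows shown earlier).
bond = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964],
        "Actor": ["Sean Connery", "Sean Connery", "Sean Connery"],
        "Box Office": [448.8, 543.8, 820.4],
    },
    index=pd.Index(["Dr. No", "From Russia with Love", "Goldfinger"], name="Film"),
)

# rename returns a new DataFrame; keys are existing names, values are new names.
renamed = bond.rename(
    columns={"Box Office": "Revenue"},  # hypothetical new column name
    index={"Dr. No": "Doctor No"},      # hypothetical new index label
)

# Overwriting the columns attribute replaces every column name at once.
# The list must contain exactly one name per existing column.
relabeled = bond.copy()
relabeled.columns = ["Release Year", "Lead Actor", "Revenue"]
```

Note that `rename` leaves `bond` untouched unless the result is assigned back, whereas assigning to `columns` changes the object it is applied to.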

## Delete Rows or Columns from a DataFrame
- The `drop` method deletes one or more rows/columns from a **DataFrame**.
- Pass the `index` parameter a list of row labels to remove, or pass the `columns` parameter a list of column names to remove.
- The `pop` method removes and returns a single **Series** (it mutates the **DataFrame** in the process).
- Python's `del` keyword also removes a single **Series**.
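
A minimal sketch of the three approaches, again using a small hand-built stand-in for the dataset rather than the CSV file (the variable names are illustrative only):

```python
import pandas as pd

# Hand-built stand-in for jamesbond.csv (values copied from rows shown earlier).
bond = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964, 1965],
        "Actor": ["Sean Connery"] * 4,
        "Box Office": [448.8, 543.8, 820.4, 848.1],
        "Budget": [7.0, 12.6, 18.6, 41.9],
    },
    index=pd.Index(
        ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
        name="Film",
    ),
)

# drop returns a new DataFrame and leaves the original unchanged.
without_rows = bond.drop(index=["Dr. No", "Goldfinger"])
without_cols = bond.drop(columns=["Budget", "Box Office"])

# pop and del both mutate the DataFrame they are applied to,
# so work on a copy to keep the original intact.
working = bond.copy()
actor = working.pop("Actor")  # removes the column and returns it as a Series
del working["Year"]           # removes the column and returns nothing
```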

## Create Random Sample with the sample Method
- The `sample` method returns one or more randomly selected rows from the **DataFrame**.
- Set the `axis` parameter to `"columns"` to extract random columns instead.
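
As a rough illustration, the cell below samples rows and then columns from a small hand-built stand-in for the dataset. The `random_state` argument is an addition for reproducibility and is not required; without it, each call returns a different selection.

```python
import pandas as pd

# Hand-built stand-in for jamesbond.csv (values copied from rows shown earlier).
bond = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964, 1965],
        "Actor": ["Sean Connery"] * 4,
        "Box Office": [448.8, 543.8, 820.4, 848.1],
        "Budget": [7.0, 12.6, 18.6, 41.9],
    },
    index=pd.Index(
        ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
        name="Film",
    ),
)

# Two random rows; every column is kept.
random_rows = bond.sample(n=2, random_state=0)

# Two random columns; every row is kept.
random_cols = bond.sample(n=2, axis="columns", random_state=0)
```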

## The nsmallest and nlargest Methods
- The `nlargest` method returns a specified number of rows with the largest values from a given column.
- The `nsmallest` method returns a specified number of rows with the smallest values from a given column.
- The `nlargest` and `nsmallest` methods are typically more efficient than sorting the entire **DataFrame** and then taking the first rows.
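
The following sketch, built on a small hand-made stand-in for the dataset, shows both methods and compares `nlargest` with the sort-then-`head` approach it replaces:

```python
import pandas as pd

# Hand-built stand-in for jamesbond.csv (values copied from rows shown earlier).
bond = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964, 1965],
        "Box Office": [448.8, 543.8, 820.4, 848.1],
        "Budget": [7.0, 12.6, 18.6, 41.9],
    },
    index=pd.Index(
        ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
        name="Film",
    ),
)

# The two films with the highest Box Office values, largest first.
top_two = bond.nlargest(n=2, columns="Box Office")

# The two films with the lowest Budget values, smallest first.
bottom_two = bond.nsmallest(n=2, columns="Budget")

# Same rows as top_two, obtained by sorting the whole DataFrame first.
sorted_top_two = bond.sort_values("Box Office", ascending=False).head(2)
```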

## Filtering with the where Method
- Similar to square brackets or `loc`, the `where` method filters the original **DataFrame** with a Boolean Series.
- Instead of dropping the rows that do **not** match the criteria, pandas keeps them and fills their values with `NaN`, so the result has the same shape as the original.
- Leaving in the `NaN` values can be advantageous for certain merge and visualization operations.
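
To make the difference concrete, here is a small sketch that applies the same Boolean Series with square brackets and with `where`. It uses a hand-built stand-in for the dataset; the variable names are illustrative.

```python
import pandas as pd

# Hand-built stand-in for jamesbond.csv (values copied from rows shown earlier).
bond = pd.DataFrame(
    {
        "Year": [1962, 1967, 1971, 1973],
        "Actor": ["Sean Connery", "David Niven", "Sean Connery", "Roger Moore"],
        "Box Office": [448.8, 315.0, 442.5, 460.3],
    },
    index=pd.Index(
        ["Dr. No", "Casino Royale", "Diamonds Are Forever", "Live and Let Die"],
        name="Film",
    ),
)

is_connery = bond["Actor"] == "Sean Connery"

# Square brackets drop the rows that fail the condition.
filtered = bond[is_connery]

# where keeps every row; rows that fail the condition become all-NaN.
masked = bond.where(is_connery)
```

Because missing values force a floating-point column, integer columns such as `Year` come back as floats in the `where` result.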

## The apply Method with DataFrames
- The `apply` method invokes a function on every column or every row in the **DataFrame**.
- Pass the uninvoked function as the first argument to the `apply` method.
- Pass the `axis` parameter an argument of `"columns"` to invoke the function on every row.
- Pandas will pass in each row's values as a **Series** object whose labels are the column names. We can use accessors like `loc` and `iloc` to extract individual column values from that row.
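
A minimal sketch of both directions, using a hand-built stand-in for the dataset. The helper functions `value_range` and `box_office_minus_budget` are hypothetical names invented for this example, and the subtraction assumes the two columns share the same units.

```python
import pandas as pd

# Hand-built stand-in for jamesbond.csv (values copied from rows shown earlier).
bond = pd.DataFrame(
    {
        "Actor": ["Sean Connery"] * 4,
        "Box Office": [448.8, 543.8, 820.4, 848.1],
        "Budget": [7.0, 12.6, 18.6, 41.9],
    },
    index=pd.Index(
        ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
        name="Film",
    ),
)


def value_range(column):
    """Hypothetical helper: receives one column as a Series."""
    return column.max() - column.min()


def box_office_minus_budget(row):
    """Hypothetical helper: receives one row as a Series keyed by column name."""
    return row.loc["Box Office"] - row.loc["Budget"]


# Default axis: the function is called once per column.
ranges = bond[["Box Office", "Budget"]].apply(value_range)

# axis="columns": the function is called once per row.
# Note that the function is passed uninvoked (no parentheses).
difference = bond.apply(box_office_minus_budget, axis="columns")
```

The row-wise result is a **Series** that shares the index of the **DataFrame**, so it can be assigned directly as a new column.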