# Modifying dataframes reference

This is the reference component to the "Modifying dataframes" section of the Advanced Pandas track. For the workbook component, [click here](#$EXERCISE_URL$).

For this lesson, we'll once again be using the [Wine Reviews dataset](https://www.kaggle.com/zynicide/wine-reviews).

In [None]:
import pandas as pd
reviews = pd.read_csv("../input/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)
reviews.head()

## Assigning to columns

When working with vanilla Python objects, we can use indexing notation to access individual elements of a list or dict:

In [None]:
x = [1, 2, 4, 8]
x[2]

But we can also use an index expression as the left-hand side of an assignment, to modify an object in place:

In [None]:
x[2] = -40
x

Pandas uses a similar convention. `reviews['region_1']` by itself gives me the "region_1" column. But if I use it as the left-hand side of an assignment statement, I can overwrite the values of that column:

In [None]:
reviews['region_1'] = "pandaville"
reviews.head()

Note that our dataset has about 130k rows, but the right-hand side of the assignment was just a single value. In this case, Pandas uses *broadcasting* to repeat the scalar value as many times as necessary to fill the column. I could also have used a list or Series as the right-hand side:

In [None]:
approx_location = "Somewhere in " + reviews.country
print("Assigning a {} with {} entries to the region_1 column".format(
    type(approx_location), len(approx_location)))
reviews['region_1'] = approx_location
reviews.head()

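This works as long as a list's length matches the number of rows in the dataframe (a Series is instead aligned on its index labels). A list of the wrong length raises a `ValueError`: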

In [None]:
# Deliberately fails with a ValueError: 3 values cannot fill ~130k rows
reviews['region_1'] = ["A", "B", "C"]
reviews.head()
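
To make the rules for the right-hand side concrete, here is a minimal sketch on a small, made-up dataframe. The frame `toy` and its values are assumptions for illustration and are not part of the Wine Reviews data. It shows a scalar being broadcast, a list of matching length, a list of the wrong length, and a Series being aligned on its index.

```python
import pandas as pd

# Hypothetical three-row frame standing in for `reviews`
toy = pd.DataFrame({"country": ["Italy", "US", "Canada"],
                    "region_1": ["Etna", "Oregon", "Ontario"]})

# Scalar: broadcast to every row
toy["region_1"] = "pandaville"
scalar_result = toy["region_1"].tolist()

# List: needs exactly one entry per row
toy["region_1"] = ["A", "B", "C"]
list_result = toy["region_1"].tolist()

# List of the wrong length: pandas raises ValueError
try:
    toy["region_1"] = ["A", "B"]
    length_error = None
except ValueError as err:
    length_error = err

# Series: matched on index labels rather than position;
# row labels missing from the Series end up as NaN
partial = pd.Series(["X", "Z"], index=[0, 2])
toy["region_1"] = partial
series_result = toy["region_1"]
```

The Series case is worth keeping in mind: a Series that is shorter than the dataframe does not raise an error, it silently leaves `NaN` in the unmatched rows.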

### Creating new columns

Adding a new column to a dataframe is as easy as overwriting an existing column - it uses the exact same syntax:

In [None]:
reviews['COUNTRY'] = reviews.country.str.upper()
reviews.head()
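
As a small self-contained sketch of the same idea, the example below derives a new column from two existing ones. The frame `toy`, its values, and the column name `points_per_dollar` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame; values are invented for the example
toy = pd.DataFrame({"country": ["Italy", "US"],
                    "points": [80, 90],
                    "price": [20.0, 45.0]})

# Assigning to a name that is not yet a column appends it on the right
toy["points_per_dollar"] = toy["points"] / toy["price"]
columns_after = toy.columns.tolist()
```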

### Modifying selections

What if we want to assign only to rows in the dataset matching some particular condition? For example, suppose we want to force the score of all Canadian wines to 100. We've seen that the subset of our dataframe corresponding to "the points for all Canadian wines" can be expressed by:

In [None]:
reviews.loc[reviews.country == 'Canada', 'points']

We can simply use this as the left-hand-side of an assignment statement:

In [None]:
reviews.loc[reviews.country == 'Canada', 'points'] = 100

Let's confirm that this worked.

In [None]:
reviews.loc[reviews.country == 'Canada']

What if we do this for a column that doesn't exist yet?

In [None]:
reviews.loc[
    reviews.country.isin(['Canada', 'US', 'Mexico']),
    'continent'
] = 'North America'

Perhaps surprisingly, Pandas lets us do this. What happens to this new column for rows not in the selection?

In [None]:
reviews.head()

They get set to `NaN` - which seems sensible enough.
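
The same two steps can be reproduced on a tiny frame, where the result is easy to inspect by eye. This is only a sketch: the frame `toy` and its values are assumptions, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical three-row frame standing in for `reviews`
toy = pd.DataFrame({"country": ["Canada", "Italy", "US"],
                    "points": [88, 91, 85]})

# Overwrite an existing column, but only for rows matching the condition
toy.loc[toy["country"] == "Canada", "points"] = 100

# Assign to a column that does not exist yet:
# rows inside the selection get the value, all other rows get NaN
north_american = toy["country"].isin(["Canada", "US", "Mexico"])
toy.loc[north_american, "continent"] = "North America"
```

Here the Italian row is the only one outside the selection, so it is the only row whose `continent` is `NaN`.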

## Using `inplace = True`

In the [lesson on Indexing and Selecting](#$TUTORIAL_URL(2)$) we briefly saw [the `set_index` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html), which lets us choose a column to use as our dataframe's new index.

In [None]:
# Create a new column "new_idx" of integers counting down from len(reviews) to 1
reviews['new_idx'] = list(range(len(reviews), 0, -1))
# Set it as the new index
reviews.set_index('new_idx')

Now if we check the first 4 rows of `reviews`, they should have index labels `[129971, 129970, 129969, 129968]`, right?

In [None]:
reviews.head(4)

What happened? By default, `set_index` doesn't modify the dataframe it's called on. Instead, it returns a new dataframe: a copy of `reviews` with the new index. But if we *do* want to modify our dataframe, we can pass the keyword argument `inplace=True`:

In [None]:
reviews.set_index('new_idx', inplace=True)

In [None]:
reviews.head(4)

In upcoming lessons, you'll see many more methods which can take an optional `inplace` argument, some of which you may already be familiar with, including:
- `reset_index()`
- `sort_values()`, `sort_index()`
- `fillna()`
- `rename()`

By default, all these methods return *new* objects rather than modifying the object they're called on, unless we set `inplace=True`.
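
The difference between the two behaviours can be checked on a small example. The sketch below uses a made-up frame `toy` (an assumption for illustration) and also shows the common alternative of keeping the dataframe that the method returns.

```python
import pandas as pd

# Hypothetical three-row frame with a countdown column, as in the lesson
toy = pd.DataFrame({"new_idx": [3, 2, 1], "points": [87, 90, 100]})

# Without inplace, set_index returns a new dataframe and leaves `toy` untouched
indexed = toy.set_index("new_idx")
index_before = toy.index.tolist()       # still the default 0, 1, 2
index_of_copy = indexed.index.tolist()  # the countdown values

# Alternative to inplace: keep the returned dataframe under a name of our choice
rebound = toy.set_index("new_idx")

# With inplace=True, `toy` itself is modified and the call returns None
return_value = toy.set_index("new_idx", inplace=True)
index_after = toy.index.tolist()
```

Because an in-place call returns `None`, writing `toy = toy.set_index("new_idx", inplace=True)` would replace the dataframe with `None`; use either the assignment or `inplace=True`, not both.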