<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

In [None]:
import pandas as pd
import numpy as np
import datetime

pd.options.display.float_format = '{:,.2f}'.format

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Remove (drop) rows that meet a certain condition

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv"
)

data.head()

In the dataset, a couple of rows have a `Customer Id` of 0. These come from an undercover customer satisfaction officer who visits a store location and takes the survey. Since this isn't a real customer, let's remove those rows from the dataset.

First, let's see what those rows look like:

In [None]:
data.loc[ data['Customer Id'] == 0 ]

To drop rows that meet a condition, select only the rows that do *not* meet it, then assign the result back to the variable:

In [None]:
data.loc[ data['Customer Id'] != 0 ]

In [None]:
data = data.loc[ data['Customer Id'] != 0 ]

In [None]:
data

In [None]:
data[ data['Customer Id'] == 0 ]

In [None]:
data.info()

**QUESTION:** What's a possible problem with this approach?

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

**ANSWER:**

Filtering creates a new DataFrame that contains every row except the dropped ones. **For large DataFrames, this can be a very expensive operation!**
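As a minimal sketch of that point, the cell below uses a made-up three-row DataFrame (the values are illustrative and not taken from the survey file). It shows that boolean filtering hands back a separate object and leaves the original untouched until you reassign the variable.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey file
toy = pd.DataFrame({'Customer Id': [0, 17, 42], 'Score': [5, 3, 4]})

# Boolean filtering builds a separate DataFrame
filtered = toy.loc[toy['Customer Id'] != 0]

print(filtered is toy)            # the result is a different object
print(len(toy), len(filtered))    # the original still has every row
```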

### How do we do better?

In [None]:
dt = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

In [None]:
dt[ dt['Customer Id'] == 0 ]

In [None]:
id(dt)

### First, get the indices of all the rows we want to drop:

In [None]:
dt[ dt['Customer Id'] == 0 ].index

In [None]:
indices_to_drop = dt[ dt['Customer Id'] == 0 ].index
indices_to_drop
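A possible shortcut, sketched on a hypothetical toy DataFrame: you can index the row labels directly with the boolean mask, which skips building the intermediate filtered DataFrame just to read its `.index`.

```python
import pandas as pd

# Hypothetical toy data; only the column name comes from the survey file
toy = pd.DataFrame({'Customer Id': [0, 17, 0, 42]})

# Apply the boolean mask to the row labels themselves
mask = toy['Customer Id'] == 0
indices_to_drop = toy.index[mask]

print(list(indices_to_drop))
```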

### Then, just use the `drop()` method, passing the indices of the rows to remove:

In [None]:
dt.drop(indices_to_drop, inplace=True)

In [None]:
id(dt)

In [None]:
dt[ dt['Customer Id'] == 0 ]

In [None]:
dt.info()
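The same steps can be sketched end to end on a hypothetical toy DataFrame. Note that the `id()` comparison only shows that the variable still refers to the same Python object after the drop; it does not measure how much memory pandas uses internally.

```python
import pandas as pd

# Hypothetical toy data; only the column name comes from the survey file
toy = pd.DataFrame({'Customer Id': [0, 17, 42]})
before = id(toy)

# Drop the matching rows in place
toy.drop(toy.index[toy['Customer Id'] == 0], inplace=True)

print(id(toy) == before)             # same object as before the drop
print(toy['Customer Id'].tolist())   # the row with id 0 is gone
```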

In [None]:
# This will NOT work: it raises a KeyError because the rows
# with the specified indices were already dropped!

dt.drop(indices_to_drop, inplace=True)
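If a cell like this might be run more than once, one option is the `errors='ignore'` argument of `drop()`, which skips labels that no longer exist. A minimal sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data; only the column name comes from the survey file
toy = pd.DataFrame({'Customer Id': [0, 17, 42]})
indices_to_drop = toy.index[toy['Customer Id'] == 0]

# First call removes the matching row
toy.drop(indices_to_drop, inplace=True)

# Second call: the labels are gone, but errors='ignore' skips them quietly
toy.drop(indices_to_drop, inplace=True, errors='ignore')

print(len(toy))
```

The trade-off is that a typo in the labels would also be skipped silently, so the default (raising an error) is often the safer choice.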

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Drop duplicate rows

* data is rarely clean
* duplicates can pose a problem (e.g. when doing machine learning)
* we can easily drop them using `drop_duplicates()`

In [None]:
points = pd.DataFrame({
    'user_id': [1101, 1102, 1103, 1104, 1105, 1106, 1101],
    'total_points': [303, 304, 300, 250, 270, 300, 303],
    'country': ['US', 'UK', 'Australia', 'US', 'Germany', 'France', 'US']
})

points

In [None]:
points.drop_duplicates()
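Before dropping anything, it can help to look at which rows pandas considers duplicates. The sketch below uses `duplicated()`, which returns a boolean mask, on a copy of the same data under the assumed name `demo` so the `points` variable is left alone.

```python
import pandas as pd

# Same values as the points DataFrame above, under a separate name
demo = pd.DataFrame({
    'user_id': [1101, 1102, 1103, 1104, 1105, 1106, 1101],
    'total_points': [303, 304, 300, 250, 270, 300, 303],
    'country': ['US', 'UK', 'Australia', 'US', 'Germany', 'France', 'US']
})

# True marks a row that repeats an earlier row in every column
dupe_mask = demo.duplicated()

# Show only the rows that drop_duplicates() would remove
print(demo[dupe_mask])
```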

<br>

#### NOTE: `drop_duplicates` does not modify the DataFrame by default

In [None]:
points

<br>

#### If we want to persist the changes, we can use `inplace=True`

In [None]:
points.drop_duplicates(inplace=True)
points
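An alternative to `inplace=True` is to assign the result back to the variable. The sketch below does that on a hypothetical three-row DataFrame and also passes `ignore_index=True` (available in pandas 1.0 and later) so the remaining rows are renumbered from 0.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical
demo = pd.DataFrame({
    'user_id': [1101, 1101, 1102],
    'total_points': [303, 303, 304]
})

# Reassign instead of using inplace=True, and renumber the index
demo = demo.drop_duplicates(ignore_index=True)

print(demo)
```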

<br>

#### You can consider just a subset of the columns when finding duplicates

* by default, all columns must match in order for two rows to be considered duplicates

* we can actually specify which columns must match in order for two rows to be considered duplicates

In [None]:
points = pd.DataFrame({
    'user_id': [1101, 1102, 1103, 1104, 1105, 1106, 1101],
    'total_points': [303, 304, 300, 250, 270, 300, 302],
    'country': ['US', 'UK', 'Australia', 'US', 'Germany', 'France', 'US']
})

points

In [None]:
points.drop_duplicates()

In [None]:
points.drop_duplicates(subset=['user_id', 'country'])

In [None]:
points
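To make the difference concrete, here is a smaller sketch on a hypothetical three-row DataFrame, where user 1101 appears twice with different `total_points`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 differ only in total_points
demo = pd.DataFrame({
    'user_id': [1101, 1102, 1101],
    'total_points': [303, 304, 302],
    'country': ['US', 'UK', 'US']
})

# No row is an exact copy of another, so nothing is removed
print(len(demo.drop_duplicates()))

# Matching only on user_id and country treats rows 0 and 2 as duplicates
deduped = demo.drop_duplicates(subset=['user_id', 'country'])
print(deduped)
```

With the default settings, the first matching row is the one that survives, so the value 303 is kept and 302 is discarded.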

<br>

#### We can choose which of the duplicate rows to drop

In [None]:
points = pd.DataFrame({
    'user_id': [1101, 1102, 1103, 1104, 1105, 1106, 1101],
    'total_points': [303, 304, 300, 250, 270, 300, 303],
    'country': ['US', 'UK', 'Australia', 'US', 'Germany', 'France', 'US']
})

points

<br>

We can keep the **first row** of each set of duplicates and drop the rest:

In [None]:
points.drop_duplicates(keep='first')

<br>

We can keep the **last row** of each set of duplicates and drop the rest:

In [None]:
points.drop_duplicates(keep='last')

<br>

Or we can just drop all the rows that are duplicated (and keep none of them):

In [None]:
points.drop_duplicates(keep=False)
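The three `keep` options can be compared side by side by looking at which index labels survive. This is a minimal sketch on a hypothetical three-row DataFrame in which rows 0 and 2 are identical.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are exact duplicates
demo = pd.DataFrame({
    'user_id': [1101, 1102, 1101],
    'total_points': [303, 304, 303]
})

# Print the surviving index labels for each keep option
for keep in ('first', 'last', False):
    surviving = demo.drop_duplicates(keep=keep).index.tolist()
    print(keep, surviving)
```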

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Update rows that meet a certain condition

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv"
)

data.head()

If you look at the `Location` column, you'll see that some of the rows contain `NaN` ("not a number", which pandas uses to mark missing values). Let's replace it with the text `missing`.

### To get rows that have empty values in a column, use `isnull()`

In [None]:
data['Location']

In [None]:
data['Location'].isnull()

In [None]:
data[ data['Location'].isnull() ]

Now let's update these rows. We will use `.loc` for this:

#### QUESTION: 
What will this code return?

In [None]:
data['Location']

What about this code?

In [None]:
data.loc[ data['Location'].isnull() ]

What about this code?

In [None]:
data.loc[ data['Location'].isnull(), 'Location']

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
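The three expressions above differ in what kind of object they return. A minimal sketch on a hypothetical toy DataFrame (the city names and the `Score` column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Location
demo = pd.DataFrame({
    'Location': ['Austin', np.nan, 'Boston'],
    'Score': [4, 5, 3]
})

# Whole column: a Series with every row
col = demo['Location']

# Rows with a missing Location, all columns: a DataFrame
rows = demo.loc[demo['Location'].isnull()]

# Rows with a missing Location, one column: a Series
cells = demo.loc[demo['Location'].isnull(), 'Location']

print(type(col).__name__, type(rows).__name__, type(cells).__name__)
print(len(col), len(rows), len(cells))
```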

### To update these rows, we can then simply assign a value to the `Location` column

In [None]:
data.loc[ data['Location'].isnull(), 'Location'] = 'missing'

In [None]:
data.head()

In [None]:
data[data['Location'].isnull()]

In [None]:
data.info()
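For the specific case of replacing missing values, `fillna()` is another option that gives the same result as the `.loc` assignment. A minimal sketch on a hypothetical toy DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Location
demo = pd.DataFrame({'Location': ['Austin', np.nan, 'Boston']})

# Replace every missing value in the column with the text 'missing'
demo['Location'] = demo['Location'].fillna('missing')

print(demo['Location'].tolist())
```

The `.loc` approach is more general, since it works for any condition and not only for missing values.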

In [None]:
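# Combine conditions with & (each one wrapped in parentheses),
# then set 'Overall Satisfaction' to 3 for the matching rows
data.loc[ (data['Helpfulness'] > 2) & (data['Courtesy'] > 2), 'Overall Satisfaction'] = 3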

In [None]:
data.head()
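The multi-condition update can be easier to follow on a small example. The sketch below uses a hypothetical three-row DataFrame with the same column names and stores the combined condition in a variable (the name `both_high` is an assumption) before using it in `.loc`.

```python
import pandas as pd

# Hypothetical toy data using the survey's column names
demo = pd.DataFrame({
    'Helpfulness': [3, 1, 3],
    'Courtesy': [3, 3, 2],
    'Overall Satisfaction': [1, 1, 1]
})

# Each comparison needs its own parentheses because & binds tighter than >
both_high = (demo['Helpfulness'] > 2) & (demo['Courtesy'] > 2)

# Only rows where both conditions hold are updated
demo.loc[both_high, 'Overall Satisfaction'] = 3

print(demo['Overall Satisfaction'].tolist())
```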