# Cleaning Data

## Summary
In this notebook, we'll be covering:
- [Dropping blank/empty cells](#Dropping-Blank-Cells)
- [Filling in blank/empty cells](#Filling-in-Blank-Cells)
- [Handling duplicates](#Dealing-with-Duplicates)
- [Fixing incorrect data types](#Fixing-Incorrect-Data-Types)

### Introduction

Data you read in often has problems: missing values, entry errors, and type errors. Some of these issues are really filtering problems. For instance, a person whose weight is listed as 1555 pounds is probably a 155 lb person whose record was entered with one "5" too many. This sort of issue is best addressed by filtering out any weight that is too extreme, which involves our next topic, filtering (and, potentially, summary statistics to detect outliers). In this section we will cover more basic data cleaning, such as dropping rows with empty data. In fact, that is exactly where we will start. The code below creates the dataframe from the last notebook, but now with some data missing. (Duplicated rows are added later, in the section on duplicates.)

In [1]:
import pandas as pd
import random
import numpy as np

### Dropping Blank Cells

In [2]:
workout_dict = {'ID': [], 'Measurement Device': [], 'Heart Rate Max': [], 'Heart Rate Min': [], 'Heart Rate Avg': [],
              'Duration of exercise (min)': [], 'Exercise Type': []}
used_ids = []

for index in range(0, 500):
    workout_id = random.randint(100_000_000, 999_999_999)
    while workout_id in used_ids:
        workout_id = random.randint(100_000_000, 999_999_999)
    used_ids.append(workout_id)
    device = random.choice(['Skykandal', 'B-Wolf'])
    mu = random.randint(65, 85)
    min_rate = int(random.gauss(mu, 10))
    max_rate = int(random.gauss(mu + 55, 25))
    while max_rate <= min_rate:
        max_rate = int(random.gauss(mu + 55, 25))
    avg = random.gauss((max_rate + min_rate) / 2, (max_rate - min_rate) / 5)
    duration = random.randint(10, 90)
    exercise = random.choice(['Running', 'Running', 'Running', 'Bicycling', 'Swimming', 'Swimming',
                              'Weight training'])
    row = [device, min_rate, max_rate, avg, duration, exercise]
    # make blank cells, always make a blank on row 3 where we can see it
    if index == 2 or random.randint(0, 3) == 0:
        i = random.randint(0, 5)
        row.pop(i)
        row.insert(i, np.nan)
        # chance that more cells will be blank
        additional_blank = random.randint(0, 3)
        while additional_blank == 0:
            i = random.randint(0, 5)
            row.pop(i)
            row.insert(i, np.nan)
            additional_blank = random.randint(0, 3)
    # very rarely, but always on row 31, the whole row will be missing data
    if random.randint(0, 50) == 0 or index == 30:
        workout_id = np.nan
        row = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
    workout_dict['ID'].append(workout_id)
    workout_dict['Measurement Device'].append(row[0])
    workout_dict['Heart Rate Min'].append(row[1])
    workout_dict['Heart Rate Max'].append(row[2])
    workout_dict['Heart Rate Avg'].append(row[3])
    if np.isnan(row[4]):
        workout_dict['Duration of exercise (min)'].append(row[4])
    else:
        workout_dict['Duration of exercise (min)'].append(str(row[4]))
    workout_dict['Exercise Type'].append(row[5])

df = pd.DataFrame(workout_dict)
df.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,651634082.0,,86.0,63.0,72.3372,12.0,Bicycling
1,233795711.0,Skykandal,133.0,69.0,117.699453,85.0,Swimming
2,379985044.0,B-Wolf,,72.0,110.58738,,Running
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27.0,Swimming
4,830742209.0,Skykandal,141.0,88.0,97.060477,29.0,Bicycling
5,179659191.0,Skykandal,119.0,84.0,102.703314,,Running
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54.0,Running
7,612087150.0,B-Wolf,104.0,70.0,80.662931,72.0,Weight training
8,213506940.0,B-Wolf,183.0,76.0,124.622657,21.0,Swimming
9,629210912.0,Skykandal,93.0,66.0,79.422937,73.0,Weight training


Let's break that code block down so that we can get a better understanding of what's happening.

```Python
workout_dict = {'ID': [], 'Measurement Device': [], 'Heart Rate Max': [], 'Heart Rate Min': [], 'Heart Rate Avg': [],
              'Duration of exercise (min)': [], 'Exercise Type': []}
used_ids = []
```
The first thing we're doing is initializing the `workout_dict` and `used_ids` variables. `workout_dict` has seven keys and each key has an empty list for its value. `used_ids` is simply an empty list.

```Python
for index in range(0, 500):
    workout_id = random.randint(100_000_000, 999_999_999)
    while workout_id in used_ids:
        workout_id = random.randint(100_000_000, 999_999_999)
    used_ids.append(workout_id)
```
Here, we're looping 500 times, and each iteration adds a new row to our dataset. We use Python's `random` library to generate an integer in the range [100_000_000, 999_999_999]. The underscores (`_`) in the numbers don't change the value; they just make long numbers easier to read. An important thing to note is that, unlike `range`, `randint` is inclusive at both ends, which means the code can generate any number up to _and_ including 999_999_999. We then check whether `workout_id` is already in our `used_ids` list and, if so, keep generating a new one until we have one that is unique. Once we have a unique `workout_id`, we append it to `used_ids`.
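
To make the inclusive/exclusive difference concrete, here is a small sketch. The seed and the tiny range are chosen only for illustration. It also shows `random.sample`, which is one alternative way to draw unique IDs in a single call; that is not what the generator above does, just an option worth knowing about.

```python
import random

random.seed(0)  # fixed seed only so the illustration is repeatable

# randint includes both endpoints, so 3 can be drawn
draws = {random.randint(1, 3) for _ in range(1000)}

# range excludes its stop value, so 3 never appears
range_values = list(range(1, 3))

# random.sample draws without replacement, so no ID can repeat
ids = random.sample(range(100_000_000, 1_000_000_000), k=5)

print(sorted(draws))
print(range_values)
print(len(set(ids)))
```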

```Python
    device = random.choice(['Skykandal', 'B-Wolf'])
    mu = random.randint(65, 85)
    min_rate = int(random.gauss(mu, 10))
    max_rate = int(random.gauss(mu + 55, 25))
    while max_rate <= min_rate:
        max_rate = int(random.gauss(mu + 55, 25))
    avg = random.gauss((max_rate + min_rate) / 2, (max_rate - min_rate) / 5)
    duration = random.randint(10, 90)
    exercise = random.choice(['Running', 'Running', 'Running', 'Bicycling', 'Swimming', 'Swimming',
                              'Weight training'])
    row = [device, min_rate, max_rate, avg, duration, exercise]
```
The Python `random` library also has a couple of other functions that are useful to us, namely `choice()` and `gauss()`. `choice()` randomly selects an element from the passed-in sequence. `gauss()` randomly samples a number from a normal distribution, where `mu` (the mean) and `sigma` (the standard deviation) are the two parameters. Similar to how we kept picking `workout_id`s until we selected one that was unique, here we keep selecting a `max_rate` until we get one that's larger than the `min_rate`. Once we've randomly selected all of our values, we store them in a list and assign that list to `row`.
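
One detail worth noticing is that the exercise list repeats some entries. Because `choice()` picks each list position with equal probability, repeating an entry makes it more likely to be chosen. The sketch below (seed and sample sizes are arbitrary, chosen for illustration) counts the results and also checks that `gauss()` samples cluster around `mu`.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed only so the illustration is repeatable

exercises = ['Running', 'Running', 'Running', 'Bicycling', 'Swimming', 'Swimming',
             'Weight training']

# 'Running' fills 3 of 7 slots, so it should come up roughly 3/7 of the time
counts = Counter(random.choice(exercises) for _ in range(7000))

# samples from gauss(mu, sigma) should average out close to mu
samples = [random.gauss(75, 10) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)

print(counts.most_common())
print(round(sample_mean, 1))
```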

```Python
    # make blank cells, always make a blank on row 3 where we can see it
    if index == 2 or random.randint(0, 3) == 0:
        i = random.randint(0, 5)
        row.pop(i)
        row.insert(i, np.nan)
        # chance that more cells will be blank
        additional_blank = random.randint(0, 3)
        while additional_blank == 0:
            i = random.randint(0, 5)
            row.pop(i)
            row.insert(i, np.nan)
            additional_blank = random.randint(0, 3)
```
Like the comment says, this block of code produces blank/empty cells, or `NaN`s (Not a Number). The first `if` statement always guarantees at least one `NaN` on row 3 (`index == 2`). It also gives a 1/4 chance on any other row (`random.randint(0, 3) == 0`). `pop()` removes (and returns) the element at a particular index, so that we can insert our `NaN` in that location with `row.insert(i, np.nan)`. Then we draw a new random number between [0, 3], and if (and while) that number is `0`, we insert more `NaN`s. 
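
As a side note, `pop()` followed by `insert()` at the same index is equivalent to assigning to that index directly. The sketch below uses a made-up row to show that both approaches leave the list in the same state.

```python
import numpy as np

# a made-up row, in the same layout as the generator's `row` list
row_a = ['B-Wolf', 72, 130, 101.5, 45, 'Running']
row_b = list(row_a)  # independent copy for comparison
i = 2

# approach used in the notebook: remove the element, then insert NaN in its place
row_a.pop(i)
row_a.insert(i, np.nan)

# equivalent shortcut: overwrite the element directly
row_b[i] = np.nan

print(row_a)
print(row_b)
```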

```Python
    # very rarely, but always on row 31, the whole row will be missing data
    if random.randint(0, 50) == 0 or index == 30:
        workout_id = np.nan
        row = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
```
Similar to the code block above, this guarantees a whole row of `NaN`s on row 31 (`index == 30`). It also gives a 1/51 chance on any row of being fully `NaN`s.

```Python
    workout_dict['ID'].append(workout_id)
    workout_dict['Measurement Device'].append(row[0])
    workout_dict['Heart Rate Min'].append(row[1])
    workout_dict['Heart Rate Max'].append(row[2])
    workout_dict['Heart Rate Avg'].append(row[3])
    if np.isnan(row[4]):
        workout_dict['Duration of exercise (min)'].append(row[4])
    else:
        workout_dict['Duration of exercise (min)'].append(str(row[4]))
    workout_dict['Exercise Type'].append(row[5])
```
This last part is pretty straightforward. We're just appending each row element to its respective `dict` entry. The only change is with the exercise duration, where we store the value as a `str` if it's not a `NaN`. (Storing numbers as text is what causes the dtype problem we fix in the last section of this notebook.)

```Python
df = pd.DataFrame(workout_dict)
df.head(10)
```
The only part that's left is to create the DataFrame and view the first 10 rows. We're passing in `workout_dict` to `pd.DataFrame()`, and since it's a `dict` with `str` keys and `list` values, `pandas` automatically knows to create a DataFrame from it using the keys as column names and the values as column values.
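
Here is the same idea on a tiny, made-up dictionary. It also shows what happens when the lists are not all the same length: pandas raises a `ValueError`, which is why the generator above appends exactly one value to every key on each loop.

```python
import pandas as pd

# keys become column names, list values become the column contents
toy_dict = {'ID': [1, 2, 3], 'Device': ['A', 'B', 'A']}
toy_df = pd.DataFrame(toy_dict)
print(toy_df)

# lists of unequal length cannot form a rectangular table
length_error = None
try:
    pd.DataFrame({'ID': [1, 2, 3], 'Device': ['A', 'B']})
except ValueError as err:
    length_error = str(err)
print(length_error)
```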

In [3]:
df.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,651634082.0,,86.0,63.0,72.3372,12.0,Bicycling
1,233795711.0,Skykandal,133.0,69.0,117.699453,85.0,Swimming
2,379985044.0,B-Wolf,,72.0,110.58738,,Running
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27.0,Swimming
4,830742209.0,Skykandal,141.0,88.0,97.060477,29.0,Bicycling
5,179659191.0,Skykandal,119.0,84.0,102.703314,,Running
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54.0,Running
7,612087150.0,B-Wolf,104.0,70.0,80.662931,72.0,Weight training
8,213506940.0,B-Wolf,183.0,76.0,124.622657,21.0,Swimming
9,629210912.0,Skykandal,93.0,66.0,79.422937,73.0,Weight training


In the output above you will see one or more cells that read `NaN` rather than a value. In this case, your data is generated randomly so we don't know exactly how many values are missing (although there will always be something missing on row three), but given the code that generated it, roughly one-quarter of all rows should be missing a value.

First, how would we detect missing values? Dataframes have an `isna` method that returns either a `True` (the cell is blank) or a `False` (the cell has a value). Let's run this on row three (index 2, which is coded to always have at least one `NaN` value).

In [4]:
df.loc[2].isna()

ID                            False
Measurement Device            False
Heart Rate Max                 True
Heart Rate Min                False
Heart Rate Avg                False
Duration of exercise (min)     True
Exercise Type                 False
Name: 2, dtype: bool

What you should see is that for any column that is `NaN` you got a `True` and for the other columns you got `False`. We could also run this on the whole dataframe.

In [5]:
df.isna()

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,False,True,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,True,False,False,True,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
495,False,False,False,False,False,False,False
496,False,False,False,False,False,False,False
497,False,False,False,False,False,False,False
498,False,False,False,False,False,False,False


This is less useful, since we now have the entire dataframe turned into `True` and `False`, but using `sum` will help. This will treat every `True` as 1 and every `False` as 0.

In [6]:
df.isna().sum()

ID                            12
Measurement Device            35
Heart Rate Max                41
Heart Rate Min                35
Heart Rate Avg                35
Duration of exercise (min)    43
Exercise Type                 25
dtype: int64

This should provide you with column names followed by a number. That number is the number of `NaN`s in that column. One quick sanity check is that the number of `NaN`s in ID should be much lower than in the other columns, since the generator that made this data never gives ID a null value unless the whole row is null.
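
The same `isna` result can answer a few related questions. The sketch below uses a small, made-up frame (the column names `a` and `b` are just placeholders) to count `NaN`s per column, rows with at least one `NaN`, rows that are entirely `NaN`, and the fraction missing in each column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, None],
})

missing = toy.isna()

per_column = missing.sum()                 # NaNs in each column
rows_with_any = missing.any(axis=1).sum()  # rows with at least one NaN
rows_all_nan = missing.all(axis=1).sum()   # rows that are entirely NaN
fraction_missing = missing.mean()          # share of each column that is NaN

print(per_column)
print(rows_with_any, rows_all_nan)
print(fraction_missing)
```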

This method has let us grasp the scope of the problem. Let's start getting rid of `NaN`s.

The key method here is `dropna` which takes a few arguments. We won't cover all of them, since some are for more advanced usage, but the essential ones are `axis`, `how`, and `inplace`. `axis` defaults to 0, which means `dropna` will act row-by-row. 1 would be column-by-column. `how` can take one of two values: `any`, which drops any row or column (based on the value of `axis`) which has any `NaN`s, or `all` which drops a row/column only if it is all `NaN`s. It defaults to `any`. Our old friend `inplace` functions as before: `True` modifies the current dataframe, `False` makes a new, modified, dataframe. It defaults to `False`.
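
Before trying these on the workout data, it may help to see all four combinations of `axis` and `how` on a tiny, made-up frame where the answer is easy to check by eye.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [4.0, 5.0, np.nan],
    'c': [7.0, 8.0, np.nan],
})

rows_any = toy.dropna()                    # keeps only rows with no NaN at all
rows_all = toy.dropna(how='all')           # drops only the fully-NaN last row
cols_any = toy.dropna(axis=1)              # every column has a NaN, so none survive
cols_all = toy.dropna(axis=1, how='all')   # no column is entirely NaN, so all survive

print(rows_any.shape, rows_all.shape, cols_any.shape, cols_all.shape)
print(toy.shape)  # the original is untouched because inplace defaults to False
```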

So, let's drop any row which is all `NaN`s. That's a garbage row, and I doubt you'll ever see such a thing in the All of Us data. In this case, we'll be using `inplace=False` so we don't alter our dataframe, and can see what different sorts of drops would do.

In [7]:
# this could also be written:
# df.dropna(axis=0, how='all', inplace=False)
# however, axis=0 and inplace=False are the defaults, so we can just not write those arguments
df_dropped = df.dropna(how='all')
print(df_dropped['ID'].isna().sum())
print(len(df_dropped['Heart Rate Max']))

0
488


You should see two numbers printed above. The first should be zero, and the second should be less than 500. In fact, it should be less than 500 by the same amount as the number of `NaN`s in ID, because `NaN` only appears in ID if the whole row is `NaN`. That second number is the size of the data set once we drop the all-`NaN` rows.

What would happen if we changed the axis?

#### Try this below, using `axis=1` and `how='all'`. Use `head` on your new dataframe to see what it looks like now.

In [8]:
# put your code here


This doesn't look any different from the original dataframe. Why? Because there's no column that is entirely `NaN`s. So, let's modify the code to use `how='any'`. Since this is the default option we could just leave the `how` argument off, but I'll specify it here because that's clearer to read.

In [8]:
df_2 = df.dropna(axis=1, how='any')
df_2.head()

0
1
2
3
4


Lovely dataframe, isn't it?

So, what happened? Well, we dropped every column in the initial dataframe that had any `NaN`s, which was all of them.

#### Below, try to write code that drops every row that is all `NaN`s and then modifies that dataframe by dropping all columns with any `NaN`s. 
When you use `head` you should see that only the ID column remains, since by dropping all the all-`NaN` rows first you will remove all the `NaN`s from ID. Use that to check that you have done this correctly.

In [10]:
# your code goes here


We might also want to drop rows with `NaN`s in some columns but not others. The `subset` argument of `dropna` lets us specify which columns to check for `NaN`s. The code below drops rows with a `NaN` in Heart Rate Max only, and then verifies the result.

In [9]:
df2 = df.dropna(subset=['Heart Rate Max'])

print(df2['Heart Rate Max'].isna().sum())
df2.head(10)

0


Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,651634082.0,,86.0,63.0,72.3372,12.0,Bicycling
1,233795711.0,Skykandal,133.0,69.0,117.699453,85.0,Swimming
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27.0,Swimming
4,830742209.0,Skykandal,141.0,88.0,97.060477,29.0,Bicycling
5,179659191.0,Skykandal,119.0,84.0,102.703314,,Running
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54.0,Running
7,612087150.0,B-Wolf,104.0,70.0,80.662931,72.0,Weight training
8,213506940.0,B-Wolf,183.0,76.0,124.622657,21.0,Swimming
9,629210912.0,Skykandal,93.0,66.0,79.422937,73.0,Weight training
10,272333999.0,Skykandal,154.0,71.0,,35.0,Bicycling
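
`subset` also combines with `how`. The sketch below, on a made-up frame with placeholder column names, contrasts the default `how='any'` with `how='all'` when two columns are listed. Note that the all-`NaN` column `z` is ignored in both cases because it is not in `subset`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'x': [1.0, np.nan, np.nan, 4.0],
    'y': [np.nan, 2.0, np.nan, 5.0],
    'z': [np.nan, np.nan, np.nan, np.nan],
})

# default how='any': a row is dropped if x OR y is missing
any_missing = toy.dropna(subset=['x', 'y'])

# how='all': a row is dropped only if x AND y are both missing
both_missing = toy.dropna(subset=['x', 'y'], how='all')

print(any_missing.index.tolist())
print(both_missing.index.tolist())
```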


#### Below, write code that drops all rows where either ID or Measurement Device is `NaN`.

In [12]:
# your code goes here


### Filling in Blank Cells

Another useful tool is `fillna`. This lets us replace `NaN` with some value. Obviously, whether this is a good idea depends on what you're doing with the dataframe. However, let's imagine that it is a good idea to replace missing values in our data with 0. The code below will do this.

In [10]:
df2 = df.fillna(0)
df2.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,651634082.0,0,86.0,63.0,72.3372,12,Bicycling
1,233795711.0,Skykandal,133.0,69.0,117.699453,85,Swimming
2,379985044.0,B-Wolf,0.0,72.0,110.58738,0,Running
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27,Swimming
4,830742209.0,Skykandal,141.0,88.0,97.060477,29,Bicycling
5,179659191.0,Skykandal,119.0,84.0,102.703314,0,Running
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54,Running
7,612087150.0,B-Wolf,104.0,70.0,80.662931,72,Weight training
8,213506940.0,B-Wolf,183.0,76.0,124.622657,21,Swimming
9,629210912.0,Skykandal,93.0,66.0,79.422937,73,Weight training


We can see that what was previously `NaN` is now 0. `fillna` takes arguments much like `dropna`. The only required one is `value`, the first argument, which I set to 0 here; it is the value to fill in. Like several previous methods, `fillna` also takes the `inplace` argument (defaulting to `False`).

In the example above everything was being filled to 0. However, what if we want to fill `NaN` with 150 in the Heart Rate Max column and 100 in the Heart Rate Avg column? Then we pass `fillna` a dictionary.

In [11]:
df2 = df.fillna({'Heart Rate Max': 150, 'Heart Rate Avg': 100})
df2.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,651634082.0,,86.0,63.0,72.3372,12.0,Bicycling
1,233795711.0,Skykandal,133.0,69.0,117.699453,85.0,Swimming
2,379985044.0,B-Wolf,150.0,72.0,110.58738,,Running
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27.0,Swimming
4,830742209.0,Skykandal,141.0,88.0,97.060477,29.0,Bicycling
5,179659191.0,Skykandal,119.0,84.0,102.703314,,Running
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54.0,Running
7,612087150.0,B-Wolf,104.0,70.0,80.662931,72.0,Weight training
8,213506940.0,B-Wolf,183.0,76.0,124.622657,21.0,Swimming
9,629210912.0,Skykandal,93.0,66.0,79.422937,73.0,Weight training


#### For practice, write code that will fill NaNs in Measurement Device, and only Measurement Device, with some default value. As always, use head to check your work. Also, leave inplace alone, or your dataframe will be changed and that may impact the exercises below.

In [15]:
# your code goes here


There are some advanced uses of `fillna` which you can see in the documentation that the pandas project provides. For instance, passing a `limit` caps how many `NaN`s are filled in each column.

In [12]:
df2 = df.fillna(0, limit=1)
df2.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,651634082.0,0,86.0,63.0,72.3372,12.0,Bicycling
1,233795711.0,Skykandal,133.0,69.0,117.699453,85.0,Swimming
2,379985044.0,B-Wolf,0.0,72.0,110.58738,0.0,Running
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27.0,Swimming
4,830742209.0,Skykandal,141.0,88.0,97.060477,29.0,Bicycling
5,179659191.0,Skykandal,119.0,84.0,102.703314,,Running
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54.0,Running
7,612087150.0,B-Wolf,104.0,70.0,80.662931,72.0,Weight training
8,213506940.0,B-Wolf,183.0,76.0,124.622657,21.0,Swimming
9,629210912.0,Skykandal,93.0,66.0,79.422937,73.0,Weight training


As you should be able to see above, this filled only the first `NaN` in each column with a zero and left later ones in the same column alone. (Compare the Duration values in the rows with index 2 and 5.)
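
The per-column behavior is easier to see on a small, made-up frame where each column has exactly two `NaN`s.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [np.nan, 1.0, np.nan],
    'b': [np.nan, np.nan, 2.0],
})

# limit=1 fills at most one NaN per column, working from the top down
filled = toy.fillna(0, limit=1)
remaining = filled.isna().sum()

print(filled)
print(remaining)
```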

A more useful advanced case is dynamic fills. As we discussed in the Pandas Data Structures notebook, you can create a new column by setting it equal to some simple mathematical expression involving other columns. So, for instance, `df['Heart Rate Midpoint'] = (df['Heart Rate Max'] + df['Heart Rate Min']) / 2` would create a column called Heart Rate Midpoint which was the average of Heart Rate Max and Heart Rate Min. We can also use these expressions in `fillna`. In the code below, we'll fill any missing values in Heart Rate Avg with the average of the maximum and minimum heart rates.

In [13]:
df2 = df.fillna({'Heart Rate Max': 150, 'Heart Rate Avg': (df['Heart Rate Max'] + df['Heart Rate Min']) / 2})
df2.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,651634082.0,,86.0,63.0,72.3372,12.0,Bicycling
1,233795711.0,Skykandal,133.0,69.0,117.699453,85.0,Swimming
2,379985044.0,B-Wolf,150.0,72.0,110.58738,,Running
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27.0,Swimming
4,830742209.0,Skykandal,141.0,88.0,97.060477,29.0,Bicycling
5,179659191.0,Skykandal,119.0,84.0,102.703314,,Running
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54.0,Running
7,612087150.0,B-Wolf,104.0,70.0,80.662931,72.0,Weight training
8,213506940.0,B-Wolf,183.0,76.0,124.622657,21.0,Swimming
9,629210912.0,Skykandal,93.0,66.0,79.422937,73.0,Weight training
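
The first ten rows above happen to have no missing Heart Rate Avg values, so the dynamic fill is hard to see there. The sketch below shows the same technique on a made-up `orders` frame (all names here are hypothetical). The fill values are matched to the gaps by index label, and existing values are left alone.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'price': [2.0, 3.0, 4.0],
    'quantity': [5, 1, 2],
    'total': [10.0, np.nan, np.nan],
})

# a Series computed from other columns, one value per row
computed = orders['price'] * orders['quantity']

# only the NaN entries of 'total' are replaced, each by the value at the same index
orders['total (filled)'] = orders['total'].fillna(computed)

print(orders)
```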


Try this yourself.

#### Below, write code that fills NaNs in Heart Rate Min with a number that is 15 below the corresponding Heart Rate Avg.

In [18]:
# your code goes here


### Dealing with Duplicates
Sometimes you get duplicated data. This is especially common if you have built a dataset yourself by combining data from several sources. Whatever the cause, we need to deal with it.

First, let's make a mess.

In [14]:
dup_dict = {'ID': [], 'Measurement Device': [], 'Heart Rate Max': [], 'Heart Rate Min': [], 'Heart Rate Avg': [],
              'Duration of exercise (min)': [], 'Exercise Type': []}

for index, row in df.dropna().iterrows():
    for n in range(0, random.choice([1, 1, 1, 2, 3])):
        dup_dict['ID'].append(int(row['ID']))
        dup_dict['Measurement Device'].append(row['Measurement Device'])
        dup_dict['Heart Rate Min'].append(row['Heart Rate Min'])
        dup_dict['Heart Rate Max'].append(row['Heart Rate Max'])
        dup_dict['Heart Rate Avg'].append(row['Heart Rate Avg'])
        dup_dict['Duration of exercise (min)'].append(row['Duration of exercise (min)'])
        dup_dict['Exercise Type'].append(row['Exercise Type'])

dup_frame = pd.DataFrame(dup_dict)
dup_frame.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,233795711,Skykandal,133.0,69.0,117.699453,85,Swimming
1,870084742,B-Wolf,103.0,46.0,72.409598,27,Swimming
2,870084742,B-Wolf,103.0,46.0,72.409598,27,Swimming
3,870084742,B-Wolf,103.0,46.0,72.409598,27,Swimming
4,830742209,Skykandal,141.0,88.0,97.060477,29,Bicycling
5,339838135,B-Wolf,103.0,92.0,97.056334,54,Running
6,612087150,B-Wolf,104.0,70.0,80.662931,72,Weight training
7,612087150,B-Wolf,104.0,70.0,80.662931,72,Weight training
8,213506940,B-Wolf,183.0,76.0,124.622657,21,Swimming
9,629210912,Skykandal,93.0,66.0,79.422937,73,Weight training


The code above essentially copies `df` (minus any rows containing `NaN`s, because of the `dropna()` call) into a new dataframe, but sometimes copies a row two or three times instead of just once. The new frame, `dup_frame`, has duplicates.

In our case, we know that while lots of the data in our dataframe could be repeated elsewhere, the ID number should not. How many unique IDs do we have?

In [15]:
n_unique_ids = dup_frame['ID'].nunique()

print('Unique IDs:', n_unique_ids)
print('ID rows:', dup_frame['ID'].count())

Unique IDs: 370
ID rows: 574


`nunique` is a method that counts the number of unique entries in the given dataframe or series. As you can see, there are more rows than unique IDs!

(A related method, `unique`, lists the unique values. If you weren't sure what exercises were listed in Exercise Type, for instance, `df['Exercise Type'].unique()` would give you that list.)
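
If you want to look at the duplicates before removing them, the `duplicated` method returns a `True`/`False` mask marking every repeat of an earlier row. The sketch below uses a small, made-up frame to show `nunique`, `unique`, and `duplicated` side by side.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID': [1, 2, 2, 3, 3, 3],
    'value': ['a', 'b', 'b', 'c', 'c', 'c'],
})

n_unique = toy['ID'].nunique()   # how many distinct IDs
unique_ids = toy['ID'].unique()  # the distinct IDs themselves
dup_mask = toy.duplicated()      # True for each row that repeats an earlier row
n_extra = dup_mask.sum()         # number of surplus rows

print(n_unique, list(unique_ids), n_extra)
print(toy[dup_mask])             # the rows that would be dropped
```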

So, how do we get rid of these duplicates? The method `drop_duplicates`. Like `dropna` it takes an `inplace` argument, and in the code below we're leaving that as False so we can try other things with the original frame.

In [16]:
dropped = dup_frame.drop_duplicates()

n_unique_ids = dropped['ID'].nunique()
print('Unique IDs:', n_unique_ids)
print('ID rows:', dropped['ID'].count())

Unique IDs: 370
ID rows: 370


Now the number of rows and the number of unique IDs should be the same. By default, `drop_duplicates` considers all columns and only drops rows that are duplicates across all of them. We could restrict this if we wanted. Imagine that, for some odd reason, we only wanted one entry for each measurement device and exercise type combination. The `subset` argument would let us do that by passing a list of columns to use.

In [17]:
dropped2 = dup_frame.drop_duplicates(subset=['Measurement Device', 'Exercise Type'])
dropped2.head(8)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,233795711,Skykandal,133.0,69.0,117.699453,85,Swimming
1,870084742,B-Wolf,103.0,46.0,72.409598,27,Swimming
4,830742209,Skykandal,141.0,88.0,97.060477,29,Bicycling
5,339838135,B-Wolf,103.0,92.0,97.056334,54,Running
6,612087150,B-Wolf,104.0,70.0,80.662931,72,Weight training
9,629210912,Skykandal,93.0,66.0,79.422937,73,Weight training
10,437545117,Skykandal,159.0,54.0,99.431639,33,Running
39,745211841,B-Wolf,134.0,51.0,92.678622,38,Bicycling


Only the first row for each combination is kept.
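
Which row survives is controlled by the `keep` argument of `drop_duplicates`, which defaults to `'first'`. The sketch below, on a made-up frame, compares the three settings.

```python
import pandas as pd

toy = pd.DataFrame({
    'device': ['A', 'A', 'B', 'A'],
    'reading': [1, 2, 3, 4],
})

first = toy.drop_duplicates(subset=['device'])               # keep the first row per device
last = toy.drop_duplicates(subset=['device'], keep='last')   # keep the last row per device
none = toy.drop_duplicates(subset=['device'], keep=False)    # drop every device that repeats

print(first.index.tolist())
print(last.index.tolist())
print(none.index.tolist())
```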

#### Below, write a block of code that will return dup_frame minus any duplicated Heart Rate Avg entries (only). Since it's fairly unlikely that Heart Rate Avg will randomly hit exactly the same number twice the resulting number of rows should be very close to the number of rows after dropping duplicates without restriction.

In [23]:
# your code goes here


### Fixing Incorrect Data Types

Imagine that, for some reason, we want to know the half-length of the exercise. This should be easy. First, we'll drop all NaNs in that column, and then make a new column that is that column divided by 2.

The first two lines of code are a little housekeeping. We can't divide `NaN` by 2, so we'll drop all the `NaN`s in Duration of exercise (min). We'll also have to type the name of this column a lot, so we'll rename it to Duration, which is a lot less typing.

In [18]:
df.dropna(subset=['Duration of exercise (min)'], inplace=True)
df.rename(columns={'Duration of exercise (min)': 'Duration'}, inplace=True)

df['Half-duration'] = df['Duration'] / 2
df.head()

TypeError: unsupported operand type(s) for /: 'str' and 'int'

You'll have a long error message above, but the important parts are `TypeError`, and the line that clarifies: `TypeError: unsupported operand type(s) for /: 'str' and 'int'`.

(Also important: since the error occurred after our housekeeping rows, the `dropna` and `rename` lines both ran and have taken effect.) 

Why the `TypeError`? Let's check the types of the two columns (called "dtypes"). (Actually, we'll just print the dtypes of all columns, since this is faster.)

In [19]:
df.dtypes

ID                    float64
Measurement Device     object
Heart Rate Max        float64
Heart Rate Min        float64
Heart Rate Avg        float64
Duration               object
Exercise Type          object
dtype: object

As you can see, the dtypes for most columns are `float64`. The number matters less, but `float` is a floating-point number (a number that can support decimal places) and an `int` is an integer. So these are number types. Duration is an `object`. Why? Well, the code that generated it saved it as text, and so while you may see a 20 in a particular cell that's not the number 20, that's the text characters for 20. You can't divide text by a number, so we get a TypeError.
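
The difference between the text `'12'` and the number 12 shows up clearly with `+`, which is defined for text but means concatenation rather than addition. The sketch below uses a made-up series of text digits.

```python
import pandas as pd

durations = pd.Series(['12', '85', '27'])  # text that merely looks numeric

# pandas does not treat this column as numeric
is_numeric_before = pd.api.types.is_numeric_dtype(durations)

# "+" on text glues strings together instead of adding numbers
doubled_text = durations + durations

# converting a single value by hand shows what we actually wanted
half_of_first = int(durations.iloc[0]) / 2

print(is_numeric_before)
print(doubled_text.tolist())
print(half_of_first)
```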

How do we fix this? We tell pandas what type the column should be. There are two ways to do this. The `astype` method lets us specify a column and a dtype in a dictionary. (`astype` has no `inplace` argument; it always returns a new dataframe, so I am assigning the result to `fixed_df`, which leaves `df` unchanged and lets us try several things.)

In [20]:
fixed_df = df.astype({'Duration': 'int64'})
fixed_df.dtypes

ID                    float64
Measurement Device     object
Heart Rate Max        float64
Heart Rate Min        float64
Heart Rate Avg        float64
Duration                int64
Exercise Type          object
dtype: object

Now we should be able to run our operation on the new dataframe.

In [21]:
fixed_df['Half-duration'] = fixed_df['Duration'] / 2
fixed_df.head()

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration,Exercise Type,Half-duration
0,651634082.0,,86.0,63.0,72.3372,12,Bicycling,6.0
1,233795711.0,Skykandal,133.0,69.0,117.699453,85,Swimming,42.5
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27,Swimming,13.5
4,830742209.0,Skykandal,141.0,88.0,97.060477,29,Bicycling,14.5
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54,Running,27.0


Another way to do this would be to restrict our frame to the single column we want first.

In [22]:
df['Duration (fixed)'] = df['Duration'].astype('int64')
df.dtypes

ID                    float64
Measurement Device     object
Heart Rate Max        float64
Heart Rate Min        float64
Heart Rate Avg        float64
Duration               object
Exercise Type          object
Duration (fixed)        int64
dtype: object

However, this required me to know that I wanted (or could use) an `int64`. If all you want is a number the `to_numeric` function will figure out the type for you. Note that `to_numeric` is not a dataframe method, but a general pandas one, so we will write `pd.to_numeric()` not `df.to_numeric`. This also means we need to pass what we want to turn into numbers as an argument to `to_numeric`.

In [23]:
df['Duration, fixed again'] = pd.to_numeric(df['Duration'])
df.dtypes

ID                       float64
Measurement Device        object
Heart Rate Max           float64
Heart Rate Min           float64
Heart Rate Avg           float64
Duration                  object
Exercise Type             object
Duration (fixed)           int64
Duration, fixed again      int64
dtype: object

You can see here that `to_numeric` has picked a number type without us having to figure it out.

`to_numeric` has a useful error-handling mechanism. Let's look at this with a list that has some characters that won't become numbers correctly.

In [24]:
pd.to_numeric(['1', '32', 'a', '5'])

ValueError: Unable to parse string "a" at position 2

By default, `to_numeric` has its `errors` argument set to `raise`, which means that when it fails to convert something it stops and gives us an error (a `ValueError`, in this case).

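`errors` can also be set to `ignore`, which leaves invalid input unchanged, and `coerce`, which sets invalid input to `NaN`. The code below shows both of these. (Newer pandas releases deprecate `errors='ignore'`, so you may see a `FutureWarning` when you run the first cell.)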

In [25]:
pd.to_numeric(['1', '32', 'a', '5'], errors='ignore')


array(['1', '32', 'a', '5'], dtype=object)

In [26]:
pd.to_numeric(['1', '32', 'a', '5'], errors='coerce')

array([ 1., 32., nan,  5.])

In many cases, where you know you should have a number, `coerce` is the right setting, since it won't stop the code from running but considers a bad input to be a blank cell.
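
One caution with `coerce` is that bad input disappears silently. A simple habit, sketched below on a made-up series, is to compare the result against the original so you can see which values failed to convert.

```python
import pandas as pd

raw = pd.Series(['1', '32', 'a', '5'])
converted = pd.to_numeric(raw, errors='coerce')

# a value failed to convert if it is NaN now but was not missing before
bad_mask = converted.isna() & raw.notna()
bad_values = raw[bad_mask].tolist()

print(converted.tolist())
print(bad_values)
```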

#### Below, fix Duration properly, inplace.

In [33]:
# your code goes here


Less frequently, you want to force a number to be an object. The ID column in our dataframe treats the IDs as numbers, but we really want them to act like labels. (The code below simply shows that they are a numeric type.)

In [27]:
df.dtypes

ID                       float64
Measurement Device        object
Heart Rate Max           float64
Heart Rate Min           float64
Heart Rate Avg           float64
Duration                  object
Exercise Type             object
Duration (fixed)           int64
Duration, fixed again      int64
dtype: object

There's no simple shortcut for this, but using `astype` to turn the numbers into the `object` dtype will work.
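
A related pitfall, if you ever want the IDs as text labels rather than just the `object` dtype: our IDs are stored as floats, so converting them straight to text keeps the trailing `.0`. The sketch below, on two made-up values, assumes the column has no `NaN`s left (converting to `int64` fails otherwise) and goes through an integer type first.

```python
import pandas as pd

ids = pd.Series([651634082.0, 233795711.0])

# converting floats directly to text keeps the decimal part
direct = ids.astype(str)

# going through int64 first gives clean labels (requires no NaNs in the column)
via_int = ids.astype('int64').astype(str)

print(direct.tolist())
print(via_int.tolist())
```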

#### Below, write code that turns the ID column into the object dtype, and then verify that using df.dtypes.

In [35]:
# your code goes here


Time to put it all together in one final exercise! The code below will make a new column, with NaNs and some dtype issues. This column is supposed to be the day of the year (1 to 365) that the exercise took place.

In [28]:
days = []
for x in range(0, len(df['ID'])):
    if random.randint(0, 5) == 0:
        days.append(np.nan)
    else:
        days.append(str(random.randint(0, 364)))

df['Day'] = days
df.head()

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration,Exercise Type,Duration (fixed),"Duration, fixed again",Day
0,651634082.0,,86.0,63.0,72.3372,12,Bicycling,12,12,172
1,233795711.0,Skykandal,133.0,69.0,117.699453,85,Swimming,85,85,151
3,870084742.0,B-Wolf,103.0,46.0,72.409598,27,Swimming,27,27,200
4,830742209.0,Skykandal,141.0,88.0,97.060477,29,Bicycling,29,29,358
6,339838135.0,B-Wolf,103.0,92.0,97.056334,54,Running,54,54,144


Unfortunately, this column isn't 1 to 365; it follows Python conventions and is 0 to 364.

#### Below, for the last exercise, make a new column that fixes the issue in Day by adding 1 to every value in Day. To do this you will need to fix two other issues as well.

In [37]:
# your code goes here


<details>
<summary>Stuck? Click here to see a hint.</summary>

The first issue is the `NaN`s. Drop those out of the Day column so you can do math with just numbers.
</details>

<details>
<summary>Still stuck? Click this to see a second hint.</summary>

The second issue is the dtype. You'll need to change Day to a numeric dtype.
</details>