<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/033__Working_With_Missing_And_Duplicate_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 4/6: DATA CLEANING AND ANALYSIS

# MISSION 5: Working With Missing And Duplicate Data

Learn how to work with missing and duplicate data in pandas.

## 1. Introduction

As we near the end of our course, we'll cover a topic that's essential to any data cleaning workflow - handling missing and duplicate data.

Missing or duplicate data may exist in a data set for a number of different reasons. Sometimes, missing or duplicate data is introduced as we perform cleaning and transformation tasks such as:

- Combining data
- Reindexing data
- Reshaping data

Other times, it exists in the original data set for reasons such as:

- User input error
- Data storage or conversion issues

In the case of missing values, they may also exist in the original data set to purposely indicate that data is unavailable.



In the Pandas Fundamentals course, we learned that there are various ways to handle missing data:

- Remove any rows that have missing values.
- Remove any columns that have missing values.
- Fill the missing values with some other value.
- Leave the missing values as is.

In this mission, we'll explore each of these options in detail and learn when to use them. We'll work with the 2015, 2016, and 2017 World Happiness Reports again - more specifically, we'll combine them and clean missing values as we start to define a more complete data cleaning workflow. You can find the data sets [here](https://www.kaggle.com/unsdsn/world-happiness#2015.csv), along with descriptions of each of the columns.

In this mission, we'll work with modified versions of the data sets. Each data set has already been updated so that each contains the same countries. For example, if a country appeared in the original 2015 report, but not in the original 2016 report, a row like the one below was added to the 2016 data set:

|...|
|-|

You'll notice that we revisit some of the concepts we learned in previous missions, such as combining data and vectorized string methods. This is to start giving you a sense of how all the data cleaning concepts we've learned fit together and better prepare you to work on the guided project at the end of this course!

Let's start by gathering information about the dataframes.

**Instructions:**

We've already read in the modified 2015, 2016, and 2017 World Happiness Reports to the variables `happiness2015`, `happiness2016`, and `happiness2017`, respectively. We also updated each dataframe so that each contains the same countries, as described above.

- Use the `DataFrame.shape` attribute to confirm the number of rows and columns for `happiness2015`, `happiness2016`, and `happiness2017`.
 - Assign the result for `happiness2015` to `shape_2015`.
 - Assign the result for `happiness2016` to `shape_2016`.
 - Assign the result for `happiness2017` to `shape_2017`.

In [None]:
# Import files directly using Google Colab
# Download the files from the links below:
# wh_2015.csv: https://drive.google.com/file/d/1hGi74f9j_HLbNGTkIcphEaU84OkgWVPK/view?usp=sharing
# wh_2016.csv: https://drive.google.com/file/d/13OJS16M5C4qUdumDkm29bLTra8QX1DBq/view?usp=sharing
# wh_2017.csv: https://drive.google.com/file/d/1pNnEWXuwDZt1A4pMOvNXkD5mCNoDrbWg/view?usp=sharing

from google.colab import files
upload = files.upload()
upload = files.upload()
upload = files.upload()


In [None]:
# Import pandas and numpy libraries
import pandas as pd
import numpy as np

In [None]:
# Read the csv files
happiness2015 = pd.read_csv("wh_2015.csv")
happiness2016 = pd.read_csv("wh_2016.csv")
happiness2017 = pd.read_csv("wh_2017.csv")

In [None]:
happiness2015.head()

In [None]:
# result of exercise comes here

## 2. Identifying Missing Values

In the last exercise, we confirmed that each data set contains the same number of rows.

Recall that the dataframes were updated so that each contains the same countries, even if the happiness score, happiness rank, etc. were missing. However, that also means that each likely contains missing values, like the one we reviewed in the previous screen:

|...|
|---|

In pandas, missing values are generally represented by the `NaN` value, as seen in the dataframe above, or the `None` value.

However, it's good to note that pandas will not automatically identify values such as `-` or `--` as `NaN` or `None`, even though they may also indicate data is missing. By default, `pd.read_csv()` only converts a fixed set of strings (for example `NA`, `N/A`, `n/a`, and the empty string) to `NaN`. See [here](https://stackoverflow.com/questions/40011531/in-pandas-when-using-read-csv-how-to-assign-a-nan-to-a-value-thats-not-the#answer-40011736) for more information on how to use the `pd.read_csv()` function to read other placeholder values in as `NaN`.
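As a minimal sketch of that approach, the `na_values` parameter of `pd.read_csv()` accepts extra strings to treat as missing. The inline CSV text and its values below are made up for illustration and are not taken from the World Happiness Reports.

```python
import io
import pandas as pd

# Hypothetical CSV text in which "--" and "-" mark missing data
csv_text = "Country,Happiness Score\nNorway,7.5\nChad,--\nTogo,-\n"

# Without na_values, the placeholders are kept as ordinary strings
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, the same placeholders are read in as NaN
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["--", "-"])

print(raw["Happiness Score"].isnull().sum())
print(parsed["Happiness Score"].isnull().sum())
```

Comparing the two counts is a quick way to check whether a placeholder string is hiding missing data.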

Once we ensure that all missing values were read in correctly, we can use the `Series.isnull()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html) to identify rows with missing values:




In [None]:
missing = happiness2015['Happiness Score'].isnull()
happiness2015[missing]

However, when working with bigger data sets, it's easier to get a summary of the missing values as follows:

In [None]:
happiness2015.isnull().sum()

The result is a series in which:

- The index contains the names of the columns in `happiness2015`.
- The corresponding value is the number of null values in each column.

In `happiness2015`, all columns except for the `Country` and `Year` columns have six missing values.

Let's confirm the number of missing values in `happiness2016` and `happiness2017` next.



**Instructions:**

- Use the `DataFrame.isnull()` and `DataFrame.sum()` methods to confirm the number of missing values in `happiness2016`. Assign the result to `missing_2016`.
- Use the `DataFrame.isnull()` and `DataFrame.sum()` methods to confirm the number of missing values in `happiness2017`. Assign the result to `missing_2017`.

## 3. Correcting Data Cleaning Errors that Result in Missing Values

In the previous exercise, you should've confirmed that `happiness2016` and `happiness2017` also contain missing values in all columns except for `Country` and `Year`. It's good to check for missing values before transforming data to make sure we don't unintentionally introduce missing values.

If we do introduce missing values after transforming data, we'll have to determine if the data is really missing or if it's the result of some kind of error. As we progress through this mission, we'll use the following workflow to clean our missing values, starting with checking for errors:
1. Check for errors in data cleaning/transformation.
2. Use data from additional sources to fill missing values.
3. Drop row/column.
4. Fill missing values with reasonable estimates computed from the available data.

Let's return to a task we completed in a previous mission - combining the 2015, 2016, and 2017 World Happiness Reports. Recall that we can use the `pd.concat()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to combine them:

In [None]:
combined = pd.concat([happiness2015, happiness2016, happiness2017], ignore_index=True)

Next, let's check for missing values in `combined`:

In [None]:
combined.isnull().sum()

We can see above that our dataframe has many missing values and these missing values follow a pattern. Most columns fall into one of the following categories:

- 177 missing values (about 1/3 of the total values)
- 337 missing values (about 2/3 of the total values)

You may have also noticed that some of the column names differ only by punctuation, which caused the dataframes to be combined incorrectly:



- `Trust (Government Corruption)`
- `Trust..Government.Corruption.`
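The effect of mismatched column names can be reproduced on a small scale. The sketch below uses two made-up, one-row dataframes (the names `df_2015` and `df_2017` and the values are illustrative only) to show how `pd.concat()` treats differently punctuated names as separate columns.

```python
import pandas as pd

# Toy frames: the same measurement stored under two differently punctuated names
df_2015 = pd.DataFrame({"Country": ["Norway"], "Trust (Government Corruption)": [0.4]})
df_2017 = pd.DataFrame({"Country": ["Norway"], "Trust..Government.Corruption.": [0.3]})

stacked = pd.concat([df_2015, df_2017], ignore_index=True)

# pd.concat() aligns on column names, so each spelling gets its own column
# and the rows coming from the other frame are filled with NaN
print(stacked.isnull().sum())
```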

In the next exercise, we'll update the column names to make them uniform and combine the dataframes again. To clean the column names, we recommend using a technique we haven't covered yet, described in [this Stack Overflow answer](https://stackoverflow.com/questions/39741429/pandas-replace-a-character-in-all-column-names).

As you start to work on more data cleaning tasks, you'll inevitably encounter scenarios you don't know specifically how to handle. Stack Overflow is a good place to look for answers, since other people have likely already asked the same question and received answers.

As a reminder, below is a list of common string methods you can use to clean the columns:



|Method|Description|
|---|---|
|`Series.str.split()`|Splits each element in the Series.|
|`Series.str.strip()`|Strips whitespace from each string in the Series.|
|`Series.str.lower()`|Converts strings in the Series to lowercase.|
|`Series.str.upper()`|Converts strings in the Series to uppercase.|
|`Series.str.get()`|Retrieves the ith element of each element in the Series.|
|`Series.str.replace()`|Replaces a regex or string in the Series with another string.|
|`Series.str.cat()`|Concatenates strings in a Series.|
|`Series.str.extract()`|Extracts substrings from the Series matching a regex pattern.|

Let's clean the column names next.

**Instructions:**

We've already updated the column names for `happiness2017`.

- Update the column names for `happiness2015` and `happiness2016` to match the formatting of the column names in `happiness2017`. Use the following criteria to rename the columns:
 - All letters should be uppercase.
 - There should be only one space between words.
 - There should be no parentheses in column names.
 - For example, the `Health (Life Expectancy)` columns should both be renamed to `HEALTH LIFE EXPECTANCY`.
- Use the `pd.concat()` function to combine `happiness2015`, `happiness2016`, and `happiness2017`. Set the `ignore_index` argument equal to `True` to reset the index in the resulting dataframe. Assign the result to `combined`.
- Use the `DataFrame.isnull()` and `DataFrame.sum()` methods to check for missing values in `combined`. Assign the result to a variable named `missing`.
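One possible way to apply these criteria is sketched below on toy dataframes. The helper name `clean_columns` and the sample frames are hypothetical, and this is only one of several ways to chain the vectorized string methods on `df.columns`.

```python
import pandas as pd

# Toy frames whose column names differ only by punctuation
df_a = pd.DataFrame({"Country": ["Norway"], "Trust (Government Corruption)": [0.4]})
df_b = pd.DataFrame({"Country": ["Norway"], "Trust..Government.Corruption.": [0.3]})

def clean_columns(df):
    """Return a copy of df with uppercase, punctuation-free column names."""
    cols = (df.columns
            .str.replace(".", " ", regex=False)    # periods become spaces
            .str.replace("(", "", regex=False)     # drop opening parentheses
            .str.replace(")", "", regex=False)     # drop closing parentheses
            .str.replace(r"\s+", " ", regex=True)  # collapse repeated spaces
            .str.strip()
            .str.upper())
    out = df.copy()
    out.columns = cols
    return out

combined_toy = pd.concat([clean_columns(df_a), clean_columns(df_b)], ignore_index=True)
print(combined_toy.columns.tolist())
print(combined_toy.isnull().sum())
```

Once both frames share the same names, the concatenated result no longer contains the missing values that the mismatch introduced.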

## 4. Visualizing Missing Data

In the last exercise, we corrected some of the missing values by fixing the column names. Note that we could have cleaned the column names without changing the capitalization. It's good practice, however, to make the capitalization uniform, because a stray uppercase or lowercase letter could've reintroduced missing values.

We also confirmed there are still values missing:

```
COUNTRY                          0
DYSTOPIA RESIDUAL               22
ECONOMY GDP PER CAPITA          22
FAMILY                          22
FREEDOM                         22
GENEROSITY                      22
HAPPINESS RANK                  22
HAPPINESS SCORE                 22
HEALTH LIFE EXPECTANCY          22
LOWER CONFIDENCE INTERVAL      335
REGION                         177
STANDARD ERROR                 334
TRUST GOVERNMENT CORRUPTION     22
UPPER CONFIDENCE INTERVAL      335
WHISKER HIGH                   337
WHISKER LOW                    337
YEAR                             0
dtype: int64
```

We can learn more about where these missing values are located by visualizing them with a [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html), a graphical representation of our data in which values are represented as colors. We'll use the seaborn library to create the heatmap.

Note below that we first set the index to the `YEAR` column so that we'll be able to see the corresponding year on the left side of the heatmap:

In [None]:
import seaborn as sns
combined_updated = combined.set_index('YEAR')
sns.heatmap(combined_updated.isnull(), cbar=False)

To understand this visualization, imagine we took `combined`, highlighted missing values in light gray and all other values in black, and then shrunk it so that we could easily view the entire dataframe at once.

Since we concatenated `happiness2015`, `happiness2016`, and `happiness2017` by stacking them, note that the top third of the dataframe corresponds to the 2015 data, the second third corresponds to the 2016 data, and the bottom third corresponds to the 2017 data.



We can make the following observations:

- No values are missing in the `COUNTRY` column.
- There are some rows in the 2015, 2016, and 2017 data with missing values in all columns EXCEPT the `COUNTRY` column.
- Some columns only have data populated for one year.
- It looks like the `REGION` data is missing for the year 2017.

Let's check that the last statement is correct in the next exercise.

**Instructions:**

- Confirm that the `REGION` column is missing from the 2017 data. Recall that there are 164 rows for the year 2017.
 - Select just the rows in `combined` in which the `YEAR` column equals 2017. Then, select just the `REGION` column. Assign the result to `regions_2017`.
 - Use the `Series.isnull()` and `Series.sum()` methods to calculate the total number of missing values in `regions_2017`, the `REGION` column for 2017. Assign the result to `missing`.
- Use the variable inspector to view the results of `missing`. Are all 164 region values missing for the year 2017?
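The selection and count described above might look like the following sketch. It runs on a small made-up dataframe (`toy`) rather than on `combined`, so the counts differ from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `combined`: regions are present for 2016 but not for 2017
toy = pd.DataFrame({
    "COUNTRY": ["Norway", "Chad", "Norway", "Chad"],
    "YEAR": [2016, 2016, 2017, 2017],
    "REGION": ["Western Europe", "Sub-Saharan Africa", np.nan, np.nan],
})

# Boolean mask for the 2017 rows, then keep only the REGION column
regions_2017_toy = toy.loc[toy["YEAR"] == 2017, "REGION"]

# Count the missing values and compare with the number of 2017 rows
missing_toy = regions_2017_toy.isnull().sum()
print(missing_toy, len(regions_2017_toy))
```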

## 5. Using Data From Additional Sources to Fill in Missing Values

In the last exercise, we confirmed that the `REGION` column is missing from the 2017 data. Since we need the regions to analyze our data, let's turn our attention there next.

Before we drop or replace any values, let's first see if there's a way we can use other available data to correct the values.

1. Check for errors in data cleaning/transformation.
2. Use data from additional sources to fill missing values.
3. Drop row/column.
4. Fill missing values with reasonable estimates computed from the available data.

Recall once more that each year contains the same countries. Since the regions are fixed values - the region a country was assigned to in 2015 or 2016 won't change - we should be able to assign the 2015 or 2016 region to the 2017 row.

In order to do so, we'll use the following strategy:

1. Create a dataframe containing all of the countries and corresponding regions from the `happiness2015`, `happiness2016`, and `happiness2017` dataframes.
2. Use the `pd.merge()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) to assign the `REGION` in the dataframe above to the corresponding country in `combined`.
3. The result will have two region columns - the original column with missing values will be named `REGION_x`. The updated column without missing values will be named `REGION_y`. We'll drop `REGION_x` to eliminate confusion.

Note that there are other ways to complete this task. We encourage you to explore them on your own.



**Instructions:**

We've already created a dataframe named `regions` containing all of the countries and corresponding regions from the `happiness2015`, `happiness2016`, and `happiness2017` dataframes.

- Use the `pd.merge()` function to assign the `REGION` in the `regions` dataframe to the corresponding country in `combined`.
 - Set the `left` parameter equal to `combined`.
 - Set the `right` parameter equal to `regions`.
 - Set the `on` parameter equal to `'COUNTRY'`.
 - Set the `how` parameter equal to `'left'` to make sure we don't drop any rows from `combined`.
 - Assign the result back to `combined`.
- Use the `DataFrame.drop()` method to drop the original region column with missing values, now named `REGION_x`.
 - Pass `'REGION_x'` into the `df.drop()` method.
 - Set the `axis` parameter equal to `1`.
 - Assign the result back to `combined`.
- Use the `DataFrame.isnull()` and `DataFrame.sum()` methods to check for missing values. Assign the result to a variable named `missing`.
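The merge strategy can be sketched on toy data as follows. The frames `combined_toy` and `regions_toy` are made up for illustration; the final `rename()` call is an extra step, not part of the instructions above, that restores the original column name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `combined`, with REGION missing in the 2017 rows
combined_toy = pd.DataFrame({
    "COUNTRY": ["Norway", "Norway", "Chad"],
    "YEAR": [2015, 2017, 2017],
    "REGION": ["Western Europe", np.nan, np.nan],
})

# Toy lookup table with one region per country
regions_toy = pd.DataFrame({
    "COUNTRY": ["Norway", "Chad"],
    "REGION": ["Western Europe", "Sub-Saharan Africa"],
})

# A left merge keeps every row of combined_toy; because both frames have a
# REGION column, pandas suffixes them as REGION_x (left) and REGION_y (right)
merged = pd.merge(left=combined_toy, right=regions_toy, on="COUNTRY", how="left")

# Drop the incomplete column and give the complete one its original name
merged = merged.drop("REGION_x", axis=1).rename(columns={"REGION_y": "REGION"})
print(merged.isnull().sum())
```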

## 6. Identifying Duplicate Values

In the previous screen, we used the 2015 and 2016 data to fill in the missing region values for the 2017 data. Note that we renamed the corrected region column to `REGION` separately to avoid confusion in the following exercises.

Before we decide how to handle the rest of our missing values, let's first check our dataframe for duplicate rows.

We'll use the `DataFrame.duplicated()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) to check for duplicate values. If no parameters are specified, the method will check for any rows in which **all** columns have the same values.

Since we should only have one country for each year, we can be a little more thorough by defining rows with ONLY the same country and year as duplicates. To accomplish this, let's pass a list of the `COUNTRY` and `YEAR` column names into the `df.duplicated()` method:

In [None]:
dups = combined.duplicated(['COUNTRY', 'YEAR'])
combined[dups]

Since the dataframe is empty, we can tell that there are no rows with exactly the same country AND year.

However, one thing to keep in mind is that the `df.duplicated()` method will only look for exact matches, so if the capitalization for country names isn't exactly the same, they won't be identified as duplicates. To be extra thorough, we can first standardize the capitalization for the `COUNTRY` column and then check for duplicates again.

**Instructions:**

- Standardize the capitalization so that all the values in the `COUNTRY` column in `combined` are uppercase.
 - As an example, `'India'` should be changed to `'INDIA'`.
- Use the `df.duplicated()` method to identify any rows that have the same value in the `COUNTRY` and `YEAR` columns. Assign your result to `dups`.
- Use `dups` to index `combined`. Print the results.
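To illustrate why capitalization matters here, the sketch below checks for duplicates before and after standardizing a made-up `COUNTRY` column. The dataframe `toy` is hypothetical.

```python
import pandas as pd

# Toy data: the first two rows describe the same country and year,
# but the country name is capitalized differently
toy = pd.DataFrame({
    "COUNTRY": ["India", "INDIA", "Chad"],
    "YEAR": [2015, 2015, 2015],
})

# Exact matching finds no duplicates yet
before = toy.duplicated(["COUNTRY", "YEAR"])

# After uppercasing, the second row is flagged as a duplicate of the first
toy["COUNTRY"] = toy["COUNTRY"].str.upper()
after = toy.duplicated(["COUNTRY", "YEAR"])

print(before.sum(), after.sum())
print(toy[after])
```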

## 7. Correcting Duplicate Values

In the previous screen, we standardized the capitalization of the values in the `COUNTRY` column and identified that we actually do have three duplicate rows!

In [None]:
combined['COUNTRY'] = combined['COUNTRY'].str.upper()
dups = combined.duplicated(['COUNTRY', 'YEAR'])

Let's inspect all the rows for `SOMALILAND REGION` in `combined`.



In [None]:
combined[combined['COUNTRY'] == 'SOMALILAND REGION']

Now, we can see that there are two rows for 2015, 2016, and 2017 each.

Next, let's use the `df.drop_duplicates()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) to drop the duplicate rows. Like the `df.duplicated()` method, the `df.drop_duplicates()` method will, by default, define duplicates as rows in which **all** columns have the same values. We'll have to specify that rows with the same values in only the `COUNTRY` and `YEAR` columns should be dropped.



It's also important to note that by default, the `drop_duplicates()` method will only keep the first duplicate row. To keep the last duplicate row, set the `keep` parameter to `'last'`. Sometimes, this will mean sorting the dataframe before dropping the duplicate rows.

In our case, since the second duplicate row above contains more missing values than the first row, we'll keep the first row.

**Instructions:**

- Use the `df.drop_duplicates()` method to drop duplicate rows that share the same country and year. Assign the result back to `combined`.
 - Pass a list containing the `COUNTRY` and `YEAR` columns into the `drop_duplicates()` method.
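The effect of the `keep` parameter can be seen in a small sketch. The dataframe `toy` and its scores are made up; only the pattern (a complete row followed by a duplicate with missing data) mirrors the situation described above.

```python
import numpy as np
import pandas as pd

# Toy data: the second row duplicates the first on COUNTRY and YEAR
# but is missing its score
toy = pd.DataFrame({
    "COUNTRY": ["SOMALILAND REGION", "SOMALILAND REGION", "CHAD"],
    "YEAR": [2015, 2015, 2015],
    "HAPPINESS SCORE": [5.0, np.nan, 3.5],
})

# Default keep='first' retains the first of each duplicate group
first_kept = toy.drop_duplicates(["COUNTRY", "YEAR"])

# keep='last' retains the last one instead, here the row with the NaN
last_kept = toy.drop_duplicates(["COUNTRY", "YEAR"], keep="last")

print(first_kept)
print(last_kept)
```

When the row worth keeping is not reliably first or last, sorting the dataframe beforehand is one way to control which duplicate survives.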

## 8. Handle Missing Values by Dropping Columns

In an earlier exercise, we used the `df.drop()` method to drop a column we didn't need for our analysis, `REGION_x`.

However, as you start working with bigger datasets, it can sometimes be tedious to create a long list of column names to drop. Instead we can use the `DataFrame.dropna()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) to complete the same task.

By default, the `dropna()` method will drop rows with any missing values. To drop columns, we can set the `axis` parameter equal to `1`, just like with the `df.drop()` method:

```
df.dropna(axis=1)
```

However, this would result in dropping every column with any missing values, and we only want to drop certain columns. Instead, we can use the `thresh` parameter to drop only the columns that contain fewer than a given number of non-null values.

So far, we've used the `df.isnull()` method to confirm the number of missing values in each column. To confirm the number of values that are NOT missing, we can use the `DataFrame.notnull()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notnull.html):



In [None]:
combined.notnull().sum().sort_values()

Above, we can see that the columns we'd like to drop - `LOWER CONFIDENCE INTERVAL`, `STANDARD ERROR`, `UPPER CONFIDENCE INTERVAL`, `WHISKER HIGH`, and `WHISKER LOW` - only contain between 155 and 158 non-null values. As a result, we'll set the `thresh` parameter equal to 159 in the `df.dropna()` method to drop them.



**Instructions:**

- Use the `df.dropna()` method to drop all columns in `combined` with fewer than 159 non-null values.
 - Set the `thresh` argument equal to `159` and the `axis` parameter equal to `1`.
- Use the `df.isnull()` and `df.sum()` methods to calculate the number of missing values for each column. Assign the result to `missing`.
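A small sketch of how `thresh` behaves when dropping columns is shown below. The dataframe `toy` and the threshold of 3 are made up for illustration; the real exercise uses `combined` and a threshold of 159.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "COUNTRY": ["A", "B", "C", "D"],                 # 4 non-null values
    "HAPPINESS SCORE": [7.0, 6.0, np.nan, 5.0],      # 3 non-null values
    "WHISKER HIGH": [7.1, np.nan, np.nan, np.nan],   # 1 non-null value
})

# Non-null counts per column, sorted from fewest to most
print(toy.notnull().sum().sort_values())

# With axis=1, a column is kept only if it has at least `thresh` non-null values
kept = toy.dropna(thresh=3, axis=1)
print(kept.columns.tolist())
```

A column with exactly `thresh` non-null values is kept, which is why the threshold is set just above the largest count among the columns to be dropped.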

## 9. Handle Missing Values by Dropping Columns Continued


## 10. Analyzing Missing Data

## 11. Handling Missing Values with Imputation

## 12. Dropping Rows