
# COURSE 4/6: DATA CLEANING AND ANALYSIS

# MISSION 5: Working With Missing And Duplicate Data

Learn how to work with missing and duplicate data in pandas.

## 1. Introduction

As we near the end of our course, we'll cover a topic that's essential to any data cleaning workflow - handling missing and duplicate data.

Missing or duplicate data may exist in a data set for a number of different reasons. Sometimes, missing or duplicate data is introduced as we perform cleaning and transformation tasks such as:

- Combining data
- Reindexing data
- Reshaping data

Other times, it exists in the original data set for reasons such as:

- User input error
- Data storage or conversion issues

In the case of missing values, they may also exist in the original data set to purposely indicate that data is unavailable.



In the Pandas Fundamentals course, we learned that there are various ways to handle missing data:

- Remove any rows that have missing values.
- Remove any columns that have missing values.
- Fill the missing values with some other value.
- Leave the missing values as is.
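
As a quick refresher, the four options above map onto pandas calls roughly as follows. This is a minimal sketch on a small made-up dataframe; `df` and its values are illustrative only and are not part of the World Happiness data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the four options
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Score": [7.5, np.nan, 6.1],
    "Year": [2015, 2015, 2015],
})

# 1. Remove any rows that have missing values
rows_dropped = df.dropna()

# 2. Remove any columns that have missing values
cols_dropped = df.dropna(axis=1)

# 3. Fill the missing values with some other value (here, the column mean)
filled = df.fillna({"Score": df["Score"].mean()})

# 4. Leave the missing values as is
unchanged = df
```

Which option is appropriate depends on why the values are missing, which is what the rest of this mission works through.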

In this mission, we'll explore each of these options in detail and learn when to use them. We'll work with the 2015, 2016, and 2017 World Happiness Reports again - more specifically, we'll combine them and clean missing values as we start to define a more complete data cleaning workflow. You can find the data sets [here](https://www.kaggle.com/unsdsn/world-happiness#2015.csv), along with descriptions of each of the columns.

In this mission, we'll work with modified versions of the data sets. Each data set has already been updated so that each contains the same countries. For example, if a country appeared in the original 2015 report, but not in the original 2016 report, a row like the one below was added to the 2016 data set:

|...|
|-|

You'll notice that we revisit some of the concepts we learned in previous missions, such as combining data and vectorized string methods. This is to start giving you a sense of how all the data cleaning concepts we've learned fit together and better prepare you to work on the guided project at the end of this course!

Let's start by gathering information about the dataframes.

**Instructions:**

We've already read in the modified 2015, 2016, and 2017 World Happiness Reports to the variables `happiness2015`, `happiness2016`, and `happiness2017`, respectively. We also updated each dataframe so that each contains the same countries, as described above.

- Use the `DataFrame.shape` attribute to confirm the number of rows and columns for `happiness2015`, `happiness2016`, and `happiness2017`.
  - Assign the result for `happiness2015` to `shape_2015`.
  - Assign the result for `happiness2016` to `shape_2016`.
  - Assign the result for `happiness2017` to `shape_2017`.

In [1]:
# Import files directly using Google Colab
# Download the files from the links below:
# wh_2015.csv: https://drive.google.com/file/d/1hGi74f9j_HLbNGTkIcphEaU84OkgWVPK/view?usp=sharing
# wh_2016.csv: https://drive.google.com/file/d/13OJS16M5C4qUdumDkm29bLTra8QX1DBq/view?usp=sharing
# wh_2017.csv: https://drive.google.com/file/d/1pNnEWXuwDZt1A4pMOvNXkD5mCNoDrbWg/view?usp=sharing

from google.colab import files
upload = files.upload()
upload = files.upload()
upload = files.upload()

Saving wh_2015.csv to wh_2015.csv


Saving wh_2016.csv to wh_2016.csv


Saving wh_2017.csv to wh_2017.csv


In [2]:
# Import pandas and numpy libraries
import pandas as pd
import numpy as np

In [3]:
# Read the CSV files
happiness2015 = pd.read_csv("wh_2015.csv")
happiness2016 = pd.read_csv("wh_2016.csv")
happiness2017 = pd.read_csv("wh_2017.csv")

In [4]:
happiness2015.head()

| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Western Europe | 1.0 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 | 2015 |
| 1 | Iceland | Western Europe | 2.0 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.4363 | 2.70201 | 2015 |
| 2 | Denmark | Western Europe | 3.0 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 | 2015 |
| 3 | Norway | Western Europe | 4.0 | 7.522 | 0.0388 | 1.459 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 | 2015 |
| 4 | Canada | North America | 5.0 | 7.427 | 0.03553 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 | 2015 |


In [5]:
# result of exercise comes here
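
One way to approach this exercise is sketched below. So that the snippet runs on its own, it builds tiny stand-in dataframes (hypothetical values); in the notebook, `happiness2015`, `happiness2016`, and `happiness2017` are the frames read from the CSV files above.

```python
import pandas as pd

# Stand-in dataframes (hypothetical values) so the sketch is self-contained
happiness2015 = pd.DataFrame({"Country": ["A", "B"], "Year": [2015, 2015]})
happiness2016 = pd.DataFrame({"Country": ["A", "B"], "Year": [2016, 2016]})
happiness2017 = pd.DataFrame({"Country": ["A", "B"], "Year": [2017, 2017]})

# DataFrame.shape is an attribute (no parentheses) holding a (rows, columns) tuple
shape_2015 = happiness2015.shape
shape_2016 = happiness2016.shape
shape_2017 = happiness2017.shape
```

Comparing the first element of each tuple confirms whether the three dataframes have the same number of rows.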

## 2. Identifying Missing Values

In the last exercise, we confirmed that each data set contains the same number of rows.

Recall that the dataframes were updated so that each contains the same countries, even if the happiness score, happiness rank, etc. were missing. However, that also means that each likely contains missing values, like the one we reviewed in the previous screen:

|...|
|---|

In pandas, missing values are generally represented by the `NaN` value, as seen in the dataframe above, or the `None` value.

However, it's good to note that `pd.read_csv()` only converts a default set of strings (such as `NA`, `N/A`, and `n/a`) to `NaN`. It will not automatically identify other placeholders, such as `-` or `--`, even though they may also indicate that data is missing. See [here](https://stackoverflow.com/questions/40011531/in-pandas-when-using-read-csv-how-to-assign-a-nan-to-a-value-thats-not-the#answer-40011736) for more information on how to use the `pd.read_csv()` function to read those values in as `NaN`.
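
As a minimal sketch of that idea, the `na_values` parameter of `pd.read_csv()` accepts extra strings to treat as missing. The CSV text below is made up for illustration and is not taken from the World Happiness files.

```python
import io

import pandas as pd

# Hypothetical CSV text in which "--" and "-" mark missing data
raw = io.StringIO(
    "Country,Happiness Score\n"
    "Switzerland,7.587\n"
    "Belize,--\n"
    "Namibia,-\n"
)

# na_values adds these strings to the markers that read_csv converts to NaN
scores = pd.read_csv(raw, na_values=["--", "-"])
n_missing = scores["Happiness Score"].isnull().sum()
```

Without `na_values`, the column would be read in as strings, and `isnull()` would not flag the two placeholder rows.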

Once we ensure that all missing values were read in correctly, we can use the `Series.isnull()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html) to identify rows with missing values:




In [6]:
missing = happiness2015['Happiness Score'].isnull()
happiness2015[missing]

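| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 158 | Belize | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2015 |
| 159 | Namibia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2015 |
| 160 | Puerto Rico | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2015 |
| 161 | Somalia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2015 |
| 162 | Somaliland Region | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2015 |
| 163 | South Sudan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2015 |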


However, when working with bigger data sets, it's easier to get a summary of the missing values as follows:

In [7]:
happiness2015.isnull().sum()

Country                          0
Region                           6
Happiness Rank                   6
Happiness Score                  6
Standard Error                   6
Economy (GDP per Capita)         6
Family                           6
Health (Life Expectancy)         6
Freedom                          6
Trust (Government Corruption)    6
Generosity                       6
Dystopia Residual                6
Year                             0
dtype: int64

The result is a series in which:

- The index contains the names of the columns in `happiness2015`.
- The corresponding value is the number of null values in each column.

In `happiness2015`, all columns except for the `Country` and `Year` columns have six missing values.

Let's confirm the number of missing values in `happiness2016` and `happiness2017` next.



**Instructions:**

- Use the `DataFrame.isnull()` and `DataFrame.sum()` methods to confirm the number of missing values in `happiness2016`. Assign the result to `missing_2016`.
- Use the `DataFrame.isnull()` and `DataFrame.sum()` methods to confirm the number of missing values in `happiness2017`. Assign the result to `missing_2017`.
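
The same pattern shown for `happiness2015` carries over directly. The sketch below uses small stand-in dataframes with made-up values so that it runs on its own; in the notebook, the real `happiness2016` and `happiness2017` would be used instead.

```python
import numpy as np
import pandas as pd

# Stand-in dataframes (hypothetical values) so the sketch is self-contained
happiness2016 = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["X", np.nan, "Y"],
    "Year": [2016, 2016, 2016],
})
happiness2017 = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness.Score": [7.5, np.nan, np.nan],
    "Year": [2017, 2017, 2017],
})

# isnull() gives a boolean dataframe; sum() counts the True values per column
missing_2016 = happiness2016.isnull().sum()
missing_2017 = happiness2017.isnull().sum()
```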

## 3. Correcting Data Cleaning Errors that Result in Missing Values

In the previous exercise, you should've confirmed that `happiness2016` and `happiness2017` also contain missing values in all columns except for `Country` and `Year`. It's good to check for missing values before transforming data to make sure we don't unintentionally introduce missing values.

If we do introduce missing values after transforming data, we'll have to determine if the data is really missing or if it's the result of some kind of error. As we progress through this mission, we'll use the following workflow to clean our missing values, starting with checking for errors:

1. Check for errors in data cleaning/transformation.
2. Use data from additional sources to fill missing values.
3. Drop row/column.
4. Fill missing values with reasonable estimates computed from the available data.

Let's return to a task we completed in a previous mission - combining the 2015, 2016, and 2017 World Happiness Reports. Recall that we can use the `pd.concat()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to combine them:

In [8]:
combined = pd.concat([happiness2015, happiness2016, happiness2017], ignore_index=True)

Next, let's check for missing values in `combined`:

In [9]:
combined.isnull().sum()

Country                            0
Region                           177
Happiness Rank                   177
Happiness Score                  177
Standard Error                   334
Economy (GDP per Capita)         177
Family                            22
Health (Life Expectancy)         177
Freedom                           22
Trust (Government Corruption)    177
Generosity                        22
Dystopia Residual                177
Year                               0
Lower Confidence Interval        335
Upper Confidence Interval        335
Happiness.Rank                   337
Happiness.Score                  337
Whisker.high                     337
Whisker.low                      337
Economy..GDP.per.Capita.         337
Health..Life.Expectancy.         337
Trust..Government.Corruption.    337
Dystopia.Residual                337
dtype: int64

We can see above that our dataframe has many missing values and these missing values follow a pattern. Most columns fall into one of the following categories:

- 177 missing values (about 1/3 of the total values)
- 337 missing values (about 2/3 of the total values)

You may have also noticed that some of the column names differ only by punctuation, which caused the dataframes to be combined incorrectly:



- `Trust (Government Corruption)`
- `Trust..Government.Corruption.`
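
To see why such a mismatch produces missing values, here is a minimal sketch with two made-up one-row dataframes (`df_a` and `df_b` are hypothetical names and values). Because `pd.concat()` aligns columns by exact name, the two spellings end up as separate columns, each padded with `NaN`.

```python
import pandas as pd

# Hypothetical frames whose column names differ only by punctuation
df_a = pd.DataFrame({"Country": ["A"], "Trust (Government Corruption)": [0.4]})
df_b = pd.DataFrame({"Country": ["B"], "Trust..Government.Corruption.": [0.3]})

# Columns are matched by exact name, so the two "Trust" columns are not merged
combined_demo = pd.concat([df_a, df_b], ignore_index=True)
missing_demo = combined_demo.isnull().sum()
```

Here `combined_demo` has three columns rather than two, and each of the two trust columns contains one missing value that did not exist in either input.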

In the next exercise, we'll update the column names to make them uniform and combine the dataframes again. To clean the column names, we recommend using a technique we haven't covered yet, described in [this Stack Overflow answer](https://stackoverflow.com/questions/39741429/pandas-replace-a-character-in-all-column-names).

As you start to work on more data cleaning tasks, you'll inevitably encounter scenarios you don't know specifically how to handle. Stack Overflow is a great place to reference to get answers for these questions, as other people have likely already asked the same question and solicited answers.

As a reminder, below is a list of common string methods you can use to clean the columns:



| Method | Description |
|---|---|
| `Series.str.split()` | Splits each element in the Series. |
| `Series.str.strip()` | Strips whitespace from each string in the Series. |
| `Series.str.lower()` | Converts strings in the Series to lowercase. |
| `Series.str.upper()` | Converts strings in the Series to uppercase. |
| `Series.str.get()` | Retrieves the ith element of each element in the Series. |
| `Series.str.replace()` | Replaces a regex or string in the Series with another string. |
| `Series.str.cat()` | Concatenates strings in a Series. |
| `Series.str.extract()` | Extracts substrings from the Series matching a regex pattern. |

Let's clean the column names next.

**Instructions:**

We've already updated the column names for `happiness2017`.

- Update the column names for `happiness2015` and `happiness2016` to match the formatting of the column names in `happiness2017`. Use the following criteria to rename the columns:
  - All letters should be uppercase.
  - There should be only one space between words.
  - There should be no parentheses in column names.
  - For example, the `Health (Life Expectancy)` columns should both be renamed to `HEALTH LIFE EXPECTANCY`.
- Use the `DataFrame.isnull()` and `DataFrame.sum()` methods to check for missing values. Assign the result to a variable named `missing`.
- Use the `pd.concat()` function to combine `happiness2015`, `happiness2016`, and `happiness2017`. Set the `ignore_index` argument equal to `True` to reset the index in the resulting dataframe. Assign the result to `combined`.
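
One possible way to meet these criteria is sketched below. The helper `clean_columns` is a hypothetical name introduced here, not part of the course materials, and the dataframes are small stand-ins with made-up values. The sketch chains the vectorized string methods on `DataFrame.columns` and passes `regex=True` explicitly, since the default for `str.replace()` differs between pandas versions.

```python
import numpy as np
import pandas as pd

def clean_columns(df):
    """Return a copy of df with uppercase, punctuation-free column names."""
    cleaned = df.copy()
    cleaned.columns = (
        cleaned.columns
        .str.replace(r"[().]", " ", regex=True)  # parentheses and periods become spaces
        .str.replace(r"\s+", " ", regex=True)    # collapse runs of whitespace
        .str.strip()                             # drop leading/trailing spaces
        .str.upper()
    )
    return cleaned

# Stand-in dataframes (hypothetical values) so the sketch is self-contained
happiness2015 = pd.DataFrame(
    {"Country": ["A"], "Health (Life Expectancy)": [0.9], "Year": [2015]}
)
happiness2017 = pd.DataFrame(
    {"COUNTRY": ["A"], "HEALTH LIFE EXPECTANCY": [np.nan], "YEAR": [2017]}
)

happiness2015 = clean_columns(happiness2015)
combined = pd.concat([happiness2015, happiness2017], ignore_index=True)
missing = combined.isnull().sum()
```

After cleaning, both frames share the same column names, so `combined` has a single `HEALTH LIFE EXPECTANCY` column and the only missing value left is the one that was already missing in the input.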

## 4. Visualizing Missing Data

In the last exercise, we corrected some of the missing values by fixing the column names. Note that we could have cleaned the column names without changing the capitalization. It's good practice, however, to make the capitalization uniform, because a stray uppercase or lowercase letter could've reintroduced missing values.

We also confirmed there are still values missing:

```
COUNTRY                          0
DYSTOPIA RESIDUAL               22
ECONOMY GDP PER CAPITA          22
FAMILY                          22
FREEDOM                         22
GENEROSITY                      22
HAPPINESS RANK                  22
HAPPINESS SCORE                 22
HEALTH LIFE EXPECTANCY          22
LOWER CONFIDENCE INTERVAL      335
REGION                         177
STANDARD ERROR                 334
TRUST GOVERNMENT CORRUPTION     22
UPPER CONFIDENCE INTERVAL      335
WHISKER HIGH                   337
WHISKER LOW                    337
YEAR                             0
dtype: int64
```

We can learn more about where these missing values are located by visualizing them with a [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html), a graphical representation of our data in which values are represented as colors. We'll use the seaborn library to create the heatmap.

Before plotting, we set the index to the `YEAR` column so that the corresponding year appears on the left side of the heatmap.
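
A minimal sketch of this step follows. It assumes seaborn is installed for the plotting part, and it uses a small stand-in for `combined` with made-up values. The names `combined_updated`, `missing_mask`, and `plot_missing` are illustrative choices, not names confirmed by the course.

```python
import numpy as np
import pandas as pd

# Stand-in for the combined dataframe (hypothetical values)
combined = pd.DataFrame({
    "COUNTRY": ["A", "B", "A", "B"],
    "REGION": ["X", "Y", np.nan, np.nan],
    "HAPPINESS SCORE": [7.5, np.nan, 7.4, 6.0],
    "YEAR": [2015, 2015, 2017, 2017],
})

# Index by year so the heatmap's y-axis labels show the year of each row
combined_updated = combined.set_index("YEAR")

# Boolean mask: True where a value is missing
missing_mask = combined_updated.isnull()

def plot_missing(mask):
    """Draw the boolean mask as a heatmap; requires seaborn."""
    import seaborn as sns  # imported here so the data prep runs without plotting libraries
    return sns.heatmap(mask, cbar=False)
```

Calling `plot_missing(missing_mask)` in a notebook cell should draw the heatmap, with missing and non-missing cells shown in two different colors and the years listed along the left side.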

## 5. Using Data From Additional Sources to Fill in Missing Values

## 6. Identifying Duplicate Values

## 7. Correcting Duplicate Values

## 8. Handle Missing Values by Dropping Columns

## 9. Handle Missing Values by Dropping Columns Continued

## 10. Analyzing Missing Data

## 11. Handling Missing Values with Imputation

## 12. Dropping Rows