## ![logo](../../img/license_header_logo.png)
> **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program and the accompanying materials are made available under the
terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). <br>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License. <br>
<br>**SPDX-License-Identifier: Apache-2.0**

# <a name="top">04 - Data Cleanup and Missing Data</a>
Authored by: Scotrraaj Gopal - scotrraaj.gopal@certifai.ai

## <a name="description">Notebook Description</a>

Once you leave the classroom and start applying your analytical skills in the real world, you'll find that the datasets you obtain rarely look the way you expected. Data passes through a variety of processes before it reaches you, and it often ends up unstructured and rather 'unfriendly' to the data scientist. This tutorial focuses on how to clean up your data and covers a handful of methods for dealing with missing data.

By the end of this tutorial, you will be able to:

1. List, rename and delete columns.
2. Detect and count `NaN` values.
3. Remove rows with `NaN` values.
4. Replace `NaN` values.

## Notebook Outline
Here's the outline for this tutorial:
1. [Notebook Description](#description)
2. [Notebook Configurations](#configuration)
3. [Data Cleanup](#clean)
    - [Handling Duplicates](#duplicates)
    - [Column Formatting](#columns)
4. [Missing Data](#missing)
    - [How Would You Know?](#detect)
    - [What To Do?](#action)
5. [Summary](#summary)
6. [Reference](#reference)

## <a name="configuration">Notebook Configurations</a>

Begin by importing the `pandas` module and loading the dataset from `../../Datasets/pandas/winemag-data-130k-v2.csv`. Name the resulting `DataFrame` object `reviews`.
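
As a hedged sketch of the loading step, the snippet below reads a tiny in-memory CSV with `pd.read_csv()` instead of the real file, so it runs without the dataset on disk. The CSV text and the name `toy` are made up for illustration; with the real file you would pass the path above to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the wine reviews file
csv_text = "country,points,price\nItaly,87,\nPortugal,87,15.0\n"

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# The empty price field in the first row is read as NaN
print(toy)
```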

In [None]:
# YOUR CODE HERE


reviews

## <a name="clean">Data Cleanup</a>

Data cleanup is a pre-processing step that has to be done before you begin analysing the dataset. It is usually the most extensive and challenging step, because it has to be done carefully without introducing bias.

### <a name="duplicates">Handling Duplicates</a>

It is always important to verify that the dataset we are working with has no duplicates, so that we know for sure that we are not aggregating duplicate rows. Removing duplicates can be considered the low-hanging fruit of data pre-processing.

We can do this with the `.duplicated()` method. Chaining it with `.sum()` shows the total number of duplicate rows.
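
Here is a minimal sketch of the same idea on a made-up four-row `DataFrame` (the name `toy` and its values are hypothetical, not taken from the wine dataset):

```python
import pandas as pd

# Hypothetical toy data; the third row repeats the first
toy = pd.DataFrame({
    "country": ["Italy", "Spain", "Italy", "France"],
    "points": [87, 90, 87, 92],
})

# .duplicated() marks a row True when an identical row appeared earlier
dup_mask = toy.duplicated()

# True counts as 1, so .sum() gives the number of duplicate rows
n_duplicates = dup_mask.sum()
print(n_duplicates)
```

Note that the first occurrence of a repeated row is not counted as a duplicate by default.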

In [None]:
# YOUR CODE HERE


There are 9,983 duplicate rows in our dataset. Let's use the `.drop_duplicates()` method to drop the duplicates.

> *Note: Use the* `inplace` *flag only once you are sure of the operation, as it immediately modifies the original dataset variable. __Be cautious when using this flag, as a mistake on a big dataset can be costly in terms of time and effort.__*
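
The difference between the two modes can be sketched on a small made-up `DataFrame` (the names `toy` and `deduped` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the third row repeats the first
toy = pd.DataFrame({
    "country": ["Italy", "Spain", "Italy", "France"],
    "points": [87, 90, 87, 92],
})
shape_before = toy.shape

# Without inplace: a new DataFrame is returned and toy is left untouched
deduped = toy.drop_duplicates()
shape_after_copy = toy.shape

# With inplace=True: toy itself is modified and the call returns None
toy.drop_duplicates(inplace=True)

print(shape_before, shape_after_copy, deduped.shape, toy.shape)
```

Working on the returned copy first is a safe way to check the result before committing to `inplace=True`.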

In [None]:
print(f"Shape of reviews with duplicates = {reviews.shape}")
# YOUR CODE HERE


print(f"Shape of reviews without duplicates = {reviews.shape}")

### <a name="columns">Column Formatting</a>

Datasets routinely come with unstructured column labels, some of them a cocktail of lowercase and uppercase words, spaces and typos. To make our lives easier when selecting data by column, we can spend a little time cleaning up the names.

We can access the column labels, all at once, with `.columns`.

In [None]:
# YOUR CODE HERE


We can use the `.rename()` method to rename certain columns with a `dict` style argument.

Let's rename the `points` column to `score` and `taster_twitter_handle` to `taster_twitter`.
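
As a small illustration, assuming a made-up two-column `DataFrame` named `toy`, listing and renaming columns might look like this:

```python
import pandas as pd

# Hypothetical toy data using two of the column labels mentioned above
toy = pd.DataFrame({
    "points": [87, 90],
    "taster_twitter_handle": ["@a", "@b"],
})

# .columns holds all the column labels
print(list(toy.columns))

# The dict maps old labels to new ones; labels not listed are left unchanged
renamed = toy.rename(columns={
    "points": "score",
    "taster_twitter_handle": "taster_twitter",
})
print(list(renamed.columns))
```

Like most `pandas` methods, `.rename()` returns a new `DataFrame` here, so `toy` keeps its original labels unless the result is assigned back.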

In [None]:
# YOUR CODE HERE


reviews.columns

The column labels of our dataset are actually already in good shape. The best practices for formatting column labels are as follows:
1. Lowercase letters
2. No special characters such as symbols and brackets
3. Spaces replaced with underscores
4. Short but descriptive

Now, what if there are columns that don't bring any meaning to your problem statement?

You can always free up some space and reduce clutter in your dataset by using the `.drop()` method. This is a powerful method for removing rows or columns. Use the `axis` parameter to specify whether you're removing a column (`axis=1`) or a row (`axis=0`).

Let's remove the `taster_twitter` column from our `DataFrame`.
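
The following sketch shows both uses of `axis` on a made-up `DataFrame` (the names `toy`, `no_twitter` and `no_first_row` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "score": [87, 90, 92],
    "taster_twitter": ["@a", "@b", "@c"],
})

# axis=1: the label refers to a column
no_twitter = toy.drop("taster_twitter", axis=1)

# axis=0: the label refers to a row (here, the row with index label 0)
no_first_row = toy.drop(0, axis=0)

print(list(no_twitter.columns))
print(list(no_first_row.index))
```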

In [None]:
print(f"reviews columns before dropping: {reviews.columns}")
# YOUR CODE HERE


print(f"reviews columns after dropping: {reviews.columns}")
reviews

## <a name="missing">Missing Data</a>

When investigating your data, you will almost inevitably come across missing or null values, which are generally placeholders for non-existent information. By default, `pandas` represents missing values in a dataset as `NaN`, short for "Not a Number".

### <a name="detect">How Would You Know?</a>

While you could inspect your dataset by eyeballing every occurrence of `NaN` and dealing with each one, this is not feasible when you have thousands of rows of data.

You can check whether your `DataFrame` object has `NaN` values with the `.isna()` method. Chain it with `.sum()` to obtain the total number of `NaN` values in each column.
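
A minimal sketch on made-up data (the name `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing country and two missing prices
toy = pd.DataFrame({
    "country": ["Italy", None, "France"],
    "price": [np.nan, 15.0, np.nan],
})

# .isna() gives a True/False table; summing it counts the True values per column
nan_counts = toy.isna().sum()
print(nan_counts)
```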

In [None]:
# YOUR CODE HERE


### <a name='action'>What To Do?</a>

There are two strategies you can follow when you encounter `NaN` values. Deciding which one to use requires intimate knowledge of the dataset and its context.

1. Remove the whole row with `NaN` values.
2. Use data imputation to fill the `NaN` values with a reasonably justified value.

#### Removing the rows with `NaN` values is rather straightforward. 

Use the `.dropna()` method to return a version of the `DataFrame` without any rows that contain `NaN` values. Since we still need the `NaN` values to showcase the next strategy, we will not use the `inplace` parameter. Chain the `.reset_index()` method with appropriate arguments *(wink.. wink..)* to renumber the index.
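
One possible way to chain the two methods is sketched below on made-up data. The names `toy` and `cleaned` are hypothetical, and `drop=True` is just one reasonable choice of argument: it discards the old index instead of keeping it as a new column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; rows 0 and 2 each contain a missing value
toy = pd.DataFrame({
    "country": ["Italy", "Spain", None, "France"],
    "price": [np.nan, 15.0, 20.0, 30.0],
})

# dropna() removes every row that has at least one NaN,
# reset_index(drop=True) renumbers the remaining rows from 0
cleaned = toy.dropna().reset_index(drop=True)

print(cleaned)
print(len(toy))  # toy still has all of its rows
```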

In [None]:
# YOUR CODE HERE


#### Replace `NaN` values with data imputation.

Replacing missing values is a conventional way to keep valuable rows that contain `NaN` values. We can opt for this strategy when dropping every row with missing data would cause the loss of a huge chunk of the dataset. `pandas` provides a really handy method for this problem: `.fillna()`. This method offers a few different ways of replacing the missing values.

For example, we can simply replace the `NaN` values in every row with 0.
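
As a hedged sketch on made-up numeric data, the snippet below fills every `NaN` with 0 and then, as one alternative, passes a `dict` to use a different fill value per column. The names and fill values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy = pd.DataFrame({
    "price": [np.nan, 15.0, np.nan],
    "points": [87.0, np.nan, 90.0],
})

# A single value fills every NaN in the DataFrame
filled = toy.fillna(0)

# A dict fills each listed column with its own value
by_column = toy.fillna({"price": 0.0, "points": 85.0})

print(filled)
print(by_column)
```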

In [None]:
# YOUR CODE HERE


Another method, `.replace()`, can also be used to deal with this issue. It is more versatile than `.fillna()`, as it can replace non-`NaN` values too.

For example, let's say the value `White Blend` in the `variety` column has recently been updated to `White Mix`. We can easily implement this change in our dataset with `.replace()`.
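
The same kind of update can be sketched on a made-up column (the names `toy` and `updated` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "variety": ["White Blend", "Pinot Noir", "White Blend"],
})

# Every exact match of the first argument is swapped for the second
updated = toy["variety"].replace("White Blend", "White Mix")

print(updated.tolist())
```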

In [None]:
print(f"First row in 'variety' column before replace: {reviews.variety[0]}")
# YOUR CODE HERE


print(f"First row in 'variety' column after replace: {reviews.variety[0]}")

We may also use `.replace()` to replace `NaN` values with values that relate better to the data. For example, `NaN` values in the `price` column can be replaced with `Free`, with the average price, and so on.
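
A minimal sketch of the average-price option on made-up data is shown below (the names are hypothetical). The average is used rather than a string such as `Free`, because putting text into the column would make it non-numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price
toy = pd.DataFrame({"price": [10.0, np.nan, 20.0]})

# .mean() skips NaN values by default
mean_price = toy["price"].mean()

# Replace the NaN entries with the average price
imputed = toy["price"].replace(np.nan, mean_price)

print(mean_price)
print(imputed.tolist())
```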

##  <a name="summary">Summary</a>
To conclude, you should now be able to:

1. List, rename and delete columns.
2. Detect and count `NaN` values.
3. Remove rows with `NaN` values.
4. Replace `NaN` values.

Congratulations, that concludes this tutorial. Exploring, cleaning and transforming data is an essential skill in data science. After some practice, you should be really comfortable with most of the basics. 

In your learning journey, you may come across errors in many different forms. Don't let that discourage you as even the best programmers have them too. Steer through the error by interpreting it and try your level best to debug your code. You may also use this [guide](https://geo-python.github.io/site/notebooks/L6/errors.html) as a point of reference.

There is an exercise notebook in this folder that can be helpful for you to test out everything that you've learnt. Don't forget to check it out!

**See you and happy coding!**

## <a name="reference">Reference</a>
* [Dataset Source](https://www.kaggle.com/zynicide/wine-reviews)

<font size=2>[Back to Top](#top)</font>