<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount Google Drive before we can access the files stored there. Run the Python cell below, click the link it generates, and paste the authorization code back into the cell's prompt.



In [0]:
from google.colab import drive
drive.mount('/content/drive')

##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [0]:
directory = "student"
if directory == "student":
  %cd drive/Colab\ Notebooks/data-science-track/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science/Course/Data-Science-Track

<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"> Data Cleaning Exercise</h1>
<hr>
Welcome to the workbook for <span style="color:royalblue">Exercise 2: Data Cleaning</span>! 

Remember, **better data beats better algorithms**.


<br><hr id="toc">

### In this lesson...

In this lesson, we'll cover the essential steps for building your analytical base table:
1. [Drop unwanted observations](#drop)
2. [Fix structural errors](#structural)
3. [Handle missing data](#missing-data)

Finally, we'll save the data to a new file so we can use it in other lessons.

<br><hr>

### First, let's import libraries and load the dataset.

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

We've provided comments for guidance.

In [0]:
# NumPy for numerical computing

# Pandas for DataFrames


# Matplotlib for visualization

# Seaborn for easier visualization

## Import the employee dataset
- Use pandas' `read_csv()` function 
- Provide the following path for the data 
```python 
path = './data/employee_data.csv'
```
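
As a rough illustration of what `read_csv()` returns, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The three rows and their values are made up for this example; only the column names are borrowed from later in this workbook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the real data will differ.
csv_text = """department,filed_complaint,last_evaluation
sales,,0.81
engineering,1.0,
sales,,0.81
"""

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)  # (number of rows, number of columns)
print(toy_df.head())
```

Empty fields in the CSV are read in as missing values (`NaN`), which is why the cleaning steps below matter.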

In [0]:
# Load employee data from CSV

<br>

### **Now we're ready to jump into cleaning the data!**
### ...

<br>

<span id="drop"></span>
# 1. Drop unwanted observations

The first step to data cleaning is removing samples from your dataset that you don't want to include in the model.

<br>

**First, <span style="color:royalblue">drop duplicates</span> from the dataset.**
* Then, print the shape of the new dataframe.
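
To see the idea on a small scale first, here is a sketch on a made-up three-row dataframe (the name `toy_df` and its values are assumptions for illustration, not the employee data).

```python
import pandas as pd

# Toy data: the first two rows are exact duplicates.
toy_df = pd.DataFrame({
    "department": ["sales", "sales", "engineering"],
    "last_evaluation": [0.81, 0.81, 0.55],
})
print(toy_df.shape)  # shape before dropping duplicates

# drop_duplicates() returns a new dataframe, so we save over the old one.
toy_df = toy_df.drop_duplicates()
print(toy_df.shape)  # shape after dropping duplicates
```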

In [0]:
# Drop duplicates

<br>

**Display all of the unique classes of the <code style="color:steelblue">'department'</code> feature**
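
As a quick sketch on made-up values, `.unique()` lists each distinct class once and includes missing values if there are any. The toy column below is an assumption for illustration only.

```python
import pandas as pd

# Toy column with a repeated class and one missing value.
toy_df = pd.DataFrame({"department": ["sales", "IT", "sales", None, "temp"]})

# Each distinct value appears once; the missing value counts as its own entry.
classes = toy_df["department"].unique()
print(classes)
```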

In [0]:
# Unique classes of 'department'


<br>

**Drop all observations that belong to the <span style="color:crimson">'temp'</span> department.**
* **Hint:** This is the same as keeping all that don't belong to that department.
* **Hint:** Remember to overwrite your original dataframe.
* Then, print the shape of the new dataframe.
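
The first hint can be sketched with a boolean mask on a made-up dataframe. The rows below are assumptions for illustration; only the `'temp'` class name comes from the exercise.

```python
import pandas as pd

# Toy data with two rows in the 'temp' department.
toy_df = pd.DataFrame({
    "department": ["sales", "temp", "IT", "temp"],
    "last_evaluation": [0.81, 0.40, 0.92, 0.55],
})

# Keep every row whose department is NOT 'temp', and save over the original.
toy_df = toy_df[toy_df["department"] != "temp"]
print(toy_df.shape)
```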

In [0]:
# Drop temporary workers


<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="structural"></span>
# 2. Fix structural errors

The next bucket under data cleaning involves fixing structural errors, which arise during measurement, data transfer, or other types of "poor housekeeping."

<br>

**Print the unique values of <code style="color:steelblue">'filed_complaint'</code> and <code style="color:steelblue">'recently_promoted'</code>.**

In [0]:
# Print unique values of 'filed_complaint'


In [0]:
# Print unique values of 'recently_promoted'


<br>

**Fill missing <code style="color:steelblue">'filed_complaint'</code> and <code style="color:steelblue">'recently_promoted'</code> values with <code style="color:crimson">0</code>.**
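
Here is a minimal sketch of the pattern on made-up data, assuming the two columns hold `1.0` where the event happened and `NaN` otherwise. That assumption is for illustration; check it against the unique values you printed above.

```python
import numpy as np
import pandas as pd

# Toy indicator columns where "no" was recorded as a missing value.
toy_df = pd.DataFrame({
    "filed_complaint": [1.0, np.nan, np.nan],
    "recently_promoted": [np.nan, 1.0, np.nan],
})

# fillna(0) returns a new Series, so we save over each column.
toy_df["filed_complaint"] = toy_df["filed_complaint"].fillna(0)
toy_df["recently_promoted"] = toy_df["recently_promoted"].fillna(0)

print(toy_df["filed_complaint"].unique())
print(toy_df["recently_promoted"].unique())
```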

In [0]:
# NaN values in filed_complaint should be 0.


In [0]:
# NaN values in recently_promoted should be 0.


<br>

**Print the unique values of <code style="color:steelblue">'filed_complaint'</code> and <code style="color:steelblue">'recently_promoted'</code> again, just to confirm.**

In [0]:
# Print unique values of 'filed_complaint'


In [0]:
# Print unique values of 'recently_promoted'


<br>

**Replace any instances of <code style="color:crimson">'information_technology'</code> with <code style="color:crimson">'IT'</code> instead.**
* Remember to do it **inplace**, OR better yet, save over the column.
* Then, plot the **bar chart** for <code style="color:steelblue">'department'</code> to see its new distribution.
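
A small sketch of the "save over the column" approach on made-up values follows. It prints class counts rather than drawing a chart; those counts are the numbers a bar chart of the column would display.

```python
import pandas as pd

# Toy column where the same department is spelled two different ways.
toy_df = pd.DataFrame({
    "department": ["IT", "information_technology", "sales", "information_technology"],
})

# replace() returns a new Series, so we save over the column.
toy_df["department"] = toy_df["department"].replace("information_technology", "IT")

# Class counts after merging the two spellings.
counts = toy_df["department"].value_counts()
print(counts)
```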

In [0]:
# 'information_technology' should be 'IT'


In [0]:
# Plot class distributions for 'department'


<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="missing-data"></span>
# 3. Handle missing data

Next, it's time to handle **missing data**. 

<br>

**Display the <span style="color:royalblue">number of missing values</span> for each feature (both categorical and numeric).**
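
One common way to do this is to chain `.isnull()` with `.sum()`. The sketch below uses a made-up dataframe with one categorical and two numeric columns; the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with missing values in both a categorical and a numeric column.
toy_df = pd.DataFrame({
    "department": ["sales", None, "IT"],
    "last_evaluation": [0.81, np.nan, np.nan],
    "filed_complaint": [1.0, 1.0, 0.0],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column.
missing = toy_df.isnull().sum()
print(missing)
```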

In [0]:
# Display number of missing values for all features in the dataset.


<br>

**Label missing values in <code style="color:steelblue">'department'</code> as <code style="color:crimson">'Missing'</code>.**
* By the way, the <code style="color:steelblue">.fillna()</code> function also has an <code style="color:steelblue">inplace=</code> argument, just like the <code style="color:steelblue">.replace()</code> function.
* Normally, I recommend overwriting the column. This time, try using the <code style="color:steelblue">inplace=True</code> argument instead, which does not return anything, so there is no need to overwrite any variable.
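
Below is a hedged sketch of an in-place fill on made-up data. In recent pandas releases, calling an inplace method on a single selected column (for example `df['col'].fillna(..., inplace=True)`) can trigger a chained-assignment warning, so this sketch applies `inplace=True` at the dataframe level with a dict that names the column. Treat it as one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Toy data with one missing department and one missing evaluation.
toy_df = pd.DataFrame({
    "department": ["sales", None, "IT"],
    "last_evaluation": [0.81, np.nan, 0.92],
})

# The dict limits the fill to 'department'; toy_df is modified directly,
# so nothing is assigned back to a variable.
toy_df.fillna({"department": "Missing"}, inplace=True)
print(toy_df)
```

Because the dict names only `'department'`, the missing `'last_evaluation'` value is left untouched.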

In [0]:
# Fill missing values in department with 'Missing'


<br>

**First, let's flag <code style="color:steelblue">'last_evaluation'</code> with an indicator variable of missingness.** 
* <code style="color:crimson">0</code> if not missing.
* <code style="color:crimson">1</code> if missing. 

Let's name the new indicator variable <code style="color:steelblue">'last_evaluation_missing'</code>.
* We can use the <code style="color:steelblue">.isnull()</code> function.
* Also, remember to convert it with <code style="color:steelblue">.astype(int)</code>
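
The two hints combine into a single line. Here is a minimal sketch on a made-up column (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy column with two missing evaluations.
toy_df = pd.DataFrame({"last_evaluation": [0.81, np.nan, 0.92, np.nan]})

# isnull() gives True/False; astype(int) turns that into 1/0.
toy_df["last_evaluation_missing"] = toy_df["last_evaluation"].isnull().astype(int)
print(toy_df)
```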

In [0]:
# Indicator variable for missing last_evaluation


<br>

**Then, simply fill in the original missing value with <code style="color:crimson">0</code> just so your algorithms can run properly.**

<br>
<b style="color:crimson">WARNING!</b> This is <b style="color:crimson">NOT</b> standard practice!!! 
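
To see why the indicator has to be created before the fill, consider this sketch on made-up data, where one row has a genuine score of `0.0` and another is missing. After filling, both show `0.0`, and only the indicator column tells them apart.

```python
import numpy as np
import pandas as pd

# Toy column: a genuine 0.0, a missing value, and an ordinary score.
toy_df = pd.DataFrame({"last_evaluation": [0.0, np.nan, 0.92]})

# Flag missingness first, while the NaN is still there to detect.
toy_df["last_evaluation_missing"] = toy_df["last_evaluation"].isnull().astype(int)

# Then fill the gap with 0 and save over the column.
toy_df["last_evaluation"] = toy_df["last_evaluation"].fillna(0)
print(toy_df)
```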

In [0]:
# Fill missing values in last_evaluation with 0


<br>

**Display the number of missing values for each feature (both categorical and numeric) again, just to confirm.**

In [0]:
# Display number of missing values by feature


<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="save-abt"></span>
# 4. Save the data

Finally, let's save the **cleaned data**. 
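
As a sketch of how `to_csv()` behaves, the example below writes a made-up dataframe to an in-memory buffer rather than to a file. Passing `index=False` is an assumption on our part: it leaves the row index out of the output so that it does not come back as an extra column when the data is reloaded.

```python
import io

import pandas as pd

# Toy dataframe standing in for the cleaned data.
toy_df = pd.DataFrame({
    "department": ["sales", "IT"],
    "last_evaluation": [0.81, 0.92],
})

# Write to an in-memory buffer; a file path string works the same way.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the text back gives the same columns and row count.
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(reloaded.shape)
```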

<br>

In [0]:
# Save the dataframe to csv


<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<br>

## Next Steps

Congratulations on making it through Exercise 2's employee data cleaning!

As a reminder, here are a few things you did in this module:
* You dropped irrelevant observations from the dataset.
* You fixed various structural errors, such as wannabe indicator variables.
* You handled missing data.

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>