# Lab Two: Data Cleaning

Data cleaning is one of the first steps in any data analysis or machine learning project. In this lab, you will go over some common data issues and apply suitable fixes.

##### Loading libraries needed and the data

In [None]:
import pandas as pd
import numpy as np

### Explaining the data we will be using for this and the next few labs

In this example, we have two datasets. They come from two hypothetical clinics, "Clinic1" and "Clinic2", which diagnose patients with a novel device that takes many measurements. The final goal is to determine whether a patient has a particular disease.

Measurements taken from patients in the two clinics are stored in `Lab2_df1.csv`. We also have an inspection log, stored in `Lab2_df2.csv`, for the devices used in "Clinic1" and "Clinic2", in which many variables of each device are measured.

Two of these variables, `'M1'` and `'M3'`, are believed to affect the readings taken from the patients. The `Lab2_df1` dataset is labelled with the actual diagnosis of whether the patient had the disease, and the goal is to predict the presence of the disease from the measurements taken from the patients. Since the device variables measured during inspection affect the measurements taken from patients in the clinics, they should also be considered. Here are the data frames:

#### Note: Cells marked '[A]' are activities for you to complete.

In [None]:
# Load the 'Lab2_df1.csv' data (assumes the file is in the working directory)
df1 = pd.read_csv('Lab2_df1.csv')

# Allow up to 100 rows to be shown when displaying a dataframe
pd.set_option('display.max_rows', 100)

In [None]:
df1.head(20)

## Looking at issues with the data

##### Unique values of the `'Gender'` feature.

In [None]:
df1['Gender'].unique()

##### Unique values of the `'Mode'` feature.

##### Observing the data types of each column.

# Lab Activity One: Guided Data Cleaning

##### It is generally a good idea to make a copy of the master dataset for cleaning so you can always go back if ever needed.

In [None]:
df_clean = df1.copy()

##### [A] Drop the `'Unnamed: 0'` column
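
A column named `'Unnamed: 0'` usually appears when a CSV was saved together with its row index. The sketch below shows one way to drop a column by name on a small toy frame; the `toy` data and the `value` column are made up for illustration and are not the lab data.

```python
import pandas as pd

# Toy frame standing in for a CSV that was saved with its index column
toy = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "value": [10.5, 11.0, 9.8]})

# drop() returns a new frame and leaves the original untouched,
# so the result has to be assigned to keep the change
toy_dropped = toy.drop(columns=["Unnamed: 0"])

print(toy_dropped.columns.tolist())  # ['value']
```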

##### [A] Fix the data in the `'Gender'` column.
> Hint: make it so there are 2 unique entries, `'male'` and `'female'`.
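
As a hedged illustration of the idea, the sketch below normalises inconsistent labels on a toy column. The messy variants used here (`"M"`, `"F"`, stray spaces, mixed case) are assumptions; inspect the unique values of the real column first, since it may contain different ones.

```python
import pandas as pd

# Toy column with hypothetical inconsistencies
toy = pd.DataFrame({"Gender": ["Male", "female ", "M", "F", "FEMALE", "male"]})

# Step 1: remove surrounding whitespace and lowercase everything
normalized = toy["Gender"].str.strip().str.lower()

# Step 2: map the remaining abbreviations onto the two target labels
toy["Gender"] = normalized.replace({"m": "male", "f": "female"})

print(sorted(toy["Gender"].unique()))  # ['female', 'male']
```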

Check the 'Gender' column again

##### [A] Fix the data in the `'Mode'` column.
> Hint: Check for lower and upper case entries

### Setting The Correct Data Type

##### [A] Change the `'H'` column to a numeric data type (`float64`) instead of `'object'`.
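
One possible approach is `pd.to_numeric`. The sketch below uses a toy column in which one entry (`"n/a"`, an assumed example of a bad value) cannot be parsed. With `errors="coerce"` that entry becomes `NaN` instead of raising an error, so check how many values were lost before relying on the result.

```python
import pandas as pd

# Toy object column: numbers stored as text plus one unparseable entry
toy = pd.DataFrame({"H": pd.Series(["1.5", "2", "n/a", "3.25"], dtype="object")})

# errors="coerce" turns anything that is not a number into NaN
toy["H"] = pd.to_numeric(toy["H"], errors="coerce")

print(toy["H"].dtype)         # float64
print(toy["H"].isna().sum())  # 1
```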

Check the 'H' column again

##### [A] Convert `'Examination date'` column to `datetime` type.
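
A minimal sketch of date parsing with `pd.to_datetime` on toy data follows. The date format (`%Y-%m-%d`) and the `"not recorded"` entry are assumptions made for this example; the real column may use a different format, in which case the `format` argument needs to change or be omitted.

```python
import pandas as pd

# Toy column with dates stored as text and one invalid entry
toy = pd.DataFrame({"Examination date": ["2021-03-01", "2021-03-15", "not recorded"]})

# Entries that do not match the format become NaT because of errors="coerce"
toy["Examination date"] = pd.to_datetime(
    toy["Examination date"], format="%Y-%m-%d", errors="coerce"
)

print(toy["Examination date"].dt.day.tolist())
print(toy["Examination date"].isna().sum())
```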

In [None]:
df_clean.dtypes

##### Using the `gender_type` dtype defined below, we set the `'Gender'` column as categorical data.

In [None]:
gender_type = pd.CategoricalDtype(categories=["female", "male"])

df_clean["Gender"] = df_clean["Gender"].astype(gender_type)

In [None]:
df_clean.dtypes
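
One thing to watch for when casting to a fixed set of categories: any value that is not in the category list is silently turned into a missing value. The toy sketch below (with a deliberately uncleaned `"Male"` entry) shows why the labels should be fixed before the conversion.

```python
import pandas as pd

gender_type = pd.CategoricalDtype(categories=["female", "male"])

# "Male" (capitalised) is not one of the declared categories
toy = pd.Series(["male", "female", "Male"], dtype="object")
as_cat = toy.astype(gender_type)

print(list(as_cat.cat.categories))  # ['female', 'male']
print(as_cat.isna().tolist())       # [False, False, True]
```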

##### [A] Replicate how we changed the gender column to categorical but this time for the `'Mode'` column.

##### [A] Again, change the `'Diagnosis'` column to categorical, following the example above.

### Duplicate Entries
- Duplicate entries in a dataset are undesirable: they add redundant information and can contribute to overfitting.

##### [A] Check for duplicate entries and delete them

In [None]:
duplicate_rows = df_clean.duplicated()

In [None]:
display(duplicate_rows)

Before applying the tilde (`~`) operator:
True = duplicated, False = not duplicated

After applying the tilde (`~`) operator:
False = duplicated, True = not duplicated
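
To make the effect of the mask concrete, here is a small sketch on a toy frame (the `id` and `score` columns are made up). By default, `duplicated()` marks only the second and later copies of a row, so the first copy is kept.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 2, 3], "score": [0.5, 0.7, 0.7, 0.9]})

mask = toy.duplicated()   # True for rows that repeat an earlier row
print(mask.tolist())      # [False, False, True, False]
print((~mask).tolist())   # [True, True, False, True]

# Keeping only the rows where the inverted mask is True removes the repeats
deduped = toy[~mask]
print(deduped.index.tolist())  # [0, 1, 3]
```

The same result can be obtained with `toy.drop_duplicates()`.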

In [None]:
df_clean = df_clean[~duplicate_rows]

In [None]:
df_clean

##### [A] Once you delete a row from your dataset, the index label of that row is removed as well, leaving gaps in the index. Reset the index of the dataset so it is in proper order.
> Hint: use the `dataframe.reset_index` function and set the drop parameter to `True`.
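
The toy sketch below illustrates what the `drop` parameter does, using a frame whose index has a gap (as if row 2 had been deleted). Without `drop=True`, the old index would be kept as an extra column named `index`.

```python
import pandas as pd

# Index 2 is missing, as it would be after removing a row
toy = pd.DataFrame({"value": [10, 20, 30]}, index=[0, 1, 3])

reset = toy.reset_index(drop=True)

print(reset.index.tolist())    # [0, 1, 2]
print(reset.columns.tolist())  # ['value']
```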

##### [A] Check the data types of your dataframe and print the shape of the dataframe.

# Lab Activity Two: Clean a Dataset Yourself

In this activity, you will clean the dataset yourself, using the examples from the activity above. The `Lab2_df2.csv` file is loaded and displayed for you.

In [None]:
# Read the Lab2_df2.csv file (assumes the file is in the working directory)
df2 = pd.read_csv('Lab2_df2.csv')
df2.head(10)

##### [A] Figure out the issues with this dataset and apply the cleaning methods
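
As a possible starting point, the sketch below gathers a few quick diagnostics in one place. `summarize_issues` is a hypothetical helper written for this example, not part of pandas or the lab, and the toy `Clinic` column is an assumption about what an inspection log might contain.

```python
import pandas as pd

def summarize_issues(df: pd.DataFrame) -> dict:
    """Collect quick diagnostics: dtypes, missing values, duplicate rows."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy data with inconsistent casing, a missing value and a repeated row
toy = pd.DataFrame({
    "Clinic": ["Clinic1", "clinic1", "Clinic2", "Clinic2"],
    "M1": [0.1, None, 0.3, 0.3],
})

report = summarize_issues(toy)
print(report["missing_per_column"])
print(report["duplicate_rows"])
```

Calling `summarize_issues(df2)` and looking at the unique values of each text column would then point to which of the fixes from Lab Activity One apply.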