# Lab 2: Cleaning

## <u>Table of contents</u>

### 1. Manipulation
1. Data Structure and Input/Output Data <br>
2. Getting Data and Modifying Data <br>
3. Summary statistics and aggregating data <br>
4. Merge and Append Data

### 2. Cleaning
1. Outlier <br>
2. Incorrect data type<br>
3. Missing data<br>
4. Duplicates<br>
5. Inaccurate data/ Invalid category<br>
6. Data Binning<br>
7. Data encoding

<b>Cheat sheet</b> <br>
pandas: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

First, `import` is used to import modules in Python. <br>
Python has many modules related to data manipulation. The most common modules are `numpy` and `pandas`.

In [None]:
# Import the modules used in this lab
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used to draw plots

## 1. Outlier

<b> Do not forget to check for and clean outliers. </b>

The dataset we use in this session is a generated total sales dataset.

In [None]:
sales_data = pd.read_csv("./2_total_sales_column.csv", sep="\t")
sales_data.head()

We can get a first look at the data by computing descriptive statistics.

In [None]:
# View descriptive statistics of data
sales_data.describe()

Next, we draw a box plot by using `<DataFrame>.boxplot()`.<br><br>
Doc df.boxplot: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html

In [None]:
# View boxplot of data
sales_data.boxplot()
plt.show()

Because this is a total sales dataset, we separate the upper outliers from the rest of the data and inspect each group on its own.

In [None]:
# Separate the upper outliers from the rest of the data
cond1 = sales_data["# total_sales"] > 2500
upper = sales_data.loc[cond1, ["# total_sales"]]
lower = sales_data.loc[~cond1, ["# total_sales"]]

In [None]:
# View boxplot of upper outliers
upper.boxplot()
plt.show()

In [None]:
# View boxplot of remaining data
lower.boxplot()
plt.show()

We do not remove the lower outliers because we assume they represent product returns, which we are interested in.
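
The threshold of 2500 above was picked by looking at the box plot. As a rough sketch of how such a cut-off could be derived from the data instead, the cell below applies the common 1.5 × IQR rule (the same rule a box plot uses by default for its whiskers). The `sales` values are made up for illustration and are not taken from the lab's CSV file.

```python
import pandas as pd

# Hypothetical sales values (assumption: not the lab's real data)
sales = pd.Series([120, 135, 150, 160, 175, 190, 210, 5000, -300], name="total_sales")

# Tukey fences: values further than 1.5 * IQR from the quartiles are flagged
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

upper_outliers = sales[sales > upper_fence]
lower_outliers = sales[sales < lower_fence]
inliers = sales[sales.between(lower_fence, upper_fence)]

print(upper_outliers)
print(lower_outliers)
```

Whether a flagged value should be removed still depends on the domain, as with the negative values that we keep here.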

## 2. Incorrect data type

The dataset which we use in this session is a demo dataset.

In [None]:
demo_data = pd.read_csv("./data/4_Demo_data.csv")
demo_data

Check data dtype by using `<DataFrame>.info()`.

In [None]:
demo_data.info()

We can set index by using `<DataFrame>.set_index()`. <br><br>
Doc. df.set_index: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html

In [None]:
demo_data = demo_data.set_index("Index")
demo_data.head()

We can change the data into the `category` dtype by using `<DataFrame>.astype()`. <br><br>
Doc. astype: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

In [None]:
demo_data["categorical"] = demo_data["categorical"].astype("category")
demo_data.info()

Missing values in this dataset are stored as the string `na`. We must change them to `<numpy>.nan` by using `<DataFrame>.replace()` before converting the columns to a numeric dtype. <br><br>

Doc. df.replace: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

In [None]:
demo_data = demo_data.replace("na", np.nan)
demo_data.head()

In [None]:
demo_data[["feature1", "feature2"]] = demo_data[["feature1", "feature2"]].astype("float")
demo_data.info()
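
The two steps above (replace the placeholder, then cast) can be tried on a small frame. The sketch below uses a made-up `raw` DataFrame, not the lab's file, and also shows `pd.to_numeric(..., errors="coerce")` as one possible alternative that turns every unparseable entry into `NaN` in a single step.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where missing values were typed in as the string "na"
raw = pd.DataFrame({
    "feature1": ["1.5", "na", "3.0"],
    "feature2": ["10", "20", "na"],
})

# Option A (same idea as in the lab): replace the placeholder, then cast
cleaned = raw.replace("na", np.nan).astype("float")

# Option B: coerce anything that cannot be parsed as a number to NaN
coerced = raw.apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(coerced)
```

Option B is convenient when the placeholder is not known in advance, but it also silently hides typos such as `"1,5"`, so it is worth checking which values became `NaN`.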

We can also convert a column to the datetime dtype by using `<pandas>.to_datetime()`. The `format` argument describes how the dates are written in the source data.<br><br>
Doc. pd.to_datetime: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html <br>
Doc. strftime format: https://strftime.org/

In [None]:
demo_data["create_date"] = pd.to_datetime(demo_data["create_date"], format="%d-%b-%y")
demo_data.info()
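
To see what the format string `"%d-%b-%y"` matches, here is a minimal sketch on made-up date strings (they are assumptions, not values from the demo file). `%d` is the day, `%b` the abbreviated month name and `%y` the two-digit year. Passing `errors="coerce"` is optional and turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical date strings written as day-month abbreviation-two digit year
dates = pd.Series(["05-Jan-21", "17-Mar-22", "not a date"])

# errors="coerce" replaces entries that do not match the format with NaT
parsed = pd.to_datetime(dates, format="%d-%b-%y", errors="coerce")

print(parsed)
print(parsed.dt.year)
```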

## 3. Missing data

We can check the number of non-null values in each column, and therefore how many values are missing, by using `<DataFrame>.info()`.

In [None]:
demo_data.info()

Or check every cell for missing data by using `<DataFrame>.isna()`, which returns a boolean DataFrame. <br><br>
Doc. df.isna: https://pandas.pydata.org/docs/reference/api/pandas.isna.html

In [None]:
demo_data.isna()

We can reduce a boolean DataFrame to one value per row or per column by using `<DataFrame>.all()` and `<DataFrame>.any()`.
- `<DataFrame>.all()`: Return whether all elements are True, potentially over an axis.
- `<DataFrame>.any()`: Return whether any element is True, potentially over an axis.

Doc df.all: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.all.html <br>
Doc df.any: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html

In [None]:
missing = demo_data.isna().any(axis=1)
missing

In [None]:
# View all rows that contain missing data
demo_data[missing]
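
The difference between `any` and `all` is easiest to see on a tiny example. The `frame` below is made up for illustration. With `axis=1` each row is reduced to a single boolean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 is partly missing, row 2 is empty
frame = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan]})
mask = frame.isna()

any_missing_row = mask.any(axis=1)  # True if at least one value in the row is missing
all_missing_row = mask.all(axis=1)  # True only if every value in the row is missing
missing_per_column = mask.sum()     # number of missing values in each column

print(any_missing_row.tolist())  # [False, True, True]
print(all_missing_row.tolist())  # [False, False, True]
print(missing_per_column)
```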

### 3.1. Deletion

We can remove missing data by using `<DataFrame>.dropna()` <br><br>
Doc. df.dropna: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

In [None]:
# Try dropping every row that contains missing data
demo_data.dropna()

In [None]:
# Try dropping every column that contains missing data
demo_data.dropna(axis=1)

In [None]:
# Try dropping only the rows where both feature1 and feature2 are missing
demo_data.dropna(how="all", subset=["feature1", "feature2"])
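
The `how` and `subset` arguments change which rows are dropped. The following sketch compares three calls on a made-up `frame` (the column names mirror the demo data, the values are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different missing-value patterns in each row
frame = pd.DataFrame({
    "feature1": [1.0, np.nan, np.nan, 4.0],
    "feature2": [10.0, 20.0, np.nan, np.nan],
    "label": ["a", "b", "c", "d"],
})

# Default: drop a row if any of its values is missing
drop_any = frame.dropna()

# Drop a row only if both feature columns are missing
drop_all_subset = frame.dropna(how="all", subset=["feature1", "feature2"])

# Drop a row if feature1 is missing, ignoring the other columns
drop_feature1 = frame.dropna(subset=["feature1"])

print(drop_any.index.tolist())         # [0]
print(drop_all_subset.index.tolist())  # [0, 1, 3]
print(drop_feature1.index.tolist())    # [0, 3]
```

None of these calls modifies `frame`. Each returns a new DataFrame, which is why the lab cells above are described as "try".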

### 3.2. Substitution

We can fill the missing values by using `<DataFrame>.fillna()`.<br><br>
Doc. df.fillna: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

In [None]:
# Try substituting with 0 (select only missing values)
demo_data.fillna(0)[missing]

In [None]:
# Try substituting with the median of each numeric column (select only missing values)
demo_data.fillna(demo_data.median(numeric_only=True))[missing]

In [None]:
# Try substituting with Last Observation Carried Forward (LOCF)/Next Observation Carried Backward (NOCB) (select only missing values)
# Note: fillna(method=...) is deprecated in recent pandas versions; ffill()/bfill() do the same job
demo_data.ffill()[missing]
# demo_data.bfill()[missing]
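
As a small illustration of LOCF and NOCB, the sketch below applies `ffill()` and `bfill()` to a made-up Series. Note that a gap at the very end cannot be filled backward, and a gap at the very start cannot be filled forward.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps in the middle and at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

locf = s.ffill()  # carry the last observed value forward
nocb = s.bfill()  # carry the next observed value backward

print(locf.tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0]
print(nocb.tolist())  # the last value stays NaN because nothing follows it
```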

We can substitute interpolated values for missing data by using `<DataFrame>.interpolate()`.<br><br>
Doc. df.interpolate: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

In [None]:
# Try substituting with interpolated values
demo_data["feature1"] = demo_data["feature1"].fillna(demo_data["feature1"].interpolate())
demo_data["feature2"] = demo_data["feature2"].fillna(demo_data["feature2"].interpolate())

In [None]:
# Select only missing values
demo_data[missing]
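
By default `interpolate()` is linear: a gap is filled with evenly spaced values between its two neighbours. The sketch below shows this on a made-up Series. It also checks that, for this example, wrapping the call in `fillna(...)` gives the same result as calling `interpolate()` directly, since interpolation only changes the missing positions.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap between 10 and 40
s = pd.Series([10.0, np.nan, np.nan, 40.0])

linear = s.interpolate()                 # default method is linear
via_fillna = s.fillna(s.interpolate())   # the pattern used in the cell above

print(linear.tolist())  # [10.0, 20.0, 30.0, 40.0]
print(via_fillna.equals(linear))
```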

## 4. Duplicates

The dataset which we use in this session is a modified supermarket sales dataset. <br><br>
Ref: https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

In [None]:
supermerket_data = pd.read_csv("./data/5_Supermarket_data_duplicate.csv")

We can flag duplicated rows by using `<DataFrame>.duplicated()`, which returns a boolean Series. <br><br>
Doc. df.duplicated: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html

In [None]:
supermerket_data.duplicated()

In [None]:
# View all duplicated rows
supermerket_data[supermerket_data.duplicated(keep=False)]

And we can remove duplicated rows by using `<DataFrame>.drop_duplicates()`. <br><br>
Doc. df.drop_duplicates: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

In [None]:
supermerket_data = supermerket_data.drop_duplicates()
supermerket_data[supermerket_data.duplicated(keep=False)] # View all duplicated rows

In [None]:
supermerket_data.loc[supermerket_data["Invoice ID"] == "605-03-2706"]
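
The `keep` and `subset` arguments decide which rows count as duplicates. The sketch below uses a made-up `frame` (the column names are assumptions for illustration) in which two rows are identical and two rows share an invoice but differ in their total.

```python
import pandas as pd

# Hypothetical invoices: rows 0 and 1 are identical, rows 3 and 4 differ in "total"
frame = pd.DataFrame({
    "invoice": ["A1", "A1", "B2", "C3", "C3"],
    "total": [10, 10, 20, 30, 35],
})

first_kept = frame.duplicated()                               # marks repeats after the first occurrence
all_marked = frame.duplicated(keep=False)                     # marks every member of a duplicated group
by_invoice = frame.duplicated(subset=["invoice"], keep=False) # compares only the invoice column
deduplicated = frame.drop_duplicates()

print(first_kept.tolist())           # [False, True, False, False, False]
print(all_marked.tolist())           # [True, True, False, False, False]
print(by_invoice.tolist())           # [True, True, False, True, True]
print(deduplicated.index.tolist())   # [0, 2, 3, 4]
```

Rows 3 and 4 survive `drop_duplicates()` because they are not exact copies, which is the situation the next part of this section deals with.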

Sometimes we may want to aggregate the duplicated rows with a group-by instead of removing them.

In [None]:
# Read duplicate csv file again
supermerket_data = pd.read_csv("./data/5_Supermarket_data_duplicate.csv")

In [None]:
supermerket_data.columns

In [None]:
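# Sum only the numeric columns within each group of rows that share the same key columns
supermerket_groupby = supermerket_data.groupby(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender', 'Product line', 'Unit price'])\
                        .sum(numeric_only=True).reset_index()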
supermerket_groupby.loc[supermerket_groupby["Invoice ID"] == "605-03-2706"]
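
The same idea on a small scale: rows that share the key columns are collapsed into one row and their numeric columns are summed. The `frame` and its column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical invoice lines: invoice A1 was recorded in two parts
frame = pd.DataFrame({
    "invoice": ["A1", "A1", "B2"],
    "product": ["tea", "tea", "milk"],
    "quantity": [2, 3, 1],
    "total": [20.0, 30.0, 5.0],
})

# Group by the identifying columns and add up the numeric ones
merged = frame.groupby(["invoice", "product"], as_index=False)[["quantity", "total"]].sum()

print(merged)
```

Summing is only one possible choice. Depending on the column, `mean`, `max` or `first` may describe the merged row better.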

## 5. Inaccurate data/ Invalid category

The dataset used in this section is a demo dataset. <br>
Sometimes categorical data contains an invalid category. We must either remove those rows or change the invalid values to `np.nan`.

In [None]:
# create DataFrame
fruit_data = pd.DataFrame({
    "fruit": ["strawberry", "melon", "raisin", "cherry", "grape", "red apple", "lime", "pear", "raspberry"],
    "color": ["red", "green", "purple", "red", "purple", "red", "sphere", "green", "red"]
})

# copy DataFrame
fruit_data1 = fruit_data.copy()
fruit_data2 = fruit_data.copy()

fruit_data

We can return unique values of Series object by using `<Series>.unique()`. <br><br>
Doc. s.unique: https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html

In [None]:
fruit_data["color"].unique()

### 5.1. Replace values

In [None]:
cond = fruit_data1["color"] == "sphere"
fruit_data1.loc[cond, "color"] = np.nan  # np.NaN was removed in NumPy 2.0
fruit_data1

### 5.2. Remove row

In [None]:
cond = fruit_data2["color"] == "sphere"
fruit_data2 = fruit_data2[~cond]
fruit_data2
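
When there are several invalid values, checking each one by name becomes tedious. One possible approach, sketched below, is to list the valid categories and use `<Series>.isin()`. The `valid_colors` set and the example values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical color column with two invalid entries
colors = pd.Series(["red", "green", "sphere", "purple", "cube"])

# Assumed list of categories that we accept as valid
valid_colors = {"red", "green", "purple"}

is_valid = colors.isin(valid_colors)
masked = colors.where(is_valid)  # invalid values become NaN
filtered = colors[is_valid]      # invalid rows are removed

print(masked.tolist())
print(filtered.tolist())  # ['red', 'green', 'purple']
```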

## 6. Data binning

Data binning groups continuous values into a small number of intervals (bins), which reduces the effects of minor observation errors.

The dataset which we use in this session is a modified solar generation and demand dataset. <br><br>
Ref: https://www.kaggle.com/datasets/arielcedola/solar-generation-and-demand-italy-20152016

In [None]:
solar_data = pd.read_csv("./data/6_Solar_generation_and_demand_2015_modify.csv")
solar_data

In [None]:
# Keep only the column of interest and drop duplicated values
selected_solar_column = solar_data["IT_solar_generation"].drop_duplicates()

We can bin values into discrete intervals by using `<pandas>.cut()` <br><br>
Doc. pd.cut: https://pandas.pydata.org/docs/reference/api/pandas.cut.html

In [None]:
# pd.cut(selected_solar_column, 4)
pd.cut(selected_solar_column, 4).value_counts().sort_index()

In [None]:
pd.cut(selected_solar_column, 4, labels=["low", "normal", "high",  "extreme"])
# pd.cut(selected_solar_column, 4, labels=["low", "normal", "high",  "extreme"]).value_counts().sort_index()

We can bin values into quantile intervals by using `<pandas>.qcut()` <br><br>
Doc. pd.qcut: https://pandas.pydata.org/docs/reference/api/pandas.qcut.html

In [None]:
pd.qcut(selected_solar_column, 4)
# pd.qcut(selected_solar_column, 4).value_counts().sort_index()

In [None]:
pd.qcut(selected_solar_column, 4, labels=["q0-q1", "q1-q2", "q2-q3", "q3-q4"])
# pd.qcut(selected_solar_column, 4, labels=["q0-q1", "q1-q2", "q2-q3", "q3-q4"]).value_counts().sort_index()
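
The difference between `pd.cut` and `pd.qcut` shows up clearly on skewed data. In the sketch below (made-up values, not the solar dataset) `pd.cut` creates four bins of equal width, so a single large value leaves the middle bins empty, while `pd.qcut` creates four bins with roughly the same number of observations.

```python
import pandas as pd

# Hypothetical skewed values with one very large observation
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: each bin spans about a quarter of the value range
equal_width = pd.cut(values, 4, labels=["low", "normal", "high", "extreme"])

# Equal-frequency bins: each bin holds about a quarter of the observations
equal_freq = pd.qcut(values, 4, labels=["q0-q1", "q1-q2", "q2-q3", "q3-q4"])

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Here seven of the eight values land in `low` with `pd.cut`, whereas `pd.qcut` puts two values in every bin.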

## 7. Data encoding

Encoding is the process of converting data into a specified format.

The dataset which we use in this session is a demo dataset.

In [None]:
# create DataFrame
fruit_data = pd.DataFrame({
    "fruit": ["strawberry", "melon", "raisin", "cherry", "grape", "red apple", "lime", "pear", "raspberry"],
    "color": ["red", "green", "purple", "red", "purple", "red", "green", "green", "red"]
})

# copy DataFrame
fruit_data1 = fruit_data.copy()
fruit_data2 = fruit_data.copy()

fruit_data

### 7.1. Ordinal encoding

First, we have to change `color` column to `category` dtype. <br>
Then, use `<Series>.cat.codes` to return Series of codes. <br><br>
Doc. s.cat.codes: https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.codes.html

In [None]:
fruit_data1["color"] = fruit_data1["color"].astype("category")
fruit_data1.info()

In [None]:
fruit_data1["color_encode"] = fruit_data1["color"].cat.codes
fruit_data1
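
With `astype("category")` the categories are sorted alphabetically, so the codes do not carry any meaningful order. Colors have no natural order, so the sketch below uses a made-up `sizes` column to show how an explicit order could be supplied with `CategoricalDtype` when the data is truly ordinal.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordinal data (assumption: not part of the lab's datasets)
sizes = pd.Series(["medium", "small", "large", "small"])

# Default: categories are sorted alphabetically (large=0, medium=1, small=2)
default_codes = sizes.astype("category").cat.codes

# Explicit order: small < medium < large
size_dtype = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
ordered_codes = sizes.astype(size_dtype).cat.codes

print(default_codes.tolist())  # [1, 2, 0, 2]
print(ordered_codes.tolist())  # [1, 0, 2, 0]
```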

### 7.2. One-hot encoding

Use `<pandas>.get_dummies()` to convert a categorical variable into dummy/indicator variables. We usually drop the first level (`drop_first=True`), because it is implied when all the other indicator columns are 0.<br><br>
Doc pd.get_dummies: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [None]:
dummie = pd.get_dummies(fruit_data["color"], prefix="color", drop_first=True)
dummie

In [None]:
fruit_data2 = pd.concat([fruit_data, dummie], axis=1)
fruit_data2
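
To see what `drop_first=True` removes, the sketch below encodes a short made-up `colors` Series with and without it. The optional `dtype=int` argument is used here only so that the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical color column
colors = pd.Series(["red", "green", "purple", "red"], name="color")

# One indicator column per category
full = pd.get_dummies(colors, prefix="color", dtype=int)

# The first category (alphabetically "green") is dropped
reduced = pd.get_dummies(colors, prefix="color", drop_first=True, dtype=int)

print(full.columns.tolist())     # ['color_green', 'color_purple', 'color_red']
print(reduced.columns.tolist())  # ['color_purple', 'color_red']
print(reduced)
```

In `reduced`, the row for `"green"` contains only zeros, which is how the dropped level is still represented.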

### <u>Exercise 5</u>

The dataset which we use in this exercise is a chronic kidney disease dataset. <br><br>
Ref: https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease

<u><b>Metadata</b></u>

<table>
    <tbody>
        <tr>
            <th><b>Variable</b></th>
            <th><b>Definition</b></th>
            <th><b>Remark</b></th>
        </tr>
        <tr>
            <td>age</td>
            <td>age</td>
            <td>numerical</td>
        </tr>
        <tr>
            <td>bp</td>
            <td>blood pressure</td>
            <td>numerical</td>
        </tr>
        <tr>
            <td>al</td>
            <td>albumin</td>
            <td>nominal (0,1,2,3,4,5)</td>
        </tr>
        <tr>
            <td>su</td>
            <td>sugar</td>
            <td>nominal (0,1,2,3,4,5)</td>
        </tr>
        <tr>
            <td>appet</td>
            <td>appetite</td>
            <td>nominal (good,poor)</td>
        </tr>
        <tr>
            <td>pe</td>
            <td>pedal edema</td>
            <td>nominal (yes,no)</td>
        </tr>
        <tr>
            <td>ane</td>
            <td>anemia</td>
            <td>nominal (yes,no)</td>
        </tr>
        <tr>
            <td>class</td>
            <td>class</td>
            <td>nominal (ckd,notckd)</td>
        </tr>
    </tbody>
</table>

Import the `7_Chronic_kidney_disease` file and write code to clean the data as thoroughly as you can. <br>

<b>Check list</b>
- Outlier
- Incorrect datatype
- Missing data
- Duplicates
- Inaccurate data/ Invalid category
- Data Binning
- Data encoding

In [None]:
# The code for Exercise 5 is here
# Load data
data_5 = pd.read_csv("./data/7_Chronic_kidney_disease.csv")
data_5.head()

---