![license_header_logo](https://user-images.githubusercontent.com/59526258/124226124-27125b80-db3b-11eb-8ba1-488d88018ebb.png)
> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

## Introduction
In this tutorial, we will learn how to automate data validation using a pre-built package.<br>

Spending several hours a day only to discover that the incoming data was incorrect is frustrating, even if finding the cause comes as a relief. Data can be incorrect for many reasons: text supplied where an integer is expected, an integer that is an outlier, or, in the worst case, an entire column missing from the data feed.<br>

At the end of this tutorial, you should be able to:
1. Automate data validation in your project
2. Handle issues with data types, data formats, and so on
3. Detect which columns of your dataset produce errors

## Data Validation Package: Pandera
`pandera` is a statistical data validation library for `pandas` data structures. `pandera` provides a versatile and expressive API for performing data validation on tidy (long-form) and wide data to make data processing pipelines more readable and robust.<br>

More information is available <a href="https://pandera.readthedocs.io/en/stable/">here</a>.

## Table of Contents
* [Data Type and Data Format](#typeformat)
* [Data Duplication](#duplicate)
* [Exercise](#exercise)

In [1]:
# import libraries
import pandas as pd
import pandera as pa
from pandera.errors import SchemaError
import numpy as np

In [2]:
# Create dataframe
df = pd.DataFrame({
    "Student ID": [101, 102, 103, 104, 105, 106, 107, 107, 108, 109],
    "Gender" : ["F", np.nan, "M", "M", np.nan, "M", "F", "F", "M", "F"],
    "Grade" : [54, 85, -15, 20, 60, 96, 84, 84, -25, 17],
    "Student fee" : [1320.0, 1450.0, 1200.0, 3200.50, 2500.0, 1785.5, 3100.2, 3100.2, 1540.0, 1630.0],
    "Date joined": pd.to_datetime(["12/01/2021", "02/01/2021", "02/02/2021", "07/02/2021", "28/12/2020", 
                    "07/01/2021", "20/11/2019", "20/11/2019", "01/02/2021", "23/01/2021"], format="%d/%m/%Y")
})

In [3]:
df.head(10)

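| | Student ID | Gender | Grade | Student fee | Date joined |
|---|---|---|---|---|---|
| 0 | 101 | F | 54 | 1320.0 | 2021-01-12 |
| 1 | 102 | NaN | 85 | 1450.0 | 2021-01-02 |
| 2 | 103 | M | -15 | 1200.0 | 2021-02-02 |
| 3 | 104 | M | 20 | 3200.5 | 2021-02-07 |
| 4 | 105 | NaN | 60 | 2500.0 | 2020-12-28 |
| 5 | 106 | M | 96 | 1785.5 | 2021-01-07 |
| 6 | 107 | F | 84 | 3100.2 | 2019-11-20 |
| 7 | 107 | F | 84 | 3100.2 | 2019-11-20 |
| 8 | 108 | M | -25 | 1540.0 | 2021-02-01 |
| 9 | 109 | F | 17 | 1630.0 | 2021-01-23 |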


In [4]:
df.dtypes

Student ID              int64
Gender                 object
Grade                   int64
Student fee           float64
Date joined    datetime64[ns]
dtype: object

## <a name="typeformat"></a>Data Type and Data Format
Here we check whether the data types and formats in our dataframe match the schema defined below.

In [5]:
# Define schema
schema = pa.DataFrameSchema({
    "Student ID" : pa.Column(int, nullable=False),
    "Gender" : pa.Column(str, checks=pa.Check(lambda x: x.isin(["M", "F"])), nullable=True),
    "Grade" : pa.Column(int, checks=pa.Check.greater_than(0)),
    "Student fee" : pa.Column(float, checks=pa.Check.in_range(0.0, 3500.00)),
    "Date joined" : pa.Column(pa.DateTime, checks=pa.Check.greater_than_or_equal_to("2021-01-01"))
})

In [6]:
schema.validate(df, lazy=True)

SchemaErrors: A total of 2 schema errors were found.

Error Counts
------------
- schema_component_check: 2

Schema Error Summary
--------------------
                                                                                              failure_cases  n_failure_cases
schema_context column      check                                                                                            
Column         Date joined greater_than_or_equal_to(2021-01-01)  [2020-12-28 00:00:00, 2019-11-20 00:00:00]                2
               Grade       greater_than(0)                                                       [-15, -25]                2

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```


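Passing <b>lazy=True</b> makes `pandera` collect every failed check and report them together in a single `SchemaErrors` exception, as shown above, instead of stopping at the first failure.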

In [7]:
# Display error
try:
    schema(df)
except SchemaError as e:
    print("Failed check:", e.check)
    print("\nInvalidated dataframe:\n", e.data)
    print("\nFailure cases:\n", e.failure_cases)

Failed check: <Check greater_than: greater_than(0)>

Invalidated dataframe:
    Student ID Gender  Grade  Student fee Date joined
0         101      F     54       1320.0  2021-01-12
1         102    NaN     85       1450.0  2021-01-02
2         103      M    -15       1200.0  2021-02-02
3         104      M     20       3200.5  2021-02-07
4         105    NaN     60       2500.0  2020-12-28
5         106      M     96       1785.5  2021-01-07
6         107      F     84       3100.2  2019-11-20
7         107      F     84       3100.2  2019-11-20
8         108      M    -25       1540.0  2021-02-01
9         109      F     17       1630.0  2021-01-23

Failure cases:
    index  failure_case
0      2           -15
1      8           -25


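Without `lazy=True`, validation stops at the first failed check and raises a `SchemaError`. Catching it with a try...except block gives access to the failed check, the data, and the failure cases as attributes of the `SchemaError` object.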

## <a name="duplicate"></a>Data Duplication
Question: How can we tell whether we have duplicate values in our dataset? 

In [8]:
schema_duplicate = pa.DataFrameSchema({
    "Student ID" : pa.Column(int)
    },
    unique=["Student ID"])

In [9]:
schema_duplicate.validate(df)

SchemaError: columns '('Student ID',)' not unique:
       column  index  failure_case
0  Student ID      6           107
1  Student ID      7           107

## <a name="exercise"></a>Exercise

You are given a dataset of customer feedback collected at a shopping mall. Validate the data against the criteria below:
1. Customer ID must be unique.
2. Gender must be either "Male" or "Female".
3. Annual income must be in a valid range (no negative numbers).
4. Spending score must be between 1 and 100.

In [10]:
shop = pd.read_csv("../data/shopping.csv")

In [11]:
shop.head()

| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score |
|---|---|---|---|---|---|
| 0 | 2 | Male | 21 | 15 | 81 |
| 1 | 173 | Male | 36 | 87 | 10 |
| 2 | 85 | Female | 21 | 54 | 57 |
| 3 | 109 | Male | 68 | 63 | 43 |
| 4 | 120 | Female | 50 | 67 | 57 |


In [12]:
shop.dtypes

CustomerID             int64
Gender                object
Age                    int64
Annual Income (k$)     int64
Spending Score         int64
dtype: object

In [13]:
# Define schema
schema = pa.DataFrameSchema({
    "CustomerID" : pa.Column(int,nullable=False),
    "Gender" : pa.Column(str, pa.Check(lambda s:s.isin(["Female", "Male"])), nullable=True),
    "Age" : pa.Column(int, checks=pa.Check.greater_than(0), nullable=True),
    "Annual income(k$)" : pa.Column(int, checks=pa.Check.greater_than_or_equal_to(0), nullable=True),
    "Spending Score" : pa.Column(int, checks=pa.Check.in_range(1,100), nullable=True)
    },
    unique = ["CustomerID"]
    )

In [14]:
# Validate schema
schema.validate(shop, lazy=True, inplace=True)

SchemaErrors: A total of 4 schema errors were found.

Error Counts
------------
- column_not_in_dataframe: 1
- schema_component_check: 2
- duplicates: 1

Schema Error Summary
--------------------
                                                                  failure_cases  n_failure_cases
schema_context  column         check                                                            
DataFrameSchema <NA>           column_in_dataframe          [Annual income(k$)]                1
                CustomerID     multiple_fields_uniqueness  [57, 58, 59, 60, 61]                5
Column          Age            greater_than(0)                       [-19, -26]                2
                Spending Score in_range(1, 100)                         [0, -5]                2

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```


### <b>Now we should ask ourselves: What is the reason for these errors?</b>
* Do we have an improperly defined schema?
* Do we expect to have negative values in our data?
* Why do we have negative values in the **Age** column? Could it be a typo?
* Why do we have values below 1 in **Spending Score**? Is it because customers received poor service?
* What should we do with our schema and our failing data points?