# GetGround DataTask

**The Data**
- GetGround currently has end-customers referred to us by partners, such as lettings agents and mortgage brokers. The customer then signs up for our service, and we pay the partner a small commission per referral.
- Referrals are on a company level: a customer who signs up for five companies counts as five referrals. Five customers in one company count as one referral.
- Partners each have consultants, such as Joe Smith working at Lettings Agent A. The referrals are attributed to the specific consultant at a partner.
- For referrals, the updated_at field essentially says when the status went from pending to either disinterested or successful. Timestamps are in Unix Nano format.
- is_outbound is true when we refer a customer to a partner, i.e. "upsell". In this case we send them the customer, and they pay us a commission. We haven't done this very thoroughly yet, so most referrals are inbound.
- Our sales people work in a "key account" model. Referrals come from partners, and a sales person typically manages partner accounts.
- We currently have sales people in the UK, Singapore and Hong Kong.
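
The company-level counting rule above can be illustrated with a small sketch. The `signups` table and its column names are hypothetical (the provided data is already one row per referral); the point is only that referrals follow distinct companies, not customers.

```python
import pandas as pd

# Hypothetical sign-up records: one row per (customer, company) pair.
signups = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 1, 1, 2, 3, 4, 5, 6],
        "company_id": [10, 11, 12, 13, 14, 20, 20, 20, 20, 20],
    }
)


def count_referrals(df: pd.DataFrame) -> int:
    """Referrals are counted per company, so count distinct company ids."""
    return df["company_id"].nunique()


one_customer_five_companies = signups[signups["customer_id"] == 1]
five_customers_one_company = signups[signups["company_id"] == 20]

print(count_referrals(one_customer_five_companies))  # 5
print(count_referrals(five_customers_one_company))   # 1
```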

**Questions and Exercises**
- Please insert the data provided as CSV into tables in an SQL database. Please include SQL queries used throughout the assignment. **DONE**: Docker + Postgres   
- Use dbt to pre-process the data and output dbt models for analysis. Include appropriate data quality tests and documentation. **DONE**: dbt - Postgres
- Analyse the data using SQL. Be sure to include your investigative thought process, findings, limitations, and assumptions. -> **Data quality analysis!**
- Based on your analysis, how would you recommend GG improve the quality of the analyses we can deliver? -> **Recommendations**

# Data quality analysis
"80% of my time was spent cleaning the data. Better data will always beat better models" - Thomson Nguyen.

##### **Goal**: analyze the quality of the data in the datasets to identify potential issues, shortcomings, and errors.

##### **Metrics to check:**
- **Coherence**: Data can be combined with other relevant data in an accurate manner.
    - An id is present in all tables, so that tables can be related without ambiguity.
    - The id is a unique key: compare the number of unique values with the total number of rows.
- **Missing data**: Count the empty fields in the datasets and understand why they are empty.
- **Duplicates**: Repetitions in the dataset. Check for duplicates across entire rows and within individual columns.
- **Bad data**: Inaccurate or wrong information.
- **Completeness**: The entries are complete and consistent.
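
As a rough illustration of how these metrics can be computed together, here is a minimal sketch. The helper name `quality_summary` and the toy table are assumptions for illustration only; they are not part of the analysis that follows.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, key: str) -> dict:
    """Return basic data quality metrics for one table."""
    return {
        "rows": len(df),
        # Coherence: the key column must not repeat.
        "key_is_unique": bool(df[key].is_unique),
        # Missing data: number of empty fields per column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Duplicates: fully repeated rows.
        "duplicated_rows": int(df.duplicated().sum()),
    }


# Toy table (made-up values) with one missing contact and one repeated row.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 3],
        "lead_sales_contact": ["Lion", None, "Tree", "Tree"],
    }
)
summary = quality_summary(toy, key="id")
print(summary)
```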


## Imports and settings

In [3]:
import pandas as pd
import numpy as np
import datetime

#sql connection
from sqlalchemy import create_engine

## Connection to DB

In [19]:
db_name = "postgres"
port = 5432
user = "postgres"
password = "postgres"
host = "localhost"

engine_template = "postgresql://{user}:{password}@{host}:{port}/{db_name}"
engine_str = engine_template.format(user=user, password=password, host=host, port=port, db_name=db_name)
print(engine_str)
engine = create_engine(engine_str)


postgresql://postgres:postgres@localhost:5432/postgres


### Checking connection

In [24]:
QUERY = 'SELECT * FROM dbt.stg_partners'

try:
    partners = pd.read_sql_query(QUERY, engine)
    print("Connected")
except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
    print(f"Not connected: {exc}")

Connected


In [25]:
partners.head()

Unnamed: 0,model_pk_id,id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,model_updated_dt
0,586327945cdfa8fe678c4d364a2b10f3,2,Agent,Potato,2020-08-31 06:47:46.322480+00:00,2020-12-04 01:51:12.823860+00:00,2022-08-14 14:52:56.070576+00:00
1,fa48d0cddd76a487c3b3ee77b0d64484,3,Agent,Lion,2020-08-31 07:37:09.759830+00:00,2021-04-20 01:31:02.989190+00:00,2022-08-14 14:52:56.070576+00:00
2,b8afa505c132130ce0429e433ff4db88,4,Agent,Potato,2020-08-31 07:38:16.698880+00:00,2021-03-25 03:10:56.065320+00:00,2022-08-14 14:52:56.070576+00:00
3,0c56e706336c13415593804a77129007,5,Agent,Lion,2020-08-31 07:43:16.281430+00:00,2020-12-07 08:43:01.086640+00:00,2022-08-14 14:52:56.070576+00:00
4,9c4d43bab9208e9438505589e43cc7f8,6,Agent,Potato,2020-08-31 09:18:53.133670+00:00,2021-01-04 06:51:57.822900+00:00,2022-08-14 14:52:56.070576+00:00


### Connecting to analytics layer

In [73]:
QUERY = 'SELECT * FROM dbt.sales_analytics_layer'
sales_layer = pd.read_sql_query(QUERY, engine)
sales_layer.head()

Unnamed: 0,partners_id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,referral_id,company_id,partner_id,consultant_id,status,is_outbound,referral_creation_date,referral_update_date,partner_name,country
0,4,Agent,Potato,2020-08-31 07:38:16.698880+00:00,2021-03-25 03:10:56.065320+00:00,1.0,385.0,4.0,4.0,successful,0.0,2020-09-01 10:25:18.374780+00:00,2020-09-01 10:25:18.374780+00:00,,
1,7,Agent,Lion,2020-08-31 09:34:00.948540+00:00,2021-02-19 07:24:33.984160+00:00,2.0,390.0,7.0,8.0,successful,0.0,2020-09-03 03:51:22.516150+00:00,2020-09-03 03:51:22.516150+00:00,Lion,HongKong
2,7,Agent,Lion,2020-08-31 09:34:00.948540+00:00,2021-02-19 07:24:33.984160+00:00,3.0,387.0,7.0,8.0,successful,0.0,2020-09-03 03:54:09.006400+00:00,2020-09-03 03:54:09.006400+00:00,Lion,HongKong
3,7,Agent,Lion,2020-08-31 09:34:00.948540+00:00,2021-02-19 07:24:33.984160+00:00,4.0,385.0,7.0,8.0,successful,0.0,2020-09-03 03:55:56.931170+00:00,2020-09-03 03:55:56.931170+00:00,Lion,HongKong
4,8,Agent,Lion,2020-09-03 03:53:06.703690+00:00,2020-12-07 08:42:55.415070+00:00,5.0,331.0,8.0,9.0,successful,0.0,2020-09-03 03:59:32.272380+00:00,2020-09-03 03:59:32.272380+00:00,Lion,HongKong


### Checking df basic info

In [40]:
sales_layer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1882 entries, 0 to 1881
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   partners_id             1882 non-null   int64              
 1   partner_type            1882 non-null   object             
 2   lead_sales_contact      1755 non-null   object             
 3   partners_creation_date  1882 non-null   datetime64[ns, UTC]
 4   partners_update_date    1882 non-null   datetime64[ns, UTC]
 5   referral_id             1470 non-null   float64            
 6   company_id              1470 non-null   float64            
 7   partner_id              1470 non-null   float64            
 8   consultant_id           1470 non-null   float64            
 9   status                  1470 non-null   object             
 10  is_outbound             1470 non-null   float64            
 11  referral_creation_date  1470 non-null   dat

## Analysis

### 1. Checking Coherence 
This step was verified in the initial exploratory analysis, prior to the generation of the analytical layer. I'll repeat it here just for validation.

#### Referrals table

In [119]:
# query raw data referrals
QUERY = 'SELECT * FROM public.referrals'
referrals = pd.read_sql_query(QUERY, engine)
referrals.head()

Unnamed: 0,id,created_at,updated_at,company_id,partner_id,consultant_id,status,is_outbound
0,1,1.59895591837478e+18,1.59895591837478e+18,385,4,4,successful,0
1,2,1.59910508251615e+18,1.59910508251615e+18,390,7,8,successful,0
2,3,1.5991052490064e+18,1.5991052490064e+18,387,7,8,successful,0
3,4,1.59910535693117e+18,1.59910535693117e+18,385,7,8,successful,0
4,5,1.59910557227238e+18,1.59910557227238e+18,331,8,9,successful,0


In [90]:
# check if the number of rows is equal to the number of unique id
print(referrals.shape[0])
print(len(referrals.id.unique()))

1470
1470


#### Partners table

In [107]:
QUERY = 'SELECT * FROM public.partners'
partners = pd.read_sql_query(QUERY, engine)
partners.head()

Unnamed: 0,id,created_at,updated_at,partner_type,lead_sales_contact
0,2,1.59885646632248e+18,1.60704667282386e+18,Agent,Potato
1,3,1.59885942975983e+18,1.61888226298919e+18,Agent,Lion
2,4,1.59885949669888e+18,1.61664185606532e+18,Agent,Potato
3,5,1.59885979628143e+18,1.60733058108664e+18,Agent,Lion
4,6,1.59886553313367e+18,1.6097431178229e+18,Agent,Potato


In [95]:
# check if the number of rows is equal to the number of unique id
print(partners.shape[0])
print(len(partners.id.unique()))

522
522


#### Sales_people table

In [101]:
QUERY = 'SELECT * FROM public.sales_people'
sales_people = pd.read_sql_query(QUERY, engine)
sales_people.head()

Unnamed: 0,name,country
0,Orange,Singapore
1,Apple,Singapore
2,Lion,HongKong
3,Tree,HongKong
4,Root,HongKong


There is no numeric id to join on, so I used the `name` column as the key and joined it to `lead_sales_contact` in the partners table.
However, having only first names makes this join error-prone.
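
One way to see which rows fail a name-based join is a left merge with `indicator=True`. The sketch below uses toy stand-ins for the two tables (values chosen for illustration), not the real data.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative only).
partners_toy = pd.DataFrame(
    {"id": [2, 3, 21], "lead_sales_contact": ["Potato", "Lion", None]}
)
sales_people_toy = pd.DataFrame(
    {"name": ["Lion", "Orange"], "country": ["HongKong", "Singapore"]}
)

# indicator=True adds a "_merge" column saying where each row found a match.
joined = partners_toy.merge(
    sales_people_toy,
    how="left",
    left_on="lead_sales_contact",
    right_on="name",
    indicator=True,
)

# "left_only" rows are partners whose contact has no row in sales_people.
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["id", "lead_sales_contact"]])
```

Any spelling variation in a first name would land in `unmatched` in the same way, which is why a proper key is preferable.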

### 2. Checking missing data

In [74]:
sales_layer.isnull().sum()

partners_id                 0
partner_type                0
lead_sales_contact        127
partners_creation_date      0
partners_update_date        0
referral_id               412
company_id                412
partner_id                412
consultant_id             412
status                    412
is_outbound               412
referral_creation_date    412
referral_update_date      412
partner_name              447
country                   447
dtype: int64

In [77]:
# percentage of null values per column (isnull() is an alias of isna())
(sales_layer.isna().mean() * 100).round(2)

partners_id                0.00
partner_type               0.00
lead_sales_contact         6.75
partners_creation_date     0.00
partners_update_date       0.00
referral_id               21.89
company_id                21.89
partner_id                21.89
consultant_id             21.89
status                    21.89
is_outbound               21.89
referral_creation_date    21.89
referral_update_date      21.89
partner_name              23.75
country                   23.75
dtype: float64

In [None]:
# Plot 

#### **Check null data in partner name and country (table sales_people)**

In [46]:
potato = sales_layer[sales_layer['partner_name'].isna()]
print(potato.shape)
potato.head()

(447, 15)


Unnamed: 0,partners_id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,referral_id,company_id,partner_id,consultant_id,status,is_outbound,referral_creation_date,referral_update_date,partner_name,country
0,4,Agent,Potato,2020-08-31 07:38:16.698880+00:00,2021-03-25 03:10:56.065320+00:00,1.0,385.0,4.0,4.0,successful,0.0,2020-09-01 10:25:18.374780+00:00,2020-09-01 10:25:18.374780+00:00,,
11,6,Agent,Potato,2020-08-31 09:18:53.133670+00:00,2021-01-04 06:51:57.822900+00:00,12.0,0.0,6.0,7.0,successful,0.0,2020-09-04 07:20:17.954300+00:00,2020-09-24 15:42:17.243660+00:00,,
12,6,Agent,Potato,2020-08-31 09:18:53.133670+00:00,2021-01-04 06:51:57.822900+00:00,13.0,0.0,6.0,7.0,successful,0.0,2020-09-04 07:20:42.749080+00:00,2020-09-04 07:20:42.749080+00:00,,
18,6,Agent,Potato,2020-08-31 09:18:53.133670+00:00,2021-01-04 06:51:57.822900+00:00,19.0,444.0,6.0,7.0,successful,0.0,2020-09-14 02:15:13.822570+00:00,2020-09-14 02:15:13.822570+00:00,,
20,2,Agent,Potato,2020-08-31 06:47:46.322480+00:00,2020-12-04 01:51:12.823860+00:00,21.0,0.0,2.0,16.0,successful,0.0,2020-09-14 15:08:09.439530+00:00,2020-09-14 15:26:15.952350+00:00,,


In [48]:
potato.lead_sales_contact.unique()

array(['Potato', None], dtype=object)

#### **Check null data in lead_sales_contact (table partners)**

In [51]:
partners_null = sales_layer[sales_layer['lead_sales_contact'].isna()]
print(partners_null.shape)
partners_null.head()

(127, 15)


Unnamed: 0,partners_id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,referral_id,company_id,partner_id,consultant_id,status,is_outbound,referral_creation_date,referral_update_date,partner_name,country
103,21,Other,,2020-09-24 09:52:12.527970+00:00,2020-09-24 09:58:32.036060+00:00,104.0,521.0,21.0,39.0,successful,0.0,2020-10-05 08:37:48.240370+00:00,2020-10-05 08:37:48.240370+00:00,,
104,21,Other,,2020-09-24 09:52:12.527970+00:00,2020-09-24 09:58:32.036060+00:00,105.0,497.0,21.0,39.0,successful,0.0,2020-10-05 08:39:48.752400+00:00,2020-10-05 08:39:48.752400+00:00,,
112,21,Other,,2020-09-24 09:52:12.527970+00:00,2020-09-24 09:58:32.036060+00:00,113.0,0.0,21.0,39.0,successful,0.0,2020-10-07 02:33:39.678320+00:00,2020-10-07 02:33:39.678320+00:00,,
125,22,Other,,2020-09-24 09:53:39.932750+00:00,2020-09-24 09:58:27.361180+00:00,126.0,543.0,22.0,40.0,successful,0.0,2020-10-12 04:58:28.754320+00:00,2020-10-27 15:39:09.531300+00:00,,
127,22,Other,,2020-09-24 09:53:39.932750+00:00,2020-09-24 09:58:27.361180+00:00,128.0,549.0,22.0,40.0,disinterested,0.0,2020-10-12 05:01:11.301240+00:00,2020-11-13 13:59:16.645150+00:00,,


#### **Check null referrals**

In [53]:
referrals_null = sales_layer[sales_layer['referral_id'].isna()]
print(referrals_null.shape)
referrals_null.head()

(412, 15)


Unnamed: 0,partners_id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,referral_id,company_id,partner_id,consultant_id,status,is_outbound,referral_creation_date,referral_update_date,partner_name,country
1470,251,Agent,Tree,2021-02-05 01:07:07.449050+00:00,2021-02-05 01:07:07.449050+00:00,,,,,,,NaT,NaT,Tree,HongKong
1471,106,Agent,Lion,2020-12-07 08:38:51.260320+00:00,2020-12-07 08:38:51.260320+00:00,,,,,,,NaT,NaT,Lion,HongKong
1472,285,Agent,Leaf,2021-02-09 10:43:29.907890+00:00,2021-02-09 10:43:29.907890+00:00,,,,,,,NaT,NaT,Leaf,UK
1473,120,Agent,Tree,2020-12-09 07:50:33.896520+00:00,2020-12-09 07:50:33.896520+00:00,,,,,,,NaT,NaT,Tree,HongKong
1474,264,Agent,Root,2021-02-09 03:42:22.980590+00:00,2021-02-09 03:42:22.980590+00:00,,,,,,,NaT,NaT,Root,HongKong


### 3. Checking duplicates:

##### Sanity check: Understanding relation between partners x referrals.

In [61]:
# Rows whose partners_id already appeared earlier, i.e. partners with more than one referral
partners_duplicated = sales_layer[sales_layer['partners_id'].duplicated()]
partners_duplicated.head()

Unnamed: 0,partners_id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,referral_id,company_id,partner_id,consultant_id,status,is_outbound,referral_creation_date,referral_update_date,partner_name,country
2,7,Agent,Lion,2020-08-31 09:34:00.948540+00:00,2021-02-19 07:24:33.984160+00:00,3.0,387.0,7.0,8.0,successful,0.0,2020-09-03 03:54:09.006400+00:00,2020-09-03 03:54:09.006400+00:00,Lion,HongKong
3,7,Agent,Lion,2020-08-31 09:34:00.948540+00:00,2021-02-19 07:24:33.984160+00:00,4.0,385.0,7.0,8.0,successful,0.0,2020-09-03 03:55:56.931170+00:00,2020-09-03 03:55:56.931170+00:00,Lion,HongKong
5,8,Agent,Lion,2020-09-03 03:53:06.703690+00:00,2020-12-07 08:42:55.415070+00:00,6.0,364.0,8.0,11.0,successful,0.0,2020-09-03 04:00:07.422910+00:00,2020-09-03 04:00:07.422910+00:00,Lion,HongKong
6,8,Agent,Lion,2020-09-03 03:53:06.703690+00:00,2020-12-07 08:42:55.415070+00:00,7.0,362.0,8.0,11.0,successful,0.0,2020-09-03 04:00:39.345260+00:00,2020-09-04 09:15:29.752330+00:00,Lion,HongKong
7,8,Agent,Lion,2020-09-03 03:53:06.703690+00:00,2020-12-07 08:42:55.415070+00:00,8.0,373.0,8.0,13.0,successful,0.0,2020-09-03 04:02:14.818720+00:00,2020-09-03 04:02:14.818720+00:00,Lion,HongKong


##### Check referrals duplicated

In [66]:
referrals_duplicated = sales_layer[sales_layer['referral_id'].duplicated()]
print(referrals_duplicated.shape)
referrals_duplicated.head()

(411, 15)


Unnamed: 0,partners_id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,referral_id,company_id,partner_id,consultant_id,status,is_outbound,referral_creation_date,referral_update_date,partner_name,country
1471,106,Agent,Lion,2020-12-07 08:38:51.260320+00:00,2020-12-07 08:38:51.260320+00:00,,,,,,,NaT,NaT,Lion,HongKong
1472,285,Agent,Leaf,2021-02-09 10:43:29.907890+00:00,2021-02-09 10:43:29.907890+00:00,,,,,,,NaT,NaT,Leaf,UK
1473,120,Agent,Tree,2020-12-09 07:50:33.896520+00:00,2020-12-09 07:50:33.896520+00:00,,,,,,,NaT,NaT,Tree,HongKong
1474,264,Agent,Root,2021-02-09 03:42:22.980590+00:00,2021-02-09 03:42:22.980590+00:00,,,,,,,NaT,NaT,Root,HongKong
1475,497,Agent,Sky,2021-05-03 06:17:19.964020+00:00,2021-05-03 06:17:19.964020+00:00,,,,,,,NaT,NaT,Sky,HongKong


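Meaning: the flagged rows are the ones without a referral id. `duplicated()` treats nulls as equal and leaves the first one unflagged, so the 412 null ids show up as 411 "duplicates"; no real referral id is repeated.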
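
The behaviour of `duplicated()` with missing values can be reproduced on a toy column. This is a minimal sketch with made-up ids; it also shows a stricter check that ignores nulls.

```python
import numpy as np
import pandas as pd

# Toy column: three real ids and three missing ones.
referral_id = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan, np.nan])

# duplicated() treats NaN values as equal and leaves the first one unflagged.
flagged = referral_id.duplicated()
print(flagged.sum())  # 2 of the 3 NaN rows are flagged

# Restricting the check to non-null ids shows whether real ids repeat.
real_duplicates = referral_id.dropna().duplicated()
print(real_duplicates.sum())  # 0
```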

##### Check duplicated rows

In [99]:
# Check for duplicates
duplicated_rows = sales_layer[sales_layer.duplicated()]
duplicated_rows

Unnamed: 0,partners_id,partner_type,lead_sales_contact,partners_creation_date,partners_update_date,referral_id,company_id,partner_id,consultant_id,status,is_outbound,referral_creation_date,referral_update_date,partner_name,country


There are no duplicated rows.

### 4. Bad data
This step was verified in the initial exploratory analysis, prior to the generation of the analytical layer. I'll repeat it here just for validation.

#### Sales_people table
Checking for bad data or inconsistencies in the name and country columns.

In [102]:
sales_people.head()

Unnamed: 0,name,country
0,Orange,Singapore
1,Apple,Singapore
2,Lion,HongKong
3,Tree,HongKong
4,Root,HongKong


##### **Unique values in country column**

In [104]:

print(len(sales_people.country.unique()))
print((sales_people.country.unique()))

3
['Singapore' 'HongKong' 'UK']


##### **Unique names in sales people**

In [105]:
sales_unique_names = list(sales_people.name.unique())
sales_unique_names.sort()
print(sales_unique_names)
print(len(sales_unique_names))

['Apple', 'Cloud', 'Daisy', 'Fig', 'Horiz', 'Leaf', 'Lion', 'Orange', 'Root', 'Sky', 'Tree', 'Tulip']
12


##### **Checking whether lead_sales_contact values match sales_people names**

In [109]:
partners_unique_names = list(partners.lead_sales_contact.unique())
partners_unique_names.sort()
print(partners_unique_names)
print(len(partners_unique_names))
# note the '0' placeholder among the names

['0', 'Apple', 'Cloud', 'Daisy', 'Fig', 'Horiz', 'Leaf', 'Lion', 'Potato', 'Root', 'Sky', 'Tree', 'Tulip']
13


In [110]:
sales_names_not_in_partners = set(sales_unique_names) - set(partners_unique_names)
sales_names_not_in_partners

{'Orange'}

In [112]:
partners_names_not_in_sales = set(partners_unique_names) - set(sales_unique_names)
partners_names_not_in_sales

{'0', 'Potato'}

##### **Bad data in partners**

In [124]:
partners.dtypes

id                     int64
created_at            object
updated_at            object
partner_type          object
lead_sales_contact    object
dtype: object

In [136]:
partners.partner_type.unique()

array(['Agent', 'IFA', 'Developer', 'Other', 'Lender',
       'Management company', 'Insurer', 'Influencer'], dtype=object)

- Sales names not in partners: Orange
- Partners names not in sales: 0 and Potato

##### **Bad data in referrals**

In [127]:
referrals.dtypes

id                int64
created_at       object
updated_at       object
company_id        int64
partner_id        int64
consultant_id     int64
status           object
is_outbound       int64
dtype: object

In [129]:
# referrals.company_id.unique()  # replace 0 with NA so that missing company ids are not masked as real values

In [134]:
# referrals.consultant_id.unique() # ok

In [132]:
referrals.status.unique() # ok

array(['successful', 'disinterested', 'pending'], dtype=object)

In [133]:
referrals.is_outbound.unique() # Ok

array([0, 1])
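
The `company_id` note above (treat 0 as missing) could look like the following sketch. It assumes that `company_id == 0` really means "unknown", which should be confirmed with the data owners; the toy rows are illustrative.

```python
import pandas as pd

# Toy referrals; the assumption is that company_id == 0 means "unknown".
referrals_toy = pd.DataFrame({"id": [12, 13, 19], "company_id": [0, 0, 444]})

# The nullable Int64 dtype keeps the ids as integers while allowing <NA>.
company = referrals_toy["company_id"].astype("Int64")

# mask() blanks out the positions where the condition is True.
referrals_toy["company_id"] = company.mask(company == 0)

print(referrals_toy["company_id"].isna().sum())  # 2
```

With the zeros converted, the missing-data counts reflect these rows instead of hiding them behind a valid-looking id.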

##### **Date format**

In [140]:
partners.dtypes

id                     int64
created_at            object
updated_at            object
partner_type          object
lead_sales_contact    object
dtype: object

In [141]:
partners['created_at'] = partners['created_at'].astype(float)
partners['updated_at'] = partners['updated_at'].astype(float)

In [143]:
# check dates
partner_corrected = partners.astype({'created_at':'datetime64[ns]', 'updated_at': 'datetime64[ns]'})
partner_corrected.created_at.min()

Timestamp('2020-08-31 06:47:46.322480128')

In [144]:
partners['created_at_len'] = partners['created_at'].astype(str).map(len)
partners.head()

Unnamed: 0,id,created_at,updated_at,partner_type,lead_sales_contact,created_at_len
0,2,1.598856e+18,1.607047e+18,Agent,Potato,20
1,3,1.598859e+18,1.618882e+18,Agent,Lion,20
2,4,1.598859e+18,1.616642e+18,Agent,Potato,20
3,5,1.59886e+18,1.607331e+18,Agent,Lion,20
4,6,1.598866e+18,1.609743e+18,Agent,Potato,20


In [146]:
partners.created_at_len.unique()

array([20, 19, 17, 18])

The Unix timestamps have different string lengths (17 to 20 characters), i.e. they are not stored in one consistent format.
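
Because the raw values arrive in exponential notation, casting them to `float` introduces rounding (note the trailing `...480128` in the timestamp printed above). A minimal sketch of an exact conversion, using `decimal.Decimal` from the standard library; the helper name `parse_unix_nano` is hypothetical:

```python
from decimal import Decimal

import pandas as pd


def parse_unix_nano(value: str) -> pd.Timestamp:
    """Convert a Unix-nanosecond value written in exponential notation."""
    nanoseconds = int(Decimal(value))  # exact conversion, no float rounding
    return pd.to_datetime(nanoseconds, unit="ns", utc=True)


# Value taken from the first row of the partners table shown earlier.
ts = parse_unix_nano("1.59885646632248e+18")
print(ts)
```

The same function handles shorter strings such as `"1.59886e+18"`, since the length of the mantissa does not matter to `Decimal`.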

# Results and recommendations

### **1. Missing data**: 

The analysis shows missing data in the following columns:
- One **lead sales contact** (Potato) is present in the partners table but not in sales_people.
- There are 447 nulls in partner **name and country**. They come from partner rows whose lead sales contact is either null (127 rows) or 'Potato', neither of which matches a row in sales_people.
- There are 412 rows with no **referral** data. These are partners that simply do not have a referral yet.

- **Recommendations**: 
We should not have missing data. Names that are missing from sales_people or from lead_sales_contact leave records unlinked in the analytics layer. For instance, Potato appears in lead_sales_contact but not in sales_people, so those partner rows have no name or country in the final layer. Conversely, Orange appears in sales_people but in no lead_sales_contact, so that sales person does not appear in the final layer at all.

### **2. Coherence and completeness**:

- **Coherence**: 
    - We do have unique ids to relate the referrals table to partners, but we do not have a proper id to relate sales_people to partners.
- **Completeness**: 
    - Sales people are recorded by first name only, without last names.
- **Recommendations**: 
    - A referral should be traceable to a lead sales contact (name and country), which gives a coherent picture of the referral. To improve quality, we should create an id, or at least add the last name, to link sales_people.name and lead_sales_contact.
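
A sketch of what the id recommendation could look like in practice. The column name `sales_person_id` and the toy rows are assumptions for illustration; in a real system the key would be assigned by the source application rather than backfilled from names.

```python
import pandas as pd

sales_people_toy = pd.DataFrame(
    {"name": ["Lion", "Tree", "Leaf"], "country": ["HongKong", "HongKong", "UK"]}
)
partners_toy = pd.DataFrame(
    {"id": [3, 251, 285], "lead_sales_contact": ["Lion", "Tree", "Leaf"]}
)

# Hypothetical surrogate key: a stable integer assigned once per sales person.
sales_people_toy["sales_person_id"] = range(1, len(sales_people_toy) + 1)

# One-off backfill: translate the existing name column into the new key.
name_to_id = sales_people_toy.set_index("name")["sales_person_id"]
partners_toy["sales_person_id"] = partners_toy["lead_sales_contact"].map(name_to_id)

# From here on the join no longer depends on how a name is spelled.
linked = partners_toy.merge(sales_people_toy, on="sales_person_id", how="left")
print(linked[["id", "sales_person_id", "country"]])
```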


### **3. Duplicates**: 

There are no duplicates in the datasets.

### **4. Bad data**: 
Some bad data was filtered or replaced with SQL in the dbt models.

- Sales names not in partners: Orange
- Partners names not in sales: 0 and Potato
- Dates with inconsistent format/length: the Unix nano timestamps were the biggest challenge. Converting them to datetime in Python is easy, but doing it in SQL was more complicated, mainly because the exponential-notation values differ in length. I had to write several cases for the transformation.
- Incomplete data may be unusable, and dates are a common source of errors in the transformation process. We should validate the number of digits or establish a simpler, more comprehensible datetime format.
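
The digit validation suggested above could be as simple as the sketch below. The function name is hypothetical, and the check assumes timestamps between 2001-09-09 and 2286-11-20, the range in which a nanosecond Unix timestamp has exactly 19 digits.

```python
def is_valid_unix_nano(value: int) -> bool:
    """Check that an integer looks like a nanosecond Unix timestamp.

    Assumption: valid timestamps are positive and have exactly 19 digits,
    which holds for dates between 2001-09-09 and 2286-11-20.
    """
    return isinstance(value, int) and value > 0 and len(str(value)) == 19


print(is_valid_unix_nano(1598856466322480000))  # True
print(is_valid_unix_nano(1598856466322))        # False: millisecond precision
```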

### **General notes and recommendations**:

Curating and cleaning data can take up to 80% of the time in data science projects. Every organization that relies on data for decision-making should practice data quality analysis, so that its decisions are based on accurate and up-to-date data rather than incorrect and out-of-date data.


- **Definitions of fields**: Individual data fields should have a well-defined and unambiguous meaning. We can create a data dictionary to improve the quality and interpretability of the data.

- **Data generation**: When data entry is done by humans, there are many opportunities for error. We should standardize and control data entry to maintain data quality, usability and comparability.
- **Data governance / data management tools**: Put data management tools into use to eliminate or reduce human error. Data should be collected and stored automatically, and entries should be validated. Clear protocols and training sessions support this.
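
As an illustration of validating entries at the point of capture, here is a minimal sketch. The allowed values are copied from the unique values found in the analysis above; the function name and the idea of hard-coding the sets are assumptions, and in practice the sets would come from reference tables.

```python
from typing import List

# Allowed values taken from the analysis above (assumed to be the full lists).
PARTNER_TYPES = {
    "Agent", "IFA", "Developer", "Other", "Lender",
    "Management company", "Insurer", "Influencer",
}
SALES_PEOPLE = {
    "Apple", "Cloud", "Daisy", "Fig", "Horiz", "Leaf",
    "Lion", "Orange", "Root", "Sky", "Tree", "Tulip",
}


def validate_partner(record: dict) -> List[str]:
    """Return a list of problems found in one partner record (empty if none)."""
    errors = []
    if record.get("partner_type") not in PARTNER_TYPES:
        errors.append(f"unknown partner_type: {record.get('partner_type')!r}")
    if record.get("lead_sales_contact") not in SALES_PEOPLE:
        errors.append(
            f"unknown lead_sales_contact: {record.get('lead_sales_contact')!r}"
        )
    return errors


print(validate_partner({"partner_type": "Agent", "lead_sales_contact": "Lion"}))
print(validate_partner({"partner_type": "Agent", "lead_sales_contact": "Potato"}))
```

A check like this would have rejected both the `Potato` and the `0` contacts before they reached the partners table.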