---

# Phase 2. Data Understanding
The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses.

## 2.1 Collect initial data
### 2.1.1 Task
Acquire the data (or access to the data) listed in the project resources. This initial collection includes data loading, if necessary for data understanding.

#### 2.1.1.1 Import Libraries
Import the Python libraries that provide the functionality we will need.

In [1]:
import os
import re
import sys
import time
from datetime import datetime, timedelta
from math import radians, asin, sqrt, cos, sin

import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis, iqr

#### 2.1.1.2 Load the data
Establish the path to the data location.

In [2]:
Data_Path = r'C:\Users\mjamesbreen\OneDrive - KPMG\Projects\Santander\fakedata'
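
A hard-coded path only works on one machine. As a minimal sketch, assuming the data directory could instead be supplied through an environment variable, the path could be resolved as follows. The variable name `DATA_PATH`, the fallback folder `data`, and the helper `resolve_data_path` are hypothetical and are not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_data_path(env_var="DATA_PATH", default="data"):
    """Return the data directory named by env_var, or the default folder.

    Both the variable name and the default are illustrative assumptions.
    """
    return Path(os.environ.get(env_var, default)).expanduser()


data_dir = resolve_data_path()
# Path objects can be joined with "/" and are accepted by pd.read_csv.
customer_file = data_dir / "FakeCustomerLocationData.csv"
print(customer_file)
```

The resulting `Path` can also be passed to `os.path.join`, so the loading cells would not need to change.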

### 2.1.2 Output
#### 2.1.2.1 Initial data collection report
List the dataset(s) acquired, together with their locations, the methods used to acquire them, and any problems encountered (and how those problems were resolved).
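
One way to produce such a report is to tabulate each dataset with its location, acquisition method, and size. The sketch below is illustrative only: the helper `collection_report` is hypothetical, and the small frames stand in for the CSV files so that the example runs without access to the data folder.

```python
import pandas as pd


def collection_report(datasets):
    """Build a one-row-per-dataset summary.

    datasets maps a dataset name to a (location, DataFrame) pair.
    """
    rows = []
    for name, (location, df) in datasets.items():
        rows.append({
            "dataset": name,
            "location": location,
            "method": "pd.read_csv",
            "rows": len(df),
            "columns": df.shape[1],
        })
    return pd.DataFrame(rows)


# Stand-in frames; in the notebook these would be the loaded DataFrames.
example_datasets = {
    "df_cust_ll": (
        "FakeCustomerLocationData.csv",
        pd.DataFrame({"customer_source_unique_id": ["RUKJ1234567"],
                      "cust_lat": [51.503], "cust_long": [-0.019]}),
    ),
    "df_turnovers": (
        "FakeTurnoverData.csv",
        pd.DataFrame({"cust_id": ["RUKJ1234567"],
                      "credit_turnover_12": [180000],
                      "credit_turnover_24": [190000],
                      "debit_turnover_12": [90000],
                      "debit_turnover_24": [80000]}),
    ),
}

report = collection_report(example_datasets)
print(report)
```

Problems encountered during loading (for example, encoding errors or missing files) would still need to be recorded by hand alongside this table.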

#### 2.1.2.2 Customer geographic data for F11, 12, 21

In [3]:
df_cust_ll = pd.read_csv(os.path.join(Data_Path,r'FakeCustomerLocationData.csv'))

#### 2.1.2.3 Branch geographic data for F11, 12, 21

In [4]:
df_branch_11 = pd.read_csv(os.path.join(Data_Path,r'FakeBranchLocations.csv')) 

#### 2.1.2.4 Turnover data for F......

In [5]:
df_turnovers = pd.read_csv(os.path.join(Data_Path,r'FakeTurnoverData.csv'))

## 2.2 Describe data
### 2.2.1 Task
Examine the “gross” or “surface” properties of the acquired data and report on the results.

### 2.2.2 Output
#### 2.2.2.1 Data description report
Describe the data that has been acquired, including the format of the data, the quantity of data (for example, the number of records and fields in each table), the identities of the fields, and any other surface features which have been discovered.
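
The separate `head()`, `describe()`, and `info()` calls used in this section can be condensed into a single per-field summary. The following is a rough sketch under that assumption; the helper name `describe_frame` is hypothetical, and the frame reuses the five branch rows shown in this section.

```python
import pandas as pd


def describe_frame(df):
    """Return one row per field: dtype, non-null count, null count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),
    })


branches = pd.DataFrame({
    "centre_code": [1234, 4321, 5678, 9876, 1010],
    "latitude": [51.495, 51.279, 51.912, 53.685, 53.502],
    "longitude": [-0.182, 1.08, 0.919, -1.494, -2.157],
})

summary = describe_frame(branches)
print(f"{len(branches)} records, {branches.shape[1]} fields")
print(summary)
```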

#### 2.2.2.2 Data description report - df_cust_ll

In [6]:
df_cust_ll.head()

,customer_source_unique_id,cust_lat,cust_long
0,RUKJ1234567,51.503,-0.019


In [7]:
df_cust_ll.describe()

,cust_lat,cust_long
count,1.0,1.0
mean,51.503,-0.019
std,,
min,51.503,-0.019
25%,51.503,-0.019
50%,51.503,-0.019
75%,51.503,-0.019
max,51.503,-0.019


In [8]:
df_cust_ll.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customer_source_unique_id  1 non-null      object 
 1   cust_lat                   1 non-null      float64
 2   cust_long                  1 non-null      float64
dtypes: float64(2), object(1)
memory usage: 152.0+ bytes


#### 2.2.2.3 Data description report - df_branch_11

In [9]:
df_branch_11.head()

,centre_code,latitude,longitude
0,1234,51.495,-0.182
1,4321,51.279,1.08
2,5678,51.912,0.919
3,9876,53.685,-1.494
4,1010,53.502,-2.157


In [10]:
df_branch_11.describe()

,centre_code,latitude,longitude
count,5.0,5.0,5.0
mean,4423.8,52.3746,-0.3668
std,3644.82252,1.137565,1.436668
min,1010.0,51.279,-2.157
25%,1234.0,51.495,-1.494
50%,4321.0,51.912,-0.182
75%,5678.0,53.502,0.919
max,9876.0,53.685,1.08


In [11]:
df_branch_11.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   centre_code  5 non-null      int64  
 1   latitude     5 non-null      float64
 2   longitude    5 non-null      float64
dtypes: float64(2), int64(1)
memory usage: 248.0 bytes


Convert centre_code from int to string so that it is treated as an identifier rather than a numeric field.

In [12]:
df_branch_11['centre_code'] = df_branch_11['centre_code'].astype(str)

In [13]:
df_branch_11.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   centre_code  5 non-null      object 
 1   latitude     5 non-null      float64
 2   longitude    5 non-null      float64
dtypes: float64(2), object(1)
memory usage: 248.0+ bytes


In [14]:
df_branch_11.describe()

,latitude,longitude
count,5.0,5.0
mean,52.3746,-0.3668
std,1.137565,1.436668
min,51.279,-2.157
25%,51.495,-1.494
50%,51.912,-0.182
75%,53.502,0.919
max,53.685,1.08
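
Converting after loading works for the codes shown here, but if a code ever had a leading zero it would already have been lost when pandas parsed the column as an integer. A minimal sketch of the alternative, reading the column as a string at load time with the `dtype` argument of `pd.read_csv`, is shown below. The code `0123` is a hypothetical value used only to illustrate the difference; no such code appears in the source data.

```python
import io

import pandas as pd

# Inline CSV text stands in for the branch file; "0123" is a made-up code.
csv_text = (
    "centre_code,latitude,longitude\n"
    "0123,51.495,-0.182\n"
    "4321,51.279,1.08\n"
)

# Default parsing infers an integer column, which drops the leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))
converted_later = as_int["centre_code"].astype(str).tolist()

# Declaring the dtype up front keeps the code exactly as written in the file.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"centre_code": str})
read_as_string = as_str["centre_code"].tolist()

print(converted_later)
print(read_as_string)
```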


#### 2.2.2.4 Data description report - df_turnovers

In [15]:
df_turnovers.head()

,cust_id,credit_turnover_12,credit_turnover_24,debit_turnover_12,debit_turnover_24
0,RUKJ1234567,180000,190000,90000,80000


In [16]:
df_turnovers.describe()

,credit_turnover_12,credit_turnover_24,debit_turnover_12,debit_turnover_24
count,1.0,1.0,1.0,1.0
mean,180000.0,190000.0,90000.0,80000.0
std,,,,
min,180000.0,190000.0,90000.0,80000.0
25%,180000.0,190000.0,90000.0,80000.0
50%,180000.0,190000.0,90000.0,80000.0
75%,180000.0,190000.0,90000.0,80000.0
max,180000.0,190000.0,90000.0,80000.0


In [17]:
df_turnovers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   cust_id             1 non-null      object
 1   credit_turnover_12  1 non-null      int64 
 2   credit_turnover_24  1 non-null      int64 
 3   debit_turnover_12   1 non-null      int64 
 4   debit_turnover_24   1 non-null      int64 
dtypes: int64(4), object(1)
memory usage: 168.0+ bytes


## 2.3 Explore data
### 2.3.1 Task
This task addresses data mining questions using querying, visualization, and reporting techniques. These include the distribution of key attributes (for example, the target attribute of a prediction task), relationships between pairs or small numbers of attributes, results of simple aggregations, properties of significant sub-populations, and simple statistical analyses.
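
As an illustration of a relationship between attributes, the trigonometric imports at the top of the notebook suggest a great-circle distance between customers and branches. The notebook does not show that calculation, so the sketch below is an assumption: it uses the standard haversine formula with a mean Earth radius of 6371 km, and the names `haversine_km` and `EARTH_RADIUS_KM` are hypothetical. The coordinates are the sample rows shown in section 2.2.

```python
from math import radians, asin, sqrt, cos, sin

import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


branches = pd.DataFrame({
    "centre_code": ["1234", "4321", "5678", "9876", "1010"],
    "latitude": [51.495, 51.279, 51.912, 53.685, 53.502],
    "longitude": [-0.182, 1.08, 0.919, -1.494, -2.157],
})
cust_lat, cust_long = 51.503, -0.019

# Distance from the single sample customer to every branch.
distances = branches.apply(
    lambda row: haversine_km(cust_lat, cust_long, row["latitude"], row["longitude"]),
    axis=1,
)
nearest = branches.loc[distances.idxmin(), "centre_code"]
print(nearest, round(distances.min(), 1))
```

Row-wise `apply` is adequate for a handful of branches; a larger dataset would likely call for a vectorised NumPy version of the same formula.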

### 2.3.2 Output
Describe the results of this task, including first findings or initial hypotheses and their impact on the remainder of the project.

## 2.4 Verify data quality
### 2.4.1 Task
Examine the quality of the data, addressing questions such as: Is the data complete (does it cover all the cases required)? Is it correct, or does it contain errors and, if there are errors, how common are they?
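
A few of these questions can be checked mechanically. The sketch below counts missing values, duplicate customer identifiers, coordinates outside the valid latitude and longitude ranges, and turnover rows with no matching customer. It assumes that `cust_id` in the turnover data refers to the same identifier as `customer_source_unique_id` in the customer data, which the sample rows suggest but the source does not state. The helper name `quality_checks` is hypothetical.

```python
import pandas as pd


def quality_checks(customers, turnovers):
    """Return counts of simple data quality problems as a dictionary."""
    issues = {}
    issues["missing_values"] = int(
        customers.isna().sum().sum() + turnovers.isna().sum().sum()
    )
    issues["duplicate_customer_ids"] = int(
        customers["customer_source_unique_id"].duplicated().sum()
    )
    # between() is False for NaN, so missing coordinates are also counted here.
    bad_lat = ~customers["cust_lat"].between(-90, 90)
    bad_long = ~customers["cust_long"].between(-180, 180)
    issues["out_of_range_coordinates"] = int((bad_lat | bad_long).sum())
    # Assumes cust_id and customer_source_unique_id share one key space.
    known_ids = set(customers["customer_source_unique_id"])
    issues["turnovers_without_customer"] = int(
        (~turnovers["cust_id"].isin(known_ids)).sum()
    )
    return issues


customers = pd.DataFrame({
    "customer_source_unique_id": ["RUKJ1234567"],
    "cust_lat": [51.503],
    "cust_long": [-0.019],
})
turnovers = pd.DataFrame({
    "cust_id": ["RUKJ1234567"],
    "credit_turnover_12": [180000],
    "credit_turnover_24": [190000],
    "debit_turnover_12": [90000],
    "debit_turnover_24": [80000],
})

quality = quality_checks(customers, turnovers)
print(quality)
```

These counts only show that problems exist; deciding how to resolve them still depends on the business context.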


### 2.4.2 Output
#### 2.4.2.1 Data quality report
List the results of the data quality verification; if quality problems exist, list possible solutions.

---