## Libraries

In [3]:
# Installations
!pip install pandas

Collecting pandas
  Downloading pandas-1.5.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting numpy>=1.20.3
  Downloading numpy-1.23.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy, pandas
Successfully installed numpy-1.23.4 pandas-1.5.1


In [7]:
import pandas as pd

## Loading the data

- What files do we have?
- What are the contents of the files?
- Which data is useful for us?

In [9]:
# list files
%ls data

data_path = "data/"

sample_submission.csv  Train.csv                  VariableDefinitions.csv
Test.csv               unlinked_masked_final.csv


In [10]:
# Read the VariableDefinitions file
var_defs = pd.read_csv(data_path + 'VariableDefinitions.csv')
var_defs

Unnamed: 0,Variable,Definition
0,CustomerId,Unique number identifying the customer on plat...
1,TransactionStartTime,Transaction start time
2,Value,Value of transaction
3,Amount,Value of Transaction with charges
4,TransactionId,Unique transaction identifier on platform
5,BatchId,Identifier for bulk transactions being done on...
6,SubscriptionId,You can have one account with multiple subscri...
7,CurrencyCode,Country currency
8,CountryCode,Numerical geographical code of country
9,ProviderId,Source provider of Item bought


### Understanding the variables and commenting on them
#### Key variables
- **Value:** Actual amount being transacted
- **Amount:** Total amount including charges
(Together, these two variables can show how transaction costs affect transaction behavior, e.g. whether people split large sums of money into smaller transactions.)
- **IssuedDate, PaidOnDate, DueDate:** Track the duration of the loan
(From these we can derive two variables: 1. LoanDuration (days) 2. OverDue (T/F))
- **AmountLoan:** Value of the loan
- **IsDefaulted:** Whether the loan was repaid or not
- **InvestorId:** Does the investor/institution affect how likely one is to receive a loan? Are there institutions with more defaulters?
- **TransactionStatus:** (y) -> Loan accepted or rejected
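
The two derived variables mentioned above (LoanDuration and OverDue) could be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual feature code. It assumes the issue date lives in a column named `IssuedDateLoan` (as in the test file header) and that duration means issue date to repayment date; both are assumptions.

```python
import pandas as pd

# Toy loan records (invented values); the real columns and formats may differ.
loans = pd.DataFrame({
    "IssuedDateLoan": ["2019-01-01 10:00:00", "2019-01-05 09:30:00"],
    "DueDate": ["2019-01-31 10:00:00", "2019-02-04 09:30:00"],
    "PaidOnDate": ["2019-01-20 12:00:00", "2019-02-10 08:00:00"],
})

# Parse the date strings so they can be subtracted and compared.
for col in ["IssuedDateLoan", "DueDate", "PaidOnDate"]:
    loans[col] = pd.to_datetime(loans[col])

# LoanDuration: whole days between issue and repayment (assumed definition).
loans["LoanDuration"] = (loans["PaidOnDate"] - loans["IssuedDateLoan"]).dt.days

# OverDue: True when the loan was repaid after its due date.
loans["OverDue"] = loans["PaidOnDate"] > loans["DueDate"]

print(loans[["LoanDuration", "OverDue"]])
```

Rows with a missing `PaidOnDate` would get a missing duration and `OverDue == False` under this sketch, so unpaid loans would need separate handling.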

#### Might be useful
- _TransactionStartTime:_ Can reveal transaction patterns, e.g. which days or times are peak and which are low
- _ProviderId:_ Does a certain source have more transactions or customers? Does a provider have regular customers?
- _ProductId:_ Is a certain product bought more often? What are the trends and patterns?
- _ProductCategory:_ (Same as above)
- _ChannelId:_ Is there a preferred channel for transacting?
- _InvestorId:_ Are some financial services preferred for loans?
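
As a rough sketch of the peak-time idea for TransactionStartTime, the timestamp can be split into hour and weekday and then counted. The five timestamps below are the ones shown in the training preview. The column names `Hour` and `DayOfWeek` are hypothetical helpers, not part of the dataset.

```python
import pandas as pd

# Timestamps copied from the first five training rows.
tx = pd.DataFrame({
    "TransactionStartTime": [
        "2018-09-21 12:17:39",
        "2018-09-25 09:20:29",
        "2018-09-25 10:33:31",
        "2018-09-27 10:26:41",
        "2018-09-27 12:44:21",
    ]
})
tx["TransactionStartTime"] = pd.to_datetime(tx["TransactionStartTime"])

# Hypothetical helper columns for grouping.
tx["Hour"] = tx["TransactionStartTime"].dt.hour
tx["DayOfWeek"] = tx["TransactionStartTime"].dt.day_name()

# Number of transactions per hour of day and per weekday.
per_hour = tx["Hour"].value_counts().sort_index()
per_day = tx["DayOfWeek"].value_counts()

print(per_hour)
print(per_day)
```

On the full dataset the same counts could be plotted to spot busy and quiet periods.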

#### Sensitive variables
- CustomerId: Do we want our model to associate a customer with some fixed prediction even if their behavior might change? This could introduce bias.

#### Not necessarily useful
- TransactionId (But can we find transactions that are anomalous?)
- BatchId
- SubscriptionId
- CurrencyCode
- CountryCode
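
One way to check whether columns such as CurrencyCode and CountryCode carry any information is to count their distinct values: a column with a single value cannot help a model. The sketch below uses a small frame with values taken from the training preview. Whether these columns are constant across the full data is an assumption that should be checked on the real file.

```python
import pandas as pd

# Small sample mirroring the preview rows; the full data may contain other values.
sample = pd.DataFrame({
    "CurrencyCode": ["UGX", "UGX", "UGX"],
    "CountryCode": [256, 256, 256],
    "Value": [550.0, 1000.0, 500.0],
})

# Number of distinct values per column.
unique_counts = sample.nunique()

# Columns with at most one distinct value are candidates for dropping.
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
print(constant_cols)
```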

In [11]:
# Load the training dataset
training_data = pd.read_csv(data_path+'Train.csv')
training_data.head(5)

Unnamed: 0,CustomerId,TransactionStartTime,Value,Amount,TransactionId,BatchId,SubscriptionId,CurrencyCode,CountryCode,ProviderId,...,LoanId,PaidOnDate,IsFinalPayBack,InvestorId,DueDate,LoanApplicationId,PayBackId,ThirdPartyId,IsThirdPartyConfirmed,IsDefaulted
0,CustomerId_27,2018-09-21 12:17:39,550.0,-550.0,TransactionId_1683,BatchId_641,SubscriptionId_2,UGX,256,ProviderId_1,...,,,,,,,,,,
1,CustomerId_27,2018-09-25 09:20:29,550.0,-550.0,TransactionId_2235,BatchId_820,SubscriptionId_2,UGX,256,ProviderId_1,...,,,,,,,,,,
2,CustomerId_27,2018-09-25 10:33:31,550.0,-550.0,TransactionId_1053,BatchId_210,SubscriptionId_4,UGX,256,ProviderId_1,...,,,,,,,,,,
3,CustomerId_27,2018-09-27 10:26:41,1000.0,-1000.0,TransactionId_2633,BatchId_876,SubscriptionId_4,UGX,256,ProviderId_1,...,,,,,,,,,,
4,CustomerId_27,2018-09-27 12:44:21,500.0,-500.0,TransactionId_71,BatchId_1362,SubscriptionId_4,UGX,256,ProviderId_1,...,,,,,,,,,,


In [12]:
# Load the test dataset
test_data = pd.read_csv(data_path+'Test.csv')
test_data.head(5)

Unnamed: 0,CustomerId,TransactionStartTime,Value,Amount,TransactionId,BatchId,SubscriptionId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,TransactionStatus,IssuedDateLoan,LoanId,InvestorId,LoanApplicationId,ThirdPartyId
0,CustomerId_310,2019-03-31 13:33:05,14000.0,-14000.0,TransactionId_925,BatchId_1144,SubscriptionId_7,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2019-03-31 13:33:04,LoanId_1027,InvestorId_1,LoanApplicationId_825,ThirdPartyId_1175
1,CustomerId_243,2019-03-31 15:04:09,1000.0,-1000.0,TransactionId_1080,BatchId_1214,SubscriptionId_7,UGX,256,ProviderId_1,ProductId_8,data_bundles,ChannelId_1,1,2019-03-31 15:04:08,LoanId_768,InvestorId_1,LoanApplicationId_68,ThirdPartyId_604
2,CustomerId_142,2019-03-31 17:31:11,2500.0,-2500.0,TransactionId_2315,BatchId_2150,SubscriptionId_7,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2019-03-31 17:31:09,LoanId_1067,InvestorId_1,LoanApplicationId_1223,ThirdPartyId_1521
3,CustomerId_142,2019-03-31 17:32:15,500.0,-500.0,TransactionId_1466,BatchId_1071,SubscriptionId_7,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2019-03-31 17:32:14,LoanId_202,InvestorId_1,LoanApplicationId_633,ThirdPartyId_406
4,CustomerId_142,2019-03-31 17:34:41,1000.0,-1000.0,TransactionId_337,BatchId_2477,SubscriptionId_7,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2019-03-31 17:34:40,LoanId_533,InvestorId_1,LoanApplicationId_309,ThirdPartyId_302
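
The two previews show different column sets, so it can help to list which training columns are missing from the test file, since those cannot be used as model inputs at prediction time. The sketch below works only from column names visible in the previews. The training preview elides its middle columns, so the lists here are partial, and the empty frames are stand-ins for the real `training_data` and `test_data`.

```python
import pandas as pd

# Column names visible in the previews (the training list is partial).
train_cols = [
    "CustomerId", "LoanId", "PaidOnDate", "IsFinalPayBack", "InvestorId",
    "DueDate", "LoanApplicationId", "PayBackId", "ThirdPartyId",
    "IsThirdPartyConfirmed", "IsDefaulted",
]
test_cols = [
    "CustomerId", "TransactionStartTime", "Value", "Amount", "TransactionId",
    "BatchId", "SubscriptionId", "CurrencyCode", "CountryCode", "ProviderId",
    "ProductId", "ProductCategory", "ChannelId", "TransactionStatus",
    "IssuedDateLoan", "LoanId", "InvestorId", "LoanApplicationId", "ThirdPartyId",
]

# Empty stand-in frames; with the real data, use training_data and test_data.
train_df = pd.DataFrame(columns=train_cols)
test_df = pd.DataFrame(columns=test_cols)

# Training columns that do not appear in the test file (returned sorted).
only_in_train = train_df.columns.difference(test_df.columns).tolist()
print(only_in_train)
```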


## Exploring the data