# Exploratory Data Analysis for Requests


## Conclusions and key findings from EDA:

Cash Requests:

- At first glance, there is no relevant missing or corrupted data:
    - There are 2,103 requests without a `user_id`. All of them have a `deleted_account_id`, which suggests that the `user_id`/`deleted_account_id` data is reliable. (Exactly one request has both ids: 21,867 + 2,104 non-null values across 23,970 rows.)
    - Other columns have missing data, but this looks normal (those columns only apply to specific cases).
    - All values of `amount` are within a plausible range (min 1, max 200).

- Format:
    - The columns `user_id` and `deleted_account_id` are stored as float because they contain NaN. They should be converted to a nullable integer type (a plain int cannot hold missing values).
    - All date columns need to be converted to a proper datetime type.
        - Note: they are currently `object` columns holding str values, plus float wherever the value is NaN.
        - Columns to convert:
            - `created_at`
            - `updated_at`
            - `moderated_at`
            - `reimbursement_date`
            - `cash_request_received_date`
            - `money_back_date`
            - `send_at`
            - `reco_creation`
            - `reco_last_update`
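
The conversions listed above could be sketched roughly as follows. This is a minimal sketch, not the notebook's actual cleaning step: the helper name `convert_types` and the tiny `sample` frame are made up for illustration, and `format="ISO8601"` assumes pandas 2.0 or later (it is used here because the timestamps appear both with and without fractional seconds).

```python
import pandas as pd

ID_COLUMNS = ["user_id", "deleted_account_id"]
DATE_COLUMNS = [
    "created_at", "updated_at", "moderated_at", "reimbursement_date",
    "cash_request_received_date", "money_back_date", "send_at",
    "reco_creation", "reco_last_update",
]


def convert_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with ids as nullable integers and dates as UTC datetimes."""
    out = df.copy()
    for col in ID_COLUMNS:
        if col in out.columns:
            # "Int64" (capital I) is the pandas nullable integer dtype: NaN becomes <NA>.
            out[col] = out[col].astype("Int64")
    for col in DATE_COLUMNS:
        if col in out.columns:
            # Assumption: pandas >= 2.0, where format="ISO8601" accepts values
            # with or without fractional seconds. Missing values become NaT.
            out[col] = pd.to_datetime(out[col], format="ISO8601", utc=True)
    return out


# Hand-made sample in the same shape as the extract (hypothetical rows).
sample = pd.DataFrame({
    "user_id": [804.0, None],
    "deleted_account_id": [None, 91.0],
    "created_at": ["2019-12-10 19:05:21.596873+00", "2020-06-05 22:00:00+00"],
    "moderated_at": ["2019-12-11 16:47:42.405646+00", None],
})
converted = convert_types(sample)
print(converted.dtypes)
```

Parsing errors are deliberately not coerced to NaT here, so that an unexpected timestamp format raises instead of silently producing missing values.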


<br>
<br>


In [1]:
import pandas as pd

# Forward slashes work on Windows as well as on Linux/macOS
requests = pd.read_csv("project_dataset/extract - cash request - data analyst.csv")


## EDA: cash requests


In [17]:
display(requests.shape)
display(requests.columns)
display(requests.head())
# display(requests.sample(10))
# display(requests.tail())

(23970, 16)

Index(['id', 'amount', 'status', 'created_at', 'updated_at', 'user_id',
       'moderated_at', 'deleted_account_id', 'reimbursement_date',
       'cash_request_received_date', 'money_back_date', 'transfer_type',
       'send_at', 'recovery_status', 'reco_creation', 'reco_last_update'],
      dtype='object')

| | id | amount | status | created_at | updated_at | user_id | moderated_at | deleted_account_id | reimbursement_date | cash_request_received_date | money_back_date | transfer_type | send_at | recovery_status | reco_creation | reco_last_update |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 100.0 | rejected | 2019-12-10 19:05:21.596873+00 | 2019-12-11 16:47:42.40783+00 | 804.0 | 2019-12-11 16:47:42.405646+00 | NaN | 2020-01-09 19:05:21.596363+00 | NaN | NaN | regular | NaN | NaN | NaN | NaN |
| 1 | 70 | 100.0 | rejected | 2019-12-10 19:50:12.34778+00 | 2019-12-11 14:24:22.900054+00 | 231.0 | 2019-12-11 14:24:22.897988+00 | NaN | 2020-01-09 19:50:12.34778+00 | NaN | NaN | regular | NaN | NaN | NaN | NaN |
| 2 | 7 | 100.0 | rejected | 2019-12-10 19:13:35.82546+00 | 2019-12-11 09:46:59.779773+00 | 191.0 | 2019-12-11 09:46:59.777728+00 | NaN | 2020-01-09 19:13:35.825041+00 | NaN | NaN | regular | NaN | NaN | NaN | NaN |
| 3 | 10 | 99.0 | rejected | 2019-12-10 19:16:10.880172+00 | 2019-12-18 14:26:18.136163+00 | 761.0 | 2019-12-18 14:26:18.128407+00 | NaN | 2020-01-09 19:16:10.879606+00 | NaN | NaN | regular | NaN | NaN | NaN | NaN |
| 4 | 1594 | 100.0 | rejected | 2020-05-06 09:59:38.877376+00 | 2020-05-07 09:21:55.34008+00 | 7686.0 | 2020-05-07 09:21:55.320193+00 | NaN | 2020-06-05 22:00:00+00 | NaN | NaN | regular | NaN | NaN | NaN | NaN |


In [4]:
display(requests.describe())

| | id | amount | user_id | deleted_account_id |
|---|---|---|---|---|
| count | 23970.0 | 23970.0 | 21867.0 | 2104.0 |
| mean | 13910.966124 | 82.720818 | 32581.250789 | 9658.755228 |
| std | 7788.117214 | 26.528065 | 27618.565773 | 7972.743249 |
| min | 3.0 | 1.0 | 34.0 | 91.0 |
| 25% | 7427.25 | 50.0 | 10804.0 | 3767.0 |
| 50% | 14270.5 | 100.0 | 23773.0 | 6121.5 |
| 75% | 20607.75 | 100.0 | 46965.0 | 16345.0 |
| max | 27010.0 | 200.0 | 103719.0 | 30445.0 |


In [23]:
# check missing data

requests.isna().sum()

id                                0
amount                            0
status                            0
created_at                        0
updated_at                        0
user_id                        2103
moderated_at                   7935
deleted_account_id            21866
reimbursement_date                0
cash_request_received_date     7681
money_back_date                7427
transfer_type                     0
send_at                        7329
recovery_status               20640
reco_creation                 20640
reco_last_update              20640
dtype: int64
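
The three recovery columns (`recovery_status`, `reco_creation`, `reco_last_update`) share the same missing count (20640), which hints that they are empty on the same rows. A rough sketch of how that could be checked, together with a percentage view of the missing values, is below. The helper names and the `sample` frame (including the placeholder status value) are hypothetical and only serve the illustration.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


def missing_together(df: pd.DataFrame, columns: list) -> bool:
    """True if, on every row, the given columns are all missing or all present."""
    n_missing = df[columns].isna().sum(axis=1)
    return bool(n_missing.isin([0, len(columns)]).all())


# Hand-made sample (hypothetical rows and placeholder values).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "recovery_status": [None, "some_status", None, None],
    "reco_creation": [None, "2020-06-12 22:27:04+00", None, None],
    "reco_last_update": [None, "2020-06-13 10:00:00+00", None, None],
})
reco_columns = ["recovery_status", "reco_creation", "reco_last_update"]

print(missing_summary(sample))
print(missing_together(sample, reco_columns))
```

On the real data this would be called as `missing_together(requests, reco_columns)`; a `False` result would mean the equal counts are a coincidence.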

In [20]:
# Check requests without user_id

requests[requests["user_id"].isna()].shape[0] # 2103 (not echoed: only the last expression of a cell is displayed)

# Requests with neither user_id nor deleted_account_id
requests[(requests["user_id"].isna()) & (requests["deleted_account_id"].isna())] # 0 rows

# Result: every request without user_id has a deleted_account_id (which suggests that the data for user_id is reliable)


| | id | amount | status | created_at | updated_at | user_id | moderated_at | deleted_account_id | reimbursement_date | cash_request_received_date | money_back_date | transfer_type | send_at | recovery_status | reco_creation | reco_last_update |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
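
A cross-tabulation of which id is present gives the same answer in a single table and also shows rows where both ids are filled. The following is a sketch under assumptions: `id_presence_table` is a hypothetical helper and `sample` is a hand-made frame, not rows from the extract.

```python
import pandas as pd

LABELS = ["missing", "present"]


def id_presence_table(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate presence of user_id (rows) against deleted_account_id (columns)."""
    to_label = {True: "present", False: "missing"}
    user = df["user_id"].notna().map(to_label).rename("user_id")
    deleted = df["deleted_account_id"].notna().map(to_label).rename("deleted_account_id")
    table = pd.crosstab(user, deleted)
    # Make sure all four cells exist, even when a combination never occurs.
    return table.reindex(index=LABELS, columns=LABELS, fill_value=0)


# Hand-made sample (hypothetical rows): two active users, one deleted
# account, and one row carrying both ids.
sample = pd.DataFrame({
    "user_id": [804.0, 231.0, None, 191.0],
    "deleted_account_id": [None, None, 91.0, 3767.0],
})
table = id_presence_table(sample)
print(table)
```

The cell `table.loc["missing", "missing"]` is the number of requests that cannot be attributed to anyone, and `table.loc["present", "present"]` counts requests that carry both ids.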


In [25]:
# check data types

requests.dtypes

id                              int64
amount                        float64
status                         object
created_at                     object
updated_at                     object
user_id                       float64
moderated_at                   object
deleted_account_id            float64
reimbursement_date             object
cash_request_received_date     object
money_back_date                object
transfer_type                  object
send_at                        object
recovery_status                object
reco_creation                  object
reco_last_update               object
dtype: object

In [31]:
# check types for the columns with dtype object


display(requests["status"].apply(type).unique())        # str
display(requests["created_at"].apply(type).unique())    # str
display(requests["updated_at"].apply(type).unique())    # str
display(requests["moderated_at"].apply(type).unique())  # str, float
print("....")
display(requests["reimbursement_date"].apply(type).unique())            # str
display(requests["cash_request_received_date"].apply(type).unique())    # str, float
display(requests["money_back_date"].apply(type).unique())               # str, float
display(requests["transfer_type"].apply(type).unique())                 # str
print("....")
display(requests["send_at"].apply(type).unique())                       # str, float
display(requests["recovery_status"].apply(type).unique())               # str, float
display(requests["reco_creation"].apply(type).unique())                 # str, float
display(requests["reco_last_update"].apply(type).unique())              # str, float




array([<class 'str'>], dtype=object)

array([<class 'str'>], dtype=object)

array([<class 'str'>], dtype=object)

array([<class 'str'>, <class 'float'>], dtype=object)

....


array([<class 'str'>], dtype=object)

array([<class 'float'>, <class 'str'>], dtype=object)

array([<class 'float'>, <class 'str'>], dtype=object)

array([<class 'str'>], dtype=object)

....


array([<class 'float'>, <class 'str'>], dtype=object)

array([<class 'float'>, <class 'str'>], dtype=object)

array([<class 'float'>, <class 'str'>], dtype=object)

array([<class 'float'>, <class 'str'>], dtype=object)
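
The per-column checks above can also be condensed into one loop that returns the Python types found in each non-numeric column. This is only a sketch: `python_types_by_column` is a hypothetical helper and `sample` is a hand-made frame that mimics what `read_csv` produces (NaN for empty cells).

```python
import pandas as pd


def python_types_by_column(df: pd.DataFrame) -> dict:
    """Map each non-numeric column to the sorted names of the Python types it holds."""
    non_numeric = df.select_dtypes(exclude="number").columns
    return {
        col: sorted({type(value).__name__ for value in df[col]})
        for col in non_numeric
    }


# Hand-made sample (hypothetical rows); float("nan") stands in for empty CSV cells.
sample = pd.DataFrame({
    "amount": [100.0, 99.0],
    "status": ["rejected", "rejected"],
    "moderated_at": ["2019-12-11 16:47:42.405646+00", float("nan")],
})

for column, types in python_types_by_column(sample).items():
    print(f"{column}: {types}")
```

A column that reports both `float` and `str` contains missing values next to its strings, which matches the pattern seen in the date columns.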