### Goals

This dataset contains information on credit card users and whether they are considered to be at risk of default.

We will only use `train_data.csv` for this project.

The analysis will roughly follow this outline:
- Get acquainted with the data
- Clean up the data
- Find questions for analysis
- Analyse variables and relationships to find patterns and answer the questions

### Data

The data for this analysis was sourced from Kaggle.

Source: [https://www.kaggle.com/datasets/tanayatipre/car-price-prediction-dataset](https://www.kaggle.com/datasets/tanayatipre/car-price-prediction-dataset)

A first impression of the dataset can be gained from its Kaggle page.

#### Loading the data

Import the necessary libraries and load the dataset.

In [166]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [167]:
credit_card_unclean = pd.read_csv("train_data.csv")

#### Data Outline

- We have 20 columns and 29165 rows
- `Job title` is the only column with missing / null values
- `Age` is recorded in days, counted backwards (negative values)
- `Employment length` is recorded in days, counted backwards (negative values); unemployed people carry the positive placeholder value 365243 instead
- `Account age` is recorded in years, counted backwards (negative values)
- Around 1.7% of accounts are at risk
- `Has a mobile phone` is always 1, so it contains no information
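The points above can be checked with a few one-liners. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not the real data); the same expressions should work on `credit_card_unclean`.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the values are invented.
toy = pd.DataFrame({
    "Has a mobile phone": [1, 1, 1, 1],
    "Job title": ["Driver", None, "Cook", None],
    "Is high risk": [0, 0, 1, 0],
})

# Number of rows and columns
n_rows, n_cols = toy.shape

# Names of the columns that contain at least one missing value
cols_with_nulls = toy.columns[toy.isnull().any()].tolist()

# The mean of a 0/1 flag equals the share of rows where the flag is 1
risk_share = toy["Is high risk"].mean()

print(n_rows, n_cols, cols_with_nulls, risk_share)
```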

In [168]:
credit_card_unclean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29165 entries, 0 to 29164
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   29165 non-null  int64  
 1   Gender               29165 non-null  object 
 2   Has a car            29165 non-null  object 
 3   Has a property       29165 non-null  object 
 4   Children count       29165 non-null  int64  
 5   Income               29165 non-null  float64
 6   Employment status    29165 non-null  object 
 7   Education level      29165 non-null  object 
 8   Marital status       29165 non-null  object 
 9   Dwelling             29165 non-null  object 
 10  Age                  29165 non-null  int64  
 11  Employment length    29165 non-null  int64  
 12  Has a mobile phone   29165 non-null  int64  
 13  Has a work phone     29165 non-null  int64  
 14  Has a phone          29165 non-null  int64  
 15  Has an email         29165 non-null 

In [169]:
credit_card_unclean.isnull().sum()

ID                        0
Gender                    0
Has a car                 0
Has a property            0
Children count            0
Income                    0
Employment status         0
Education level           0
Marital status            0
Dwelling                  0
Age                       0
Employment length         0
Has a mobile phone        0
Has a work phone          0
Has a phone               0
Has an email              0
Job title              9027
Family member count       0
Account age               0
Is high risk              0
dtype: int64
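To put the missing `Job title` count into perspective, it helps to express it as a share of all rows. The sketch below only redoes the arithmetic with the two numbers reported above (9027 missing values, 29165 rows); on a DataFrame, `isnull().mean()` would give the same share per column.

```python
# Counts taken from the outputs above
missing_job_titles = 9027
total_rows = 29165

# Share of rows without a job title
missing_share = missing_job_titles / total_rows

print(f"{missing_share:.1%}")  # prints 31.0%
```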

In [170]:
credit_card_unclean.describe()

| | ID | Children count | Income | Age | Employment length | Has a mobile phone | Has a work phone | Has a phone | Has an email | Family member count | Account age | Is high risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 | 29165.0 |
| mean | 5078232.0 | 0.43079 | 186890.4 | -15979.47749 | 59257.761255 | 1.0 | 0.22431 | 0.294977 | 0.090279 | 2.197531 | -26.137734 | 0.01711 |
| std | 41824.0 | 0.741882 | 101409.6 | 4202.997485 | 137655.883458 | 0.0 | 0.417134 | 0.45604 | 0.286587 | 0.912189 | 16.486702 | 0.129682 |
| min | 5008804.0 | 0.0 | 27000.0 | -25152.0 | -15713.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -60.0 | 0.0 |
| 25% | 5042047.0 | 0.0 | 121500.0 | -19444.0 | -3153.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | -39.0 | 0.0 |
| 50% | 5074666.0 | 0.0 | 157500.0 | -15565.0 | -1557.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | -24.0 | 0.0 |
| 75% | 5114629.0 | 1.0 | 225000.0 | -12475.0 | -412.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | -12.0 | 0.0 |
| max | 5150485.0 | 19.0 | 1575000.0 | -7705.0 | 365243.0 | 1.0 | 1.0 | 1.0 | 1.0 | 20.0 | 0.0 | 1.0 |


In [171]:
credit_card_unclean.nunique()

ID                     29165
Gender                     2
Has a car                  2
Has a property             2
Children count             9
Income                   259
Employment status          5
Education level            5
Marital status             5
Dwelling                   6
Age                     6794
Employment length       3483
Has a mobile phone         1
Has a work phone           2
Has a phone                2
Has an email               2
Job title                 18
Family member count       10
Account age               61
Is high risk               2
dtype: int64

### Data Cleaning

#### Remove unnecessary information

`Has a mobile phone` has the same value in every row, so it contains no information and will be removed.

In [172]:
credit_card_unclean.drop("Has a mobile phone", axis = 1, inplace = True)

credit_card_unclean.columns

Index(['ID', 'Gender', 'Has a car', 'Has a property', 'Children count',
       'Income', 'Employment status', 'Education level', 'Marital status',
       'Dwelling', 'Age', 'Employment length', 'Has a work phone',
       'Has a phone', 'Has an email', 'Job title', 'Family member count',
       'Account age', 'Is high risk'],
      dtype='object')
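Columns like this can also be found programmatically rather than by reading the `nunique()` output. The following is a minimal sketch on a hypothetical toy frame (`toy`, `constant_cols` and `cleaned` are illustrative names, not part of the notebook). It returns a new frame instead of modifying in place, which is a stylistic choice and not a requirement.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the values are invented.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Has a mobile phone": [1, 1, 1],
    "Has a phone": [0, 1, 0],
})

# A column with a single distinct value (counting NaN as a value) carries no information
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# drop() returns a new DataFrame; the original `toy` is left unchanged
cleaned = toy.drop(columns=constant_cols)

print(constant_cols)
print(cleaned.columns.tolist())
```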

#### Rename Columns

The column names contain a lot of capital letters, spaces and unnecessary words.

We will give the columns clean names that are easier to work with during analysis.

In [173]:
credit_card_unclean.columns = ['id', 'gender', 'car', 'property', 'children',
       'income', 'employment_status', 'education', 'marital_status',
       'dwelling', 'age', 'employment_length', 'work_phone',
       'phone', 'email', 'job', 'family_size',
       'account_age', 'high_risk']

credit_card_unclean.columns

Index(['id', 'gender', 'car', 'property', 'children', 'income',
       'employment_status', 'education', 'marital_status', 'dwelling', 'age',
       'employment_length', 'work_phone', 'phone', 'email', 'job',
       'family_size', 'account_age', 'high_risk'],
      dtype='object')
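Assigning a list to `.columns` relies on the names being in exactly the right order. As an alternative, a mapping from old to new names makes each rename explicit. This is a minimal sketch on a hypothetical frame with only four of the columns (`toy`, `rename_map` and `renamed` are illustrative names).

```python
import pandas as pd

# Hypothetical empty frame that carries only a few of the original column names
toy = pd.DataFrame(columns=["ID", "Has a car", "Children count", "Is high risk"])

# Explicit old -> new mapping, so the order of the columns does not matter
rename_map = {
    "ID": "id",
    "Has a car": "car",
    "Children count": "children",
    "Is high risk": "high_risk",
}

# Fail loudly if a name in the mapping does not exist (rename() would skip it silently)
unknown = set(rename_map) - set(toy.columns)
if unknown:
    raise KeyError(f"Columns not found: {sorted(unknown)}")

renamed = toy.rename(columns=rename_map)
print(renamed.columns.tolist())
```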

#### Format Columns

- Convert the columns recorded in days (`age` and `employment_length`) to years
- Flip the sign of `employment_length`, `age` and `account_age` so that values counted backwards become positive

In [174]:
credit_card_unclean.age = -credit_card_unclean.age / 365.25
credit_card_unclean.account_age = -credit_card_unclean.account_age
credit_card_unclean.employment_length = -credit_card_unclean.employment_length / 365.25

credit_card_unclean.employment_length.describe()

count    29165.000000
mean      -162.238908
std        376.881269
min       -999.980835
25%          1.127995
50%          4.262834
75%          8.632444
max         43.019849
Name: employment_length, dtype: float64
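To see where these numbers come from, the conversion can be replayed on two raw values from the earlier `describe()` output: the median (-1557 days) and the maximum (365243 days). The helper name `days_back_to_years` is hypothetical and only used for this illustration.

```python
DAYS_PER_YEAR = 365.25  # same divisor as in the cell above


def days_back_to_years(days):
    """Flip the sign of a day count and convert it to years."""
    return -days / DAYS_PER_YEAR


# Raw median of -1557 days becomes roughly 4.26 years
median_years = days_back_to_years(-1557)

# Raw maximum of 365243 days becomes roughly -999.98 years
max_raw_years = days_back_to_years(365243)

print(round(median_years, 6), round(max_raw_years, 6))
```

The large positive raw value ends up as the minimum of about -1000 years, which explains the negative mean.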

#### Remove false data

In the `employment_length` column, every unemployed person has a value of about -1000 years, which is a placeholder rather than a real duration. Let's add an `employed` column that records whether someone is employed, and set `employment_length` to 0 for the unemployed.

In [178]:
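# After the sign flip, employed people have positive lengths; the placeholder is negative
credit_card_unclean["employed"] = credit_card_unclean.employment_length >= 0
# Replace the placeholder with 0 for everyone who is not employed
credit_card_unclean.loc[~credit_card_unclean.employed, "employment_length"] = 0

credit_card_unclean.employed.describe()

count     29165
unique        2
top        True
freq      24257
Name: employed, dtype: object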
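An alternative sketch flags the placeholder on the raw day counts, before the conversion to years. It assumes that 365243 (the maximum of the raw `Employment length` column) is the only placeholder value, which is not verified here. The toy frame and the column name `employment_days` are hypothetical.

```python
import pandas as pd

# Assumption: this is the only placeholder used for unemployed people
UNEMPLOYED_PLACEHOLDER = 365243

# Hypothetical raw values in days, counted backwards
toy = pd.DataFrame({"employment_days": [-1557, UNEMPLOYED_PLACEHOLDER, -412]})

# Everyone without the placeholder counts as employed
toy["employed"] = toy["employment_days"] != UNEMPLOYED_PLACEHOLDER

# Convert to positive years; where() keeps the value for employed rows and uses 0 otherwise
toy["employment_length"] = (-toy["employment_days"] / 365.25).where(toy["employed"], 0.0)

print(toy)
```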