## What is a Variable?

_A variable is any characteristic, number, or quantity that can be measured or counted. The following are examples of variables:_

1. _Age (21, 35, 62, ...)_
2. _Gender (male, female)_
3. _Income (GBP 20000, GBP 35000, GBP 45000, ...)_
4. _House price (GBP 350000, GBP 570000, ...)_
5. _Country of birth (China, Russia, Costa Rica, ...)_
6. _Eye colour (brown, green, blue, ...)_
7. _Vehicle make (Ford, Volkswagen, ...)_

_They are called 'variables' because the value they take may vary (and it usually does) in a population._


_Most variables in a data set can be classified into one of two major types:_

* _Numerical variables_
* _Categorical variables_
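
_As a rough first pass, pandas can split the columns of a table by storage type. The sketch below uses a small made-up table (the column names and values are illustrative assumptions). Storage type is only a hint: a category coded with numbers would still be reported as numerical._

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions
df = pd.DataFrame({
    'age': [21, 35, 62],
    'income': [20000.0, 35000.0, 45000.0],
    'eye_colour': ['brown', 'green', 'blue'],
    'vehicle_make': ['Ford', 'Volkswagen', 'Ford'],
})

# Columns stored as numbers vs. everything else
numerical = df.select_dtypes(include='number').columns.tolist()
categorical = df.select_dtypes(exclude='number').columns.tolist()

print(numerical)
print(categorical)
```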

### Numerical Variables
_The values of a numerical variable are numbers. Numerical variables can be further classified into discrete and continuous variables._

#### 1. _Discrete Numerical Variable_
_A variable whose values are whole numbers (counts) is called discrete. For example, the number of items bought by a customer in a supermarket is discrete. The customer can buy 1, 25, or 50 items, but not 3.7 items. It is always a whole number. The following are examples of discrete variables:_

1. _Number of active bank accounts of a borrower (1, 4, 7, ...)_
2. _Number of pets in the family_
3. _Number of children in the family_


#### 2. _Continuous Numerical Variable_
_A variable that may take any value within some range is called continuous. For example, the total amount paid by a customer in a supermarket is continuous. The customer can pay GBP 20.50, GBP 13.10, GBP 83.20, and so on. Other examples of continuous variables are:_

1. _House price (in principle, it can take any value) (GBP 350000, 57000, 1000000, ...)_
2. _Time spent surfing a website (3.4 seconds, 5.10 seconds, ...)_
3. _Total debt as percentage of total income in the last month (0.2, 0.001, 0, 0.75, ...)_
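
_One simple heuristic for telling the two apart is to check whether every observed value is a whole number. The helper `looks_discrete` below is a hypothetical name used only for this sketch, and the values are made up. The check is not conclusive: a continuous variable that happens to be recorded in round numbers would also pass it._

```python
import pandas as pd

def looks_discrete(values: pd.Series) -> bool:
    """Return True if every non-missing value is a whole number."""
    values = values.dropna()
    return bool((values % 1 == 0).all())

# Toy data (illustrative assumptions)
items_bought = pd.Series([1, 25, 50])           # counts
amount_paid = pd.Series([20.5, 13.10, 83.20])   # GBP, may take fractional values

print(looks_discrete(items_bought))
print(looks_discrete(amount_paid))
```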

In [0]:
# Importing Required Libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

In [0]:
# Follow article: https://medium.com/@yvettewu.dw/tutorial-kaggle-api-google-colaboratory-1a054a382de0

!pip install kaggle

!mkdir .kaggle

import json
# Placeholders: substitute your own Kaggle credentials and never publish the real key
token = {"username": "YOUR_KAGGLE_USERNAME", "key": "YOUR_KAGGLE_API_KEY"}
with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)

In [0]:
!chmod 600 /content/.kaggle/kaggle.json
# Make sure the target directory exists before copying the token into it
!mkdir -p ~/.kaggle
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle config set -n path -v{/content}

!kaggle datasets download -d wendykan/lending-club-loan-data -p /content

!unzip \*.zip

In [0]:
pd.set_option('display.max_rows',1000)
pd.set_option('display.max_columns',1000)

In [0]:
# Dataset: https://www.kaggle.com/wendykan/lending-club-loan-data

use_cols = ['id', 'loan_amnt', 'int_rate', 'annual_inc', 'open_acc', 'open_il_12m',
            'loan_status', 'purpose', 'home_ownership', 'grade', 'issue_d', 'last_pymnt_d']

data = pd.read_csv('/content/loan.csv', usecols=use_cols)
data1 = data.sample(10000, random_state=44)  # set a seed for reproducibility
data.head()

In [0]:
data['loan_status'].value_counts()

In [0]:
data.info()

In [0]:
# Continuous Variable
data1.loan_amnt.unique()

In [0]:
# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-pastel'
plt.style.use('seaborn-pastel')

In [0]:
fig = data1.loan_amnt.hist(bins=50)
fig.set_title('Loan Amount Requested')
fig.set_xlabel('Loan Amount')
fig.set_ylabel('Number of Loans')

_The values are spread across the entire range of the variable. This is characteristic of continuous variables._

In [0]:
data1.int_rate.unique()

In [0]:
fig = data1.int_rate.hist(bins=30)
fig.set_title('Interest Rate')
fig.set_xlabel('Interest Rate')
fig.set_ylabel('Number of Loans')

In [0]:
fig = data1.annual_inc.hist(bins=100)
fig.set_xlim(0, 400000)
fig.set_title("Customer's Annual Income")
fig.set_xlabel('Annual Income')
fig.set_ylabel('Number of Customers')

In [0]:
# Discrete Variable
data1.open_acc.dropna().unique()

_This is the total number of credit items (for example, credit cards, car loans, mortgages, etc.) known for the borrower. By definition it is a discrete variable, because a borrower can have 1 credit card, but not 3.5 credit cards._

In [0]:
fig = data1.open_acc.hist(bins=100)
fig.set_xlim(0, 30)
fig.set_title('Number of open accounts')
fig.set_xlabel('Number of open accounts')
fig.set_ylabel('Number of Customers')

In [0]:
data1.open_il_12m.unique()

In [0]:
fig = data1.open_il_12m.hist(bins=50)
fig.set_title('Number of installment accounts opened in past 12 months')
fig.set_xlabel('Number of installment accounts opened in past 12 months')
fig.set_ylabel('Number of Borrowers')

_Histograms of discrete variables have this typical broken shape, because not all values within the variable's range are present in the data._
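
_The effect can be reproduced on a handful of made-up counts. The sketch below bins whole numbers into more bins than there are distinct values, so most bins stay empty. The values and the bin count are illustrative assumptions._

```python
import numpy as np

# Toy discrete variable: only the whole numbers 1 to 5 occur
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# 20 bins over a range that contains only 5 distinct values
counts, edges = np.histogram(values, bins=20)

non_empty_bins = int((counts > 0).sum())
empty_bins = int((counts == 0).sum())

print(non_empty_bins, empty_bins)
```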

In [0]:
# Discrete Variable: Binary Value
data.loan_status.unique()

_Binary variables are discrete variables that can take only 2 values, hence the name._

_In the next cells we will create an additional variable, called defaulted, to flag the loans that have defaulted. A defaulted loan is a loan that a customer has failed to repay, so the money is lost._

_The variable takes the value 0 when the loan is OK and being repaid regularly, or 1 when the borrower has confirmed that they will not be able to repay the borrowed amount._

In [0]:
data['defaulted'] = np.where(data.loan_status.isin(['Default']), 1, 0)
data.defaulted.mean()
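
_Because the flag is coded as 0 and 1, its mean is the proportion of defaulted loans. The sketch below applies the same construction to a few made-up loan statuses (the values are illustrative assumptions, not taken from the data set)._

```python
import numpy as np
import pandas as pd

# Toy loan statuses (made-up values)
status = pd.Series(['Current', 'Fully Paid', 'Default', 'Current', 'Default'])

# 1 where the loan is in default, 0 otherwise
defaulted = pd.Series(np.where(status.isin(['Default']), 1, 0))

print(defaulted.tolist())
print(defaulted.mean())  # share of loans flagged as defaulted
```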

In [0]:
data[data.loan_status.isin(['Default'])].head()

In [0]:
data.defaulted.unique()

In [0]:
fig = data.defaulted.hist()
fig.set_xlim(0, 2)
fig.set_title('Defaulted accounts')
fig.set_xlabel('Defaulted')
fig.set_ylabel('Number of Loans')

### Categorical Variables

_The values of a categorical variable are selected from a group of categories, also called labels. Examples are gender (male or female) and marital status (never married, married, divorced or widowed). Other examples of categorical variables include:_

1. _Intended use of loan (debt-consolidation, car purchase, wedding expenses, ...)_
2. _Mobile network provider (Vodafone, Orange, ...)_
3. _Postcode_

_Categorical variables can be further categorised into ordinal and nominal variables._


#### _Ordinal Categorical Variable_
_Categorical variables whose categories can be meaningfully ordered are called ordinal. For example:_

1. _Student's grade in an exam (A, B, C or Fail)._
2. _Days of the week can be ordinal with Monday = 1 and Sunday = 7._
3. _Educational level, with the categories Elementary school, High school, College graduate and PhD ranked from 1 to 4._
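
_In pandas, an ordering like this can be declared explicitly with an ordered categorical type. The sketch below uses made-up exam grades. The chosen order, from worst to best, is an assumption for illustration._

```python
import pandas as pd

# Toy exam grades (made-up values)
grades = pd.Series(['B', 'A', 'Fail', 'C', 'A'])

# Declare the order explicitly, from worst to best
grade_order = ['Fail', 'C', 'B', 'A']
grades_ordinal = grades.astype(pd.CategoricalDtype(categories=grade_order, ordered=True))

# With an order defined, min/max and integer codes become meaningful
print(grades_ordinal.min(), grades_ordinal.max())
print(grades_ordinal.cat.codes.tolist())
```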

#### _Nominal Categorical Variable_
_There isn't an intrinsic order of the labels. For example, country of birth (Argentina, England, Germany) is nominal. Other examples of nominal variables include:_

1. _Postcode_
2. _Vehicle make (Citroen, Peugeot, ...)_

_There is nothing that indicates an intrinsic order of the labels, and in principle, they are all equal._
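
_For a nominal variable, the useful summaries are the set of labels and how often each one occurs. The sketch below stores made-up countries of birth as an unordered pandas categorical. The values are illustrative assumptions._

```python
import pandas as pd

# Toy nominal variable (made-up values)
country = pd.Series(['Argentina', 'England', 'Germany', 'England'], dtype='category')

print(country.cat.ordered)              # no order is defined between labels
print(country.nunique())                # number of distinct labels
print(country.value_counts().to_dict()) # frequency of each label
```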

In [0]:
data1.home_ownership.unique()

In [0]:
fig = data1['home_ownership'].value_counts().plot.bar()
fig.set_title('Home Ownership')
fig.set_ylabel('Number of customers')

In [0]:
data1['home_ownership'].value_counts()

In [0]:
data1.purpose.unique()

In [0]:
fig = data1['purpose'].value_counts().plot.bar()
fig.set_title('Loan Purpose')
fig.set_ylabel('Number of customers')

In [0]:
data1.loan_status.unique()

In [0]:
fig = data1['loan_status'].value_counts().plot.bar()
fig.set_title('Status of the Loan')
fig.set_ylabel('Number of customers')

### Date & Time

_A special type of categorical variable is one that, instead of taking traditional labels like colour (blue, red) or city (London, Manchester), takes dates as values. For example, date of birth ('29-08-1987', '12-01-2012'), or time of application ('2016-Dec', '2013-March')._

_Datetime variables can contain dates only, or time only, or date and time._

_Typically, we would never work with a date variable as a categorical variable, for a variety of reasons:_

1. _Date variables usually contain a huge number of individual categories, which will expand the feature space dramatically_
2. _Date variables allow us to capture much more information from the dataset if preprocessed in the right way_
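
_One common way to preprocess dates is to parse them and derive numerical features, such as the year, the month, or the time elapsed between two events. The sketch below uses made-up dates. The 'Mon-YYYY' layout, the English month abbreviations, and the new column names are assumptions for illustration._

```python
import pandas as pd

# Toy dates in a 'Mon-YYYY' layout (values and layout are assumptions)
df = pd.DataFrame({
    'issue_d': ['Dec-2015', 'Mar-2016'],
    'last_pymnt_d': ['Feb-2016', 'Mar-2017'],
})

# Parse the strings; with no day given, each date falls on the 1st of the month
issue_dt = pd.to_datetime(df['issue_d'], format='%b-%Y')
last_pymnt_dt = pd.to_datetime(df['last_pymnt_d'], format='%b-%Y')

# Derive numerical features from the parsed dates
df['issue_year'] = issue_dt.dt.year
df['issue_month'] = issue_dt.dt.month
df['days_to_last_payment'] = (last_pymnt_dt - issue_dt).dt.days

print(df)
```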

In [0]:
data1.dtypes

In [0]:
data1['issue_dt'] = pd.to_datetime(data1.issue_d)
data1['last_pymnt_dt'] = pd.to_datetime(data1.last_pymnt_d)

data1[['issue_d', 'issue_dt', 'last_pymnt_d', 'last_pymnt_dt']].head()

In [0]:
fig = data1.groupby(['issue_dt', 'grade'])['loan_amnt'].sum().unstack().plot(figsize=(14, 8), linewidth=2)

fig.set_title('Disbursed amount in time')
fig.set_ylabel('Disbursed Amount (US Dollars)')

### Mixed Variables

_Mixed variables are those whose values contain both numbers and labels._

_Variables can be mixed for a variety of reasons. For example, when credit agencies gather and store financial information about users, the values they store are usually numbers. However, in some cases the agency cannot retrieve information for a certain user. In these situations, credit agencies assign a different code or 'label' to each reason for which the information could not be retrieved. This produces mixed type variables, which contain numbers when the value could be retrieved and labels otherwise._

_As an example, think of the variable 'number_of_open_accounts'. It can take any number, representing the number of different financial accounts of the borrower. Sometimes, information may not be available for a certain borrower, for a variety of reasons. Each reason will be coded by a different letter, for example: 'A': couldn't identify the person, 'B': no relevant data, 'C': person seems not to have any open account._

_Another example of a mixed type variable is missed_payment_status. This variable indicates whether a borrower has missed any payment on a financial item. For example, if the borrower has a credit card, this variable indicates whether they missed a monthly payment on it. It can therefore take the values 0, 1, 2, or 3, meaning that the customer has missed 0-3 payments on their account. It can also take the value D, if the customer defaulted on that account._

_Typically, once the customer has missed 3 payments, the lender declares the item defaulted (D). That is why this variable takes the numerical values 0-3 and then D._
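
_One possible way to handle a mixed variable is to split it into a numerical part and a label part. The sketch below does this for a made-up version of number_of_open_accounts. The values and the names `numeric_part` and `label_part` are illustrative assumptions._

```python
import pandas as pd

# Toy mixed variable: numbers of open accounts plus letter codes (made-up values)
open_accounts = pd.Series(['3', '0', 'A', '7', 'C', 'B'])

# Numerical part: entries that are not numbers become NaN
numeric_part = pd.to_numeric(open_accounts, errors='coerce')

# Label part: keep the code only where the value was not a number
label_part = open_accounts.where(numeric_part.isna())

print(numeric_part.tolist())
print(label_part.tolist())
```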