<a href="https://colab.research.google.com/github/StacyChebet/LoanDefaults/blob/master/LoanDefaults.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Data Processing**
Data processing involves cleaning and transforming raw data into a suitable format for modeling.
## **Objective**
This notebook provides hands-on experience in preprocessing data, focusing on handling missing values, scaling features, encoding categorical variables, and splitting the data for modeling.
## **Key Objectives**
1. **Understand Data:** Become familiar with the dataset and its characteristics.
2. **Handle Missing Values:** Identify and treat missing data in different columns.
3. **Feature Encoding:** Convert categorical variables into numerical formats suitable for machine learning models.
4. **Feature Scaling:** Normalize or standardize numerical values to improve model performance.
5. **Data Splitting:** Prepare the dataset for training and testing to evaluate the performance of machine learning models.
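
The five objectives above can be sketched end to end on a toy table. This is a minimal pandas-only illustration, not the notebook's actual workflow: the rows are made up, only a few of the dataset's column names are used, and the choices of median imputation, one-hot encoding, and standardization are assumptions made for the example.

```python
import pandas as pd

# Toy records borrowing a few column names from the dataset; values are illustrative only.
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 0, 1, 0],
    "CODE_GENDER": ["M", "F", "F", "M", "F", "M"],
    "AMT_INCOME_TOTAL": [225000.0, None, 144000.0, 81000.0, 103500.0, 90000.0],
    "AMT_CREDIT": [578619.0, 270000.0, 753840.0, 98910.0, 521280.0, 300000.0],
})

# 1. Understand the data: count missing values per column.
missing_counts = toy.isna().sum()

# 2. Handle missing values: fill the numeric gap with the column median (one possible choice).
income_median = toy["AMT_INCOME_TOTAL"].median()
toy["AMT_INCOME_TOTAL"] = toy["AMT_INCOME_TOTAL"].fillna(income_median)

# 3. Feature encoding: one-hot encode the categorical column.
encoded = pd.get_dummies(toy, columns=["CODE_GENDER"], dtype=int)

# 4. Feature scaling: standardize numeric columns to zero mean and unit variance.
numeric_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]
col_mean = encoded[numeric_cols].mean()
col_std = encoded[numeric_cols].std(ddof=0)
encoded[numeric_cols] = (encoded[numeric_cols] - col_mean) / col_std

# 5. Data splitting: hold out one third of the rows for testing.
test = encoded.sample(frac=1 / 3, random_state=42)
train = encoded.drop(test.index)
```

The sketch follows the order of the list for readability. In a real project the imputation and scaling statistics are usually computed on the training split only, so that no information from the test rows leaks into training.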


## **Data Description**
The dataset is a collection of loan application records that can be used to predict the likelihood of a default. <br>
Here is a brief description of each column in the dataset:
- `TARGET`: Binary indicator where **1** represents a default on a loan and **0** represents a non-default. This is the label for our predictive modeling.
- `NAME_CONTRACT_TYPE`: Type of loan contracted. Categorical variable (e.g. 'Cash loans', 'Revolving loans').
- `CODE_GENDER`: Gender of the applicant. Categorical variable ('M' for male, 'F' for female).
- `FLAG_OWN_CAR`: Indicates whether the applicant owns a car ('Y' for yes, 'N' for no).
- `FLAG_OWN_REALTY`: Indicates whether the applicant owns real estate ('Y' for yes, 'N' for no).
- `CNT_CHILDREN`: Number of children the applicant has.
- `AMT_INCOME_TOTAL`: Total annual income of the applicant.
- `AMT_CREDIT`: Credit amount of the loan taken.
- `AMT_ANNUITY`: Loan annuity.
- `DAYS_BIRTH`: Applicant's age in days at the time of application, stored as a negative number (days counted backward from the application date).
- `YEARS_EMPLOYED`: Number of years the applicant has been employed.
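
Because `DAYS_BIRTH` is negative, it is often easier to read once converted to a positive age in years. A minimal sketch, using the `DAYS_BIRTH` values from the sample rows shown when the data is loaded; the column name `AGE_YEARS` is hypothetical and the 365.25-day year is an assumption.

```python
import pandas as pd

# DAYS_BIRTH values taken from the first five rows of the dataset.
days_birth = pd.Series([-12347, -14048, -14639, -14591, -12023], name="DAYS_BIRTH")

# Flip the sign and divide by an average year length (365.25 days is an assumption).
age_years = (-days_birth / 365.25).rename("AGE_YEARS")

print(age_years.round(1).tolist())
```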


## **Loading Libraries and Data**
Libraries used:
- **Pandas:** For data manipulation
- **NumPy:** For numerical operations
- **Matplotlib:** For plotting
- **Seaborn:** For data visualization

In [2]:
#Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Setting visualization styles
sns.set(style="whitegrid")

#Mounting google drive
from google.colab import drive
drive.mount('/content/drive')

#Changing directory
%cd /content/drive/My Drive/Colab Notebooks/Data Analytics - IBT/LoanDefaults

#Loading the dataset
file_path = "loan_default.csv"
df = pd.read_csv(file_path)

#Displaying the first few rows of the dataset
df.head()

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/Data Analytics - IBT/LoanDefaults


| Unnamed: 0 | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | DAYS_BIRTH | YEARS_EMPLOYED |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | Cash loans | M | Y | N | 1 | 225000.0 | 578619.0 | 23229.0 | -12347 | 0 |
| 1 | 0.0 | Revolving loans | M | Y | Y | 1 | NaN | 270000.0 | 13500.0 | -14048 | 6 |
| 2 | 0.0 | Cash loans | M | Y | N | 0 | 144000.0 | 753840.0 | 29340.0 | -14639 | 6 |
| 3 | 0.0 | Cash loans | F | N | Y | 0 | 81000.0 | 98910.0 | 7785.0 | -14591 | 11 |
| 4 | 0.0 | Cash loans | F | N | Y | 1 | 103500.0 | 521280.0 | 26779.5 | -12023 | 0 |
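
The preview includes an `Unnamed: 0` column that is not part of the data description. pandas assigns that name to a column whose header is empty, which typically happens when a CSV was saved together with its row index. Whether that is the case for `loan_default.csv` is an assumption; the sketch below uses a small in-memory CSV to show the behavior and one way to avoid the extra column.

```python
import io
import pandas as pd

# Two rows in a layout like the preview; the first column has an empty header (assumed).
csv_text = (
    ",TARGET,NAME_CONTRACT_TYPE\n"
    "0,0.0,Cash loans\n"
    "1,0.0,Revolving loans\n"
)

# Default load: pandas names the header-less column "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index avoids the extra column.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```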


## **Initial Data Exploration**
In this step, we will conduct an initial exploration of the dataset to understand its structure and basic characteristics. We will:

1. Check the shape of the dataset
2. Display the data types of each column
3. Get a summary of the dataset using descriptive statistics
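
As a rough illustration of these three checks, here is a sketch on a small stand-in table built from a few values in the preview. The name `sample` is hypothetical; on the full dataset the same calls would be made on `df`.

```python
import pandas as pd

# Small stand-in for df, using values from the first three preview rows.
sample = pd.DataFrame({
    "TARGET": [0.0, 0.0, 0.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "AMT_INCOME_TOTAL": [225000.0, None, 144000.0],
    "AMT_CREDIT": [578619.0, 270000.0, 753840.0],
})

# 1. Shape: a (rows, columns) tuple.
print(sample.shape)

# 2. Data type of each column.
print(sample.dtypes)

# 3. Descriptive statistics for the numeric columns (count excludes missing values).
summary = sample.describe()
print(summary)
```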

In [None]:
#Checking