# Understanding the Problem and Data Preparation 

## Problem Definition
The goal is to classify the credit risk of loan applicants using the features available in the dataset. Given the financial and personal details an applicant provides, we want to predict whether the loan application is high risk or low risk.

## Data Loading and Initial Exploration
First, let's load the dataset and perform an initial exploration to understand its structure and data types. This includes:

- Viewing the first few rows to understand the format.
- Generating descriptive statistics to get an overview of numerical features.
- Checking for missing values or any inconsistencies in the data.

In [3]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configuring visuals
%matplotlib inline
sns.set_style("whitegrid")


In [4]:
# Load the dataset
data_path = '../data/german.data-numeric'
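# The file is whitespace-delimited with no header row; sep=r"\s+" replaces the deprecated delim_whitespace=True
data = pd.read_csv(data_path, header=None, sep=r"\s+")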

# Displaying the first few rows of the dataset to ensure correct loading
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,1,6,4,12,5,5,3,4,1,67,...,0,0,1,0,0,1,0,0,1,1
1,2,48,2,60,1,3,2,2,1,22,...,0,0,1,0,0,1,0,0,1,2
2,4,12,4,21,1,4,3,3,1,49,...,0,0,1,0,0,1,0,1,0,1
3,1,42,2,79,1,4,3,4,2,45,...,0,0,0,0,0,0,0,0,1,1
4,1,24,3,49,1,3,3,4,4,53,...,1,0,1,0,0,0,0,0,1,2
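
The head alone does not show the overall shape or the column types, both of which are part of understanding the structure. A minimal sketch of those two checks follows. It reads a small in-memory sample in the same whitespace-delimited layout; the three rows are illustrative only and are not the real file.

```python
import io
import pandas as pd

# Illustrative whitespace-delimited sample (not the real file)
sample = io.StringIO(
    "1  6 4 12 1\n"
    "2 48 2 60 2\n"
    "4 12 4 21 1\n"
)

# sep=r"\s+" splits on any run of whitespace
df = pd.read_csv(sample, header=None, sep=r"\s+")

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column
```

Running the same two calls on `data` would confirm the number of rows and columns before any columns are dropped or renamed.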


In [5]:
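# Keep the first 20 columns so the frame matches the 20 names below.
# Caution: this also drops column 24, which holds only 1 and 2 in the output above
# and appears to be the class label.
data = data.iloc[:, :-5]
# Column names based on the dataset attribute description.
# Caution: the column order of the numeric file may not match this list; for example,
# column 9 (67, 22, 49, ...) looks like an age but is labeled "Other Debtors/Guarantors" here.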
column_names = [
    "Status of Existing Checking Account", "Duration in Months", "Credit History",
    "Purpose", "Credit Amount", "Savings Account/Bonds", "Present Employment Since",
    "Installment Rate in Percentage of Disposable Income", "Personal Status and Sex",
    "Other Debtors/Guarantors", "Present Residence Since", "Property", "Age in Years",
    "Other Installment Plans", "Housing", "Number of Existing Credits at This Bank",
    "Job", "Number of People Liable to Provide Maintenance For", "Telephone",
    "Foreign Worker"
]

# Assign column names to the DataFrame
data.columns = column_names

# Verify the column names were added successfully
data.head()


Unnamed: 0,Status of Existing Checking Account,Duration in Months,Credit History,Purpose,Credit Amount,Savings Account/Bonds,Present Employment Since,Installment Rate in Percentage of Disposable Income,Personal Status and Sex,Other Debtors/Guarantors,Present Residence Since,Property,Age in Years,Other Installment Plans,Housing,Number of Existing Credits at This Bank,Job,Number of People Liable to Provide Maintenance For,Telephone,Foreign Worker
0,1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0
1,2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0
2,4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0
3,1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0
4,1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0
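
In the first `head()` output, column 24 contains only the values 1 and 2, which suggests it is the class label, yet `data.iloc[:, :-5]` removes it together with four feature columns. The sketch below shows one way to set the label aside before touching the features. It uses a toy frame, and it assumes that the last column is the label and that 1 means good credit and 2 means bad credit; the name `high_risk` and the `feature_i` placeholders are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the raw file: three feature columns plus a label column
raw = pd.DataFrame([[1, 6, 67, 1], [2, 48, 22, 2], [4, 12, 49, 1]])

# Assumption: the last column is the class label
target = raw.iloc[:, -1]
features = raw.iloc[:, :-1].copy()

# Assumption: 1 = good credit, 2 = bad credit; recode so that 1 marks high risk
y = (target == 2).astype(int)
y.name = "high_risk"

# Neutral placeholder names avoid attaching a wrong meaning to a column
features.columns = [f"feature_{i}" for i in features.columns]

print(features.head())
print(y.value_counts())
```

Placeholder names can be replaced with descriptive ones once each column has been matched against the dataset documentation.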


### Generate descriptive statistics:

In [6]:
# Generate descriptive statistics to understand the dataset's numerical distributions
display(data.describe())


Unnamed: 0,Status of Existing Checking Account,Duration in Months,Credit History,Purpose,Credit Amount,Savings Account/Bonds,Present Employment Since,Installment Rate in Percentage of Disposable Income,Personal Status and Sex,Other Debtors/Guarantors,Present Residence Since,Property,Age in Years,Other Installment Plans,Housing,Number of Existing Credits at This Bank,Job,Number of People Liable to Provide Maintenance For,Telephone,Foreign Worker
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,2.577,20.903,2.545,32.711,2.105,3.384,2.682,2.845,2.358,35.546,2.675,1.407,1.155,1.404,1.037,0.234,0.103,0.907,0.041,0.179
std,1.257638,12.058814,1.08312,28.252605,1.580023,1.208306,0.70808,1.103718,1.050209,11.375469,0.705601,0.577654,0.362086,0.490943,0.188856,0.423584,0.304111,0.290578,0.198389,0.383544
min,1.0,4.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,19.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,12.0,2.0,14.0,1.0,3.0,2.0,2.0,1.0,27.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
50%,2.0,18.0,2.0,23.0,1.0,3.0,3.0,3.0,2.0,33.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
75%,4.0,24.0,4.0,40.0,3.0,5.0,3.0,4.0,3.0,42.0,3.0,2.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0
max,4.0,72.0,4.0,184.0,5.0,5.0,4.0,4.0,4.0,75.0,3.0,4.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0
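
Means and standard deviations are hard to interpret for integer-coded categories, so it can help to count the distinct values in each column before reading the table above. The following is a rough sketch on a toy frame; the column names and the `max_levels` threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame: one coded category, one continuous column, one binary indicator
toy = pd.DataFrame({
    "checking_status": [1, 2, 4, 1, 1, 3],
    "duration_months": [6, 48, 12, 42, 24, 36],
    "telephone": [0, 1, 0, 0, 1, 0],
})

# Number of distinct values per column
n_unique = toy.nunique()

# Arbitrary threshold for this toy example; tune it for the real data
max_levels = 4
likely_categorical = n_unique[n_unique <= max_levels].index.tolist()
likely_continuous = n_unique[n_unique > max_levels].index.tolist()

print(n_unique)
print("Likely categorical:", likely_categorical)
print("Likely continuous:", likely_continuous)
```

A low distinct count is only a hint; the attribute description remains the authority on which columns are categorical.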


### Check for missing values:

In [7]:
# Check for missing values in the dataset
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)


Missing values in each column:
 Status of Existing Checking Account                    0
Duration in Months                                     0
Credit History                                         0
Purpose                                                0
Credit Amount                                          0
Savings Account/Bonds                                  0
Present Employment Since                               0
Installment Rate in Percentage of Disposable Income    0
Personal Status and Sex                                0
Other Debtors/Guarantors                               0
Present Residence Since                                0
Property                                               0
Age in Years                                           0
Other Installment Plans                                0
Housing                                                0
Number of Existing Credits at This Bank                0
Job                                                    0
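
Missing values are only one kind of inconsistency. Duplicated rows and out-of-range values are two others that are cheap to check. The sketch below runs both checks on a toy frame; the column names and the plausibility bounds are hypothetical and would need to be set from the dataset documentation.

```python
import pandas as pd

# Toy frame with one duplicated row
toy = pd.DataFrame({
    "duration_months": [6, 48, 12, 12],
    "age_years": [67, 22, 49, 49],
})

# Count rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Hypothetical plausibility bounds (inclusive) for each column
bounds = {"duration_months": (1, 120), "age_years": (18, 100)}
out_of_range = {
    col: int((~toy[col].between(lo, hi)).sum())
    for col, (lo, hi) in bounds.items()
}

print("Duplicated rows:", n_duplicates)
print("Out-of-range values per column:", out_of_range)
```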

## Data Preprocessing
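Preprocessing typically covers the following tasks:

- Cleaning the data by handling missing values and correcting any errors. The check above found no missing values, so no imputation is needed here.
- Encoding categorical variables in numeric form, for example with one-hot encoding, because many machine learning models require numerical input. This file is already integer-coded, but integer codes for unordered categories can still benefit from one-hot encoding.
- Normalizing or standardizing numerical variables where the model is sensitive to feature scale, so that features measured on different scales contribute comparably.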
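
As a minimal sketch of the encoding step, `pd.get_dummies` can expand an integer-coded categorical column into one indicator column per level. The toy frame and its column names are assumptions for illustration.

```python
import pandas as pd

# Toy frame: one integer-coded category and one continuous column
toy = pd.DataFrame({
    "credit_history": [4, 2, 4, 3],
    "duration_months": [6, 48, 12, 24],
})

# One indicator column per observed level; other columns pass through unchanged
encoded = pd.get_dummies(
    toy, columns=["credit_history"], prefix="credit_history", dtype=int
)

print(encoded)
```

For linear models, passing `drop_first=True` drops one level per category and avoids perfectly collinear indicator columns.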

In [8]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# No dummy encoding has been applied; every column in this file is already numeric,
# so the scaler is fitted on all columns, including integer-coded categories and 0/1 indicators
data_scaled = scaler.fit_transform(data)
data = pd.DataFrame(data_scaled, columns=data.columns)
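
The cell above fits the scaler on every row and every column. A common alternative is to split the data first, fit the scaler on the training rows only, and scale only the continuous columns, so that test-set statistics do not leak into training. The sketch below shows this on randomly generated toy data; the column names and the choice of continuous columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data (not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "duration_months": rng.integers(4, 73, size=20).astype(float),
    "age_years": rng.integers(19, 76, size=20).astype(float),
    "telephone": rng.integers(0, 2, size=20),
})

# Hypothetical choice: only these columns are treated as continuous
continuous_cols = ["duration_months", "age_years"]

# Split before fitting so the test rows do not influence the scaler
train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
train[continuous_cols] = scaler.fit_transform(train[continuous_cols])
test[continuous_cols] = scaler.transform(test[continuous_cols])

print(train[continuous_cols].mean().round(6))
print(test.head())
```

The binary `telephone` column is left as it is, since standardizing a 0/1 indicator rarely helps and makes it harder to read.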
