In [5]:
# Suppress warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)


import pandas as pd
import numpy as np
import skimpy 

---

# Point 3: Data Understanding
- Gathering initial data and exploring its characteristics.
- Assessing data quality, completeness, and relevance.
- Identifying potential data issues and limitations.


---

In [6]:
# Gathering initial data and exploring its characteristics
df = pd.read_csv('../data/raw_data.csv')
df.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


---

The dataset mixes numerical and categorical variables describing loan applicants and their applications, with Loan_Status recording the outcome of each application. This makes it a reasonable basis for analyzing which factors influence loan approval and for building predictive models of loan eligibility. The preview already shows one data quality issue (LoanAmount is missing in the first row), and it says nothing yet about relationships between variables, so further exploration is needed.
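
A first look at relationships could start with the target's class balance and a cross-tabulation against one candidate predictor. The sketch below is only an illustration: it uses a small toy frame with invented values (only the column names come from the dataset), and choosing Credit_History as the predictor is an assumption, not a finding.

```python
import pandas as pd

# Toy stand-in for the loan data: real column names, invented values.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, None],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y", "Y"],
})

# Share of each class in the target variable.
target_share = toy["Loan_Status"].value_counts(normalize=True)

# Loan_Status shares within each Credit_History group (each row sums to 1).
# Rows where Credit_History is missing are dropped by crosstab's default.
status_by_history = pd.crosstab(
    toy["Credit_History"], toy["Loan_Status"], normalize="index"
)

print(target_share)
print(status_by_history)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.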


---

In [7]:
# Assessing data quality, completeness, and relevance
# df.info() prints its report and returns None, so it is not wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


---

- The dataset contains 614 entries (rows) and 13 columns.
- Each column represents a different variable or feature.
- The variables have different data types:
  - Four columns are of type float64: CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History.
  - One column is of type int64: ApplicantIncome.
  - Eight columns are of type object. Seven of them hold categorical values (Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status); Loan_ID is an identifier rather than a feature.
- Seven columns have missing values (counts out of 614 rows): Credit_History (50), Self_Employed (32), LoanAmount (22), Dependents (15), Loan_Amount_Term (14), Gender (13), and Married (3).
- The target variable (Loan_Status) is categorical and has two classes: Y (Yes) and N (No).
- The other categorical variables are Gender, Married, Dependents, Education, Self_Employed, and Property_Area.
- The numerical variables are ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. Credit_History is also stored as float64, but the preview shows only the value 1.0, which suggests it may be a flag that is better treated as categorical.
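
The missing-value counts and the numeric/non-numeric split can also be computed directly rather than read off the `info()` output. The following is a minimal sketch on a toy frame: the column names come from the dataset, but every value (including the IDs and where the gaps fall) is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: real column names, invented values and invented gaps.
toy = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3", "A4"],
    "Gender": ["Male", None, "Female", "Male"],
    "ApplicantIncome": [5000, 4000, 3000, 2500],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Missing values per column, as a count and as a percentage of rows.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})
# Keep only the columns that actually have gaps.
summary = summary[summary["n_missing"] > 0]

# Split columns by dtype instead of listing them by hand.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(summary)
print(numeric_cols, other_cols)
```

Note that a dtype-based split puts Credit_History among the numeric columns and Loan_ID among the non-numeric ones, so the result still needs a manual review before modelling.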

---

In [8]:
# Identifying potential data issues and limitations
from skimpy import skim
skim(df)

---

`skim(df)` prints a compact summary of the dataset: its structure, the missing values in each column, and basic statistics for the numerical variables. It is a quick way to spot potential issues and decide where further exploration and analysis should focus.
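
The same kinds of checks can be approximated with plain pandas when skimpy is not available. The sketch below rebuilds four columns from the five rows shown in the preview above, so its numbers describe those five rows only, not the full 614-row dataset; the duplicate-ID check is an extra sanity check added here, not something taken from the skim summary.

```python
import pandas as pd

# Four columns from the five preview rows shown earlier in the notebook.
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005", "LP001006", "LP001008"],
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Graduate"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "LoanAmount": [None, 128.0, 66.0, 120.0, 141.0],
})

# Basic statistics for the numeric columns (count, mean, std, min, quartiles, max).
numeric_stats = sample.describe()

# Number of distinct levels in a categorical column.
education_levels = sample["Education"].nunique()

# An identifier column should contain no repeated values.
duplicate_ids = int(sample["Loan_ID"].duplicated().sum())

print(numeric_stats)
print(education_levels, duplicate_ids)
```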

---