<div class="alert alert-block alert-info">
Singapore Management University<br>
CS105 Statistical Thinking for Data Science, 2024/25 Term 2
</div>

# CS105 Group Project Submission (Part I)

-----
Provide your team details, including section, team number, team members, and the name of the dataset. 
Complete all of the following sections. For any part requiring code to derive your answers, please create a code cell immediately below your response and run the code.
To edit any markdown cell, double click the cell; after editing, execute the markdown cell to collapse it.
<br>
-----

## Declaration

<span style="color:red">By submitting this notebook, we declare that **no part of this submission is generated by any AI tool**. We understand that AI-generated submissions will be considered plagiarism and that, as with other plagiarism cases, disciplinary action will be imposed.</span>

#### Section: G2
#### Team: T3
#### Members:
1. Bradley Goh
2. Denzyl Ng
3. Jared Yeo
4. Nagaraj Yohapriya
5. Zhang Zemin

#### Dataset: Credit

## Part I: Exploratory Data Analysis (EDA) [8% of final grade]

### 1. Overview of dataset [15% of Part I]

**a.** Summarise the background of the dataset [limited to 50 words]

**Response.** 
The dataset contains records of individuals applying for a credit facility at a bank. Each row corresponds to one applicant and holds the attributes captured by the bank during the application, together with the approval status of that application.

**b.** State the size of the dataset

**Response.**
1000 rows x 23 columns
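
The size can be read directly from the loaded DataFrame. A minimal sketch, using a small hypothetical stand-in frame (`demo`) rather than the real `credit.csv`:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (the actual file has 1000 rows x 23 columns)
demo = pd.DataFrame({"ID": [1, 2, 3], "Duration": [12, 24, 36], "Approval": [1, 0, 1]})

# DataFrame.shape is a (rows, columns) tuple
n_rows, n_cols = demo.shape
print(f"{n_rows} rows x {n_cols} columns")
```

On the real data, the same call would be `credit.shape`.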

**c.** For each variable, describe what it represents and its data type (numerical or categorical)

**Response.**

| Variable | Description | Data type |
|---|---|---|
| ID | Identification number | Categorical |
| Checking_Account | Status of checking account | Categorical |
| Duration | Credit duration in months | Numerical |
| Payment_Status | Credit history | Categorical |
| Purpose | Purpose of credit | Categorical |
| Amount | Credit amount | Numerical |
| Savings_Account | Status of savings account | Categorical |
| Employment | Length of current employment | Categorical |
| Installment | Installment rate as percentage of disposable income | Numerical |
| Personal_Status | Marital status and sex | Categorical |
| Guarantors | Other debtors or guarantors | Categorical |
| Residence_Length | Number of years staying in current residence | Categorical |
| Assets | Asset ownership | Categorical |
| Age | Age in years | Numerical |
| Credit_Rating | Credit rating | Numerical |
| Existing_Credits | Other existing credit in place | Categorical |
| Housing_Type | Type of apartment | Categorical |
| Num_Credits | Number of existing credits | Numerical |
| Occupation | Occupation | Categorical |
| Dependents | Number of dependents | Numerical |
| Telephone | Has telephone | Categorical |
| Foreign_Worker | Foreign worker or not | Categorical |
| Approval | Loan approval status | Categorical |

### 2. Data pre-processing [35% of Part I]

**a.** For each variable, determine the percentage of missing data. For any column with missing data, describe how you resolve the issue. Clearly state any assumption you made.

In [11]:
#**Response.** 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

credit = pd.read_csv("credit.csv")

# Fraction of missing values in each column (multiply by 100 for a percentage)
missing = credit.isna().mean()
print(missing)

# Assumption: a missing Checking_Account means the applicant has no checking account,
# so missing values are encoded as 4.
# Assumption: a missing Dependents value means the applicant has no dependents,
# so missing values are set to 0.

# Fill missing Checking_Account values (assign back instead of chained inplace=True)
credit["Checking_Account"] = credit["Checking_Account"].fillna(4)

# Fill missing Dependents values
credit["Dependents"] = credit["Dependents"].fillna(0)

ID                  0.000
Checking_Account    0.058
Duration            0.000
Payment_Status      0.000
Purpose             0.000
Amount              0.000
Savings_Account     0.000
Employment          0.000
Installment         0.000
Personal_Status     0.000
Guarantors          0.000
Residence_Length    0.000
Assets              0.000
Age                 0.000
Credit_Rating       0.000
Existing_Credits    0.000
Housing_Type        0.000
Num_Credits         0.000
Occupation          0.000
Dependents          0.050
Telephone           0.000
Foreign_Worker      0.000
Approval            0.000
dtype: float64
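
The two fill assumptions stated in the cell comments can also be applied in a single call by passing a dictionary to `DataFrame.fillna`. The sketch below uses a small hypothetical frame (`demo`) in place of `credit.csv`; the fill values 4 and 0 come from those assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real data comes from credit.csv
demo = pd.DataFrame({
    "Checking_Account": [1.0, np.nan, 2.0, 4.0],
    "Dependents": [1.0, 2.0, np.nan, 1.0],
})

# Percentage of missing values per column (mean of a boolean mask, times 100)
missing_pct = demo.isna().mean() * 100
print(missing_pct)

# Fill values taken from the stated assumptions
fill_values = {"Checking_Account": 4, "Dependents": 0}

# fillna with a dict fills each column with its own value and returns a new frame
demo = demo.fillna(fill_values)
print(demo)
```

Assigning the result back to the variable avoids in-place modification of an intermediate column object.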




**b.** For each variable, identify outliers (if any) and describe how you resolve the issue. Clearly state any assumption you made.

In [13]:
#**Response.** 

# Check that every ID is unique (the count should equal the number of rows)
print(credit["ID"].nunique())

# Valid Residence_Length codes are 1 to 4; flag anything outside that range
invalid_mask = (credit["Residence_Length"] < 1) | (credit["Residence_Length"] > 4)
invalid_values = credit[invalid_mask]

# Print results
if not invalid_values.empty:
    print("Invalid Residence_Length values found:")
    print(invalid_values)

    # Mode of the valid values; mode() returns a Series, so take the first entry
    residence_mode = credit.loc[~invalid_mask, "Residence_Length"].mode()[0]

    # Assumption: invalid values are replaced with the most common valid value
    credit.loc[invalid_mask, "Residence_Length"] = residence_mode

1000
Invalid Residence_Length values found:
      ID  Checking_Account  Duration  Payment_Status  Purpose  Amount  \
41    42               4.0        12               4        0     682   
47    48               4.0        12               4        6    2748   
86    87               4.0        10               2        0    1546   
99   100               4.0        36               2        9    5742   
323  324               1.0        36               2       10   15857   
329  330               4.0        30               2        3    2333   
425  426               4.0         8               4        0     713   
496  497               4.0        12               2        6    3321   
578  579               2.0         9               2        3    1082   
760  761               2.0        24               2        1    4113   
836  837               4.0         6               2        3    1595   
952  953               4.0        18               4        3    6070   

     S
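
The check above covers only `Residence_Length`. For continuous variables such as `Amount`, one common option (not necessarily the approach intended for this submission) is the 1.5 × IQR rule. A minimal sketch follows; the helper name `iqr_outlier_mask` is made up for illustration, and the sample consists of a few `Amount` values copied from the output above.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# A handful of Amount values from the printed rows, used only for illustration
amount = pd.Series([682, 1082, 1546, 1595, 2333, 2748, 3321, 15857], name="Amount")

mask = iqr_outlier_mask(amount)
print(amount[mask])
```

Whether a flagged value should be capped, removed, or kept is a separate judgment call, since a large credit amount may be a genuine application rather than an error.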

**c.** For categorical variables, perform the necessary encoding.

**Response.** 
The categorical variables are already numerically encoded in the dataset, so no further encoding was performed.
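
Because the categories are stored as integer codes, pandas treats them as numbers unless told otherwise. One option, sketched below on illustrative values, is to cast the coded columns to the `category` dtype so that numeric summaries such as the mean are not computed on them by accident. The column list here is an assumed subset, chosen only for illustration.

```python
import pandas as pd

# Illustrative coded values; the real columns come from credit.csv
demo = pd.DataFrame({
    "Purpose": [0, 6, 0, 9],
    "Housing_Type": [1, 2, 2, 3],
    "Amount": [682, 2748, 1546, 5742],
})

# Assumed subset of categorical columns, for illustration only
categorical_cols = ["Purpose", "Housing_Type"]
demo[categorical_cols] = demo[categorical_cols].astype("category")

print(demo.dtypes)

# describe() summarises only the numeric columns by default
numeric_summary = demo.describe()
print(numeric_summary)
```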

### 3. Exploratory analysis and visualization [50% of Part I]

**a.** For each variable, provide relevant summary statistics

In [None]:
#**Response.** 

# mode() returns a Series (there can be ties), so convert it to a list for printing

# Summary of Checking_Account
print(f"Checking_Account - Mode: {credit.Checking_Account.mode().tolist()}")

# Summary of Duration
print(f"Duration - Mean: {credit.Duration.mean()}")
print(f"Duration - Median: {credit.Duration.median()}")
print(f"Duration - Variance: {credit.Duration.var()}")

# Summary of Payment_Status
print(f"Payment_Status - Mode: {credit.Payment_Status.mode().tolist()}")

# Summary of Purpose
print(f"Purpose - Mode: {credit.Purpose.mode().tolist()}")

# Summary of Num_Credits
print(f"Num_Credits - Mean: {credit.Num_Credits.mean()}")
print(f"Num_Credits - Median: {credit.Num_Credits.median()}")
print(f"Num_Credits - Variance: {credit.Num_Credits.var()}")

# Summary of Occupation
print(f"Occupation - Mode: {credit.Occupation.mode().tolist()}")

# Summary of Dependents
print(f"Dependents - Mean: {credit.Dependents.mean()}")
print(f"Dependents - Median: {credit.Dependents.median()}")
print(f"Dependents - Variance: {credit.Dependents.var()}")

# Summary of Foreign_Worker
print(f"Foreign_Worker - Mode: {credit.Foreign_Worker.mode().tolist()}")
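
The same statistics can be produced for every variable without repeating the print calls. A minimal sketch, assuming the columns are split into numerical and categorical lists as in Section 1c; only a few columns are shown, with illustrative values in a stand-in frame (`demo`).

```python
import pandas as pd

# Illustrative values; the real data comes from credit.csv
demo = pd.DataFrame({
    "Duration": [12, 12, 10, 36, 36, 30],
    "Num_Credits": [1, 2, 1, 1, 2, 1],
    "Purpose": [0, 6, 0, 9, 10, 3],
})

# Assumed split of columns by data type, for illustration only
numerical_cols = ["Duration", "Num_Credits"]
categorical_cols = ["Purpose"]

# Mean, median and sample variance for each numerical column (one row per variable)
numeric_summary = demo[numerical_cols].agg(["mean", "median", "var"]).T
print(numeric_summary)

# Mode(s) for each categorical column; a column can have more than one mode
modes = {col: demo[col].mode().tolist() for col in categorical_cols}
print(modes)
```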



**b.** For each variable, provide an appropriate visualisation depicting the distribution of its values, and summarize any key observation(s) you made.

**Response.** 
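
One possible starting point, sketched on illustrative values rather than the full dataset: a histogram for a numerical variable and a bar chart of category counts for a categorical one. The choice of `Amount` and `Purpose`, the number of bins, and the stand-in frame `demo` are assumptions made for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values; the real data comes from credit.csv
demo = pd.DataFrame({
    "Amount": [682, 2748, 1546, 5742, 15857, 2333, 713, 3321],
    "Purpose": [0, 6, 0, 9, 10, 3, 0, 6],
})

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram for a numerical variable
ax_num.hist(demo["Amount"], bins=5)
ax_num.set_title("Amount")
ax_num.set_xlabel("Credit amount")
ax_num.set_ylabel("Frequency")

# Bar chart of counts per category code for a categorical variable
purpose_counts = demo["Purpose"].value_counts().sort_index()
ax_cat.bar(purpose_counts.index.astype(str), purpose_counts.values)
ax_cat.set_title("Purpose")
ax_cat.set_xlabel("Purpose code")
ax_cat.set_ylabel("Count")

fig.tight_layout()
```

In a notebook the figure is rendered when the cell finishes running.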

**c.** Perform bi-variate analysis on the variables. You do not need to present the analysis of every pair of variables; only focus on the pairs you believe are worth investigating and explain. For each pair, describe the relationship between the two variables. Use appropriate statistical methods and/or visualizations.

**Response.** 
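
For a pair of categorical variables, one option is a contingency table followed by a chi-square test of independence. The sketch below uses made-up values, and the 0/1 coding of `Approval` is an assumption. The counts are far too small for a reliable test, so it only shows the mechanics.

```python
import pandas as pd
from scipy import stats

# Made-up values for illustration; the real data comes from credit.csv
demo = pd.DataFrame({
    "Checking_Account": [1, 1, 2, 2, 4, 4, 4, 4, 1, 2],
    "Approval":         [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
})

# Contingency table of observed counts
table = pd.crosstab(demo["Checking_Account"], demo["Approval"])
print(table)

# Chi-square test of independence between the two variables
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
```

For a numerical variable against a categorical one, grouped summary statistics or side-by-side box plots are a common alternative.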