<a href="https://colab.research.google.com/github/Nelkit/DSI-AT2-data-analysis-project/blob/main/notebooks/DSI_AT2_data_analysis_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assessment task 2: Data analysis project




## Stage 1 – To-Do List

### 1. Setup
- [x] Create GitHub repository (`DSI-AT2-data-analysis-project`).
- [x] Add README.md with project name, description, and member list.
- [ ] Set up shared workspace (Google Colab/Word).

### 2. Dataset Selection & Problem Definition
- [ ] Search for and shortlist possible datasets.
- [ ] Evaluate datasets for relevance, size, completeness, and licensing.
- [ ] Select final dataset.
- [ ] Define the general problem (context and background).
- [ ] Identify the business/research need.
- [ ] Write 1–3 research questions.
- [ ] Draft **Section 1: Problem** (300–400 words).

### 3. Literature Review
- [ ] Each member finds 2–3 relevant academic/industry papers.
- [ ] Summarise each paper: objective, method, findings, relevance.
- [ ] Identify trends and gaps.
- [ ] Draft **Section 2: Literature Review** (600–700 words).

### 4. Data Exploration
- [ ] Load dataset into analysis environment (Python, R, etc.).
- [ ] Inspect structure: rows, columns, data types.
- [ ] Identify missing values and data quality issues.
- [ ] Generate summary statistics.
- [ ] Create 1–2 visualisations to illustrate data characteristics.
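
The inspection steps in this checklist could be sketched roughly as follows. This is only an illustration, not the project's agreed method: `summarise_frame` is a hypothetical helper name, and the toy frame holds made-up values that merely reuse two column names from the dataset.

```python
import pandas as pd


def summarise_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })


# Toy frame with made-up values, for illustration only (hypothetical data)
toy = pd.DataFrame({
    "Age": ["23", "-500", "23", None],
    "Monthly_Inhand_Salary": [1824.84, None, None, 1824.84],
})

summary = summarise_frame(toy)
print(toy.shape)   # rows and columns
print(summary)     # per-column data quality overview
```

Keeping dtype, missing counts, and cardinality in one table makes it easier to spot numeric columns stored as text and columns with many gaps in a single pass.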

### 5. Data Preparation
- [ ] Clean data (handle missing values, remove duplicates, fix types).
- [ ] Apply transformations (normalisation, encoding, etc.) if needed.
- [ ] Document all preparation steps and reasoning.
- [ ] Draft preparation part of **Section 4: Data**.
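
As a rough sketch of the cleaning steps listed above (duplicates, types, missing values), something like the following might serve as a starting point. The helper `clean_numeric`, the toy data, and the choice of median imputation are all assumptions made for illustration; the team may well settle on different rules.

```python
import pandas as pd


def clean_numeric(series: pd.Series) -> pd.Series:
    """Strip stray underscores, then coerce to numbers; unparseable entries become NaN."""
    cleaned = series.astype(str).str.replace("_", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


# Toy frame with made-up values, for illustration only (hypothetical data)
toy = pd.DataFrame({
    "Customer_ID": ["A", "A", "B", "B"],
    "Age": ["23", "23", "41_", "abc"],
    "Salary": [1800.0, 1800.0, None, 5200.0],
})

# 1. Remove exact duplicate rows
prepared = toy.drop_duplicates().copy()

# 2. Fix types: text column holding numbers becomes numeric
prepared["Age"] = clean_numeric(prepared["Age"])

# 3. Handle missing values (median imputation is an assumption, one option among several)
prepared["Salary"] = prepared["Salary"].fillna(prepared["Salary"].median())

print(prepared)
```

Coercing with `errors="coerce"` turns bad entries into missing values instead of raising, so they can be counted and documented in the preparation notes before deciding how to treat them.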

### 6. Approach Planning
- [ ] Define methodology for Stage 2 (analysis/modeling approach).
- [ ] List tools, frameworks, and metrics to be used.
- [ ] Create a mini-timeline for Stage 2 work.
- [ ] Draft **Section 3: Approach** (400–500 words).

### 7. Report Integration & Review
- [ ] Merge all sections into one document with consistent formatting.
- [ ] Add figures/tables in the right sections.
- [ ] Write introduction/title page.
- [ ] Add references in correct style (APA/Harvard).
- [ ] Review word count (~2000 words excluding extras).
- [ ] Proofread for clarity, grammar, and alignment with marking criteria.

### 8. Submission
- [ ] Export report as PDF.
- [ ] Final group review and sign-off.
- [ ] Submit before **29 Aug 23:59**.



## 0. Setup Environment

### 0.a Install Mandatory Packages

> Do not modify this code before running it

In [1]:
# Do not modify this code

import os
import sys
from pathlib import Path

COURSE = "36100"
ASSIGNMENT = "AT2"
DATA = "data"

asgmt_path = f"{COURSE}/assignment/{ASSIGNMENT}"
root_path = "./"

if os.getenv("COLAB_RELEASE_TAG"):

    from google.colab import drive
    from pathlib import Path

    print("\n###### Connect to personal Google Drive ######")
    gdrive_path = "/content/gdrive"
    drive.mount(gdrive_path)
    root_path = f"{gdrive_path}/MyDrive/"

print("\n###### Setting up folders ######")
folder_path = Path(f"{root_path}/{asgmt_path}/") / DATA
folder_path.mkdir(parents=True, exist_ok=True)
print(f"\nYou can now save your data files in: {folder_path}")

if os.getenv("COLAB_RELEASE_TAG"):
    %cd {folder_path}



###### Connect to personal Google Drive ######
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).

###### Setting up folders ######

You can now save your data files in: /content/gdrive/MyDrive/36100/assignment/AT2/data
/content/gdrive/MyDrive/36100/assignment/AT2/data


### 0.b Disable Warning Messages

> Do not modify this code before running it

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### 0.c Install Additional Packages

> If you are using additional packages, you need to install them here using the command: `! pip install <package_name>`

In [3]:
!pip install scipy



### 0.d Import Packages

In [4]:
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import matplotlib.gridspec as gridspec

### 0.e Reusable Functions

## 1. Project Overview

### 1.1 Project Description

`Type your project description here`

### 1.2 Business Objective

`Type your business objective here`

### 1.3 Research Questions

`Type your research questions here`

## 2. Data Loading and Understanding

In [5]:
df_train = pd.read_csv(folder_path / "train.csv")
df_test = pd.read_csv(folder_path / "test.csv")



## 3. Exploratory Data Analysis (EDA)

### 3.1 Explore features

In [7]:
df_train.head()

| | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x1602 | CUS_0xd40 | January | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | _ | 809.98 | 26.82262 | 22 Years and 1 Months | No | 49.574949 | 80.41529543900253 | High_spent_Small_value_payments | 312.49408867943663 | Good |
| 1 | 0x1603 | CUS_0xd40 | February | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 31.94496 | NaN | No | 49.574949 | 118.28022162236736 | Low_spent_Large_value_payments | 284.62916249607184 | Good |
| 2 | 0x1604 | CUS_0xd40 | March | Aaron Maashoh | -500 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 28.609352 | 22 Years and 3 Months | No | 49.574949 | 81.699521264648 | Low_spent_Medium_value_payments | 331.2098628537912 | Good |
| 3 | 0x1605 | CUS_0xd40 | April | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 31.377862 | 22 Years and 4 Months | No | 49.574949 | 199.4580743910713 | Low_spent_Small_value_payments | 223.45130972736783 | Good |
| 4 | 0x1606 | CUS_0xd40 | May | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 809.98 | 24.797347 | 22 Years and 5 Months | No | 49.574949 | 41.420153086217326 | High_spent_Medium_value_payments | 341.48923103222177 | Good |
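
The preview shows `Credit_History_Age` stored as text such as `22 Years and 1 Months`. One possible way to turn it into a number of months is sketched below. The function name `history_age_to_months` is hypothetical, and the pattern assumes every non-missing value follows the format seen in these five rows, which has not been verified against the full dataset.

```python
import math
import re

# Assumed format: "<years> Years and <months> Months" (based only on the rows previewed above)
_PATTERN = re.compile(r"(\d+)\s+Years?\s+and\s+(\d+)\s+Months?")


def history_age_to_months(text):
    """Convert e.g. '22 Years and 1 Months' to total months; NaN if missing or unparseable."""
    if not isinstance(text, str):
        return math.nan
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        return math.nan
    years, months = map(int, match.groups())
    return years * 12 + months


print(history_age_to_months("22 Years and 1 Months"))
```

In the notebook this could be applied with `df_train["Credit_History_Age"].map(history_age_to_months)`. Returning NaN for unexpected strings keeps them visible for later inspection instead of silently guessing a value.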


In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

In [9]:
df_train.describe()

| | Monthly_Inhand_Salary | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Delay_from_due_date | Num_Credit_Inquiries | Credit_Utilization_Ratio | Total_EMI_per_month |
|---|---|---|---|---|---|---|---|---|
| count | 84998.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 98035.0 | 100000.0 | 100000.0 |
| mean | 4194.17085 | 17.09128 | 22.47443 | 72.46604 | 21.06878 | 27.754251 | 32.285173 | 1403.118217 |
| std | 3183.686167 | 117.404834 | 129.05741 | 466.422621 | 14.860104 | 193.177339 | 5.116875 | 8306.04127 |
| min | 303.645417 | -1.0 | 0.0 | 1.0 | -5.0 | 0.0 | 20.0 | 0.0 |
| 25% | 1625.568229 | 3.0 | 4.0 | 8.0 | 10.0 | 3.0 | 28.052567 | 30.30666 |
| 50% | 3093.745 | 6.0 | 5.0 | 13.0 | 18.0 | 6.0 | 32.305784 | 69.249473 |
| 75% | 5957.448333 | 7.0 | 7.0 | 20.0 | 28.0 | 9.0 | 36.496663 | 161.224249 |
| max | 15204.633333 | 1798.0 | 1499.0 | 5797.0 | 67.0 | 2597.0 | 50.0 | 82331.0 |
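
Several columns in this summary have maxima far above their upper quartile (for example `Num_Bank_Accounts` with a 75% value of 7 and a maximum of 1798), which suggests extreme or erroneous entries. A minimal sketch of one common way to flag such values, the 1.5 × IQR rule, is shown below. The helper name `iqr_outlier_mask` and the toy series are assumptions for illustration, and whether this rule suits the project is still an open choice.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy series with made-up values, for illustration only (hypothetical data)
toy = pd.Series([3, 6, 7, 5, 4, 1798], name="Num_Bank_Accounts")
mask = iqr_outlier_mask(toy)

print(toy[mask])  # values flagged as potential outliers
```

A flag column like this only marks candidates. Deciding whether to cap, remove, or keep them is a separate step that belongs in the data cleaning section, and domain rules (such as counts that cannot be negative) may need their own checks.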


### 3.2 Explore target variable

## 4. Feature Selection

### 4.1 Feature Selection Approach

### 4.2 Final Selected Features

## 5. Data Preprocessing

### 5.1 Data Cleaning

### 5.2 Feature Engineering

### 5.3 Data Transformation

## 6. Data Modeling

### 6.1 Generate Predictions with Baseline Model

### 6.2 Assess the Baseline Model

## 7. Model Evaluation

### 7.1 Generate Predictions with Model Selected

### 7.2 Assess the Selected Model

## 8. Insights and Conclusions