Python-based data preprocessing project converting raw customer CSV data into structured JSON files. Includes data cleaning, nested structuring, and derived metrics like salary-per-commute. Features anomaly detection and Seaborn visualisations for analysing trends in age, salary, and employment.
The project simulates a real-world business scenario where a company needs to prepare large volumes of customer information for further analysis. The data, originally stored in a flat CSV file (acw_user_data.csv), contains various attributes such as customer demographics, financial details, vehicle information, and employment status. However, this raw dataset is not suitable for direct analytical use, as it lacks structure, consistency, and derived metrics.
To address these issues, a robust Python-based data processing pipeline was implemented using standard-library modules (csv, json, os, sys, etc.). The pipeline performs a sequence of essential preprocessing tasks, including:
- Reading and cleaning data to handle missing or invalid entries (e.g., errors in the dependants column).
- Structuring flat data into nested representations such as Vehicle, Credit Card, and Address objects.
- Segregating data into multiple JSON outputs (processed.json, retired.json, employed.json, remove_ccard.json, and commute.json) for specific analytical needs.
- Deriving new metrics such as the Salary-Commute ratio, enabling customers to be ranked by income relative to travel distance.
- Identifying anomalies, such as credit card records with unrealistic validity periods (greater than 10 years).

Short illustrative sketches of each of these steps follow.
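As a rough sketch of the cleaning step, the snippet below loads the raw file with csv.DictReader and coerces the dependants column to an integer. The header name "Dependants" and the fallback value of 0 are assumptions for illustration, not details confirmed by the dataset.

```python
import csv

def read_and_clean(path="acw_user_data.csv"):
    """Read the raw CSV, coercing the (assumed) Dependants column."""
    rows, problem_rows = [], []
    with open(path, newline="") as f:
        for index, row in enumerate(csv.DictReader(f)):
            raw = (row.get("Dependants") or "").strip()
            try:
                row["Dependants"] = int(raw)
            except ValueError:
                # Missing or malformed entry: note the row index and
                # fall back to 0 dependants.
                problem_rows.append(index)
                row["Dependants"] = 0
            rows.append(row)
    return rows, problem_rows
```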
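Nesting can then be pictured as folding related flat columns into sub-objects. Every header name below ("Vehicle Make", "Credit Card Start Date", and so on) is a hypothetical placeholder for the real CSV headers:

```python
def nest_record(row):
    """Fold flat CSV fields into nested Vehicle / Credit Card / Address
    objects; all field names are illustrative placeholders."""
    return {
        "first_name": row.get("First Name"),
        "last_name": row.get("Last Name"),
        "age": row.get("Age"),
        "vehicle": {
            "make": row.get("Vehicle Make"),
            "model": row.get("Vehicle Model"),
            "year": row.get("Vehicle Year"),
        },
        "credit_card": {
            "start_date": row.get("Credit Card Start Date"),
            "expiry_date": row.get("Credit Card Expiry Date"),
            "number": row.get("Credit Card Number"),
        },
        "address": {
            "street": row.get("Address Street"),
            "city": row.get("Address City"),
            "postcode": row.get("Address Postcode"),
        },
    }
```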
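Segregation then reduces to filtering the nested records and dumping each subset with json.dump. The "retired" flag and "employer" field used below are assumed names:

```python
import json

def write_outputs(records):
    """Write the full dataset plus retired and employed subsets."""
    with open("processed.json", "w") as f:
        json.dump(records, f, indent=4)
    # A truthy (assumed) "retired" flag marks retired customers.
    with open("retired.json", "w") as f:
        json.dump([r for r in records if r.get("retired")], f, indent=4)
    # A non-empty (assumed) "employer" field marks employed customers.
    with open("employed.json", "w") as f:
        json.dump([r for r in records if r.get("employer")], f, indent=4)
```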
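The Salary-Commute ratio might be derived along these lines. The field names, and the convention of leaving the ratio equal to the raw salary for commutes of one unit or less, are assumptions rather than the project's confirmed rule:

```python
import json

def rank_by_salary_commute(records):
    """Attach a salary-per-commute-distance metric and rank by it."""
    for r in records:
        salary = float(r.get("salary") or 0)
        distance = float(r.get("distance_commuted") or 0)
        # Assumed convention: very short commutes keep the raw salary,
        # avoiding division blow-ups for customers working from home.
        r["salary_commute"] = salary if distance <= 1 else salary / distance
    ranked = sorted(records, key=lambda r: r["salary_commute"], reverse=True)
    with open("commute.json", "w") as f:
        json.dump(ranked, f, indent=4)
    return ranked
```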
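Finally, the anomaly check amounts to comparing each card's start and expiry dates. The "%m/%y" date format and the nested credit_card keys are assumptions about how those fields are stored:

```python
import json
from datetime import datetime

def flag_long_validity(records, max_years=10):
    """Collect cards whose start-to-expiry span exceeds max_years."""
    flagged = []
    for r in records:
        card = r.get("credit_card", {})
        try:
            start = datetime.strptime(card.get("start_date") or "", "%m/%y")
            end = datetime.strptime(card.get("expiry_date") or "", "%m/%y")
        except ValueError:
            continue  # skip records with missing or unparseable dates
        if (end - start).days / 365.25 > max_years:
            flagged.append(r)
    with open("remove_ccard.json", "w") as f:
        json.dump(flagged, f, indent=4)
    return flagged
```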
In addition, the second phase of the project introduces data visualisation using Pandas and Seaborn, allowing insights into customer demographics and trends. Statistical measures (e.g., mean salary, median age) are calculated, and both univariate and multivariate plots are generated to explore relationships between variables such as age, salary, commute distance, and marital status.
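A minimal sketch of this phase, assuming the processed records load with pandas.read_json and that columns named age, salary, distance_commuted, and marital_status exist:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the processed records; column names here are assumptions.
df = pd.read_json("processed.json")

# Statistical summaries.
print("Mean salary:", df["salary"].mean())
print("Median age:", df["age"].median())

# Univariate: distribution of customer ages.
sns.histplot(data=df, x="age", bins=20)
plt.savefig("age_distribution.png")
plt.clf()

# Multivariate: salary against commute distance, split by marital status.
sns.scatterplot(data=df, x="distance_commuted", y="salary", hue="marital_status")
plt.savefig("salary_vs_commute.png")
```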
This project demonstrates the complete data preprocessing lifecycle, from data ingestion and transformation to feature derivation and visual analysis. It highlights clean, structured, and well-documented data preparation as a foundational step for subsequent machine learning and AI-based modelling.