💰 FinTech Project

As part of the Big Data and AI Engineering Onsite Bootcamp, we were asked to deliver a data science solution for the Saudi market. The project has to have an impact and address a real-world problem using Saudi datasets.

Table of Contents

Project Overview
Business Objective
- Methods Used
- Technologies
Dataset Overview
Preprocessing Overview
Visualization
Modeling Results
Contributing Members Contact
Acknowledgments

Project Overview

The project's structure and files are outlined below for easier navigation. Some notebooks and datasets could not be uploaded, either to protect the company's confidentiality or because of size limits:

├── README.md
├── CapestoneProject_Dashboard_Desert_Ninjas.pdf
├── CapstoneProject_Presentation_Desert_Ninjas.pdf
├── Notebooks
│   ├── CapstoneProject_Pre_Preprocessing_Notebook_ComanyNameEncryption.ipynb 
│   ├── CapstoneProject_Preprocessing_Notebook_Desert_Ninjas.ipynb
│   ├── CapstoneProject_EDA_Notebook_Desert_Ninjas.ipynb
│   └── CapstoneProject_ML_Notebook_Desert_Ninjas.ipynb
└── Datasets
    ├── Encrypted_full_dataset.csv (output of the pre-preprocessing notebook) 
    ├── Encrypted_exported_raw_data.csv (output of the pre-preprocessing notebook) 
    ├── Preprocessed_full_dataset.csv (output of the preprocessing notebook) 
    └── Final_extracted_dataset.csv (used for the EDA, Dashboard, and Machine Learning models)

Note: We started with two datasets that have different schemas (Encrypted_full_dataset + Encrypted_exported_raw_data).

(back to top)

Business Objective

The purpose of this project is to predict potential customers for a FinTech startup using its visitors' activity logs. These potential investors would then be targeted with marketing strategies.

Methods Used

Preprocessing raw data
Feature Engineering
Feature Selection
Labeling and classifying the data
Exploratory Data Analysis
Data Visualization
Machine Learning
Oversampling
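
To illustrate the oversampling step listed above, here is a minimal sketch of random oversampling using only the Python standard library. It is not the project's actual code (the project lists Imbalanced-learn for this); the function name `random_oversample`, the toy rows, and the class labels are all hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (with replacement) enough copies to reach the target size
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Hypothetical toy data: 8 regular visitors and 2 investors
rows = [[i] for i in range(10)]
labels = ["visitor"] * 8 + ["investor"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Oversampling is normally applied to the training split only, so that duplicated rows never appear in the data used for evaluation.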

Technologies

Python, Jupyter
Pandas
Plotly
Sklearn
Imbalanced-learn
Power BI

(back to top)

Dataset Overview

A FinTech startup, referred to here as X, wants to understand its customers’ behavior and whether they are going to invest, based on their activity logs. The problem is challenging because we lack the following information to support our analysis:

The number of visitors to the website
The demographics of these visitors

The analysis will help the company create a new marketing strategy for attracting more customers, increasing its revenues, and learning the patterns of customers who reach the investment pages but do not commit to the full transaction. Lucky for the FinTech company, we say, challenge accepted!

At the beginning of our analysis, we raised some questions that we intend to answer using our EDA, dashboard visualization, and modeling. The questions are:

What kind of data does their website collect from users?
Which path do users usually visit, and how much time do they spend on it?
Does the average time spent on a page differ based on the user type?
Which path has the maximum time? Is this the path that leads to a successful transaction (investment)?

We hope to answer all of these questions in our analysis.
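
As a rough illustration of how the time-related questions above could be approached, the sketch below averages time on page by user type and by path. The record layout `(user_type, page_path, seconds_on_page)`, the sample values, and the helper `mean_time_by` are assumptions made for this example; they do not reflect the company's real schema or data.

```python
from collections import defaultdict

# Hypothetical activity-log records: (user_type, page_path, seconds_on_page)
logs = [
    ("visitor", "/home", 12.0),
    ("visitor", "/invest", 30.0),
    ("investor", "/invest", 90.0),
    ("investor", "/home", 10.0),
    ("investor", "/invest", 110.0),
]

def mean_time_by(records, key_index):
    """Average seconds_on_page grouped by the field at key_index."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of seconds, row count]
    for rec in records:
        bucket = totals[rec[key_index]]
        bucket[0] += rec[2]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

by_user_type = mean_time_by(logs, 0)   # average time per user type
by_path = mean_time_by(logs, 1)        # average time per page path
slowest_path = max(by_path, key=by_path.get)

print(by_user_type)
print(slowest_path)
```

With Pandas, the same grouping would typically be a `groupby` on the relevant column followed by `mean`.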

(back to top)

Preprocessing Overview

Preprocessing is the essence of this project. This README gives an overview of each step; for a more detailed description, visit our Medium Blog Post.

The dataset before and after the preprocessing:

Preprocessing steps:

Feature engineering steps:

Features before removing data leakage:

Selecting the features after removing the data leakage:

(back to top)

Visualization

Based on our EDA, we found that 80% of our users are regular visitors, while only 17% are investors. We therefore created two dashboards, one for each user type.

Visitors dashboard:

Investors dashboard:

As mentioned above, you can visit our web blog for a detailed analysis of the project.

(back to top)

Modeling Results

All of these models were evaluated in order to choose the best one.

Because our dataset is imbalanced, we use recall as our evaluation metric. We also want to focus on identifying the potential customers class, so we chose the model that identified this class best compared to our baseline, which is XGBoost.
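
The reasoning behind choosing recall on an imbalanced dataset can be seen in a small example. The sketch below is illustrative only: the labels and predictions are made up, and `recall` and `accuracy` are hand-written helpers rather than the Sklearn functions the project would have used.

```python
def recall(y_true, y_pred, positive):
    """Share of actual positives that the model also predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 1 = potential customer, 0 = regular visitor
y_true = [0] * 8 + [1] * 2

# A model that always predicts the majority class
majority_pred = [0] * 10

# A model that finds both potential customers at the cost of two false alarms
model_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

print(accuracy(y_true, majority_pred), recall(y_true, majority_pred, positive=1))
print(accuracy(y_true, model_pred), recall(y_true, model_pred, positive=1))
```

In this toy case both models have the same accuracy, but only recall on the potential-customer class shows that the first model never finds a single customer.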

XGBoost results:

Baseline Distribution:

(back to top)

Contributing Members Contact

Team Leader: Reema Alaswad (Reema's LinkedIn)

Other Members:

Name	LinkedIn
Raghad Aleisa	Raghad's LinkedIn
AlJohara Alkanhal	AlJohara's LinkedIn
Maha AlHazzani	Maha's LinkedIn
Eman Aldosari	Eman's LinkedIn

(back to top)

Acknowledgments

(back to top)
