# 📂 Financial Data Analytics: Forecasting, Variance Analysis, and Customer Insights  
**💳 Dataset Reference:** [Financial Transactions Dataset: Analytics (Kaggle)](https://www.kaggle.com)  

### Table of Contents

1. **Introduction**
   - [Project Overview](#project-overview)
   - [Key Objectives](#key-objectives)
   - [Tools & Technologies](#tools--technologies)
   - [Project Execution Plan](#project-execution-plan)

2. **Data Integration and SQL Setup**
   - [Dataset Schema & Relationships](#dataset-schema--relationships)
   - [Key Relationships & Data Flow](#key-relationships--data-flow)
   - [SQL Database](#sql-database)
   - [Connect to SQL Database](#connect-to-sql-database)
   - [Create Tables](#create-tables)

# 📚 1. Introduction
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section covers the purpose, problem statement, and contribution of this project, along with the project overview.

</div>

#### Project Overview 
This project leverages a **comprehensive financial dataset** containing **transaction records, customer details, and card information** spanning the **2010s decade**. The dataset is designed for **financial forecasting, fraud detection, and AI-powered banking solutions**.  

The goal is to **demonstrate expertise in SQL, Python, Machine Learning, and Power BI** by performing:  
- **Financial forecasting** for revenue and expense prediction.  
- **Variance analysis** to compare budgeted vs. actual financials.  
- **Customer insights extraction** for spending behavior segmentation.  
- **Business intelligence reporting** using interactive dashboards.  



#### Key Objectives 
This project focuses on four core objectives: financial forecasting, variance analysis, customer segmentation, and business intelligence reporting. By leveraging SQL for data extraction and transformation, Python for machine learning, and Power BI for visualization, the project aims to provide actionable financial insights.

- **Financial Forecasting:** The goal is to build predictive models to forecast future revenue and expenses using historical transaction data. This involves extracting and cleaning financial data with SQL, training time-series models (ARIMA, XGBoost, LSTMs), and evaluating model performance using metrics such as RMSE, MAPE, and R² scores.

- **Variance Analysis:** The project will compare budgeted vs. actual financial performance to identify key variance drivers. Using SQL, variance percentages will be calculated across customers, merchants, and spending categories. This analysis will help detect high-variance spending patterns and uncover seasonal trends impacting financial deviations.

- **Customer Spending & Segmentation:** To understand customer behavior, clustering algorithms (K-Means, DBSCAN) will be used to categorize customers based on spending patterns. The analysis will differentiate between high-value and low-value customers by evaluating transaction trends and creating customer profiles based on spending frequency, transaction volume, and merchant categories.

- **Business Intelligence & Reporting:** The final objective is to develop a Power BI dashboard for real-time financial tracking and variance monitoring. SQL-based financial data will be integrated into Power BI to design interactive visualizations for revenue, expense, and variance trends. The dashboard will provide financial decision-makers with insights into spending behavior, revenue trends, and key financial variances.

By addressing these objectives, the project will demonstrate a strong foundation in SQL, Python, Machine Learning, and Power BI, ensuring a data-driven approach to financial analysis and forecasting.
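
The evaluation metrics named under Financial Forecasting can be sketched with NumPy alone. This is a minimal illustration, not the project's evaluation code: the function name `forecast_metrics` and the sample figures are hypothetical, and skipping zero-valued actuals in MAPE is one assumed convention among several.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return RMSE, MAPE (in percent), and R² for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted

    rmse = float(np.sqrt(np.mean(errors ** 2)))
    # MAPE is undefined when an actual value is zero, so those points are skipped
    nonzero = actual != 0
    mape = float(np.mean(np.abs(errors[nonzero] / actual[nonzero])) * 100)
    # R² compares the model's squared error with that of a constant mean forecast
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mape": mape, "r2": r2}

# Hypothetical monthly revenue figures, for illustration only
metrics = forecast_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

Lower RMSE and MAPE indicate a closer fit, while R² approaches 1 as the model explains more of the variation in the actual values.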


#### Tools & Technologies

| **Technology** | **Use Case** |
|---------------|-------------|
| **SQL** | Data extraction, cleaning, and variance calculations. |
| **Python (Pandas, NumPy)** | Data manipulation and preprocessing. |
| **Machine Learning (Scikit-learn, Statsmodels, XGBoost, LSTMs)** | Predictive modeling for financial forecasting. |
| **Power BI** | Interactive dashboards for financial tracking and business intelligence. |



#### Project Execution Plan

1. **Data Integration & SQL Setup**  
   - Load the dataset into **SQL** and establish relationships.  
   - Ensure data consistency and integrity.  

2. **Exploratory Data Analysis (EDA)**  
   - Analyze **transaction trends, revenue fluctuations, and spending patterns**.  
   - Detect high-variance cases using **SQL queries** and **Python visualizations**.  

3. **Financial Forecasting & Segmentation Models**  
   - Train and evaluate **time-series forecasting models**.  
   - Apply **clustering techniques for customer segmentation**.  

4. **Power BI Dashboard Creation**  
   - Visualize key metrics, financial forecasts, and variance trends.  
   - Create interactive reports for decision-making.  
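
The high-variance detection mentioned in step 2 of the plan can be sketched in pandas as follows. The dataset shown in this notebook carries no budget figures, so the `budgeted` column, the category totals, and the 15% threshold are all hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical budget vs. actual totals per spending category (not from the dataset)
budget_vs_actual = pd.DataFrame({
    "category": ["Grocery", "Fuel", "Restaurants"],
    "budgeted": [1000.0, 400.0, 500.0],
    "actual": [1100.0, 300.0, 500.0],
})

# Variance percentage = (actual - budgeted) / budgeted * 100
budget_vs_actual["variance_pct"] = (
    (budget_vs_actual["actual"] - budget_vs_actual["budgeted"])
    / budget_vs_actual["budgeted"] * 100
)

# Flag categories whose absolute variance exceeds an assumed 15% threshold
THRESHOLD_PCT = 15.0
high_variance = budget_vs_actual[budget_vs_actual["variance_pct"].abs() > THRESHOLD_PCT]
print(high_variance)
```

The same calculation could be expressed in SQL with a grouped query; the pandas form is shown only because it runs without a database.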


🚀 **Next Steps:**  
- **Step 1:** Load and integrate the dataset into SQL.  
- **Step 2:** Perform **Exploratory Data Analysis (EDA)**.  
- **Step 3:** Develop **Machine Learning models** for forecasting and segmentation.  
- **Step 4:** Build **Power BI dashboard** for financial insights.  


---
**📌 Author:** [Teslim Adeyanju] 
**📅 Date:** [Project Start Date]  

### Loading Libraries


In [2]:
import pandas as pd
import numpy as np

# 📚 2. Data Integration and SQL Setup
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section covers loading the five dataset files, cleaning them with Python, and setting up the MySQL database, its tables, and the relationships between them.

</div>

The dataset consists of five interlinked files, requiring proper SQL-based relationships and Python-based preprocessing. The data integration process involves loading the dataset into SQL, establishing relationships, and ensuring data consistency and integrity. The SQL setup will enable efficient data retrieval and manipulation for exploratory data analysis (EDA) and machine learning modeling.
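
Before foreign keys are enforced in SQL, it can help to confirm in pandas that every child key has a matching parent row. The sketch below is one possible check, not part of the original notebook: `orphan_keys` is a hypothetical helper, and the two small frames are made-up stand-ins that reuse the raw column names `id` and `client_id`.

```python
import pandas as pd

def orphan_keys(child, child_key, parent, parent_key):
    """Return the sorted distinct child key values with no matching parent row."""
    missing = ~child[child_key].isin(parent[parent_key])
    return sorted(child.loc[missing, child_key].unique().tolist())

# Tiny illustrative frames; the third card points at a user that does not exist
users_demo = pd.DataFrame({"id": [825, 1746]})
cards_demo = pd.DataFrame({"id": [4524, 2731, 9999], "client_id": [825, 825, 42]})

print(orphan_keys(cards_demo, "client_id", users_demo, "id"))
```

An empty result means the child table can be loaded without violating the foreign key; any values returned would need to be fixed or dropped first.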

### Dataset Schema & Relationships  

The table below outlines the **primary keys, foreign keys, and purpose** of each file. Key names follow the SQL schema defined later in this section; the raw files use `id`, `client_id`, and `mcc` for these columns.  

| **Dataset**                | **Primary Key**     | **Foreign Key(s) & Relationships**           | **Purpose** |
|----------------------------|--------------------|----------------------------------------------|-------------|
| `transactions_data.csv`     | `transaction_id`   | `user_id`, `card_id`, `mcc_code`            | Tracks revenue, expenses, and spending details. |
| `cards_data.csv`           | `card_id`          | `user_id`                                   | Links transactions to customers via cards. |
| `users_data.csv`           | `user_id`          | None                                        | Provides user demographics and account details. |
| `mcc_codes.json`           | `mcc_code`         | None                                        | Maps merchant category codes to business types. |
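| `train_fraud_labels.json`  | `transaction_id`   | `transaction_id` (references `transactions_data.csv`) | Labels transactions as fraudulent or legitimate (`Yes`/`No` in the file, stored as a boolean `is_fraud` in SQL). |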

---

### Key Relationships & Data Flow 
1. **Transactions (`transactions_data.csv`)**:  
   - Connects **users (`users_data.csv`)** via `user_id`.  
   - Links to **card details (`cards_data.csv`)** through `card_id`.  
   - Associates with **merchant categories (`mcc_codes.json`)** using `mcc_code`.  
   - Mapped to **fraud labels (`train_fraud_labels.json`)** using `transaction_id`.  

2. **Users (`users_data.csv`)**:  
   - Each user can own multiple cards (`cards_data.csv`) via `user_id`.  

3. **Cards (`cards_data.csv`)**:  
   - Each card is linked to a specific user (`users_data.csv`) using `user_id`.  

4. **Fraud Labels (`train_fraud_labels.json`)**:  
   - Identifies whether a transaction (`transactions_data.csv`) was fraudulent (`Yes`) or legitimate (`No`); the SQL table stores this as a boolean (`1`/`0`).  

---

This structured schema ensures **efficient SQL querying, data transformation, and machine learning applications** by providing a **clear mapping of financial transactions, user demographics, merchant categories, and fraud labels**.  
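
The relationships described above translate directly into joins. The following sketch uses tiny hand-made frames (all values are illustrative, and the `_demo` names are hypothetical) to show how a transaction picks up user, card, and merchant-category attributes through `user_id`, `card_id`, and `mcc_code`.

```python
import pandas as pd

# Tiny stand-ins for the cleaned tables, using the schema's key names
transactions_demo = pd.DataFrame({
    "transaction_id": [1, 2],
    "user_id": [825, 1746],
    "card_id": [4524, 77],
    "mcc_code": [5812, 5541],
    "amount": [14.57, 80.00],
})
users_demo = pd.DataFrame({"user_id": [825, 1746], "credit_score": [787, 701]})
cards_demo = pd.DataFrame({"card_id": [4524, 77], "card_brand": ["Visa", "Mastercard"]})
mcc_demo = pd.DataFrame({
    "mcc_code": [5812, 5541],
    "category_description": ["Eating Places and Restaurants", "Service Stations"],
})

# Left joins keep every transaction; validate= raises if a lookup key is duplicated
enriched = (
    transactions_demo
    .merge(users_demo, on="user_id", how="left", validate="many_to_one")
    .merge(cards_demo, on="card_id", how="left", validate="many_to_one")
    .merge(mcc_demo, on="mcc_code", how="left", validate="many_to_one")
)
print(enriched[["transaction_id", "card_brand", "category_description"]])
```

Using `validate="many_to_one"` is a design choice that guards against accidental row duplication when a lookup table contains repeated keys.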


### Load dataset: users_data

In [3]:
# Load users data
users = pd.read_csv("users_data.csv")
users.head()

Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards
0,825,53,66,1966,11,Female,462 Rose Lane,34.15,-117.76,$29278,$59696,$127613,787,5
1,1746,53,68,1966,12,Female,3606 Federal Boulevard,40.76,-73.74,$37891,$77254,$191349,701,5
2,1718,81,67,1938,11,Female,766 Third Drive,34.02,-117.89,$22681,$33483,$196,698,5
3,708,63,63,1957,1,Female,3 Madison Street,40.71,-73.99,$163145,$249925,$202328,722,4
4,1164,43,70,1976,9,Male,9620 Valley Stream Drive,37.76,-122.44,$53797,$109687,$183855,675,1


In [4]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   int64  
 1   current_age        2000 non-null   int64  
 2   retirement_age     2000 non-null   int64  
 3   birth_year         2000 non-null   int64  
 4   birth_month        2000 non-null   int64  
 5   gender             2000 non-null   object 
 6   address            2000 non-null   object 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   object 
 10  yearly_income      2000 non-null   object 
 11  total_debt         2000 non-null   object 
 12  credit_score       2000 non-null   int64  
 13  num_credit_cards   2000 non-null   int64  
dtypes: float64(2), int64(7), object(5)
memory usage: 218.9+ KB


In [5]:
# Check for missing values
users.isnull().sum()

id                   0
current_age          0
retirement_age       0
birth_year           0
birth_month          0
gender               0
address              0
latitude             0
longitude            0
per_capita_income    0
yearly_income        0
total_debt           0
credit_score         0
num_credit_cards     0
dtype: int64

In [5]:
def clean_users_data(data):
    """Strips '$' and ',' from the currency columns and converts them to float."""
    data = data.copy()  # leave the caller's DataFrame untouched so the cell can be re-run
    for col in ['per_capita_income', 'yearly_income', 'total_debt']:
        data[col] = (
            data[col]
            .str.replace('$', '', regex=False)  # literal match; '$' is an anchor in regex mode
            .str.replace(',', '', regex=False)
            .astype(float)
        )
    return data

# Clean users data
users_data = clean_users_data(users)


In [6]:
users_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   int64  
 1   current_age        2000 non-null   int64  
 2   retirement_age     2000 non-null   int64  
 3   birth_year         2000 non-null   int64  
 4   birth_month        2000 non-null   int64  
 5   gender             2000 non-null   object 
 6   address            2000 non-null   object 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   float64
 10  yearly_income      2000 non-null   float64
 11  total_debt         2000 non-null   float64
 12  credit_score       2000 non-null   int64  
 13  num_credit_cards   2000 non-null   int64  
dtypes: float64(5), int64(7), object(2)
memory usage: 218.9+ KB


____

### Load dataset: card_data

In [7]:
# Load card data
cards = pd.read_csv("cards_data.csv")
cards.head()

Unnamed: 0,id,client_id,card_brand,card_type,card_number,expires,cvv,has_chip,num_cards_issued,credit_limit,acct_open_date,year_pin_last_changed,card_on_dark_web
0,4524,825,Visa,Debit,4344676511950444,12/2022,623,YES,2,$24295,09/2002,2008,No
1,2731,825,Visa,Debit,4956965974959986,12/2020,393,YES,2,$21968,04/2014,2014,No
2,3701,825,Visa,Debit,4582313478255491,02/2024,719,YES,2,$46414,07/2003,2004,No
3,42,825,Visa,Credit,4879494103069057,08/2024,693,NO,1,$12400,01/2003,2012,No
4,4659,825,Mastercard,Debit (Prepaid),5722874738736011,03/2009,75,YES,1,$28,09/2008,2009,No


In [8]:
cards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6146 entries, 0 to 6145
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id                     6146 non-null   int64 
 1   client_id              6146 non-null   int64 
 2   card_brand             6146 non-null   object
 3   card_type              6146 non-null   object
 4   card_number            6146 non-null   int64 
 5   expires                6146 non-null   object
 6   cvv                    6146 non-null   int64 
 7   has_chip               6146 non-null   object
 8   num_cards_issued       6146 non-null   int64 
 9   credit_limit           6146 non-null   object
 10  acct_open_date         6146 non-null   object
 11  year_pin_last_changed  6146 non-null   int64 
 12  card_on_dark_web       6146 non-null   object
dtypes: int64(6), object(7)
memory usage: 624.3+ KB


In [9]:
def clean_cards_data(data):
    """
    Cleans the cards data:
    1. Converts 'expires' and 'acct_open_date' (both 'MM/YYYY') to datetime.
    2. Cleans and converts 'credit_limit' to float.
    3. Converts 'year_pin_last_changed' to a nullable integer (it contains only years).
    4. Fills missing credit limits and PIN-change years with 0.
    """
    # Convert 'expires' to datetime (format is 'MM/YYYY', e.g. '12/2022')
    data['expires'] = pd.to_datetime(data['expires'], format='%m/%Y', errors='coerce')
    
    # Clean and convert 'credit_limit' to float if it's not already a float
    if data['credit_limit'].dtype == 'object':
        data['credit_limit'] = (data['credit_limit'].str.replace(r'[$,]', '', regex=True).astype(float))
    
    # Convert 'acct_open_date' to datetime; it is also 'MM/YYYY' (e.g. '09/2002'),
    # so a day-level format such as '%m/%d/%Y' would coerce every value to NaT
    data['acct_open_date'] = pd.to_datetime(data['acct_open_date'], format='%m/%Y', errors='coerce')
    
    # Convert 'year_pin_last_changed' to a nullable integer (it contains only years)
    data['year_pin_last_changed'] = pd.to_numeric(data['year_pin_last_changed'], errors='coerce').astype('Int64')
    
    # Handle missing values (optional, depending on your use case);
    # dates that fail to parse are already NaT and need no fill
    data['credit_limit'] = data['credit_limit'].fillna(0.0)  # Fill missing credit limits with 0.0
    data['year_pin_last_changed'] = data['year_pin_last_changed'].fillna(0)  # Fill missing years with 0
    
    return data

# Clean cards data
cards_data = clean_cards_data(cards)

In [10]:
cards_data.head()

Unnamed: 0,id,client_id,card_brand,card_type,card_number,expires,cvv,has_chip,num_cards_issued,credit_limit,acct_open_date,year_pin_last_changed,card_on_dark_web
0,4524,825,Visa,Debit,4344676511950444,2022-12-01,623,YES,2,24295.0,2002-09-01,2008,No
1,2731,825,Visa,Debit,4956965974959986,2020-12-01,393,YES,2,21968.0,2014-04-01,2014,No
2,3701,825,Visa,Debit,4582313478255491,2024-02-01,719,YES,2,46414.0,2003-07-01,2004,No
3,42,825,Visa,Credit,4879494103069057,2024-08-01,693,NO,1,12400.0,2003-01-01,2012,No
4,4659,825,Mastercard,Debit (Prepaid),5722874738736011,2009-03-01,75,YES,1,28.0,2008-09-01,2009,No


In [11]:
cards_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6146 entries, 0 to 6145
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     6146 non-null   int64         
 1   client_id              6146 non-null   int64         
 2   card_brand             6146 non-null   object        
 3   card_type              6146 non-null   object        
 4   card_number            6146 non-null   int64         
 5   expires                6146 non-null   datetime64[ns]
 6   cvv                    6146 non-null   int64         
 7   has_chip               6146 non-null   object        
 8   num_cards_issued       6146 non-null   int64         
 9   credit_limit           6146 non-null   float64       
 10  acct_open_date         6146 non-null   datetime64[ns]
 11  year_pin_last_changed  6146 non-null   Int64         
 12  card_on_dark_web       6146 non-null   object        
dtypes: 

____

### Load dataset: mcc_codes_data

In [12]:
# Load mcc codes data
mcc_codes = pd.read_json("mcc_codes.json", orient="index")
mcc_codes.head()

Unnamed: 0,0
5812,Eating Places and Restaurants
5541,Service Stations
7996,"Amusement Parks, Carnivals, Circuses"
5411,"Grocery Stores, Supermarkets"
4784,Tolls and Bridge Fees


In [13]:
mcc_codes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 109 entries, 5812 to 5733
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       109 non-null    object
dtypes: object(1)
memory usage: 1.7+ KB
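
The loaded frame keeps the codes in its index and the descriptions in a column named `0`, while the SQL schema in this notebook expects `mcc_code` and `category_description`. A minimal reshaping sketch, assuming that target layout and using a two-row stand-in (`mcc_demo` and `mcc_table` are hypothetical names):

```python
import pandas as pd

# Stand-in for the frame produced by pd.read_json(..., orient="index"):
# codes in the index and a single column named 0
mcc_demo = pd.DataFrame(
    {0: ["Eating Places and Restaurants", "Service Stations"]},
    index=[5812, 5541],
)

mcc_table = (
    mcc_demo
    .rename(columns={0: "category_description"})  # give the value column a meaningful name
    .rename_axis("mcc_code")                      # name the index so reset_index yields 'mcc_code'
    .reset_index()
)
mcc_table["mcc_code"] = mcc_table["mcc_code"].astype(int)
print(mcc_table)
```

After this step each row maps one-to-one onto a row of the `mcc_codes` table.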


___

### Load dataset: Transaction_data

In [14]:
# load transactions data
transactions = pd.read_csv("transactions_data.csv", parse_dates=["date"]) 
transactions.head()

Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors
0,7475327,2010-01-01 00:01:00,1556,2972,$-77.00,Swipe Transaction,59935,Beulah,ND,58523.0,5499,
1,7475328,2010-01-01 00:02:00,561,4575,$14.57,Swipe Transaction,67570,Bettendorf,IA,52722.0,5311,
2,7475329,2010-01-01 00:02:00,1129,102,$80.00,Swipe Transaction,27092,Vista,CA,92084.0,4829,
3,7475331,2010-01-01 00:05:00,430,2860,$200.00,Swipe Transaction,27092,Crown Point,IN,46307.0,4829,
4,7475332,2010-01-01 00:06:00,848,3915,$46.41,Swipe Transaction,13051,Harwood,MD,20776.0,5813,


In [15]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13305915 entries, 0 to 13305914
Data columns (total 12 columns):
 #   Column          Dtype         
---  ------          -----         
 0   id              int64         
 1   date            datetime64[ns]
 2   client_id       int64         
 3   card_id         int64         
 4   amount          object        
 5   use_chip        object        
 6   merchant_id     int64         
 7   merchant_city   object        
 8   merchant_state  object        
 9   zip             float64       
 10  mcc             int64         
 11  errors          object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(5)
memory usage: 1.2+ GB


In [16]:
import pandas as pd

def clean_transactions_data(transactions_df):
    """
    Cleans the transactions data using vectorized operations:
    1. Removes dollar signs and commas from the `amount` column and converts it to float.
    2. Converts the `date` column to a datetime object.
    3. Handles missing values in `merchant_state`, `zip`, and `errors` columns.
    4. Returns a cleaned DataFrame.
    """
    # Clean the `amount` column
    transactions_df['amount'] = (
        transactions_df['amount']
        .str.replace('$', '', regex=False)  # Remove dollar signs
        .str.replace(',', '', regex=False)  # Remove commas
        .astype(float)  # Convert to float
    )
    
    # Convert the `date` column to datetime
    transactions_df['date'] = pd.to_datetime(transactions_df['date'])
    
    # Handle missing values in `merchant_state`, `zip`, and `errors`
    transactions_df['merchant_state'] = transactions_df['merchant_state'].fillna('Unknown')
    transactions_df['zip'] = transactions_df['zip'].fillna('Unknown')
    transactions_df['errors'] = transactions_df['errors'].fillna('No Errors')
    
    # Convert specific columns to integers
    int_columns = ['id', 'client_id', 'card_id', 'merchant_id', 'mcc']
    transactions_df[int_columns] = transactions_df[int_columns].astype(int)
    
    # Rename columns to match the database schema (optional)
    transactions_df = transactions_df.rename(columns={
        'id': 'transaction_id',
        'client_id': 'user_id',
        'date': 'transaction_date',
        'mcc': 'mcc_code'
    })
    
    return transactions_df

# Clean transactions data
transactions_data = clean_transactions_data(transactions)

In [17]:
transactions_data.head()

Unnamed: 0,transaction_id,transaction_date,user_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc_code,errors
0,7475327,2010-01-01 00:01:00,1556,2972,-77.0,Swipe Transaction,59935,Beulah,ND,58523.0,5499,No Errors
1,7475328,2010-01-01 00:02:00,561,4575,14.57,Swipe Transaction,67570,Bettendorf,IA,52722.0,5311,No Errors
2,7475329,2010-01-01 00:02:00,1129,102,80.0,Swipe Transaction,27092,Vista,CA,92084.0,4829,No Errors
3,7475331,2010-01-01 00:05:00,430,2860,200.0,Swipe Transaction,27092,Crown Point,IN,46307.0,4829,No Errors
4,7475332,2010-01-01 00:06:00,848,3915,46.41,Swipe Transaction,13051,Harwood,MD,20776.0,5813,No Errors


In [18]:
transactions_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13305915 entries, 0 to 13305914
Data columns (total 12 columns):
 #   Column            Dtype         
---  ------            -----         
 0   transaction_id    int64         
 1   transaction_date  datetime64[ns]
 2   user_id           int64         
 3   card_id           int64         
 4   amount            float64       
 5   use_chip          object        
 6   merchant_id       int64         
 7   merchant_city     object        
 8   merchant_state    object        
 9   zip               object        
 10  mcc_code          int64         
 11  errors            object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(5)
memory usage: 1.2+ GB
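
After cleaning, `zip` holds a mix of floats such as `58523.0` and the string `'Unknown'`, while the SQL schema stores it as text. An alternative treatment, sketched below under the assumption that five-character strings with leading zeros are wanted, converts the floats to text and leaves missing values missing. `normalize_zip` is a hypothetical helper and the sample values are illustrative.

```python
import pandas as pd

def normalize_zip(zip_series):
    """Convert float ZIP codes to 5-character strings; missing values stay missing."""
    as_int = zip_series.astype("Int64")             # nullable integer keeps <NA> for missing ZIPs
    as_text = as_int.astype("string").str.zfill(5)  # restore leading zeros, e.g. 2776 -> "02776"
    return as_text

zips = pd.Series([58523.0, 2776.0, None])
normalized = normalize_zip(zips)
print(normalized.tolist())
```

Keeping missing ZIP codes as nulls rather than a placeholder string lets SQL distinguish "no ZIP recorded" with `IS NULL`; whether that is preferable depends on how the column is queried downstream.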


____

### Load dataset: Train_fraud_data

In [19]:
# Load fraud labels data
fraud_labels = pd.read_json("train_fraud_labels.json")
fraud_labels.head()

Unnamed: 0,target
10649266,No
23410063,No
9316588,No
12478022,No
9558530,No


In [20]:
fraud_labels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8914963 entries, 10649266 to 15151926
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   target  object
dtypes: object(1)
memory usage: 136.0+ MB
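
The labels arrive as `Yes`/`No` strings indexed by transaction id, whereas the `fraud_labels` table defined later in this notebook has a `transaction_id` key and a boolean `is_fraud` column. One way to bridge the two is sketched here; the ids and labels in `fraud_demo` are made up for illustration.

```python
import pandas as pd

# Stand-in for the loaded labels: transaction ids in the index, "Yes"/"No" in `target`
fraud_demo = pd.DataFrame({"target": ["No", "Yes", "No"]}, index=[1, 2, 3])

fraud_table = fraud_demo.rename_axis("transaction_id").reset_index()

# Map the text labels to booleans; an unexpected label becomes missing rather than False
fraud_table["is_fraud"] = fraud_table["target"].map({"Yes": True, "No": False})
fraud_table = fraud_table.drop(columns="target")
print(fraud_table)
```

Mapping through an explicit dictionary, instead of comparing with `== "Yes"`, makes unexpected label values visible as nulls.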


### SQL Database

In [21]:
import os
import mysql.connector
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Retrieve MySQL credentials from environment variables
MYSQL_USER = os.getenv("MYSQL_USERNAME")
MYSQL_PASSWORD = os.getenv("MYSQL_PASSWORD")
MYSQL_HOST = os.getenv("MYSQL_HOST")

# Validate credentials
if not MYSQL_USER or not MYSQL_PASSWORD or not MYSQL_HOST:
    raise ValueError("Missing MySQL credentials. Please check your .env file.")
else:
    print("MySQL credentials loaded successfully.")

MySQL credentials loaded successfully.


### Connect to SQL Database

In [22]:
# Database connection function
def connect_to_db():
    return mysql.connector.connect(
        host=MYSQL_HOST,
        user=MYSQL_USER,
        password=MYSQL_PASSWORD
    )

# Database creation function
def create_database():
    conn = connect_to_db()
    cursor = conn.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS financial_transactions_db")
    print("Database 'financial_transactions_db' created successfully and connected.")
    conn.close()

# Connect to the database function
def connect_to_db_with_db():
    return mysql.connector.connect(
        host=MYSQL_HOST,
        user=MYSQL_USER,
        password=MYSQL_PASSWORD,
        database="financial_transactions_db"
    )

# Call the functions
create_database()
db = connect_to_db_with_db()

Database 'financial_transactions_db' created successfully and connected.
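
The functions in this notebook open a connection, do their work, commit, and close. That pattern can be wrapped in a context manager so the connection is released even when a statement fails. The sketch below is an assumption about how this might be organised, not code from the project: `managed_connection` and `FakeConnection` are hypothetical names, and the fake class exists only so the example runs without a MySQL server.

```python
from contextlib import contextmanager

@contextmanager
def managed_connection(connect):
    """Open a connection via `connect()`, commit on success, roll back on error, always close."""
    conn = connect()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

class FakeConnection:
    """Minimal stand-in that records calls, so the sketch runs without a database."""
    def __init__(self):
        self.calls = []
    def commit(self):
        self.calls.append("commit")
    def rollback(self):
        self.calls.append("rollback")
    def close(self):
        self.calls.append("close")

with managed_connection(FakeConnection) as conn:
    pass  # table creation or inserts would go here
print(conn.calls)
```

With the real driver, the factory passed in would be `connect_to_db_with_db`, since any zero-argument callable that returns a connection with `commit`, `rollback`, and `close` fits.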


### Create Tables

1. `users` table
2. `cards` table
3. `mcc_codes` table
4. `transactions` table
5. `fraud_labels` table
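
The order matters because a table can only be created after the tables its foreign keys reference. As a small illustration, the dependencies declared in the `CREATE TABLE` statements of this section can be fed to the standard library's `graphlib` (Python 3.9+) to derive a valid order. The `dependencies` mapping is transcribed by hand from those statements.

```python
from graphlib import TopologicalSorter

# Each table maps to the tables its foreign keys reference
dependencies = {
    "users": set(),
    "mcc_codes": set(),
    "cards": {"users"},
    "transactions": {"users", "cards", "mcc_codes"},
    "fraud_labels": {"transactions"},
}

# static_order() yields every table after all of the tables it depends on
creation_order = list(TopologicalSorter(dependencies).static_order())
print(creation_order)
```

Several orders are valid; the one used in this notebook is among them, and dropping the tables would require the reverse order.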

In [23]:
def create_users_table():
    """
    Creates the 'users' table in the database with the correct schema.
    """
    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id BIGINT UNSIGNED PRIMARY KEY, -- Ensures consistency across foreign keys
        current_age INT,
        retirement_age INT,
        birth_year INT,
        birth_month INT,
        gender VARCHAR(10),
        address VARCHAR(255),
        latitude DOUBLE PRECISION,         
        longitude DOUBLE PRECISION,        
        per_capita_income DOUBLE PRECISION, 
        yearly_income DOUBLE PRECISION,    
        total_debt DOUBLE PRECISION,       
        credit_score INT,
        num_credit_cards INT
    );
    """)
    conn.commit()
    conn.close()
    print("✅ Table 'users' created successfully.")


In [24]:
def create_cards_table():
    """
    Creates the 'cards' table in the database with the correct schema.
    """
    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS cards (
        card_id BIGINT UNSIGNED PRIMARY KEY,
        user_id BIGINT UNSIGNED NOT NULL, 
        card_brand VARCHAR(50),
        card_type VARCHAR(50),
        card_number BIGINT,
        expires DATE,
        cvv INT,
        has_chip VARCHAR(10),
        num_cards_issued INT,
        credit_limit DOUBLE PRECISION, 
        acct_open_date DATE,
        year_pin_last_changed INT,
        card_on_dark_web VARCHAR(10),
        FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE
    );
    """)
    conn.commit()
    conn.close()
    print("✅ Table 'cards' created successfully.")


In [25]:
def create_mcc_codes_table():
    """
    Creates the 'mcc_codes' table in the database with merchant category codes.
    """
    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS mcc_codes (
        mcc_code BIGINT UNSIGNED PRIMARY KEY,
        category_description VARCHAR(255)
    );
    """)
    conn.commit()
    conn.close()
    print("✅ Table 'mcc_codes' created successfully.")


In [26]:
def create_transactions_table():
    """
    Creates the 'transactions' table in the database with the correct schema.
    """
    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        transaction_id BIGINT UNSIGNED PRIMARY KEY,  
        transaction_date TIMESTAMP,  
        user_id BIGINT UNSIGNED NOT NULL,  
        card_id BIGINT UNSIGNED NOT NULL,  
        amount DOUBLE PRECISION,  
        use_chip VARCHAR(50),  
        merchant_id BIGINT UNSIGNED,  
        merchant_city TEXT,  
        merchant_state VARCHAR(50),  
        zip VARCHAR(20),  
        mcc_code BIGINT UNSIGNED,  
        errors TEXT,  
        FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE,  
        FOREIGN KEY (card_id) REFERENCES cards(card_id) ON DELETE CASCADE,  
        FOREIGN KEY (mcc_code) REFERENCES mcc_codes(mcc_code) ON DELETE SET NULL
    );
    """)
    conn.commit()
    conn.close()
    print("✅ Table 'transactions' created successfully.")


In [27]:
def create_fraud_table():
    """
    Creates the 'fraud_labels' table in the database.
    """
    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS fraud_labels (
        transaction_id BIGINT UNSIGNED PRIMARY KEY,  
        is_fraud BOOLEAN,  
        FOREIGN KEY (transaction_id) REFERENCES transactions(transaction_id) ON DELETE CASCADE
    );
    """)
    conn.commit()
    conn.close()
    print("✅ Table 'fraud_labels' created successfully.")


In [28]:
# Create tables
create_users_table()
create_cards_table()
create_mcc_codes_table()
create_transactions_table()
create_fraud_table()

✅ Table 'users' created successfully.
✅ Table 'cards' created successfully.
✅ Table 'mcc_codes' created successfully.
✅ Table 'transactions' created successfully.
✅ Table 'fraud_labels' created successfully.


In [29]:
# Show tables in the financial_transactions_db database
cursor = db.cursor()
cursor.execute("SHOW TABLES IN financial_transactions_db")
tables = cursor.fetchall()
for table in tables:
    print(table)


('cards',)
('fraud_labels',)
('mcc_codes',)
('transactions',)
('users',)
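
To see what the foreign keys and `ON DELETE CASCADE` clauses do in practice, the following sketch recreates a reduced `users`/`cards` pair in an in-memory SQLite database. SQLite from the standard library stands in for MySQL only so the example runs without a server; column types are simplified, and the behaviour in MySQL should be analogous but is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE cards (
        card_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE
    );
""")
conn.execute("INSERT INTO users VALUES (825)")
conn.executemany("INSERT INTO cards VALUES (?, ?)", [(4524, 825), (2731, 825)])

# A card pointing at a non-existent user is rejected
try:
    conn.execute("INSERT INTO cards VALUES (1, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting the user removes that user's cards as well
conn.execute("DELETE FROM users WHERE user_id = 825")
remaining_cards = conn.execute("SELECT COUNT(*) FROM cards").fetchone()[0]
print(rejected, remaining_cards)
conn.close()
```

This is also why parent tables must be populated before their children: inserting cards before users would fail in the same way as the rejected row above.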


### Insert Data into: User Table

In [37]:
users_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   int64  
 1   current_age        2000 non-null   int64  
 2   retirement_age     2000 non-null   int64  
 3   birth_year         2000 non-null   int64  
 4   birth_month        2000 non-null   int64  
 5   gender             2000 non-null   object 
 6   address            2000 non-null   object 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   float64
 10  yearly_income      2000 non-null   float64
 11  total_debt         2000 non-null   float64
 12  credit_score       2000 non-null   int64  
 13  num_credit_cards   2000 non-null   int64  
dtypes: float64(5), int64(7), object(2)
memory usage: 218.9+ KB



In [None]:
# Function to insert users' data into the table efficiently
def insert_users_data(users_df):
    """
    Inserts users' data from a DataFrame into the users table efficiently.
    """
    # Ensure column alignment (rename returns a copy, so the caller's DataFrame is not modified)
    users_df = users_df.rename(columns={'id': 'user_id'})

    # Connect to the database
    conn = connect_to_db_with_db()
    cursor = conn.cursor()

    # Prepare SQL query
    sql_query = """
        INSERT INTO users (user_id, 
                           current_age, 
                           retirement_age, 
                           birth_year, 
                           birth_month, 
                           gender, 
                           address,
                           latitude, 
                           longitude, 
                           per_capita_income, 
                           yearly_income, 
                           total_debt, 
                           credit_score, 
                           num_credit_cards)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """

    # Convert DataFrame to list of tuples (for executemany)
    data_to_insert = [
        (
            int(row['user_id']), 
            int(row['current_age']), 
            int(row['retirement_age']),
            int(row['birth_year']), 
            int(row['birth_month']), 
            row['gender'],
            row['address'], 
            float(row['latitude']), 
            float(row['longitude']),
            float(row['per_capita_income']), 
            float(row['yearly_income']), 
            float(row['total_debt']),
            int(row['credit_score']), 
            int(row['num_credit_cards'])
        )
        for _, row in users_df.iterrows()
    ]

    # Execute batch insertion for better performance;
    # close the connection even if the insert or commit fails
    try:
        cursor.executemany(sql_query, data_to_insert)
        conn.commit()
    finally:
        conn.close()

    print("✅ Users data inserted successfully.")

# Call the function
insert_users_data(users_data)


✅ Users data inserted successfully.
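
The explicit `int(...)` and `float(...)` casts above exist to hand plain Python values to the database driver. As a minimal sketch of the same idea, the hypothetical helper `to_native_records` below builds the tuples with `itertuples`, which is generally faster than `iterrows`, and unwraps any NumPy scalar it meets. The helper name, `sample_users`, and its values are illustrative assumptions and not part of the notebook.

```python
import pandas as pd


def to_native_records(df, columns):
    """Return the rows of df[columns] as tuples of plain Python values."""
    records = []
    for row in df[columns].itertuples(index=False, name=None):
        # NumPy scalars expose .item(), which returns the equivalent Python object;
        # plain Python values such as str are passed through unchanged.
        records.append(tuple(v.item() if hasattr(v, "item") else v for v in row))
    return records


# Toy frame standing in for users_data (values are illustrative only)
sample_users = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["Female", "Male"],
    "yearly_income": [59696.0, 77254.0],
    "credit_score": [787, 701],
})

records = to_native_records(
    sample_users, ["user_id", "gender", "yearly_income", "credit_score"]
)
print(records)
```

The resulting list can be passed to `cursor.executemany(sql_query, records)` in the same way as `data_to_insert`, provided the column order matches the `INSERT` statement.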


### Insert Data into: Cards Table

In [None]:
cards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6146 entries, 0 to 6145
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id                     6146 non-null   int64 
 1   client_id              6146 non-null   int64 
 2   card_brand             6146 non-null   object
 3   card_type              6146 non-null   object
 4   card_number            6146 non-null   int64 
 5   expires                6146 non-null   object
 6   cvv                    6146 non-null   int64 
 7   has_chip               6146 non-null   object
 8   num_cards_issued       6146 non-null   int64 
 9   credit_limit           6146 non-null   object
 10  acct_open_date         6146 non-null   object
 11  year_pin_last_changed  6146 non-null   int64 
 12  card_on_dark_web       6146 non-null   object
dtypes: int64(6), object(7)
memory usage: 624.3+ KB


In [None]:
# Function to insert cards' data into the table
def insert_cards_data(cards_df):
    # Connect to the database
    conn = connect_to_db_with_db()
    cursor = conn.cursor()

    try:
        # Iterate over each row in the DataFrame and insert it into the table
        for _, row in cards_df.iterrows():
            # Remove dollar signs and thousands separators, then convert credit_limit to float
            credit_limit = float(row['credit_limit'].replace('$', '').replace(',', ''))

            cursor.execute("""
                INSERT INTO cards (card_id,
                                   user_id,
                                   card_brand,
                                   card_type,
                                   card_number,
                                   expires,
                                   cvv,
                                   has_chip,
                                   num_cards_issued,
                                   credit_limit,
                                   acct_open_date,
                                   year_pin_last_changed,
                                   card_on_dark_web)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            """,
            (
                int(row['id']),
                int(row['client_id']),
                row['card_brand'],
                row['card_type'],
                int(row['card_number']),
                row['expires'],
                int(row['cvv']),
                row['has_chip'],
                int(row['num_cards_issued']),
                credit_limit,
                row['acct_open_date'],
                int(row['year_pin_last_changed']),
                row['card_on_dark_web']
            ))

        # Commit the transaction
        conn.commit()
    finally:
        # Close the connection even if an insert fails
        conn.close()

    print("✅ Cards data inserted successfully.")

# Call the function
insert_cards_data(cards)

✅ Cards data inserted successfully.
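
The cards loop issues one `INSERT` per row and cleans `credit_limit` one value at a time. A possible alternative, sketched here under assumptions, is to clean the column in one vectorized pass and build all tuples up front so they can be sent with `executemany`, as the users function does. The helper `prepare_cards_records`, the four-column `sample_cards` frame, and its values are hypothetical and only illustrate the shape of the data.

```python
import pandas as pd


def prepare_cards_records(cards_df):
    """Build insert-ready tuples from a raw cards DataFrame."""
    prepared = cards_df.copy()  # leave the caller's DataFrame untouched
    # Strip "$" and "," in one vectorized pass, then convert to float
    prepared["credit_limit"] = (
        prepared["credit_limit"]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype(float)
    )
    # One tuple per row, in the DataFrame's column order
    return list(prepared.itertuples(index=False, name=None))


# Reduced toy frame (illustrative values, subset of the real columns)
sample_cards = pd.DataFrame({
    "id": [4524, 2731],
    "client_id": [825, 825],
    "card_brand": ["Visa", "Visa"],
    "credit_limit": ["$24,295", "$21,968"],
})

card_records = prepare_cards_records(sample_cards)
print(card_records)
```

With the full 13-column frame, the tuples line up with the `INSERT INTO cards (...)` column list only if the DataFrame columns are in the same order as that list.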


### Insert Data into: MCC Codes Table

In [None]:
mcc_codes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 109 entries, 5812 to 5733
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       109 non-null    object
dtypes: object(1)
memory usage: 1.7+ KB


In [None]:
# Function to insert MCC codes data into the table
def insert_mcc_codes_data(mcc_codes_df):
    # Connect to the database
    conn = connect_to_db_with_db()
    cursor = conn.cursor()

    try:
        # The DataFrame index holds the MCC code; its single column holds the description
        for mcc_code, row in mcc_codes_df.iterrows():
            cursor.execute("""
                INSERT INTO mcc_codes (mcc_code,
                                       category_description)
                VALUES (%s, %s)
            """,
            (
                mcc_code,     # Assuming the index is the `mcc_code`
                row.iloc[0]   # Assuming the first column is `category_description`
            ))

        # Commit the transaction
        conn.commit()
    finally:
        # Close the connection even if an insert fails
        conn.close()

    print("✅ MCC Codes data inserted successfully.")

# Call the function
insert_mcc_codes_data(mcc_codes)

✅ MCC Codes data inserted successfully.
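
Because `mcc_codes` keeps the code in its index and the description in a single column, the `(mcc_code, category_description)` pairs can also be built without `iterrows`. The sketch below assumes the index values are numeric strings and that casting them to `int` suits the target column; `mcc_records` and the two sample rows are hypothetical.

```python
import pandas as pd


def mcc_records(mcc_codes_df):
    """Return (mcc_code, category_description) tuples from an index-keyed frame."""
    descriptions = mcc_codes_df.iloc[:, 0]  # first (and only) column, by position
    return [(int(code), str(text)) for code, text in descriptions.items()]


# Toy frame with the same layout as mcc_codes (rows are illustrative)
sample_mcc = pd.DataFrame(
    {0: ["Eating Places and Restaurants", "Service Stations"]},
    index=["5812", "5541"],
)

pairs = mcc_records(sample_mcc)
print(pairs)
```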


### Insert Data into: Transactions Table

In [None]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13305915 entries, 0 to 13305914
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   id              int64  
 1   date            object 
 2   client_id       int64  
 3   card_id         int64  
 4   amount          object 
 5   use_chip        object 
 6   merchant_id     int64  
 7   merchant_city   object 
 8   merchant_state  object 
 9   zip             float64
 10  mcc             int64  
 11  errors          object 
dtypes: float64(1), int64(5), object(6)
memory usage: 1.2+ GB


In [None]:
transactions.head()

| | id | date | client_id | card_id | amount | use_chip | merchant_id | merchant_city | merchant_state | zip | mcc | errors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7475327 | 2010-01-01 00:01:00 | 1556 | 2972 | \$-77.00 | Swipe Transaction | 59935 | Beulah | ND | 58523.0 | 5499 | |
| 1 | 7475328 | 2010-01-01 00:02:00 | 561 | 4575 | \$14.57 | Swipe Transaction | 67570 | Bettendorf | IA | 52722.0 | 5311 | |
| 2 | 7475329 | 2010-01-01 00:02:00 | 1129 | 102 | \$80.00 | Swipe Transaction | 27092 | Vista | CA | 92084.0 | 4829 | |
| 3 | 7475331 | 2010-01-01 00:05:00 | 430 | 2860 | \$200.00 | Swipe Transaction | 27092 | Crown Point | IN | 46307.0 | 4829 | |
| 4 | 7475332 | 2010-01-01 00:06:00 | 848 | 3915 | \$46.41 | Swipe Transaction | 13051 | Harwood | MD | 20776.0 | 5813 | |


In [None]:
transactions.isnull().sum()

id                       0
date                     0
client_id                0
card_id                  0
amount                   0
use_chip                 0
merchant_id              0
merchant_city            0
merchant_state     1563700
zip                1652706
mcc                      0
errors            13094522
dtype: int64
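
Raw counts are hard to judge against 13,305,915 rows. Dividing the counts above by the row total gives roughly 11.75% missing for `merchant_state`, 12.42% for `zip`, and 98.41% for `errors`. A small helper along these lines reports both figures side by side; `missing_summary` and the toy `sample` frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with the same kind of gaps (values are illustrative)
sample = pd.DataFrame({
    "merchant_state": ["ND", None, "CA", "IN"],
    "zip": [58523.0, np.nan, np.nan, 46307.0],
    "mcc": [5499, 5311, 4829, 4829],
})

summary = missing_summary(sample)
print(summary)
```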

In [None]:
import pandas as pd

def clean_transactions_data(transactions_df):
    """
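    Cleans the transactions data using vectorized operations:
    1. Removes dollar signs and commas from the `amount` column and converts it to float.
    2. Converts the `date` column to a datetime object.
    3. Fills missing values in `merchant_state`, `zip`, and `errors` with placeholder labels.
    4. Casts the ID and MCC columns to integers.
    5. Renames columns to match the database schema.

    Returns the cleaned DataFrame. The cleaning steps also modify the columns of the
    input DataFrame, so run this once on freshly loaded data.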
    """
    # Clean the `amount` column
    transactions_df['amount'] = (
        transactions_df['amount']
        .str.replace('$', '', regex=False)  # Remove dollar signs
        .str.replace(',', '', regex=False)  # Remove commas
        .astype(float)  # Convert to float
    )
    
    # Convert the `date` column to datetime
    transactions_df['date'] = pd.to_datetime(transactions_df['date'])
    
    # Handle missing values in `merchant_state`, `zip`, and `errors`
    transactions_df['merchant_state'] = transactions_df['merchant_state'].fillna('Unknown')
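    # Note: `zip` is float64, so filling with a string turns it into an object column
    # that mixes floats and strings
    transactions_df['zip'] = transactions_df['zip'].fillna('Unknown')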
    transactions_df['errors'] = transactions_df['errors'].fillna('No Errors')
    
    # Convert specific columns to integers
    int_columns = ['id', 'client_id', 'card_id', 'merchant_id', 'mcc']
    transactions_df[int_columns] = transactions_df[int_columns].astype(int)
    
    # Rename columns to match the database schema (optional)
    transactions_df = transactions_df.rename(columns={
        'id': 'transaction_id',
        'client_id': 'user_id',
        'date': 'transaction_date',
        'mcc': 'mcc_code'
    })
    
    return transactions_df
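
The `.astype(float)` step in the function above raises an error as soon as one `amount` value cannot be parsed. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns such values into `NaN` so they can be inspected instead of stopping the run. The helper `parse_amounts` and the sample values, including the deliberately bad `"n/a"` entry, are assumptions made for this sketch.

```python
import pandas as pd


def parse_amounts(amounts):
    """Parse currency strings to floats and report values that failed to parse."""
    cleaned = (
        amounts.str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
    )
    # Unparseable strings become NaN instead of raising an error
    parsed = pd.to_numeric(cleaned, errors="coerce")
    # Rows that were present in the input but could not be converted
    bad_rows = amounts[parsed.isna() & amounts.notna()]
    return parsed, bad_rows


sample_amounts = pd.Series(["$-77.00", "$14.57", "$1,200.00", "n/a"])
parsed, bad_rows = parse_amounts(sample_amounts)
print(parsed)
print(bad_rows)
```

Negative amounts such as `$-77.00` parse correctly because only the dollar sign and commas are stripped.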

In [None]:
# Clean the transactions data
cleaned_transactions = clean_transactions_data(transactions)

In [None]:
cleaned_transactions.head()

| | transaction_id | transaction_date | user_id | card_id | amount | use_chip | merchant_id | merchant_city | merchant_state | zip | mcc_code | errors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7475327 | 2010-01-01 00:01:00 | 1556 | 2972 | -77.0 | Swipe Transaction | 59935 | Beulah | ND | 58523.0 | 5499 | No Errors |
| 1 | 7475328 | 2010-01-01 00:02:00 | 561 | 4575 | 14.57 | Swipe Transaction | 67570 | Bettendorf | IA | 52722.0 | 5311 | No Errors |
| 2 | 7475329 | 2010-01-01 00:02:00 | 1129 | 102 | 80.0 | Swipe Transaction | 27092 | Vista | CA | 92084.0 | 4829 | No Errors |
| 3 | 7475331 | 2010-01-01 00:05:00 | 430 | 2860 | 200.0 | Swipe Transaction | 27092 | Crown Point | IN | 46307.0 | 4829 | No Errors |
| 4 | 7475332 | 2010-01-01 00:06:00 | 848 | 3915 | 46.41 | Swipe Transaction | 13051 | Harwood | MD | 20776.0 | 5813 | No Errors |
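
At roughly 13.3 million rows, sending the cleaned transactions in a single `executemany` call may strain memory or hit server packet limits. One common approach is to insert in fixed-size batches and commit after each one. The sketch below shows that pattern with a hypothetical `insert_in_batches` helper; it runs against an in-memory SQLite database purely so the example is self-contained, and SQLite uses `?` placeholders where the notebook's queries use `%s`. The batch size is an arbitrary assumption to tune.

```python
import sqlite3


def insert_in_batches(conn, sql, records, batch_size=10_000):
    """Insert records with executemany, committing after each batch."""
    cursor = conn.cursor()
    inserted = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        cursor.executemany(sql, batch)
        conn.commit()  # keep each transaction small
        inserted += len(batch)
    return inserted


# Demo on an in-memory SQLite database (stand-in for the real connection)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, amount REAL)"
)
rows = [(7475327, -77.0), (7475328, 14.57), (7475329, 80.0)]
total = insert_in_batches(
    conn, "INSERT INTO transactions VALUES (?, ?)", rows, batch_size=2
)
count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
conn.close()
print(total, count)
```

Committing per batch means a failure partway through leaves earlier batches in the table, so a rerun needs to skip or replace rows that were already loaded.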
