# The Heart of Investment

In this analysis, we generate two fictional datasets, one of investments and one of clients. Our goal is to study asset classes.

In [49]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta
from faker import Faker

In [80]:
# Initialize Faker and seed every random number generator for reproducibility
fake = Faker()
Faker.seed(42)
random.seed(42)
np.random.seed(42)

# Parameters
client_ids = [f"C{str(i).zfill(5)}" for i in range(1, 5001)]
asset_classes = ['Stocks', 'Bonds', 'Real Estate', 'Crypto', 'Mutual Funds']

# Helper: draw a random date between Jan 1 of start_year and Jan 1 of end_year (both inclusive)
def generate_random_date(start_year=2008, end_year=2025):
    start = datetime(start_year, 1, 1)
    end = datetime(end_year, 1, 1)
    return start + timedelta(days=random.randint(0, (end - start).days))

data = []
client_data = []
for unique_client_id in client_ids:
    # Each investment goes to a randomly drawn client, so a client may hold several investments or none
    client_id = random.choice(client_ids)
    date = generate_random_date()
    asset_class = random.choice(asset_classes)
    investment_amount = round(np.random.uniform(1000, 5000000), 2)
    # Mean 8%, SD 50%
    return_pct = round(np.random.normal(loc=0.08, scale=0.50), 4)
    investment_value = round(investment_amount * (1 + return_pct), 2)
    # Funds the client has available to invest in addition to current assets (currently unused)
    #external_assets = round(np.random.uniform(5000, 5000000), 2)
    
    data.append({
        'client_id': client_id,
        'investment_date': date.strftime('%Y-%m-%d'),
        'asset_class': asset_class,
        'investment_amount': investment_amount,
        'return_pct': return_pct,
        'investment_value': investment_value
    })

    name = fake.name()
    address = fake.street_address()
    city = fake.city()
    state = fake.state()
    country = fake.country()
    phone = fake.phone_number()
    latitude = round(fake.latitude(), 6)
    longitude = round(fake.longitude(), 6)

    client_data.append({
        # Use the loop's own ID so that every client appears exactly once in the client table
        'client_id': unique_client_id,
        'name': name,
        'address': address,
        'city': city,
        'state': state,
        'country': country,
        'phone_number': phone,
        'latitude': latitude,
        'longitude': longitude
    })

# Create the investments DataFrame
df_investments = pd.DataFrame(data)
# Optional: Save to CSV
df_investments.to_csv("synthetic_investment_data.csv", index=False)


# Create the clients DataFrame
df_clients = pd.DataFrame(client_data)
# Optional: Save to CSV
df_clients.to_csv("synthetic_clients_data.csv", index=False)


df_investments.head()

   client_id  investment_date   asset_class  investment_amount  return_pct  investment_value
0     C00913       2008-07-23   Real Estate         1873326.05     -0.4759         981810.18
1     C02007       2013-01-02         Bonds          780816.61      0.2395         967822.19
2     C00840       2023-03-06  Mutual Funds          291359.98      0.2195         355313.50
3     C00713       2021-03-30        Crypto         3540654.82      0.5853        5613000.09
4     C00261       2008-09-01        Stocks          103901.89     -0.2104          82040.93
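As a quick sanity check, the generation code derives `investment_value` as `investment_amount * (1 + return_pct)`, rounded to two decimals. The short sketch below recomputes that for row 0 of the output above; the numbers are copied from that row, nothing else is assumed.

```python
# Values copied from row 0 of the df_investments.head() output.
investment_amount = 1873326.05
return_pct = -0.4759

# Same formula as in the generation loop: amount grown (or shrunk) by the return.
investment_value = round(investment_amount * (1 + return_pct), 2)
print(investment_value)
```

The result agrees with the `investment_value` of 981810.18 shown for that row.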


In [82]:
df_investments.info()
df_clients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   client_id          5000 non-null   object 
 1   investment_date    5000 non-null   object 
 2   asset_class        5000 non-null   object 
 3   investment_amount  5000 non-null   float64
 4   return_pct         5000 non-null   float64
 5   investment_value   5000 non-null   float64
dtypes: float64(3), object(3)
memory usage: 234.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   client_id     5000 non-null   object
 1   name          5000 non-null   object
 2   address       5000 non-null   object
 3   city          5000 non-null   object
 4   state         5000 non-null   object
 5   country       5000 non-null   object
 6   phone_num
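The `info()` output lists `investment_date` and `asset_class` as `object` columns. Before any time-based or per-class analysis, it is probably worth converting them. The sketch below shows one way to do that on a tiny stand-in frame (the name `investments` and its two rows, copied from the head output, are only for illustration).

```python
import pandas as pd

# Small stand-in for df_investments; values taken from the first two rows shown earlier.
investments = pd.DataFrame({
    "investment_date": ["2008-07-23", "2013-01-02"],
    "asset_class": ["Real Estate", "Bonds"],
})

# Parse the ISO-formatted strings into proper datetime64 values.
investments["investment_date"] = pd.to_datetime(
    investments["investment_date"], format="%Y-%m-%d"
)

# A column with only a handful of distinct labels fits the category dtype well.
investments["asset_class"] = investments["asset_class"].astype("category")

print(investments.dtypes)
```

With a datetime column, accessors such as `.dt.year` become available for grouping by period.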

Client-Level Analyses (Demographic Insights)  
	1.	Client Location Heatmap  
        * Plot latitude & longitude to visualize geographic concentration of investors.
	2.	Top Cities/States/Countries  
        * Rank locations by number of clients or total investment.
	3.	Client Clustering  
        * Use K-Means or DBSCAN on lat/long to group clients by region.
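
The first two client-level ideas can be sketched with plain pandas and NumPy. This is a minimal illustration, not the final analysis: the `clients` frame is a hypothetical stand-in for `df_clients`, the 10-degree grid is an arbitrary choice, and the grid counts are only a numeric proxy for a plotted heatmap. Clustering with K-Means or DBSCAN is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_clients, using the same column names.
clients = pd.DataFrame({
    "client_id": ["C00001", "C00002", "C00003", "C00004", "C00005"],
    "state": ["Ohio", "Texas", "Ohio", "Utah", "Ohio"],
    "latitude": [40.1, 31.0, 39.9, 39.3, 40.4],
    "longitude": [-82.9, -99.9, -83.1, -111.7, -82.5],
})

# Rank states by number of clients (value_counts sorts in descending order).
clients_per_state = clients["state"].value_counts()

# Coarse "heatmap": count clients in 10-degree latitude/longitude cells.
# The cast to float is a precaution in case coordinates are stored as Decimal objects.
lat_edges = np.arange(-90, 91, 10)
lon_edges = np.arange(-180, 181, 10)
grid, _, _ = np.histogram2d(
    clients["latitude"].astype(float),
    clients["longitude"].astype(float),
    bins=[lat_edges, lon_edges],
)

print(clients_per_state)
print("busiest cell holds", int(grid.max()), "clients")
```

The `grid` array could be passed to a plotting function such as an image or mesh plot to obtain the actual heatmap.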

Investment-Level Analyses  
	1.	Asset Allocation Overview  
        * Distribution of asset classes (Stocks, Bonds, Real Estate, Crypto, Mutual Funds).
	2.	Return Distribution  
        * Histogram or KDE of return_pct to evaluate investment performance variability.
	3.	Investment Amount Trends  
        * Time-series of investment volumes by date or year.
	4.	Return by Asset Class  
        * Mean, median, and variance of return per asset class.
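
The investment-level items map fairly directly onto `value_counts` and `groupby`. The sketch below uses a small made-up `investments` frame with the same column names as `df_investments`; the figures are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df_investments.
investments = pd.DataFrame({
    "asset_class": ["Stocks", "Bonds", "Stocks", "Crypto", "Bonds", "Crypto"],
    "investment_date": ["2020-01-15", "2020-06-01", "2021-03-10",
                        "2021-07-04", "2022-02-02", "2022-09-09"],
    "investment_amount": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0],
    "return_pct": [0.10, 0.02, -0.05, 0.50, 0.04, -0.30],
})

# Asset allocation: share of investments per asset class.
allocation = investments["asset_class"].value_counts(normalize=True)

# Return by asset class: mean, median, and (sample) variance.
return_stats = investments.groupby("asset_class")["return_pct"].agg(
    ["mean", "median", "var"]
)

# Investment amount trend: total amount invested per calendar year.
yearly_volume = (
    investments
    .assign(year=pd.to_datetime(investments["investment_date"]).dt.year)
    .groupby("year")["investment_amount"]
    .sum()
)

print(allocation)
print(return_stats)
print(yearly_volume)
```

For the return distribution itself, `investments["return_pct"]` can be handed to any histogram or KDE plotting routine.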

Client + Investment Joined Analyses  
	1.	Total Investment Per Client  
        * Aggregate investment amount and value by client_id.
	2.	Top Investors  
        * Identify clients with highest total investment or return.
	3.	Geographic Return Analysis  
        * Compare average returns across cities/states/countries.
	4.	Client Segmentation by Portfolio  
        * Segment clients by their asset diversification (e.g., only crypto vs. multi-asset).
	5.	Choropleth or GeoMap  
        * Visualize total investment or return by region.
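
The joined analyses all start from a per-client aggregation and a merge on `client_id`. Below is a minimal sketch under the assumption that the client table has exactly one row per `client_id`; the `validate="many_to_one"` argument makes pandas raise an error if that does not hold. Both frames and the `single-asset` / `multi-asset` labels are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for df_clients and df_investments.
clients = pd.DataFrame({
    "client_id": ["C00001", "C00002", "C00003"],
    "country": ["Chile", "Kenya", "Chile"],
})
investments = pd.DataFrame({
    "client_id": ["C00001", "C00001", "C00002", "C00003", "C00003"],
    "asset_class": ["Stocks", "Bonds", "Crypto", "Crypto", "Crypto"],
    "investment_amount": [1000.0, 2000.0, 500.0, 300.0, 700.0],
    "investment_value": [1100.0, 2040.0, 400.0, 450.0, 630.0],
    "return_pct": [0.10, 0.02, -0.20, 0.50, -0.10],
})

# Total investment per client, plus the number of distinct asset classes held.
per_client = investments.groupby("client_id").agg(
    total_invested=("investment_amount", "sum"),
    total_value=("investment_value", "sum"),
    n_asset_classes=("asset_class", "nunique"),
)

# Simple portfolio segmentation: one asset class vs. several.
per_client["segment"] = per_client["n_asset_classes"].map(
    lambda n: "single-asset" if n == 1 else "multi-asset"
)

# Top investors by total amount invested.
top_investors = per_client.sort_values("total_invested", ascending=False)

# Geographic return analysis: attach client attributes, then average by country.
joined = investments.merge(clients, on="client_id", how="left", validate="many_to_one")
return_by_country = joined.groupby("country")["return_pct"].mean()

print(top_investors)
print(return_by_country)
```

The `return_by_country` series is also the natural input for a choropleth, keyed by country name.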

Optional Advanced Analyses  
	1.	Predictive Modeling  
        * Predict expected returns based on asset class, amount, and location using regression.
	2.	Risk-Return Profiling  
        * Scatter plot of investment volatility vs. return per client or asset class.
	3.	Temporal Investment Behavior  
        * Analyze client investment frequency and amounts over time.
	4.	Client Lifetime Value (CLV)  
        * Estimate based on cumulative return and frequency of investments.
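
Two of the advanced ideas, risk-return profiling and temporal behavior, can be prototyped with grouped aggregations before any modeling. The sketch below is only a starting point: the data is made up, volatility is taken to be the sample standard deviation of `return_pct`, and `cumulative_gain` is a hypothetical, very rough proxy for client lifetime value.

```python
import pandas as pd

# Hypothetical stand-in for df_investments.
investments = pd.DataFrame({
    "client_id": ["C00001", "C00001", "C00002", "C00002", "C00002"],
    "asset_class": ["Stocks", "Stocks", "Bonds", "Bonds", "Stocks"],
    "investment_date": ["2019-01-01", "2021-01-01", "2020-05-05",
                        "2020-06-06", "2022-07-07"],
    "investment_amount": [1000.0, 2000.0, 1000.0, 1000.0, 4000.0],
    "investment_value": [1100.0, 1800.0, 1020.0, 1040.0, 5200.0],
    "return_pct": [0.10, -0.10, 0.02, 0.04, 0.30],
})

# Risk-return profile: mean return vs. volatility (sample std) per asset class.
risk_return = investments.groupby("asset_class")["return_pct"].agg(
    mean_return="mean", volatility="std"
)

# Temporal behavior and a crude CLV proxy per client.
behaviour = (
    investments
    .assign(
        investment_date=pd.to_datetime(investments["investment_date"]),
        gain=investments["investment_value"] - investments["investment_amount"],
    )
    .groupby("client_id")
    .agg(
        n_investments=("investment_date", "count"),
        first_investment=("investment_date", "min"),
        last_investment=("investment_date", "max"),
        cumulative_gain=("gain", "sum"),
    )
)

print(risk_return)
print(behaviour)
```

Plotting `volatility` against `mean_return` gives the scatter plot described under risk-return profiling.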