Here I generate about 20k synthetic users and their behaviours by mimicking the existing ones.

# Import modules

In [1]:
import pandas as pd
import sys
import os

# Add the parent directory's 'scripts' folder to the Python path
sys.path.append(os.path.abspath('../scripts'))

from utils import (
    generate_unique_phone_numbers,
    generate_user_dates,
    generate_total_reloads,
    generate_total_reload_amount,
    generate_imei,
    generate_brand_and_model,
    generate_device_category,
    generate_data_kb,
)


# 🚀 Step 1: Generate 20,000 Unique Users

Let's generate 20k users, with each number starting with the prefix 2257987 followed by 5 random digits.
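The body of `generate_unique_phone_numbers` lives in `utils` and is not shown here. As a rough sketch only, one way to get unique numbers is to sample the 5-digit suffixes without replacement; the function name `sketch_unique_phone_numbers` and its `n_random_digits` and `seed` parameters are hypothetical. With 5 digits there are 100,000 possible suffixes, so 20,000 unique numbers fit comfortably.

```python
import random


def sketch_unique_phone_numbers(count, prefix, n_random_digits=5, seed=None):
    """Hypothetical stand-in for generate_unique_phone_numbers (not the utils code)."""
    space = 10 ** n_random_digits  # how many distinct suffixes exist
    if count > space:
        raise ValueError(f"only {space} unique numbers are possible for this prefix")
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no suffix repeats.
    suffixes = rng.sample(range(space), count)
    # Zero-pad so that every suffix has exactly n_random_digits digits.
    return [f"{prefix}{suffix:0{n_random_digits}d}" for suffix in suffixes]


sketch_users = sketch_unique_phone_numbers(count=20000, prefix="2257987", seed=42)
```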

In [2]:
users = generate_unique_phone_numbers(count=20000, prefix='2257987')
users[:10]  # Display the first 10 phone numbers

['225798754166',
 '225798737214',
 '225798780743',
 '225798775800',
 '225798705734',
 '225798774451',
 '225798714478',
 '225798703577',
 '225798751155',
 '225798711486']


# 📅 Step 2: Generate Dates for Each User

📌 Based on the original data:

Date range: 2024-09-01 to 2025-04-21 (233 days total)

Each user had 7 to 619 records (mean ≈ 165)

Let’s give each user a random number of records between, say, 50 and 250
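A minimal sketch of this idea, under stated assumptions: `sketch_user_dates`, `min_records`, `max_records` and `seed` are hypothetical names, and the real `generate_user_dates` in `utils` may work differently. Each user gets a uniformly random record count, and each record gets a random day from the range, stored as a `YYYYMMDD` integer. Days are drawn with replacement, so a user can have several records on the same day.

```python
import numpy as np
import pandas as pd


def sketch_user_dates(users, start_date, end_date,
                      min_records=50, max_records=250, seed=None):
    """Hypothetical stand-in for generate_user_dates (not the utils code)."""
    rng = np.random.default_rng(seed)
    days = pd.date_range(start_date, end_date, freq="D")
    # Encode each calendar day as an integer such as 20240901.
    day_codes = np.array([int(day.strftime("%Y%m%d")) for day in days])
    # One record count per user, inclusive on both ends.
    counts = rng.integers(min_records, max_records + 1, size=len(users))
    # Repeat every phone number once per record, then draw a day for each record.
    phone_col = np.repeat(np.asarray(users), counts)
    date_col = rng.choice(day_codes, size=counts.sum())
    return pd.DataFrame({"Phone Number": phone_col, "dates": date_col})


demo_users = ["225798700001", "225798700002", "225798700003"]
demo_df = sketch_user_dates(demo_users, "2024-09-01", "2025-04-21", seed=0)
```

With 20,000 users this averages about 150 records per user, which is in line with the roughly 3 million rows reported by `df.shape` further down.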



In [3]:
df = generate_user_dates(users, start_date='2024-09-01', end_date='2025-04-21')

In [4]:
# Shuffle the rows and rebuild the index
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

df.sample(10)  # Display 10 random rows of the DataFrame

Unnamed: 0,Phone Number,dates
492670,225798753813,20241217
2389805,225798706313,20250308
1428068,225798746145,20241226
488439,225798731579,20241117
353637,225798780420,20250421
989585,225798736764,20241222
2037614,225798726619,20250228
2821756,225798730461,20250308
2317888,225798768331,20241110
1051008,225798728022,20250105


In [5]:
df.shape  # (number of rows, number of columns)


(2994814, 2)

# 📌 Step 3: Add total_reloads

Based on the real data provided:

min: 1
max: 12
mean: 1.6
median: 1

🔍 Insight: Most users reload just once per day — this is highly skewed toward low values.


I’ll generate total_reloads per row using a skewed distribution — for example, a weighted random choice:

| Value | Weight (probability) |
| ----- | -------------------- |
| 1     | 60%                  |
| 2     | 25%                  |
| 3     | 8%                   |
| 4–6   | 6%                   |
| 7–12  | 1%                   |
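The weighted choice in the table can be sketched as follows. This is an illustration, not the `utils` implementation: `sketch_total_reloads` is a hypothetical name, and splitting each bucket's weight evenly across its values (4–6 and 7–12) is an assumption, since the table only gives bucket totals.

```python
import numpy as np


def sketch_total_reloads(n, seed=None):
    """Hypothetical stand-in for generate_total_reloads (not the utils code)."""
    rng = np.random.default_rng(seed)
    values = np.arange(1, 13)  # possible reload counts: 1..12
    probs = np.array(
        [0.60, 0.25, 0.08]      # values 1, 2, 3
        + [0.06 / 3] * 3        # bucket 4-6, weight shared evenly (assumption)
        + [0.01 / 6] * 6        # bucket 7-12, weight shared evenly (assumption)
    )
    probs = probs / probs.sum()  # guard against floating-point drift
    return rng.choice(values, size=n, p=probs)


demo_reloads = sketch_total_reloads(100_000, seed=0)
```

Under these weights the expected mean is about 1.74, slightly above the observed 1.6, while the median stays at 1.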

In [6]:
df['total_reloads'] = generate_total_reloads(len(df))
df.sample(10)  # Display 10 random rows of the DataFrame

Unnamed: 0,Phone Number,dates,total_reloads
190920,225798744195,20240922,1
1366387,225798764965,20250317,1
175291,225798796652,20241218,1
483715,225798755395,20241011,2
2449596,225798727352,20241008,2
519140,225798700518,20250411,2
524599,225798775128,20250327,1
125046,225798775943,20241021,1
116128,225798761619,20250205,9
921000,225798785856,20240914,1


# 📌 Step 4: Generate total_reload_amount

The given dataset has the following distribution:

| Stat         | Value                     |
| ------------ | ------------------------- |
| Min          | 2.17                      |
| Max          | 40,000                    |
| Mean         | 522.49                    |
| Std          | 1361.03                   |
| Median (50%) | 220                       |
| 25%          | 100                       |
| 75%          | 500                       |
| NaNs         | 68 out of \~9300 (≈ 0.7%) |

This suggests a right-skewed distribution — lots of small values, few big ones.

I'll use a log-normal distribution to mimic skewed values

Clip/round values between 2.0 and 40000

Add ~0.7% NaNs
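The three steps above could look roughly like this. The parameter choices are assumptions, not the values used in `utils`: the log-normal location is set to `ln(median)`, and the scale is derived from the quartiles, using the fact that for a log-normal `ln(q75) - ln(q25) = 2 * 0.6745 * sigma` (about 1.19 here). The name `sketch_total_reload_amount` and all of its keyword arguments are hypothetical.

```python
import numpy as np


def sketch_total_reload_amount(n, median=220.0, q25=100.0, q75=500.0,
                               low=2.0, high=40000.0, nan_ratio=0.007, seed=None):
    """Hypothetical stand-in for generate_total_reload_amount (not the utils code)."""
    rng = np.random.default_rng(seed)
    mu = np.log(median)  # the median of a log-normal is exp(mu)
    # 0.6745 is the 75th percentile of the standard normal distribution.
    sigma = (np.log(q75) - np.log(q25)) / (2 * 0.6745)
    amounts = rng.lognormal(mean=mu, sigma=sigma, size=n)
    # Keep values inside the observed range and round to two decimals.
    amounts = np.clip(amounts, low, high).round(2)
    # Blank out a small random share of rows to mimic the missing values.
    amounts[rng.random(n) < nan_ratio] = np.nan
    return amounts


demo_amounts = sketch_total_reload_amount(200_000, seed=0)
```

With these parameters the implied mean is roughly 450, somewhat below the observed 522.49, so the scale would need tuning if matching the mean matters more than matching the quartiles.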


In [7]:
df['total_reload_amount'] = generate_total_reload_amount(len(df))
df.sample(10)  # Display 10 random rows of the DataFrame

Unnamed: 0,Phone Number,dates,total_reloads,total_reload_amount
717465,225798797412,20241007,1,23.36
2193828,225798741267,20240909,1,78.79
904556,225798760175,20241208,1,88.44
991265,225798773297,20250219,1,25.51
951324,225798752344,20250227,1,765.67
1996821,225798713657,20241015,1,70.99
1479547,225798748273,20250116,1,115.0
198326,225798775775,20241108,3,244.48
2485217,225798740440,20250131,2,198.74
744032,225798710243,20240924,1,149.45


# 📌 Step 5: Generate imei

Behaviour of the given data:

Each row had a unique imei

Values are 10-digit integers

Start with either 1 or 2
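These three constraints mean the values lie between 1,000,000,000 and 2,999,999,999 and must not repeat. A sketch under assumptions (the name `sketch_imei` and the redraw loop are illustrative, and the real `generate_imei` may differ): draw random integers in that range, drop duplicates, and top up until enough unique values exist.

```python
import numpy as np


def sketch_imei(n, seed=None):
    """Hypothetical stand-in for generate_imei (not the utils code)."""
    rng = np.random.default_rng(seed)
    low, high = 1_000_000_000, 3_000_000_000  # high is exclusive
    unique = np.empty(0, dtype=np.int64)
    while len(unique) < n:
        # Draw only as many values as are still missing, then deduplicate.
        draw = rng.integers(low, high, size=n - len(unique), dtype=np.int64)
        unique = np.unique(np.concatenate([unique, draw]))
    # np.unique returns sorted values, so shuffle to remove the ordering.
    rng.shuffle(unique)
    return unique


demo_imei = sketch_imei(100_000, seed=1)
```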



In [8]:
df['imei'] = generate_imei(len(df))
df.sample(10)  # Display 10 random rows of the DataFrame

Unnamed: 0,Phone Number,dates,total_reloads,total_reload_amount,imei
2176875,225798710161,20241104,1,63.95,2168326419
2139537,225798704581,20250307,1,126.69,1001628204
1741263,225798711042,20250406,6,304.2,1877361683
2193287,225798773422,20250315,11,91.63,1627497549
1441401,225798702530,20250406,1,98.88,2634631874
717641,225798722937,20241221,3,111.44,1024173345
1297857,225798797446,20250114,1,333.96,1347640194
675880,225798771733,20250105,2,33.29,2951974210
1407901,225798765649,20241214,1,37.04,2946999278
1415225,225798794724,20250419,1,568.57,1620482775


# 📌 Step 6: Generate brand_name + model_name

From the given data:

📊 brand_name distribution:

49 brands total

Skewed: top brand has 4590 entries, some have only 1

Median ≈ 16 entries per brand

📊 model_name distribution:

352 models total

Skewed: top model has 1510 entries, others very rare

Brands can have multiple models

✅ Strategy

Define a dictionary of fake brand_name: [model_name1, model_name2, ...]

Sample brands using weighted probability

Then randomly pick a model under that brand
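The strategy can be sketched as below. Everything specific here is an assumption made for illustration: the name `sketch_brand_and_model`, the random number of models per brand, and the `1/rank` weights that make low-numbered brands more common. The actual catalogue and weights in `utils` are not shown in this notebook.

```python
import numpy as np


def sketch_brand_and_model(n, n_brands=49, max_models=10, seed=None):
    """Hypothetical stand-in for generate_brand_and_model (not the utils code)."""
    rng = np.random.default_rng(seed)
    brands = [f"Brand_{i}" for i in range(1, n_brands + 1)]
    # Fake catalogue: each brand gets between 1 and max_models models.
    catalog = {
        brand: [f"{brand}_Model_{j}"
                for j in range(1, int(rng.integers(1, max_models + 1)) + 1)]
        for brand in brands
    }
    # Skewed brand weights: weight proportional to 1 / rank (assumption).
    weights = 1.0 / np.arange(1, n_brands + 1)
    weights /= weights.sum()
    brand_idx = rng.choice(n_brands, size=n, p=weights)
    # Uniform model pick within the chosen brand: floor(U * k) lies in 0..k-1.
    model_counts = np.array([len(catalog[brand]) for brand in brands])
    model_idx = (rng.random(n) * model_counts[brand_idx]).astype(int)
    brand_col = [brands[b] for b in brand_idx]
    model_col = [catalog[brands[b]][m] for b, m in zip(brand_idx, model_idx)]
    return brand_col, model_col


demo_brands, demo_models = sketch_brand_and_model(5000, seed=0)
```

Returning two equally long sequences keeps the call compatible with a two-column assignment such as `df['brand_name'], df['model_name'] = ...`.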

In [9]:
df['brand_name'], df['model_name'] = generate_brand_and_model(len(df))
df.sample(10)  # Display 10 random rows of the DataFrame

Unnamed: 0,Phone Number,dates,total_reloads,total_reload_amount,imei,brand_name,model_name
1560776,225798762945,20240922,1,30.58,2554540507,Brand_41,Brand_41_Model_2
1031768,225798798983,20241104,1,28.57,2851969100,Brand_16,Brand_16_Model_4
2513041,225798765088,20241001,1,136.55,1938602309,Brand_31,Brand_31_Model_6
2849290,225798701750,20250227,1,140.94,1676391959,Brand_49,Brand_49_Model_1
1568200,225798743922,20250130,1,254.51,1339034805,Brand_46,Brand_46_Model_3
2653641,225798744654,20241024,2,160.13,1684744616,Brand_28,Brand_28_Model_10
647506,225798774525,20250215,1,362.22,2640918142,Brand_1,Brand_1_Model_3
1703920,225798786066,20250127,3,215.89,2795104296,Brand_32,Brand_32_Model_3
2695608,225798712309,20240912,1,342.72,1277910312,Brand_10,Brand_10_Model_4
2574857,225798700391,20241204,2,447.47,1722790816,Brand_21,Brand_21_Model_1


# 📌 Step 7: Generate device_category

From the given dataset:

| Code | Feature           | Count | Approx %  |
| ---- | ----------------- | ----- | --------- |
| 5    | Smartphone        | 3970  | \~42.3% ✅ |
| 7    | Basic             | 2982  | \~31.8% ✅ |
| 4    | FeaturePhone      | 2241  | \~23.9% ✅ |
| -    | (invalid/missing) | 180   | \~1.9%    |
| 2    | Modem             | 3     | \~0.03%   |
| 3    | M2M               | 1     | \~0.01%   |
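One way to reproduce these shares is to use the observed counts directly as sampling weights. This is a sketch, not the `utils` code: `sketch_device_category` and `OBSERVED_COUNTS` are hypothetical names, and keeping the codes as strings so that the `-` placeholder fits in the same column is an assumption.

```python
import numpy as np

# Counts copied from the table above; "-" marks invalid or missing entries.
OBSERVED_COUNTS = {"5": 3970, "7": 2982, "4": 2241, "-": 180, "2": 3, "3": 1}


def sketch_device_category(n, counts=OBSERVED_COUNTS, seed=None):
    """Hypothetical stand-in for generate_device_category (not the utils code)."""
    rng = np.random.default_rng(seed)
    labels = np.array(list(counts.keys()))
    weights = np.array(list(counts.values()), dtype=float)
    weights /= weights.sum()  # turn counts into probabilities
    return rng.choice(labels, size=n, p=weights)


demo_categories = sketch_device_category(200_000, seed=0)
```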



In [10]:
df['device_category'] = generate_device_category(len(df))
df.sample(10)  # Display 10 random rows of the DataFrame

Unnamed: 0,Phone Number,dates,total_reloads,total_reload_amount,imei,brand_name,model_name,device_category
514576,225798784235,20250415,1,209.15,2720291194,Brand_49,Brand_49_Model_5,5
1401417,225798723624,20241217,1,92.68,1545862270,Brand_16,Brand_16_Model_3,5
905350,225798773800,20240902,2,365.8,1811505922,Brand_11,Brand_11_Model_2,5
2005233,225798783758,20241016,1,393.79,1188471105,Brand_21,Brand_21_Model_2,7
396024,225798774629,20241103,1,243.39,1969281888,Brand_43,Brand_43_Model_2,7
2551941,225798786948,20241018,1,114.95,2738019263,Brand_28,Brand_28_Model_6,7
1875398,225798791169,20250217,1,204.36,2899852440,Brand_28,Brand_28_Model_9,7
1922362,225798724852,20241117,3,114.19,2194181720,Brand_13,Brand_13_Model_2,4
1221526,225798728157,20241015,1,48.13,2787744311,Brand_27,Brand_27_Model_3,5
243969,225798738178,20250330,1,272.93,2344931754,Brand_1,Brand_1_Model_9,4


# 📌 Step 8: Generate data_kb (daily mobile data usage)

| Stat    | Value          |
| ------- | -------------- |
| Min     | 0.0            |
| Max     | 8,353,427.0 😲 |
| 25%     | 0.0            |
| 50%     | 0.0            |
| 75%     | \~15,000.0     |
| Mean    | \~142,924.9    |
| Std Dev | \~413,158.9    |


🧠 Insight:

Most users don’t use data at all → many 0s

Others use only a small amount

A few are heavy consumers (long right tail)

✅ Plan

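Generate many 0s (let’s say ~5%–10% of rows = zero usage). Note that this is well below the original data, where the median is 0, so at least half of the rows have zero usage.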

The rest follow a log-normal distribution (right-skewed, realistic)

Final result looks natural and spread out like real telecom data
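A zero-inflated log-normal along the lines of this plan might look as follows. The name `sketch_data_kb` and the values of `median_kb`, `sigma` and `max_kb` are assumptions chosen for illustration (the cap is set near the observed maximum in the table above); only `zero_ratio` mirrors an argument that the real `generate_data_kb` is called with in this notebook.

```python
import numpy as np


def sketch_data_kb(n, zero_ratio=0.07, median_kb=240_000.0, sigma=1.3,
                   max_kb=8_000_000.0, seed=None):
    """Hypothetical stand-in for generate_data_kb (not the utils code)."""
    rng = np.random.default_rng(seed)
    # Right-skewed usage for everyone; the median of a log-normal is exp(mean).
    usage = rng.lognormal(mean=np.log(median_kb), sigma=sigma, size=n)
    # Cap the long tail and round to two decimals.
    usage = np.clip(usage, 0.0, max_kb).round(2)
    # Overwrite a random share of rows with zero usage.
    usage[rng.random(n) < zero_ratio] = 0.0
    return usage


demo_data_kb = sketch_data_kb(200_000, seed=0)
```

Raising `zero_ratio` towards 0.5 or above would bring the share of zeros closer to the original data, whose 25% and 50% quantiles are both 0.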

In [11]:
df['data_kb'] = generate_data_kb(len(df), zero_ratio=0.07)
df.sample(10)  # Display 10 random rows of the DataFrame

Unnamed: 0,Phone Number,dates,total_reloads,total_reload_amount,imei,brand_name,model_name,device_category,data_kb
379659,225798778133,20241108,1,14.46,1653505803,Brand_29,Brand_29_Model_3,4,121319.51
212480,225798776252,20241228,1,382.42,1498192976,Brand_37,Brand_37_Model_2,7,333245.37
2026010,225798739482,20241124,2,513.22,2056444915,Brand_6,Brand_6_Model_4,5,693466.71
2251322,225798732946,20250102,1,233.95,1322405872,Brand_37,Brand_37_Model_1,7,1439029.15
532424,225798797687,20250416,2,62.52,2974119745,Brand_16,Brand_16_Model_2,5,10683.16
645232,225798729384,20250111,1,43.75,1766578565,Brand_39,Brand_39_Model_5,5,495511.46
2152563,225798746671,20241027,2,43.65,1456346147,Brand_33,Brand_33_Model_2,4,711761.35
2673443,225798775816,20241113,1,89.22,1437023807,Brand_31,Brand_31_Model_3,5,157875.23
200740,225798716575,20250321,1,386.4,1637236143,Brand_37,Brand_37_Model_3,4,59395.6
509986,225798717331,20240925,1,313.68,2586667536,Brand_5,Brand_5_Model_5,5,312932.79


In [12]:
df

Unnamed: 0,Phone Number,dates,total_reloads,total_reload_amount,imei,brand_name,model_name,device_category,data_kb
0,225798722725,20250226,1,1497.34,1566059407,Brand_49,Brand_49_Model_4,5,0.00
1,225798722760,20241125,4,655.05,1461658951,Brand_28,Brand_28_Model_7,5,74055.87
2,225798769346,20250310,1,62.25,2186777731,Brand_16,Brand_16_Model_1,5,99548.76
3,225798783528,20240902,2,199.50,2466370262,Brand_1,Brand_1_Model_3,7,92339.23
4,225798752766,20240924,2,376.31,2081550714,Brand_30,Brand_30_Model_3,5,528053.80
...,...,...,...,...,...,...,...,...,...
2994809,225798761423,20250215,1,772.48,1607045912,Brand_49,Brand_49_Model_4,7,2137415.36
2994810,225798743753,20241021,1,140.90,2077599725,Brand_9,Brand_9_Model_3,4,33657.25
2994811,225798718924,20250413,1,28.21,1898916851,Brand_33,Brand_33_Model_1,5,86719.71
2994812,225798776913,20250322,1,94.85,2269934669,Brand_5,Brand_5_Model_6,4,348661.72


In [13]:
df['data_kb'].describe()  # Display the summary statistics of the 'data_kb' column

count    2.994814e+06
mean     5.033131e+05
std      8.238644e+05
min      0.000000e+00
25%      9.500815e+04
50%      2.394204e+05
75%      5.618043e+05
max      8.000000e+06
Name: data_kb, dtype: float64

# Merge with the existing data

In [14]:
df2 = pd.read_excel('../data/raw/mtn_upsell_uncleaned.xlsx')

# df = df.merge(df2, on='Phone number', how='left')

# Pair up the column names by position to compare the column order of both frames
list(zip(df.columns, df2.columns))

[('Phone Number', 'dates'),
 ('dates', 'Phone Number'),
 ('total_reloads', 'total_reloads'),
 ('total_reload_amount', 'total_reload_amount'),
 ('imei', 'imei'),
 ('brand_name', 'brand_name'),
 ('model_name', 'model_name'),
 ('device_category', 'device_category'),
 ('data_kb', 'data_kb')]
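The pairs show that both frames hold the same columns, with only the first two in swapped order. As a small aside, `pd.concat` aligns on column names rather than positions, so reordering the generated frame only affects the column order of the result. The toy frames below use made-up values to illustrate this.

```python
import pandas as pd

# Two tiny frames with the same columns in a different order (made-up values).
generated = pd.DataFrame({"Phone Number": ["225798700001"], "dates": [20240901]})
original = pd.DataFrame({"dates": [20240902], "Phone Number": ["225798700002"]})

# Both frames must expose the same set of columns before stacking.
assert set(generated.columns) == set(original.columns)

# Reordering first gives the result the original's column order.
combined = pd.concat([generated[original.columns], original], ignore_index=True)

# Without reordering, values still line up by name; only the column order differs.
by_name = pd.concat([generated, original], ignore_index=True)
```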

In [15]:
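# Reorder the generated columns to match the original data
df = df[df2.columns]
# Stack both datasets, then shuffle so original and generated rows are mixed
df = pd.concat([df, df2], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
df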

Unnamed: 0,dates,Phone Number,total_reloads,total_reload_amount,imei,brand_name,model_name,device_category,data_kb
0,20240903,225798707300,2,52.22,2529499082,Brand_10,Brand_10_Model_7,7,121142.46
1,20250331,225798724508,2,41.02,1480929134,Brand_33,Brand_33_Model_8,5,103739.88
2,20250204,225798733835,1,43.90,2150205527,Brand_22,Brand_22_Model_10,7,99597.33
3,20250214,225798721011,2,36.32,1701085748,Brand_26,Brand_26_Model_1,7,125106.20
4,20241123,225798705193,1,35.55,2389010154,Brand_2,Brand_2_Model_10,5,1441097.83
...,...,...,...,...,...,...,...,...,...
3004186,20241213,225798744864,1,49.51,1583585686,Brand_2,Brand_2_Model_9,7,491512.55
3004187,20241230,225798752096,2,54.62,2765832989,Brand_21,Brand_21_Model_4,5,322323.00
3004188,20250321,225798785198,2,560.87,2925281234,Brand_19,Brand_19_Model_6,7,167538.59
3004189,20241006,225798740974,1,117.98,1173782221,Brand_16,Brand_16_Model_3,7,54288.12


Store the combined DataFrame in a CSV file.

In [16]:
df.to_csv('../data/raw/mtn_upsell_generated.csv', index=False)