## **Build Models and Comparisons** 

In order to study the relationship between economic characteristics and telephone fraud strategies, the appropriate independent variables should reflect various aspects of economic conditions.

I choose the following aspects of economic data as independent variables:

1. **Gross Domestic Product (GDP)**: measures the overall level of economic activity and wealth in a state or region (the dataset used below reports total state GDP rather than a per capita figure). More economically developed areas may see more fraudulent activity because fraudsters may believe that residents of these areas have more money.

2. **Personal Income**: includes total personal income and per capita personal income, which can reflect the average economic level of residents.

3. **Disposable Personal Income**: the remaining income of residents after subtracting necessary taxes and other involuntary expenditures from their income. This indicator can help to understand the real purchasing power of the population.

4. **Personal Consumption Expenditures**: indicates how much the population spends on a variety of goods and services, which may be related to the financial activities of consumers and their vulnerability to fraud.

5. **Regional Price Parity**: reflects differences in price levels in different regions, which may affect the real purchasing power of the population and the choice of fraudulent strategies.

6. **Total Employment**: an indicator of economic activity, measured as the number of jobs. High employment may indicate a stable economy, while low employment may increase the risk of fraud.

7. **Education Statistics (number of people aged 25 and over with a bachelor's degree or higher)**: residents with different education levels may respond differently when they receive a scam call.

Dependent variable selection:

 **Topic of Fraudulent Call**

* Importing and cleaning of data

Importing state-by-state economic data for 2022

In [1]:
import pandas as pd

# Load data
file_path = '/Users/fangguoguo/Desktop/fraud_call_project/Table-4_clean.csv'  
data = pd.read_csv(file_path)

# Fix column names: remove spaces before and after
data.columns = data.columns.str.strip()

# List all available economic indicators, making sure the column names match those in dataset
print("columns:", data.columns.tolist())

# Selection of relevant economic indicators
selected_columns = [
    'State',
    'Gross domestic product (GDP)',
    'Personal income',
    'Disposable personal income',
    'Personal consumption expenditures',
    'Regional price parities (RPPs) 9',
    'Total employment (number of jobs)'
]

# Creating a new DataFrame
economic_2022_df = data[selected_columns].copy()

# Check the new DataFrame
print(economic_2022_df.head())

# Save new DataFrame to CSV
economic_2022_df.to_csv('/Users/fangguoguo/Desktop/fraud_call_project/economic_2022_df.csv', index=False)


columns: ['State', 'Disposable personal income', 'Gross domestic product (GDP)', 'Implicit regional price deflator 10', 'Per capita disposable personal income 7', 'Per capita personal consumption expenditures (PCE) 8', 'Per capita personal income 6', 'Personal consumption expenditures', 'Personal income', 'Real GDP (millions of chained 2017 dollars) 1', 'Real PCE (millions of constant (2017) dollars) 3', 'Real per capita PCE 5', 'Real per capita personal income 4', 'Real personal income (millions of constant (2017) dollars) 2', 'Regional price parities (RPPs) 9', 'Total employment (number of jobs)']
        State  Gross domestic product (GDP)  Personal income  \
0     Alabama                      281569.0         258362.2   
1      Alaska                       65698.8          50349.7   
2     Arizona                      475653.7         430083.5   
3    Arkansas                      165989.3         160254.2   
4  California                     3641643.4        3006647.3   

   Dispo



In [2]:
economic_2022_df.info()
economic_2022_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   State                              52 non-null     object 
 1   Gross domestic product (GDP)       52 non-null     float64
 2   Personal income                    52 non-null     float64
 3   Disposable personal income         52 non-null     float64
 4   Personal consumption expenditures  52 non-null     float64
 5   Regional price parities (RPPs) 9   52 non-null     float64
 6   Total employment (number of jobs)  52 non-null     int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 3.0+ KB


| | State | Gross domestic product (GDP) | Personal income | Disposable personal income | Personal consumption expenditures | Regional price parities (RPPs) 9 | Total employment (number of jobs) |
|---|---|---|---|---|---|---|---|
| 0 | Alabama | 281569.0 | 258362.2 | 229599.5 | 215104.6 | 87.776 | 2869931 |
| 1 | Alaska | 65698.8 | 50349.7 | 45685.6 | 43412.8 | 101.989 | 457687 |
| 2 | Arizona | 475653.7 | 430083.5 | 377143.3 | 368866.6 | 99.897 | 4287595 |
| 3 | Arkansas | 165989.3 | 160254.2 | 142906.9 | 128662.4 | 86.597 | 1755536 |
| 4 | California | 3641643.4 | 3006647.3 | 2464043.4 | 2352361.6 | 112.47 | 25300974 |


Importing education statistics for 2022

In [3]:
education_data_path = '/Users/fangguoguo/Desktop/fraud_call_project/tabn104.csv'

education_2022_df = pd.read_csv(education_data_path)

education_2022_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 2 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   State                                              51 non-null     object
 1   Number of persons age 25 and over 
(in thousands)  51 non-null     object
dtypes: object(2)
memory usage: 948.0+ bytes


In [4]:
# Remove spaces and line breaks from all column names
education_2022_df.columns = education_2022_df.columns.str.replace('\n', '').str.strip()

# Convert column 'Number of persons age 25 and over (in thousands)' in education_df from object to float
# Remove commas and convert data types
education_2022_df['Number of persons age 25 and over (in thousands)'] = education_2022_df['Number of persons age 25 and over (in thousands)'].str.replace(',', '').astype(float)
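
A slightly more defensive variant of the conversion above, as a minimal sketch. `pd.to_numeric` with `errors='coerce'` turns any value that is not a number (for example a footnote marker) into `NaN` instead of raising an error. The toy frame `toy_edu` and its values are made up for illustration and are not the real education table.

```python
import pandas as pd

# Hypothetical toy data standing in for the education table (values are illustrative)
col = "Number of persons age 25 and over (in thousands)"
toy_edu = pd.DataFrame({
    "State": ["A", "B", "C"],
    col: ["3,475", "490", "n/a"],
})

# Strip thousands separators, then coerce anything non-numeric to NaN
toy_edu[col] = pd.to_numeric(
    toy_edu[col].str.replace(",", "", regex=False), errors="coerce"
)

# Count rows that failed to parse so they can be inspected before modeling
n_unparsed = int(toy_edu[col].isna().sum())
print(toy_edu)
print("Unparsed rows:", n_unparsed)
```

Checking `n_unparsed` before merging makes it harder for a silently broken value to reach the regression step.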

In [5]:
print(economic_2022_df['State'].unique())
print(education_2022_df['State'].unique())

['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'United States' 'Utah' 'Vermont'
 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming']
['Alabama ' 'Alaska ' 'Arizona ' 'Arkansas ' 'California ' 'Colorado '
 'Connecticut ' 'Delaware ' 'District of Columbia ' 'Florida ' 'Georgia '
 'Hawaii ' 'Idaho ' 'Illinois ' 'Indiana ' 'Iowa ' 'Kansas ' 'Kentucky '
 'Louisiana ' 'Maine ' 'Maryland ' 'Massachusetts ' 'Michigan '
 'Minnesota ' 'Mississippi ' 'Missouri ' 'Montana ' 'Nebraska ' 'Nevada '
 'New Hampshire ' 'New 

In [6]:
# Remove extra spaces from State column values in education_df
education_2022_df['State'] = education_2022_df['State'].str.strip()

# Re-merge
eco_edu_2022_df = pd.merge(economic_2022_df, education_2022_df, on="State")

# Print the first few lines of the merged result
eco_edu_2022_df.head()

| | State | Gross domestic product (GDP) | Personal income | Disposable personal income | Personal consumption expenditures | Regional price parities (RPPs) 9 | Total employment (number of jobs) | Number of persons age 25 and over (in thousands) |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 281569.0 | 258362.2 | 229599.5 | 215104.6 | 87.776 | 2869931 | 3475.0 |
| 1 | Alaska | 65698.8 | 50349.7 | 45685.6 | 43412.8 | 101.989 | 457687 | 490.0 |
| 2 | Arizona | 475653.7 | 430083.5 | 377143.3 | 368866.6 | 99.897 | 4287595 | 5048.0 |
| 3 | Arkansas | 165989.3 | 160254.2 | 142906.9 | 128662.4 | 86.597 | 1755536 | 2056.0 |
| 4 | California | 3641643.4 | 3006647.3 | 2464043.4 | 2352361.6 | 112.47 | 25300974 | 26878.0 |
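
An inner merge silently drops rows whose `State` value appears in only one table, such as an aggregate row that exists on one side only. The sketch below shows one way to make such rows visible, using the `how`, `indicator`, and `validate` options of `pd.merge`. The two toy frames and their column names (`gdp`, `pop_25_plus`) are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; the 'United States' row exists on the left side only
left = pd.DataFrame({"State": ["Alabama", "Alaska", "United States"],
                     "gdp": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"State": ["Alabama", "Alaska"],
                      "pop_25_plus": [10.0, 20.0]})

# how='outer' keeps unmatched rows, indicator=True adds a '_merge' column that
# records where each row came from, and validate='one_to_one' raises an error
# if a state appears more than once on either side
checked = pd.merge(left, right, on="State", how="outer",
                   indicator=True, validate="one_to_one")

unmatched = checked.loc[checked["_merge"] != "both", "State"].tolist()
print("Rows without a match:", unmatched)
```

Once the unmatched rows have been reviewed, the regular inner merge can be used as before.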


Importing data on economic losses from 2022 fraudulent calls

In [7]:
fraud_report_2022_path = '/Users/fangguoguo/Desktop/fraud_call_project/2022_CSN_State_Fraud_Reports_and_Losses.csv' 

fraud_report_2022_df = pd.read_csv(fraud_report_2022_path)
selected_columns = ['State', '# of Reports', 'Total $ Loss']

fraud_report_2022_df = fraud_report_2022_df[selected_columns].copy()

fraud_report_2022_df.info()

fraud_report_2022_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   State         52 non-null     object
 1   # of Reports  52 non-null     object
 2   Total $ Loss  52 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


| | State | # of Reports | Total $ Loss |
|---|---|---|---|
| 0 | Alabama | 22113 | $53,864,805.00 |
| 1 | Alaska | 4409 | $16,691,422.00 |
| 2 | Arizona | 43960 | $173,944,111.00 |
| 3 | Arkansas | 12917 | $34,563,485.00 |
| 4 | California | 213223 | $1,348,767,079.00 |


In [8]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# First, clean up the Total $ Loss column, removing possible currency symbols and commas
# A raw string (r'...') avoids Python's invalid-escape-sequence warning for '\$'
fraud_report_2022_df['Total $ Loss'] = fraud_report_2022_df['Total $ Loss'].replace(r'[\$,]', '', regex=True).astype(float)

# Clean up # of Reports columns, remove commas
fraud_report_2022_df['# of Reports'] = fraud_report_2022_df['# of Reports'].replace(',', '', regex=True).astype(int)

# Checking changed data types
print(fraud_report_2022_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   State         52 non-null     object 
 1   # of Reports  52 non-null     int64  
 2   Total $ Loss  52 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ KB
None


In [9]:
# Merging data
eco_edu_fraud_2022_df = pd.merge(eco_edu_2022_df, fraud_report_2022_df, on='State')

eco_edu_fraud_2022_df.head(60)

eco_edu_fraud_2022_df.to_csv('/Users/fangguoguo/Desktop/fraud_call_project/eco_edu_fraud_2022_df.csv', index=False)

* Multiple linear regression modeling to provide a simple understanding of the relationship between the number of reports of fraudulent calls and economic conditions

In [10]:
import pandas as pd
import statsmodels.api as sm


# Defining the dependent variable
y = eco_edu_fraud_2022_df['# of Reports']

# Define the independent variables
X = eco_edu_fraud_2022_df[['Gross domestic product (GDP)', 'Personal income', 'Disposable personal income', 
          'Personal consumption expenditures', 'Regional price parities (RPPs) 9', 'Total employment (number of jobs)', 
          'Number of persons age 25 and over (in thousands)']]

# Adding a constant term (intercept) to the model
X = sm.add_constant(X)

# Fitting multiple linear regression models
model = sm.OLS(y, X).fit()

# Summary of the output model
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           # of Reports   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.993
Method:                 Least Squares   F-statistic:                     1034.
Date:                Sun, 23 Jun 2024   Prob (F-statistic):           9.25e-46
Time:                        00:36:52   Log-Likelihood:                -481.15
No. Observations:                  51   AIC:                             978.3
Df Residuals:                      43   BIC:                             993.7
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                                       coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------

In this multiple linear regression model, Regional Price Parities (RPPs) and the education variable are the statistically significant predictors of the number of fraudulent calls reported. Holding the other variables constant, a one-unit increase in RPPs is associated with about 220.6 more reports, and a one-unit increase in the number of persons aged 25 and over (a column measured in thousands) is associated with about 5.384 more reports. The coefficients on other variables such as GDP, personal income, disposable personal income, and personal consumption expenditures are not statistically significant. The adjusted R-squared of the model is 0.993 and the model as a whole is statistically significant (p-value < 0.001 for the F-statistic), so it explains the variation in the number of reports well. Because most predictors are state-level totals that tend to rise and fall together with the size of a state, the individual coefficients should be read as associations rather than causal effects. Overall, the result suggests that reported fraudulent telephone activity is closely associated with the economic and demographic characteristics of a state.
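
When several predictors move together, individual coefficients and their p-values become unstable. One common diagnostic is the variance inflation factor (VIF). The sketch below computes it with NumPy, using the identity that the VIF of predictor j is the j-th diagonal entry of the inverse of the predictors' correlation matrix. The data is synthetic and the column names (`gdp`, `income`, `price_index`) are placeholders, not the notebook's real columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: two predictors share a common "size" factor, one does not
size = rng.lognormal(mean=0.0, sigma=1.0, size=51)
toy_X = pd.DataFrame({
    "gdp": size * (1 + 0.05 * rng.normal(size=51)),
    "income": size * (1 + 0.05 * rng.normal(size=51)),
    "price_index": rng.normal(loc=100, scale=5, size=51),
})

# VIF_j is the j-th diagonal entry of the inverse correlation matrix
corr = np.corrcoef(toy_X.to_numpy(), rowvar=False)
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy_X.columns)
print(vif.round(1))
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of strong collinearity. Running the same calculation on the seven real predictors would show whether it applies here.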

* Multiple linear regression modeling for a simple understanding of the relationship between fraudulent call losses and economic characteristics

In [11]:
import pandas as pd
import statsmodels.api as sm


# Defining the dependent variable
y = eco_edu_fraud_2022_df['Total $ Loss']

# Defining the independent variables
X = eco_edu_fraud_2022_df[['Gross domestic product (GDP)', 'Personal income', 'Disposable personal income', 
          'Personal consumption expenditures', 'Regional price parities (RPPs) 9', 'Total employment (number of jobs)', 
          'Number of persons age 25 and over (in thousands)']]

# Adding constant term (intercept) to the model
X = sm.add_constant(X)

# Fitting multiple linear regression models
model = sm.OLS(y, X).fit()

# Summary of the output model
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           Total $ Loss   R-squared:                       0.969
Model:                            OLS   Adj. R-squared:                  0.964
Method:                 Least Squares   F-statistic:                     191.5
Date:                Sun, 23 Jun 2024   Prob (F-statistic):           2.81e-30
Time:                        00:36:52   Log-Likelihood:                -961.00
No. Observations:                  51   AIC:                             1938.
Df Residuals:                      43   BIC:                             1953.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                                       coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------

This multiple linear regression model shows that personal consumption expenditures and the education variable are the statistically significant predictors of total losses from fraudulent calls. Specifically, holding the other variables constant, a one-unit increase in personal consumption expenditures is associated with an increase in total losses of approximately $2,113, while a one-unit increase in the education variable (one thousand individuals aged 25 and over) is associated with a decrease in total losses of approximately $107,400. The model has an adjusted R-squared value of 0.964, indicating that it explains the variation in loss amounts well. Although the model as a whole is significant, the coefficients on GDP, personal income, disposable personal income, regional price parity, and total employment are not statistically significant, which may imply that fraudulent telephone losses have a weaker direct relationship with these economic indicators.

* Next, model the relationship between regional economic characteristics and phone scam strategies (scam themes)

Importing data on phone scam topics (theft type) for 2022

In [12]:
import pandas as pd

# Loading CSV files
file_path = '/Users/fangguoguo/Desktop/fraud_call_project/2022_CSN_State_Identity_Theft_Reports.csv' 
theft_type_2022 = pd.read_csv(file_path)

# Clean up the '# of Reports' column, remove commas and convert to integers
theft_type_2022['# of Reports'] = theft_type_2022['# of Reports'].str.replace(',', '').astype(int)

# Using the pivot_table method to reshape data
theft_type_2022 = theft_type_2022.pivot_table(index='State', columns='Theft Type', values='# of Reports', aggfunc='sum')

# Re-use index 'State' as a column
theft_type_2022.reset_index(inplace=True)

# Display of converted data
theft_type_2022.head()

| | State | Bank Fraud | Credit Card Fraud | Employment or Tax-Related Fraud | Government Documents or Benefits Fraud | Loan or Lease Fraud | Other Identity Theft | Phone or Utilities Fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1276 | 8585 | 917 | 415 | 2754 | 5876 | 918 |
| 1 | Alaska | 155 | 249 | 88 | 37 | 74 | 227 | 67 |
| 2 | Arizona | 2042 | 7609 | 2072 | 749 | 2665 | 6111 | 1682 |
| 3 | Arkansas | 619 | 2091 | 540 | 232 | 714 | 1587 | 457 |
| 4 | California | 11166 | 64878 | 10156 | 4790 | 14977 | 32952 | 6850 |
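
The reshaping step above turns one row per (state, theft type) into one row per state with one column per theft type. A small sketch of the same long-to-wide transformation follows. It uses four values copied from the table above, and the frame names `long_df` and `wide_df` are my own.

```python
import pandas as pd

# Long format: one row per (state, theft type); values copied from the output above
long_df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Theft Type": ["Bank Fraud", "Credit Card Fraud", "Bank Fraud", "Credit Card Fraud"],
    "# of Reports": [1276, 8585, 155, 249],
})

# Wide format: one row per state, one column per theft type
wide_df = (
    long_df.pivot_table(index="State", columns="Theft Type",
                        values="# of Reports", aggfunc="sum")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'Theft Type' axis label
print(wide_df)
```

Clearing `columns.name` is optional. It only removes the `Theft Type` label that otherwise appears in the header of the displayed table.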


In [13]:
# Merging data
eco_edu_f_t_2022_df = pd.merge(eco_edu_fraud_2022_df, theft_type_2022, on="State")

eco_edu_f_t_2022_df.to_csv('/Users/fangguoguo/Desktop/fraud_call_project/eco_edu_f_t_2022_df.csv', index=False)

eco_edu_f_t_2022_df.head()

| | State | Gross domestic product (GDP) | Personal income | Disposable personal income | Personal consumption expenditures | Regional price parities (RPPs) 9 | Total employment (number of jobs) | Number of persons age 25 and over (in thousands) | # of Reports | Total $ Loss | Bank Fraud | Credit Card Fraud | Employment or Tax-Related Fraud | Government Documents or Benefits Fraud | Loan or Lease Fraud | Other Identity Theft | Phone or Utilities Fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 281569.0 | 258362.2 | 229599.5 | 215104.6 | 87.78 | 2869931 | 3475.0 | 22113 | 53864805.0 | 1276 | 8585 | 917 | 415 | 2754 | 5876 | 918 |
| 1 | Alaska | 65698.8 | 50349.7 | 45685.6 | 43412.8 | 101.99 | 457687 | 490.0 | 4409 | 16691422.0 | 155 | 249 | 88 | 37 | 74 | 227 | 67 |
| 2 | Arizona | 475653.7 | 430083.5 | 377143.3 | 368866.6 | 99.9 | 4287595 | 5048.0 | 43960 | 173944111.0 | 2042 | 7609 | 2072 | 749 | 2665 | 6111 | 1682 |
| 3 | Arkansas | 165989.3 | 160254.2 | 142906.9 | 128662.4 | 86.6 | 1755536 | 2056.0 | 12917 | 34563485.0 | 619 | 2091 | 540 | 232 | 714 | 1587 | 457 |
| 4 | California | 3641643.4 | 3006647.3 | 2464043.4 | 2352361.6 | 112.47 | 25300974 | 26878.0 | 213223 | 1348767079.0 | 11166 | 64878 | 10156 | 4790 | 14977 | 32952 | 6850 |


Every state experiences all types of phone scams, so what we might consider is predicting the severity or frequency of each type of scam, not just its presence or absence.

In this case, I plan to treat the data on each fraud topic as continuous variables to be used in the predictive model. For example, the number of scam reports or the economic losses caused by such scams could be used as target variables, rather than simple presence/absence (yes/no) binary labels. So the choice of dependent variable is the number of reports (continuous variable) on the topic of telephone fraud, which includes bank fraud, credit card fraud, employment or tax-related fraud, government document or benefit fraud, loan or lease fraud, other identity theft, and telephone or utility fraud.
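
To illustrate what treating several fraud topics as continuous targets means in practice, here is a minimal sketch on synthetic data. It relies on the fact that scikit-learn's `LinearRegression` accepts a two-dimensional target and fits one set of coefficients per target column. The arrays `X_toy`, `Y_toy`, and `true_coef` are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: 51 "states", 3 features, 2 continuous targets (two fraud types)
X_toy = rng.normal(size=(51, 3))
true_coef = np.array([[2.0, 0.0, -1.0],
                      [0.5, 3.0, 0.0]])
Y_toy = X_toy @ true_coef.T + 0.1 * rng.normal(size=(51, 2))

# A 2-D target yields one row of coefficients per target column
multi_model = LinearRegression().fit(X_toy, Y_toy)
print(multi_model.coef_.shape)  # (n_targets, n_features)
print(multi_model.coef_.round(2))
```

Each row of `coef_` can then be read like the coefficients of a separate regression for one fraud type.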

---
Building a multi-output regression model

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Preparing the input features (independent variables)
X = eco_edu_f_t_2022_df[['Gross domestic product (GDP)', 'Personal income', 'Disposable personal income',
          'Personal consumption expenditures', 'Regional price parities (RPPs) 9', 'Total employment (number of jobs)',
          'Number of persons age 25 and over (in thousands)']]

# Prepare the target variables for multiple scam types
y = eco_edu_f_t_2022_df[['Bank Fraud', 'Credit Card Fraud', 'Employment or Tax-Related Fraud', 
          'Government Documents or Benefits Fraud', 'Loan or Lease Fraud', 
          'Other Identity Theft', 'Phone or Utilities Fraud']]

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the multioutput linear regression model
linear_model = MultiOutputRegressor(LinearRegression())
linear_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = linear_model.predict(X_test)

# Calculate overall evaluation metrics
linear_overall_mse = mean_squared_error(y_test, y_pred, multioutput='uniform_average')
linear_overall_rmse = np.sqrt(linear_overall_mse)
linear_overall_mae = mean_absolute_error(y_test, y_pred, multioutput='uniform_average')
linear_overall_r2 = r2_score(y_test, y_pred, multioutput='uniform_average')

# Print out the overall metrics
print("Overall Mean Squared Error (MSE):", linear_overall_mse)
print("Overall Root Mean Squared Error (RMSE):", linear_overall_rmse)
print("Overall Mean Absolute Error (MAE):", linear_overall_mae)
print("Overall R-squared (R2):", linear_overall_r2)

Overall Mean Squared Error (MSE): 10279735.703671264
Overall Root Mean Squared Error (RMSE): 3206.2026922313044
Overall Mean Absolute Error (MAE): 1504.0388958756762
Overall R-squared (R2): 0.6888235549932845


Advanced Multiple Output Linear Modeling

In [15]:
import time
import sys
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Calculate the start time of the model
start_time = time.time()


# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the multioutput linear regression model
linear_model = MultiOutputRegressor(LinearRegression())
linear_model.fit(X_train, y_train)

# Make predictions on both the train and test sets
y_train_pred = linear_model.predict(X_train)
y_test_pred = linear_model.predict(X_test)

# Calculate metrics for training data
linear_train_mse = mean_squared_error(y_train, y_train_pred, multioutput='uniform_average')
linear_train_rmse = np.sqrt(linear_train_mse)
linear_train_mae = mean_absolute_error(y_train, y_train_pred, multioutput='uniform_average')
linear_train_r2 = r2_score(y_train, y_train_pred, multioutput='uniform_average')

# Calculate metrics for testing data
linear_test_mse = mean_squared_error(y_test, y_test_pred, multioutput='uniform_average')
linear_test_rmse = np.sqrt(linear_test_mse)
linear_test_mae = mean_absolute_error(y_test, y_test_pred, multioutput='uniform_average')
linear_test_r2 = r2_score(y_test, y_test_pred, multioutput='uniform_average')

# Calculating Runtime
linear_end_time = time.time()
linear_elapsed_time = linear_end_time - start_time

# Display Indicator Tables
linear_metrics_df = pd.DataFrame({
    "Metric": ["MSE", "RMSE", "MAE", "R²"],
    "Train": [linear_train_mse, linear_train_rmse, linear_train_mae, linear_train_r2],
    "Test": [linear_test_mse, linear_test_rmse, linear_test_mae, linear_test_r2]
})

print(linear_metrics_df)

# Print out the training time of the model
print("Training time:", linear_elapsed_time, "seconds")

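# Approximate model memory usage
# Note: sys.getsizeof returns only the shallow size of the estimator object;
# it does not include the fitted coefficient arrays the object refers to
print("Model size:", sys.getsizeof(linear_model), "bytes")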


  Metric      Train        Test
0    MSE 3799680.28 10279735.70
1   RMSE    1949.28     3206.20
2    MAE    1052.13     1504.04
3     R²       0.85        0.69
Training time: 0.010259151458740234 seconds
Model size: 56 bytes


The multi-output regression model performed reasonably well in predicting report counts for the different fraud types: the coefficient of determination on the test set was 0.6888, meaning the model explains about 68.88% of the variability in the test data. However, the average prediction error is high (RMSE of about 3,206 cases and MAE of about 1,504 cases), and the drop from a training R² of 0.85 to a test R² of 0.69 suggests some overfitting. To address this, I next use a Lasso model, which adds regularization to the multi-output regression.
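
With only about 51 rows, a single 80/20 split leaves roughly 11 states in the test set, so the test metrics can change noticeably with the random seed. As a hedged sketch, k-fold cross-validation gives a more stable picture. The data below is synthetic, and `X_toy` and `Y_toy` are placeholders for the real feature and target tables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a 51-row table with 4 features and 3 targets
X_toy = rng.normal(size=(51, 4))
Y_toy = X_toy @ rng.normal(size=(4, 3)) + rng.normal(size=(51, 3))

# 5-fold CV: every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_toy, Y_toy, cv=cv, scoring="r2")

print("R2 per fold:", scores.round(3))
print("Mean and std:", scores.mean().round(3), scores.std().round(3))
```

The spread across folds indicates how much of a difference between two models could be due to the split alone.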

---
Building the Lasso model

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8]  
X = eco_edu_f_t_2022_df[features]

# Initialize and apply standardizers
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Select fraud-related columns as output variables
scam_features = eco_edu_f_t_2022_df.columns[10:]  
Y = eco_edu_f_t_2022_df[scam_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Initialize Lasso regression model, set more iterations
lasso_model = Lasso(alpha=0.1, max_iter=10000)

# Training model
lasso_model.fit(X_train, Y_train)

# Evaluating models using test data
lasso_score = lasso_model.score(X_test, Y_test)

# Predictions on the test set using the model
Y_pred = lasso_model.predict(X_test)

# Calculation of assessment indicators
r2 = r2_score(Y_test, Y_pred)
mse = mean_squared_error(Y_test, Y_pred)
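# Average of the per-target RMSEs (the same value squared=False returned;
# that option is deprecated in recent scikit-learn releases)
rmse = (mean_squared_error(Y_test, Y_pred, multioutput='raw_values') ** 0.5).mean()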
mae = mean_absolute_error(Y_test, Y_pred)

# Print out the results of the assessment indicators
print(f'R² score: {r2}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'Mean Absolute Error: {mae}')

R² score: 0.7094341992869276
Mean Squared Error: 7888587.031392873
Root Mean Squared Error: 2452.789819708901
Mean Absolute Error: 1408.8104248849302




The Lasso regression model explains about 70.9% of the variability in the test data when predicting the number of fraud cases. While this shows that the model has some predictive power, the average prediction error (RMSE of about 2,453 cases and MAE of about 1,409 cases) leaves room for improvement. To improve accuracy, I next tune the regularization strength (alpha) using cross-validation.
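
One detail worth noting about the workflow: fitting `StandardScaler` on the full table before splitting lets the test rows influence the scaling statistics. A possible alternative, sketched here on synthetic data with a single target for brevity, wraps the scaler and the model in a scikit-learn pipeline so that the scaler is fitted on the training rows only. All variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic features on very different scales
X_toy = rng.normal(size=(51, 3)) * np.array([1e5, 1e2, 1.0])
y_toy = X_toy[:, 0] / 1e5 + X_toy[:, 2] + 0.1 * rng.normal(size=51)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The scaler's mean and variance are learned from X_tr only, then reused on X_te
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10000))
pipe.fit(X_tr, y_tr)
print("Test R2:", round(pipe.score(X_te, y_te), 3))
```

On a dataset this small the effect on the metrics may be minor, but the pipeline also keeps scaling and model together when cross-validating.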

Optimized lasso model

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import MultiTaskLassoCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8] 
X = eco_edu_f_t_2022_df[features]

# Initialize and apply standardizers
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Select fraud-related columns as output variables
scam_features = eco_edu_f_t_2022_df.columns[10:]  
Y = eco_edu_f_t_2022_df[scam_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Initialize the MultiTaskLassoCV regression model and use cross-validation to determine the best alpha
lasso = MultiTaskLassoCV(cv=5, random_state=42, max_iter=10000)

# Training Model
lasso.fit(X_train, Y_train)

# Predictions on the test set using the model
Y_pred = lasso.predict(X_test)

# Calculation of assessment indicators
r2 = r2_score(Y_test, Y_pred, multioutput='uniform_average')
mse = mean_squared_error(Y_test, Y_pred, multioutput='uniform_average')
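# Average of the per-target RMSEs (the same value squared=False returned;
# that option is deprecated in recent scikit-learn releases)
rmse = (mean_squared_error(Y_test, Y_pred, multioutput='raw_values') ** 0.5).mean()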
mae = mean_absolute_error(Y_test, Y_pred, multioutput='uniform_average')

# Print out the results of the assessment indicators
print(f'Optimal alpha: {lasso.alpha_}')
print(f'R² score: {r2}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'Mean Absolute Error: {mae}')

Optimal alpha: 1453.5765059662886
R² score: 0.7730662522207071
Mean Squared Error: 6813601.013380618
Root Mean Squared Error: 2203.7093222165217
Mean Absolute Error: 1282.754843596172
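
Note that the RMSE printed above (about 2,204) is not the square root of the printed MSE, which would be about 2,610. The likely reason is that the two numbers aggregate the seven targets differently: one averages the per-target RMSEs, the other takes the root of the averaged MSE. The toy example below, with made-up numbers, shows how far the two can diverge.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Two targets with very different error magnitudes (made-up numbers)
Y_true = np.array([[0.0, 0.0], [0.0, 0.0]])
Y_hat = np.array([[1.0, 10.0], [1.0, 10.0]])

# One MSE per target: 1 for the first column, 100 for the second
per_target_mse = mean_squared_error(Y_true, Y_hat, multioutput="raw_values")

rmse_of_mean_mse = np.sqrt(per_target_mse.mean())  # sqrt((1 + 100) / 2)
mean_of_rmses = np.sqrt(per_target_mse).mean()     # (1 + 10) / 2

print(rmse_of_mean_mse, mean_of_rmses)
```

Either definition is defensible, but model comparisons are only meaningful when every model is scored with the same one.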




Advanced lasso modeling

In [18]:
import pandas as pd
import time
import sys
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import MultiTaskLassoCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# Start counting
start_time = time.time()

# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8] 
X = eco_edu_f_t_2022_df[features]

# Initialize and apply standardizers
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Select fraud-related columns as output variables
scam_features = eco_edu_f_t_2022_df.columns[10:]  
Y = eco_edu_f_t_2022_df[scam_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Initialize the MultiTaskLassoCV regression model and use cross-validation to determine the best alpha
lasso = MultiTaskLassoCV(cv=5, random_state=42, max_iter=10000)

# Training Models
lasso.fit(X_train, Y_train)

# Predictions using the model on the training and test sets
Y_train_pred = lasso.predict(X_train)
Y_test_pred = lasso.predict(X_test)

# Compute various evaluation metrics for training and test sets
lasso_train_r2 = r2_score(Y_train, Y_train_pred, multioutput='uniform_average')
lasso_train_mse = mean_squared_error(Y_train, Y_train_pred, multioutput='uniform_average')
lasso_train_rmse = np.sqrt(lasso_train_mse)
lasso_train_mae = mean_absolute_error(Y_train, Y_train_pred, multioutput='uniform_average')

lasso_test_r2 = r2_score(Y_test, Y_test_pred, multioutput='uniform_average')
lasso_test_mse = mean_squared_error(Y_test, Y_test_pred, multioutput='uniform_average')
lasso_test_rmse = np.sqrt(lasso_test_mse)
lasso_test_mae = mean_absolute_error(Y_test, Y_test_pred, multioutput='uniform_average')

# Stopping the timer and calculating the running time
lasso_end_time = time.time()
lasso_elapsed_time = lasso_end_time - start_time

# Creating and displaying indicator tables
lasso_metrics_df = pd.DataFrame({
    'Metric': ['MSE', 'RMSE', 'MAE', 'R²'],
    'Train': [lasso_train_mse, lasso_train_rmse, lasso_train_mae, lasso_train_r2],
    'Test': [lasso_test_mse, lasso_test_rmse, lasso_test_mae, lasso_test_r2]
})

print(lasso_metrics_df)

# Print out the training time of the model
print("Training time:", lasso_elapsed_time, "seconds")

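# Print the approximate memory size of the model object
# Note: sys.getsizeof returns only the shallow size of the estimator object;
# it does not include the fitted coefficient arrays the object refers to
print("Model size:", sys.getsizeof(lasso), "bytes")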


  Metric      Train       Test
0    MSE 6792162.99 6813601.01
1   RMSE    2606.18    2610.29
2    MAE    1180.93    1282.75
3     R²       0.78       0.77
Training time: 0.4518468379974365 seconds
Model size: 56 bytes
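
The reported model size is identical for the linear and Lasso models because `sys.getsizeof` measures only the estimator object itself, not the arrays it holds. As a rough alternative, sketched below under the assumption that serialized size is an acceptable proxy for model size, the pickled length reflects the fitted coefficients as well. `toy_model` and the random data are placeholders.

```python
import pickle
import sys

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(51, 7))
Y_toy = rng.normal(size=(51, 7))

toy_model = LinearRegression().fit(X_toy, Y_toy)

# Shallow size: excludes the coefficient arrays the estimator refers to
shallow_bytes = sys.getsizeof(toy_model)
# Serialized size: includes coef_ and intercept_
pickled_bytes = len(pickle.dumps(toy_model))

print("sys.getsizeof:", shallow_bytes, "bytes")
print("pickled size:", pickled_bytes, "bytes")
```

For model comparison, the pickled size is usually the more informative of the two numbers.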


---
Modeling Random Forests

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler


# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8]  
X = eco_edu_f_t_2022_df[features]

# Initialize and apply standardizers
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Select fraud-related columns as output variables
scam_features = eco_edu_f_t_2022_df.columns[10:]  
Y = eco_edu_f_t_2022_df[scam_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Initialize a random forest regression model
random_forest = RandomForestRegressor(n_estimators=100, random_state=42)

# Training Model
random_forest.fit(X_train, Y_train)

# Predictions on the test set using the model
Y_pred = random_forest.predict(X_test)

# Calculation of assessment indicators
r2 = r2_score(Y_test, Y_pred, multioutput='uniform_average')
mse = mean_squared_error(Y_test, Y_pred, multioutput='uniform_average')
rmse = mean_squared_error(Y_test, Y_pred, squared=False, multioutput='uniform_average')
mae = mean_absolute_error(Y_test, Y_pred, multioutput='uniform_average')

# Print out the results of the assessment indicators
print(f'R² score: {r2}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'Mean Absolute Error: {mae}')

R² score: 0.7693225349031566
Mean Squared Error: 11925503.233315583
Root Mean Squared Error: 2824.606827417942
Mean Absolute Error: 1680.613636363636




The random forest regression model predicts the types of phone fraud cases reasonably well, explaining about 76.93% of the variance in the test data. The prediction errors are still sizeable, however (RMSE of about 2,825 cases and MAE of about 1,681 cases), so I plan to improve the model by tuning its hyperparameters.

Optimized Random Forest Model

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Defining the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],  # Number of trees, adjustable and expandable according to needs
    'max_depth': [10, 20, 30, 40, 50, None]  # Maximum depth of the tree, None means the tree can grow to any depth
}

# Creating Random Forest Regression Model
rf = RandomForestRegressor(random_state=42)

# Set up the grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)

# Training Grid Search Models
grid_search.fit(X_train, Y_train)

# Print the optimal parameters and the corresponding MSE
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(-grid_search.best_score_))

Fitting 5 folds for each of 30 candidates, totalling 150 fits


[CV] END .....................max_depth=10, n_estimators=100; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=100; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=100; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=100; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=100; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=200; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=200; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=200; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=200; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=200; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=300; total time=   0.1s
[CV] END .....................max_depth=10, n_estimators=300; total time=   0.2s
[CV] END ...................

The grid search selects a max_depth of 10 and n_estimators of 300. I use these parameters to build an optimized Random Forest regression model, then train it and evaluate it on the test set to confirm how these parameters perform on held-out data.

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Initializing Random Forest Regression Models with Optimal Parameters
optimized_rf = RandomForestRegressor(n_estimators=300, max_depth=10, random_state=42)

# Training model
optimized_rf.fit(X_train, Y_train)

# Predictions on the test set using the model
Y_pred_optimized = optimized_rf.predict(X_test)

# Calculation of assessment indicators
optimized_r2 = r2_score(Y_test, Y_pred_optimized, multioutput='uniform_average')
optimized_mse = mean_squared_error(Y_test, Y_pred_optimized, multioutput='uniform_average')
optimized_rmse = mean_squared_error(Y_test, Y_pred_optimized, squared=False, multioutput='uniform_average')
optimized_mae = mean_absolute_error(Y_test, Y_pred_optimized, multioutput='uniform_average')

# Print out the results of the assessment indicators
print(f'Optimized R² score: {optimized_r2}')
print(f'Optimized Mean Squared Error: {optimized_mse}')
print(f'Optimized Root Mean Squared Error: {optimized_rmse}')
print(f'Optimized Mean Absolute Error: {optimized_mae}')

Optimized R² score: 0.773827697448578
Optimized Mean Squared Error: 10743368.627943678
Optimized Root Mean Squared Error: 2714.0992173043132
Optimized Mean Absolute Error: 1626.6956132756134




Advanced Random Forest Modeling

In [22]:
import time
import sys
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8]  
X = eco_edu_f_t_2022_df[features]

# Initialize and apply standardizers
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Select fraud-related columns as output variables
scam_features = eco_edu_f_t_2022_df.columns[10:]  
Y = eco_edu_f_t_2022_df[scam_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Initializing Random Forest Regression Models with Optimal Parameters
optimized_rf = RandomForestRegressor(n_estimators=300, max_depth=10, random_state=42)

# Start the timer
start_time = time.time()

# Training model
optimized_rf.fit(X_train, Y_train)

# Predictions using the model on the training and test sets
Y_train_pred = optimized_rf.predict(X_train)
Y_test_pred = optimized_rf.predict(X_test)

# Compute various evaluation metrics for training and test sets
rf_train_r2 = r2_score(Y_train, Y_train_pred, multioutput='uniform_average')
rf_train_mse = mean_squared_error(Y_train, Y_train_pred, multioutput='uniform_average')
rf_train_rmse = np.sqrt(rf_train_mse)
rf_train_mae = mean_absolute_error(Y_train, Y_train_pred, multioutput='uniform_average')

rf_test_r2 = r2_score(Y_test, Y_test_pred, multioutput='uniform_average')
rf_test_mse = mean_squared_error(Y_test, Y_test_pred, multioutput='uniform_average')
rf_test_rmse = np.sqrt(rf_test_mse)
rf_test_mae = mean_absolute_error(Y_test, Y_test_pred, multioutput='uniform_average')

# Stopping the timer and calculating the running time
rf_end_time = time.time()
rf_elapsed_time = rf_end_time - start_time

# Creating and displaying indicator tables
rf_metrics_df = pd.DataFrame({
    'Metric': ['MSE', 'RMSE', 'MAE', 'R²'],
    'Train': [rf_train_mse, rf_train_rmse, rf_train_mae, rf_train_r2],
    'Test': [rf_test_mse, rf_test_rmse, rf_test_mae, rf_test_r2]
})

print(rf_metrics_df)

# Print out the training time of the model
print("Training time:", rf_elapsed_time, "seconds")

# Print the shallow size of the model object (sys.getsizeof does not include the fitted trees)
print("Model size:", sys.getsizeof(optimized_rf), "bytes")


  Metric      Train        Test
0    MSE 2193164.35 10743368.63
1   RMSE    1480.93     3277.71
2    MAE     536.52     1626.70
3     R²       0.94        0.77
Training time: 0.12435722351074219 seconds
Model size: 56 bytes
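
The test RMSE in this table (3277.71) differs from the value printed in In [21] (2714.10) even though the test MSE is identical. The likely reason is the aggregation order: with `squared=False` and `multioutput='uniform_average'`, scikit-learn appears to average the per-target RMSE values, while `np.sqrt` of the averaged MSE takes the root after averaging. A small sketch with made-up numbers (not the fraud data) shows that the two quantities are not the same:

```python
import math

# Made-up example: 3 samples, 2 targets on very different scales
y_true = [[10.0, 100.0], [20.0, 200.0], [30.0, 300.0]]
y_pred = [[12.0, 110.0], [18.0, 190.0], [32.0, 310.0]]

n_targets = len(y_true[0])

# Mean squared error computed separately for each target column
per_target_mse = [
    sum((t[j] - p[j]) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    for j in range(n_targets)
]

# Variant A: average the MSE across targets, then take the square root
rmse_of_mean_mse = math.sqrt(sum(per_target_mse) / n_targets)

# Variant B: take the square root per target, then average
mean_of_rmse = sum(math.sqrt(m) for m in per_target_mse) / n_targets

print("sqrt of averaged MSE:", rmse_of_mean_mse)
print("average of per-target RMSE:", mean_of_rmse)
```

Variant B can never exceed variant A, so RMSE values are only comparable across cells when they use the same variant. The summary tables in this section use variant A.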


---
Building Gradient Boosting Regressor Model

In [23]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8] 
X = eco_edu_f_t_2022_df[features]

# Initialize and apply standardizers
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Select fraud-related columns as output variables
scam_features = eco_edu_f_t_2022_df.columns[10:] 
Y = eco_edu_f_t_2022_df[scam_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Initialize the gradient boosting regression model
gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Use MultiOutputRegressor to make GBM support multiple outputs
multioutput_gbm = MultiOutputRegressor(gbm)

# Training model
multioutput_gbm.fit(X_train, Y_train)

# Predictions on the test set using the model
Y_pred_gbm = multioutput_gbm.predict(X_test)

# Calculation of overall assessment indicators
gbm_r2 = r2_score(Y_test, Y_pred_gbm, multioutput='uniform_average')
gbm_mse = mean_squared_error(Y_test, Y_pred_gbm, multioutput='uniform_average')
gbm_rmse = mean_squared_error(Y_test, Y_pred_gbm, squared=False, multioutput='uniform_average')
gbm_mae = mean_absolute_error(Y_test, Y_pred_gbm, multioutput='uniform_average')

# Print out the results of the assessment indicators
print(f'Overall GBM R² score: {gbm_r2}')
print(f'Overall GBM Mean Squared Error: {gbm_mse}')
print(f'Overall GBM Root Mean Squared Error: {gbm_rmse}')
print(f'Overall GBM Mean Absolute Error: {gbm_mae}')

Overall GBM R² score: 0.592157731933271
Overall GBM Mean Squared Error: 25167487.342226144
Overall GBM Root Mean Squared Error: 3896.8444125867754
Overall GBM Mean Absolute Error: 2194.9013304380273




The gradient boosting model performed only moderately well in predicting the types of phone fraud cases: the overall R² score was 0.5922, meaning the model explained about 59.22% of the variance in the data. Its error metrics are high, with a mean squared error (MSE) of 25,167,487.34, a root mean squared error (RMSE) of 3,896.84, and a mean absolute error (MAE) of 2,194.90, so there is still considerable room to improve prediction accuracy.

The standard GradientBoostingRegressor does not support multiple outputs natively, which is why it is wrapped in MultiOutputRegressor here; the wrapper fits one independent GBM per target. A grid search can still be run on the wrapped model (its parameters are addressed with the `estimator__` prefix), but the same settings are then shared by every target rather than tuned for the multi-output task as a whole. Since the final comparison is based on multi-output accuracy, I chose not to optimize the GBM model.
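
For reference, a minimal sketch of how a grid search over the wrapped model could look. It uses synthetic data from `make_regression` rather than the fraud dataset, and the small grid, `cv=3`, and the variable names `X_demo`, `Y_demo`, `wrapped`, and `demo_grid` are illustrative assumptions, not settings used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in data: 7 features and 3 targets (not the fraud dataset)
X_demo, Y_demo = make_regression(
    n_samples=60, n_features=7, n_targets=3, noise=5.0, random_state=42
)

# One GBM is fitted per target column
wrapped = MultiOutputRegressor(GradientBoostingRegressor(random_state=42))

# Parameters of the inner model are addressed with the "estimator__" prefix
demo_grid = {
    "estimator__n_estimators": [20, 50],
    "estimator__max_depth": [2, 3],
}

search = GridSearchCV(wrapped, demo_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_demo, Y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validation MSE:", -search.best_score_)
```

Each candidate setting is applied to all per-target models at once, and the score is the MSE averaged over targets.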

Advanced Gradient Boosting Regressor Model

In [24]:
import time
import sys
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8] 
X = eco_edu_f_t_2022_df[features]

# Initialize and apply standardizers
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Select fraud-related columns as output variables
scam_features = eco_edu_f_t_2022_df.columns[10:]  
Y = eco_edu_f_t_2022_df[scam_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Initialize the gradient boosting regression model
gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Using the MultiOutputRegressor to make GBM support multiple outputs
multioutput_gbm = MultiOutputRegressor(gbm)

# Start the timer
start_time = time.time()

# Training model
multioutput_gbm.fit(X_train, Y_train)

# Predictions using the model on the training and test sets
Y_train_pred = multioutput_gbm.predict(X_train)
Y_test_pred = multioutput_gbm.predict(X_test)

# Compute various evaluation metrics for training and test sets
gbm_train_r2 = r2_score(Y_train, Y_train_pred, multioutput='uniform_average')
gbm_train_mse = mean_squared_error(Y_train, Y_train_pred, multioutput='uniform_average')
gbm_train_rmse = np.sqrt(gbm_train_mse)
gbm_train_mae = mean_absolute_error(Y_train, Y_train_pred, multioutput='uniform_average')

gbm_test_r2 = r2_score(Y_test, Y_test_pred, multioutput='uniform_average')
gbm_test_mse = mean_squared_error(Y_test, Y_test_pred, multioutput='uniform_average')
gbm_test_rmse = np.sqrt(gbm_test_mse)
gbm_test_mae = mean_absolute_error(Y_test, Y_test_pred, multioutput='uniform_average')

# Stopping the timer and calculating the running time
gbm_end_time = time.time()
gbm_elapsed_time = gbm_end_time - start_time

# Creating and displaying indicator tables
gbm_metrics_df = pd.DataFrame({
    'Metric': ['MSE', 'RMSE', 'MAE', 'R²'],
    'Train': [gbm_train_mse, gbm_train_rmse, gbm_train_mae, gbm_train_r2],
    'Test': [gbm_test_mse, gbm_test_rmse, gbm_test_mae, gbm_test_r2]
})

print(gbm_metrics_df)

# Print out the training time of the model
print("Training time:", gbm_elapsed_time, "seconds")

# Print the shallow size of the wrapper object (sys.getsizeof does not include the fitted models)
print("Model size:", sys.getsizeof(multioutput_gbm), "bytes")


  Metric   Train        Test
0    MSE 5395.30 25167487.34
1   RMSE   73.45     5016.72
2    MAE   41.73     2194.90
3     R²    1.00        0.59
Training time: 0.11009097099304199 seconds
Model size: 56 bytes
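
In the timed cells of this section, the timer starts before `fit` and stops after the predictions and metrics have been computed, so "Training time" includes more than training. A minimal sketch of timing the fit step alone with `time.perf_counter`, which is intended for measuring short intervals. The helper `timed` and the stand-in `fake_fit` are hypothetical names introduced only for this illustration.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) for that call only."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_fit(n):
    """Hypothetical stand-in for model.fit: some CPU-bound work."""
    return sum(i * i for i in range(n))

fit_result, fit_seconds = timed(fake_fit, 100_000)
print("Fit time:", fit_seconds, "seconds")
```

In the notebook this would be called as `timed(multioutput_gbm.fit, X_train, Y_train)`, with prediction and metric computation left outside the timed call.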


---
Building Neural Network Model

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from keras.models import Sequential
from keras.layers import Dense


# Selection of numerical features for normalization
features = eco_edu_f_t_2022_df.columns[1:8] 
X = eco_edu_f_t_2022_df[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Selection of target variables
target_features = eco_edu_f_t_2022_df.columns[10:] 
Y = eco_edu_f_t_2022_df[target_features]

# Split the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Building neural network model
model = Sequential([
    Dense(128, input_dim=X_train.shape[1], activation='relu'),
    Dense(64, activation='relu'),
    Dense(Y_train.shape[1], activation='linear')
])

# Compile the model
model.compile(loss='mse', optimizer='adam', metrics=['mae'])

# Training model
history = model.fit(X_train, Y_train, epochs=100, batch_size=10, verbose=1, validation_split=0.2)

# Predictions on the test set using the model
Y_pred = model.predict(X_test)

# Calculation of assessment indicators
r2 = r2_score(Y_test, Y_pred)
mse = mean_squared_error(Y_test, Y_pred)
rmse = mean_squared_error(Y_test, Y_pred, squared=False)
mae = mean_absolute_error(Y_test, Y_pred)

# Print out the results of the assessment indicators
print("R² score:", r2)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78



The neural network model performed poorly in predicting the types of phone fraud cases: its R² score of 0.2869 means it explains only about 28.69% of the variance in the data. Its error metrics are also high, with a mean squared error (MSE) of 57,701,461.04, a root mean squared error (RMSE) of 5,635.66, and a mean absolute error (MAE) of 3,417.13, all much higher than expected. This suggests that the model structure or training process needs further adjustment and optimization.

Optimized Neural Network Model

In [26]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Build the neural network model
model = Sequential([
    Dense(128, input_dim=X_train.shape[1], activation='relu'),  # first hidden layer (receives the input features)
    Dense(64, activation='relu'),  # second hidden layer
    Dense(Y_train.shape[1], activation='linear')  # output layer, one unit per target
])

# Compile the model
model.compile(loss='mse', optimizer=Adam(learning_rate=0.001), metrics=['mae'])

# Learning rate decay
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=0.00001, verbose=1)

# Train the model, adjust batch size and number of training rounds
history = model.fit(X_train, Y_train, epochs=200, batch_size=20, verbose=1, validation_split=0.2, callbacks=[reduce_lr])

# Predictions on the test set using the model
Y_pred = model.predict(X_test)

# Calculation of assessment indicators
r2 = r2_score(Y_test, Y_pred)
mse = mean_squared_error(Y_test, Y_pred)
rmse = mean_squared_error(Y_test, Y_pred, squared=False)
mae = mean_absolute_error(Y_test, Y_pred)

# Print out the results of the assessment indicators
print("R² score:", r2)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)



Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78



After these changes (an explicit Adam learning rate, learning rate decay on plateau, a larger batch size, and more epochs), the neural network's accuracy improves over the baseline model.
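
To make the learning rate decay concrete, here is a simplified pure-Python sketch of the plateau rule with the same settings as the cell above (`factor=0.1`, `patience=10`, `min_lr=0.00001`). It is not the Keras implementation: it ignores details such as `min_delta` and `cooldown`, the function name `plateau_schedule` is hypothetical, and the validation losses are made up.

```python
def plateau_schedule(val_losses, lr=0.001, factor=0.1, patience=10, min_lr=0.00001):
    """Return the learning rate used in each epoch under a simplified plateau rule."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        schedule.append(lr)  # rate in effect during this epoch
        if loss < best:
            best = loss      # improvement: reset the patience counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule

# Made-up validation losses: three improving epochs, then a long plateau
demo_losses = [5.0, 4.0, 3.0] + [3.0] * 25
demo_schedule = plateau_schedule(demo_losses)

print("First epoch rate:", demo_schedule[0])
print("Last epoch rate:", demo_schedule[-1])
```

With these losses the rate stays at 0.001 until the plateau has lasted ten epochs, then drops by a factor of ten each time the plateau persists for another ten epochs.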

Advanced Neural Network Modeling

In [27]:
import time
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Build the neural network model
model = Sequential([
    Dense(128, input_dim=X_train.shape[1], activation='relu'),  # first hidden layer (receives the input features)
    Dense(64, activation='relu'),  # second hidden layer
    Dense(Y_train.shape[1], activation='linear')  # output layer, one unit per target
])

# Compile the model
model.compile(loss='mse', optimizer=Adam(learning_rate=0.001), metrics=['mae'])

# Learning rate decay
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=0.00001, verbose=1)

# Start the timer
start_time = time.time()

# Train the model, adjust batch size and number of training rounds
history = model.fit(X_train, Y_train, epochs=200, batch_size=20, verbose=1, validation_split=0.2, callbacks=[reduce_lr])

# Predictions on the test set using the model
Y_pred = model.predict(X_test)

# Using the model's predictions on the training set (for calculating training set metrics)
Y_train_pred = model.predict(X_train)

# Compute various evaluation metrics for training and test sets
nn_train_r2 = r2_score(Y_train, Y_train_pred)
nn_train_mse = mean_squared_error(Y_train, Y_train_pred)
nn_train_rmse = np.sqrt(nn_train_mse)
nn_train_mae = mean_absolute_error(Y_train, Y_train_pred)

nn_test_r2 = r2_score(Y_test, Y_pred)
nn_test_mse = mean_squared_error(Y_test, Y_pred)
nn_test_rmse = np.sqrt(nn_test_mse)
nn_test_mae = mean_absolute_error(Y_test, Y_pred)

# Stopping the timer and calculating the running time
nn_end_time = time.time()
nn_elapsed_time = nn_end_time - start_time

# Creating and displaying indicator tables
nn_metrics_df = pd.DataFrame({
    'Metric': ['MSE', 'RMSE', 'MAE', 'R²'],
    'Train': [nn_train_mse, nn_train_rmse, nn_train_mae, nn_train_r2],
    'Test': [nn_test_mse, nn_test_rmse, nn_test_mae, nn_test_r2]
})

print(nn_metrics_df)

# Print out the training time of the model
print("Training time:", nn_elapsed_time, "seconds")

# Print the shallow size of the model object (sys.getsizeof does not include the network weights)
print("Model size:", sys.getsizeof(model), "bytes")




Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78

---
Summarize the performance of all models

In [28]:
import pandas as pd

# Column order for the combined summary table
columns = ['Model', 'MSE Train', 'MSE Test', 'R-Squared Train', 'R-Squared Test',
           'RMSE Train', 'RMSE Test', 'MAE Train', 'MAE Test']

# Metric tables produced by the model cells above
models = [
    ('Linear Model', linear_metrics_df),
    ('Best Lasso Model', lasso_metrics_df),
    ('Best RF Model', rf_metrics_df),
    ('GB Model', gbm_metrics_df),
    ('Best Neural Network Model', nn_metrics_df)
]

# Collect one row per model, then build the DataFrame once
# (concatenating onto an empty DataFrame in a loop triggers a pandas FutureWarning)
rows = []
for model_name, metrics_df in models:
    # Index by metric name so each value can be looked up directly
    m = metrics_df.set_index('Metric')
    rows.append({
        'Model': model_name,
        'MSE Train': m.loc['MSE', 'Train'],
        'MSE Test': m.loc['MSE', 'Test'],
        'R-Squared Train': m.loc['R²', 'Train'],
        'R-Squared Test': m.loc['R²', 'Test'],
        'RMSE Train': m.loc['RMSE', 'Train'],
        'RMSE Test': m.loc['RMSE', 'Test'],
        'MAE Train': m.loc['MAE', 'Train'],
        'MAE Test': m.loc['MAE', 'Test']
    })

combined_df = pd.DataFrame(rows, columns=columns)

# Output the consolidated DataFrame
combined_df






| Model | MSE Train | MSE Test | R-Squared Train | R-Squared Test | RMSE Train | RMSE Test | MAE Train | MAE Test |
|---|---|---|---|---|---|---|---|---|
| Linear Model | 3799680.28 | 10279735.7 | 0.85 | 0.69 | 1949.28 | 3206.2 | 1052.13 | 1504.04 |
| Best Lasso Model | 6792162.99 | 6813601.01 | 0.78 | 0.77 | 2606.18 | 2610.29 | 1180.93 | 1282.75 |
| Best RF Model | 2193164.35 | 10743368.63 | 0.94 | 0.77 | 1480.93 | 3277.71 | 536.52 | 1626.7 |
| GB Model | 5395.3 | 25167487.34 | 1.0 | 0.59 | 73.45 | 5016.72 | 41.73 | 2194.9 |
| Best Neural Network Model | 29133984.54 | 38760396.13 | 0.46 | 0.54 | 5397.59 | 6225.78 | 1939.32 | 2697.93 |


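All models take little time to run and report a similar amount of memory, so I rank them only on the accuracy metrics in the performance summary. (Note that `sys.getsizeof` measures only the shallow size of the model object, which is why the sizes shown above are all 56 bytes.)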

1. **Best Lasso Model**: shows the best stability and generalization overall. Its performance on the training and test sets is nearly identical (R² of 0.78 vs. 0.77) and it has the lowest test errors of all models, which makes it suitable for applications that require reliable predictions on unseen data.

2. **Best RF Model**: despite excellent performance on the training set (R² of 0.94), the drop on the test set (R² of 0.77) indicates overfitting.

3. **Linear Model**: despite a drop in performance on the test set (R² of 0.85 vs. 0.69), it still fits moderately well. This model is suitable for quick preliminary analysis, but more sophisticated modeling may be needed for more complex relationships in the data.

4. **GB Model**: the almost perfect performance on the training set (R² of 1.00) contrasts with a sharp drop on the test set (R² of 0.59), which indicates severe overfitting.

5. **Best Neural Network Model**: the weakest performer among all models, with poor results on both the training and test sets (R² of 0.46 and 0.54).
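
The ranking above can be reproduced from the summary table with a simple rule: sort by test R² (higher is better) and break ties with test MSE (lower is better). The train-minus-test R² gap is printed as a rough overfitting indicator. This is a sketch of one possible ranking rule, not necessarily the exact reasoning used above; the numbers are copied from the summary table.

```python
# (model, train R², test R², test MSE) copied from the summary table
summary = [
    ("Linear Model", 0.85, 0.69, 10279735.70),
    ("Best Lasso Model", 0.78, 0.77, 6813601.01),
    ("Best RF Model", 0.94, 0.77, 10743368.63),
    ("GB Model", 1.00, 0.59, 25167487.34),
    ("Best Neural Network Model", 0.46, 0.54, 38760396.13),
]

# Higher test R² first; ties broken by lower test MSE
ranked = sorted(summary, key=lambda row: (-row[2], row[3]))

for rank, (name, r2_train, r2_test, mse_test) in enumerate(ranked, start=1):
    gap = r2_train - r2_test  # large positive gap suggests overfitting
    print(f"{rank}. {name}: test R²={r2_test:.2f}, train-test R² gap={gap:+.2f}")

ranking = [row[0] for row in ranked]
```

The Lasso and random forest models tie on test R², so the tie-break on test MSE is what places Lasso first.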

* Conclusion

This week I ranked all the predictive models by compiling their accuracy metrics on the training and test sets, together with their runtime and memory usage. Based on the final results, the Lasso model performed best overall, while the neural network model performed worst. The Lasso model may therefore be the most practical choice for studying how the economic indicators of each region predict the topics used in phone scams.