MCA Technology Solutions Private Limited was established in 2015 in Bangalore with the objective of
integrating analytics and technology with business. MCA Technology Solutions helped its clients in areas such as
customer intelligence, forecasting, optimization, risk assessment, web analytics, text mining, and cloud
solutions. The risk assessment vertical at MCA Technology Solutions focused on problems such as fraud detection
and credit scoring. Sachin Kumar, Director at MCA Technology Solutions, Bangalore, was approached by one
of his clients, a commercial bank, to assist them in detecting earnings manipulators among the bank's customers.
The bank provided business loans to small and medium enterprises, and the value of the loans ranged from INR 10
million to 500 million. The bank suspected that its customers might be manipulating earnings to
increase their chance of securing a loan. Saurabh Rishi, the chief data scientist at MCA Technologies, was
assigned the task of developing a use case for predicting earnings manipulations. He was aware of models such
as Benford's law and the Beneish model used for predicting earnings manipulations; however, he was not sure of
their performance, especially in the Indian context. Saurabh decided to develop his own model for predicting
earnings manipulations using data downloaded from the Prowess database maintained by the Centre for
Monitoring Indian Economy (CMIE). Saurabh received information related to earnings manipulators from the
Securities and Exchange Board of India (SEBI) and the LexisNexis database. Data on more than 1200 companies
was collected to develop the model. MCA Technology believed that machine learning algorithms may give
better accuracy compared to traditional models such as the Beneish model used for predicting earnings
manipulation.


In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    mean_absolute_error,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

warnings.filterwarnings('ignore')

# Fix the seed so that splits and models are reproducible
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

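from google.colab import files

# Upload complete_data.csv (1200 non-manipulators and 39 manipulators)
files.upload()
CD = pd.read_csv("complete_data.csv")

# Upload sample_data.csv (220 cases including the 39 manipulators)
files.upload()
SD = pd.read_csv("sample_data.csv")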



Saving complete_data.csv to complete_data.csv


Saving sample_data.csv to sample_data.csv


**1.** The number of manipulators is usually much less than non-manipulators. What kind of modeling
problems can one expect when cases in one class are much lower than the other class in a binary
classification problem? How can one handle these problems?

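Accuracy becomes misleading: a model that predicts “non-manipulator” for every company still looks accurate, because 1200 of the 1239 companies in the complete data are non-manipulators, yet it detects no manipulators at all. The learner is pulled toward the majority class, so it under-detects manipulators (false negatives), and the predicted probabilities for the rare class tend to be too low. Common ways to handle this are to rebalance the training data (undersampling the majority class, as the 220-case sample in effect does, or oversampling the minority class), to weight the classes during training (for example `class_weight="balanced"`), to tune the decision threshold instead of defaulting to 0.5, and to evaluate with recall, precision, ROC-AUC, and PR-AUC instead of accuracy.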
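
To make the accuracy problem concrete, the short sketch below scores a trivial "always non-manipulator" rule using the class counts printed for complete_data.csv. The variable names are illustrative and are not part of the notebook.

```python
# Class counts taken from the notebook output for complete_data.csv.
n_non_manipulators = 1200
n_manipulators = 39
total = n_non_manipulators + n_manipulators

# A trivial "model" that labels every company as a non-manipulator (class 0).
true_negatives = n_non_manipulators
false_negatives = n_manipulators
true_positives = 0

accuracy = true_negatives / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

The rule is right for almost every company and still finds no manipulators, which is why recall and precision for class 1 matter more than accuracy in this problem.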

In [None]:
target = "C-MANIPULATOR"  # change if needed
print("Class counts:\n", CD[target].value_counts())


Class counts:
 C-MANIPULATOR
0    1200
1      39
Name: count, dtype: int64


**2.** Use sample_data.csv (220 cases including 39 manipulators) and develop a logistic regression model
that can be used by MCA Technologies Private Limited for predicting probability of earnings
manipulation. Describe each step of building the model and explain the rationale when necessary

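The model is built in the following steps: create a clean binary target from the `Manipulator` column; drop identifier and target-like columns so the model cannot “cheat”; coerce the predictors to numeric and keep complete cases only; make a stratified 70/30 train-test split so both parts keep the same share of manipulators; fit a logistic regression with `class_weight="balanced"` to offset the class imbalance; evaluate on the hold-out set at a 0.5 threshold; and refit with statsmodels to obtain coefficient significance.

The logistic regression model performs well on the 30% hold-out set (ROC-AUC = 0.91, accuracy = 84.8%). Significant predictors include ACCR, SGI, GMI, and DSRI, all positively related to manipulation risk. At the 0.5 threshold the model detects half of the manipulators (recall = 0.50) with moderate precision (0.60).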
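
The cell below passes `class_weight="balanced"` to scikit-learn. As a rough illustration of what that option does, the sketch here reproduces the documented heuristic `n_samples / (n_classes * count_per_class)` with the training-set class counts implied by the printed shapes and supports (220 cases with 39 manipulators, 66 held out as 54 vs 12, leaving 127 vs 27). The counts are derived, not printed by the notebook, so treat them as an assumption.

```python
import numpy as np

# Assumed training-set class counts: [non-manipulators, manipulators].
y_train_counts = np.array([127, 27])

n_samples = y_train_counts.sum()
n_classes = len(y_train_counts)

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
class_weights = n_samples / (n_classes * y_train_counts)

# After weighting, both classes carry the same total weight in the loss.
weighted_totals = class_weights * y_train_counts

print("class weights:", class_weights)
print("weighted totals:", weighted_totals)
```

Each manipulator ends up counting several times as much as a non-manipulator, which pushes the fitted probabilities for class 1 upward and usually raises recall at the cost of precision.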

In [None]:
# Create a clean binary target
SD["manip"] = SD["Manipulator"].astype(str).str.strip().isin(["Yes", "C", "1"]).astype(int)

# Select predictors: drop ID and target-like columns
# Remove obvious identifier/target columns so the model can’t “cheat.”
# Force everything to numeric (logit requires numeric features).
# Drop columns that became entirely NaN after coercion.
# Drop rows still missing required predictors so we train on complete cases.

drop_cols = [c for c in ["Company ID", "Manipulator", "manip", "C-MANIPULATOR"] if c in SD.columns]
X = SD.drop(columns=drop_cols, errors="ignore")
X = X.apply(pd.to_numeric, errors="coerce")
X = X.loc[:, X.notna().any(axis=0)]

SD_model = pd.concat([SD[["manip"]], X], axis=1).dropna(axis=0)
Y = SD_model["manip"]
X = SD_model.drop(columns=["manip"])

# Train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=RANDOM_SEED, stratify=Y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

# Fit logistic regression with class_weight="balanced"
logit_s = LogisticRegression(
    max_iter=1000, class_weight="balanced", solver="lbfgs", random_state=RANDOM_SEED)
logit_s.fit(X_train, Y_train)

# Evaluate on the test set (threshold = 0.50 first)
YProb = logit_s.predict_proba(X_test)[:, 1]
YPred = (YProb >= 0.5).astype(int)

print("\nConfusion matrix @0.5:")
print(confusion_matrix(Y_test, YPred))

print("\nClassification report @0.5:")
print(classification_report(Y_test, YPred, digits=3))

print("ROC AUC:", roc_auc_score(Y_test, YProb))

predictor_terms = " + ".join([f'Q("{c}")' for c in X.columns])
formula = f"manip ~ {predictor_terms}"

# Recreate the modeling frame for stats models
SD_sm = pd.concat([Y, X], axis=1)

logit_model = smf.logit(formula=formula, data=SD_sm).fit(disp=False)
print(logit_model.summary())

X_train shape: (154, 8)
X_test shape: (66, 8)
Y_train shape: (154,)
Y_test shape: (66,)

Confusion matrix @0.5:
[[50  4]
 [ 6  6]]

Classification report @0.5:
              precision    recall  f1-score   support

           0      0.893     0.926     0.909        54
           1      0.600     0.500     0.545        12

    accuracy                          0.848        66
   macro avg      0.746     0.713     0.727        66
weighted avg      0.840     0.848     0.843        66

ROC AUC: 0.9074074074074074
                           Logit Regression Results                           
Dep. Variable:                  manip   No. Observations:                  220
Model:                          Logit   Df Residuals:                      211
Method:                           MLE   Df Model:                            8
Date:                Fri, 17 Oct 2025   Pseudo R-squ.:                  0.4206
Time:                        21:19:51   Log-Likelihood:                -59.558
converged: 

**3.** Comment on the model you developed by explaining the results; how do you measure the accuracy of
the model?

The logistic regression model developed to predict the probability of earnings manipulation, fitted with statsmodels on all 220 sample cases, is statistically significant overall (p-value below 0.001) with a McFadden pseudo R² of 0.42. Unlike the R² of a linear regression, this is not a share of variance explained; it measures the relative improvement in log-likelihood over an intercept-only model, and a value this high indicates a good fit. Key variables such as DSRI, GMI, AQI, SGI, and ACCR are statistically significant and increase the likelihood of manipulation, while DEPI, SGAI, and LEVI are not significant.

Using a stratified 70/30 train-test split, the model achieved a ROC-AUC of 0.91 on the 66 hold-out cases, indicating a strong ability to rank manipulators above non-manipulators. Accuracy (84.8%) is less informative, because always predicting “non-manipulator” would already score 81.8% (54 of 66). At the 0.5 threshold the model catches 6 of the 12 manipulators (recall = 0.50) with 4 false alarms (precision = 0.60, F1-score = 0.55). Overall, the model is statistically sound and useful for ranking firms by risk of earnings manipulation, but the threshold should be tuned before relying on it to catch most manipulators.
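
The reported metrics can be re-derived by hand from the confusion matrix printed at the 0.5 threshold. The sketch below does that arithmetic and adds the accuracy of a majority-class baseline for comparison; the variable names are illustrative.

```python
# Confusion matrix at the 0.5 threshold, copied from the notebook output
# (rows = actual, columns = predicted; class 1 = manipulator).
tn, fp = 50, 4
fn, tp = 6, 6

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting "non-manipulator" on the same hold-out set.
baseline_accuracy = (tn + fp) / total

print(f"accuracy  = {accuracy:.3f} (baseline {baseline_accuracy:.3f})")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
```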

**4.** What should be the strategy adopted by MCA Technology Solutions to deploy the logistic regression
model developed?

MCA Technology Solutions should use the logistic regression model as a decision-support tool to identify companies with a higher probability of earnings manipulation.
Firms with predicted probabilities above a chosen threshold (for example, 0.5) can be flagged for detailed review before loans or approvals are finalized.
The model should be integrated into MCA’s assessment system, updated regularly with new financial data, and validated against real outcomes to maintain accuracy.
It should serve as a guidance tool to support, not replace, the professional judgment of financial analysts.
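
One way to choose the review threshold, instead of defaulting to 0.5, is to pick the value that minimizes an assumed misclassification cost on validation data. The sketch below is illustrative only: `pick_threshold`, the 10:1 cost ratio, and the eight validation cases are hypothetical and do not come from the notebook or the bank.

```python
import numpy as np

def pick_threshold(y_true, y_prob, cost_fn=10.0, cost_fp=1.0):
    """Return the candidate threshold with the lowest total misclassification cost."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best_thr, best_cost = None, np.inf
    # Every distinct predicted probability is a candidate cut-off.
    for thr in np.unique(y_prob):
        y_pred = (y_prob >= thr).astype(int)
        fn = np.sum((y_true == 1) & (y_pred == 0))  # missed manipulators
        fp = np.sum((y_true == 0) & (y_pred == 1))  # unnecessary reviews
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    return best_thr, best_cost

# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.90]

thr, cost = pick_threshold(y_val, p_val)
print("chosen threshold:", thr, "total cost:", cost)
```

When a missed manipulator is assumed to cost more than an unnecessary review, the chosen threshold tends to fall below 0.5, trading extra reviews for fewer misses.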

**5.** Based on the models developed in questions 3 and 4, suggest a Mscore (Manipulator score) that can be used by regulators to identify potential manipulators.

In [None]:
Model = logit_model.predict(SD_model.drop(columns=["manip"]))

def mscore(prob):
    return prob * 100.0

# Show first 10
ExampleScores = pd.DataFrame({"prob": Model[:10], "M_score": mscore(Model[:10])})
print("Example M-Score (first 10 from SAMPLE set):")
print(ExampleScores)

print("""
Example policy bands:
- 0–29: Low risk → auto
- 30–59: Medium → light review / request docs
- 60–100: High → enhanced review
""")
SD["M_score"] = mscore(logit_model.predict(SD_model.drop(columns=["manip"])))

def band(ms):
    if ms >= 60: return "High"
    if ms >= 30: return "Medium"
    return "Low"

SD["RiskBand"] = SD["M_score"].apply(band)

display(SD.head())

Example M-Score (first 10 from SAMPLE set):
       prob     M_score
0  0.140924   14.092447
1  1.000000  100.000000
2  0.197409   19.740926
3  0.288999   28.899862
4  0.156169   15.616860
5  0.401512   40.151223
6  0.044196    4.419647
7  0.983856   98.385572
8  0.219807   21.980696
9  0.713263   71.326332

Example policy bands:
- 0–29: Low risk → auto
- 30–59: Medium → light review / request docs
- 60–100: High → enhanced review



|   | Company ID | DSRI | GMI | AQI | SGI | DEPI | SGAI | ACCR | LEVI | Manipulator | C-MANIPULATOR | manip | M_score | RiskBand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.624742 | 1.128927 | 7.185053 | 0.366211 | 1.381519 | 1.624145 | -0.166809 | 1.161082 | Yes | 1 | 1 | 14.092447 | Low |
| 1 | 2 | 1.0 | 1.606492 | 1.004988 | 13.081433 | 0.4 | 5.198207 | 0.060475 | 0.986732 | Yes | 1 | 1 | 100.0 | High |
| 2 | 3 | 1.0 | 1.015607 | 1.241389 | 1.475018 | 1.169353 | 0.647671 | 0.036732 | 1.264305 | Yes | 1 | 1 | 19.740926 | Low |
| 3 | 4 | 1.486239 | 1.0 | 0.465535 | 0.67284 | 2.0 | 0.09289 | 0.273434 | 0.680975 | Yes | 1 | 1 | 28.899862 | Low |
| 4 | 5 | 1.0 | 1.369038 | 0.637112 | 0.861346 | 1.454676 | 1.74146 | 0.123048 | 0.939047 | Yes | 1 | 1 | 15.61686 | Low |
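
The M-score above is the predicted probability rescaled to 0 to 100. A regulator may prefer a score on the log-odds scale, which for a logistic model is a linear combination of the eight ratios and so resembles the form of the Beneish M-score. The sketch below is one possible variant, not the notebook's method; `mscore_logit` and `prob_from_mscore` are hypothetical helper names, and the cut-offs simply translate the 30 and 60 policy bands.

```python
import math

def mscore_logit(prob, eps=1e-9):
    """Log-odds M-score: ln(p / (1 - p)), with p clipped away from 0 and 1."""
    p = min(max(prob, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def prob_from_mscore(m):
    """Invert the log-odds score back to a probability."""
    return 1.0 / (1.0 + math.exp(-m))

# Translate the example policy bands (30 and 60 on the 0-100 scale).
medium_cutoff = mscore_logit(0.30)
high_cutoff = mscore_logit(0.60)

print("Medium band starts at M =", round(medium_cutoff, 3))
print("High band starts at M   =", round(high_cutoff, 3))
```

On this scale a score of 0 corresponds to a 50% predicted probability, and positive scores mean manipulation is predicted to be more likely than not.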


**6.** Develop classification and regression tree (CART) model. Describe each step of building the model
and explain the rationale when necessary. What insights do you obtain from the CART model?


In [None]:
#Model Set up
DT = DecisionTreeClassifier(
    criterion="gini",
    max_depth=4,
    min_samples_leaf=10,
    class_weight="balanced",
    random_state=RANDOM_SEED)

#Training the model
DT.fit(X_train, Y_train)

# Predict
Pred = DT.predict_proba(X_test)[:, 1]
XP = (Pred >= 0.50).astype(int)

#Evaluating the model
print("Confusion @0.50:\n", confusion_matrix(Y_test, XP))
print(classification_report(Y_test, XP, digits=3))
# Note: XP holds hard 0/1 labels, so this is the AUC of the thresholded classifier;
# pass Pred (the probabilities) instead to score how well the tree ranks firms.
print("ROC-AUC:", roc_auc_score(Y_test, XP))

# Quick insights
importances = pd.Series(DT.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print("\nTop features:\n", importances.head(5))
print("\nRules (top levels):")
# Same caveat as ROC-AUC: computed from hard labels, not probabilities.
print("PR-AUC:", average_precision_score(Y_test, XP))
print(export_text(DT, feature_names=list(X_train.columns), max_depth=3))



Confusion @0.50:
 [[38 16]
 [ 9  3]]
              precision    recall  f1-score   support

           0      0.809     0.704     0.752        54
           1      0.158     0.250     0.194        12

    accuracy                          0.621        66
   macro avg      0.483     0.477     0.473        66
weighted avg      0.690     0.621     0.651        66

ROC-AUC: 0.47685185185185186

Top features:
 LEVI    0.301861
DSRI    0.264444
SGAI    0.259699
AQI     0.173997
SGI     0.000000
dtype: float64

Rules (top levels):
PR-AUC: 0.17583732057416268
|--- LEVI <= 0.61
|   |--- class: 1
|--- LEVI >  0.61
|   |--- DSRI <= 1.74
|   |   |--- SGAI <= 0.75
|   |   |   |--- class: 1
|   |   |--- SGAI >  0.75
|   |   |   |--- AQI <= 2.02
|   |   |   |   |--- class: 0
|   |   |   |--- AQI >  2.02
|   |   |   |   |--- class: 1
|   |--- DSRI >  1.74
|   |   |--- class: 1



The CART model achieved 62% accuracy on the hold-out set, below the 81.8% obtained by always predicting “non-manipulator”, and it caught only 3 of the 12 manipulators. The reported ROC-AUC (0.48) and PR-AUC (0.18) were computed from hard 0/1 predictions and not from predicted probabilities, but either way the tree shows little predictive power on unseen data. The splits use LEVI, DSRI, SGAI, and AQI, suggesting that firms with a low leverage index (LEVI ≤ 0.61), a high receivables index (DSRI > 1.74), a low SGAI (≤ 0.75), or a high asset quality index (AQI > 2.02) are classified as likely manipulators.

Because the rules generalize poorly here, they are best treated as clear, interpretable red-flag hypotheses for auditors and analysts to investigate, not as a standalone classifier.
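
The ROC-AUC and PR-AUC printed for the tree can be reproduced from its confusion matrix alone, which shows they describe the thresholded 0/1 predictions and not the predicted probabilities. The sketch below does the arithmetic; the variable names are illustrative.

```python
# CART confusion matrix at the 0.50 threshold, copied from the notebook output.
tn, fp = 38, 16
fn, tp = 9, 3

tpr = tp / (tp + fn)   # recall for manipulators
tnr = tn / (tn + fp)   # recall for non-manipulators

# With hard labels the ROC curve has a single interior point, so the area
# under it is the average of the two recalls (balanced accuracy).
auc_from_hard_labels = (tpr + tnr) / 2

# Average precision with hard labels has two steps: the flagged group,
# then everyone (where precision equals the share of manipulators).
precision_flagged = tp / (tp + fp)
prevalence = (tp + fn) / (tn + fp + fn + tp)
ap_from_hard_labels = tpr * precision_flagged + (1 - tpr) * prevalence

print("ROC-AUC from hard labels:", auc_from_hard_labels)
print("PR-AUC from hard labels: ", ap_from_hard_labels)
```

Scoring the probabilities (`Pred` in the cell above) would use every distinct leaf probability as a threshold and could give different values.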

**7.** Develop a logistic regression model using the complete_data.csv (1200 non-manipulators and 39
manipulators), compare the results with the previous logistic regression model.

In [None]:
# Prepare data for the complete dataset

if "C-MANIPULATOR" in CD.columns:
    CD["manip_c"] = CD["C-MANIPULATOR"].astype(int)
else:
    # Fallback if the column name is different; adjust if needed
    CD["manip_c"] = CD["Manipulator"].astype(str).str.strip().isin(["Yes","C","1"]).astype(int)


drop_cols = [c for c in ["Company ID", "Manipulator", "manip_c", "C-MANIPULATOR"] if c in CD.columns]
XC = CD.drop(columns=drop_cols, errors="ignore")
XC = XC.apply(pd.to_numeric, errors="coerce")
XC = XC.loc[:, XC.notna().any(axis=0)]

# Combine target and predictors, then keep complete cases only
CD_model = pd.concat([CD[["manip_c"]], XC], axis=1).dropna(axis=0)
YC = CD_model["manip_c"]
XC = CD_model.drop(columns=["manip_c"])

XC_train, XC_test, YC_train, YC_test = train_test_split(
    XC, YC, test_size=0.30, random_state=RANDOM_SEED, stratify=YC
)

print("XC_train shape:", XC_train.shape)
print("XC_test shape:", XC_test.shape)
print("YC_train shape:", YC_train.shape)
print("YC_test shape:", YC_test.shape)

# Fit logistic regression with class_weight="balanced"
LogitC = LogisticRegression(
    max_iter=1000, class_weight="balanced", solver="lbfgs", random_state=RANDOM_SEED
)
LogitC.fit(XC_train, YC_train)
print("\nLogistic Regression (COMPLETE) trained successfully.")

# Evaluate on the test set (threshold = 0.50 first)
YCProb = LogitC.predict_proba(XC_test)[:, 1]
YCPred = (YCProb >= 0.5).astype(int)

print("\nConfusion matrix @0.5 (COMPLETE):")
print(confusion_matrix(YC_test, YCPred))

print("\nClassification report @0.5 (COMPLETE):")
print(classification_report(YC_test, YCPred, digits=3))

print("ROC AUC (COMPLETE):", roc_auc_score(YC_test, YCProb))

PredictorC = " + ".join([f'Q("{c}")' for c in XC.columns])
FormulaC = f"manip_c ~ {PredictorC}"
CD_sm = pd.concat([YC, XC], axis=1)
LogitModelC = smf.logit(formula=FormulaC, data=CD_sm).fit(disp=False)
print(LogitModelC.summary())

XC_train shape: (867, 8)
XC_test shape: (372, 8)
YC_train shape: (867,)
YC_test shape: (372,)

Logistic Regression (COMPLETE) trained successfully.

Confusion matrix @0.5 (COMPLETE):
[[321  39]
 [  3   9]]

Classification report @0.5 (COMPLETE):
              precision    recall  f1-score   support

           0      0.991     0.892     0.939       360
           1      0.188     0.750     0.300        12

    accuracy                          0.887       372
   macro avg      0.589     0.821     0.619       372
weighted avg      0.965     0.887     0.918       372

ROC AUC (COMPLETE): 0.8953703703703704
                           Logit Regression Results                           
Dep. Variable:                manip_c   No. Observations:                 1239
Model:                          Logit   Df Residuals:                     1230
Method:                           MLE   Df Model:                            8
Date:                Fri, 17 Oct 2025   Pseudo R-squ.:                  

The logistic regression model on the complete dataset is statistically significant (pseudo R² = 0.32, p < 0.001). The pseudo R² is lower than the 0.42 obtained on the 220-case sample; it measures the improvement in log-likelihood over an intercept-only model and should not be read as 32% of variance explained. Key predictors such as DSRI, GMI, AQI, SGI, and ACCR are statistically significant and positively associated with manipulation.

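On the hold-out set the model achieved a ROC-AUC of 0.90, close to the 0.91 of the sample-data model, so its ability to rank firms is similar. Accuracy (88.7%) is below the 96.8% that always predicting “non-manipulator” would give (360 of 372), so it is not a useful yardstick here. At the 0.5 threshold, recall rises from 0.50 to 0.75 (9 of 12 manipulators caught), but precision falls from 0.60 to 0.19 because 39 non-manipulators are also flagged; with so few manipulators in the data, even a modest false-positive rate swamps the true positives.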
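
The drop in precision on the complete data is largely a prevalence effect. The sketch below writes precision in terms of the true-positive rate, the false-positive rate, and the share of manipulators, then evaluates it at the two hold-out prevalences. Holding the rates fixed across datasets is a simplifying assumption made for illustration, and `precision_at` is a hypothetical helper.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by a true-positive rate, false-positive rate, and prevalence."""
    flagged_true = tpr * prevalence
    flagged_false = fpr * (1.0 - prevalence)
    return flagged_true / (flagged_true + flagged_false)

# Rates from the complete-data confusion matrix [[321, 39], [3, 9]].
tpr = 9 / 12
fpr = 39 / 360

# Same rates applied to the two hold-out prevalences.
precision_complete = precision_at(tpr, fpr, 12 / 372)
precision_sample_mix = precision_at(tpr, fpr, 12 / 66)

print(f"precision at complete-data prevalence: {precision_complete:.3f}")
print(f"precision at sample-data prevalence:   {precision_sample_mix:.3f}")
```

With identical error rates, precision is several times higher when manipulators make up a larger share of the cases, so precision figures from the two datasets are not directly comparable.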

**8.** Develop models using machine learning algorithms such as random forest and boosting (or even more
than these two). Using appropriate evaluation approaches to compare the outputs from these methods
with logistic regression and classification tree.

In [None]:
# Define models
LogistReg = LogisticRegression(max_iter=1000, class_weight="balanced", solver="lbfgs", random_state=RANDOM_SEED)
Random    = RandomForestClassifier(n_estimators=200, max_depth=6, class_weight="balanced", random_state=RANDOM_SEED)
GBoost    = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=RANDOM_SEED)


# Train & evaluate helper  (robust to models without predict_proba)
def test_model(model, X_tr, X_te, y_tr, y_te):
    model.fit(X_tr, y_tr)
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_te)[:, 1]
    else:
        # fallback if ever needed
        df = model.decision_function(X_te)
        y_prob = (df - df.min()) / (df.max() - df.min() + 1e-9)
    roc = roc_auc_score(y_te, y_prob)
    pr  = average_precision_score(y_te, y_prob)
    return roc, pr

# Evaluate on SAMPLE DATA (SD)
SD_Scores = []
for name, mdl in [
    ("Logistic", LogistReg),
    ("Random Forest", Random),
    ("Gradient Boosting", GBoost),

]:
    roc, pr = test_model(mdl, X_train, X_test, Y_train, Y_test)
    SD_Scores.append((name + " (SD)", roc, pr))

# Evaluate on COMPLETE DATA (CD)
CD_Scores = []
for name, mdl in [
    ("Logistic", LogistReg),
    ("Random Forest", Random),
    ("Gradient Boosting", GBoost),

]:
    roc, pr = test_model(mdl, XC_train, XC_test, YC_train, YC_test)
    CD_Scores.append((name + " (CD)", roc, pr))

# Combine and show
results = pd.DataFrame(SD_Scores + CD_Scores, columns=["Model", "ROC-AUC", "PR-AUC"])
print("\nModel comparison on SD and CD datasets:")
print(results.sort_values(["Model", "PR-AUC"], ascending=[True, False]))



Model comparison on SD and CD datasets:
                    Model   ROC-AUC    PR-AUC
5  Gradient Boosting (CD)  0.865856  0.346491
2  Gradient Boosting (SD)  0.797068  0.489814
3           Logistic (CD)  0.895370  0.300202
0           Logistic (SD)  0.907407  0.669590
4      Random Forest (CD)  0.884028  0.435887
1      Random Forest (SD)  0.827160  0.607427
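
PR-AUC values from the two datasets are hard to compare directly, because a classifier with no skill scores roughly the share of manipulators in the test set (12 of 66 for SD, 12 of 372 for CD). As a rough aid, the sketch below divides each PR-AUC from the table by that baseline. The "lift" ratio is an illustrative device and is not part of the notebook.

```python
# PR-AUC values copied from the comparison table, keyed by (model, dataset).
pr_auc = {
    ("Logistic", "SD"): 0.669590,
    ("Random Forest", "SD"): 0.607427,
    ("Gradient Boosting", "SD"): 0.489814,
    ("Logistic", "CD"): 0.300202,
    ("Random Forest", "CD"): 0.435887,
    ("Gradient Boosting", "CD"): 0.346491,
}

# Share of manipulators in each hold-out set (approximate no-skill PR-AUC).
prevalence = {"SD": 12 / 66, "CD": 12 / 372}

lift = {key: value / prevalence[key[1]] for key, value in pr_auc.items()}

for (model, dataset), ratio in sorted(lift.items(), key=lambda kv: (kv[0][1], -kv[1])):
    print(f"{dataset}  {model:<18} PR-AUC lift over baseline: {ratio:.1f}x")

best_cd = max((k for k in lift if k[1] == "CD"), key=lambda k: lift[k])
best_sd = max((k for k in lift if k[1] == "SD"), key=lambda k: lift[k])
```

Read this way, every model improves on its baseline by a larger factor on the complete data than on the sample, even though the raw PR-AUC values are lower there.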


**9.** What will be your final recommendation for predicting earnings manipulators? Explain your reasons.

MCA Technology Solutions should deploy the Logistic Regression model (trained on the complete dataset) as the primary predictive tool for detecting potential earnings manipulators. This model offers:

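* Strong discrimination (ROC-AUC = 0.90, the highest of the three models compared on the complete data), ensuring reliable ranking of manipulators above non-manipulators. Accuracy (88.7%) is not a selling point on its own, since always predicting “non-manipulator” would score 96.8%.

* Solid recall (0.75), allowing the system to detect most manipulation cases even in an imbalanced dataset, at the cost of low precision (0.19), so flagged firms still need analyst review.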

* Transparent interpretability, as it clearly identifies the financial indicators (DSRI, GMI, AQI, SGI, and ACCR) most associated with manipulation risk.

As a complementary measure, MCA can use a tree ensemble for advanced pattern detection and to capture non-linear relationships missed by the linear model. Gradient Boosting is one option, although on the complete data Random Forest scored higher on both ROC-AUC (0.88 vs 0.87) and PR-AUC (0.44 vs 0.35). Together, Logistic Regression for explainability and an ensemble for deeper predictive strength provide a balanced and transparent framework for earnings manipulation detection.