## Introduction

In this assignment, we use the following methods to **predict accounting restatements**:
1. OLS
2. Logit
3. LASSO
4. KNN
5. Decision Tree
6. Random Forest
7. Boosting

Below are the descriptions of the chosen **x-variables** and **y-variable**:

| Variable    | Name| Description |
| :-----------: | :----------- | :----------- |
| y-variable  | restatement | Accounting restatement |
| x-variables | ch_inv| <br/><font size="2"> $\frac{\text{Change in inventory}}{\text{total assets}}$</font><br/><br/> |
|             | issue |Dummy variable equal to 1 if the firm recently issued securities (seasoned equity issuance)|
|             | ret |Stock return in the past year          |
|             | leverage |<br/><font size="2"> $\frac{\text{Long-term Debt}}{\text{total assets}}$</font><br/><br/> |
|             | ch_pension |Change in expected return on pension plan assets |
|             | auopic | Indicator variable coded 1 if the firm receives a qualified opinion on internal control from its auditor |
|             | spread | Average bid-ask spread over the past 252 days |
|             | beat |Indicator variable coded 1 if the firm’s earnings meet or beat the analyst forecast |
|             | industry dummies | Included because business models and characteristics differ widely across industries. For example, banks and insurers (ind14) hold far less inventory than pharmaceutical firms (ind15). |
|             | year dummies  |Included to capture time-related effects, since firms' financial statements are prepared under different macroeconomic conditions and under the laws and regulations in force at the time.|

(**Note:** Estimate using only the training sample, and avoid perfect multicollinearity among the industry and year dummies.)
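
The multicollinearity point in the note can be illustrated with a small sketch. It assumes three hypothetical industries and six made-up firms; none of these values come from `data_final.csv`. A constant plus a full set of dummies is rank deficient, and dropping one category restores full rank. This is why the notebook's `col_ind` keeps `ind1` to `ind14` and leaves out `ind15`.

```python
import numpy as np

# Hypothetical industry codes for six firms (three industries).
industry = np.array([0, 1, 2, 0, 1, 2])

# One 0/1 column per industry: the columns sum to 1 in every row.
dummies_full = (industry[:, None] == np.arange(3)).astype(float)
const = np.ones((len(industry), 1))

# Constant + all dummies: the constant equals the sum of the dummy columns,
# so the design matrix is rank deficient (perfect multicollinearity).
X_trap = np.hstack([const, dummies_full])

# Dropping one category (the reference group) restores full column rank.
X_ok = np.hstack([const, dummies_full[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], rank_trap)  # number of columns vs. rank
print(X_ok.shape[1], rank_ok)
```

The coefficients on the remaining dummies are then read relative to the omitted category.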

###  1. OLS Regression

Introduction
>xxxxx

Usefulness
>xxxxx

Limitation
>xxxxx

In [314]:
# General import
import pandas as pd
import statsmodels.api as sm
import numpy as np
import sklearn
import os   

# Display the table of dataset
os.getcwd()
df_data_final = pd.read_csv('C:\\Users\\Yowpe\\Downloads\\Code\\data_final.csv')
df_data_final

Unnamed: 0,gvkey,date,fyear,foreign,big4,audit_ch,audit_tenure,ret,lagret1,ffind,...,ind8,ind9,ind10,ind11,ind12,ind13,ind14,ind15,test,validation
0,1004,31-May-02,2001,0,1,0,18,-0.051837,0.139215,Retail,...,0,0,0,0,1,0,0,0,0,0
1,1004,31-May-03,2002,0,1,0,19,-0.541911,-0.051837,Retail,...,0,0,0,0,1,0,0,0,0,0
2,1004,31-May-04,2003,0,1,0,20,0.918543,-0.541911,Retail,...,0,0,0,0,1,0,0,0,0,0
3,1004,31-May-05,2004,0,1,0,21,0.574125,0.918543,Retail,...,0,0,0,0,1,0,0,0,0,0
4,1004,31-May-06,2005,0,1,0,22,0.375776,0.574125,Retail,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54349,297209,31-Dec-12,2012,1,1,0,4,-0.144719,0.000000,Refining & Extractive,...,0,0,0,0,0,0,0,0,1,0
54350,297209,31-Dec-13,2013,1,1,0,5,-0.102787,-0.144719,Refining & Extractive,...,0,0,0,0,0,0,0,0,1,0
54351,297209,31-Dec-14,2014,1,1,0,6,-0.703434,-0.102787,Refining & Extractive,...,0,0,0,0,0,0,0,0,1,0
54352,315887,31-Dec-14,2014,1,1,0,2,-0.861493,0.000000,Transportation,...,0,0,1,0,0,0,0,0,1,0


In [313]:
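# Create year dummies, keeping the original year in fyear_new for later grouping
df_data_final['fyear_new'] = df_data_final['fyear']
df_data_final = pd.get_dummies(df_data_final, columns=['fyear'])

# Year dummies fyear_2001 ... fyear_2010 and industry dummies ind1 ... ind14
col_year = ['fyear_{}'.format(2001 + i) for i in range(10)]
col_ind = ['ind{}'.format(i + 1) for i in range(14)]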

In [310]:
# Data Split (Training set & Testing Set)
cond_0 = (df_data_final['test'] == 0) 
cond_1 = (df_data_final['test'] == 1)
    
X_columns = ['ch_inv','issue','ret','leverage','ch_pension','auopic','spread','beat'] + col_ind + col_year
x = df_data_final.loc[cond_0, X_columns] 
x = sm.add_constant(x) 

y = df_data_final.loc[cond_0, 'restatement']

model = sm.OLS(y, x).fit()
model.summary()

0,1,2,3
Dep. Variable:,restatement,R-squared:,0.045
Model:,OLS,Adj. R-squared:,0.044
Method:,Least Squares,F-statistic:,63.13
Date:,"Thu, 10 Nov 2022",Prob (F-statistic):,0.0
Time:,18:44:47,Log-Likelihood:,-2313.9
No. Observations:,43031,AIC:,4694.0
Df Residuals:,42998,BIC:,4980.0
Df Model:,32,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0183,0.007,2.454,0.014,0.004,0.033
ch_inv,0.0643,0.035,1.821,0.069,-0.005,0.133
issue,0.0273,0.004,6.401,0.000,0.019,0.036
ret,-0.0026,0.002,-1.172,0.241,-0.007,0.002
leverage,0.0280,0.007,3.882,0.000,0.014,0.042
ch_pension,1.4543,0.594,2.448,0.014,0.290,2.619
auopic,0.1869,0.007,25.670,0.000,0.173,0.201
spread,-0.1188,0.079,-1.498,0.134,-0.274,0.037
beat,-0.0060,0.003,-2.245,0.025,-0.011,-0.001

0,1,2,3
Omnibus:,26299.433,Durbin-Watson:,0.805
Prob(Omnibus):,0.0,Jarque-Bera (JB):,182734.188
Skew:,3.059,Prob(JB):,0.0
Kurtosis:,11.031,Cond. No.,739.0


### a)  R-square (OLS)

R-square = 0.030

### b) R-square (Manual Calculation)

In [4]:
model.ess 

88.75328198905163

In [5]:
model.ssr

2848.5723669617037

In [6]:
ess = model.ess
tss = model.ssr + ess
r_2 = ess/tss
print('R^2 =', r_2)

R^2 = 0.03021567663794965
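
The manual calculation relies on the decomposition TSS = ESS + RSS, which holds for OLS with a constant. The sketch below checks that identity on synthetic data (the data and coefficients are made up for illustration). It uses only NumPy, with `rss` playing the role of statsmodels' `model.ssr`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Constant plus two hypothetical regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

# OLS fit by least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

ess = np.sum((fitted - y.mean()) ** 2)  # explained sum of squares
rss = np.sum((y - fitted) ** 2)         # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares

r2_from_ess = ess / tss
r2_from_rss = 1 - rss / tss
print(r2_from_ess, r2_from_rss)
```

Both expressions give the same R-square, so `ess / (ssr + ess)` is a valid manual check.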


###  c) Adjusted R-square (OLS)

Adjusted R-square = 0.030 

### d) Adjusted R-square (Manual Calculation)

$Adjusted$ $R^2$ $= 1-\frac{\frac{RSS}{N-K-1}}{\frac{TSS}{N-1}}$

In [7]:
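ess = model.ess
rss = model.ssr
tss = model.ssr + ess

n = len(y)
# Note: x includes the constant column, so x.shape[1] is one more than the
# number of regressors K in the formula above.
k = x.shape[1]

adjusted_r_2 = 1 - ((rss / (n - k - 1)) / (tss / (n - 1)))
print('Adjusted R^2 =', adjusted_r_2)

Adjusted R^2 = 0.029516513540570055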


In [8]:
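# Algebraically the same formula as the previous cell, since RSS/TSS = 1 - R^2;
# the two results differ only by floating-point rounding.
adjusted_r_2 = 1 - (1-r_2)*(len(y)-1)/(len(y)-x.shape[1]-1)

adjusted_r_2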

0.029516513540570166
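
The two manual calculations are the same formula written in two ways. The sketch below shows this with illustrative numbers that are not taken from the regression. The helper names are hypothetical, and `n_regressors` is assumed to count slope coefficients only, without the intercept, as in the $N-K-1$ term of the formula.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R^2 from R^2; n_regressors excludes the intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


def adjusted_r2_from_sums(rss, tss, n_obs, n_regressors):
    """Adjusted R^2 from the residual and total sums of squares."""
    return 1 - (rss / (n_obs - n_regressors - 1)) / (tss / (n_obs - 1))


# Illustrative values only.
rss, tss, n_obs, n_regressors = 80.0, 100.0, 50, 3
r2 = 1 - rss / tss

a = adjusted_r2(r2, n_obs, n_regressors)
b = adjusted_r2_from_sums(rss, tss, n_obs, n_regressors)
print(a, b)
```

If the design matrix carries a constant column, its column count is one more than the number of regressors. Under the assumption above, `x.shape[1] - 1` would be the value to pass.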

### e) Variable Significance Levels

In [9]:
p_value = pd.DataFrame({'pvalue': model.pvalues,})
p_value

Unnamed: 0,pvalue
const,0.008772734
ch_inv,0.3610245
ch_cs,0.03980055
ch_emp,0.0001959284
oplease,6.839848e-05
issue,2.158107e-10
ret,0.01357181
ind1,0.7965258
ind2,0.001987462
ind3,0.5579976


In [10]:
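# errors='ignore' keeps the cell from failing if 'const' is not in the filtered set
P1_cols = p_value[p_value.pvalue < 0.01].drop('const', errors='ignore').index.tolist()
P2_cols = p_value[p_value.pvalue < 0.05].drop('const', errors='ignore').drop(P1_cols).index.tolist()
P3_cols = p_value[p_value.pvalue < 0.10].drop('const', errors='ignore').drop(P1_cols).drop(P2_cols).index.tolist()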

print('Variables significant at 1% : ',P1_cols) 
print('Variables significant at 5% : ',P2_cols) 
print('Variables significant at 10% : ',P3_cols) 

Variables significant at 1% :  ['ch_emp', 'oplease', 'issue', 'ind2', 'ind5', 'ind6', 'ind9', 'ind11', 'ind12', 'ind13', 'ind14', 'fyear_2001', 'fyear_2002', 'fyear_2003', 'fyear_2004', 'fyear_2005']
Variables significant at 5% :  ['ch_cs', 'ret']
Variables significant at 10% :  ['ind7']


Based on the p-values, the following rules were used to determine each variable's significance level:

1. p < 0.01: "The estimated coefficient is significant at the 1% level"
2. p < 0.05: "The estimated coefficient is significant at the 5% level"
3. p < 0.10: "The estimated coefficient is significant at the 10% level"
4. p ≥ 0.10: "The estimated coefficient is not statistically significant, as it is not statistically different from 0"
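
The rules above can be written as a small helper. This is a minimal sketch; the function name is hypothetical, and the example p-values are copied from the p-value table shown earlier.

```python
def significance_level(p_value):
    """Map a p-value to the strictest conventional level it passes."""
    if p_value < 0.01:
        return "1%"
    if p_value < 0.05:
        return "5%"
    if p_value < 0.10:
        return "10%"
    return "not significant"


# p-values copied from the table above.
examples = {
    "issue": 2.158107e-10,
    "ret": 0.01357181,
    "ch_inv": 0.3610245,
    "ind1": 0.7965258,
}
labels = {name: significance_level(p) for name, p in examples.items()}
print(labels)
```

Each variable is assigned to one level only, which matches how `P1_cols`, `P2_cols` and `P3_cols` exclude variables already captured at a stricter level.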


Variables significant at 1% = ch_emp, oplease, issue, ind2, ind5, ind6, ind9, ind11, ind12, ind13, ind14, fyear_2001, fyear_2002, fyear_2003, fyear_2004, fyear_2005

Variables significant at 5% = ch_cs, ret

Variables significant at 10% = ind7


### f) Interpret the meaning of estimated coefficients 

1. <u> Issue </u>
<br>Issue has a positive and significant effect on a firm’s likelihood of having an accounting restatement. Holding all other firm characteristics equal, a firm that recently issued securities is 2.7 percentage points more likely to have an accounting restatement than a firm that did not.
<br>

2. <u> Ind 9 </u> 
<br>Ind 9 has a positive and significant effect on a firm’s likelihood of having an accounting restatement. Holding all other firm characteristics equal, a firm in industry 9 is 4.2 percentage points more likely to have an accounting restatement than a firm in the omitted reference industry.
<br>

3. <u> Fyear 2001 </u> 
<br>Fyear 2001 has a positive and significant effect on a firm’s likelihood of having an accounting restatement. Holding all other firm characteristics equal, a firm-year in fiscal year 2001 is 6.8 percentage points more likely to have an accounting restatement than a firm-year in the omitted reference years.
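
Coefficients on dummy variables in an OLS model with a binary outcome are differences in probability, so they are read in percentage points. A minimal sketch with made-up data (ten hypothetical firms, a single dummy and no other controls) shows that the slope equals the difference in restatement rates between the two groups.

```python
import numpy as np

# Hypothetical data: d = 1 if the firm issued securities, y = 1 if it restated.
d = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1], dtype=float)

# OLS of y on a constant and the dummy.
X = np.column_stack([np.ones_like(d), d])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

rate_without = y[d == 0].mean()  # restatement rate when d = 0
rate_with = y[d == 1].mean()     # restatement rate when d = 1
print(intercept, slope, rate_with - rate_without)
```

With further controls the coefficient is no longer a raw difference in means, but it is still a change in probability with the other regressors held fixed.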





###  g) Change in the likelihood of a restatement when an x-variable moves from the 25th to the 75th percentile

In [11]:
df_data_final['restatement'].describe() 

count    54354.000000
mean         0.066214
std          0.248658
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: restatement, dtype: float64

$\;\;\;\;\;$**i. Ch_cs**

In [12]:
df_data_final['ch_cs'].describe() 

count    54354.000000
mean         0.046719
std          1.256693
min         -6.692308
25%         -0.088095
50%          0.051227
75%          0.203228
max          6.757576
Name: ch_cs, dtype: float64

In [13]:
(0.203228--0.088095) * 0.0020

0.000582646

In [14]:
0.06/7.0

0.008571428571428572

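Interpretation: If a firm's change in accounts receivable moves from the 25th to the 75th percentile, the predicted probability of a restatement rises by about 0.06 percentage points, which is about 0.9% of the mean restatement probability (roughly 7%).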


$\;\;\;\;\;$**ii. Ret**

In [15]:
df_data_final['ret'].describe() 

count    54354.000000
mean         0.067745
std          0.567142
min         -0.861493
25%         -0.255567
50%         -0.023484
75%          0.246067
max          2.836037
Name: ret, dtype: float64

In [16]:
(0.246067--0.255567) * -0.0055

-0.002758987

In [17]:
-0.28/7.0

-0.04

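Interpretation: If a firm's stock return in the past year moves from the 25th to the 75th percentile, the predicted probability of a restatement falls by about 0.28 percentage points, which is about 4% of the mean restatement probability (roughly 7%).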

$\;\;\;\;\;$**iii. Ch_emp**

In [18]:
df_data_final['ch_emp'].describe() 

count    54354.000000
mean        -0.041584
std          0.277849
min         -1.371319
25%         -0.118597
50%         -0.024756
75%          0.058048
max          0.896901
Name: ch_emp, dtype: float64

In [19]:
(0.058048--0.118597) * -0.0168

-0.0029676359999999996

In [20]:
-0.30/7.0

-0.04285714285714286

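Interpretation: If a firm's change in employees moves from the 25th to the 75th percentile, the predicted probability of a restatement falls by about 0.30 percentage points, which is about 4.3% of the mean restatement probability (roughly 7%).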

$\;\;\;\;\;$**iv. Oplease**

In [21]:
df_data_final['oplease'].describe() 

count    54354.000000
mean         0.002807
std          0.025175
min         -0.089858
25%         -0.002301
50%          0.000000
75%          0.004588
max          0.124105
Name: oplease, dtype: float64

In [22]:
(0.004588--0.002301) * 0.1951

0.0013440439

In [23]:
0.13/7.0

0.018571428571428572

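Interpretation: If a firm's present value of operating leases moves from the 25th to the 75th percentile, the predicted probability of a restatement rises by about 0.13 percentage points, which is about 1.9% of the mean restatement probability (roughly 7%).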
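
The interquartile calculations in this part all follow the same two steps: multiply the coefficient by the 25th-to-75th percentile gap, then scale by the mean restatement probability. A minimal sketch follows. The helper name is hypothetical, the inputs are the `oplease` figures shown above, and the baseline is the rounded 7% mean used in the cells.

```python
def iqr_effect(coef, q25, q75, baseline_prob):
    """Return (change in predicted probability, change relative to baseline)."""
    change = coef * (q75 - q25)
    return change, change / baseline_prob


# oplease: coefficient and quartiles as used in the cells above.
change, relative = iqr_effect(coef=0.1951, q25=-0.002301, q75=0.004588,
                              baseline_prob=0.07)
print(f"change in probability: {change:.6f}")
print(f"relative to the mean:  {relative:.1%}")
```

The first number is a change in probability (multiply by 100 for percentage points); the second is a share of the baseline probability.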

# Question 3

### a) Predict likelihood of having a restatement

$\;\;\;\;\;$**i. Training Sample**

In [24]:
y_pred_train_ols = model.predict(x)
y_pred_train_ols 

0        0.166496
1        0.174506
2        0.158339
3        0.129334
4        0.128767
           ...   
54314    0.060099
54318    0.051090
54322    0.123524
54323    0.053851
54339    0.017358
Length: 43031, dtype: float64

In [25]:
y_pred_train_ols.describe()

count    43031.000000
mean         0.073691
std          0.045416
min         -0.045056
25%          0.042059
50%          0.073403
75%          0.106211
max          0.205144
dtype: float64

$\;\;\;\;\;$**ii. Test Sample**

In [26]:
cond_1 = (df_data_final['test'] == 1)
X_test = df_data_final.loc[cond_1, X_columns]  
X_test = sm.add_constant(X_test)
y_pred_test_ols = model.predict(X_test)
y_pred_test_ols

11       0.092499
12       0.096797
13       0.088759
42       0.045719
54       0.048585
           ...   
54349    0.078367
54350    0.051792
54351    0.062912
54352    0.040708
54353    0.043741
Length: 11323, dtype: float64

In [27]:
y_pred_test_ols.describe()

count    11323.000000
mean         0.049505
std          0.031624
min         -0.041158
25%          0.020890
50%          0.047972
75%          0.077004
max          0.139045
dtype: float64

$\;\;\;\;\;$**iii. Whole Sample**

In [28]:
X_all = df_data_final[X_columns]  
X_all = sm.add_constant(X_all)
y_pred_all_ols = model.predict(X_all)
y_pred_all_ols

0        0.166496
1        0.174506
2        0.158339
3        0.129334
4        0.128767
           ...   
54349    0.078367
54350    0.051792
54351    0.062912
54352    0.040708
54353    0.043741
Length: 54354, dtype: float64

In [29]:
y_pred_all_ols.describe() 

count    54354.000000
mean         0.068653
std          0.044019
min         -0.045056
25%          0.039017
50%          0.065061
75%          0.095611
max          0.205144
dtype: float64

### b) Number of observations with predicted probability < 0

$\;\;\;\;\;$**i. Training Sample**

In [30]:
y_pred_train_ols[y_pred_train_ols<0].describe()

count    1203.000000
mean       -0.011470
std         0.008612
min        -0.045056
25%        -0.016418
50%        -0.010528
75%        -0.004155
max        -0.000013
dtype: float64

$\;\;\;\;\;$**ii. Test Sample**

In [31]:
y_pred_test_ols[y_pred_test_ols<0].describe() 

count    449.000000
mean      -0.013623
std        0.006832
min       -0.041158
25%       -0.017044
50%       -0.014113
75%       -0.010571
max       -0.000187
dtype: float64
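
Negative fitted values are a known feature of OLS with a binary outcome, because the fitted line is not bounded. A sketch on synthetic data (a rare outcome driven by one hypothetical regressor; seed and coefficients are arbitrary) reproduces the pattern. It also shows that the mean fitted value equals the sample mean of the outcome when a constant is included.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)

# Rare binary outcome whose true probability rises with x (hypothetical data).
p_true = 1 / (1 + np.exp(-(-3.0 + 1.0 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

# Linear probability model: OLS of y on a constant and x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

n_negative = int((y_hat < 0).sum())
print("fitted values below zero:", n_negative)
print("mean fitted value:", y_hat.mean(), "mean outcome:", y.mean())
```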

# Logit Regression

In [32]:
model_logit = sm.Logit(y,x).fit()
model_logit.summary()

Optimization terminated successfully.
         Current function value: 0.247820
         Iterations 7


0,1,2,3
Dep. Variable:,restatement,No. Observations:,43031.0
Model:,Logit,Df Residuals:,43000.0
Method:,MLE,Df Model:,30.0
Date:,"Thu, 10 Nov 2022",Pseudo R-squ.:,0.05802
Time:,16:13:02,Log-Likelihood:,-10664.0
converged:,True,LL-Null:,-11321.0
Covariance Type:,nonrobust,LLR p-value:,1.858e-257

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-3.4529,0.126,-27.352,0.000,-3.700,-3.206
ch_inv,0.3900,0.517,0.755,0.450,-0.623,1.403
ch_cs,0.0455,0.018,2.505,0.012,0.010,0.081
ch_emp,-0.2347,0.064,-3.673,0.000,-0.360,-0.109
oplease,2.0557,0.640,3.214,0.001,0.802,3.309
issue,0.4154,0.079,5.273,0.000,0.261,0.570
ret,-0.0707,0.032,-2.203,0.028,-0.134,-0.008
ind1,-0.0845,0.431,-0.196,0.844,-0.928,0.759
ind2,0.3886,0.123,3.169,0.002,0.148,0.629


### Pseudo R-square 

Pseudo R-square = 0.05802

In [33]:
PR_2 = 1- (-10664)/(-11321) 
print('Pseudo R-square =', PR_2)

Pseudo R-square = 0.05803374260224359


### Predict likelihood of having a restatement

$\;\;\;\;\;$**i. Training Sample**

In [34]:
y_pred_train_logit = model_logit.predict(x)
y_pred_train_logit 

0        0.190109
1        0.203539
2        0.175067
3        0.121617
4        0.135492
           ...   
54314    0.057192
54318    0.049368
54322    0.116876
54323    0.051630
54339    0.027862
Length: 43031, dtype: float64

In [35]:
y_pred_train_logit.describe()

count    43031.000000
mean         0.073691
std          0.047212
min          0.008739
25%          0.039402
50%          0.059316
75%          0.101425
max          0.284945
dtype: float64

$\;\;\;\;\;$**ii. Test Sample**

In [36]:
cond_1 = (df_data_final['test'] == 1)
X_test = df_data_final.loc[cond_1, X_columns]  
X_test = sm.add_constant(X_test)
y_pred_test_logit = model_logit.predict(X_test)
y_pred_test_logit

11       0.078528
12       0.082463
13       0.074751
42       0.047282
54       0.047441
           ...   
54349    0.079580
54350    0.050668
54351    0.058691
54352    0.043150
54353    0.044825
Length: 11323, dtype: float64

In [37]:
y_pred_test_logit.describe()

count    11323.000000
mean         0.049420
std          0.021722
min          0.009546
25%          0.029803
50%          0.047184
75%          0.068313
max          0.139200
dtype: float64

$\;\;\;\;\;$**iii. Whole Sample**

In [38]:
X_all = df_data_final[X_columns]  
X_all = sm.add_constant(X_all)
y_pred_all_logit = model_logit.predict(X_all)
y_pred_all_logit

0        0.190109
1        0.203539
2        0.175067
3        0.121617
4        0.135492
           ...   
54349    0.079580
54350    0.050668
54351    0.058691
54352    0.043150
54353    0.044825
Length: 54354, dtype: float64

In [39]:
y_pred_all_logit.describe() 

count    54354.000000
mean         0.068635
std          0.044273
min          0.008739
25%          0.038181
50%          0.055393
75%          0.084050
max          0.284945
dtype: float64

### Number of observations with predicted probability < 0

$\;\;\;\;\;$**i. Training Sample**

In [40]:
y_pred_train_logit[y_pred_train_logit<0].describe() 

count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
dtype: float64

$\;\;\;\;\;$**ii. Test Sample**

In [41]:
y_pred_test_logit[y_pred_test_logit<0].describe() 

count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
dtype: float64
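
The empty results are expected: a logit prediction is the logistic function of the linear index, and that function only returns values between 0 and 1. A minimal sketch follows, using the `const` coefficient from the logit summary above as one input. It assumes all other regressors are zero, which is purely illustrative.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Index equal to the constant alone (all regressors set to zero).
baseline = logistic(-3.4529)

# A range of index values, from very negative to very positive.
probs = [logistic(z) for z in (-10.0, -3.4529, 0.0, 3.0, 10.0)]
print(baseline)
print(probs)
```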

# Optional parts 

### 1. SEC on OLS predicted probabilities

In [42]:
df_data_final['y_pred_all_ols'] = y_pred_all_ols
# 66.6th percentile of the OLS predicted probability within each fiscal year (whole sample)
group_year = df_data_final.groupby(['fyear_new'])['y_pred_all_ols'].quantile(0.666).rename('pctile66_model_OLS')


In [43]:
group_year

fyear_new
2001    0.132903
2002    0.135896
2003    0.134142
2004    0.128307
2005    0.102813
2006    0.071640
2007    0.061922
2008    0.053640
2009    0.058780
2010    0.064525
2011    0.063565
2012    0.062623
2013    0.065045
2014    0.066265
Name: pctile66_model_OLS, dtype: float64

In [44]:
df_data_final = df_data_final.merge(group_year, on='fyear_new', how='left') 
df_data_final['top_third_model_OLS'] = 0 
cond_top_third_OLS = (df_data_final['y_pred_all_ols'] >= df_data_final['pctile66_model_OLS'])
df_data_final.loc[cond_top_third_OLS, 'top_third_model_OLS'] = 1

In [45]:
cond_select_OLS = (df_data_final['test'] == 1) & (df_data_final['top_third_model_OLS'] == 1) & (df_data_final['restatement'] == 1)

In [46]:
len(df_data_final[cond_select_OLS])

167

In [47]:
cond_how_many_restatement_test= (df_data_final['test'] == 1) & (df_data_final['restatement'] == 1)

len(df_data_final[cond_how_many_restatement_test])

428

Interpretation: According to the OLS model, the SEC would catch 167 of the 428 restatements in the test sample (about 39.0%).
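
The screening rule (flag the top third of predicted probabilities within each year, then count flagged restatements) can be sketched on a toy table. The data below are made up, and the sketch uses `groupby(...).transform(...)` rather than a merge. That is an alternative way to attach the yearly cutoff to each row, not the approach used in the cells above.

```python
import pandas as pd

# Toy data: two fiscal years, six firms each (hypothetical values).
df = pd.DataFrame({
    "fyear": [2001] * 6 + [2002] * 6,
    "prob": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06,
             0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    "restatement": [0, 0, 0, 0, 1, 0,
                    1, 0, 0, 0, 0, 1],
})

# Per-year 66.6th percentile, broadcast back to every row of that year.
cutoff = df.groupby("fyear")["prob"].transform(lambda s: s.quantile(0.666))
df["top_third"] = (df["prob"] >= cutoff).astype(int)

# Restatements that fall in the flagged group, and all restatements.
caught = int(((df["top_third"] == 1) & (df["restatement"] == 1)).sum())
total = int(df["restatement"].sum())
print(f"caught {caught} of {total}")
```

In the notebook the same count is additionally restricted to rows with `test == 1`.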

### 2. SEC on Logit predicted probabilities

In [48]:
df_data_final['y_pred_all_logit'] = y_pred_all_logit
# 66.6th percentile of the Logit predicted probability within each fiscal year (whole sample)
group_year_logit = df_data_final.groupby(['fyear_new'])['y_pred_all_logit'].quantile(0.666).rename('pctile66_model_logit')

In [49]:
group_year_logit

fyear_new
2001    0.138788
2002    0.141258
2003    0.140108
2004    0.134865
2005    0.105240
2006    0.067489
2007    0.054884
2008    0.046147
2009    0.052844
2010    0.058651
2011    0.057790
2012    0.056950
2013    0.058330
2014    0.059886
Name: pctile66_model_logit, dtype: float64

In [50]:
df_data_final = df_data_final.merge(group_year_logit, on='fyear_new', how='left') 
df_data_final['top_third_model_logit'] = 0 
cond_top_third_logit = (df_data_final['y_pred_all_logit'] >= df_data_final['pctile66_model_logit'])
df_data_final.loc[cond_top_third_logit, 'top_third_model_logit'] = 1

In [51]:
cond_select_logit = (df_data_final['test'] == 1) & (df_data_final['top_third_model_logit'] == 1) & (df_data_final['restatement'] == 1)

In [52]:
len(df_data_final[cond_select_logit])

168

In [176]:
cond_how_many_restatement_test= (df_data_final['test'] == 1) & (df_data_final['restatement'] == 1)

len(df_data_final[cond_how_many_restatement_test])

428

Interpretation: According to the Logit model, the SEC would catch 168 of the 428 restatements in the test sample (about 39.3%).

### 3. Include more x-variables to catch restatement


$\;\;\;\;\;$**i. OLS**

In [292]:
x_columns_2 = ['ch_inv','issue','ret','ch_pension','auopic','spread','beat','leverage'] + col_ind + col_year
x_2 = df_data_final.loc[cond_0,x_columns_2] 
x_2 = sm.add_constant(x_2) 

y_2 = df_data_final.loc[cond_0, 'restatement']

model_2 = sm.OLS(y_2, x_2).fit()
model_2.summary()

0,1,2,3
Dep. Variable:,restatement,R-squared:,0.045
Model:,OLS,Adj. R-squared:,0.044
Method:,Least Squares,F-statistic:,63.13
Date:,"Thu, 10 Nov 2022",Prob (F-statistic):,0.0
Time:,16:52:42,Log-Likelihood:,-2313.9
No. Observations:,43031,AIC:,4694.0
Df Residuals:,42998,BIC:,4980.0
Df Model:,32,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0183,0.007,2.454,0.014,0.004,0.033
ch_inv,0.0643,0.035,1.821,0.069,-0.005,0.133
issue,0.0273,0.004,6.401,0.000,0.019,0.036
ret,-0.0026,0.002,-1.172,0.241,-0.007,0.002
ch_pension,1.4543,0.594,2.448,0.014,0.290,2.619
auopic,0.1869,0.007,25.670,0.000,0.173,0.201
spread,-0.1188,0.079,-1.498,0.134,-0.274,0.037
beat,-0.0060,0.003,-2.245,0.025,-0.011,-0.001
leverage,0.0280,0.007,3.882,0.000,0.014,0.042

0,1,2,3
Omnibus:,26299.433,Durbin-Watson:,0.805
Prob(Omnibus):,0.0,Jarque-Bera (JB):,182734.188
Skew:,3.059,Prob(JB):,0.0
Kurtosis:,11.031,Cond. No.,739.0


In [293]:
X_all_2 = df_data_final[x_columns_2]  
X_all_2 = sm.add_constant(X_all_2)
y_pred_all_ols2 = model_2.predict(X_all_2)
y_pred_all_ols2

0        0.170360
1        0.168802
2        0.174542
3        0.121116
4        0.128912
           ...   
54349    0.058665
54350    0.060235
54351    0.061566
54352    0.042831
54353    0.064689
Length: 54354, dtype: float64

In [294]:
y_pred_all_ols2.describe() 

count    54354.000000
mean         0.069032
std          0.054038
min         -0.054312
25%          0.034248
50%          0.061671
75%          0.094219
max          0.345380
dtype: float64

In [295]:
df_data_final['y_pred_all_ols2'] = y_pred_all_ols2
# 66.6th percentile of the second OLS model's predicted probability within each fiscal year (whole sample)
group_year2 = df_data_final.groupby(['fyear_new'])['y_pred_all_ols2'].quantile(0.666).rename('pctile66_model_OLS2')


In [296]:
group_year2

fyear_new
2001    0.130799
2002    0.136772
2003    0.133138
2004    0.120820
2005    0.094187
2006    0.065494
2007    0.055561
2008    0.051202
2009    0.058280
2010    0.062603
2011    0.063013
2012    0.061059
2013    0.064277
2014    0.065466
Name: pctile66_model_OLS2, dtype: float64

In [299]:
df_data_final = df_data_final.merge(group_year2, on='fyear_new', how='left') 
df_data_final['top_third_model_OLS2'] = 0 
cond_top_third_OLS2 = (df_data_final['y_pred_all_ols2'] >= df_data_final['pctile66_model_OLS2'])
df_data_final.loc[cond_top_third_OLS2, 'top_third_model_OLS2'] = 1

In [300]:
cond_select_OLS2 = (df_data_final['test'] == 1) & (df_data_final['top_third_model_OLS2'] == 1) & (df_data_final['restatement'] == 1)

In [301]:
len(df_data_final[cond_select_OLS2])

210

In [302]:
cond_how_many_restatement_test= (df_data_final['test'] == 1) & (df_data_final['restatement'] == 1)

len(df_data_final[cond_how_many_restatement_test])

428

Interpretation: According to the new OLS model, SEC can catch 230/428 restatements.

$\;\;\;\;\;$**ii. Logit**

In [67]:
model_logit_2 = sm.Logit(y_2,x_2).fit()
model_logit_2.summary()

Optimization terminated successfully.
         Current function value: 0.241573
         Iterations 8


0,1,2,3
Dep. Variable:,restatement,No. Observations:,43031.0
Model:,Logit,Df Residuals:,42990.0
Method:,MLE,Df Model:,40.0
Date:,"Fri, 30 Sep 2022",Pseudo R-squ.:,0.08176
Time:,19:52:24,Log-Likelihood:,-10395.0
converged:,True,LL-Null:,-11321.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-3.1241,0.114,-27.509,0.000,-3.347,-2.902
age,-0.0057,0.002,-3.689,0.000,-0.009,-0.003
auopic,1.5693,0.072,21.654,0.000,1.427,1.711
spread,-5.0119,1.384,-3.622,0.000,-7.724,-2.300
beat,-0.0710,0.041,-1.725,0.085,-0.152,0.010
IG,-0.4455,0.073,-6.098,0.000,-0.589,-0.302
short,2.4095,0.438,5.503,0.000,1.551,3.268
dadif,-0.0394,0.037,-1.053,0.292,-0.113,0.034
ch_pension,16.2657,10.130,1.606,0.108,-3.589,36.121


In [68]:
X_all_2 = df_data_final[x_columns_2]  
X_all_2 = sm.add_constant(X_all_2)
y_pred_all_logit2 = model_logit_2.predict(X_all_2)
y_pred_all_logit2

0        0.145426
1        0.199330
2        0.212320
3        0.161909
4        0.134343
           ...   
54349    0.073980
54350    0.059764
54351    0.060386
54352    0.041601
54353    0.065707
Length: 54354, dtype: float64

In [69]:
y_pred_all_logit2.describe() 

count    54354.000000
mean         0.069015
std          0.056690
min          0.005704
25%          0.032904
50%          0.053634
75%          0.085515
max          0.613959
dtype: float64

In [70]:
df_data_final['y_pred_all_logit2'] = y_pred_all_logit2
# 66.6th percentile of the second Logit model's predicted probability within each fiscal year (whole sample)
group_year3 = df_data_final.groupby(['fyear_new'])['y_pred_all_logit2'].quantile(0.666).rename('pctile66_model_logit2')


In [71]:
group_year3

fyear_new
2001    0.136544
2002    0.144287
2003    0.139712
2004    0.121814
2005    0.091797
2006    0.060551
2007    0.049275
2008    0.044090
2009    0.052261
2010    0.057472
2011    0.056721
2012    0.056425
2013    0.057641
2014    0.058365
Name: pctile66_model_logit2, dtype: float64

In [72]:
df_data_final = df_data_final.merge(group_year3, on='fyear_new', how='left') 
df_data_final['top_third_model_logit2'] = 0 
cond_top_third_logit2 = (df_data_final['y_pred_all_logit2'] >= df_data_final['pctile66_model_logit2'])
df_data_final.loc[cond_top_third_logit2, 'top_third_model_logit2'] = 1

In [73]:
cond_select_logit2 = (df_data_final['test'] == 1) & (df_data_final['top_third_model_logit2'] == 1) & (df_data_final['restatement'] == 1)

In [74]:
len(df_data_final[cond_select_logit2])

235

In [75]:
cond_how_many_restatement_test= (df_data_final['test'] == 1) & (df_data_final['restatement'] == 1)

len(df_data_final[cond_how_many_restatement_test])

428

Interpretation: According to the new Logit model, the SEC would catch 235 of the 428 restatements in the test sample (about 54.9%).
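
To compare models on a common scale, the counts can be converted to capture rates (caught restatements divided by all restatements in the test sample). A short sketch using the counts reported in the interpretations above; the dictionary labels are only for display.

```python
total_restatements = 428

# Counts of test-sample restatements caught, as reported above.
caught = {
    "OLS": 167,
    "Logit": 168,
    "Logit (more x-variables)": 235,
}

rates = {name: count / total_restatements for name, count in caught.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```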

### 4. LASSO 

In [76]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

In [None]:
# np.arange(start, stop, step): alpha grid from 0 to 0.0095 in steps of 0.0005
param = {'alpha': np.arange(0, 0.01, 0.0005)}

In [None]:
import warnings
warnings.filterwarnings('ignore')  # suppress warnings before fitting (e.g. those raised for alpha=0)

lasso = Lasso()
lasso_search = GridSearchCV(lasso, param)
lasso_search.fit(x, y)

In [None]:
lasso_search.cv_results_

In [115]:
lasso_search_2 = GridSearchCV(lasso, param, scoring='r2')
lasso_search_2.fit(x, y)

GridSearchCV(estimator=Lasso(),
             param_grid={'alpha': array([0.    , 0.0005, 0.001 , 0.0015, 0.002 , 0.0025, 0.003 , 0.0035,
       0.004 , 0.0045, 0.005 , 0.0055, 0.006 , 0.0065, 0.007 , 0.0075,
       0.008 , 0.0085, 0.009 , 0.0095])},
             scoring='r2')
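
The grid printed in the `GridSearchCV` output runs from 0 to 0.0095 in steps of 0.0005. The sketch below shows how `np.arange(start, stop, step)` produces such a grid, and what happens when the stop value is smaller than the step. `np.linspace` is included as an alternative that fixes the number of points directly; the variable names are illustrative.

```python
import numpy as np

# 20 candidate alphas: 0, 0.0005, ..., 0.0095 (stop value is excluded).
grid = np.arange(0, 0.01, 0.0005)

# If stop is smaller than step, only the start value is produced.
single = np.arange(0, 0.001, 0.005)

# Same grid built by specifying the endpoints and the number of points.
grid_linspace = np.linspace(0, 0.0095, 20)

print(len(grid), grid[:3], grid[-1])
print(single)
```

Since `alpha=0` removes the penalty entirely, a grid that contains only 0 would make the search equivalent to plain least squares.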

In [None]:
for result in lasso_search.cv_results_:
    print(result, lasso_search.cv_results_[result])

# Print the summary once, after the loop
print(lasso_search.best_score_)
print(lasso_search.best_params_)
print(lasso_search.best_estimator_)

In [82]:
lasso_search_2 = GridSearchCV(lasso, param, scoring='r2')
lasso_search_2.fit(x, y)
import warnings
warnings.filterwarnings('ignore')  

In [84]:
for result in lasso_search_2.cv_results_:
    print(result, lasso_search_2.cv_results_[result])
    print(lasso_search_2.best_estimator_)

mean_fit_time [3.55421295 0.05648146 0.04547505 0.04069781 0.03527327 0.03322668
 0.0308857  0.03287115 0.02933607 0.02773099 0.02734437 0.02884378
 0.02722383 0.02673898 0.02438879 0.02692914 0.02491508 0.02431717
 0.02356567 0.02605157]
Lasso(alpha=0.0)
std_fit_time [0.16663772 0.01050933 0.00466829 0.00219221 0.00616465 0.00225456
 0.00383385 0.00758046 0.00212278 0.00434592 0.0039169  0.00486631
 0.00539107 0.00333373 0.0065641  0.00257668 0.00330463 0.00168607
 0.00444028 0.0036319 ]
Lasso(alpha=0.0)
mean_score_time [0.00581126 0.0021883  0.00438657 0.00456238 0.00545282 0.00381227
 0.00257659 0.00429473 0.00225215 0.00514321 0.00542412 0.00368772
 0.00404129 0.0069881  0.00420675 0.00212235 0.00397124 0.0022264
 0.00125232 0.00253496]
Lasso(alpha=0.0)
std_score_time [0.00371672 0.00140542 0.00338015 0.0022283  0.00294426 0.00232574
 0.00340775 0.00293973 0.00247542 0.00258575 0.002616   0.0036558
 0.00225628 0.00234199 0.00263952 0.00126185 0.00246166 0.00224388
 0.00204363 0.001

In [85]:
sklearn.metrics.SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_wei

In [88]:
model_lasso=Lasso(alpha=0) 
model_lasso.fit(x, y)  
model_lasso.coef_ 

array([ 0.        ,  0.03278322,  0.0020261 , -0.01681148,  0.19507473,
        0.02675283, -0.00550633, -0.00694714,  0.02680137,  0.00566544,
        0.00566085, -0.03852833, -0.02581661,  0.01339416, -0.00144786,
        0.04239867,  0.00535402, -0.02506738,  0.04927245,  0.03116159,
       -0.03305919,  0.06774002,  0.07317636,  0.07013973,  0.06163844,
        0.03612609,  0.00539149, -0.00576728, -0.00909781, -0.00068027,
        0.00109513])

In [89]:
y_lasso_pred = model_lasso.predict(X_all) 

In [90]:
df_data_final['restatement_prob_model_lasso'] = y_lasso_pred 

In [91]:
df_data_final['restatement_prob_model_lasso']

0        0.166496
1        0.174506
2        0.158339
3        0.129334
4        0.128767
           ...   
54349    0.078367
54350    0.051792
54351    0.062912
54352    0.040708
54353    0.043741
Name: restatement_prob_model_lasso, Length: 54354, dtype: float64

In [92]:
df_data_final['restatement_prob_model_lasso'].describe() 

count    54354.000000
mean         0.068653
std          0.044019
min         -0.045056
25%          0.039017
50%          0.065061
75%          0.095611
max          0.205144
Name: restatement_prob_model_lasso, dtype: float64

In [93]:
group_year4 = df_data_final.groupby(['fyear_new'])['restatement_prob_model_lasso'].quantile(0.666).rename('pctile66_model_lasso')


In [94]:
group_year4

fyear_new
2001    0.132903
2002    0.135896
2003    0.134142
2004    0.128307
2005    0.102813
2006    0.071640
2007    0.061922
2008    0.053640
2009    0.058780
2010    0.064525
2011    0.063565
2012    0.062623
2013    0.065045
2014    0.066265
Name: pctile66_model_lasso, dtype: float64

In [95]:
df_data_final = df_data_final.merge(group_year4, on='fyear_new', how='left')  
df_data_final['top_third_model_lasso'] = 0 
cond_top_third_lasso = (df_data_final['restatement_prob_model_lasso'] >= df_data_final['pctile66_model_lasso'])
df_data_final.loc[cond_top_third_lasso, 'top_third_model_lasso'] = 1

In [96]:
cond_select_lasso = (df_data_final['test'] == 1) & (df_data_final['top_third_model_lasso'] == 1) & (df_data_final['restatement'] == 1)

In [97]:
len(df_data_final[cond_select_lasso])  # same count as OLS: with alpha=0 the LASSO penalty vanishes

167
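
The LASSO count matches the OLS count because `alpha=0` switches the penalty off. The sketch below illustrates this with a small coordinate-descent solver written from scratch in NumPy. It is not scikit-learn's implementation. It minimises $\frac{1}{2n}\lVert y - Xw\rVert^2 + \alpha\lVert w\rVert_1$ without an intercept, on synthetic data with arbitrary coefficients.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it is within it."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_cd(X, y, alpha, n_sweeps=500):
    """Coordinate descent for (1/(2n))*||y - Xw||^2 + alpha*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with coordinate j's contribution added back.
            partial_resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.5, size=200)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_alpha0 = lasso_cd(X, y, alpha=0.0)

# Any alpha above max|X'y|/n shrinks every coefficient to exactly zero.
alpha_max = np.max(np.abs(X.T @ y)) / len(y)
w_big = lasso_cd(X, y, alpha=1.01 * alpha_max)
print(w_ols, w_alpha0, w_big)
```

A positive `alpha` is what lets LASSO select variables; at `alpha=0` the predictions, and therefore the top-third ranking, coincide with OLS.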

In [1198]:
cond_how_many_restatement_test= (df_data_final['test'] == 1) & (df_data_final['restatement'] == 1)

len(df_data_final[cond_how_many_restatement_test])

428

In [98]:
model_lasso2=Lasso(alpha=0)  
model_lasso2.fit(x_2, y_2)

Lasso(alpha=0)

In [99]:
model_lasso2.coef_

array([ 0.00000000e+00, -3.28474369e-04,  1.81693376e-01, -1.99831410e-01,
       -4.12877658e-03, -2.10281973e-02,  1.49939636e-01, -1.87035119e-03,
        1.00024862e+00,  1.78335417e-04, -7.01615827e-02, -1.26751796e-03,
        4.61998925e-03,  2.97999972e-02,  8.72902371e-03,  1.90156137e-03,
        1.07425243e-01, -1.13991648e-02,  2.27615047e-02,  1.14960220e-02,
        8.02349143e-03, -3.97724757e-02, -2.05923113e-02,  1.31652296e-02,
       -9.11320476e-04,  3.98227724e-02,  6.60808653e-05, -9.93906281e-03,
        4.93322513e-02,  2.66244137e-02, -3.31564871e-02,  7.45440260e-02,
        8.03487026e-02,  7.60324051e-02,  5.63175084e-02,  3.02404672e-02,
        1.16228539e-03, -1.14430322e-02, -1.41678744e-02, -9.97981991e-04,
        1.09446557e-03])

In [100]:
y_lasso_pred2 = model_lasso2.predict(X_all_2) 

In [101]:
df_data_final['restatement_prob_model_lasso2'] = y_lasso_pred2 

In [102]:
df_data_final['restatement_prob_model_lasso2']

0        0.148138
1        0.168255
2        0.170952
3        0.150063
4        0.131774
           ...   
54349    0.074244
54350    0.064086
54351    0.064519
54352    0.038884
54353    0.070579
Name: restatement_prob_model_lasso2, Length: 54354, dtype: float64

In [103]:
df_data_final['restatement_prob_model_lasso2'].describe() 

count    54354.000000
mean         0.068985
std          0.055171
min         -0.059936
25%          0.031461
50%          0.064343
75%          0.096905
max          0.362675
Name: restatement_prob_model_lasso2, dtype: float64

In [104]:
# Use a distinct column name so the merge does not clash with pctile66_model_lasso
group_year5 = df_data_final.groupby(['fyear_new'])['restatement_prob_model_lasso2'].quantile(0.666).rename('pctile66_model_lasso2')


In [110]:
df_data_final = df_data_final.merge(group_year5, on='fyear_new', how='left')  
df_data_final['top_third_model_lasso2'] = 0 
cond_top_third_lasso2 = (df_data_final['restatement_prob_model_lasso2'] >= df_data_final['pctile66_model_lasso2'])
df_data_final.loc[cond_top_third_lasso2, 'top_third_model_lasso2'] = 1

In [111]:
cond_select_lasso2 = (df_data_final['test'] == 1) & (df_data_final['top_third_model_lasso2'] == 1) & (df_data_final['restatement'] == 1)

In [112]:
len(df_data_final[cond_select_lasso2])  # test-sample restatements caught by the second LASSO model

230