# Topic

## 1. Introduction

Auditing is the process of investigating the financial records of businesses to ensure that their financial statements comply with internationally accepted legal standards. The data set used in this project is a collection of present and historical risk factors that help auditors identify businesses at risk of engaging in unfair practices (N. Hooda et al., 2018).

Specifically, we use the Audit Data collected by the Comptroller and Auditor General (CAG) of India, an independent constitutional body entrusted with auditing the financial transactions of government-funded firms. The CAG collected comprehensive non-confidential data from the Auditor General Office (AGO), covering 777 firms from 46 Indian cities between 2015 and 2016. The target firms were then grouped into 14 different sectors.

This data analysis project applies predictive analytics to the classification of fraudulent firms, using the case study above. It attempts to answer the following research question: Is a business at risk or not at risk of committing fraud, based on its Inherent, Control and Detection Risk scores?

The variables we will use to answer this question measure the potential discrepancies that can occur during a company’s transactions (Inherent Risk), internal audit (Control Risk) and external audit (Detection Risk). The product of these three variables gives the Audit Risk score (ARS). In the risk assessment, companies with an ARS greater than or equal to 1 are identified as “risky” firms and assigned a risk assessment value of 1, while companies with an ARS below 1 are classified as “not risky” firms and assigned a value of 0.
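The scoring rule described above can be sketched in a few lines of base R. This is only an illustration: the helper name `classify_risk` and its `threshold` argument are hypothetical, and the three example firms are copied from the training-set preview in Section 2, where the component scores appear as `Inherent_Risk`, `CONTROL_RISK` and `Detection_Risk`.

```r
# Hypothetical helper: compute the Audit Risk score (ARS) as the product of
# the three component risks and label a firm 1 ("risky") when ARS >= threshold.
classify_risk <- function(inherent, control, detection, threshold = 1) {
  audit_risk <- inherent * control * detection
  data.frame(
    Audit_Risk = audit_risk,
    Risk       = as.integer(audit_risk >= threshold)
  )
}

# Three firms copied from the training-set preview in Section 2
result <- classify_risk(
  inherent  = c(1.548, 2.930, 6.580),
  control   = c(0.4, 1.2, 0.4),
  detection = c(0.5, 0.5, 0.5)
)
print(result)
```

For these three firms the sketch gives Audit Risk scores of 0.3096, 1.758 and 1.316, so the first firm is labelled 0 and the other two are labelled 1, in agreement with the `Audit_Risk` and `Risk` columns of the preview.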


## 2. Preliminary Exploratory Data Analysis

In [2]:
# Load libraries
options(repr.matrix.max.rows=6)
library(tidyverse)
library(tidymodels)
library(repr)
library(cowplot)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

In [4]:
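# Read the data set and correct variable types.
# LOCATION_ID is read as character; entries that are not numeric become NA
# when coerced to integer (hence the coercion warning in the output).
audit_data <- read_csv("data/audit_risk.csv") |>
    mutate(Risk = as.factor(Risk),
           LOCATION_ID = as.integer(LOCATION_ID))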

#Splitting data into training set and testing set
set.seed(265)
audit_split <- initial_split(audit_data, prop = 0.75, strata = Risk)
audit_training <- training(audit_split)
audit_testing <- testing(audit_split)

#Printing training data set
audit_training

New names:
• `Score_B` -> `Score_B...7`
• `Score_B` -> `Score_B...11`
Rows: 776 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): LOCATION_ID
dbl (26): Sector_score, PARA_A, Score_A, Risk_A, PARA_B, Score_B...7, Risk_B...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
“NAs introduced by coercion”


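| Sector_score | LOCATION_ID | PARA_A | Score_A | Risk_A | PARA_B | Score_B...7 | Risk_B | TOTAL | numbers | ⋯ | RiSk_E | History | Prob | Risk_F | Score | Inherent_Risk | CONTROL_RISK | Detection_Risk | Audit_Risk | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 3.89 | 6 | 0.51 | 0.2 | 0.102 | 0.23 | 0.2 | 0.046 | 0.74 | 5 | ⋯ | 0.4 | 0 | 0.2 | 0 | 2 | 1.548 | 0.4 | 0.5 | 0.3096 | 0 |
| 3.89 | 6 | 0.00 | 0.2 | 0.000 | 0.08 | 0.2 | 0.016 | 0.08 | 5 | ⋯ | 0.4 | 0 | 0.2 | 0 | 2 | 1.416 | 0.4 | 0.5 | 0.2832 | 0 |
| 3.89 | 6 | 0.00 | 0.2 | 0.000 | 0.83 | 0.2 | 0.166 | 0.83 | 5 | ⋯ | 0.4 | 0 | 0.2 | 0 | 2 | 2.156 | 0.4 | 0.5 | 0.4312 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 55.57 | 2 | 1.06 | 0.4 | 0.424 | 0.63 | 0.2 | 0.126 | 1.69 | 5 | ⋯ | 1.2 | 0 | 0.2 | 0 | 2.6 | 2.930 | 1.2 | 0.5 | 1.7580 | 1 |
| 55.57 | 32 | 0.00 | 0.2 | 0.000 | 8.49 | 0.6 | 5.094 | 8.49 | 5 | ⋯ | 0.4 | 0 | 0.2 | 0 | 3.2 | 6.580 | 0.4 | 0.5 | 1.3160 | 1 |
| 55.57 | 13 | 1.06 | 0.4 | 0.424 | 1.60 | 0.4 | 0.640 | 2.66 | 5 | ⋯ | 0.4 | 0 | 0.2 | 0 | 3.2 | 12.118 | 0.4 | 0.5 | 2.4236 | 1 |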


Although the data set does not include a sector ID for each sector, we hypothesize that each sector has a unique historical sector risk score (Sector_score). We therefore count how many unique sector scores there are and how many firms fall under each one.

In [5]:
#Table reporting number of sectors and number of firms under each sector
firm_count <- audit_data |>
    select(Sector_score) |>
    group_by(Sector_score) |>
    summarise(firm_count = n())
print(firm_count)

# A tibble: 13 × 2
   Sector_score firm_count
          <dbl>      <int>
 1         1.85         95
 2         1.99         47
 3         2.34          5
 4         2.36          1
 5         2.37         74
 6         2.72         82
 7         3.41         76
 8         3.89        114
 9        15.6           3
10        17.7           1
11        21.6          41
12        55.6         200
13        59.8          37


The table shows only 13 unique sector scores, although the data set is supposed to cover 14 sectors. As a result, the firm counts per sector score cannot all correspond to individual sectors. One possible explanation is that two sectors share exactly the same sector risk score, which would make Sector_score an unreliable proxy for counting the sectors and the firms within them.
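The mismatch can be checked with a short base R sketch. The counts are copied from the table above, with sector scores as displayed (rounded by the tibble printout). The variable names `n_scores`, `n_firms`, `expected_sectors` and `missing_sectors` are introduced here only for illustration, and the expected number of sectors (14) is taken from the data set description in the Introduction.

```r
# Firm counts per Sector_score, copied from the printed table
# (sector scores are shown as displayed, i.e. rounded).
firm_count <- data.frame(
  Sector_score = c(1.85, 1.99, 2.34, 2.36, 2.37, 2.72, 3.41,
                   3.89, 15.6, 17.7, 21.6, 55.6, 59.8),
  firm_count   = c(95, 47, 5, 1, 74, 82, 76, 114, 3, 1, 41, 200, 37)
)

n_scores <- nrow(firm_count)             # number of distinct sector scores
n_firms  <- sum(firm_count$firm_count)   # total firms across all scores

expected_sectors <- 14                   # from the data set description
missing_sectors  <- expected_sectors - n_scores

cat("Distinct sector scores:", n_scores, "\n")
cat("Total firms counted:   ", n_firms, "\n")
cat("Sectors unaccounted for:", missing_sectors, "\n")
```

The total of 776 firms matches the number of rows reported when the file was read, so no firm is lost in the grouping. The single unaccounted sector is consistent with, but does not prove, two sectors sharing one score.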