# Visualizations

https://lookerstudio.google.com/reporting/8a8c0320-fb19-42cb-b942-8e275a5b4f8c

### Fraud Incident Over Time
This chart shows how fraud incidents change over time.<br>
Insights:
- Are fraud cases rising or falling?
- When do spikes happen, and which transaction categories are likely involved?
- Which type of fraud is more prevalent: low-value but frequent, or high-value but rare?

### Merchant Category Fraud Rate
This chart shows which merchant categories are most affected by fraud.<br>
Insights:
- Which user age groups tend to commit fraud in which categories?
- When do certain merchant categories experience more fraud?
- What types of fraud does each merchant category experience?

### Fraud Committed Per City Population
This chart shows how fraud rates vary by city.<br>
Insights:
- Which cities have the highest fraud rates?
- What’s the age distribution of fraudsters in those cities?
- Which merchant categories are the most affected in high-fraud cities?

### User Committed Fraud Age Distribution
This chart shows the ratio of fraudsters by age.<br>
Insights:
- Which age groups are more likely to commit fraud?
- Which merchant categories are targeted by different user age groups?

# Handling PII Data

I decide which columns are PII by inspecting their values. If the values reveal private information that makes a credit card user identifiable, the column is treated as PII and handled accordingly.

1. Credit card number ```cc_num```: All digits are masked except the last 4. Operators can still validate a user during calls by cross-checking the last 4 digits of the card.
2. First name ```first``` and last name ```last```: SHA-256 hashing is applied. This obscures the users' names but still lets analysts identify unique users and their transactions for fraud and analytics purposes.
3. Address ```street```: Fully redacted, since leaking a user's exact residence poses a physical risk to them.
4. ZIP code ```zip```: The last 2 digits are masked. This obscures the exact locality but preserves city/town-level information for analytics.
5. Latitude ```lat``` and longitude ```long```: Rounded to 1 decimal place, which gives only ~10 km accuracy for the residence while still allowing analysis at the city/town level.
6. Date of birth ```dob```: Month and day are masked. This protects users' privacy but still allows age-trend analysis.
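The masking rules above can be sketched in plain Python. This is only an illustration of the logic, not the pipeline's actual code: the function names, the `*` mask character, the `[REDACTED]` placeholder, and the `yyyy-MM-dd` format for ```dob``` are all assumptions.

```python
import hashlib


def mask_cc_num(cc_num: str) -> str:
    """Mask every character except the last 4 (mask character '*' is assumed)."""
    if len(cc_num) <= 4:
        return cc_num
    return "*" * (len(cc_num) - 4) + cc_num[-4:]


def hash_name(name: str) -> str:
    """SHA-256 hex digest; the same name always maps to the same hash."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def redact_street(_street: str) -> str:
    """Fully redact the street address (placeholder text is assumed)."""
    return "[REDACTED]"


def mask_zip(zip_code: str) -> str:
    """Mask the last 2 digits of the ZIP code."""
    return zip_code[:-2] + "**"


def coarsen_coordinate(value: float) -> float:
    """Round latitude or longitude to 1 decimal place."""
    return round(value, 1)


def mask_dob(dob: str) -> str:
    """Keep only the year; assumes dob is formatted as yyyy-MM-dd."""
    return dob[:4] + "-**-**"
```

Because the hash is deterministic, two transactions by the same user still share the same hashed ```first``` and ```last``` values, which is what keeps per-user analysis possible.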

# Data Quality Assurance

### Identifying and processing dirty data
To identify dirty data, I prefer to fully parse the dataset so that each field becomes its own column. This lets me save the data as a CSV file and inspect individual columns in Excel.<br>

Using the sort and filter tools in Excel, I can generate a drop-down for each column, which lets me inspect the distinct values of that column.<br>

The noisiest column is ```person_name```. Through inspection, I found that it contains symbols irrelevant to our use case, as well as extra characters like ```eeeee``` at the end of the name. The symbols can be cleaned with regex. The extra characters like ```eeeee``` are not cleaned directly, but since they always occur at the end of the value, I can simply split the value on spaces and extract only the first and last name.

The second kind of dirty data is string null values such as ```na```, ```null```, and the empty string, which appear in some string columns. To rid all columns of them, I convert these string null values into real nulls. Since leading and trailing spaces are nearly impossible to spot by eye, I trim every column regardless of its assumed data type.

There are also formatting issues in certain string and numeric columns. For instance, ```merchant``` has a ```fraud_``` prefix, which can be easily removed. For ```zip```, some values have fewer digits than the 5-digit format used in the United States. These are ZIP codes that start with 0 whose leading zeros were dropped at an unknown stage. I fixed them by left-padding with 0.

For timestamp columns, ```merch_last_update_time``` and ```merch_eff_time``` are integers with 13 and 16 digits respectively, so I deduce they are Unix timestamps in milliseconds and microseconds. However, some values have 1 fewer digit (12 digits for ```merch_last_update_time``` and 15 digits for ```merch_eff_time```). I experimented with multiplying these values by 10, which produced timestamps consistent with the normal values, so this fix was implemented.

# Transformation Steps

### Extract nested fields
Use ```col("parent_field.nested_field")``` to extract a nested field.<br>
The nested field ```address``` contains another layer of nested fields; extract these after extracting the nested fields under ```personal detail```.
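As a plain-Python analogue of dotted-path extraction (not the Spark code itself), the two-layer extraction can be sketched on a dictionary. The helper `get_nested`, the key spelling `personal_detail`, and the sample record are hypothetical.

```python
from functools import reduce


def get_nested(record: dict, dotted_path: str):
    """Walk a nested dict using a dotted path such as 'parent.child'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), record)


# Hypothetical record shaped like the nested source data.
record = {
    "personal_detail": {
        "person_name": "Jennifer,Banks eeeee",
        "address": {"street": "561 Perry Cove", "zip": "2139"},
    }
}

# First layer: fields directly under personal_detail.
person_name = get_nested(record, "personal_detail.person_name")
address = get_nested(record, "personal_detail.address")

# Second layer: fields under address, extracted after the first layer.
street = address["street"]
zip_code = address["zip"]
```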

### Clean and split person name

1. Replace dirty symbols (commas, slashes, semicolons, etc.) with a space.
2. Collapse the runs of multiple spaces left by step 1 into a single space.
3. Remove leading and trailing spaces.
4. Split the value by space; the first word is the first name and the second word is the last name.
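The four steps above can be sketched in plain Python as follows. The exact set of dirty symbols in the character class is an assumption, since the source only lists commas, slashes, and semicolons as examples, and the function name is hypothetical.

```python
import re

# Assumed set of dirty symbols; extend it to match what the data contains.
DIRTY_SYMBOLS = re.compile(r"[,/\\;|@!]")
MULTI_SPACE = re.compile(r"\s+")


def split_person_name(raw: str):
    """Return (first, last) from a noisy person_name value."""
    cleaned = DIRTY_SYMBOLS.sub(" ", raw)      # step 1: symbols -> space
    cleaned = MULTI_SPACE.sub(" ", cleaned)    # step 2: collapse spaces
    cleaned = cleaned.strip()                  # step 3: trim
    parts = cleaned.split(" ")                 # step 4: split by space
    first = parts[0] if parts[0] else None
    last = parts[1] if len(parts) > 1 else None
    return first, last
```

Trailing noise such as ```eeeee``` ends up as a third word and is simply ignored, because only the first two words are kept.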

### Drop parsed struct columns

Remove the parsed struct columns so that no extra columns remain in the DF. This keeps subsequent cleaning steps from doing redundant work.

### Clean up all fields

1. Trim every field to remove leading and trailing spaces.
2. If a field holds a string null value such as ```na```, ```null```, or an empty string, convert it to a proper null.
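A minimal plain-Python sketch of the trim-then-nullify rule is shown below. Matching the null tokens case-insensitively is an assumption, and the names `NULL_TOKENS`, `clean_value`, and `clean_row` are hypothetical.

```python
NULL_TOKENS = {"", "na", "null"}


def clean_value(value):
    """Trim a string value and map string nulls to None."""
    if not isinstance(value, str):
        return value  # leave real nulls and non-string values untouched
    trimmed = value.strip()
    # Assumption: tokens are compared case-insensitively.
    return None if trimmed.lower() in NULL_TOKENS else trimmed


def clean_row(row: dict) -> dict:
    """Apply clean_value to every field of a row."""
    return {name: clean_value(value) for name, value in row.items()}
```

Trimming first matters: a value like `"  na "` only matches a null token once its surrounding spaces are removed.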

### Transform string fields

1. If the ```merchant``` value starts with ```fraud_```, remove the ```fraud_``` prefix.
2. If ```zip``` has fewer than 5 digits, pad the value with leading zeros so that it has 5 digits.
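Both string fixes are short enough to sketch in plain Python. The function names and the sample merchant value are hypothetical, and the sketch assumes ```zip``` is held as a string.

```python
PREFIX = "fraud_"


def clean_merchant(merchant: str) -> str:
    """Drop the 'fraud_' prefix when present; otherwise return the value unchanged."""
    return merchant[len(PREFIX):] if merchant.startswith(PREFIX) else merchant


def pad_zip(zip_code: str) -> str:
    """Left-pad with zeros to 5 digits; values already 5 digits long are unchanged."""
    return zip_code.zfill(5)
```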

### Process and convert timestamp fields

1. Parse the ```trans_date_trans_time``` string using the format ```yyyy-MM-dd HH:mm:ss``` and convert it into a timestamp.
2. Create the ```trans_date_trans_time_utc+8``` timestamp column by adding 8 hours to the timestamp.
3. Fix millisecond Unix timestamps in ```merch_last_update_time``` that have 12 digits by multiplying them by 10.
4. Convert the value from milliseconds to seconds (with decimals), then convert it into a timestamp.
5. Create the ```merch_last_update_time_utc+8``` timestamp column by adding 8 hours to the timestamp.
6. Fix microsecond Unix timestamps in ```merch_eff_time``` that have 15 digits by multiplying them by 10.
7. Convert the value from microseconds to seconds (with decimals), then convert it into a timestamp.
8. Create the ```merch_eff_time_utc+8``` timestamp column by adding 8 hours to the timestamp.
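The digit fix and unit conversion above can be sketched in plain Python. The function names and sample values are hypothetical. Unlike the steps above, which divide into seconds with decimals, this sketch uses integer microseconds to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_to_utc(value: int, expected_digits: int, units_per_second: int) -> datetime:
    """Convert a Unix timestamp to a UTC datetime, repairing values one digit short."""
    if len(str(value)) == expected_digits - 1:
        value *= 10  # restore the missing digit
    micros = value * (1_000_000 // units_per_second)
    return EPOCH + timedelta(microseconds=micros)


def plus_8_hours(ts: datetime) -> datetime:
    """Shift a timestamp by 8 hours for the *_utc+8 columns."""
    return ts + timedelta(hours=8)


def parse_trans_time(text: str) -> datetime:
    """Parse trans_date_trans_time (yyyy-MM-dd HH:mm:ss)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Milliseconds: 13 digits expected, 12-digit values are repaired.
last_update = epoch_to_utc(160_000_000_000, expected_digits=13, units_per_second=1_000)
# Microseconds: 16 digits expected, 15-digit values are repaired.
eff_time = epoch_to_utc(160_000_000_000_000, expected_digits=16, units_per_second=1_000_000)
```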

### Apply PII masking

1. Mask the credit card number ```cc_num```, revealing only the last 4 digits.
2. Apply SHA-256 hashing to the first and last names, ```first``` and ```last```.
3. Redact the address ```street```.
4. Mask the last 2 digits of the ZIP code ```zip```.
5. Round the latitude ```lat``` and longitude ```long``` to 1 decimal place.
6. Mask the month and day of the date of birth ```dob```.