## WNS TRIANGE HACKQUEST

### Problem Statement

Within the realm of insurance, the processing of claims related to vehicle damage stands out as a routine yet crucial responsibility. The insurance sector grapples with an ongoing dilemma in distinguishing genuine claims from deceptive ones, a situation that can result in substantial financial setbacks. The emergence of Generative AI and various stable diffusion models has contributed to a surge in the number of fraudulent claims. It has become commonplace for users to incorporate fraudulent images as components of the claim settlement process.

This poses a formidable challenge to insurance companies as they strive to differentiate between legitimate and deceitful claims. Deceptive claims often involve amplifying the severity of damage or fabricating entirely false claims. To curb these financial losses and uphold the integrity of their operations, insurance firms must formulate effective approaches for accurately and efficiently flagging fraudulent claims.

In the context of this hackathon, the WNS team invites the community to devise a robust and high-performance model utilizing computer vision techniques to classify images as either fraudulent or non-fraudulent within the context of insurance claims. By precisely identifying fraudulent images, insurance companies can evaluate the authenticity of a claim and make well-informed decisions regarding payout.

### Dataset

You are provided with 3 files: training set, test set and sample submission.

The training set contains a diverse collection of car images, each labeled as fraudulent or non-fraudulent. The dataset includes images with varying lighting conditions, cluttered backgrounds, a long-tailed distribution, and so on.

In the test set, you are provided with only the images, and you need to predict the label (fraudulent or non-fraudulent) for each image.

The sample submission file contains the format in which the user needs to submit the solution file.

### Dataset Description

The training set, test set, and sample submission are described below.

#### Training set

The training set contains two items: an images folder and train.csv.

The images folder contains the images to be used for training the model, and train.csv contains the label of each image in the training set. The data description is given below.

![image](https://github.com/Akshay-Paunikar/WNS_Triange_Hackquest/assets/86560684/b2d4a1b8-f13c-4bb7-b75a-7272c0e620c8)

#### Test set

The test set contains two items: an images folder and test.csv.

The images folder contains the test images for which predictions are to be made, and test.csv contains the unique identifier of each image in the test set. You will need to make a prediction for each image in the test set. The data description is given below.

![image](https://github.com/Akshay-Paunikar/WNS_Triange_Hackquest/assets/86560684/810e41cc-f63e-489f-a8f7-1a3a24ac52a6)

#### Sample Submission

The sample submission contains two columns, image_id and label. Its description is given below.

![image](https://github.com/Akshay-Paunikar/WNS_Triange_Hackquest/assets/86560684/3209b2b2-276f-4d5d-8147-cf834da748e9)
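
A submission file with these two columns could be written along the following lines. This is only a sketch: the helper name `write_submission`, the output file name, and the example predictions are hypothetical, and the labels are assumed to use the same 0/1 encoding as train.csv.

```python
import csv


def write_submission(predictions, path):
    """Write (image_id, label) pairs to a CSV with the two expected columns.

    `predictions` is assumed to be an iterable of (image_id, label) pairs.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id", "label"])
        for image_id, label in predictions:
            writer.writerow([image_id, int(label)])


# Hypothetical predictions for three test images (assumed 0/1 label encoding).
example_predictions = [(8080, 0), (8081, 1), (8082, 0)]
write_submission(example_predictions, "submission_example.csv")
```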

#### Evaluation metric

The model will be evaluated with the macro F1 score.
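
Macro F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The sketch below computes it in plain Python on made-up toy labels (not competition data) to show why a model that always predicts the majority class scores poorly even when its accuracy looks high. The function names are illustrative.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Define F1 as 0 when the class never appears in truth or predictions.
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example: 8 non-fraudulent (0) and 2 fraudulent (1) images.
y_true = [0] * 8 + [1] * 2
y_pred_all_zero = [0] * 10  # a model that never flags fraud

accuracy = sum(t == p for t, p in zip(y_true, y_pred_all_zero)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred_all_zero))
```

In this toy case the accuracy is 0.8, but the F1 for class 1 is 0, which pulls the macro F1 down to about 0.44.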

#### Public and Private Split

Test data is further divided into Public (40%) and Private (60%) data.

Your initial submissions will be checked and scored on the Public data. The final rankings will be based on your Private score, which will be published once the competition is over.

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# import required libraries
import os
import numpy as np
import pandas as pd
from PIL import Image

In [None]:
# path to current working directory
%pwd

'/content'

In [None]:
# change working directory to desired path
os.chdir("/content/drive/MyDrive/WNS_TRIANGE_HACKQUEST/")

In [None]:
# check if you are in the right directory
%pwd

'/content/drive/MyDrive/WNS_TRIANGE_HACKQUEST'

In [None]:
# read the train data
train_data = pd.read_csv("/content/drive/MyDrive/WNS_TRIANGE_HACKQUEST/dataset/train/train.csv")

In [None]:
# first five records
train_data.head()

|   | image_id | filename | label |
|---|----------|----------|-------|
| 0 | 1 | 1.jpg | 0 |
| 1 | 2 | 2.jpg | 0 |
| 2 | 3 | 3.jpg | 0 |
| 3 | 4 | 4.jpg | 0 |
| 4 | 5 | 5.jpg | 0 |


In [None]:
# shape of dataset
train_data.shape

(8079, 3)

In [None]:
# check if data is balanced w.r.t. label or not
train_data['label'].value_counts()

0    7614
1     465
Name: label, dtype: int64

As you can see, the dataset is heavily imbalanced: 7614 images are labeled non-fraudulent (0) and only 465 are labeled fraudulent (1).
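
One common way to compensate for this kind of imbalance during training is to weight each class inversely to its frequency. This notebook does not do so at this stage; the sketch below only illustrates the arithmetic, using the counts reported above and the usual "balanced" heuristic `n_samples / (n_classes * class_count)`. The function name is illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Label counts taken from the value_counts() output above.
labels = [0] * 7614 + [1] * 465
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight, so the 465 fraudulent images are not drowned out by the 7614 non-fraudulent ones.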

Next, we read the images and combine them with the CSV data to create a new dataframe for further use.

In [None]:
combined_data_train = []

In [None]:
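# open every training image listed in train.csv
# (Image.open is lazy: pixel data is only read when the image is first used)
for index, row in train_data.iterrows():
    image_path = "/content/drive/MyDrive/WNS_TRIANGE_HACKQUEST/dataset/train/images/" + row['filename']
    image = Image.open(image_path)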

    combined_row_train = {
        "image_name": row['filename'],
        "fraudulent_claim": row['label'],
        "image_data": image
    }

    combined_data_train.append(combined_row_train)

In [None]:
combined_df_train = pd.DataFrame(combined_data_train)

In [None]:
combined_df_train.head()

|   | image_name | fraudulent_claim | image_data |
|---|------------|------------------|------------|
| 0 | 1.jpg | 0 | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 1 | 2.jpg | 0 | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 2 | 3.jpg | 0 | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 3 | 4.jpg | 0 | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 4 | 5.jpg | 0 | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |


In [None]:
combined_df_train.dtypes

image_name          object
fraudulent_claim     int64
image_data          object
dtype: object

In [None]:
target_size = (224, 224)

In [None]:
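def preprocess_image(image):
    # convert to RGB so every array has shape (224, 224, 3),
    # regardless of the source image mode (grayscale, CMYK, ...)
    resized_image = image.convert("RGB").resize(target_size)
    # scale pixel values from [0, 255] to [0, 1]
    processed_image = np.array(resized_image)/255.0
    return processed_image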

In [None]:
combined_df_train['processed_image'] = combined_df_train['image_data'].apply(preprocess_image)
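
Image files in a folder can differ in mode (grayscale, RGB, RGBA, and so on), and the shape of the array produced by `np.asarray` depends on that mode. The sketch below uses synthetic blank images, not the competition data, to show how the shapes differ and how converting to RGB first makes them uniform. The helper `to_array` and its `force_rgb` flag are illustrative names, not part of the notebook.

```python
import numpy as np
from PIL import Image

TARGET_SIZE = (224, 224)


def to_array(image, force_rgb):
    """Resize an image and scale its pixels to [0, 1], optionally forcing RGB."""
    if force_rgb:
        image = image.convert("RGB")
    resized = image.resize(TARGET_SIZE)
    return np.asarray(resized).astype(np.float32) / 255.0


# Synthetic blank images in three different modes.
synthetic = {mode: Image.new(mode, (640, 480)) for mode in ("L", "RGB", "RGBA")}

for mode, img in synthetic.items():
    print(mode, to_array(img, force_rgb=False).shape, to_array(img, force_rgb=True).shape)
```

Without the conversion, the grayscale image yields a 2-D array and the RGBA image yields four channels; with it, all three arrays can be stacked into a single batch.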

In [None]:
combined_df_train.drop(columns=['image_data'], inplace=True)

In [None]:
combined_df_train.head()

In [None]:
combined_df_train.to_csv("training_data.csv",index=False)
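
One caveat with writing image arrays to CSV: NumPy abbreviates large arrays with `...` when converting them to text, so a CSV cell holding a 224x224 array is unlikely to preserve the full pixel data. A possible alternative, sketched below under the assumption that all processed arrays share the same shape, is to stack them and store them in NumPy's `.npz` format. The helper names, the file name, and the random stand-in arrays are hypothetical.

```python
import os
import tempfile

import numpy as np

# Large arrays are summarized with "..." in their string form.
big = np.zeros((224, 224, 3))
print("..." in str(big))


def save_image_arrays(names, arrays, path):
    """Stack equally shaped arrays into one batch and save it with the file names."""
    batch = np.stack(arrays).astype(np.float32)
    np.savez_compressed(path, names=np.array(names), images=batch)


def load_image_arrays(path):
    """Load the file names and the image batch back from an .npz file."""
    with np.load(path) as data:
        return [str(n) for n in data["names"]], data["images"]


# Random stand-ins for two processed images.
names = ["1.jpg", "2.jpg"]
arrays = [np.random.rand(224, 224, 3) for _ in names]

path = os.path.join(tempfile.mkdtemp(), "training_arrays.npz")
save_image_arrays(names, arrays, path)
loaded_names, loaded_images = load_image_arrays(path)
print(loaded_names, loaded_images.shape)
```

Storing the batch as `float32` halves the size relative to the default `float64` while keeping more precision than the original 8-bit pixels need.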

In [None]:
# read the test data
test_data = pd.read_csv("/content/drive/MyDrive/WNS_TRIANGE_HACKQUEST/dataset/test/test.csv")

In [None]:
# first five records
test_data.head()

|   | image_id | filename |
|---|----------|----------|
| 0 | 8080 | 8080.jpg |
| 1 | 8081 | 8081.jpg |
| 2 | 8082 | 8082.jpg |
| 3 | 8083 | 8083.jpg |
| 4 | 8084 | 8084.jpg |


In [None]:
# shape of dataset
test_data.shape

(3462, 2)

In [None]:
combined_data_test = []

# open every test image listed in test.csv (no labels are available here)
for index, row in test_data.iterrows():
    image_path = "/content/drive/MyDrive/WNS_TRIANGE_HACKQUEST/dataset/test/images/" + row['filename']
    image = Image.open(image_path)

    combined_row_test = {
        "image_name": row['filename'],
        "image_data": image
    }

    combined_data_test.append(combined_row_test)

In [None]:
combined_df_test = pd.DataFrame(combined_data_test)

In [None]:
combined_df_test.head()

|   | image_name | image_data |
|---|------------|------------|
| 0 | 8080.jpg | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 1 | 8081.jpg | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 2 | 8082.jpg | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 3 | 8083.jpg | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |
| 4 | 8084.jpg | `<PIL.JpegImagePlugin.JpegImageFile image mode=...` |


In [None]:
combined_df_test.dtypes

image_name    object
image_data    object
dtype: object

In [None]:
combined_df_test['processed_image'] = combined_df_test['image_data'].apply(preprocess_image)

In [None]:
combined_df_test.drop(columns=['image_data'], inplace=True)

In [None]:
combined_df_test.head()

In [None]:
combined_df_test.to_csv("testing_data.csv",index=False)