In [5]:
# EDA.ipynb

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set(style="whitegrid")

# Load the data
data = pd.read_csv('/Users/habeeb/Downloads/Git/ML/AXIS/Data/cleaned_fraud_data.csv')

# Display the first few rows
data.head()

Unnamed: 0,Month,WeekOfMonth,DayOfWeek,Make,AccidentArea,DayOfWeekClaimed,MonthClaimed,WeekOfMonthClaimed,Sex,MaritalStatus,...,WitnessPresent,AgentType,NumberOfSuppliments,AddressChange_Claim,NumberOfCars,Year,BasePolicy,Unnamed: 33,Unnamed: 34,Statement
0,Dec,5,Wednesday,Honda,Urban,Tuesday,Jan,1,0,Single,...,0,External,none,1 year,3 to 4,1994,Liability,-,Sport - Liability,True
1,Jan,3,Wednesday,Honda,Urban,Monday,Jan,4,1,Single,...,0,External,none,no change,1 vehicle,1994,Collision,-,Sport - Collision,True
2,Oct,5,Friday,Honda,Urban,Thursday,Nov,2,1,Married,...,0,External,none,no change,1 vehicle,1994,Collision,-,Sport - Collision,True
3,Jun,2,Saturday,Toyota,Rural,Friday,Jul,1,1,Married,...,0,External,more than 5,no change,1 vehicle,1994,Liability,-,Sport - Liability,False
4,Jan,5,Monday,Honda,Urban,Tuesday,Feb,2,0,Single,...,0,External,none,no change,1 vehicle,1994,Collision,-,Sport - Collision,True


In [13]:
# Display basic information about the dataset
data.info()

# Display summary statistics
data.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15420 entries, 0 to 15419
Data columns (total 36 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Month                 15420 non-null  object
 1   WeekOfMonth           15420 non-null  int64 
 2   DayOfWeek             15420 non-null  object
 3   Make                  15420 non-null  object
 4   AccidentArea          15420 non-null  object
 5   DayOfWeekClaimed      15420 non-null  object
 6   MonthClaimed          15420 non-null  object
 7   WeekOfMonthClaimed    15420 non-null  int64 
 8   Sex                   15420 non-null  int64 
 9   MaritalStatus         15420 non-null  object
 10  Age                   15420 non-null  int64 
 11  Fault                 15420 non-null  int64 
 12  PolicyType            15420 non-null  object
 13  VehicleCategory       15420 non-null  object
 14  VehiclePrice          15420 non-null  object
 15  FraudFound_P          15420 non-null

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
WeekOfMonth,15420.0,2.788586,1.287585,1.0,2.0,3.0,4.0,5.0
WeekOfMonthClaimed,15420.0,2.693969,1.259115,1.0,2.0,3.0,4.0,5.0
Sex,15420.0,0.843061,0.363755,0.0,1.0,1.0,1.0,1.0
Age,15420.0,39.855707,13.492377,0.0,31.0,38.0,48.0,80.0
Fault,15420.0,0.728275,0.444863,0.0,0.0,1.0,1.0,1.0
FraudFound_P,15420.0,0.059857,0.23723,0.0,0.0,0.0,0.0,1.0
PolicyNumber,15420.0,7710.5,4451.514911,1.0,3855.75,7710.5,11565.25,15420.0
RepNumber,15420.0,8.483268,4.599948,1.0,5.0,8.0,12.0,16.0
Deductible,15420.0,407.70428,43.950998,300.0,400.0,400.0,400.0,700.0
DriverRating,15420.0,2.487808,1.119453,1.0,1.0,2.0,3.0,4.0


# Metadata Brief:

## 1. Month (`object`)
- Contains 3-letter abbreviations for the months of the year.
- **Question**: Are these the months in which the accident occurred?

## 2. WeekOfMonth (`int64`)
- Represents the week in the month when the accident occurred.
- **Question**: Is this the week the accident took place?

## 3. DayOfWeek (`object`)
- Contains days of the week (e.g., Monday, Tuesday, etc.).
- **Question**: Do these represent the days of the week the accident occurred?

## 4. Make (`object`)
- Contains a list of 19 car manufacturers.

## 5. AccidentArea (`object`)
- Classifies accident areas as "Urban" or "Rural."

## 6. DayOfWeekClaimed (`object`)
- Contains the day of the week the claim was filed.
- **Issue**: Contains '0'. Investigate how many instances there are and check if they represent missing data.

## 7. MonthClaimed (`object`)
- Contains 3-letter abbreviations for the months the claims were filed.
- **Issue**: Contains '0'. Investigate how many occurrences there are and their meaning (possible missing data).
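The two `'0'` issues flagged for `DayOfWeekClaimed` and `MonthClaimed` can be investigated together. The sketch below is a minimal example on a toy frame; the helper `count_placeholder` and the toy values are illustrative assumptions, and in the notebook the real `data` frame would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for the claim-date columns (hypothetical values).
toy = pd.DataFrame({
    "DayOfWeekClaimed": ["Monday", "0", "Tuesday", "Friday"],
    "MonthClaimed": ["Jan", "0", "Feb", "Jul"],
})


def count_placeholder(df, columns, placeholder="0"):
    """Return, per column, how many rows equal the placeholder string."""
    return {
        col: int((df[col].astype(str) == placeholder).sum())
        for col in columns
    }


zero_counts = count_placeholder(toy, ["DayOfWeekClaimed", "MonthClaimed"])
print(zero_counts)

# Rows where both columns hold the placeholder probably share one cause.
both_zero = toy[(toy["DayOfWeekClaimed"] == "0") & (toy["MonthClaimed"] == "0")]
print(len(both_zero))
```

If the per-column counts match the number of rows in `both_zero`, the placeholder most likely marks the same incomplete records in both columns.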

## 8. WeekOfMonthClaimed (`int64`)
- Represents the week in the month when the claim was filed.

## 9. Sex (`object`)
- Gender of the individual making the claim.
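- **Action**: Convert to binary (e.g., 0 for male, 1 for female).
- **Note**: In the cleaned file loaded above, `Sex` is already `int64`. Confirm which label maps to 1, since about 84% of rows carry that value.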

## 10. MaritalStatus (`object`)
- Marital status of the individual making the claim.

## 11. Age (`int64`)
- Age of the individual making the claim.
- **Issue**: At least one individual has an age of 0, suggesting missing data.
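One way to handle the zero ages is to count them and then mask them as missing, so they do not distort later statistics. This is a sketch on hypothetical toy values, not the notebook's actual cleaning step.

```python
import pandas as pd

# Toy ages (hypothetical); in the notebook this would be data["Age"].
toy = pd.DataFrame({"Age": [0, 31, 38, 0, 48]})

# Count the suspicious zero entries.
n_zero = int((toy["Age"] == 0).sum())

# Replace zeros with NaN so they are treated as missing, not as real ages.
age_clean = toy["Age"].mask(toy["Age"] == 0)

print(n_zero)
print(age_clean.mean())  # mean over the non-missing ages only
```

Whether to impute or drop the masked rows afterwards depends on how many there are.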

## 12. Fault (`object`)
- Categorizes who was deemed at fault in the accident.
- **Action**: Convert to binary (1 for at fault, 0 for not at fault).
- **Note**: In the cleaned file loaded above, `Fault` is already `int64`.

## 13. PolicyType (`object`)
- Contains two pieces of information:
  - Type of insurance: Liability, all perils, or collision.
  - Category of vehicle: Sport, sedan, or utility.
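Because `PolicyType` combines two fields, it can be split and compared against `VehicleCategory` and `BasePolicy` to see whether it adds any information. The sketch below assumes the `"Category - Coverage"` format seen in values such as `Sport - Liability` in the `head()` output; the toy rows and the new column names are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical) following the assumed "Category - Coverage" format.
toy = pd.DataFrame({
    "PolicyType": ["Sport - Liability", "Sport - Collision", "Sedan - All Perils"],
    "VehicleCategory": ["Sport", "Sport", "Sedan"],
    "BasePolicy": ["Liability", "Collision", "All Perils"],
})

# Split once on the separator into two new columns.
parts = toy["PolicyType"].str.split(" - ", n=1, expand=True)
parts.columns = ["category_from_policy", "base_from_policy"]

# PolicyType is redundant only if both halves match the existing columns.
redundant = bool(
    (parts["category_from_policy"] == toy["VehicleCategory"]).all()
    and (parts["base_from_policy"] == toy["BasePolicy"]).all()
)
print(redundant)
```

On the real data the check may fail for some rows; those mismatches would be worth inspecting before dropping any of the three columns.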

## 14. VehicleCategory (`object`)
- Contains the categorization of the vehicle (refer to `PolicyType`).

## 15. VehiclePrice (`object`)
- Contains ranges for the value of the vehicle.
- **Action**: Replace ranges with the mean value of the range and convert to `float`.
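The range-to-number conversion, which recurs for several columns in this brief, could be sketched as a small parser. The helper name `range_midpoint` is hypothetical, and the label format `"<low> to <high>"` is an assumption based on values such as `3 to 4` in the `head()` output. Open-ended labels such as `more than 5` and non-numeric labels such as `none` have no midpoint, so the sketch returns `None` for them and leaves that decision to the caller.

```python
import re

# Matches labels of the assumed form "<low> to <high>", e.g. "3 to 4".
_RANGE = re.compile(r"\s*(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\s*")


def range_midpoint(label):
    """Return the midpoint of a two-bound range label, else None."""
    match = _RANGE.fullmatch(str(label))
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2


for label in ["3 to 4", "more than 5", "none"]:
    print(label, "->", range_midpoint(label))
```

Applied to a column, this would look like `data["VehiclePrice"].map(range_midpoint)`, assuming the column's labels follow the same format; the labels that map to `None` then need an explicit rule.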

## 16. FraudFound_P (`int64`)
- Indicates whether the claim was fraudulent (1) or not (0).
- **Purpose**: This is the target variable for prediction.
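The summary statistics above report a mean of 0.059857 for `FraudFound_P`, so roughly 6% of claims are flagged as fraudulent. A quick class-balance check makes that imbalance explicit; the toy frame below is hypothetical and stands in for `data`.

```python
import pandas as pd

# Toy target column (hypothetical): one fraudulent claim out of ten.
toy = pd.DataFrame({"FraudFound_P": [0] * 9 + [1]})

# Share of each class; for a 0/1 column the mean equals the share of 1s.
class_share = toy["FraudFound_P"].value_counts(normalize=True)
fraud_rate = float(toy["FraudFound_P"].mean())

print(class_share)
print(fraud_rate)
```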

## 17. PolicyNumber (`int64`)
- Masked policy number; appears to equal the zero-based row index plus 1 (values run from 1 to 15420, while the index runs from 0 to 15419).
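The relationship between `PolicyNumber` and the row position can be verified directly. The summary statistics show values from 1 to 15420 against an index of 0 to 15419, so this sketch tests for "index plus 1" on a hypothetical toy frame.

```python
import pandas as pd

# Toy frame (hypothetical) with a default 0-based RangeIndex.
toy = pd.DataFrame({"PolicyNumber": [1, 2, 3, 4]})

# True only if every policy number equals its row position plus one.
matches_index = bool((toy["PolicyNumber"] == toy.index.to_numpy() + 1).all())
print(matches_index)
```

If the check holds on the real data, the column is just a row identifier and carries no predictive information.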

## 18. RepNumber (`int64`)
- Represents the number of the representative handling the claim, an integer from 1 to 16.

## 19. Deductible (`int64`)
- The deductible amount (integer values).

## 20. DriverRating (`int64`)
- Rating scale from 1 to 4.
- **Question**: Is `DriverRating` purely ordinal, or can the gaps between ratings be treated as equal (interval scale)?

## 21. Days_Policy_Accident (`object`)
- Represents the number of days between policy purchase and the accident.
- **Action**: Values are given as ranges, so replace each value with the mean of the range and convert to `float`.

## 22. Days_Policy_Claim (`object`)
- Represents the number of days between policy purchase and claim filing.
- **Action**: Similar to `Days_Policy_Accident`, replace ranges with the mean and convert to `float`.

## 23. PastNumberOfClaims (`object`)
- Represents the previous number of claims filed by the policyholder (or claimant).

## 24. AgeOfVehicle (`object`)
- Represents the age of the vehicle at the time of the accident.
- **Action**: Replace the range with the mean of the range and convert to `float`.

## 25. AgeOfPolicyHolder (`object`)
- Represents the age of the policyholder at the time of the claim.
- **Action**: Replace the range with the mean of the range and convert to `float`.

## 26. PoliceReportFiled (`object`)
- Indicates whether a police report was filed for the accident.
- **Action**: Convert to binary (0 for no, 1 for yes).

## 27. WitnessPresent (`object`)
- Indicates whether a witness was present during the accident.
- **Action**: Convert to binary (0 for no, 1 for yes).
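The binary conversions for `PoliceReportFiled` and `WitnessPresent` could share one helper that fails loudly on unexpected labels. The helper `to_binary` is hypothetical, and the raw labels `"Yes"` and `"No"` are an assumption about the uncleaned data.

```python
import pandas as pd


def to_binary(series, mapping):
    """Map labels to 0/1 and raise on any label missing from the mapping."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped labels: {sorted(unknown)}")
    # Nullable integer dtype keeps missing values as <NA> instead of failing.
    return series.map(mapping).astype("Int64")


# Toy column (hypothetical labels).
witness = pd.Series(["No", "Yes", "No"], name="WitnessPresent")
witness_binary = to_binary(witness, {"No": 0, "Yes": 1})
print(witness_binary.tolist())
```

Raising on unknown labels avoids silently turning typos or placeholder values into missing data.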

## 28. AgentType (`object`)
- Classifies the agent handling the claim as either internal or external.
- **Question**: What does this classification mean? Consider converting to binary.

## 29. NumberOfSuppliments (`object`)
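- Number of supplements attached to the claim (the column name misspells "supplements"); sample values include `none` and `more than 5`.
- **Question**: What does a supplement mean in this dataset? In auto insurance it usually refers to an additional claim for damage found after the initial estimate, but confirm this before transforming the column.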

## 30. AddressChange_Claim (`object`)
- Possibly the time elapsed between an address change and the claim filing; sample values include `1 year` and `no change`.
- **Action**: Replace each interval with the mean value of the range.

## 31. NumberOfCars (`object`)
- Likely refers to the number of cars involved in the accident or the number of cars covered under the policy.
- **Action**: Replace each interval with the mean value of the range.

## 32. Year (`int64`)
- Represents the year in which the accident occurred.

## 33. BasePolicy (`object`)
- Represents the type of insurance coverage (e.g., as seen in `PolicyType`).
