#Data Understanding and Initial Checks

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/A2D2/fraud_oracle.csv')
df.head()

In [None]:
num_rows, num_columns = df.shape
print(f"The dataset contains {num_rows} rows and {num_columns} columns.")

In [None]:
df.columns.tolist()

=> This is a well-structured dataset for vehicle insurance claim fraud detection, covering features related to both the insured person and the claim event.

###Overview of the key columns

**Temporal Features:**

*   Month, WeekOfMonth, DayOfWeek: These capture the date of the accident. Anomalies in accident timing might indicate fraudulent behavior.
*   DayOfWeekClaimed, MonthClaimed, WeekOfMonthClaimed: These show when the claim was made. A large gap between the accident and the claim date may raise suspicion.

**Demographics and Driver Information:**

*   Sex, MaritalStatus, Age, AgeOfPolicyHolder: Demographic details can help identify if certain groups are more likely to commit fraud.

**Accident and Claim Details:**

*   AccidentArea: Whether the accident occurred in an urban or rural area can be significant.
*   Fault: Who was at fault in the accident may influence the claim's legitimacy.
*   PastNumberOfClaims, Days_Policy_Accident, Days_Policy_Claim: A history of frequent claims or new policies may suggest fraud.

**Policy and Vehicle Information:**

*   PolicyType, VehicleCategory, VehiclePrice, BasePolicy: Different types of policies and vehicle characteristics could correlate with fraudulent claims.

**Claim Investigation:**

*   PoliceReportFiled, WitnessPresent, AgentType: The presence of witnesses or police reports often impacts the claim's credibility.

**Fraud Indicator:**

*   FraudFound_P: This is likely the target variable, indicating whether fraud was detected in the claim.

=> A row in our dataset represents a single insurance claim made by a policyholder.

=>Each column provides specific information related to the claim, the policyholder, the insured vehicle, and the context of the incident.

#Comprehensive Exploratory Data Analysis (EDA)

###01.Data Overview

In [None]:
df.info()  # Get information about data types and missing values

In [None]:
df.describe(include='object').T # Summary statistics for categorical data

In [None]:
df.describe()  # Summary statistics for numerical data

In [None]:
# Check for missing values
df.isnull().sum()

Good: there are no missing values (no NaN entries in any column).
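`isnull()` only flags NaN/None entries, so placeholder values such as a literal 0 are not counted as missing. A minimal sketch of how such placeholders could be counted is given below. It runs on a made-up toy frame: the values, and the assumption that a `"0"` string can appear in `MonthClaimed`, are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values), standing in for the real claims table.
toy = pd.DataFrame({
    "Age": [34, 0, 51, 0],
    "MonthClaimed": ["Jan", "0", "Mar", "Apr"],
})

# isnull() reports nothing, because the placeholders are ordinary values.
nan_total = int(toy.isnull().sum().sum())

# Explicit per-column checks for the (assumed) placeholder values.
placeholder_counts = {
    "Age": int((toy["Age"] == 0).sum()),
    "MonthClaimed": int((toy["MonthClaimed"] == "0").sum()),
}

print("NaN entries:", nan_total)
print("Placeholder entries:", placeholder_counts)
```

On the real data, the same checks could be run on `df` after inspecting `df[col].unique()` for suspicious values.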

In [None]:
duplicate_rows = df.duplicated() # Check for duplicates
print(f"Number of duplicate rows: {duplicate_rows.sum()}")

No duplicate rows were found.

###02.Univariate Analysis

Univariate analysis examines each variable individually to understand its distribution, central tendency (mean, median), spread (variance, standard deviation), and outliers. This analysis provides a foundation for detecting any data issues or significant patterns.

**Analysis of Categorical Variables**

In [None]:
categorical_cols = ['Make', 'AccidentArea', 'Sex', 'MaritalStatus', 'Fault', 'PolicyType', 'VehicleCategory', 'FraudFound_P']

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


num_cols = 4
num_rows = 2
fig, axes = plt.subplots(num_rows, num_cols, figsize=(25, 10))
axes = axes.flatten()

# Plotting the count distributions for categorical variables
for i, col in enumerate(categorical_cols):
    sns.countplot(x=col, data=df, palette='Set2', order=df[col].value_counts().index, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}', fontsize=14)
    axes[i].set_xlabel(col, fontsize=12)
    axes[i].set_ylabel('Count', fontsize=12)
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

**Make:**

*Dominance of common car makes:*

-Pontiac, Toyota, and Honda are the most frequent car brands in the dataset. This likely indicates that these makes are more commonly insured or involved in claims, which could be due to their widespread use in the population.

-On the other hand, luxury brands like Ferrari and Porsche have fewer occurrences, which makes sense because fewer people own luxury cars.

Implication for fraud detection:

-It’s worth investigating whether certain makes are more associated with fraud.

-For example, luxury cars, though rare, might have a different fraud rate compared to more common makes like Toyota or Honda.

**Fault:**

*Policyholder at fault:*

Most claims attribute fault to the policyholder rather than a third party. => This could reflect the nature of insurance claims, where policyholders are more likely to report incidents where they are involved directly.

Implication for fraud detection:

=>A deeper look could reveal if claims where the policyholder is at fault have different fraud rates.

=>For example, policyholders who admit fault might be more honest, or certain types of fraudulent behavior might involve third-party blame.

**Policy Type:**

*Sedan policies dominate:*  

Most claims come from sedan owners, especially in collision and liability cases. This reflects the common use of sedans as personal vehicles. Other vehicle types, such as utility and sport vehicles, have fewer claims.

Implication for fraud detection:

-Since sedans are common, it’s crucial to assess whether certain policy types are more prone to fraud.

-For example, “All Perils” policies might attract more fraudulent claims than “Collision” or “Liability” types due to their broader coverage.

In [None]:
fraud_counts = df['FraudFound_P'].value_counts().reset_index()
fraud_counts.columns = ['FraudFound_P', 'Count']
fraud_counts

In [None]:
import plotly.express as px
# Pie Chart : Target Balance
fig = px.pie(fraud_counts, names='FraudFound_P', values='Count', color='FraudFound_P',
            color_discrete_map={0: '#87CEFA', 1: '#FF6F61'})  # hex colors need the leading '#'

fig.update_traces(
    textinfo='percent',
    textfont={'size': 16, 'color': 'Black'},
    marker=dict(line=dict(color='black', width=2)))

fig.update_layout(
    title={
        'text': 'Target Balance',
        'x':0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    })

fig.show()

**FraudFound_P:**

*Low fraud rate:*

Only a small fraction of claims are marked as fraudulent. This is typical in fraud detection, where fraudulent claims are a small subset of the total claims.

=>Implication for fraud detection: Since fraud cases are rare, this class imbalance could make it harder to detect fraud accurately using machine learning.

=>It will be essential to focus on techniques that handle imbalanced datasets to ensure the model performs well in identifying fraud.
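The degree of imbalance can be put into numbers before choosing a technique. A minimal sketch, assuming a made-up target with 6 fraud cases in 100 claims (not the real counts):

```python
import pandas as pd

# Toy target (made-up numbers): 6 fraud cases among 100 claims.
target = pd.Series([1] * 6 + [0] * 94, name="FraudFound_P")

# Proportion of each class.
class_share = target.value_counts(normalize=True)

# How many non-fraud claims there are for every fraud claim.
imbalance_ratio = (target == 0).sum() / (target == 1).sum()

print(class_share.round(2).to_dict())
print(f"majority:minority = {imbalance_ratio:.1f}:1")
```

On the real data, `target` would simply be `df['FraudFound_P']`.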

**Analysis of Numerical Variables**

In [None]:
#DriverRating Distribution
import seaborn as sns
driver_rating_counts = df['DriverRating'].value_counts()

labels = driver_rating_counts.index
sizes = driver_rating_counts.values

pastel_colors = sns.color_palette("pastel", len(labels))

plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140, colors=pastel_colors)
plt.title('Driver Rating Distribution', fontsize=16)
plt.axis('equal')
plt.show()

In [None]:
# Plotting the histogram for age distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=30, kde=True, color='skyblue', edgecolor='black')
plt.title('Age Distribution', fontsize=16)
plt.xlabel('Age', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()

**Deductible Distribution:**

-Deductible: Most claims have a deductible between 400 and 500.

-Values around 300 and 700 are rare cases where the deductible is unusually low or high.

**Fraud Correlation:** => It would be useful to check whether deductibles correlate with fraud occurrences (i.e., the FraudFound_P column). Unusually high or low deductibles might point to riskier claims or fraudulent activity, depending on how these rare values align with the fraud flags.
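One way to run this check is to compute the fraud rate for each deductible level together with the number of claims behind it. The sketch below uses a made-up toy frame, so the resulting rates say nothing about the real data.

```python
import pandas as pd

# Toy claims (made-up values); the notebook would use df instead.
toy = pd.DataFrame({
    "Deductible":   [400, 400, 400, 400, 500, 500, 300, 700],
    "FraudFound_P": [0,   0,   1,   0,   0,   1,   1,   0],
})

# The mean of a 0/1 target within a group is that group's fraud rate;
# the count shows how many claims the rate is based on.
by_deductible = (
    toy.groupby("Deductible")["FraudFound_P"]
       .agg(claims="count", fraud_rate="mean")
)
print(by_deductible)
```

Keeping the `claims` column next to `fraud_rate` makes it obvious when a rate rests on only a handful of rows.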

**Characteristic of Dataset:**

In [None]:
# Function01: Summarize the characteristic of dataset
def summarize(DataFrame):

    summary = pd.DataFrame()

    # Data Type
    summary['Data Type'] = DataFrame.dtypes
    # N Unique
    summary['N Unique'] = DataFrame.nunique()
    # Unique
    summary['Unique'] = DataFrame.apply(lambda x: x.unique().tolist())
    # Max
    summary['Max'] = DataFrame.apply(lambda x: x.max() if pd.api.types.is_numeric_dtype(x) else '-')
    # Min
    summary['Min'] = DataFrame.apply(lambda x: x.min() if pd.api.types.is_numeric_dtype(x) else '-')

    # Measures of Central Tendency: Mean, Median, Mode
    summary['Mean'] = DataFrame.apply(lambda x: round(x.mean(), 2) if pd.api.types.is_numeric_dtype(x) else '-')
    summary['Median'] = DataFrame.apply(lambda x: x.median() if pd.api.types.is_numeric_dtype(x) else '-')
    summary['Mode'] = DataFrame.apply(lambda x: x.mode().iloc[0] if not x.mode().empty else '-')

    # Measures of Dispersion: Range, Variance, Standard Deviation
    summary['Range'] = DataFrame.apply(lambda x: x.max() - x.min() if pd.api.types.is_numeric_dtype(x) else '-')
    summary['Variance'] = DataFrame.apply(lambda x: x.var() if pd.api.types.is_numeric_dtype(x) else '-')
    summary['Standard Deviation'] = DataFrame.apply(lambda x: x.std() if pd.api.types.is_numeric_dtype(x) else '-')

    # Measures of Shape: Skewness, Kurtosis
    summary['Skewness'] = DataFrame.apply(lambda x: round(x.skew(), 2) if pd.api.types.is_numeric_dtype(x) else '-')
    summary['Kurtosis'] = DataFrame.apply(lambda x: round(x.kurt(), 2) if pd.api.types.is_numeric_dtype(x) else '-')

    return summary

In [None]:
summary = summarize(df)
summary

###03.Bivariate Analysis

This helps us explore relationships between two variables, particularly between features and the target variable (FraudFound_P).

In [None]:
df_fraud = df[df['FraudFound_P'] == 1]
df_non_fraud = df[df['FraudFound_P'] == 0]
df_fraud

**Fraud Detection by Sex**

In [None]:
fraud_counts_sex = df_fraud['Sex'].value_counts()
fraud_percentages_sex = (fraud_counts_sex / fraud_counts_sex.sum()) * 100
print(fraud_percentages_sex)

In [None]:
import plotly.graph_objects as go
fig = go.Figure()


fig.add_trace(go.Bar(
    x=fraud_counts_sex.index,
    y=fraud_counts_sex.values,
    text=[f'{count} ({percentage:.2f}%)' for count, percentage in zip(fraud_counts_sex.values, fraud_percentages_sex)],
    textposition='auto',
    marker_color=['rgba(179, 205, 227, 0.8)', 'rgba(251, 180, 174, 0.8)'],
    width=0.5
))

fig.update_layout(
    title={
        'text': 'Fraud Detection by Sex',
        'x': 0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    },
    xaxis=dict(
        title='Sex',
        tickmode='array',
        tickvals=fraud_counts_sex.index
    ),
    yaxis=dict(title='Count'),
    showlegend=False,
    template="plotly_white"
)

fig.show()


**Fraud Detection by Sex**

**Insight:** Males account for a far larger share of detected fraud cases than females.

**Analysis:** The chart shows each sex's share of the fraud cases, not the fraud rate within each sex. Because it does not account for how many claims each sex files overall, the gap may largely reflect the composition of the dataset rather than a higher propensity for fraud among men. The fraud rates per sex (fraud cases divided by all claims of that sex) should be compared before concluding that interventions should focus on male drivers.
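A share among fraud cases and a fraud rate within a group are different quantities. The sketch below uses made-up data in which both sexes have exactly the same fraud rate, yet one sex still accounts for most fraud cases simply because it files most claims.

```python
import pandas as pd

# Toy data (made-up): 80 male and 20 female claims, 10% fraud in each group.
toy = pd.DataFrame({
    "Sex": ["Male"] * 80 + ["Female"] * 20,
    "FraudFound_P": [1] * 8 + [0] * 72 + [1] * 2 + [0] * 18,
})

# Share of each sex among fraud cases (the quantity plotted in the bar chart).
share_of_fraud = toy.loc[toy["FraudFound_P"] == 1, "Sex"].value_counts(normalize=True)

# Fraud rate within each sex: fraud cases divided by all claims of that sex.
rate_within_sex = toy.groupby("Sex")["FraudFound_P"].mean()

print(share_of_fraud.to_dict())
print(rate_within_sex.to_dict())
```

On the real data, `rate_within_sex` computed from `df` is the figure that supports or refutes a claim about one sex being more prone to fraud.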

In [None]:
df_counts_make = df['Make'].value_counts().sort_index()
df_counts_fraud2 = df_fraud['Make'].value_counts().sort_index()
df_percentages_fraud2 = pd.DataFrame(round((df_counts_fraud2 / df_counts_make) * 100, 2)).fillna(0).reset_index()

df_percentages_fraud2.columns = ['Make', 'Fraud %']
df_percentages_fraud2 = df_percentages_fraud2.sort_values(by=['Fraud %', 'Make'])

# Bar Chart
fig = go.Figure()

fig.add_trace(go.Bar(x=df_percentages_fraud2['Fraud %'], y=df_percentages_fraud2['Make'], orientation='h',
    marker_color='lightcoral'))

fig.update_layout(
    title={
        'text': 'Fraud Detection by Make',
        'x':0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    })

fig.show()

**Fraud Detection by Make**

**Insight:** Mercedes has the highest fraud rate of all makes, followed by Accura, Saturn, and Saab. At the other end, Ferrari and Jaguar have the lowest fraud rates.

**Analysis:** This could suggest that certain brands are more frequently involved in fraud cases, possibly because of the demographic or economic status of their owners, or the cost of repairs and insurance associated with the brand. However, makes with few claims in the dataset, which includes the luxury brands, produce unstable percentages: a single fraudulent claim can move the rate a lot. The high rate for Mercedes should therefore be read together with its claim count before treating it as evidence of more extensive or high-value fraud attempts.
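To judge how much a per-make fraud rate can be trusted, the rate can be reported with its claim count and an approximate confidence interval. The sketch below uses the Wilson score interval on made-up data; `RareMake` and `CommonMake` are hypothetical labels, and the helper `wilson_interval` is written here for illustration.

```python
import math
import pandas as pd

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)

# Toy data (made-up): a rare make with 1 fraud in 4 claims,
# and a common make with 60 frauds in 1000 claims.
toy = pd.DataFrame({
    "Make": ["RareMake"] * 4 + ["CommonMake"] * 1000,
    "FraudFound_P": [1, 0, 0, 0] + [1] * 60 + [0] * 940,
})

stats = toy.groupby("Make")["FraudFound_P"].agg(fraud="sum", claims="count")
stats["rate"] = stats["fraud"] / stats["claims"]

intervals = [wilson_interval(f, n) for f, n in zip(stats["fraud"], stats["claims"])]
stats["low"] = [low for low, _ in intervals]
stats["high"] = [high for _, high in intervals]
print(stats.round(3))
```

The rare make shows a much higher rate, but its interval is so wide that it is still compatible with the common make's rate.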

In [None]:
df_viz = df.copy()
df_fraud = df_viz[df_viz['FraudFound_P'] == 1]

In [None]:
df_counts_vp = df_viz['VehiclePrice'].value_counts().sort_index()
df_counts_fraud3 = df_fraud['VehiclePrice'].value_counts().sort_index()
df_percentages_fraud3 = pd.DataFrame(round((df_counts_fraud3 / df_counts_vp) * 100, 2)).fillna(0).reset_index()

df_percentages_fraud3.columns = ['VehiclePrice', 'Fraud %']
df_percentages_fraud3 = df_percentages_fraud3.sort_values(by=['Fraud %'])

# Bar Chart
fig = go.Figure()

fig.add_trace(go.Bar(x=df_percentages_fraud3['Fraud %'], y=df_percentages_fraud3['VehiclePrice'], orientation='h',
    marker_color='#9FF781'))

fig.update_layout(
    title={
        'text': 'Fraud Detection by VehiclePrice',
        'x':0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    })

fig.update_layout(
    xaxis=dict(title='Percentage', range=[0, 16], dtick=2), yaxis=dict(title='VehiclePrice'), legend=dict(title='Outcome'), bargap=0.3)

fig.add_shape(
    type="line",
    x0=6,
    y0=0,
    x1=6,
    y1=5.5,
    line=dict(
        color="rgba(0, 100, 0, 1)",
        width=2,
        dash="dashdot",
    ),
)

fig.show()

The fraud rate is highest in the cheapest and the most expensive vehicle price categories.

In [None]:
# Age of Vehicle
df_counts_ageofvehicle = df_viz['AgeOfVehicle'].value_counts().sort_index()
df_counts_fraud6 = df_fraud['AgeOfVehicle'].value_counts().sort_index()
df_percentages_fraud6 = pd.DataFrame(round((df_counts_fraud6 / df_counts_ageofvehicle) * 100, 2)).fillna(0).reset_index()

df_percentages_fraud6.columns = ['AgeOfVehicle', 'Fraud %']
df_percentages_fraud6 = df_percentages_fraud6.sort_values(by=['Fraud %'])

# Bar Chart
fig = go.Figure()

fig.add_trace(go.Bar(x=df_percentages_fraud6['Fraud %'], y=df_percentages_fraud6['AgeOfVehicle'], orientation='h',
    marker_color='rgb(255, 165, 0)'))

fig.update_layout(
    title={
        'text': 'Fraud Detection by Age of Vehicle',
        'x':0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    })

fig.update_layout(
    xaxis=dict(title='Percentage', range=[0, 16], dtick=2), yaxis=dict(title='AgeOfVehicle'), legend=dict(title='Outcome'), bargap=0.3)

fig.add_shape(
    type="line",
    x0=6,
    y0=-1,
    x1=6,
    y1=8,
    line=dict(
        color="rgba(0, 100, 0, 1)",
        width=2,
        dash="dashdot",
    ),
)

fig.show()

In [None]:
from plotly.subplots import make_subplots
# Month
df_counts_month = df_viz['Month'].value_counts().sort_index()
df_counts_fraud7 = df_fraud['Month'].value_counts().sort_index()
df_percentages_fraud7 = pd.DataFrame(round((df_counts_fraud7 / df_counts_month) * 100, 2)).fillna(0).reset_index()

df_percentages_fraud7.columns = ['Month', 'Fraud %']

# MonthClaimed
df_counts_monthclaimed = df_viz['MonthClaimed'].value_counts().sort_index()
df_counts_fraud8 = df_fraud['MonthClaimed'].value_counts().sort_index()
df_percentages_fraud8 = pd.DataFrame(round((df_counts_fraud8 / df_counts_monthclaimed) * 100, 2)).fillna(0).reset_index()

df_percentages_fraud8.columns = ['MonthClaimed', 'Fraud %']

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Bar Chart - Subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=('Month', "MonthClaimed"))

fig.add_trace(go.Bar(x=df_percentages_fraud7['Month'], y=df_percentages_fraud7['Fraud %'],
    marker_color='#8A2BE2'), row=1, col=1)

fig.add_trace(go.Bar(x=df_percentages_fraud8['MonthClaimed'], y=df_percentages_fraud8['Fraud %'],
    marker_color='#FF4500'), row=1, col=2)

fig.update_layout(
    title={
        'text': 'Fraud Detection by Month & MonthClaimed',
        'x':0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    }, showlegend=False,
    xaxis=dict(categoryorder='array', categoryarray=month_order),
    xaxis2=dict(categoryorder='array', categoryarray=month_order))

fig.add_shape(
    type="line",
    x0=-1,
    y0=6,
    x1=12,
    y1=6,
    line=dict(
        color="rgba(128, 128, 128, 1)",
        width=2,
        dash="dashdot",
    ),
    xref="x1",
    yref="y1"
)

fig.add_shape(
    type="line",
    x0=-1,
    y0=6,
    x1=12.5,
    y1=6,
    line=dict(
        color="rgba(128, 128, 128, 1)",
        width=2,
        dash="dashdot",
    ),
    xref="x2",
    yref="y2"
)

fig.show()

Month and MonthClaimed show similar fraud-rate patterns across the year.

###04.Multivariate Analysis

In [None]:
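# Note: this selects all male claims; no FraudFound_P filter is applied here.
# .copy() avoids pandas' SettingWithCopyWarning when AgeRange is added below.
df_male = df[df['Sex'] == 'Male'].copy()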
age_bins = [18, 25, 35, 45, 55, 65, 75, 85]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+']
df_male['AgeRange'] = pd.cut(df_male['Age'], bins=age_bins, labels=age_labels, right=False)

age_range_counts = df_male['AgeRange'].value_counts().sort_index()
age_range_percentages = (age_range_counts / age_range_counts.sum()) * 100


fig = go.Figure()

fig.add_trace(go.Bar(
    x=age_range_percentages.index,
    y=age_range_percentages.values,
    text=[f'{percentage:.2f}%' for percentage in age_range_percentages.values],
    marker_color='rgba(179, 205, 227, 0.8)',
    width=0.5
))

fig.update_layout(
    title={
        'text': 'Male Claims by Age Range',
        'x': 0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    },
    xaxis=dict(
        title='Age Range',
        tickmode='array',
        tickvals=age_range_percentages.index
    ),
    yaxis=dict(title='Percentage (%)'),
    showlegend=False,
    template="plotly_white"
)


fig.show()

**Male Claims by Age Range**

**Insight:** Among male claimants, the 25-34 age group accounts for the largest share at 32.39%, followed by 35-44 (27.69%) and 45-54 (20.01%). The share tapers off significantly after the age of 54.

**Analysis:** The code above filters only on Sex and not on FraudFound_P, so these percentages describe the age composition of all male claims rather than fraud rates. They are consistent with the hypothesis that younger males (especially in their mid-20s to mid-30s) are involved in more fraudulent claims, for example because of riskier driving behavior or financial pressure, but they do not show it. To test the hypothesis, the fraud rate within each age range has to be compared.
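To check whether fraud is actually more common in a given age range, the fraud rate can be computed per group rather than the share of claims. A minimal sketch on made-up data, reusing the same bins and labels as the cells in this section:

```python
import pandas as pd

# Toy data (made-up values); Age 0 stands for an invalid entry.
toy = pd.DataFrame({
    "Sex": ["Male"] * 6 + ["Female"] * 2,
    "Age": [22, 28, 31, 33, 47, 0, 30, 52],
    "FraudFound_P": [0, 1, 0, 0, 1, 0, 0, 0],
})

age_bins = [18, 25, 35, 45, 55, 65, 75, 85]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+']

males = toy[toy["Sex"] == "Male"].copy()
# Ages outside [18, 85) become NaN and are dropped by the groupby below.
males["AgeRange"] = pd.cut(males["Age"], bins=age_bins, labels=age_labels, right=False)

# Fraud rate and number of claims per age range (only ranges that occur).
summary = (
    males.groupby("AgeRange", observed=True)["FraudFound_P"]
         .agg(claims="count", fraud_rate="mean")
)
print(summary)
```

The same pattern works for the female subset or for a make-specific subset by changing the filter.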

In [None]:
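# Note: this selects all female claims; no FraudFound_P filter is applied here.
# .copy() avoids pandas' SettingWithCopyWarning when AgeRange is added below.
df_female = df[df['Sex'] == 'Female'].copy()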
age_bins = [18, 25, 35, 45, 55, 65, 75, 85]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+']

df_female['AgeRange'] = pd.cut(df_female['Age'], bins=age_bins, labels=age_labels, right=False)
age_range_counts_female = df_female['AgeRange'].value_counts().sort_index()
age_range_percentages_female = (age_range_counts_female / age_range_counts_female.sum()) * 100

fig = go.Figure()

fig.add_trace(go.Bar(
    x=age_range_percentages_female.index,
    y=age_range_percentages_female.values,
    text=[f'{percentage:.2f}%' for percentage in age_range_percentages_female.values],
    textposition='auto',
    marker_color='rgba(255, 182, 193, 0.8)',
    width=0.5
))

fig.update_layout(
    title={
        'text': 'Female Claims by Age Range',
        'x': 0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    },
    xaxis=dict(
        title='Age Range',
        tickmode='array',
        tickvals=age_range_percentages_female.index
    ),
    yaxis=dict(title='Percentage (%)'),
    showlegend=False,
    template="plotly_white"
)

fig.show()

In [None]:
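# 'Mecedes' is kept exactly as written in the source; confirm the label with df['Make'].unique().
# No FraudFound_P filter is applied here; .copy() avoids SettingWithCopyWarning below.
df_mercedes_male = df[(df['Make'] == 'Mecedes') & (df['Sex'] == 'Male') & (df['Age'].between(18,85))].copy()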


age_bins = [18, 25, 35, 45, 55, 65, 75, 85]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+']
df_mercedes_male['AgeRange'] = pd.cut(df_mercedes_male['Age'], bins=age_bins, labels=age_labels, right=False)

age_range_counts_mercedes_male = df_mercedes_male['AgeRange'].value_counts().sort_index()
age_range_percentages_mercedes_male = (age_range_counts_mercedes_male / age_range_counts_mercedes_male.sum()) * 100

fig = go.Figure()

fig.add_trace(go.Bar(
    x=age_range_percentages_mercedes_male.index,
    y=age_range_percentages_mercedes_male.values,
    text=[f'{percentage:.2f}%' for percentage in age_range_percentages_mercedes_male.values],
    textposition='auto',
    marker_color='rgba(179, 205, 227, 0.8)',
    width=0.5
))

fig.update_layout(
    title={
        'text': "Mercedes Claims by Male Drivers and Age Range",
        'x': 0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    },
    xaxis=dict(
        title='Age Range',
        tickmode='array',
        tickvals=age_range_percentages_mercedes_male.index
    ),
    yaxis=dict(title='Percentage (%)'),
    showlegend=False,
    template="plotly_white"
)

fig.show()

In [None]:
import plotly.graph_objects as go
# No FraudFound_P filter is applied here; .copy() avoids SettingWithCopyWarning below.
df_accura_male = df[(df['Make'] == 'Accura') & (df['Sex'] == 'Male') & (df['Age'].between(18, 85))].copy()
age_bins = [18, 25, 35, 45, 55, 65, 75, 85]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+']


df_accura_male['AgeRange'] = pd.cut(df_accura_male['Age'], bins=age_bins, labels=age_labels, right=False)
age_range_counts_accura_male = df_accura_male['AgeRange'].value_counts().sort_index()


age_range_percentages_accura_male = (age_range_counts_accura_male / age_range_counts_accura_male.sum()) * 100


fig = go.Figure()

fig.add_trace(go.Bar(
    x=age_range_percentages_accura_male.index,
    y=age_range_percentages_accura_male.values,
    text=[f'{percentage:.2f}%' for percentage in age_range_percentages_accura_male.values],
    textposition='auto',
    marker_color='rgba(179, 205, 227, 0.8)',
    width=0.5
))

fig.update_layout(
    title={
        'text': "Accura Claims by Male Drivers and Age Range",
        'x': 0.5,
        'font': {'family': "Arial, sans-serif", 'size': 24}
    },
    xaxis=dict(
        title='Age Range',
        tickmode='array',
        tickvals=age_range_percentages_accura_male.index
    ),
    yaxis=dict(title='Percentage (%)'),
    showlegend=False,
    template="plotly_white"
)


fig.show()

In [None]:
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Correlation matrix heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df[numerical_cols].corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features', fontsize=16)
plt.show()

FraudFound_P:

This is the target variable, and it shows low correlations with other features (the highest absolute value is around 0.03 with Age). This suggests that none of the numerical features have a strong linear relationship with fraud detection, indicating that perhaps more complex relationships (non-linear) exist or categorical variables may play a larger role.
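Since Pearson correlation does not apply to categorical features, an association measure such as Cramér's V can be used to compare them against the target. The sketch below is a plain implementation without bias correction, run on made-up data. The helper name `cramers_v` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V between two categorical series (no bias correction)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n.
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy data (made-up): fraud occurs only under one of the two policies.
toy = pd.DataFrame({
    "BasePolicy": ["Liability"] * 50 + ["Collision"] * 50,
    "FraudFound_P": [0] * 50 + [1] * 10 + [0] * 40,
})

v = cramers_v(toy["BasePolicy"], toy["FraudFound_P"])
print(f"Cramér's V = {v:.3f}")
```

Values close to 0 indicate little association and values close to 1 a strong one. Applying the function to each categorical column of `df` gives a ranking comparable to the heatmap above.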

PolicyNumber and Year:

These two variables have a very high correlation of 0.94. This could indicate that policy numbers follow a pattern based on the year they were issued, which makes them largely redundant. We might consider removing one of these features to avoid multicollinearity issues.

#Conclusion:

1/There are no missing values.

2/"PolicyNumber" is merely an identification number. Let's drop it.

3/The features "Sex", "PoliceReportFiled", and "WitnessPresent" are effectively Boolean. They should be converted to 0 or 1.

4/The features "AccidentArea", "Fault", and "AgentType" each have only two unique values. They can also be converted to 0 or 1.

5/The minimum value of "Age" is 0, which is not a plausible age and most likely marks an invalid or missing entry.

6/The features "FraudFound_P" and "Deductible" are highly skewed. Typically, a feature is considered skewed if its skewness falls outside the range of -0.5~0.5.

 Since "FraudFound_P" and "Deductible" have skewness values outside this range, their distributions have pronounced tails: most observations sit at one value and a few lie far from it. This is essential to know, as highly skewed data can impact the performance of many machine learning models, particularly those sensitive to outliers or that assume normality (e.g., linear regression). **For a numeric feature such as "Deductible", transforming the data (e.g., using log, square root, or Box-Cox transformations) might be helpful to reduce skewness.** For "FraudFound_P", which is a binary target, the skewness simply reflects the class imbalance and is addressed as an imbalance problem rather than by transformation.



7/The features "WeekOfMonth", "WeekOfMonthClaimed", "FraudFound_P", "Deductible", and "DriverRating" exhibit high kurtosis; Typically a feature is considered to have high kurtosis if its value falls outside the range of -1 ~ 1.

8/Among all the features, **only Age is a numeric variable**. The rest can be interpreted as categorical variables.

9/The target variable "FraudFound_P" is **highly imbalanced**.

10/Most detected fraud cases involve males. Whether males are also more likely to commit fraud still has to be checked against their share of all claims.
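Points 2 to 5 translate into a few preprocessing steps. The sketch below shows one possible way to do them on a made-up toy frame. The category labels in `binary_maps` and the choice of median imputation for Age are assumptions, not decisions taken in this notebook.

```python
import pandas as pd

# Toy data (made-up values) with the columns mentioned in points 2 to 5.
toy = pd.DataFrame({
    "PolicyNumber": [1, 2, 3, 4],
    "Sex": ["Male", "Female", "Male", "Male"],
    "PoliceReportFiled": ["No", "Yes", "No", "No"],
    "AccidentArea": ["Urban", "Rural", "Urban", "Urban"],
    "Age": [34, 0, 51, 27],
})

# Point 2: drop the identifier column.
clean = toy.drop(columns=["PolicyNumber"])

# Points 3 and 4: explicit mappings keep the meaning of 0 and 1 documented.
binary_maps = {
    "Sex": {"Female": 0, "Male": 1},
    "PoliceReportFiled": {"No": 0, "Yes": 1},
    "AccidentArea": {"Rural": 0, "Urban": 1},
}
for col, mapping in binary_maps.items():
    clean[col] = clean[col].map(mapping)

# Point 5: treat Age == 0 as missing, then fill with the median of valid ages
# (one possible choice among several).
clean["Age"] = clean["Age"].where(clean["Age"] > 0)
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

print(clean)
```

Using `map` with an explicit dictionary turns any unexpected label into NaN, which makes typos in the category values easy to spot.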

**Issues Caused by Imbalanced Target Variable:**

**Model Bias:** The model may become biased towards the majority class, leading to poor performance on the minority class.

**Misleading Metrics:** Metrics like accuracy can be misleading, as high accuracy can be achieved by simply predicting the majority class.

**Poor Generalization:** The model may not generalize well to real-world scenarios where the minority class is important.
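The misleading-accuracy problem is easy to reproduce. In the sketch below, which uses made-up labels with 6 fraud cases in 100 claims, a baseline that never predicts fraud reaches high accuracy while detecting no fraud at all.

```python
import numpy as np

# Toy labels (made-up numbers): 6 fraud cases among 100 claims.
y_true = np.array([1] * 6 + [0] * 94)

# Baseline "model" that always predicts the majority class (no fraud).
y_pred = np.zeros_like(y_true)

accuracy = float((y_pred == y_true).mean())
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.2f}, recall on fraud = {recall:.2f}")
```

This is why recall, precision and F1 on the fraud class are more informative than accuracy for this problem.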

**=>Necessity of Oversampling:**
Oversampling addresses these issues by balancing the training data, which helps the model learn from both classes more effectively. This improves the model's ability to correctly predict the minority class, leading to better overall performance and more reliable evaluation metrics.


**Improved Model Performance:** In cases of severe class imbalance, the model may become biased towards the majority class, leading to poor performance on the minority class. Oversampling helps mitigate this issue, enhancing the overall performance of the model.

**Enhanced Evaluation Metrics:** With severe class imbalance, metrics like accuracy can be misleading. After oversampling the training data, the model should be assessed with metrics such as Precision, Recall, and F1 Score, computed on a test set that has not been resampled.

**Balanced Learning:** Balancing the training data ensures that the model learns from all classes evenly. This leads to better performance across various scenarios when the model is deployed in real-world applications.
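As an illustration, random oversampling can be done with pandas alone. The sketch below is one simple variant (duplicating minority rows at random) on made-up data. The helper `random_oversample` is hypothetical, the every-fifth-row hold-out is used only to keep the example deterministic, and more refined methods such as SMOTE are not shown.

```python
import pandas as pd

def random_oversample(train, target, random_state=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    counts = train[target].value_counts()
    majority_n = int(counts.max())
    parts = []
    for label, n in counts.items():
        rows = train[train[target] == label]
        if n < majority_n:
            extra = rows.sample(majority_n - int(n), replace=True, random_state=random_state)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    # Shuffle so the duplicated rows are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data (made-up): 6 fraud cases among 100 claims.
toy = pd.DataFrame({"Age": range(20, 120), "FraudFound_P": [1] * 6 + [0] * 94})

# Hold out a test set first; only the training part is oversampled.
test = toy.iloc[::5]
train = toy.drop(test.index)
train_balanced = random_oversample(train, "FraudFound_P")

print(train["FraudFound_P"].value_counts().to_dict())
print(train_balanced["FraudFound_P"].value_counts().to_dict())
```

The test set keeps the original class ratio, so metrics computed on it reflect what the model would face in practice.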