<h1 align="center">📖 Introduction 📖</h1>

<div class="alert alert-block" style="font-size:20px; font-family:verdana; line-height:1.7em; border-radius:20px; padding:2em; background-color:#212021; color: #c4c2c4">
  <img src="assets/insurance.jpg" style="float: right; margin-left: 20px; width: 500px;">
  Insurance is an essential financial service that provides protection against unforeseen risks, offering peace of mind to individuals and businesses. One of the central challenges in the insurance industry is determining the Premium Amount—the cost charged to customers in exchange for coverage. Setting the right premium requires a careful balance: it should reflect the risk associated with the insured entity while remaining competitive and fair.
  <br><br>
  The Premium Amount depends on many factors, such as customer demographics, claim history, and attributes of the insured item (e.g., vehicle age, property type). Pricing is difficult because these features interact, and because errors in either direction are costly: underpricing leads to losses for the insurer, while overpricing drives customers to competitors.
  <br><br>
  In this notebook, we will analyze the dataset to uncover patterns and relationships that drive premium calculations. Through this exploration, we aim to develop a deeper understanding of the variables at play, identify key drivers of the Premium Amount, and set the stage for building predictive models that can assist in accurate pricing strategies.
  <br><br>
  Let’s delve into the data and tackle this fascinating insurance problem!
</div>

<h1 align="center">🔬 Feature Description 🔬</h1>

<div class="alert alert-block" style="font-size:20px; font-family:verdana; line-height:1.7em; border-radius:20px; padding:2em; background-color:#212021; color: #c4c2c4;">
The features below appear in both the <strong>Train</strong> and <strong>Test</strong> datasets; the ranges and means are computed over the two combined:<br>
<ol>
<li><strong>Gender</strong>: Participant's gender.</li>
<li><strong>Age</strong>: Participant's age, ranging from 18 to 64, with a mean of 41.</li>
<li><strong>Annual Income</strong>: Yearly income ranging from 1 to 149,997, with a mean of 32,768.</li>
<li><strong>Marital Status</strong>: Represents participants' marital status, categorized as <strong>Single</strong>, <strong>Married</strong>, or <strong>Divorced</strong>.</li>
<li><strong>Number of Dependents</strong>: Refers to individuals financially supported by the participant, ranging from 0 to 4, with a mean of 2.</li>
<li><strong>Education Level</strong>: Represents participants' highest level of education, including <strong>Master's</strong>, <strong>PhD</strong>, <strong>Bachelor's</strong>, and <strong>High School</strong>.</li>
<li><strong>Occupation</strong>: Represents the participant's employment status, including <strong>Employed</strong>, <strong>Self-Employed</strong>, and <strong>Unemployed</strong>.</li>
<li><strong>Health Score</strong>: Represents the participant's health rating, ranging from 1 to 58, with a mean of 24.</li>
<li><strong>Location</strong>: Represents the participant's living area, categorized as <strong>Suburban</strong>, <strong>Rural</strong>, or <strong>Urban</strong>.</li>
<li><strong>Policy Type</strong>: Represents the type of insurance policy, categorized as <strong>Premium</strong>, <strong>Comprehensive</strong>, or <strong>Basic</strong>.</li>
<li><strong>Policy Start Date</strong>: Represents the date the participant's policy started, ranging from 2019 to 2024, with a mean of 2021.</li>
<li><strong>Previous Claims</strong>: Represents the number of previous claims made by the participant, ranging from 0 to 9, with a mean of 1.</li>
<li><strong>Vehicle Age</strong>: Represents the age of the participant's vehicle, ranging from 0 to 19 years, with a mean of 9.</li>
<li><strong>Credit Score</strong>: Represents the participant's credit score, ranging from 300 to 849, with a mean of 592.</li>
<li><strong>Insurance Duration</strong>: Represents the number of years the participant has held their insurance, ranging from 1 to 9 years, with a mean of 5.</li>
<li><strong>Customer Feedback</strong>: Represents the participant's feedback on the service, categorized as <strong>Poor</strong>, <strong>Average</strong>, or <strong>Good</strong>.</li>
<li><strong>Smoking Status</strong>: Represents whether the participant smokes, categorized as <strong>Yes</strong> or <strong>No</strong>.</li>
<li><strong>Exercise Frequency</strong>: Represents how often the participant exercises, categorized as <strong>Rarely</strong>, <strong>Daily</strong>, <strong>Weekly</strong>, or <strong>Monthly</strong>.</li>
<li><strong>Property Type</strong>: Represents the type of property the participant resides in, categorized as <strong>House</strong>, <strong>Condo</strong>, or <strong>Apartment</strong>.</li>
</ol>
<strong>Target Variable</strong><br>
<ul>
<li><strong>Premium Amount</strong>: Represents the insurance premium charged to the participant, ranging from 20 to 4,999.</li>
</ul>
</div>


<h1 align="center">🎯 Target 🎯</h1>

<div class="alert alert-block alert-danger" style="font-size:20px; font-family:verdana; line-height:1.7em; background-color:#212021; border-radius:2em; padding:2em;">
    The goal of this project is to develop predictive models that accurately estimate an individual's insurance premium from demographic, lifestyle, and personal factors. By analyzing this dataset, we aim to:<br>
    <ol>
        <li><strong>Identify Key Factors</strong>: Discover which variables—such as age, marital status, education level, occupation, credit score, health status, and more—most strongly influence the premium amount.</li>
        <li><strong>Build a Predictive Model</strong>: Use machine learning techniques to create a model that can predict an individual's insurance premium amount, given their personal and lifestyle information.</li>
        <li><strong>Gain Business Insights</strong>: Understand the patterns and trends that impact premium pricing, which can help insurance companies optimize their offerings and improve customer segmentation strategies.</li>
    </ol>
    Ultimately, this project aims to show how machine learning can support insurance pricing and customer targeting by identifying the factors that most influence premiums.
</div>


<h1 align="center">🎉 Fun Fact 🎉</h1>

<div class="alert alert-block alert-info" style="font-size:20px; font-family:verdana; line-height:1.7em; background-color:#f0f8ff; border-radius:2em; padding:2em;">
  The word "premium" is derived from the Latin word <i>praemium</i>, which meant "reward" or "prize." So, every time you pay your insurance premium, you're actually investing in a <i>"reward"</i> or <i>"prize"</i> for future protection!
</div>

In [86]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [87]:
train_path = './data/train.csv'
test_path = './data/test.csv'

In [88]:
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

In [89]:
df = pd.concat([train_df, test_df], sort=False)
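
Because the train and test frames are stacked into a single `df`, it can help to record where each row came from so the two parts can be separated again before modeling. The sketch below uses toy frames with made-up values. The `is_train` column name is hypothetical, and the sketch assumes the test file has no `Premium Amount` column, which this notebook does not show.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (values are made up for illustration).
toy_train = pd.DataFrame({"id": [0, 1], "Age": [25, 40], "Premium Amount": [500.0, 1200.0]})
toy_test = pd.DataFrame({"id": [2, 3], "Age": [31, 58]})

# Tag each row with its origin before concatenating (hypothetical column name).
toy_train["is_train"] = True
toy_test["is_train"] = False

# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of
# repeating the row labels of each input frame.
combined = pd.concat([toy_train, toy_test], sort=False, ignore_index=True)

# Rows from the test frame have no target, so 'Premium Amount' is NaN there.
train_part = combined[combined["is_train"]]
test_part = combined[~combined["is_train"]].drop(columns="Premium Amount")
```

Without `ignore_index=True`, the combined frame keeps each input's original row labels, so label-based lookups such as `df.loc[0]` would return one row from each source.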

In [92]:
train_df.columns

Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status',
       'Number of Dependents', 'Education Level', 'Occupation', 'Health Score',
       'Location', 'Policy Type', 'Previous Claims', 'Vehicle Age',
       'Credit Score', 'Insurance Duration', 'Policy Start Date',
       'Customer Feedback', 'Smoking Status', 'Exercise Frequency',
       'Property Type', 'Premium Amount'],
      dtype='object')
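
The ranges, means, and category lists quoted in the Feature Description can be recomputed directly from the data. Here is a minimal sketch on a toy frame with made-up values; on the real data, the same calls would presumably be applied to `df`.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the combined df.
toy = pd.DataFrame({
    "Age": [18, 41, 64],
    "Credit Score": [300.0, 592.0, None],
    "Location": ["Urban", "Rural", "Urban"],
})

# min / max / mean for each numeric column (missing values are skipped by default).
numeric_summary = toy.select_dtypes("number").agg(["min", "max", "mean"]).T

# Distinct categories for each non-numeric column, ignoring missing values.
categories = {
    col: sorted(toy[col].dropna().unique())
    for col in toy.select_dtypes(exclude="number")
}
```

`numeric_summary` has one row per numeric feature, which makes it easy to compare against the values listed in the description.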

In [93]:
# Parse the date column, then split it into year, month, and day features.
df['Policy Start Date'] = pd.to_datetime(df['Policy Start Date'])
df['Policy Start Year'] = df['Policy Start Date'].dt.year
df['Policy Start Month'] = df['Policy Start Date'].dt.month
df['Policy Start Day'] = df['Policy Start Date'].dt.day
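
If some date strings turn out to be malformed, `pd.to_datetime` raises by default. The sketch below shows one defensive variant that uses `errors="coerce"` and also extracts the day of week. The sample strings are made up, and the last one is deliberately invalid.

```python
import pandas as pd

# Made-up sample strings; the last one is deliberately malformed.
dates = pd.Series(["2023-12-23", "2019-08-17", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

# Calendar parts; rows that failed to parse come out as NaN.
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "dayofweek": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})

# Number of values that could not be parsed.
n_bad = int(parsed.isna().sum())
```

Checking `n_bad` before building features shows whether any rows were silently turned into missing values.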

In [94]:
df['Policy Start Year'].value_counts()

Policy Start Year
2022    409127
2021    408649
2020    402870
2023    398493
2024    240087
2019    140774
Name: count, dtype: int64
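
`value_counts()` orders the years by frequency, which makes the timeline harder to read. As a small sketch, the counts printed above can be put back in chronological order and expressed as shares of all policies. The variable names here are illustrative.

```python
import pandas as pd

# Counts copied from the output above.
year_counts = pd.Series(
    {2022: 409127, 2021: 408649, 2020: 402870, 2023: 398493, 2024: 240087, 2019: 140774},
    name="count",
)

# sort_index() restores chronological order.
by_year = year_counts.sort_index()

# Share of all policies per start year, as percentages.
share_pct = (by_year / by_year.sum() * 100).round(2)
```

On the real frame, the same ordering comes from `df['Policy Start Year'].value_counts().sort_index()`. The noticeably smaller counts for 2019 and 2024 may indicate that the data covers only part of those years, though the notebook has not yet checked this.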