# <div style='padding:25px;background-color:maroon;color:white;border-radius:4px;font-size:100%;text-align: center'> Insurance Analytics and Prediction<br></div>

# <div style='padding:5px;background-color:maroon;color:white;border-radius:2px;font-size:100%;text-align: center'>Data Cleaning<br></div>

## <span style="color:Aqua;"> Objective of the Project:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The objective of this project is to leverage advanced analytics techniques, including classification,
regression, and clustering, to extract valuable insights from insurance data. By analyzing a
comprehensive dataset, the project aims to enhance decision-making processes, optimize risk
assessment, and improve overall operational efficiency within the insurance industry.

### <p style="color:Aqua;"> Key Components:</p>

<p style="color:Tomato;font-size: 110%"> <b> 1. Customer Segmentation (Clustering):</b> </p>

<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Apply clustering algorithms to group policyholders based on similar characteristics and behavior.

<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Identify customer segments with common insurance needs and preferences.

<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Tailor marketing strategies and product offerings to specific clusters, enhancing customer engagement and increasing cross-selling opportunities.

<p style="color:Tomato;font-size: 110%"> <b> 2. Fraudulent or Legitimate Assessment (Classification):</b> </p>
<span style="color: Chartreuse;">   &#9784; &nbsp;</span>  Implement a classification model to categorize insurance claims into predefined classes, such as
fraudulent or legitimate. <br>
<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Utilize machine learning algorithms to predict the likelihood of a claim being fraudulent based on
historical data. <br>
<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Enhance fraud detection capabilities to reduce financial losses and improve the accuracy of claim
assessments.

<p style="color:Tomato;font-size: 110%"> <b> 3. Premium Prediction (Regression):</b> </p>

<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Develop regression models to predict insurance premium pricing based on various factors such as age, location, coverage type, and previous claims history. <br>
<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Explore the relationship between different variables and premiums to optimize pricing strategies. <br>
<span style="color: Chartreuse;">   &#9784; &nbsp;</span> Provide recommendations for personalized premium adjustments, leading to improved customer
satisfaction and retention.

## <span style="color:Aqua;">Importing libraries from Python</span>

In [1]:
import pandas as pd 
import os
import numpy as np
from IPython.display import display, HTML

pd.options.display.max_columns = 50
pd.set_option("display.precision", 4)
pd.set_option('display.float_format', '{:.4f}'.format)


## <span style="color:Aqua;">Data Exploration:

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> In this section we explore each variable (data column) and identify any data issues that may affect our Machine Learning model (prediction app).



### <span style="color:Tomato;">Reading Dataset from Excel:

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> Reading data from the Excel file into pandas (Python's data wrangling library).

In [180]:
df = pd.read_excel('insurance_data.xlsx')

### <span style="color:Tomato;"> Understanding the dataset:
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> From the output below we see that we have 1000 rows and 39 columns. The next output shows 5 rows sampled at random from the dataset.

In [181]:
display(HTML(f"<p style='color: orange; font-weight: bold;'>{df.shape}\n\n</p>"))
df.sample(5)

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,insured_sex,insured_education_level,insured_occupation,insured_hobbies,insured_relationship,capital-gains,capital-loss,incident_date,incident_type,collision_type,incident_severity,authorities_contacted,incident_state,incident_city,incident_location,incident_hour_of_the_day,number_of_vehicles_involved,property_damage,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported
313,436,59,153154,2010-08-21,OH,500/1000,1000,1338.55,0,430380,MALE,PhD,protective-serv,board-games,own-child,39000,0,2015-01-12,Multi-vehicle Collision,Side Collision,Total Loss,Police,WV,Hillsdale,6355 4th Hwy,10,3,?,2,2,NO,68000,13600,6800,47600,Toyota,Corolla,2014,N
608,267,46,270208,2004-08-09,OH,100/300,2000,1546.01,0,616276,FEMALE,MD,adm-clerical,polo,wife,0,0,2015-01-06,Multi-vehicle Collision,Front Collision,Total Loss,Police,VA,Riverwood,9760 4th Hwy,4,4,NO,2,1,?,77100,15420,7710,53970,Volkswagen,Jetta,1996,N
605,246,44,996850,1995-03-08,OH,100/300,1000,1397.0,0,614521,MALE,High School,machine-op-inspct,reading,not-in-family,0,0,2015-01-03,Single Vehicle Collision,Rear Collision,Minor Damage,Other,NY,Arlington,7705 Lincoln Drive,6,1,NO,1,0,NO,61740,6860,6860,48020,Accura,MDX,1997,N
50,430,59,691189,2004-01-10,OH,250/500,2000,1326.62,7000000,477310,MALE,MD,other-service,bungie-jumping,own-child,0,0,2015-01-03,Multi-vehicle Collision,Front Collision,Minor Damage,Fire,NY,Riverwood,5104 Francis Drive,19,3,?,0,3,YES,81800,16360,8180,57260,Nissan,Pathfinder,1998,N
541,239,41,743092,2013-11-11,OH,250/500,1000,1325.44,7000000,474898,FEMALE,JD,farming-fishing,paintball,other-relative,51400,-6300,2015-02-18,Parked Car,?,Trivial Damage,Police,NC,Arlington,6303 1st Drive,22,1,?,0,2,YES,10790,1660,830,8300,Mercedes,E400,2013,N


In [182]:
df.head()

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,insured_sex,insured_education_level,insured_occupation,insured_hobbies,insured_relationship,capital-gains,capital-loss,incident_date,incident_type,collision_type,incident_severity,authorities_contacted,incident_state,incident_city,incident_location,incident_hour_of_the_day,number_of_vehicles_involved,property_damage,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported
0,328,48,521585,2014-10-17,OH,250/500,1000,1406.91,0,466132,MALE,MD,craft-repair,sleeping,husband,53300,0,2015-01-25,Single Vehicle Collision,Side Collision,Major Damage,Police,SC,Columbus,9935 4th Drive,5,1,YES,1,2,YES,71610,6510,13020,52080,Saab,92x,2004,Y
1,228,42,342868,2006-06-27,IN,250/500,2000,1197.22,5000000,468176,MALE,MD,machine-op-inspct,reading,other-relative,0,0,2015-01-21,Vehicle Theft,?,Minor Damage,Police,VA,Riverwood,6608 MLK Hwy,8,1,?,0,0,?,5070,780,780,3510,Mercedes,E400,2007,Y
2,134,29,687698,2000-09-06,OH,100/300,2000,1413.14,5000000,430632,FEMALE,PhD,sales,board-games,own-child,35100,0,2015-02-22,Multi-vehicle Collision,Rear Collision,Minor Damage,Police,NY,Columbus,7121 Francis Lane,7,3,NO,2,3,NO,34650,7700,3850,23100,Dodge,RAM,2007,N
3,256,41,227811,1990-05-25,IL,250/500,2000,1415.74,6000000,608117,FEMALE,PhD,armed-forces,board-games,unmarried,48900,-62400,2015-01-10,Single Vehicle Collision,Front Collision,Major Damage,Police,OH,Arlington,6956 Maple Drive,5,1,?,1,2,NO,63400,6340,6340,50720,Chevrolet,Tahoe,2014,Y
4,228,44,367455,2014-06-06,IL,500/1000,1000,1583.91,6000000,610706,MALE,Associate,sales,board-games,unmarried,66000,-46000,2015-02-17,Vehicle Theft,?,Minor Damage,,NY,Arlington,3041 3rd Ave,20,1,NO,0,1,NO,6500,1300,650,4550,Accura,RSX,2009,N


In [183]:
df.tail()

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,insured_sex,insured_education_level,insured_occupation,insured_hobbies,insured_relationship,capital-gains,capital-loss,incident_date,incident_type,collision_type,incident_severity,authorities_contacted,incident_state,incident_city,incident_location,incident_hour_of_the_day,number_of_vehicles_involved,property_damage,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported
995,3,38,941851,1991-07-16,OH,500/1000,1000,1310.8,0,431289,FEMALE,Masters,craft-repair,paintball,unmarried,0,0,2015-02-22,Single Vehicle Collision,Front Collision,Minor Damage,Fire,NC,Northbrook,6045 Andromedia St,20,1,YES,0,1,?,87200,17440,8720,61040,Honda,Accord,2006,N
996,285,41,186934,2014-01-05,IL,100/300,1000,1436.79,0,608177,FEMALE,PhD,prof-specialty,sleeping,wife,70900,0,2015-01-24,Single Vehicle Collision,Rear Collision,Major Damage,Fire,SC,Northbend,3092 Texas Drive,23,1,YES,2,3,?,108480,18080,18080,72320,Volkswagen,Passat,2015,N
997,130,34,918516,2003-02-17,OH,250/500,500,1383.49,3000000,442797,FEMALE,Masters,armed-forces,bungie-jumping,other-relative,35100,0,2015-01-23,Multi-vehicle Collision,Side Collision,Minor Damage,Police,NC,Arlington,7629 5th St,4,3,?,2,3,YES,67500,7500,7500,52500,Suburu,Impreza,1996,N
998,458,62,533940,2011-11-18,IL,500/1000,2000,1356.92,5000000,441714,MALE,Associate,handlers-cleaners,base-jumping,wife,0,0,2015-02-26,Single Vehicle Collision,Rear Collision,Major Damage,Other,NY,Arlington,6128 Elm Lane,2,1,?,0,1,YES,46980,5220,5220,36540,Audi,A5,1998,N
999,456,60,556080,1996-11-11,OH,250/500,1000,766.19,0,612260,FEMALE,Associate,sales,kayaking,husband,0,0,2015-02-26,Parked Car,?,Minor Damage,Police,WV,Columbus,1416 Cherokee Ridge,6,1,?,0,3,?,5060,460,920,3680,Mercedes,E400,2007,N


<span style="color:Chartreuse;font-size:120%;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> We can see that a few data points are recorded as ? rather than null. We must investigate further.
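
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> One way to investigate is to count the ? placeholder per column. The sketch below uses a small toy frame (the values are illustrative, not taken from the real file); the same expression should work on df.

```python
import pandas as pd

# Toy frame standing in for the insurance data (illustrative values only).
toy = pd.DataFrame({
    "collision_type": ["Side Collision", "?", "Rear Collision", "?"],
    "property_damage": ["YES", "?", "NO", "NO"],
    "witnesses": [2, 0, 3, 1],
})

# Only non-numeric columns can hold the '?' placeholder, so check just those.
placeholder_counts = toy.select_dtypes(exclude="number").eq("?").sum()

# Keep only the columns where the placeholder actually occurs.
print(placeholder_counts[placeholder_counts > 0])
```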

In [184]:
df.describe(include = "number").T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
months_as_customer,1000.0,203.954,115.1132,0.0,115.75,199.5,276.25,479.0
age,1000.0,38.948,9.1403,19.0,32.0,38.0,44.0,64.0
policy_number,1000.0,546238.648,257063.0053,100804.0,335980.25,533135.0,759099.75,999435.0
policy_deductable,1000.0,1136.0,611.8647,500.0,500.0,1000.0,2000.0,2000.0
policy_annual_premium,1000.0,1256.4061,244.1674,433.33,1089.6075,1257.2,1415.695,2047.59
umbrella_limit,1000.0,1101000.0,2297406.5981,-1000000.0,0.0,0.0,0.0,10000000.0
insured_zip,1000.0,501214.488,71701.6109,430104.0,448404.5,466445.5,603251.0,620962.0
capital-gains,1000.0,25126.1,27872.1877,0.0,0.0,0.0,51025.0,100500.0
capital-loss,1000.0,-26793.7,28104.0967,-111100.0,-51500.0,-23250.0,0.0,0.0
incident_hour_of_the_day,1000.0,11.644,6.9514,0.0,6.0,12.0,17.0,23.0


<span style="color:Chartreuse;font-size:120%;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> We can see the statistical characteristics of the numerical features (columns): count, central tendency (mean and median), standard deviation, minimum, maximum, and percentile values. This helps us understand the distribution and structure of the data.

In [185]:
df.describe(exclude = "number").T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
policy_bind_date,1000,,,,2002-02-08 04:40:47.999999872,1990-01-08 00:00:00,1995-09-19 00:00:00,2002-04-01 12:00:00,2008-04-21 12:00:00,2015-02-22 00:00:00
policy_state,1000,3.0,OH,352.0,,,,,,
policy_csl,1000,3.0,250/500,351.0,,,,,,
insured_sex,1000,2.0,FEMALE,537.0,,,,,,
insured_education_level,1000,7.0,JD,161.0,,,,,,
insured_occupation,1000,14.0,machine-op-inspct,93.0,,,,,,
insured_hobbies,1000,20.0,reading,64.0,,,,,,
insured_relationship,1000,6.0,own-child,183.0,,,,,,
incident_date,1000,,,,2015-01-30 08:02:24,2015-01-01 00:00:00,2015-01-15 00:00:00,2015-01-31 00:00:00,2015-02-15 00:00:00,2015-03-01 00:00:00
incident_type,1000,4.0,Multi-vehicle Collision,419.0,,,,,,


<span style="color:Chartreuse;font-size:120%;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> We can see the summary of the categorical features (columns): count, number of unique values, the most frequent value (top) and its frequency. The two date columns report mean, minimum, maximum and percentiles instead. This helps us understand the distribution and structure of the data.

### <span style="color:Khaki;">Feature Details:

In [186]:
df.sample(3)

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,insured_sex,insured_education_level,insured_occupation,insured_hobbies,insured_relationship,capital-gains,capital-loss,incident_date,incident_type,collision_type,incident_severity,authorities_contacted,incident_state,incident_city,incident_location,incident_hour_of_the_day,number_of_vehicles_involved,property_damage,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported
991,257,44,109392,2006-07-12,OH,100/300,1000,1280.88,0,433981,MALE,MD,other-service,basketball,other-relative,59400,-32200,2015-02-06,Single Vehicle Collision,Rear Collision,Total Loss,Other,WV,Riverwood,5312 Francis Ridge,21,1,NO,0,1,NO,46980,0,5220,41760,Accura,TL,2002,N
748,322,44,769602,2004-12-19,IL,100/300,1000,1156.19,0,606249,FEMALE,College,machine-op-inspct,cross-fit,husband,49900,-62700,2015-02-15,Multi-vehicle Collision,Side Collision,Major Damage,Fire,NY,Northbrook,3751 Tree Hwy,20,3,YES,0,3,?,49400,9880,4940,34580,Jeep,Wrangler,2010,N
14,180,38,644081,1998-12-28,OH,250/500,2000,1301.13,0,476685,FEMALE,College,machine-op-inspct,board-games,not-in-family,41300,-55500,2015-01-15,Single Vehicle Collision,Rear Collision,Total Loss,Police,SC,Springfield,6851 3rd Drive,12,1,NO,0,2,YES,46200,4200,8400,33600,Dodge,Neon,2003,Y


<span style="color:Tomato;font-size: 100%"> <b> 1. Months as Customer - </b></span> Customer tenure ranges from new customers (0 months) to long-term ones (479 months), with a mean of about 204 months. Helps identify long-term vs new customers.

<span style="color:Tomato;font-size: 100%"> <b> 2. Age - </b></span> Important demographic factor and a crucial factor in premium calculation.

<span style="color:Tomato;font-size: 100%"> <b> 3. Policy Number - </b></span> Policy identifier, so a redundant feature. Generally not used in modeling.

<span style="color:Tomato;font-size: 100%"> <b> 4. Policy Bind Date - </b></span> Policy start date. Can be used to calculate policy age.

<span style="color:Tomato;font-size: 100%"> <b> 5. Policy State - </b></span> Location. Location-based risk factors.

<span style="color:Tomato;font-size: 100%"> <b> 6. Policy CSL - </b></span>  Combined Single Limit. Indicates coverage level.

<span style="color:Tomato;font-size: 100%"> <b> 7. Policy Deductible - </b></span> Portion of a claim that the policyholder is responsible for paying. Redundant feature.

<span style="color:Tomato;font-size: 100%"> <b> 8. Policy Annual Premium - </b></span> Premium amount; indicates customer value.

<span style="color:Tomato;font-size: 100%"> <b> 9. Umbrella Limit - </b></span> Max coverage limit.

<span style="color:Tomato;font-size: 100%"> <b> 10. Insured Zip - </b></span> Zip/Postal codes can be used to determine the level of risk associated with an insured individual and also help in developing targeted insurance products.

<span style="color:Tomato;font-size: 100%"> <b> 11. Insured Sex - </b></span> Gender can help in developing targeted insurance products.

<span style="color:Tomato;font-size: 100%"> <b> 12. Insured Education Level - </b></span> Demographic information and proxy for risk and income.

<span style="color:Tomato;font-size: 100%"> <b> 13. Insured Occupation - </b></span> Lifestyle indicator. Might reveal unexpected correlations with fraud.

<span style="color:Tomato;font-size: 100%"> <b> 14. Insured Hobbies - </b></span> Lifestyle and risk indicator. Might reveal unexpected correlations with fraud.

<span style="color:Tomato;font-size: 100%"> <b> 15. Insured Relationship - </b></span> Family status. May influence premium price, and could be an indicator of potential fraud.

<span style="color:Tomato;font-size: 100%"> <b> 16. Capital Gains and Loss - </b></span> Not sure about this feature. I will try to interpret it as we go on with EDA. If it is not helpful we can drop these features.

<span style="color:Tomato;font-size: 100%"> <b> 17. Incident Date - </b></span> Accident date. Policy age can be calculated from the policy bind date and the incident date.

<span style="color:Tomato;font-size: 100%"> <b> 18. Incident Type and Collision Type - </b></span> Different types may have varying fraud rates. Past incidents may affect future premiums.

<span style="color:Tomato;font-size: 100%"> <b> 19. Incident Severity - </b></span> Severe incidents might be less likely to be fraudulent.

<span style="color:Tomato;font-size: 100%"> <b> 20. Authorities Contacted - </b></span> This information can help in predicting claim amounts or categorizing claims based on severity. It could also be indicative of fraudulent claims.

<span style="color:Tomato;font-size: 100%"> <b> 21. Incident State and City - </b></span> Location-based risk factors.

<span style="color:Tomato;font-size: 100%"> <b> 22. Incident Hour - </b></span> Analyzing the relationship between incident hour and other variables (e.g., incident type, claim amount) can reveal patterns indicative of fraud.

<span style="color:Tomato;font-size: 100%"> <b> 23. Number of Vehicles Involved - </b></span> This information can help in risk assessment and premium calculation.

<span style="color:Tomato;font-size: 100%"> <b> 24. Property Damage - </b></span> Property damage can be a factor in assessing risk for future insurance policies. Inconsistent information between property damage and other claim details might indicate potential fraud.

<span style="color:Tomato;font-size: 100%"> <b> 25. Bodily Injuries - </b></span> Unusual patterns in bodily injuries might indicate potential fraud.

<span style="color:Tomato;font-size: 100%"> <b> 26. Witnesses - </b></span> A higher number of witnesses could indicate a lower probability of fraud.

<span style="color:Tomato;font-size: 100%"> <b> 27. Police Report Available - </b></span> This feature can be a valuable indicator of potential fraud.

<span style="color:Tomato;font-size: 100%"> <b> 28. Claim Amount - </b></span> Can help identify anomalies.

<span style="color:Tomato;font-size: 100%"> <b> 29. Vehicle Details - </b></span> Premiums are often based on the vehicle's value, which is influenced by its make, model, and year.

<span style="color:Tomato;font-size: 100%"> <b> 30. Fraud Reported - </b></span> One of our target variables.
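
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> The policy age mentioned for Policy Bind Date and Incident Date can be derived by subtracting the two date columns. A minimal sketch on a toy frame; the column name policy_age_days is an assumption made for illustration and is not part of the dataset.

```python
import pandas as pd

# Two rows with the same date columns as the dataset (toy values).
toy = pd.DataFrame({
    "policy_bind_date": pd.to_datetime(["2014-10-17", "2006-06-27"]),
    "incident_date": pd.to_datetime(["2015-01-25", "2015-01-21"]),
})

# Subtracting datetime columns gives a Timedelta; .dt.days converts it to whole days.
# 'policy_age_days' is a hypothetical column name.
toy["policy_age_days"] = (toy["incident_date"] - toy["policy_bind_date"]).dt.days
print(toy)
```

A negative value here would mean an incident dated before the policy started, which would be worth checking as a data error.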

### <span style="color:Khaki;"> Checking data Type:
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> From the output below, we can see the data type of each feature. We can reduce memory usage by maintaining or casting applicable data types to the features. This process will smooth our data wrangling efforts.

In [187]:
df.dtypes

months_as_customer                      int64
age                                     int64
policy_number                           int64
policy_bind_date               datetime64[ns]
policy_state                           object
policy_csl                             object
policy_deductable                       int64
policy_annual_premium                 float64
umbrella_limit                          int64
insured_zip                             int64
insured_sex                            object
insured_education_level                object
insured_occupation                     object
insured_hobbies                        object
insured_relationship                   object
capital-gains                           int64
capital-loss                            int64
incident_date                  datetime64[ns]
incident_type                          object
collision_type                         object
incident_severity                      object
authorities_contacted             
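
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> As a rough illustration of the memory saving from casting, the sketch below converts a low-cardinality text column to category and downcasts an integer column. It runs on a toy frame of 1000 rows; which columns to cast in the real dataset is left as a judgment call.

```python
import pandas as pd

# Toy frame: a text column with 3 distinct values and a small-valued integer column.
toy = pd.DataFrame({
    "policy_state": ["OH", "IN", "IL", "OH"] * 250,
    "policy_deductable": [1000, 2000, 500, 1000] * 250,
})
before = toy.memory_usage(deep=True).sum()

compact = toy.copy()
# Few distinct values: store integer codes plus a small lookup table.
compact["policy_state"] = compact["policy_state"].astype("category")
# Pick the smallest integer type that can hold the values.
compact["policy_deductable"] = pd.to_numeric(compact["policy_deductable"], downcast="integer")
after = compact.memory_usage(deep=True).sum()

print(compact.dtypes)
print(f"{before} bytes -> {after} bytes")
```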

In [188]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   months_as_customer           1000 non-null   int64         
 1   age                          1000 non-null   int64         
 2   policy_number                1000 non-null   int64         
 3   policy_bind_date             1000 non-null   datetime64[ns]
 4   policy_state                 1000 non-null   object        
 5   policy_csl                   1000 non-null   object        
 6   policy_deductable            1000 non-null   int64         
 7   policy_annual_premium        1000 non-null   float64       
 8   umbrella_limit               1000 non-null   int64         
 9   insured_zip                  1000 non-null   int64         
 10  insured_sex                  1000 non-null   object        
 11  insured_education_level      1000 non-null  

### <span style="color:Khaki;">Replacing ? with Null:

In [189]:
df=df.replace('?',np.nan)
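
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> An alternative is to convert the placeholder while loading, using the na_values argument of the pandas readers. The sketch below uses pd.read_csv on in-memory text so that it runs without the Excel file; pd.read_excel accepts the same argument. The three-column sample is illustrative.

```python
import io

import numpy as np
import pandas as pd

# Small in-memory sample with the '?' placeholder (illustrative rows).
raw = io.StringIO(
    "incident_type,collision_type,property_damage\n"
    "Vehicle Theft,?,?\n"
    "Single Vehicle Collision,Side Collision,YES\n"
)

# Option 1: treat '?' as missing while reading.
at_load = pd.read_csv(raw, na_values="?")

# Option 2: read as-is, then replace afterwards (the approach used in this notebook).
raw.seek(0)
after_load = pd.read_csv(raw).replace("?", np.nan)

print(at_load.isna().sum())
```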

### <span style="color:Khaki;">Checking Null Values:
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> This code counts the empty cells in each column of our dataset. It is crucial to handle these null values before feeding our data to a Machine Learning algorithm.

In [190]:
print(df.isnull().sum().sum(),' -- ',df.isna().sum().sum())
df.isnull().sum()

972  --  972


months_as_customer               0
age                              0
policy_number                    0
policy_bind_date                 0
policy_state                     0
policy_csl                       0
policy_deductable                0
policy_annual_premium            0
umbrella_limit                   0
insured_zip                      0
insured_sex                      0
insured_education_level          0
insured_occupation               0
insured_hobbies                  0
insured_relationship             0
capital-gains                    0
capital-loss                     0
incident_date                    0
incident_type                    0
collision_type                 178
incident_severity                0
authorities_contacted           91
incident_state                   0
incident_city                    0
incident_location                0
incident_hour_of_the_day         0
number_of_vehicles_involved      0
property_damage                360
bodily_injuries     
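
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> The raw counts can be easier to read as a short table of only the affected columns, with the share of missing rows. A sketch on a toy frame (illustrative values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "collision_type": ["Rear Collision", np.nan, "Side Collision", np.nan],
    "property_damage": ["YES", np.nan, np.nan, np.nan],
    "age": [48, 42, 29, 41],
})

# isna().mean() is the fraction of missing rows per column.
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean().mul(100).round(1),
})

# Show only columns that have gaps, worst first.
missing = missing[missing["missing"] > 0].sort_values("missing", ascending=False)
print(missing)
```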

### <span style="color:Khaki;">Checking unique values of the features:
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> This code prints two counts for each feature: the length of unique(), which counts NaN as a value, and nunique(), which ignores it. A non-zero difference therefore flags a feature that contains null values.

In [191]:
for col in df.columns:
    print(f"{col} - {len(df[col].unique())} - {df[col].nunique()} \n  Diff is {df[col].nunique() - len(df[col].unique())}") 

months_as_customer - 391 - 391 
  Diff is 0
age - 46 - 46 
  Diff is 0
policy_number - 1000 - 1000 
  Diff is 0
policy_bind_date - 951 - 951 
  Diff is 0
policy_state - 3 - 3 
  Diff is 0
policy_csl - 3 - 3 
  Diff is 0
policy_deductable - 3 - 3 
  Diff is 0
policy_annual_premium - 991 - 991 
  Diff is 0
umbrella_limit - 11 - 11 
  Diff is 0
insured_zip - 995 - 995 
  Diff is 0
insured_sex - 2 - 2 
  Diff is 0
insured_education_level - 7 - 7 
  Diff is 0
insured_occupation - 14 - 14 
  Diff is 0
insured_hobbies - 20 - 20 
  Diff is 0
insured_relationship - 6 - 6 
  Diff is 0
capital-gains - 338 - 338 
  Diff is 0
capital-loss - 354 - 354 
  Diff is 0
incident_date - 60 - 60 
  Diff is 0
incident_type - 4 - 4 
  Diff is 0
collision_type - 4 - 3 
  Diff is -1
incident_severity - 4 - 4 
  Diff is 0
authorities_contacted - 5 - 4 
  Diff is -1
incident_state - 7 - 7 
  Diff is 0
incident_city - 7 - 7 
  Diff is 0
incident_location - 1000 - 1000 
  Diff is 0
incident_hour_of_the_day - 24 - 2

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> From the output above, we can see that policy_number and incident_location each have 1000 unique values, one per row, so they act as identifiers and may not be useful for modeling.<br>
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> insured_zip (995) and policy_annual_premium (991) are also almost unique per row.<br>
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> collision_type and authorities_contacted show a difference of -1, which confirms that these features contain null values.
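
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> The "Diff is -1" lines in the output come from how pandas treats NaN in the two counts. A small sketch on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Police", "Fire", np.nan, "Police"], name="authorities_contacted")

with_nan = len(s.unique())                # unique() keeps NaN as one of the values
without_nan = s.nunique()                 # nunique() drops NaN by default
also_with_nan = s.nunique(dropna=False)   # dropna=False makes nunique() count it too

print(with_nan, without_nan, also_with_nan)
```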

### <span style="color:Khaki;">Checking values lesser than or equal to 0:

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> Sometimes the data may contain negative values or just updated with 0 due to data entry error. This code will show such datapoints.

In [192]:
for col in df.select_dtypes(include=['number']).columns:
    print(f"{col} --  {(df[col] <= 0).sum()}")

months_as_customer --  1
age --  0
policy_number --  0
policy_deductable --  0
policy_annual_premium --  0
umbrella_limit --  799
insured_zip --  0
capital-gains --  508
capital-loss --  1000
incident_hour_of_the_day --  52
number_of_vehicles_involved --  0
bodily_injuries --  340
witnesses --  249
total_claim_amount --  0
injury_claim --  25
property_claim --  19
vehicle_claim --  0
auto_year --  0


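<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> From the above output, several features contain zero or negative values. Some are expected: capital-loss is recorded as a non-positive number (its maximum is 0), and 0 is a valid value for incident_hour_of_the_day, bodily_injuries and witnesses. Others, such as the negative minimum of umbrella_limit (-1000000), look like errors. We must check and address these.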
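
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> Because the check above combines zero and negative values, it can help to count them separately. A sketch on a toy frame (illustrative values):

```python
import pandas as pd

toy = pd.DataFrame({
    "umbrella_limit": [0, 5000000, -1000000, 0],
    "capital-loss": [0, -62400, -46000, 0],
    "witnesses": [2, 0, 3, 1],
})

numeric = toy.select_dtypes(include="number")

# One row per column, with negatives and zeros counted separately.
summary = pd.DataFrame({
    "negative": (numeric < 0).sum(),
    "zero": (numeric == 0).sum(),
})
print(summary)
```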

In [193]:
df.columns

Index(['months_as_customer', 'age', 'policy_number', 'policy_bind_date',
       'policy_state', 'policy_csl', 'policy_deductable',
       'policy_annual_premium', 'umbrella_limit', 'insured_zip', 'insured_sex',
       'insured_education_level', 'insured_occupation', 'insured_hobbies',
       'insured_relationship', 'capital-gains', 'capital-loss',
       'incident_date', 'incident_type', 'collision_type', 'incident_severity',
       'authorities_contacted', 'incident_state', 'incident_city',
       'incident_location', 'incident_hour_of_the_day',
       'number_of_vehicles_involved', 'property_damage', 'bodily_injuries',
       'witnesses', 'police_report_available', 'total_claim_amount',
       'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make',
       'auto_model', 'auto_year', 'fraud_reported'],
      dtype='object')

## <span style="color:Aqua;">Data Cleaning:

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> In this section we clean the data by imputing missing values and fixing inconsistencies.



### <span style="color:Khaki;">Imputing Null Values in collision_type:

In [194]:
print(df.collision_type.value_counts().values.sum())
df.collision_type.value_counts()

822


collision_type
Rear Collision     292
Side Collision     276
Front Collision    254
Name: count, dtype: int64

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> collision_type is populated for 822 of the 1000 rows. We fill the remaining 178 with Others, since the collision type cannot be determined from the existing data.
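
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> In the sampled rows shown earlier, the ? in collision_type appears on Vehicle Theft and Parked Car incidents. Whether that holds for all 178 rows has not been verified here; a cross-tabulation like the sketch below (toy data) is one way to check before filling.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "incident_type": ["Vehicle Theft", "Parked Car", "Single Vehicle Collision",
                      "Multi-vehicle Collision", "Vehicle Theft"],
    "collision_type": [np.nan, np.nan, "Side Collision", "Rear Collision", np.nan],
})

# Label each row as missing or present, then count by incident type.
status = toy["collision_type"].isna().map({True: "missing", False: "present"})
table = pd.crosstab(toy["incident_type"], status)
print(table)
```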

In [195]:
df['collision_type'] = df['collision_type'].fillna('Others')

In [196]:
print(df.collision_type.value_counts().values.sum())
df.collision_type.value_counts()

1000


collision_type
Rear Collision     292
Side Collision     276
Front Collision    254
Others             178
Name: count, dtype: int64

### <span style="color:Khaki;">Imputing Null Values in property_damage:

In [197]:
print(df.property_damage.value_counts().values.sum())
df.property_damage.value_counts()

640


property_damage
NO     338
YES    302
Name: count, dtype: int64

<span style="color: Chartreuse;font-size:120%"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> property_damage is populated for 640 of the 1000 rows. We can fill the rest using the features property_claim and incident_type.
- If the policyholder made a property claim, then this column should be YES, otherwise NO.
- If the incident type is Vehicle Theft, then property damage might not be applicable, so NO.

In [198]:
df['property_damage'] = np.where(df['property_damage'].isnull() & (df['property_claim']>0), 'YES', df['property_damage'])

In [199]:
df['property_damage'] = np.where(df['property_damage'].isnull() & (df['incident_type'] == 'Vehicle Theft'), 'NO', df['property_damage'])

In [200]:
df['property_damage'] = np.where(df['property_damage'].isnull() & (df['property_claim']<= 0), 'NO', df['property_damage'])

In [201]:
print(df.property_damage.value_counts().values.sum())
df.property_damage.value_counts()

1000


property_damage
YES    655
NO     345
Name: count, dtype: int64
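
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> The same rules can also be written with boolean masks and .loc. The sketch below wraps them in a helper named impute_property_damage (a hypothetical name) and runs on toy data. It assumes property_claim has no missing values; under that assumption the Vehicle Theft rule appears to be covered by the final "no positive claim" rule, so two steps are enough.

```python
import numpy as np
import pandas as pd


def impute_property_damage(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing property_damage from property_claim (hypothetical helper)."""
    out = frame.copy()
    missing = out["property_damage"].isna()
    # A positive property claim implies there was property damage.
    out.loc[missing & (out["property_claim"] > 0), "property_damage"] = "YES"
    # Anything still missing has no positive property claim.
    out.loc[out["property_damage"].isna(), "property_damage"] = "NO"
    return out


toy = pd.DataFrame({
    "incident_type": ["Vehicle Theft", "Parked Car", "Vehicle Theft", "Single Vehicle Collision"],
    "property_claim": [780, 0, 0, 6340],
    "property_damage": [np.nan, np.nan, np.nan, "NO"],
})
result = impute_property_damage(toy)
print(result["property_damage"].tolist())
```

Existing values are left untouched: the last toy row keeps NO even though its property claim is positive.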

### <span style="color:Khaki;">Imputing Null Values in authorities_contacted:

In [202]:
print(df.authorities_contacted.value_counts().values.sum())
df.authorities_contacted.value_counts()

909


authorities_contacted
Police       292
Fire         223
Other        198
Ambulance    196
Name: count, dtype: int64

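<span style="color: Chartreuse;font-size:120%"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> authorities_contacted is populated for 909 of the 1000 rows. We can fill the rest using the features police_report_available, incident_type and bodily_injuries.

- If police_report_available is YES, then the police were probably contacted.
- If there are bodily injuries, then an ambulance was probably contacted (the code below applies this only to Vehicle Theft incidents with no police report).
- Any remaining missing values are filled with Other.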

In [203]:
df['authorities_contacted'] = np.where(df['authorities_contacted'].isnull() & (df['police_report_available']== 'YES'), 'Police', df['authorities_contacted'])

In [204]:
df['authorities_contacted'] = np.where(df['authorities_contacted'].isnull() & 
                                       (df['incident_type']== 'Vehicle Theft') & 
                                       (df['bodily_injuries'] > 0) & 
                                       (df['police_report_available']== 'NO'), 
                                       'Ambulance', df['authorities_contacted'])

In [205]:
df['authorities_contacted'] = df['authorities_contacted'].fillna('Other')

In [206]:
print(df.authorities_contacted.value_counts().values.sum())
df.authorities_contacted.value_counts()

1000


authorities_contacted
Police       320
Other        250
Fire         223
Ambulance    207
Name: count, dtype: int64
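
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> When several rules are applied in sequence, it can be useful to see how many rows each rule fills. The sketch below uses a helper named fill_missing_where (a hypothetical name) that only touches missing cells and returns the number of rows it changed. The data is a toy frame.

```python
import numpy as np
import pandas as pd


def fill_missing_where(frame, column, condition, value):
    """Set `value` where `column` is missing and `condition` holds (hypothetical helper).

    Modifies `frame` in place and returns the number of rows filled.
    """
    mask = frame[column].isna() & condition
    frame.loc[mask, column] = value
    return int(mask.sum())


toy = pd.DataFrame({
    "authorities_contacted": [np.nan, np.nan, np.nan, "Fire"],
    "police_report_available": ["YES", "NO", np.nan, "YES"],
    "incident_type": ["Parked Car", "Vehicle Theft", "Vehicle Theft", "Parked Car"],
    "bodily_injuries": [0, 2, 0, 1],
})

# Rule 1: a police report suggests the police were contacted.
n_police = fill_missing_where(
    toy, "authorities_contacted", toy["police_report_available"] == "YES", "Police"
)
# Rule 2: theft with injuries and no police report suggests an ambulance.
n_ambulance = fill_missing_where(
    toy,
    "authorities_contacted",
    (toy["incident_type"] == "Vehicle Theft")
    & (toy["bodily_injuries"] > 0)
    & (toy["police_report_available"] == "NO"),
    "Ambulance",
)
# Rule 3: everything still missing becomes 'Other'.
n_other = fill_missing_where(
    toy, "authorities_contacted", pd.Series(True, index=toy.index), "Other"
)
print(n_police, n_ambulance, n_other)
```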

### <span style="color:Khaki;">Imputing Null Values in police_report_available:

In [207]:
print(df.police_report_available.value_counts().values.sum())
df.police_report_available.value_counts()

657


police_report_available
NO     343
YES    314
Name: count, dtype: int64

<span style="color: Chartreuse;font-size:120%"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> police_report_available is populated for 657 of the 1000 rows. We can fill the rest using the features fraud_reported and authorities_contacted.

- If fraud_reported is Y, then the police might not have been contacted, so there is no report.
- If authorities_contacted is Police, then the police were involved and we should have a report.

In [208]:
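# Note: this rule has no isnull() condition, so every row with fraud_reported == 'Y'
# is set to 'NO', including rows that already had a value.
df['police_report_available'] = np.where(df['fraud_reported'] == 'Y' , 
                                       'NO', df['police_report_available'])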

In [209]:
df['police_report_available'] = np.where(df['police_report_available'].isnull() & 
                                       (df['incident_type']== 'Vehicle Theft') & 
                                       (df['fraud_reported'] == 'N') &
                                       (df['property_damage'] == 'YES'), 
                                       'YES', df['police_report_available'])

In [210]:
df['police_report_available'] = np.where(df['police_report_available'].isnull() & 
                                       (df['authorities_contacted']== 'Police') & 
                                       (df['fraud_reported'] == 'N') &
                                       (df['bodily_injuries'] > 0), 
                                       'YES', df['police_report_available'])

In [211]:
df['police_report_available'] = np.where(df['police_report_available'].isnull() & 
                                       (df['fraud_reported'] == 'Y') , 
                                       'NO', df['police_report_available'])

In [212]:
df['police_report_available'] = np.where(df['police_report_available'].isnull() & 
                                       (df['incident_type']== 'Multi-vehicle Collision') & 
                                       (df['fraud_reported'] == 'N') &
                                       (df['authorities_contacted'].isin(['Police','Fire','Ambulance'])),
                                       'YES', df['police_report_available'])

In [213]:
df['police_report_available'] = df['police_report_available'].fillna('NO')

In [214]:
print(df.police_report_available.value_counts().values.sum())
df.police_report_available.value_counts()

1000


police_report_available
NO     637
YES    363
Name: count, dtype: int64

## <span style="color:Aqua;"> Cleaning Case and Text inconsistency and checking unique values:

<span style="color: Chartreuse;font-size:120%"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> Correcting text and case inconsistencies while checking the unique categories of each feature.

In [215]:
df.sample(2)

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,insured_sex,insured_education_level,insured_occupation,insured_hobbies,insured_relationship,capital-gains,capital-loss,incident_date,incident_type,collision_type,incident_severity,authorities_contacted,incident_state,incident_city,incident_location,incident_hour_of_the_day,number_of_vehicles_involved,property_damage,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported
240,249,43,547802,2013-09-03,IL,250/500,1000,1518.46,0,606238,FEMALE,MD,armed-forces,cross-fit,own-child,0,0,2015-01-26,Single Vehicle Collision,Front Collision,Major Damage,Fire,SC,Riverwood,2201 4th Lane,16,1,YES,0,0,YES,53500,5350,5350,42800,Saab,92x,2015,N
727,39,22,691115,1993-01-28,IN,500/1000,500,1173.21,0,431202,MALE,JD,farming-fishing,polo,not-in-family,0,0,2015-02-14,Single Vehicle Collision,Rear Collision,Major Damage,Police,SC,Northbend,4782 Sky Lane,14,1,YES,0,1,NO,86130,15660,7830,62640,Suburu,Legacy,2009,Y


In [216]:
# Identifier, code, and numeric-like columns that must not be re-cased
excluded_columns = ['policy_state', 'insured_education_level', 'incident_state','policy_number','policy_csl',
                    'insured_zip','incident_hour_of_the_day','number_of_vehicles_involved','bodily_injuries','witnesses','auto_year']
# Only columns already stored as 'category' are visited by this loop
for col in df.select_dtypes(include='category'):
    if col not in excluded_columns:
        print(col)
        # Trim surrounding whitespace and normalise to Title Case
        df[col] = df[col].str.strip().str.title()
        # Summary plus alphabetically sorted value counts, to spot near-duplicate labels
        print(df[col].describe(),'\n\n',df[col].value_counts().reset_index().sort_values(by=col,ascending=True),'\n','---X---'*10)
        df[col] =df[col].astype('category')
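
As a small illustration of why stripping and re-casing matters, the sketch below uses a hypothetical series of car makes (not taken from the dataset) in which the same label appears with different spacing and case.

```python
import pandas as pd

# Hypothetical raw labels: three spellings of the same make plus one other
raw = pd.Series([' honda', 'Honda ', 'HONDA', 'saab'])

# Trim whitespace, then convert to Title Case
clean = raw.str.strip().str.title()

raw_categories = raw.nunique()      # every spelling counts separately
clean_categories = clean.nunique()  # spellings collapse into one label each
print(raw_categories, clean_categories)
```

After normalisation the three spellings of the same make collapse into a single category, which is what the sorted value counts in the loop above help to confirm.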

## <span style="color:Aqua;">Casting Appropriate Data Type:

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> Casting appropriate data types reduces the dataset's memory footprint and allows us to perform data cleaning operations smoothly.
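
To make the memory claim concrete, here is a minimal sketch that compares a repeated string column stored as `object` with the same column stored as `category`. The series is synthetic and only mimics a low-cardinality column such as policy_state.

```python
import pandas as pd

# Synthetic low-cardinality column: 1000 strings drawn from three labels
states = pd.Series(['OH', 'IN', 'IL', 'OH'] * 250, dtype=object)

# deep=True includes the memory held by the Python string objects
bytes_object = states.memory_usage(deep=True)
bytes_category = states.astype('category').memory_usage(deep=True)

print(bytes_object, bytes_category)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest for columns with few distinct values.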

In [217]:
cat_var = ['insured_zip','policy_number','incident_hour_of_the_day','number_of_vehicles_involved','bodily_injuries',
           'witnesses','auto_year']

In [218]:
df[cat_var] = df[cat_var].astype('category')

In [219]:
df[df.select_dtypes(include='number').columns].describe()

Unnamed: 0,months_as_customer,age,policy_deductable,policy_annual_premium,umbrella_limit,capital-gains,capital-loss,total_claim_amount,injury_claim,property_claim,vehicle_claim
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,203.954,38.948,1136.0,1256.4061,1101000.0,25126.1,-26793.7,52761.94,7433.42,7399.57,37928.95
std,115.1132,9.1403,611.8647,244.1674,2297406.5981,27872.1877,28104.0967,26401.5332,4880.9519,4824.7262,18886.2529
min,0.0,19.0,500.0,433.33,-1000000.0,0.0,-111100.0,100.0,0.0,0.0,70.0
25%,115.75,32.0,500.0,1089.6075,0.0,0.0,-51500.0,41812.5,4295.0,4445.0,30292.5
50%,199.5,38.0,1000.0,1257.2,0.0,0.0,-23250.0,58055.0,6775.0,6750.0,42100.0
75%,276.25,44.0,2000.0,1415.695,0.0,51025.0,0.0,70592.5,11305.0,10885.0,50822.5
max,479.0,64.0,2000.0,2047.59,10000000.0,100500.0,0.0,114920.0,21450.0,23670.0,79560.0


In [220]:
num_var = df.select_dtypes(include='number').columns
df[num_var] = df[num_var].apply(pd.to_numeric, errors='raise', downcast='integer')
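
A short sketch of what `downcast='integer'` does, using two made-up series: values that are all whole numbers are moved to the smallest integer type that can hold them, while values with a fractional part are left as floats.

```python
import pandas as pd

# Whole-number floats (hypothetical values) can be stored as small integers
whole = pd.Series([0.0, 5.0, 100.0])
# Values with a fractional part cannot be converted without loss
fractional = pd.Series([433.33, 1257.2])

whole_dtype = str(pd.to_numeric(whole, downcast='integer').dtype)
fractional_dtype = str(pd.to_numeric(fractional, downcast='integer').dtype)

print(whole_dtype, fractional_dtype)
```

This is consistent with the `df.info()` output in this notebook, where policy_annual_premium stays float64 while the other numeric columns become small integer types.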

In [221]:
cat_var = df.select_dtypes(include='object').columns
df[cat_var] = df[cat_var].astype('category')

In [222]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   months_as_customer           1000 non-null   int16         
 1   age                          1000 non-null   int8          
 2   policy_number                1000 non-null   category      
 3   policy_bind_date             1000 non-null   datetime64[ns]
 4   policy_state                 1000 non-null   category      
 5   policy_csl                   1000 non-null   category      
 6   policy_deductable            1000 non-null   int16         
 7   policy_annual_premium        1000 non-null   float64       
 8   umbrella_limit               1000 non-null   int32         
 9   insured_zip                  1000 non-null   category      
 10  insured_sex                  1000 non-null   category      
 11  insured_education_level      1000 non-null  

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> After casting the data types, the memory usage is reduced to 200 KB.

## <p Style="color: Aqua"> Dropping Redundant Features:

In [223]:
col_drop = ['policy_number','policy_csl','umbrella_limit','insured_zip','capital-gains','capital-loss',
            'incident_location','auto_model']
df = df.drop(col_drop,axis=1)

## <p Style="color: Aqua"> Checking Data Frame After Cleaning:

In [224]:
df.sample(5)

Unnamed: 0,months_as_customer,age,policy_bind_date,policy_state,policy_deductable,policy_annual_premium,insured_sex,insured_education_level,insured_occupation,insured_hobbies,insured_relationship,incident_date,incident_type,collision_type,incident_severity,authorities_contacted,incident_state,incident_city,incident_hour_of_the_day,number_of_vehicles_involved,property_damage,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_year,fraud_reported
204,241,39,1996-06-04,IL,2000,1042.26,MALE,JD,sales,kayaking,husband,2015-01-31,Multi-vehicle Collision,Rear Collision,Total Loss,Ambulance,WV,Northbend,21,3,NO,1,2,YES,19080,4240,2120,12720,Saab,1995,N
276,296,42,2003-03-16,IN,2000,1219.27,MALE,Associate,tech-support,paintball,husband,2015-02-16,Multi-vehicle Collision,Side Collision,Total Loss,Ambulance,WV,Columbus,9,3,YES,1,2,NO,64080,7120,7120,49840,Saab,2012,N
637,292,45,1991-02-05,IL,1000,1358.91,MALE,Masters,craft-repair,dancing,unmarried,2015-01-09,Vehicle Theft,Others,Trivial Damage,Police,WV,Northbend,4,1,NO,0,2,NO,7370,670,1340,5360,Suburu,1997,N
887,441,55,2009-07-29,IN,500,1270.29,MALE,College,armed-forces,exercise,husband,2015-02-19,Parked Car,Others,Minor Damage,Other,VA,Arlington,4,1,NO,0,0,NO,6400,640,640,5120,Honda,2002,N
993,124,28,2001-12-08,OH,1000,1235.14,MALE,MD,exec-managerial,camping,husband,2015-02-17,Multi-vehicle Collision,Side Collision,Total Loss,Other,OH,Hillsdale,20,3,YES,0,1,NO,60200,6020,6020,48160,Volkswagen,2012,N


In [225]:
df.dtypes

months_as_customer                      int16
age                                      int8
policy_bind_date               datetime64[ns]
policy_state                         category
policy_deductable                       int16
policy_annual_premium                 float64
insured_sex                          category
insured_education_level              category
insured_occupation                   category
insured_hobbies                      category
insured_relationship                 category
incident_date                  datetime64[ns]
incident_type                        category
collision_type                       category
incident_severity                    category
authorities_contacted                category
incident_state                       category
incident_city                        category
incident_hour_of_the_day             category
number_of_vehicles_involved          category
property_damage                      category
bodily_injuries                   

In [226]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   months_as_customer           1000 non-null   int16         
 1   age                          1000 non-null   int8          
 2   policy_bind_date             1000 non-null   datetime64[ns]
 3   policy_state                 1000 non-null   category      
 4   policy_deductable            1000 non-null   int16         
 5   policy_annual_premium        1000 non-null   float64       
 6   insured_sex                  1000 non-null   category      
 7   insured_education_level      1000 non-null   category      
 8   insured_occupation           1000 non-null   category      
 9   insured_hobbies              1000 non-null   category      
 10  insured_relationship         1000 non-null   category      
 11  incident_date                1000 non-null  

### <span style="color:Khaki;">Checking Null Values:
<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> This code counts the empty cells in each column of our dataset. It is crucial to handle these null values before feeding the data to a Machine Learning algorithm.

In [227]:
print(df.isnull().sum().sum(),' -- ',df.isna().sum().sum())
df.isnull().sum()

0  --  0


months_as_customer             0
age                            0
policy_bind_date               0
policy_state                   0
policy_deductable              0
policy_annual_premium          0
insured_sex                    0
insured_education_level        0
insured_occupation             0
insured_hobbies                0
insured_relationship           0
incident_date                  0
incident_type                  0
collision_type                 0
incident_severity              0
authorities_contacted          0
incident_state                 0
incident_city                  0
incident_hour_of_the_day       0
number_of_vehicles_involved    0
property_damage                0
bodily_injuries                0
witnesses                      0
police_report_available        0
total_claim_amount             0
injury_claim                   0
property_claim                 0
vehicle_claim                  0
auto_make                      0
auto_year                      0
fraud_repo

<span style="color: Chartreuse;font-size:120%"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> No null values remain, as we have addressed all of them.

### <span style="color:Khaki;">Checking unique values of the features:
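<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> This code reports the number of unique data points in each feature. Because unique() counts NaN as a value while nunique() ignores it by default, a non-zero difference between the two would also reveal remaining null values.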
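
The difference between the two counts can be seen on a tiny, made-up series that contains one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
s = pd.Series(['YES', 'NO', np.nan, 'YES'])

with_nan = len(s.unique())   # unique() keeps NaN as its own entry
without_nan = s.nunique()    # nunique() drops NaN by default

diff = without_nan - with_nan
print(with_nan, without_nan, diff)
```

A difference of zero for every column, as in the output below, therefore indicates that no column still contains NaN.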

In [228]:
for col in df.columns:
    print(f"{col} - {len(df[col].unique())} - {df[col].nunique()} \n  Diff is {df[col].nunique() - len(df[col].unique())}") 

months_as_customer - 391 - 391 
  Diff is 0
age - 46 - 46 
  Diff is 0
policy_bind_date - 951 - 951 
  Diff is 0
policy_state - 3 - 3 
  Diff is 0
policy_deductable - 3 - 3 
  Diff is 0
policy_annual_premium - 991 - 991 
  Diff is 0
insured_sex - 2 - 2 
  Diff is 0
insured_education_level - 7 - 7 
  Diff is 0
insured_occupation - 14 - 14 
  Diff is 0
insured_hobbies - 20 - 20 
  Diff is 0
insured_relationship - 6 - 6 
  Diff is 0
incident_date - 60 - 60 
  Diff is 0
incident_type - 4 - 4 
  Diff is 0
collision_type - 4 - 4 
  Diff is 0
incident_severity - 4 - 4 
  Diff is 0
authorities_contacted - 4 - 4 
  Diff is 0
incident_state - 7 - 7 
  Diff is 0
incident_city - 7 - 7 
  Diff is 0
incident_hour_of_the_day - 24 - 24 
  Diff is 0
number_of_vehicles_involved - 4 - 4 
  Diff is 0
property_damage - 2 - 2 
  Diff is 0
bodily_injuries - 3 - 3 
  Diff is 0
witnesses - 4 - 4 
  Diff is 0
police_report_available - 2 - 2 
  Diff is 0
total_claim_amount - 763 - 763 
  Diff is 0
injury_claim -

### <span style="color:Khaki;">Checking values less than 0:

<span style="color: Chartreuse;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#9784; </span> Data entry errors can leave negative values in columns that should never be negative. This code counts the negative values in each numeric column. Because the comparison is strict (`< 0`), entries equal to 0 are not counted.

In [229]:
for col in df.select_dtypes(include=['number']).columns:
    print(f"{col} --  {(df[col] < 0).sum()}")

months_as_customer --  0
age --  0
policy_deductable --  0
policy_annual_premium --  0
total_claim_amount --  0
injury_claim --  0
property_claim --  0
vehicle_claim --  0
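
If zeros should be reviewed as well, the same check can be run separately for negative and zero entries. The sketch below uses a made-up frame, so the column values are assumptions rather than the notebook's data.

```python
import pandas as pd

# Hypothetical numeric data containing one negative value and a few zeros
toy = pd.DataFrame({
    'injury_claim': [0, 640, 5350],
    'age': [43, 22, 39],
    'balance': [-100, 0, 0],
})

numeric = toy.select_dtypes(include='number')

# Boolean frames summed column-wise give one count per column
negative_counts = (numeric < 0).sum()
zero_counts = (numeric == 0).sum()

print(pd.DataFrame({'negative': negative_counts, 'zero': zero_counts}))
```

Whether a zero is an error depends on the column: a zero claim amount can be legitimate, while a zero age would not be.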



## <p Style="color: Aqua"> Exporting Data Frame

In [230]:
df.to_feather('Cleaned_data.feather') # Export the data frame in Feather format. It is efficient and preserves our data types.

In [231]:
df2 = pd.read_feather('Cleaned_data.feather')

In [232]:
df2.sample(5)

Unnamed: 0,months_as_customer,age,policy_bind_date,policy_state,policy_deductable,policy_annual_premium,insured_sex,insured_education_level,insured_occupation,insured_hobbies,insured_relationship,incident_date,incident_type,collision_type,incident_severity,authorities_contacted,incident_state,incident_city,incident_hour_of_the_day,number_of_vehicles_involved,property_damage,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_year,fraud_reported
242,190,40,2007-01-27,OH,2000,965.21,FEMALE,JD,exec-managerial,camping,other-relative,2015-02-02,Parked Car,Others,Trivial Damage,Police,SC,Hillsdale,10,1,YES,2,1,YES,6300,630,630,5040,Nissan,2001,N
149,193,41,1995-07-16,OH,500,847.03,FEMALE,JD,craft-repair,skydiving,not-in-family,2015-02-08,Single Vehicle Collision,Side Collision,Major Damage,Other,SC,Springfield,1,1,YES,1,0,NO,112320,17280,17280,77760,Suburu,2011,Y
440,108,31,2005-12-09,IN,2000,1175.7,MALE,Masters,protective-serv,yachting,not-in-family,2015-02-19,Single Vehicle Collision,Rear Collision,Total Loss,Fire,NY,Columbus,14,1,NO,0,2,NO,57330,6370,6370,44590,Dodge,2006,N
218,328,46,1996-06-19,IL,500,1314.6,FEMALE,MD,prof-specialty,exercise,not-in-family,2015-02-23,Single Vehicle Collision,Rear Collision,Total Loss,Other,WV,Hillsdale,0,1,YES,2,3,NO,70290,12780,6390,51120,Saab,1998,Y
110,261,42,2009-01-11,OH,500,1337.56,FEMALE,College,prof-specialty,video-games,unmarried,2015-01-12,Single Vehicle Collision,Rear Collision,Minor Damage,Police,SC,Riverwood,18,1,YES,1,2,YES,74700,7470,14940,52290,Dodge,2010,N


In [233]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   months_as_customer           1000 non-null   int16         
 1   age                          1000 non-null   int8          
 2   policy_bind_date             1000 non-null   datetime64[ns]
 3   policy_state                 1000 non-null   category      
 4   policy_deductable            1000 non-null   int16         
 5   policy_annual_premium        1000 non-null   float64       
 6   insured_sex                  1000 non-null   category      
 7   insured_education_level      1000 non-null   category      
 8   insured_occupation           1000 non-null   category      
 9   insured_hobbies              1000 non-null   category      
 10  insured_relationship         1000 non-null   category      
 11  incident_date                1000 non-null  