# Business and Problem Statement Analysis

Shiv Nadar Institution of Eminence is a student centric, multidisciplinary and research focused university offering a wide range of academic programs at the Undergraduate, Masters and Doctoral levels. The University was set up in 2011 by the Shiv Nadar Foundation, a philanthropic foundation established by Mr. Shiv Nadar, founder of HCL. In the NIRF (Government’s National Institutional Ranking Framework), the University has been the youngest institution in the ‘top 100’ overall list.

The university’s Academy of Continuing Education aims to facilitate best-in-class knowledge, practices and skill development offerings to the growing ecosystem of lifetime learners and leaders, both within and outside the university. With distinguished academics as the university’s faculty members and programme instructors, the Academy of Continuing Education offers uniquely crafted programmes that are delivered innovatively, bringing together the best of the university’s rich intellectual resources.

The university aims to help students prepare for today as well as their future through its unique certification programme in data sciences and business analytics. The collaboration between the Academy of Continuing Education at Shiv Nadar Institution of Eminence and MachineHack hopes to strengthen the data science community in India and pave the way for innovation in business analytics.



### About the dataset and problem

The vehicle insurance business is a multi-billion dollar industry. Every year millions of premiums are paid, and a huge number of claims also pile up.

You have to step into the shoes of a data scientist who is building models to help an insurance company understand which claims should be rejected and which should be accepted for reimbursement.

You are given a rich dataset consisting of thousands of rows of past records, which you can use to learn more about your customers’ behaviours. Your task is to create an ML model that looks at an insurance claim and decides whether to reject or accept it.

Dimensions: (to be added later)

Columns: ['ID', 'AGE', 'GENDER', 'DRIVING_EXPERIENCE', 'EDUCATION', 'INCOME', 'CREDIT_SCORE', 'VEHICLE_OWNERSHIP', 'VEHICLE_YEAR', 'MARRIED', 'CHILDREN', 'POSTAL_CODE', 'ANNUAL_MILEAGE', 'SPEEDING_VIOLATIONS', 'DUIS', 'PAST_ACCIDENTS', 'OUTCOME', 'TYPE_OF_VEHICLE']

Learn and predict the OUTCOME variable.


### The main motive of this project is to learn how to deploy a model

* The main focus is to train a model.
* Then try deploying it so that it can be accessed through a URL.

### Evaluation Criteria

**The submission will be evaluated using the [Log Loss metric](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn-metrics-log-loss). One can use [sklearn.metrics.log_loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn-metrics-log-loss) to calculate it.**
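
To make the metric concrete, here is a minimal sketch that computes binary log loss by hand and compares it with `sklearn.metrics.log_loss`. The helper name `binary_log_loss` and the toy labels and probabilities are assumptions made for illustration; they are not part of the hackathon data.

```python
import numpy as np
from sklearn.metrics import log_loss


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for over-confident predictions.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


# Toy labels and predicted probabilities of class 1 (made up for illustration).
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.6, 0.4]

manual = binary_log_loss(y_true, y_prob)
reference = log_loss(y_true, y_prob)
print(manual, reference)
```

Lower is better: confident correct predictions contribute almost nothing, while confident wrong predictions are penalised heavily. This is why the submission should contain probabilities rather than hard 0/1 labels.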

This hackathon supports private and public leaderboards

The public leaderboard is evaluated on 30% of Test data

The private leaderboard will be made available at the end of the hackathon, which will be evaluated on 100% Test data

Final winners will be judged on the following in the final jury round:

- 30% Business Outcome/Impact,
- 20% Innovation + Creativity,
- 20% Algorithm and ML approach,
- 15% Statistical analysis,
- 15% Presentation + Communication

# Download the dataset

In [1]:
import pandas as pd
import numpy as np

#from pandas_profiling import ProfileReport
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
df_train_full = pd.read_csv('train.csv')

In [3]:
# takes too much time
# # The data profiling is okay but it has too much description to work with
# profile = ProfileReport(df_train_full)
# profile

# Data Preparation

In [4]:
df_train_full.head().T

Unnamed: 0,0,1,2,3,4
ID,816393,251762,481952,3506,498013
AGE,40-64,26-39,40-64,40-64,40-64
GENDER,female,male,male,male,female
DRIVING_EXPERIENCE,20-29y,20-29y,20-29y,20-29y,20-29y
EDUCATION,university,high school,none,high school,none
INCOME,middle class,middle class,middle class,upper class,working class
CREDIT_SCORE,0.63805,0.475741,0.839817,0.682527,0.572184
VEHICLE_OWNERSHIP,0.0,1.0,1.0,1.0,1.0
VEHICLE_YEAR,after 2015,before 2015,before 2015,before 2015,after 2015
MARRIED,0.0,1.0,1.0,0.0,1.0


In [5]:
df_train_full.dtypes

ID                       int64
AGE                     object
GENDER                  object
DRIVING_EXPERIENCE      object
EDUCATION               object
INCOME                  object
CREDIT_SCORE           float64
VEHICLE_OWNERSHIP      float64
VEHICLE_YEAR            object
MARRIED                float64
CHILDREN               float64
POSTAL_CODE              int64
ANNUAL_MILEAGE         float64
SPEEDING_VIOLATIONS      int64
DUIS                     int64
PAST_ACCIDENTS           int64
OUTCOME                float64
TYPE_OF_VEHICLE         object
dtype: object

In [6]:
df_train_full['OUTCOME'].unique()

array([0., 1.])

In [7]:
df_train_full['OUTCOME'] = df_train_full['OUTCOME'].astype(int)

In [8]:
df_train_full.dtypes

ID                       int64
AGE                     object
GENDER                  object
DRIVING_EXPERIENCE      object
EDUCATION               object
INCOME                  object
CREDIT_SCORE           float64
VEHICLE_OWNERSHIP      float64
VEHICLE_YEAR            object
MARRIED                float64
CHILDREN               float64
POSTAL_CODE              int64
ANNUAL_MILEAGE         float64
SPEEDING_VIOLATIONS      int64
DUIS                     int64
PAST_ACCIDENTS           int64
OUTCOME                  int64
TYPE_OF_VEHICLE         object
dtype: object

In [9]:
# lowercase the column names and replace spaces with underscores
df_train_full.columns = df_train_full.columns.str.lower().str.replace(' ', '_')

# do the same for the string values inside the object columns
string_columns = list(df_train_full.dtypes[df_train_full.dtypes == 'object'].index)

for col in string_columns:
    df_train_full[col] = df_train_full[col].str.lower().str.replace(' ', '_')

In [10]:
df_train_full.head().T

Unnamed: 0,0,1,2,3,4
id,816393,251762,481952,3506,498013
age,40-64,26-39,40-64,40-64,40-64
gender,female,male,male,male,female
driving_experience,20-29y,20-29y,20-29y,20-29y,20-29y
education,university,high_school,none,high_school,none
income,middle_class,middle_class,middle_class,upper_class,working_class
credit_score,0.63805,0.475741,0.839817,0.682527,0.572184
vehicle_ownership,0.0,1.0,1.0,1.0,1.0
vehicle_year,after_2015,before_2015,before_2015,before_2015,after_2015
married,0.0,1.0,1.0,0.0,1.0


In [11]:
df_train_full.dtypes[df_train_full.dtypes == 'object']

age                   object
gender                object
driving_experience    object
education             object
income                object
vehicle_year          object
type_of_vehicle       object
dtype: object

In [12]:
# separating the columns into categorical and numerical variables
categorical = ['age', 'gender', 'driving_experience', 'education', 'income', 'vehicle_ownership', 
               'married', 'children', 'postal_code', 'speeding_violations', 'duis', 'past_accidents', 
               'vehicle_year', 'type_of_vehicle']
numerical = [col for col in df_train_full.columns if col not in categorical]

# check that no column is left out ('outcome' is still in `numerical` at this point)
assert len(categorical) + len(numerical) == len(df_train_full.columns)

# the target must not be used as a feature
numerical.remove('outcome')

# Splitting the data into train and validation sets (the test data is already provided)

This is done before EDA to mimic a real-life scenario, where you act as if the test data is not available to you.

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
# Split the labelled data into train/validation sets (80%/20%); the test set is loaded separately.
train_df, val_df = train_test_split(df_train_full, test_size=0.2, random_state=1)

In [15]:
test_df = pd.read_csv('test.csv')

In [16]:
len(train_df), len(val_df), len(test_df)

(84000, 21000, 45000)

In [17]:
y_train = train_df.outcome.values
y_val = val_df.outcome.values

In [18]:
del train_df['outcome']
del val_df['outcome']

# Exploratory data analysis

In [19]:
df_train_full.isnull().sum()

id                     0
age                    0
gender                 0
driving_experience     0
education              0
income                 0
credit_score           0
vehicle_ownership      0
vehicle_year           0
married                0
children               0
postal_code            0
annual_mileage         0
speeding_violations    0
duis                   0
past_accidents         0
outcome                0
type_of_vehicle        0
dtype: int64

In [20]:
df_train_full['outcome'].value_counts()

0    60622
1    44378
Name: outcome, dtype: int64

In [21]:
# percentage of each class (divide by the length of the same frame that was counted)
df_train_full['outcome'].value_counts()*100/len(df_train_full)

0    57.735238
1    42.264762
Name: outcome, dtype: float64

In [22]:
from IPython.display import display

In [23]:
global_mean = df_train_full.outcome.mean()
global_mean

0.42264761904761905

In [24]:
for col in categorical:
    df_group = df_train_full.groupby(by=col).outcome.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['risk'] = df_group['mean'] / global_mean
    display(df_group)

Unnamed: 0_level_0,mean,diff,risk
age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
16-25,0.418324,-0.004324,0.98977
26-39,0.422969,0.000322,1.000761
40-64,0.425199,0.002551,1.006036
65+,0.422244,-0.000404,0.999045


Unnamed: 0_level_0,mean,diff,risk
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.430058,0.007411,1.017534
male,0.418145,-0.004502,0.989347


Unnamed: 0_level_0,mean,diff,risk
driving_experience,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0-9y,0.419196,-0.003452,0.991833
10-19y,0.413131,-0.009517,0.977483
20-29y,0.431254,0.008606,1.020363
30y+,0.425928,0.00328,1.007761


Unnamed: 0_level_0,mean,diff,risk
education,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
high_school,0.423503,0.000855,1.002024
none,0.424127,0.001479,1.003499
university,0.420083,-0.002564,0.993933


Unnamed: 0_level_0,mean,diff,risk
income,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
middle_class,0.420546,-0.002102,0.995026
poverty,0.426498,0.00385,1.00911
upper_class,0.423319,0.000672,1.001589
working_class,0.420218,-0.00243,0.99425


Unnamed: 0_level_0,mean,diff,risk
vehicle_ownership,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,0.425968,0.00332,1.007856
1.0,0.421953,-0.000694,0.998357


Unnamed: 0_level_0,mean,diff,risk
married,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,0.426177,0.003529,1.00835
1.0,0.420135,-0.002512,0.994056


Unnamed: 0_level_0,mean,diff,risk
children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,0.421687,-0.000961,0.997726
1.0,0.423535,0.000887,1.002099


Unnamed: 0_level_0,mean,diff,risk
postal_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10238,0.423874,0.001226,1.002901
11343,0.000000,-0.422648,0.000000
11514,1.000000,0.577352,2.366037
11545,1.000000,0.577352,2.366037
11626,0.000000,-0.422648,0.000000
...,...,...,...
92097,0.471264,0.048617,1.115029
92098,0.456693,0.034045,1.080552
92099,0.425806,0.003159,1.007474
92100,0.398010,-0.024638,0.941706


Unnamed: 0_level_0,mean,diff,risk
speeding_violations,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.42436,0.001712,1.004051
1,0.417829,-0.004819,0.988598
2,0.419248,-0.0034,0.991956
3,0.422074,-0.000574,0.998642
4,0.413507,-0.009141,0.978373
5,0.422613,-3.5e-05,0.999917
6,0.401656,-0.020991,0.950334
7,0.495726,0.073079,1.172907
8,0.411538,-0.011109,0.973715
9,0.424419,0.001771,1.00419


Unnamed: 0_level_0,mean,diff,risk
duis,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.422409,-0.000239,0.999436
1,0.43207,0.009422,1.022294
2,0.411813,-0.010835,0.974365
3,0.427502,0.004855,1.011486
4,0.425952,0.003304,1.007818
5,0.422311,-0.000337,0.999203
6,0.381579,-0.041069,0.90283


Unnamed: 0_level_0,mean,diff,risk
past_accidents,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.421138,-0.00151,0.996428
1,0.431551,0.008903,1.021065
2,0.429811,0.007163,1.016949
3,0.435106,0.012458,1.029476
4,0.403189,-0.019459,0.95396
5,0.406181,-0.016467,0.961039
6,0.417603,-0.005045,0.988064
7,0.401937,-0.020711,0.950998
8,0.405882,-0.016765,0.960333
9,0.389831,-0.032817,0.922353


Unnamed: 0_level_0,mean,diff,risk
vehicle_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
after_2015,0.422182,-0.000466,0.998898
before_2015,0.423032,0.000384,1.00091


Unnamed: 0_level_0,mean,diff,risk
type_of_vehicle,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
hatchback,0.425703,0.003055,1.007229
sedan,0.417959,-0.004689,0.988906
sports_car,0.428047,0.005399,1.012775
suv,0.415114,-0.007534,0.982175


### Mutual Info score:

In [25]:
from sklearn.metrics import mutual_info_score

In [26]:
def calculate_mi(series):
    return mutual_info_score(series, df_train_full.outcome)

df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')

display(df_mi.head())
display(df_mi.tail())

Unnamed: 0,MI
postal_code,0.067632
speeding_violations,0.000159
driving_experience,0.000108
past_accidents,7e-05
gender,6.8e-05


Unnamed: 0,MI
income,8.744343e-06
education,5.833603e-06
vehicle_ownership,4.721103e-06
children,1.747201e-06
vehicle_year,3.668605e-07
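
The `postal_code` score stands out, but it should be read with care: the plug-in estimate used by `mutual_info_score` is biased upward for features with many distinct values, and several postal codes in the risk table above have a mean of exactly 0 or 1, which suggests very small groups. The sketch below illustrates the effect on purely random data. The sample size and the cardinality of 5,000 are arbitrary assumptions, not properties of this dataset.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 20_000

# Target and both features are drawn independently, so the true MI is zero.
target = rng.integers(0, 2, size=n)
low_card = rng.integers(0, 2, size=n)        # 2 distinct values, like gender
high_card = rng.integers(0, 5_000, size=n)   # many distinct values, like a postal code

mi_low = mutual_info_score(low_card, target)
mi_high = mutual_info_score(high_card, target)
print(f"low cardinality:  {mi_low:.6f}")
print(f"high cardinality: {mi_high:.6f}")
```

Even though neither feature carries any signal, the high-cardinality one gets a much larger score. Whether this fully explains the `postal_code` value here is not established; checking the score on the validation split or grouping rare codes would be one way to find out.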


# Training the model

In [27]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

In [28]:
columns = categorical + numerical

In [29]:
if('outcome' in columns):
    print('yes')
else:
    print('no')

no


In [30]:
C=1.0

In [31]:
train_dicts = train_df[columns].to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

In [38]:
train_dicts[1534], y_train[1534]

({'age': '26-39',
  'gender': 'male',
  'driving_experience': '0-9y',
  'education': 'none',
  'income': 'middle_class',
  'vehicle_ownership': 1.0,
  'married': 1.0,
  'children': 0.0,
  'postal_code': 10238,
  'speeding_violations': 2,
  'duis': 0,
  'past_accidents': 0,
  'vehicle_year': 'after_2015',
  'type_of_vehicle': 'sports_car',
  'id': 18647,
  'credit_score': 0.7179927843792123,
  'annual_mileage': 9000.0},
 0)

In [32]:
model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)  # reuse C so the saved file name matches the model
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, solver='liblinear')

In [33]:
val_dicts = val_df[columns].to_dict(orient='records')
X_val = dv.transform(val_dicts)

y_pred = model.predict_proba(X_val)[:, 1]

In [34]:
y_pred

array([0.4138601 , 0.41675677, 0.40503222, ..., 0.44444914, 0.43478886,
       0.42250146])

In [35]:
roc_auc_score(y_val, y_pred)

0.5020962233915409
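
An AUC of about 0.50 means the model ranks claims no better than chance. Since the leaderboard uses log loss rather than AUC, it helps to know what a feature-free baseline scores on that metric. The sketch below rebuilds the label vector from the class counts printed earlier (60622 and 44378) and scores a constant prediction; it is computed on the full training file, so the validation figure may differ slightly.

```python
import numpy as np
from sklearn.metrics import log_loss

# Class counts reported by value_counts() on the full training file.
n_negative, n_positive = 60622, 44378

y_true = np.concatenate([np.zeros(n_negative, dtype=int), np.ones(n_positive, dtype=int)])
base_rate = y_true.mean()  # same value as global_mean

# A "model" that ignores every feature and always predicts the base rate.
constant_pred = np.full(len(y_true), base_rate)
baseline_log_loss = log_loss(y_true, constant_pred)

# Predicting 0.5 for everyone gives ln(2) regardless of the labels.
coin_flip_log_loss = log_loss(y_true, np.full(len(y_true), 0.5))

print(f"base rate:          {base_rate:.4f}")
print(f"baseline log loss:  {baseline_log_loss:.4f}")
print(f"coin-flip log loss: {coin_flip_log_loss:.4f}")
```

Any model worth deploying should score below the base-rate baseline on `log_loss(y_val, y_pred)`.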

# Deploying the model
1. Saving and loading the model
1. Web services: introduction to Flask
1. Serving the claim model with Flask
1. Python virtual environment: Pipenv
1. Environment management: Docker

## Saving and loading the model

In [39]:
## Save the model

In [36]:
import pickle

In [37]:
output_file = f'model_C={C}.bin'

In [38]:
with open(output_file, 'wb') as f_out: 
    pickle.dump((dv, model), f_out)

## Load the model

In [40]:
# to check that the model is really loaded from disk, restart the kernel and then run the cells below

In [1]:
import pickle

In [2]:
!ls

 analytics_olympiad_2022.ipynb	 submission.csv   train.csv
'model_C=1.0.bin'		 test.csv


In [4]:
input_file = 'model_C=1.0.bin'

In [5]:
with open(input_file, 'rb') as f_in: 
    dv, model = pickle.load(f_in)

In [6]:
model

LogisticRegression(max_iter=1000, solver='liblinear')

In [39]:
customer = {'age': '26-39',
  'gender': 'male',
  'driving_experience': '0-9y',
  'education': 'none',
  'income': 'middle_class',
  'vehicle_ownership': 1.0,
  'married': 1.0,
  'children': 0.0,
  'postal_code': 10238,
  'speeding_violations': 2,
  'duis': 0,
  'past_accidents': 0,
  'vehicle_year': 'after_2015',
  'type_of_vehicle': 'sports_car',
  'id': 18647,
  'credit_score': 0.7179927843792123,
  'annual_mileage': 9000.0}

In [40]:
X = dv.transform([customer])

In [41]:
y_pred = model.predict_proba(X)[0, 1]

In [42]:
print('input:', customer)
print('output:', y_pred)

input: {'age': '26-39', 'gender': 'male', 'driving_experience': '0-9y', 'education': 'none', 'income': 'middle_class', 'vehicle_ownership': 1.0, 'married': 1.0, 'children': 0.0, 'postal_code': 10238, 'speeding_violations': 2, 'duis': 0, 'past_accidents': 0, 'vehicle_year': 'after_2015', 'type_of_vehicle': 'sports_car', 'id': 18647, 'credit_score': 0.7179927843792123, 'annual_mileage': 9000.0}
output: 0.4484685581863869


In [43]:
# up to this point the model has been saved to disk and loaded back from it

## Web services: introduction to Flask

## Serving the claim model with Flask
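
This section is still empty in the notebook, so what follows is only a minimal sketch of how the loaded `dv` and `model` could be exposed over HTTP, not the final service. The service name `'claim'`, the `/predict` route, the response keys and the 0.5 cut-off are all hypothetical. To keep the example self-contained, a tiny model is fitted on made-up records in place of `pickle.load`.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in for `dv, model = pickle.load(f_in)`: a tiny model fitted on made-up
# records so that the sketch runs without the pickled file.
toy_records = [
    {'gender': 'male', 'vehicle_year': 'after_2015', 'credit_score': 0.72},
    {'gender': 'female', 'vehicle_year': 'before_2015', 'credit_score': 0.48},
    {'gender': 'male', 'vehicle_year': 'before_2015', 'credit_score': 0.35},
    {'gender': 'female', 'vehicle_year': 'after_2015', 'credit_score': 0.81},
]
toy_labels = [0, 1, 1, 0]
dv = DictVectorizer(sparse=False)
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(dv.fit_transform(toy_records), toy_labels)

app = Flask('claim')  # hypothetical service name


@app.route('/predict', methods=['POST'])  # hypothetical endpoint
def predict():
    customer = request.get_json()
    X = dv.transform([customer])
    probability = float(model.predict_proba(X)[0, 1])
    # The 0.5 cut-off is an arbitrary placeholder, not a tuned threshold.
    return jsonify({'outcome_probability': probability,
                    'outcome': bool(probability >= 0.5)})


# Exercise the endpoint in-process with Flask's built-in test client.
client = app.test_client()
response = client.post('/predict', json=toy_records[0])
result = response.get_json()
print(response.status_code, result)
```

To serve real requests, the same `app` would go into a script that loads `model_C=1.0.bin` and ends with a call such as `app.run(host='0.0.0.0', port=9696)`, where the port is an arbitrary choice. The test client is used here only so that the example runs inside a notebook cell without blocking.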

## Python virtual environment: Pipenv

## Environment management: Docker

# Submission

In [35]:
sub_df = pd.read_csv('submission.csv')

# Resources

Project details:
* This is what you need to do for each project
	* Think of a problem that's interesting for you and find a dataset for that
	* Describe this problem and explain how a model could be used
	* Prepare the data, do EDA, and analyze important features
	* Train multiple models, tune their performance and select the best model
	* Export the notebook into a script
	* Put your model into a web service and deploy it locally with Docker
	* Bonus points for deploying the service to the cloud
	* Links to submission:
		* [Project instructions](https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp/projects)
		* [Project Submission link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/projects.md#midterm-project)




* [Pandas profiling for easy EDA](https://pypi.org/project/pandas-profiling/)
* [Course notebook for reference](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/04-evaluation/homework_4.ipynb)

Future requirements:
* How does DictVectorizer work?
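
As a starting point for that question, here is a small sketch of the behaviour that matters most for this notebook: `DictVectorizer` one-hot encodes string values and passes numeric values through unchanged. The two records are made up for illustration. One consequence is that integer-coded columns such as `postal_code` are treated as plain numbers, even though they are listed under `categorical` above, unless they are cast to strings first.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {'gender': 'male', 'postal_code': 10238, 'credit_score': 0.72},
    {'gender': 'female', 'postal_code': 92100, 'credit_score': 0.48},
]

# String values become one-hot columns; numeric values are copied through as-is.
dv_numeric = DictVectorizer(sparse=False)
X_numeric = dv_numeric.fit_transform(records)
print(dv_numeric.feature_names_)
print(X_numeric)

# Casting a numeric code to str makes DictVectorizer one-hot encode it instead.
records_as_str = [{**r, 'postal_code': str(r['postal_code'])} for r in records]
dv_onehot = DictVectorizer(sparse=False)
X_onehot = dv_onehot.fit_transform(records_as_str)
print(dv_onehot.feature_names_)
```

Whether one-hot encoding `postal_code` would help this model is an open question, since it adds one column per distinct code.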