## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.


In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).


In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0


Split the data into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1`.

In [66]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

In [67]:
# !wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

In [68]:
df_origin = pd.read_csv("course_lead_scoring.csv")

In [69]:
df_origin.isna().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [70]:
df_origin.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [71]:
def prepare_data(df):
    df_result = df.copy()
    df_result["annual_income"] = df_result["annual_income"].fillna(0)
    df_result["lead_source"] = df_result["lead_source"].fillna("NA")
    df_result["industry"] = df_result["industry"].fillna("NA")
    df_result["employment_status"] = df_result["employment_status"].fillna("NA")
    df_result["location"] = df_result["location"].fillna("NA")

    return df_result

In [72]:
df = prepare_data(df_origin)

In [73]:
df.isna().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64
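
The same fill rules can also be driven by column dtype instead of listing each column by name. The sketch below is one possible variant, not the notebook's own method: the helper name `fill_missing_by_dtype` and the toy values are hypothetical, and it assumes every non-numeric column should be treated as categorical.

```python
import numpy as np
import pandas as pd


def fill_missing_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Fill non-numeric columns with 'NA' and numeric columns with 0.0."""
    result = df.copy()
    # Assumption: anything that is not numeric is a categorical feature.
    categorical = result.select_dtypes(exclude="number").columns
    numerical = result.select_dtypes(include="number").columns
    result[categorical] = result[categorical].fillna("NA")
    result[numerical] = result[numerical].fillna(0.0)
    return result


# Toy frame with made-up values, used only to illustrate the helper.
toy = pd.DataFrame({
    "lead_source": ["paid_ads", None, "events"],
    "annual_income": [79450.0, np.nan, 78796.0],
    "converted": [1, 0, 1],
})
filled = fill_missing_by_dtype(toy)
print(filled)
```

On the homework data this would be called as `fill_missing_by_dtype(df_origin)`. Listing the columns explicitly, as `prepare_data` does, is more verbose but makes it harder to fill a column by accident.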

In [74]:
df.head()

| | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NA | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.8 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NA | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


In [75]:
target_column = "converted"

In [76]:
from pandas import DataFrame


def split_features_and_target(df_train: DataFrame, df_validation: DataFrame, df_test: DataFrame):
    y_train = df_train[target_column].values
    y_validation = df_validation[target_column].values
    y_test = df_test[target_column].values

    df_train = df_train.drop([target_column], axis=1)
    df_validation = df_validation.drop([target_column], axis=1)
    df_test = df_test.drop([target_column], axis=1)

    return df_train, y_train, df_validation, y_validation, df_test, y_test

In [77]:
# Split the prepared frame (missing values filled), not the raw df_origin
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [78]:
df_train, df_validation = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [79]:
df_train, y_train, df_validation, y_validation, df_test, y_test = split_features_and_target(df_train, df_validation, df_test)
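
The second split uses `test_size=0.25` because the validation set is carved out of the remaining 80% of the data, and 0.25 × 0.8 = 0.2 of the total. A minimal check of that arithmetic, using a stand-in array of 1000 row indices rather than the homework data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for row indices; the size 1000 is an arbitrary illustrative choice.
rows = np.arange(1000)

# First split: hold out 20% of all rows for the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)
# Second split: 25% of the remaining 80% is 20% of the total.
train, validation = train_test_split(full_train, test_size=0.25, random_state=1)

sizes = (len(train), len(validation), len(test))
print(sizes)  # (600, 200, 200)
```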

### Question 1: ROC AUC feature importance

ROC AUC can also be used to evaluate the feature importance of numerical variables.

Let's do that:

* For each numerical variable, use it as the score (aka prediction) and compute the AUC with the `y` variable as ground truth.
* Use the training dataset for that.


If your AUC is < 0.5, invert this variable by putting "-" in front

(e.g. `-df_train['balance']`)

AUC can go below 0.5 if the variable is negatively correlated with the target variable. Negating the variable flips the direction of the correlation, so the negative correlation becomes positive.

Which numerical variable (among the following 4) has the highest AUC?

- `lead_score`
- `number_of_courses_viewed`
- `interaction_count`
- `annual_income`
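
The procedure described above can be sketched as follows. This is a minimal illustration, not a definitive solution: the helper name `auc_by_feature` is hypothetical, and the data is synthetic (random columns named `helpful`, `inverted`, and `noise`), so the numbers it prints say nothing about the homework answer.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_by_feature(df, y, columns):
    """Use each column as a score, compute its AUC, and flip it if below 0.5."""
    scores = {}
    for column in columns:
        auc = roc_auc_score(y, df[column])
        if auc < 0.5:
            # Negatively correlated with the target: negate the variable.
            auc = roc_auc_score(y, -df[column])
        scores[column] = auc
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic data for illustration only.
rng = np.random.default_rng(1)
y_toy = rng.integers(0, 2, size=500)
df_toy = pd.DataFrame({
    "helpful": y_toy + rng.normal(0, 1, size=500),    # positively related to y
    "inverted": -y_toy + rng.normal(0, 1, size=500),  # negatively related to y
    "noise": rng.normal(0, 1, size=500),              # unrelated to y
})

toy_auc = auc_by_feature(df_toy, y_toy, df_toy.columns)
print(toy_auc)
```

With the notebook's variables, the call would be `auc_by_feature(df_train, y_train, ["lead_score", "number_of_courses_viewed", "interaction_count", "annual_income"])`.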