Initial setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df=pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv')

Data Preparation

In [3]:
df.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0


In [4]:
categorical = df.dtypes[df.dtypes==object].index.to_list()
categorical

['lead_source', 'industry', 'employment_status', 'location']

In [5]:
numerical = df.dtypes[df.dtypes!=object].index.to_list()
numerical

['number_of_courses_viewed',
 'annual_income',
 'interaction_count',
 'lead_score',
 'converted']

If there are missing values:
  1. For categorical features, replace them with 'NA'
  2. For numerical features, replace them with 0.0

In [6]:
df[categorical]=df[categorical].fillna('NA')
df[numerical]=df[numerical].fillna(0)

Split the data into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the train_test_split function for that with random_state=1.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [8]:
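# Hold out 20% of the data for the test set
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
# 25% of the remaining 80% gives a validation set that is 20% of the full data
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)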

df_train=df_train.reset_index(drop=True)
df_val=df_val.reset_index(drop=True)
df_test=df_test.reset_index(drop=True)


y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

del df_train['converted']
del df_val['converted']
del df_test['converted']
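
The second split uses test_size=0.25 because it is applied to the 80% that remains after the test set is held out, and 0.25 × 0.8 = 0.2 of the full data. A minimal sketch of that arithmetic, assuming a hypothetical 1000-row frame named `toy` (not the lead-scoring data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 1000 rows, used only to check the split sizes.
toy = pd.DataFrame({"x": range(1000)})

# Hold out 20% for test, then take 25% of the remaining 80% for validation.
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=1)

split_sizes = (len(toy_train), len(toy_val), len(toy_test))
print(split_sizes)  # (600, 200, 200)
```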

Q1: ROC AUC feature importance

ROC AUC could also be used to evaluate feature importance of numerical variables.

Let's do that

  1. For each numerical variable, use it as a score (aka prediction) and compute the AUC with the y variable as ground truth.
  2. Use the training dataset for that.
  3. If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. -df_train['balance']).

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable - then negative correlation becomes positive.
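
Negating a variable reverses the ranking it induces, so the AUC of the negated variable is one minus the original AUC. A small sketch of this on hypothetical toy arrays (`y_toy` and `x_toy` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: the feature tends to be high for negatives
# and low for positives, so it is negatively related to the target.
y_toy = np.array([0, 0, 0, 1, 1, 1])
x_toy = np.array([9.0, 8.0, 5.0, 6.0, 2.0, 1.0])

auc_raw = roc_auc_score(y_toy, x_toy)       # below 0.5
auc_negated = roc_auc_score(y_toy, -x_toy)  # mirrored above 0.5

print(auc_raw, auc_negated)
```

Here only one of the nine positive/negative pairs is ranked correctly by the raw feature, so the raw AUC is 1/9 and the negated AUC is 8/9.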

Which numerical variable (among the following 4) has the highest AUC?

In [13]:
from sklearn.metrics import roc_auc_score

In [12]:
numerical_auc = numerical.copy()
numerical_auc.remove('converted')
print(numerical_auc)

['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']


In [21]:
for col in numerical_auc:
    score = roc_auc_score(y_train, df_train[col])
    # An AUC below 0.5 means the feature is negatively related to the target, so negate it
    if score < 0.5:
        score = roc_auc_score(y_train, -df_train[col])
    print(f"{col} | AUC: {score}")

number_of_courses_viewed | AUC: 0.7635680590007088
annual_income | AUC: 0.5519578313253012
interaction_count | AUC: 0.738270176293409
lead_score | AUC: 0.6144993577250176

number_of_courses_viewed has the highest AUC (about 0.764).


Q2: Training the model

Apply one-hot-encoding using DictVectorizer and train the logistic regression with these parameters:

LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)

What's the AUC of this model on the validation dataset? (round to 3 digits)

1)0.32

2)0.52

3)0.72

4)0.92


In [29]:
dicts = df_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(dicts)

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

In [30]:
X_val = dv.transform(df_val.to_dict(orient='records'))

In [32]:
y_pred = model.predict_proba(X_val)[:,1]

auc = roc_auc_score(y_val, y_pred)

auc

np.float64(0.8171316268814112)

Q3: Precision and Recall

Now let's compute precision and recall for our model.

  1. Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01.
  2. For each threshold, compute precision and recall.
  3. Plot them.

At which threshold do the precision and recall curves intersect?

1)0.145

2)0.345

3)0.545

4)0.745

In [34]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [42]:
def precision_recall_dataframe(y_val, y_pred):
    scores = []
    thresholds = np.linspace(0, 1, 101)

    for t in thresholds:
        predict_positive = (y_pred >= t)

        # zero_division=0 returns 0 instead of warning when nothing is predicted positive
        p = precision_score(y_val, predict_positive, zero_division=0)
        r = recall_score(y_val, predict_positive, zero_division=0)

        scores.append((t, p, r))

    # Collect the per-threshold scores into a DataFrame and return it
    return pd.DataFrame(scores, columns=['t', 'p', 'r'])

In [43]:
# Use the model's validation predictions from Q2
df_scores = precision_recall_dataframe(y_val, y_pred)

In [46]:
plt.plot(df_scores.t, df_scores.p, label='Precision')
plt.plot(df_scores.t, df_scores.r, label='Recall')
plt.xlabel('Threshold')
plt.legend()
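
Besides reading the crossing point off the plot, one possible way to locate it is to take the threshold where the absolute gap between precision and recall is smallest. The sketch below does that on hypothetical labels and probabilities (`y_true_toy`, `y_prob_toy`), not on the model's predictions. The two curves can coincide over a range of thresholds, in which case np.argmin returns the first one.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_prob_toy = np.array([0.052, 0.204, 0.353, 0.407, 0.551,
                       0.608, 0.655, 0.803, 0.902, 0.951])

thresholds = np.linspace(0, 1, 101)
precisions, recalls = [], []
for t in thresholds:
    pred = (y_prob_toy >= t).astype(int)
    precisions.append(precision_score(y_true_toy, pred, zero_division=0))
    recalls.append(recall_score(y_true_toy, pred, zero_division=0))

precisions = np.array(precisions)
recalls = np.array(recalls)

# Ignore thresholds where precision is 0 (no true positives predicted),
# since precision and recall trivially coincide at 0 there.
gap = np.abs(precisions - recalls)
gap[precisions == 0] = np.inf

best_threshold = thresholds[np.argmin(gap)]
print(best_threshold)
```

On this toy data the curves first meet at a threshold of about 0.41, where both precision and recall equal 5/6.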