In [39]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
from scipy.stats.mstats import winsorize

In [40]:
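# Load the data from a CSV file and convert the 'churned_status' variable to 0/1.

data = pd.read_csv('task_data_churned.csv')
# map() is used instead of replace() so the result is numeric without relying on implicit downcasting
data['churned_status_int'] = data['churned_status'].map({'No': 0, 'Yes': 1})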

In [41]:
# Initial data check

data.describe()

# Most variables are numeric, and most of them have no missing values. A number of variables are zero in most rows,
# so it makes sense to check how important they are for prediction and what they mean. Data quality also needs
# further examination: a few columns (for example 'action_gps_tracking', with 876 non-missing values out of 2502)
# have many missing values, and we need to decide how to handle or impute them. Ideally, we would find out
# why the values are missing and fill in the true values. Failing that, we can impute the column mean,
# or impute from 'similar' instances that do have the value. For example, if 'action_gps_tracking' is missing
# in one row, and another row that is similar on the other variables has a value for it, we copy that value
# into the row where it is missing.
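
# A minimal sketch of imputing from 'similar' instances, assuming scikit-learn's KNNImputer and a toy
# frame with made-up values (not the real data). With n_neighbors=1, each missing value is copied from
# the nearest row that has the value. On real data, the imputer should be fitted on the training rows only,
# and features on very different scales would need scaling first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical values); NaN marks a missing action count.
toy = pd.DataFrame({
    'ws_users_activated':  [1.0, 1.0, 10.0, 11.0],
    'action_gps_tracking': [2.0, np.nan, 8.0, np.nan],
})

# Distances are computed on the features both rows have observed;
# the missing value is taken from the closest row that has it.
imputer = KNNImputer(n_neighbors=1)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```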

# Another topic to consider is how to handle outliers. Depending on the nature of the data
# and the meaning of each variable, outliers can be handled in different ways. One method is
# winsorizing, where an outlier is replaced with the value at a chosen percentile of that variable (e.g., the 90th).
# This way, we don't drop the instances that contain outliers, and therefore we have more data for training the model.
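
# A minimal sketch of winsorizing, assuming the percentile limits are taken from the training values
# and then applied to both training and test values. The helper name winsorize_with_train_limits and the
# toy arrays are hypothetical. np.quantile interpolates between values, so the limits may differ slightly
# from those used by scipy.stats.mstats.winsorize.

```python
import numpy as np

def winsorize_with_train_limits(train, test, lower=0.05, upper=0.95):
    """Clip both arrays to the [lower, upper] quantiles of the training array."""
    lo, hi = np.quantile(train, [lower, upper])
    return np.clip(train, lo, hi), np.clip(test, lo, hi), (lo, hi)

train_values = np.arange(1, 101, dtype=float)   # 1, 2, ..., 100
test_values = np.array([0.0, 50.0, 500.0])      # contains a low and a high outlier

train_w, test_w, (lo, hi) = winsorize_with_train_limits(train_values, test_values)

print(lo, hi)
print(test_w)
```

# Values inside the limits are unchanged; values outside are moved to the nearest limit, so no row is dropped.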


# This dataset also contains a categorical variable (country), so we need to decide how to
# convert it into numeric values. One way is simply to create a new 0/1 variable for each country
# (one-hot encoding). Another is mean encoding: each category is replaced with the mean of the dependent
# variable (churn) over the instances in that category. For example, the country 'serbia' would be replaced
# with the mean churn calculated over all training-set instances with country='serbia'.
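
# A minimal sketch of mean encoding on toy data (made-up rows, not the real dataset). Two details here are
# assumptions added for illustration and are not part of the approach described above: the means are smoothed
# toward the global churn rate with a hypothetical strength m, and countries unseen in training fall back
# to the global churn rate.

```python
import pandas as pd

train = pd.DataFrame({
    'country': ['serbia', 'serbia', 'serbia', 'india'],
    'churn':   [1, 0, 1, 0],
})
test = pd.DataFrame({'country': ['serbia', 'peru']})  # 'peru' never appears in train

global_mean = train['churn'].mean()
stats = train.groupby('country')['churn'].agg(['mean', 'count'])

# Smoothed mean: small groups are pulled toward the global mean.
m = 2  # hypothetical smoothing strength
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

train_encoded = train['country'].map(smoothed)
test_encoded = test['country'].map(smoothed).fillna(global_mean)

print(smoothed)
print(test_encoded)
```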

# The ratio of churn to non-churn in this dataset is imbalanced (about 32% of rows are churned), so we should
# also consider methods for handling that imbalance. The imbalance is not severe, but it's worth exploring
# different methods to achieve an optimal solution. One way is the SMOTE algorithm, which generates
# new synthetic instances of the minority class and thus evens out the class proportions.
# Another way is to set class-weighting parameters in the training algorithm itself.
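
# A minimal sketch of the weighting option, assuming a toy label vector with 68 non-churned and 32 churned
# rows (not the real data). XGBoost's documentation suggests count(negative) / count(positive) as a typical
# value for scale_pos_weight. That ratio equals the ratio of the two 'balanced' class weights
# n_samples / (n_classes * class_count), not the positive-class weight on its own.

```python
import numpy as np

y_toy = np.array([0] * 68 + [1] * 32)   # hypothetical labels: 1 = churned

counts = np.bincount(y_toy)             # [n_negative, n_positive]

# Typical value suggested for XGBoost's scale_pos_weight parameter.
scale_pos_weight = counts[0] / counts[1]

# 'Balanced' per-class weights; their ratio gives the same number.
balanced_weights = len(y_toy) / (2 * counts)

print(scale_pos_weight)
print(balanced_weights, balanced_weights[1] / balanced_weights[0])
```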

# To choose the best model, there are several factors we can consider and then test the performance of different
# models. We can test different algorithms, as well as different approaches in handling outliers, asymmetric classes,
# missing values, etc., and then based on that, see what makes the most sense and gives
# the best results in our specific case.

# For the purposes of this task, I have decided to compare two models, a basic model that will serve as a baseline
# and an enhanced one where I will apply some techniques to address the issues described above, and then compare their performance.

# We also need to choose the metric that will be our target, i.e. the one we are trying to optimize.
# Depending on business needs, it can be precision, recall, accuracy, etc.
# I will focus on one of them here (accuracy).
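
# A minimal sketch of how these metrics differ, using made-up labels and predictions (the names
# y_true_demo and y_pred_demo are hypothetical). Precision asks how many predicted churners really churned;
# recall asks how many real churners were found.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # 1 = churned
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = accuracy_score(y_true_demo, y_pred_demo)     # (tp + tn) / all
precision = precision_score(y_true_demo, y_pred_demo)   # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)         # tp / (tp + fn)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

# Here accuracy is 0.7 even though only half of the churned customers are detected, which is why the
# choice of metric should follow the business need.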


,ws_users_activated,ws_users_deactivated,ws_users_invited,action_create_project,action_export_report,action_api_and_webhooks,action_time_entries_via_tracker,action_start_trial,action_import_csv,action_create_invoice,...,action_gps_tracking,action_screenshots,action_create_custom_field,value_days_to_purchase,value_number_of_active_months,value_transactions_number,value_regular_seats,value_kiosk_seats,revenue,churned_status_int
count,2502.0,2502.0,2502.0,2502.0,2502.0,2502.0,2502.0,2502.0,2502.0,2502.0,...,876.0,1044.0,443.0,2502.0,2502.0,2502.0,2502.0,2502.0,2502.0,2502.0
mean,5.619504,0.827738,0.158273,28.043965,22.709432,0.383293,19.479616,0.175859,0.622702,8.494005,...,1.371005,1.417625,7.24605,61.286571,4.215827,5.728617,6.067946,0.257794,378.331825,0.319345
std,11.36413,3.527056,0.784527,80.761092,80.884964,3.089846,114.85605,0.380777,4.770705,52.699928,...,0.726969,0.791806,11.577418,85.179584,3.691711,4.893211,11.766325,2.95797,1007.971191,0.466316
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,2.0,1.0,1.0,2.0,1.0,0.0,38.961,0.0
50%,2.0,0.0,0.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,4.0,24.0,3.0,4.0,2.0,0.0,105.7615,0.0
75%,6.0,0.0,0.0,26.0,15.0,0.0,0.0,0.0,0.0,2.0,...,2.0,2.0,7.0,84.75,7.0,8.0,6.0,0.0,333.45975,1.0
max,206.0,73.0,20.0,1923.0,1740.0,127.0,3382.0,1.0,120.0,1405.0,...,8.0,11.0,106.0,420.0,14.0,90.0,215.0,117.0,27235.156,1.0


In [43]:
# Split the dataset into X (independent variables) and y (dependent variable)

X = data.drop(['churned_status_int', 'churned_status'], axis=1)
y = data['churned_status_int']

# One-hot (dummy) encoding of the categorical variable (country)

X = pd.get_dummies(X)

# Create two datasets: one for training (80%), one for testing (20%)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# xgboost model (missing values are left as NaN, which XGBoost handles natively)

clf = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

# Training 

clf.fit(X_train, y_train)

# Predictions on test and train data set

y_pred = clf.predict(X_test)
y_pred_train = clf.predict(X_train)

# Evaluation of the model on test and train

accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred)

# Feature importance

feature_importance = clf.feature_importances_

# dataframe for feature importance

feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importance
})

# Sorting

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print(feature_importance_df)

print(accuracy_train)
print(accuracy_test)




                             Feature  Importance
6    action_time_entries_via_tracker    0.127062
19     value_number_of_active_months    0.053622
0                 ws_users_activated    0.028782
20         value_transactions_number    0.025955
87                     country_India    0.023774
..                               ...         ...
75          country_French Polynesia    0.000000
72                   country_Finland    0.000000
71                  country_Ethiopia    0.000000
70                   country_Estonia    0.000000
178                country_Wisconsin    0.000000

[179 rows x 2 columns]
0.9815092453773113
0.7325349301397206


In [48]:
# These were the results for the baseline model. Now I will apply some of the techniques described above, train a new
# model, and compare its performance with the baseline.

# Imputation of missing values

# Some columns have more than half of their values missing ('action_create_custom_field' and
# 'action_gps_tracking', with 443 and 876 non-missing values out of 2502), so I drop them from the model.
# For the other columns with missing values, I impute the column mean. Note that 'action_screenshots'
# (1044 non-missing values out of 2502) is also more than half missing, but it is kept and mean-imputed here.

# numeric_only=True restricts the means to numeric columns; text columns such as 'country' are left as they are
data_imputed = data.fillna(data.mean(numeric_only=True))

data_imputed = data_imputed.drop(['action_create_custom_field', 'action_gps_tracking'], axis=1)


# Outlier handling: winsorizing.

# Select the numeric columns, excluding the target

numeric_columns = data_imputed.select_dtypes(include=['number']).columns.difference(['churned_status_int'])

# Winsorize each numeric column: the lowest 5% and the highest 5% of values are replaced
# with the nearest remaining value

for column in numeric_columns:
    data_imputed[column] = winsorize(data_imputed[column], limits=(0.05, 0.05))

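# Creating two datasets, train and test
# (.copy() makes them independent frames, so adding a column below does not trigger a pandas warning)

data_train, data_test = train_test_split(data_imputed, test_size=0.2, random_state=42)
data_train = data_train.copy()
data_test = data_test.copy()

# Mean encoding for the categorical variable, computed on the training set only.
# Countries that appear only in the test set map to NaN, which XGBoost treats as a missing value.

country_mean_encoding = data_train.groupby('country')['churned_status_int'].mean()
data_train['country_mean_encoded'] = data_train['country'].map(country_mean_encoding)
data_test['country_mean_encoded'] = data_test['country'].map(country_mean_encoding)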


# We split the dataset into X (independent variables) and y (dependent variable) vectors

X_train = data_train.drop(['churned_status_int', 'churned_status'], axis=1)
y_train = data_train['churned_status_int']

X_test = data_test.drop(['churned_status_int', 'churned_status'], axis=1)
y_test = data_test['churned_status_int']


# Drop initial country variable

X_train = X_train.drop('country', axis=1)
X_test = X_test.drop('country', axis=1)


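# Class weights for the imbalanced classes ('balanced' weights: n_samples / (n_classes * class_count))

class_weights = len(y_train) / (2 * np.bincount(y_train))

# xgboost model with weights
# Note: scale_pos_weight is set here to the positive-class weight alone. The typical value suggested in
# XGBoost's documentation is count(negative) / count(positive), which equals class_weights[1] / class_weights[0].

clf = xgb.XGBClassifier(objective='binary:logistic', scale_pos_weight=class_weights[1], random_state=42)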

# Model training

clf.fit(X_train, y_train)

# Predictions

y_pred = clf.predict(X_test)
y_pred_train = clf.predict(X_train)

# Evaluation

accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred)

print(accuracy_train)
print(accuracy_test)

0.9880059970014993
0.7405189620758483




In [49]:

# Feature importance

feature_importance = clf.feature_importances_

# df for feature importance

feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importance
})

# Sorting

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)


feature_importance_df


                            Feature  Importance
6   action_time_entries_via_tracker    0.251189
17    value_number_of_active_months    0.090734
0                ws_users_activated    0.058622
5           action_api_and_webhooks    0.045987
22             country_mean_encoded    0.044203
18        value_transactions_number    0.042763
10              action_lock_entries    0.042183
7                action_start_trial    0.040232
1              ws_users_deactivated    0.039408
16           value_days_to_purchase    0.034657


In [None]:
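# Concluding observations: test accuracy rose from 73.3% to 74.1%, i.e. by about 0.8 percentage points
# (4 more correct predictions out of 501 test rows). That is not a significant improvement and may be within
# the noise of a single train/test split, but it shows one way we can enhance the model.
# An accuracy of 74% on test data is a moderate result: with a churn rate of about 32%, always predicting
# 'no churn' would already score about 68% on the full dataset. The desired value of the statistic we are
# trying to optimize depends largely on the specific business problem and what we define in advance as success.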

# To further improve the model, the focus should be on acquiring more data: both more users and more
# predictor variables would be beneficial. The most important variable for the model in both runs
# is 'action_time_entries_via_tracker'. It is important to understand what it represents and to gather new
# variables in that direction. In this process, communication and feedback from the business are crucial,
# because that's where we can learn which variables matter for this specific problem and understand how they
# impact the prediction.

# One potential issue that can be observed here is overfitting: accuracy is about 98-99% on the training data
# but only about 73-74% on the test data, for both models. There are different methods to address this problem,
# such as refitting the model with only the most important variables and using cross-validation to check
# which of these models show similar results on training and test data.
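
# A minimal sketch of that cross-validation check, assuming synthetic data and scikit-learn's
# GradientBoostingClassifier as a stand-in for the xgboost model (XGBClassifier follows the same scikit-learn
# estimator interface and could be passed to the helper instead). The helper name train_test_gap is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

def train_test_gap(model, X, y, cv=5):
    """Return mean train accuracy, mean validation accuracy, and their difference."""
    scores = cross_validate(model, X, y, cv=cv, scoring='accuracy',
                            return_train_score=True)
    train_acc = scores['train_score'].mean()
    valid_acc = scores['test_score'].mean()
    return train_acc, valid_acc, train_acc - valid_acc

# Synthetic stand-in data, not the churn dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

gaps = {}
for depth in (1, 5):  # shallow trees vs. deeper, more flexible trees
    model = GradientBoostingClassifier(max_depth=depth, n_estimators=50, random_state=42)
    gaps[depth] = train_test_gap(model, X_demo, y_demo)
    print(depth, gaps[depth])
```

# A large gap between the two accuracies points to overfitting; a simpler model with a smaller gap
# may generalize better.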

# Hyperparameter tuning is also one of the topics that can be explored to improve the model itself, 
# through the selection of appropriate hyperparameters for the model.
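
# A minimal sketch of hyperparameter tuning with a cross-validated grid search, again assuming synthetic data
# and GradientBoostingClassifier as a stand-in for the xgboost model. The grid values are arbitrary
# examples, not recommendations; 'max_depth' and 'learning_rate' are parameters that XGBClassifier also accepts.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the churn dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination in the grid is scored with 3-fold cross-validation.
param_grid = {'max_depth': [1, 3], 'learning_rate': [0.05, 0.2]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring='accuracy',
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```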