### Models for predicting next-month check-in (this notebook applies an imbalanced-dataset handling method)
### The training data runs through the end of Jun. 2024, to see how well the models predict Jul. 2024 check-ins

In [2]:
### read in the necessary files
import pandas as pd
usr_segment = pd.read_csv("Jun_segmentation.csv")
features = pd.read_csv("features.csv")
openapp_recency = pd.read_csv("openapp_recency.csv")

In [4]:
## merge dataset
merged_1 = pd.merge(usr_segment, features, on='user_sn', how='inner')
merged_2 = pd.merge(merged_1, openapp_recency, on='user_sn', how='inner')
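Both merges above are inner joins on `user_sn`, so any user missing from one of the files is silently dropped. A minimal sketch of how that loss could be checked, using small made-up frames in place of the real CSV files (all values below are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the three input files (hypothetical values).
usr_segment = pd.DataFrame({"user_sn": [1, 2, 3, 4], "segmentation": [1, 2, 5, 7]})
features = pd.DataFrame({"user_sn": [1, 2, 3], "recency": [3, 40, 95]})
openapp_recency = pd.DataFrame({"user_sn": [1, 3, 4], "openapp_recency": [0, 12, 60]})

# validate="one_to_one" raises MergeError if user_sn is duplicated on either side.
merged_1 = pd.merge(usr_segment, features, on="user_sn", how="inner", validate="one_to_one")
merged_2 = pd.merge(merged_1, openapp_recency, on="user_sn", how="inner", validate="one_to_one")

# Users present in the segmentation file but lost by the inner joins.
dropped = set(usr_segment["user_sn"]) - set(merged_2["user_sn"])
print(f"kept {len(merged_2)} of {len(usr_segment)} users; dropped: {sorted(dropped)}")
```

If the real files have exactly one row per user, the `validate` argument changes nothing; it only guards against accidental row duplication.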

In [5]:
merged_2.dtypes

user_sn                        int64
segmentation                   int64
hourly_ci_number               int64
ovn_ci_number                  int64
day_ci_number                  int64
total_paid_from_user           int64
usr_cancel_num_3_months        int64
hotel_cancel_num_3_months      int64
g2j_cancel_num_3_months        int64
no_show_num_3_months           int64
recency                        int64
average_ci_time_gap          float64
std_ci_time_gap              float64
open_app_num_3_months        float64
mileage_used_num             float64
current_mileage_point        float64
search_num_3_months          float64
review_num_3_months          float64
avg_mark                     float64
time_since_join                int64
user_province                 object
ci_num                         int64
openapp_recency                int64
dtype: object

### feature explanation
- segmentation: user segment as of the train-period month

- hourly_ci_number: number of hourly check-ins by the user

- ovn_ci_number: number of overnight (ovn) check-ins by the user

- day_ci_number: number of day check-ins by the user

- total_paid_from_user: total amount paid by the user

- usr_cancel_num_3_months: number of cancellations by the user in the last 3 months of the train period

- hotel_cancel_num_3_months: number of cancellations by the hotel in the last 3 months of the train period

- g2j_cancel_num_3_months: number of g2j cancellations in the last 3 months of the train period

- no_show_num_3_months: number of no-shows in the last 3 months of the train period

- recency: number of days between the user's last check-in before the end of the train period and the end of the train period

- average_ci_time_gap: average gap, in days, between consecutive check-ins

- std_ci_time_gap: standard deviation of the gaps, in days, between consecutive check-ins

- open_app_num_3_months: number of times the user opened the app in the last 3 months of the train period

- mileage_used_num: number of times the user used mileage points up to the end of the train period

- current_mileage_point: the user's mileage point balance at the end of the train period

- search_num_3_months: number of searches by the user in the last 3 months of the train period

- review_num_3_months: number of reviews given by the user in the last 3 months of the train period

- time_since_join: number of days between the user's registration date and the end of the train period

- user_province: province of the user

- openapp_recency: number of days between the user's last app open before the end of the train period and the end of the train period

- ci_num: number of check-ins by the user in the next-month period (used to create the target variable)
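To make the time-based definitions concrete, here is a small sketch that derives `recency`, `average_ci_time_gap` and `std_ci_time_gap` for one user. The check-in dates are invented, and the use of the sample standard deviation is an assumption; the notebook does not show how these columns were actually built.

```python
import pandas as pd

# Hypothetical check-in dates for a single user (not taken from the real data).
checkins = pd.to_datetime(["2024-03-01", "2024-03-11", "2024-04-10", "2024-06-20"])
train_end = pd.Timestamp("2024-06-30")  # end of the train period

# Day gaps between consecutive check-ins; the first diff is NaN and is dropped.
gaps = pd.Series(checkins).diff().dt.days.dropna()

recency = (train_end - checkins.max()).days   # days since the last check-in
average_ci_time_gap = gaps.mean()             # mean of the gaps 10, 30 and 71
std_ci_time_gap = gaps.std()                  # sample std (ddof=1); an assumption

print(recency, average_ci_time_gap, round(std_ci_time_gap, 2))
```

A user with a single check-in has no gaps at all, which is one way these two columns could end up needing special handling upstream.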

In [6]:
merged_2.isnull().sum()

user_sn                           0
segmentation                      0
hourly_ci_number                  0
ovn_ci_number                     0
day_ci_number                     0
total_paid_from_user              0
usr_cancel_num_3_months           0
hotel_cancel_num_3_months         0
g2j_cancel_num_3_months           0
no_show_num_3_months              0
recency                           0
average_ci_time_gap               0
std_ci_time_gap                   0
open_app_num_3_months        117067
mileage_used_num                239
current_mileage_point         50134
search_num_3_months          156746
review_num_3_months          103449
avg_mark                     103449
time_since_join                   0
user_province                     2
ci_num                            0
openapp_recency                   0
dtype: int64

## Cleaning data

In [7]:
## Fill the null values with 0 for open_app_num_3_months, mileage_used_num, current_mileage_point,
## search_num_3_months and review_num_3_months, then cast these count columns to int64
count_cols = ['open_app_num_3_months', 'mileage_used_num', 'current_mileage_point',
              'review_num_3_months', 'search_num_3_months']
merged_2[count_cols] = merged_2[count_cols].fillna(0).astype('int64')

## avg_mark is also filled with 0, so 0 here means "no review given" rather than a real mark
merged_2['avg_mark'] = merged_2['avg_mark'].fillna(0).astype('float32')

In [8]:
merged_2.dtypes

user_sn                        int64
segmentation                   int64
hourly_ci_number               int64
ovn_ci_number                  int64
day_ci_number                  int64
total_paid_from_user           int64
usr_cancel_num_3_months        int64
hotel_cancel_num_3_months      int64
g2j_cancel_num_3_months        int64
no_show_num_3_months           int64
recency                        int64
average_ci_time_gap          float64
std_ci_time_gap              float64
open_app_num_3_months          int64
mileage_used_num               int64
current_mileage_point          int64
search_num_3_months            int64
review_num_3_months            int64
avg_mark                     float32
time_since_join                int64
user_province                 object
ci_num                         int64
openapp_recency                int64
dtype: object

In [9]:
# Drop records with a negative time_since_join or average_ci_time_gap, and records with a null user_province
merged_2 = merged_2[merged_2['time_since_join'] >= 0]
merged_2 = merged_2[merged_2['average_ci_time_gap'] >= 0]
merged_2 = merged_2[merged_2['user_province'].notnull()]

In [11]:
segment_data = {
    'segmentation': [1, 2, 3, 4, 5, 6, 7],
    'segmentation_text': ['New', 'Existing', 'Retention', 'Win-back', 'Churn', 'Drop', 'Dormant']
}

segment_df = pd.DataFrame(segment_data)

# add the segmentation text label
cleaned_df = pd.merge(merged_2, segment_df, on='segmentation', how='left')
cleaned_df.shape

(265085, 24)

### Create target/dependent variable and train test split

In [12]:
# target: 1 if the user has at least one check-in in the next month, else 0
cleaned_df['ci_or_not'] = (cleaned_df['ci_num'] > 0).astype(int)
cleaned_df['ci_or_not'].value_counts()

ci_or_not
0    242262
1     22823
Name: count, dtype: int64
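The counts above show a strongly imbalanced target. A quick worked calculation, using only the two numbers printed by `value_counts()`:

```python
# Counts copied from the value_counts() output above.
n_neg, n_pos = 242262, 22823

pos_rate = n_pos / (n_neg + n_pos)   # share of users who check in next month
imbalance_ratio = n_neg / n_pos      # negatives per positive
baseline_accuracy = 1 - pos_rate     # accuracy of always predicting class 0

print(f"positive rate: {pos_rate:.2%}")
print(f"negatives per positive: {imbalance_ratio:.1f}")
print(f"always-0 baseline accuracy: {baseline_accuracy:.2%}")
```

Roughly 8.6% of users are positives, so a model that always predicts "no check-in" already reaches about 91% accuracy. This is why accuracy alone is a weak yardstick here and per-class precision and recall are more informative.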

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

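# drop the id, the numeric segment code (segmentation_text is kept instead) and the target-related columns
X = cleaned_df.drop(columns=['user_sn', 'segmentation', 'ci_or_not', 'ci_num'])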
y = cleaned_df['ci_or_not']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=32, stratify=y)
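The `stratify=y` argument keeps the positive rate the same in the train and test splits, which matters with a rare positive class. A minimal sketch on synthetic labels (the 900/100 split is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 900 negatives and 100 positives (hypothetical).
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=32, stratify=y
)

# With stratification both splits keep the 10% positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```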

In [15]:
num_features = X_train.select_dtypes(exclude="object").columns
cat_features = X_train.select_dtypes(include="object").columns

print('We have {} numerical features : {}'.format(len(num_features), num_features))
print('\nWe have {} categorical features : {}'.format(len(cat_features), cat_features))

We have 19 numerical features : Index(['hourly_ci_number', 'ovn_ci_number', 'day_ci_number',
       'total_paid_from_user', 'usr_cancel_num_3_months',
       'hotel_cancel_num_3_months', 'g2j_cancel_num_3_months',
       'no_show_num_3_months', 'recency', 'average_ci_time_gap',
       'std_ci_time_gap', 'open_app_num_3_months', 'mileage_used_num',
       'current_mileage_point', 'search_num_3_months', 'review_num_3_months',
       'avg_mark', 'time_since_join', 'openapp_recency'],
      dtype='object')

We have 2 categorical features : Index(['user_province', 'segmentation_text'], dtype='object')


### Preprocessing, imbalanced data handling and train for Tree-based models

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

oh_transformer = OneHotEncoder()

# Tree-based models do not need feature scaling, so numeric features are passed through unchanged
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ('pass', 'passthrough', num_features)
    ]
)
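One thing to watch with this preprocessor: a default `OneHotEncoder()` raises an error at transform time if the test set contains a category that was not seen during fitting (for example, a rare province). A hedged sketch of one possible guard, on toy data with hypothetical province labels, is `handle_unknown="ignore"`, which encodes unseen categories as all zeros:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frames; the province labels "A", "B", "C" are hypothetical.
train = pd.DataFrame({"user_province": ["A", "B", "A"], "recency": [3, 40, 95]})
test = pd.DataFrame({"user_province": ["C"], "recency": [7]})  # "C" was never seen in training

pre = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["user_province"]),
        ("pass", "passthrough", ["recency"]),
    ]
)
pre.fit(train)

row = pre.transform(test)
row = row.toarray() if hasattr(row, "toarray") else row  # densify if the output is sparse
print(row)  # both one-hot columns are 0 for the unseen province
```

Whether this is needed for the notebook's data depends on whether every province in the test split also appears in the train split.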

In [18]:
from imblearn.over_sampling import SMOTE
from collections import Counter

counter = Counter(y_train)
print('Before', counter)

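# oversample only the train dataset using SMOTE (the test set is left untouched)
# note: no random_state is set, so the synthetic samples differ between runs
X_train_encoded = preprocessor.fit_transform(X_train)
smt = SMOTE()
X_train_sm, y_train_sm = smt.fit_resample(X_train_encoded, y_train)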

counter = Counter(y_train_sm)
print('After', counter)

Before Counter({0: 193810, 1: 18258})
After Counter({0: 193810, 1: 193810})
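SMOTE reaches the balanced counts above by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point at a random position on the line segment between them. The sketch below reproduces only that interpolation step with NumPy on two invented points; it is an illustration of the idea, not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class rows: [recency, one-hot column for some province].
x_i = np.array([10.0, 1.0])
x_nn = np.array([30.0, 0.0])  # assumed to be one of x_i's nearest minority neighbours

# Draw a random position on the segment between the two points.
lam = rng.random()
x_new = x_i + lam * (x_nn - x_i)

print(x_new)  # recency lies between 10 and 30; the one-hot column is fractional
```

Note that the interpolated one-hot column is no longer exactly 0 or 1. Since SMOTE is applied here after one-hot encoding, some synthetic rows will carry such fractional category values; `imblearn` also provides `SMOTENC` for data that mixes numeric and categorical features, which may be worth comparing.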


In [19]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=5, max_features='log2', bootstrap=True, random_state=2)
# rf = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=1, class_weight=class_weights)

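# encode the test set with the preprocessor fitted on the train set
processed_X_test = preprocessor.transform(X_test)

rf.fit(X_train_sm, y_train_sm)  # train on the SMOTE-resampled train set

# make predictions on the original (not resampled) test set
y_test_pred = rf.predict(processed_X_test)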

report = classification_report(y_test, y_test_pred)
print("Test Classification Report:")
print(report)


Test Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96     48452
           1       0.54      0.53      0.53      4565

    accuracy                           0.92     53017
   macro avg       0.75      0.74      0.74     53017
weighted avg       0.92      0.92      0.92     53017



##### Good precision and recall (0.96 each) for class 0 (no check-in next month)
##### Precision of 0.54 for class 1 (check-in next month) -> of all users predicted to check in next month, 54% actually did
##### Recall of 0.53 for class 1 (check-in next month) -> of all users who actually checked in next month, the model identified 53%
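As a reminder of how these two numbers are defined, here is a tiny worked example on made-up labels (ten users, four of whom actually check in). It uses the same `sklearn.metrics` module as the report above.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = check-in next month, 0 = no check-in.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# TP = 2 (predicted 1, actually 1), FP = 1 (predicted 1, actually 0), FN = 2 (missed positives)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

print(round(precision, 2), round(recall, 2))
```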

### Preprocessing, imbalanced data handling and train for non-tree-based models

In [23]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

oh_transformer = OneHotEncoder()
numeric_transformer = StandardScaler()


# non-tree-based models are sensitive to feature scale, so numeric features are standardized
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

In [36]:
from imblearn.over_sampling import SMOTE
from collections import Counter

counter = Counter(y_train)
print('Before', counter)

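# oversample only the train dataset using SMOTE (the test set is left untouched)
# note: no random_state is set, so the synthetic samples differ between runs
X_train_encoded = preprocessor.fit_transform(X_train)
smt = SMOTE()
X_train_sm, y_train_sm = smt.fit_resample(X_train_encoded, y_train)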

counter = Counter(y_train_sm)
print('After', counter)

Before Counter({0: 193810, 1: 18258})
After Counter({0: 193810, 1: 193810})


In [27]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42, max_iter=1000)

processed_X_test = preprocessor.transform(X_test)

lr.fit(X_train_sm, y_train_sm) # Train model

y_test_pred = lr.predict(processed_X_test)

report = classification_report(y_test, y_test_pred)
print("Test Classification Report:")
print(report)


Test Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.82      0.89     48452
           1       0.30      0.82      0.44      4565

    accuracy                           0.82     53017
   macro avg       0.64      0.82      0.66     53017
weighted avg       0.92      0.82      0.85     53017



##### High precision (0.98) but lower recall (0.82) for class 0 (no check-in next month)
##### Precision of 0.30 for class 1 (check-in next month) -> of all users predicted to check in next month, only 30% actually did
##### Recall of 0.82 for class 1 (check-in next month) -> of all users who actually checked in next month, the model identified 82%
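The two models trade precision against recall in opposite directions, and the F1 score (the harmonic mean of the two) summarizes that trade-off in one number. The sketch below recomputes the class 1 F1 from the rounded precision and recall printed in the two reports; because the inputs are rounded, the last digit could in general differ from the report.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall copied from the two classification reports.
rf_f1 = f1(0.54, 0.53)  # random forest
lr_f1 = f1(0.30, 0.82)  # logistic regression

print(round(rf_f1, 2), round(lr_f1, 2))
```

On this metric the random forest scores higher for class 1, while logistic regression finds far more of the actual check-ins at the cost of many more false alarms. Which behaviour is preferable depends on the relative cost of a missed check-in versus a wrong positive prediction.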