## Project Description

This project classifies phone user behavior on a usage scale from 1 to 5. The higher the number, the more heavily the user uses the phone. The dataset contains the following features:
- User ID
- Device Model
- Operating System
- App Usage Time (min/day)
- Screen On Time (hours/day)
- Battery Drain (mAh/day)
- Number of Apps Installed
- Data Usage (MB/day)
- Age
- Gender
- User Behavior Class

The dataset is from Kaggle: https://www.kaggle.com/datasets/valakhorasani/mobile-device-usage-and-user-behavior-dataset

## Import the relevant libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mutual_info_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

import xgboost as xgb
import pickle

## Load the dataset

In [None]:
df = pd.read_csv('data/user_behavior_dataset.csv')
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

There are no null values in the dataset.

In [None]:
# Turn the column names into lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

We should map the target (`user_behavior_class`) to 0 through 4 instead of 1 through 5. XGBoost expects multiclass labels to start at 0, and skipping this step causes an error.
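
The remapping can also be derived from the labels themselves instead of being typed out by hand. The snippet below is a minimal sketch in plain Python; the helper name `build_label_mapping` and the sample labels are assumptions made for illustration and are not part of the notebook.

```python
# Sketch: shift arbitrary class labels onto the contiguous range 0..K-1.
# The label values below are made up for illustration.
def build_label_mapping(labels):
    """Return a dict that maps each distinct label to an index in 0..K-1."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}

raw_labels = [1, 3, 5, 2, 4, 1, 5]         # hypothetical 1-to-5 classes
mapping = build_label_mapping(raw_labels)
encoded = [mapping[label] for label in raw_labels]

# Invert the mapping to report predictions on the original 1-to-5 scale.
inverse = {index: label for label, index in mapping.items()}
decoded = [inverse[index] for index in encoded]

print(mapping)
print(encoded)
```

Keeping the inverse mapping around makes it easy to translate a predicted class back to the original scale when presenting results.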

In [None]:
# Map user_behavior_class from 1-5 to 0-4
df.user_behavior_class = df.user_behavior_class.map({1:0, 2:1, 3:2, 4:3, 5:4})

In [None]:
# Map gender to a number
df.gender = df.gender.map({'Female': 0, 'Male': 1})

In [None]:
df.user_behavior_class.value_counts()

In [None]:
# drop the user_id
del df['user_id']

## EDA

In [None]:
df.describe()

For the numerical features, the median (the 50% row) falls roughly in the middle of each feature's range.

In [None]:
# Separate the numerical and categorical features for the feature importance checks
categorical_features = ['gender', 'device_model', 'operating_system']
numerical_features = ['app_usage_time_(min/day)', 'screen_on_time_(hours/day)',
                      'battery_drain_(mah/day)', 'number_of_apps_installed',
                      'data_usage_(mb/day)', 'age']

In [None]:
# Checking the correlation between the numerical features and the target
df[numerical_features].corrwith(df.user_behavior_class)

In [None]:
# Check the mutual information between the categorical features and the target
def mutual_info_usage_score(series):
    return mutual_info_score(series, df.user_behavior_class)

mi = df[categorical_features].apply(mutual_info_usage_score)
mi.sort_values(ascending=False)

Based on the feature importance above, we will not use `age` because its impact on the model is very small: its correlation with the target is `-0.0006`.

We also drop `operating_system` and `device_model`.
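
To make the mutual information score used above more concrete, here is a minimal sketch that computes it by hand for two discrete sequences. The function name `mutual_information` and the toy data are assumptions for illustration; the notebook itself relies on `mutual_info_score` from scikit-learn.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * log(count * n / (px[x] * py[y]))
    return mi

# Made-up toy data, not taken from the dataset.
target = [0, 0, 1, 1, 2, 2]
informative = ['a', 'a', 'b', 'b', 'c', 'c']    # determines the target exactly
uninformative = ['x', 'y', 'x', 'y', 'x', 'y']  # independent of the target

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(mi_informative)
print(mi_uninformative)
```

A feature that tells us nothing about the target scores zero, which is why a score close to zero is a reasonable argument for dropping a categorical feature.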

In [None]:
figure, axis = plt.subplots(nrows=len(numerical_features), sharex=True, figsize=(6, 30))
for i in range(len(numerical_features)):
    sns.boxplot(x=df.user_behavior_class, y=df[numerical_features[i]], ax=axis[i])

In [None]:
df_clean = df.drop(['age', 'operating_system', 'device_model'], axis=1)
df_clean.head()

## Modeling

In [None]:
df_full_train, df_test = train_test_split(df_clean, test_size=0.2, random_state=12)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=12)
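
The two calls above produce a 60/20/20 split, because the second `test_size=0.25` applies to the 80% that remains after the first split. The arithmetic can be checked with a short sketch; the variable names are assumptions for illustration, and the actual row counts may differ slightly because they are rounded to whole rows.

```python
from fractions import Fraction

test_size = Fraction(1, 5)   # test_size=0.2 in the first split
val_size = Fraction(1, 4)    # test_size=0.25 in the second split

full_train_share = 1 - test_size            # what remains after the first split
val_share = full_train_share * val_size     # validation share of the whole dataset
train_share = full_train_share - val_share  # training share of the whole dataset

print(train_share, val_share, test_size)
```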

In [None]:
y_train = df_train.user_behavior_class.values
y_val = df_val.user_behavior_class.values
y_test = df_test.user_behavior_class.values
y_full_train = df_full_train.user_behavior_class.values

del df_train['user_behavior_class']
del df_val['user_behavior_class']
del df_test['user_behavior_class']
del df_full_train['user_behavior_class']

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_full_train = df_full_train.reset_index(drop=True)

In [None]:
unique, count = np.unique(y_train, return_counts=True)
count, unique

In [None]:
train_dict = df_train.to_dict(orient='records')
val_dict = df_val.to_dict(orient='records')
test_dict = df_test.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)
X_test = dv.transform(test_dict)

### Modeling with XGBoost

In [None]:
features = dv.get_feature_names_out()
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=list(features))
dval = xgb.DMatrix(X_val, feature_names=list(features))
dtest = xgb.DMatrix(X_test, feature_names=list(features))

In [None]:
%%time
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'multi:softmax',
    'num_class': 5,
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=10)

In [None]:
%%time
y_pred = model.predict(dval)

In [None]:
(y_val == y_pred).mean()

In [None]:
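# Keep the test predictions in their own variable so that y_pred
# still holds the validation predictions for the confusion matrix below
y_pred_test = model.predict(dtest)
(y_test == y_pred_test).mean()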

In [None]:
cm = confusion_matrix(y_val, y_pred)
sns.heatmap(pd.DataFrame(cm))

We got accuracy, precision, and recall of `1.0`, which means there are no false positives and no false negatives.
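
The link between the confusion matrix and these three metrics can be sketched in plain Python. The helper `per_class_metrics` and the counts are made up for illustration; they are not the values from this notebook.

```python
def per_class_metrics(cm):
    """Accuracy plus per-class precision and recall from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j,
    the same layout that sklearn's confusion_matrix returns.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n_classes))
    precision, recall = [], []
    for k in range(n_classes):
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                   # row sum
        precision.append(cm[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(cm[k][k] / actual_k if actual_k else 0.0)
    return correct / total, precision, recall

# Made-up counts: a purely diagonal matrix means no class is ever confused.
perfect = [[30, 0, 0], [0, 25, 0], [0, 0, 20]]
# Made-up counts with two mistakes: true class 0 predicted as class 1.
imperfect = [[28, 2, 0], [0, 25, 0], [0, 0, 20]]

print(per_class_metrics(perfect))
print(per_class_metrics(imperfect))
```

Every off-diagonal entry lowers the recall of its row's class and the precision of its column's class, so all metrics reach `1.0` only when the matrix is purely diagonal.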

### Modeling with Naive Bayes

In [None]:
clf = GaussianNB(var_smoothing=0.01)

In [None]:
%%time
clf.fit(X_train, y_train)
clf.score(X_val, y_val)

In [None]:
%%time
nb_pred = clf.predict(X_val)

In [None]:
cm_nb = confusion_matrix(y_val, nb_pred)
sns.heatmap(pd.DataFrame(cm_nb))

In [None]:
y_pred = clf.predict_proba(X_val)
roc_auc_score(y_val, y_pred, multi_class='ovr')

### Modeling with KNN

In [None]:
%%time
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

In [None]:
%%time
y_pred = knn.predict(X_val)

In [None]:
knn.score(X_val, y_val)

In [None]:
df_clean

We got 100% accuracy for all of the models. Looking back, the separation between the classes is very clear, so we did not tune the models because tuning is not needed in this case.
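
One way to check this kind of separation numerically is to compare the value range of a feature in each class. The sketch below uses made-up numbers and hypothetical helper names (`class_ranges`, `ranges_overlap`); it only illustrates the idea and does not reproduce the dataset.

```python
def class_ranges(values, labels):
    """Map each class label to the (min, max) of the feature values in that class."""
    ranges = {}
    for value, label in zip(values, labels):
        low, high = ranges.get(label, (value, value))
        ranges[label] = (min(low, value), max(high, value))
    return ranges

def ranges_overlap(ranges):
    """True if any two classes share part of their value range."""
    ordered = sorted(ranges.values())
    return any(prev[1] >= cur[0] for prev, cur in zip(ordered, ordered[1:]))

# Made-up feature values for three hypothetical classes.
labels = [0, 0, 1, 1, 2, 2]
separated = [30, 85, 95, 170, 185, 290]
mixed = [30, 120, 95, 170, 160, 290]

print(ranges_overlap(class_ranges(separated, labels)))
print(ranges_overlap(class_ranges(mixed, labels)))
```

If a single feature has non-overlapping ranges for every class, a simple threshold rule on that feature already classifies perfectly, which would explain why very different models all reach the same score.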

## Save the model with pickle

We use the Naive Bayes model because it is the fastest to compute.

In [None]:
output_file = 'model.bin'
output_file

In [None]:
with open(output_file, 'wb') as f_out:
    pickle.dump((dv, clf), f_out)
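
Because the vectorizer and the classifier are saved together as one tuple, loading returns them in the same order. The round trip below is a minimal sketch that uses plain dictionaries as hypothetical stand-ins for the fitted objects, so it runs without the trained model.

```python
import pickle

# Hypothetical stand-ins for the fitted DictVectorizer and classifier,
# so that the round trip runs without the trained objects.
preprocessor = {'feature_order': ['gender', 'number_of_apps_installed']}
classifier = {'name': 'stand-in model'}

# Serialize both objects together as one tuple, as in pickle.dump((dv, clf), f_out).
payload = pickle.dumps((preprocessor, classifier))

# Loading returns the same tuple, so it unpacks in the order it was saved.
loaded_preprocessor, loaded_classifier = pickle.loads(payload)

print(loaded_preprocessor == preprocessor, loaded_classifier == classifier)
```

With the real file, the same unpacking applies to `pickle.load(f_in)` on `model.bin` opened in `'rb'` mode. Only unpickle files from a source you trust.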

In [None]:
# Look up a single row by position (df[101] would look for a column named 101)
df.iloc[101]