# Requirements

In [2]:
# Add as many imports as you need.
import pandas as pd

# Laboratory Exercise - Run Mode (8 points)

## Mobile Device Usage and User Behavior Dataset
The dataset contains detailed information on 700 mobile device users, capturing various usage patterns and behavior classifications. The features include app usage time, screen-on time, battery drain, data consumption, and more. These metrics provide insights into the user's daily interactions with their device, such as how much time is spent on apps, the amount of screen activity, battery usage, and mobile data consumption. In addition, user demographics like age and gender are included, as well as the device model and operating system. The 'user behavior class' attribute categorizes users based on their usage patterns, ranging from light to extreme behavior. All features, except for the 'user behavior class', can be used as input variables for analysis and modeling, while the 'user behavior class' serves as the target variable for prediction. This dataset offers valuable insights for studying mobile user behavior and can be used for building predictive models in the domain of mobile technology and applications.

Load the dataset into a `pandas` data frame.

In [6]:
# Write your code here. Add as many boxes as you need.
data = pd.read_csv('user_behavior_data.csv')
data.head()

Unnamed: 0,User ID,Device Model,Operating System,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,Gender,User Behavior Class
0,1,Google Pixel 5,Android,393,6.4,1872,67,1122.0,40.0,Male,4
1,2,OnePlus 9,Android,268,4.7,1331,42,944.0,47.0,Female,3
2,3,Xiaomi Mi 11,Android,154,4.0,761,32,,42.0,Male,2
3,4,Google Pixel 5,Android,239,4.8,1676,56,871.0,20.0,Male,3
4,5,iPhone 12,iOS,187,4.3,1367,58,988.0,31.0,Female,3


Preprocess the input and the output variables appropriately.

In [8]:
# Write your code here. Add as many boxes as you need.

Explore the dataset using visualizations of your choice.

In [10]:
# Write your code here. Add as many boxes as you need.

Check if the dataset is balanced.

In [12]:
# Write your code here. Add as many boxes as you need.
data['User Behavior Class'].value_counts()

User Behavior Class
2    146
3    143
4    139
5    136
1    136
Name: count, dtype: int64
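
The counts above are close to uniform. One way to put a number on that, sketched here under the assumption that the counts are re-entered by hand rather than read from the CSV, is to convert them to shares and compare the largest class with the smallest. The names `counts`, `shares` and `imbalance_ratio` are illustrative only.

```python
import pandas as pd

# Class counts copied from the output above (stand-in for value_counts()).
counts = pd.Series({2: 146, 3: 143, 4: 139, 5: 136, 1: 136}, name="count")

# Share of each class in percent; a perfectly balanced 5-class set gives 20% each.
shares = (counts / counts.sum() * 100).round(2)

# Ratio of the largest to the smallest class; values near 1 indicate balance.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"imbalance ratio: {imbalance_ratio:.3f}")
```

A ratio this close to 1 suggests that plain accuracy is a reasonable headline metric and that resampling is probably unnecessary.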

## Detecting Missing Values
Calculate the percentage of missing values present in each column of the dataset.

In [14]:
# Write your code here. Add as many boxes as you need.
data.isnull().sum()

User ID                         0
Device Model                    0
Operating System               70
App Usage Time (min/day)        0
Screen On Time (hours/day)      0
Battery Drain (mAh/day)         0
Number of Apps Installed        0
Data Usage (MB/day)           140
Age                            35
Gender                          0
User Behavior Class             0
dtype: int64

In [15]:
data.drop(columns=['User ID'], inplace=True)

In [16]:
data.sample(10)

Unnamed: 0,Device Model,Operating System,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,Gender,User Behavior Class
98,Google Pixel 5,Android,59,1.2,361,18,293.0,25.0,Female,1
71,iPhone 12,iOS,521,9.0,2902,97,1701.0,37.0,Male,5
118,Samsung Galaxy S21,Android,82,1.6,590,13,,28.0,Female,1
113,Samsung Galaxy S21,Android,136,3.2,818,33,404.0,42.0,Male,2
188,iPhone 12,iOS,130,2.0,602,21,589.0,30.0,Female,2
434,iPhone 12,iOS,581,8.4,2591,99,2304.0,58.0,Male,5
653,Samsung Galaxy S21,Android,49,1.2,365,19,144.0,29.0,Male,1
454,iPhone 12,iOS,143,3.9,1160,24,398.0,45.0,Male,2
250,OnePlus 9,Android,42,1.4,324,13,272.0,29.0,Female,1
114,Xiaomi Mi 11,Android,471,7.9,2156,76,1324.0,54.0,Female,4


In [17]:
percent = data.isnull().sum() / len(data) * 100

In [18]:
percent

Device Model                   0.0
Operating System              10.0
App Usage Time (min/day)       0.0
Screen On Time (hours/day)     0.0
Battery Drain (mAh/day)        0.0
Number of Apps Installed       0.0
Data Usage (MB/day)           20.0
Age                            5.0
Gender                         0.0
User Behavior Class            0.0
dtype: float64

In [19]:
from sklearn.impute import SimpleImputer

In [20]:
imp = SimpleImputer(strategy='most_frequent')
data['Operating System'] = imp.fit_transform(data[['Operating System']]).ravel()
data.sample(20)

Unnamed: 0,Device Model,Operating System,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,Gender,User Behavior Class
301,iPhone 12,iOS,267,4.4,1505,59,971.0,38.0,Female,3
669,Samsung Galaxy S21,Android,160,3.2,648,31,339.0,27.0,Female,2
2,Xiaomi Mi 11,Android,154,4.0,761,32,,42.0,Male,2
436,OnePlus 9,Android,221,4.4,1341,46,862.0,20.0,Male,3
674,Xiaomi Mi 11,Android,522,11.4,2776,93,1768.0,27.0,Female,5
337,Samsung Galaxy S21,Android,30,1.3,561,15,252.0,34.0,Male,1
387,Samsung Galaxy S21,Android,72,1.3,461,13,199.0,32.0,Male,1
500,Google Pixel 5,Android,66,1.3,369,14,,,Male,1
179,iPhone 12,iOS,539,11.9,2853,83,2007.0,55.0,Male,5
186,iPhone 12,iOS,402,7.8,2014,79,1088.0,34.0,Female,4


In [21]:
data.isnull().sum()

Device Model                    0
Operating System                0
App Usage Time (min/day)        0
Screen On Time (hours/day)      0
Battery Drain (mAh/day)         0
Number of Apps Installed        0
Data Usage (MB/day)           140
Age                            35
Gender                          0
User Behavior Class             0
dtype: int64

In [22]:
imp = SimpleImputer(strategy='mean')
data['Age'] = imp.fit_transform(data[['Age']]).ravel()
data.sample(20)

Unnamed: 0,Device Model,Operating System,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,Gender,User Behavior Class
517,iPhone 12,Android,64,1.2,592,19,,25.0,Male,1
650,Google Pixel 5,Android,149,2.0,1041,39,,49.0,Male,2
1,OnePlus 9,Android,268,4.7,1331,42,944.0,47.0,Female,3
177,Samsung Galaxy S21,Android,193,5.7,1471,51,972.0,31.0,Female,3
277,Google Pixel 5,Android,230,4.4,1607,52,878.0,54.0,Male,3
58,Xiaomi Mi 11,Android,428,7.0,2306,75,1144.0,22.0,Male,4
23,Google Pixel 5,Android,292,5.6,1401,46,949.0,37.0,Female,3
95,Xiaomi Mi 11,Android,326,7.2,2243,73,1454.0,50.0,Male,4
420,iPhone 12,iOS,32,1.4,416,12,198.0,56.0,Male,1
476,Google Pixel 5,Android,318,6.6,2089,77,1126.0,49.0,Female,4


In [23]:
data.isnull().sum()

Device Model                    0
Operating System                0
App Usage Time (min/day)        0
Screen On Time (hours/day)      0
Battery Drain (mAh/day)         0
Number of Apps Installed        0
Data Usage (MB/day)           140
Age                             0
Gender                          0
User Behavior Class             0
dtype: int64

In [24]:
imp = SimpleImputer(strategy='mean')
data['Data Usage (MB/day)'] = imp.fit_transform(data[['Data Usage (MB/day)']]).ravel()
data.sample(20)

Unnamed: 0,Device Model,Operating System,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,Gender,User Behavior Class
502,Xiaomi Mi 11,Android,582,8.4,2664,91,2493.0,55.0,Female,5
569,Google Pixel 5,Android,404,6.6,2181,77,1327.0,18.0,Male,4
374,OnePlus 9,Android,69,1.3,434,12,164.0,42.0,Male,1
396,Xiaomi Mi 11,Android,78,1.1,437,14,942.332143,38.342857,Female,1
362,Xiaomi Mi 11,Android,78,1.5,341,11,259.0,38.0,Female,1
82,Xiaomi Mi 11,Android,330,7.2,2363,77,1133.0,21.0,Female,4
385,iPhone 12,iOS,349,6.6,2041,78,1096.0,40.0,Female,4
150,iPhone 12,iOS,523,9.4,2583,92,1539.0,21.0,Male,5
590,Samsung Galaxy S21,Android,159,3.7,630,33,575.0,30.0,Male,2
264,iPhone 12,iOS,334,6.8,2000,77,1079.0,40.0,Female,4


In [25]:
data.isnull().sum()

Device Model                  0
Operating System              0
App Usage Time (min/day)      0
Screen On Time (hours/day)    0
Battery Drain (mAh/day)       0
Number of Apps Installed      0
Data Usage (MB/day)           0
Age                           0
Gender                        0
User Behavior Class           0
dtype: int64

## Understanding the Causes Behind Missing Values
Using visualization tools such as heatmaps and dendrograms, illustrate the interdependence between attributes with missing values. Also, visualize the distribution of the missing values within the dataset using matrices and bar charts.

In [27]:
# Write your code here. Add as many boxes as you need.
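
Plotting libraries such as `missingno` are commonly used for the matrix, bar, heatmap and dendrogram views. The numbers behind the bar chart and the heatmap can also be computed with plain pandas. The sketch below does that on a small synthetic frame (not the lab dataset). The frame `toy` and the way its gaps are generated are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic frame (NOT the lab dataset) with gaps in three columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Operating System": rng.choice(["Android", "iOS"], size=200),
    "Data Usage (MB/day)": rng.normal(900, 300, size=200),
    "Age": rng.integers(18, 60, size=200).astype(float),
})

# Make "Age" gaps a subset of "Data Usage" gaps so the two are related,
# while "Operating System" gaps are generated independently.
age_missing = rng.random(200) < 0.10
usage_missing = age_missing | (rng.random(200) < 0.10)
os_missing = rng.random(200) < 0.10
toy.loc[age_missing, "Age"] = np.nan
toy.loc[usage_missing, "Data Usage (MB/day)"] = np.nan
toy.loc[os_missing, "Operating System"] = np.nan

# Indicator matrix: 1 where a value is missing, 0 otherwise.
nullity = toy.isnull().astype(int)

# Percentage of missing values per column (what a bar chart would display).
missing_pct = nullity.mean() * 100

# Pairwise correlation of the indicators (what a nullity heatmap would display).
nullity_corr = nullity.corr()

print(missing_pct.round(1))
print(nullity_corr.round(2))
```

A clearly positive entry in `nullity_corr` means two columns tend to be missing in the same rows, which hints that the gaps are not completely random.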

## Handling the Missing Values
Handle the missing values using a suitable method based on the insights obtained from the various visualizations.

In [29]:
# Write your code here. Add as many boxes as you need.
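
The imputation cells earlier in this notebook fill `Operating System` with its most frequent value and the two numeric columns with their mean. A minimal pandas sketch of the same two strategies follows. It makes one extra assumption: the fill values are learned from the training rows only, so the held-out rows do not influence them. The frame `toy`, the row split and the `fill_values` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame (NOT the lab dataset) using the three columns with gaps.
toy = pd.DataFrame({
    "Operating System": ["Android", "iOS", np.nan, "Android", "Android", "iOS"],
    "Data Usage (MB/day)": [1122.0, 944.0, np.nan, 871.0, 988.0, np.nan],
    "Age": [40.0, 47.0, 42.0, np.nan, 31.0, 20.0],
})

# Pretend the first four rows are the training split and the rest the test split.
train, test = toy.iloc[:4], toy.iloc[4:]

# Learn one fill value per column from the training rows only.
fill_values = {
    "Operating System": train["Operating System"].mode().iloc[0],  # most frequent
    "Data Usage (MB/day)": train["Data Usage (MB/day)"].mean(),    # mean
    "Age": train["Age"].mean(),                                    # mean
}

# Apply the same fill values to both splits.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Whether the mean is appropriate depends on what the missingness plots show. If gaps in one column track the values of another, a model-based or group-wise fill may be a better fit.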

In [30]:
from sklearn.preprocessing import LabelEncoder

In [31]:
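def label(data, columns):
    # Encode each listed column as integers, working on a copy of the frame.
    data_copy = data.copy()
    for c in columns:
        # A fresh encoder per column keeps each column's fitted classes separate.
        encoder = LabelEncoder()
        data_copy[c] = encoder.fit_transform(data_copy[c].astype(str))
    return data_copy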

In [32]:
data = label(data=data, columns=['Device Model','Operating System','Gender','User Behavior Class'])

In [33]:
data.head()

Unnamed: 0,Device Model,Operating System,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,Gender,User Behavior Class
0,0,0,393,6.4,1872,67,1122.0,40.0,1,3
1,1,0,268,4.7,1331,42,944.0,47.0,0,2
2,3,0,154,4.0,761,32,942.332143,42.0,1,1
3,0,0,239,4.8,1676,56,871.0,20.0,1,2
4,4,1,187,4.3,1367,58,988.0,31.0,0,2


## Dataset Splitting
Partition the dataset into training and testing sets with an 80:20 ratio.

In [35]:
# Write your code here. Add as many boxes as you need.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(data.iloc[:,:-1], data.iloc[:,-1], test_size=0.2)
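
The call above draws a different random split on every run and does not force the class proportions to match between the two sets. A hedged variant, shown on a synthetic stand-in for the encoded frame, fixes `random_state` and passes `stratify`. The frame `toy`, its feature names and the seed value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded frame: 700 rows, target in the last column.
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(700, 3)), columns=["f1", "f2", "f3"])
toy["User Behavior Class"] = np.repeat(np.arange(5), 140)

X = toy.iloc[:, :-1]  # every column except the last one
y = toy.iloc[:, -1]   # the last column is the target

# stratify=y keeps the class proportions equal in both splits;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().sort_index())
```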

## Feature Scaling
Standardize the features appropriately.

In [37]:
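# Write your code here. Add as many boxes as you need.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training split only, then reuse it for the test split.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

data.sample(15)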

Unnamed: 0,Device Model,Operating System,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,Gender,User Behavior Class
522,3,0,438,6.5,1849,64,1125.0,49.0,0,3
466,2,0,414,7.3,2349,75,1092.0,51.0,1,3
185,0,0,498,10.7,2738,94,1995.0,42.0,1,4
475,2,0,412,6.2,2201,68,1085.0,54.0,0,3
68,4,1,516,10.2,2932,98,1547.0,31.0,1,4
237,1,0,425,6.9,2142,66,1130.0,19.0,0,3
663,0,0,469,6.4,1858,78,1297.0,55.0,0,3
639,0,0,538,9.8,2778,91,2080.0,35.0,0,4
207,0,0,163,3.1,620,21,419.0,23.0,1,1
402,4,1,411,7.4,1960,71,1264.0,40.0,1,3


## Model Selection

Choose and train an appropriate model for the given task.

In [40]:
# Write your code here. Add as many boxes as you need.
# XGB
from xgboost import XGBClassifier
# `verbose` is not an XGBClassifier parameter and was ignored with a warning, so it is dropped here.
model = XGBClassifier(max_depth=50, min_child_weight=1, n_estimators=200, n_jobs=-1, learning_rate=0.16)
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)



In [41]:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test, y_pred))

[[28  0  0  0  0]
 [ 0 28  0  0  0]
 [ 0  0 30  0  0]
 [ 0  0  0 23  0]
 [ 0  0  0  1 30]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        28
           1       1.00      1.00      1.00        28
           2       1.00      1.00      1.00        30
           3       0.96      1.00      0.98        23
           4       1.00      0.97      0.98        31

    accuracy                           0.99       140
   macro avg       0.99      0.99      0.99       140
weighted avg       0.99      0.99      0.99       140
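
The per-class figures in the report follow directly from the confusion matrix. As a worked check, the sketch below re-enters the matrix printed above and recomputes precision, recall, F1 and accuracy with NumPy. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [28, 0, 0, 0, 0],
    [0, 28, 0, 0, 0],
    [0, 0, 30, 0, 0],
    [0, 0, 0, 23, 0],
    [0, 0, 0, 1, 30],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = predicted counts
recall = true_positives / cm.sum(axis=1)     # row sums = true counts (support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for label, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The single off-diagonal entry (one sample of class 4 predicted as class 3) is what lowers the precision of class 3 to 23/24 and the recall of class 4 to 30/31.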



Use the trained model to make predictions for the test set.

In [43]:
# Write your code here. Add as many boxes as you need.

Assess the performance of the model by using different classification metrics.

In [99]:
# Write your code here. Add as many boxes as you need.
# LGBM
from lightgbm import LGBMClassifier

clf = LGBMClassifier()
clf.fit(X_train, Y_train)

y_pred = clf.predict(X_test)
print(classification_report(Y_test, y_pred))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000068 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 781
[LightGBM] [Info] Number of data points in the train set: 560, number of used features: 9
[LightGBM] [Info] Start training from score -1.645806
[LightGBM] [Info] Start training from score -1.557252
[LightGBM] [Info] Start training from score -1.600549
[LightGBM] [Info] Start training from score -1.574347
[LightGBM] [Info] Start training from score -1.673976
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        28
           1       1.00      1.00      1.00        28
           2       1.00      1.00      1.00        30
           3       1.00      1.00      1.00        23
           4       1.00      1.00      1.00        31

    accuracy                           1.00       140
 

# Laboratory Exercise - Bonus Task (+ 2 points)

As part of the bonus task in this laboratory assignment, your objective is to fine-tune at least one hyper-parameter using a cross-validation with grid search. This involves systematically experimenting with various values for the hyper-parameter(s) and evaluating the model's performance using cross-validation. Upon determining the most suitable value(s) for the hyper-parameter(s), evaluate the model's performance on a test set for final assessment.

Hint: Use `GridSearchCV` from the `scikit-learn` library. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.

## Dataset Splitting
Partition the dataset into training and testing sets with a 90:10 ratio.

In [49]:
# Write your code here. Add as many boxes as you need.

## Feature Scaling
Standardize the features appropriately.

In [51]:
# Write your code here. Add as many boxes as you need.

## Fine-tuning the Hyperparameters
Experiment with various values for the chosen hyperparameter(s) and evaluate the model's performance using cross-validation.

In [53]:
# Write your code here. Add as many boxes as you need.
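
One possible shape for the bonus task is sketched below on synthetic data (not the lab dataset). It makes several assumptions. A `RandomForestClassifier` stands in for the boosted models used earlier so that the example depends on scikit-learn only. The tuned hyperparameter is `max_depth`, and the grid values are arbitrary. The scaler sits inside a `Pipeline` so that it is refit on each cross-validation training fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 5-class data (NOT the lab dataset): 700 rows, 9 features.
X, y = make_classification(
    n_samples=700, n_features=9, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# 90:10 split, stratified so both sets keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Scaling lives inside the pipeline, so each CV fold is scaled on its own.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Candidate values for one hyperparameter; "model__" targets the pipeline step.
param_grid = {"model__max_depth": [3, 5, None]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

With the default `refit=True`, `search` retrains the best configuration on the full training set, so `search.score` and `search.predict` can be used directly for the final assessment.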

## Final Assessment of the Model Performance
Upon determining the most suitable hyperparameter(s), evaluate the model's performance on a test set for final assessment.

In [55]:
# Write your code here. Add as many boxes as you need.