In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv(r"/kaggle/input/mobile-device-usage-and-user-behavior-dataset/user_behavior_dataset.csv")
df

In [None]:
df.info()

****
## Data Visualization ##
****

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
tdf=df['Device Model'].value_counts().reset_index()
tdf1=df['Operating System'].value_counts().reset_index()
tdf2=df['Gender'].value_counts().reset_index()
fig,ax=plt.subplots(1,3,figsize=(15,6))
ax[0].pie(x=tdf['count'],labels=tdf['Device Model'],autopct='%.2f%%')
ax[0].set_title("Distribution of Phone Brands")
ax[1].pie(x=tdf1['count'],labels=tdf1['Operating System'],autopct='%.2f%%')
ax[1].set_title("Distribution of Operating System")
ax[2].pie(x=tdf2['count'],labels=tdf2['Gender'],autopct='%.2f%%')
ax[2].set_title("Distribution of Gender")
plt.show()

**Findings**

1. This dataset contains 5 different phone models, each making up around 20 percent of the dataset
2. There are two operating systems; Android makes up around 80 percent of the dataset (which makes sense, since 4 out of the 5 phone models run Android :) )
3. The dataset has a balanced male-to-female ratio
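
The shares read off the pie charts can also be computed directly. The snippet below is a minimal sketch on a tiny hypothetical sample (the `sample` frame and its values are made up for illustration, not taken from the dataset); on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature sample; the notebook itself would use `df`.
sample = pd.DataFrame({
    "Operating System": ["Android", "Android", "Android", "Android", "iOS"],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
})

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 gives the percentages shown in a pie chart.
os_share = sample["Operating System"].value_counts(normalize=True).mul(100).round(2)
gender_share = sample["Gender"].value_counts(normalize=True).mul(100).round(2)

print(os_share)
print(gender_share)
```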

In [None]:
fig,ax=plt.subplots(2,2,figsize=(20,20))
sns.countplot(data=df,x='Device Model',hue='Gender',ax=ax[0][0])
sns.countplot(data=df,x='Operating System',hue='Gender',ax=ax[0][1])
sns.scatterplot(data=df,y='Device Model',x='Age',hue='Gender',ax=ax[1][0])
sns.scatterplot(data=df,y='Operating System',x='Age',hue='Gender',ax=ax[1][1])
plt.show()

**Deductions:**

**-> The first graph (TOP-LEFT) tries to find out the relationship between Device Model and Gender:**
1. Google Pixel 5, OnePlus 9 and Samsung Galaxy S21 have more male users than female users, while Xiaomi Mi 11 and iPhone 12 have more female users than male users
2. This by itself might not be useful, but later analysis might show a relationship between the target variable (User Behavior Class) and device model, which in turn could tell us something about Gender and the target variable

**-> The Second graph (TOP-RIGHT) tries to find out the relationship between Gender and Operating System:**
1. Android tends to have more male users, possibly due to its broader device offerings and flexibility.
2. iOS is more balanced in terms of gender distribution, suggesting it appeals equally to both males and females.

**-> The Third Graph (BOTTOM-LEFT) visualizes any relationship between Device model and Age:**
1. There is no clear device model preference based on age or gender: models like the Samsung Galaxy S21 and iPhone 12 are popular across all groups, so there is no real pattern here :(

**-> The Fourth Graph (BOTTOM-RIGHT) visualizes any relationship between choice of operating system and age:**
1. Android has a more diverse age representation compared to iOS, which skews younger. This could suggest that Android devices are more affordable or offer more variety for older users, while iOS appeals to younger, possibly more brand-conscious consumers. There is a slight trend but not really an important one
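
The gender-by-device comparison in the count plots can be checked numerically with a cross-tabulation. This is only a sketch on a small hypothetical sample (the rows below are invented for illustration); `pd.crosstab` with `normalize="index"` turns the counts into per-device shares.

```python
import pandas as pd

# Hypothetical rows for illustration; the notebook would pass columns of `df`.
sample = pd.DataFrame({
    "Device Model": ["Google Pixel 5", "Google Pixel 5", "iPhone 12",
                     "iPhone 12", "iPhone 12", "Xiaomi Mi 11"],
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
})

# Raw counts per (device, gender) pair, i.e. the bar heights of the count plot.
counts = pd.crosstab(sample["Device Model"], sample["Gender"])

# Row-normalized shares: each device's row sums to 1.
shares = pd.crosstab(sample["Device Model"], sample["Gender"], normalize="index")

print(counts)
print(shares.round(2))
```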

In [None]:
plt.figure(figsize=(20, 6))
sns.boxplot(x='Age', y='App Usage Time (min/day)', hue='Gender', data=df)
plt.title('App Usage Time by Age and Gender')
plt.show()

1. **General App Usage Time Variability:** Across all ages, there is considerable variability in app usage time for both males and females, with some ages showing a wider spread of usage times (e.g., ages 22, 31, 45) while others have a more concentrated range (e.g., ages 38, 41, 48).

2. **Outliers:** There are a few outliers, particularly around the older age groups (e.g., age 54, 59), suggesting that a few users report significantly higher app usage compared to others in their age group.

3. **Gender Differences:** In several age groups (e.g., ages 19, 26, 43), males appear to have a higher median app usage time than females. Conversely, females tend to use apps more on average in certain age ranges, such as ages 23, 34, 53, 58, where their median usage time is higher compared to males. Some age groups (e.g., 22, 32, 44) show almost equal median app usage between genders, indicating similar app usage behavior between males and females.

4. **Younger vs. Older Age Groups:** Users in their 20s tend to have higher variability in app usage time, especially around ages 20-25. Users in their 50s show less variability, with tighter box plots, suggesting more consistent app usage behavior in older individuals.

5. **No Clear Age Trend:** There does not seem to be a clear increasing or decreasing trend in app usage with age, implying that both young and older individuals have a wide range of app usage patterns. However, certain age groups (e.g., 21, 34, 53) show notably higher usage times across both genders, which may indicate lifestyle factors affecting these specific ages.
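
The medians and spreads read off the box plot can be tabulated per age and gender. The sketch below uses a small hypothetical sample (values invented for illustration) and a helper named `iqr` that is not part of the notebook; the median is the line inside each box and the interquartile range is the box height.

```python
import pandas as pd

# Hypothetical values for illustration only.
sample = pd.DataFrame({
    "Age": [22, 22, 22, 22, 45, 45, 45, 45],
    "Gender": ["Male", "Male", "Female", "Female"] * 2,
    "App Usage Time (min/day)": [100, 300, 200, 220, 400, 420, 150, 550],
})

def iqr(s):
    """Interquartile range: the height of the box in a box plot."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per (Age, Gender) group with its median and spread.
summary = (
    sample.groupby(["Age", "Gender"])["App Usage Time (min/day)"]
    .agg(median="median", iqr=iqr)
)
print(summary)
```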

In [None]:
plt.figure(figsize=(15, 6))
sns.scatterplot(x='Screen On Time (hours/day)', y='Battery Drain (mAh/day)', hue='Device Model', data=df)
plt.title('Screen On Time vs Battery Drain by Device Model')
plt.show()

**Findings:**

1. Distinct Groups by Screen Time: There are clear clusters based on screen on time. The devices exhibit step-like battery drain patterns where screen on time clusters around specific ranges, such as 2, 4, 6, 8, 10 hours/day.

2. Battery Drain Increases with Screen On Time: As expected, battery drain increases with longer screen-on times, with higher mAh consumption visible as screen time grows from left to right.

3. No Significant Device Outliers: All device models—Google Pixel 5, OnePlus 9, Xiaomi Mi 11, iPhone 12, and Samsung Galaxy S21—show similar distributions within these groups, with no one device standing out dramatically for better or worse performance.

4. Tight Grouping in Early Stages: For lower screen on times (around 2–4 hours), the battery drain is tightly grouped, indicating minimal variability in battery consumption for short usage durations.

5. Greater Spread at Higher Screen On Times: At higher screen times (8–12 hours), there is more variability in battery drain across devices, suggesting that extended use might have differing impacts on battery efficiency.
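
The visual impression that battery drain rises with screen-on time, similarly for every device, can be quantified with a Pearson correlation overall and per device. This is a sketch on hypothetical numbers (device names `A` and `B` and all values are made up); on the real data the same calls would run on `df`.

```python
import pandas as pd

x_col = "Screen On Time (hours/day)"
y_col = "Battery Drain (mAh/day)"

# Hypothetical values for illustration only.
sample = pd.DataFrame({
    "Device Model": ["A", "A", "A", "B", "B", "B"],
    x_col: [2.0, 6.0, 10.0, 2.5, 5.5, 9.0],
    y_col: [500, 1500, 2600, 600, 1400, 2400],
})

# Pearson correlation across all rows.
overall_r = sample[x_col].corr(sample[y_col])

# The same correlation computed separately within each device model.
per_device_r = pd.Series({
    name: group[x_col].corr(group[y_col])
    for name, group in sample.groupby("Device Model")
})

print(round(overall_r, 3))
print(per_device_r.round(3))
```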

In [None]:
plt.figure(figsize=(20, 6))
sns.barplot(x='Device Model', y='Number of Apps Installed', hue='Operating System', data=df)
plt.title('Number of Apps Installed by Device Model and OS')
plt.show()


In [None]:
plt.figure(figsize=(15, 6))
sns.histplot(data=df, x='Data Usage (MB/day)', hue='Gender', bins=30, kde=True)
plt.title('Distribution of Data Usage by Gender')
plt.show()


1. Data Usage Concentration: Both males and females show the highest count of users in the lower data usage range (0–500 MB/day). This suggests that most users, regardless of gender, consume relatively small amounts of data daily.

2. Males and Females Have Different Peaks: The density plot (KDE) for males peaks earlier (around 500 MB/day), while for females, it peaks slightly later (around 700 MB/day), suggesting that females tend to consume a bit more data on average compared to males.

3. Overall Distribution Shape: The distribution for both genders declines as data usage increases, with both groups having fewer users consuming more than 1500 MB/day. This indicates that heavy data usage is less common for both genders.

4. Higher Data Usage Among Males: In the higher data usage categories (above 1000 MB/day), the number of male users generally exceeds that of female users, particularly in the 1500–2500 MB/day range.

5. Female Dominance in Mid-Range Usage: Females appear to dominate the middle range (500–1000 MB/day), indicating a higher proportion of moderate data users compared to males.

In [None]:
plt.figure(figsize=(15, 6))
sns.histplot(data=df, x='Screen On Time (hours/day)', hue='Gender', bins=30, kde=True)
plt.title('Distribution of Screen On Time by Gender')
plt.show()


1. Most participants spend 1-3 hours/day on screens.
2. Males generally have higher screen time than females, especially in the 1-5 hour range.
3. Female screen time is more concentrated in 4-6 hours, with fewer reporting high screen times (above 7 hours).
4. Males have a broader screen time distribution, while females' usage is more concentrated.
5. Both genders peak at 1-2 hours/day.

****
## Data Preprocessing ##
****

In [None]:
df.info()

In [None]:
df.isnull().any()   #Checking for Null values

In [None]:
#Dropping unwanted columns ('columns=' already selects the axis, so 'axis=1' is not needed)
df.drop(columns='User ID',inplace=True)

In [None]:
for col in df.columns[df.dtypes=='object']:
    print(df[col].value_counts(),'\n\n')

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le=LabelEncoder()    #Encoding categorical values
for col in df.columns[df.dtypes=='object']:
    df[col]=le.fit_transform(df[col])
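
Because a single `LabelEncoder` is refit on every column, only the mapping of the last column survives the loop. If the original labels are needed later (for example to label plots), one option is to keep one encoder per column. The sketch below assumes a small hypothetical frame and a dictionary named `encoders`, neither of which is part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame for illustration only.
sample = pd.DataFrame({
    "Operating System": ["Android", "iOS", "Android"],
    "Gender": ["Male", "Female", "Female"],
})

encoders = {}  # column name -> fitted LabelEncoder
for col in ["Operating System", "Gender"]:
    enc = LabelEncoder()
    sample[col] = enc.fit_transform(sample[col])
    encoders[col] = enc

# Classes are sorted alphabetically, so Female -> 0 and Male -> 1.
decoded = encoders["Gender"].inverse_transform(sample["Gender"])
print(sample)
print(list(decoded))
```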

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df.corr(),annot=True)
plt.show()

In [None]:
#Creating New Features (Derived from existing)
df['Data_consumption_for_app']=df['App Usage Time (min/day)']/df['Data Usage (MB/day)']
df['App_Usage_Prop']=df['App Usage Time (min/day)']/(df['Screen On Time (hours/day)']*60)   

#More features could be derived by combining columns; these two are just examples.
#The heatmap below shows how the two new features correlate with the target variable
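
Ratio features like these break if a denominator is zero, which would put `inf` into the correlation matrix and the models. A minimal sketch of one way to guard against that, on a hypothetical sample where one row has zero screen time (an assumption made for illustration; the real dataset may never contain zeros):

```python
import numpy as np
import pandas as pd

# Hypothetical values; the last row has zero screen time on purpose.
sample = pd.DataFrame({
    "App Usage Time (min/day)": [120, 300, 60],
    "Screen On Time (hours/day)": [4.0, 6.0, 0.0],
})

# Convert hours to minutes so both sides of the ratio share a unit.
screen_minutes = sample["Screen On Time (hours/day)"] * 60

# Replace zero denominators with NaN so the result is NaN rather than inf.
sample["App_Usage_Prop"] = (
    sample["App Usage Time (min/day)"] / screen_minutes.replace(0, np.nan)
)
print(sample)
```

Rows with `NaN` can then be dropped or imputed before modelling, depending on how many there are.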

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df.corr(),annot=True)
plt.show()

In [None]:
df.info()

In [None]:
#Splitting the data
from sklearn.model_selection import train_test_split
x=df.drop(columns='User Behavior Class')
y=df['User Behavior Class']
x_t,x_te,y_t,y_te=train_test_split(x,y,test_size=0.25,random_state=20)
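
For a multi-class target, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses synthetic data with five equally sized classes (the `*_demo` names and the data are hypothetical); whether stratification changes anything for this dataset is not verified here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, five classes with 20 rows each.
rng = np.random.default_rng(20)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series(np.repeat([1, 2, 3, 4, 5], 20), name="User Behavior Class")

# stratify=y_demo preserves the class proportions in train and test.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=20, stratify=y_demo
)

print(y_te_demo.value_counts().sort_index())
```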

****
## Models :) ##
****

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier , AdaBoostClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,classification_report,confusion_matrix

**K Nearest Neighbors Classifier**

In [None]:
knn=KNeighborsClassifier()
params={'n_neighbors':list(np.arange(2,32))}
nknn=RandomizedSearchCV(knn,random_state=20,scoring='accuracy',param_distributions=params,cv=10)
nknn.fit(x_t,y_t)
print(nknn.best_params_)
print(nknn.best_score_)
nknn=nknn.best_estimator_

In [None]:
pred_t=nknn.predict(x_t)
pred=nknn.predict(x_te)
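
KNN is distance-based, so columns with large numeric ranges (battery drain in mAh, data usage in MB) can dominate columns with small ranges. One common option is to standardize inside a pipeline so that scaling is refit on each cross-validation fold. This is a sketch on synthetic data from `make_classification`, not the notebook's result; the step names `scale` and `knn` and the `*_demo` variables are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is inflated to mimic a large-range feature.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=20
)
X_demo[:, 0] *= 1000

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
search = RandomizedSearchCV(
    pipe,
    param_distributions={"knn__n_neighbors": list(range(2, 32))},
    n_iter=5, cv=5, scoring="accuracy", random_state=20,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```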

**Decision Tree Classifier**

In [None]:
dt=DecisionTreeClassifier()
path=dt.cost_complexity_pruning_path(x_t,y_t)
alphas=path.ccp_alphas
params={'ccp_alpha':alphas}
ndt=RandomizedSearchCV(dt,random_state=20,scoring='accuracy',param_distributions=params,n_jobs=-1,n_iter=3,cv=10)
ndt.fit(x_t,y_t)
print(ndt.best_params_)
print(ndt.best_score_)
ba=ndt.best_params_['ccp_alpha']
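
To see what the candidate `ccp_alpha` values do, the sketch below refits a tree at the smallest, middle and largest alpha from the pruning path and prints the resulting node counts: larger alphas prune more aggressively. It runs on synthetic data from `make_classification` (an assumption for illustration), and the clipping step is a defensive choice of this sketch, not something the notebook does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=20
)

path = DecisionTreeClassifier(random_state=20).cost_complexity_pruning_path(
    X_demo, y_demo
)
# Clip at zero in case floating-point noise yields a tiny negative alpha.
alphas = np.clip(path.ccp_alphas, 0, None)

sizes = []
for alpha in (alphas[0], alphas[len(alphas) // 2], alphas[-1]):
    tree = DecisionTreeClassifier(random_state=20, ccp_alpha=alpha)
    tree.fit(X_demo, y_demo)
    sizes.append(tree.tree_.node_count)
    print(f"ccp_alpha={alpha:.5f} -> {tree.tree_.node_count} nodes")
```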

In [None]:
dt=DecisionTreeClassifier(ccp_alpha=ba)
params={'criterion':['gini','entropy'],'min_samples_split':list(np.arange(2,41)),'min_samples_leaf':list(np.arange(2,41)),
        'max_depth':list(np.arange(1,10))}
ndt=RandomizedSearchCV(dt,random_state=20,scoring='accuracy',param_distributions=params,n_jobs=-1,cv=10)
ndt.fit(x_t,y_t)
print(ndt.best_params_)
print(ndt.best_score_)
ndt=ndt.best_estimator_

In [None]:
pred1_t=ndt.predict(x_t)
pred1=ndt.predict(x_te)

**Random Forest Classifier**

In [None]:
rf=RandomForestClassifier()
params={'criterion':['gini','entropy'],'min_samples_split':list(np.arange(2,41)),'min_samples_leaf':list(np.arange(2,41)),
        'max_depth':list(np.arange(1,10)),'n_estimators':[1000]}
nrf=RandomizedSearchCV(rf,random_state=20,scoring='accuracy',param_distributions=params,n_jobs=-1,cv=10)
nrf.fit(x_t,y_t)
print(nrf.best_params_)
print(nrf.best_score_)
nrf=nrf.best_estimator_

In [None]:
pred2_t=nrf.predict(x_t)
pred2=nrf.predict(x_te)

**AdaBoost Classifier**

In [None]:
ada=AdaBoostClassifier(algorithm='SAMME')
params={'n_estimators':[1000],'learning_rate': np.arange(0.01, 2.01, 0.01)}
nada=RandomizedSearchCV(ada,random_state=20,param_distributions=params,cv=10,n_jobs=-1,scoring='accuracy')
nada.fit(x_t ,y_t)
print(nada.best_params_)
print(nada.best_score_)
nada=nada.best_estimator_

In [None]:
pred3_t=nada.predict(x_t)
pred3=nada.predict(x_te)

In [None]:
report_knn=classification_report(y_te,pred)
report_dt=classification_report(y_te,pred1)
report_rf=classification_report(y_te,pred2)
report_ada=classification_report(y_te,pred3)
print('KNN\n',report_knn)
print('\n\nDecision Tree Classifier\n',report_dt)
print('\n\nRandom Forest Classifier\n',report_rf)
print('\n\nAdaBoost Classifier\n',report_ada)

In [None]:
tdf=pd.DataFrame({'Classification Algorithms':['KNN','Decision Tree Classifier','Random Forest Classifier','AdaBoost Classifier'],
                  'Training Accuracy':[accuracy_score(y_t,pred_t),accuracy_score(y_t,pred1_t),accuracy_score(y_t,pred2_t),accuracy_score(y_t,pred3_t)],
                  'Training Precision':[precision_score(y_t,pred_t,average='macro'),precision_score(y_t,pred1_t,average='macro'),precision_score(y_t,pred2_t,average='macro'),precision_score(y_t,pred3_t,average='macro')],
                  'Training Recall':[recall_score(y_t,pred_t,average='macro'),recall_score(y_t,pred1_t,average='macro'),recall_score(y_t,pred2_t,average='macro'),recall_score(y_t,pred3_t,average='macro')],
                  'Training F1 Score':[f1_score(y_t,pred_t,average='macro'),f1_score(y_t,pred1_t,average='macro'),f1_score(y_t,pred2_t,average='macro'),f1_score(y_t,pred3_t,average='macro')]})
tdf

In [None]:
tedf=pd.DataFrame({'Classification Algorithms':['KNN','Decision Tree Classifier','Random Forest Classifier','AdaBoost Classifier'],
                  'Testing Accuracy':[accuracy_score(y_te,pred),accuracy_score(y_te,pred1),accuracy_score(y_te,pred2),accuracy_score(y_te,pred3)],
                  'Testing Precision':[precision_score(y_te,pred,average='macro'),precision_score(y_te,pred1,average='macro'),precision_score(y_te,pred2,average='macro'),precision_score(y_te,pred3,average='macro')],
                  'Testing Recall':[recall_score(y_te,pred,average='macro'),recall_score(y_te,pred1,average='macro'),recall_score(y_te,pred2,average='macro'),recall_score(y_te,pred3,average='macro')],
                  'Testing F1 Score':[f1_score(y_te,pred,average='macro'),f1_score(y_te,pred1,average='macro'),f1_score(y_te,pred2,average='macro'),f1_score(y_te,pred3,average='macro')]})
tedf

In [None]:
fig,ax=plt.subplots(2,2,figsize=(20,15))
#confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
sns.heatmap(confusion_matrix(y_t,pred_t),annot=True,ax=ax[0][0],fmt='d')
ax[0][0].set_title('KNN')
sns.heatmap(confusion_matrix(y_t,pred1_t),annot=True,ax=ax[0][1],fmt='d')
ax[0][1].set_title('Decision Tree Classifier')
sns.heatmap(confusion_matrix(y_t,pred2_t),annot=True,ax=ax[1][0],fmt='d')
ax[1][0].set_title('Random Forest Classifier')
sns.heatmap(confusion_matrix(y_t,pred3_t),annot=True,ax=ax[1][1],fmt='d')
ax[1][1].set_title('AdaBoost Classifier')
plt.suptitle('Confusion Matrices for Training Set',fontsize=20)
plt.show()

In [None]:
fig,ax=plt.subplots(2,2,figsize=(20,15))
#confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
sns.heatmap(confusion_matrix(y_te,pred),annot=True,ax=ax[0][0],fmt='d')
ax[0][0].set_title('KNN')
sns.heatmap(confusion_matrix(y_te,pred1),annot=True,ax=ax[0][1],fmt='d')
ax[0][1].set_title('Decision Tree Classifier')
sns.heatmap(confusion_matrix(y_te,pred2),annot=True,ax=ax[1][0],fmt='d')
ax[1][0].set_title('Random Forest Classifier')
sns.heatmap(confusion_matrix(y_te,pred3),annot=True,ax=ax[1][1],fmt='d')
ax[1][1].set_title('AdaBoost Classifier')
plt.suptitle('Confusion Matrices for Testing Set',fontsize=20)
plt.show()
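
When reading these heatmaps, keep in mind that scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, so rows are actual classes and columns are predicted classes. Swapping the arguments transposes the matrix, which is invisible when every prediction is correct but misleading otherwise. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: one sample of class 1 is misclassified as class 2.
y_true = [1, 1, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2]

# Rows = actual class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Swapping the arguments yields the transpose.
cm_swapped = confusion_matrix(y_pred, y_true)
print(cm_swapped)
```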

****
## Conclusion ##
****

**In this analysis, we evaluated four different classification algorithms: K-Nearest Neighbors (KNN), Decision Tree Classifier, Random Forest Classifier, and AdaBoost Classifier. Each model achieved perfect performance metrics on both the training and testing sets.**

These results indicate that all models effectively learned the underlying patterns of the data without overfitting, as evidenced by their equal performance on both training and testing sets. The consistently high precision, recall, and F1 scores across all classifiers suggest robust predictive capabilities.

Given this performance, any of these models could be reliably deployed in practice. Further analysis may explore potential variations in hyperparameters or the impact of additional features to enhance generalization in more complex datasets.
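
The train-versus-test comparison used above as the overfitting check can also be run across several folds rather than one split. The sketch below is a hedged illustration on synthetic data from `make_classification` (not the notebook's dataset, and its scores say nothing about the results above); it uses `cross_validate` with `return_train_score=True` and reports the gap between mean training and mean validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=20
)

# return_train_score=True reports accuracy on the training folds as well.
cv_result = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=20),
    X_demo, y_demo, cv=5, scoring="accuracy", return_train_score=True,
)

# A large positive gap would hint at overfitting; a gap near zero would not.
gap = cv_result["train_score"].mean() - cv_result["test_score"].mean()
print(f"train={cv_result['train_score'].mean():.3f} "
      f"test={cv_result['test_score'].mean():.3f} gap={gap:.3f}")
```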
