# Group Project #1
## Richelle Johnson, @03106854
### Data Set: Mobile Device Usage and User Behavior Data, from Vala Khorasani via Kaggle

In [4]:
import pandas as pd
import numpy as np

#### Step 2: Data Loading
##### Using the **read_csv()** function, the data set is loaded into the notebook and assigned to the variable "df" to keep the code legible. Printing "df" with print() displays the data set. It has 11 columns and 700 rows, so pandas truncates the printed output and not every row can be displayed at once. For that reason, the head() method was used to display the first 20 rows.
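
As a minimal sketch of the same loading steps, the example below reads a small inline CSV instead of the real file. The three-row excerpt, `csv_text`, and `sample_df` are hypothetical names and data used only for illustration. It also shows `shape`, which reports the row and column counts without printing the whole frame.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for user_behavior_dataset.csv
csv_text = """User ID,Device Model,Operating System,App Usage Time (min/day)
1,Google Pixel 5,Android,393
2,OnePlus 9,Android,268
3,Xiaomi Mi 11,Android,154
"""

# read_csv() accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)

# head(n) returns the first n rows as a new DataFrame
first_two = sample_df.head(2)
print(first_two)
```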

In [6]:
#Load data from user_behavior_dataset.csv
df = pd.read_csv("user_behavior_dataset.csv")

In [120]:
#Print the data set
print(df)

     User ID        Device Model Operating System  App Usage Time (min/day)  \
0          1      Google Pixel 5          Android                       393   
1          2           OnePlus 9          Android                       268   
2          3        Xiaomi Mi 11          Android                       154   
3          4      Google Pixel 5          Android                       239   
4          5           iPhone 12              iOS                       187   
..       ...                 ...              ...                       ...   
695      696           iPhone 12              iOS                        92   
696      697        Xiaomi Mi 11          Android                       316   
697      698      Google Pixel 5          Android                        99   
698      699  Samsung Galaxy S21          Android                        62   
699      700           OnePlus 9          Android                       212   

     Screen On Time (hours/day)  Battery Drain (mAh

In [118]:
#Display the first few rows using head()
print(df.head(20))

    User ID        Device Model Operating System  App Usage Time (min/day)  \
0         1      Google Pixel 5          Android                       393   
1         2           OnePlus 9          Android                       268   
2         3        Xiaomi Mi 11          Android                       154   
3         4      Google Pixel 5          Android                       239   
4         5           iPhone 12              iOS                       187   
5         6      Google Pixel 5          Android                        99   
6         7  Samsung Galaxy S21          Android                       350   
7         8           OnePlus 9          Android                       543   
8         9  Samsung Galaxy S21          Android                       340   
9        10           iPhone 12              iOS                       424   
10       11      Google Pixel 5          Android                        53   
11       12           OnePlus 9          Android                

#### Step 3: Basic NumPy Operations
##### The focus columns of the data set - device model, app usage time, screen on time, number of apps installed, age, and gender - are put into a list named selected_cols. The selected columns are converted into an array using the **array()** function, and the array is named numpy_array. The mean, median, and standard deviation of the four numeric columns were then calculated with the **mean(), median(), and std()** methods, applied to the DataFrame columns rather than to numpy_array (which also contains the text columns). These methods were used instead of describe() because only those three metrics are necessary. The app usage time column is recorded in minutes per day, but for simplicity's sake it was converted to hours per day by dividing (/) by 60, and the result was stored as app_usage_time_hours.

In [19]:
#Device model 
#App usage time (min/day)
#Screen on time (hrs/day)
#Number of apps installed
#Age
#Gender
selected_cols = ["Device Model", "App Usage Time (min/day)", "Screen On Time (hours/day)", "Number of Apps Installed", "Age", "Gender"]

In [122]:
#Putting the selected columns into an array using array() and printing the array
numpy_array = np.array(df[selected_cols])
print(numpy_array)

[['Google Pixel 5' 393 6.4 67 40 'Male']
 ['OnePlus 9' 268 4.7 42 47 'Female']
 ['Xiaomi Mi 11' 154 4.0 32 42 'Male']
 ...
 ['Google Pixel 5' 99 3.1 22 50 'Female']
 ['Samsung Galaxy S21' 62 1.7 13 44 'Male']
 ['OnePlus 9' 212 5.4 49 23 'Female']]
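
Because the array above mixes text and numbers, NumPy stores it with the generic `object` dtype, and numeric functions cannot be applied to it as a whole. The sketch below, which uses a hypothetical three-row frame named `toy`, shows one way to build a purely numeric array and compute column-wise statistics on it directly with NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Device Model": ["Google Pixel 5", "OnePlus 9", "Xiaomi Mi 11"],
    "App Usage Time (min/day)": [393, 268, 154],
    "Screen On Time (hours/day)": [6.4, 4.7, 4.0],
})

# Text and numbers together fall back to dtype object
mixed_array = np.array(toy)

# Selecting only the numeric columns gives a float64 array
numeric_cols = ["App Usage Time (min/day)", "Screen On Time (hours/day)"]
numeric_array = toy[numeric_cols].to_numpy()

print(mixed_array.dtype)
print(numeric_array.dtype)

# axis=0 computes one statistic per column
col_means = numeric_array.mean(axis=0)
col_medians = np.median(numeric_array, axis=0)
print(col_means)
print(col_medians)
```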


In [161]:
# Mean, median, and standard deviation of the numeric columns using mean(), median(), and std().
cols_analyze = ["App Usage Time (min/day)", "Screen On Time (hours/day)", "Number of Apps Installed", "Age"]
# Calculating the mean, median, and standard deviation for each column
means = df[cols_analyze].mean()
medians = df[cols_analyze].median()
standard_deviations = df[cols_analyze].std()
# Print the means, medians, and standard deviations of cols_analyze
print(means)
print(medians)
print(standard_deviations)

App Usage Time (min/day)      271.128571
Screen On Time (hours/day)      5.272714
Number of Apps Installed       50.681429
Age                            38.482857
dtype: float64
App Usage Time (min/day)      227.5
Screen On Time (hours/day)      4.9
Number of Apps Installed       49.0
Age                            38.0
dtype: float64
App Usage Time (min/day)      177.199484
Screen On Time (hours/day)      3.068584
Number of Apps Installed       26.943324
Age                            12.012916
dtype: float64
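
One detail worth noting about the standard deviations above: the pandas `std()` method computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std()` computes the population standard deviation (`ddof=0`). The sketch below uses the first five app usage values from the printed data set to show that the two only agree when `ddof` is set explicitly. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First five app usage times (min/day) shown in the printed data set
values = pd.Series([393, 268, 154, 239, 187])

# pandas default: sample standard deviation (ddof=1)
sample_std = values.std()

# NumPy default: population standard deviation (ddof=0)
population_std = np.std(values.to_numpy())

# Passing ddof=1 makes NumPy match pandas
matched_std = np.std(values.to_numpy(), ddof=1)

print(sample_std, population_std, matched_std)
```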


In [91]:
# Convert app usage time from minutes to hours per day
app_usage_time_hours = df["App Usage Time (min/day)"] / 60
# Print the result
print(app_usage_time_hours)

0      6.550000
1      4.466667
2      2.566667
3      3.983333
4      3.116667
         ...   
695    1.533333
696    5.266667
697    1.650000
698    1.033333
699    3.533333
Name: App Usage Time (min/day), Length: 700, dtype: float64
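
The conversion above produces a separate Series and leaves the data set itself unchanged. If the converted values should travel with the data, one option is to attach them as a new column. The sketch below assumes a hypothetical column name, "App Usage Time (hours/day)", and a three-row frame named `toy`.

```python
import pandas as pd

# Hypothetical three-row frame with the minutes-per-day column
toy = pd.DataFrame({"App Usage Time (min/day)": [393, 268, 154]})

# assign() returns a new DataFrame and leaves toy untouched;
# the new column name is an illustrative choice, not part of the data set
with_hours = toy.assign(
    **{"App Usage Time (hours/day)": toy["App Usage Time (min/day)"] / 60}
)

# Round to two decimals for display
rounded = with_hours["App Usage Time (hours/day)"].round(2)
print(rounded)
```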


#### Step 4: Data Cleaning with Pandas
##### Using **isnull()**, the data is checked for missing values. There are no missing values, but if there were, **dropna()** could be used to remove the rows that contain them. The data set is checked for duplicate rows using **duplicated()**. There are no duplicate rows, but if there were, **drop_duplicates()** could be used to remove them.
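
Since the real data set has nothing to clean, the sketch below uses a hypothetical four-row frame named `toy`, containing one missing value and one duplicate row, to show what the cleaning calls would do.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the last row has a missing Age, rows 1 and 2 are identical
toy = pd.DataFrame({
    "User ID": [1, 2, 2, 3],
    "Age": [40.0, 47.0, 47.0, np.nan],
})

# Count missing values per column and fully duplicated rows
missing_per_column = toy.isnull().sum()
duplicate_count = toy.duplicated().sum()
print(missing_per_column)
print(duplicate_count)

# dropna() removes rows with any missing value (the default, axis=0);
# drop_duplicates() keeps the first copy of each repeated row
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(cleaned)
```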

In [183]:
# Check for missing values using isnull()
#Using sum() function to show the total number of null values
missing_values = df.isnull().sum()
print(missing_values)
#There are no missing values!

User ID                       0
Device Model                  0
Operating System              0
App Usage Time (min/day)      0
Screen On Time (hours/day)    0
Battery Drain (mAh/day)       0
Number of Apps Installed      0
Data Usage (MB/day)           0
Age                           0
Gender                        0
User Behavior Class           0
dtype: int64


In [187]:
# Check for duplicate rows using duplicated()
#Using sum() function to show the total number of duplicate values
duplicates = df.duplicated().sum()
print(duplicates)
#There are no duplicate values!

0


#### Step 5: Data Filtering and Selection
##### Using comparison operators such as (>) or (<), the data is filtered to fit certain conditions. The column "App Usage Time (min/day)" is filtered to show users with an app usage time of over 300 minutes per day. 275 of the 700 sample users have an app usage time greater than 300 minutes per day. The first 4 rows of the device model and app usage time columns are selected using **loc[]**, which selects by label and includes both ends of the slice. The **iloc[]** indexer is used to select data by integer position, and it excludes the end of the slice; here it selects the first 5 rows of the first 3 columns.
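
The filtering step can be sketched on a hypothetical five-row frame named `toy`. The example also shows how to count the matching rows without printing them, and how two conditions can be combined with `&` (each condition needs its own parentheses).

```python
import pandas as pd

# Hypothetical five-row frame
toy = pd.DataFrame({
    "User ID": [1, 2, 3, 4, 5],
    "App Usage Time (min/day)": [393, 268, 154, 239, 187],
})
col = "App Usage Time (min/day)"

# A comparison produces a boolean mask, one True/False per row
heavy_mask = toy[col] > 300
heavy_users = toy[heavy_mask]

# True counts as 1, so summing the mask counts the matching rows
heavy_count = int(heavy_mask.sum())
print(heavy_count)

# Combining two conditions: between 150 and 250 minutes per day
mid_users = toy[(toy[col] > 150) & (toy[col] < 250)]
print(mid_users)
```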

In [193]:
#Filtering the data set to check for users with more than 300 minutes of app usage time per day.
filtered_df = df[df['App Usage Time (min/day)'] > 300]
print(filtered_df)

     User ID        Device Model Operating System  App Usage Time (min/day)  \
0          1      Google Pixel 5          Android                       393   
6          7  Samsung Galaxy S21          Android                       350   
7          8           OnePlus 9          Android                       543   
8          9  Samsung Galaxy S21          Android                       340   
9         10           iPhone 12              iOS                       424   
..       ...                 ...              ...                       ...   
689      690  Samsung Galaxy S21          Android                       541   
692      693        Xiaomi Mi 11          Android                       378   
693      694        Xiaomi Mi 11          Android                       505   
694      695  Samsung Galaxy S21          Android                       564   
696      697        Xiaomi Mi 11          Android                       316   

     Screen On Time (hours/day)  Battery Drain (mAh

In [198]:
#Using loc[] to select certain rows and columns
selected_data = df.loc[0:3, ['Device Model', 'App Usage Time (min/day)']]
print(selected_data)

     Device Model  App Usage Time (min/day)
0  Google Pixel 5                       393
1       OnePlus 9                       268
2    Xiaomi Mi 11                       154
3  Google Pixel 5                       239


In [201]:
#Using iloc[] to select columns 0-2 and rows 0-4
selected_data = df.iloc[0:5, 0:3]
print(selected_data)

   User ID    Device Model Operating System
0        1  Google Pixel 5          Android
1        2       OnePlus 9          Android
2        3    Xiaomi Mi 11          Android
3        4  Google Pixel 5          Android
4        5       iPhone 12              iOS
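
The difference between the two outputs above comes from how each indexer treats the end of a slice. A minimal sketch on a hypothetical five-row frame named `toy`: the same slice `0:3` returns four rows with `loc[]` and three rows with `iloc[]`.

```python
import pandas as pd

# Hypothetical five-row frame with the default integer index 0..4
toy = pd.DataFrame({
    "User ID": [1, 2, 3, 4, 5],
    "Device Model": ["Google Pixel 5", "OnePlus 9", "Xiaomi Mi 11",
                     "Google Pixel 5", "iPhone 12"],
    "Operating System": ["Android", "Android", "Android", "Android", "iOS"],
})

# loc[] is label-based and includes the end label: rows 0, 1, 2, 3
by_label = toy.loc[0:3, ["Device Model"]]

# iloc[] is position-based and excludes the end position: rows 0, 1, 2
by_position = toy.iloc[0:3, 0:2]

print(by_label.shape)
print(by_position.shape)
```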


#### Step 6: Sorting and Ranking
##### Using **sort_values()**, the data set is sorted in ascending order by app usage time. Ascending order requires "ascending=True" (capitalized, because Python is case-sensitive); "ascending=False" sorts the data in descending order. The **rank()** method assigns each value a rank based on its position in the sorted order. With the default settings used here, rank 1 goes to the smallest value, and tied values share the average of their ranks, which is why fractional ranks such as 289.5 appear. The output therefore shows ranks, not usage times: for example, the first user, with 393 minutes per day, receives a rank of 486 out of 700.
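
A minimal sketch of the two sort directions, using a hypothetical four-row frame named `toy` with no tied values:

```python
import pandas as pd

# Hypothetical four-row frame
toy = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "App Usage Time (min/day)": [393, 268, 154, 239],
})

# ascending=True (the default) puts the smallest value first
ascending_df = toy.sort_values(by="App Usage Time (min/day)", ascending=True)

# ascending=False puts the largest value first
descending_df = toy.sort_values(by="App Usage Time (min/day)", ascending=False)

print(ascending_df)
print(descending_df)
```

Note that sorting keeps the original index labels attached to each row, which is why the index in the sorted output is no longer in numerical order.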

In [206]:
#Using sort_values() to sort "App Usage Time (min/day)" in ascending order.
sorted_df = df.sort_values(by=['App Usage Time (min/day)'], ascending=[True])
print(sorted_df)

     User ID        Device Model Operating System  App Usage Time (min/day)  \
355      356  Samsung Galaxy S21          Android                        30   
337      338  Samsung Galaxy S21          Android                        30   
244      245           OnePlus 9          Android                        30   
73        74        Xiaomi Mi 11          Android                        31   
163      164           iPhone 12              iOS                        32   
..       ...                 ...              ...                       ...   
654      655      Google Pixel 5          Android                       594   
166      167      Google Pixel 5          Android                       595   
184      185        Xiaomi Mi 11          Android                       597   
341      342           iPhone 12              iOS                       597   
367      368           OnePlus 9          Android                       598   

     Screen On Time (hours/day)  Battery Drain (mAh

In [233]:
#Using rank() to rank the data; by default, rank 1 goes to the smallest value.
#Ranking app usage time
ranked_col = df["App Usage Time (min/day)"].rank()
print(ranked_col)

0      486.0
1      394.0
2      249.0
3      367.0
4      289.5
       ...  
695    139.0
696    437.0
697    155.0
698     64.5
699    325.0
Name: App Usage Time (min/day), Length: 700, dtype: float64
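
The behavior of `rank()` can be sketched on a hypothetical four-value Series named `usage` that contains one tie. It shows the default ascending ranks, the `ascending=False` option that gives rank 1 to the largest value, and the `method="min"` option that gives tied values the lowest rank in their group instead of the average.

```python
import pandas as pd

# Hypothetical values with one tie (268 appears twice)
usage = pd.Series([393, 268, 154, 268], name="App Usage Time (min/day)")

# Default: smallest value gets rank 1, ties share the average rank
default_rank = usage.rank()

# ascending=False: largest value gets rank 1
top_rank = usage.rank(ascending=False)

# method="min": tied values all get the lowest rank of their group
min_rank = usage.rank(method="min", ascending=False)

print(default_rank.tolist())
print(top_rank.tolist())
print(min_rank.tolist())
```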


#### Step 7: Grouping and Aggregation
##### Using the **groupby()** function, the data set is grouped by operating system - iOS or Android. The **count()** function is used to determine the number of users on each operating system. The count shows that there are more Android users (554) than iOS users (146) in the sample. The **min()** and **max()** functions are used to determine the minimum battery drain and maximum app usage time within each operating system group. The minimum battery drain is 302 milliamp hours per day for Android users and 308 milliamp hours per day for iOS users. The maximum app usage time per day is 598 minutes for Android users and 597 minutes for iOS users. The **std()** function is used to find the standard deviation of app usage time per day for each operating system. The standard deviation is approximately 179 minutes per day for Android users and 170 minutes per day for iOS users.
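
As an alternative sketch, the four group statistics can be computed in a single call with `agg()` and named aggregations. The five-row frame `toy`, its battery drain values, and the output column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical five-row frame; battery drain values are made up
toy = pd.DataFrame({
    "Operating System": ["Android", "Android", "iOS", "Android", "iOS"],
    "App Usage Time (min/day)": [393, 268, 187, 154, 424],
    "Battery Drain (mAh/day)": [1872, 1331, 1367, 761, 1957],
})

# Each keyword is an output column: (source column, aggregation function)
summary = toy.groupby("Operating System").agg(
    user_count=("App Usage Time (min/day)", "count"),
    min_battery_drain=("Battery Drain (mAh/day)", "min"),
    max_app_usage=("App Usage Time (min/day)", "max"),
    std_app_usage=("App Usage Time (min/day)", "std"),
)
print(summary)
```

The result is one table with a row per operating system, which can be easier to compare than four separate printouts.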

In [244]:
#Using groupby() to select a column to work with
#Using count() to count the number of users using the operating system
user_count = df.groupby('Operating System')['User ID'].count()
print("User Count by Operating System:", user_count)

User Count by Operating System: Operating System
Android    554
iOS        146
Name: User ID, dtype: int64


In [248]:
#Using min() to find the minimum battery drain in the data set, grouped by operating system
min_battery_drain = df.groupby('Operating System')['Battery Drain (mAh/day)'].min()
print("Minimum Battery Drain:")
print(min_battery_drain)

Minimum Battery Drain:
Operating System
Android    302
iOS        308
Name: Battery Drain (mAh/day), dtype: int64


In [241]:
#Using max() to find the maximum app usage time in the data set, grouped by operating system
max_app_usage_time = df.groupby('Operating System')['App Usage Time (min/day)'].max()
print("Maximum App Usage Time:")
print(max_app_usage_time)

Maximum App Usage Time:
Operating System
Android    598
iOS        597
Name: App Usage Time (min/day), dtype: int64


In [252]:
#Using std() to find the standard deviation of app usage time in the data set, grouped by operating system
std_app_usage_time = df.groupby('Operating System')['App Usage Time (min/day)'].std()
print("Standard Deviation of App Usage Time:")
print(std_app_usage_time)

Standard Deviation of App Usage Time:
Operating System
Android    179.188678
iOS        169.592390
Name: App Usage Time (min/day), dtype: float64


#### Step 8: Data Export
##### Using **to_csv()**, the data set, now checked for missing values and duplicates, is saved to a CSV file. This is important so that the data set is available for future reference or modification. "index=False" was included so that the row index is not written to the file, and the file contains only the data columns.
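
The effect of the index argument can be sketched without writing a file: when `to_csv()` is called with no path, it returns the CSV text as a string. The two-row frame `toy` below is hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({
    "User ID": [1, 2],
    "Device Model": ["Google Pixel 5", "OnePlus 9"],
})

# With no path argument, to_csv() returns the CSV contents as a string
csv_without_index = toy.to_csv(index=False)
csv_with_index = toy.to_csv()

# Compare the header lines: the default adds an unnamed leading index column
header_without = csv_without_index.splitlines()[0]
header_with = csv_with_index.splitlines()[0]
print(header_without)
print(header_with)

# Reading the index-free text back gives the same columns and values
round_trip = pd.read_csv(io.StringIO(csv_without_index))
print(round_trip)
```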

In [260]:
#Saving the data set to a csv
df.to_csv('mobile_device_usage.csv', index=False)
#The csv is saved to Files in Jupyter.