# Assignment 1: Exploratory Data Analysis
## Group 22
- Natasa Bolic (300241734)
- Brent Palmer (300193610)
## Imports

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset 1: The Influence of Demographics on Digital Consumption

### Dataset Description
**Url:** https://www.kaggle.com/datasets/valakhorasani/mobile-device-usage-and-user-behavior-dataset/data <br>
**Name:** Mobile Device Usage and User Behaviour <br>
**Author:** Seyedvala Khorasani<br>
**Purpose:** The dataset gives researchers, data scientists, and analysts an opportunity to develop predictive models of mobile user behaviour. The author specifies that the dataset is intended as an educational resource for machine learning algorithms and emphasizes that it must not be treated as a reliable source.<br>
**Shape:** There are 700 rows and 11 columns, so `data.shape` returns `(700, 11)`. <br>
**Features:** 
- `User ID` (categorical): Uniquely identifies each user.
- `Device Model` (categorical): The model name of the user's phone.
- `Operating System` (categorical): The operating system of the user's phone (Android or iOS).
- `App Usage Time (min/day)` (numerical): The user's daily time spent on apps (mins).
- `Screen On Time (hours/day)` (numerical): The user's daily time with an active screen (hours).
- `Battery Drain (mAh/day)` (numerical): The user's daily battery consumption (mAh).
- `Number of Apps Installed` (numerical): The number of apps the user has installed.
- `Data Usage (MB/day)` (numerical): The user's average daily data usage (MB).
- `Age` (numerical): The age of the user.
- `Gender` (categorical): The gender of the user.
- `User Behavior Class` (categorical): Class of user behaviour from 1 (light usage) to 5 (extreme usage).
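
The types listed above describe each feature's role, while `data.info()` reports the text columns as `object` and `User Behavior Class` as `int64`. As a minimal sketch of making those roles explicit in pandas, the snippet below casts a small hand-made frame (the rows in `sample` are illustrative, not taken from the dataset) to categorical dtypes. Treating `User Behavior Class` as an ordered category is an assumption based on its description as a scale from 1 (light usage) to 5 (extreme usage).

```python
import pandas as pd

# Illustrative rows only; the notebook itself would apply this to `data`.
sample = pd.DataFrame({
    "User ID": [1, 2, 3],
    "Operating System": ["Android", "iOS", "Android"],
    "Gender": ["Male", "Female", "Male"],
    "User Behavior Class": [4, 3, 2],
})

# Nominal features: categories with no inherent order.
for col in ["Operating System", "Gender"]:
    sample[col] = sample[col].astype("category")

# Ordinal feature (assumption): classes 1 to 5 are ordered from light to extreme usage.
behavior_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
sample["User Behavior Class"] = sample["User Behavior Class"].astype(behavior_dtype)

print(sample.dtypes)
```

With an ordered categorical, comparisons and `min()`/`max()` respect the class order, while the nominal columns only support equality checks.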

**Redundancy:** There is no redundancy in the dataset. Each row has a unique user ID in the first column, so duplicates must be checked on the remaining columns by passing them to the `.duplicated()` method through its `subset` parameter: `data.duplicated(subset=data.columns[1:]).any()`. This expression returns `True` if any row repeats an earlier row on those columns. Since it returns `False`, there is no redundancy. <br>
**Missing Values:** There are no missing values in the dataset. The expression `data.isnull().sum()` returns the number of missing values in each column. Since the count is 0 for every column, there are no missing values.
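
To make the behaviour of the two checks concrete, here is a small sketch on a hypothetical frame (`toy` and its values are made up for illustration). Two of its rows differ only in their ID, and one value is missing, so both checks report a problem, unlike on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 match on everything except the ID,
# and one Age value is missing.
toy = pd.DataFrame({
    "User ID": [1, 2, 3],
    "Device Model": ["iPhone 12", "OnePlus 9", "iPhone 12"],
    "Age": [31, np.nan, 31],
})

# Ignore the first column (the unique ID) when comparing rows.
has_duplicates = toy.duplicated(subset=toy.columns[1:]).any()

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()

print(has_duplicates)
print(missing_per_column)
```

Without the `subset` argument, the unique `User ID` would make every row distinct and the duplicate check would never find a match.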

## Dataset Overview (in Code)

### Extract the Dataset (Mobile Device Usage and User Behaviour)

In [15]:
url = "https://raw.githubusercontent.com/BrentMRPalmer/mobile-data/refs/heads/main/user_behavior_dataset.csv"
data = pd.read_csv(url)

In [14]:
data.head()

| | User ID | Device Model | Operating System | App Usage Time (min/day) | Screen On Time (hours/day) | Battery Drain (mAh/day) | Number of Apps Installed | Data Usage (MB/day) | Age | Gender | User Behavior Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Google Pixel 5 | Android | 393 | 6.4 | 1872 | 67 | 1122 | 40 | Male | 4 |
| 1 | 2 | OnePlus 9 | Android | 268 | 4.7 | 1331 | 42 | 944 | 47 | Female | 3 |
| 2 | 3 | Xiaomi Mi 11 | Android | 154 | 4.0 | 761 | 32 | 322 | 42 | Male | 2 |
| 3 | 4 | Google Pixel 5 | Android | 239 | 4.8 | 1676 | 56 | 871 | 20 | Male | 3 |
| 4 | 5 | iPhone 12 | iOS | 187 | 4.3 | 1367 | 58 | 988 | 31 | Female | 3 |


In [12]:
data.shape

(700, 11)

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User ID                     700 non-null    int64  
 1   Device Model                700 non-null    object 
 2   Operating System            700 non-null    object 
 3   App Usage Time (min/day)    700 non-null    int64  
 4   Screen On Time (hours/day)  700 non-null    float64
 5   Battery Drain (mAh/day)     700 non-null    int64  
 6   Number of Apps Installed    700 non-null    int64  
 7   Data Usage (MB/day)         700 non-null    int64  
 8   Age                         700 non-null    int64  
 9   Gender                      700 non-null    object 
 10  User Behavior Class         700 non-null    int64  
dtypes: float64(1), int64(7), object(3)
memory usage: 60.3+ KB


In [10]:
data.describe()

| | User ID | App Usage Time (min/day) | Screen On Time (hours/day) | Battery Drain (mAh/day) | Number of Apps Installed | Data Usage (MB/day) | Age | User Behavior Class |
|---|---|---|---|---|---|---|---|---|
| count | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 |
| mean | 350.5 | 271.128571 | 5.272714 | 1525.158571 | 50.681429 | 929.742857 | 38.482857 | 2.99 |
| std | 202.21688 | 177.199484 | 3.068584 | 819.136414 | 26.943324 | 640.451729 | 12.012916 | 1.401476 |
| min | 1.0 | 30.0 | 1.0 | 302.0 | 10.0 | 102.0 | 18.0 | 1.0 |
| 25% | 175.75 | 113.25 | 2.5 | 722.25 | 26.0 | 373.0 | 28.0 | 2.0 |
| 50% | 350.5 | 227.5 | 4.9 | 1502.5 | 49.0 | 823.5 | 38.0 | 3.0 |
| 75% | 525.25 | 434.25 | 7.4 | 2229.5 | 74.0 | 1341.0 | 49.0 | 4.0 |
| max | 700.0 | 598.0 | 12.0 | 2993.0 | 99.0 | 2497.0 | 59.0 | 5.0 |


In [17]:
data.nunique()

User ID                       700
Device Model                    5
Operating System                2
App Usage Time (min/day)      387
Screen On Time (hours/day)    108
Battery Drain (mAh/day)       628
Number of Apps Installed       86
Data Usage (MB/day)           585
Age                            42
Gender                          2
User Behavior Class             5
dtype: int64

### Checking for Missing Values
https://www.atlassian.com/data/notebook/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe

In [20]:
data.isnull().sum()

User ID                       0
Device Model                  0
Operating System              0
App Usage Time (min/day)      0
Screen On Time (hours/day)    0
Battery Drain (mAh/day)       0
Number of Apps Installed      0
Data Usage (MB/day)           0
Age                           0
Gender                        0
User Behavior Class           0
dtype: int64

### Checking for Duplicates
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html

In [24]:
data.duplicated(subset=data.columns[1:]).any()

False

## Insights

### Insight 1:

### Insight 2:

### Insight 3:

### Insight 4:

### Insight 5:

### Insight 6: r6 - use the scatterplot to highlight correlation
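
One possible way to act on this note is sketched below: a scatterplot of two numerical columns with the Pearson correlation coefficient in the title. The note does not say which columns to compare, so the pairing of `App Usage Time (min/day)` with `Battery Drain (mAh/day)` is an assumption, and the `demo` frame holds randomly generated stand-in values rather than the dataset, so the plot says nothing about the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT values from the dataset.
rng = np.random.default_rng(0)
usage = rng.uniform(30, 600, size=50)
drain = 3 * usage + rng.normal(0, 100, size=50)
demo = pd.DataFrame({
    "App Usage Time (min/day)": usage,
    "Battery Drain (mAh/day)": drain,
})

# Column pairing is an assumption; any two numerical columns would work.
x_col, y_col = "App Usage Time (min/day)", "Battery Drain (mAh/day)"

# Pearson correlation coefficient between the two columns.
r = demo[x_col].corr(demo[y_col])

# Scatterplot with the coefficient shown in the title.
fig, ax = plt.subplots()
ax.scatter(demo[x_col], demo[y_col], alpha=0.7)
ax.set_xlabel(x_col)
ax.set_ylabel(y_col)
ax.set_title(f"Pearson r = {r:.2f}")
```

In the notebook, replacing `demo` with `data` would draw the same plot for the loaded dataset.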

### Insight 7:

### Insight 8:

### Insight 9:

### Insight 10: