# Part 1 ‐ Exploratory data analysis
The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15-minute time intervals, and visualize and describe the resulting **time series** of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them.

In [3]:
# Essentials
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import chart_studio.plotly as py
import plotly.express as px

## 1.1 Data Cleaning

In [67]:
# Get logins file
logins_df = pd.read_json('logins.json')
logins_df.head(3)

Unnamed: 0,login_time
0,1970-01-01 20:13:18
1,1970-01-01 20:16:10
2,1970-01-01 20:16:37


In [68]:
logins_df.describe()

Unnamed: 0,login_time
count,93142
unique,92265
top,1970-02-12 11:16:53
freq,3
first,1970-01-01 20:12:16
last,1970-04-13 18:57:38


The most frequent timestamp, 1970-02-12 11:16:53, appears 3 times, and `describe()` reports 92,265 unique values among 93,142 rows, so 877 rows repeat an earlier timestamp. The file does not say whether the records come from one user or from many. I assume they come from multiple users, so identical login times are treated as valid and kept.
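
The duplicate check described above can be sketched as follows. This is a minimal illustration on a toy frame (`toy_df` and its values are hypothetical stand-ins, not read from logins.json); the same calls should apply to `logins_df['login_time']`.

```python
import pandas as pd

# Toy stand-in for logins_df (hypothetical values, not from logins.json)
toy_df = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-02-12 11:16:53",
        "1970-02-12 11:16:53",
        "1970-02-12 11:16:53",
        "1970-01-01 20:13:18",
        "1970-01-01 20:16:10",
    ])
})

# How often each timestamp occurs; keep only the repeated ones
occurrences = toy_df["login_time"].value_counts()
repeated = occurrences[occurrences > 1]

# Rows that repeat an earlier identical timestamp (count minus unique)
n_extra_rows = int(toy_df["login_time"].duplicated().sum())

print(repeated)
print("rows repeating an earlier timestamp:", n_extra_rows)
```

Counting repeats rather than dropping them keeps the assumption explicit: if the records later turn out to come from a single user, the same mask can be passed to `drop_duplicates` instead.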

In [69]:
logins_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93142 entries, 0 to 93141
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   login_time  93142 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 727.8 KB


In [70]:
logins_df.isna().sum()

login_time    0
dtype: int64

The column already has the correct dtype (datetime64[ns]) and contains no missing values.

In [71]:
logins_df

Unnamed: 0,login_time
0,1970-01-01 20:13:18
1,1970-01-01 20:16:10
2,1970-01-01 20:16:37
3,1970-01-01 20:16:36
4,1970-01-01 20:26:21
...,...
93137,1970-04-13 18:50:19
93138,1970-04-13 18:43:56
93139,1970-04-13 18:54:02
93140,1970-04-13 18:57:38


In [72]:
# Sorting by datetime
logins_df = logins_df.sort_values(by='login_time')
logins_df.reset_index(drop=True, inplace=True)
logins_df

Unnamed: 0,login_time
0,1970-01-01 20:12:16
1,1970-01-01 20:13:18
2,1970-01-01 20:16:10
3,1970-01-01 20:16:36
4,1970-01-01 20:16:37
...,...
93137,1970-04-13 18:48:52
93138,1970-04-13 18:50:19
93139,1970-04-13 18:54:02
93140,1970-04-13 18:54:23


The raw records are not in chronological order (compare indices 93137 to 93139 in the unsorted output above), so they are sorted by login_time and the index is reset.
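
Rather than spotting out-of-order rows by eye, the ordering can be tested directly. The sketch below uses a toy frame built from the three timestamps shown at indices 93137 to 93139 (`toy_df` is a hypothetical name); `is_monotonic_increasing` is the pandas property doing the work.

```python
import pandas as pd

# The three out-of-order timestamps from the unsorted output
toy_df = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-04-13 18:50:19",
        "1970-04-13 18:43:56",
        "1970-04-13 18:54:02",
    ])
})

# True only if every timestamp is >= the one before it
was_sorted = toy_df["login_time"].is_monotonic_increasing

# Same sort + index reset as in the notebook cell
sorted_df = toy_df.sort_values(by="login_time").reset_index(drop=True)
is_sorted = sorted_df["login_time"].is_monotonic_increasing

print(was_sorted, is_sorted)
```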

## 1.2 Aggregate these login counts based on 15-minute time intervals (Time Resampling)

In [73]:
logins = logins_df.set_index('login_time')
logins['count'] = 1 # Helper column: each row is one login, so summing it counts logins per bin
logins.head()

login_time,count
1970-01-01 20:12:16,1
1970-01-01 20:13:18,1
1970-01-01 20:16:10,1
1970-01-01 20:16:36,1
1970-01-01 20:16:37,1


In [74]:
logins = logins.resample('15min').sum() # sum per 15-minute bin ('15T' is a deprecated alias in recent pandas)

In [75]:
logins.head()

login_time,count
1970-01-01 20:00:00,2
1970-01-01 20:15:00,6
1970-01-01 20:30:00,9
1970-01-01 20:45:00,7
1970-01-01 21:00:00,1


In [77]:
logins.shape

(9788, 1)
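
As a sanity check on the shape above, the expected number of 15-minute bins can be derived from the first and last timestamps reported by `describe()`. The sketch below also shows the resampling on a small toy series (`toy_times` is hypothetical) to illustrate that empty bins are kept with a count of 0 rather than dropped.

```python
import pandas as pd

# Hypothetical timestamps: two in 20:00, two in 20:15, one in 21:00
toy_times = pd.DatetimeIndex([
    "1970-01-01 20:12:16",
    "1970-01-01 20:13:18",
    "1970-01-01 20:16:10",
    "1970-01-01 20:16:36",
    "1970-01-01 21:01:00",
])

# One row per login, summed per 15-minute bin; empty bins sum to 0
toy_counts = pd.Series(1, index=toy_times).resample("15min").sum()
print(toy_counts)

# First and last login times as reported by describe() in section 1.1
first = pd.Timestamp("1970-01-01 20:12:16")
last = pd.Timestamp("1970-04-13 18:57:38")

# Bins run from the floored first timestamp to the floored last one
expected_bins = len(
    pd.date_range(first.floor("15min"), last.floor("15min"), freq="15min")
)
print(expected_bins)  # 9788, matching the (9788, 1) shape above
```

Because the bin count matches, the resampled frame has no gaps in its index: intervals without logins appear as zeros, which matters when averaging per interval later on.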

## 1.3 Visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand

In [78]:
fig = px.histogram(logins, x="count", nbins=15)

fig.update_layout(
    title_text ='Distribution of Login Counts',
    xaxis_title_text = '# of logins in 15-min interval',
    yaxis_title_text = 'Count',
    bargap=0.2, # gap between bars of adjacent location coordinates
)
fig.show()

The majority of 15-minute intervals have between 0 and 15 logins.
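
The reading from the histogram can be backed by a number. The following is a minimal sketch on made-up counts (`toy_counts` is hypothetical and not taken from the data); on the real frame the same expressions would be applied to `logins['count']`.

```python
import pandas as pd

# Hypothetical per-interval login counts (not taken from logins.json)
toy_counts = pd.Series([0, 2, 6, 9, 7, 1, 15, 22, 40, 3], name="count")

# Fraction of 15-minute intervals with at most 15 logins
share_le_15 = (toy_counts <= 15).mean()

# Median and upper quantiles give a compact view of the tail
quantiles = toy_counts.quantile([0.5, 0.9, 0.99])

print(f"{share_le_15:.0%} of intervals have <= 15 logins")
print(quantiles)
```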

### 1.3.1 Monthly/Weekly/Daily/Hours Logins

In [None]:

# Work on a copy of the sorted logins and derive calendar features with the .dt accessor
df = logins_df.copy()
df['date'] = df['login_time'].dt.date
df['hour'] = df['login_time'].dt.hour
df['weekday'] = df['login_time'].dt.weekday  # Monday=0 ... Sunday=6
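
Once the hour and weekday features exist, login totals per group follow from a `groupby`. This is a sketch under the assumption that totals per hour and per weekday are the quantities of interest; `toy` and its timestamps are hypothetical, and `day_name()` is used here only to make the output readable.

```python
import pandas as pd

# Hypothetical timestamps standing in for the login records
toy = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-01-01 20:12:16",  # Thursday
        "1970-01-01 20:13:18",  # Thursday
        "1970-01-01 21:05:00",  # Thursday
        "1970-01-02 20:40:00",  # Friday
        "1970-01-03 02:10:00",  # Saturday
    ])
})

# Calendar features via the vectorized .dt accessor
toy["hour"] = toy["login_time"].dt.hour
toy["weekday"] = toy["login_time"].dt.day_name()

# Number of logins per hour of day and per weekday
logins_by_hour = toy.groupby("hour").size()
logins_by_weekday = toy.groupby("weekday").size()

print(logins_by_hour)
print(logins_by_weekday)
```

On the real data the period does not cover a whole number of weeks, so weekday totals are best divided by the number of times each weekday occurs before days are compared.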

### 1.3.2 Most Logins by day