Part 1 - Exploratory data analysis
The attached logins.json file contains (simulated) timestamps of user logins in a particular
geographic location. Aggregate these login counts based on 15-minute time intervals, and
visualize and describe the resulting time series of login counts in ways that best characterize the
underlying patterns of the demand. Please report/illustrate important features of the demand,
such as daily cycles. If there are data quality issues, please report them.


In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt

In [2]:
logins = pd.read_json('logins.json')
df = pd.DataFrame(logins, columns = ['login_time'], index = None)

In [3]:
df.head()

Unnamed: 0,login_time
0,1970-01-01 20:13:18
1,1970-01-01 20:16:10
2,1970-01-01 20:16:37
3,1970-01-01 20:16:36
4,1970-01-01 20:26:21


In [4]:
df.describe()

Unnamed: 0,login_time
count,93142
unique,92265
top,1970-02-12 11:16:53
freq,3
first,1970-01-01 20:12:16
last,1970-04-13 18:57:38
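
The summary above lists 93142 rows but only 92265 unique timestamps, so 877 rows repeat an earlier timestamp. Whether these are duplicated records or simply several users logging in during the same second cannot be told from the timestamps alone, but it is worth reporting. A minimal sketch of how the repeats could be counted, using a tiny synthetic frame (the `sample` data below is made up for illustration and is not taken from logins.json):

```python
import pandas as pd

# Tiny synthetic sample (illustrative only, not the real logins.json data)
sample = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-01-01 20:13:18",
        "1970-01-01 20:16:10",
        "1970-01-01 20:16:10",  # exact repeat of the previous timestamp
        "1970-01-01 20:26:21",
    ])
})

# Number of rows whose timestamp already appeared in an earlier row
n_repeats = int(sample["login_time"].duplicated().sum())

# Timestamps that occur more than once, with how often each occurs
value_counts = sample["login_time"].value_counts()
repeated = value_counts[value_counts > 1]

print(n_repeats)
print(repeated)
```

On the real data the same two calls would be applied to `df['login_time']`.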


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93142 entries, 0 to 93141
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   login_time  93142 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 727.8 KB


In [6]:
# Check for missing values (isnull() and isna() are aliases, so both print the same result)
print(df.isnull().sum())
print(df.isna().sum())

login_time    0
dtype: int64
login_time    0
dtype: int64


In [11]:
# Count logins in 15-minute intervals.
# Workaround for an error I ran into: copy the timestamps into a second column.
# Grouping on 'login_time' moves it to the index, so count() needs another column to count.
df['fifteen_interval'] = df['login_time']

interval_table = df.groupby(pd.Grouper(key='login_time', freq='15min')).count()
interval_table

login_time,fifteen_interval
1970-01-01 20:00:00,2
1970-01-01 20:15:00,6
1970-01-01 20:30:00,9
1970-01-01 20:45:00,7
1970-01-01 21:00:00,1
...,...
1970-04-13 17:45:00,5
1970-04-13 18:00:00,5
1970-04-13 18:15:00,2
1970-04-13 18:30:00,7
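
The same 15-minute aggregation can also be written with `resample`, which avoids the extra copy of the timestamp column. This is a sketch on a small synthetic sample, not a replacement for the cell above; the names `sample` and `login_count` are made up for the example:

```python
import pandas as pd

# Tiny synthetic sample (illustrative only, not the real logins.json data)
sample = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-01-01 20:13:18",
        "1970-01-01 20:16:10",
        "1970-01-01 20:16:37",
        "1970-01-01 20:50:00",
    ])
})

# One row per login with the value 1, indexed by timestamp
ones = pd.Series(1, index=pd.DatetimeIndex(sample["login_time"]))

# Sum the ones within each 15-minute bin; bins without logins get 0
counts = ones.resample("15min").sum().rename("login_count")

print(counts)
```

Bins with no logins show up as 0 rather than being dropped, which keeps the time series evenly spaced for plotting.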


This is a start, but the raw 15-minute counts don't make any trend visible. Next, I'll break the logins down by day of the week, month, and time of day to see when activity is highest.
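
One way to look for a daily cycle is to average the 15-minute counts by hour of day. The sketch below assumes a series of 15-minute counts indexed by interval start; the series `counts` here is synthetic (two days of made-up values), so only the shape of the calculation carries over to the real data:

```python
import pandas as pd

# Synthetic 15-minute counts covering two full days (illustrative values only)
idx = pd.date_range("1970-01-01 00:00", periods=2 * 96, freq="15min")
counts = pd.Series(range(len(idx)), index=idx, name="login_count")

# Average number of logins per 15-minute interval for each hour of the day
hourly_profile = counts.groupby(counts.index.hour).mean()

print(hourly_profile.head())
```

Averaging per interval, rather than summing raw logins, keeps hours comparable even when the first and last days are only partly covered.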

In [13]:
df['month'] = df['fifteen_interval'].dt.month
df.head()

Unnamed: 0,login_time,fifteen_interval,month
0,1970-01-01 20:13:18,1970-01-01 20:13:18,1
1,1970-01-01 20:16:10,1970-01-01 20:16:10,1
2,1970-01-01 20:16:37,1970-01-01 20:16:37,1
3,1970-01-01 20:16:36,1970-01-01 20:16:36,1
4,1970-01-01 20:26:21,1970-01-01 20:26:21,1


In [15]:
# Note: .dt.day gives the day of the month (1-31); use .dt.dayofweek for the day of the week
df['day'] = df['fifteen_interval'].dt.day
df.head()

Unnamed: 0,login_time,fifteen_interval,month,day
0,1970-01-01 20:13:18,1970-01-01 20:13:18,1,1
1,1970-01-01 20:16:10,1970-01-01 20:16:10,1,1
2,1970-01-01 20:16:37,1970-01-01 20:16:37,1,1
3,1970-01-01 20:16:36,1970-01-01 20:16:36,1,1
4,1970-01-01 20:26:21,1970-01-01 20:26:21,1,1
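
Since `.dt.day` returns the day of the month, a weekday breakdown needs a different accessor. A small sketch on synthetic timestamps (the column names `weekday` and `weekday_name` are made up for the example):

```python
import pandas as pd

# Tiny synthetic sample (illustrative only, not the real logins.json data)
sample = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-01-01 20:13:18",
        "1970-01-03 02:05:00",
        "1970-01-04 23:10:00",
    ])
})

# Day of the month (1-31), as in the cell above
sample["day"] = sample["login_time"].dt.day

# Day of the week as an integer, Monday=0 through Sunday=6
sample["weekday"] = sample["login_time"].dt.dayofweek

# Day of the week as a readable name
sample["weekday_name"] = sample["login_time"].dt.day_name()

print(sample)
```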


In [16]:
# A 'morning' / 'afternoon' / 'evening' split is one option. For now, add an hour column
# and decide later whether that coarser split is worth the time.
df['hour'] = df['fifteen_interval'].dt.hour
df.head()

Unnamed: 0,login_time,fifteen_interval,month,day,hour
0,1970-01-01 20:13:18,1970-01-01 20:13:18,1,1,20
1,1970-01-01 20:16:10,1970-01-01 20:16:10,1,1,20
2,1970-01-01 20:16:37,1970-01-01 20:16:37,1,1,20
3,1970-01-01 20:16:36,1970-01-01 20:16:36,1,1,20
4,1970-01-01 20:26:21,1970-01-01 20:26:21,1,1,20


In [18]:
# Prepare the data for visualization by counting logins per month, day, and hour
by_month = df['month'].value_counts().sort_index()
by_day = df['day'].value_counts().sort_index()
by_hour = df['hour'].value_counts().sort_index()
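
A possible next step is to plot these counts. The sketch below draws a bar chart of logins per hour from a small synthetic sample; the data, the `Agg` backend choice, and the axis labels are assumptions made for the example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the example runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Tiny synthetic sample (illustrative only, not the real logins.json data)
sample = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-01-01 20:13:18",
        "1970-01-01 20:16:10",
        "1970-01-01 21:05:00",
        "1970-01-01 23:40:00",
    ])
})

# Logins per hour of day; reindex so hours without logins appear as 0
by_hour = (
    sample["login_time"].dt.hour
    .value_counts()
    .sort_index()
    .reindex(range(24), fill_value=0)
)

fig, ax = plt.subplots(figsize=(8, 3))
bars = ax.bar(by_hour.index, by_hour.values)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Number of logins")
ax.set_title("Logins by hour (synthetic sample)")
fig.tight_layout()
```

In the notebook itself the `matplotlib.use("Agg")` line can be dropped, and the same plotting calls would take `by_hour` as computed in the cell above.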