# eFuse Follower Events
The aim of this notebook is to get a feel for the data and determine if further processing is needed. We'll be working with an anonymized dataset containing follower events from the eFuse platform.

**Follower events** are events that occur whenever one user follows another. This notebook has two main steps.

1. A quick inspection of our data.
2. Any further processing if necessary.

For those of you who want the quick fact sheet without going through the notebook, I've outlined our findings below. Enjoy!

## Notable Findings

1. There are roughly 2.45 times as many unique followers as unique followees.
2. Follower `5e1ab920b988fb71f5ae532db2fc449e` has followed 12,652 users.
3. Followee `bbc4a5710e40e0c9fe61a6793297db55` has been followed by 51,753 users.
4. **August 23rd, 2020** had the highest single day follower activity with over 103,000 (roughly 17% of the total) events!
5. Roughly 17.5% of all follower events occur during the 4pm (16:00 UTC) hour.
6. Roughly 42.4% of all follower events occur between 3pm (15:00) and 9pm (21:00) UTC.

**Before moving forward**, run the code cell below to load some useful functions and packages.

In [1]:
# accessing utils module
import sys
sys.path.append('../utils')

# needed for loading data:
import pandas as pd

# some problem-specific helper functions:
from utils import get_path

## Step 1: Inspecting Follower Events

Our dataset contains info about who users decided to follow and at what time they did so on the platform. Our data contains three dimensions (columns) as outlined below:

- **follower** - The id of the person that initiated the follow
- **followee** - The id of the person being followed
- **createdAt** - The timestamp (in UTC) at which the follow event happened

**Location of the data.** The data is stored in a CSV file called `followers_anonymized.csv`, which can be found in `data/raw` of this repository. Take a moment to read in the data and observe the output:

In [2]:
print("Location of data files:", get_path('raw'))
print("Location of anonymized followers data:", get_path('raw/followers_anonymized.csv'))
print("Loading...")
df = pd.read_csv(get_path('raw/followers_anonymized.csv'), parse_dates=['createdAt'])
print("Done loading!\n")
print("Displaying the first 5 rows...")
df.head(5)

Location of data files: /Users/matthewquinn/dev/eFuse-sample/data/raw
Location of anonymized followers data: /Users/matthewquinn/dev/eFuse-sample/data/raw/followers_anonymized.csv
Loading...
Done loading!

Displaying the first 5 rows...


Unnamed: 0,follower,followee,createdAt
0,782bc0ce5ffe00c95bbc52f72fc654a2,e83a54eefe2ff5daf80a66505f9472a4,2019-11-26 21:36:21.221000+00:00
1,e83a54eefe2ff5daf80a66505f9472a4,782bc0ce5ffe00c95bbc52f72fc654a2,2019-11-26 21:36:39.756000+00:00
2,782bc0ce5ffe00c95bbc52f72fc654a2,4627a06c99dde3d167166eaab32e947d,2019-11-26 21:36:49.184000+00:00
3,782bc0ce5ffe00c95bbc52f72fc654a2,49b76015d44936cb2c2184fc805e88a6,2019-11-26 21:36:57.669000+00:00
4,782bc0ce5ffe00c95bbc52f72fc654a2,1e66c9b143e9ee9155d0dcaa8d5997b0,2019-11-26 21:41:21.834000+00:00


In [3]:
print("Displaying the last 5 rows...")
df.tail(5)

Displaying the last 5 rows...


Unnamed: 0,follower,followee,createdAt
609112,a9b7b5dc212eb2cc43c7503ec256d78c,5e1ab920b988fb71f5ae532db2fc449e,2021-05-28 14:40:20.915000+00:00
609113,3828c0c23a13cef79a3b9c30848c3609,c3581e6d3601d2825c10a4547d64f019,2021-05-28 14:49:39.844000+00:00
609114,255f0841aae13a484e8b9b8e314022bd,37b8780b7369005f5bc922de6501dd40,2021-05-28 14:58:24.978000+00:00
609115,49b76015d44936cb2c2184fc805e88a6,818ae3dee4d173fbc40b8aaf2c89f00c,2021-05-28 15:17:50.008000+00:00
609116,72ae2db888738150b28d520578a5907d,37b8780b7369005f5bc922de6501dd40,2021-05-28 15:21:01.119000+00:00


In [4]:
print("Some quick descriptive statistics")
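# Note: datetime_is_numeric is only accepted by pandas 1.1 through 1.5; it was removed in pandas 2.0.
display(df.describe(datetime_is_numeric=True).T)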

Some quick descriptive statistics


Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
follower,609117,97765.0,5e1ab920b988fb71f5ae532db2fc449e,12652.0,NaT,NaT,NaT,NaT,NaT,NaT
followee,609117,39956.0,bbc4a5710e40e0c9fe61a6793297db55,51753.0,NaT,NaT,NaT,NaT,NaT,NaT
createdAt,609117,,,,2020-11-14 06:46:40.858285056+00:00,2019-11-26 21:36:21.221000+00:00,2020-08-23 16:38:47.335000064+00:00,2020-12-04 18:42:52.616999936+00:00,2021-02-24 10:36:24.983000064+00:00,2021-05-28 15:21:01.119000+00:00


**Note the following:**
1. Events range from `November 26th, 2019 @ 21:36:21` to `May 28th, 2021 @ 15:21:01`.
2. There are roughly 2.45 times as many unique followers as unique followees.
3. Follower `5e1ab920b988fb71f5ae532db2fc449e` has followed 12,652 users.
4. Followee `bbc4a5710e40e0c9fe61a6793297db55` has been followed by 51,753 users.

One thing this data doesn't capture is how many times a follower has followed, unfollowed, and/or re-followed a given user. No worries, though; we'll save that question for another time. For now, let's turn our attention to the `createdAt` column.
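
As a rough first look at re-follows, repeated `(follower, followee)` pairs can be counted. The sketch below uses a tiny made-up frame (the `u1`, `u2`, `u3` ids are hypothetical) rather than the real data. A repeated pair only hints at a re-follow, since the dataset has no unfollow events to confirm it.

```python
import pandas as pd

# Toy stand-in for the follower events; the ids are made up for illustration.
events = pd.DataFrame({
    "follower": ["u1", "u1", "u2", "u1"],
    "followee": ["u2", "u3", "u1", "u2"],
})

# Number of events recorded for each (follower, followee) pair.
pair_counts = events.groupby(["follower", "followee"]).size()

# Pairs that show up more than once are candidates for re-follows.
repeated_pairs = pair_counts[pair_counts > 1]

# Rows that repeat an earlier (follower, followee) pair.
n_repeat_rows = int(events.duplicated(subset=["follower", "followee"]).sum())

print(repeated_pairs)
print("Rows repeating an earlier pair:", n_repeat_rows)
```

Running the same two checks on `df` would show whether the raw counts overstate the number of distinct users involved.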

## Step 2: Determine If Data Processing is Needed
While our dataset only has three dimensions, there are still a few things we can do to touch it up. Let's take a look at **`createdAt`**.

As a reminder, `createdAt` holds the timestamp of each event. As seen earlier, `createdAt` ranges from `November 26th, 2019 @ 21:36:21` to `May 28th, 2021 @ 15:21:01`. To make future analysis easier, let's go ahead and split it out into `date`, `year`, `timestamp`, and `hour`.

In [5]:
# createdAt was already parsed by read_csv (parse_dates), so the .dt accessor works directly
df['date'] = df['createdAt'].dt.date
df['year'] = df['createdAt'].dt.year.astype(str)
df['timestamp'] = df['createdAt'].dt.time
df['hour'] = df['createdAt'].dt.hour

Great! Now that we've done that, we can remove the `createdAt` column altogether.

In [6]:
df.drop(columns="createdAt", inplace=True)
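
One thing to keep in mind: the `+00:00` offset in the raw timestamps means the extracted `hour` values are in UTC. If local-time hours are ever needed, they would have to be derived before `createdAt` is dropped. Here is a minimal sketch using two timestamps copied from the output above. The `America/New_York` zone is purely an assumption for illustration; the dataset does not say where users are located.

```python
import pandas as pd

# First and last event timestamps from the dataset, parsed as timezone-aware UTC.
created_at = pd.Series(
    pd.to_datetime(
        ["2019-11-26 21:36:21.221+00:00", "2021-05-28 15:21:01.119+00:00"],
        utc=True,
    )
)

# Hour of day as stored (UTC).
utc_hours = created_at.dt.hour

# Hour of day after converting to an assumed local time zone.
local_hours = created_at.dt.tz_convert("America/New_York").dt.hour

print(utc_hours.tolist(), local_hours.tolist())
```

The two hours shift by different amounts because the first timestamp falls in standard time and the second in daylight saving time.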

In [7]:
print("Displaying the first 5 rows...")
display(df.head(5))
print("\nDisplaying the last 5 rows...")
display(df.tail(5))
print("\nSome quick descriptive statistics...")
display(df.describe(include="all").T)
print("\nFollower events by the hour:", df.hour.value_counts(), sep="\n========================\n")
print()
print("\nFollower events via bins (8)", df.hour.value_counts(bins=8), sep="\n========================\n")

Displaying the first 5 rows...


Unnamed: 0,follower,followee,date,year,timestamp,hour
0,782bc0ce5ffe00c95bbc52f72fc654a2,e83a54eefe2ff5daf80a66505f9472a4,2019-11-26,2019,21:36:21.221000,21
1,e83a54eefe2ff5daf80a66505f9472a4,782bc0ce5ffe00c95bbc52f72fc654a2,2019-11-26,2019,21:36:39.756000,21
2,782bc0ce5ffe00c95bbc52f72fc654a2,4627a06c99dde3d167166eaab32e947d,2019-11-26,2019,21:36:49.184000,21
3,782bc0ce5ffe00c95bbc52f72fc654a2,49b76015d44936cb2c2184fc805e88a6,2019-11-26,2019,21:36:57.669000,21
4,782bc0ce5ffe00c95bbc52f72fc654a2,1e66c9b143e9ee9155d0dcaa8d5997b0,2019-11-26,2019,21:41:21.834000,21



Displaying the last 5 rows...


Unnamed: 0,follower,followee,date,year,timestamp,hour
609112,a9b7b5dc212eb2cc43c7503ec256d78c,5e1ab920b988fb71f5ae532db2fc449e,2021-05-28,2021,14:40:20.915000,14
609113,3828c0c23a13cef79a3b9c30848c3609,c3581e6d3601d2825c10a4547d64f019,2021-05-28,2021,14:49:39.844000,14
609114,255f0841aae13a484e8b9b8e314022bd,37b8780b7369005f5bc922de6501dd40,2021-05-28,2021,14:58:24.978000,14
609115,49b76015d44936cb2c2184fc805e88a6,818ae3dee4d173fbc40b8aaf2c89f00c,2021-05-28,2021,15:17:50.008000,15
609116,72ae2db888738150b28d520578a5907d,37b8780b7369005f5bc922de6501dd40,2021-05-28,2021,15:21:01.119000,15



Some quick descriptive statistics...


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
follower,609117.0,97765.0,5e1ab920b988fb71f5ae532db2fc449e,12652.0,,,,,,,
followee,609117.0,39956.0,bbc4a5710e40e0c9fe61a6793297db55,51753.0,,,,,,,
date,609117.0,549.0,2020-08-23,103155.0,,,,,,,
year,609117.0,3.0,2020,315466.0,,,,,,,
timestamp,609117.0,600396.0,16:37:07.588000,6.0,,,,,,,
hour,609117.0,,,,12.967249,6.493841,0.0,8.0,15.0,17.0,23.0



Follower events by the hour:
16    106328
17     45466
15     34343
21     26379
20     24400
18     24260
11     23767
19     23396
23     23287
14     23025
0      21600
1      21260
13     21163
22     20740
12     19296
5      18914
4      18578
2      18133
3      17152
9      16797
10     16334
6      15602
8      15487
7      13410
Name: hour, dtype: int64


Follower events via bins (8)
(14.375, 17.25]    186137
(17.25, 20.125]     72056
(20.125, 23.0]      70406
(11.5, 14.375]      63484
(-0.024, 2.875]     60993
(8.625, 11.5]       56898
(2.875, 5.75]       54644
(5.75, 8.625]       44499
Name: hour, dtype: int64
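
The binned output above is easier to read once each hour is mapped to its bin. The sketch below assumes `value_counts(bins=8)` applies the same equal-width binning as `pd.cut(..., bins=8)`, and relies on every hour from 0 to 23 appearing in the data (as the hourly counts above show).

```python
import pandas as pd

# Every hour of the day, standing in for the distinct values of df.hour.
hours = pd.Series(range(24))

# Assign each hour to one of 8 equal-width bins; labels=False returns bin numbers 0-7.
bin_index = pd.cut(hours, bins=8, labels=False)

# Collect the hours that land in each bin.
hours_per_bin = hours.groupby(bin_index).apply(list)

for bin_number, bin_hours in hours_per_bin.items():
    print(bin_number, bin_hours)
```

Each bin covers three consecutive hours, so the largest bin, `(14.375, 17.25]`, corresponds to hours 15, 16, and 17.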


**Initial Observations:**
1. The time period for these events ranges from `November 26th, 2019 @ 21:36:21` to `May 28th, 2021 @ 15:21:01`.
2. There are roughly 2.45 times as many unique followers as unique followees.
3. Follower `5e1ab920b988fb71f5ae532db2fc449e` has followed 12,652 users.
4. Followee `bbc4a5710e40e0c9fe61a6793297db55` has been followed by 51,753 users.
5. **August 23rd, 2020** had the highest single day follower activity with over 103,000 (roughly 17% of the total) events!
6. Roughly 17.5% of all follower events occur during the 4pm (16:00 UTC) hour.
7. Roughly 42.4% of all follower events occur between 3pm (15:00) and 9pm (21:00) UTC.
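
The percentages in this list can be re-derived from the numbers printed earlier. The sketch below hard-codes the counts from the `describe` and `value_counts` outputs instead of reloading the data; the variable names are my own. The 42.4% figure matches the two busiest bins, `(14.375, 17.25]` and `(17.25, 20.125]`, which together cover hours 15 through 20.

```python
# Counts copied from the notebook outputs above (not recomputed from the raw file).
total_events = 609_117
unique_followers = 97_765
unique_followees = 39_956
peak_day_events = 103_155  # events on 2020-08-23

# Events per hour of day, from df.hour.value_counts().
events_per_hour = {
    0: 21600, 1: 21260, 2: 18133, 3: 17152, 4: 18578, 5: 18914,
    6: 15602, 7: 13410, 8: 15487, 9: 16797, 10: 16334, 11: 23767,
    12: 19296, 13: 21163, 14: 23025, 15: 34343, 16: 106328, 17: 45466,
    18: 24260, 19: 23396, 20: 24400, 21: 26379, 22: 20740, 23: 23287,
}


def share(hours):
    """Percentage of all events that fall in the given hours."""
    return 100 * sum(events_per_hour[h] for h in hours) / total_events


follower_ratio = unique_followers / unique_followees
peak_day_share = 100 * peak_day_events / total_events
share_16 = share([16])
share_15_to_20 = share(range(15, 21))

print(f"Followers per followee: {follower_ratio:.2f}")
print(f"Peak day share:         {peak_day_share:.1f}%")
print(f"Hour 16 share:          {share_16:.1f}%")
print(f"Hours 15-20 share:      {share_15_to_20:.1f}%")
```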

## Next Steps
This notebook served as a quick exploration of our data to check whether any further processing was needed. Luckily, there wasn't much to do, but we still found some interesting insights that we'll explore further in the next notebook on Exploratory Data Analysis (EDA).

**Let's save our new dataframe before closing this out!**

In [8]:
print("Saving...")
df.to_csv(get_path('processed/followers.csv'))
print("Data saved successfully!")

Saving...
Data saved successfully!
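
A note for whoever loads `processed/followers.csv` next: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column on reload. The sketch below shows the difference on a tiny made-up frame with shortened, hypothetical ids, using in-memory buffers instead of real files. Passing `index=False` is one option, not necessarily what the next notebook expects.

```python
import io

import pandas as pd

# Tiny stand-in for the processed frame; the ids are shortened and hypothetical.
df = pd.DataFrame({
    "follower": ["782b", "e83a"],
    "followee": ["e83a", "782b"],
    "hour": [21, 21],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Also note that column types are not stored in a CSV, so `date` would come back as plain strings and `year` as integers unless `parse_dates` or `dtype` is passed to `read_csv`.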
