# Exploratory Data Analysis of Follower Events
The aim of this notebook is to get a feel for the data and determine if further processing is needed. We'll be working with an anonymized dataset containing follower events from the eFuse platform.

Note our findings from this notebook in the section to follow.

## Quick Summary



## Background
eFuse has follower event data that spans from November 11th, 2019 to May 28th, 2021. Follower events are events that occur whenever one gamer follows another. In the [previous notebook](http://localhost:8888/notebooks/notebooks/0%20-%20eFuse%20Follower%20Events.ipynb), we outlined a few observations that we'll explore further. As a reminder:

1. The time period for these events ranges from `November 11th, 2019 @ 21:36:21` to `May 28th, 2021 @ 15:21:01`.
2. There are roughly 2.45 times as many followers as followees.
3. Follower `5e1ab920b988fb71f5ae532db2fc449e` has followed 12,652 gamers.
4. Followee `bbc4a5710e40e0c9fe61a6793297db55` has been followed by 51,753 gamers.
5. `August 23rd, 2020` had the highest number of follower events with over 103,000 (roughly 17%) in one day!
6. Roughly 17.5% of all follower events occur during the 4pm (16:00) hour.
7. Roughly 42.4% of all follower events occur between 4pm (16:00) and 8pm (20:00).
8. There isn't any apparent indicator of followers or followees that follow, then unfollow, and follow again.
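
As a rough illustration of how counts like these could be reproduced with pandas, here is a minimal sketch on a toy table. The ids are made up, and it assumes that observation 2 compares unique followers to unique followees and that a repeated (follower, followee) pair is what a follow, unfollow, follow-again cycle would look like in this data. The previous notebook may have computed these differently.

```python
import pandas as pd

# Toy stand-in for the follower events table (ids are made up).
events = pd.DataFrame({
    "follower": ["a", "a", "b", "c", "a", "d"],
    "followee": ["x", "y", "x", "x", "x", "y"],
})

# Observation 2 (assumed definition): unique followers relative to unique followees.
ratio = events["follower"].nunique() / events["followee"].nunique()

# Observations 3 and 4: the follower with the most events and the most-followed followee.
top_follower = events["follower"].value_counts().idxmax()
top_followee = events["followee"].value_counts().idxmax()

# Observation 8 (assumed definition): a (follower, followee) pair that appears
# more than once would hint at a follow, unfollow, follow-again cycle.
pair_counts = events.groupby(["follower", "followee"]).size()
repeated_pairs = pair_counts[pair_counts > 1]

print(f"followers per followee: {ratio:.2f}")
print("top follower:", top_follower)
print("top followee:", top_followee)
print("repeated pairs:", list(repeated_pairs.index))
```

On the real data, an empty `repeated_pairs` would be consistent with observation 8.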

To get started, let's load in some useful functions and packages.

In [1]:
# accessing utils module
import sys
sys.path.append('../utils')

# needed for loading data:
import pandas as pd

# some problem-specific helper functions:
from utils import get_path, generae_fig_to_plot

# needed for interactive visualizations
from IPython.display import display, Markdown
import bokeh
from bokeh.io import show, output_notebook
from bokeh.layouts import row
output_notebook()

## The follower event database

Our dataset consists of over 609,000 events across several dimensions. Those dimensions are:

- **follower** - The id of the person that initiated the follow
- **followee** - The id of the person being followed
- **date** - The calendar date on which the follow event happened
- **year** - The year in which the follow event happened
- **timestamp** - The time of day at which the follow event happened, down to the microsecond (for example, `21:36:21.221000`)
- **hour** - The hour of the day (0-23) in which the follow event happened

Run the code below to load our data into our notebook!

In [2]:
print("Location of data files:", get_path('processed'))
print("Location of anonymized followers data:", get_path('processed/followers_anonymized.csv'))
print("Loading...")
df = pd.read_csv(get_path('processed/followers_anonymized.csv'), parse_dates=['date'], index_col=0)
print("...Done loading")
print("Displaying the first 5 rows...")
df.head(5)

Location of data files: /Users/matthewquinn/dev/eFuse-sample/data/processed
Location of anonymized followers data: /Users/matthewquinn/dev/eFuse-sample/data/processed/followers_anonymized.csv
Loading...
...Done loading
Displaying the first 5 rows...


| Unnamed: 0 | follower | followee | date | year | timestamp | hour |
|---|---|---|---|---|---|---|
| 0 | 782bc0ce5ffe00c95bbc52f72fc654a2 | e83a54eefe2ff5daf80a66505f9472a4 | 2019-11-26 | 2019 | 21:36:21.221000 | 21 |
| 1 | e83a54eefe2ff5daf80a66505f9472a4 | 782bc0ce5ffe00c95bbc52f72fc654a2 | 2019-11-26 | 2019 | 21:36:39.756000 | 21 |
| 2 | 782bc0ce5ffe00c95bbc52f72fc654a2 | 4627a06c99dde3d167166eaab32e947d | 2019-11-26 | 2019 | 21:36:49.184000 | 21 |
| 3 | 782bc0ce5ffe00c95bbc52f72fc654a2 | 49b76015d44936cb2c2184fc805e88a6 | 2019-11-26 | 2019 | 21:36:57.669000 | 21 |
| 4 | 782bc0ce5ffe00c95bbc52f72fc654a2 | 1e66c9b143e9ee9155d0dcaa8d5997b0 | 2019-11-26 | 2019 | 21:41:21.834000 | 21 |
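
Since the goal of this notebook is to decide whether further processing is needed, a few sanity checks on the loaded frame can help. The sketch below is one possible set of checks, not part of the original notebook. The helper name `check_events` is hypothetical, and it assumes `timestamp` is stored as an `HH:MM:SS.ffffff` string, as the preview suggests. It runs on two rows copied from the preview so that it stands on its own.

```python
import pandas as pd

# Two rows copied from the preview above, so the sketch runs on its own.
sample = pd.DataFrame({
    "follower": ["782bc0ce5ffe00c95bbc52f72fc654a2", "e83a54eefe2ff5daf80a66505f9472a4"],
    "followee": ["e83a54eefe2ff5daf80a66505f9472a4", "782bc0ce5ffe00c95bbc52f72fc654a2"],
    "date": pd.to_datetime(["2019-11-26", "2019-11-26"]),
    "year": [2019, 2019],
    "timestamp": ["21:36:21.221000", "21:36:39.756000"],
    "hour": [21, 21],
})


def check_events(frame):
    """Return simple data-quality counts for a follower events frame (hypothetical helper)."""
    # Assumes timestamp strings start with a two-digit hour, e.g. "21:36:21.221000".
    hour_from_timestamp = frame["timestamp"].str.slice(0, 2).astype(int)
    return {
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "self_follows": int((frame["follower"] == frame["followee"]).sum()),
        "year_mismatches": int((frame["date"].dt.year != frame["year"]).sum()),
        "hour_mismatches": int((hour_from_timestamp != frame["hour"]).sum()),
    }


report = check_events(sample)
print(report)
```

Calling `check_events(df)` on the full dataset would return the same counts for all events; any non-zero count would point to rows worth a closer look.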


## Exploratory Data Analysis!

Now let's get to the fun stuff! We won't do anything too complicated just yet, but let's take a deeper look into our data to find any noticeable relationships or patterns.

**Time-Based Dimensions** Assuming the above went well and our data loaded successfully, let's count the follower events along each of our time-based dimensions and make a quick plot of the totals. The columns below represent those dimensions.
- `date`
- `hour`
- `year`

**Note: The charts below are interactive so feel free to play around before moving on!**

In [3]:
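# Count events along each time-based dimension, giving one summary frame per dimension.
time_dims = ["date", "hour", "year"]
dfs = [df.groupby(dim, as_index=False).size().rename(columns={"size": "total_events"}) for dim in time_dims]
# Plot the three summaries side by side.
show(row([generae_fig_to_plot(summary, ticks=3) for summary in dfs]))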
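
To make the aggregation step concrete, here is a small sketch of the same `groupby(...).size()` pattern on a toy frame with made-up rows. It leaves out the plotting helper and only shows the shape of the summary tables that get plotted. It assumes pandas 1.1 or later, where `size()` with `as_index=False` returns a frame with a `size` column.

```python
import pandas as pd

# Toy events with made-up values for the three time-based dimensions.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-08-22", "2020-08-23", "2020-08-23", "2020-08-23"]),
    "hour": [16, 16, 17, 16],
    "year": [2020, 2020, 2020, 2020],
})

time_dims = ["date", "hour", "year"]

# One summary table per dimension: the dimension value and its event count.
totals = {
    dim: toy.groupby(dim, as_index=False).size().rename(columns={"size": "total_events"})
    for dim in time_dims
}

for dim, table in totals.items():
    print(table, end="\n\n")
```

Each table has two columns, the dimension itself and `total_events`, which gives the plotting helper one value per date, hour, or year to draw.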

## Initial Observations
Fantastic! We're just about done here. This was a simple and quick exploration before diving deeper into our data. Let's take a moment to outline all of our findings and then let's see what's next to come!

**Observations:**
1. The time period for these events ranges from `November 11th, 2019 @ 21:36:21` to `May 28th, 2021 @ 15:21:01`.
2. There are roughly 2.45 times as many followers as followees.
3. Follower `5e1ab920b988fb71f5ae532db2fc449e` has followed 12,652 gamers.
4. Followee `bbc4a5710e40e0c9fe61a6793297db55` has been followed by 51,753 gamers.
5. `August 23rd, 2020` had the highest number of follower events with over 103,000 (roughly 17%) in one day!
6. Roughly 17.5% of all follower events occur during the 4pm (16:00) hour.
7. Roughly 42.4% of all follower events occur between 4pm (16:00) and 8pm (20:00).
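
The peak-day and hour-of-day shares in observations 5 through 7 can be read straight off the `date` and `hour` columns. The sketch below shows one way to do that on made-up rows. It assumes that "between 4pm and 8pm" means hours 16 through 19 inclusive; if the 20:00 hour is meant to be included, the upper bound would be 20 instead.

```python
import pandas as pd

# Made-up events; only the date and hour columns matter here.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-08-22", "2020-08-23", "2020-08-23", "2020-08-23", "2020-08-24"]
    ),
    "hour": [9, 16, 17, 16, 20],
})

# Observation 5: the busiest day and its share of all events.
daily = toy["date"].value_counts()
peak_day = daily.idxmax()
peak_day_share = daily.max() / len(toy)

# Observation 6: share of events in the 16:00 hour.
hour_share = toy["hour"].value_counts(normalize=True)
share_16 = hour_share.get(16, 0.0)

# Observation 7 (assumed window): share of events in hours 16 through 19.
window_share = toy["hour"].between(16, 19).mean()

print(f"peak day: {peak_day.date()} ({peak_day_share:.1%} of events)")
print(f"16:00 hour share: {share_16:.1%}")
print(f"16:00-19:59 share: {window_share:.1%}")
```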

## Next Steps
This notebook served as a quick exploration of our data to check whether any further processing was needed. Luckily, there wasn't much to do, but we still found some interesting insights that we'll explore further in the next notebook on Exploratory Data Analysis (EDA).

**Let's save our new dataframe before closing this out!**
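
A minimal sketch of the save step, assuming the frame is written back to CSV with pandas. The output file name `followers_explored.csv` is hypothetical, and a temporary directory stands in for the project's data folder so that the example runs on its own. In the notebook, the target path would presumably come from `get_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the notebook's dataframe (rows copied from the preview).
df = pd.DataFrame({
    "follower": ["782bc0ce5ffe00c95bbc52f72fc654a2", "e83a54eefe2ff5daf80a66505f9472a4"],
    "followee": ["e83a54eefe2ff5daf80a66505f9472a4", "782bc0ce5ffe00c95bbc52f72fc654a2"],
    "date": pd.to_datetime(["2019-11-26", "2019-11-26"]),
    "year": [2019, 2019],
    "timestamp": ["21:36:21.221000", "21:36:39.756000"],
    "hour": [21, 21],
})

# Hypothetical output location: a temporary directory instead of data/processed.
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "followers_explored.csv"  # hypothetical file name

# Write the frame, keeping the index as the first column.
df.to_csv(out_file)

# Read it back the same way the notebook loads its input, as a quick round-trip check.
reloaded = pd.read_csv(out_file, parse_dates=["date"], index_col=0)
print(reloaded.head())
```

Writing the index as the first column matches how the input file is read with `index_col=0`, so the saved file can be loaded the same way in the next notebook.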