# Follower Event Exploratory Data Analysis
So here we are, back again! This time, we'll start some Exploratory Data Analysis. Also known as EDA! The goal of EDA is to further investigate our data by seeing if there are any hidden findings that can help us understand what's going on. What we'll be examining here are the dimensions around time and follower-followee.

Just like with the previous notebook, if you want the quick notes on our findings, I've outlined them below. Otherwise, feel free to explore this notebook. All of the visuals are interactive. Enjoy!

## Notable Findings
- August 2020 and February 2021 have huge spikes in follower activity, hinting at some sort of event, campaign, or influencer arriving on the platform.
- With the huge spikes in August 2020 and February 2021, it's hard to get a view on what "consistent" follower engagement looks like month to month.
- With only one full year of data (2020), there is a lot of opportunity to shape follower activity in the future (for example, rewarding users for reaching n followers or for supporting n users).
- While there are some notable users with substantial followers, over a third of followees (14,649 of 39,956) don't follow anyone, so they aren't engaging as both a follower and a followee.
- Roughly three quarters of the users following others (72,458 of 97,765, about 74%) have 0 followers themselves, suggesting uninteresting profiles or a lack of engagement with the platform and with the accounts they follow.
- 50% of those being followed have 2 or fewer followers, and even among followees with more than 6 followers (the notebook's "top 25%" cut, which covers roughly 30% of followees), 75% have 29 followers or fewer. This hints that folks come to the platform to follow the same handful of accounts and engage mostly with them rather than with other users.

## Previous Findings
If you skipped the first notebook, no worries, here are our findings from that.

1. There are roughly 2.45 times as many followers as followees.
2. Follower `5e1ab920b988fb71f5ae532db2fc449e` has followed 12,652 users.
3. Followee `bbc4a5710e40e0c9fe61a6793297db55` has been followed by 51,753 users.
4. **August 23rd, 2020** had the highest single day follower activity with over 103,000 (roughly 17% of the total) events!
5. Roughly 17.5% of all follower events occur during the 4pm (16:00) hour.
6. Roughly 42.4% of all follower events occur between 4pm (16:00) and 9pm (21:00).

Let's load in some useful functions and packages.

In [1]:
# accessing utils module
import sys
sys.path.append('../utils')

# needed for loading data:
import pandas as pd

# some problem-specific helper functions:
from utils import get_path, get_monthly_event_count, create_time_series_df
from visualize import generae_fig_to_plot, plot_multi_lines, generae_fig_to_plot_std

# needed for interactive visualizations
from IPython.display import display, Markdown
import bokeh
from bokeh.io import show, output_notebook
from bokeh.layouts import row
output_notebook()

## The follower event database

Our dataset consists of 609,117 events across several dimensions. Those dimensions are:

- **follower** - The id of the person that initiated the follow
- **followee** - The id of the person being followed
- **date** - The date that the follow event happened
- **year** - The year that the follow event happened
- **timestamp** - The timestamp that the follow event happened
- **hour** - The hour that the follow event happened

Run the code below to load our data into our notebook!

In [2]:
print("Location of data files:", get_path('processed'))
print("Location of anonymized followers data:", get_path('processed/followers.csv'))
print("Loading...")
df = pd.read_csv(get_path('processed/followers.csv'), parse_dates=['date'], index_col=0)
print("Done loading!\n")
print("Displaying the first 5 rows...")
df.head(5)

Location of data files: /Users/matthewquinn/dev/eFuse-sample/data/processed
Location of anonymized followers data: /Users/matthewquinn/dev/eFuse-sample/data/processed/followers.csv
Loading...
Done loading!

Displaying the first 5 rows...


Unnamed: 0,follower,followee,date,year,timestamp,hour
0,782bc0ce5ffe00c95bbc52f72fc654a2,e83a54eefe2ff5daf80a66505f9472a4,2019-11-26,2019,21:36:21.221000,21
1,e83a54eefe2ff5daf80a66505f9472a4,782bc0ce5ffe00c95bbc52f72fc654a2,2019-11-26,2019,21:36:39.756000,21
2,782bc0ce5ffe00c95bbc52f72fc654a2,4627a06c99dde3d167166eaab32e947d,2019-11-26,2019,21:36:49.184000,21
3,782bc0ce5ffe00c95bbc52f72fc654a2,49b76015d44936cb2c2184fc805e88a6,2019-11-26,2019,21:36:57.669000,21
4,782bc0ce5ffe00c95bbc52f72fc654a2,1e66c9b143e9ee9155d0dcaa8d5997b0,2019-11-26,2019,21:41:21.834000,21


## Exploratory Data Analysis!

Now let's get to the fun stuff! We won't do anything too complicated just yet, but let's take a deeper look into our data to find any noticeable relationships or patterns.

### Time-Based Dimensions
Assuming the above went well and our data loaded successfully, let's reduce our time-based dimensions and make a quick plot for follower events based on this. The columns below represent those dimensions.
- `date`
- `hour`
- `year`

**Note: The charts below are interactive so feel free to play around before moving on!**

In [3]:
time_dims = ["date","hour", "year"]
dfs = [df.groupby(x, as_index=False).size().rename(columns={"size": "total_events"}) for x in time_dims]
# `d` avoids reusing the name of the main `df` inside the comprehension
show(row([generae_fig_to_plot(d, ticks=3) for d in dfs]))

**Notice the following:**
1. **August 23rd, 2020** had 103,155 events. This could be due to a few things such as: an event or a raffle, a huge rush of sign-ups on the platform, or a few big time influencers joined and folks flocked to be the first to follow them.
2. The 4pm hour is the prime-time hour for follower activity, although the **August 23rd, 2020** spike may inflate it well beyond the norm.
3. The 7am hour is the least active time for followers.
4. 2020 saw the largest jump in follower activity. However, we're less than halfway through 2021 and follower activity for this year looks set to surpass 2020's numbers, indicating growth on the platform.
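
One way to check how much a single spike day drives the "prime-time hour" is to compare each hour's share of events with and without that day. The sketch below uses a tiny made-up `events` frame (the column names mirror the notebook's `date` and `hour`, but the values are hypothetical), so treat it as an illustration of the idea rather than a result.

```python
import pandas as pd

# Toy events (hypothetical data): one "spike" day plus two ordinary days.
events = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-08-23"] * 6 + ["2020-08-24"] * 2 + ["2020-08-25"] * 2
    ),
    "hour": [16, 16, 16, 16, 17, 18, 9, 16, 9, 20],
})


def hourly_share(frame: pd.DataFrame) -> pd.Series:
    """Return each hour's share of events as a percentage."""
    return (frame["hour"].value_counts(normalize=True) * 100).round(1)


spike_day = pd.Timestamp("2020-08-23")
share_all = hourly_share(events)
share_no_spike = hourly_share(events[events["date"] != spike_day])

print("All days:\n", share_all.sort_index())
print("Without the spike day:\n", share_no_spike.sort_index())
```

In this toy data the 4pm hour drops from half of all events to a quarter once the spike day is excluded, which is the kind of shift worth looking for in the real data.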

Now, let's go ahead and take a look at what's going down year over year.

In [4]:
mo_event_count = get_monthly_event_count(df)
df_to_plot = create_time_series_df(mo_event_count)
plot_multi_lines(df_to_plot)

Some descriptive statistics to go with the above graph!

In [5]:
round(df_to_plot.describe().T, 2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
y_2019,2.0,5065.5,7146.73,12.0,2538.75,5065.5,7592.25,10119.0
y_2020,12.0,26288.83,32802.25,4851.0,9966.25,17094.0,27713.75,126558.0
y_2021,5.0,56704.0,32670.89,19150.0,36995.0,53382.0,69326.0,104667.0


2020 is the only full year of data we have: the platform appears to have launched in late 2019 (only two months of 2019 data), and we're still barely halfway through 2021 (five months). This makes it challenging to do more complex time-based analysis, but there's still a lot this data can tell us.

**Notice the following:**
1. August 2020 and February 2021 have large spikes in follower activity, hinting at some sort of campaign or event.
2. Q1 of 2021 saw a 795.88% increase in follower activity compared to Q1 of 2020, though note that the February 2021 spike inflates this figure.
3. After the 795.88% increase in Q1 2021, Q2 2021 hints at follower activity leveling off, with a 151.58% increase in April and May.
4. The standard deviation is very high (in 2020 it even exceeds the mean: 32,802.25 vs. 26,288.83), indicating inconsistent follower activity from month to month.
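
For reference, here is a small sketch of the arithmetic behind these notes. The quarterly totals are hypothetical (the notebook doesn't print them); the mean and standard deviation values are copied from the `describe()` output above. The ratio of standard deviation to mean (the coefficient of variation) is one simple way to put a number on "inconsistent".

```python
def pct_change(old: float, new: float) -> float:
    """Percent change from `old` to `new`: (new - old) / old * 100."""
    if old == 0:
        raise ValueError("percent change is undefined when the baseline is 0")
    return (new - old) / old * 100


# Hypothetical quarterly totals, NOT taken from the dataset.
q1_prev, q1_curr = 20_000, 180_000
print(f"Q1 change: {pct_change(q1_prev, q1_curr):.2f}%")

# Coefficient of variation (std / mean), using the describe() values above.
cv_2020 = 32802.25 / 26288.83
cv_2021 = 32670.89 / 56704.0
print(f"CV 2020: {cv_2020:.2f}, CV 2021: {cv_2021:.2f}")
```

Keep in mind that an 800% increase means the new value is nine times the old one, not eight times.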

Let's go ahead and take a look at the standard deviation across each month.

In [6]:
show(row([generae_fig_to_plot_std(df_to_plot, year) for year in df_to_plot.columns]))

**Explaining The Visuals** Here we've plotted the upper and lower standard deviation (red shaded area) around the area of consistency (green area). Months that fall in the red area show inconsistent or unexpected follower activity relative to the average, whereas months that land in the green area show consistent activity, or what you could expect for that month. The green dashed line marks the average number of follower events for that year.

**Notice the following:**
- August 2020 and February 2021 have huge spikes, which pull the yearly averages up and cause the remaining months to fall below average or into the zone of inconsistency.
- Most months fall below their year's average: in both 2020 and 2021 the median month sits below the mean.
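
To make the band idea concrete, here is a minimal sketch that labels months relative to a one-standard-deviation band around the yearly mean. It assumes the band is simply mean ± one sample standard deviation (how `generae_fig_to_plot_std` actually draws it isn't shown here). The five values are the 2021 monthly totals read off the `describe()` output above; with only five observations, min/25%/50%/75%/max are the sorted observations themselves.

```python
import pandas as pd

# The five 2021 monthly totals, taken from the describe() row for y_2021.
# Month labels are deliberately left out: only the values are known here.
monthly = pd.Series([19150, 36995, 53382, 69326, 104667], name="y_2021")

mean = monthly.mean()
std = monthly.std()  # sample standard deviation (ddof=1), as describe() reports
lower, upper = mean - std, mean + std


def label(value: float) -> str:
    """Classify a monthly total relative to the mean +/- std band."""
    if value < lower:
        return "below band"
    if value > upper:
        return "above band"
    return "within band"


labels = monthly.apply(label)
print(f"mean={mean:.2f}, std={std:.2f}, band=[{lower:.2f}, {upper:.2f}]")
print(pd.DataFrame({"events": monthly, "label": labels}))
```

Under that assumption, only the smallest and largest 2021 months fall outside the band; the other three sit inside it.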

Don't be alarmed by these observations, though. The more data we acquire as the platform matures and the industry normalizes, the more these numbers should level out, even with the spikes. For now, let's **replace these spikes with the mean** and see how our data looks.

In [7]:
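# Swap each spike month for that year's mean. The mean is computed before the
# replacement, so it still includes the spike itself.
df_to_plot["y_2020"] = df_to_plot["y_2020"].replace(df_to_plot["y_2020"].loc["Aug"], df_to_plot["y_2020"].mean())
df_to_plot["y_2021"] = df_to_plot["y_2021"].replace(df_to_plot["y_2021"].loc["Feb"], df_to_plot["y_2021"].mean())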

In [8]:
round(df_to_plot.describe().T, 2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
y_2019,2.0,5065.5,7146.73,12.0,2538.75,5065.5,7592.25,10119.0
y_2020,12.0,17933.07,9264.45,4851.0,9966.25,17094.0,26543.88,32125.0
y_2021,5.0,47111.4,19422.98,19150.0,36995.0,53382.0,56704.0,69326.0


Great, now that we've replaced those spikes with the mean, let's go ahead and re-plot this to see how it looks.

In [9]:
show(row([generae_fig_to_plot_std(df_to_plot, year) for year in df_to_plot.columns]))

Replacing the huge spikes with the mean results in more consistency in follower activity across the months. However, there are still some things to note.

**Notice the following:**
- While the inconsistency of follower activity dropped in 2020 and 2021 after removing the huge spikes, there are still periods of low follower activity.
- 2021 still has a large standard deviation, indicating more unknowns, but again, this is partly because it's an incomplete year. It's still something to keep watching to get a sense of consistency from month to month.

Alright, now that we've viewed our time-based dimensions, let's go ahead and learn a thing or two about our followers and followees!

### Follower and Followee EDA
Who is following who and who is being followed can tell us a lot about who our influencers are on the platform. Additionally, it can show us the networks and various communities within our platform based on follower and followee networks. Before diving deep into this, let's take a quick look at what we're working with!

In [10]:
print("Displaying some quick stats on our followers...")
df[["follower", "followee"]].describe()

Displaying some quick stats on our followers...


Unnamed: 0,follower,followee
count,609117,609117
unique,97765,39956
top,5e1ab920b988fb71f5ae532db2fc449e,bbc4a5710e40e0c9fe61a6793297db55
freq,12652,51753


As mentioned before, `5e1ab920b988fb71f5ae532db2fc449e` has given 12,652 follows as the platform's top follower, and `bbc4a5710e40e0c9fe61a6793297db55` has been followed by 51,753 users as our top followee.

Let's take a look at some other follower statistics.

In [11]:
# count events per (follower, followee) pair; naming the count column here
# works whether or not pandas names the value_counts() result itself
df_followers = (
    df[["follower", "followee"]]
    .value_counts()
    .reset_index(name="total_follows")
)
print("\nData on how many times a follower followed the followee...")
display(df_followers.head())
print("\nQuick stats on how many times a follower followed the followee...")
display(df_followers.describe().T)
print("Number of times a follower has followed a followee:", 
      *df_followers.total_follows.unique(), sep="\n ===> ")
# NOTE: this adds unique followers to unique followees, so users who appear in
# both roles are counted twice
print("Number of folks on the platform engaging in follower activity:", 
      df[["follower", "followee"]].describe().loc["unique"].sum())


Data on how many times a follower followed the followee...


Unnamed: 0,follower,followee,total_follows
0,dbbc26201cba155406a789727d5dd5c8,086f435f286cae790ff84dcebf26b106,4
1,6f4538135f8a067aadeed3936d213908,c32b5699821fbff77bc8d5e156f04ec9,4
2,2cde2cfd19378d4c8397c2a7ca54fe72,83f507a6429017b08a5413e2387cf92a,2
3,9527c8b77b84926ae73eb9c119553a19,ee15dedd6d93b0bed849526310f8b19d,2
4,072b12123838edd30b7a1574813b2235,a92ce72df353ed57fe47913f357a5299,2



Quick stats on how many times a follower followed the followee...


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_follows,609080.0,1.000061,0.008969,1.0,1.0,1.0,1.0,4.0


Number of times a follower has followed a followee:
 ===> 4
 ===> 2
 ===> 1
Number of folks on the platform engaging in follower activity: 137721


**Notice the following:**
1. There are more followers (97,765) than followees (39,956), suggesting that a lot of people follow a smaller set of specific accounts and that many of the folks being followed aren't following back. This could be for a variety of reasons.
2. A few followers have followed the same account 2 or 4 times. This could mean they followed, unfollowed, and followed again, for whatever reason. We'll explore that below.
3. User `bbc4a5710e40e0c9fe61a6793297db55` has been followed 51,753 times. This could be an influencer or another top user.
4. User `5e1ab920b988fb71f5ae532db2fc449e` has followed 12,652 other users, indicating a very active member of the platform.
5. The count of 137,721 users engaging in follower activity adds unique followers to unique followees, so users in both roles are counted twice. With the 25,307 users who appear in both roles (shown later in this notebook), the number of distinct users is 97,765 + 39,956 - 25,307 = 112,414.
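
A quick aside on counting users: adding the number of unique followers to the number of unique followees counts anyone who plays both roles twice. The sketch below shows the difference on a toy frame with hypothetical ids; a set union counts each user once.

```python
import pandas as pd

# Toy follow events (hypothetical ids): "b" is both a follower and a followee.
events = pd.DataFrame({
    "follower": ["a", "a", "b", "c"],
    "followee": ["b", "d", "d", "b"],
})

followers = set(events["follower"])
followees = set(events["followee"])

naive_total = len(followers) + len(followees)  # counts "b" twice
both = followers & followees                   # users in both roles
unique_users = followers | followees           # each user counted once

print("naive sum:", naive_total)
print("in both roles:", len(both))
print("unique users:", len(unique_users))
```

On the real data, the equivalent would be something like `pd.concat([df["follower"], df["followee"]]).nunique()`.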

There's some basic info here, but the item to explore relates to followers giving more than one follow to the same account. Let's dive deeper to see what's going on there.

In [12]:
# pairs where the same follower followed the same followee more than once
more_than_1_follow = df_followers.query("total_follows > 1")
print("Followers initiating with more than 1 follower for the same followee")
display(more_than_1_follow.head())
# each row is one (follower, followee) pair
print("Total number of follower events with more than 1 follow to same account",
      more_than_1_follow["follower"].count(), sep="\n ===> ")
print("Total number of followers with more than 1 follow to same account:", 
      more_than_1_follow["follower"].nunique(), sep="\n ===> ")

Followers initiating with more than 1 follower for the same followee


Unnamed: 0,follower,followee,total_follows
0,dbbc26201cba155406a789727d5dd5c8,086f435f286cae790ff84dcebf26b106,4
1,6f4538135f8a067aadeed3936d213908,c32b5699821fbff77bc8d5e156f04ec9,4
2,2cde2cfd19378d4c8397c2a7ca54fe72,83f507a6429017b08a5413e2387cf92a,2
3,9527c8b77b84926ae73eb9c119553a19,ee15dedd6d93b0bed849526310f8b19d,2
4,072b12123838edd30b7a1574813b2235,a92ce72df353ed57fe47913f357a5299,2


Total number of follower events with more than 1 follow to same account
 ===> 33
Total number of followers with more than 1 follow to same account:
 ===> 22


Here we see 33 follower-followee pairs, across 22 followers, where the same user was followed more than once. Let's take a specific look at those accounts.

In [13]:
# keep only the events belonging to pairs that were followed more than once
pairs = more_than_1_follow[["follower", "followee"]]
multi_follow_events = df.merge(pairs, on=["follower", "followee"], how="inner")
# parse the time-of-day strings so they can be sorted and subtracted
multi_follow_events["timestamp"] = pd.to_timedelta(multi_follow_events["timestamp"])
multi_follow_events = multi_follow_events.sort_values(
    by=["follower", "followee", "date", "year", "timestamp"]
).reset_index(drop=True)
print("Here are the first 10 rows:")
display(multi_follow_events.head(10))

Here are the first 10 rows:


Unnamed: 0,follower,followee,date,year,timestamp,hour
0,072b12123838edd30b7a1574813b2235,a92ce72df353ed57fe47913f357a5299,2021-03-10,2021,0 days 20:29:32.426000,20
1,072b12123838edd30b7a1574813b2235,a92ce72df353ed57fe47913f357a5299,2021-03-10,2021,0 days 20:29:32.548000,20
2,11fb46c7654a17a517a3e66c24133b41,4f789ca00a491a274f8e6d311e89cead,2020-05-03,2020,0 days 17:13:53.272000,17
3,11fb46c7654a17a517a3e66c24133b41,4f789ca00a491a274f8e6d311e89cead,2020-05-03,2020,0 days 17:13:53.273000,17
4,11fb46c7654a17a517a3e66c24133b41,8796a5ebad1c45f0927cec031e95bbc8,2020-05-03,2020,0 days 17:13:53.270000,17
5,11fb46c7654a17a517a3e66c24133b41,8796a5ebad1c45f0927cec031e95bbc8,2020-05-03,2020,0 days 17:13:53.272000,17
6,11fb46c7654a17a517a3e66c24133b41,c7c0bcf29a3e254d99dcb413d07f0d49,2020-05-03,2020,0 days 17:11:04.673000,17
7,11fb46c7654a17a517a3e66c24133b41,c7c0bcf29a3e254d99dcb413d07f0d49,2020-05-03,2020,0 days 17:11:04.674000,17
8,17152d75ffd165d4323dd93e10871a1d,0468c8a42373b28a837b3ffca9ae0e7a,2021-01-31,2021,0 days 21:30:27.843000,21
9,17152d75ffd165d4323dd93e10871a1d,0468c8a42373b28a837b3ffca9ae0e7a,2021-01-31,2021,0 days 21:30:27.851000,21


**Notice the following:**
1. Look specifically at the `timestamp` column. Notice how the times between repeated follows are only milliseconds apart.

Let's simplify this view some.

In [14]:
print("Here are the first 2 rows:")
display(multi_follow_events[["follower", "followee", "date", "timestamp"]].head(2))

Here are the first 2 rows:


Unnamed: 0,follower,followee,date,timestamp
0,072b12123838edd30b7a1574813b2235,a92ce72df353ed57fe47913f357a5299,2021-03-10,0 days 20:29:32.426000
1,072b12123838edd30b7a1574813b2235,a92ce72df353ed57fe47913f357a5299,2021-03-10,0 days 20:29:32.548000


Follower `072b12123838edd30b7a1574813b2235` has two follower events logged about 122 milliseconds apart. This could indicate a quick "follow-unfollow-follow", or it could indicate a minor hiccup in how we're logging these events.

Before figuring out what to do with these events, let's find the duration between each of them to make sure there weren't days between follows.

In [15]:
# gap between consecutive events within each follower/followee/date/hour group
durations = (
    multi_follow_events
    .groupby(["follower", "followee", "date", "year", "hour"])["timestamp"]
    .diff()
    .rename("time_btwn_event")
)
# NOTE: .dt.microseconds is only the sub-second part of each gap, expressed in
# microseconds; whole seconds and days are not included
multi_follow_events["time_btwn_event"] = durations.dt.microseconds
multi_follow_events[["time_btwn_event"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
time_btwn_event,37.0,8837.837838,21860.051735,0.0,1000.0,2000.0,4000.0,122000.0


Wow! The gaps are tiny: the median is 2,000 microseconds (2 milliseconds) and the largest is 122,000 microseconds (122 milliseconds). Gaps this small suggest a lag or some sort of double logging when these events are tracked, rather than deliberate re-follows.
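
One caveat worth knowing when measuring gaps with pandas: `.dt.microseconds` returns only the sub-second component of a timedelta, while `.dt.total_seconds()` returns the whole duration. The two gaps below are hypothetical, chosen to show how a multi-day gap can look small if only the microsecond component is inspected.

```python
import pandas as pd

# Hypothetical gaps between consecutive events.
gaps = pd.Series(
    pd.to_timedelta(["0 days 00:00:00.122000", "2 days 03:00:01.500000"])
)

sub_second = gaps.dt.microseconds       # sub-second component only
full_seconds = gaps.dt.total_seconds()  # full duration in seconds

print(pd.DataFrame({
    "gap": gaps,
    "microseconds_component": sub_second,
    "total_seconds": full_seconds,
}))
```

The second gap is more than two days long, yet its microsecond component is just 500,000, so `total_seconds()` is the safer check when the goal is to rule out long gaps.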

Let's treat these as duplicate events and remove them.

In [16]:
df.drop_duplicates(
    subset=["follower", "followee", "date", "year", "hour"], keep='first', inplace=True
)

# recount events per (follower, followee) pair after de-duplication
df_followers = (
    df[["follower", "followee"]]
    .value_counts()
    .reset_index(name="total_follows")
)
print("\nQuick stats on how many times a follower followed the followee...")
display(df_followers.describe().T)
print("Number of times a follower has followed a followee:", 
      *df_followers.total_follows.unique(), sep="\n ===> ")


Quick stats on how many times a follower followed the followee...


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_follows,609080.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


Number of times a follower has followed a followee:
 ===> 1
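
As an aside, de-duplicating on the `hour` bucket works here, but it would miss a double-logged pair that straddles an hour boundary and would also drop a genuine re-follow made within the same hour. A possible alternative is to drop an event only when it comes within a short window of the previous event for the same pair. The sketch below assumes a single combined datetime column (called `ts` here, which the notebook's data doesn't have as-is) and a 1-second window; both are illustrative choices.

```python
import pandas as pd

# Toy events (hypothetical): the first pair straddles an hour boundary,
# the second pair is a genuine re-follow one day later.
events = pd.DataFrame({
    "follower": ["a", "a", "b", "b"],
    "followee": ["x", "x", "y", "y"],
    "ts": pd.to_datetime([
        "2021-03-10 20:59:59.950", "2021-03-10 21:00:00.010",
        "2021-03-10 08:00:00.000", "2021-03-11 08:00:00.000",
    ]),
})

threshold = pd.Timedelta(seconds=1)  # assumed cut-off for "same click"

events = events.sort_values(["follower", "followee", "ts"])
gap = events.groupby(["follower", "followee"])["ts"].diff()

# Keep the first event of each pair and any event that follows a long enough gap.
deduped = events[gap.isna() | (gap > threshold)].reset_index(drop=True)
print(deduped)
```

Here the 60-millisecond repeat is dropped even though it crosses from the 20:00 hour into the 21:00 hour, while the next-day re-follow is kept.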


Great! Now that that's resolved, let's go ahead and continue exploring our followers and followees!

In [17]:
follower_stats = df.groupby("follower", as_index=False).size().rename(columns={"size":"num_following"})
print("\nSome stats about who is doing the following!")
display(pd.DataFrame(follower_stats.describe().astype(int)).T)
print("\nTop 10 users following the most users:\n", 
      follower_stats.nlargest(10, columns="num_following").reset_index(drop=True))


Some stats about who is doing the following!


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_following,97765,6,102,1,2,2,3,12649



Top 10 users following the most users:
                            follower  num_following
0  5e1ab920b988fb71f5ae532db2fc449e          12649
1  4627a06c99dde3d167166eaab32e947d          11767
2  a2902fb04be2e55118937cb7e3ecb84d           8792
3  0559f44be415ed2db1936906744f22e6           8742
4  91d48a3b5bb53adbfb9150f996d4a6ad           8527
5  ac27c266cf4edaf9a16f3d78db318c18           7593
6  4bf5a1c9cc9874feb7d2948783d3e0b9           7577
7  0468c8a42373b28a837b3ffca9ae0e7a           6805
8  a6c26024c466c7df983b3d0fa23f94eb           6671
9  1977229aff9c172eea93e65ce9f688be           5680


**Notice the following:**
1. 75% of users follow 3 or fewer others on the platform.
2. The average user follows 6 others, but the median is only 2, so the mean is being pulled up by a handful of very heavy followers.
3. The top two users have each followed more than 11,500 others!
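
To see why the mean and median can disagree this much, here is a tiny sketch with made-up follow counts: most users follow a few accounts and one follows thousands.

```python
import pandas as pd

# Hypothetical follow counts, not taken from the dataset.
num_following = pd.Series([1, 2, 2, 2, 3, 3, 12_000])

mean_all = num_following.mean()
median_all = num_following.median()
mean_trimmed = num_following.drop(num_following.idxmax()).mean()

print("mean:", round(mean_all, 1))
print("median:", median_all)
print("mean without the largest value:", round(mean_trimmed, 1))
```

A single extreme value drags the mean into the thousands while the median stays at 2, which is why the median is usually the better summary of a "typical" user for skewed counts like these.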

Let's check out our followees!

In [18]:
followee_stats = df.groupby("followee", as_index=False).size().rename(columns={"size":"num_followers"})
print("\nSome stats about the folks being followed on the platform!")
display(pd.DataFrame(followee_stats.describe().astype(int)).T)
print("\nTop 10 gamers with the most followers:\n", 
      followee_stats.nlargest(10, columns="num_followers").reset_index(drop=True))


Some stats about the folks being followed on the platform!


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_followers,39956,15,373,1,1,2,10,51753



Top 10 gamers with the most followers:
                            followee  num_followers
0  bbc4a5710e40e0c9fe61a6793297db55          51753
1  ff0d35ecf1f3b20e1bf08292d12b7966          50861
2  49b76015d44936cb2c2184fc805e88a6           5463
3  ce5e4a49602ff3fe1084228b60092ae0           5359
4  fb61b286f0c18d6a7f4ea7bea11a6299           4567
5  1ca9e8d41d676b1255055bef3ed0c008           4417
6  7da3095ae5f27161b3a0421d14a56193           4367
7  521a56d52b9c3f40930a4f38192dca7e           3815
8  aad82f104ba33c96ce48386cc51ea43b           3744
9  4627a06c99dde3d167166eaab32e947d           3468


**Notice the following:**
1. Half of the folks being followed have 2 or fewer followers.
2. The top two users being followed have more than 50,000 followers!
3. 25% of users being followed have 10 or more followers.

Let's explore the dynamic between followers (97,765) and followees (39,956) to paint a better picture of our users.

In [19]:
user_stats_followers = follower_stats.set_index("follower").join(
    followee_stats.set_index("followee")
).reset_index()
user_stats_followers.columns = ["user", "friends", "followers"]
print("\nChecking out our user stats")
display(user_stats_followers.head())


Checking out our user stats


Unnamed: 0,user,friends,followers
0,00015559b9fd9bb87a510726e667515d,2,
1,0002c4cbeb6e13c76243d4e410081466,1,
2,0002e671d27f14cc96969bb7816490de,4,8.0
3,0004541e3a5fcf800bba2171bbec0938,2,
4,0005271d95d68ac34968b91764ac6a5a,2,


**An important note:** The above represents user stats for users who are followers. They may or may not have any followers themselves, but they are doing the following. Users without followers won't appear in the `followee_stats` dataset that we'll peek at later.

Let's make note of the following dimensions:
- `user` - the id of our platform user
- `friends` - the number of users this user follows
- `followers` - the number of users following this user

Also note the `NaN`. This indicates that the specific user has 0 followers. Let's replace these with `0`.

In [20]:
user_stats_followers.fillna(0, inplace=True)
user_stats_followers.followers = user_stats_followers.followers.astype(int)
print("\nChecking out the first 5 users")
display(user_stats_followers.head())
print("\nSome stats!")
display(pd.DataFrame(user_stats_followers.describe().astype(int)).T)


Checking out the first 5 users


Unnamed: 0,user,friends,followers
0,00015559b9fd9bb87a510726e667515d,2,0
1,0002c4cbeb6e13c76243d4e410081466,1,0
2,0002e671d27f14cc96969bb7816490de,4,8
3,0004541e3a5fcf800bba2171bbec0938,2,0
4,0005271d95d68ac34968b91764ac6a5a,2,0



Some stats!


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
friends,97765,6,102,1,2,2,3,12649
followers,97765,4,170,0,0,0,1,50861


In [21]:
print("\nStats on followers who also have followers")
user_stats_followers.query("followers > 0").describe().T.astype(int)


Stats on followers who also have followers


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
friends,25307,17,200,1,1,5,7,12649
followers,25307,17,335,1,1,3,12,50861


**Notice the following:**
1. Roughly 74% of the users following others (72,458 of 97,765) have 0 followers themselves.
2. 25,307 of the 97,765 followers (25.89%) have followers themselves.
3. Of the followers who also have followers, 50% of them have 3 or fewer.
4. 25,307 of the 39,956 (63.34%) **followees** also follow others.
5. There are 14,649 followees not following anyone.

Let's explore the dynamic between followee (39,956) and followers (97,765) to paint a better picture of this relationship.

In [22]:
user_stats_followees = followee_stats.set_index("followee").join(
    follower_stats.set_index("follower")
).reset_index()
user_stats_followees.columns = ["user", "followers", "friends"]
print("\nChecking out our user stats")
display(user_stats_followees.head())


Checking out our user stats


Unnamed: 0,user,followers,friends
0,0002e671d27f14cc96969bb7816490de,8,4.0
1,0005c17c32267aaaee6556abd76a5209,29,
2,00067463e2e33891e363092a5a82d469,46,10.0
3,00070928a4162930be27149f06456589,1,1.0
4,00075e9f6e308daa29735d445ab33e8d,1,1.0


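Same thing as before: followees missing from `follower_stats` follow no one, so let's replace the `NaN` in `friends` with `0`.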

In [23]:
user_stats_followees.fillna(0, inplace=True)
# cast `friends`, the column that held NaN after the join, back to int
user_stats_followees.friends = user_stats_followees.friends.astype(int)
print("\nChecking out the first 5 users")
display(user_stats_followees.head())
print("\nSome stats!")
display(pd.DataFrame(user_stats_followees.describe().astype(int)).T)


Checking out the first 5 users


Unnamed: 0,user,followers,friends
0,0002e671d27f14cc96969bb7816490de,8,4
1,0005c17c32267aaaee6556abd76a5209,29,0
2,00067463e2e33891e363092a5a82d469,46,10
3,00070928a4162930be27149f06456589,1,1
4,00075e9f6e308daa29735d445ab33e8d,1,1



Some stats!


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
followers,39956,15,373,1,1,2,10,51753
friends,39956,10,160,0,0,1,5,12649


In [24]:
print("\nStats on followees not following others")
user_stats_followees.query("friends == 0").describe().T.astype(int)


Stats on followees not following others


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
followers,14649,11,429,1,1,2,6,51753
friends,14649,0,0,0,0,0,0,0


In [25]:
print("\nStats on the top 25% users being followed")
user_stats_followees.query("followers > 6").describe().T.astype(int)


Stats on the top 25% users being followed


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
followers,12001,46,679,7,13,23,29,51753
friends,12001,29,291,0,0,5,11,12649


**Notice the following:**
1. 75% of all followees follow 5 or fewer users (7 or fewer among the followees who follow at least one user).
2. 63.34% of followees (25,307 of 39,956) are also followers.
3. 50% of the 14,649 users not following anyone have 2 or fewer followers (75% have 6 or fewer followers).
4. Among followees with more than 6 followers, at least 25% follow no one (the 25th percentile of `friends` is 0).
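
The two joins above each look at users from one side. As a possible alternative, a single outer merge with `indicator=True` puts every user in one table and records which side they came from. The sketch below uses tiny hypothetical frames that mimic the shape of `follower_stats` and `followee_stats` (with a shared `user` column, which is an assumption of this example).

```python
import pandas as pd

# Toy per-user counts (hypothetical ids and values).
follower_stats = pd.DataFrame({"user": ["a", "b", "c"], "friends": [2, 1, 1]})
followee_stats = pd.DataFrame({"user": ["b", "d"], "followers": [2, 2]})

# An outer merge keeps users from both sides; `_merge` says where each came from.
users = follower_stats.merge(followee_stats, on="user", how="outer", indicator=True)
users = users.fillna({"friends": 0, "followers": 0}).astype(
    {"friends": int, "followers": int}
)

# Translate the merge indicator into a readable role.
role_names = {
    "left_only": "follower only",
    "right_only": "followee only",
    "both": "both",
}
users["role"] = users["_merge"].astype(str).map(role_names)

print(users[["user", "friends", "followers", "role"]])
print(users["role"].value_counts())
```

With a table like this, shares such as "followees who also follow others" can be computed against whichever denominator is intended (followees, followers, or all distinct users).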

## Next Steps
Great, we've made it to the end! To recap, we completed some EDA around our time-based and follower-followee dimensions. While we're done with EDA (for now), there are still lots of opportunities to come back and add to this notebook. The next step is to examine the networks of specific users with what we call a network analysis. So let's start framing our thoughts around "Who is in whose network?"

**Before closing out though, let's save our dataframe!**

In [26]:
print("Saving...")
df[["follower", "followee"]].to_csv(get_path('processed/followers_deduped.csv'))
print("Data saved successfully!")

Saving...
Data saved successfully!
