In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

import io
import numpy as np
import pandas as pd

import scipy.stats as stats
from scipy.stats import kendalltau, pearsonr, spearmanr
import statsmodels.api as sm
from statsmodels.formula.api import ols

from datetime import datetime, date


In [None]:
from google.colab import files

animeList = files.upload()

Saving UserList.csv to UserList.csv


In [None]:
df = pd.read_csv(io.BytesIO(animeList['UserList.csv']))

# Question
## Do users who plan to watch lots of anime actually watch lots of anime? 

In the anime dataset we are going to look for an answer to what seems like an obvious question. Intuitively, there should be a pretty strong correlation between what a person wants to watch and how much they actually watch (in the context of a streaming service). If you are like me, when you see something you want to watch you'll add it to your library so you can come back and watch it later - why else add it to your library, right? 

So, this will be a good test to see whether people are mostly window shoppers, who can't be bothered to watch the titles they say they want to see, or whether they follow through and actually watch what they plan to watch. 

With more than 300,000 user records in this set, there should be plenty of data to work with.  

Now, let's take a look.

In [None]:
df.head()

Unnamed: 0,username,user_id,user_watching,user_completed,user_onhold,user_dropped,user_plantowatch,user_days_spent_watching,gender,location,birth_date,access_rank,join_date,last_online,stats_mean_score,stats_rewatched,stats_episodes
0,karthiga,2255153,3,49,1,0,0,55.31,Female,"Chennai, India",1990-04-29,,2013-03-03,2014-02-04 01:32:00,7.43,0.0,3391.0
1,RedvelvetDaisuki,1897606,61,396,39,0,206,118.07,Female,Manila,1995-01-01,,2012-12-13,1900-05-13 02:47:00,6.78,80.0,7094.0
2,Damonashu,37326,45,195,27,25,59,83.7,Male,"Detroit,Michigan",1991-08-01,,2008-02-13,1900-03-24 12:48:00,6.15,6.0,4936.0
3,bskai,228342,25,414,2,5,11,167.16,Male,"Nayarit, Mexico",1990-12-14,,2009-08-31,2014-05-12 16:35:00,8.27,1.0,10081.0
4,shuzzable,2347781,36,72,16,2,25,35.48,,,,,2013-03-25,2015-09-09 21:54:00,9.06,7.0,2154.0


In [None]:
df.tail()

Unnamed: 0,username,user_id,user_watching,user_completed,user_onhold,user_dropped,user_plantowatch,user_days_spent_watching,gender,location,birth_date,access_rank,join_date,last_online,stats_mean_score,stats_rewatched,stats_episodes
302670,ScruffyPuffy,3119025,0,27,0,0,0,7.92,,,,,2013-09-06,2014-10-10 09:04:00,0.0,0.0,477.0
302671,Torasori,3975907,22,239,0,4,176,86.88,Male,"Latvia, Riga",1998-11-18,,2014-07-30,2018-05-24 21:34:46,8.98,47.0,5313.0
302672,onpc,1268417,5,169,2,5,24,38.36,Male,,,,2012-04-23,2016-12-28 14:35:00,7.72,0.0,2280.0
302673,HMicca,1289601,11,73,2,2,16,119.97,Female,"Birmingham, England",1995-08-12,,2012-05-05,2012-11-15 08:10:00,8.89,11.0,7049.0
302674,mini_kaila,236339,2,10,5,3,5,4.17,Female,,,,2009-09-19,2011-08-19 01:15:00,7.82,0.0,245.0


In [None]:
df.shape

(302675, 17)

Lots of data to work with here, but it will need some serious cleaning up. To start with, let's drop all the columns we know we aren't going to use.

In [None]:
df = df.drop(["access_rank", "stats_mean_score", "stats_rewatched", "stats_episodes", "username", "location", "birth_date", "gender", "user_onhold", "user_dropped"], axis = 1)

Now that our columns have been whittled down a fair bit, let's see how many NaN entries we have.

In [None]:
df.isnull().sum().sort_values(ascending=False)

join_date                   129
last_online                 129
user_id                       0
user_watching                 0
user_completed                0
user_plantowatch              0
user_days_spent_watching      0
dtype: int64

Okay, there are a few, but not nearly as many as I was expecting. Let's go ahead and drop all of our null rows and see what we have left.

In [None]:
newdf = df.dropna(subset = ["last_online", "join_date"])

In [None]:
newdf.head()

Unnamed: 0,user_id,user_watching,user_completed,user_plantowatch,user_days_spent_watching,join_date,last_online
0,2255153,3,49,0,55.31,2013-03-03,2014-02-04 01:32:00
1,1897606,61,396,206,118.07,2012-12-13,1900-05-13 02:47:00
2,37326,45,195,59,83.7,2008-02-13,1900-03-24 12:48:00
3,228342,25,414,11,167.16,2009-08-31,2014-05-12 16:35:00
4,2347781,36,72,25,35.48,2013-03-25,2015-09-09 21:54:00


In [None]:
newdf.shape

(302546, 7)

Awesome! Still have most of the dataset to work with from here.

## Data now clean
After sorting down to the columns we will be using, there were very few NaN entries to remove, so overall the majority of the dataset is still intact - great!

The next step we're going to take is to filter out all the users who are not consistently active on the platform. We will do this by limiting our users to those who have logged on in the last year. 

Let's find out what the latest date is in the dataset.

In [None]:
newdf["last_online"].sort_values(ascending = False )

294913    2018-05-25 12:53:00
292747    2018-05-25 12:52:00
296948    2018-05-25 12:51:00
296109    2018-05-25 12:50:00
294821    2018-05-25 12:49:00
                 ...         
198235    1900-01-01 00:54:00
194033    1900-01-01 00:52:00
101153    1900-01-01 00:38:00
201461    1900-01-01 00:31:00
284163    1900-01-01 00:13:00
Name: last_online, Length: 302546, dtype: object

Looks like the latest login was May 25th, 2018. We will keep all users whose last login falls between May 25th, 2017 and May 25th, 2018.

Another thing to note: some rows record a last login in the year 1900, which is clearly not a real login date. Hopefully not many rows are affected by this quirk.
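To put a number on "hopefully not many", one option is to parse the column and count the rows whose year is 1900. The sketch below runs on a tiny made-up frame (the `sample` rows are illustrative, not taken from the real file), so it only shows the shape of the check.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real table; values are illustrative only.
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "last_online": [
        "2014-02-04 01:32:00",
        "1900-05-13 02:47:00",
        "1900-03-24 12:48:00",
        "2018-05-24 21:34:46",
    ],
})

# Parse the strings into real timestamps so the year can be inspected.
parsed = pd.to_datetime(sample["last_online"])

# Flag rows whose year is 1900, the suspicious value noted above.
is_1900 = parsed.dt.year == 1900

n_1900 = int(is_1900.sum())
share_1900 = float(is_1900.mean())
print(f"{n_1900} of {len(sample)} rows ({share_1900:.0%}) have a 1900 login date")
```

Running the same three steps on `newdf["last_online"]` would give the actual count for this dataset.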

In [None]:
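# Use the cleaned frame. The dates are ISO strings, so the comparison is lexicographic,
# and timestamps on 2018-05-25 itself sort after '2018-05-25' and are excluded.
activeUser_df = newdf.loc[(newdf['last_online'] >= '2017-05-25') & (newdf['last_online'] <= '2018-05-25')]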
activeUser_df["last_online"].sort_values(ascending=False)

298654    2018-05-24 23:58:00
297959    2018-05-24 23:58:00
298112    2018-05-24 23:57:36
302386    2018-05-24 23:57:00
298061    2018-05-24 23:56:33
                 ...         
47478     2017-05-25 02:19:00
141785    2017-05-25 01:44:00
2163      2017-05-25 01:40:00
96595     2017-05-25 00:19:00
217497    2017-05-25 00:03:00
Name: last_online, Length: 67724, dtype: object

Nice, we still have a solid number of active users at about 68k. Let's make sure we only keep those who 'actually' use the service. 
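The filter above compares date strings, which is why the most recent row kept is from May 24th rather than May 25th. A possible alternative, sketched here on made-up rows, is to parse the column into timestamps and derive the one-year window from the latest login. `filter_last_year` is a hypothetical helper, not part of this notebook, and its window differs slightly from the one used above (it keeps the latest day and counts back exactly one year from the latest timestamp), so the row count on the real data would not be exactly 67,724.

```python
import pandas as pd

def filter_last_year(frame, column="last_online"):
    """Keep rows whose timestamp falls within one year of the latest one.

    Hypothetical helper, not part of the notebook above.
    """
    stamps = pd.to_datetime(frame[column])
    latest = stamps.max()
    cutoff = latest - pd.DateOffset(years=1)
    # Both ends of the window are inclusive.
    mask = (stamps >= cutoff) & (stamps <= latest)
    return frame.loc[mask]

# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "last_online": [
        "2018-05-25 12:53:00",  # latest login, kept
        "2017-06-01 08:00:00",  # inside the window, kept
        "2016-12-28 14:35:00",  # older than one year, dropped
        "1900-01-01 00:13:00",  # 1900 date, dropped
    ],
})

active = filter_last_year(sample)
print(active["user_id"].tolist())
```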


We are going to do this by only keeping users who have at least 30 titles on their plan-to-watch list and more than 30 completed titles. Keep in mind, these are going to be our power users. However, rather than breaking our data down into a variety of subsets that accommodate the various start dates for each user, this will be an easier way to track a correlation between titles added and titles watched. 

By going this direction we will keep in mind that our results could be biased toward those who heavily use the service vs those who don't. Though I'd argue that's what we're looking for anyway, we don't really want the people that don't actively use the service.
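One way to get a feel for how much the cutoff matters is to recompute the correlation at a few different cutoffs and compare. The sketch below does this on synthetic counts generated purely for illustration. `correlation_by_cutoff` is a hypothetical helper, and it applies the same `>=` cutoff to both columns, which is a small simplification of the filter used in this notebook.

```python
import numpy as np
import pandas as pd

def correlation_by_cutoff(frame, cutoffs):
    """Pearson r between planned and completed counts for each cutoff.

    Hypothetical helper; a cutoff c keeps rows where both counts are >= c.
    """
    results = {}
    for c in cutoffs:
        kept = frame[(frame["user_plantowatch"] >= c) & (frame["user_completed"] >= c)]
        results[c] = kept["user_plantowatch"].corr(kept["user_completed"])
    return results

# Synthetic counts, generated only to make the sketch runnable.
rng = np.random.default_rng(0)
planned = rng.poisson(60, size=2000)
completed = planned + rng.poisson(40, size=2000)
synthetic = pd.DataFrame({"user_plantowatch": planned, "user_completed": completed})

by_cutoff = correlation_by_cutoff(synthetic, [0, 30, 60])
for cutoff, r in by_cutoff.items():
    print(cutoff, round(r, 3))
```

If the coefficient moves a lot between cutoffs on the real data, that would be a sign the result depends on where the power-user line is drawn.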

In [None]:
reallyActive_df = (activeUser_df.loc[(activeUser_df['user_plantowatch'] >= 30) & (activeUser_df['user_completed'] > 30)])

In [None]:
reallyActive_df.shape

(41177, 7)

Nice, still have the majority of our active users at 41k. Now we are ready to see if there is a correlation.

# Our Correlation Test
This test will tell us whether the number of titles people plan to watch is correlated with the number of titles they actually complete.

In [None]:
reallyActive = reallyActive_df.drop(columns = ['user_id', 'join_date', 'last_online'])
reallyActive.corr().style.background_gradient(cmap = "GnBu")

Unnamed: 0,user_watching,user_completed,user_plantowatch,user_days_spent_watching
user_watching,1.0,0.18919,0.173954,0.129732
user_completed,0.18919,1.0,0.227587,0.636512
user_plantowatch,0.173954,0.227587,1.0,0.116852
user_days_spent_watching,0.129732,0.636512,0.116852,1.0


## Result
Super interesting - the only somewhat-strong correlation is between titles completed and days spent watching (about 0.64), which makes sense. You'd hope more titles watched would equal more time watching.

What's weird to me is the weak correlation between planning to watch titles and actually completing them (about 0.23). I mean, a correlation is there, but it's so weak that it suggests a lot of people really are just window shoppers. They just keep adding titles, but then what? Clearly they aren't watching them all! Such a shame.
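The table above reports coefficients only, and `DataFrame.corr` defaults to Pearson, which measures linear association. As a rough cross-check, `pearsonr` and `spearmanr` from `scipy.stats` (both imported at the top of this notebook) return a coefficient together with a p-value, and Spearman works on ranks, so it is less sensitive to a few users with huge lists. The sketch below uses synthetic, skewed data as a stand-in; the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, skewed counts standing in for user_plantowatch and user_completed.
rng = np.random.default_rng(42)
planned = rng.lognormal(mean=4.0, sigma=0.8, size=1000)
completed = 0.3 * planned + rng.lognormal(mean=5.0, sigma=0.8, size=1000)

# Each call returns the coefficient and its two-sided p-value.
r, r_pvalue = pearsonr(planned, completed)
rho, rho_pvalue = spearmanr(planned, completed)

print(f"Pearson r = {r:.3f} (p = {r_pvalue:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {rho_pvalue:.3g})")
```

On the real data, the same two calls could be made with `reallyActive["user_plantowatch"]` and `reallyActive["user_completed"]`.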

# Write Up

Okay, so I chose the anime-viewer dataset because I thought it would be really interesting to use such a large dataset compared to the others we have used so far in this class. There was a bit of a challenge in getting it uploaded properly. Initially I tried to upload it to my GitHub repo so I could link it in from there, but the file was too large for a browser upload, and pushing it from the command line wasn’t working as I had hoped. In the end, I decided I would just be lame and upload the file directly into the notebook. 

The primary question I wanted to ask was this: do users who add movies or TV shows to their streaming library actually watch those shows? In theory, the more shows a person adds, the more they should be watching. Personally, my list is the first thing I look at when I log into my streaming account and I’ll usually watch something from there, so I figured it would make sense for there to be a correlation between added shows and viewed shows. There is so much user data that I think we can get a pretty good picture of whether people actually do that, or whether they just add shows to the library all the time and never watch them, like window shoppers. It would be interesting to figure this out. 

Diving into the dataset a little more and looking at the head, tail, and shape, I found that there were a lot of entries: 302,675 rows. I only really needed information that told me when users joined, how active they were, how many titles they were adding to their library, and how many titles they were actually watching. So, out of the 17 columns, I dropped 10. Then, in the columns relevant to the question, there were very few null entries: only 129 rows were missing a join date or last-online date. 
After that, most of the data was still intact. It didn’t need much more cleaning, but I did narrow it down to active users, meaning people who had logged on in the last year. I did this by finding the latest date recorded in the dataset and dropping everyone who hadn’t logged on in the year before it. This cut the data down quite a bit, from about 300k users to about 68k. Now that we only had active users, I wanted to focus on the really active ones: people with at least 30 titles on their plan-to-watch list and more than 30 completed titles over the course of their account history. This rooted out casual viewers who don’t watch consistently and left about 41k users to pull data from. 

I ran a correlation test to see how the remaining variables relate to one another. The results surprised me more than I expected. There was only a weak correlation (about 0.23) between shows added and shows watched. It seems as though most people add shows to their library but don’t necessarily watch them. There’s a little correlation there, but not much. A stronger correlation (about 0.64) came up between titles completed and days spent watching, which makes sense. There should be a pretty good correlation between those two, so I’m glad that at least showed up. Interesting to see that the titles added aren’t really being watched. Who knew? 

I used the pandas correlation table since it displays really well. It gives good results, and I like how it color-codes the correlations in a nice little grid. The darker the color, the stronger the correlation.