# Users analysis

In [None]:
playlog = spark.read.format("csv").option("header", "true").option("inferSchema","true").load("s3://full-stack-bigdata-datasets/Big_Data/youtube_playlog.csv")
playlog.printSchema()

root
 |-- timestamp: integer (nullable = true)
 |-- user: integer (nullable = true)
 |-- song: string (nullable = true)



In [None]:
playlog.limit(5).toPandas()

Unnamed: 0,timestamp,user,song
0,1392387533,0,t1l8Z6gLPzo
1,1392387538,1,t1l8Z6gLPzo
2,1392387556,2,t1l8Z6gLPzo
3,1392387561,3,we5gzZq5Avg
4,1392387566,4,we5gzZq5Avg


1. Compute a new column `datetime` by converting `timestamp` to a datetime, drop the `timestamp` column, and order by `datetime`. Save the result as a new DataFrame `df` and show its first 5 rows.

> TIP: use the function `from_unixtime(...)` from `pyspark.sql.functions`; it converts a Unix timestamp (seconds since the epoch) into a datetime string.
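
To see what the conversion does on a single value, here is a plain-Python sketch (not Spark). It assumes the time zone is UTC, which is consistent with the output shown in this notebook; the helper name `epoch_to_string` is made up for illustration.

```python
from datetime import datetime, timezone

def epoch_to_string(seconds: int) -> str:
    """Format epoch seconds as 'yyyy-MM-dd HH:mm:ss'."""
    # Assumption: UTC. Spark uses the session time zone instead.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

first = epoch_to_string(1392387533)  # first timestamp of the playlog
print(first)  # 2014-02-14 14:18:53
```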

In [None]:
from pyspark.sql import functions as F
playlog = playlog.withColumn("datetime", F.from_unixtime(F.col("timestamp")))
playlog = playlog.drop("timestamp")
playlog.show(5)

+----+-----------+-------------------+
|user|       song|           datetime|
+----+-----------+-------------------+
|   0|t1l8Z6gLPzo|2014-02-14 14:18:53|
|   1|t1l8Z6gLPzo|2014-02-14 14:18:58|
|   2|t1l8Z6gLPzo|2014-02-14 14:19:16|
|   3|we5gzZq5Avg|2014-02-14 14:19:21|
|   4|we5gzZq5Avg|2014-02-14 14:19:26|
+----+-----------+-------------------+
only showing top 5 rows



In [None]:
# Slow: ordering by `datetime` has to scan the whole dataset
playlog.orderBy("datetime").show(5)

+----+-----------+-------------------+
|user|       song|           datetime|
+----+-----------+-------------------+
|   4|nRa-eGzpT6o|1965-07-26 03:21:43|
|   0|t1l8Z6gLPzo|2014-02-14 14:18:53|
|  70|VJ6ofd0pB_c|2014-02-14 14:18:57|
|  22|Q24VZL8wpOM|2014-02-14 14:18:57|
|   1|t1l8Z6gLPzo|2014-02-14 14:18:58|
+----+-----------+-------------------+
only showing top 5 rows



Now that we have a datetime column, we can compute new columns, namely:
- [year](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.year.html#pyspark.sql.functions.year)
- [month](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.month.html#pyspark.sql.functions.month)
- [dayofmonth](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.dayofmonth.html#pyspark.sql.functions.dayofmonth)
- [dayofweek](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.dayofweek.html#pyspark.sql.functions.dayofweek)
- [dayofyear](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.dayofyear.html#pyspark.sql.functions.dayofyear)
- [weekofyear](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.weekofyear.html#pyspark.sql.functions.weekofyear)

We will put the resulting DataFrame in a variable called `df_enriched`.

2. Follow the previous instructions.

*Tip: you can use the `reduce` function from the `functools` package to generate all the columns automatically; otherwise, create them manually one by one.*
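
The `reduce` idea can be sketched in plain Python on a single row. This is only an analogue of the Spark version: the `EXTRACTORS` table and the `add_field` helper are made up for illustration, and each lambda mimics the Spark function of the same name.

```python
from datetime import datetime
from functools import reduce

# Hypothetical extractors mirroring the Spark functions of the same name.
EXTRACTORS = {
    "year": lambda d: d.year,
    "month": lambda d: d.month,
    "dayofmonth": lambda d: d.day,
    "dayofweek": lambda d: d.isoweekday() % 7 + 1,  # Spark convention: 1 = Sunday ... 7 = Saturday
    "dayofyear": lambda d: d.timetuple().tm_yday,
    "weekofyear": lambda d: d.isocalendar()[1],  # ISO 8601 week number
}

def add_field(row: dict, name: str) -> dict:
    """Return a copy of `row` with one extra date-part field."""
    return {**row, name: EXTRACTORS[name](row["datetime"])}

row = {"user": 0, "song": "t1l8Z6gLPzo", "datetime": datetime(2014, 2, 14, 14, 18, 53)}
# Fold over the column names, adding one field per step.
enriched = reduce(add_field, EXTRACTORS, row)
print(enriched)
```

With a Spark DataFrame the same fold would call `df.withColumn(name, getattr(F, name)("datetime"))` at each step, starting from `playlog`.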

In [None]:
playlog = playlog.withColumn('year', F.year('datetime'))
playlog.show(5)

+----+-----------+-------------------+----+
|user|       song|           datetime|year|
+----+-----------+-------------------+----+
|   0|t1l8Z6gLPzo|2014-02-14 14:18:53|2014|
|   1|t1l8Z6gLPzo|2014-02-14 14:18:58|2014|
|   2|t1l8Z6gLPzo|2014-02-14 14:19:16|2014|
|   3|we5gzZq5Avg|2014-02-14 14:19:21|2014|
|   4|we5gzZq5Avg|2014-02-14 14:19:26|2014|
+----+-----------+-------------------+----+
only showing top 5 rows



In [None]:
playlog = playlog.withColumn('month', F.month('datetime')).withColumn("dayofmonth", F.dayofmonth("datetime")).withColumn("dayofweek", F.dayofweek("datetime")).withColumn("dayofyear", F.dayofyear("datetime")).withColumn("weekofyear", F.weekofyear("datetime"))
playlog.show(5)

+----+-----------+-------------------+----+-----+----------+---------+---------+----------+
|user|       song|           datetime|year|month|dayofmonth|dayofweek|dayofyear|weekofyear|
+----+-----------+-------------------+----+-----+----------+---------+---------+----------+
|   0|t1l8Z6gLPzo|2014-02-14 14:18:53|2014|    2|        14|        6|       45|         7|
|   1|t1l8Z6gLPzo|2014-02-14 14:18:58|2014|    2|        14|        6|       45|         7|
|   2|t1l8Z6gLPzo|2014-02-14 14:19:16|2014|    2|        14|        6|       45|         7|
|   3|we5gzZq5Avg|2014-02-14 14:19:21|2014|    2|        14|        6|       45|         7|
|   4|we5gzZq5Avg|2014-02-14 14:19:26|2014|    2|        14|        6|       45|         7|
+----+-----------+-------------------+----+-----+----------+---------+---------+----------+
only showing top 5 rows



Unnamed: 0,user,song,datetime,year,month,dayofmonth,dayofyear,weekofyear
0,4,nRa-eGzpT6o,1965-07-26 03:21:43,1965,7,26,207,30
1,0,t1l8Z6gLPzo,2014-02-14 14:18:53,2014,2,14,45,7
2,22,Q24VZL8wpOM,2014-02-14 14:18:57,2014,2,14,45,7
3,70,VJ6ofd0pB_c,2014-02-14 14:18:57,2014,2,14,45,7
4,1,t1l8Z6gLPzo,2014-02-14 14:18:58,2014,2,14,45,7


### Aggregates

#### `firstPlay`, `lastPlay`, `playCount`, `uniquePlayCount`
For each user, we will compute these metrics:
- `firstPlay`: datetime of the first listening
- `lastPlay`: datetime of the last listening
- `playCount`: total play counts
- `uniquePlayCount`: unique play counts

We'll save all these in a new DataFrame: `users`.  
When you're done, print out the first 5 rows of `users` ordered by descending `playCount`.

3. Compute, for each user:
- `firstPlay`
- `lastPlay`
- `playCount`
- `uniquePlayCount`

Save the results in a DataFrame named `users`.
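
As a mental model for the four metrics, here is a plain-Python sketch on a handful of made-up plays (the data is invented for illustration, not taken from the playlog). It relies on the fact that `yyyy-MM-dd HH:mm:ss` strings sort chronologically.

```python
from collections import defaultdict

# Made-up (user, song, datetime) plays, for illustration only.
plays = [
    (1, "a", "2014-02-14 14:18:58"),
    (1, "a", "2014-02-15 09:00:00"),
    (1, "b", "2014-02-16 10:30:00"),
    (2, "a", "2014-02-14 14:19:16"),
]

# Group the plays by user.
by_user = defaultdict(list)
for user, song, when in plays:
    by_user[user].append((when, song))

# One entry per user, with the four aggregates.
users = {
    user: {
        "firstPlay": min(w for w, _ in rows),
        "lastPlay": max(w for w, _ in rows),
        "playCount": len(rows),
        "uniquePlayCount": len({s for _, s in rows}),
    }
    for user, rows in by_user.items()
}
print(users[1])
```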

In [None]:
playlog.groupBy("user").agg(F.min("datetime").alias("FirstPlay")).show(5)

+----+-------------------+
|user|          FirstPlay|
+----+-------------------+
|   0|2014-02-14 14:18:53|
|   1|2014-02-14 14:18:58|
|   2|2014-02-14 14:19:16|
|   3|2014-02-14 14:19:21|
|   4|1965-07-26 03:21:43|
+----+-------------------+
only showing top 5 rows



In [None]:
playlog.groupBy("user").agg(F.min("datetime").alias("FirstPlay"), F.max("datetime").alias("LastPlay"), F.count("song").alias("PlayCount"), F.countDistinct("song").alias("UniquePlayCount")).limit(5).show()

+----+-------------------+-------------------+
|user|          FirstPlay|           LastPlay|
+----+-------------------+-------------------+
|   1|2014-02-14 14:18:58|2016-11-10 01:23:00|
|   3|2014-02-14 14:19:21|2017-01-16 12:54:35|
|   5|2014-02-14 14:19:26|2019-03-13 15:35:46|
|   6|2014-02-14 14:19:27|2019-03-17 17:44:38|
|  12|2014-02-14 14:20:12|2016-07-16 14:31:38|
+----+-------------------+-------------------+



In [None]:
playlog.groupBy("user").agg(F.min("datetime").alias("FirstPlay"), F.max("datetime").alias("LastPlay"), F.count("song").alias("PlayCount"), F.countDistinct("song").alias("UniquePlayCount")).orderBy("UniquePlayCount").limit(5).show()

+----+-------------------+-------------------+---------+---------------+
|user|          FirstPlay|           LastPlay|PlayCount|UniquePlayCount|
+----+-------------------+-------------------+---------+---------------+
|1649|2014-02-18 04:17:33|2014-02-18 04:17:33|        1|              1|
|1086|2014-02-16 13:52:48|2014-02-16 13:52:48|        1|              1|
|1075|2014-02-15 22:22:03|2014-02-15 22:22:03|        1|              1|
|1174|2014-02-16 12:59:19|2014-02-16 12:59:19|        1|              1|
|2375|2014-02-21 15:56:44|2014-02-21 15:56:44|        1|              1|
+----+-------------------+-------------------+---------+---------------+



In [None]:
my_aggregate = [
  F.min("datetime").alias("FirstPlay"), 
  F.max("datetime").alias("LastPlay"), 
  F.count("song").alias("PlayCount"), 
  F.countDistinct("song").alias("UniquePlayCount")
]

playlog.groupBy("user").agg(*my_aggregate).orderBy("UniquePlayCount").limit(5).show()

+----+-------------------+-------------------+---------+---------------+
|user|          FirstPlay|           LastPlay|PlayCount|UniquePlayCount|
+----+-------------------+-------------------+---------+---------------+
|1147|2014-02-16 09:16:26|2014-02-16 09:16:26|        1|              1|
|1086|2014-02-16 13:52:48|2014-02-16 13:52:48|        1|              1|
|1427|2014-02-17 16:01:46|2014-02-17 16:01:46|        2|              1|
|1174|2014-02-16 12:59:19|2014-02-16 12:59:19|        1|              1|
|1075|2014-02-15 22:22:03|2014-02-15 22:22:03|        1|              1|
+----+-------------------+-------------------+---------+---------------+



In [None]:
my_aggregate = [
  F.min("datetime").alias("FirstPlay"), 
  F.max("datetime").alias("LastPlay"), 
  F.count("song").alias("PlayCount"), 
  F.countDistinct("song").alias("UniquePlayCount")
]

playlog.groupBy("user").agg(*my_aggregate).orderBy(F.desc("UniquePlayCount")).limit(5).toPandas()

Unnamed: 0,user,FirstPlay,LastPlay,PlayCount,UniquePlayCount
0,213,2014-02-14 15:34:17,2019-04-02 06:04:08,278749,161406
1,7290,2014-04-30 20:12:41,2019-04-03 06:50:05,151513,83831
2,226,2014-02-14 16:28:13,2019-04-02 13:36:34,94589,38883
3,22332,2014-10-29 19:05:20,2018-12-13 06:50:53,63516,37584
4,347,2014-02-14 17:53:17,2019-03-26 20:07:34,93724,31240


In [None]:
my_aggregate = [
  F.min("datetime").alias("FirstPlay"), 
  F.max("datetime").alias("LastPlay"), 
  F.count("song").alias("PlayCount"), 
  F.countDistinct("song").alias("UniquePlayCount")
]

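# Forgot to save the result in the `users` DataFrame.
# Note: `.limit(5)` keeps only the top 5 rows, so `users` and everything derived from it covers just these 5 users.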
users = playlog.groupBy("user").agg(*my_aggregate).orderBy(F.desc("UniquePlayCount")).limit(5)
users.toPandas()

Unnamed: 0,user,FirstPlay,LastPlay,PlayCount,UniquePlayCount
0,213,2014-02-14 15:34:17,2019-04-02 06:04:08,278749,161406
1,7290,2014-04-30 20:12:41,2019-04-03 06:50:05,151513,83831
2,226,2014-02-14 16:28:13,2019-04-02 13:36:34,94589,38883
3,22332,2014-10-29 19:05:20,2018-12-13 06:50:53,63516,37584
4,347,2014-02-14 17:53:17,2019-03-26 20:07:34,93724,31240


Unnamed: 0,user,firstPlay,lastPlay,playCount,uniquePlayCount
0,213,2014-02-14 15:34:17,2019-04-02 06:04:08,278749,161406
1,7290,2014-04-30 20:12:41,2019-04-03 06:50:05,151513,83831
2,435,2014-02-14 19:51:09,2019-04-03 19:36:28,144711,20055
3,21950,2014-10-23 09:09:36,2019-02-06 00:54:54,126285,15075
4,6270,2014-04-13 18:45:54,2018-08-11 20:46:08,125056,9247


4. Run a sanity check that no `firstPlay` is later than the corresponding `lastPlay`

In [None]:
users.filter(F.col("LastPlay") < F.col("FirstPlay")).count() 

Out[50]: 0

5. Another sanity check: we grouped on the `user` column, so each user should appear in exactly one row. Make sure all users are unique in the DataFrame

In [None]:
print("# users          : ", users.count())
print("# distinct users : ", users.select("user").distinct().count())

# users          :  5
# distinct users :  5


### `timespan`
We will compute `timespan`: the overall span of a user's activity in days, rounded down. For example:
- if a user was active on the service for 23 hours, we will say they were active for 0 days
- for 53 hours, that would be 2 days of activity

We **will not** transform the `users` DataFrame in place, but instead save the result as a new DataFrame: `users_with_timespan`.
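
The rounding rule can be checked on the two examples with a small plain-Python sketch; the helper name `timespan_days` is made up for illustration.

```python
from datetime import datetime, timedelta

def timespan_days(first_play: datetime, last_play: datetime) -> int:
    """Whole days elapsed between two datetimes, rounded down."""
    return (last_play - first_play) // timedelta(days=1)

start = datetime(2014, 2, 14, 12, 0, 0)
print(timespan_days(start, start + timedelta(hours=23)))  # 0
print(timespan_days(start, start + timedelta(hours=53)))  # 2
```

Note that this counts elapsed 24-hour periods, which is not the same as the difference between the two calendar dates.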

6. Compute `timespan` and save the result as a new DataFrame: `users_with_timespan`

In [None]:
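# Attempt: subtracting the two string columns directly yields null (see output), so they must be converted first
users_with_timespan = users.select(F.col("LastPlay")-F.col("FirstPlay"))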
users_with_timespan.show(5)

+----------------------+
|(LastPlay - FirstPlay)|
+----------------------+
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
+----------------------+



In [None]:
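# Copy-pasted from the solution
from pyspark.sql.types import IntegerType

def compute_timespan(df):
  # The body was missing; this reconstruction (an assumption) matches the output below:
  # elapsed seconds between the two plays, divided by 86400 and rounded down.
  seconds = F.unix_timestamp("LastPlay") - F.unix_timestamp("FirstPlay")
  return df.withColumn("timespan", F.floor(seconds / 86400).cast(IntegerType()))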


users_with_timespan = users.transform(compute_timespan)
users_with_timespan.limit(5).toPandas()

Unnamed: 0,user,FirstPlay,LastPlay,PlayCount,UniquePlayCount,timespan
0,213,2014-02-14 15:34:17,2019-04-02 06:04:08,278749,161406,1872
1,7290,2014-04-30 20:12:41,2019-04-03 06:50:05,151513,83831,1798
2,226,2014-02-14 16:28:13,2019-04-02 13:36:34,94589,38883,1872
3,22332,2014-10-29 19:05:20,2018-12-13 06:50:53,63516,37584,1505
4,347,2014-02-14 17:53:17,2019-03-26 20:07:34,93724,31240,1866


Let's see what this looks like. We will use Databricks' `display` to plot a histogram of `timespan`.

7. Plot a histogram of `timespan`

In [None]:
display(users_with_timespan.select('timespan'))

timespan
1872
1798
1872
1505
1866


Databricks visualization. Run in Databricks to view.

This looks like a power law, so let's try a log transform.

8. Use describe on the `timespan` column

In [None]:
users_with_timespan.select("timespan").describe().show()

+-------+------------------+
|summary|          timespan|
+-------+------------------+
|  count|                 5|
|   mean|            1782.6|
| stddev|158.30287426322994|
|    min|              1505|
|    max|              1872|
+-------+------------------+



In [None]:
# The suggested solution
users_with_timespan.select('timespan').describe().toPandas().set_index('summary')

9. Plot a histogram of log transformed `timespan`

In [None]:
display(users_with_timespan.select(F.log('timespan')))

ln(timespan)
7.534762657037537
7.494430215031565
7.534762657037537
7.316548177182976
7.531552381407289


Databricks visualization. Run in Databricks to view.

10. Plot a QQ-Plot of log transformed `timespan`

In [None]:
display(users_with_timespan.select(F.log('timespan')))

ln(timespan)
7.534762657037537
7.494430215031565
7.534762657037537
7.316548177182976
7.531552381407289


In [None]:
# See: https://fr.wikipedia.org/wiki/Diagramme_quantile-quantile
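
A QQ-plot pairs each sorted sample value with the matching quantile of a reference distribution. The plain-Python sketch below computes those pairs for the five log-transformed values shown above. It assumes a normal reference distribution fitted to the sample and the plotting position `(i + 0.5) / n`, which is one common choice among several; the helper name `qq_points` is made up.

```python
import math
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted sample value with the matching normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: compare against a normal law with the sample's mean and stdev.
    reference = NormalDist(mean(ordered), stdev(ordered))
    return [(reference.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

timespans = [1872, 1798, 1872, 1505, 1866]  # the five values shown above
points = qq_points([math.log(t) for t in timespans])
for theoretical, observed in points:
    print(f"{theoretical:.3f}  {observed:.3f}")
```

Plotting `observed` against `theoretical` gives the QQ-plot; points close to the diagonal suggest the sample is roughly normal.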


We'll filter out users who stayed for less than a day and plot a histogram of this filtered data.

11. Plot a histogram of log transformed `timespan` of users who stayed at least one day

In [None]:
display(users_with_timespan.where(F.col('timespan') != 0).select(F.log('timespan')))
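
The filter matters because the logarithm is undefined at 0. A plain-Python sketch with made-up timespans shows the same filter-then-log step:

```python
import math

timespans = [0, 0, 1, 7, 1872]  # made-up values, in days

# Keep non-zero values only, since log(0) is undefined.
logged = [math.log(t) for t in timespans if t != 0]
print(logged)
```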

### `isSingleDayUser`
What percentage of users used the service for less than one day?

12. Compute the percentage of users who used the service for less than a day

Wow, that's a lot! We will flag this as its own column.  
That means we will create a new Boolean column `isSingleDayUser` that is `True` if the user used the service for less than a day and `False` otherwise.

13. Create a new column `isSingleDayUser` to flag if a user used the service for less than a day
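
Both steps boil down to testing `timespan == 0`. Here is a plain-Python sketch on made-up timespans (the values are invented, so the resulting percentage says nothing about the real data):

```python
# Made-up timespans in days, keyed by user id.
timespans = {1: 0, 2: 0, 3: 12, 4: 0, 5: 1872}

# Flag: True when the user stayed less than a day.
is_single_day_user = {user: days == 0 for user, days in timespans.items()}

# Percentage of flagged users.
share = 100 * sum(is_single_day_user.values()) / len(is_single_day_user)
print(f"{share:.1f}% of users stayed less than a day")  # 60.0% of users stayed less than a day
```

In Spark the flag would be the Boolean expression `F.col("timespan") == 0` added with `withColumn`.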

### Measure of activity: `activeDaysCount` and `meanPlaycountByActiveDay`
This one is a bit harder. We want to compute:
- the number of active days for each user (not the `timespan`)
- the average play count on these active days for each user

14. Create 2 new columns
- activeDaysCount: the count of days each user was active
- dailyAvgPlayCount: the daily average playcount per user (active days only)
- activeDay
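
One way to think about it: first count plays per user and per calendar day, then aggregate those daily counts per user. The plain-Python sketch below uses made-up plays and treats the first 10 characters of the datetime string as the day; this is an illustration, not the notebook's solution.

```python
from collections import Counter, defaultdict

# Made-up (user, datetime string) plays.
plays = [
    (1, "2014-02-14 14:18:58"),
    (1, "2014-02-14 18:00:00"),
    (1, "2014-02-16 10:30:00"),
    (2, "2014-02-14 14:19:16"),
]

# Step 1: play count per user and per day.
plays_per_day = defaultdict(Counter)
for user, when in plays:
    plays_per_day[user][when[:10]] += 1  # first 10 characters are the date

# Step 2: number of active days and average plays on those days.
activity = {
    user: {
        "activeDaysCount": len(days),
        "dailyAvgPlayCount": sum(days.values()) / len(days),
    }
    for user, days in plays_per_day.items()
}
print(activity)
```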

15. Plot a histogram of log of `activeDaysCount`

16. Plot a histogram of log of `dailyAvgPlayCount`

## Going further
What else do you think would be interesting to compute?
What about the ratio of activity, e.g. the ratio between `timespan` and `activeDaysCount`?
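
A possible definition of that ratio, as a plain-Python sketch. The formula is an assumption: it divides the active days by `timespan + 1` so that single-day users do not cause a division by zero. Because `timespan` is rounded down, the result can slightly exceed 1.

```python
def activity_ratio(active_days_count: int, timespan: int) -> float:
    """Rough share of the days in the span on which the user played something."""
    # Assumption: +1 avoids dividing by zero for users with a timespan of 0.
    return active_days_count / (timespan + 1)

print(activity_ratio(1, 0))    # 1.0
print(activity_ratio(10, 99))  # 0.1
```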