# Preprocessing

## Loading our data from S3

In [None]:
from pyspark.sql import functions as F

In [None]:
filepath = "s3://full-stack-bigdata-datasets/Big_Data/youtube_playlog.csv"

In [None]:
playlog = (spark.read.format('csv')
             .option('header', 'true')
             .option('inferSchema', 'true')
             .load(filepath))
playlog.show(5)

+----------+----+-----------+
| timestamp|user|       song|
+----------+----+-----------+
|1392387533|   0|t1l8Z6gLPzo|
|1392387538|   1|t1l8Z6gLPzo|
|1392387556|   2|t1l8Z6gLPzo|
|1392387561|   3|we5gzZq5Avg|
|1392387566|   4|we5gzZq5Avg|
+----------+----+-----------+
only showing top 5 rows



## First analysis
1. Print out our DataFrame's schema

In [None]:
playlog.printSchema()

root
 |-- timestamp: integer (nullable = true)
 |-- user: integer (nullable = true)
 |-- song: string (nullable = true)



2. Use `.describe(...)` on your DataFrame

In [None]:
playlog.describe().display()

summary,timestamp,user,song
count,25739537.0,25739537.0,25739537
mean,1442700656.1045842,12697.352275450798,2.532571778181818E8
stddev,34432848.72371195,13094.065905828476,8.334645614940468E8
min,-139955897.0,0.0,---AtpxbkaE
max,1554321113.0,45903.0,zzzcFgRMY6c


### Missing values check

3. Count the missing values for each column, put the result in a pandas DataFrame, and print it out.

*TIP: you may use a dictionary comprehension to create the base from which to build the DataFrame.*


In [None]:
from pyspark.sql.functions import col, isnan, when, count
playlog.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in playlog.columns]).show()

+---------+----+----+
|timestamp|user|song|
+---------+----+----+
|        0|   0|   0|
+---------+----+----+
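
The check above tests both `isnan` and `isNull` because the two conditions are distinct: NaN is an actual floating-point value, while null marks an absent value. A minimal plain-Python sketch of the same distinction, on a made-up list (the values and the `is_missing` helper are illustrative assumptions, not part of the dataset or of PySpark):

```python
import math

# Toy column mixing real values, a NaN and a null (None); values are made up.
toy_column = [1.5, float("nan"), None, 3.0]

def is_missing(value):
    """Return True for None (null) or a float NaN, mirroring isNull | isnan."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)

# Counting nulls only misses the NaN entry.
null_only = sum(v is None for v in toy_column)
nan_or_null = sum(is_missing(v) for v in toy_column)

print(null_only, nan_or_null)
```

On this toy list, a null-only count and a NaN-or-null count differ, which is why it can be worth checking both conditions on numeric columns.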



In [None]:
na_counts = {c: playlog.where(col(c).isNull()).count() for c in playlog.columns}
na_counts

Out[68]: {'timestamp': 0, 'user': 0, 'song': 0}

In [None]:
import pandas as pd
pd.DataFrame.from_dict(na_counts, orient='index', columns=['missing values'])

Unnamed: 0,missing values
timestamp,0
user,0
song,0


### Duplicates check

4. Check if playlog without duplicates has the same number of rows as the original.

In [None]:
count1 = playlog.count()

In [None]:
count2 = playlog.dropDuplicates().count()
if count1 != count2:
    print(f"There are {count1-count2:_} duplicates")
else:
    print("No duplicate")

There are 123_651 duplicates


It seems we have duplicates; let's count how many.

5. Figure out a way to count the number of duplicates.

In [None]:
print(f"{count1-count2:_}")

123_651
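
The difference `count1 - count2` measures surplus copies, that is, the rows that `dropDuplicates()` would remove, not the number of distinct rows that appear more than once. A plain-Python sketch on a made-up list of `(timestamp, user, song)` tuples shows the two quantities side by side; the toy rows and variable names are illustrative assumptions:

```python
from collections import Counter

# Made-up rows shaped like (timestamp, user, song); not taken from the dataset.
toy_rows = [
    (10, 0, "a"),
    (10, 0, "a"),
    (10, 0, "a"),   # the same row three times: 2 surplus copies
    (11, 1, "b"),
    (11, 1, "b"),   # the same row twice: 1 surplus copy
    (12, 2, "c"),   # unique row
]

occurrences = Counter(toy_rows)

# Equivalent of count1 - count2: rows removed by deduplication.
surplus_copies = len(toy_rows) - len(occurrences)

# Number of distinct rows that occur more than once.
duplicated_keys = sum(1 for n in occurrences.values() if n > 1)

print(surplus_copies, duplicated_keys)
```

The two numbers coincide only when every duplicated row appears exactly twice.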


### Other checks
6. Order the DataFrame by ascending `timestamp` and show the first 5 rows.

In [None]:
playlog.orderBy('timestamp').show(5)

+----------+----+-----------+
| timestamp|user|       song|
+----------+----+-----------+
|-139955897|   4|nRa-eGzpT6o|
|1392387533|   0|t1l8Z6gLPzo|
|1392387537|  70|VJ6ofd0pB_c|
|1392387537|  22|Q24VZL8wpOM|
|1392387538|   1|t1l8Z6gLPzo|
+----------+----+-----------+
only showing top 5 rows



Do you see anything suspicious?

The first timestamp is negative, and it seems to be the only one.  
We will make sure there are no others like it.

7. Count the number of rows with a negative timestamp

In [None]:
tmp = playlog.filter('timestamp<0').count()
tmp

Out[74]: 1

In [None]:
tmp = playlog.select('timestamp').filter('timestamp<0').count()
tmp

Out[75]: 1

As expected, there is only one negative timestamp. Since there is only one, we can safely `.collect(...)` it.

8. Collect the problematic rows

In [None]:
playlog.select('timestamp').filter('timestamp<0').collect()

Out[76]: [Row(timestamp=-139955897)]

There is only one problematic value among more than 25M rows. This negative timestamp is an error, so the real value is effectively missing. We could try to reconstruct it, but that would be tedious for a single row out of 25M, so we will simply remove it.
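
To see why the negative value cannot be a real play event, it helps to read the timestamps as dates. The sketch below assumes the column holds Unix epoch seconds in UTC (the dataset does not state this explicitly) and uses only the standard library; `to_utc` is a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_utc(seconds):
    """Convert epoch seconds to an aware UTC datetime (works for negatives too)."""
    return EPOCH + timedelta(seconds=seconds)

first_valid = to_utc(1392387533)   # smallest non-negative timestamp shown above
latest = to_utc(1554321113)        # max reported by describe()
suspicious = to_utc(-139955897)    # the single negative timestamp

print(first_valid.date(), latest.date(), suspicious.date())
```

Under that assumption, the valid timestamps span 2014 to 2019, while the negative value falls in 1965, decades before the rest of the log.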

## Removing the row with a negative timestamp

We will use our new knowledge about the data to perform some preprocessing.  

Our pipeline will have 2 steps:
* Remove duplicates (123,651 rows)
* Remove the row with a negative timestamp (1 row)

We will call our new DataFrame `playlog_processed` and save it to S3 in parquet format.

9. Build the processed DataFrame:
* filter out duplicated rows
* filter out rows with a negative timestamp
* save the result to a new DataFrame: `playlog_processed`
* finally, print out the number of rows in this DataFrame

In [None]:
# Keep only distinct rows
playlog_processed = playlog.distinct()
playlog_processed.count()

Out[77]: 25615886

In [None]:
# Drop the negative timestamp by keeping only values > 0
playlog_processed = playlog_processed.filter('timestamp>0')
playlog_processed.count()

Out[78]: 25615885

In [None]:
playlog_processed = playlog.distinct().filter('timestamp>0')
playlog_processed.count()

Out[79]: 25615885
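
The row counts reported so far are mutually consistent, which can be verified with simple arithmetic. The numbers below are copied from the outputs above; the variable names are only illustrative:

```python
# Counts copied from the notebook outputs above.
total_rows = 25_739_537          # count reported by describe()
duplicate_rows = 123_651         # count1 - count2
negative_timestamp_rows = 1      # rows with timestamp < 0

# Step 1: remove duplicates; step 2: remove the negative-timestamp row.
after_dedup = total_rows - duplicate_rows
after_filter = after_dedup - negative_timestamp_rows

print(f"{after_dedup:_} {after_filter:_}")
```

Both values match the counts returned by the cells above. Since the `timestamp>0` filter removes exactly one row from the deduplicated data, it also follows that no row has a timestamp of exactly 0.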

10. Save the processed DataFrame to S3 in Parquet format. For this, you may use the method `.write.parquet(...)`.
*You may use this path: 's3://full-stack-bigdata-datasets/Big_Data/playlog_processed_student.parquet'*

In [None]:
out_path = "s3://full-stack-bigdata-datasets/Big_Data/playlog_processed_student.parquet"
playlog_processed.write.parquet(out_path, mode="overwrite")

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-2891180002313736>:2
      1 out_path = "s3://full-stack-bigdata-datasets/Big_Data/playlog_processed_student.parquet"
----> 2 playlog_processed.write.parquet(out_path, mode="overwrite")

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try:
---> 48     res = func(*args, *