# First steps with DataFrames

## Learning objectives

- Learn basic transformations and actions on PySpark DataFrames
- Learn to define a temporary view and execute SQL statements using the SparkSession

In [None]:
spark

In [None]:
# Load the file hosted at `filepath` into a PySpark DataFrame: user_logs_df
filepath = "s3://full-stack-bigdata-datasets/Big_Data/youtube_playlog.csv"

user_logs_df = (
    spark.read.format('csv')
    .option('header', 'true')
    .option('inferSchema', 'true')
    .load(filepath)
)

It's easier to think of PySpark DataFrames as SQL tables than as equivalents of `pandas` DataFrames. If you are familiar with data manipulation in `pandas`, it will be tempting to fall back on `pandas` thinking; this is the worst thing you can do.

The goal of this notebook is to help you counter your intuition on this.

This is why, for every task in this notebook, we will first implement it in declarative SQL (using `spark.sql(...)`). You will then try to get the same result using the imperative style of the PySpark DataFrame API.

Before we get started, we will run a few actions that have no equivalent in SQL: `.show()`, `.printSchema()` and `.describe()`.

Remember, these are actions, which means they will **actually perform computations**.

Unlike most actions, `.show()` and `.printSchema()` won't return a result, but just print out to the screen.

1. Show the first 10 rows of `user_logs_df`:

In [None]:
user_logs_df.show(10)

+----------+----+-----------+
| timestamp|user|       song|
+----------+----+-----------+
|1392387533|   0|t1l8Z6gLPzo|
|1392387538|   1|t1l8Z6gLPzo|
|1392387556|   2|t1l8Z6gLPzo|
|1392387561|   3|we5gzZq5Avg|
|1392387566|   4|we5gzZq5Avg|
|1392387566|   5|we5gzZq5Avg|
|1392387574|   6|49esza4eiK4|
|1392387579|   2|BoO6LfR7ca0|
|1392387583|   7|DaH4W1rY9us|
|1392387584|   2|BoO6LfR7ca0|
+----------+----+-----------+
only showing top 10 rows



2. Print out the schema of `user_logs_df`:

In [None]:
user_logs_df.printSchema()

root
 |-- timestamp: integer (nullable = true)
 |-- user: integer (nullable = true)
 |-- song: string (nullable = true)



Another action is `.describe()`. Unlike the previous two, it returns a value: descriptive statistics about the DataFrame, themselves in the form of a Spark DataFrame.

3. Use `.describe()` on `user_logs_df` and store the result in `user_describe`:

In [None]:
user_describe = user_logs_df.describe()

4. Convert the results to a `pandas` DataFrame with `.toPandas()`:

In [None]:
# WARNING: this takes 2 hours!
# 7 min!
tmp = user_describe.toPandas()

5. Show the results with `display()`:

In [None]:
display(tmp)

summary,timestamp,user,song
count,25739537.0,25739537.0,25739537
mean,1442700656.1045842,12697.352275450798,2.532571778181818E8
stddev,34432848.72371195,13094.065905828476,8.334645614940468E8
min,-139955897.0,0.0,---AtpxbkaE
max,1554321113.0,45903.0,zzzcFgRMY6c


6. Show the results using `.show()`:

In [None]:
spark.createDataFrame(tmp).show()

+-------+--------------------+------------------+-------------------+
|summary|           timestamp|              user|               song|
+-------+--------------------+------------------+-------------------+
|  count|            25739537|          25739537|           25739537|
|   mean|1.4427006561045842E9|12697.352275450798|2.532571778181818E8|
| stddev| 3.443284872371195E7|13094.065905828476|8.334645614940468E8|
|    min|          -139955897|                 0|        ---AtpxbkaE|
|    max|          1554321113|             45903|        zzzcFgRMY6c|
+-------+--------------------+------------------+-------------------+



7. Before we can query using SQL, we need a `TempView`. Create a TempView of `user_logs_df` named `my_table`.

In [None]:
user_logs_df.createOrReplaceTempView('my_table')

## Task 1: count the number of records

`.count()` is an action, not a transformation, so it performs the computation immediately and returns a number. `COUNT` in a SQL statement, by contrast, still returns a DataFrame, and you have to force the computation (for example with `.show()` or `display(...)`).

1. count the number of records using SQL

In [None]:
# NOTE: the course says triple quotes are required

result = spark.sql("SELECT COUNT(*) FROM my_table")  # count every row in my_table
display(result)

count(1)
25739537


2. count the number of records using PySpark DataFrames transformations and actions

In [None]:
result = user_logs_df.count()
display(result)

25739537

## Task 2: select the column `user`

1. Select the column 'user' using SQL

In [None]:
result = spark.sql("SELECT user FROM my_table LIMIT 20")  
display(result)

user
0
1
2
3
4
5
6
2
7
2


2. Select the column 'user' using the PySpark DataFrame API

In [None]:
user_logs_df.select("user").show(20)

+----+
|user|
+----+
|   0|
|   1|
|   2|
|   3|
|   4|
|   5|
|   6|
|   2|
|   7|
|   2|
|   8|
|   9|
|   3|
|  10|
|  11|
|   7|
|  12|
|  13|
|   3|
|  14|
+----+
only showing top 20 rows



## Task 3: select all distinct users

1. Select distinct users using SQL

In [None]:
result = spark.sql("SELECT DISTINCT user FROM my_table LIMIT 20")  
display(result)

user
12
1
13
6
16
3
5
19
15
9


2. Select distinct users using the PySpark DataFrame API

In [None]:
user_logs_df.select('user').distinct().show(20)

+----+
|user|
+----+
|  12|
|   1|
|  13|
|   6|
|  16|
|   3|
|  20|
|   5|
|  19|
|  15|
|   9|
|  17|
|   4|
|   8|
|   7|
|  10|
|  11|
|  14|
|   2|
|   0|
+----+
only showing top 20 rows



## Task 4: Select all distinct users and alias the column name to `distinct_user`

1. Select distinct users using SQL and alias the name of the new column to `distinct_user`

In [None]:
result = spark.sql("SELECT DISTINCT user as distinct_user FROM my_table LIMIT 20")  
display(result)

distinct_user
12
1
13
6
16
3
5
19
15
9


2. Select distinct users using the PySpark DataFrame API and alias the name of the new column to `distinct_user`

In [None]:
user_logs_df.select(user_logs_df["user"].alias("distinct_user")).distinct().show(20)

+-------------+
|distinct_user|
+-------------+
|           12|
|            1|
|           13|
|            6|
|           16|
|            3|
|           20|
|            5|
|           19|
|           15|
|            9|
|           17|
|            4|
|            8|
|            7|
|           10|
|           11|
|           14|
|            2|
|            0|
+-------------+
only showing top 20 rows



## Task 5: count the number of distinct users

1. Count the number of distinct users using SQL. Alias the resulting column to `total_distinct_user`

In [None]:
result = spark.sql("SELECT COUNT(DISTINCT user) as total_distinct_user FROM my_table LIMIT 20")  
display(result)

total_distinct_user
45904


2. Count the number of distinct users using the PySpark DataFrame API

In [None]:
user_logs_df.select(user_logs_df["user"].alias("distinct_user")).distinct().count()

Out[93]: 45904

## Task 6: count the number of distinct songs

1. Count the number of distinct songs using SQL. Alias the resulting column to `total_distinct_song`

In [None]:
result = spark.sql("SELECT COUNT(DISTINCT song) as total_distinct_song FROM my_table LIMIT 20")  
display(result)

total_distinct_song
631348


2. Count the number of distinct songs using the PySpark DataFrame API

In [None]:
user_logs_df.select(user_logs_df["song"].alias("distinct_song")).distinct().count()

Out[96]: 631348