## Data Exploration with Spark

---

### Import Libraries

In [1]:
# import libraries
import findspark

# Locate the spark installation
findspark.init()

import pyspark as ps
from pyspark.sql.functions import col, count, when

### Initialize Spark

In [2]:
# Initialize a SparkContext
sc = ps.SparkContext(appName="prior_analysis")


### Load data from HDFS into Spark DataFrames

In [3]:
# Initialize the Session
spark_session = ps.sql.SparkSession(sc)

# Load the data
df_data = spark_session.read.csv('hdfs://localhost:9900/user/andreaalberti/book_reviews/books_data.csv', header=True, inferSchema=True)
df_ratings = spark_session.read.csv('hdfs://localhost:9900/user/andreaalberti/book_reviews/books_rating.csv', header=True, inferSchema=True)

### Data Exploration

- Show the first 5 rows of each table
- Inspect the inferred schema of each table
- Check each table's dimensions (rows, columns)
- Show summary statistics
- Count null values per column

In [14]:
# Show the data
print('Data Table: \n')
df_data.show(5)

print('Ratings Table: \n')
df_ratings.show(5)

Data Table: 

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+------------+
|               Title|         description|             authors|               image|         previewLink|           publisher| publishedDate|            infoLink|          categories|ratingsCount|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+------------+
|Its Only Art If I...|                null|    ['Julie Strain']|http://books.goog...|http://books.goog...|                null|          1996|http://books.goog...|['Comics & Graphi...|        null|
|Dr. Seuss: Americ...|"Philip Nel takes...| like that of Lew...| has changed lang...| giving us new wo...| inspiring artist...|['Philip Nel']|http://books.goog...|http://books.goog...|   A&C Bla
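
In the preview above, the `Dr. Seuss` row has fragments of the description in the `authors`, `image`, `previewLink`, and `publisher` columns, and the author list appears under `publishedDate`. That pattern suggests a quoted description containing commas was split into several fields. One plausible cause, not verified against this dataset, is that the file escapes embedded quotes by doubling them (`""`), while Spark's CSV reader uses a backslash as its default escape character. The sketch below uses Python's standard `csv` module on a made-up line (not taken from the dataset) to show the difference between splitting on every comma and quote-aware parsing.

```python
import csv
import io

# Made-up example line (hypothetical, not a row from books_data.csv):
# the second field is quoted, contains commas, and escapes quotes by doubling them.
line = 'Example Title,"First clause, second clause, and a ""quoted"" word",Example Author,1996'

# Splitting on every comma ignores the quotes and breaks the description apart
naive_fields = line.split(",")

# csv.reader honours the quotes and un-doubles the embedded ones
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), "fields with a plain split")
print(len(parsed_fields), "fields with quote-aware parsing")
print(parsed_fields[1])
```

In PySpark, the reader options `quote`, `escape`, and `multiLine` control the equivalent behaviour. Passing `escape='"'` and `multiLine=True` to `read.csv` may be worth trying here, but that is an assumption about how the file is formatted.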

In [13]:
# Investigate the schema
print('Data Table Schema: \n')
df_data.printSchema()

print('Ratings Table Schema: \n')
df_ratings.printSchema()

Data Table Schema: 

root
 |-- Title: string (nullable = true)
 |-- description: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- image: string (nullable = true)
 |-- previewLink: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- publishedDate: string (nullable = true)
 |-- infoLink: string (nullable = true)
 |-- categories: string (nullable = true)
 |-- ratingsCount: string (nullable = true)

Ratings Table Schema: 

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- profileName: string (nullable = true)
 |-- review/helpfulness: string (nullable = true)
 |-- review/score: string (nullable = true)
 |-- review/time: string (nullable = true)
 |-- review/summary: string (nullable = true)
 |-- review/text: string (nullable = true)
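
Every column in both tables was inferred as `string`, including ones that look numeric such as `Price` and `review/score`, so numeric summaries of these columns should be read with caution. A minimal sketch of one way to address this is to build Spark SQL cast expressions for the ratings table. The target types in `NUMERIC_CASTS` are assumptions (the notebook does not state them), and `build_cast_exprs` is a hypothetical helper. Column names are wrapped in backticks because an unquoted `review/score` would be parsed as a division.

```python
# Column names copied from the ratings schema above
RATINGS_COLUMNS = [
    "Id", "Title", "Price", "User_id", "profileName",
    "review/helpfulness", "review/score", "review/time",
    "review/summary", "review/text",
]

# Assumed target types; adjust after inspecting the actual values
NUMERIC_CASTS = {
    "Price": "DOUBLE",
    "review/score": "DOUBLE",
    "review/time": "BIGINT",
}


def build_cast_exprs(columns, casts):
    """Return one Spark SQL expression per column, casting those listed in `casts`."""
    exprs = []
    for name in columns:
        quoted = f"`{name}`"  # backticks protect the '/' in names like review/score
        if name in casts:
            exprs.append(f"CAST({quoted} AS {casts[name]}) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs


cast_exprs = build_cast_exprs(RATINGS_COLUMNS, NUMERIC_CASTS)
for expr in cast_exprs:
    print(expr)
```

The resulting list could then be passed to `df_ratings.selectExpr(*cast_exprs)`. Under Spark's non-ANSI cast behaviour, values that cannot be parsed typically become null rather than raising an error, so null counts are worth rechecking after casting.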



In [18]:
# Check dimensionality
print(f'Data Table Dimensionality: {df_data.count(), len(df_data.columns)}')
print(f'Ratings Table Dimensionality: {df_ratings.count(), len(df_ratings.columns)}')

# Statistical summary
print('Data Table Summary: \n')
df_data.describe().show()

print('Ratings Table Summary: \n')
df_ratings.describe().show()

Data Table Dimensionality: (212404, 10)
Ratings Table Dimensionality: (3000000, 10)
Data Table Summary: 

+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|summary|               Title|         description|             authors|               image|         previewLink|           publisher|       publishedDate|            infoLink|          categories|        ratingsCount|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  count|              212403|              144047|              181153|              161213|              188349|              139274|              186560|              188103|              171880|               63852|
|   mean|   3823.672941176471|  1.4285714285714286|              1578.4|              1184.0|            Infinity|      

Ratings Table Summary: 

+-------+--------------------+--------------------+--------------------+-------------------+-----------+-------------------+------------------+--------------------+--------------------+--------------------+
|summary|                  Id|               Title|               Price|            User_id|profileName| review/helpfulness|      review/score|         review/time|      review/summary|         review/text|
+-------+--------------------+--------------------+--------------------+-------------------+-----------+-------------------+------------------+--------------------+--------------------+--------------------+
|  count|             3000000|             2999792|              482421|            2437750|    2437800|            2999633|           2999870|             2999973|             2999935|             2999957|
|   mean|1.0568515696607149E9|   2012.796651763537|  21.767951161877054|  18.29299003322259|        NaN|3.285048033703448E8| 1656.860421970827|1.1270533345949814E9|        

In [29]:
from pyspark.sql.functions import col, count, when

# Check for missing values
print('Data Table Missing Values: \n')
df_data.select([count(when(col(c).isNull(), c)).alias(c) for c in df_data.columns]).show()

print('Ratings Table Missing Values: \n')
df_ratings.select([count(when(col(c).isNull(), c)).alias(c) for c in df_ratings.columns]).show()


Data Table Missing Values: 

+-----+-----------+-------+-----+-----------+---------+-------------+--------+----------+------------+
|Title|description|authors|image|previewLink|publisher|publishedDate|infoLink|categories|ratingsCount|
+-----+-----------+-------+-----+-----------+---------+-------------+--------+----------+------------+
|    1|      68357|  31251|51191|      24055|    73130|        25844|   24301|     40524|      148552|
+-----+-----------+-------+-----+-----------+---------+-------------+--------+----------+------------+

Ratings Table Missing Values: 

+---+-----+-------+-------+-----------+------------------+------------+-----------+--------------+-----------+
| Id|Title|  Price|User_id|profileName|review/helpfulness|review/score|review/time|review/summary|review/text|
+---+-----+-------+-------+-----------+------------------+------------+-----------+--------------+-----------+
|  0|  208|2517579| 562250|     562200|               367|         130|         27|            65|         43|
+---+-----+-------+-------+-----------+------------------+------------+-----------+--------------+-----------+
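
Raw null counts are easier to compare across tables of very different sizes when expressed as a share of the rows. The sketch below does that arithmetic in plain Python, using the row totals and null counts printed in the outputs above. `null_share` is a hypothetical helper name.

```python
# Row totals and null counts copied from the notebook outputs above
DATA_ROWS = 212404
RATINGS_ROWS = 3000000

data_nulls = {
    "Title": 1, "description": 68357, "authors": 31251, "image": 51191,
    "previewLink": 24055, "publisher": 73130, "publishedDate": 25844,
    "infoLink": 24301, "categories": 40524, "ratingsCount": 148552,
}
ratings_nulls = {
    "Id": 0, "Title": 208, "Price": 2517579, "User_id": 562250,
    "profileName": 562200, "review/helpfulness": 367, "review/score": 130,
    "review/time": 27, "review/summary": 65, "review/text": 43,
}


def null_share(null_counts, total_rows):
    """Return {column: percentage of rows that are null}, rounded to one decimal."""
    return {name: round(100 * n / total_rows, 1) for name, n in null_counts.items()}


data_share = null_share(data_nulls, DATA_ROWS)
ratings_share = null_share(ratings_nulls, RATINGS_ROWS)

for label, share in (("Data", data_share), ("Ratings", ratings_share)):
    print(f"{label} table, % null per column (highest first):")
    for name, pct in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
        print(f"  {name:<20}{pct:>6.1f}")
```

On these figures, `Price` is null in roughly 84% of the ratings rows and `ratingsCount` in roughly 70% of the data rows, which matters when deciding whether those columns are usable at all.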

In [55]:
# Stop the SparkContext
sc.stop()