## Data Cleaning with Spark
### In this notebook, the columns that are not useful for the analysis are removed and punctuation is stripped from the text columns

### **PLEASE NOTE:**  
### Since this notebook stores its results in HDFS, execute it only once; otherwise an error will be thrown

---

### Import Libraries

In [1]:
# import libraries
import pandas as pd
import pyspark as ps
from pyspark.sql.functions import col, sum
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
import findspark

### Initialize Spark

In [2]:
# Locate the Spark installation
findspark.init()

# Stop any Spark session that is already running, then start a fresh SparkContext
spark = SparkSession.builder.appName("data_cleaning").getOrCreate()
spark.stop()
sc = ps.SparkContext(appName="data_cleaning")

# Create the SparkSession on top of the new context
spark_session = ps.sql.SparkSession(sc)

23/09/07 18:46:54 WARN Utils: Your hostname, MacBook-Pro-di-Andrea.local resolves to a loopback address: 127.0.0.1; using 192.168.1.148 instead (on interface en0)
23/09/07 18:46:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/07 18:46:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/07 18:46:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/09/07 18:46:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Connect and import data from HDFS directly into a Spark DataFrame

In [3]:
# Define schema for better manipulation

data_schema = StructType([
    StructField("Title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("authors", StringType(), True),
    StructField("image", StringType(), True),
    StructField("previewLink", StringType(), True),
    StructField("publisher", StringType(), True),
    StructField("publishedDate", StringType(), True),
    StructField("infoLink", StringType(), True),
    StructField("categories", StringType(), True),
    StructField("ratingsCount", FloatType(), True)
])

ratings_schema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Title", StringType(), True),
    StructField("Price", FloatType(), True),
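    # NOTE: the summary at the end of this notebook reports 0 non-null User_id values,
    # which suggests the ids are not integers; StringType may be the safer choice.
    StructField("User_id", IntegerType(), True),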
    StructField("profileName", StringType(), True),
    StructField("review/helpfulness", StringType(), True),
    StructField("review/score", FloatType(), True),
    StructField("review/time", IntegerType(), True),
    StructField("review/summary", StringType(), True),
    StructField("review/text", StringType(), True)
])

# Load the original data

df_data = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/original_data/books_data.csv', header=True, schema=data_schema)
df_ratings = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/original_data/books_rating.csv', header=True, schema=ratings_schema)

## Data Transformations
---

### - Remove useless columns

These columns are not useful for our analysis, so they are dropped. The original files are kept unchanged in HDFS, and the cleaned files are stored in HDFS as well.

**Data Table:**
All the link columns are removed, together with `ratingsCount`.
- image
- previewLink
- infoLink
- ratingsCount

**Rating Table:**
- Id
In [4]:
# Remove the link columns and ratingsCount from the book data
df_data = df_data.drop("image", "previewLink", "infoLink", "ratingsCount")

# Show the results
df_data.show(5)

# Remove the Id column from the ratings data
df_ratings = df_ratings.drop("Id")

# Show the results
df_ratings.show(5)

+--------------------+--------------------+-------------------+---------+-------------+--------------------+
|               Title|         description|            authors|publisher|publishedDate|          categories|
+--------------------+--------------------+-------------------+---------+-------------+--------------------+
|Its Only Art If I...|                null|   ['Julie Strain']|     null|         1996|['Comics & Graphi...|
|Dr. Seuss: Americ...|Philip Nel takes ...|     ['Philip Nel']|A&C Black|   2005-01-01|['Biography & Aut...|
|Wonderful Worship...|This resource inc...|   ['David R. Ray']|     null|         2000|        ['Religion']|
|Whispers of the W...|Julia Thomas find...|['Veronica Haddon']|iUniverse|      2005-02|         ['Fiction']|
|Nation Dance: Rel...|                null|    ['Edward Long']|     null|   2003-03-01|                null|
+--------------------+--------------------+-------------------+---------+-------------+--------------------+
only showing top 5 

### - Remove all the punctuation inside each column

Punctuation, tabs and line breaks are replaced with spaces to avoid parsing problems when the CSV is read back.
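
To illustrate the kind of parsing problem meant here, the following minimal sketch uses Python's standard `csv` module rather than Spark, with a made-up review text (an assumption for illustration only). A field that contains a quote, a comma and a line break spans two physical lines once written, so a reader that treats each physical line as one record splits it apart. After the replacement the record fits on a single line.

```python
import csv
import io

# Hypothetical row: the second field contains quotes, a comma and a line break
row = ["Some Title", 'He said "great",\nthen left']

# Write the row as CSV and look at the physical lines it occupies
raw_buffer = io.StringIO()
csv.writer(raw_buffer).writerow(row)
physical_lines = raw_buffer.getvalue().splitlines()
print(len(physical_lines))  # 2: one record spread over two lines

# Replace the problematic characters with spaces, as the notebook does
cleaned = [f.replace('"', " ").replace(",", " ").replace("\n", " ") for f in row]
clean_buffer = io.StringIO()
csv.writer(clean_buffer).writerow(cleaned)
cleaned_lines = clean_buffer.getvalue().splitlines()
print(len(cleaned_lines))  # 1: one record, one line
```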

In [5]:
from pyspark.sql.functions import col, regexp_replace

# Character class matching tabs, line breaks and every ASCII punctuation mark
punctuation_pattern = r'[\t\n\r\\\\!\"#$%&\'()*+,-./:;<=>?@\[\\\]^_`{|}~]'

data_cols_to_change = ['Title', 'description',
                       'authors', 'publisher', 'categories']

# Replace each match with a space in the text columns of the book data
for col_name in data_cols_to_change:
    df_data = df_data.withColumn(col_name, regexp_replace(
        col(col_name), punctuation_pattern, ' '))

ratings_cols_to_change = ['Title', 'profileName',
                          'review/summary', 'review/text']

# Apply the same replacement to the text columns of the ratings data
for col_name in ratings_cols_to_change:
    df_ratings = df_ratings.withColumn(col_name, regexp_replace(
        col(col_name), punctuation_pattern, ' '))
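
The hand-written character class above is easy to get wrong. As a hedged alternative, a similar class can be derived from `string.punctuation` with `re.escape`. The sketch below checks the idea with Python's own `re` module only; `PATTERN` and `strip_punctuation` are hypothetical names, and passing `PATTERN.pattern` to Spark's `regexp_replace` is an assumption that is not tested here.

```python
import re
import string

# Every ASCII punctuation mark plus tab, newline and carriage return
PATTERN = re.compile("[" + re.escape(string.punctuation) + r"\t\n\r]")


def strip_punctuation(text):
    """Replace each punctuation or control character with a single space."""
    return PATTERN.sub(" ", text)


print(strip_punctuation("Dr. Seuss: American Icon"))

# Soundness check: no character of the class is left after cleaning
sample = strip_punctuation("a\tb\n['Comics & Graphic Novels']")
print(PATTERN.search(sample) is None)
```

The first call reproduces the doubled spaces visible in the cleaned titles further down, because each removed character becomes one space.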

In [14]:
# Check whether a given column still contains a given character (here: a tab)

contains_tab = df_ratings.filter(col("review/text").contains("\t")).count() > 0
print("Does the 'review/text' column contain a tab? ", contains_tab)



Does the 'review/text' column contain a tab?  False


                                                                                


### Store the results in HDFS

In [6]:
df_ratings.repartition(1).write.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_rating_cleaned', mode='overwrite', header=True)

df_data.repartition(1).write.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_data_cleaned', mode='overwrite', header=True)



                                                                                

## Read the new data to check soundness

In [7]:
data_df = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_data_cleaned/part-00000-ea742e5c-f0c3-4d06-a362-b6623fef520d-c000.csv', header=True, inferSchema=True)
rating_df = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_rating_cleaned/part-00000-f127d73d-c204-4663-bfe6-30dd15b39a1e-c000.csv', header=True, inferSchema=True)
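
The part-file names above contain an identifier that Spark generates when it writes, so these exact paths are unlikely to match after the write cell is run again. As a hedged alternative, Spark can be pointed at the output directory and reads the part files inside it. The sketch below assumes the `spark_session` object from the earlier cells; `dataset_dir` and `read_cleaned` are hypothetical helper names, not part of the notebook.

```python
import posixpath


def dataset_dir(part_file_path):
    """Return the directory that holds a Spark part file (hypothetical helper)."""
    return posixpath.dirname(part_file_path)


def read_cleaned(spark, directory):
    """Read all part files under `directory` as one DataFrame (hypothetical helper)."""
    return spark.read.option('escape', '"').csv(
        directory, header=True, inferSchema=True)


part_file = ('hdfs://localhost:9900/user/book_reviews/books_data_cleaned/'
             'part-00000-ea742e5c-f0c3-4d06-a362-b6623fef520d-c000.csv')
print(dataset_dir(part_file))

# With a running session, the call would look like:
# data_df = read_cleaned(spark_session, dataset_dir(part_file))
```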

                                                                                

In [8]:
data_df.show(50)
data_df.printSchema()
print("Num values :",data_df.count())
data_df.describe().show()

+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|               Title|         description|             authors|           publisher|publishedDate|          categories|
+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|Its Only Art If I...|                null|        Julie Strain|                null|         1996|Comics   Graphic ...|
|Dr  Seuss  Americ...|Philip Nel takes ...|          Philip Nel|           A C Black|   2005-01-01|Biography   Autob...|
|Wonderful Worship...|This resource inc...|        David R  Ray|                null|         2000|            Religion|
|Whispers of the W...|Julia Thomas find...|     Veronica Haddon|           iUniverse|      2005-02|             Fiction|
|Nation Dance  Rel...|                null|         Edward Long|                null|   2003-03-01|                null|
|The Church of Chr...|In The Chu

23/09/07 18:53:57 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------+--------------------+--------------------+--------+--------------+------------------+--------------------+
|summary|               Title|         description| authors|     publisher|     publishedDate|          categories|
+-------+--------------------+--------------------+--------+--------------+------------------+--------------------+
|  count|              212403|              143927|  180987|        136517|            187099|              171205|
|   mean|   3533.684210526316|  1.4285714285714286|   102.0|       51495.0|1983.8165452207459|              1858.0|
| stddev|  10146.955559031441|  0.9759000729485332|    null|          null| 32.53827334249166|                null|
|    min|               00 01|                   0|     102|010 Publishers|         101-01-01|1  Short Stories ...|
|    max|you can do anythi...|�Una novela llam�...|편집부편|          펜립|              20??|             Śaivism|
+-------+--------------------+--------------------+--------+--------------+---

                                                                                

In [9]:
rating_df.show(5)
rating_df.printSchema()
print("Num values :",rating_df.count())
rating_df.describe().show()

+--------------------+-----+-------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|               Title|Price|User_id|         profileName|review/helpfulness|review/score|review/time|      review/summary|         review/text|
+--------------------+-----+-------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|Its Only Art If I...| null|   null|Jim of Oz  jim of oz|               7/7|         4.0|  940636800|Nice collection o...|This is only for ...|
|Dr  Seuss  Americ...| null|   null|       Kevin Killian|             10/10|         5.0| 1095724800|   Really Enjoyed It|I don t care much...|
|Dr  Seuss  Americ...| null|   null|        John Granger|             10/11|         5.0| 1078790400|Essential for eve...|If people become ...|
|Dr  Seuss  Americ...| null|   null|Roy E  Perry  ama...|               7/7|         4.0| 1090713600|Phlip Nel gives s...|Theodore Seuss

                                                                                

Num values : 3000000




+-------+--------------------+------------------+-------+--------------------+------------------+-----------------+--------------------+--------------------+--------------------+
|summary|               Title|             Price|User_id|         profileName|review/helpfulness|     review/score|         review/time|      review/summary|         review/text|
+-------+--------------------+------------------+-------+--------------------+------------------+-----------------+--------------------+--------------------+--------------------+
|  count|             2999792|            481171|      0|             2437808|           3000000|          3000000|             3000000|             2998132|             2999986|
|   mean|   2009.919466403162| 21.76265587493888|   null|                 NaN|              null|4.215289333333334| 1.132306772630393E9|            Infinity|             17963.0|
| stddev|   1534.295236559636|26.206540521370115|   null|                 NaN|              null|1.203053

                                                                                

In [None]:
spark_session.stop()