## Data Cleaning with Spark
### This notebook removes the columns that are not useful for the analysis

### **PLEASE NOTE:**  
### Since this script stores its results in Hadoop, execute it only once; otherwise an error will be thrown

---

### Import Libraries

In [1]:
# Import libraries
import findspark
import pyspark as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

### Initialize Spark

In [2]:
# Locate the Spark installation
findspark.init()

# Get any existing session (or create one) and stop it,
# so that the context created below starts fresh
spark = SparkSession.builder.appName("data_cleaning").getOrCreate()
spark.stop()

# Initialize a new SparkContext
sc = ps.SparkContext(appName="data_cleaning")

# Initialize the session on top of that context
spark_session = ps.sql.SparkSession(sc)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/05 17:44:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Connect and import data from HDFS directly into a Spark DataFrame

In [3]:
# Define schema for better manipulation

data_schema = StructType([
    StructField("Title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("authors", StringType(), True),
    StructField("image", StringType(), True),
    StructField("previewLink", StringType(), True),
    StructField("publisher", StringType(), True),
    StructField("publishedDate", StringType(), True),
    StructField("infoLink", StringType(), True),
    StructField("categories", StringType(), True),
    StructField("ratingsCount", FloatType(), True)
])

ratings_schema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Title", StringType(), True),
    StructField("Price", FloatType(), True),
    StructField("User_id", IntegerType(), True),
    StructField("profileName", StringType(), True),
    StructField("review/helpfulness", StringType(), True),
    StructField("review/score", FloatType(), True),
    StructField("review/time", IntegerType(), True),
    StructField("review/summary", StringType(), True),
    StructField("review/text", StringType(), True)
])

# Load the original data

df_data = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/original_data/books_data.csv', header=True, schema=data_schema)
df_ratings = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/original_data/books_rating.csv', header=True, schema=ratings_schema)
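
The reads above set the `escape` option to `"` because Spark's CSV reader treats the backslash as the escape character by default, whereas many CSV exports represent a quote inside a quoted field by doubling it. As a minimal, Spark-free sketch of that convention, Python's standard `csv` module parses doubled quotes in the same way. The sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Hypothetical sample: a quoted field containing doubled quotes and a comma.
sample = 'Title,description\n"A ""quoted"" title","first, second"\n'

# csv.reader follows the doubled-quote convention by default.
rows = list(csv.reader(io.StringIO(sample)))
header, first_row = rows[0], rows[1]

print(header)
print(first_row)
```

The doubled quotes collapse to a single literal quote and the comma inside the quoted field does not split it, which is the behaviour the `escape` option is presumably meant to obtain from Spark.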

## Data Transformations
---

### - Remove useless columns

The columns listed below are not useful for our analysis, so they are removed. The original files are kept unchanged in HDFS, and the cleaned files are stored in HDFS as well.

**Data Table:**
All the link columns are removed, together with ratingsCount.
- image
- previewLink
- infoLink
- ratingsCount

**Rating Table:**
- Id

In [4]:
# Remove the link columns and ratingsCount from the data table
df_data = df_data.drop("image", "previewLink", "infoLink", "ratingsCount")

# Show the results
df_data.show(5)

# Remove the Id column from the ratings table
df_ratings = df_ratings.drop("Id")

# Show the results
df_ratings.show(5)

+--------------------+--------------------+-------------------+---------+-------------+--------------------+
|               Title|         description|            authors|publisher|publishedDate|          categories|
+--------------------+--------------------+-------------------+---------+-------------+--------------------+
|Its Only Art If I...|                null|   ['Julie Strain']|     null|         1996|['Comics & Graphi...|
|Dr. Seuss: Americ...|Philip Nel takes ...|     ['Philip Nel']|A&C Black|   2005-01-01|['Biography & Aut...|
|Wonderful Worship...|This resource inc...|   ['David R. Ray']|     null|         2000|        ['Religion']|
|Whispers of the W...|Julia Thomas find...|['Veronica Haddon']|iUniverse|      2005-02|         ['Fiction']|
|Nation Dance: Rel...|                null|    ['Edward Long']|     null|   2003-03-01|                null|
+--------------------+--------------------+-------------------+---------+-------------+--------------------+
only showing top 5 rows

### - Remove all the punctuation inside each column

This avoids parsing problems when the CSV is read back.

In [5]:
from pyspark.sql.functions import col, regexp_replace

# Character class covering the ASCII punctuation characters
punctuation_pattern = r'[!"#$%&\'()*+,-./:;<=>?@\[\\\]^_`{|}~]'

# Replace punctuation with a space in the text columns of the data table
data_cols_to_change = ['Title', 'description', 'authors', 'publisher', 'categories']
for col_name in data_cols_to_change:
    df_data = df_data.withColumn(col_name, regexp_replace(col(col_name), punctuation_pattern, ' '))

# Do the same for the text columns of the ratings table
ratings_cols_to_change = ['Title', 'profileName', 'review/summary', 'review/text']
for col_name in ratings_cols_to_change:
    df_ratings = df_ratings.withColumn(col_name, regexp_replace(col(col_name), punctuation_pattern, ' '))
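
The character class in the cell above is meant to cover every ASCII punctuation character. A quick local check, sketched here with Python's `re` module instead of Spark, compares it against `string.punctuation`. Python's regex dialect differs from the Java one used by `regexp_replace`, so treat this as a sanity check on the pattern rather than proof of Spark's behaviour. `strip_punctuation` is a hypothetical helper introduced only for this illustration.

```python
import re
import string

# Same character class as in the notebook cell, copied verbatim.
PUNCTUATION_PATTERN = r'[!"#$%&\'()*+,-./:;<=>?@\[\\\]^_`{|}~]'
punct_re = re.compile(PUNCTUATION_PATTERN)

# Punctuation characters the pattern fails to match (expected: none).
unmatched = [ch for ch in string.punctuation if not punct_re.fullmatch(ch)]

# Letters, digits, or whitespace the pattern matches by mistake (expected: none).
overmatched = [ch for ch in string.ascii_letters + string.digits + string.whitespace
               if punct_re.fullmatch(ch)]


def strip_punctuation(text):
    """Hypothetical helper: replace each punctuation character with a space."""
    return punct_re.sub(' ', text)


print(unmatched, overmatched)
print(strip_punctuation("Dr. Seuss: American Icon"))
```

Because each punctuation character is replaced by a space rather than deleted, runs of spaces can appear in the cleaned text (for example after `.` or `:` followed by a space).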

In [6]:
# Check if a given column contains a given character
'''
contains_comma = df_ratings.filter(col("User_id").contains(",")).count() > 0
print("Does the 'User_id' column contain ','? ", contains_comma)
'''

Does the 'User_id' column contain ','?  False

### Store the results in Hadoop

In [6]:
df_ratings.repartition(1).write.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_rating', mode='overwrite', header=True)

df_data.repartition(1).write.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_data', mode='overwrite', header=True)

CodeCache: size=131072Kb used=20651Kb max_used=20663Kb free=110420Kb
 bounds [0x00000001061d8000, 0x0000000107628000, 0x000000010e1d8000]
 total_blobs=8419 nmethods=7455 adapters=877
 compilation: disabled (not enough contiguous free space left)


                                                                                

---

### Check whether the columns have been correctly removed

In [None]:
# Read back the cleaned ratings from the output directory written by the store step
ratings_df = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_rating', header=True, inferSchema=True)
ratings_df.printSchema()
ratings_df.describe().show()
ratings_df.show(5)

In [None]:
# Read back the cleaned book data from the output directory written by the store step
data_df = spark_session.read.option('escape', '"').csv(
    'hdfs://localhost:9900/user/book_reviews/books_data', header=True, inferSchema=True)
data_df.printSchema()
data_df.describe().show()
data_df.show(5)
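
The printed schemas above have to be inspected by eye. As a small complement, the check can be made explicit by comparing the reloaded column names against the list of dropped columns. The sketch below works on plain Python lists so that it runs without a cluster. `leftover_columns` is a hypothetical helper, and in the notebook one would presumably pass `data_df.columns` and `ratings_df.columns` to it.

```python
# Columns that the cleaning step is expected to have removed.
DROPPED_DATA_COLS = ["image", "previewLink", "infoLink", "ratingsCount"]
DROPPED_RATINGS_COLS = ["Id"]


def leftover_columns(columns, dropped):
    """Hypothetical helper: return the dropped columns still present in `columns`."""
    present = set(columns)
    return [name for name in dropped if name in present]


# Column names as they appear in the cleaned data table shown earlier.
cleaned_data_cols = ["Title", "description", "authors",
                     "publisher", "publishedDate", "categories"]

print(leftover_columns(cleaned_data_cols, DROPPED_DATA_COLS))
```

An empty list means every dropped column is gone. A statement such as `assert not leftover_columns(data_df.columns, DROPPED_DATA_COLS)` would make the notebook fail loudly if a column survived.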


In [None]:
spark_session.stop()