# Student Details

***

**Name:** Hoai Nhan Nguyen <br>
**Student Number:** sba24098 <br>
**Course:** Higher Diploma in Science in Artificial Intelligence Applications

***

# Data Cleaning and Transformation


**Importing Apache Spark Libraries.**

In [1]:
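# Importing libraries
# Note: importing sum and round from pyspark.sql.functions shadows Python's built-in sum() and round() in this notebook
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, sum, when, regexp_replace, round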

**Creating a new Spark Session.**

In [3]:
# Creating new SparkSession
spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("local[*]") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()

**Reading Amazon-Products.csv from HDFS.**

In [None]:
# Reading Amazon-Products.csv from HDFS, with options for quoted fields that contain commas, quotes and line breaks
df = spark.read.option("header", "true") \
               .option("inferSchema", "true") \
               .option("multiLine", "true") \
               .option("escape", "\"") \
               .option("quote", "\"") \
               .csv("hdfs://localhost:9000/user1/big_data_ca1/data/Amazon-Products.csv")
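
The `quote`, `escape` and `multiLine` options matter when a field contains commas, double quotes or line breaks. As a rough illustration outside Spark, Python's standard `csv` module follows the same convention, where a quote inside a quoted field is written as two quotes. The sample records below are made up for the example and are not taken from the dataset.

```python
import csv
import io

# Hypothetical CSV text: the first product name contains a comma,
# a doubled quote and a line break inside one quoted field.
raw = 'name,actual_price\n"Fan, 16"" blade\nwith stand","₹1,999"\n"Plain item","₹500"\n'

# csv.reader reads "" inside a quoted field as a literal quote and keeps
# embedded newlines, which is what quote/escape/multiLine request from Spark.
rows = list(csv.reader(io.StringIO(raw, newline="")))

for row in rows:
    print(row)
```

Without this handling, the embedded line break would split the first product into two broken records.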

**Understanding the structure of the Spark Dataframe.**

In [None]:
# Reviewing the schema of the Spark Dataframe
df.printSchema()

In [None]:
# Dropping the columns that are not required for this task 
df = df.drop("_c0","image","link")
df.printSchema()

In [None]:
# Reviewing the rows of the Spark dataframe 
df.show()

In [13]:
# Get the number of rows
num_rows = df.count()
# Get the number of columns
num_columns = len(df.columns)

# Printing the shape (rows, columns)
print(f"Shape of the DataFrame: ({num_rows}, {num_columns})")


Shape of the DataFrame: (495197, 7)


**Checking the null values in the Spark Dataframe.**

In [None]:
# Counting the null or empty values per column
df.select([
    sum(when(col(column_name).isNull() | (col(column_name) == ""), 1).otherwise(0)).alias(column_name + "_nulls")
    for column_name in df.columns
]).show()
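
The expression above counts a value as missing when it is either NULL or an empty string. A minimal plain-Python sketch of the same rule, using hypothetical rows in which `None` stands in for NULL:

```python
# Hypothetical rows standing in for the DataFrame; None plays the role of NULL.
rows = [
    {"name": "Item A", "ratings": "4.1", "discount_price": None},
    {"name": "Item B", "ratings": "", "discount_price": "₹499"},
    {"name": "Item C", "ratings": None, "discount_price": None},
]


def count_missing(rows):
    """Count, per column, the values that are None or an empty string."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                counts[column] += 1
    return counts


missing = count_missing(rows)
print(missing)
```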

**Handling Null values in the Spark Dataframe.**

In [None]:
# Filtering out rows where both 'discount_price' and 'actual_price' are null.
df_clean = df.filter(~(col("discount_price").isNull() & col("actual_price").isNull()))

# Filling the remaining null values in ratings, no_of_ratings, discount_price and actual_price with 0
df_clean = df_clean.fillna({"ratings": 0,"no_of_ratings": 0, "discount_price":0, "actual_price":0})
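
The filter condition `~(A & B)` keeps every row in which at least one of the two prices is present; only rows with no price at all are dropped. The following plain-Python sketch walks through the same two steps on hypothetical records (the function name and sample values are illustrative only):

```python
def clean_prices(rows):
    """Drop rows with no price at all, then replace remaining missing values with 0."""
    fill_columns = ("ratings", "no_of_ratings", "discount_price", "actual_price")
    cleaned = []
    for row in rows:
        # Same condition as ~(discount is NULL AND actual is NULL):
        # skip the row only when both prices are missing.
        if row["discount_price"] is None and row["actual_price"] is None:
            continue
        filled = dict(row)
        for column in fill_columns:
            if filled[column] is None:
                filled[column] = 0
        cleaned.append(filled)
    return cleaned


# Hypothetical records; None stands in for NULL.
sample = [
    {"ratings": "4.0", "no_of_ratings": "3", "discount_price": None, "actual_price": "₹999"},
    {"ratings": None, "no_of_ratings": None, "discount_price": None, "actual_price": None},
    {"ratings": None, "no_of_ratings": None, "discount_price": "₹450", "actual_price": "₹500"},
]

result = clean_prices(sample)
print(result)
```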

In [None]:
# Counting the null or empty values per column
df_clean.select([
    sum(when(col(column_name).isNull() | (col(column_name) == ""), 1).otherwise(0)).alias(column_name + "_nulls")
    for column_name in df_clean.columns
]).show()

**Handling duplicate rows in the Spark Dataframe.**

In [None]:
# Removing the duplicate rows from the Spark DataFrame 
df_clean = df_clean.dropDuplicates()

**Cleaning the ratings and no_of_ratings columns.**

In [None]:
# Keeping only the rows where the 'ratings' column contains a valid number (integer or decimal).
df_clean = df_clean.filter(F.col('ratings').rlike(r'^[0-9]*\.?[0-9]+$'))

# Ensuring ratings are between 0 and 5.0
df_clean = df_clean.filter((F.col('ratings') >= 0) & (F.col('ratings') <= 5.0))
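
To see which values the pattern `^[0-9]*\.?[0-9]+$` lets through, the check can be tried in plain Python with the `re` module. This is only a sketch: the candidate strings are hypothetical, and `float()` stands in for the implicit cast Spark applies when comparing the column with numbers.

```python
import re

# Same pattern as in the rlike() filter above.
RATING_PATTERN = re.compile(r"^[0-9]*\.?[0-9]+$")


def is_valid_rating(value):
    """True when the string is a plain number between 0 and 5 inclusive."""
    if RATING_PATTERN.search(value) is None:
        return False
    return 0 <= float(value) <= 5.0


# Hypothetical inputs: numeric strings, stray text, a price and an out-of-range value.
candidates = ["4.1", "5", ".5", "4.", "Get", "FREE", "₹68.99", "7.2"]
valid = [value for value in candidates if is_valid_rating(value)]
print(valid)
```

Note that the pattern requires at least one digit after an optional decimal point, so `"4."` is rejected while `".5"` is accepted.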


In [None]:
# Removing commas from 'no_of_ratings'
df_clean = df_clean.withColumn("no_of_ratings", regexp_replace(col("no_of_ratings"), ",", ""))

# Keeping only the rows where the 'no_of_ratings' column contains a valid integer.
df_clean = df_clean.filter(col("no_of_ratings").rlike("^[0-9]+$"))
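
The order of the two steps matters: the thousands separators have to be removed first, otherwise a value such as `17,394` would fail the digits-only check. A small plain-Python sketch of the same idea, with a hypothetical helper name and made-up inputs:

```python
import re


def parse_rating_count(value):
    """Strip thousands separators; return an int, or None if the rest is not all digits."""
    stripped = value.replace(",", "")
    if re.fullmatch(r"[0-9]+", stripped) is None:
        return None
    return int(stripped)


for text in ["17,394", "0", "Only 2 left in stock."]:
    print(repr(text), "->", parse_rating_count(text))
```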

**Converting currency from Indian Rupee to Euro for the actual_price and discount_price columns.**

In [None]:
# Removing ₹ and commas, then converting to double
df_converted = df_clean.withColumn(
    "actual_price",
    regexp_replace(col("actual_price"), "[₹,]", "").cast("double")  
)

# Converting INR to EUR (using conversion rate: 1 INR = 0.011 EUR)
conversion_rate = 0.011
df_converted = df_converted.withColumn(
    "actual_price",
    round(col("actual_price") * conversion_rate, 2)
)

In [None]:
# Removing ₹ and commas, then converting to double
df_converted = df_converted.withColumn(
    "discount_price",
    regexp_replace(col("discount_price"), "[₹,]", "").cast("double") 
)

# Converting INR to EUR (using conversion rate: 1 INR = 0.011 EUR)
conversion_rate = 0.011
df_converted = df_converted.withColumn(
    "discount_price",
    round(col("discount_price") * conversion_rate, 2) 
)
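
Both price columns go through the same three steps: strip the currency symbol and commas, multiply by the conversion rate, and round to two decimals. The sketch below reproduces that arithmetic in plain Python. It uses `Decimal` with half-up rounding rather than doubles, which is a choice made for this example so the results are exact; the input strings are hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

CONVERSION_RATE = Decimal("0.011")  # 1 INR = 0.011 EUR, the rate used above


def inr_to_eur(price_text):
    """Strip the rupee sign and commas, convert to EUR, round half up to 2 decimals."""
    amount_inr = Decimal(re.sub(r"[₹,]", "", price_text))
    amount_eur = amount_inr * CONVERSION_RATE
    return amount_eur.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for text in ["₹42,990", "₹1,999", "₹500"]:
    print(text, "->", inr_to_eur(text))
```

For instance, 42,990 INR × 0.011 = 472.89 EUR.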

**Saving the converted Spark Dataframe as a CSV file to HDFS.**

In [None]:
# Writing the cleaned DataFrame to HDFS; Spark creates a directory of part files at this path
df_converted.write \
  .option("header", "true") \
  .option("quoteAll", "true") \
  .option("escape", "\"") \
  .csv("hdfs://localhost:9000/user1/big_data_ca1/data/Amazon-Products-Cleaned.csv")

# Loading CSV Data into HBase 

**Importing HappyBase Library to connect to HBase.**

In [4]:
import happybase

**Reading Amazon-Products-Cleaned.csv from HDFS.**

In [5]:
# Reading Amazon-Products-Cleaned.csv from HDFS with the same quoting options used when it was written
df = spark.read.option("header", "true") \
               .option("inferSchema", "true") \
               .option("multiLine", "true") \
               .option("escape", "\"") \
               .option("quote", "\"") \
               .csv("hdfs://localhost:9000/user1/big_data_ca1/data/Amazon-Products-Cleaned.csv")

**Reviewing the schema of Amazon-Products-Cleaned.csv.**

In [15]:
# Reviewing the schema of the Spark Dataframe
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- main_category: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- ratings: double (nullable = true)
 |-- no_of_ratings: integer (nullable = true)
 |-- discount_price: double (nullable = true)
 |-- actual_price: double (nullable = true)



In [7]:
df.show()

+--------------------+-------------+----------------+-------+-------------+--------------+------------+
|                name|main_category|    sub_category|ratings|no_of_ratings|discount_price|actual_price|
+--------------------+-------------+----------------+-------+-------------+--------------+------------+
|Voltas 1.5 Ton 3 ...|   appliances|Air Conditioners|    0.0|            0|        329.89|      472.89|
|Panasonic 1 Ton 4...|   appliances|Air Conditioners|    4.0|            3|        406.89|       556.6|
|Samsung 1.5 Ton 3...|   appliances|Air Conditioners|    1.7|            2|        557.47|       579.7|
|Usha Mist Air Icy...|   appliances|  All Appliances|    4.1|        17394|         31.89|       33.55|
|Wonderchef Nutri-...|   appliances|  All Appliances|    3.7|        11480|         32.71|        71.5|
|SUJATA Powermatic...|   appliances|  All Appliances|    4.4|         2848|         64.63|       89.79|
|Bajaj MX 3 Neo St...|   appliances|  All Appliances|    3.6|   

**Connecting to HBase.**

In [11]:
# Connecting to HBase with the happybase library 
connection = happybase.Connection('localhost')
connection.open()

**Creating and adding data to the table.**

In [12]:
# Defining the column families 
column_families = {
    'Item_Info': dict(),
    'Ratings_Info': dict(),
    'Pricing_Info': dict(),
}

# Creating the table amazon_products if it doesn't exist
table_name = 'amazon_products'
if table_name not in connection.tables():
    connection.create_table(table_name, column_families)

# Getting the table amazon_products
table = connection.table(table_name)

# Looping over the rows collected to the driver and adding each one to the table amazon_products
for idx, row in enumerate(df.rdd.collect(), start=1):
    # Generating a 6-character row key for each item
    row_key = str(idx).zfill(6)
    
    # Adding the data to the table
    table.put(row_key, {
        'Item_Info:name': row['name'],
        'Item_Info:main_category': row['main_category'],
        'Item_Info:sub_category': row['sub_category'],
        'Ratings_Info:ratings': str(row['ratings']),
        'Ratings_Info:no_of_ratings': str(row['no_of_ratings']),
        'Pricing_Info:discount_price': str(row['discount_price']),
        'Pricing_Info:actual_price': str(row['actual_price'])
    })
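
Collecting every row to the driver and issuing one `put` per row works, but it can be slow and memory-hungry for a dataset of this size. One possible alternative, sketched below under stated assumptions, sends the writes in groups through happybase's `Table.batch()`. The helper names `row_to_cells` and `load_rows`, the batch size of 1000 and the choice to store missing values as empty strings are all illustrative and not part of the original notebook.

```python
def row_to_cells(row):
    """Map one record to HBase cells; missing values become empty strings."""
    def text(value):
        return "" if value is None else str(value)

    return {
        "Item_Info:name": text(row["name"]),
        "Item_Info:main_category": text(row["main_category"]),
        "Item_Info:sub_category": text(row["sub_category"]),
        "Ratings_Info:ratings": text(row["ratings"]),
        "Ratings_Info:no_of_ratings": text(row["no_of_ratings"]),
        "Pricing_Info:discount_price": text(row["discount_price"]),
        "Pricing_Info:actual_price": text(row["actual_price"]),
    }


def load_rows(table, rows, batch_size=1000):
    """Write rows through a happybase batch so puts are sent in groups."""
    written = 0
    with table.batch(batch_size=batch_size) as batch:
        for idx, row in enumerate(rows, start=1):
            # Zero-padded keys keep HBase's lexicographic order equal to numeric order.
            batch.put(str(idx).zfill(6), row_to_cells(row))
            written = idx
    return written
```

In the notebook this could be called as `load_rows(table, df.toLocalIterator())`, which streams rows to the driver one partition at a time instead of materialising them all with `collect()`.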


**Closing the connection.**

In [None]:
connection.close()