### Heavy Load Benchmark: User Score Analysis and Average Computation

This notebook runs a heavy-load benchmark of PySpark on a large dataset of user scores. The goal is to process the dataset efficiently, compute per-anime rating statistics, and produce a trimmed dataset for downstream tasks.

The workflow is divided into two key stages:

1. **Data Preprocessing and Cleaning**:
   - The raw dataset containing user scores is preprocessed to extract relevant columns, clean the data, and structure it into a trimmed and manageable format.
   - The dataset is saved as a new file to serve as the basis for further analysis.

2. **Average Score Computation**:
   - The preprocessed dataset is analyzed to calculate the average score for each anime based on user ratings.
   - PySpark’s distributed computation capabilities are leveraged to process the data efficiently, demonstrating its scalability under heavy computational loads.

This benchmark evaluates the Spark session's ability to manage memory, allocate resources across nodes, and execute parallel computations, providing insights into the framework's performance under heavy workloads.


In [37]:
%reset

### Data Loading and Preprocessing for User Scores

This section focuses on loading and preprocessing a large dataset of user scores, preparing it for further analysis. Below is a summary of the steps:

1. **Spark Session Initialization**:
   - A Spark session is configured for distributed processing on YARN, with explicit settings for driver and executor memory, cores per executor, and the number of executor instances.
   - With the values used here, the job requests 6 executors with 2 cores and 2 GB of memory each.

2. **Data Loading**:
   - The dataset is loaded as an RDD from a CSV file. Each line of the file is treated as a string, allowing flexible processing.

3. **Header Extraction and Filtering**:
   - The header row is extracted to identify column names.
   - The header is then filtered out from the dataset to avoid processing it as part of the data.

4. **Column Selection**:
   - Each line in the dataset is parsed into fields using a CSV reader.
   - Selected fields (`user_id`, `anime_id`, and `rating`) are extracted to focus on the relevant data for analysis.

5. **DataFrame Creation**:
   - The processed RDD is converted into a Spark DataFrame, with the column names `user_id`, `anime_id`, and `rating` assigned. Because the fields come from parsed text, all three columns hold strings at this stage.
   - This structured format facilitates data manipulation and storage in subsequent steps.

6. **Verification**:
   - The resulting DataFrame is displayed to verify the correctness of the preprocessing and ensure the selected columns contain valid data.

7. **Data Export**:
   - The cleaned and structured DataFrame is saved in CSV format. Spark writes the output as a directory of part files, each starting with a header row, and the `overwrite` mode replaces any existing output at the specified path.

This section prepares a trimmed and structured version of the user scores dataset, making it ready for downstream tasks such as data analysis or feature engineering.
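
The column-selection step can be illustrated outside Spark. The sketch below parses a single line with Python's `csv` module, as the `process_line` function in this notebook does, and compares it with a plain `split(",")`. The sample row and its column layout are hypothetical, chosen only to show why a CSV reader is needed when a field contains a quoted comma; `parse_csv_line` is an illustrative name.

```python
import csv
from io import StringIO

def parse_csv_line(line):
    """Parse one CSV-formatted line into a list of fields."""
    reader = csv.reader(StringIO(line), quotechar='"', skipinitialspace=True)
    return next(reader)

# Hypothetical row; the real file's column layout is assumed, not confirmed.
sample = '1,user_a,21,"Some Title, Part 2",9'

fields = parse_csv_line(sample)
selected = (fields[0], fields[2], fields[4])  # user_id, anime_id, rating
print(selected)  # ('1', '21', '9')

# A naive split breaks the quoted title into two fields and shifts the rating.
naive = sample.split(",")
print(len(fields), len(naive))  # 5 6
```

Selecting by position only works if every line yields the same number of fields, which is why the quoted comma matters here.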


In [38]:
from pyspark.sql import SparkSession
import csv
from io import StringIO

# Variables for Spark configuration
spark_master = "yarn"
driver_memory = "2048m"
am_memory = "1024m"
executor_memory = "2g"
executor_cores = "2"
executor_instances = "6"
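# Note: spark.cores.max is documented for standalone and Mesos coarse-grained
# deployments; on YARN the total is executor_instances * executor_cores (6 * 2 = 12).
max_cores = "12"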

# Create a Spark session with the configurations
spark = SparkSession.builder \
    .appName("Benchmark heavy load 3 config 5") \
    .master(spark_master) \
    .config("spark.driver.memory", driver_memory) \
    .config("spark.yarn.am.memory", am_memory) \
    .config("spark.executor.memory", executor_memory) \
    .config("spark.executor.cores", executor_cores) \
    .config("spark.executor.instances", executor_instances) \
    .config("spark.cores.max", max_cores) \
    .getOrCreate()


# Define the file path
filepath = "project/users-score-2023.csv"

# Read the CSV file as an RDD of strings
rdd = spark.sparkContext.textFile(filepath)

def process_line(line):
    """Parse one CSV line into a list of fields, honoring quoted commas."""
    reader = csv.reader(StringIO(line), quotechar='"', skipinitialspace=True)
    return next(reader)

header = rdd.first()


# Filter out the header, parse each line, and keep user_id, anime_id, and rating
selected_columns_rdd = rdd.filter(lambda line: line != header) \
    .map(process_line) \
    .map(lambda row: (row[0], row[2], row[4]))

schema = ["user_id", "anime_id", "rating"]

# Create a DataFrame from the filtered RDD
users_df = spark.createDataFrame(selected_columns_rdd, schema=schema)
users_df.show(truncate=False)


output_path = "project/users-score-2023-trimmed"

# Save the DataFrame in CSV format (a directory of part files), replacing any existing output
users_df.write.mode("overwrite").csv(output_path, header=True)



+-------+--------+------+
|user_id|anime_id|rating|
+-------+--------+------+
|1      |21      |9     |
|1      |48      |7     |
|1      |320     |5     |
|1      |49      |8     |
|1      |304     |8     |
|1      |306     |8     |
|1      |53      |7     |
|1      |47      |5     |
|1      |591     |6     |
|1      |54      |7     |
|1      |55      |5     |
|1      |56      |6     |
|1      |57      |9     |
|1      |368     |5     |
|1      |68      |7     |
|1      |889     |9     |
|1      |1519    |7     |
|1      |58      |8     |
|1      |1222    |7     |
|1      |458     |4     |
+-------+--------+------+
only showing top 20 rows




### Average Score Computation for Anime

This section focuses on calculating the average score for each anime based on user ratings in a preprocessed dataset. The steps are as follows:

1. **Data Loading**:
   - The trimmed dataset containing user scores is loaded as an RDD from the specified file path.
   - Each line in the file is read as a string, and the header row is identified for exclusion.

2. **Data Parsing and Filtering**:
   - Every line equal to the header is filtered out, which also removes the header repeated at the top of each part file, so only data records are processed.
   - Each line is split into fields, and the relevant columns (`anime_id` and `rating`) are extracted. Ratings are converted into floating-point numbers for numerical computations.

3. **Score Aggregation**:
   - Each record is mapped into a key-value pair where the key is the `anime_id` and the value is a tuple containing the score and a count of 1.
   - The `reduceByKey` function is used to aggregate the scores and counts for each `anime_id`. This step calculates the total score and the number of ratings for each anime.

4. **Average Score Calculation**:
   - After aggregation, the average score for each anime is computed by dividing the total score by the count of ratings. This step results in an RDD containing `anime_id` as the key and the computed average score as the value.

5. **Results Collection**:
   - The first 50 results are retrieved with `take(50)` and printed to check the computation.
   - Each result displays the `anime_id` and its corresponding average score, formatted to two decimal places.

6. **Stopping the Spark Session**:
   - The Spark session is stopped after the computation is complete to release resources.

This section provides a scalable and efficient way to compute average scores for anime titles based on user ratings, leveraging PySpark's distributed processing capabilities.
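
The aggregation logic can be sketched in plain Python without a cluster. The example below mirrors the `(score, 1)` pairing and the pairwise combine function passed to `reduceByKey`, using `functools.reduce` on a small hand-made list. The `ratings` values and the helper names `combine` and `average_by_key` are illustrative assumptions, not part of the benchmark data or the notebook's code.

```python
from collections import defaultdict
from functools import reduce

def combine(a, b):
    """Merge two (sum, count) pairs; same logic as the reduceByKey lambda."""
    return (a[0] + b[0], a[1] + b[1])

def average_by_key(pairs):
    """Compute the mean value per key from (key, value) pairs."""
    grouped = defaultdict(list)
    for key, score in pairs:
        grouped[key].append((score, 1))  # map step: value -> (score, 1)
    totals = {key: reduce(combine, vals) for key, vals in grouped.items()}
    return {key: total / count for key, (total, count) in totals.items()}

# Illustrative (anime_id, rating) pairs, not taken from the dataset.
ratings = [("21", 9.0), ("48", 7.0), ("21", 8.0), ("48", 6.0), ("21", 7.0)]

averages = average_by_key(ratings)
print(averages)  # {'21': 8.0, '48': 6.5}
```

Carrying a `(sum, count)` pair rather than a running average keeps the combine step associative and commutative, so partial results from different partitions can be merged in any order before the final division.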


In [39]:
# Path to the trimmed dataset (the directory of part files written in the previous step)
filepath = "project/users-score-2023-trimmed"

# Read the file as an RDD
rdd = spark.sparkContext.textFile(filepath)

# Extract the header
header = rdd.first()

# Split a line on commas; sufficient here because the trimmed data holds
# three simple fields (user_id, anime_id, rating) with no embedded commas
def process_line(line):
    return line.split(",")

# Filter out the header and parse the data
data_rdd = rdd.filter(lambda line: line != header) \
              .map(process_line) \
              .map(lambda cols: (cols[1], float(cols[2])))  # (anime_id, score)

# Compute the sum and count for each anime_id with map and reduceByKey
# Step 1: Map each record to (anime_id, (score, 1))
# Step 2: Use reduceByKey to add up the scores and counts per anime_id
# Step 3: Divide the total score by the count to get the average
average_rdd = data_rdd.map(lambda x: (x[0], (x[1], 1))) \
                      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
                      .map(lambda x: (x[0], x[1][0] / x[1][1]))


# Take the first 50 results and display them
results = average_rdd.take(50)
for anime_id, avg_score in results:
    print(f"Anime ID: {anime_id}, Average Score: {avg_score:.2f}")


# Stop the Spark session
spark.stop()



Anime ID: 11813, Average Score: 7.74
Anime ID: 16742, Average Score: 7.23
Anime ID: 22537, Average Score: 6.45
Anime ID: 53, Average Score: 7.23
Anime ID: 16201, Average Score: 6.99
Anime ID: 54, Average Score: 7.21
Anime ID: 28249, Average Score: 7.69
Anime ID: 2251, Average Score: 8.56
Anime ID: 7674, Average Score: 8.25
Anime ID: 12365, Average Score: 8.62
Anime ID: 13535, Average Score: 7.82
Anime ID: 740, Average Score: 7.63
Anime ID: 1239, Average Score: 7.53
Anime ID: 1278, Average Score: 7.30
Anime ID: 3076, Average Score: 7.38
Anime ID: 1498, Average Score: 7.28
Anime ID: 889, Average Score: 8.22
Anime ID: 3389, Average Score: 6.66
Anime ID: 2341, Average Score: 6.37
Anime ID: 1474, Average Score: 7.52
Anime ID: 1476, Average Score: 7.61
Anime ID: 225, Average Score: 6.62
Anime ID: 3784, Average Score: 8.65
Anime ID: 504, Average Score: 6.77
Anime ID: 1611, Average Score: 5.96
Anime ID: 27631, Average Score: 7.17
Anime ID: 16099, Average Score: 6.18
Anime ID: 4999, Average Sco