### **Team Members**


| Name                 | ID            |
|--------------------- |---------------|
| Salma Mamdoh Sabry   | 20210162      |
| Roaa Talat Mohamed   | 20210138      |
| Shawky Ebrahim Ahmed | 20210184      |
| Belal Ahmed Eid      | 20210092      |


📝 **Assignment Introduction**
------------------------------

### 📘 **Dataset Background**

The dataset comes from the **Wikimedia Foundation**, which runs Wikipedia and other open-knowledge platforms. It contains **page view statistics** collected between **00:00 and 01:00 on January 1, 2016**.

Each line in the file represents the number of views for a specific page in that hour.

---

### 🔢 **Schema**

Each line has 4 fields separated by whitespace:

| Field         | Description                                               |
|---------------|-----------------------------------------------------------|
| `Project Code`| Project identifier (e.g. `en` for English Wikipedia)      |
| `Page Title`  | Title of the accessed page (e.g. `Political_status_of_Crimea`) |
| `Page Hits`   | Number of times this page was accessed in that hour       |
| `Page Size`   | Size of the page (likely in bytes)                        |


In [6]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col
from pyspark.sql.types import LongType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
import timeit
from pyspark.sql import functions as F
from itertools import combinations
import nltk
from nltk.corpus import stopwords

In [7]:
spark = SparkSession.builder \
    .appName("WikimediaPageViews") \
    .master("local[*]")\
    .getOrCreate()

sc = spark.sparkContext

#### Data Loading & Validation

In [8]:
data_path = "data.out" 
raw_data = sc.textFile(data_path)
print(f"Total lines loaded: {raw_data.count():,}")

Total lines loaded: 3,324,129


In [9]:
def check_data_quality(rdd):
    empty_lines = rdd.filter(lambda x: len(x.strip()) == 0).count()
    malformed_lines = rdd.filter(lambda x: len(x.strip().split()) != 4).count()
    
    print("=== Data Quality Report ===")
    print(f"Total lines: {rdd.count():,}")
    print(f"Empty lines: {empty_lines:,}")
    print(f"Malformed lines: {malformed_lines:,}")
    print(f"Valid lines: {rdd.count() - empty_lines - malformed_lines:,}")
    
    if malformed_lines > 0:
        print("\nSample malformed lines:")
        for line in rdd.filter(lambda x: len(x.strip().split()) != 4).take(5):
            print(line)

check_data_quality(raw_data)

=== Data Quality Report ===
Total lines: 3,324,129
Empty lines: 0
Malformed lines: 103
Valid lines: 3,324,026

Sample malformed lines:
ak.v  2 3606
ar  526 21232283
ar.s  4 38267
ay.v  2 3606
az  1 19081


In [10]:
def parse_and_validate(line):
    """Parse line and validate it has 4 parts with the correct types"""
    parts = line.strip().split()
    if len(parts) != 4:
        return None
    try:
        project_code = parts[0]
        page_title = parts[1]
        page_hits = int(parts[2])
        page_size = int(parts[3])
        return (project_code, page_title, page_hits, page_size)
    except ValueError:
        return None

valid_lines = raw_data.filter(lambda line: parse_and_validate(line) is not None).cache()

# Parsed and structured RDD from valid_lines
parsed_rdd = valid_lines.map(parse_and_validate)

# Count and display basic info
total_count = raw_data.count()
valid_count = valid_lines.count()
print(f"Original count: {total_count:,}")
print(f"Valid lines count: {valid_count:,}")
print(f"Removed {total_count - valid_count:,} malformed lines")

# Simulated schema and sample data
print("\nSchema: (project_code: str, page_title: str, page_hits: int, page_size: int)\n")
for row in parsed_rdd.take(5):
    print(row)

Original count: 3,324,129
Valid lines count: 3,324,026
Removed 103 malformed lines

Schema: (project_code: str, page_title: str, page_hits: int, page_size: int)

('aa', '271_a.C', 1, 4675)
('aa', 'Category:User_th', 1, 4770)
('aa', 'Chiron_Elias_Krase', 1, 4694)
('aa', 'Dassault_rafaele', 2, 9372)
('aa', 'E.Desv', 1, 4662)


#### 1. Page Size Analysis

In [11]:
print("=== Map-Reduce Approach ===")

def map_reduce_stats():
    sizes = valid_lines.map(lambda line: int(line.split()[3]))
    count = sizes.count()
    total = sizes.reduce(lambda a, b: a + b)
    min_size = sizes.reduce(lambda a, b: a if a < b else b)
    max_size = sizes.reduce(lambda a, b: a if a > b else b)
    
    return {
        'min': min_size,
        'max': max_size,
        'avg': total / count
    }

# Time the execution
start_time = timeit.default_timer()
mr_stats = map_reduce_stats()
mr_time = timeit.default_timer() - start_time

print(f"Min size: {mr_stats['min']:,} bytes")
print(f"Max size: {mr_stats['max']:,} bytes")
print(f"Avg size: {mr_stats['avg']:,.2f} bytes")
print(f"Execution time: {mr_time:.4f} seconds")

=== Map-Reduce Approach ===
Min size: 0 bytes
Max size: 141,180,155,987 bytes
Avg size: 132,215.80 bytes
Execution time: 15.6108 seconds


In [12]:
print("=== Spark Loops Approach ===")


def spark_loops_stats():
    sizes_list = []

    # Pull every valid line to the driver, then work in plain Python
    for line in valid_lines.collect():
        size = int(line.split()[3])
        sizes_list.append(size)

    if not sizes_list:
        return {'min': 0, 'max': 0, 'avg': 0}

    count = len(sizes_list)
    total = 0
    min_size = sizes_list[0]
    max_size = sizes_list[0]

    for size in sizes_list:
        total += size
        if size < min_size:
            min_size = size
        if size > max_size:
            max_size = size

    return {
        'min': min_size,
        'max': max_size,
        'avg': total / count
    }

# Time the execution
start_time = timeit.default_timer()
loop_stats = spark_loops_stats()
loop_time = timeit.default_timer() - start_time


print(f"Min size: {loop_stats['min']:,} bytes")
print(f"Max size: {loop_stats['max']:,} bytes")
print(f"Avg size: {loop_stats['avg']:,.2f} bytes")
print(f"Execution time: {loop_time:.4f} seconds")

=== Spark Loops Approach ===
Min size: 0 bytes
Max size: 141,180,155,987 bytes
Avg size: 132,215.80 bytes
Execution time: 2.8708 seconds


In [13]:
# Performance Comparison 
print("\n=== Performance Comparison ===")
print(f"Map-Reduce Time: {mr_time:.4f} sec")
print(f"Spark Loops Time: {loop_time:.4f} sec")
print(f"Difference: {abs(mr_time - loop_time):.4f} sec")
print(f"Faster by: {max(mr_time, loop_time)/min(mr_time, loop_time):.2f}x")

schema = StructType([
    StructField("Approach", StringType(), True),
    StructField("Min", DoubleType(), True),
    StructField("Max", DoubleType(), True),
    StructField("Avg", DoubleType(), True),
    StructField("Time_sec", DoubleType(), True)
])

data = [
    ("Map-Reduce", float(mr_stats['min']), float(mr_stats['max']), float(mr_stats['avg']), float(mr_time)),
    ("Spark Loops", float(loop_stats['min']), float(loop_stats['max']), float(loop_stats['avg']), float(loop_time))
]

results_df = spark.createDataFrame(data, schema)
results_df.show()


=== Performance Comparison ===
Map-Reduce Time: 15.6108 sec
Spark Loops Time: 2.8708 sec
Difference: 12.7400 sec
Faster by: 5.44x
+-----------+---+----------------+------------------+------------------+
|   Approach|Min|             Max|               Avg|          Time_sec|
+-----------+---+----------------+------------------+------------------+
| Map-Reduce|0.0|1.41180155987E11|132215.79814237313|15.610841399990022|
|Spark Loops|0.0|1.41180155987E11|132215.79814237313|2.8708305999898585|
+-----------+---+----------------+------------------+------------------+



#### ✅ Observations:
- Both approaches produced the **same statistics** in terms of `Min`, `Max`, and `Avg` page size.
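- However, the **Spark Loops approach outperformed the traditional Map-Reduce**, completing the task in about a fifth of the time (2.87 s versus 15.61 s, roughly 5.4x faster).
- Part of the gap comes from the setup: `map_reduce_stats` triggers four separate Spark actions (`count` plus three `reduce` calls), while the loop version pulls the cached data to the driver once with `collect()`. The loop version only works because the whole dataset fits in driver memory.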
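
One way to cut the number of passes is to compute count, sum, minimum, and maximum together. The sketch below is plain Python, not the notebook's code: `seq_op` folds one value into a running summary and `comb_op` merges two summaries, which is the shape of function that PySpark's `RDD.aggregate(zeroValue, seqOp, combOp)` expects. The `partitions` list and all names are hypothetical stand-ins for illustration.

```python
from functools import reduce

# Running summary: (count, total, minimum, maximum).
ZERO = (0, 0, float("inf"), float("-inf"))

def seq_op(acc, size):
    """Fold one page size into a partition-level summary."""
    count, total, lo, hi = acc
    return (count + 1, total + size, min(lo, size), max(hi, size))

def comb_op(a, b):
    """Merge the summaries of two partitions."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

# Hypothetical page sizes split into two "partitions".
partitions = [[4675, 4770, 4694], [9372, 4662]]

# Summarize each partition independently, then merge the summaries.
per_partition = [reduce(seq_op, part, ZERO) for part in partitions]
count, total, min_size, max_size = reduce(comb_op, per_partition, ZERO)

stats = {"min": min_size, "max": max_size, "avg": total / count}
print(stats)
```

With a real RDD of sizes, the same two functions could in principle be passed to `aggregate` so that all statistics come from a single action; whether that closes the timing gap on this dataset has not been measured here.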


#### 2. Count `The...` Titles

In [14]:
print("=== Map-Reduce Approach: Determine the number of page titles that start with the article “The”. How many of those page titles are not part of the English project  ===")

def map_reduce_count_the_titles(data):
    parsed_lines = data.map(lambda line: line.split())
    the_titles = parsed_lines.filter(lambda record: record[1].startswith('The'))
    the_titles_count = the_titles.count()
    non_en_the = the_titles.filter(lambda record: record[0] != 'en')
    non_en_the_count = non_en_the.count()
    return the_titles_count, non_en_the_count

# Time the execution
start_time = timeit.default_timer()
the_titles_count, non_en_the_count = map_reduce_count_the_titles(valid_lines)
mr_time = timeit.default_timer() - start_time

print("Map-Reduce Approach:")
print("Titles starting with The: ", the_titles_count)
print("Non-English titles starting with The: ", non_en_the_count)
print(f"\nExecution time: {mr_time:.4f} seconds")


=== Map-Reduce Approach: Determine the number of page titles that start with the article “The”. How many of those page titles are not part of the English project  ===
Map-Reduce Approach:
Titles starting with The:  45020
Non-English titles starting with The:  10292

Execution time: 8.4738 seconds


In [15]:
print("=== Spark-loops: Determine the number of page titles that start with the article “The”. How many of those page titles are not part of the English project  ===")

def spark_loops_count_the_titles(data):
    collected_data = data.collect()
    the_titles_loops = 0
    non_en_the_loops = 0

    for line in collected_data:
        record = line.split()
        if record[1].startswith('The'):
            the_titles_loops += 1
            if record[0] != "en":
                non_en_the_loops += 1
    return the_titles_loops, non_en_the_loops

start_time = timeit.default_timer()
the_titles_loops, non_en_the_loops = spark_loops_count_the_titles(valid_lines)
loop_time = timeit.default_timer() - start_time

print("\nSpark Loops:")
print("Titles starting with The: ", the_titles_loops)
print("Non-English titles starting with The: ", non_en_the_loops)
print(f"\nExecution time: {loop_time:.4f} seconds")

=== Spark-loops: Determine the number of page titles that start with the article “The”. How many of those page titles are not part of the English project  ===

Spark Loops:
Titles starting with The:  45020
Non-English titles starting with The:  10292

Execution time: 2.1718 seconds


In [16]:
# Performance Comparison 
print("\n=== Performance Comparison ===")
print(f"Map-Reduce Time: {mr_time:.4f} sec")
print(f"Spark Loops Time: {loop_time:.4f} sec")
print(f"Difference: {abs(mr_time - loop_time):.4f} sec")
print(f"Faster by: {max(mr_time, loop_time)/min(mr_time, loop_time):.2f}x")


=== Performance Comparison ===
Map-Reduce Time: 8.4738 sec
Spark Loops Time: 2.1718 sec
Difference: 6.3020 sec
Faster by: 3.90x


#### ✅ Observations:
- The **Spark Loop approach** outperforms the **Map-Reduce approach** in terms of execution time by a significant margin (about 3.9x faster: 2.17 s versus 8.47 s).
- Both approaches produce identical results, confirming that the logic and data are consistent across both methods.
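
Note that `startswith('The')` also matches titles such as `Theory` or `Thermodynamics`, where "The" is not the article. If the intent is to count only titles whose first word is "The", a stricter predicate is one option. The sketch below assumes words in a title are separated by underscores, as in the schema example; the function name and sample titles are hypothetical.

```python
def starts_with_article_the(title: str) -> bool:
    """True only if the first underscore-separated word is exactly 'The'."""
    return title.split("_", 1)[0] == "The"

# Hypothetical sample titles for illustration.
samples = ["The_Beatles", "The", "Theory_of_relativity", "Them", "the_end"]

# Compare the prefix check used in the notebook with the stricter check.
for title in samples:
    print(title, title.startswith("The"), starts_with_article_the(title))
```

Using the stricter predicate would likely lower both counts reported above; the actual numbers would need a rerun on the dataset.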

#### 3. Unique Title Terms

In [17]:
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\E.J.S\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
print("=== Map-Reduce Approach: Determine the number of unique terms appearing in the page titles ===")

def map_reduce_unique_title_terms():
    # Split each title into terms on underscores
    terms = valid_lines.flatMap(lambda line: line.split()[1].split("_"))
    # Lowercase, keep alphanumeric terms only, and drop English stop words
    normalized_terms = terms.map(lambda term: term.lower())\
                            .filter(lambda term: term.isalnum())\
                            .filter(lambda term: term not in stop_words)
    unique_terms = normalized_terms.distinct()
    return unique_terms

# Time the execution
start_time = timeit.default_timer()
unique_terms = map_reduce_unique_title_terms()
unique_terms_count = unique_terms.count()
mr_time = timeit.default_timer() - start_time

print(f"Total number of unique terms: {unique_terms_count}")
print(f"\nExecution time: {mr_time:.4f} seconds")

=== Map-Reduce Approach: Determine the number of unique terms appearing in the page titles ===
Total number of unique terms: 850479

Execution time: 11.8674 seconds


In [19]:
print("=== Loop-Based Approach Matching Map-Reduce Output ===")

def spark_loops_unique_title_terms():
    unique_terms = set()
    for line in valid_lines.collect():
        terms = line.strip().split()[1].split("_")
        for term in terms:
            term = term.lower().strip()
            # Same rules as the Map-Reduce version: alphanumeric, not a stop word
            if term.isalnum() and term not in stop_words:
                unique_terms.add(term)
    return unique_terms

start_time = timeit.default_timer()
unique_terms = spark_loops_unique_title_terms()
unique_terms_count = len(unique_terms)
loop_time = timeit.default_timer() - start_time

print(f"Total number of unique terms: {unique_terms_count}")
print(f"\nExecution time: {loop_time:.4f} seconds")

=== Loop-Based Approach Matching Map-Reduce Output ===
Total number of unique terms: 850479

Execution time: 4.5771 seconds


In [20]:
# Performance Comparison 
print("\n=== Performance Comparison ===")
print(f"Map-Reduce Time: {mr_time:.4f} sec")
print(f"Spark Loops Time: {loop_time:.4f} sec")
print(f"Difference: {abs(mr_time - loop_time):.4f} sec")
print(f"Faster by: {max(mr_time, loop_time)/min(mr_time, loop_time):.2f}x")


=== Performance Comparison ===
Map-Reduce Time: 11.8674 sec
Spark Loops Time: 4.5771 sec
Difference: 7.2904 sec
Faster by: 2.59x


#### ✅ Observations:

- The **Spark Loop approach** outperforms the **Map-Reduce approach** in terms of execution time by a significant margin (about 2.6x faster: 4.58 s versus 11.87 s).
- Both approaches produce identical results, confirming that the logic and data are consistent across both methods.
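
To make the filtering rules concrete, the sketch below applies the same three steps (lowercase, keep only alphanumeric terms, drop stop words) to a few titles that appear elsewhere in this notebook. It uses a tiny hand-written stop-word set, `demo_stop_words`, as a stand-in for the NLTK list, so the result is illustrative only.

```python
# Stand-in for stopwords.words('english'); the real NLTK list is much longer.
demo_stop_words = {"the", "of", "a", "in"}

def title_terms(title):
    """Split a title on underscores and keep normalized, meaningful terms."""
    kept = []
    for term in title.split("_"):
        term = term.lower()
        # isalnum() rejects empty strings and anything containing
        # punctuation, e.g. '3166-1' or 'category:user'.
        if term.isalnum() and term not in demo_stop_words:
            kept.append(term)
    return kept

titles = ["Political_status_of_Crimea", "ISO_3166-1", "Category:User_th"]
demo_unique_terms = set()
for title in titles:
    demo_unique_terms.update(title_terms(title))

print(sorted(demo_unique_terms))
```

One consequence of the `isalnum()` filter is that terms containing hyphens, colons, or dots are discarded entirely rather than cleaned, which affects the unique-term count.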

#### 4. Title Count Analysis

In [21]:
print("=== Map-Reduce Approach for Title Counts ===")

def map_reduce_title_counts():
    title_ones = valid_lines.map(lambda line: (line.split()[1], 1))
    title_counts = title_ones.reduceByKey(lambda a, b: a + b)
    
    sorted_titles_rdd = title_counts.map(lambda x: (x[1], x[0])) \
                                    .sortByKey(ascending=False) \
                                    .map(lambda x: (x[1], x[0]))  # Back to (title, count)
    
    total_unique = title_counts.count()  
    top_20_titles = sorted_titles_rdd.take(20) 

    return {
        'total_unique': total_unique,
        'top_titles': top_20_titles,
        'sorted_titles_rdd': sorted_titles_rdd  
    }

# Time execution
start_time = timeit.default_timer()
title_stats = map_reduce_title_counts()
mr_time = timeit.default_timer() - start_time

# Print results
print(f"Total unique titles: {title_stats['total_unique']:,}")
print("\nTop 20 titles by count:")
for idx, (title, count) in enumerate(title_stats['top_titles'], 1):
    print(f"{idx:2d}. {title[:50]:<50} {count:>8,}")

print(f"\nExecution time: {mr_time:.4f} seconds")

=== Map-Reduce Approach for Title Counts ===
Total unique titles: 2,968,690

Top 20 titles by count:
 1. water                                                   118
 2. 1863                                                    106
 3. Berlin                                                  101
 4. Google                                                  101
 5. Linux                                                    98
 6. Main_Page                                                90
 7. ISO_3166-1                                               88
 8. Microsoft_Windows                                        87
 9. HTML                                                     86
10. Index.php                                                86
11. Frank_Lloyd_Wright                                       85
12. PHP                                                      83
13. ISO_4217                                                 76
14. Boston                                                   75
15.

In [22]:
def spark_loops_title_counts():
    all_lines = valid_lines.collect()  

    title_counts = {}

    for line in all_lines:
        title = line.split()[1]  
        title_counts[title] = title_counts.get(title, 0) + 1  

    sorted_titles = sorted(title_counts.items(), key=lambda x: -x[1])

    return {
        'total_unique': len(sorted_titles),
        'top_titles': sorted_titles[:20],  
        'all_counts': sorted_titles        
    }

# Time the execution
start_time = timeit.default_timer()
title_stats_loop = spark_loops_title_counts()
loop_time = timeit.default_timer() - start_time

# Print results
print(f"Total unique titles: {title_stats_loop['total_unique']:,}")
print("\nTop 20 titles by count:")
for idx, (title, count) in enumerate(title_stats_loop['top_titles'], 1):
    print(f"{idx:2d}. {title[:50]:<50} {count:>8,}")

print(f"\nExecution time: {loop_time:.4f} seconds")

Total unique titles: 2,968,690

Top 20 titles by count:
 1. water                                                   118
 2. 1863                                                    106
 3. Berlin                                                  101
 4. Google                                                  101
 5. Linux                                                    98
 6. Main_Page                                                90
 7. ISO_3166-1                                               88
 8. Microsoft_Windows                                        87
 9. Index.php                                                86
10. HTML                                                     86
11. Frank_Lloyd_Wright                                       85
12. PHP                                                      83
13. ISO_4217                                                 76
14. Boston                                                   75
15. Special:Search                              

In [23]:
# Performance Comparison for Title Counts 
print("\n=== Performance Comparison for Title Counts ===")
print(f"Map-Reduce Time: {mr_time:.4f} sec")
print(f"Python Loop Time: {loop_time:.4f} sec")
print(f"Difference: {abs(mr_time - loop_time):.4f} sec")
print(f"Python Loop was {mr_time/loop_time:.1f}x faster")

schema = StructType([
    StructField("Approach", StringType(), True),
    StructField("Unique_Titles", LongType(), True),
    StructField("Top_Count", LongType(), True),
    StructField("Time_sec", DoubleType(), True)
])

top_count_mr = title_stats['top_titles'][0][1] if title_stats['top_titles'] else 0
top_count_loop = title_stats_loop['top_titles'][0][1] if title_stats_loop['top_titles'] else 0

data = [
    Row("Map-Reduce", title_stats['total_unique'], top_count_mr, float(mr_time)),
    Row("Python Loop", title_stats_loop['total_unique'], top_count_loop, float(loop_time))
]

title_comparison_df = spark.createDataFrame(data, schema)

print("\nPerformance Comparison Results:")
title_comparison_df.show(truncate=False)

print("\n=== Verification ===")
print(f"Unique counts match: {title_stats['total_unique'] == title_stats_loop['total_unique']}")
print(f"Top count match: {top_count_mr == top_count_loop}")



=== Performance Comparison for Title Counts ===
Map-Reduce Time: 25.6639 sec
Python Loop Time: 3.3709 sec
Difference: 22.2930 sec
Python Loop was 7.6x faster

Performance Comparison Results:
+-----------+-------------+---------+-----------------+
|Approach   |Unique_Titles|Top_Count|Time_sec         |
+-----------+-------------+---------+-----------------+
|Map-Reduce |2968690      |118      |25.66392029999406|
|Python Loop|2968690      |118      |3.37089219999325 |
+-----------+-------------+---------+-----------------+


=== Verification ===
Unique counts match: True
Top count match: True


#### ✅ Observations:
- The **Python Loop approach** outperforms the **Map-Reduce approach** in terms of execution time by a significant margin (about 7.6x faster: 3.37 s versus 25.66 s).
- Both approaches produce identical results for **Unique Titles** and **Top Count**, confirming that the logic and data are consistent across both methods.
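
The two top-20 lists agree on counts but order the tied titles `HTML` and `Index.php` (both 86) differently, because neither sort specifies what to do with ties. A minimal sketch of a deterministic ordering, by descending count and then by title, follows. The `sample_counts` dictionary is a small hand-made sample using counts from the output above, and `top_n` is a hypothetical helper.

```python
import heapq

# Small sample of title counts taken from the output above.
sample_counts = {"HTML": 86, "Index.php": 86, "PHP": 83, "water": 118}

def top_n(counts, n):
    """Return the n most frequent titles, breaking ties alphabetically."""
    # nsmallest on (-count, title) avoids sorting the whole dictionary.
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))

top_three = top_n(sample_counts, 3)
print(top_three)
```

With an RDD of `(title, count)` pairs, `takeOrdered(20, key=lambda kv: (-kv[1], kv[0]))` is a comparable option that avoids a full `sortByKey`; it has not been timed against the notebook's version here.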

#### 5. Grouping by Title

In [24]:
print("=== Map-Reduce Approach: Combine data of pages with the same title ===")

def map_reduce_combine_all_by_title():
    title_data_rdd = valid_lines.map(lambda line: line.split()) \
                                .filter(lambda parts: len(parts) == 4) \
                                .map(lambda parts: (parts[1], (parts[0], int(parts[2]), int(parts[3]))))
    
    grouped_by_title = title_data_rdd.groupByKey()
    
    return grouped_by_title

# Time execution
start_time = timeit.default_timer()
combined_rdd = map_reduce_combine_all_by_title()
sample_combined = combined_rdd.take(5)
mr_time = timeit.default_timer() - start_time

# Show results
print("\nSample output (first 5 grouped titles):\n")
for title, page_data_list in sample_combined:
    print(f"Title: {title}")
    for project_code, hits, size in page_data_list:
        print(f"  - Project: {project_code}, Hits: {hits}, Size: {size}")
    print()

print(f"\nExecution time: {mr_time:.4f} seconds")


=== Map-Reduce Approach: Combine data of pages with the same title ===

Sample output (first 5 grouped titles):

Title: Indonesian_Wikipedia
  - Project: aa, Hits: 1, Size: 4679
  - Project: en, Hits: 1, Size: 93905

Title: Special:MyLanguage/Meta:Index
  - Project: aa, Hits: 1, Size: 4701

Title: Special:WhatLinksHere/Main_Page
  - Project: aa, Hits: 1, Size: 5556
  - Project: commons.m, Hits: 2, Size: 15231
  - Project: en, Hits: 5, Size: 101406
  - Project: en.s, Hits: 1, Size: 8597
  - Project: en.voy, Hits: 1, Size: 8550
  - Project: meta.m, Hits: 1, Size: 11529
  - Project: outreach.m, Hits: 1, Size: 5698
  - Project: simple, Hits: 3, Size: 32145

Title: Special:WhatLinksHere/MediaWiki:Edittools
  - Project: aa, Hits: 1, Size: 5139

Title: User:IlStudioso
  - Project: aa, Hits: 1, Size: 6796


Execution time: 11.2713 seconds


In [25]:
print("=== Map-Reduce Approach: Aggregate data of pages with the same title ===")

def map_reduce_aggregate_by_title():
    title_data_rdd = valid_lines.map(lambda line: line.split()) \
                                .filter(lambda parts: len(parts) == 4) \
                                .map(lambda parts: (parts[1], (parts[0], int(parts[2]), int(parts[3]))))
    
    aggregated_rdd = title_data_rdd.mapValues(lambda x: ([x[0]], x[1], x[2])) \
                                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))

    return aggregated_rdd

# Time execution
start_time = timeit.default_timer()
aggregated_rdd = map_reduce_aggregate_by_title()
sample_aggregated = aggregated_rdd.take(5)
agg_time = timeit.default_timer() - start_time

# Show results
print("\nSample output (first 5 aggregated titles):\n")
for title, (project_codes, total_hits, total_size) in sample_aggregated:
    print(f"Title: {title}")
    print(f"  - Projects: {project_codes}")
    print(f"  - Total Hits: {total_hits}")
    print(f"  - Total Size: {total_size}\n")

print(f"\nExecution time: {agg_time:.4f} seconds")


=== Map-Reduce Approach: Aggregate data of pages with the same title ===

Sample output (first 5 aggregated titles):

Title: Indonesian_Wikipedia
  - Projects: ['aa', 'en']
  - Total Hits: 2
  - Total Size: 98584

Title: Special:MyLanguage/Meta:Index
  - Projects: ['aa']
  - Total Hits: 1
  - Total Size: 4701

Title: Special:WhatLinksHere/Main_Page
  - Projects: ['aa', 'commons.m', 'en', 'en.s', 'en.voy', 'meta.m', 'outreach.m', 'simple']
  - Total Hits: 15
  - Total Size: 188712

Title: Special:WhatLinksHere/MediaWiki:Edittools
  - Projects: ['aa']
  - Total Hits: 1
  - Total Size: 5139

Title: User:IlStudioso
  - Projects: ['aa']
  - Total Hits: 1
  - Total Size: 6796


Execution time: 12.1982 seconds


In [26]:
print("=== Loop-Based Approach: Aggregate data of pages with the same title ===")

def loop_based_aggregate_by_title():
    collected_data = valid_lines.collect()
    
    title_dict = {}

    for line in collected_data:
        parts = line.split()
        if len(parts) != 4:
            continue  

        project_code = parts[0]
        title = parts[1]
        hits = int(parts[2])
        size = int(parts[3])

        if title not in title_dict:
            title_dict[title] = {
                'project_codes': [project_code],
                'total_hits': hits,
                'total_size': size
            }
        else:
            title_dict[title]['project_codes'].append(project_code)
            title_dict[title]['total_hits'] += hits
            title_dict[title]['total_size'] += size
    
    return title_dict

# Time execution
start_time = timeit.default_timer()
aggregated_results = loop_based_aggregate_by_title()
loop_time = timeit.default_timer() - start_time

# Display results
print("\nSample output (first 5 aggregated titles):\n")
for i, (title, data) in enumerate(aggregated_results.items()):
    if i >= 5:
        break
    print(f"Title: {title}")
    print(f"  - Project Codes: {data['project_codes']}")
    print(f"  - Total Hits: {data['total_hits']}")
    print(f"  - Total Size: {data['total_size']}\n")

print(f"\nExecution time: {loop_time:.4f} seconds")

=== Loop-Based Approach: Aggregate data of pages with the same title ===

Sample output (first 5 aggregated titles):

Title: 271_a.C
  - Project Codes: ['aa', 'az', 'bcl', 'be']
  - Total Hits: 4
  - Total Size: 22386

Title: Category:User_th
  - Project Codes: ['aa', 'commons.m']
  - Total Hits: 2
  - Total Size: 4770

Title: Chiron_Elias_Krase
  - Project Codes: ['aa', 'az', 'bg', 'cho', 'dz', 'it']
  - Total Hits: 6
  - Total Size: 34584

Title: Dassault_rafaele
  - Project Codes: ['aa', 'en', 'it']
  - Total Hits: 4
  - Total Size: 21940

Title: E.Desv
  - Project Codes: ['aa', 'arc', 'ast', 'fiu-vro', 'fr', 'ik']
  - Total Hits: 6
  - Total Size: 31539


Execution time: 9.8272 seconds


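#### ✅ Observations:

In this run the **Loop-Based approach** finished in 9.83 s, compared with 12.20 s for the **Map-Reduce** aggregation, so the loop was about 1.2x faster. The two timings are not directly comparable: the Map-Reduce timing covers only `take(5)`, whereas the loop aggregates every title into a dictionary.

The advantage of Map-Reduce is scalability rather than speed on this dataset. Its work can be distributed across partitions and nodes, and it does not need the whole dataset in driver memory, which the loop approach does because it relies on `collect()`. On a single machine (`local[*]`) with data of this size, that advantage does not show up in wall-clock time.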

In [27]:
print("=== Map-Reduce Approach: Generate Page Combinations Per Title ===")

def generate_page_combinations(combined_rdd):
    # For each title, generate all unique pairs from its associated pages
    title_combinations_rdd = combined_rdd.flatMapValues(
        lambda pages: list(combinations(pages, 2))
    )
    return title_combinations_rdd

# Time execution
start_time = timeit.default_timer()
title_page_combinations_rdd = generate_page_combinations(combined_rdd)
sample_pairs = title_page_combinations_rdd.take(5)
combo_time = timeit.default_timer() - start_time

# Display results
print("\nSample output (first 5 title-based page combinations):\n")
for title, ((proj1, hits1, size1), (proj2, hits2, size2)) in sample_pairs:
    print(f"Title: {title}")
    print(f"  - Page 1 → Project: {proj1}, Hits: {hits1}, Size: {size1}")
    print(f"  - Page 2 → Project: {proj2}, Hits: {hits2}, Size: {size2}")
    print()

print(f"\nExecution time: {combo_time:.4f} seconds")


=== Map-Reduce Approach: Generate Page Combinations Per Title ===

Sample output (first 5 title-based page combinations):

Title: Indonesian_Wikipedia
  - Page 1 → Project: aa, Hits: 1, Size: 4679
  - Page 2 → Project: en, Hits: 1, Size: 93905

Title: Special:WhatLinksHere/Main_Page
  - Page 1 → Project: aa, Hits: 1, Size: 5556
  - Page 2 → Project: commons.m, Hits: 2, Size: 15231

Title: Special:WhatLinksHere/Main_Page
  - Page 1 → Project: aa, Hits: 1, Size: 5556
  - Page 2 → Project: en, Hits: 5, Size: 101406

Title: Special:WhatLinksHere/Main_Page
  - Page 1 → Project: aa, Hits: 1, Size: 5556
  - Page 2 → Project: en.s, Hits: 1, Size: 8597

Title: Special:WhatLinksHere/Main_Page
  - Page 1 → Project: aa, Hits: 1, Size: 5556
  - Page 2 → Project: en.voy, Hits: 1, Size: 8550


Execution time: 2.4744 seconds
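
The number of pairs grows quadratically with the number of projects sharing a title: a title that appears in k projects yields k(k-1)/2 pairs. A short sketch of that arithmetic, using group sizes seen in the outputs above (8 projects for `Special:WhatLinksHere/Main_Page`, 118 for `water`); the helper name `pair_count` is hypothetical.

```python
from itertools import combinations
from math import comb

def pair_count(k: int) -> int:
    """Number of unordered pairs among k pages that share a title."""
    return comb(k, 2)

# Cross-check the formula against itertools.combinations for a small group.
projects = ["aa", "commons.m", "en", "en.s", "en.voy",
            "meta.m", "outreach.m", "simple"]
assert len(list(combinations(projects, 2))) == pair_count(len(projects))

# Pair counts for a few group sizes.
for k in (1, 2, 8, 118):
    print(k, pair_count(k))
```

Titles that appear in only one project contribute no pairs, which is why `Special:MyLanguage/Meta:Index` does not show up in the sample above.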


In [28]:
all_combined = title_page_combinations_rdd.collect()

# Write all results to a file
output_path = "final_output_combined_for_the_same_title.out"
with open(output_path, "w", encoding="utf-8") as f:
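    f.write("=== Combined Page Pairs Grouped by Title ===\n")
    # Each record is (title, (page1, page2)); unpack the pair for this record
    # instead of reusing variables left over from the previous cell
    for title, ((proj1, hits1, size1), (proj2, hits2, size2)) in all_combined:
        f.write(f"Title: {title}\n")
        f.write(f"  - Page 1 → Project: {proj1}, Hits: {hits1}, Size: {size1}\n")
        f.write(f"  - Page 2 → Project: {proj2}, Hits: {hits2}, Size: {size2}\n")
        f.write("\n")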

print(f"All results written to {output_path}")

All results written to final_output_combined_for_the_same_title.out
