### **Bonus Question - Connected Components on MapReduce** 
MapReduce is well suited to network analysis because it processes large graph datasets in parallel. Splitting the work into map and reduce steps distributes the analysis across machines, which is essential for large-scale graph problems such as finding connected components.

1. In this task, you are required to use PySpark and the MapReduce paradigm to identify the connected components in a flight network graph. The focus should be on airports rather than cities. As you know, a connected component refers to a group of airports where every pair of airports within the group is connected either directly or indirectly.

The function takes the following inputs: 
1. Flight network
2. A starting date
3. An end date

The function outputs: 
1. The number of the connected components during that period
2. The size of each connected component
3. The airports within the largest connected component identified.

__Note:__ For this task, you should check if there is a flight between two airports during that period.
__Note:__ You are not allowed to use pre-existing packages or functions in PySpark; instead, you must implement the algorithm from scratch using the MapReduce paradigm.

2. Compare the execution time and the results of your implementation with those of the GraphFrames package for identifying connected components. If there is any difference in the results, provide an explanation for why that might occur.
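
As a rough illustration of the idea behind a MapReduce connected-components computation, the sketch below runs minimum-label propagation in plain Python on a toy graph. The function names (`map_phase`, `reduce_phase`, `connected_components`) and the three-letter airport codes are hypothetical and chosen only for this example; it is not the PySpark solution itself, just the map and reduce steps under the assumption that every airport starts with its own code as its label.

```python
from collections import defaultdict

def map_phase(labels, edges):
    """Emit (airport, candidate_label) pairs: the airport's own label
    plus the label of every neighbour (edges are treated as undirected)."""
    for airport, label in labels.items():
        yield airport, label
    for src, dst in edges:
        yield dst, labels[src]
        yield src, labels[dst]

def reduce_phase(pairs):
    """Group candidates by airport and keep the minimum label."""
    grouped = defaultdict(list)
    for airport, label in pairs:
        grouped[airport].append(label)
    return {airport: min(candidates) for airport, candidates in grouped.items()}

def connected_components(edges):
    # Every airport starts in its own component, labelled by its own code
    labels = {airport: airport for edge in edges for airport in edge}
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        if new_labels == labels:  # no label changed: propagation has finished
            break
        labels = new_labels
    # Collect airports that share a label into one component
    components = defaultdict(set)
    for airport, label in labels.items():
        components[label].add(airport)
    return sorted(components.values(), key=len, reverse=True)

# Hypothetical toy network: two separate groups of airports
toy_edges = [("JFK", "LAX"), ("LAX", "SFO"), ("ANC", "FAI")]
components = connected_components(toy_edges)
print(len(components), [len(c) for c in components])
```

Each pass corresponds to one map step and one reduce step; the number of passes needed grows with the longest shortest path inside a component.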


In [6]:
! pip install pyngrok gdown  pyspark  yellowbrick graphframes

Collecting graphframes
  Downloading graphframes-0.6-py2.py3-none-any.whl.metadata (934 bytes)
Collecting nose (from graphframes)
  Downloading nose-1.3.7-py3-none-any.whl.metadata (1.7 kB)
Downloading graphframes-0.6-py2.py3-none-any.whl (18 kB)
Downloading nose-1.3.7-py3-none-any.whl (154 kB)
Installing collected packages: nose, graphframes
Successfully installed graphframes-0.6 nose-1.3.7


In [7]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark import SparkContext


In [3]:
# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("Connected Components").getOrCreate()

# Load the CSV file
df = spark.read.csv("Airports2.csv", header=True, inferSchema=True)

# Show the first records to confirm the load
df.show()


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, min as spark_min, size, row_number
from pyspark.sql.window import Window
import time

class ImprovedConnectedComponents:
    def __init__(self, max_iterations=20, spark_configs=None):
        """
        Initializes SparkSession with optimized configurations.

        Args:
            max_iterations (int, optional): Maximum number of iterations for component detection. Defaults to 20.
            spark_configs (dict, optional): Additional Spark configurations as key-value pairs.
        """
        print("Initializing SparkSession with optimized configurations...")
        
        builder = SparkSession.builder.appName("Optimized Connected Components")
        default_configs = {
            "spark.sql.adaptive.enabled": "true",
            "spark.sql.adaptive.skewJoin.enabled": "true",
            "spark.sql.shuffle.partitions": "400",
            "spark.driver.memory": "16g",
            "spark.executor.memory": "16g",
            "spark.memory.fraction": "0.8",
            "spark.memory.storageFraction": "0.2"
        }
        # Merge default and custom configurations
        for key, value in {**default_configs, **(spark_configs or {})}.items():
            builder = builder.config(key, value)
        
        self.spark = builder.getOrCreate()
        self.sc = self.spark.sparkContext
        self.max_iterations = max_iterations

    def find_connected_components(self, flight_network_path, start_date, end_date):
        """
        Identifies connected components in a flight network using an optimized algorithm.

        Args:
            flight_network_path (str): Path to the flight network CSV file.
            start_date (str): Start date for filtering (format YYYY-MM-DD).
            end_date (str): End date for filtering (format YYYY-MM-DD).

        Returns:
            dict: Results containing connected components information and performance metrics.
        """
        start_time = time.time()
        print(f"Analyzing flight network from {start_date} to {end_date}")
        print(f"Maximum iterations set to: {self.max_iterations}")

        # STEP 1: Load and filter the dataset
        print("Loading and filtering flight network data...")
        df = self.spark.read.csv(flight_network_path, header=True, inferSchema=True)
        
        # Check if necessary columns exist
        required_columns = ['Fly_date', 'Origin_airport', 'Destination_airport']
        for col_name in required_columns:
            if col_name not in df.columns:
                raise ValueError(f"Required column '{col_name}' not found in the dataset")
        
        # Filter by date range
        filtered_df = df.filter((col("Fly_date") >= start_date) & (col("Fly_date") <= end_date))
        
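        # Print dataset statistics
        # Note: unique_airports counts distinct origin airports only, so airports
        # that appear solely as destinations are not included in this figure
        total_flights = filtered_df.count()
        unique_airports = filtered_df.select("Origin_airport").distinct().count()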
        print(f"Total flights in the period: {total_flights}")
        print(f"Unique airports: {unique_airports}")

        # Create undirected graph edges
        edges_df = filtered_df.select("Origin_airport", "Destination_airport").distinct()
        bidirectional_edges = edges_df.union(
            edges_df.select(
                col("Destination_airport").alias("Origin_airport"),
                col("Origin_airport").alias("Destination_airport")
            )
        ).distinct()

        # STEP 2: Connected components via iterative minimum-label propagation
        print("Computing connected components...")
        
        # Initial node mapping
        initial_nodes = bidirectional_edges.select("Origin_airport").distinct() \
            .withColumnRenamed("Origin_airport", "airport") \
            .withColumn("component", col("airport"))

        # Iterative component merging
        current_nodes = initial_nodes
        previous_component_count = 0

        for iteration in range(self.max_iterations):
            print(f"Iteration {iteration + 1}")
            
            # Join edges with current component mapping
            merged_nodes = bidirectional_edges.join(current_nodes, 
                bidirectional_edges.Origin_airport == current_nodes.airport, "left") \
                .select(
                    col("Destination_airport").alias("airport"),
                    col("component")
                )

            # Merge components by taking the minimum component label
            current_nodes = merged_nodes.union(current_nodes) \
                .groupBy("airport") \
                .agg(spark_min("component").alias("component"))

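            # Early stopping heuristic: stop when the number of distinct labels no longer changes.
            # Note: the count can stall while labels are still propagating along long paths,
            # so this check may stop before the labels have fully converged.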
            current_component_count = current_nodes.select("component").distinct().count()
            if current_component_count == previous_component_count:
                print(f"Converged at iteration {iteration + 1}")
                break
            previous_component_count = current_component_count

        # STEP 3: Analyze Connected Components
        print("Analyzing connected components...")
        components_df = current_nodes.groupBy("component") \
            .agg(collect_list("airport").alias("airports"))
        
        # Add component size using size() function
        components_df = components_df.withColumn("size", size(col("airports")))
        
        # Sort components by size in descending order
        window_spec = Window.orderBy(col("size").desc())
        components_df = components_df.withColumn("rank", row_number().over(window_spec))

        # Collect results
        results_rows = components_df.collect()
        
        # Calculate performance metrics
        end_time = time.time()
        total_processing_time = end_time - start_time
        
        # Prepare final results
        results = {
            "number_of_components": len(results_rows),
            "component_sizes": [row["size"] for row in results_rows],
            "largest_component_size": results_rows[0]["size"] if results_rows else 0,
            "largest_component_airports": results_rows[0]["airports"] if results_rows else [],
            "performance_metrics": {
                "total_flights": total_flights,
                "unique_airports": unique_airports,
                "processing_time_seconds": total_processing_time,
                "iterations_completed": iteration + 1
            }
        }

        print("Connected Components Analysis Completed!")
        return results

    def cleanup(self):
        """
        Stops the SparkSession to free up resources.
        """
        if self.spark:
            self.spark.stop()
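
The early-stopping rule in the loop above compares the number of distinct labels between iterations. The plain-Python sketch below, using a hypothetical seven-node path graph and hypothetical helper names (`propagate`, `run`), suggests why that rule can stop too early and how checking whether any label changed behaves differently. It mirrors the synchronous minimum-label update of the class but is not part of the original implementation.

```python
def propagate(labels, edges):
    """One synchronous pass: each node takes the minimum of its own label
    and the labels its neighbours held before this pass."""
    new_labels = dict(labels)
    for u, v in edges:
        new_labels[v] = min(new_labels[v], labels[u])
        new_labels[u] = min(new_labels[u], labels[v])
    return new_labels

def run(edges, stop_on_count, max_iterations=50):
    """Return the number of components reported under the chosen stopping rule."""
    labels = {node: node for edge in edges for node in edge}
    previous_count = 0
    for _ in range(max_iterations):
        new_labels = propagate(labels, edges)
        count = len(set(new_labels.values()))
        if stop_on_count:
            done = count == previous_count   # rule used in the class above
        else:
            done = new_labels == labels      # stricter rule: no label changed
        labels, previous_count = new_labels, count
        if done:
            break
    return len(set(labels.values()))

# Hypothetical path graph A-G-F-E-D-C-B: a single connected component
path = ["A", "G", "F", "E", "D", "C", "B"]
path_edges = list(zip(path, path[1:]))

by_count = run(path_edges, stop_on_count=True)
by_change = run(path_edges, stop_on_count=False)
print(by_count, by_change)
```

On this path the count-based rule stops while two labels remain, whereas the change-based rule continues until a single label is left. In DataFrame terms, one possible equivalent is to join the previous and the new `airport`/`component` mappings on `airport` and stop when no row has a different `component`.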


In [2]:
# Initialize the analyzer
import json


connector = ImprovedConnectedComponents(
    max_iterations=10  # Upper bound on label-propagation iterations
)

# Analyze flights from 1990 through 2000
results = connector.find_connected_components(
    flight_network_path="Airports2.csv",
    start_date="1990-01-01",
    end_date="2000-12-31"
)

# Print detailed results
print(json.dumps(results, indent=2))

# Cleanup
connector.cleanup()

Initializing SparkSession with optimized configurations...
Analyzing flight network from 1990-01-01 to 2000-12-31
Maximum iterations set to: 10
Loading and filtering flight network data...
Total flights in the period: 1650268
Unique airports: 450
Computing connected components...
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Converged at iteration 5
Analyzing connected components...
Connected Components Analysis Completed!
{
  "number_of_components": 1,
  "component_sizes": [
    463
  ],
  "largest_component_size": 463,
  "largest_component_airports": [
    "ABE",
    "ABI",
    "ABQ",
    "ABR",
    "ABY",
    "ACT",
    "ACV",
    "ACY",
    "ADM",
    "ADQ",
    "ADS",
    "AEX",
    "AFW",
    "AGC",
    "AGS",
    "AHN",
    "AID",
    "AIY",
    "ALB",
    "ALM",
    "ALO",
    "ALW",
    "AMA",
    "ANB",
    "ANC",
    "AND",
    "AOH",
    "AOO",
    "APF",
    "APN",
    "ARA",
    "ART",
    "AST",
    "ATL",
    "ATW",
    "ATY",
    "AUS",
    "AVL",
    "AV

## **GraphFrames**

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from graphframes import GraphFrame

def analyze_flight_network_graphframes(flight_network_df, start_date, end_date):
    """
    Analyze connected components in a flight network using GraphFrames
    
    Parameters:
    - flight_network_df: PySpark DataFrame with flight network data
    - start_date: Beginning of the date range to analyze
    - end_date: End of the date range to analyze
    
    Returns:
    - Dictionary with analysis results
    """
    # 1. Filter flights within the specified date range
    filtered_flights = flight_network_df.filter(
        (col("Fly_date") >= start_date) & (col("Fly_date") <= end_date)
    )
    
    # 2. Create vertices (unique airports)
    vertices = filtered_flights.select(
        col("Origin_airport").alias("id")
    ).union(
        filtered_flights.select(
            col("Destination_airport").alias("id")
        )
    ).distinct()
    
    # 3. Create edges (flights between airports)
    edges = filtered_flights.select(
        col("Origin_airport").alias("src"),
        col("Destination_airport").alias("dst")
    )
    
    # 4. Create GraphFrame
    graph = GraphFrame(vertices, edges)
    
    # 5. Find connected components
    connected_components = graph.connectedComponents()
    
    # 6. Analyze components
    component_analysis = connected_components.groupBy("component") \
        .agg(
            F.count("id").alias("component_size"),
            F.collect_list("id").alias("airports_in_component")
        ).orderBy(col("component_size").desc())
    
    # 7. Prepare results (collect once so the aggregation is not recomputed)
    rows = component_analysis.collect()
    results = {
        "total_connected_components": len(rows),
        "component_sizes": [row["component_size"] for row in rows],
        "largest_component_airports": rows[0]["airports_in_component"] if rows else []
    }
    
    return results

In [1]:
!pip install graphframes




In [2]:
from graphframes import GraphFrame
from pyspark.sql import SparkSession
import time
from pyspark.sql import functions as F

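# Initialize Spark session
spark = SparkSession.builder \
    .appName("ConnectedComponentsGraphFrames") \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.0-s_2.12") \
    .getOrCreate()

# GraphFrames' default connectedComponents algorithm needs a checkpoint directory;
# the "checkpoints" path is an assumed local location
spark.sparkContext.setCheckpointDir("checkpoints")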

def find_connected_components(flight_network_df, start_date, end_date):
    # 1. Filter flights within the specified date range
    filtered_flights = flight_network_df.filter(
        (F.col("Fly_date") >= start_date) & (F.col("Fly_date") <= end_date)
    )

    # 2. Create vertices (unique airports)
    vertices = filtered_flights.select(
        F.col("Origin_airport").alias("id")
    ).union(
        filtered_flights.select(
            F.col("Destination_airport").alias("id")
        )
    ).distinct()

    # 3. Create edges (flights between airports)
    edges = filtered_flights.select(
        F.col("Origin_airport").alias("src"),
        F.col("Destination_airport").alias("dst")
    )

    # Create a GraphFrame
    graph = GraphFrame(vertices, edges)

    # Start the timer
    start_time = time.time()

    # Run the connectedComponents algorithm
    result = graph.connectedComponents()

    # 4. Number of connected components
    # (this action is kept inside the timed region so the measurement covers materialized results)
    num_components = result.select("component").distinct().count()

    # Stop the timer
    end_time = time.time()

    # 5. Size of each connected component
    component_sizes = result.groupBy("component").count().withColumnRenamed("count", "size")

    # 6. Airports in the largest connected component
    largest_component_id = component_sizes.orderBy(F.desc("size")).first()["component"]
    largest_component_airports = result.filter(F.col("component") == largest_component_id).select("id").collect()

    # Print the execution time
    print(f"Execution Time (GraphFrames): {end_time - start_time} seconds")

    # Output the results
    print(f"Number of connected components: {num_components}")
    print(f"Size of each connected component:")
    component_sizes.show()
    
    print(f"Airports in the largest connected component:")
    for airport in largest_component_airports:
        print(airport["id"])


Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735)
	at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270)
	at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:1139)
	at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:1125)
	at org.apache.spark.util.Utils$.fetchFile(Utils.scala:489)
	at org.apache.spark.SparkContext.addFile(SparkContext.scala:1790)
	at org.apache.spark.SparkContext.$anonfun$new$16(SparkContext.scala:528)
	at org.apache.spark.SparkContext.$anonfun$new$16$adapted(SparkContext.scala:528)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547)
	at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568)
	at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:688)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1907)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1867)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1840)
	at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
	at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207)
	at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304)
	at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
	at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
	at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
	at org.apache.spark.util.Utils$.createTempDir(Utils.scala:242)
	at org.apache.spark.util.SparkFileUtils.createTempDir(SparkFileUtils.scala:103)
	at org.apache.spark.util.SparkFileUtils.createTempDir$(SparkFileUtils.scala:102)
	at org.apache.spark.util.Utils$.createTempDir(Utils.scala:94)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:372)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
	at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467)
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:515)
	... 25 more
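
The traceback above reports that `HADOOP_HOME` and `hadoop.home.dir` are unset, which on Windows usually means Spark cannot locate the Hadoop helper binaries (`winutils.exe`). A minimal sketch of one way to set the variables from Python before the SparkSession is created is shown below. The helper name `configure_hadoop_home` and the path `C:\hadoop` are assumptions for illustration; the directory is expected to contain `bin\winutils.exe`, which has to be obtained separately.

```python
import os

def configure_hadoop_home(hadoop_home):
    """Point HADOOP_HOME at a local Hadoop directory and put its bin folder on PATH.
    Must run before the first SparkSession/SparkContext is created."""
    os.environ["HADOOP_HOME"] = hadoop_home
    bin_dir = os.path.join(hadoop_home, "bin")
    os.environ["PATH"] = bin_dir + os.pathsep + os.environ.get("PATH", "")
    return bin_dir

# Assumed location; replace with the directory that actually holds bin\winutils.exe
bin_dir = configure_hadoop_home(r"C:\hadoop")
print(os.environ["HADOOP_HOME"])
```

Because the JVM reads these variables at start-up, the notebook kernel typically needs a restart if a Spark session was already attempted.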


In [2]:
# Example usage
# The DataFrame must contain the 'Fly_date', 'Origin_airport', and 'Destination_airport' columns
# Define your start and end dates (in 'yyyy-MM-dd' format)
start_date = '1990-01-01'
end_date = '1990-12-31'

df = spark.read.csv("Airports2.csv", header=True, inferSchema=True)

# Call the function with your DataFrame
find_connected_components(df, start_date, end_date)



Py4JJavaError: An error occurred while calling o63.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
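
The `ClassNotFoundException` above indicates that the GraphFrames JVM classes were not available to this session, so the GraphFrames comparison did not run here. As a possible independent cross-check of the component sizes, the sketch below applies a plain union-find on the driver. It assumes the distinct airport pairs are small enough to collect (a few hundred airports in this dataset); the function names and toy edges are hypothetical, and this is not a substitute for the GraphFrames comparison requested in the task.

```python
from collections import Counter

def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union_find_component_sizes(edges):
    """Return component sizes (largest first) for an undirected edge list."""
    parent = {}
    for u, v in edges:
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:
            # Attach the larger root label to the smaller one
            parent[max(root_u, root_v)] = min(root_u, root_v)
    sizes = Counter(find(parent, node) for node in parent)
    return sorted(sizes.values(), reverse=True)

# Hypothetical toy edges; with Spark, the pairs could come from collecting
# the distinct (Origin_airport, Destination_airport) rows of the filtered data
sizes = union_find_component_sizes([("JFK", "LAX"), ("LAX", "SFO"), ("ANC", "FAI")])
print(sizes)
```

If the sizes from this check differ from the iterative PySpark result, the stopping rule or the iteration cap of the iterative version would be the first things to inspect.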
