# PySpark and Big Data Project - Solutions
* [View Solution Notebook](./solution.html)

* [Project Page Link](https://www.codecademy.com/courses/big-data-pyspark/projects/pyspark-common-crawl)

## Task Group 1 - Analyzing Common Crawl Data with RDDs

### Task 1

Initialize a new Spark Context and read in the domain graph as an RDD.

In [None]:
# Import required modules
from pyspark.sql import SparkSession

# Create a new SparkSession
spark = SparkSession \
    .builder \
    .getOrCreate()

# Get SparkContext
sc = spark.sparkContext

In [None]:
# Read Domains CSV File into an RDD
common_crawl_domain_counts = sc.textFile('./crawl/cc-main-limited-domains.csv')

# Display the first few domains from the RDD
common_crawl_domain_counts.take(10)

### Task 2

Apply `fmt_domain_graph_entry` over `common_crawl_domain_counts` and save the result as a new RDD named `formatted_host_counts`.

In [None]:
def fmt_domain_graph_entry(entry):
    """
    Formats a Common Crawl domain graph entry. Extracts the site_id,
    domain name, top-level domain (tld), and subdomain count as separate items.
    """

    # Split the entry on delimiter ('\t') into site_id, domain, tld, and num_subdomains
    site_id, domain, tld, num_subdomains = entry.split('\t')        
    return int(site_id), domain, tld, int(num_subdomains)
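
To see what the formatter produces, it can be run on a single line in plain Python, with no Spark required. This is only a sketch: `sample_entry` is a made-up line in the tab-delimited layout the function expects, not a record copied from the dataset.

```python
def fmt_domain_graph_entry(entry):
    """Split a tab-delimited entry into (site_id, domain, tld, num_subdomains)."""
    site_id, domain, tld, num_subdomains = entry.split('\t')
    return int(site_id), domain, tld, int(num_subdomains)

# Hypothetical sample line (assumption, not taken from the real file)
sample_entry = '42\texample\torg\t7'

formatted = fmt_domain_graph_entry(sample_entry)
print(formatted)
```

Mapping the function over the RDD applies this same transformation to every line, so each element becomes a four-item tuple with the two numeric fields cast to `int`.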

In [None]:
# Apply `fmt_domain_graph_entry` to the raw data RDD
formatted_host_counts = common_crawl_domain_counts.map(fmt_domain_graph_entry)

# Display the first few entries of the new RDD
formatted_host_counts.take(10)

### Task 3

Apply `extract_subdomain_counts` over `common_crawl_domain_counts` and save the result as a new RDD named `host_counts`.

In [None]:
def extract_subdomain_counts(entry):
    """
    Extract the subdomain count from a Common Crawl domain graph entry.
    """
    
    # Split the entry on delimiter ('\t') into site_id, domain, tld, and num_subdomains
    site_id, domain, tld, num_subdomains = entry.split('\t')
    
    # return ONLY the num_subdomains
    return int(num_subdomains)


# Apply `extract_subdomain_counts` to the raw data RDD
host_counts = common_crawl_domain_counts.map(extract_subdomain_counts)

# Display the first few entries
host_counts.take(10)

### Task 4

Using `host_counts`, calculate the total number of subdomains across all domains in the dataset and save the result to a variable named `total_host_counts`.
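
The reduction can be sketched in plain Python with `functools.reduce`, which folds a sequence with a two-argument function much as an RDD's `reduce` does. The list `sample_host_counts` is a hypothetical stand-in for the `host_counts` RDD, and its values are made up.

```python
from functools import reduce

# Hypothetical subdomain counts standing in for the `host_counts` RDD
sample_host_counts = [1, 7, 3, 12]

# The same kind of two-argument function one would pass to an RDD's reduce
add = lambda a, b: a + b

sample_total = reduce(add, sample_host_counts)
print(sample_total)
```

Addition is a safe choice for the reduce function because it is associative and commutative, so partial sums computed on separate partitions can be combined in any order.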

In [None]:
# Reduce the RDD to a single value, the sum of subdomains, with a lambda function
# as the reduce function
total_host_counts = host_counts.reduce(lambda a, b: a + b)

# Display result count
print(total_host_counts)

### Task 5

Stop the current `SparkSession` and `sparkContext` before moving on to analyze the data with Spark SQL.

In [None]:
# Stop the sparkContext and the SparkSession
# (`spark.stop()` also stops the underlying SparkContext)
spark.stop()

## Task Group 2 - Exploring Domain Counts with PySpark DataFrames and SQL

### Task 6

Create a new `SparkSession` and assign it to a variable named `spark`.

In [None]:
from pyspark.sql import SparkSession

# Create a new SparkSession
spark = SparkSession.builder.getOrCreate()

### Task 7

Read `./crawl/cc-main-limited-domains.csv` into a new Spark DataFrame named `common_crawl`.

In [None]:
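# Read the target file into a DataFrame; the file is tab-delimited despite its .csv extension
common_crawl = spark.read \
    .option('sep', '\t') \
    .option('inferSchema', True) \
    .csv('./crawl/cc-main-limited-domains.csv')

# Display the DataFrame to the notebook
common_crawl.show(5, truncate=False)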

### Task 8

Rename the DataFrame's columns to the following: 

- site_id
- domain
- top_level_domain
- num_subdomains


In [None]:
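# Rename the DataFrame's columns with `withColumnRenamed()`.
# Spark names the columns of a headerless CSV `_c0`, `_c1`, ... by default.
common_crawl = common_crawl \
    .withColumnRenamed('_c0', 'site_id') \
    .withColumnRenamed('_c1', 'domain') \
    .withColumnRenamed('_c2', 'top_level_domain') \
    .withColumnRenamed('_c3', 'num_subdomains')

# Display the first few rows of the DataFrame and the new schema
common_crawl.show(5, truncate=False)
common_crawl.printSchema()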

## Task Group 3 - Reading and Writing Datasets to Disk

### Task 9

Save the `common_crawl` DataFrame as parquet files in a directory called `./results/common_crawl/`.

In [None]:
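# Save the `common_crawl` DataFrame to a series of parquet files
# (overwrite mode lets the cell be re-run without a "path already exists" error)
common_crawl.write.parquet('./results/common_crawl/', mode='overwrite')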

### Task 10

Read `./results/common_crawl/` into a new DataFrame to confirm our DataFrame was saved properly.

In [None]:
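# Read from parquet directory
common_crawl_domains = spark.read.parquet('./results/common_crawl/')

# Display the first few rows of the DataFrame and the schema
common_crawl_domains.show(5, truncate=False)
common_crawl_domains.printSchema()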

## Task Group 4 - Querying Domain Counts with PySpark DataFrames and SQL

### Task 11

Create a local temporary view from `common_crawl_domains`.

In [None]:
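# Create a temporary view in the metadata for this `SparkSession`
# (the view name `crawl` is an arbitrary choice; any valid name works)
common_crawl_domains.createOrReplaceTempView('crawl')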

### Task 12

Calculate the total number of domains for each top-level domain in the dataset.

In [None]:
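# Aggregate the DataFrame using DataFrame methods
common_crawl_domains \
    .groupBy('top_level_domain') \
    .count() \
    .orderBy('count', ascending=False) \
    .show(5, truncate=False)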

In [None]:
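# Aggregate the DataFrame using SQL (assumes the Task 11 view is named `crawl`)
spark.sql("""
    SELECT top_level_domain, COUNT(domain) AS domain_count
    FROM crawl GROUP BY top_level_domain ORDER BY domain_count DESC
""").show(5, truncate=False)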

### Task 13

Calculate the total number of subdomains for each top-level domain in the dataset.
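
The per-TLD aggregations in Tasks 12 and 13 are ordinary `GROUP BY` queries, so their logic can be checked on a tiny table without a Spark cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL; the table name `crawl` and every row in `rows` are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical rows: (site_id, domain, top_level_domain, num_subdomains)
rows = [
    (1, 'example', 'org', 7),
    (2, 'sample', 'org', 3),
    (3, 'demo', 'gov', 12),
]

# In-memory SQLite database standing in for the Spark temporary view
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE crawl (site_id INTEGER, domain TEXT, '
    'top_level_domain TEXT, num_subdomains INTEGER)'
)
conn.executemany('INSERT INTO crawl VALUES (?, ?, ?, ?)', rows)

# Count domains (Task 12) and sum subdomains (Task 13) per top-level domain
query = """
    SELECT top_level_domain,
           COUNT(domain) AS domain_count,
           SUM(num_subdomains) AS subdomain_total
    FROM crawl
    GROUP BY top_level_domain
    ORDER BY top_level_domain
"""
totals = conn.execute(query).fetchall()
conn.close()
print(totals)
```

Each result row holds one top-level domain with its domain count and its subdomain total, which is the shape the Spark queries in these two tasks should produce as well.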

In [None]:
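# Aggregate the DataFrame using DataFrame methods
common_crawl_domains \
    .groupBy('top_level_domain') \
    .sum('num_subdomains') \
    .orderBy('sum(num_subdomains)', ascending=False) \
    .show(5, truncate=False)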

In [None]:
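# Aggregate the DataFrame using SQL (assumes the Task 11 view is named `crawl`)
spark.sql("""
    SELECT top_level_domain, SUM(num_subdomains) AS subdomain_total
    FROM crawl GROUP BY top_level_domain ORDER BY subdomain_total DESC
""").show(5, truncate=False)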

### Task 14

How many subdomains does `nps.gov` have? Filter the dataset to that website's entry and display the columns `top_level_domain`, `domain`, and `num_subdomains` in your result.
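
Because the formatting function earlier in the notebook splits each entry into separate domain and top-level-domain fields, the filter presumably has to match `nps` and `gov` in two different columns rather than the single string `nps.gov`. A minimal plain-Python sketch of that filter and projection follows; the rows and their subdomain counts are hypothetical.

```python
# Hypothetical rows: (site_id, domain, top_level_domain, num_subdomains).
# The subdomain counts are made up for illustration.
rows = [
    (1, 'nps', 'gov', 5),
    (2, 'nps', 'org', 2),
    (3, 'nasa', 'gov', 9),
]

# Split the site name into the two fields stored in the dataset
domain, tld = 'nps.gov'.rsplit('.', 1)

# Keep only the matching row and project (top_level_domain, domain, num_subdomains)
matches = [(r[2], r[1], r[3]) for r in rows if r[1] == domain and r[2] == tld]
print(matches)
```

Filtering on the domain alone would also return the `nps.org` row here, which is why both conditions are needed.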

In [None]:
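# Filter the DataFrame using DataFrame methods
common_crawl_domains \
    .filter(common_crawl_domains.domain == 'nps') \
    .filter(common_crawl_domains.top_level_domain == 'gov') \
    .select('top_level_domain', 'domain', 'num_subdomains') \
    .show(truncate=False)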

In [None]:
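# Filter the DataFrame using SQL (assumes the Task 11 view is named `crawl`)
spark.sql("""
    SELECT top_level_domain, domain, num_subdomains
    FROM crawl WHERE domain = 'nps' AND top_level_domain = 'gov'
""").show(truncate=False)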

### Task 15

Close the `SparkSession` and underlying `sparkContext`.

In [None]:
# Stop the notebook's `SparkSession` and `sparkContext`
spark.stop()