## Import all the libraries needed for this notebook
PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.


In [3]:
import os  # Miscellaneous operating system interfaces 
import sys  # System-specific parameters and functions (https://docs.python.org/3/library/sys.html)
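os.environ["JAVA_HOME"] = r"C:\Program Files\Eclipse Adoptium\jdk-17.0.18.8-hotspot"  # Point Spark at the local OpenJDK installation (adjust the path for your machine)
os.environ['PYSPARK_PYTHON'] = sys.executable  # Use the current Python interpreter for the Spark workers
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable  # Use the current Python interpreter for the Spark driver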

# Import SparkSession from pyspark.sql module to create a Spark session
from pyspark.sql import SparkSession  # The entry point to programming Spark with the Dataset and DataFrame API

# Import the necessary types as classes
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType



## Creating a DataFrame from filestores
**About DataFrames**

- DataFrames: Tabular format (rows and columns).
- Support SQL-like operations.
- Comparable to a pandas DataFrame or a SQL table.
- Hold structured data.

**Difference between pandas and PySpark DataFrames**
- pandas operates on a single machine.
- PySpark distributes data across multiple machines, which improves processing speed and scalability for large datasets.

In [4]:
# Create a Spark session
spark = SparkSession.builder.appName("MySparkAPP").getOrCreate()
print("Success") # Print a success message to indicate that the Spark session has been created successfully

# Create a DataFrame from a local CSV file with header and inferSchema options enabled
salaries_df = spark.read.csv("data/salaries.csv", header=True, inferSchema=True)

# Show the schema of the DataFrame to understand the structure of the data
salaries_df.printSchema()

# Show the DataFrame
salaries_df.show()

Success
root
 |-- work_year: integer (nullable = true)
 |-- experience_level: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- job_title: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- salary_currency: string (nullable = true)
 |-- salary_in_usd: integer (nullable = true)
 |-- employee_residence: string (nullable = true)
 |-- remote_ratio: integer (nullable = true)
 |-- company_location: string (nullable = true)
 |-- company_size: string (nullable = true)

+---------+----------------+---------------+--------------------+------+---------------+-------------+------------------+------------+----------------+------------+
|work_year|experience_level|employment_type|           job_title|salary|salary_currency|salary_in_usd|employee_residence|remote_ratio|company_location|company_size|
+---------+----------------+---------------+--------------------+------+---------------+-------------+------------------+------------+----------------+--------

## Basic Analytics on PySpark Dataframes
**Aggregate functions:**
- count()
- sum()
- min()
- max()

**Key functions for PySpark analytics**
- **.select():** Selects specific columns from the DataFrame
- **.filter():** Filters rows based on specific conditions
- **.groupBy():** Groups rows based on one or more columns
- **.agg():** Applies aggregate functions to grouped data


In [5]:
# Count the number of rows in the DataFrame and print the result
row_count = salaries_df.count()
print(f"Total number of rows in the DataFrame: {row_count}")

# Compute the average "salary_in_usd" per company location, sorted from highest to lowest
avg_salary_df = salaries_df.groupBy("company_location").agg({"salary_in_usd": "avg"}).sort("avg(salary_in_usd)", ascending=False)
avg_salary_df.show()

Total number of rows in the DataFrame: 37234
+----------------+------------------+
|company_location|avg(salary_in_usd)|
+----------------+------------------+
|              QA|          300000.0|
|              SK|          225000.0|
|              VE|          192500.0|
|              PR|          167500.0|
|              US| 165911.9742057623|
|              IL|        157888.625|
|              CA| 143228.7273542601|
|              SA|139999.33333333334|
|              EG| 136903.7037037037|
|              CH|130909.36363636363|
|              AU|128938.14942528735|
|              NZ|          127998.5|
|              MX|117452.03448275862|
|              DE|110896.81879194631|
|              JP|        110821.625|
|              IE|109723.14285714286|
|              BE|          105576.5|
|              UA|          103000.0|
|              DZ|          100000.0|
|              CN|          100000.0|
+----------------+------------------+
only showing top 20 rows


## More on Spark DataFrames
**Creating DataFrames from various data sources:**
- **CSV Files:** Common for structured, delimited data. They do not define or enforce data types or a schema, which can lead to inconsistencies.
- **JSON Files:** A semi-structured, hierarchical data format. However, it can be storage-intensive.
- **Parquet Files:** Optimized for storage and querying, and often used in data engineering. The format enforces a schema definition and supports complex data structures.
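
To make the point about CSV and data types concrete, here is a small sketch that uses only the Python standard library (not Spark). The record is made up for illustration. It serializes the same record as CSV and as JSON and then checks which Python type the `age` value has after parsing.

```python
import csv
import io
import json

# Hypothetical record, serialized once as CSV and once as JSON
csv_text = "age,occupation\n41,Craft-repair\n"
json_text = '{"age": 41, "occupation": "Craft-repair"}'

csv_row = next(csv.DictReader(io.StringIO(csv_text)))
json_row = json.loads(json_text)

print(type(csv_row["age"]).__name__)   # CSV: every value arrives as text
print(type(json_row["age"]).__name__)  # JSON: numbers keep a numeric type
```

This is why a CSV reader has to either guess the types from the values or be given a schema, while JSON already distinguishes numbers from strings.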

In [6]:
# Load JSON file into a DataFrame and show the contents
adults_df = spark.read.json("data/adults.json")

adults_df.show()

+---+-------------+------+--------------+-----------------+
|age|education.num|income|marital.status|       occupation|
+---+-------------+------+--------------+-----------------+
| 90|            9| <=50K|       Widowed|                ?|
| 82|            9| <=50K|       Widowed|  Exec-managerial|
| 66|           10| <=50K|       Widowed|                ?|
| 54|            4| <=50K|      Divorced|Machine-op-inspct|
| 41|           10| <=50K|     Separated|   Prof-specialty|
| 34|            9| <=50K|      Divorced|    Other-service|
| 38|            6| <=50K|     Separated|     Adm-clerical|
| 74|           16|  >50K| Never-married|   Prof-specialty|
| 68|            9| <=50K|      Divorced|   Prof-specialty|
| 41|           10|  >50K| Never-married|     Craft-repair|
| 45|           16|  >50K|      Divorced|   Prof-specialty|
| 38|           15|  >50K| Never-married|   Prof-specialty|
| 52|           13|  >50K|       Widowed|    Other-service|
| 32|           14|  >50K|     Separated

In [7]:
# Load Parquet file into a DataFrame and show the contents
house_prices_df = spark.read.parquet("data/house-price.parquet")

house_prices_df.show()

+--------+-----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
|   price| area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
+--------+-----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
|13300000| 7420|       4|        2|      3|     yes|       no|      no|             no|            yes|      2|     yes|       furnished|
|12250000| 8960|       4|        4|      4|     yes|       no|      no|             no|            yes|      3|      no|       furnished|
|12250000| 9960|       3|        2|      2|     yes|       no|     yes|             no|             no|      2|     yes|  semi-furnished|
|12215000| 7500|       4|        2|      2|     yes|       no|     yes|             no|            yes|      3|     yes|       furnished|
|11410000| 7420|       4|        1

## Schema inference and manual schema definition
- Spark can infer schemas from data with inferSchema=True
- Manually define a schema for better control; this is useful for fixed data structures

## DataTypes in PySpark DataFrames
- IntegerType: Whole numbers
- LongType: Larger whole numbers (8-byte signed numbers)
- FloatType and DoubleType: Floating-point numbers for decimal values
- StringType: Used for text or string data


In [8]:
# Construct the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])

# Create a DataFrame from in-memory rows using the explicit schema
df = spark.createDataFrame([(1, "Alice", [85, 90, 92]), (2, "Bob", [78, 82, 88])], schema)
df.show()


+---+-----+------------+
| id| name|      scores|
+---+-----+------------+
|  1|Alice|[85, 90, 92]|
|  2|  Bob|[78, 82, 88]|
+---+-----+------------+



## DataFrame operations - selection and filtering
- Use **.select()** to choose specific columns
- Use **.filter()** or **.where()** to filter rows based on conditions
- Use **.sort()** to order by a collection of columns

In [10]:
# Select only the "area" and "bedrooms" columns
house_prices_df.select("area", "bedrooms").show()

+-----+--------+
| area|bedrooms|
+-----+--------+
| 7420|       4|
| 8960|       4|
| 9960|       3|
| 7500|       4|
| 7420|       4|
| 7500|       3|
| 8580|       4|
|16200|       5|
| 8100|       4|
| 5750|       3|
|13200|       3|
| 6000|       4|
| 6550|       4|
| 3500|       4|
| 7800|       3|
| 6000|       4|
| 6600|       4|
| 8500|       3|
| 4600|       3|
| 6420|       3|
+-----+--------+
only showing top 20 rows


In [16]:
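# Keep houses with exactly 2 parking spaces and more than 4 bedrooms (.where() is an alias of .filter())
house_prices_df.where(house_prices_df["parking"]==2).filter(house_prices_df["bedrooms"]>4).show()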

+-------+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
|  price|area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
+-------+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
|8400000|7950|       5|        2|      2|     yes|       no|     yes|            yes|             no|      2|      no|     unfurnished|
|6440000|8580|       5|        3|      2|     yes|       no|      no|             no|             no|      2|      no|       furnished|
+-------+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+



## Sorting and dropping missing values
- Order data using **.sort()** or **.orderBy()**
- Use **.na.drop()** to remove rows with null values

In [None]:
# Sort by age, oldest first
adults_df.sort("age", ascending=False).show()

+---+-------------+------+------------------+-----------------+
|age|education.num|income|    marital.status|       occupation|
+---+-------------+------+------------------+-----------------+
| 90|            9| <=50K|           Widowed|                ?|
| 82|            9| <=50K|           Widowed|  Exec-managerial|
| 74|           16|  >50K|     Never-married|   Prof-specialty|
| 73|            9| <=50K|Married-civ-spouse|  Farming-fishing|
| 71|            9| <=50K|Married-civ-spouse|                ?|
| 71|            9| <=50K|Married-civ-spouse|            Sales|
| 68|            9| <=50K|          Divorced|   Prof-specialty|
| 68|           10| <=50K|Married-civ-spouse|                ?|
| 67|           10| <=50K|Married-civ-spouse|                ?|
| 66|           10| <=50K|           Widowed|                ?|
| 63|           16|  >50K|          Divorced|  Exec-managerial|
| 62|           13|  >50K|Married-civ-spouse|  Farming-fishing|
| 61|            9| <=50K|          Divo

In [None]:
# Replace dots with underscores for all column names   
for col_name in adults_df.columns:
    if '.' in col_name:
        new_name = col_name.replace('.', '_')
        adults_df = adults_df.withColumnRenamed(col_name, new_name)

# Drop rows that contain null values and show the result
# (the "?" placeholders are strings, not nulls, so those rows remain)
adults_df.na.drop().show()

+---+-------------+------+--------------+-----------------+
|age|education_num|income|marital_status|       occupation|
+---+-------------+------+--------------+-----------------+
| 90|            9| <=50K|       Widowed|                ?|
| 82|            9| <=50K|       Widowed|  Exec-managerial|
| 66|           10| <=50K|       Widowed|                ?|
| 54|            4| <=50K|      Divorced|Machine-op-inspct|
| 41|           10| <=50K|     Separated|   Prof-specialty|
| 34|            9| <=50K|      Divorced|    Other-service|
| 38|            6| <=50K|     Separated|     Adm-clerical|
| 74|           16|  >50K| Never-married|   Prof-specialty|
| 68|            9| <=50K|      Divorced|   Prof-specialty|
| 41|           10|  >50K| Never-married|     Craft-repair|
| 45|           16|  >50K|      Divorced|   Prof-specialty|
| 38|           15|  >50K| Never-married|   Prof-specialty|
| 52|           13|  >50K|       Widowed|    Other-service|
| 32|           14|  >50K|     Separated