# What is Spark SQL?

Spark SQL is the Apache Spark module for working with structured and semi-structured data. It lets users mix SQL (Structured Query Language) queries with the rest of a Spark program, so complex analyses run on Spark's distributed execution engine.

Key features and components of Spark SQL include:

1. **DataFrame API:**
   - Spark SQL introduces the concept of DataFrames, which are distributed collections of data organized into named columns. DataFrames provide a higher-level abstraction compared to RDDs (Resilient Distributed Datasets) and are designed to work seamlessly with Spark SQL.

2. **SQL Queries:**
   - Spark SQL allows users to execute SQL queries on DataFrames. Complex transformations and aggregations can be expressed in SQL syntax, which gives a familiar interface to anyone who already knows SQL.

3. **Hive Compatibility:**
   - Spark SQL is compatible with Apache Hive, allowing users to run Hive queries and access Hive UDFs (User-Defined Functions) within Spark applications. This compatibility makes it easier for organizations that have existing Hive queries to transition to Spark.

4. **Catalyst Optimizer:**
   - Spark SQL includes the Catalyst optimizer, which rewrites the logical plan of a query and selects a physical execution plan for it in order to improve performance.

5. **DataSource API:**
   - Spark SQL provides a DataSource API that allows users to read and write data in various formats and storage systems. This includes support for reading and writing data in Parquet, Avro, ORC, JSON, CSV, and more.

6. **Unified Data Access:**
   - With Spark SQL, users can switch between the DataFrame API and the SQL API within the same program, because both operate on the same data and the same engine. This lets users choose whichever API best suits a particular task.

7. **Structured Streaming:**
   - Structured Streaming is built on the Spark SQL engine, so streaming data can be processed with the same DataFrame operations and SQL queries used for batch data.

Here's a simple example of using Spark SQL:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Create a DataFrame from a CSV file
df = spark.read.csv("path/to/your/data.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("my_table")

# Perform a SQL query on the DataFrame
result = spark.sql("SELECT * FROM my_table WHERE age > 25")

# Show the result
result.show()

# Stop the Spark session when done
spark.stop()
```

In this example, a DataFrame is created from a CSV file and registered as a temporary view, and a SQL query is then executed against that view using Spark SQL.

```sql
CREATE TABLE flights (
DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count LONG)
USING JSON OPTIONS (path '/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/2011-summary.json')
```

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Build a small example DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# 1. Select Operation
result_select = df.select("name", "age")
result_select.show()

# 2. Filter Operation
result_filter = df.filter(df["age"] > 25)
result_filter.show()

# 3. GroupBy and Aggregation
result_groupby = df.groupBy("age").count()
result_groupby.show()

# 4. Sorting
result_sort = df.orderBy("age")
result_sort.show()

# 5. SQL Queries
result_sql = spark.sql("SELECT name, age FROM people WHERE age > 25")
result_sql.show()

# 6. Join Operation
df2 = spark.createDataFrame([("Alice", "Engineer"), ("Bob", "Doctor")], ["name", "profession"])
result_join = df.join(df2, "name")
result_join.show()

# 7. Union Operation
df3 = spark.createDataFrame([("David", 28), ("Eva", 35)], ["name", "age"])
result_union = df.union(df3)
result_union.show()

# Stop the Spark session when done
spark.stop()
```

```text
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+

+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+

+---+-----+
|age|count|
+---+-----+
| 25|    1|
| 30|    1|
| 22|    1|
+---+-----+

+-------+---+
|   name|age|
+-------+---+
|Charlie| 22|
|  Alice| 25|
|    Bob| 30|
+-------+---+

+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+

+-----+---+----------+
| name|age|profession|
+-----+---+----------+
|Alice| 25|  Engineer|
|  Bob| 30|    Doctor|
+-----+---+----------+

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
|  David| 28|
|    Eva| 35|
+-------+---+
```



Beyond basic querying and DataFrame manipulation, Spark SQL also supports table-level operations. Here are examples of some of them:

### 1. Creating External Tables:

```python
from pyspark.sql import SparkSession

# Create a Spark session; Hive-format DDL such as CREATE EXTERNAL TABLE requires Hive support
spark = SparkSession.builder.appName("ExternalTableExample").enableHiveSupport().getOrCreate()

# Create an external table named 'external_data' whose data lives at the given location
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS external_data (name STRING, age INT) LOCATION '/path/to/external/table'")
```

This example creates an external table named `external_data` pointing to data located at `/path/to/external/table`. The actual table creation syntax may vary based on the underlying metastore (Hive, etc.) and the data source.

### 2. Inserting into Tables:

```python
# Assuming 'people' is an existing table in the catalog, not the temporary view created earlier
df_insert = spark.createDataFrame([("John", 28), ("Emma", 32)], ["name", "age"])

# Insert data into an existing table
df_insert.write.insertInto("people")
```

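This example appends the rows of a DataFrame (`df_insert`) to an existing table named `people`. `insertInto` matches columns by position rather than by name, so the DataFrame's column order must match the table's.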

### 3. Describing Table Metadata:

```python
# Describe the metadata of a table
spark.sql("DESCRIBE EXTENDED people").show(truncate=False)
```

This SQL query provides detailed information about the structure and metadata of the `people` table.

### 4. Refreshing Table Metadata:

```python
# Refresh the metadata of a table
spark.sql("REFRESH TABLE people")
```

`REFRESH TABLE` invalidates the cached data and metadata for the table, so the next query re-reads them. It is useful when the table's underlying files have been changed outside of Spark SQL.

### 5. Dropping Tables:

```python
# Drop a table
spark.sql("DROP TABLE IF EXISTS people")
```

This SQL query drops the `people` table if it exists.

### 6. Caching Tables:

```python
# Cache a table in memory
spark.sql("CACHE TABLE people")
```

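This operation caches the `people` table, which can speed up subsequent queries that read it. The SQL `CACHE TABLE` statement is eager by default; `CACHE LAZY TABLE` defers caching until the table is first used.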

### 7. Uncaching Tables:

```python
# Uncache a table from memory
spark.sql("UNCACHE TABLE people")
```

This operation removes the table from the in-memory cache.

Remember to adapt these examples based on your specific use case, table structures, and storage systems. The syntax might vary depending on the underlying storage system (Hive, HDFS, etc.) and the Spark version you are using.