# Spark SQL

Exactly! ✅ **`createOrReplaceTempView` does *not* create a physical table** in any storage system such as Hive, an external database, or files on disk.

---

### 🔍 What `createOrReplaceTempView("my_table")` really does:

* It **registers a logical name** (`"my_table"`) for a DataFrame **inside Spark's session catalog**.
* It is a **temporary reference to the DataFrame's query plan**, held in the session's memory; the data itself is not copied or cached.
* It's like saying:

  > “Hey Spark, if someone runs a SQL query on `my_table`, use this DataFrame.”

---

### ❌ It does NOT:

* Store any data on disk.
* Create a real table in Hive or an external database.
* Persist the table across sessions.

---

### ✅ It DOES:

* Allow you to use SQL queries on your DataFrame:

  ```python
  spark.sql("SELECT * FROM my_table WHERE age > 25").show()
  ```
* Exist only for the duration of the current Spark session.

---

### 🔄 If you want to create an actual table:

Use:

```python
df.write.saveAsTable("permanent_table")
```

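This writes the data as files under the warehouse directory and registers the table in the catalog. If a Hive metastore is configured, the table stays available in later sessions.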

---

### 🧠 TL;DR

| Action                      | Creates a real table? | Stored on disk? | Survives session? |
| --------------------------- | --------------------- | --------------- | ----------------- |
| `createOrReplaceTempView()` | ❌ No                  | ❌ No            | ❌ No              |
| `df.write.saveAsTable()`    | ✅ Yes (catalog table) | ✅ Yes           | ✅ Yes, given a persistent metastore |

---



In [4]:
# In short, it doesn't create a table or persist anything; it's just a name Spark SQL can use to refer to the DataFrame (Brian Copler, Spark Streaming, InfoQ)

# just to make this df visible in sql land

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("JupyterStandalo") \
    .master("spark://8fa087ac675c:7077") \
    .config("spark.executor.instances", "3") \
    .config("spark.executor.cores", "6") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/24 12:35:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/06/24 12:35:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]

# The `spark` session created in the previous cell is reused here

# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])

# Create the DataFrame from the in-memory rows and the explicit schema
df = spark.createDataFrame(data=data2, schema=schema)

In [6]:
df.createOrReplaceTempView("my_table")


In [7]:
# Note: my_table is available in this session only.

# The temporary view is scoped to the SparkSession in which it was created.

In [8]:
df1 = spark.sql("SELECT * FROM my_table")

In [9]:
df1.show()

+-------+-----------+--------+-----+---------------+
|   Name|Roll Number|Class ID|Marks|Extracurricular|
+-------+-----------+--------+-----+---------------+
| Pulkit|         12|    CS32|   82|    Programming|
| Ritika|         20|    CS32|   94|        Writing|
|Atirikt|          4|    BB21|   78|           NULL|
| Reshav|         18|    NULL|   56|           NULL|
+-------+-----------+--------+-----+---------------+



                                                                                

In [10]:
# Now you can run as much SQL as you want against my_table