# Spark SQL 
### Core Classes

### class pyspark.sql.SparkSession(sparkContext: pyspark.context.SparkContext, jsparkSession: Optional[py4j.java_gateway.JavaObject] = None, options: Dict[str, Any] = {})

The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:

In [5]:
from pyspark.sql import SparkSession 

In [16]:
spark = (
    SparkSession.builder
        .master("local")
        .appName("Example Erarser")
        .getOrCreate()
)
print("Spark Version : "+spark.version)

Spark Version : 3.2.4


### class pyspark.sql.DataFrame(jdf: py4j.java_gateway.JavaObject, sql_ctx: Union[SQLContext, SparkSession])

A distributed collection of data grouped into named columns. A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

In [51]:
# Create a DataFrame using SparkSession
department = spark.createDataFrame([
    {"id": 1, "name": "PySpark"},
    {"id": 2, "name": "ML"},
    {"id": 3, "name": "Spark SQL"}
])
# show() prints the table and returns None, so it is kept out of the assignment
department.show()

+---+---------+
| id|     name|
+---+---------+
|  1|  PySpark|
|  2|       ML|
|  3|Spark SQL|
+---+---------+



### pyspark.sql.Column

##### class pyspark.sql.Column(jc: py4j.java_gateway.JavaObject)

A column in a DataFrame.

In [53]:
df = spark.createDataFrame(
     [(2, "Alice"), (5, "Bob")], ["age", "name"])
df.age

Column<'age'>

### pyspark.sql.Observation

Class to observe (named) metrics on a DataFrame.

Metrics are aggregation expressions, which are applied to the DataFrame while it is being processed by an action.

The metrics have the following guarantees:

- It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset during the action.
- It will report the value of the defined aggregate columns as soon as we reach the end of the action.

The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset’s columns must always be wrapped in an aggregate function.

An Observation instance collects the metrics while the first action is executed. Subsequent actions do not modify the metrics returned by Observation.get. Retrieval of the metric via Observation.get blocks until the first action has finished and metrics become available.

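In [None]:
# NOTE: THE FUNCTION IS NOT RECOGNIZED, so this cell was not run.
# Likely cause: Observation was added in PySpark 3.3.0 and this notebook runs Spark 3.2.4.
from pyspark.sql.functions import col, count, lit, max

from pyspark.sql import Observation

df = spark.createDataFrame([["Alice", 2], ["Bob", 5]], ["name", "age"])

observation = Observation("my metrics")

observed_df = df.observe(observation, count(lit(1)).alias("count"), max(col("age")))

# The first action triggers the collection of the metrics
observed_df.count()
# Documented result: 2

# get is a property; it blocks until the first action has finished
observation.get
# Documented result: {'count': 2, 'max(age)': 5}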

### pyspark.sql.Row

class pyspark.sql.Row

A row in DataFrame. The fields in it can be accessed:

- like attributes (row.key)
- like dictionary values (row[key])

key in row will search through row keys.

Row can be used to create a row object by using named arguments. It is not allowed to omit a named argument to represent that the value is None or missing. This should be explicitly set to None in this case.

In [65]:
from pyspark.sql import Row

row = Row(name="Alice", age=11)
print(row)
print(row['name'], row['age'])
print(row.name, row.age)
print('name' in row)
'wrong_key' in row

Row(name='Alice', age=11)
Alice 11
Alice 11
True


False

### pyspark.sql.DataFrameReader.csv

DataFrameReader.csv(path: Union[str, List[str]], schema: Union[pyspark.sql.types.StructType, str, None] = None, sep: Optional[str] = None, encoding: Optional[str] = None, quote: Optional[str] = None, escape: Optional[str] = None, comment: Optional[str] = None, header: Union[bool, str, None] = None, inferSchema: Union[bool, str, None] = None, ignoreLeadingWhiteSpace: Union[bool, str, None] = None, ignoreTrailingWhiteSpace: Union[bool, str, None] = None, nullValue: Optional[str] = None, nanValue: Optional[str] = None, positiveInf: Optional[str] = None, negativeInf: Optional[str] = None, dateFormat: Optional[str] = None, timestampFormat: Optional[str] = None, maxColumns: Union[str, int, None] = None, maxCharsPerColumn: Union[str, int, None] = None, maxMalformedLogPerPartition: Union[str, int, None] = None, mode: Optional[str] = None, columnNameOfCorruptRecord: Optional[str] = None, multiLine: Union[bool, str, None] = None, charToEscapeQuoteEscaping: Optional[str] = None, samplingRatio: Union[str, float, None] = None, enforceSchema: Union[bool, str, None] = None, emptyValue: Optional[str] = None, locale: Optional[str] = None, lineSep: Optional[str] = None, pathGlobFilter: Union[bool, str, None] = None, recursiveFileLookup: Union[bool, str, None] = None, modifiedBefore: Union[bool, str, None] = None, modifiedAfter: Union[bool, str, None] = None, unescapedQuoteHandling: Optional[str] = None) → DataFrame

Loads a CSV file and returns the result as a DataFrame.

This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable the inferSchema option or specify the schema explicitly using schema.

In [69]:
df = spark.createDataFrame([{"age": 100, "name": "Hyukjin Kwon"}])
df.write.mode("overwrite").format("csv").save("Carpeta_temporal/Ejemplo_csv")
# Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.
spark.read.csv("Carpeta_temporal/Ejemplo_csv", schema=df.schema, nullValue="Hyukjin Kwon").show()

+---+----+
|age|name|
+---+----+
|100|null|
+---+----+



### pyspark.sql.DataFrameReader.format

Specifies the input data source format.
The example below writes a DataFrame into a JSON file and reads it back.

In [70]:
# Write a DataFrame into a JSON file
spark.createDataFrame(
    [{"age": 100, "name": "Hyukjin Kwon"}]
).write.mode("overwrite").format("json").save("Carpeta_temporal/Ejemplo_json")

# Read the JSON file as a DataFrame.
spark.read.format('json').load("Carpeta_temporal/Ejemplo_json").show()

+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
+---+------------+



### pyspark.sql.DataFrameReader.jdbc

### DataFrameReader.jdbc(url: str, table: str, column: Optional[str] = None, lowerBound: Union[str, int, None] = None, upperBound: Union[str, int, None] = None, numPartitions: Optional[int] = None, predicates: Optional[List[str]] = None, properties: Optional[Dict[str, str]] = None) → DataFrame

Constructs a DataFrame representing the database table named table, accessible via the JDBC URL url and connection properties.
Partitions of the table will be retrieved in parallel if either column or predicates is specified.
lowerBound, upperBound and numPartitions are needed when column is specified.
If both column and predicates are specified, column will be used.

#### More information on the parameters:
    
URL : https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.jdbc.html
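
As a rough illustration of the predicates option, the sketch below builds one WHERE clause per partition and passes the list to `spark.read.jdbc`. The helper names `build_range_predicates` and `read_partitioned`, the `id` column, and the bounds are hypothetical. The read itself is only defined, not called, because it needs a reachable database and a JDBC driver on the classpath.

```python
from typing import Dict, List


def build_range_predicates(
    column: str, lower: int, upper: int, num_partitions: int
) -> List[str]:
    """Split [lower, upper) into one WHERE clause per partition (hypothetical helper)."""
    step = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * step
        # The last partition absorbs any remainder so the whole range is covered
        end = upper if i == num_partitions - 1 else start + step
        predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


def read_partitioned(
    spark, url: str, table: str, predicates: List[str], properties: Dict[str, str]
):
    """Read the table with one partition per predicate (needs a live database)."""
    return spark.read.jdbc(
        url=url, table=table, predicates=predicates, properties=properties
    )


predicates = build_range_predicates("id", 0, 100, 4)
print(predicates)
```

Rows that match none of the predicates, such as NULL ids or ids outside the range, are not read in this variant, so the clauses should cover every row that is needed.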

In [39]:
df = (
    spark.read.format("csv")
        .option("header", "true")
        .option("delimiter", ";")
        .load("data/registro_factura.csv")  # Mobiles_Dataset_(2025).csv
)

In [40]:
df.show(10)

+-----------+------+--------------------+-------+-----------------+----------------+
|Supermecado|Unidad|            Producto|Importe|Precio por Unidad|           Fecha|
+-----------+------+--------------------+-------+-----------------+----------------+
|  Mercadona|     1|Bicarbonato Blanq...|   1.10|             1.10|17/02/2025_11:53|
|     Eroski|     1|       Lacon Ahumado|      3|                3|08/02/2025_12:26|
|     Eroski|     1|       Sprite 0,33 l|   0.65|             0.65|08/02/2025_12:26|
|     Eroski|     1|    Pepsi Zero Limao|   0.75|             0.75|08/02/2025_12:26|
|     Eroski|     1|       Tiburon Pasta|   0.80|             0.80|  08/02/25_12:26|
|     Eroski|     2|  Aceitunas Rellenas|   1.94|             0.97|  08/02/25_12:26|
|     Eroski|     4|       Amstel Radler|   3.08|             0.77|  08/02/25_12:26|
|     Eroski|     1|    Gel WC Eucalipto|   0.93|             0.93|  08/02/25_12:26|
|     Eroski|     1|Jabon lavaplatos ...|   1.43|             1.4