### Create a table:

In [None]:
%sql
CREATE TABLE IF NOT EXISTS table_name USING parquet OPTIONS (path "/the/file/path/table_file.parquet")

In [None]:
%python
spark.sql("""CREATE TABLE IF NOT EXISTS table_name USING parquet OPTIONS (path "{}")""".format(table_name_path))
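
When the table name or path comes from a variable, it can help to build the statement in a small helper and inspect it before handing it to `spark.sql()`. The sketch below is only an illustration: `build_create_table_sql` and its identifier check are hypothetical helpers, not part of Spark.

```python
import re

def build_create_table_sql(table_name: str, path: str, fmt: str = "parquet") -> str:
    """Return a CREATE TABLE statement for an external file (hypothetical helper)."""
    # Allow only simple identifiers such as "table" or "db.table".
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?", table_name):
        raise ValueError(f"invalid table name: {table_name!r}")
    # The path is wrapped in double quotes, so it must not contain any itself.
    if '"' in path:
        raise ValueError("path must not contain double quotes")
    return f'CREATE TABLE IF NOT EXISTS {table_name} USING {fmt} OPTIONS (path "{path}")'

statement = build_create_table_sql("table_name", "/the/file/path/table_file.parquet")
print(statement)
```

In a notebook, the result would then be passed on as `spark.sql(statement)`.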

### Create a widget:

In [None]:
%sql
CREATE WIDGET TEXT widget_name DEFAULT "widget value";

-- Returns the widget's current value
SELECT getArgument("widget_name");

-- Removes the widget
REMOVE WIDGET widget_name;

### SparkSession (the single entry point since Spark 2.0; older versions used SparkContext and SQLContext):

![SparkSession Methods](pics/SparkSession%20Methods.PNG)

### SQL vs DataFrame API:

In [None]:
%sql
SELECT col_1, col_2
FROM table_name
WHERE col_2 < 20
ORDER BY col_2

In [None]:
%python
display(spark.table("table_name")
    .select("col_1", "col_2")
    .where("col_2 < 20")
    .orderBy("col_2"))
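
The two cells above express the same query. As a rough sketch, they can be wrapped in functions that take the session as a parameter, which makes the one-to-one mapping between SQL clauses and DataFrame methods easier to see. The function names and the `max_value` parameter are illustrative assumptions; `spark` is expected to be the notebook's SparkSession.

```python
def query_with_sql(spark, table_name, max_value):
    """Run the query as a SQL string (illustrative helper)."""
    return spark.sql(
        f"SELECT col_1, col_2 FROM {table_name} "
        f"WHERE col_2 < {max_value} ORDER BY col_2"
    )

def query_with_dataframe_api(spark, table_name, max_value):
    """Build the same query with DataFrame methods (illustrative helper)."""
    return (spark.table(table_name)          # FROM table_name
            .select("col_1", "col_2")        # SELECT col_1, col_2
            .where(f"col_2 < {max_value}")   # WHERE col_2 < max_value
            .orderBy("col_2"))               # ORDER BY col_2
```

Both functions return a DataFrame, so either result can be passed to `display()` or `.show()`.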

### Create a DataFrame:

In [None]:
tableDF = spark.table("table_name")

### SparkSession Methods:

- **spark.sql()** -> creates a df from a SQL query  
- **spark.table()** -> creates a df from a table  
- **spark.read** -> returns a DataFrameReader used to read data into dfs (a property, not a method)  
- **spark.range()** -> creates a df with a single column (`id`) holding the elements of a range  
- **spark.createDataFrame()** -> creates a df from a list of tuples  

### DataFrame Action Methods:

- **df.show()** -> displays the top n rows in table form  
- **df.count()** -> returns the number of rows in the df  
- **df.describe()/summary()** -> computes basic statistics  
- **df.first()/head()** -> returns the first row  
- **df.collect()** -> returns all the df rows as a list of Row objects  
- **df.take()** -> returns the first n rows as a list  
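
Actions trigger computation and return results to the driver. A minimal sketch of a helper that gathers a few of them follows; `quick_look` is a hypothetical name, and `df` is assumed to be a PySpark DataFrame.

```python
def quick_look(df, n=3):
    """Collect a few action results for a DataFrame (hypothetical helper)."""
    return {
        "row_count": df.count(),   # number of rows
        "first_row": df.first(),   # first Row (None if the df is empty)
        "sample": df.take(n),      # list with the first n Rows
    }
```

`df.collect()` is deliberately left out of the helper because it pulls every row to the driver, which can be expensive on large tables.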


### Column Operators and Methods:

![Column Operators and Methods](pics/Column%20Operators%20and%20Methods.PNG)

### Row Methods:

![Row Methods](pics/Row%20Methods.PNG)

### Create a View:

In [None]:
df.createOrReplaceTempView("view_name")

### Defining Schema Structure:

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, LongType

# StructField(name, dataType, nullable)
df_struct_schema = StructType([
    StructField("col_1", StringType(), True),
    StructField("col_2", LongType(), True),
    StructField("col_n", StringType(), True)
])

# Other types include ArrayType, IntegerType, DoubleType, etc.
# A StructField can also hold a nested StructType

In [None]:
# DDL-formatted schema string
df_ddl_schema = "col_1 string, col_2 long, col_n string"
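
A DDL schema string is just comma-separated `name type` pairs, so it can be assembled from a list when the columns are known programmatically. This is a small sketch; `to_ddl` and the `columns` list are illustrative assumptions.

```python
def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string (illustrative helper)."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)

columns = [("col_1", "string"), ("col_2", "long"), ("col_n", "string")]
df_ddl_schema = to_ddl(columns)
print(df_ddl_schema)  # col_1 string, col_2 long, col_n string
```

The resulting string can be passed wherever a schema is accepted, for example `spark.read.schema(df_ddl_schema)`.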

### Data Sources:

In [None]:
# parquet
# READ
df = (spark
        .read
        .parquet("path/to/the.parquet")
     )
# WRITE
(df.write
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("path/to/target.parquet")
)

In [None]:
# csv
# READ
df = (spark
        .read
        .option("sep", "\t")
        .option("header", True)
        .option("inferSchema", True)
        # or, instead of inferSchema, pass an explicit schema:
        # .schema(df_struct_schema)
        .csv("path/to/the.csv")
     )
# OR
df = (spark
        .read
        .csv("path/to/the.csv", sep="\t", header=True, inferSchema=True)
     )
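
Inferring a CSV schema costs an extra pass over the data, so a common pattern is to use an explicit schema when one is available and fall back to inference otherwise. A minimal sketch, assuming `spark` is the active SparkSession; `read_csv` and its parameters are hypothetical names.

```python
def read_csv(spark, path, schema=None, sep="\t"):
    """Read a delimited file, preferring an explicit schema (hypothetical helper)."""
    reader = spark.read.option("sep", sep).option("header", True)
    if schema is not None:
        # schema may be a StructType or a DDL string
        reader = reader.schema(schema)
    else:
        # No schema given: let Spark infer column types from the data
        reader = reader.option("inferSchema", True)
    return reader.csv(path)
```

For example, `read_csv(spark, "path/to/the.csv", schema="col_1 string, col_2 long, col_n string")` would skip inference, while `read_csv(spark, "path/to/the.csv")` would infer the types.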

In [None]:
# json
# READ
df = (spark
        .read
        .option("inferSchema", True)
        .json("path/to/the.json")
     )


In [None]:
# You can also write as a table
df.write.mode("overwrite").saveAsTable("table_name")

# or as a Delta table
df.write.format("delta").mode("overwrite").save("path/to/the/delta")