<a href="https://colab.research.google.com/github/Ricardo-Jaramillo/PySpark/blob/main/00_Basics_of_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark Basics

In this notebook I follow Jose Portilla's course and add some comments of my own.
The course shows four different ways to install and use PySpark, with a virtual machine environment as the main one.

The course was originally taught on a Linux OS inside a VM. I moved over to Colab instead, and here's my journey...

## Install pyspark and get familiar with its syntax

In [32]:
# Install and import pyspark libraries
!pip install pyspark

import pyspark
from pyspark.sql import SparkSession



In [33]:
# Init Spark
spark = SparkSession.builder.appName('Basics').getOrCreate()

In [34]:
# Get our first dataframe
df = spark.read.json('people.json')
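The read above only works if `people.json` sits in the working directory, which is not the case on a fresh Colab runtime. A minimal sketch for creating it, assuming the file holds exactly the three rows that `df.show()` prints in this notebook (the record contents are inferred from that output, not taken from the course files):

```python
import json

# Records inferred from the df.show() output in this notebook (an assumption).
# Michael has no "age" key, which Spark reads as NULL.
records = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# By default spark.read.json expects JSON Lines: one JSON object per line.
with open("people.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Run this before the `spark.read.json('people.json')` cell, or upload the course's own file instead.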

In [35]:
# Show our dataframe
df.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [36]:
# Get the current schema of the df
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [37]:
# Get columns
df.columns

['age', 'name']

In [38]:
# Describe df
df.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   NULL|
| stddev|7.7781745930520225|   NULL|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
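The `age` count is 2 rather than 3 because `describe()` skips the NULL age, and `stddev` is the sample standard deviation (it divides by n - 1). A quick check of those numbers in plain Python, with the two non-null ages typed in by hand:

```python
import statistics

# The two non-null ages from the dataframe; Michael's NULL is left out.
ages = [30, 19]

count = len(ages)
mean = statistics.mean(ages)
# statistics.stdev is the sample standard deviation (divides by n - 1).
stddev = statistics.stdev(ages)

print(count, mean, stddev)
```

The `min` and `max` of `name` are plain string comparisons, which is why `Andy` comes first and `Michael` last.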



## Change the schema type

In [39]:
# Import the types needed to define a schema
from pyspark.sql.types import (StructField, StringType,
                               IntegerType, StructType)

In [40]:
# Define the fields of the new schema and wrap them in a StructType
data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]

final_struct = StructType(fields=data_schema)

In [41]:
# Read in again people.json with the new struct
df_int = spark.read.json('people.json', schema=final_struct)

In [42]:
# Print the schema of the new df
df_int.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



In [43]:
# Compare with the schema of the original df
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



## Work with columns

In [47]:
# Show the column object and its type
df['age'], type(df['age'])

(Column<'age'>, pyspark.sql.column.Column)

In [48]:
# Get the values of that column
df.select('age').show()

+----+
| age|
+----+
|NULL|
|  30|
|  19|
+----+



In [55]:
# Get a single row of the dataframe
df.head(2)[0], type(df.head(2)[0])

(Row(age=None, name='Michael'), pyspark.sql.types.Row)

In [57]:
# Select column values
df.select(['age', 'name']).show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [59]:
# Show a new column
df.withColumn('double_age', df['age']*2).show()

+----+-------+----------+
| age|   name|double_age|
+----+-------+----------+
|NULL|Michael|      NULL|
|  30|   Andy|        60|
|  19| Justin|        38|
+----+-------+----------+



In [61]:
# Show the non-modified df
df.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [65]:
# Show a renamed column
df.withColumnRenamed('age', 'my_new_age').show()

+----------+-------+
|my_new_age|   name|
+----------+-------+
|      NULL|Michael|
|        30|   Andy|
|        19| Justin|
+----------+-------+



In [66]:
# Show the non-modified df
df.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



## Working with SQL Syntax

In [84]:
# Register the df as a temporary view so we can query it with SQL syntax
df.createOrReplaceTempView('people')

In [71]:
# Make a simple query
results = spark.sql('SELECT * FROM people')

In [72]:
# show results
results.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [85]:
# Run a new query with a WHERE filter
new_results = spark.sql('SELECT * FROM people WHERE age = 30')

In [76]:
# Show new_results
new_results.show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [86]:
# Drop temporary view
spark.catalog.dropTempView("people")

True