## References and Links

In [None]:
#### PySpark
01. <a>https://youtu.be/EB8lfdxpirM?feature=shared</a>
02. <a>https://www.udemy.com/course/introduction-to-python-for-big-data-engineering-with-pyspark/?couponCode=ST4MT73124</a>
03. <a>https://www.youtube.com/watch?v=6MaZoOgJa84&list=PLMWaZteqtEaJFiJ2FyIKK0YEuXwQ9YIS_&index=1</a>
04. <a>https://www.youtube.com/watch?v=AGgyf9bO_8M&list=PLlUZLZydkS7_8WnK8fMENmJFSfPwxw9Fi</a>

#### Spark
01. <a>https://www.youtube.com/watch?v=qU7u9wGB0JA&list=PLLa_h7BriLH27lOCmOOWhOuarb3HtSeaH</a>

#### MySQL
01. <a>https://youtu.be/5OdVJbNCSso?feature=shared</a>
02. <a>https://youtu.be/7mz73uXD9DA?feature=shared</a>
03. <a>https://www.javatpoint.com/mysql-partitioning</a>

#### Linux
01. <a>https://www.youtube.com/watch?v=4e669hSjaX8</a> (File and Directory Permissions)
02. <a>https://www.youtube.com/watch?v=19WOD84JFxA</a> (User Management)
03. <a>https://www.youtube.com/watch?v=bz0ZCUv5rYo</a> (Full Linux Tutorial : Process, File & Directories, User Management)

#### Databricks
01. <a>https://www.youtube.com/watch?v=2-RIPNhhgHU&list=PL8zzpRdWG891m4LmFeVp-XkYh7ya-G6b6</a>

#### Python
01. <a>https://www.youtube.com/@Indently/videos</a>

#### Others
01. <a>https://www.youtube.com/watch?v=hdFk3EvL1ug&list=PLLa_h7BriLH1NK3hcydotCPgTh5xSWivM&index=2</a>

#### Projects
01. <a>https://www.youtube.com/watch?v=BlWS4foN9cY</a>

## Notes

In [None]:
01. PySpark does not support the Dataset API because Python is dynamically typed: there is no compile step in which the type checking that Datasets rely on could happen.
02. RDDs and DataFrames are sufficient for PySpark, and DataFrames benefit from Spark’s Catalyst optimizer for better performance.
03. Since PySpark does not support the Dataset API (which is available in Scala and Java), the PySpark DataFrame API has been designed to offer several advantages and features to make up for this limitation.
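
The first note can be seen in plain Python, without Spark. The sketch below assumes a hypothetical `Person` class and `birth_year` function, both made up for illustration. It shows that type hints are not enforced before the program runs, so a mismatched value only fails when the offending operation executes.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Type hints document intent; Python does not enforce them.
    name: str
    age: int


def birth_year(person: Person, current_year: int) -> int:
    # Fails only at run time if `age` is not a number.
    return current_year - person.age


# Building a Person with the wrong type for `age` raises nothing:
# there is no compile step that could reject it.
bad = Person(name="Alice", age="thirty")

try:
    birth_year(bad, 2024)
    error = None
except TypeError as exc:
    # The mismatch surfaces only when the subtraction actually runs.
    error = exc

print(type(error).__name__)  # prints "TypeError"
```

A typed Dataset in Scala or Java would reject the equivalent mistake at compile time, which is the guarantee Python cannot offer.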

## Create SparkContext in Apache Spark version 1.x

In [None]:
from pyspark import SparkContext

# Create a SparkContext object
sc = SparkContext(appName="MySparkApplication")

# Print some info regarding the SparkContext
print(sc)

# Stop it, or else we cannot create another SparkContext
sc.stop()

## Create SparkContext via SparkSession in Apache Spark version 2.x

In [None]:
# Import PySpark
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
        .appName("PySpark-Get-Started") \
        .getOrCreate()

# Get the SparkContext from the SparkSession
sc = spark.sparkContext

# Print some info regarding the SparkContext
print(sc)

# Stop it, or else we cannot create another SparkContext
sc.stop() # or spark.stop()

## Creating RDD and DataFrame from RDD

In [None]:
from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder.getOrCreate()

# Create RDD
rdd = spark.sparkContext.parallelize([(1, 'Alice'), (2, 'Bob')])

# Create a DataFrame from the RDD
df = spark.createDataFrame(rdd, schema=['id', 'name'])

# Creating DataFrame from other external source like CSV, JSON, Parquet
# df = spark.read.csv(path = "?")

# Creating new RDD by converting DataFrame to RDD
rdd_new = df.rdd

spark.stop()

## Functions and Imports

In [None]:
#### RDD Functions

01. collect()
02. count()
03. distinct()
04. filter()
05. map()            : Returns exactly one output element per input element
06. flatMap()
07. sortByKey()      : sortByKey(ascending=True); pass False for descending order
08. take()           : take(n)
09. reduce()
10. first()
11. last()           : Not an RDD method; use top(n) or takeOrdered(n) to get elements from the end of an ordering
12. min()
13. max()
14. union()
15. subtract()
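
The results of a few of these functions can be modelled with plain Python lists, which is handy for reasoning about output shapes before running anything on a cluster. This is only an analogy: the `lines` and `pairs` data are made up, the code does not use Spark, and real RDD transformations are lazy and distributed.

```python
from functools import reduce
from itertools import chain

lines = ["hello world", "hi"]

# map(): exactly one output element per input element.
mapped = [line.split(" ") for line in lines]

# flatMap(): each input may yield zero or more outputs, flattened into one list.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

# filter(): keep only the elements for which the predicate is true.
short_words = [word for word in flat_mapped if len(word) <= 2]

# reduce(): combine all elements pairwise into a single value.
total_chars = reduce(lambda a, b: a + b, (len(word) for word in flat_mapped))

# sortByKey(False): sort (key, value) pairs by key in descending order.
pairs = [(2, "Bob"), (3, "Carol"), (1, "Alice")]
sorted_desc = sorted(pairs, key=lambda kv: kv[0], reverse=True)

# take(2): the first two elements.
first_two = sorted_desc[:2]

print(mapped)       # nested: one list per input line
print(flat_mapped)  # flat: one entry per word
```

Here `mapped` keeps two elements (one list per line) while `flat_mapped` holds three words, which is the practical difference between map() and flatMap().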

#### DF Functions
01. show()                    : show(truncate=False) prints full column values
02. printSchema()
03. select()
04. columns                   : Attribute, not a method (df.columns)
05. filter()
06. where()                   : Alias of filter()
07. distinct()
08. alias()
09. orderBy()
10. dropDuplicates()
11. withColumn()              : To add or replace a column
12. withColumnRenamed()
13. drop()
14. na.drop()                 : Drop rows with null values; na.drop("any") drops a row if any value is null, na.drop("all") only if all values are null
15. describe()                : Returns count, mean, stddev, min, max for the given columns; call show() to print them
16. groupBy()
17. agg()
18. count()
19. isNull()                  : Column method
20. cast()                    : Column method
21. limit()
22. toPandas()                : To convert a DataFrame to a pandas DataFrame
23. createOrReplaceTempView() : To register the DataFrame as a temporary view that Spark SQL can query
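
The difference between na.drop("any") and na.drop("all") is easy to mix up, so here is a small plain-Python model of it. The `rows` data and the `drop_nulls` helper are hypothetical and only mimic the row-filtering rule; they are not Spark code.

```python
rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": None, "name": None},
]


def drop_nulls(rows, how="any"):
    """Plain-Python model of the na.drop(how) rule; not Spark code."""
    if how == "any":
        # Drop a row if at least one of its values is null.
        return [r for r in rows if all(v is not None for v in r.values())]
    if how == "all":
        # Drop a row only if every one of its values is null.
        return [r for r in rows if any(v is not None for v in r.values())]
    raise ValueError("how must be 'any' or 'all'")


kept_any = drop_nulls(rows, "any")  # only the fully populated row survives
kept_all = drop_nulls(rows, "all")  # only the fully null row is removed

print(len(kept_any), len(kept_all))  # prints "1 2"
```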

#### Spark Functions
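1. createDataFrame()      : To create a DataFrame
2. sparkContext           : Attribute (not a method) that returns the underlying SparkContext, used for RDD work
3. sql()                  : To run a Spark SQL query
4. read                   : Attribute that returns a DataFrameReader for loading files in various formats
5. write                  : DataFrame attribute (df.write) that returns a DataFrameWriter for saving in various formats
6. range()                : To create a DataFrame with a single column named id holding a range of values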

#### PySpark imports
1. from pyspark.sql import SparkSession, Row

2. from pyspark.sql.functions import col, expr, concat, concat_ws, year, array_contains, count, desc, round, udf, countDistinct
3. from pyspark.sql.functions import min, max, sumDistinct, avg, to_timestamp, date_format, regexp_replace, regexp_extract, lit, array
4. from pyspark.sql.functions import explode, map_keys, map_values, split, when, from_json, current_date
5. from pyspark.sql.functions import to_date

6. from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, FloatType, DateType, BooleanType
7. from pyspark.sql.types import MapType
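
One caveat with the imports above: names such as min, max, round and count, once imported from pyspark.sql.functions, replace the Python built-ins of the same name in the current module. The sketch below shows the effect with a hypothetical stand-in for the Spark function, so that it runs without Spark installed.

```python
import builtins


# Hypothetical stand-in for `from pyspark.sql.functions import max`,
# defined here only so the example runs without Spark. Like the real
# function, it expects a column name rather than a list of numbers.
def max(col_name):
    return f"max({col_name})"


# The name `max` now refers to the column-style function.
shadowed = max("salary")

# The Python built-in is still reachable through the builtins module.
restored = builtins.max([3, 7, 5])

print(shadowed, restored)  # prints "max(salary) 7"
```

A common way to avoid the clash is to import the module under an alias, `from pyspark.sql import functions as F`, and call `F.max(...)` explicitly.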