In [0]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PySpark Example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()


Here is an explanation of each part:

1. `from pyspark.sql import SparkSession`: This imports the SparkSession class from the pyspark.sql module. SparkSession is the entry point to programming Spark with the Dataset and DataFrame API.

2. `spark = SparkSession.builder`: This accesses the `builder` attribute, which returns a `SparkSession.Builder` object used to configure the session before it is created.

3. `appName("PySpark Example")`: This method sets the name of the Spark application, which will be visible in the Spark web UI.

4. `config("spark.some.config.option", "some-value")`: This method sets a Spark property as a key-value pair. The key "spark.some.config.option" and the value "some-value" are placeholders; replace them with a real property for your use case, and chain additional `config()` calls to set more than one.

5. `getOrCreate()`: This method gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in the builder.

In summary, this code creates (or reuses) a Spark session with the given application name and configuration. The resulting `spark` object is what the rest of this notebook uses to read files into DataFrames.

In [0]:
spark

# Load a CSV file as a Spark DataFrame

In [0]:
df1 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/gaurav.ojha008@nmims.edu.in/bank_full-1.csv")

In [0]:
df1

DataFrame[age: string, job: string, marital: string, education: string, default: string, balance: string, housing: string, loan: string, contact: string, day: string, month: string, duration: string, campaign: string, pdays: string, previous: string, poutcome: string, Target: string]

# Schema of the DataFrame

In [0]:
df1.printSchema()

root
 |-- age: string (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: string (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- campaign: string (nullable = true)
 |-- pdays: string (nullable = true)
 |-- previous: string (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- Target: string (nullable = true)



# Reading a Parquet File

In [0]:
# Parquet files carry their own schema, so no "header" option is needed
df2 = spark.read.format("parquet").load("dbfs:/FileStore/shared_uploads/gaurav.ojha008@nmims.edu.in/userdata1.parquet")

In [0]:
df2

DataFrame[registration_dttm: timestamp, id: int, first_name: string, last_name: string, email: string, gender: string, ip_address: string, cc: string, country: string, birthdate: string, salary: double, title: string, comments: string]

In [0]:
df2.printSchema()

root
 |-- registration_dttm: timestamp (nullable = true)
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- cc: string (nullable = true)
 |-- country: string (nullable = true)
 |-- birthdate: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- title: string (nullable = true)
 |-- comments: string (nullable = true)



### Get the number of records

In [0]:
df2.count()

1000