## Connecting to a Spark cluster

## Objectives

In this Jupyter notebook, I learned to:

 - Use PySpark to connect to a Spark cluster.
 - Create a Spark session.
 - Read a CSV file into a DataFrame using the Spark session.
 - Stop the Spark session.
 - Use this lab notebook offline.


In [1]:
#!pip install pyspark==3.1.2 -q
#!pip install findspark -q

### Importing Required Libraries

In [2]:
# Suppress warnings so they do not clutter the cell output
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

# import SparkSession
from pyspark.sql import SparkSession

# Examples


## Task 1 - Create a Spark session


In [3]:
# Create a SparkSession, or reuse the active one if it already exists
spark = SparkSession.builder.appName("Getting Started with Spark").getOrCreate()

24/07/19 08:57:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## Task 2 - Download the data file


In [4]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv

--2024-07-19 08:57:22--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13891 (14K) [text/csv]
Saving to: ‘mpg.csv’


2024-07-19 08:57:22 (41.6 MB/s) - ‘mpg.csv’ saved [13891/13891]
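The `!wget` cell needs the `wget` tool on the machine and fetches the file again each time it runs. For working offline after the first run, a pure-Python download that skips files already on disk may be handy. This is a minimal sketch using only the standard library; the helper name `download_if_missing` is made up for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download `url` to `dest` unless the file already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        # Only touch the network when there is no local copy yet.
        urlretrieve(url, str(dest))
    return dest
```

In this notebook it could be called as `download_if_missing("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv", "mpg.csv")`. Note that an existing file is never refreshed, so delete it by hand to force a new download.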



## Task 3 - Load the data from a CSV file into a DataFrame


In [5]:
# Load mpg dataset
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)


## Task 4 - Explore the data set


Print the schema of the dataset

In [6]:
mpg_data.printSchema()

root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)
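
With `inferSchema=True`, Spark makes an extra pass over the file to work out the column types. Once the types are known, as they are from the schema printed here, they can be passed in explicitly instead. The sketch below assumes the columns stay exactly as printed; `MPG_SCHEMA` and `load_mpg` are names invented for this example, and `spark` is the session created in Task 1.

```python
# Column names and types copied from the printSchema() output.
# Backticks quote the one column name that contains a space.
MPG_SCHEMA = (
    "MPG DOUBLE, Cylinders INT, `Engine Disp` DOUBLE, Horsepower INT, "
    "Weight INT, Accelerate DOUBLE, Year INT, Origin STRING"
)


def load_mpg(spark, path="mpg.csv"):
    """Read the mpg CSV with a fixed schema instead of inferring one."""
    # header=True tells Spark the first line holds column names, not data.
    return spark.read.csv(path, header=True, schema=MPG_SCHEMA)
```

Calling `load_mpg(spark).printSchema()` in the notebook should show the same columns and types as the inferred schema.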



Look at some sample rows from the dataset we loaded:


In [7]:
# show 5 rows from the dataset
mpg_data.head(5)

[Row(MPG=15.0, Cylinders=8, Engine Disp=390.0, Horsepower=190, Weight=3850, Accelerate=8.5, Year=70, Origin='American'),
 Row(MPG=21.0, Cylinders=6, Engine Disp=199.0, Horsepower=90, Weight=2648, Accelerate=15.0, Year=70, Origin='American'),
 Row(MPG=18.0, Cylinders=6, Engine Disp=199.0, Horsepower=97, Weight=2774, Accelerate=15.5, Year=70, Origin='American'),
 Row(MPG=16.0, Cylinders=8, Engine Disp=304.0, Horsepower=150, Weight=3433, Accelerate=12.0, Year=70, Origin='American'),
 Row(MPG=14.0, Cylinders=8, Engine Disp=455.0, Horsepower=225, Weight=3086, Accelerate=10.0, Year=70, Origin='American')]

## Task 5 - Stop the Spark session


Once the analysis is finished, stop the Spark session to release the resources it holds.


In [8]:
spark.stop()
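
If a cell fails before `spark.stop()` is reached, the session stays open. One way to guard against that is to wrap the session in a context manager that always stops it. This is only a sketch: `managed_session` is a hypothetical helper, not part of PySpark, and it expects a builder such as `SparkSession.builder.appName(...)`.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(builder):
    """Yield a session from `builder` and stop it on exit, even after an error."""
    spark = builder.getOrCreate()
    try:
        yield spark
    finally:
        # Runs whether the body finished normally or raised.
        spark.stop()


# Usage in the notebook (SparkSession is imported in the cells above):
# with managed_session(SparkSession.builder.appName("Getting Started with Spark")) as spark:
#     spark.read.csv("mpg.csv", header=True, inferSchema=True).printSchema()
```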

# Exercises


### Exercise 1 - Create a Spark Session


Create a Spark session with the app name "Diamond data analysis".


In [9]:
spark = SparkSession.builder.appName("Diamond data analysis").getOrCreate()

### Exercise 2 - Load the dataset into a dataframe


Download the data set from "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"


In [10]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv


--2024-07-19 09:00:40--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3192561 (3.0M) [text/csv]
Saving to: ‘diamonds.csv’


2024-07-19 09:00:41 (15.3 MB/s) - ‘diamonds.csv’ saved [3192561/3192561]



Load the diamond dataset into a DataFrame named diamond_data.


In [13]:
diamond_data = spark.read.csv("diamonds.csv")

### Exercise 3 - Explore the data


Print the schema of the dataframe


In [14]:
diamond_data.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)



### Exercise 4 - Print the top 5 rows of the dataframe


In [15]:
diamond_data.head(5)

[Row(_c0='s', _c1='carat', _c2='cut', _c3='color', _c4='clarity', _c5='depth', _c6='table', _c7='price', _c8='x', _c9='y', _c10='z'),
 Row(_c0='1', _c1='0.23', _c2='Ideal', _c3='E', _c4='SI2', _c5='61.5', _c6='55', _c7='326', _c8='3.95', _c9='3.98', _c10='2.43'),
 Row(_c0='2', _c1='0.21', _c2='Premium', _c3='E', _c4='SI1', _c5='59.8', _c6='61', _c7='326', _c8='3.89', _c9='3.84', _c10='2.31'),
 Row(_c0='3', _c1='0.23', _c2='Good', _c3='E', _c4='VS1', _c5='56.9', _c6='65', _c7='327', _c8='4.05', _c9='4.07', _c10='2.31'),
 Row(_c0='4', _c1='0.29', _c2='Premium', _c3='I', _c4='VS2', _c5='62.4', _c6='58', _c7='334', _c8='4.2', _c9='4.23', _c10='2.63')]
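
Notice what the output of Exercises 3 and 4 shows: the columns are named _c0 to _c10, every column is a string, and the header line ('s', 'carat', 'cut', ...) comes back as the first data row. That is how `spark.read.csv` behaves when called with no options. Passing `header=True` and `inferSchema=True`, as in Task 3, should give named and typed columns instead. A minimal sketch, where `read_csv_with_header` is a made-up helper name and `spark` is the session from Exercise 1:

```python
def read_csv_with_header(spark, path):
    """Read a CSV whose first line holds column names, inferring column types."""
    # header=True: take column names from the first line instead of _c0, _c1, ...
    # inferSchema=True: scan the data to pick numeric types where they fit.
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage in the notebook:
# diamond_data = read_csv_with_header(spark, "diamonds.csv")
# diamond_data.printSchema()
# diamond_data.head(5)
```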