In [1]:
# base path for the dataset files, so later reads only need the file name
path_data = '/home/nero/Documents/Estudos/DataCamp/Python/Machine_Learning_with_PySpark/datasets/'


In [2]:
# start spark session

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('chapter_01').getOrCreate()

23/07/11 09:03:30 WARN Utils: Your hostname, nero resolves to a loopback address: 127.0.1.1; using 192.168.1.14 instead (on interface wlp2s0)
23/07/11 09:03:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/11 09:03:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/07/11 09:03:35 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [1]:
# exercise 01

"""
Creating a SparkSession

In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.

The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:

    specify the location of the master node;
    name the application (optional); and
    retrieve an existing SparkSession or, if there is none, create a new one.

The SparkSession class has a version attribute which gives the version of Spark. Note: The version can also be accessed via the __version__ attribute on the pyspark module.

Find out more about SparkSession here.
(https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession)

Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.

Note: You might find it useful to review the slides from the lessons in the Slides panel next to the IPython Shell.
"""

# Instructions

"""

    Import the SparkSession class from pyspark.sql.
    Create a SparkSession object connected to a local cluster. Use all available cores. Name the application 'test'.
    Use the version attribute on the SparkSession object to retrieve the version of Spark running on the cluster. Note: The version might be different to the one that's used in the presentation (it gets updated from time to time).
    Shut down the cluster.

"""

# solution

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
print(spark.version)

# Terminate the cluster
spark.stop()

#----------------------------------#

# Conclusion

"""
Nicely done! The session object will now allow us to load data into Spark.
"""



3.4.1
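
The master URL and the shutdown step from exercise 01 can be sketched together as follows. This is a minimal illustration, not part of the course solution: `local_master` and `print_spark_version` are hypothetical helper names, and the second function assumes PySpark is installed. `local[*]` requests all available cores, while `local[n]` requests n cores. Wrapping the work in `try`/`finally` means `spark.stop()` runs even if the body raises.

```python
def local_master(cores=None):
    """Build a local master URL: 'local[*]' for all cores, 'local[n]' for n cores."""
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def print_spark_version(cores=None, app_name="test"):
    """Start a local session, print the Spark version, then shut the session down."""
    # Imported lazily so local_master() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(local_master(cores))
        .appName(app_name)
        .getOrCreate()
    )
    try:
        print(spark.version)
    finally:
        # Free the cluster's resources even if the body raised.
        spark.stop()
```

Calling `print_spark_version()` would reproduce the exercise with all cores, and `print_spark_version(cores=2)` would limit the local cluster to two.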



In [3]:
# exercise 02

"""
Loading flights data

In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly, these data have been trimmed down to only 50,000 records. You can get a larger dataset in the same format here.
(https://assets.datacamp.com/production/repositories/3918/datasets/e1c1a03124fb2199743429e9b7927df18da3eacf/flights-larger.csv)

Notes on CSV format:

    fields are separated by a comma (this is the default separator) and
    missing data are denoted by the string 'NA'.

Data dictionary:

    mon — month (integer; the sample rows shown after the solution include 0, so values appear to run from 0 to 11 rather than 1 to 12)
    dom — day of month (integer between 1 and 31)
    dow — day of week (integer; 1 = Monday and 7 = Sunday)
    carrier — carrier (IATA code)(https://en.wikipedia.org/wiki/List_of_airline_codes)
    flight — flight number
    org — origin airport (IATA code)(https://en.wikipedia.org/wiki/IATA_airport_code)
    mile — distance (miles)
    depart — departure time (decimal hour)
    duration — expected duration (minutes)
    delay — delay (minutes)

pyspark has been imported for you and the session has been initialized.

Note: The data have been aggressively down-sampled.
"""

# Instructions

"""

    Read data from a CSV file called 'flights.csv'. Assign data types to columns automatically. Deal with missing data.
    How many records are in the data?
    Take a look at the first five records.
    What data types have been assigned to the columns? Do these look correct?

"""

# solution

# Read data from CSV file
flights = spark.read.csv(path_data + 'flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)

#----------------------------------#

# Conclusion

"""
The correct data types have been inferred for all of the columns.
"""


The data contain 50000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]
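
To see why `nullValue='NA'` matters for type inference, here is a plain-Python sketch of the idea. It is not Spark's implementation: `read_csv` and `infer_type` are hypothetical helpers, and the CSV text is reconstructed from the first two rows shown above, assuming the null delay was stored as NA in the file.

```python
import csv
import io

# Reconstructed sample (assumption): header plus the first two rows shown above.
SAMPLE = """mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
11,20,6,US,19,JFK,2153,9.48,351,NA
0,22,2,UA,1107,ORD,316,16.33,82,30
"""


def infer_type(values):
    """Return 'int', 'double' or 'string' based on the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "string"
    for name, cast in (("int", int), ("double", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def read_csv(text, sep=",", null_value="NA"):
    """Parse delimited text, map the null marker to None, and infer column types."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    header, records = rows[0], rows[1:]
    records = [[None if cell == null_value else cell for cell in row]
               for row in records]
    columns = zip(*records)
    dtypes = [(name, infer_type(col)) for name, col in zip(header, columns)]
    return records, dtypes


records, dtypes = read_csv(SAMPLE)
print(dtypes)

# Without a null marker, the literal 'NA' stays in the delay column.
_, raw_dtypes = read_csv(SAMPLE, null_value=None)
print(dict(raw_dtypes)["delay"])
```

In this sketch the delay column is inferred as an integer only when 'NA' is treated as missing. Otherwise the stray string forces the whole column to string, which is presumably why the exercise sets `nullValue`.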



In [4]:
# exercise 03

"""
Loading SMS spam data

You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.

The file sms.csv contains a selection of SMS messages which have been classified as either 'spam' or 'ham'. These data have been adapted from the UCI Machine Learning Repository. There are a total of 5574 SMS messages, of which 747 have been labelled as spam.
(https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)

Notes on CSV format:

    no header record and
    fields are separated by a semicolon (this is not the default separator).

Data dictionary:

    id — record identifier
    text — content of SMS message
    label — spam or ham (integer; 0 = ham and 1 = spam)

"""

# Instructions

"""

    Specify the data schema, giving column names ("id", "text", and "label") and column types.
    Read data from a delimited file called "sms.csv".
    Print the schema for the resulting DataFrame.


"""

# solution

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv(path_data + 'sms.csv', sep=';', header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()

#----------------------------------#

# Conclusion

"""
Excellent! You now know how to initiate a Spark session and load data. In the next chapter you'll use the data you've just loaded to build a classification model.
"""

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)
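
As an alternative sketch, `spark.read.csv` also accepts the schema as a DDL-formatted string instead of a `StructType`. `SMS_SCHEMA_DDL` and `load_sms` are hypothetical names introduced here for illustration. The function assumes an active SparkSession and a valid path to the semicolon-delimited file.

```python
# Same three columns as the StructType above, written as a DDL string.
SMS_SCHEMA_DDL = "id INT, text STRING, label INT"


def load_sms(spark, path):
    """Read the semicolon-delimited, header-less SMS file with an explicit schema."""
    return spark.read.csv(path, sep=";", header=False, schema=SMS_SCHEMA_DDL)


# Usage, with an active session and the path prefix from the first cell:
# sms = load_sms(spark, path_data + 'sms.csv')
# sms.printSchema()
```

The DDL form is shorter for flat schemas. The `StructType` form used in the solution is more explicit and lets nullability be set per field.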



In [7]:
sms.createOrReplaceTempView('sms')
flights.createOrReplaceTempView('flights')

spark.catalog.listTables()

[Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='sms', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]
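
Once registered, the temporary views can be queried with SQL through `spark.sql`. A minimal sketch, assuming the session is still active, so it has to run before the final `spark.stop()`. `LABEL_COUNT_QUERY` and `label_counts` are hypothetical names introduced here.

```python
# Count messages per label in the 'sms' temporary view (0 = ham, 1 = spam).
LABEL_COUNT_QUERY = """
SELECT label, COUNT(*) AS n
FROM sms
GROUP BY label
ORDER BY label
"""


def label_counts(spark):
    """Run the label-count query against the registered 'sms' view."""
    return spark.sql(LABEL_COUNT_QUERY)


# Usage, with an active session:
# label_counts(spark).show()
```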

In [8]:
spark.stop()