The function below makes it easy to create a data frame from data in Cloud Object Store using a so-called "DataSource", a plugin mechanism that lets Apache Spark read from different data sources.

This is the first function you have to implement. You are passed a dataframe object. We've also registered the dataframe in the Apache Spark SQL catalog, so you can also issue queries against the "washing" table using "spark.sql()" (in this notebook the SparkSession is bound to the name "sc", so the call is "sc.sql()"). Hint: to get an idea of the contents of the catalog, you can use spark.catalog.listTables().
You are free to use the dataframe API, SQL, or the RDD API. If you want to use the RDD API, obtain the encapsulated RDD using "df.rdd". You can test the function by running one of the last three cells of this notebook, but please make sure you run the cells from top to bottom, since some depend on each other.
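
As a rough sketch of the three options mentioned above, the same row count can be obtained through the dataframe API, SQL, or the RDD API. The helper names `count_three_ways` and `temp_view_names` are made up for this illustration; the sketch assumes `spark` is a SparkSession and that `df` has been registered as the temporary view `washing`.

```python
def count_three_ways(df, spark, view_name="washing"):
    """Return the row count via the dataframe, SQL and RDD APIs.

    Hypothetical helper: assumes `spark` is a SparkSession and that `df`
    is registered under `view_name` in the Spark SQL catalog.
    """
    return {
        # Dataframe API: count() is an action that returns a Python int.
        "dataframe": df.count(),
        # SQL: count(*) counts rows. The washing table also has a column
        # named `count`, so a bare `count` would select that column instead.
        "sql": spark.sql(
            "SELECT count(*) AS cnt FROM {}".format(view_name)
        ).first()["cnt"],
        # RDD API: df.rdd exposes the encapsulated RDD of Row objects.
        "rdd": df.rdd.count(),
    }


def temp_view_names(spark):
    """List the names of temporary views known to the catalog."""
    return [t.name for t in spark.catalog.listTables() if t.isTemporary]
```

All three counts should agree; if they do not, the view name most likely points at a different dataframe.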

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark


In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType
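# Note: despite its name, sc is a SparkSession, not a SparkContext
sc = SparkSession.builder.master("local[*]").getOrCreate()
# Alternative: create a plain SparkContext instead
#sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))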

In [18]:
import numpy as np
a = np.array([[1,2,3]])
b = np.array([[4,5,6]])
c = np.sum(a.dot(b.T))  # transpose b so the shapes (1,3) and (3,1) align for the dot product

In [0]:
#Please implement a function returning the number of rows in the dataframe
def count(df,sc):
    #TODO Please enter your code here, you are not required to use the template code below
    #some reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
    # count(*) counts rows; a bare `count` would select the column named count
    return sc.sql('select count(*) as cnt from washing').first().cnt

Now it's time to implement the second function. Please return an integer containing the number of fields. The easiest way to get this is to use the dataframe API. Hint: You might find the dataframe API documentation useful: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

In [0]:
def getNumberOfFields(df,sc):
    #TODO Please enter your code here, you are not required to use the template code below
    #some reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
    # df.columns is a Python list with one entry per field
    return len(df.columns)

Finally, please implement a function which returns a (Python) list of string values of the field names in this data frame. Hint: Simply copying and pasting the names doesn't work because the auto-grader will create a random data frame for testing, so please use the data frame API here as well. Again, this is the link to the documentation: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
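
One possible way to explore a data frame's schema before writing the function is sketched here. The helper name `describe_schema` is invented for this illustration; it relies only on the `columns` and `dtypes` attributes of a PySpark DataFrame.

```python
def describe_schema(df):
    """Summarize the schema of a dataframe.

    Hypothetical helper: uses DataFrame.columns (list of field names) and
    DataFrame.dtypes (list of (name, type string) pairs).
    """
    names = list(df.columns)   # field names as Python strings
    types = dict(df.dtypes)    # maps field name -> type string such as 'bigint'
    return {"n_fields": len(names), "names": names, "types": types}
```

Because the names come from the data frame itself, the same call works for whatever random data frame the auto-grader supplies.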

In [0]:
def getFieldNames(df,sc):
    #TODO Please enter your code here, you are not required to use the template code below
    #some reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
    # df.columns already is a list of the field names as strings
    return df.columns

Now it is time to grab a Parquet file and create a dataframe from it. Using Spark SQL, you can query it like a database table.
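
To make the "like a database" idea concrete, here is a minimal sketch, assuming `spark` is an existing SparkSession. The helper name `query_parquet` and its parameters are invented for this illustration; `spark.read.format("parquet").load(path)` is the generic DataSource form of `spark.read.parquet(path)`.

```python
def query_parquet(spark, path, view_name, sql):
    """Load a Parquet file, register it as a temporary view, and query it.

    Hypothetical helper: `spark` is assumed to be a SparkSession, and `sql`
    is expected to refer to the table by `view_name`.
    """
    # Generic DataSource API: pick the format by name, then load the path.
    df = spark.read.format("parquet").load(path)
    # Register the dataframe in the catalog so SQL can refer to it by name.
    df.createOrReplaceTempView(view_name)
    # collect() brings the result rows back to the driver as a Python list.
    return spark.sql(sql).collect()
```

A call such as `query_parquet(sc, '/content/washing.parquet', 'washing', 'SELECT count(*) AS n FROM washing')` would return a single-row list. `collect()` is only appropriate for small results like this one.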

In [0]:
!wget https://github.com/IBM/coursera/blob/master/coursera_ds/washing.parquet?raw=true
!mv /content/washing.parquet?raw=true /content/washing.parquet

--2019-11-03 18:14:48--  https://github.com/IBM/coursera/blob/master/coursera_ds/washing.parquet?raw=true
Resolving github.com (github.com)... 192.30.253.113
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/IBM/coursera/raw/master/coursera_ds/washing.parquet [following]
--2019-11-03 18:14:48--  https://github.com/IBM/coursera/raw/master/coursera_ds/washing.parquet
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/IBM/coursera/master/coursera_ds/washing.parquet [following]
--2019-11-03 18:14:48--  https://raw.githubusercontent.com/IBM/coursera/master/coursera_ds/washing.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTT

In [0]:
from pyspark.sql import SparkSession  
sc = SparkSession \
    .builder \
    .appName("Python Spark IBM") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = sc.read.parquet('/content/washing.parquet')
df.createOrReplaceTempView('washing')
df.show()

+--------------------+--------------------+-----+--------+----------+---------+--------+-----+-----------+-------------+-------+
|                 _id|                _rev|count|flowrate|fluidlevel|frequency|hardness|speed|temperature|           ts|voltage|
+--------------------+--------------------+-----+--------+----------+---------+--------+-----+-----------+-------------+-------+
|0d86485d0f88d1f9d...|1-57940679fb8a713...|    4|      11|acceptable|     null|      77| null|        100|1547808723923|   null|
|0d86485d0f88d1f9d...|1-15ff3a0b304d789...|    2|    null|      null|     null|    null| 1046|       null|1547808729917|   null|
|0d86485d0f88d1f9d...|1-97c2742b68c7b07...|    4|    null|      null|       71|    null| null|       null|1547808731918|    236|
|0d86485d0f88d1f9d...|1-eefb903dbe45746...|   19|      11|acceptable|     null|      75| null|         86|1547808738999|   null|
|0d86485d0f88d1f9d...|1-5f68b4c72813c25...|    7|    null|      null|       75|    null| null|   

The following cell can be used to test your count function

In [0]:
cnt = None
nof = None
fn = None
cnt = count(df,sc)
print(cnt)




The following cell can be used to test your getNumberOfFields function

In [0]:
nof = getNumberOfFields(df,sc)
print(nof)

11


The following cell can be used to test your getFieldNames function

In [0]:
fn = getFieldNames(df,sc)
print(fn)

['_id', '_rev', 'count', 'flowrate', 'fluidlevel', 'frequency', 'hardness', 'speed', 'temperature', 'ts', 'voltage']


Congratulations, you are done. So please submit your solutions to the grader now.

# Start of Assignment-Submission

The first thing we need to do is install a small helper library for submitting the solutions to the Coursera grader:


In [0]:
!rm -f rklib.py
!wget https://raw.githubusercontent.com/IBM/coursera/master/rklib.py

--2019-11-03 18:16:47--  https://raw.githubusercontent.com/IBM/coursera/master/rklib.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2540 (2.5K) [text/plain]
Saving to: ‘rklib.py’


2019-11-03 18:16:47 (49.1 MB/s) - ‘rklib.py’ saved [2540/2540]



Now it’s time to submit your first solution. Please make sure that the token variable contains a valid submission token. You can obtain it from the Coursera web page of the course, in the grader section of this assignment.

Please also specify the email address you use with Coursera.
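
Before submitting, it can help to check locally that the three answers have the types the assignment asks for (an integer, an integer, and a list of strings). The sketch below is one way to do that; the helper name `serialize_answers` is hypothetical, and the part IDs are copied from the submission cell of this notebook.

```python
import json


def serialize_answers(cnt, nof, fn):
    """Validate the three answers and serialize them with json.dumps.

    Hypothetical helper: the part IDs are taken from the notebook's
    submission cell; the type checks mirror the assignment text.
    """
    for name, value in (("cnt", cnt), ("nof", nof)):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("{} must be an int, got {}".format(name, type(value)))
    if not isinstance(fn, list) or not all(isinstance(x, str) for x in fn):
        raise TypeError("fn must be a list of str, got {}".format(type(fn)))
    return {
        "2FjQw": json.dumps(cnt),
        "j8gMs": json.dumps(nof),
        "xaauC": json.dumps(fn),
    }
```

Passing something other than a list of strings for `fn`, for example a bound method such as `df.count`, fails early with a readable message.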


In [0]:
from rklib import submit, submitAll
import json

key = "SVDiVSHNEeiDqw70MIp2vA"

email = "web2ajax@gmail.com"
token = "NsabU9JibaReIcx2"

parts_data = {}
parts_data["2FjQw"] = json.dumps(cnt)
parts_data["j8gMs"] = json.dumps(nof)
parts_data["xaauC"] = json.dumps(fn)


submitAll(email, token, key, parts_data)

Submission successful, please check on the coursera grader page for the status
-------------------------
{"elements":[{"itemId":"7Yp62","id":"sUpST4RAEeawAApvKZgcCQ~7Yp62~f5_Wgf5uEemQmQ5viwNtrA","courseId":"sUpST4RAEeawAApvKZgcCQ"}],"paging":{},"linked":{}}
-------------------------
