### DataFrame object

#### Create SparkContext and SparkSession 

In [11]:
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [12]:
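# create entry points to spark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# stop any SparkContext left over from a previous run of this cell
try:
    sc.stop()
except NameError:
    # sc has not been defined yet, so there is nothing to stop
    pass
sc = SparkContext()
spark = SparkSession(sparkContext=sc)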

In [13]:
spark

#### Create a DataFrame object
##### - By reading a file

In [14]:
fpp = spark.read.csv(path='datasets/Furniture Price Prediction.csv',
                        sep=',',
                        encoding='UTF-8',
                        comment=None,
                        header=True, 
                        inferSchema=True)
fpp.show(n=5, truncate=False)

+-----------------------------------------------------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+----+--------+----+------+
|furniture                                                                                      |type             |url                                                                                                                                         |rate|delivery|sale|price |
+-----------------------------------------------------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+----+--------+----+------+
|Bed side table with storage shelf                                                              |Home Decor Center|https://www.jumia.com.eg//ar/home-de

##### - By using the createDataFrame function

######  1. From an RDD
Each element in the RDD becomes a Row object.

In [15]:
from pyspark.sql import Row
rdd = sc.parallelize([
    Row(x=[5,6,4], y=['apple', 'orange', 'berries']),
    Row(x=[4,5,6], y=['pear', 'kiwi', 'banana'])
])
rdd.collect()

[Row(x=[5, 6, 4], y=['apple', 'orange', 'berries']),
 Row(x=[4, 5, 6], y=['pear', 'kiwi', 'banana'])]

######  2. From a pandas DataFrame

In [16]:
import pandas as pd
pd_df = pd.DataFrame({
    'x': [[5,6,4], [4,5,6]],
    'y': [['apple', 'orange', 'berries'], ['pear', 'kiwi', 'banana']]
})
pd_df

           x                         y
0  [5, 6, 4]  [apple, orange, berries]
1  [4, 5, 6]      [pear, kiwi, banana]


######  3. From a list
Each element in the list becomes a Row in the DataFrame.

In [19]:
df_list = [['apple', 5], ['orange', 6]]
df = spark.createDataFrame(df_list, ['fruit', 'price'])
df.show()

+------+-----+
| fruit|price|
+------+-----+
| apple|    5|
|orange|    6|
+------+-----+



In [23]:
df.dtypes

[('fruit', 'string'), ('price', 'bigint')]

### Conversion between DataFrame and RDD

#### DataFrame to RDD
A DataFrame can be easily converted to an RDD through the **pyspark.sql.DataFrame.rdd** property (it is an attribute, so it is accessed without parentheses). Each element in the returned RDD is a pyspark.sql.Row object, which behaves like a tuple whose fields can also be accessed by name.

In [28]:
fpp.rdd.take(3)

[Row(furniture='Bed side table with storage shelf ', type='Home Decor Center', url='https://www.jumia.com.eg//ar/home-decor-center-bedside-table-side-table-with-drawer-white-403055-cm-304-27151996.html', rate='3.3', delivery=172.14, sale='72%', price='2500.0'),
 Row(furniture='Bed side table with storage shelf ', type='Modern Home', url='https://www.jumia.com.eg//ar/modern-home-bedside-table-side-table-with-storage-white-554040-27151979.html', rate='0', delivery=172.14, sale='54%', price='1200.0'),
 Row(furniture='Modern Zigzag TV Table ', type='Modern Home', url='https://www.jumia.com.eg//ar/generic-zigzag-tv-table-beige-120cm-32489890.html', rate='0', delivery=172.14, sale='18%', price='1099.0')]

From here, we can apply RDD transformations such as **map, mapValues, flatMap, flatMapValues**, along with the other methods that RDDs provide.

In [45]:
fpp_map = fpp.rdd.map(lambda x: (x['furniture'], x['price']))
fpp_map.take(5)

[('Bed side table with storage shelf ', '2500.0'),
 ('Bed side table with storage shelf ', '1200.0'),
 ('Modern Zigzag TV Table ', '1099.0'),
 ('Bedside table with storage shelf ', '1200.0'),
 ('Wall Mounted TV Unit with Cabinet TV Stand Unit with Shelves for Living Room (Brown with White)',
  '1400.0')]

Here, each Row is mapped to a (furniture, price) tuple, so the result is a pair RDD: an RDD of key-value pairs with the furniture name as the key and the price as the value. In this output both the key and the value are strings. **map() operates on the entire element**, so on a pair RDD the function receives the whole key-value pair.

In contrast, **mapValues() operates only on the value of each key-value pair** and leaves the key unchanged. Let's see this with an example.

In [47]:
fpp_mapvalues = fpp_map.mapValues(lambda x: [x, x * 2])
fpp_mapvalues.take(5)

[('Bed side table with storage shelf ', ['2500.0', '2500.02500.0']),
 ('Bed side table with storage shelf ', ['1200.0', '1200.01200.0']),
 ('Modern Zigzag TV Table ', ['1099.0', '1099.01099.0']),
 ('Bedside table with storage shelf ', ['1200.0', '1200.01200.0']),
 ('Wall Mounted TV Unit with Cabinet TV Stand Unit with Shelves for Living Room (Brown with White)',
  ['1400.0', '1400.01400.0'])]
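Note that the "doubled" values in the output above are repeated strings rather than numbers: the price values are strings, and in Python multiplying a string by an integer repeats it. The plain-Python sketch below shows the difference. It assumes the prices are meant to be numeric, and `double_price` is a hypothetical helper name that does not appear in the original notebook.

```python
# A price taken from the output above; it is a string, not a number.
price = '2500.0'

# Multiplying a string by an integer repeats the string.
repeated = price * 2


def double_price(value):
    """Hypothetical helper: parse the price string, then return [price, 2 * price]."""
    number = float(value)
    return [number, number * 2]


doubled = double_price(price)
print(repeated)
print(doubled)
```

With the pair RDD from this section, the helper could be applied as `fpp_map.mapValues(double_price)`. Since the price column was read as a string, some rows may contain values that do not parse as floats, and those would need separate handling.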

####  RDD to DataFrame 
To convert an RDD to a DataFrame, we can use the SparkSession.createDataFrame() function. In the approach shown here, every element in the RDD is first converted to a Row object so that the column names travel with the data.

##### Create an RDD

In [51]:
rdd_raw = sc.textFile('datasets/Furniture Price Prediction.csv')
rdd_raw.take(3)

['furniture,type,url,rate,delivery,sale,price',
 'Bed side table with storage shelf ,Home Decor Center,https://www.jumia.com.eg//ar/home-decor-center-bedside-table-side-table-with-drawer-white-403055-cm-304-27151996.html,3.3,172.14,72%,2500.0',
 'Bed side table with storage shelf ,Modern Home,https://www.jumia.com.eg//ar/modern-home-bedside-table-side-table-with-storage-white-554040-27151979.html,0,172.14,54%,1200.0']

##### Save the first row to a variable

In [57]:
header = rdd_raw.map(lambda x: x.split(',')).filter(lambda x: x[0] == 'furniture').collect()[0]
header

['furniture', 'type', 'url', 'rate', 'delivery', 'sale', 'price']

##### Save the rest to a new RDD

In [56]:
rdd = rdd_raw.map(lambda x: x.split(',')).filter(lambda x: x[0] != 'furniture')
rdd.take(2)

[['Bed side table with storage shelf ',
  'Home Decor Center',
  'https://www.jumia.com.eg//ar/home-decor-center-bedside-table-side-table-with-drawer-white-403055-cm-304-27151996.html',
  '3.3',
  '172.14',
  '72%',
  '2500.0'],
 ['Bed side table with storage shelf ',
  'Modern Home',
  'https://www.jumia.com.eg//ar/modern-home-bedside-table-side-table-with-storage-white-554040-27151979.html',
  '0',
  '172.14',
  '54%',
  '1200.0']]
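Splitting each line on `','` works for the rows shown above, but it assumes that no field contains a comma. As a hedged sketch, Python's standard `csv` module can parse a line while respecting quoted fields. The `parse_line` helper and the sample line are made up for illustration and are not taken from the dataset.

```python
import csv


def parse_line(line):
    """Hypothetical helper: parse one CSV line, respecting quoted fields."""
    return next(csv.reader([line]))


# Made-up line for illustration: the first field is quoted and contains a comma.
sample = '"TV Table, beige",Modern Home,0,172.14,18%,1099.0'

naive = sample.split(',')    # splits inside the quoted field as well
parsed = parse_line(sample)  # keeps the quoted field in one piece

print(len(naive), len(parsed))
print(parsed[0])
```

If the file did contain quoted commas, `rdd_raw.map(parse_line)` could stand in for `rdd_raw.map(lambda x: x.split(','))` in the cells above.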

###### Convert RDD elements to Row objects

First, we define a function that takes a list of column names and a list of values and creates a Row of key-value pairs. Row() expects its fields as keyword arguments, so we can't simply pass a dictionary to it directly. Instead, we build a dictionary and use the ** operator to unpack it into keyword arguments.

Let's define the function:

In [58]:
def list_to_row(keys, values):
    row_dict = dict(zip(keys, values))
    return Row(**row_dict)
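To see what `dict(zip(...))` and `**` do in isolation, here is a plain-Python sketch. It uses `collections.namedtuple` as a stand-in for Row purely for illustration; the `Record` type is an assumption of this example, not part of pyspark.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, used only for illustration.
Record = namedtuple('Record', ['fruit', 'price'])

keys = ['fruit', 'price']
values = ['apple', 5]

# zip pairs each key with its value; dict turns the pairs into a mapping.
row_dict = dict(zip(keys, values))

# ** unpacks the mapping into keyword arguments,
# so this call is equivalent to Record(fruit='apple', price=5).
record = Record(**row_dict)

print(row_dict)
print(record)
```

The same pattern is what `list_to_row` relies on: the header supplies the keys, each split line supplies the values, and `**` passes them as named fields.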

In [59]:
rdd_rows = rdd.map(lambda x: list_to_row(header, x))
rdd_rows.take(3)

[Row(furniture='Bed side table with storage shelf ', type='Home Decor Center', url='https://www.jumia.com.eg//ar/home-decor-center-bedside-table-side-table-with-drawer-white-403055-cm-304-27151996.html', rate='3.3', delivery='172.14', sale='72%', price='2500.0'),
 Row(furniture='Bed side table with storage shelf ', type='Modern Home', url='https://www.jumia.com.eg//ar/modern-home-bedside-table-side-table-with-storage-white-554040-27151979.html', rate='0', delivery='172.14', sale='54%', price='1200.0'),
 Row(furniture='Modern Zigzag TV Table ', type='Modern Home', url='https://www.jumia.com.eg//ar/generic-zigzag-tv-table-beige-120cm-32489890.html', rate='0', delivery='172.14', sale='18%', price='1099.0')]

Now we can convert the **RDD** to a **DataFrame**.

In [60]:
df = spark.createDataFrame(rdd_rows)
df.show(5)

+--------------------+-----------------+--------------------+----+--------+----+------+
|           furniture|             type|                 url|rate|delivery|sale| price|
+--------------------+-----------------+--------------------+----+--------+----+------+
|Bed side table wi...|Home Decor Center|https://www.jumia...| 3.3|  172.14| 72%|2500.0|
|Bed side table wi...|      Modern Home|https://www.jumia...|   0|  172.14| 54%|1200.0|
|Modern Zigzag TV ...|      Modern Home|https://www.jumia...|   0|  172.14| 18%|1099.0|
|Bedside table wit...|      Modern Home|https://www.jumia...|   0|  172.14| 58%|1200.0|
|Wall Mounted TV U...|      Modern Home|https://www.jumia...|   5|   52.44| 54%|1400.0|
+--------------------+-----------------+--------------------+----+--------+----+------+
only showing top 5 rows

