## Dataframe Basics

### Let's create two DataFrames one as a Dimension and the other one as a Fact.

In [3]:
personDIM = [
  (123, 'John', 25),
  (234, 'Doe', 27),
  (345, 'Jane', 21),
  (456, 'Jimmy', 45),
]

transFact = [
  (1, 123, 'Soap', 5.00, '2018-01-01'),
  (2, 123 , 'Fruit', 4.67, '2018-01-01'),
  (3, 234 , 'Soap', 5.00, '2018-02-01'),
  (4, 234 , 'Bread', 1.99, '2018-03-01'),
  (5, 234, 'Milk', 4.55, '2018-08-01'),
  (6, 345 , 'Chips', 5.99, '2018-09-01'),
]

In [4]:
personDF = spark.createDataFrame(personDIM, ["id", "name", "age"])

In [5]:
transDF = spark.createDataFrame(transFact, ["id", "person_id", "item", "amount", "purchase_date"])

#### Convert purchase_date to a date format using to_date function

In [7]:
from pyspark.sql.functions import to_date
transDF = transDF.withColumn("purchase_date", to_date("purchase_date" , "yyyy-mm-dd"))

### Perform some DataFrames functions

In [9]:
from pyspark.sql.functions import col
transDF.withColumn("id", col("id") * 2).show()

In [10]:
joinExpression = (  transDF["person_id"] == personDF["id"] )
joinType = 'left_outer'
df = personDF.join(transDF,joinExpression,joinType)

In [11]:
display(df)

id,name,age,id.1,person_id,item,amount,purchase_date
234,Doe,27,3.0,234.0,Soap,5.0,2018-01-01
234,Doe,27,4.0,234.0,Bread,1.99,2018-01-01
234,Doe,27,5.0,234.0,Milk,4.55,2018-01-01
123,John,25,1.0,123.0,Soap,5.0,2018-01-01
123,John,25,2.0,123.0,Fruit,4.67,2018-01-01
345,Jane,21,6.0,345.0,Chips,5.99,2018-01-01
456,Jimmy,45,,,,,


In [12]:
from pyspark.sql.functions import sum, desc
df.groupBy("name").agg(sum("amount").alias("Total Spent")).orderBy(desc("Total Spent")).show()

### Let's make the following the SQL Way
```
joinExpression = (  transDF["person_id"] == personDF["id"] )
joinType = 'left_outer'
df = personDF.join(transDF,joinExpression,joinType)

from pyspark.sql.functions import sum, desc
df.groupBy("name").agg(sum("amount").alias("Total Spent")).orderBy(desc("Total Spent")).show()

```

In [14]:
transDF.createOrReplaceTempView("transSQL")

In [15]:
personDF.createOrReplaceTempView("personSQL")

In [16]:
dfSQL = spark.sql("""
select *
from personSQL
Left join transSQL on (personSQL.id = transSQL.person_id)
""")

In [17]:
dfSQL.createOrReplaceTempView("dfSQL")

In [18]:
spark.sql("""
select name, sum(amount) as `total amount` from dfSQL
group by name
order by `total amount` desc
""").show()

#### In order to mount s3 to Databricks Execute the following:
```
#%fs mount s3a://Access key ID:Secret access key@bucketname /mnt/mys3
```
#### Make sure that you replace / with %2F in Access Key or Secret Key

In [20]:
%fs ls /mnt/

In [21]:
dfSQL.toPandas()

In [22]:
dfSQL.drop("id").write.format("json").option("path", "/mnt/mys3/demo").save()

In [23]:
df_train = spark.read.format("csv").option("inferSchema", "True").option("header", "True")\
.option("path", "/mnt/mys3/test/train.csv").load()

In [24]:
df_train.show(5)