# <img src='https://airsblobstorage.blob.core.windows.net/airstream/Asset 275.png' width="50px"> RDDs and DataFrames

This notebook shows how to create RDDs and DataFrames, and how to query a table or DataFrame stored in DBFS. [DBFS](https://docs.microsoft.com/en-us/azure/databricks/data/databricks-file-system) (the Databricks File System) lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in an individual cell with the `%LANGUAGE` magic command syntax, for example `%sql`. Python, Scala, SQL, and R are all supported.

In [0]:
dbutils.fs.ls("/databricks-datasets")

In [0]:
dbutils.fs.ls("/databricks-datasets/adult")

In [0]:
%fs ls /databricks-datasets/adult/

path,name,size
dbfs:/databricks-datasets/adult/README.md,README.md,2672
dbfs:/databricks-datasets/adult/adult.data,adult.data,3974305
dbfs:/databricks-datasets/adult/adult.test,adult.test,2003132


In [0]:
# TODO Recording: In the cell below please expand the dataframe to show what columns are present in there

In [0]:
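# adult.data has no header row, so header=True turns the values of the first
# record into the column names (e.g. ' State-gov'); later cells rely on those names.
adult_census_data = spark.read.csv("dbfs:/databricks-datasets/adult/adult.data", header=True)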

type(adult_census_data)
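
The column names used later in this notebook (such as `' State-gov'` and `' <=50K'`) come from reading a file without a header row as if it had one. The sketch below reproduces that effect with Python's standard `csv` module instead of Spark, so it runs anywhere. The sample rows are a hypothetical, shortened stand-in for the real file, and `csv.DictReader` is used only as an analogy for `header=True`, not as a description of how Spark parses CSV.

```python
import csv
import io

# Hypothetical, shortened sample in the style of adult.data:
# no header row, and a space after every comma.
sample = (
    "39, State-gov, Adm-clerical, <=50K\n"
    "50, Self-emp-not-inc, Exec-managerial, <=50K\n"
    "38, Private, Handlers-cleaners, >50K\n"
)

# DictReader treats the first line as the field names, much like header=True.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# The first record has become the header, leading spaces included.
print(reader.fieldnames)

# Only the remaining records are left as data rows.
print(len(rows))

# Values must be looked up with the exact name, including the leading space.
print(rows[0][" State-gov"])
```

In this analogy one record is lost to the header and every name after the first carries a leading space, which is why lookups by name have to include that space.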

In [0]:
adult_census_rdd = adult_census_data.rdd

type(adult_census_rdd)

In [0]:
# Expand the pointer for Spark Jobs
# Click on View
# Completed stages should be 1
# Expand the event timeline -- select and scroll right -- hover over the blue bar representing the collect operation
# Expand the DAG visualization

In [0]:
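# collect() is an action: it runs a Spark job and returns every row to the driver.
# That is fine for a small file like this one, but avoid it on large datasets.
adult_census_rdd.collect()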

In [0]:
adult_census_rdd.count()

In [0]:
adult_census_rdd.first()

In [0]:
# map() is a lazy transformation: this returns a new RDD and does not run a job yet.
adult_census_rdd.map(lambda row: (row[1], row[3], row[5]))
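
A `map` on an RDD only describes work to be done. Nothing is computed until an action such as `collect()` asks for results. Python's built-in `map` happens to be lazy in a similar way, so the plain-Python sketch below can illustrate the idea without a cluster. It is an analogy only: the rows and the `pick_fields` helper are hypothetical, and `list(...)` stands in for `collect()`.

```python
# Hypothetical rows standing in for the records of an RDD.
rows = [
    ("39", "State-gov", "77516", "Bachelors", "13", "Never-married"),
    ("50", "Private", "83311", "Masters", "14", "Divorced"),
]

calls = []  # records every row the mapping function actually processes


def pick_fields(row):
    """Keep the fields at positions 1, 3 and 5, as the lambda in the notebook does."""
    calls.append(row)
    return (row[1], row[3], row[5])


# Defining the mapping does no work yet (like rdd.map(...)).
lazy = map(pick_fields, rows)
calls_before = len(calls)

# Asking for the results forces the computation (like .collect()).
result = list(lazy)
calls_after = len(calls)

print(calls_before, calls_after)
print(result)
```

The function is not called at all until the results are requested, which mirrors why a Spark job only appears in the UI once an action runs.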

In [0]:
adult_census_rdd.map(lambda row: (row[1], row[3], row[5])) \
                .collect()

In [0]:
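# Fields can also be selected by column name. The names include a leading space
# because they were taken from the first data record of the file.
adult_census_rdd.map(lambda row: (row[' State-gov'], row[' Adm-clerical'], row[' <=50K'])) \
                .collect()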

In [0]:
adult_census_rdd_filtered = adult_census_rdd.filter(lambda row: row[' <=50K'] == ' <=50K')
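
The filter compares against `' <=50K'` with a leading space because the raw values in the file carry that space. The plain-Python sketch below, using a hypothetical list of label values rather than the real RDD, shows how an exact comparison behaves with and without the space, and how stripping whitespace first makes the comparison less fragile.

```python
# Hypothetical label values as they would appear when read without trimming.
labels = [" <=50K", " >50K", " <=50K"]

# Exact match including the leading space: finds the intended rows.
exact = [value for value in labels if value == " <=50K"]

# Exact match without the space: silently matches nothing.
naive = [value for value in labels if value == "<=50K"]

# Stripping whitespace before comparing works with either form of the data.
stripped = [value for value in labels if value.strip() == "<=50K"]

print(len(exact), len(naive), len(stripped))
```

A filter that returns zero rows without raising an error is easy to miss, so checking the count after filtering, as the next cell does, is a useful habit.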

In [0]:
adult_census_rdd_filtered.count()

In [0]:
adult_census_rdd_filtered.collect()

In [0]:
dbutils.fs.ls("/databricks-datasets/bikeSharing/")

In [0]:
dbutils.fs.ls("/databricks-datasets/bikeSharing/data-001")

In [0]:
# TODO Recording: Please expand the dataframe to show the fields

In [0]:
bike_sharing_data = spark.read.format('csv') \
                         .option("inferSchema", "True") \
                         .option("header", "True") \
                         .option("sep", ",") \
                         .load("/databricks-datasets/bikeSharing/data-001/day.csv")

In [0]:
bike_sharing_data.show()

In [0]:
bike_sharing_data.show(10)

In [0]:
bike_sharing_data_selected = bike_sharing_data.select('season', 'holiday', 'cnt')

In [0]:
bike_sharing_data_selected.show()

In [0]:
bike_sharing_data.select('season').distinct().show()

In [0]:
bike_sharing_data.filter(bike_sharing_data['cnt'] > 1000).show()

In [0]:
bike_sharing_data.filter(bike_sharing_data['mnth'] == 12).show()

In [0]:
bike_sharing_data.filter(bike_sharing_data['yr'] == 0).count()