## Overview

This notebook shows how to load and query data that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. However, you can use other languages in a cell with the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [None]:
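import re
import string

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# sc is predefined in Databricks notebooks; defining it here lets the cell run elsewhere too
sc = spark.sparkContext

# File location
file_location = "/FileStore/tables/shakespeare_1.txt"
text = sc.textFile(file_location)

# Words whose counts we want to report (matching is case-sensitive)
given_words = set("Shakespeare,When,Lord,Library,GUTENBERG,WILLIAM,COLLEGE,WORLD".split(","))

def clean_data(line):
    # Remove punctuation characters from both ends of the line;
    # punctuation inside the line is handled by the split below
    return line.strip(string.punctuation)

# Despite the name, df is an RDD of (word, count) pairs, not a DataFrame
df = text.map(clean_data)
df = df.flatMap(lambda line: re.split(r"\W+", line))  # split on runs of non-word characters
df = df.filter(lambda word: len(word) > 1)            # drop empty strings and single characters
df = df.map(lambda word: (word, 1))
df = df.reduceByKey(lambda a, b: a + b)               # sum the 1s per distinct word

# Filter on the cluster so that only the requested words are collected
for pair in df.filter(lambda pair: pair[0] in given_words).collect():
    print(pair)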

('Shakespeare', 22)
('GUTENBERG', 100)
('WILLIAM', 128)
('WORLD', 98)
('COLLEGE', 98)
('When', 406)
('Lord', 402)
('Library', 5)


**Please note that these counts are case-sensitive: WORLD is counted separately from "World" and "world". Punctuation was stripped before counting, but the text was not converted to lowercase, which is why the output lists both 'And' and 'and'.**
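
To make the case-sensitivity concrete, here is a minimal plain-Python sketch (no Spark required) that applies the same strip, split, and length-filter steps to a single line. The sample sentence and the `tokenize` helper are illustrative assumptions and do not come from the notebook or the Shakespeare file.

```python
import re
import string
from collections import Counter

def tokenize(line):
    """Mirror the notebook's steps: strip edge punctuation,
    split on runs of non-word characters, drop tokens shorter than 2."""
    stripped = line.strip(string.punctuation)
    return [w for w in re.split(r"\W+", stripped) if len(w) > 1]

# Hypothetical sample line, not taken from the input file
sample = "The WORLD, the World, and the world."

# Counting as the notebook does: each spelling is its own key
case_sensitive = Counter(tokenize(sample))

# Alternative: lowercase every token first so spellings merge
case_insensitive = Counter(w.lower() for w in tokenize(sample))

print(case_sensitive["WORLD"], case_sensitive["World"], case_sensitive["world"])
print(case_insensitive["world"])
```

Lowercasing each token before counting, as in `case_insensitive`, would merge the three spellings into one key. The counts reported above do not do this.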

In [None]:
# Sort by count in ascending order; words with equal counts appear in no particular order
res = df.sortBy(lambda pair: pair[1]).collect()
print("bottom 20 words are:")
print(res[:20])

bottom 20 words are:
[('anyone', 1), ('restrictions', 1), ('online', 1), ('www', 1), ('gutenberg', 1), ('org', 1), ('COPYRIGHTED', 1), ('Details', 1), ('guidelines', 1), ('Posting', 1), ('2011', 1), ('EBook', 1), ('January', 1), ('1994', 1), ('Character', 1), ('encoding', 1), ('cooperation', 1), ('Public', 1), ('Domain', 1), ('implications', 1)]


In [None]:
print("Top 20 words in ascending order are:")
print(res[-20:])

Top 20 words in ascending order are:
[('him', 2527), ('this', 2609), ('for', 2713), ('your', 2797), ('be', 2986), ('it', 3078), ('with', 3221), ('his', 3278), ('me', 3448), ('not', 3595), ('is', 3722), ('And', 3735), ('that', 3864), ('in', 4803), ('my', 4922), ('you', 5360), ('to', 7742), ('of', 7968), ('and', 8942), ('the', 11412)]
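
Sorting the full result and slicing it works for a vocabulary of this size. As a plain-Python illustration of selecting only the N smallest or largest counts without sorting everything, the sketch below uses `heapq` on a small made-up list of pairs. The data and the value of `n` are assumptions for illustration. On an RDD, `takeOrdered` should play a comparable role, though that is not shown here.

```python
import heapq

# Hypothetical (word, count) pairs standing in for the collected result
pairs = [("the", 9), ("and", 7), ("of", 6), ("online", 1), ("www", 1), ("to", 5)]

n = 2
# n pairs with the smallest counts, in ascending order of count
bottom = heapq.nsmallest(n, pairs, key=lambda p: p[1])
# n pairs with the largest counts, in descending order of count
top = heapq.nlargest(n, pairs, key=lambda p: p[1])

print("bottom:", bottom)
print("top (descending):", top)
```

Note that `heapq.nlargest` returns the largest counts first, whereas the cell above prints the top words in ascending order.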
