For help, look here:
https://spark.apache.org/docs/latest/rdd-programming-guide.html

In [0]:
# List the pre-loaded Databricks datasets
display(dbutils.fs.ls('dbfs:/databricks-datasets/'))

path,name,size,modificationTime
dbfs:/databricks-datasets/COVID/,COVID/,0,0
dbfs:/databricks-datasets/README.md,README.md,976,1532468253000
dbfs:/databricks-datasets/Rdatasets/,Rdatasets/,0,0
dbfs:/databricks-datasets/SPARK_README.md,SPARK_README.md,3359,1455043490000
dbfs:/databricks-datasets/adult/,adult/,0,0
dbfs:/databricks-datasets/airlines/,airlines/,0,0
dbfs:/databricks-datasets/amazon/,amazon/,0,0
dbfs:/databricks-datasets/asa/,asa/,0,0
dbfs:/databricks-datasets/atlas_higgs/,atlas_higgs/,0,0
dbfs:/databricks-datasets/bikeSharing/,bikeSharing/,0,0


## 1. Word Count

In [0]:
# Create an RDD from the text file (sc is the SparkContext)
rdd = sc.textFile("dbfs:/databricks-datasets/SPARK_README.md")

In [0]:
# Read the first 20 lines
rdd.take(20)

Out[5]: ['# Apache Spark',
 '',
 'Spark is a fast and general cluster computing system for Big Data. It provides',
 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that',
 'supports general computation graphs for data analysis. It also supports a',
 'rich set of higher-level tools including Spark SQL for SQL and DataFrames,',
 'MLlib for machine learning, GraphX for graph processing,',
 'and Spark Streaming for stream processing.',
 '',
 '<http://spark.apache.org/>',
 '',
 '',
 '## Online Documentation',
 '',
 'You can find the latest Spark documentation, including a programming',
 'guide, on the [project web page](http://spark.apache.org/documentation.html)',
 'and [project wiki](https://cwiki.apache.org/confluence/display/SPARK).',
 'This README file only contains basic setup instructions.',
 '',
 '## Building Spark']

In [0]:
# Example: use a lambda function with flatMap to split each line into words
words = rdd.flatMap(lambda line: line.split(" "))

for w in words.collect():
  print(w)

#
Apache
Spark

Spark
is
a
fast
and
general
cluster
computing
system
for
Big
Data.
It
provides
high-level
APIs
in
Scala,
Java,
Python,
and
R,
and
an
optimized
engine
that
supports
general
computation
graphs
for
data
analysis.
It
also
supports
a
rich
set
of
higher-level
tools
including
Spark
SQL
for
SQL
and
DataFrames,
MLlib
for
machine
learning,
GraphX
for
graph
processing,
and
Spark
Streaming
for
stream
processing.

<http://spark.apache.org/>


##
Online
Documentation

You
can
find
the
latest
Spark
documentation,
including
a
programming
guide,
on
the
[project
web
page](http://spark.apache.org/documentation.html)
and
[project
wiki](https://cwiki.apache.org/confluence/display/SPARK).
This
README
file
only
contains
basic
setup
instructions.

##
Building
Spark

Spark
is
built
using
[Apache
Maven](http://maven.apache.org/).
To
build
Spark
and
its
example
programs,
run:





build/mvn
-DskipTests
clean
package

(You
do
not
need
to
do
this
if
you
downloaded
a
pre-built
package.)
More
detaile
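
The difference between `map` and `flatMap` is easier to see on a tiny input. The following is a plain-Python sketch (no Spark needed) of what the two transformations produce. The sample lines and the names `mapped` and `flat_mapped` are illustrative assumptions, not part of the notebook.

```python
from itertools import chain

# Three sample lines standing in for the RDD contents (illustrative data)
lines = ["# Apache Spark", "", "Spark is fast"]

# map: one output element per input line, so we get a list of word lists
mapped = [line.split(" ") for line in lines]

# flatMap: the per-line lists are flattened into a single sequence of words
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

Note that the blank line becomes an empty string, because `"".split(" ")` returns `['']`. That is where the blank entries in the output above come from.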

In [0]:
# Take the previous function and
# 1. count the occurrences of each word

# Count word occurrences
word_counts = (
    rdd.flatMap(lambda line: line.split(" "))  # Split each line into words
    .map(lambda word: word.strip())  # Remove extra spaces
    .filter(lambda word: word != "")  # Remove empty words
    .countByValue()  # Count occurrences of each word
)

# Display results
for word, count in list(word_counts.items())[:20]:  # Show first 20 results
    print(f"{word}: {count}")

#: 1
Apache: 1
Spark: 13
is: 6
a: 8
fast: 1
and: 10
general: 2
cluster: 2
computing: 1
system: 1
for: 11
Big: 1
Data.: 1
It: 2
provides: 1
high-level: 1
APIs: 1
in: 5
Scala,: 1
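
`countByValue()` is an action: it brings the counts back to the driver as a dictionary instead of producing a new RDD. A rough plain-Python equivalent of the pipeline, using `collections.Counter` on a small illustrative sample (an assumption, not the README itself), might look like this:

```python
from collections import Counter

# Illustrative sample with leading, trailing, and repeated spaces
lines = ["Spark is fast", "  Spark is  general "]

# Same three steps as the Spark pipeline: split, strip, drop empty words
words = (word.strip() for line in lines for word in line.split(" "))
word_counts = Counter(word for word in words if word != "")

for word, count in list(word_counts.items())[:20]:
    print(f"{word}: {count}")
```

Because the result lives on the driver, this approach suits small vocabularies; for large ones, a `map` to `(word, 1)` followed by `reduceByKey` keeps the counts distributed.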


In [0]:
# 2. change all capital letters to lower case
lower_rdd = rdd.map(lambda line: line.lower())
for line in lower_rdd.collect():
  print(line)

# Show the first 20 lowercased lines
lower_rdd.take(20)

# apache spark

spark is a fast and general cluster computing system for big data. it provides
high-level apis in scala, java, python, and r, and an optimized engine that
supports general computation graphs for data analysis. it also supports a
rich set of higher-level tools including spark sql for sql and dataframes,
mllib for machine learning, graphx for graph processing,
and spark streaming for stream processing.

<http://spark.apache.org/>


## online documentation

you can find the latest spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html)
and [project wiki](https://cwiki.apache.org/confluence/display/spark).
this readme file only contains basic setup instructions.

## building spark

spark is built using [apache maven](http://maven.apache.org/).
to build spark and its example programs, run:

    build/mvn -dskiptests clean package

(you do not need to do this if you downloaded a pre-built package.)
more detaile

In [0]:
# 3. eliminate stopwords 
stop_words = ['and', 'to', 'in', 'at', 'the', 'an']
filtered_rdd = (
    rdd.flatMap(lambda line: line.split())    # Split lines into words
        .map(lambda word: word.lower())       # Convert each word to lowercase
        .filter(lambda word: word not in stop_words)  # Remove stopwords
)

# Collect and print the filtered content
for word in filtered_rdd.collect():
  print(word)

#
apache
spark
spark
is
a
fast
general
cluster
computing
system
for
big
data.
it
provides
high-level
apis
scala,
java,
python,
r,
optimized
engine
that
supports
general
computation
graphs
for
data
analysis.
it
also
supports
a
rich
set
of
higher-level
tools
including
spark
sql
for
sql
dataframes,
mllib
for
machine
learning,
graphx
for
graph
processing,
spark
streaming
for
stream
processing.
<http://spark.apache.org/>
##
online
documentation
you
can
find
latest
spark
documentation,
including
a
programming
guide,
on
[project
web
page](http://spark.apache.org/documentation.html)
[project
wiki](https://cwiki.apache.org/confluence/display/spark).
this
readme
file
only
contains
basic
setup
instructions.
##
building
spark
spark
is
built
using
[apache
maven](http://maven.apache.org/).
build
spark
its
example
programs,
run:
build/mvn
-dskiptests
clean
package
(you
do
not
need
do
this
if
you
downloaded
a
pre-built
package.)
more
detailed
documentation
is
available
from
project
site,
["building
sp
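
As a side note on the stop-word filter, a `set` is a common choice for membership tests. The sketch below applies the same lowercase-then-filter logic to a single line in plain Python; the variable names are illustrative assumptions.

```python
# Same stop words as in the cell above, stored as a set for fast lookup
stop_words = {"and", "to", "in", "at", "the", "an"}

line = "To build Spark and its example programs, run:"

# Lowercase each word first, then drop the ones that are stop words
filtered = [w for w in (word.lower() for word in line.split()) if w not in stop_words]
print(filtered)
```

Words with attached punctuation, such as `and,`, would not match the stop-word list, so removing punctuation first gives a cleaner result.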

In [0]:
# 4. sort in alphabetical order
words_rdd = rdd.flatMap(lambda x: x.split(" "))
sorted_rdd = words_rdd.sortBy(lambda word: word.lower())
sorted_rdd.collect()

Out[12]: ['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '"local"',
 '"local[N]"',
 '"yarn"',
 '#',
 '##',
 '##',
 '##',
 '##',
 '##',
 '##',
 '##',
 '##',
 '(You',
 '-DskipTests',
 './bin/pyspark',
 './bin/run-example',
 './bin/run-example',
 './bin/spark-shell',
 './dev/run-tests',
 '1000).count()',
 '1000:',
 '1000:',
 '<class>',
 '<http://spark.apache.org/>',
 '>>>',
 '["Building',
 '["Specifying',
 '[Apache',
 '[building',
 '[Configuration',
 '[params]`.',
 '[project',
 '[project',
 '[run',
 '`./bin/run-example',
 '`examples`',
 '`examples`',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'A',
 'a',
 'abbreviated',
 'About',
 'against',
 'also',
 'also',
 'also',
 'also',
 'Alternatively,',
 'an',
 'an
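
The run of empty strings at the top of the output comes from `split(" ")`, which yields `''` for blank lines and repeated spaces, and the empty string sorts before every other string. The plain-Python sketch below mimics the case-insensitive sort key on a few illustrative words (the sample data is an assumption).

```python
# A few words mixing case, punctuation, and an empty string (illustrative data)
words = ["Spark", "", "a", "##", "A", "also"]

# Sort on the lowercased word, as in sortBy(lambda word: word.lower())
sorted_words = sorted(words, key=lambda word: word.lower())
print(sorted_words)

# Dropping empty strings first removes the leading run of ''
non_empty_sorted = sorted((w for w in words if w != ""), key=str.lower)
print(non_empty_sorted)
```

Python's `sorted` keeps tied elements in their input order; the relative order of ties such as `'a'` and `'A'` in the Spark output may differ.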

In [0]:
# 5. sort from most to least frequent word
# Step 1: Split the text into words and count occurrences
word_counts = rdd.flatMap(lambda x: x.split(" ")) \
                 .map(lambda word: (word.lower(), 1)) \
                 .reduceByKey(lambda a, b: a + b)

# Step 2: Sort the word counts from most frequent to least frequent
sorted_word_counts = word_counts.sortBy(lambda x: x[1], ascending=False)

# Step 3: Collect and display the sorted results
sorted_word_counts.collect()

Out[13]: [('', 67),
 ('the', 22),
 ('to', 16),
 ('spark', 13),
 ('for', 13),
 ('and', 11),
 ('a', 9),
 ('##', 8),
 ('run', 7),
 ('you', 7),
 ('is', 6),
 ('can', 6),
 ('in', 5),
 ('of', 5),
 ('on', 5),
 ('documentation', 4),
 ('also', 4),
 ('example', 4),
 ('if', 4),
 ('an', 3),
 ('this', 3),
 ('use', 3),
 ('programs', 3),
 ('tests', 3),
 ('hadoop', 3),
 ('including', 3),
 ('building', 3),
 ('build', 3),
 ('with', 3),
 ('or', 3),
 ('please', 3),
 ('supports', 2),
 ('set', 2),
 ('online', 2),
 ('[project', 2),
 ('using', 2),
 ('do', 2),
 ('at', 2),
 ('scala', 2),
 ('shell', 2),
 ('following', 2),
 ('1000:', 2),
 ('python', 2),
 ('./bin/run-example', 2),
 ('examples', 2),
 ('locally', 2),
 ('class', 2),
 ('guidance', 2),
 ('versions', 2),
 ('hadoop,', 2),
 ('refer', 2),
 ('particular', 2),
 ('hive', 2),
 ('general', 2),
 ('cluster', 2),
 ('it', 2),
 ('python,', 2),
 ('that', 2),
 ('sql', 2),
 ('detailed', 2),
 ('interactive', 2),
 ('shell:', 2),
 ('command,', 2),
 ('which', 2),
 ('should'
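
To make the `reduceByKey` step concrete, here is a local plain-Python sketch of the same idea. The helper `reduce_by_key` is a hypothetical stand-in written for illustration, not a Spark API, and the input pairs are made up.

```python
# (word, 1) pairs as produced by the map step (illustrative data)
pairs = [("spark", 1), ("the", 1), ("spark", 1), ("the", 1), ("the", 1)]


def reduce_by_key(pairs, func):
    """Combine all values that share a key using a binary function."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


counts = reduce_by_key(pairs, lambda a, b: a + b)

# Sort by the count (second tuple element), most frequent first
sorted_counts = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(sorted_counts)
```

The empty string tops the real output with 67 occurrences for the same reason as before: `split(" ")` emits `''` for blank lines and repeated spaces. Filtering empty words before the `map` step would remove that entry.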

In [0]:
# 6. remove punctuation
import re
cleaned_rdd = rdd.map(lambda line: re.sub(r'[^\w\s]', '', line))
cleaned_rdd.take(20)

Out[15]: [' Apache Spark',
 '',
 'Spark is a fast and general cluster computing system for Big Data It provides',
 'highlevel APIs in Scala Java Python and R and an optimized engine that',
 'supports general computation graphs for data analysis It also supports a',
 'rich set of higherlevel tools including Spark SQL for SQL and DataFrames',
 'MLlib for machine learning GraphX for graph processing',
 'and Spark Streaming for stream processing',
 '',
 'httpsparkapacheorg',
 '',
 '',
 ' Online Documentation',
 '',
 'You can find the latest Spark documentation including a programming',
 'guide on the project web pagehttpsparkapacheorgdocumentationhtml',
 'and project wikihttpscwikiapacheorgconfluencedisplaySPARK',
 'This README file only contains basic setup instructions',
 '',
 ' Building Spark']
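
Deleting punctuation outright fuses hyphenated words and URLs, as `highlevel` and `httpsparkapacheorg` show in the output above. One possible variation, sketched here in plain Python on an illustrative line, replaces punctuation with a space instead and then splits on whitespace. Whether that fits depends on how hyphenated words should be counted.

```python
import re

line = "high-level APIs in Scala, Java, Python, and R"

# Same pattern as the cell above: delete anything that is not a word character or whitespace
removed = re.sub(r"[^\w\s]", "", line)

# Variation: substitute a space, then split on any whitespace
spaced = re.sub(r"[^\w\s]", " ", line).split()

print(removed)
print(spaced)
```

Keep in mind that `\w` also matches digits and the underscore, so those characters survive both versions.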

## 2. What does it do?

In [0]:
# Create an RDD of tuples (name, age)
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30),
("TD", 35), ("Brooke", 25)])

# Try to understand what this code does (line by line)
agesRDD = (dataRDD
  .map(lambda x: (x[0], (x[1], 1)))
  .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
  .map(lambda x: (x[0], x[1][0]/x[1][1])))

# Collect and print results
for name, avg_age in agesRDD.collect():
    print(f"{name}: {avg_age:.2f}")

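# This Spark code computes the average age for each unique name in dataRDD, step by step:
# 1. map: turn (name, age) into (name, (age, 1)) so that each record carries a count of 1
# 2. reduceByKey: add up the ages and the counts per name, giving (name, (age_sum, count))
# 3. map: divide age_sum by count, giving (name, average_age)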

Brooke: 22.50
Denny: 31.00
TD: 35.00
Jules: 30.00
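
The same three steps can be traced without Spark. The following plain-Python sketch reuses the notebook's data; the intermediate names `pairs`, `totals`, and `averages` are illustrative assumptions.

```python
data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

# Step 1 (map): pair each age with a count of 1
pairs = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): add ages and counts element-wise for each name
totals = {}
for name, (age, n) in pairs:
    age_sum, count = totals.get(name, (0, 0))
    totals[name] = (age_sum + age, count + n)

# Step 3 (map): divide the summed ages by the number of records
averages = {name: age_sum / count for name, (age_sum, count) in totals.items()}

for name, avg_age in averages.items():
    print(f"{name}: {avg_age:.2f}")
```

Carrying a `(sum, count)` pair matters because averaging is not associative: averaging partial averages can give the wrong answer, whereas sums and counts can be combined in any order.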
