
In [1]:
!pip install pyspark



**PySpark Streaming:**

Create a Spark session.

In [2]:
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("Spark Streaming")
    .master("local[*]")
    .getOrCreate()
)

spark

I am creating a text file to use with Spark streaming.

In [3]:
with open('Example.txt','w') as f:
  f.write('simon has a dog and a cat the dog and cat used to love simon')

In [4]:
with open('Example.txt','r') as f:
  x = f.read()
  print(x)

simon has a dog and a cat the dog and cat used to love simon


Now we will read the data from the text file.

In [5]:
df_raw = spark.read.format("text").load('/content/Example.txt')
df_raw.printSchema()

root
 |-- value: string (nullable = true)



In [6]:
df_raw.show()

+--------------------+
|               value|
+--------------------+
|simon has a dog a...|
+--------------------+



Passing truncate = False to show prints each value in full instead of cutting it off as in the output above.

In [7]:
df_raw.show(truncate = False)

+------------------------------------------------------------+
|value                                                       |
+------------------------------------------------------------+
|simon has a dog and a cat the dog and cat used to love simon|
+------------------------------------------------------------+



Next we need to count how many times each word occurs. To do that, we first need to split the sentence into words.

So we will import the split function.

In [9]:
from pyspark.sql.functions import split

In [10]:
df_words = df_raw.withColumn('words',split('value',' '))

In [11]:
df_words.show(truncate = False)

+------------------------------------------------------------+----------------------------------------------------------------------------+
|value                                                       |words                                                                       |
+------------------------------------------------------------+----------------------------------------------------------------------------+
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|
+------------------------------------------------------------+----------------------------------------------------------------------------+



Now we need to explode the list into separate words, one per row.

For that we need to import explode.

In [12]:
from pyspark.sql.functions import explode

In [13]:
df_explode = df_words.withColumn('word',explode('words'))
df_explode.show(truncate = False)

+------------------------------------------------------------+----------------------------------------------------------------------------+-----+
|value                                                       |words                                                                       |word |
+------------------------------------------------------------+----------------------------------------------------------------------------+-----+
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|simon|
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|has  |
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|a    |
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to

Now that we have every word from the text file in the word column, we will drop the rest of the columns.

In [14]:
df_explode = df_explode.drop('value','words')
df_explode.show(truncate = False)

+-----+
|word |
+-----+
|simon|
|has  |
|a    |
|dog  |
|and  |
|a    |
|cat  |
|the  |
|dog  |
|and  |
|cat  |
|used |
|to   |
|love |
|simon|
+-----+



Now we will count the number of occurrences of each word.

For that we need groupBy and agg, together with the count function.

In [22]:
# count must be imported, otherwise the agg call below raises a NameError
from pyspark.sql.functions import count
df_agg = df_explode.groupBy('word').agg(count('*').alias('word_count'))
df_agg.show(truncate = False)

+-----+----------+
|word |word_count|
+-----+----------+
|used |1         |
|simon|2         |
|dog  |2         |
|love |1         |
|cat  |2         |
|the  |1         |
|and  |2         |
|a    |2         |
|has  |1         |
|to   |1         |
+-----+----------+
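
As a quick sanity check on the table above, the same counts can be reproduced in plain Python without Spark. This is a small sketch that assumes the sentence written to Example.txt earlier; the variable names are illustrative only.

```python
from collections import Counter

# Same sentence that was written to Example.txt.
sentence = "simon has a dog and a cat the dog and cat used to love simon"

# str.split(" ") plays the role of split + explode: one list entry per word.
words = sentence.split(" ")

# Counter plays the role of groupBy('word').agg(count('*')).
word_counts = Counter(words)

for word, n in sorted(word_counts.items()):
    print(word, n)
```

The five words that appear twice (simon, dog, cat, and, a) and the five that appear once should match the word_count column shown by df_agg.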



In [None]:
Spark = (
    SparkSession
    .builder
    .appName('temp')
    .config('spark.')
)