In [2]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488490 sha256=0bcd11bb4e7ad1b615644cd6a13af63f77e8fa20c6f8c39bb066a2a9f2278aa3
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


**PySpark Streaming:**

Create a Spark session.

In [7]:
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("Spark Streaming")
    .master("local[*]")
    .getOrCreate()
)

spark

I am creating a text file to use with Spark Streaming.

In [8]:
with open('Example.txt','w') as f:
  f.write('simon has a dog and a cat the dog and cat used to love simon')

In [9]:
with open('Example.txt','r') as f:
  x = f.read()
  print(x)

simon has a dog and a cat the dog and cat used to love simon


Now we will read the data from the text file.

In [10]:
df_raw = spark.read.format("text").load('/content/Example.txt')
df_raw.printSchema()

root
 |-- value: string (nullable = true)



In [11]:
df_raw.show()

+--------------------+
|               value|
+--------------------+
|simon has a dog a...|
+--------------------+



Passing truncate = False to show prints the full contents of each column instead of the shortened values in the output above.

In [12]:
df_raw.show(truncate = False)

+------------------------------------------------------------+
|value                                                       |
+------------------------------------------------------------+
|simon has a dog and a cat the dog and cat used to love simon|
+------------------------------------------------------------+



Next we want to count how many times each word occurs. To do that, we first need to split the sentence into words.

For this we import the split function.

In [13]:
from pyspark.sql.functions import split

In [14]:
df_words = df_raw.withColumn('words',split('value',' '))

In [15]:
df_words.show(truncate = False)

+------------------------------------------------------------+----------------------------------------------------------------------------+
|value                                                       |words                                                                       |
+------------------------------------------------------------+----------------------------------------------------------------------------+
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|
+------------------------------------------------------------+----------------------------------------------------------------------------+



Now we need to explode the list into separate rows, one word per row.

For this we import explode.

In [16]:
from pyspark.sql.functions import explode

In [17]:
df_explode = df_words.withColumn('word',explode('words'))
df_explode.show(truncate = False)

+------------------------------------------------------------+----------------------------------------------------------------------------+-----+
|value                                                       |words                                                                       |word |
+------------------------------------------------------------+----------------------------------------------------------------------------+-----+
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|simon|
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|has  |
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to, love, simon]|a    |
|simon has a dog and a cat the dog and cat used to love simon|[simon, has, a, dog, and, a, cat, the, dog, and, cat, used, to

Now all the words from the text file are in the word column, so we will drop the rest of the columns.

In [18]:
df_explode = df_explode.drop('value','words')
df_explode.show(truncate = False)

+-----+
|word |
+-----+
|simon|
|has  |
|a    |
|dog  |
|and  |
|a    |
|cat  |
|the  |
|dog  |
|and  |
|cat  |
|used |
|to   |
|love |
|simon|
+-----+



Now we will count the number of occurrences of each word.

For that we use the groupBy and agg functions.

In [67]:
from pyspark.sql.functions import count
df_agg = df_explode.groupBy('word').agg(count('*').alias('word_count'))
df_agg.show(truncate = False)

+-----+----------+
|word |word_count|
+-----+----------+
|used |1         |
|simon|2         |
|dog  |2         |
|love |1         |
|cat  |2         |
|the  |1         |
|and  |2         |
|a    |2         |
|has  |1         |
|to   |1         |
+-----+----------+



Finding Unique Values.

In [26]:
df_unique = df_explode.select('word').distinct()
df_unique.show()

+-----+
| word|
+-----+
| used|
|simon|
|  dog|
| love|
|  cat|
|  the|
|  and|
|    a|
|  has|
|   to|
+-----+



Filtering Rows. Keep only the words whose length is greater than 3.

In [64]:
from pyspark.sql.functions import col
from pyspark.sql.functions import length
x = df_explode.filter(length(col('word')) > 3)
x.show(truncate = False)

+-----+
|word |
+-----+
|simon|
|used |
|love |
|simon|
+-----+



Calculating the Average Word Length.

avg returns a double by default, but here we want the answer as an integer.

**cast('required datatype')** converts a column from one data type to another.

This version returns a DataFrame as output.

In [65]:
from pyspark.sql.functions import avg, length, round

avg_length = df_explode.select(round(avg(length(col('word'))),0).cast('int').alias('avg length of words'))
avg_length.show(truncate = False)

+-------------------+
|avg length of words|
+-------------------+
|3                  |
+-------------------+



This version returns a plain Python value as output.

**Collect:**
The collect method retrieves the data from the DataFrame to the driver program as a list of rows. It’s a way to pull the computed data into the local Python environment.

In [55]:
from pyspark.sql.functions import avg, length, round

avg_length = df_explode.select(round(avg(length(col('word'))),0).cast('int').alias('avg length of words')).collect()[0][0]
print('average length is: ',avg_length)

average length is:  3


In [None]:
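from pyspark.sql import SparkSession

# Use the lowercase name spark, consistent with the session created earlier.
spark = (
    SparkSession
    .builder
    .appName('temp')
    # Ask Spark to stop streaming gracefully on shutdown.
    .config('spark.streaming.stopGracefullyOnShutdown', True)
    .master("local[*]")
    .getOrCreate()
)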