Problem: create a CSV file with the following contents:

title,author,genre,sales,year

"1984", "George Orwell", "Science Fiction", 5000, 1949

"The Lord of the Rings", "J.R.R. Tolkien", "Fantasy", 3000, 1954

"To Kill a Mockingbird", "Harper Lee", "Southern Gothic", 4000, 1960

"The Catcher in the Rye", "J.D. Salinger", "Novel", 2000, 1951

"The Great Gatsby", "F. Scott Fitzgerald", "Novel", 4500, 1925

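One way to produce this file as a single plain-text file is to write the rows verbatim with the Python standard library. This is only a minimal sketch: the helper name `write_books_csv` is hypothetical, and the default file name `book1.csv` is assumed from the path the notebook reads later.

```python
from pathlib import Path

# The rows exactly as given in the problem statement,
# including the space after each comma.
BOOKS_CSV = '''title,author,genre,sales,year
"1984", "George Orwell", "Science Fiction", 5000, 1949
"The Lord of the Rings", "J.R.R. Tolkien", "Fantasy", 3000, 1954
"To Kill a Mockingbird", "Harper Lee", "Southern Gothic", 4000, 1960
"The Catcher in the Rye", "J.D. Salinger", "Novel", 2000, 1951
"The Great Gatsby", "F. Scott Fitzgerald", "Novel", 4500, 1925
'''


def write_books_csv(path="book1.csv"):
    """Write the sample data to a single CSV file and return its path (hypothetical helper)."""
    target = Path(path)
    target.write_text(BOOKS_CSV, encoding="utf-8")
    return target
```

Calling `write_books_csv()` creates `book1.csv` in the current working directory.
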
Task:

- Using Spark, read the data from the CSV file.
- Filter the data to keep only books whose sales exceed 3000 copies.
- Group the data by genre and compute the total sales for each genre.
- Sort the data by total sales in descending order.
- Print the results to the screen.
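
As a cross-check that does not depend on Spark, the expected result of these steps can be computed with the Python standard library. This is a sketch, not part of the assignment: the function name `sales_by_genre` and the `min_sales` parameter are hypothetical, and the data is embedded as a string so the example is self-contained.

```python
import csv
import io
from collections import defaultdict

SAMPLE = '''title,author,genre,sales,year
"1984", "George Orwell", "Science Fiction", 5000, 1949
"The Lord of the Rings", "J.R.R. Tolkien", "Fantasy", 3000, 1954
"To Kill a Mockingbird", "Harper Lee", "Southern Gothic", 4000, 1960
"The Catcher in the Rye", "J.D. Salinger", "Novel", 2000, 1951
"The Great Gatsby", "F. Scott Fitzgerald", "Novel", 4500, 1925
'''


def sales_by_genre(csv_text, min_sales=3000):
    """Filter, group by genre, sum sales, and sort descending (hypothetical helper)."""
    # skipinitialspace=True drops the space after each comma,
    # so the quoted fields are parsed without their quotes.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    totals = defaultdict(int)
    for row in reader:
        sales = int(row["sales"])
        if sales > min_sales:  # strictly greater than the threshold
            totals[row["genre"]] += sales
    # Largest total first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(sales_by_genre(SAMPLE))
# [('Science Fiction', 5000), ('Novel', 4500), ('Southern Gothic', 4000)]
```

The comparison is strict, so "The Lord of the Rings" (exactly 3000 copies) is excluded and Fantasy does not appear in the result.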

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=1f4fe34f541c7d414b46341629a943a36969daf298750c8ee014cd1ee5d75efb
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [20]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as fn
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("Sem5").getOrCreate()

# Sample rows: (title, author, genre, sales, year)
data = [
    ("1984", "George Orwell", "Science Fiction", 5000, 1949),
    ("The Lord of the Rings", "J.R.R. Tolkien", "Fantasy", 3000, 1954),
    ("To Kill a Mockingbird", "Harper Lee", "Southern Gothic", 4000, 1960),
    ("The Catcher in the Rye", "J.D. Salinger", "Novel", 2000, 1951),
    ("The Great Gatsby", "F. Scott Fitzgerald", "Novel", 4500, 1925),
]

schema = ["title", "author", "genre", "sales", "year"]
df = spark.createDataFrame(data, schema)
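# Spark writes a directory named "book1" containing part files, not a single CSV file.
# mode("overwrite") lets the cell be re-run; the header option stores the column names.
df.write.mode("overwrite").option("header", True).csv("book1")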
df.show()

+--------------------+-------------------+---------------+-----+----+
|               title|             author|          genre|sales|year|
+--------------------+-------------------+---------------+-----+----+
|                1984|      George Orwell|Science Fiction| 5000|1949|
|The Lord of the R...|     J.R.R. Tolkien|        Fantasy| 3000|1954|
|To Kill a Mocking...|         Harper Lee|Southern Gothic| 4000|1960|
|The Catcher in th...|      J.D. Salinger|          Novel| 2000|1951|
|    The Great Gatsby|F. Scott Fitzgerald|          Novel| 4500|1925|
+--------------------+-------------------+---------------+-----+----+



In [28]:

# Without inferSchema every column, including sales, is read as a string.
df = spark.read.option("header", True).csv("book1.csv")
print("Original dataset:")
df.show()

# Comparing the string column with 3000 relies on Spark's implicit numeric cast.
filtered_df = df.filter(df.sales > 3000)
print("Filtered by sales (more than 3000 copies):")
filtered_df.show()

# Total sales per genre, computed on the filtered rows only.
sum_sales_by_genre_df = filtered_df.groupBy("genre").agg(
    fn.sum(fn.col("sales").cast("int")).alias("sum_sales_by_genre")
)
# Sort by total sales in descending order and print the result:
print("Total sales by genre, in descending order:")
sum_sales_by_genre_df.orderBy(fn.desc("sum_sales_by_genre")).show()

Original dataset:
+--------------------+--------------------+--------------------+-----+------+
|               title|              author|               genre|sales|  year|
+--------------------+--------------------+--------------------+-----+------+
|           """1984""|   ""George Orwell""| ""Science Fiction""| 5000| 1949"|
|"""The Lord of th...|  ""J.R.R. Tolkien""|         ""Fantasy""| 3000| 1954"|
|"""To Kill a Mock...|      ""Harper Lee""| ""Southern Gothic""| 4000| 1960"|
|"""The Catcher in...|   ""J.D. Salinger""|           ""Novel""| 2000| 1951"|
|"""The Great Gats...| ""F. Scott Fitzg...|           ""Novel""| 4500| 1925"|
+--------------------+--------------------+--------------------+-----+------+

Filtered by sales (more than 3000 copies):
+--------------------+--------------------+--------------------+-----+------+
|               title|              author|               genre|sales|  year|
+--------------------+--------------------+--------------------+-----+----