# Assignment

**Setup:** create a CSV file with the following contents:

title,author,genre,sales,year

"1984", "George Orwell", "Science Fiction", 5000, 1949

"The Lord of the Rings", "J.R.R. Tolkien", "Fantasy", 3000, 1954

"To Kill a Mockingbird", "Harper Lee", "Southern Gothic", 4000, 1960

"The Catcher in the Rye", "J.D. Salinger", "Novel", 2000, 1951

"The Great Gatsby", "F. Scott Fitzgerald", "Novel", 4500, 1925
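
The rows above have a space after every comma, which some CSV parsers treat as part of the value. As a minimal sketch, assuming the file is created by hand rather than through Spark, the snippet below writes the content verbatim and checks that it parses cleanly with Python's standard `csv` module. The file name `books.csv` is a hypothetical choice, not something the assignment specifies.

```python
import csv

# File content copied verbatim from the setup above
# (note the space after each comma).
CSV_TEXT = '''title,author,genre,sales,year
"1984", "George Orwell", "Science Fiction", 5000, 1949
"The Lord of the Rings", "J.R.R. Tolkien", "Fantasy", 3000, 1954
"To Kill a Mockingbird", "Harper Lee", "Southern Gothic", 4000, 1960
"The Catcher in the Rye", "J.D. Salinger", "Novel", 2000, 1951
"The Great Gatsby", "F. Scott Fitzgerald", "Novel", 4500, 1925
'''

# "books.csv" is a hypothetical file name chosen for this sketch.
with open("books.csv", "w", encoding="utf-8", newline="") as f:
    f.write(CSV_TEXT)

# skipinitialspace=True drops the space that follows each comma,
# so the quoted fields are recognised as quoted.
with open("books.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f, skipinitialspace=True))

for row in rows:
    print(row["title"], row["sales"])
```

Spark's CSV reader has an `ignoreLeadingWhiteSpace` option that plays a similar role; whether it is needed depends on how the file was written.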

**Task:**

- Using Spark, read the data from the CSV file.

- Filter the data to keep only books whose sales exceed 3000 copies.

- Group the data by genre and compute the total sales for each genre.

- Sort the data by total sales in descending order.

- Print the results to the screen.

In [1]:
!pip install -q pyspark

In [5]:
from pyspark.sql import SparkSession

# Build a local session; the builder creates the SparkContext as needed
spark = SparkSession.builder.master("local[*]").appName("Hw5").getOrCreate()
sc = spark.sparkContext

In [7]:
# Create the CSV (Spark writes data.csv as a directory of part files)
data = [
    ("1984", "George Orwell", "Science Fiction", 5000, 1949),
    ("The Lord of the Rings", "J.R.R. Tolkien", "Fantasy", 3000, 1954),
    ("To Kill a Mockingbird", "Harper Lee", "Southern Gothic", 4000, 1960),
    ("The Catcher in the Rye", "J.D. Salinger", "Novel", 2000, 1951),
    ("The Great Gatsby", "F. Scott Fitzgerald", "Novel", 4500, 1925)
]
columns = ["title", "author", "genre", "sales", "year"]
df = spark.createDataFrame(data, columns)
df.write.csv("./data.csv", header=True, mode="overwrite")

In [8]:
# Read csv
csv_file_path = "./data.csv"
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
print("Source:")
df.show()

Source:
+--------------------+-------------------+---------------+-----+----+
|               title|             author|          genre|sales|year|
+--------------------+-------------------+---------------+-----+----+
|To Kill a Mocking...|         Harper Lee|Southern Gothic| 4000|1960|
|The Catcher in th...|      J.D. Salinger|          Novel| 2000|1951|
|    The Great Gatsby|F. Scott Fitzgerald|          Novel| 4500|1925|
|                1984|      George Orwell|Science Fiction| 5000|1949|
|The Lord of the R...|     J.R.R. Tolkien|        Fantasy| 3000|1954|
+--------------------+-------------------+---------------+-----+----+



In [9]:
# Filter
filtered_df = df.filter(df.sales > 3000)
print("Filtered: sales > 3000:")
filtered_df.show()

Filtered: sales > 3000:
+--------------------+-------------------+---------------+-----+----+
|               title|             author|          genre|sales|year|
+--------------------+-------------------+---------------+-----+----+
|To Kill a Mocking...|         Harper Lee|Southern Gothic| 4000|1960|
|    The Great Gatsby|F. Scott Fitzgerald|          Novel| 4500|1925|
|                1984|      George Orwell|Science Fiction| 5000|1949|
+--------------------+-------------------+---------------+-----+----+



In [13]:
# Group by genre and sum sales (uses the unfiltered df, so every book is counted)
grouped_df = df.groupBy("genre").sum("sales")
print("Grouped by genre and sum of sales")
grouped_df.show()

Grouped by genre and sum of sales
+---------------+----------+
|          genre|sum(sales)|
+---------------+----------+
|Southern Gothic|      4000|
|          Novel|      6500|
|        Fantasy|      3000|
|Science Fiction|      5000|
+---------------+----------+



In [20]:
# Sort the individual books by sales, descending (df, not the per-genre totals)
sorted_df = df.orderBy(df.sales.desc())
print("Sorted by sales value, desc:")
sorted_df.show()

Sorted by sales value, desc:
+--------------------+-------------------+---------------+-----+----+
|               title|             author|          genre|sales|year|
+--------------------+-------------------+---------------+-----+----+
|                1984|      George Orwell|Science Fiction| 5000|1949|
|    The Great Gatsby|F. Scott Fitzgerald|          Novel| 4500|1925|
|To Kill a Mocking...|         Harper Lee|Southern Gothic| 4000|1960|
|The Lord of the R...|     J.R.R. Tolkien|        Fantasy| 3000|1954|
|The Catcher in th...|      J.D. Salinger|          Novel| 2000|1951|
+--------------------+-------------------+---------------+-----+----+



In [21]:
spark.stop()