<a href="https://colab.research.google.com/github/IsfaquethedataAnalyst/Data_Analyst/blob/main/Analyze_a_Dataset_of_Books_using_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's create a PySpark application that analyzes a dataset of books. We'll run several aggregations (by category, author, and publication year) to extract insights from the data.

1. First, set up your environment:

   * Install PySpark
   * Create a new Python file, book_analysis.py


In [28]:
!pip install pyspark



In [29]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, year

# Initialize Spark Session
spark = SparkSession.builder.appName("BookAnalysis").getOrCreate()

# Load the dataset
# Books.csv has the columns: Title, Authors, Description, Category, Publisher, Publish Date, Price
df = spark.read.csv("Books.csv", header=True, inferSchema=True)

In [30]:
def analyze_books(df):
    pass

In [31]:
df.show(10)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|               Title|             Authors|         Description|            Category|           Publisher|        Publish Date|               Price|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       Goat Brothers|    By Colton, Larry|                NULL|   History , General|           Doubleday|Friday, January 1...|Price Starting at...|
|  The Missing Person|  By Grumbach, Doris|                NULL|   Fiction , General|    Putnam Pub Group|Sunday, March 1, ...|Price Starting at...|
|Don't Eat Your He...|By Piscatella, Jo...|                NULL| Cooking , Reference|      Workman Pub Co|Thursday, Septemb...|Price Starting at...|
|When Your Corpora...|   By Davis, Paul D.|                NULL|                NULL|       Natl Pr Books|

In [32]:
genre_counts = df.groupBy("Category").count().orderBy(col("count").desc())
genre_counts.show()

+--------------------+-----+
|            Category|count|
+--------------------+-----+
|                NULL|25763|
|   Fiction , General| 2208|
| Fiction , Myster...| 1390|
|  Fiction , Literary| 1302|
| Fiction , Romanc...|  943|
| Fiction , Thrill...|  903|
| Fiction , Thrill...|  878|
|  Religion , General|  871|
|   Cooking , General|  820|
| Fiction , Romanc...|  657|
| Fiction , Romanc...|  601|
| Fiction , Scienc...|  597|
|   History , General|  585|
| Juvenile Fiction...|  561|
| Juvenile Nonfict...|  538|
| Social Science ,...|  526|
| Business & Econo...|  525|
| Juvenile Fiction...|  438|
| Religion , Chris...|  430|
|   Science , General|  414|
+--------------------+-----+
only showing top 20 rows
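
The counts above are dominated by NULL, and every other value pairs a subject with a subcategory ("Fiction , General"). Here is a minimal sketch of grouping by the subject alone, assuming the text before the first comma is the top-level subject. The names `top_level_category` and `count_top_level` are hypothetical helpers, not part of the original notebook.

```python
def top_level_category(category):
    """Return the text before the first comma, stripped, or None if missing."""
    if category is None:
        return None
    head = category.split(",")[0].strip()
    return head or None


def count_top_level(df, source="Category"):
    """Spark version of the same rule: skip NULLs, group by the first part."""
    # Imported here so the helper above runs without Spark installed.
    from pyspark.sql.functions import col, split, trim

    top = trim(split(col(source), ",").getItem(0))
    return (
        df.filter(col(source).isNotNull())
        .withColumn("top_category", top)
        .groupBy("top_category")
        .count()
        .orderBy(col("count").desc())
    )


print(top_level_category(" Fiction , General"))  # prints: Fiction
```

Calling `count_top_level(df).show()` should then merge rows such as "Fiction , General" and "Fiction , Literary" into a single "Fiction" bucket.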



In [33]:
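# Price is read as free text ("Price Starting at..."), so avg() returns NULL
# for every category whose prices cannot be cast to a number
avg_price = df.groupBy("Category").agg(avg("Price").alias("avg_price"))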
avg_price.show()

+--------------------+---------+
|            Category|avg_price|
+--------------------+---------+
|       Art , General|     NULL|
| Religion , Judai...|     NULL|
| Religion , Bibli...|     NULL|
| Fiction , Africa...|     NULL|
| Business & Econo...|     NULL|
| Political Scienc...|     NULL|
| but nevertheless...|     NULL|
| Cooking , Region...|     NULL|
| Technology & Eng...|     NULL|
| Medical , Nursin...|     NULL|
|   for mental acuity|     NULL|
|    once and for all|     NULL|
|            get lean|     NULL|
| But That Don't M...|     NULL|
|        or mercenary|     NULL|
| fans have been f...|     NULL|
| acting as his ey...|     NULL|
| Medical , Oncolo...|     NULL|
| including:* Deve...|     NULL|
| knowing and hila...|     NULL|
+--------------------+---------+
only showing top 20 rows
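
The NULL averages above are what we'd expect when Price holds text such as "Price Starting at..." rather than a number. One way to get a usable average is to pull the first numeric amount out of the string before aggregating. This is a sketch under assumptions: the full price format is truncated in the output, so the sample string below is invented, and `parse_price` and `average_price_by_category` are hypothetical names.

```python
import re

# First run of digits with an optional decimal part, e.g. "5.29".
PRICE_PATTERN = r"(\d+(?:\.\d+)?)"


def parse_price(text):
    """Return the first number found in text as a float, or None."""
    if text is None:
        return None
    match = re.search(PRICE_PATTERN, text)
    return float(match.group(1)) if match else None


def average_price_by_category(df):
    """Spark version: extract the amount, then average it per category."""
    # Imported here so parse_price can be tried without Spark installed.
    from pyspark.sql.functions import avg, col, regexp_extract, when

    raw = regexp_extract(col("Price"), PRICE_PATTERN, 1)
    # regexp_extract returns "" when nothing matches; keep those rows as NULL.
    price = when(raw != "", raw.cast("double"))
    return (
        df.withColumn("price_value", price)
        .groupBy("Category")
        .agg(avg("price_value").alias("avg_price"))
    )


# Invented sample string; the real format is cut off in the output above.
print(parse_price("Price Starting at $5.29"))  # prints: 5.29
```

The pattern ignores thousands separators, so an amount like "1,250.00" would be read as 1.0; adjust it if the data contains such values.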



In [34]:
# Count books per author; the column is named "Authors" in this dataset
author_counts = df.groupBy("Authors").count().orderBy(col("count").desc())
author_counts.show(10)

+--------------------+-----+
|             Authors|count|
+--------------------+-----+
|                  By| 1043|
|    By Roberts, Nora|  195|
|  By Time-Life Books|  172|
|          By unknown|  122|
|"By ""Better Home...|  121|
|  By Steel, Danielle|  120|
|      By Lucado, Max|   97|
|              By n/a|   92|
|By Reader's Diges...|   89|
| By Macomber, Debbie|   85|
+--------------------+-----+
only showing top 10 rows
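
Several of the top "authors" above are placeholders ("By", "By unknown", "By n/a"), and every name carries a "By " prefix. Below is a rough sketch of normalizing the column before counting. It assumes those three placeholders are the only ones worth dropping; `clean_author` and `count_clean_authors` are hypothetical helper names.

```python
import re

# Leading "By" as a whole word, plus any whitespace after it.
AUTHOR_PREFIX = r"^By\b\s*"
# Values treated as "no author" once the prefix is removed (assumed list).
PLACEHOLDER_AUTHORS = ("", "unknown", "n/a")


def clean_author(text):
    """Strip the "By " prefix and map placeholder values to None."""
    if text is None:
        return None
    name = re.sub(AUTHOR_PREFIX, "", text.strip()).strip()
    return None if name.lower() in PLACEHOLDER_AUTHORS else name


def count_clean_authors(df):
    """Spark version of the same rule, followed by the per-author count."""
    # Imported here so clean_author can be tried without Spark installed.
    from pyspark.sql.functions import col, lower, regexp_replace, trim, when

    name = trim(regexp_replace(trim(col("Authors")), AUTHOR_PREFIX, ""))
    cleaned = when(~lower(name).isin(*PLACEHOLDER_AUTHORS), name)
    return (
        df.withColumn("author", cleaned)
        .filter(col("author").isNotNull())
        .groupBy("author")
        .count()
        .orderBy(col("count").desc())
    )


print(clean_author("By Roberts, Nora"))  # prints: Roberts, Nora
```

The word boundary in the pattern keeps names that merely start with "By" (for example "Byron") intact.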



In [35]:
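# Publish Date is free text ("Friday, January 1..."), which year() cannot parse,
# so almost every row ends up in the NULL group
books_per_year = df.groupBy(year("Publish Date").alias("year")).count().orderBy("year")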
books_per_year.show()

+----+------+
|year| count|
+----+------+
|NULL|103078|
|1865|     1|
|1955|     1|
|1995|     1|
|2012|     1|
+----+------+
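
Almost every row lands in the NULL group above because `year()` expects a date or an ISO-style string, while Publish Date holds text like "Friday, January 1...". A simple workaround is to extract the four-digit year with a regular expression instead of parsing the whole date. This is a sketch, assuming each date string contains exactly one standalone four-digit number; the sample string is invented, and `extract_year` and `books_per_year_from_text` are hypothetical names.

```python
import re

# A standalone four-digit number, e.g. the "1993" in a long-form date.
YEAR_PATTERN = r"\b(\d{4})\b"


def extract_year(text):
    """Return the first four-digit number in text as an int, or None."""
    if text is None:
        return None
    match = re.search(YEAR_PATTERN, text)
    return int(match.group(1)) if match else None


def books_per_year_from_text(df, source="Publish Date"):
    """Spark version: extract the year as text, cast it, then count per year."""
    # Imported here so extract_year can be tried without Spark installed.
    from pyspark.sql.functions import col, regexp_extract, when

    raw = regexp_extract(col(source), YEAR_PATTERN, 1)
    # regexp_extract returns "" when nothing matches; keep those rows as NULL.
    year_col = when(raw != "", raw.cast("int"))
    return df.withColumn("year", year_col).groupBy("year").count().orderBy("year")


# Invented sample string; the real values are cut off in the output above.
print(extract_year("Friday, January 1, 1993"))  # prints: 1993
```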



In [36]:
def clean_data(df):
  # Implement your data cleaning logic here
  # For example, removing duplicates, handling missing values, etc.
  return df

cleaned_df = clean_data(df)

In [37]:
print("Basic Statistics:")
cleaned_df.describe().show()

Basic Statistics:
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|summary|               Title|             Authors|         Description|            Category|           Publisher|        Publish Date|               Price|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  count|              103082|              103082|               70231|               77319|              103000|              103023|              103041|
|   mean|             1310.04|                NULL|                 2.0|              1285.2|  3540556.8493150687|            1046.375|             834.625|
| stddev|   864.0054361968898|                NULL|                NULL|    955.917889008421|  3234904.2949715196|   987.1423312485678|   958.2619590398323|
|    min| American Heritag...|          

In [38]:
from pyspark.sql.functions import length

print("Book Length Analysis:")
length_df = cleaned_df.withColumn("title_length", length(col("title")))
length_df.select(avg("title_length").alias("avg_title_length")).show()

print("Top 5 Longest Titles:")
length_df.orderBy(col("title_length").desc()).select("title", "title_length").show(5)

Book Length Analysis:
+------------------+
|  avg_title_length|
+------------------+
|44.744776003569974|
+------------------+

Top 5 Longest Titles:
+--------------------+------------+
|               title|title_length|
+--------------------+------------+
|A Generous Orthod...|         293|
|Rules of Contract...|         254|
|The Old West Spea...|         248|
|Fleeced: How Bara...|         244|
|Nineteenth Centur...|         243|
+--------------------+------------+
only showing top 5 rows



In [41]:
print("Author Productivity Over Time:")
# Verify the column names in your cleaned DataFrame
print(cleaned_df.columns)

# Assuming the date column is named 'Publish Date' after cleaning
author_productivity = cleaned_df.groupBy("Authors", year("Publish Date").alias("year")) \
                               .count() \
                               .orderBy("Authors", "year")
author_productivity.show()

Author Productivity Over Time:
['Title', 'Authors', 'Description', 'Category', 'Publisher', 'Publish Date', 'Price']
+--------------------+----+-----+
|             Authors|year|count|
+--------------------+----+-----+
|                14)"|NULL|    1|
|          1780-1835"|NULL|    1|
|            Book 3)"|NULL|    2|
| Charlie Brown! V...|NULL|    1|
|            Expanded|NULL|    1|
| Jr. With researc...|NULL|    1|
|           Level 1)"|NULL|    1|
|   Love and Losers."|NULL|    1|
| Maintain and Ach...|NULL|    1|
| More Peanuts! Vo...|NULL|    1|
| Plain and Tall""...|NULL|    1|
| Proverbs 20:17 (...|NULL|    1|
|           Saving It|NULL|    1|
| Sexuality and Su...|NULL|    1|
| Shed Pounds for ...|NULL|    1|
|               The)"|NULL|    1|
| Together with Da...|NULL|    1|
|            Vol. 4)"|NULL|    1|
|            Vol. 9)"|NULL|    1|
|                   "|NULL|    1|
+--------------------+----+-----+
only showing top 20 rows
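
The Authors values above (`14)"`, `Book 3)"`, `Plain and Tall""...`) look like fragments of quoted titles, which suggests that some rows were split in the wrong place when the CSV was read. A likely cause, though not confirmed here, is that the file escapes quotes by doubling them and may contain line breaks inside quoted fields, neither of which Spark's default CSV settings handle. The sketch below first shows the intended parse with Python's `csv` module on an invented sample, then a hypothetical `read_books` helper that passes the matching options to `spark.read.csv`.

```python
import csv
import io

# Invented two-line sample: a quoted title containing a comma, doubled
# quotes, and an embedded line break.
SAMPLE = 'Title,Authors\n"A ""Quoted"" Title, Part 2\nsecond line","By Doe, Jane"\n'


def parse_rows(text):
    """Parse CSV text with standard quoting rules (doubled quotes, multiline)."""
    return list(csv.reader(io.StringIO(text)))


def read_books(spark, path="Books.csv"):
    """Re-read the file so quoted fields stay in one column (assumed fix)."""
    return spark.read.csv(
        path,
        header=True,
        inferSchema=True,
        multiLine=True,  # allow line breaks inside quoted fields
        quote='"',
        escape='"',      # treat "" inside a quoted field as a literal quote
    )


rows = parse_rows(SAMPLE)
print(len(rows))   # header plus one data row
print(rows[1][1])  # the author stays in its own column
```

If the row count and the Authors column look sane after `df = read_books(spark)`, the earlier aggregations can be rerun on the new DataFrame.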

