
# 📘 Exploratory Data Analysis on Books Adapted into Movies

This notebook explores the dataset **MoviesAdaptedFromBooks.csv**.  
The goal is to find some simple patterns and insights using beginner-level Python code.



## 📊 Dataset Explanation

The dataset has the following columns:

- **title** – name of the book  
- **author** – who wrote the book  
- **publishDate** – when the book was first published  
- **rating** – average user rating  
- **numRatings** – number of people who rated the book  
- **awards** – awards won by the book  
- **price** – cost of the book  
- **pages** – number of pages  
- **language** – language of the book  
- **likedPercent** – how many people liked it (percentage)

We will also create two new columns:

- **publishYear** → extracted from publishDate  
- **hasAward** → 1 if the book has an award, 0 if not


In [None]:

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv("MoviesAdaptedFromBooks.csv")

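# Add new columns
# errors='coerce' turns dates that cannot be parsed into missing values instead of raising an error
df['publishYear'] = pd.to_datetime(df['publishDate'], errors='coerce').dt.year
# hasAward is 1 when the awards cell holds any non-blank text, otherwise 0
df['hasAward'] = df['awards'].apply(lambda x: 1 if pd.notna(x) and len(str(x).strip()) > 0 else 0)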

# Show first few rows
df.head()
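
One thing worth checking before relying on hasAward: if the awards column stores "no awards" as the text `[]`, the rule above would still count it as an award, because the text is not blank. Whether this file does that is an assumption, not something confirmed here. A minimal sketch on made-up values, using a hypothetical helper named `has_award`:

```python
import pandas as pd

# Made-up sample values; the real awards column may be formatted differently.
sample = pd.DataFrame(
    {"awards": ["['Example Award (2001)']", "[]", None, "   ", "Example Prize"]}
)

def has_award(value):
    """Return 1 if the value looks like a non-empty awards entry, else 0."""
    if pd.isna(value):
        return 0
    text = str(value).strip()
    # Treat blank text and an empty-list string as "no award".
    return 0 if text in ("", "[]") else 1

sample["hasAward"] = sample["awards"].apply(has_award)
print(sample)
```

Running `df['awards'].value_counts().head()` on the real data would show whether an entry such as `[]` actually occurs.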



## 📈 Hypothesis 1: Newer books have higher ratings


In [None]:

# Scatter plot for publishYear vs rating
plt.figure(figsize=(6,4))
plt.scatter(df['publishYear'], df['rating'], alpha=0.5)
plt.title("Publish Year vs Rating")
plt.xlabel("Publish Year")
plt.ylabel("Rating")
plt.show()



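**Observation:**  
The scatter plot shows each book's average rating against its publication year. Books whose publishDate could not be parsed have no publishYear, so they do not appear in the plot.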

**Conclusion:**  
Newer books do not always have higher ratings. Some older books are still rated very well, so the year alone does not decide the rating.
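
A scatter plot can be hard to read by eye, so a single number helps back up the conclusion. The sketch below computes the Pearson correlation between year and rating on a tiny made-up table (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "publishYear": [1950, 1975, None, 2000, 2015],
    "rating": [4.2, 3.9, 4.0, 4.1, 3.8],
})

# Series.corr uses Pearson correlation by default and
# ignores pairs where either value is missing.
year_rating_corr = toy["publishYear"].corr(toy["rating"])
print(round(year_rating_corr, 2))
```

On the real data the same call would be `df['publishYear'].corr(df['rating'])`. A value close to 0 would agree with the conclusion above, while a value close to 1 would support the hypothesis.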



## 📚 Hypothesis 2: Books with more pages get higher ratings


In [None]:

# Scatter plot for pages vs rating
plt.figure(figsize=(6,4))
plt.scatter(df['pages'], df['rating'], alpha=0.5, color='orange')
plt.title("Pages vs Rating")
plt.xlabel("Number of Pages")
plt.ylabel("Rating")
plt.show()



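**Observation:**  
The scatter plot shows how ratings vary with the number of pages. It can show a pattern, but it cannot show that page count causes a higher or lower rating.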

**Conclusion:**  
Books with more pages don’t always get higher ratings. Some short books are liked a lot too.
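
Another way to check this is to sort books into length groups and compare the average rating of each group. The sketch below does that with `pd.cut` on made-up rows. The bin edges (200 and 500 pages) and the group names are arbitrary choices for illustration:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "pages": [120, 180, 320, 410, 650, 900],
    "rating": [4.1, 3.7, 3.9, 4.0, 4.2, 3.8],
})

# Arbitrary bin edges: up to 200 pages, 201 to 500, and more than 500.
bins = [0, 200, 500, float("inf")]
labels = ["short", "medium", "long"]
toy["lengthGroup"] = pd.cut(toy["pages"], bins=bins, labels=labels)

# Mean rating and number of books in each length group.
rating_by_length = (
    toy.groupby("lengthGroup", observed=False)["rating"].agg(["mean", "count"])
)
print(rating_by_length)
```

If the group means on the real data are close to each other, that supports the conclusion that length alone does not decide the rating.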



## 🏆 Hypothesis 3: Books with awards have higher ratings


In [None]:

# Average rating for books without (0) and with (1) an award
avg_rating_award = df.groupby('hasAward')['rating'].mean()
avg_rating_award.plot(kind='bar', color=['red', 'green'], figsize=(6,4))
plt.title("Average Rating: Award vs No Award")
plt.xlabel("Has Award (1 = Yes, 0 = No)")
plt.ylabel("Average Rating")
plt.show()



**Observation:**  
The bar chart compares the average rating of books with an award (hasAward = 1) and books without one (hasAward = 0).

**Conclusion:**  
Books with awards don’t always have much higher ratings. Awards might help visibility, but not always the rating.
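
Two bars that both start at zero can hide a small difference, so it may help to print the numbers behind the chart along with the size of each group. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "hasAward": [1, 1, 1, 0, 0, 0, 0],
    "rating": [4.1, 4.3, 3.9, 4.0, 3.8, 4.2, 3.6],
})

# Mean, median and group size for each value of hasAward.
summary = toy.groupby("hasAward")["rating"].agg(["mean", "median", "count"])

# Difference in mean rating: award group minus no-award group.
mean_gap = summary.loc[1, "mean"] - summary.loc[0, "mean"]
print(summary)
print("Gap in mean rating:", round(mean_gap, 2))
```

A gap of a few hundredths of a rating point, or a group with only a handful of books, would be a reason to treat the comparison with caution.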



## 💰 Hypothesis 4: Higher-rated books cost more


In [None]:

# Scatter plot for rating vs price
plt.figure(figsize=(6,4))
plt.scatter(df['rating'], df['price'], alpha=0.5, color='purple')
plt.title("Rating vs Price")
plt.xlabel("Rating")
plt.ylabel("Price")
plt.show()



**Observation:**  
The scatter plot shows the price of each book against its rating.

**Conclusion:**  
There is no clear link between rating and price. Some expensive books have average ratings, while cheaper ones are well-rated.
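
Price columns are sometimes read in as text when a few entries are not plain numbers. Whether that happens in this file is an assumption. The sketch below, on made-up rows, converts price to numbers and then computes a rank correlation (Spearman) by correlating the ranks of the two columns:

```python
import pandas as pd

# Made-up rows for illustration only; "not listed" stands in for a bad entry.
toy = pd.DataFrame({
    "rating": [3.5, 3.9, 4.0, 4.2, 4.5],
    "price": ["12.99", "5.50", "not listed", "20.00", "7.25"],
})

# Turn text into numbers; entries that are not numbers become NaN.
toy["priceNum"] = pd.to_numeric(toy["price"], errors="coerce")
clean = toy.dropna(subset=["priceNum"])

# Spearman correlation is the Pearson correlation of the ranks.
rank_corr = clean["rating"].rank().corr(clean["priceNum"].rank())
print(round(rank_corr, 2))
```

Rank correlation is less sensitive to a few very expensive books than the ordinary correlation, which suits a column like price.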



## 👥 Hypothesis 5: Books with more ratings have higher liked percentages


In [None]:

# Scatter plot for number of ratings vs liked percentage
plt.figure(figsize=(6,4))
plt.scatter(df['numRatings'], df['likedPercent'], alpha=0.5, color='teal')
plt.title("Number of Ratings vs Liked Percentage")
plt.xlabel("Number of Ratings")
plt.ylabel("Liked Percentage")
plt.show()



**Observation:**  
The scatter plot shows the liked percentage of each book against the number of ratings it received.

**Conclusion:**  
Books with many ratings often have good liked percentages, but not always. Popularity doesn’t always mean everyone likes it.
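
Rating counts usually range from a handful to millions, so a few very popular books can squash every other point against the left edge of the plot. Taking the logarithm of the counts spreads them out. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "numRatings": [50, 400, 3_000, 25_000, 1_000_000],
    "likedPercent": [88, 92, 90, 95, 93],
})

# log10 turns 10, 100, 1000 into 1, 2, 3, so large counts no longer dominate.
toy["logNumRatings"] = np.log10(toy["numRatings"])

raw_corr = toy["numRatings"].corr(toy["likedPercent"])
log_corr = toy["logNumRatings"].corr(toy["likedPercent"])
print("Correlation with raw counts:", round(raw_corr, 2))
print("Correlation with log counts:", round(log_corr, 2))
```

For the plot itself, calling `plt.xscale("log")` before `plt.show()` has a similar effect. Books with zero ratings would need to be filtered out first, because the logarithm of zero is not a finite number.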



## 🌍 Hypothesis 6: English books have higher liked percentages


In [None]:

# Average liked percentage per language, keeping the 10 highest
avg_like_lang = df.groupby('language')['likedPercent'].mean().sort_values(ascending=False).head(10)
avg_like_lang.plot(kind='bar', figsize=(10,6))
plt.title("Average Liked Percentage by Language (Top 10)")
plt.xlabel("Language")
plt.ylabel("Average Liked Percentage")
plt.show()



**Observation:**  
The bar chart shows the ten languages with the highest average liked percentage.

**Conclusion:**  
English is not at the top here. This means books in some other languages have higher liked percentages in this dataset.
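
A language with only one or two books can reach the top of this ranking by chance. One way to guard against that is to keep only languages with a minimum number of books before ranking. The sketch below uses made-up rows, and the threshold `MIN_BOOKS` is an arbitrary value chosen for illustration:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "language": ["English"] * 4 + ["French"] * 3 + ["Icelandic"],
    "likedPercent": [90, 92, 88, 94, 91, 89, 96, 99],
})

MIN_BOOKS = 3  # arbitrary threshold for this example

# Mean liked percentage and number of books per language.
stats = toy.groupby("language")["likedPercent"].agg(["mean", "count"])

# Keep only languages with enough books, then rank by mean.
reliable = stats[stats["count"] >= MIN_BOOKS].sort_values("mean", ascending=False)
print(reliable)
```

Here the single Icelandic book is dropped from the ranking even though its liked percentage is the highest in the table.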



## 🏁 Final Summary

We checked six simple ideas about books adapted into movies.  
None of the six guesses was clearly confirmed by the plots.  
We learned that ratings and likes depend on many things, not just price, pages, or language.  
This basic EDA helped us understand the dataset better using easy Python steps.
