
# 📘 EDA: Books Adapted into Movies

This notebook explores a dataset of **books that were adapted into movies**.  
We will study the dataset to understand how book features such as ratings, price, and publication year relate to each other.  

To make the analysis more meaningful, it relies on two derived columns, which must exist in `df` before the hypothesis cells are run:
1. `publishYear` – the year the book was published.  
2. `hasAward` – whether the book has won any awards.  

We will test **six hypotheses** using different types of graphs (scatter, bar, line, and box) to visualize relationships and patterns.


In [None]:

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("BooksAdaptedFromMovies.csv")

# Show first few rows
df.head()


In [None]:

# Remove unused columns to make data cleaner and faster to analyze
unused_columns = ['isbn', 'link', 'coverImg', 'description', 'authorLink']
df = df.drop(columns=unused_columns, errors='ignore')

# Display basic info
df.info()
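
The hypothesis cells below read `publishYear` and `hasAward`, so both columns need to exist in `df` first. The following is a minimal sketch of one way to derive them. It assumes the raw data has a `publishDate` column holding date-like strings and an `awards` column holding free text that is empty or `[]` when a book has no award. Both column names and formats are assumptions, not confirmed by this notebook, so adjust them to the actual CSV. The helper name `add_derived_columns` is also hypothetical.

```python
import pandas as pd


def add_derived_columns(books: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `books` with `publishYear` and `hasAward` added.

    Assumes hypothetical raw columns `publishDate` and `awards`.
    """
    out = books.copy()

    # Unparseable dates become NaT, so their year becomes <NA> instead of raising
    dates = pd.to_datetime(out["publishDate"], errors="coerce")
    out["publishYear"] = dates.dt.year.astype("Int64")

    # 1 if the awards text is non-empty, 0 if it is missing, blank, or "[]"
    awards = out["awards"].fillna("").astype(str).str.strip()
    out["hasAward"] = (~awards.isin(["", "[]"])).astype(int)
    return out


# Toy rows (made up) to show the behaviour on clean, empty, and broken values
sample = pd.DataFrame({
    "publishDate": ["2008-09-14", "1960-07-11", "not a date"],
    "awards": ["['Locus Award']", "[]", None],
})
derived = add_derived_columns(sample)
print(derived[["publishYear", "hasAward"]])
```

On the real data this would be called as `df = add_derived_columns(df)` right after the cleanup cell, once the source column names have been checked against `df.info()`.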



## 📊 Hypothesis 1: Books with awards have higher ratings on average

We will check if books that won awards tend to have higher ratings.


In [None]:

# Bar chart comparing average ratings with and without awards
plt.figure(figsize=(8,6))
df.groupby('hasAward')['rating'].mean().plot(kind='bar', color=['orange', 'green'])
plt.title("Average Rating: With vs Without Awards")
plt.xlabel("Has Award (0 = No, 1 = Yes)")
plt.ylabel("Average Rating")
plt.show()



**Conclusion:**
- Books that have won awards generally show a higher average rating than those without awards.  
- This suggests that awards tend to go to well-rated books, although a difference in means does not show which way the influence runs.  
- There is a positive association between awards and ratings.
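
A bar of two means hides how many books sit in each group and how far apart the groups really are. As a rough sketch, the comparison can be backed by a small summary table. The data below is made up for illustration; on the real dataset the same `groupby` call would run on `df`.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
books = pd.DataFrame({
    "hasAward": [0, 0, 0, 1, 1],
    "rating": [3.5, 4.0, 3.9, 4.2, 4.4],
})

# Count, mean, and median per group: the count shows how much data backs each bar
summary = books.groupby("hasAward")["rating"].agg(["count", "mean", "median"])

# Difference in mean rating (award group minus no-award group)
mean_gap = summary.loc[1, "mean"] - summary.loc[0, "mean"]

print(summary)
print(f"Mean rating gap: {mean_gap:.2f}")
```

If one group is much smaller than the other, or the gap is only a few hundredths of a point, the conclusion deserves more caution than the bar chart alone implies.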



## 💰 Hypothesis 2: Newer books are priced higher

We will test if books published more recently have higher prices.


In [None]:

# Scatter plot: Publish year vs price
plt.figure(figsize=(8,6))
plt.scatter(df['publishYear'], df['price'])
plt.title("Publish Year vs Price")
plt.xlabel("Publish Year")
plt.ylabel("Price ($)")
plt.show()



**Conclusion:**
- The scatter plot shows a weak trend where newer books (higher publish years) are slightly more expensive.  
- However, some older books still have higher prices, possibly due to being classics or rare editions.  
- Price is therefore only weakly related to publish year.
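
A "weak trend" read off a scatter plot can be put into numbers with a correlation coefficient. The sketch below uses made-up data and assumes, for illustration only, that `price` may arrive as text with some unparseable entries; if the real column is already numeric, the `pd.to_numeric` step is harmless.

```python
import pandas as pd

# Toy data (made up); prices are stored as text here to mimic a messy CSV column
books = pd.DataFrame({
    "publishYear": [1950, 1975, 1990, 2005, 2015],
    "price": ["4.50", "6.00", "5.25", "bad value", "9.99"],
})

# Convert price to numbers; entries that cannot be parsed become NaN
books["price"] = pd.to_numeric(books["price"], errors="coerce")
clean = books.dropna(subset=["publishYear", "price"])

# Pearson measures linear association. Correlating the ranks gives Spearman,
# which is less sensitive to a few very expensive books.
pearson = clean["publishYear"].corr(clean["price"])
spearman = clean["publishYear"].rank().corr(clean["price"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Values near 0 indicate little association and values near 1 a strong positive one, which gives the "weak trend" wording something concrete to rest on.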



## 📈 Hypothesis 3: Higher-rated books get more readers

We will check if books with higher ratings attract more readers, using `numRatings` (the number of ratings a book has received) as a proxy for readership.


In [None]:

# Line plot: Average number of ratings by rating score
plt.figure(figsize=(8,6))
df.groupby('rating')['numRatings'].mean().plot(kind='line', marker='o')
plt.title("Rating vs Number of Readers")
plt.xlabel("Book Rating")
plt.ylabel("Average Number of Ratings")
plt.show()



**Conclusion:**
- The line graph shows that books with higher ratings generally have more readers.  
- One possible explanation is that well-rated books are recommended more often, but the plot shows an association, not a cause.  
- A positive relationship exists between rating and number of readers.
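
Grouping by the exact `rating` value can leave only one or two books per point, which makes the line jumpy. One alternative, sketched here on made-up data, is to bin ratings into ranges and report the median number of ratings per bin. The bin edges are an illustrative choice, not something taken from the dataset.

```python
import pandas as pd

# Toy data (made up)
books = pd.DataFrame({
    "rating": [3.2, 3.4, 3.8, 3.9, 4.1, 4.3, 4.6],
    "numRatings": [100, 300, 500, 700, 900, 1100, 50],
})

# Half-point bins; each interval excludes its left edge and includes its right edge
bins = [3.0, 3.5, 4.0, 4.5, 5.0]
books["ratingBin"] = pd.cut(books["rating"], bins=bins)

# The median is less affected by a handful of blockbuster titles than the mean
by_bin = books.groupby("ratingBin", observed=True)["numRatings"].agg(["count", "median"])

print(by_bin)
```

The `count` column matters here: very highly rated books are often niche titles with few ratings, so the top bin may not follow the trend of the bins below it.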



## 📦 Hypothesis 4: Books with more pages cost more

We will check if books with more pages are priced higher.


In [None]:

# Scatter plot: Pages vs Price
plt.figure(figsize=(8,6))
plt.scatter(df['numPages'], df['price'])
plt.title("Number of Pages vs Price")
plt.xlabel("Number of Pages")
plt.ylabel("Price ($)")
plt.show()



**Conclusion:**
- The scatter plot shows that price slightly increases with the number of pages.  
- This is plausible, as longer books may cost more to produce.  
- However, the trend is not very strong, suggesting price also depends on other factors like popularity or author fame.
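
To say how much price rises with length, a straight line can be fitted to the scatter. This is a minimal sketch on made-up data using `numpy.polyfit`; the resulting slope is only meaningful on the real data if the relationship is roughly linear.

```python
import numpy as np
import pandas as pd

# Toy data (made up): price rises by exactly 1.00 per 100 pages
books = pd.DataFrame({
    "numPages": [100, 200, 300, 400, 500],
    "price": [6.0, 7.0, 8.0, 9.0, 10.0],
})
clean = books.dropna(subset=["numPages", "price"])

# A degree-1 fit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(clean["numPages"].to_numpy(), clean["price"].to_numpy(), deg=1)

# Pearson correlation shows how tightly the points follow that line
r = clean["numPages"].corr(clean["price"])

print(f"About ${slope * 100:.2f} per extra 100 pages, r = {r:.2f}")
```

A small slope together with a low `r` would match the reading above that page count is only one of several factors behind price.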



## 🌍 Hypothesis 5: English books have higher liked percentages

We will compare the average liked percentage of English books with books in other languages.


In [None]:

# Bar chart for top 10 languages by average liked percentage
plt.figure(figsize=(10,6))
df.groupby('language')['likedPercent'].mean().sort_values(ascending=False).head(10).plot(kind='bar')
plt.title("Average Liked Percent by Language (Top 10)")
plt.xlabel("Language")
plt.ylabel("Average Liked Percent")
plt.show()



**Conclusion:**
- English books have the highest average liked percentage compared to other languages.  
- This may be because English-language books reach a larger audience and are widely adapted into movies.  
- This suggests that audience size and language reach may play a role in reader liking, though the averages for languages with few books are less reliable.
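
Sorting languages by mean alone can put a language with a single well-liked book at the top. A hedged sketch of one safeguard, on made-up data, is to keep only languages with a minimum number of books before ranking. The threshold `MIN_BOOKS` is an illustrative value, not one used in this notebook.

```python
import pandas as pd

# Toy data (made up): "Latin" has one book with a perfect score
books = pd.DataFrame({
    "language": ["English"] * 4 + ["French"] * 3 + ["Latin"],
    "likedPercent": [90, 92, 94, 96, 88, 90, 92, 100],
})

MIN_BOOKS = 3  # illustrative threshold, not taken from the notebook

# Book count and mean liked percentage per language
stats = books.groupby("language")["likedPercent"].agg(["count", "mean"])

# Rank only the languages that have enough books to make the mean meaningful
reliable = stats[stats["count"] >= MIN_BOOKS].sort_values("mean", ascending=False)

print(reliable)
```

Comparing `stats` with `reliable` shows whether the ranking in the bar chart depends on languages that are represented by only a handful of titles.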



## 🏆 Hypothesis 6: Award-winning books have higher reader engagement (ratings + likes)

We will check if award-winning books also receive more reader ratings, using `numRatings` as the engagement measure.


In [None]:

# Box plot comparing number of ratings based on awards
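# Pass figsize directly: with `by`, boxplot creates its own figure,
# so a preceding plt.figure() would only leave an empty extra figure
df.boxplot(column='numRatings', by='hasAward', figsize=(8,6))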
plt.title("Reader Engagement: Awards vs No Awards")
plt.suptitle("")
plt.xlabel("Has Award (0 = No, 1 = Yes)")
plt.ylabel("Number of Ratings")
plt.show()



**Conclusion:**
- The box plot shows that award-winning books generally receive more reader ratings.  
- This suggests that books that win awards attract more attention and engagement.  
- There is a positive association between awards and reader involvement, as measured by the number of ratings.
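
Counts of ratings are usually very skewed, so a few blockbuster titles can pull a group's mean far from its typical value. The sketch below, on made-up data, compares median and mean per group and also includes `likedPercent`, since the hypothesis mentions likes as well as ratings.

```python
import pandas as pd

# Toy data (made up): one non-award book has an extreme number of ratings
books = pd.DataFrame({
    "hasAward": [0, 0, 0, 1, 1, 1],
    "numRatings": [100, 200, 90000, 500, 800, 1100],
    "likedPercent": [88, 90, 95, 91, 93, 95],
})

# Median and mean side by side for both engagement measures
engagement = books.groupby("hasAward")[["numRatings", "likedPercent"]].agg(["median", "mean"])

print(engagement)
```

In this toy example the no-award group has the higher mean but the lower median, which shows why the median, the line inside each box of the box plot, is the safer basis for the comparison.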



# 🧾 Final Summary

Through this analysis, we explored six hypotheses about books adapted into movies using simple graphs.  

### Key findings:
- Award-winning books have higher ratings and engagement.  
- Newer books are slightly more expensive.  
- More pages may lead to a higher price, but not always.  
- English books have the highest average liked percentage.  
- Books with higher ratings also attract more readers.  

Overall, we see that **ratings, awards, and language** are associated with how readers respond to books that are later made into movies. These plots show associations, not causes.
