
# 📘 EDA: Books Adapted into Movies

This notebook explores a dataset of **books that were adapted into movies**.  
We will look at parts of the dataset such as book ratings, number of pages, prices, and awards.  

We will create two new columns:
1. `publishYear` – the year the book was published.  
2. `hasAward` – whether the book has won any awards (1 = yes, 0 = no).

We will test **six simple hypotheses** using easy plots and explain each result in **short bullet points**. Each conclusion will also say which data is on the **X-axis** and the **Y-axis**.



**Assignment notes (from professor):**

- This EDA should include at least six hypotheses related to the dataset features.  
- Use Markdown cells to explain what the code and plots show.  
- Each hypothesis must be a statement (not a question) and must be tested with code and plots.  
- Include an introduction and a conclusion.  
- This assignment is usually done with a partner — make sure you follow your course rules.



## 📊 Dataset Explanation

Columns used in this notebook:

- **title** – name of the book  
- **author** – who wrote the book  
- **publishDate** – when the book was first published (we will extract year)  
- **rating** – average user rating (numeric)  
- **numRatings** – number of people who rated the book (numeric)  
- **awards** – awards won by the book (text or empty)  
- **price** – cost of the book (numeric)  
- **pages** – number of pages (numeric)  
- **language** – language of the book (text)  
- **likedPercent** – percent of readers who liked the book (numeric, 0-100)


In [None]:

# Simple imports and data loading
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset - make sure the CSV is in the same folder as this notebook
df = pd.read_csv("MoviesAdaptedFromBooks.csv")

# Create publishYear and hasAward
df['publishYear'] = pd.to_datetime(df.get('publishDate', pd.Series()), errors='coerce').dt.year
df['hasAward'] = df['awards'].apply(lambda x: 1 if pd.notna(x) and len(str(x).strip())>0 else 0)

# Show basic info and first rows
df.info()
df.head()



---
## Hypothesis 1 (statement): Newer books have higher ratings.

We will compare **publishYear** (X-axis) with **rating** (Y-axis) using a scatter plot.


In [None]:

# Scatter: publishYear vs rating
plt.figure(figsize=(6,4))
plt.scatter(df['publishYear'], df['rating'], alpha=0.5)
plt.title("Publish Year (X) vs Rating (Y)")
plt.xlabel("Publish Year (X)")
plt.ylabel("Rating (Y)")
plt.grid(True)
plt.show()



**Conclusion:**  
- X-axis: publishYear, Y-axis: rating.  
- The points show ratings for books across years. Some recent books have good ratings, and some old books also have high ratings.  
- There is no clear steady increase in rating as year increases. New books are not always better rated.  



---
## Hypothesis 2 (statement): Books with more pages get higher ratings.

We will compare **pages** (X-axis) with **rating** (Y-axis).


In [None]:

# Scatter: pages vs rating
plt.figure(figsize=(6,4))
plt.scatter(df['pages'], df['rating'], alpha=0.5)
plt.title("Pages (X) vs Rating (Y)")
plt.xlabel("Number of Pages (X)")
plt.ylabel("Rating (Y)")
plt.grid(True)
plt.show()



**Conclusion:**  
- X-axis: pages, Y-axis: rating.  
- The plot shows ratings for books with different page counts. Some long books have high ratings, but some short books also have high ratings.  
- Page count alone does not clearly predict a higher rating.  



---
## Hypothesis 3 (statement): Books that won awards have higher ratings.

We will compare average rating for **hasAward** (X-axis, 0 or 1) and **rating** (Y-axis).


In [None]:

# Bar: average rating by hasAward
avg_rating_award = df.groupby('hasAward', dropna=False)['rating'].mean()
avg_rating_award.plot(kind='bar', figsize=(6,4))
plt.title("Average Rating by Award Status")
plt.xlabel("Has Award (X) - 0 = No, 1 = Yes")
plt.ylabel("Average Rating (Y)")
plt.grid(axis='y')
plt.show()



**Conclusion:**  
- X-axis: hasAward (0 = no award, 1 = has award), Y-axis: average rating.  
- The bar plot compares average ratings for books with and without awards.  
- If the bars are similar, awards do not strongly change average rating. If a bar is higher, those books are rated a bit better on average.  
- In this dataset, the difference is small, so awards do not guarantee much higher ratings.



---
## Hypothesis 4 (statement): Higher-rated books cost more.

We will compare **rating** (X-axis) with **price** (Y-axis).


In [None]:

# Scatter: rating vs price
plt.figure(figsize=(6,4))
plt.scatter(df['rating'], df['price'], alpha=0.5)
plt.title("Rating (X) vs Price (Y)")
plt.xlabel("Rating (X)")
plt.ylabel("Price (Y)")
plt.grid(True)
plt.show()



**Conclusion:**  
- X-axis: rating, Y-axis: price.  
- The scatter shows prices for books at different rating levels. Some high-rated books are expensive, but many are not.  
- Price is not a strong predictor of a higher rating in this dataset.



---
## Hypothesis 5 (statement): Books with more readers have higher liked percentages.

We will compare **numRatings** (X-axis) with **likedPercent** (Y-axis).


In [None]:

# Scatter: numRatings vs likedPercent
plt.figure(figsize=(6,4))
plt.scatter(df['numRatings'], df['likedPercent'], alpha=0.5)
plt.title("Number of Ratings (X) vs Liked Percentage (Y)")
plt.xlabel("Number of Ratings (X)")
plt.ylabel("Liked Percentage (Y)")
plt.grid(True)
plt.show()



**Conclusion:**  
- X-axis: numRatings, Y-axis: likedPercent.  
- The plot shows how likedPercent changes as more people rate a book. Some popular books have high likedPercent, but not all.  
- High reader counts do not always mean a higher liked percentage, though many popular books still have good likedPercent.



---
## Hypothesis 6 (statement): English books have higher liked percentages than other languages.

We will compare **language** (X-axis) with average **likedPercent** (Y-axis) for the top 10 languages.


In [None]:

# Bar: average likedPercent by language (top 10)
avg_like_lang = df.groupby('language', dropna=False)['likedPercent'].mean().sort_values(ascending=False).head(10)
avg_like_lang.plot(kind='bar', figsize=(10,6))
plt.title("Average Liked Percentage by Language (Top 10)")
plt.xlabel("Language (X)")
plt.ylabel("Average Liked Percentage (Y)")
plt.grid(axis='y')
plt.show()



**Conclusion:**  
- X-axis: language, Y-axis: average likedPercent.  
- The bar chart shows which languages have higher liked percentages on average.  
- If English is not in the top bars, then English books are not the highest-liked in this dataset. Other languages may show higher likedPercent.



---
## 🏁 Final Conclusion and Submission Notes

- We tested six clear hypotheses related to columns in the dataset.  
- Each hypothesis used a simple plot and short explanation.  
- We explained what is on the X-axis and Y-axis for each plot and gave short bullet conclusions in plain English.  

**Submission:** Save this notebook and upload it to Blackboard as your EDA assignment. If you worked alone, keep written permission from your professor if required.
