# 🎬 Will I Like This Movie?
## A Hands-On Introduction to Data Science & AI

Welcome! In this notebook, you'll analyze movie data and build your first machine learning model.

**Your Mission:** Discover what makes a movie successful and predict movie ratings!

---

## Checkpoint 1: 🔧 Environment Setup

Let's start by importing the libraries we need. Run the cell below by pressing `Shift + Enter`.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots look nice
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Show plots inline
%matplotlib inline

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")

**✅ Checkpoint 1 Complete!** If you see the success message above, you're ready to continue!

---

## Checkpoint 2: 📂 Load the Data

Now let's load our movie dataset. We'll use `pandas` to read a CSV file.

In [None]:
# Load the movie dataset
df = pd.read_csv('../data/movies.csv')

print(f"🎬 Loaded {len(df)} movies!")

### 👀 Take a First Look

Let's see what our data looks like!

In [None]:
# TODO: Display the first 5 rows of the dataset
# Hint: Use df.head()



In [None]:
# TODO: Check the shape of the dataset (rows, columns)
# Hint: Use df.shape



In [None]:
# TODO: List all column names
# Hint: Use df.columns



### 📝 Understanding the Columns

| Column | Description |
|--------|-------------|
| `title` | Movie title |
| `release_year` | Year the movie was released |
| `budget` | Production budget (in USD) |
| `revenue` | Box office revenue (in USD) |
| `runtime` | Movie length in minutes |
| `vote_average` | Average user rating (0-10) ⭐ |
| `vote_count` | Number of votes |
| `popularity` | Popularity score |
| `genre` | Primary genre |

**✅ Checkpoint 2 Complete!** You've loaded the data and explored its structure!

---

## Checkpoint 3: 🔍 Explore the Data

Let's dig deeper and understand our data better.

In [None]:
# Get summary statistics
df.describe()

In [None]:
# TODO: What is the average movie rating?
# Hint: Use df['vote_average'].mean()

average_rating = ___
print(f"⭐ Average movie rating: {average_rating:.2f}")

In [None]:
# TODO: Find the highest-rated movie
# Hint: Use df.loc[df['vote_average'].idxmax()]

best_movie = ___
print(f"🏆 Highest rated movie: {best_movie['title']} ({best_movie['vote_average']})")

In [None]:
# TODO: Find the lowest-rated movie
# Hint: Use df.loc[df['vote_average'].idxmin()]

worst_movie = ___
print(f"👎 Lowest rated movie: {worst_movie['title']} ({worst_movie['vote_average']})")

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

### 🤔 Quick Questions

Based on your exploration:
1. What's the average movie rating? ___
2. What's the range of budgets? ___
3. Which genres are represented? ___

**✅ Checkpoint 3 Complete!** You now understand the data!

---

## Checkpoint 4: 📊 Visualize Patterns

A picture is worth a thousand data points! Let's create some visualizations.

### 📈 Distribution of Ratings

In [None]:
# TODO: Create a histogram of movie ratings
# Hint: Use plt.hist(df['vote_average'], bins=20)

plt.figure(figsize=(10, 6))

# Your code here:


plt.xlabel('Rating')
plt.ylabel('Number of Movies')
plt.title('Distribution of Movie Ratings')
plt.show()

### 💰 Budget vs Revenue

In [None]:
# TODO: Create a scatter plot of budget vs revenue
# Hint: Use plt.scatter(df['budget'], df['revenue'])

plt.figure(figsize=(10, 6))

# Your code here:


plt.xlabel('Budget ($)')
plt.ylabel('Revenue ($)')
plt.title('Budget vs Revenue')
plt.show()

### 🎭 Ratings by Genre

In [None]:
# TODO: Create a bar chart of average rating by genre
# Hint: Use df.groupby('genre')['vote_average'].mean().plot(kind='bar')

plt.figure(figsize=(12, 6))

# Your code here:


plt.xlabel('Genre')
plt.ylabel('Average Rating')
plt.title('Average Rating by Genre')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 📅 Ratings Over Time

In [None]:
# Average rating per year
plt.figure(figsize=(12, 6))

yearly_avg = df.groupby('release_year')['vote_average'].mean()
yearly_avg.plot(kind='line', marker='o', markersize=3)

plt.xlabel('Year')
plt.ylabel('Average Rating')
plt.title('Average Movie Rating Over Time')
plt.show()

**✅ Checkpoint 4 Complete!** You've created beautiful visualizations!

---

## Checkpoint 5: 🔗 Find Correlations

Which features are related to higher ratings? Let's find out!

In [None]:
# Select numeric columns for correlation
numeric_cols = ['budget', 'revenue', 'runtime', 'popularity', 'vote_average', 'vote_count']
correlation_matrix = df[numeric_cols].corr()

# Display correlation with vote_average
print("Correlation with ratings (vote_average):")
print(correlation_matrix['vote_average'].sort_values(ascending=False))

In [None]:
# TODO: Create a correlation heatmap
# Hint: Use sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

plt.figure(figsize=(10, 8))

# Your code here:


plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

### 🤔 Interpreting Correlations

- **Positive correlation (close to 1)**: As one goes up, the other tends to go up
- **Negative correlation (close to -1)**: As one goes up, the other tends to go down
- **No correlation (close to 0)**: No linear relationship (a curved relationship can still exist)
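
To make these three cases concrete, here is a small sketch on synthetic data (not the movie dataset). The DataFrame `toy` and its column names are made up for illustration. The last column shows why a correlation near 0 only rules out a straight-line relationship.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the movie dataset), used only to illustrate the cases
rng = np.random.default_rng(seed=0)
x = np.arange(100, dtype=float)

toy = pd.DataFrame({
    "x": x,
    "rises_with_x": 2 * x + 5,      # perfectly linear, increasing
    "falls_with_x": -3 * x + 1,     # perfectly linear, decreasing
    "noise": rng.normal(size=100),  # random values unrelated to x
    "u_shape": (x - 49.5) ** 2,     # depends on x, but not linearly
})

# Pearson correlation of every column with x
toy_corr = toy.corr()["x"]
print(toy_corr.round(2))
```

The `u_shape` column is completely determined by `x`, yet its correlation with `x` is essentially zero because the relationship is symmetric rather than linear. A scatter plot is a good way to catch cases like this.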

**Question:** Which feature has the strongest correlation with movie ratings?

**✅ Checkpoint 5 Complete!** You know which features correlate with good ratings!

---

## Checkpoint 6: 🤖 Build a Simple Predictor

Now for the exciting part - let's build a machine learning model!

In [None]:
# Import machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

print("✅ ML libraries imported!")

### Step 1: Prepare the Data

In [None]:
# Select features (X) and target (y)
# We'll use budget, runtime, popularity, and vote_count to predict vote_average

features = ['budget', 'runtime', 'popularity', 'vote_count']
target = 'vote_average'

# Remove rows with missing values
df_clean = df[features + [target]].dropna()

X = df_clean[features]
y = df_clean[target]

print(f"📊 Using {len(X)} movies for training")
print(f"🎯 Features: {features}")
print(f"🎯 Target: {target}")

### Step 2: Split into Training and Testing Sets

In [None]:
# TODO: Split the data into training (80%) and testing (20%) sets
# Hint: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



print(f"🏋️ Training set: {len(X_train)} movies")
print(f"🧪 Testing set: {len(X_test)} movies")

### Step 3: Train the Model

In [None]:
# TODO: Create and train a Linear Regression model
# Hint: 
# model = LinearRegression()
# model.fit(X_train, y_train)



print("✅ Model trained!")

### Step 4: Make Predictions

In [None]:
# TODO: Use the model to make predictions on the test set
# Hint: y_pred = model.predict(X_test)



print("🔮 Predictions made!")
print(f"\nFirst 5 predictions: {y_pred[:5]}")
print(f"Actual values:       {y_test[:5].values}")

**✅ Checkpoint 6 Complete!** You've built your first ML model!

---

## Checkpoint 7: 📈 Evaluate Your Model

How good is our model? Let's find out!

In [None]:
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("📊 Model Performance:")
print(f"   Root Mean Squared Error: {rmse:.2f}")
print(f"   R² Score: {r2:.3f}")
print(f"\n💡 Interpretation:")
print(f"   Our predictions are typically off by about {rmse:.2f} rating points")
print(f"   Our model explains {r2*100:.1f}% of the variance in ratings")
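
To see what these two metrics measure, here is a minimal sketch that computes them by hand. The four ratings are made up for illustration, and the `toy_` names are chosen so they do not overwrite the notebook's own variables. RMSE squares the errors before averaging, so large misses count more than they would in a plain average of absolute errors.

```python
import numpy as np

# Made-up ratings for illustration only (not from the movie dataset)
actual = np.array([6.0, 7.0, 8.0, 5.0])
predicted = np.array([6.5, 6.5, 7.0, 6.0])

errors = actual - predicted            # [-0.5, 0.5, 1.0, -1.0]
toy_mse = np.mean(errors ** 2)         # mean of the squared errors
toy_rmse = np.sqrt(toy_mse)            # square root brings us back to rating points
toy_mae = np.mean(np.abs(errors))      # plain average miss, for comparison

ss_res = np.sum(errors ** 2)                    # variation the predictions missed
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
toy_r2 = 1 - ss_res / ss_tot

print(f"RMSE: {toy_rmse:.2f}")  # RMSE: 0.79
print(f"MAE:  {toy_mae:.2f}")   # MAE:  0.75
print(f"R²:   {toy_r2:.2f}")    # R²:   0.50
```

An R² of 0.50 here means the predictions account for half of the variation in the actual ratings. A model that always predicted the mean rating would score 0.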

In [None]:
# TODO: Visualize predictions vs actual ratings
# Hint: Use plt.scatter(y_test, y_pred)

plt.figure(figsize=(10, 6))

# Your code here:


# Add a diagonal line (perfect predictions would fall on this line)
plt.plot([0, 10], [0, 10], 'r--', label='Perfect Prediction')

plt.xlabel('Actual Rating')
plt.ylabel('Predicted Rating')
plt.title('Predicted vs Actual Ratings')
plt.legend()
plt.show()

In [None]:
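# Inspect the model's coefficients, sorted by absolute size.
# Caution: the features use very different units (budget in dollars,
# runtime in minutes), so raw coefficients are not directly comparable.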
feature_importance = pd.DataFrame({
    'Feature': features,
    'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("📊 Feature Importance:")
print(feature_importance)
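
A raw coefficient depends on the units of its feature, so a feature measured in millions of dollars can get a tiny coefficient even when it matters a lot. One common way to make coefficients comparable is to standardize the features first. The sketch below uses synthetic data (not the movie dataset); `toy_X`, `toy_y`, and the column names are hypothetical, and the target is built so both features matter equally.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: not the movie dataset). The target depends
# equally on both features once each is measured in standard deviations.
rng = np.random.default_rng(seed=1)
toy_X = pd.DataFrame({
    "big_units": rng.normal(50_000_000, 20_000_000, size=500),  # budget-like scale
    "small_units": rng.normal(100, 15, size=500),               # runtime-like scale
})
toy_y = (
    (toy_X["big_units"] - 50_000_000) / 20_000_000
    + (toy_X["small_units"] - 100) / 15
    + rng.normal(0, 0.1, size=500)
)

# Fit once on the raw features
raw_model = LinearRegression().fit(toy_X, toy_y)

# Rescale each feature to mean 0 and standard deviation 1, then refit
scaled_X = StandardScaler().fit_transform(toy_X)
scaled_model = LinearRegression().fit(scaled_X, toy_y)

comparison = pd.DataFrame({
    "Feature": toy_X.columns,
    "Raw coefficient": raw_model.coef_,
    "Standardized coefficient": scaled_model.coef_,
})
print(comparison)
```

The raw coefficients differ by several orders of magnitude, while the standardized ones come out nearly equal, which matches how the data was built. Applying the same idea to the movie features is one possible way to get a fairer ranking.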

**✅ Checkpoint 7 Complete!** You can evaluate your model!

---

## Checkpoint 8: 🎉 Conclusions & Next Steps

Congratulations! You've completed your first Data Science project!

### 📝 Your Findings

Fill in your conclusions:

1. **What factors correlate with good ratings?**
   - _Your answer here..._

2. **Which genre has the best average rating?**
   - _Your answer here..._

3. **How well does our model predict ratings?**
   - _Your answer here..._

4. **What could improve our model?**
   - _Your answer here..._

### 🚀 What You Learned

Today you practiced skills from across the Data Science & AI curriculum:

- ✅ **Python programming** - loops, functions, libraries
- ✅ **Data manipulation** - pandas DataFrames, filtering, grouping
- ✅ **Visualization** - matplotlib, seaborn, charts
- ✅ **Statistics** - correlations, distributions, averages
- ✅ **Machine Learning** - training models, making predictions
- ✅ **Git & GitHub** - version control, collaboration

### 🏆 Bonus Challenges

Want to keep going? Try these:

1. Add more features to the model (like genre as a category)
2. Try a different algorithm (Random Forest, Decision Tree)
3. Find the most underrated movies (low popularity, high rating)
4. Predict revenue instead of rating
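
As a starting point for the first challenge, here is a minimal sketch of turning a text column like `genre` into numbers a model can use. It runs on a tiny made-up table (`toy_movies` is hypothetical), not the real dataset. The technique is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per genre.

```python
import pandas as pd

# Tiny made-up table for illustration; column names mirror the dataset's
toy_movies = pd.DataFrame({
    "runtime": [95, 120, 150, 101],
    "genre": ["Comedy", "Drama", "Action", "Comedy"],
})

# One-hot encode: each genre becomes its own 0/1 column
toy_encoded = pd.get_dummies(toy_movies, columns=["genre"], dtype=int)
print(toy_encoded)
```

The new columns (`genre_Action`, `genre_Comedy`, `genre_Drama`) could then be added to the `features` list. With a linear model you may also want to try `drop_first=True`, which drops one genre column to avoid redundant information.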

In [None]:
# 🎊 CONGRATULATIONS! 🎊
print("="*50)
print("🎉 CONGRATULATIONS! 🎉")
print("="*50)
print("\nYou've completed your first Data Science project!")
print("\nYou now know how to:")
print("  📂 Load and explore datasets")
print("  📊 Create insightful visualizations")
print("  🔗 Analyze correlations")
print("  🤖 Build and evaluate ML models")
print("\nWelcome to the world of Data Science & AI! 🚀")