# Will I Like This Movie?

## A Hands-On Introduction to Data Science & AI

Welcome! In this notebook, you'll analyze movie data and build your first machine learning model.

**Your Mission:** Discover what makes a movie successful and predict movie ratings!

---

## Checkpoint 1: Environment Setup

Let's start by importing the libraries we need. Run the cell below by pressing `Shift + Enter`.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots look nice
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Show plots inline
%matplotlib inline

print("All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

**Checkpoint 1 Complete!** If you see the success message above, you're ready to continue!

---

## Checkpoint 2: Load the Data

Now let's load our movie dataset. We'll use `pandas` to read a CSV file.

In [None]:
# Load the movie dataset
df = pd.read_csv('../data/movies.csv')

print(f"Loaded {len(df)} movies!")

### Take a First Look

Let's see what our data looks like!

In [None]:
# TODO: Display the first 5 rows of the dataset
# Hint: Use df.head()

In [None]:
# TODO: Check the shape of the dataset (rows, columns)
# Hint: Use df.shape

In [None]:
# TODO: List all column names
# Hint: Use df.columns

### Understanding the Columns

| Column | Description |
|--------|-------------|
| `title` | Movie title |
| `release_year` | Year the movie was released |
| `budget` | Production budget (in USD) |
| `revenue` | Box office revenue (in USD) |
| `runtime` | Movie length in minutes |
| `vote_average` | Average user rating (0-10) |
| `vote_count` | Number of votes |
| `popularity` | Popularity score |
| `genre` | Primary genre |

**Checkpoint 2 Complete!** You've loaded the data and explored its structure!

---

## Checkpoint 3: Explore the Data

Let's dig deeper and understand our data better.

In [None]:
# Get summary statistics
df.describe()

In [None]:
# TODO: What is the average movie rating?
# Hint: Use df['vote_average'].mean()
average_rating = ___

print(f"Average movie rating: {average_rating:.2f}")

In [None]:
# TODO: Find the highest-rated movie
# Hint: Use df.loc[df['vote_average'].idxmax()]
best_movie = ___

print(f"Highest rated movie: {best_movie['title']} ({best_movie['vote_average']})")

In [None]:
# TODO: Find the lowest-rated movie
# Hint: Use df.loc[df['vote_average'].idxmin()]
worst_movie = ___

print(f"Lowest rated movie: {worst_movie['title']} ({worst_movie['vote_average']})")

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
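To get a feel for what the missing-value check reports, here is a small sketch on a made-up table. The `toy` DataFrame and its values are invented for illustration and are not taken from `movies.csv`. It counts the gaps in each column and then shows how many rows survive if you drop incomplete ones.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset (values invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [1_000_000, np.nan, 5_000_000, 2_000_000],
    "vote_average": [6.5, 7.1, np.nan, 8.0],
})

# Count missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Keep only the rows where every column has a value
toy_complete = toy.dropna()
print(f"Rows before: {len(toy)}, rows after dropna: {len(toy_complete)}")
```

Dropping rows is the simplest option, but it throws data away. Whether that is acceptable depends on how many rows are affected.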

### Quick Questions

Based on your exploration:

1. What's the average movie rating? ___
2. What's the range of budgets? ___
3. Which genres are represented? ___

**Checkpoint 3 Complete!** You now understand the data!

---

## Checkpoint 4: Visualize Patterns

A picture is worth a thousand data points! Let's create some visualizations.

### Distribution of Ratings

In [None]:
# TODO: Create a histogram of movie ratings
# Hint: Use plt.hist(df['vote_average'], bins=20)

plt.figure(figsize=(10, 6))
# Your code here:

plt.xlabel('Rating')
plt.ylabel('Number of Movies')
plt.title('Distribution of Movie Ratings')
plt.show()

### Budget vs Revenue

In [None]:
# TODO: Create a scatter plot of budget vs revenue
# Hint: Use plt.scatter(df['budget'], df['revenue'])

plt.figure(figsize=(10, 6))
# Your code here:

plt.xlabel('Budget ($)')
plt.ylabel('Revenue ($)')
plt.title('Budget vs Revenue')
plt.show()

### Ratings by Genre

In [None]:
# TODO: Create a bar chart of average rating by genre
# Hint: Use df.groupby('genre')['vote_average'].mean().plot(kind='bar')

plt.figure(figsize=(12, 6))
# Your code here:

plt.xlabel('Genre')
plt.ylabel('Average Rating')
plt.title('Average Rating by Genre')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
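If `groupby` is new to you, this sketch shows what it computes before anything is plotted. The `toy` table is invented for illustration. The call groups rows by genre, averages the ratings inside each group, and sorts the result so the best-rated genre comes first.

```python
import pandas as pd

# Hypothetical mini-dataset (values invented for illustration)
toy = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Comedy", "Action"],
    "vote_average": [8.0, 6.0, 7.0, 7.0, 5.5],
})

# One average per genre, highest first
genre_avg = (
    toy.groupby("genre")["vote_average"]
    .mean()
    .sort_values(ascending=False)
)
print(genre_avg)
```

The result is a Series indexed by genre, which is why calling `.plot(kind='bar')` on it gives one bar per genre.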

### Ratings Over Time

In [None]:
# Average rating per year
plt.figure(figsize=(12, 6))

yearly_avg = df.groupby('release_year')['vote_average'].mean()
yearly_avg.plot(kind='line', marker='o', markersize=3)

plt.xlabel('Year')
plt.ylabel('Average Rating')
plt.title('Average Movie Rating Over Time')
plt.show()

**Checkpoint 4 Complete!** You've created beautiful visualizations!

---

## Checkpoint 5: Find Correlations

Which features are related to higher ratings? Let's find out!

In [None]:
# Select numeric columns for correlation
numeric_cols = ['budget', 'revenue', 'runtime', 'popularity', 'vote_average', 'vote_count']
correlation_matrix = df[numeric_cols].corr()

# Display correlation with vote_average
print("Correlation with ratings (vote_average):")
print(correlation_matrix['vote_average'].sort_values(ascending=False))

In [None]:
# TODO: Create a correlation heatmap
# Hint: Use sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

plt.figure(figsize=(10, 8))
# Your code here:

plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

### Interpreting Correlations

- **Positive correlation (close to 1)**: As one goes up, the other tends to go up
- **Negative correlation (close to -1)**: As one goes up, the other tends to go down
- **No correlation (close to 0)**: No linear relationship

**Question:** Which feature has the strongest correlation with movie ratings?
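The three cases can be checked on a handful of numbers. This is a minimal sketch with made-up values, using NumPy's `np.corrcoef` to compute the Pearson correlation. The third case is the interesting one: the second variable is completely determined by the first, yet the correlation comes out at zero because the relationship is a curve, not a straight line.

```python
import numpy as np

# Made-up values, symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the correlation of the pair
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]   # rises with x
r_negative = np.corrcoef(x, -3 * x)[0, 1]      # falls as x rises
r_curved = np.corrcoef(x, x ** 2)[0, 1]        # U-shaped, not a line

print(f"Rising line:  {r_positive:.2f}")
print(f"Falling line: {r_negative:.2f}")
print(f"U-shape:      {r_curved:.2f}")
```

So a value near 0 in the heatmap is a reason to look at a scatter plot, not proof that two columns are unrelated.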

**Checkpoint 5 Complete!** You know which features are most strongly associated with ratings!

---

## Checkpoint 6: Build a Simple Predictor

Now for the exciting part - let's build a machine learning model!

In [None]:
# Import machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

print("ML libraries imported!")

### Step 1: Prepare the Data

In [None]:
# Select features (X) and target (y)
# We'll use budget, runtime, popularity, and vote_count to predict vote_average
features = ['budget', 'runtime', 'popularity', 'vote_count']
target = 'vote_average'

# Remove rows with missing values
df_clean = df[features + [target]].dropna()

X = df_clean[features]
y = df_clean[target]

print(f"Using {len(X)} movies with complete data")
print(f"Features: {features}")
print(f"Target: {target}")

### Step 2: Split into Training and Testing Sets

In [None]:
# TODO: Split the data into training (80%) and testing (20%) sets
# Hint: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {len(X_train)} movies")
print(f"Testing set: {len(X_test)} movies")
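Here is a sketch of what the split does, run on ten made-up rows so the sizes are easy to check by eye. The names `toy_X`, `toy_y`, `X_tr`, `X_te`, `y_tr`, and `y_te` are chosen for this example only, so they won't overwrite your notebook variables. With `test_size=0.2`, two of the ten rows are held out for testing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical movies (values invented for illustration)
toy_X = pd.DataFrame({
    "runtime": np.arange(90, 100),
    "popularity": np.arange(10),
})
toy_y = pd.Series(np.linspace(5.0, 9.5, 10), name="vote_average")

# Hold out 20% of the rows; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print(f"Training rows: {len(X_tr)}, testing rows: {len(X_te)}")
print(f"Held-out row labels: {sorted(X_te.index)}")
```

Features and target are shuffled together, so each row in `X_tr` still lines up with its rating in `y_tr`. The model never sees the held-out rows during training, which is what makes the later evaluation honest.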

### Step 3: Train the Model

In [None]:
# TODO: Create and train a Linear Regression model
# Hint:
# model = LinearRegression()
# model.fit(X_train, y_train)

print("Model trained!")

### Step 4: Make Predictions

In [None]:
# TODO: Use the model to make predictions on the test set
# Hint: y_pred = model.predict(X_test)

print("Predictions made!")
print(f"\nFirst 5 predictions: {y_pred[:5]}")
print(f"Actual values:       {y_test[:5].values}")

**Checkpoint 6 Complete!** You've built your first ML model!

---

## Checkpoint 7: Evaluate Your Model

How good is our model? Let's find out!

In [None]:
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
print(f"   Root Mean Squared Error: {rmse:.2f}")
print(f"   R² Score: {r2:.3f}")

print("\nInterpretation:")
print(f"   Our predictions are typically off by about {rmse:.2f} rating points")
print(f"   Our model explains {r2*100:.1f}% of the variance in ratings")
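To see where these two numbers come from, here is a sketch that computes them by hand with NumPy on three made-up ratings. The variable names are chosen for this example only. RMSE squares the errors before averaging, so large misses count more than they would in a plain average of absolute errors (shown as `mae_manual` for comparison). R² compares the model's squared error with that of always guessing the mean rating.

```python
import numpy as np

# Three hypothetical movies: true ratings and a model's guesses
actual = np.array([6.0, 7.0, 8.0])
predicted = np.array([6.5, 7.0, 7.5])

errors = actual - predicted

# Root mean squared error: square, average, then take the square root
mse_manual = np.mean(errors ** 2)
rmse_manual = np.sqrt(mse_manual)

# Mean absolute error, for comparison with RMSE
mae_manual = np.mean(np.abs(errors))

# R²: 1 minus (model's squared error / squared error of guessing the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.3f}")
print(f"MAE:  {mae_manual:.3f}")
print(f"R²:   {r2_manual:.3f}")
```

An R² of 1 means perfect predictions, 0 means no better than guessing the mean, and a negative value means worse than that.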

In [None]:
# TODO: Visualize predictions vs actual ratings
# Hint: Use plt.scatter(y_test, y_pred)

plt.figure(figsize=(10, 6))
# Your code here:

# Add a diagonal line (perfect predictions would fall on this line)
plt.plot([0, 10], [0, 10], 'r--', label='Perfect Prediction')

plt.xlabel('Actual Rating')
plt.ylabel('Predicted Rating')
plt.title('Predicted vs Actual Ratings')
plt.legend()
plt.show()

In [None]:
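# See which features are most important
# Note: each coefficient is in its feature's own units (dollars, minutes, votes),
# so coefficient sizes are not directly comparable across features.
feature_importance = pd.DataFrame({
    'Feature': features,
    'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("Feature Importance:")
print(feature_importance)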
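Raw regression coefficients depend on the units of each feature, so a budget measured in dollars will usually get a tiny coefficient even when it matters a lot. One common way to compare features fairly is to standardize them first. The sketch below is one possible approach, not part of the original exercise: it uses scikit-learn's `StandardScaler` on synthetic data generated from a made-up rule, and all names (`toy_X`, `toy_y`, `raw_model`, `scaled_model`) are chosen for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features in very different units (values invented for illustration)
toy_X = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),   # dollars
    "runtime": rng.uniform(80, 180, n),   # minutes
})

# Made-up rule for the rating, plus a little noise
toy_y = (
    5.0
    + 1e-8 * toy_X["budget"]
    + 0.005 * toy_X["runtime"]
    + rng.normal(0, 0.1, n)
)

# Fit once on raw units, once on standardized features (mean 0, std 1)
raw_model = LinearRegression().fit(toy_X, toy_y)
scaled_X = StandardScaler().fit_transform(toy_X)
scaled_model = LinearRegression().fit(scaled_X, toy_y)

comparison = pd.DataFrame({
    "Feature": toy_X.columns,
    "Raw coefficient": raw_model.coef_,
    "Standardized coefficient": scaled_model.coef_,
})
print(comparison)
```

In this synthetic setup the raw coefficients make runtime look far more important than budget, while the standardized ones reverse the order. A standardized coefficient reads as "rating points per one standard deviation of the feature", which puts all features on the same footing.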

**Checkpoint 7 Complete!** You can evaluate your model!

---

## Checkpoint 8: Conclusions & Next Steps

Congratulations! You've completed your first Data Science project!

### Your Findings

Fill in your conclusions:

1. **What factors correlate with good ratings?**
   - _Your answer here..._
2. **Which genre has the best average rating?**
   - _Your answer here..._
3. **How well does our model predict ratings?**
   - _Your answer here..._
4. **What could improve our model?**
   - _Your answer here..._

### What You Learned

Today you practiced skills from across the Data Science & AI curriculum:

- **Python programming** - loops, functions, libraries
- **Data manipulation** - pandas DataFrames, filtering, grouping
- **Visualization** - matplotlib, seaborn, charts
- **Statistics** - correlations, distributions, averages
- **Machine Learning** - training models, making predictions
- **Git & GitHub** - version control, collaboration

### Bonus Challenges

Want to keep going? Try these:

1. Add more features to the model (like genre as a category)
2. Try a different algorithm (Random Forest, Decision Tree)
3. Find the most underrated movies (low popularity, high rating)
4. Predict revenue instead of rating
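As a starting point for the first challenge: a linear model needs numbers, so a text column like `genre` has to be converted first. One common option is one-hot encoding with `pd.get_dummies`, sketched here on a made-up table (the `toy` data is invented for illustration). Each genre becomes its own 0/1 column.

```python
import pandas as pd

# Hypothetical mini-dataset (values invented for illustration)
toy = pd.DataFrame({
    "runtime": [95, 120, 140],
    "genre": ["Comedy", "Drama", "Action"],
})

# Replace the text column with one 0/1 column per genre
encoded = pd.get_dummies(toy, columns=["genre"], dtype=int)
print(encoded)
```

The new `genre_...` columns can be added to the feature list alongside the numeric ones. With a plain linear regression, you may want to pass `drop_first=True` so the genre columns are not perfectly redundant with each other.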

In [None]:
# CONGRATULATIONS!
print("=" * 50)
print("CONGRATULATIONS!")
print("=" * 50)

print("\nYou've completed your first Data Science project!")
print("\nYou now know how to:")
print("   Load and explore datasets")
print("   Create insightful visualizations")
print("   Analyze correlations")
print("   Build and evaluate ML models")
print("\nWelcome to the world of Data Science & AI!")