# 🪄 Harry Potter Fan Fiction Logistic Regression Analysis

## Learning Objectives

In this notebook, you'll learn to:
- Build logistic regression models with **interaction terms** (capturing how variables work together)
- Create **polynomial terms** (quadratic) to model non-linear relationships
- Compare **StatsModels** and **scikit-learn** approaches to logistic regression
- Interpret model coefficients and assess model performance
- Understand how complex relationships in data can be captured through feature engineering

## Dataset Overview

We're working with a cleaned subset of Harry Potter fan fiction data from [fanfiction.net](https://www.fanfiction.net/book/Harry-Potter/). The dataset contains:

**Quantitative Variables:**
- `words` - Number of words in the story
- `reviews` - Number of reviews the story received
- `favorites` - Number of readers who favorited the story
- `follows` - Number of readers who follow the story

**Binary Categorical Variables (1 = True, 0 = False):**
- `harry` - Harry is a character in the story
- `hermione` - Hermione is a character in the story
- `multiple` - The story has multiple chapters
- `english` - The story is in English
- `humor` - The story's genre is humor

**Goal:** Predict whether a fan fiction story is "popular" based on its characteristics using logistic regression with interaction and polynomial terms.


## Phase 1: Setup and Imports

First, let's import all the libraries we'll need for this analysis.


In [1]:
# TODO: Import all necessary libraries
# Hint: You'll need pandas for data manipulation, numpy for numerical operations,
# matplotlib and seaborn for visualization, statsmodels for logistic regression,
# and sklearn for additional modeling and evaluation tools

# Write your import statements here:


## Phase 2: Data Loading and Initial Exploration

Understanding your data is the foundation of any good analysis. Let's start by loading the dataset and getting familiar with its structure.


In [2]:
# TODO: Load the Harry Potter fan fiction dataset
# Hint: Use pandas' read_csv() function to load 'hp.csv' from the data directory
# Store the result in a variable called 'hp'

# Write your code here:


In [3]:
# TODO: Display the first few rows of the dataset
# Hint: Use the .head() method to see the first 5 rows

# Write your code here:


In [4]:
# TODO: Check the shape and basic info about the dataset
# Hint: Use .shape to see dimensions and .info() to see data types

# Write your code here:


In [5]:
# TODO: Check for missing values
# Hint: Use .isnull().sum() to count missing values in each column

# Write your code here:


In [6]:
# TODO: Generate descriptive statistics for all variables
# Hint: Use .describe() to get summary statistics like mean, median, std, etc.

# Write your code here:


## Phase 3: Exploratory Data Analysis (EDA)

Exploratory Data Analysis is crucial before building any model. It helps us understand patterns, relationships, and potential issues in our data. This understanding will guide our feature engineering decisions.


In [7]:
# TODO: Create distribution plots for quantitative variables
# Hint: Use seaborn's histplot() or displot() to visualize distributions
# Plot distributions for: words, reviews, favorites, follows
# Consider using subplots to show multiple distributions at once

# Write your code here:


In [8]:
# TODO: Create count plots for binary categorical variables
# Hint: Use seaborn's countplot() to show the distribution of 0s and 1s
# Plot for: harry, hermione, multiple, english, humor

# Write your code here:


In [9]:
# TODO: Create scatter plots to explore relationships between quantitative variables
# Hint: Use seaborn's scatterplot() or pairplot()
# Focus on key relationships like: words vs reviews, favorites vs follows, words vs favorites
# Look for potential non-linear patterns that might benefit from polynomial terms

# Write your code here:


In [10]:
# TODO: Generate a correlation heatmap for quantitative variables
# Hint: Use pandas' .corr() method to calculate correlations, then seaborn's heatmap() to visualize
# Consider adding annotations to show correlation values

# Write your code here:


### Reflection Questions:
1. What patterns do you notice in the distributions? Are there any variables with skewed distributions?
2. Which quantitative variables seem most correlated with each other?
3. Are there any relationships that look non-linear (curved rather than straight)?
4. How balanced are the binary variables? Are some characters more common than others?


## Phase 4: Target Variable Creation

Logistic regression requires a binary outcome variable. Since we want to predict story popularity, we need to define what makes a story "popular" based on the engagement metrics available (favorites, follows, reviews).

**Think about it:** What would make a fan fiction story popular? High number of favorites? Many followers? Lots of reviews? A combination of these?
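
As one possible starting point, the sketch below builds a binary target from a median threshold on `favorites`. The toy DataFrame and its values are invented for illustration (they are not from `hp.csv`), and the median cutoff is an assumption; a different metric, percentile, or composite score may suit the real data better.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "favorites": [0, 2, 5, 9, 40, 120],
    "follows": [1, 0, 3, 7, 25, 90],
})

# Assumed rule: a story counts as "popular" if its favorites exceed the median.
threshold = toy["favorites"].median()
toy["is_popular"] = np.where(toy["favorites"] > threshold, 1, 0)

print(threshold)                    # 7.0
print(toy["is_popular"].tolist())   # [0, 0, 0, 1, 1, 1]
```

A median split gives roughly balanced classes by construction, which is convenient for a first model but is only one of several defensible definitions of popularity.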


In [11]:
# TODO: Explore the distribution of engagement metrics
# Hint: Look at the distribution of favorites, follows, and reviews
# Consider using box plots or violin plots to understand the spread
# This will help you decide on a reasonable threshold for "popularity"

# Write your code here:


In [12]:
# TODO: Create a binary target variable for story popularity
# Hint: You can use a threshold approach (e.g., stories above median favorites)
# Or create a composite score combining multiple metrics
# Name your target variable 'is_popular' (1 = popular, 0 = not popular)
# Consider using np.where() or boolean indexing

# Write your code here:


In [13]:
# TODO: Check the distribution of your target variable
# Hint: Use value_counts() to see how many stories are popular vs not popular
# Also create a count plot to visualize the balance
# Note: Perfect balance isn't required, but extreme imbalance (e.g., 95% in one class) can be problematic

# Write your code here:


## Phase 5: Feature Engineering - Interaction Terms

**What are interaction terms?** They capture how the effect of one variable depends on the value of another variable. For example:
- `harry × words`: Does the relationship between story length and popularity differ for Harry stories vs non-Harry stories?
- `words × reviews`: Does the effect of story length on popularity depend on how many reviews the story has?

**Why do we need them?** Real-world relationships are often more complex than simple linear effects. Interaction terms help us model these complex relationships.
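
To make the idea concrete, here is a minimal sketch of how an interaction column is built by elementwise multiplication. The rows are made up; only the column names follow the dataset description above.

```python
import pandas as pd

# Toy rows (invented values); column names mirror the dataset description.
toy = pd.DataFrame({
    "harry": [1, 0, 1, 0],
    "words": [1000, 2000, 3000, 4000],
    "reviews": [3, 10, 0, 25],
})

# Binary x quantitative: equals `words` for Harry stories and 0 otherwise.
toy["harry_words"] = toy["harry"] * toy["words"]

# Quantitative x quantitative: product of the two raw values.
toy["words_reviews"] = toy["words"] * toy["reviews"]

print(toy)
```

Because `harry` is 0 or 1, the `harry_words` column simply "switches on" the word count for Harry stories, which is what lets the model estimate a separate slope for that group.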


In [14]:
# TODO: Create interaction terms between binary and quantitative variables
# Hint: Multiply each binary variable (harry, hermione, multiple, english, humor) 
# with each quantitative variable (words, reviews, favorites, follows)
# Use simple multiplication: binary_var * quantitative_var
# Name them descriptively like 'harry_words', 'hermione_reviews', etc.

# Write your code here:


In [15]:
# TODO: Create interaction terms between quantitative variables
# Hint: Multiply each pair of quantitative variables
# Examples: words*reviews, words*favorites, reviews*follows, etc.
# Be careful not to create redundant terms (words*reviews is the same as reviews*words)

# Write your code here:


In [16]:
# TODO: Create interaction terms between binary variables
# Hint: Multiply each pair of binary variables
# Examples: harry*hermione, harry*multiple, hermione*english, etc.
# These capture whether certain combinations of characteristics are more popular

# Write your code here:


### Understanding Interaction Terms:
**Binary × Quantitative:** If `harry_words` has a positive coefficient, it means Harry stories benefit more from being longer than non-Harry stories.

**Quantitative × Quantitative:** If `words_reviews` has a positive coefficient, it means the effect of word count on popularity is stronger when there are more reviews.

**Binary × Binary:** If `harry_hermione` has a positive coefficient, it means stories with both Harry and Hermione are more popular than you'd expect from just adding their individual effects.
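
The binary × quantitative case can be read directly off the coefficients. The sketch below uses hypothetical coefficient values (not estimates from this dataset) to show how the slope for `words` differs between the two groups.

```python
# Hypothetical coefficients on the log-odds scale, chosen only for illustration.
b_words = 0.0002        # slope for words when harry = 0
b_harry_words = 0.0003  # extra slope for words when harry = 1

def words_slope(harry: int) -> float:
    """Change in log-odds of popularity per additional word, given `harry`."""
    return b_words + b_harry_words * harry

print(words_slope(0))  # slope for non-Harry stories
print(words_slope(1))  # slope for Harry stories (main effect + interaction)
```

The main effect of `harry` (not shown) shifts the intercept for Harry stories; the interaction coefficient is what changes the slope.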


## Phase 6: Feature Engineering - Polynomial Terms

**What are polynomial terms?** They capture non-linear relationships. A quadratic (squared) term can model relationships that curve upward or downward.

**Why do we need them?** Sometimes the relationship between a variable and the outcome isn't straight-line. For example:
- Very short stories might be unpopular (too brief)
- Very long stories might also be unpopular (too long to read)
- Medium-length stories might be most popular (sweet spot)

This creates a curved (quadratic) relationship that linear terms alone can't capture.
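
The "sweet spot" idea can be sketched numerically. The coefficients below are hypothetical, and measuring length in thousands of words is an assumption made to keep the numbers readable; a negative coefficient on the squared term is what produces the rise-then-fall shape.

```python
import numpy as np

# Hypothetical model: log-odds = b0 + b1 * x + b2 * x**2, with x = words in thousands.
b0, b1, b2 = -2.0, 0.8, -0.04

words_k = np.array([1.0, 10.0, 20.0])           # short, medium, long stories
log_odds = b0 + b1 * words_k + b2 * words_k ** 2
prob = 1.0 / (1.0 + np.exp(-log_odds))          # convert log-odds to probability

# A quadratic with b2 < 0 peaks where its derivative b1 + 2 * b2 * x equals zero.
peak = -b1 / (2 * b2)

print(prob)
print(peak)
```

With these made-up values the predicted probability is highest for the medium-length story and lower at both extremes, a pattern a model with only a linear `words` term cannot represent.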


In [17]:
# TODO: Create quadratic (squared) terms for quantitative variables
# Hint: Square each quantitative variable using ** 2 or np.square()
# Create: words_squared, reviews_squared, favorites_squared, follows_squared
# These will help capture non-linear relationships

# Write your code here:


In [18]:
# TODO: (Optional) Visualize original vs squared relationships
# Hint: Create scatter plots showing the relationship between original variables and target
# Then show the same for squared variables
# This can help you see if the quadratic terms are capturing important patterns

# Write your code here:


## Phase 7: Model Building with StatsModels

StatsModels provides detailed statistical output that's great for understanding your model. It gives you p-values, confidence intervals, and detailed diagnostics that help with interpretation.

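**Key difference:** With the array-based API (`sm.Logit`), StatsModels does not add an intercept for you; you must add a constant column to the feature matrix yourself. (The formula API, `smf.logit`, includes an intercept automatically.)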


In [19]:
# TODO: Prepare your feature matrix and target variable
# Hint: Create a DataFrame or array containing all your features:
# - Original quantitative variables (words, reviews, favorites, follows)
# - Original binary variables (harry, hermione, multiple, english, humor)
# - All interaction terms you created
# - All polynomial terms you created
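# Make sure to exclude the target variable from your features!
# Also drop any metric you used to define 'is_popular' (e.g., favorites if you thresholded on it),
# along with interaction and squared terms built from it; otherwise the model can reproduce the label almost perfectly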

# Write your code here:


In [20]:
# TODO: Add constant term to your feature matrix
# Hint: Use sm.add_constant() to add a column of 1s for the intercept
# This is required for StatsModels logistic regression

# Write your code here:


In [21]:
# TODO: Fit the logistic regression model using StatsModels
# Hint: Use sm.Logit() with your target variable and feature matrix
# Then call .fit() to train the model
# Store the result in a variable like 'model_stats'

# Write your code here:


In [22]:
# TODO: Display the model summary
# Hint: Use .summary() method on your fitted model
# This will show coefficients, p-values, confidence intervals, and model statistics

# Write your code here:


### Interpreting StatsModels Output:
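1. **Coefficients:** Show the change in log-odds for a 1-unit increase in the predictor, holding the other predictors constant. For a variable that also appears in an interaction or squared term, the main-effect coefficient describes the effect only where the partner variable (or the variable itself, for a squared term) equals zero
2. **P-values:** Indicate statistical significance (a common threshold is p < 0.05)
3. **Pseudo R-squared:** Measures model fit (higher is better, but don't expect values as high as in linear regression)
4. **Log-likelihood:** Used to compare models fit to the same data (higher, meaning closer to zero, is better)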

**Key insight:** Positive coefficients increase the probability of the outcome, negative coefficients decrease it.
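
Log-odds are hard to reason about directly, so coefficients are often exponentiated into odds ratios. The sketch below uses hypothetical coefficient values purely to illustrate the conversion.

```python
import numpy as np

# Hypothetical fitted coefficients on the log-odds scale (illustrative only).
coefs = {"humor": 0.40, "multiple": -0.25, "english": 0.0}

# exp(coefficient) is the multiplicative change in the odds per 1-unit increase.
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: coefficient={coefs[name]:+.2f}, odds ratio={ratio:.2f}")
# humor: coefficient=+0.40, odds ratio=1.49
# multiple: coefficient=-0.25, odds ratio=0.78
# english: coefficient=+0.00, odds ratio=1.00
```

An odds ratio above 1 corresponds to a positive coefficient, below 1 to a negative coefficient, and exactly 1 to no effect.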


## Phase 8: Model Building with scikit-learn

Scikit-learn is more focused on prediction and machine learning workflows. It's often easier to use for model evaluation and comparison, but provides less detailed statistical output.

**Key differences from StatsModels:**
- Automatically handles the intercept (no need to add constant)
- Built-in train/test split and cross-validation tools
- More machine learning focused (prediction vs. inference)
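
The workflow these differences point to can be sketched end to end on synthetic data. Everything below (the feature matrix, the true coefficients, the 25% test size) is an assumption for illustration, not the notebook's intended solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the engineered feature matrix and target.
X = rng.normal(size=(400, 3))
log_odds = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# Hold out a test set; stratify keeps the class balance similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# No constant column needed: fit_intercept=True is the default.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard 0/1 labels
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of class 1

print(accuracy_score(y_test, y_pred))
print(clf.intercept_, clf.coef_)
```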


In [23]:
# TODO: Split your data into training and testing sets
# Hint: Use train_test_split() from sklearn.model_selection
# Use a reasonable test size (e.g., 0.2 or 0.3)
# Set random_state for reproducibility
# Remember: you need to split both features and target

# Write your code here:


In [24]:
# TODO: Fit a logistic regression model using scikit-learn
# Hint: Import LogisticRegression from sklearn.linear_model
# Create a model instance and fit it on your training data
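# Note: You don't need to add a constant term - sklearn fits an intercept by default
# (if you reuse the Phase 7 matrix, drop its constant column first)
# If you see a ConvergenceWarning, standardize the features or increase max_iter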

# Write your code here:


In [25]:
# TODO: Make predictions on the test set
# Hint: Use .predict() for binary predictions (0 or 1)
# Use .predict_proba() to get probability scores
# Store both types of predictions

# Write your code here:


In [26]:
# TODO: Evaluate your model performance
# Hint: Import evaluation metrics from sklearn.metrics
# Calculate: accuracy_score, confusion_matrix, classification_report
# Also consider: roc_auc_score, precision_score, recall_score, f1_score

# Write your code here:


### Comparing StatsModels vs scikit-learn:
1. **StatsModels:** Better for understanding relationships, hypothesis testing, detailed statistics
2. **scikit-learn:** Better for prediction, model comparison, machine learning workflows
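3. **Coefficients:** Will match only approximately. scikit-learn's `LogisticRegression` applies L2 regularization by default (`C=1.0`), which shrinks coefficients relative to StatsModels' unpenalized maximum-likelihood fit; the estimates also differ if one model is fit on the full data and the other on the training split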
4. **Use both:** StatsModels for interpretation, scikit-learn for evaluation and comparison
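
One reason the two libraries can disagree on coefficients is scikit-learn's default L2 penalty. The sketch below, on synthetic data with assumed true coefficients, compares the default fit with a very weakly penalized one (a large `C`), which behaves much like an unpenalized maximum-likelihood fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic data with assumed true coefficients (2.0, -1.0).
X = rng.normal(size=(150, 2))
log_odds = 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# Default settings: L2 penalty with C=1.0.
default_fit = LogisticRegression(max_iter=5000).fit(X, y)

# Large C means a very weak penalty, approximating an unpenalized fit.
weak_penalty_fit = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)

norm_default = np.linalg.norm(default_fit.coef_)
norm_weak = np.linalg.norm(weak_penalty_fit.coef_)

print(default_fit.coef_, norm_default)
print(weak_penalty_fit.coef_, norm_weak)
```

The default fit should show coefficients pulled toward zero relative to the weakly penalized one. When comparing against StatsModels, a large `C` is a simple way to put the two on a similar footing.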


## Phase 9: Model Comparison and Interpretation

Now let's compare different models to understand the value of interaction and polynomial terms. We'll build a baseline model with only original features and compare it to our full model.


In [27]:
# TODO: Build a baseline model with only original features (no interactions or polynomials)
# Hint: Use only the original quantitative and binary variables
# Fit this model using both StatsModels and scikit-learn
# This will help you see the improvement from adding interaction and polynomial terms

# Write your code here:


In [28]:
# TODO: Compare model performance metrics
# Hint: Calculate the same evaluation metrics for both baseline and full models
# Compare: accuracy, AUC, precision, recall, F1-score
# Create a comparison table or summary

# Write your code here:


In [29]:
# TODO: Identify the most significant interaction and polynomial terms
# Hint: Look at the coefficients and p-values from your StatsModels output
# Find terms with small p-values; compare coefficient sizes only between features measured on the same scale
# Consider creating a summary of the top 5-10 most important terms

# Write your code here:


### Key Questions to Consider:
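1. **Which features matter most?** Compare coefficient magnitudes only after putting predictors on a common scale (e.g., standardizing), since a raw coefficient depends on the variable's units; otherwise rely on z-statistics and p-values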
2. **Are interactions helpful?** Compare baseline vs full model performance
3. **Do polynomial terms capture important patterns?** Check if squared terms are significant
4. **What makes a story popular?** Interpret the most significant coefficients in real-world terms

**Example interpretation:** If `harry_words` has a positive coefficient, each additional word raises the log-odds of popularity more for Harry stories than for non-Harry stories.
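
When judging which features matter most, keep in mind that a coefficient's size depends on the units of its predictor. The sketch below uses a hypothetical coefficient for `words` to show that changing units rescales the coefficient without changing the model.

```python
import numpy as np

# Hypothetical coefficient for `words` measured in single words.
beta_per_word = 0.0004
words = np.array([500.0, 5000.0, 50000.0])

# Same predictor expressed in thousands of words: the coefficient scales by 1000.
words_k = words / 1000.0
beta_per_thousand = beta_per_word * 1000.0

# The contribution to the log-odds is identical under both unit choices.
contribution_raw = beta_per_word * words
contribution_k = beta_per_thousand * words_k

print(np.allclose(contribution_raw, contribution_k))  # True
```

A tiny raw coefficient on `words` can therefore represent a larger effect than a big coefficient on a 0/1 variable, which is why standardizing before ranking coefficients is a common precaution.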


## Phase 10: Visualization of Results

Visualizing your model results helps you understand performance and communicate findings effectively. Good visualizations can reveal patterns that numbers alone might miss.
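
One of the plots in this phase is the ROC curve, summarized by its AUC. As a way to build intuition, the sketch below computes AUC from its pairwise definition: the share of (popular, not popular) pairs in which the popular story gets the higher score. The helper `pairwise_auc` is a hypothetical teaching function; in practice `roc_auc_score` from `sklearn.metrics` does this for you.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

y_true = [0, 0, 1, 1]
print(pairwise_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(pairwise_auc(y_true, [0.1, 0.2, 0.8, 0.9]))   # 1.0
print(pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5]))   # 0.5
```

The last two calls show the reference points: perfectly separated scores give 1.0, and scores that carry no information give 0.5.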


In [30]:
# TODO: Create ROC curve and calculate AUC
# Hint: Use roc_curve() and auc() from sklearn.metrics
# Plot the ROC curve using matplotlib
# The AUC (Area Under Curve) measures how well your model separates the classes
# AUC = 0.5 is random, AUC = 1.0 is perfect

# Write your code here:


In [31]:
# TODO: Visualize predicted probabilities distribution
# Hint: Create histograms of predicted probabilities for each class (popular vs not popular)
# Use your model's .predict_proba() output
# This shows how confident your model is in its predictions

# Write your code here:


In [32]:
# TODO: Create a feature importance visualization
# Hint: Plot the absolute values of the coefficients, ideally from a model fit on standardized features so the magnitudes are comparable across variables
# Focus on the top 10-15 most important features
# Use a horizontal bar plot for easy reading
# This helps identify which features drive predictions most

# Write your code here:


In [33]:
# TODO: Create a confusion matrix heatmap
# Hint: Use seaborn's heatmap() to visualize your confusion matrix
# Add annotations to show the actual numbers
# This helps you see where your model makes mistakes

# Write your code here:


## Phase 11: Conclusion and Reflection

Congratulations! You've completed a comprehensive logistic regression analysis with interaction and polynomial terms. Let's reflect on what you've learned and discovered.


### Reflection Questions:

1. **Model Performance:** How well did your model perform? What was your final accuracy and AUC?

2. **Feature Importance:** Which features were most important for predicting story popularity? Did this match your initial expectations?

3. **Interaction Terms:** Which interaction terms were most significant? What do they tell you about how different story characteristics work together?

4. **Polynomial Terms:** Were any quadratic terms significant? What non-linear relationships did they capture?

5. **Model Comparison:** How much did interaction and polynomial terms improve your model compared to the baseline?

6. **Real-world Insights:** What insights can you draw about what makes Harry Potter fan fiction popular? Are there any surprising findings?

7. **Limitations:** What are the limitations of your analysis? What other factors might affect story popularity that aren't in your dataset?

8. **Next Steps:** If you were to continue this analysis, what would you do next? Consider:
   - Regularization (L1/L2) to handle overfitting
   - Feature selection to identify the most important predictors
   - Cross-validation for more robust performance estimates
   - Other algorithms (Random Forest, XGBoost) for comparison
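
Two of these next steps, regularization and cross-validation, can be combined in a few lines. The sketch below is one possible setup on synthetic data; the penalty strength `C=0.5`, the 5 folds, and the ROC AUC scoring are all assumptions to adjust for the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic data: only the first two of four features carry signal.
X = rng.normal(size=(300, 4))
log_odds = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# Scaling inside the pipeline means each fold is standardized using only its training part.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.5, max_iter=1000),  # smaller C = stronger L2 penalty
)

scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(scores)
print(scores.mean())
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the preprocessing step.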


### Key Learning Takeaways:

✅ **Interaction Terms:** You learned how to create and interpret interaction terms that capture how variables work together

✅ **Polynomial Terms:** You discovered how quadratic terms can capture non-linear relationships

✅ **StatsModels vs scikit-learn:** You compared two different approaches to logistic regression and understood their strengths

✅ **Model Interpretation:** You learned to interpret coefficients, p-values, and model performance metrics

✅ **Feature Engineering:** You practiced creating new features from existing ones to improve model performance

✅ **Real-world Application:** You applied these techniques to a realistic dataset with complex relationships


### Congratulations! 🎉

You've successfully completed a comprehensive logistic regression analysis with advanced feature engineering techniques. These skills will serve you well in future machine learning projects where you need to model complex relationships in your data.

**Remember:** The goal isn't just to build a model, but to understand the relationships in your data and extract meaningful insights that can inform real-world decisions.
