# Process 4

## Quiz
1. **In a linear regression model, the coefficient of an independent variable represents:**
   - A) The variance in the dependent variable explained by the independent variable
   - B) The average change in the dependent variable for a one-unit increase in the independent variable, holding other variables constant
   - C) The intercept of the regression line, holding other variables constant
   - D) The average prediction accuracy of the model
   - **Your Answer:**  B

2. **Which of the following best describes R-squared in the context of a linear regression model?**
   - A) It shows the proportion of unexplained variation in the dependent variable after fitting the model.
   - B) It shows the proportion of variation in the dependent variable that is explained by the independent variables.
   - C) It quantifies the strength and direction of the linear relationship between the dependent and independent variables.
   - D) It measures the difference between predicted and actual values for each data point.
   - **Your Answer:**  B

3. **In multiple regression, which of the following is correct?**
   - A) Each predictor’s coefficient indicates the correlation with the dependent variable, assuming all other predictors are zero.
   - B) The model only uses one predictor at a time to ensure independence.
   - C) All predictor variables must be standardised before inclusion in the model.
   - D) Each predictor’s coefficient indicates its association with the dependent variable while holding other predictors constant.
   - **Your Answer:**  D

4. **Which of the following is *not* an assumption of multiple linear regression?**
   - A) The residuals are normally distributed.
   - B) There is no multicollinearity among independent variables.
   - C) The dependent variable must be binary.
   - D) The relationship between independent and dependent variables is linear.
   - **Your Answer:** C

5. **When comparing multiple models using AIC (Akaike Information Criterion), the model with the lowest AIC value:**
   - A) is generally preferred because it has fewer predictors and a higher R-squared.
   - B) is generally preferred because it's a better balance between model fit and model complexity.
   - C) is generally discarded because it minimises the residual sum of squares (RSS) without considering the number of predictors.
   - D) is generally discarded because it indicates overfitting.
   - **Your Answer:**  B

## Code

**Dataset Description**
The dataset contains information about music album sales and three related factors that could influence sales:

- **adverts**: the advertising budget spent on promoting the album (in thousands of dollars),
- **sales**: the number of album copies sold (in thousands),
- **airplay**: the number of times songs from the album were played on BBC Radio 1,
- **attract**: a rating of how attractive the artist(s) are perceived to be by a random sample of their fans (on a scale from 1 to 10).

In [29]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def analyze_regression(data, independent_vars, dependent_var):
    """Fit an ordinary least squares model and return its coefficients, intercept, and R-squared."""
    # Accept the dependent variable as either a string or a one-element list
    if isinstance(dependent_var, list):
        dependent_var = dependent_var[0]

    X = data[independent_vars]
    y = data[dependent_var]

    model = LinearRegression()
    model.fit(X, y)

    # R-squared is computed on the same data used for fitting (in-sample fit)
    y_pred = model.predict(X)
    r_squared = r2_score(y, y_pred)

    # Convert numpy types to Python native types
    results = {
        'coefficients': {var: float(coef) for var, coef in zip(independent_vars, model.coef_)},
        'intercept': float(model.intercept_),
        'r_squared': float(r_squared)
    }

    return results

def print_regression_results(results):
    """Print the intercept, coefficients, and R-squared returned by analyze_regression."""
    print("\nRegression Analysis Results:")
    print(f"Intercept: {results['intercept']:.4f}")
    print("\nCoefficients:")
    for var, coef in results['coefficients'].items():
        print(f"  {var}: {coef:.4f}")
    print(f"\nR-squared: {results['r_squared']:.4f}")
    print("\n")

> ### 1. **Simple Regression on Advertising Budget and Sales**
>   Fit a simple linear regression model using `adverts` as the independent variable and `sales` as the dependent variable. Comment on the model's coefficient and intercept.


In [20]:
# your code here
df = pd.read_csv('/Users/olisa/Lis/data_sci/week7/Album Sales 2.txt', delimiter='\t')
df.head()

| Unnamed: 0 | adverts | sales | airplay | attract |
|---|---|---|---|---|
| 0 | 10.256 | 330 | 43 | 10 |
| 1 | 985.685 | 120 | 28 | 7 |
| 2 | 1445.563 | 360 | 35 | 7 |
| 3 | 1188.193 | 270 | 33 | 7 |
| 4 | 574.513 | 220 | 44 | 5 |


In [24]:
results = analyze_regression(df, ['adverts'], ['sales'])
print_regression_results(results)


Regression Analysis Results:
Intercept: 134.1399

Coefficients:
  adverts: 0.0961

R-squared: 0.3346




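The model explains 33.46% of the variation in sales (R² = 0.3346).

The coefficient of 0.0961 means that each one-unit increase in `adverts` (1,000 dollars of advertising) is associated with an average increase of 0.0961 in `sales`. Since sales are recorded in thousands, that is about 96 additional copies.

The intercept of 134.1399 is the predicted value of `sales` when `adverts` is zero: roughly 134,000 copies with no advertising spend.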
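
The fitted line can be written as sales = 134.1399 + 0.0961 × adverts. As a minimal sketch of how the intercept and slope combine, the snippet below plugs a few budgets into that equation. The function name `predict_sales_from_adverts` and the three budget values are hypothetical, chosen only for illustration; the two constants are copied from the output above.

```python
# Coefficients copied from the simple regression output above.
INTERCEPT = 134.1399  # predicted sales (thousands) when adverts = 0
SLOPE = 0.0961        # change in predicted sales per one-unit increase in adverts


def predict_sales_from_adverts(adverts):
    """Return predicted sales (in thousands of copies) for a given adverts value."""
    return INTERCEPT + SLOPE * adverts


# Hypothetical budgets, used only to illustrate the equation.
for budget in (0, 100, 1000):
    predicted = predict_sales_from_adverts(budget)
    print(f"adverts = {budget:>4}: predicted sales = {predicted:.2f}")
```

At `adverts = 0` the prediction is the intercept itself, and every further 100 units of `adverts` adds 9.61 to the prediction.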

> ### 2. **Simple Regression on Airplay and Sales**
>  Fit a simple linear regression model using `airplay` as the independent variable and `sales` as the dependent variable. Comment on the R-squared value and interpret the model fit.


In [25]:
# your code here
results2 = analyze_regression(df, ['airplay'], ['sales'])
print_regression_results(results2)


Regression Analysis Results:
Intercept: 84.8725

Coefficients:
  airplay: 3.9392

R-squared: 0.3587




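This model fits slightly better than the advertising model, with R² = 0.3587 (35.87% of the variation in sales explained), although almost two thirds of the variation remains unexplained.

Each additional play on the radio is associated with an average increase of 3.9392 in `sales` (about 3,900 copies).
The coefficient is larger than the one for `adverts`, but the two predictors are measured in different units (plays versus thousands of dollars), so the coefficients alone do not show that airplay has the stronger effect. The higher R² is the fairer basis for calling airplay the slightly better single predictor.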
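
With a single predictor, R² is the square of the Pearson correlation between the predictor and the outcome, so the reported R² can be turned back into a correlation. The sketch below does that using only the two numbers printed above; it is an illustration of the identity, not a replacement for computing the correlation from the data.

```python
import math

r_squared = 0.3587  # R-squared reported for the airplay model
slope = 3.9392      # airplay coefficient; its sign gives the sign of r

# In a one-predictor regression, R-squared equals the squared Pearson correlation,
# so r is the square root of R-squared carrying the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope)
print(f"Implied correlation between airplay and sales: {r:.3f}")
```

A correlation of roughly 0.6 is a moderate positive linear relationship, which matches the reading of the model fit above.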

> ###  3. **Multiple Regression with Advertising Budget and Airplay**
>  Fit a multiple regression model using `adverts` and `airplay` as predictors for `sales`. Comment on each coefficient. Can you tell which variable has a stronger impact on sales?



In [26]:
# your code here
results3 = analyze_regression(df, ['adverts', 'airplay'], ['sales'])
print_regression_results(results3)


Regression Analysis Results:
Intercept: 41.1238

Coefficients:
  adverts: 0.0869
  airplay: 3.5888

R-squared: 0.6293




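Substantial improvement over either single-predictor model, with R² = 0.6293 (62.93% of the variance explained).

Holding airplay constant, each additional 1,000 dollars of advertising is associated with about 0.0869 more sales units (roughly 87 copies). Holding advertising constant, each additional play is associated with about 3.5888 more sales units (roughly 3,600 copies).
The airplay coefficient is numerically larger, but because the predictors are on different scales, the raw coefficients cannot show which variable has the stronger impact; that comparison requires standardized coefficients.
The combination of both variables provides much better predictive power than either variable alone.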
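
Because `adverts` and `airplay` are measured in different units, one common way to put their coefficients on a shared scale is to standardize them: multiply each raw coefficient by the standard deviation of its predictor and divide by the standard deviation of the outcome. The sketch below shows the calculation. As a loudly stated assumption, it uses only the five rows printed by `df.head()` above to estimate the standard deviations, so its numbers are illustrative and should not be read as results; the helper name `standardized_coefficients` is hypothetical.

```python
from statistics import stdev

# First five rows printed by df.head() above; NOT the full dataset.
adverts = [10.256, 985.685, 1445.563, 1188.193, 574.513]
airplay = [43, 28, 35, 33, 44]
sales = [330, 120, 360, 270, 220]

# Raw coefficients copied from the two-predictor model output.
raw_coefs = {"adverts": 0.0869, "airplay": 3.5888}
predictors = {"adverts": adverts, "airplay": airplay}


def standardized_coefficients(raw_coefs, predictors, outcome):
    """Rescale each raw slope to 'SDs of outcome per one SD of predictor'."""
    sd_y = stdev(outcome)
    return {
        name: coef * stdev(predictors[name]) / sd_y
        for name, coef in raw_coefs.items()
    }


betas = standardized_coefficients(raw_coefs, predictors, sales)
for name, beta in betas.items():
    print(f"{name}: standardized coefficient = {beta:.3f}")
```

With the full data loaded, the same calculation would use `df[col].std()` for each column in place of the five-row lists; both `statistics.stdev` and the pandas default use the sample (n - 1) standard deviation.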

> ###  4. **Model Comparison**
>    Extend the previous model by adding `attract` as a third predictor. Fit the model and interpret how attractiveness influences sales. Compare the three models you created so far. Which model would you choose to predict album sales?



In [28]:
# your code here
results4 = analyze_regression(df, ['adverts', 'airplay', 'attract'], ['sales'])
print_regression_results(results4)


Regression Analysis Results:
Intercept: -26.6130

Coefficients:
  adverts: 0.0849
  airplay: 3.3674
  attract: 11.0863

R-squared: 0.6647




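Best-fitting model, with R² = 0.6647 (66.47% of the variance explained).

Attractiveness has a positive association with sales: holding advertising and airplay constant, each one-point increase in `attract` is associated with about 11.0863 more sales units (roughly 11,000 copies). The output above does not include p-values, so statistical significance is not established here.

Ordered by raw coefficient size: attractiveness (11.0863) > airplay (3.3674) > advertising (0.0849). This ordering reflects the units of each predictor and is not a ranking of importance.

This is the model to choose for predictions because:

- it has the highest R² of the models fitted (0.6647, against 0.6293 with two predictors and at most 0.3587 with one), with the caveat that R² can never decrease when a predictor is added, so the modest gain from `attract` is better confirmed with adjusted R² or AIC;
- it captures three different aspects of sales influence;
- the coefficients of `adverts` and `airplay` barely change when `attract` is added (0.0869 to 0.0849 and 3.5888 to 3.3674), so each variable appears to contribute its own information.

The `adverts` coefficient looks small (0.0849) because of its units, not because advertising is unimportant: `adverts` takes values in the hundreds and thousands (about 10 to 1,446 in the first five rows), whereas `airplay` is in the tens and `attract` is on a 1 to 10 scale.
Since both `adverts` and `sales` are recorded in thousands, each additional 1,000 dollars of advertising is associated with about 0.0849 thousand (roughly 85) additional copies, holding the other predictors constant.
For example, raising `adverts` by 100 units adds about 8.49 to the predicted `sales` value (100 × 0.0849).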
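
R² never decreases when a predictor is added, so a complexity-aware measure is a useful complement when comparing these models. The sketch below computes adjusted R² and a relative AIC from the R² values printed above. Two assumptions are made: the sample size `N_OBS = 200` is a hypothetical placeholder that should be replaced with `len(df)`, and the AIC is the Gaussian linear-model form, n·ln(RSS/n) + 2k, with the term shared by all models fitted to the same `sales` values dropped, so only differences between models are meaningful.

```python
import math

# (number of predictors, R-squared) copied from the outputs above.
models = {
    "adverts": (1, 0.3346),
    "airplay": (1, 0.3587),
    "adverts + airplay": (2, 0.6293),
    "adverts + airplay + attract": (3, 0.6647),
}

N_OBS = 200  # ASSUMPTION: hypothetical sample size; replace with len(df)


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def relative_aic(r2, n, p):
    """AIC up to an additive constant shared by models fitted to the same outcome.

    Uses RSS = (1 - R2) * TSS, so n*ln(RSS/n) = n*ln(1 - R2) + constant.
    k = p + 1 counts the slopes plus the intercept.
    """
    return n * math.log(1 - r2) + 2 * (p + 1)


for name, (p, r2) in models.items():
    adj = adjusted_r2(r2, N_OBS, p)
    aic = relative_aic(r2, N_OBS, p)
    print(f"{name}: adjusted R2 = {adj:.4f}, relative AIC = {aic:.1f}")
```

Lower relative AIC and higher adjusted R² both favour a model; the actual ranking depends on the true number of rows in the dataset.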

> ### 5. **Model Prediction**
> 
> Three artists sent their material to your music production company. Using the model you selected in the previous task, choose the artist who is more likely to sell. The artists' profiles are in the following cell. You can only pick one artist. Feel free to ignore any predictor that your model does not need. 
> 
> Stats has consequences! You will launch one artist to fame,  while shattering two artists' dreams. Choose wisely.

In [32]:
# artist profiles
artists = {
    'artist 1': {'adverts': 215, 'airplay': 28, 'attract': 8},
    'artist 2': {'adverts': 531, 'airplay': 36, 'attract': 6},
    'artist 3': {'adverts': 911, 'airplay': 19, 'attract': 7}
}

In [None]:

# Extract model coefficients and intercept from the results
coefficients = results4['coefficients']
intercept = results4['intercept']

# Predict sales from an artist's features using the fitted coefficients
def predict_sales(artist_profile, coefficients, intercept):
    # Iterate over the model's coefficients so that any profile keys the model does not use are ignored
    sales = sum(coef * artist_profile[feature] for feature, coef in coefficients.items()) + intercept
    return sales


predicted_sales = {}
for artist, profile in artists.items():
    predicted_sales[artist] = predict_sales(profile, coefficients, intercept)


best_artist = max(predicted_sales, key=predicted_sales.get)


print("Predicted sales for each artist:")
for artist, sales in predicted_sales.items():
    print(f"{artist}: {sales:.2f}")

print(f"\nThe artist most likely to sell is: {best_artist}")


Predicted sales for each artist:
artist 1: 174.62
artist 2: 206.21
artist 3: 192.30

The artist most likely to sell is: artist 2
