
In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# Data
data = {'Newspaper Ads': [1, 2, 3, 1, 2],
        'Orders Received': [35, 48, 60, 40, 50]}
df = pd.DataFrame(data)

# 1. Perform Linear Regression Analysis
X = df[['Newspaper Ads']]
y = df['Orders Received']

model = LinearRegression()
model.fit(X, y)

# Coefficients
print("Linear Regression Coefficients:")
print(f"Intercept: {model.intercept_}")
print(f"Coefficient for Newspaper Ads: {model.coef_[0]}")

# 2. Calculate the Baseline Value
baseline = np.mean(y)
print(f"\nBaseline Value (Mean of Orders Received): {baseline}")

# 3. Calculate SHAP Values
# For a linear model, the SHAP value of a feature is
# coefficient * (feature value - mean of that feature over the background data; here, the fitting data)
mean_newspaper_ads = np.mean(X['Newspaper Ads'])
df['SHAP Value'] = model.coef_[0] * (X['Newspaper Ads'] - mean_newspaper_ads)

# 4. Compute Final Prediction
df['Predicted Orders'] = model.predict(X)

# Confirm that Final Prediction = Baseline + SHAP Value
# We can add a check column to verify this
df['Prediction Check'] = baseline + df['SHAP Value']

print("\nSHAP Values and Predictions:")
print(df[['Newspaper Ads', 'Orders Received', 'SHAP Value', 'Predicted Orders', 'Prediction Check']])

# 5. Interpret the Results and Comparison
print("\nInterpretation and Comparison:")
for index, row in df.iterrows():
    print(f"\nRecord {index + 1}:")
    print(f"Newspaper Ads: {row['Newspaper Ads']}")
    print(f"Actual Orders: {row['Orders Received']}")
    print(f"Predicted Orders: {row['Predicted Orders']:.2f}")
    print(f"SHAP Value: {row['SHAP Value']:.2f}")
    print(f"Interpretation: With {row['Newspaper Ads']} newspaper ads, the predicted orders are {row['Predicted Orders']:.2f}. The SHAP value of {row['SHAP Value']:.2f} indicates how much this number of ads contributed to the prediction compared to the baseline of {baseline:.2f}.")

    difference = row['Predicted Orders'] - row['Orders Received']
    print(f"Difference (Predicted - Actual): {difference:.2f}")

    if difference > 0:
        print("Status: Overprediction")
    elif difference < 0:
        print("Status: Underprediction")
    else:
        print("Status: Accurate prediction")

# Summary Analysis
print("\nSummary Analysis:")
# Model fit: R-squared computed on the same five records used for fitting
# (this measures in-sample fit, not out-of-sample accuracy)
from sklearn.metrics import r2_score
r2 = r2_score(y, df['Predicted Orders'])
print(f"\nModel Accuracy (R-squared): {r2:.2f}")

# Trend Analysis
print(f"\nTrend Analysis: The coefficient for Newspaper Ads is {model.coef_[0]:.2f}. This indicates that for each additional newspaper ad, the predicted number of orders increases by approximately {model.coef_[0]:.2f}.")

# SHAP Interpretation insights
print("\nSHAP Interpretation Insights: The SHAP values show the contribution of each specific number of newspaper ads to the predicted orders relative to the average number of orders (baseline). Because the fitted coefficient is positive, a positive SHAP value means the number of ads is above its average and pushes the prediction above the baseline, while a negative SHAP value means the number of ads is below its average and pushes the prediction below the baseline.")

Linear Regression Coefficients:
Intercept: 26.285714285714295
Coefficient for Newspaper Ads: 11.285714285714281

Baseline Value (Mean of Orders Received): 46.6

SHAP Values and Predictions:
   Newspaper Ads  Orders Received  SHAP Value  Predicted Orders  \
0              1               35   -9.028571         37.571429   
1              2               48    2.257143         48.857143   
2              3               60   13.542857         60.142857   
3              1               40   -9.028571         37.571429   
4              2               50    2.257143         48.857143   

   Prediction Check  
0         37.571429  
1         48.857143  
2         60.142857  
3         37.571429  
4         48.857143  

Interpretation and Comparison:

Record 1:
Newspaper Ads: 1.0
Actual Orders: 35.0
Predicted Orders: 37.57
SHAP Value: -9.03
Interpretation: With 1.0 newspaper ads, the predicted orders are 37.57. The SHAP value of -9.03 indicates how much this number of ads contributed to th
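
The additivity check in this cell can be reproduced without scikit-learn. The sketch below refits the same five records with `np.polyfit` (a swapped-in fitting routine, not the notebook's `LinearRegression`) and confirms that the baseline plus the SHAP value recovers each prediction.

```python
import numpy as np

# Same five records as in the cell above
ads = np.array([1, 2, 3, 1, 2], dtype=float)
orders = np.array([35, 48, 60, 40, 50], dtype=float)

# Least-squares line: orders ~ intercept + slope * ads
slope, intercept = np.polyfit(ads, orders, deg=1)

baseline = orders.mean()                # mean of the target
shap_ads = slope * (ads - ads.mean())   # coefficient * (value - feature mean)
predictions = intercept + slope * ads

# prediction = baseline + SHAP value, because the fitted line passes through
# the point (mean of ads, mean of orders)
print(np.allclose(predictions, baseline + shap_ads))
```

The identity holds here because the baseline and the feature mean are taken over the same records the line was fitted on.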

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# Data
data = {'Doctors Available': [3, 2, 4, 1, 2],
        'Reminders Sent': [1, 1, 0, 0, 1],
        'Appointments': [40, 35, 30, 20, 38]}
df = pd.DataFrame(data)

# 1. Perform Multiple Linear Regression Analysis
X = df[['Doctors Available', 'Reminders Sent']]
y = df['Appointments']

model = LinearRegression()
model.fit(X, y)

# Coefficients
print("Multiple Linear Regression Coefficients:")
print(f"Intercept: {model.intercept_}")
print(f"Coefficient for Doctors Available: {model.coef_[0]}")
print(f"Coefficient for Reminders Sent: {model.coef_[1]}")

# 2. Calculate the Baseline Value
baseline = np.mean(y)
print(f"\nBaseline Value (Mean of Appointments): {baseline}")

# 3. Calculate SHAP Values
# For a linear model, the SHAP value of a feature is
# coefficient * (feature value - mean of that feature over the background data; here, the fitting data)
mean_doctors_available = np.mean(X['Doctors Available'])
mean_reminders_sent = np.mean(X['Reminders Sent'])

df['SHAP (Doctors Available)'] = model.coef_[0] * (X['Doctors Available'] - mean_doctors_available)
df['SHAP (Reminders Sent)'] = model.coef_[1] * (X['Reminders Sent'] - mean_reminders_sent)
df['Total SHAP Value'] = df['SHAP (Doctors Available)'] + df['SHAP (Reminders Sent)']


# 4. Compute Final Prediction for Each Record
df['Predicted Appointments'] = model.predict(X)

# Verify: Prediction = Baseline + SHAP (Doctors Available) + SHAP (Reminders Sent)
# We can add a check column to verify this
df['Prediction Check'] = baseline + df['Total SHAP Value']


print("\nSHAP Values and Predictions:")
print(df[['Doctors Available', 'Reminders Sent', 'Appointments', 'SHAP (Doctors Available)', 'SHAP (Reminders Sent)', 'Total SHAP Value', 'Predicted Appointments', 'Prediction Check']])


# 5. Interpret the Results and Comparison
print("\nInterpretation and Comparison:")
for index, row in df.iterrows():
    print(f"\nRecord {index + 1}:")
    print(f"Doctors Available: {row['Doctors Available']}")
    print(f"Reminders Sent: {row['Reminders Sent']}")
    print(f"Actual Appointments: {row['Appointments']}")
    print(f"Predicted Appointments: {row['Predicted Appointments']:.2f}")
    print(f"SHAP (Doctors Available): {row['SHAP (Doctors Available)']:.2f}")
    print(f"SHAP (Reminders Sent): {row['SHAP (Reminders Sent)']:.2f}")
    print(f"Interpretation: For {row['Doctors Available']:.0f} doctors available and {row['Reminders Sent']:.0f} reminder(s) sent, the predicted appointments are {row['Predicted Appointments']:.2f}. The SHAP values for Doctors Available ({row['SHAP (Doctors Available)']:.2f}) and Reminders Sent ({row['SHAP (Reminders Sent)']:.2f}) show their individual contributions to the prediction compared to the baseline of {baseline:.2f}.")

    difference = row['Predicted Appointments'] - row['Appointments']
    print(f"Difference (Predicted - Actual): {difference:.2f}")

    if difference > 0:
        print("Status: Overprediction")
    elif difference < 0:
        print("Status: Underprediction")
    else:
        print("Status: Accurate prediction")

# Summary Analysis
print("\nSummary Analysis:")
# Model fit: R-squared computed on the same five records used for fitting
# (this measures in-sample fit, not out-of-sample accuracy)
from sklearn.metrics import r2_score
r2 = r2_score(y, df['Predicted Appointments'])
print(f"\nModel Accuracy (R-squared): {r2:.2f}")

# Trend Analysis
print("\nTrend Analysis:")
print(f"- For each additional doctor available, the predicted number of appointments increases by approximately {model.coef_[0]:.2f}.")
print(f"- When a reminder is sent (compared to not sent), the predicted number of appointments increases by approximately {model.coef_[1]:.2f}.")


# SHAP Interpretation insights
print("\nSHAP Interpretation Insights:")
print("The SHAP values show the contribution of the number of doctors available and whether a reminder was sent to the predicted appointments relative to the average number of appointments (baseline).")
print("Both fitted coefficients are positive, so the sign of each SHAP value follows the sign of (feature value - feature mean):")
print("- Positive SHAP values for 'Doctors Available' indicate more doctors than average, pushing the prediction above the baseline.")
print("- Negative SHAP values for 'Doctors Available' indicate fewer doctors than average, pushing the prediction below the baseline.")
print("- 'Reminders Sent' is a 0/1 feature, so its SHAP value takes only two values: positive when a reminder was sent and negative when it was not.")

Multiple Linear Regression Coefficients:
Intercept: 16.612903225806456
Coefficient for Doctors Available: 3.3548387096774186
Coefficient for Reminders Sent: 13.2258064516129

Baseline Value (Mean of Appointments): 32.6

SHAP Values and Predictions:
   Doctors Available  Reminders Sent  Appointments  SHAP (Doctors Available)  \
0                  3               1            40                  2.012903   
1                  2               1            35                 -1.341935   
2                  4               0            30                  5.367742   
3                  1               0            20                 -4.696774   
4                  2               1            38                 -1.341935   

   SHAP (Reminders Sent)  Total SHAP Value  Predicted Appointments  \
0               5.290323          7.303226               39.903226   
1               5.290323          3.948387               36.548387   
2              -7.935484         -2.567742               30.
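
As a cross-check on the table above, the sketch below refits the same five records with `np.linalg.lstsq` (a swapped-in solver, not the notebook's `LinearRegression`). It shows that the 0/1 feature can take only two SHAP values, and that the two per-feature SHAP values plus the baseline recover each prediction.

```python
import numpy as np

doctors = np.array([3, 2, 4, 1, 2], dtype=float)
reminders = np.array([1, 1, 0, 0, 1], dtype=float)
appointments = np.array([40, 35, 30, 20, 38], dtype=float)

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones_like(doctors), doctors, reminders])
(intercept, coef_doctors, coef_reminders), *_ = np.linalg.lstsq(A, appointments, rcond=None)

shap_doctors = coef_doctors * (doctors - doctors.mean())
shap_reminders = coef_reminders * (reminders - reminders.mean())

# A 0/1 feature yields exactly two distinct SHAP values
reminder_levels = np.unique(np.round(shap_reminders, 6))
print(reminder_levels)

predictions = A @ np.array([intercept, coef_doctors, coef_reminders])
print(np.allclose(predictions, appointments.mean() + shap_doctors + shap_reminders))
```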

In [None]:
from sklearn.datasets import load_diabetes

# Load the dataset
diabetes_data = load_diabetes()

# Create DataFrame for features and Series for target
X = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
y = pd.Series(diabetes_data.target, name='target')

# Display the first few rows of the features and target
display(X.head())
display(y.head())

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


Unnamed: 0,target
0,151.0
1,75.0
2,141.0
3,206.0
4,135.0


In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (353, 10)
Shape of X_test: (89, 10)
Shape of y_train: (353,)
Shape of y_test: (89,)


In [None]:
# Instantiate and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Print the intercept and coefficients
print("Linear Regression Model Coefficients:")
print(f"Intercept: {model.intercept_}")
print("Coefficients:")
for feature, coef in zip(X_train.columns, model.coef_):
    print(f"{feature}: {coef}")

Linear Regression Model Coefficients:
Intercept: 151.34560453985995
Coefficients:
age: 37.904021350074984
sex: -241.96436231273995
bmi: 542.4287585162899
bp: 347.70384391385636
s1: -931.4888458835163
s2: 518.0622769833376
s3: 163.41998299131035
s4: 275.3179015786484
s5: 736.1988589046839
s6: 48.67065743196543


In [None]:
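# Calculate the baseline value as the mean of the target variable from the training data.
# For least squares with an intercept, this equals the mean prediction over the full training set.
# It is not necessarily the base value a SHAP explainer uses (see explainer.expected_value).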
baseline = np.mean(y_train)

# Print the baseline value
print(f"Baseline Value (Mean of Disease Progression in Training Data): {baseline}")

Baseline Value (Mean of Disease Progression in Training Data): 153.73654390934846
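
Using the mean of `y_train` as the baseline relies on a property of least squares with an intercept: the residuals sum to zero, so the mean fitted value equals the mean target. The sketch below demonstrates this on synthetic data (the data and coefficients are made up for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))   # synthetic features, illustration only
y_syn = 5.0 + X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Least squares with an explicit intercept column
A = np.column_stack([np.ones(len(X_syn)), X_syn])
beta, *_ = np.linalg.lstsq(A, y_syn, rcond=None)
fitted = A @ beta

# Residuals sum to zero, so the two means agree
print(np.isclose(fitted.mean(), y_syn.mean()))
```

The equality is tied to the rows used for fitting; a mean prediction taken over any other set of rows generally differs.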


In [None]:
import shap

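# Create a SHAP explainer object for the trained linear regression model.
# Note: a DataFrame passed as background data is wrapped in a masker that, by default in
# recent SHAP releases, subsamples it to 100 rows. The explainer's base value
# (explainer.expected_value) is the mean prediction over that sample, not over all of X_train.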
explainer = shap.Explainer(model, X_train)

# Calculate the SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Display the shape of the SHAP values
print("Shape of SHAP values:", shap_values.shape)

Shape of SHAP values: (89, 10)
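
For a linear model, Shapley values computed by brute force over all feature coalitions reduce to the closed form `coefficient * (value - background mean)`. The sketch below checks this on a small hypothetical model with synthetic background data. It does not call the `shap` library, and it assumes that features outside a coalition keep their background values (the "interventional" convention).

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))   # synthetic background data, illustration only
coef = np.array([2.0, -1.0, 0.5])       # hypothetical linear model
intercept = 3.0


def predict(X):
    return intercept + X @ coef


def coalition_value(x, subset):
    """Mean prediction when the features in `subset` are fixed to x and the
    remaining features keep their background values."""
    Z = background.copy()
    idx = list(subset)
    if idx:
        Z[:, idx] = x[idx]
    return predict(Z).mean()


def exact_shapley(x):
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (coalition_value(x, subset + (i,)) - coalition_value(x, subset))
    return phi


x = np.array([1.0, 0.5, -2.0])
closed_form = coef * (x - background.mean(axis=0))
print(np.allclose(exact_shapley(x), closed_form))
```

The enumeration costs 2^n model evaluations per feature, which is why the closed form matters for the 10- and 41-feature models in this notebook.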


In [None]:
# Calculate predicted values for the test set
predicted_values = model.predict(X_test)

# Calculate the sum of SHAP values for each record in the test set
sum_of_shap_values = np.sum(shap_values, axis=1)

# Add the baseline value to the sum of SHAP values
baseline_plus_shap = baseline + sum_of_shap_values

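# Compare the predicted values with the baseline plus sum of SHAP values.
# prediction = base value + sum(SHAP values) holds exactly for a linear model, but only when
# the base value is the explainer's own explainer.expected_value (the mean prediction over its
# background sample). mean(y_train) matches that only if the background is the full training set.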
difference = np.abs(predicted_values - baseline_plus_shap)

# Check if the maximum absolute difference is close to zero
tolerance = 1e-9
verification_successful = np.all(difference < tolerance)

# Print verification result
if verification_successful:
    print("Verification successful: Predicted values are equal to baseline + sum of SHAP values.")
else:
    print("Verification failed: Predicted values are NOT equal to baseline + sum of SHAP values.")

print(f"Maximum absolute difference: {np.max(difference)}")

Verification failed: Predicted values are NOT equal to baseline + sum of SHAP values.
Maximum absolute difference: 3.787469330827207
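
The failed check above is consistent with a base-value mismatch rather than a problem with the SHAP values: in the recorded outputs, `Baseline + Total SHAP` exceeds the prediction by the same amount (about 3.79) for every record. The sketch below reproduces the mechanism without calling `shap`. The 100-row `background` is a hypothetical stand-in for whatever sample an explainer draws, so it will not reproduce the exact 3.787 offset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Hypothetical 100-row background sample (an assumption, not SHAP's actual draw)
background = X_train.sample(n=100, random_state=0)

# Closed-form linear SHAP values relative to that background
phi = (X_test - background.mean()) * model.coef_
predictions = model.predict(X_test)

base_background = model.predict(background).mean()   # base value consistent with phi
base_target = y_train.mean()                          # the notebook's baseline

gap_background = predictions - (base_background + phi.sum(axis=1))
gap_target = predictions - (base_target + phi.sum(axis=1))

print(np.abs(gap_background).max())           # numerically zero
print(gap_target.min(), gap_target.max())     # the same offset for every record
```

With the `shap` explainer used in this notebook, the consistent base value should be available as `explainer.expected_value`; substituting it for `baseline` in the check is the likely fix.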


In [None]:
# 1. Analyze the relationship between SHAP values and coefficients
print("Relationship between SHAP Values and Feature Coefficients:")
print("For a linear model, the SHAP value for a feature for a specific instance is approximately the feature's coefficient multiplied by the difference between the feature's value for that instance and the feature's average value across the training data.")
print("\nModel Coefficients:")
for feature, coef in zip(X_train.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")

print("\nSample SHAP Values (First 5 records in test set):")
# Create a DataFrame for easier viewing
shap_df = pd.DataFrame(shap_values, columns=X_test.columns)
display(shap_df.head())

# Demonstrate the relationship for a sample record and feature
sample_record_index = 0
sample_feature = 'bmi'
feature_value = X_test.iloc[sample_record_index][sample_feature]
mean_feature_value = np.mean(X_train[sample_feature])
coefficient = model.coef_[X_train.columns.get_loc(sample_feature)]
calculated_shap = coefficient * (feature_value - mean_feature_value)
actual_shap = shap_df.iloc[sample_record_index][sample_feature]

print(f"\nExample for record {sample_record_index} and feature '{sample_feature}':")
print(f"Feature value: {feature_value:.4f}")
print(f"Mean feature value (training): {mean_feature_value:.4f}")
print(f"Coefficient: {coefficient:.4f}")
print(f"Calculated SHAP value (coef * (value - mean)): {calculated_shap:.4f}")
print(f"Actual SHAP value: {actual_shap:.4f}")
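# Caution: the gap between the two values is not numerical noise. The hand calculation uses the
# mean over all of X_train, while the explainer uses the mean over its (by default subsampled)
# background data, so the two feature means differ.
print("Note: The values might not be exactly equal due to potential numerical precision or complexities in the SHAP calculation library for non-exact linear models.")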


# 2. Compare predicted vs actual values for sample records
print("\nComparison of Predicted vs Actual Values (Sample Records):")
sample_indices = [0, 1, 2, 3, 4] # Select first 5 records from test set

sample_comparison_df = pd.DataFrame({
    'Actual': y_test.iloc[sample_indices],
    'Predicted': predicted_values[sample_indices],
    'Difference (Predicted - Actual)': predicted_values[sample_indices] - y_test.iloc[sample_indices],
    'Total SHAP Value': sum_of_shap_values[sample_indices]
}, index=X_test.index[sample_indices])

display(sample_comparison_df)


# 3. Examine total SHAP value and prediction difference for samples
print("\nAnalysis of Total SHAP Value and Prediction Difference:")
for index, row in sample_comparison_df.iterrows():
    actual = row['Actual']
    predicted = row['Predicted']
    total_shap = row['Total SHAP Value']
    difference = row['Difference (Predicted - Actual)']

    print(f"\nRecord Index (from test set): {index}")
    print(f"Actual Value: {actual:.2f}")
    print(f"Predicted Value: {predicted:.2f}")
    print(f"Total SHAP Value: {total_shap:.2f}")
    print(f"Baseline Value: {baseline:.2f}")
    print(f"Baseline + Total SHAP: {baseline + total_shap:.2f}")
    print(f"Difference (Predicted - Actual): {difference:.2f}")

    if difference > 0:
        status = "Overprediction"
    elif difference < 0:
        status = "Underprediction"
    else:
        status = "Accurate prediction"
    print(f"Status: {status}")
    print(f"Interpretation: The predicted value ({predicted:.2f}) is the baseline ({baseline:.2f}) adjusted by the total contribution of all features (Total SHAP = {total_shap:.2f}). The difference of {difference:.2f} indicates how far off the prediction is from the actual value.")


# 4. Discuss SHAP values in the context of the baseline
print("\nSHAP Value Interpretation in the Context of Baseline:")
print(f"The baseline value represents the average predicted outcome when we don't know any specific feature values for an individual (it's the mean of the training target variable, {baseline:.2f}).")
print("Each feature's SHAP value for a specific individual tells us how much that feature's value for that individual contributes to pushing the prediction away from the baseline.")
print("- A positive SHAP value for a feature means that feature's value for this individual increases the prediction relative to the baseline.")
print("- A negative SHAP value for a feature means that feature's value for this individual decreases the prediction relative to the baseline.")
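print("The sum of all SHAP values for an individual, when added to the explainer's base value, gives the final prediction for that individual. This is exact for linear models, provided the base value is the mean prediction over the same background data the explainer used.")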
print("\nOverall Interpretation:")
print("By examining the SHAP values for different features across individuals, we can understand which features are most influential in driving the prediction and in which direction (increasing or decreasing the predicted disease progression relative to the average). Features with larger absolute SHAP values have a greater impact.")

Relationship between SHAP Values and Feature Coefficients:
For a linear model, the SHAP value for a feature for a specific instance is approximately the feature's coefficient multiplied by the difference between the feature's value for that instance and the feature's average value across the training data.

Model Coefficients:
age: 37.9040
sex: -241.9644
bmi: 542.4288
bp: 347.7038
s1: -931.4888
s2: 518.0623
s3: 163.4200
s4: 275.3179
s5: 736.1989
s6: 48.6707

Sample SHAP Values (First 5 records in test set):


Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,1.962051,10.37901,-3.361659,-3.683095,-116.415425,64.587169,3.561787,9.387414,23.691276,-0.510046
1,3.751993,10.37901,20.023792,9.484923,23.288212,-8.903205,0.553521,-10.931664,-16.762007,-1.316442
2,2.65049,-12.685457,-2.192386,-2.486002,-95.908469,25.00295,9.578319,-0.772125,62.017204,-1.114843
3,3.889681,10.37901,28.2087,29.440455,-51.049503,18.675964,-12.081196,38.850078,72.439226,2.715539
4,0.722861,-12.685457,-10.96193,1.105276,-35.669286,27.274176,-0.649785,9.387414,-3.971039,-0.711645



Example for record 0 and feature 'bmi':
Feature value: -0.0062
Mean feature value (training): 0.0017
Coefficient: 542.4288
Calculated SHAP value (coef * (value - mean)): -4.3078
Actual SHAP value: -3.3617
Note: The values might not be exactly equal due to potential numerical precision or complexities in the SHAP calculation library for non-exact linear models.

Comparison of Predicted vs Actual Values (Sample Records):


     Actual   Predicted  Difference (Predicted - Actual)  Total SHAP Value
287   219.0  139.547558                       -79.452442        -10.401516
211    70.0  179.517208                       109.517208         29.568134
72    202.0  134.038756                       -67.961244        -15.910319
321   230.0  291.417029                        61.417029        141.467955
73    111.0  123.789659                        12.789659        -26.159416



Analysis of Total SHAP Value and Prediction Difference:

Record Index (from test set): 287
Actual Value: 219.00
Predicted Value: 139.55
Total SHAP Value: -10.40
Baseline Value: 153.74
Baseline + Total SHAP: 143.34
Difference (Predicted - Actual): -79.45
Status: Underprediction
Interpretation: The predicted value (139.55) is the baseline (153.74) adjusted by the total contribution of all features (Total SHAP = -10.40). The difference of -79.45 indicates how far off the prediction is from the actual value.

Record Index (from test set): 211
Actual Value: 70.00
Predicted Value: 179.52
Total SHAP Value: 29.57
Baseline Value: 153.74
Baseline + Total SHAP: 183.30
Difference (Predicted - Actual): 109.52
Status: Overprediction
Interpretation: The predicted value (179.52) is the baseline (153.74) adjusted by the total contribution of all features (Total SHAP = 29.57). The difference of 109.52 indicates how far off the prediction is from the actual value.

Record Index (from test set): 72
Actua
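
The statement that features with larger absolute SHAP values have a greater impact can be turned into a ranking by averaging absolute SHAP values per feature. The sketch below does this for the five test records shown in the SHAP table above. Five records is far too few for a reliable ranking, so treat the result as an illustration of the calculation only.

```python
import numpy as np

features = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]

# SHAP values of the first five test records, copied from the table above
shap_rows = np.array([
    [1.962051, 10.37901, -3.361659, -3.683095, -116.415425, 64.587169, 3.561787, 9.387414, 23.691276, -0.510046],
    [3.751993, 10.37901, 20.023792, 9.484923, 23.288212, -8.903205, 0.553521, -10.931664, -16.762007, -1.316442],
    [2.65049, -12.685457, -2.192386, -2.486002, -95.908469, 25.00295, 9.578319, -0.772125, 62.017204, -1.114843],
    [3.889681, 10.37901, 28.2087, 29.440455, -51.049503, 18.675964, -12.081196, 38.850078, 72.439226, 2.715539],
    [0.722861, -12.685457, -10.96193, 1.105276, -35.669286, 27.274176, -0.649785, 9.387414, -3.971039, -0.711645],
])

# Row sums reproduce the "Total SHAP Value" column of the comparison table
totals = shap_rows.sum(axis=1)

# Mean absolute SHAP value per feature, largest first
mean_abs = np.abs(shap_rows).mean(axis=0)
ranking = [features[i] for i in np.argsort(mean_abs)[::-1]]
print(ranking[:3])
```

On the full test set the same calculation would be `np.abs(shap_values).mean(axis=0)`.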

In [None]:
import requests
import zipfile
import io

# Download the zip file (fail fast on network or HTTP errors)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip'
response = requests.get(url, timeout=30)
response.raise_for_status()

# Extract the zip file in memory
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    # Assuming 'student-mat.csv' is the relevant file
    with z.open('student-mat.csv') as f:
        # Load the CSV into a pandas DataFrame
        df = pd.read_csv(f, sep=';')

# Display the first 5 rows
display(df.head())

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [None]:
# Identify categorical columns (object and category dtypes)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Apply one-hot encoding to the identified categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features (independent variables)
X = df_encoded.drop('G3', axis=1)

# Define target variable (dependent variable)
y = df_encoded['G3']

# Display the first few rows of the processed features DataFrame and the target Series
print("Processed Features (X) after One-Hot Encoding:")
display(X.head())

print("\nTarget Variable (y):")
display(y.head())

Processed Features (X) after One-Hot Encoding:


Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,...,True,False,True,False,False,False,True,True,False,False
1,17,1,1,1,2,0,5,3,3,1,...,False,False,False,True,False,False,False,True,True,False
2,15,1,1,1,2,3,4,3,2,2,...,True,False,True,False,True,False,True,True,True,False
3,15,4,2,1,3,0,3,2,2,1,...,True,False,False,True,True,True,True,True,True,True
4,16,3,3,1,2,0,4,3,2,1,...,False,False,False,True,True,False,True,True,False,False



Target Variable (y):


Unnamed: 0,G3
0,6
1,6
2,10
3,15
4,10
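
With `drop_first=True`, one level of each categorical column gets no dummy column and becomes the reference level. The coefficient and SHAP value of a column such as `Mjob_health` are then read relative to that reference. The sketch below shows the effect on a tiny hypothetical frame that reuses one column name and three of its categories.

```python
import pandas as pd

# Tiny hypothetical frame, illustration only
toy = pd.DataFrame({"age": [18, 17, 15], "Mjob": ["at_home", "health", "other"]})
encoded = pd.get_dummies(toy, columns=["Mjob"], drop_first=True)

# "at_home" (first in sorted order) has no column of its own: it is the reference level
print(list(encoded.columns))

# The "at_home" row is encoded as all-zero / all-False dummies
print(encoded.loc[0, ["Mjob_health", "Mjob_other"]].tolist())
```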


In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (316, 41)
Shape of X_test: (79, 41)
Shape of y_train: (316,)
Shape of y_test: (79,)


In [None]:
# Instantiate and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Print the intercept and coefficients
print("Linear Regression Model Coefficients:")
print(f"Intercept: {model.intercept_}")
print("Coefficients:")
for feature, coef in zip(X_train.columns, model.coef_):
    print(f"{feature}: {coef}")

Linear Regression Model Coefficients:
Intercept: -1.9970238353909675
Coefficients:
age: -0.11592661953142151
Medu: 0.08631677199446891
Fedu: -0.16737767954799243
traveltime: 0.08859281947145183
studytime: -0.007742757332106258
failures: -0.2857811460549535
famrel: 0.3149624187326617
freetime: -0.02022573762987478
goout: 0.18985939119239
Dalc: -0.18550918311119666
Walc: 0.053905720507275726
health: 0.044026942775088215
absences: 0.05559268675514427
G1: 0.2116986990276982
G2: 0.9577720755378567
school_MS: 0.09381411270762968
sex_M: 0.3744096005036927
address_U: 0.0826123683206245
famsize_LE3: -0.008771732366087678
Pstatus_T: -0.14039392973725973
Mjob_health: -0.4632439048909869
Mjob_other: -0.23719518388174726
Mjob_services: -0.05164319597163882
Mjob_teacher: 0.09915197113328098
Fjob_health: 0.48124209905265314
Fjob_other: 0.20644054426045141
Fjob_services: -0.2947620057278342
Fjob_teacher: -0.06829100970015514
reason_home: -0.6119007186133253
reason_other: 0.30497262275199855
reason_rep

In [None]:
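# Calculate the mean of the target variable from the training data.
# For least squares with an intercept, this equals the mean prediction over the full training set.
# It is not necessarily the base value a SHAP explainer uses (see explainer.expected_value).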
baseline = np.mean(y_train)

# Print the baseline value
print(f"Baseline Value (Mean of Final Exam Scores in Training Data): {baseline}")

Baseline Value (Mean of Final Exam Scores in Training Data): 10.325949367088608


In [None]:
import shap

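# Create a SHAP explainer object for the trained linear regression model.
# Note: a DataFrame passed as background data is, by default in recent SHAP releases,
# subsampled to 100 rows, so explainer.expected_value is the mean prediction over that sample.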
explainer = shap.Explainer(model, X_train)

# Calculate the SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Print the shape of the SHAP values
print("Shape of SHAP values:", shap_values.shape)

Shape of SHAP values: (79, 41)
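
When reading this matrix, note that the sign of an individual SHAP value depends on two things: the sign of the coefficient, and whether the feature value lies above or below the background mean. The sketch below illustrates this with a hypothetical two-feature linear model and synthetic background data. It also shows that SHAP values average to zero over the background itself, whatever the coefficient signs.

```python
import numpy as np

rng = np.random.default_rng(1)
background = rng.normal(loc=2.0, size=(100, 2))   # synthetic background, illustration only
coef = np.array([0.9, -0.4])                      # hypothetical coefficients


def linear_shap(X):
    # Closed-form SHAP values of a linear model relative to the background mean
    return coef * (X - background.mean(axis=0))


# Averaged over the background itself, every feature's SHAP value is zero
mean_shap = linear_shap(background).mean(axis=0)
print(np.allclose(mean_shap, 0.0))

# A record one unit above the background mean on both features:
# positive SHAP for the positive coefficient, negative for the negative one
above = background.mean(axis=0) + 1.0
print(linear_shap(above))
```

The mean signed SHAP value of a feature therefore says little about the direction of its effect; the coefficient sign (or a plot of SHAP value against feature value) does.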


In [None]:
# Calculate predicted values for the test set
predicted_values = model.predict(X_test)

# Calculate the sum of SHAP values for each record in the test set
sum_of_shap_values = np.sum(shap_values, axis=1)

# Add the baseline value to the sum of SHAP values
baseline_plus_shap = baseline + sum_of_shap_values

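# Compare the predicted values with the baseline plus sum of SHAP values.
# prediction = base value + sum(SHAP values) holds exactly for a linear model, but only when
# the base value is the explainer's own explainer.expected_value (the mean prediction over its
# background sample). mean(y_train) matches that only if the background is the full training set.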
difference = np.abs(predicted_values - baseline_plus_shap)

# Check if the maximum absolute difference is close to zero
tolerance = 1e-9
verification_successful = np.all(difference < tolerance)

# Print verification result
if verification_successful:
    print("Verification successful: Predicted values are equal to baseline + sum of SHAP values.")
else:
    print("Verification failed: Predicted values are NOT exactly equal to baseline + sum of SHAP values within the specified tolerance.")

print(f"Maximum absolute difference: {np.max(difference)}")

Verification failed: Predicted values are NOT exactly equal to baseline + sum of SHAP values within the specified tolerance.
Maximum absolute difference: 0.30495432978401915
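
A quick way to tell a base-value mismatch from a genuine additivity failure is to check whether the gap is the same for every record. The sketch below does this with the five sample records reported later in this notebook (predicted values and total SHAP values are copied from that comparison table).

```python
import numpy as np

baseline = 10.325949367088608   # mean of y_train printed earlier
predicted = np.array([6.001607, 11.528478, 2.866437, 8.796631, 8.553106])
total_shap = np.array([-4.629297, 0.897575, -7.764467, -1.834273, -2.077797])

# Gap between each prediction and baseline + total SHAP
gap = predicted - (baseline + total_shap)
print(gap.round(4))

# A constant gap means the SHAP values are consistent with the model and only
# the base value is off; the base value they imply is baseline + gap
implied_base_value = baseline + gap.mean()
print(round(implied_base_value, 4))
```

The gap matches the reported maximum absolute difference for all five records, which points to the explainer using a base value other than the mean of `y_train`.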


In [None]:
# 1. Print the coefficients of the trained linear regression model
print("Linear Regression Model Coefficients:")
print(f"Intercept: {model.intercept_:.4f}")
print("Coefficients:")
for feature, coef in zip(X_train.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")

# 2. Create a DataFrame from the calculated SHAP values and display its head
shap_df = pd.DataFrame(shap_values, columns=X_test.columns)
print("\nSHAP Values for Test Set (First 5 Records):")
display(shap_df.head())

# 3. Select sample records and create a comparison DataFrame
sample_indices = [0, 1, 2, 3, 4] # Select first 5 records from test set

sample_comparison_df = pd.DataFrame({
    'Actual': y_test.iloc[sample_indices],
    'Predicted': predicted_values[sample_indices],
    'Difference (Predicted - Actual)': predicted_values[sample_indices] - y_test.iloc[sample_indices],
    'Total SHAP Value': sum_of_shap_values[sample_indices]
}, index=X_test.index[sample_indices])

print("\nComparison of Predicted vs Actual Values and Total SHAP (Sample Records):")
display(sample_comparison_df)


# 4. Iterate through sample records and print detailed analysis
print("\nDetailed Analysis of Sample Records:")
for index, row in sample_comparison_df.iterrows():
    actual = row['Actual']
    predicted = row['Predicted']
    total_shap = row['Total SHAP Value']
    difference = row['Difference (Predicted - Actual)']

    print(f"\nRecord Index (from test set): {index}")
    print(f"Actual Value: {actual:.2f}")
    print(f"Predicted Value: {predicted:.2f}")
    print(f"Total SHAP Value: {total_shap:.2f}")
    print(f"Baseline Value: {baseline:.2f}")
    # Compare Baseline + Total SHAP with the prediction; the two differ by a constant
    # when the baseline is not the explainer's own base value (explainer.expected_value)
    print(f"Baseline + Total SHAP: {baseline + total_shap:.2f}")
    print(f"Difference (Predicted - Actual): {difference:.2f}")

    if difference > 0.5: # Using a small tolerance for over/underprediction status
        status = "Overprediction"
    elif difference < -0.5: # Using a small tolerance for over/underprediction status
        status = "Underprediction"
    else:
        status = "Accurate prediction (within +/- 0.5)"
    print(f"Status: {status}")
    print(f"Interpretation: The predicted value ({predicted:.2f}) is the baseline ({baseline:.2f}) adjusted by the total contribution of all features (Total SHAP = {total_shap:.2f}). The difference of {difference:.2f} indicates how far off the prediction is from the actual value.")


# 5. Explain SHAP value contribution relative to the baseline
print("\nSHAP Value Interpretation in the Context of Baseline:")
print(f"The baseline value represents the average predicted outcome when we don't know any specific feature values for an individual (it's the mean of the training target variable, {baseline:.2f}).")
print("Each feature's SHAP value for a specific individual tells us how much that feature's value for that individual contributes to pushing the prediction away from the baseline.")
print("- A positive SHAP value for a feature means that feature's value for this individual increases the prediction relative to the baseline.")
print("- A negative SHAP value for a feature means that feature's value for this individual decreases the prediction relative to the baseline.")
print("The sum of all SHAP values for an individual, when added to the baseline, gives the final prediction for that individual (ideally, for linear models, though small discrepancies can occur).")


# 6. Provide an overall interpretation of the SHAP analysis
print("\nOverall Interpretation of Feature Contributions using SHAP:")
print("By examining the SHAP values across the test set, we can gain insights into which features are most influential in determining the predicted final exam scores.")
print("Features with a large mean absolute SHAP value have a large impact on the predictions.")
print("The direction of a feature's effect is given by the sign of its coefficient: with a positive coefficient, values above the background mean raise the predicted score and values below it lower the score; with a negative coefficient the effect is reversed.")
print("The average signed SHAP value of a feature is not a measure of direction; it is close to zero whenever the test set resembles the background data.")
print("Individual SHAP values allow us to understand the unique contribution of each feature for each student's predicted score.")

# Optional: You can also create a SHAP summary plot for a visual overview
# shap.summary_plot(shap_values, X_test)

Linear Regression Model Coefficients:
Intercept: -1.9970
Coefficients:
age: -0.1159
Medu: 0.0863
Fedu: -0.1674
traveltime: 0.0886
studytime: -0.0077
failures: -0.2858
famrel: 0.3150
freetime: -0.0202
goout: 0.1899
Dalc: -0.1855
Walc: 0.0539
health: 0.0440
absences: 0.0556
G1: 0.2117
G2: 0.9578
school_MS: 0.0938
sex_M: 0.3744
address_U: 0.0826
famsize_LE3: -0.0088
Pstatus_T: -0.1404
Mjob_health: -0.4632
Mjob_other: -0.2372
Mjob_services: -0.0516
Mjob_teacher: 0.0992
Fjob_health: 0.4812
Fjob_other: 0.2064
Fjob_services: -0.2948
Fjob_teacher: -0.0683
reason_home: -0.6119
reason_other: 0.3050
reason_reputation: -0.2217
guardian_mother: 0.0953
guardian_other: -0.1493
schoolsup_yes: 0.7857
famsup_yes: 0.2037
paid_yes: 0.0672
activities_yes: -0.5175
nursery_yes: -0.2369
higher_yes: 0.3754
internet_yes: -0.1682
romantic_yes: -0.3904

SHAP Values for Test Set (First 5 Records):


Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,-0.044052,-0.073369,0.251067,0.046068,0.007975,-0.768751,0.009449,-0.034586,-0.430981,0.100175,...,0.022873,0.01045,0.67566,0.073316,-0.027561,-0.25875,-0.037898,-0.367869,-0.030277,0.109299
1,-0.159979,-0.159686,0.083689,0.134661,0.007975,0.088592,0.009449,0.005865,-0.051262,-0.085334,...,-0.072431,0.01045,-0.109991,0.073316,0.039661,-0.25875,-0.037898,-0.367869,-0.030277,-0.281054
2,-0.159979,0.012948,-0.083689,-0.042525,0.000232,-0.197189,0.009449,0.005865,-0.051262,0.100175,...,0.022873,0.01045,-0.109991,0.073316,-0.027561,0.25875,-0.037898,0.007508,-0.030277,-0.281054
3,0.071875,-0.073369,0.251067,-0.042525,0.000232,0.088592,0.324411,0.005865,0.138597,0.100175,...,0.022873,0.01045,-0.109991,-0.13034,0.039661,-0.25875,-0.037898,0.007508,-0.030277,-0.281054
4,-0.391832,-0.073369,0.083689,-0.042525,0.000232,-0.48297,0.324411,-0.034586,0.138597,-0.456353,...,-0.072431,-0.13884,-0.109991,0.073316,0.039661,0.25875,-0.037898,0.007508,0.137927,0.109299



Comparison of Predicted vs Actual Values and Total SHAP (Sample Records):


     Actual  Predicted  Difference (Predicted - Actual)  Total SHAP Value
78       10   6.001607                        -3.998393         -4.629297
371      12  11.528478                        -0.471522          0.897575
248       5   2.866437                        -2.133563         -7.764467
55       10   8.796631                        -1.203369         -1.834273
390       9   8.553106                        -0.446894         -2.077797



Detailed Analysis of Sample Records:

Record Index (from test set): 78
Actual Value: 10.00
Predicted Value: 6.00
Total SHAP Value: -4.63
Baseline Value: 10.33
Baseline + Total SHAP: 5.70
Difference (Predicted - Actual): -4.00
Status: Underprediction
Interpretation: The predicted value (6.00) is the baseline (10.33) adjusted by the total contribution of all features (Total SHAP = -4.63). The difference of -4.00 indicates how far off the prediction is from the actual value.

Record Index (from test set): 371
Actual Value: 12.00
Predicted Value: 11.53
Total SHAP Value: 0.90
Baseline Value: 10.33
Baseline + Total SHAP: 11.22
Difference (Predicted - Actual): -0.47
Status: Accurate prediction (within +/- 0.5)
Interpretation: The predicted value (11.53) is the baseline (10.33) adjusted by the total contribution of all features (Total SHAP = 0.90). The difference of -0.47 indicates how far off the prediction is from the actual value.

Record Index (from test set): 248
Actual Value: 5.00
Predi