# 🔍 Murder in the Machine Learning Manor 🔎

## A Data Science Detective Investigation

![Crime Scene](data/assets/1.png)

### 📱 BREAKING NEWS 📱

**TRAGEDY AT MACHINE LEARNING MANOR**: Renowned data scientist Professor Reginald "Regressor" Fisher has been found dead in his study during the annual International Conference on Statistical Learning. The cause of death appears to be blunt force trauma from what investigators believe to be a vintage calculating machine.

**Detective's Note**: _You've been called in as data science detectives to solve this case using your machine learning expertise. Eight suspects were at the manor during the time of the murder. Each has motives, alibis, and various characteristics that may point to their guilt or innocence. Your job is to analyze the evidence and identify the killer using the techniques you've learned in class._

**Your Task**: Work through this notebook, analyze the evidence, and build different models to identify the killer. You'll discover that some models struggle with certain evidence patterns, while others might just crack the case!

## Case Setup

First, let's import the necessary detective tools (libraries) and examine the evidence (data).

In [None]:
# Import our detective tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import modeling libraries (import only what you need)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Set aesthetic style of the plots
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = [10, 6]

np.random.seed(42)  # The answer to everything (DO NOT MODIFY THIS)

## Case Background

**Detective's Note**: _You have access to two crucial datasets:_

1. **Previous Case Files** (`previous_murders_training_data.csv`): Records from previous solved cases with known guilt scores.
2. **Current Case Evidence** (`current_murder_evidence.csv`): Evidence collected from the current murder investigation without known guilt scores.

_Your mission is to analyze patterns from previous cases to determine who is most likely guilty in the current case._

## Part 1: Examining Previous Cases

![Evidence Locker](data/assets/2.png)

**Detective's Note**: _Let's first examine the records from previous cases to understand what factors are associated with guilt._

In [None]:
# Load the previous case files
previous_cases_file = 'data/previous_murders_training_data.csv'
previous_cases = pd.read_csv(previous_cases_file)

# Display the first few rows to understand the data structure
previous_cases.head()

**Detective's Note**: _These previous cases contain a 'guilt_score' column which indicates how likely each suspect was to have committed the crime (higher values = more likely to be guilty). The other columns represent evidence, characteristics, and circumstances surrounding each suspect._

In [None]:
# Examine the structure of the previous cases
# 1. Print the dataset shape
print(f"Dataset shape: {previous_cases.shape}")

# 2. Check for missing values
print("\nMissing values per column:")
print(previous_cases.isnull().sum())

# 3. Examine the distribution of guilt scores
print("\nGuilt score statistics:")
print(previous_cases['guilt_score'].describe())

# Plot the distribution of guilt scores
plt.figure(figsize=(10, 6))
sns.histplot(previous_cases['guilt_score'], bins=30, kde=True)
plt.title('Distribution of Guilt Scores in Previous Cases')
plt.xlabel('Guilt Score')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Let's examine if there are any extremely high guilt scores 
# This might indicate "smoking gun" evidence patterns
high_guilt = previous_cases[previous_cases['guilt_score'] > 0.85]
print(f"Number of high guilt cases (>0.85): {len(high_guilt)}")

if len(high_guilt) > 0:
    print("\nExample of high guilt cases:")
    print(high_guilt.head())
    
    # Let's see what features these high guilt cases have in common
    print("\nFeature statistics for high guilt cases:")
    numeric_high_guilt = high_guilt.select_dtypes(include=['number'])
    print(numeric_high_guilt.describe().round(2).T[['mean', 'min', 'max']])

In [None]:
# Create a heatmap to visualize the correlation between features and guilt
# 1. Select the numeric columns
numeric_columns = previous_cases.select_dtypes(include=['number']).columns

# 2. Calculate the correlation matrix
correlation_matrix = previous_cases[numeric_columns].corr()

# 3. Create a heatmap visualization
plt.figure(figsize=(14, 12))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Numeric Features')
plt.tight_layout()
plt.show()

# Let's also look at the top correlations with guilt_score specifically
guilt_correlations = correlation_matrix['guilt_score'].drop('guilt_score').sort_values(ascending=False)
print("\nTop features correlated with guilt:")
print(guilt_correlations.head(10))

print("\nFeatures negatively correlated with guilt:")
print(guilt_correlations.tail(5))

**Detective's Note**: _Interestingly, while the average correlations can be informative, they might not tell the whole story. In many criminal cases, a single piece of damning evidence (a "smoking gun") can be more important than many weak correlations. Let's keep this in mind as we build our models._
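
To make the "smoking gun" idea concrete, here is a minimal sketch on synthetic data (not the case files). The column names `flag_a`, `flag_b`, `flag_c` and the guilt formula are assumptions chosen purely for illustration: guilt jumps only when all three flags occur together, so each flag on its own shows a modest correlation while the combined pattern tracks guilt closely.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical binary evidence flags (illustrative names, not real columns)
toy = pd.DataFrame({
    "flag_a": rng.integers(0, 2, n),
    "flag_b": rng.integers(0, 2, n),
    "flag_c": rng.integers(0, 2, n),
})

# Assumed rule: guilt is low by default and jumps only when all flags co-occur
all_three = (toy["flag_a"] == 1) & (toy["flag_b"] == 1) & (toy["flag_c"] == 1)
toy["guilt"] = 0.2 + 0.7 * all_three + rng.normal(0, 0.02, n)

# Correlation of each single flag with guilt
single_corr = toy[["flag_a", "flag_b", "flag_c"]].corrwith(toy["guilt"])

# Correlation of the combined pattern with guilt
combo_corr = all_three.astype(int).corr(toy["guilt"])

print(single_corr.round(2))
print(f"combined pattern: {combo_corr:.2f}")
```

A correlation heatmap only shows the per-feature numbers, which is why a combination like this can go unnoticed at this stage.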

**Group Discussion (5 minutes)**: 
- What factors appear to be correlated with guilt in previous cases?
- Are there any surprising relationships in the data?
- What evidence would you prioritize if you were investigating a new case?

## Part 2: Building Detective Models

![Detective at Desk](data/assets/4.png)

**Detective's Note**: _Now that we understand the previous cases, let's build different detective models to learn patterns of guilt. Each model has its own approach to analyzing evidence._

In [None]:
# Prepare previous cases data for modeling
# 1. Separate features (X) and target (y = guilt_score)
# First, drop non-predictive columns
X = previous_cases.drop(['suspect_id', 'suspect_name', 'guilt_score'], axis=1)
y = previous_cases['guilt_score']

# 2. Handle categorical variables using one-hot encoding
X = pd.get_dummies(X, columns=['relationship_to_victim'], drop_first=False)

# 3. Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Standardize numerical features
# First identify numeric columns (excluding binary/one-hot encoded features)
numeric_cols = X.select_dtypes(include=['float64', 'int64']).columns
binary_cols = [col for col in X.columns if col.startswith('relationship_to_victim_')]
numeric_cols = [col for col in numeric_cols if col not in binary_cols]

# Create scaler and fit on training data only
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_val_scaled = X_val.copy()

# Apply scaling to numeric columns only
X_train_scaled[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val_scaled[numeric_cols] = scaler.transform(X_val[numeric_cols])

print(f"Training features shape: {X_train_scaled.shape}")
print(f"Validation features shape: {X_val_scaled.shape}")

### Detective Model 1: Linear Regression

**Detective's Note**: _This model analyzes evidence by looking at the overall relationships between each piece of evidence and guilt. It treats all data points equally and focuses on average patterns rather than specific combinations of evidence._
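
As a rough illustration of "average patterns", the sketch below fits a linear regression to synthetic data in which guilt is high only when two hypothetical clues are both present. The data and the values 0.1 and 0.9 are assumptions for this example; the point is that an additive model spreads the effect across both clues and falls short of the true score for the combined case.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Two hypothetical binary clues (synthetic, for illustration only)
X_toy = rng.integers(0, 2, size=(1000, 2)).astype(float)
both = (X_toy[:, 0] == 1) & (X_toy[:, 1] == 1)

# Assumed rule: guilt is 0.9 when both clues are present, 0.1 otherwise
y_toy = 0.1 + 0.8 * both

lin = LinearRegression().fit(X_toy, y_toy)

# Prediction for a suspect with both clues, compared with the true value 0.9
pred_both = lin.predict(np.array([[1.0, 1.0]]))[0]
toy_r2 = lin.score(X_toy, y_toy)

print(f"coefficients: {lin.coef_.round(2)}")
print(f"prediction when both clues are present: {pred_both:.2f} (true: 0.90)")
print(f"R² on the toy data: {toy_r2:.2f}")
```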

In [None]:
# Train a linear regression model on previous cases
# 1. Create and fit a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# 2. Evaluate model performance (R²)
train_r2 = linear_model.score(X_train_scaled, y_train)
val_r2 = linear_model.score(X_val_scaled, y_val)

print(f"Linear Regression Model - Training R²: {train_r2:.4f}")
print(f"Linear Regression Model - Validation R²: {val_r2:.4f}")

# Make predictions and calculate RMSE
y_val_pred = linear_model.predict(X_val_scaled)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f"Linear Regression Model - Validation RMSE: {val_rmse:.4f}")

# 3. Examine coefficients to see what evidence this model values
# Create a DataFrame with feature names and their coefficients
coefficients = pd.DataFrame({
    'Feature': X_train_scaled.columns,
    'Coefficient': linear_model.coef_
})

# Sort by absolute value of coefficient (most important features first)
coefficients['Abs_Coefficient'] = np.abs(coefficients['Coefficient'])
coefficients = coefficients.sort_values('Abs_Coefficient', ascending=False).reset_index(drop=True)

# Display the top 10 most important features according to linear regression
print("\nTop 10 most important features according to Linear Regression:")
print(coefficients.head(10))

# Plot the top 15 features by coefficient magnitude
plt.figure(figsize=(12, 8))
top_features = coefficients.head(15)
colors = ['green' if c > 0 else 'red' for c in top_features['Coefficient']]
sns.barplot(x='Coefficient', y='Feature', data=top_features, palette=colors)
plt.title('Top 15 Feature Importance in Linear Regression Model')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

### Detective Model 2: Decision Tree

**Detective's Note**: _This model works like a detective asking a series of yes/no questions about the evidence to determine guilt. It can find patterns in specific combinations of evidence that might be missed by linear models._
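
The "series of yes/no questions" can be printed directly. The sketch below trains a small tree on synthetic data (two hypothetical clues named `clue_a` and `clue_b`, with guilt assumed to be high only when both are present) and uses scikit-learn's `export_text` to show the questions it learned.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)

# Two hypothetical binary clues (synthetic, for illustration only)
X_toy = rng.integers(0, 2, size=(500, 2)).astype(float)
both = (X_toy[:, 0] == 1) & (X_toy[:, 1] == 1)

# Assumed rule: guilt is 0.9 when both clues are present, 0.1 otherwise
y_toy = 0.1 + 0.8 * both

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Each indented line is one yes/no question on a clue; leaves show the predicted guilt
rules = export_text(toy_tree, feature_names=["clue_a", "clue_b"])
print(rules)

# The tree isolates the "both clues" combination in its own leaf
pred_both = toy_tree.predict(np.array([[1.0, 1.0]]))[0]
print(f"prediction when both clues are present: {pred_both:.2f}")
```

The same call works on `tree_model` from the cell below once it has been trained, passing `feature_names=list(X_train.columns)`.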

In [None]:
# Train a decision tree regression model
# 1. Create and fit a decision tree regressor with increased depth to capture complex patterns
tree_model = DecisionTreeRegressor(max_depth=8, random_state=42)
tree_model.fit(X_train, y_train)  # Note: Trees don't require scaling

# 2. Evaluate model performance (R²)
tree_train_r2 = tree_model.score(X_train, y_train)
tree_val_r2 = tree_model.score(X_val, y_val)

print(f"Decision Tree Model - Training R²: {tree_train_r2:.4f}")
print(f"Decision Tree Model - Validation R²: {tree_val_r2:.4f}")

# Make predictions and calculate RMSE
y_val_pred_tree = tree_model.predict(X_val)
tree_val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred_tree))
print(f"Decision Tree Model - Validation RMSE: {tree_val_rmse:.4f}")

# 3. Extract and visualize feature importance
tree_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': tree_model.feature_importances_
})
tree_importances = tree_importances.sort_values('Importance', ascending=False).reset_index(drop=True)

# Display the top 10 most important features according to the decision tree
print("\nTop 10 most important features according to Decision Tree:")
print(tree_importances.head(10))

# Plot the top 15 features by importance
plt.figure(figsize=(12, 8))
top_tree_features = tree_importances.head(15)
sns.barplot(x='Importance', y='Feature', data=top_tree_features, color='skyblue')
plt.title('Top 15 Feature Importance in Decision Tree Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

# Visualize the tree structure (first few levels)
plt.figure(figsize=(20, 10))
plot_tree(tree_model, feature_names=list(X_train.columns), filled=True, max_depth=3, fontsize=10)
plt.title('Decision Tree Structure (First 3 Levels)')
plt.tight_layout()
plt.show()

### Detective Model 3: Random Forest

**Detective's Note**: _This model is like a team of detectives, each examining the evidence from a slightly different angle, then coming together to make a final determination. Random Forests are especially good at identifying specific patterns or "smoking gun" evidence that might be buried in a sea of other information._
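
The "team of detectives" picture can be checked on a small example: a `RandomForestRegressor` prediction is the average of the predictions made by its individual trees. The sketch below uses synthetic data (an assumption for illustration, not the case files) and compares a manual average over `estimators_` with the forest's own output.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic features and a target that depends on an interaction (illustrative only)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy[:, 0] * X_toy[:, 1] + rng.normal(0, 0.1, 300)

toy_forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Ask every individual tree ("detective") for its opinion on the first five rows
per_tree = np.stack([tree.predict(X_toy[:5]) for tree in toy_forest.estimators_])

# Average the individual opinions and compare with the forest's prediction
manual_mean = per_tree.mean(axis=0)
forest_pred = toy_forest.predict(X_toy[:5])

print("spread of tree opinions per row:", per_tree.std(axis=0).round(3))
print("manual average matches forest:", np.allclose(manual_mean, forest_pred))
```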

In [None]:
# Train a random forest regression model
# 1. Create and fit a random forest regressor with increased complexity
forest_model = RandomForestRegressor(n_estimators=100, max_depth=15, min_samples_leaf=2, random_state=42)
forest_model.fit(X_train, y_train)  # Random Forests don't require scaling

# 2. Evaluate model performance (R²)
forest_train_r2 = forest_model.score(X_train, y_train)
forest_val_r2 = forest_model.score(X_val, y_val)

print(f"Random Forest Model - Training R²: {forest_train_r2:.4f}")
print(f"Random Forest Model - Validation R²: {forest_val_r2:.4f}")

# Make predictions and calculate RMSE
y_val_pred_forest = forest_model.predict(X_val)
forest_val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred_forest))
print(f"Random Forest Model - Validation RMSE: {forest_val_rmse:.4f}")

# 3. Extract and visualize feature importance
forest_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': forest_model.feature_importances_
})
forest_importances = forest_importances.sort_values('Importance', ascending=False).reset_index(drop=True)

# Display the top 10 most important features according to the random forest
print("\nTop 10 most important features according to Random Forest:")
print(forest_importances.head(10))

# Plot the top 15 features by importance
plt.figure(figsize=(12, 8))
top_forest_features = forest_importances.head(15)
sns.barplot(x='Importance', y='Feature', data=top_forest_features, color='green')
plt.title('Top 15 Feature Importance in Random Forest Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

### Let's check how our models handle high-guilt cases

![Detective at Desk](data/assets/3.png)

**Detective's Note**: _We should specifically look at how well each model performs at identifying very high guilt cases. These would be the "smoking gun" patterns that we want to detect in our current case._

In [None]:
# Check how our models perform on the very high guilt cases
high_guilt_indices = y_val.index[y_val > 0.85]

if len(high_guilt_indices) > 0:
    high_guilt_X = X_val.loc[high_guilt_indices]
    high_guilt_X_scaled = X_val_scaled.loc[high_guilt_indices]
    high_guilt_y = y_val.loc[high_guilt_indices]
    
    # Make predictions
    linear_high_guilt_pred = linear_model.predict(high_guilt_X_scaled)
    tree_high_guilt_pred = tree_model.predict(high_guilt_X)
    forest_high_guilt_pred = forest_model.predict(high_guilt_X)
    
    # Compare results
    print("Performance on high guilt cases (guilt > 0.85):")
    print(f"Linear regression average prediction: {np.mean(linear_high_guilt_pred):.4f}")
    print(f"Decision tree average prediction: {np.mean(tree_high_guilt_pred):.4f}")
    print(f"Random forest average prediction: {np.mean(forest_high_guilt_pred):.4f}")
    print(f"Actual average guilt: {np.mean(high_guilt_y):.4f}")
    
    # Plot individual predictions
    results = pd.DataFrame({
        'Actual': high_guilt_y,
        'Linear': linear_high_guilt_pred,
        'Tree': tree_high_guilt_pred,
        'Forest': forest_high_guilt_pred
    })
    
    # DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
    results.plot(kind='bar', figsize=(12, 6))
    plt.title('Model Predictions on High Guilt Cases')
    plt.ylabel('Guilt Score')
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()
else:
    print("No high guilt cases (guilt > 0.85) found in the validation set.")

In [None]:
# Compare model performances
# 1. Create a comparison visualization of all three model performances
models = ['Linear Regression', 'Decision Tree', 'Random Forest']
train_scores = [train_r2, tree_train_r2, forest_train_r2]
val_scores = [val_r2, tree_val_r2, forest_val_r2]
rmse_scores = [val_rmse, tree_val_rmse, forest_val_rmse]

# Plot R² comparison
plt.figure(figsize=(12, 6))
X_axis = np.arange(len(models))
width = 0.35

plt.bar(X_axis - width/2, train_scores, width, label='Training R²')
plt.bar(X_axis + width/2, val_scores, width, label='Validation R²')

plt.xticks(X_axis, models)
plt.xlabel('Model')
plt.ylabel('R² Score')
plt.title('Model Performance Comparison (R²)')
plt.legend()
plt.tight_layout()
plt.show()

# Plot RMSE comparison
plt.figure(figsize=(10, 6))
plt.bar(models, rmse_scores, color='coral')
plt.xlabel('Model')
plt.ylabel('RMSE (lower is better)')
plt.title('Model Performance Comparison (RMSE)')
plt.tight_layout()
plt.show()

# 2. Let's compare which features each model considers most important
# First, normalize each model's importance scores to a 0-1 scale (divide by the largest value)
linear_importance = coefficients.copy()
linear_importance['Normalized_Importance'] = linear_importance['Abs_Coefficient'] / linear_importance['Abs_Coefficient'].max()

tree_importances['Normalized_Importance'] = tree_importances['Importance'] / tree_importances['Importance'].max()
forest_importances['Normalized_Importance'] = forest_importances['Importance'] / forest_importances['Importance'].max()

# Identify top 5 features for each model
print("Top 5 features by model:")
print("\nLinear Regression:")
print(linear_importance[['Feature', 'Coefficient']].head(5))

print("\nDecision Tree:")
print(tree_importances[['Feature', 'Importance']].head(5))

print("\nRandom Forest:")
print(forest_importances[['Feature', 'Importance']].head(5))

**Group Discussion (10 minutes)**:
- Which model performed best on previous cases?
- What evidence did each model consider most important?
- Why might different models value different types of evidence?
- What kinds of evidence patterns might tree-based models detect that linear models cannot?

## Part 3: Investigating the Current Case

**Detective's Note**: _Now it's time to apply our trained detective models to the current murder case. Let's load the evidence and see who each model identifies as the most likely culprit._

In [None]:
# Load the current case evidence
current_case_file = 'data/current_murder_evidence.csv'
current_evidence = pd.read_csv(current_case_file)

# Display the first few rows
current_evidence.head()

In [None]:
# Examine the structure of the current evidence
# 1. Check the dataset shape
print(f"Current evidence dataset shape: {current_evidence.shape}")

# 2. Identify the suspects in this case
suspects = current_evidence['suspect_name'].unique()
print(f"\nSuspects in the current case ({len(suspects)}):")
for i, suspect in enumerate(suspects, 1):
    print(f"{i}. {suspect}")

# Count data points per suspect
suspect_counts = current_evidence['suspect_name'].value_counts()
print("\nNumber of evidence points per suspect:")
print(suspect_counts)

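# 3. Confirm there's no guilt_score column
if 'guilt_score' in current_evidence.columns:
    print("\nWarning: 'guilt_score' is present in the current evidence; drop it before predicting.")
else:
    print("\nNote: As expected, there is no 'guilt_score' column in the current evidence.")
    print("We'll need to use our models to predict it.")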

In [None]:
# Prepare the current case data for prediction
# 1. Apply the same preprocessing steps used on the previous cases
# First, separate the suspect identification columns
suspect_info = current_evidence[['suspect_id', 'suspect_name']]
X_current = current_evidence.drop(['suspect_id', 'suspect_name'], axis=1)

# Check for any missing values
print("Missing values in current evidence:")
print(X_current.isnull().sum().sum())

if X_current.isnull().sum().sum() > 0:
    print("Columns with missing values:")
    print(X_current.columns[X_current.isnull().any()].tolist())
    
    # Fill missing values with appropriate strategies
    # For this case, we'll use median for numeric columns and mode for categorical
    for col in X_current.columns:
        if X_current[col].isnull().any():
            if X_current[col].dtype.kind in 'iuf':
                # Numeric column (integer or float) - use median
                X_current[col] = X_current[col].fillna(X_current[col].median())
            else:
                # Categorical column - use mode
                X_current[col] = X_current[col].fillna(X_current[col].mode()[0])

# 2. Handle categorical variables with one-hot encoding
# Get the same columns as in training
X_current_encoded = pd.get_dummies(X_current, columns=['relationship_to_victim'], drop_first=False)

# 3. Ensure feature columns match those used in training
# Get the list of columns from the training data
training_columns = X_train.columns.tolist()

# Find columns in current data that weren't in training data
new_columns = [col for col in X_current_encoded.columns if col not in training_columns]
if new_columns:
    print(f"New columns found in current evidence: {new_columns}")
    # Drop columns that weren't in training data
    X_current_encoded = X_current_encoded.drop(columns=new_columns)

# Add missing columns that were in training data but not in current data
missing_columns = [col for col in training_columns if col not in X_current_encoded.columns]
if missing_columns:
    print(f"Missing columns in current evidence: {missing_columns}")
    # Add missing columns with zeros
    for col in missing_columns:
        X_current_encoded[col] = 0

# Ensure columns are in the same order as in training
X_current_encoded = X_current_encoded[training_columns]

# 4. Apply the same scaling to numeric features
X_current_scaled = X_current_encoded.copy()
X_current_scaled[numeric_cols] = scaler.transform(X_current_encoded[numeric_cols])

print(f"Current evidence features shape: {X_current_encoded.shape}")
print(f"Original training features shape: {X_train.shape}")
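
The column-matching steps in the cell above (drop unseen columns, add missing ones as zeros, restore the training order) can also be written as a single `DataFrame.reindex` call. The sketch below shows this on a tiny made-up example; the column names `motive` and `relationship` and their values are hypothetical stand-ins, not columns from the case files.

```python
import pandas as pd

# Toy "training" data with three relationship categories (hypothetical values)
train_toy = pd.get_dummies(
    pd.DataFrame({
        "motive": [0.2, 0.9, 0.5],
        "relationship": ["colleague", "rival", "student"],
    }),
    columns=["relationship"],
)

# Toy "current case" data: one category is new, two seen in training are absent
new_toy = pd.get_dummies(
    pd.DataFrame({
        "motive": [0.7, 0.1],
        "relationship": ["rival", "stranger"],
    }),
    columns=["relationship"],
)

# Keep only training columns, in training order; fill absent ones with 0
aligned = new_toy.reindex(columns=train_toy.columns, fill_value=0)

print(aligned)
```

Whether the one-liner or the explicit loop is preferable is a matter of taste; the loop prints which columns were added or dropped, which can help when debugging.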

In [None]:
# Apply each model to predict guilt scores for the current case
# 1. Use each trained model to predict guilt scores
linear_predictions = linear_model.predict(X_current_scaled)
tree_predictions = tree_model.predict(X_current_encoded)
forest_predictions = forest_model.predict(X_current_encoded)

# 2. Add these predictions to the evidence dataframe
results = pd.DataFrame({
    'suspect_id': suspect_info['suspect_id'],
    'suspect_name': suspect_info['suspect_name'],
    'linear_guilt': linear_predictions,
    'tree_guilt': tree_predictions,
    'forest_guilt': forest_predictions
})

# Display a few predictions
print("Sample of predictions:")
results.head(10)

## Part 4: Solving the Case

![Case Solved](data/assets/5.png)

**Detective's Note**: _Let's analyze both the average and maximum guilt scores predicted by each model for each suspect. We need to be especially vigilant for potential "smoking gun" evidence that might only appear in a few data points._
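
To see why the average and the maximum can disagree, consider this small sketch with made-up scores for two hypothetical suspects, "A" and "B". Suspect A is moderately suspicious across the board, while suspect B has one very high-scoring evidence point among otherwise low scores.

```python
import pandas as pd

# Made-up per-evidence-point guilt scores (illustrative values only)
toy_results = pd.DataFrame({
    "suspect_name": ["A", "A", "A", "B", "B", "B"],
    "forest_guilt": [0.50, 0.55, 0.45, 0.20, 0.25, 0.95],
})

# Compute both summaries per suspect in one pass
summary = toy_results.groupby("suspect_name")["forest_guilt"].agg(["mean", "max"])
print(summary.round(4))

# The two summaries can nominate different top suspects
top_by_mean = summary["mean"].idxmax()
top_by_max = summary["max"].idxmax()
print(f"Top suspect by average guilt: {top_by_mean}")
print(f"Top suspect by maximum guilt: {top_by_max}")
```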

In [None]:
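# Calculate average predicted guilt for each suspect by each model
# 1. Group by suspect_name and compute the mean of each model's predictions
#    (select the prediction columns explicitly so suspect_id is not averaged)
avg_guilt = results.groupby('suspect_name')[['linear_guilt', 'tree_guilt', 'forest_guilt']].mean().reset_index()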

# Round to 4 decimal places for readability
avg_guilt['linear_guilt'] = avg_guilt['linear_guilt'].round(4)
avg_guilt['tree_guilt'] = avg_guilt['tree_guilt'].round(4)
avg_guilt['forest_guilt'] = avg_guilt['forest_guilt'].round(4)

# 2. Create a comparison table showing each model's top suspects by average guilt
print("Average guilt scores by model:")
print(avg_guilt[['suspect_name', 'linear_guilt', 'tree_guilt', 'forest_guilt']])

# Create separate DataFrames for each model's average guilt rankings
linear_ranking = avg_guilt[['suspect_name', 'linear_guilt']].sort_values('linear_guilt', ascending=False).reset_index(drop=True)
tree_ranking = avg_guilt[['suspect_name', 'tree_guilt']].sort_values('tree_guilt', ascending=False).reset_index(drop=True)
forest_ranking = avg_guilt[['suspect_name', 'forest_guilt']].sort_values('forest_guilt', ascending=False).reset_index(drop=True)

# Display each model's suspect rankings by average guilt
print("\nLinear Regression Model - Suspect Rankings (Average Guilt):")
print(linear_ranking)

print("\nDecision Tree Model - Suspect Rankings (Average Guilt):")
print(tree_ranking)

print("\nRandom Forest Model - Suspect Rankings (Average Guilt):")
print(forest_ranking)

In [None]:
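# Now calculate the MAXIMUM guilt score for each suspect to detect smoking gun evidence
# (restrict to the prediction columns so suspect_id is not included)
max_guilt = results.groupby('suspect_name')[['linear_guilt', 'tree_guilt', 'forest_guilt']].max().reset_index()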

# Round to 4 decimal places for readability
max_guilt['linear_guilt'] = max_guilt['linear_guilt'].round(4)
max_guilt['tree_guilt'] = max_guilt['tree_guilt'].round(4)
max_guilt['forest_guilt'] = max_guilt['forest_guilt'].round(4)

# Create a comparison table showing each model's top suspects by maximum guilt
print("Maximum guilt scores by model:")
print(max_guilt[['suspect_name', 'linear_guilt', 'tree_guilt', 'forest_guilt']])

# Create separate DataFrames for each model's maximum guilt rankings
max_linear_ranking = max_guilt[['suspect_name', 'linear_guilt']].sort_values('linear_guilt', ascending=False).reset_index(drop=True)
max_tree_ranking = max_guilt[['suspect_name', 'tree_guilt']].sort_values('tree_guilt', ascending=False).reset_index(drop=True)
max_forest_ranking = max_guilt[['suspect_name', 'forest_guilt']].sort_values('forest_guilt', ascending=False).reset_index(drop=True)

# Display each model's suspect rankings by maximum guilt
print("\nLinear Regression Model - Suspect Rankings (Maximum Guilt):")
print(max_linear_ranking)

print("\nDecision Tree Model - Suspect Rankings (Maximum Guilt):")
print(max_tree_ranking)

print("\nRandom Forest Model - Suspect Rankings (Maximum Guilt):")
print(max_forest_ranking)

In [None]:
# Find data points where the Random Forest predicts a very high guilt score (potential smoking gun)
high_guilt_threshold = 0.75
high_guilt_points = results[results['forest_guilt'] > high_guilt_threshold]

if not high_guilt_points.empty:
    print(f"Found {len(high_guilt_points)} data points with guilt scores above {high_guilt_threshold}")
    print(high_guilt_points[['suspect_name', 'linear_guilt', 'tree_guilt', 'forest_guilt']].sort_values('forest_guilt', ascending=False))
    
    # Let's examine the smoking gun evidence for the top suspect
    top_high_guilt_idx = high_guilt_points['forest_guilt'].idxmax()
    smoking_gun_evidence = current_evidence.loc[top_high_guilt_idx]
    
    print(f"\nSmoking gun evidence for {smoking_gun_evidence['suspect_name']}:")
    
    # Show the key features that might be contributing to this high guilt score
    important_features = ['alibi_strength', 'motive_strength', 'prior_conflict', 'fingerprints_at_scene',
                         'dna_match_strength', 'at_scene_during_murder', 'had_opportunity',
                         'suspicious_behavior', 'inconsistent_statements']
    
    for feature in important_features:
        if feature in smoking_gun_evidence:
            print(f"{feature}: {smoking_gun_evidence[feature]}")
else:
    print(f"No data points found with guilt scores above {high_guilt_threshold}")

In [None]:
# Create a visualization comparing the suspects across models (average guilt)
# 1. Bar chart showing average guilt scores by model and suspect
# Prepare data for plotting
plot_data = pd.melt(avg_guilt,
                   id_vars=['suspect_name'], 
                   value_vars=['linear_guilt', 'tree_guilt', 'forest_guilt'],
                   var_name='Model', 
                   value_name='Average Guilt Score')

# Map model names to more readable labels
plot_data['Model'] = plot_data['Model'].map({
    'linear_guilt': 'Linear Regression',
    'tree_guilt': 'Decision Tree',
    'forest_guilt': 'Random Forest'
})

# Create the bar chart
plt.figure(figsize=(14, 8))
sns.barplot(x='suspect_name', y='Average Guilt Score', hue='Model', data=plot_data)
plt.xticks(rotation=45, ha='right')
plt.title('Average Predicted Guilt Scores by Model and Suspect')
plt.xlabel('Suspect')
plt.ylabel('Average Guilt Score')
plt.legend(title='Model')
plt.tight_layout()
plt.show()

In [None]:
# Create a visualization comparing the suspects across models (maximum guilt)
# Prepare data for plotting maximum guilt
max_plot_data = pd.melt(max_guilt,
                       id_vars=['suspect_name'], 
                       value_vars=['linear_guilt', 'tree_guilt', 'forest_guilt'],
                       var_name='Model', 
                       value_name='Maximum Guilt Score')

# Map model names to more readable labels
max_plot_data['Model'] = max_plot_data['Model'].map({
    'linear_guilt': 'Linear Regression',
    'tree_guilt': 'Decision Tree',
    'forest_guilt': 'Random Forest'
})

# Create the bar chart for maximum guilt
plt.figure(figsize=(14, 8))
sns.barplot(x='suspect_name', y='Maximum Guilt Score', hue='Model', data=max_plot_data)
plt.xticks(rotation=45, ha='right')
plt.title('Maximum Predicted Guilt Scores by Model and Suspect')
plt.xlabel('Suspect')
plt.ylabel('Maximum Guilt Score')
plt.legend(title='Model')
plt.tight_layout()
plt.show()

In [None]:
# Let's create heatmaps for both average and maximum guilt
# 1. Average guilt heatmap
guilt_pivot_avg = avg_guilt.set_index('suspect_name')[['linear_guilt', 'tree_guilt', 'forest_guilt']]
guilt_pivot_avg.columns = ['Linear Regression', 'Decision Tree', 'Random Forest']

plt.figure(figsize=(12, 8))
ax = sns.heatmap(guilt_pivot_avg, annot=True, cmap='YlOrRd', linewidths=0.5, fmt='.4f')
plt.title('Average Guilt Score Heatmap by Model and Suspect')
plt.tight_layout()
plt.show()

# 2. Maximum guilt heatmap
guilt_pivot_max = max_guilt.set_index('suspect_name')[['linear_guilt', 'tree_guilt', 'forest_guilt']]
guilt_pivot_max.columns = ['Linear Regression', 'Decision Tree', 'Random Forest']

plt.figure(figsize=(12, 8))
ax = sns.heatmap(guilt_pivot_max, annot=True, cmap='YlOrRd', linewidths=0.5, fmt='.4f')
plt.title('Maximum Guilt Score Heatmap by Model and Suspect')
plt.tight_layout()
plt.show()

**Detective's Note**: _The MAXIMUM guilt scores reveal something very interesting! While the average guilt would point to one suspect, looking at the specific pattern or "smoking gun" evidence points to another suspect entirely. Tree-based models are particularly good at identifying these specific patterns._

In [None]:
# Analyze evidence patterns for the top suspects by both average and maximum guilt
# Get top suspect from each approach
avg_top_suspect = forest_ranking.iloc[0]['suspect_name']
max_top_suspect = max_forest_ranking.iloc[0]['suspect_name']

print(f"Top suspect by average guilt (Random Forest): {avg_top_suspect}")
print(f"Top suspect by maximum guilt (Random Forest): {max_top_suspect}")

# Let's get the evidence for these top suspects (deduplicated, order preserved)
top_suspects = list(dict.fromkeys([avg_top_suspect, max_top_suspect]))

for suspect in top_suspects:
    suspect_data = current_evidence[current_evidence['suspect_name'] == suspect]
    suspect_results = results[results['suspect_name'] == suspect]
    
    print(f"\nEvidence summary for {suspect}:")
    numeric_evidence = suspect_data.select_dtypes(include=['number'])
    print(numeric_evidence.describe().T[['mean', 'min', 'max']])
    
    # Calculate the percentage of times certain key binary flags are True
    # (named flag_cols to avoid overwriting binary_cols from the preprocessing step)
    flag_cols = ['at_scene_during_murder', 'had_opportunity', 'fingerprints_at_scene', 'prior_conflict']
    binary_percent = {}
    for col in flag_cols:
        if col in suspect_data.columns:
            binary_percent[col] = suspect_data[col].mean() * 100
    
    print("\nBinary evidence percentages:")
    for col, percentage in binary_percent.items():
        print(f"{col}: {percentage:.1f}%")
        
    # Check relationship to victim
    relationship_counts = suspect_data['relationship_to_victim'].value_counts(normalize=True) * 100
    print("\nRelationship to victim percentages:")
    for relationship, percentage in relationship_counts.items():
        print(f"{relationship}: {percentage:.1f}%")
        
    # Find maximum guilt score for each model
    max_linear = suspect_results['linear_guilt'].max()
    max_tree = suspect_results['tree_guilt'].max()
    max_forest = suspect_results['forest_guilt'].max()
    
    print(f"\n{suspect} - Maximum guilt scores:")
    print(f"Linear Regression: {max_linear:.4f}")
    print(f"Decision Tree: {max_tree:.4f}")
    print(f"Random Forest: {max_forest:.4f}")
    
    # If this is the smoking gun suspect, let's examine the specific evidence point
    if max_forest > 0.8:  # Only look at very high guilt scores
        smoking_gun_idx = suspect_results['forest_guilt'].idxmax()
        smoking_gun = current_evidence.loc[smoking_gun_idx]
        
        print(f"\nSmoking gun evidence for {suspect}:")
        # Display key evidence factors that might be part of a pattern
        key_factors = ['dna_match_strength', 'alibi_strength', 'at_scene_during_murder', 
                       'had_opportunity', 'fingerprints_at_scene', 'suspicious_behavior',
                       'witness_testimony', 'motive_strength', 'prior_conflict', 'time_of_arrival',
                       'time_of_departure', 'time_at_scene']
        
        for factor in key_factors:
            if factor in smoking_gun:
                print(f"{factor}: {smoking_gun[factor]}")

## Identifying the True Murderer

**Detective's Note**: _Our analysis reveals a fascinating contrast between the average and maximum guilt scores. While the average guilt scores point to one suspect, the maximum scores tell a different story. The presence of a specific "smoking gun" evidence pattern, which tree-based models excel at detecting, points conclusively to the actual murderer._
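
As a minimal sketch of why the two rankings can disagree, consider a toy table. The suspect labels and scores below are made up for illustration and are not taken from the case data; only the `suspect_name` column name mirrors the notebook.

```python
import pandas as pd

# Hypothetical toy scores (NOT the case data): suspect A is moderately
# suspicious on every row, suspect B is mostly clean with one extreme row.
toy = pd.DataFrame({
    "suspect_name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "guilt": [0.55, 0.60, 0.50, 0.58, 0.10, 0.15, 0.95, 0.12],
})

# Aggregate the same scores two ways and see who tops each ranking
summary = toy.groupby("suspect_name")["guilt"].agg(["mean", "max"])
top_by_mean = summary["mean"].idxmax()
top_by_max = summary["max"].idxmax()

print(summary)
print(f"Top by mean: {top_by_mean}, top by max: {top_by_max}")
```

In this toy table suspect A has the higher mean while suspect B has the higher maximum, which is the same kind of disagreement the investigation has to resolve.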

### Final Decision Analysis

Let's perform a final analysis to conclusively identify our suspect:

In [None]:
# Create a final comparison of our top suspects
# Let's look at the distributions of guilt scores for each suspect
plt.figure(figsize=(15, 10))

# Plot for each model
models = ['linear_guilt', 'tree_guilt', 'forest_guilt']
model_names = ['Linear Regression', 'Decision Tree', 'Random Forest']

for i, (model, name) in enumerate(zip(models, model_names)):
    plt.subplot(3, 1, i+1)
    
    # Create a boxplot showing the distribution of guilt scores per suspect
    sns.boxplot(x='suspect_name', y=model, data=results)
    plt.title(f'Distribution of {name} Guilt Scores by Suspect')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Guilt Score')
    if i < 2:
        plt.xlabel('')
    else:
        plt.xlabel('Suspect')
    
plt.tight_layout()
plt.show()

### Evidence for Each Prime Suspect

Let's take a closer look at the specific evidence against our prime suspects:

In [None]:
# Let's examine the specific evidence for the Security Guard Tom Johnson
security_guard_data = current_evidence[current_evidence['suspect_name'] == 'Security Guard Tom Johnson']
security_guard_results = results[results['suspect_name'] == 'Security Guard Tom Johnson']

# Attach the model scores to the evidence rows (pd.concat with axis=1 aligns on the shared index),
# then sort by forest guilt score to find the most incriminating evidence
security_guard_combined = pd.concat([security_guard_data, security_guard_results[['linear_guilt', 'tree_guilt', 'forest_guilt']]], axis=1)
security_guard_sorted = security_guard_combined.sort_values('forest_guilt', ascending=False).reset_index(drop=True)

# Look at the top 3 most incriminating pieces of evidence
print("Top 3 most incriminating evidence points for Security Guard Tom Johnson:")
for i in range(min(3, len(security_guard_sorted))):
    evidence = security_guard_sorted.iloc[i]
    print(f"\nEvidence #{i+1} - Guilt scores:")
    print(f"Linear: {evidence['linear_guilt']:.4f}, Tree: {evidence['tree_guilt']:.4f}, Forest: {evidence['forest_guilt']:.4f}")
    
    print("Key evidence factors:")
    key_factors = ['dna_match_strength', 'alibi_strength', 'at_scene_during_murder', 
                   'had_opportunity', 'fingerprints_at_scene', 'suspicious_behavior',
                   'witness_testimony', 'motive_strength', 'prior_conflict']
    
    for factor in key_factors:
        if factor in evidence:
            print(f"{factor}: {evidence[factor]}")

In [None]:
# Now let's examine the Rival Scientist Dr. Michael Brooks
rival_scientist_data = current_evidence[current_evidence['suspect_name'] == 'Rival Scientist Dr. Michael Brooks']
rival_scientist_results = results[results['suspect_name'] == 'Rival Scientist Dr. Michael Brooks']

# Attach the model scores to the evidence rows (pd.concat with axis=1 aligns on the shared index),
# then sort by linear guilt score
rival_scientist_combined = pd.concat([rival_scientist_data, rival_scientist_results[['linear_guilt', 'tree_guilt', 'forest_guilt']]], axis=1)
rival_scientist_sorted = rival_scientist_combined.sort_values('linear_guilt', ascending=False).reset_index(drop=True)

# Look at the top 3 most incriminating pieces of evidence
print("Top 3 most incriminating evidence points for Rival Scientist Dr. Michael Brooks:")
for i in range(min(3, len(rival_scientist_sorted))):
    evidence = rival_scientist_sorted.iloc[i]
    print(f"\nEvidence #{i+1} - Guilt scores:")
    print(f"Linear: {evidence['linear_guilt']:.4f}, Tree: {evidence['tree_guilt']:.4f}, Forest: {evidence['forest_guilt']:.4f}")
    
    print("Key evidence factors:")
    key_factors = ['dna_match_strength', 'alibi_strength', 'at_scene_during_murder', 
                   'had_opportunity', 'fingerprints_at_scene', 'suspicious_behavior',
                   'witness_testimony', 'motive_strength', 'prior_conflict']
    
    for factor in key_factors:
        if factor in evidence:
            print(f"{factor}: {evidence[factor]}")
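
The two cells above differ only in the suspect name and the column used for sorting. One possible way to factor out the shared steps is sketched below. The helper name `top_evidence` and the toy frames are hypothetical, and the sketch assumes, as the cells above do, that the evidence table and the score table share the same index.

```python
import pandas as pd

def top_evidence(evidence, scores, suspect, sort_by, n=3):
    """Hypothetical helper: the n highest-scoring evidence rows for one suspect."""
    mask = evidence["suspect_name"] == suspect
    score_cols = ["linear_guilt", "tree_guilt", "forest_guilt"]
    # join aligns on the index, so both frames must share it (assumption)
    combined = evidence[mask].join(scores.loc[mask, score_cols])
    ranked = combined.sort_values(sort_by, ascending=False)
    return ranked.head(n).reset_index(drop=True)

# Toy stand-ins for current_evidence and results (made-up values)
toy_evidence = pd.DataFrame({
    "suspect_name": ["A", "B", "A", "B"],
    "dna_match_strength": [0.2, 0.9, 0.7, 0.1],
})
toy_scores = pd.DataFrame({
    "linear_guilt": [0.3, 0.6, 0.5, 0.2],
    "tree_guilt": [0.1, 0.9, 0.4, 0.1],
    "forest_guilt": [0.2, 0.8, 0.6, 0.1],
})

top_a = top_evidence(toy_evidence, toy_scores, "A", sort_by="forest_guilt", n=1)
print(top_a)
```

With the notebook's own tables, a call along the lines of `top_evidence(current_evidence, results, 'Security Guard Tom Johnson', 'forest_guilt')` would be the equivalent of the first cell, under the same index assumption.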

### Final Verdict

**Detective's Conclusion:** After thorough analysis of all evidence using multiple modeling approaches, we confidently conclude that **Security Guard Tom Johnson** is the murderer.

**Key Findings:**
1. While the Rival Scientist shows consistently moderate guilt across all evidence (highest average guilt), Security Guard Tom Johnson has specific evidence points with extremely high guilt scores.
2. The random forest model detected a specific pattern of evidence (the "smoking gun") for Security Guard Tom Johnson that the linear model missed.
3. This smoking gun evidence combines a strong DNA match, a weak alibi, presence at the scene during the murder, opportunity, and fingerprints at the scene.
4. The linear model was misled by the Rival Scientist's consistently moderate guilt scores across many evidence points. A linear model adds up each factor's contribution independently, so without explicit interaction terms it tracks average patterns rather than specific combinations of evidence.
5. Tree-based models were able to detect the non-linear interactions among evidence factors that conclusively identify Security Guard Tom Johnson as the murderer.
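
The difference between the two model families in the findings above can be illustrated with a small synthetic experiment. This is only a sketch under stated assumptions: the three binary flags and the rule "guilt is high only when all three are set" are invented for the example and are not the notebook's actual data-generating process.

```python
import itertools

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic evidence (NOT the case data): every combination of
# three binary flags, repeated 50 times.
combos = list(itertools.product([0.0, 1.0], repeat=3))
X = np.array(combos * 50)

# Assumed rule for this toy: guilt is high only when all three flags are set.
y = np.where(X.all(axis=1), 1.0, 0.1)

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Score the "smoking gun" pattern with both models
pattern = np.array([[1.0, 1.0, 1.0]])
linear_score = linear.predict(pattern)[0]
tree_score = tree.predict(pattern)[0]
print(f"Linear: {linear_score:.2f}, Tree: {tree_score:.2f}")
```

On this toy data the tree isolates the all-flags-set combination and scores it at 1.0, while the linear model, which can only add per-flag contributions, scores the same pattern at about 0.55.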

## Case Closed: Final Report

Based on your investigation, prepare a final report in the ReadMe.MD file.

![Detective at Desk](data/assets/6.png)