# 🔍 Murder in the Machine Learning Manor 🔎

## A Data Science Detective Investigation

![Crime Scene](data/assets/1.png)

### 📱 BREAKING NEWS 📱

**TRAGEDY AT MACHINE LEARNING MANOR**: Renowned data scientist Professor Reginald "Regressor" Fisher has been found dead in his study during the annual International Conference on Statistical Learning. The cause of death appears to be blunt force trauma from what investigators believe to be a vintage calculating machine. Professor Fisher, famous for his work on predictive algorithms and pattern recognition, was found by his colleague Dr. Emma Clarke at approximately 10:30 PM last night. The Manor's security system recorded eight individuals on the premises at the time of the murder, all of whom are now persons of interest. Preliminary forensic analysis suggests the murder occurred between 9:15 PM and 9:45 PM, during the evening reception.

**Detective's Note**: _You've been called in as data science detectives to solve this case using your machine learning expertise. Eight suspects were at the manor during the time of the murder. Each has motives, alibis, and various characteristics that may point to their guilt or innocence. A clever murderer might try to appear innocent in most ways, with only a few tell-tale signs of guilt. Your job is to analyze the evidence and identify the killer using the techniques you've learned in class._

**Your Task**: Work through this notebook, analyzing the evidence and building different models to identify the killer. You'll discover that some models struggle with certain evidence patterns, while others might just crack the case!

## Case Setup

First, let's import the necessary detective tools (libraries) and examine the evidence (data).

In [2]:
# Import our detective tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import modeling libraries (use what you need)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Set aesthetic style of the plots
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = [10, 6]

np.random.seed(42)  # The answer to everything (DO NOT MODIFY THIS)

## Case Background

**Detective's Note**: _You have access to two crucial datasets:_

1. **Previous Case Files** (`previous_murders_training_data.csv`): Records from previous solved cases with known guilt scores.
2. **Current Case Evidence** (`current_murder_evidence.csv`): Evidence collected from the current murder investigation without known guilt scores.

_Your mission is to analyze patterns from previous cases to determine who is most likely guilty in the current case._

## Part 1: Examining Previous Cases

![Evidence Locker](data/assets/2.png)

**Detective's Note**: _Let's first examine the records from previous cases to understand what factors are associated with guilt. Just like in real detective work, understanding patterns from past crimes can help us solve the current one. Remember that criminals may leave behind different types of evidence - sometimes it's a consistent pattern of suspicious behavior, other times it might be a single, damning piece of evidence amid otherwise innocent-looking circumstances._

In [3]:
# Load the previous case files
previous_cases_file = 'data/previous_murders_training_data.csv'
previous_cases = pd.read_csv(previous_cases_file)

# Display the first few rows to understand the data structure
previous_cases.head()

| | suspect_id | suspect_name | age | height | weight | relationship_to_victim | physical_capability | access_to_scene | alibi_strength | motive_strength | ... | time_of_arrival | time_of_departure | at_scene_during_murder | time_at_scene | had_opportunity | witness_testimony | camera_evidence | suspicious_behavior | inconsistent_statements | guilt_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | Daniel Jones | 53 | 69 | 165 | colleague | 0.441861 | 1 | 0.374242 | 5.703339 | ... | -33 | 4 | 1 | 20 | 1 | 3.311828 | 4.032132 | 0.56634 | 1.195932 | 0.951991 |
| 1 | 29 | Karen White | 44 | 74 | 167 | colleague | 0.627642 | 1 | 0.471421 | 4.113072 | ... | -14 | 9 | 1 | 20 | 1 | 4.279564 | 3.153876 | 1.359889 | 1.425823 | 0.903476 |
| 2 | 16 | Thomas Garcia | 42 | 66 | 185 | stranger | 0.648821 | 1 | 1.251484 | 6.308535 | ... | -33 | 17 | 1 | 33 | 1 | 7.911329 | 6.131906 | 2.666209 | 1.830442 | 0.936273 |
| 3 | 48 | Susan Davis | 41 | 68 | 192 | stranger | 0.610546 | 1 | 1.362031 | 2.357013 | ... | -20 | 13 | 1 | 44 | 1 | 4.276733 | 6.93714 | 2.099999 | 2.449324 | 0.954769 |
| 4 | 5 | Mary Wilson | 55 | 70 | 175 | stranger | 0.419857 | 1 | 1.752331 | 4.8359 | ... | -4 | 28 | 1 | 56 | 1 | 4.753111 | 2.589281 | 1.116569 | 2.464788 | 0.921019 |


**Detective's Note**: _These previous cases contain a 'guilt_score' column which indicates how likely each suspect was to have committed the crime (higher values = more likely to be guilty). The other columns represent evidence, characteristics, and circumstances surrounding each suspect._
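
Before touching the real case files, it can help to rehearse the first inspection steps on a tiny stand-in. The sketch below is only an illustration: `toy_cases` is a made-up frame that borrows two column names from the case files, and its values are random, not real evidence.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Made-up stand-in for the case files (values are random, not real evidence).
toy_cases = pd.DataFrame({
    "motive_strength": rng.uniform(0, 10, size=100),
    "guilt_score": rng.exponential(scale=0.2, size=100),  # deliberately right-skewed
})
toy_cases.loc[3, "motive_strength"] = np.nan  # plant one missing value

# Shape: number of rows (records) and columns (pieces of evidence).
n_rows, n_cols = toy_cases.shape
print(n_rows, n_cols)

# Missing values per column.
missing_per_column = toy_cases.isna().sum()
print(missing_per_column)

# Summary statistics and skewness of the target.
guilt_summary = toy_cases["guilt_score"].describe()
guilt_skew = toy_cases["guilt_score"].skew()
print(guilt_summary)
print(guilt_skew)
```

A clearly positive skew suggests a long right tail, which is worth keeping in mind when comparing models later.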

In [None]:
# Examine the structure of the previous cases
# 1. Print the dataset shape
# YOUR CODE HERE

# 2. Check for missing values
# YOUR CODE HERE

# 3. Examine the distribution of guilt scores
# YOUR CODE HERE

# Plot the distribution of guilt scores
# YOUR CODE HERE

In [None]:
# Let's examine if there are any extremely high guilt scores 
# This might tell us something about how guilt is distributed
# YOUR CODE HERE

In [None]:
# Create a heatmap to visualize the correlation between features and guilt
# 1. Select the numeric columns
# YOUR CODE HERE

# 2. Calculate the correlation matrix
# YOUR CODE HERE

# 3. Create a heatmap visualization
# YOUR CODE HERE

# Look at the top correlations with guilt_score specifically
# YOUR CODE HERE

**Detective's Note**: _Take a close look at the distribution of guilt scores. Notice anything interesting about the shape? Also, when examining correlations, remember that overall correlations might not tell the whole story. Sometimes, what matters is a specific combination of factors rather than individual relationships. The best detectives know that criminals don't always leave obvious clues._
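
To see why overall correlations can hide a combination effect, here is a small sketch on synthetic data. The column names (`clue_a`, `clue_b`, `noise`, `guilt`) are hypothetical and the target is constructed so that it depends only on the product of two clues; nothing here comes from the case files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical toy evidence: two centered clues and one pure-noise feature.
toy = pd.DataFrame({
    "clue_a": rng.normal(size=n),
    "clue_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})

# By construction, guilt depends only on the *combination* of the two clues.
toy["guilt"] = toy["clue_a"] * toy["clue_b"]

# Correlation of each individual feature with the target.
corr_with_guilt = toy.corr()["guilt"].drop("guilt")
print(corr_with_guilt.abs().sort_values(ascending=False))

# Correlation of the target with the interaction term itself.
interaction_corr = np.corrcoef(toy["clue_a"] * toy["clue_b"], toy["guilt"])[0, 1]
print(interaction_corr)
```

In this toy setup every individual correlation is close to zero even though the two clues together determine the target completely, so a weak entry in a heatmap does not by itself rule a feature out.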

**Group Discussion (5 minutes)**: 
- What factors appear to be correlated with guilt in previous cases?
- What evidence would you prioritize if you were investigating a new case?

## Part 2: Building Detective Models

![Detective at Desk](data/assets/4.png)

**Detective's Note**: _Now that we understand the previous cases, let's build different detective models to learn patterns of guilt. Each model has its own approach to analyzing evidence. Just as detectives may approach a case with different investigation styles, each model may identify different suspects as the most likely culprit._

In [None]:
# Prepare previous cases data for modeling
# 1. Separate features (X) and target (y = guilt_score)
# YOUR CODE HERE

# 2. Handle categorical variables using one-hot encoding
# YOUR CODE HERE

# 3. Split data into training and validation sets
# YOUR CODE HERE

# 4. Standardize numerical features
# YOUR CODE HERE
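
The four preparation steps can be rehearsed on a miniature frame first. This is a sketch under stated assumptions, not the intended solution: the `cases` frame, its column names, and the 80/20 split are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Hypothetical miniature case file; column names are illustrative only.
cases = pd.DataFrame({
    "relationship": rng.choice(["colleague", "stranger", "family"], size=n),
    "motive": rng.uniform(0, 10, size=n),
    "alibi": rng.uniform(0, 3, size=n),
})
cases["guilt"] = 0.05 * cases["motive"] - 0.1 * cases["alibi"] + rng.normal(0, 0.05, size=n)

# 1. Separate features and target.
X = cases.drop(columns=["guilt"])
y = cases["guilt"]

# 2. One-hot encode the categorical column.
X = pd.get_dummies(X, columns=["relationship"], dtype=float)

# 3. Hold out a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Fit the scaler on the training split only, then reuse it for validation.
numeric_cols = ["motive", "alibi"]
scaler = StandardScaler()
X_train = X_train.copy()
X_val = X_val.copy()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])

print(X_train.shape, X_val.shape)
```

Fitting the scaler on the training split alone keeps information from the validation cases out of the preprocessing, which makes the validation score a fairer estimate.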

### Detective Model 1: Linear Regression

**Detective's Note**: _This model analyzes evidence by looking at the overall relationship between each piece of evidence and guilt. It learns one fixed weight per piece of evidence from all previous cases at once, so it captures average patterns rather than specific combinations of evidence. Think of this as a detective who scores every clue, multiplies each score by its learned weight, and adds the results into an overall suspicion level._
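
As a rough illustration of how a linear model turns evidence into weights, the sketch below fits `LinearRegression` on synthetic data. The feature names and the true weights (2.0 and -1.0) are assumptions chosen for the example, and the score is computed on the training data only to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical standardized evidence features.
X = pd.DataFrame({
    "motive": rng.normal(size=n),
    "alibi": rng.normal(size=n),
    "irrelevant": rng.normal(size=n),
})
# Assumed ground truth: guilt rises with motive and falls with alibi strength.
y = 2.0 * X["motive"] - 1.0 * X["alibi"] + rng.normal(scale=0.1, size=n)

linreg = LinearRegression().fit(X, y)
r2 = r2_score(y, linreg.predict(X))

# One learned weight per feature; rank features by absolute weight.
coefs = pd.Series(linreg.coef_, index=X.columns)
ranked = coefs.abs().sort_values(ascending=False)

print(round(r2, 3))
print(coefs)
print(ranked)
```

Because the features here share a scale, the absolute coefficients can be compared directly; with unscaled features that comparison would be misleading.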

In [None]:
# Train a linear regression model on previous cases
# 1. Create and fit a linear regression model
# YOUR CODE HERE

# 2. Evaluate model performance (R²)
# YOUR CODE HERE

# 3. Examine coefficients to see what evidence this model values
# YOUR CODE HERE

# 4. Visualize the most important features
# YOUR CODE HERE

### Detective Model 2: Decision Tree

**Detective's Note**: _This model works like a detective asking a series of yes/no questions about the evidence to determine guilt. It can find patterns in specific combinations of evidence that might be missed by linear models. Imagine a detective who follows a step-by-step reasoning process, looking for critical decision points._
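
The following sketch contrasts a shallow tree with a linear model on a synthetic "both clues must be present" pattern. The data and the 0.5 thresholds are invented for illustration; the real case files may behave differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 2000

# Two hypothetical evidence values, each uniform on [0, 1].
X = rng.uniform(0, 1, size=(n, 2))
# Guilt is high only when BOTH evidence values exceed 0.5.
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(float)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Linear model: can only add up the two clues.
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_val, y_val)

# Shallow tree: can ask "is clue 1 high?" and then "is clue 2 high?".
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
tree_r2 = tree.score(X_val, y_val)

print(round(linear_r2, 3), round(tree_r2, 3))
print(tree.feature_importances_)
```

On this toy pattern two yes/no questions are enough to separate the cases, which is why the tree's validation R² ends up well above the linear model's.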

In [None]:
# Train a decision tree regression model
# 1. Create and fit a decision tree regressor
# Experiment with different max_depth values
# YOUR CODE HERE

# 2. Evaluate model performance (R²)
# YOUR CODE HERE

# 3. Extract and visualize feature importance
# YOUR CODE HERE

# 4. Visualize the tree structure (optional)
# YOUR CODE HERE

### Detective Model 3: Random Forest

**Detective's Note**: _This model is like a team of detectives, each examining the evidence from a slightly different angle, then coming together to make a final determination. Random Forests might spot important patterns that could be overlooked by a single detective. This approach is particularly good at identifying critical evidence amid a lot of noise._
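
Here is a minimal sketch of a forest picking out one decisive clue among distractors. Everything in it is synthetic: `key_clue`, the nine `noise_*` columns, and the threshold of 1.0 are assumptions made for the example, and the hyperparameters are arbitrary starting values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 1000

# One hypothetical decisive clue plus nine irrelevant features.
cols = ["key_clue"] + [f"noise_{i}" for i in range(9)]
X = pd.DataFrame(rng.normal(size=(n, len(cols))), columns=cols)
# Guilt jumps only when the key clue is extreme; small measurement noise on top.
y = (X["key_clue"] > 1.0).astype(float) + rng.normal(scale=0.1, size=n)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_train, y_train)
forest_r2 = forest.score(X_val, y_val)

# Impurity-based importances, largest first.
importances = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)

print(round(forest_r2, 3))
print(importances.head())
```

Impurity-based importances are convenient but can be biased toward features with many distinct values, so treat the ranking as a lead to follow up rather than as proof.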

In [None]:
# Train a random forest regression model
# 1. Create and fit a random forest regressor
# Try different combinations of n_estimators and max_depth
# YOUR CODE HERE

# 2. Evaluate model performance (R²)
# YOUR CODE HERE

# 3. Extract and visualize feature importance
# YOUR CODE HERE

### Testing Model Performance on High-Guilt Cases
![Detective at Desk](data/assets/3.png)

**Detective's Note**: _A good detective should be able to spot a murderer with overwhelming evidence against them. Let's see how our models perform on cases with very high guilt scores._
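
One possible way to check this is to compute the error only on the cases with the highest true scores. The helper `rmse_on_high_guilt` below is a hypothetical function written for this sketch, and the score arrays are made-up numbers, not model output.

```python
import numpy as np

def rmse_on_high_guilt(y_true, y_pred, quantile=0.9):
    """RMSE restricted to cases whose true score is at or above the given quantile."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    threshold = np.quantile(y_true, quantile)
    mask = y_true >= threshold
    return float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))

# Toy validation scores: mostly low guilt with one extreme case (made-up numbers).
y_true = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1, 5.0])
# A model that predicts the bulk well but badly underestimates the extreme case.
y_pred = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1, 1.0])

overall_rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
high_rmse = rmse_on_high_guilt(y_true, y_pred, quantile=0.9)

print(round(overall_rmse, 3), round(high_rmse, 3))
```

A model can look acceptable on the overall error while missing the extreme cases badly, so comparing the two numbers per model is more informative than either one alone.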

In [None]:
# Check how our models perform on high guilt cases
# Try to identify how well each model identifies the most obvious criminals
# YOUR CODE HERE

In [None]:
# Compare model performances
# 1. Create a comparison visualization of all three model performances
# YOUR CODE HERE

# 2. Compare which features each model considers most important
# YOUR CODE HERE

**Group Discussion (10 minutes)**:
- Which model performed best on previous cases?
- What evidence did each model consider most important?
- Why might different models value different types of evidence?
- What kinds of evidence patterns might tree-based models detect that linear models cannot?
- In a detective context, why might it be important to look at both overall patterns and specific combinations of evidence?

## Part 3: Investigating the Current Case

**Detective's Note**: _Now it's time to apply our trained detective models to the current murder case. Let's load the evidence and see who each model identifies as the most likely culprit. Remember that a savvy criminal might try to appear innocent in most ways, with only a few tell-tale signs of guilt._

In [None]:
# Load the current case evidence
current_case_file = 'data/current_murder_evidence.csv'
current_evidence = pd.read_csv(current_case_file)

# Display the first few rows
current_evidence.head()

In [None]:
# Examine the structure of the current evidence
# 1. Check the dataset shape
# YOUR CODE HERE

# 2. Identify the suspects in this case
# YOUR CODE HERE

# 3. Check for any missing values or other data issues
# YOUR CODE HERE

In [None]:
# Prepare the current case data for prediction
# 1. Apply the same preprocessing steps used on the previous cases
# YOUR CODE HERE

# 2. Handle categorical variables with one-hot encoding
# YOUR CODE HERE

# 3. Ensure feature columns match those used in training
# YOUR CODE HERE

# 4. Apply the same scaling to numeric features
# YOUR CODE HERE
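
Matching the training layout is the step that most often goes wrong, so here is a small sketch of one way to do it with `DataFrame.reindex` and an already fitted `StandardScaler`. The frames, the `rival` category, and the column names are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training features before and after one-hot encoding.
train_raw = pd.DataFrame({
    "motive": [1.0, 5.0, 9.0, 3.0],
    "relationship": ["colleague", "stranger", "family", "colleague"],
})
X_train = pd.get_dummies(train_raw, columns=["relationship"], dtype=float)
train_columns = X_train.columns

numeric_cols = ["motive"]
scaler = StandardScaler().fit(X_train[numeric_cols])

# New evidence: lacks the "family" category and contains an unseen "rival" one.
new_raw = pd.DataFrame({
    "motive": [5.0, 7.0],
    "relationship": ["stranger", "rival"],
})
X_new = pd.get_dummies(new_raw, columns=["relationship"], dtype=float)

# Reindex to the training layout: missing dummies become 0, unseen ones are dropped.
X_new = X_new.reindex(columns=train_columns, fill_value=0.0)

# Reuse the scaler fitted on the training data; do not refit it on the new data.
X_new[numeric_cols] = scaler.transform(X_new[numeric_cols])

print(X_new)
```

Refitting the scaler on the new evidence would shift the feature values relative to what the models saw during training, so the fitted scaler is reused as-is.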

In [None]:
# Apply each model to predict guilt scores for the current case
# 1. Use each trained model to predict guilt scores
# YOUR CODE HERE

# 2. Add these predictions to the evidence dataframe
# YOUR CODE HERE

# Display a few predictions
# YOUR CODE HERE

## Part 4: Solving the Case

![Case Solved](data/assets/5.png)

**Detective's Note**: _Now that we have predictions from our models, we need to analyze them carefully to determine who is most likely guilty. In detective work, you should consider both patterns of suspicious behavior and any "smoking gun" evidence. A suspect might appear mostly innocent but have one or two pieces of very incriminating evidence._
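
The difference between a steady pattern and a single "smoking gun" shows up when mean and maximum predictions are compared side by side. The sketch below uses made-up predictions for two hypothetical suspects; the column names `pred_linear` and `pred_forest` are assumptions, not names from the case files.

```python
import pandas as pd

# Made-up predictions: several evidence rows per (hypothetical) suspect.
preds = pd.DataFrame({
    "suspect_name": ["A", "A", "A", "B", "B", "B"],
    "pred_linear": [0.4, 0.5, 0.6, 0.2, 0.2, 0.3],
    "pred_forest": [0.4, 0.4, 0.5, 0.1, 0.1, 0.9],
})
model_cols = ["pred_linear", "pred_forest"]
grouped = preds.groupby("suspect_name")[model_cols]

# Mean and maximum predicted guilt per suspect and model.
summary = grouped.agg(["mean", "max"])

# Rank suspects per model by mean prediction (1 = most suspicious).
mean_rank = grouped.mean().rank(ascending=False)

# Top suspect per model under each aggregation.
top_by_mean = grouped.mean().idxmax()
top_by_max = grouped.max().idxmax()

print(summary)
print(mean_rank)
print(top_by_mean)
print(top_by_max)
```

In this toy table suspect B looks harmless on average but has one very high forest prediction, so the two aggregations point at different people for that model.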

In [None]:
# Calculate average predicted guilt for each suspect by each model
# 1. Group by suspect_name and compute mean for each model's predictions
# YOUR CODE HERE

# 2. Create a comparison table showing each model's top suspects
# YOUR CODE HERE

# 3. Create rankings for each model
# YOUR CODE HERE

In [None]:
# Consider looking at maximum guilt scores as well as averages
# This can help identify suspects with any "smoking gun" evidence
# YOUR CODE HERE

In [None]:
# Create a visualization comparing the suspects across models
# 1. Try different visualization approaches (bar charts, heatmaps, etc.)
# YOUR CODE HERE

# 2. Consider visualizing both average and maximum guilt scores
# YOUR CODE HERE

**Detective's Note**: _Do the models agree on the prime suspect, or are they pointing to different people? If they disagree, try to understand why. Look for any suspects with extremely high guilt scores on specific pieces of evidence - this could be the "smoking gun" that cracks the case!_

In [None]:
# Analyze evidence patterns for the top suspects
# 1. Extract the evidence for the top suspects from each model
# YOUR CODE HERE

# 2. Compare their evidence profiles
# YOUR CODE HERE

# 3. Look for any particularly incriminating evidence points
# YOUR CODE HERE

In [None]:
# Examine the distribution of guilt scores for your top suspects
# This can help identify if there are outlier data points with very high guilt
# YOUR CODE HERE

## Case Closed: Final Report

**Detective's Note**: _Based on your investigation, prepare a final report in the `ReadMe.MD` file._


![Detective at Desk](data/assets/6.png)