# Problem


# The Midnight Mystery: A Data Mining Murder Investigation

## Background

It's a dark and stormy night at the Grand Hotel. Many distinguished guests have gathered for an exclusive gala. At precisely 11:50 PM, a scream pierces the night - one guest has been murdered!

The legendary detective Sherlock Holmes and his assistant Dr. Watson arrive at the scene. They conduct thorough interviews with all surviving guests. Each guest provides information about their whereabouts at the time of the murder and whom they remember seeing.

A crucial piece of evidence is found: a handwritten note clutched in the victim's hand, apparently torn from the murderer during the struggle. The note appears to be written in a distinctive style.

Your task is to help Holmes and Watson identify the most likely suspects.

## Dataset Description

You are provided with `murder_mystery.json` containing:
- **Metadata**: Case details including victim name, murder time, and the mysterious note
- **Interrogations**: Interview reports, each containing:
  - Guest name
  - Their statement about location and sightings
  - Interview timestamp

## Your Mission

You must analyze the evidence to identify the prime suspects. Your investigation should follow these steps:


Combine your findings from all analyses to:
1. Identify the most likely murderer
2. Provide evidence supporting your conclusion
3. Explain any alternative suspects and why they were ruled out

**Deliverable**: A final report presenting your conclusion with supporting evidence, and the code.

## Technical Requirements

- You can use any Python libraries you like, such as: `networkx`, `python-louvain`, `sentence-transformers`, `openai/anthropic` (via OpenRouter)
- Your code should be well-documented and reproducible
- Handle API rate limits appropriately

## Evaluation Criteria

- **Code Quality**: Clean, documented, efficient code
- **Analysis Depth**: Thoroughness of investigation and use of multiple techniques
- **Conclusions**: Logical reasoning and evidence-based conclusions

## Hints

- People may be mistaken about whom they saw. A mistake is less likely if, for example, they were playing cards with a guest rather than just briefly seeing them pass by.
- Some guests might naturally be more isolated due to their behavior
- Writing styles can reveal more than just dialect - look for patterns


## Submission

Submit a Jupyter notebook containing:
1. All code with explanations
2. Your final report as a markdown cell

USE ONLY THE API KEY PROVIDED VIA EMAIL 

Good luck, detective! The truth is hidden in the data...

# Reasoning

To identify the murderer, I approach the problem using a network analysis and natural language processing pipeline:

1. **Graph Construction**
   I begin by constructing a network graph where each node represents a person and each edge represents a sighting or interaction between two individuals. I assign a **reliability score** to each edge based on the nature of the interaction:

   * **1.0** if the two individuals directly interacted (e.g., had a conversation).
   * **0.5** if one person merely observed the other.

   These scores are determined with the help of an OpenAI model, which classifies the interactions described in each statement.

2. **Filtering Based on Alibis**
   Using this graph, I remove all individuals who had **at least one connection with a reliability score of 1.0** during the time of the murder. This implies they were with someone else and thus likely have a credible alibi.

3. **Text Embedding and Similarity Matching**
   For the remaining suspects, I embed both their statements and the mysterious note using **SentenceTransformer**.
   I then calculate **cosine similarity** between each person's statement and the note to find semantic overlaps that might indicate authorship or involvement.

4. **Narrowing Down Suspects**
   Based on similarity scores, I shortlist the **top 3 suspects** whose statements most closely match the note.

5. **Language and Contextual Analysis with GPT**
   I then provide these top 3 statements to ChatGPT along with the content of the note. ChatGPT analyzes the writing style, tone, and potential motivations to suggest the most likely author of the note and, by extension, the probable culprit.

6. **Final Judgment Criteria**
   The final determination is made using a combination of:

   * **Alibi validation** (who was alone and unaccounted for),
   * **Writing style similarity** between the note and statements,
   * **Intent**, as interpreted from their statements by ChatGPT.
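
Steps 1 and 2 can be sketched without any external library. This is only an illustration under stated assumptions: the guest names and sightings are hypothetical placeholders, and the helpers `build_weighted_edges` and `without_alibi` are names introduced here, not part of the notebook cells.

```python
from collections import defaultdict

# Hypothetical sightings: (witness, person seen, reliability score).
sightings = [
    ("Guest A", "Guest B", 1.0),  # direct interaction, e.g. playing cards
    ("Guest B", "Guest C", 0.5),  # brief sighting only
    ("Guest D", "Guest C", 0.5),
]


def build_weighted_edges(sightings):
    """Store one undirected edge per pair, keeping the highest reliability."""
    edges = defaultdict(float)
    for witness, seen, score in sightings:
        pair = frozenset((witness, seen))
        edges[pair] = max(edges[pair], score)
    return dict(edges)


def without_alibi(guests, edges, threshold=1.0):
    """Return the guests who have no edge at or above the threshold."""
    cleared = set()
    for pair, score in edges.items():
        if score >= threshold:
            cleared.update(pair)
    return set(guests) - cleared


guests = ["Guest A", "Guest B", "Guest C", "Guest D"]
edges = build_weighted_edges(sightings)
suspects = without_alibi(guests, edges)
print(sorted(suspects))
```

In this toy example, Guest A and Guest B clear each other through their 1.0 edge, while Guest C and Guest D, who are linked only by 0.5 sightings, stay on the suspect list.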

The code cells below attempt to carry out these steps.




## Code 0

This is the first attempt; the later cells try to improve on its result.

In [None]:
# Base code

import json
import networkx as nx
from sentence_transformers import SentenceTransformer, util
import openai
import matplotlib.pyplot as plt


# Load the data
with open('murder_mystery.json', 'r') as file:
    data = json.load(file)

metadata = data['metadata']
interrogations = data['interrogations']

# Extract the victim's note
victim_note = metadata['victim_note']

# Initialize the sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute the embedding for the victim's note
victim_note_embedding = model.encode(victim_note, convert_to_tensor=True)

# Analyze guest statements for similarity to the victim's note
similarities = []
for interrogation in interrogations:
    guest = interrogation['guest']
    statement = interrogation['statement']
    statement_embedding = model.encode(statement, convert_to_tensor=True)
    similarity = util.pytorch_cos_sim(victim_note_embedding, statement_embedding).item()
    similarities.append((guest, similarity))

# Sort guests by similarity
similarities.sort(key=lambda x: x[1], reverse=True)

# Print the top 3 most similar guests
print("Top 3 guests with similar writing style to the victim's note:")
for guest, similarity in similarities[:3]:
    print(f"{guest}: Similarity = {similarity:.4f}")


# Build a network graph of guest interactions
G = nx.Graph()

# Add nodes (guests)
for interrogation in interrogations:
    guest = interrogation['guest']
    G.add_node(guest)

# Add edges based on sightings in statements
for interrogation in interrogations:
    guest = interrogation['guest']
    statement = interrogation['statement']
    # Extract mentioned guests by substring match (simplified heuristic);
    # skip the speaker's own name to avoid self-loops
    mentioned_guests = [g for g in G.nodes if g in statement and g != guest]
    for mentioned_guest in mentioned_guests:
        G.add_edge(guest, mentioned_guest)

# Analyze the graph
print("\nNetwork Graph Analysis:")
print(f"Number of nodes (guests): {G.number_of_nodes()}")
print(f"Number of edges (interactions): {G.number_of_edges()}")

# Identify isolated nodes (guests with no interactions)
isolated_guests = list(nx.isolates(G))
print(f"Isolated guests: {isolated_guests}")

# Visualize the graph
plt.figure(figsize=(12, 10))
nx.draw(G, with_labels=True, node_color='lightblue', edge_color='gray', node_size=2000, font_size=10)
plt.title("Guest Interaction Network")
plt.show()

# Combine results to identify the murderer
prime_suspect = similarities[0][0]  # Guest with the highest similarity to the victim's note
print(f"\nPrime Suspect based on writing style: {prime_suspect}")


## Code 1

In [None]:
# Code with ChatGPT prompt
import json
import os
import networkx as nx
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI
import matplotlib.pyplot as plt

# Set up the OpenRouter client. The key is read from an environment variable
# (the name OPENROUTER_API_KEY is an assumption) instead of being hardcoded.
client = OpenAI(api_key=os.environ["OPENROUTER_API_KEY"],
                base_url="https://openrouter.ai/api/v1")

# Load the data
with open('murder_mystery.json', 'r') as file:
    data = json.load(file)

metadata = data['metadata']
interrogations = data['interrogations']

# Extract the victim's note
victim_note = metadata['victim_note']

# Initialize the sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to query OpenAI for interaction reliability
def ask_openai_for_reliability(guest, statement):
    prompt = f"""
    The following is a statement from {guest}:
    "{statement}"
    
    Please extract two lists:
    1. Guests the person has interacted with (e.g., playing cards, having a drink, chatting).
    2. Guests the person has only seen (e.g., be seen, passing by or observing).
    
    Return the lists in JSON format with keys "interacted_with" and "only_seen".
    """
    try:
        response = client.chat.completions.create(
                model="openai/gpt-4.1",
                messages=[
                    {"role": "system", "content": "You are a precise information extraction assistant. Extract only what is explicitly stated."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.1,
                max_tokens=200
        )
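        content = response.choices[0].message.content
        # Keep only the JSON object; the model may wrap it in a Markdown fence
        start, end = content.find("{"), content.rfind("}")
        parsed = json.loads(content[start:end + 1])
        return {
            "interacted_with": parsed.get("interacted_with", []),
            "only_seen": parsed.get("only_seen", []),
        }
    except Exception as e:
        # Fall back to empty lists so the graph-building loop can continue
        print(f"Error while querying the model for {guest}: {e}")
        return {"interacted_with": [], "only_seen": []}

# Build a network graph of guest interactions with reliability parameter
G = nx.Graph()

# Add nodes (guests)
for interrogation in interrogations:
    guest = interrogation['guest']
    G.add_node(guest)

# Add edges based on OpenAI-determined interactions
for interrogation in interrogations:
    guest = interrogation['guest']
    statement = interrogation['statement']
    
    # Query OpenAI for interaction details
    reliability_data = ask_openai_for_reliability(guest, statement)
    print(f"Reliability data for {guest}: {reliability_data}")

    interacted_with = reliability_data["interacted_with"]
    only_seen = reliability_data["only_seen"]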

    # Add edges for "interacted_with" with reliability 1.0
    for other_guest in interacted_with:
        if G.has_edge(guest, other_guest):
            G[guest][other_guest]['weight'] = max(G[guest][other_guest]['weight'], 1.0)
        else:
            G.add_edge(guest, other_guest, weight=1.0)
    
    # Add edges for "only_seen" with reliability 0.5
    for other_guest in only_seen:
        if G.has_edge(guest, other_guest):
            G[guest][other_guest]['weight'] = max(G[guest][other_guest]['weight'], 0.5)
        else:
            G.add_edge(guest, other_guest, weight=0.5)

# Analyze the graph
print("\nNetwork Graph Analysis:")
print(f"Number of nodes (guests): {G.number_of_nodes()}")
print(f"Number of edges (interactions): {G.number_of_edges()}")

# Identify isolated nodes (guests with no interactions)
isolated_guests = list(nx.isolates(G))
print(f"Isolated guests: {isolated_guests}")

# Visualize the graph with edge weights
plt.figure(figsize=(12, 10))
pos = nx.spring_layout(G)
edges = G.edges(data=True)
weights = [edge[2]['weight'] for edge in edges]
nx.draw(
    G, pos, with_labels=True, node_color='lightblue', edge_color=weights,
    edge_cmap=plt.cm.Blues, node_size=2000, font_size=10
)
plt.title("Guest Interaction Network with Reliability Parameter")
#plt.colorbar(plt.cm.ScalarMappable(cmap=plt.cm.Blues), label="Reliability")
plt.show()

# Exclude suspects with high-reliability links
possible_suspects = set(G.nodes)
for edge in G.edges(data=True):
    if edge[2]['weight'] == 1.0:  # High-reliability link
        possible_suspects.discard(edge[0])
        possible_suspects.discard(edge[1])

print(f"\nPossible suspects after excluding high-reliability links: {possible_suspects}")

# Compute embeddings for the victim's note
victim_note_embedding = model.encode(victim_note, convert_to_tensor=True)

# Analyze guest statements for similarity to the victim's note
similarities = []
for interrogation in interrogations:
    guest = interrogation['guest']
    if guest in possible_suspects:  # Only consider possible suspects
        statement = interrogation['statement']
        statement_embedding = model.encode(statement, convert_to_tensor=True)
        similarity = util.pytorch_cos_sim(victim_note_embedding, statement_embedding).item()
        similarities.append((guest, similarity))

# Sort guests by similarity
similarities.sort(key=lambda x: x[1], reverse=True)

# Print the top 3 most similar guests
print("\nTop 3 guests with similar writing style to the victim's note:")
for guest, similarity in similarities[:3]:
    print(f"{guest}: Similarity = {similarity:.4f}")

# Combine results to identify the murderer
prime_suspect = similarities[0][0]  # Guest with the highest similarity to the victim's note
print(f"\nPrime Suspect based on writing style: {prime_suspect}")

# Use OpenAI to analyze the note and statements
def ask_openai_for_reasoning(victim_note, prime_suspect, statement):
    prompt = f"""
    The victim's note is: "{victim_note}"
    The prime suspect based on writing style is {prime_suspect}.
    Their statement is: "{statement}"
    Explain why this person might be the murderer based on the note and their statement.
    """
    try:
        response = client.chat.completions.create(
                model="openai/gpt-4.1",
                messages=[
                    {"role": "system", "content": "You are a precise information extraction assistant. Extract only what is explicitly stated."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.1,
                max_tokens=100
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {e}"

# Get reasoning for the prime suspect, passing their statement along with the note
suspect_statement = next(i['statement'] for i in interrogations if i['guest'] == prime_suspect)
reasoning = ask_openai_for_reasoning(victim_note, prime_suspect, suspect_statement)
print("\nOpenAI Reasoning on the Prime Suspect:")
print(reasoning)

### Problem

ChatGPT does not reliably distinguish between guests who were only seen and guests who were interacted with. As a consequence, only the `interacted_with` list is populated, so the embedding step that follows does not work because it receives an empty list.
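
One way to make the model's answer easier to inspect is to parse and validate it before building the graph. The sketch below is an assumption-laden illustration: `parse_reliability_reply` is a hypothetical helper, the sample reply is invented, and the rule that a name listed in both places counts as an interaction only is a design choice, not something the dataset prescribes.

```python
import json


def parse_reliability_reply(reply, known_guests):
    """Turn a model reply into two clean, non-overlapping name lists."""
    empty = {"interacted_with": [], "only_seen": []}
    # The reply may be wrapped in a Markdown fence or extra prose,
    # so keep only the outermost {...} span before parsing.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        parsed = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return empty
    known = list(known_guests)
    # Drop anything that is not a known guest name (e.g. "a waiter").
    interacted = [n for n in (parsed.get("interacted_with") or []) if n in known]
    # A guest listed in both places counts as an interaction only.
    seen = [n for n in (parsed.get("only_seen") or [])
            if n in known and n not in interacted]
    return {"interacted_with": interacted, "only_seen": seen}


# Invented reply, wrapped in a Markdown fence as chat models often do.
fence = "`" * 3
reply = (
    fence + "json\n"
    '{"interacted_with": ["Guest A"], '
    '"only_seen": ["Guest A", "Guest B", "a waiter"]}\n'
    + fence
)
result = parse_reliability_reply(reply, ["Guest A", "Guest B"])
print(result)
```

Printing the parsed result for a handful of statements makes it possible to check by hand whether the empty `only_seen` lists come from the model or from the parsing.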

## Code 2

I tried parsing the statements myself to improve on this.

In [None]:
# Manual parser

import json
import networkx as nx
from sentence_transformers import SentenceTransformer, util
import matplotlib.pyplot as plt
import re

# Load the data
with open('murder_mystery.json', 'r') as file:
    data = json.load(file)

metadata = data['metadata']
interrogations = data['interrogations']

# Extract the victim's note
victim_note = metadata['victim_note']

# Initialize the sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def parse_statement_for_reliability(statement):
    interacted_with = []
    only_seen = []
    flag = None  # Keeps track of the current context (interacted_with or only_seen)

    # Keywords for interactions and sightings
    interaction_keywords = ["chatting", "discussion", "talking", "playing"]
    sighting_keywords = ["seen", "see", "saw"]

    # Split the statement into words
    words = statement.split()

    i = 0
    while i < len(words):
        word = words[i]

        # Check for interaction keywords
        if word in interaction_keywords:
            flag = "interacted_with"
            i += 1
            continue

        # Check for sighting keywords
        if word in sighting_keywords:
            flag = "only_seen"
            i += 1
            continue

        # Check for names (two consecutive capitalized words)
        if i + 1 < len(words) and words[i][0].isupper() and words[i + 1][0].isupper():
            name = f"{words[i]} {words[i + 1]}"
            if flag == "interacted_with" and name not in interacted_with:
                interacted_with.append(name)
            elif flag == "only_seen" and name not in only_seen:
                only_seen.append(name)
            i += 2  # Skip the second part of the name
            continue

        i += 1

    return interacted_with, only_seen


# Build a network graph of guest interactions with reliability parameter
G = nx.Graph()

# Add nodes (guests)
for interrogation in interrogations:
    guest = interrogation['guest']
    G.add_node(guest)

# Add edges based on parsed interactions
for interrogation in interrogations:
    guest = interrogation['guest']
    statement = interrogation['statement']

    # Parse the statement for reliability
    interacted_with, only_seen = parse_statement_for_reliability(statement)
    print(f"Guest: {guest}, Interacted With: {interacted_with}, Only Seen: {only_seen}")

    # Add edges for "interacted_with" with reliability 1.0
    for other_guest in interacted_with:
        if G.has_edge(guest, other_guest):
            G[guest][other_guest]['weight'] = max(G[guest][other_guest]['weight'], 1.0)
        else:
            G.add_edge(guest, other_guest, weight=1.0)

    # Add edges for "only_seen" with reliability 0.5
    for other_guest in only_seen:
        if G.has_edge(guest, other_guest):
            G[guest][other_guest]['weight'] = max(G[guest][other_guest]['weight'], 0.5)
        else:
            G.add_edge(guest, other_guest, weight=0.5)

# Analyze the graph
print("\nNetwork Graph Analysis:")
print(f"Number of nodes (guests): {G.number_of_nodes()}")
print(f"Number of edges (interactions): {G.number_of_edges()}")

# Identify isolated nodes (guests with no interactions)
isolated_guests = list(nx.isolates(G))
print(f"Isolated guests: {isolated_guests}")

# Visualize the graph with edge weights
plt.figure(figsize=(12, 10))
pos = nx.spring_layout(G)
edges = G.edges(data=True)
weights = [edge[2]['weight'] for edge in edges]
nx.draw(
    G, pos, with_labels=True, node_color='lightblue', edge_color=weights,
    edge_cmap=plt.cm.Blues, node_size=2000, font_size=10
)
plt.title("Guest Interaction Network with Reliability Parameter")
plt.show()

# Exclude suspects with high-reliability links
possible_suspects = set(G.nodes)
for edge in G.edges(data=True):
    if edge[2]['weight'] == 1.0:  # High-reliability link
        possible_suspects.discard(edge[0])
        possible_suspects.discard(edge[1])

print(f"\nPossible suspects after excluding high-reliability links: {possible_suspects}")

# Compute embeddings for the victim's note
victim_note_embedding = model.encode(victim_note, convert_to_tensor=True)

# Analyze guest statements for similarity to the victim's note
similarities = []
for interrogation in interrogations:
    guest = interrogation['guest']
    if guest in possible_suspects:  # Only consider possible suspects
        statement = interrogation['statement']
        statement_embedding = model.encode(statement, convert_to_tensor=True)
        similarity = util.pytorch_cos_sim(victim_note_embedding, statement_embedding).item()
        similarities.append((guest, similarity))

# Sort guests by similarity
similarities.sort(key=lambda x: x[1], reverse=True)

# Print the top 3 most similar guests
print("\nTop 3 guests with similar writing style to the victim's note:")
for guest, similarity in similarities[:3]:
    print(f"{guest}: Similarity = {similarity:.4f}")

# Combine results to identify the murderer
prime_suspect = similarities[0][0]  # Guest with the highest similarity to the victim's note
print(f"\nPrime Suspect based on writing style: {prime_suspect}")

### Problem

The names are repeated too many times, so the resulting graph is messier. The same problem as before also remains: the `only_seen` lists are almost always empty, and the only name that appears in them is 'Doctor Ashcroft'.
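
A possible cause of the repeated names, offered here as an assumption since the parser output is not shown, is that tokens such as `Ashcroft,` keep their punctuation and therefore become separate graph nodes. The sketch below maps a raw name onto a known guest before it is added to the graph; `canonical_name` is a hypothetical helper, not part of the cell above.

```python
import string


def canonical_name(raw, known_guests):
    """Map a raw name such as 'Doctor Ashcroft,' onto a known guest name.

    Returns None when the cleaned text matches no known guest.
    """
    cleaned = raw.strip(string.punctuation + string.whitespace)
    # Drop a trailing possessive: "Doctor Ashcroft's" -> "Doctor Ashcroft"
    if cleaned.endswith("'s"):
        cleaned = cleaned[:-2]
    for guest in known_guests:
        if cleaned.lower() == guest.lower():
            return guest
    return None


known_guests = ["Doctor Ashcroft", "Viscount Pemberton", "Baron Sienna"]
print(canonical_name("Doctor Ashcroft,", known_guests))
print(canonical_name("viscount pemberton's", known_guests))
print(canonical_name("The Library", known_guests))
```

Names that come back as `None` (capitalized word pairs that are not guests) can simply be skipped, which keeps the node set equal to the guest list.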

## Code 3

Here I embed every guest's statement rather than only those on the suspect list.

In [None]:
import json
import networkx as nx
from sentence_transformers import SentenceTransformer, util
import matplotlib.pyplot as plt
import re

# Load the data
with open('murder_mystery.json', 'r') as file:
    data = json.load(file)

metadata = data['metadata']
interrogations = data['interrogations']

# Extract the victim's note
victim_note = metadata['victim_note']

# Initialize the sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to parse statements and assign reliability
# def parse_statement_for_reliability(statement):
#     interacted_with = []
#     only_seen = []

#     # Patterns for interactions (reliability = 1.0)
#     interaction_keywords = r"\b(chatting|discussion|talking|playing)\b"
#     interaction_pattern = re.compile(f"{interaction_keywords}.*?([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)")

#     # Patterns for sightings (reliability = 0.5)
#     sighting_keywords = r"\b(seen|see|saw)\b"
#     sighting_pattern = re.compile(f"{sighting_keywords}.*?([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)")

#     # Find all matches for interactions
#     for match in interaction_pattern.finditer(statement):
#         interacted_with.append(match.group(1))

#     # Find all matches for sightings
#     for match in sighting_pattern.finditer(statement):
#         only_seen.append(match.group(1))

#     return interacted_with, only_seen
# Function to parse statements and assign reliability
def parse_statement_for_reliability(statement):
    interacted_with = []
    only_seen = []
    flag = None  # Keeps track of the current context (interacted_with or only_seen)

    # Keywords for interactions and sightings
    interaction_keywords = ["chatting", "discussion", "talking", "playing"]
    sighting_keywords = ["seen", "see", "saw"]

    # Split the statement into words
    words = statement.split()

    i = 0
    while i < len(words):
        word = words[i]

        # Check for interaction keywords
        if word in interaction_keywords:
            flag = "interacted_with"
            i += 1
            continue

        # Check for sighting keywords
        if word in sighting_keywords:
            flag = "only_seen"
            i += 1
            continue

        # Check for names (two consecutive capitalized words)
        if i + 1 < len(words) and words[i][0].isupper() and words[i + 1][0].isupper():
            name = f"{words[i]} {words[i + 1]}"
            if flag == "interacted_with" and name not in interacted_with:
                interacted_with.append(name)
            elif flag == "only_seen" and name not in only_seen:
                only_seen.append(name)
            i += 2  # Skip the second part of the name
            continue

        i += 1

    return interacted_with, only_seen


# Build a network graph of guest interactions with reliability parameter
G = nx.Graph()

# Add nodes (guests)
for interrogation in interrogations:
    guest = interrogation['guest']
    G.add_node(guest)

# Add edges based on parsed interactions
for interrogation in interrogations:
    guest = interrogation['guest']
    statement = interrogation['statement']

    # Parse the statement for reliability
    interacted_with, only_seen = parse_statement_for_reliability(statement)
    print(f"Guest: {guest}, Interacted With: {interacted_with}, Only Seen: {only_seen}")

    # Add edges for "interacted_with" with reliability 1.0
    for other_guest in interacted_with:
        if G.has_edge(guest, other_guest):
            G[guest][other_guest]['weight'] = max(G[guest][other_guest]['weight'], 1.0)
        else:
            G.add_edge(guest, other_guest, weight=1.0)

    # Add edges for "only_seen" with reliability 0.5
    for other_guest in only_seen:
        if G.has_edge(guest, other_guest):
            G[guest][other_guest]['weight'] = max(G[guest][other_guest]['weight'], 0.5)
        else:
            G.add_edge(guest, other_guest, weight=0.5)

# Analyze the graph
print("\nNetwork Graph Analysis:")
print(f"Number of nodes (guests): {G.number_of_nodes()}")
print(f"Number of edges (interactions): {G.number_of_edges()}")

# Identify isolated nodes (guests with no interactions)
isolated_guests = list(nx.isolates(G))
print(f"Isolated guests: {isolated_guests}")

# Visualize the graph with edge weights
plt.figure(figsize=(12, 10))
pos = nx.spring_layout(G)
edges = G.edges(data=True)
weights = [edge[2]['weight'] for edge in edges]
nx.draw(
    G, pos, with_labels=True, node_color='lightblue', edge_color=weights,
    edge_cmap=plt.cm.Blues, node_size=2000, font_size=10
)
plt.title("Guest Interaction Network with Reliability Parameter")
plt.show()

# Exclude suspects with high-reliability links
possible_suspects = set(G.nodes)
for edge in G.edges(data=True):
    if edge[2]['weight'] == 1.0:  # High-reliability link
        possible_suspects.discard(edge[0])
        possible_suspects.discard(edge[1])

print(f"\nPossible suspects after excluding high-reliability links: {possible_suspects}")

# Compute embeddings for the victim's note
victim_note_embedding = model.encode(victim_note, convert_to_tensor=True)

# Analyze guest statements for similarity to the victim's note
similarities = []
for interrogation in interrogations:
    guest = interrogation['guest']
    statement = interrogation['statement']
    statement_embedding = model.encode(statement, convert_to_tensor=True)
    similarity = util.pytorch_cos_sim(victim_note_embedding, statement_embedding).item()
    similarities.append((guest, similarity))

# Sort guests by similarity
similarities.sort(key=lambda x: x[1], reverse=True)

# Print the top 3 most similar guests
print("\nTop 3 guests with similar writing style to the victim's note:")
for guest, similarity in similarities[:3]:
    print(f"{guest}: Similarity = {similarity:.4f}")

# Combine results to identify the murderer
prime_suspect = similarities[0][0]  # Guest with the highest similarity to the victim's note
print(f"\nPrime Suspect based on writing style: {prime_suspect}")

# Solution

Based on my results, the most likely murderer is Doctor Ashcroft. In the network graph he is the only guest who appears in other guests' `only_seen` lists, so he was not accounted for at the time of the murder, and the graph from Code 2 shows him to be more isolated than the others. Moreover, the embedding analysis assigns him the highest similarity to the note left at the crime scene.


Based on the embeddings, the other suspects could be Viscount Pemberton and Baron Sienna, but each of them has a reliable link, which means they have an alibi.
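
Sentence embeddings mainly capture what a text is about, so a purely stylistic cross-check may be a useful complement. The sketch below swaps in a different technique from the one used in the cells above: cosine similarity over character trigram counts, a common stylometry baseline. The texts are invented placeholders, not taken from `murder_mystery.json`, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, lower-cased, whitespace collapsed."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_style(note, statements):
    """Return (guest, score) pairs sorted from most to least similar."""
    note_profile = char_ngrams(note)
    scores = [(guest, cosine(note_profile, char_ngrams(text)))
              for guest, text in statements.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented texts for illustration only.
note = "Meet me at midnight, or else..."
statements = {
    "Guest A": "I was in the library all evening.",
    "Guest B": "Meet me at midnight? Never heard of it, or else I'd say...",
}
ranking = rank_by_style(note, statements)
for guest, score in ranking:
    print(f"{guest}: {score:.3f}")
```

If the trigram ranking and the embedding ranking agree on the top suspect, that supports the conclusion; if they disagree, the difference is worth explaining in the report.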

###### Note: Code executed on Colab