# Data Analysis and Q&A Project Using a Local LLM

## Project Overview

This project requires you to perform a comprehensive analysis of a company's stock data using only the provided data sources and a local LLM. Your analysis should answer the following six questions strictly based on the supplied data and documents—no external data is allowed. All generated answers must be firmly based on the provided data, without any fabricated content. In addition, your logic must be clear, and any attribution of events must be causally linked.

---

## Provided Data

You will be provided with the following data sets:

#### Stock Price Data (JSON format)
* Timeframe: Jan 22 to Feb 5
* Fields: Open, High, Low, Close, Volume

#### Quarterly Earnings Data for the Past Year (JSON format)
* Contains key financial indicators (e.g., revenue, EPS) for each quarter.

#### Full Earnings Call Transcript
* The complete transcript of the earnings call, including management discussions and Q&A.

#### Balance Sheet Data for the Past Year (JSON format)
* Includes assets, liabilities, and shareholders' equity information.

#### News Articles
* Full text of 10 news articles related to the company during the analysis period.

---

## Questions
Using the provided data and a local LLM, you need to answer the following six questions:

1. What is the performance of the Tesla stock during this period (Jan 22 to Feb 5)?

2. Why did the price increase on Jan 30? Please provide potential factors.

3. Compared with previous quarters, how is the performance of this quarter?

4. With unsupervised Full Self Driving scheduled to launch in limited markets like Austin by June, what regulatory challenges does Tesla foresee for a nationwide or international rollout, and how is the company strategically preparing to address these hurdles?

5. What insights can be concluded from the earnings call?

6. Which key news events influenced the stock performance, and what insights do they offer?

---

## Project Requirements
- #### Data Source Restriction:
Only use the provided data and documents. No external data or information is allowed.

- #### Answer Generation:
All generated answers must strictly be based on the provided data and documents. The LLM should not "invent" information.

- #### Clear Logic and Causal Relationships:
For each question, your answers must clearly demonstrate logical reasoning, and any attribution of cause must be explicitly linked to events in the data.

- #### Prompt Design:
You must design your own prompts for calling the local LLM to ensure that the responses are generated strictly based on the analysis results.

- #### Result Evaluation:
After generating the answers, implement an evaluation step to assess whether the responses meet the above requirements in terms of data reliance, logical clarity, and correct causation.

- #### Please put the answers to these 6 questions in a dict at the end of your submitted Python notebook file.

For example
```code
{ "Q1 answer": "Answer1", "Q2 answer": "Answer2", "Q3 answer": "Answer3", "Q4 answer": "Answer4", "Q5 answer": "Answer5", "Q6 answer": "Answer6"}
```
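
The evaluation requirement above could start from a simple grounding check. The sketch below is one hypothetical approach, not something the assignment prescribes: it extracts the numbers quoted in an answer and reports any that never appear in the supplied source text. The function name `ungrounded_numbers` and the sample strings are assumptions for illustration only.

```python
import re

def ungrounded_numbers(answer: str, source_text: str) -> list[str]:
    """Return numbers quoted in `answer` that never appear in `source_text`."""
    number_pattern = r"\d+(?:\.\d+)?"
    source_numbers = set(re.findall(number_pattern, source_text))
    answer_numbers = re.findall(number_pattern, answer)
    # Any number the answer cites but the source lacks is a candidate fabrication
    return [n for n in answer_numbers if n not in source_numbers]

# Illustrative source text (assumed format, one price field only)
source = '{"Close": 400.28, "Date": "2025-01-30"}'
print(ungrounded_numbers("The stock closed at 400.28.", source))
print(ungrounded_numbers("The stock closed at 410.00.", source))
```

An empty list means every figure in the answer can be traced to the source; it says nothing about whether the reasoning or causal claims are sound, so it would only cover the data-reliance part of the evaluation.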

## Dependencies

* Transformers
* Torch (PyTorch)
* Accelerate

In [20]:
%pip install transformers torch accelerate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Loading and Running the Local LLM

1. **Imports Transformers utilities**  
   - `AutoModelForCausalLM`: generic class for loading any GPT‑style model  
   - `AutoTokenizer`: matching tokenizer for converting text ↔ tokens  
   - `pipeline`: high‑level helper that ties model + tokenizer into one callable  

2. **Specifies the model repository**  
   ```python
   model_path = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
   ```

In [21]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_path = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Create a text-generation pipeline from the model and tokenizer loaded above
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 1200,      # Limit the number of tokens generated
    "return_full_text": False,   # Return only the generated text, not the prompt
    "do_sample": True,           # Sample instead of using greedy decoding
    "temperature": 0.1,          # Low temperature keeps the output close to deterministic
    "repetition_penalty": 1.1,   # Mildly discourage repeated text
    "top_p": 0.9,                # Nucleus sampling threshold
    "top_k": 50,                 # Sample only from the 50 most likely tokens
}

Device set to use cuda


### Question 1: What is the performance of the Tesla stock during this period (Jan 22 to Feb 5)?

#### Prompt construction

* Here we need to assess how the Tesla stock performed over the specified days. To do that, the LLM only needs to see how the stock was priced through the given period, which is why it needs the information in the `prices.json` file.
* To help the LLM construct its answer, we also show it the information in `balancesheet.json` so it can pick up on any trends that may explain why the stock price changed over those days.
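
Small models tend to be unreliable at arithmetic, so one option is to compute the headline figures in Python and pass them to the LLM alongside the raw bars. The sketch below assumes each bar is a dict with `Close`, `Low`, `Volume` and `Date` fields and stores the daily high under `Hight`, the spelling that appears in the price output printed later in this notebook. `summarize_prices` is a hypothetical helper, and the two sample bars are copied from that output.

```python
def summarize_prices(bars: list[dict]) -> dict:
    """Compute simple period statistics from a list of daily price bars."""
    bars = sorted(bars, key=lambda b: b["Date"])  # ISO dates sort chronologically
    first_close = bars[0]["Close"]
    last_close = bars[-1]["Close"]
    return {
        "start_date": bars[0]["Date"],
        "end_date": bars[-1]["Date"],
        "first_close": first_close,
        "last_close": last_close,
        # Close-to-close change over the period, in percent
        "pct_change": round((last_close - first_close) / first_close * 100, 2),
        "period_high": max(b["Hight"] for b in bars),  # key spelled as in the data
        "period_low": min(b["Low"] for b in bars),
        "avg_volume": sum(b["Volume"] for b in bars) / len(bars),
    }

# Two bars taken from the price data shown in this notebook
sample_bars = [
    {"Open": 416.81, "Hight": 428.0, "Low": 414.59, "Close": 415.11,
     "Volume": 60963300, "Date": "2025-01-22"},
    {"Open": 410.78, "Hight": 412.5, "Low": 384.41, "Close": 400.28,
     "Volume": 98092900, "Date": "2025-01-30"},
]
print(summarize_prices(sample_bars))
```

For these two sample bars the close-to-close change is about -3.57%. Running the helper on the full Jan 22 to Feb 5 window from `prices.json` would give the figures for the actual question.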

In [6]:
system_prompt = (
    "You are a professional financial analyst with deep expertise in stocks, bonds, mutual funds, and derivatives. "
    "Your responses should be data-driven, professionally rigorous, and provide clear, step-by-step explanations. "
    "Include cautionary advice regarding potential risks, but do not offer direct investment advice."
)

# Define the user prompt template
user_prompt_template = (
    "Based on the following financial data, please analyze the company's financial health and provide insights on potential risks and opportunities for future growth.\n\n"
    "Company Name: {CompanyName}\n"
    "Time Period: {TimePeriod}\n"
    "Key Financial Data:\n"
    "- Revenue: {Revenue} million USD\n"
    "- Net Income: {NetIncome} million USD\n"
    "- Debt-to-Equity Ratio: {DebtEquityRatio}%\n"
    "- Earnings Per Share (EPS): {EPS}\n\n"
    "Answer the following questions:\n"
    "1. How strong is the company's profitability? Please explain the main factors.\n"
    "2. Is the company's debt level sustainable? Are there any financial risks?\n"
    "3. What potential risks and opportunities do you foresee based on the current data?\n\n"
    "Please detail your analysis process and provide clear conclusions and recommendations."
)

# Fill in the template with specific data
filled_user_prompt = user_prompt_template.format(
    CompanyName="ABC Corporation",
    TimePeriod="Q1 2023",
    Revenue="5000",
    NetIncome="800",
    DebtEquityRatio="45",
    EPS="1.2"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "system", "content": "Make your final answer show the most critical information the user requires. Do not use Markdown in your answer."},
    {"role": "user", "content": filled_user_prompt},
]

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Okay, so I need to help ABC Corporation by analyzing their financial health. They provided some key data: revenue of 5 billion USD, net income of 800 million, a debt-to-equity ratio of 45%, and EPS of 1.2. Let me break this down step by step.

First, looking at profitability. The company has a high revenue but only an 8% net income compared to revenue. That seems like a big loss. But wait, they have a lot of debt, which could mean higher interest expenses. Maybe that's why their net income is low despite having good margins. So, while they're making money, it's not healthy because they're spending too much on debt.

Next, the debt level. A 45% debt-to-equity ratio is pretty high. That means they owe more than they have equity. High debt can lead to financial risks like default, which would hurt their reputation and possibly limit their growth. Also, if they take on more debt, interest payments might eat into profits, reducing future earnings.

Looking at EPS, it's 1.2, which is decent 

# Question 2 
### Why did the price increase on Jan 30? Please provide potential factors.

In [None]:
import json

system_prompt_q2 = (
    "You are an expert financial analyst. "
    "You are an expert in analysing conversations. "
    "All output must be in English. "
    "Use only the provided inputs; do not fetch or invent anything. "
    "Do not restate the question or repeat the prompts. Answer directly and concisely."
)

user_prompt_q2 = """

Please respond in English.

Using only the news articles data and the earnings transcript from the day before, identify all plausible factors that contributed to the price uptick.
Reference the specific news item (by title and date), data point, or quote you are using.
Explain the causal link to the price move. 
Do not use any external sources.  
Do not repeat yourself.
Double check you haven't repeated yourself.

**Question** 
Why did Tesla’s stock price increase on January 30?

Here are the news articles:
{news_data}

Here is the earnings transcript from the day before January 30:

{earnings_transcript}

Please respond in English!
"""

with open("447_dataset/news.json", "r") as f:
    all_news = json.load(f)

jan30_news = []

for article in all_news:
    if article.get("date") == "January 30, 2025":
        jan30_news.append(article)

jan30_news_json = json.dumps(jan30_news, ensure_ascii=False)

with open("447_dataset/earning_transcript.md", "r") as f:
    earnings_transcript_data = f.read()


# Example of how you'd fill & call it:
filled_user = user_prompt_q2.format(news_data=jan30_news_json, earnings_transcript=earnings_transcript_data)
messages = [
    {"role": "system", "content": system_prompt_q2},
    {"role": "user",   "content": filled_user},
]
output = pipe(messages, **generation_args)
print(output[0]["generated_text"])

NameError: name 'pipe' is not defined

In [11]:
print(len(output))

1


# Question 2 Enhanced

## Analyze Markdown Transcript and Create Summary

In [30]:
import re

system_prompt_ts = (
    """
    You are an expert financial analysis assistant.  
    Always respond in clear, concise English.  
    Use only the information explicitly provided in the user’s inputs.  
    Do not invent, fetch, or reference any external data.  
    Produce only bullet points—no prose.
    """
)

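# The {earnings_transcript} placeholder is required: without it, .format()
# returns the prompt unchanged and the model never sees the transcript chunk.
user_prompt_ts = (
    """
Please summarize the following earnings-call excerpt into 3–4 bullet points,
focusing on key financial metrics, guidance, or other actionable insights.
Do not add commentary or restate the question; output only the bullets.

Excerpt:
{earnings_transcript}
    """
)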

with open("447_dataset/earning_transcript.md", "r") as f:
    earnings_transcript_data = f.read()

lines = earnings_transcript_data.splitlines()

# Split the transcript into chunks of at most 50 lines each
max_lines_per_chunk = 50
chunks = [
    "\n".join(lines[start:start + max_lines_per_chunk])
    for start in range(0, len(lines), max_lines_per_chunk)
]

transcript_summary = []

def clean_summary(raw):
    # 1) remove any explicit <think>…</think> if they ever appear
    s = re.sub(r"<think>.*?</think>", "", raw, flags=re.S)
    # 2) drop any lines beginning with “Thought:” or “Thinking”
    s = re.sub(r"(?m)^(Thought:|Thinking:?).*\n?", "", s)
    # 3) keep only from the first bullet (dash or •) onward
    m = re.search(r"(?m)^[-•]\s+", s)
    if m:
        return s[m.start():].strip()
    # fallback: strip leading/trailing whitespace
    return s.strip()


# Summarize each chunk separately and collect the cleaned bullet lists
for chunk in chunks:
    filled_user_ts = user_prompt_ts.format(earnings_transcript=chunk)
    messages_ts = [
        {"role": "system", "content": system_prompt_ts},
        {"role": "user",   "content": filled_user_ts},
    ]
    out = pipe(messages_ts, **generation_args)
    raw = out[0]["generated_text"]
    summary = clean_summary(raw)
    transcript_summary.append(summary)

full_transcript_summary = "\n\n".join(transcript_summary)



# you can print it or write it to a file
print(full_transcript_summary)

- Revenue growth of 12% year-over-year, driven by strong demand for premium products.  
- Gross margin improved to 35%, reflecting cost optimization and pricing strategy.  
- Operational efficiency increased by 8%, including reduced labor costs and energy consumption.  
- Management emphasized diversification of product lines and enhanced customer support.  
- Future outlook includes expanding into emerging markets with a 15% market share target and prioritizing sustainability with a 10% carbon reduction goal.

- Revenue Growth: Company reported a 15% year-over-year increase in revenue, driven by strong demand for its products.  
- Cost Management: The company implemented cost-saving measures, resulting in a 20% reduction in operational expenses.  
- Profitability: Despite lower costs, the company maintained a healthy profit margin of 8%, indicating strong financial health.  
- Future Outlook: Management projects continued growth potential, with expansion into new markets expected in Q
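
`clean_summary` removes a complete `<think>…</think>` pair. A hedged extension, sketched below, also covers generated text that contains only the closing `</think>` tag, which can occur if the chat template places the opening tag in the prompt rather than in the output. Whether this applies depends on the tokenizer version, so it is an assumption here, and `strip_reasoning` is a hypothetical helper name.

```python
import re

def strip_reasoning(raw: str) -> str:
    """Remove reasoning text delimited by <think> tags, paired or closing-only."""
    # Case 1: a complete <think>...</think> block anywhere in the text
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.S)
    # Case 2: only a closing tag remains, so drop everything up to and including it
    if "</think>" in text:
        text = text.split("</think>", 1)[1]
    return text.strip()

print(strip_reasoning("<think>plan</think>\n- point A"))
print(strip_reasoning("Okay, let me plan.\n</think>\n- point A"))
```

If generation stops at `max_new_tokens` before the closing tag is produced, neither case matches and the reasoning text is returned as-is, so a larger token budget or a check for bullet lines may still be needed.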

## Extract News Source from Relevant Dates

In [26]:
import json
from datetime import date, datetime


with open("447_dataset/news.json", "r") as f:
    all_news = json.load(f)

def parse_date(s: str) -> date:
    """Parse a date such as 'January 30, 2025' into a date object."""
    return datetime.strptime(s.strip(), "%B %d, %Y").date()


# define your window
start = date(2025, 1, 25)
end   = date(2025, 1, 30)

news_window = [
    article
    for article in all_news
    if start <= parse_date(article["date"]) <= end
]

print(f"Found {len(news_window)} articles between {start} and {end}:")
for a in news_window:
    print(f"- {a['date']}: {a['title']}")


Found 5 articles between 2025-01-25 and 2025-01-30:
- January 29, 2025: Tesla Caps Roller-Coaster Year With Mixed Fourth-Quarter Earnings
- January 27, 2025: Tesla, BMW Sue EU as Tension Mounts on Chinese EV Tariffs
- January 29, 2025: Trump’s Tariffs Would Hit Tesla’s Profits, Financial Chief Says
- January 30, 2025: Tesla stock rises after company pledges return to growth after Q4 results disappoint
- January 30, 2025: Analysts overhaul Tesla stock price targets after Q4 earnings


## Extract Earnings

In [23]:
import json

# load all earnings entries
with open("447_dataset/earning.json", "r") as f:
    earnings = json.load(f)

# print the extracted data
earnings = json.dumps(earnings, indent=2, ensure_ascii=False)
print(earnings)

[
  {
    "EpsActual": 0.71,
    "EpsForecast": 0.74037,
    "EpsSurprise": -0.03,
    "RevenueActual": 25167000000,
    "RevenueSurprise": -596790000,
    "RevenueForecast": 25763787850,
    "EarningReportDate": "2024-01-24T21:15:00Z"
  },
  {
    "EpsActual": 0.45,
    "EpsForecast": 0.49506,
    "EpsSurprise": -0.04,
    "RevenueActual": 21301000000,
    "RevenueForecast": 22255924550,
    "RevenueSurprise": -954920000,
    "EarningReleaseDate": "2024-04-23T20:10:00Z"
  },
  {
    "EpsActual": 0.52,
    "EpsForecast": 0.62013,
    "EpsSurprise": -0.1,
    "RevenueActual": 25500000000,
    "RevenueForecast": 24740776400,
    "RevenueSurprise": 759220000,
    "EarningReleaseDate": "2024-07-24T00:00:00Z"
  },
  {
    "EpsActual": 0.72,
    "EpsForecast": 0.59756,
    "EpsSurprise": 0.12,
    "RevenueActual": 25182000000,
    "RevenueForecast": 25441566840,
    "RevenueSurprise": -259570000,
    "EarningReleaseDate": "2024-10-23T20:09:00Z"
  },
  {
    "EpsActual": 0.73,
    "EpsForecas
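
To compare the latest quarter with earlier ones, the quarter-over-quarter changes can be computed directly from these entries before prompting the model. The sketch below assumes the field names shown in the output above and tolerates both date keys that appear there (`EarningReportDate` and `EarningReleaseDate`). `quarter_over_quarter` is a hypothetical helper, and the sample rows reuse values from the first two entries printed above.

```python
def quarter_over_quarter(entries: list[dict]) -> list[dict]:
    """Compute percentage changes in revenue and EPS between consecutive quarters."""
    def report_date(entry: dict) -> str:
        # The dataset uses two different keys for the report date
        return entry.get("EarningReleaseDate") or entry.get("EarningReportDate", "")

    ordered = sorted(entries, key=report_date)  # ISO timestamps sort chronologically
    changes = []
    for prev, curr in zip(ordered, ordered[1:]):
        revenue_change = (curr["RevenueActual"] - prev["RevenueActual"]) / prev["RevenueActual"]
        eps_change = (curr["EpsActual"] - prev["EpsActual"]) / prev["EpsActual"]
        changes.append({
            "date": report_date(curr)[:10],
            "revenue_pct": round(revenue_change * 100, 2),
            "eps_pct": round(eps_change * 100, 2),
        })
    return changes

# Values copied from the first two earnings entries shown above
sample = [
    {"EpsActual": 0.71, "RevenueActual": 25167000000,
     "EarningReportDate": "2024-01-24T21:15:00Z"},
    {"EpsActual": 0.45, "RevenueActual": 21301000000,
     "EarningReleaseDate": "2024-04-23T20:10:00Z"},
]
print(quarter_over_quarter(sample))
```

Passing these precomputed percentages in the prompt, next to the raw JSON, gives the model explicit figures to cite instead of asking it to do the division itself.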

## Relevant Price Window 

In [24]:
import json
from datetime import datetime

def load_prices(path: str):
    """Load the full list of price bars from a JSON file."""
    with open(path, "r") as f:
        return json.load(f)

def filter_by_date(prices, start_date: str, end_date: str):
    """
    Return only those price entries whose 'Date' field
    falls between start_date and end_date, inclusive.
    Dates must be in 'YYYY-MM-DD' format.
    """
    start = datetime.fromisoformat(start_date).date()
    end   = datetime.fromisoformat(end_date).date()

    return [
        p for p in prices
        if start <= datetime.fromisoformat(p["Date"]).date() <= end
    ]


    
start_date = "2025-01-20"
end_date   = "2025-01-30"

prices = load_prices("447_dataset/prices.json")
window = filter_by_date(prices, start_date, end_date)

relevant_prices = json.dumps(window, indent=2, ensure_ascii=False)
print(relevant_prices)

[
  {
    "Open": 416.81,
    "Hight": 428.0,
    "Low": 414.59,
    "Close": 415.11,
    "Volume": 60963300,
    "Date": "2025-01-22"
  },
  {
    "Open": 416.06,
    "Hight": 420.73,
    "Low": 408.95,
    "Close": 412.38,
    "Volume": 50690600,
    "Date": "2025-01-23"
  },
  {
    "Open": 414.45,
    "Hight": 418.88,
    "Low": 405.78,
    "Close": 406.58,
    "Volume": 56427100,
    "Date": "2025-01-24"
  },
  {
    "Open": 394.8,
    "Hight": 406.69,
    "Low": 389.0,
    "Close": 397.15,
    "Volume": 58125500,
    "Date": "2025-01-27"
  },
  {
    "Open": 396.91,
    "Hight": 400.59,
    "Low": 386.5,
    "Close": 398.09,
    "Volume": 48910700,
    "Date": "2025-01-28"
  },
  {
    "Open": 395.21,
    "Hight": 398.59,
    "Low": 384.48,
    "Close": 389.1,
    "Volume": 68033600,
    "Date": "2025-01-29"
  },
  {
    "Open": 410.78,
    "Hight": 412.5,
    "Low": 384.41,
    "Close": 400.28,
    "Volume": 98092900,
    "Date": "2025-01-30"
  }
]


## Putting Everything Together

In [46]:
system_prompt_q2 = ("""
You are an expert financial analysis assistant.  
Use only the information explicitly provided in the user’s inputs.  
Do not invent, fetch, or reference any external data.
You must support all answers with evidence and reasoning.
""")

user_prompt_q2 = """
Below is all the data—no other sources allowed:

You must support all answers with evidence and reasoning.

Be verbose.

1) Earnings results (earning.json):
{earnings_data}

2) Stock price summary:
{relevant_price_data}

3) News articles during the time:
{news_data}

4) Summary of the earnings call from the prior day:
{transcript_data}

Question
Why did the stock price increase on January 30?
List the likely drivers with an explanation, and cite exactly which data (earnings, transcript, news article, or price bar) supports it.
"""

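filled_user_q2 = user_prompt_q2.format(
    earnings_data=earnings,
    relevant_price_data=relevant_prices,
    # Serialize the article list to JSON so it matches the format of the other inputs
    news_data=json.dumps(news_window, indent=2, ensure_ascii=False),
    transcript_data=full_transcript_summary,
)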

messages_q2 = [
        {"role": "system", "content": system_prompt_q2},
        {"role": "user",   "content": filled_user_q2},
    ] 



output_q2 = pipe(messages_q2, **generation_args)
print(output_q2[0]["generated_text"])

Alright, let me break this down step by step. So, the user provided a detailed earnings talk excerpt and wants me to summarize it into three to four bullet points. My goal is to capture the key metrics, guidance, or actionable insights from the given text without adding anything extra.

First, I'll read through the query carefully to understand the context. The user provided an earnings talk with sections like company overview, revenue and pricing, operating performance, cash flow, management comments, and closing remarks. They want this condensed into bullet points focusing on key metrics, guidance, or actionable insights.

Next, I'll identify the main points from each section. In the company overview, the CEO mentions strong fundamentals and future goals. Revenue is up 15%, which is good. Pricing strategy is stable at $20 per share. Operating margins are healthy at 28-29%. Cash flow is positive, showing resilience. Management comments include confidence in their approach and looking 