# Data Analysis and Q&A Project Using a Local LLM

## Project Overview

This project requires you to perform a comprehensive analysis of a company's stock data using only the provided data sources and a local LLM. Your analysis should answer the following six questions strictly based on the supplied data and documents—no external data is allowed. All generated answers must be firmly based on the provided data, without any fabricated content. In addition, your logic must be clear, and any attribution of events must be causally linked.

---

## Provided Data

You will be provided with the following data sets:

#### Stock Price Data (JSON format)
* Timeframe: Jan 22 to Feb 5
* Fields: Open, High, Low, Close, Volume

#### Quarterly Earnings Data for the Past Year (JSON format)
* Contains key financial indicators (e.g., revenue, EPS) for each quarter.

#### Full Earnings Call Transcript
* The complete transcript of the earnings call, including management discussions and Q&A.

#### Balance Sheet Data for the Past Year (JSON format)
* Includes assets, liabilities, and shareholders' equity information.

#### News Articles
* Full text of 10 news articles related to the company during the analysis period.

---

## Questions
Using the provided data and a local LLM, you need to answer the following six questions:

1. What is the performance of the Tesla stock during this period (Jan 22 to Feb 5)?

2. Why did the price increase on Jan 30? Please provide potential factors.

3. Compared with previous quarters, how is the performance of this quarter?

4. With unsupervised Full Self Driving scheduled to launch in limited markets like Austin by June, what regulatory challenges does Tesla foresee for a nationwide or international rollout, and how is the company strategically preparing to address these hurdles?

5. What insights can be concluded from the earnings call?

6. Which key news events influenced the stock performance, and what insights do they offer?

---

## Project Requirements
- #### Data Source Restriction:
Only use the provided data and documents. No external data or information is allowed.

- #### Answer Generation:
All generated answers must strictly be based on the provided data and documents. The LLM should not "invent" information.

- #### Clear Logic and Causal Relationships:
For each question, your answers must clearly demonstrate logical reasoning, and any attribution of cause must be explicitly linked to events in the data.

- #### Prompt Design:
You must design your own prompts for calling the local LLM to ensure that the responses are generated strictly based on the analysis results.

- #### Result Evaluation:
After generating the answers, implement an evaluation step to assess whether the responses meet the above requirements in terms of data reliance, logical clarity, and correct causation.

- #### Please put the answers to these 6 questions in a dict at the end of your submitted Python notebook file.

For example:
```python
{"Q1 answer": "Answer1", "Q2 answer": "Answer2", "Q3 answer": "Answer3", "Q4 answer": "Answer4", "Q5 answer": "Answer5", "Q6 answer": "Answer6"}
```
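
One way to approach part of the result-evaluation requirement is a simple grounding check: extract every number cited in a generated answer and confirm that it also appears in the source text given to the LLM. The sketch below is only illustrative. `find_unsupported_numbers` is a hypothetical helper, not something the assignment prescribes, and it covers data reliance only; logical clarity and causation would need a separate check.

```python
import re

# Matches integers, thousands-separated integers, and decimals (e.g. 98,092,900 or 400.28)
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def find_unsupported_numbers(answer: str, source_text: str) -> list[str]:
    """Return numbers cited in `answer` that never appear in `source_text`."""
    def normalize(text: str) -> set[str]:
        # Drop thousands separators so "98,092,900" and "98092900" compare equal
        return {match.replace(",", "") for match in NUMBER_RE.findall(text)}

    source_numbers = normalize(source_text)
    return sorted(n for n in normalize(answer) if n not in source_numbers)

# Illustrative inputs: the source line is taken from the price data,
# the answer is a made-up LLM response containing one invented figure.
source = "2025-01-30: Open $410.78, Close $400.28, Volume 98,092,900"
answer = "On Jan 30 the stock closed at $400.28, up from $350.00."
print(find_unsupported_numbers(answer, source))
```

An empty list suggests that every figure in the answer can be traced back to the supplied data; any returned value is a candidate hallucination worth inspecting by hand.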

## Dependencies

* Transformers
* Torch (PyTorch)
* Accelerate

In [4]:
%pip install transformers accelerate pandas
# %pip install torch  # Install PyTorch if you don't already have it

DATA_DIR = "447_dataset"

import pandas as pd
import json
import os


Note: you may need to restart the kernel to use updated packages.


## Loading and Running the Local LLM

1. **Imports Transformers utilities**  
   - `AutoModelForCausalLM`: generic class for loading any GPT‑style model  
   - `AutoTokenizer`: matching tokenizer for converting text ↔ tokens  
   - `pipeline`: high‑level helper that ties model + tokenizer into one callable  

2. **Specifies the model repository**  
   ```python
   model_path = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
   ```

In [13]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_path = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Create a text-generation pipeline from the model and tokenizer loaded above
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 1200,     # Limit the number of tokens generated
    "return_full_text": False,  # Return only the generated text, not the prompt
    "do_sample": True,          # Use sampling to generate text
    "temperature": 0.1,         # Control the randomness of the output
    "repetition_penalty": 1.1,  # Values above 1 discourage repeated tokens
    "top_p": 0.9,               # Sample from the smallest token set with cumulative probability 0.9
    "top_k": 50,                # Sample only from the 50 most likely tokens
}

Device set to use cuda
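
To see why a `temperature` of 0.1 keeps the output close to deterministic even though `do_sample` is on, recall that sampling divides the logits by the temperature before applying softmax. The sketch below uses made-up logits for three hypothetical candidate tokens; it illustrates the general mechanism only and is not taken from the `pipeline` internals.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens (assumption for illustration)
logits = [2.0, 1.0, 0.5]

for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

At the lower temperature almost all probability mass lands on the top-scoring token, which suits a task where answers must stay tied to the supplied data.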


### Question 1: What is the performance of the Tesla stock during this period (Jan 22 to Feb 5)?

#### Prompt construction

* Here we need to describe the performance of the Tesla stock over the specified days. To do that, the LLM only needs to see how the stock was priced throughout the period, so the prompt includes the information from the `prices.json` file.
* To help the LLM construct its answer, we will also show the information in `balancesheet.json` so it can pick up on any trends that might explain why the stock price deviated and changed over those days.

In [None]:
def get_prices_summary():
    """
    Extract the price information and convert it into a simple,
    readable format to be included in the prompt to the LLM.
    """
    prices_path = os.path.join(DATA_DIR, "prices.json")

    # Load raw JSON into a DataFrame
    with open(prices_path, "r") as f:
        prices = pd.DataFrame(json.load(f))

    # Parse dates and index
    prices["Date"] = pd.to_datetime(prices["Date"])
    prices = prices.set_index("Date").sort_index()

    # Build a human-readable summary for each day
    daily_summaries = []
    for date, row in prices.iterrows():
        daily_summaries.append(
            f"{date.strftime('%Y-%m-%d')}: "
            f"Open ${row['Open']:.2f}, "
            f"High ${row['High']:.2f}, "
            f"Low ${row['Low']:.2f}, "
            f"Close ${row['Close']:.2f}, "
            f"Volume {int(row['Volume']):,}"
        )

    # Join them into one block of text
    daily_summary_text = "\n".join(daily_summaries)
    return daily_summary_text

print(get_prices_summary())

2025-01-22: Open $416.81, High $428.00, Low $414.59, Close $415.11, Volume 60,963,300
2025-01-23: Open $416.06, High $420.73, Low $408.95, Close $412.38, Volume 50,690,600
2025-01-24: Open $414.45, High $418.88, Low $405.78, Close $406.58, Volume 56,427,100
2025-01-27: Open $394.80, High $406.69, Low $389.00, Close $397.15, Volume 58,125,500
2025-01-28: Open $396.91, High $400.59, Low $386.50, Close $398.09, Volume 48,910,700
2025-01-29: Open $395.21, High $398.59, Low $384.48, Close $389.10, Volume 68,033,600
2025-01-30: Open $410.78, High $412.50, Low $384.41, Close $400.28, Volume 98,092,900
2025-01-31: Open $401.53, High $419.99, Low $401.34, Close $404.60, Volume 83,568,200
2025-02-03: Open $386.68, High $389.17, Low $374.36, Close $383.68, Volume 93,732,100
2025-02-04: Open $382.63, High $394.00, Low $381.40, Close $392.21, Volume 57,072,200
2025-02-05: Open $387.51, High $388.39, Low $375.53, Close $378.17, Volume 57,223,300
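
A small model can struggle with arithmetic, so it may help to pre-compute a few headline figures from the rows above rather than leave the calculation to the LLM. The sketch below is one possible approach, not part of the original notebook: `summarize_period` is a hypothetical helper, and the rows are copied by hand from the printed output instead of being read from `prices.json`.

```python
# (date, open, high, low, close, volume), copied from the printed summary
rows = [
    ("2025-01-22", 416.81, 428.00, 414.59, 415.11, 60_963_300),
    ("2025-01-23", 416.06, 420.73, 408.95, 412.38, 50_690_600),
    ("2025-01-24", 414.45, 418.88, 405.78, 406.58, 56_427_100),
    ("2025-01-27", 394.80, 406.69, 389.00, 397.15, 58_125_500),
    ("2025-01-28", 396.91, 400.59, 386.50, 398.09, 48_910_700),
    ("2025-01-29", 395.21, 398.59, 384.48, 389.10, 68_033_600),
    ("2025-01-30", 410.78, 412.50, 384.41, 400.28, 98_092_900),
    ("2025-01-31", 401.53, 419.99, 401.34, 404.60, 83_568_200),
    ("2025-02-03", 386.68, 389.17, 374.36, 383.68, 93_732_100),
    ("2025-02-04", 382.63, 394.00, 381.40, 392.21, 57_072_200),
    ("2025-02-05", 387.51, 388.39, 375.53, 378.17, 57_223_300),
]

def summarize_period(rows):
    """Compute close-to-close change and the period's extremes."""
    first_close, last_close = rows[0][4], rows[-1][4]
    pct_change = (last_close - first_close) / first_close * 100
    high_row = max(rows, key=lambda r: r[2])    # row with the highest High
    low_row = min(rows, key=lambda r: r[3])     # row with the lowest Low
    volume_row = max(rows, key=lambda r: r[5])  # row with the largest Volume
    return {
        "first_close": first_close,
        "last_close": last_close,
        "pct_change": round(pct_change, 2),
        "period_high": (high_row[0], high_row[2]),
        "period_low": (low_row[0], low_row[3]),
        "peak_volume": (volume_row[0], volume_row[5]),
    }

summary = summarize_period(rows)
for key, value in summary.items():
    print(f"{key}: {value}")
```

These figures could be appended to the prompt alongside the daily rows, so that the LLM's answer cites numbers that are already verified against the data.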


In [None]:
system_prompt_q1 = (
    "You are an expert financial data analyst LLM.\n"
    "Please ensure that the rules below are followed:\n"
    "1. Use only the data summaries provided in the prompt.\n"
    "2. Show clear step-by-step reasoning, linking each claim to the data.\n"
    "3. Keep the explanation clear, logical, and accurate.\n"
    "4. Do not invent or hallucinate any information.\n"
    "After your answer, provide a checklist summary indicating whether each criterion is satisfied. "
    "If any criterion is not met, include a brief explanation."
)

user_prompt_template_q1 =( 
    "Please analyze the following information and provide a detailed answer. After your analysis, include a checklist verifying that:\n"
    "- The answer includes a clear introduction, analysis, and conclusion.\n"
    "- All necessary data points mentioned in the input are addressed.\n"
    "- The formatting matches the specified template.\n"
    "- The language is professional and concise.\n\n"
    "Here is the input data:\n"
    "{input_data}\n\n"
    "Provide your detailed analysis followed by the verification checklist."
)

# Fill in the template with the Q1 question and the daily price summary
question_q1 = "What is the performance of the Tesla stock during this period (Jan 22 to Feb 5)?"
filled_user_prompt = user_prompt_template_q1.format(
    input_data=f"Question: {question_q1}\n\nDaily price data:\n{get_prices_summary()}"
)

messages = [
    {"role": "system", "content": system_prompt_q1},
    {"role": "user", "content": filled_user_prompt},
]

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
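
DeepSeek-R1 distilled models typically emit their reasoning first and close it with a `</think>` marker before the final answer. Assuming that behaviour holds for this model and tokenizer version, a small post-processing step can keep only the final answer for the answers dict. `extract_final_answer` is a hypothetical helper, and the sample string below is made up for illustration rather than real model output.

```python
def extract_final_answer(generated_text: str) -> str:
    """Return the text after the last </think> marker, or the whole text if absent."""
    marker = "</think>"
    if marker in generated_text:
        # Keep only what follows the reasoning section
        return generated_text.rsplit(marker, 1)[1].strip()
    # No marker found (e.g. generation was cut off mid-reasoning): return as-is
    return generated_text.strip()

# Made-up sample in the shape the model is assumed to produce
sample_output = (
    "<think>\nCompare the first and last close.\n</think>\n\n"
    "The stock fell over the period."
)
print(extract_final_answer(sample_output))
```

If the marker is missing, the reasoning may have consumed the whole `max_new_tokens` budget, in which case raising the limit or shortening the prompt is worth considering.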
