# End of week 1 exercise

To demonstrate your familiarity with the OpenAI API and Ollama, build a tool that takes a technical question
and responds with an explanation. This is a tool that you will be able to use yourself during the course!

In [8]:
import os
from openai import OpenAI
from IPython.display import Markdown, display, update_display
from dotenv import load_dotenv

In [9]:
# constants
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
MODEL_GPT = 'gpt-4o-mini'
MODEL_LLAMA = 'llama3.2'

In [10]:
openai = OpenAI()
system_prompt = '''
You are a tool that takes a technical question and responds with an explanation.
For example, if I give you the prompt "What is the use of a broadcast variable in Spark?",
you will respond with an explanation of what a broadcast variable is and how it is used in Spark, along with a piece of example code.
Decide how long the response should be based on how much explanation the question requires.
'''

In [11]:
# here is the question; type over this to ask something new

question = """
Please explain why it is necessary to use Spark if I can use Python too.
"""

In [13]:
# Build the chat messages and stream the GPT response into a live Markdown display
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question},
]
stream = openai.chat.completions.create(model=MODEL_GPT, messages=messages, stream=True)
result = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    result += chunk.choices[0].delta.content or ''
    update_display(Markdown(result), display_id=display_handle.display_id)

Apache Spark is a powerful distributed computing system designed to handle large-scale data processing efficiently, while Python is a programming language. While you can perform data analysis and manipulation using Python, Spark provides several advantages that make it necessary, particularly when dealing with big data scenarios:

### Reasons to Use Spark

1. **Scalability**: Spark can easily scale to handle petabytes of data across many machines. It distributes data across the cluster, allowing for parallel processing, which can dramatically speed up data manipulation operations that Python, running on a single machine, would struggle to handle.

2. **Speed**: Spark is designed for fast computation. It achieves this through in-memory processing, which reduces the amount of time spent reading from and writing to disk. This is particularly beneficial for iterative algorithms common in machine learning, which often require multiple passes over the data.

3. **Ease of Use with Large Data Sets**: Spark provides high-level APIs in Java, Scala, Python, and R, and it comes with built-in libraries for SQL, machine learning (MLlib), graph processing (GraphX), and stream processing (Spark Streaming). This makes it easier to perform complex data operations.

4. **Fault Tolerance**: Through its resilient distributed datasets (RDDs), Spark can automatically recover lost data and rerun failed tasks, making it more robust than standard Python scripts which may fail completely if an error occurs.

5. **Integration with Big Data Tools**: Spark works well with big data technologies like Hadoop and Apache Mesos. It can read and write data from various sources such as HDFS, HBase, and S3, making it highly compatible within the big data ecosystem.

6. **Distributed Processing**: Spark allows you to run your computations on a cluster of machines. This means that you can leverage the processing power of multiple nodes, which is crucial when your data set exceeds the memory capacity of a single machine.

### Example Use Case

Imagine you have a massive dataset containing millions of entries related to user behaviors on a website, and you want to perform operations such as filtering, aggregating, and machine learning model training.

**PySpark Example**:

Here's a simple example of using PySpark to perform data processing:

```python
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("DataProcessingExample") \
    .getOrCreate()

# Load data into a DataFrame
data_file_path = "hdfs://path/to/your/large-data-file.csv"
user_data = spark.read.csv(data_file_path, header=True, inferSchema=True)

# Perform some transformations
filtered_data = user_data.filter(user_data['activity'] == 'click')
aggregated_data = filtered_data.groupBy('user_id').count()

# Show the result
aggregated_data.show()

# Stop the Spark Session
spark.stop()
```
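
For comparison, the same filter-and-count can be sketched in plain Python on a single machine. This is only a toy illustration of the idea of counting per partition and merging the partial results. It is not how Spark is implemented, and the sample rows, the two-way split, and the helper names `count_clicks` and `merge_counts` are assumptions made for the example.

```python
from collections import Counter


def count_clicks(rows):
    """Count 'click' rows per user_id within one chunk (partition) of the data."""
    return Counter(row["user_id"] for row in rows if row["activity"] == "click")


def merge_counts(partials):
    """Combine the per-partition counts into one overall count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical sample data using the same columns as the PySpark example
rows = [
    {"user_id": "u1", "activity": "click"},
    {"user_id": "u2", "activity": "view"},
    {"user_id": "u1", "activity": "click"},
    {"user_id": "u2", "activity": "click"},
]

# Pretend the data is split in two; each partition could be counted independently
partitions = [rows[:2], rows[2:]]
clicks_per_user = merge_counts(count_clicks(p) for p in partitions)
print(dict(clicks_per_user))
```

Here everything runs in one process and must fit in memory. In Spark, the partitions live on different machines and the per-partition work runs in parallel.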

### Conclusion

In summary, while Python is a versatile programming language for data analysis, Spark provides the tools and environment necessary for handling large datasets efficiently and effectively. If you're working with significant amounts of data or need distributed processing capabilities, utilizing Spark is a strong choice.

In [None]:
# Get Llama 3.2 to answer via Ollama's OpenAI-compatible endpoint
OLLAMA_BASE_URL = "http://localhost:11434/v1"

# Ollama does not check the API key, but the OpenAI client needs a non-empty value
ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')

response = ollama.chat.completions.create(
    model=MODEL_LLAMA,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)

display(Markdown(response.choices[0].message.content))
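
Both cells send the same system prompt and question to an OpenAI-compatible client, so the call could be wrapped in one helper. The following is a minimal sketch and not part of the original exercise. The names `build_messages`, `collect_stream`, and `ask` are hypothetical, and it assumes the client exposes `chat.completions.create` with `stream=True`, as used in the GPT cell earlier.

```python
def build_messages(system_prompt, question):
    """Return the two-message chat payload used throughout this notebook."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Defensive: skip chunks that carry no choices
        if not chunk.choices:
            continue
        # delta.content can be None, so fall back to an empty string
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def ask(client, model, system_prompt, question):
    """Send the question to any OpenAI-compatible client and return the full answer."""
    stream = client.chat.completions.create(
        model=model,
        messages=build_messages(system_prompt, question),
        stream=True,
    )
    return collect_stream(stream)
```

With the clients defined earlier, `ask(openai, MODEL_GPT, system_prompt, question)` and `ask(ollama, MODEL_LLAMA, system_prompt, question)` would each return the complete answer as a string, which can then be passed to `display(Markdown(...))`.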