In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from perfeed.tools.pr_summarizer import PRSummarizer
from perfeed.git_providers.github import GithubProvider
from perfeed.llms.ollama_client import OllamaClient
from perfeed.llms.openai_client import OpenAIClient
from perfeed.data_stores import FeatherStorage
import pandas as pd
import asyncio
import nest_asyncio
import json

from dotenv import load_dotenv
load_dotenv()
nest_asyncio.apply()

from IPython.display import display, Markdown

In [27]:
from perfeed.models.git_provider import CommentType
g = GithubProvider("Perfeed")
comments = asyncio.run(g._get_pr_comments(owner='Perfeed', repo_name='perfeed', pr_number=22, comment_type=CommentType.REVIEW_COMMENT))

In [28]:
[i.get('position', 'not_found') for i in comments]

[None,
 None,
 9,
 9,
 None,
 7,
 None,
 None,
 66,
 None,
 1,
 1,
 1,
 9,
 1,
 32,
 32,
 32,
 46]
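
The output above mixes integer positions with `None`. GitHub's REST API appears to report a null `position` when a review comment no longer maps onto the current diff (for example, an outdated comment); that interpretation is an assumption here. A minimal sketch for separating the two groups, assuming each comment is a dict with an optional `position` key, as the `.get` call above suggests. The helper name `split_by_position` and the sample data are hypothetical.

```python
from typing import Any


def split_by_position(
    comments: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split comments into (positioned, unpositioned) by the `position` field."""
    positioned = [c for c in comments if c.get("position") is not None]
    unpositioned = [c for c in comments if c.get("position") is None]
    return positioned, unpositioned


# Hypothetical sample mirroring the shape of the comments above:
# one explicit None, one integer position, one missing key.
sample = [{"id": 1, "position": None}, {"id": 2, "position": 9}, {"id": 3}]
positioned, unpositioned = split_by_position(sample)
print(len(positioned), len(unpositioned))
```

A missing key and an explicit `None` are treated the same way in this sketch, which may or may not be what downstream threading logic wants.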

In [15]:
comments_22



# PR summary

In [None]:
fs = FeatherStorage(data_type="pr_summary", overwrite=False, append=True)
summarizer = PRSummarizer(
    GithubProvider("Perfeed"), 
    llm=OpenAIClient("gpt-4o-mini"),
    # llm=OllamaClient("llama3.1"),
    store=fs)
pr_summary, metadata = asyncio.run(summarizer.run("perfeed", 22))

Here is the pull request context:

Pull request info:
PR author: 'jzxcd'
PR title: 'Weekly summary refine'
PR context:
'None'

PR code diff:
'diff --git a/perfeed/git_providers/base.py b/perfeed/git_providers/base.py
index e192fd9..45d8764 100644
--- a/perfeed/git_providers/base.py
+++ b/perfeed/git_providers/base.py
@@ -20,6 +20,11 @@ async def get_pr(self, repo: str, pr_number: int) -> PullRequest:
 
     @abstractmethod
     async def search_prs(
-        self, repo_name: str, start_date: datetime, end_date: datetime, authors: set[str], closed_only: bool = True
+        self,
+        repo_name: str,
+        start_date: datetime,
+        end_date: datetime,
+        authors: set[str],
+        closed_only: bool = True,
     ) -> list[int]:
         pass
diff --git a/perfeed/git_providers/github.py b/perfeed/git_providers/github.py
index 75d9e32..2f33d48 100644
--- a/perfeed/git_providers/github.py
+++ b/perfeed/git_providers/github.py
@@ -6,6 +6,9 @@
 from perfeed.config_loader im

TypeError: cannot unpack non-iterable NoneType object
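
The `TypeError` above is what Python raises when the right-hand side of a tuple unpacking is `None`, which suggests the awaited call returned nothing. A minimal sketch of guarding the unpacking so the failure is reported with context. The coroutine `run_summary` is a hypothetical stand-in and not the real `PRSummarizer.run`; its return shape is assumed.

```python
import asyncio
from typing import Optional


async def run_summary(pr_number: int) -> Optional[tuple[str, dict]]:
    """Hypothetical stand-in for a coroutine that may return None on failure."""
    if pr_number < 0:
        return None
    return f"summary for PR {pr_number}", {"pr_number": pr_number}


def run_and_unpack(pr_number: int) -> tuple[str, dict]:
    """Run the coroutine and fail with a clear message instead of an unpack error."""
    result = asyncio.run(run_summary(pr_number))
    if result is None:
        raise RuntimeError(f"no summary returned for PR {pr_number}")
    summary, metadata = result
    return summary, metadata


summary, metadata = run_and_unpack(22)
print(summary, metadata)
```

Checking for `None` before unpacking does not fix the underlying cause, but it points at the call that produced no result instead of at the assignment.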

In [31]:
# gpt-4o-mini
json.loads(pr_summary.model_dump_json())

{'type': ['Enhancement'],
 'title': 'Weekly summary refine',
 'description': 'This PR refines the weekly summary functionality by enhancing the comment threading logic and updating the `PRComment` model to include a `code_change` attribute. It also introduces a new method `comments_to_thread` to structure comments into threads, improving the clarity of discussions in pull requests.',
 'pr_files': [{'filename': 'perfeed/git_providers/base.py',
   'language': 'python',
   'changes_summary': 'Updated the `search_prs` method signature to improve clarity.',
   'changes_title': 'Refine search_prs method signature',
   'label': 'enhancement'},
  {'filename': 'perfeed/git_providers/github.py',
   'language': 'python',
   'changes_summary': 'Implemented `comments_to_thread` function to organize comments into threads based on replies.',
   'changes_title': 'Add comments_to_thread function',
   'label': 'enhancement'},
  {'filename': 'perfeed/models/git_provider.py',
   'language': 'python',
   '

In [None]:
# llama3.1
json.loads(pr_summary.model_dump_json())

{'type': ['Enhancement'],
 'title': 'Add BaseClient for llm interface with OpenAI and Ollama impl',
 'description': 'This PR adds a base client class for LLM interfaces, implementing the chat_completion method. It also includes two specific clients: OpenAIClient and OllamaClient. The code introduces new files in the src/llms directory and modifies the pyproject.toml file to include dependencies.',
 'pr_files': [{'filename': 'src/llms/base_client.py',
   'language': 'Python',
   'changes_summary': 'Added a base client class for LLM interfaces, implementing the chat_completion method. Introduced two specific clients: OpenAIClient and OllamaClient.',
   'changes_title': 'Base Client Implementation',
   'label': 'Enhancement'},
  {'filename': 'src/llms/ollama_client.py',
   'language': 'Python',
   'changes_summary': 'Implemented the OllamaClient class, inheriting from BaseClient. The chat_completion method interacts with the Ollama API to generate a response.',
   'changes_title': 'Ollama
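
The two outputs above differ in casing for the `language` and `label` fields (`'python'` / `'enhancement'` versus `'Python'` / `'Enhancement'`). A minimal sketch of normalizing these fields before comparing or aggregating results across models. The helper name `normalize_pr_files` is hypothetical, and lowercasing is only one possible convention.

```python
from typing import Any


def normalize_pr_files(pr_files: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the file entries with `language` and `label` lowercased."""
    normalized = []
    for entry in pr_files:
        fixed = dict(entry)  # shallow copy so the input is left untouched
        for key in ("language", "label"):
            if isinstance(fixed.get(key), str):
                fixed[key] = fixed[key].strip().lower()
        normalized.append(fixed)
    return normalized


# Entries abbreviated from the two outputs above.
pr_files = [
    {"filename": "perfeed/git_providers/base.py", "language": "python", "label": "enhancement"},
    {"filename": "src/llms/base_client.py", "language": "Python", "label": "Enhancement"},
]
normalized = normalize_pr_files(pr_files)
print({(f["language"], f["label"]) for f in normalized})
```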

# Weekly summary

In [None]:
model_name = "llama3.1"
llm = OllamaClient(model_name)
fs = FeatherStorage(data_type="pr_summary", overwrite=False, append=True)

summarizer = PRSummarizer(
    GithubProvider("Perfeed"), 
    llm=llm,
    store=fs
)
# for pr in [11, 12, 13, 14]:
#     pr_summary, metadata = asyncio.run(summarizer.run("perfeed", pr))
#     fs.save(data=pr_summary, metadata=metadata)

'OllamaClient'

In [8]:
df = fs.load()

# option 1: filter for the input provider/model, then rank summaries within each PR
df = df[(df['llm_provider'] == llm.__class__.__name__) & (df['model'] == model_name)]
rank = df.groupby(['pr_number'])['created_at'].rank(ascending=False)

# option 2: keep the latest per (PR, provider, model), without filtering by provider/model
# rank = df.groupby(['pr_number', 'llm_provider', 'model'])['created_at'].rank(ascending=False)

# keep only the most recent summary per group; copy so the column assignment below
# does not write to a slice of the original frame
df = df[rank == 1].copy()
# review interval in hours between PR creation and merge
df['review_interval'] = (pd.to_datetime(df['pr_merged_at']) - pd.to_datetime(df['pr_created_at'])).dt.total_seconds() / 3600
df = df[df['author'] == 'jzxcd']
df

Unnamed: 0,type,title,description,pr_files,comments,repo,author,pr_number,llm_provider,model,pr_created_at,pr_merged_at,created_at,review_interval
10,[Bug fix],fix pr_summarizer for proper format and async...,This PR fixes the `pr_summarizer` to properly ...,"[{'changes_summary': 'Added asyncio support, f...",[],perfeed,jzxcd,13,OllamaClient,llama3.1,2024-10-25T00:16:08Z,2024-10-25T04:03:44Z,2024-11-12T05:34:45Z,3.793333
11,[Bug fix],Datastore,Added support for feather and SQL DB storage. ...,[{'changes_summary': 'Added _data to .gitignor...,"[{'child_thread_ids': [1818137662, 1818349286]...",perfeed,jzxcd,14,OllamaClient,llama3.1,2024-10-26T23:48:11Z,2024-10-28T16:13:24Z,2024-11-12T05:35:13Z,40.420278
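
The `review_interval` values in the table can be checked by hand. A minimal standard-library sketch that keeps the most recent summary per PR and recomputes the interval in hours from the timestamps shown above. The helper names and the extra older row for PR 14 are hypothetical and included only to exercise the "latest" selection.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(ts: str) -> datetime:
    """Parse a timestamp such as 2024-10-25T00:16:08Z as UTC."""
    return datetime.strptime(ts, ISO_FORMAT).replace(tzinfo=timezone.utc)


def review_interval_hours(created_at: str, merged_at: str) -> float:
    """Hours elapsed between PR creation and merge."""
    return (parse_utc(merged_at) - parse_utc(created_at)).total_seconds() / 3600


def latest_per_pr(rows: list[dict]) -> list[dict]:
    """Keep, for each pr_number, the row with the most recent created_at."""
    latest: dict[int, dict] = {}
    for row in rows:
        key = row["pr_number"]
        current = latest.get(key)
        if current is None or parse_utc(row["created_at"]) > parse_utc(current["created_at"]):
            latest[key] = row
    return list(latest.values())


rows = [
    {"pr_number": 13, "pr_created_at": "2024-10-25T00:16:08Z",
     "pr_merged_at": "2024-10-25T04:03:44Z", "created_at": "2024-11-12T05:34:45Z"},
    {"pr_number": 14, "pr_created_at": "2024-10-26T23:48:11Z",
     "pr_merged_at": "2024-10-28T16:13:24Z", "created_at": "2024-11-12T05:35:13Z"},
    # Hypothetical older summary of PR 14; it should be dropped.
    {"pr_number": 14, "pr_created_at": "2024-10-26T23:48:11Z",
     "pr_merged_at": "2024-10-28T16:13:24Z", "created_at": "2024-11-01T00:00:00Z"},
]

intervals = {
    row["pr_number"]: round(review_interval_hours(row["pr_created_at"], row["pr_merged_at"]), 6)
    for row in latest_per_pr(rows)
}
print(intervals)  # {13: 3.793333, 14: 40.420278}
```

The results match the `review_interval` column: 3 h 47 min 36 s for PR 13 and 40 h 25 min 13 s for PR 14.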


In [7]:
from jinja2 import Environment, StrictUndefined
from perfeed.config_loader import settings
from perfeed.utils.utils import json_output_curator

llm_llama = OllamaClient("llama3.1")
llm_openai = OpenAIClient("gpt-4o-mini")

In [8]:

variables = {
    "pr_summaries": df.to_json(orient='records'),
}
environment = Environment(undefined=StrictUndefined)
system_prompt = environment.from_string(
    settings.weekly_summary_prompt.system
).render(variables)
user_prompt = environment.from_string(
    settings.weekly_summary_prompt.user
).render(variables)
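
The cell above renders both prompts with `StrictUndefined`, so a template variable missing from `variables` raises an error instead of silently rendering as an empty string. A minimal sketch of that behavior, assuming `jinja2` is installed. The template text is hypothetical; the real prompts come from `settings.weekly_summary_prompt`.

```python
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import UndefinedError

environment = Environment(undefined=StrictUndefined)
# Hypothetical template text standing in for the configured prompt.
template = environment.from_string("Summarize these PRs: {{ pr_summaries }}")

# With the variable supplied, rendering succeeds.
rendered = template.render({"pr_summaries": "[]"})
print(rendered)

# With the variable missing, StrictUndefined raises at render time.
try:
    template.render({})
    missing_detected = False
except UndefinedError:
    missing_detected = True
print(missing_detected)
```

With the default `Undefined`, the second render would produce `Summarize these PRs: ` and the omission could go unnoticed in the prompt sent to the model.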


In [9]:
# llama3.1
summary = llm_llama.chat_completion(system_prompt, user_prompt)
Markdown(summary)

**PR Summary for the Week**

### Overview of PRs

This week, there were two pull requests (PRs) merged into the Perfeed repository.

#### PR 1: Fixing pr_summarizer for Proper Format and Asyncio Run
Type: Bug fix
Title: fix pr_summarizer for proper format and asyncio run
Description: This PR fixes the `pr_summarizer` to properly output JSON format and adds asyncio support. It also restructures the utils.

#### PR 2: Adding Support for Feather and SQL DB Storage
Type: Bug fix
Title: Datastore
Description: Added support for feather and SQL DB storage. Also added save and load functionality.

### Significant Changes

**PR 1**

* **Bug Fix**: Updated `pr_summarizer` to properly output JSON format and added asyncio support.
* **Enhancement**: Restructured the utils module, added _output_curator function to the utils module, and renamed tokenizer.py to utils.py.

**PR 2**

* **Bug Fix**: Added support for feather and SQL DB storage, and added save and load functionality.
* **Discussion Points**:
	+ Discussion on async function and PRSummarizer.run()
	+ Discussion on PRSummaryMetadata class
	+ Discussion on OllamaClient class
	+ Discussion on DataStore class and factory function
	+ Discussion on BaseStorage class and return type
	+ Discussion on validate_and_convert function

### Review Interval

The review interval for both PRs was within a reasonable timeframe.

### Conclusion

Two important PRs were merged into the Perfeed repository this week, fixing issues with `pr_summarizer` and adding support for feather and SQL DB storage. The discussions on these PRs provided valuable insights and suggestions from team members.

In [10]:
# gpt-4o-mini
summary = llm_openai.chat_completion(system_prompt, user_prompt)
Markdown(summary)

### Summary of Recent PRs

This week, jzxcd merged 2 PRs focused on bug fixes and enhancements to the `pr_summarizer` and datastore functionalities.

---

**Overview:**
- The first PR addressed issues with the `pr_summarizer`, ensuring proper JSON output and adding asyncio support. The second PR introduced support for feather and SQL database storage, along with save and load functionalities.

---

**Significant Changes:**
- **PR Summarizer Fixes:**
  - Fixed JSON output formatting and added asyncio support to enhance performance.
  - Restructured utility functions for better organization.
  
- **Datastore Enhancements:**
  - Introduced a base class for storage handlers, allowing for easier management of different storage types.
  - Added classes for feather and SQL database storage, enabling flexible data handling.

---

**Refactors/Architecture:**
- Renamed `tokenizer.py` to `utils.py` to better reflect its purpose.
- Updated the `.gitignore` file to exclude unnecessary data files.
- Created an `__init__.py` file for the data stores to facilitate package management.

---

**Review Process:**
- In the first PR, there were discussions regarding the use of the async keyword for parallel execution, with jzxcd agreeing to keep it despite initial issues.
- Clarifications were provided on the purpose of the `PRSummaryMetadata` class and the `self.provider` attribute, enhancing understanding among team members.
- In the second PR, suggestions were made to improve the design, such as using a generic return type for the `load()` method and moving functions to the base class. jzxcd engaged in constructive discussions, agreeing to some suggestions while providing rationale for others.
- Both PRs were merged efficiently, with the first merged within 4 hours and the second within 40 hours after addressing feedback.

--- 

This summary aligns with the sprint objectives by enhancing the functionality and performance of the `pr_summarizer` and datastore components, ensuring a more robust and maintainable codebase.