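# Project Analysis

Descriptions are generated level by level, with a different prompt template for each level.

Rough description generation (top-down):
1. Read every file and generate a rough description for each one (the input text is truncated so the model's context limit is not exceeded).
2. From the file descriptions, generate a description of each file's namespace (its parent folder); most references point to namespaces.
3. Generate a description for each class in each file, using the file's references to pull in the rough descriptions of other files as context.
4. Generate a description for each method in each file, using the class description as context.

Fine description generation (bottom-up):
1. Generate each method's description from the descriptions of the other methods in its class, together with the class description.
2. Generate the class description from the fine descriptions of all methods in the class.
3. Generate the file description from the descriptions of all classes in the file.
4. Regenerate the file description, combining it with the descriptions of the referenced files.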
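
The two passes can be sketched on a toy project. This is only an illustration of the traversal order: `describe` is a hypothetical stand-in for the model call, the `project` dictionary is invented sample data, and the namespace and cross-file reference steps are left out for brevity.

```python
from typing import Dict, List

# Hypothetical toy project: file -> class -> method names.
project: Dict[str, Dict[str, List[str]]] = {"a.cs": {"Foo": ["Run", "Stop"]}}


def describe(kind: str, name: str, context: List[str]) -> str:
    """Stand-in for the LLM call; records how much context was available."""
    return f"{kind}:{name}<{len(context)} ctx>"


def rough_pass(project):
    """Top-down: file -> class -> method, each level seeded by the one above."""
    rough = {}
    for fpath, classes in project.items():
        file_desc = describe("file", fpath, [])
        rough[fpath] = file_desc
        for cname, methods in classes.items():
            class_desc = describe("class", cname, [file_desc])
            rough[(fpath, cname)] = class_desc
            for m in methods:
                rough[(fpath, cname, m)] = describe("method", m, [class_desc])
    return rough


def fine_pass(project, rough):
    """Bottom-up: method -> class -> file, each level summarising the one below."""
    fine = {}
    for fpath, classes in project.items():
        class_descs = []
        for cname, methods in classes.items():
            method_descs = []
            for m in methods:
                # Sibling method descriptions plus the rough class description.
                siblings = [rough[(fpath, cname, s)] for s in methods if s != m]
                desc = describe("method", m, siblings + [rough[(fpath, cname)]])
                fine[(fpath, cname, m)] = desc
                method_descs.append(desc)
            class_desc = describe("class", cname, method_descs)
            fine[(fpath, cname)] = class_desc
            class_descs.append(class_desc)
        fine[fpath] = describe("file", fpath, class_descs)
    return fine


rough = rough_pass(project)
fine = fine_pass(project, rough)
```

The rough pass gives every node some description before the fine pass starts, so the bottom-up step always has sibling and parent context to draw on.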

In [2]:
import time
import os
import pickle
from typing import *

In [3]:
import torch

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [5]:
from tinydb import TinyDB,Query

In [6]:
os.environ["RWKV_CUDA_ON"] = '1' # if '1' then use CUDA kernel for seq mode (much faster)

In [7]:
os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

In [8]:
device = 'cuda:1' if torch.cuda.is_available() else 'cpu'

In [9]:
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']
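
A minimal sketch of how the detected encoding might be used to read a source file safely. `read_text_auto` and its `fallback` parameter are hypothetical names, and the optional import is only there so the snippet also runs where `chardet` is not installed. `detect` can report `None` as the encoding (for example on empty input), so a fallback is applied.

```python
try:
    from chardet import detect
except ImportError:  # chardet is optional for this sketch
    detect = None


def read_text_auto(path: str, fallback: str = "utf-8") -> str:
    """Read a file as text, guessing its encoding from the raw bytes."""
    with open(path, "rb") as f:
        raw = f.read()
    encoding = None
    if detect is not None:
        # detect() returns a dict; 'encoding' may be None when it cannot decide.
        encoding = detect(raw)["encoding"]
    # errors="replace" keeps the pipeline running on a wrong guess.
    return raw.decode(encoding or fallback, errors="replace")
```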

## 1. Loading the project test data

In [10]:
import json

In [11]:
from chardet import detect

## 2. Model loading test

In [12]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [13]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("../data/deepseek-coder-6.7b-instruct-AQLM/", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("../data/deepseek-coder-6.7b-instruct-AQLM/", 
                                             # device_map = 'auto',
                                             trust_remote_code=True, 
                                             torch_dtype=torch.bfloat16).cuda()

`low_cpu_mem_usage` was None, now default to True since model is quantized.


## 3. Prompt templates for project analysis

In [16]:
import tinydb

#### Top-down: class structure description

In [64]:
file_db_lst = tinydb.TinyDB('../data/results/amc-rough_csfile_db-dscoderv2lite.json').all()
namespace_db_lst = tinydb.TinyDB('../data/results/amc-rough_namespace_db-dscoderv2lite.json').all()
class_db_lst = tinydb.TinyDB('../data/results/amc-rough_class_db-dscoderv2lite.json').all()
method_db_lst = tinydb.TinyDB('../data/results/amc-rough_method_db-dscoderv2lite.json').all()

In [65]:
file_db_map = {n['filepath']:n['desc'] for n in file_db_lst}

namespace_db_map = {n['namespace']:n['desc'] for n in namespace_db_lst}

with open('../data/processed/amc-by-tree-sitter.json', 'r') as f:
    project_meta = json.load(f)

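# Nest as filepath -> classname -> methodname -> desc.
# A dict comprehension keyed on filepath would keep only the last method per file.
method_db_map = {}
for n in method_db_lst:
    class_methods = method_db_map.setdefault(n['filepath'], {}).setdefault(n['classname'], {})
    class_methods[n['methodname']] = n['desc']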

In [71]:
class_db_map = {}
for cnode in class_db_lst:
    if not cnode['filepath'] in class_db_map:
        class_db_map[cnode['filepath']] = {}
    class_db_map[cnode['filepath']][cnode['classname']] = cnode['desc']
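
To illustrate the shape of these lookup maps, here is a small sketch on invented rows. `group_by_file` and `lookup` are hypothetical helpers, not part of the notebook. The chained `.get` calls return a default instead of raising `KeyError` when a file or class has no stored description yet.

```python
from typing import Dict, List

# Hypothetical rows in the shape stored by the class database.
rows: List[Dict[str, str]] = [
    {"filepath": "a.cs", "classname": "Foo", "desc": "Foo desc"},
    {"filepath": "a.cs", "classname": "Bar", "desc": "Bar desc"},
    {"filepath": "b.cs", "classname": "Baz", "desc": "Baz desc"},
]


def group_by_file(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Build filepath -> classname -> desc without dropping earlier rows."""
    grouped: Dict[str, Dict[str, str]] = {}
    for row in rows:
        grouped.setdefault(row["filepath"], {})[row["classname"]] = row["desc"]
    return grouped


def lookup(grouped, filepath: str, classname: str, default: str = "") -> str:
    """Return the stored description, or default when nothing is stored."""
    return grouped.get(filepath, {}).get(classname, default)


grouped = group_by_file(rows)
```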

In [60]:
def gen_prompt_class(ref_desc_map:Dict[str,str],
                     class_map:Dict[str,str],
                      max_length:int=1024):
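    '''
    Build the prompt for a class-level description.

    ref_desc_map: referenced file (namespace) name -> generated description text
    class_map: class content with keys class_name, class_comment, class_code
    max_length: maximum number of characters of class_code to include
    '''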
    template_pref = f"""你是一个资深的软件工程师，现在需要对一个 csharp 软件项目进行分析，并撰写代码文档，现在针对一个文件中单独的类class中包含的功能以及其作用进行描述。这个类在使用时可能调用了标准库以及本地的其他文件和命名空间，现在需要对这个类进行详细分析，需要撰写的文档格式如下：# 类的整体功能描述; # 类中包含的关键成员以及方法的介绍; # 综合上面信息，总结这个类的作用范围（如：信号分析、adapter、中间件等等）。
    这个类所引入的外部库的描述如下，分为引入的命名空间名称以及其对应的描述文本：
    """

    template_pref += '\n'.join(f"引用名：{k}；描述内容：{v}" for k,v in ref_desc_map.items())
    
    
    template_middle = f"\n\n类本身的代码内容如下：\n 类名：{class_map['class_name']}；类注释：{class_map['class_comment']}；类代码体：{class_map['class_code'][:max_length]}"
    
    template_surf = '\n\n请生成对应的类描述文档：\n'
    
    # Prefix (task + reference descriptions) + class body + closing instruction
    return template_pref + template_middle + template_surf
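
The prompt builder cuts the class body at a fixed character count, which can end the excerpt in the middle of a statement. One possible refinement, shown here only as a sketch, is to cut at the last complete line within the limit. `truncate_at_line` is a hypothetical helper and is not used by the notebook.

```python
def truncate_at_line(code: str, max_length: int = 1024) -> str:
    """Return at most max_length characters of code, ending on a full line."""
    if len(code) <= max_length:
        return code
    head = code[:max_length]
    cut = head.rfind("\n")
    # Fall back to a hard cut when no usable line break lies within the limit.
    return head if cut <= 0 else head[:cut]
```

A character limit is still only a rough proxy for the model's token limit, so the value of `max_length` would need to be chosen conservatively.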
    

In [61]:
db_class = tinydb.TinyDB('../data/results/amc-rough_class_db-dscoderv2lite.json')

In [56]:
amc_doc = {}
for k in project_meta:
    node = project_meta[k]
    fname = node['filename']
    fpath = node['filepath']
    namespace = node['namespace_key']
    if not namespace in amc_doc:
        amc_doc[namespace] = {
                            'namespace_doc':namespace_db_map[namespace],
                             'file_list':[]
        }
    csfile_doc_node = {
                    'filepath':fpath,
                      'file_doc':file_db_map[fpath],
                    'class_list':[]
    }
    
    allref_lst = node['source_reference'] + list(node['relocal_reference'])
    ref_desc_map = {k:('' if not k in namespace_db_map else namespace_db_map[k]) for k in allref_lst}
    for class_node in node['source_class']:
        class_name = class_node['class_name']
        if class_name in class_db_map.get(fpath, {}):
            # print(f'{fpath} {class_name} exist continue.')
            continue
            
        prompt = gen_prompt_class(ref_desc_map=ref_desc_map,class_map=class_node)
        
        pt_llama = new_p = f"""<|im_start|>使用中文的csharp资深开发者 "\n {prompt} 使用中文回答，尽可能的详尽描述每个部分的功能，以及总结整体代码功能（800字）。<|im_end|>\n"""
        
        messages=[
            { 'role': 'user', 'content': pt_llama}
        ]
        
        inputs = tokenizer.apply_chat_template(messages, 
                                               add_generation_prompt=True, 
                                               return_tensors="pt").to(model.device)
        # tokenizer.eos_token_id is the id of <|EOT|> token
        outputs = model.generate(inputs, max_new_tokens=4086, 
                                 do_sample=False, 
                                 top_k=50, 
                                 top_p=0.95, 
                                 num_return_sequences=1, 
                                 eos_token_id=tokenizer.eos_token_id)
        desc = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    
        # Insert into the database
        db_class.insert({'filepath': fpath, 'classname':class_name,'desc': desc})
        print(f'{fpath} - {class_name} insert sucsscfull.')
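
The skip check in this loop consults `class_db_map`, which is built before the loop and not updated inside it, so a class name that appears more than once in the same file may be generated and inserted repeatedly. A minimal sketch of one way to make the insert idempotent follows. It uses a plain list as a stand-in for the database table, and `insert_once` and `seen` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple


def insert_once(rows: List[Dict[str, str]],
                seen: Set[Tuple[str, str]],
                filepath: str,
                classname: str,
                desc: str) -> bool:
    """Append a row unless (filepath, classname) was already stored."""
    key = (filepath, classname)
    if key in seen:
        return False
    rows.append({"filepath": filepath, "classname": classname, "desc": desc})
    seen.add(key)  # remember the key so later duplicates are skipped
    return True


rows: List[Dict[str, str]] = []
seen: Set[Tuple[str, str]] = set()
first = insert_once(rows, seen, "gclib.cs", "LibraryPath", "desc 1")
second = insert_once(rows, seen, "gclib.cs", "LibraryPath", "desc 2")
```

In the real loop the check would go before the model call, so that the expensive generation step is skipped as well.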

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.

../data/AMC/AMC/AutoScan/AbstractAutoScanTask.cs - AutoScanTaskFactory insert sucsscfull.
../data/AMC/AMC/AutoScan/AbstractAutoScanTask.cs - AutoScanPageName insert sucsscfull.
../data/AMC/AMC/AutoScan/AutoScanParallelTask.cs - AutoScanParallelPageName insert sucsscfull.
../data/AMC/AMC/AutoScan/AutoScanSerialTask.cs - AutoScanSerialPageName insert sucsscfull.
../data/AMC/AMC/SystemPara/Form_GlobalparaDataType.cs - Form_GlobalparaDataType insert sucsscfull.
../data/AMC/AMC/SystemPara/GlobalParameter.cs - GlobalParameter insert sucsscfull.
../data/AMC/AMC/SystemPara/SystemParameter.cs - SystemParameterKey insert sucsscfull.
../data/AMC/AMC/SystemPara/SystemParameter.cs - RobotTrajKey insert sucsscfull.
../data/AMC/AMC/SystemPara/SystemParameter.cs - RobotTrajKeySerial insert sucsscfull.
../data/AMC/AMC/SystemPara/SystemParameter.cs - RobotTrajKeyParallel insert sucsscfull.
../data/AMC/AMC/SystemPara/SystemParameter.cs - IOUseKey insert sucsscfull.
../data/AMC/AMC/SystemPara/SystemParameter.cs - DeviceAppKey insert sucsscfull.
../data/AMC/AMC/SystemPara/SystemParameter.cs - AuthorityControlsKey insert sucsscfull.
../data/AMC/Common/gclib-bak.cs - LibraryPath insert sucsscfull.
../data/AMC/Common/gclib.cs - LibraryPath insert sucsscfull.
../data/AMC/Common/gclib.cs - LibraryPath insert sucsscfull.
../data/AMC/Common/gclib.cs - LibraryPath insert sucsscfull.
../data/AMC/Devices/MotionCard/GailMotionCard/gclib.cs - LibraryPath insert sucsscfull.
../data/AMC/Devices/MotionCard/GailMotionCard/gclib.cs - LibraryPath insert sucsscfull.
../data/AMC/Devices/MotionCard/GailMotionCard/gclib.cs - LibraryPath insert sucsscfull.
../data/AMC/Measure/DrawObjects/Xml/XRemarks.cs - XRemarks insert sucsscfull.
../data/AMC/Measure/DrawObjects/Xml/XRemarks.cs - XRemarkBase insert sucsscfull.
../data/AMC/Measure/DrawObjects/Xml/XRemarks.cs - XRemarkLine insert sucsscfull.
../data/AMC/Measure/DrawObjects/Xml/XRemarks.cs - XRemarkMultLine insert sucsscfull.
../data/AMC/Measure/DrawObjects/Xml/XRemarks.cs - XRemarkCircle insert sucsscfull.
../data/AMC/Measure/DrawObjects/Xml/XRemarks.cs - XRemarkRect insert sucsscfull.
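
The attention-mask warning in the output above comes from passing only the token ids to `generate`. A minimal sketch of one way to address it, assuming a `transformers` version whose `apply_chat_template` accepts `return_dict=True`, is shown below. `generate_description` is a hypothetical wrapper, and falling back to the EOS id for padding is an assumption. The log also prints `None` for `eos_token_id`, which may mean the tokenizer's EOS token is not being picked up and is worth checking separately.

```python
def generate_description(model, tokenizer, prompt, max_new_tokens=1024):
    """Greedy-decode a reply, passing attention_mask and pad_token_id explicitly."""
    messages = [{"role": "user", "content": prompt}]
    # return_dict=True yields both input_ids and attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    # Assumption: reuse EOS as padding when the tokenizer defines no pad token.
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        pad_id = tokenizer.eos_token_id
    outputs = model.generate(
        **inputs,  # forwards input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, so top_k / top_p are not needed
        pad_token_id=pad_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    prompt_len = len(inputs["input_ids"][0])
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```

With a loaded model this would be called as `desc = generate_description(model, tokenizer, pt_llama)` inside the loop.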
