Notebook created by Rosa Filgueira - r.filgueira@epcc.ed.ac.uk

# Python Repository Embeddings and Similarity Analysis

## **Introduction**
This notebook explores Python code embedding and similarity analysis using transformer models. It provides a comprehensive guide for understanding, processing, and analyzing Python repositories to derive semantic insights.

### **Key Objectives**
1. Learn how to generate embeddings for Python repositories at various levels (e.g., code, documentation, requirements, and README).
2. Perform semantic similarity analysis between Python repositories using embeddings.
3. Utilize libraries like `inspect4py` and `RepoSim4Py` for parsing repositories and generating embeddings.




## **Table of Contents**
1. [Introduction](#introduction)
2. [Python Snippets Embeddings and Similarity](#python-snippets-embeddings-and-similarity)
3. [Parsing Python Repositories](#parsing-python-repositories)
4. [Visualization of the `directory_info.json` file](#visualization-of-the-directory_infojson-file)
5. [Extract Metadata from `directory_info.json`](#extract-metadata-from-directory_infojson)
6. [Python Repository Embeddings (Different Levels) - RepoSim4Py](#python-repository-embeddings-different-levels---reposim4py)
7. [Extract Embeddings from the `repo_embeddings`](#extract-embeddings-from-the-repo_embeddings)
8. [Python Repositories Similarities (Different Levels)](#python-repositories-similarities-different-levels)




## **Python Snippets Embeddings and Similarity**
### **Concept**
Embedding Python snippets involves converting code components (e.g., functions, methods, classes) into numerical vectors. These embeddings capture the semantic intent and functionality of the code, allowing for similarity analysis.

### **Steps**
1. **Tokenization**:
   - Converts Python code into tokens (keywords, symbols, identifiers).
   - These tokens are input to a transformer model.
2. **Embedding Generation**:
   - Uses a pre-trained model (e.g., `UniXcoder`) to generate dense embeddings representing the code's logic.
3. **Similarity Calculation**:
   - Compares embeddings of two or more snippets using cosine similarity.
   - Cosine similarity ranges from -1 to 1; a score closer to 1 indicates more closely related snippets.
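
To make the similarity step concrete, here is a minimal pure-Python sketch of cosine similarity on toy 2-D vectors. The vectors are illustrative values only, not real UniXcoder embeddings, and the helper name `cosine` is chosen just for this example. In general, vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for embeddings (illustrative values only)
print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction: approximately 1
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: approximately -1
```

The magnitude of the vectors does not matter, only their direction, which is why the notebook normalizes the embeddings before taking the dot product.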

### **Key Outputs**
- Example snippets and their embeddings.
- Cosine similarity between:
  - Similar snippets: Medium similarity score (e.g., 0.68).
  - Identical snippets: Perfect similarity score (1.0).

In [None]:
!pip install transformers torch



In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load UniXcoder
## Base version
# tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
# model = AutoModel.from_pretrained("microsoft/unixcoder-base")

# The UniXcoder version used in RepoSim4Py - it gives us a slightly better result with this example.
tokenizer = AutoTokenizer.from_pretrained("Lazyhope/unixcoder-nine-advtest")
model = AutoModel.from_pretrained("Lazyhope/unixcoder-nine-advtest")

# Define a helper function to normalize embeddings
def normalize_embeddings(embeddings):
    return embeddings / torch.norm(embeddings, dim=1, keepdim=True)

# Encode Python code snippets
def encode_code(code):
    tokens = tokenizer(code, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model(**tokens).last_hidden_state.mean(dim=1)  # Mean pooling
    return embeddings

# Cosine similarity calculation
def cosine_similarity(embedding1, embedding2):
    embedding1 = normalize_embeddings(embedding1)
    embedding2 = normalize_embeddings(embedding2)
    return torch.matmul(embedding1, embedding2.T).item()

# Example Python code snippets
code_snippet_1 = """
def add(a, b):
    return a + b
"""

code_snippet_2 = """
def sum_numbers(x, y):
    return x + y
"""

# Encode the code snippets
embedding1 = encode_code(code_snippet_1)
embedding2 = encode_code(code_snippet_2)

# Calculate similarity
similarity = cosine_similarity(embedding1, embedding2)
print(f"Cosine Similarity: {similarity}")


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Cosine Similarity: 0.6800544261932373


In [None]:
# Example Python - this code snippet is exactly the same as code_snippet_1
code_snippet_3 = """
def add(a, b):
    return a + b
"""
# Encode the code snippets
embedding1 = encode_code(code_snippet_1)
embedding3 = encode_code(code_snippet_3)

# Calculate similarity
similarity = cosine_similarity(embedding1, embedding3)
print(f"Cosine Similarity: {similarity}")


Cosine Similarity: 1.0


## **Parsing Python Repositories**

This section demonstrates how to parse Python repositories using `inspect4py`, a Python library for extracting repository metadata, structure, and dependencies. It automates the analysis of:
- Functions, classes, and source code.
- Licenses and requirements.
- Directory trees and control flow.
- And more; see https://github.com/SemanticRepoHub/inspect4py


### **Installation**
Install the `inspect4py` library using pip:
```bash
!pip install inspect4py
```

Flags explained (options of the `inspect4py` command):
* `-i` or `--input_path`: Path to the file/directory to inspect (mandatory).
* `-o` or `--output_dir`: Specifies the output directory for results. Defaults to "OutputDir" if not provided.
* `-r` or `--requirements`: Extracts requirements from the repository.
* `-html` or `--html_output`: Generates an HTML summary of the parsed data.
* `-cl` or `--call_list`: Extracts function call lists.
* `-ld` or `--license_detection`: Detects repository licenses.
* `-si` or `--software_invocation`: Generates software invocation commands for the repository.
* `-dt` or `--directory_tree`: Captures the directory structure.
* `-sc` or `--source_code`: Extracts source code snippets.
* `-rm` or `--readme`: Extracts README content.
* `-md` or `--metadata`: Retrieves metadata via GitHub API.


### **Outputs**
* HTML file: Provides a human-readable summary of the parsed data.
  * `OUTPUT_DIR/directory_info.html`
* JSON file: Contains detailed structured information for further analysis.
  * `OUTPUT_DIR/directory_info.json`
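
The flags and output files described above can be combined in a single invocation. As a minimal sketch, the hypothetical helpers `build_inspect4py_command` and `expected_outputs` below (neither is part of `inspect4py`) assemble the command line as a list, using only flags from the list above, and derive the paths of the two output files. The sketch only builds the command; it does not run it.

```python
from pathlib import PurePosixPath

def build_inspect4py_command(input_path, output_dir, flags):
    """Assemble an inspect4py command line (hypothetical helper, not part of inspect4py)."""
    return ["inspect4py", "-i", input_path, "-o", output_dir, *flags]

def expected_outputs(output_dir):
    """Paths of the HTML and JSON summaries described in the Outputs list."""
    base = PurePosixPath(output_dir)
    return {
        "html": str(base / "directory_info.html"),
        "json": str(base / "directory_info.json"),
    }

# Flags taken from the list above: requirements, source code, README, HTML summary
cmd = build_inspect4py_command(
    "python-hello-world", "python-hello-world-output", ["-r", "-sc", "-rm", "-html"]
)
print(" ".join(cmd))
print(expected_outputs("python-hello-world-output"))
```

Once `inspect4py` is installed, a list built this way could be passed to `subprocess.run(cmd, check=True)` instead of typing the command in a shell cell.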


In [None]:
!pip install inspect4py

In [None]:
!git clone https://github.com/lazyhope/python-hello-world.git

Cloning into 'python-hello-world'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 2), reused 6 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), done.
Resolving deltas: 100% (2/2), done.


In [None]:
!inspect4py -i python-hello-world -o python-hello-world-output -r -si -sc -ld -cl -dt -rm -md -html

Creating jsDir:python-hello-world-output/python-hello-world/json_files
Finding the requirements with the pigar package for python-hello-world
python-hello-world/
├── .gitignore
├── hello_world.py
├── LICENSE
└── README.md
Analysis completed
Total number of folders processed (root folder is considered a folder): 1
Total number of files found:  2
Total number of classes found:  0
Total number of dependencies found in those files 0
Total number of functions parsed:  1


In [None]:
!cat python-hello-world-output/directory_info.json

{"python-hello-world-output/python-hello-world": [{"file": {"path": "/content/python-hello-world/hello_world.py", "fileNameBase": "hello_world", "extension": "py"}, "functions": {"main": {"doc": {"short_description": "Prints hello world"}, "min_max_lineno": {"min_lineno": 1, "max_lineno": 4}, "calls": ["print"], "source_code": "def main():\n    \"\"\"Prints hello world\"\"\"\n    print('Hello World!')"}}, "body": {"calls": ["hello_world.main"], "source_code": ["main()"]}, "main_info": {"main_flag": 1, "main_function": "hello_world.main", "type": "script"}, "is_test": false}], "directory_tree": {"python-hello-world": {"README.md": "text file", "LICENSE": "license file", "hello_world.py": "python script", ".gitignore": "git file"}}, "software_invocation": [{"type": "script", "run": "python /content/python-hello-world/hello_world.py", "has_structure": "main", "mentioned_in_readme": false, "ranking": 1}], "software_type": "script", "license": {"detected_type": [{"MIT": "97.3%"}, {"MIT-0": 

## **Visualization of the `directory_info.json` File**
The file `directory_info.json` contains all the extracted information about a repository.

### **Visualization Steps**
1. Use the `json2html` library to convert the JSON data into HTML format for better readability.
2. Save the HTML file and display it within the notebook.


In [None]:
import json
from json2html import json2html
from IPython.display import display, HTML  # IPython.core.display is deprecated for these names

# Load the JSON file
json_file_path = "python-hello-world-output/directory_info.json"

with open(json_file_path, "r") as file:
    data = json.load(file)

# Convert JSON to HTML using json2html
html_content = json2html.convert(json=data)

# Save the HTML file
html_file_path = "visualized_json.html"
with open(html_file_path, "w") as html_file:
    html_file.write(html_content)

# Display the HTML in the notebook
display(HTML(html_content))

print(f"HTML file saved to: {html_file_path}")


[Rendered HTML table of `directory_info.json`. It shows, as nested tables: the per-file information (`file`, `functions`, `body`, `main_info`, `is_test`) for `hello_world.py`, the directory tree, the software invocation (`type`, `run`, `has_structure`, `mentioned_in_readme`, `ranking`), the detected license (MIT 97.3%, MIT-0 92.6%) with its extracted text, the README content, and the GitHub metadata for `lazyhope/python-hello-world`.]

HTML file saved to: visualized_json.html
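
For a quick text-only overview that does not depend on `json2html`, the nested keys can also be printed as an indented outline. The sketch below uses a hypothetical helper, `outline`, and a small hand-abridged sample shaped like part of `directory_info.json`; with the real file, the loaded `data` dictionary would be passed instead.

```python
def outline(obj, depth=0, max_depth=2):
    """Return indented lines listing the keys of nested dicts/lists, up to max_depth."""
    lines = []
    if depth > max_depth:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append("  " * depth + str(key))
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(obj, list):
        # List items are shown at the same depth as their parent key's children
        for item in obj:
            lines.extend(outline(item, depth, max_depth))
    return lines

# Small sample mirroring part of directory_info.json (abridged by hand)
sample = {
    "directory_tree": {"python-hello-world": {"README.md": "text file", "LICENSE": "license file"}},
    "software_invocation": [{"type": "script", "ranking": 1}],
    "software_type": "script",
}
print("\n".join(outline(sample)))
```

Raising `max_depth` shows deeper levels of the structure, at the cost of a longer listing.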


## **Extract Metadata from `directory_info.json`**

This section extracts every `source_code` field from `directory_info.json`. The same approach works for any other property or field in the file.



### **Source Code Extraction**
A recursive function is used to extract all `source_code` fields from the JSON file.


In [None]:
import json

# Load the JSON file
json_file_path = "python-hello-world-output/directory_info.json"

with open(json_file_path, "r") as file:
    data = json.load(file)

# Recursive function to extract all 'source_code' values
def extract_source_code(json_obj):
    source_code_list = []
    if isinstance(json_obj, dict):
        for key, value in json_obj.items():
            if key == "source_code":
                if isinstance(value, list):
                    source_code_list.extend(value)  # Append all code snippets in the list
                else:
                    source_code_list.append(value)  # Append single code snippet
            else:
                source_code_list.extend(extract_source_code(value))
    elif isinstance(json_obj, list):
        for item in json_obj:
            source_code_list.extend(extract_source_code(item))
    return source_code_list

# Extract all source_code values
source_codes = extract_source_code(data)

# Display the extracted source codes
print("Extracted Source Code:")
for i, code in enumerate(source_codes, start=1):
    print(f"\nSource Code {i}:\n{code}")


Extracted Source Code:

Source Code 1:
def main():
    """Prints hello world"""
    print('Hello World!')

Source Code 2:
main()
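
The same recursive idea can be generalized to any field name. The sketch below is a hypothetical variant, `extract_field`, that takes the field name as a parameter; the sample dictionary is a hand-abridged stand-in for the loaded `data`. Unlike the function above, it keeps list values as they are instead of flattening them.

```python
def extract_field(json_obj, field):
    """Recursively collect every value stored under `field` in nested dicts/lists."""
    found = []
    if isinstance(json_obj, dict):
        for key, value in json_obj.items():
            if key == field:
                found.append(value)
            else:
                found.extend(extract_field(value, field))
    elif isinstance(json_obj, list):
        for item in json_obj:
            found.extend(extract_field(item, field))
    return found

# Abridged sample shaped like the directory_info.json shown earlier
sample = {
    "python-hello-world-output/python-hello-world": [
        {
            "functions": {"main": {"calls": ["print"], "source_code": "def main(): ..."}},
            "body": {"calls": ["hello_world.main"], "source_code": ["main()"]},
        }
    ],
    "software_type": "script",
}
print(extract_field(sample, "calls"))
print(extract_field(sample, "software_type"))
```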


### **README Extraction**
Extract `readme_files` data from the JSON file. This includes all README content related to the repository.


In [None]:
readme_files = data.get("readme_files")
print(readme_files)

{'python-hello-world-output/python-hello-world/README.md': '# python-hello-world\nHello world in a single python file.\n'}


## **Python Repository Embeddings (Different Levels) - RepoSim4Py**
The `RepoSim4Py` pipeline generates multi-level embeddings for Python repositories without requiring local cloning.

 Steps:
 * Initialise the pipeline.

 * Generate the embeddings (e.g., `code_embeddings`, `doc_embeddings`) for the `lazyhope/python-hello-world` repository.

Let's start! First, initialise the pipeline:

In [None]:
from transformers import pipeline

model = pipeline(model="Henry65/RepoSim4Py", trust_remote_code=True)


Device set to use cpu


[*] Please set GitHub token to avoid unexpected errors. 
For more info, see: https://docs.github.com/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token


Then pass one repository (or several repositories in a tuple) as input to obtain its embeddings at different levels. Here the input is `lazyhope/python-hello-world` (https://github.com/lazyhope/python-hello-world). Note that RepoSim4Py generates the embeddings at every level automatically, without requiring a local clone of the repository. This is not the case with `inspect4py`, which requires the repository to be cloned locally before it is run.

In [None]:
repo_embeddings = model("lazyhope/python-hello-world")

[+] Getting metadata for lazyhope/python-hello-world
[+] Downloading lazyhope/python-hello-world
[+] Extracting lazyhope/python-hello-world info



[*] Generating code embeddings for lazyhope/python-hello-world
[*] Generating doc embeddings for lazyhope/python-hello-world
[*] Generating requirement embeddings for lazyhope/python-hello-world
[*] Generating readme embeddings for lazyhope/python-hello-world


In [None]:
print(repo_embeddings)

[{'name': 'lazyhope/python-hello-world', 'topics': [], 'license': 'MIT', 'stars': 0, 'code_embeddings': array([[-2.07551098e+00,  2.81387782e+00,  2.35216832e+00,
         2.59944463e+00,  1.61096111e-01,  3.40696836e+00,
        -1.68738878e+00,  1.34109974e+00, -8.76387730e-02,
        -9.40873504e-01,  1.84302497e+00, -1.54258704e+00,
        -1.20860016e+00,  3.89647305e-01,  9.12968442e-02,
         1.30119666e-01,  2.70758629e+00,  1.68714738e+00,
        -1.72955608e+00,  1.99777484e+00,  1.92740226e+00,
        -6.32611141e-02, -1.65508246e+00,  3.06686664e+00,
        -6.60295308e-01,  2.11442399e+00,  1.05828118e+00,
         8.08641851e-01, -1.92431927e-01, -6.44402385e-01,
         1.18032956e+00, -1.32853782e+00, -9.91892457e-01,
        -1.34122834e-01,  1.51528263e+00,  2.25033689e+00,
         2.52902460e+00,  2.13251114e+00, -1.62592030e+00,
        -1.06958258e+00, -7.72430301e-01,  8.85902405e-01,
         2.44556636e-01, -7.68351912e-01,  3.38689780e+00,
        -1.

## **Extract Embeddings from the `repo_embeddings`**

When running the **RepoSim4Py** pipeline on a repository (e.g., `lazyhope/python-hello-world`), the pipeline generates embeddings representing different components of the repository. Below is the detailed explanation of the embeddings:

#### 1. **`code_embeddings`**
- **Description**: A matrix of embeddings representing individual segments of the codebase (e.g., functions or classes).
- **Shape**: `(1, 768)`
- **Purpose**: Captures semantic information from the code, allowing fine-grained analysis of its structure and functionality.

#### 2. **`mean_code_embedding`** - EMBEDDING OF A REPO AT CODE LEVEL
- **Description**: The mean of all embeddings in `code_embeddings`.
- **Shape**: `(1, 768)`
- **Purpose**: Provides an aggregated vector representation of the codebase, summarizing its semantic characteristics.

#### 3. **`doc_embeddings`**
- **Description**: A matrix of embeddings representing the textual content found in the repository, such as inline comments and documentation.
- **Shape**: `(1, 768)`
- **Purpose**: Captures the semantic information in the documentation for further analysis or comparison.

#### 4. **`mean_doc_embedding`** - EMBEDDING OF A REPO AT DOC LEVEL
- **Description**: The mean of all embeddings in `doc_embeddings`.
- **Shape**: `(1, 768)`
- **Purpose**: Provides a summarized representation of the repository’s documentation.

#### 5. **`requirement_embeddings`**
- **Description**: Embeddings derived from the `requirements.txt` file or similar dependency-related content.
- **Shape**: `(1, 768)`
- **Purpose**: Represents dependencies and external libraries used by the repository.

#### 6. **`mean_requirement_embedding`** - EMBEDDING OF A REPO AT REQ LEVEL
- **Description**: The mean of all embeddings in `requirement_embeddings`.
- **Shape**: `(1, 768)`
- **Purpose**: Provides a consolidated vector representation of the repository's dependencies.

#### 7. **`readme_embeddings`**
- **Description**: A matrix of embeddings derived from segments of the README file.
- **Shape**: `(3, 768)` (depending on the number of segments in the README).
- **Purpose**: Captures the semantic information in the repository’s primary documentation file.

#### 8. **`mean_readme_embedding`** - EMBEDDING OF A REPO AT README LEVEL
- **Description**: The mean of all embeddings in `readme_embeddings`.
- **Shape**: `(1, 768)`
- **Purpose**: Summarizes the content of the README into a single vector.

#### 9. **`mean_repo_embedding`** - EMBEDDING OF A REPO AT ALL LEVELS
- **Description**: A concatenation of `mean_code_embedding`, `mean_doc_embedding`, `mean_requirement_embedding`, and `mean_readme_embedding`.
- **Shape**: `(1, 3072)` (since each component has a size of 768, resulting in a combined size of 4×768).
- **Purpose**: Provides a holistic representation of the repository, considering both its code and documentation.
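
The relationship between the per-segment matrices, the mean vectors, and the concatenated repository vector can be sketched with toy data. The example below uses 4-dimensional vectors in place of the 768-dimensional ones, with made-up values and a hypothetical helper `mean_rows`. It illustrates the arithmetic described above only; it is not RepoSim4Py's actual implementation.

```python
def mean_rows(matrix):
    """Average a list of equal-length row vectors into a single vector."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]

# Toy 4-dimensional "embeddings" standing in for the 768-dimensional ones
code_embeddings = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]  # two code segments
doc_embeddings = [[0.5, 0.5, 0.5, 0.5]]                          # one docstring
requirement_embeddings = [[0.0, 0.0, 0.0, 0.0]]                  # placeholder values
readme_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

mean_code = mean_rows(code_embeddings)
mean_doc = mean_rows(doc_embeddings)
mean_req = mean_rows(requirement_embeddings)
mean_readme = mean_rows(readme_embeddings)

# Concatenate in the order given in item 9: code, doc, requirement, readme
mean_repo = mean_code + mean_doc + mean_req + mean_readme

print(mean_code)       # [2.0, 2.0, 2.0, 2.0]
print(len(mean_repo))  # 16, i.e. 4 levels x 4 dimensions (4 x 768 = 3072 in the real pipeline)
```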


We will be most interested in all the embeddings that have **MEAN** in their name: **mean_code_embedding** , **mean_doc_embedding**, **mean_readme_embedding**, **mean_requirement_embedding** and **mean_repo_embedding**.


To extract and print the `mean_code_embedding` from `repo_embeddings`, you can use the following code:

In [None]:
# Extract the mean_code_embedding from repo_embeddings
mean_code_embedding = repo_embeddings[0]['mean_code_embedding']

# Print the mean_code_embedding
print(mean_code_embedding)


[[-2.07551098e+00  2.81387782e+00  2.35216832e+00  2.59944463e+00
   1.61096111e-01  3.40696836e+00 -1.68738878e+00  1.34109974e+00
  -8.76387730e-02 -9.40873504e-01  1.84302497e+00 -1.54258704e+00
  -1.20860016e+00  3.89647305e-01  9.12968442e-02  1.30119666e-01
   2.70758629e+00  1.68714738e+00 -1.72955608e+00  1.99777484e+00
   1.92740226e+00 -6.32611141e-02 -1.65508246e+00  3.06686664e+00
  -6.60295308e-01  2.11442399e+00  1.05828118e+00  8.08641851e-01
  -1.92431927e-01 -6.44402385e-01  1.18032956e+00 -1.32853782e+00
  -9.91892457e-01 -1.34122834e-01  1.51528263e+00  2.25033689e+00
   2.52902460e+00  2.13251114e+00 -1.62592030e+00 -1.06958258e+00
  -7.72430301e-01  8.85902405e-01  2.44556636e-01 -7.68351912e-01
   3.38689780e+00 -1.32800138e+00 -9.68226433e-01 -5.88156164e-01
  -2.46586561e+00 -6.41603827e-01  5.38640738e-01 -7.32820690e-01
   1.00686514e+00 -4.46850330e-01  2.31803918e+00 -1.30480003e+00
   1.22141266e+00  8.20153773e-01  1.00025713e-01  4.63853925e-01
  -1.14763

To extract and print the mean_doc_embedding from repo_embeddings, you can use the same approach as for mean_code_embedding:

In [None]:
# Extract the mean_doc_embedding from repo_embeddings
mean_doc_embedding = repo_embeddings[0]['mean_doc_embedding']

# Print the mean_doc_embedding
print(mean_doc_embedding)

[[-2.37494445e+00  5.40957093e-01  2.29580140e+00  2.96114874e+00
   2.65712440e-01  1.86113524e+00 -4.52338845e-01  1.51038814e+00
  -1.45246160e+00  4.15124357e-01  1.62401366e+00 -9.30627167e-01
  -1.21469331e+00  1.37300289e+00 -4.74416554e-01 -4.08132970e-01
   1.09968197e+00  9.77959514e-01 -1.00369346e+00  2.32375175e-01
   1.73945343e+00  1.11439097e+00 -6.72563985e-02  2.44768929e+00
  -6.16708100e-01 -5.03109276e-01  1.56256425e+00 -3.37080032e-01
  -2.78485727e+00  3.40257913e-01  3.24099135e+00 -1.43355393e+00
   5.65988243e-01 -1.67214274e+00  2.18131113e+00  4.73105490e-01
   1.34431863e+00 -5.87368850e-03 -2.57398653e+00  1.33399236e+00
  -1.32202834e-01  8.32079768e-01 -3.81437302e-01  8.99240553e-01
   1.39589775e+00 -1.30818164e+00  1.31295133e+00  7.05342889e-01
  -2.39992023e+00  4.27395284e-01  1.44685709e+00  4.72248018e-01
   6.86656654e-01 -8.98061022e-02  2.27945089e+00 -8.24249208e-01
   3.54709089e-01 -6.83327079e-01  9.41495359e-01 -2.38424659e-01
  -4.23826

## **Python Repositories Similarities (Different Levels)**

### **Objective**
Calculate the semantic similarity between repositories based on embeddings at different levels.

### **Steps**
1. Extract mean embeddings for both repositories.
2. Use cosine similarity to compare the embeddings:
   - **Code similarity**
   - **Documentation similarity**
   - **README similarity**
   - **Overall repository similarity**
3. Present results in a tabular format.

In [34]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Load the RepoSim4Py model
from transformers import pipeline
model = pipeline(model="Henry65/RepoSim4Py", trust_remote_code=True)

# Generate embeddings for the two repositories
repo1_embeddings = model("lazyhope/python-hello-world")
repo2_embeddings = model("dbarnett/python-helloworld")




Device set to use cpu


[*] Please set GitHub token to avoid unexpected errors. 
For more info, see: https://docs.github.com/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
[+] Getting metadata for lazyhope/python-hello-world
[+] Downloading lazyhope/python-hello-world
[+] Extracting lazyhope/python-hello-world info



[*] Generating code embeddings for lazyhope/python-hello-world
[*] Generating doc embeddings for lazyhope/python-hello-world
[*] Generating requirement embeddings for lazyhope/python-hello-world
[*] Generating readme embeddings for lazyhope/python-hello-world
[+] Getting metadata for dbarnett/python-helloworld
[+] Downloading dbarnett/python-helloworld
[+] Extracting dbarnett/python-helloworld info



[*] Generating code embeddings for dbarnett/python-helloworld
[*] Generating doc embeddings for dbarnett/python-helloworld
[*] Generating requirement embeddings for dbarnett/python-helloworld
[*] Generating readme embeddings for dbarnett/python-helloworld


In [37]:
# Extract mean embeddings for similarity computation
def extract_mean_embeddings(embeddings, keys):
    return {key: embeddings[0][key].flatten() for key in keys}

embedding_keys = [
    "mean_code_embedding",
    "mean_doc_embedding",
    "mean_readme_embedding",
    "mean_requirement_embedding",
    "mean_repo_embedding",
]

repo1_mean_embeddings = extract_mean_embeddings(repo1_embeddings, embedding_keys)
repo2_mean_embeddings = extract_mean_embeddings(repo2_embeddings, embedding_keys)

# Compute cosine similarity for each embedding level
similarity_results = {
    "code_sim": cosine_similarity(
        [repo1_mean_embeddings["mean_code_embedding"]],
        [repo2_mean_embeddings["mean_code_embedding"]],
    )[0][0],
    "doc_sim": cosine_similarity(
        [repo1_mean_embeddings["mean_doc_embedding"]],
        [repo2_mean_embeddings["mean_doc_embedding"]],
    )[0][0],
    "readme_sim": cosine_similarity(
        [repo1_mean_embeddings["mean_readme_embedding"]],
        [repo2_mean_embeddings["mean_readme_embedding"]],
    )[0][0],
    "req_sim": cosine_similarity(
        [repo1_mean_embeddings["mean_requirement_embedding"]],
        [repo2_mean_embeddings["mean_requirement_embedding"]],
    )[0][0],
    "repo_sim": cosine_similarity(
        [repo1_mean_embeddings["mean_repo_embedding"]],
        [repo2_mean_embeddings["mean_repo_embedding"]],
    )[0][0],
}

# Convert results to a DataFrame for presentation
similarity_df = pd.DataFrame([similarity_results])

# Display the similarity results
print(similarity_df)



   code_sim   doc_sim  readme_sim  req_sim  repo_sim
0  0.704256  0.441496    0.775795      0.0  0.587949


In [38]:
similarity_df

|   | code_sim | doc_sim | readme_sim | req_sim | repo_sim |
|---|----------|---------|------------|---------|----------|
| 0 | 0.704256 | 0.441496 | 0.775795 | 0.0 | 0.587949 |


### **Analysis of Results**

The similarity scores indicate how semantically similar two Python repositories are across different levels of their structure. Here's a breakdown of the results:

| Metric         | Similarity Score | Interpretation                                                                                     |
|----------------|------------------|-----------------------------------------------------------------------------------------------------|
| **Code Similarity (`code_sim`)**   | **0.704256**         | A moderately high similarity score suggests that the codebases of the two repositories share significant structural or functional similarities. This could indicate similar coding patterns, algorithms, or functionality. |
| **Documentation Similarity (`doc_sim`)**   | **0.441496**         | A lower similarity score for documentation implies that the docstrings or inline comments vary more significantly between the repositories. This could reflect differences in documentation quality, style, or the level of detail provided. |
| **README Similarity (`readme_sim`)**   | **0.775795**         | A high similarity score for README files indicates that the repositories have very similar descriptions, which might point to shared goals, purposes, or project structures. |
| **Requirements Similarity (`req_sim`)**   | **0.0**              | A score of 0.0 for requirements suggests that the repositories either have no overlapping dependencies or that one or both repositories lack a requirements file entirely. |
| **Overall Repository Similarity (`repo_sim`)**   | **0.587949**         | The combined similarity score across all levels suggests that the repositories are somewhat related but not identical. This holistic measure is computed on the concatenation of the code, documentation, requirements, and README mean embeddings, so each level contributes in proportion to the magnitude of its vector rather than with a fixed equal weight. |
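
Two properties of these scores are easy to check on toy data: a level whose vector is all zeros yields a similarity of 0.0, and the cosine similarity of concatenated vectors is not simply the average of the per-level similarities. The sketch below uses made-up 2-D vectors for two hypothetical repositories and a pure-Python `cosine` helper that returns 0.0 for zero vectors; that convention is an assumption made for this example, chosen to be consistent with the `req_sim` of 0.0 reported above.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumed convention)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# Toy per-level vectors for two hypothetical repositories (not real embeddings)
repo_a = {"code": [10.0, 0.0], "readme": [0.0, 1.0], "req": [0.0, 0.0]}
repo_b = {"code": [10.0, 0.0], "readme": [1.0, 0.0], "req": [3.0, 4.0]}

per_level = {level: cosine(repo_a[level], repo_b[level]) for level in repo_a}
concat_a = repo_a["code"] + repo_a["readme"] + repo_a["req"]
concat_b = repo_b["code"] + repo_b["readme"] + repo_b["req"]

print(per_level)                    # code: 1.0, readme: 0.0, req: 0.0
print(sum(per_level.values()) / 3)  # plain average of the three levels
print(cosine(concat_a, concat_b))   # similarity of the concatenated vectors
```

In this toy case the concatenated similarity is much higher than the plain average because the code vectors have the largest magnitude and therefore dominate the dot product.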

### **Insights**
- The relatively high **code** and **README** similarity scores indicate that the repositories might share a common purpose or functionality.
- The lower **documentation similarity** indicates that the two repositories document their code quite differently, for example in style or level of detail.
- The absence of shared dependencies highlights potential differences in the technological stacks or a lack of explicit requirement files in one or both repositories.

This analysis can guide further exploration, such as inspecting specific areas of alignment or divergence to understand the nature of these repositories' similarities.
