# GitHub project

# Imports

In [None]:
# Uncomment the following line to install PyMySQL if necessary
#!pip install PyMySQL

In [None]:
# Draw the plots immediately after the current cell
%matplotlib inline

import pandas as pd
import pymysql
import warnings

# Project configuration

The basic database structure is created from the following file in the GitHub repository:

GitHubScraper/JupyterNotebook/Structure/githubProjectStructure.sql

### Step 1: import the structure
In MySQL Workbench, go to "Server" -> "Data Import". Choose "Import from Self-Contained File" and use the three-dots button to select the path to "githubProjectStructure.sql", as in the following image:

![Javatpoint](./Images/Project_Setup/data_import.png)  

Then scroll down, and create a new Schema by clicking on the "New..." button. Name it: "githubProject".

![Javatpoint](./Images/Project_Setup/create_new_schema.png)  

Scroll to the bottom and click "Start Import" in the bottom-right corner. If there are no errors, the view switches to "Import Progress" and reports success.

Refresh the schemas:

![Javatpoint](./Images/Project_Setup/refresh.png)  

### Step 2: import the data from the CSV files

Download the CSV files [here](https://drive.google.com/drive/folders/1NWhfFss0_M9V_clkcE9TH-Fy3oIEndWV?usp=sharing).

Right-click the "githubProject" schema and choose "Table Data Import Wizard". Use the wizard to import each CSV file into its matching table.

Select the .csv file you want to import, in this example trendVisits.csv:

![Javatpoint](./Images/Project_Setup/file_import.png) 

Since the structure already exists, choose the existing table that matches the CSV file:

![Javatpoint](./Images/Project_Setup/use_existing.png) 

After that, check that the column types are correct and run the import. Repeat this for each CSV file.

You can also verify the imported data as follows:

![Javatpoint](./Images/Project_Setup/verify.png) 



# Connection

First, connect to the database:

In [None]:
db_name = "githubProject"
db_host = "localhost"
db_port = 3306
db_username = "root"
db_password = input("Enter the DB password")

dataBaseConnection = pymysql.connect(host=db_host,
                            port=db_port,
                            user=db_username,
                            password=db_password,
                            db=db_name)

In [None]:
# Get the cursor to interact with the database
cursor = dataBaseConnection.cursor()

In [None]:
def execute_select_query(cursor, query):
    """
    Executes a SELECT query.

    Returns a tuple (rows, DataFrame), or (None, None) if the query fails.
    """
    try:
        cursor.execute(query)
        output = cursor.fetchall()
        # Fetch column names from cursor's description
        columns = [desc[0] for desc in cursor.description]

        # Convert output to a pandas DataFrame. Build it even when there are
        # no rows, so that `df` is always defined and keeps its column names.
        df = pd.DataFrame(list(output), columns=columns)
        print("Query executed successfully!")
        return output, df

    except pymysql.Error as e:
        print("Error executing query:")
        warnings.warn(str(e))
        return None, None
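
The column-name step above can be tried without a MySQL server. The following is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in, which is an assumption: the notebook itself uses PyMySQL. The names `rows_as_dicts` and `demo_cursor` and the toy table contents are hypothetical and exist only for illustration. Note that `sqlite3` uses `?` placeholders, whereas PyMySQL uses `%s`.

```python
import sqlite3


def rows_as_dicts(cursor, query, params=()):
    """Run a SELECT and return the rows as a list of {column: value} dicts."""
    cursor.execute(query, params)
    # DB-API cursors describe each result column with a tuple; item 0 is the name.
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical in-memory table standing in for the real Repositories table.
connection = sqlite3.connect(":memory:")
demo_cursor = connection.cursor()
demo_cursor.execute("CREATE TABLE Repositories (owner TEXT, name TEXT)")
demo_cursor.executemany(
    "INSERT INTO Repositories VALUES (?, ?)",
    [("ollama", "ollama"), ("yjs", "yjs")],
)

# Passing values as parameters keeps them out of the SQL string itself.
demo_rows = rows_as_dicts(
    demo_cursor, "SELECT owner, name FROM Repositories WHERE owner = ?", ("yjs",)
)
empty_rows = rows_as_dicts(
    demo_cursor, "SELECT owner, name FROM Repositories WHERE owner = ?", ("nobody",)
)
print(demo_rows)
print(empty_rows)
connection.close()
```

An empty result still yields the column names from `cursor.description`, which is why the helper can return a well-formed (empty) structure instead of failing.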

# "Open source" repositories & data for the website
To highlight open-source repositories (and not just popular ones, as GitHub does), we export data about these repositories in `.json` format so that the website can use it.

We consider a repository to be an open-source project if it has more than 5 contributors and more than 50 issues in total: a simple but effective criterion.

Gathering the required data involves JOINs across multiple tables.
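
The selection criterion can be illustrated on its own, before any SQL. This is a small sketch on made-up data: the repository names and numbers in `sample_repos` are hypothetical, and `is_open_source_project` is a helper introduced only for this example. Both thresholds are strict, so a repository with exactly 50 issues is excluded.

```python
# Hypothetical per-repository metrics; in the notebook they come from RepositoryVisits.
sample_repos = [
    {"repo": "big-org/framework", "contributors": 120, "open_issues": 300, "closed_issues": 4000},
    {"repo": "solo-dev/dotfiles", "contributors": 1, "open_issues": 2, "closed_issues": 5},
    {"repo": "small-team/tool", "contributors": 6, "open_issues": 20, "closed_issues": 30},
]

MIN_CONTRIBUTORS = 5
MIN_ISSUES = 50


def is_open_source_project(repo):
    """Apply the criterion: more than 5 contributors and more than 50 issues in total."""
    total_issues = repo["open_issues"] + repo["closed_issues"]
    return repo["contributors"] > MIN_CONTRIBUTORS and total_issues > MIN_ISSUES


selected = [repo["repo"] for repo in sample_repos if is_open_source_project(repo)]
print(selected)
```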

In [None]:
open_source_projects_query = """
SELECT r.owner, r.name, MAX(r.description) as description, r.mainLanguage,
MAX(stars) as total_stars, MAX(contributors) as total_contributors, MAX(openIssues + closedIssues) as total_issues,
MAX(o.avatar_url) as avatar_url, GROUP_CONCAT(DISTINCT t.topic) as topics,
MAX(watchers) as total_watchers FROM Repositories r
-- Join with RepositoryVisits to get metrics like contributors, stars, issues amount
JOIN RepositoryVisits v
ON r.owner = v.owner AND r.name = v.name
-- Join with Owners to get the avatar URL
JOIN Owners o
ON r.owner = o.username
-- Join with RepositoryTopics to get a list of the tags/topics of the repos
JOIN RepositoryTopics t
ON t.repo = r.name AND t.owner = r.owner
GROUP BY r.owner, r.name, r.mainLanguage, r.license -- Must group by mainLanguage and license as well as they're non-aggregated
HAVING total_contributors > 5 AND total_issues > 50
ORDER BY total_contributors DESC; -- Show repos with most contributors first
"""


output_repo, df_repo = execute_select_query(cursor, open_source_projects_query)
print(len(df_repo), "repositories matching criteria")
df_repo.head(100)

The data for these repositories is exported to .json so that the website can use it.

In this process we also categorize the repositories by their general subject area:
- Web development
- Data science
- Applications
- Development tools
- Resource repositories (e.g. a collection of algorithms)

The categorization is based on the "topics" (tags) that the repositories have. Repositories are also categorized by their main programming language.
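
As a rough sketch of how topic-based categorization works, the alias table is inverted into a topic-to-tag lookup and each repository's topics are passed through it. The names `TOY_TAG_ALIASES`, `toy_tag_map` and `categorize` and the sample topics are hypothetical and much smaller than the real tables.

```python
# Toy alias table: website tag -> GitHub topics that count as that tag.
TOY_TAG_ALIASES = {
    "Web": ["react", "vue", "nodejs"],
    "Machine Learning": ["pytorch", "llm"],
}

# Invert it into a lookup from GitHub topic to website tag.
toy_tag_map = {
    alias: tag for tag, aliases in TOY_TAG_ALIASES.items() for alias in aliases
}


def categorize(topics_csv):
    """Return (website tags, unmapped topics) for a comma-separated topic string."""
    tags, ignored = set(), []
    for topic in topics_csv.split(","):
        if topic in toy_tag_map:
            tags.add(toy_tag_map[topic])
        else:
            ignored.append(topic)
    return tags, ignored


tags, ignored = categorize("react,pytorch,hacktoberfest")
print(sorted(tags), ignored)
```

Keeping track of the unmapped topics makes it easy to spot frequent topics that still need an alias.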

In [None]:
from collections import defaultdict
import json

used_languages = set()
used_tags = set()

# Renames or discards languages.
# Used to group up frameworks, variants and transpilers.
LANGUAGE_REMAP = {
    "TypeScript": "JavaScript",
    "Vue": "JavaScript",
    "HTML": "JavaScript", # Bruh
    "Jupyter Notebook": "Python",
    "Kotlin": "Java & Kotlin",
    "Java": "Java & Kotlin",
    "C": "C/C++",
    "C++": "C/C++",
    "Ruby": "Others",
    "Go": "Others",
    "Swift": "Others",
    "Clojure": "Others",
    "Haskell": "Others",
    "Dart": "Others",
    "Shell": "Others",
    "PowerShell": "Others",
    "Scala": "Others",
    "Svelte": "Others",
    "Vim Script": "",
    "CSS": "",
    "SCSS": "",
    "MDX": "",
}

# Remaps GitHub topics to the tags that are used in the website.
TAG_MAP = {}
# Maps tags used by the website to a list of Github "topics" that will be considered as the same tag.
# Ex. repos with topics "react", "vue" become tagged as "Web"
TAG_ALIASES = {
    "Web": ["react", "vue", "web", "reactjs", "css", "chrome-extension", "react-grid", "react-table", "php", "http", "nodejs", "typescript", "electron", "search-engine", "webgl", "rest", "rest-api", "swagger", "static-site-generator", "blog-engine", "router", "webview", "jquery", "http-client", "website", "reactive-templates", "nuxt", "nat", "javascript", "http2", "nginx", "apache", "aws", "api-gateway", "jekyll", "bootstrap"],
    "Modding": ["mod", "minecraft", "emulation", "emulator", "forge", "minecraft-launcher", "modrinth", "minecraft-api", "minecraft-server", "bepinex", "unity3d", "unreal", "unity-mono", "craftbukkit", "valheim", "minecraft-mod", "gta5", "fabric", "gamedev", "retroarch", "game"],
    "Data Science": ["math", "numpy", "data-science", "graphql", "data-visualization", "jupyter-notebook", "big-data"],
    "Machine Learning": ["ml", "pytorch", "deep-learning", "machine-learning", "deep-neural-networks", "tensorflow", "neural-network", "tensor", "computer-vision", "reinforcement-learning", "hyperparameter-tuning", "ai", "artificial-intelligence", "llama", "llms", "llm", "openai"],
    "Tool": ["containers", "zsh", "docker", "github", "cli", "searchengine", "postgrest", "devtool", "cloudstorage", "git", "npm", "database", "postgresql", "backend", "shell-scripting", "websocket", "collaboration", "developer-tools", "promise", "api", "testing", "translation", "i18n", "language", "golang-library", "algorithm", "firmware", "style-linter", "linting", "converter", "blockchain", "wordpress", "static-site-generator", "blog-engine", "material", "material-design", "framework", "argument-parser", "command-line-parser", "readme-generator", "ssh", "backup", "reverse-engineering", "animation", "sdk", "devops", "jenkins", "documentation", "terminal", "encryption", "scrapers", "3d-printing", "reactive-templates", "image-optimization", "file-server", "nat", "proxy", "shell", "linters", "git-client", "raspberry-pi", "blogging", "npm-cli", "aws", "api-gateway", "decompiler", "kubernetes", "tools"],
    "App": ["note-taking", "productivity", "prest", "download", "latex", "text-editor", "curl", "ftp", "bot", "synchronization", "sqlite", "mattermost", "messaging", "conferencing", "remote-desktop", "emacs", "color-picker", "cli-app", "subtitle-downloader", "decompiler", "mobile-app"],

    "Resource": ["learn-to-code", "freecodecamp", "curriculum", "certification", "learnopengl", "lists", "resources", "resource", "dataset", "public-api", "public-apis", "practice", "interview", "styleguide", "list", "interview-questions", "awesome-list", "principles", "design-patterns"],
}
# Tags manually added to some repositories (which otherwise lack descriptive ones)
MANUAL_TAGS = {
    "minio/minio": ["Machine Learning"],
    "Aliucord/Aliucord": ["Modding"],
    "cli-guidelines/cli-guidelines": ["Resource"],
    "yjs/yjs": ["Tool"],
    "TigerVNC/tigervnc": ["Tool"],
    "ollama/ollama": ["App"],
    "micropython/micropython": ["Tool"], # Python implementation
    "raspberrypi/linux": ["App"],
    "rust-lang/rust": ["Tool"],
    "home-assistant/core": ["Tool"],
    "vuejs/vue-cli": ["Tool", "Web"],
    "remix-run/remix": ["Tool", "Web"],
}
# Same thing as above, but mapping tag to list of repos with it instead, as I realized at the end this would've been more convenient.
MANUAL_TAGS2 = {
    "Tool": ["pytorch/tutorials", "vuejs/core", "google/googletest", "google/guava", "ReactiveX/RxJava"],
    "Web": ["vuejs/core"],
    "App": ["square/retrofit"],
}
for tag,aliases in TAG_ALIASES.items():
    for alias in aliases:
        TAG_MAP[alias] = tag
for tag,repos in MANUAL_TAGS2.items():
    for repo in repos:
        if repo in MANUAL_TAGS:
            MANUAL_TAGS[repo].append(tag)
        else:
            MANUAL_TAGS[repo] = [tag]

IGNORED_TAGS = defaultdict(int)

oss_repos = {}
for index, row in df_repo.iterrows():
    key = row["owner"] + "/" + row["name"]
    main_language = row["mainLanguage"]
    if main_language in LANGUAGE_REMAP:
        main_language = LANGUAGE_REMAP[main_language]
    repo = {
        "topics": set(),
        "languages": set([main_language] if main_language != "" else []),
        "stars": row["total_stars"],
        "contributors": row["total_contributors"],
        "icon": row["avatar_url"],
        "description": row["description"],
    }
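    for topic in row["topics"].split(","):
        if topic in TAG_MAP:
            repo["topics"].add(TAG_MAP[topic])
        else:
            # Count the topics that have no mapping to a website tag
            IGNORED_TAGS[topic] += 1
    # Manual tags only depend on the repository, so add them once per repo
    if key in MANUAL_TAGS:
        for tag in MANUAL_TAGS[key]:
            repo["topics"].add(tag)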

    oss_repos[key] = repo
    used_tags = used_tags.union(repo["topics"])
    used_languages.add(main_language)

# Exclude mirrors and other repositories that cannot be contributed to or are otherwise unsuitable
BLACKLISTED_REPOS = [
    "gitlabhq/gitlabhq", # Read-only mirror.
    "qemu/qemu", # Read-only mirror.
    "xasset/xasset", # Not in English.
    "jynew/jynew", # Unity RPG game framework; documentation is in Chinese only.
    "doocs/advanced-java", # Chinese-only Java interview questions.
    "CyC2018/CS-Notes", # Chinese computer science course resources.
    "apache/kafka", # Read-only mirror.
]
for blacklisted_repo in BLACKLISTED_REPOS:
    # pop() with a default avoids a KeyError if the repo is not in the query results
    oss_repos.pop(blacklisted_repo, None)

# Convert sets to lists for json serialization, and add other
# keys the site expects
for key,repo in oss_repos.items():
    repo["owner"] = key.split("/")[0]
    repo["repo"] = key.split("/")[1]
    repo["topics"] = list(repo["topics"])
    repo["languages"] = list(repo["languages"])

# Export the .json
with open("repositories_output.json", "w") as f:
    json.dump(oss_repos, f, indent=2)

print("Valid repositories:", len(oss_repos))
print("Languages used:", used_languages)
print("Tags used:", used_tags)

Since the repositories in the database were mainly found by scraping GitHub's "trending" pages, we can conclude that about 50% of the repositories that GitHub highlights are merely popular repositories from individuals or small teams rather than projects open to contribution: only about 1,000 of the 2,000 repositories in the database meet the requirements we imposed.

# Tables
This section shows the tables for the different entities in the database.

## Repositories

In [None]:
query = "SELECT * FROM Repositories;"

output_repo, df_repo = execute_select_query(cursor, query)
df_repo.tail(5)

In [None]:
query = "SELECT * FROM RepositoryVisits;"

output_repo_visists, df_repo_visists = execute_select_query(cursor, query)
df_repo_visists.head(5)

In [None]:
query = "SELECT * FROM RepositoryTopics;"

output_repo_topics, df_repo_topics = execute_select_query(cursor, query)
df_repo_topics.tail(5)

## Owners

In [None]:
query = "SELECT * FROM Owners;"
# Use _ to ignore
_, df_owners = execute_select_query(cursor, query)
df_owners.head(5)

In [None]:
query = "SELECT * FROM OwnerVisits;"

output_owner_visits, df_owner_visits = execute_select_query(cursor, query)
df_owner_visits.head(5)

## Commits

In [None]:
query = "SELECT * FROM Commits;"

output_commits, df_commits = execute_select_query(cursor, query)

df_commits.head(5)

## Topics

In [None]:
query = "SELECT * FROM Topics;"

output_topics, df_topics = execute_select_query(cursor, query)
df_topics.tail(5)

In [None]:
query = "SELECT * FROM TopicVisits;"

_, df_topic_visits = execute_select_query(cursor, query)
df_topic_visits.head(5)

## Trend

In [None]:
query = "SELECT * FROM TrendVisits;"

_, df_trend_visits = execute_select_query(cursor, query)
df_trend_visits.head(5)

# Analysis

### Basic statistics
* Distribution of main topics (bar or pie chart), i.e. the proportion of the studied repositories in which each topic appears.
* Distribution of languages (bar or pie chart), as above.
* Time evolution of interest in topics: repositories per topic, followers per topic.

### Dimensionality reduction for an overview of the repositories
The idea is to perform a PCA (principal component analysis) to gain insight into the structure of the whole dataset. We would then color the data points by, for instance, the language used, to see whether these groups have similar characteristics and lie close together in the projected space.
* Create a dataframe containing all RepositoryVisits data for a certain date for all the studied repositories.
* Standardize the data (i.e., to prevent variables such as commits from weighing far more than others such as forks).
* Perform a PCA into 2 components on it.
* Plot the results while clustering the points depending on different criteria:
    + mainLanguage
    + topic
* Supervised machine learning using stars.
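
The standardization and PCA steps listed above could look roughly like the sketch below. It runs on randomly generated data, not on the project's tables, and it computes the PCA with a plain NumPy SVD, which is an assumption: the notebook does not say which library would be used. The `metrics` matrix and its three columns are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical metrics: one row per repository, columns = stars, forks, commits.
metrics = rng.lognormal(mean=[8.0, 5.0, 7.0], sigma=1.0, size=(50, 3))

# Standardize each column to zero mean and unit variance so that
# large-valued variables do not dominate the projection.
standardized = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)

# PCA via SVD of the standardized matrix: the rows of vt are the principal axes,
# ordered from largest to smallest singular value.
_, singular_values, vt = np.linalg.svd(standardized, full_matrices=False)

# Project every repository onto the first two principal axes.
components = standardized @ vt[:2].T
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

print(components.shape)
print(explained_variance_ratio)
```

The two columns of `components` are the coordinates that would be plotted and colored by `mainLanguage` or topic.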
### Open questions
* How to use stars and trends?
* How to use contributions by owners?


# Commit message analysis

In [None]:
# Import necessary libraries
import numpy as np
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import warnings
import nltk
from nltk.corpus import wordnet
import unicodedata
import re

"""
If `!pip install wordcloud` doesn't work, install it with the interpreter
that the notebook is using:

import sys
print(sys.executable)  # use the printed path as <path> below

<path> -m pip install wordcloud
"""

# Suppress warnings
warnings.filterwarnings("ignore")

# Define constants and configurations
STANDARDIZE_SUBSTITUTION = re.compile(r"[!\"$#%&()*+,\-./:;<=>?[\]^_`{|}~]")
CONSECUTIVE_SPACE_SUBSTITUTION = re.compile(r"  +")
LEMMATIZATION_BLACKLIST = {"was", "as"}
WORD_REPLACEMENTS = {
    "read-me": "readme",
    "read.me": "readme",
    "readme.md": "readme"
}
stopwords = set(STOPWORDS)
# stopwords.update(["a"]) # Manually add stopwords if needed

# Ensure NLTK resources are downloaded
# nltk.download()

### Helper functions

In [None]:
def remove_accents(input_str):
    """
    Convert accented characters to base characters.
    Source: https://stackoverflow.com/a/1207479
    """
    nfkd_form = unicodedata.normalize('NFKD', input_str).encode("ascii", "ignore")
    return bytes.decode(nfkd_form)

lemmatizer = nltk.stem.WordNetLemmatizer()

# Function to get the wordnet POS tag
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def standardize(word):
    """Standardize a word."""
    word = word.lower()
    word = re.sub(STANDARDIZE_SUBSTITUTION, " ", word)
    word = re.sub(CONSECUTIVE_SPACE_SUBSTITUTION, " ", word)
    word = remove_accents(word)
    word = word.replace("'s", "").replace("#", "").strip()
    words = [WORD_REPLACEMENTS.get(w, w) for w in word.split()]
    
    # Enhanced lemmatization with POS tagging
    words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) if w not in LEMMATIZATION_BLACKLIST else w for w in words]
    
    return " ".join(words)

In [None]:
print(standardize("thIs a TesT"))
print(standardize("readme.md"))
_standardize_test_words = [
    "Testing @here as#Dasd 333232 tests testings gItHUbs testing added",  # tests -> test due to lemmatization
    "Test_ing !!, (asd,as d).",  # Underscores are treated as word separators
    "Test łłłł ñaññañ áaaa",  # Removal of accents and non-standard characters
    "テスト #asdasd arabian, wolves",
]

for word in _standardize_test_words:
    print(standardize(word))

## Data cleaning

Since the goal is to analyze how people write, some messages are of no interest to us. We therefore exclude every message that does not meet certain requirements.

Our exclusion criteria are:
- Bot messages, which we identify by an author whose name contains `[bot]`, and messages that contain names such as `dependabot` or `renovatebot`.
- Any duplicated message, since duplicates would be counted more than once in the analysis.
- The automatic messages that GitHub generates, such as `Merge pull request`, `Merge branch`, `Squash`, `Initial commit` and `Revert`, since they are not written by people.

Based on these criteria, we cleaned the data.
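
The three exclusion criteria can be sketched without pandas as a single filtering function. This is an illustration on made-up commits, not the notebook's implementation: `clean_commits` and `sample_commits` are hypothetical, and the sketch follows the criteria as written in the list above, so it also drops messages that mention `dependabot` or `renovatebot`.

```python
import re

BOT_AUTHOR = re.compile(r"\[bot\]|-bot")
BOT_KEYWORDS = ("dependabot", "renovatebot")
AUTOMATIC_PREFIXES = (
    "Merge pull request",
    "Merge branch",
    "Squash",
    "Initial commit",
    "Revert ",
)


def clean_commits(commits):
    """Keep human-written, non-duplicated messages from (author, message) pairs."""
    seen_messages = set()
    kept = []
    for author, message in commits:
        # 1. Bots: identified by the author name or by bot names in the message.
        if BOT_AUTHOR.search(author) or any(k in message for k in BOT_KEYWORDS):
            continue
        # 2. Duplicates: only the first occurrence of a message is considered.
        if message in seen_messages:
            continue
        seen_messages.add(message)
        # 3. Automatic messages: recognized by their standard prefix.
        if message.startswith(AUTOMATIC_PREFIXES):
            continue
        kept.append((author, message))
    return kept


sample_commits = [
    ("alice", "Fix typo in README"),
    ("dependabot[bot]", "Bump lodash from 4.17.20 to 4.17.21"),
    ("bob", "Merge pull request #12 from alice/fix"),
    ("carol", "Fix typo in README"),
    ("dave", "Add dark mode"),
    ("release-bot", "Release v1.2.0"),
]
cleaned = clean_commits(sample_commits)
print(cleaned)
```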

In [None]:
# Find all the commits authored by a bot.
from IPython.display import display
pd.set_option('display.max_rows', None)

# Raw string so that `\[` reaches the regex engine unchanged;
# na=False treats missing values as non-matches.
cond_a = df_commits[df_commits["author"].str.contains(r"\[bot]|-bot", na=False)]
cond_c = df_commits[df_commits["message"].str.contains("dependabot", na=False)]
cond_d = df_commits[df_commits["message"].str.contains("renovatebot", na=False)]

unique_auth = pd.unique(cond_a.author)
unique_msg = pd.unique(cond_c.message)

# Build the frames directly instead of using a variable named `dict`,
# which would shadow the built-in type.
df_bots = pd.DataFrame({'bots': unique_auth})

# Display the bot names
display(df_bots)

df_bots_msg = pd.DataFrame({'bots_message': unique_msg})

display(df_bots_msg)

In [None]:
# Filter out the bots
print("With bots", len(df_commits))
df_commits = df_commits[~df_commits["author"].isin(df_bots["bots"])]
print("Without bots", len(df_commits))

# Filter out duplicated messages (probably automatic messages)
df_commits = df_commits.drop_duplicates(subset=['message'])
print("Without duplicated messages", len(df_commits))

# Filter out the automatic messages
df_commits = df_commits[
    ~(df_commits.message.str.startswith("Merge pull request") |
      df_commits.message.str.startswith("Merge branch") |
      df_commits.message.str.startswith("Squash") |
      df_commits.message.str.startswith("Initial commit") |
      df_commits.message.str.startswith("Revert ")
     )
]
print("Without automatic messages", len(df_commits))

Here we can see that, out of 94,872 messages, 66,582 remain usable.

## Analysis of the most used words

We analyze the most used words in two ways:
- With a word cloud, a graphical representation built out of words in which each word is drawn larger the more often it occurs and smaller the less it appears. This gives a quick view of which words are used the most.
- By counting the words and plotting the top n most used words in descending order, ranked either by the number of individual messages that contain the word or by its total occurrences.
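
The difference between the two counts can be seen on a tiny example. This sketch uses `collections.Counter` on three made-up messages; `sample_messages`, `total_occurrences` and `messages_containing` are hypothetical names, and no standardization or stopword removal is applied here.

```python
from collections import Counter

sample_messages = ["fix bug fix test", "add test", "fix readme"]

total_occurrences = Counter()    # every appearance of a word counts
messages_containing = Counter()  # a word counts at most once per message

for message in sample_messages:
    words = message.split()
    total_occurrences.update(words)
    messages_containing.update(set(words))

print(total_occurrences.most_common(3))
print(messages_containing.most_common(3))
```

The two counters only differ for words that are repeated inside a single message, such as "fix" in the first message.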

In [None]:
def plot_word_cloud(word_cloud, text, save_image=False, image_name="none"):
    """Create and display a word cloud image."""
    word_cloud_output = word_cloud.generate(text)
    plt.imshow(word_cloud_output, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    if save_image:
        word_cloud_output.to_file(f"./Images/wordclouds/{image_name}_word_cloud.png")

def transform_format(val):
    """Transform format for mask creation."""
    return 255 if val == 0 else val

def obtain_mask(mask):
    """Obtain mask for word cloud."""
    trans_mask = np.ndarray((mask.shape[0], mask.shape[1]), np.int32)
    for i in range(len(mask)):
        trans_mask[i] = list(map(transform_format, mask[i]))
    return trans_mask

In [None]:
# Load all text
all_text = " ".join(commit for commit in df_commits.message)
# Set the wordcloud params
word_cloud = WordCloud(stopwords=stopwords, max_font_size=30, max_words=200, background_color="black")
# Plot the wordcloud
plot_word_cloud(word_cloud, all_text)

In [None]:
# Use a mask and a color map

# <a href='https://dryicons.com/icon/square-github-icon-8312'> Icon by Dryicons </a>
github_image = np.array(Image.open("./Images/wordclouds/github_square.png"))

word_cloud_mask = obtain_mask(github_image)
# The word cloud repeats words because it needs to fill the gaps in the mask with words of a suitable size.
# print(len(all_text.split(' ')))
# print(len(pd.unique(all_text.split(' '))))

word_cloud = WordCloud(stopwords=stopwords, colormap='rainbow', mask=word_cloud_mask, max_font_size=30, max_words=10000, background_color="black")
plot_word_cloud(word_cloud, all_text, True, "git_mask")

In [None]:
# Use a mask        
git_image = np.array(Image.open("./Images/wordclouds/git.png"))

word_cloud_mask = obtain_mask(git_image)

word_cloud = WordCloud(stopwords=stopwords, mask=word_cloud_mask,colormap='hot', max_font_size=20, max_words=10000, background_color="#474747")
plot_word_cloud(word_cloud, all_text)

Leaving the stopwords aside, the most used words are `fix, add, update`, since they are the largest words in the clouds. Next, we look at the exact counts of the most used words.

In [None]:
def count_words(df: pd.DataFrame):
    """
    :param df: DataFrame with the messages and associated information
    :return: Dictionary with the format {word: {n_ocur: value, n_messages: value}, ...}
    """
    # Using defaultdict for convenience; we won't have to add keys explicitly
    dicc = defaultdict(lambda: {"n_ocur": 0, "n_messages": 0})

    total_rows = len(df)
    processed_rows_count = 0
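    for _, row in df.iterrows():
        text = row.iloc[0]  # The message is the first (and only) column
        text = standardize(text)  # Apply standardization
        # Skip the whole message if it contains any digit
        if any(char.isdigit() for char in text):
            continue
        # Count the words; split() without arguments never yields empty strings
        words = text.split()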
        unique_words = set(words)
        # Times that a word appears and times that appears in different messages.
        for word in words:
            dicc[word]["n_ocur"] += 1
        for word in unique_words:
            dicc[word]["n_messages"] += 1

        processed_rows_count += 1
        # print(f"Word {processed_rows_count}/{total_rows} done; {processed_rows_count / total_rows * 100:.2f}%")
    
    return dicc

frame = {'Messages': df_commits.message}

result = pd.DataFrame(frame)

ocurrencesDict = count_words(result)

# Sort the words by occurrences and occurrences in messages.
def obtain_top_n_words(dictionary, N, filter="n_ocur", desc=True):
    # Get all the words and their frequency, skipping stopwords
    filtered_words = [(word, freq[filter]) for word, freq in dictionary.items() if word not in stopwords]

    # Sort by frequency: most frequent first when desc is True
    sorted_words = sorted(filtered_words, key=lambda x: x[1], reverse=desc)

    # Take the first N words
    return sorted_words[:N]

top_words_list_ocur = obtain_top_n_words(ocurrencesDict, 30, "n_ocur", True)
top_words_list_msg = obtain_top_n_words(ocurrencesDict, 30, "n_messages", True)

print("Occurrences Top", top_words_list_ocur)
print("\nOccurrences per message Top", top_words_list_msg)

# Function to plot the top words
def plot_top_words(topList, yLabel, ax):
    x = [word for word, freq in topList]
    y = [freq for word, freq in topList]
    
    # Making the bar chart on the data
    ax.bar(x, y)
    
    # Giving title to the plot
    ax.set_title("Most used words")
    ax.set_xticks(x)
    ax.set_xticklabels(x, rotation=90, ha='right')
    # Giving X and Y labels
    ax.set_xlabel("Word")
    ax.set_ylabel(yLabel)
     
    # We do not call plt.show() here, as we want to show both plots together

# Create a figure with two subplots
fig, axs = plt.subplots(1, 2, figsize=(15, 7))

# Plot the top words by occurrences
plot_top_words(top_words_list_ocur, "Occurrences of the word", axs[0])

# Plot the top words by occurrences in different messages
plot_top_words(top_words_list_msg, "Occurrences of the word in different messages", axs[1])

# Adjust layout for better spacing
plt.tight_layout()

# Show the plots
plt.show()

# We can see that there is no significant difference, since normally, in a commit message, we do not repeat the same word twice.

From the charts, we can see that in quantitative terms they are almost identical: they have the same scale and, in general, the same ordering of the most used words. This indicates that words are not usually repeated within the same message, since the chart on the left counts total occurrences while the one on the right counts the messages in which each word appears.

The three most used words are `fix, add, update`, the same ones we saw in the word cloud. We have grouped the words according to a possible semantic meaning.

### Development cycle:

- **Add (7121 occurrences):** Addition of new features and growth of the project.
- **Fix (6692 occurrences):** Bug fixing to ensure the stability of the project.
- **Update (5167 occurrences):** Improvements and updates to existing code, keeping the project up to date with best practices and technologies.

### Complementary words:

- **Remove (1943 occurrences):** Reflects the removal of obsolete or unnecessary code or functionality, which is crucial for keeping the code clean and efficient.
- **Use (1491 occurrences):** Indicates the adoption or reuse of code or libraries, suggesting an efficient use of the available resources.
- **Test (1537 occurrences):** Underlines the importance of testing in the development cycle to ensure correct behavior and avoid regressions.
- **Readme (1176 occurrences):** Reflects the project documentation, which is important for collaboration and clarity among team members and users.
- **Feat (1149 occurrences):** Similar to "add"; shows the focus on new functionality.

### Key words in maintenance and improvement:

- **Version (1080 occurrences):** Indicates changes in the project's versions, reflecting continuous evolution and version management.
- **File (1077 occurrences):** May refer to file management within the project, such as creating, modifying or deleting files.
- **Refactor (606 occurrences):** Underlines the effort to reorganize the code to improve its structure without changing its external behavior.
- **Improve (576 occurrences):** Reflects continuous improvements to existing code or functionality.

### Other relevant words:

- **Build (812 occurrences), Link (773 occurrences), Error (732 occurrences):** These words indicate specific activities related to building the project, managing links, and fixing errors, respectively.
- **CI (640 occurrences):** Refers to continuous integration, which is important for automatic deployment and continuous verification of changes.
- **Config (479 occurrences):** Reflects the project configuration, which is crucial for running and deploying it correctly.



## Most used groups of words

In this section we use **frequent itemsets** to find patterns in what users write. The main idea is to find groups of words such that, when one appears, another appears as well. This would indicate that the words are related or are frequently used together. This process helps us identify meaningful associations or patterns in the text.
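
To make the notion of support concrete, the sketch below counts word pairs by brute force on four made-up transactions. It is not the Apriori algorithm, which prunes candidates instead of enumerating every pair, and it is limited to pairs; `transactions`, `min_support` and `frequent_pairs` are hypothetical names used only here. The support of a pair is the fraction of messages that contain both words.

```python
import math
from collections import Counter
from itertools import combinations

# Each transaction is the set of unique words of one commit message.
transactions = [
    {"fix", "typo", "readme"},
    {"update", "readme"},
    {"fix", "typo"},
    {"add", "test"},
]

min_support = 0.5
# Minimum number of messages in which a pair must appear.
min_count = math.ceil(len(transactions) * min_support)

pair_counts = Counter()
for items in transactions:
    # Sorting gives every pair a canonical order, e.g. ("fix", "typo").
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

frequent_pairs = {
    pair: count / len(transactions)
    for pair, count in pair_counts.items()
    if count >= min_count
}
print(min_count, frequent_pairs)
```

With a support threshold of 0.5 over four messages, a pair must appear in at least two of them, and only ("fix", "typo") does.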

In [None]:
#https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/#apriori-frequent-itemsets-via-the-apriori-algorithm
def generate_data_set(df: pd.DataFrame, rmStopWords = True, useStandard = True):
    """
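    :param df: DataFrame whose first column holds the commit messages
    :param rmStopWords: if True, remove the stopwords from each transaction
    :param useStandard: if True, standardize each message before splitting it
    :return: list of transactions; each commit message is a transaction and its unique words are the items
    """
    data_set = []
    total_rows = len(df)
    processed_rows_count = 0
    for _, row in df.iterrows():
        text = row.iloc[0]  # The message is the first (and only) column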
        if (useStandard):
            text = standardize(text)  # Apply standardization
       
        # Split the words, but don't hold digits.
        words = [word for word in text.split() if not any(char.isdigit() for char in word)]
        if (rmStopWords):
            unique_words = list(set(words) - stopwords)
        else:
            unique_words = list(set(words))
        data_set.append(unique_words)

        processed_rows_count += 1
        # print(f"Word {processed_rows_count}/{total_rows} done; {processed_rows_count / total_rows * 100:.2f}%")
    return data_set

In [None]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import math
#!pip install mlxtend

def get_frequent_itemsets(data, min_supp = 0.01):
    """
    To save memory, you may want to represent your transaction data in the sparse format. 
    This is especially useful if you have lots of products and small transactions.
    """
    te = TransactionEncoder()
    oht_ary = te.fit(data).transform(data, sparse=True)
    sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
    
    # total messages * support = minimum number of messages in which an itemset must appear
    print("An itemset must appear in at least", math.ceil(len(data) * min_supp), "messages")
    
    # Calculate it using apriori algorithm
    frequent_itemsets = apriori(sparse_df, min_support=min_supp, use_colnames=True, verbose=1)
    frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
    return frequent_itemsets

In [None]:
# Create the data frame using the commits.
frame = {'Messages': df_commits.message}
result = pd.DataFrame(frame)

In [None]:
# Since the data set is too big, and our computers can't handle it, we are going to use a randomly sampled fraction of it.
partial_data_set = generate_data_set(result.sample(frac =.50))

In [None]:
frequent_itemsets = get_frequent_itemsets(partial_data_set, 0.001)
# Sort by length, then by support (both descending)
frequent_itemsets_sorted = frequent_itemsets.sort_values(by=['length', 'support'], ascending=[False, False])
print(frequent_itemsets_sorted)

In [None]:
average_support = math.ceil(frequent_itemsets_sorted['support'].mean() * len(partial_data_set))
print("Average amount of support: ", average_support, " messages")

In [None]:
data_group = frequent_itemsets[frequent_itemsets['length'] >= 2].copy()  # copy() avoids SettingWithCopyWarning when adding columns
data_group['itemsets'] = data_group['itemsets'].apply(lambda x: str(sorted(list(x))))
data_group['support messages'] = data_group['support'].apply(lambda x: math.ceil(len(partial_data_set) * x))
data_group = data_group.sort_values(by=['support messages'], ascending=False)
# Show only those groups, who at least, have the average support.
# With this, we can preserve significant data, while reducing the amount of itemsets
data_group = data_group[(data_group['support messages'] >= average_support)]

print(data_group[['itemsets', 'support messages']])

In [None]:
# Group the itemsets by their first word (alphabetical, since each itemset was sorted)
def group_by_starting_word(df):
    grouped_phrases = {}
    for itemset in df['itemsets']:
        # Each itemset is the string form of a sorted list, e.g. "['fix', 'issue']"
        cleaned_string = itemset.replace("[", "").replace("]", "").replace("'", "")
        items_list = [item.strip() for item in cleaned_string.split(', ')]
        # Only the second item is kept, so larger itemsets contribute just one pair
        grouped_phrases.setdefault(items_list[0], []).append(items_list[1])
    return grouped_phrases

# Group itemsets by starting word
grouped_phrases = group_by_starting_word(data_group)
print(grouped_phrases)

In [None]:
import matplotlib.pyplot as plt
import networkx as nx

# Draw a star-shaped graph linking a word to the words it co-occurs with
def plot_graph(key, values):
    G = nx.Graph()

    # Add main node (key)
    G.add_node(key)

    # Add edges between main node and each phrase (value)
    for value in values:
        G.add_edge(key, value)

    # Plotting the graph
    pos = nx.spring_layout(G, seed=42)  # positions for all nodes

    plt.figure(figsize=(8, 6))
    nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=2000, font_size=10, font_weight='bold', arrows=False)
    plt.title(f'Undirected Graph for "{key}"')
    plt.show()

# Iterate through each key-value pair and plot the graph
for key, values in grouped_phrases.items():
    plot_graph(key, values)

### What happens if we do not normalize, or if we count the stop words?

In [None]:
data_set_no_standard = generate_data_set(result.sample(frac=0.5), rmStopWords=False, useStandard=True)
frequent_itemsets = get_frequent_itemsets(data_set_no_standard)
# Sort by itemset length first, then by support (both descending)
frequent_itemsets_sorted = frequent_itemsets.sort_values(by=['length', 'support'], ascending=[False, False])

In [None]:
# Keep only itemsets with at least two items; copy to avoid modifying a view of frequent_itemsets.
data_group = frequent_itemsets[frequent_itemsets['length'] >= 2].copy()
data_group['itemsets'] = data_group['itemsets'].apply(lambda x: str(sorted(x)))
# The support is relative to the data set that was actually mined in this experiment.
data_group['support messages'] = data_group['support'].apply(lambda x: math.ceil(len(data_set_no_standard) * x))
data_group = data_group.sort_values(by=['support messages'], ascending=False)

print(data_group[['itemsets', 'support messages']])

In [None]:
grouped_phrases = group_by_starting_word(data_group)
print(grouped_phrases)
# Iterate through each key-value pair and plot the graph
for key, values in grouped_phrases.items():
    plot_graph(key, values)

In [None]:
data_set_no_standard = generate_data_set(result.sample(frac=0.5), rmStopWords=False, useStandard=False)
frequent_itemsets = get_frequent_itemsets(data_set_no_standard, min_supp=0.001)

# Keep only itemsets with at least two items and sort them by the number of supporting messages
data_group = frequent_itemsets[frequent_itemsets['length'] >= 2].copy()
data_group['itemsets'] = data_group['itemsets'].apply(lambda x: str(sorted(x)))
data_group['support messages'] = data_group['support'].apply(lambda x: math.ceil(len(data_set_no_standard) * x))
data_group = data_group.sort_values(by=['support messages'], ascending=False)

print(data_group[['itemsets', 'support messages']])

In [None]:
grouped_phrases = group_by_starting_word(data_group)
print(grouped_phrases)
# Iterate through each key-value pair and plot the graph
for key, values in grouped_phrases.items():
    plot_graph(key, values)

Because of space limitations, and following our professor's advice, we decided to run the algorithm on partial samples of the data set instead of on the whole data set.

We chose to show the result as graphs, since the frequent itemsets algorithm concentrates most of the groups in itemsets of size 2, and a graph makes it easy to see how one word relates to another.
What we obtained is the set of keywords that tend to appear together. For example, for "fix" we have:

`'fix': ['in', 'to', 'for', 'on', 'the', 'of', 'issue', 'with']`

This means that people tend to write something like:
- "fix xxxx to yyyyy"
- "fix ssss with nnnn"
- "fix issue eeee"
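
To make the idea concrete, the following is a minimal sketch in plain Python (it does not use mlxtend) that counts how often pairs of words co-occur in a handful of made-up commit messages, and then groups the frequent pairs by their alphabetically first word, mirroring the dictionary shown above. The messages, the `min_count` threshold, and the helper name `frequent_pairs` are assumptions made only for this illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical commit messages, invented for this illustration.
messages = [
    "fix issue with login",
    "fix typo in readme",
    "fix issue in parser",
    "add tests for parser",
]

def frequent_pairs(messages, min_count=2):
    """Count word pairs that co-occur in at least `min_count` messages."""
    pair_counts = Counter()
    for message in messages:
        # A message is treated as a set of words, like a transaction.
        words = sorted(set(message.lower().split()))
        pair_counts.update(combinations(words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

pairs = frequent_pairs(messages)

# Group each pair under its alphabetically first word.
grouped = defaultdict(list)
for first, second in sorted(pairs):
    grouped[first].append(second)

print(dict(grouped))
```

Note that the grouping key is the alphabetically first word of each pair, not the first word of the message, because the itemsets are unordered sets that are sorted before being grouped.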


### Syntactic exploration

In this section, we try to identify the most common or dominant grammatical categories in the messages.
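
As a rough illustration of what "most common grammatical categories" means here, the sketch below computes the share of each part-of-speech tag over a short list of tokens. The `(word, tag)` pairs are hand-written for the example (an assumption; in the notebook they come from NLTK's `pos_tag`), and the helper name `tag_ratios` is hypothetical.

```python
from collections import Counter

# Hand-tagged tokens using Penn Treebank tags; invented for illustration.
tagged = [
    ("fix", "VB"),
    ("login", "NN"),
    ("bug", "NN"),
    ("in", "IN"),
    ("parser", "NN"),
]

def tag_ratios(tagged_tokens):
    """Return (tag, ratio) pairs sorted from most to least frequent."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return [(tag, count / total) for tag, count in counts.most_common()]

ratios = tag_ratios(tagged)
print(ratios)
```

The ratios are computed over the number of tagged tokens, so they add up to 1 across all tags.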

In [None]:
import nltk
from nltk import word_tokenize, pos_tag
from collections import Counter
from nltk.corpus import wordnet as wn

# Uncomment to download the data needed to run the functions.
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')

In [None]:
# Map to make the tags more readable. Source: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
tag_map = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal number',
    'DT': 'Determiner',
    'EX': 'Existential there',
    'FW': 'Foreign word',
    'IN': 'Preposition or subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative',
    'JJS': 'Adjective, superlative',
    'LS': 'List item marker',
    'MD': 'Modal',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive ending',
    'PRP': 'Personal pronoun',
    'PRP$': 'Possessive pronoun',
    'RB': 'Adverb',
    'RBR': 'Adverb, comparative',
    'RBS': 'Adverb, superlative',
    'RP': 'Particle',
    'SYM': 'Symbol',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund or present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, non-3rd person singular present',
    'VBZ': 'Verb, 3rd person singular present',
    'WDT': 'Wh-determiner',
    'WP': 'Wh-pronoun',
    'WP$': 'Possessive wh-pronoun',
    'WRB': 'Wh-adverb'
}

# Join all commit messages into a single text
all_text = " ".join(df_commits.message)
# len() of a string counts characters, not words
print("Total characters:", len(all_text))

In [None]:
def calc_ratios(all_text, tag_map):
    # Tokenize the text and tag each token with its part of speech
    tokens = word_tokenize(all_text)
    tagged = pos_tag(tokens)

    # Count the occurrences of each tag
    tag_counts = Counter(tag for word, tag in tagged)

    # Replace each tag with its human-readable label
    labeled_tag_counts = {tag_map.get(tag, tag): count for tag, count in tag_counts.items()}

    # Calculate the ratio of each tag over all tagged tokens
    total_tags = sum(tag_counts.values())
    tag_ratios = {tag: count / total_tags for tag, count in labeled_tag_counts.items()}

    # Sort ratios by value in descending order
    sorted_ratios = sorted(tag_ratios.items(), key=lambda item: item[1], reverse=True)

    # Create a dataframe to show the result
    df_ratios = pd.DataFrame(sorted_ratios, columns=['Etiqueta', 'Ratio'])

    # Number of tokens behind each ratio (ratios are over tokens, not characters)
    df_ratios['Paraules'] = (df_ratios['Ratio'] * total_tags).round().astype(int)

    return df_ratios
    
df_ratios = calc_ratios(all_text, tag_map)
print(df_ratios)

In [None]:
# Plot the ratios as percentages
tags = df_ratios["Etiqueta"]
ratios = df_ratios["Ratio"]

plt.figure(figsize=(10, 6))
plt.bar(tags, ratios * 100, color='skyblue')
plt.xlabel('Word tags')
plt.ylabel('Percentage (%)')
plt.title('Distribution of grammatical category usage')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

From the results, we can highlight that nouns dominate: the categories NN ("noun, singular or mass") and NNP ("proper noun, singular") occupy the top two positions in terms of relative frequency (ratio). This could indicate that the messages tend to describe or point to concrete entities, which may be why adjectives come in third position.

Therefore, one could say that the messages are rather descriptive, with the aim of making it easy to tell what they refer to.

### And which verb forms are used most?

In [None]:
# Select the verb categories; the 'Paraules' column is already computed by calc_ratios
verb_usage = df_ratios[df_ratios['Etiqueta'].str.startswith('Verb')].copy()
print(verb_usage)

In [None]:
tags = verb_usage['Etiqueta']
ratios = verb_usage['Ratio']

# Plot the ratios as percentages
plt.figure(figsize=(10, 6))
plt.bar(tags, ratios * 100, color='skyblue')
plt.xlabel('Word tags')
plt.ylabel('Percentage (%)')
plt.title('Distribution of verb usage')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

As a result, we can conclude that the messages focus on clarifying and identifying the changes made to the code. This focus is crucial for efficient communication among the members of the development team and for a precise understanding of the modifications implemented.

In [None]:
# Closes the connection
dataBaseConnection.close()