# Processing Libraries (Contributions) Dataset

Processing publishes its contributed libraries through [this endpoint](https://download.processing.org/contribs) on the Processing website. The endpoint lists only the latest version of each library.

**It is not clear where to get the rest of the data; I need to ask on the forum.**

In [33]:
import git
import os
import shutil
import pandas as pd
import glob

In [5]:
!mkdir -p libraries
!wget -q -O libraries/current-libs.txt https://download.processing.org/contribs
!head libraries/current-libs.txt

library
name=proscene
authors=[Jean Pierre Charalambos](http://otrolado.info)
url=http://otrolado.info
categories=3D,Animation,Geometry,GUI,I/O,Utilities
sentence=This project is deprecated and will soon no longer be available. Download the nub library instead.
paragraph=Main features include: 1. Default interactivity through the mouse and keyboard that simply does what you expect; 2. Generic support for Human Interface Devices; 3. Arcball, walkthrough and third person camera modes; 4. Hierarchical coordinate systems (frames), with functions to convert between them; 5. Coordinate systems can easily be moved with the mouse. 6. Keyframes; 7. Object picking; 8. Keyboard shortcuts and camera profiles customization; 8. Animation framework; 9. Screen drawing; and, 10. Off-screen rendering mode support.
version=33
prettyVersion=3.0.1
minRevision=256


To reconstruct the history, I need a snapshot of the `contributions` file after each change. Capturing every individual library update will still be challenging: once a library is validated, its author controls the update stream through a trusted endpoint that returns the library's latest version and build.

In [None]:
website_repo = 'https://github.com/processing/processing-web-archive.git'
folder = 'libraries'
local_dir = 'libraries/processing-web-archive'

contrib_fn = "contrib_generate/contributions.txt"

if not os.path.exists(local_dir):
    repo = git.Repo.clone_from(website_repo, local_dir)
else:
    print(f"Repository already exists at {local_dir}, skipping clone.")

g = git.Git(local_dir)

# Get log history
loginfo = g.log('--pretty=%H %ct', '--', contrib_fn)

# Loop through logs and checkout each version
for line in loginfo.splitlines():
    line = line.strip()

    commit_hash, timestamp = line.split(" ")
    print(commit_hash)
    g.checkout(commit_hash)  # This could be sped up, but it is probably not worth it
    dest_fn = f'{folder}/contributions-{timestamp}-{commit_hash}.txt'

    # Copy the file to the destination
    shutil.copy(f'{local_dir}/{contrib_fn}', dest_fn)
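If the checkout loop above ever becomes a bottleneck, one possible alternative is to read each revision of the file with `git show <commit>:<path>`, which leaves the working tree untouched. The sketch below is only an illustration under assumptions: the helper names `parse_log` and `export_snapshots` are hypothetical, and it calls the `git` command line through `subprocess` rather than GitPython.

```python
import subprocess
from pathlib import Path


def parse_log(loginfo):
    """Turn `git log --pretty='%H %ct'` output into (hash, timestamp) pairs."""
    entries = []
    for line in loginfo.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log output
        commit_hash, timestamp = line.split(" ", 1)
        entries.append((commit_hash, int(timestamp)))
    return entries


def export_snapshots(repo_dir, path_in_repo, out_dir):
    """Write one snapshot of path_in_repo per commit, without any checkout.

    Hypothetical helper: assumes `git` is on PATH and repo_dir is a clone.
    """
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--pretty=%H %ct", "--", path_in_repo],
        capture_output=True, text=True, check=True,
    ).stdout

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for commit_hash, timestamp in parse_log(log):
        # `git show <hash>:<path>` prints the file as it was at that commit
        result = subprocess.run(
            ["git", "-C", repo_dir, "show", f"{commit_hash}:{path_in_repo}"],
            capture_output=True,
        )
        if result.returncode != 0:
            continue  # the file does not exist at this commit (e.g. it was deleted)
        dest = out / f"contributions-{timestamp}-{commit_hash}.txt"
        dest.write_bytes(result.stdout)
        written.append(dest)
    return written
```

With the variables from the cell above, this would be called as `export_snapshots(local_dir, contrib_fn, folder)`, producing the same file names as the checkout loop.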

## Identifying events across the libraries

### Aggregating all the values

In [61]:
all_dfs = []

def read_and_parse_file(file_path, timestamp):
    with open(file_path, 'r') as f:
        text = f.read()

    # Function to parse a single block (could be library, tool, or mode)
    def parse_block(block, block_type):
        block_dict = {}
        lines = block.strip().split("\n")[1:]  # Exclude the first line containing block type
        for line in lines:
            if "=" not in line:
                continue  # Skip empty or malformed lines that carry no key=value pair
            key, value = line.split("=", 1)
            block_dict[key] = value
        block_dict['timestamp'] = timestamp
        block_dict['type'] = block_type
        return block_dict

    all_parsed = []
    
    # Split by blank lines and iterate through each block
    for block in text.strip().split("\n\n"):
        first_line = block.split("\n")[0]
        if first_line in ['library', 'tool', 'mode']:
            all_parsed.append(parse_block(block, first_line))

    return pd.DataFrame(all_parsed)

for file_path in glob.glob(f'{folder}/contributions-*.txt'):
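    # File names look like contributions-<timestamp>-<hash>.txt; keep the timestamp numeric so it sorts correctly
    timestamp = int(os.path.basename(file_path).split('.')[0].split('-')[1])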
    df = read_and_parse_file(file_path, timestamp)
    all_dfs.append(df)

combined_df = pd.concat(all_dfs, ignore_index=True)
combined_df.to_csv("temp.csv", index=False)
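To make the block format concrete, here is a small dependency-free sketch of the same parsing idea applied to an inline sample. It is not the code used to build the dataset: `parse_contributions` is a hypothetical name, the first block reuses fields from the `proscene` entry shown earlier, and the second block (`example-tool`) is invented purely for illustration.

```python
import re


def parse_contributions(text):
    """Parse contribs-style text into a list of dicts, one per block."""
    entries = []
    # Normalise line endings, then split on one or more blank lines
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.strip().split("\n")
        block_type = lines[0].strip()
        if block_type not in ("library", "tool", "mode"):
            continue  # not a recognised block header
        entry = {"type": block_type}
        for line in lines[1:]:
            key, sep, value = line.partition("=")
            if sep:  # ignore lines that have no '=' at all
                entry[key.strip()] = value
        entries.append(entry)
    return entries


# The second block is made up for this example
sample = """library
name=proscene
url=http://otrolado.info
version=33
prettyVersion=3.0.1

tool
name=example-tool
version=1
this line has no equals sign
"""

parsed = parse_contributions(sample)
print(parsed[0]["name"], parsed[0]["version"], parsed[1]["type"])
```

Splitting each line on the first `=` only matters because values such as URLs or Markdown links in `authors` may themselves contain `=`.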

### Preserving only the changes

In [79]:
# Sort the DataFrame by 'id' and 'timestamp'
sorted_df = combined_df.sort_values(by=['id', 'timestamp'])

# Flag rows whose 'version' differs from the previous snapshot of the same 'id'
sorted_df['changed'] = sorted_df['version'] != sorted_df.groupby('id')['version'].shift(1)

# Add a 'removed_at' column recording when an entry last appeared in the snapshots

# Find the last snapshot in which each 'id' appears
last_seen = sorted_df.groupby('id')['timestamp'].max()

# An entry counts as removed if its last appearance predates the newest snapshot
latest_timestamp = sorted_df['timestamp'].max()

# Build the DataFrame of removed libraries; it keeps its 'id' and 'removed_at'
# columns even when nothing was removed, so the merge below cannot fail
df_removed = (
    last_seen[last_seen != latest_timestamp]
    .rename('removed_at')
    .reset_index()
)

sorted_df = pd.merge(sorted_df, df_removed, on='id', how='left')

# Filter only the rows where 'changed' is True
filtered_df = sorted_df[sorted_df['changed'] == True]

# Drop the 'changed' column
final_df = filtered_df.drop('changed', axis=1)

# Optionally, reset the index
final_df.reset_index(drop=True, inplace=True)

# Write the change events to disk; 'removed_at' holds the last known timestamp of entries that were later removed
final_df.to_csv("libraries-data.csv", index=False)
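As a sanity check on the logic above, the same two rules (keep a row only when its version differs from the previous snapshot of the same `id`, and mark an `id` as removed when its last appearance predates the newest snapshot) can be written without pandas and run on a tiny hand-made sample. This is a sketch under assumptions: `change_events` is a hypothetical helper, and the sample rows are invented.

```python
def change_events(rows):
    """Return only the rows where 'version' changed, with a 'removed_at' field.

    rows: iterable of dicts with 'id', 'timestamp' and 'version' keys.
    """
    rows = sorted(rows, key=lambda r: (r["id"], r["timestamp"]))
    if not rows:
        return []

    latest = max(r["timestamp"] for r in rows)

    # Rows are sorted by timestamp within each id, so the last write wins
    last_seen = {r["id"]: r["timestamp"] for r in rows}

    missing = object()  # sentinel: no previous snapshot for this id
    previous_version = {}
    events = []
    for r in rows:
        if previous_version.get(r["id"], missing) != r["version"]:
            event = dict(r)
            gone = last_seen[r["id"]]
            event["removed_at"] = gone if gone != latest else None
            events.append(event)
        previous_version[r["id"]] = r["version"]
    return events


# Invented sample: id "001" is updated once, id "002" disappears after t=200
sample_rows = [
    {"id": "001", "timestamp": 100, "version": "1"},
    {"id": "001", "timestamp": 200, "version": "1"},
    {"id": "001", "timestamp": 300, "version": "2"},
    {"id": "002", "timestamp": 100, "version": "5"},
    {"id": "002", "timestamp": 200, "version": "5"},
]

events = change_events(sample_rows)
for e in events:
    print(e)
```

Here `removed_at` is really "last seen at": it is the timestamp of the final snapshot that still contained the entry, not the moment of removal itself.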