The preprocessing includes the following steps:

1. Preparation of projects.csv (Result in projects_processed.csv)
2. Map projects with hackathons (Result in hackathon_project.csv)
3. Map projects (hackathon_project.csv) with participants (Result in proj_hack_people.csv)

In [2]:
import pandas as pd

The original datasets:

In [3]:
project = pd.read_csv('../data/projects.csv')
participants = pd.read_csv('../data/participants.csv')
hackathons = pd.read_csv('../data/hackathons.csv')

### 1. Preparation of projects.csv (Result in projects_processed.csv)

Only keep relevant fields

Remove records with empty values

Only keep the projects with accessible GitHub repository


In [4]:
print(project.shape[0])

303680


In [5]:
project.columns

Index(['Unnamed: 0', 'submission_gal_url', 'project_URL', 'github_links',
       'participants', 'participants_num', 'build_with', 'repo_link', 'repo',
       'submitted_to_link', 'submitted_to_name', 'submitted_to_hacks_num',
       'likes', 'comments'],
      dtype='object')

In [19]:
project_filtered = project[['submitted_to_link', 'project_URL', 'github_links','submitted_to_hacks_num']]
project_filtered = project_filtered.dropna(subset=['submitted_to_link', 'project_URL', 'github_links'])

print(project_filtered.shape[0])

project_filtered

157960


Unnamed: 0,submitted_to_link,project_URL,github_links,submitted_to_hacks_num
0,https://supernova.devpost.com/,https://devpost.com/software/faefolk,"https://github.com/ICCards/faefolk, https://gi...",1.0
1,https://supernova.devpost.com/,https://devpost.com/software/hls-nzcdmt,https://github.com/Hls-labs,1.0
2,https://supernova.devpost.com/,https://devpost.com/software/ic-map-collector,https://github.com/stumpigit/icmaps,1.0
3,https://supernova.devpost.com/,https://devpost.com/software/kontribute,https://github.com/teambonsai/bonsai_dapp/blob...,1.0
5,https://supernova.devpost.com/,https://devpost.com/software/tingram,https://github.com/tingramtingram/dfinity,1.0
...,...,...,...,...
303663,"https://dvhacks-3.devpost.com/, https://sf-glo...",https://devpost.com/software/resumex,https://github.com/rohan-patra/ResumeX,2.0
303664,https://sf-global-progcomp.devpost.com/,https://devpost.com/software/reach-du8xmv,https://github.com/detorresramos/Reach.io,1.0
303665,https://sf-global-progcomp.devpost.com/,https://devpost.com/software/gloid-r6nxlq,https://github.com/hsrambo07/gloID,1.0
303670,https://bay-area-hacks-2021-12377.devpost.com/,https://devpost.com/software/green-karma-n5lybx,https://github.com/AravTewari/Green-Karma,1.0


Check the accessibility of the projects' GitHub repositories

Code in accessibility.py, result in projects_github_accessibility.csv

If this step is rerun to verify the accessibility results, some of the later steps (dropping duplicates, 'Nan' values, ...) can be done first to make the check faster
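The script accessibility.py is not shown in this notebook. Purely as an illustration, an accessibility check could look like the sketch below. The function names `is_accessible` and `check_unique`, the use of a HEAD request, and the 10-second timeout are all assumptions and not the contents of accessibility.py. Caching by URL reflects the note above that removing duplicates first makes the check faster.

```python
import urllib.request


def is_accessible(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the URL answers with a 2xx status (hypothetical helper).

    Any HTTP error (for example 404), network error, or malformed URL
    is treated as "not accessible".
    """
    if not url:
        return False
    try:
        request = urllib.request.Request(url, method='HEAD')
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # urllib.error.URLError and HTTPError are subclasses of OSError;
        # ValueError covers malformed URLs.
        return False


def check_unique(urls, checker=is_accessible):
    """Check every distinct URL only once and reuse the cached answer."""
    cache = {}
    for url in urls:
        if url not in cache:
            cache[url] = checker(url)
    return [cache[url] for url in urls]
```

With a DataFrame, the result of `check_unique(df['github_links'])` could be stored as an `accessibility` column. How the original script handles rows with several comma-separated GitHub links is not known from this notebook.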

Drop null values and duplicate rows (rows count as duplicates when project_URL, submitted_to_link and github_links are all the same)

In [None]:
proj_accs = pd.read_csv('../data/projects_github_accessibility.csv')
# print(proj_accs.shape[0])

# Filter to keep the accessible ones
proj_accs = proj_accs[proj_accs['accessibility']==True]
# print(proj_accs.shape[0])

# The literal strings 'Nan'/'nan' are not removed by dropna, so filter them out explicitly:
proj_accs = proj_accs[~proj_accs['submitted_to_link'].isin(['Nan', 'nan'])]
# print('Projects with accessible GitHub links: ',proj_accs.shape[0])

# Drop duplicate rows (same project_URL, submitted_to_link and github_links)
proj_accs = proj_accs.drop_duplicates(subset=['project_URL', 'submitted_to_link', 'github_links'], keep='first')
print('Projects with accessible GitHub links: ',proj_accs.shape[0])

Projects with accessible GitHub links:  72202
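One possible explanation for why `dropna` leaves 'Nan' in place is that the value is stored as an ordinary string: `pd.read_csv` recognises spellings such as `NaN` and `nan` as missing by default, but not `Nan`. The toy CSV below is made up for illustration and only sketches two ways of handling such placeholders; it does not claim to reproduce the actual file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV: a real link, the string 'Nan', and an empty field.
raw = "project_URL,submitted_to_link\np1,https://a.devpost.com/\np2,Nan\np3,\n"

# Default parsing: the empty field becomes NaN, but 'Nan' stays a string,
# so dropna keeps the 'Nan' row.
df = pd.read_csv(io.StringIO(raw))
after_dropna = df.dropna(subset=['submitted_to_link'])
print(len(after_dropna))  # 2

# Option 1: turn the placeholder strings into real NaN, then dropna works.
col = df['submitted_to_link']
df_fixed = df.assign(submitted_to_link=col.mask(col.isin(['Nan', 'nan'])))
cleaned = df_fixed.dropna(subset=['submitted_to_link'])

# Option 2: declare the placeholder as a missing value when loading.
df_na = pd.read_csv(io.StringIO(raw), na_values=['Nan'])
cleaned_at_load = df_na.dropna(subset=['submitted_to_link'])

print(len(cleaned), len(cleaned_at_load))  # 1 1
```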


In some rows the project_URL is the same but the submitted_to_link differs

Merge github_links and submitted_to_link (union set) for rows with the same project_URL

In [None]:
# In some rows the project_URL is the same but the submitted_to_link differs.
# Merge the given fields (union set) for rows with the same project_URL.
def analyze_and_merge_project_urls(df, fields_to_merge):
    # Count total rows (used in the summary printed below)
    total_rows = len(df)
    
    # Find duplicate rows
    duplicate_mask = df.duplicated(subset=['project_URL'], keep=False)
    duplicate_rows = df[duplicate_mask]
    
    # Group duplicates by project_URL
    duplicates_grouped = duplicate_rows.groupby('project_URL')
    
    # Create a list to store merged rows
    merged_rows = []
    
    # Function to merge links (create a unique set of links)
    def merge_links(links_series):
        # Remove NaN and convert to set of unique links
        merged_links = set()
        for links in links_series:
            if pd.notna(links) and isinstance(links, str):
                merged_links.update(link.strip() for link in links.split(','))
        # Convert back to comma-separated string
        return ', '.join(sorted(merged_links)) if merged_links else None
    
    # Process duplicate groups
    for url, group in duplicates_grouped:
        # Merge the duplicate rows
        merged_row = group.iloc[0].copy()  # Start with the first row
        
        # Merge fields:
        for field in fields_to_merge:
            merged_row[field] = merge_links(group[field])

        # # Merge github_links
        # merged_row['github_links'] = merge_links(group['github_links'])
        # # Merge submitted_to_link
        # merged_row['submitted_to_link'] = merge_links(group['submitted_to_link'])
        
        merged_rows.append(merged_row)
    
    # Remove duplicate rows from the original dataframe
    df_unique = df[~duplicate_mask].copy()
    
    # Add merged rows
    df_merged = pd.concat([df_unique, pd.DataFrame(merged_rows)], ignore_index=True)
    
    # Print analysis
    print(f"Total rows in the original dataframe: {total_rows}")
    print(f"Number of unique project URLs after merging: {df_merged['project_URL'].nunique()}")
    print(f"Number of rows in merged dataframe: {len(df_merged)}")
    
    # print("\n--- Examples of Merged Project URLs ---")
    # for url, group in duplicates_grouped:
    #     print(f"\nProject URL: {url}")
    #     print("Original Duplicate Rows:")
    #     print(group[['project_URL', 'github_links', 'submitted_to_link']].to_string(index=False))
        
    #     # Find the merged row
    #     merged_row = df_merged[df_merged['project_URL'] == url]
    #     print("\nMerged Row:")
    #     print(merged_row[['project_URL', 'github_links', 'submitted_to_link']].to_string(index=False))
    
    return df_merged

proj_accs = analyze_and_merge_project_urls(proj_accs, fields_to_merge=['github_links','submitted_to_link'])

Total rows in the original dataframe: 72202
Number of unique project URLs after merging: 72122
Number of rows in merged dataframe: 72122
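The same union merge can also be expressed with `groupby` and `agg`. The sketch below is an alternative formulation, not the function used above; `union_of_links` and `merge_by_project_url` are hypothetical names. It differs in two small ways: rows come out in order of first appearance rather than with merged rows appended at the end, and the `'first'` aggregation takes the first non-null value of each remaining column.

```python
import pandas as pd


def union_of_links(series):
    """Union of comma-separated links across a group, as a sorted string."""
    links = set()
    for value in series.dropna():
        links.update(part.strip() for part in str(value).split(',') if part.strip())
    return ', '.join(sorted(links)) if links else None


def merge_by_project_url(df, fields_to_merge):
    """Collapse rows sharing a project_URL into one row per project."""
    # Default: keep the first non-null value of every other column ...
    agg = {col: 'first' for col in df.columns if col != 'project_URL'}
    # ... except the link fields, which are merged as a union set.
    agg.update({col: union_of_links for col in fields_to_merge})
    return df.groupby('project_URL', as_index=False, sort=False).agg(agg)


toy = pd.DataFrame({
    'project_URL': ['p1', 'p1', 'p2'],
    'submitted_to_link': ['h1', 'h2, h1', 'h3'],
    'accessibility': [True, True, True],
})
merged_toy = merge_by_project_url(toy, fields_to_merge=['submitted_to_link'])
print(merged_toy)
```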


One project may be submitted to several hackathons

Count projects submitted to multiple hackathons (the two commented-out methods were tested and found to give the same result)

In [103]:
# submitted_to_many1 = project_filtered[project_filtered['submitted_to_hacks_num']>1]
# print(len(submitted_to_many1))
# submitted_to_many2 = project_filtered[project_filtered['submitted_to_link'].str.contains(',')]
# print(len(submitted_to_many2))

# na=False treats a missing submitted_to_link as "no comma" instead of raising an error
submitted_to_many = proj_accs[proj_accs['submitted_to_link'].str.contains(',', na=False)]
print(len(submitted_to_many))

2003
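Besides testing for a comma, the number of hackathons per project can be derived directly from the comma-separated field. This is a small sketch on made-up values, assuming links are separated by commas as in the data shown above.

```python
import pandas as pd

# Hypothetical values of submitted_to_link: one link, two links, missing.
links = pd.Series([
    'https://a.devpost.com/',
    'https://a.devpost.com/, https://b.devpost.com/',
    None,
])

# Number of links per row; missing values stay NaN.
n_hackathons = links.str.split(',').str.len()

# Rows with more than one hackathon (NaN compares as False).
multi = int((n_hackathons > 1).sum())
print(multi)  # 1
```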


In [104]:
print(proj_accs.shape[0])
proj_accs.columns

72122


Index(['project_URL', 'github_links', 'submitted_to_link', 'accessibility'], dtype='object')

The processed project dataset should include the participants information, which exists in the original projects.csv

It would have been better to keep these fields when checking accessibility and building projects_github_accessibility.csv

But checking accessibility is time-consuming, so here we simply join proj_accs with project to recover these fields

In [111]:
proj_for_join = project[['project_URL', 'participants']]
proj_for_join = proj_for_join.dropna(subset=['project_URL', 'participants'])
print(proj_for_join.shape[0])

proj_for_join = proj_for_join.drop_duplicates(subset=['project_URL', 'participants'], keep='first')
print(proj_for_join.shape[0])

# proj_for_join = proj_for_join.drop_duplicates(subset=['project_URL'], keep='first')
# print(proj_for_join.shape[0])

298201
187724
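Because duplicates are dropped on the pair (project_URL, participants), a project_URL can still occur more than once in proj_for_join, and a left join on project_URL will then return more rows than its left table. The toy frames below are invented to sketch how this can be detected before joining, using the `validate` argument of `merge`.

```python
import pandas as pd

# Hypothetical data: p1 appears twice on the right with different participants.
left = pd.DataFrame({'project_URL': ['p1', 'p2']})
right = pd.DataFrame({
    'project_URL': ['p1', 'p1', 'p2'],
    'participants': ['u1', 'u2', 'u3'],
})

# Rows on the right whose key is not unique.
dup_rows = int(right['project_URL'].duplicated(keep=False).sum())

# The left join duplicates the p1 row: 2 rows in, 3 rows out.
joined = left.merge(right, on='project_URL', how='left')

# validate='many_to_one' raises if the right-hand keys are not unique.
try:
    left.merge(right, on='project_URL', how='left', validate='many_to_one')
    right_keys_unique = True
except pd.errors.MergeError:
    right_keys_unique = False

print(dup_rows, len(joined), right_keys_unique)  # 2 3 False
```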


In [112]:
proj_accs['project_URL'] = proj_accs['project_URL'].astype(str) 
proj_for_join['project_URL'] = proj_for_join['project_URL'].astype(str)

merged_df = pd.merge(
    proj_accs,
    proj_for_join,
    on='project_URL',
    how='left'
)

merged_df

Unnamed: 0,project_URL,github_links,submitted_to_link,accessibility,participants
0,https://devpost.com/software/faefolk,"https://github.com/ICCards/faefolk, https://gi...",https://supernova.devpost.com/,True,"https://devpost.com/RAW4RMCS, https://devpost...."
1,https://devpost.com/software/ic-map-collector,https://github.com/stumpigit/icmaps,https://supernova.devpost.com/,True,https://devpost.com/stumpigit
2,https://devpost.com/software/kontribute,https://github.com/teambonsai/bonsai_dapp/blob...,https://supernova.devpost.com/,True,"https://devpost.com/TsuDohNimhh, https://devpo..."
3,https://devpost.com/software/tingram,https://github.com/tingramtingram/dfinity,https://supernova.devpost.com/,True,"https://devpost.com/k-tsytsyn, https://devpost..."
4,https://devpost.com/software/ant-kingdom,https://github.com/NFPTU/dfinity-fu,https://supernova.devpost.com/,True,"https://devpost.com/damtuankhanglm1, https://d..."
...,...,...,...,...,...
72224,https://devpost.com/software/web-scraping-of-u...,https://github.com/kishanrajput23/Personal-Pyt...,"https://init-weekend.devpost.com/, https://lhd...",True,https://devpost.com/kishan_rajput23
72225,https://devpost.com/software/weblight,https://github.com/tirtharajsinha/MLH-projects...,"https://init-day-3.devpost.com/, https://lhd-b...",True,https://devpost.com/tirtharajsinha
72226,https://devpost.com/software/womentr-8e3df1,https://github.com/shametha/WoMentr,"https://inclutech-hack.devpost.com/, https://t...",True,"https://devpost.com/20eucs019, https://devpost..."
72227,https://devpost.com/software/womentr-8e3df1,https://github.com/shametha/WoMentr,"https://inclutech-hack.devpost.com/, https://t...",True,"https://devpost.com/19eucs130, https://devpost..."


merged_df has more rows than proj_accs because proj_for_join was only deduplicated on the pair ('project_URL', 'participants'), so some project_URL values still appear in several rows with different participants. Therefore:

Merge the participants (union set) for rows with the same project_URL. participants_num is not among the columns of merged_df, so it is not merged here.

In [113]:
merged_df = analyze_and_merge_project_urls(merged_df, fields_to_merge=['participants'])

Total rows in the original dataframe: 72229
Number of unique project URLs after merging: 72122
Number of rows in merged dataframe: 72122


The row count now matches proj_accs again (72122). Write to CSV for reuse.

In [None]:
merged_df.to_csv("../data/projects_processed.csv", index=False, encoding="utf-8")

### 2. Map projects with hackathons (Result in hackathon_project.csv)

In [None]:
import pandas as pd

projects = pd.read_csv('../data/projects_processed.csv')
hackathons = pd.read_csv('../data/hackathons.csv')

In [120]:
hackathons.columns

Index(['Unnamed: 0', 'URL', 'Criteria', 'schedule', 'hack_type', 'info',
       'start_date_format', 'end_date_format', 'Prizes', 'prize_money', 'Id',
       'Title', 'Location', 'start_date', 'end_date', 'year', 'themes',
       'prize', 'registered_N', 'featured', 'organization_name',
       'winners_announced', 'submission_gallery_url',
       'start_a_submission_url'],
      dtype='object')

In [7]:
hackathons['start_date_format'] = pd.to_datetime(hackathons['start_date_format'], errors='coerce')
hackathons['end_date_format'] = pd.to_datetime(hackathons['end_date_format'], errors='coerce')

earliest_start = hackathons['start_date_format'].min()
latest_end = hackathons['end_date_format'].max()

print("Earliest start date:", earliest_start)
print("Latest end date:", latest_end)

Earliest start date: 2009-10-06 00:00:00
Latest end date: 2022-07-03 00:00:00
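With `errors='coerce'`, values that cannot be parsed become `NaT` silently, and `min()`/`max()` skip them. The sketch below uses invented values to show how the number of failed parses could be checked; it says nothing about how many such values hackathons.csv actually contains.

```python
import pandas as pd

# Hypothetical raw column: a valid date, an unparseable string, a missing value.
raw = pd.Series(['2022-05-10', 'not a date', None])

parsed = pd.to_datetime(raw, errors='coerce')

# Values that were present but could not be parsed.
n_failed = int(parsed.isna().sum() - raw.isna().sum())

# min() ignores NaT, so it returns the only valid date.
earliest = parsed.min()
print(n_failed, earliest)
```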


Drop NaN and duplicates for hackathons.csv (none were found)

In [121]:
print(hackathons.shape[0])

hackathons = hackathons.dropna(subset=['URL'])
print(hackathons.shape[0])

hackathons = hackathons.drop_duplicates(subset=['URL'], keep='first')
print(hackathons.shape[0])

7053
7053
7053


In [None]:
def merge_projects_hackathons(projects, hackathons):
    # Set hackathons index to URL for efficient lookup
    hackathons_indexed = hackathons.set_index('URL')
    
    # Create a list to store rows for the final dataframe
    merged_rows = []
    
    # Iterate through each project row
    for _, project_row in projects.iterrows():
        # Split submitted_to_link into individual links
        submitted_links = str(project_row['submitted_to_link']).split(',')
        
        # Clean and strip each link
        submitted_links = [link.strip() for link in submitted_links if link.strip()]
        
        # Check each link against hackathons URLs
        for link in submitted_links:
            # Check if the link exists in hackathons URLs
            if link in hackathons_indexed.index:
                # Create a new row combining project and hackathon info
                merged_row = {
                    'hackathon_URL': link,
                    'project_URL': project_row['project_URL'],
                    'github_links': project_row['github_links'],
                    'start_date_format': hackathons_indexed.loc[link, 'start_date_format'],
                    'end_date_format': hackathons_indexed.loc[link, 'end_date_format'],
                    'participants': project_row['participants']
                }
                merged_rows.append(merged_row)
    
    # Convert the list of merged rows to a DataFrame
    new_df = pd.DataFrame(merged_rows)
    
    # Save to CSV
    new_df.to_csv("../data/hackathon_project.csv", index=False, encoding="utf-8")
    
    # Print some information about the merge
    print(f"Original projects rows: {len(projects)}")
    print(f"Merged dataframe rows: {len(new_df)}")
    
    return new_df

merged_hackathon_projects = merge_projects_hackathons(projects, hackathons)

Original projects rows: 72122
Merged dataframe rows: 76284
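The merged dataframe has more rows than the projects table because each project yields one row per matching hackathon link. The same mapping can be sketched without a Python-level loop by splitting the links, exploding them into rows, and inner-joining on the hackathon URL. `explode_and_join` is a hypothetical name and the toy data is invented; unlike the function above, this version does not write a CSV.

```python
import pandas as pd


def explode_and_join(projects, hackathons):
    """One output row per (project, hackathon) pair found in hackathons."""
    # Split the comma-separated links and give each link its own row.
    exploded = projects.assign(
        hackathon_URL=projects['submitted_to_link'].astype(str).str.split(',')
    ).explode('hackathon_URL', ignore_index=True)
    exploded['hackathon_URL'] = exploded['hackathon_URL'].str.strip()

    # Keep only the hackathon columns needed in the result.
    dates = hackathons[['URL', 'start_date_format', 'end_date_format']].rename(
        columns={'URL': 'hackathon_URL'}
    )

    # The inner join drops links that have no entry in hackathons.
    merged = exploded.merge(dates, on='hackathon_URL', how='inner')
    return merged[['hackathon_URL', 'project_URL', 'github_links',
                   'start_date_format', 'end_date_format', 'participants']]


toy_projects = pd.DataFrame({
    'project_URL': ['p1', 'p2'],
    'github_links': ['g1', 'g2'],
    'submitted_to_link': ['h1, h2', 'h3'],
    'participants': ['u1', 'u2'],
})
toy_hackathons = pd.DataFrame({
    'URL': ['h1', 'h2'],
    'start_date_format': ['2021-01-01', '2021-02-01'],
    'end_date_format': ['2021-01-02', '2021-02-02'],
})
toy_result = explode_and_join(toy_projects, toy_hackathons)
print(toy_result)
```

In the toy data, p1 produces two rows (h1 and h2) and p2 produces none, because h3 is not in the hackathons table.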


### 3. Map projects (hackathon_project.csv) with participants (Result in proj_hack_people.csv)

In [None]:
import pandas as pd

proj_hack = pd.read_csv('../data/hackathon_project.csv')

proj_hack.columns

Index(['hackathon_URL', 'project_URL', 'github_links', 'start_date_format',
       'end_date_format', 'participants'],
      dtype='object')

In [None]:
participants = pd.read_csv('../data/participants.csv')

participants.columns

Index(['url', 'name', 'website', 'github', 'twitter', 'address', 'skills',
       'interests'],
      dtype='object')

DevPost profile URL: proj_hack['participants'] and participants['url']

GitHub profile URL: participants['github']

Drop NaN and duplicates. Open question: should records with a null github be dropped? For now they are kept.

In [128]:
print(participants.shape[0])

# participants = participants.dropna(subset=['url','github'])
participants = participants.dropna(subset=['url'])
print(participants.shape[0])

participants = participants.drop_duplicates(subset=['url','github'],keep='first')
print(participants.shape[0])

participants = participants.drop_duplicates(subset=['url'],keep='first')
print(participants.shape[0])

191638
191638
183752
183752


In [131]:
people_with_multiple_github = participants[participants['github'].str.contains(',',na=False)]
print('Number of participants with more than one GitHub account for hackathon events: ',len(people_with_multiple_github))

Number of participants with more than one GitHub account for hackathon events:  0


Add participants' GitHub links to proj_hack as a new column

In [None]:
# Set participants to index for efficient lookup
participants_indexed = participants.set_index('url')

def get_participant_githubs(devpost_links):
    # If proj_hack['participants'] is NaN, return empty string
    if pd.isna(devpost_links):
        return ''
    
    # Split participants links
    links = str(devpost_links).split(',')
    
    # Clean and strip links
    links = [link.strip() for link in links]
    
    # Collect githubs for each link
    githubs = []
    for link in links:
        # Check if link exists in participants index
        if link in participants_indexed.index:
            # Get github, use 'no github link' if NaN
            github = participants_indexed.loc[link, 'github']
            githubs.append(str(github) if pd.notna(github) else 'no github link')
        else:
            githubs.append('no github link')
    
    return ', '.join(githubs)

proj_hack_people = proj_hack.copy()
proj_hack_people['participants_githubs'] = proj_hack['participants'].apply(get_participant_githubs)
proj_hack_people.to_csv('../data/proj_hack_people.csv', index=False, encoding="utf-8")

proj_hack_people

Unnamed: 0,hackathon_URL,project_URL,github_links,start_date_format,end_date_format,participants,participants_githubs
0,https://supernova.devpost.com/,https://devpost.com/software/faefolk,"https://github.com/ICCards/faefolk, https://gi...",2022-05-10,2022-06-22,"https://devpost.com/RAW4RMCS, https://devpost....","https://github.com/RAW4RMCS, no github link, n..."
1,https://supernova.devpost.com/,https://devpost.com/software/ic-map-collector,https://github.com/stumpigit/icmaps,2022-05-10,2022-06-22,https://devpost.com/stumpigit,no github link
2,https://supernova.devpost.com/,https://devpost.com/software/kontribute,https://github.com/teambonsai/bonsai_dapp/blob...,2022-05-10,2022-06-22,"https://devpost.com/TsuDohNimhh, https://devpo...","https://github.com/TsuDohNimhh, no github link"
3,https://supernova.devpost.com/,https://devpost.com/software/tingram,https://github.com/tingramtingram/dfinity,2022-05-10,2022-06-22,"https://devpost.com/k-tsytsyn, https://devpost...","no github link, no github link, no github link"
4,https://supernova.devpost.com/,https://devpost.com/software/ant-kingdom,https://github.com/NFPTU/dfinity-fu,2022-05-10,2022-06-22,"https://devpost.com/damtuankhanglm1, https://d...","no github link, no github link, https://github..."
...,...,...,...,...,...,...,...
76279,https://lhd-learn-day-4.devpost.com/,https://devpost.com/software/web-scraping-hgdp3c,https://github.com/aayushibansal2001/Web-Scraping,2021-10-13,2021-10-14,"https://devpost.com/aayushibansal2001, https:/...","no github link, no github link"
76280,https://inclutech-hack.devpost.com/,https://devpost.com/software/womentr-8e3df1,https://github.com/shametha/WoMentr,2021-12-03,2021-12-05,"https://devpost.com/19eucs130, https://devpost...","no github link, no github link, no github link..."
76281,https://the-superpositron.devpost.com/,https://devpost.com/software/womentr-8e3df1,https://github.com/shametha/WoMentr,2021-08-13,2021-08-15,"https://devpost.com/19eucs130, https://devpost...","no github link, no github link, no github link..."
76282,https://youthhacks-14264.devpost.com/,https://devpost.com/software/womentr-8e3df1,https://github.com/shametha/WoMentr,2021-12-17,2021-12-20,"https://devpost.com/19eucs130, https://devpost...","no github link, no github link, no github link..."
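For larger tables, the per-link lookup can be done against a plain dictionary instead of the DataFrame index. This is a sketch of that variant on invented data, with the hypothetical helper names `build_github_lookup` and `map_githubs`. It assumes `url` is already unique in participants, as it is after the deduplication above, and it keeps the same 'no github link' placeholder.

```python
import pandas as pd


def build_github_lookup(participants):
    """Map DevPost profile URL -> GitHub URL, skipping missing GitHub values."""
    known = participants.dropna(subset=['url', 'github'])
    return dict(zip(known['url'], known['github']))


def map_githubs(devpost_links, lookup, placeholder='no github link'):
    """Translate a comma-separated list of DevPost links into GitHub links."""
    if pd.isna(devpost_links):
        return ''
    links = [link.strip() for link in str(devpost_links).split(',')]
    # Unknown profiles and profiles without GitHub both get the placeholder.
    return ', '.join(str(lookup.get(link, placeholder)) for link in links)


toy_participants = pd.DataFrame({
    'url': ['https://devpost.com/a', 'https://devpost.com/b'],
    'github': ['https://github.com/a', None],
})
lookup = build_github_lookup(toy_participants)
mapped = map_githubs(
    'https://devpost.com/a, https://devpost.com/b, https://devpost.com/c', lookup
)
print(mapped)
# https://github.com/a, no github link, no github link
```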
