# Analysis of top contributors for ICML 2022

This repository analyzes recent ICML contributions. If you want to explore the dataset yourself, you can download it from the releases section of this repo.

[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/TobiasJacob/icml-crawler)

## Setup

Follow the script [build_and_publish.sh](build_and_publish.sh) for setup and report generation.

The download step uses multiprocessing to crawl all paper submissions in parallel, so the full crawl finishes within several minutes.
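As a rough illustration of the parallel crawl idea, here is a minimal sketch using `multiprocessing.Pool`. It is not the crawler from this repo: `fetch_paper` and `crawl` are hypothetical names, and the fetch step is a stub that builds a record instead of issuing an HTTP request.

```python
from multiprocessing import Pool


def fetch_paper(paper_id: int) -> dict:
    """Hypothetical stand-in for downloading and parsing one paper page."""
    # A real crawler would issue an HTTP request here; this stub only
    # builds a record so the sketch runs without network access.
    return {"paperid": paper_id, "title": f"paper-{paper_id}"}


def crawl(paper_ids, processes: int = 4) -> list:
    """Fetch all papers in parallel; pool.map keeps the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(fetch_paper, paper_ids)


if __name__ == "__main__":
    records = crawl(range(8))
    print(len(records), "records fetched")
```

Because each page can be fetched independently, the work splits cleanly across worker processes.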

## I just want to download the dataset

You can download the dataset from the [releases](https://github.com/TobiasJacob/icml-crawler/releases) section.

## Example Analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../data/records.csv")
df = df.dropna()
df

Number of individual papers

In [None]:
df["paperid"].nunique()
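The cell above counts with `nunique()` rather than `len(df)`. That suggests the CSV holds one row per author and paper pair, which is an assumption here and not something the repo states. A small sketch on made-up toy data shows the difference:

```python
import pandas as pd

# Toy records (made up for illustration): one row per (paper, author) pair.
toy = pd.DataFrame(
    {
        "paperid": [1, 1, 2, 3, 3, 3],
        "author": ["A", "B", "A", "C", "D", "E"],
    }
)

n_rows = len(toy)                    # counts author-paper pairs
n_papers = toy["paperid"].nunique()  # counts distinct papers
print(n_rows, n_papers)
```

Under that assumption, `len(df)` would overcount papers with several authors.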

The number of papers per year shows how the conference grew over time:

In [None]:
df.groupby("year")["paperid"].nunique().plot()
plt.ylabel("papers")
plt.show()

These are the authors with the most contributions:

In [None]:
df.groupby("author")["paperid"].nunique().sort_values(ascending=False).head(20)

These are the institutions contributing the most papers, sorted by their 2022 count:

In [None]:
df_leads = df.groupby(["institution", "year"])["paperid"].nunique().unstack().sort_values(2022, ascending=False)
df_leads.to_csv("Leading Institutions.csv")
df_leads.head(30)
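To make the shape of `df_leads` easier to picture, here is a sketch of the same `groupby` / `unstack` / `sort_values` chain on made-up toy data. The institution names `X`, `Y`, `Z` and all values are illustrative only.

```python
import pandas as pd

# Toy records (made up): one row per (institution, paper) contribution.
toy = pd.DataFrame(
    {
        "institution": ["X", "X", "X", "Y", "Y", "Y", "Z"],
        "year": [2021, 2022, 2022, 2022, 2022, 2022, 2021],
        "paperid": [1, 2, 3, 3, 4, 5, 6],
    }
)

# Distinct papers per (institution, year), as a Series with a two-level index.
counts = toy.groupby(["institution", "year"])["paperid"].nunique()

# unstack() moves the "year" level into columns: one row per institution.
table = counts.unstack()

# Sort by the 2022 column; institutions without 2022 papers are NaN and go last.
leads = table.sort_values(2022, ascending=False)
print(leads)
```

Combinations that never occur, such as `Z` in 2022 here, show up as `NaN` rather than `0`, which is why the columns become floats.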

I am particularly interested in Northeastern, KIT (Karlsruhe), Tübingen, Munich, ETH Zürich, and RWTH:

In [None]:
print("Tübingen", df[df["institution"].str.contains("Tübingen")]["paperid"].nunique())
print("Northeastern", df[df["institution"].str.contains("Northeastern")]["paperid"].nunique())
print("Karlsruhe", df[df["institution"].str.contains("Karlsruhe")]["paperid"].nunique())
print("Munich", df[df["institution"].str.contains("Munich")]["paperid"].nunique())
print("RWTH", df[df["institution"].str.contains("RWTH")]["paperid"].nunique())
print("ETH Zürich", df[df["institution"].str.contains("ETH")]["paperid"].nunique())
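These counts rely on case-sensitive substring matching, so they depend on how each institution name is spelled in the data. The sketch below uses made-up toy rows and a hypothetical helper `count_papers` to show how a spelling variant can be missed. Passing `regex=False` treats the keyword as a literal string, not as a regular expression.

```python
import pandas as pd

# Toy rows (made up) with two spellings of the same institution.
toy = pd.DataFrame(
    {
        "institution": [
            "ETH Zürich",
            "ETH Zurich",
            "University of Tübingen",
        ],
        "paperid": [1, 2, 3],
    }
)


def count_papers(df: pd.DataFrame, keyword: str) -> int:
    """Count distinct papers whose institution contains the literal keyword."""
    mask = df["institution"].str.contains(keyword, regex=False)
    return df.loc[mask, "paperid"].nunique()


eth_count = count_papers(toy, "ETH")        # matches both spellings
zurich_count = count_papers(toy, "Zürich")  # misses the "Zurich" spelling
print(eth_count, zurich_count)
```

A short keyword such as `ETH` catches more spelling variants, but it can also match unrelated names that happen to contain the same letters.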

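The following cells list the top 10 authors at each of these institutions, starting with Northeastern:

In [None]: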
df[df["institution"].str.contains("Northeastern")].groupby("author")["paperid"].nunique().sort_values(ascending=False).head(10)

In [None]:
df[df["institution"].str.contains("Tübingen")].groupby("author")["paperid"].nunique().sort_values(ascending=False).head(10)

In [None]:
df[df["institution"].str.contains("ETH")].groupby("author")["paperid"].nunique().sort_values(ascending=False).head(10)

In [None]:
df[df["institution"].str.contains("Munich")].groupby("author")["paperid"].nunique().sort_values(ascending=False).head(10)

In [None]:
df[df["institution"].str.contains("RWTH")].groupby("author")["paperid"].nunique().sort_values(ascending=False).head(10)

In [None]:
df[df["institution"].str.contains("Karlsruhe")].groupby("author")["paperid"].nunique().sort_values(ascending=False).head(10)
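The six cells above repeat the same chain with a different keyword. One possible way to factor this out is a small helper. `top_authors` is a hypothetical name, not a function in this repo, and the toy data is made up.

```python
import pandas as pd


def top_authors(df: pd.DataFrame, keyword: str, n: int = 10) -> pd.Series:
    """Return the n authors with the most distinct papers at matching institutions."""
    mask = df["institution"].str.contains(keyword, regex=False)
    return (
        df[mask]
        .groupby("author")["paperid"]
        .nunique()
        .sort_values(ascending=False)
        .head(n)
    )


# Toy records (made up) to exercise the helper.
toy = pd.DataFrame(
    {
        "institution": ["RWTH Aachen", "RWTH Aachen", "RWTH Aachen", "KIT Karlsruhe"],
        "author": ["A", "A", "B", "C"],
        "paperid": [1, 2, 2, 3],
    }
)

result = top_authors(toy, "RWTH")
print(result)
```

With the real data, a loop such as `for name in ["Northeastern", "Tübingen", "ETH"]: print(top_authors(df, name))` would reproduce the individual cells.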