## Project Objective

Build an interactive web page using a Jupyter notebook with a Python kernel and ipywidgets, and use Voilà to serve the notebook as a web page. The web page should display and explain the data (e.g., the number of downloads, number of UofT authors, time, and journal topic) so that the user can answer the questions below. Web pages of this kind are often called dashboards.

## Assignment Questions

- Which journals are downloaded most frequently? (good chunk done)

- How many authors from UofT publish in the journals that are downloaded? (rough work)

- How do the download patterns change over time? Can you predict future downloads? (Tao)

- Any other interesting questions that your group thinks could be answered using this data.

## Issues to Consider
- What information will the user see on the web page?

- How will your group display different data? As a visualization, table, text, or combination? Where will you add interactivity? How will you know if your choices lead to effective and accurate communication of information?

- How will your group predict future downloads? How will you display this information?
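
As one possible starting point for the prediction question, the sketch below fits a straight-line trend to yearly download totals by ordinary least squares and extrapolates it one year ahead. The names `fit_linear_trend` and `predict` are hypothetical, and the yearly totals are placeholders made up for illustration, not values from the reports. A real forecast would likely also need to account for seasonality and for changes in subscriptions.

```python
def fit_linear_trend(years, totals):
    """Return (slope, intercept) of the least-squares line totals ~ years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(totals) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, totals))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(year, slope, intercept):
    """Evaluate the fitted line at the given year."""
    return slope * year + intercept


# placeholder totals, NOT taken from the actual reports
years = [2014, 2015, 2016, 2017, 2018]
totals = [100, 110, 120, 130, 140]

slope, intercept = fit_linear_trend(years, totals)
forecast_2019 = predict(2019, slope, intercept)
```

A straight line is only a baseline; it is useful mainly as something to compare more careful models against.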

## User Documentation for Interactive Web Page
- The user documentation should explain to users what data is being displayed on your web page. For example, if you use the data to do a calculation or create a plot, explain why the calculation was done and how it should be interpreted.

- The documentation should be broken into sections that correspond to the sections of your web page.

- The user documentation should be done using a Jupyter notebook. Ideally your group would find a way to incorporate the documentation into the design of the web page, although this isn't necessary.

## Data

In [1]:
import glob
import pandas as pd
import datetime
import calendar
import numpy as np
import operator

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import ipywidgets as widgets
from ipywidgets import interact
from IPython.display import display

# silence warnings in the served page; comment out these two lines while debugging
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("JSC370 Data KM.csv").drop("Unnamed: 0", axis = 1)
report_data2 = df

uoft_data2 = pd.read_csv("Web of Science data UofT affiliated pubs 2014-2018.csv")

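def tryint(x):
    # convert to int where possible; anything that cannot be converted becomes NaN
    try:
        x = int(x)
    except (ValueError, TypeError, OverflowError):
        x = np.nan
    return x

report_data2["Reporting Period Total"] = report_data2["Reporting Period Total"].apply(tryint)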
report_data2 = report_data2.dropna(subset=["Reporting Period Total"]).reset_index(drop=True)

report_data2["Reporting Period HTML"] = pd.to_numeric(report_data2["Reporting Period HTML"])
report_data2["Reporting Period PDF"] = pd.to_numeric(report_data2["Reporting Period PDF"])

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
for month in months:
    report_data2[month] = pd.to_numeric(report_data2[month])
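
A possible alternative to a hand-written helper such as `tryint` is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` in one vectorized call. The sketch below uses a made-up column, not the real data. The two approaches are not exactly equivalent: `int("3.5")` raises and so becomes `NaN` under `tryint`, whereas `pd.to_numeric` keeps `3.5`.

```python
import pandas as pd

# made-up values standing in for a messy "Reporting Period Total" column
raw = pd.Series(["12", "abc", None, 7, "3.5"])

# entries that cannot be parsed as numbers become NaN instead of raising
cleaned = pd.to_numeric(raw, errors="coerce")

# keep only the rows that parsed successfully
valid = cleaned.dropna().reset_index(drop=True)
```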

In [3]:
report_data = report_data2
uoft_data = uoft_data2

In [2]:
# takes around 4 minutes to run

# get report names
reportnames = glob.glob('JSC370 Data KM/*.*', recursive = True)

# read in all excel files
report_data = pd.DataFrame()
for name in reportnames:
    if "JR1" not in name:
        print("Not JR1")
        print(name)
        continue
    
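    excel = True
    try:
        curr_report = pd.read_excel(name)
    except Exception:
        # not a readable Excel file: fall back to tab-separated text,
        # trying 7 and then 9 rows of preamble before the header
        try:
            curr_report = pd.read_csv(name, skiprows=7, sep="\t")
        except Exception:
            curr_report = pd.read_csv(name, skiprows=9, sep="\t")
        excel = False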
    
    if excel:
        # get index where data starts 
        # (this is different between some files so using a loop to get the starting point)
        colnamesindex = -1
        for i in range(len(curr_report)):
            if curr_report.iloc[i, 0] == "Journal" or curr_report.iloc[i, 1] == "Publisher":
                colnamesindex = i
                break
        if colnamesindex == -1:
            print("Might be in a different language, not matching up properly")
            print(name)
            continue

        # set column names as the proper thing
        curr_report.columns = curr_report.loc[colnamesindex,]

        if "+" in curr_report.columns:
            # for the files formated weirdly with a + and spreading title names across rows
            # fix column names
            colnames = list(curr_report.columns)
            for i in range(len(colnames)):
                if pd.notna(curr_report.iloc[colnamesindex + 1, i]):
                    colnames[i] = colnames[i] + " " + str(curr_report.iloc[colnamesindex + 1, i])
                if pd.notna(curr_report.iloc[colnamesindex + 2, i]):
                    colnames[i] = colnames[i] + " " + str(curr_report.iloc[colnamesindex + 2, i])
            curr_report.columns = colnames
            curr_report = curr_report.drop(columns=["+"])
            curr_report = curr_report.drop(curr_report[curr_report["Journal"] == "+"].index, axis=0)

            # save only data part
            curr_report = curr_report[colnamesindex + 3:]
        else:
            # save only data part
            curr_report = curr_report[colnamesindex + 1:]

    # insert year of report
    jr1_i = name.index("JR1")
    year = name[jr1_i + 4 : jr1_i + 8]
    curr_report.insert(0, "Year", int(year))

    # reformat months to not include year or date for generality
    datetimes = {}
    for colname in curr_report.columns:
        if pd.notna(colname):
            if isinstance(colname, datetime.datetime):
                # some encoded as datetime
                datetimes[colname] = calendar.month_name[colname.month][:3]
            elif isinstance(colname, str):
                # sometimes there's whitespace messing things up
                strippedcolname = colname.strip()
                if strippedcolname.endswith(year):
                    # some encoded as MMM-YYYY
                    datetimes[colname] = strippedcolname[:3]
                else:
                    datetimes[colname] = strippedcolname
    curr_report = curr_report.rename(columns=datetimes)

    # for some reports that don't label the Journal column
    curr_report = curr_report.rename(columns={np.nan: "Journal"})
    
    # insert file package name (not sure if useful); assumes backslash path separators (Windows)
    package_i = name.index("\\")
    filepub = name[package_i + 1 : jr1_i - 1]
    curr_report.insert(0, "FilePackage", filepub)
    
    # drop totals row
    curr_report = curr_report.drop(curr_report.index[0])
    
    # insert if SP or not
    mainname = name.split(".")[0]
    if mainname.endswith("SP"):
        curr_report.insert(0, "SP", "Yes")
    else:
        curr_report.insert(0, "SP", "No")
    
    try:
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        report_data = pd.concat([report_data, curr_report])
    except ValueError:
        print("Can't append, some other weird error")
        print(name)
        continue
    
print("Done")

Not JR1
JSC370 Data KM\ALJC JR5 2017 SP.xlsx
Not JR1
JSC370 Data KM\ALJC JR5 2018 SP.xlsx
Might be in a different language, not matching up properly
JSC370 Data KM\CAIRN JR1 2014.xls
Might be in a different language, not matching up properly
JSC370 Data KM\CAIRN JR1 2015.xls
Might be in a different language, not matching up properly
JSC370 Data KM\CAIRN JR1 2016 .xls
Might be in a different language, not matching up properly
JSC370 Data KM\CAIRN JR1 2017.xls
Might be in a different language, not matching up properly
JSC370 Data KM\CAIRN JR1 2018.xls
Done
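
The month-header clean-up inside the loop above can be read more easily in isolation. The sketch below restates that step as a standalone function; the name `normalize_month_label` is hypothetical and the sample headers are made up. It relies on `calendar.month_name`, which follows the current locale, so it assumes an English locale.

```python
import calendar
import datetime


def normalize_month_label(colname, year):
    """Map a month column header to a three-letter month name where possible."""
    if isinstance(colname, datetime.datetime):
        # headers parsed as dates, e.g. datetime(2015, 3, 1) -> "Mar"
        return calendar.month_name[colname.month][:3]
    if isinstance(colname, str):
        stripped = colname.strip()
        if stripped.endswith(str(year)):
            # headers such as "Jan-2014" -> "Jan"
            return stripped[:3]
        # other text headers only lose surrounding whitespace
        return stripped
    # anything else (e.g. a missing header) is returned unchanged
    return colname


# made-up headers for illustration
labels = [normalize_month_label(c, 2014)
          for c in [datetime.datetime(2014, 3, 1), " Jan-2014 ", "Publisher "]]
```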


In [3]:
# fixing more formatting problems
report_data["Reporting Period Total"] = report_data["Reporting Period Total"].fillna(report_data["Retrievals"])
report_data["Reporting Period HTML"] = report_data["Reporting Period HTML"].fillna(report_data["HTML"])
report_data["Reporting Period PDF"] = report_data["Reporting Period PDF"].fillna(report_data["PDF"])
report_data["Journal DOI"] = report_data["Journal DOI"].fillna(report_data["Journal Doi"])
report_data["Journal"] = report_data["Journal"].fillna(report_data["Title"])
report_data["Journal"] = report_data["Journal"].fillna(report_data["Unnamed: 0"])

report_data = report_data.drop(columns=["Retrievals", "HTML", "PDF", "Journal Doi", "Title", "Dec-2015", "Unnamed: 0"])
report_data.loc[report_data["Online ISSN"] == " ", "Online ISSN"] = np.nan

# drop the rows that are accidentally still there
report_data = report_data[report_data["Reporting Period Total"].notna()]

In [4]:
# strip excess whitespace
report_data["Journal"] = report_data["Journal"].apply(lambda x: x.strip() if isinstance(x, str) else x)
report_data["FilePackage"] = report_data["FilePackage"].apply(lambda x: x.strip() if isinstance(x, str) else x)

# double check data
report_data["Reporting Period Total"] = report_data["Reporting Period Total"].apply(
    lambda x: x if isinstance(x, int) else np.nan)
report_data = report_data.dropna(subset=["Reporting Period Total"])
report_data = report_data.reset_index(drop=True)

In [5]:
# uoft report names
uoftnames = glob.glob('Web of Science data UofT affiliated pubs 2014-2018/*.*', recursive = True)

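# takes around 40 seconds to load
uoft_data = pd.DataFrame()
year = 2014
# glob does not guarantee any ordering, so sort first
# (assumes the file names sort chronologically, one file per year)
for name in sorted(uoftnames):
    curr_uoft = pd.read_excel(name)
    curr_uoft.insert(0, "Year", year)
    uoft_data = pd.concat([uoft_data, curr_uoft])
    year = year + 1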
    
uoft_data = uoft_data.drop_duplicates()
uoft_data = uoft_data.reset_index(drop=True)

In [6]:
# fix formatting problems
uoft_data["Category: Heading 1"] = uoft_data["Category: Heading 1"].fillna(uoft_data["Category: Headings 1"])
uoft_data["PubType"] = uoft_data["PubType"].fillna(uoft_data["Pubtype"])

uoft_data = uoft_data.drop(columns=["Category: Headings 1", "Pubtype"])

In [7]:
def get_num_uoft_authors(row):
    new_row = row.copy()
    first_three = ["(a1) First UofT affiliated author's position in the author list ",
                   " (a2) Second UofT affiliated author's position in the author list",
                   "(a3) Third UofT affiliated author's position in the author list "]
    if pd.notna(row[first_three[0]]):
        new_row["NumUofTAuthors"] = new_row["NumUofTAuthors"] + 1
    else:
        return new_row
        
    if pd.notna(row[first_three[1]]):
        new_row["NumUofTAuthors"] = new_row["NumUofTAuthors"] + 1
    else:
        return new_row
        
    if pd.notna(row[first_three[2]]):
        new_row["NumUofTAuthors"] = new_row["NumUofTAuthors"] + 1
    else:
        return new_row
        
    for j in range(4, 61):
        if pd.notna(row['a' + str(j)]):
            new_row["NumUofTAuthors"] = new_row["NumUofTAuthors"] + 1
        else:
            return new_row
            
    return new_row

In [8]:
# takes a few minutes because apply(axis=1) calls the function once per row
uoft_data["NumUofTAuthors"] = 0
uoft_data = uoft_data.apply(get_num_uoft_authors, axis=1)
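
One way the row-by-row count might be sped up is to compute it for all rows at once. The sketch below counts, per row, how many author-position columns are filled before the first empty one, which is the same stopping rule as `get_num_uoft_authors`. The function name `count_leading_non_null` and the toy frame are hypothetical; on the real data the column list would be the three long-named position columns followed by `a4` through `a60`.

```python
import numpy as np
import pandas as pd


def count_leading_non_null(df, position_cols):
    """Count, per row, how many of position_cols are non-null before the first null."""
    present = df[position_cols].notna().astype(int)
    # the cumulative product stays 1 until the first missing value, then drops to 0
    return present.cumprod(axis=1).sum(axis=1)


# toy frame with hypothetical short column names a1..a3
toy = pd.DataFrame({
    "a1": [1, np.nan, 1, 1],
    "a2": [2, 3, np.nan, 2],
    "a3": [np.nan, 4, 5, 3],
})
toy["NumUofTAuthors"] = count_leading_non_null(toy, ["a1", "a2", "a3"])
```

The second toy row counts as zero even though later columns are filled, mirroring the early `return` in the original function.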

In [9]:
uoft_data.head()

| | Year | UID | PubDate | Issue | Volume | Pages | Start Page | End Page | Number of Pages | Source Title | ... | b56 | a57 | b57 | a58 | b58 | a59 | b59 | a60 | b60 | NumUofTAuthors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | WOS:000341974900031 | 2014-08-01 | 2 | 98 | 541-548 | 541 | 548 | 9 | ANNALS OF THORACIC SURGERY | ... | | | | | | | | | | 0 |
| 1 | 2014 | WOS:000346385100013 | 2014-01-01 | 11 | 138 | 1495-1502 | 1495 | 1502 | 8 | ARCHIVES OF PATHOLOGY & LABORATORY MEDICINE | ... | | | | | | | | | | 1 |
| 2 | 2014 | WOS:000342917300011 | 2014-08-01 | 4 | 23 | 302-307 | 302 | 307 | 6 | CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE | ... | | | | | | | | | | 2 |
| 3 | 2014 | WOS:000331684800009 | 2014-03-01 | 3 | 66 | 404-410 | 404 | 410 | 7 | ARTHRITIS CARE & RESEARCH | ... | | | | | | | | | | 1 |
| 4 | 2014 | WOS:000331927800004 | 2014-03-01 | 3 | 71 | 916-928 | 916 | 928 | 13 | JOURNAL OF THE ATMOSPHERIC SCIENCES | ... | | | | | | | | | | 1 |


(a) Create an interactive scatter plot that shows the effect of mean, variance, and sample size on a fitted simple linear regression line. Assume that the independent variable has $N(\mu,\sigma^2)$ distribution.

(b) Briefly explain why your interactive scatter plot is effective at communicating the impact of mean, variance, and sample size on simple linear regression.

In [10]:
# interactive scatter plot with a fitted simple linear regression line
import matplotlib.pyplot as plt
import seaborn as sns

def plott(mean, var, sample_size):
    fig, axs = plt.subplots()
    # x ~ N(mean, var); np.random.normal expects the standard deviation
    x = np.random.normal(mean, np.sqrt(var), sample_size)
    # noise around the true line y = 5x, with a fixed standard deviation
    errors = np.random.normal(0, 100, sample_size)
    y = 5 * x + errors
    sns.regplot(x=x, y=y, ci=99, ax=axs, color="blue", line_kws={"color": "red"})
    
    
interact(plott, mean = widgets.IntSlider(value=1, min=0, max=20, step=1),
                var = widgets.IntSlider(value=1, min=1, max=1000, step=10),
                sample_size= widgets.IntSlider(value=10, min=10, max=1000, step=10))

widgets.HTML(
    value="<b>Simple linear regression</b>",
)

interactive(children=(IntSlider(value=1, description='mean', max=20), IntSlider(value=1, description='var', ma…

HTML(value='<b>Simple linear regression</b>')

# User Selections

In [10]:
# a month breakdown is only possible when the download type is "total"
# selecting a subgroup of publishers or journal names is straightforward with pandas
#    (to be implemented with widgets later)

type_of_downloads = "total" # ["total", "pdf", "html"] 
start_year = 2014
start_month = "Jan"
end_year = 2014
end_month = "Dec"
######
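
To make the meaning of these start and end selections concrete, the sketch below lists every (year, month) pair that a given selection covers. The function name `months_in_range` is hypothetical and is not used elsewhere in the notebook; it only illustrates which month columns a selection is expected to include.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def months_in_range(start_year, start_month, end_year, end_month):
    """List the (year, month) pairs from the start selection to the end selection, inclusive."""
    # number the months consecutively so that the range can cross year boundaries
    start = start_year * 12 + MONTHS.index(start_month)
    end = end_year * 12 + MONTHS.index(end_month)
    return [(i // 12, MONTHS[i % 12]) for i in range(start, end + 1)]


covered = months_in_range(2014, "Nov", 2015, "Feb")
```

A start that falls after the end gives an empty list, which is a convenient signal that the selection contains no data.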

In [11]:
# each slice is copied so that adding "Downloads" does not trigger SettingWithCopyWarning
def get_selected_report_data(type_of_downloads, start_year, start_month, end_year, end_month):
    selected_data = pd.DataFrame()
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    years = list(range(start_year, end_year + 1))

    d_types = {"total": "Reporting Period Total",
               "pdf": "Reporting Period PDF",
               "html": "Reporting Period HTML"}
    if type_of_downloads not in d_types:
        # fail early instead of hitting a KeyError on an empty column name later
        raise ValueError("Download type not valid: " + str(type_of_downloads))
    d_type = d_types[type_of_downloads]

    # monthly specifics only available if download type is total
    if type_of_downloads == "total":
        start_month_i = months.index(start_month)
        end_month_i = months.index(end_month)
        if start_year == end_year:
            # only within the year range
            selected_data = report_data[report_data["Year"] == start_year].copy()
            selected_data["Downloads"] = 0

            for j in range(start_month_i, end_month_i + 1):
                selected_data["Downloads"] = selected_data["Downloads"] + selected_data[months[j]].fillna(0)
        
        elif start_year < end_year:
            # start year months
            selected_data = report_data[report_data["Year"] == start_year].copy()
            selected_data["Downloads"] = 0

            # get counts within month range for start year
            for j in range(start_month_i, len(months)):
                selected_data["Downloads"] = selected_data["Downloads"] + selected_data[months[j]].fillna(0)

            # in-between years count for the entire year
            inbetween_years = years[1 : len(years) - 1]
            inbetween_data = report_data[report_data["Year"].isin(inbetween_years)].copy()
            inbetween_data["Downloads"] = inbetween_data[d_type]

            # loop for end year months
            end_data = report_data[report_data["Year"] == end_year].copy()
            end_data["Downloads"] = 0

            for j in range(0, end_month_i + 1):
                end_data["Downloads"] = end_data["Downloads"] + end_data[months[j]].fillna(0)

            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            selected_data = pd.concat([selected_data, inbetween_data, end_data])

        else:
            # start year > end year, no data
            print("Start year after end year, no data")
            selected_data = pd.DataFrame()
    else:
        # do only year selections
        selected_data = report_data[report_data["Year"].isin(years)].copy()
        selected_data["Downloads"] = selected_data[d_type]
    
    return selected_data


# this merge probably needs work: it requires both ISSNs to agree, and pandas treats
# missing (NaN) keys as equal to each other, so rows with a missing ISSN are matched loosely
def merge_selected_reports_and_uoft(selected_data):
    # ISSNs of only selected data
    available_links = selected_data[["Print ISSN", "Online ISSN"]]
    available_links = available_links.drop_duplicates()
    available_links = available_links[(available_links["Print ISSN"].notna()) | 
                                      (available_links["Online ISSN"].notna())]
    available_links = available_links.reset_index(drop=True)

    # get uoft publications that match the selected ISSNs
    matched_uoft = pd.merge(uoft_data, available_links, how="inner", 
                            left_on=["ISSN", "eISSN"], right_on=["Print ISSN", "Online ISSN"])

    # save their ISSNs
    matched_uoft = matched_uoft.drop(columns=["ISSN", "eISSN"])
    available_links = matched_uoft[["Print ISSN", "Online ISSN"]]
    available_links = available_links.drop_duplicates()
    available_links = available_links[(available_links["Print ISSN"].notna()) | 
                                      (available_links["Online ISSN"].notna())]
    available_links = available_links.reset_index(drop=True)

    # get download counts of those uoft publications
    matched_selected = pd.merge(selected_data, available_links, how="inner", 
                            on=["Print ISSN", "Online ISSN"])

    # takes a while and can cause memory problems if both dataframes really big
    matched_data = pd.merge(matched_selected, matched_uoft, how="inner", 
                            on=["Print ISSN", "Online ISSN", "Year"]) #, right_on=["ISSN", "eISSN"])
    
    matched_data = matched_data.reset_index(drop=True)
    return matched_data
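
One possible way to handle rows with a missing ISSN is to match on each identifier separately and combine the results, dropping missing keys first so that they never match each other. The sketch below is an assumption about what a more forgiving match could look like, not the method used above; the function name `match_on_either_issn` and the toy frames are hypothetical, while the column names follow the ones used in `merge_selected_reports_and_uoft`.

```python
import pandas as pd


def match_on_either_issn(pubs, journals):
    """Inner-join pubs to journals when either the print or the online ISSN agrees."""
    matches = []
    for pub_col, journal_col in [("ISSN", "Print ISSN"), ("eISSN", "Online ISSN")]:
        # drop missing keys first: pandas would otherwise match NaN keys to each other
        left = pubs.dropna(subset=[pub_col])
        right = journals.dropna(subset=[journal_col])
        matches.append(pd.merge(left, right, how="inner",
                                left_on=pub_col, right_on=journal_col))
    # a pair that agrees on both identifiers appears in both merges, so deduplicate
    return pd.concat(matches, ignore_index=True).drop_duplicates().reset_index(drop=True)


# toy data for illustration only
pubs = pd.DataFrame({
    "UID": ["p1", "p2", "p3", "p4"],
    "ISSN": ["1111-1111", None, None, "3333-3333"],
    "eISSN": [None, "2222-2222", None, "4444-4444"],
})
journals = pd.DataFrame({
    "Journal": ["A", "B", "C", "D"],
    "Print ISSN": ["1111-1111", None, None, "3333-3333"],
    "Online ISSN": ["9999-9999", "2222-2222", None, "4444-4444"],
})
matched = match_on_either_issn(pubs, journals)
```

In the toy data, `p3` and journal `C` have no identifiers at all and are therefore left unmatched, whereas a direct two-key merge would pair them.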

# Which journals are downloaded most frequently?

If time permits:
Also rank publishers, platforms, and packages. For those, it may be better to also look at downloads divided by the number of journals, because a platform might carry many journals that are each rarely downloaded.
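
The normalization mentioned above could be sketched as follows: total downloads for each group divided by the number of distinct journals in that group. The function name `downloads_per_journal` and the toy data are hypothetical; the column names `Platform`, `Journal`, and `Downloads` follow the ones used in this notebook.

```python
import pandas as pd


def downloads_per_journal(data, group_col):
    """Total downloads per group, divided by the number of distinct journals in it."""
    grouped = data.groupby(group_col).agg(
        Downloads=("Downloads", "sum"),
        Journals=("Journal", "nunique"),
    )
    grouped["DownloadsPerJournal"] = grouped["Downloads"] / grouped["Journals"]
    return grouped.sort_values("DownloadsPerJournal", ascending=False)


# toy data for illustration only; journal "A" appears in two years
toy = pd.DataFrame({
    "Platform": ["X", "X", "X", "Y"],
    "Journal": ["A", "A", "B", "C"],
    "Downloads": [10, 20, 30, 50],
})
per_platform = downloads_per_journal(toy, "Platform")
```

Here platform `X` has more downloads in total but fewer per journal than `Y`, which is the situation the note above is concerned about.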

In [12]:
# probably select data elsewhere to avoid continuous use of that long function
selected_data = get_selected_report_data(type_of_downloads, start_year, start_month, end_year, end_month)
uoft_reports = merge_selected_reports_and_uoft(selected_data)
uoft_downloads = uoft_reports.drop_duplicates(subset=["Print ISSN", "Online ISSN", "Journal", "Year"])

In [7]:
# functions

def get_top_journals(top_num, data):
    # aggregate downloads across years and across different versions of the same journal
    aggregated = data.groupby(["Journal"])["Downloads"].sum()
    aggregated = pd.DataFrame(aggregated)
    results = aggregated.sort_values(by="Downloads", ascending=False)[0:top_num]
    
    # get order 
    order = results.reset_index()
    order = order.drop(columns=["Downloads"])
    order = list(order["Journal"])

    # get publishers
    top_journals = results.index
    top_journals = data[data["Journal"].isin(top_journals)]
    publishers = top_journals.groupby(["Journal", "Publisher"])["Downloads"].sum()
    publishers = pd.DataFrame(publishers)
    publishers = publishers.reset_index()
    # each journal only has one publisher
    # any extra publishers is just the same publisher but spelt slightly differently
    publishers = publishers.drop_duplicates(subset=["Journal"]) 
    publishers = publishers.drop(columns=["Downloads"])
    
    # get platform counts
    platforms = top_journals.groupby(["Journal", "Platform"])["Downloads"].sum()
    platforms = pd.DataFrame(platforms)
    platforms = platforms.reset_index()
    platforms = platforms.merge(publishers)

    # get package counts
    packages = top_journals.groupby(["Journal", "FilePackage"])["Downloads"].sum()
    packages = pd.DataFrame(packages)
    packages = packages.reset_index()
    packages = packages.merge(publishers)
    
    return results, order, platforms, packages


def barchart_of_downloads(data, extra, order, totals):
    fig = px.bar(data, x='Journal', y='Downloads', color=extra, 
                 hover_data=['Downloads', 'Publisher', extra],
                 category_orders={"Journal": order}, height=600, width=900) 

    y1 = list(totals["Downloads"])
    xcoord = list(totals.index) 
    annotations = [dict(x=xi, y=yi, text=str(yi), xanchor='center', 
                        yanchor='bottom', showarrow=False) for xi, yi in zip(xcoord, y1)]

    title_part = "Packages"
    if extra == "Platform":
        title_part = "Platforms"
    
    fig.update_layout(
        title={'text': "Top " + str(len(order)) + " Journals and Their " + title_part,
               'y':0.95,
               'x':0.5,
               'xanchor': 'center',
               'yanchor': 'top'},
        annotations=annotations)
    
    return fig

In [31]:
# widgets

# dropdown menu of top
top_dd = widgets.Dropdown(options = [5, 10], description='Top:')
# dropdown menu of if uoft or not
uoft_dd = widgets.Dropdown(options = ["All Reports", "Only UofT Reports"], description="Type:")

counts_table = widgets.Output()
# can't get figure widget working, even though it would probably be faster
platform_graph = go.FigureWidget()
package_graph = go.FigureWidget()

In [32]:
# global selections
curr_data1 = selected_data
top = 5

# update functions
def update_journal_figs(results, order, platforms, packages):
    with counts_table:
        display(results)
        
    new_fig = barchart_of_downloads(platforms, "Platform", order, results)
    platform_graph.data = []
    platform_graph.add_traces(new_fig.data)
    platform_graph.layout = new_fig.layout
    
    new_fig = barchart_of_downloads(packages, "FilePackage", order, results)
    package_graph.data = []
    package_graph.add_traces(new_fig.data)
    package_graph.layout = new_fig.layout

        
def top_update(change):  
    global top
    counts_table.clear_output() 
    
    # don't need to change current data
    top = change.new
    results, order, platforms, packages = get_top_journals(top, curr_data1)
    update_journal_figs(results, order, platforms, packages)
        
        
def uoft_update(change):  
    global curr_data1
    counts_table.clear_output() 
    
    if change.new == "All Reports":
        curr_data1 = selected_data
    elif change.new == "Only UofT Reports":
        curr_data1 = uoft_downloads
    
    results, order, platforms, packages = get_top_journals(top, curr_data1)
    update_journal_figs(results, order, platforms, packages)


top_dd.observe(top_update, names = 'value')
uoft_dd.observe(uoft_update, names = 'value')

In [33]:
results, order, platforms, packages = get_top_journals(top, curr_data1)
update_journal_figs(results, order, platforms, packages)

input_widgets = widgets.HBox([top_dd, uoft_dd])
display(input_widgets)

tab = widgets.Tab([counts_table, platform_graph, package_graph])
tab.set_title(0, "Counts")
tab.set_title(1, 'Graph With Platform')
tab.set_title(2, 'Graph With Package')
display(tab)

# TODO: set some color map thing so that it doesn't repeat colours when the legend gets large

HBox(children=(Dropdown(description='Top:', options=(5, 10), value=5), Dropdown(description='Type:', options=(…

Tab(children=(Output(), FigureWidget({
    'data': [{'alignmentgroup': 'True',
              'customdata': arr…

# How many authors from UofT publish in the journals that are downloaded?

Could try to also allow topic selection.

There are some cases of multiple publications in one journal.

In the user selected year range:

- Number of UofT publications in journals (and percentage). If the user wants only currently subscribed journals, selecting just the most recent year should work.
- Distribution of downloads and total downloads (shows how popular UofT publications are)
- Distribution of "number of UofT authors" and distribution of "proportion of UofT authors"
- Distribution of "number of pages" for the UofT publications
- Value counts of Publication Type and Document Type
- Distribution of categories: Heading 1, Subheadings, Subjects

In [46]:
# functions
head_selection = "All Headings"
subhead_selection = "All Subheadings"
curr_data2 = uoft_reports
curr_downloads = uoft_downloads
proportions = curr_data2["NumUofTAuthors"] / curr_data2["Total number of authors"]
proportions = pd.DataFrame(proportions).rename(columns={0: "Proportions"})

def select_topics(head, subhead, uoft_selected_data):
    if head != "All Headings":
        # select that heading
        curr_data2 = uoft_selected_data[uoft_selected_data["Category: Heading 1"] == head]
    else:
        curr_data2 = uoft_selected_data
        
    if subhead != "All Subheadings":
        # keep rows whose subheadings text contains the selection (missing values never match)
        has_subhead = curr_data2["Category: Subheadings"].str.contains(subhead, regex=False, na=False)
        curr_data2 = curr_data2[has_subhead]
        
    unique_downloads = curr_data2.drop_duplicates(subset=["Print ISSN", "Online ISSN", "Journal", "Year"])
    return curr_data2, unique_downloads


def get_stats(all_data, download_data):
    stats = {"Number of UofT publications in journals subscribed to:": str(len(all_data)),
             "Number of UofT authors of publications in journals subscribed to:": str(sum(all_data["NumUofTAuthors"])),
             "Number of journals with UofT publications subscribed to:": str(len(download_data)),
             "Percentage of journals with UofT publications subscribed to:": 
              str(round((len(download_data) / len(selected_data)) * 100, 3)) + "%"
            }
    stats = pd.DataFrame(stats.items(), columns=["Stat", "Value"])
    stats = stats.set_index("Stat")
    return stats


def get_doctype_counts(all_data):
    doctype_counts = all_data["Document Type"].value_counts()
    indexes = list(doctype_counts.index)
    curr_counts = list(doctype_counts.values)

    i = 0
    counts = {}
    for doctype in indexes:
        types = doctype.replace(',',';').split(";")
        for onetype in types:
            onetype = onetype.strip()
            if onetype not in counts:
                counts[onetype] = 0
            counts[onetype] = counts[onetype] + curr_counts[i]
        i = i + 1
    return counts


def get_cat2_counts(all_data):
    cat2_counts = all_data["Category: Subheadings"].value_counts()
    indexes = list(cat2_counts.index)
    curr_counts = list(cat2_counts.values)

    i = 0
    cat2_counts = {}
    for cat2 in indexes:
        types = cat2.replace(',',';').split(";")
        for onetype in types:
            onetype = onetype.strip()
            if onetype not in cat2_counts:
                cat2_counts[onetype] = 0
            cat2_counts[onetype] = cat2_counts[onetype] + curr_counts[i]
        i = i + 1
        
    return cat2_counts


def get_subject_counts(all_data):
    cat3_counts = all_data["Category: Subjects"].value_counts()
    indexes = list(cat3_counts.index)
    curr_counts = list(cat3_counts.values)

    i = 0
    cat3_counts = {}
    for cat3 in indexes:
        types = cat3.replace(',',';').split(";")
        for onetype in types:
            onetype = onetype.strip().lower()
            if onetype not in cat3_counts:
                cat3_counts[onetype] = 0
            cat3_counts[onetype] = cat3_counts[onetype] + curr_counts[i]
        i = i + 1

    all_cat3_counts = cat3_counts.copy()
    max_subject = max(all_cat3_counts.items(), key=operator.itemgetter(1))[0]
    bound = cat3_counts[max_subject] / 4

    small = {}
    total = 0
    for cat3 in cat3_counts:
        if cat3_counts[cat3] < bound:
            small[cat3] = cat3_counts[cat3]
            total = total + cat3_counts[cat3]

    too_small = list(small.keys())
    for key in too_small:
        del cat3_counts[key]
    cat3_counts["Other"] = total

    all_cat3_data = pd.DataFrame(all_cat3_counts.items(), columns=['Subject', 'Number of Publications'])
    cat3_data = pd.DataFrame(cat3_counts.items(), columns=["Subject", "Num Publications"])
    return cat3_data, all_cat3_data
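
The three counting functions above share the same split-and-tally loop. A minimal sketch of how that loop might be factored out with `collections.Counter` is shown below; the name `count_split_labels` is hypothetical and the sample counts are made up. It accepts anything with an `.items()` method, so it should work on a plain dict as well as on the result of `value_counts()`.

```python
from collections import Counter


def count_split_labels(value_counts, lowercase=False):
    """Tally labels that are stored as ';' or ',' separated strings."""
    counts = Counter()
    for label, n in value_counts.items():
        for part in label.replace(",", ";").split(";"):
            part = part.strip()
            if lowercase:
                part = part.lower()
            # every row carrying this combined label counts once for each part
            counts[part] += n
    return dict(counts)


# made-up counts for illustration
sample = {"Article; Review": 3, "Article": 2, "Review, Letter": 1}
doc_counts = count_split_labels(sample)
```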

In [47]:
# widgets

head_dd = widgets.Dropdown(options = ["All Headings"] + 
                           list(uoft_reports["Category: Heading 1"].value_counts().index))
subhead_dd = widgets.Dropdown(options = ["All Subheadings"] + 
                              list(get_cat2_counts(uoft_reports).keys()))

uoft_stats = widgets.Output()

download_dist = go.FigureWidget(skip_invalid=True)
num_authors_dist = go.FigureWidget()
prop_authors_dist = go.FigureWidget(skip_invalid=True)
num_pages_dist = go.FigureWidget()

pubtype_dist = go.FigureWidget()
doctype_dist = go.FigureWidget()

cat1_graph = go.FigureWidget()
cat2_graph = go.FigureWidget()
cat3_graph = go.FigureWidget()

In [48]:
# update functions

def head_update(change):  
    global head_selection
    global curr_data2
    global curr_downloads
    global proportions
    uoft_stats.clear_output() 
    
    head_selection = change.new
    curr_data2, curr_downloads = select_topics(head_selection, subhead_selection, uoft_reports)
    proportions = curr_data2["NumUofTAuthors"] / curr_data2["Total number of authors"]
    proportions = pd.DataFrame(proportions).rename(columns={0: "Proportions"})
    update_uoft_figs(curr_data2, curr_downloads, proportions)
    
    
def subhead_update(change):  
    global subhead_selection
    global curr_data2
    global curr_downloads
    global proportions
    uoft_stats.clear_output() 
    
    subhead_selection = change.new
    curr_data2, curr_downloads = select_topics(head_selection, subhead_selection, uoft_reports)
    proportions = curr_data2["NumUofTAuthors"] / curr_data2["Total number of authors"]
    proportions = pd.DataFrame(proportions).rename(columns={0: "Proportions"})
    update_uoft_figs(curr_data2, curr_downloads, proportions)
    
    
def update_uoft_figs(curr_data2, curr_downloads, proportions):
    with uoft_stats:
        display(get_stats(curr_data2, curr_downloads))
    
    # update download distribution
    new_fig = px.histogram(curr_downloads, x="Downloads", marginal="box", nbins=200, 
                           title="Downloads of Journals with UofT Publications", height=400)
    download_dist.data = []
    download_dist.add_traces(new_fig.data)
    download_dist.layout = new_fig.layout
    
    # update proportion authors distribution
    fig2 = px.histogram(proportions, x="Proportions", nbins=20, 
                        title="Proportion of UofT Authors in Publications", height=400,
                        labels={"Proportions": "Number of UofT Authors / Total Number of Authors"})
    prop_authors_dist.data = []
    prop_authors_dist.add_traces(fig2.data)
    prop_authors_dist.layout = fig2.layout
        
    # update num authors distribution
    fig1 = px.histogram(curr_data2, x="NumUofTAuthors", marginal="box", nbins=100, 
                        title="Number of UofT Authors in Publications", height=400,
                        labels={"NumUofTAuthors": "Number of UofT Authors"})
    num_authors_dist.data = []
    num_authors_dist.add_traces(fig1.data)
    num_authors_dist.layout = fig1.layout
        
    # update num pages
    fig3 = px.histogram(curr_data2, x="Number of Pages", marginal="box", nbins=100, 
                        title="Number of Pages in UofT Publications", height=400)
    num_pages_dist.data = []
    num_pages_dist.add_traces(fig3.data)
    num_pages_dist.layout = fig3.layout
        
    update_pies(curr_data2)
        
        
def update_pies(curr_data2):
    # update pubtypes
    fig = px.pie(values=list(curr_data2["PubType"].value_counts().values), 
                 names=list(curr_data2["PubType"].value_counts().index), 
                 title='Publication Types')
    pubtype_dist.data = []
    pubtype_dist.add_traces(fig.data)
    pubtype_dist.layout = fig.layout
        
    # update doctypes
    doctype_counts = get_doctype_counts(curr_data2)
    fig1 = px.pie(values=list(doctype_counts.values()), 
                  names=list(doctype_counts.keys()), 
                  title='Document Types')
    doctype_dist.data = []
    doctype_dist.add_traces(fig1.data)
    doctype_dist.layout = fig1.layout
        
    # update cat1
    # name both columns explicitly so this does not depend on how the installed
    # pandas version labels the output of value_counts().reset_index()
    cat1_counts = (curr_data2["Category: Heading 1"].value_counts()
                   .rename_axis("Heading").reset_index(name="Num Publications"))
    fig2 = px.pie(cat1_counts, values="Num Publications", 
                  names="Heading", 
                  title='Heading 1 Categories')
    cat1_graph.data = []
    cat1_graph.add_traces(fig2.data)
    cat1_graph.layout = fig2.layout
        
    # update cat2
    cat2_counts = get_cat2_counts(curr_data2)
    fig3 = px.pie(values=list(cat2_counts.values()), 
                  names=list(cat2_counts.keys()), 
                  title='Subheading Categories')
    cat2_graph.data = []
    cat2_graph.add_traces(fig3.data)
    cat2_graph.layout = fig3.layout
        
    # update cat3
    cat3_counts, all_cat3_counts = get_subject_counts(curr_data2)
    fig4 = px.pie(cat3_counts, values="Num Publications", 
                  names="Subject", 
                  title='Subject Categories')
    cat3_graph.data = []
    cat3_graph.add_traces(fig4.data)
    cat3_graph.layout = fig4.layout
    

head_dd.observe(head_update, names = 'value')
subhead_dd.observe(subhead_update, names = 'value')

In [49]:
# display

update_uoft_figs(curr_data2, curr_downloads, proportions)

input_widgets = widgets.HBox([head_dd, subhead_dd])

dist_tab = widgets.Tab([download_dist, num_authors_dist, prop_authors_dist, num_pages_dist])
dist_tab.set_title(0, "Downloads")
dist_tab.set_title(1, 'Num Authors')
dist_tab.set_title(2, 'Prop Authors')
dist_tab.set_title(3, 'Pages')

type_tab = widgets.Tab([pubtype_dist, doctype_dist])
type_tab.set_title(0, "Publication Types")
type_tab.set_title(1, 'Document Types')

cat_tab = widgets.Tab([cat1_graph, cat2_graph, cat3_graph])
cat_tab.set_title(0, "Headings")
cat_tab.set_title(1, 'Subheadings')
cat_tab.set_title(2, 'Subjects')

main_uoft_tab = widgets.Tab([uoft_stats, dist_tab, type_tab, cat_tab])
main_uoft_tab.set_title(0, "Stats")
main_uoft_tab.set_title(1, 'Distributions')
main_uoft_tab.set_title(2, 'Journal Types')
main_uoft_tab.set_title(3, 'Categories')

In [50]:
display(input_widgets)
display(main_uoft_tab)

HBox(children=(Dropdown(options=('All Headings', 'Science & Technology', 'Social Sciences', 'Arts & Humanities…

Tab(children=(Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '                        …

In [32]:
data = uoft_reports
proportions = data["NumUofTAuthors"] / data["Total number of authors"]
proportions = pd.DataFrame(proportions).rename(columns={0: "Proportions"})
proportions

| | Proportions |
|---|---|
| 0 | 0.090909 |
| 1 | 0.818182 |
| 2 | 0.050000 |
| 3 | 0.035714 |
| 4 | 0.142857 |
| ... | ... |
| 14512 | 0.555556 |
| 14513 | 0.800000 |
| 14514 | 0.111111 |
| 14515 | 1.000000 |
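
The cell above divides the UofT author count by the total author count and then renames the unnamed column `0`. The sketch below shows the same calculation with the column named directly through `to_frame`. It also guards against a total of zero, which is an assumption about the data rather than something observed in `uoft_reports`. The rows are made up; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names match uoft_reports in the notebook.
data = pd.DataFrame({
    "NumUofTAuthors": [1, 9, 1, 0],
    "Total number of authors": [11, 11, 20, 0],
})

# Treat a total of 0 as missing so the division yields NaN for that row
# (assumption: such rows should be left out of the histogram).
totals = data["Total number of authors"].replace(0, np.nan)

# to_frame names the column directly, so no rename from column 0 is needed.
proportions = (data["NumUofTAuthors"] / totals).to_frame("Proportions")
print(proportions)
```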


In [45]:
fig2 = px.histogram(proportions, x="Proportions",
                    title="Proportion of UofT Authors in Publications", height=400,
                    labels={"Proportions": "Number of UofT Authors / Total Number of Authors"})
prop_authors_dist.data = []
prop_authors_dist.add_traces(fig2.data)
prop_authors_dist.layout = fig2.layout
#px.histogram(x=checking, marginal="box", nbins=200, 
#                           title="Downloads of Journals with UofT Publications", height=400)

check.data = []
check.add_traces(fig2.data)
check.layout = fig2.layout

In [41]:
prop_authors_dist

FigureWidget({
    'data': [],
    'layout': {'barmode': 'relative',
               'legend': {'tracegroupgap'…

In [35]:
check

FigureWidget({
    'data': [], 'layout': {'template': '...'}
})

In [30]:
check.data = []
check.add_traces(new_fig.data)

FigureWidget({
    'data': [{'alignmentgroup': 'True',
              'bingroup': 'x',
              'hoverlabe…

In [28]:
new_fig.data

(Histogram({
     'alignmentgroup': 'True',
     'bingroup': 'x',
     'hoverlabel': {'namelength': 0},
     'hovertemplate': 'Number of UofT Authors=%{x}<br>count=%{y}',
     'legendgroup': '',
     'marker': {'color': '#636efa'},
     'name': '',
     'nbinsx': 100,
     'offsetgroup': '',
     'orientation': 'v',
     'showlegend': False,
     'x': array([1, 9, 1, ..., 1, 1, 3], dtype=int64),
     'xaxis': 'x',
     'yaxis': 'y'
 }),
 Box({
     'alignmentgroup': 'True',
     'hoverlabel': {'namelength': 0},
     'hovertemplate': 'Number of UofT Authors=%{x}',
     'legendgroup': '',
     'marker': {'color': '#636efa'},
     'name': '',
     'notched': True,
     'offsetgroup': '',
     'showlegend': False,
     'x': array([1, 9, 1, ..., 1, 1, 3], dtype=int64),
     'xaxis': 'x2',
     'yaxis': 'y2'
 }))

In [74]:
num_authors_dist.data = []
num_authors_dist.add_traces(fig2.data)
#num_authors_dist.layout = fig2.layout

FigureWidget({
    'data': [{'alignmentgroup': 'True',
              'bingroup': 'x',
              'hoverlabe…

In [75]:
num_authors_dist

FigureWidget({
    'data': [{'alignmentgroup': 'True',
              'bingroup': 'x',
              'hoverlabe…

In [65]:
fig1 = px.histogram(curr_data2, x="NumUofTAuthors", marginal="box", nbins=100, 
                        title="Number of UofT Authors in Publications", height=400,
                        labels={"NumUofTAuthors": "Number of UofT Authors"})
fig1.show()

In [66]:
fig1.data

(Histogram({
     'alignmentgroup': 'True',
     'bingroup': 'x',
     'hoverlabel': {'namelength': 0},
     'hovertemplate': 'Number of UofT Authors=%{x}<br>count=%{y}',
     'legendgroup': '',
     'marker': {'color': '#636efa'},
     'name': '',
     'nbinsx': 100,
     'offsetgroup': '',
     'orientation': 'v',
     'showlegend': False,
     'x': array([1, 9, 1, ..., 1, 1, 3], dtype=int64),
     'xaxis': 'x',
     'yaxis': 'y'
 }),
 Box({
     'alignmentgroup': 'True',
     'hoverlabel': {'namelength': 0},
     'hovertemplate': 'Number of UofT Authors=%{x}',
     'legendgroup': '',
     'marker': {'color': '#636efa'},
     'name': '',
     'notched': True,
     'offsetgroup': '',
     'showlegend': False,
     'x': array([1, 9, 1, ..., 1, 1, 3], dtype=int64),
     'xaxis': 'x2',
     'yaxis': 'y2'
 }))

In [67]:
fig2.data

(Histogram({
     'alignmentgroup': 'True',
     'bingroup': 'x',
     'hoverlabel': {'namelength': 0},
     'hovertemplate': 'x=%{x}<br>count=%{y}',
     'legendgroup': '',
     'marker': {'color': '#636efa'},
     'name': '',
     'nbinsx': 20,
     'offsetgroup': '',
     'orientation': 'v',
     'showlegend': False,
     'x': array([0.09090909, 0.81818182, 0.05      , ..., 0.11111111, 1.        ,
                 1.        ]),
     'xaxis': 'x',
     'yaxis': 'y'
 }),
 Box({
     'alignmentgroup': 'True',
     'hoverlabel': {'namelength': 0},
     'hovertemplate': 'x=%{x}',
     'legendgroup': '',
     'marker': {'color': '#636efa'},
     'name': '',
     'notched': True,
     'offsetgroup': '',
     'showlegend': False,
     'x': array([0.09090909, 0.81818182, 0.05      , ..., 0.11111111, 1.        ,
                 1.        ]),
     'xaxis': 'x2',
     'yaxis': 'y2'
 }))

# Comparing UofT vs non-UofT

To do if I have free time.