# FOCUSED Project: OSPO adoption

As part of the [FOCUSED Collaboration project](https://github.com/JumpsuitWizard/FOCUSED-Collaboration), this notebook examines the adoption of Open Source Program Offices (OSPOs) among companies in the [Standard and Poor's 500 index](https://en.wikipedia.org/wiki/S%26P_500).

## Authors

- **PI**: Duane O'Brien
- **Researcher**: julia ferraioli
- **Analyst**: Reshama Shaikh

## Research question

## Methodology

## Data sources

The following data sources are used in the analysis:

- [S&P 500](https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv) retrieved on 2021-10-05
- [OSCI Index](https://opensourceindex.io/) retrieved on 2022-02-28
- [OSPO Landscape](https://landscape.todogroup.org/) retrieved on 2022-05-09


## Data fetching

Fetch a fresh copy of each data source unless a copy has already been saved for the current month.
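
The caching rule used by the fetching cell can be sketched as a pair of small helpers: a source counts as current when a file for it exists under this month's `data_raw/YYYY/MM/` directory. This is only an illustration of the idea; `cache_path`, `needs_fetch`, and the `root` parameter are hypothetical names that do not appear in the notebook.

```python
import os
from datetime import date


def cache_path(key, fmt, day, root="data_raw"):
    """Build root/YYYY/MM/key.fmt, the per-month location of a cached source."""
    return os.path.join(root, "%d" % day.year, "%02d" % day.month,
                        "%s.%s" % (key, fmt))


def needs_fetch(key, fmt, day, root="data_raw"):
    """Return True when no copy has been saved for the month containing `day`."""
    return not os.path.isfile(cache_path(key, fmt, day, root=root))


# Example: where the S&P 500 file for July 2023 would live
print(cache_path("SP500", "csv", date(2023, 7, 4)))
```

Because the directory name changes every calendar month, a file saved on the last day of a month is treated as out of date the next day. Comparing file modification times would be an alternative if a strict age limit is wanted.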


In [117]:
import csv
from datetime import date
import json
import numpy as np
import os
import pandas as pd
import plotly.express as px
import pprint
import re
import requests
from yaml import safe_load

pp = pprint.PrettyPrinter(indent=2)

# Set the raw output directory in YYYY/MM format
today = date.today()
data_dir = "data_raw/%s/%s/" % (today.year, ('%02d' % (today.month)))

# Dictionary of sources
sources = {"SP500": {"name": "S&P 500",
                     "link": "https://raw.githubusercontent.com/datasets/s-and-p-500-companies/main/data/constituents.csv",
                     "format": "csv",
                     "data": None
                     },
           "OSCI": {"name": "OSCI",
                    # The base URL is completed with the previous month's YYYY/MM.json (ex /monthly/2023/07.json);
                    # DateOffset rolls January back to December of the prior year
                    "link": "https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/monthly/%s.json" % ((pd.Timestamp(today) - pd.DateOffset(months=1)).strftime("%Y/%m")),
                    "format": "json",
                    "data": None
                    },
           "TODO": {"name": "TODO Group Landscape",
                    "link": "https://raw.githubusercontent.com/todogroup/ospolandscape/master/landscape.yml",
                    "format": "yml",
                    "data": None
                    },
           "OSPOPlusPlus": {"name": "OSPO++",
                            "link": "https://raw.githubusercontent.com/ospoplusplus/ospoplusplus/main/content/about/members/_index.md",
                            "format": "md",
                            "data": None
                            },
           "OSPOAlliance": {"name": "OSPO Alliance",
                            "link": "https://gitlab.eclipse.org/eclipse/plato/www/-/raw/main/layouts/shortcodes/section-members.html",
                            "format": "html",
                            "data": None
                            }
           }

# Fetch and save the versioned data if it is out of date
for key, value in sources.items():
    # Check most recent data file
    filepath = os.path.join(data_dir, ("%s.%s" % (key, value['format'])))
    if os.path.isfile(filepath):
        print(value['name'], " is up-to-date")
    else:
        # Request the data
        try:
            req = requests.get(value['link'], stream=True)
            req.raise_for_status()
            os.makedirs(data_dir, exist_ok=True)
            with open(filepath, "w") as f:
                f.write(req.text)
        except requests.exceptions.RequestException as e:
            # Covers HTTP error statuses as well as connection failures and timeouts
            print("received %s; skipping." % (e))
            continue


S&P 500  is up-to-date
TODO Group Landscape  is up-to-date
OSPO++  is up-to-date
OSPO Alliance  is up-to-date


## Data setup


### Create utility functions for parsing specific data sources


In [119]:
# Process TODOGroup Landscape
def parse_todo(filepath):
    with open(filepath, 'r') as f:
        raw = safe_load(f)
        return pd.json_normalize(
            raw['landscape'],
            record_path=['subcategories', 'items'],
            meta=['category', 'name'],
            meta_prefix='category_'
        )[['name', 'category_name']]

# Process OSPO++ members
def parse_opp(filepath):
    regex = r"(?<=company name\=)\"(.*?)\""
    with open(filepath, 'rt') as f:
        raw = f.read()
        return pd.DataFrame({'OSPO++ Member': re.findall(regex, raw)})

# Process OSPO Alliance members
def parse_alliance(filepath):
    name_re = r"(?<=alt\=)\"(.*?)(?= logo|\")"
    site_re = r"(?<=href\=)\"(.*?)(?=\")"
    with open(filepath, 'r') as f:
        raw = f.read()
        return pd.DataFrame({'OSPO Alliance Member': re.findall(name_re, raw), 'Website': re.findall(site_re, raw)})
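
To see what these parsers extract, the same `json_normalize` call and regular expressions can be run on small in-memory samples. The sketch below is illustrative only: the sample YAML structure, the shortcode line, and the HTML snippet are hypothetical stand-ins, and the real upstream files may be laid out differently.

```python
import re

import pandas as pd

# Hypothetical stand-in for landscape.yml after safe_load()
raw = {
    "landscape": [
        {
            "category": None,
            "name": "OSPO Adopters",
            "subcategories": [
                {
                    "subcategory": None,
                    "name": "Example subcategory",
                    "items": [
                        {"item": None, "name": "Example Corp (Adopter)"},
                        {"item": None, "name": "Sample Inc (Adopter)"},
                    ],
                }
            ],
        }
    ]
}

# One row per item, carrying the enclosing category's name along as metadata
todo = pd.json_normalize(
    raw["landscape"],
    record_path=["subcategories", "items"],
    meta=["category", "name"],
    meta_prefix="category_",
)[["name", "category_name"]]

# Hypothetical member shortcode and HTML anchor
opp_md = '{{< company name="Example University" url="https://example.edu" >}}'
alliance_html = ('<a href="https://example.org">'
                 '<img src="logo.png" alt="Example Org logo"></a>')

opp_names = re.findall(r'(?<=company name=)"(.*?)"', opp_md)
alliance_names = re.findall(r'(?<=alt=)"(.*?)(?= logo|")', alliance_html)
alliance_sites = re.findall(r'(?<=href=)"(.*?)(?=")', alliance_html)

print(todo)
print(opp_names, alliance_names, alliance_sites)
```

Note that `parse_alliance` builds its DataFrame from two independent `findall` results, so the page must contain the same number of `alt` and `href` attributes for the columns to line up.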


### Load the data into Pandas


In [123]:
# Parse and process fetched data files
for key, value in sources.items():
    fmt = value['format']  # named fmt to avoid shadowing the built-in format()
    filepath = os.path.join(data_dir, ("%s.%s" % (key, fmt)))
    if not os.path.isfile(filepath):
        continue
    if fmt == 'csv':
        value['data'] = pd.read_csv(filepath)
    elif fmt == 'json':
        value['data'] = pd.read_json(filepath)
        value['data'] = pd.json_normalize(value['data']['data'])
    elif fmt in ('yml', 'yaml'):
        if key == 'TODO':
            value['data'] = parse_todo(filepath)
        else:
            print("file %s has no currently implemented handler" % (key))
    elif fmt == 'md':
        if key == 'OSPOPlusPlus':
            value['data'] = parse_opp(filepath)
        else:
            print("file %s has no currently implemented handler" % (key))
    elif fmt == 'html':
        if key == 'OSPOAlliance':
            value['data'] = parse_alliance(filepath)
        else:
            print("file %s has no currently implemented handler" % (key))
    else:
        print("file format %s has no currently implemented handler" % (fmt))

In [124]:
for key, value in sources.items():
    pp.pprint(key)
    pp.pprint(value['data'])


'SP500'
    Symbol              Security             GICS Sector  \
0      MMM                    3M             Industrials   
1      AOS           A. O. Smith             Industrials   
2      ABT                Abbott             Health Care   
3     ABBV                AbbVie             Health Care   
4      ACN             Accenture  Information Technology   
..     ...                   ...                     ...   
498    YUM           Yum! Brands  Consumer Discretionary   
499   ZBRA    Zebra Technologies  Information Technology   
500    ZBH         Zimmer Biomet             Health Care   
501   ZION  Zions Bancorporation              Financials   
502    ZTS                Zoetis             Health Care   

                      GICS Sub-Industry    Headquarters Location  Date added  \
0              Industrial Conglomerates    Saint Paul, Minnesota  1957-03-04   
1                     Building Products     Milwaukee, Wisconsin  2017-07-26   
2                 Health Care E

### Preview the data


In [None]:
for key, value in sources.items():
    print(key, "\n------")
    pp.pprint(value['data'].sample(5))
    print("\n")

### Clean up the data


In [639]:
# Rename columns in S&P 500 and add a few fields for comparison purposes
df_sp = df_sp.rename(columns={'Name': 'company', 'Sector': 'sector'})

df_sp['in S&P 500'] = True
df_sp['country'] = "United States"

# Reorder the columns for clarity's sake
order = ['company', 'sector', 'country', 'in S&P 500']

df_sp = df_sp.reindex(order, axis=1)

# Filter out columns from OSCI that we don't need
keep_cols = ['company', 'position', 'industry']
df_osci = df_osci.filter(keep_cols)
print(df_osci.head())

# Rename columns in OSCI and add a field for comparison purposes
df_osci = df_osci.rename(
    columns={'position': 'OSCI position', 'industry': 'OSCI sector'})
df_osci['in OSCI'] = True

# Reorder the columns for clarity's sake
order = ['company', 'OSCI sector', 'OSCI position', 'in OSCI']

df_osci = df_osci.reindex(order, axis=1)

# Rename columns in OSPO Landscape and add a field for comparison purposes
df_ospo = df_ospo.rename(columns={'Name': 'OSPO status',
                                  'Organization': 'company',
                                  'Market Cap': 'market cap',
                                  'Crunchbase Country': 'OSPO country',
                                  'License': 'license'})

df_ospo['in OSPO landscape'] = True

keep_cols = ['OSPO status', 'company', 'market cap',
             'OSPO country', 'in OSPO landscape']
df_ospo = df_ospo.filter(keep_cols)

# Filter out those who have not adopted an OSPO
df_ospo = df_ospo.loc[df_ospo['OSPO status'].str.contains(
    "adopter", case=False)]

# Reorder the columns for clarity's sake
order = ['company', 'OSPO country', 'market cap',
         'OSPO status', 'in OSPO landscape']


df_ospo = df_ospo.reindex(order, axis=1)


     company  position    industry
0     Google         1  Technology
1  Microsoft         2  Technology
2    Red Hat         3  Technology
3      Intel         4  Technology
4        IBM         5  Technology
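
The cleanup cell operates on `df_sp`, `df_osci`, and `df_ospo`, which the cells shown here never assign, and it renames `Name` and `Sector` while the S&P 500 output above lists `Security` and `GICS Sector`. A minimal sketch of the same rename, flag, and reorder pattern follows, assuming the frame comes from the loaded S&P 500 source and uses the newer column names; the sample rows are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for the loaded S&P 500 constituents
sp_raw = pd.DataFrame({
    "Symbol": ["MMM", "ABT"],
    "Security": ["3M", "Abbott"],
    "GICS Sector": ["Industrials", "Health Care"],
})

# Rename to the shared column names and add comparison fields
df_sp = sp_raw.rename(columns={"Security": "company", "GICS Sector": "sector"})
df_sp["in S&P 500"] = True
df_sp["country"] = "United States"

# Keep and order only the columns needed downstream
df_sp = df_sp.reindex(["company", "sector", "country", "in S&P 500"], axis=1)

# reindex does not raise on a missing label; it adds an all-NaN column instead
unrenamed = sp_raw.reindex(["company"], axis=1)

print(df_sp)
print(unrenamed["company"].isna().all())
```

The last two lines show why a rename that targets the wrong source column names can go unnoticed: `rename` ignores keys it cannot find, and `reindex` then quietly produces empty columns.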


### Preview the data


In [640]:
df_sp.sample(5)


| | company | sector | country | in S&P 500 |
|---|---|---|---|---|
| 420 | State Street Corporation | Financials | United States | True |
| 276 | Kroger | Consumer Staples | United States | True |
| 477 | Vulcan Materials | Materials | United States | True |
| 154 | Domino's Pizza | Consumer Discretionary | United States | True |
| 249 | Intercontinental Exchange | Financials | United States | True |


In [641]:
df_osci.sample(5)


| | company | OSCI sector | OSCI position | in OSCI |
|---|---|---|---|---|
| 168 | Pantheon | Technology | 169 | True |
| 200 | Linutronix | Technology | 201 | True |
| 62 | Iohk | Technology | 63 | True |
| 203 | Tensor | Technology | 204 | True |
| 60 | Atlassian | Technology | 61 | True |


In [642]:
df_ospo.sample(5)


| | company | OSPO country | market cap | OSPO status | in OSPO landscape |
|---|---|---|---|---|---|
| 90 | Aiven | Finland | | Aiven (Adopter) | True |
| 123 | Equinix | United States | 65114260000.0 | Equinix (Adopter) | True |
| 147 | Morgan Stanley | United States | 150626000000.0 | Morgan Stanley (Adopter) | True |
| 182 | Verizon Media | United States | 199440900000.0 | Verizon Media (Adopter) | True |
| 148 | National Instruments | United States | 4715077000.0 | National Instruments (Adopter) | True |


### Merge the data sources


In [643]:
# First, merge S&P Index and OSCI, then merge with OSPO Landscape
all_data = (df_sp.merge(df_osci, left_on='company',
                        right_on='company', how='outer')).merge(df_ospo,
                                                                left_on='company',
                                                                right_on='company',
                                                                how='outer')

# Prefer S&P data over OSCI and OSPO Landscape data
all_data['sector'] = all_data['sector'].mask(
    pd.isnull, all_data['OSCI sector'])
all_data['country'] = all_data['country'].mask(
    pd.isnull, all_data['OSPO country'])
all_data = all_data.drop(['OSCI sector', 'OSPO country'], axis=1)

all_data.sort_values(by=['company']).head(10)


| | company | sector | country | in S&P 500 | OSCI position | in OSCI | market cap | OSPO status | in OSPO landscape |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3M | Industrials | United States | True | | | | | |
| 768 | 4teamwork | Technology | | | 285.0 | True | | | |
| 759 | 5minds | Technology | | | 276.0 | True | | | |
| 1 | A. O. Smith | Industrials | United States | True | | | | | |
| 7 | ADM | Consumer Staples | United States | True | | | | | |
| 11 | AES Corp | Utilities | United States | True | | | | | |
| 526 | AMD | Technology | | | 31.0 | True | | | |
| 44 | APA Corporation | Energy | United States | True | | | | | |
| 519 | ARM | Technology | | | 22.0 | True | | | |
| 51 | AT&T | Communication Services | United States | True | | | | | |
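
The merge cell above combines an outer join with a null-coalescing step, so that S&P 500 values win and OSCI values only fill the gaps. A minimal sketch of that pattern on hypothetical rows (the company data here is made up for illustration):

```python
import pandas as pd

# Hypothetical slices of two sources that share a 'company' key
left = pd.DataFrame({
    "company": ["3M", "IBM"],
    "sector": ["Industrials", "Information Technology"],
    "in S&P 500": True,
})
right = pd.DataFrame({
    "company": ["IBM", "Red Hat"],
    "OSCI sector": ["Technology", "Technology"],
    "in OSCI": True,
})

# Outer join keeps companies that appear in either source
merged = left.merge(right, on="company", how="outer")

# Where 'sector' is null, take the value from 'OSCI sector'; otherwise keep it
merged["sector"] = merged["sector"].mask(pd.isnull, merged["OSCI sector"])
merged = merged.drop(columns=["OSCI sector"])

print(merged)
```

Rows that exist in only one source come out with nulls in the other source's flag column, which is why the later cleanup fills `in S&P 500`, `in OSCI`, and `in OSPO landscape` with `False`. Because the join is on exact company names, spelling differences between sources produce separate rows rather than matches.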


### Do some manual cleanup for known issues


In [644]:
# Google is split across two Alphabet share classes and a separate Google entry, so we'll merge them
googles = all_data.loc[all_data['company'].str.contains(
    'Alphabet|Google', case=False, regex=True)]
col_order = ['company', 'sector', 'country', 'in S&P 500', 'OSCI position', 'in OSCI', 'market cap',
             'OSPO status', 'in OSPO landscape']
google = googles.groupby('country', as_index=False).last()

# Update the data set and drop the extraneous entries
all_data.set_index('company', inplace=True)
all_data.update(google.set_index('company'))
all_data.reset_index(inplace=True)
all_data[all_data['company'] == 'Google']
all_data.drop(googles.iloc[:2].index, inplace=True)
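
The consolidation relies on `groupby(...).last()`, which by default returns the last non-null value of each column within a group, so several partial rows collapse into one complete row. A minimal sketch on hypothetical rows (the company labels and values are illustrative, not taken from the data files):

```python
import numpy as np
import pandas as pd

# Three partial rows describing the same company
rows = pd.DataFrame({
    "company": ["Alphabet (Class A)", "Alphabet (Class C)", "Google"],
    "country": ["United States"] * 3,
    "in S&P 500": [True, True, np.nan],
    "OSCI position": [np.nan, np.nan, 1.0],
})

# last() skips nulls per column, so each field keeps its last known value
combined = rows.groupby("country", as_index=False).last()

print(combined)
```

The grouping key matters here: rows with a null `country` are dropped by `groupby` by default, and any other company in the same country that matched the filter would be folded into the same row.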


In [645]:
all_data.sample(25)


| | company | sector | country | in S&P 500 | OSCI position | in OSCI | market cap | OSPO status | in OSPO landscape |
|---|---|---|---|---|---|---|---|---|---|
| 87 | CarMax | Consumer Discretionary | United States | True | | | | | |
| 47 | Aptiv | Consumer Discretionary | United States | True | | | | | |
| 601 | Truss Works | Technology | | | 116.0 | True | | | |
| 412 | Sherwin-Williams | Materials | United States | True | | | | | |
| 405 | Salesforce | Information Technology | United States | True | 30.0 | True | 174780300000.0 | Salesforce (Adopter) | True |
| 642 | ArangoDB | Technology | | | 158.0 | True | | | |
| 65 | Best Buy | Consumer Discretionary | United States | True | | | | | |
| 411 | ServiceNow | Information Technology | United States | True | | | | | |
| 619 | Kaltura | Technology | | | 135.0 | True | | | |
| 200 | Ford | Consumer Discretionary | United States | True | | | | | |


In [646]:
# Normalize the sectors across data sets

# Assign the result back rather than mutating the column in place, which is
# unreliable on a selected column; the trailing space in the Financials label is removed
all_data['sector'] = all_data['sector'].replace({
    'Information Technology': 'Technology',
    'Health Care': 'Healthcare & Pharma',
    'Financials': 'Banking, Insurance & Financial Services',
})

# Fill null values with default ones where needed
all_data = all_data.fillna(value={'in S&P 500': False, 'in OSPO landscape': False, 'in OSCI': False,
                                  'OSCI position': 'n/a', 'OSPO status': 'n/a', 'market cap': 'unknown', 'sector': 'unknown'})


### Get a sneak peek at the data


In [647]:
data_sample = all_data.sample(20)
data_sample.sort_values(by=['company'])


| | company | sector | country | in S&P 500 | OSCI position | in OSCI | market cap | OSPO status | in OSPO landscape |
|---|---|---|---|---|---|---|---|---|---|
| 19 | Align Technology | Healthcare & Pharma | United States | True | | False | unknown | | False |
| 21 | Alliant Energy | Utilities | United States | True | | False | unknown | | False |
| 533 | Automattic | Technology | | False | 38.0 | True | unknown | | False |
| 77 | Broadcom | Technology | United States | True | 70.0 | True | unknown | | False |
| 656 | Buro Happold | Professional Services | | False | 172.0 | True | unknown | | False |
| 88 | Carnival Corporation | Consumer Discretionary | United States | True | | False | unknown | | False |
| 105 | Chipotle Mexican Grill | Consumer Discretionary | United States | True | | False | unknown | | False |
| 106 | Chubb | Banking, Insurance & Financial Services | United States | True | | False | unknown | | False |
| 172 | Entergy | Utilities | United States | True | | False | unknown | | False |
| 177 | Essex Property Trust | Real Estate | United States | True | | False | unknown | | False |


## Inspect the data


In [648]:
sp_count = len(all_data[all_data['in S&P 500']])
ospo_count = len(all_data[all_data['in OSPO landscape']])
osci_count = len(all_data[all_data['in OSCI']])
sp_ospo_count = len(all_data.query('`in S&P 500` & `in OSPO landscape`'))
sp_osci_count = len(all_data.query('`in S&P 500` & `in OSCI`'))
ospo_osci = len(all_data.query('`in OSPO landscape` & `in OSCI`'))
intersection = len(all_data.query(
    '`in S&P 500` & `in OSPO landscape` & `in OSCI`'))

listings = pd.DataFrame(
    [[sp_count, ospo_count, osci_count, sp_ospo_count, sp_osci_count, ospo_osci, intersection]],
    index=['count'],
    columns=['in S&P 500', 'in OSPO landscape', 'in OSCI', 'in S&P and OSPO landscape',
             'in S&P and OSCI', 'in OSPO landscape and OSCI', 'in all three'])
listings


| | in S&P 500 | in OSPO landscape | in OSCI | in S&P and OSPO landscape | in S&P and OSCI | in OSPO landscape and OSCI | in all three |
|---|---|---|---|---|---|---|---|
| count | 504 | 102 | 299 | 23 | 23 | 36 | 12 |
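
The same overlap counts can also be derived directly from the three boolean flag columns, without a separate query per combination. The sketch below uses a hypothetical four-row frame, so its numbers are unrelated to the counts above.

```python
import pandas as pd

flags = ["in S&P 500", "in OSPO landscape", "in OSCI"]

# Hypothetical membership flags for four companies
toy = pd.DataFrame(
    [
        ["A", True, True, True],
        ["B", True, False, False],
        ["C", False, True, True],
        ["D", True, False, True],
    ],
    columns=["company"] + flags,
)

# Summing booleans counts the True values in each column
totals = toy[flags].sum()

# A row is in all three data sets when every flag is True
in_all_three = int(toy[flags].all(axis=1).sum())

# Size of every distinct combination of flags
combos = toy.groupby(flags).size()

print(totals)
print(in_all_three)
print(combos)
```

This only works once the flag columns hold real booleans, which is what the earlier `fillna` step with `False` defaults provides.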


### What companies are present in all three datasets?


In [649]:
all_data.query('`in S&P 500` & `in OSPO landscape` & `in OSCI`')


| | company | sector | country | in S&P 500 | OSCI position | in OSCI | market cap | OSPO status | in OSPO landscape |
|---|---|---|---|---|---|---|---|---|---|
| 8 | Adobe | Technology | United States | True | 17.0 | True | 188740124672.0 | Adobe (Adopter) | True |
| 45 | Apple | Technology | United States | True | 24.0 | True | 2608140386304.0 | Apple (Adopter) | True |
| 53 | Autodesk | Technology | United States | True | 98.0 | True | 41994772480.0 | Autodesk (Adopter) | True |
| 220 | Goldman Sachs | Banking, Insurance & Financial Services | United States | True | 248.0 | True | 107847974912.0 | Goldman Sachs (Adopter) | True |
| 228 | Hewlett Packard Enterprise | Technology | United States | True | 40.0 | True | 20440801280.0 | HPE (Adopter) | True |
| 240 | IBM | Technology | United States | True | 5.0 | True | 120321998848.0 | IBM (Adopter) | True |
| 248 | Intel | Technology | United States | True | 4.0 | True | 183187193856.0 | Intel (Adopter) | True |
| 314 | Microsoft | Technology | United States | True | 2.0 | True | 2087771176960.0 | Microsoft (Adopter) | True |
| 328 | Netflix | Communication Services | United States | True | 77.0 | True | 85567168512.0 | Netflix (Adopter) | True |
| 405 | Salesforce | Technology | United States | True | 30.0 | True | 174780301312.0 | Salesforce (Adopter) | True |


## Save processed data to file


In [650]:
all_data.to_csv('data_derived/merged_data.csv', index=False)


## Proceed to create visualizations

Head over to the [Visualizations notebook](Visualizations.ipynb) to generate some charts about the data.
