# FOCUSED Project: OSPO adoption

As part of the [FOCUSED Collaboration project](https://github.com/JumpsuitWizard/FOCUSED-Collaboration), this notebook examines Open Source Program Office (OSPO) adoption among companies in the [Standard and Poor's 500 index](https://en.wikipedia.org/wiki/S%26P_500).

## Authors

- **PI**: Duane O'Brien
- **Researcher**: julia ferraioli
- **Analyst**: Reshama Shaikh

## Research question

## Methodology

## Data sources

The following data sources are used in the analysis:

- [S&P 500](https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv) retrieved on 2021-10-05
- [OSCI Index](https://opensourceindex.io/) retrieved on 2022-02-28
- [OSPO Landscape](https://landscape.todogroup.org/) retrieved on 2022-05-09

## Data setup
### Load the data into Pandas

In [594]:
import csv
import json
import numpy as np
import pandas as pd
import plotly.express as px

data_dir = "data_raw/"

# Load S&P 500 dataset into a dataframe
df_sp = pd.read_csv(data_dir+'sp500.csv')


# Load Open Source Contributor Index (OSCI) dataset into a dataframe
# Use a context manager so the file handle is closed after reading
with open(data_dir + 'osci.json') as f:
    json_osci = json.load(f)
df_osci = pd.read_json(data_dir + 'osci.json')
df_osci = pd.json_normalize(df_osci['data'])
print(df_osci.shape)

# Load OSPO landscape dataset into a dataframe
df_ospo = pd.read_csv(data_dir + 'ospo_todo_landscape.csv')

(298, 12)
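The OSCI load step relies on `pd.json_normalize` to turn a list of nested records into one row per company. The following is a minimal sketch of that behaviour on a toy payload; the payload shape and the company names `ExampleCo` and `SampleInc` are assumptions made for illustration, not the real OSCI file.

```python
import pandas as pd

# Toy payload that mimics an assumed {"data": [...]} layout (illustrative only)
payload = {
    "data": [
        {"company": "ExampleCo", "position": 1,
         "languages": [{"name": "Python", "amount": 3}]},
        {"company": "SampleInc", "position": 2, "languages": []},
    ]
}

# One row per record; list-valued fields stay as lists in a single column
flat = pd.json_normalize(payload["data"])
print(flat.shape)
print(list(flat.columns))
```

List-valued fields such as `languages` are kept as a single column of lists, which matches the previews of `df_osci` where `yoy`, `contributors`, `languages`, and `licenses` hold lists of dictionaries.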


### Preview the data

In [595]:
df_sp.sample(5)

Unnamed: 0,Symbol,Name,Sector
425,SNPS,Synopsys,Information Technology
59,BLL,Ball Corp,Materials
500,YUM,Yum! Brands,Consumer Discretionary
222,HBI,Hanesbrands,Consumer Discretionary
491,WRK,WestRock,Materials


In [596]:
df_osci.sample(5)

Unnamed: 0,positionChange,company,activeContributors,activeContributorsChange,totalCommunity,totalCommunityChange,position,yoy,contributors,languages,licenses,industry
285,7.0,Pilz,0.0,0.0,4,0.0,286,"[{'date': '2020-04-17', 'active': 8.0}, {'date...","[{'Contributor': 'jschleicher', 'Commits': 5},...","[{'name': 'HTML', 'amount': 1}, {'name': 'C++'...","[{'name': 'bsd-3-clause', 'amount': 4}, {'name...",Technology
9,0.0,SAP,623.0,51.0,1307,83.0,10,"[{'date': '2020-04-17', 'active': 431.0}, {'da...","[{'Contributor': 'Rafael Franzke', 'Commits': ...","[{'name': 'Groovy', 'amount': 11}, {'name': 'S...","[{'name': 'apache-2.0', 'amount': 34358}, {'na...",Technology
76,2.0,Netflix,70.0,2.0,203,22.0,77,"[{'date': '2020-04-17', 'active': 60.0}, {'dat...","[{'Contributor': 'Edgar Lee', 'Commits': 285},...","[{'name': 'Jupyter Notebook', 'amount': 65}, {...","[{'name': 'apache-2.0', 'amount': 2934}, {'nam...",Media & Telecoms
56,-2.0,ThoughtWorks,91.0,12.0,291,25.0,57,"[{'date': '2020-04-17', 'active': 182.0}, {'da...","[{'Contributor': 'Chad Wilson', 'Commits': 151...","[{'name': 'Rust', 'amount': 15}, {'name': 'Obj...","[{'name': 'mit', 'amount': 2253}, {'name': 'ap...",Technology
203,-9.0,Tensor,10.0,2.0,31,6.0,204,"[{'date': '2020-04-17', 'active': 56.0}, {'dat...","[{'Contributor': 'va.ten', 'Commits': 74}, {'C...","[{'name': 'Java', 'amount': 4}, {'name': 'SCSS...","[{'name': 'mit', 'amount': 332}, {'name': 'oth...",Technology


In [597]:
df_ospo.sample(5)

Unnamed: 0,Name,Organization,Homepage,Logo,Twitter,Crunchbase URL,Market Cap,Ticker,Funding,Member,...,Github Stars,Github Description,Github Latest Commit Date,Github Latest Commit Link,Github Release Date,Github Release Link,Github Start Commit Date,Github Start Commit Link,Github Contributors Count,Github Contributors Link
149,Netapp (Adopter),NetApp,https://netapp.com,https://landscape.todogroup.org/logos/netapp-a...,https://twitter.com/netapp,https://www.crunchbase.com/organization/netapp,16616760000.0,NTAP,0.0,General,...,,,,,,,,,,
197,Debricked License Compliance,Debricked,https://www.debricked.com/tools/license-compli...,https://landscape.todogroup.org/logos/debricke...,https://twitter.com/debrickedab,https://www.crunchbase.com/organization/debricked,,,4552074.0,false,...,,,,,,,,,,
45,JiHu(GitLab) (Member),GitLab,https://about.gitlab.cn,https://landscape.todogroup.org/logos/ji-hu-gi...,,https://www.crunchbase.com/organization/gitlab...,,,0.0,General,...,,,,,,,,,,
66,Sauce Labs (Member),Sauce Labs,https://github.com/saucelabs,https://landscape.todogroup.org/logos/sauce-la...,https://twitter.com/saucelabs,https://www.crunchbase.com/organization/sauce-...,,,151000000.0,General,...,,,,,,,,,,
109,Chef (Adopter),Chef Software,https://chef.github.io,https://landscape.todogroup.org/logos/chef-ado...,https://twitter.com/chef,https://www.crunchbase.com/organization/chef,,,105000000.0,false,...,,,,,,,,,,


### Clean up the data

In [598]:
# Rename columns in S&P 500 and add a few fields for comparison purposes
df_sp = df_sp.rename(columns = {'Name': 'company', 'Sector': 'sector'})

df_sp['in S&P 500'] = True
df_sp['country'] = "United States"

# Reorder the columns for clarity's sake
order = ['company', 'sector', 'country', 'in S&P 500']

df_sp = df_sp.reindex(order, axis=1)

# Filter out columns from OSCI that we don't need
keep_cols = ['company', 'position', 'industry']
df_osci = df_osci.filter(keep_cols)
print(df_osci.head())

# Rename columns in OSCI and add a field for comparison purposes
df_osci = df_osci.rename(columns = {'position': 'OSCI position', 'industry': 'OSCI sector'})
df_osci['in OSCI'] = True

# Reorder the columns for clarity's sake
order = ['company', 'OSCI sector', 'OSCI position', 'in OSCI']

df_osci = df_osci.reindex(order, axis=1)

# Rename columns in OSPO Landscape and add a field for comparison purposes
df_ospo = df_ospo.rename(columns =
                         {'Name': 'OSPO status',
                          'Organization': 'company',
                          'Market Cap': 'market cap',
                          'Crunchbase Country': 'OSPO country',
                          'License': 'license'})

df_ospo['in OSPO landscape'] = True

keep_cols = ['OSPO status','company', 'market cap', 'OSPO country', 'in OSPO landscape']
df_ospo = df_ospo.filter(keep_cols)

# Filter out those who have not adopted an OSPO
df_ospo = df_ospo.loc[df_ospo['OSPO status'].str.contains("adopter", case = False)]

# Reorder the columns for clarity's sake
order = ['company', 'OSPO country', 'market cap', 'OSPO status', 'in OSPO landscape']


df_ospo = df_ospo.reindex(order, axis=1)

     company  position    industry
0     Google         1  Technology
1  Microsoft         2  Technology
2    Red Hat         3  Technology
3      Intel         4  Technology
4        IBM         5  Technology
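The adopter filter in the cell above keeps only rows whose `OSPO status` contains the word "adopter", ignoring case. A small sketch of that filter in isolation, using status strings that appear in the raw OSPO landscape preview; the variable names are illustrative.

```python
import pandas as pd

# Status strings taken from the raw OSPO landscape preview
status = pd.Series([
    "Netapp (Adopter)",
    "Sauce Labs (Member)",
    "Debricked License Compliance",
    "Chef (Adopter)",
])

# Case-insensitive substring match, as in the clean-up cell
adopters = status[status.str.contains("adopter", case=False)]
print(adopters.tolist())
```

Entries tagged as members or tools are dropped, so only organizations listed as adopters remain.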


### Preview the cleaned data

In [599]:
df_sp.sample(5)

Unnamed: 0,company,sector,country,in S&P 500
313,Micron Technology,Information Technology,United States,True
36,AmerisourceBergen,Health Care,United States,True
395,Regions Financial Corporation,Financials,United States,True
38,Amgen,Health Care,United States,True
330,Newmont,Materials,United States,True


In [600]:
df_osci.sample(5)

Unnamed: 0,company,OSCI sector,OSCI position,in OSCI
110,Mendix,Technology,111,True
106,Okta,Technology,107,True
95,MuleSoft,Technology,96,True
59,Couchbase,Technology,60,True
157,ArangoDB,Technology,158,True


In [601]:
df_ospo.sample(5)

Unnamed: 0,company,OSPO country,market cap,OSPO status,in OSPO landscape
118,Dropbox,United States,8088215000.0,Dropbox (Adopter),True
104,Bosch,Germany,,Bosch (Adopter),True
172,Stripe,United States,,Stripe (Adopter),True
107,Capital One,United States,51953790000.0,Capital One (Adopter),True
132,Google,United States,1524992000000.0,Google (Adopter),True


### Merge the data sources

In [602]:
# First, merge S&P Index and OSCI, then merge with OSPO Landscape
all_data = (df_sp
            .merge(df_osci, on = 'company', how = 'outer')
            .merge(df_ospo, on = 'company', how = 'outer'))

# Prefer S&P data over OSCI and OSPO Landscape data
all_data['sector'] = all_data['sector'].mask(pd.isnull, all_data['OSCI sector'])
all_data['country'] = all_data['country'].mask(pd.isnull, all_data['OSPO country'])
all_data = all_data.drop(['OSCI sector', 'OSPO country'], axis = 1)

all_data.sort_values(by = ['company']).head(10)


Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
0,3M,Industrials,United States,True,,,,,
768,4teamwork,Technology,,,285.0,True,,,
759,5minds,Technology,,,276.0,True,,,
1,A. O. Smith,Industrials,United States,True,,,,,
7,ADM,Consumer Staples,United States,True,,,,,
11,AES Corp,Utilities,United States,True,,,,,
526,AMD,Technology,,,31.0,True,,,
44,APA Corporation,Energy,United States,True,,,,,
519,ARM,Technology,,,22.0,True,,,
51,AT&T,Communication Services,United States,True,,,,,
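Because the merge joins on exact company names, a company spelled differently in two sources ends up as two separate rows. One way to see how many rows matched is the `indicator` parameter of `DataFrame.merge`. The sketch below uses toy frames standing in for `df_sp` and `df_ospo`; the rows are illustrative, not the real data.

```python
import pandas as pd

# Toy stand-ins for df_sp and df_ospo (illustrative rows only)
left = pd.DataFrame({"company": ["Adobe", "Walmart", "3M"], "in S&P 500": True})
right = pd.DataFrame({"company": ["Adobe", "Walmart Labs"], "in OSPO landscape": True})

# indicator=True adds a `_merge` column: left_only, right_only, or both
merged = left.merge(right, on="company", how="outer", indicator=True)
match_counts = merged["_merge"].value_counts().to_dict()
print(match_counts)
```

Here "Walmart" and "Walmart Labs" do not match, so each is counted on its own side. Rows flagged `left_only` or `right_only` are candidates for the kind of manual name reconciliation done in the next section.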


### Do some manual cleanup for known issues

In [603]:
# Google is split across two Alphabet share classes in the S&P 500 and a separate Google entry, so we'll merge them
googles = all_data.loc[all_data['company'].str.contains('Alphabet|Google', case = False, regex = True)]
col_order = ['company', 'sector', 'country', 'in S&P 500', 'OSCI position', 'in OSCI', 'market cap',
             'OSPO status', 'in OSPO landscape']
google = googles.groupby('country', as_index = False).last()

# Update the data set and drop the extraneous entries
all_data.set_index('company', inplace = True)
all_data.update(google.set_index('company'))
all_data.reset_index(inplace = True)
all_data[all_data['company'] == 'Google']
all_data.drop(googles.iloc[:2].index)

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
0,3M,Industrials,United States,True,,,,,
1,A. O. Smith,Industrials,United States,True,,,,,
2,Abbott Laboratories,Health Care,United States,True,,,,,
3,AbbVie,Health Care,United States,True,,,,,
4,Abiomed,Health Care,United States,True,,,,,
...,...,...,...,...,...,...,...,...,...
832,"University of California, Santa Cruz",,United States,,,,,University of Santa Cruz (Adopter),True
833,US Bank,,United States,,,,7.334153e+10,US Bank (Adopter),True
834,Walmart Labs,,United States,,,,4.214506e+11,WalmartLabs (Adopter),True
835,Wayfair,,United States,,,,9.014573e+09,Wayfair (Adopter),True
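Note that `DataFrame.drop` returns a new DataFrame rather than modifying the original, so the final line of the cell above displays the result without changing `all_data`. If the intent is to remove the Alphabet rows permanently, the result would need to be assigned back. A minimal sketch of the difference on toy data; the company labels are illustrative.

```python
import pandas as pd

# Toy frame with two share-class rows and one consolidated row (illustrative labels)
toy = pd.DataFrame({"company": ["Alphabet (Class A)", "Alphabet (Class C)", "Google", "IBM"]})
dupes = toy[toy["company"].str.contains("Alphabet", case=False)]

toy.drop(dupes.index)        # returns a new frame; `toy` itself is unchanged
rows_before = len(toy)

toy = toy.drop(dupes.index)  # reassigning persists the removal
rows_after = len(toy)

print(rows_before, rows_after)
```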


In [604]:
all_data.sample(25)

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
499,Xylem,Industrials,United States,True,,,,,
570,Instructure,Technology,,,81.0,True,,,
460,Union Pacific,Industrials,United States,True,,,,,
132,Crown Castle,Real Estate,United States,True,,,,,
298,Marriott International,Consumer Discretionary,United States,True,,,,,
392,Realty Income Corporation,Real Estate,United States,True,,,,,
399,Rockwell Automation,Industrials,United States,True,,,,,
136,D. R. Horton,Consumer Discretionary,United States,True,,,,,
30,American Electric Power,Utilities,United States,True,,,,,
700,Kakao,Media & Telecoms,,,216.0,True,,,


In [605]:
# Normalize the sectors across data sets

all_data['sector'] = all_data['sector'].replace({
    'Information Technology': 'Technology',
    'Health Care': 'Healthcare & Pharma',
    'Financials': 'Banking, Insurance & Financial Services',
})

# Fill null values with default ones where needed
all_data = all_data.fillna(value={'in S&P 500': False, 'in OSPO landscape': False, 'in OSCI': False,
                                  'OSCI position': 'n/a', 'OSPO status': 'n/a', 'market cap': 'unknown'})

### Get a sneak peek at the data

In [606]:
data_sample = all_data.sample(20)
data_sample.sort_values(by = ['company'])

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
613,Arbisoft,Technology,,False,129.0,True,unknown,,False
57,Avery Dennison,Materials,United States,True,,False,unknown,,False
571,Bloomberg,"Banking, Insurance & Financial Services",United States,False,82.0,True,unknown,Bloomberg (Adopter),True
607,Brave,Technology,,False,123.0,True,unknown,,False
113,Citizens Financial Group,"Banking, Insurance & Financial Services",United States,True,,False,unknown,,False
126,Constellation Brands,Consumer Staples,United States,True,,False,unknown,,False
167,Edwards Lifesciences,Healthcare & Pharma,United States,True,,False,unknown,,False
195,First Republic Bank,"Banking, Insurance & Financial Services",United States,True,,False,unknown,,False
505,Google,Technology,United States,True,1.0,True,1524992311296.0,Google (Adopter),True
325,MSCI,"Banking, Insurance & Financial Services",United States,True,,False,unknown,,False


## Inspect the data

In [607]:
sp_count = len(all_data[all_data['in S&P 500']])
ospo_count = len(all_data[all_data['in OSPO landscape']])
osci_count = len(all_data[all_data['in OSCI']])
sp_ospo_count = len(all_data.query('`in S&P 500` & `in OSPO landscape`'))
sp_osci_count = len(all_data.query('`in S&P 500` & `in OSCI`'))
ospo_osci = len(all_data.query('`in OSPO landscape` & `in OSCI`'))
intersection = len(all_data.query('`in S&P 500` & `in OSPO landscape` & `in OSCI`'))

listings = pd.DataFrame([[sp_count, ospo_count, osci_count, sp_ospo_count, sp_osci_count, ospo_osci, intersection]],
                        index = ['count'],
                        columns = ['in S&P 500', 'in OSPO landscape', 'in OSCI', 'in S&P and OSPO landscape',
                                   'in S&P and OSCI', 'in OSPO landscape and OSCI', 'in all three'])
listings

Unnamed: 0,in S&P 500,in OSPO landscape,in OSCI,in S&P and OSPO landscape,in S&P and OSCI,in OSPO landscape and OSCI,in all three
count,506,102,299,23,23,36,12
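The counts in the table above can be turned into shares to make the overlap easier to read. A minimal sketch, with the counts copied by hand from the table and dictionary keys chosen for illustration. Because the merge relies on exact company-name matches, these shares are probably best read as lower bounds.

```python
# Counts copied from the table above; key names are illustrative
counts = {"sp500": 506, "sp_and_ospo": 23, "sp_and_osci": 23, "all_three": 12}

# Share of S&P 500 entries that also appear as OSPO adopters
ospo_share = counts["sp_and_ospo"] / counts["sp500"]

# Of the S&P 500 entries with an OSPO, the share that also rank in OSCI
osci_given_ospo = counts["all_three"] / counts["sp_and_ospo"]

print(f"{ospo_share:.1%}")       # 4.5%
print(f"{osci_given_ospo:.1%}")  # 52.2%
```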


### What companies are present in all three datasets?

In [608]:
all_data.query('`in S&P 500` & `in OSPO landscape` & `in OSCI`')

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
8,Adobe,Technology,United States,True,17.0,True,188740124672.0,Adobe (Adopter),True
45,Apple,Technology,United States,True,24.0,True,2608140386304.0,Apple (Adopter),True
53,Autodesk,Technology,United States,True,98.0,True,41994772480.0,Autodesk (Adopter),True
220,Goldman Sachs,"Banking, Insurance & Financial Services",United States,True,248.0,True,107847974912.0,Goldman Sachs (Adopter),True
228,Hewlett Packard Enterprise,Technology,United States,True,40.0,True,20440801280.0,HPE (Adopter),True
240,IBM,Technology,United States,True,5.0,True,120321998848.0,IBM (Adopter),True
248,Intel,Technology,United States,True,4.0,True,183187193856.0,Intel (Adopter),True
314,Microsoft,Technology,United States,True,2.0,True,2087771176960.0,Microsoft (Adopter),True
328,Netflix,Communication Services,United States,True,77.0,True,85567168512.0,Netflix (Adopter),True
405,Salesforce,Technology,United States,True,30.0,True,174780301312.0,Salesforce (Adopter),True
