# FOCUSED Project: OSPO adoption

As part of the [FOCUSED Collaboration project](https://github.com/JumpsuitWizard/FOCUSED-Collaboration), this notebook examines the adoption of Open Source Program Offices (OSPOs) among companies in the [Standard and Poor's 500 index](https://en.wikipedia.org/wiki/S%26P_500).

## Authors

- **PI**: Duane O'Brien
- **Researcher**: julia ferraioli
- **Analyst**: Reshama Shaikh

## Research question

## Methodology

## Data sources

The following data sources are used in the analysis:

- [S&P 500](https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv) retrieved on 2021-10-05
- [Open Source Contributor Index (OSCI)](https://opensourceindex.io/) retrieved on 2022-02-28
- [OSPO Landscape](https://landscape.todogroup.org/) retrieved on 2022-05-09

## Data setup
### Load the data into Pandas

In [635]:
import csv
import json
import numpy as np
import pandas as pd
import plotly.express as px

data_dir = "data_raw/"

# Load S&P 500 dataset into a dataframe
df_sp = pd.read_csv(data_dir+'sp500.csv')


# Load Open Source Contributor Index (OSCI) dataset into a dataframe
# The records live under the top-level 'data' key; read the file once and close it afterwards
with open(data_dir + 'osci.json') as f:
    json_osci = json.load(f)
df_osci = pd.json_normalize(json_osci['data'])
print(df_osci.shape)

# Load OSPO landscape dataset into a dataframe
df_ospo = pd.read_csv(data_dir + 'ospo_todo_landscape.csv')

(298, 12)
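To make the OSCI loading step easier to follow, here is a minimal sketch of how `pd.json_normalize` turns a list of records stored under a `data` key into one row per record. The payload is a made-up two-record stand-in (the company names and values are hypothetical), not the real OSCI file.

```python
import pandas as pd

# Hypothetical payload shaped like {"data": [...]}; all values are invented
payload = {
    "data": [
        {"company": "Example A", "position": 1, "industry": "Technology",
         "languages": [{"name": "Python", "amount": 3}]},
        {"company": "Example B", "position": 2, "industry": "Technology",
         "languages": []},
    ]
}

# Each dict becomes a row; list-valued fields such as 'languages' stay as lists
df_demo = pd.json_normalize(payload["data"])
print(df_demo.shape)
print(list(df_demo.columns))
```

List-valued fields are left untouched, which is why columns such as `languages` and `licenses` show up as lists of dicts in the preview.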


### Preview the data

In [636]:
df_sp.sample(5)

Unnamed: 0,Symbol,Name,Sector
50,AIZ,Assurant,Financials
290,L,Loews Corporation,Financials
185,EXPD,Expeditors,Industrials
288,LKQ,LKQ Corporation,Consumer Discretionary
299,MMC,Marsh & McLennan,Financials


In [637]:
df_osci.sample(5)

Unnamed: 0,positionChange,company,activeContributors,activeContributorsChange,totalCommunity,totalCommunityChange,position,yoy,contributors,languages,licenses,industry
1,0.0,Microsoft,3630.0,308.0,8128,545.0,2,"[{'date': '2020-04-17', 'active': 2562.0}, {'d...","[{'Contributor': 'SDKAuto', 'Commits': 48812},...","[{'name': 'Lean', 'amount': 1040}, {'name': 'C...","[{'name': 'mit', 'amount': 227121}, {'name': '...",Technology
193,4.0,datavisyn,11.0,0.0,13,0.0,194,"[{'date': '2020-04-17', 'active': []}, {'date'...","[{'Contributor': 'dvzacharycutler', 'Commits':...","[{'name': 'Python', 'amount': 42}, {'name': 'J...","[{'name': 'bsd-3-clause', 'amount': 702}, {'na...",Healthcare & Pharma
236,6.0,Galera Cluster,5.0,0.0,9,1.0,237,"[{'date': '2020-04-17', 'active': 4.0}, {'date...","[{'Contributor': 'Teemu Ollakka', 'Commits': 8...","[{'name': 'C', 'amount': 25}, {'name': 'C++', ...","[{'name': 'other', 'amount': 135}, {'name': 'g...",Technology
182,2.0,Azavea,14.0,1.0,23,0.0,183,"[{'date': '2020-04-17', 'active': 20.0}, {'dat...","[{'Contributor': 'Michael Maurizi', 'Commits':...","[{'name': 'JavaScript', 'amount': 492}, {'name...","[{'name': 'apache-2.0', 'amount': 768}, {'name...",Technology
189,-5.0,Travis CI,13.0,2.0,20,2.0,190,"[{'date': '2020-04-17', 'active': 58.0}, {'dat...",[{'Contributor': 'Deployment Bot (from Travis ...,"[{'name': 'Procfile', 'amount': 1}, {'name': '...","[{'name': 'mit', 'amount': 4232}, {'name': 'ap...",Technology


In [638]:
df_ospo.sample(5)

Unnamed: 0,Name,Organization,Homepage,Logo,Twitter,Crunchbase URL,Market Cap,Ticker,Funding,Member,...,Github Stars,Github Description,Github Latest Commit Date,Github Latest Commit Link,Github Release Date,Github Release Link,Github Start Commit Date,Github Start Commit Link,Github Contributors Count,Github Contributors Link
200,FOSSA License Compliance,FOSSA,https://fossa.com/product/open-source-license-...,https://landscape.todogroup.org/logos/fossa-li...,https://twitter.com/getfossa,https://www.crunchbase.com/organization/fossa-2,,,33900000.0,false,...,,,,,,,,,,
65,SAP (Member),SAP,https://github.com/sap,https://landscape.todogroup.org/logos/sap-memb...,https://twitter.com/sapopensource,https://www.crunchbase.com/organization/sap,116731200000.0,SAP,1301371000.0,General,...,,,,,,,,,,
0,Adobe (Member),Adobe,https://www.adobe.com/,https://landscape.todogroup.org/logos/adobe-me...,https://twitter.com/Adobe,https://www.crunchbase.com/organization/adobe,188740100000.0,ADBE,2500000.0,General,...,,,,,,,,,,
198,Debricked Security,Debricked,https://debricked.com/tools/security,https://landscape.todogroup.org/logos/debricke...,https://twitter.com/debrickedab,https://www.crunchbase.com/organization/debricked,,,4552074.0,false,...,,,,,,,,,,
27,eBay (Member),eBay,https://ebay.github.io,https://landscape.todogroup.org/logos/e-bay-me...,https://twitter.com/eBay,https://www.crunchbase.com/organization/ebay,30270380000.0,EBAY,6700000.0,General,...,,,,,,,,,,


### Clean up the data

In [639]:
# Rename columns in S&P 500 and add a few fields for comparison purposes
df_sp = df_sp.rename(columns = {'Name': 'company', 'Sector': 'sector'})

df_sp['in S&P 500'] = True
df_sp['country'] = "United States"

# Reorder the columns for clarity's sake
order = ['company', 'sector', 'country', 'in S&P 500']

df_sp = df_sp.reindex(order, axis=1)

# Filter out columns from OSCI that we don't need
keep_cols = ['company', 'position', 'industry']
df_osci = df_osci.filter(keep_cols)
print(df_osci.head())

# Rename columns in OSCI and add a field for comparison purposes
df_osci = df_osci.rename(columns = {'position': 'OSCI position', 'industry': 'OSCI sector'})
df_osci['in OSCI'] = True

# Reorder the columns for clarity's sake
order = ['company', 'OSCI sector', 'OSCI position', 'in OSCI']

df_osci = df_osci.reindex(order, axis=1)

# Rename columns in OSPO Landscape and add a field for comparison purposes
df_ospo = df_ospo.rename(columns =
                         {'Name': 'OSPO status',
                          'Organization': 'company',
                          'Market Cap': 'market cap',
                          'Crunchbase Country': 'OSPO country',
                          'License': 'license'})

df_ospo['in OSPO landscape'] = True

keep_cols = ['OSPO status','company', 'market cap', 'OSPO country', 'in OSPO landscape']
df_ospo = df_ospo.filter(keep_cols)

# Filter out those who have not adopted an OSPO
df_ospo = df_ospo.loc[df_ospo['OSPO status'].str.contains("adopter", case = False)]

# Reorder the columns for clarity's sake
order = ['company', 'OSPO country', 'market cap', 'OSPO status', 'in OSPO landscape']


df_ospo = df_ospo.reindex(order, axis=1)

     company  position    industry
0     Google         1  Technology
1  Microsoft         2  Technology
2    Red Hat         3  Technology
3      Intel         4  Technology
4        IBM         5  Technology
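The adopter filter relies on the label text in the `OSPO status` column. Below is a small sketch of that filter on invented rows (the names are hypothetical). It also passes `na=False`, which is an addition to what the notebook does, so that a missing label counts as a non-match instead of putting a null into the boolean mask.

```python
import pandas as pd

# Invented rows; only the "(Adopter)" / "(Member)" suffix pattern mirrors the landscape data
demo = pd.DataFrame({
    "OSPO status": ["Acme (Adopter)", "Acme (Member)", "Widget Compliance Tool", None],
    "company": ["Acme", "Acme", "Widget", "Nameless"],
})

# Case-insensitive substring match; na=False maps missing labels to False
is_adopter = demo["OSPO status"].str.contains("adopter", case=False, na=False)
adopters = demo.loc[is_adopter]
print(adopters)
```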


### Preview the cleaned data

In [640]:
df_sp.sample(5)

Unnamed: 0,company,sector,country,in S&P 500
420,State Street Corporation,Financials,United States,True
276,Kroger,Consumer Staples,United States,True
477,Vulcan Materials,Materials,United States,True
154,Domino's Pizza,Consumer Discretionary,United States,True
249,Intercontinental Exchange,Financials,United States,True


In [641]:
df_osci.sample(5)

Unnamed: 0,company,OSCI sector,OSCI position,in OSCI
168,Pantheon,Technology,169,True
200,Linutronix,Technology,201,True
62,Iohk,Technology,63,True
203,Tensor,Technology,204,True
60,Atlassian,Technology,61,True


In [642]:
df_ospo.sample(5)

Unnamed: 0,company,OSPO country,market cap,OSPO status,in OSPO landscape
90,Aiven,Finland,,Aiven (Adopter),True
123,Equinix,United States,65114260000.0,Equinix (Adopter),True
147,Morgan Stanley,United States,150626000000.0,Morgan Stanley (Adopter),True
182,Verizon Media,United States,199440900000.0,Verizon Media (Adopter),True
148,National Instruments,United States,4715077000.0,National Instruments (Adopter),True


### Merge the data sources

In [643]:
# First, merge S&P Index and OSCI, then merge with OSPO Landscape
all_data = (
    df_sp.merge(df_osci, on = 'company', how = 'outer')
         .merge(df_ospo, on = 'company', how = 'outer')
)

# Prefer S&P data over OSCI and OSPO Landscape data
all_data['sector'] = all_data['sector'].mask(pd.isnull, all_data['OSCI sector'])
all_data['country'] = all_data['country'].mask(pd.isnull, all_data['OSPO country'])
all_data = all_data.drop(['OSCI sector', 'OSPO country'], axis = 1)

all_data.sort_values(by = ['company']).head(10)


Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
0,3M,Industrials,United States,True,,,,,
768,4teamwork,Technology,,,285.0,True,,,
759,5minds,Technology,,,276.0,True,,,
1,A. O. Smith,Industrials,United States,True,,,,,
7,ADM,Consumer Staples,United States,True,,,,,
11,AES Corp,Utilities,United States,True,,,,,
526,AMD,Technology,,,31.0,True,,,
44,APA Corporation,Energy,United States,True,,,,,
519,ARM,Technology,,,22.0,True,,,
51,AT&T,Communication Services,United States,True,,,,,
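Because the merge joins on the exact `company` string, the same company spelled differently in two sources ends up as two separate rows. One way to spot such rows, sketched here on toy data, is the `indicator` option of `DataFrame.merge`. The frames and the "Alphabet Inc. (Class A)" spelling are illustrative assumptions, not values taken from the datasets.

```python
import pandas as pd

# Toy stand-ins for two of the sources; names are illustrative only
left = pd.DataFrame({"company": ["Apple", "3M", "Alphabet Inc. (Class A)"], "in S&P 500": True})
right = pd.DataFrame({"company": ["Apple", "Google", "AMD"], "in OSCI": True})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = left.merge(right, on="company", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Rows that matched in only one source are candidates for manual name cleanup
unmatched = merged.loc[merged["_merge"] != "both", "company"]
print(sorted(unmatched))
```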


### Do some manual cleanup for known issues

In [644]:
# Google appears as two Alphabet share classes plus a separate Google entry, so we'll merge them
googles = all_data.loc[all_data['company'].str.contains('Alphabet|Google', case = False, regex = True)]
# Collapse the matching rows per country, keeping the last non-null value in each column
google = googles.groupby('country', as_index = False).last()

# Update the data set and drop the extraneous entries
all_data.set_index('company', inplace = True)
all_data.update(google.set_index('company'))
all_data.reset_index(inplace = True)
# Drop the first two matches (assumed to be the two Alphabet rows)
all_data.drop(googles.iloc[:2].index, inplace = True)
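As an alternative to patching rows after the merge, names could be normalized before merging. The sketch below is not what this notebook does; it uses a hypothetical alias map and toy rows (the "Alphabet Inc. (Class A/C)" spellings are assumptions) to show the idea.

```python
import pandas as pd

# Hypothetical alias map; the keys are assumed spellings, not taken from the dataset
aliases = {
    "Alphabet Inc. (Class A)": "Google",
    "Alphabet Inc. (Class C)": "Google",
}

sp = pd.DataFrame({
    "company": ["Alphabet Inc. (Class A)", "Alphabet Inc. (Class C)", "3M"],
    "sector": ["Communication Services", "Communication Services", "Industrials"],
})

# Map known aliases to one canonical name, then keep a single row per company
sp["company"] = sp["company"].replace(aliases)
sp = sp.drop_duplicates(subset="company").reset_index(drop=True)
print(sp)
```

Normalizing first keeps the merge itself simple, at the cost of maintaining the alias map by hand.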

In [645]:
all_data.sample(25)

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
87,CarMax,Consumer Discretionary,United States,True,,,,,
47,Aptiv,Consumer Discretionary,United States,True,,,,,
601,Truss Works,Technology,,,116.0,True,,,
412,Sherwin-Williams,Materials,United States,True,,,,,
405,Salesforce,Information Technology,United States,True,30.0,True,174780300000.0,Salesforce (Adopter),True
642,ArangoDB,Technology,,,158.0,True,,,
65,Best Buy,Consumer Discretionary,United States,True,,,,,
411,ServiceNow,Information Technology,United States,True,,,,,
619,Kaltura,Technology,,,135.0,True,,,
200,Ford,Consumer Discretionary,United States,True,,,,,


In [646]:
# Normalize the sectors across data sets

# Map S&P sector names onto the OSCI names; assign the result back rather than mutating the column in place
all_data['sector'] = all_data['sector'].replace({
    'Information Technology': 'Technology',
    'Health Care': 'Healthcare & Pharma',
    'Financials': 'Banking, Insurance & Financial Services',
})

# Fill null values with default ones where needed
all_data = all_data.fillna(value={'in S&P 500': False, 'in OSPO landscape': False, 'in OSCI': False,
                                  'OSCI position': 'n/a', 'OSPO status': 'n/a', 'market cap': 'unknown', 'sector': 'unknown'})

### Get a sneak peek at the data

In [647]:
data_sample = all_data.sample(20)
data_sample.sort_values(by = ['company'])

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
19,Align Technology,Healthcare & Pharma,United States,True,,False,unknown,,False
21,Alliant Energy,Utilities,United States,True,,False,unknown,,False
533,Automattic,Technology,,False,38.0,True,unknown,,False
77,Broadcom,Technology,United States,True,70.0,True,unknown,,False
656,Buro Happold,Professional Services,,False,172.0,True,unknown,,False
88,Carnival Corporation,Consumer Discretionary,United States,True,,False,unknown,,False
105,Chipotle Mexican Grill,Consumer Discretionary,United States,True,,False,unknown,,False
106,Chubb,"Banking, Insurance & Financial Services",United States,True,,False,unknown,,False
172,Entergy,Utilities,United States,True,,False,unknown,,False
177,Essex Property Trust,Real Estate,United States,True,,False,unknown,,False


## Inspect the data

In [648]:
sp_count = len(all_data[all_data['in S&P 500']])
ospo_count = len(all_data[all_data['in OSPO landscape']])
osci_count = len(all_data[all_data['in OSCI']])
sp_ospo_count = len(all_data.query('`in S&P 500` & `in OSPO landscape`'))
sp_osci_count = len(all_data.query('`in S&P 500` & `in OSCI`'))
ospo_osci = len(all_data.query('`in OSPO landscape` & `in OSCI`'))
intersection = len(all_data.query('`in S&P 500` & `in OSPO landscape` & `in OSCI`'))

listings = pd.DataFrame([[sp_count, ospo_count, osci_count, sp_ospo_count, sp_osci_count, ospo_osci, intersection]],
                        index = ['count'],
                        columns = ['in S&P 500', 'in OSPO landscape', 'in OSCI', 'in S&P and OSPO landscape',
                                   'in S&P and OSCI', 'in OSPO landscape and OSCI', 'in all three'])
listings

Unnamed: 0,in S&P 500,in OSPO landscape,in OSCI,in S&P and OSPO landscape,in S&P and OSCI,in OSPO landscape and OSCI,in all three
count,504,102,299,23,23,36,12
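The counts above are inclusive: a company in all three datasets is also counted in each pairwise overlap. For a complementary view, the following sketch counts each exact combination of flags once, using `groupby` on toy data. The four rows are invented and the flag column names are simply reused from this notebook.

```python
import pandas as pd

# Invented membership flags for four hypothetical companies
flags = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "in S&P 500": [True, True, False, False],
    "in OSPO landscape": [True, False, True, False],
    "in OSCI": [True, True, True, False],
})

flag_cols = ["in S&P 500", "in OSPO landscape", "in OSCI"]

# One row per distinct combination of flags, with the number of companies in it
combo_counts = flags.groupby(flag_cols).size().reset_index(name="count")
print(combo_counts)
```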


### Which companies are present in all three datasets?

In [649]:
all_data.query('`in S&P 500` & `in OSPO landscape` & `in OSCI`')

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
8,Adobe,Technology,United States,True,17.0,True,188740124672.0,Adobe (Adopter),True
45,Apple,Technology,United States,True,24.0,True,2608140386304.0,Apple (Adopter),True
53,Autodesk,Technology,United States,True,98.0,True,41994772480.0,Autodesk (Adopter),True
220,Goldman Sachs,"Banking, Insurance & Financial Services",United States,True,248.0,True,107847974912.0,Goldman Sachs (Adopter),True
228,Hewlett Packard Enterprise,Technology,United States,True,40.0,True,20440801280.0,HPE (Adopter),True
240,IBM,Technology,United States,True,5.0,True,120321998848.0,IBM (Adopter),True
248,Intel,Technology,United States,True,4.0,True,183187193856.0,Intel (Adopter),True
314,Microsoft,Technology,United States,True,2.0,True,2087771176960.0,Microsoft (Adopter),True
328,Netflix,Communication Services,United States,True,77.0,True,85567168512.0,Netflix (Adopter),True
405,Salesforce,Technology,United States,True,30.0,True,174780301312.0,Salesforce (Adopter),True


## Save processed data to file

In [650]:
all_data.to_csv('data_derived/merged_data.csv', index=False)
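One thing to watch when reading this file back: by default, `pd.read_csv` treats strings such as `n/a` as missing values, so the placeholder written for `OSCI position` and `OSPO status` would come back as NaN. The sketch below demonstrates this on a one-row toy frame held in memory (the company name is hypothetical) and shows `keep_default_na=False` as one possible way to keep the literal strings.

```python
import io
import pandas as pd

# One invented row using the same placeholder strings as the fillna step
demo = pd.DataFrame({"company": ["Acme"], "OSPO status": ["n/a"], "market cap": ["unknown"]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default parsing: 'n/a' is in pandas' default list of NA markers
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps 'n/a' as a literal string
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["OSPO status"].isna().all())
print(literal_read["OSPO status"].iloc[0])
```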

## Proceed to create visualizations

Head over to the [Visualizations notebook](Visualizations.ipynb) to generate some charts about the data.