# FOCUSED Project: OSPO adoption

As part of the [FOCUSED Collaboration project](https://github.com/JumpsuitWizard/FOCUSED-Collaboration), this notebook examines the adoption of Open Source Program Offices (OSPOs) among companies in the [Standard and Poor's 500 index](https://en.wikipedia.org/wiki/S%26P_500).

## Authors

- **PI**: Duane O'Brien
- **Researcher**: julia ferraioli
- **Analyst**: Reshama Shaikh

## Research question

## Methodology

## Data sources

The following data sources are used in the analysis:

- [S&P 500](https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv) retrieved on 2021-10-05
- [OSCI Index](https://opensourceindex.io/) retrieved on 2022-02-28
- [OSPO Landscape](https://landscape.todogroup.org/) retrieved on 2022-05-09

## Data setup

### Load the data into Pandas

In [611]:
import csv
import json
import numpy as np
import pandas as pd
import plotly.express as px

data_dir = "data_raw/"

# Load S&P 500 dataset into a dataframe
df_sp = pd.read_csv(data_dir+'sp500.csv')


# Load Open Source Contributor Index (OSCI) dataset into a dataframe.
# The company records sit under the top-level 'data' key.
with open(data_dir + 'osci.json') as f:
    json_osci = json.load(f)
df_osci = pd.json_normalize(json_osci['data'])
print(df_osci.shape)

# Load OSPO landscape dataset into a dataframe
df_ospo = pd.read_csv(data_dir + 'ospo_todo_landscape.csv')

(298, 12)
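
A quick check after loading can catch a changed upstream schema before it causes confusing errors later. The following is a minimal sketch on a toy frame; `require_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def require_columns(df, required, name):
    """Raise ValueError if any expected column is missing (hypothetical helper)."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"{name} is missing columns: {missing}")

# Toy stand-in for df_sp, using the column names shown in the preview
toy_sp = pd.DataFrame({
    'Symbol': ['AVGO'],
    'Name': ['Broadcom'],
    'Sector': ['Information Technology'],
})

# Passes silently when every expected column is present
require_columns(toy_sp, ['Symbol', 'Name', 'Sector'], 'sp500.csv')
```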


### Preview the data

In [612]:
df_sp.sample(5)

Unnamed: 0,Symbol,Name,Sector
77,AVGO,Broadcom,Information Technology
162,EMN,Eastman Chemical,Materials
332,NWS,News Corp (Class B),Communication Services
71,BA,Boeing,Industrials
400,ROL,Rollins,Industrials


In [613]:
df_osci.sample(5)

Unnamed: 0,positionChange,company,activeContributors,activeContributorsChange,totalCommunity,totalCommunityChange,position,yoy,contributors,languages,licenses,industry
246,4.0,OpenLattice,3.0,0.0,7,0.0,247,"[{'date': '2020-04-17', 'active': 10.0}, {'dat...","[{'Contributor': 'Matthew Tamayo-Rios', 'Commi...","[{'name': 'Java', 'amount': 632}, {'name': 'CS...","[{'name': 'gpl-3.0', 'amount': 858}, {'name': ...",Technology
39,1.0,Hewlett Packard Enterprise,148.0,8.0,334,25.0,40,"[{'date': '2020-04-17', 'active': 56.0}, {'dat...","[{'Contributor': 'Mitch Harding', 'Commits': 8...","[{'name': 'C#', 'amount': 3}, {'name': 'Python...","[{'name': 'mit', 'amount': 6425}, {'name': 'ap...",Technology
254,-8.0,Atypon,3.0,1.0,8,1.0,255,"[{'date': '2020-04-17', 'active': 2.0}, {'date...","[{'Contributor': 'asouqi', 'Commits': 54}, {'C...","[{'name': 'Go', 'amount': 9}, {'name': 'Python...","[{'name': 'apache-2.0', 'amount': 60}, {'name'...",Technology
269,6.0,EuroLinux,2.0,0.0,3,0.0,270,"[{'date': '2020-04-17', 'active': 1.0}, {'date...","[{'Contributor': 'Tomasz Podsiadły', 'Commits'...","[{'name': 'Shell', 'amount': 64}, {'name': 'C'...","[{'name': 'upl-1.0', 'amount': 64}, {'name': '...",Technology
297,0.0,Talan,0.0,0.0,1,0.0,298,"[{'date': '2020-04-17', 'active': 1.0}, {'date...","[{'Contributor': 'ezzeddine', 'Commits': 3}]","[{'name': 'JavaScript', 'amount': 3}]","[{'name': 'mit', 'amount': 3}]",Professional Services


In [614]:
df_ospo.sample(5)

Unnamed: 0,Name,Organization,Homepage,Logo,Twitter,Crunchbase URL,Market Cap,Ticker,Funding,Member,...,Github Stars,Github Description,Github Latest Commit Date,Github Latest Commit Link,Github Release Date,Github Release Link,Github Start Commit Date,Github Start Commit Link,Github Contributors Count,Github Contributors Link
115,Deutsche Bahn (Adopter),Deutsche Bahn,https://www.dbsystel.de/dbsystel-en,https://landscape.todogroup.org/logos/deutsche...,https://twitter.com/DB_Bahn,https://www.crunchbase.com/organization/deutsc...,,,,General,...,,,,,,,,,,
8,Banco Itaú (Member),Itau Unibanco,https://www.itau.com/,https://landscape.todogroup.org/logos/banco-it...,https://twitter.com/itauunibanco_ri,https://www.crunchbase.com/organization/banco-...,,,,General,...,,,,,,,,,,
54,Nokia (Member),Nokia,https://www.nokia.com/,https://landscape.todogroup.org/logos/nokia-me...,https://twitter.com/nokia,https://www.crunchbase.com/organization/nokia,,,868070947.0,General,...,,,,,,,,,,
83,Western Digital (Member),Western Digital,https://github.com/westerndigitalcorporation,https://landscape.todogroup.org/logos/western-...,https://twitter.com/westerndigital,https://www.crunchbase.com/organization/wester...,20172500000.0,WDC.F,0.0,General,...,,,,,,,,,,
202,FossID,FossID,https://fossid.com,https://landscape.todogroup.org/logos/foss-id.svg,https://twitter.com/FOSSID_AB,https://www.crunchbase.com/organization/fossid,,,,false,...,,,,,,,,,,


### Clean up the data

In [615]:
# Rename columns in S&P 500 and add a few fields for comparison purposes
df_sp = df_sp.rename(columns = {'Name': 'company', 'Sector': 'sector'})

df_sp['in S&P 500'] = True
df_sp['country'] = "United States"

# Reorder the columns for clarity's sake
order = ['company', 'sector', 'country', 'in S&P 500']

df_sp = df_sp.reindex(order, axis=1)

# Filter out columns from OSCI that we don't need
keep_cols = ['company', 'position', 'industry']
df_osci = df_osci.filter(keep_cols)
print(df_osci.head())

# Rename columns in OSCI and add a field for comparison purposes
df_osci = df_osci.rename(columns = {'position': 'OSCI position', 'industry': 'OSCI sector'})
df_osci['in OSCI'] = True

# Reorder the columns for clarity's sake
order = ['company', 'OSCI sector', 'OSCI position', 'in OSCI']

df_osci = df_osci.reindex(order, axis=1)

# Rename columns in OSPO Landscape and add a field for comparison purposes
df_ospo = df_ospo.rename(columns =
                         {'Name': 'OSPO status',
                          'Organization': 'company',
                          'Market Cap': 'market cap',
                          'Crunchbase Country': 'OSPO country',
                          'License': 'license'})

df_ospo['in OSPO landscape'] = True

keep_cols = ['OSPO status','company', 'market cap', 'OSPO country', 'in OSPO landscape']
df_ospo = df_ospo.filter(keep_cols)

# Filter out those who have not adopted an OSPO
df_ospo = df_ospo.loc[df_ospo['OSPO status'].str.contains("adopter", case = False)]

# Reorder the columns for clarity's sake
order = ['company', 'OSPO country', 'market cap', 'OSPO status', 'in OSPO landscape']


df_ospo = df_ospo.reindex(order, axis=1)

     company  position    industry
0     Google         1  Technology
1  Microsoft         2  Technology
2    Red Hat         3  Technology
3      Intel         4  Technology
4        IBM         5  Technology


### Preview the cleaned data

In [616]:
df_sp.sample(5)

Unnamed: 0,company,sector,country,in S&P 500
340,NortonLifeLock,Information Technology,United States,True
240,IBM,Information Technology,United States,True
245,Illumina,Health Care,United States,True
494,Williams Companies,Energy,United States,True
460,Union Pacific,Industrials,United States,True


In [617]:
df_osci.sample(5)

Unnamed: 0,company,OSCI sector,OSCI position,in OSCI
152,Cloudbees,Technology,153,True
218,eyeo,Technology,219,True
56,ThoughtWorks,Technology,57,True
256,Polidea,Technology,257,True
116,Spryker,Technology,117,True


In [618]:
df_ospo.sample(5)

Unnamed: 0,company,OSPO country,market cap,OSPO status,in OSPO landscape
185,Wayfair,United States,9014573000.0,Wayfair (Adopter),True
127,Fannie Mae,United States,888533100.0,Fannie Mae (Adopter),True
100,BlackRock,United States,96751120000.0,BlackRock (Adopter),True
137,IBM,United States,120322000000.0,IBM (Adopter),True
178,Two Sigma,United States,,Two Sigma (Adopter),True
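
The `OSPO status` column still carries the organization name, as in `Wayfair (Adopter)`. If only the status label is wanted, it could be pulled out with `Series.str.extract`. This is a sketch on toy values, not a step the notebook performs.

```python
import pandas as pd

# Toy values in the same shape as the landscape's Name column
status = pd.Series(['Wayfair (Adopter)', 'Nokia (Member)', 'FossID'])

# Capture the text inside the trailing parentheses; rows without one become NaN
label = status.str.extract(r'\(([^)]+)\)\s*$', expand=False)
print(label.tolist())
```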


### Merge the data sources

In [619]:
# First, merge S&P Index and OSCI, then merge with OSPO Landscape
all_data = (
    df_sp.merge(df_osci, on = 'company', how = 'outer')
         .merge(df_ospo, on = 'company', how = 'outer')
)

# Prefer S&P data; fall back to the OSCI sector and OSPO Landscape country where S&P has none
all_data['sector'] = all_data['sector'].fillna(all_data['OSCI sector'])
all_data['country'] = all_data['country'].fillna(all_data['OSPO country'])
all_data = all_data.drop(['OSCI sector', 'OSPO country'], axis = 1)

all_data.sort_values(by = ['company']).head(10)


Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
0,3M,Industrials,United States,True,,,,,
768,4teamwork,Technology,,,285.0,True,,,
759,5minds,Technology,,,276.0,True,,,
1,A. O. Smith,Industrials,United States,True,,,,,
7,ADM,Consumer Staples,United States,True,,,,,
11,AES Corp,Utilities,United States,True,,,,,
526,AMD,Technology,,,31.0,True,,,
44,APA Corporation,Energy,United States,True,,,,,
519,ARM,Technology,,,22.0,True,,,
51,AT&T,Communication Services,United States,True,,,,,
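
Because the merge joins on exact `company` strings, a name spelled differently in two sources will produce two separate rows instead of one match. A minimal sketch, on made-up toy frames, of using the `indicator` parameter of `DataFrame.merge` to list the rows that did not match:

```python
import pandas as pd

# Toy frames standing in for df_sp and df_osci (illustrative values only)
left = pd.DataFrame({'company': ['Acme Corp', 'Globex'], 'in S&P 500': [True, True]})
right = pd.DataFrame({'company': ['Acme', 'Globex'], 'in OSCI': [True, True]})

# indicator=True adds a `_merge` column: 'left_only', 'right_only', or 'both'
merged = left.merge(right, on='company', how='outer', indicator=True)

# Rows that failed to match are candidates for manual name reconciliation
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['company', '_merge']])
```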


### Do some manual cleanup for known issues

In [620]:
# Google is split across two Alphabet share classes and a separate Google entry, so we'll merge them
googles = all_data.loc[all_data['company'].str.contains('Alphabet|Google', case = False, regex = True)]

# Collapse rows that share a country, taking the last non-null value in each column
google = googles.groupby('country', as_index = False).last()

# Update the data set and drop the extraneous entries
all_data.set_index('company', inplace = True)
all_data.update(google.set_index('company'))
all_data.reset_index(inplace = True)

# Drop the first two matching rows (assumed here to be the Alphabet entries)
all_data.drop(googles.iloc[:2].index, inplace = True)
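
An alternative to patching rows after the merge is to map known aliases onto one canonical name beforehand. The sketch below assumes a hypothetical `ALIASES` dictionary and toy data; the share-class names are illustrative and may not match the spelling in the S&P file.

```python
import pandas as pd

# Hypothetical alias table: variant name -> canonical name
ALIASES = {
    'Alphabet (Class A)': 'Google',
    'Alphabet (Class C)': 'Google',
}

df = pd.DataFrame({
    'company': ['Alphabet (Class A)', 'Alphabet (Class C)', 'IBM'],
    'sector': ['Communication Services', 'Communication Services', 'Information Technology'],
})

# Replace known variants, then keep one row per canonical company
df['company'] = df['company'].replace(ALIASES)
df = df.drop_duplicates(subset='company', keep='first').reset_index(drop=True)
print(df)
```

Applied to each source before merging, this would let the join itself line up the rows, at the cost of maintaining the alias table by hand.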

In [621]:
all_data.sample(25)

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
317,Mohawk Industries,Consumer Discretionary,United States,True,,,,,
136,D. R. Horton,Consumer Discretionary,United States,True,,,,,
190,Fastenal,Industrials,United States,True,,,,,
145,Diamondback Energy,Energy,United States,True,,,,,
372,Pool Corporation,Consumer Discretionary,United States,True,,,,,
9,Advance Auto Parts,Consumer Discretionary,United States,True,,,,,
274,KLA Corporation,Information Technology,United States,True,,,,,
437,Texas Instruments,Information Technology,United States,True,,,,,
355,Paccar,Industrials,United States,True,,,,,
574,Verizon Media,Media & Telecoms,United States,,86.0,True,199440900000.0,Verizon Media (Adopter),True


In [622]:
# Normalize the S&P sector names to the ones OSCI uses
all_data['sector'] = all_data['sector'].replace({
    'Information Technology': 'Technology',
    'Health Care': 'Healthcare & Pharma',
    'Financials': 'Banking, Insurance & Financial Services',
})

# Fill null values with default ones where needed
all_data = all_data.fillna(value={'in S&P 500': False, 'in OSPO landscape': False, 'in OSCI': False,
                                  'OSCI position': 'n/a', 'OSPO status': 'n/a', 'market cap': 'unknown'})

### Get a sneak peek at the data

In [623]:
data_sample = all_data.sample(20)
data_sample.sort_values(by = ['company'])

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
746,Abilian,Technology,,False,263.0,True,unknown,,False
30,American Electric Power,Utilities,United States,True,,False,unknown,,False
642,ArangoDB,Technology,,False,158.0,True,unknown,,False
715,CS Group,Technology,,False,231.0,True,unknown,,False
85,Capital One Financial,"Banking, Insurance & Financial Services",United States,True,,False,unknown,,False
799,Deutsche Bahn,,Germany,False,,False,unknown,Deutsche Bahn (Adopter),True
208,Gap,Consumer Discretionary,United States,True,,False,unknown,,False
507,GitHub,Technology,United States,False,8.0,True,2087771176960.0,GitHub (Adopter),True
745,Infosiftr,Technology,,False,262.0,True,unknown,,False
773,Insolar Technologies,Technology,,False,290.0,True,unknown,,False


## Inspect the data

In [624]:
sp_count = len(all_data[all_data['in S&P 500']])
ospo_count = len(all_data[all_data['in OSPO landscape']])
osci_count = len(all_data[all_data['in OSCI']])
sp_ospo_count = len(all_data.query('`in S&P 500` & `in OSPO landscape`'))
sp_osci_count = len(all_data.query('`in S&P 500` & `in OSCI`'))
ospo_osci = len(all_data.query('`in OSPO landscape` & `in OSCI`'))
intersection = len(all_data.query('`in S&P 500` & `in OSPO landscape` & `in OSCI`'))

listings = pd.DataFrame([[sp_count, ospo_count, osci_count, sp_ospo_count, sp_osci_count, ospo_osci, intersection]],
                        ['count'], ['in S&P 500','in OSPO landscape','in OSCI', 'in S&P and OSPO landscape',
                        'in S&P and OSCI', 'in OSPO landscape and OSCI', 'in all three'])
listings

Unnamed: 0,in S&P 500,in OSPO landscape,in OSCI,in S&P and OSPO landscape,in S&P and OSCI,in OSPO landscape and OSCI,in all three
count,504,102,299,23,23,36,12
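
The `in OSCI` count above (299) is one more than the 298 rows loaded from the OSCI file. One possible explanation is that a company name appears more than once in another source and was duplicated by the join. A minimal sketch, on toy data, of surfacing such duplicates with `DataFrame.duplicated`:

```python
import pandas as pd

# Toy stand-in for all_data; 'Initech' appears twice, as a repeated join key would cause
toy = pd.DataFrame({
    'company': ['Initech', 'Initech', 'Globex'],
    'in OSCI': [True, True, False],
})

# keep=False flags every member of a duplicated group, not just the repeats
dupes = toy[toy.duplicated(subset='company', keep=False)]
print(dupes)
```

Passing `validate='one_to_one'` to `DataFrame.merge` is another option: it raises a `MergeError` when either side contains duplicate keys.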


### What companies are present in all three datasets?

In [625]:
all_data.query('`in S&P 500` & `in OSPO landscape` & `in OSCI`')

Unnamed: 0,company,sector,country,in S&P 500,OSCI position,in OSCI,market cap,OSPO status,in OSPO landscape
8,Adobe,Technology,United States,True,17.0,True,188740124672.0,Adobe (Adopter),True
45,Apple,Technology,United States,True,24.0,True,2608140386304.0,Apple (Adopter),True
53,Autodesk,Technology,United States,True,98.0,True,41994772480.0,Autodesk (Adopter),True
220,Goldman Sachs,"Banking, Insurance & Financial Services",United States,True,248.0,True,107847974912.0,Goldman Sachs (Adopter),True
228,Hewlett Packard Enterprise,Technology,United States,True,40.0,True,20440801280.0,HPE (Adopter),True
240,IBM,Technology,United States,True,5.0,True,120321998848.0,IBM (Adopter),True
248,Intel,Technology,United States,True,4.0,True,183187193856.0,Intel (Adopter),True
314,Microsoft,Technology,United States,True,2.0,True,2087771176960.0,Microsoft (Adopter),True
328,Netflix,Communication Services,United States,True,77.0,True,85567168512.0,Netflix (Adopter),True
405,Salesforce,Technology,United States,True,30.0,True,174780301312.0,Salesforce (Adopter),True


## Save processed data to file

In [626]:
all_data.to_csv('data_derived/merged_data.csv', index=False)
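
One caveat when reading this file back: by default `pandas.read_csv` treats strings such as `n/a` as missing values, so the `'n/a'` placeholders filled in earlier would come back as `NaN`. A minimal sketch of a round trip that preserves them, using an in-memory buffer and toy data in place of the real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({'company': ['Acme'], 'OSPO status': ['n/a']})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing converts 'n/a' to NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False leaves the placeholder string intact
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['OSPO status'].isna().all())
print(literal_read['OSPO status'].tolist())
```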