## Data Analysis

### What each column represents

- gvkey: a unique identifier for the company (Global Company Key)
- cid: a unique identifier for the customer
- cnms: customer name
- ctype: customer type
- gareac: geographic area code
- gareat: geographic area type
- salecs: sales in current period (in millions)
- sid: segment identifier
- stype: segment type
- srcdate: source date
- conm: company name
- tic: stock ticker symbol
- cusip: CUSIP number, a unique identifier for a security
- cik: SEC Central Index Key, a unique identifier for a company
- sic: Standard Industrial Classification code, a numerical code used to classify industries
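
Because several of these columns are identifiers rather than quantities, it can help to pin their types when loading. The following is a minimal sketch, not the notebook's actual loading code: it reads two rows copied from the outputs further down (trimmed to a few columns) from an in-memory string, and the choice to keep `cusip` and `sic` as strings and to convert `cik` to a nullable integer is an assumption about what is convenient downstream.

```python
import io
import pandas as pd

# Two rows taken from the notebook outputs, reduced to a handful of columns.
sample_csv = io.StringIO(
    "gvkey,cnms,ctype,salecs,srcdate,cusip,cik,sic\n"
    "1166,10 Customers,COMPANY,999.563,2019-12-31,N07045102,351483.0,3559\n"
    "1598,4 Customers,COMPANY,99.686,2019-12-31,031100100,1037868.0,3823\n"
)

df = pd.read_csv(
    sample_csv,
    # Assumption: identifiers are kept as strings so leading zeros survive.
    dtype={"cusip": "string", "sic": "string"},
    parse_dates=["srcdate"],
)

# cik is stored as a float (e.g. 351483.0); a nullable integer keeps
# missing values while dropping the trailing ".0".
df["cik"] = df["cik"].astype("Int64")

print(df.dtypes)
```

Keeping `cusip` as a string matters for values such as `031100100`, which would lose the leading zero if parsed as a number.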



In [2]:
## imports
import pandas as pd
import os

In [13]:
comp = pd.read_csv('inputs/cust_supply_2019_2022.csv')

In [25]:
# Keep rows whose customer type is 'COMPANY' and whose customer name is reported,
# then drop rows with missing sales
comp2 = comp[comp['ctype'] == 'COMPANY']
comp2 = comp2[comp2['cnms'] != 'Not Reported']
comp3 = comp2.dropna(subset=['salecs'])
comp3

Unnamed: 0,gvkey,cid,cnms,ctype,gareac,gareat,salecs,sid,stype,srcdate,conm,tic,cusip,cik,sic
183,1166,43,10 Customers,COMPANY,,,999.563,2,BUSSEG,2019-12-31,ASM INTERNATIONAL NV,ASMIY,N07045102,351483.0,3559
184,1166,45,7 Customers,COMPANY,,,416.234,2,BUSSEG,2020-12-31,ASM INTERNATIONAL NV,ASMIY,N07045102,351483.0,3559
185,1166,44,3 Customers,COMPANY,,,946.132,2,BUSSEG,2020-12-31,ASM INTERNATIONAL NV,ASMIY,N07045102,351483.0,3559
186,1166,44,3 Customers,COMPANY,,,1168.678,2,BUSSEG,2021-12-31,ASM INTERNATIONAL NV,ASMIY,N07045102,351483.0,3559
187,1166,45,7 Customers,COMPANY,,,383.657,2,BUSSEG,2021-12-31,ASM INTERNATIONAL NV,ASMIY,N07045102,351483.0,3559
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77875,345556,3,AstraZeneca AB,COMPANY,SWE,ISO,0.500,1,BUSSEG,2021-12-31,F-STAR THERAPEUTICS INC,FSTX,30315R107,1566373.0,2836
77876,345556,2,Denali Therapeutics Inc,COMPANY,USA,ISO,0.117,1,BUSSEG,2021-12-31,F-STAR THERAPEUTICS INC,FSTX,30315R107,1566373.0,2836
77877,345556,1,Ares Trading S.A,COMPANY,CHE,ISO,2.800,1,BUSSEG,2021-12-31,F-STAR THERAPEUTICS INC,FSTX,30315R107,1566373.0,2836
77878,345764,2,2 Customers,COMPANY,,,1.668,1,BUSSEG,2020-12-31,T STAMP INC,IDAI,873048409,1718939.0,7372
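
The filtered output still contains aggregate labels such as `10 Customers` and `2 Customers` alongside named companies. If only named customers are wanted, one possible approach is to match the `<number> Customers` pattern, as sketched below. The pattern and the small `customers` frame are assumptions for illustration, built from a few `cnms` values that appear in the output above; the real data may contain other aggregate labels.

```python
import pandas as pd

# A few customer names copied from the output above.
customers = pd.DataFrame({
    "cnms": ["10 Customers", "AstraZeneca AB", "2 Customers", "Denali Therapeutics Inc"],
    "salecs": [999.563, 0.500, 1.668, 0.117],
})

# Assumption: aggregate rows are exactly "<digits> Customers".
is_aggregate = customers["cnms"].str.fullmatch(r"\d+ Customers")

# Keep only rows that name a specific customer.
named = customers[~is_aggregate]
print(named)
```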


In [30]:
# rename returns a new DataFrame, so assign the result back
comp3 = comp3.rename(columns={'cik': 'CIK'})
merged = comp3.merge(sp500, on='CIK', how = 'inner')
merged

Unnamed: 0,gvkey,cid,cnms,ctype,gareac,gareat,salecs,sid,stype,srcdate,...,cusip,CIK,sic,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,Founded
0,1327,23,Apple Inc,COMPANY,,,1722.168,7,BUSSEG,2019-09-30,...,83088M102,4127.0,3674,SWKS,Skyworks Solutions,Information Technology,Semiconductors,"Irvine, California",2015-03-12,2002
1,1327,23,Apple Inc,COMPANY,,,1879.192,7,BUSSEG,2020-09-30,...,83088M102,4127.0,3674,SWKS,Skyworks Solutions,Information Technology,Semiconductors,"Irvine, California",2015-03-12,2002
2,1327,23,Apple Inc,COMPANY,,,3014.369,7,BUSSEG,2021-09-30,...,83088M102,4127.0,3674,SWKS,Skyworks Solutions,Information Technology,Semiconductors,"Irvine, California",2015-03-12,2002
3,1327,23,Apple Inc,COMPANY,,,3181.590,7,BUSSEG,2022-09-30,...,83088M102,4127.0,3674,SWKS,Skyworks Solutions,Information Technology,Semiconductors,"Irvine, California",2015-03-12,2002
4,1598,35,4 Customers,COMPANY,,,99.686,3,BUSSEG,2019-12-31,...,031100100,1037868.0,3823,AME,Ametek,Industrials,Electrical Components & Equipment,"Berwyn, Pennsylvania",2013-09-23,1930
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1180,187697,17,5 Customers,COMPANY,,,284.339,5,BUSSEG,2021-12-31,...,29414B104,1352010.0,7370,EPAM,EPAM Systems,Information Technology,IT Consulting & Other Services,"Newtown, Pennsylvania",2021-12-14,1993
1181,187697,50,5 Customers,COMPANY,,,682.147,5,BUSSEG,2021-12-31,...,29414B104,1352010.0,7370,EPAM,EPAM Systems,Information Technology,IT Consulting & Other Services,"Newtown, Pennsylvania",2021-12-14,1993
1182,316056,3,10 Customers,COMPANY,,,656.420,1,BUSSEG,2019-12-31,...,G0176J109,1579241.0,3420,ALLE,Allegion,Industrials,Building Products,"New York City, New York",2013-12-02,1908
1183,316056,3,10 Customers,COMPANY,,,652.776,1,BUSSEG,2020-12-31,...,G0176J109,1579241.0,3420,ALLE,Allegion,Industrials,Building Products,"New York City, New York",2013-12-02,1908
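
The merge relies on the key column having the same name on both sides, and the output shows `CIK` as a float (`4127.0`) on the Compustat side. A minimal sketch of the same step on toy data follows. The frames `comp_small` and `sp500_small` are hypothetical stand-ins using values from the outputs in this notebook, and casting both keys to a nullable integer and passing `validate="m:1"` are optional safeguards assumed here, not something the notebook does.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
comp_small = pd.DataFrame({
    "gvkey": [1327, 1166],
    "cnms": ["Apple Inc", "10 Customers"],
    "salecs": [1722.168, 999.563],
    "cik": [4127.0, 351483.0],
})
sp500_small = pd.DataFrame({
    "Symbol": ["SWKS", "MMM"],
    "Security": ["Skyworks Solutions", "3M"],
    "CIK": [4127, 66740],
})

# rename returns a copy, so the result is assigned back.
comp_small = comp_small.rename(columns={"cik": "CIK"})

# Align the key dtype on both sides (optional, shown as a safeguard).
comp_small["CIK"] = comp_small["CIK"].astype("Int64")
sp500_small["CIK"] = sp500_small["CIK"].astype("Int64")

# validate="m:1" raises if a CIK appears more than once in sp500_small.
merged_small = comp_small.merge(sp500_small, on="CIK", how="inner", validate="m:1")
print(merged_small)
```

Only the Skyworks row survives the inner merge here, because it is the only CIK present in both toy frames.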


In [35]:
merged['cnms'].unique()
# Sorted unique customer names per CIK; uses pandas only, so numpy is not required
g = merged.groupby("CIK")['cnms'].apply(lambda x: sorted(x.dropna().unique()))
g

## Getting the S&P 500 data

In [3]:
## downloading the SP500 info from the web

os.makedirs("inputs", exist_ok=True)
sp500_file = 'inputs/sp500_2022.csv'

if not os.path.exists(sp500_file):
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    pd.read_html(url)[0].to_csv(sp500_file,index=False)

sp500 = pd.read_csv(sp500_file)

In [None]:
## merging the sp500 dataset to the compustat dataset

# add merge code here

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989
...,...,...,...,...,...,...,...,...
498,YUM,Yum! Brands,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,1041061,1997
499,ZBRA,Zebra Technologies,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,877212,1969
500,ZBH,Zimmer Biomet,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,1136869,1927
501,ZION,Zions Bancorporation,Financials,Regional Banks,"Salt Lake City, Utah",2001-06-22,109380,1873
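
For the merge step left as a placeholder in the last cell, one option is to start from the S&P 500 list and check which firms have any customer records at all. The sketch below assumes this is a useful diagnostic. It uses hypothetical toy frames (`sp500_small`, `comp_small`) with CIK values taken from the outputs in this notebook, and pandas' `indicator=True` option to label unmatched rows.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
sp500_small = pd.DataFrame({
    "Symbol": ["MMM", "SWKS", "AME"],
    "CIK": [66740, 4127, 1037868],
})
comp_small = pd.DataFrame({
    "CIK": [4127, 4127, 1037868],
    "cnms": ["Apple Inc", "Apple Inc", "4 Customers"],
    "salecs": [1722.168, 1879.192, 99.686],
})

# Left merge against the distinct CIKs that have customer records;
# the _merge column reports whether each S&P 500 firm found a match.
coverage = sp500_small.merge(
    comp_small[["CIK"]].drop_duplicates(),
    on="CIK",
    how="left",
    indicator=True,
)

# Firms in the index with no customer rows in the Compustat extract.
no_customers = coverage.loc[coverage["_merge"] == "left_only", "Symbol"].tolist()
print(no_customers)
```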
