# [Cleaning: SEC dataset, Sector Market Value](#section-title)

In [1]:
#imports
import pandas as pd
from polygon import RESTClient
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

---

## Importing and Consolidating Past Datasets and the SEC Dataset:
- The purpose of including the SEC dataset was to obtain a unique identifier (known as the CIK) for each company and reference this across all datasets to ensure I am identifying and matching companies correctly. 

- The GICS dataframe identifies companies by stock ticker and company name, and the name can vary between records. For example, "Apple" and "Apple Inc." refer to the same company. To avoid treating one company as two because of name variations, I have amended the GICS dataframe to include each company's CIK identifier, per the __[SEC's EDGAR database](https://www.sec.gov/include/ticker.txt)__.


- CIK, or __[Central Index Key](https://www.sec.gov/edgar/searchedgar/cik)__, is a unique identifier that the SEC assigns to each filer. Within the context of this project, the key is used to map every company in the GICS dataframe to one unique code. The financial statement API employed below also identifies companies by CIK, so incorporating the CIK into the GICS dataframe provides a common column on which to join the dataframes after calling and cleaning all data.
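
As a rough illustration of the ticker-to-CIK mapping described above, the sketch below parses a few sample lines and adds a zero-padded CIK column. It assumes the source file is tab-separated with no header row (one `ticker<TAB>cik` pair per line); the names `df_map` and `cik_padded` are hypothetical, and the 10-digit padding is only needed where an endpoint expects that form.

```python
import io

import pandas as pd

# Sample lines in the assumed "ticker<TAB>cik" layout (no header row)
sample = "aapl\t320193\nmsft\t789019\nbrk-b\t1067983\n"

df_map = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["ticker", "cik"],  # name the columns up front instead of renaming later
)

# Uppercase tickers so they compare cleanly with other datasets
df_map["ticker"] = df_map["ticker"].str.upper()

# Hypothetical helper column: CIK as a 10-character, zero-padded string
df_map["cik_padded"] = df_map["cik"].astype(str).str.zfill(10)

print(df_map)
```

Naming the columns at read time would avoid the `Unnamed: 0` / `Unnamed: 1` headers that otherwise need a later rename.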

In [2]:
# Importing the dataframe for ticker to Cik conversion

df_g = pd.read_csv("../data/cleaned_csvs_interim_steps/gics_cleaned.csv")
df_sec = pd.read_csv("../data/sec_ciks.csv")

df_tickers_list = pd.read_csv("../data/cleaned_csvs_interim_steps/tickers_list.csv")
tickers = df_tickers_list["symbol"].tolist()

In [3]:
tickers

['CTVA',
 'NTR',
 'FMC',
 'MOS',
 'CF',
 'ICL',
 'SMG',
 'UAN',
 'AVD',
 'BIOX',
 'IPI',
 'MBII',
 'RKDA',
 'CGA',
 'SEED',
 'YTEN',
 'SVFD',
 'RYAAY',
 'DAL',
 'LUV',
 'UAL',
 'AAL',
 'ALK',
 'ZNH',
 'CPA',
 'JBLU',
 'CEA',
 'ULCC',
 'SAVE',
 'ALGT',
 'SNCY',
 'VLRS',
 'SKYW',
 'HA',
 'GOL',
 'MESA',
 'AER',
 'ASR',
 'PAC',
 'OMAB',
 'JOBY',
 'AAWW',
 'CAAP',
 'ASLE',
 'MIC',
 'BLDE',
 'UP',
 'AA',
 'CSTM',
 'ACH',
 'KALU',
 'CENX',
 'VFC',
 'RL',
 'GIL',
 'CPRI',
 'COLM',
 'PVH',
 'UA',
 'UAA',
 'ZGN',
 'KTB',
 'OXM',
 'HBI',
 'LEVI',
 'GOOS',
 'FIGS',
 'SGC',
 'LAKE',
 'VNCE',
 'DLA',
 'JRSH',
 'EVK',
 'XELB',
 'LLL',
 'TJX',
 'LULU',
 'ROST',
 'BURL',
 'ONON',
 'GPS',
 'CRI',
 'AEO',
 'VSCO',
 'URBN',
 'BOOT',
 'BKE',
 'ANF',
 'GES',
 'CHS',
 'GIII',
 'SCVL',
 'DBI',
 'PLCE',
 'GCO',
 'ZUMZ',
 'DXLG',
 'CURV',
 'JILL',
 'DLTH',
 'BIRD',
 'CATO',
 'RENT',
 'TLYS',
 'CTRN',
 'LVLU',
 'EXPR',
 'DBGI',
 'BLK',
 'BX',
 'KKR',
 'BK',
 'APO',
 'AMP',
 'STT',
 'TROW',
 'NTRS',
 'ARES',
 'B

In [4]:
print(df_sec.shape)
df_sec.head()

(12084, 2)


Unnamed: 0.1,Unnamed: 0,Unnamed: 1
0,aapl,320193
1,msft,789019
2,brk-b,1067983
3,unh,731766
4,jnj,200406


In [5]:
df_sec = df_sec.rename(columns={'Unnamed: 0': 'ticker', 'Unnamed: 1': 'cik'})

In [6]:
df_sec['cik'].isnull().sum()

0

In [7]:
df_sec.duplicated().sum()

0
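
`duplicated()` with no arguments only flags rows that repeat in every column. A stricter check, sketched below on made-up rows, looks at each key separately: a repeated ticker would make the later join ambiguous, while a repeated CIK can be legitimate (for example, several share classes of one company). The tickers and CIK values here are placeholders, not real data.

```python
import pandas as pd

# Placeholder rows: two tickers that share one CIK, plus an unrelated company
df_example = pd.DataFrame(
    {"ticker": ["AAA", "AAB", "CCC"], "cik": [111, 111, 222]}
)

# Tickers should be unique, since they are the join key
dup_tickers = int(df_example["ticker"].duplicated().sum())

# Rows whose CIK appears more than once (keep=False marks every occurrence)
shared_cik = df_example[df_example["cik"].duplicated(keep=False)]

print(dup_tickers)
print(shared_cik)
```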

In [8]:
# Save a clean copy of the df_sec as csv, for record
df_sec.to_csv("../data/cleaned_csvs_interim_steps/sec_cleaned.csv", index = False)

---

## Importing and Consolidating Sector Market Value:

In [9]:
# Importing the sector contribution to total market value df:
df_mv = pd.read_csv("../data/gics_sector_chart.csv", index_col = None)

## Deriving Sector Contribution to Total Market Value of US Stock Market

- To determine the market concentration of each sector and company, I first need the size of the US financial markets (in USD) and each sector's contribution to that total.

In [10]:
# Quick cleaning:
# Replace spaces in column names with underscores, lowercase them, and drop the unused "url" column
df_mv.columns = df_mv.columns.str.replace(' ', '_')
df_mv.columns = df_mv.columns.str.lower()
df_mv.drop(columns = ["url"], inplace = True)
df_mv.head()

Unnamed: 0,sector,sector_revenue_total_(trillions),total_cap_(trillions)
0,Consumer Discretionary,2.9,46.3
1,Consumer Staples,2.6,46.3
2,Energy,2.2,46.3
3,Financials,2.7,46.3
4,Health Care,2.8,46.3
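
One possible way to express each sector's contribution is as a ratio of the two numeric columns. The sketch below uses two rows copied from the table above; it assumes that dividing the sector column by the total column is the intended measure, and the column name `sector_share` is hypothetical.

```python
import pandas as pd

# Two rows taken from the df_mv preview above
df_example = pd.DataFrame(
    {
        "sector": ["Consumer Discretionary", "Energy"],
        "sector_revenue_total_(trillions)": [2.9, 2.2],
        "total_cap_(trillions)": [46.3, 46.3],
    }
)

# Hypothetical derived column: sector value as a fraction of the total
df_example["sector_share"] = (
    df_example["sector_revenue_total_(trillions)"]
    / df_example["total_cap_(trillions)"]
)

print(df_example[["sector", "sector_share"]])
```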


In [11]:
df_mv.to_csv('../data/cleaned_csvs_interim_steps/mv_cleaned.csv',index=False)

---

## Merging SEC + GICS dataframes, then merging (SEC+GICS) + MV

In [12]:
# Convert the data within the common column to uppercase in both dataframes, per convention, to check for matches

df_sec['ticker'] = df_sec['ticker'].str.upper()
df_g["symbol"] = df_g["symbol"].str.upper()

In [13]:
# Using an inner join to merge the dataframes df_sec and df_g to df_sec_g
df_sec_g = pd.merge(df_sec, df_g, left_on='ticker', right_on='symbol')

In [14]:
# Check if all values in the 'ticker' column match the values in the 'symbol' column
all(df_sec_g['ticker'] == df_sec_g['symbol'])

True

In [15]:
# List any rows where 'ticker' and 'symbol' do not match.
# An empty result confirms that every row matches.
df_sec_g.query('ticker != symbol')

Unnamed: 0,ticker,cik,symbol,description,gics_sector,equity_securities,cap_size


### Justification for differences between merged df and df_g count

- The merged dataframe has fewer rows than df_g. The difference might be explained by companies delisting (buyouts, bankruptcies, etc.) and being updated on the SEC website without being updated on MBA_stocks.com. Possible causes:

1. An error on either the MBA website or the SEC website
2. Lists compiled at different times, so that one recognizes bankruptcies, mergers, acquisitions, or name changes that the other does not
3. Slightly different "universes" of stocks
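
To see which tickers account for the difference, an outer join with pandas' `indicator=True` option labels each row by the side it came from. This is a minimal sketch on placeholder frames (`left`, `right`, and `audit` are hypothetical names), not a step from the original pipeline.

```python
import pandas as pd

# Placeholder frames standing in for df_sec and df_g
left = pd.DataFrame({"ticker": ["AAA", "BBB"], "cik": [1, 2]})
right = pd.DataFrame(
    {"symbol": ["BBB", "CCC"], "gics_sector": ["Energy", "Materials"]}
)

# Outer join keeps unmatched rows; "_merge" records where each row came from
audit = pd.merge(
    left, right, left_on="ticker", right_on="symbol", how="outer", indicator=True
)

only_sec = audit[audit["_merge"] == "left_only"]    # in the SEC list only
only_gics = audit[audit["_merge"] == "right_only"]  # in the GICS list only

print(audit["_merge"].value_counts())
```

Inspecting `only_gics` for a handful of tickers would show whether the gaps come from delistings, name changes, or a different stock universe.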

In [16]:
df_sec_g.head()

Unnamed: 0,ticker,cik,symbol,description,gics_sector,equity_securities,cap_size
0,AAPL,320193,AAPL,Apple Inc,Information Technology,Common stocks,Large cap
1,MSFT,789019,MSFT,Microsoft Corp,Information Technology,Common stocks,Large cap
2,UNH,731766,UNH,Unitedhealth Group Inc,Health Care,Common stocks,Large cap
3,JNJ,200406,JNJ,Johnson & Johnson,Health Care,Common stocks,Large cap
4,V,1403161,V,Visa Inc Class A,Information Technology,Common stocks,Large cap


In [17]:
df_sec_g["gics_sector"].unique()

array(['Information Technology', 'Health Care', 'Energy',
       'Consumer Staples', 'Financials', 'Consumer Discretionary',
       'Communication Services', 'Utilities', 'Industrials', 'Materials',
       'Real Estate'], dtype=object)

In [18]:
df_sec_g

Unnamed: 0,ticker,cik,symbol,description,gics_sector,equity_securities,cap_size
0,AAPL,320193,AAPL,Apple Inc,Information Technology,Common stocks,Large cap
1,MSFT,789019,MSFT,Microsoft Corp,Information Technology,Common stocks,Large cap
2,UNH,731766,UNH,Unitedhealth Group Inc,Health Care,Common stocks,Large cap
3,JNJ,200406,JNJ,Johnson & Johnson,Health Care,Common stocks,Large cap
4,V,1403161,V,Visa Inc Class A,Information Technology,Common stocks,Large cap
...,...,...,...,...,...,...,...
5983,WLYB,107140,WLYB,"John Wiley & Sons, Inc. Class B",Communication Services,Common stocks,Mid cap
5984,APTS,1481832,APTS,"Preferred Apartment Communities, Inc",Real Estate,REITs,Small cap
5985,GBBKR,1894951,GBBKR,Global Blockchain Acquisition Corp. Right,Financials,Common stocks,Micro cap
5986,GFGDR,1876714,GFGDR,The Growth for Good Acquisition Corporation Right,Financials,Common stocks,Micro cap


In [19]:
df_mv["sector"].unique()

array(['Consumer Discretionary', 'Consumer Staples', 'Energy',
       'Financials', 'Health Care', 'Industrials', 'Materials',
       'Real Estate', 'Information Technology', 'Communication Services',
       'Uttilities'], dtype=object)

In [20]:
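# Now for the next merge:
# Using a left join to add the sector market values in df_mv to df_sec_g, giving df_sec_g_mv

# df_mv spells one sector "Uttilities"; align it with the GICS label so those rows match
df_mv["sector"] = df_mv["sector"].replace({"Uttilities": "Utilities"})

df_sec_g_mv = df_sec_g.merge(df_mv, left_on="gics_sector", right_on="sector", how="left")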
df_sec_g_mv.drop(columns = ["sector"], inplace = True) # Same as "gics_sector"
df_sec_g_mv.head()

Unnamed: 0,ticker,cik,symbol,description,gics_sector,equity_securities,cap_size,sector_revenue_total_(trillions),total_cap_(trillions)
0,AAPL,320193,AAPL,Apple Inc,Information Technology,Common stocks,Large cap,5.8,46.3
1,MSFT,789019,MSFT,Microsoft Corp,Information Technology,Common stocks,Large cap,5.8,46.3
2,UNH,731766,UNH,Unitedhealth Group Inc,Health Care,Common stocks,Large cap,2.8,46.3
3,JNJ,200406,JNJ,Johnson & Johnson,Health Care,Common stocks,Large cap,2.8,46.3
4,V,1403161,V,Visa Inc Class A,Information Technology,Common stocks,Large cap,5.8,46.3
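
Because this is a left join, a sector label that does not match exactly leaves the market value columns as NaN without raising an error. The sector listing from `df_mv` shows the spelling `Uttilities`, while the GICS data uses `Utilities`, so a check along these lines is worth running after the merge. The frames below are placeholders.

```python
import pandas as pd

# Placeholder frames: one sector label is misspelled on the right-hand side
companies = pd.DataFrame(
    {"ticker": ["AAA", "BBB"], "gics_sector": ["Energy", "Utilities"]}
)
sectors = pd.DataFrame(
    {"sector": ["Energy", "Uttilities"], "total_cap_(trillions)": [46.3, 46.3]}
)

merged = companies.merge(
    sectors, left_on="gics_sector", right_on="sector", how="left"
)

# Sectors on the left that found no partner on the right
unmatched = merged.loc[merged["sector"].isna(), "gics_sector"].unique()

print(unmatched)
```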


In [21]:
df_sec_g_mv.to_csv('../data/cleaned_csvs_interim_steps/sec_gics_mv_cleaned.csv', index = False)

---