# Data Sources
- [cust_supply_2019_2022.csv](inputs/cust_supply_2019_2022.csv) provided by Dr. Bowen
- S&P 500 data (obtained by scraping Wikipedia)
- Data from the accounting dataset provided by Dr. Bowen, limited to the variables we selected after cleaning the Compustat dataset
    - Variables:
        - fyear (fiscal year)
        - sale (net sales)
        - acominc (net income)
        - at (total assets)
        - capx (capex, dollar amount)
        - capxv (capex ratio for current fiscal year)
        - cogs (cost of goods sold)
        - gp (gross profit)
        - epsfx (basic EPS: uses the actual number of shares outstanding and excludes potentially dilutive securities)


# Data
- We received our data from Dr. Bowen as a `.csv` file
- To load the data into Python, we used `pd.read_csv('inputs/cust_supply_2019_2022.csv')`
- We then downloaded the S&P 500 data from Wikipedia, caching it locally so the page is only scraped when no saved copy exists:
```python
import os

import pandas as pd

os.makedirs("inputs", exist_ok=True)
sp500_file = 'inputs/sp500_2022.csv'

# Scrape Wikipedia only if there is no cached copy on disk
if not os.path.exists(sp500_file):
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    # [0] selects the first table found on the page
    pd.read_html(url)[0].to_csv(sp500_file, index=False)

sp500 = pd.read_csv(sp500_file)
sp500
```

# EDA
- We used the `eda.py` file from the community codebook to perform EDA on the raw Compustat data
```python
insufficient_but_starting_eda(comp, ['cnms', 'ctype', 'gareac', 'gareat', 'stype', 'srcdate', 'conm', 'tic', 'cusip'])
```
- We found that
    - there are 77901 data entries in this csv
    - there are 9 categorical variables
    - there are 6 numerical variables
    - the unit level is sales
    - the only variables with missing data are 
        - gareac (57.8%) 
        - gareat (57.8%)
        - stype (14.0%)
        - salecs (12.4%)
        - cik (0.6%)
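
The missing-data percentages above can be reproduced with a short pandas check. This is a minimal sketch, not the code inside `eda.py`: the helper name `missing_share` is hypothetical, and the toy frame stands in for the raw Compustat data with made-up values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw Compustat frame; all values are made up.
toy = pd.DataFrame({
    "gareac": [np.nan, "USA", np.nan, np.nan],
    "stype": ["GEOSEG", np.nan, "BUSSEG", "BUSSEG"],
    "salecs": [10.0, 20.0, np.nan, 40.0],
    "conm": ["A", "B", "C", "D"],
})


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first.

    Columns without any missing values are left out.
    """
    pct = df.isna().mean().mul(100).round(1)
    return pct[pct > 0].sort_values(ascending=False)


missing = missing_share(toy)
print(missing)
```

Calling `missing_share(comp)` on the real data should list only the columns that have gaps, which makes it easy to compare against the figures reported above.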


Additional EDA will be done once we receive the accounting data from Dr. Bowen

# Cleaning 
- Initially, we filtered the Compustat data to rows where the company type (`ctype`) is `COMPANY`, but that query left us with only about 150 firms. We therefore decided to keep all company types (`ctype`), which more than doubled the sample (about 350 firms).
- This also gave us a more holistic view of the firms' financials
    - For example, AAPL did not report sales to specific companies, so under the initial filter it would have been excluded from the list of firms we are observing

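```python
# Keep every company type; the COMPANY-only filter below left too few firms
comp2 = comp
# comp2 = comp[comp['ctype'] == 'COMPANY']

# Drop rows where the customer name was not reported
comp2 = comp2[comp2['cnms'] != 'Not Reported']
# Optional check: inspect the rows that the filter above removes
# comp4 = comp[comp['cnms'] == 'Not Reported']
# comp4

# Drop rows with missing customer sales
comp3 = comp2.dropna(subset=['salecs'])
comp3
```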
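
To see how much each filter choice changes the sample, the firm counts under both variants can be compared side by side. This is a sketch under assumptions: the `clean` helper is hypothetical, firms are identified by the `tic` column (our guess at a reasonable firm key), and the toy rows and `ctype` labels are illustrative rather than taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Compustat frame; values are illustrative only.
toy = pd.DataFrame({
    "tic":    ["AAA", "AAA", "BBB", "CCC", "CCC"],
    "ctype":  ["COMPANY", "GEOREG", "MARKET", "COMPANY", "COMPANY"],
    "cnms":   ["Cust 1", "Europe", "Retail", "Not Reported", "Cust 2"],
    "salecs": [5.0, 7.0, 3.0, 1.0, np.nan],
})


def clean(df: pd.DataFrame, company_only: bool = False) -> pd.DataFrame:
    """Apply the cleaning steps, optionally with the COMPANY-only filter."""
    out = df
    if company_only:
        out = out[out["ctype"] == "COMPANY"]
    out = out[out["cnms"] != "Not Reported"]
    return out.dropna(subset=["salecs"])


# Count distinct firms that survive each variant of the cleaning
n_company_only = clean(toy, company_only=True)["tic"].nunique()
n_all_types = clean(toy)["tic"].nunique()
print(n_company_only, n_all_types)
```

In the toy data, firm `BBB` reports only a market-level row, so it survives only when all company types are kept, which mirrors the AAPL case described above.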
- Additionally, we made sure the column names in the Compustat data matched those in the S&P 500 dataset before merging
    - We renamed `cik` to `CIK` to match
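
The rename and merge might look roughly like the sketch below. It rests on several assumptions: the merge key is `CIK`, an inner join is wanted, and rows with a missing `cik` are dropped first (a column with missing values is read as float by `pd.read_csv`, so the sketch casts it back to integer). The toy frames and their values are made up, and `Symbol` is assumed to be a column in the scraped table.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the cleaned Compustat frame and the S&P 500 table.
comp_toy = pd.DataFrame({
    "cik": [1001.0, 1002.0, 1003.0, np.nan],
    "conm": ["FIRM A", "FIRM B", "FIRM C", "FIRM D"],
})
sp500_toy = pd.DataFrame({
    "CIK": [1001, 1002],
    "Symbol": ["AAA", "BBB"],
})

# Assumption: rows without a cik cannot be matched, so drop them
comp_toy = comp_toy.dropna(subset=["cik"])
# Cast to integer so the key has the same dtype on both sides
comp_toy = comp_toy.astype({"cik": "int64"})
# Rename to match the S&P 500 column name
comp_toy = comp_toy.rename(columns={"cik": "CIK"})

# Inner join keeps only firms present in the S&P 500 list;
# validate raises if a CIK appears more than once in the S&P 500 table
merged = comp_toy.merge(sp500_toy, on="CIK", how="inner", validate="many_to_one")
print(merged)
```

Passing `validate="many_to_one"` is optional, but it guards against a duplicated key in the S&P 500 table silently multiplying rows.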