In [1]:
from eda import insufficient_but_starting_eda
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
comp = pd.read_csv('inputs/cust_supply_2019_2022.csv')
sp500 = pd.read_csv('inputs/sp500_2022.csv')
acct_raw = pd.read_csv("inputs/acct_data.csv")

# Data Sources
- [cust_supply_2019_2022.csv](inputs/cust_supply_2019_2022.csv) provided by Dr. Bowen
- S&P 500 constituent data (obtained by scraping Wikipedia)
- An accounting dataset provided by Dr. Bowen, containing the variables we selected after cleaning the Compustat dataset
    - Variables:
    - fyear (fiscal year)
    - sale (net sales)
    - rect (receivables, total)
    - invt (inventories)
    - ap (accounts payable - trade)
    - ib (income before extraordinary items)
    - ni (net income (loss))
    - oibdp (operating income before depreciation)
    - at (total assets)
    - capx (capex, dollar amount)
    - capxv (capital expenditures on property, plant and equipment, Schedule V; a dollar amount, not a ratio)
    - cogs (cost of goods sold)
    - gp (gross profit)
    - epsfx (diluted EPS excluding extraordinary items; it accounts for potentially dilutive securities, unlike basic EPS)
    - acominc (accumulated other comprehensive income (loss))


# Data
- We received the Compustat and accounting data from Dr. Bowen as .csv files
- To load the data into Python we used
    - `comp = pd.read_csv('inputs/cust_supply_2019_2022.csv')` for the Compustat dataset  
    - `acct_raw = pd.read_csv("inputs/acct_data.csv")` for the accounting dataset
- We then downloaded the S&P 500 constituent list from Wikipedia, saving a local copy so the page is only scraped once:
```python
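import os
import pandas as pd

os.makedirs("inputs", exist_ok=True)
sp500_file = 'inputs/sp500_2022.csv'

# Scrape the constituents table only if we have no local copy yet
if not os.path.exists(sp500_file):
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    # The first table on the page is the current constituent list
    pd.read_html(url)[0].to_csv(sp500_file, index=False)

sp500 = pd.read_csv(sp500_file)
sp500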
```

# EDA - Compustat Data
- We used the `eda.py` file from the community codebook to perform EDA on the raw Compustat data
```python
insufficient_but_starting_eda(comp, ['cnms', 'ctype', 'gareac', 'gareat', 'stype', 'srcdate', 'conm', 'tic', 'cusip'])
```
- We found that
    - there are 77,901 rows in this CSV
    - there are 9 categorical variables
    - there are 6 numerical variables
    - the unit of observation is a sale
    - the only variables with missing data are 
        - gareac (57.8%) 
        - gareat (57.8%)
        - stype (14.0%)
        - salecs (12.4%)
        - cik (0.6%)
        
- Although more than 50% of geographic area code (gareac) and geographic area type (gareat) values are missing, these variables are not used in our analysis and will be dropped.
- We are also not concerned about segment type (stype), since we will use the GICS sector to describe the seller instead.
- Sales (salecs) are often not reported when the buyer is also "Not Reported", so the 12.4% of missing sales values is not a concern.
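
The missing-data percentages above can also be checked directly in pandas. The following is a minimal sketch on a small toy frame (the values are illustrative, not the real file). The last line shows one way to check how often a missing `salecs` coincides with a "Not Reported" customer name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer-supplier file; values are illustrative only.
toy = pd.DataFrame({
    "gvkey": [1004, 1004, 1004, 1004],
    "cnms": ["Not Reported", "U.S. Government", "U.S. Government", "Europe/Africa"],
    "gareac": [np.nan, "USA", "USA", np.nan],
    "salecs": [np.nan, 455.9, 90.3, 5.8],
})

# Share of missing values per column, in percent, largest first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

# Show only the columns that actually have gaps.
print(missing_pct[missing_pct > 0].round(1))

# Among rows with missing sales, the share whose customer name is "Not Reported".
share_not_reported = (toy.loc[toy["salecs"].isna(), "cnms"] == "Not Reported").mean()
print(share_not_reported)
```

On the real data, replacing `toy` with `comp` should give the same kind of summary that `insufficient_but_starting_eda` prints.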

In [3]:
insufficient_but_starting_eda(comp, ['cnms', 'ctype', 'gareac', 'gareat', 'stype', 'srcdate', 'conm', 'tic', 'cusip'])

   gvkey  cid             cnms   ctype  gareac gareat  salecs  sid   stype  \
0   1004   31        All Other  MARKET     NaN    NaN     NaN    0     NaN   
1   1004   18  U.S. Government  GOVDOM     USA    ISO   455.9   20  BUSSEG   
2   1004   26  U.S. Government  GOVDOM     USA    ISO    90.3   22  BUSSEG   
3   1004   36    Europe/Africa  GEOREG  EUROPE    REG     5.8   22  BUSSEG   
4   1004   34            Other  GEOREG   OTHER    REG   170.4   20  BUSSEG   

      srcdate      conm  tic      cusip     cik   sic  
0  2019-05-31  AAR CORP  AIR  000361105  1750.0  5080  
1  2019-05-31  AAR CORP  AIR  000361105  1750.0  5080  
2  2019-05-31  AAR CORP  AIR  000361105  1750.0  5080  
3  2019-05-31  AAR CORP  AIR  000361105  1750.0  5080  
4  2019-05-31  AAR CORP  AIR  000361105  1750.0  5080   
---
        gvkey  cid                     cnms   ctype   gareac gareat    salecs  \
77896  350681    1  Large Corporate Clients  MARKET      NaN    NaN    76.814   
77897  353444    4        Re

# EDA - Accounting Data
- We also ran `eda.py` on the accounting dataset provided by Dr. Bowen
```python
insufficient_but_starting_eda(acct_raw)
```
- We found that
    - there are 26,905 observations
    - the unit of observation is a firm-year
    - all variables are numerical, as we requested
        - there are 16 variables
    - the variables with missing data are
        - capxv (16.3%)
        - oibdp (11.4%)
        - invt (8.9%)
        - capx (8.7%)
        - rect (8.5%)
        - acominc (8.3%)
        - ap (8.3%)
        - epsfx (8.1%)
        - ib (8.0%)
        - ni (8.0%)
        - cogs (8.0%)
        - gp (8.0%)
        - sale (8.0%)
        - at (7.7%)
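
One way to confirm the firm-year unit of observation is to check that no (`gvkey`, `fyear`) pair appears twice. This is a small sketch using a few rows copied from the output below, not the full dataset.

```python
import pandas as pd

# A few rows taken from the head of the accounting data shown in this notebook.
toy_acct = pd.DataFrame({
    "gvkey": [1004, 1004, 1004, 1045],
    "fyear": [2018, 2019, 2020, 2018],
    "sale": [2051.8, 2089.3, 1651.4, 44541.0],
})

# The data is at the firm-year level if no (gvkey, fyear) pair is repeated.
is_firm_year = not toy_acct.duplicated(subset=["gvkey", "fyear"]).any()

n_obs = len(toy_acct)                  # number of firm-year observations
n_firms = toy_acct["gvkey"].nunique()  # number of distinct firms

print(is_firm_year, n_obs, n_firms)
```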

In [5]:
insufficient_but_starting_eda(acct_raw)

   gvkey  fyear  acominc      ap       at    capx   capxv     cogs  epsfx  \
0   1004   2018    -40.9   187.8   1517.2    17.4    17.4   1679.5   2.40   
1   1004   2019    -44.6   191.6   2079.0    23.6    23.6   1728.7   0.71   
2   1004   2020    -18.3   127.2   1539.7    11.3    11.3   1364.6   1.30   
3   1004   2021    -19.6   156.4   1573.9    17.3    17.3   1470.3   2.16   
4   1045   2018  -5274.0  1773.0  60580.0  3745.0  3745.0  31365.0   3.03   

        gp      ib    invt      ni   oibdp    rect     sale  
0    372.3    84.1   589.0     7.5   153.5   258.1   2051.8  
1    360.6    24.8   692.7     4.4   150.1   229.1   2089.3  
2    286.8    46.3   591.0    35.8   101.8   238.6   1651.4  
3    346.8    78.5   604.1    78.7   149.3   290.3   1817.1  
4  13176.0  1412.0  1522.0  1412.0  5606.0  1706.0  44541.0   
---
        gvkey  fyear   acominc        ap         at     capx    capxv  \
26900  349972   2022     0.096     1.378     28.064    0.000    0.000   
26901  350681 

# Cleaning 
- Initially, we filtered the Compustat data to keep only rows whose customer type (`ctype`) is a company. That query left us with only about 150 firms, so we decided to keep all customer types (`ctype`), which roughly doubled the amount of data (about 350 firms).
- This also gave us a more holistic view of the firms' financials
    - for example, AAPL did not report sales to specific companies, so under the initial method it would not have been included in our list of firms
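
The trade-off between the two filters can be sketched by counting distinct firms under each. The frame below is toy data, and the label `"COMPANY"` is an assumption about how company-type customers are coded in `ctype`.

```python
import pandas as pd

# Toy data; gvkey values and the "COMPANY" label are illustrative assumptions.
toy = pd.DataFrame({
    "gvkey": [1004, 1004, 2001, 2001, 3002],
    "ctype": ["GOVDOM", "GEOREG", "COMPANY", "MARKET", "MARKET"],
})

# Firms that survive when only company-type customers are kept.
firms_company_only = toy.loc[toy["ctype"] == "COMPANY", "gvkey"].nunique()

# Firms available when every customer type is kept.
firms_all_types = toy["gvkey"].nunique()

print(firms_company_only, firms_all_types)
```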

The code we used to clean/filter the comp data is below:

```python
# Drop rows where the customer name is not reported
comp2 = comp[comp['cnms'] != 'Not Reported']
# Drop rows with no customer sales figure
comp3 = comp2.dropna(subset=['salecs'])
```

# Merging
- Before merging, we made sure the key column in the Compustat data had the same name as in the S&P 500 dataset
    - We renamed `cik` to `CIK` to match
```python
comp3 = comp3.rename(columns = {'cik': 'CIK'})
merged = comp3.merge(sp500, on='CIK', how = 'inner')
```
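
Because `cik` is stored as a float in the Compustat file (for example `1750.0`) and has some missing values, it can be worth checking how many rows actually match before relying on the inner merge. This is a hedged sketch on toy frames, where the CIK values and symbols are made up. It uses the `indicator` and `validate` options of `DataFrame.merge`.

```python
import numpy as np
import pandas as pd

# Toy frames; CIK values and symbols are hypothetical.
left = pd.DataFrame({"CIK": [1.0, 1.0, 3.0, np.nan],
                     "salecs": [455.9, 90.3, 12.0, 7.0]})
right = pd.DataFrame({"CIK": [1, 2], "Symbol": ["AAA", "BBB"]})

# Drop rows without a CIK and align the key dtype with the right-hand frame.
left = left.dropna(subset=["CIK"]).astype({"CIK": "int64"})

# An outer merge with indicator=True shows where each row came from;
# validate raises an error if CIK is not unique on the right-hand side.
check = left.merge(right, on="CIK", how="outer",
                   indicator=True, validate="many_to_one")
counts = check["_merge"].value_counts()
print(counts)

# Rows flagged "both" are exactly what how='inner' would keep.
inner_rows = check[check["_merge"] == "both"]
```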

# Finalizing the Dataset
- Since we are only looking at firms that filed in 2019 and 2022, we dropped every row dated in 2020 or 2021 so that the dataset covers only those two years
```python
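# Rows dated inside this window (2020 and 2021) are excluded
start_date = '2020-01-01'
end_date = '2021-12-31'
# NOTE: `srcdate` is assumed here to be the date column in `merged`
filtered_df = merged.query('@start_date <= srcdate <= @end_date')
# get the indices of the rows inside the excluded window
filtered_indices = filtered_df.index
# drop those rows, keeping only the 2019 and 2022 observations
filtered_out_df = merged.drop(filtered_indices)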
```
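
An equivalent way to express this filter is to parse the date column and keep the wanted years directly, which avoids relying on string comparison of dates. This is a sketch on toy data, and it assumes the date column is `srcdate`.

```python
import pandas as pd

# Toy data; assumes the date column is called `srcdate`.
toy = pd.DataFrame({
    "gvkey": [1, 1, 1, 1],
    "srcdate": ["2019-05-31", "2020-05-31", "2021-05-31", "2022-05-31"],
})

# Parse the strings into real timestamps.
toy["srcdate"] = pd.to_datetime(toy["srcdate"])

# Keep only the years of interest.
keep_years = [2019, 2022]
kept = toy[toy["srcdate"].dt.year.isin(keep_years)]

print(kept)
```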


# Caveats/Going Forward
- For this dataset, we assume that any filing made in 2019 contains information for the 2019 fiscal year
    - We know this can lead to inaccuracies when a firm does not file for the 2019 fiscal year during 2019.
    - For instance, if a firm files in January 2020, our analysis assigns the data to fiscal year 2020, when in reality it corresponds to fiscal year 2019
    - TLDR: we treat the fiscal year as equal to the filing year
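
The effect of this assumption can be illustrated with a toy example (all values are made up). The fiscal year is taken from the calendar year of the date, so a January 2020 row is labelled 2020 and fails to match a 2019 accounting record.

```python
import pandas as pd

# Toy customer-supplier rows; firm 2 has a date in January 2020.
toy = pd.DataFrame({
    "gvkey": [1, 2],
    "srcdate": pd.to_datetime(["2019-12-31", "2020-01-31"]),
})

# Current simplifying assumption: fiscal year = calendar year of the date.
toy["fyear"] = toy["srcdate"].dt.year

# Toy accounting data in which both firms report for fiscal year 2019.
toy_acct = pd.DataFrame({
    "gvkey": [1, 2],
    "fyear": [2019, 2019],
    "sale": [10.0, 20.0],
})

# Firm 2 is labelled 2020, so it finds no 2019 accounting row to match.
linked = toy.merge(toy_acct, on=["gvkey", "fyear"], how="left")
print(linked)
```

Rows left with a missing `sale` after such a merge are the cases the correction listed under Going Forward would need to address.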
    
Going Forward
- Correcting the fiscal-year assumption described above
- Adding years 2020 and 2021 to the analysis