# Data Collection and Preparation

# Mary Donovan Martello

## The first part of this project consisted of collecting, cleaning and preparing data from three different sources: a flat file, a website (collected by scraping the website), and an API.  The final product included joining all three sources on the same key, storing them in a SQLite database and executing SQL commands.  This notebook includes pulling data via an API and cleaning the data.

# Part 3: API Data Source / Cleaning

In [1]:

%matplotlib inline

# import libraries
import urllib.request, urllib.parse
from urllib.error import HTTPError,URLError
import requests
import pandas as pd
import json
import re
import numpy as np
import matplotlib.pyplot as plt
import datetime

## This file uses mutual fund data from the FMP Finance API: https://financialmodelingprep.com/developer/docs/

### Designate which mutual fund data to pull.

In [2]:
tickers3 = ['ACAMU', 'ACTTW', 'CBSHP', 'CENTA', 'CFFAU', 'CHNGU', 'CHSCL', 'CHSCM', 'CHSCN', 'CHSCO', 'CHSCP', 'CMCSA', 'CNBKA', 
            'COWNL', 'CRSAU', 'CYCCP', 'DISCB', 'DISCK', 'DPHCW', 'FCNCA', 'FITBI', 'DGICA', 'DGICB', 'DISCA', 'ETON', 'MARPS', 
            'MINDP', 'NGHCP', 'PACQU', 'PTVCA', 'PTVCB', 'RUSHA', 'RUSHB', 'RYAAY', 'QRTEB', 'RBCAA', 'SENEA', 'SENEB', 'TOTAR', 
            'TZACW', 'WHLRP', 'WTFCM', 'AACG', 'AAL', 'AAMC', 'AAME', 'AAN', 'AAOI', 'AAON', 'AAP', 'AAPL', 'AAT', 'AAU', 'AAWW', 
            'AAXN', 'ABB', 'ABBV', 'ABC', 'ABCB', 'ABEO', 'ABEV', 'ABG', 'ABIO', 'ABM', 'ABMD', 'ABR', 'ABT', 'ABTX', 'ABUS', 
            'ACA', 'ACAD', 'ACAMU', 'ACB', 'ACBI', 'ACC', 'ACCO', 'ACEL', 'ACER', 'ACGL', 'ACH', 'ACHC', 'ACHV', 'ACIA', 'ACIU', 
            'ACIW', 'ACLS', 'ACM', 'ACMR', 'ACN', 'ACNB', 'ACOR', 'ACRE', 'ACRS', 'ACRX', 'ACST', 'ACTG', 'ACU', 'ACY', 'ADAP', 
            'ADBE', 'ADC', 'ADES', 'ADI', 'ADIL', 'ADM', 'ADMA', 'ADMP', 'ADMS', 'ADNT', 'ADP', 'ADPT', 'ADRO', 'ADS', 'ADSK', 
            'ADSW', 'ADT', 'ADTN', 'ADUS', 'ADXS', 'AEE', 'AEG', 'AEGN', 'AEHR', 'AEIS', 'AEL', 'AEM', 'AEMD', 'AEO', 'AEP', 
            'AER', 'AERI', 'AES', 'AESE', 'AEY', 'AEYE', 'AEZS', 'AFG', 'AFH', 'AFI', 'AFIN', 'AFL', 'AFMD', 'AGCO', 'AGE', 
            'AGEN', 'AGFS', 'AGI', 'AGIO', 'AGLE', 'AGM', 'AGMH', 'AGNC', 'AGO', 'AGR', 'AGRO', 'AGRX', 'AGS', 'AGTC', 'AGX', 
            'AGYS', 'AHC', 'AHH', 'AHI', 'AHPI', 'AHT', 'AIG', 'AIH', 'AIHS', 'AIM', 'AIMC', 'AIMT', 'AINC', 'AINV', 'AIRG', 
            'FB', 'GOOG']

### Set variables for API pull.

In [3]:
apiKey = '' # financial modeling api deleted for privacy purposes
httpVar = 'https://financialmodelingprep.com/api/v3/key-metrics/' # original api
param = '?apikey=' # financial modeling api
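
The three variables above are concatenated into one request URL per ticker. A minimal sketch of that assembly follows, assuming a hypothetical helper named `build_url` and a placeholder key `'DEMO_KEY'` (neither appears in the original notebook). It uses `urllib.parse` to escape the ticker and encode the query string rather than joining raw strings.

```python
from urllib.parse import quote, urlencode

# base endpoint copied from the cell above
httpVar = 'https://financialmodelingprep.com/api/v3/key-metrics/'

def build_url(ticker, api_key, base=httpVar):
    """Hypothetical helper: return the key-metrics URL for one ticker."""
    # quote() escapes characters that are unsafe in a URL path;
    # urlencode() builds the 'apikey=...' query string
    return base + quote(ticker) + '?' + urlencode({'apikey': api_key})

# 'DEMO_KEY' is a placeholder, not a real API key
print(build_url('AAPL', 'DEMO_KEY'))
```

Keeping the key out of the notebook text (for example, by reading it from an environment variable) avoids having to delete it before sharing.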

**Below are two functions to get the JSON data from the API and convert it to a DataFrame. Sometimes they work, and other times a later cell reports that a variable is not defined. I do not know whether this is an issue with the API, Anaconda, or my system. The code has worked, however, so I do have the data saved in a DataFrame.**

In [4]:
# function to request each ticker's key metrics and return the parsed JSON as a list

def get_jsonparsed_data(api, http, param):
    """
    Request ``http + ticker + param + api`` for each ticker in ``tickers3``,
    parse each response as JSON and return the results.
    Parameters
    ----------
    api : str
    http : str
    param : str
    Returns
    -------
    list
    """
    data = []
    for i in tickers3:
        url = http + i + param + api # financial modeling api
        try:
            resp = urllib.request.urlopen(url)
            data.append(json.load(resp))
        except URLError as e:
            # HTTPError is a subclass of URLError, so both are caught here
            print(f"ERROR for {i}: {e.reason}")
    return data

# assign the result so that later cells can use it, then display it
data = get_jsonparsed_data(apiKey, httpVar, param)
data

[[{'symbol': 'ACAMU',
   'date': '2019-12-31',
   'revenuePerShare': 0.0,
   'netIncomePerShare': 0.11411605841221746,
   'operatingCashFlowPerShare': -0.06743312901905195,
   'freeCashFlowPerShare': -0.06743312901905195,
   'cashPerShare': 0.052521605167326485,
   'bookValuePerShare': 0.16404490570289276,
   'tangibleBookValuePerShare': 10.22516287497235,
   'shareholdersEquityPerShare': 0.16404490570289276,
   'interestDebtPerShare': 7.047782979741737,
   'marketCap': 318206126.15999997,
   'enterpriseValue': 531418293.15999997,
   'peRatio': 91.48580966832864,
   'priceToSalesRatio': None,
   'pocfratio': -154.82004397349522,
   'pfcfRatio': -154.82004397349522,
   'pbRatio': 63.64111067800078,
   'ptbRatio': 63.64111067800078,
   'evToSales': None,
   'enterpriseValueOverEBITDA': -569.6815222858514,
   'evToOperatingCashFlow': -258.55631539281546,
   'evToFreeCashFlow': -258.55631539281546,
   'earningsYield': 0.010930656936036155,
   'freeCashFlowYield': -0.006459111975004976,
   

In [6]:
def makeDataframe(data):
    # convert each response (a list of records) to its own DataFrame
    dfList = [pd.DataFrame(x) for x in data]
    # Concatenating the dataframes 
    combApiDf1 = pd.concat(dfList)
    return combApiDf1



In [None]:
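# assign the returned DataFrame; the name used inside the function is local to it
combApiDf1 = makeDataframe(data)

In [None]:
# replace the repeated per-ticker index with a unique 0..n-1 index
combApiDf1 = combApiDf1.reset_index(drop=True)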
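
One possible explanation for the intermittent "not defined" error described above is variable scope: names assigned inside a function are local to it, so a later cell can only use the result if the call's return value was assigned to a variable, or if a variable of that name was already created earlier in the same session. A minimal sketch of that behavior, with a hypothetical function `load_values` standing in for the API call:

```python
def load_values():
    # 'values' is local to the function and is gone once the function returns
    values = [1, 2, 3]
    return values

# calling without assignment discards the result
load_values()
try:
    print(values)
except NameError as err:
    print("NameError:", err)

# assigning the return value keeps it for later cells
values = load_values()
print(values)
```

This would also explain why the same cells succeed in a session where the standalone loop was run first, since that loop defines the names at the top level.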

**As mentioned above, the two functions have worked to return data from the API and convert it to a DataFrame. At other times, however, I get an error that the returned variable is not defined, and I have not determined whether the cause is the API, Anaconda, or my system. Therefore, I am including the code below, outside of functions, that did produce the JSON data and the DataFrame. I then exported the DataFrame to a csv file to use for the data wrangling.**

In [9]:
data = []
for i in tickers3:
    url = httpVar + i + param + apiKey # financial modeling api
    resp  = urllib.request.urlopen(url)
    data.append(json.load(resp))

In [10]:
# Each response is a list of records for one ticker, so convert each response to its own dataframe and collect the dataframes in a list.
dfList = []
for x in data:
    dfList.append(pd.DataFrame(x))

In [11]:
len(dfList)

176
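
The shape of the parsed JSON drives the conversion above: each response is a JSON array of records, and a list of dicts becomes a DataFrame with one row per dict. A small sketch with made-up stand-in responses (the symbols and values below are illustrative, not API data):

```python
import json
import pandas as pd

# stand-ins for two API responses: each is a JSON array of yearly records
raw_responses = [
    '[{"symbol": "AAA", "date": "2019-12-31", "roe": 0.1},'
    ' {"symbol": "AAA", "date": "2018-12-31", "roe": 0.2}]',
    '[{"symbol": "BBB", "date": "2019-12-31", "roe": 0.3}]',
]
data = [json.loads(text) for text in raw_responses]

# a list of dicts becomes a frame with one row per dict
frames = [pd.DataFrame(records) for records in data]
print([frame.shape for frame in frames])
```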

In [12]:
# Concatenating the dataframes 
combApiDf1 = pd.concat(dfList)

In [13]:
combApiDf1.shape

(1868, 59)

In [14]:
combApiDf1

Unnamed: 0,symbol,date,revenuePerShare,netIncomePerShare,operatingCashFlowPerShare,freeCashFlowPerShare,cashPerShare,bookValuePerShare,tangibleBookValuePerShare,shareholdersEquityPerShare,...,averagePayables,averageInventory,daysSalesOutstanding,daysPayablesOutstanding,daysOfInventoryOnHand,receivablesTurnover,payablesTurnover,inventoryTurnover,roe,capexPerShare
0,ACAMU,2019-12-31,0,0.114116,-0.0674331,-0.0674331,0.0525216,0.164045,10.2252,0.164045,...,,,,,,,,,0.695639,0
0,ACTTW,2019-12-31,0,0.467505,-0.0471785,-0.0471785,0.119586,0.594467,36.3091,0.594467,...,,,,,,,,,0.786427,0
0,CBSHP,2019-12-31,7.218,3.70202,4.50673,4.13256,4.3206,27.5494,227.777,27.5494,...,,,0,,,,,,0.134378,0.374174
1,CBSHP,2018-12-31,7.43444,3.91241,4.98737,4.68691,4.58337,26.4529,228.46,26.4529,...,,,0,,,,,,0.147901,0.300455
2,CBSHP,2017-12-31,6.95068,3.02575,4.10427,3.81225,4.15365,25.736,233.877,25.736,...,,,0,,,,,,0.117569,0.292018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,GOOG,2019-12-31,466.895,99.0663,157.269,89.3422,53.3596,581.082,730.69,581.082,...,4969500000,1.053e+09,57.1121,28.232,5.0717,6.39094,12.9286,71.968,0.170486,67.9269
1,GOOG,2018-12-31,391.215,87.8854,137.167,65.285,47.7543,507.903,608.141,507.903,...,4378000000,1.107e+09,55.5907,26.8345,6.78525,6.56584,13.6019,53.7931,0.173036,71.8816
2,GOOG,2017-12-31,318.411,36.3693,106.537,68.6685,30.7769,438.034,510.859,438.034,...,3757500000,9.28e+08,60.3729,25.1191,5.99752,6.04576,14.5308,60.8585,0.0830284,37.8686
3,GOOG,2016-12-31,261.884,56.5068,104.542,74.9169,37.4758,403.351,428.55,403.351,...,3209500000,6.875e+08,57.1606,21.2011,2.78388,6.38551,17.2161,131.112,0.140093,29.6256


In [15]:
# the index restarts at 0 for each ticker after concatenation; build a unique 0..n-1 index
# https://note.nkmk.me/en/python-pandas-dataframe-rename/
indexList = list(range(len(combApiDf1)))
len(indexList)

1868

In [16]:
combApiDf1.index = indexList

In [17]:
combApiDf1.index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867],
           dtype='int64', length=1868)
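
The index reset above can also be done at concatenation time. A minimal sketch on two stand-in frames (made-up values), using the `ignore_index` parameter of `pd.concat`:

```python
import pandas as pd

# two small stand-in frames, shaped like per-ticker API responses
df_a = pd.DataFrame({'symbol': ['AAA', 'AAA'], 'date': ['2019-12-31', '2018-12-31']})
df_b = pd.DataFrame({'symbol': ['BBB'], 'date': ['2019-12-31']})

# default concat keeps each frame's own index, so labels repeat
stacked = pd.concat([df_a, df_b])
# ignore_index=True renumbers the rows 0..n-1 in one step
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```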

In [None]:
# export DataFrame to a csv file
#combApiDf1.to_csv(r'C:\dev\code\540FinalProject\api_dataframe.csv', index=False, header=True)
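
The `index=False` argument in the export matters when the file is read back. A small sketch of the round trip, using an in-memory buffer in place of a file on disk and a made-up two-row frame:

```python
import io
import pandas as pd

df = pd.DataFrame({'symbol': ['AAA', 'BBB'], 'roe': [0.12, 0.34]})

# write without the index, then read the same text back
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=True)
buffer.seek(0)
restored = pd.read_csv(buffer)

# writing with the index adds an unnamed first column on re-import
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(with_index.columns))
```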

## Data Cleaning

In [2]:
# import file
apDf = pd.read_csv('api_dataframe.csv')
apDf.head()

Unnamed: 0,symbol,date,revenuePerShare,netIncomePerShare,operatingCashFlowPerShare,freeCashFlowPerShare,cashPerShare,bookValuePerShare,tangibleBookValuePerShare,shareholdersEquityPerShare,...,averagePayables,averageInventory,daysSalesOutstanding,daysPayablesOutstanding,daysOfInventoryOnHand,receivablesTurnover,payablesTurnover,inventoryTurnover,roe,capexPerShare
0,ACAMU,2019-12-31,0.0,0.114116,-0.067433,-0.067433,0.052522,0.164045,10.225163,0.164045,...,,,,,,,,,0.695639,0.0
1,ACTTW,2019-12-31,0.0,0.467505,-0.047178,-0.047178,0.119586,0.594467,36.309078,0.594467,...,,,,,,,,,0.786427,0.0
2,CBSHP,2019-12-31,7.218001,3.702023,4.506732,4.132558,4.320599,27.549427,227.776612,27.549427,...,,,0.0,,,,,,0.134378,0.374174
3,CBSHP,2018-12-31,7.434439,3.91241,4.987366,4.686911,4.583366,26.452893,228.460158,26.452893,...,,,0.0,,,,,,0.147901,0.300455
4,CBSHP,2017-12-31,6.95068,3.02575,4.104268,3.81225,4.153654,25.735967,233.876898,25.735967,...,,,0.0,,,,,,0.117569,0.292018


### Explore Data

In [3]:
apDf.shape

(1868, 59)
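
Alongside the shape, the column dtypes give a quick overview of a frame without indexing into a particular row. A minimal sketch on a stand-in frame (the columns and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'symbol': ['AAA', 'BBB'],
    'date': ['2019-12-31', '2018-12-31'],
    'roe': [0.12, 0.34],
})

# one dtype per column
print(df.dtypes)

# names of the numeric columns only
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```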

In [4]:

# print the Python type of each column's value in row 2
for count, c in enumerate(apDf.columns):
    print(type(apDf.iloc[2, count]))

<class 'str'>
<class 'str'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'nump

In [5]:
for c in apDf.columns:
    miss = apDf[c].isnull().sum()
    if miss>0:
        print("{} has {} missing value(s)".format(c,miss))

revenuePerShare has 159 missing value(s)
netIncomePerShare has 159 missing value(s)
operatingCashFlowPerShare has 159 missing value(s)
freeCashFlowPerShare has 159 missing value(s)
cashPerShare has 159 missing value(s)
bookValuePerShare has 159 missing value(s)
tangibleBookValuePerShare has 159 missing value(s)
shareholdersEquityPerShare has 159 missing value(s)
interestDebtPerShare has 159 missing value(s)
marketCap has 159 missing value(s)
enterpriseValue has 5 missing value(s)
peRatio has 211 missing value(s)
priceToSalesRatio has 312 missing value(s)
pocfratio has 267 missing value(s)
pfcfRatio has 78 missing value(s)
pbRatio has 263 missing value(s)
ptbRatio has 263 missing value(s)
evToSales has 169 missing value(s)
enterpriseValueOverEBITDA has 26 missing value(s)
evToOperatingCashFlow has 123 missing value(s)
evToFreeCashFlow has 78 missing value(s)
earningsYield has 159 missing value(s)
freeCashFlowYield has 159 missing value(s)
debtToEquity has 128 missing value(s)
debtToAsse
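
The same per-column counts can be produced without an explicit loop. A sketch on a made-up frame, assuming only the standard `isna` and `sum` methods:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'symbol': ['AAA', 'BBB', 'CCC'],
    'peRatio': [10.0, np.nan, np.nan],
    'roe': [0.1, 0.2, 0.3],
})

# count missing values per column, then keep only columns with at least one
missing = df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```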

### Rename Column Names

In [14]:
# Replace Headers - symbol needs to be fund_name for merging data frames
apDf.rename({'symbol': 'fund_name', 'date': 'valuation_date'}, axis=1, inplace=True)
apDf.columns

Index(['fund_name', 'valuation_date', 'revenuePerShare', 'netIncomePerShare',
       'operatingCashFlowPerShare', 'freeCashFlowPerShare', 'cashPerShare',
       'bookValuePerShare', 'tangibleBookValuePerShare',
       'shareholdersEquityPerShare', 'interestDebtPerShare', 'marketCap',
       'enterpriseValue', 'peRatio', 'priceToSalesRatio', 'pocfratio',
       'pfcfRatio', 'pbRatio', 'ptbRatio', 'evToSales',
       'enterpriseValueOverEBITDA', 'evToOperatingCashFlow',
       'evToFreeCashFlow', 'earningsYield', 'freeCashFlowYield',
       'debtToEquity', 'debtToAssets', 'netDebtToEBITDA', 'currentRatio',
       'interestCoverage', 'incomeQuality', 'dividendYield', 'payoutRatio',
       'salesGeneralAndAdministrativeToRevenue',
       'researchAndDdevelopementToRevenue', 'intangiblesToTotalAssets',
       'capexToOperatingCashFlow', 'capexToRevenue', 'capexToDepreciation',
       'stockBasedCompensationToRevenue', 'grahamNumber', 'roic',
       'returnOnTangibleAssets', 'grahamNetNet', 
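
The rename exists so that this frame shares a key column with the other sources. A minimal sketch of how such a join might look, where `otherDf` and its `category` column are hypothetical stand-ins for another source that already uses `fund_name`:

```python
import pandas as pd

apiDf = pd.DataFrame({'symbol': ['AAA', 'BBB'], 'peRatio': [10.0, 20.0]})
# hypothetical second source that already uses 'fund_name' as its key
otherDf = pd.DataFrame({'fund_name': ['AAA', 'BBB'], 'category': ['x', 'y']})

# rename the key, then join the two frames on it
apiDf = apiDf.rename(columns={'symbol': 'fund_name'})
merged = apiDf.merge(otherDf, on='fund_name', how='inner')
print(merged)
```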

### Convert date datatype to datetime timestamp.

In [16]:
# inspect the date column before converting it to a datetime timestamp (dtype is object)
apDf['valuation_date']

0       2019-12-31
1       2019-12-31
2       2019-12-31
3       2018-12-31
4       2017-12-31
           ...    
1863    2019-12-31
1864    2018-12-31
1865    2017-12-31
1866    2016-12-31
1867    2015-12-31
Name: valuation_date, Length: 1868, dtype: object

In [17]:
apDf['valuation_date'] = pd.to_datetime(apDf['valuation_date'])
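
If some date strings might be malformed, the conversion can be made explicit about format and failure handling. A sketch on made-up values, using the `format` and `errors` parameters of `pd.to_datetime`:

```python
import pandas as pd

dates = pd.Series(['2019-12-31', '2018-12-31', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum())
```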

In [18]:
print(type(apDf.iloc[2, 1]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


### Impute missing values with mean.

In [28]:
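# fill all missing numerical values with the column mean
# https://stackoverflow.com/questions/18689823/pandas-dataframe-replace-nan-values-with-average-of-columns
# numeric_only=True skips the string and datetime columns; pandas 2.0+ raises a TypeError on the string column without it
apDf = apDf.fillna(apDf.mean(numeric_only=True))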

In [29]:
for c in apDf.columns:
    miss = apDf[c].isnull().sum()
    if miss>0:
        print("{} has {} missing value(s)".format(c,miss))

investedCapital has 1868 missing value(s)
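
The remaining 1868 missing values are expected: a column that is entirely empty has no mean, so there is nothing to fill it with. A small sketch of that behavior on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fund_name': ['AAA', 'BBB', 'CCC'],
    'peRatio': [10.0, np.nan, 20.0],
    'investedCapital': [np.nan, np.nan, np.nan],
})

# numeric_only=True skips the string column when computing means
means = df.mean(numeric_only=True)
filled = df.fillna(means)

# the partly filled column is imputed; the all-NaN column stays empty
print(filled['peRatio'].tolist())
print(filled['investedCapital'].isna().sum())
```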


### Create new derived features.

In [36]:
# Calculate values for investedCapital and replace all Series values
apDf['investedCapital'] = apDf['enterpriseValue'] - apDf['marketCap'] + apDf['workingCapital']

In [37]:
apDf['investedCapital']

0       2.148150e+08
1       6.571400e+04
2       4.115269e+10
3       4.113641e+10
4       4.120586e+10
            ...     
1863    9.377100e+10
1864    8.876300e+10
1865    9.337900e+10
1866    7.966900e+10
1867    5.947500e+10
Name: investedCapital, Length: 1868, dtype: float64

In [38]:
# create a new Series with calculated value of share price
apDf['sharePrice'] = apDf['pocfratio'] * apDf['operatingCashFlowPerShare']

In [39]:
apDf['sharePrice']

0         10.44
1          1.29
2         26.05
3         25.73
4         26.25
         ...   
1863    1455.84
1864    1089.06
1865    1163.69
1866     802.32
1867     742.95
Name: sharePrice, Length: 1868, dtype: float64
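
The derived price can be checked by hand for row 0 (ACAMU), using the per-share values printed in the raw API output earlier in this notebook. Because `pocfratio` is price divided by operating cash flow per share, multiplying the two recovers the price; the P/E ratio times net income per share gives an independent cross-check.

```python
# values copied from the first ACAMU record in the API output above
pocfratio = -154.82004397349522
operating_cash_flow_per_share = -0.06743312901905195
pe_ratio = 91.48580966832864
net_income_per_share = 0.11411605841221746

# price = (price / operating cash flow per share) * operating cash flow per share
share_price = pocfratio * operating_cash_flow_per_share
# cross-check: price = (price / earnings per share) * earnings per share
share_price_from_pe = pe_ratio * net_income_per_share

print(round(share_price, 2))
print(round(share_price_from_pe, 2))
```

Both results round to the 10.44 shown for row 0 in the `sharePrice` column.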

### Subset the dataframe.

In [41]:
apSubDf = apDf.iloc[:, [0, 1, 11, 12, 13, 14, 16, 23, 26, 42, 46]]
apSubDf.shape

(1868, 11)

In [42]:
apSubDf.head()

Unnamed: 0,fund_name,valuation_date,marketCap,enterpriseValue,peRatio,priceToSalesRatio,pfcfRatio,earningsYield,debtToAssets,returnOnTangibleAssets,netCurrentAssetValue
0,ACAMU,2019-12-31,318206100.0,531418300.0,91.48581,3505604.0,-154.820044,0.010931,0.983957,0.01116,-304840400.0
1,ACTTW,2019-12-31,10850080.0,9864030.0,2.759329,3505604.0,-27.342988,0.362407,0.037001,0.012876,-10228240.0
2,CBSHP,2019-12-31,2964073000.0,2472458000.0,7.036693,3.609033,6.303602,0.142112,0.87974,0.016253,-13867860000.0
3,CBSHP,2018-12-31,2851193000.0,2343301000.0,6.576509,3.46092,5.489756,0.152056,0.884884,0.017125,-13486610000.0
4,CBSHP,2017-12-31,2770819000.0,2332380000.0,8.675536,3.776609,6.885698,0.115267,0.890609,0.012937,-12904140000.0
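
Selecting by position works as long as the column order never changes. A sketch of the same idea using column names instead, on a stand-in frame that borrows a few of the real column names (the values are made up):

```python
import pandas as pd

# stand-in frame with a few of the real column names
df = pd.DataFrame({
    'fund_name': ['AAA'],
    'valuation_date': pd.to_datetime(['2019-12-31']),
    'revenuePerShare': [1.0],
    'marketCap': [100.0],
    'peRatio': [12.5],
})

keep = ['fund_name', 'valuation_date', 'marketCap', 'peRatio']
# .copy() makes the subset independent of the original frame
subset = df[keep].copy()
print(subset.shape)
```

Name-based selection raises a `KeyError` if a column is missing, whereas a positional list would silently pick up whichever column now sits at that position.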
