## Setup: Create a directory at the repository root named ./Private/Data and move the shared CSV dataset files into it. This directory is gitignored to preserve data integrity. Once this is done, the cells can be run in order to build a single pandas dataframe with all the company data, which is then saved as a single master file, MasterDataset.csv (tab-separated).
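
The setup step could be scripted roughly as follows. This is only a sketch that uses the standard library; it assumes the three file names listed in the first code cell and does not copy any data itself.

```python
from pathlib import Path

# Directory described in the setup note above.
data_dir = Path('./Private/Data')
data_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# File names taken from the first code cell of this notebook.
expected_files = ['Financials.csv', 'MarketData.csv', 'Profiles.csv']
missing = [name for name in expected_files if not (data_dir / name).is_file()]

if missing:
    print("Missing from {}: {}".format(data_dir, ", ".join(missing)))
else:
    print("All dataset files are in place.")
```

If any file is reported missing, the `pd.read_csv` calls further down will raise `FileNotFoundError`.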

## Please do NOT edit this notebook unless you want to modify the data retrieval process. For approaches that use this data to predict valuations or other metrics, please create separate notebooks in this same master directory. This notebook is intended for data retrieval and data cleaning alone.

## If you uncomment any commented lines, please re-comment them before committing your changes. These lines have been commented out to preserve data integrity, as this is a public repository.

## For more information about this project/access to the data, please contact LionBase at https://lionbase.nyc/

In [1]:
# Standard Imports
import pandas as pd
#import matplotlib.pyplot as plt
#import numpy as np
%matplotlib inline

path = './Private/Data/'
file_names = ['Financials.csv','MarketData.csv','Profiles.csv']

In [2]:
financial_frame = pd.read_csv(path+file_names[0])
market_frame = pd.read_csv(path+file_names[1])
profile_frame = pd.read_csv(path+file_names[2])

# Drop all the source columns; this data is not relevant
financial_frame.drop(['Source','Source.1','Source.2','Source.3','Source.4','Source.5','Source.6'],axis=1, inplace=True)
market_frame.drop(['Source','Source.1','Source.2', 'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Source.7'],axis=1, inplace=True)
profile_frame.drop(['Source','Source.1','Source.2', 'Source.3', 'Source.4', 'Source.5', 'Source.6'],axis=1, inplace=True)

#financial_frame.head()
#market_frame.head()
#profile_frame.head()
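
The `Source.1`, `Source.2`, ... names come from pandas renaming duplicate column headers when it reads a CSV. As a possible alternative to listing each one by hand, the columns could be selected by pattern. The sketch below uses a small hypothetical frame (`demo_frame`, with invented values) rather than the real data.

```python
import pandas as pd

# Hypothetical stand-in for one of the loaded frames; values are invented.
demo_frame = pd.DataFrame({
    'Current Sales': [1.0, 2.0],
    'Source': ['a', 'b'],
    'Current EBITDA': [0.5, 0.7],
    'Source.1': ['a', 'b'],
})

# Select 'Source' and its numbered duplicates ('Source.1', 'Source.2', ...).
is_source = demo_frame.columns.str.match(r'Source(\.\d+)?$')
source_columns = demo_frame.columns[is_source]

demo_frame = demo_frame.drop(columns=source_columns)
print(list(demo_frame.columns))
```

This keeps working if a future export has more or fewer source columns, at the cost of being less explicit about what is removed.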

In [3]:
master_frame = pd.concat([financial_frame,market_frame, profile_frame], axis=1)
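
Concatenating with `axis=1` lines rows up by index, so it relies on the three files listing the same companies in the same order. A quick check of the row counts before concatenating may catch a mismatched export. The helper name `check_same_length` and the toy frames below are hypothetical.

```python
import pandas as pd

def check_same_length(frames):
    """Return True when every frame has the same number of rows."""
    lengths = {len(frame) for frame in frames}
    return len(lengths) == 1

# Toy frames with invented values, deliberately of different lengths.
left = pd.DataFrame({'current_sales': [10, 20, 30]})
right = pd.DataFrame({'country': ['US', 'GB']})

print(check_same_length([left, left]))
print(check_same_length([left, right]))

# With mismatched lengths, concat still succeeds and pads with NaN,
# which is why the mismatch is easy to miss.
combined = pd.concat([left, right], axis=1)
print(combined)
```

In this notebook the check would be called as `check_same_length([financial_frame, market_frame, profile_frame])`.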

In [4]:
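def clean_ticker(old_ticker):
    """Return the ticker text before the first hyphen, or "N/A" if missing."""
    old_ticker = str(old_ticker)
    # Missing values arrive as NaN, which str() renders as 'nan'
    if old_ticker == 'nan':
        return "N/A"
    hyphen_index = old_ticker.find('-')
    if hyphen_index == -1:
        return old_ticker
    return old_ticker[:hyphen_index]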
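
To illustrate what the function above does, here is a compact equivalent applied to a few made-up tickers. The name `clean_ticker_sketch` and the sample symbols are hypothetical and not taken from the dataset.

```python
def clean_ticker_sketch(old_ticker):
    """Same rule as clean_ticker: keep the text before the first hyphen."""
    text = str(old_ticker)
    if text == 'nan':  # a missing value (NaN) converted to a string
        return "N/A"
    return text.partition('-')[0]

# Invented examples: suffixed ticker, plain ticker, missing value, two hyphens.
examples = ['AAPL-US', 'MSFT', float('nan'), 'BRK-B-US']
cleaned = [clean_ticker_sketch(ticker) for ticker in examples]
print(cleaned)  # ['AAPL', 'MSFT', 'N/A', 'BRK']
```

Note the last case: everything after the first hyphen is dropped, so a symbol that legitimately contains a hyphen is shortened as well.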

In [5]:
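# Normalise column names: trim, lowercase, underscores for spaces, no parentheses.
# regex=False makes '(' and ')' literal characters on every pandas version.
master_frame.columns = (
    master_frame.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
)
master_frame.drop(['name', 'cusip'], axis=1, inplace=True)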
master_frame['quote_symbol'] = master_frame['quote_symbol'].apply(clean_ticker)
#master_frame.head()
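
The renaming rule in this cell can be traced on plain strings. The raw headers below are assumptions, chosen so that they produce names seen in the feature list; the real headers in the CSV files may differ.

```python
def normalize_column(name):
    """Apply the same steps as the column clean-up: strip, lowercase, replace."""
    return (name.strip().lower()
            .replace(' ', '_').replace('(', '').replace(')', ''))

# Assumed raw headers, for illustration only.
raw_headers = [' Current Sales ', 'Current Market Cap (USD)',
               'Current Price / Cash', 'Quote Symbol']
normalized = [normalize_column(header) for header in raw_headers]
print(normalized)
# ['current_sales', 'current_market_cap_usd', 'current_price_/_cash', 'quote_symbol']
```

Since the slash is kept, columns such as `current_price_/_cash` have to be accessed with brackets, for example `master_frame['current_price_/_cash']`, rather than as attributes.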

### List of features available in the dataset

In [6]:
feature_list = list(master_frame.columns.values)
for feature in feature_list:
    print(feature)
print('\n')
print("Number of elements:{}".format(len(master_frame)))

mastercsv = path+'MasterDataset.csv'

current_sales
current_ebitda
current_ebit
current_net_income
current_total_assets
current_total_liabilities
current_market_cap_usd
current_price_close
current_pe_ratio
actual_eps
current_price_/_cash
current_price_/_sales
dividend_yield
quote_symbol
sedol
country
exchange
primary_sic_code


Number of elements:4577


In [7]:
# Save the dataframe to disk. Note: the file is tab-separated (sep='\t') despite its .csv extension
master_frame.to_csv(mastercsv, sep='\t', encoding='utf-8')

## This dataset is now ready to be used in a separate notebook, in order to analyse and predict trends of companies.
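
Because the master file is written with `sep='\t'` and includes the index column, a separate notebook would likely need to pass matching arguments when loading it. The sketch below round-trips a tiny invented frame through an in-memory buffer instead of touching the real file.

```python
import io
import pandas as pd

# Tiny invented frame standing in for master_frame.
frame = pd.DataFrame({'quote_symbol': ['AAA', 'BBB'], 'country': ['US', 'GB']})

# Write with the same separator used in this notebook.
buffer = io.StringIO()
frame.to_csv(buffer, sep='\t')
buffer.seek(0)

# Read it back: same separator, and the first column is the saved index.
restored = pd.read_csv(buffer, sep='\t', index_col=0)
print(restored)
```

For the real file, the equivalent call would be `pd.read_csv('./Private/Data/MasterDataset.csv', sep='\t', index_col=0)`. Reading it with the default comma separator would put each row into a single column.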