# Project Part G: Better Models

![](banner_project.jpg)

In [1]:
analyst = "Firstname Lastname" # Replace this with your name

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Exploratory Data Analysis

* **Objective:** Conduct an exploratory data analysis of a dataset about public company fundamentals.
* **Approach:**  Retrieve a dataset of public company fundamentals covering thousands of US companies for all four quarters of 2017, plus stock price data for those companies from quarter 4 of 2018.  Transform the dataset representation to capture all information about any single company in a single observation.  Apply various descriptive statistics, data visualizations (including kernel density estimates), and cross-tabulations to look for interesting patterns and inter-company relationships.

### Representation

* **Objective:**  Transform the representation of a dataset about public company fundamentals.
* **Approach:**  Transform the dataset representation using variable filtration, imputation, principal component analysis, and/or other methods.

### Build & Tune

* **Objective:**  Construct and tune several classifiers and/or regressors, each trained on a transformed dataset about public company fundamentals.
* **Approach:**  Construct models to predict stock performance, given 12 months of past company fundamentals data, using machine learning model construction methods and the transformed data.  Tune the models by systematically trying various combinations of predictor variables and cutoffs, and identify the combination with the best business performance according to the business model and business parameters.  Select the method that produces the best-performing model as measured by cross-validation, and build a new model using that method based on all the original data.


### Deployment
* **Objective:** Recommend a portfolio of 12 company investments that maximizes the 12-month return on a total \\$1,000,000 investment.
* **Approach:** Retrieve an investment opportunities dataset, comprising fundamentals for some set of public companies over some one-year period.  Transform the representation of the investment opportunities to match the representation expected by the model, leveraging previous analysis.  Use the model to make predictions about the investment opportunities and accordingly recommend a portfolio of 12 company investments.

### Data Source

The data includes these files:

* Data Dictionary.csv
* Company Fundamentals 2017.csv
* Company Fundamentals 2018.csv

The dataset and accompanying data dictionary were sourced from ...

* Wharton Research Data Services > Compustat - Capital IQ from Standard & Poor's > North America - Daily > Fundamentals Quarterly (https://wrds-www.wharton.upenn.edu/)

  * Date Variable: Data Date
  * Date Range: 2017-01 to 2017-12 -or- 2018-01 to 2018-12
  * Company Codes: Search the entire database
    * Consolidation Level: C, Output
    * Industry Format: INDL, FS, Output
    * Data Format: STD, Output
    * Population Source: D, Output
    * Quarter Type: Fiscal View, Output
    * Currency: USD, Output (not CAD)
    * Company Status: Active, Output (not Inactive)
  * Variable Types: Data Items, Select All (674)
  * Query output:
    * Output format: comma-delimited text
    * Compression type: None
    * Date format: MMDDYY10

The dataset is restricted to selected active, publicly held US companies that reported quarterly measures, including stock prices, for the 1st, 2nd, 3rd, and 4th quarters of 2017 and 2018.  All non-missing stock prices exceed \\$3 per share.  All files are in comma-separated values (CSV) format.

The data dictionary was copied from the Variable Descriptions tab into Excel and saved in CSV format.

## Business Model


The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>
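
As a quick check of the three formulas above, here is a minimal worked sketch. The three-company portfolio, its growth rates, and its allocations are made-up illustration values, not part of the project data or a recommended allocation scheme.

```r
# Hypothetical portfolio (illustration only): three companies instead of twelve.
growth     = c(0.10, -0.05, 0.20)       # growth_i: 12-month price growth per company
allocation = c(400000, 300000, 300000)  # allocation_i: dollars invested per company

# budget = sum of allocations
budget = sum(allocation)

# profit = sum((1 + growth_i) * allocation_i) - budget
profit = sum((1 + growth) * allocation) - budget

# profit rate = profit / budget
profit.rate = profit / budget

print(c(budget=budget, profit=profit, profit.rate=profit.rate))
```

With these toy numbers the portfolio ends the year worth \\$1,085,000, so the profit is \\$85,000 and the profit rate is 0.085.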

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = ................... 

Fill the portfolio with companies that ...................

In [1]:
# Set the business parameters.


## Data

_<< Discuss this data retrieval. >>_

### Data Dictionary

Retrieve and present the data dictionary for the company fundamentals datasets.

In [None]:
# Retrieve the data dictionary.


### Data for Current Year

#### Retrieve Raw Data

Retrieve the company fundamentals data for calendar year 2017.

In [None]:
# Retrieve the 2017 data.


#### Partition Data by Calendar Quarter 

To partition the dataset by calendar quarter in which information is reported, first add a synthetic variable to indicate such.  Then partition into four new datasets, one for each quarter, and drop the quarter variables. Additionally, filter the observations to include only those with non-missing `prccq` $\geq$ 3.  Then remove any observations about companies that reported more than once per quarter.  Then change all the variable names (except for the `gvkey`, `tic`, and `conm` variables) by suffixing them with quarter information - e.g., in the Quarter 1 dataset, `prccq` becomes `prccq.q1`, etc.
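
The steps above can be sketched in base R roughly as follows. This is one possible reading, not the required solution: it assumes the report date is in a `datadate` column formatted as MM/DD/YYYY, uses a tiny made-up data frame in place of the real 2017 file, and interprets "remove companies that reported more than once per quarter" as dropping those companies from that quarter entirely. The helper name `partition.quarter` is hypothetical.

```r
# Toy stand-in for the raw 2017 data (the real file has hundreds of columns).
data = data.frame(gvkey=c(1, 1, 2, 2, 2, 3),
                  tic=c("AAA", "AAA", "BBB", "BBB", "BBB", "CCC"),
                  conm=c("A Co", "A Co", "B Co", "B Co", "B Co", "C Co"),
                  datadate=c("03/31/2017", "06/30/2017", "03/31/2017",
                             "03/31/2017", "06/30/2017", "03/31/2017"),
                  prccq=c(10, 12, 5, 6, 7, 2),
                  stringsAsFactors=FALSE)

# Synthetic variable: calendar quarter (1-4) derived from the month of datadate.
month = as.integer(format(as.Date(data$datadate, format="%m/%d/%Y"), "%m"))
data$quarter = (month - 1) %/% 3 + 1

# Hypothetical helper: build the dataset for one calendar quarter.
partition.quarter = function(data, q) {
  d = data[data$quarter == q, ]
  d$quarter = NULL                                   # drop the quarter variable
  d = d[!is.na(d$prccq) & d$prccq >= 3, ]            # non-missing prices of at least 3
  dup = d$gvkey[duplicated(d$gvkey)]                 # companies reporting more than once
  d = d[!(d$gvkey %in% dup), ]
  keep = colnames(d) %in% c("gvkey", "tic", "conm")  # identifier names stay unchanged
  colnames(d)[!keep] = paste0(colnames(d)[!keep], ".q", q)
  d
}

parts = lapply(1:4, function(q) partition.quarter(data, q))
data.q1 = parts[[1]]; data.q2 = parts[[2]]; data.q3 = parts[[3]]; data.q4 = parts[[4]]
print(data.q1)
```

In the toy data, company 3 is dropped from quarter 1 for a price below 3 and company 2 is dropped for reporting twice, so only company 1 remains in `data.q1`.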

In [None]:
# Partition the dataset as described.


#### Consolidate Data by Company

Consolidate the four quarter datasets into one dataset, with one observation per company that includes variables for all four quarters.  Remove any observations with missing `prccq.q4` values.
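
A minimal sketch of this consolidation, assuming the four quarter datasets are joined on the identifier variables with a full outer join, so that a company missing from quarter 4 shows up with a missing `prccq.q4` and can then be removed. The join type and the toy data frames are assumptions for illustration.

```r
# Toy quarter datasets (illustration only); company 2 did not report in quarter 4.
ids = data.frame(gvkey=c(1, 2, 3), tic=c("AAA", "BBB", "CCC"),
                 conm=c("A Co", "B Co", "C Co"), stringsAsFactors=FALSE)
data.q1 = cbind(ids,               prccq.q1=c(10, 5, 8))
data.q2 = cbind(ids[c(1, 2), ],    prccq.q2=c(12, 7))
data.q3 = cbind(ids,               prccq.q3=c(11, 6, 9))
data.q4 = cbind(ids[c(1, 3), ],    prccq.q4=c(13, 9))

# Full outer join of all four quarters, one observation per company.
by.vars = c("gvkey", "tic", "conm")
data.current = Reduce(function(a, b) merge(a, b, by=by.vars, all=TRUE),
                      list(data.q1, data.q2, data.q3, data.q4))

# Remove observations with missing quarter 4 price.
data.current = data.current[!is.na(data.current$prccq.q4), ]
print(data.current)
```

Company 3 is kept even though its quarter 2 price is missing. Only a missing `prccq.q4` removes an observation here.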

In [None]:
# Consolidate the partitions as described.


### Data for Next Year

#### Retrieve Raw Data

Retrieve the company fundamentals data for calendar year 2018.

In [None]:
# Retrieve the 2018 data.


#### Filter Data by Calendar Quarter 4 

To filter the dataset by calendar quarter in which information is reported, first add a synthetic variable to indicate such, and then select only observations with information reported in quarter 4. Additionally, filter the observations to include only those with non-missing `prccq`, and keep only the `gvkey` and `prccq` variables.  Then remove any observations about companies that reported more than once per quarter.
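
A rough base R sketch of this filter, again assuming a `datadate` column in MM/DD/YYYY format and using made-up rows in place of the real 2018 file:

```r
# Toy stand-in for the raw 2018 data (illustration only).
data.next = data.frame(gvkey=c(1, 2, 3, 3, 4),
                       datadate=c("12/31/2018", "12/31/2018", "10/31/2018",
                                  "12/31/2018", "09/30/2018"),
                       prccq=c(11, NA, 9, 9.5, 20),
                       saleq=c(100, 200, 300, 310, 400),
                       stringsAsFactors=FALSE)

# Synthetic variable: calendar quarter derived from the month of datadate.
month = as.integer(format(as.Date(data.next$datadate, format="%m/%d/%Y"), "%m"))
data.next$quarter = (month - 1) %/% 3 + 1

# Keep quarter 4 observations with non-missing price, and only gvkey and prccq.
data.next = data.next[data.next$quarter == 4 & !is.na(data.next$prccq),
                      c("gvkey", "prccq")]

# Drop companies that reported more than once in the quarter.
dup = data.next$gvkey[duplicated(data.next$gvkey)]
data.next = data.next[!(data.next$gvkey %in% dup), ]
print(data.next)
```

In the toy data only company 1 survives: company 2 has a missing price, company 3 reported twice in quarter 4, and company 4 reported in quarter 3.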

In [3]:
# Filter the dataset as described.


### Data for Consolidated Current Year / Next Year

Consolidate the processed 2017 dataset and processed 2018 dataset, keeping only observations that have both 2017 and 2018 information.  Then add appropriate synthetic variables.
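
One way to sketch this step is an inner join on `gvkey` followed by a growth calculation. The variable names `prccq.next` and `growth` are hypothetical choices, and defining growth as the relative price change from 2017 quarter 4 to 2018 quarter 4 is an assumption that matches the $growth_i$ term of the business model. Other synthetic variables may also be appropriate.

```r
# Toy processed datasets (illustration only).
data.current = data.frame(gvkey=c(1, 2, 3), tic=c("AAA", "BBB", "CCC"),
                          conm=c("A Co", "B Co", "C Co"),
                          prccq.q4=c(10, 20, 8), stringsAsFactors=FALSE)
data.next = data.frame(gvkey=c(1, 3, 4), prccq=c(12, 6, 50))

# Rename the 2018 price so it cannot be confused with the 2017 prices.
colnames(data.next)[colnames(data.next) == "prccq"] = "prccq.next"

# Inner join: keep only companies present in both years.
data = merge(data.current, data.next, by="gvkey")

# Synthetic variable: price growth from 2017 Q4 to 2018 Q4.
data$growth = (data$prccq.next - data$prccq.q4) / data$prccq.q4
print(data)
```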

In [None]:
# Consolidate the datasets as described.


## Exploratory Data Analysis

_<< Discuss this exploratory data analysis and what insights you gleaned from it. >>_

### Descriptive Statistics

In [None]:
# Apply descriptive statistics to the data

### Data Visualization

In [None]:
# Visualize the data

## Transform Representation of Data

_<< Discuss this new representation of data. >>_
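
As an illustration of the kinds of transformation mentioned in the directions (variable filtration, imputation, principal component analysis), here is a small sketch on made-up data. The 50% missing-value threshold, mean imputation, and the choice of two principal components are arbitrary assumptions, not recommendations.

```r
# Toy dataset with missing values (illustration only).
data = data.frame(gvkey=1:6,
                  x1=c(1, 2, NA, 4, 5, 6),
                  x2=c(2.1, 3.9, 6.2, NA, 9.8, 12.1),
                  x3=c(NA, NA, NA, NA, 1, 2),
                  x4=c(5, 3, 6, 2, 7, 1))
predictors = setdiff(colnames(data), "gvkey")

# Filtration: keep predictors with at most 50% missing values.
missing.rate = sapply(data[predictors], function(x) mean(is.na(x)))
kept = predictors[missing.rate <= 0.5]

# Imputation: replace each missing value with its column mean.
imputed = as.data.frame(lapply(data[kept], function(x) {
  x[is.na(x)] = mean(x, na.rm=TRUE)
  x
}))

# Principal component analysis on centered, scaled predictors.
pca = prcomp(imputed, center=TRUE, scale.=TRUE)

# New representation: identifier plus the first two principal components.
data.pc = data.frame(gvkey=data$gvkey, pca$x[, 1:2])
print(data.pc)
```

Keeping the `kept` variable list, the imputation means, and the `pca` object makes it possible to apply the same transformation to new data later, for example with `predict(pca, newdata)`.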

In [None]:
# Change the representation of the data

## Build & Tune Models

_<< Discuss this model building and tuning. >>_

### Try Several Methods to Build & Tune Several Models
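
The sketch below shows the general shape of a cutoff search evaluated by 5-fold cross-validation. It is only an illustration under several assumptions: the data is simulated, logistic regression (`glm`) stands in for whichever method is being tried, the class variable `up` is hypothetical, and profit is estimated by splitting the budget equally across up to 12 test-fold companies whose predicted probability meets the cutoff.

```r
set.seed(12345)

# Simulated data (illustration only): two predictors and a noisy growth outcome.
n = 200
data = data.frame(x1=rnorm(n), x2=rnorm(n))
data$growth = 0.05 * data$x1 + rnorm(n, sd=0.1)
data$up = as.integer(data$growth > 0)    # hypothetical class variable

budget = 1000000
portfolio.size = 12
nfold = 5
fold = sample(rep(1:nfold, length.out=n))  # random fold assignment

# Estimate accuracy and profit for one cutoff, averaged over the folds.
evaluate = function(cutoff) {
  accuracy = numeric(nfold)
  profit = numeric(nfold)
  for (k in 1:nfold) {
    train = data[fold != k, ]
    test = data[fold == k, ]
    model = glm(up ~ x1 + x2, data=train, family=binomial)
    prob = predict(model, test, type="response")
    accuracy[k] = mean(as.integer(prob >= cutoff) == test$up)

    # Pick the highest-probability companies that meet the cutoff.
    pick = head(order(prob, decreasing=TRUE), portfolio.size)
    pick = pick[prob[pick] >= cutoff]
    if (length(pick) > 0) {
      allocation = budget / length(pick)   # equal allocation (assumption)
      profit[k] = sum((1 + test$growth[pick]) * allocation) - budget
    }
  }
  data.frame(cutoff=cutoff, accuracy=mean(accuracy), profit=mean(profit))
}

results = do.call(rbind, lapply(c(0.3, 0.5, 0.7), evaluate))
best = results[which.max(results$profit), ]
print(results)
```

The same loop can be nested inside further loops over predictor combinations and hyper-parameter settings, collecting one row of `results` per combination.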

In [None]:
# Construct several models using one method to predict stock performance.
# Iterate through combinations of hyper-parameter settings.
# Iterate through combinations of predictor variables.
# Iterate through cutoff values.

# Estimate each model's accuracy and profit, using 5-fold cross validation.


In [None]:
# Construct several models using one method to predict stock performance.
# Iterate through combinations of hyper-parameter settings.
# Iterate through combinations of predictor variables.
# Iterate through cutoff values.

# Estimate each model's accuracy and profit, using 5-fold cross validation.


In [None]:
# Construct several models using one method to predict stock performance.
# Iterate through combinations of hyper-parameter settings.
# Iterate through combinations of predictor variables.
# Iterate through cutoff values.

# Estimate each model's accuracy and profit, using 5-fold cross validation.


### Select the Best Method & Build the Best Model

In [4]:
# Construct a model using the method for the best performing model and
# based on the ORIGINAL model training data.


## Deployment

_<< Discuss this deployment. >>_

### Investment Opportunities

In [5]:
# Retrieve "Investment Opportunities.csv"


### Partition Investment Opportunities Data by Calendar Quarter 

To partition the dataset by calendar quarter in which information is reported, first add a synthetic variable to indicate such.  Then partition into four new datasets, one for each quarter, and drop the quarter variables. Additionally, filter the observations to include only those with non-missing `prccq`.  Then remove any observations about companies that reported more than once per quarter.  Then change all the variable names (except for the `gvkey`, `tic`, and `conm` variables) by suffixing them with quarter information - e.g., in the Quarter 1 dataset, `prccq` becomes `prccq.q1`, etc.

In [7]:
# Partition the dataset as described.


### Consolidate Investment Opportunities Data by Company

Consolidate the four quarter datasets into one dataset, with one observation per company that includes variables for all four quarters.  Remove any observations with missing `prccq.q4` values.

In [6]:
# Consolidate the partitions as described.


### Transform Representation of Investment Opportunities Data

In [None]:
# Transform representation of data to conform to model expectations


### Predict & Make Portfolio Recommendation
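
A minimal sketch of turning predictions into a portfolio in the expected format (`gvkey`, `tic`, `conm`, `allocation`). The opportunities and the `predicted.growth` values here are made up. In practice the predictions would come from applying the selected model to the transformed investment opportunities. Equal allocation across the 12 companies is an assumption, not a prescribed rule.

```r
budget = 1000000
portfolio.size = 12

# Toy investment opportunities with made-up predictions (illustration only).
opportunities = data.frame(gvkey=1:20,
                           tic=paste0("T", 1:20),
                           conm=paste("Company", 1:20),
                           stringsAsFactors=FALSE)
opportunities$predicted.growth = ((1:20 * 7) %% 20) / 100

# Choose the 12 opportunities with the highest predicted growth.
top = order(opportunities$predicted.growth, decreasing=TRUE)[1:portfolio.size]
portfolio = opportunities[top, c("gvkey", "tic", "conm")]

# Split the budget equally across the chosen companies.
portfolio$allocation = budget / portfolio.size
print(portfolio)
```

The resulting `portfolio` has exactly the four columns checked in the format confirmation step, and its allocations sum to the budget after rounding.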

In [7]:
# Use the model to predict stock performance of each investment opportunity.
# Recommend a portfolio of allocations to 12 investment opportunities: gvkey, tic, conm, allocation


### Store Portfolio Recommendation

In [15]:
# Store portfolio recommendation

write.csv(portfolio, paste0(analyst, ".csv"), row.names=FALSE)

### Confirm That Format Is Correct

In [16]:
portfolio.retrieved = read.csv(paste0(analyst, ".csv"), header=TRUE)
opportunities = unique(read.csv("Investment Opportunities.csv", header=TRUE)$gvkey)

columns = identical(colnames(portfolio.retrieved), c("gvkey", "tic", "conm", "allocation"))  # exact names, order, and count
companies = all(portfolio.retrieved$gvkey %in% opportunities)
allocations = round(sum(portfolio.retrieved$allocation)) == budget
                         
check = data.frame(analyst, columns, companies, allocations)
fmt(check, "Portfolio Recommendation | Format Check")

analyst,columns,companies,allocations
Firstname Lastname,True,True,True


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 29, 2020
</span>
</p>
</font>