# Project Part F: All Together Now

![](banner_project.jpg)

In [None]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

In [None]:
analyst = "Firstname Lastname" # Replace this with your name
fmt(analyst)

## Preamble

### Objective

Recommend a portfolio of 12 company investments that maximizes 12-month profit on a \$1,000,000 investment.

### Decision Model

Decision is which companies to include in the portfolio.

Approach is to fill the portfolio with companies predicted to have ...
* the highest probabilities of growing 30% or more at 12 months, or
* the highest growths at 12 months

<br>
Business parameters:

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocation = portions of \\$1,000,000: investments to allocate to specific companies in the portfolio 

$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business value of decision:

* Profit at 12 months

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>
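
As a minimal sketch of the two formulas above, the following computes profit for an equally weighted portfolio. The `growth` values are invented purely for illustration; only the budget, portfolio size, and equal allocation mirror the settings used in this notebook.

```r
# Illustration of the budget and profit formulas; the growth values are invented.
budget = 1000000
portfolio_size = 12
allocation = rep(budget / portfolio_size, portfolio_size)

# Assumed 12-month growth for each of the 12 companies (illustrative only).
growth = c(0.10, -0.05, 0.30, 0.00, 0.20, -0.10, 0.15, 0.05, 0.40, -0.20, 0.25, 0.10)

# budget = sum(allocation_i): the allocations must add up to the budget.
stopifnot(isTRUE(all.equal(sum(allocation), budget)))

# profit = sum((1 + growth_i) * allocation_i) - budget
profit = sum((1 + growth) * allocation) - budget
profit
```

With equal allocations, profit reduces to the budget times the mean growth of the selected companies, so an unequal allocation only helps if more money goes to the companies that actually grow more.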


In [None]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size) # you can keep or change this setting

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

In [None]:
# Set any additional business parameters.

threshold = 0.30 # you can keep or change this setting
fmt(threshold)

### Data Source

Data files were sourced from Wharton Research Data Services > Compustat - Capital IQ from Standard & Poor's > North America - Daily > Fundamentals Quarterly (https://wrds-www.wharton.upenn.edu/)

Selection criteria:

  * Date Variable: Data Date
  * Date Range: 2017-01 to 2017-12 -or- 2018-01 to 2018-12
  * Company Codes: Search the entire database
    * Consolidation Level: C, Output
    * Industry Format: INDL, FS, Output
    * Data Format: STD, Output
    * Population Source: D, Output
    * Quarter Type: Fiscal View, Output
    * Currency: USD, Output (not CAD)
    * Company Status: Active, Output (not Inactive)
  * Variable Types: Data Items, Select All (674)
  * Query output:
    * Output format: comma-delimited text
    * Compression type: None
    * Data format: MMDDYY10

Data are restricted to select US active, publicly held companies that reported quarterly measures including stock prices for 1st, 2nd, 3rd, and 4th quarters in years 2017 and 2018.  All non-missing stock prices exceed $3 per share.  File formats are all comma-separated values (CSV).

## Data

### Retrieve Data

In [None]:
# Retrieve the 2017 data.
# Show the first observation of the retrieved data.

datax.2017 = read.csv("Company Fundamentals 2017.csv", header=TRUE)
datax.2017[1,]

In [None]:
# Retrieve the 2018 data.
# Show the first observation of the retrieved data.

datax.2018 = read.csv("Company Fundamentals 2018.csv", header=TRUE)
datax.2018[1,]

### Prepare Data for Analysis

_2017 Data:_

Add a synthetic variable indicating the calendar quarter in which information is reported, then partition the dataset into four new datasets, one for each quarter, and drop the quarter variable from each. Filter the observations to include only those with non-missing `prccq` $\geq$ 3. Remove any observations about companies that reported more than once per quarter. Then change all the variable names (except for the `gvkey`, `tic`, and `conm` variables) by suffixing them with quarter information; e.g., in the Quarter 1 dataset, `prccq` becomes `prccq.q1`.

Consolidate the four quarter datasets into one dataset, with one observation per company that includes variables for all four quarters.  Remove any observations with missing `prccq.q4` values.
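
The 2017 steps above can be sketched on a toy dataset as follows. This is one possible reading, not a reference solution: the `datadate` column name and its MM/DD/YYYY text format are assumptions about the retrieved file, `prepare_quarter` is a hypothetical helper, "reported more than once per quarter" is interpreted as dropping every observation for that company in that quarter, and the consolidation is assumed to be a full outer join.

```r
# Toy stand-in for the 2017 data; all values are invented.
# Assumption: the report date is in a column named datadate, formatted MM/DD/YYYY.
data.toy = data.frame(
  gvkey    = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  tic      = c(rep("AAA", 4), rep("BBB", 5), rep("CCC", 3)),
  conm     = c(rep("Alpha Co", 4), rep("Beta Co", 5), rep("Gamma Co", 3)),
  datadate = c("03/31/2017", "06/30/2017", "09/30/2017", "12/31/2017",
               "03/31/2017", "06/30/2017", "06/30/2017", "09/30/2017", "12/31/2017",
               "03/31/2017", "09/30/2017", "12/31/2017"),
  prccq    = c(10, 11, 12, 13, 5, 5.5, 6, 7, NA, 20, 2, 22),
  stringsAsFactors = FALSE)

# Add a synthetic variable for the calendar quarter (1 to 4).
month = as.integer(format(as.Date(data.toy$datadate, format = "%m/%d/%Y"), "%m"))
data.toy$quarter = (month - 1) %/% 3 + 1

# Keep only observations with non-missing prccq >= 3.
data.toy = data.toy[!is.na(data.toy$prccq) & data.toy$prccq >= 3, ]

id_vars = c("gvkey", "tic", "conm")

# Hypothetical helper: build the dataset for quarter q.
prepare_quarter = function(d, q) {
  part = d[d$quarter == q, names(d) != "quarter"]        # partition, drop quarter variable
  repeated = part$gvkey[duplicated(part$gvkey)]          # companies reporting more than once
  part = part[!(part$gvkey %in% repeated), ]             # assumed reading: drop all their rows
  rename = !(names(part) %in% id_vars)
  names(part)[rename] = paste0(names(part)[rename], ".q", q)   # e.g. prccq -> prccq.q1
  part
}

parts = lapply(1:4, function(q) prepare_quarter(data.toy, q))

# Consolidate: one row per company with variables for all four quarters (full outer join assumed).
data.2017 = Reduce(function(a, b) merge(a, b, by = id_vars, all = TRUE), parts)

# Remove observations with missing Quarter 4 price.
data.2017 = data.2017[!is.na(data.2017$prccq.q4), ]
data.2017
```

In this toy example, company 2 is dropped because it has no usable Quarter 4 price, while company 3 is kept with missing Quarter 2 and Quarter 3 values.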

_2018 Data:_

Add a synthetic variable indicating the calendar quarter in which information is reported, and select only observations with information reported in quarter 4. Additionally, filter the observations to include only those with non-missing `prccq`, and keep only the `gvkey` and `prccq` variables. Then remove any observations about companies that reported more than once per quarter.

_Consolidate:_

Consolidate the processed 2017 dataset and processed 2018 dataset, keeping only observations that have both 2017 and 2018 information.  Then add these 2 synthetic variables:

$\begin{align}
growth : & \, (prccq - prccq.q4) \div prccq.q4 \\
big\_growth : & \, growth \geq threshold
\end{align}$
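
A small sketch of the consolidation and the two synthetic variables, using invented prices. The toy data frame names are hypothetical; `merge` with its default inner join is one way to keep only companies that appear in both years.

```r
# Toy prepared 2017 data (Quarter 4 price only) and toy filtered 2018 data; values are invented.
data.2017.toy = data.frame(gvkey = c(1, 2, 3), prccq.q4 = c(10, 20, 40))
data.2018.toy = data.frame(gvkey = c(1, 2, 4), prccq = c(14, 19, 50))
threshold = 0.30

# Inner join: keeps only companies with both 2017 and 2018 information.
data.toy = merge(data.2017.toy, data.2018.toy, by = "gvkey")

# growth = (prccq - prccq.q4) / prccq.q4
data.toy$growth = (data.toy$prccq - data.toy$prccq.q4) / data.toy$prccq.q4

# big_growth = growth >= threshold
data.toy$big_growth = data.toy$growth >= threshold
data.toy
```

Companies 3 and 4 drop out because each appears in only one year; company 1 grows 40% and is flagged as big growth, while company 2 shrinks 5% and is not.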

In [None]:
# Partition the 2017 data.

# Consolidate the 2017 data partitions.

# Filter the 2018 data.

# Consolidate the 2017 data and 2018 data.

# Show first observation of prepared data.


## Exploratory Data Analysis

In [None]:
# Show fraction of observations that are missing price data (i.e., prccq.q1, prccq.q2, prccq.q3, prccq.q4).


In [None]:
# Show another interesting statistic.


In [None]:
# Show another interesting statistic.


In [None]:
# Visualize growth across companies (sorted lowest to highest).


In [None]:
# Visualize the amount of missing data across variables.


In [None]:
# Show another interesting visualization.


In [None]:
# Show another interesting visualization.


## Data Representation

Data representation is transformed as follows:

* Slice the data to include only predictor variables with at least 95% non-missing values.
* Impute missing data ...
  * for each numeric variable, use the mean of non-missing values
  * for each non-numeric variable, use the mode of non-missing values
* Slice the data to include only numeric (including integer) variables with non-zero variance.
* Transform predictor variables to principal component representation.
* Slice the data further ...
  * The first 3 columns are gvkey, tic, and conm; these are identifier variables
  * The next 3 columns are PC1, PC2, and PC3; these are predictor variables
  * The next 3 columns are prccq, growth, and big_growth; these are outcome variables
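
The transformation steps listed above can be sketched with base R on a toy dataset. All data here are randomly generated, the column names `x1` to `x4`, `sparse`, `constant`, and `label` are hypothetical, and centering and scaling before `prcomp` is an assumed choice rather than something this notebook specifies.

```r
# Toy prepared data; every value is randomly generated for illustration.
set.seed(1)
n = 40
data.toy = data.frame(
  gvkey = 1:n,
  tic   = paste0("T", 1:n),
  conm  = paste("Company", 1:n),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n),
  sparse   = c(rnorm(5), rep(NA, n - 5)),             # mostly missing
  constant = rep(7, n),                               # zero variance
  label    = sample(c("a", "b"), n, replace = TRUE),  # non-numeric
  prccq    = runif(n, 5, 50),
  growth   = rnorm(n, 0.1, 0.3),
  stringsAsFactors = FALSE)
data.toy$big_growth = data.toy$growth >= 0.30
data.toy$x1[3] = NA                                   # one value to impute

id_vars      = c("gvkey", "tic", "conm")
outcome_vars = c("prccq", "growth", "big_growth")
predictors   = data.toy[, setdiff(names(data.toy), c(id_vars, outcome_vars))]

# Keep predictor variables with at least 95% non-missing values.
keep = colMeans(!is.na(predictors)) >= 0.95
predictors = predictors[, keep, drop = FALSE]

# Impute: mean for numeric variables, mode for non-numeric variables.
get_mode = function(v) { v = v[!is.na(v)]; names(which.max(table(v))) }
for (j in names(predictors)) {
  v = predictors[[j]]
  if (any(is.na(v))) {
    v[is.na(v)] = if (is.numeric(v)) mean(v, na.rm = TRUE) else get_mode(v)
    predictors[[j]] = v
  }
}

# Keep only numeric variables with non-zero variance.
numeric_ok = sapply(predictors, function(v) is.numeric(v) && var(v) > 0)
predictors = predictors[, numeric_ok, drop = FALSE]

# Principal component representation (centering and scaling are assumed choices).
pca = prcomp(predictors, center = TRUE, scale. = TRUE)
pcs = as.data.frame(pca$x[, 1:3])

# Identifiers, then PC1 to PC3, then outcome variables.
data.rep = cbind(data.toy[, id_vars], pcs, data.toy[, outcome_vars])
data.rep[1, ]
```

Keeping the fitted `pca` object and the imputation values around is useful later, since new data generally need to be transformed with the same statistics.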

In [None]:
# Specify predictor variables and predicted variables

# Filter variables

# Impute missing data.

# Further filter variables

# Transform predictor variables to principal component representation

# Reduce number of predictor variables and consolidate data

# Show first observation of transformed data.


## Model 1

Model 1 is a naive Bayes classifier that predicts whether or not a company stock price will grow by 30% or more at 12 months.
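
However the probabilities are produced, the decision step is the same: rank companies by predicted probability and take the top ones. The sketch below shows only that step, using invented probabilities in place of real naive Bayes output; `select_portfolio` is a hypothetical helper, and the toy budget and portfolio size are scaled down for readability.

```r
# Hypothetical helper: fill a portfolio with the k highest-scoring companies.
select_portfolio = function(data, score, k, allocation) {
  top = order(score, decreasing = TRUE)[1:k]
  portfolio = data[top, c("gvkey", "tic", "conm")]
  portfolio$allocation = allocation
  portfolio
}

# Toy companies with invented growth, and invented predicted probabilities of big_growth.
data.toy = data.frame(gvkey = 1:6, tic = paste0("T", 1:6), conm = paste("Company", 1:6),
                      growth = c(0.50, -0.10, 0.35, 0.05, 0.40, -0.20),
                      stringsAsFactors = FALSE)
prob.toy = c(0.80, 0.10, 0.60, 0.30, 0.20, 0.05)

budget.toy = 300
k = 3
portfolio = select_portfolio(data.toy, prob.toy, k, rep(budget.toy / k, k))

# Realized profit of the chosen portfolio, using the known growth of each chosen company.
growth.chosen = data.toy$growth[match(portfolio$gvkey, data.toy$gvkey)]
profit = sum((1 + growth.chosen) * portfolio$allocation) - budget.toy
portfolio
profit
```

Note that company 5 has high growth but a low predicted probability, so it is left out; the estimated profit reflects how well the model ranks companies, not how well it classifies them at a particular cutoff.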

### Build & Evaluate Model (not tuned)

In [None]:
# Construct a naive Bayes model to predict big_growth.
# Use the model to inform the decision about how to fill the portfolio.
# Show the 5-fold cross-validation estimated business value of the model as measured by portfolio profit.


### Build, Evaluate, & Tune Model

In [None]:
# Tune the naive Bayes model by iterating through predictor variable combinations.
# Show the predictor variable combination and estimated profit for the best performing model. 


## Model 2

Model 2 is a linear regression model that predicts company stock price growth at 12 months.
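
A minimal sketch of how a 5-fold cross-validation estimate of portfolio profit might be set up for a linear regression, using only base R. The data are randomly generated with an invented relationship between `PC1` and `growth`, and the random fold assignment shown here is an assumption; the course setup may provide its own fold helper.

```r
# Toy transformed data; the relationship between PC1 and growth is invented.
set.seed(12345)
n = 200
data.toy = data.frame(PC1 = rnorm(n), PC2 = rnorm(n), PC3 = rnorm(n))
data.toy$growth = 0.05 + 0.10 * data.toy$PC1 + rnorm(n, sd = 0.2)

budget = 1000000
portfolio_size = 12
allocation = rep(budget / portfolio_size, portfolio_size)

# Assign each observation at random to one of 5 folds of equal size.
nfold = 5
fold = sample(rep(1:nfold, length.out = n))

profit = numeric(nfold)
for (i in 1:nfold) {
  train = data.toy[fold != i, ]
  test  = data.toy[fold == i, ]

  # Fit on the training folds, predict growth on the held-out fold.
  model = lm(growth ~ PC1 + PC2 + PC3, data = train)
  predicted = predict(model, newdata = test)

  # Fill the portfolio with the companies predicted to grow the most.
  top = order(predicted, decreasing = TRUE)[1:portfolio_size]

  # Profit is computed from the actual growth of the chosen companies.
  profit[i] = sum((1 + test$growth[top]) * allocation) - budget
}

profit
mean(profit)   # cross-validation estimate of profit
```

Each held-out fold must contain at least `portfolio_size` companies for the selection step to make sense; here each fold has 40.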

### Build & Evaluate Model (not tuned)

In [None]:
# Construct a linear regression model to predict growth.
# Use the model to inform the decision about how to fill the portfolio.
# Show the 5-fold cross-validation estimated business value of the model as measured by portfolio profit.


### Build, Evaluate, & Tune Model

In [None]:
# Tune the linear regression model by iterating through predictor variable combinations.
# Show the predictor variable combination and estimated profit for the best performing model. 


## Model 3

Model 3 is a ... that predicts ... at 12 months. (You choose the model construction method)

### Build & Evaluate Model (not tuned)

In [None]:
# Construct a ... model to predict ... .
# Use the model to inform the decision about how to fill the portfolio.
# Show the 5-fold cross-validation estimated business value of the model as measured by portfolio profit.


### Build, Evaluate, & Tune Model

In [None]:
# Tune the ... by iterating through predictor variable combinations,
# hyperparameter settings (if applicable), and cutoffs (if applicable).
# Show the predictor variable combination, hyperparameter settings (if applicable),
# cutoff (if applicable), and estimated profit for the best performing model.


## Model 4

Model 4 is a ... that predicts ... at 12 months. (You choose the model construction method)

### Build & Evaluate Model (not tuned)

In [None]:
# Construct a ... model to predict ... .
# Use the model to inform the decision about how to fill the portfolio.
# Show the 5-fold cross-validation estimated business value of the model as measured by portfolio profit.


### Build, Evaluate, & Tune Model

In [None]:
# Tune the ... by iterating through predictor variable combinations,
# hyperparameter settings (if applicable), and cutoffs (if applicable).
# Show the predictor variable combination, hyperparameter settings (if applicable),
# cutoff (if applicable), and estimated profit for the best performing model.


## Investment Opportunities

Test the best performing model on new investment opportunities.
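
One point that is easy to get wrong here is that new data are usually transformed with statistics learned from the training data, not recomputed from the new data. The sketch below assumes that approach, with randomly generated toy data and hypothetical variable names `x1` to `x3`: missing values are filled with training means, and principal components come from `predict` applied to the training `prcomp` object.

```r
# Toy training predictors and toy new observations; all values are invented.
set.seed(2)
train = data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
new   = data.frame(x1 = c(0.5, NA), x2 = c(-1.0, 0.2), x3 = c(0.0, 1.5))

# Statistics learned from the training data only.
train_means = colMeans(train)
pca = prcomp(train, center = TRUE, scale. = TRUE)

# Impute missing values in the new data with the training means.
for (j in names(new)) new[[j]][is.na(new[[j]])] = train_means[[j]]

# Project the new data onto the principal components found in training.
new_pcs = as.data.frame(predict(pca, newdata = new))[, 1:3]
new_pcs
```

The new data must contain the same predictor columns, with the same names, as the data used to fit the principal components.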

### Retrieve Data

In [None]:
# Retrieve the investment opportunities data.
# Show the first observation of the data.

datax.io = read.csv("Investment Opportunities.csv", header=TRUE)
datax.io[1,]

### Prepare Data for Analysis

In [None]:
# Prepare the investment opportunities data for further analysis.


### Transform Data Representation

In [None]:
# Transform the prepared investment opportunities data representation as appropriate for use with the best performing model.
# Show first observation of transformed data.


## Apply Model

### Build Best Model

In [None]:
# Construct the best performing model using all 2017 and 2018 data for training.


### Recommend Portfolio

In [None]:
# Use the model to inform the decision about how to fill the portfolio with companies from the investment opportunities.
# Show the portfolio: gvkey, tic, conm, allocation


### Store Portfolio Recommendation

In [None]:
write.csv(portfolio, paste0(analyst, ".csv"), row.names=FALSE)

### Confirm That Format Is Correct

In [None]:
portfolio.retrieved = read.csv(paste0(analyst, ".csv"), header=TRUE)
opportunities = unique(read.csv("Investment Opportunities.csv", header=TRUE)$gvkey)

columns = identical(colnames(portfolio.retrieved), c("gvkey", "tic", "conm", "allocation"))
companies = all(portfolio.retrieved$gvkey %in% opportunities)
allocations = round(sum(portfolio.retrieved$allocation)) == budget
                         
check = data.frame(analyst, columns, companies, allocations)
fmt(check, "Portfolio Recommendation | Format Check")

## Discussion

_<  Discuss your analysis.  Comment on your approach, what you learned, and how you may be able to use what you learned in your future work.  Approximately 250 words. >_

<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 20, 2021
</span>
</p>
</font>