### Market Cap Prediction with Various Models

This notebook explores the Fortune 1000 companies dataset and tries to predict each company's market cap with several regression models.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

from main import preprocessing, scale_and_split_data, linear_models, ensemble_models

import warnings
warnings.filterwarnings(action="ignore")

In [2]:
df = pd.read_csv("Fortune_1000.csv")

In [3]:
df

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
0,Walmart,1,0.0,523964.0,14881.0,2200000,Retailing,Bentonville,AR,no,no,no,yes,1.0,C. Douglas McMillon,https://www.stock.walmart.com,WMT,411690
1,Amazon,2,3.0,280522.0,11588.0,798000,Retailing,Seattle,WA,no,yes,no,yes,5.0,Jeffrey P. Bezos,https://www.amazon.com,AMZN,1637405
2,Exxon Mobil,3,-1.0,264938.0,14340.0,74900,Energy,Irving,TX,no,no,no,yes,2.0,Darren W. Woods,https://www.exxonmobil.com,XOM,177923
3,Apple,4,-1.0,260174.0,55256.0,137000,Technology,Cupertino,CA,no,no,no,yes,3.0,Timothy D. Cook,https://www.apple.com,AAPL,2221176
4,CVS Health,5,3.0,256776.0,6634.0,290000,Health Care,Woonsocket,RI,no,no,yes,yes,8.0,Karen S. Lynch,https://www.cvshealth.com,CVS,98496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Mr. Cooper Group,996,0.0,2007.0,274.0,9100,Financials,Coppell,TX,,no,no,yes,,Jay Bray,https://mrcoopergroup.com,COOP,674.1
996,Herc Holdings,997,0.0,1999.0,47.5,5100,Business Services,Bonita Springs,FL,,no,no,yes,,Lawrence H. Silber,https://www.hercrentals.com,HRI,590.5
997,Healthpeak Properties,998,0.0,1997.4,45.5,204,Financials,Irvine,CA,,no,no,yes,,Thomas M. Herzog,https://www.hcpi.com,PEAK,12059.3
998,SPX FLOW,999,0.0,1996.3,-95.1,5000,Industrials,Charlotte,NC,,no,no,no,,Marcus G. Michael,https://www.spxflow.com,FLOW,1211.8


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   company            1000 non-null   object 
 1   rank               1000 non-null   int64  
 2   rank_change        1000 non-null   float64
 3   revenue            1000 non-null   float64
 4   profit             998 non-null    float64
 5   num. of employees  1000 non-null   int64  
 6   sector             1000 non-null   object 
 7   city               1000 non-null   object 
 8   state              1000 non-null   object 
 9   newcomer           500 non-null    object 
 10  ceo_founder        1000 non-null   object 
 11  ceo_woman          1000 non-null   object 
 12  profitable         1000 non-null   object 
 13  prev_rank          1000 non-null   object 
 14  CEO                992 non-null    object 
 15  Website            1000 non-null   object 
 16  Ticker             938 no

In [5]:
df[["ceo_founder", "ceo_woman", "profitable"]].value_counts()

ceo_founder  ceo_woman  profitable
no           no         yes           763
                        no            126
             yes        yes            57
yes          no         yes            33
                        no             10
no           yes        no             10
yes          yes        yes             1
dtype: int64

Rows with a null market cap will need to be dropped, since that is the value we are trying to predict. Most of the other null values are in columns that will not be used in the model. The missing profit values will be filled with the column mean.
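The `preprocessing` function is imported from `main`, and its source is not shown in this notebook. The steps described above could be sketched roughly as follows. The name `preprocessing_sketch`, the choice of kept columns, and the yes/no mapping are assumptions inferred from the output columns below, not the author's actual code.

```python
import numpy as np
import pandas as pd


def preprocessing_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for main.preprocessing (assumed behaviour)."""
    out = df.copy()

    # Coerce the target to numeric, then drop rows where it is missing.
    out["Market Cap"] = pd.to_numeric(out["Market Cap"], errors="coerce")
    out = out.dropna(subset=["Market Cap"]).copy()

    # Fill missing profit values with the column mean.
    out["profit"] = out["profit"].fillna(out["profit"].mean())

    # Map the yes/no flags to 1/0.
    for col in ["ceo_founder", "ceo_woman", "profitable"]:
        out[col] = out[col].map({"yes": 1, "no": 0})

    # Keep only the modelling columns and one-hot encode the categorical ones.
    keep = [
        "revenue", "profit", "num. of employees",
        "ceo_founder", "ceo_woman", "profitable", "Market Cap",
        "sector", "city", "state",
    ]
    return pd.get_dummies(out[keep], columns=["sector", "city", "state"])


# Toy data (made up) with one missing profit and one missing market cap.
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "revenue": [10.0, 20.0, 30.0],
    "profit": [1.0, np.nan, 3.0],
    "num. of employees": [100, 200, 300],
    "sector": ["Tech", "Energy", "Tech"],
    "city": ["X", "Y", "Z"],
    "state": ["CA", "TX", "CA"],
    "ceo_founder": ["no", "yes", "no"],
    "ceo_woman": ["no", "no", "yes"],
    "profitable": ["yes", "yes", "no"],
    "Market Cap": ["100", "250.5", None],
})
result = preprocessing_sketch(toy)
print(result.shape)
```

In this sketch the profit mean is computed after the rows without a market cap are removed; computing it beforehand would be an equally plausible reading of the description.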

### Preprocessing

In [6]:
temp_df = preprocessing(df)
temp_df

Unnamed: 0,revenue,profit,num. of employees,ceo_founder,ceo_woman,profitable,Market Cap,sector_Aerospace & Defense,sector_Apparel,sector_Business Services,...,state_PA,state_PR,state_RI,state_SC,state_TN,state_TX,state_UT,state_VA,state_WA,state_WI
0,523964.0,14881.0,2200000,0,0,1,411690.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,280522.0,11588.0,798000,1,0,1,1637405.0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,264938.0,14340.0,74900,0,0,1,177923.0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,260174.0,55256.0,137000,0,0,1,2221176.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,256776.0,6634.0,290000,0,1,1,98496.0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2007.0,274.0,9100,0,0,1,674.1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
996,1999.0,47.5,5100,0,0,1,590.5,0,0,1,...,0,0,0,0,0,0,0,0,0,0
997,1997.4,45.5,204,0,0,1,12059.3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,1996.3,-95.1,5000,0,0,0,1211.8,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
temp_df.isnull().sum().sum()

0

In [8]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 950 entries, 0 to 999
Columns: 459 entries, revenue to state_WI
dtypes: float64(3), int64(4), uint8(452)
memory usage: 478.7 KB


In [9]:
X_train, X_test, y_train, y_test = scale_and_split_data(temp_df)

In [10]:
X_train

Unnamed: 0,revenue,profit,num. of employees,ceo_founder,ceo_woman,profitable,sector_Aerospace & Defense,sector_Apparel,sector_Business Services,sector_Chemicals,...,state_PA,state_PR,state_RI,state_SC,state_TN,state_TX,state_UT,state_VA,state_WA,state_WI
339,-0.182121,-0.030031,-0.172549,-0.228506,-0.249600,0.413249,-0.161971,-0.12969,4.179979,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,-0.356537,-0.038808,-0.197642,-0.12356,-0.157014
58,1.079306,0.501438,-0.231484,-0.228506,-0.249600,0.413249,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,2.804758,-0.038808,-0.197642,-0.12356,-0.157014
863,-0.372868,-0.252795,-0.276585,-0.228506,4.006405,0.413249,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,-0.356537,-0.038808,-0.197642,-0.12356,-0.157014
747,-0.355584,-0.225409,-0.343187,-0.228506,-0.249600,0.413249,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,2.804758,-0.038808,-0.197642,-0.12356,-0.157014
873,-0.374988,-0.370936,-0.285958,-0.228506,-0.249600,-2.419849,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,-0.356537,-0.038808,-0.197642,-0.12356,-0.157014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
809,-0.364224,-0.240179,-0.315566,-0.228506,4.006405,0.413249,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,-0.356537,-0.038808,-0.197642,-0.12356,-0.157014
75,0.771302,0.373296,0.548207,-0.228506,-0.249600,0.413249,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,-0.356537,-0.038808,-0.197642,-0.12356,-0.157014
957,-0.384852,-0.369881,-0.295331,-0.228506,-0.249600,-2.419849,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,-0.356537,-0.038808,-0.197642,-0.12356,-0.157014
248,-0.082339,0.162313,0.323263,-0.228506,-0.249600,0.413249,-0.161971,-0.12969,-0.239236,-0.17609,...,-0.209657,-0.038808,-0.038808,-0.038808,-0.141204,-0.356537,-0.038808,-0.197642,-0.12356,-0.157014


In [11]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 665 entries, 339 to 38
Columns: 458 entries, revenue to state_WI
dtypes: float64(458)
memory usage: 2.3 MB


The data has been preprocessed and scaled, and is now ready for modelling.
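`scale_and_split_data` also comes from `main`, so its internals are not visible here. The `X_train.info()` output (665 of the 950 rows) is consistent with a 70/30 split, and every column is float after scaling. A minimal sketch under those assumptions is below. The `test_size`, the `random_state`, and the decision to fit the scaler on the training rows only are guesses, not the author's confirmed settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def scale_and_split_sketch(df, target="Market Cap", test_size=0.3, random_state=42):
    """Hypothetical stand-in for main.scale_and_split_data (assumed behaviour)."""
    X = df.drop(columns=[target])
    y = df[target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Fit the scaler on the training rows only, then apply it to both splits,
    # so that no test-set statistics leak into the features.
    scaler = StandardScaler()
    X_train = pd.DataFrame(
        scaler.fit_transform(X_train), index=X_train.index, columns=X.columns
    )
    X_test = pd.DataFrame(
        scaler.transform(X_test), index=X_test.index, columns=X.columns
    )
    return X_train, X_test, y_train, y_test


# Small made-up frame to show the shapes that come back.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "revenue": rng.normal(size=10),
    "profit": rng.normal(size=10),
    "Market Cap": rng.normal(size=10),
})
X_tr, X_te, y_tr, y_te = scale_and_split_sketch(demo)
print(X_tr.shape, X_te.shape)
```

Wrapping the scaled arrays back into DataFrames keeps the original row index, which matches the non-sequential index seen in the `X_train` display.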

### Models and Results

In [12]:
linear_results, lasso_results, ridge_results = linear_models(X_train, X_test, y_train, y_test)
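`linear_models` is imported from `main`, and each result is indexed as `[0]` for the error and `[1]` for the R2 score in the next cell. A helper with that return shape might look like the sketch below. The function name, the `alpha` values, and the absence of any grid search are assumptions, since the original source is not shown.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score


def linear_models_sketch(X_train, X_test, y_train, y_test):
    """Hypothetical stand-in for main.linear_models (assumed behaviour)."""
    results = []
    for model in (LinearRegression(), Lasso(alpha=1.0), Ridge(alpha=1.0)):
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        # RMSE is the square root of the mean squared error.
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        results.append((rmse, r2_score(y_test, pred)))
    return tuple(results)


# Made-up, nearly linear data just to exercise the function.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 4.0]) + rng.normal(scale=0.01, size=200)

linear_res, lasso_res, ridge_res = linear_models_sketch(X[:150], X[150:], y[:150], y[150:])
print(linear_res, lasso_res, ridge_res)
```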

In [13]:
print("Linear Regression rmse: {}    Linear Regression R2 Score: {}".format(linear_results[0], linear_results[1]))
print(" Lasso Regression rmse: {}     Lasso Regression R2 Score: {}".format(lasso_results[0], lasso_results[1]))
print(" Ridge Regression rmse: {}     Ridge Regression R2 Score: {}".format(ridge_results[0], ridge_results[1]))

Linear Regression rmse: 1.5394740660412708e+38    Linear Regression R2 Score: -7.297178960644719e+27
 Lasso Regression rmse: 22012075263.671215     Lasso Regression R2 Score: -0.043382646303594896
 Ridge Regression rmse: 22586313757.127308     Ridge Regression R2 Score: -0.07060181904104446


In [14]:
gradient_boost_results, random_forest_results = ensemble_models(X_train, X_test, y_train, y_test)
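`ensemble_models` is likewise defined in `main`. A sketch with the same `(error, R2)` return shape is shown below, using scikit-learn's default hyperparameters and a fixed `random_state`. Those settings and the function name are assumptions, and the original may well tune the models (for example with the `GridSearchCV` imported above).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def ensemble_models_sketch(X_train, X_test, y_train, y_test, random_state=0):
    """Hypothetical stand-in for main.ensemble_models (assumed behaviour)."""
    models = (
        GradientBoostingRegressor(random_state=random_state),
        RandomForestRegressor(random_state=random_state),
    )
    results = []
    for model in models:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        results.append((rmse, r2_score(y_test, pred)))
    return tuple(results)


# Made-up data where the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

gb_res, rf_res = ensemble_models_sketch(X[:225], X[225:], y[:225], y[225:])
print(gb_res, rf_res)
```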

In [15]:
print("Gradient Boost rmse: {}    Gradient Boost R2 Score: {}".format(gradient_boost_results[0], gradient_boost_results[1]))
print(" Random Forest rmse: {}     Random Forest R2 Score: {}".format(random_forest_results[0], random_forest_results[1]))

Gradient Boost rmse: 8243858192.73126    Gradient Boost R2 Score: 0.6092372720949465
 Random Forest rmse: 7668100034.362797     Random Forest R2 Score: 0.6365284776588684
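A quick consistency check on the printed error values: because R2 = 1 - MSE / Var(y_test), dividing a mean squared error by (1 - R2) gives the test-set variance of the target, which is the same for every model evaluated on the same split. The check below uses only the numbers printed above. If the implied variance comes out essentially identical across models, that suggests the values labelled `rmse` are mean squared errors, in which case their square roots would be the RMSE figures.

```python
import math

# (error value, R2) pairs copied from the printed output above.
reported = {
    "Linear Regression": (1.5394740660412708e38, -7.297178960644719e27),
    "Lasso Regression": (22012075263.671215, -0.043382646303594896),
    "Ridge Regression": (22586313757.127308, -0.07060181904104446),
    "Gradient Boost": (8243858192.73126, 0.6092372720949465),
    "Random Forest": (7668100034.362797, 0.6365284776588684),
}

# If the error value is an MSE, error / (1 - R2) is the variance of y_test,
# so it should be (nearly) the same number for every model.
implied_variance = {
    name: err / (1.0 - r2) for name, (err, r2) in reported.items()
}
spread = max(implied_variance.values()) / min(implied_variance.values()) - 1.0

for name, (err, _) in reported.items():
    print(
        f"{name:>17}: implied variance {implied_variance[name]:.4e}, "
        f"sqrt(error) {math.sqrt(err):,.0f}"
    )
print(f"relative spread of implied variance: {spread:.2e}")
```

This does not change the ranking of the models, since ordering by MSE and ordering by RMSE are the same.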


### Model Rankings:
1) Random Forest
2) Gradient Boost
3) Lasso Regression
4) Ridge Regression
5) Linear Regression


The linear models did not perform well on this dataset. All three had a negative R2 score on the test set, which means each did worse than a baseline that always predicts the mean market cap. Plain linear regression was by far the worst, with an R2 of about -7.3e27.
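To make the meaning of a negative R2 concrete, here is a small illustration with made-up numbers (not taken from the dataset). Predicting the mean of the true values for every row scores exactly 0, and predictions that are further from the truth than that baseline score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # made-up target values

# Baseline: always predict the mean of the true values (25.0).
baseline_pred = np.full_like(y_true, y_true.mean())

# Predictions that run in the opposite direction to the truth.
bad_pred = np.array([40.0, 30.0, 20.0, 10.0])

r2_baseline = r2_score(y_true, baseline_pred)  # 1 - 500/500 = 0.0
r2_bad = r2_score(y_true, bad_pred)            # 1 - 2000/500 = -3.0

print(r2_baseline, r2_bad)
```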

Gradient boosting and random forest were the best-performing models and the only ones with a positive R2 score, at about 0.61 and 0.64 respectively.

