In [1]:
#Libraries
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler

In [2]:
#Reading the data into pandas dataframes

df1 = pd.read_excel(r"C:\Users\Alex\Desktop\Data science\Deliverable\Data\R&D ranking of the world top 2500 companies_0_0.xlsx")
df2 = pd.read_excel(r"C:\Users\Alex\Desktop\Data science\Deliverable\Data\SB2021_World2500_0.xlsx")
df3 = pd.read_excel(r"C:\Users\Alex\Desktop\Data science\Deliverable\Data\SB2020_World2500.xlsx")
df4 = pd.read_excel(r"C:\Users\Alex\Desktop\Data science\Deliverable\Data\SB2019_main stats_GLOBAL2500.xlsx")


    Innovation has become ever more critical in recent decades as companies seek a competitive advantage over their rivals. It would nevertheless be a massive understatement to treat innovation purely as a driver of profits, as it generates substantial side effects. These range from an expected boost in GDP and in the general prosperity of a country to matters of national security. Unsurprisingly, governments around the world have used every available tool to foster innovation. These tools range from direct investments and subsidies to more indirect approaches such as tax benefits or policies meant to incentivise companies to invest more in research.
    One of the most commonly accepted metrics of innovation is the level of R&D expenditure, as it is easily accessible and quantifiable. In this project, we therefore attempt to predict R&D expenditure from a variety of metrics that are available for some of the biggest companies in the world. If we find a strong relationship with certain metrics, we will then replicate the process and predict future R&D expenditure with machine learning techniques.
    Predicting future R&D expenditure is a challenging endeavour, as it relies on publicly available data that companies have provided. In our case, one of the most reliable sources of data that we could find was the “EU Industrial R&D Investment Scoreboard”, which the European Commission prepares on an annual basis in order to “benchmark the performance of EU innovation-driven industries against major global counterparts and to provide an R&D investment database that companies, investors and policymakers can use to compare individual company performances against the best global competitors in their sectors.”
    This scoreboard includes data on the 2500 largest companies in the world. As these companies’ R&D expenditure for 2020 accounted for 90% of the world’s business-funded R&D, a successful way to predict their future R&D strategies would be an extremely valuable tool in the hands of policymakers in their efforts to encourage innovation further.

    Our dataset contains information from the annual reports for the years 2018-2021. The datasets that were used can be easily retrieved from the website "https://iri.jrc.ec.europa.eu/scoreboard/2021-eu-industrial-rd-investment-scoreboard". The variables include the current financial metrics of each company, such as market capitalisation and number of employees, the annual growth rates of these metrics and, finally, some broader characteristics of the company, such as the location in which it is based or the industry to which it belongs.

      Before using the data for the regression and the different machine learning techniques, we need to carry out the appropriate data cleaning. Moreover, to make the code as clear and easy to read as possible, we comment it thoroughly below.
      We combine the datasets by stacking their rows and aligning the columns that they have in common. The reason for this is that we want to approach the problem cross-sectionally and delve deeper into the relationship between the "R&D one-year growth" and the other available attributes. Since examining this relationship requires a large enough dataset, we decided to use the datasets of the last four years.
      It is of paramount importance to point out that, as the report mentions, "In 2020, the pandemic hit global business hard, causing a significant drop in companies' sales, profits and capital expenditures. However, overall R&D investment was sustained by increases in sectors positively affected by the crisis, namely ICT services and Health industries, while most other sectors decreased R&D investment, particularly the transport-related industries that have been most strongly affected by the lockdown."
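
    The effect of stacking tables whose columns only partly overlap can be illustrated with a minimal sketch. The two toy frames and their values below are hypothetical stand-ins, not the scoreboard data; the point is only that pd.concat keeps the union of the columns and fills the gaps with NaN, which is why the cleaning step that follows matters.

```python
import pandas as pd

# Two toy yearly tables (hypothetical values); they share only two columns
year_a = pd.DataFrame({
    "Company": ["A", "B"],
    "R&D one-year growth (%)": [5.0, -2.0],
    "Employees": [100, 250],
})
year_b = pd.DataFrame({
    "Company": ["A", "C"],
    "R&D one-year growth (%)": [3.0, 7.5],
    "Profitability (%)": [12.0, 4.0],
})

# Stack the rows; columns missing from one table are filled with NaN
stacked = pd.concat([year_a, year_b], ignore_index=True)

# Count the gaps that the stacking introduced in each column
missing_per_column = stacked.isna().sum()
print(stacked.shape)
print(missing_per_column)
```

    Columns that appear in only some of the yearly files end up mostly empty, so a threshold on the number of non-missing values is a natural way to keep only the widely shared attributes.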

In [3]:
#Since our dataframes have different columns, we do a crude concat that stacks the rows and keeps the union of the columns

data_whole = pd.concat([df1, df2, df3, df4], ignore_index=True)

    Regarding the assumptions that we have to make, the most vital one is that the observations with missing values are missing at random, that is, that their missingness is not correlated with the variables of interest; if it is, removing them would lead to bias. Additionally, a rather safer assumption concerns the reliability of our data source: as the data come from the European Commission, we assume that they are as reliable as they can be. On the positive side, we do not have to make any assumptions regarding the causal relationship between our variables, as the aim of this project is simply to use machine learning to predict the values and not to infer a causal relationship.
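
    One rough way to probe the missing-values assumption is to compare the target between rows that have a gap and rows that do not. The sketch below uses synthetic data and hypothetical column names (rd_growth, employees), with gaps deliberately introduced at random; it is an illustration of the check, not a result from the scoreboard data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data; column names are hypothetical
toy = pd.DataFrame({
    "rd_growth": rng.normal(5.0, 10.0, 500),
    "employees": rng.normal(1000.0, 200.0, 500),
})

# Knock out roughly 20% of one attribute completely at random
mask = rng.random(500) < 0.2
toy.loc[mask, "employees"] = np.nan

# Flag the rows with a gap and compare the target across the two groups
has_gap = toy["employees"].isna().rename("employees_missing")
summary = toy.groupby(has_gap)["rd_growth"].agg(["mean", "count"])
print(summary)
```

    If the two group means differed markedly on the real data, dropping the incomplete rows would be more likely to bias the results.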

In [4]:
#Dropping the attributes with fewer than 8000 non-NaN values

data_whole.dropna(inplace=True, axis=1, thresh=8000)

#Dropping the rows that contain NaN values

data_whole.dropna(inplace=True)
data_whole.isna().sum() # Check that we have no NaN values left

#We can do that because we are still left with a significant number of rows to conduct predictions
#Otherwise we would have to impute the missing values instead of dropping the rows
#We still have 8383 rows

#Drop unnecessary columns

data_whole = data_whole.drop(['World rank', 'Company', 'Country', 'Region'],axis= 1)
data_whole.columns


Index(['R&D one-year growth (%)', 'Net sales one-year growth (%)',
       'R&D intensity (%)', 'Capex one-year growth (%)', 'Capex intensity (%)',
       'Op.profits one-year growth (%)', 'Profitability (%)', 'Employees',
       'Employees one-year growth (%)', 'Market cap one-year growth (%)'],
      dtype='object')

In [5]:
#Converting the data from pandas object dtype to float64

for col in data_whole.columns:
    data_whole[col] = pd.to_numeric(data_whole[col], errors='coerce')

#Dropping NaN values that could have been created due to the argument errors='coerce'

data_whole.dropna(inplace=True)

In [6]:
#Scaling the data

data_to_be_scaled = data_whole.to_numpy()

scaler = MinMaxScaler(feature_range=(0, 1)) #Instantiating the scaler
data_scaled = scaler.fit_transform(data_to_be_scaled)

In [7]:
#Checking the shape of our data so we can perform the split
data_scaled.shape

(7774, 10)

In [8]:
# Splitting the data to x and y

y = data_scaled[:,0]
X = data_scaled[:,1:10]

In [9]:
#A regression


X2 = sm.add_constant(X) # adding a constant

est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary()) # This allows us to actually see the model


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.010
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     8.768
Date:                Fri, 17 Jun 2022   Prob (F-statistic):           3.14e-13
Time:                        13:39:08   Log-Likelihood:                 20432.
No. Observations:                7774   AIC:                        -4.084e+04
Df Residuals:                    7764   BIC:                        -4.077e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0121      0.044      0.272      0.7

    In this part we split the data into training and test sets using the classic train_test_split of sklearn and scale the data appropriately for the machine learning methods we are going to attempt, as not doing so would significantly impact the results of such methods. The scaler is fitted on the training set only and then applied to both sets.
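
    A small sketch of what fitting the scaler on the training data alone implies, using made-up one-column arrays: test values that lie outside the training range are mapped outside [0, 1], which is expected behaviour rather than an error.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test values for a single feature
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0], [-2.0]])

# Learn the minimum and maximum from the training data only
toy_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)

train_scaled = toy_scaler.transform(train)  # spans exactly 0 to 1
test_scaled = toy_scaler.transform(test)    # 12 maps to 1.2, -2 maps to -0.2
print(test_scaled)
```

    Fitting on the training set alone keeps information about the test set from leaking into the preprocessing.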

In [10]:
#Machine learning attempt

from sklearn.model_selection import train_test_split

#Splitting into train and test data, test size 20%

ML_x = data_to_be_scaled[:, 1:10] #Features: every column except the first

ML_y = data_to_be_scaled[:, 0] #Target: R&D one-year growth (%)

X_train, X_test, y_train, y_test = train_test_split(ML_x, ML_y, test_size=0.2, random_state=50)

#Scaling our data

scaler2 = MinMaxScaler(feature_range= (0,1))

scaler2.fit(X_train)

X_train_scaled = scaler2.transform(X_train)
X_test_scaled = scaler2.transform(X_test)


    We will use Random Forest Regression, as it combines the predictions of many decision trees to make a more accurate prediction than a single tree. That said, the presence of correlated predictors has been shown to impair its ability to identify strong predictors.

In [11]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate the model with 500 decision trees

rf = RandomForestRegressor(n_estimators=500, criterion='squared_error', random_state=42)

# Train the model on the training data

rf.fit(X_train_scaled, y_train)
rf_predictions = rf.predict(X_test_scaled) #Conducting predictions on test data

from sklearn.metrics import r2_score 

r2_score(y_test,rf_predictions) #Evaluating the model

-1.7749318140132826
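
    A negative R-squared such as the one above means that the predictions fit the test data worse than simply predicting the mean of the target would. The toy numbers below are made up purely to show how r2_score behaves in that situation.

```python
from sklearn.metrics import r2_score

# Made-up target values with mean 2.0
y_true = [1.0, 2.0, 3.0]

# Always predicting the mean gives an R-squared of exactly 0
r2_mean = r2_score(y_true, [2.0, 2.0, 2.0])

# Predictions that run against the data are worse than the mean: 1 - 8/2 = -3
r2_reversed = r2_score(y_true, [3.0, 2.0, 1.0])

print(r2_mean, r2_reversed)
```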

    Since the simple linear regression gives a better R2 than the random forest, we decided to try two other linear regression methods provided by sklearn, the Ridge and Lasso models, to explore this topic further. Ridge solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm.

    We are using Ridge Regression, as it is a method of estimating the coefficients in cases where the independent variables are highly correlated, as could be the case with our data, which reflects the financial health of a company.

In [12]:
#Ridge Regression ML

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1) #Instantiating the model

ridge_reg.fit(X_train_scaled, y_train) #Training the model

ridge_reg_predictions = ridge_reg.predict(X_test_scaled) #Conducting predictions on test data
r2_score(y_test,ridge_reg_predictions) #Evaluating the model

0.0427909250889289

    We are also using Lasso regression to further check the robustness of our final results and to observe potential differences between the performance of these methods.
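
    One difference worth observing between the two methods is sparsity: the l1 penalty of Lasso can set coefficients to exactly zero, whereas the l2 penalty of Ridge only shrinks them. The sketch below uses synthetic data in which only the first feature matters; the data and the variable names are assumptions for illustration, with the same alpha values as in the cells of this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic features in [0, 1]; only the first one drives the target
X_toy = rng.random((200, 5))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0.0, 0.1, 200)

ridge_toy = Ridge(alpha=1).fit(X_toy, y_toy)
lasso_toy = Lasso(alpha=0.1).fit(X_toy, y_toy)

# Count the coefficients that each model sets to exactly zero
n_zero_ridge = int(np.sum(ridge_toy.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_toy.coef_ == 0))
print(ridge_toy.coef_, lasso_toy.coef_)
print(n_zero_ridge, n_zero_lasso)
```

    Inspecting lasso_reg.coef_ in the same way on the real data would show which attributes, if any, the Lasso model actually retains.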

In [13]:
#Lasso regression ML

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)

lasso_reg.fit(X_train_scaled, y_train) #Training the model

lasso_reg_predictions = lasso_reg.predict(X_test_scaled) #Conducting predictions on test data
r2_score(y_test,lasso_reg_predictions) #Evaluating the model


0.034833694128738935

    In conclusion, the regularised machine learning models reach a larger R-squared on the test set (0.043 for Ridge and 0.035 for Lasso) than the simple OLS regression (0.010), which indicates that these methods explain the data somewhat better and are able to conduct predictions, although the random forest does not (-1.77). We believe that machine learning can indeed be used for the purpose of predicting the future R&D expenditure of a company and that this tool can be used by policymakers to guide them in creating policies aimed at fostering innovation.
    For the alpha of both Ridge and Lasso, we observe approximately similar values of R-squared with alpha between 0.1 and 1 and therefore use the value with the better results. We report only the R-squared because it is scale-independent and therefore easier to compare across models than metrics such as the RMSE.
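
    As an alternative to comparing alpha values by hand, the choice could be delegated to cross-validation on the training set. The sketch below swaps in sklearn's RidgeCV for that purpose; this is not what the notebook does, and the synthetic data, the candidate alphas and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic regression problem standing in for the scaled scoreboard features
X_toy = rng.random((300, 4))
y_toy = X_toy @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(0.0, 0.2, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=50)

# Let cross-validation on the training data pick alpha from a small grid
candidate_alphas = (0.1, 0.5, 1.0)
ridge_cv = RidgeCV(alphas=candidate_alphas).fit(X_tr, y_tr)

chosen_alpha = ridge_cv.alpha_
test_r2 = ridge_cv.score(X_te, y_te)  # R-squared on the held-out test set
print(chosen_alpha, test_r2)
```

    Selecting alpha on the training data in this way keeps the test set untouched until the final evaluation.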

    Of course, it is essential to emphasise the shortcomings of our analysis. The main issues can be attributed to the data: the availability of data was considerably constrained, and the existing datasets contained a number of missing values that could further hamper the reliability of our results. Moreover, despite coming from the same source (the European Commission), the attributes varied from year to year.

    Consequently, this project has great potential for more thorough research in the future, given better data accessibility and quality, while more computing power would allow the machine learning to be performed in a more detailed manner.