# Machine Learning to Predict Earnings for Stocks: Support-vector Machines

**Hugh Donnelly, CFA**<br> 
*AlphaWave Data*

**September 2021**

### Introduction
In this article, we cover the support-vector machine (SVM), a powerful algorithm that can be used in both classification and regression settings. Let's begin by laying down the theoretical foundation of the algorithm.

Jupyter Notebooks are available on [Google Colab]() and [Github]().

For this project, we use several Python-based scientific computing technologies listed below.

In [1]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<h4>Machine Learning Overview</h4>  
<img src='ML Photos/1_SVM_ML_Graph.PNG'>

So where does SVM fit among other [machine learning algorithms](https://hdonnelly6.medium.com/list/machine-learning-for-investing-7f2690bb1826)?

SVM is in the supervised learning family. Supervised learning is applied when we want to map new input data to some output data.  In the context of classification, it assigns a label to some input data called X (i.e. the independent features).  In a [regression](https://hdonnelly6.medium.com/introduction-to-machine-learning-regression-fee4200132f0), we map the input data X to a continuous output variable Y, as in a single-variable function, y = mx + b. The goal in supervised learning is to approximate a function that takes new input data X and predicts the output variable Y for that data. We train (i.e. supervise) the algorithm on labeled training data, and then tune it so that it also performs well on data it has not seen, which we check with the test data.

Let's build up the motivation and intuition for SVM.  Assume we have two classes defined by blue circles and red squares on a two dimensional plane.  How can we split this data so that the two classes are well separated?  We can start drawing lines, but you can see there are many options.

<h4>SVM used in Classification and Regression</h4>  
<img src='ML Photos/2_SVM_Categorize_Graph.PNG'>

SVMs define the best line as the one that maximizes the margin between the two classes.  This line is the optimal hyperplane, as seen in the graph below.  In practice we will usually be working with more than two dimensions, where the word hyperplane is more accurate than line.  Keep in mind that we need to select the hyperplane that not only classifies this data, but also generalizes well to new data points.

<h4>Optimal Hyperplane</h4>  
<img src='ML Photos/3_SVM_Hyperplane_Graph.PNG'>

If we look closely, the optimal hyperplane is identified by the maximum margin that is between boundary lines of the classes.  It is exactly on those class boundary lines where support-vectors exist.  So, support-vectors are just points that lie on the two margins.  Notice that support-vectors solely determine the maximum margin and the optimal hyperplane.  The points that are outside the support-vectors do not influence the position of that hyperplane.
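To make the role of support vectors concrete, here is a minimal sketch on six hypothetical points (not the article's dataset). It fits a linear SVM, removes one point that is not a support vector, refits, and compares the two hyperplanes. The very large `C` is an assumption used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical points, not from the article's dataset)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)
support_idx = set(clf.support_)

# Refit after removing one point that is NOT a support vector
non_sv = next(i for i in range(len(X)) if i not in support_idx)
mask = np.arange(len(X)) != non_sv
clf_reduced = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])

print("support vectors:\n", clf.support_vectors_)
print("w, b (all points): ", clf.coef_[0], clf.intercept_[0])
print("w, b (one removed):", clf_reduced.coef_[0], clf_reduced.intercept_[0])
```

The two fits should report essentially the same weights and intercept, since the removed point sits outside the margin and plays no part in positioning the hyperplane.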

As we can see, the main objective in SVM is to find the optimal hyperplane that correctly classifies data points between classes.  The dimension of this hyperplane is equal to the number of features minus one.  In our example with two features, the hyperplane is a one-dimensional line.  With three features, the hyperplane is a two-dimensional plane.

In our example, the two classes were linearly separable.  What if the data looks like the graph below, where one class is encircled by another?  Clearly, no matter how hard we try, no line will be able to separate the data into two classes.  This is where the kernel trick comes into play.

<h4>Using a Kernel to Map Data into a Higher-Dimensional Space</h4>  
<img src='ML Photos/4_SVM_Hyperplane_Circle_Data_Graph.PNG'>

The kernel transforms our low-dimensional input space into a higher-dimensional space where the classes can be linearly separated, as can be seen in the graph.

After the data is transformed into the three dimensions, we can easily separate the data using a two dimensional hyperplane.  In practice, we often work with higher dimensions so this technique scales as well.  
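The lift into three dimensions can be sketched on synthetic data. The example below is an illustration under assumptions: it uses scikit-learn's `make_circles` instead of the article's data, and it adds the third coordinate by hand (the squared distance from the origin) rather than through a kernel.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: one class encircled by the other (not the article's dataset)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A straight line in the original 2-D space cannot separate the rings
linear_2d = SVC(kernel="linear").fit(X, y)
acc_2d = linear_2d.score(X, y)

# Add a third coordinate z = x1^2 + x2^2 (squared distance from the origin)
X_3d = np.column_stack([X, (X ** 2).sum(axis=1)])

# In 3-D a flat plane can separate the two classes
linear_3d = SVC(kernel="linear").fit(X_3d, y)
acc_3d = linear_3d.score(X_3d, y)

print(f"training accuracy, 2-D linear: {acc_2d:.2f}")
print(f"training accuracy, 3-D linear: {acc_3d:.2f}")
```

The inner ring has small z values and the outer ring has large ones, so a plane at roughly constant z splits them even though no line in the original plane can.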

For those interested in the kernel's mathematical properties, the kernel function acts as a modified dot product: it takes input vectors in the original space and returns the dot product of the transformed vectors in the higher-dimensional space, without ever computing the transformation explicitly.
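A small worked example may help here. The sketch below uses the degree-2 polynomial kernel k(a, b) = (a · b)² as an illustration (it is not the RBF kernel used later in the article). The helper names `phi` and `poly2_kernel` are hypothetical. For 2-D inputs, this kernel corresponds to the explicit 3-D feature map (x1², √2·x1·x2, x2²).

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Kernel computed in the original 2-D space: k(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = np.dot(phi(a), phi(b))   # dot product in the 3-D feature space
implicit = poly2_kernel(a, b)       # same value without ever building phi

print(explicit, implicit)
```

Both routes give 16 for these inputs (up to floating-point rounding), because a · b = 4 and the explicit dot product is 9 + 6 + 1.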

### Initial Setup
Now let's retrieve simulated quarterly fundamentals data for anonymized members of the S&P 500 from a saved pickle file we will use in this analysis.  This pickle file contains 51 features that we will use to predict the direction of the next quarter's earnings based on the current quarter's fundamental data.

If you wish, you can also use real financial data provided by [AlphaWave Data](https://www.alphawavedata.com/) in this analysis.

In [2]:
# Load equity dataframe from the saved pickle file
data = pd.read_pickle("./svm_data.pkl")
data

ID,EPS,change in EPS,Account Receivable Turnover,Current Ratio,Quick Ratio,Inventory Turnover,Total Debt To Equity,ROA,ROE,Gross Profit Margin,...,Change in Equity to Fixed Assets,Change in Sales to Total Assets,Change in EBIT to revenue,Change in Profit margin,Change in Sales to Inventory,Change in Sales to Working capital,Change in R&D to Revenue,Change in working cap to Assets,Change in Operating Income or Losses,Change in EBITDA Margin
0,3.044,,26.798445,3.150561,2.064582,22.240825,1325.517988,8.883952,33.958244,25.273754,...,,,,,,,,,,
0,3.424,0.380000,24.727972,2.700914,1.842956,21.072092,1347.030183,12.324957,44.042406,27.514299,...,-23.494239,-6.434533,29.515626,19.434621,-0.686169,52.633164,1.116427,-3.893287,21.181899,28.212400
0,3.934,0.510000,24.581375,2.704947,1.884968,21.077957,1345.780617,14.640880,48.929598,28.984562,...,1.892533,-1.163430,12.278989,19.276009,-1.082365,-4.937915,-6.448939,0.917992,10.972702,2.175614
0,4.084,0.150000,24.166449,2.537465,1.756469,20.245552,1406.426471,16.180438,58.727070,28.999023,...,-27.764985,-0.219476,-0.933605,-0.300019,1.578928,52.635740,-0.104461,-7.192905,-1.151032,6.219108
0,3.064,-1.020000,22.217841,3.063550,2.010260,17.716250,1402.460965,16.017779,59.725389,26.217002,...,17.893434,-7.214681,-24.290138,-32.450923,-9.127981,-51.242550,23.044946,-1.773955,-29.752363,-14.951802
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
504,1.914,0.306168,14.783556,2.844682,2.526065,,1399.524214,4.393086,19.923301,76.257884,...,,4.423973,6.126118,69.289285,,-19.917751,,-8.792037,10.821108,6.053170
504,1.784,-0.130000,14.864186,2.843043,2.631217,,1401.624076,4.715484,20.690754,78.124570,...,,-5.681214,18.362582,-15.160033,,5.456669,,-0.223852,11.638151,10.519497
504,1.634,-0.150000,15.180999,3.065134,2.847354,,1388.890037,-0.439290,0.712332,80.462786,...,,-4.431060,-2.815343,-24.099620,,-22.544284,,-0.832590,-7.121654,0.428377
504,1.674,0.040000,16.602697,3.221663,2.807657,,1385.848914,4.210020,19.787080,73.175611,...,,-0.264846,-26.519779,9.836592,,-20.725509,,-4.097929,-26.714389,-17.679338


Let's begin by outlining the steps we will take to make this prediction.

### Earnings movement prediction

#### Forecast direction of next quarter earnings based on accounting information of the current quarter

#### Steps:
- Enhance data with additional information
- Preprocess the data
- Apply Support-vector Machines on our dataset
- Try to improve our results through [PCA](https://hdonnelly6.medium.com/machine-learning-for-esg-stock-trading-pca-and-clustering-ebe6077fc8f0)



In [3]:
data.head(3)

ID,EPS,change in EPS,Account Receivable Turnover,Current Ratio,Quick Ratio,Inventory Turnover,Total Debt To Equity,ROA,ROE,Gross Profit Margin,...,Change in Equity to Fixed Assets,Change in Sales to Total Assets,Change in EBIT to revenue,Change in Profit margin,Change in Sales to Inventory,Change in Sales to Working capital,Change in R&D to Revenue,Change in working cap to Assets,Change in Operating Income or Losses,Change in EBITDA Margin
0,3.044,,26.798445,3.150561,2.064582,22.240825,1325.517988,8.883952,33.958244,25.273754,...,,,,,,,,,,
0,3.424,0.38,24.727972,2.700914,1.842956,21.072092,1347.030183,12.324957,44.042406,27.514299,...,-23.494239,-6.434533,29.515626,19.434621,-0.686169,52.633164,1.116427,-3.893287,21.181899,28.2124
0,3.934,0.51,24.581375,2.704947,1.884968,21.077957,1345.780617,14.64088,48.929598,28.984562,...,1.892533,-1.16343,12.278989,19.276009,-1.082365,-4.937915,-6.448939,0.917992,10.972702,2.175614


Let's begin by enriching our data with some additional columns.  In a typical machine learning workflow, the majority of the effort is usually dedicated to data cleaning and data preparation.  To run SVM successfully, we need to do much of this work before we can feed the data into the model.  To enhance the data, we follow the steps below.

#### Enhance data:
- Change in earnings per share: (Current Period EPS - Prior Period EPS)
- Assign 1 to a positive change in EPS and 0 otherwise
- Shift the binary label by -1: we will be using current financial data to predict the future change in earnings
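The steps above can be sketched on a toy panel. This is a variation under stated assumptions, not the article's exact code: the frame `toy` and its values are hypothetical, `ID` is held as a regular column rather than as the index, and the difference and shift are applied within each company via `groupby`, so that the last quarter of one company never receives a label from the next company.

```python
import pandas as pd

# Hypothetical toy panel: two companies (ID 0 and 1), three quarters each
toy = pd.DataFrame({
    "ID":  [0, 0, 0, 1, 1, 1],
    "EPS": [1.0, 1.2, 1.1, 5.0, 4.0, 4.5],
})

# Change in EPS within each company (the first quarter of each company is NaN)
toy["change in EPS"] = toy.groupby("ID")["EPS"].diff()

# 1 for a positive change, 0 otherwise (NaN compares as False, so it becomes 0)
toy["binary change"] = (toy["change in EPS"] > 0).astype(int)

# Next quarter's label for the same company; the last quarter of each company has none
toy["Future change"] = toy.groupby("ID")["binary change"].shift(-1)

print(toy)
```

Rows whose `Future change` is NaN (the final quarter of each company) would then be removed by a `dropna` on that column.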


In [4]:
# Create binary column: 1 for a positive change in EPS, 0 otherwise (NaN also maps to 0)
data['binary change'] = (data['change in EPS'] > 0).astype(int)

# Shift the label by -1 so each row holds the next row's change: 1 or 0
# Note: shift(-1) runs over the whole frame, so the last row of each ID picks up the first row of the next ID
data['Future change'] = data['binary change'].shift(-1)

In [5]:
# Goal is to anticipate the sign of the future earnings change from the financial data of the current quarter.
# If the future earnings change is positive, we assign 1, otherwise 0, to the Future change value of the current quarter
data[['EPS','change in EPS','Future change']].head(6)

ID,EPS,change in EPS,Future change
0,3.044,,1.0
0,3.424,0.38,1.0
0,3.934,0.51,1.0
0,4.084,0.15,0.0
0,3.064,-1.02,0.0
0,1.654,-1.41,1.0


Using the pandas describe function to examine our data, we can see that a number of columns contain positive or negative infinity.

In [6]:
# Examine data 
data.describe()

Unnamed: 0,EPS,change in EPS,Account Receivable Turnover,Current Ratio,Quick Ratio,Inventory Turnover,Total Debt To Equity,ROA,ROE,Gross Profit Margin,...,Change in EBIT to revenue,Change in Profit margin,Change in Sales to Inventory,Change in Sales to Working capital,Change in R&D to Revenue,Change in working cap to Assets,Change in Operating Income or Losses,Change in EBITDA Margin,binary change,Future change
count,4495.0,3995.0,3751.0,3842.0,3842.0,2906.0,4278.0,4445.0,4149.0,3575.0,...,3868.0,4025.0,2736.0,3410.0,1287.0,3388.0,4018.0,3721.0,4545.0,4544.0
mean,2.526763,0.048794,inf,2.991957,2.393492,24.17929,1422.67078,6.090243,29.921766,55.445803,...,inf,340.6696,2.154707,54.832803,inf,0.34051,inf,-36.992755,0.467987,0.46809
std,3.903321,2.803885,,1.374618,1.099318,26.750252,888.674995,6.201164,36.878775,53.44922,...,,18875.29,38.520377,557.009442,,594.324837,,1739.997723,0.499029,0.499036
min,-37.796,-45.08,12.512706,1.380189,1.251109,12.341636,1234.0,-82.358224,-267.66,-2003.262946,...,-222289.5,-40413.1,-1225.694465,-970.974658,-100.0,-6485.354365,-19457.54,-102661.762624,0.0,0.0
25%,1.624,-0.31,16.848874,2.226506,1.76188,15.088645,1283.288269,2.823914,18.352548,40.309807,...,-24.98755,-39.03804,-10.095712,-20.948636,-7.5001,-10.985229,-30.16442,-13.991613,0.0,0.0
50%,2.134,0.035,18.742693,2.619922,2.101069,17.086972,1325.637636,5.199796,23.269783,54.984085,...,-1.277005,-3.455357,0.094533,-0.696431,-0.765746,-1.484532,-2.751739,0.213443,0.0,0.0
75%,2.864,0.386667,23.213704,3.250817,2.601631,21.818833,1399.547976,8.807839,33.320514,74.795379,...,16.9678,24.44517,11.251677,21.098892,7.577819,7.065427,20.27,11.807783,1.0,1.0
max,83.314,56.758548,inf,16.903861,12.961896,404.276095,44454.0,52.774513,640.483713,109.258971,...,inf,1185553.0,550.720288,23774.482579,inf,31651.342509,inf,18111.658087,1.0,1.0


We will replace negative and positive infinity with NaN.

In [7]:
# Replace infinity with nan
data = data.replace([np.inf, -np.inf], np.nan)

We will also drop the rows where the change in earnings per share or the future change is NaN.  We are trying to predict the change in earnings, so rows where these values are missing carry no useful information for our analysis.

In [8]:
# Drop rows where change in EPS or Future change is NaN: they are of no use to us
data = data.dropna(subset = ['change in EPS', 'Future change'])

We are also going to drop three columns: EPS, change in EPS, and binary change.  These columns were used to build the Future change target and are no longer needed as we go on to examine the missing data.

In [9]:
# We no longer need these columns
data = data.drop(columns = ['EPS','change in EPS','binary change'])

As you can see, almost every column other than Future change has some percentage of missing values, and some columns have a substantial amount.  We have to deal with these missing values before proceeding.

In [10]:
# Examine missing data
missing_column_data = 100*(data.isnull().sum() / data.shape[0]).round(3)
print('Percent of missing values per column:\n', missing_column_data)

Percent of missing values per column:
 Account Receivable Turnover                18.5
Current Ratio                              15.3
Quick Ratio                                15.3
Inventory Turnover                         35.8
Total Debt To Equity                        5.6
ROA                                         1.5
ROE                                         8.1
Gross Profit Margin                        21.1
Accounts Receivable Turnover               18.5
Inventory to Sales                         16.6
LT Debt to Total Equity                     4.9
Sales to Total Assets                       0.2
EBIT to revenue                             4.0
Profit margin                               0.1
Sales to Cash                               0.2
Sales to Inventory                         32.1
Sales to Working capital                   15.3
Sales to Dep Fixed assets                  45.1
Working capital to total Asset             15.3
Operating Income to Total Assets            0.2
...

Real world data often has missing values which require careful attention.  The handling of missing values is very important during the pre-processing step because many machine learning algorithms do not work with missing data.  There are two general ways of thinking about how to handle missing data.  One way is to delete the rows with the missing data, but we risk losing valuable information doing this.  The alternative is to impute the missing values using one of an array of methods, such as mean or median imputation, neural networks, or Multiple Imputation by Chained Equations (MICE).
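As an illustration of the imputation route, the sketch below uses scikit-learn's `SimpleImputer` (median) and `KNNImputer` on two small hypothetical arrays. It assumes the data has already been split into train and test, so that the fill values are learned from the training rows only. This is an alternative to the approach taken in this article, not a description of it.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy matrices with missing values (np.nan)
X_train_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_test_toy = np.array([[np.nan, 20.0], [4.0, np.nan]])

# Median imputation: statistics are learned from the training rows only
median_imputer = SimpleImputer(strategy="median")
X_train_med = median_imputer.fit_transform(X_train_toy)
X_test_med = median_imputer.transform(X_test_toy)

# KNN imputation: fill each gap from the nearest training rows
knn_imputer = KNNImputer(n_neighbors=2)
X_train_knn = knn_imputer.fit_transform(X_train_toy)
X_test_knn = knn_imputer.transform(X_test_toy)

print("training medians:", median_imputer.statistics_)
print(X_test_med)
```

Fitting on the training rows and only transforming the test rows keeps information from the test set out of the fill values.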

In this exercise, we will drop columns that have more than 30% of data missing.

In [11]:
# Identify the 10 columns that have more than 30% of data missing
columns_to_drop = missing_column_data[missing_column_data > 30]
columns_to_drop

Inventory Turnover                  35.8
Sales to Inventory                  32.1
Sales to Dep Fixed assets           45.1
change in Inventories               32.2
change in Inventory Turnover        36.3
change in R&D Expense               68.0
Change in Inventory to Sales        33.6
Change in Equity to Fixed Assets    49.2
Change in Sales to Inventory        32.3
Change in R&D to Revenue            68.0
dtype: float64

This will result in us dropping ten columns.

In [12]:
# Drop the 10 columns identified above
data = data.drop(columns = list(columns_to_drop.index))
print( f'New Dataframe shape : {data.shape}' )

New Dataframe shape : (3994, 40)


Let's continue with pre-processing our data.

#### Preprocess data:
- Handle remaining missing values
- Minimize influence of outliers by performing Winsorization
- Standardize data 


We handle the remaining missing data by replacing each NaN with the mean of its column.

In [13]:
# Keep in mind that this is a naive way to handle missing values.
# The means are computed before the train/test split, which can cause data leakage,
# and the method does not factor in the covariance between features.
# For more robust methods, take a look at MICE and KNN imputation.

# Fill each column's NaN values with that column's mean
data = data.fillna(data.mean())

In [14]:
# Check for missing values
missing_column_data = 100*(data.isnull().sum()/ data.shape[0]).round(3)
print('Percent of missing values per column:\n',missing_column_data)

Percent of missing values per column:
 Account Receivable Turnover                0.0
Current Ratio                              0.0
Quick Ratio                                0.0
Total Debt To Equity                       0.0
ROA                                        0.0
ROE                                        0.0
Gross Profit Margin                        0.0
Accounts Receivable Turnover               0.0
Inventory to Sales                         0.0
LT Debt to Total Equity                    0.0
Sales to Total Assets                      0.0
EBIT to revenue                            0.0
Profit margin                              0.0
Sales to Cash                              0.0
Sales to Working capital                   0.0
Working capital to total Asset             0.0
Operating Income to Total Assets           0.0
Trailing 12M EBITDA Margin                 0.0
Div as % of CF                             0.0
change in Depreciation and Amortization    0.0
change Total Assets  

Before we proceed further, we need to split the data into train and test sets.  Holding out a test set is essential in machine learning because it lets us detect overfitting: it shows how good our model really is and how well it performs on new data.  We train the model on the training data and then use the trained model to make predictions on the test data, whose labels are withheld from the model and used only for scoring.

Here we split the data into train and test by allocating 80% of the data to train and 20% of the data to test.

In [15]:
# First we need to split our data into train and test. 
from sklearn.model_selection import train_test_split

# Independent values/features
X = data.iloc[:,:-1].values
# Dependent values
y = data.iloc[:,-1].values

# Create test and train data sets, split data randomly into 20% test and 80% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

We also need to winsorize the data to limit the influence of extreme values, typically by setting all outliers to a specified percentile of the data.  Notice how we are winsorizing the train data and test data separately.  If you winsorize all of your data together first and partition it into training and testing sets afterwards, you are allowing future data (i.e. test data) to influence your cutoff values.  Since you won't know the future when you use your model, you cannot use data manipulation affected by your future test data.

In [16]:
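from scipy.stats import mstats

# Winsorize top 1% and bottom 1% of points.
# Note: with the default axis=None, winsorize ranks all values of the array together;
# passing axis=0 would apply the limits to each feature column separately.

# Apply on X_train and X_test separately
X_train = mstats.winsorize(X_train, limits = [0.01, 0.01])
X_test = mstats.winsorize(X_test, limits = [0.01, 0.01])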
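Note that `mstats.winsorize` with its default `axis=None` treats the whole array as one pool of values, so features on larger scales dominate the cutoffs. As a hedged alternative, the sketch below winsorizes each feature separately with NumPy, taking the 1st and 99th percentile cutoffs from the training set and applying them to both sets. The toy arrays are hypothetical, and this is a variation on the cell above rather than the code that produced the results in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X_train_toy = rng.normal(size=(1000, 2)) * np.array([1.0, 100.0])
X_test_toy = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])

# Per-column 1st and 99th percentiles, learned from the training set only
lower = np.percentile(X_train_toy, 1, axis=0)
upper = np.percentile(X_train_toy, 99, axis=0)

# Clip both sets to the training cutoffs (the bounds broadcast across columns)
X_train_w = np.clip(X_train_toy, lower, upper)
X_test_w = np.clip(X_test_toy, lower, upper)

print("lower cutoffs per feature:", lower)
print("upper cutoffs per feature:", upper)
```

Each feature gets its own pair of cutoffs, and the test set is clipped with values that were fixed before it was seen.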

There is one last thing that we have to do before we train the algorithm and that is to standardize the data.

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.

Standardization of a dataset is a common requirement for many machine learning estimators.  These algorithms may behave badly if the individual features do not more or less look like standard normally distributed data (i.e. Gaussian with zero mean and unit variance).

For instance many elements used in the objective function of a machine learning algorithm (such as the RBF kernel of Support-vector Machines (SVM) or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order.  If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

In [17]:
# Standardize features by removing the mean and scaling to unit variance.

# IMPORTANT: During testing, it is important to construct the test feature vectors using the means and standard deviations saved from
# the training data, rather than computing it from the test data. You must scale your test inputs using the saved means
# and standard deviations, prior to sending them to your SVM library for classification.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit to training data and then transform it
X_train = sc.fit_transform(X_train)
# Perform standardization on testing data using mu and sigma from training data
X_test = sc.transform(X_test)

So what are the advantages and disadvantages of support-vector machines?

[Source: scikit-learn](https://scikit-learn.org/stable/modules/svm.html) <br>

### SVM

**Advantages:**
* Effective in high dimensional spaces.
* Still effective in cases where the number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

**Disadvantages:**
* If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
* It also doesn't perform very well when the data set is noisy (i.e. when target classes are overlapping).
* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.


<img src='img/svm.jpg'>

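Let's train SVM on our data.  We will first use C = 1 and the RBF kernel, which are the scikit-learn defaults, together with gamma = 'auto', which sets gamma to 1 / n_features.  (In scikit-learn 0.22 and later, the default gamma is 'scale'.)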

In [18]:
# Support Vector Classification(C)
from sklearn.svm import SVC

# Initialize svm, rbf is a default kernel
classifier_rbf = SVC(C = 1, kernel = 'rbf', gamma = 'auto', random_state = 0)

# Fit the model on training data
classifier_rbf.fit(X_train, y_train)

# Make a prediction on testing data
y_pred_rbf = classifier_rbf.predict(X_test)

We can also check the model's accuracy score and classification report.  The classification report gives precision, recall, and f1-scores for each class.  Precision is the fraction of predictions for a class that actually belong to that class.  Recall is the fraction of all examples of a class in the dataset that the model correctly identifies.  The f1-score is the harmonic mean of the two.

In [19]:
# Import accuracy score
from sklearn.metrics import accuracy_score
ac_rbf = accuracy_score(y_test, y_pred_rbf)
print('Accuracy with RBF: {:.2f}'.format(ac_rbf))

Accuracy with RBF: 0.58


In [20]:
# Precision and recall
from sklearn.metrics import classification_report
result = classification_report(y_test, y_pred_rbf)
print(result)

              precision    recall  f1-score   support

         0.0       0.57      0.77      0.66       413
         1.0       0.61      0.38      0.47       386

    accuracy                           0.58       799
   macro avg       0.59      0.58      0.56       799
weighted avg       0.59      0.58      0.57       799
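The precision and recall figures in a report like this can be reproduced by hand from a confusion matrix. The sketch below does so on ten hypothetical labels and predictions (these are not the article's results) and compares the hand calculation with scikit-learn's own `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (not the article's results)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

Here two of the three positive predictions are correct (precision 2/3), and two of the four actual positives are found (recall 1/2).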



Let's see if we can improve this algorithm by tuning some of the hyperparameters.  So what are the hyperparameters?

In machine learning, hyperparameters are those parameters whose values are used to control the learning process.  Their configuration is external to the model and the tuning process usually involves discovering hyperparameters that result in the model making the most skillful predictions.  Hyperparameters are often used in the process to help estimate the model parameters.  Let's look at the three most important hyperparameters in our example.

#### Hyperparameters:
- Kernel - transforms the data into the required form (i.e. dimension) so that the classes can be separated. RBF is useful when the decision boundary is non-linear: it computes a linear separation in a higher dimension, which corresponds to a curved boundary in the original space. In some applications, a more complex kernel is suggested to separate classes whose boundary is curved or nonlinear.
- Regularization, C - penalty parameter on misclassification or error. It tells the SVM optimization how much error is bearable. A small C tolerates more misclassification and results in a large-margin hyperplane, while a large C penalizes errors heavily and results in a small-margin hyperplane.
- Gamma - defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors. Higher values of gamma fit the training dataset more closely, which can cause overfitting.
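The effect of C on the margin can be checked numerically. The sketch below is illustrative only: it uses synthetic overlapping blobs from `make_blobs` rather than the article's data, fits a linear SVM for three arbitrary values of C, and reports the margin width, which for a linear SVM is 2 / ||w||.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (not the article's dataset)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

margins = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Distance between the two margin boundaries of a linear SVM
    margins[C] = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: margin width={margins[C]:.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```

The smallest C should give the widest margin, because errors are cheap and the optimizer prefers a small weight vector. The largest C should give the narrowest.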

In [21]:
# Keep C = 1 and switch the kernel to linear (gamma is ignored by the linear kernel)
classifier_lin = SVC(C = 1, kernel = 'linear', gamma = 'auto', random_state = 0)

# Fit the model on training data
classifier_lin.fit(X_train, y_train)

# Make a prediction on testing data
y_pred_lin = classifier_lin.predict(X_test)

from sklearn.metrics import accuracy_score
ac_lin = accuracy_score(y_test, y_pred_lin)
print('Accuracy with Linear: {:.2f}'.format(ac_lin))

Accuracy with Linear: 0.57


Finding optimal hyperparameters can be tedious, but it can be done systematically by trying various combinations of hyperparameters and keeping the combination that performs best.
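One common way to automate this search is scikit-learn's `GridSearchCV`. The sketch below is a minimal illustration on synthetic data from `make_classification`, standing in for the article's dataset. The grid values are assumptions chosen for demonstration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the article's data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["rbf", "linear"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.2f}")
print(f"held-out accuracy: {search.score(X_te, y_te):.2f}")
```

Cross-validation picks the combination using the training data only, so the held-out test set still gives an unbiased check at the end.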

Can we speed up our SVM algorithm?

#### Principal Component Analysis (PCA)
- Common way to speed up machine learning algorithms
- A large number of features in the dataset can affect both the training time and the accuracy of the model
- PCA is a statistical technique that replaces the original features with a smaller number of components (linear combinations of the features) that capture the maximum information about the dataset
- Components are ranked by the variance they explain: the higher the variance, the more information that component conveys
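To show how a variance threshold translates into a number of components, the sketch below builds hypothetical data (10 correlated features driven by 3 latent factors) and finds the smallest number of components whose cumulative explained variance exceeds 95%. It then compares that count with what `PCA(n_components=0.95)` selects on its own.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 10 correlated features driven by 3 latent factors plus noise
latent = rng.normal(size=(500, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

# Fit all components and accumulate the share of variance each one explains
full = PCA().fit(X_toy)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%
k = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float in (0, 1) asks PCA to choose that number itself
auto = PCA(n_components=0.95).fit(X_toy)

print("components needed:", k, "| chosen by PCA(0.95):", auto.n_components_)
```

Because the toy features are built from only three factors, a handful of components carries almost all of the variance.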

In [22]:
from sklearn.decomposition import PCA

# keep 95% of variance
pca = PCA(0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [23]:
# Components that explain 95% of variance in our dataset
explained_variance = pca.explained_variance_ratio_
# 28 components explain 95% of variance, down from the original 39 features
len(explained_variance)

28

We are able to achieve similar accuracy with only 28 components.

In [24]:
classifier = SVC(C = 1, kernel='rbf',gamma = 'auto',random_state=0)

classifier.fit(X_train_pca, y_train)
y_pred = classifier.predict(X_test_pca)
ac = accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(ac))


Accuracy: 0.57


As can be seen, PCA can be a useful tool in your machine learning implementation.  We reduced the 39 input features to 28 principal components while achieving a similar prediction accuracy (0.57 versus 0.58), and a smaller input should also shorten training time, although we did not time it here.

### Conclusion
SVMs can be a powerful tool in your machine learning arsenal, and the kernel is what makes them especially useful.  Some classification methods rely on the strong assumption that data samples are linearly separable.  With a kernel, the input data can be mapped into a high dimensional space where a linear separation may exist, so we no longer have to rely on that assumption in the original space.  With appropriate regularization, SVMs are fairly robust to overfitting and perform very well when there is a clear margin of separation between classes.  Also, SVMs can be used when the total number of samples is less than the number of features.

### Additional Resources

#### Helpful Blog Posts
Machine Learning for Investing: https://hdonnelly6.medium.com/list/machine-learning-for-investing-7f2690bb1826

#### Python Libraries

Scikit train_test_split:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Scikit SVM:
https://scikit-learn.org/stable/modules/svm.html
    
PCA:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Missing values imputation:
https://scikit-learn.org/stable/modules/impute.html

    