## Exploratory data analysis of the US Stock Financial indicators data set

#### Names of contributors: Anene Ifeanyi, Chizitere Igwe

#### Date: 2020-12-24

# Table of contents

1. [Summary](#Summary)
2. [Method](#Method)
3. [Data](#Data)
4. [Analysis and Results](#PartitionDataintotrainandtestsplits)
5. [Exploratory Data Visualisations](#ExploratoryDataVisualisations)
6. [References](#References)

## Summary <a name="Summary"></a>




## Method <a name="Method"></a>

Given the financial stock indicators, should a hypothetical investor buy the stock or not? 

### Data <a name="Data"></a>

In [1]:
# Import required exploratory data analysis packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# train test split and cross validation
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, RandomizedSearchCV

# Preprocessing
from sklearn.preprocessing import (
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.impute import SimpleImputer

# Feature selection
from sklearn.feature_selection import RFE, RFECV


# classifiers / models
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


# Others
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.metrics import f1_score, mean_squared_error, make_scorer, recall_score, r2_score

In [2]:
# For this exploratory data analysis, we will only be working with the 2014 dataset. The techniques applied here can be applied to the other datasets.

df_2014 = pd.read_csv('../data/raw/2014_Financial_Data.csv') # Home directory is working directory

In [3]:
# The first CSV column has no header and holds the ticker symbols, so give it a proper name
df_2014 = df_2014.rename(columns={'Unnamed: 0': 'Ticker'})

## Analysis and Results <a name="PartitionDataintotrainandtestsplits"></a>



In [4]:
train_df, test_df = train_test_split(df_2014, train_size = 0.75, random_state = 123)

train_df

| | Ticker | Revenue | Revenue Growth | Cost of Revenue | Gross Profit | R&D Expenses | SG&A Expense | Operating Expenses | Operating Income | Interest Expense | ... | Receivables growth | Inventory Growth | Asset Growth | Book Value per Share Growth | Debt Growth | R&D Expense Growth | SG&A Expenses Growth | Sector | 2015 PRICE VAR [%] | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2690 | COWN | 4.277760e+08 | 0.3021 | 2.344100e+07 | 4.043350e+08 | 0.0 | 3.871980e+08 | 4.232100e+08 | -1.887500e+07 | 0.000000e+00 | ... | 0.3715 | 0.0000 | 0.3060 | 0.3553 | 1.8686 | 0.0000 | 0.3901 | Financial Services | 460.323717 | 1 |
| 1809 | HIL | 4.893480e+08 | -0.1514 | 3.227330e+08 | 1.666150e+08 | 0.0 | 1.462650e+08 | 1.462650e+08 | 2.035000e+07 | 3.099000e+06 | ... | 0.1474 | 0.0000 | 0.0494 | 0.1749 | -0.0881 | 0.0000 | -0.1934 | Industrials | -0.767263 | 0 |
| 178 | GGB | 1.597685e+10 | -0.0537 | 1.404669e+10 | 1.930158e+09 | 0.0 | 1.024389e+09 | 1.118824e+09 | 8.113342e+08 | 5.247371e+08 | ... | -0.0352 | -0.0751 | -0.0399 | -0.0590 | 0.0360 | 0.0000 | -0.0740 | Basic Materials | -63.811960 | 0 |
| 1971 | HST | 5.321000e+09 | 0.0300 | 3.674000e+09 | 1.647000e+09 | 0.0 | 2.700000e+08 | 7.170000e+08 | 9.300000e+08 | 2.070000e+08 | ... | 0.3462 | 0.0000 | -0.0501 | 0.0136 | -0.1685 | 0.0000 | -0.2128 | Real Estate | -32.611145 | 0 |
| 118 | CELH | 1.461009e+07 | NaN | 9.011923e+06 | 5.598167e+06 | 0.0 | 7.128100e+06 | 7.128100e+06 | -1.529933e+06 | 4.972030e+05 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Consumer Defensive | 288.000011 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1122 | FIVE | 5.354020e+08 | 0.2783 | 3.473860e+08 | 1.880160e+08 | 0.0 | 1.342790e+08 | 1.342790e+08 | 5.373700e+07 | 1.513000e+06 | ... | 0.0000 | 0.4693 | 0.2235 | 0.0987 | -0.4348 | 0.0000 | 0.1970 | Consumer Cyclical | -19.589178 | 0 |
| 1346 | EDUC | 2.609700e+07 | 0.0239 | 1.052350e+07 | 1.557350e+07 | 0.0 | 1.432180e+07 | 1.432180e+07 | 1.251700e+06 | 0.000000e+00 | ... | -0.1223 | 0.0054 | -0.0638 | -0.0731 | -1.0000 | 0.0000 | 0.0404 | Consumer Cyclical | 154.366251 | 1 |
| 3454 | WIX | 1.418410e+08 | 0.7626 | 2.610800e+07 | 1.157330e+08 | 57832000.0 | 1.135450e+08 | 1.713770e+08 | -5.564400e+07 | 0.000000e+00 | ... | 0.4266 | 0.0000 | 0.0266 | -0.7010 | 0.0000 | 0.9498 | 0.8289 | Technology | 8.076013 | 1 |
| 3437 | AMBA | 1.576080e+08 | 0.3018 | 5.776100e+07 | 9.984700e+07 | 48777000.0 | 2.315300e+07 | 7.193000e+07 | 2.791700e+07 | 0.000000e+00 | ... | -0.0653 | 0.1720 | 0.3225 | -0.3205 | 0.0000 | 0.1389 | 0.2916 | Technology | 9.251276 | 1 |
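The behaviour of the 75/25 split can be checked on a small stand-in frame. The sketch below is only an illustration: `toy_df` and its values are made up and are not part of the financial dataset. It shows that a fixed `random_state` reproduces the same partition and that the two splits share no rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for df_2014 (all values are made up).
toy_df = pd.DataFrame({
    "Ticker": [f"T{i}" for i in range(20)],
    "Revenue": range(20),
    "Class": [0, 1] * 10,
})

# Same arguments as the split above: 75% of rows for training, fixed seed.
toy_train, toy_test = train_test_split(toy_df, train_size=0.75, random_state=123)

# Repeating the call with the same seed reproduces the same partition.
toy_train_again, _ = train_test_split(toy_df, train_size=0.75, random_state=123)

print(len(toy_train), len(toy_test))
print(toy_train.index.equals(toy_train_again.index))
print(set(toy_train.index) & set(toy_test.index))  # rows shared by both splits
```

Because the split is random rather than stratified, the class balance of the two parts can differ slightly; passing `stratify=toy_df["Class"]` is one option if equal class proportions matter.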


In [5]:
# Separate the features (X) from the target (y) for both the train and test splits

X_train, y_train = (train_df.drop(columns = ["Class"]), train_df["Class"])

X_test, y_test = (test_df.drop(columns = ["Class"]), test_df["Class"])

In [6]:
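# Preprocessing and transformations

drop_features = ["Ticker"]

categorical_features = ["Sector"]

# Note: this selects every numeric column, including "2015 PRICE VAR [%]".
# Class appears to be derived from that column, so it should be excluded
# before fitting any real model to avoid target leakage.
numerical_features = X_train.select_dtypes(include=np.number).columns.tolist()

# len(numerical_features) == 222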

In [7]:
numeric_transformer = make_pipeline(StandardScaler(), SimpleImputer())


preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numerical_features),
    (OneHotEncoder(handle_unknown = "ignore"), categorical_features)
)
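To see what this preprocessor produces, the same three steps can be applied to a tiny frame. This is a minimal sketch on made-up data: `toy_X` and `toy_preprocessor` are hypothetical names, and only one numeric and one categorical column are used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (made-up values) with one missing revenue.
toy_X = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC", "DDD"],
    "Revenue": [1.0, 2.0, np.nan, 3.0],
    "Sector": ["Technology", "Healthcare", "Technology", "Utilities"],
})

# Same structure as the preprocessor above: drop, scale + impute, one-hot encode.
toy_preprocessor = make_column_transformer(
    ("drop", ["Ticker"]),
    (make_pipeline(StandardScaler(), SimpleImputer()), ["Revenue"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Sector"]),
)

toy_out = toy_preprocessor.fit_transform(toy_X)
# The result may be sparse or dense depending on its density; normalise to dense.
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)

print(toy_out.shape)  # one scaled numeric column plus one column per sector
print(toy_out[:, 0])  # the missing revenue is filled with the scaled mean
```

`StandardScaler` ignores missing values when fitting and leaves them in place, so scaling before imputing works; the imputed entries then sit at the mean of the scaled column, which is approximately zero. Imputing first and scaling second is the more common ordering and gives slightly different scaled values.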

### Baseline model

In [8]:
# Helper function

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns the mean and standard deviation of cross validation scores
    """

    scores = cross_validate(model, X_train, y_train, **kwargs)

    scores_df = pd.DataFrame(scores)
    mean_scores = scores_df.mean()
    std_scores = scores_df.std()

    # Format each metric as "mean (+/- std)"; zip avoids positional
    # indexing into a label-indexed Series.
    out_col = [
        f"{mean:0.3f} (+/- {std:0.3f})"
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data = out_col, index = mean_scores.index)
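The helper relies on the dictionary that `cross_validate` returns when several metrics are requested. The following sketch shows that structure on synthetic data. The dataset from `make_classification`, the `LogisticRegression` model and the names `X_toy`, `y_toy` and `summary_text` are illustrative assumptions, not part of this analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# With a list of metrics, cross_validate returns one array per metric and split type.
scores = cross_validate(
    LogisticRegression(),
    X_toy,
    y_toy,
    scoring=["accuracy", "f1"],
    return_train_score=True,
)

# One row per fold (5 by default), one column per returned key.
scores_df = pd.DataFrame(scores)

# Summarise each column as "mean (+/- std)", as the helper above does.
summary_text = pd.Series(
    [f"{m:0.3f} (+/- {s:0.3f})" for m, s in zip(scores_df.mean(), scores_df.std())],
    index=scores_df.columns,
)
print(summary_text)
```

The keys follow the pattern `test_<metric>` and `train_<metric>`, alongside `fit_time` and `score_time`, which is why the results table lists timing rows next to the scores.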

In [11]:
results = {}

scoring_metric = ["accuracy", "recall", "precision", "f1"]

dummy_model = make_pipeline(preprocessor, DummyClassifier(strategy = "stratified"))

results["dummy"] = mean_std_cross_val_scores(dummy_model, X_train, y_train, return_train_score = True, scoring = scoring_metric)



In [12]:
pd.DataFrame(results)

| | dummy |
|---|---|
| fit_time | 0.039 (+/- 0.007) |
| score_time | 0.011 (+/- 0.002) |
| test_accuracy | 0.503 (+/- 0.012) |
| train_accuracy | 0.508 (+/- 0.012) |
| test_recall | 0.431 (+/- 0.024) |
| train_recall | 0.433 (+/- 0.021) |
| test_precision | 0.426 (+/- 0.015) |
| train_precision | 0.432 (+/- 0.015) |
| test_f1 | 0.429 (+/- 0.020) |
| train_f1 | 0.433 (+/- 0.017) |
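These baseline scores can be cross-checked with a short calculation. A stratified dummy classifier draws its predictions at random according to the class proportions, independently of the features. The sketch below assumes a positive-class share `p` of about 0.43; that value is an assumption read off the reported recall, not a figure stated in the notebook.

```python
import numpy as np

# Assumed share of Class == 1 rows (inferred from the reported recall, not measured here).
p = 0.43

# Predictions are independent of the true labels, so:
expected_accuracy = p**2 + (1 - p) ** 2  # both positive, or both negative
expected_recall = p                      # P(pred = 1 | true = 1)
expected_precision = p                   # P(true = 1 | pred = 1)

# Quick simulation of the same setting as a sanity check.
rng = np.random.default_rng(0)
n = 200_000
y_true = rng.random(n) < p
y_pred = rng.random(n) < p
simulated_accuracy = (y_true == y_pred).mean()

print(f"expected accuracy:  {expected_accuracy:.3f}")
print(f"simulated accuracy: {simulated_accuracy:.3f}")
```

Under this assumption the expected accuracy is about 0.51 and the expected recall and precision are both about 0.43, which is in line with the table. Any real model should beat these figures to be considered useful.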


## Exploratory Data Visualisations <a name="ExploratoryDataVisualisations"></a>

In [None]:
# Sector distribution is important when analysing stocks 


train_df["Sector"].value_counts().plot.bar()

plt.title('Sector Count', fontsize = 20)
plt.xlabel("Sector")
plt.ylabel("Count of Records")

plt.show()

Figure 2: Distribution of each sector in the 2014 Financial indicator training dataset.

Figure 2 shows that financial services, healthcare and technology contain many more companies than utilities and communication services. This imbalance should be kept in mind when developing an ML model, so that the model does not overfit to the heavily represented sectors at the expense of the sparse ones.
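The degree of imbalance can be quantified as well as plotted. The sketch below uses a made-up `toy_sectors` series, not the real sector counts, to show how `value_counts(normalize=True)` turns counts into shares and how the largest and smallest sectors compare.

```python
import pandas as pd

# Hypothetical sector labels (made-up counts, for illustration only).
toy_sectors = pd.Series(
    ["Financial Services"] * 5 + ["Healthcare"] * 3 + ["Utilities"] * 2,
    name="Sector",
)

# Share of records per sector, largest first.
sector_share = toy_sectors.value_counts(normalize=True)

# Ratio between the most and least represented sectors.
imbalance_ratio = sector_share.max() / sector_share.min()

print(sector_share)
print(f"largest / smallest sector: {imbalance_ratio:.1f}x")
```

Running the same two lines on `train_df["Sector"]` would give the actual shares behind Figure 2.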

In [None]:
# Select some variables that might be important for this task, to create a correlation matrix.

train_df_corr = train_df.filter([
    "Revenue Growth", "Operating Income", "Book Value per Share Growth", "Net Income", "Net Profit Margin",
    "Total liabilities", "Capital expenditure", "Operating Cash Flow", "priceEarningsRatio",
    "grossProfitMargin", "netProfitMargin", "assetTurnover", "cashPerShare", "Net Income per Share",
    "Book Value per Share", "Shareholders Equity per Share", "PE ratio", "Debt to Assets", "Income Quality",
    "Capex to Operating Cash Flow", "Invested Capital", "ROE", "Capex per Share",
])

In [None]:
fig, ax = plt.subplots(figsize = (20, 15)) 

sns.heatmap(train_df_corr.corr(), annot = True, cmap = 'YlOrRd', fmt = '.3f', vmin = -1, vmax = 1, center = 0, ax = ax)

plt.show()

Figure 3: Correlation matrix plot of some important variables in the 2014 Financial Indicator dataset.

Looking at this correlation matrix, we can see that most of the features in this dataset are neither highly positively correlated nor highly negatively correlated. Also, we cannot exclude the few that are highly positively or negatively correlated, because they may be useful predictors in conjunction with the other features.
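With this many variables, the strongly correlated pairs are easier to read from a list than from the heatmap. The sketch below is an illustration on synthetic data: the frame `toy`, its columns and the 0.8 cut-off are all assumptions chosen for the example, not values used in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built to be almost perfectly correlated with "a".
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle (excluding the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Reshape to one row per pair and keep those above the assumed 0.8 threshold.
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > 0.8]

print(strong_pairs)
```

Applying the same steps to `train_df_corr.corr()` would list the few highly correlated pairs mentioned above without having to scan the full matrix.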

## References <a name="References"></a>

de Jonge, E., 2020. CRAN - Package Docopt. [online] Cran.r-project.org. Available at: https://cran.r-project.org/web/packages/docopt/index.html [Accessed 29 November 2020].

Oliphant, T.E., 2006. A guide to NumPy, Trelgol Publishing USA.

McKinney, W. & others, 2010. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference. pp. 51–56.

Waskom, M. et al., 2017. mwaskom/seaborn: v0.8.1 (September 2017), Zenodo. Available at: https://doi.org/10.5281/zenodo.883859.

Van Rossum, G. & Drake, F.L., 2009. Python 3 Reference Manual, Scotts Valley, CA: CreateSpace.

Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(3), pp.90–95.

Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Pérez, F. & Granger, B.E., 2007. IPython: a system for interactive scientific computing. Computing in Science & Engineering, 9(3).

Kluyver, T. et al., 2016. Jupyter Notebooks – a publishing format for reproducible computational workflows. In F. Loizides & B. Schmidt, eds. Positioning and Power in Academic Publishing: Players, Agents and Agendas. pp. 87–90.

Anon, 2020. Anaconda Software Distribution, Anaconda Inc. Available at: https://docs.anaconda.com/.