## Exploratory data analysis of the US Stock Financial indicators data set

#### Names of contributors: Anene Ifeanyi, Chizitere Igwe

#### Date: 2020-12-24

# Table of contents

1. [Summary](#Summary)
2. [Method](#Method)
3. [Data](#Data)
4. [Analysis and Results](#PartitionDataintotrainandtestsplits)
5. [References](#References)

### Summary <a name="Summary"></a>




## Method <a name="Method"></a>

Given the financial stock indicators, should a hypothetical investor buy the stock or not? 

### Data <a name="Data"></a>

In [1]:
# Import required exploratory data analysis packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# train test split and cross validation
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, RandomizedSearchCV

# Preprocessing
from sklearn.preprocessing import (
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.impute import SimpleImputer

# Feature selection
from sklearn.feature_selection import RFE, RFECV


# classifiers / models
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


# Others
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.metrics import f1_score, mean_squared_error, make_scorer, recall_score, r2_score

In [2]:
# For this exploratory data analysis, we will only be working with the 2014 dataset. The techniques applied here can be applied to the other datasets.

df_2014 = pd.read_csv('../data/raw/2014_Financial_Data.csv') 

In [3]:
df_2014 = df_2014.rename(columns={'Unnamed: 0': 'Ticker'})

## Analysis and Results <a name="PartitionDataintotrainandtestsplits"></a>



In [4]:
train_df, test_df = train_test_split(df_2014, train_size = 0.75, random_state = 123)

In [5]:
# Create X and Y train

X_train, y_train = (train_df.drop(columns = ["Class"]), train_df["Class"])

X_test, y_test = (test_df.drop(columns = ["Class"]), test_df["Class"])

In [6]:
# Preprocessing and transformations

drop_features = ["Ticker"]

categorical_features = ["Sector"]

numerical_features = X_train.select_dtypes(include=np.number).columns.tolist()

#len(numerical_features) == 222

The `Ticker` column is dropped because it identifies the stock rather than describing its financials, so it is unlikely to contribute to prediction.
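One way to check this reasoning is to test whether a column takes a distinct value on every row, in which case it behaves like an identifier and one-hot encoding it would only memorise individual rows. The sketch below uses a made-up toy frame; the tickers, the values, and the helper name `is_identifier` are illustrative assumptions, and the notebook itself does not verify that `Ticker` is unique in the real data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the 2014 data set.
toy = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC", "DDD"],
    "Sector": ["Energy", "Energy", "Healthcare", "Healthcare"],
    "Class": [1, 0, 1, 0],
})

def is_identifier(frame, column):
    """Return True when every row has a distinct value in `column`."""
    return frame[column].nunique() == len(frame)

ticker_is_id = is_identifier(toy, "Ticker")   # one distinct value per row
sector_is_id = is_identifier(toy, "Sector")   # values repeat across rows
print(ticker_is_id, sector_is_id)
```

Running the same check on `X_train` would show whether the assumption holds for the real data.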

In [7]:
numeric_transformer = make_pipeline(StandardScaler(), SimpleImputer())


preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numerical_features),
    (OneHotEncoder(handle_unknown = "ignore"), categorical_features)
)

### Baseline model

In [8]:
# Helper function 

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns the mean and standard deviation of cross validation scores
    """
    
    scores = cross_validate(model, X_train, y_train, **kwargs)
    
    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    
    out_col = []
    
    # Use positional access (.iloc) because the Series is indexed by metric name
    for i in range(len(mean_scores)):
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))
    
    return pd.Series(data = out_col, index = mean_scores.index)

In [9]:
results = {}

scoring_metric = ["accuracy", "recall", "precision", "f1"]

dummy_model = make_pipeline(preprocessor, DummyClassifier(strategy = "stratified"));

results["dummy"] = mean_std_cross_val_scores(dummy_model, X_train, y_train, return_train_score = True, scoring = scoring_metric);


In [10]:
pd.DataFrame(results)

| Metric | dummy |
|---|---|
| fit_time (s) | 0.039 (+/- 0.008) |
| score_time (s) | 0.011 (+/- 0.001) |
| test_accuracy | 0.509 (+/- 0.019) |
| train_accuracy | 0.514 (+/- 0.006) |
| test_recall | 0.417 (+/- 0.052) |
| train_recall | 0.435 (+/- 0.007) |
| test_precision | 0.429 (+/- 0.027) |
| train_precision | 0.438 (+/- 0.007) |
| test_f1 | 0.423 (+/- 0.040) |
| train_f1 | 0.436 (+/- 0.006) |


Table 1: Results of the baseline model. 

Table 1 shows the cross-validation results for the baseline model. Its scores are low, as expected for a classifier that guesses in proportion to the class frequencies, but they serve as a reference for the models developed later.
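As a rough sanity check on Table 1, the scores expected from a stratified dummy classifier can be worked out by hand. The sketch below assumes the classifier predicts the positive class with probability equal to the positive-class share `p`, independently of the true label. The value 0.43 is an assumption read off from the dummy recall in Table 1, not a figure computed from the data, and `stratified_dummy_expectations` is a hypothetical helper name.

```python
def stratified_dummy_expectations(positive_share):
    """Expected scores when predictions are drawn at random with
    P(predict positive) = positive_share, independent of the true label."""
    p = positive_share
    return {
        # Correct when both are positive (p * p) or both negative ((1 - p) * (1 - p)).
        "accuracy": p * p + (1 - p) * (1 - p),
        # Among true positives, a share p is predicted positive.
        "recall": p,
        # Among predicted positives, a share p is truly positive.
        "precision": p,
    }

# Assumed positive-class share (illustrative, not measured from the data set).
expected = stratified_dummy_expectations(0.43)
print(expected)
```

With `p = 0.43` the expected accuracy is about 0.51, in line with the cross-validated baseline.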

## Linear models

In [11]:
# logistic regression

log_reg_model = make_pipeline(preprocessor, LogisticRegression(max_iter = 1000, class_weight = "balanced"))

results["Logistic Regression"] = mean_std_cross_val_scores(log_reg_model, X_train, y_train, return_train_score = True, 
                                                           scoring = scoring_metric)


In [12]:
pd.DataFrame(results)

| Metric | dummy | Logistic Regression |
|---|---|---|
| fit_time (s) | 0.039 (+/- 0.008) | 0.111 (+/- 0.013) |
| score_time (s) | 0.011 (+/- 0.001) | 0.011 (+/- 0.000) |
| test_accuracy | 0.509 (+/- 0.019) | 0.663 (+/- 0.040) |
| train_accuracy | 0.514 (+/- 0.006) | 0.727 (+/- 0.066) |
| test_recall | 0.417 (+/- 0.052) | 0.684 (+/- 0.057) |
| train_recall | 0.435 (+/- 0.007) | 0.755 (+/- 0.063) |
| test_precision | 0.429 (+/- 0.027) | 0.596 (+/- 0.039) |
| train_precision | 0.438 (+/- 0.007) | 0.663 (+/- 0.073) |
| test_f1 | 0.423 (+/- 0.040) | 0.637 (+/- 0.046) |
| train_f1 | 0.436 (+/- 0.006) | 0.706 (+/- 0.069) |


Table 2: Results of baseline model and Logistic regression model.

The logistic regression model shows promise: its mean cross-validated test scores exceed the dummy classifier's on every metric (for example, a test F1 of 0.637 against 0.423). The scores are still fairly low, however. Optimising the regularisation hyperparameter (`C`) may give better results.
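In scikit-learn, `C` is the inverse of the regularisation strength, so smaller values shrink the coefficients more. The sketch below illustrates this on synthetic data from `make_classification`, not on the stock data. The sample size, feature counts, and noise level are arbitrary assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noisy binary classification problem (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.1, random_state=123
)

# Candidate values for C, spaced on a log scale from 0.001 to 100.
C_values = 10.0 ** np.arange(-3, 3, 1)

coef_norms = []
for C in C_values:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X, y)
    coefs = model.named_steps["logisticregression"].coef_
    # L2 norm of the fitted coefficients: small when regularisation is strong.
    coef_norms.append(float(np.linalg.norm(coefs)))

for C, norm in zip(C_values, coef_norms):
    print(f"C = {C:g}: coefficient norm = {norm:.3f}")
```

The coefficient norm is smallest at the smallest `C` and grows as `C` increases, which is why a search over `C` trades off underfitting against overfitting.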

### Logistic Regression hyperparameter optimisation

In [13]:
log_reg_param_grid = {"logisticregression__C": 10.0 ** np.arange(-3, 3, 1)}
mult_metric_eval_scorer = {"accuracy" : "accuracy", "recall" : "recall", "precision" : "precision", "f1" : "f1"}

log_reg_random_search = RandomizedSearchCV(log_reg_model, param_distributions = log_reg_param_grid,
                                           n_jobs = -1, verbose = 1, scoring = mult_metric_eval_scorer, refit = "f1",
                                           return_train_score = True);

log_reg_random_search.fit(X_train, y_train);

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    4.6s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [14]:
best_log_reg_model = log_reg_random_search.best_estimator_;

# The key matches the column name shown in the results table
results["logreg (tuned)"] = mean_std_cross_val_scores(best_log_reg_model, X_train, y_train, return_train_score = True,
                                                      scoring = scoring_metric);




In [15]:
pd.DataFrame(results)

| Metric | dummy | Logistic Regression | logreg (tuned) |
|---|---|---|---|
| fit_time (s) | 0.039 (+/- 0.008) | 0.111 (+/- 0.013) | 0.465 (+/- 0.041) |
| score_time (s) | 0.011 (+/- 0.001) | 0.011 (+/- 0.000) | 0.012 (+/- 0.001) |
| test_accuracy | 0.509 (+/- 0.019) | 0.663 (+/- 0.040) | 0.895 (+/- 0.022) |
| train_accuracy | 0.514 (+/- 0.006) | 0.727 (+/- 0.066) | 0.943 (+/- 0.022) |
| test_recall | 0.417 (+/- 0.052) | 0.684 (+/- 0.057) | 0.913 (+/- 0.030) |
| train_recall | 0.435 (+/- 0.007) | 0.755 (+/- 0.063) | 0.964 (+/- 0.018) |
| test_precision | 0.429 (+/- 0.027) | 0.596 (+/- 0.039) | 0.854 (+/- 0.023) |
| train_precision | 0.438 (+/- 0.007) | 0.663 (+/- 0.073) | 0.911 (+/- 0.031) |
| test_f1 | 0.423 (+/- 0.040) | 0.637 (+/- 0.046) | 0.883 (+/- 0.025) |
| train_f1 | 0.436 (+/- 0.006) | 0.706 (+/- 0.069) | 0.937 (+/- 0.025) |


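Table 3: Results of the baseline, logistic regression, and regularisation-optimised logistic regression models.

After optimising the regularisation hyperparameter, much better scores were obtained: the mean cross-validated test F1 rose from 0.637 to 0.883 and the test accuracy from 0.663 to 0.895.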

## References <a name="References"></a>

de Jonge, E., 2020. CRAN - Package Docopt. [online] Cran.r-project.org. Available at: https://cran.r-project.org/web/packages/docopt/index.html [Accessed 29 November 2020].

Oliphant, T.E., 2006. A guide to NumPy, Trelgol Publishing USA.

McKinney, W. & others, 2010. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference. pp. 51–56.

Waskom, M. et al., 2017. mwaskom/seaborn: v0.8.1 (September 2017), Zenodo. Available at: https://doi.org/10.5281/zenodo.883859.

Van Rossum, G. & Drake, F.L., 2009. Python 3 Reference Manual, Scotts Valley, CA: CreateSpace.

Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(3), pp.90–95.

Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Pérez, F. & Granger, B.E., 2007. IPython: a system for interactive scientific computing. Computing in Science & Engineering, 9(3).

Kluyver, T. et al., 2016. Jupyter Notebooks – a publishing format for reproducible computational workflows. In F. Loizides & B. Schmidt, eds. Positioning and Power in Academic Publishing: Players, Agents and Agendas. pp. 87–90.

Anon, 2020. Anaconda Software Distribution, Anaconda Inc. Available at: https://docs.anaconda.com/.