# Exploratory Data Analysis Notebook

In this notebook I explore the dataset I built from AlphaVantage data. It is always a good idea to get a good sense of the data before doing any modeling. Data exploration gives us a sense of the variables at our disposal. It also tells us what kind of data cleaning we will have to do (dropping some rows, imputing some missing values) and helps us understand how the features interact with each other.

## Libraries

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

## Data

We can load the dataset we created for ourselves using data from the AlphaVantage API. We will then be able to explore the data to see whether we can find something of value for our upcoming modeling task.

In [2]:
data = pd.read_csv('../data/preprocessed/preprocessed_data_with_target.csv')

## Exploratory Data Analysis

### Basics

Here is some basic information about the dataset.

#### Dimensions

In [7]:
print(f"The dataset consists of {len(data)} observations.")

The dataset consists of 3037 observations.


#### Features & Features Types

In [8]:
print(f"The dataset consists of {data.shape[1]} features including the target.")

The dataset consists of 67 features including the target.


In [13]:
print(f"The dataset consists of {data.select_dtypes('number').shape[1]} numerical features")

The dataset consists of 63 numerical features


In [23]:
print(f"The numerical features include: {data.select_dtypes('number').columns.tolist()[:5]}")

The numerical features include: ['year', 'totalAssets', 'totalCurrentAssets', 'cashAndCashEquivalentsAtCarryingValue', 'cashAndShortTermInvestments']


In [15]:
print(f"Conversely, the dataset consists of {data.select_dtypes('object').shape[1]} non-numerical features")

Conversely, the dataset consists of 4 non-numerical features


In [19]:
print(f"The {data.select_dtypes('object').shape[1]} non-numerical columns are: {data.select_dtypes('object').columns.tolist()}")

The 4 non-numerical columns are: ['symbol', 'fiscalDateEnding', 'Sector', 'Industry']


To sum up, the dataset comprises 67 features, 63 of which are numerical. This leaves us with 4 non-numerical features: *symbol*, *fiscalDateEnding*, *Sector*, *Industry*. Among those 4, *symbol* and *fiscalDateEnding* will not be used for modeling; combined, they serve as a way to identify each observation. All in all, we only have to deal with two non-numerical features, *Sector* and *Industry*, which both give us information about the types of companies we are dealing with. We will see later on how we can deal with those variables.
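
Since *symbol* and *fiscalDateEnding* are meant to identify each observation together, it may be worth checking that the pair really is unique. Here is a minimal sketch on a toy frame; the symbols `AAA` and `BBB` and all values are made up for illustration, and the same call could be run on `data` instead.

```python
import pandas as pd

# Toy stand-in for the real dataset: hypothetical symbols and made-up values.
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "fiscalDateEnding": ["2016-12-31", "2017-12-31", "2016-12-31", "2017-12-31"],
    "reportedEPS": [1.0, 1.1, 2.0, 2.1],
})

key = ["symbol", "fiscalDateEnding"]

# duplicated(keep=False) flags every row whose key appears more than once.
n_duplicated = int(toy.duplicated(subset=key, keep=False).sum())
print(f"Rows sharing a (symbol, fiscalDateEnding) key: {n_duplicated}")
```

A count of zero means the pair can safely serve as an identifier; anything else would call for a closer look before modeling.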

#### Missing Values

**Target variable**

Since we constructed the target variable by shifting the data and merging back, we knew from the beginning that we would have missing values for the target variable. Indeed, we cannot know the target for a 2022 observation (for instance), as this would amount to knowing the EPS in 2027 (using a 5-year shift as we did), which is impossible.
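
To make the mechanism concrete, here is a rough sketch of how a shift-and-merge target produces missing values at the end of the time range. It is not the exact code used to build the dataset: the toy company `AAA`, its EPS values, and the `HORIZON` constant are assumptions for illustration.

```python
import pandas as pd

HORIZON = 5  # assumed name for the 5-year shift described above

# Toy data: one hypothetical company with made-up EPS values.
toy = pd.DataFrame({
    "symbol": ["AAA"] * 8,
    "year": list(range(2015, 2023)),
    "reportedEPS": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7],
})

# Relabel each EPS with the year it should serve as a target for.
future = toy[["symbol", "year", "reportedEPS"]].rename(
    columns={"reportedEPS": "futureEPS"}
)
future["year"] = future["year"] - HORIZON

# A left merge keeps every observation; unmatched ones get NaN as target.
with_target = toy.merge(future, on=["symbol", "year"], how="left")
print(with_target)
```

In this toy example only 2015 to 2017 receive a target; every later year has no EPS five years ahead and ends up with `NaN`.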

In [29]:
print(f"There are {int(data['futureEPS'].isna().sum())} missing values in the target variable")

There are 1099 missing values in the target variable


One thing we can do, though, is make sure that all those missing values belong to observations from 2019 onwards.

In [33]:
print(f"Observation years with missing target values: {data[data.futureEPS.isna()]['year'].unique()}")

Observation years with missing target values: [2013 2019 2020 2021 2022 2023 2024 2025]


Although I'm not too surprised to see years from 2019 onwards, having 2013 in the list is an issue, as we should have been able to build the target for this year (i.e. from 2018 EPS data). We need to investigate what happens with 2013 to gauge the scale of the problem (maybe only one company is involved).
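
One quick way to size the problem is to count the missing targets per year rather than only listing the years. This is a sketch on a toy frame with hypothetical symbols and made-up values; on the real dataset, `toy` would simply be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two hypothetical companies, made-up targets.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "year": [2013, 2013, 2019, 2019, 2020, 2020],
    "futureEPS": [np.nan, 2.0, np.nan, np.nan, np.nan, np.nan],
})

# Summing the boolean mask within each year gives the number of missing targets.
missing_per_year = toy["futureEPS"].isna().groupby(toy["year"]).sum()
print(missing_per_year)
```

A year with a small count among otherwise complete years, like 2013 here, points to an isolated issue rather than a systematic one.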

In [37]:
data[(data.futureEPS.isna()) & (data['year'] == 2013)][['symbol', 'fiscalDateEnding']]

| | symbol | fiscalDateEnding |
|---|---|---|
| 709 | CSCO | 2013-07-27 |


Fortunately, only one 2013 observation, Cisco's, has a missing target value. Since it is a 2013 observation, we can check whether we have a 2018 observation for Cisco. If so, the EPS of that 2018 observation can become the target of the 2013 one.

In [40]:
data[(data.symbol == 'CSCO') & (data.year == 2018)][['symbol', 'fiscalDateEnding', 'reportedEPS']]

| | symbol | fiscalDateEnding | reportedEPS |
|---|---|---|---|


Apparently, we have no data for Cisco in 2018. Let's check what we have for Cisco in general:

In [41]:
data[(data.symbol == 'CSCO')][['symbol', 'fiscalDateEnding', 'reportedEPS']]

| | symbol | fiscalDateEnding | reportedEPS |
|---|---|---|---|
| 164 | CSCO | 2010-07-31 | 1.61 |
| 331 | CSCO | 2011-07-30 | 1.61 |
| 517 | CSCO | 2012-07-28 | 1.85 |
| 709 | CSCO | 2013-07-27 | 2.02 |
| 907 | CSCO | 2014-07-26 | 2.06 |
| 1106 | CSCO | 2015-07-25 | 2.2 |
| 1307 | CSCO | 2016-07-30 | 2.36 |
| 1510 | CSCO | 2017-07-29 | 2.39 |
| 1923 | CSCO | 2019-07-27 | 3.09 |
| 2136 | CSCO | 2020-07-25 | 3.2 |
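
The table above jumps straight from 2017 to 2019. The same kind of gap could be searched for across all companies. Below is a minimal sketch: the Cisco years come from the table above, while the company `AAA` and the helper `missing_years` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Cisco years as listed in the table above, plus a hypothetical gap-free company.
toy = pd.DataFrame({
    "symbol": ["CSCO"] * 10 + ["AAA"] * 3,
    "year": [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019, 2020,
             2018, 2019, 2020],
})

def missing_years(years: pd.Series) -> list:
    """Return the years between the first and last observation that are absent."""
    present = set(years.astype(int).tolist())
    expected = set(range(min(present), max(present) + 1))
    return sorted(expected - present)

# One entry per symbol: the list of years missing inside its observed range.
gaps = {symbol: missing_years(years) for symbol, years in toy.groupby("symbol")["year"]}
print({symbol: years for symbol, years in gaps.items() if years})
```

Note that this only detects holes inside each company's observed range; it says nothing about years missing before the first or after the last observation.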


We can even load and check the original datasets to see if we did something wrong in our merging and dataset creation operations:

In [57]:
income_data = pd.read_csv('../data/raw/income_statement_annual.csv')
income_data[(income_data['symbol'] == 'CSCO') & (income_data.fiscalDateEnding.str.contains('2018'))][['fiscalDateEnding', 'symbol']]

| | fiscalDateEnding | symbol |
|---|---|---|
| 1822 | 2018-07-28 | CSCO |


In [58]:
balance_data = pd.read_csv('../data/raw/balance_sheet_annual.csv')
balance_data[(balance_data['symbol'] == 'CSCO') & (balance_data.fiscalDateEnding.str.contains('2018'))][['fiscalDateEnding', 'symbol']]

| | fiscalDateEnding | symbol |
|---|---|---|
| 1132 | 2018-07-29 | CSCO |


Now I understand why we do not have data for Cisco in 2018 in our dataset. The 2018 *fiscalDateEnding* values are not exactly the same in the balance sheet data (2018-07-29) and the income statement data (2018-07-28). The problem is that I used an inner join on the exact value of the *fiscalDateEnding* variable to merge those two datasets. The 2018 rows could not be matched and were therefore excluded from the dataset, which in turn left the 2013 observation without a target. This means that we should probably merge on YYYY-MM instead of YYYY-MM-DD. This can easily be implemented in the code. Note that this may also fix the issue for other observations.
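
A minimal sketch of the proposed fix is shown here, assuming *fiscalDateEnding* is stored as an ISO `YYYY-MM-DD` string. The two dates are the Cisco 2018 values found in the raw files; the columns `income_value`, `balance_value` and `fiscalMonth` are hypothetical names introduced for illustration.

```python
import pandas as pd

# The two Cisco 2018 rows from the raw files; value columns are made-up placeholders.
income = pd.DataFrame({
    "symbol": ["CSCO"],
    "fiscalDateEnding": ["2018-07-28"],
    "income_value": [1.0],
})
balance = pd.DataFrame({
    "symbol": ["CSCO"],
    "fiscalDateEnding": ["2018-07-29"],
    "balance_value": [2.0],
})

# Exact-date inner join: the one-day mismatch drops the row.
exact = income.merge(balance, on=["symbol", "fiscalDateEnding"], how="inner")

# Year-month key: keep the first 7 characters ("YYYY-MM") of the ISO date string.
for df in (income, balance):
    df["fiscalMonth"] = df["fiscalDateEnding"].str[:7]

by_month = income.merge(
    balance,
    on=["symbol", "fiscalMonth"],
    how="inner",
    suffixes=("_income", "_balance"),
)

print(f"Exact-date merge: {len(exact)} row(s); year-month merge: {len(by_month)} row(s)")
```

A month-level key would still miss a pair of dates that straddle a month boundary (for example the 31st in one file and the 1st of the next month in the other). If that turns out to happen, `pd.merge_asof` with a small `tolerance` could be an alternative worth testing.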

**Independent variables**

### Data Distribution

#### Box-plots

### Correlation Analysis

#### Correlation Matrix

## Insights gained

Let's summarize the insights we gained from the EDA and list the next actions we have already identified.

- Small dataset with ~3k observations
- 67 features including the target, 63 of which are numerical
- 2 non-numerical features to ID each observation
- 2 non-numerical features to define the sector and industry the company operates in. We may need to one-hot encode (OHE) them, which would result in many more features.
- 1099 missing values for the target variable, which was expected
- 2018 Cisco issue due to exact-date merging between balance sheet data and income statement data. This led to a missing target value in 2013. The code must be fixed by merging on YYYY-MM instead.
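
To give an idea of how one-hot encoding would grow the feature count, here is a small sketch using `pd.get_dummies`. The sector and industry labels are made up and are not taken from the dataset; whether one-hot encoding is the right choice for *Industry* remains to be decided.

```python
import pandas as pd

# Toy stand-in: hypothetical companies with made-up sector and industry labels.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "Sector": ["TECHNOLOGY", "ENERGY", "TECHNOLOGY"],
    "Industry": ["SOFTWARE", "OIL & GAS", "SEMICONDUCTORS"],
})

# Each distinct label becomes its own 0/1 column; the original columns are dropped.
encoded = pd.get_dummies(toy, columns=["Sector", "Industry"], dtype=int)
print(encoded.columns.tolist())
```

The two categorical columns are replaced by one column per distinct label, so on the real data the number of added columns would be `data['Sector'].nunique() + data['Industry'].nunique()`.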