<a href="https://colab.research.google.com/github/DanRHowarth/Artificial-Intelligence-Cloud-and-Edge-Implementations/blob/master/Oxford_EDA__Assessment_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Template for performing EDA on ABCF Data 

* This notebook sets out areas for potential investigation when exploring the data.
* Most code cells are still empty (I ran out of time), but they mark where the code could go. We have a decent idea of the code we can use, so we can add it once it has been developed.
* It is a first draft - we will update it as we learn so that we can build a good template for future use. It also misses out some EDA topics that we can add later. But hopefully it's a good start.

* Contents:
  * 1. Overview of EDA / Suggested Workflow / Some ideas/questions we can ask about the data 
  * 2. Initial Data Analysis 
  * 3. Exploratory Data Analysis 


# 1. Overview of EDA and Workflow / Some questions we can ask about the data

## 1.1 Overview of EDA
* Confusingly, I would group *Initial Data Analysis* and *Exploratory Data Analysis* together under the same term, EDA.
* IDA - the things we need to do before we can analyse the data in EDA: checking data types and missing data, making sure we have the full data and know where it comes from, etc.
* EDA - building an understanding of the data: generating and testing hypotheses, recording assumptions, and starting to prepare for modelling, etc.



### 1.1.1 Some relevant things I have excluded (for now)

* Model prep as a last phase of EDA - things required to get data ready for modelling - scaling data for the model, feature selection, feature engineering.
* Statistical hypothesis testing - confirmatory data analysis, e.g. statistical significance
* Also, this is not an exhaustive list of things that could be done under EDA. We can add more as we do more.

## 1.2 Workflow

* Suggested workflow *for each section / bit of analysis*: 
  * **Investigate it / Plot it / Report it**
* Or more fully: 
  * Implement in code
  * Plot the result - for pretty much everything 
  * Think about what it says: a) as a standalone result, b) in relation to the rest of the dataset, c) in relation to the rest of the data for the problem, d) based on what we know about the problem (context), e) what is missing / what it doesn't say
  * Record any assumptions that you have made about the data 
  * Generate ideas about what the data tells us, or could tell us in combination with other data

* For each of the IDA/EDA sections, we should:
  * look at the properties of the data
  * preprocess as required
  * go into an understanding phase 
* Each will look a bit different, but the approach helps with structure (I think)

* This lends itself to a nice chart that I will pull together :-) 


## 1.3 Some Questions to get the ball rolling:

**Initial Data Analysis**
* What is the volume of data? 
* What do missing values (or the absence of missing values) mean?
* Is this a sample or the full dataset?
* Is this the time period we were expecting?
* What are the different features?

**Exploratory Data Analysis**
* Are there duplicates in the data?
* What is the time period? 
* What does the data tell us? How does it tell us that? How well does it tell us that?
* How variable is the data?
* Are there any outliers, how do we treat them? 
* Does a long term summary provide some insight: a trend, a direction that we can capture? 
* How does a short term sample [of some length] compare to the overall data? (Same property types? Does a random sample vary from the broader population?)
* What does a yearly/monthly/rolling average tell us? Are there drifts in the data? Seasonal peaks?
* What are the most important variables? Why? What scale? (feature selection...relevant to what) 
* In what ways can we measure the variables? Rates of change, ratios, etc.?
* If we set hypotheses...what will the measure of success be?




**For Assumptions..**
* What we believe about the data
* What it is
* Where it has been generated from 
* How it has been treated, if at all, prior to this stage


# 2. Initial Data Analysis 

## 2.1 Overview of what we are trying to achieve

* Principles...
* What it is 
* What we do 
* What the result is

## 2.2 Properties 

In [0]:
# data size and shape
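
A minimal sketch of how this cell (and the column names cell that follows) could be filled in, assuming the data sits in a pandas DataFrame called `df`. The toy data and the column names `date` and `value` are made up for illustration; they are not the ABCF data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=5, freq="D"),
    "value": [1.0, 2.5, 3.0, 4.5, 5.0],
})

n_rows, n_cols = df.shape                     # (rows, columns)
n_cells = df.size                             # rows * columns
mem_bytes = df.memory_usage(deep=True).sum()  # approximate memory footprint

print(f"{n_rows} rows x {n_cols} columns, {n_cells} cells, {mem_bytes} bytes")
print(list(df.columns))                       # column names
```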


In [0]:
# column names 


In [0]:
# time period of data
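
One possible way to check the time period, assuming the dataset has a date-like column. The column name `date` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; dates arrive as strings and out of order
df = pd.DataFrame({
    "date": ["2019-01-03", "2019-01-01", "2019-01-10"],
    "value": [3, 1, 10],
})

df["date"] = pd.to_datetime(df["date"])   # parse strings into timestamps
start, end = df["date"].min(), df["date"].max()
span_days = (end - start).days            # length of the covered period

print(start.date(), "to", end.date(), f"({span_days} days)")
```

Comparing `start` and `end` with the period we were told to expect is a quick first sanity check.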


In [0]:
# data types (computational)
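
A small sketch for inspecting computational data types, again on made-up data with hypothetical column names. It also splits the columns into numeric and non-numeric, which is often a useful first grouping.

```python
import pandas as pd

# Hypothetical toy data: a date stored as text, a float and an integer
df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-02"],
    "value": [1.5, 2.5],
    "count": [1, 2],
})

print(df.dtypes)  # one dtype per column

# Group columns by whether pandas treats them as numeric
numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", non_numeric_cols)
```

Here `date` shows up as non-numeric because it is still text, which flags it for the preprocessing step.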


In [0]:
# Check for missing values 
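
A sketch of a missing-value summary, assuming missing entries are encoded as `NaN`/`None`. If the real data uses sentinel values (e.g. `-999` or empty strings), those would need converting first. The data below is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of three columns
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```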


In [0]:
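# take a look at the data (assumes `df` is the DataFrame loaded earlier)
# print each result: a notebook cell only displays its last expression
print(df.head())
print(df.tail())
print(df.describe())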

### 2.2.1 What does this tell us?
* What is the volume of data? 
* What do missing values (or the absence of missing values) mean?
* Is this the time period we were expecting? 


## 2.3 Preprocessing 

In [0]:
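# treat data types
# astype() needs a target dtype; the column name and dtype below are placeholders
df = df.astype({"column_name": "float64"})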
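
A fuller sketch of treating data types, under the assumption that the raw data arrives as text. The columns `date`, `value` and `group` are hypothetical. `pd.to_numeric(..., errors="coerce")` is used here so that unparseable entries become `NaN` instead of raising an error; whether that is the right treatment depends on the data.

```python
import pandas as pd

# Hypothetical toy data where everything has been read in as text
df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-02", "2019-01-03"],
    "value": ["1.5", "2.0", "n/a"],
    "group": ["a", "b", "a"],
})

df["date"] = pd.to_datetime(df["date"])                     # text -> timestamps
df["value"] = pd.to_numeric(df["value"], errors="coerce")   # "n/a" becomes NaN
df["group"] = df["group"].astype("category")                # text -> categorical

print(df.dtypes)
```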

In [0]:
# placeholder 


## 2.4 Understanding 

### 2.4.1 Questions to be asked (more qualitative at this stage)
* Where did the data come from?
* Is it a sample or the whole lot?
* What do we judge about its quality?
* Has it been treated? 
* Do we have feature descriptors? 
* To all the above -> are we taking steps to address them?

### 2.4.2 Assumptions 
* What assumptions have we made about the data?
* Add here...

### 2.4.3 What ideas and hypotheses have we generated about the data?
* What does the IDA give us in terms of ideas? 
* Add here...

# 3. Exploratory Data Analysis 

## 3.1 Overview 


* What we are looking at...
  * Properties 
  * Preprocessing
  * Understanding 
    * Univariate - understand individual variables 
    * Bivariate - understand variables in relationship to another variable
    * Multivariate - ....
* Looking to implement, code, describe and generate ideas about it

## 3.2 Properties 

### 3.2.1 Data types (statistical)
* Continuous, discrete, categorical, binary, ordinal 

In [0]:
# relevant code here 


### 3.2.2 Detailed view of values

In [0]:
# unique values - number, examples 
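
A possible way to count unique values and pull a few examples per column. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "group": ["a", "b", "a", "c", "a"],
    "value": [1, 2, 2, 3, 3],
})

n_unique = df.nunique()   # number of distinct values per column

# First three distinct values of each column, in order of appearance
examples = {col: df[col].drop_duplicates().head(3).tolist() for col in df.columns}

print(n_unique)
print(examples)
```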


In [0]:
# duplicate entries 
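
A sketch for finding duplicates, distinguishing fully duplicated rows from repeated values in a key column. The `id` column is a hypothetical key; the real dataset may have a different one, or none.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 1 exactly; id 3 appears twice
# with different values
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "value": [10, 20, 20, 30, 31],
})

n_dup_rows = df.duplicated().sum()         # exact repeats beyond the first copy
dup_groups = df[df.duplicated(keep=False)] # every row belonging to a duplicate set
n_dup_ids = df["id"].duplicated().sum()    # repeated keys, even if values differ
deduped = df.drop_duplicates()             # keep the first copy of each exact row

print(n_dup_rows, n_dup_ids, len(deduped))
```

Repeated keys with differing values (like `id` 3 here) are usually the more interesting case, since dropping them silently would lose information.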


In [0]:
# class imbalances (if relevant)
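
If there is a categorical target, class balance could be checked roughly like this. The `label` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy target with an 80/20 split
df = pd.DataFrame({"label": ["a"] * 8 + ["b"] * 2})

counts = df["label"].value_counts()                 # rows per class
shares = df["label"].value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()       # largest vs smallest class

print(counts)
print(shares)
print("imbalance ratio:", imbalance_ratio)
```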


In [0]:
# plots include bar, histogram 
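
A minimal plotting sketch, assuming matplotlib is available: a bar chart of counts for a categorical column and a histogram for a numeric one. Column names and data are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column
df = pd.DataFrame({
    "group": ["a", "b", "a", "c", "a", "b"],
    "value": [1.0, 2.0, 2.5, 3.0, 3.5, 10.0],
})

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: how many rows fall in each category
df["group"].value_counts().plot.bar(ax=ax_bar, title="Rows per group")

# Histogram: how the numeric values are distributed
df["value"].plot.hist(bins=5, ax=ax_hist, title="Distribution of value")

fig.tight_layout()
```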


#### What does this tell us? 
* What do we need to do with duplicate entries? 
* Are the unique values what we expected?

## 3.3 Preprocessing 


In [0]:
# normalize, standardize the data if required 
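
A sketch of the two most common rescalings, written with plain pandas: min-max normalization to [0, 1] and standardization to zero mean and unit variance. The column name is hypothetical, and using the population standard deviation (`ddof=0`) is a choice made for this example.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"value": [2.0, 4.0, 6.0, 8.0]})
col = df["value"]

# Min-max normalization: rescale to the range [0, 1]
df["value_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score): zero mean, unit standard deviation
df["value_z"] = (col - col.mean()) / col.std(ddof=0)

print(df)
```

If the data later feeds a model, the scaling parameters should come from the training split only, so that nothing leaks from the test data.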


In [0]:
# deal with missing values appropriately 
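
Which treatment is "appropriate" depends on why the values are missing, so this is only a sketch of the usual options on invented time-indexed data: dropping, filling with a summary statistic, forward-filling, and interpolating.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two gaps
idx = pd.date_range("2019-01-01", periods=5, freq="D")
df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]}, index=idx)

dropped = df.dropna()                            # remove rows with gaps
filled_median = df.fillna(df["value"].median())  # fill with the column median
filled_ffill = df.ffill()                        # carry the last observation forward
interpolated = df.interpolate()                  # linear interpolation between neighbours

print(interpolated)
```

Whichever option is chosen, it is worth recording it under the assumptions for this section.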


## 3.4 Understanding 


### 3.4.1 Based on individual variables 


In [0]:
# plot of data - e.g. line plot

In [0]:
# centrality - mean, median, etc. 
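
A short sketch of the centrality measures on made-up data. The single large value is there on purpose, to show how the mean and median can disagree.

```python
import pandas as pd

# Hypothetical toy data with one large value pulling the mean upwards
df = pd.DataFrame({"value": [1, 2, 2, 3, 12]})

mean = df["value"].mean()            # sensitive to extreme values
median = df["value"].median()        # robust to extreme values
modes = df["value"].mode().tolist()  # most frequent value(s)

print(mean, median, modes)
```

A large gap between mean and median is an early hint of skew or outliers.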


In [0]:
# distributions and spread of data (standard deviation, interquartile range, outliers, range, percentile, skewness/kurtosis)
# plots - boxplots, distributions, frequency table, density estimate, violin plot    
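
A sketch of the spread measures listed above, plus one common outlier rule (values more than 1.5 × IQR beyond the quartiles, the same rule a boxplot uses for its whiskers). The 1.5 multiplier is a convention, not something this dataset dictates, and the data is invented.

```python
import pandas as pd

# Hypothetical toy series with one obvious outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 50.0], name="value")

std = s.std()                           # sample standard deviation
value_range = s.max() - s.min()         # range
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                           # interquartile range

# Boxplot-style fences for flagging outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

skewness = s.skew()                     # > 0 means a longer right tail
kurt = s.kurt()                         # excess kurtosis

print(f"std={std:.2f} range={value_range} iqr={iqr} skew={skewness:.2f}")
print(outliers)
```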


In [0]:
# averages - yearly, monthly, rolling
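
A sketch of monthly and rolling averages, assuming the data has a `DatetimeIndex`. The series below is synthetic and the 7-day window is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical daily series: 60 days of steadily increasing values
idx = pd.date_range("2019-01-01", periods=60, freq="D")
df = pd.DataFrame({"value": range(60)}, index=idx)

# Monthly average, labelled by the first day of each month
monthly_mean = df["value"].resample("MS").mean()

# 7-day rolling average; the first 6 entries are NaN (window not yet full)
rolling_7d = df["value"].rolling(window=7).mean()

print(monthly_mean)
print(rolling_7d.head(10))
```

Plotting the rolling average over the raw series is a simple way to see trend and drift; yearly averages follow the same `resample` pattern with a yearly frequency.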


#### 3.4.1.1 Ideas and Hypotheses (numbering system gets a bit clunky now)
* What sort of distribution?
* Outliers? 
* How do the variables change over time? What could this mean? Do we think this change relates to change in other variables? In some sort of measurable way?
* 

#### 3.4.1.2 Assumptions
* What are we assuming about the data?

### 3.4.2 Based on relationships between variables 

In [0]:
# correlation, covariance  
# plots - scatter, pairplots, heatmaps (hopefully we have seaborn!)
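
A sketch of correlation and covariance on made-up columns `x`, `y` and `z`. Pearson correlation captures linear relationships; Spearman is added here as a rank-based alternative that is less sensitive to outliers.

```python
import pandas as pd

# Hypothetical toy data: y is exactly 2*x, z tends to fall as x rises
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = df.corr()                         # Pearson correlation matrix
cov = df.cov()                           # covariance matrix (sample, ddof=1)
rank_corr = df.corr(method="spearman")   # rank-based correlation

print(corr.round(2))
print(cov)
```

The `corr` matrix is the usual input to a heatmap, and a scatter plot of any strongly correlated pair is worth a look before reading much into the number.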


#### 3.4.2.1 Ideas and Hypotheses
* Is there a correlation between different data types -> what could explain this?
* How does the correlation change over time?

#### 3.4.2.2 Assumptions
* Update any assumptions here 

### 3.4.3 Other approaches  

In [0]:
# Dimensionality reduction for high dimensional data 
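
One option here is principal component analysis (PCA). The sketch below does PCA through a singular value decomposition in NumPy on standardized synthetic data; it is an illustration of the idea, not a recommendation over a library implementation. The feature names `f1` to `f3` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: f2 is nearly a multiple of f1, f3 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "f1": base,
    "f2": 2 * base + rng.normal(scale=0.1, size=200),
    "f3": rng.normal(size=200),
})

# Standardize so each feature contributes on the same scale
X = ((df - df.mean()) / df.std(ddof=0)).to_numpy()

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)   # share of variance per component

# Project the data onto the first two principal components
components = X @ Vt[:2].T

print(explained_ratio.round(3))
print(components.shape)
```

Because `f1` and `f2` carry almost the same information, most of the variance should end up in the first two components, which is the kind of redundancy this step is meant to reveal.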


In [0]:
# 


#### 3.4.3.1 Ideas and Hypotheses
* Answer

#### 3.4.3.2 Assumptions
* Assumption

### 3.4.4 Relationships between datasets 

In [0]:
# 


In [0]:
# 


#### 3.4.4.1 Ideas and Hypotheses
* Answer

#### 3.4.4.2 Assumptions
* Assumption

## 3.5 Overall list of ideas/hypotheses/assumptions

### 3.5.1 Ideas / Hypotheses

* Idea 1
* Idea 2

### 3.5.2 Assumptions 

* Assumption 1 
* Assumption 2