# Capstone project overview

## The task

See the Moodle page for the task, or use data and a project from your company that are of similar complexity.

> If you choose to work on a company project, please clarify with your company whether they need a confidentiality agreement (there is a sample on the Moodle page, but you can of course use your company's own NDA instead), and get their consent to use it in your project (you can find a template for a letter of confirmation and consent in Appendix 1 of the Coding Project Handbook on Moodle).


## Code and text

Submit **one single PDF** (it’s nice if you also submit the ipynb, but that is optional). The PDF should contain BOTH **code and text**!
* Option A: **work exclusively in ipynb**, and use markdown cells for the sections, text, tables, etc. Generate the PDF from the ipynb at the end (page numbers can be added then; pay attention to collapsed cells if working on Google Colab, though!).
  * PDF generation: using [nbconvert](https://nbconvert.readthedocs.io/en/latest/) or print to PDF.
* Option B: prepare the **text in a separate document, code in ipynb and copy-and-pasted into the text document**.
  * Relevant parts of the code **copy-and-pasted into the relevant sections** and parts of the text, or…
  * …**entire code from ipynb in an appendix**. Important: especially in this case, the _code should also be structured_!
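
For Option A, the nbconvert step can also be scripted. The following is only a minimal sketch: the function names and the file name `capstone.ipynb` are made up for illustration, and it assumes that `jupyter nbconvert` is installed and on your PATH. As far as I know, `--to pdf` needs a LaTeX installation, while `--to webpdf` renders through a headless browser instead.

```python
import subprocess
from pathlib import Path


def build_nbconvert_command(notebook_path, output_format="pdf"):
    """Return the argument list for converting a notebook with nbconvert."""
    notebook = Path(notebook_path)
    if notebook.suffix != ".ipynb":
        raise ValueError(f"expected an .ipynb file, got {notebook.name!r}")
    # "pdf" goes through LaTeX; "webpdf" uses a headless browser instead.
    return ["jupyter", "nbconvert", "--to", output_format, str(notebook)]


def convert_notebook(notebook_path, output_format="pdf"):
    """Run the conversion; raises CalledProcessError if nbconvert fails."""
    subprocess.run(build_nbconvert_command(notebook_path, output_format), check=True)


# Hypothetical usage (not executed here):
# convert_notebook("capstone.ipynb")
```

Whichever route you take, open the generated PDF and check that no cells are missing or cut off.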


## Format, references etc.

* Pay attention to **length requirement**: if you submit significantly shorter work than specified in the module handbook (e.g., only ~10 pages), you cannot get a pass grade!
  * **Word count includes text AND code**, so given the amount of exploration, data cleaning and preparation, modelling, evaluation and reflection needed in the project, it’s easy to reach the required length.
  * The text and code **should NOT be irrelevant** to the project, though! (Writing 5 pages on an irrelevant topic is not the way to satisfy the word count requirement!)
* PDF must start with a **title page and declaration**!
  * When working in ipynb, you can, e.g., embed them as images (see below).
  * Or simply prepend image/pdf to generated PDF. (If you like, [even in Python](https://stackoverflow.com/questions/3444645/merge-pdf-files).)
* ALWAYS **cite other works you use** (not just academic papers; URLs are fine, too, e.g., when using code snippets from stackoverflow). Do “**in-line citation**” where you use the material + **collect ALL references at the end** of the thesis under a dedicated “References” section!
  * Levels of using other work:
    * For **paraphrasing** or when only using ideas, a simple reference is fine.
    * For **short word-for-word citations** from some source, use quotation marks (plus the reference of course).
    * For **long word-for-word citations** from some source, use indentation and/or italics and/or a smaller font, etc. (plus the reference of course). Warning: _do NOT use more than a paragraph or two word-for-word from other sources!!! Paraphrase!_
  * If possible, be consistent in your referencing style.
* **Structure your work logically**, use sections, subsections.


## Submission

* The submission link will ONLY open **once you have answered the 3 reflective questions under “Coding Project title and final exam”** on Moodle. → Pay attention to the **deadline** (last day of the teaching period), do not leave submission to the last minute!!!
  * Bear in mind that there is a **minimum word limit** to these questions, too.
* Submission of project itself:
  * **Submit one single PDF, containing EVERYTHING** (text, plots, code). Optionally also submit the ipynb you worked on.
  * **10MB file size limit** → you probably won’t reach this limit unless you generate / include a LOT of HIGH-resolution plots / images. You can reduce figure sizes if it seems that you will run into this issue.
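
A quick way to check the file size before uploading is sketched below. The function name is hypothetical, and the code assumes that "10MB" means 10 × 1024 × 1024 bytes; if Moodle counts differently, adjust the constant.

```python
import os

# Assumption: the 10MB limit is interpreted as 10 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 10 * 1024 * 1024


def within_size_limit(path, limit_bytes=SIZE_LIMIT_BYTES):
    """Return (ok, size_in_mb) for the file at `path`."""
    size = os.path.getsize(path)
    return size <= limit_bytes, size / (1024 * 1024)


# Hypothetical usage (not executed here):
# ok, size_mb = within_size_limit("capstone.pdf")
# print(f"{size_mb:.1f} MB, within limit: {ok}")
```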


## Assessment criteria

You have to _reach a pass mark on ALL criteria_ to get an overall pass grade.

* **Approach adopted**: including how data were handled, the rationale for various decisions, and discussion of selecting particular approach and tools.
* **Data science methodology**: appropriate techniques, exploratory and modelling steps, validation and conclusion.
* **Coding**: purposefulness, quality, and elegance of the code used.
* **Reflexivity**: clarity of linkages between actions and outcomes, depth of reflections, identification of learning points for the future.
* **Communication and presentation**: language and grammar, clear formatting, adept use of Jupyter and citations.

Assessment procedure:
* 2 readers do a detailed evaluation based on the criteria.
* Checked by internal and external moderator.
* Approved by Exam Board. 
  * → this is the point after which you get the results and the feedback.


### In case you are unsuccessful...

* You have to enroll (=> also pay) for another semester.

* Two ways to be unsuccessful:

  * **If you don't submit a capstone project...**
    * In the next semester >> you have to work on the task assigned in THAT NEW SEMESTER (can't resubmit your old, reworked project).
    * But your Capstone Project **grade can be anything between 0-100%**.
  * **If you submit your capstone project, but get a fail mark...**
    * In the next semester >> you can SUBMIT THE SAME PROJECT you worked on (after you corrected the problems with it, of course).
    * But your Capstone Project **grade is capped at 50%!**

# Capstone project structure


## Preliminary material

* Title page (1 page) -- see below
* Declaration (1 page)
* (Acknowledgements)
* Abstract (<300 words summary)
* Table of contents -- see below

## Introduction

* Topic and goal of the project
* Models to be used
* Hypotheses, expectations


## Description of the task

* Specifying the goal of the project and describing the task.
* Supervised/unsupervised, classification, regression, domain-specific, time series, etc.
* The aspects of data preparation (e.g., train-test split and shuffling) and modelling to pay attention to given the task.


## Exploratory data analysis

### Taking stock of independent and target variables
* Description, initial reasoning
* Data types (e.g., datetime)
* Sample/time step/pixel count, data dimensionality etc.
* Descriptive statistics

### Initial visualisations
* Data- and variable-appropriate plots
* Domain-specific explorations
  * E.g., time series: trend, periodicity, etc.
* Explore relationships between variables
  * Pairplots
  * With target: (lagged) correlations, autocorrelation


## Data preparation

### Missing values

* Exploration, reasoning about handling it etc.

### Anomaly detection and solutions for outlier handling

* Boxplots, z-values, etc.
* Clipping, etc.
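
As an illustration of z-values and clipping, here is a minimal standard-library sketch. The data, the threshold of 2.5 and the clipping bounds are made up, and in a real project you would justify them from the EDA. Note that with the population standard deviation, no z-value in a sample of size n can exceed sqrt(n - 1), so a threshold of 3 can never flag anything in a sample of only 10 points.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def clip(values, lower, upper):
    """Limit every value to the closed interval [lower, upper]."""
    return [min(max(x, lower), upper) for x in values]


# Made-up readings; 50 is the suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
flagged = [x for x, z in zip(readings, z_scores(readings)) if abs(z) > 2.5]
clipped = clip(readings, 9, 12)
```

Whether you clip, drop or keep a flagged value is a decision to reason about in the text; the code only finds candidates.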

### Feature engineering

#### Feature encoding 

* e.g., cyclic encoding of temporal features, one-hot-encoding of categorical features etc.
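
Both encodings can be sketched in a few lines. This is a plain-Python illustration with hypothetical function names; in practice you would more likely use the encoders of your data library. The point of the cyclic encoding is that hour 23 and hour 0 end up close to each other, which is not the case with the raw numbers.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic value (e.g., hour 0-23 with period 24) to (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


hour_23 = cyclic_encode(23, 24)
hour_0 = cyclic_encode(0, 24)
categories, rows = one_hot_encode(["red", "blue", "red"])
```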

#### Feature selection, possible derived features, possible additional (e.g., time-related) features

##### Reason about dropping a variable!
* Categorical variable with too many unique values (e.g., 1000 samples, 800 unique values) and no simple way to reduce the category count.
* Intuitively not important, e.g., an ID tag.
* Too many NA values, e.g., 800 NA out of 1000.
* Train models with and without the variable and see which model performs best.
* Explore the feature importances in different models:
  * Linear regression: coefficients (after scaling!)
  * RF: feature importances
  * L1 regularization
* Only use some variables for now and leave the rest as “future work”.
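
The first few checks above can be turned into a quick screening step. The sketch below is only an illustration: the function names, the toy table and the 0.5 thresholds are assumptions, and missing values are represented by `None`. It only lists candidates; the decision to drop a variable still needs reasoning in the text.

```python
def column_diagnostics(column):
    """Summarise one column given as a list, with None marking a missing value."""
    n = len(column)
    present = [v for v in column if v is not None]
    return {
        "missing_fraction": (n - len(present)) / n if n else 0.0,
        "unique_fraction": len(set(present)) / len(present) if present else 0.0,
    }


def drop_candidates(table, categorical, max_missing=0.5, max_unique=0.5):
    """Return {column: reason} for columns worth reasoning about dropping."""
    reasons = {}
    for name, column in table.items():
        stats = column_diagnostics(column)
        if stats["missing_fraction"] > max_missing:
            reasons[name] = "too many missing values"
        elif name in categorical and stats["unique_fraction"] > max_unique:
            # Only checked for categorical columns: numeric columns are
            # expected to have many distinct values.
            reasons[name] = "too many unique categories"
    return reasons


# Toy table (made-up values).
table = {
    "id": ["a", "b", "c", "d"],
    "city": ["x", "x", "y", "x"],
    "note": [None, None, None, "k"],
    "price": [1.0, 2.0, 3.0, 4.0],
}
candidates = drop_candidates(table, categorical={"id", "city", "note"})
```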

Refer to the EDA, e.g., correlations, but be aware of their limitations (e.g., Pearson correlation only captures linear relationships)!
* High absolute correlation 
  * Useful for predicting target?
  * Not useful, because information already present in other variable? -> reducing multicollinearity, dimensionality reduction…
* Forecasting task -> _Lagged_ correlations relevant.
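
A lagged correlation pairs the feature at time t - lag with the target at time t. Here is a minimal sketch using Pearson correlation; the function names and the toy series are made up, and the target is constructed so that it simply repeats the feature two steps later.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lagged_correlation(feature, target, lag):
    """Correlate feature[t - lag] with target[t] for a non-negative lag."""
    if lag < 0 or lag >= len(feature) - 1:
        raise ValueError("lag must be >= 0 and leave at least two pairs")
    if lag == 0:
        return pearson(feature, target)
    return pearson(feature[:-lag], target[lag:])


# Toy data: the target repeats the feature two steps later.
feature = [1, 3, 2, 5, 4, 7, 6, 9]
target = [0, 0] + feature[:-2]
by_lag = {lag: lagged_correlation(feature, target, lag) for lag in range(4)}
best_lag = max(by_lag, key=by_lag.get)
```

In a forecasting task, only lags that are actually available at prediction time are usable as features.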

### Potential data normalisation

* Standard/MinMax scaling, reasoning about which models need it, etc.
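
One detail that is easy to get wrong: the scaling parameters should be estimated on the training data only and then applied unchanged to validation and test data, otherwise information leaks from the test set. A minimal sketch of this idea follows (function names and numbers are hypothetical; library scalers follow the same fit/transform pattern).

```python
from statistics import mean, pstdev


def fit_standard_scaler(train):
    """Estimate mean and standard deviation on the training data only."""
    sigma = pstdev(train)
    return {"mean": mean(train), "std": sigma if sigma > 0 else 1.0}


def standard_scale(values, params):
    return [(x - params["mean"]) / params["std"] for x in values]


def fit_minmax_scaler(train):
    """Estimate minimum and range on the training data only."""
    lo, hi = min(train), max(train)
    return {"min": lo, "range": (hi - lo) if hi > lo else 1.0}


def minmax_scale(values, params):
    return [(x - params["min"]) / params["range"] for x in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [10.0]  # outside the training range, so its scaled value exceeds 1
minmax_params = fit_minmax_scaler(train)
train_scaled = minmax_scale(train, minmax_params)
test_scaled = minmax_scale(test, minmax_params)
```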

### Potential dimensionality reduction
* E.g., PCA or LDA for modelling or visualisation, or UMAP for visualisation?

### Potential resampling

### Domain-specific data prep
* E.g., detrending, deseasonalising, rolling aggregation for smoothing if needed.
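
As a simple illustration, differencing is one common way to remove a trend (lag 1) or a seasonal pattern (lag equal to the season length), and a rolling mean is a basic smoother. The sketch below uses hypothetical function names and made-up numbers; it also shows the inverse step, which you need later to bring predictions back to the original scale.

```python
def difference(series, lag=1):
    """Remove trend (lag=1) or seasonality (lag=season length) by differencing."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def undo_difference(diffs, initial, lag=1):
    """Rebuild the original series from its differences and its first `lag` values."""
    restored = list(initial)
    for d in diffs:
        restored.append(restored[-lag] + d)
    return restored


def rolling_mean(series, window):
    """Mean over a trailing window; the first window - 1 points are dropped."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


series = [3, 5, 8, 12, 17]  # made-up values
diffs = difference(series)
restored = undo_difference(diffs, series[:1])
```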

### Preparation of input and target data for modelling

* Maybe use a pipeline.
* Pay attention to shape of data needed for different models.
* Reason about whether normalisation is vital for a model (e.g., neighbourhood/distance/density sensitive approaches, neural networks).
* Train-valid-test split (or prepare for cross-validation). Shuffle true or false? --> Has to be the same for all models.
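
A minimal sketch of such a split is shown below; the function name and the fractions are assumptions. With `shuffle=False` the order is preserved, so for a time series the validation and test sets come strictly after the training set. With `shuffle=True` a fixed seed keeps the split identical for all models.

```python
import random


def train_valid_test_split(rows, valid_fraction=0.2, test_fraction=0.2,
                           shuffle=False, seed=0):
    """Split rows into (train, valid, test); order is kept unless shuffle=True."""
    rows = list(rows)
    if shuffle:
        # A seeded generator makes the shuffle reproducible across models.
        random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_valid = int(len(rows) * valid_fraction)
    n_train = len(rows) - n_valid - n_test
    return (
        rows[:n_train],
        rows[n_train:n_train + n_valid],
        rows[n_train + n_valid:],
    )


rows = list(range(10))
train, valid, test = train_valid_test_split(rows)  # chronological split
```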


## Modelling, predictions

### Deciding on goodness of fit criteria

* _Task-appropriate_ metrics and visualisation to be applied in the evaluation of _all_ predictions.
* Don’t forget to _inverse transform_ the target when evaluating, if any prior transformations were applied (detrending, scaling etc.).
* Classification metrics vs. regression metrics; description of metrics used.
* Visualisation and inspection of _true and predicted values_ to see and reason about where issues lie.
* _Train vs. out-of-sample_ performance comparison for each model.
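
To see why the inverse transform matters, here is a minimal sketch with two regression metrics. The function names and numbers are made up, and the target is assumed to have been standard-scaled with mean 100 and standard deviation 20. The same predictions give an MAE of 0.5 in scaled units but 10 in the original units, and only the latter is interpretable for the task.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def inverse_standard_scale(values, mean, std):
    """Undo standard scaling: x = z * std + mean."""
    return [v * std + mean for v in values]


# Assumed scaling parameters of the target (made-up numbers).
target_mean, target_std = 100.0, 20.0
true_scaled = [0.0, -0.5]
pred_scaled = [0.5, -1.0]

mae_scaled = mae(true_scaled, pred_scaled)
mae_original = mae(
    inverse_standard_scale(true_scaled, target_mean, target_std),
    inverse_standard_scale(pred_scaled, target_mean, target_std),
)
```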

### Non-ML baselines

* Non-machine-learning baselines, e.g., average, rolling average, last, etc.
* Description of the baseline and evaluation.
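
The baselines named above fit in a few lines each. This is a sketch for a forecasting setting with hypothetical function names and made-up data; evaluate the baselines with the same metrics as the ML models, so that you can show how much the models actually add.

```python
def mean_baseline(train, horizon):
    """Predict the training average for every future step."""
    avg = sum(train) / len(train)
    return [avg] * horizon


def last_value_baseline(train, horizon):
    """Predict the last observed value for every future step."""
    return [train[-1]] * horizon


def rolling_mean_baseline(train, horizon, window=3):
    """Predict each step as the mean of the latest `window` values,
    feeding earlier predictions back in as history."""
    history = list(train)
    preds = []
    for _ in range(horizon):
        recent = history[-window:]
        preds.append(sum(recent) / len(recent))
        history.append(preds[-1])
    return preds


train = [2.0, 4.0, 6.0, 8.0]  # made-up values
```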

### Baseline ML models, e.g., linear/logistic regression

* Description of the relevant model type and its strengths and limitations.
* (Hyperparameter tuning / motivation of hyperparam choices if applicable.)
* Evaluation.

### More advanced ML models, generic and domain-specific

* Description of the relevant model type and its strengths and limitations.
* Hyperparameter tuning, motivation of hyperparam choices if applicable.
* Evaluation.

## Evaluation, reflection

* Reasoning about and reflection on the results. 
* Compare to the goals and expectations.


## Conclusions

* Conclusions, recommendations based on the results.
* Limitations, and identification of work for further research.



## References

```
LeCun & Bengio (1995) Convolutional networks for images, speech, and time-series. _The handbook of brain theory and neural networks_ 3361 (10).

https://stackoverflow.com/questions/30061902/how-to-handle-citations-in-ipython-notebook

...
```




## Appendix

* E.g., code, or large tables etc.



# Appendix: some sample technical solutions in ipynb

## A solution to title page and declaration: embed image in markdown

Convert your docx (or pdf) to an image (there are countless online tools if you don't have a program to do it on your own PC).

Then use the markdown syntax `![](IMAGEPATH)` to embed it in a markdown cell.

(Note, below, I include the markdown code in the example, too. Naturally, you would just use the code itself, without the code block.)

```markdown
![](https://drive.google.com/uc?export=download&id=1DMNHIs5BZR9g08AJhA3DbdK-R1EFS-ka)
```
![](https://drive.google.com/uc?export=download&id=1DMNHIs5BZR9g08AJhA3DbdK-R1EFS-ka)

```markdown
![](https://drive.google.com/uc?export=download&id=1_zb-oXtlNvJonMyaa4ex6PtLYuMX0a58)
```

![](https://drive.google.com/uc?export=download&id=1_zb-oXtlNvJonMyaa4ex6PtLYuMX0a58)

## A solution to table of contents

### A solution when using Google Colab

Bring up the Command palette (e.g., hit Ctrl+Shift+P), search for table, then select "Add table of contents cell":

![](https://drive.google.com/uc?export=download&id=1v6ncWwbh2KGOdG3GVUYNTwN40jUgqJ-x)

This will automatically generate a table of contents from the headings (use `#`-syntax!) of your markdown cells.

### A solution for offline jupyter usage

E.g., put the following markdown code wherever you want your table of contents:

```markdown
<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>
```

Then put the following into a code cell anywhere in your notebook (preferably at the end, in some appendix), which will automatically populate the Table of Contents from the headings (use `#`-syntax!) of your markdown cells:

```
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
```
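
If you prefer not to load an external script, a table of contents can also be generated in Python from the notebook file itself and pasted into a markdown cell. The sketch below is only one possible approach: the function names and the file name `capstone.ipynb` are made up, and the link anchors assume that Jupyter builds heading ids by replacing spaces with hyphens, which you should verify in your own environment.

```python
import json
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def notebook_headings(notebook):
    """Yield (level, title) for every markdown heading in a parsed .ipynb dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        in_fence = False
        for line in text.splitlines():
            if line.lstrip().startswith("```"):
                in_fence = not in_fence  # skip '#' lines inside code fences
                continue
            match = None if in_fence else HEADING.match(line)
            if match:
                yield len(match.group(1)), match.group(2)


def build_toc(notebook):
    """Return a nested markdown list linking to each heading."""
    lines = []
    for level, title in notebook_headings(notebook):
        anchor = title.replace(" ", "-")  # assumption: Jupyter-style anchors
        lines.append(f"{'  ' * (level - 1)}* [{title}](#{anchor})")
    return "\n".join(lines)


def toc_from_file(path):
    """Load an .ipynb file (it is plain JSON) and build its table of contents."""
    with open(path, encoding="utf-8") as handle:
        return build_toc(json.load(handle))


# Hypothetical usage (not executed here):
# print(toc_from_file("capstone.ipynb"))
```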