# Final project instructions

Final projects are due by **9 AM 20 Nov, 2020**.

### Timeline

**Note**: You are free to revise *any* of the entries below, so don't feel constrained by what you wrote if it does not work out. For example, you might change the data science objective or the technology stack if doing so would improve the project outcome.

The dates given below are guides, but evidence of regular, incremental progress on the project will be considered in the final grade.

Aim to complete the first draft of the project by **05 Nov 2020**. Each group will be asked to make class presentations on their project between 05-20 Nov 2020.

#### By 11 Sep 2020

- Form a team
    - Give your team a name
    - Form a team GitHub repository for your project 
        - you will write your own blog about the project on your individual GitHub-pages, but refer to the group repository for project content
    - List members of your team and program on the README page
    - Send Cliburn and Michale the following information:
        - URL of team GitHub repository
        - Team name
        - Name, email and program of each team member
        - Which of the 3 data sets you will be working on
    - Set up a channel for team communication on Teams/Slack
    - Set up a schedule for team meetings
          
#### By 25 Sep 2020

- The team repository README should contain the following
    - The specific data sets you will be working with
    - The objective of the project and how it benefits the target consumer
        - What data product(s) will be generated? For example, the project can generate one or more of the following
            - Report (e.g. auto-generated PDF)
            - Dashboard
            - Online ML algorithm
            - App
            - Other
    - A sketch of the data science plan, which may include (if appropriate)
        - Data plan (Extract, Load, Transform)
        - ML plan (Model training, evaluation, selection, deployment, monitoring)
        - Operations plan (workflow orchestration, CI/CD, containerization, serverless, testing, packaging, logging)
        - Technology stack (databases, python packages, cloud platform)
    - Roles, responsibilities and *timed milestones* for each team member
        - Plan for *accountability* but be willing to help out team members
        - Ask team members for help if you are overwhelmed
    - Set up channels for cross-team experts on Teams/Slack
        - Channel for **data engineering**
            - Databases (SQL, NoSQL)
            - ELT (`singer`)
            - Distributed data (`spark`)
            - Workflow orchestration (`airflow`)
        - Channel for **data science and ML**
            - EDA and visualization (`pandas`, `vaex`)
            - Dashboards (`dash`, `streamlit`)
            - ML frameworks (`scikit-learn`, `tensorflow`)
            - Explainability (`yellowbrick`, `shap`)
        - Channel for **operations**
            - Testing (`pytest`)
            - Containers (`docker`, `docker-compose`, `kubernetes`)
            - CI/CD (`jenkins`, `buddy`, `CircleCI`)
            - Cloud platforms (`aws`, `gcp`, `azure`)
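
As a starting point for the testing topic in the operations channel, the following is a minimal sketch of a pure function together with a test written in the plain-`assert` style that `pytest` collects. The function `normalize_age` and its 0-120 range are hypothetical; they stand in for whatever cleaning step your project actually needs.

```python
# Hypothetical example: the function name and the 0-120 range are
# illustrative assumptions, not part of the course materials.
from typing import Optional


def normalize_age(raw: str) -> Optional[int]:
    """Parse an age field, returning None for missing or implausible values."""
    try:
        age = int(raw.strip())
    except ValueError:
        return None  # empty strings and non-numeric text
    return age if 0 <= age <= 120 else None


def test_normalize_age():
    # pytest collects functions named test_* and reports any failed assert
    assert normalize_age(" 42 ") == 42
    assert normalize_age("") is None
    assert normalize_age("-5") is None
    assert normalize_age("abc") is None
    assert normalize_age("500") is None
```

If this is saved as a file such as `test_cleaning.py` (a made-up name), running `pytest` in the same directory should discover and run the test.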

        
#### 25 Sep - 05 Nov 2020

- Do the research and necessary work as planned
- Perform a branch and merge operation to fix a bug or add a feature
- Contribute something to another team's project by forking and submitting a pull request

#### 05 Nov - 20 Nov 2020

- Generate and review data product
- Final optimizations
- Revise team project README page
- Write personal blog
- Prepare for class presentation

#### 20 Nov 2020

- Final project due

Execute the cell below to see how much time you have left!

    import pendulum

    # Project deadline: 9 AM US/Eastern on 20 Nov 2020
    deadline = pendulum.datetime(2020, 11, 20, 9, 0, 0, tz='US/Eastern')
    print(f'You have {pendulum.now().diff_for_humans(deadline)} project submission deadline.')

Example output:

    You have 2 months before project submission deadline.


## Project instructions

For the final project, you will choose one or more data sets from one of the following topics

- COVID-19
- ICU admission
- Brain imaging

Form a team of 3-4 people. Each team will need to do the following:

- Identify and frame a data science problem to solve given the data (data analyst/domain expert)
- Process and manage the data so that it is easily accessible (data engineer)
- Develop the analysis, modeling and reporting data product (data scientist / machine learning engineer)

Each group member will also have to do the following *individual* task:

- Write a post on GitHub-pages about the project and what you learned from doing this

**Note**: 80% of the grade will be based on the quality of the project as a whole (team effort); 20% will be based on the quality of the blog/report (individual effort).

### You should use the project to showcase the following skill sets

- Basic skills
    - Effective use of version control
        - Ideally, showcase branch and merge skills for concurrent feature development / bug fixes
    - Literate programming
    - Functional coding style
- Domain knowledge
    - Introduction of domain concepts necessary to understand the data set(s)
    - Framing the data science challenge
- Data management
    - Use of different data formats
    - Use of SQL and/or NoSQL databases
- Data analysis and visualization
    - Data cleaning and validation
    - Use of linked data sources if applicable
    - Attractive and easy-to-understand reports and/or dashboards
- Development of data product
    - This is often a predictive machine learning model (classical or deep)
        - Construct pipeline, perform model training, evaluation and selection
    - Other data products can also be developed (in place of or in addition to supervised learning) if more appropriate for your problem
        - Examples - unsupervised or self-supervised models, graphical models, complex interactive visualizations
- Operationalization
    - Testing, logging, monitoring, streaming, packaging if applicable
    - Use of distributed computing platform if applicable
    - Deployment to cloud platform (including serverless if applicable)
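
To make the data management items above more concrete, here is a minimal sketch that moves records between formats (CSV in, JSON out) through an in-memory SQLite database, written as small functions. The table name, column names and sample rows are made up for illustration and do not reflect the schema of any of the course data sets; only the Python standard library is assumed.

```python
import csv
import io
import json
import sqlite3

# Made-up sample data; a real project would read from its chosen data set.
SAMPLE_CSV = """patient_id,unit,length_of_stay_days
1,MICU,3.5
2,SICU,1.0
3,MICU,6.5
"""


def read_rows(csv_text):
    """Parse CSV text into a list of (patient_id, unit, length_of_stay) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (int(r["patient_id"]), r["unit"], float(r["length_of_stay_days"]))
        for r in reader
    ]


def load_rows(conn, rows):
    """Create the hypothetical `stays` table and bulk-insert the rows."""
    conn.execute(
        "CREATE TABLE stays (patient_id INTEGER PRIMARY KEY, unit TEXT, los_days REAL)"
    )
    conn.executemany("INSERT INTO stays VALUES (?, ?, ?)", rows)
    conn.commit()


def mean_stay_by_unit(conn):
    """Aggregate in SQL and return a plain dict keyed by unit."""
    cursor = conn.execute(
        "SELECT unit, AVG(los_days) FROM stays GROUP BY unit ORDER BY unit"
    )
    return {unit: mean for unit, mean in cursor}


conn = sqlite3.connect(":memory:")
load_rows(conn, read_rows(SAMPLE_CSV))
summary = mean_stay_by_unit(conn)
print(json.dumps(summary, indent=2))  # JSON export of the SQL aggregate
conn.close()
```

Keeping parsing, loading and querying in separate functions makes each step easy to test on its own and to swap out, for example by replacing the in-memory database with a file-backed or server database.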
    
### Milestones

- Form team (diversity is essential - teams **must** consist of people from at least 2 different programs)
- Initial exploration of data sets for all topics
- Choose topic
- Further exploration of data sets for selected topic
- Frame data science problem to solve
- Develop the data science and machine learning product
- Deploy the product to a cloud platform (GCP, Azure, AWS)
- Streamline and automate 


### Suggestions for working as a team

- You should begin as soon as possible
- Aim to have a group meeting at least 1-2 times a week
- Set up Teams or Slack for communication within your team
- It is probably easiest if one person takes the role of project manager to
    - coordinate group activities
    - check that sufficient progress is being made week by week
- Avoid excessive "divide and conquer" specialization
    - each individual should take on at least two roles
    - each skill set area should have at least two people involved
    - teach your team members how you performed a task that not everyone is familiar with

## Data resources

Your primary data source should come from one of the three links provided below. You can, however, link to other public reference data sets, databases or ontologies if appropriate.

**Notes** 

- To access the data set, you may need to register on the site, or even complete specified training courses.
- If you need the data sets put on the Duke Spark cluster, please let me know and we will work with OIT to make this possible

### Open-Access Data and Computational Resources to Address COVID-19 [link](https://datascience.nih.gov/covid-19-open-access-resources)

- COVID-19 open-access data and computational resources are being provided by federal agencies, including NIH, public consortia, and private entities. These resources are freely available to researchers, and this page will be updated as more information becomes available. 
- The Office of Data Science Strategy seeks to provide the research community with links to open-access data, computational, and supporting resources. These resources are being aggregated and posted for scientific and public health interests. Inclusion of a resource on this list does not mean it has been evaluated or endorsed by NIH.

### MIMIC [link](https://mimic.physionet.org)

- Deidentified health data associated with ~60,000 ICU admissions (53,432 adult patients and 8,100 neonatal patients) from June 2001 to October 2012
- Includes demographics, vital signs, laboratory tests, medications, and more


### Oasis Brains Project [link](http://www.oasis-brains.org/)

- Neuroimaging datasets that have been used for hypothesis-driven data analyses, development of neuroanatomical atlases, and development of segmentation algorithms
- OASIS-1: Cross-sectional MRI data in young, middle aged, nondemented and demented older adults
    - 416 subjects; 434 MRI sessions
- OASIS-2: Longitudinal MRI data in nondemented and demented older adults
    - 150 subjects; 373 MRI sessions
- OASIS-3: Longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer’s disease
    - 1098 subjects; 2168 MRI sessions; 1608 PET sessions