# Assignment 8: Deeper Into Data Analysis and Visualization

***
***
## Project Description
Similar to previous assignments, you will use Python to create your own original analysis of an existing real-world data set. You will create a Notebook documenting your research question, your process, and conclusions. Unlike previous assignments, this assignment also requires you to examine continuous data.

This is an individual project. You may not work with anyone else on this assignment.

## Purpose of the Assignment
To demonstrate your ability to:
* Develop a research question involving correlational data
* Explore correlational data using descriptive statistics and visualizations
* Draw appropriate conclusions from data

## Selecting and Using a Dataset
* You may not use the following datasets: 'broadway', 'state crime', or 'recent-grads'.
* Choose and download a dataset from the [CORGIS project](https://think.cs.vt.edu/corgis/csv/index.html). If you want to use a non-CORGIS dataset, you must get approval from an instructor first.
* To work with the dataset you select, you will need to upload it to your 125assignments-f18/A8 folder on jupyterlab. You can do this by navigating the jupyterhub file browser to your A8 directory, then dragging the file from a folder on your computer (or from your desktop) into the file browser area of your web browser window. We will also show you how to do this in class.
* Because you are required to include a scatter matrix by group (see below), you will need a dataset that contains both continuous variables you can correlate and a variable that divides the records into groups.
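
As a quick check that a dataset can support a scatter matrix by group, the sketch below lists the numeric columns and the columns that could serve as groups. It is only an illustration: the helper `find_candidate_columns`, the `max_groups` cutoff, and the tiny made-up table are all assumptions, not part of the course materials. With a real file you would start from `pd.read_csv(...)` instead.

```python
import pandas as pd


def find_candidate_columns(df, max_groups=10):
    """Return (numeric_columns, group_columns) for a DataFrame.

    A column counts as a possible grouping column if it is not numeric
    and has between 2 and max_groups distinct values (an arbitrary cutoff).
    """
    numeric_columns = df.select_dtypes(include="number").columns.tolist()
    group_columns = [
        name
        for name in df.columns
        if name not in numeric_columns and 1 < df[name].nunique() <= max_groups
    ]
    return numeric_columns, group_columns


# Tiny made-up table standing in for a downloaded CSV file.
example = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "height_cm": [150.0, 162.5, 171.0, 158.2],
    "weight_kg": [50.1, 61.3, 70.8, 55.0],
})

numeric_columns, group_columns = find_candidate_columns(example)
print("Numeric columns:", numeric_columns)
print("Possible grouping columns:", group_columns)
```

A dataset is a reasonable candidate if it has at least two numeric columns and at least one possible grouping column.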

## Identifying a Research Question
Now the tough part. Explore your data. Use visualizations and descriptive statistics to get a broad sense of the data. Then get creative, and think of an interesting question to pursue using the data. During this process you may generate many questions. However, you are only going to submit one question for this assignment. The question you choose should address something that isn't immediately obvious about the data set. For example, if you gave us a count of the number of plays and the number of musicals in the broadway dataset, and a bar graph showing the same, you would not be demonstrating your ability to develop an interesting research question or to explore the data in any meaningful way.
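
A first pass at exploring a dataset might look like the sketch below: overall descriptive statistics, per-group means, and a correlation matrix. The data here is randomly generated and the column names (`section`, `hours_studied`, `exam_score`) are hypothetical; substitute the columns of your own dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: 60 rows in three groups, with one variable that
# depends on the other plus random noise.
rng = np.random.default_rng(seed=0)
n = 60
hours = rng.uniform(0, 10, size=n)
data = pd.DataFrame({
    "section": np.repeat(["A", "B", "C"], n // 3),
    "hours_studied": hours,
    "exam_score": 50 + 4 * hours + rng.normal(0, 5, size=n),
})

numeric_names = ["hours_studied", "exam_score"]

# Count, mean, standard deviation, and quartiles for the numeric columns.
overall_summary = data.describe()

# The mean of each numeric column within each group.
group_means = data.groupby("section")[numeric_names].mean()

# Pearson correlation between every pair of numeric columns.
correlations = data[numeric_names].corr()

print(overall_summary)
print(group_means)
print(correlations)
```

Differences between the group means, or a correlation that is stronger than you expected, are the kinds of observations that can lead to a research question.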

## Project Requirements
* You must use **at least one scatter matrix visualization** that looks at associations **by group** (see 7.4.4 in the textbook for an example)
* Your submitted notebook must clearly demonstrate your entire process using Markdown and Code cells
* You may not submit a visualization that appeared in any way in any of the class materials
* Your descriptive statistics and visualization must be appropriate for the type of data you are working with and the type of question you are trying to answer
* Your visualization must be appropriately titled, and your axes must be labelled
* Your visualization should clearly convey meaningful information about the dataset
* You must provide a clear and succinct explanation of your visualization and descriptive stats and how they relate to your research question
* Additional specific requirements are included in the markdown comments in the individual notebook sections provided below
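
One possible way to build a scatter matrix that looks at associations by group is sketched below, using `pandas.plotting.scatter_matrix` and colouring each point by its group. This is not necessarily the approach used in section 7.4.4 of the textbook. The helper `grouped_scatter_matrix`, the colour palette, and the randomly generated measurements are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from pandas.plotting import scatter_matrix


def grouped_scatter_matrix(data, group_column, palette, title):
    """Draw a scatter matrix of the numeric columns, coloured by group."""
    numeric = data.select_dtypes(include="number")
    # One colour per row, looked up from that row's group.
    point_colors = data[group_column].map(palette).tolist()
    axes = scatter_matrix(
        numeric, c=point_colors, alpha=0.7, figsize=(7, 7), diagonal="hist"
    )
    figure = axes[0, 0].get_figure()
    figure.suptitle(title)
    # Build the legend by hand, one coloured patch per group.
    handles = [Patch(color=color, label=name) for name, color in palette.items()]
    figure.legend(handles=handles, title=group_column, loc="upper right")
    return axes


# Made-up measurements for two groups of 40 rows each.
rng = np.random.default_rng(seed=1)
group_size = 40
length = np.concatenate([
    rng.normal(20, 2, group_size),
    rng.normal(26, 2, group_size),
])
measurements = pd.DataFrame({
    "species": np.repeat(["alpha", "beta"], group_size),
    "length_mm": length,
    "width_mm": 0.5 * length + rng.normal(0, 1, 2 * group_size),
    "mass_g": rng.normal(12, 3, 2 * group_size),
})

palette = {"alpha": "tab:blue", "beta": "tab:orange"}
axes = grouped_scatter_matrix(
    measurements, "species", palette, "Made-up measurements by species"
)
```

`scatter_matrix` labels the outer axes with the column names, so descriptive column names double as axis labels. In a notebook the figure is displayed when the cell finishes running.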

## Help!
These sorts of open-ended assignments can be challenging. Please ask us for help. Talk to us in person after class, come to office hours, or post a question to Piazza.

## Submitting

The submission code is at the end of this notebook. Be aware that if you execute the entire notebook, the final block will likely need your verification and will wait for it rather than running additional code.

***
***

## Grading Rubric
 
### General Guidelines
* The due date and time is provided in Moodle.  Any code submitted after that time will have the late penalty from the syllabus applied.
* The evaluator will use informed judgement in assessing the notebook.
* In all cases a rating of Not Met is given if the required element is not recognizably present. It is the responsibility of the student to include each element.
* Ratings are converted to percentages as follows: Not Met, 0%; Poor, 60%; Adequate, 80%; Excellent, 100%
 
### Compelling Research Question (10%):
* Not Met: The research question was absent or had no clear connection to the assignment.
* Poor: The research question was confusing or ambiguous.
* Adequate: The research question was well defined, but not interesting or well elaborated.
* Excellent: The research question was clear, well defined, interesting, and well elaborated.

### Documentation and Statement of Ethics (10%): 
* Not Met: The documentation and/or statement of ethics was absent or had no clear connection to the assignment.
* Poor: The documentation and/or statement of ethics was confusing or ambiguous.
* Adequate: The documentation and/or statement of ethics was complete and clear, but not well elaborated.
* Excellent: The documentation and/or statement of ethics was complete, clear, thoughtful, and well elaborated.

### Code Specifications and Correctness (20%):
For more details on Specifications and Correctness see the [course programming rubric](https://sun.iwu.edu/~mliffito/class/2018f/csds125/rubric.php). 
* Not Met: Many or most code blocks only function correctly in very limited cases or not at all.
* Poor: Significant details of the specification are violated, some code blocks exhibit incorrect behavior.
* Adequate: Minor details of the program specification are violated.
* Excellent: No errors; all code blocks work correctly and meet the specifications.

### Code Readability and Documentation (10%): 
Documentation includes Markdown cells and inline comments in code cells that explain what the code does. 
Readability includes using indentation consistently, adding whitespace (blank lines, spaces) where appropriate, giving variables meaningful names, and organizing the code well.
For more details on Readability and Documentation see the [course programming rubric](https://sun.iwu.edu/~mliffito/class/2018f/csds125/rubric.php). 
* Not Met: Code is not documented, and/or is not readable.
* Poor: Code is poorly documented, and/or is difficult to read.
* Adequate: Code is documented and is readable, but may contain minor issues.
* Excellent: Code is well documented and is clearly readable.

### Statistics (15%): The notebook used appropriate statistics to describe the data and answer the research question.
* Not Met: The statistics were absent or had no clear connection to the assignment.
* Poor: The statistics were inappropriate for the questions, or were incorrectly interpreted.
* Adequate: The statistics and the interpretation of the statistics were generally correct but lacked some degree of clarity or meaningful relationship.
* Excellent: The statistics were appropriate to the questions, and were interpreted correctly and clearly to answer the question.

### Visualization Selection (15%): The notebook used a visualization appropriate for answering the research question.
* Not Met: The visualization was absent or had no clear connection to the assignment.
* Poor: Visualization was inappropriate to the questions, or was incorrectly interpreted.
* Adequate: The kind of visualization, its interpretations, and the answer were generally correct but lacked some degree of clarity or meaningful relationship.
* Excellent: The visualization was appropriate to the questions, and was interpreted correctly and clearly to answer the question.

### Visualization Presentation (10%): Visualizations presented with appropriate formatting (y-axis scaling, labelling axes, title, labelled categories, etc).
* Not Met: Visualization formatting was absent.
* Poor: Visualization formatting was partially absent or inappropriate.
* Adequate: Visualization formatting was generally correct but omitted small elements or had some small elements that could be improved.
* Excellent:  Visualization formatting was correct and complete.

### Notebook Design (10%): The notebook was clear, succinct, and appropriately designed.
* Not Met: The notebook had no clear connection to the assignment.
* Poor: The statements were confusing or ambiguous; the notebook was difficult to read or contained distracting or irrelevant materials.
* Adequate: The statements were generally clear but were awkwardly phrased or overly verbose; the notebook was easily readable.
* Excellent: The statements of the questions and answers were clearly understandable; the notebook used a layout that was easily readable and visually attractive.
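
To see how the category weights and rating percentages in the rubric above could combine, the sketch below assumes the overall score is a simple weighted sum, which the rubric does not state explicitly. The ratings in `example_ratings` are hypothetical, and any late penalty is not included.

```python
# Category weights, in percent of the total grade, taken from the rubric.
WEIGHTS = {
    "Compelling Research Question": 10,
    "Documentation and Statement of Ethics": 10,
    "Code Specifications and Correctness": 20,
    "Code Readability and Documentation": 10,
    "Statistics": 15,
    "Visualization Selection": 15,
    "Visualization Presentation": 10,
    "Notebook Design": 10,
}

# Rating-to-percentage conversion from the General Guidelines.
RATING_PERCENT = {"Not Met": 0, "Poor": 60, "Adequate": 80, "Excellent": 100}


def overall_score(ratings):
    """Weighted sum of the category ratings (assumed formula)."""
    return sum(
        WEIGHTS[category] * RATING_PERCENT[rating] / 100
        for category, rating in ratings.items()
    )


# Hypothetical set of ratings for one notebook.
example_ratings = {
    "Compelling Research Question": "Excellent",
    "Documentation and Statement of Ethics": "Adequate",
    "Code Specifications and Correctness": "Excellent",
    "Code Readability and Documentation": "Adequate",
    "Statistics": "Adequate",
    "Visualization Selection": "Excellent",
    "Visualization Presentation": "Poor",
    "Notebook Design": "Excellent",
}

total = overall_score(example_ratings)
print(total)  # 89.0
```

Under this assumption, a single Poor rating in a 10% category costs 4 points overall, while an Adequate rating in the 20% code category would also cost 4 points.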

***
***
***
***

**Edit the markdown blocks below. Replace the instructions for each section with information about your project.**

# Title of Your Analysis (replace this with your title)
* The title should succinctly convey the key question or finding.
* In a few sentences identify the research question you will explore with the dataset. Explain why the research question is of interest to you, and may be of interest to a potential audience.


***
## Authorship and Resources Used
* Include your name here.
* Include the date of authorship here.
* If you received any assistance from anyone else, state who you consulted and specifically how they helped.
* If you used any other resources, state what they were and specifically how they helped, include links to the resources. [Markdown links use this formatting.](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#links)

***
## Data Description and Source
* In a few sentences describe the dataset you selected for analysis.
* Data attribution: describe where the data came from, provide links.

***
## Ethics
* In a few sentences, identify any potential harm that might result from your analysis. If you see no possible harm, state your reasons. If you do see potential harm, identify who might be at risk. 

***
## Import Libraries and Set Preferences for Visualization

In [None]:
# enter and test your code here

***
## Read and Verify Data

In [None]:
# enter and test your code here

***
## Descriptive Statistics and Visualizations
* In each step you need to clearly articulate what you are doing, why you are doing it, and what you have learned from each step.
* Provide a specific, clear, and concise justification for why the types of visualization(s) you chose are well suited to the data and your question.

In [None]:
# enter and test your code here

***
## Conclusions
* Provide a summary of your results and a discussion of what has been learned from your project

***
***
***
***
## Submitting

Once you're finished, select "Save Notebook" in the File menu (or press the Save icon, or press <kbd>Ctrl+S</kbd>) and then execute the cell below. The result will contain a link that you can use to check that your assignment has been submitted successfully. You can re-run the cell and submit more than once before the deadline. We will only grade your final submission.

*[It may print some errors saying "Javascript Error: IPython is not defined"; those may safely be ignored.]*

In [None]:
# This cell is just for submitting your work.  Do not change anything in it.
from client.api.notebook import Notebook
ok = Notebook('A8.ok')
import os
if not os.path.exists(os.path.join(os.environ.get("HOME"), ".config/ok/auth_refresh")):
    ok.auth(force=True)
else:
    ok.auth(inline=True)
_ = ok.submit()