# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)
## PA2 Data Visualization (100 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Use `matplotlib` to visualize data
* Transform a continuous attribute into a categorical attribute using discretization
* Calculate a least squares linear regression line

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Use Python for data analysis

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining HW2

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA2 link to accept the assignment and create a private repository for your assignment in GitHub Classroom: https://classroom.github.com/a/CpqLQcT1

Your repo, for example, will be named GonzagaCPSC310/pa2-yourusername (where yourusername is your Github username). I highly recommend committing/pushing regularly so your work is always backed up. We will grade your most recent commit, even if that commit is after the due date (your work will be marked late if this is the case).

## Overview and Requirements
Write a program (`pa2.py`) that creates plots for the "pre-processed" automobile dataset (auto-data.txt) you created for PA1. Pick one method to deal with missing values for this assignment (e.g., eliminate rows with missing values, use means or medians, etc.).

For this assignment you will need to perform the following steps and turn in your source code, plots, and a description (i.e., log) of the process you used to "visualize" the given data and any insights gained from the visualizations. Your log needs to be written separately from your .py file and may be written in a .txt or a .md (markdown) file. Each chart that you generate must include:
* a figure title
* labels on the x and y axes where appropriate (see the examples in Figure 1) 

Also, your final program should save each chart as a PDF file using the `savefig("filename.pdf")` function. Saved charts should start with the step name, e.g., step-1-cylinders.pdf.

Note: as you write solutions for the following steps, I highly encourage you to design functions that are generic and re-usable for future programming assignments and data mining tasks.

## Step 1 Frequency Diagrams
Create a frequency diagram (sometimes informally referred to as a "histogram") for each of the cylinders, model year, and origin attributes of the auto-data.txt dataset. Each diagram should show the frequency (i.e., total number) of cars per value of the given attribute. Use a basic bar chart to draw your frequency diagrams. See Figure 1 for an example for the cylinders attribute.
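
As a rough illustration of the idea above, the following sketch counts how often each value occurs and draws the counts as a bar chart. The function names (`get_frequencies`, `frequency_diagram`) and the `Agg` backend choice are assumptions made for this example, not part of the assignment's required design.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: charts are only saved to files, never shown on screen
import matplotlib.pyplot as plt


def get_frequencies(values):
    """Return the sorted unique values and a parallel list of their counts."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    unique_values = sorted(counts)
    return unique_values, [counts[v] for v in unique_values]


def frequency_diagram(values, title, xlabel):
    """Build a bar chart of value frequencies and return the figure."""
    unique_values, counts = get_frequencies(values)
    positions = list(range(len(unique_values)))
    fig, ax = plt.subplots()
    ax.bar(positions, counts)
    # Label each bar with the attribute value it represents
    ax.set_xticks(positions)
    ax.set_xticklabels([str(v) for v in unique_values])
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Count")
    return fig
```

Given a hypothetical list `cylinders` holding that column of the dataset, a call such as `frequency_diagram(cylinders, "Cars by Number of Cylinders", "Cylinders").savefig("step-1-cylinders.pdf")` would save the chart as a PDF.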

## Step 2 Pie Charts
Create a pie chart showing the frequency of cars for each of the attributes from step 1. Your pie chart should include the percentages for each attribute value (using `autopct="%1.1f%%"`). See Figure 1 for an example for the cylinders attribute.
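
A minimal sketch of such a pie chart is shown below. It assumes the attribute values and their counts have already been computed; the function name `pie_chart` and its parameters are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: charts are only saved to files
import matplotlib.pyplot as plt


def pie_chart(labels, counts, title):
    """Build a pie chart with one wedge per attribute value and return the figure."""
    fig, ax = plt.subplots()
    # autopct prints each wedge's share of the total with one decimal place
    ax.pie(counts, labels=[str(label) for label in labels], autopct="%1.1f%%")
    ax.set_title(title)
    return fig
```

The returned figure can then be saved with, for example, `fig.savefig("step-2-cylinders.pdf")`.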

## Step 3 Dot Charts
Create a dot (aka strip) chart showing the values for mpg, displacement, horsepower, weight, acceleration, and msrp. See Figure 1 for an example for mpg. As shown, darker circles indicate more data instances with that value. Some hints for creating a similar-looking dot chart:
* set the y-axis value for each x value to 1
* hide the y-axis using `pyplot.gca().get_yaxis().set_visible(False)`
* use the '.' marker, set markersize to a value larger than the default, and set alpha=0.2 to make the dots transparent
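
The hints above could be combined roughly as follows. This sketch uses the `Axes` object returned by `plt.subplots()` rather than `pyplot.gca()`; the function name `dot_chart` and the marker size of 20 are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: charts are only saved to files
import matplotlib.pyplot as plt


def dot_chart(values, title, xlabel):
    """Build a dot (strip) chart of the given values and return the figure."""
    fig, ax = plt.subplots()
    # Every point gets the same y value so the dots line up in one row
    ys = [1] * len(values)
    # Transparent markers: overlapping dots appear darker
    ax.plot(values, ys, ".", markersize=20, alpha=0.2)
    # The y-axis carries no information, so hide it
    ax.get_yaxis().set_visible(False)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    return fig
```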

## Step 4 Discretization
There is often a need to transform a continuous attribute into a categorical attribute.
Use the following two approaches to convert mpg into a categorical attribute and for each approach create a corresponding frequency diagram.

Approach 1. The US Department of Energy assigns gasoline vehicles a fuel economy rating from 1
(worst) to 10 (best). The ratings are defined in terms of mpg as follows:

Rating |MPG
-|-|
10 |≥ 45
9 |37–44
8 |31–36
7 |27–30
6 |24–26
5 |20–23
4 |17–19
3 |15–16
2 |14
1 |≤ 13

Use these ranges to define category values (denoting rating 1 to 10) for the mpg attribute.
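
One possible way to encode the table is as a list of lower bounds checked from the highest rating down. Note the assumption in this sketch: the table lists whole-number ranges, so a fractional value such as 36.5 is assigned to the highest rating whose lower bound it meets (here, 8). Rounding mpg to the nearest integer first would be a different, equally defensible choice. The names `RATING_LOWER_BOUNDS` and `mpg_to_rating` are hypothetical.

```python
# (lower bound, rating) pairs taken from the table, highest rating first
RATING_LOWER_BOUNDS = [
    (45, 10), (37, 9), (31, 8), (27, 7), (24, 6),
    (20, 5), (17, 4), (15, 3), (14, 2),
]


def mpg_to_rating(mpg):
    """Map an mpg value to a fuel economy rating from 1 to 10."""
    for lower_bound, rating in RATING_LOWER_BOUNDS:
        if mpg >= lower_bound:
            return rating
    # Anything below the lowest listed bound gets the worst rating
    return 1
```

The resulting ratings are categorical values, so their frequency diagram can be drawn the same way as in Step 1.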


Approach 2. Create 5 "equal-width" bins to generate categories. The bins should divide the range of mpg values into equal-width subranges, where value 1 denotes the subrange with the smallest values and 5 the subrange with the largest values (see Figure 1).

Each frequency diagram should label bins according to their corresponding ranges (e.g., "27–30"). See Figure 1 for an example.
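
Equal-width binning can be sketched as below. The helper names are hypothetical, and the sketch assumes that each bin includes its lower cutoff and excludes its upper cutoff, except the last bin, which also includes the maximum value.

```python
def compute_equal_width_cutoffs(values, num_bins):
    """Return num_bins + 1 cutoffs that split [min, max] into equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    return [low + i * width for i in range(num_bins + 1)]


def assign_bin(value, cutoffs):
    """Return the 1-based index of the bin that contains value."""
    num_bins = len(cutoffs) - 1
    for i in range(1, num_bins):
        if value < cutoffs[i]:
            return i
    # The last bin is closed on the right so the maximum value lands in it
    return num_bins


def bin_labels(cutoffs):
    """Return one range label per bin, e.g. '10.0–20.0'."""
    return [
        f"{cutoffs[i]:.1f}\u2013{cutoffs[i + 1]:.1f}"
        for i in range(len(cutoffs) - 1)
    ]
```

For example, values ranging from 10 to 60 split into 5 bins give cutoffs 10, 20, 30, 40, 50, 60, so a value of 20 falls in bin 2 and a value of 60 falls in bin 5.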

## Step 5 Histograms
Create a histogram using the `pyplot.hist()` function for each of the attributes in Step 3. Use the default of 10 bins (see Figure 1).

## Step 6 Relationships A
Create scatter plots that compare displacement, horsepower, weight, acceleration,
and msrp to mpg (i.e., where mpg is the y-axis in each scatter plot). Be sure to appropriately label the x and y axes. Figure 1 gives an example for displacement.

## Step 7 Relationships B
Write a function to calculate (least-squares) linear regressions and create scatter
plots with the corresponding linear regression lines for comparing displacement, horsepower, weight, and msrp to mpg. Create one additional scatter plot with a linear regression line comparing displacement to weight. Label each plot with the correlation coefficient and covariance. Figure 1 gives an example for displacement compared to mpg.
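
For reference, the least-squares slope is $m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and the intercept is $b = \bar{y} - m\bar{x}$. The sketch below computes these along with the correlation coefficient and covariance. The function names are hypothetical, and the covariance here is the population version (dividing by n); dividing by n - 1 instead would give the sample covariance.

```python
import math


def mean(values):
    return sum(values) / len(values)


def linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def covariance(xs, ys):
    """Population covariance (assumption: divide by n rather than n - 1)."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r, which always lies in [-1, 1]."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

The regression line itself can be drawn on top of a scatter plot by plotting the two points `(min(xs), slope * min(xs) + intercept)` and `(max(xs), slope * max(xs) + intercept)` as a line.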

## Step 8 Relationships C
Create a chart to compare categorical/continuous attributes. To do this, create a box plot describing MPG (continuous) by model year (where we view year as categorical). An example of the chart is shown in Figure 1.
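
A box plot of this kind needs the continuous values grouped by category first. The following is a sketch under the assumption that the two columns are available as parallel lists; `group_by` and `box_plot` are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: charts are only saved to files
import matplotlib.pyplot as plt


def group_by(keys, values):
    """Group values by key; return sorted keys and a parallel list of value lists."""
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)
    labels = sorted(groups)
    return labels, [groups[label] for label in labels]


def box_plot(keys, values, title, xlabel, ylabel):
    """Build one box per distinct key and return the figure."""
    labels, grouped = group_by(keys, values)
    fig, ax = plt.subplots()
    ax.boxplot(grouped)
    # boxplot places the boxes at x = 1, 2, ..., so label those positions
    ax.set_xticks(list(range(1, len(labels) + 1)))
    ax.set_xticklabels([str(label) for label in labels])
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return fig
```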

## Step 9 Relationships D
Create a chart to compare categorical/categorical attributes. To do this, create a frequency diagram of the number of cars from each country of origin (categorical) separated out by model year (viewed as categorical). An example of the chart is shown in Figure 1.
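
One way to approach this is to count cars per (model year, origin) pair and draw one set of bars per origin, shifted sideways so the bars for a given year sit next to each other. The sketch below assumes the two columns are parallel lists; the function names and the total group width of 0.8 are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: charts are only saved to files
import matplotlib.pyplot as plt


def count_by_group(years, origins):
    """Return sorted years and a dict mapping each origin to its per-year counts."""
    year_labels = sorted(set(years))
    year_index = {year: i for i, year in enumerate(year_labels)}
    counts = {origin: [0] * len(year_labels) for origin in sorted(set(origins))}
    for year, origin in zip(years, origins):
        counts[origin][year_index[year]] += 1
    return year_labels, counts


def multiple_frequency_diagram(years, origins, title):
    """Build a grouped bar chart of car counts by origin within each model year."""
    year_labels, counts = count_by_group(years, origins)
    fig, ax = plt.subplots()
    bar_width = 0.8 / len(counts)  # each year's group of bars spans 0.8 units
    for offset, (origin, origin_counts) in enumerate(counts.items()):
        positions = [i + offset * bar_width for i in range(len(year_labels))]
        ax.bar(positions, origin_counts, bar_width, label=str(origin))
    # Put each year's tick at the center of its group of bars
    group_center = bar_width * (len(counts) - 1) / 2
    ax.set_xticks([i + group_center for i in range(len(year_labels))])
    ax.set_xticklabels([str(year) for year in year_labels])
    ax.set_title(title)
    ax.set_xlabel("Model Year")
    ax.set_ylabel("Count")
    ax.legend(title="Origin")
    return fig
```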

## Figure 1
![Figure 1](https://raw.githubusercontent.com/GonzagaCPSC310/PAs/master/figures/auto_data_charts.png)

Example visualizations for each step: 
* (a) frequency diagram
* (b) pie chart
* (c) dot chart
* (d) frequency diagram of equal width binning
* (e) histogram of acceleration values generated from pyplot with 10 bins
* (f) scatterplot comparing displacement to mpg
* (g) similar plot as in (f) but with linear regression line
* (h) box and whisker plot
* (i) (BONUS) multiple frequency diagram

## BONUS (5 pts)
Take a look at the matplotlib gallery: https://matplotlib.org/gallery/index.html

Choose a type of chart that is significantly different from the charts covered in this assignment and use it to display data from auto-data.txt. In your log, describe how the chart displays data, how it is supposed to be interpreted, interesting conclusions you can draw from the chart, and the process you took to create the chart. Have fun with this one!!

## Submitting Assignments
1. Use Github classroom to submit your assignment via a Github repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this. You must commit your solution by the due date and time.
1. Your repo should contain only your .py file(s), your input .csv/.txt file(s), your output .pdf plot files, and your log file (.txt or .md). Double check that this is the case by cloning (or downloading a zip of) your submission repo and running your code from the command line like we will when we grade your code.

## Grading Guidelines
This assignment is worth 100 points + 5 points bonus. Your assignment will be evaluated based on successful execution from the command line (using the Anaconda Python Distribution v3.7) and adherence to the program requirements. We will grade according to the following criteria:
* 10 pts for correct step 1
* 5 pts for correct step 2
* 10 pts for correct step 3
* 10 pts for correct step 4
* 5 pts for correct step 5
* 10 pts for correct step 6
* 10 pts for correct step 7
* 10 pts for correct step 8
* 10 pts for correct step 9
* 10 pts for quality and clarity of the write-up in the log file
* 10 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC310/PAs/blob/master/Coding%20Standard.ipynb)