# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)
## PA1 Python for Data Analysis (75 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Identify and remove duplicates from a table
* Combine tables together
* Clean data
* Compute simple summary statistics
* Handle missing values

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Understand terms and concepts in Chapters 1 and 2 of Bramer
* Run Python scripts from the command line
* Use Python variables, operators, lists, functions, conditionals, loops, and file I/O

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining HW1

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA1 link to accept the assignment and create your private repository in GitHub Classroom: https://classroom.github.com/a/AqfR7uZX

Your repo, for example, will be named GonzagaCPSC310/pa1-yourusername (where yourusername is your Github username). I highly recommend committing/pushing regularly so your work is always backed up. We will grade your most recent commit, even if that commit is after the due date (your work will be marked late if this is the case).

Note: Working with Github classroom does not involve any forking!

## Overview and Requirements
Write a program (pa1.py) that performs data pre-processing tasks on an automobile dataset. Download the auto-mpg.txt and auto-prices.txt datasets from https://github.com/GonzagaCPSC310/PAs/tree/master/files. These datasets contain information about cars manufactured and sold in the 1970s.

The attributes of auto-mpg.txt are: 
* mpg (miles per gallon)
* cylinders
* displacement
* horsepower
* weight
* acceleration
* model year
* origin
* car name

The attributes of auto-prices.txt are:
* car name
* model year
* msrp (manufacturer's suggested retail price)

For this assignment you will need to perform the following steps and hand in your source code, tests, and a description (i.e., log) of the process you used to "clean" the given data. Your log needs to be written separately from your .py file and may be written in a .txt or a .md (markdown) file.

Note: as you write solutions for the following steps, I highly encourage you to design functions that are generic and re-usable for future programming assignments and data mining tasks.

## Step 1
In pa1.py write functions to count the number of instances in the dataset and to find any duplicates (i.e., instances with the same car name and model year). If duplicates do exist, determine how to resolve them and then modify the dataset (i.e., manually copy the dataset to a new file, resolve the duplicates, then rerun the script OR programmatically resolve duplicates and write the modified dataset to a file). Be sure to write down the duplicates you found (if any), how you resolved them, and why you resolved them the way you did in your log. After you complete this step, running your pa1.py program should print the following:
```
--------------------------------------------------
auto-mpg-nodups.txt:
--------------------------------------------------
No. of instances: ???
Duplicates: []
--------------------------------------------------
auto-prices-nodups.txt:
--------------------------------------------------
No. of instances: ???
Duplicates: []
```

where each ??? is replaced with the number of instances in the corresponding dataset.
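One possible shape for the duplicate check is sketched below. This is only an illustration, not the required design: the function name `find_duplicates`, the `key_indexes` parameter, and the three-row table (with made-up car names and prices) are all assumptions introduced for this example.

```python
def find_duplicates(table, key_indexes):
    """Return every row whose key (the values at key_indexes) was already seen."""
    seen = set()
    duplicates = []
    for row in table:
        key = tuple(row[i] for i in key_indexes)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


# Hypothetical data: car name at index 0, model year at index 1, msrp at index 2.
table = [
    ["car a", "70", "2000"],
    ["car b", "71", "3000"],
    ["car a", "70", "2100"],
]

print("No. of instances:", len(table))
print("Duplicates:", find_duplicates(table, [0, 1]))
```

Passing the key column indexes as a parameter keeps the function reusable for both input files, since car name and model year sit at different positions in each.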

## Step 2
Add functions to pa1.py to combine the two datasets, write out the result to auto-data.txt, and count the number of instances (as in Step 1 above). The result of this step should print the following (where ??? should be replaced with the actual number of instances):

```
--------------------------------------------------
auto-mpg-nodups.txt:
--------------------------------------------------
No. of instances: ???
Duplicates: []
--------------------------------------------------
auto-prices-nodups.txt:
--------------------------------------------------
No. of instances: ???
Duplicates: []
--------------------------------------------------
combined table (saved as auto-data.txt):
--------------------------------------------------
No. of instances: ???
Duplicates: []
```

The combined dataset should have 10 attributes such that the first 9 attributes are those from auto-mpg.txt and the last attribute is the corresponding msrp (price) from auto-prices.txt. The two datasets should be combined (i.e., joined) on car name and model year. To combine the two datasets, you should perform a "full outer join". That is, you should not disregard non-matches and instead include non-matches by padding the attributes with missing values (denoted as "NA"). As an example, while there isn't a price listed in auto-prices.txt for a 1970 amc rebel sst, auto-data.txt should include an instance:
```
16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,NA
```

As another example, while there isn't an instance in auto-mpg.txt for a 1971 Audi 100 LS, auto-data.txt should include an instance:
```
NA,NA,NA,NA,NA,NA,71,NA,audi 100 ls,3595
```

## Step 3
Updated 1/29/19 for clarification:

Resolve (i.e., clean up) the cases in auto-data.txt in which there are prices but no mpg data. To do this, you may need to manually edit your auto-mpg-nodups.txt and/or auto-prices-nodups.txt files to resolve the issues (e.g., in the case of a misspelling), which will require you to regenerate the auto-data.txt file. Be sure to copy these files and rename them appropriately, e.g. auto-mpg-clean.txt and/or auto-prices-clean.txt. Also, write down in your log how and why you resolved these cases the way you did.
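To find the cases that need attention, a small filter over the combined table may help. The sketch below assumes the 10-attribute layout from Step 2 (mpg at index 0, msrp at index 9) and uses a hypothetical helper name, `price_without_mpg`. The two rows are the Step 2 examples.

```python
def price_without_mpg(table, mpg_index=0, msrp_index=9, missing="NA"):
    """Return the rows that have a price but no mpg value."""
    return [
        row
        for row in table
        if row[mpg_index] == missing and row[msrp_index] != missing
    ]


combined = [
    ["16.0", "8", "304.0", "150.0", "3433", "12.0", "70", "1", "amc rebel sst", "NA"],
    ["NA", "NA", "NA", "NA", "NA", "NA", "71", "NA", "audi 100 ls", "3595"],
]

for row in price_without_mpg(combined):
    # Print model year and car name so each case can be inspected by hand.
    print(row[6], row[8])
```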

## Step 4
Add functions to pa1.py to compute summary statistics for auto-data.txt. For each continuous attribute, compute the minimum, maximum, midpoint (halfway between min and max), average, and median values. You can ignore the categorical attributes for now (we'll look at these further in PA2). The result of running your program after this step should be similar to the following. The tables below were generated using the [`tabulate`](https://pypi.org/project/tabulate/) module (which you can install at the command line with `conda install tabulate` OR `pip install tabulate`). Note that your summary values may be different from those below.

```
--------------------------------------------------
auto-mpg-clean.txt:
--------------------------------------------------
No. of instances: ???
Duplicates: []
--------------------------------------------------
auto-prices-clean.txt:
--------------------------------------------------
No. of instances: ???
Duplicates: []
--------------------------------------------------
combined table (saved as auto-data.txt):
--------------------------------------------------
No. of instances: ???
Duplicates: []
Summary Stats:
============ ===== ===== ======= ====== ======
attribute      min   max     mid    avg    med
============ ===== ===== ======= ====== ======
mpg              9  43.1    26.1   21.1     20
displacement    68   455   261.5  214.3    200
...
msrp          1798 21497 11647.5   4131 3824.5
============ ===== ===== ======= ====== ======

```
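The five statistics can be computed with plain Python, as in the sketch below. The helper names `numeric_column` and `summarize` and the four-row table are hypothetical, and the sketch assumes that missing values are stored as the string "NA" and should be skipped when summarizing.

```python
def numeric_column(table, index, missing="NA"):
    """Collect one column as floats, skipping missing values."""
    return [float(row[index]) for row in table if row[index] != missing]


def summarize(values):
    """Return (min, max, mid, avg, med) for a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    low, high = ordered[0], ordered[-1]
    mid = (low + high) / 2
    avg = sum(ordered) / n
    if n % 2 == 1:
        med = ordered[n // 2]
    else:
        # Even count: average the two middle values.
        med = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return low, high, mid, avg, med


# Hypothetical one-column table; the second row has a missing value.
table = [["10.0"], ["NA"], ["20.0"], ["60.0"]]
stats = summarize(numeric_column(table, 0))
print(["example"] + list(stats))
```

A list of such rows (attribute name followed by the five values) is the kind of input `tabulate` accepts, together with a list of headers.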

## Step 5
Write functions to perform three different techniques to resolve missing values, and for each compute the same summary statistics as in Step 4. Note that there should only be three columns in auto-data.txt that contain missing values.
1. The first approach should be to remove all instances with missing values.
2. The second approach should be to replace missing values with their corresponding attribute's average value.
3. And the third approach should also replace missing values with average values but based on meaningful subsets of the data, e.g., based on the model year, origin, or some combination of attributes that makes the most sense to you.

Be sure to document your decisions in your log. The result of running your program should include the summary statistics for each approach.

```
...
--------------------------------------------------
combined table (saved as auto-data.txt):
--------------------------------------------------
No. of instances: ???
Duplicates: []
Summary Stats:
============ ===== ===== ======= ====== =====
attribute      min   max     mid    avg   med
============ ===== ===== ======= ====== =====
mpg             11  43.1    27.1   20.8  19.4
...
============ ===== ===== ======= ====== =====
--------------------------------------------------
combined table (rows w/ missing values removed):
--------------------------------------------------
No. of instances: ???
Duplicates: []
Summary Stats:
============ ===== ===== ======= ====== =====
attribute      min   max     mid    avg   med
============ ===== ===== ======= ====== =====
mpg              9  43.1    26.1   21.1  20.2
...
============ ===== ===== ======= ====== =====
etc.
```
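The three approaches listed in this step might look something like the following sketch. The function names, the `group_col` parameter, and the two-column table (model year, msrp) with made-up prices are assumptions for illustration. The sketch also assumes that numeric values were already converted to numbers, that missing values are the string "NA", and that a group with no known values falls back to the overall average.

```python
from collections import defaultdict

MISSING = "NA"


def remove_rows_with_missing(table):
    """Approach 1: keep only the rows that have no missing values."""
    return [row for row in table if MISSING not in row]


def fill_with_average(table, col, group_col=None):
    """Approaches 2 and 3: replace missing values in `col` with an average.

    Without group_col the overall column average is used. With group_col the
    average is taken over rows sharing the same group_col value.
    """
    known = [row[col] for row in table if row[col] != MISSING]
    overall = sum(known) / len(known)

    groups = defaultdict(list)
    if group_col is not None:
        for row in table:
            if row[col] != MISSING:
                groups[row[group_col]].append(row[col])

    filled = []
    for row in table:
        new_row = list(row)  # copy so the input table is left unchanged
        if new_row[col] == MISSING:
            values = groups.get(row[group_col]) if group_col is not None else None
            new_row[col] = sum(values) / len(values) if values else overall
        filled.append(new_row)
    return filled


# Hypothetical rows: [model year, msrp]
table = [[70, 2000.0], [70, 3000.0], [71, 5000.0], [70, MISSING], [71, MISSING]]

print(remove_rows_with_missing(table))
print(fill_with_average(table, 1))
print(fill_with_average(table, 1, group_col=0))
```

Each function returns a new table, so the same summary statistics code can be run on all three results and compared.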

## Bonus (3 pts)
Change your program so the names of the two input files will be passed in to your program via [command line arguments](https://docs.python.org/3/tutorial/stdlib.html#command-line-arguments). Use the names of these files to programmatically create subsequent file names. For example, running your program as follows: 
```
python pa1.py auto-mpg.txt auto-prices.txt
```

would produce the same file names as listed in the specifications above. But if the names of the files are car-mpg.txt and car-prices.txt, then running your program as follows:
```
python pa1.py car-mpg.txt car-prices.txt
```

would produce file names such as car-mpg-nodups.txt, car-prices-nodups.txt, car-data.txt, car-mpg-clean.txt, etc. Essentially, you are not hard-coding filenames in your program but instead constructing them at runtime using the command line arguments.

If incorrect command line arguments are given (e.g. missing one of them), print a string showing usage instructions. This helps the user know how to run your program!!
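A rough sketch of the filename construction and usage check is shown below. The helper names `with_suffix` and `combined_name` are hypothetical, and the sketch assumes that the combined file's prefix is everything before the last hyphen of the first input file's name. In pa1.py, `main` would be called with `sys.argv`; here it is called with an explicit list so the example runs on its own.

```python
import os


def with_suffix(filename, suffix):
    """car-mpg.txt, "nodups" -> car-mpg-nodups.txt"""
    stem, ext = os.path.splitext(filename)
    return f"{stem}-{suffix}{ext}"


def combined_name(filename):
    """car-mpg.txt -> car-data.txt (prefix = text before the last hyphen)."""
    stem, ext = os.path.splitext(filename)
    prefix = stem.rsplit("-", 1)[0]
    return f"{prefix}-data{ext}"


def main(argv):
    """Return 0 on success, 1 if the arguments are wrong."""
    if len(argv) != 3:
        print(f"usage: python {argv[0]} <mpg-file> <prices-file>")
        return 1
    mpg_file, prices_file = argv[1], argv[2]
    for name in (
        with_suffix(mpg_file, "nodups"),
        with_suffix(prices_file, "nodups"),
        combined_name(mpg_file),
        with_suffix(mpg_file, "clean"),
        with_suffix(prices_file, "clean"),
    ):
        print(name)
    return 0


# Stand-in for sys.argv when run as: python pa1.py car-mpg.txt car-prices.txt
main(["pa1.py", "car-mpg.txt", "car-prices.txt"])
```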

## Submitting Assignments
1. Use Github classroom to submit your assignment via a Github repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this. You must commit your solution by the due date and time.
1. Your repo should contain only your .py file(s), your input .txt data file(s), and your log file (.txt or .md). Double check that this is the case by cloning (or downloading a zip of) your submission repo and running your code from the command line like we will when we grade your code.

## Grading Guidelines
This assignment is worth 75 points + 3 points bonus. Your assignment will be evaluated based on successful execution from the command line (using the Anaconda Python Distribution v3.7) and adherence to the program requirements. We will grade according to the following criteria:
* 10 pts for correct step 1
* 10 pts for correct step 2
* 10 pts for correct step 3
* 10 pts for correct step 4
* 15 pts for correct step 5
* 10 pts for quality and clarity of the write-up in the log file
* 10 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC310/PAs/blob/master/Coding%20Standard.ipynb)