# Style

- Read the PEP 8 Python style guide to help you write the most legible,
  canonical Python.
- Files/directories: avoid spaces in file and directory names. So many people
  work on the command line in data science. IDEs don't have much trouble with
  files that have spaces in their names, and spaces are allowed on all OSes,
  but they can be a pain in scenarios like remote ssh sessions. It's annoying
  to ssh into a machine and then use quotes or escape characters to manage a
  file system that has files with spaces in their names.
- Docstrings: use them. Always document your Python functions/classes with a
  docstring: <code>""" docstring """</code>. This is actually a part of the Python
  language and can be read by IDEs or other Python tooling to provide
  just-in-time documentation while writing/editing code. PEP 8 covers
  docstrings and defers to PEP 257 for the detailed conventions.
- Spaces, not tabs.
- <code>\_\_init\_\_.py</code> goes in any directory that has Python modules which you call
  'import' on.
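
As a minimal sketch of the docstring advice above: the function `mean_mpg` below is hypothetical (it is not part of the reviewed code) and only shows where the docstring goes and how tools reach it through `__doc__`.

```python
def mean_mpg(values):
    """Return the arithmetic mean of a sequence of mpg readings.

    `values` is any non-empty sequence of numbers. Raises ValueError
    if the sequence is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


# The docstring is stored on the function object itself, which is how
# help() and IDE tooltips find it.
print(mean_mpg.__doc__)
print(mean_mpg([18.0, 15.0, 18.0]))  # 17.0
```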

# Mechanics

- Caching your data offline can be a good idea.
- Writing your own tools to process and analyze data is definitely a positive.
  It's important that you understand what's happening behind the scenes.
- Although I personally see no reason why folks shouldn't use their own
  homebrew code to handle data processing, there are a lot of libraries
  that do this for you and are highly optimized for speed. This may become
  necessary when your data sets jump from a few hundred rows to gigabytes
  or terabytes and beyond.
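
One way to read the caching advice is "download once, then reuse the local copy". The sketch below assumes that reading; `cached_path` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the example runs offline.

```python
import os
import tempfile


def cached_path(path, fetch):
    """Return `path`, calling `fetch(path)` first only if the file is missing."""
    if not os.path.exists(path):
        fetch(path)
    return path


# Stand-in for a real download; it records each call so the caching
# behaviour is visible.
calls = []


def fake_fetch(path):
    calls.append(path)
    with open(path, "w") as f:
        f.write("18.0 8 307.0 130.0 3504. 12.0 70 1\n")


target = os.path.join(tempfile.mkdtemp(), "auto-mpg.data")
cached_path(target, fake_fetch)  # file missing: fetch runs
cached_path(target, fake_fetch)  # file present: fetch is skipped
print(len(calls))  # 1
```

Passing the fetcher in as an argument keeps the caching logic separate from how the data is obtained, so the same helper works for any source.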

# Modeling
- So, you want to use the number of cylinders to predict gas mileage? There are
  several other features in the data set, why did you pick cylinders?
- Would other features help make your prediction more accurate? Why are you
  ignoring them?
- Implement your ideas and show me the following things:
 - How does the model predict the mpg based on the features?
 - If I gave you a new car your model had never seen, how could you predict the
   mpg of that car?
 - How can I evaluate how well your model performs?
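
A rough sketch of what an answer to those three questions could look like. Everything about the setup is an assumption made for illustration: ordinary least squares is just one possible model, `weight` and `horsepower` are an arbitrary feature choice, the hold-out split is arbitrary, and the "new car" values are made up. The ten rows are the ones printed in the `data` output later in these notes.

```python
import numpy as np

# (weight, horsepower, mpg) for the ten cars shown in the `data` output.
rows = np.array([
    [3504.0, 130.0, 18.0], [3693.0, 165.0, 15.0], [3436.0, 150.0, 18.0],
    [3433.0, 150.0, 16.0], [3449.0, 140.0, 17.0], [4341.0, 198.0, 15.0],
    [4354.0, 220.0, 14.0], [4312.0, 215.0, 14.0], [4425.0, 225.0, 14.0],
    [3850.0, 190.0, 15.0],
])

# Design matrix: a column of ones (intercept) plus the two features.
X = np.column_stack([np.ones(len(rows)), rows[:, 0], rows[:, 1]])
y = rows[:, 2]

# Hold out the last two cars so the model is scored on data it never saw.
X_train, y_train = X[:8], y[:8]
X_test, y_test = X[8:], y[8:]

# Ordinary least squares: pick coefficients that minimise squared error.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(weight, horsepower):
    """Predict mpg for one car from the fitted coefficients."""
    return coef[0] + coef[1] * weight + coef[2] * horsepower


# Evaluate with root-mean-square error on the held-out cars.
rmse = np.sqrt(np.mean((X_test @ coef - y_test) ** 2))

print("coefficients:", coef)
print("held-out RMSE:", rmse)
print("hypothetical new car:", predict(3600.0, 160.0))
```

With only ten cars the numbers mean little; the point is the shape of the workflow: fit on one part of the data, predict from features alone, and score on cars the model has not seen.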

# Programming
- Your nested while-loops are hella inefficient for sorting: for every year,
  you have to go through the whole data set. You should think about using a
  data structure like a dictionary if you want to get the info you want in
  one pass.
- Your code is not 'functional'. I can see that if you wanted to modify your
  code, perhaps to group data in another way or to try something different,
  it would require a complete rewrite. This makes iterating on ideas slow and
  inefficient. Thankfully, someone has solved this problem for you: it's
  called "pandas DataFrames". For large data sets, you can use Apache Spark
  or concurrency.
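
The one-pass dictionary idea can be sketched as follows. The `cars` list is made up for illustration and the variable names are hypothetical; the point is that each record is visited once rather than once per year.

```python
from collections import defaultdict

# (model_year, cylinders) pairs; made-up values for illustration only.
cars = [(70, 8), (70, 8), (70, 4), (71, 4), (71, 6), (72, 4)]

# One pass over the data: each record is filed under its year as it is
# read, instead of rescanning the whole list once per year.
cylinders_by_year = defaultdict(list)
for model_year, cylinders in cars:
    cylinders_by_year[model_year].append(cylinders)

# Summaries now only touch the records that belong to each year.
for model_year in sorted(cylinders_by_year):
    values = cylinders_by_year[model_year]
    print(model_year, len(values), sum(values) / len(values))
```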

# Data Exploration
- Looks like you were careful when loading in your data to check for weird
  values. This is great practice!
- Explore the data a little bit more carefully before you start modeling. Why
  did you start with cylinders? Does the data show that the feature can be used
  to predict MPG? How would you show this with math?
- It's not clear what date range you are associating with cylinders, and the
  way you wrote this part is prone to off-by-one errors.
- Why did you group cylinder count by year?
- You should look for:
 - Individual feature distributions 
 - Categorical vs numeric features
 - Transforming categorical features into numeric features.
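
One way to "show this with math" is a correlation coefficient, and one common way to turn a categorical feature into numbers is one-hot encoding. The sketch below does both by hand so the arithmetic is visible. The `weight` and `mpg` values are the ten rows printed in the `data` output later in these notes; the origin codes passed to `one_hot` are made up, and both helper names are hypothetical.

```python
import math

# weight and mpg for the ten cars shown in the `data` output.
weight = [3504.0, 3693.0, 3436.0, 3433.0, 3449.0,
          4341.0, 4354.0, 4312.0, 4425.0, 3850.0]
mpg = [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0, 14.0, 14.0, 15.0]


def pearson(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def one_hot(values):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


r = pearson(weight, mpg)
print("weight vs mpg correlation:", r)  # negative for these ten rows
print(one_hot([1, 3, 1, 2]))            # made-up origin codes
```

A value of `r` near -1 or +1 suggests a strong linear relationship; a value near 0 suggests the feature alone will not predict much.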


# Nick Bootstrapped

Here, I will reproduce your code using a more data-sciency approach. Note where I have structured things differently.

In [83]:
%matplotlib notebook
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np

## Download Data Set Locally

In [84]:
%%bash
wget https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

--2017-08-22 12:54:26--  https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Resolving archive.ics.uci.edu... 128.195.10.249
Connecting to archive.ics.uci.edu|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30286 (30K) [text/plain]
Saving to: ‘auto-mpg.data’

     0K .......... .......... .........                       100% 1.30M=0.02s

2017-08-22 12:54:26 (1.30 MB/s) - ‘auto-mpg.data’ saved [30286/30286]



In [85]:
%%bash
ls -ltrh | grep 'auto-mpg'

-rw-r--r--  1 mbeaumi  staff    30K Jul  7  1993 auto-mpg.data


## Process Data

In [86]:
raw_data = []

features = [
    "mpg", 
    "cylinders", 
    "displacement", 
    "horsepower", 
    "weight", 
    "acceleration", 
    "model_year", 
    "origin", 
    "car_name"
]

with open('./auto-mpg.data','r') as f:
    for line in f:
        tokens = line.split()
        raw_data.append({
            "mpg":tokens[0],
            "cylinders":tokens[1],
            "displacement":tokens[2],
            "horsepower":tokens[3],
            "weight":tokens[4],
            "acceleration":tokens[5],
            "model_year":tokens[6],
            "origin":tokens[7],
            "car_name":'_'.join(tokens[8:])
        })
data = pd.DataFrame(raw_data)

# Every feature except car_name is numeric; tokens that fail to parse become NaN.
for name in features[:-1]:
    data[name] = pd.to_numeric(data[name], errors='coerce')
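
For comparison, pandas can do this parsing in a single `read_csv` call. The sketch below assumes the file is whitespace-delimited, has no header row, wraps the car name in double quotes, and marks missing values with `?`. The two sample lines are typed in by hand in that assumed layout so the block runs without the download, and the name `parsed` is used to avoid clobbering `data`.

```python
import io

import pandas as pd

features = ["mpg", "cylinders", "displacement", "horsepower", "weight",
            "acceleration", "model_year", "origin", "car_name"]

# Two hand-typed lines in the assumed layout of auto-mpg.data.
sample = (
    '18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"\n'
    '25.0 4 98.00 ? 2046. 19.0 71 1\t"ford pinto"\n'
)

parsed = pd.read_csv(
    io.StringIO(sample),  # swap in './auto-mpg.data' for the real file
    sep=r"\s+",           # any run of spaces or tabs separates fields
    names=features,       # the file has no header row
    na_values="?",        # treat '?' as missing instead of text
)
print(parsed.dtypes)
```

Unlike the manual loop, this keeps the spaces in `car_name` and drops the surrounding quotes, so the two approaches do not produce identical name strings.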

In [87]:
data.dtypes

acceleration    float64
car_name         object
cylinders         int64
displacement    float64
horsepower      float64
model_year        int64
mpg             float64
origin            int64
weight          float64
dtype: object

In [88]:
data.shape

(398, 9)

In [89]:
data

| | acceleration | car_name | cylinders | displacement | horsepower | model_year | mpg | origin | weight |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | "chevrolet_chevelle_malibu" | 8 | 307.0 | 130.0 | 70 | 18.0 | 1 | 3504.0 |
| 1 | 11.5 | "buick_skylark_320" | 8 | 350.0 | 165.0 | 70 | 15.0 | 1 | 3693.0 |
| 2 | 11.0 | "plymouth_satellite" | 8 | 318.0 | 150.0 | 70 | 18.0 | 1 | 3436.0 |
| 3 | 12.0 | "amc_rebel_sst" | 8 | 304.0 | 150.0 | 70 | 16.0 | 1 | 3433.0 |
| 4 | 10.5 | "ford_torino" | 8 | 302.0 | 140.0 | 70 | 17.0 | 1 | 3449.0 |
| 5 | 10.0 | "ford_galaxie_500" | 8 | 429.0 | 198.0 | 70 | 15.0 | 1 | 4341.0 |
| 6 | 9.0 | "chevrolet_impala" | 8 | 454.0 | 220.0 | 70 | 14.0 | 1 | 4354.0 |
| 7 | 8.5 | "plymouth_fury_iii" | 8 | 440.0 | 215.0 | 70 | 14.0 | 1 | 4312.0 |
| 8 | 10.0 | "pontiac_catalina" | 8 | 455.0 | 225.0 | 70 | 14.0 | 1 | 4425.0 |
| 9 | 8.5 | "amc_ambassador_dpl" | 8 | 390.0 | 190.0 | 70 | 15.0 | 1 | 3850.0 |


In [92]:
# fig, ax = plt.subplots(figsize=(10,10))
model_year_count = data.groupby(['cylinders','model_year']).count()
# model_year_count.plot(x='cylinders', y='mpg', ax=ax)
model_year_count

| cylinders | model_year | acceleration | car_name | displacement | horsepower | mpg | origin | weight |
|---|---|---|---|---|---|---|---|---|
| 3 | 72 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 73 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 77 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 80 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 70 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| 4 | 71 | 13 | 13 | 13 | 12 | 13 | 13 | 13 |
| 4 | 72 | 14 | 14 | 14 | 14 | 14 | 14 | 14 |
| 4 | 73 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
| 4 | 74 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
| 4 | 75 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
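
Note that `count()` reports non-null values per column, which is why `horsepower` shows 12 where the other columns show 13 for 4-cylinder cars from model year 71. If the goal is simply the number of cars per group, `size()` gives one number per group. A sketch with a tiny hand-made frame (`toy` is not the real data):

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame: one 4-cylinder car has a missing horsepower value.
toy = pd.DataFrame({
    "cylinders": [4, 4, 8],
    "model_year": [71, 71, 70],
    "horsepower": [np.nan, 90.0, 130.0],
})

grouped = toy.groupby(["cylinders", "model_year"])
non_null = grouped["horsepower"].count()  # skips the NaN
n_cars = grouped.size()                   # counts every row

print(non_null)
print(n_cars.unstack(fill_value=0))  # cylinders as rows, years as columns
```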
