Assignment Four - CRISP-DM Capstone: Association Rule Mining, Clustering, or Collaborative Filtering
===
## Hector Curi - Spencer Kaiser

---
Instructions
---
In the final assignment for this course, you will be using one of three different analysis methods:

* Option A: Use transaction data for mining association rules
* Option B: Use clustering on an unlabeled dataset to provide insight or features
* Option C: Use collaborative filtering to build a custom recommendation system

Your choice of dataset will largely determine the task that you are trying to achieve, though the dataset does not need to change from your previous tasks. For example, you might choose to use clustering on your data as a preprocessing step that extracts different features. Then you can use those features to build a classifier and analyze its performance in terms of accuracy (precision, recall) and speed. Alternatively, you might choose a completely different dataset and perform rule mining or build a recommendation system.

---
Dataset Selection and Toolkits
---
As before, you need to choose a dataset that is not small. It might be massive in terms of the number of attributes (or transactions), classes (or items, users, etc.) or whatever is appropriate for the task you are performing. Note that scikit-learn can be used for clustering analysis, but not for Association Rule Mining (you should use R) or collaborative filtering (you should use graphlab-create from Dato). Both can be run using iPython notebooks as performed in class.

Write a report covering in detail all the steps of the project. The results need to be reproducible using only this report. Describe all assumptions you make and include all code you use in the iPython notebook or as supplemental functions. Follow the CRISP-DM framework in your analysis (you are performing all of the CRISP-DM outline). This report is worth 20% of the final grade.

---
### Choice of Task
We are going to use the MovieLens movie ratings dataset to complete **Option C**. We plan on utilizing over 100,000 user ratings of 8,570 movies to create a user-item collaborative filtering recommendation system.
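
To make the idea of a user-item matrix concrete, here is a minimal pure-Python sketch. It is not the GraphLab model used in this report: the toy ratings, the helper names (`build_user_item`, `item_vectors`, `cosine`) and the choice of cosine similarity are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy ratings as (userId, movieId, rating) triples; not MovieLens data.
ratings = [
    (1, 10, 4.0), (1, 20, 5.0),
    (2, 10, 4.0), (2, 20, 5.0), (2, 30, 1.0),
    (3, 30, 5.0),
]

def build_user_item(triples):
    """Return a sparse user-item matrix as {userId: {movieId: rating}}."""
    matrix = defaultdict(dict)
    for user, movie, rating in triples:
        matrix[user][movie] = rating
    return dict(matrix)

def item_vectors(user_item):
    """Transpose the user-item matrix into {movieId: {userId: rating}}."""
    items = defaultdict(dict)
    for user, row in user_item.items():
        for movie, rating in row.items():
            items[movie][user] = rating
    return dict(items)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

user_item = build_user_item(ratings)
items = item_vectors(user_item)

# Movies 10 and 20 were rated alike by the same users; movie 30 was not.
sim_10_20 = cosine(items[10], items[20])
sim_10_30 = cosine(items[10], items[30])
```

In this toy data, movies 10 and 20 come out as more similar to each other than movies 10 and 30, which is the kind of signal an item-similarity recommender relies on.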

---
### Initial Code
Our first task was to create a usable dataset from the individual files provided by MovieLens. We accomplished this by using the pandas `merge` function and writing the result to a new file called `data.csv`. Each instance in this file now contains the id of both the user and the movie, the movie title, its rating, genres, and more.

In [33]:
#import all packages used in this assignment
import graphlab as gl
import pandas as pd
import sqlite3
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from pandas.tools.plotting import scatter_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from numpy import random as rd
from sklearn import metrics as mt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import StratifiedKFold
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline

# read in the separate csv files (ratings and movies) and combine them into one for easier use
df1 = pd.read_csv('data/ratings.csv')
df2 = pd.read_csv('data/movies.csv')
merged = df1.merge(df2, on="movieId", how="outer").fillna("")
merged.to_csv("data/data.csv", index=False)

# read in final merged file
data = gl.SFrame.read_csv("data/data.csv")

# embed graphs to ipython
gl.canvas.set_target('ipynb')

PROGRESS: Finished parsing file /Users/hcuri/Dropbox/SMU/Classes/ CSE 5331 - Data Mining/Assignments/DataMining/04_AssignmentFour/data/data.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.143437 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[float,float,float,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/hcuri/Dropbox/SMU/Classes/ CSE 5331 - Data Mining/Assignments/DataMining/04_AssignmentFour/data/data.csv
PROGRESS: Parsing completed. Parsed 100041 lines in 0.131392 secs.


Business Understanding
---

---
### 1. Business Purpose & Implications of Analysis (10 points)
**Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm and why does this method make sense for this specific dataset and the stakeholders' needs?**

We chose the MovieLens dataset from the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities, available at http://grouplens.org/datasets/movielens/.

This dataset was created primarily for research. Other similar datasets exist; however, they were made available mainly for competitions aimed at improving a business's recommendation model.

In the real world, one of the most beneficial uses of the resulting system would be for movie streaming services like Netflix. Netflix is constantly adding new movies that it believes will be in high demand while removing under-performing titles. It does this to maximize the revenue generated from each title after paying licensing fees, to keep its offering current and relevant, and to ensure that subscribers are happy with what they are paying for.

Our recommendation system could be used to help Netflix fine-tune their offering to best meet the needs of their current subscribers. It could be used to help identify new movies that may be popular and it could also be used to help determine which movies are the least popular.

We will use cross validation to measure the effectiveness of our algorithm. Using precision and recall, we can estimate how many of our recommendations match the held-out testing set (precision) and how many of the movies in the testing set our recommender manages to surface (recall).
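
As a rough illustration of these two measures, the sketch below computes precision and recall for the top-k recommendations of a single user. The function name `precision_recall_at_k` and the movie ids are hypothetical; this is not the evaluation code of the report, only the arithmetic behind the metrics.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations against held-out items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(len(top_k)) if top_k else 0.0
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Hypothetical example: movie ids recommended to one user, in ranked order,
# and the movies that user actually rated in the held-out testing set.
recommended = [6, 50, 296, 318, 593]
relevant = {6, 318, 1196}

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
```

Here two of the five recommendations appear in the testing set, so precision is 2/5, and two of the three held-out movies were recommended, so recall is 2/3.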

Data Understanding
---

---
### 1. Data Meaning (10 points)
**Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?**

Attribute | Type | Range / Values | Description 
--- | --- | --- | ---
userId | Nominal | 1 - 706 | Randomly assigned user IDs
movieId | Nominal | 1 - 129651 | Randomly assigned movie IDs
rating | Ordinal | 0.5 - 5.0 | Individual rating given to each movie by a user
timestamp | Interval | 828504918 - 1427754939 | Number of seconds between midnight Coordinated Universal Time (UTC) on January 1, 1970 and the moment the rating was made
title | Nominal | N/A | Title of the movie 
genres | Nominal | Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, N/A '(no genres listed)' | Genre or genres assigned to each movie 
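
The timestamp values in the table are easier to interpret as calendar dates. The following sketch converts the two endpoints of the range to UTC using only the standard library; it assumes Python 3, and the helper name `to_utc` is ours.

```python
from datetime import datetime, timezone

def to_utc(timestamp):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)

# Endpoints of the timestamp range listed in the attribute table.
earliest = to_utc(828504918)
latest = to_utc(1427754939)

print(earliest.isoformat())
print(latest.isoformat())
```

The earliest rating falls in April 1996 and the latest in 2015.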


Data collection for this dataset began in April of 1996. Since then, the dataset has gone through multiple iterations, and the result is a high-quality, well-polished dataset. The dataset does not contain missing values in the traditional sense; however, according to its creators, “Only movies with at least one rating or tag are included in the dataset”, so a movie can be present with tags but no ratings. After our outer merge, such movies appear as instances with an empty rating field. Because the quantity of these instances is low and because we are primarily focusing on ratings, we will remove them from the dataset.
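
One possible way to drop these instances is sketched below with pandas. The two small frames are hypothetical stand-ins for `ratings.csv` and `movies.csv`. The sketch assumes the filter is applied right after the merge, before any `fillna("")` call, since filling with empty strings would hide the missing ratings from `dropna`.

```python
import pandas as pd

# Hypothetical miniature versions of ratings.csv and movies.csv.
ratings = pd.DataFrame({
    "userId": [1, 8],
    "movieId": [6, 6],
    "rating": [2.0, 5.0],
})
movies = pd.DataFrame({
    "movieId": [6, 7],
    "title": ["Heat (1995)", "Sabrina (1995)"],
})

# An outer merge keeps movie 7 even though nobody rated it,
# so its userId and rating come back as NaN.
merged = ratings.merge(movies, on="movieId", how="outer")

# Keep only the instances that actually carry a rating.
rated_only = merged.dropna(subset=["rating"])
```

An inner merge (`how="inner"`) would have the same effect in this toy case, because it never creates rows for movies without ratings.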

In [32]:
# Use head() to get a general idea of what the data in our dataset looks like
data.head()

userId,movieId,rating,timestamp,title,genres
1.0,6.0,2.0,980730861.0,Heat (1995),Action|Crime|Thriller
8.0,6.0,5.0,964736358.0,Heat (1995),Action|Crime|Thriller
9.0,6.0,3.0,844674216.0,Heat (1995),Action|Crime|Thriller
16.0,6.0,2.0,855198917.0,Heat (1995),Action|Crime|Thriller
19.0,6.0,4.0,1154985045.0,Heat (1995),Action|Crime|Thriller
21.0,6.0,4.0,865106909.0,Heat (1995),Action|Crime|Thriller
31.0,6.0,5.0,1186249974.0,Heat (1995),Action|Crime|Thriller
60.0,6.0,5.0,1264525122.0,Heat (1995),Action|Crime|Thriller
87.0,6.0,5.0,848503814.0,Heat (1995),Action|Crime|Thriller
96.0,6.0,5.0,882593416.0,Heat (1995),Action|Crime|Thriller


Furthermore, there are no apparent outliers in the set. Users rate movies on a scale from 0.5 to 5.0 in increments of 0.5. After running a `min` and a `max` on the ratings field, we can see that it contains no out-of-range values (the outliers in this case). Lastly, the dataset contains no duplicate instances.
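
The checks described above can be sketched in plain Python as follows. The helper names are hypothetical, and treating a repeated (userId, movieId) pair as a duplicate is our assumption about what counts as a duplicate instance. The first three rows come from the `data.head()` output; the last two are made-up bad rows added so that the checks have something to flag.

```python
def invalid_ratings(rows):
    """Return rows whose rating is outside 0.5-5.0 or not a multiple of 0.5."""
    bad = []
    for row in rows:
        doubled = row["rating"] * 2
        if not (1 <= doubled <= 10 and doubled == int(doubled)):
            bad.append(row)
    return bad

def duplicate_keys(rows):
    """Return the (userId, movieId) pairs that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        key = (row["userId"], row["movieId"])
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes

rows = [
    {"userId": 1, "movieId": 6, "rating": 2.0},
    {"userId": 8, "movieId": 6, "rating": 5.0},
    {"userId": 9, "movieId": 6, "rating": 3.0},
    {"userId": 9, "movieId": 6, "rating": 3.0},   # hypothetical duplicate
    {"userId": 16, "movieId": 6, "rating": 5.5},  # hypothetical out-of-range rating
]

bad = invalid_ratings(rows)
dupes = duplicate_keys(rows)
```

On a clean dataset both `bad` and `dupes` would be empty.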

---
### 2. WRITE TITLE (10 points)
**Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.**

Modeling and Evaluation (50 points)
---

Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results.

---
### 1. WRITE TITLE
**Create user-item matrices or item-item matrices using collaborative filtering**

---
### 2. WRITE TITLE
**Determine performance of the recommendations using different performance measures.**

---
### 3. WRITE TITLE
**Use tables/visualization to discuss the found results.**

---
### 4. WRITE TITLE
**Describe your results. What findings are the most compelling and why?**

Deployment (10 points)
---
Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?

---
### 1. WRITE TITLE
**How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?**

---
### 2. WRITE TITLE
**How would you deploy your model for interested parties?**

---
### 3. WRITE TITLE
**What other data should be collected?**

---
### 4. WRITE TITLE
**How often would the model need to be updated, etc.?**

Exceptional Work (10 points)
---

---
**You have free rein to provide additional analyses or combine analyses**

In [None]:
                                 .''.
       .''.             *''*    :_\/_:     .
      :_\/_:   .    .:.*_\/_*   : /\ :  .'.:.'.
  .''.: /\ : _\(/_  ':'* /\ *  : '..'.  -=:o:=-
 :_\/_:'.:::. /)\*''*  .|.* '.\'/.'_\(/_'.':'.'
 : /\ : :::::  '*_\/_* | |  -= o =- /)\    '  *
  '..'  ':::'   * /\ * |'|  .'/.\'.  '._____
      *        __*..* |  |     :      |.   |' .---"|
       _*   .-'   '-. |  |     .--'|  ||   | _|    |
    .-'|  _.|  |    ||   '-__  |   |  |    ||      |
    |' | |.    |    ||       | |   |  |    ||      |
 ___|  '-'     '    ""       '-'   '-.'    '`      |____