# Exploratory Data Analysis with K-Means Clustering

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

| File              | Field 1      | Field 2       | Field 3            | Field 4        |
|-------------------|--------------|---------------|--------------------|----------------|
| ratings.csv       | userId(int)  | movieId(int)  | rating(0.5-5.0)    | timestamp(UTC) |
| tags.csv          | userId(int)  | movieId(int)  | tag(string)        | timestamp(UTC) |
| movies.csv        | movieId(int) | title(string) | genres(string)     | -              |
| genome-scores.csv | movieId(int) | tagId(int)    | relevance(0.0-1.0) | -              |
| genome-tags.csv   | tagId(int)   | tag(string)   | -                  | -              |

 * **genres(string)** takes its values from the list below:
   * Action
   * Adventure
   * Animation
   * Children's
   * Comedy
   * Crime
   * Documentary
   * Drama
   * Fantasy
   * Film-Noir
   * Horror
   * Musical
   * Mystery
   * Romance
   * Sci-Fi
   * Thriller
   * War
   * Western
   * (no genres listed)
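
Since one of the approaches below relies on the number of genres per movie, here is a minimal sketch of how that count might be derived. It assumes the `genres` column stores multiple genres in one pipe-separated string (for example `Adventure|Animation|Comedy`), which is not stated above; the three rows are made-up stand-ins for **movies.csv**.

```python
import pandas as pd

# Toy stand-in for movies.csv; titles and genres are invented for illustration.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Adventure|Animation|Comedy", "Drama", "(no genres listed)"],
})

# Assumption: genres are separated by "|" within a single string.
genre_lists = movies["genres"].str.split("|")
movies["n_genres"] = genre_lists.str.len()

# The placeholder value is not a real genre, so count it as zero.
movies.loc[movies["genres"] == "(no genres listed)", "n_genres"] = 0

print(movies[["movieId", "n_genres"]])
```

Treating `(no genres listed)` as zero genres is a design choice; leaving it at one would put those movies in the same group as single-genre films.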

## Possible Approaches

 Before describing these models, I want to note that clustering on the number of genres alone would not be feasible, as the resulting clusters would be excessively large. One of the models below does, however, use it as a preprocessing step before further clustering.
 * **Lazy Tag Model**
   * Remove all tags with a relevance below 0.9
   * Make K-Means clusters with k equal to the number of unique tags left
 * **Lazy Genre-Tag Model**
   * Cluster all of the movies on the number of genres
   * Treat each of these clusters as its own dataset and cluster it further by tags
 * **Branch by Reviewer Model**
   * If a reviewer enjoyed two movies, then each movie should be considered a recommendation for the other.
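
The two steps of the Lazy Tag Model can be sketched roughly as follows. This is only an illustration on made-up data: the `scores` frame stands in for **genome-scores.csv**, and the use of scikit-learn's `KMeans` (with `n_init=10` and a fixed `random_state`) is an assumption, since no clustering library has been chosen yet.

```python
import pandas as pd
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

# Toy stand-in for genome-scores.csv; the values are invented for illustration.
scores = pd.DataFrame({
    "movieId":   [1, 2, 3, 4, 4],
    "tagId":     [10, 10, 20, 20, 30],
    "relevance": [0.95, 0.92, 0.97, 0.91, 0.50],
})

# Step 1: remove all tags with a relevance below 0.9.
trimmed = scores[scores["relevance"] >= 0.9]

# Build one row per movie and one column per remaining tag (0.0 where absent).
features = trimmed.pivot_table(
    index="movieId", columns="tagId", values="relevance", fill_value=0.0
)

# Step 2: run K-Means with k equal to the number of unique tags left.
k = trimmed["tagId"].nunique()
model = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = pd.Series(
    model.fit_predict(features.to_numpy()), index=features.index, name="cluster"
)

print(labels)
```

Cluster ids are arbitrary, so only co-membership is meaningful: here movies 1 and 2 share one cluster and movies 3 and 4 share the other. Movies that lose every tag in step 1 never enter `features` and would need separate handling.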


### First Attempts at Processing Data for the Lazy Tag Model

In [8]:
df = pd.read_csv('../data/external/ml-latest/ml-latest/genome-scores.csv')
df.tagId.unique().size

1128

In [9]:
df.movieId.unique().size

13176

In [10]:
df_trimmed = df[df.relevance >= 0.9]
df_trimmed.tagId.unique().size

1054

In [11]:
df_trimmed.movieId.unique().size

11969

From the above calculations, keeping only genome tags with a relevance of at least 0.9 drops 1207 movies (13176 - 11969), leaving them tagless, and 74 tags (1128 - 1054). It is also worth noting that **genome-scores.csv** contains only 13176 unique movie IDs. Below we read **ratings.csv** to check whether this reflects the actual number of unique movies in the dataset.
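
The same bookkeeping can be wrapped in a small helper so that other thresholds are easy to try. The function name `coverage_loss` and the toy frame are hypothetical; this is a sketch, not part of the original notebook.

```python
import pandas as pd

def coverage_loss(scores: pd.DataFrame, threshold: float = 0.9) -> dict:
    """Count the movies and tags that disappear entirely once rows
    with relevance below `threshold` are dropped."""
    kept = scores[scores["relevance"] >= threshold]
    return {
        "movies_lost": scores["movieId"].nunique() - kept["movieId"].nunique(),
        "tags_lost": scores["tagId"].nunique() - kept["tagId"].nunique(),
    }

# Toy stand-in for genome-scores.csv; the values are invented for illustration.
toy_scores = pd.DataFrame({
    "movieId":   [1, 1, 2, 3],
    "tagId":     [10, 20, 10, 30],
    "relevance": [0.95, 0.30, 0.92, 0.50],
})

result = coverage_loss(toy_scores, threshold=0.9)
print(result)
```

On the toy data, movie 3 loses its only tag and tags 20 and 30 vanish. Applied to the real `df` with the default threshold, the helper should reproduce the 1207 movies and 74 tags derived from the cell outputs above.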

In [13]:
df_ratings = pd.read_csv('../data/external/ml-latest/ml-latest/ratings.csv')

df_ratings.movieId.unique().size

53889

From the above information it can be determined that **genome-scores.csv** only contains information on 13176 movies, whereas **ratings.csv** contains information on 53889. At this point I would consider the tags model useful for only a small subset of the movies in the dataset (11969 / 53889, or about 22.2%).
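
As a quick check of these proportions, the counts reported above can be turned into percentages. The variable names are illustrative, and the calculation assumes every movie in **genome-scores.csv** also appears in **ratings.csv**, which has not been verified here.

```python
# Counts taken from the cell outputs and text above.
movies_in_genome = 13176        # unique movieIds in genome-scores.csv
movies_with_strong_tag = 11969  # movies keeping at least one tag with relevance >= 0.9
movies_in_ratings = 53889       # unique movieIds in ratings.csv

# Assumption: the genome movies are a subset of the rated movies.
genome_share = movies_in_genome / movies_in_ratings
strong_tag_share = movies_with_strong_tag / movies_in_ratings

print(f"Movies with any genome score: {genome_share:.2%}")
print(f"Movies with a tag >= 0.9:     {strong_tag_share:.2%}")
```

This gives roughly 24.45% and 22.21% respectively, so even before applying the threshold the genome scores cover less than a quarter of the rated movies.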

## Final Thoughts on the Lazy Tag Model
The Lazy Tag Model was my favorite initial idea because it required the least up-front attention to detail: the connections between movies would come from the reviewers who had tagged them. Unfortunately, the dataset shows that only about 22.2% of the movies have a tag with a relevance of at least 0.9, so the model would more often than not either produce unhelpful recommendations or recommend the cluster containing all tagless movies.