# Project Explainer Notebook
Made by Bjørn Vinther (s152423) and Peter Scheel (s152424)

# Motivation
### What is our dataset?
This dataset contains filming locations in San Francisco:

* 1265 data objects (each contains a unique film location) (244 kB total)

* 12 attributes

* Title, Release Year, Location, Fun Facts, Production Company, Distributor, Director, Writer, Actor 1, Actor 2, Actor 3, "Smile Again, Jenny Lee"

* The 1265 film locations are distributed over 220 different movies

* Movies date back to 1915 and up to early 2016

This seems like a small dataset, but we went with it, and it turned out to need a lot of processing afterwards.

### Okay, but why this dataset?
We wanted something we found interesting in our spare time. It was either something with movies or something with games, and this dataset seemed to fit best with what we had done previously in the course. We liked visualizing location-based data and working with geoplots.

This could be useful for film studios to easily find locations that match a certain genre. It could also be interesting for film enthusiasts to see if there are any patterns between the filming locations and the type of movie (e.g. genre or rating).

After working with the dataset, it proved a bit too challenging to get any proper results from it. Rather than discarding all the preliminary work and changing dataset, we chose to supplement the dataset with information from other sources.

# Basic stats
### Cleaning and preprocessing of the data
We spent a lot of time on processing the data. We tried to find other datasets to supplement our dataset, but we didn't manage to find any. We then got creative and tried to use major web databases such as Google, IMDb, and Facebook to collect any other useful information.

We started off by combining the original dataset with information from IMDb. The original dataset had a lot of errors and missing values, so we instead used it to find each movie on IMDb and then used the data from IMDb. This was done automatically using a Python plugin. Any movies it couldn't find automatically we tried to find manually. If we still weren't able to find a movie, we discarded it. We also discarded all TV series (8 in total).

The locations were given as names in the original dataset, so we used the Google geocoder to convert these names into latitude and longitude coordinates. This was a very demanding task since Google has limits on its API calls, so it had to be processed over a couple of days. Additionally, not all coordinates were correct, so we still had to detect and remove outliers.
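
As an illustration of the outlier step, here is a minimal sketch that drops coordinates falling outside a rough bounding box around San Francisco. The bounds in `SF_BOUNDS`, the function name `filter_outliers` and the sample points are assumptions made for this example; they are not the exact values or method used in the project.

```python
# Rough bounding box around San Francisco (assumed values, for illustration only).
SF_BOUNDS = {
    "lat_min": 37.70,
    "lat_max": 37.84,
    "lon_min": -122.52,
    "lon_max": -122.35,
}


def filter_outliers(points, bounds=SF_BOUNDS):
    """Keep only the (lat, lon) pairs that fall inside the bounding box."""
    return [
        (lat, lon)
        for lat, lon in points
        if bounds["lat_min"] <= lat <= bounds["lat_max"]
        and bounds["lon_min"] <= lon <= bounds["lon_max"]
    ]


# Hypothetical geocoder output: two plausible points and one wrong match.
geocoded = [
    (37.8199, -122.4783),  # Golden Gate Bridge area
    (37.7749, -122.4194),  # downtown San Francisco
    (34.0522, -118.2437),  # Los Angeles, i.e. a mis-geocoded location name
]

cleaned = filter_outliers(geocoded)
print(cleaned)
```

A bounding box is the simplest possible check; a distance threshold from the city center would work similarly.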

Finally, we used Facebook to get information about places near some of the positions in the dataset. The data took very long to receive, and like Google, Facebook has a limit on its API, so the data sadly didn't make it into any of our visualizations. The hope was that this information could help us tell more about the area around a geo-coordinate.

### Attributes
Of the 12 attributes, we didn't really use much of the data for our models:
* Title 
 * Title of the movie, not so relevant for our problem
* Release Year
 * Year of release, used to categorize our locations into decades
* Location 
 * The most important attribute in our data
* Fun Facts
 * Random facts, not so relevant for our problem
* Production Company
 * Production company of the movie, not really used, but could have been interesting
* Distributor
 * Distributor of the movie, not really used, but could have been interesting
* Director
 * Director of the movie, not so relevant for our problem
* Writer
 * Writer of the movie, not so relevant for our problem
* Actor 1, Actor 2, Actor 3
 * Top 3 actors in the movie, not so relevant for our problem
* "Smile Again, Jenny Lee"
 * One movie has an entry in this attribute, funnily enough, the movie called "Smile Again, Jenny Lee"
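
The decade categorization of the release year mentioned above can be sketched in a couple of lines. The helper name `to_decade` is made up for this example.

```python
def to_decade(release_year: int) -> int:
    """Map a release year to the first year of its decade, e.g. 1915 -> 1910."""
    return (release_year // 10) * 10


years = [1915, 1971, 1999, 2016]
decades = [to_decade(year) for year in years]
print(decades)  # [1910, 1970, 1990, 2010]
```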

### Initial visualizations and basic plots
<img src="img/genrescore.png" alt="Drawing" style="width: 400px;"/>
<img src="img/DistributionOfRatings.png" alt="Drawing" style="width: 400px;"/>
<img src="img/LocationsPrMovie.png" alt="Drawing" style="width: 400px;"/>
<img src="img/MoviesPrDirector.png" alt="Drawing" style="width: 400px;"/>
<img src="img/MoviesPrGenre.png" alt="Drawing" style="width: 400px;"/>
<img src="img/MoviesPrStudio.png" alt="Drawing" style="width: 400px;"/>
<img src="img/Heatmap-locations.png" alt="Drawing" style="width: 400px;"/>
<img src="img/Heatmap-facebook.png" alt="Drawing" style="width: 400px;"/>

# Theory
### Our models

#### Decision Forest
We chose to use a decision forest to classify the genre based on the year and geo-coordinate of a filming location. We chose a decision forest because we believed it was unlikely that there was a linear relation between the geo-coordinates and the genre, so a decision forest should be able to make a better model than a linear one.

Our first challenge was that each movie could have multiple genres. To solve this, we chose to train a decision forest for each genre. The data for each forest then only contained two classes, either having the given genre or not, and we used the decision forest to learn the probability of a data point having the given genre.

Another challenge we faced was that the data was unbalanced, since some genres were more popular than others (e.g. the genre "war" only had one movie). We solved this by combining all the unpopular genres into a genre called "Other". We also resampled the training data so there was an equal number of entries for each of the two classes.
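
The two balancing steps can be sketched roughly as follows. This is only an illustration: the threshold `min_movies`, the helper names and the toy data are assumptions, and the sketch oversamples the smaller class, which is one of several ways to resample to equal class sizes and not necessarily the one used in the project.

```python
import random
from collections import Counter


def merge_rare_genres(movie_genres, min_movies):
    """Replace genres that occur in fewer than `min_movies` movies with "Other"."""
    counts = Counter(g for genres in movie_genres for g in set(genres))
    return [
        sorted({g if counts[g] >= min_movies else "Other" for g in genres})
        for genres in movie_genres
    ]


def balance_binary(rows, labels, seed=0):
    """Oversample the smaller class so both classes have equally many rows."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    if not by_class[0] or not by_class[1]:
        raise ValueError("both classes must be present")
    target = max(len(by_class[0]), len(by_class[1]))
    balanced = []
    for label, members in by_class.items():
        # Draw extra rows (with replacement) until the class reaches the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend((row, label) for row in members + extra)
    rng.shuffle(balanced)
    return balanced


# Toy data: one genre list and one (lat, lon, year) row per location.
movie_genres = [["Drama", "Crime"], ["Drama"], ["War"], ["Comedy", "Drama"], ["Comedy"]]
rows = [
    (37.79, -122.41, 1971),
    (37.80, -122.42, 1988),
    (37.76, -122.45, 1958),
    (37.78, -122.40, 2003),
    (37.77, -122.43, 1996),
]

merged = merge_rare_genres(movie_genres, min_movies=2)
# Binary target for a single genre, matching the one-forest-per-genre setup.
labels = [1 if "Drama" in genres else 0 for genres in merged]
balanced = balance_binary(rows, labels)
print(merged)
print(Counter(label for _, label in balanced))
```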

We trained the model on 80 % of the data and used 20 % for testing. We used "entropy" to determine the optimal splits in the decision trees and calculated the performance as the fraction of correct predictions.
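
A minimal sketch of this setup, assuming scikit-learn's `RandomForestClassifier` as the decision forest and using synthetic stand-in data (the real features were latitude, longitude and year; the labelling rule below is invented purely so the example has something to learn):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: columns are latitude, longitude, release year.
n = 400
X = np.column_stack([
    rng.uniform(37.70, 37.84, n),
    rng.uniform(-122.52, -122.35, n),
    rng.integers(1915, 2017, n),
])
# Invented rule: a location "has the genre" if it lies in the northern part.
y = (X[:, 0] > 37.77).astype(int)

# 80 % for training, 20 % for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One binary forest per genre; this is the forest for a single genre.
forest = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
forest.fit(X_train, y_train)

# Performance as the fraction of correct predictions.
accuracy = accuracy_score(y_test, forest.predict(X_test))
# Probability of the "has genre" class for each test point.
genre_probability = forest.predict_proba(X_test)[:, 1]
print(f"accuracy: {accuracy:.2f}")
```

The number of trees (`n_estimators=100`) is an arbitrary choice for the sketch, not a value taken from the project.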


#### KNN
The idea was to see if there were some tendencies only based on the location. It would be cool to see specific parts of the city being used for specific genres. Our initial model came up with this map of San Francisco for the genres:
<img src="img/knn.png" alt="Drawing" style="width: 400px;"/>
We did the same thing for the specific ratings for each location. Our initial model came up with this map of San Francisco for the ratings:
<img src="img/knn-2.png" alt="Drawing" style="width: 400px;"/>

We used K-fold to determine the optimal amount of neighbors for our KNN:
<img src="img/knn-cross.png" alt="Drawing" style="width: 400px;"/>
This clearly didn't work very well, but we followed the result and went with K=4 for both models.

We trained the models on 80 % of the data and used 20 % for testing. We calculated the performance by determining how accurately the models could predict the genre and rating.
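
The K-fold search for the number of neighbors could look roughly like the sketch below. It assumes scikit-learn, 5 folds, candidate values 1 to 10 and synthetic coordinates with an invented label; none of these settings are taken from the project.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in data: latitude and longitude only, as in the KNN models.
n = 300
X = np.column_stack([
    rng.uniform(37.70, 37.84, n),
    rng.uniform(-122.52, -122.35, n),
])
# Invented label: the eastern part of the city gets class 1.
y = (X[:, 1] > -122.43).astype(int)

folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated accuracy for each candidate number of neighbors.
mean_scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X, y, cv=folds).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

Note that latitude and longitude are on similar scales here, so no feature scaling is applied; with mixed units it would usually be needed for KNN.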

#### KMeans
For this model, we wanted to see whether the locations form clusters around the city. Our initial model came up with the following clusters, using 10 clusters as a starting point:
<img src="img/kmeans.png" alt="Drawing" style="width: 400px;"/>
This model finds clusters in the data, which makes it possible to see whether the data has any sort of grouping.

For this model we didn't use any cross-validation, as we knew that we wanted to let the end user change the number of clusters interactively.

This model did cluster the data, and for specific areas it seems to work decently. The dataset clearly has most of its locations in the central part of the city, and therefore most of the clusters end up in that part of the map.
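
A minimal sketch of the clustering, assuming scikit-learn's `KMeans` and synthetic coordinates with a dense "downtown" blob to mimic the concentration described above. Precomputing the labels for every cluster count is one simple way to back an interactive slider; it is an assumption for this example, not a description of how the project's visualization was built.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Synthetic stand-in coordinates: a dense central blob plus scattered points.
downtown = rng.normal(loc=[37.79, -122.41], scale=0.005, size=(150, 2))
elsewhere = np.column_stack([
    rng.uniform(37.70, 37.84, 50),
    rng.uniform(-122.52, -122.35, 50),
])
coords = np.vstack([downtown, elsewhere])

# Starting point: 10 clusters.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(coords)
centers = kmeans.cluster_centers_

# Precompute cluster labels for every selectable number of clusters.
labels_by_k = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    for k in range(2, 11)
}
print(centers.shape, sorted(labels_by_k))
```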

# Visualizations
### Explain the visualizations you've chosen.
For all of the visualizations of our models, we chose to go with geoplots, since each of the models uses locations.

#### Decision Forest
For our Decision Forest, we chose a geoplot and predicted the probability that a certain genre would be at a given position in a specific decade. This was done for each point in a grid laid over the map.

#### KNN
Our KNN also uses a geoplot and predicts that a certain genre would be at a given position, but unlike the Decision Forest it does not use the decade. This was done for each point in the grid.
The same was done for a KNN model trained to predict the IMDb rating instead of the genre.
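
Predicting for each point in the grid can be sketched as below. The grid resolution, the bounding box and the synthetic training data are assumptions for illustration; the model here is a KNN with K=4, and a fitted forest could be evaluated the same way by adding a decade column to `grid_points`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

# Synthetic stand-in training data: latitude, longitude and an invented label.
n = 200
X = np.column_stack([
    rng.uniform(37.70, 37.84, n),
    rng.uniform(-122.52, -122.35, n),
])
y = (X[:, 0] > 37.77).astype(int)
model = KNeighborsClassifier(n_neighbors=4).fit(X, y)

# Regular grid over the (assumed) bounding box of the city.
lats = np.linspace(37.70, 37.84, 30)
lons = np.linspace(-122.52, -122.35, 40)
lon_grid, lat_grid = np.meshgrid(lons, lats)
grid_points = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])

# One probability per grid cell, reshaped back to the grid for plotting.
probability = model.predict_proba(grid_points)[:, 1].reshape(lat_grid.shape)
print(probability.shape)
```

The resulting 2D array can then be drawn as a colored overlay on the map.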

#### KMeans
The KMeans clustering uses the same visualization as presented in the second assignment. We included the possibility to change the number of clusters from 2 to 10, and for fun, we also clustered the individual genres.

### Why are they right for the story you want to tell?
Our visualizations support the main goal, even though they don't determine anything concrete. We use the forest to predict the probability of a certain genre.

# Discussion
### The good stuff
We used a lot of alternative ways of getting data, which potentially could have led to new discoveries. We did see some tendencies in our models. Although the many locations in the downtown district make it hard to differentiate the genres there, each of the models shows that certain genres seem to be located in certain parts of the city.

### The bad stuff
Our dataset was lacking useful data. Most of our data did not seem to be correlated in any way, and even though we hoped to find cool new patterns, we ended up finding nothing. We guess that is one of the risks when exploring data. Had we had more data, we would probably have had a greater chance of finding some patterns. We would have liked to look more into the surroundings of the clusters we found. The processing of our small dataset clearly took up way too much of our time, and we didn't get to make enough super cool plots or train amazing models.