# Synopsis
This exercise uses an IMDb dataset of movies released between 1893 and 2005. Your task is to analyze the data and draw conclusions from your results (think historically).

# Dataset
The CSV file can be found at http://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/movies.csv (the same dataset has been uploaded to this repo).
Use the attached CSV file so every group analyzes the same data. The CSV file contains the following columns:

1. `title`: Title of the movie
2. `year`: Year the movie was released
3. `length`: Length of the movie in minutes
4. `budget`: Cost of the movie
5. `rating`: The average rating (as listed on IMDb)
6. `votes`: Number of votes given
7. `r1`-`r10`: Share of votes given to each rating from 1 to 10. The shares should add up to 100, but due to rounding the sum is not exact
8. `mpaa`: MPAA (Motion Picture Association of America) rating, see https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system
9. Genres (`Action`, `Animation`, `Comedy`, `Drama`, `Documentary`, `Romance`, `Short`): Which genres the film belongs to (can be multiple)
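The rounding remark about `r1`-`r10` can be checked with a short sketch. It assumes the ten columns are named `r1` through `r10`, as listed above; `rating_share_sums` is a hypothetical helper name, not part of the exercise.

```python
import pandas as pd


def rating_share_sums(data):
    """Return the row-wise sum of the r1..r10 columns (hypothetical helper)."""
    # Assumption: the columns are named r1, r2, ..., r10
    columns = [f"r{i}" for i in range(1, 11)]
    return data[columns].sum(axis=1)
```

Calling `rating_share_sums(data).describe()` on the loaded DataFrame should show how far the sums drift from 100.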

# Questions
1. Create histograms of rating, length and length of title.
2. Create a graph showing both year and number of movies made.
3. Make a scatter-plot with year and length of movies. (Try to find clusters and explain them. Wild guesses are accepted.)
4. Apply the MeanShift algorithm to identify all the clusters in the scatter-plot from Question 3.
5. Create a median line for the scatter-plot made in Question 3.
6. Make a scatter plot of: Length and year. Explain the connections between the different axes. (Optional: whether the movie is animated or not.)

# Solutions

## Modules

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import webget
from collections import defaultdict
from urllib.parse import urlparse
import os

## Python book plot enabler

In [11]:
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

We will be using [pandas](http://pandas.pydata.org/) to read the data from a CSV file and prepare it as a DataFrame object. Webget is a custom library written by us that downloads a file from a direct link. Next, [os](https://docs.python.org/2/library/os.html) is used to get the file name in a platform-independent way. Next, [urlparse](https://docs.python.org/2/library/urlparse.html) is used together with webget to parse the URL. Next comes [heapq](https://docs.python.org/2/library/heapq.html), a module that gives us access to the heap queue algorithm behind [heapsort](https://en.wikipedia.org/wiki/Heapsort). Collections contains the [defaultdict](https://docs.python.org/2/library/collections.html) object we will be using as our data structure; its `lambda: 0` default factory sets up new keys with a default value. Finally, we have [pyplot](http://matplotlib.org/) for plotting our data.
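As a minimal sketch of the default-factory behaviour described above (the sample years are made up for illustration):

```python
from collections import defaultdict

# A defaultdict calls its factory for every missing key, so `lambda: 0`
# makes each new key start at 0.
counts = defaultdict(lambda: 0)

for year in [1999, 2000, 1999]:  # hypothetical sample years
    counts[year] += 1  # no KeyError the first time a year is seen

print(dict(counts))
```

Passing `int` instead of `lambda: 0` gives the same default, since `int()` returns 0.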

## Pre-Functions
To download the dataset from an external source, implement the function:

In [6]:
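def download(link):
    # Download the file with our custom webget library
    webget.download(link)
    # Return the local file name, taken from the last part of the URL path
    return os.path.basename(urlparse(link).path)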

To convert the CSV file to a manageable DataFrame object, implement the function:

In [None]:
def csv_to_df(src):
    return pd.read_csv(src)

To prepare the data object we will draw our conclusions from, set up the following variables using the functions above:

In [None]:
url = "http://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/movies.csv"
file = download(url)
data = csv_to_df(file)

We will be using two functions to handle the keys and the values of the dictionaries (a Python data type) we'll be working with. Prepare the functions:

In [4]:
def get_dict_values(d):
    return list(d.values())

In [3]:
def get_dict_keys(d):
    return list(d.keys())

We are now ready to feed our "data" object to the functions that handle it.

## Question 1
Create histograms of rating, length and length of title.
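No solution is given for this question, so the following is only a sketch of one possible approach. It assumes the columns `rating`, `length` and `title` from the dataset description; `ex_one` and the bin counts are hypothetical choices.

```python
import matplotlib.pyplot as plt


def ex_one(data):
    """Draw histograms of rating, length and title length (hypothetical helper)."""
    # Title length in characters; cast to str in case a title was parsed as a number
    title_length = data["title"].astype(str).str.len()

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].hist(data["rating"], bins=20)
    axes[0].set_title("Rating")
    axes[1].hist(data["length"], bins=50)
    axes[1].set_title("Length (minutes)")
    axes[2].hist(title_length, bins=50)
    axes[2].set_title("Title length (characters)")
    return fig
```

A few very long movies may stretch the length histogram; passing a `range=` argument to `hist` is one way to zoom in.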

## Question 2
Create a graph showing both year and number of movies made.

### Result
![image](http://i67.tinypic.com/2llle2o.png)

### Observations
1. Overall climb in movie production towards the year 2000
2. A sudden dive after 2002
3. On average 650 movies are made yearly

### Code
Prepare a defaultdict, so every new key starts at a default value of 0.
Iterate over every tuple in the dataframe. For each row, the year is at position [3] of the tuple (position [0] is the index). Because missing keys default to 0, we can increment the entry for that year without checking whether it already exists.

Finally, return the dict.

In [1]:
def ex_two(data):
    # Years that have not been seen yet start at 0
    movie_dict = defaultdict(int)

    for row in data.itertuples():
        year = row[3]  # position 3 of the row tuple holds the year
        movie_dict[year] += 1

    return movie_dict
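The dictionary returned by `ex_two` still has to be drawn. A minimal sketch of that step, assuming a mapping from year to movie count (`plot_movies_per_year` is a hypothetical helper name):

```python
import matplotlib.pyplot as plt


def plot_movies_per_year(movie_dict):
    """Plot year against number of movies (hypothetical helper)."""
    # Sort by year so the line runs left to right;
    # a dict keeps insertion order, not key order.
    years, counts = zip(*sorted(movie_dict.items()))

    fig, ax = plt.subplots()
    ax.plot(years, counts)
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of movies")
    return ax
```

In the notebook this would be called as `plot_movies_per_year(ex_two(data))`.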

## Question 3
Make a scatter-plot with year and length of movies. (Try to find clusters and explain them. Wild guesses are accepted).
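One possible starting point, sketched under the assumption that the DataFrame has the `year` and `length` columns described in the dataset section (`ex_three`, the marker size and the transparency are hypothetical choices):

```python
import matplotlib.pyplot as plt


def ex_three(data):
    """Scatter-plot of release year against movie length (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(10, 6))
    # Small, semi-transparent markers keep dense regions readable
    ax.scatter(data["year"], data["length"], s=2, alpha=0.3)
    ax.set_xlabel("Year")
    ax.set_ylabel("Length (minutes)")
    return ax
```

If a handful of extremely long movies compress the rest of the plot, `ax.set_ylim` can be used to restrict the visible range.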

## Question 4
Apply the MeanShift algorithm to identify all the clusters in the scatter-plot from Question 3
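A sketch of how this could be approached with scikit-learn's `MeanShift`. Using scikit-learn is an assumption here, since it is not among the modules imported above, and `ex_four` and the `quantile` default are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth


def ex_four(points, quantile=0.2):
    """Cluster (year, length) points with MeanShift (hypothetical helper)."""
    X = np.asarray(points, dtype=float)
    # Derive the kernel bandwidth from the data instead of hard-coding it
    bandwidth = estimate_bandwidth(X, quantile=quantile)
    # bin_seeding starts from a coarse grid of seeds, which speeds up fitting
    model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    labels = model.fit_predict(X)
    return labels, model.cluster_centers_
```

With the full dataset this may take a while, so running it on a random sample of `data[["year", "length"]].dropna()` first can be a reasonable way to experiment. The returned labels can be passed as the `c=` argument of a scatter plot to colour each cluster.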

## Question 5
Create a median line for the scatter-plot made in Question 3
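One way to read "median line" is the median movie length for each year. That interpretation is an assumption, and `median_length_per_year` is a hypothetical helper name:

```python
import pandas as pd


def median_length_per_year(data):
    """Return the median movie length for each year (hypothetical helper)."""
    # groupby sorts the years, so the result can be plotted directly as a line
    return data.groupby("year")["length"].median()
```

The result can be drawn on top of the scatter plot with `ax.plot(medians.index, medians.values, color="red")`, where `medians` is the returned Series and `ax` is the scatter plot's axes.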

## Question 6
Make a scatter plot of: Length and year. Explain the connections between the different axes. (Optional: whether the movie is animated or not.)
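For the optional part, a sketch that colours animated movies separately. It assumes the `Animation` column holds 1 for animated movies and 0 otherwise; `ex_six` is a hypothetical helper name.

```python
import matplotlib.pyplot as plt


def ex_six(data):
    """Scatter-plot of year against length, split by animation (hypothetical helper)."""
    # Assumption: the Animation column is 1 for animated movies, 0 otherwise
    animated = data["Animation"] == 1

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(data.loc[~animated, "year"], data.loc[~animated, "length"],
               s=2, alpha=0.3, label="Not animated")
    ax.scatter(data.loc[animated, "year"], data.loc[animated, "length"],
               s=2, alpha=0.6, label="Animated")
    ax.set_xlabel("Year")
    ax.set_ylabel("Length (minutes)")
    ax.legend()
    return ax
```

Drawing the animated movies last keeps them visible on top of the much larger non-animated group.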