## 1. Intro

### 1.1 Data Set
In this project, we work with IMDb movie reviews collected from the IMDb website, http://www.imdb.com/.

The Internet Movie Database (IMDb) is an online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews. It was launched in 1990 by computer programmer Col Needham. As of September 2016, IMDb has approximately 3.9 million titles (including episodes) and 7.4 million personalities in its database, as well as 67 million registered users. [*From Wikipedia*]


### 1.2 Data Preprocessing
To collect information from the IMDb website, we use the [imdbpie](https://github.com/richardasaurus/imdb-pie) module and save the results into two sqlite3 tables. Because of time constraints, we crawl only a random sample of about 1% of all the movies in IMDb. The data collection code can be found at [Data Collection Code](https://github.com/JinyiLu/15688-Team/blob/master/doc/DataCollection.md). Data can be found at [Data]()
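
The 1% random crawl could be sketched roughly as follows. This is only an illustration: `sample_ids`, the candidate ID range, and the seed are hypothetical, and the actual crawler in the linked repository may select titles differently.

```python
import random

def sample_ids(candidate_ids, fraction=0.01, seed=0):
    """Keep each candidate ID independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the crawl list is reproducible
    return [imdb_id for imdb_id in candidate_ids if rng.random() < fraction]

# Hypothetical candidates: IDs of the form "tt" followed by seven digits.
candidates = ["tt{:07d}".format(n) for n in range(1, 100001)]
to_crawl = sample_ids(candidates)
print(len(to_crawl), "of", len(candidates), "titles selected")
```

Sampling each ID independently gives roughly, not exactly, 1% of the candidates; `random.sample` would be the alternative if an exact count were needed.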

We store the data in two sqlite3 tables: **movie** and **review**.

**Movie** contains all the information about movies. Its schema is:

| Column Name  |  Type | Comment |
|---|---|---|
| imdb_id  | TEXT  | primary key  |
| title  | TEXT  | title of the movie  |
| type  | TEXT  | type of the movie  |
| year | INTEGER | movie release year |
| tagline | TEXT | |
| plots | TEXT | |
| plot_outline | TEXT | |
| rating | INTEGER | |
| genres | TEXT | |
| votes | INTEGER | number of user votes for the movie |
| runtime | INTEGER | in seconds |
| poster_url | TEXT | |
| cover_url | TEXT | |
| release_date | TEXT | |
| certification | TEXT | |
| trailer_image_urls | TEXT | |
| directors_summary | TEXT | |
| creators | TEXT | |
| cast_summary | TEXT | |
| writers_summary | TEXT | |
| credits | TEXT | |
| trailers | TEXT | |


**Review** contains all the information about reviews. Its schema is:

| Column Name  |  Type | Comment |
|---|---|---|
|imdb_id | TEXT | |
|username | TEXT | |
|content | TEXT | |
|postdate| TEXT | |
|rating | INTEGER | user rating 1-10 |
|summary | TEXT | review summary |
|status | TEXT | |
|user_location|TEXT| |
|user_score| INTEGER | |
|user_score_count | INTEGER | |
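
As a rough sketch of how the two tables relate, the snippet below creates trimmed-down versions of **movie** and **review** in an in-memory database and joins them on `imdb_id`. Only a few of the listed columns are included, and the inserted rows mirror the sample rows in this section; the real tables live in the project's database file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, not the project file
conn.executescript("""
CREATE TABLE movie (
    imdb_id TEXT PRIMARY KEY,
    title   TEXT,
    year    INTEGER,
    rating  INTEGER,
    votes   INTEGER
);
CREATE TABLE review (
    imdb_id  TEXT,
    username TEXT,
    rating   INTEGER,
    summary  TEXT
);
""")

conn.execute("INSERT INTO movie VALUES (?, ?, ?, ?, ?)",
             ("tt2635824", "We Ride: The Story of Snowboarding", 2013, 7.8, 81))
conn.executemany("INSERT INTO review VALUES (?, ?, ?, ?)", [
    ("tt2635824", "(borkoboardo)", 4, "Fans will love it!"),
    ("tt2635824", "surfs_up1976", 3, "Oh boy, what a mess..."),
])

# Each review points at its movie through imdb_id.
rows = conn.execute("""
    SELECT m.title, r.username, r.rating
    FROM review AS r
    JOIN movie AS m ON m.imdb_id = r.imdb_id
    ORDER BY r.username
""").fetchall()
for row in rows:
    print(row)
```

Note that SQLite column types are affinities rather than strict types, so a fractional movie rating such as 7.8 is still stored as a real number in a column declared `INTEGER`.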

The following are sample rows from each table:

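    import sqlite3
    import pandas as pd

    # Open the SQLite database produced by the data collection step
    DB_NAME = '../data/imdb_final.db'
    conn = sqlite3.connect(DB_NAME)

    # Load the movie table and show its first five rows
    df = pd.read_sql_query("SELECT * FROM movie", conn)
    df.head()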

| | imdb_id | title | type | year | tagline | plots | plot_outline | rating | genres | votes | ... | cover_url | release_date | certification | trailer_image_urls | directors_summary | creators | cast_summary | writers_summary | credits | trailers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt1756420 | Bez vini vinovatiye | feature | 2008 | | `[]` | | 5.1 | `[u'Drama']` | 18 | ... | | | | `[]` | `[<Person: u'Gleb Panfilov' (u'nm0659368')>]` | `[]` | `[<Person: u'Inna Churikova' (u'nm0161500')>, <...` | `[]` | `[<Person: u'Gleb Panfilov' (u'nm0659368')>, <P...` | `[]` |
| 1 | tt0241250 | The Blind Date | feature | 2000 | | `[u"Lucy Kennedy, a one time police detective w...` | Lucy Kennedy, a one time police detective whos... | 6.2 | `[u'Thriller']` | 39 | ... | | | 15.0 | `[]` | `[<Person: u'Nigel Douglas' (u'nm0235196')>]` | `[]` | `[<Person: u'Zara Turner' (u'nm0877947')>, <Per...` | `[<Person: u'Simon Booker' (u'nm0095405')>, <Pe...` | `[<Person: u'Nigel Douglas' (u'nm0235196')>, <P...` | `[]` |
| 2 | tt0106806 | Emmanuelle's Love | feature | 1993 | The Legend is back and the adventure begins! | `[u"Emmanuelle withdraws into a temple in Tibet...` | Emmanuelle withdraws into a temple in Tibet, w... | 4.6 | `[u'Drama', u'Romance']` | 152 | ... | `https://images-na.ssl-images-amazon.com/images...` | 1993-04-04 | 18.0 | `[]` | `[<Person: u'Francis Leroi' (u'nm0163095')>]` | `[]` | `[<Person: u'Marcela Walerstein' (u'nm0907380')...` | `[<Person: u'Emmanuelle Arsan' (u'nm0037491')>,...` | `[<Person: u'Francis Leroi' (u'nm0163095')>, <P...` | `[]` |
| 3 | tt2635824 | We Ride: The Story of Snowboarding | documentary | 2013 | The story of snowboarding told by the people w... | `[u"Grain Media and Burn Energy Drink tell the ...` | Grain Media and Burn Energy Drink tell the sto... | 7.8 | `[u'Documentary', u'Adventure', u'History', u'S...` | 81 | ... | `https://images-na.ssl-images-amazon.com/images...` | 2013-01-31 | | `[]` | `[<Person: u'Jon Drever' (u'nm2270358')>, <Pers...` | `[]` | `[<Person: u'Danny Davis' (u'nm2289497')>, <Per...` | `[<Person: u'Jon Drever' (u'nm2270358')>]` | `[<Person: u'Jon Drever' (u'nm2270358')>, <Pers...` | `[]` |
| 4 | tt0426589 | Succubus | feature | 1987 | Sex Slaves To The Devil! | `[u'The Von Romburg castle has been cursed ever...` | The Von Romburg castle has been cursed ever si... | 6.6 | `[u'Horror']` | 7 | ... | | 1987 | | `[]` | `[<Person: u'Patrick Dromgoole' (u'nm0238245')>]` | `[]` | `[<Person: u'Barry Foster' (u'nm0287687')>, <Pe...` | `[<Person: u'Bob Baker' (u'nm0048276')>, <Perso...` | `[<Person: u'Patrick Dromgoole' (u'nm0238245')>...` | `[]` |
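
List-valued columns such as `genres` come back as the string form of a Python list (for example `[u'Drama', u'Romance']`). One possible way to turn such a string back into a list is `ast.literal_eval`, sketched below; `parse_list_column` is a hypothetical helper name. This does not work for columns such as `cast_summary`, whose `<Person: ...>` entries are not Python literals.

```python
import ast

def parse_list_column(value):
    """Parse a stored string such as "[u'Drama', u'Romance']" into a list."""
    if not value:  # None or empty string: treat as no entries
        return []
    return list(ast.literal_eval(value))

print(parse_list_column("[u'Drama', u'Romance']"))
print(parse_list_column("[]"))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text that was scraped from the web.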


    # Load the review table and show its first five rows
    df = pd.read_sql_query("SELECT * FROM review", conn)
    df.head()

| | imdb_id | username | content | postdate | rating | summary | status | user_location | user_score | user_score_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt2635824 | (borkoboardo) | It's a difficult one - the history of snowboar... | 2013-02-26 | 4.0 | Fans will love it! | G | Livigno | 6.0 | 9.0 |
| 1 | tt2635824 | surfs_up1976 | Another attempt of capturing the history of sn... | 2013-02-25 | 3.0 | Oh boy, what a mess... | S | Sweden | 3.0 | 4.0 |
| 2 | tt0079677 | lazarillo | This is one of those films that kind of fall i... | 2008-11-24 | | Not good but interesting--and certainly offbea... | G | Denver, Colorado and Santiago, Chile | 5.0 | 7.0 |
| 3 | tt0079677 | Wizard-8 | Today, the all-but-forgotten movie "(Friday Th... | 2012-07-23 | | Great effort is obvious, but it doesn't work i... | G | Victoria, BC | 2.0 | 3.0 |
| 4 | tt0079677 | HumanoidOfFlesh | After tragic death of his parents-the woman ac... | 2010-12-12 | 8.0 | Young boy's tormented psyche. | G | Chyby, Poland | 4.0 | 7.0 |



## Analysis
* General statistics
* Rating sparsity (compared with MovieLens)
* Rating count / review count
* Rating vs. rating count
* Rating vs. user count
* (Review/movie) rating vs. movie count
* Location vs. user count
* Location vs. movie count
* Top lists
    * Top movies by rating count and review count
    * Top movies by movie rating and review rating (with a vote threshold)
        * Movie title, trailers, runtime, credits, cover
    * Top users by review count
* Post date vs. movie count
* Post date vs. average runtime
* Genre vs. movie count
* Genre vs. votes and movie rating
* Genre vs. average runtime
* Agree and disagree
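
For the top-movie ranking with a vote threshold listed above, one possible query is sketched below. The rows are the five sample movies from section 1.2 loaded into an in-memory table, and the threshold value `MIN_VOTES` is a hypothetical choice, not one taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (imdb_id TEXT PRIMARY KEY, title TEXT, "
    "rating INTEGER, votes INTEGER)"
)
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("tt1756420", "Bez vini vinovatiye", 5.1, 18),
    ("tt0241250", "The Blind Date", 6.2, 39),
    ("tt0106806", "Emmanuelle's Love", 4.6, 152),
    ("tt2635824", "We Ride: The Story of Snowboarding", 7.8, 81),
    ("tt0426589", "Succubus", 6.6, 7),
])

MIN_VOTES = 30  # hypothetical threshold; tune it to the real vote distribution

# Rank by rating, ignoring movies with too few votes to be reliable.
top = conn.execute("""
    SELECT title, rating, votes
    FROM movie
    WHERE votes >= ?
    ORDER BY rating DESC, votes DESC
    LIMIT 10
""", (MIN_VOTES,)).fetchall()
for title, rating, votes in top:
    print(title, rating, votes)
```

The threshold matters because a movie with only a handful of votes, such as the sample with 7 votes and a 6.6 rating, could otherwise outrank movies rated by far more users.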

## Recommendation System
* Sparsity -> traditional methods may not work
* Potential solutions
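
The sparsity concern can be quantified as the fraction of empty cells in the user-by-movie rating matrix, that is, 1 - ratings / (users × movies). The sketch below computes it from a trimmed `review` table; the in-memory rows mirror the five sample reviews from section 1.2, so the resulting number is illustrative only and says nothing about the full data set. It also assumes each user rates a given movie at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (imdb_id TEXT, username TEXT, rating INTEGER)")
conn.executemany("INSERT INTO review VALUES (?, ?, ?)", [
    ("tt2635824", "(borkoboardo)", 4),
    ("tt2635824", "surfs_up1976", 3),
    ("tt0079677", "lazarillo", None),   # review without a rating
    ("tt0079677", "Wizard-8", None),    # review without a rating
    ("tt0079677", "HumanoidOfFlesh", 8),
])

# COUNT(rating) skips NULLs, so only reviews that carry a rating are counted.
n_users, n_movies, n_ratings = conn.execute("""
    SELECT COUNT(DISTINCT username),
           COUNT(DISTINCT imdb_id),
           COUNT(rating)
    FROM review
""").fetchone()

sparsity = 1.0 - float(n_ratings) / (n_users * n_movies)
print("users:", n_users, "movies:", n_movies, "ratings:", n_ratings)
print("sparsity:", round(sparsity, 3))
```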

## Reference
* https://en.wikipedia.org/wiki/IMDb
* https://github.com/richardasaurus/imdb-pie
* https://github.com/JinyiLu/15688-Team