# First Attempt at Implementing an FC Neural Network in PyTorch

In [1]:
import torch

## Problem Description

The aim of the model is to estimate the [IMDB](https://www.imdb.com) score of a movie based on:

- Name
- Release Date
- Production Budget
- Genre
- Worldwide Gross
- Run Time

The model then tries to learn the following relationship:

$$\left.
\begin{array}{l}
\text{Film Name} \\
\text{Release Date} \\
\text{Production Budget} \\
\text{Genre} \\
\text{Worldwide Gross} \\
\text{Run Time}
\end{array}
\right\}\Rightarrow\text{IMDB Rating}$$

## Data Source

The data source comes from [here](https://vega.github.io/editor/data/movies.json), licensed under MIT License.

## Preprocessing

For preprocessing, I used a combination of Lua and Python.

### Filter

The original JSON file has more than 3000 films, but many of them have missing information.

The `preprocess.lua` script **removes every entry that is missing any of the following fields**:

- Film Name (`Title`)
- Release Date (`Release Date`)
- Production Budget (`Production Budget`)
- Genre (`Main Genre`)
- Worldwide Gross (`Worldwide Gross`)
- Run Time (`Running Time min`)
- IMDB Rating (`IMDB Rating`)

After filtering, **1140 films** survived with **12 different genres**:

```json
["Action","Black Comedy","Comedy","Adventure","Drama","Romantic Comedy","Horror","Thriller/Suspense","Musical","Documentary","Western","Concert/Performance"]
```

The list of movies that survived is stored in `movies_processed.json`.

The list of genres is stored in `genres.json`.
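
The filtering rule above can be sketched in Python. This is only an illustration: the actual filter lives in `preprocess.lua`, the helper names `is_complete` and `filter_complete` are hypothetical, and treating a JSON `null` (`None`) as "missing" is an assumption. The toy entries use only three of the fields to keep the example short.

```python
def is_complete(entry, required_fields):
    """Return True if every required field is present and not None (JSON null)."""
    return all(entry.get(field) is not None for field in required_fields)


def filter_complete(entries, required_fields):
    """Keep only the entries that have a value for every required field."""
    return [entry for entry in entries if is_complete(entry, required_fields)]


# Toy data for illustration only (not taken from the real dataset).
sample = [
    {"Title": "A", "Production Budget": 1000000, "Running Time min": 95},
    {"Title": "B", "Production Budget": None, "Running Time min": 120},  # null budget
    {"Title": "C", "Running Time min": 88},  # budget key absent
]

kept = filter_complete(sample, ["Title", "Production Budget", "Running Time min"])
print([entry["Title"] for entry in kept])
```

Only entry `A` passes here, since `B` has a null budget and `C` lacks the key entirely.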

### Encoding

Before feeding the data into the FC neural network, we first need to find a way to encode the data into a list of numbers.

In [3]:
import json, os

# Load the genre list and the filtered movies produced by the preprocessing step.
with open("genres.json") as f:
    Genres = json.load(f)
with open("movies_processed.json") as f:
    Movies = json.load(f)
print("%d movies found with %d genres." % (len(Movies), len(Genres)))

1140 movies found with 12 genres.
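
One possible way to turn a genre into a list of numbers is a one-hot vector with one slot per genre. This is a sketch of a common technique, not necessarily the encoding this notebook ends up using; `one_hot_genre` and the shortened `sample_genres` list are hypothetical names introduced for illustration.

```python
def one_hot_genre(genre, genres):
    """Return a list with 1.0 at the index of `genre` and 0.0 elsewhere."""
    vec = [0.0] * len(genres)
    vec[genres.index(genre)] = 1.0  # raises ValueError for an unknown genre
    return vec


# Shortened genre list for illustration; the real list has 12 entries.
sample_genres = ["Action", "Black Comedy", "Comedy"]
encoded = one_hot_genre("Comedy", sample_genres)
print(encoded)
```

With the full list in `genres.json`, each movie's genre would become a vector of length 12.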


### Input and Output

We'll feed all the information about a movie into an FC network.

#### Encoding

In [30]:
print("Maximum Length of Title:", max(len(str(m["Title"])) for m in Movies))

Maximum Length of Title: 58


The snippet above shows that the longest title has 58 characters. Each title can therefore be encoded as 58 ASCII codes, with 0 used as padding for shorter titles.
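
A minimal sketch of such a title encoding follows. The function name `encode_title` is hypothetical, and replacing non-ASCII characters with `?` is an assumption, since the notebook does not say how they are handled.

```python
MAX_TITLE_LEN = 58  # maximum title length found in the dataset


def encode_title(title, length=MAX_TITLE_LEN):
    """Encode a title as exactly `length` ASCII codes, right-padded with 0."""
    # str() mirrors the notebook, which also converts titles with str().
    # Assumption: non-ASCII characters are replaced with '?'.
    codes = list(str(title).encode("ascii", errors="replace"))
    codes = codes[:length]  # truncate anything longer than `length`
    return codes + [0] * (length - len(codes))


vec = encode_title("Up")
print(len(vec), vec[:4])
```

Every title then maps to a vector of the same length, which is what a fully connected layer with a fixed input size requires.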

In [28]:
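import time
# Dates look like "Jul 25 2008". time.mktime interprets them in the local time zone,
# so the exact values depend on the machine. Dividing by 100 scales the timestamps down.
print("Release Date Timestamp:", [time.mktime(time.strptime(m["Release Date"], "%b %d %Y")) / 100 for m in Movies])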

Release Date Timestamp: [8237952.0, 5036832.0, 12169152.0, 8346816.0, 8389152.0, 3611232.0, 8244000.0, 8371008.0, 8552448.0, 8498016.0, 8334720.0, 8389152.0, 8395200.0, 8352864.0, 8219808.0, 8377056.0, 8262144.0, 8431488.0, 8449632.0, 11062368.0, 8443584.0, 22074912.0, 8244000.0, 8369280.0, 8362368.0, 8407296.0, 3296736.0, 8395200.0, 8485920.0, 8231904.0, 8062560.0, 8383104.0, 8449632.0, 5219964.0, 8504064.0, 9574560.0, 6973920.0, 8479872.0, 8326080.0, 8375328.0, 8479872.0, 8358912.0, 8364960.0, 8340768.0, 8383104.0, 8473824.0, 8467776.0, 8340768.0, 8280288.0, 8485920.0, 8472096.0, 8461728.0, 8484192.0, 8455680.0, 8383104.0, 8358912.0, 10917216.0, 8328672.0, 8401248.0, 8401248.0, 8310528.0, 8377056.0, 8381376.0, 8443584.0, 8316576.0, 11298240.0, 21978144.0, 8268192.0, 8437536.0, 8213760.0, 9748224.0, 9228096.0, 12380832.0, 10826496.0, 10040256.0, 11824416.0, 9840672.0, 12580416.0, 9556416.0, 10566432.0, 11788128.0, 10639008.0, 9997920.0, 10403136.0, 11733696.0, 9387072.0, 11886624.0, 1