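# Topic: Movie Rating Prediction
This project uses a text convolutional neural network together with the [`MovieLens`](https://grouplens.org/datasets/movielens/) dataset to carry out a movie recommendation task.<br>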
![ex_screenshot](reco.png)

## Team members:
* ChoiHyunMin

### Topic:
* Compare the rating-prediction accuracy of traditional machine learning methods (Linear Regression and SVM) with that of a DNN

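### Introduction:
* Recommender systems are especially important in today's information-rich world: online shopping, online bookstores, news headlines, social networks, music sites, movie information, and so on. Wherever there are users, recommendations are needed. The goal is to recommend personalized content to groups of people who share the same preferences and behavior patterns.<br>Key point: use three models and different preprocessing methods, and compare which features, models, and parameters give better predictions.<br>

* Predicting directly which movie a user will like gives very low accuracy, because no one can watch and rate all of the roughly 4,000 movies in the dataset. Instead, I predict the rating each user would give to each movie; movies with a high predicted rating can then be recommended to that user. This study therefore focuses on how accurately each user's rating for each movie can be predicted, that is, how close the prediction is to the true rating. Recommending suitable movies to users is left as future work.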
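
The idea of recommending the movies with the highest predicted rating can be sketched as follows. This is only an illustration of the future-work step, not code from this project: `recommend_top_n`, `toy_predict`, and the toy scores are hypothetical names and values, and `predict_rating` stands in for whichever trained model is used.

```python
from typing import Callable, Iterable, List, Set, Tuple


def recommend_top_n(
    user_id: int,
    candidate_movies: Iterable[int],
    seen_movies: Set[int],
    predict_rating: Callable[[int, int], float],
    n: int = 3,
) -> List[Tuple[int, float]]:
    """Rank the movies a user has not rated by predicted rating; return the n best."""
    scored = [
        (movie_id, predict_rating(user_id, movie_id))
        for movie_id in candidate_movies
        if movie_id not in seen_movies
    ]
    # Highest predicted rating first; ties are broken by the smaller movie ID.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


# Toy stand-in for a trained model (hypothetical values, for illustration only).
def toy_predict(user_id: int, movie_id: int) -> float:
    fake_scores = {1: 4.5, 2: 3.0, 3: 4.8, 4: 2.1, 5: 4.5}
    return fake_scores[movie_id]


top = recommend_top_n(user_id=1, candidate_movies=[1, 2, 3, 4, 5],
                      seen_movies={3}, predict_rating=toy_predict, n=3)
print(top)  # [(1, 4.5), (5, 4.5), (2, 3.0)]
```

Movie 3 has the highest toy score but is skipped because the user has already rated it.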

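### Approach:
 1. Use LR, SVM, and DNN to make personalized recommendations (analysis and training follow the pipeline in the figure below)
 2. Compare the state-of-the-art DNN approach with the traditional methods LR and SVM to gain some insights<br>
![ex_screenshot](moviedat.png)<br>
 3. Use MSE and MAE for the comparison<br><br>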
![ex_screenshot](mse.png)

# Import the required packages

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

import re
from tensorflow.python.ops import math_ops

In [2]:
# Keras utility functions
from keras import metrics
from keras.utils import np_utils

from keras import backend as K

import keras
from keras.backend import set_session
import tensorflow as tf
import os
def create_session(gpu_id='0', pp_mem_frac=None):

    tf.reset_default_graph()
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id # can multiple?
    with tf.device('/gpu:' + gpu_id):
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        if pp_mem_frac is not None:
            config.gpu_options.per_process_gpu_memory_fraction=pp_mem_frac
        session = tf.Session(config = config)
    return session

gpu_id = '0'
sess = create_session(gpu_id)
set_session(sess)

Using TensorFlow backend.


# Download the dataset
Run the code below to download the [`dataset`](http://files.grouplens.org/datasets/movielens/ml-1m.zip).

In [3]:
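# Download the dataset and extract it into the data folder
import os
import shutil  # needed by the cleanup branch in download_extract
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile
import hashlib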

def _unzip(save_path, _, database_name, data_path):
    print('Extracting {}...'.format(database_name))
    with zipfile.ZipFile(save_path) as zf:
        zf.extractall(data_path)

def download_extract(database_name, data_path):
    DATASET_ML1M = 'ml-1m'

    if database_name == DATASET_ML1M:
        url = 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
        hash_code = 'c4d9eecfca2ab87c1945afe126590906'
        extract_path = os.path.join(data_path, 'ml-1m')
        save_path = os.path.join(data_path, 'ml-1m.zip')
        extract_fn = _unzip

    if os.path.exists(extract_path):
        print('Found {} Data'.format(database_name))
        return

    if not os.path.exists(data_path):
        os.makedirs(data_path)

    if not os.path.exists(save_path):
        with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Downloading {}'.format(database_name)) as pbar:
            urlretrieve(
                url,
                save_path,
                pbar.hook)

    assert hashlib.md5(open(save_path, 'rb').read()).hexdigest() == hash_code, \
        '{} file is corrupted.  Remove the file and try again.'.format(save_path)

    os.makedirs(extract_path)
    try:
        extract_fn(save_path, extract_path, database_name, data_path)
    except Exception as err:
        shutil.rmtree(extract_path)  # Remove extraction folder if there is an error
        raise err

    print('Done.')

class DLProgress(tqdm):
    last_block = 0
    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num
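
The integrity check above reads the whole archive into memory before hashing it. As a minimal sketch of an alternative, the digest can be computed chunk by chunk; `md5_of_file` and its `chunk_size` parameter are hypothetical names introduced here, and the self-check uses a small temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Small self-check on a temporary file (not the real ml-1m.zip).
payload = b'movielens' * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

streamed = md5_of_file(tmp_path, chunk_size=64)
os.remove(tmp_path)
print(streamed == hashlib.md5(payload).hexdigest())  # True
```

The result could then be compared with `hash_code` in the same way as the `assert` in `download_extract`.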

## Load the data

In [4]:
data_dir = './'
download_extract('ml-1m', data_dir)

Found ml-1m Data


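## A first look at the data: users, movies, and ratings
This project uses the MovieLens 1M dataset, which contains about 1 million ratings from roughly 6,000 users on nearly 4,000 movies.

The dataset consists of three files: user data users.dat, movie data movies.dat, and rating data ratings.dat.

### User data
The fields are user ID, gender, age, occupation ID, and zip code.

Format in the data file: UserID::Gender::Age::Occupation::Zip-code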

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
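
To make the `::`-separated format concrete, here is a minimal sketch that parses a single line by hand. `UserRecord` and `parse_user_line` are hypothetical names used only for illustration; the notebook itself reads the file with pandas in the next cell.

```python
from typing import NamedTuple


class UserRecord(NamedTuple):
    user_id: int
    gender: str
    age: int            # age-group code (1, 18, 25, ...), not the exact age
    occupation_id: int
    zip_code: str


def parse_user_line(line: str) -> UserRecord:
    """Split one users.dat line on the '::' separator."""
    user_id, gender, age, occupation, zip_code = line.strip().split('::')
    # The zip code stays a string so that leading zeros are not lost.
    return UserRecord(int(user_id), gender, int(age), int(occupation), zip_code)


record = parse_user_line('1::F::1::10::48067')
print(record)
```

Keeping the zip code as text matters only if the field is used; when it is parsed as an integer, a leading zero is dropped.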

In [5]:
users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']
users = pd.read_csv('./ml-1m/users.dat', sep='::', header=None, names=users_title, engine = 'python')
users.head()

Unnamed: 0,UserID,Gender,Age,OccupationID,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


- UserID, Gender, Age, and Occupation are all categorical fields. The zip code field is not used.

### Movie data
The fields are movie ID, title, and genres.

Format in the data file: MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

	* Action
	* Adventure
	* Animation
	* Children's
	* Comedy
	* Crime
	* Documentary
	* Drama
	* Fantasy
	* Film-Noir
	* Horror
	* Musical
	* Mystery
	* Romance
	* Sci-Fi
	* Thriller
	* War
	* Western

In [6]:
movies_title = ['MovieID', 'Title', 'Genres']
movies = pd.read_csv('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine = 'python')
movies.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


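- MovieID is a categorical field, Title is text, and Genres is also a categorical field.

### Rating data
The fields are user ID, movie ID, rating, and timestamp.

Format in the data file: UserID::MovieID::Rating::Timestamp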

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings

In [7]:
ratings_title = ['UserID','MovieID', 'Rating', 'timestamps']
ratings = pd.read_csv('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine = 'python')
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,timestamps
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


- The Rating field is the target we want to learn; the timestamp field is not used.

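# Data Preprocessing - 1

User features
- UserID, Occupation, and MovieID are left unchanged.
- Gender: convert 'F' and 'M' to 0 and 1.
- Age: convert to 7 consecutive integers, 0 to 6.

Movie features: the experiments are run both with and without the movie data, because it is unclear how much the title and genres improve accuracy.
- Genres: a categorical field that has to be converted to numbers. First build a dictionary mapping each genre string to an integer, then convert each movie's Genres field into a list of integers (since some movies combine several genres).
- Title: handled the same way as Genres. First build a dictionary from words to integers, then convert the words in each Title into a list of integers. The release year in the Title also has to be removed.
- The Genres and Title fields must be padded to a uniform length so that they are easy to process in a neural network.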
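
The encode-then-pad steps described above can be sketched on a few toy genre strings. This is a simplified illustration, not the notebook's `load_data`: `build_vocab` and `encode_and_pad` are hypothetical helpers, the vocabulary is sorted and `<PAD>` is fixed to 0 for reproducibility (the notebook enumerates a `set`, so its integer codes can differ between runs), and the target length of 4 is chosen only for the example.

```python
from typing import Dict, List

PAD = '<PAD>'


def build_vocab(tokens_per_item: List[List[str]]) -> Dict[str, int]:
    """Map every distinct token to an integer, in sorted order."""
    vocab = sorted({tok for tokens in tokens_per_item for tok in tokens})
    token2int = {tok: idx for idx, tok in enumerate(vocab, start=1)}
    token2int[PAD] = 0  # reserve 0 for padding (a choice made for this sketch)
    return token2int


def encode_and_pad(tokens: List[str], token2int: Dict[str, int], length: int) -> List[int]:
    """Convert tokens to integers, then truncate or right-pad to a fixed length."""
    ids = [token2int[tok] for tok in tokens][:length]
    return ids + [token2int[PAD]] * (length - len(ids))


genres = ["Animation|Children's|Comedy", 'Comedy|Romance', 'Comedy']
split_genres = [g.split('|') for g in genres]
genre2int = build_vocab(split_genres)
encoded = [encode_and_pad(g, genre2int, length=4) for g in split_genres]
print(encoded)  # [[1, 2, 3, 0], [3, 4, 0, 0], [3, 0, 0, 0]]

# Age-group codes mapped to 0..6 in ascending order.
age_codes = [1, 18, 25, 35, 45, 50, 56]
age2int = {age: idx for idx, age in enumerate(sorted(age_codes))}
print(age2int[25])  # 2
```

Sorting before enumerating keeps the age codes ordinal (a larger code means an older group), which an arbitrary `set` order does not guarantee.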

In [54]:
def load_data():

    #read User data
    users_title = ['UserID', 'Gender', 'Age', 'JobID', 'Zip-code']
    users = pd.read_table('./ml-1m/users.dat', sep='::', header=None, names=users_title, engine = 'python')
    users = users.filter(regex='UserID|Gender|Age|JobID')
    users_orig = users.values
    #Change gender and age in User data
    gender_map = {'F':0, 'M':1}
    users['Gender'] = users['Gender'].map(gender_map)

    age_map = {val:ii for ii,val in enumerate(set(users['Age']))}
    users['Age'] = users['Age'].map(age_map)

    #Read Movie
    movies_title = ['MovieID', 'Title', 'Genres']
    movies = pd.read_table('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine = 'python')
    movies_orig = movies.values
    #remove year in Title
    pattern = re.compile(r'^(.*)\((\d+)\)$')

    title_map = {val:pattern.match(val).group(1) for ii,val in enumerate(set(movies['Title']))}
    movies['Title'] = movies['Title'].map(title_map)

    # Build the genre-to-integer dictionary
    genres_set = set()
    for val in movies['Genres'].str.split('|'):
        genres_set.update(val)

    genres_set.add('<PAD>')
    genres2int = {val:ii for ii, val in enumerate(genres_set)}

    # Convert the genres to equal-length integer lists of length 18
    genres_map = {val:[genres2int[row] for row in val.split('|')] for ii,val in enumerate(set(movies['Genres']))}

    for key in genres_map:
        for cnt in range(max(genres2int.values()) - len(genres_map[key])):
            genres_map[key].insert(len(genres_map[key]) + cnt,genres2int['<PAD>'])
    
    movies['Genres'] = movies['Genres'].map(genres_map)

    # Build the title-word-to-integer dictionary
    title_set = set()
    for val in movies['Title'].str.split():
        title_set.update(val)
    
    title_set.add('<PAD>')
    title2int = {val:ii for ii, val in enumerate(title_set)}

    # Convert the titles to equal-length integer lists of length 15
    title_count = 15
    title_map = {val:[title2int[row] for row in val.split()] for ii,val in enumerate(set(movies['Title']))}
    
    for key in title_map:
        for cnt in range(title_count - len(title_map[key])):
            title_map[key].insert(len(title_map[key]) + cnt,title2int['<PAD>'])
    
    movies['Title'] = movies['Title'].map(title_map)

    #Read Ratings
    ratings_title = ['UserID','MovieID', 'ratings', 'timestamps']
    ratings = pd.read_table('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine = 'python')
    ratings = ratings.filter(regex='UserID|MovieID|ratings')

    #Combine all data in to data variables
    data = pd.merge(pd.merge(ratings, users), movies)
    
    return data

### Load the preprocessed data

In [65]:
data= load_data()

In [66]:
# Keep only the ratings from users with UserID below 500
data = data[data["UserID"]<500]

In [57]:
data.shape

(73770, 8)

In [67]:
# Take a look at the data
data.head()

Unnamed: 0,UserID,MovieID,ratings,Gender,Age,JobID,Title,Genres
0,1,1193,5,0,0,10,"[1851, 1878, 4754, 5135, 739, 5083, 927, 927, ...","[5, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18..."
1,2,1193,5,1,5,16,"[1851, 1878, 4754, 5135, 739, 5083, 927, 927, ...","[5, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18..."
2,12,1193,4,1,6,12,"[1851, 1878, 4754, 5135, 739, 5083, 927, 927, ...","[5, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18..."
3,15,1193,4,1,6,7,"[1851, 1878, 4754, 5135, 739, 5083, 927, 927, ...","[5, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18..."
4,17,1193,5,1,3,1,"[1851, 1878, 4754, 5135, 739, 5083, 927, 927, ...","[5, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18..."


# Data Preprocessing - 2

* Convert the Series to arrays so that they can be fed to the models

In [68]:
# .values replaces the deprecated Series.as_matrix()
label = data["ratings"].values
title = data["Title"].values
genres = data["Genres"].values



- The Title and Genres entries are Python lists; convert each one to a NumPy array so that it can be combined with the user features

In [69]:
for i in range(len(data)):
    #print(type(title[i]))
    title[i] = np.asarray(title[i])
    genres[i] = np.asarray(genres[i])
   #print(type(title[i]))

* Drop ratings, Title, and Genres from data and store the result in data1

In [71]:
data1 = data.drop(["ratings","Title","Genres"], axis = 1)

- Convert data1 to a list first; it is converted to an array again further below

In [72]:
data1 = data1.values.tolist()

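- To include the movie features, use the code below. The reported results use only the user features, so the code that adds the movie features is commented out; the results of both settings are still compared below.<br>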
data_list = []<br>
for i in range(len(data)):<br>
    data_ = np.concatenate((data1[i], title[i], genres[i]))<br>
    data_list.append(data_)
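
As a small sketch of what the concatenation produces for one row, assuming the padded lengths stated in the preprocessing comments (15 for Title, 18 for Genres): `build_feature_row` is a hypothetical helper and the title IDs are placeholders. The point is the width of the resulting vector, which the first layer of the network has to match.

```python
from typing import List, Sequence


def build_feature_row(base: Sequence[int],
                      title_ids: Sequence[int],
                      genre_ids: Sequence[int]) -> List[int]:
    """Concatenate the 5 base features with the padded title and genre IDs."""
    return list(base) + list(title_ids) + list(genre_ids)


base = [1, 1193, 0, 0, 10]    # UserID, MovieID, Gender, Age, JobID
title_ids = [7] * 15          # placeholder word IDs, padded length 15
genre_ids = [5] + [18] * 17   # one genre followed by padding, length 18

row = build_feature_row(base, title_ids, genre_ids)
print(len(row))  # 38
```

With the movie features included, each sample has 38 values instead of 5, so an `input_shape=(5,)` first layer would need to become `input_shape=(38,)`.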

In [73]:
# data_list = []
# for i in range(len(data)):
#     data_ = np.concatenate((data1[i], title[i], genres[i]))
#     data_list.append(data_)
data_list = data1

- Convert data_list to an array so that it can be fed to the models

In [74]:
data_array = np.array(data_list)

In [76]:
#data_array.shape

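# Train the models and compare their accuracy

- Accuracy below is measured by the gap between the predicted rating and the true rating.

- MSE and MAE are used for the comparison; the closer to 0, the better. For this dataset MAE is the more intuitive of the two, because it directly tells how far, on average, the predicted rating is from the true rating.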
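
A minimal sketch of the metrics on four toy ratings, written in plain Python so the arithmetic is visible; the helper names and the toy values are chosen here for illustration. The cells below print `sqrt(mean_squared_error(...))`, which corresponds to `rmse` in this sketch.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of MSE, expressed in the same units as the ratings."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [5, 3, 4, 1]
y_pred = [4.0, 3.0, 2.0, 2.0]

print(mse(y_true, y_pred))             # 1.5
print(round(rmse(y_true, y_pred), 3))  # 1.225
print(mae(y_true, y_pred))             # 1.0
```

An MAE of 1.0 reads directly as "off by one star on average", while MSE penalizes the single two-star miss more heavily.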

## Linear Regression

In [77]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_array, label, test_size=0.2, random_state=42)

- **Standardize the data**

In [78]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
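
What the scaler does can be sketched in plain Python: the per-column mean and standard deviation are estimated on the training rows only and then reused for the test rows. `fit_scaler` and `transform` are hypothetical helpers written for this illustration; they mirror the idea, not the scikit-learn implementation.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def fit_scaler(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Estimate the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows: Sequence[Sequence[float]],
              means: Sequence[float],
              stds: Sequence[float]) -> List[List[float]]:
    """Subtract the mean and divide by the standard deviation, column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1, 10], [2, 20], [3, 30]]
test = [[4, 20]]

means, stds = fit_scaler(train)              # statistics come from the training set only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)   # the test set reuses them
print(means)  # [2.0, 20.0]
print([round(v, 3) for v in test_scaled[0]])  # [2.449, 0.0]
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step.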



In [39]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x_train, y_train)
result = reg.predict(x_test)

In [40]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
print(sqrt(mean_squared_error(result, y_test)))
print(mean_absolute_error(result, y_test))

1.1109963360784072
0.9191402037585548


RMSE (square root of MSE) : 1.111 <br>MAE : 0.919

In [36]:
# Remove the results above
#del x_train, x_test, y_train, y_test

## SVR

In [38]:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
clf  = SVR()
reg = clf.fit(x_train, y_train)
result = reg.predict(x_test)

print(sqrt(mean_squared_error(result, y_test)))
print(mean_absolute_error(result, y_test))



1.1054977673959607
0.8891158932545625


RMSE (square root of MSE) : 1.105 <br>MAE : 0.889

# Deep Neural Network

- First build the neural network

In [90]:
from keras.models import Sequential
from keras.layers import Dense,Activation,Dropout
from keras import optimizers
from keras.callbacks import EarlyStopping
model = Sequential()
model.add(Dense(1024, input_shape=(5,)))
model.add(Activation('relu'))
# model.add(Dropout(0.2))
model.add(Dense(1024))
model.add(Activation('relu'))
# model.add(Dropout(0.5))
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))
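
One detail of the network above is worth checking: its last layer is a sigmoid, whose output always lies between 0 and 1, while the ratings it is trained on range from 1 to 5. The sketch below is plain Python and independent of Keras. It shows the error floor this implies on the five ratings printed by `ratings.head()`. `rescaled_sigmoid` is a hypothetical alternative and not part of this notebook.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; its value never exceeds 1."""
    return 1.0 / (1.0 + math.exp(-x))


outputs = [sigmoid(x) for x in (-10, 0, 10, 30)]
print(max(outputs) <= 1.0)  # True

# If every prediction is capped at 1, the best achievable guess for a
# rating of 1 or more is 1.0. Errors on the first five ratings shown above:
ratings = [5, 3, 3, 4, 5]
best_mae = sum(r - 1.0 for r in ratings) / len(ratings)
best_rmse = math.sqrt(sum((r - 1.0) ** 2 for r in ratings) / len(ratings))
print(best_mae)             # 3.0
print(round(best_rmse, 3))  # 3.13


def rescaled_sigmoid(x: float, low: float = 1.0, high: float = 5.0) -> float:
    """Hypothetical alternative: stretch the sigmoid onto the rating range."""
    return low + (high - low) * sigmoid(x)


print(rescaled_sigmoid(0.0))  # 3.0
```

A linear output layer, or an output rescaled to the rating range as in `rescaled_sigmoid`, is a common choice for regression targets. Whether it would change the comparison reported here has not been verified in this notebook.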

#### Using MSE

In [86]:
opt = optimizers.Adam(lr=1e-3) 
earlystop = EarlyStopping(monitor = 'val_loss',patience=5)
model.compile(loss='mean_squared_error', optimizer=opt)
model.fit(x_train,y_train,epochs=1000, batch_size=200,validation_split=0.2,callbacks=[earlystop])

Train on 47212 samples, validate on 11804 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000


<keras.callbacks.History at 0x2adef810b00>

In [87]:
mse = model.evaluate(x_test,y_test)
# Take the square root to obtain the RMSE
sqrt(mse)



2.841576332354263

**The RMSE is 2.842**

#### Using MAE

In [88]:
# Remove the MSE model above
del model

In [93]:
# First re-run the cell above that builds the neural network
opt = optimizers.Adam(lr=1e-3) 
earlystop = EarlyStopping(monitor = 'val_loss',patience=5)
model.compile(loss='mean_absolute_error', optimizer=opt)
model.fit(x_train,y_train,epochs=1000, batch_size=200,validation_split=0.2,callbacks=[earlystop])

Train on 47212 samples, validate on 11804 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000


<keras.callbacks.History at 0x2adeff0ea58>

In [97]:
mae = model.evaluate(x_test,y_test)
mae



2.6143418733902672

**The MAE is 2.614**

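# Conclusion

- I ran the experiments both with and without standardization and found no major difference on this dataset; the study nevertheless keeps the standardized version.
- Since it was not clear at first which features would be useful, all of them were used initially. After removing the movie features Title and Genres and using only 5 features, the results were slightly better, with differences on the order of 0.0001.
- The movie features Title and Genres carry information whose relation to the rating is unclear, and they did not help achieve better accuracy.
- Why is the DNN worse than SVM and LR?<br>- The target is each person's rating for each movie. But there are so many movies that the amount of data is relatively small for a DNN: the dataset cannot contain every user's rating for every movie, and some users do not rate carefully. As a result, the traditional machine learning methods performed better.
* The methods compare as follows:<br>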


Metric | LR | SVR | DNN
-----|-----|--|---
MAE|0.919|0.889|2.614
RMSE|1.111|1.105|2.842
