### Motivation
In [my last notebook](https://www.kaggle.com/mengxinbj/lstm-prediction-on-trending-youtube-videos-views), I built an LSTM model with Keras and predicted the view trends of videos in the [Trending YouTube Video Statistics dataset](https://www.kaggle.com/datasnaek/youtube-new). In this notebook, I implement the same prediction task in PyTorch. It's a good example for getting a feel for the PyTorch style. Building and training the model takes more code than in Keras, but PyTorch is much more flexible. I hope you like it. Let's get started.

In [51]:
import pandas as pd
import numpy as np
from datetime import datetime
import os
from sklearn.preprocessing import StandardScaler

### Data Read and Preprocessing

In [52]:
filepath1="/kaggle/input/youtube-new/US_category_id.json"
category_id_df = pd.read_json(filepath1)
category_id_df.head()

Unnamed: 0,kind,etag,items
0,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
1,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
2,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
3,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
4,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."


In [53]:
filepath2="/kaggle/input/youtube-new/USvideos.csv"
videos_df = pd.read_csv(filepath2,header='infer')
videos_df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In this section, I remove some unnecessary characters such as '"', ',' and '\r' from the tags, title, channel title and description columns. These characters can cause problems when loading the data into a relational database for an ETL task (we don't do that in this project). Each country's file may have different characters that need to be removed; we only take the US dataset as an example.

In [54]:
def clean_video_csv(video_df,country_code):
    """
    Remove unnecessary chars like '"', ',' and '\r', which cause errors when copying csv files into a Redshift staging table.
    
    Parameters:
    video_df: Dataframe from read_csv file
    country_code: country code (e.g. 'US') stored in the new "country" column
    
    Return:
    video_df: Dataframe with the unnecessary chars removed
    """
    video_df["tags"] = video_df["tags"].apply(lambda x:x.replace('"',""))
    video_df["title"] = video_df["title"].apply(lambda x:x.replace(',',' '))
    video_df["channel_title"] = video_df["channel_title"].apply(lambda x:x.replace(',',' '))
    video_df["description"] = video_df["description"].apply(lambda x:str(x).replace('\r',''))
    video_df["description"] = video_df["description"].apply(lambda x:str(x).replace(',',' '))
    video_df["description"] = video_df["description"].apply(lambda x:str(x).replace('"',''))
    video_df["country"] = country_code
    return video_df
#Clean videos csv files for selected country code
country_code=['US']
for c in country_code:
    filepath="/kaggle/input/youtube-new/"+c+"videos.csv"
    video_df = pd.read_csv(filepath,header='infer')
    savepath = "/kaggle/working/"+c+"videos1.csv"
    video_df = clean_video_csv(video_df,c)
    video_df.to_csv(savepath,index=False)
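
The same cleanup can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is only a sketch on a tiny made-up frame; `clean_text_columns` and `demo_df` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Tiny made-up frame with the same text columns as the videos CSV.
demo_df = pd.DataFrame({
    "tags": ['a|"b"'],
    "title": ["Hello, world"],
    "channel_title": ["Foo, Inc"],
    "description": ['He said "hi",\r\nbye'],
})

def clean_text_columns(df):
    """Hypothetical vectorized variant of clean_video_csv."""
    out = df.copy()
    # Drop double quotes from the tags column.
    out["tags"] = out["tags"].str.replace('"', "", regex=False)
    # Replace commas with spaces in title and channel title.
    for col in ["title", "channel_title"]:
        out[col] = out[col].str.replace(",", " ", regex=False)
    # Description: drop carriage returns and quotes, then replace commas.
    desc = out["description"].astype(str)
    desc = desc.str.replace(r'[\r"]', "", regex=True)
    out["description"] = desc.str.replace(",", " ", regex=False)
    return out

cleaned = clean_text_columns(demo_df)
print(cleaned.iloc[0].tolist())
```

Passing `regex=False` makes the literal replacements explicit, so the behaviour does not depend on the pandas version's default.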

Since category ids are not the same across countries, we extract the category ids and titles from each country's json file and save them as a csv file for later use.

In [55]:
def category_extract (df,country_code):
    """
    Extract category id and category title from the category_id json file
    
    Parameters:
    df: Dataframe from read_json file
    country_code: country code (e.g. 'US') stored in the country_code column
    
    Return:
    category_df: Dataframe with columns: category_id,category_title,country_code
    
    """
    category_id = []
    category_title = []
    for i in range(df.shape[0]):
        category_id.append(df.iloc[i]["items"]['id'])
        category_title.append(df.iloc[i]["items"]["snippet"]["title"])
    category_df = pd.DataFrame()
    category_df["category_id"] = category_id
    category_df["category_title"] = category_title
    category_df.insert(category_df.shape[1],"country_code",country_code)
    return category_df

#Extract category title and id from json file of each country
category_all = pd.DataFrame()
for c in country_code:
    filepath="/kaggle/input/youtube-new/"+c+"_category_id.json"
    category_id_df = pd.read_json(filepath)
    category_all = pd.concat([category_all,category_extract(category_id_df,c)])
    
#category_all.tail()
savepath = "/kaggle/working/category_all.csv"
category_all.to_csv(savepath,index=False)
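
As an alternative to looping over rows, the nested `items` entries could be flattened with `pd.json_normalize`. The sketch below uses a hand-written stand-in for the `items` list with only the fields used above; the real file contains more fields and more categories.

```python
import pandas as pd

# Hand-written stand-in for the "items" list in a category_id json file.
items = [
    {"kind": "youtube#videoCategory", "id": "1",
     "snippet": {"title": "Film & Animation"}},
    {"kind": "youtube#videoCategory", "id": "2",
     "snippet": {"title": "Autos & Vehicles"}},
]

# Nested keys become dotted column names such as "snippet.title".
flat = pd.json_normalize(items)

category_demo = (
    flat[["id", "snippet.title"]]
    .rename(columns={"id": "category_id", "snippet.title": "category_title"})
    .assign(country_code="US")
)
print(category_demo)
```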

### Data Transformation
Merge the category into trending videos data.

In [56]:
US = pd.read_csv("/kaggle/working/USvideos1.csv")
category = pd.read_csv("/kaggle/working/category_all.csv")
US1 =US.merge(category,how="inner",left_on=["category_id","country"],right_on=["category_id","country_code"])
US1["trending_date1"] = US1["trending_date"].apply(lambda x: pd.Timestamp(int("20"+x[0:2]),int(x[-2:]),int(x[3:5]),0))
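
The `trending_date` strings are in year.day.month order (for example `17.14.11` is 14 November 2017). A minimal sketch of the same conversion with `pd.to_datetime` and an explicit format string, on two made-up values:

```python
import pandas as pd

# Sample strings in the dataset's yy.dd.mm layout.
trending = pd.Series(["17.14.11", "18.01.06"])

# %y = two-digit year, %d = day, %m = month.
parsed = pd.to_datetime(trending, format="%y.%d.%m")
print(parsed)
```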

We extract the columns that will be used in this project.

In [57]:
columns = ["video_id","trending_date1","channel_title","publish_time","views","likes","dislikes","comment_count","category_title"]
US1 = US1[columns].copy()

We only keep videos that have more than 4 trending days of data, because we use the first 4 trending days to predict the 5th trending day.

In [58]:
trendingdate_df = US1.groupby("video_id").trending_date1.describe(datetime_is_numeric=True).reset_index()
videos = trendingdate_df[trendingdate_df["count"].values>4].video_id
US2 = US1[ US1.video_id.isin(videos.values)]
trendingdate_df.head()

Unnamed: 0,video_id,count,mean,min,25%,50%,75%,max
0,-0CMnp02rNY,6,2018-06-08 12:00:00,2018-06-06,2018-06-07 06:00:00,2018-06-08 12:00:00,2018-06-09 18:00:00,2018-06-11
1,-0NYY8cqdiQ,1,2018-02-01 00:00:00,2018-02-01,2018-02-01 00:00:00,2018-02-01 00:00:00,2018-02-01 00:00:00,2018-02-01
2,-1Hm41N0dUs,3,2018-04-30 00:00:00,2018-04-29,2018-04-29 12:00:00,2018-04-30 00:00:00,2018-04-30 12:00:00,2018-05-01
3,-1yT-K3c6YI,4,2017-11-30 12:00:00,2017-11-29,2017-11-29 18:00:00,2017-11-30 12:00:00,2017-12-01 06:00:00,2017-12-02
4,-2RVw2_QyxQ,3,2017-11-15 00:00:00,2017-11-14,2017-11-14 12:00:00,2017-11-15 00:00:00,2017-11-15 12:00:00,2017-11-16
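
Only the `count` column of the summary is needed for the filter. A lighter sketch of the same idea, assuming one row per video per trending day, counts the rows per `video_id` directly; `demo` is made-up data.

```python
import pandas as pd

# Video "a" trends for 5 days, video "b" for only 3.
demo = pd.DataFrame({
    "video_id": ["a"] * 5 + ["b"] * 3,
    "views": range(8),
})

# Broadcast each video's row count back onto its rows.
n_days = demo.groupby("video_id")["views"].transform("count")

# Keep only videos with more than 4 trending days.
kept = demo[n_days > 4]
print(kept["video_id"].unique())
```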


We standardize the views data before feeding it into the model later.

In [59]:
def standardize(data):
    scaler = StandardScaler()
    scaler = scaler.fit(data)
    transformed = scaler.transform(data)
    return scaler,transformed
scaler_views, US_views = standardize(US2.views.values.reshape(-1,1))
#scaler_likes, US_likes = standardize(US2.likes.values.reshape(-1,1))
#scaler_dislikes, US_dislikes = standardize(US2.dislikes.values.reshape(-1,1))
#scaler_comments, US_comments = standardize(US2.comment_count.values.reshape(-1,1))
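
For reference, `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`). The NumPy sketch below reproduces that arithmetic, and its inverse, on made-up view counts.

```python
import numpy as np

# Made-up view counts as a single column, like views.reshape(-1, 1).
views = np.array([[100.0], [200.0], [300.0], [400.0]])

mean = views.mean(axis=0)
std = views.std(axis=0)        # population std (ddof=0)

z = (views - mean) / std       # what scaler.transform computes
restored = z * std + mean      # what scaler.inverse_transform computes

print(z.ravel())
print(restored.ravel())
```

Keeping the fitted scaler around, as `scaler_views` does above, is what lets us map predictions back to real view counts later.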

We build the data subset that will be used to build and train the model.

In [60]:
US3 = pd.DataFrame()
US3["trending_date1"] = US2["trending_date1"]
US3["video_id"] = US2["video_id"]
US3["views"] = US_views
#US3["likes"] = US_likes
#US3["dislikes"] = US_dislikes
#US3["comment_count"] = US_comments
US3.reset_index(inplace=True)
US3.head()

Unnamed: 0,index,trending_date1,video_id,views
0,0,2017-11-14,2kyS6SvSYSE,-0.237504
1,17,2017-11-15,2kyS6SvSYSE,-0.05543
2,31,2017-11-16,cmoknv58jjE,-0.104086
3,32,2017-11-16,2kyS6SvSYSE,-0.038156
4,33,2017-11-16,Jidk0O6uu-0,-0.325464


In [61]:
US3.drop("index",axis=1,inplace=True)
US3.head()

Unnamed: 0,trending_date1,video_id,views
0,2017-11-14,2kyS6SvSYSE,-0.237504
1,2017-11-15,2kyS6SvSYSE,-0.05543
2,2017-11-16,cmoknv58jjE,-0.104086
3,2017-11-16,2kyS6SvSYSE,-0.038156
4,2017-11-16,Jidk0O6uu-0,-0.325464


We build the x dataset (features) and the y dataset (labels). X contains the views of the first 4 trending days and Y is the views of the fifth trending day.

In [62]:

x=[]
y=[]
category = []
for v in videos:
    row=[]
    temp_df = US3[US3["video_id"]==v].sort_values(by="trending_date1")
    #print (temp_df)
    seq = temp_df.views[0:4].index
        
    for s in seq:
        #print (US3.iloc[s].values[2:])
        row.append(US3.iloc[s].values[2:3])
    x.append(row)
    nextstep = temp_df.views[4:5].values
    y.append(nextstep)
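
The windowing can also be sketched with a single sorted `groupby`, which avoids filtering the whole frame once per video. The frame below is made up, and every video in it has at least 5 rows, as the earlier filter guarantees.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "video_id": ["a"] * 5 + ["b"] * 6,
    "trending_date1": pd.to_datetime(
        ["2018-01-0%d" % d for d in [3, 1, 2, 5, 4]]
        + ["2018-02-0%d" % d for d in range(1, 7)]
    ),
    "views": [0.3, 0.1, 0.2, 0.5, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})

xs, ys = [], []
# Sort by date first; groupby keeps that row order inside each group.
for _, group in demo.sort_values("trending_date1").groupby("video_id"):
    v = group["views"].to_numpy()
    xs.append(v[:4].reshape(4, 1))   # first 4 trending days as features
    ys.append(v[4:5])                # 5th trending day as label

x_demo = np.stack(xs).astype("float32")   # (videos, 4, 1)
y_demo = np.stack(ys).astype("float32")   # (videos, 1)
print(x_demo.shape, y_demo.shape)
```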
    

Reshape X to (total, seqlen, inputdim) and Y to (total, outputdim), with seqlen=4, inputdim=1 and outputdim=1.

In [63]:
x = np.reshape(x,(len(x),4,1))
print (x.shape)
print (x[0])


(3984, 4, 1)
[[-0.2719427408449598]
 [-0.2555659799132111]
 [-0.24286315271238587]
 [-0.23675864020197335]]


In [64]:
y = np.reshape(y,(-1,1))
print (y.shape)
print (y[0])

(3984, 1)
[-0.23191164]


In [65]:
x = x.astype('float32')
y = y.astype('float32')

Split the data into training, validation and test sets. The size of each should be an integer multiple of the batch size.

In [66]:

batch_size = 100

x_train,x_remain = x[:3700],x[3700:3900]
y_train,y_remain = y[:3700],y[3700:3900]

x_val,x_test = x_remain[:100],x_remain[100:]
y_val,y_test = y_remain[:100],y_remain[100:]
print (x_train.shape,x_val.shape,x_test.shape)

(3700, 4, 1) (100, 4, 1) (100, 4, 1)


Create tensor datasets, then create DataLoaders that batch the training, validation and test datasets.

In [67]:
import torch
from torch.utils.data import TensorDataset,DataLoader
train_data = TensorDataset(torch.from_numpy(x_train),torch.from_numpy(y_train))
valid_data = TensorDataset(torch.from_numpy(x_val),torch.from_numpy(y_val))
test_data = TensorDataset(torch.from_numpy(x_test),torch.from_numpy(y_test))


train_loader = DataLoader(train_data,shuffle=True,batch_size=batch_size)
valid_loader = DataLoader(valid_data,shuffle=True,batch_size=batch_size)
test_loader = DataLoader(test_data,shuffle=True,batch_size=batch_size)
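
The split sizes are multiples of the batch size because the hidden state is created for a fixed batch size. If a dataset size is not a multiple, one option (not used in this notebook) is `drop_last=True`, which discards the final incomplete batch. A small sketch with made-up tensors:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 250 made-up samples: not a multiple of the batch size of 100.
demo_x = torch.zeros(250, 4, 1)
demo_y = torch.zeros(250, 1)

loader = DataLoader(
    TensorDataset(demo_x, demo_y),
    batch_size=100,
    shuffle=True,
    drop_last=True,   # skip the trailing batch of 50 samples
)

batch_sizes = [xb.size(0) for xb, _ in loader]
print(batch_sizes)  # [100, 100]
```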

Have a look at one batch of the X and Y data.

In [68]:
dataiter = iter(train_loader)
sample_x,sample_y = next(dataiter)

print ('Sample input size:',sample_x.size())
print ('Sample input:\n',sample_x[0:2])
print ('\n')
print ('Sample output size:',sample_y.size())
print ('Sample output:\n',sample_y[0:2])


Sample input size: torch.Size([100, 4, 1])
Sample input:
 tensor([[[-0.0899],
         [ 1.3740],
         [ 1.5428],
         [ 1.7263]],

        [[-0.2691],
         [-0.2090],
         [-0.1870],
         [-0.1702]]])


Sample output size: torch.Size([100, 1])
Sample output:
 tensor([[ 1.9194],
        [-0.1596]])


### Build the Model

Checking if GPU is available.

In [69]:
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print ('Training on GPU')
else:
    print ('Training on CPU')

Training on CPU


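Build a model with a multilayer LSTM, a dropout layer and a fully connected layer at the end. The model takes input_size, output_size, hidden_dim, the number of LSTM layers and a dropout probability, and sets batch_first to True. Note that in the code below drop_prob is not passed to nn.LSTM; dropout is applied with a fixed probability of 0.3 to the LSTM output before the fully connected layer.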

In [70]:
import torch.nn as nn
class mylstm(nn.Module):
    
    def __init__(self,input_size,output_size,hidden_dim, n_layers,drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(mylstm,self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        self.lstm = nn.LSTM(input_size,hidden_dim,n_layers,batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim,output_size)
        
    def forward(self,x,hidden):
        batch_size = x.size(0)
        
        lstm_out,hidden = self.lstm(x,hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out =self.dropout(lstm_out)
       
        out = self.fc(out)#shape=(batchsize*seqlen,outputdim)
        out = out.view(batch_size, -1) 
        out = out[:, -1]
        return out,hidden
    
    def init_hidden(self,batch_size):
        """
        Initializes hidden state and cell state
        """
        
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

In [71]:
input_size = 1
output_size =1
hidden_dim = 256
n_layers = 2

net = mylstm(input_size,output_size,hidden_dim,n_layers)
print (net)

mylstm(
  (lstm): LSTM(1, 256, num_layers=2, batch_first=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
)
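
To make the tensor shapes in `forward` concrete, here is a small sketch with random input. The hidden size of 8 is an assumption chosen only to keep the example small (the model above uses 256), and dropout is left out so the two ways of selecting the last time step can be compared exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=8, num_layers=2, batch_first=True)
fc = nn.Linear(8, 1)

x_demo = torch.randn(5, 4, 1)           # (batch, seq_len, input_size)
lstm_out, (h_n, c_n) = lstm(x_demo)     # hidden state defaults to zeros

print(lstm_out.shape)                   # (batch, seq_len, hidden_size)
print(h_n.shape)                        # (num_layers, batch, hidden_size)

# Option 1: keep only the last time step, then apply the linear layer.
last_step = lstm_out[:, -1, :]
pred = fc(last_step).squeeze(-1)        # (batch,)

# Option 2 (as in forward above): apply the linear layer to every
# time step, reshape to (batch, seq_len), then keep the last column.
flat = fc(lstm_out.contiguous().view(-1, 8)).view(5, -1)[:, -1]

print(torch.allclose(pred, flat, atol=1e-6))
```

Both options give the same prediction; the first simply skips the linear layer for the time steps that are thrown away.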


### Training
We use MSELoss as the criterion and Adam as the optimizer. We run 40 epochs over the whole training set, and every 50 training steps (batches) we evaluate on the validation set and print the result.

In [72]:
lr = 0.001
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(),lr=lr)

In [73]:
epoch = 40
counter = 0
print_every = 50
clip = 5

if train_on_gpu:
    net.cuda()
    
net.train()
for e in range(epoch):
    h = net.init_hidden(batch_size)
    #print (h[0].shape)
    for inputs,labels in train_loader:
        counter +=1
        
        if train_on_gpu:
            inputs,labels = inputs.cuda(),labels.cuda()
            
        h = tuple([each.data for each in h])
        net.zero_grad()
        output,h = net(inputs,h)
        
        loss = criterion(output.squeeze(),labels.squeeze())
        loss.backward()
        
        nn.utils.clip_grad_norm_(net.parameters(),clip)
        optimizer.step()
        
        if counter % print_every==0:
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs,labels in valid_loader:
                val_h = tuple([each.data for each in val_h])
            
                if train_on_gpu:
                    inputs,labels = inputs.cuda(),labels.cuda()
                output,val_h = net(inputs,val_h)
                val_loss = criterion(output.squeeze(),labels.squeeze())
                
                val_losses.append(val_loss.item())
            net.train()
            print ("Epoch:{}/{}...".format(e+1,epoch),
               "Step:{}...".format(counter),
               "Loss:{:.6f}...".format(loss.item()),
               "Val loss:{:.6f}".format(np.mean(val_losses)))

Epoch:2/40... Step:50... Loss:0.012888... Val loss:0.008803
Epoch:3/40... Step:100... Loss:0.011558... Val loss:0.004719
Epoch:5/40... Step:150... Loss:0.189438... Val loss:0.004063
Epoch:6/40... Step:200... Loss:0.068017... Val loss:0.022945
Epoch:7/40... Step:250... Loss:0.000795... Val loss:0.002629
Epoch:9/40... Step:300... Loss:0.004782... Val loss:0.006457
Epoch:10/40... Step:350... Loss:0.031094... Val loss:0.001639
Epoch:11/40... Step:400... Loss:0.002246... Val loss:0.002133
Epoch:13/40... Step:450... Loss:0.026684... Val loss:0.002824
Epoch:14/40... Step:500... Loss:0.003164... Val loss:0.002069
Epoch:15/40... Step:550... Loss:0.003797... Val loss:0.002123
Epoch:17/40... Step:600... Loss:0.002095... Val loss:0.002428
Epoch:18/40... Step:650... Loss:0.003948... Val loss:0.001548
Epoch:19/40... Step:700... Loss:0.002160... Val loss:0.001442
Epoch:21/40... Step:750... Loss:0.000857... Val loss:0.001377
Epoch:22/40... Step:800... Loss:0.001483... Val loss:0.002395
Epoch:23/40... 
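
The line `h = tuple([each.data for each in h])` in the loop keeps the values of the hidden state but cuts it off from the previous batch's computation graph, so `backward()` does not try to go through earlier batches. A minimal sketch of that effect, using `.detach()` (the currently recommended way to do the same thing) and a small made-up LSTM:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(1, 8, batch_first=True)
h = (torch.zeros(1, 5, 8), torch.zeros(1, 5, 8))

# After one forward pass the new hidden state is part of the graph.
out, h = lstm(torch.randn(5, 4, 1), h)
attached_before = h[0].requires_grad

# Detach: same values, but no link back to the first batch.
values_before = h[0].clone()
h = tuple(each.detach() for each in h)
attached_after = h[0].requires_grad

print(attached_before, attached_after)
```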

We evaluate on the test set, which has only 1 batch, and print the loss. Note that the data here is still standardized; it has not been inversely transformed.

In [74]:
test_losses=[]

h = net.init_hidden(batch_size)
net.eval()
for inputs,labels in test_loader:
    h = tuple([each.data for each in h])
    if train_on_gpu:
        inputs,labels =inputs.cuda(),labels.cuda()
    output,h = net(inputs,h)
    test_loss = criterion(output.squeeze(),labels.squeeze())
    
    test_losses.append(test_loss.item())

print ("Test loss:{:.6f}".format(np.mean(test_losses)))





Test loss:0.001955
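
Because the loss is computed on standardized values, it is not in units of views. For a single standardized column, the MSE in the original units equals the standardized MSE multiplied by the square of the scaler's standard deviation. A sketch with made-up numbers:

```python
import numpy as np

# Made-up view counts and their standardized version.
views = np.array([100.0, 200.0, 300.0, 400.0])
mean, std = views.mean(), views.std()
true_z = (views - mean) / std

# Made-up predictions in standardized units.
pred_z = true_z + np.array([0.1, -0.1, 0.05, 0.0])
mse_z = np.mean((pred_z - true_z) ** 2)

# Map predictions back to views and compute the MSE there.
pred_views = pred_z * std + mean
mse_views = np.mean((pred_views - views) ** 2)

print(mse_z, mse_views, mse_z * std ** 2)
```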


We predict on a slice of the dataset that was not used in the training, validation or test sets, and print the true and predicted values, which are quite close. Finally, we print the root mean square error of this small demo set.

In [75]:
from torch import Tensor 
from sklearn.metrics import mean_squared_error
import math
predict_tensor = torch.from_numpy(x[3900:3910])
batch_size = predict_tensor.size(0)
h = net.init_hidden(batch_size)
if train_on_gpu:
    predict_tensor = predict_tensor.cuda()
output,h = net(predict_tensor,h)
pred = scaler_views.inverse_transform(Tensor.detach(output) )
true = scaler_views.inverse_transform(y[3900:3910])
for i in range(10):
    print ("The true value is {}, and the predict value is {}".format(true[i],pred[i]))
print ("The root mean square error of the 10 data is ",math.sqrt(mean_squared_error(pred,true)))

The true value is [3017144.], and the predict value is 3253285.0
The true value is [237230.97], and the predict value is 240166.21875
The true value is [460125.97], and the predict value is 472155.71875
The true value is [1899969.], and the predict value is 2014338.25
The true value is [879709.94], and the predict value is 860050.3125
The true value is [1827806.], and the predict value is 1916636.0
The true value is [96253.21], and the predict value is 85471.7109375
The true value is [351798.97], and the predict value is 359858.21875
The true value is [99200.96], and the predict value is 86233.9609375
The true value is [608431.06], and the predict value is 609081.6875
The root mean square error of the 10 data is  88104.06065556797
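
As a cross-check, the RMSE can be recomputed by hand from the ten pairs printed above. The values below are copied from that output, so they are rounded as displayed; the result should agree with the printed RMSE to within rounding.

```python
import math

# (true, predicted) pairs copied from the printed output above.
pairs = [
    (3017144.0, 3253285.0),
    (237230.97, 240166.21875),
    (460125.97, 472155.71875),
    (1899969.0, 2014338.25),
    (879709.94, 860050.3125),
    (1827806.0, 1916636.0),
    (96253.21, 85471.7109375),
    (351798.97, 359858.21875),
    (99200.96, 86233.9609375),
    (608431.06, 609081.6875),
]

squared_errors = [(p - t) ** 2 for t, p in pairs]
rmse = math.sqrt(sum(squared_errors) / len(pairs))
print(rmse)
```

Note that RMSE is dominated by the most-viewed videos: the first sample alone contributes roughly 70% of the summed squared error.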


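The mean absolute percentage error over this slice of 10 samples is about 5.1%. It's pretty good!

In [83]:
pred_flat, true_flat = np.ravel(pred), np.ravel(true)  #same 1-D shape, so the subtraction is element-wise
print ("The mean absolute percentage error of the 10 data is {:.1f}%".format(100*np.mean(np.abs(pred_flat-true_flat)/true_flat)))

The mean absolute percentage error of the 10 data is 5.1%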
