# Prediction of YouTube Music-Video Views Based on Visual Features
## Project Description

The aim of this project is to predict the number of views a YouTube music video has received, using only the visual information provided by the video itself. Relying on meta-information such as 'Song Name' or 'Band', or on the audio that comes with the video, is therefore strictly forbidden.  
 
The following protocol gives an overview of all the different approaches and steps that have been applied during this project.


# 1: Dataset creation
The first step of this project was to create a dataset consisting of various YouTube music videos.
To select candidate songs to search for, we used the song list from the LFM-1b dataset.
We divided our dataset creation procedure into two steps:
- Extraction: Search and select music videos 
- Download: Download found music videos

### Extraction
#### First Approach
The next step was to find a way to check whether a music video exists for each of these songs and, if so, to download it from YouTube.
Even this step proved problematic, as it is very difficult to decide, based only on the meta-info of the YouTube search results, whether a video is indeed a music video for the searched song. How we handled this is described in detail below.
Our first attempt to query the songs on YouTube made use of the YouTube API, which provides a straightforward and easy-to-use interface for querying information.
Unfortunately, we had to refrain from using the YouTube API, as it only allowed us to query about 10 songs per day, which is far too slow for our purposes.

#### Second Approach
We then decided to use a hand-crafted crawling approach as an alternative, built with the Scrapy Python package.
We implemented a web crawler that issues search queries on YouTube for a specific song $X$ and creator $C$ using the following search pattern:  
"$X$ - $C$ Official Video"

This already led to very promising results. However, the crawled data also contained many false positives, such as lyric videos, videos of other songs by the same band, and uploads of the audio accompanied only by a picture of the CD cover.
To decrease our false-positive rate, we used the following rules to define our final YouTube database.

##### Extraction Rules
We defined the various rules based on the following video meta-info:
- Title of the Video
- Description of the Video
- Channel name that uploaded the Video

Based on the information above we defined the following rules:  
- If the title contains the words 'Cover' or 'Lyric', we assume that it is not the original music video for the song and reject it.
- If the name of the band appears neither in the title nor in the channel name, we reject the video.
- If "Official video" occurs in the title or in the description, in any of several languages and spelling variants, we accept the video.
- If "Full video" occurs in the title, we accept the video.
- If "Directed by" occurs in the description, we accept the video.
- If the video title exactly matches "$SongName$ - $Creator$", we accept the video.
- In all other cases, we reject the video. 
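
The search pattern and the rules above could be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the names `build_search_query` and `is_music_video` are hypothetical, the rules are assumed to be checked in the order listed, and only the English phrase "official video" is matched, whereas the project also covered other languages and variants.

```python
import re

# Assumption: only the English phrasing is matched here; other languages
# and spelling variants would need additional patterns.
OFFICIAL_VIDEO = re.compile(r"official\s+(music\s+)?video", re.IGNORECASE)


def build_search_query(song, creator):
    """Build the search string "<song> - <creator> Official Video"."""
    return f"{song} - {creator} Official Video"


def is_music_video(title, description, channel, song, creator):
    """Apply the accept/reject rules in the order they are listed (assumed)."""
    title_l = title.lower()
    description_l = description.lower()
    band = creator.lower()

    # Reject covers and lyric videos. This is a plain substring match, so a
    # title containing e.g. "Discover" would be rejected as well.
    if "cover" in title_l or "lyric" in title_l:
        return False
    # Reject if the band appears neither in the title nor in the channel name.
    if band not in title_l and band not in channel.lower():
        return False
    if OFFICIAL_VIDEO.search(title) or OFFICIAL_VIDEO.search(description):
        return True
    if "full video" in title_l:
        return True
    if "directed by" in description_l:
        return True
    # Accept an exact "<song> - <creator>" title (case-insensitive here).
    if title_l.strip() == f"{song} - {creator}".lower():
        return True
    # In all other cases, reject.
    return False
```

Because the reject rules come first, a lyric video is dropped even if its description contains "Official video", which matches the stated preference for a clean database over a complete one.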

These search rules led to good results. Many videos were still dropped, but this was not much of an issue, as we had plenty of songs to search for.
Our clear focus was therefore on building a consistent database of videos that really are music videos, rather than on avoiding false negatives.
The only problem we noticed was that many videos marked as music videos still showed only the CD cover of the band while playing the audio.
We describe how we fixed this in a later step.
The result of the crawling step is a MongoDB database storing the LFM-1b data together with the corresponding video meta-info: video ID, title, channel, description, and the URL of the video.

### Download
#### Fetching the Videos
Having built a database that stores the URLs of the music videos, the only thing left to finalize our dataset was to download the videos for later use in prediction.
For this, we used the 'youtube-dl' Python package, which fortunately can also download all the video meta-info as a JSON file.
We used this meta-info to update our database with the required view counts and further information such as dislikes, likes, duration, and video resolution.

As we expected to run into storage problems due to the large number of videos, we decided to download videos with a maximum resolution of 640x360.
The videos are stored in a specified location, and the path to each video is stored in the database.
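
A download configuration along these lines might look like the sketch below. The helper names `build_download_options` and `download_video`, the output template, and the returned field selection are assumptions for illustration; the option keys (`format`, `outtmpl`, `writeinfojson`, `quiet`) are standard youtube-dl options, but the project's actual settings are not documented here.

```python
def build_download_options(target_dir, max_height=360):
    """Options for youtube_dl.YoutubeDL; max_height=360 is an assumed cap."""
    return {
        # Best single file whose height does not exceed the cap.
        "format": f"best[height<={max_height}]",
        # Store each video under its YouTube ID (hypothetical naming scheme).
        "outtmpl": f"{target_dir}/%(id)s.%(ext)s",
        # Also write the meta-info next to the video as a .info.json file.
        "writeinfojson": True,
        "quiet": True,
    }


def download_video(url, target_dir):
    """Download one video and return the meta-info used to update the database."""
    import youtube_dl  # third-party package, imported lazily

    with youtube_dl.YoutubeDL(build_download_options(target_dir)) as ydl:
        info = ydl.extract_info(url, download=True)
    return {
        "video_id": info.get("id"),
        "view_count": info.get("view_count"),
        "like_count": info.get("like_count"),
        "dislike_count": info.get("dislike_count"),
        "duration": info.get("duration"),
        "width": info.get("width"),
        "height": info.get("height"),
    }
```

Filtering by height only is a simplification; a stricter cap could also constrain the width in the format selector.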

##### Problems
During the download process, we unfortunately had to cope with restrictions on how fast we could download videos, possibly caused by DoS protection on YouTube's side.
This problem had already occurred during the extraction process, where we were able to solve it by waiting some time after each query.
However, downloading puts far more load on YouTube than simply searching for videos.
Hence, the download process in particular was very tedious and required long waiting times as well as experimenting with different configurations.
This consumed a large amount of valuable time that could have been put to better use.
We therefore recommend that anyone who wants to mine a similar dataset contact YouTube directly for extended API access, in order to avoid these problems.
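
One way to implement the waiting described above is a small wrapper that pauses after every request and waits longer after a failure. This is only a sketch under assumptions: the name `run_throttled`, the default pause, and the doubling of the wait after each failure are illustrative choices, not the configuration that was actually used.

```python
import time


def run_throttled(action, pause=2.0, max_attempts=4, sleep=time.sleep):
    """Call action(); on failure wait with doubling delays, on success pause once.

    `sleep` is injectable so the waiting behaviour can be tested without delay.
    """
    for attempt in range(max_attempts):
        try:
            result = action()
        except Exception:
            # In real code, catch only the download library's error type.
            if attempt == max_attempts - 1:
                raise
            sleep(pause * 2 ** (attempt + 1))
            continue
        # Fixed pause after a successful request to keep the request rate low.
        sleep(pause)
        return result
```

A call such as `run_throttled(lambda: fetch(url))`, with `fetch` standing in for any search or download function, would then replace the direct call.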

# 2: Preprocessing
The previous step resulted in a dataset of about 10,500 downloaded music videos including their meta-info.  
Before we could predict view counts, we had to address the following issues:
- How do we handle the huge amount of video data (~200 GB)?
- What features should we extract, and how?

##### Size of Data
Regarding the storage problem, we took an approach similar to that of Yu-Gang Jiang et al., "Understanding and Predicting Interestingness of Videos". To extract visual features, we sampled frames from each video at equally spaced intervals.  
As a result, each video is described by a fixed number of frames N.  
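
Equally spaced sampling can be illustrated with a short index computation. The helper `sample_frame_indices` is hypothetical, and taking the middle of each interval (rather than its start) is an assumption of this sketch.

```python
def sample_frame_indices(total_frames, n):
    """Return up to n frame indices spread evenly over a video.

    The video is split into n equal intervals and the middle frame of each
    interval is chosen. Videos shorter than n frames yield every frame.
    """
    if total_frames <= 0 or n <= 0:
        return []
    n = min(n, total_frames)
    step = total_frames / n
    return [int(step * i + step / 2) for i in range(n)]
```

If the frames are read with OpenCV, for instance, each index could be passed to `cap.set(cv2.CAP_PROP_POS_FRAMES, index)` before calling `cap.read()`; whether the project used OpenCV for this step is not stated here.
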
During this process, we had an idea for reducing the false-positive rate among the selected music videos. As mentioned earlier, our video extraction pipeline still admitted many "music videos" that only show an album cover while playing the audio.
To detect these, we computed the structural similarity (SSIM) score between each pair of consecutively sampled frames to check whether they are essentially the same. If the number of near-identical frames in a video exceeds a certain threshold, we exclude the video from the music video dataset and delete it to free storage space.
This filtering was a clear improvement to our database and fits naturally into the video subsampling step.
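
The decision logic of this filter could look roughly like the sketch below. The function name `is_static_video` and both thresholds are assumptions, and the share of near-identical pairs is used instead of an absolute count. The similarity measure is passed in as a function; with scikit-image, `skimage.metrics.structural_similarity` applied to grayscale frames would be a natural choice.

```python
def is_static_video(frames, similarity, sim_threshold=0.95, max_static_ratio=0.9):
    """Flag a video whose consecutive sampled frames are almost all identical.

    frames:     sampled frames in temporal order
    similarity: function (frame_a, frame_b) -> score, where 1.0 means identical
    Both thresholds are illustrative values, not the ones used in the project.
    """
    pairs = list(zip(frames, frames[1:]))
    if not pairs:
        return False
    static_pairs = sum(1 for a, b in pairs if similarity(a, b) >= sim_threshold)
    return static_pairs / len(pairs) > max_static_ratio
```

Since the frames have already been sampled for feature extraction, the check adds only one similarity computation per consecutive pair.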

##### Features to Extract
To decide which features to use for our regression task, we looked at various resources published by the scientific community, especially publications from the MediaEval challenge.
In our opinion, the most relevant related work is "Predicting popularity of online videos using Support Vector Regression" by Tomasz Trzcinski and Przemysław Rokita.
They list a large number of features, which they used to predict the view counts of Facebook and YouTube videos. First, they extracted basic video characteristics such as length, frame rate, and resolution.
Furthermore, they extracted basic color information encoding the most dominant color in a frame.
They also extracted occurrences of faces and their relative size in the frame.
In addition, they applied OCR to spot subtitles and other textual hints.
Their most promising approach was to use a CNN pretrained on ImageNet to get an idea of what actually happens in the video.
To this end, they extracted a set of frames that should represent the video scenes and propagated them through the CNN. They averaged the resulting 1000-dimensional probability vectors and normalized the result to 1.

However, the term "most promising" should be taken with a grain of salt, as the reported correlation coefficients for all of the visual features are fairly low.

In general, we could not find literature on predicting video popularity based solely on visual features.
In the work mentioned above, the task was to predict the view count on the 30th day given the view-count history of the preceding 29 days.
The described visual features were essentially used only as a hint for fine-tuning the regression results.

Other works mostly used audio and visual features in combination with metadata to obtain useful results, which is of course not the goal of our task.


As the CNN approach showed the most promising results in their research, we started by implementing it for our project as well. In addition, we used basic video information such as length and color information, as well as a face detector, to extract features.
This process is described in detail below.

##### Feature Extraction
First, we implemented a small class that performs a clean train-test split of our dataset.
Our feature extractors are applied to both sets separately.
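
A reproducible split at the video level might be sketched as follows. The project uses a class for this; the function `train_test_split_ids`, the 20% test share, and the seed are hypothetical stand-ins.

```python
import random


def train_test_split_ids(video_ids, test_fraction=0.2, seed=42):
    """Split video IDs into disjoint train and test lists, reproducibly.

    Splitting whole videos rather than frames keeps all frames of one video
    in the same set. Duplicate IDs are removed before shuffling.
    """
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]
```

Fixing the seed means every feature extractor sees the same partition, so features computed at different times stay comparable.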

Next, we built an image classifier to predict what is happening in each sampled video frame. For this, we used the ResNet50 CNN loaded with weights pretrained on ImageNet.
It outputs a 1000-dimensional probability vector that gives the probability of each of the 1000 ImageNet classes occurring. From these probabilities, we store the top-5 predictions as features for the specific image.
Thus, we end up with 5 labels and their respective probabilities for each sampled image of a video.
We also implemented the approach of Tomasz Trzcinski and Przemysław Rokita by averaging the 1000-dimensional probability vectors over all images of a video and storing the normalized vector as a feature for later regression analysis.
As a result, we have two slightly different kinds of CNN-based visual features for the regression later on.
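
Given the per-frame probability vectors, both feature variants reduce to a few lines. The sketch below uses plain Python lists and hypothetical helper names; normalizing so that the entries sum to 1 is an assumption about what "normalized" means here.

```python
def top_k(probabilities, k=5):
    """Return the k most probable classes as (class_index, probability) pairs."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def mean_probability_vector(frame_probabilities):
    """Average the per-frame vectors of one video and rescale to sum to 1."""
    n_frames = len(frame_probabilities)
    dim = len(frame_probabilities[0])
    mean = [sum(frame[i] for frame in frame_probabilities) / n_frames
            for i in range(dim)]
    total = sum(mean)
    return [m / total for m in mean] if total > 0 else mean
```

The class indices returned by `top_k` would still need to be mapped to ImageNet label names, which the pretrained model's tooling usually provides.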

We then built a small feature extractor that counts the occurrences of faces in the sampled video frames. The extracted features are:
- The total number of faces in the video samples
- The average number of faces per frame
- The average percentage of image-space covered by faces. 
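
Assuming a face detector that returns bounding boxes as `(x, y, width, height)` tuples, the three features above could be computed as in this sketch. The input layout and the name `face_features` are hypothetical, and overlapping boxes are counted twice in the coverage for simplicity.

```python
def face_features(frames):
    """Aggregate face statistics over the sampled frames of one video.

    frames: list of (frame_width, frame_height, boxes), where boxes is a list
            of (x, y, w, h) face rectangles found in that frame.
    Coverage is returned as a fraction in [0, 1]; multiply by 100 for percent.
    """
    n_frames = len(frames)
    if n_frames == 0:
        return {"total_faces": 0, "avg_faces_per_frame": 0.0, "avg_face_coverage": 0.0}
    total_faces = sum(len(boxes) for _, _, boxes in frames)
    coverage = []
    for frame_w, frame_h, boxes in frames:
        face_area = sum(w * h for _, _, w, h in boxes)
        coverage.append(min(1.0, face_area / (frame_w * frame_h)))
    return {
        "total_faces": total_faces,
        "avg_faces_per_frame": total_faces / n_frames,
        "avg_face_coverage": sum(coverage) / n_frames,
    }
```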

Finally, we extracted some low-level color information. We loaded the video frames and converted them to the HSV color space for easy color picking.
We split the color space into the following 'main' colors: red, orange, yellow, green, cyan, blue, violet, pink, black, and white.
For every frame, we store the most dominant color, and we also store the most dominant color of the video overall.
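
The color binning could be sketched as below using only the standard library. All thresholds (the hue boundaries and the saturation and value cut-offs for black and white) are assumptions chosen for illustration, as is taking the video's dominant color to be the most frequent per-frame dominant color.

```python
import colorsys
from collections import Counter

# Upper hue bound in degrees for each named color (assumed boundaries).
HUE_BINS = [(15, "red"), (45, "orange"), (75, "yellow"), (150, "green"),
            (195, "cyan"), (255, "blue"), (285, "violet"), (345, "pink"),
            (360, "red")]


def classify_pixel(r, g, b):
    """Map an 8-bit RGB pixel to one of the named main colors via HSV."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if v < 0.2:
        return "black"
    if s < 0.2:
        # Unsaturated pixels are assigned to white or black by brightness.
        return "white" if v > 0.5 else "black"
    hue = h * 360
    for upper, name in HUE_BINS:
        if hue < upper:
            return name
    return "red"


def dominant_color(pixels):
    """Most frequent named color among the (r, g, b) pixels of one frame."""
    return Counter(classify_pixel(*p) for p in pixels).most_common(1)[0][0]


def dominant_video_color(frames):
    """Most frequent per-frame dominant color across all sampled frames."""
    return Counter(dominant_color(f) for f in frames).most_common(1)[0][0]
```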

As the CNN labels (drawn from the 1000 ImageNet classes) and the dominant colors are categorical features, we provide functionality to load them in one-hot-encoded form during regression.
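
For illustration, one-hot encoding a dominant color amounts to the following. The helper `one_hot` and the fixed category order are assumptions of this sketch, not the project's loading code.

```python
COLORS = ["red", "orange", "yellow", "green", "cyan",
          "blue", "violet", "pink", "black", "white"]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category order."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]
```

For example, `one_hot("green", COLORS)` sets only the fourth entry to 1. The same helper would apply to CNN labels, with the list of ImageNet class names as `categories`.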

To sum up, the following features are now available:
- Basic
    - Video Duration in seconds
- CNN based
    - Top 5 labels/probabilities for every frame
    - Normalized average 1000 dimensional probability vector of label occurrence
- Faces
    - The total number of faces in the video samples
    - The average number of faces per frame
    - The average percentage of image-space covered by faces
- Color
    - Most dominant color per frame in video
    - Most dominant color per video


#### Exploratory Data Analysis

