# Prediction of YouTube Music-Video Views Based on Visual Features
## Project Description

The aim of this project is to predict the number of views a YouTube music video has received, using only the visual information in the video itself. Relying on meta-information such as 'Song Name' or 'Band', or on the audio that comes with the video, is therefore strictly forbidden.  
 
The following protocol gives an overview of all the different approaches and steps that have been applied during this project.


# 1: Dataset creation
The first step of this project was to create a dataset consisting of various YouTube music videos. 
To select candidate songs to search for, we used the song list from the LFM-1b dataset.
We divided our dataset creation procedure into two steps:
- Extraction: Search and select music videos 
- Download: Download found music videos
### Extraction
#### First Approach
The next step was to check whether a music video exists for each of these songs and, if so, to download it from YouTube.
This step was already problematic, because it is very difficult to decide from the meta-info of the YouTube search results alone whether a video is indeed a music video for the searched song. How we handled this in detail is described later on.
Our first attempt to query the songs on YouTube used the YouTube API, which provides a straightforward and easy-to-use interface for querying information.
Unfortunately, we had to refrain from using the YouTube API, as it only allowed us to query about 10 songs per day, which is far too slow for our purposes.   

#### Second Approach
We therefore switched to a hand-crafted crawling approach based on the Scrapy Python package. 
We implemented a web crawler that issues search queries on YouTube for a specific song $X$ and creator $C$ using the following search pattern:  
"$X$ - $C$ Official Video"
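
As a minimal sketch of how such a query could be assembled (not the project's actual crawler code): the function names below are hypothetical, and the `search_query` parameter of the public YouTube results page is an assumption about what the crawler requested.

```python
from urllib.parse import urlencode

# Assumption: the crawler requested the public results page with this parameter.
YOUTUBE_SEARCH_URL = "https://www.youtube.com/results"


def build_search_query(song: str, creator: str) -> str:
    """Return the search string following the pattern '<song> - <creator> Official Video'."""
    return f"{song.strip()} - {creator.strip()} Official Video"


def build_search_url(song: str, creator: str) -> str:
    """Return a URL-encoded search URL for the given song and creator."""
    query = build_search_query(song, creator)
    return f"{YOUTUBE_SEARCH_URL}?{urlencode({'search_query': query})}"
```

A Scrapy spider could then yield one request per song using the URL returned by `build_search_url`.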

This already led to very promising results. However, the crawled data also included many false positives, such as lyric videos, videos of other songs by the same band, and uploads of the audio accompanied only by a picture of the CD cover.
To decrease our false positive rate, we used the following rules to define our final YouTube database.

##### Extraction Rules
We defined the various rules based on the following video meta-info:
- Title of the Video
- Description of the Video
- Channel name that uploaded the Video

Based on the information above we defined the following rules:  
- If the title contains the words 'Cover' or 'Lyric', we assume that it is not the original music video for the song
- If the name of the band is neither in the Title nor in the Channel name we reject the video
- If "Official video" occurs in the Title or in the Description, in any of several languages and spelling variants, we accept the video
- If "Full video" occurs in the title, we accept the video
- If "Directed by" occurs in the description, we accept the video
- If the video title exactly matches: "$SongName$ - $Creator$", we accept the video.
- In all other cases, we reject the video. 
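
The rules above can be sketched as a single filter function. This is an illustration under stated assumptions, not the project's implementation: the rules are applied in the listed order with the first match deciding, matching is case-insensitive, only the English phrase "official video" is covered (the multilingual variants are omitted), and 'Cover'/'Lyric' are matched as whole words. All names are hypothetical.

```python
import re

# Assumption: whole-word matching, so "Discover" does not trigger the 'Cover' rule.
REJECT_TITLE = re.compile(r"\b(cover|lyrics?)\b", re.IGNORECASE)
# Assumption: English only; the other languages and variants are not modeled here.
OFFICIAL_VIDEO = re.compile(r"official\s+(music\s+)?video", re.IGNORECASE)


def is_music_video(title: str, description: str, channel: str,
                   song: str, creator: str) -> bool:
    """Apply the extraction rules in order; the first matching rule decides."""
    title_l = title.lower()
    description_l = description.lower()
    creator_l = creator.lower()

    # Reject covers and lyric videos.
    if REJECT_TITLE.search(title):
        return False
    # Reject if the band name is in neither the title nor the channel name.
    if creator_l not in title_l and creator_l not in channel.lower():
        return False
    # Accept on "Official video" in the title or the description.
    if OFFICIAL_VIDEO.search(title) or OFFICIAL_VIDEO.search(description):
        return True
    # Accept on "Full video" in the title.
    if "full video" in title_l:
        return True
    # Accept on "Directed by" in the description.
    if "directed by" in description_l:
        return True
    # Accept if the title is exactly "<song> - <creator>".
    if title_l.strip() == f"{song} - {creator}".lower():
        return True
    # Reject everything else.
    return False
```

Because the final rule rejects by default, the filter favors precision over recall, which matches the stated goal of a consistent database.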

These search rules led to good results. Many videos were still dropped, but this was not much of an issue, as we had plenty of songs to search for.  
The clear focus was therefore on building a consistent database of videos that really are music videos, rather than on avoiding false negatives.
The only problem we noticed was that many videos marked as music videos still showed only the CD cover of the band while playing the audio.
We will describe how we fixed this in a later step.
The result of the crawling step is a MongoDB database storing the LFM-1b data together with the corresponding video meta-info: video ID, title, channel, description, and the URL of the video.
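
For illustration, one stored record might be assembled as follows. This is only a sketch: the field names and the layout of the LFM-1b part are assumptions, not the project's actual schema.

```python
def make_video_record(lfm_track: dict, video_id: str, title: str,
                      channel: str, description: str) -> dict:
    """Combine an LFM-1b track entry with the crawled video meta-info.

    All field names are hypothetical; the real schema may differ.
    """
    return {
        "lfm": dict(lfm_track),  # copy of the original LFM-1b entry
        "video_id": video_id,
        "title": title,
        "channel": channel,
        "description": description,
        "url": f"https://www.youtube.com/watch?v={video_id}",
    }
```

A dictionary of this shape could be inserted into a MongoDB collection as a single document.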

### Download
#### Fetching the Videos
With a database of music-video URLs in place, the only thing left to finalize the dataset was to download the music videos for later use in prediction.
For this we used the 'youtube-dl' Python package, which conveniently also allows downloading all the video meta-info as a JSON file.
We used this meta-info to update our database with the required view counts and further information such as likes, dislikes, duration, and video resolution.

As we expected to run into memory problems due to the large number of videos, we decided to download videos with a maximum resolution of 640x360.
The videos are stored in a specified location, and the path to each video is stored in the database.
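
A possible youtube-dl configuration for this setup is sketched below. The helper names, the output file naming, and the exact format selector are assumptions rather than the settings actually used; `format`, `outtmpl`, and `writeinfojson` are standard youtube-dl options, and the listed statistics are keys youtube-dl commonly reports in its info JSON.

```python
import os


def build_ydl_options(download_dir: str, max_height: int = 360) -> dict:
    """Return youtube-dl options that cap the resolution and save the meta-info."""
    return {
        # Best single-file format whose height does not exceed max_height.
        "format": f"best[height<={max_height}]",
        # Hypothetical naming scheme: one file per video ID.
        "outtmpl": os.path.join(download_dir, "%(id)s.%(ext)s"),
        # Write the video meta-info next to the video as a .info.json file.
        "writeinfojson": True,
    }


def extract_stats(info: dict) -> dict:
    """Pick the statistics of interest; keys missing from the meta-info become None."""
    keys = ("view_count", "like_count", "dislike_count", "duration", "width", "height")
    return {key: info.get(key) for key in keys}

# Usage (requires the youtube_dl package and network access):
#   import youtube_dl
#   with youtube_dl.YoutubeDL(build_ydl_options("videos")) as ydl:
#       ydl.download(["<video url>"])
```

The dictionary returned by `extract_stats` can then be written back to the database entry of the corresponding video.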

##### Problems
During the download process, we unfortunately had to cope with restrictions on how fast we could download videos, possibly caused by DoS protection configured by YouTube. 
This problem had already occurred during the extraction process, where we were able to solve it by waiting some time after each query.
However, downloading puts far more load on YouTube than simply searching for videos.
Hence, the download process in particular was very tedious and required long waiting times as well as experimenting with different configurations. \
This consumed a large amount of valuable time that could have been put to good use otherwise. 
Thus, to avoid these problems, we would recommend that anyone who wants to mine a similar dataset contact YouTube directly for extended API access. 
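
The waiting strategy mentioned above could be sketched as a retry helper with growing pauses. Note that exponential backoff is a technique substituted here for illustration; the protocol only states that waiting and different configurations were tried. The function name and all default values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(action: Callable[[], T], retries: int = 5,
                      base_delay: float = 2.0, factor: float = 2.0,
                      max_delay: float = 300.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action(); on failure wait and retry, multiplying the wait by factor."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    delay = base_delay
    for attempt in range(retries):
        try:
            return action()
        except Exception:
            # Give up after the last attempt and re-raise the error.
            if attempt == retries - 1:
                raise
            sleep(min(delay, max_delay))
            delay *= factor
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter makes it possible to test the helper without actually waiting.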

# 2: Preprocessing


