# Clean-up

In this notebook the scraped data is cleaned up. This step runs after the pictures have been collected.

First, the instances that did not include a valid thumbnail link are removed (they are listed in missing.txt in each subset folder).

Second, the title text is cleaned and standardized to make it ready for vectorization.

In [1]:
import pandas as pd
import os
import urllib.request
import time
from urllib.error import HTTPError
import fasttext
import fasttext.util
import emoji

In [3]:
# Remove instances with a missing thumbnail (IDs listed in missing.txt)

dataPath = './DataFiles/'
imagePath = './Images/'
cleanPath = './CleanedFiles/'

fileNames = os.listdir(dataPath)

for fileName in fileNames:
    
    dataFile = pd.read_csv(dataPath + fileName, lineterminator='\n')
    missingList = pd.read_csv(imagePath + fileName.replace('.csv','/') + 'missing.txt', lineterminator='\n',header=None)[0]
    noMissing = dataFile[~dataFile['ID'].isin(missingList)]
    noMissing.to_csv(cleanPath + fileName, index=False, sep=",")
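
The filtering step above can be tried on in-memory data before running it over the real folders. This is a minimal sketch: the IDs and titles are made-up stand-ins, and `io.StringIO` replaces the CSV file and missing.txt on disk.

```python
import io
import pandas as pd

# Toy stand-ins for one scraped CSV and its missing.txt (hypothetical values).
data_csv = io.StringIO("ID,TITLE\na1,first video\nb2,second video\nc3,third video\n")
missing_txt = io.StringIO("b2\n")

data = pd.read_csv(data_csv)
# header=None because missing.txt is a bare list of IDs, one per line.
missing_ids = pd.read_csv(missing_txt, header=None)[0]

# Keep only the rows whose ID is not listed as missing.
kept = data[~data["ID"].isin(missing_ids)]
print(kept["ID"].tolist())  # ['a1', 'c3']
```

If a subset's missing.txt can be completely empty, `pd.read_csv` raises an `EmptyDataError`, so the loop may need a guard for that case.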


In [4]:
# Remove private and deleted videos

fileNames = os.listdir(cleanPath)

titleBlacklist = ['Private video', 'Deleted video']

for fileName in fileNames:
         
    
    dataFile = pd.read_csv(cleanPath + fileName, lineterminator='\n')
    noMissing = dataFile[~dataFile['TITLE'].isin(titleBlacklist)]
    noMissing.to_csv(cleanPath + fileName, index=False, sep=",")

In [7]:
# Clean up strings to alphanumeric characters only

fileNames = os.listdir(cleanPath)

for fileName in fileNames:
    # read one file
    dataFile = pd.read_csv(cleanPath + fileName, lineterminator='\n')
    
    # replace emojis with their text equivalent
    
    dataFile['TITLE'] = dataFile['TITLE'].apply(emoji.demojize, delimiters=(" ", " "))
    
    #all strings to lowercase
    dataFile['TITLE'] = dataFile['TITLE'].str.lower()
    
    #remove everything except a-z 0-9 and ' '.
    dataFile['TITLE'] = dataFile['TITLE'].replace('[^a-zA-Z0-9 ]', '', regex=True)
    
    # reduce consecutive spaces to a single space
    dataFile['TITLE'] = dataFile['TITLE'].replace(' +', ' ', regex=True)
    
    #remove leading space
    dataFile['TITLE'] = dataFile['TITLE'].replace('^ ', '', regex=True)
    
    # remove trailing space
    dataFile['TITLE'] = dataFile['TITLE'].replace(' $', '', regex=True)
    
    #remove all rows with no title
    dataFile = dataFile[dataFile['TITLE'] != '']
    
    #save the data frame
    dataFile.to_csv(cleanPath + fileName, index=False, sep=",")
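
To see what the regex steps do to a single title, they can be collected in one function. This is a sketch only: `clean_title` is a hypothetical helper that is not part of the notebook, and the emoji step is left out so that the example depends only on the standard library.

```python
import re

def clean_title(title: str) -> str:
    """Hypothetical helper mirroring the regex steps above for one string."""
    text = title.lower()
    # Drop everything except a-z, 0-9 and the space character.
    text = re.sub(r"[^a-z0-9 ]", "", text)
    # Collapse runs of spaces into a single space.
    text = re.sub(r" +", " ", text)
    # Remove the leading and trailing space that may remain.
    return text.strip(" ")

print(clean_title("  Hello,   WORLD!! 100%  "))  # hello world 100
```

A title that consists only of punctuation comes back as an empty string, which is why the rows with an empty title are dropped afterwards.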

In [8]:
# By checking the maximum number of spaces in the titles of each
# data set we can count how many words the longest title contains

fileNames = os.listdir(cleanPath)
count = 1

for fileName in fileNames:
    dataFile = pd.read_csv(cleanPath + fileName, lineterminator='\n')

    print(dataFile['TITLE'].str.count(' ').max())
    
# The maximum number of spaces is 26, so the longest title has 27 words

22
21
20
22
22
21
25
25
21
23
22
20
26
22
21
20
21
20
22
22
23
21
21
22
20
23
24
22
24
20
20
23
20
22
21
23
23
21
20
21
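
Instead of reading the maximum off the printed list, it could also be computed directly. The sketch below uses two small made-up frames in place of the cleaned files and relies on the cleaned titles having single spaces and no padding, so that the word count equals the space count plus one.

```python
import pandas as pd

# Hypothetical cleaned titles standing in for two of the files.
frames = [
    pd.DataFrame({"TITLE": ["one two three", "one"]}),
    pd.DataFrame({"TITLE": ["one two three four five", "one two"]}),
]

# Spaces per title, maximised within each file and then across files.
max_spaces = max(frame["TITLE"].str.count(" ").max() for frame in frames)

# Words = spaces + 1 for titles with single spaces and no padding.
max_words = int(max_spaces) + 1
print(max_words)  # 5
```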


## Vectorisation

The Python library fastText is used together with a pre-trained model available from the fastText website to translate the title strings in the data set into word vector representations.

To speed up the translation, a vocabulary set of all words in the data set is created, and these words are then translated to vector representations. The vocabulary and its translations form a lookup table, which is then used to add a word vector column to the data set.
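
The lookup-table idea described above can be sketched independently of the model. Everything named here is an assumption for illustration: `build_lookup`, `vectorise` and `toy_embed` are hypothetical helpers, and `toy_embed` merely stands in for the fastText model, whose `get_word_vector` method would take its place in the notebook.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Vector = Sequence[float]

def build_lookup(titles: Iterable[str],
                 embed: Callable[[str], Vector]) -> Dict[str, Vector]:
    """Embed every distinct word once and return a word -> vector table."""
    vocabulary = {word for title in titles for word in title.split()}
    return {word: embed(word) for word in vocabulary}

def vectorise(title: str, lookup: Dict[str, Vector]) -> List[Vector]:
    """Translate a title into its list of word vectors via the table."""
    return [lookup[word] for word in title.split()]

def toy_embed(word: str) -> List[float]:
    # Stand-in embedder (assumption): word length and number of "a"s.
    return [float(len(word)), float(word.count("a"))]

titles = ["a cat sat", "a bat"]
lookup = build_lookup(titles, toy_embed)
print(len(lookup))                 # 4 distinct words
print(vectorise("a bat", lookup))  # [[1.0, 1.0], [3.0, 1.0]]
```

The point of the table is that each distinct word is embedded only once, however often it occurs in the titles.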

### Vocabulary

In [9]:
## Compiling the vocabulary

path = './CleanedFiles/'

dataFiles = os.listdir(path)

# Read every cleaned file with the same settings and stack them into one frame
frames = [pd.read_csv(path + dataFile, lineterminator='\n') for dataFile in dataFiles]

completeFrame = pd.concat(frames)
    


In [13]:
completeFrame

Unnamed: 0,ID,TITLE,THUMBNAIL1,THUMBNAIL2,THUMBNAIL3
0,6yvn5ho-L7g,asmr sex sounds male and female moans with con...,https://i.ytimg.com/vi/6yvn5ho-L7g/default.jpg,https://i.ytimg.com/vi/6yvn5ho-L7g/mqdefault.jpg,https://i.ytimg.com/vi/6yvn5ho-L7g/hqdefault.jpg
1,TnAAhh61ykg,moaning mlg sound download,https://i.ytimg.com/vi/TnAAhh61ykg/default.jpg,https://i.ytimg.com/vi/TnAAhh61ykg/mqdefault.jpg,https://i.ytimg.com/vi/TnAAhh61ykg/hqdefault.jpg
2,6D2CewA5SlA,asmr milf handjob orgasm sexy moaning sounds,https://i.ytimg.com/vi/6D2CewA5SlA/default.jpg,https://i.ytimg.com/vi/6D2CewA5SlA/mqdefault.jpg,https://i.ytimg.com/vi/6D2CewA5SlA/hqdefault.jpg
3,4XKDid4uYoY,1 hour moaning,https://i.ytimg.com/vi/4XKDid4uYoY/default.jpg,https://i.ytimg.com/vi/4XKDid4uYoY/mqdefault.jpg,https://i.ytimg.com/vi/4XKDid4uYoY/hqdefault.jpg
4,u04MVJwJjqQ,male moaning sounds,https://i.ytimg.com/vi/u04MVJwJjqQ/default.jpg,https://i.ytimg.com/vi/u04MVJwJjqQ/mqdefault.jpg,https://i.ytimg.com/vi/u04MVJwJjqQ/hqdefault.jpg
...,...,...,...,...,...
11437,_wdWEmKHrIw,asmr mom orgasm sexy moaning sounds asmr videos,https://i.ytimg.com/vi/_wdWEmKHrIw/default.jpg,https://i.ytimg.com/vi/_wdWEmKHrIw/mqdefault.jpg,https://i.ytimg.com/vi/_wdWEmKHrIw/hqdefault.jpg
11438,cmZmA4vNM5I,pillow humping smilingfacewithhearteyes asmr o...,https://i.ytimg.com/vi/cmZmA4vNM5I/default.jpg,https://i.ytimg.com/vi/cmZmA4vNM5I/mqdefault.jpg,https://i.ytimg.com/vi/cmZmA4vNM5I/hqdefault.jpg
11439,ZT5uHSskDIA,girl orgasm asmr sexy moaning sounds fingering,https://i.ytimg.com/vi/ZT5uHSskDIA/default.jpg,https://i.ytimg.com/vi/ZT5uHSskDIA/mqdefault.jpg,https://i.ytimg.com/vi/ZT5uHSskDIA/hqdefault.jpg
11440,-rVom-_x5Vo,4 minutes of real female moaning asmr intense ...,https://i.ytimg.com/vi/-rVom-_x5Vo/default.jpg,https://i.ytimg.com/vi/-rVom-_x5Vo/mqdefault.jpg,https://i.ytimg.com/vi/-rVom-_x5Vo/hqdefault.jpg


In [14]:
vocabulary = set(completeFrame.TITLE.str.cat(sep=' ').split())


In [15]:
len(vocabulary)

200409

### FastText

In [8]:

fasttext.util.download_model('en', if_exists='ignore')

#ft = fasttext.load_model('cc.en.300.bin')

'cc.en.300.bin'

In [None]:
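# Load the 300-dimensional model and reduce it to 100 dimensions
ft = fasttext.load_model('cc.en.300.bin')
print(ft.get_dimension())
fasttext.util.reduce_model(ft, 100)

# Confirm the new dimension and save the reduced model
print(ft.get_dimension())
ft.save_model('cc.en.100.bin')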

### Look-up Dict