According to YouTube's terms of service, a video which "Promis[es] money, products, software, or gaming perks for free if viewers install software, download an app, or perform other tasks" is spam, and posting such videos is against those terms. This project uses machine learning and statistical methods to find videos that fit this description and report them automatically.


The following types of content are not allowed on YouTube. Keep in mind this list isn't a complete list.

    Making exaggerated promises, such as claims that viewers can get rich fast or that a miracle treatment can cure chronic illnesses such as cancer.
    Promoting cash gifting or other pyramid schemes.
    Accounts dedicated to cash gifting schemes.
    Videos that promise "You'll make $50,000 tomorrow with this plan!"

Don’t post content on YouTube if it fits any of the descriptions noted below.

    Links to or promotes third-party services that artificially inflate metrics like views, likes, and subscribers
    Content linking to or promoting third-party view count or subscriber gaming websites or services
    Offering to subscribe to another creator’s channel only if they subscribe to your channel (“sub4sub”)
        Note: You're allowed to encourage viewers to subscribe, hit the like button, share, or leave a comment
    Content featuring a creator purchasing their views from a third party with the intent of promoting the service

Here are some examples of content that’s not allowed on YouTube.

    A video testimonial in which a creator shows themselves successfully purchasing artificial page traffic from a third party
    A video in which a creator links to a third party artificial page traffic provider in a promotional or supportive context. For example: “I got 1 million subscribers on this video in a day and you can too!”
    A video that tries to force or trick viewers into watching another video through deceptive means (for example: a misleadingly labeled info card)
    Channels dedicated to artificial channel engagement traffic or promoting businesses that exist for this sole purpose





In [1]:
import os
import json 
import pandas as pd
import pymongo
#from google.colab import drive
from pymongo import MongoClient
import socket
import urllib.request as urllib2



In [2]:
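#connect to the api
from googleapiclient.discovery import build
#read the API key from the environment instead of hard-coding it in the notebook
#(YOUTUBE_API_KEY is an assumed variable name; set it before running)
gkey=os.environ["YOUTUBE_API_KEY"]
youtube=build('youtube','v3', developerKey=gkey)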

In [3]:
#Get roughly the first 50 top-level comments on a video, plus the replies returned with each
from googleapiclient.errors import HttpError

def video_comments(video_id):
    try:
        counter=0
        # list of [comment, replies] pairs
        commentlist=[]

        # retrieve the first page of comment threads
        video_response=youtube.commentThreads().list(
            part='snippet,replies',
            videoId=video_id
        ).execute()

        # iterate over pages of results
        while video_response:

            for item in video_response['items']:

                # extract the top-level comment text
                comment = item['snippet']['topLevelComment']['snippet']['textDisplay']

                # collect the replies returned with this thread (empty if none)
                replies = []
                if item['snippet']['totalReplyCount']>0:
                    for reply in item.get('replies', {}).get('comments', []):
                        replies.append(reply['snippet']['textDisplay'])

                # store the comment together with its replies
                commentlist.append([comment, replies])
                counter+=1

            # request the next page, passing its token so the same page is not fetched again
            if 'nextPageToken' in video_response and counter<=50:
                video_response = youtube.commentThreads().list(
                        part = 'snippet,replies',
                        videoId = video_id,
                        pageToken = video_response['nextPageToken']
                    ).execute()
            else:
                return(commentlist)
    except HttpError as err:
        # googleapiclient raises HttpError; status 403 is treated as "comments disabled"
        if err.resp.status == 403:
            return("disabled")
        else:
            raise
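
The paging pattern used here (request a page, read `nextPageToken`, request again with that token) can be isolated in a small generator. The sketch below is only an illustration of that pattern, not part of the project: `iter_items`, `fetch_page`, and `FAKE_PAGES` are hypothetical names, and `fetch_page` stands in for a call that wraps something like `youtube.commentThreads().list(...).execute()` and passes the token along.

```python
def iter_items(fetch_page, limit=50):
    """Yield up to `limit` items from a paged API.

    `fetch_page(token)` is a hypothetical callable that returns one response
    dict with an 'items' list and, when more pages exist, a 'nextPageToken'.
    """
    token = None
    yielded = 0
    while True:
        response = fetch_page(token)
        for item in response.get("items", []):
            if yielded >= limit:
                return
            yield item
            yielded += 1
        # follow the token to the next page, or stop when there is none
        token = response.get("nextPageToken")
        if token is None or yielded >= limit:
            return


# Stand-in for the API: three fake pages keyed by page token.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e"]},
}

all_items = list(iter_items(FAKE_PAGES.get))
first_three = list(iter_items(FAKE_PAGES.get, limit=3))
```

Because the limit is checked per item rather than per page, the generator never returns more than `limit` items, whereas a per-page check can overshoot by up to one page.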



In [4]:
#Get a dictionary of variables from one video

def getvars(id,is_spam):
  request=youtube.videos().list(
      id = id,
      part=["snippet","statistics"],
  )
  response=request.execute()
  
  vars={}

  vars["id"]=id
  
  if is_spam not in (0, 1, 2):
     raise ValueError("Second argument must be 0, 1, or 2 (0 for OK, 1 for money scam, 2 for harmful alternative health).")
  vars["spam"]=is_spam
  

  statsvarlist=["commentCount", "dislikeCount","favoriteCount","likeCount","viewCount"]
  snipvarlist=["defaultAudioLanguage","description","tags","title","thumbnails"]

  snipvalues=response["items"][0]["snippet"]
  statsvalues=response["items"][0]["statistics"]

  for item in statsvarlist:
    if item in statsvalues:
      vars[item] = statsvalues[item]

  for item in snipvarlist:
    if item in snipvalues:
      vars[item] = snipvalues[item]

  vars["commentSection"]=video_comments(id)
  return(vars)



In [5]:
#Get up to 50 videos from a username
def get_ids_byuser(username):
  vidIdList=[]
  counter=0
  # look up the channel's uploads playlist
  request=youtube.channels().list(
      forUsername = username,
      part=["contentDetails"]
    
  )
  response=request.execute()
  uploads_id=response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]


  request=youtube.playlistItems().list(
      playlistId = uploads_id,
      part=["contentDetails"]
    
  )
  response=request.execute()
    
    # iterate video response
  while response:
        
        # extracting required info
        # from each result object 
        videoslist = response['items']
        for i in range(len(videoslist)):
          videoid=videoslist[i]["contentDetails"]["videoId"]
          vidIdList.append(videoid)
          counter+=1
  
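        # request the next page while there is one and fewer than ~50 ids are collected
        if 'nextPageToken' in response and counter<=50:
          request=youtube.playlistItems().list(
            playlistId = uploads_id,
            part=["contentDetails"],
            pageToken=response["nextPageToken"]
          )
          # execute the request; without this the loop re-reads the same page
          response=request.execute()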
        else:
            return(vidIdList)
  

    


In [6]:
#Get up to 50 videos from a channel ID
def get_ids_bychannelid(ID):
  vidIdList=[]
  counter=0
  # look up the channel's uploads playlist
  request=youtube.channels().list(
      id = ID,
      part=["contentDetails"]
    
  )
  response=request.execute()
  uploads_id=response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]


  request=youtube.playlistItems().list(
      playlistId = uploads_id,
      part=["contentDetails"]
    
  )
  response=request.execute()
    
    # iterate video response
  while response:
        
        # extracting required info
        # from each result object 
        videoslist = response['items']
        for i in range(len(videoslist)):
          videoid=videoslist[i]["contentDetails"]["videoId"]
          vidIdList.append(videoid)
          counter+=1
  
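        # request the next page while there is one and fewer than ~50 ids are collected
        if 'nextPageToken' in response and counter<=50:
          request=youtube.playlistItems().list(
            playlistId = uploads_id,
            part=["contentDetails"],
            pageToken=response["nextPageToken"]
          )
          # execute the request; without this the loop re-reads the same page
          response=request.execute()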
        else:
            return(vidIdList)
  

    


In [8]:
#connect to mongodb

conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)


db = client.videos_mdb
collection = db["videos"]  

In [9]:
#Define a function to check whether a video is already in the database
def is_inthedb(id):
    return collection.count_documents({"id":id}) > 0

In [38]:
#Define a function to add video info to MongoDB by channel id
def sample_vids_bychannelid(id, is_scam):
  VidIds=get_ids_bychannelid(id)
  for subid in VidIds:
    if not is_inthedb(subid):
      data=getvars(subid,is_scam)
      # insert through the existing connection; `with client:` would close it on exit
      collection.insert_one(data)


In [11]:
#Define a function to add video info to MongoDB by username
def sample_vids_byusername(username, is_scam):
  VidIds=get_ids_byuser(username)
  for id in VidIds:
    if not is_inthedb(id):
      # fetch by video id (the original passed the username here by mistake)
      data=getvars(id,is_scam)
      # insert through the existing connection; `with client:` would close it on exit
      collection.insert_one(data)
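
The check-then-insert pattern in the two functions above costs two round trips per video. One possible alternative, sketched here as an assumption rather than as the project's method, is a single upsert keyed on `id` with `$setOnInsert`, which leaves an existing record untouched. `save_video` is a hypothetical helper; `update_one(..., upsert=True)` and the `upserted_id` attribute of its result are standard PyMongo.

```python
def save_video(collection, record):
    """Insert `record` unless a document with the same "id" already exists.

    `collection` is expected to behave like a pymongo Collection.
    Returns True when a new document was written, False when one was already stored.
    """
    result = collection.update_one(
        {"id": record["id"]},        # match on the YouTube video id
        {"$setOnInsert": record},    # fields are written only when inserting
        upsert=True,
    )
    # upserted_id is None when an existing document matched the filter
    return result.upserted_id is not None
```

In the notebook this would be called as `save_video(collection, getvars(subid, is_scam))`. Note that it still calls the API for every video, so the `is_inthedb` check remains useful when API quota is the concern.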

In [42]:
#Manually search by name to find a channel's channel ID
request=youtube.search().list(
    q= "zork",
    part=["id", "snippet"],
    maxResults=5

)
response=request.execute()
response

{'kind': 'youtube#searchListResponse',
 'etag': '3_98CvV4TrFs7lVEPf_EDdmcy_Y',
 'nextPageToken': 'CAUQAA',
 'regionCode': 'US',
 'pageInfo': {'totalResults': 108995, 'resultsPerPage': 5},
 'items': [{'kind': 'youtube#searchResult',
   'etag': '0l3TzkcH5HTjIrFlKTBdrjhLx1I',
   'id': {'kind': 'youtube#video', 'videoId': 'PWQDccL0aXM'},
   'snippet': {'publishedAt': '2017-03-18T02:57:51Z',
    'channelId': 'UCY7icYxhAoo8LleOzWBw_6w',
    'title': 'Let&#39;s Play - Zork I: The Great Underground Empire',
    'description': 'A 1980s classic text adventure! Enjoy as the GH rips through it with some ambient music to break the silence.',
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/PWQDccL0aXM/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/PWQDccL0aXM/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'high': {'url': 'https://i.ytimg.com/vi/PWQDccL0aXM/hqdefault.jpg',
      'width': 480,
      'height': 360}},
 

In [12]:
#Function to fetch individual spam videos 
def getone(id, is_spam):
    if is_inthedb(id)==True:
        print("Already got it.")
    else:
        get=getvars(id,is_spam)
        db = client.videos_mdb
        db.videos.insert_one(get)

In [46]:
#Sampling non-scams 
#sample_vids_bychannelid("UCm22FAXZMw1BaWeFszZxUKw",0)  #Kitboga - complete
#sample_vids_byusername("SmoshGames",0)
#sample_vids_bychannelid("UCY30JRSgfhYXA6i6xX1erWg",0)  #Markiplier - complete
#sample_vids_bychannelid("UCrPUg54jUy1T_wII9jgdRbg", 0) #Chris Ramsay
#sample_vids_bychannelid("UCnZj2VMd3IdyzIKOJCK4VlA",0)
#---Non-scam videos that have scam-like qualities such as relevant keywords or legitimate product advertisements (so the machine can make finer distinctions)
#
#sample_vids_bychannelid("UCm22FAXZMw1BaWeFszZxUKw",0) #--Kitboga
#sample_vids_bychannelid("UCTCpOFIu6dHgOjNJ0rTymkQ",0) # as seen on TV -- complete
#sample_vids("Jim Browning",0)
#sample_vids("Freakin' Reviews",0)
#sample_vids("Chadtronic",0)
#sample_vids_bychannelid("UCZLFu8bHbwtnIgWLg5UtINw",0)#VidIQ. This one is important, because it shows you how to get subscribers the right way (rather than selling giftcards or subs for subs). --complete#

#---Normal non-scam videos
#sample_vids("Insym",0)
#sample_vids_bychannelid("UCnmgSO_4g6QcRzy0yFeglyA",0)#Grand Illusions --complete
#sample_vids_bychannelid("UC1VLQPn9cYSqx8plbk9RxxQ",0)# The action lab --complete
#sample_vids("Chris Ramsay",0)
goodlist=["tYQY1UKDLFM","PWQDccL0aXM","eoxJWJaA1gc","PCimuf6F6C8","7rfVE2JvLqA"]
moregoods=["sJqyaSV6E7c"]
for item in goodlist:
    try:
        getone(item,0)
    except Exception as err:
        #report the failure instead of silently skipping the video
        print(f"Skipped {item}: {err}")




In [23]:
#Sampling scams


#sample_vids_bychannelid("UCAT3-AQKNU0ITQXnjLOoDWA",1) #Wesley Virgin - complete
#sample_vids_bychannelid("UCBnYn54boCxNoob5DXsy_ag",1) #Digital Millionaire -complete
#sample_vids_bychannelid("UClVGRVvggdqZT02kjiVt0IQ",1) #Dave Nick --complete
sample_vids_bychannelid("UCC2Sqxq54NVM87b-gV1mKag",1)  #finance girl

#getone("KH-i2P92bS4",1)
#-----get all from "finance girl"
#spamlist=["jK7xrgOeOQg","feAfP_MWz5g","gnNyEBdBkO4","H-i2P92bS4","6RLYyE3dDLE","mAg7Qs-XifE","1DL1xnmkbJM","QDfXqGn4Bmo&t=138s","UPkEZ0Rl11k&t=169s","h6sbddOaI88","Q7O5aKKm4uM","B75deAvCw9o","Y5osifxyCSU&t=55s", "97Z37QRz2a4", "YgEKUE9vwPk","QuS0HqXx9sI","K-n06-1eS2A","Gs9saVFwyno","QpuVeL9IKCk","6ixsQInp11U","i-Lg4efoOJY","PvOi6uxtLIc","WsZllvBTNvA","cpEA1050d_s","qHWfP56Xo2E","ixpI0jBM_ps","z-6Ol9Gu2Bw","5C6GrJzc5zM","aj5WXrUfb0U","Gs9saVFwyno","SUFrcNpFYaw","lGTZf4AuaRY","ayXCcOOFWCk","fyzUXGmccKI","T7aPZo09ToM","gU_D_SkxpOY","XNE0jrfZs28","qvG7TDnzuf0","clzqH8jXlhY","SV_4nhfKKYo","apXcI7QnTzk","ioT5aRzPWFY","p5TUdv4G1K0","M9YKeor__8A&t=8s","dnsVDw5NqPs","Jko7MDeAzzo","umbpntB65uc","82hRKeuZav0","1XgVTx4j4c0"]
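
Several of the hand-collected IDs in the list above still carry URL parameters such as `&t=138s`, and one entry (`H-i2P92bS4`) is a character shorter than the rest. A small normalizer, sketched below, could catch both before the IDs reach `getone`. `clean_video_id` and `VIDEO_ID_PATTERN` are hypothetical names, and the 11-character pattern is an assumption based on how the IDs in this notebook look, not a documented guarantee.

```python
import re

# Assumption: video IDs are 11 characters drawn from letters, digits, "-" and "_".
VIDEO_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{11}$")


def clean_video_id(raw):
    """Strip URL parameters such as "&t=138s".

    Returns the bare ID, or None if what is left does not look like a video ID.
    """
    candidate = raw.strip().split("&", 1)[0]
    return candidate if VIDEO_ID_PATTERN.match(candidate) else None


samples = ["QDfXqGn4Bmo&t=138s", "jK7xrgOeOQg", "H-i2P92bS4"]
cleaned = [clean_video_id(s) for s in samples]
# cleaned == ["QDfXqGn4Bmo", "jK7xrgOeOQg", None]
```

Entries that come back as `None` can then be reviewed by hand rather than sent to the API.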



In [19]:
data=pd.read_csv("imagedata.csv")

In [43]:
#Manually search by name to find a channel's channel ID
request=youtube.search().list(
    q= "Zork",
    part=["id", "snippet"],
    maxResults=5

)
response=request.execute()


In [44]:
response

{'kind': 'youtube#searchListResponse',
 'etag': '1VOfpCSs2i6WzzGhL-gRp4eZc08',
 'nextPageToken': 'CAUQAA',
 'regionCode': 'US',
 'pageInfo': {'totalResults': 108977, 'resultsPerPage': 5},
 'items': [{'kind': 'youtube#searchResult',
   'etag': '0l3TzkcH5HTjIrFlKTBdrjhLx1I',
   'id': {'kind': 'youtube#video', 'videoId': 'PWQDccL0aXM'},
   'snippet': {'publishedAt': '2017-03-18T02:57:51Z',
    'channelId': 'UCY7icYxhAoo8LleOzWBw_6w',
    'title': 'Let&#39;s Play - Zork I: The Great Underground Empire',
    'description': 'A 1980s classic text adventure! Enjoy as the GH rips through it with some ambient music to break the silence.',
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/PWQDccL0aXM/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/PWQDccL0aXM/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'high': {'url': 'https://i.ytimg.com/vi/PWQDccL0aXM/hqdefault.jpg',
      'width': 480,
      'height': 360}},
 

In [None]:
#TO DO

#replace missing comment counts with 0, but add a new dichotomous variable for disabled comments
#recode default audio language to be numeric
#delete repeated comments. I think there was an error collecting them at one point
#document that a missing default audio language is not recorded as missing data but as its own category, because a video lacking a default language might be a relevant predictor of scam status
#be POSITIVE that video and channel id do not end up as variables
#add vars likecount/viewcount, dislikecount/viewcount
#eliminate the "favoritecount" variable. Its value is either always 0, or 0 too often, and will create skew.
#drop duplicate records
#handle coding of disabled comment sections --- fixed in retrieval, coded as "disabled", but may need more handling
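
Several of the items above can be sketched as one cleaning pass over the stored records. This is a rough illustration under stated assumptions, not the project's final preprocessing: `prepare_record` and `drop_duplicates` are hypothetical helpers, and the output names `commentsDisabled`, `likeRatio`, `dislikeRatio`, and the `"none"` language category are placeholders chosen for the example.

```python
def prepare_record(video):
    """Return a cleaned feature dict for one stored video record."""
    # Counts arrive as strings in the API response, so convert them.
    views = int(video.get("viewCount", 0))
    likes = int(video.get("likeCount", 0))
    dislikes = int(video.get("dislikeCount", 0))

    out = {}
    # video_comments() stores the string "disabled" when comments are off.
    out["commentsDisabled"] = int(video.get("commentSection") == "disabled")
    # A missing comment count becomes 0; the dummy above keeps the distinction.
    out["commentCount"] = int(video.get("commentCount", 0))
    # A missing language is kept as its own category, not as missing data.
    out["defaultAudioLanguage"] = video.get("defaultAudioLanguage", "none")
    # Ratios, guarding against division by zero for videos with no views.
    out["likeRatio"] = likes / views if views else 0.0
    out["dislikeRatio"] = dislikes / views if views else 0.0
    out["spam"] = video["spam"]
    # "id" and "favoriteCount" are deliberately left out of the features.
    return out


def drop_duplicates(videos):
    """Keep the first record seen for each video id."""
    seen = set()
    unique = []
    for video in videos:
        if video["id"] not in seen:
            seen.add(video["id"])
            unique.append(video)
    return unique


raw = [
    {"id": "x", "spam": 1, "viewCount": "200", "likeCount": "50",
     "commentSection": "disabled", "favoriteCount": "0"},
    {"id": "x", "spam": 1, "viewCount": "200", "likeCount": "50",
     "commentSection": "disabled", "favoriteCount": "0"},
]
features = [prepare_record(v) for v in drop_duplicates(raw)]
```

Recoding the language category to numbers and removing repeated comments are left out here; both depend on decisions the list above has not settled yet.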

In [None]:
#step 0: collect
#step 1: have one or more nested machine learning algorithms predict the spam dummy from qualitative vars like title, tags, and comments
#step 2: have a parent algorithm (or regression equation) predict spamminess from the other values plus the output of the nested predictions as factors

False

In [11]:
data

<pymongo.cursor.Cursor at 0x7fec9cdce340>

In [None]:
#scams
7yYvCIUjx7o
_vcBDMq6PkM
784LEikg8_o #--malicious

In [None]:
#non-scam giveaways that follow YT's rules decently well
_ltiL-AyRAk
3gh1AdtQKWQ

In [None]:
#Well-meaning people not following the giveaway rules 
BOByZjhFRmw
cEBNadvCJbs
KfArrqQtao0
iJj_Ikx_uDY
aAtJ1zC2LPk #Jazza
005ON0SKk9Q #Jazza
m6gLe45zSsw
-dBwZdc_c0M
Co0_HVab0vw
VBozk2qZEpg

In [None]:
#doesn't follow giveaway rules and is also kinda sus, but not obviously horrible
YHn1xTy-uY0

In [None]:
#video does not match description, and also offers free stuff
J90cJfKlhEY

In [None]:
#do nothing and make money videos
EkLFsL4KevU