## Web Scraping data

The ``requests`` module is used to fetch the source code of the required web-site.

The ``BeautifulSoup`` module is used to parse the HTML page (using any suitable parser) and pull data out of it.

### Importing necessary libraries

In [1]:
import requests
from bs4 import BeautifulSoup as bs
from time import sleep
import pandas as pd

### Trying out data scraping on my friend's web-site

Learning web scraping with the BeautifulSoup and requests libraries to get data from web-sites, starting with my friend's portfolio site.

In [2]:
# defining link variable to store web-site's URL
link = "http://bhupeshv.me"

In [3]:
# requesting the site; the returned response holds the page source in req.text
req = requests.get (link)

# 200 means request went through
print (req, type (req))

<Response [200]> <class 'requests.models.Response'>


#### Parsing the web-site
``BeautifulSoup ()`` (``bs`` in the code) parses the HTML text of the ``requests`` response (``req.text``) with the ``lxml`` parser and returns a ``BeautifulSoup`` object (``soup`` in the code).

In [4]:
# parsing the site using lxml parser
soup = bs (req.text, "lxml")

# prettify () used to print source code with proper nesting, easier to read
print (soup.prettify (), type (soup))

<!DOCTYPE HTML>
<html>
 <head>
  <title>
   Bhupesh Varshney
  </title>
  <meta charset="utf-8"/>
  <meta content="Welcome Its Me Bhupesh Varshney" name="description"/>
  <meta content="Bhupesh,Varshney,Bhupesh Varshney,bhupesh,bhupeshv,BHUPESH" name="keywords"/>
  <meta content="Bhupesh Varshney" name="author"/>
  <meta content="width=device-width, initial-scale=1, user-scalable=no" name="viewport"/>
  <link href="static/favicon.ico" rel="icon" type="image/x-icon"/>
  <link href="static/main.css" rel="stylesheet"/>
 </head>
 <body class="is-preload">
  <!-- Wrapper -->
  <div id="wrapper">
   <!-- Main -->
   <section id="main">
    <header>
     <span class="avatar">
      <img height="300" src="static/profile.jpg" width="300"/>
     </span>
     <h1>
      Bhupesh Varshney
     </h1>
     <p>
      Open Source Enthusiast 🤓 &amp; Code Pervert 😜 .
     </p>
     <a class="button" href="blog.html">
      Checkout My Blogs
     </a>
     <br/>
     <br/>
     <a class="button" href="htt

#### Finding specific elements
``find ()`` returns the first tag/element that matches.

``find_all ()`` returns all the instances of the specified tag/element.

In [5]:
print (soup.find ("a", class_="button"))
print (soup.find ("a", class_="button").text)

<a class="button" href="blog.html">Checkout My Blogs</a>
Checkout My Blogs


In [6]:
# using a separate loop variable so the URL stored in `link` is not overwritten
for button in soup.find_all ("a", class_="button"):
    print (button.text + ": " + button['href'])

Checkout My Blogs: blog.html
Resume: https://drive.google.com/file/d/1YlOHhvkLCwEhY1iFsHppVOmtMWeKtp4X/view?usp=sharing
Available: https://github.com/Bhupesh-V/HeckChat
Available: https://github.com/Bhupesh-V/Algorithms
On Going: https://github.com/Bhupesh-V/CoderBot
Available: https://github.com/Bhupesh-V/Learn-Python-Packages
Available: https://github.com/Bhupesh-V/30-Seconds-Of-STL
Available: https://github.com/Bhupesh-V/EmployeeAPI


In [7]:
print (soup.find ("section").text)




Bhupesh Varshney
Open Source Enthusiast 🤓 & Code Pervert 😜 .
Checkout My Blogs
Resume



              Hi, I am Bhupesh Varshney, 19 year old Student Developer, 2nd year Computer Science student living in New Delhi, India. I code in Python,C++,Golang,C & JavaScript.I use Linux, jupyter, VSCode as my developer environment.
            


Projects


Name 
Status


HeckChat
Available


Algorithms
Available


CoderBot
On Going


Learn-Python-Packages
Available


30-Seconds-Of-STL
Available


EmployeeAPI
Available



Frameworks & Technologies










My Weapons









Share Your Views!

















Find me on 





























### Scraping data from YouTube

In [8]:
# declaring variables which will be needed to create the specified links

yt_link = "https://www.youtube.com"
categories = ["travel+blog", "science+and+technology", "food", "manufacturing", "history", "art+and+music"]
searchstr = "/results?search_query="
only_vids = "&sp=EgIQAQ%253D%253D"

# printing the links that will be generated
for category in categories:
    print (yt_link + searchstr + category + only_vids)

https://www.youtube.com/results?search_query=travel+blog&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=science+and+technology&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=food&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=manufacturing&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=history&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=art+and+music&sp=EgIQAQ%253D%253D
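
The links above are built by plain string concatenation, with the `+` signs and the filter value typed in already escaped. As a rough sketch, the same URLs could be produced with `urllib.parse.urlencode` from the standard library, which does the escaping itself. The helper name `search_url` is made up for this example, and the default `sp` value is just the filter string from the cell above with one layer of percent-encoding removed (an assumption about where that string came from).

```python
from urllib.parse import urlencode

YT_LINK = "https://www.youtube.com"


def search_url(query: str, sp: str = "EgIQAQ%3D%3D") -> str:
    """Build a results-page URL; urlencode turns spaces into '+' and '%' into '%25'."""
    return YT_LINK + "/results?" + urlencode({"search_query": query, "sp": sp})


# categories written with plain spaces instead of hand-escaped '+'
for category in ["travel blog", "science and technology", "food"]:
    print(search_url(category))
```

Written this way, a category containing characters such as `&` or `#` would still give a valid query string.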


#### Starting with the Travel Blog category

In [9]:
travel_blog = yt_link + searchstr + categories[0] + only_vids

print (travel_blog)

req = requests.get (travel_blog)
print (req)

https://www.youtube.com/results?search_query=travel+blog&sp=EgIQAQ%253D%253D
<Response [200]>


In [10]:
soup = bs (req.text, "lxml")

# printing all links found at the requested page
# for link in soup.find_all ("a"):
#     print (link.prettify ())

In [11]:
# printing the source code of the web-page in readable format
# print (soup.prettify ())

In [12]:
# using class_ (CSS class of the element) to find all video-links, descriptions

links = soup.find_all ("a", class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link ")

descriptions = soup.find_all ("div", class_="yt-lockup-description yt-ui-ellipsis yt-ui-ellipsis-2")

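# 20 video results in one page
# zip () pairs the two lists by position, so titles and descriptions get misaligned
# when a result shows up in only one of the lists (as in the output below)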
for link, description in zip (links, descriptions):
    print ("Video-Title: " + link.text)
    print ("Description: " + description.text)
    print ("Video-ID: " + link["href"].split("=")[1])
    print ()

Video-Title: TRAVEL VLOG ∙ Welcome to Bali | PRISCILLA LEE
Description: Learn how to set up a blog Follow this Step-By-Step Tutorial
Video-ID: i9E_Blai8vk

Video-Title: How do I travel so much ! How do I earn money!!
Description: I had the chance to fly out to Bali with my whole family this Thanksgiving for our first ever trip abroad! I wanted to document my ...
Video-ID: e2NQE41J5eM

Video-Title: GOA TRAVEL DIARY | FOUR DAYS IN GOA | TRAVEL OUTFIT IDEAS
Description: SUBSCRIBE - https://goo.gl/dEtSMJ ('MountainTrekker') Gimbal - https://goo.gl/Frwci2 If you have any other query feel free to ask ...
Video-ID: -LzdIILq5vE

Video-Title: 200 Days - A Trip Around the World Travel Film
Description: Hope you enjoy MY GOA TRAVEL DIARY this video! Don't forget to subscribe and like this video Thank you for your support, love ...
Video-ID: RcmrbNRK-jY

Video-Title: How to Start a Travel Blog [2019] Travel Blogging Full-Time
Description: My wife and I traveled to 17 countries in 200 days. This fi
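
The video ID above is taken with `link["href"].split("=")[1]`, which works while the href has exactly one query parameter. A slightly more defensive sketch, using only the standard library, reads the `v` parameter by name. The function name `video_id` and the second and third hrefs are hypothetical examples, not values seen on the page.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def video_id(href: str) -> Optional[str]:
    """Return the value of the 'v' query parameter, or None if the href has none."""
    values = parse_qs(urlparse(href).query).get("v")
    return values[0] if values else None


print(video_id("/watch?v=i9E_Blai8vk"))
print(video_id("/watch?v=i9E_Blai8vk&list=PL123"))  # hypothetical href with an extra parameter
print(video_id("/channel/abc"))                     # hypothetical href without a video ID
```

With the split-based version, the second href would give `i9E_Blai8vk&list` and the third would raise an `IndexError`.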

In [13]:
# descriptions on the page are incomplete
for des in soup.find_all ("div", class_="yt-lockup-description yt-ui-ellipsis yt-ui-ellipsis-2"):
    print (des.text)
    print ()

Learn how to set up a blog Follow this Step-By-Step Tutorial

I had the chance to fly out to Bali with my whole family this Thanksgiving for our first ever trip abroad! I wanted to document my ...

SUBSCRIBE - https://goo.gl/dEtSMJ ('MountainTrekker') Gimbal - https://goo.gl/Frwci2 If you have any other query feel free to ask ...

Hope you enjoy MY GOA TRAVEL DIARY this video! Don't forget to subscribe and like this video Thank you for your support, love ...

My wife and I traveled to 17 countries in 200 days. This film is the story of our incredible trip! Enjoy! We used a GoPro and a Nikon ...

Create a Travel Blog Website for Just $3.95++ http://bit.ly/create-a-travel-blog-websiteI've been a full-time blogger ...

THIS IS MY FAVOURITE COUNTRY I HAVE EVER TRAVELED! You simply must put the Philippines on your bucket list. The most ...

Travel blogger, Nikki Vargas, of The Pin the Map Project (voted top 100 travel blog in 2015) shares her best tips to getting started ...

Read the full 

#### Issue:
One results page holds only about 20 videos, so it is not yet clear how to get 300+ results for each category.

#### Fix:

Use the pagination links to request the next page of results.

In [14]:
# CSS classes of YouTube's pagination buttons
buttons = soup.find_all ("a", attrs={"class": "yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default"})

# <a aria-label="Go to page 2" class="yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default" data-sessionlink="itct=CAcQnKQBGAciEwjY6q2r8_TgAhUTJmgKHbHYACwo9CQ" data-visibility-tracking="CAcQnKQBGAciEwjY6q2r8_TgAhUTJmgKHbHYACwo9CQ" href="/results?sp=EgIQAUgU6gMA&amp;search_query=travel+blog">
#  <span class="yt-uix-button-content">
#   Next »
#  </span>
# </a>

In [15]:
# printing all pagination buttons on the site
for button in buttons:
    print (button["href"])
    
# getting Next button (last pagination button for going to next page)
next_button = buttons[-1]["href"]

# creating link for next page
next_button = "https://www.youtube.com" + str(next_button)
print (next_button)

/results?search_query=travel+blog&sp=EgIQAUgU6gMA
/results?sp=EgIQAUgo6gMA&search_query=travel+blog
/results?sp=EgIQAUg86gMA&search_query=travel+blog
/results?search_query=travel+blog&sp=EgIQAUhQ6gMA
/results?sp=EgIQAUhk6gMA&search_query=travel+blog
/results?sp=EgIQAUh46gMA&search_query=travel+blog
/results?sp=EgIQAUgU6gMA&search_query=travel+blog
https://www.youtube.com/results?sp=EgIQAUgU6gMA&search_query=travel+blog
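
The pagination hrefs printed above are relative paths, and the cell turns them into full links by prefixing the site address. As a small sketch, `urllib.parse.urljoin` from the standard library does the same job and also leaves an href alone if it is already absolute. The helper name `absolute` is an assumption made for this example.

```python
from urllib.parse import urljoin

BASE = "https://www.youtube.com"


def absolute(href: str) -> str:
    """Resolve a (possibly relative) href against the site's base URL."""
    return urljoin(BASE, href)


print(absolute("/results?sp=EgIQAUgU6gMA&search_query=travel+blog"))
print(absolute("https://www.youtube.com/watch?v=e2NQE41J5eM"))  # already absolute, returned as-is
```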


In [16]:
# making request to the next page linked by the Next button
nextreq = requests.get (next_button)

# printing the response of the next-page request (not the earlier `req`)
print (nextreq)

<Response [200]>


In [17]:
# parsing the next page
nextsoup = bs (nextreq.text, "lxml")

In [18]:
count = 1

# finding the video links in the next page
for link in nextsoup.find_all ("a", class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link "):
    print ("Video-Title: " + link.text + " Count = " + str (count))
    print ("Video-ID: " + link["href"].split("=")[1])
    print ()
    count += 1

Video-Title: Beautiful Thailand by Travel Blogger @joaocajuda Count = 1
Video-ID: rlCVwA2A5UA

Video-Title: Paris Vacation Travel Guide | Expedia Count = 2
Video-ID: AQ6GmpMu5L8

Video-Title: Top 10 YouTube Travellers Count = 3
Video-ID: 942sCY6d7kU

Video-Title: Thailand Vlog Count = 4
Video-ID: 41ssBujAS0U

Video-Title: Miami Vacation Travel Guide | Expedia Count = 5
Video-ID: 58iT2L4VQj4

Video-Title: A WEEK IN PARIS | travel vlog Count = 6
Video-ID: tKXrpRrj7Ow

Video-Title: Oslo, Flam, & Bergen Norway Travel Vlog | As Told By Count = 7
Video-ID: t2Y71Tv5Nos

Video-Title: Ajmer Shatabdi Express : My FIRST Indian Train Journey - Delhi to Jaipur ( FULL JOURNEY) 🇮🇳 Count = 8
Video-ID: eU9CXazi_3c

Video-Title: Taiwan Travel Guide: A 3-Day Itinerary | Taipei + Day Tours Count = 9
Video-ID: _ZU-bP6Z9S4

Video-Title: Exploring Hong Kong: A 5-Day Itinerary (Macau Included!) Count = 10
Video-ID: v3RLM9wt5s8

Video-Title: SIARGAO, PHILIPPINES TRAVEL GUIDE (budget & itinerary) Count = 11
Vid

In [19]:
# source code copied from a blog (works, but doesn't give exactly what is needed)

source = requests.get(travel_blog)
soup = bs (source.text, 'lxml')


for content in soup.find_all('div', class_= "yt-lockup-content"):
    
    # description may be empty and might throw exception
    try:
        
        title = content.h3.a.text
        print(title)

        description = content.find('div', class_="yt-lockup-description yt-ui-ellipsis yt-ui-ellipsis-2").text
        print(description)

    except Exception as e:
        
        description = None

    print('\n')

How to Start a Blog in 2019 Step-By-Step
Learn how to set up a blog Follow this Step-By-Step Tutorial


TRAVEL VLOG ∙ Welcome to Bali | PRISCILLA LEE
I had the chance to fly out to Bali with my whole family this Thanksgiving for our first ever trip abroad! I wanted to document my ...


How do I travel so much ! How do I earn money!!
SUBSCRIBE - https://goo.gl/dEtSMJ ('MountainTrekker') Gimbal - https://goo.gl/Frwci2 If you have any other query feel free to ask ...


GOA TRAVEL DIARY | FOUR DAYS IN GOA | TRAVEL OUTFIT IDEAS
Hope you enjoy MY GOA TRAVEL DIARY this video! Don't forget to subscribe and like this video Thank you for your support, love ...


200 Days - A Trip Around the World Travel Film
My wife and I traveled to 17 countries in 200 days. This film is the story of our incredible trip! Enjoy! We used a GoPro and a Nikon ...


How to Start a Travel Blog [2019] Travel Blogging Full-Time
Create a Travel Blog Website for Just $3.95++ http://bit.ly/create-a-travel-blog-websiteI've

In [20]:
# scraping description from a video's site (where video is played)
url = "https://www.youtube.com/watch?v=e2NQE41J5eM"

req_video = requests.get (url)

soup_video = bs (req_video.text, "lxml")

In [21]:
# getting video title
title = soup_video.find ("h1", class_="watch-title-container").text.split ("\n")[2].rsplit ()

formatted_title = ""

for w in title:
    formatted_title += w + " "
    
print ("Video title: ")
print (formatted_title)

# <h1 class="title style-scope ytd-video-primary-info-renderer"><yt-formatted-string force-default-style="" class="style-scope ytd-video-primary-info-renderer">How do I travel so much ! How do I earn money!!</yt-formatted-string></h1>

Video title: 
How do I travel so much ! How do I earn money!! 


In [22]:
# getting video description
print ("Video description: ")

print (soup_video.find ("p", id="eow-description").text)

Video description: 
SUBSCRIBE - https://goo.gl/dEtSMJ (‘MountainTrekker’)Gimbal - https://goo.gl/Frwci2If you have any other query feel free to ask at - www.facebook.com/groups/touristhelpline (It may not be possible for me to answer each and every query here, but other group members, travelers, and travel experts can help you)Other travel series  - # THAILAND playlist - https://goo.gl/dOUJck # EUROPE Playlist - https://goo.gl/Tlx9mJ# BANGLADESH playlist -  https://goo.gl/uw1y1v# SPITI (India) playlist - https://goo.gl/xqvvQ6# MALAYSIA playlist - https://goo.gl/2a3doKPLEASE SHARE THE VIDEOS AND LET OTHERS GET INFORMED ABOUT THIS CHANNEL My blog: www.touristhelpline.comFACEBOOK.com/page.mountaintrekkerTWITTER.com/mttrekkerindia
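
In the output above the lines of the description run into each other (for example `Frwci2If you have`). That is probably because the line breaks in the markup are `<br>` tags, which `.text` drops. Below is a minimal sketch of one way around it, using `get_text ()` with a separator and an explicit check for a missing element. The markup is a made-up stand-in for a watch page, `description_text` is a hypothetical helper, and `html.parser` is used only so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup


def description_text(soup: BeautifulSoup) -> str:
    """Return the description with one line per <br>-separated piece, or '' if it is missing."""
    node = soup.find("p", id="eow-description")
    if node is None:  # find () returns None when nothing matches
        return ""
    return node.get_text(separator="\n", strip=True)


# hypothetical markup, shaped like the description paragraph scraped above
html = '<p id="eow-description">SUBSCRIBE - example.com<br/>Gimbal - example.org</p>'
print(description_text(BeautifulSoup(html, "html.parser")))

# a page without the description paragraph gives an empty string instead of an exception
print(repr(description_text(BeautifulSoup("<p>nothing here</p>", "html.parser"))))
```

The `None` check covers the same case that the try-except blocks in this notebook are there for, but without hiding other errors.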


In [23]:
# trying something out
url = "https://www.youtube.com/results?sp=EgIQAQ%253D%253D&search_query=travel+blogs"

requestURL = requests.get (url)

URLsoup = bs (requestURL.text, "lxml")

In [24]:
# creating a pandas dataframe so that writing to a csv file is easy
YTDf = pd.DataFrame (columns=["Video ID", "Title", "Description", "Category"])

# creating lists for storing video's ID, Title, Description, Category
ytid = []
yttitle =  []
ytdes = []
ytcat = []

YTDf

Unnamed: 0,Video ID,Title,Description,Category


In [25]:
links = URLsoup.find_all ("a", class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link ")

# temporary link for making request to video's site
templink = yt_link

for link in links:
    print ("Video-Title: " + link.text)
    print ("Video-ID: " + link["href"].split("=")[1])
    
    ytid.append (link["href"].split("=")[1])
    yttitle.append (link.text)
    
    # creating link to a video
    templink += link["href"]
    
    # making request to that video's page
    tempreq = requests.get (templink)
    tempsoup = bs (tempreq.text, "lxml")
    
    # scraping full video description
    print ("Video description: ")
    print (tempsoup.find ("p", id="eow-description").text)
    
    ytdes.append (tempsoup.find ("p", id="eow-description").text)
    ytcat.append ("travel")
    print ()

    # sleeping between requests so as NOT to get flagged by YouTube (requests fired at inhuman speed look like a DoS attack)
    sleep (2)
    templink = yt_link

Video-Title: TRAVEL VLOG ∙ Welcome to Bali | PRISCILLA LEE
Video-ID: i9E_Blai8vk
Video description: 
I had the chance to fly out to Bali with my whole family this Thanksgiving for our first ever trip abroad! I wanted to document my memories so that I can always remember how blessed I am to have this opportunity and also share it with those who have never been to Bali.I am by no means anything close to a professional so please excuse the shaky hand and enjoy!Check out my Bali blog post on my website -http://www.leepriscilla.com/blog/sx9n...LOCATIONS:1. The Viceroy Bali (Ubud)https://www.viceroybali.com/en/introd...2. AYANA Resort & Spa (Jimbaran)http://www.ayana.com/CAMERA:Canon G9XMUSIC:Like more music and original songs?Check out David's channel: https://bit.ly/2Hhj3TB1. Naulé - Motionhttps://soundcloud.com/naulemusic/motion2. Pogo - Bloomhttps://www.youtube.com/watch?v=t_hto...3. Tom Misch - Lush Lifehttps://soundcloud.com/tommisch/lush-...4. Static Love - 1986https://soundcloud.com/
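
The loop above spaces out its requests with a fixed `sleep (2)`. As a sketch of the same idea, the small class below only waits for whatever is left of the interval, so slow responses are not padded with extra delay. The class name `Throttle` and the 0.2 second interval are assumptions made for the demonstration. No request is sent here.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between consecutive wait () calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # time of the previous call; None before the first one

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a requests.get (...) call would follow here
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```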

In [26]:
YTDf = pd.DataFrame (ytid, columns=["Video ID"])
YTDf["Title"] = yttitle
YTDf["Description"] = ytdes
YTDf["Category"] = ytcat
YTDf

Unnamed: 0,Video ID,Title,Description,Category
0,i9E_Blai8vk,TRAVEL VLOG ∙ Welcome to Bali | PRISCILLA LEE,I had the chance to fly out to Bali with my wh...,travel
1,e2NQE41J5eM,How do I travel so much ! How do I earn money!!,SUBSCRIBE - https://goo.gl/dEtSMJ (‘MountainTr...,travel
2,942sCY6d7kU,Top 10 YouTube Travellers,"Hello, and welcome to TopX, where we count dow...",travel
3,yvn79Rv0F48,Backpacking In Meghalaya | NorthEast India Tri...,"In this video I explored North East India, sta...",travel
4,RcmrbNRK-jY,200 Days - A Trip Around the World Travel Film,My wife and I traveled to 17 countries in 200 ...,travel
5,u-UA8t2EVpA,Top 10 MOST BEAUTIFUL Places in IRELAND | Esse...,Traveling to Ireland or Northern Ireland? Fro...,travel
6,EthqIhPtd2I,"TRAVEL VLOG: SANTORINI, GREECE",Thank you so much for watching! I hope you fou...,travel
7,7ByoBJYXU0k,5 Steps to Becoming a Travel Blogger,"Travel blogger, Nikki Vargas, of The Pin the M...",travel
8,i5F7Xh9CO8U,EXPLORING VARANASI | Benaras Travel Vlog #1,Spent an incredible week exploring the beautif...,travel
9,kiNyRY5s7n8,How to Start a Travel Blog [2019] Travel Blogg...,Create a Travel Blog Website for Just $3.95++ ...,travel
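
The cells above keep four separate lists in step by hand, which goes wrong as soon as one `append` is skipped. A possible alternative, sketched here with plain dictionaries, is to store one record per video. The helper `make_record` is hypothetical, and the two rows reuse IDs and titles from the table above with the description left empty.

```python
from typing import Dict, List

COLUMNS = ["Video ID", "Title", "Description", "Category"]


def make_record(video_id: str, title: str, description: str, category: str) -> Dict[str, str]:
    """Bundle one video's fields so they cannot drift apart across lists."""
    return dict(zip(COLUMNS, [video_id, title, description, category]))


records: List[Dict[str, str]] = []
records.append(make_record("942sCY6d7kU", "Top 10 YouTube Travellers", "", "travel"))
records.append(make_record("RcmrbNRK-jY", "200 Days - A Trip Around the World Travel Film", "", "travel"))

print(len(records), list(records[0]))
```

A list of dictionaries like this can be passed straight to `pd.DataFrame (records)` to get the same four columns.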


### Test code

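**NOTE**: The code above was used to try out the individual steps.

The code below doesn't work properly. It builds ``link`` for each category but never requests it, so every category re-uses ``URLsoup`` from the earlier cell (the results come from the wrong page). It also never follows the Next button, so the ``while`` loop goes over the same ``links`` again and the results get repeated.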

In [None]:
for category in categories:
    
    link = yt_link + searchstr + category + only_vids
    
    links = URLsoup.find_all ("a", class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link ")

    templink = yt_link
    
    count = 0
    
    while (count < 300):
        
        for link in links:
            try:
            
                print ("Video-Title: " + link.text)
                print ("Video-ID: " + link["href"].split("=")[1])
        
                ytid.append (link["href"].split("=")[1])
                yttitle.append (link.text)
        
                templink += link["href"]
                tempreq = requests.get (templink)
                tempsoup = bs (tempreq.text, "lxml")
        
                print ("Video description: ")
                print (tempsoup.find ("p", id="eow-description").text)
        
                ytdes.append (tempsoup.find ("p", id="eow-description").text)
            
            except Exception as e:
                ytdes.append ("")
                ytcat.append (category.split ("+")[0])
            
            print ()
        
            sleep (5)
            templink = yt_link
            count += 1
            

## Final code:

*(which works)*

In [6]:
yt_link = "https://www.youtube.com"
categories = ["travel+blog", "science+and+technology", "food", "manufacturing", "history", "art+and+music"]
searchstr = "/results?search_query="
only_vids = "&sp=EgIQAQ%253D%253D"

for category in categories:
    print (yt_link + searchstr + category + only_vids)

https://www.youtube.com/results?search_query=travel+blog&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=science+and+technology&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=food&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=manufacturing&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=history&sp=EgIQAQ%253D%253D
https://www.youtube.com/results?search_query=art+and+music&sp=EgIQAQ%253D%253D


In [49]:
# creating empty lists to fill details in

vidID = []
vidTitle = []
vidDescription = []
vidCateogry = []

In [50]:
for category in categories:
    
    # link of category
    link = yt_link + searchstr + category + only_vids

    # first request
    req =  requests.get (link)

    # first soup
    soup = bs (req.text, "lxml")

    # finding all links in first page
    links = soup.find_all ("a", class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link ")

    # setting count to 0 to get required results
    count = 0

    while (count < 400):
        
        # temporary link to get description
        templink = yt_link
    
        # iterating over the links
        for link in links:
        
            # description might be empty so putting code in try-except block
            try:
                # appending video ID
                vidID.append (link["href"].split("=")[1])
                # appending video Title
                vidTitle.append (link.text)
                # print (str (count) + ". " + link.text)
            
                # getting temporary link to make a request to get video description
                templink += link["href"]
            
                # making request and soupifying
                tempreq = requests.get (templink)
                sleep (2.5)
                tempsoup = bs (tempreq.text, "lxml")
            
                # appending video Description
                vidDescription.append (tempsoup.find ("p", id="eow-description").text)
            
            except Exception as e:
                
                vidDescription.append ("")
            
            # appending video Category
            vidCateogry.append (category.split ("+")[0])
            templink = yt_link
            count += 1        
    
        # finding pagination buttons
        buttons = soup.find_all ("a", class_="yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default")
    
        # finding Next button
        next_button = buttons[-1]["href"]
        next_button = "https://www.youtube.com" + str(next_button)
    
        # making request on the Next button
        nextreq = requests.get (next_button)
    
        nextsoup = bs (nextreq.text, "lxml")
    
        # new list of links
        links = nextsoup.find_all ("a", class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link ")
        # setting soup as nextsoup to find next page's pagination button
        soup = nextsoup
    
    # done message for each category
    print (category.split ("+")[0] + " done.")

print ("Done all categories.")

travel done.
science done.
food done.
manufacturing done.
history done.
art done.
Done all categories.
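
The loop above takes `buttons[-1]` on every page, so it relies on a Next button always being there. Below is a sketch of the same "follow Next until enough results" idea with two guards: it stops when a page has no next link, and when a link points back to a page already visited. To keep it runnable without network access, the fetching and parsing are replaced by a made-up `fake_fetch` function over a tiny dictionary. The names `collect_items`, `fake_fetch` and `FAKE_SITE` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (items found on the page, URL of the next page or None on the last page)
Page = Tuple[List[str], Optional[str]]


def collect_items(first_url: str, fetch_page: Callable[[str], Page], limit: int) -> List[str]:
    """Follow next-page links until `limit` items are collected or the pages run out."""
    items: List[str] = []
    seen_urls = set()
    url: Optional[str] = first_url
    while url is not None and url not in seen_urls and len(items) < limit:
        seen_urls.add(url)
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items[:limit]


# stand-in for requests + BeautifulSoup: three pages, the last one has no Next link
FAKE_SITE: Dict[str, Page] = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c", "d"], "page3"),
    "page3": (["e"], None),
}


def fake_fetch(url: str) -> Page:
    return FAKE_SITE[url]


print(collect_items("page1", fake_fetch, limit=4))
print(collect_items("page1", fake_fetch, limit=400))
```

In the notebook's setting, `fetch_page` would request the URL, parse it and return the video links together with the Next button's href.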


In [51]:
YTDf = pd.DataFrame (columns=["Video ID", "Title", "Description", "Category"])
YTDf["Video ID"] = vidID
YTDf["Title"] = vidTitle
YTDf["Description"] = vidDescription
YTDf["Category"] = vidCateogry

# counting all unique results
print ("Unique count = " + str (len (YTDf["Video ID"].unique ())))

# seeing what the Dataframe looks like (with all results)
YTDf

Unique count = 2111


Unnamed: 0,Video ID,Title,Description,Category
0,e2NQE41J5eM,How do I travel so much ! How do I earn money!!,SUBSCRIBE - https://goo.gl/dEtSMJ (‘MountainTr...,travel
1,i9E_Blai8vk,TRAVEL VLOG ∙ Welcome to Bali | PRISCILLA LEE,I had the chance to fly out to Bali with my wh...,travel
2,-LzdIILq5vE,GOA TRAVEL DIARY | FOUR DAYS IN GOA | TRAVEL O...,Hope you enjoy MY GOA TRAVEL DIARY this video!...,travel
3,7ByoBJYXU0k,5 Steps to Becoming a Travel Blogger,"Travel blogger, Nikki Vargas, of The Pin the M...",travel
4,7mIzRYh8jGA,What is it like to travel in PAKISTAN?,Subscribe now: https://goo.gl/6zXZGKWatch the ...,travel
5,EthqIhPtd2I,"TRAVEL VLOG: SANTORINI, GREECE",Thank you so much for watching! I hope you fou...,travel
6,g-7RK9cmXis,HOW TO TRAVEL THE PHILIPPINES,THIS IS MY FAVOURITE COUNTRY I HAVE EVER TRAVE...,travel
7,SL_YBLWdZb8,Welcome to Peru! | Best Essential Tips & Trave...,Welcome to Peru! This essential travel guide w...,travel
8,ZR7Z74UY2TY,VAN LIFE | A Day in The Life | MEXICO,We have been traveling in our van in Mexico fo...,travel
9,RcmrbNRK-jY,200 Days - A Trip Around the World Travel Film,My wife and I traveled to 17 countries in 200 ...,travel


In [52]:
# what the Dataframe looks like without duplicate results
YTDf.drop_duplicates ()

Unnamed: 0,Video ID,Title,Description,Category
0,e2NQE41J5eM,How do I travel so much ! How do I earn money!!,SUBSCRIBE - https://goo.gl/dEtSMJ (‘MountainTr...,travel
1,i9E_Blai8vk,TRAVEL VLOG ∙ Welcome to Bali | PRISCILLA LEE,I had the chance to fly out to Bali with my wh...,travel
2,-LzdIILq5vE,GOA TRAVEL DIARY | FOUR DAYS IN GOA | TRAVEL O...,Hope you enjoy MY GOA TRAVEL DIARY this video!...,travel
3,7ByoBJYXU0k,5 Steps to Becoming a Travel Blogger,"Travel blogger, Nikki Vargas, of The Pin the M...",travel
4,7mIzRYh8jGA,What is it like to travel in PAKISTAN?,Subscribe now: https://goo.gl/6zXZGKWatch the ...,travel
5,EthqIhPtd2I,"TRAVEL VLOG: SANTORINI, GREECE",Thank you so much for watching! I hope you fou...,travel
6,g-7RK9cmXis,HOW TO TRAVEL THE PHILIPPINES,THIS IS MY FAVOURITE COUNTRY I HAVE EVER TRAVE...,travel
7,SL_YBLWdZb8,Welcome to Peru! | Best Essential Tips & Trave...,Welcome to Peru! This essential travel guide w...,travel
8,ZR7Z74UY2TY,VAN LIFE | A Day in The Life | MEXICO,We have been traveling in our van in Mexico fo...,travel
9,RcmrbNRK-jY,200 Days - A Trip Around the World Travel Film,My wife and I traveled to 17 countries in 200 ...,travel


In [54]:
# creating .csv file by dropping the duplicate results
YTDf.drop_duplicates ().to_csv ("YTDataset.csv", index=False)
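
One thing to keep in mind: the unique count earlier is taken over the `Video ID` column only, while `drop_duplicates ()` without arguments compares whole rows. If the same video were scraped twice with different descriptions (say, one request failed and left an empty string), both rows would survive. The sketch below shows the difference on a made-up three-row frame. Whether this actually happens in the scraped data has not been checked here.

```python
import pandas as pd

# made-up rows: the first two share a Video ID but differ in Description
df = pd.DataFrame({
    "Video ID": ["e2NQE41J5eM", "e2NQE41J5eM", "i9E_Blai8vk"],
    "Description": ["full text", "", "other text"],
})

rows_whole = len(df.drop_duplicates())                     # compares every column
rows_by_id = len(df.drop_duplicates(subset="Video ID"))    # compares the ID only, keeps the first row
print(rows_whole, rows_by_id)
```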

#### Resources followed to learn Web Scraping:

1. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2. https://allofyourbases.com/2017/10/08/web-scraping-youtube-in-python/
3. https://allofyourbases.com/2017/11/08/webcrawling-youtube-pagination-in-python/
4. https://www.youtube.com/watch?v=0_VZ7NpVw1Y
5. https://www.youtube.com/watch?v=PI1-1TtFz50
6. https://www.youtube.com/watch?v=ng2o98k983k

### Learning web-scraping took the whole time, so I wasn't able to do classification.

However, if I had to classify, I would go with an RNN that classifies based on the video title and the video description (after cleaning up both, since they contain emojis, affiliate links and other links which are not useful).