<a href="https://colab.research.google.com/github/Blackman9t/EDA/blob/master/web_scraping_with_beautiful_soup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Reading the headlines, summaries and video links from a website using BeautifulSoup

In [3]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
from pandas import json_normalize  # transform JSON data into pandas dataframes
import numpy as np
import csv

print('All modules imported')

All modules imported


In [0]:
site_link = 'https://coreyms.com/'

# get the page's HTML source and fail early on a bad HTTP status
response = requests.get(site_link, timeout=10)
response.raise_for_status()
source = response.text

# use BeautifulSoup with the lxml parser to parse the HTML
soup = BeautifulSoup(source, 'lxml')

#print(soup.prettify())

### Now let's grab the first headline, summary and video link from the website, but first let's inspect the structure

First let's grab the headline

In [0]:
# first we find the first article element on the page
article = soup.find('article')

# let's print to see
#print(article.prettify())

In [12]:
# Let's save the first headline to a variable
headline = article.a.text
print(headline)

Visual Studio Code (Windows) – Setting up a Python Development Environment and Complete Overview


Next, let's grab the post summary that follows the headline

In [15]:
# find the div with class entry-content, take its first p (paragraph) and keep just the text
summary = article.find('div', class_='entry-content').p.text

print(summary)

In this Python Programming Tutorial, we will be learning how to set up a Python development environment in VSCode on Windows. VSCode is a very nice free editor for writing Python applications and many developers are now switching over to this editor. In this video, we will learn how to install VSCode, get the Python extension installed, how to change Python interpreters, create virtual environments, format/lint our code, how to use Git within VSCode, how to debug our programs, how unit testing works, and more. We have a lot to cover, so let’s go ahead and get started…
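
The chained lookup above can also be written as a CSS selector. This is a minimal sketch on a made-up HTML snippet (the `sample_html` string and the `sample_*` names are illustrative and not taken from the site). It uses the built-in `html.parser` so it runs without `lxml`, and it assumes the page follows the same article structure inspected above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the article structure inspected above
sample_html = """
<article>
  <h2><a href="#">Sample headline</a></h2>
  <div class="entry-content"><p>Sample summary.</p></div>
</article>
"""

sample_article = BeautifulSoup(sample_html, 'html.parser').find('article')

# Chained lookup, as in the cell above
via_find = sample_article.find('div', class_='entry-content').p.text

# The same element addressed with a CSS selector
via_select = sample_article.select_one('div.entry-content p').text

# Both approaches reach the same paragraph
print(via_find == via_select)

# select_one (like find) returns None when nothing matches
missing_frame = sample_article.select_one('iframe.youtube-player')
print(missing_frame)
```

Because a failed lookup returns `None`, chaining `.text` onto it raises an `AttributeError`, which is worth keeping in mind when some articles lack a field.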


Next we need to get the link to the first video. This is a bit tricky.

In [28]:
# First let's grab the iframe that holds the video,
# which in this case is the first iframe with class youtube-player
frame = article.find('iframe', class_="youtube-player")
frame

<iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/-nh9rCzPJ20?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" type="text/html" width="640"></iframe>

In [29]:
# the video link is stored in the iframe's src attribute; tag attributes are read with square brackets, like a dictionary

vid_link = frame['src']

vid_link

'https://www.youtube.com/embed/-nh9rCzPJ20?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent'

In [30]:
# Next let's split the link on the / sign so we can pick out the part that holds the video ID

link_split = vid_link.split('/')

link_split

['https:',
 '',
 'www.youtube.com',
 'embed',
 '-nh9rCzPJ20?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent']

In [32]:
# Next let's use simple Python indexing to select the last part of the link,
# then split it on the ? sign to isolate the video ID
vid_link = link_split[4].split('?')

vid_link = vid_link[0]

vid_link

'-nh9rCzPJ20'

Now we can create our own YouTube link using this video ID

In [34]:
yt_link = f'youtube.com/watch?v={vid_link}'  
print(yt_link)

youtube.com/watch?v=-nh9rCzPJ20
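
The split-based steps above depend on the ID always being the fifth piece of the link. As an alternative sketch using only the standard library, `urllib.parse` can pull the ID out of the URL path and build a full link that includes the `https://www.` prefix. The helper name `embed_to_watch_url` is hypothetical, and the check assumes embed links always have the form `/embed/<video id>`.

```python
from urllib.parse import urlsplit, urlencode

def embed_to_watch_url(embed_url):
    """Turn a YouTube embed URL into a full watch URL."""
    # The path of an embed URL is expected to look like '/embed/<video id>'
    path_parts = urlsplit(embed_url).path.strip('/').split('/')
    if len(path_parts) != 2 or path_parts[0] != 'embed' or not path_parts[1]:
        raise ValueError(f'not a YouTube embed URL: {embed_url}')
    video_id = path_parts[1]
    # urlencode builds the 'v=<id>' query string and escapes it if needed
    return 'https://www.youtube.com/watch?' + urlencode({'v': video_id})

# The src value scraped from the first article above
embed_url = ('https://www.youtube.com/embed/-nh9rCzPJ20?version=3&rel=1&fs=1'
             '&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent')

watch_url = embed_to_watch_url(embed_url)
print(watch_url)
```

Raising a `ValueError` for an unexpected link makes a changed page layout visible instead of silently producing a wrong ID.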


### Finding the data for all the articles on the website

In [36]:
# Here we shall use find_all instead of find, with a for-loop to process every article
    
for article in soup.find_all('article'):
    try:
        headline = article.a.text
        print(headline)
        summary = article.find('div', class_='entry-content').p.text
        print(summary)
        frame = article.find('iframe',class_="youtube-player")
        vid_link = frame['src']
        link_split = vid_link.split('/')
        vid_link = link_split[4].split('?')
        vid_link = vid_link[0]
        yt_link = f'youtube.com/watch?v={vid_link}'
        print(yt_link)
    except Exception:  # a missing tag or an unexpected link format ends up here
        print('None')
    print()


Visual Studio Code (Windows) – Setting up a Python Development Environment and Complete Overview
In this Python Programming Tutorial, we will be learning how to set up a Python development environment in VSCode on Windows. VSCode is a very nice free editor for writing Python applications and many developers are now switching over to this editor. In this video, we will learn how to install VSCode, get the Python extension installed, how to change Python interpreters, create virtual environments, format/lint our code, how to use Git within VSCode, how to debug our programs, how unit testing works, and more. We have a lot to cover, so let’s go ahead and get started…
youtube.com/watch?v=-nh9rCzPJ20

Visual Studio Code (Mac) – Setting up a Python Development Environment and Complete Overview
In this Python Programming Tutorial, we will be learning how to set up a Python development environment in VSCode on MacOS. VSCode is a very nice free editor for writing Python applications and many dev
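
The loop above wraps all three lookups in one `try` block, so one missing tag skips every lookup after it. A minimal sketch of a per-field alternative follows. The two-article `sample_page` and the `parse_article` helper are made up for illustration, and the built-in `html.parser` is used so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical two-article page; the second article has no video iframe
sample_page = """
<article>
  <h2><a href="#">First post</a></h2>
  <div class="entry-content"><p>First summary.</p></div>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/abc123?version=3"></iframe>
</article>
<article>
  <h2><a href="#">Second post</a></h2>
  <div class="entry-content"><p>Second summary.</p></div>
</article>
"""

def parse_article(article):
    """Extract each field on its own, leaving None where a tag is missing."""
    row = {'Headline': None, 'Summary': None, 'Video_link': None}
    if article.a is not None:
        row['Headline'] = article.a.text
    content = article.find('div', class_='entry-content')
    if content is not None and content.p is not None:
        row['Summary'] = content.p.text
    frame = article.find('iframe', class_='youtube-player')
    if frame is not None and frame.get('src'):
        # Same split-based ID extraction as in the cells above
        video_id = frame['src'].split('/')[4].split('?')[0]
        row['Video_link'] = f'youtube.com/watch?v={video_id}'
    return row

sample_soup = BeautifulSoup(sample_page, 'html.parser')
rows = [parse_article(article) for article in sample_soup.find_all('article')]
for row in rows:
    print(row)
```

With this layout an article without a video still keeps its headline and summary, and only the `Video_link` field stays `None`.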

Let's save the scraped data to a CSV file. It is cleaner to wrap the whole process in a function.

In [0]:
def find_all_articles_data(soup):
    """ This function takes a parsed web page (a BeautifulSoup object), scrapes
    each article's headline, summary and video link, saves them to
    'scraped_site.csv' and returns the data in a pandas data frame """

    # create the csv file; the with-block closes it even if an error occurs,
    # and newline='' stops the csv module from writing blank rows on Windows
    with open('scraped_site.csv', 'w', newline='', encoding='utf-8') as csv_file:

        # call the writer function on the csv file
        csv_writer = csv.writer(csv_file)

        # pass the headers list to the csv writer
        csv_writer.writerow(['Headline', 'Summary', 'Video_link'])

        for article in soup.find_all('article'):
            # reset the fields so a failed article never reuses the previous row's values
            headline = summary = yt_link = None
            try:
                headline = article.a.text
                summary = article.find('div', class_='entry-content').p.text
                frame = article.find('iframe', class_="youtube-player")
                vid_link = frame['src'].split('/')[4].split('?')[0]
                yt_link = f'youtube.com/watch?v={vid_link}'
            except (AttributeError, TypeError, KeyError, IndexError):
                # a missing tag or an unexpected link format leaves the remaining fields empty
                print('None')
            csv_writer.writerow([headline, summary, yt_link])

    return pd.read_csv('scraped_site.csv')

In [71]:
scraped_df = find_all_articles_data(soup)
scraped_df

,Headline,Summary,Video_link
0,Visual Studio Code (Windows) – Setting up a Py...,"In this Python Programming Tutorial, we will b...",youtube.com/watch?v=-nh9rCzPJ20
1,Visual Studio Code (Mac) – Setting up a Python...,"In this Python Programming Tutorial, we will b...",youtube.com/watch?v=06I63_p-2A4
2,Clarifying the Issues with Mutable Default Arg...,"In this Python Programming Tutorial, we will b...",youtube.com/watch?v=_JGmemuINww
3,5 Common Python Mistakes and How to Fix Them,"In this Python Programming Tutorial, we will b...",youtube.com/watch?v=zdJEYhA2AZQ
4,How I Setup a New Development Machine – Using ...,"In this video, I’ll be showing how I set up a ...",youtube.com/watch?v=kIdiWut8eD8
5,How to Write Python Scripts to Analyze JSON AP...,"In this Python Programming Tutorial, we will b...",youtube.com/watch?v=1lxrb_ezP-g
6,Homebrew Tutorial: Simplify Software Installat...,"In this video, we’ll be learning how to use th...",youtube.com/watch?v=SELYgZvAZbU
7,Python Tutorial: VENV (Windows) – How to Use V...,"In this Python Programming Tutorial, we will b...",youtube.com/watch?v=APOPm01BVrk
8,Python Tutorial: VENV (Mac & Linux) – How to U...,"In this Python Programming Tutorial, we will b...",youtube.com/watch?v=Kg1Yvry_Ydk
9,10 Python Tips and Tricks For Writing Better Code,"In this Python Programming video, we will be g...",youtube.com/watch?v=C-gEQdGVXbk
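
As a sketch of the same CSV step with named columns, `csv.DictWriter` writes dictionaries keyed by header name, which avoids keeping list positions in sync with the header row. The example writes to an in-memory buffer instead of a file so it leaves nothing on disk. The headlines and links are copied from the table above, while the summaries are placeholders because the table shows them truncated.

```python
import csv
import io

# Sample rows; the summaries are placeholders, not the real scraped text
sample_rows = [
    {'Headline': '5 Common Python Mistakes and How to Fix Them',
     'Summary': '(placeholder summary)',
     'Video_link': 'youtube.com/watch?v=zdJEYhA2AZQ'},
    {'Headline': '10 Python Tips and Tricks For Writing Better Code',
     'Summary': '(placeholder summary, with a comma)',
     'Video_link': 'youtube.com/watch?v=C-gEQdGVXbk'},
]

fieldnames = ['Headline', 'Summary', 'Video_link']

# An in-memory buffer stands in for the file on disk
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(sample_rows)

# Read the data back to confirm it round-trips, including the quoted comma
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == sample_rows)
```

To write a real file, the buffer can be swapped for `open('scraped_site.csv', 'w', newline='', encoding='utf-8')` inside a `with` block.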
