<h1 align=center> Web Scraping Project </h1>

<h3 align=center> By Prosper Ayawah </h3>


## Description

This notebook contains code that scrapes information from a data science forum (https://www.datasciencecentral.com/forum) and compiles the data into a CSV file.

The following tasks are completed:

- Use a for loop to extract key information from all 10 pages. In total, we need key information for 100 topics (10 pages, each page containing 10 topics);
- Use a pandas data frame to store the following key information for the 100 topics:
  1. TopicTitle
  2. TopicURL
  3. NumOfReplies
  4. Author
  5. AuthorURL
- Save the data frame containing the key information to a CSV file.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## 1. Prepare URLs of the first 10 pages

The first page URL:
https://www.datasciencecentral.com/forum?page=1

In [2]:
url_temp = 'https://www.datasciencecentral.com/forum?page={page_num}'

In [3]:
url_temp

'https://www.datasciencecentral.com/forum?page={page_num}'

In [4]:
url_temp.format(page_num = 1)

'https://www.datasciencecentral.com/forum?page=1'

In [5]:
url_temp.format(page_num = 2)

'https://www.datasciencecentral.com/forum?page=2'

Create a data frame with 10 rows, one per page, then add a column holding the URL of each page.

In [6]:
df = pd.DataFrame({'Page': range(1,11)})
df

|   | Page |
|---|------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 10 |


In [7]:
df['PageURL'] = [url_temp.format(page_num = i+1) for i in range(0,len(df))]
df

|   | Page | PageURL |
|---|------|---------|
| 0 | 1 | https://www.datasciencecentral.com/forum?page=1 |
| 1 | 2 | https://www.datasciencecentral.com/forum?page=2 |
| 2 | 3 | https://www.datasciencecentral.com/forum?page=3 |
| 3 | 4 | https://www.datasciencecentral.com/forum?page=4 |
| 4 | 5 | https://www.datasciencecentral.com/forum?page=5 |
| 5 | 6 | https://www.datasciencecentral.com/forum?page=6 |
| 6 | 7 | https://www.datasciencecentral.com/forum?page=7 |
| 7 | 8 | https://www.datasciencecentral.com/forum?page=8 |
| 8 | 9 | https://www.datasciencecentral.com/forum?page=9 |
| 9 | 10 | https://www.datasciencecentral.com/forum?page=10 |
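
The same page URLs can also be built from a query-parameter dictionary rather than a format string. The sketch below is one possible alternative, assuming the forum only needs the `page` parameter shown above; `BASE_URL` and `build_page_url` are hypothetical names that do not appear in the notebook.

```python
from urllib.parse import urlencode

# Base address of the forum listing (taken from the URL template above).
BASE_URL = 'https://www.datasciencecentral.com/forum'

def build_page_url(page_num):
    """Return the listing URL for one page (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'page': page_num})}"

# URLs for pages 1 through 10, in order.
page_urls = [build_page_url(n) for n in range(1, 11)]
print(page_urls[0])
print(len(page_urls))
```

A list like `page_urls` could be assigned straight to `df['PageURL']`, since it has one entry per row.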


## 2. Download HTML Files

Try the requests.get() method on the first page.

In [8]:
x = requests.get('https://www.datasciencecentral.com/forum?page=1')
x

<Response [200]>

Use the requests.get() method to download HTML files for all 10 pages.

In [9]:
df['HTML'] = [requests.get(df.loc[i,'PageURL']).content for i in range(0,len(df))]
df

|   | Page | PageURL | HTML |
|---|------|---------|------|
| 0 | 1 | https://www.datasciencecentral.com/forum?page=1 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 1 | 2 | https://www.datasciencecentral.com/forum?page=2 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 2 | 3 | https://www.datasciencecentral.com/forum?page=3 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 3 | 4 | https://www.datasciencecentral.com/forum?page=4 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 4 | 5 | https://www.datasciencecentral.com/forum?page=5 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 5 | 6 | https://www.datasciencecentral.com/forum?page=6 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 6 | 7 | https://www.datasciencecentral.com/forum?page=7 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 7 | 8 | https://www.datasciencecentral.com/forum?page=8 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 8 | 9 | https://www.datasciencecentral.com/forum?page=9 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
| 9 | 10 | https://www.datasciencecentral.com/forum?page=10 | `b'<!DOCTYPE html>\n<html lang="en" xmlns:og="h...` |
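
The one-line download above assumes that every request succeeds. A slightly more defensive version might set a timeout, raise on HTTP errors, and pause briefly between requests. This is only a sketch: `download_pages`, `delay_seconds`, and the default values are assumptions rather than part of the notebook, and `session` is expected to be a `requests.Session` or any object with a compatible `get` method.

```python
import time

def download_pages(urls, session, delay_seconds=1.0, timeout=10):
    """Download each URL in turn and return the raw page contents.

    Hypothetical helper: `session` is assumed to behave like
    requests.Session, i.e. session.get(url, timeout=...) returns an
    object with raise_for_status() and a .content attribute.
    """
    urls = list(urls)
    pages = []
    for position, url in enumerate(urls):
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of storing an error page
        pages.append(response.content)
        if position < len(urls) - 1:
            time.sleep(delay_seconds)  # short pause between consecutive requests
    return pages

# Possible usage in the notebook (not executed here):
# df['HTML'] = download_pages(df['PageURL'], requests.Session())
```

Passing the session in as an argument keeps the helper easy to try out with a stand-in object before pointing it at the live site.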


## 3. Explore the HTML Using Beautiful Soup Methods

Next, let's explore how Beautiful Soup methods can be used to parse the HTML. As before, we try the methods on page 1 first.

In [10]:
soup = BeautifulSoup(df.loc[0,'HTML'], 'html.parser')

body = soup.find_all(lambda tag: tag.name=='div' and 
                     tag.get('class')==['xg_module_body'])[-1].find('tbody')
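
The lambda passed to `find_all` compares the whole `class` list, so it only matches `div` tags whose class is exactly `xg_module_body`. Searching with `class_='xg_module_body'` would also match tags that carry additional classes. The snippet below illustrates the difference on a made-up HTML fragment; the fragment and the `pagination` class are assumptions for illustration, not taken from the forum page.

```python
from bs4 import BeautifulSoup

# Made-up fragment: one div with exactly one class, one with an extra class.
html = """
<div class="xg_module_body">first</div>
<div class="xg_module_body pagination">second</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact match: the class list must equal ['xg_module_body'].
exact = soup.find_all(lambda tag: tag.name == 'div' and
                      tag.get('class') == ['xg_module_body'])

# Loose match: any div whose class list contains 'xg_module_body'.
loose = soup.find_all('div', class_='xg_module_body')

print([tag.text for tag in exact])
print([tag.text for tag in loose])
```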

In [11]:
topics = body.find_all('h3')
print(topics)

[<h3><a _snid="6448529:Topic:945561" href="https://www.datasciencecentral.com/forum/topics/looking-for-a-ds-mentor">Looking for a DS Mentor</a></h3>, <h3><a _snid="6448529:Topic:961649" href="https://www.datasciencecentral.com/forum/topics/help-others-to-get-started-with-machine-learning">Help others to get started with Machine Learning.</a></h3>, <h3><a _snid="6448529:Topic:1000088" href="https://www.datasciencecentral.com/forum/topics/advices-for-use-case-in-data-science">Advices for use case in Data Science</a></h3>, <h3><a _snid="6448529:Topic:803137" href="https://www.datasciencecentral.com/forum/topics/one-day-will-humans-be-to-ai-what-dogs-are-to-humans-now">One day, will humans be to AI what dogs are to humans now?</a></h3>, <h3><a _snid="6448529:Topic:990315" href="https://www.datasciencecentral.com/forum/topics/example-of-traffic-camera-maintenance-dashboard">Example of Traffic Camera Maintenance Dashboard</a></h3>, <h3><a _snid="6448529:Topic:999240" href="https://www.datasc

In [12]:
# Get topic titles
[ x.text for x in topics]

['Looking for a DS Mentor',
 'Help others to get started with Machine Learning.',
 'Advices for use case in Data Science',
 'One day, will humans be to AI what dogs are to humans now?',
 'Example of Traffic Camera Maintenance Dashboard',
 'Data Science Techniques to eliminate False Negatives',
 'Global Data Science Platform Market: Trends and Opportunities',
 'FFT on Time Series Data',
 'Hardware Configuration',
 'Graduate Programs in Healthcare Data Science']

In [13]:
# Get topic URLs
[x.find('a')['href'] for x in topics]

['https://www.datasciencecentral.com/forum/topics/looking-for-a-ds-mentor',
 'https://www.datasciencecentral.com/forum/topics/help-others-to-get-started-with-machine-learning',
 'https://www.datasciencecentral.com/forum/topics/advices-for-use-case-in-data-science',
 'https://www.datasciencecentral.com/forum/topics/one-day-will-humans-be-to-ai-what-dogs-are-to-humans-now',
 'https://www.datasciencecentral.com/forum/topics/example-of-traffic-camera-maintenance-dashboard',
 'https://www.datasciencecentral.com/forum/topics/data-science-techniques-to-eliminate-false-negatives',
 'https://www.datasciencecentral.com/forum/topics/global-data-science-platform-market-trends-and-opportunities',
 'https://www.datasciencecentral.com/forum/topics/fft-on-time-series-data',
 'https://www.datasciencecentral.com/forum/topics/hardware-configuration',
 'https://www.datasciencecentral.com/forum/topics/graduate-programs-in-healthcare-data-science']

In [14]:
replies = body.find_all('td',{'class':'bignum xg_lightborder'})
replies

[<td class="bignum xg_lightborder">2</td>,
 <td class="bignum xg_lightborder">2</td>,
 <td class="bignum xg_lightborder">3</td>,
 <td class="bignum xg_lightborder">4</td>,
 <td class="bignum xg_lightborder">4</td>,
 <td class="bignum xg_lightborder">6</td>,
 <td class="bignum xg_lightborder">0</td>,
 <td class="bignum xg_lightborder">2</td>,
 <td class="bignum xg_lightborder">4</td>,
 <td class="bignum xg_lightborder">2</td>]

In [15]:
# Get the number of replies
[int(x.text) for x in replies]

[2, 2, 3, 4, 4, 6, 0, 2, 4, 2]

In [16]:
users = body.find_all('span',{'class':'xg_avatar'})
users

[<span class="xg_avatar"><a class="fn url" href="https://www.datasciencecentral.com/profile/RionaZefi" title="Riona Zefi"><span class="table_img dy-avatar dy-avatar-48"><img alt="" class="photo photo" src="https://storage.ning.com/topology/rest/1.0/file/get/4445248530?profile=RESIZE_48X48&amp;width=48&amp;height=48&amp;crop=1%3A1"/></span></a></span>,
 <span class="xg_avatar"><a class="fn url" href="https://www.datasciencecentral.com/profile/StuartKasemeier" title="Stuart Kasemeier"><span class="table_img dy-avatar dy-avatar-48"><img alt="" class="photo photo" src="https://storage.ning.com/topology/rest/1.0/file/get/6712599095?profile=RESIZE_48X48&amp;width=48&amp;height=48&amp;crop=1%3A1"/></span></a></span>,
 <span class="xg_avatar"><a class="fn url" href="https://www.datasciencecentral.com/profile/Alonso612" title="Alonso"><span class="table_img dy-avatar dy-avatar-48"><img alt="" class="photo photo" src="https://storage.ning.com/topology/rest/1.0/file/get/2808682810?profile=origina

In [17]:
# Get user names
[x.find('a',{'class':'fn url'})['title'] for x in users]

['Riona Zefi',
 'Stuart Kasemeier',
 'Alonso',
 'Vincent Granville',
 'Hal Trask',
 'Arunansu Pattanayak',
 'Aakash Choudhary',
 'Prashanth Southekal, PhD',
 'Robert Ginns',
 'Dania Khan']

In [18]:
# Get author URLs
[x.find('a',{'class':'fn url'})['href'] for x in users]

['https://www.datasciencecentral.com/profile/RionaZefi',
 'https://www.datasciencecentral.com/profile/StuartKasemeier',
 'https://www.datasciencecentral.com/profile/Alonso612',
 'https://www.datasciencecentral.com/profile/VincentGranville',
 'https://www.datasciencecentral.com/profile/HalTrask',
 'https://www.datasciencecentral.com/profile/ArunansuPattanayak',
 'https://www.datasciencecentral.com/profile/AakashChoudhary',
 'https://www.datasciencecentral.com/profile/PrashanthSouthekal',
 'https://www.datasciencecentral.com/profile/RobertGinns',
 'https://www.datasciencecentral.com/profile/DaniaKhan']
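
The cells above collect titles, reply counts, and authors as separate lists and rely on those lists lining up. An alternative is to parse one table row at a time so that the fields of each topic stay together. The sketch below assumes each topic sits in its own `<tr>` inside the `tbody`; that layout, the sample HTML, the `example.com` URLs, and the helper name `parse_topic_row` are all assumptions for illustration and should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Made-up table that mimics the tags and classes explored above.
sample_html = """
<table><tbody>
<tr>
  <td><span class="xg_avatar"><a class="fn url" href="https://example.com/profile/A" title="Author A"></a></span></td>
  <td><h3><a href="https://example.com/forum/topics/first">First topic</a></h3></td>
  <td class="bignum xg_lightborder">2</td>
</tr>
<tr>
  <td><span class="xg_avatar"><a class="fn url" href="https://example.com/profile/B" title="Author B"></a></span></td>
  <td><h3><a href="https://example.com/forum/topics/second">Second topic</a></h3></td>
  <td class="bignum xg_lightborder">0</td>
</tr>
</tbody></table>
"""

def parse_topic_row(row):
    """Extract the five fields of one topic from a <tr> (hypothetical helper)."""
    topic_link = row.find('h3').find('a')
    author_link = row.find('span', {'class': 'xg_avatar'}).find('a', {'class': 'fn url'})
    replies = row.find('td', {'class': 'bignum xg_lightborder'})
    return {
        'TopicTitle': topic_link.text,
        'TopicURL': topic_link['href'],
        'NumOfReplies': int(replies.text),
        'Author': author_link['title'],
        'AuthorURL': author_link['href'],
    }

rows = BeautifulSoup(sample_html, 'html.parser').find('tbody').find_all('tr')
records = [parse_topic_row(row) for row in rows]
print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.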

## 4. Loop Through the Pages to Collect Information

In [19]:
df1 = pd.DataFrame()
for r in range(0, len(df)):
    # Parse the stored HTML of page r
    soup = BeautifulSoup(df.loc[r,'HTML'], 'html.parser')

    # Take the last div whose class is exactly 'xg_module_body' and get its table body
    body = soup.find_all(lambda tag: tag.name=='div' and
                     tag.get('class')==['xg_module_body'])[-1].find('tbody')

    # Topic titles and URLs come from the links inside the <h3> tags
    topics = body.find_all('h3')
    TopicTitle = [x.text for x in topics]
    TopicURL = [x.find('a')['href'] for x in topics]

    # Number of replies per topic
    replies = body.find_all('td',{'class':'bignum xg_lightborder'})
    NumOfReplies = [int(x.text) for x in replies]

    # Author names and profile URLs come from the avatar links
    users = body.find_all('span',{'class':'xg_avatar'})
    Author = [x.find('a',{'class':'fn url'})['title'] for x in users]
    AuthorURL = [x.find('a',{'class':'fn url'})['href'] for x in users]

    # Build the data frame for this page and append it to the running result
    df1_ = pd.DataFrame({'TopicTitle': TopicTitle,
                            'TopicURL': TopicURL,
                            'NumOfReplies': NumOfReplies,
                            'Author': Author,
                            'AuthorURL': AuthorURL})
    df1 = pd.concat([df1,df1_])
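
Because `pd.concat` keeps the index of each input frame by default, the combined frame repeats the per-page row labels instead of numbering the topics from start to finish. If a single running index is preferred, one option is to collect the per-page frames in a list and concatenate them once with `ignore_index=True`. The sketch below uses two small made-up frames in place of the scraped pages.

```python
import pandas as pd

# Stand-ins for the per-page frames built inside the loop (made-up values).
page_frames = [
    pd.DataFrame({'TopicTitle': ['A', 'B'], 'NumOfReplies': [2, 0]}),
    pd.DataFrame({'TopicTitle': ['C', 'D'], 'NumOfReplies': [4, 1]}),
]

# Default behaviour: each frame keeps its own 0-based index.
repeated = pd.concat(page_frames)

# ignore_index=True renumbers the rows of the combined frame from 0.
combined = pd.concat(page_frames, ignore_index=True)

print(repeated.index.tolist())
print(combined.index.tolist())
```

Concatenating once at the end also avoids rebuilding the growing frame on every iteration.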


In [20]:
df1

|   | TopicTitle | TopicURL | NumOfReplies | Author | AuthorURL |
|---|------------|----------|--------------|--------|-----------|
| 0 | Looking for a DS Mentor | https://www.datasciencecentral.com/forum/topic... | 2 | Riona Zefi | https://www.datasciencecentral.com/profile/Rio... |
| 1 | Help others to get started with Machine Learning. | https://www.datasciencecentral.com/forum/topic... | 2 | Stuart Kasemeier | https://www.datasciencecentral.com/profile/Stu... |
| 2 | Advices for use case in Data Science | https://www.datasciencecentral.com/forum/topic... | 3 | Alonso | https://www.datasciencecentral.com/profile/Alo... |
| 3 | One day, will humans be to AI what dogs are to... | https://www.datasciencecentral.com/forum/topic... | 4 | Vincent Granville | https://www.datasciencecentral.com/profile/Vin... |
| 4 | Example of Traffic Camera Maintenance Dashboard | https://www.datasciencecentral.com/forum/topic... | 4 | Hal Trask | https://www.datasciencecentral.com/profile/Hal... |
| ... | ... | ... | ... | ... | ... |
| 5 | Linking Analytics Intervention Significance to... | https://www.datasciencecentral.com/forum/topic... | 1 | JOHN MEE | https://www.datasciencecentral.com/profile/JOH... |
| 6 | Constrained optimization with objective functi... | https://www.datasciencecentral.com/forum/topic... | 3 | DJ | https://www.datasciencecentral.com/profile/DJ |
| 7 | Help needed to prove that Earth is not flat | https://www.datasciencecentral.com/forum/topic... | 15 | Vincent Granville | https://www.datasciencecentral.com/profile/Vin... |
| 8 | Resources ? | https://www.datasciencecentral.com/forum/topic... | 1 | gerard pierre L | https://www.datasciencecentral.com/profile/ger... |


In [21]:
# Write the collected information to a CSV file
df1.to_csv('Web_Scraping_file.csv')
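
By default, `to_csv` writes the index as an extra first column with no header, and pandas labels that column `Unnamed: 0` when the file is read back. Passing `index=False` leaves the index out. The round trip below is a small illustration that uses an in-memory buffer and a made-up two-row frame rather than the scraped data.

```python
import io

import pandas as pd

# Made-up frame standing in for the scraped results.
sample = pd.DataFrame({'TopicTitle': ['A', 'B'], 'NumOfReplies': [2, 0]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Whether to keep the index depends on how the file will be used; here it only repeats the per-page row numbers.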