### Class Example:  Building A Web Scraper

This notebook briefly walks through how to build a web scraper and relates it to a few of the concepts covered so far in this workshop.

**Important**:  This example is not meant to explain every line of code that's used, only to relate some important concepts discussed in class to a real-world example.  It's okay if you don't understand everything shown here. Focus on the bigger picture, and turn this notebook into a follow-up project after the workshop.

**Step 1**:  Do the imports

In [1]:
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

**Step 2**: Connect to the URL using the requests library.

In [2]:
url         = 'https://generalassemb.ly/education'

params      = {
                'where'  : 'new-york-city',
                'format' : 'classes-workshops'
              }

r           = requests.get(url, params=params)

Now we have the response from the website saved in the variable `r`.
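
The `params` dictionary ends up in the query string of the URL. The sketch below shows roughly what that looks like, using only the standard library. `requests` does its own encoding internally, so treat this as an illustration rather than its exact implementation.

```python
from urllib.parse import urlencode

url = 'https://generalassemb.ly/education'
params = {
    'where': 'new-york-city',
    'format': 'classes-workshops',
}

# Encode the dictionary as a query string and append it to the base URL.
query = urlencode(params)
full_url = f'{url}?{query}'
print(full_url)
```

After a real request, `r.url` holds the URL that was actually fetched and `r.status_code` holds the HTTP status, which is worth checking before working with `r.text`.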

We can now grab the entire content of the page as a piece of text.  Here are its last 500 characters:

In [4]:
r.text[-500:]

"&gt;&lt;label for=&#39;design&#39;&gt;&lt;input type=&#39;checkbox&#39; name=&#39;topics&#39; id=&#39;design&#39; value=&#39;design&#39;&gt;Design&lt;/label&gt;&lt;label for=&#39;data&#39;&gt;&lt;input type=&#39;checkbox&#39; name=&#39;topics&#39; id=&#39;data&#39; value=&#39;data&#39;&gt;Data&lt;/label&gt;&lt;label for=&#39;business&#39;&gt;&lt;input type=&#39;checkbox&#39; name=&#39;topics&#39; id=&#39;business&#39; value=&#39;business&#39;&gt;Business&lt;/label&gt;'></section>\n</div>\n</body>\n"

**Our Problem**:  We don't need the vast majority of this information, just the list near the bottom of the page that contains the information about the classes.

To turn this into actual data, we need to accomplish two steps:
 
 - Extract the portion of the string that contains ONLY what we need
 - Convert it into an actual Python list that we can then manipulate further
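
On a small scale, the two steps can be sketched as follows. The page fragment, the variable name `EXAMPLE_JSON`, and its contents are made up for illustration; the real page is far larger.

```python
import json

# Hypothetical page fragment standing in for r.text.
page = (
    '<html><body><script>\n'
    '  window.EXAMPLE_JSON = [{"title": "Class A"}, {"title": "Class B"}];\n'
    '</script></body></html>'
)

# Step 1: extract only the part of the string that holds the list.
start = page.index('[')
stop = page.index('];') + 1   # + 1 keeps the closing bracket
list_text = page[start:stop]

# Step 2: convert that string into a real Python list.
classes = json.loads(list_text)
print(type(classes).__name__, len(classes))
print(classes[1]['title'])
```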
 
Some of the code in the next cell is not covered in class, so make a note to study anything you don't understand afterwards:

In [5]:
# parse the response text into a BeautifulSoup object
doc         = BeautifulSoup(r.text, 'html.parser')
# grab all the script tags, then use a negative index to pick the second-to-last one
script      = doc.find_all('script')[-2]
# convert the BeautifulSoup tag back into a string
text        = str(script)

Now let's take a look at our variable text:

In [6]:
text[:1000]

'<script>\n  window.EDUCATIONAL_OFFERINGS_JSON = [{"format":"event","overview":"Join us for a fireside chat with Lila Miller, Founder \\u0026 Executive Creative Director at Bonita Semana! ","topics":[{"id":1,"name":"Business","asset_folder":"business"},{"id":4,"name":"Design","asset_folder":"ux_and_design"}],"instructors":[{"id":22815,"name":"Roman Sciascia","title":"Visual Brand Lead, General Assembly"},{"id":22133,"name":"Lila Miller","title":"Founder \\u0026 Executive Creative Director, Bonita Semana"}],"title":"Fireside Chat with Lila Miller, Presented by Bonita Semana","starts":"2020-01-29T23:30:00.000Z","length_in_weeks":null,"url":"http://generalassemb.ly/education/fireside-chat-with-lila-miller-presented-by-bonita-semana/new-york-city/97264","image_url":"https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/13114/thumb_Screen_Shot_2019-11-27_at_3.37.05_PM.png","duration_description":null,"next_info_session":null,"number_of_sessions":null,"date_num":"29","dat

This is closer to what we want, but we still need to whittle it down to the actual list and nothing else.

In [7]:
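# get the index position where the next variable begins, which is just past the end of our list
end         = text.index('window.TOPICS_JSON')
# these two slices reduce the string to exactly the list:
# the first drops everything from 'window.TOPICS_JSON' onward,
# the second drops the 47 characters before '[' and the 4 characters after ']'
text        = text[:end]
text        = text[47:-4]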
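
Hard-coded offsets such as `47` and `-4` stop working as soon as the page layout shifts by a character. One possible alternative, sketched here on a made-up stand-in for the script tag, is to search for the opening bracket and let `json.JSONDecoder.raw_decode` find where the list ends. The `sample` string and its contents are assumptions for illustration only.

```python
import json

# Hypothetical stand-in for str(script); the real tag is far longer.
sample = (
    '<script>\n'
    '  window.EDUCATIONAL_OFFERINGS_JSON = [{"title": "A"}, {"title": "B"}];\n'
    '  window.TOPICS_JSON = [{"id": 1}];\n'
    '</script>'
)

marker = 'window.EDUCATIONAL_OFFERINGS_JSON'
# Find the first "[" that follows the variable name.
start = sample.index('[', sample.index(marker))
# raw_decode parses one JSON value beginning at `start` and also returns
# the index where that value ends, so no end offset is needed.
offerings, end = json.JSONDecoder().raw_decode(sample, start)
print(offerings)
```

This version extracts and parses in one step, so it would replace both the slicing and the later `json.loads` call rather than sit alongside them.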

Now let's take a look at our `text` variable:

In [8]:
text

'[{"format":"event","overview":"Join us for a fireside chat with Lila Miller, Founder \\u0026 Executive Creative Director at Bonita Semana! ","topics":[{"id":1,"name":"Business","asset_folder":"business"},{"id":4,"name":"Design","asset_folder":"ux_and_design"}],"instructors":[{"id":22815,"name":"Roman Sciascia","title":"Visual Brand Lead, General Assembly"},{"id":22133,"name":"Lila Miller","title":"Founder \\u0026 Executive Creative Director, Bonita Semana"}],"title":"Fireside Chat with Lila Miller, Presented by Bonita Semana","starts":"2020-01-29T23:30:00.000Z","length_in_weeks":null,"url":"http://generalassemb.ly/education/fireside-chat-with-lila-miller-presented-by-bonita-semana/new-york-city/97264","image_url":"https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/13114/thumb_Screen_Shot_2019-11-27_at_3.37.05_PM.png","duration_description":null,"next_info_session":null,"number_of_sessions":null,"date_num":"29","date_description":"Wed, 29 January","time_descripti

This is a list serialized into a string, and it has to be turned back into a list.  The `json` library does this in one line of code.

In [9]:
data = json.loads(text)
data[5]

{'format': 'class',
 'overview': 'To successfully create an offering that dramatically shifts behavior, generates habitual engagement, and promotes a profitable cycle, we must first understand the consumer journey.',
 'topics': [{'id': 4, 'name': 'Design', 'asset_folder': 'ux_and_design'},
  {'id': 1, 'name': 'Business', 'asset_folder': 'business'}],
 'instructors': [{'id': 12492,
   'name': 'Ashley Treni',
   'title': 'Co-Founder, JustFix.nyc'}],
 'title': 'Customer Journey Mapping',
 'starts': '2020-01-30T23:30:00.000Z',
 'length_in_weeks': None,
 'url': 'http://generalassemb.ly/education/customer-journey-mapping/new-york-city/91728',
 'image_url': 'https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/5952/thumb_Marketing_Customer_Audience_Cards_Hand.jpg',
 'duration_description': None,
 'next_info_session': None,
 'number_of_sessions': None,
 'date_num': '30',
 'date_description': 'Thu, 30 January',
 'time_description': ' 6:30 -  9:30pm EST'}
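
Comparing the raw string with the parsed output shows two conversions: JSON `null` becomes Python `None`, and the escape `\u0026` becomes `&`. A minimal sketch on a hand-written string (the record itself is made up):

```python
import json

# The raw-string prefix keeps \u0026 as six literal characters,
# the way it appears in the page source.
raw = r'[{"title": "Founder \u0026 Director", "length_in_weeks": null, "id": 4}]'

# loads turns the JSON text into Python objects.
parsed = json.loads(raw)
print(parsed)

# dumps goes the other way: Python objects back to a JSON string.
print(json.dumps(parsed))
```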

Looking at the first element, we can see that `data` is a normal list that can be manipulated in every way we've discussed so far today.

In [10]:
data[0]

{'format': 'event',
 'overview': 'Join us for a fireside chat with Lila Miller, Founder & Executive Creative Director at Bonita Semana! ',
 'topics': [{'id': 1, 'name': 'Business', 'asset_folder': 'business'},
  {'id': 4, 'name': 'Design', 'asset_folder': 'ux_and_design'}],
 'instructors': [{'id': 22815,
   'name': 'Roman Sciascia',
   'title': 'Visual Brand Lead, General Assembly'},
  {'id': 22133,
   'name': 'Lila Miller',
   'title': 'Founder & Executive Creative Director, Bonita Semana'}],
 'title': 'Fireside Chat with Lila Miller, Presented by Bonita Semana',
 'starts': '2020-01-29T23:30:00.000Z',
 'length_in_weeks': None,
 'url': 'http://generalassemb.ly/education/fireside-chat-with-lila-miller-presented-by-bonita-semana/new-york-city/97264',
 'image_url': 'https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/13114/thumb_Screen_Shot_2019-11-27_at_3.37.05_PM.png',
 'duration_description': None,
 'next_info_session': None,
 'number_of_sessions': None,
 'date_nu
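
Because the result is an ordinary list of dictionaries, the usual list tools apply. The sketch below uses a few made-up records shaped like the entries in `data`, trimmed to two keys, to show slicing, filtering, and counting.

```python
# Hypothetical records shaped like the entries in `data`.
sample_data = [
    {'format': 'event', 'title': 'Fireside Chat'},
    {'format': 'class', 'title': 'Customer Journey Mapping'},
    {'format': 'class', 'title': 'Visual Design 101'},
]

# Indexing and slicing work as on any list.
first_two = sample_data[:2]

# A list comprehension filters the records by one of their keys.
class_titles = [row['title'] for row in sample_data if row['format'] == 'class']

# Count how often each format appears with a plain dictionary.
counts = {}
for row in sample_data:
    counts[row['format']] = counts.get(row['format'], 0) + 1

print(class_titles)
print(counts)
```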

And finally, let's do some additional processing to see how we can turn this into a dataframe.

**Key Note**:  We are not going to go over these details in class!

In [11]:
# take the name of the first instructor for each class (None if there are no instructors)
instructors = [x['instructors'][0]['name'] if x['instructors'] else None for x in data]
# take the name of the first topic for each class (None if there are no topics)
topics      = [x['topics'][0]['name'] if x['topics'] else None for x in data]
# collect the start timestamp of each class
date        = [x['starts'] for x in data]

# turn the original list into a dataframe
df          = pd.DataFrame(data)

# make new columns from the variables instructors, topics, date
df['Instructor']   = pd.Series(instructors)
df['Topic']        = pd.Series(topics)
df['date']         = pd.to_datetime(date)
df['date']         = df['date'].dt.date
# drop the columns we no longer need
df                 = df.drop(['url', 'topics', 'instructors', 'image_url', 'next_info_session',
                              'starts', 'date_description', 'number_of_sessions', 'date_num',
                              'duration_description'], axis=1)
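
The conditional expression in the `instructors` comprehension guards against an empty list, which would otherwise raise an `IndexError` at `[0]`. One way to reuse that guard for more than one field is a small helper. The function name `first_name` and the sample records below are hypothetical, not part of the notebook.

```python
def first_name(items):
    """Return the 'name' of the first dictionary in items, or None if items is empty."""
    return items[0]['name'] if items else None

# Hypothetical records: the second one has no instructors and no topics.
sample_data = [
    {'instructors': [{'name': 'Ashley Treni'}], 'topics': [{'name': 'Design'}]},
    {'instructors': [], 'topics': []},
]

instructors = [first_name(x['instructors']) for x in sample_data]
topics = [first_name(x['topics']) for x in sample_data]
print(instructors)
print(topics)
```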

The result, `df`, is a tidy, manageable dataframe that is easy to work with.  Here are its first 10 rows:

In [12]:
df.head(10)

| | format | overview | title | length_in_weeks | time_description | Instructor | Topic | date |
|---|---|---|---|---|---|---|---|---|
| 0 | event | Join us for a fireside chat with Lila Miller, ... | Fireside Chat with Lila Miller, Presented by B... | | 6:30 - 8:30pm EST | Roman Sciascia | Business | 2020-01-29 |
| 1 | class | Grasp online marketing basics, get your head a... | Digital Marketing: Key Concepts and Metrics | | 6:30 - 9:30pm EST | Salim Holder | Marketing | 2020-01-29 |
| 2 | workshop | This online bootcamp will immerse you in the f... | User Experience Design Bootcamp Remote (Online) | | 11:00 - 5:00pm EST | Joe Anastasio | Design | 2020-01-30 |
| 3 | event | In this online workshop, you’ll learn to ask t... | Intro to Data Analytics \| Livestream | | 1:00 - 3:00pm EST | Craig Fryar | Data | 2020-01-30 |
| 4 | event | In this workshop you will learn new approaches... | Go Dutch on a Property: A New Approach to Hom... | | 6:30 - 8:30pm EST | Nikki Merkerson | Business | 2020-01-30 |
| 5 | class | To successfully create an offering that dramat... | Customer Journey Mapping | | 6:30 - 9:30pm EST | Ashley Treni | Design | 2020-01-30 |
| 6 | class | In this hands-on introductory workshop, learn ... | Visual Design 101 | | 6:30 - 8:30pm EST | Dominika Juraszek | Design | 2020-01-30 |
| 7 | workshop-series | Build and Evaluate Machine Learning Models Wit... | Python and Machine Learning Bootcamp Series | | 10:00 - 5:00pm EST | Jonathan Bechtel | Data | 2020-01-31 |
| 8 | workshop | Learn to create a new file, set up master page... | Adobe InDesign Bootcamp | | 10:00 - 6:00pm EST | Dominika Juraszek | Design | 2020-01-31 |
| 9 | workshop | This single day bootcamp will run through the ... | Data Analytics Bootcamp | | 10:00 - 5:00pm EST | Kelly Gracia | Data | 2020-01-31 |
