
**Data Science - web scraper 3**

Aim of the file:

1.   Scrape multiple sites with a similar URL.
2.   Do this efficiently by running a loop over an array.
3.   Collect the results together.

In [22]:
# // 1.  Import packages that we need:
# // Always run this cell first. If you change something and come back later without rerunning it, you may get an error
import numpy as np
import pandas as pd
# // Web scraping: 
import requests
import string
from bs4 import BeautifulSoup
# // OS. Sometimes need this for finding working directory:
import os
# ////////////////////////////////////////////////////////////////

Introduction: using a base URL and injecting a series of topic names into it.
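
A minimal sketch of the substitution itself, using the FT address from this notebook. The variable names `homepage` and `tech_page` are only illustrative. `str.format` replaces the `{}` placeholder with its argument, and an empty string leaves the base URL unchanged.

```python
# // The {} in the string is a placeholder that str.format() fills in.
url_base = "https://www.ft.com/{}"

# // An empty string leaves only the base address (the homepage).
homepage = url_base.format('')

# // Any other string is inserted where the curly brackets were.
tech_page = url_base.format('tech')

print(homepage)
print(tech_page)
```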

In [None]:
# // Set the base URL: 
# // The curly brackets are a placeholder: each item you put in the array below will be inserted in their place
url_base = "https://www.ft.com/{}"

# // Add an array:
# // Because I chose the FT website I have called it 'topic', but this could be anything: it is just a name, not a formula.
# // I have taken each of the section names from the top menu of the website and inserted them below
# // For the homepage I don't want to add anything in the curly brackets, so I add an empty string ('')
topic = ['','world', 'world/uk',  'companies', 'tech', 'markets', 'climate-capital', 'opinion', 'work-careers', 'life-arts', 'htsi']

# // Create an empty array that we are going to fill, and base it on the length of the topic array
# // length = however many items you put in your array (in this case 11).
# // The empty array is a placeholder that you will fill with the info you want
# // Unlike a blank Excel sheet, a NumPy array has a fixed size, so you must say how much data it will hold before you fill it
# // Alter the string length (here 'S50') to fit the longest URL. Try a shorter figure: 'S30', for example, would cut off the longest URLs.
length = len(topic)
urls = np.empty(length, dtype='S50')

# // The motivation for a loop is to repeat the same steps many times, e.g. downloading lists for large datasets rather than doing it manually
# // "topic" and "x" can be called anything: x is just an identifier that goes from the first item to the last in order, and its value changes each time you go through the loop
# // Loops save a lot of time, like a mass mail-out
# // Loop across this array:
for x in topic:
   # // Put the particular "topic" into the base URL
   # // format takes the item x and puts it into the curly brackets of url_base from above
   # // topicURL is the full URL; it is different on each pass through the loop because it depends on x
   # // The URL then has to be stored in the right position in the array, which the remaining lines do
   topicURL = url_base.format(x)
   # // Find the index value of this particular "topic".
   # // This tells you which item number you are on. i also changes each time because it depends on x, e.g. the homepage is item zero in the list (remember Python counts 0, 1, 2, 3)
   i = topic.index(x)
   # // Fill the empty urls array, at the given index value, with the full URL for this "topic"
   # // Square brackets select a position in the array
   urls[i] = topicURL

# // Print out the urls that we have
# // This checks that the array is working. It is also helpful for checking the string length: if it is not long enough, some urls will be cut off and will not work when you click on them
# // The 'b' prefix in the output only marks each entry as a bytes string (because of dtype 'S'); it is not part of the URL
urls
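
One alternative way to write the same loop, sketched under the assumption that every URL is plain ASCII. `enumerate` supplies the position and the item together, which avoids the `topic.index(x)` lookup (that lookup returns the first match, so it would misplace results if a topic appeared twice). The names `url_list`, `longest`, `too_short` and `wide_enough` are illustrative and not part of the original notebook.

```python
import numpy as np

url_base = "https://www.ft.com/{}"
topic = ['', 'world', 'world/uk', 'companies', 'tech', 'markets',
         'climate-capital', 'opinion', 'work-careers', 'life-arts', 'htsi']

# // enumerate() gives the position i and the item x at the same time
url_list = [None] * len(topic)
for i, x in enumerate(topic):
    url_list[i] = url_base.format(x)

# // The longest URL decides how wide the fixed-width string array must be
longest = max(len(u) for u in url_list)

# // A width that is too small cuts URLs off without any warning
too_short = np.array(url_list, dtype='S30')
wide_enough = np.array(url_list, dtype='S{}'.format(longest))

print(longest)
print(too_short[6])
print(wide_enough[6])
```

Measuring the longest URL first removes the guesswork about whether 'S50' is wide enough.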

Using this in a full example:

In [24]:
# // Set the base url:
url_base = "https://www.ft.com/{}"

# // Pick the words that we want to put into the curly brackets of this url:
topic = ['','world', 'world/uk',  'companies', 'tech', 'markets', 'climate-capital', 'opinion', 'work-careers', 'life-arts', 'htsi']

# // Create an empty array that is going to house the results
# // We need to tell Python this array needs to be able to hold objects, hence dtype=object.
# // This is because we are not going to put just one number, or one string, into each position in the array
# // Rather, each part of this array is going to be a list with the individual scraping results:
length = len(topic)
data = np.empty(length, dtype='object')

# // Begin a loop, dealing with these topics one by one:
for x in topic:
   
   # // Return the index number of the thing we are working with:
   s = topic.index(x)
   
   # // Build the URL for this iteration of the loop:
   URL = url_base.format(x)
   
   # // Request the html from the URL:
   html = requests.get(URL)
   
   # // Get the soup of this page
   soup = BeautifulSoup(html.content, 'html.parser')
   
   # // Now get what we want from the page: 
   headline = soup.find_all("a", class_="js-teaser-heading-link")
   article = soup.find_all("div", class_="o-teaser__heading")

   
   # // find_all returns a list of matches; keep only the text of the first match:
   headline = headline[0].text
   article = article[0].text
 
   
   
   # // Group together:
   results = [x, headline, article]
   
   # // Sense check: print out what we have at this point in the loop.
   # // A bare variable name inside a loop displays nothing, so use print():
   print(s, x, results)

   # // Find the index value of this particular topic (the same value as s above).
   i = topic.index(x)
   
   # // Fill these results into the master array of results:
   # // Fill the empty data array, at the given index value, with the results for this topic
   data[i] = results   
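
The loop above takes element `[0]` of each `find_all` result. `find_all` returns an empty list when nothing matches, and indexing an empty list raises an `IndexError`, which would stop the whole loop. Below is a minimal sketch of one way to guard against that. The helper `first_text` is hypothetical, and the `SimpleNamespace` objects are only stand-ins for the tags that `soup.find_all(...)` would return, so the sketch runs without a network request.

```python
from types import SimpleNamespace

def first_text(matches, default=""):
    """Return the stripped text of the first match, or `default` if there are none."""
    if not matches:
        return default
    return matches[0].text.strip()

# // Stand-ins for the lists that soup.find_all(...) might return
found = [SimpleNamespace(text="Gopuff cuts pay after raising billions  ")]
nothing = []

headline = first_text(found)
missing = first_text(nothing, default="(no headline found)")

print(headline)
print(missing)
```

Inside the loop this would be used as `headline = first_text(soup.find_all("a", class_="js-teaser-heading-link"))`. The `strip()` call also removes trailing spaces.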

Now examine what we have, and how we can retrieve various parts of it:

In [25]:
data

array([list(['', 'Taliban faces growing dissent as protests erupt in Afghan cities', 'Taliban faces growing dissent as protests erupt in Afghan cities']),
       list(['world', 'Athens official blames wildfires on ‘criminal lack of preparedness’', 'Athens official blames wildfires on ‘criminal lack of preparedness’']),
       list(['world/uk', 'Boris Johnson facing calls to sack foreign secretary Dominic Raab  ', 'Boris Johnson facing calls to sack foreign secretary Dominic Raab  ']),
       list(['companies', 'Gopuff cuts pay after raising billions', 'Gopuff cuts pay after raising billions']),
       list(['tech', 'Facebook unveils virtual office app Horizon Workrooms', 'Facebook unveils virtual office app Horizon Workrooms']),
       list(['markets', 'Stocks and commodities fall on Fed and global growth jitters', 'Stocks and commodities fall on Fed and global growth jitters']),
       list(['climate-capital', 'Ozone recovery helps reduce global warming', 'Ozone recovery helps reduce 

In [20]:
data[1]

['world',
 'Athens official blames wildfires on ‘criminal lack of preparedness’',
 'Athens official blames wildfires on ‘criminal lack of preparedness’']

In [21]:
data[0][2]

'Taliban faces growing dissent as protests erupt in Afghan cities'
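
The indexing above can be sketched on a small copy of the results: `data[i]` returns the whole list for one topic and `data[i][j]` returns one item inside it. The three rows below are copied from the output shown earlier. The names `rows`, `topics`, `headlines` and `by_topic` are illustrative and not part of the original notebook.

```python
import numpy as np

# // Three rows copied from the scraped output
rows = [
    ['', 'Taliban faces growing dissent as protests erupt in Afghan cities',
     'Taliban faces growing dissent as protests erupt in Afghan cities'],
    ['world', 'Athens official blames wildfires on ‘criminal lack of preparedness’',
     'Athens official blames wildfires on ‘criminal lack of preparedness’'],
    ['companies', 'Gopuff cuts pay after raising billions',
     'Gopuff cuts pay after raising billions'],
]

# // Rebuild the object array in the same way as the loop does
data = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    data[i] = row

# // One row, and one item within a row
second_row = data[1]
first_headline = data[0][1]

# // Pull one "column" out of every row
topics = [row[0] for row in data]
headlines = [row[1] for row in data]

# // Look a headline up by topic name instead of by position
by_topic = {row[0]: row[1] for row in data}

print(topics)
print(by_topic['companies'])
```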