## PS7: Webscraping
Due: 17-MAR-2021

Import BeautifulSoup, json, requests, and pandas.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import json

## IMDB top 50 rated films.

The following URL, https://www.imdb.com/search/title/?groups=top_250&sort=user_rating, is a link to the top 50 rated films on IMDB. Create a pandas DataFrame with three columns: Title, Year, and Rating, pulling the data from the webpage. 

We can do this in steps. First, get the HTML code that generated the webpage. 

In [2]:
url = "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating"

Using the "Inspect Element" tool in a browser, see that each film is displayed in a `DIV` with the class `lister-item`. Use BS to find all such elements and store them in a list called `films`.

Then, create a list of the title of each film. Notice, by inspecting the HTML, that the title is contained inside an `<a>` tag (a link) that is itself inside a `DIV` with class `lister-item-content`. That is, for each film in the list `films`, find the div with the class `lister-item-content`, then find the first link and get the text of that link. Store this in a dataframe called `films_df` (which currently has a single column, 'Title').

In [3]:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
films = soup.find_all('div', class_='lister-item') #List of all film containers on the page
title_list = [] #Collects the title of each film
for film in films: #Iterates over the film containers
    content = film.find('div', class_='lister-item-content') #Content block of this film
    title = content.find('a') #The first link in the content block holds the title
    title_list.append(title.text) #Appends the title as text
T = 'Title'
films_df = pd.DataFrame(title_list, columns=[T]) #Defines DataFrame
films_df

Unnamed: 0,Title
0,The Shawshank Redemption
1,The Godfather
2,The Dark Knight
3,The Godfather: Part II
4,12 Angry Men
5,The Lord of the Rings: The Return of the King
6,Pulp Fiction
7,Schindler's List
8,Inception
9,Fight Club
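
The nested lookup described above can be tried offline on a small fragment. The sketch below is only illustrative: the HTML is made up to mimic the structure described in this section (it is not real IMDb markup), and it shows why the search is scoped to the `lister-item-content` div rather than taking the first link in the whole film container.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described above; not real IMDb markup.
html = """
<div class="lister-item">
  <div class="lister-item-image"><a href="/title/tt0000001/">poster</a></div>
  <div class="lister-item-content">
    <h3><a href="/title/tt0000001/">First Film</a></h3>
  </div>
</div>
<div class="lister-item">
  <div class="lister-item-content">
    <h3><a href="/title/tt0000002/">Second Film</a></h3>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
films = soup.find_all("div", class_="lister-item")

titles = []
for film in films:
    # Scope the search to the content block so the poster link is skipped.
    content = film.find("div", class_="lister-item-content")
    link = content.find("a")
    titles.append(link.get_text(strip=True))

print(titles)  # ['First Film', 'Second Film']
```

Searching `film.find("a")` directly would return the poster link for the first film in this fragment, which is why the intermediate div matters.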


Repeat: now create a list of the year of each film, and store it in a second column of `films_df`. This is even easier since each year is stored in a `span` with class `lister-item-year`. Convert the text to an integer (which means first formatting the string to remove the parentheses).

In [4]:
year_list=[] #Same format as before
for j in range(len(films)): 
    film_j=films[j].find_all('span', class_ = 'lister-item-year')
    year=str(film_j[0].text)
    year=year.replace('(','') 
    year=year.replace(')','')
    year_list.append(int(year))
T='Title'
Y='Year'
data={T:title_list,Y:year_list}
films_df = pd.DataFrame(data ,columns=[T, Y])
films_df

Unnamed: 0,Title,Year
0,The Shawshank Redemption,1994
1,The Godfather,1972
2,The Dark Knight,2008
3,The Godfather: Part II,1974
4,12 Angry Men,1957
5,The Lord of the Rings: The Return of the King,2003
6,Pulp Fiction,1994
7,Schindler's List,1993
8,Inception,2010
9,Fight Club,1999
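
Stripping the parentheses works when the text is exactly of the form `(1994)`. As a hedged alternative, the sketch below uses the `re` module (already imported at the top) to pull out the first four-digit number instead. The helper name `parse_year` is hypothetical, and the second input form, with an extra marker before the year, is an assumption about what such a page might contain rather than something shown in the output above.

```python
import re

def parse_year(text):
    """Return the first four-digit number found in text as an int, or None."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None

print(parse_year("(1994)"))      # 1994
print(parse_year("(I) (2019)"))  # 2019 (assumed input form with an extra marker)
print(parse_year("no year"))     # None
```

Returning `None` instead of raising keeps the loop running when a year is missing; whether that is preferable to failing loudly depends on how the column is used afterwards.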


Repeat: now create a list of the score of each film. This time, you have to figure out where it is stored. Convert the text to a float and store it in the third column of `films_df`.

In [5]:
score_list=[] #Same format as before
for k in range(len(films)):
    film_k=films[k].find_all('div', class_='inline-block ratings-imdb-rating')
    score=float(film_k[0].text) #float() ignores the surrounding whitespace and newlines
    score_list.append(score)
T='Title'
Y='Year'
S='Rating'
data={T:title_list,Y:year_list, S:score_list}
films_df = pd.DataFrame(data ,columns=[T, Y, S])
films_df

Unnamed: 0,Title,Year,Rating
0,The Shawshank Redemption,1994,9.3
1,The Godfather,1972,9.2
2,The Dark Knight,2008,9.0
3,The Godfather: Part II,1974,9.0
4,12 Angry Men,1957,9.0
5,The Lord of the Rings: The Return of the King,2003,8.9
6,Pulp Fiction,1994,8.9
7,Schindler's List,1993,8.9
8,Inception,2010,8.8
9,Fight Club,1999,8.8
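
The three loops above each walk the same list of films. One possible variation, sketched here under assumptions, collects title, year, and rating in a single pass and builds one record per film. The HTML fragment is made up for illustration (real pages may nest these elements differently), and the class names are the ones used in the cells above.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; not real IMDb markup.
html = """
<div class="lister-item">
  <div class="lister-item-content">
    <h3><a>First Film</a> <span class="lister-item-year">(1994)</span></h3>
    <div class="inline-block ratings-imdb-rating">
      <strong>9.3</strong>
    </div>
  </div>
</div>
<div class="lister-item">
  <div class="lister-item-content">
    <h3><a>Second Film</a> <span class="lister-item-year">(1972)</span></h3>
    <div class="inline-block ratings-imdb-rating">
      <strong>9.2</strong>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for film in soup.find_all("div", class_="lister-item"):
    content = film.find("div", class_="lister-item-content")
    year_text = film.find("span", class_="lister-item-year").get_text()
    # Matching on a single class is enough even when the tag carries several.
    rating_div = film.find("div", class_="ratings-imdb-rating")
    rows.append({
        "Title": content.find("a").get_text(strip=True),
        "Year": int(year_text.strip("()")),
        "Rating": float(rating_div.get_text(strip=True)),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which keeps the three columns aligned by construction.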


Find the correlation between the year and the score of the films.

In [6]:
x=films_df["Year"] #Sets the inputs of x and y
y=films_df["Rating"]
corr=x.corr(y) #Calculates the correlation between x and y
print(corr) #A value near -0.11 suggests only a weak negative relationship: newer films tend to have slightly lower ratings

-0.11464313226119215
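
`Series.corr` computes the Pearson correlation coefficient by default. As a rough illustration of what that number measures, the sketch below computes the same statistic by hand using only the ten rows displayed in the tables above. Because it uses ten films rather than all fifty, its result will differ from the value printed above; the helper name `pearson` is hypothetical.

```python
import math

# The ten (Year, Rating) pairs shown in the output tables above.
years = [1994, 1972, 2008, 1974, 1957, 2003, 1994, 1993, 2010, 1999]
ratings = [9.3, 9.2, 9.0, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8]

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(years, ratings)
print(r)
```

The coefficient always lies between -1 and 1; values close to 0 indicate little linear relationship, so the sign alone says less than the magnitude.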


## Indeed Job Listings.

The object of this question is to construct a dataframe with a list of job listings given a (US) city. 
This question is hard and should take some time.

First, we want to create a function `createURL` that takes a city and a start value and creates something that looks like 

`'https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=CITY&start=START'`

where `CITY` and `START` are input parameters. So `createURL('New York', 10)` should create

https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10

Notice that we need to replace spaces with plus signs in order to make the URL work.

In [7]:
def createURL(city, start):
    """
    inputs: A string and an integer respectively
    output: A URL
    A function that generates a URL featuring parameters which were used as inputs
    """
    url='https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=CITY&start=START'
    city=city.replace(' ','+') #Replaces any spaces with + to make a usable URL
    url=url.replace('CITY', city) #Replaces CITY with the input
    url=url.replace('START', str(start)) #Replaces START with the input in string form
    return url #Returns the URL so that other functions can use it

print(createURL('New York', 10))

https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10
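
Replacing spaces by hand covers city names with spaces, but other characters (commas, ampersands) also need escaping in a query string. As a hedged alternative, the sketch below lets the standard library do the encoding with `urllib.parse.urlencode`. The function name `build_url` is hypothetical, and the search text `data scientist $20,000` is simply the decoded form of the `q` parameter in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_url(city, start):
    """Build the search URL, letting urlencode escape spaces and symbols."""
    params = {
        "q": "data scientist $20,000",  # decoded form of the q parameter above
        "l": city,
        "start": start,
    }
    return BASE_URL + "?" + urlencode(params)

print(build_url("New York", 10))
# https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10
```

`urlencode` turns spaces into `+`, `$` into `%24`, and `,` into `%2C`, so the result matches the target URL without any manual string replacement.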


Then examine the link above, which should be the output of `createURL('New York', 10)`. Create three functions:
 - `extract_title`, which extracts the job title from each listing and stores the results in a list. These can be found in `a` tags with the attribute `data-tn-element` equal to `jobTitle`. Thus you can use `soup.findAll(name="a", attrs={"data-tn-element":"jobTitle"})` to get all the elements. The job title is then stored in the `title` attribute of the element, which can be obtained via dictionary-like indexing: `element['title']`. (Note that `element.title` does something different: it searches for a child `<title>` tag.)
 - `extract_location`, which extracts the location from each listing and stores the results in a list. These can be found in `span` tags with the attribute `class` equal to `location`. The relevant information is in the `text` attribute.
 - `extract_company`, which extracts the company name from each listing and stores the results in a list. Use inspect element in your browser to figure out which tags to search (it's very similar to the previous function).
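
The difference between an HTML attribute, a child tag, and the text of an element is easy to mix up in BeautifulSoup. The sketch below illustrates the three kinds of access on a single made-up `a` tag (the markup is invented for illustration and is not copied from Indeed).

```python
from bs4 import BeautifulSoup

# Made-up element for illustration; not copied from a real listing.
html = '<a data-tn-element="jobTitle" title="Data Scientist" href="#">\nData Scientist\n</a>'
soup = BeautifulSoup(html, "html.parser")
element = soup.find("a", attrs={"data-tn-element": "jobTitle"})

print(element["title"])              # Data Scientist (value of the title attribute)
print(element.get("title"))          # Data Scientist (same, but None if the attribute is missing)
print(element.title)                 # None (searches for a child <title> tag)
print(element.get_text(strip=True))  # Data Scientist (visible text, whitespace stripped)
```

Dot access such as `element.title` is shorthand for `element.find("title")`, which is why it returns `None` here, while `element.text` works because it returns the text inside the tag.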

In [52]:
def extract_title(soup):
    """
    inputs: A HTML file which is generated using BeautifulSoup
    output: A list of strings
    A function that generates a list of Job Titles from a specified webpage (within set parameters)
    """
    title_list=[] #Empty list to help count the titles
    title_soup=soup.findAll(name="a", attrs={"data-tn-element":"jobTitle"}) #Generates a list of all raw code instances
    for i in range(len(title_soup)): 
        titles=title_soup[i].text #Uses the link text; .title looks for a child <title> tag, whereas the title attribute is title_soup[i]['title']
        titles=titles.replace('\n','') #Deletes any \n to make it pretty
        title_list.append(titles) #Appends the title as text onto the empty list
    return title_list

page = requests.get("https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10") #URL input
soup = BeautifulSoup(page.content, 'html.parser') #HTML file from URL to be used as the function input
extract_title(soup)

['Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist, Buyer Experience',
 'Senior Data Scientist, Machine Learning Research',
 'Data Scientist',
 'Content Contributor : Exploratory Data Analysis',
 'Business Intelligence Analyst / Data Scientist',
 'Senior Data Scientist (Americas - Remote)',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist I/II - 014116',
 'Data Scientist, Practice Management',
 'Data Scientist']

In [53]:
def extract_location(soup): 
    """
    inputs: A HTML file which is generated using BeautifulSoup
    output: A list of strings
    A function that generates a list of Job Locations from a specified webpage (within set parameters)
    """
    loc_list=[] #Same format as previously; the location is stored in the .text of each span
    loc_soup=soup.findAll("span", attrs={"class":"location"})
    for i in range(len(loc_soup)):
        location=loc_soup[i].text
        loc_list.append(location)
    return loc_list

page = requests.get("https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")
soup = BeautifulSoup(page.content, 'html.parser')
extract_location(soup)

['New York State',
 'New York, NY 10007 (Tribeca area)',
 'Utica, NY 13502',
 'New York, NY',
 'Rochester, NY 14607 (East Avenue area)',
 'New York, NY 10007 (Financial District area)',
 'Niskayuna, NY 12309',
 'New York, NY',
 'Armonk, NY 10504',
 'New York, NY',
 'New York, NY 10022 (Turtle Bay area)',
 'New York, NY 10001 (Chelsea area)',
 'New York, NY 10007 (Tribeca area)',
 'New York, NY',
 'Utica, NY 13502']

In [10]:
def extract_company(soup): 
    """
    inputs: A HTML file which is generated using BeautifulSoup
    output: A list of strings
    A function that generates a list of Companies from a specified webpage (within set parameters)
    """
    comp_list=[] #Same format as previously
    comp_soup=soup.findAll("span", attrs={"class":"company"}) #Matches span tags with class "company"
    for i in range(len(comp_soup)):
        company=comp_soup[i].text
        company=company.replace('\n','')
        comp_list.append(company)
    return comp_list
    
page = requests.get("https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")
soup = BeautifulSoup(page.content, 'html.parser')

extract_company(soup)

[]

Now create a function `URLtoDataFrame` that takes a city and start, creates the relevant URL and constructs a dataframe with the columns `Job Title`, `Location`, and `Company` and with the rows as the data scraped from the above functions.

In [54]:
def URLtoDataFrame(city, start):
    """
    inputs: A string and an integer respectively
    output: A dataframe 
    A function that generates a URL and a dataframe with information on Job Titles, Locations and Companies
    """
    url='https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=CITY&start=START'
    city=city.replace(' ','+') 
    url=url.replace('CITY', city)
    url=url.replace('START', str(start)) #Obtains usable URL
    print("URL:", url)
    page = requests.get(url) 
    soup = BeautifulSoup(page.content, 'html.parser') #Creates Input for sub-functions
    data={'Job Title':extract_title(soup),'Location':extract_location(soup), 'Company':extract_company(soup)} #Generates Data
    df=pd.DataFrame(data, columns=['Job Title', 'Location', 'Company']) #Formats DataFrame from sub-function outputs
    return df
    
URLtoDataFrame('New York', 20) #Test Function

URL: https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=20


Unnamed: 0,Job Title,Location,Company
0,DATA SCIENTIST,"New York, NY",Sightly Enterprises
1,eCom Data Scientist,"New York, NY",PepsiCo
2,Data Scientist IBM Tech Re-Entry,"Armonk, NY 10504",IBM
3,"Senior Data Scientist, Applied Machine Learning","New York, NY 10007 (Tribeca area)",Flatiron Health
4,Data Scientist - Publishing Royalties,"New York, NY",Spotify
5,Data Scientist,"New York, NY",RISIRISA
6,Data Scientist,"New York, NY",Neuberger Berman
7,Data Scientist,"New York, NY 10013 (SoHo area)",Sharecare Inc
8,Data Scientist (Data Visualization) - Podcasts...,"New York, NY",Spotify
9,Data Scientist,"New York, NY",MassMutual


Finally, construct a function that concatenates all the data together to obtain 50 companies for a given city. That is, it iterates over the start values $0, 10, 20, \ldots, 90$ and combines the dataframes.

In [49]:
def CitytoDataFrame(city):
    """
    inputs: A string
    output: A dataframe
    A function that generates multiple URLs for a single dataframe with information on:
    Job Titles, Locations and Companies without repeating any Companies over an iteration of 10 different URLs per city
    """
    url='https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=CITY&start=START'
    city=city.replace(' ','+')
    url=url.replace('CITY', city) #Obtaining URL
    frames=[] #One DataFrame per results page
    for n in range(10): #Start values 0, 10, ..., 90
        url_n=url.replace('START', str(n*10))
        page = requests.get(url_n) #Requests page for the nth URL
        soup = BeautifulSoup(page.content, 'html.parser') #Obtains HTML File
        data={'Job Title':extract_title(soup),'Location':extract_location(soup),'Company':extract_company(soup)}
        frames.append(pd.DataFrame(data, columns=['Job Title', 'Location', 'Company'])) #DataFrame for this page
    df=pd.concat(frames, ignore_index=True) #Stacks the pages into a single DataFrame
    df=df.drop_duplicates(subset='Company') #Deletes any rows containing duplicate company values, ensuring uniqueness
    return df

In [50]:
CitytoDataFrame('Chicago')


Unnamed: 0,Job Title,Location,Company
0,Associate Data Scientist,"Deerfield, IL 60015",WALGREENS
1,Data Scientist,"Chicago, IL 60615 (East Hyde Park area)",NowPow
2,Data Scientist,"Chicago, IL",Eleks
3,Data Scientist,"Chicago, IL 60604 (The Loop area)",CNA Insurance
4,Cat Digital - Associate Data Scientist,"Chicago, IL 60661 (Near West Side area)",Caterpillar
5,Data Scientist,"Chicago, IL",Lenovo
6,Data Scientist,"Chicago, IL",Hitachi Solutions Ltd
7,Jr. Data Scientist,"Chicago, IL",Federal Reserve Bank of New York
8,Data Scientist,"Chicago, IL 60606 (The Loop area)",Walker & Dunlop
9,Data Scientist,"Chicago, IL 60290",Thermo Fisher Scientific
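
Combining per-page results is a matter of stacking DataFrames row-wise. The sketch below shows the idea offline with `pd.concat` on two small, made-up "pages" (the job data is invented for illustration), followed by the same `drop_duplicates` step used in `CitytoDataFrame`.

```python
import pandas as pd

# Made-up results standing in for two scraped pages.
pages = [
    {
        "Job Title": ["Data Scientist", "Data Analyst"],
        "Location": ["Chicago, IL", "Chicago, IL"],
        "Company": ["Acme", "Globex"],
    },
    {
        "Job Title": ["Senior Data Scientist", "Data Scientist"],
        "Location": ["Chicago, IL", "Evanston, IL"],
        "Company": ["Acme", "Initech"],
    },
]

frames = [pd.DataFrame(page) for page in pages]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1 per page.
combined = pd.concat(frames, ignore_index=True)

# Keep the first row seen for each company, then renumber again.
unique = combined.drop_duplicates(subset="Company").reset_index(drop=True)

print(len(combined))             # 4
print(list(unique["Company"]))   # ['Acme', 'Globex', 'Initech']
```

Assigning a list into a single cell of an existing column does not append rows, which is why collecting the page frames in a list and concatenating once is the usual pattern.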


Dear Reader,
With regard to the last part of Question 2, I ran into the CAPTCHA issue, and since it's 7 AM and I have been working for close to 12 hours, I will have to concede the remainder of this question. However, the function `CitytoDataFrame` does perform its required task (to my understanding). The function takes a string input, generates a URL for each of 10 iterations, and concatenates the required fields using the previous functions. It also removes any row whose Company value has already appeared, so that every company is unique. What I am unsure of is how many entries are in the dataframe (since I can't get an output); however, the intention is to have a row for each listing on each page, for 10 pages. I hope you will consider this when marking this question, since I thoroughly enjoyed coding it before the unresponsiveness.