<a href="https://colab.research.google.com/github/Shailesh0209/x_tools_in_ds_dipoma-iitm/blob/main/x_get_data_w2_TDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# L2.2: Get the data - Nominatim Open Street Maps

## Scraping using the Geocoding API of OpenStreetMap (OSM)
We will use the Nominatim API to look up geocoding information for any free-form address text using Python.

In [None]:
# no need to install these if using Google Colab
!pip install geopandas
!pip install geopy

In [None]:
# import nominatim api
from geopy.geocoders import Nominatim

In [None]:
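# activate nominatim geocoder
# (geopy 2.x requires a user_agent; the name below is an arbitrary placeholder)
locator = Nominatim(user_agent="tds_geocoding_tutorial")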
# type any address text
location = locator.geocode("F-194, kankarbagh, Patna,Bihar, India")


In [None]:
# print latitude and longitude of the address 
print("Latitude={}, Longitude={}".format(location.latitude, location.longitude))

In [None]:
# the API output has multiple other details as a json like altitude, latitude
# longitude, correct raw address, etc.
# printing all the information

location.raw, location.point, location.longitude, location.latitude, location.altitude, location.address

In [None]:
# typing another address 
location2 = locator.geocode("IIT Madras")

In [None]:
location2.raw, location2.point, location2.longitude, location2.latitude, location2.altitude, location2.address
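`geocode` returns `None` when Nominatim finds no match, so reading `.latitude` from the result would raise an `AttributeError`. A minimal sketch of a guard follows; `describe_location` is a hypothetical helper, and the stand-in object with made-up coordinates replaces a live API call so the example runs offline.

```python
from types import SimpleNamespace

def describe_location(location):
    """Format a geocoding result, or report that nothing was found."""
    if location is None:  # geocode() returns None when there is no match
        return "Address not found"
    return "Latitude={}, Longitude={}".format(location.latitude, location.longitude)

# Stand-in for a geopy Location object (hypothetical values, no network call)
fake_location = SimpleNamespace(latitude=12.5, longitude=80.25)
print(describe_location(fake_location))
print(describe_location(None))
```

In the notebook, the same helper could be called as `describe_location(locator.geocode(address))`.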

# L2.3: Get the data - BBC Weather location service


## A tutorial to scrape the location ID of any city in BBC Weather

This code snippet takes a city name as input and sends a request to the BBC Weather location API for the city's location ID. The location ID is then used in the next part of the code to scrape the weather forecast for that city.

Web scraping is not always legal. It is a good idea to check the terms of the website you plan to scrape before proceeding. Also, if your code requests a URL from a server multiple times, it is good practice to either cache your requests or insert a timed delay between consecutive requests.
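The caching and delay advice can be sketched as a small wrapper. This is one possible approach, not part of the original notebook: `make_polite_fetcher` and `fake_fetch` are hypothetical names, and a fake fetcher stands in for `requests.get` so the example runs without network access.

```python
import time

def make_polite_fetcher(fetch, delay_seconds=1.0):
    """Wrap `fetch(url)` with an in-memory cache and a pause between real requests."""
    cache = {}
    last_call = [None]  # time of the most recent real request

    def polite_fetch(url):
        if url in cache:              # repeated URL: answer from the cache
            return cache[url]
        if last_call[0] is not None:  # wait out the remainder of the delay
            elapsed = time.monotonic() - last_call[0]
            if elapsed < delay_seconds:
                time.sleep(delay_seconds - elapsed)
        cache[url] = fetch(url)
        last_call[0] = time.monotonic()
        return cache[url]

    return polite_fetch

# Demonstration with a fake fetcher so that no network access is needed
calls = []

def fake_fetch(url):
    calls.append(url)
    return "body of " + url

get = make_polite_fetcher(fake_fetch, delay_seconds=0.1)
get("https://example.com/a")
get("https://example.com/a")   # served from the cache
get("https://example.com/b")   # a new URL, fetched after a short pause
print(calls)
```

With real requests, the wrapped function could be something like `lambda url: requests.get(url).text`.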

In [None]:
import os

import requests  # to get the webpage
import json      # to work with the API response in JSON format

from urllib.parse import urlencode
import numpy as np
import pandas as pd
import re        # regular expression operators


In [None]:
test_city = "New York"
location_url = 'https://locator-service.api.bbci.co.uk/locations?' + urlencode({
                'api_key': 'AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv',
                's': test_city,
                'stack': 'aws',
                'locale': 'en',
                'filter': 'international',
                'place-types': 'settlement,airport,district',
                'order': 'importance',
                'a': 'true',
                'format': 'json'
                })
location_url
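To see what `urlencode` does to the parameters, here is a short standalone example using two of the keys from the cell above. Spaces become `+` and commas become `%2C`, which is also why a stray space inside a comma-separated value changes the query that is sent.

```python
from urllib.parse import urlencode

params = {"s": "New York", "place-types": "settlement,airport,district"}
query = urlencode(params)
print(query)  # s=New+York&place-types=settlement%2Cairport%2Cdistrict

# A stray space after a comma is encoded too, so the server receives a different value
print(urlencode({"place-types": "airport, district"}))  # place-types=airport%2C+district
```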

In [None]:
result = requests.get(location_url, verify=False).json()  # parse the JSON so it can be indexed below
result

In [None]:
# Print locationid
result['response']['results']['results'][0]['id']

### Creating a function to output the location ID for any city name given as input

In [None]:
def getlocid(city):
    city = city.lower() # convert city name to lowercase to standardize format
    # Convert into an API call using URL encoding
    location_url = 'https://locator-service.api.bbci.co.uk/locations?' + urlencode({
        'api_key': 'AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv',
        's': city,
        'stack': 'aws',
        'locale': 'en',
        'filter': 'international',
        'place-types': 'settlement,airport,district',
        'order': 'importance',
        'a': 'true',
        'format': 'json'
    })
    result = requests.get(location_url).json()
    locid = result['response']['results']['results'][0]['id']
    return locid

In [None]:
getlocid('Toronto')
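Indexing `[0]` directly raises an `IndexError` when the service returns no match for a city. A minimal sketch of a more defensive lookup is shown below. `extract_location_id` is a hypothetical helper, and the sample payloads are made up: they only mirror the nesting used in `getlocid`.

```python
def extract_location_id(payload):
    """Return the first location id in a parsed response, or None if there is none."""
    results = payload.get("response", {}).get("results", {}).get("results", [])
    if not results:
        return None
    return results[0].get("id")

# Hypothetical payloads that follow the same nesting as the real response
sample = {"response": {"results": {"results": [{"id": "12345", "name": "Sample City"}]}}}
empty = {"response": {"results": {"results": []}}}
print(extract_location_id(sample))
print(extract_location_id(empty))
```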

# L2.4: Get the data - Scraping with Excel

# L2.5: Get the data - Scraping with Python

## Web Scraping IMDb

In this exercise we'll look at scraping data from IMDb. Our goal is to convert IMDb's Top 250 list of movies into tabular form using Python. This data can then be used for further analysis.

1: Import Necessary Libraries

In [None]:
from bs4 import BeautifulSoup as bs
import requests # to access website
import pandas as pd

2: Load the webpage

In [None]:
r = requests.get("https://www.imdb.com/chart/top/")

# Convert to a BeautifulSoup object (naming the parser avoids a warning)
soup = bs(r.content, "html.parser")

# Store and print the prettified HTML
contents = soup.prettify()
print(contents)

3: Creating empty list

In [None]:
movie_title = []
movie_year = []
movie_rating = []

4: Extract HTML tag contents

In [None]:
imdb_table = soup.find(class_="chart full-width")

In [None]:
movie_titlecolumn= imdb_table.find_all(class_="titleColumn")

In [None]:
movie_ratingscolumn = imdb_table.find_all(class_="ratingColumn imdbRating")

In [None]:
for row in movie_titlecolumn:
    title = row.a.text # tag content extraction
    movie_title.append(title)
movie_title

In [None]:
for row in movie_titlecolumn:
    year = row.span.text # tag content extraction: the text contained inside the span tag
    movie_year.append(year)
movie_year

In [None]:
for row in movie_ratingscolumn:
    rating = row.strong.text # tag content extraction
    movie_rating.append(rating)
movie_rating

5: Create DataFrame

In [None]:
movie_df = pd.DataFrame({'Movie Title': movie_title, 'Year of Release': movie_year, 'IMDB Rating': movie_rating})
movie_df

In [None]:
movie_df.head()
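The scraped values are all strings, so the year and rating columns usually need converting before any numeric analysis. A minimal sketch follows, assuming the year text arrives wrapped in parentheses, such as `(1994)`. That format is an assumption about the page, and `clean_year` and `clean_rating` are hypothetical helpers.

```python
import re

def clean_year(text):
    """Pull a four-digit year out of a string such as '(1994)'."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None

def clean_rating(text):
    """Convert a rating string such as ' 9.2 ' to a float."""
    return float(text.strip())

print(clean_year("(1994)"), clean_rating(" 9.2 "))
```

These could be applied to the DataFrame with, for example, `movie_df['Year of Release'].map(clean_year)`.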

# L2.6: Get the data - Wikimedia

This is a short, self-explanatory tutorial on using the wikipedia library to extract information from Wikipedia.

In [None]:
!pip install wikipedia
import wikipedia as wk

In [None]:
print(wk.search("Isaac Newton"))

In [None]:
print(wk.search("IIT Madras", results=2))

In [None]:
print(wk.summary("Isaac Newton"))

In [None]:
print(wk.summary("IIT Madras", sentences=2))

In [None]:
full_page = wk.page("IIT Madras")
print(full_page.content)


In [None]:
print(full_page.url)

In [None]:
print(full_page.references)

In [None]:
print(full_page.images)

In [None]:
print(full_page.images[0])

In [None]:
# extract html code of wikipedia page based on any search text
html = wk.page("IIT Madras").html().encode("UTF-8")

import pandas as pd
df = pd.read_html(html)[6]

df

# L2.7: Get the data - Scrape BBC weather with Python

## A tutorial to scrape the web.

This example scrapes BBC Weather for a specific city, collects the weather forecast for the next 14 days, and saves it as a CSV file.

Web scraping is not always legal. It is a good idea to check the terms of the website you plan to scrape before proceeding. Also, if your code requests a URL from a server multiple times, it is good practice to either cache your requests or insert a timed delay between consecutive requests.

In [None]:
import json   # to work with the API response in JSON format

from urllib.parse import urlencode

import requests # to get the webpage 
from bs4 import BeautifulSoup # to parse the webpage 

import pandas as pd
import re # regular expression operators

from datetime import datetime


We now GET the webpage of interest from the server.

In [None]:
required_city = "Mumbai"
location_url = 'https://locator-service.api.bbci.co.uk/locations?' + urlencode({
    'api_key': 'AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv',
    's': required_city,
    'stack': 'aws',
    'locale': 'en',
    'filter': 'international',
    'place-types': 'settlement,airport,district',
    'order': 'importance',
    'a': 'true',
    'format': 'json'
})
location_url

In [None]:
result = requests.get(location_url, verify=False).json()
result

In [None]:
# url = 'https://www.bbc.com/weather/1275339'  # URL to BBC Weather
# corresponding to a specific city (Mumbai, in this example)

url = 'https://www.bbc.com/weather/'+result['response']['results']['results'][0]['id']
response = requests.get(url, verify=False)

Next, we create an instance of BeautifulSoup.

In [None]:
soup = BeautifulSoup(response.content, 'html.parser')

The information we want (daily high and low temperatures, and the daily weather summary) is in specific blocks on the webpage. We need to find the block type, the type of identifier, and the identifier name. All of these can be found by right-clicking on the webpage and selecting 'Inspect' in the Chrome browser; other browsers offer a similar option.
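The idea of locating a block by its tag type and class name can be illustrated offline. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without installing anything. The HTML snippet, the class name `temp-high`, and the `ClassTextCollector` class are all made up for illustration.

```python
from html.parser import HTMLParser

class ClassTextCollector(HTMLParser):
    """Collect the text inside every element with the given tag and class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:   # nested tag of the same type
                self.depth += 1
        elif tag == self.tag and self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

# Hypothetical markup: block type span, identifier type class, class name temp-high
html = """
<span class="temp-high"><span class="value">31°</span></span>
<span class="temp-low">26°</span>
<span class="temp-high">30°</span>
"""
collector = ClassTextCollector("span", "temp-high")
collector.feed(html)
print(collector.results)
```

BeautifulSoup's `find_all('span', attrs={'class': ...})` performs the same kind of lookup with far less code, which is why the notebook uses it.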

In [None]:
daily_high_values = soup.find_all('span',
                    attrs={'class':
                    'wr-day-temperature__high-value'})
# block-type: span; identifier type: class; and
# class name: wr-day-temperature__high-value
daily_high_values

In [None]:
daily_low_values = soup.find_all('span',
                attrs={'class': 'wr-day-temperature__low-value'})
daily_low_values

In [None]:
daily_summary = soup.find('div',
            attrs={'class': 'wr-day-summary'})
daily_summary

In [None]:
daily_summary.text

`General bookkeeping`:
With the code in the cells above, we get forecast data for 14 days, including today. We will now post-process the data: first extract the required text and discard the HTML wrapper code, then combine all variables into one common list, and finally convert it into a pandas DataFrame.

In [None]:
daily_high_values[0].text.strip()


In [None]:
daily_high_values[5].text.strip()

In [None]:
daily_high_values[0].text.strip().split()[0]

In [None]:
daily_high_values_list = [value.text.strip().split()[0] for value in daily_high_values]
daily_high_values_list

In [None]:
daily_low_values_list = [value.text.strip().split()[0] for value in daily_low_values]
daily_low_values_list

In [None]:
daily_summary.text

In [None]:
daily_summary_list = re.findall('[a-zA-Z][^A-Z]*', daily_summary.text) # split the string on uppercase
daily_summary_list
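To see how this pattern splits text, here it is applied to a made-up string in which several summaries are run together without separators. Each match starts at a letter and continues until the next uppercase letter.

```python
import re

# Hypothetical concatenated summaries (not real scraped output)
sample_text = "Sunny intervalsLight rain showersThundery showers"
parts = re.findall('[a-zA-Z][^A-Z]*', sample_text)
print(parts)  # ['Sunny intervals', 'Light rain showers', 'Thundery showers']
```

The approach relies on each summary starting with its only capital letter; a summary with a capital letter in the middle would be split in two.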

In [None]:
datelist = pd.date_range(datetime.today(), periods=len(daily_high_values)).tolist()
datelist

In [None]:
datelist = [d.date().strftime('%y-%m-%d') for d in datelist]
datelist
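Note that `%y` produces a two-digit year. The short standard-library example below builds consecutive dates from a fixed start date, chosen only to make the output reproducible, and compares `%y` with `%Y`.

```python
from datetime import date, timedelta

start = date(2024, 3, 5)  # fixed example date so the output is reproducible
days = [start + timedelta(days=i) for i in range(3)]

short = [d.strftime('%y-%m-%d') for d in days]  # two-digit year
full = [d.strftime('%Y-%m-%d') for d in days]   # four-digit year
print(short)  # ['24-03-05', '24-03-06', '24-03-07']
print(full)   # ['2024-03-05', '2024-03-06', '2024-03-07']
```

The four-digit form is less ambiguous when the CSV is opened later in a spreadsheet or parsed by another tool.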

In [None]:
zipped = zip(datelist, daily_high_values_list, daily_low_values_list, daily_summary_list)

In [None]:
df = pd.DataFrame(list(zipped), columns=['Date', 'High', 'Low', 'Summary'])

In [None]:
display(df)

In [None]:
# remove the 'degree' character ('°' needs no escaping in a regular expression)
df.High = df.High.replace('°', '', regex=True).astype(float)
df.Low = df.Low.replace('°', '', regex=True).astype(float)

In [None]:
display(df)

Extract the name of the city for which data is gathered


In [None]:
# location = soup.find('div', attrs={'class':'wr-c-location'})
location = soup.find('h1', attrs={'id':'wr-location-name-id'})
location.text.split()

In [None]:
# save the forecast as a CSV file named after the city
filename_csv = location.text.split()[0]+'.csv'
df.to_csv(filename_csv, index=False)

In [None]:
filename_xlsx = location.text.split()[0]+'.xlsx'
df.to_excel(filename_xlsx)

# L2.8: Get the data - Scraping PDFs

# GA2