
# ADVANCED PANDAS: DATA IMPORTING & WEB SCRAPING

Course Outline:
- Basic Data Importing
    - Flat Files (.csv, .tsv, .txt)
    - Excel Files (.xlsx)
    - Other Files (.dta, .mat, etc.)
- Importing Data from Relational Databases
    - SQL Crash Course
    - Database Files (.db, .sqlite, etc.)
- ***Importing Data from the Internet***
    - HTML & CSS Crash Course
    - Web Scraping Basics
    - Working with JSON Data & APIs
- ***Case-study: Wuzzuf.com [Web Scraping]***

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

==========

## Case-study: Wuzzuf.com [Web Scraping]

- Wuzzuf.com URL: https://wuzzuf.net/search/jobs/?a=navbl%7Cspbl&q=illustrator
- Full Tutorial: https://www.edureka.co/blog/web-scraping-with-python/

##### How Do You Scrape Data From A Website?

- Find the URL that you want to scrape
- Inspect the page
- Find the data you want to extract
- Write the code
- Run the code and extract the data
- Store the data in the required format
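
The steps above can be sketched offline before touching a live site. This is a minimal illustration only: `SAMPLE_HTML`, the `job` and `company` class names, and the `extract_jobs` helper are all hypothetical stand-ins, not the markup of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; real markup will differ.
SAMPLE_HTML = """
<div class="job"><h2>Graphic Designer</h2><a class="company">Acme</a></div>
<div class="job"><h2>Illustrator</h2><a class="company">Globex</a></div>
"""

def extract_jobs(html):
    """Parse the HTML and return one dict per job card."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", class_="job"):
        jobs.append({
            "title": card.h2.get_text(strip=True),
            "company": card.find("a", class_="company").get_text(strip=True),
        })
    return jobs

jobs = extract_jobs(SAMPLE_HTML)
print(jobs)
```

Working against a small inline string like this makes it easier to check the extraction logic before pointing it at a real URL.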

##### Importing Libraries & Methods

In [None]:
# BeautifulSoup is a Python library for pulling data out of HTML and XML files
from bs4 import BeautifulSoup as bs

# urllib.request for opening and reading URLs
from urllib.request import urlopen

##### Inputting the URL

In [None]:
# Getting the website page address
url = 'https://wuzzuf.net/search/jobs/?a=navbl%7Cspbl&q=illustrator'
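
The query string in this address is percent-encoded (`%7C` is the `|` character). As a rough sketch, the same URL can be assembled with `urllib.parse.urlencode`. The `build_search_url` helper is hypothetical, the `a` value is simply copied from the URL above, and whether other `q` values behave the same way on the site is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://wuzzuf.net/search/jobs/"

def build_search_url(query, a="navbl|spbl"):
    """Return the search URL for `query`, percent-encoding the parameters."""
    # The meaning of `a` is not documented here; it is copied from the original URL.
    return BASE_URL + "?" + urlencode({"a": a, "q": query})

search_url = build_search_url("illustrator")
print(search_url)
```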

##### Opening a Request to the URL

In [None]:
client = urlopen(url)
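
Some sites may reject requests that carry the default `Python-urllib` user agent. A possible workaround, shown here as a sketch, is to wrap the URL in a `urllib.request.Request` with an explicit `User-Agent` header. The `build_request` helper and the user-agent string are assumptions chosen for illustration.

```python
from urllib.request import Request

def build_request(url, user_agent="Mozilla/5.0 (compatible; course-notebook)"):
    """Return a Request carrying an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

request = build_request("https://wuzzuf.net/search/jobs/?a=navbl%7Cspbl&q=illustrator")

# The request can then be opened with a timeout, for example:
# client = urlopen(request, timeout=10)
```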

##### Getting the HTML Code of the Full Page

In [None]:
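# read() returns the whole page as raw bytes
html = client.read()
# Show only the first 500 bytes to keep the output readable
print(html[:500])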

##### Closing the Request

In [None]:
client.close()
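
Opening, reading, and closing can also be combined in a `with` block, which closes the connection even if reading fails. The sketch below uses a hypothetical `read_page` helper and a `data:` URL so that it runs without network access; the 10-second timeout and the UTF-8 fallback are assumptions.

```python
from urllib.request import urlopen

def read_page(url, timeout=10):
    """Read a URL and return its body as text, closing the connection automatically."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)

# A data: URL keeps the demonstration offline; a page URL is passed the same way.
text = read_page("data:text/plain;charset=utf-8,hello")
print(text)
```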

##### Creating an HTML Parser Using BeautifulSoup

In [None]:
soup = bs(html, "html.parser")

In [None]:
# The page is now parsed; displaying the soup shows the full HTML tree
soup

##### Creating a Container for the Needed Data

In [None]:
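# Each job posting sits in a div with this class; the names look auto-generated,
# so they may need updating if the site changes
containers = soup.find_all("div", {"class": "css-qa8nz1-Card e1v1l3u10"})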

In [None]:
len(containers)

In [None]:
# prettify() is a method of the tag itself
print(containers[0].prettify())
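
Class names such as `css-qa8nz1-Card` look machine-generated, so an exact match may stop working when the site is rebuilt. One hedged alternative is a CSS attribute selector that matches only the stable-looking part of the name. The markup below is hypothetical, and the assumption that `-Card` stays constant on the real page is untested.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating generated class names; the real page may differ.
html_sample = """
<div class="css-abc123-Card e1v1l3u10"><h2>Illustrator</h2></div>
<div class="css-abc123-Card e1v1l3u10"><h2>Graphic Designer</h2></div>
<div class="css-zzz999-Footer">Footer</div>
"""
sample_soup = BeautifulSoup(html_sample, "html.parser")

# [class*="-Card"] matches any div whose class attribute contains "-Card".
cards = sample_soup.select('div[class*="-Card"]')
titles = [card.h2.get_text(strip=True) for card in cards]
print(len(cards), titles)
```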

##### Accessing Page Elements

In [None]:
# Let's check how to access a specific element in the page (i.e. the job title)
containers[0].div.h2.text

In [None]:
# A more robust way to access the job title: search by tag and class
# (find_all is the current name; findAll is a legacy alias)
job_title = containers[0].find_all("h2", {"class": "css-m604qf"})
job_title[0].text

In [None]:
company_name = containers[0].find_all("a", {"class": "css-17s97q8"})
company_name[0].text

In [None]:
job_type = containers[0].find_all("a", {"class": "css-n2jc4m"})
job_type[0].text
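
Indexing with `[0]` raises an `IndexError` when a card lacks the element. A small helper, sketched here under the name `text_or_default` (hypothetical, not part of BeautifulSoup), can return a fallback value instead. The sample card is made up to exercise the missing case.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, tag, css_class, default=""):
    """Return the stripped text of the first matching child, or `default` if absent."""
    element = parent.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else default

# Hypothetical card with no job-type link, to exercise the fallback.
card = BeautifulSoup(
    '<div><h2 class="css-m604qf"> Illustrator </h2></div>', "html.parser"
).div

title = text_or_default(card, "h2", "css-m604qf")
missing_type = text_or_default(card, "a", "css-n2jc4m", default="N/A")
print(title, missing_type)
```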

##### Bringing it All Together

In [None]:
import csv
import os

# Make sure the output folder exists before creating the file
os.makedirs('data', exist_ok=True)

# csv.writer quotes any field that itself contains a comma,
# and the with-block closes the file automatically
with open('data/wuzzuf.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Job_Title", "Company_Name", "Job_Type"])

    # Now we will get ALL the needed data from the web page
    for container in containers:
        job_title = container.find_all("h2", {"class": "css-m604qf"})[0].text.strip()
        company_name = container.find_all("a", {"class": "css-17s97q8"})[0].text.strip()
        job_type = container.find_all("a", {"class": "css-n2jc4m"})[0].text.strip()

        print(job_title + ", " + company_name + ", " + job_type)
        writer.writerow([job_title, company_name, job_type])
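
As an alternative to writing each line by hand, the scraped values can be collected into a list of dictionaries and handed to pandas, which quotes fields that contain commas. The rows below are hypothetical placeholders, and the in-memory buffer stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical rows standing in for the values scraped in the loop.
rows = [
    {"Job_Title": "Illustrator, Senior", "Company_Name": "Acme", "Job_Type": "Full Time"},
    {"Job_Title": "Graphic Designer", "Company_Name": "Globex", "Job_Type": "Part Time"},
]
jobs_df = pd.DataFrame(rows)

# Writing to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
jobs_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The first title contains a comma, so it is written inside double quotes and still reads back as a single column.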

##### Inputting the File into Pandas

In [None]:
wuzzuf = pd.read_csv('data/wuzzuf.csv')
wuzzuf
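
Once the data is in a DataFrame, a quick summary helps confirm that the scrape looks sensible. The sketch below counts postings per job type on a small hypothetical table; the column names and values are assumptions, not real scraped results.

```python
import pandas as pd

# Hypothetical data shaped like the scraped file.
sample = pd.DataFrame({
    "Job_Title": ["Illustrator", "Graphic Designer", "2D Artist"],
    "Company_Name": ["Acme", "Globex", "Acme"],
    "Job_Type": ["Full Time", "Part Time", "Full Time"],
})

# Count how many postings fall under each job type.
type_counts = sample["Job_Type"].value_counts()
print(type_counts)
```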

==========

# THANK YOU!