# Code Snippets for assignment 2.

1. HTML and the DOM: structure and tags.

• HTML (Hypertext Markup Language) is the standard markup language used to structure and present content on the World Wide Web. It consists of a series of elements (tags) that define the structure and semantics of a web page.
• Each HTML element is enclosed within opening and closing tags, and elements can contain other elements, creating a parent-child relationship.
• HTML describes structure and meaning only; styling and layout are handled by CSS.
• Some HTML tags are:
    o	<html>
    o	<head>
    o	<body>
    o	<p>
    o	<img>
    o	<a>
    o	<h1>, <h2>, <h3>
    o	<title>
    o	<ul>, <ol>
    o	<li>
• The Document Object Model (DOM) is a programming interface for web documents. It represents an HTML or XML document as a tree of objects that scripting languages such as JavaScript can read and modify.
• When a web page is loaded, the browser creates a DOM representation of the HTML document.


In [None]:
<!DOCTYPE html>
<html>
<head>
    <title>My Webpage</title>
</head>
<body>
    <header>
        <h1>Welcome to My Webpage</h1>
    </header>
    <main>
        <p>This is a paragraph.</p>
        <a href="https://www.youtube.com/">Click me</a>
    </main>
    <footer>
        <p>&copy; 2023 My Webpage</p>
    </footer>
</body>
</html>
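
The parent-child nesting described above can be made visible with a few lines of Python. This is only a rough sketch that uses the standard library's `html.parser` to print an indented outline of the tags; it is not the DOM a browser builds, and the class name `TreePrinter` and the shortened page are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they cannot contain children.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreePrinter(HTMLParser):
    """Collects one indented line per element to show parent-child nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then descend into it.
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves back up to its parent.
        if tag not in VOID_TAGS:
            self.depth -= 1


page = """<!DOCTYPE html>
<html>
<head><title>My Webpage</title></head>
<body>
  <main>
    <p>This is a paragraph.</p>
    <a href="https://www.youtube.com/">Click me</a>
  </main>
</body>
</html>"""

printer = TreePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

Each level of indentation in the output corresponds to one step down the tree: `title` sits under `head`, while `p` and `a` are siblings under `main`.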


2. CSS Selectors.

• CSS (Cascading Style Sheets) selectors are patterns that pick out which HTML elements on a web page a set of style rules applies to.
• Some common CSS selectors are:
    o	Type Selector: p { /* styles for all <p> elements */ }
    o	Class Selector: .highlight { /* styles for all elements with class="highlight" */ }
    o	ID Selector: #header { /* styles for the element with id="header" */ }
    o	Descendant Selector: div p { /* styles for every <p> nested anywhere inside a <div> */ }
    o	Child Selector: ul > li { /* styles for every <li> that is a direct child of a <ul> */ }


In [None]:
/* Select by element type: */
p {
    color: blue;
}

/* Select by class: */
.highlight {
    font-weight: bold;
}

/* Select by ID: */
#header {
    background-color: yellow;
}

/* Select by descendant: */
div p {
    color: red;
    font-size: 16px;
}

/* Select by direct child: */
ul > li {
    list-style-type: square;
}
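
The same selector syntax is also useful when scraping, because Beautiful Soup accepts CSS selectors through its `select()` and `select_one()` methods. The snippet below is a minimal sketch, assuming the `bs4` package is installed; the HTML fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up fragment that exercises each selector type.
html = """
<div id="header"><p>Header text</p></div>
<p class="highlight">Important</p>
<ul><li>One</li><li>Two</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Type selector: every <p> element.
all_paragraphs = [p.get_text() for p in soup.select("p")]
# Class selector: every element with class="highlight".
highlighted = [el.get_text() for el in soup.select(".highlight")]
# ID selector: the single element with id="header".
header = soup.select_one("#header")
# Descendant selector: <p> elements nested inside a <div>.
nested = [p.get_text() for p in soup.select("div p")]
# Child selector: <li> elements directly under a <ul>.
items = [li.get_text() for li in soup.select("ul > li")]

print(all_paragraphs)
print(highlighted)
print(header["id"])
print(nested)
print(items)
```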


3. HTTP and request.

• Hypertext Transfer Protocol (HTTP) is the protocol that a client (such as a web browser) and a web server use to exchange information.
• A request is a message sent by the client asking the server to act on a specific resource, for example to return a web page or an image. The server answers with a response that carries a status code and, usually, the requested content.


In [None]:
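import requests

url = "https://api.example.com/data"  # placeholder endpoint
# Send a GET request; the timeout stops the call from hanging forever.
response = requests.get(url, timeout=10)

# 200 means the server handled the request successfully.
if response.status_code == 200:
    data = response.json()  # parse the JSON body into Python objects
    print(data)
else:
    print("Request failed with status code:", response.status_code)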
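
To watch a request and its status code without depending on an external service, the sketch below starts a tiny server on the local machine and queries it with the standard library's `http.client`. The handler class `DemoHandler`, the `/data` path, and the helper `get` are hypothetical names chosen for this illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: answers GET /data with JSON, anything else with 404."""

    def do_GET(self):
        if self.path == "/data":
            body = json.dumps({"message": "hello"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def get(port, path):
    """Send one GET request and return (status code, raw body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

ok_status, ok_body = get(port, "/data")
missing_status, _ = get(port, "/missing")
print(ok_status, json.loads(ok_body))
print(missing_status)

server.shutdown()
server.server_close()
```

The first request succeeds with status 200 and a JSON body; the second asks for a path the server does not know and gets 404 back.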


4. Parsing HTML.

• Parsing HTML refers to the process of analysing and interpreting the structure and content of an HTML document.
• Parsing is essential for web browsers, search engines, and other software to understand and display the content correctly.
• The parsing process involves breaking down the raw HTML text into a structured format that the software can work with.


In [None]:
from bs4 import BeautifulSoup

html = "<p>This is a <strong>paragraph</strong> with <a href='https://www.example.com'>a link</a></p>"
soup = BeautifulSoup(html, 'html.parser')

paragraph = soup.find('p')
strong_text = paragraph.find('strong').text
link = paragraph.find('a')['href']

print("Paragraph:", paragraph.text)
print("Strong text:", strong_text)
print("Link:", link)


5. Web Scraping Ethics and Legality.

• Web scraping involves extracting data from websites, and its ethics and legality can be complex and depend on various factors, including the purpose of scraping, the website's terms of use, and applicable laws.
• ETHICS:
    o Respect for Terms of Use.
    o Purpose of Scraping.
    o Impact on website performance.
    o Respect for Privacy.
    o Attribution and Ownership: Give proper attribution to the source of the scraped data and respect copyright and intellectual property rights.
• LEGALITY:
    o Copyright and Intellectual Property.
    o Violation of Terms of Use.
    o Publicly Available Data.
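
One practical way to limit the impact on a website's performance is to space requests out. The class below is a minimal sketch, not an established library API: `PoliteThrottle` and its parameters are hypothetical names, and the 0.2 second interval is an arbitrary value picked to keep the demo short.

```python
import time


class PoliteThrottle:
    """Ensures at least `min_interval` seconds pass between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Pause if the previous request was too recent; return the delay used."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


throttle = PoliteThrottle(min_interval=0.2)
delays = []
for url in ["https://www.example.com/a", "https://www.example.com/b", "https://www.example.com/c"]:
    delays.append(throttle.wait())
    # A real scraper would fetch `url` here.

print(delays)
```

The first call returns immediately, and every later call waits for whatever remains of the interval, so the target site never sees a burst of back-to-back requests.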


6. Basics of API.

• An API (Application Programming Interface) is a set of rules and protocols that allows different software applications to communicate and interact with each other.
• APIs enable developers to access certain features or data from another application, service, or platform without needing to understand the internal workings of that application.
• Example: Websites that offer “Sign in using Google” or “Sign in using Facebook” call an API provided by Google or Facebook to verify the person's identity before letting them use the site.


7. User Agents and Headers.

• User Agents: A user agent is a string of text that identifies the software and device (client) that is making a request to a server, typically over the internet.
• Headers: HTTP headers are pieces of information included in the requests and responses exchanged between a client and a server.
• Headers are crucial for controlling the behaviour of the client and server during web communication. They facilitate efficient data transfer, authentication, and content negotiation.


In [None]:
import requests

headers = {
    "User-Agent": "My User Agent",
    "Custom-Header": "Value"
}

url = "https://www.example.com"
response = requests.get(url, headers=headers)
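
To see what the server actually receives, the following sketch runs a small local server that echoes back two request headers and then reads a response header on the client side. It uses only the standard library; the handler name `EchoHeaders` is hypothetical, and the header values are the same placeholders used above.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Hypothetical server: replies with the request headers it received."""

    def do_GET(self):
        seen = {
            "User-Agent": self.headers.get("User-Agent"),
            "Custom-Header": self.headers.get("Custom-Header"),
        }
        body = json.dumps(seen).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), EchoHeaders)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Request headers travel from the client to the server.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("GET", "/", headers={"User-Agent": "My User Agent", "Custom-Header": "Value"})
response = conn.getresponse()

# Response headers travel back from the server to the client.
content_type = response.getheader("Content-Type")
seen_by_server = json.loads(response.read())

conn.close()
server.shutdown()
server.server_close()

print(seen_by_server)
print(content_type)
```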


8. Dynamic Websites and AJAX.

• Dynamic websites are websites that display content that can change or update dynamically without requiring a full page reload.
• The content is generated on the server or client side based on user interactions or other events. This allows for more interactive and responsive user experiences than traditional static websites.
• AJAX (Asynchronous JavaScript and XML): AJAX is a technique used to create interactive and dynamic web applications by enabling asynchronous communication between the browser and the server. It allows parts of a web page to be updated without requiring a full page reload.
• AJAX is widely used in web development to create interactive web applications, such as social media feeds, search autosuggestions, live chat systems, etc.
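
For scraping, the practical consequence is that the HTML first sent by the server may not contain the data at all. The sketch below illustrates this with made-up strings: `initial_html` stands in for the page as delivered, and `ajax_response` stands in for the JSON a background request might return. Both are assumptions for illustration, and no real site or endpoint is involved.

```python
import json
from html.parser import HTMLParser


class ItemCounter(HTMLParser):
    """Counts the <li> elements present in the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


# Hypothetical initial page: the list stays empty until JavaScript fills it in.
initial_html = '<ul id="feed"></ul><script src="feed.js"></script>'

# Hypothetical body of the response the script would request in the background.
ajax_response = '{"posts": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]}'

counter = ItemCounter()
counter.feed(initial_html)
titles = [post["title"] for post in json.loads(ajax_response)["posts"]]

print(counter.count)
print(titles)
```

Parsing the initial HTML finds no list items, while the JSON payload holds the content. A scraper therefore has to either read that payload directly or drive a real browser, which is where tools such as Selenium come in.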


9. Regular Expression (Regex).

• A regular expression is a sequence of characters that defines a search pattern.
• Regular expressions are used for pattern matching within strings, making them a powerful tool for text manipulation and data validation.
• Regular expressions consist of a combination of normal characters (literal characters) and special characters (metacharacters) that have special meanings.
• These metacharacters allow you to define complex patterns for searching, matching, and replacing text.
• Example: [a-zA-Z] will match any single letter from a to z or A to Z.
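
A short sketch with Python's built-in `re` module shows literal characters and metacharacters working together. The sample sentence is made up. Note that the second range in the letter class must end in a capital Z: a range written `A-z` also covers the punctuation characters that sit between `Z` and `a` in ASCII, including the underscore.

```python
import re

text = "Order_42 shipped to Alice on 2023-08-15"

# [a-zA-Z]+ matches runs of letters only.
letters_only = re.findall(r"[a-zA-Z]+", text)

# [a-zA-z]+ looks similar but also accepts [ \ ] ^ _ and the backtick.
loose_range = re.findall(r"[a-zA-z]+", text)

# \d{4}-\d{2}-\d{2} matches four digits, a dash, two digits, a dash, two digits.
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()

# re.sub replaces every match; here each digit becomes "#".
masked = re.sub(r"\d", "#", text)

print(letters_only)
print(loose_range)
print(date)
print(masked)
```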


10. Python Libraries: Beautiful Soup, Requests, Scrapy, Selenium, urllib3, JSON, CSV, XML.

a) Beautiful Soup: A library for parsing HTML and XML documents. It provides tools for navigating and manipulating the parsed content. It's commonly used for web scraping tasks.

b) Requests: A library for making HTTP requests and handling responses. It simplifies the process of sending HTTP requests and working with data from APIs or websites.

c) Scrapy: A powerful web crawling framework used for extracting data from websites. It provides features for handling HTTP requests, parsing HTML/XML, and organizing scraped data.

d) Selenium: A web testing library used to automate browser actions. It's often used for tasks like web scraping on websites with dynamic content that cannot be easily parsed using traditional methods.

e) urllib3: A powerful HTTP client library with features like connection pooling, support for file uploads, and more. It's a lower-level alternative to requests for more advanced use cases.

f) JSON: A standard format for representing structured data. The json library in Python provides functions to encode Python objects into JSON and decode JSON into Python objects.

g) CSV: A format for representing tabular data. Python's built-in csv module allows you to read and write CSV files easily.

h) XML: A markup format for representing structured data, similar in syntax to HTML. Python's built-in libraries (xml.etree.ElementTree or xml.dom.minidom) allow you to parse and manipulate XML documents.


In [None]:
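# c) Scrapy is driven from the command line, so these are shell commands, not Python:
#    scrapy startproject myproject
#    cd myproject
#    scrapy genspider myspider example.com
#    scrapy crawl myspider

# d) Selenium: open a page in a Chrome window controlled from Python
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.example.com")
driver.quit()  # close the browser when done

# e) urllib3: send a GET request through a connection pool
import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'https://www.example.com')

# f) json: encode a Python dict as a JSON string and decode it again
import json
data = {"name": "John", "age": 30}
json_data = json.dumps(data)
decoded_data = json.loads(json_data)

# g) csv: write a header row and one data row
import csv
with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['Name', 'Age'])
    csvwriter.writerow(['Daniel', 21])

# h) xml.etree.ElementTree: parse an XML string and read an element's text
import xml.etree.ElementTree as ET
xml_data = "<data><item>Value</item></data>"
root = ET.fromstring(xml_data)
value = root.find('item').text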