# Python 101

Python is a versatile programming language that's becoming increasingly popular in data science and web scraping. This tutorial will cover the basics of Python, including essential data structures, the significance of whitespace, and other crucial Python concepts. These are foundational skills that will prepare you for web scraping.



## Why Python?

Python is known for its simplicity and readability, which makes it an excellent choice for both new and experienced programmers. Its vast libraries and tools make it a go-to language for data analysis, machine learning, web scraping, and much more.



## 1. Python Basics

### Whitespace and Syntax

Python uses indentation to define code blocks. This reliance on whitespace makes the code consistent and readable but requires careful attention to indentation.


In [8]:
# Optional type hints (not enforced at runtime):
# (name: str = "") -> None
#   name: str = "": expect the name param to be a string; default to an empty string if none is provided
#   -> None: do not expect a return value


def greet(name: str = "") -> None:
    if name:
        print(f"Hello, {name}!")
    else:
        print("Hello, World!")


def greet_with_no_typing(name=""):
    if name:
        print(f"Hello, {name}!")
    else:
        print("Hello, World!")


# function call with positional argument
greet("CALDER")
# function call with keyword argument
greet(name="CEDR")
# function call with no arguments
greet()

Hello, CALDER!
Hello, CEDR!
Hello, World!


In the example above, notice the indentation under the `if` and `else` statements. This is how Python groups lines of code into blocks.

- **Indentation:** Typically, Python uses 4 spaces per indentation level.
- **New Lines:** Statements are usually separated by new lines, but you can end a statement with a semicolon (`;`) if you'd like to put multiple statements on a single line.
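
As a small illustration of the two points above, the sketch below shows nested indentation levels and two statements separated by a semicolon. The function name `describe` is made up for this example.

```python
def describe(number: int) -> str:
    # Each nested block is indented 4 more spaces than its parent.
    if number % 2 == 0:
        if number == 0:
            return "zero"
        return "even"
    return "odd"


# Two statements on one line, separated by a semicolon (legal, but rarely used).
a = 1; b = 2
print(describe(a), describe(b), describe(0))  # odd even zero
```

Putting each statement on its own line is the more common style, since it keeps the code easier to scan.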



### Comments

Comments in Python are indicated by the `#` symbol.

```python
# This is a single-line comment

"""
This is a multi-line comment.
It's not technically a comment but a multi-line string.
"""
```



## 2. Data Structures

Python offers several built-in data structures:



### Strings

### 1. Creating and Accessing Strings
In Python, strings are sequences of characters enclosed in quotes.
```python
# Single or double quotes can be used
my_string = "Hello, World!"
print(my_string)  # Output: Hello, World!

# Accessing individual characters
first_char = my_string[0]
print(first_char)  # Output: H

# Using negative index to access characters from the end
last_char = my_string[-1]
print(last_char)  # Output: !
```

### 2. Slicing Strings
You can extract a substring using slicing.
```python
# Syntax: string[start:end]
substring = my_string[0:5]
print(substring)  # Output: Hello

# Omitting start begins at the start of the string; omitting end runs to the end of the string
substring = my_string[:5]
print(substring)  # Output: Hello

substring = my_string[7:]
print(substring)  # Output: World!
```

### 3. Modifying Strings
Strings in Python are immutable, meaning you cannot modify them directly, but you can create new strings.
```python
# Concatenation
greeting = "Hello"
name = "Alice"
message = greeting + ", " + name + "!"
print(message)  # Output: Hello, Alice!

# Repeating strings
repeated = "Ha" * 3
print(repeated)  # Output: HaHaHa
```

### 4. Common String Methods
Python provides many built-in methods for string manipulation.
```python
# Converting case
print(my_string.lower())  # Output: hello, world!
print(my_string.upper())  # Output: HELLO, WORLD!

# Checking content
print(my_string.startswith('Hello'))  # Output: True
print(my_string.endswith('!'))        # Output: True

# Finding substrings
index = my_string.find('World')
print(index)  # Output: 7 (returns -1 if not found)

# Replacing substrings
new_string = my_string.replace('World', 'Python')
print(new_string)  # Output: Hello, Python!
```

### 5. Splitting and Joining Strings
```python
# Splitting a string into a list
words = my_string.split(', ')
print(words)  # Output: ['Hello', 'World!']

# Joining a list into a string
joined_string = ' '.join(words)
print(joined_string)  # Output: Hello World!
```

### 6. Stripping Strings
Remove whitespace or specific characters from the start and end of the string.
```python
str_with_spaces = "   Hello, World!   "
trimmed = str_with_spaces.strip()
print(trimmed)  # Output: Hello, World!

# You can also strip specific characters
stripped_chars = "xxHelloxx".strip('x')
print(stripped_chars)  # Output: Hello
```

### 7. String Formatting
There are two common ways to insert a variable into a string:
```python
# f-string style
x = "This is a variable"
new_string = f"{x} and it is awesome"
# string format
new_string = "{} and it is awesome".format(x)
```

These are just a few of the many ways you can manipulate strings in Python. Experimenting with these methods will help you understand how to use them effectively in your programming tasks.
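
As a further sketch of f-string formatting, the example below uses format specifiers to control how values are rendered. The variable names and values are placeholders chosen for illustration.

```python
price = 3.14159
count = 42
name = "widget"

# :.2f rounds to two decimal places; :>6 right-aligns in a field 6 characters wide
line = f"{name}: {price:.2f} x {count:>6}"
print(line)  # widget: 3.14 x     42

# {name=} (Python 3.8+) prints the expression alongside its value, handy for debugging
debug = f"{name=}"
print(debug)  # name='widget'
```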



### Lists

Lists are ordered, mutable collections of items.


In [10]:
# create list
fruits = ["apple", "banana", "cherry"]
print(fruits)  # ['apple', 'banana', 'cherry']

# append to list
fruits.append("orange")  # Adds an item to the list
print(fruits)  # ['apple', 'banana', 'cherry', 'orange']

# access list value by index
# NOTE: index starts at 0
print(fruits[0])  # -> "apple"

# negative index starts from end of list
print(fruits[-1])  # -> "orange"
print(fruits[-2])  # -> "cherry"

# slice of a list [first index is inclusive, second index is non-inclusive]
print(fruits[0:2])  # -> ['apple', 'banana']
print(fruits[0:-1])  # -> ['apple', 'banana', 'cherry']

fruits.append(
    {"apples": 5, "oranges": 10.5, "cherries": 200}
)  # a list can hold mixed types, including other data structures
print(
    fruits
)  # ["apple", "banana", "cherry", "orange", {"apples": 5, "oranges": 10.5, "cherries": 200}]

fruits.extend(["blueberry", "raspberry"])
# same result as the line below, except that extend() modifies the list in place
# fruits = fruits + ["blueberry", "raspberry"]
print(fruits)
# ["apple", "banana", "cherry", "orange", {"apples": 5, "oranges": 10.5, "cherries": 200}, "blueberry", "raspberry"]

# sort list alphabetically and only keep strings
only_fruits = sorted(f for f in fruits if isinstance(f, str))
print(only_fruits)

['apple', 'banana', 'cherry']
['apple', 'banana', 'cherry', 'orange']
apple
orange
cherry
['apple', 'banana']
['apple', 'banana', 'cherry']
['apple', 'banana', 'cherry', 'orange', {'apples': 5, 'oranges': 10.5, 'cherries': 200}]
['apple', 'banana', 'cherry', 'orange', {'apples': 5, 'oranges': 10.5, 'cherries': 200}, 'blueberry', 'raspberry']
['apple', 'banana', 'blueberry', 'cherry', 'orange', 'raspberry']



### Tuples

Tuples are similar to lists but are immutable, meaning they cannot be changed once created.

Tuples are less important for web scraping.

```python
coordinates = (10.4, 5.3)
# coordinates[0] = 5  # This will raise an error
```
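
A minimal sketch of what immutability means in practice, reusing the `coordinates` tuple from above. The `visits` dictionary is a made-up example.

```python
coordinates = (10.4, 5.3)

# Unpacking assigns each element to its own variable
x, y = coordinates
print(x, y)  # 10.4 5.3

# Item assignment raises TypeError because tuples are immutable
try:
    coordinates[0] = 5
except TypeError as e:
    error_message = str(e)
    print(f"TypeError: {error_message}")

# Tuples of hashable items can be used as dictionary keys (lists cannot)
visits = {coordinates: "first stop"}
print(visits[(10.4, 5.3)])  # first stop
```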



### Dictionaries

Dictionaries store key-value pairs, much like hash maps in other languages.

In [22]:
person = {"name": "Alice", "age": 28}
print(person["name"])  # Alice

# add a new key
person["height"] = "6'"
print(person)

# change the value of an existing key
person["height"] = "5'"
print(person)


Alice
{'name': 'Alice', 'age': 28, 'height': "6'"}
{'name': 'Alice', 'age': 28, 'height': "5'"}
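
A few other dictionary operations come up often when handling scraped data. The sketch below starts from a fresh `person` dictionary; the `"email"` key is a hypothetical missing field.

```python
person = {"name": "Alice", "age": 28}

# .get() returns None (or a default you supply) instead of raising KeyError
missing = person.get("email")
print(missing)  # None
fallback = person.get("email", "unknown")
print(fallback)  # unknown

# `in` checks whether a key exists
has_age = "age" in person
print(has_age)  # True

# .pop() removes a key and returns its value
age = person.pop("age")
print(age)  # 28
print(person)  # {'name': 'Alice'}
```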


## 3. Control Structures

Python's control structures work similarly to other programming languages but emphasize readability.

### Conditional Statements



In [9]:
age = 18
if age >= 18:
    print("Eligible to vote")
elif age == 17:
    print("Almost eligible")
else:
    print("Not eligible to vote")


Eligible to vote


### Loops

#### For Loop

A `for` loop iterates over an iterable (such as a list, tuple, dictionary, set, or string).

In [11]:
print("** Loop over a list... **")
for fruit in only_fruits:
    print(fruit)

dictionary = {"key": "value", "key2": "value2"}

print("\n** Loop over a dict **")
# Iterate over all key/value pairs
for key, value in dictionary.items():
    print(f"Dict iterator for {key=}: {value=}")

# Iterate over keys only, with a key lookup for the value
for key in dictionary:
    print(f"Value for {key} is {dictionary[key]}")

** Loop over a list... **
apple
banana
blueberry
cherry
orange
raspberry

** Loop over a dict **
Dict iterator for key='key': value='value'
Dict iterator for key='key2': value='value2'
Value for key is value
Value for key2 is value2



#### While Loop

A `while` loop repeats a block of code as long as its condition is true.


In [26]:
# Take care to avoid infinite loops by having a break condition.
count = 0
while True:
    print(count)
    count += 1
    if count >= 5:
        break

print("\n")

# While loop with built in break condition
count = 0
while count < 5:
    print(count)
    count += 1

0
1
2
3
4


0
1
2
3
4


### Try/Except and Exceptions

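A `try`/`except` block catches an error raised inside the `try` body and runs handling code instead of letting the program crash. You can also raise exceptions yourself with `raise`.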


In [None]:
# Raise an exception:
raise Exception("There was an error")


Exception: There was an error

In [20]:
# Conditionally raise an exception
val = 10
if not isinstance(val, str):
    dtype = type(val)
    raise ValueError(f"{val} is not a string. It is a {dtype} instead.")


ValueError: 10 is not a string. It is a <class 'int'> instead.

In [None]:
# Exception handling
try:
    var = int("not a number")
except ValueError:
    print("Can't int a string")

# catch any exception
try:
    var = int("not a number")
except Exception as e:
    print(f"The exception is: {e}")


Can't int a string
The exception is: invalid literal for int() with base 10: 'not a number'


#### Two different strategies with Try/Except

1. **Look before you leap**: Check values before trying to access them to avoid errors.
    ```python
    if "key" in dictionary:
        print("The key is in the dictionary")
    ```

2. **Better to ask forgiveness than permission**: Access the value and handle potential exceptions.
    ```python
    try:
        val = dictionary['non existent key']
    except KeyError as e:
        print(e)
    ```
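
The two strategies can be compared side by side. In this sketch, `record` is a hypothetical scraped item and the function names are made up for illustration. Both functions return the same fallback when the key is missing.

```python
record = {"title": "Example Book", "price": "51.77"}  # hypothetical scraped record


def rating_lbyl(item: dict) -> str:
    # Look before you leap: test for the key first
    if "rating" in item:
        return item["rating"]
    return "n/a"


def rating_eafp(item: dict) -> str:
    # Ask forgiveness: try the lookup and handle the KeyError
    try:
        return item["rating"]
    except KeyError:
        return "n/a"


print(rating_lbyl(record), rating_eafp(record))  # n/a n/a
```

Which style reads better is largely a judgment call. The second is often preferred when the key is expected to be present most of the time.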



## 4. Functions

Functions in Python are defined using the `def` keyword.


In [13]:
def add(a, b):
    return a + b


result = add(5, 3)
print(result)  # 8

############################


# clean text of unwanted characters
def clean_str(text: str) -> str:
    return text.replace("qqq ", "")


print(clean_str("This is a string with qqq errors qqq in it."))
print(clean_str("This is a string without errors in it."))

8
This is a string with errors in it.
This is a string without errors in it.


In [14]:
from curl_cffi import requests


# retrieve a url and return the content
def get_url(url):
    resp = requests.get(url, impersonate="chrome")
    if not resp.ok:
        raise Exception(f"Error retrieving {url}")
    return resp.content


print(get_url("https://toscrape.com/"))

# Understanding functions is crucial when performing web scraping, as you'll often need to define reusable operations for crawling data.

b'<!DOCTYPE html>\n<html lang="en">\n    <head>\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n        <title>Scraping Sandbox</title>\n        <link href="./css/bootstrap.min.css" rel="stylesheet">\n        <link href="./css/main.css" rel="stylesheet">\n    </head>\n    <body>\n        <div class="container">\n            <div class="row">\n                <div class="col-md-1"></div>\n                <div class="col-md-10 well">\n                    <img class="logo" src="img/zyte.png" width="200px">\n                    <h1 class="text-right">Web Scraping Sandbox</h1>\n                </div>\n            </div>\n\n            <div class="row">\n                <div class="col-md-1"></div>\n                <div class="col-md-10">\n                    <h2>Books</h2>\n                    <p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It\'s a safe place for beginners learning web scraping and for deve

### Saving files

In [None]:
from curl_cffi import requests

url = "https://core-docs.s3.us-east-1.amazonaws.com/documents/asset/uploaded_file/1591/Prospect_SD_59/4767963/Substitute_Instructional_Assistant_8.28.24.docx.pdf"

resp = requests.get(url, impersonate="chrome")

# check the status code is good, raise exception if not
if not resp.ok:
    raise Exception("Error retrieving file.")

# save the content of the pdf to file
filename = "filename.pdf"
with open(filename, "wb") as fi:
    fi.write(resp.content)
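
The same `with open(...)` pattern works for text data. The sketch below needs no network access: it writes placeholder rows to a JSON file in a temporary directory and reads them back. The `rows` data and the file name are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path

rows = [{"school_year": "2017", "pct": 13.6}]  # placeholder data

out_dir = Path(tempfile.mkdtemp())  # throwaway directory for the example
out_path = out_dir / "rows.json"

# "w" opens in text mode; use "wb" for bytes such as PDFs or images
with open(out_path, "w", encoding="utf-8") as fi:
    json.dump(rows, fi, indent=2)

# Read the file back to confirm the round trip
with open(out_path, encoding="utf-8") as fi:
    loaded = json.load(fi)

print(loaded == rows)  # True
```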

# Introduction to HTML and Web Scraping with XPath and CSS Selectors

Web scraping is a powerful technique for extracting information from websites. To effectively scrape data from a web page, it's crucial to understand the structure of HTML and how you can navigate it using tools like XPath and CSS selectors. This tutorial will introduce you to HTML elements, the basics of XPath and CSS selectors, and how to use them for web scraping.

## Understanding HTML Elements

HTML (Hypertext Markup Language) is the standard language for creating web pages. A web page is composed of various HTML elements, which are the building blocks of a web page. Here are the core components of an HTML element:

1. **Tags**: HTML elements are defined by tags, surrounded by angle brackets. For example, `<p>` is a paragraph tag.
2. **Attributes**: Elements can have attributes that provide additional information. Attributes are added within the opening tag. For example, `<a href="https://example.com">` uses the `href` attribute to specify the URL.
3. **Content**: The text or nested elements within an HTML tag. For example, in `<p>Hello World</p>`, "Hello World" is the content.
4. **Hierarchy**: HTML is structured in a tree-like hierarchy with nested elements. Understanding the parent-child-sibling relationship is key for web scraping.
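
To make these components concrete, here is a minimal sketch using `html.parser` from Python's standard library. It records each tag, its attributes, its text content, and its nesting depth. The `TagRecorder` class and the HTML snippet are made up for this example; dedicated scraping libraries offer more convenient interfaces.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Record tags, attributes, and text along with their nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", self.depth, data.strip()))


parser = TagRecorder()
parser.feed('<div id="main"><p>Hello World</p><a href="https://example.com">link</a></div>')

# Indent each event by its depth to show the tree structure
for event in parser.events:
    print("  " * event[1] + str(event[2:]))
```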

### Common HTML Elements

- `<html>`: The root element of an HTML page.
- `<head>`: Contains meta-information about the document, like the title.
- `<body>`: Contains the content of the document, including text, images, links, etc.
- `<div>`: A block-level container for organizing content.
- `<span>`: An inline container for text.
- `<a>`: Anchor tag for hyperlinks.
- `<p>`: Paragraphs for blocks of text.
- `<ul>` and `<ol>`: Unordered and ordered lists, containing `<li>` elements.

Understanding these elements will help you identify the parts of the webpage you want to scrape.

## Selecting HTML Structures with XPath

XPath is a powerful tool used to navigate through elements and attributes in an XML document, and it works similarly in HTML.

### Basic XPath Syntax

1. **Select nodes**: Use a single slash `/` to select from the root, and a double slash `//` to select from anywhere in the document.
   - `/html/body/div`: Selects a `div` that's a direct child of `body`.
   - `//div`: Selects all `div` elements in the document.

2. **Select child nodes**: You can specify children by using `/`.
   - `//ul/li`: Selects all `li` elements that are children of any `ul`.

3. **Attributes**: Use `@` to specify an attribute.
   - `//@href`: Selects all `href` attributes.
   - `//a[@href='https://example.com']`: Selects `a` tags with a specific `href` value.

4. **Predicates**: Use square brackets `[]` to specify conditions.
   - `//div[@class='example']`: Selects all `div` elements with a class of "example".
   - `//li[1]`: Selects every `li` that is the first `li` child of its parent.
   - `//li[last()]`: Selects every `li` that is the last `li` child of its parent.

### Advanced XPath Features

- **Axes**: Navigate through the document by relationships, like `ancestor`, `descendant`, `parent`, and `following-sibling`.
- **Functions**: XPath supports functions such as `contains()` and `starts-with()`, along with the `text()` node test, allowing fine-grained selection.
  - `//button[text()='Click Me']`: Selects a button whose text content is "Click Me".

## Selecting HTML Structures with CSS Selectors

CSS selectors provide a straightforward syntax for selecting HTML elements based on their tags, classes, IDs, and more. They are commonly used in web development and scraping.

### Basic CSS Selectors

1. **Tag Selector**: Selects elements by their tag name.
   - `p`: Selects all `<p>` elements.

2. **Class Selector**: Use a dot `.` to select elements by their class.
   - `.example`: Selects all elements with `class="example"`.

3. **ID Selector**: Use a hash `#` to select elements by their ID.
   - `#main`: Selects the element with `id="main"`.

4. **Attribute Selector**: Use brackets to select elements with specific attributes.
   - `a[href]`: Selects all `<a>` elements with an `href` attribute.
   - `a[href='https://example.com']`: Selects `<a>` elements with a specific `href`.

### Advanced CSS Selectors

- **Descendant Selector**: Use a space to select descendants.
  - `div p`: Selects all `<p>` elements inside a `<div>`.

- **Child Selector**: Use `>` to select direct children.
  - `div > p`: Selects all `<p>` elements that are direct children of a `<div>`.

- **Sibling Selector**: Use `+` or `~` to select siblings.
  - `h1 + p`: Selects a `<p>` that immediately follows an `<h1>`.
  - `h1 ~ p`: Selects all `<p>` siblings after an `<h1>`.

- **Pseudo-classes**: Select elements based on their position or state.
  - `a:hover`: Selects `<a>` elements when they're hovered over.
  - `li:first-child`: Selects each `<li>` that is the first child of its parent.


## Conclusion

Understanding HTML and learning how to navigate its structure using XPath and CSS selectors is foundational for web scraping. By mastering these techniques, you can efficiently extract data from websites for various applications. Start experimenting with different selectors to become proficient in scraping real-world web pages.

### Next Steps

Once you're comfortable with these concepts, you can move on to implementing them with a web scraping library like BeautifulSoup (for Python + CSS selectors) or lxml (for Python + XPath) to automate your data collection processes.

### Sample HTML Document

```html
<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML</title>
    <style>
        .highlight { color: red; }
    </style>
</head>
<body>
    <div id="main">
        <h1>Welcome to the XPath and CSS Selectors Guide</h1>
        <p class="intro">This is an introductory paragraph.</p>
        <ul id="item-list">
            <li class="highlight">Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
            <li class="highlight">Item 4</li>
        </ul>
        <a href="https://example.com" class="external">Visit Example.com</a>
    </div>
</body>
</html>
```

### XPath Examples

1. **Select all `<div>` elements:**
   - XPath: `//div`
   - Matches: `<div id="main">...</div>`

2. **Select the title of the page:**
   - XPath: `/html/head/title`
   - Matches: `<title>Sample HTML</title>`

3. **Select all list items with the class "highlight":**
   - XPath: `//li[@class='highlight']`
   - Matches: `<li class="highlight">Item 1</li>`, `<li class="highlight">Item 4</li>`

4. **Select the first `<p>` element:**
   - XPath: `//p[1]`
   - Matches: `<p class="intro">This is an introductory paragraph.</p>`

5. **Select the last `<li>` element inside the unordered list:**
   - XPath: `//ul[@id='item-list']/li[last()]`
   - Matches: `<li class="highlight">Item 4</li>`

6. **Select all the `<a>` elements with an `href` attribute:**
   - XPath: `//a[@href]`
   - Matches: `<a href="https://example.com" class="external">Visit Example.com</a>`
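
Several of these expressions can be tried with Python's standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath. This is a sketch under some assumptions: the sample document is trimmed (no doctype or `<style>` block) and happens to be well-formed XML, which real web pages often are not, and paths are written relative to the root `<html>` element because `ElementTree` does not accept absolute paths on an element. For real pages, a library such as lxml is the more usual choice.

```python
import xml.etree.ElementTree as ET

html = """
<html>
<head><title>Sample HTML</title></head>
<body>
    <div id="main">
        <h1>Welcome to the XPath and CSS Selectors Guide</h1>
        <p class="intro">This is an introductory paragraph.</p>
        <ul id="item-list">
            <li class="highlight">Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
            <li class="highlight">Item 4</li>
        </ul>
        <a href="https://example.com" class="external">Visit Example.com</a>
    </div>
</body>
</html>
"""

root = ET.fromstring(html.strip())  # root is the <html> element

# Equivalent of /html/head/title, written relative to the root
title = root.find("head/title").text
print(title)  # Sample HTML

# Equivalent of //li[@class='highlight']
highlighted = [li.text for li in root.findall(".//li[@class='highlight']")]
print(highlighted)  # ['Item 1', 'Item 4']

# Equivalent of //ul[@id='item-list']/li[last()]
last_item = root.find(".//ul[@id='item-list']/li[last()]").text
print(last_item)  # Item 4

# Equivalent of //a[@href], reading the attribute value from each match
links = [a.get("href") for a in root.findall(".//a[@href]")]
print(links)  # ['https://example.com']
```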

### CSS Selector Examples

1. **Select all `<p>` elements:**
   - CSS Selector: `p`
   - Matches: `<p class="intro">This is an introductory paragraph.</p>`

2. **Select the element with id "main":**
   - CSS Selector: `#main`
   - Matches: `<div id="main">...</div>`

3. **Select all elements with the class "highlight":**
   - CSS Selector: `.highlight`
   - Matches: `<li class="highlight">Item 1</li>`, `<li class="highlight">Item 4</li>`

4. **Select the `<a>` element that has the `href` attribute set to "https://example.com":**
   - CSS Selector: `a[href='https://example.com']`
   - Matches: `<a href="https://example.com" class="external">Visit Example.com</a>`

5. **Select all `<p>` elements that are direct children of `<div>` elements:**
   - CSS Selector: `div > p`
   - Matches: `<p class="intro">This is an introductory paragraph.</p>`

6. **Select the sibling `<li>` elements after the first one:**
   - CSS Selector: `li + li`
   - Matches: `<li>Item 2</li>`, `<li>Item 3</li>`, `<li class="highlight">Item 4</li>`

These examples showcase how you can leverage both XPath and CSS selectors to navigate and select specific parts of an HTML document efficiently.

In [None]:
from curl_cffi import requests

# from selectolax.parser import HTMLParser

base_url = "https://adc.hidoe.us/#/chronic-absenteeism"
url = "https://adc.hidoe.us/api/chronic-absenteeism/grade"

querystring = {
    "dataLoadTag": "2024Final",
    "dataType": "fsy",
    "denNum": "Denominator",
    "endYear": "2024",
    "entity": "335",
    "entityType": "School",
    "gradRetCategory": "1",
    "listType": "ReadingChronicAbsenteeism",
    "schoolTitle": "all",
    "schoolType": "all",
    "series": "grade",
    "seriesType": "grade",
    "startYear": "2017",
    "subject": "R",
}

payload = ""
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:132.0) Gecko/20100101 Firefox/132.0",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Connection": "keep-alive",
    "Referer": "https://adc.hidoe.us/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "TE": "trailers",
}
resp = requests.get(
    url, impersonate="chrome", data=payload, headers=headers, params=querystring
)

if not resp.ok:
    raise Exception("Error with api call")

data = resp.json()
data


{'status': 'success',
 'message': 'Successfully retrieved chronic-absent grade',
 'data': {'nonFinal': True,
  'raw': [{'SchoolYear': '2017',
    'SchoolYearFormatted': '2016-17',
    'DataLoadTag': '2017Final',
    'PCT99': 13.6,
    'PCT91': 18.6,
    'PCT01': 11.9,
    'PCT02': 13.7,
    'PCT03': 12.7,
    'PCT04': 10.2,
    'PCT05': 17.9,
    'PCT06': 10.8,
    'PCT07': None,
    'PCT08': None,
    'PCT09': None,
    'PCT10': None,
    'PCT11': None,
    'PCT12': None},
   {'SchoolYear': '2018',
    'SchoolYearFormatted': '2017-18',
    'DataLoadTag': '2018Final',
    'PCT99': 10.3,
    'PCT91': 15.3,
    'PCT01': 4.5,
    'PCT02': 12.1,
    'PCT03': 10.5,
    'PCT04': 6.1,
    'PCT05': 16.2,
    'PCT06': 8.1,
    'PCT07': None,
    'PCT08': None,
    'PCT09': None,
    'PCT10': None,
    'PCT11': None,
    'PCT12': None},
   {'SchoolYear': '2019',
    'SchoolYearFormatted': '2018-19',
    'DataLoadTag': '2019Final',
    'PCT99': 12,
    'PCT91': 12.5,
    'PCT01': 25,
    'PCT02':

In [None]:
data["data"]["raw"][0]

{'SchoolYear': '2017',
 'SchoolYearFormatted': '2016-17',
 'DataLoadTag': '2017Final',
 'PCT99': 13.6,
 'PCT91': 18.6,
 'PCT01': 11.9,
 'PCT02': 13.7,
 'PCT03': 12.7,
 'PCT04': 10.2,
 'PCT05': 17.9,
 'PCT06': 10.8,
 'PCT07': None,
 'PCT08': None,
 'PCT09': None,
 'PCT10': None,
 'PCT11': None,
 'PCT12': None}
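
Once the response is parsed, ordinary dictionary and list operations are enough to reshape it. The sketch below works on a shortened copy of the first record shown above, so it runs without calling the API. It assumes that keys starting with `PCT` hold percentage values and that `None` marks missing data; the meaning of each suffix is not documented here.

```python
# Shortened copy of data["data"]["raw"][0] from the output above
record = {
    "SchoolYear": "2017",
    "SchoolYearFormatted": "2016-17",
    "DataLoadTag": "2017Final",
    "PCT01": 11.9,
    "PCT02": 13.7,
    "PCT07": None,
}

# Keep only the PCT fields that have a value, keyed by their suffix
pct_values = {
    key[3:]: value
    for key, value in record.items()
    if key.startswith("PCT") and value is not None
}

print(record["SchoolYearFormatted"], pct_values)
# 2016-17 {'01': 11.9, '02': 13.7}
```

The same comprehension could be applied to every item in `data["data"]["raw"]` with a `for` loop.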