![HI-TEC 2025](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/hitec2025-logo.png)
## Secure Programming with Python
![SECURE PROGRAMMING WITH PYTHON](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/workshop-head-image.png)


### Prof. Pamela Brauda and David Singletary
### Florida State College at Jacksonville

![PART1](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/p1-head.png)
# Part 1. Introductory Topics
0. Welcome and Workshop Overview
1. Git and GitHub for Secure Version Control
2. Jupyter Notebooks and Google Colab
3. Secure Coding Basics

# Welcome

Thank you for enrolling in our workshop. The material we are presenting consists of excerpts from a recent weeklong course we taught at the NITIC Summer Working Connections in Columbus, OH (here is the link to the full course: https://github.com/FSCJ-FacultyDev/SWC-Columbus-2025). This course drew from a variety of sources, including books, websites, and personal experience gained over eight years of teaching Python at Florida State College at Jacksonville (FSCJ). Some of this content is part of the curriculum in our A.S. in Computer Information Technology (https://www.fscj.edu/academics/programs/as/2153), our A.S. in Data Science Technology (https://www.fscj.edu/academics/programs/as/2157), our B.A.S. in Information Systems Technology (https://www.fscj.edu/academics/programs/bs/S301) which includes concentrations in application development and FinTech, and our non-credit (CWE) course offerings.

We've included a notebook containing references, and we loosely cite these throughout the content (apologies in advance to the American Psychological Association and to any workshop attendees who are sticklers for proper source citations).

# 1. Secure Version Control with Git and GitHub
- Let's kick the workshop off by building from the ground up with secure version control using two de facto standard tools: Git and GitHub.
  - These form the backbone of secure software development infrastructure by enabling collaboration, access control, traceability, and early vulnerability detection.
- [Git](https://git-scm.com/) and [GitHub](https://github.com/) allow students to learn secure coding practices and conduct peer reviews. Together, these tools provide the following capabilities:
  - Support collaboration among multiple contributors without conflicts.
  - Enhance reproducibility by offering transparency in both development and research workflows.
  - Enforce access control by restricting unauthorized modifications through protected branches.
  - Encourage structured, versioned, and well-documented coding practices.


![Git logo](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day1-gitlogo.png)
![GitHub logo](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day1-ghlogo.png)

- **Git** ensures code changes are tracked, managed, and reviewed systematically for security and accountability.
  - Branching and merging allow teams to develop features and security patches separately before integrating them into the main codebase.
  - Commit history and diffs provide visibility into changes, helping to identify and prevent security vulnerabilities.
  - Rollback and recovery capabilities enable reverting to previous versions to mitigate security breaches or accidental changes.
- **GitHub** provides cloud-based tools for managing repositories, enforcing security policies, and controlling access.
  - Pull requests and code reviews ensure changes are reviewed by peers before merging, reducing the risk of introducing vulnerabilities.
  - Security tools in GitHub (e.g., Dependabot, secret scanning, code scanning) help detect vulnerabilities in dependencies and source code.
  - Access controls and compliance features, such as protected branches and role-based permissions, enforce secure coding practices and industry standards.

# GitHub Classroom

![GitHub Classroom Overview](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day1-ghclassroom.png)

- We use GitHub Classroom extensively in many of our data science and software development courses to manage coding assignments, encourage collaboration, and teach real-world version control practices.
- The platform allows instructors to automatically generate private repositories for each student or team, streamlining assignment distribution and submission.
- It enables instructors to monitor student progress, provide in-line feedback through pull requests, and even automate testing and grading using GitHub Actions.
- Students gain hands-on experience with industry-standard tools while reinforcing best practices in collaborative and secure software development.
- See our [HITEC 2023 presentation (PDF)](https://github.com/FSCJ-FacultyDev/HITEC2025/raw/main/docs/GitHubClassroom-Instructor.pdf) for more information on setting up GitHub Classroom.



# 🛠️ Hands-On: Set Up a Private GitHub Repo for a Python Project

In this exercise, we will create a private GitHub repository for a simple Python project, define a dependency in `requirements.txt`, and write a basic Python script that uses that dependency.

---

### 1. Create a Private GitHub Repository

a. Go to [https://github.com](https://github.com) and log in.  
b. Click the **+** icon in the top-right corner and choose **New repository**.  
c. Fill in:  
   - Repository name: `python-demo-project` (or similar)
   - Description (optional)  
   
d. Under **Visibility**, select **Private**.  
e. (Optional) Check **"Add a README file"**.  
f. Click **Create repository**.  

---

### 2. Add a Dependency File

a. In your new repository, click **Add file** > **Create new file**.  
b. Name the file **requirements.txt**  
c. In the editor, add the following line:  

    requests==2.31.0

d. Click **Commit changes**, enter a commit message in the dialog (e.g., *Create requirements.txt*), and click **Commit changes** again.

---

### 3. Add a Python Script

a. In the repository, click **Add file** > **Create new file**.  
b. Name the file **main.py**  
c. Paste in the following code:  

    import requests

    # A timeout keeps the script from hanging if the server never responds
    response = requests.get("https://www.example.com", timeout=10)
    if response.status_code == 200:
        print("Successfully reached example.com")
    else:
        print("Request failed with status:", response.status_code)


d. Commit the file.



# Multi-Factor Authentication
- GitHub supports Multi-Factor Authentication (MFA) to enhance account security by requiring users to provide an additional verification factor beyond their password (Settings > Password and authentication).
- MFA can be enabled via:
  - Time-based one-time passwords (TOTP) generated by authenticator apps like Google Authenticator, Authy, or the GitHub mobile app
  - Security keys that support FIDO2/WebAuthn
  - SMS-based authentication (less recommended due to security concerns)
- Once enabled, MFA is required during login and when performing sensitive actions, such as modifying account settings or accessing repositories with heightened security policies.
- Additionally, GitHub allows organizations to enforce MFA for members, ensuring stronger protection for repositories and codebases.
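
To make the TOTP option above more concrete, here is a minimal sketch of how an authenticator app could derive a six-digit code from a shared secret and the current time, following RFC 6238 with HMAC-SHA1. The function name `totp` and the demo secret are illustrative assumptions (the secret is the one used in the RFC's test vectors); this is not GitHub's implementation.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, at_time=None, step: int = 30, digits: int = 6) -> str:
    """Return a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    if at_time is None:
        at_time = time.time()
    counter = int(at_time // step)              # Number of elapsed time steps
    msg = struct.pack(">Q", counter)            # Counter as 8-byte big-endian
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                  # Dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Demo secret taken from the RFC 6238 test vectors (an assumption for this
# sketch); real secrets are random and shared once when MFA is enabled.
demo_secret = b"12345678901234567890"
print(totp(demo_secret, at_time=59))   # 287082
print(totp(demo_secret))               # Changes every 30 seconds
```

Because both sides compute the code from the same secret and clock, the server can verify it without the secret ever being sent at login time.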

# Role-Based Access Control (RBAC)
- RBAC is a security model that restricts system access based on predefined roles assigned to users, ensuring they have only the necessary permissions to perform their tasks.
- Instead of granting individual permissions directly, RBAC assigns permissions to roles, which are then assigned to users, simplifying access management and reducing security risks.
- Originally formalized by NIST, RBAC is widely used in enterprise environments, databases, operating systems, and cloud platforms to enforce least privilege and improve compliance with security policies.
- It helps organizations efficiently manage user access, streamline administrative tasks, and minimize the risk of unauthorized actions.
- GitHub uses RBAC by assigning predefined roles (like Read, Write, Admin) at the repository, organization, and enterprise levels to control user access, enabling least-privilege permissions and scalable access management.
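
The core RBAC idea (permissions attach to roles, and roles attach to users) can be sketched in a few lines of Python. The role names echo GitHub's Read, Write, and Admin roles, but the permission names, the user names, and the `is_allowed` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical role-to-permission mapping, loosely modeled on
# GitHub's Read / Write / Admin repository roles.
ROLE_PERMISSIONS = {
    "read": frozenset({"view_code"}),
    "write": frozenset({"view_code", "push_code"}),
    "admin": frozenset({"view_code", "push_code", "change_settings"}),
}

# Users are assigned roles, never individual permissions.
USER_ROLES = {"alice": "admin", "bob": "write", "carol": "read"}


def is_allowed(user: str, permission: str) -> bool:
    """Return True only if the user's role grants the permission."""
    role = USER_ROLES.get(user)                  # None for unknown users
    return permission in ROLE_PERMISSIONS.get(role, frozenset())


print(is_allowed("bob", "push_code"))        # True
print(is_allowed("carol", "push_code"))      # False
print(is_allowed("mallory", "view_code"))    # False (unknown user)
```

Unknown users and unknown roles fall through to an empty permission set, so the sketch denies by default, which is the least-privilege behavior described above.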

# Student Repositories
- Access control must be balanced with collaborative learning when using GitHub in a classroom environment.
- The goal is to promote collaboration while maintaining the integrity of the repository and preventing accidental or unauthorized modifications.
- Instructors can assign Read access to students who only need to view a repository, Write access for those contributing code without merging, and Maintain or Admin roles for team leads or advanced students managing repository settings.
- Branch protection rules allow instructors to ensure students follow proper version control workflows, such as requiring pull requests and code reviews before merging.
- GitHub Classroom allows use of private (template) repositories to generate private forkable repos for students to support academic integrity.

# Branch Protection Rules
- Branch protection rules in GitHub help enforce version control best practices by restricting direct changes to important branches. To set them up:
  - In the repository on GitHub, click on the "Settings" tab.
  - In the left sidebar, under "Code and automation", click "Branches".
  - In the "Branch protection rules" section, click "Add rule".
  - In the "Branch name pattern" field, enter the branch name you want to protect (e.g., main, develop, or use wildcards like feature/*).
  - Select Protection Options. GitHub provides several protection rules you can enable:
    - Require pull request reviews before merging: set a required number of approvals (e.g., at least one review). Block self-reviews to enforce team feedback.
    - Require status checks to pass before merging: ensure automated tests (CI/CD) pass before merging. Select specific checks (e.g., Linting, Unit Tests, Build).
    - Require commit signatures: enforce cryptographic signatures to verify commit authenticity.
    - Restrict who can push to the branch: allow only instructors or specific team members to push directly.
    - Require branches to be up to date before merging: prevent merging outdated branches to avoid conflicts.
    - Prevent branch deletion: stop accidental or malicious deletion of protected branches.
  - Save the Rule
    - Review the settings.
    - Click "Create" to apply the protection rule.

### Examples of Branch Protection Rules for GitHub Projects

---

#### 1. Require Pull Request Reviews Before Merging  
*Prevents direct commits to the main branch and enforces a code review process.*

**Example Rule:**
- Require at least one approved review before merging.  
- Prevent self-approval (someone else must review the changes).  
- Dismiss stale approvals if new commits are added.

---

#### 2. Require Status Checks to Pass Before Merging  
*Prevents merging unless automated tests (CI/CD pipelines) pass.*

**Example Rule:**
- Require GitHub Actions tests to pass before merging.  
- Block merging if tests fail.

---

#### 3. Restrict Who Can Push to a Branch (intro courses)
*Limits who can make direct changes to critical branches.*

**Example Rule:**
- Only instructors can push directly to main.  
- Students must use feature branches and open pull requests.

---

#### 4. Prevent Deletion of Protected Branches  
*Stops accidental or malicious deletion of important branches.*

**Example Rule:**
- Prevent deletion of the main and develop branches.


# 🛠️ Hands-On: Explore Branch Protection Rules in a GitHub repo

In this exercise, we will explore branch protection rules for our previous GitHub repository.

---

- Modifying these rules is a common practice in professional workflows.

1. Go to your repository on GitHub.
2. Click the Settings tab (you must have admin access to see this option).
3. In the left sidebar, select Branches.
4. Under Branch protection rules, click **Add classic branch protection rule**
5. In Branch name pattern, type **main**
6. Verify **Allow deletions** is *unchecked* — this prevents the branch from being deleted.
7. Other rules you can set:
  - Require a pull request before merging
  - Require status checks (e.g. automation tool checks) to pass before merging
8. Click Create or Save changes at the bottom.  
(Disregard any warnings about non-enforcement of your rules due to use of a free account.)

# Managing Sensitive Data (Secrets)
## Never Commit Secrets!
- API keys, credentials, and other sensitive data should never be committed to version control systems like Git because they can easily be exposed to unauthorized users, especially if the repository is public or shared across teams.
- [Don't be like Dropbox!](https://blog.gitguardian.com/dropbox-breach-hack-github-circleci/)
- Once exposed, these values can be used by attackers to gain access to protected systems, steal data, abuse services (e.g., triggering rate limits or incurring unexpected costs), or compromise application security.
- Even in private repositories, accidental leaks are possible through forks, backups, or misconfigured access controls.
- Best practices dictate storing secrets in environment variables or secure vaults, and using .gitignore to exclude local configuration files that contain sensitive information.
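
As a minimal sketch of the environment-variable approach, the code below reads a secret at run time instead of hard-coding it. The variable name `DEMO_API_KEY` and the helper `get_api_key` are assumptions made for this example, and the value is set inside the script only so the cell runs on its own.

```python
import os


def get_api_key(var_name: str = "DEMO_API_KEY") -> str:
    """Read a secret from the environment instead of hard-coding it."""
    value = os.environ.get(var_name)
    if not value:
        # Fail loudly, but never echo secret values in error messages
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return value


# For demonstration only: normally the variable is set outside the code
# (shell profile, CI secret store, or an untracked local config file).
os.environ["DEMO_API_KEY"] = "not-a-real-key"
key = get_api_key()
print("Key loaded, length:", len(key))
```

Printing only the length confirms the key was found without writing the secret to notebook output or logs.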

# Tools for Protecting Secrets
- Tools like [git-secrets](https://github.com/awslabs/git-secrets) and [truffleHog](https://trufflesecurity.com/trufflehog) are designed to detect and prevent the accidental leakage of secrets—such as API keys, passwords, and tokens—into Git repositories.
- These tools scan commit messages, staged files, and repository history for patterns that resemble sensitive information, helping developers catch issues before they are pushed to remote servers.
- **git-secrets** can be integrated as a pre-commit hook to block commits containing known secret patterns using regular expressions for detection.
  - A common example of a known secret pattern is an AWS Access Key ID, e.g.,

```
          AKIAIOSFODNN7EXAMPLE
```

  - detected by the regex

```
          AKIA[0-9A-Z]{16}
```

- **truffleHog** performs deep scans for high-entropy (highly random) strings that may indicate keys or credentials.
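
The two detection ideas above, pattern matching and entropy, can be sketched with the standard library. This is only a rough illustration of the concepts, not how git-secrets or truffleHog are actually implemented; the helper names `find_aws_key_ids` and `shannon_entropy` are hypothetical.

```python
import math
import re
from collections import Counter

# Same pattern as above: "AKIA" followed by 16 uppercase letters or digits
AWS_KEY_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")


def find_aws_key_ids(text: str) -> list:
    """Return all substrings that look like AWS Access Key IDs."""
    return AWS_KEY_PATTERN.findall(text)


def shannon_entropy(s: str) -> float:
    """Return the Shannon entropy of a string in bits per character."""
    if not s:
        return 0.0
    counts = Counter(s)
    return sum((n / len(s)) * math.log2(len(s) / n) for n in counts.values())


sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"  # example key from AWS docs'
print(find_aws_key_ids(sample))           # ['AKIAIOSFODNN7EXAMPLE']

# Repetitive text scores low; key-like strings score noticeably higher
print(round(shannon_entropy("aaaaaaaa"), 2))
print(round(shannon_entropy("AKIAIOSFODNN7EXAMPLE"), 2))
```

A scanner built on these ideas would flag strings that match a known pattern or exceed an entropy threshold; choosing that threshold is a trade-off between missed secrets and false positives.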


# Static Analysis and Dependency Scanning of Code "at Rest"
- Static analysis and dependency scanning in Python are essential practices for identifying code issues and securing third-party packages early in the development cycle.
- **Static analysis** analyzes code without executing it in order to detect potential errors, code smells (signs of poor design or maintainability issues), security vulnerabilities, and other issues.
- Static analysis tools like [pylint](https://pypi.org/project/pylint/), [flake8](https://flake8.pycqa.org/en/latest/), and [bandit](https://bandit.readthedocs.io/) examine Python code without executing it, flagging syntax errors, code style violations, and potential security flaws such as use of unsafe functions or insecure imports.
- **Dependency scanning** automatically identifies and evaluates third-party libraries used in a project to detect known security vulnerabilities, outdated packages, and other issues.
- Dependency scanning tools like [pip-audit](https://pypi.org/project/pip-audit/), [safety](https://www.getsafety.com/cli), or GitHub's built-in [Dependabot](https://docs.github.com/en/code-security/dependabot) check configured packages for known security vulnerabilities.
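
To show what "analyzing code without executing it" can look like, here is a toy static check built on Python's standard `ast` module. It flags direct calls to `eval` and `exec`. The deny-list and the function name `find_risky_calls` are assumptions for this sketch; tools such as bandit apply far richer rule sets.

```python
import ast

# Hypothetical deny-list for this sketch; real tools use many more rules
RISKY_CALLS = {"eval", "exec"}


def find_risky_calls(source: str) -> list:
    """Return (line_number, name) pairs for calls to deny-listed functions."""
    findings = []
    for node in ast.walk(ast.parse(source)):   # Parse only; nothing is executed
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in RISKY_CALLS):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)


sample_code = '''
user_expr = input("Enter an expression: ")
result = eval(user_expr)
print(result)
'''
print(find_risky_calls(sample_code))      # [(3, 'eval')]
```

The sample program is never run, so the call to `input()` does not prompt; the check works purely on the parsed syntax tree.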

# 🛠️ Hands-On: Set Up a Private GitHub Repo with Dependabot Dependency Scanning

- **Dependabot** is a built-in GitHub feature that scans dependency files (like requirements.txt for Python or package.json for Node.js) and opens pull requests to update outdated or insecure packages.
- It helps keep projects secure and up to date with minimal manual effort.
- In this exercise we will configure Dependabot to automatically check our Python project's dependencies for updates and known security vulnerabilities.

1. Go to your Python project's GitHub Repository

2. Enable GitHub Security Features
  - a.	Click the Settings tab of the repository.
  - b.	In the left sidebar, go to **Advanced Security**.
  - c.	Confirm the following are enabled (the default) and enable them if necessary:
    - Dependency graph
    - Dependabot alerts
    - Dependabot security updates

3. Add a Dependabot Configuration File
  - a.	Return to the main repo page and create a YAML file named **.github/dependabot.yml** (YAML is a configuration language; we are using it here to tell Dependabot how and when to automatically check for and suggest updates to the project dependencies).
  - b.	Paste the following configuration for Python dependencies:

```
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"  # Location of requirements.txt
    schedule:
      interval: "weekly"
```
4. Commit the file

5. Review Alerts and Pull Requests
  - a.	Under “Security/Dependabot alerts”, review any detected vulnerabilities. If issues are found, Dependabot may automatically open pull requests to update affected packages.
  - b.	Under “Insights/Dependency Graph/Dependabot”, review **Recent update jobs**.


# 2. Jupyter Notebooks and Google Colab
- Jupyter Notebooks and Google Colab provide flexible, interactive environments for secure coding practices, enabling developers to test and refine security-focused scripts in isolated, controlled settings.
  - Both support Python and a wide range of security-related libraries, making them useful for tasks like penetration testing, secure coding education, and cryptographic implementations.
  - Built-in execution controls let users run code in separate cells, reducing the risk of unintended operations.
- Colab’s cloud-based execution adds an extra layer of security by sandboxing processes away from local machines, preventing potential malware execution.
- Both platforms facilitate reproducibility and collaboration, allowing security teams to document vulnerabilities and share insights while maintaining strict access controls.

# 3. Secure Coding Basics


# Coding Style
- Good coding style is essential because it promotes clear, consistent, and readable code, making it easier to spot logic errors, unintended behavior, and security flaws during development and review.

# GotoFail: A Case Study in Style-Related Vulnerabilities
The **goto fail** bug was a critical security flaw in Apple’s SSL/TLS implementation (discovered in 2014).
  - It was caused by a duplicate goto statement in the C code.
  - The defect caused a certificate validation operation to prematurely exit, allowing attackers to impersonate secure websites and intercept encrypted communications.


```
// NOTE: This is C, not Python
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;  // <- This accidental second 'goto fail;' is the bug
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;
```


# PEP 8 and Python Style Guidelines

- A Python Enhancement Proposal (PEP) describes a new feature, process, or guideline for the Python community.
  - PEPs provide a structured way to propose, discuss, and document changes to the language.
    - PEP 1 (https://peps.python.org/pep-0001/) was created in June 2000 and defines what PEPs are for and how they should be used.
    - PEPs 2–7 set early guidelines for Python's development: adding new standard library modules (PEP 2), handling bug reports (PEP 3), deprecating standard modules (PEP 4), language evolution and backward compatibility (PEP 5), bug-fix releases (PEP 6), and a style guide for Python's C code (PEP 7).
    - PEP 8 (https://peps.python.org/pep-0008/) provides Python coding style guidelines for maintainable code.
  - As instructors of introductory programming courses, we have the opportunity to teach the need for code style discipline and best practices early in a student's learning journey.
  - By teaching established and consistent conventions we can help students develop habits that lead to more reliable and professional software.



# Guidelines

- These guidelines emphasize code readability and foster a mindset that prioritizes security and maintainability—critical skills for aspiring developers.

1. Formatting for Readability and Maintainability
2. Naming Conventions
3. Module Imports
4. Documentation and Comments

# Formatting for Readability and Maintainability

- Consistent Indentation (https://peps.python.org/pep-0008/#indentation)
  - Use 4 spaces (not tabs) per indentation level to avoid confusion that could lead to logical errors.
- Maximum Line Length (https://peps.python.org/pep-0008/#maximum-line-length)
  - 79 characters prevents the need for horizontal scrolling, making it easier to audit code for security flaws.



# Naming Conventions
- https://peps.python.org/pep-0008/#naming-conventions
- Use meaningful names for variables, functions, and classes.
- Avoid **shadowing**
  - Shadowing occurs when a local variable or function name in your code overrides a built-in name, making the original built-in temporarily inaccessible.
  - Don't use names that overwrite Python built-ins (e.g., don’t name a variable **id**, **list**, or **sum**).
- Use CAPITALIZED_NAMES for constants that shouldn’t be modified.

In [None]:
# shadowing and constants

TAX_RATE = 0.07  # FL sales tax, won't change without legislation
sum = 10  # Shadows the built-in sum function

numbers = [1, 2, 3]
try:
    total = sum(numbers)  # Fails: sum is now the integer 10
except TypeError as e:
    print("Error:", e)

del sum  # Removes the shadowing name; the built-in is visible again
total = sum(numbers)
print(total)

print(total * (1 + TAX_RATE))  # Total with tax applied

# Module Imports

# 🛠️ Hands-On: Demonstrating Wildcard Import Issues
- Avoid wildcard imports

```
    from module import *
```

- This can introduce unintended variables and functions into the namespace, leading to unpredictable behavior.

In [None]:
# Create two module files dynamically with conflicting function names

# First module: math_tools with an add() function
# that performs numerical addition
with open('math_tools.py', 'w') as f:
    f.write("""
def add(x, y):
    return x + y
""")

# Second module: string_tools with an add() function
# that performs string concatenation
with open('string_tools.py', 'w') as f:
    f.write("""
def add(x, y):
    return x + " " + y
""")

# Import all contents from math_tools using wildcard import
from math_tools import *
print("Imported math_tools")
# Show memory address of the current add() function
print("id(add) after math_tools import:", id(add))

# Import all contents from string_tools — this silently
# overwrites the previous add()
from string_tools import *
print("Imported string_tools")
# Show that the memory address has changed: add() was overwritten
print("id(add) after string_tools import:", id(add))

# Test which 'add' function is currently in scope
# (hint: it's the one from string_tools).
print("\nTesting add(2, 3):")
try:
    # Will raise a TypeError because string concatenation expects strings
    result = add(2, 3)
    print("Result of add(2, 3):", result)
except Exception as e:
    # Catch error caused by conflicting function definitions
    print("Error:", e)

# Clean up: delete dynamically created module files
import os
os.remove('math_tools.py')
os.remove('string_tools.py')


In [None]:
# Fix the problem shown in previous cell.
# Import the modules explicitly using aliases to avoid conflicts
import math_tools as mt
import string_tools as st

# Call add() from each module explicitly
print("Calling math_tools.add(2, 3):")
try:
    result_math = mt.add(2, 3)  # This performs numerical addition
    print("Result:", result_math)
except Exception as e:
    print("Error in math_tools.add:", e)

print("\nCalling string_tools.add('hello', 'world'):")
try:
    result_string = st.add("hello", "world")  # This performs string concatenation
    print("Result:", result_string)
except Exception as e:
    print("Error in string_tools.add:", e)

# Whitespace

- https://peps.python.org/pep-0008/#whitespace-in-expressions-and-statements

- Use spaces around operators and after commas to improve readability. Do this:

```
    x = a + b
```

- instead of this:

```
    x=a+b
```

- Avoid extraneous whitespace in expressions; instead of:

```
    x = (a + b ) # (trailing space inside parentheses)
```

- Do this:

```
    x = (a + b)
```

# Exception Handling
- https://peps.python.org/pep-0008/#programming-recommendations
- Don’t use "bare except" statements. Instead of:

```
    try:
        process_data()
    except:
        pass # Silently ignores errors (including critical ones)
```

- Do this:

```
    try:
        process_data()
    except (ValueError, KeyError) as e:
        logger.error(f"Processing failed: {e}")  # Log the issue
```

- Raise exceptions explicitly and use meaningful exception types with clear messages.  
  Instead of:

```
    raise Exception("Error occurred")  # Valid but non-specific
```

- Do this:

```
    raise ValueError("Invalid input")
```
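
The snippets above assume that `process_data` and `logger` already exist. The following self-contained sketch puts the same two recommendations together: raise a specific exception with a clear message, then catch only that type and log it. The function `parse_age` and its 0 to 150 range are hypothetical choices for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def parse_age(raw: str) -> int:
    """Convert user input to an age, raising ValueError if it is invalid."""
    age = int(raw)                    # Raises ValueError for non-numeric text
    if not 0 <= age <= 150:
        raise ValueError(f"Age out of range: {age}")
    return age


for raw in ["42", "abc", "-5"]:
    try:
        print("Parsed:", parse_age(raw))
    except ValueError as e:
        logger.error("Processing failed: %s", e)   # Log the issue, keep going
```

Catching only `ValueError` means an unexpected problem, such as a `KeyboardInterrupt` or a programming error, still surfaces instead of being silently swallowed.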

# Documentation and Comments
- https://peps.python.org/pep-0008/#comments
- Use docstrings to document a function's behavior and any security considerations.

In [None]:
def sanitize_input(user_input):
    """
    Cleans user input to prevent injection attacks.

    This function removes potentially dangerous characters
    to protect against code injection vulnerabilities.

    Args:
        user_input (str): The input string provided by the user.

    Returns:
        str: A sanitized version of the input safe for further processing.
    """
    return user_input.replace("<", "").replace(">", "").replace(";", "")

help(sanitize_input)  # help() prints the docstring itself and returns None


- Avoid inline comments disclosing security-sensitive information. Instead of:

```
	# Hashing passwords with MD5 (insecure)
```

- Do this:

```
	# Securely hash passwords
```

# Loops, Lists, Tuples, and Dictionaries

# Loops / Secure Iteration

- Avoid Infinite Loops and Ensure Proper Termination
- Infinite loops can cause a program to become unresponsive, consume excessive system resources, or create security vulnerabilities such as denial-of-service (DoS) risks.
- To ensure proper termination of loops, follow these best practices:

In [None]:
# Use explicit loop conditions: ensure loops have well-defined termination conditions

count = 0
while count < 10:  # Proper termination condition
    print(count)
    count += 1  # Ensure progress towards termination

In [None]:
# Avoid using while True without a break condition

while True:
    # processing steps ...
    #
    user_input = input("Enter 'exit' to stop: ")
    if user_input.lower() == 'exit':
        break

# Implement Timeouts/Iteration Limits
- When processing user input or external data, avoid infinite loops by implementing timeouts or iteration limits.

  ```
  import time

  start_time = time.time()
  timeout = 5  # seconds

  while time.time() - start_time < timeout:
      if some_condition():  # Replace with actual condition
          break
  ```
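
The timeout loop above bounds a loop by elapsed time. An iteration limit bounds it by the number of attempts instead. The sketch below is one way to write that; `MAX_ATTEMPTS`, `wait_for`, and the stand-in condition are hypothetical names introduced for this example.

```python
MAX_ATTEMPTS = 5  # Hypothetical limit; tune for the real workload


def wait_for(condition, max_attempts=MAX_ATTEMPTS):
    """Call condition() until it returns True or the attempt budget runs out."""
    for attempt in range(1, max_attempts + 1):
        if condition():
            return attempt          # Number of attempts that were needed
    return None                     # Gave up: the loop is guaranteed to end


# Stand-in condition that becomes true on the third call
calls = iter([False, False, True])
print(wait_for(lambda: next(calls)))      # 3
print(wait_for(lambda: False))            # None
```

Using `for ... in range(...)` rather than `while True` makes the upper bound visible in the loop header, so a reviewer can confirm termination at a glance.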

- Instead of loading a large dataset into memory all at once, use generators to iterate over the data lazily and avoid excessive memory usage.
- The **yield** keyword allows a function to produce values one at a time, pausing execution between values and resuming from the same point, which is what makes a function a generator.

In [None]:
# The secure_generator function below produces one value at a time
# instead of returning an entire list

def secure_generator(n):
    for i in range(n):
        yield i  # Generates values on demand

for value in secure_generator(10):
    print(value)

# Security Considerations for Lists and Tuples

| List | Tuple |
|------|-------|
| Mutable | Immutable                  |
| Dynamic collections | Fixed data structures |
| Memory overhead for dynamic resizing | More memory efficient |
| Built-in methods for modification | Fewer built-in methods |
| Slightly slower for dynamic resizing | Slightly faster for large datasets |
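
A short sketch of the first and third rows of the table: a list accepts in-place changes, a tuple rejects them, and the two differ in memory footprint. The variable names are illustrative, and the exact sizes reported by `sys.getsizeof` depend on the Python implementation and version.

```python
import sys

permissions_list = ["read", "write"]
permissions_tuple = ("read", "write")

permissions_list.append("admin")          # Lists can be changed in place
print(permissions_list)

try:
    permissions_tuple[0] = "admin"        # Tuples reject item assignment
except TypeError as e:
    print("Error:", e)

# Sizes are implementation-specific; on CPython a tuple is typically
# smaller than a list holding the same elements.
print(sys.getsizeof([1, 2, 3]), sys.getsizeof((1, 2, 3)))
```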


# Tuples for Secure Data

---
- A **hash value** is a fixed-size numerical value produced by a **hash function**.
- A hash value is derived from the contents of an object and is used to quickly compare and retrieve objects in data structures like dictionaries and sets. Equal objects always have equal hash values, although two different objects can occasionally share one (a collision).
- Tuples are **hashable** and can be used as dictionary keys or set elements as long as they only contain hashable elements.
- Using tuples ensures that security-sensitive mappings (e.g., user permissions, access control lists) remain unchanged.

In [None]:
# Example: Is an admin with read permissions allowed to perform a write?

non_hashable_tuple = (["admin", "read"], "write")  # contains a list
try:
    access_rights = {non_hashable_tuple: True}
except TypeError as e:
    print("Error:", e)  # unhashable type: 'list'

hashable_tuple = (("admin", "read"), "write")  # hashable
access_rights = {hashable_tuple: True}
print(access_rights[hashable_tuple])

# Copying Lists
- Prevent unintended data modifications by using tuples for data that should remain unchanged.
  - Make copies of lists before passing them to functions if modification is not intended.
- Copying ensures that modifications to the copy do not unintentionally affect the original, which is important when dealing with mutable and nested data structures.
- A Python list's copy method performs a shallow copy: it only copies the outer list and keeps references to mutable elements inside.
- The **copy** module (https://docs.python.org/3/library/copy.html) allows programmers to create both shallow and deep copies of objects.
  - The copy function in this module behaves like the list's copy method.
  - A **deep copy** creates a new compound object and recursively inserts copies into it of the objects found in the original.
  - This is only relevant for compound objects (objects that contain other objects, like lists or class instances).
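
One way to apply the "copy before modifying" advice is to have the function itself work on an independent copy, so callers cannot be affected by accident. The function `normalize_scores` and its data are hypothetical, chosen only to illustrate the pattern.

```python
import copy


def normalize_scores(scores):
    """Return scores scaled to 0..1 without modifying the caller's data."""
    working = copy.deepcopy(scores)       # Work on an independent copy
    for row in working:
        top = max(row)
        for i, value in enumerate(row):
            row[i] = value / top
    return working


original = [[5, 10], [2, 4]]
scaled = normalize_scores(original)
print(scaled)      # [[0.5, 1.0], [0.5, 1.0]]
print(original)    # [[5, 10], [2, 4]] (unchanged)
```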

# Copying Lists: The Shallow Copy Problem
- Since a shallow copy only copies references to the inner lists, modifying an element inside the copy also affects the original_list.

In [None]:
import copy

original_list = [[1, 2, 3], [4, 5, 6]]
shallow_copy = copy.copy(original_list)

# Modifying an inner list

shallow_copy[0][0] = 99

print(original_list)  # Output (modified): [[99, 2, 3], [4, 5, 6]]
print(shallow_copy)   # Output (modified): [[99, 2, 3], [4, 5, 6]]

# 🛠️ Hands-On: Make a Deep Copy

- copy.deepcopy() creates a new object and recursively copies all objects within it, ensuring that nested mutable objects are fully duplicated rather than just referenced so the copied object is completely independent.

In [None]:
import copy

original_list = [[1, 2, 3], [4, 5, 6]]
deep_copy = copy.deepcopy(original_list)

# Modifying an inner list
deep_copy[0][0] = 99

print(original_list)  # Output: [[1, 2, 3], [4, 5, 6]]  (Unchanged)
print(deep_copy)      # Output: [[99, 2, 3], [4, 5, 6]]  (Modified)

# Index Errors/Boundary Overflow Best Practices

# 🛠️ Hands-On: Validating Index Values
- Always check if an index is within range before accessing elements.
- Use len() to determine valid index ranges.
- Validate user input before using it as an index.


In [None]:
# Safely retrieve an element from a list
def get_element(lst, indx):
    if not isinstance(indx, int): # Validate user input
        print("Error: Index must be an integer.")
        return None
    if 0 <= indx < len(lst):  # Check if index is in range
        return lst[indx]
    else:
        print("Error: Index out of range.")
        return None

# Example usage
my_list = ["apple", "banana", "cherry"]
# Valid index
print(get_element(my_list, 1))
# Out-of-range index
print(get_element(my_list, 5))
# Invalid input (not an integer)
print(get_element(my_list, "two"))

# 🛠️ Hands-On: Handling Unexpected Errors Using try-except


In [None]:
# handle unexpected errors using try-except

# return a default value or a meaningful message on failure.
# log errors for debugging instead of silently failing.

my_list = [10, 20, 30]
try:
    print(my_list[3])
except IndexError:
    print("Index out of range!")
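
The comments in the cell above mention returning a default value and logging errors. A minimal sketch of both, using the standard `logging` module, might look like this; the helper name `safe_get` is an assumption for the example.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)


def safe_get(items, index, default=None):
    """Return items[index], or default (with a logged warning) on failure."""
    try:
        return items[index]
    except (IndexError, TypeError) as e:
        logger.warning("Lookup failed for index %r: %s", index, e)
        return default


my_list = [10, 20, 30]
print(safe_get(my_list, 1))               # 20
print(safe_get(my_list, 3, default=-1))   # -1 (and a warning is logged)
```

The caller gets a predictable value, while the log keeps a record of the failure for debugging.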

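# If You Must Use Indexes, Use Slicing
- Unlike direct indexing (e.g., my_list[3]), which raises an IndexError when the index is out of bounds, slicing never raises for out-of-range bounds: it returns whatever elements exist, possibly an empty list.
- Because slicing does not signal that elements were missing, check the length of the result when the number of elements matters.

# 🛠️ Hands-On: Using Slices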

In [None]:
# uncomment for error example
#my_list = [10, 20, 30]
#print(my_list[5])  # IndexError: index out of range

my_list = [10, 20, 30]
print(my_list[:5])  # Prevents out-of-range errors
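
- One caveat worth sketching: because a slice never raises an IndexError, it can quietly return fewer items than requested. The check below is an illustrative assumption about how a caller might handle that, not a required pattern.

```python
my_list = [10, 20, 30]

first_five = my_list[:5]          # no IndexError, but only 3 items come back
if len(first_five) < 5:
    print(f"Expected 5 items, got {len(first_five)}")

# A one-item slice is empty (not an error) when the index is past the end
item = my_list[5:6]
value = item[0] if item else None
print(value)
```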

# Dictionaries
## Use dict.get() for Dictionaries Instead of Direct Key Access
- dict.get() for dictionary access is generally safer than direct key access (e.g., my_dict['key']) because it prevents potential KeyError exceptions that could disrupt program flow or unintentionally expose sensitive error messages.
- A default return value can be specified when the key is missing, avoiding unhandled exceptions and reducing the likelihood of information leakage or crashes due to unexpected input or data manipulation by malicious users.
- Gracefully handles edge cases.

# 🛠️ Hands-On: Using dict.get() to Safely Access a Dictionary Element

In [None]:
user_data = {
    "username": "alice",
    "email": "alice@example.com"
    # Note: 'phone' key is missing
}

# safe access using dict.get()
phone = user_data.get("phone", "Not provided")
print(f"Phone: {phone}")

# unsafe access using direct indexing (will raise KeyError)
try:
    phone_direct = user_data["phone"]
    print(f"Phone (direct): {phone_direct}")
except KeyError:
    print("Error: 'phone' key not found; direct access raised KeyError")

# defaultdict for Missing Keys
- **collections.defaultdict** is a standard-library class that handles missing dictionary keys without raising exceptions, which improves code reliability.
- Specifying a default factory function, such as int or list, automatically initializes missing keys with a safe, predictable value.
  - Prevents KeyError exceptions that might otherwise expose implementation details or crash the application due to unanticipated input.
- In scenarios where input data is partially controlled by users, using defaultdict maintains program integrity, reduces error-handling complexity, and safeguards against logic flaws that could be exploited by attackers.

# 🛠️ Hands-On: Using defaultdict() to Handle Missing Keys

In [None]:
from collections import defaultdict

# Initialize defaultdict with list as the default factory
user_actions = defaultdict(list)

# Simulated user input (some users may be missing from initial data)
user_actions["alice"].append("login")
user_actions["bob"].append("upload_file")
user_actions["charlie"].append("logout")

# Accessing a non-existent user — will NOT raise KeyError
user_actions["david"].append("download_report")

# Output all user actions
for user, actions in user_actions.items():
    print(f"{user}: {actions}")
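
- A related behavior to keep in mind: looking up a missing key with square brackets does more than return the default; it also stores a new entry. With user-controlled keys, that can grow the dictionary unexpectedly. The short sketch below (the user names are made up) shows the difference between a bracket lookup, a membership test, and `.get()`.

```python
from collections import defaultdict

user_actions = defaultdict(list)
user_actions["alice"].append("login")

# Reading with [] silently creates an empty entry for the unknown key
_ = user_actions["mallory"]
print("mallory" in user_actions)   # True: the lookup inserted the key

# Membership tests and .get() do not insert anything
print("eve" in user_actions)
print(user_actions.get("eve", []))
print(len(user_actions))
```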


# Using Safer Data Structures
- Use `collections.deque` when implementing fixed-size buffers.
- Good for scenarios like rolling logs, recent event tracking, or sliding windows, where only the most recent items need to be retained.
- Automatically removes oldest entries when the maximum size is reached, eliminating the need for manual cleanup logic.
- Helps prevent unbounded memory growth in applications that process continuous or untrusted input streams, making it both efficient and safer.

# 🛠️ Hands-On: Using a Deque

In [None]:
from collections import deque

# Create a deque with a fixed max size of 3
recent_inputs = deque(maxlen=3)

# Simulated stream of user inputs
inputs = ["a", "b", "c", "d", "e"]

for item in inputs:
    recent_inputs.append(item)
    print(f"Current buffer: {list(recent_inputs)}")


![PART2](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/p2-head.png)
# Part 2. Data Science and AI Topics

# Data Science Topics

1. Input Validation and Sanitization

2. Access Control and Data Privacy

3. Secure Data Storage and Serialization

4. Third-Party Library and Dependency Management

# 1. Input Validation and Sanitization
- Ensure all data inputs (e.g., CSVs, JSON, user uploads) are validated before processing.

# 🛠️ Hands-On: Clean Data with Pandas Built-in Methods
- Use Pandas' built-in methods to check for missing values, unexpected types, or malformed rows.


In [None]:
import pandas as pd

# input data
data = {
    "name": ["Alice", "Bob", None],
    "age": [25, "unknown", 30],
    "email": ["alice@example.com", "bob[at]example.com", "carol@example.com"]
}

df = pd.DataFrame(data)

# Check for missing values
if df.isnull().values.any():
    print("Warning: Missing values detected!")
    print(df.isnull().sum())

# Check for unexpected types (e.g., non-numeric ages)
if not pd.api.types.is_numeric_dtype(df['age']):
    print("Warning: Non-numeric values detected in 'age' column!")
    print(df['age'])

# Sanitize: Attempt to coerce 'age' to numeric, mark errors
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Check for malformed email addresses (basic check)
invalid_emails = df[~df['email'].str.contains(r"^[^@]+@[^@]+\.[^@]+$", regex=True)]
if not invalid_emails.empty:
    print("Warning: Malformed email addresses found!")
    print(invalid_emails['email'])

print("\nCleaned DataFrame:")
print(df)


# Prevent Malicious Payloads

- A code injection pattern is a sequence of input (e.g. Excel formulas or SQL statements) intended to insert malicious code into a program, aiming to trick the system into executing unintended commands, altering behavior, or compromising security.

# 🛠️ Hands-On: Sanitize Data
- Use string prefix checks to sanitize potentially dangerous spreadsheet formulas in a DataFrame.

In [None]:
import pandas as pd

# uploaded data
data = {
    "username": ["alice", "bob", "=2+5", "+CMD|' /C calc'!A0"],
    "comment": ["hello", "world", "=HYPERLINK('http://malicious.com')", "goodbye"]
}

df = pd.DataFrame(data)

print("Original data:")
print(df)

# Sanitize dangerous formulas
dangerous_prefixes = ('=', '+', '-', '@')

def sanitize_formula(cell):
    if isinstance(cell, str) and cell.startswith(dangerous_prefixes):
        return "'" + cell
    return cell

# Apply sanitize_formula to each column separately
df_sanitized = df.copy()
for col in df_sanitized.columns:
    # map() applies a function to each element in a column
    df_sanitized[col] = df_sanitized[col].map(sanitize_formula)

print("\nSanitized data:")
print(df_sanitized)



# 2. Access Control and Data Privacy
- Access control and data privacy come into play when handling sensitive information, particularly in shared or multi-user environments.
- One effective control strategy is to enforce access policies through **row and column filtering**, which ensures that users can only view or manipulate the data they are authorized to access (similar to creating a view in a database).
- For example, an analyst might only be permitted to see records related to their assigned region (row-level filtering), or a customer support representative may be restricted from viewing personally identifiable information like social security numbers (column-level filtering).

# 🛠️ Hands-On: Use Column Filtering

In [None]:
import pandas as pd

# Sample data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'email': ['alice@example.com', 'bob@example.com', 'carol@example.com'],
    'ssn': ['123-45-6789', '987-65-4321', '555-55-5555'],
    'salary': [70000, 80000, 75000]
})

print("Original DataFrame:")
print(df)

# Only allow non-sensitive columns to be accessed
df_filtered = df[['name', 'email']]

print("\nFiltered DataFrame (restricted access):")
print(df_filtered)
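
- Row-level filtering, described above, can be sketched in a similar way. The `region` column and the `rows_for_region` helper are hypothetical additions for this example; a real system would look up the allowed region from the authenticated user's profile.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'region': ['East', 'West', 'East'],   # hypothetical column for this sketch
    'salary': [70000, 80000, 75000]
})

def rows_for_region(frame, allowed_region):
    """Return a copy holding only the rows for the analyst's region."""
    mask = frame['region'] == allowed_region
    return frame.loc[mask].copy()

east_view = rows_for_region(df, 'East')
print(east_view)
```

- Returning a copy means later changes to the restricted view cannot alter the original DataFrame.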

# Anonymizing/Redacting Sensitive Fields
- Anonymizing and redacting sensitive fields is a key data privacy technique used to protect personally identifiable information (PII) or confidential attributes within a dataset.
- Anonymization typically involves transforming data so that individuals cannot be identified.
  - This can include name hashing or email address masking.
- Redaction replaces or removes sensitive values entirely (e.g., replacing a social security number with "***-**-****").
- These methods are used when sharing data with third parties, performing analytics, or creating datasets for testing or training machine learning models.

# 🛠️ Hands-On: Anonymize Sensitive Data in a Pandas DataFrame
- This program demonstrates how to anonymize sensitive fields in a DataFrame using pandas.
- It starts with a sample dataset containing personally identifiable information (PII) such as names, email addresses, and Social Security numbers.
- A copy of the original DataFrame is created, and the sensitive columns (name, email, and ssn) are overwritten with the placeholder value "REDACTED".
- This simple anonymization hides sensitive information while preserving the overall structure and non-sensitive data (like salary) for further analysis or sharing.

In [None]:
import pandas as pd

# Sample data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'email': ['alice@example.com', 'bob@example.com', 'carol@example.com'],
    'ssn': ['123-45-6789', '987-65-4321', '555-55-5555'],
    'salary': [70000, 80000, 75000]
})

print("Original DataFrame:")
print(df)

# Anonymize sensitive fields
df_anonymized = df.copy()
df_anonymized['name'] = 'REDACTED'
df_anonymized['email'] = 'REDACTED'
df_anonymized['ssn'] = 'REDACTED'

print("\nAnonymized DataFrame:")
print(df_anonymized)
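
- The hashing, masking, and redaction techniques listed earlier can also be sketched with the standard library alone. Everything named here (`PEPPER`, `pseudonymize`, `mask_email`, `redact_ssn`) is a hypothetical helper for illustration. Note that a keyed hash of a low-variety value such as a first name is better described as pseudonymization than as full anonymization.

```python
import hashlib
import hmac

# Hypothetical secret; in practice, load it from a secrets manager, not source code
PEPPER = b"replace-with-a-secret-key"

def pseudonymize(value):
    """Keyed SHA-256 hash: the same input always yields the same short token."""
    digest = hmac.new(PEPPER, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]

def mask_email(email):
    """Keep the first character and the domain, e.g. a***@example.com."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "REDACTED"
    return f"{local[0]}***@{domain}"

def redact_ssn(_ssn):
    """Replace the whole value with a fixed placeholder."""
    return "***-**-****"

record = {"name": "Alice", "email": "alice@example.com", "ssn": "123-45-6789"}
safe_record = {
    "name": pseudonymize(record["name"]),
    "email": mask_email(record["email"]),
    "ssn": redact_ssn(record["ssn"]),
}
print(safe_record)
```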

# 3. Secure Data Storage and Serialization
- Choosing safe serialization methods and securing the storage are critical parts of building secure applications.
- The two processes are closely related:
  - Serialization is the process of converting data into a format (like JSON, pickle, XML) so it can be saved to storage or transmitted over a network.
  - Secure data storage requires protecting the data both at rest (stored persistently and not actively being transmitted or processed) and during serialization to prevent unauthorized access, tampering, or exploitation.
- If serialization is handled insecurely, attackers can:
  - Inject malicious payloads.
  - Read sensitive data from poorly protected files.
  - Exploit insecure formats (like untrusted pickle files) to execute arbitrary code.



# Avoid pickle
- pickle can execute arbitrary code when deserializing.
- Never load pickle files from untrusted sources.

```
# Pickle is unsafe
import pickle

# Dangerous: loading untrusted data
with open('user_data.pkl', 'rb') as f:
    data = pickle.load(f)  # Vulnerable to code execution
```
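
- To see why loading is the dangerous step, here is a deliberately harmless sketch. The `Payload` class is hypothetical, and `eval("6 * 7")` stands in for whatever an attacker would really run; the point is only that the callable executes during `pickle.loads()`.

```python
import pickle

class Payload:
    # pickle calls __reduce__ to learn how to rebuild the object;
    # an attacker can make it return any callable plus arguments
    def __reduce__(self):
        return (eval, ("6 * 7",))   # harmless stand-in for malicious code

malicious_bytes = pickle.dumps(Payload())

# The victim only calls pickle.loads(), yet eval() runs during loading
result = pickle.loads(malicious_bytes)
print(result)
```

- The loaded "object" is simply whatever the attacker's callable returned, which is why untrusted pickle data should never be deserialized.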


# JSON Serializes Securely
- JSON is data-only — it cannot embed executable code.
- It is better suited for secure storage and exchange of structured data.

```
# Using JSON for serialization
import json

# Safe: loading trusted JSON data
with open('user_data.json', 'r') as f:
    data = json.load(f)  # Parses data only, no code execution
```

# But JSON is Text - How Can It Be Secured?
- Sensitive data can be encrypted before writing it to storage.
- Always validate and sanitize data when loading from any serialized format.
- NOTE: Fernet, provided by Python's cryptography library, is a symmetric authenticated encryption scheme that encrypts data using a shared secret key. It is popular due to its simplicity, but it is not suitable for encrypting large files or streams because the entire message must reside in memory.
- We describe various cryptography libraries in the Part 3 content.
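
- Validating data after loading can be as simple as checking the expected keys and types. The sketch below assumes a made-up schema (`REQUIRED_FIELDS`) and a hypothetical `load_user` helper; real projects often use a dedicated schema-validation library instead.

```python
import json

# Hypothetical schema for this example: required keys and their types
REQUIRED_FIELDS = {"username": str, "email": str}

def load_user(json_text):
    """Parse JSON and reject anything that does not match the expected shape."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"Missing or invalid field: {field}")
    # Keep only the fields we expect; drop anything extra
    return {field: data[field] for field in REQUIRED_FIELDS}

print(load_user('{"username": "alice", "email": "alice@example.com", "role": "admin"}'))

try:
    load_user('{"username": 42}')
except ValueError as err:
    print(f"Rejected: {err}")
```

- Dropping unexpected keys (such as the extra `role` field here) keeps attacker-supplied fields from flowing into later code.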

# 🛠️ Hands-On: Encrypt and Decrypt JSON Data
- This program uses Fernet symmetric encryption to demonstrate secure encryption and decryption of JSON data.
- It generates a key and saves it to a file, then creates a sample JSON object, serializes it to bytes, and encrypts it.
- The encrypted data is then saved to a binary file.
- The program then reads the key and encrypted data back from disk, validates that the encrypted file is not empty, and attempts to decrypt and parse the JSON securely.
- If decryption fails due to tampering or corruption, it raises an InvalidToken error.
- Finally, it prints the decrypted data and cleans up temporary files, ensuring confidentiality and integrity of the data at rest.

In [None]:
from cryptography.fernet import Fernet, InvalidToken
import json
import os

# generate and save the encryption key securely
key = Fernet.generate_key()
with open('secret.key', 'wb') as key_file:
    key_file.write(key)

# create a cipher using the key
cipher = Fernet(key)

# create JSON data to encrypt
json_data = {
    "username": "admin",
    "password": "SuperSecret123!",
    "permissions": ["read", "write", "delete"]
}

# serialize JSON to string and then encode to bytes
json_string = json.dumps(json_data)
data_bytes = json_string.encode('utf-8')

# encrypt the serialized data
encrypted_data = cipher.encrypt(data_bytes)

# save the encrypted data to a file
with open('secure_data.bin', 'wb') as data_file:
    data_file.write(encrypted_data)

# reload the key (yes, we already have it) and decrypt the data

try:
    # load the key
    with open('secret.key', 'rb') as key_file:
        loaded_key = key_file.read()

    # recreate the cipher using the key
    loaded_cipher = Fernet(loaded_key)

    # load the encrypted data
    with open('secure_data.bin', 'rb') as data_file:
        encrypted_contents = data_file.read()

    # validate - Check if the file is empty
    if not encrypted_contents:
        raise ValueError("Error: Encrypted file is empty.")

    # attempt decryption (sanitize - verify integrity)
    decrypted_bytes = loaded_cipher.decrypt(encrypted_contents)

    # decode bytes back to string and parse JSON
    decrypted_json = json.loads(decrypted_bytes.decode('utf-8'))

    # use the clean decrypted JSON data
    print("Decrypted JSON data:")
    print(json.dumps(decrypted_json, indent=2))

except InvalidToken:
    print("Error: Decryption failed — data may have been tampered with!")
except Exception as e:
    print(f"Unexpected error: {e}")

# clean up files after demonstration
os.remove('secure_data.bin')
os.remove('secret.key')


# 4. Third-Party Library and Dependency Management

- Third-Party Library and Dependency Management is a critical aspect of Python software development that ensures applications are built using reliable, secure, and maintainable external packages.
  - Python's ecosystem boasts a vast array of third-party libraries available through the Python Package Index (PyPI), offering developers access to prebuilt functionality for tasks like web development, machine learning, data analysis, and cryptography.
  - Relying on external packages introduces risks such as compatibility issues, deprecated APIs, or vulnerabilities.
  - Proper dependency management practices help mitigate these risks and maintain the long-term health of a project.
    - Keep dependencies up to date by patching known vulnerabilities.
    - Use virtual environments or containers to isolate packages.
    - Verify the integrity of libraries (e.g., via hash checks or using trusted sources like PyPI).
- The most common tool for managing Python dependencies is pip, which allows developers to install packages using simple commands like *pip install requests*.
- To ensure consistency across development environments, dependencies are often listed in a requirements.txt file. This file serves as a manifest of exact versions used in a project and can be regenerated using pip freeze > requirements.txt.

In [None]:
# Use the ! to run shell commands
!pip install requests flask
!pip freeze > requirements.txt

    - Example of a requirements.txt file:

        flask==2.3.2  
        requests==2.31.0  

    - NOTE: Package sub-dependencies required by flask or requests will be downloaded and installed.



- Using virtual environments or containers like Docker ensures that dependencies are isolated, reproducible, and avoid polluting the global environment.
  - Isolation enhances security by creating controlled, predictable environments and limiting exposure to global Python environments or system-wide packages.
  - Reproducibility is especially important in teams or production deployments.
  - Python developers frequently use tools like venv or virtualenv.
  - These tools create project-specific environments where dependencies can be installed without conflicting with those of other projects.

### The following is sample code only; it is not intended to be run in Colab
  
```
# Create a virtual environment using the venv command
# (the following code is intended to run on Linux)
# Name the virtual environment venv -- to make sure it's confusing for beginners
python -m venv venv

# Activate the virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the specific packages required from the requirements.txt file
pip install -r requirements.txt
```  

- For larger projects or those requiring more sophisticated workflows, tools like Poetry or Pipenv offer enhanced dependency resolution, semantic versioning, and lock file generation.
  - These tools help manage both direct and transitive dependencies more precisely.
  - Integrating security tools such as pip-audit or GitHub's Dependabot helps identify known vulnerabilities in third-party packages, enabling proactive patching.
- The following code uses pip-audit to audit your project's dependencies:

In [None]:
!pip install pip-audit
!pip-audit

# This demo is more effective using a command prompt or terminal on a local
# machine vs. Colab; pip-audit in Colab will only audit the packages installed
# in the current session, not your system environment or project virtualenv.

  - Keeping data science libraries like NumPy and Pandas up to date is crucial for:
    - patching security vulnerabilities
    - improving performance
    - ensuring compatibility with other modern libraries
  - Outdated packages can expose applications to known security issues that are publicly documented in CVEs (Common Vulnerabilities and Exposures).
  - Tools like pip list --outdated, pip-review, or automated scanners such as GitHub Dependabot can help identify outdated packages.


In [None]:
# Use pip to check for outdated packages
!pip list --outdated

  - Running the above code block may reveal some 'outdated' packages.
  - The latest is not always the greatest.
    - The newest version of a package may not be compatible with other stable packages being used.

- Verifying the integrity of libraries helps ensure a malicious or tampered package is not installed.
  - Python supports hash checking mode via pip, which uses SHA256 hashes in requirements.txt to verify the exact files being installed.

In [None]:
!pip download numpy==2.2.5
!pip hash /content/numpy-2.2.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

- We can now add this information to our requirements.txt file:

```
numpy==2.2.5 \
  --hash=sha256:d84a1e9a5f2b4fadc3e9a2f8a69dfae9d048ba861b546f5e4f3a4a0b7a65c208
```

- Including a --hash with each dependency ensures the exact package file is used.
- If someone uploads a malicious version of numpy==2.2.5 to PyPI, or if a mirror is compromised, pip will refuse to install it if the file’s SHA256 hash doesn’t match.
- This protects development and deployment pipelines from executing compromised code during automated builds or deployments.
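
- pip performs this comparison itself when hash-checking mode is active (for example, with its `--require-hashes` option). To make the idea concrete, the sketch below does the same kind of check by hand with `hashlib`. The helper names and the temporary demo file are assumptions for illustration; this is not how pip is implemented.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected_hex):
    """Compare the file's SHA-256 digest with the expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())

# Demo with a temporary file standing in for a downloaded package
content = b"pretend this is a wheel file"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name

expected = hashlib.sha256(content).hexdigest()
ok_result = matches_expected(demo_path, expected)
bad_result = matches_expected(demo_path, "0" * 64)
print("Correct hash accepted:", ok_result)
print("Wrong hash accepted:  ", bad_result)

os.remove(demo_path)
```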

# Best Practices
By following best practices in dependency management such as pinning versions, isolating environments, regularly auditing packages, and avoiding unmaintained libraries, developers can reduce technical debt and enhance the reliability and security of Python applications.

# NOTE from "Alice and Bob Learn Secure Coding":
[2] Janca, Ch. 6
"You might have noticed that I suggested Pip Freeze and then said not to pin your libraries. How can you both freeze and keep updating libraries? Whenever you have a chance to do a code update, you want to update as many libraries as you can, to avoid technical debt. If you have the libraries pinned, they won't do that. But when you move from environment to environment (dev ‐> QA ‐> staging), you do not want versions of your code changing, as your testing will be inaccurate. Once you are ready to go beyond dev, you freeze, but before that, you update, update, update (especially libraries!)."

# AI Topics
1. Using AI Safely and Securely
2. Ethical Use of AI
3. Data Integrity
4. Defending Against Adversarial Attacks

# 1. Using Artificial Intelligence Safely and Securely

- As artificial intelligence becomes increasingly integrated into software development workflows, developers must focus on safety and security.
  - [**Safety**](https://arxiv.org/pdf/1606.06565) protects the system and its users from unintentional harm.
  - This includes preventing software bugs, ensuring system reliability, avoiding accidents (e.g., crashing an autonomous system), and mitigating unintended consequences of AI behavior.
  - Safety is about making sure the system does what it's supposed to do, and doesn’t do something dangerous by mistake.
  - [**Security**](https://arxiv.org/pdf/1802.07228) focuses on protecting the system from intentional harm, such as malicious attacks, unauthorized access, data breaches, or exploitation of vulnerabilities.
  - It's about defending against external threats and ensuring confidentiality, integrity, and availability.
- Tools such as code assistants, vulnerability scanners, and automated reasoning engines can help identify bugs, enforce secure coding standards, and streamline threat modeling.
- Use caution when relying on AI-generated code, as it may introduce vulnerabilities or reproduce insecure patterns learned from its training data.

# Useful Tips from Alice and Bob
[2] Ch. 15

- Use AI to write user stories, documentation, and anything else that is written for your job.
  - Double‐check everything before showing it to anyone else; the first draft will likely be imperfect.
- Use AI to help you write code; just make sure you check it first and don't give it sensitive data when asking for that help.
- Use AI to help you find vulnerabilities; if you can share your code (e.g., open source), ask it to find vulnerabilities.
  - It might not be as good as a SAST tool, but if you have an AI and you don't have a SAST tool, take what you can get.
- Ask AI to do threat models for you; it will likely come up with lots of ridiculous ideas, but it usually has one or two good ones.
- Ask AI for help with design, describe what you want, and see what it comes up with.
  - This is like asking a junior employee: the ideas can be a bit wacky, but you get a lot of results really fast, and some of it is usable.
- Ask AI to add comments to your code.
  - It might be able to add more than you would have the patience for.
  - Be sure they are kept concise and brief; no one wants to read a novel.
- Ask AI to suggest bug fixes; some of them will be good.

# Caveats:
- Do not allow AI to make decisions on behalf of applications without oversight.
  - Each decision should be validated by some other part of the code that is not controlled by the AI.
  - AI should never be trusted to make an important decision on its own.
- An AI should not be able to control itself or others.
  - Many current researchers cannot fully explain how models work or why they hallucinate.
- Do not share sensitive or private information with an AI unless you own or control it.
- Always fully review AI-generated code.
  - Don't use it if you don't understand how it works.
  - [Junior developers sometimes rely too heavily on AI tools without fully understanding the code being created](https://www.calcalistech.com/ctechnews/article/ybba8gx5n)
  - [As a junior developer, it’s important to learn the fundamentals](https://noncodersuccess.medium.com/should-junior-developers-use-ai-for-coding-24cc03717525)
- Verify AI-generated content has not broken copyright.
  - Be sure it is original, properly attributed, and not derived from protected material.
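
- The first caveat (have code outside the AI's control validate each decision) might look like the following sketch. The action names, the refund limit, and the `validate_ai_decision` function are all hypothetical; the point is only that the rules live in ordinary, human-reviewed code.

```python
# Hypothetical policy values; the action names are illustrative only
AUTO_APPROVED_ACTIONS = {"send_reminder", "flag_for_review"}
MAX_AUTO_REFUND = 50.00

def validate_ai_decision(decision):
    """Independent check, written by people, for an AI-proposed action."""
    action = decision.get("action")
    if action in AUTO_APPROVED_ACTIONS:
        return "approved"
    if action == "issue_refund":
        amount = decision.get("amount")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        # Small refunds can proceed; anything else goes to a person
        if is_number and 0 < amount <= MAX_AUTO_REFUND:
            return "approved"
        return "needs_human_review"
    # Anything not explicitly allowed is refused
    return "rejected"

print(validate_ai_decision({"action": "send_reminder"}))
print(validate_ai_decision({"action": "issue_refund", "amount": 5000}))
print(validate_ai_decision({"action": "delete_database"}))
```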

# 2. Ethical Use of AI
## How Does AI Ethics Relate to Secure Development?
1. Bias as a Security Risk
	- Biased models can lead to unfair treatment (e.g., in hiring or lending).
	- Security tie-in: Attackers can exploit known biases to game or manipulate models (e.g., feeding crafted inputs to skew recommendations).
2. Data Privacy and Confidentiality
	- Users have a right to privacy; exposing sensitive data violates ethical standards.
	- Security tie-in: Poor privacy protections can lead to data leaks or model inversion attacks that reconstruct private training data.
3. Accountability and Transparency
	- Users should know how decisions are made and be able to challenge them.
	- Security tie-in: Transparent systems allow for better auditing, threat detection, and accountability for model behavior and access.
4. Adversarial Robustness as Ethical Responsibility
	- Systems should behave reliably and safely, especially in critical domains like healthcare or autonomous driving.
	- Security tie-in: Building defenses against adversarial attacks ensures models cannot be tricked into unsafe or harmful outputs.
5. Fair Access and System Abuse
	- AI systems should not reinforce inequality or exclusion.
	- Security tie-in: Rate limiting, authentication, and abuse detection protect systems from being exploited for unfair advantage.

# Compliance with Legal and Regulatory Requirements
- Compliance with regulations and guidance, such as the General Data Protection Regulation (GDPR) and the rules enforced by the U.S. Equal Employment Opportunity Commission (EEOC), is essential.
- These regulations mandate responsible data handling, privacy protections, and fairness in areas like automated decision-making, user profiling, and employment screening.
- Developers must ensure that AI-driven applications are transparent, auditable, and designed to avoid bias or discrimination.
- Robust security is required to protect sensitive data.
- Proper data validation, encryption, access control, and audit logging play a critical role.


5. Secure Model Deployment
	- Restrict access to inference endpoints and model files.
	- Authenticate and authorize API requests to prevent abuse or reverse engineering.
	- Protect models in memory and at rest using encryption and secure containers.
6. Privacy-Preserving Machine Learning
	- Implement techniques like differential privacy, federated learning, or homomorphic encryption.
	- Mask or anonymize sensitive features during preprocessing.
	- Ensure compliance with data minimization principles.
7. Monitoring and Incident Response
	- Continuously monitor model behavior for drift, misuse, or attacks.
	- Set up logging and alerting systems for unexpected input/output patterns.
	- Have a rollback or patching process in place for compromised models.

# 3. Data Integrity

- Maintaining data integrity is fundamental to the development of secure and reliable AI systems, beginning with validating and cleaning training data to detect anomalies, outliers, or potentially malicious inserts that could impact model behavior or introduce vulnerabilities.
- In traditional software development, we only need to focus on testing and versioning code. [6] Ch. 1
- In machine learning, we have to test and version our data as well.
- Indiscriminately accepting all available data might hurt your model’s performance and even make it susceptible to data poisoning attacks.
- High-quality, trusted data input is essential to reduce the risk.
- Regularly auditing models for signs of degraded performance can help, as can more advanced measures such as [defensive distillation](https://arxiv.org/pdf/1511.04508) and [feature squeezing](https://arxiv.org/pdf/1704.01155).
- [Cisco: How to detect and mitigate AI data poisoning](https://outshift.cisco.com/blog/ai-data-poisoning-detect-mitigate)
- Implementing data versioning and **provenance tracking** (keeping a record of data origin, changes, and history) allows developers to trace the origin, transformations, and usage of datasets over time.
- MLOps frameworks like [DVC (Data Version Control)](https://dvc.org/doc) provide built-in support for versioning and provenance tracking.
- Reproducibility, accountability, and rollback are essential to handle potential contamination or error.
- Data validation libraries, logging pipelines, and checksum verification help enforce these safeguards.
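
- As a rough illustration of provenance tracking with checksums, the sketch below records a small manifest for a dataset and later checks whether the data still matches it. The manifest fields, the source name, and both helper functions are assumptions made for this example; tools such as DVC handle this far more thoroughly.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_manifest(data_bytes, source, note):
    """Record where a dataset came from and a checksum of its exact contents."""
    return {
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
        "source": source,
        "note": note,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

def is_unchanged(data_bytes, manifest):
    """True if the data still matches the checksum stored in the manifest."""
    return hashlib.sha256(data_bytes).hexdigest() == manifest["sha256"]

training_csv = b"feature1,feature2,label\n0.1,0.2,0\n0.9,0.8,1\n"
manifest = make_manifest(training_csv, source="hypothetical-vendor-export", note="initial import")
print(json.dumps(manifest, indent=2))

tampered_csv = training_csv + b"10,10,0\n"   # simulated poisoned row
print("Original unchanged:", is_unchanged(training_csv, manifest))
print("Tampered unchanged:", is_unchanged(tampered_csv, manifest))
```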

# 🛠️ Hands-On: Detect Model Poisoning Using IsolationForest
- Detecting outliers using Scikit-Learn's IsolationForest algorithm can flag potentially poisoned samples during preprocessing.
- IsolationForest detects outliers by repeatedly splitting data to see which points are easiest to isolate.
- IsolationForest.fit_predict() labels each data point as either:
  - 1 → inlier (normal)
  - -1 → outlier (anomalous / possible poisoned sample)

![IsolationForest](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day2-isolationforest.png)

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Simulate training data with potential poisoned samples
np.random.seed(42)

# Generate 100 data points with 2 features from a normal distribution
#   ~68% within ±1 standard deviation (i.e., between −1 and 1)
#   ~95% within ±2 standard deviations (i.e., between −2 and 2)
#   ~99.7% within ±3 standard deviations (i.e., between −3 and 3)
normal_data = np.random.normal(loc=0.0, scale=1.0, size=(100, 2))

# Manually create 2 outliers to simulate poisoned samples
poisoned_data = np.array([[10, 10], [15, -12]])  # Simulated poisoned outliers

# Stack the normal and poisoned data vertically to form one dataset
data = np.vstack((normal_data, poisoned_data))

# Convert to DataFrame for inspection
df = pd.DataFrame(data, columns=["feature1", "feature2"])

# Apply Isolation Forest to detect outliers
# (relative to this specific dataset)
model = IsolationForest(contamination=0.05)
df['anomaly'] = model.fit_predict(df[["feature1", "feature2"]])

# -1 indicates an outlier (possible poisoned sample)
outliers = df[df['anomaly'] == -1]
clean_data = df[df['anomaly'] == 1]

print("Detected Outliers (Possible Poisoned Samples):")
print(outliers)

# 4. Defending Against Adversarial Attacks
- **Adversarial attacks** involve subtly modified inputs designed to fool machine learning models into making incorrect predictions.
- These are often imperceptible to humans but can drastically alter model outputs, revealing vulnerabilities in both the model and the broader software systems that depend on its predictions.
- **Adversarial training** is a widely adopted technique which involves augmenting the training dataset with adversarial examples so that the model learns to classify both normal and adversarial inputs correctly.
  - This helps prepare the model for real-world attacks by simulating them during training.
- **Input regularization techniques**, such as adding noise or using dropout, further harden models by discouraging overfitting and encouraging the model to generalize better, reducing its sensitivity to small variations.
- Detection mechanisms can also be used at inference time to identify and reject potentially adversarial inputs.
  - These might involve monitoring for abnormal activation patterns (i.e., inputs that trigger unexpected neuron activation responses) or statistical flags that suggest input manipulation.
- Evaluating model robustness can be performed using specialized testing frameworks like [CleverHans](https://cleverhans.io/) and [Foolbox](https://foolbox.readthedocs.io/en/stable/#) which generate adversarial examples and assess model susceptibility.
- These tools support continuous security testing practices similar to **fuzzing** (feeding a program large volumes of unexpected or random data to discover vulnerabilities, crashes, or unexpected behavior) and penetration testing in traditional software.
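
- To give a feel for what an adversarial example is, the sketch below applies the fast gradient sign method (one common technique, chosen here for illustration; the text above does not prescribe a specific method) to a tiny logistic model written in plain Python. The weights, bias, and epsilon are made-up values, not a trained model.

```python
import math

# Hypothetical, already-trained weights for a 4-feature logistic model
weights = [2.0, -1.5, 0.5, 3.0]
bias = -1.0

def predict(x):
    """Probability of class 1 under the logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_example(x, true_label, epsilon):
    """Nudge each feature by epsilon in the direction that increases the loss."""
    # For logistic regression with cross-entropy loss, dLoss/dx_i = (p - y) * w_i
    error = predict(x) - true_label
    return [xi + epsilon * sign(error * w) for xi, w in zip(x, weights)]

x = [0.5, 0.7, 0.2, 0.9]
x_adv = fgsm_example(x, true_label=1, epsilon=0.1)

print("clean prediction:      ", round(predict(x), 3))
print("adversarial prediction:", round(predict(x_adv), 3))
```

- Each feature moves by at most 0.1, yet the model's confidence in the correct class drops. Adversarial training would add pairs like `(x_adv, true_label)` back into the training set.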

# 🛠️ Hands-On: Add Noise to Training Data

In [None]:
# adding noise for input regularization

import numpy as np

# Original input (e.g., a vector representing an image or features)
original_input = np.array([0.5, 0.7, 0.2, 0.9])

# Add Gaussian noise with mean 0 and standard deviation 0.1
noise = np.random.normal(loc=0.0, scale=0.1, size=original_input.shape)
noisy_input = original_input + noise

print("Original input:", original_input)
print("Noisy input:   ", noisy_input)

In [None]:
# Deep learning example: apply Gaussian noise during training for robustness

import numpy as np
from tensorflow.keras.layers import Input, Dense, GaussianNoise
from tensorflow.keras.models import Model

# Define the input layer with 4 features
inputs = Input(shape=(4,))

# Apply Gaussian noise (stddev = 0.1) to the input during training
# Regularizes the model by making it less sensitive to small input changes
x = GaussianNoise(0.1)(inputs)

# Add a hidden dense layer with 16 neurons and ReLU activation
x = Dense(16, activation='relu')(x)

# Output layer with 1 neuron and sigmoid activation
# (suitable for binary classification)
outputs = Dense(1, activation='sigmoid')(x)

# Build the model
model = Model(inputs, outputs)

# Compile the model with Adam optimizer and binary crossentropy loss function
model.compile(optimizer='adam', loss='binary_crossentropy')

# Sample batch of input data (3 samples with 4 features each)
sample_input = np.array([
    [0.5, 0.7, 0.2, 0.9],
    [0.1, 0.3, 0.5, 0.2],
    [0.9, 0.8, 0.1, 0.4]
])

# Perform a forward pass in training mode (noise is applied)
predictions_with_noise = model(sample_input, training=True)

# Perform a forward pass in inference mode (no noise applied)
predictions_without_noise = model(sample_input, training=False)

# Compare model predictions with and without input noise
print("Predictions with noise:\n", predictions_with_noise.numpy())
print("\nPredictions without noise:\n", predictions_without_noise.numpy())

# Summary of results
- The predictions with noise are close to but not exactly the same as those without noise.
- The differences are generally small if the model is reasonably well-behaved and the standard deviation of the noise is low (0.1).
- This difference illustrates the effect of input noise and how the model responds to slight variations in input values.
- This technique helps in:
  - Preventing overfitting by exposing the model to slightly altered data during training.
  - Improving generalization to real-world data that may contain natural variability or noise.
  - Making the model more robust against adversarial examples or unintended edge-case inputs.
  - Improving tolerance of slightly malformed or perturbed inputs, such as those produced by input fuzzing (a useful defensive layer, though not a complete defense on its own).

![PART3](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/p3-head.png)
# Part 3. Cryptography and Static Analysis
1. Implementing Cryptography with Python
2. Detecting Vulnerabilities using Static Analysis Tools



# 1. Implementing Cryptography with Python

# Encryption and Decryption of Sensitive Data

- The modern developer's encryption toolbox consists of a modest collection of basic tools.
- The following list enumerates the basic crypto security functions and describes what each does, as well as what the security of each depends on:
  - **Random numbers** are useful as padding and nonces(*), but only if they are unpredictable.
  - **Message digests** (hash functions) serve as a fingerprint of data, but only if impervious to collisions.
  - **Symmetric encryption** conceals data based on a secret key the parties share.
  - **Asymmetric encryption** conceals data based on a secret the recipient knows.
  - **Digital signatures** authenticate data based on a secret only the signer knows.
  - **Digital certificates** authenticate signers based on trust in a root certificate.
  - **Key exchange** allows two parties to establish a shared secret over an open channel, despite eavesdropping.
- [1] Kohnfelder Ch. 5
-\* a **nonce** (short for "number used once") is a random value used to ensure that old communications cannot be reused in replay attacks, and to add unpredictability to encryption operations.

# 🛠️ Hands-On: Encrypt and Decrypt a Message

In [None]:
# 1: Random nonce         — Ensure unpredictable values for operations
# 2: Hash message         — Create secure, fixed-size data fingerprint
# 3: Symmetric encrypt    — Encrypt with a shared secret
# 4: Asymmetric encrypt   — Encrypt with public/private key pairs
# 5: Digital signature    — Authenticate sender’s identity and message integrity
# 6: Signature verify     — Confirm that signature is authentic
# 7: Key exchange         — Establish a new shared secret without prior sharing

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec, padding, rsa
import secrets

# 1
nonce = secrets.token_bytes(16)  # 16-byte random nonce
print("Random Nonce:", nonce.hex())

# 2
message = b"Confidential data"
digest = hashes.Hash(hashes.SHA256())
digest.update(message)
message_hash = digest.finalize()
print("SHA-256 Digest:", message_hash.hex())

# 3
symmetric_key = Fernet.generate_key()
cipher = Fernet(symmetric_key)

encrypted_message = cipher.encrypt(message)
print("Symmetrically Encrypted Message:", encrypted_message)

decrypted_message = cipher.decrypt(encrypted_message)
print("Decrypted Symmetric Message:", decrypted_message.decode())

# 4
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

encrypted_with_public = public_key.encrypt(
    message,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None
    )
)
print("Asymmetrically Encrypted Message:", encrypted_with_public.hex())

decrypted_with_private = private_key.decrypt(
    encrypted_with_public,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None
    )
)
print("Decrypted Asymmetric Message:", decrypted_with_private.decode())

# 5
signature = private_key.sign(
    message,
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    hashes.SHA256()
)
print("Digital Signature:", signature.hex())

# 6
try:
    public_key.verify(
        signature,
        message,
        padding.PSS(
            mgf=padding.MGF1(hashes.SHA256()),
            salt_length=padding.PSS.MAX_LENGTH
        ),
        hashes.SHA256()
    )
    print("Signature Verified Successfully!")
except Exception as e:
    print("Signature Verification Failed:", e)

# skip digital certificates (trust chains)
# this would need a real X.509 certificate authority setup (bigger project!)

# 7. Key exchange (Elliptic Curve Diffie-Hellman, ECDH)
# Simulate two parties generating shared secret

# Party A key pair
party_a_private_key = ec.generate_private_key(ec.SECP384R1())
party_a_public_key = party_a_private_key.public_key()

# Party B key pair
party_b_private_key = ec.generate_private_key(ec.SECP384R1())
party_b_public_key = party_b_private_key.public_key()

# Each party computes the shared secret
shared_secret_a = party_a_private_key.exchange(ec.ECDH(), party_b_public_key)
shared_secret_b = party_b_private_key.exchange(ec.ECDH(), party_a_public_key)

# Confirm that both shared secrets match
print("Shared Secret A:", shared_secret_a.hex())
print("Shared Secret B:", shared_secret_b.hex())
print("Shared secrets match:", shared_secret_a == shared_secret_b)
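A raw ECDH shared secret is usually not used directly as an encryption key; a common follow-up step is to pass it through a key derivation function. The sketch below assumes HKDF with SHA-256 from the cryptography package; the party names and the `info` label are illustrative assumptions.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_session_key(shared_secret: bytes) -> bytes:
    """Derive a 32-byte key from an ECDH shared secret."""
    # HKDF objects are single-use, so build a new one for each derivation
    return HKDF(
        algorithm=hashes.SHA256(),
        length=32,
        salt=None,
        info=b"workshop key exchange demo",  # hypothetical context label
    ).derive(shared_secret)

alice_private = ec.generate_private_key(ec.SECP384R1())
bob_private = ec.generate_private_key(ec.SECP384R1())

# Each side combines its own private key with the other side's public key
alice_secret = alice_private.exchange(ec.ECDH(), bob_private.public_key())
bob_secret = bob_private.exchange(ec.ECDH(), alice_private.public_key())

alice_key = derive_session_key(alice_secret)
bob_key = derive_session_key(bob_secret)
print("Derived keys match:", alice_key == bob_key)
```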


# Choosing Secure Cryptographic Libraries
- Cryptographic libraries are used in software development to securely implement encryption, decryption, authentication, digital signatures, and data integrity checks.
  - Selecting the right libraries is critical to building secure applications.
  - OWASP provides guidelines for identifying trusted libraries at https://top10proactive.owasp.org/the-top-10/c6-use-secure-dependencies/#implementation  
- Python library specifics:
  1. Use established libraries that are widely trusted and have been subject to external audits, such as **cryptography** (the "official" Python cryptography package https://cryptography.io/en/latest/), **PyNaCl** (Python binding to libsodium, a high-level cryptography library), or **PyCryptodome** (a fork of the now-legacy PyCrypto, with active maintenance).
  2. Avoid outdated or deprecated libraries like **PyCrypto** (no longer maintained) or custom, hand-rolled cryptography code with potentially unresolved vulnerabilities.
  3. Libraries should have a history of frequent updates, security patches, and responsiveness to vulnerability reports.
  4. Use simple, high-level cryptographic APIs (like Fernet from cryptography.fernet) instead of directly managing low-level primitives like block ciphers and key scheduling.
  5. Verify the library’s license (e.g., Apache 2.0, BSD) fits within your project’s legal and operational requirements.
  6. Well-documented libraries with active communities are easier to use securely, reducing the chance of misuse or configuration mistakes.

# Secure Hashing and Integrity Checks

- Hashing is a cryptographic process that converts data into a fixed-length string (digest), uniquely representing the input.
  - Secure hash functions (like SHA-256) are designed to be collision-resistant, fast, and irreversible.
  - A hash **collision** is a random match in hash values that occurs when a hashing algorithm produces the same hash value for two distinct pieces of data, e.g.,

```
hash1
         -> 9F86D081884C7D659A2FEAA0C55A...
hash2
```

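  - The probability that two given distinct inputs share the same SHA-256 digest is about 1 in 2<sup>256</sup> (2<sup>256</sup> ≈ 1.2 × 10<sup>77</sup>)
  - No efficient algorithm is known to construct sequences with the same hash value
  - **Quantum computers** could in theory speed up brute-force searches against SHA-256 (Grover's algorithm roughly halves the effective security level), but no practical way of reversing the hashing process is known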
- Integrity checks are used to verify that data has not been tampered with or corrupted.
  - A newly computed hash of received data can be compared to a previously known good hash.
- Any change to the original input (even a single bit) produces a drastically different output.
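The "single bit" claim can be checked directly. In this small sketch the two messages (made-up examples) differ in exactly one bit, because the characters `1` (0x31) and `9` (0x39) differ only in one bit position; the helper name `differing_bits` is ours.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

message_a = b"Transfer $100 to account 12345"
message_b = b"Transfer $900 to account 12345"  # one bit flipped in the input

digest_a = hashlib.sha256(message_a).digest()
digest_b = hashlib.sha256(message_b).digest()

changed = differing_bits(digest_a, digest_b)
print(digest_a.hex())
print(digest_b.hex())
print(f"Input bits changed:  {differing_bits(message_a, message_b)}")
print(f"Digest bits changed: {changed} of 256")
```

For a well-behaved hash function, roughly half of the 256 output bits are expected to change.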

# 🛠️ Hands-On: Verify File Integrity Using SHA-256 Hashing

In [None]:
import hashlib

# create a sample file to hash
with open("example.txt", "w") as f:
    f.write("This is some test content for hashing.\n")

# define the secure hash function
def hash_file(filepath):
    hasher = hashlib.sha256()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

# hash the file initially
original_hash = hash_file("example.txt")

# re-hash the file to simulate checking for integrity
current_hash = hash_file("example.txt")

print(original_hash)
print(current_hash)

# compare hashes to verify integrity
if original_hash == current_hash:
    print("File integrity verified.")
else:
    print("File has been altered or corrupted.")
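The check above compares two hashes of an unchanged file, so it always passes. As a complementary sketch, the following simulates tampering between the two hash computations. The file name and contents are hypothetical, and `hmac.compare_digest` is used as one reasonable way to compare digests in constant time.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Work in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "report.txt")
    with open(path, "w") as f:
        f.write("Quarterly total: 1000\n")
    known_good_hash = sha256_of_file(path)

    # Simulate tampering: the file is edited after the hash was recorded
    with open(path, "w") as f:
        f.write("Quarterly total: 9000\n")
    current_hash = sha256_of_file(path)

# compare_digest takes the same time whether or not the values match
file_unchanged = hmac.compare_digest(known_good_hash, current_hash)
print("File unchanged:", file_unchanged)
```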

# Key Management and Storage

<img src="https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day3-key.png">

- Managing and storing cryptographic keys securely is just as critical as the algorithms themselves.
  - Poor key management can render even the most robust encryption ineffective.
- In Python, cryptographic keys may be symmetric (e.g., used with AES) or asymmetric (e.g., public/private RSA or ECC key pairs).
- Regardless of the type, keys must be generated using cryptographically secure random number generators (e.g., secrets or os.urandom) and stored in a way that prevents unauthorized access, while still allowing legitimate use within the application.

- The cryptography and PyCryptodome libraries described in the "Choosing Secure Cryptographic Libraries" section above, along with the Fernet recipe included in cryptography, provide secure ways to generate, serialize, and deserialize keys.
  - **cryptography** supports serialization of asymmetric keys to PEM or DER formats. Private keys can optionally be encrypted with a password during export.
  - **PyCryptodome** supports raw and standard formats (e.g., PEM for RSA/ECC). Private keys can be encrypted using passphrases with PKCS#8. Also supports key wrapping for secure key transport and exchange.
  - **Fernet** uses fixed-format, URL-safe Base64-encoded 32-byte symmetric keys that are not serialized in PEM/DER formats and are typically stored in secure files or environment variables.
- Keys should never be hardcoded in source code or stored in plaintext configuration files; instead, they can be encrypted and stored in environment variables, secure key vaults (like AWS KMS or HashiCorp Vault), or protected OS-specific keyrings.
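As a sketch of password-protected export with the cryptography package, the following serializes an RSA private key to encrypted PEM and loads it back. The passphrase literal is a placeholder assumption for the demo only; in practice it would be prompted for or fetched from a secrets manager.

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Placeholder passphrase for the demo only; never hardcode a real one
passphrase = b"example-passphrase-only"

# Export the private key as PKCS#8 PEM, encrypted with the passphrase
pem_bytes = private_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.BestAvailableEncryption(passphrase),
)
print(pem_bytes.decode().splitlines()[0])

# Loading requires the same passphrase
restored_key = serialization.load_pem_private_key(pem_bytes, password=passphrase)
same_key = (
    restored_key.public_key().public_numbers()
    == private_key.public_key().public_numbers()
)
print("Round trip preserved the key:", same_key)
```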

# 🛠️ Hands-On: Manage a Symmetric Encryption Key Using Fernet

In [None]:
from cryptography.fernet import Fernet

# Generate a key and write it to a file
key = Fernet.generate_key()
with open("secret.key", "wb") as key_file:
    key_file.write(key)

# To use the key later, read it back securely and initialize the cipher object:

# Load the key
with open("secret.key", "rb") as key_file:
    key = key_file.read()

cipher = Fernet(key)
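An alternative to a key file is to read the key from an environment variable. The sketch below uses only the standard library; the variable name `APP_FERNET_KEY` and the helper name are assumptions, and the demo sets the variable itself only so that the example is self-contained.

```python
import base64
import os

def load_fernet_key_from_env(var_name: str = "APP_FERNET_KEY") -> bytes:
    """Return the key stored in the given environment variable, or raise."""
    value = os.environ.get(var_name)
    if value is None:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    key = value.encode("ascii")
    # A Fernet key is 32 random bytes encoded as URL-safe Base64
    if len(base64.urlsafe_b64decode(key)) != 32:
        raise ValueError(f"{var_name} does not hold a 32-byte Base64 key")
    return key

# Demo only: a real deployment sets the variable outside the program
os.environ["APP_FERNET_KEY"] = base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")

key = load_fernet_key_from_env()
print("Loaded a key of", len(base64.urlsafe_b64decode(key)), "bytes")
```

The returned bytes can be passed to `Fernet(key)` in the same way as the key read from `secret.key`.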

# Best Practices

- Best practices include periodically rotating keys and revoking compromised ones.
  - Implementing key rotation involves encrypting data with a new key and optionally decrypting-reencrypting older data, or maintaining a key versioning scheme.
  - In a production-grade system, it is often beneficial to separate key management from application logic entirely by delegating to hardware security modules (HSMs) or managed services.
  - For local development and learning purposes, securing file permissions and avoiding plaintext exposure are key steps toward better cryptographic hygiene.
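One way to sketch key rotation is with `MultiFernet` from the cryptography package, which encrypts with the first key in its list and can decrypt with any of them. The variable names and the sample plaintext are illustrative.

```python
from cryptography.fernet import Fernet, InvalidToken, MultiFernet

old_key = Fernet(Fernet.generate_key())
new_key = Fernet(Fernet.generate_key())

# A token created before the rotation, under the old key
old_token = old_key.encrypt(b"customer record")

# The new key is listed first, so it is used for encryption;
# the old key stays in the list so existing tokens can still be read
rotating = MultiFernet([new_key, old_key])
new_token = rotating.rotate(old_token)  # re-encrypt under the new key

print("New key reads rotated token:", new_key.decrypt(new_token))

try:
    old_key.decrypt(new_token)
    old_key_still_works = True
except InvalidToken:
    old_key_still_works = False
print("Old key reads rotated token:", old_key_still_works)
```

Once every stored token has been rotated, the old key can be dropped from the list and retired.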

# Hardware Security Modules (HSMs)

<img src="https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day3-hsm.png">

- HSMs are physical devices designed to protect and manage digital keys and perform cryptographic operations within a tamper-resistant environment.
  - AWS CloudHSM (https://aws.amazon.com/cloudhsm) - a cloud-based HSM service from Amazon that provides dedicated HSM appliances within AWS infrastructure.
  - Azure Dedicated HSM (https://azure.microsoft.com/en-us/products/azure-dedicated-hsm) - a managed HSM offering deployed in Azure data centers.
  - Google Cloud HSM (https://cloud.google.com/kms/docs/hsm) - a cloud-native HSM that integrates with Google Cloud's KMS (see below).


# Key Management Services
- KMS are cloud-based or on-premises services designed to securely create, store, manage, and control access to cryptographic keys, often integrating with applications and infrastructure to simplify encryption and compliance.
  - AWS Key Management Service (https://aws.amazon.com/kms/) - integrates with other AWS services and optionally backs keys with AWS CloudHSM.
  - Azure Key Vault (https://azure.microsoft.com/en-us/products/key-vault) - a centralized cloud service for managing keys, secrets, and certificates, optionally backed by HSMs.
  - Google Cloud Key Management Service (https://cloud.google.com/security/products/security-key-management) - manages encryption keys for Google Cloud projects with various levels of protection, including Cloud HSM integration.
  - HashiCorp Vault (https://www.hashicorp.com/en/products/vault) - a popular open-source and enterprise tool for managing secrets and protecting sensitive data using software-based encryption and optional HSM support (it integrates with external HSMs and does not provide its own).

# Digital Signatures and Authentication

- Digital signatures are a cornerstone of modern cryptographic authentication.
- They provide a way to ensure that a message or document genuinely comes from a trusted source and has not been altered in transit.
- Unlike handwritten signatures, which can be forged or copied, digital signatures rely on mathematical algorithms and cryptographic key pairs, that is, on asymmetric (public-key) cryptography.
- The sender signs a message with their private key, and the recipient verifies the signature using the sender’s public key.
- This mechanism guarantees both integrity and authenticity.

# Authentication Using Digital Signatures

- Using digital signatures for authentication is common in secure communications protocols such as TLS, S/MIME, and digital certificates in PKI (Public Key Infrastructure).
- When a signed message is received, the recipient can verify its origin and confirm that the message content hasn't changed.
- This is useful for validating code (e.g., software binaries), securing email, and authenticating identity in blockchain transactions or secure APIs.

- In Python, digital signatures can be implemented using libraries like cryptography or PyCryptodome.
- Using **cryptography**, you can generate an RSA key pair, sign a message with the private key, and verify the signature with the public key.

# 🛠️ Hands-On: Generate RSA Keys, Sign Message, Verify Signature

In [None]:
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Generate RSA keys
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()
print("RSA key pair generated.")

# Sign a message
message = b"Verify me"
signature = private_key.sign(
    message,
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    hashes.SHA256()
)
print(f"Message signed: {message.decode()}")
print(f"Signature (hex): {signature.hex()}")

# Verify the signature
try:
    public_key.verify(
        signature,
        message,
        padding.PSS(
            mgf=padding.MGF1(hashes.SHA256()),
            salt_length=padding.PSS.MAX_LENGTH
        ),
        hashes.SHA256()
    )
    print("Signature is valid. Message is authentic and unchanged.")
except Exception as e:
    print("Signature verification failed:", str(e))


## A NOTE on Private Key Generation

```
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
```

The arguments **public_exponent** and **key_size** are critical to how the RSA key pair is generated and how secure and efficient it will be.

- public_exponent=65537
  - This is the RSA public exponent **e**, a value used in the exponentiation step of the encryption. There is also a private exponent **d**, which is kept secret and is calculated from e together with the key's secret prime factors.
  - It must be an odd integer greater than 1.
  - 65537 is a widely used standard value because it strikes a balance between security and performance.
  - 65537 is prime and has only two bits set (binary: 10000000000000001), making exponentiation efficient.
  - Larger values are slower.
  - Smaller values have had historical vulnerabilities.
- key_size=2048
  - This defines the bit length of the RSA modulus, which determines the overall strength of the key.
  - A 2048-bit key means the modulus (a large calculated value included in both the public and private keys) is a product of two 1024-bit primes.
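The relationship between **e**, **d**, and the modulus can be worked through with deliberately tiny numbers. This is a toy illustration only: the primes 61 and 53 and the exponent 17 are textbook-sized assumptions, and there is no padding, so nothing here is suitable for real use.

```python
# Toy RSA with tiny primes: for illustration only, never for real use
p, q = 61, 53
n = p * q                  # modulus, included in both public and private keys
phi = (p - 1) * (q - 1)    # Euler's totient of n
e = 17                     # public exponent: odd, greater than 1, coprime with phi
d = pow(e, -1, phi)        # private exponent: modular inverse of e (Python 3.8+)

message = 65
ciphertext = pow(message, e, n)    # encrypt: m^e mod n
recovered = pow(ciphertext, d, n)  # decrypt: c^d mod n

print(f"n = {n}, e = {e}, d = {d}")
print(f"ciphertext = {ciphertext}, recovered = {recovered}")

# 65537 equals 2**16 + 1, so its binary form has only two 1-bits,
# which keeps exponentiation with it cheap
print(bin(65537), "has", bin(65537).count("1"), "bits set")
```

With a real 2048-bit key the arithmetic is the same, but `p` and `q` are each about 1024 bits long, which is what makes recovering `d` from the public values impractical.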

## Differences between Simple Hashing and RSA Signature Validation
| **Feature**     | **Hashing**                                | **Digital Signing with RSA**                                |
|-----------------|---------------------------------------------|--------------------------------------------------------------|
| **What it does**| Creates a unique fingerprint of data        | Authenticates and proves the origin and integrity of data     |
| **Use case**    | Detect changes to data                      | Verify the sender’s identity and detect changes to data       |

# Using Signatures for Non-Repudiation
- **Non-repudiation** is a security principle that ensures a party in a communication cannot deny the authenticity of their signature, message, or action.
  - It provides proof of origin and integrity that can be verified by a third party, and is commonly implemented using digital signatures and cryptographic certificates.
  - When a user signs a document with their private key, anyone with the corresponding public key can verify that the signature is valid and came from that user.
    - Since the private key is known only to the signer, they cannot later claim they didn't sign it—this is the essence of non-repudiation.
  - It is widely used in:
    - Electronic contracts and legal documents
    - Secure email (e.g., with S/MIME or PGP)
    - Software distribution to verify trusted sources
    - Blockchain transactions to prove ownership or consent
  - Beyond simple message authentication, digital signatures are a critical building block for systems that require non-repudiation.
  - This ensures that a sender cannot later deny having signed a message, as only their private key could have produced the signature.
  - In digital contract systems or electronic voting, non-repudiation is essential to maintain trust and accountability.

# 🛠️ Hands-On: Use Non-Repudiation to Verify a Sender

In [None]:
# This example simulates a sender signing a message with their private key
# and a verifier confirming the signature with the corresponding public key
# ==> the sender cannot deny authorship.

# Message Integrity: The message has not been tampered with.
# Authentication: It came from the holder of the private key.
# Non-Repudiation: The signer cannot later deny creating the signature,
# because only they had access to the private key.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives import serialization

# generate RSA key pair
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# sender signs the message
message = b"This message is from Alice"
signature = private_key.sign(
    message,
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    hashes.SHA256()
)

# verifier confirms authenticity (non-repudiation)
try:
    public_key.verify(
        signature,
        message,
        padding.PSS(
            mgf=padding.MGF1(hashes.SHA256()),
            salt_length=padding.PSS.MAX_LENGTH
        ),
        hashes.SHA256()
    )
    print("Signature is valid. Message is authentic and unchanged.")
except Exception as e:
    print("Signature verification failed:", str(e))
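The example above exercises only the success path. The sketch below covers the failure cases as well: a modified message and a signature checked against the wrong public key. The helper names and the second key pair are illustrative assumptions; `InvalidSignature` is the exception the cryptography package raises when verification fails.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

def pss_padding():
    """PSS padding configured the same way for signing and verifying."""
    return padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                       salt_length=padding.PSS.MAX_LENGTH)

def is_valid(public_key, signature, message):
    """Return True only if the signature matches this message and key."""
    try:
        public_key.verify(signature, message, pss_padding(), hashes.SHA256())
        return True
    except InvalidSignature:
        return False

alice_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
other_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

message = b"This message is from Alice"
signature = alice_key.sign(message, pss_padding(), hashes.SHA256())

genuine = is_valid(alice_key.public_key(), signature, message)
tampered = is_valid(alice_key.public_key(), signature, b"This message is from Mallory")
wrong_key = is_valid(other_key.public_key(), signature, message)

print("Original message, Alice's key:", genuine)
print("Modified message, Alice's key:", tampered)
print("Original message, other key:  ", wrong_key)
```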

# 🛠️ Hands-On: Securing Local File Permissions to Avoid Plaintext Exposure

In [None]:
# Storing a Fernet key securely (Linux/macOS)
from cryptography.fernet import Fernet
import os
import stat

# Generate a new encryption key
key = Fernet.generate_key()

# Define a secure file path
key_file = "secret.key"

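# Create the file with owner-only permissions from the start, then write the key
fd = os.open(key_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "wb") as f:
    f.write(key)

# Enforce -rw------- even if the file already existed with looser permissions
os.chmod(key_file, 0o600)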

print("Key saved securely to", key_file)
file_stat = os.stat(key_file)
permissions = stat.filemode(file_stat.st_mode)

print(f"File: {key_file}")
print(f"Permissions: {permissions}")
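A complementary check is to refuse to load a key file whose permissions are too open. The sketch below assumes a POSIX system (Linux/macOS); the helper name and the demo file are hypothetical.

```python
import os
import stat
import tempfile

def is_private_to_owner(path):
    """Return True if neither group nor others have any access to the file."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.key")
    with open(demo_path, "wb") as f:
        f.write(b"not a real key")

    os.chmod(demo_path, 0o644)   # -rw-r--r-- : readable by everyone
    loose_ok = is_private_to_owner(demo_path)

    os.chmod(demo_path, 0o600)   # -rw------- : owner read/write only
    strict_ok = is_private_to_owner(demo_path)

print("0o644 acceptable:", loose_ok)
print("0o600 acceptable:", strict_ok)
```

An application could call such a check before reading `secret.key` and stop with an error if it fails.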

# Common Pitfalls

["The highest priority for developers is to build features, and while security is not intentionally on the back burner, they don’t necessarily have the skills to avoid poor coding patterns that lead to security bugs, and the benchmark of a good engineer rarely includes secure coding prowess."](https://www.securecodewarrior.com/article/poor-coding-patterns-can-lead-to-big-security-problems-so-why-do-we-encourage-them)  
Matias Madou, Ph.D., CTO/Co-Founder, Secure Code Warrior

[Common Weakness Enumeration (CWE)](https://cwe.mitre.org) is a community-developed list of common software and hardware security weaknesses maintained by MITRE Corporation, a nonprofit that operates federally funded research and development centers (FFRDCs) to support U.S. government agencies in cybersecurity and other areas. It helps identify, classify, and mitigate common causes of security vulnerabilities. A CVE (Common Vulnerabilities and Exposures) entry **tells you what went wrong**; a CWE entry **explains why it was possible**.

- Common pitfalls include
  - [hardcoding secrets or keys in source code](https://cwe.mitre.org/data/definitions/540.html)
  - [using outdated or insecure algorithms](https://cwe.mitre.org/data/definitions/327.html)
  - [misusing cryptographic primitives](https://cwe.mitre.org/data/definitions/1240.html)
  - [skipping validation or error checking](https://cwe.mitre.org/data/definitions/1215.html)
  - [assuming randomness is secure by default](https://cwe.mitre.org/data/definitions/330.html)
  - [insecure storage of sensitive files](https://cwe.mitre.org/data/definitions/922.html)
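The "randomness is secure by default" pitfall can be illustrated with the standard library. In this sketch the `weak_token` helper and the seed value are assumptions made for the demonstration: the general-purpose `random` module is reproducible from its seed, while `secrets` draws from the operating system's cryptographically secure source.

```python
import random
import secrets

def weak_token(seed):
    """Token from the general-purpose PRNG: fully determined by its seed."""
    rng = random.Random(seed)
    return "".join(rng.choice("0123456789abcdef") for _ in range(32))

# Anyone who learns or guesses the seed can regenerate the "secret" token
first = weak_token(2025)
second = weak_token(2025)
print("Seeded tokens identical: ", first == second)

# secrets is intended for tokens, keys, and nonces
strong_a = secrets.token_hex(16)
strong_b = secrets.token_hex(16)
print("secrets tokens identical:", strong_a == strong_b)
```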

# Best Practices

- Use high-level abstractions
  - High-level APIs like Fernet and the helpers in cryptography.hazmat.primitives.serialization provide sensible defaults and built-in protections (e.g., encryption with authentication, key serialization standards) so developers don’t have to manage low-level error-prone cryptographic operations.
- Leverage environment variables and secrets managers
  - Load secrets from os.environ or external vaults (e.g., AWS Secrets Manager, HashiCorp Vault) rather than bundling them with your code.
- Validate all inputs before cryptographic operations
  - Prevent malformed, tampered, or oversized inputs from causing errors or unexpected behavior. Check the format of incoming public keys or the size of decrypted plaintext.
- Always authenticate encrypted data
  - Use encryption modes like AES-GCM or tools like Fernet that include built-in authentication to protect both confidentiality and integrity.
- Implement key lifecycle management
  - Plan for key rotation, expiration, and revocation. Store metadata (e.g., key version, created timestamp) with encrypted content to support future changes.
- Handle exceptions gracefully and securely
  - Avoid leaking detailed error messages that could help attackers infer internal behavior (e.g., padding oracle attacks). Log minimal information and sanitize outputs.
- Test cryptographic logic independently
  - Write unit tests specifically for cryptographic operations to verify key generation, encryption-decryption symmetry, and signature validity.
- Avoid reinventing protocols
  - Resist the temptation to build your own security protocol or tweak standards. Instead, follow established ones like TLS, JOSE (for JWT), or S/MIME.
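As a sketch of the "always authenticate encrypted data" practice, the following uses AES-GCM from the cryptography package and shows that a single flipped ciphertext bit is rejected at decryption time. The associated-data value is a hypothetical record header.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

nonce = os.urandom(12)             # 96-bit nonce; never reuse one with the same key
associated_data = b"record-id:42"  # authenticated but not encrypted (hypothetical)

ciphertext = aesgcm.encrypt(nonce, b"Confidential data", associated_data)
plaintext = aesgcm.decrypt(nonce, ciphertext, associated_data)
print("Decrypted:", plaintext)

# Flip one bit of the ciphertext to simulate tampering in transit
tampered = bytes([ciphertext[0] ^ 0x01]) + ciphertext[1:]
try:
    aesgcm.decrypt(nonce, tampered, associated_data)
    tamper_detected = False
except InvalidTag:
    tamper_detected = True
print("Tampering detected:", tamper_detected)
```

Note that the failure path reports only that authentication failed, in line with the advice above about not leaking detailed error information.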

# 2. Detecting Vulnerabilities Using Static Analysis Tools

- Static Code Analysis
  - Static analysis is a cornerstone of secure software development, particularly when detecting vulnerabilities early in the development lifecycle.
  - Unlike dynamic analysis, which tests programs during execution, static code analysis examines source code or compiled bytecode without executing it.
  - This enables detection of syntactic and semantic issues such as buffer overflows, injection flaws, and improper use of APIs before the software is run.
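To illustrate what "examining source code without executing it" means, here is a minimal sketch of a static check built on Python's `ast` module. It looks for a single pattern (calls that pass `shell=True`) in a made-up source string; real tools apply many such rules and far more analysis.

```python
import ast

# Source code to inspect; it is parsed as text and never executed
SOURCE = '''
import subprocess

def ping_server(host):
    subprocess.call(f"ping -c 1 {host}", shell=True)

def list_files():
    subprocess.run(["ls", "-l"])
'''

def find_shell_true_calls(source):
    """Return the line numbers of calls that pass the keyword shell=True."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for keyword in node.keywords:
            if (keyword.arg == "shell"
                    and isinstance(keyword.value, ast.Constant)
                    and keyword.value.value is True):
                findings.append(node.lineno)
    return findings

print("shell=True found on lines:", find_shell_true_calls(SOURCE))
```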

- Selecting and Implementing Static Analysis Tools
  - In Python, tools like pylint, flake8, and bandit are commonly used for static analysis.
    - [Bandit](https://github.com/PyCQA/bandit) is designed specifically to find security issues in Python code, such as use of insecure functions or hard-coded passwords.


# 🛠️ Hands-On: Use Bandit to Flag Vulnerabilities

In [None]:
# create a file with a security flaw
with open('script.py', 'w') as f:
    f.write("""
# Example Python script with a security flaw
import subprocess

def ping_server(host):
    subprocess.call(f"ping -c 1 {host}", shell=True)
    # Unsafe: vulnerable to shell injection
""")

In [None]:
# Use bandit to analyze the script file
!pip install bandit -q > /dev/null 2>&1  # suppress install output
!bandit -r script.py

  - The output of Bandit flags the use of subprocess.call with shell=True as a high-severity vulnerability, warning that it may lead to shell injection if the input is not properly sanitized.

  - Selecting the right static analysis tool depends on several factors, including the programming language, type of vulnerabilities to detect, integration capabilities with your development environment, and scalability for large codebases.
      - Tools like SonarQube, Semgrep, and CodeQL offer cross-language support and can integrate with CI/CD pipelines.
      - For Python specifically:
        - [Pylint](https://pylint.pycqa.org/) is effective for enforcing coding standards and identifying common logic errors
        - [Flake8](https://flake8.pycqa.org/) focuses on style and syntax conformity.
        - [Semgrep](https://semgrep.dev/) combines rule-based scanning with fast performance and is particularly useful for finding both security and logic issues.

In [None]:
# Example: running Semgrep on a project directory using a registry ruleset
# (assumes semgrep is already installed, e.g., with pip)
!semgrep --config=p/ci python_project/
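Returning to the shell-injection finding that Bandit reports for `script.py`, one common remediation is to validate the input and pass the command as an argument list without `shell=True`. The sketch below is one possible approach, not the only one; the validation pattern and the helper name `build_ping_command` are assumptions.

```python
import ipaddress
import re
import subprocess

# Conservative hostname pattern: letters, digits, hyphens, and dots only,
# starting and ending with a letter or digit
HOSTNAME_PATTERN = re.compile(r"[A-Za-z0-9]([A-Za-z0-9.-]{0,251}[A-Za-z0-9])?")

def build_ping_command(host):
    """Validate the host and return the command as an argument list."""
    try:
        ipaddress.ip_address(host)  # accept literal IPv4/IPv6 addresses
    except ValueError:
        if not HOSTNAME_PATTERN.fullmatch(host):
            raise ValueError(f"Refusing suspicious host value: {host!r}")
    return ["ping", "-c", "1", host]

def ping_server(host):
    # No shell is involved: the host is passed as a single argument
    return subprocess.run(build_ping_command(host), capture_output=True, timeout=10)

print(build_ping_command("example.com"))
try:
    build_ping_command("example.com; cat /etc/passwd")
except ValueError as err:
    print(err)
```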

- To integrate static analysis tools into your development workflow, they should ideally be part of your IDE and CI/CD pipeline.
  - For local development, editors like VS Code and PyCharm support plugins for pylint, flake8, and Bandit, providing real-time feedback.
  - In CI/CD environments, GitHub Actions or GitLab CI can be configured to run these tools automatically on each push or pull request, ensuring consistent enforcement of security and style checks.

  - Here is an example of a GitHub Actions workflow that runs bandit on every push.
    - We will implement this as a hands-on activity in Part 4

```
---
name: Security Scan

on: [push]

jobs:
  bandit-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Bandit
        run: pip install bandit
      - name: Run Bandit
        run: bandit -r .
```

- Static analysis tools help identify security and quality issues before code reaches production, making them invaluable in modern DevSecOps practices.
- Their effectiveness lies not only in the bugs they catch but also in how seamlessly they can be integrated into development and delivery pipelines.
- Selecting appropriate tools and incorporating them into both local and automated workflows can significantly reduce the likelihood of shipping insecure or faulty code.

![PART4](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/p4-head.png)
# Part 4. Secure Software Processes
1. Threat modeling  
2. Secure workflows  
3. Automated Security Testing




# 1. Threat Modeling

![ThreatZine](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day4-threatmodeling.png)

[7] Shostack Ch. 1
- **Threat modeling** is a way to think ahead about security—before your system is built or changed.
- It considers system design, components, and data flows to identify vulnerabilities by asking questions like:
  - What could go wrong?
  - Who might attack this system, and how?
  - What damage could they do?
  - What can we do to prevent or minimize that damage?
- The goal is to spot and fix security issues early, when it is cheaper and easier to do so.
- It's a key part of building secure, resilient systems.

[1] Kohnfelder Ch. 2
- A security mindset means shifting from a builder’s perspective to an attacker’s view.
- Threats include intentional attacks, accidents, bugs, hardware failures, and human error.
- A security mindset helps guide secure decision-making and can be adapted to fit available time and resources.
- Incremental improvements in threat identification and mitigation significantly strengthen security even if all vulnerabilities aren't found.
  - This can also reveal opportunities for non-security-related improvements, e.g., system efficiencies and new features.


# Identifying Assets and Attack Surfaces
- Identify and prioritize software assets based on their value and sensitivity, e.g., applications and APIs (both internal and external), source code and configuration files, data stores (databases, user credentials)
- Avoid complex risk calculations; instead, use a simple strategy like the Agile "T-shirt size" system (e.g., Extra-Large, Large, Medium, Small) to prioritize asset protection efforts.
  - https://www.easyagile.com/blog/agile-estimation-techniques
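T-shirt sizing is simple enough to capture in a few lines of code. The sketch below is one possible representation; the asset names, the sizes assigned to them, and the inclusion of an extra-large tier are illustrative assumptions rather than recommendations.

```python
from dataclasses import dataclass

# Smaller rank means higher protection priority
SIZE_RANK = {"XL": 0, "L": 1, "M": 2, "S": 3}

@dataclass
class Asset:
    name: str
    size: str  # one of the keys in SIZE_RANK

# Hypothetical inventory with illustrative sizes
inventory = [
    Asset("Public marketing site content", "S"),
    Asset("User credential database", "XL"),
    Asset("Internal API configuration files", "M"),
    Asset("Application source code", "L"),
]

# Sort so the assets needing the most protection come first
prioritized = sorted(inventory, key=lambda asset: SIZE_RANK[asset.size])
for asset in prioritized:
    print(f"{asset.size:>2}  {asset.name}")
```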





# 🛠️ Hands-On: Asset Prioritization using T-Shirt Sizing
### Prioritize the Following Assets Using T-Shirt Sizing
Match each asset to its appropriate T-shirt size priority:  
(Choices: **Extra-Large**, **Large**, **Medium**, **Small**)

| Asset | Priority (Match) |
|:------------------------|:------------------|
| 1. Financial transaction records | ___ |
| 2. Internal system logs containing harmless details | ___ |
| 3. Customer personal information (e.g., location, identifiers) | ___ |
| 4. Client-side application code accessible to all users | ___ |
| 5. Private encryption keys used for secure communications | ___ |
| 6. Advertising data collected by a social media platform | ___ |

<details>
<summary>Click to reveal the Answer Key</summary>
<br>

| Asset | Correct Priority |
|:------|:-----------------|
| 1. Financial transaction records | Extra-Large |
| 2. Internal system logs containing harmless details | Small |
| 3. Customer personal information (e.g., location, identifiers) | Large |
| 4. Client-side application code accessible to all users | Small |
| 5. Private encryption keys used for secure communications | Extra-Large |
| 6. Advertising data collected by a social media platform | Medium |
</details>


## Group Similar Assets When Appropriate
- Group similar assets when appropriate for easier management, but separate them if risk profiles or usage contexts differ significantly.
  - Consider an organization which maintains the following assets:
    - Two internal HR web applications hosted on the same internal network.
    - A public-facing customer support portal accessible via the internet.
    - An internal payroll system that processes sensitive financial data.
<details>
<summary><strong>How would you group them? (click for suggestions)</strong></summary>
<ul>
  <li><strong>Group the two internal HR web applications:</strong>
    <ul>
      <li>Same network.</li>
      <li>Accessed by the same group of employees.</li>
      <li>Similar security controls and data sensitivity levels.</li>
    </ul>
  </li>
  <li><strong>Do not group the customer support portal with internal applications:</strong>
    <ul>
      <li>Exposed to the internet and has a larger attack surface.</li>
      <li>Subject to different risks, such as DDoS or credential stuffing.</li>
      <li>May have a separate set of compliance or logging requirements.</li>
    </ul>
  </li>
  <li><strong>Do not group the payroll system with the HR applications:</strong>
    <ul>
      <li>Handles more sensitive data (e.g., salaries, bank details).</li>
      <li>Requires stricter access controls and different audit requirements.</li>
    </ul>
  </li>
</ul>
</details>

- Always consider asset value from multiple perspectives — including customers, attackers, and the organization itself — to avoid underestimating potential risks.
- Minimize attack surfaces wherever possible, since they are the first points of entry for attackers; early blocking reduces the spread of attacks.
- Recognize that attack surfaces include both digital and physical exposures such as public network connections and device interfaces.

# Using Threat Modeling Frameworks
- [5] Olmstead Ch. 6
- A **threat modeling framework** provides a structured approach to identifying, evaluating, and addressing potential security threats in a system or application. These frameworks help teams anticipate how attackers might exploit vulnerabilities and guide the design of appropriate defenses before issues occur.
- Two common frameworks include **STRIDE** and **DREAD**
- The STRIDE model is a Microsoft framework for identifying and categorizing security threats to a system; the name is an acronym of its six categories.
  - Spoofing - an attacker pretends to be someone or something else, e.g., gaining unauthorized access with a valid user's credentials
  - Tampering - unauthorized modification of data, code, or system components, e.g., altering database contents or disrupting a system's regular operation
  - Repudiation - a user or system entity denies having performed an action, making it hard to attribute responsibility, e.g., manipulating a log file to make it appear someone else did something
  - Information disclosure - sensitive information is exposed to unauthorized individuals or systems, e.g., leaking confidential data such as student grades, financial information, or personal health information
  - Denial of Service (DoS) - disrupting or degrading the availability of a system or its components so that legitimate users cannot reach them, e.g., flooding a web server with requests
  - Elevation of privilege - an attacker gains a higher level of access or permission than authorized, e.g., a vulnerability allows a regular user to escalate to administrator
- The DREAD model rates each of its five components on a scale from 0 to 10; a lower score means lower risk.
  - Damage - the potential impact of a vulnerability if it were exploited; 0 indicates no damage, 10 indicates catastrophic damage
  - Reproducibility - how easily an attacker can reproduce the conditions necessary to exploit a vulnerability; 0 means difficult to impossible to reproduce, and 10 means effortless to reproduce
  - Exploitability - how easily an attacker can exploit a vulnerability, considering the complexity of the attack and the skills required to carry it out
  - Affected users - the number of users or systems that could be impacted if a vulnerability is exploited
  - Discoverability - how easily an attacker can discover a vulnerability; 0 means difficult and 10 means straightforward
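
To make the two frameworks concrete, here is a minimal sketch that tags a threat with a STRIDE category and combines its five DREAD ratings. Averaging the ratings is one common convention and is an assumption here (the text above does not prescribe a formula); the `Threat` class and the sample ratings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of Service"
    ELEVATION_OF_PRIVILEGE = "Elevation of privilege"


@dataclass
class Threat:
    title: str
    category: Stride
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def dread_score(self) -> float:
        """Average the five DREAD ratings (assumed convention); lower is better."""
        ratings = (
            self.damage,
            self.reproducibility,
            self.exploitability,
            self.affected_users,
            self.discoverability,
        )
        for rating in ratings:
            if not 0 <= rating <= 10:
                raise ValueError("DREAD ratings must be between 0 and 10")
        return sum(ratings) / len(ratings)


# Hypothetical threats with illustrative ratings.
threats = [
    Threat("Audit log can be edited by operators", Stride.REPUDIATION, 4, 3, 5, 2, 3),
    Threat("SQL injection in login form", Stride.TAMPERING, 8, 9, 7, 9, 6),
]

# Review the highest-scoring (riskiest) threats first.
for threat in sorted(threats, key=Threat.dread_score, reverse=True):
    print(f"{threat.dread_score():4.1f}  {threat.category.value:<12} {threat.title}")
```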

# Threat Modeling Tools
- [Microsoft Threat Modeling Tool](https://learn.microsoft.com/en-us/azure/security/develop/threat-modeling-tool)
- [IBM Guardium Vulnerability Assessment](https://www.ibm.com/products/guardium-vulnerability-assessment)
- [OWASP Threat Dragon](https://owasp.org/www-project-threat-dragon/)
- [OWASP pytm](https://owasp.org/www-project-pytm/) is a Pythonic framework for threat modeling


# 🛠️ Hands-On: Use pytm to Create a Threat Model
- The following code defines a minimal system architecture with **actors** (entities that interact with the system, such as users or external services), **boundaries** (logical or physical zones that separate trust levels, such as internal networks or the internet), and **dataflows** (paths through which data moves between components, indicating protocols and direction).
- The script then generates and renders a threat report using a custom Jinja2 template.
- [Jinja2](https://pypi.org/project/Jinja2/) is a Python templating engine that allows dynamic generation of text files (e.g., HTML, Markdown, reports) using placeholders and control structures.
  - Templates can include variables, loops, and conditionals, making it easy to separate presentation logic from application logic.
  - Commonly used in web development and report generation, Jinja2 integrates seamlessly with tools like Flask, Django, and custom CLI applications like pytm.

In [None]:
!pip install pytm Jinja2

In [None]:
%%writefile minimal_model.py
from pytm import TM, Server, Dataflow, Datastore, Actor, Boundary
from jinja2 import Environment, FileSystemLoader
import os

# STRIDE category inference is used below because built-in pytm threats
# often lack an explicit category. Keyword matching is used to approximate
# for reporting purposes. Some threats may still be unlabeled.

STRIDE_CATEGORIES = {
    # Spoofing
    "spoof": "Spoofing",
    "forging": "Spoofing",
    "impersonation": "Spoofing",
    "credential falsification": "Spoofing",
    "session hijacking": "Spoofing",
    "replay": "Spoofing",

    # Tampering
    "tamper": "Tampering",
    "manipulation": "Tampering",
    "injection": "Tampering",
    "sql": "Tampering",
    "command": "Tampering",
    "format string": "Tampering",
    "api manipulation": "Tampering",
    "overwriting": "Tampering",
    "overwrite": "Tampering",

    # Repudiation
    "repudiation": "Repudiation",
    "audit log manipulation": "Repudiation",
    "log tampering": "Repudiation",

    # Information Disclosure
    "leak": "Information Disclosure",
    "exfiltration": "Information Disclosure",
    "exposure": "Information Disclosure",
    "unprotected": "Information Disclosure",
    "disclosure": "Information Disclosure",
    "data leak": "Information Disclosure",
    "sensitive": "Information Disclosure",
    "sniffing": "Information Disclosure",

    # Denial of Service
    "flood": "Denial of Service",
    "dos": "Denial of Service",
    "denial": "Denial of Service",
    "overflow": "Denial of Service",
    "crash": "Denial of Service",
    "allocation": "Denial of Service",
    "ping of the death": "Denial of Service",
    "smuggling": "Denial of Service",
    "excessive": "Denial of Service",

    # Elevation of Privilege
    "privilege": "Elevation of Privilege",
    "escalation": "Elevation of Privilege",
    "bypass": "Elevation of Privilege",
    "unauthorized": "Elevation of Privilege",
    "elevation": "Elevation of Privilege",
    "root": "Elevation of Privilege",
    "admin": "Elevation of Privilege",
}

def infer_stride_category(threat):
    description = threat.description.lower()
    name = threat.__class__.__name__.lower()

    for keyword, stride in STRIDE_CATEGORIES.items():
        if keyword in description or keyword in name:
            return stride
    return "Uncategorized"

# Create a new threat model instance
tm = TM("Minimal Threat Model")

# define a basic system architecture

# trust boundaries
internet = Boundary("Internet")
internal = Boundary("Internal Network")

# external actor
user = Actor("User")

# key components (pytm elements record their trust boundary in the inBoundary attribute)
web_server = Server("Web Server", inBoundary=internet)
db = Datastore("Database", inBoundary=internal)

# data flows
Dataflow(user, web_server, "User sends credentials", protocol="HTTPS")
Dataflow(web_server, db, "Web server queries user info", protocol="SQL")

# Process the model to populate threats
tm.process()

# Access the elements
all_elements = list(tm._elements)
dataflows = [e for e in all_elements if isinstance(e, Dataflow)]
components = [e for e in all_elements if not isinstance(e, Dataflow)]
threats = list(tm._threats)

# Infer STRIDE category for each threat
for threat in threats:
    threat.category = infer_stride_category(threat)

# Set up Jinja2 templating
env = Environment(loader=FileSystemLoader(searchpath=os.path.dirname(__file__)))
template = env.get_template("custom_report.jinja2")
output = template.render(tm=tm, elements=components, dataflows=dataflows, threats=threats)

print(output)



In [None]:
%%writefile custom_report.jinja2
{# create custom template #}
Threat Model: {{ tm.name }}
=========================

Components:
{% for element in elements %}
- {{ element.name }} ({{ element.__class__.__name__ }})
{% endfor %}

Data Flows:
{% for df in dataflows %}
- {{ df.name }}: {{ df.source.name }} → {{ df.sink.name }} via {{ df.protocol }}
{% endfor %}

Threats:
{% for threat in threats %}
- **{{ threat.target.name if threat.target else "General Threat" }}**:
  - **Type:** {{ threat.__class__.__name__.replace('_', ' ') }}
  - **STRIDE:** {{ threat.category if threat.category else "Uncategorized" }}
  - **Description:** {{ threat.description }}
{% endfor %}


In [None]:
# The threat list covers common attack types like injection, spoofing, and
# session hijacking, as identified by pytm's built-in threat modeling logic.
!python3 minimal_model.py

# 2. Secure Workflows


# Secure Development Workflows
- A **Secure Development Workflow** integrates security practices throughout the entire software development lifecycle, from initial planning to deployment and maintenance.
- These workflows ensure that security is not treated as an afterthought, but as a core component of every phase of development.
- Key practices include threat modeling during design, secure coding standards during implementation, static and dynamic analysis during testing, and secure configuration management during deployment.
- Secure workflows also emphasize the use of version control, peer reviews, and automated CI/CD pipelines to reduce the risk of introducing vulnerabilities and to detect issues early.


# Integration in the SDLC
![SecureAgile](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day4-secureagilesdlc.png)
- By embedding security controls directly into development processes, teams can more effectively manage risks without slowing down delivery.
- Secure development workflows promote collaboration between developers, security professionals, and operations teams, following methodologies such as DevSecOps.
- These workflows often include automated tools for code scanning, dependency checking, and infrastructure validation, allowing security to scale with development **velocity** (the speed and efficiency of delivering code changes to production).
- Secure development workflows lead to more resilient applications, reduced remediation costs, and greater compliance with regulatory and industry standards.

# Integrating Security into CI/CD Pipelines
- **CI/CD** (Continuous Integration and Continuous Deployment) refers to the practice of regularly merging code changes with automated builds and tests to catch issues early (CI) and automatically releasing validated changes to production or staging environments (CD).
- Integrating automated security tests into a CI/CD pipeline ensures that vulnerabilities are detected and addressed early (during code commits, builds, and deployments) rather than after release.
- Common integrations include static application security testing (SAST), dependency scanning, secret detection, and configuration validation tools that run automatically as part of the pipeline.
- By **shifting security left** and making it a routine part of development workflows, teams can reduce risk without compromising development velocity and maintain a consistent security baseline across all code changes.  
![ShiftLeft](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day4-devsecshiftleft.png)
- As a practical example, developers can integrate tools like Bandit into their GitHub CI workflows to automatically detect common Python security issues during each code push.

# 🛠️ Hands-On: Integrate Bandit into a GitHub Actions Workflow

- In this hands-on, we will add a GitHub Actions workflow to the python-demo-project repository from Part 1 to demonstrate a secure development workflow.

1. Add the following file to your repository as **.github/workflows/bandit.yml**
---

```
name: Security Scan

on: [push]

jobs:
  bandit-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Bandit
        run: pip install bandit
      - name: Run Bandit
        run: bandit -r .
```

- After committing the file, click on the "Actions" menu from your repository's home page to watch the run.
- By default, Bandit exits with a non-zero status when it reports any issue, including low- and medium-severity ones, which fails the workflow run.
- Our initial Python code made a call to **requests.get** with no timeout, which Bandit reports as a medium-severity issue.
- We can modify the command to suppress the lower-severity issues as follows:

```
    bandit -r . --severity-level high
```
- Alternatively, we can fix the underlying problem by adding a timeout to the call.
  - This is also a good way to retest the action and verify that the security scan passes; the workflow runs again when we commit the following change:

- Edit the main.py file and change

```
    response = requests.get("https://www.example.com")
```
- to

```
    response = requests.get("https://www.example.com", timeout=5)
```

- Then commit your change and view the Actions results again (the run may stay in a pending status for a while; free GitHub accounts aren't usually first in line).
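
For finer control than a single command-line flag, a pipeline step can post-process Bandit's JSON report (produced with `bandit -r . -f json -o bandit-report.json`). The sketch below is one possible gate, assuming the report's `results` entries carry an `issue_severity` field; the `blocking_issues` helper is hypothetical and the sample report is hand-written in that shape, not real scanner output.

```python
import json

SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}


def blocking_issues(report: dict, threshold: str = "HIGH") -> list:
    """Return the findings whose severity is at or above the threshold."""
    minimum = SEVERITY_RANK[threshold]
    return [
        issue
        for issue in report.get("results", [])
        if SEVERITY_RANK.get(str(issue.get("issue_severity", "")).upper(), 0) >= minimum
    ]


# Hand-written sample shaped like a Bandit JSON report (illustrative only).
sample_report = json.loads("""
{
  "results": [
    {"filename": "main.py", "line_number": 3, "test_id": "B113",
     "issue_severity": "MEDIUM", "issue_confidence": "LOW",
     "issue_text": "Call to requests without timeout"}
  ]
}
""")

for level in ("HIGH", "MEDIUM"):
    found = blocking_issues(sample_report, level)
    print(f"threshold={level}: {len(found)} blocking issue(s)")
```

In a real workflow, the script would load the report file and exit with a non-zero status when the returned list is non-empty.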

# Secure Design Reviews and Approval Processes
![CodeReview](https://raw.githubusercontent.com/FSCJ-FacultyDev/HITEC2025/main/images/day4-codereviews.png)
- Vulnerabilities and insecure code should be identified before deployment.
- This requires systematic inspection of source code by one or more qualified reviewers looking for issues such as
  - improper input validation
  - insecure cryptographic use
  - injection flaws
  - logic errors
- Code reviews not only help find bugs but also encourage developers to follow secure coding standards and best practices (e.g., [OWASP](https://owasp.org/www-project-top-ten/) and [CERT](https://wiki.sei.cmu.edu/confluence/display/seccode/SEI+CERT+Coding+Standards)).
  - OWASP guidelines for code reviews can be found [here](https://owasp.org/www-project-code-review-guide/assets/OWASP_Code_Review_Guide_v2.pdf).

## Effective Security Design Reviews
- References
  - [Microsoft](https://www.microsoft.com/en-us/securityengineering/sdl/practices)
  - [OWASP](https://owasp.org/www-project-application-security-verification-standard/)
  - [NIST](https://csrc.nist.gov/publications/detail/sp/800-218/final)
  - [Google Engineering Practices (GitHub)](https://github.com/google/eng-practices/blob/master/review/index.md)
- An effective security review combines manual inspection with automated tools.
  - Manual reviews allow human reviewers to spot complex logic flaws and subtle security issues that scanners might miss.
  - Automated static analysis tools can efficiently catch repetitive patterns, outdated libraries, or known vulnerabilities across large codebases.
    - Integrating these tools into a CI/CD pipeline ensures that each pull request or code commit is scanned early, preventing security regressions and helping teams maintain a strong security posture throughout the development cycle.

## Secure Design Review Checklist
  1. Understand the System Context
    - Have all components, data flows, and trust boundaries been identified?
    - Has a threat model (e.g., STRIDE or DREAD) been developed for the system?
    - Are all third-party services, libraries, and APIs documented?
  2. Authentication & Authorization
    - Does the system enforce strong, secure user authentication?
    - Are authentication credentials securely stored (e.g., hashed and salted passwords)?
    - Is access control enforced at all critical entry points?
    - Are role-based or attribute-based access control models clearly defined?
  3. Data Protection & Privacy
    - Is sensitive data (PII, credentials, tokens) encrypted in transit (TLS) and at rest?
    - Are proper cryptographic algorithms and key lengths selected?
    - Is key management handled securely and separately from application logic?
    - Are data retention and deletion policies aligned with privacy requirements?
  4. Input Validation & Output Encoding
    - Is all user input validated, sanitized, and length-limited?
    - Are appropriate output encoding mechanisms in place to prevent injection attacks (e.g., XSS, SQLi)?
    - Are dangerous file uploads, redirects, or deserialization scenarios accounted for?
  5. Error Handling & Logging
    - Are errors logged in a secure, centralized location without exposing sensitive details?
    - Do error messages avoid revealing internal implementation details to users?
    - Are logs protected from tampering and accessible only to authorized users?
  6. Secure Communications
    - Is TLS enforced for all client-server and service-to-service communication?
    - Are certificates validated, and is certificate pinning considered for critical systems?
    - Are insecure protocols (e.g., HTTP, FTP) avoided?
  7. Dependency & Environment Security
    - Are third-party libraries and dependencies tracked and regularly scanned for vulnerabilities (e.g., via SBOM or SCA tools)?
    - Is the build and deployment environment hardened against supply chain attacks?
    - Are secrets managed securely (e.g., not hardcoded or in source control)?
  8. Secure Defaults & Fail-Safe Design
    - Does the system follow the principle of least privilege by default?
    - Are security controls opt-out rather than opt-in?
    - Does the system fail securely (e.g., deny access by default when uncertain)?
  9. Resilience & Threat Mitigation
    - Are rate limiting, CAPTCHA, or other bot defenses implemented where needed?
    - Is the system protected against common attacks (e.g., replay attacks, CSRF, DoS)?
    - Are security headers (e.g., CSP, HSTS, X-Frame-Options) considered for web apps?
  10. Review & Documentation
    - Has the design been reviewed by at least one independent security reviewer?
    - Are security assumptions, decisions, and mitigations documented?
    - Are plans in place for ongoing threat monitoring and incident response?

- Approval processes further reinforce security by requiring that code cannot be merged into the main branch without passing defined **security gates**.
- These gates include
  - successful automated tests
  - static analysis results
  - formal sign-off from security-trained reviewers.

- Role-based access control (RBAC) within source control systems ensures that only authorized individuals can approve or deploy changes.
- Because a failed security gate blocks the merge, the gate should always provide feedback that helps the developer(s) resolve the issue(s) and learn from the experience.
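
The gate logic described above can be sketched in a few lines. This is a simplified model, not how any particular platform implements branch protection; the `MergeRequest` fields and the `gate_failures` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MergeRequest:
    tests_passed: bool
    high_severity_findings: int
    security_approvals: int


def gate_failures(request: MergeRequest, required_approvals: int = 1) -> list:
    """Return actionable feedback for each failed gate (empty list = merge allowed)."""
    failures = []
    if not request.tests_passed:
        failures.append("Automated tests failed: fix the failing tests and push again.")
    if request.high_severity_findings > 0:
        failures.append(
            f"Static analysis reported {request.high_severity_findings} "
            "high-severity finding(s): review the scan report."
        )
    if request.security_approvals < required_approvals:
        failures.append("Missing sign-off from a security-trained reviewer.")
    return failures


def can_merge(request: MergeRequest) -> bool:
    return not gate_failures(request)


blocked = MergeRequest(tests_passed=True, high_severity_findings=2, security_approvals=0)
for message in gate_failures(blocked):
    print("BLOCKED:", message)
```

Returning the reasons, rather than a bare pass/fail, is what turns a blocked merge into feedback the developer can act on.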

# Incident Response
- Operational processes must be able to withstand, respond to, and recover from security incidents without compromising data integrity or business continuity.
- Secure workflows are designed not only to prevent unauthorized actions but also to remain resilient under attack or failure.
- By integrating incident response into the lifecycle, organizations ensure that even if a breach or disruption occurs, there are predefined procedures in place to contain the threat, minimize impact, and restore secure operations.
- This reinforces both trust and continuity in systems that handle sensitive or mission-critical activities.

# Workflow Recovery
- Workflow recovery is integral to restoring affected business functions and digital services following an incident.
- This includes the recovery of applications, user access, and dependent systems in accordance with defined recovery time objectives (RTOs) and recovery point objectives (RPOs).
- Workflow recovery plans may involve failover systems, backups, and automated deployment scripts to rebuild environments efficiently.
- Coordination between IT, development, and security teams is essential to ensure continuity and reduce downtime.
- Integrating workflow recovery with incident response ensures not only that threats are neutralized, but that services are brought back online in a secure and controlled manner.
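
As a small worked example of RTOs and RPOs, the sketch below checks a recovery timeline against both objectives. It assumes downtime is measured from incident start to service restoration and data loss from the last good backup to incident start; the `recovery_report` helper and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_report(incident_start, last_backup, service_restored, rto, rpo):
    """Compare actual downtime and data-loss window with the RTO and RPO."""
    downtime = service_restored - incident_start
    data_loss_window = incident_start - last_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident timeline.
report = recovery_report(
    incident_start=datetime(2025, 1, 10, 9, 0),
    last_backup=datetime(2025, 1, 10, 8, 30),
    service_restored=datetime(2025, 1, 10, 11, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=15),
)

for key, value in report.items():
    print(f"{key}: {value}")
```

Here the service came back within the four-hour RTO, but the 30-minute gap since the last backup exceeds the 15-minute RPO, which would point to more frequent backups.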

# 3. Automating Security Testing
- Integrating automated security tests into CI/CD pipelines ensures that vulnerabilities are caught early in the development lifecycle, reducing the risk of deploying insecure code.
- Using automated tools for static application security testing (SAST), dependency scanning, and secret detection within build and deployment workflows helps enforce security policies without delaying delivery.
- This also helps developers receive immediate feedback when insecure code or libraries are introduced, allowing issues to be resolved before reaching production.
- Security gates in CI/CD pipelines can also be configured to block deployments if critical findings are detected, reinforcing a shift-left security strategy.

# Security Test Coverage and Prioritization
- The most critical parts of an application (such as authentication logic, data processing, and external interfaces) must be thoroughly tested for vulnerabilities.
- Since testing every line of code equally is often impractical, prioritization helps focus security efforts on high-risk areas that handle sensitive data or have a history of exploitation.
- Effective coverage (the extent to which your security tests examine critical code paths, inputs, and features) includes a mix of static and dynamic analysis, dependency checks, and manual reviews for complex logic.
- Mapping tests to known threat models or CWE categories can help guide where deeper scrutiny is needed, making security testing more efficient and impactful across the development lifecycle.
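
One lightweight way to map tests to CWE categories is a lookup table that can be checked for gaps. The test names and the set of required CWEs below are hypothetical; only the CWE identifiers themselves are real.

```python
# CWE categories the team has decided must be covered (illustrative selection).
REQUIRED_CWES = {
    "CWE-89": "SQL injection",
    "CWE-79": "Cross-site scripting",
    "CWE-287": "Improper authentication",
    "CWE-798": "Use of hard-coded credentials",
}

# Hypothetical security tests and the CWE categories each one exercises.
TEST_COVERAGE = {
    "test_login_rejects_sql_metacharacters": {"CWE-89"},
    "test_profile_page_escapes_html": {"CWE-79"},
    "test_login_requires_valid_password": {"CWE-287"},
}


def uncovered_cwes(required, coverage):
    """Return the required CWE ids that no test claims to cover."""
    covered = set()
    for cwe_ids in coverage.values():
        covered |= cwe_ids
    return sorted(set(required) - covered)


for cwe_id in uncovered_cwes(REQUIRED_CWES, TEST_COVERAGE):
    print(f"No test covers {cwe_id} ({REQUIRED_CWES[cwe_id]})")
```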

# Managing False Positives
- False positives in security test findings must be managed to distinguish real vulnerabilities from incorrect alerts.
- This is essential to maintaining trust in automated security tools and avoiding wasted developer effort.
- When tools produce too many irrelevant warnings, teams may start ignoring results altogether, missing real threats in the process.
- Prioritizing findings based on severity, exploitability, and impact helps filter meaningful issues from noise.
- Integrating results into developer workflows with clear remediation guidance also improves response time and reduces frustration.
- Regular tuning of security tools and rulesets is necessary to adapt to evolving codebases and reduce alert fatigue.
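
A simple way to keep false positives from drowning out real findings is a reviewed suppression list in which every entry carries a justification. The sketch below assumes findings are identified by rule ID and file path; the `Finding` class, the `triage` helper, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str
    path: str
    message: str


# Reviewed suppressions: (rule_id, path) -> written justification.
SUPPRESSIONS = {
    ("B105", "tests/test_config.py"): "Test-only placeholder credential; file is not deployed.",
}


def triage(findings, suppressions):
    """Split findings into actionable ones and suppressed ones (with their reasons)."""
    actionable, suppressed = [], []
    for finding in findings:
        reason = suppressions.get((finding.rule_id, finding.path))
        if reason is None:
            actionable.append(finding)
        else:
            suppressed.append((finding, reason))
    return actionable, suppressed


findings = [
    Finding("B105", "tests/test_config.py", "Possible hardcoded password"),
    Finding("B608", "app/db.py", "Possible SQL injection via string-based query"),
]

actionable, suppressed = triage(findings, SUPPRESSIONS)
print("Actionable:", [f.rule_id for f in actionable])
print("Suppressed:", [(f.rule_id, why) for f, why in suppressed])
```

Keeping the justification next to each suppression makes it easy to revisit the list during regular tuning.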










# Prioritizing Findings
- By considering not just severity but also **exploitability** (how easy it is to take advantage of the issue) and **impact** (what harm it can cause), developers can focus on what truly needs immediate action and avoid wasting time on theoretical or low-risk findings.
## Examples
### CVE-2022-12345 – SQL Injection in Login Endpoint
Severity: High  
Exploitability: Easy (public exploit available)  
Impact: Allows account takeover  
Priority: Critical — Fix Immediately
### Hardcoded test credentials found in test_config.py
Severity: Medium  
Exploitability: Low (file not deployed in production)  
Impact: No direct production risk  
Priority: Low — Address later or exclude from scan scope
### Outdated jQuery version detected
Severity: Medium  
Exploitability: Medium (theoretical exploit)  
Impact: Potential XSS on legacy admin tools  
Priority: Medium — Plan patch in next sprint  
### Missing HttpOnly flag on session cookie
Severity: High  
Exploitability: Moderate  
Impact: Increases XSS impact  
Priority: High — Patch in current release
### Unused dependency xmltodict with known DoS vulnerability
Severity: High  
Exploitability: Low (not imported anywhere)  
Impact: Minimal unless activated  
Priority: Low — Remove when cleaning dependencies
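
The examples above can be approximated with a small scoring function. This is one possible scheme, not a standard: it assumes each factor is reduced to low/medium/high ("Easy" is treated as high exploitability and "Moderate" as medium), assigns impact levels by judgment, and uses thresholds chosen to match the five examples.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}


def priority(severity: str, exploitability: str, impact: str) -> str:
    """Combine three low/medium/high ratings into a priority label."""
    total = LEVELS[severity] + LEVELS[exploitability] + LEVELS[impact]
    if total >= 9:
        return "Critical"
    if total >= 7:
        return "High"
    if total == 6:
        return "Medium"
    return "Low"


# (finding, severity, exploitability, impact); impact levels are assumed readings
# of the free-text impact descriptions in the examples.
examples = [
    ("SQL injection in login endpoint", "high", "high", "high"),
    ("Hardcoded test credentials in test_config.py", "medium", "low", "low"),
    ("Outdated jQuery version", "medium", "medium", "medium"),
    ("Missing HttpOnly flag on session cookie", "high", "medium", "medium"),
    ("Unused dependency xmltodict", "high", "low", "low"),
]

for name, severity, exploitability, impact in examples:
    print(f"{priority(severity, exploitability, impact):<8} {name}")
```

Note how the unused dependency ends up low priority despite its high severity, because neither exploitability nor impact supports urgent action.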

# SBOMs
- A Software Bill of Materials (SBOM) is a detailed inventory of all components, libraries, and dependencies used by a software application.  
- It provides a comprehensive record which lists open-source, proprietary, and third-party components.  
- It contains component metadata, including version numbers, licenses, and source information.  
- SBOMs promote visibility into the software supply chain and are used in conjunction with scanning tools to identify components with known security issues.
- Popular SBOM tools and formats include [Trivy](https://trivy.dev/latest/), [CycloneDX](https://cyclonedx.org/), [SPDX](https://spdx.dev/), [OWASP Dependency-Track](https://dependencytrack.org/), [Syft](https://www.cisa.gov/resources-tools/services/syft), [Anchore](https://anchore.com/), and [FOSSA](https://fossa.com/).
- SBOM scans are typically run as part of automated CI/CD workflows to verify:
  - Known vulnerabilities in dependencies
  - License compliance and component provenance
  - Tampering or unauthorized components in build artifacts
- SBOM data is cross-referenced with vulnerability databases (e.g., CVE, National Vulnerability Database, Aqua Vulnerability Database, OSS Index, GitHub Advisory Database, Snyk Vulnerability Database) to identify known issues.
- Package ecosystems for languages other than Python are also affected, e.g., JavaScript/Node.js (npm), Java (Maven Central), and others.
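
To show what an SBOM looks like to a program, the sketch below lists the components of a CycloneDX-style JSON document. The sample is a trimmed, hand-written document that assumes the usual top-level `components` array with `name` and `version` fields; the package versions are borrowed from the scan results discussed in this section, and `list_components` is a hypothetical helper.

```python
import json


def list_components(sbom: dict) -> list:
    """Return sorted (name, version) pairs for every component in the SBOM."""
    return sorted(
        (component.get("name", "?"), component.get("version", "?"))
        for component in sbom.get("components", [])
    )


# Trimmed, hand-written sample in CycloneDX JSON shape (illustrative only).
sample_sbom = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "cryptography", "version": "43.0.3",
     "purl": "pkg:pypi/cryptography@43.0.3"},
    {"type": "library", "name": "Mako", "version": "1.1.3",
     "purl": "pkg:pypi/mako@1.1.3"}
  ]
}
""")

for name, version in list_components(sample_sbom):
    print(f"{name}=={version}")
```

A scanner performs essentially this enumeration and then looks up each name and version pair in its vulnerability databases.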



# 🛠️ Hands-On: Run an SBOM check

In [None]:
!pip freeze >requirements.txt
!echo 'showing line count for dependencies:'
!wc -l requirements.txt

In [None]:
!sudo apt-get install -y wget apt-transport-https gnupg lsb-release
!wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | sudo apt-key add -
!echo deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main | sudo tee -a /etc/apt/sources.list.d/trivy.list
!sudo apt-get update
!sudo apt-get install -y trivy

In [None]:
!pip install cyclonedx-bom
!python3 -m cyclonedx_py requirements -i requirements.txt -o sbom.json
!trivy sbom sbom.json

## Results
- Environment scanned: Python packages (via SBOM from requirements.txt)
- Total Vulnerabilities Found: 10
  - High severity: 5
  - Medium: 3
  - Low: 2
  - Critical: 0
- Warnings: Trivy warns that SBOMs generated by third-party tools (like cyclonedx-bom) may lead to incomplete or imprecise matching, but this report still picked up valid CVEs based on package name and version, so the findings are informative and should not be ignored.
- Each row in the table tells you:
  - Library: The affected package
  - Vulnerability: CVE ID with severity (e.g. CVE-2022-40023)
  - Installed Version: The version in your Colab environment
  - Fixed Version: The version where the issue is patched
  - Title + Link: A brief vulnerability description and a link for more info
- Examples:
  - High Severity
    - Mako 1.1.3 → vulnerable to CVE-2022-40023 (Regular Expression DoS)
      - Fixed in 1.2.2
    - keras 3.8.0 → vulnerable to CVE-2025-1550
      - Fixed in 3.9.0
    - jupyter-server has multiple high and medium CVEs
  - cryptography 43.0.3 has a known LOW severity issue — fixed in 44.0.1
- Should You Be Concerned?
  - Yes, especially for HIGH severity vulnerabilities in actively used libraries like keras (RCE risk), jupyter-server (user hash disclosure, redirection, etc.), and Mako (ReDoS).
  - These could impact the confidentiality, integrity, or availability of systems if exposed to malicious input — particularly in multi-user/shared environments like Jupyter notebooks or APIs.
- What You Should Do
  - Upgrade the packages: use `pip install --upgrade <package>` or pin higher versions in requirements.txt.
  - Avoid vulnerable versions when building distributable apps or APIs.
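
A pinned requirements list can be checked against the "Fixed Version" column with a few lines of Python. This sketch uses the three fixed versions quoted in the results above and a deliberately simple parser that only handles plain dotted numeric versions; for real projects, a dedicated parser such as `packaging.version` is the safer choice. The helper names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Parse a plain dotted numeric version such as '1.2.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum safe versions taken from the scan results above.
FIXED_IN = {"mako": "1.2.2", "keras": "3.9.0", "cryptography": "44.0.1"}


def needs_upgrade(requirement_lines, fixed_in):
    """Return (name, installed, fixed) for pinned packages below their fixed version."""
    outdated = []
    for line in requirement_lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and unpinned entries
        name, version = (part.strip() for part in line.split("==", 1))
        fixed = fixed_in.get(name.lower())
        if fixed and parse_version(version) < parse_version(fixed):
            outdated.append((name, version, fixed))
    return outdated


# Hypothetical excerpt of a requirements.txt file.
requirements = ["Mako==1.1.3", "keras==3.9.0", "cryptography==43.0.3", "requests==2.32.3"]

for name, installed, fixed in needs_upgrade(requirements, FIXED_IN):
    print(f"{name}: {installed} -> upgrade to at least {fixed}")
```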

# Integrating a Scan into a GitHub Action
### Sample YAML file; store in .github/workflows

```
name: Trivy Dependency Scan

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  trivy-scan:
    name: Scan Python dependencies with Trivy
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

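      - name: Install Trivy
        run: |
          sudo apt-get update
          sudo apt-get install -y wget
          # Release assets are versioned, so download from the matching tag
          # rather than "latest"; change both version strings together.
          wget https://github.com/aquasecurity/trivy/releases/download/v0.48.4/trivy_0.48.4_Linux-64bit.deb
          sudo dpkg -i trivy_0.48.4_Linux-64bit.deb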

      - name: Scan project directory for vulnerabilities
        run: trivy fs --exit-code 1 --severity CRITICAL,HIGH .

      # Optional: Save Trivy scan report as an artifact.
      # "if: always()" lets these steps run even when the scan step above fails the job.
      - name: Save Trivy scan report
        if: always()
        run: trivy fs --severity CRITICAL,HIGH --format table --output trivy-report.txt .

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: trivy-report
          path: trivy-report.txt
```