
Research O&M Resource Tracking Paradigms #4240

Closed · 2 tasks done
nickumia-reisys opened this issue Mar 17, 2023 · 4 comments

Labels: Explore, Mission & Vision, O&M (Operations and maintenance tasks for the Data.gov platform)

Comments

nickumia-reisys (Contributor) commented Mar 17, 2023

User Story

In order to better focus the O&M role, the Data.gov O&M Team wants to research auditing/reporting/metrics/management frameworks that capture the nature of the serverless cloud environment that we support.

Acceptance Criteria

  • GIVEN research has been completed
    WHEN I look at this ticket
    THEN I see a list of (or maybe just one) framework that we can use to track O&M tasks/priorities/metrics.
  • There are supporting details and/or a comparison between each framework.

Background

O&M tasking is a bit of a mess. We don't have a baseline for normal operations and we don't have a list of resources that need a baseline.

Security Considerations (required)

Improving team mental health makes us better prepared to act both proactively and reactively in future incidents.

Sketch

TBD.

nickumia-reisys (Contributor, Author) commented:
From my (initial?) findings, the main things that I would like to see implemented relate to:

  • Abstraction: Hardware vs. Software vs. Networking
    • For Hardware: Better inventory tracking processes
    • For Software/Networking: Better site reliability processes
    • For Software: Better state dashboards
  • Resiliency: If our technology stack changes, updating these processes and policies should take minimal effort.
  • Prioritization:
    • A list of our critical failure indicators
      • Downtime
      • Compromised data/system
      • ... what else?
    • Fostering a proactive vs. reactive mindset
      • Because we don't know what we're looking at, we are inherently reactive. By organizing our state and resource tracking, we'll be able to be more proactive than reactive.
  • Continuous Baseline Analysis: @FuhuXia has the best idea of what the baseline is for our systems (having been on O&M the longest)
    • What tools can we use to gather our baseline?
      • NR has its AI-based analysis; maybe investigate it more. If it's missing data, that should be a different issue.
      • How can we use cloud.gov-based software/technology to gain insight into application performance?
    • When assessing these tools, be sure to see how baselines can be tracked and updated relative to a new variable that may be introduced in the future (e.g., a baseline may consist of 50 parameters/metrics; how would an additional parameter affect the overall baseline computation?). A rough sketch of one way to handle this follows this list.
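
To make the baseline question above concrete, here is a minimal sketch, assuming each metric keeps its own rolling statistics, of how a new parameter could be added without disturbing the existing baseline. The metric names and sample values are hypothetical, not real Data.gov measurements.

```python
# Sketch only: a hypothetical per-metric baseline store. Metric names and
# sample values are made up for illustration, not real Data.gov data.
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class MetricBaseline:
    """Rolling summary statistics for a single metric."""
    name: str
    samples: list[float]

    @property
    def average(self) -> float:
        return mean(self.samples)

    @property
    def deviation(self) -> float:
        return stdev(self.samples) if len(self.samples) > 1 else 0.0

    def is_anomalous(self, value: float, n_sigmas: float = 3.0) -> bool:
        """Flag a new observation that falls outside the baseline band."""
        return abs(value - self.average) > n_sigmas * self.deviation


# Each metric carries its own baseline, so introducing a 51st metric only
# means adding one more entry; the existing baselines are untouched.
baselines = {
    "catalog_error_rate": MetricBaseline("catalog_error_rate", [0.010, 0.020, 0.015]),
    "harvest_job_failures": MetricBaseline("harvest_job_failures", [1.0, 0.0, 2.0]),
}
baselines["solr_p95_latency_ms"] = MetricBaseline("solr_p95_latency_ms", [120.0, 135.0, 128.0])

print(baselines["catalog_error_rate"].is_anomalous(0.2))  # True: well above baseline
```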

I'm going to start with SRE (site reliability engineering) and our state dashboards.

nickumia-reisys (Contributor, Author) commented:
Okay, I think I was putting too much effort into trying to actively apply the things I was researching. Refocusing to answer the acceptance criteria, I'll just focus on SRE as a framework to improve O&M.

As an explanation of how the rest of this comment is organized, there are two sections: (1) statements to ponder and (2) statements that we already imbibe. Statements to ponder are a collection of concepts that we can consider implementing and/or organizing to help rate the health of our systems on a daily basis. Statements that we already imbibe are mostly things that we've implemented but have not accepted as truths for how our systems are performing; we just need to calibrate them to make them useful for O&M assessment. After reviews from the team, I think we can start to dig into the actual implementation of some of these ideas.

I think the point of everything I've laid out here is that we want to say, "Here is the dashboard that shows that our applications have been healthy" as opposed to us loosely saying that we've checked a bunch of different items and have not witnessed any incidents.

Statements to ponder:

  • Operations is a software problem
  • "SREs have skillsets of both Dev and Ops and share “Wisdom of Production” to the development team"
  • What are some SLOs that we care about?

    Catchpoint SRE Survey Report 2019
    Availability — 72%
    Response Time — 47%
    Latency — 46%
    We do not have SLOs — 27%

    • Important key points about SLOs
    • To implement SLOs, we need to define what a good classification of our system is.
      • Catalog is a metadata repository + (soon to be) metadata reporting system
        • What qualities do we want to track that will assess success for this type of system?
        • Availability (uptime, latency, error rates), Data Integrity (a known problem, but we need to track it somehow) and Searchability (how many clicks between entering the site and finding a desired dataset, or how long users spend on the site) could be a few. Starting out, Availability + Data Integrity are more important than Searchability. (A rough availability/error-budget sketch follows this list.)
      • Inventory is a metadata creation system
      • I believe both can be classified more broadly as "User-facing serving systems"
  • DevOps focuses on moving through the development pipeline efficiently, while SRE focuses on balancing site reliability with creating new features. O&M is more SRE than DevOps.
  • Implementation of SRE requires planning and consensus.
    • This ticket is just an initial planning one, but more is needed to make this a reality. I would encourage that we foster an SRE mindset during O&M, so that the O&M role builds the initial foundation of this framework.
  • Gain system observability
    • SRE teams use metrics to determine if the software consumes excessive resources or behaves abnormally.
    • SRE software generates detailed, timestamped information called logs in response to specific events.
    • Traces are observations of the code path of a specific function in a distributed system. (i.e. how long does a specific function call/process take)
  • Implement system monitoring (a golden-signals sketch also follows this list)
    • Latency
    • Traffic (number of requests)
    • Errors
    • Saturation (real-time load of the application)
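
As a starting point for the availability SLO above, here is a minimal sketch of how an availability figure and a remaining error budget could be computed from request counts. The 99.5% target and the counts are placeholders, not agreed Data.gov numbers.

```python
# Sketch only: availability SLO and error-budget math over one window.
# The 99.5% target and the request counts below are hypothetical.
def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests served successfully during the window."""
    return 1.0 - failed_requests / total_requests


def error_budget_remaining(total: int, failed: int, slo_target: float = 0.995) -> float:
    """Share of the window's error budget still unspent (1.0 = untouched, < 0 = blown)."""
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


total, failed = 1_000_000, 2_000  # e.g. one month of catalog web requests
print(f"availability: {availability(total, failed):.4%}")                      # 99.8000%
print(f"error budget remaining: {error_budget_remaining(total, failed):.1%}")  # 60.0%
```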
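
And for the monitoring item, a minimal sketch of rolling the four golden signals up from request records. The field names and the saturation figure are illustrative; in practice they would come from NR or cloud.gov rather than a hand-built list.

```python
# Sketch only: the four golden signals from a batch of request records.
# Field names are illustrative, not an existing Data.gov log schema.
from statistics import quantiles

requests = [
    {"path": "/dataset", "status": 200, "latency_ms": 120},
    {"path": "/dataset", "status": 500, "latency_ms": 950},
    {"path": "/api/3/action/package_search", "status": 200, "latency_ms": 80},
]

traffic = len(requests)                                      # Traffic: request count
errors = sum(1 for r in requests if r["status"] >= 500)      # Errors: 5xx responses
latencies = [r["latency_ms"] for r in requests]
p95_latency = quantiles(latencies, n=20)[-1]                 # Latency: 95th percentile
saturation = 0.42  # Saturation: real-time load (would come from the platform, e.g. cloud.gov CPU/memory)

print(f"traffic={traffic} errors={errors} p95_latency={p95_latency:.0f}ms saturation={saturation:.0%}")
```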

Statements that we already imbibe (but need to continuously implement):

  • Automate as much as possible
  • NR tracks Latency, Traffic and Errors, but let's define a metric (sketched below this list) so that we can say
    • "Application Latency has been average for this time period" or
    • "We've seen 3x as much traffic as our average for this time period"
    • If our systems are performing well in average AND edge cases, then O&M can feel like the job is getting done.
  • I'm not sure which uptime we actually want to track: application status on cloud.gov or NR synthetic monitoring on the CloudFront distributions. It could be a combination of them; but to one of the points above, it's not about perfection.
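
A minimal sketch of what the "average vs. 3x average" statement could look like as a check; the window size and the hourly counts are placeholders.

```python
# Sketch only: compare the current hour's traffic to a recent rolling average.
# The hourly counts and thresholds are placeholders, not tuned values.
from statistics import mean

hourly_request_counts = [900, 1100, 1000, 950, 1050, 3200]  # last value is the current hour

baseline = mean(hourly_request_counts[:-1])
current = hourly_request_counts[-1]
ratio = current / baseline

if ratio >= 3.0:
    print(f"Traffic is {ratio:.1f}x the recent average -- worth a look")
elif ratio <= 1.5:
    print("Traffic has been average for this time period")
```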

A list of references for different perspectives on SRE:

jbrown-xentity (Contributor) commented:
I think that the SLOs (Service Level Objectives) we want to focus on are the following (by process):

  • Catalog Web: Uptime (this should factor in downtime when a Solr server is unresponsive and a % of our requests are failing)
  • Catalog Harvest: Count of jobs that are "clean" (no errors, only additions, updates, and removals) vs. count of jobs that contain errors (a roll-up sketch follows this list)
    • Ideally this is rolled up by harvest source/organization, but if a job runs daily and always fails, this should be a signal for our attention
    • Not sure what the target value (Service Level Agreement) is at this time, but let's start capturing and figure that out later
    • Will be helpful information to have on hand when we start harvesting 2.0
  • Inventory: Count of user-reported issues/bugs
    • Would like to consider automated testing so we aren't reliant on users, but not possible at this time
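
A minimal sketch of the clean-vs-errored harvest job roll-up, grouped by source. The job records and field names are hypothetical, not the actual harvester's schema.

```python
# Sketch only: roll harvest jobs up into "clean" vs. "with errors" per source.
# The job records and field names are hypothetical.
from collections import defaultdict

jobs = [
    {"source": "agency-a", "errors": 0, "added": 12, "updated": 3, "deleted": 1},
    {"source": "agency-a", "errors": 4, "added": 0, "updated": 0, "deleted": 0},
    {"source": "agency-b", "errors": 0, "added": 0, "updated": 5, "deleted": 0},
]

summary = defaultdict(lambda: {"clean": 0, "with_errors": 0})
for job in jobs:
    bucket = "clean" if job["errors"] == 0 else "with_errors"
    summary[job["source"]][bucket] += 1

for source, counts in summary.items():
    print(f"{source}: {counts['clean']} clean / {counts['with_errors']} with errors")
```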

nickumia-reisys (Contributor, Author) commented:
We have accepted the details above as the initial starting point to improve O&M processes. This will be an ongoing effort as part of the O&M shift over the next few weeks. I'll make sure to tag it in the new issues, so that we don't lose it.

nickumia-reisys added the O&M, Mission & Vision, and Explore labels on Oct 9, 2023