Site Reliability Engineering (SRE)

The role

We are looking for a Site Reliability Engineer (SRE) to join the infrastructure team.

Site reliability engineering (SRE) is a software engineering approach to IT operations. SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks.

An SRE has a unique role that requires either a background as a software developer with additional operations experience, or a background as a sysadmin or in IT operations combined with software development skills.

Site reliability engineers split their time between operations tasks and project work. We expect them to spend at most 50% of their time on operations, and this should be monitored to ensure they don’t go over. The rest of the time should be spent on development tasks like creating new features, scaling the system, and implementing automation.

Automation is an important part of the site reliability engineer’s role. If they are dealing with a problem repeatedly, they will automate a solution. This also helps ensure that operations work stays at or below half of their workload.

Maintaining the balance between operations and development work is a key component of SRE.

Read about other perks and benefits at jobs.holded.com

The team

You will be part of the Infrastructure and Operations team that owns and maintains the whole Holded infrastructure.

The team is empowered to make independent decisions while partnering with and effectively serving the product teams, analytics teams, and other areas of the business.

What you will do

SRE teams are responsible for how code is deployed, configured, and monitored, as well as the availability, latency, change management, emergency response, and capacity management of services in production.
You will be hands-on writing code: not only proofs of concept, but anything you consider worth getting hands-on with.
You will work directly with other technical leads, architects, and product designers to shape and reinvent an epic digital product.
You will collaborate with other teams to identify and fix technical problems.
You will be involved in architectural decisions, communicate them, and help teams adopt them.
You will help hire and train technical personnel.
You will actively automate or eliminate anything that is repetitive or that could lead to human errors.
You will design and implement observability, that is, the ability to ask arbitrary questions about your system without having to know ahead of time what you would want to ask.
You will define, test, and run an incident management process.

In one month

You will have completed your onboarding.
You will already know your team.
You will have deployed several times to production.
You will have joined the main ongoing architectural discussions and actively participated in them.
You will know the main metrics and service level indicators of the main product areas.

In three months

You will know the architecture in detail, and you will be in the process of improving certain parts. By then, you will have identified clear areas you would like to improve and will be leading the adoption of those improvements.
You will have led a successful project, be it an automation feature, a technical debt reduction, a DX improvement, etc., achieving the expected result with total technical independence.

In six months

You will already know all the processes and tools in depth.
With your contributions, you will have improved some metrics or key indicators of the platform.

About you

You have an intrinsic bias towards simplicity, and a constant willingness to simplify complex systems.
You have 4+ years of experience as an SRE, DevOps, or Systems Engineer.
You have experience working with major cloud vendors like AWS, GCP, or Azure.
You have experience with Kubernetes.
You practice infrastructure-as-code.
You have deep knowledge of networking (VPC, network peering, etc.).
You have experience with monitoring tools like Prometheus, Grafana, Kibana, Datadog, New Relic, etc.
You have experience managing a logs infrastructure.
You have deep experience with database technologies.
You care about best practices, and software maintainability is a top priority for you.
You like to explore new technologies and are curious about how things work.
You have experience creating high-quality software balanced with a pragmatic understanding of how to make appropriate tradeoffs (e.g., reduce scope) to ship quickly and iterate when necessary.
You are a reliable, trustworthy person who keeps their promises.

Nice-to-haves

You have a full stack mindset. It does not mean you have mastered every single part of the stack, but that you understand how things work under the hood and are willing to help when needed.
You have deep experience with MongoDB, Redis, etc.
But most important of all: you are a freak like we are, you love what you do, and you want to enjoy your work while building something important.

Apply now!
