# **Big Data and Cloud Computing**

Prof. Michel Fornaciali

Prof. Thanuci Silva

Assignment 1 - SSH Logs

Use Spark RDDs to answer the questions below:

1. **How many lines are in the log file?**
2. **How many successful logins occurred on the system?**
3. **Which users logged in to this system?**
4. **Who accesses the machine most frequently? (excluding root)**
5. **Which IPs access this machine most frequently?**

In [1]:
# Create the Spark session
from pyspark.sql import SparkSession
spark = SparkSession \
            .builder \
            .master('local[2]') \
            .appName("appName") \
            .getOrCreate()

# Get the Spark context
sc = spark.sparkContext

In [2]:
!wget https://raw.githubusercontent.com/elastic/examples/master/Machine%20Learning/Security%20Analytics%20Recipes/suspicious_login_activity/data/auth.log

--2024-11-09 12:42:30--  https://raw.githubusercontent.com/elastic/examples/master/Machine%20Learning/Security%20Analytics%20Recipes/suspicious_login_activity/data/auth.log
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 797637 (779K) [text/plain]
Saving to: ‘auth.log’


2024-11-09 12:42:31 (13.0 MB/s) - ‘auth.log’ saved [797637/797637]



In [3]:
rdd1 = sc.textFile('auth.log')

In [4]:
rdd1.take(20)

['Mar 27 13:06:56 ip-10-77-20-248 sshd[1291]: Server listening on 0.0.0.0 port 22.',
 'Mar 27 13:06:56 ip-10-77-20-248 sshd[1291]: Server listening on :: port 22.',
 'Mar 27 13:06:56 ip-10-77-20-248 systemd-logind[1118]: Watching system buttons on /dev/input/event0 (Power Button)',
 'Mar 27 13:06:56 ip-10-77-20-248 systemd-logind[1118]: Watching system buttons on /dev/input/event1 (Sleep Button)',
 'Mar 27 13:06:56 ip-10-77-20-248 systemd-logind[1118]: New seat seat0.',
 'Mar 27 13:08:09 ip-10-77-20-248 sshd[1361]: Accepted publickey for ubuntu from 85.245.107.41 port 54259 ssh2: RSA SHA256:Kl8kPGZrTiz7g4FO1hyqHdsSBBb5Fge6NWOobN03XJg',
 'Mar 27 13:08:09 ip-10-77-20-248 sshd[1361]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)',
 'Mar 27 13:08:09 ip-10-77-20-248 systemd: pam_unix(systemd-user:session): session opened for user ubuntu by (uid=0)',
 'Mar 27 13:08:09 ip-10-77-20-248 systemd-logind[1118]: New session 1 of user ubuntu.',
 'Mar 27 13:09:37 ip-10-77-20-248 s

# How many lines are in the log file?

* [pyspark.RDD.count](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.count.html)

In [7]:
num_linhas = rdd1.count()
print(f"Total number of lines in the log file: {num_linhas}")


Total number of lines in the log file: 7121


# How many successful logins occurred on the system?

Hint: use the search string _"Accepted password for"_

* [pyspark.RDD.filter](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.filter.html)

In [12]:
logins_sucesso = rdd1.filter(lambda line: "Accepted password" in line)
num_logins_sucesso = logins_sucesso.count()
print(f"Total successful logins: {num_logins_sucesso}")


Total successful logins: 190
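
To sanity-check the predicate without a Spark session, the sketch below applies the same substring test to three lines copied from the log previews in this notebook. It is plain Python rather than Spark, and `sample_lines` and `is_password_login` are illustrative names that do not appear in the original. It also makes one consequence of the hint visible: a public-key login is an accepted login too, but it does not contain the search string, so it is not counted.

```python
# Plain-Python sketch of the filter predicate (assumption: the same
# substring test that the RDD filter applies to each line).
sample_lines = [
    "Mar 27 13:08:09 ip-10-77-20-248 sshd[1361]: Accepted publickey for ubuntu "
    "from 85.245.107.41 port 54259 ssh2: RSA SHA256:Kl8kPGZrTiz7g4FO1hyqHdsSBBb5Fge6NWOobN03XJg",
    "Mar 29 10:43:01 ip-10-77-20-248 sshd[1193]: Accepted password for "
    "elastic_user_7 from 127.0.0.1 port 52942 ssh2",
    "Mar 27 13:06:56 ip-10-77-20-248 sshd[1291]: Server listening on 0.0.0.0 port 22.",
]


def is_password_login(line):
    """Return True when the line records a successful password login."""
    return "Accepted password" in line


# Keep only the lines that satisfy the predicate, as rdd1.filter(...) would.
matches = [line for line in sample_lines if is_password_login(line)]
print(len(matches))
```

Whether public-key logins should count as successful logins depends on how the question is read; the notebook follows the hint and counts password logins only.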


In [14]:
logins_sucesso.take(20)

['Mar 29 10:43:01 ip-10-77-20-248 sshd[1193]: Accepted password for elastic_user_7 from 127.0.0.1 port 52942 ssh2',
 'Mar 29 11:35:20 ip-10-77-20-248 sshd[1361]: Accepted password for elastic_user_2 from 85.245.107.41 port 50690 ssh2',
 'Mar 29 11:36:51 ip-10-77-20-248 sshd[1460]: Accepted password for elastic_user_8 from 85.245.107.41 port 50696 ssh2',
 'Mar 29 11:37:37 ip-10-77-20-248 sshd[1510]: Accepted password for elastic_user_5 from 85.245.107.41 port 50697 ssh2',
 'Mar 29 11:37:50 ip-10-77-20-248 sshd[1558]: Accepted password for elastic_user_8 from 85.245.107.41 port 50699 ssh2',
 'Mar 29 11:45:01 ip-10-77-20-248 sshd[1734]: Accepted password for elastic_user_7 from 85.245.107.41 port 50755 ssh2',
 'Mar 29 11:52:58 ip-10-77-20-248 sshd[1857]: Accepted password for elastic_user_0 from 85.245.107.41 port 50797 ssh2',
 'Mar 29 12:02:08 ip-10-77-20-248 sshd[1985]: Accepted password for elastic_user_0 from 85.245.107.41 port 50817 ssh2',
 'Mar 29 12:19:17 ip-10-77-20-248 sshd[2055]

['elastic_user_7',
 'elastic_user_2',
 'elastic_user_8',
 'elastic_user_5',
 'elastic_user_8']

# Which users logged in to this system?

* [pyspark.RDD.map](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.map.html)

In [21]:
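# After splitting on whitespace, token 8 is the username:
# ... 'Accepted'(5) 'password'(6) 'for'(7) '<user>'(8) 'from'(9) ...
total_users = logins_sucesso.map(lambda line: line.split()[8])
# Preview only; the result is displayed just when this is the last expression in a cell
total_users.take(5)
# Remove duplicates and bring the usernames back to the driver
usuarios = total_users.distinct().collect()
print(f"Users who logged in to the system: {usuarios}")


Users who logged in to the system: ['elastic_user_8', 'elastic_user_5', 'elastic_user_4', 'elastic_user_9', 'elastic_user_7', 'elastic_user_2', 'elastic_user_0', 'elastic_user_3', 'elastic_user_1', 'elastic_user_6']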
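
The username extraction depends on the position of whitespace-separated tokens. As a minimal sketch, the plain-Python snippet below checks that position on one line copied from this notebook's output, and compares it with a regex anchored on the message text. The regex variant and the names `login_re`, `user_by_position`, and `user_by_regex` are assumptions for illustration, not the notebook's method.

```python
import re

# Sample line copied from the successful-login preview in this notebook.
line = ("Mar 29 10:43:01 ip-10-77-20-248 sshd[1193]: "
        "Accepted password for elastic_user_7 from 127.0.0.1 port 52942 ssh2")

# Positional approach used in the notebook: token 8 is the username.
tokens = line.split()
user_by_position = tokens[8]

# Alternative (an assumption): capture the token between "for" and "from".
login_re = re.compile(r"Accepted password for (\S+) from")
match = login_re.search(line)
user_by_regex = match.group(1) if match else None

print(user_by_position, user_by_regex)
```

The positional version is shorter; the regex version would keep working if the number of tokens before the message changed, at the cost of a slightly more involved pattern.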


# Who accesses the machine most frequently?

* [pyspark.RDD.map](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html)
* [pyspark.RDD.reduceByKey](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.reduceByKey.html)
* [pyspark.RDD.takeOrdered](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.takeOrdered.html)

In [22]:
acessos_por_usuario = total_users \
                      .filter(lambda user: user != "root") \
                      .map(lambda user: (user, 1)) \
                      .reduceByKey(lambda a, b: a + b)

usuario_mais_frequente = acessos_por_usuario.takeOrdered(1, key=lambda x: -x[1])[0]
print(f"User who accesses the machine most frequently (excluding root): {usuario_mais_frequente}")


User who accesses the machine most frequently (excluding root): ('elastic_user_0', 29)
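
The counting pattern (map each key to `(key, 1)`, sum per key, then take the pair with the largest count) can be mirrored in plain Python to check the logic on a small input. The sketch below uses the five usernames previewed earlier in this notebook; `users`, `counts`, and `top_user` are illustrative names, and `collections.Counter` stands in for `reduceByKey` only as a local analogy.

```python
from collections import Counter

# The five usernames shown in the preview output of this notebook.
users = ["elastic_user_7", "elastic_user_2", "elastic_user_8",
         "elastic_user_5", "elastic_user_8"]

# Analogue of filter + map to (user, 1) + reduceByKey: occurrences per user.
counts = Counter(user for user in users if user != "root")

# Analogue of takeOrdered(1, key=lambda x: -x[1]): ordering by the negated
# count puts the most frequent user first.
top_user = min(counts.items(), key=lambda pair: -pair[1])
print(top_user)
```

Negating the count in the key is what turns an ascending ordering into "largest first", which is the same trick the `takeOrdered` call relies on.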


# Which IPs access this machine most frequently?

* [Regex for IPs](https://stackoverflow.com/questions/10086572/ip-address-validation-in-python-using-regex)
* [pyspark.RDD.flatMap](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.flatMap.html)

In [24]:
import re
# Matches the dotted-quad shape only; it does not check that each octet is <= 255
valid_ip_adress_regex = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
pattern = re.compile(valid_ip_adress_regex)

In [26]:
acessos_por_ip = (
    logins_sucesso
    .flatMap(lambda line: pattern.findall(line))
    .map(lambda ip: (ip, 1))
    .reduceByKey(lambda a, b: a + b)
)

ips_mais_frequentes = acessos_por_ip.takeOrdered(5, key=lambda x: -x[1])
print(f"IPs that access the machine most frequently: {ips_mais_frequentes}")

IPs that access the machine most frequently: [('85.245.107.41', 139), ('24.151.103.17', 47), ('95.93.96.191', 3), ('127.0.0.1', 1)]
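
As a rough check of what the regex does and does not match, the plain-Python sketch below runs the same pattern on one log line copied from this notebook. The hyphenated hostname `ip-10-77-20-248` is not picked up, so each login line contributes only its source address. The helper `is_valid_ipv4` is hypothetical and uses the standard-library `ipaddress` module to show that the pattern checks shape, not octet range.

```python
import ipaddress
import re

# Same pattern as in the notebook cell.
pattern = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

line = ("Mar 29 11:35:20 ip-10-77-20-248 sshd[1361]: "
        "Accepted password for elastic_user_2 from 85.245.107.41 port 50690 ssh2")

# Only the dotted source address matches; the hostname uses hyphens.
found = pattern.findall(line)


def is_valid_ipv4(text):
    """Return True if text parses as an IPv4 address (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


# The pattern accepts any 1-3 digit groups, so an out-of-range address matches.
shape_only = pattern.findall("999.999.999.999")
print(found, shape_only, is_valid_ipv4("999.999.999.999"))
```

For sshd log lines this distinction is unlikely to matter, since the daemon writes real addresses; a stricter check would only be needed if the pattern were applied to untrusted text.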
