# Find Hacking Attempts from Network Data

The goal of this project is for you to create a program that examines log data of net flow traffic, and produces a score, from 1 to 10, describing the degree to which the logs suggest a brute force attack is taking place on a server. [Source](https://classroom.udacity.com/courses/ud919/lessons/3610088757/concepts/35982786900923#)

### Initial imports

In [42]:
import pandas as pd
import datetime
import time

### Dealing with the data

First, let's load and explore the data. As mentioned in the project description, the data has the following format:

source ip-address | destination ip-address | protocol | source port | destination port | # packets | bytes | flags | site | time

In [2]:
# Load the input csv

input_file_path = "ds_1_with_fields.csv"

data = pd.read_csv(input_file_path)

In [6]:
# Get summary information about the data (columns, dtypes, non-null counts)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 337940 entries, 0 to 337939
Data columns (total 10 columns):
source_ip           337940 non-null object
destination_ip      337940 non-null object
start_time          337940 non-null int64
source_port         337940 non-null int64
destination_port    337940 non-null int64
flags               337940 non-null object
site                337940 non-null object
asn                 337940 non-null object
num_packets         337940 non-null int64
num_bytes           337940 non-null int64
dtypes: int64(5), object(5)
memory usage: 25.8+ MB


Every column reports 337940 non-null entries, which matches the 337940 rows, so there are no missing values.
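
The absence of missing values can also be checked directly instead of being read off the `info()` summary. A minimal sketch, using a small hypothetical frame named `sample` in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`; the real frame has 337940 rows.
sample = pd.DataFrame({
    "source_ip": ["135.b1d10.d1c38.20", "135.0777d.04511.237"],
    "num_packets": [6, 11],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isnull().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```

On the real data, the same two lines would be applied to `data` instead of `sample`.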

In [5]:
data.head()

Unnamed: 0,source_ip,destination_ip,start_time,source_port,destination_port,flags,site,asn,num_packets,num_bytes
0,135.b1d10.d1c38.20,135.0777d.04511.237,1415749946,22,45092,-AP---,45c48,c4ca4,6,504
1,135.0777d.04511.237,135.b1d10.13fe9.91,1415477729,45603,22,"-A----,-AP---",45c48,c4ca4,11,956
2,135.0777d.04511.237,135.b1d10.13fe9.71,1415702327,45596,22,"-A----,-AP---",45c48,c4ca4,11,956
3,135.b1d10.d1c38.119,135.0777d.04511.237,1415754478,22,50107,-AP---,45c48,c4ca4,6,504
4,135.b1d10.13fe9.56,135.0777d.04511.237,1415749597,22,45580,-AP---,45c48,c4ca4,6,504


In [110]:
# Convert a Unix timestamp to a readable date string (in the local time zone)
def convert_time(unixtime):
    
    return datetime.datetime.fromtimestamp(unixtime).strftime('%Y-%m-%d %H:%M:%S')

In [113]:
print(convert_time(max(data.start_time)))
print(convert_time(min(data.start_time)))

2014-11-12 15:44:10
2014-11-04 22:42:11
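
`datetime.datetime.fromtimestamp` returns local time when no time zone is passed, so the dates printed above depend on the machine that ran the notebook. As a hedged alternative, pandas can convert a whole column at once and pin it to UTC. The sketch below uses two `start_time` values copied from the `data.head()` output; the variable names are illustrative.

```python
import pandas as pd

# Two start_time values copied from the data.head() output.
start_times = pd.Series([1415749946, 1415477729])

# Interpret the integers as seconds since the Unix epoch, in UTC.
start_times_utc = pd.to_datetime(start_times, unit="s", utc=True)

print(start_times_utc)
```

Working in UTC keeps the time differences used later independent of the local clock settings.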


In [50]:
# Create a list of (row index, start time) pairs, sorted by start time

init_time = time.time()

times = list()

for ix, row in data.iterrows():
    times.append((ix, row["start_time"]))
    
    if ix%50000 == 0:
        print(ix, end=" .. ")

times.sort(key=lambda x: x[1])
print("\nFinished creating list in {:.2f}s".format(time.time()-init_time))

0 .. 50000 .. 100000 .. 150000 .. 200000 .. 250000 .. 300000 .. 
Finished creating list in 51.06s
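
`iterrows` builds a new Series for every row, which is likely why this loop takes almost a minute. Assuming only the index and `start_time` are needed, the same sorted list of pairs could be built without a Python-level loop. A sketch on a toy frame (the name `flows` is a stand-in for `data`):

```python
import pandas as pd

# Toy stand-in for `data`, with only the column that matters here.
flows = pd.DataFrame({"start_time": [10, 5, 10]})

# Sort the column by value; a stable sort keeps ties in index order,
# which matches list.sort() applied to pairs appended in index order.
sorted_times = flows["start_time"].sort_values(kind="stable")

# Series.items() yields (index, value) pairs.
times = list(sorted_times.items())

print(times)
```

The result has the same shape as the list built by the loop: one `(index, start_time)` pair per flow, ordered by time.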


A possible heuristic is to detect cases in which the same source ip targets the same destination ip, varying the port, within a short time window. Another thing to check is whether the number of packets and the number of bytes are the same across the detected flows.
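
One way this heuristic could be expressed is as a per-pair summary: for every (source ip, destination ip) pair, count the flows, the distinct ports, and the time span they cover. The sketch below runs on toy data. The thresholds (`min_flows`, `max_span_seconds`) and the choice to count distinct source ports are assumptions made for illustration, not values taken from the project.

```python
import pandas as pd

# Toy flows: "a" hits "x" three times within seconds, each from a new port.
flows = pd.DataFrame({
    "source_ip": ["a", "a", "a", "b"],
    "destination_ip": ["x", "x", "x", "x"],
    "source_port": [40001, 40002, 40003, 50000],
    "destination_port": [22, 22, 22, 80],
    "start_time": [100, 101, 103, 500],
})

# Summarise every (source, destination) pair.
pair_stats = (
    flows.groupby(["source_ip", "destination_ip"])
    .agg(
        n_flows=("start_time", "size"),
        n_source_ports=("source_port", "nunique"),
        first_seen=("start_time", "min"),
        last_seen=("start_time", "max"),
    )
    .reset_index()
)
pair_stats["span_seconds"] = pair_stats["last_seen"] - pair_stats["first_seen"]

# Hypothetical thresholds: many flows packed into a short window.
min_flows = 3
max_span_seconds = 60
suspicious = pair_stats[
    (pair_stats["n_flows"] >= min_flows)
    & (pair_stats["span_seconds"] <= max_span_seconds)
]

print(suspicious)
```

On the real data the thresholds would need tuning, since the busiest source ips there produce tens of thousands of flows.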

In [53]:
# Create a new dataframe with only the necessary columns

data_1 = data.drop(['flags', 'site', 'asn'], axis=1)

In [132]:
# Iterate over source ips from most to least frequent; stop below 500 flows
for ip, count in data_1.source_ip.value_counts().items():
    
    if count < 500: break

    else:

        temp_df = data_1.loc[data_1.source_ip == ip]
        nr_ip_access = len(temp_df)

        temp_df = temp_df.loc[temp_df.destination_port == 22]
        nr_port_access = len(temp_df)
        ratio = 100*float(nr_port_access)/nr_ip_access
        
        if temp_df.empty: pass
        
        else:

            print("Dealing with source ip: ", ip)
            print("Number of accesses to port 22: {} corresponding to {:.2f}%".format(nr_port_access,
                                                                               ratio))
            
            early_unix_time = min(temp_df.start_time)
            late_unix_time = max(temp_df.start_time)
            
            print("Earliest access attempt: ", convert_time(early_unix_time))
            print("Median access attempt: ", convert_time(temp_df.start_time.median()))
            print("Latest access attempt: ", convert_time(late_unix_time))
            print("Std approximately {:.2f} hours".format(temp_df.start_time.std()/(60*60)))


            if (late_unix_time - early_unix_time)/(60*60) >= 12:

                print("Fails time distance criterion")

            else:
                print(temp_df.num_packets.value_counts())
                print(temp_df.num_bytes.value_counts())

            print("\n")



Dealing with source ip:  135.0777d.04511.237
Number of accesses to port 22: 91646 corresponding to 100.00%
Earliest access attempt:  2014-11-04 22:42:11
Median access attempt:  2014-11-10 01:09:16
Latest access attempt:  2014-11-12 15:39:12
Std approximately 44.85 hours
Fails time distance criterion


Dealing with source ip:  135.0777d.04511.232
Number of accesses to port 22: 71672 corresponding to 100.00%
Earliest access attempt:  2014-11-12 03:47:59
Median access attempt:  2014-11-12 09:40:29
Latest access attempt:  2014-11-12 15:43:48
Std approximately 3.42 hours
7    45016
6    21196
5     4427
4      877
3      129
2       18
8        6
9        2
1        1
Name: num_packets, dtype: int64
385    45001
339    19064
287     2332
333     1469
293      932
279      903
325      546
227      288
235      259
241      233
281      125
273      118
340      106
233       72
181       42
189       39
175       37
288       12
399       11
183       11
221       11
129        9
229        8
3

The packet information is used to check for consistency. For source ip 135.0777d.04511.232 the values are concentrated: 45016 of the 71672 flows carry 7 packets, and 45001 carry 385 bytes. This is suspicious, since it suggests that many near-identical flows are being sent. The next step is to find (num_packets, num_bytes) tuples that repeat.
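
The search for repeated tuples could be sketched as a group-and-count over the two columns. The example below uses toy values loosely modelled on the output above; the frame `flows` and the derived `top_share` measure are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy flows for one source ip: most share the same (packets, bytes) pair.
flows = pd.DataFrame({
    "num_packets": [7, 7, 7, 6, 6, 5],
    "num_bytes": [385, 385, 385, 339, 339, 287],
})

# Count how often each (num_packets, num_bytes) tuple occurs.
tuple_counts = (
    flows.groupby(["num_packets", "num_bytes"])
    .size()
    .sort_values(ascending=False)
)

# Share of flows taken by the most common tuple; values near 1 mean
# the traffic is highly repetitive.
top_tuple = tuple_counts.index[0]
top_share = tuple_counts.iloc[0] / len(flows)

print(tuple_counts)
print("Most common tuple:", top_tuple, "share: {:.2f}".format(top_share))
```

A high share for a single tuple would support the suspicion that the same request is being repeated, which could feed into the final 1 to 10 score.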