I wrote this post in Typora on my PC, but the layout may break when I copy it into the Utopian post editor, so I have also uploaded it to GitHub. If the article layout looks garbled here, please read it at the URL below. Thanks.
As network security data keeps growing in scale, applying data analysis techniques to network security becomes more and more important. Using honeypot log data, let us briefly analyse with Python what people do through proxy IPs.
When hackers or technicians on the Internet want to attack something, they need to hide their IP, so they use a proxy IP to conceal their identity. A honeypot is an active-defense security technique: a system deliberately designed as a decoy to be attacked or intruded upon. A honeypot is set up to protect the real system, as well as to collect information about attackers.
-
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
-
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell (à la MATLAB or Mathematica), web application servers, and various graphical user interface toolkits.
-
import pandas as pd
from datetime import timedelta, datetime
import matplotlib.pyplot as plt
import numpy as np
-
We use the proxy IP usage log as the data to analyse. The proxy IP log is stored as CSV. To read data from the CSV file, just call pandas.read_csv: one line of code reads all the data into a two-dimensional table structure, a DataFrame variable.
analysis_data = pd.read_csv('./honeypot_data.csv')
Of course, pandas also provides an IO tool that allows large files to be read in chunks, which is really convenient. I tried reading a large file to test the performance of this IO tool: fully loading about 215 million records took only around 90 seconds, which is quite good.
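A minimal sketch of chunked reading with the `chunksize` parameter. The CSV content and column names here are invented for illustration (an in-memory string stands in for the real honeypot log file):

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the large honeypot log file
csv_text = (
    "timestamp,srcip,proxy_host\n"
    "2018-01-01,1.2.3.4,example.com\n"
    "2018-01-02,5.6.7.8,example.org\n"
    "2018-01-03,1.2.3.4,example.com\n"
)

# With chunksize, read_csv returns an iterator of DataFrames,
# so a file far larger than memory can be processed block by block
chunks = pd.read_csv(StringIO(csv_text), chunksize=2)
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)
```

For the real file you would pass the path instead of `StringIO`, and aggregate per chunk instead of keeping everything in memory.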
-
In general, before analysing the data, it is better to first get an overall picture of it: the total number of records, the variables, their distributions, duplicated data, missing data, abnormal data and so on.
-
A few simple functions:
# view the number of rows and columns of the data
analysis_data.shape
# view the first 5 lines of data
analysis_data.head()  # you can also pass the number of rows as a parameter
# view the last 5 lines of data
analysis_data.tail()  # you can also pass the number of rows as a parameter
-
This lists information such as the dates users used the proxy IP, the proxy header information, the domain names accessed through the proxy, the proxy methods, the source IPs, the honeypot nodes and so on.
analysis_data.select_dtypes(include=['object']).describe()
analysis_data.select_dtypes(include=['float64']).describe()
We find that fields such as scan_os_sub_fp, scan_scan_mode and so on have values that may be NaN.
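To see exactly how many missing values each field has, `isnull().sum()` counts NaN per column. The toy frame below imitates such sparsely-filled honeypot fields (column values are made up):

```python
import pandas as pd
import numpy as np

# Toy frame imitating sparsely-filled honeypot fields
df = pd.DataFrame({
    'scan_os_sub_fp': [np.nan, np.nan, 'Linux'],
    'scan_scan_mode': [np.nan, 'syn', np.nan],
    'srcip': ['1.2.3.4', '5.6.7.8', '9.9.9.9'],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Columns where this count is close to the total row count are good candidates for dropping.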
-
Because the source data usually contains some empty values, or even entirely empty columns, which affect the time and efficiency of the analysis, these invalid data need to be processed after previewing the data summary.
# Remove all rows containing any empty value
analysis_data.dropna()
# Remove all columns whose values are all empty
analysis_data.dropna(axis=1, how='all')
# Remove rows with empty values in the specified columns, like 'proxy_host' and 'srcip'
analysis_data.dropna(subset=['proxy_host', 'srcip'])
# Remove rows that have fewer than 10 non-null values
analysis_data.dropna(thresh=10)
-
After a preliminary look at some of the information in the data, we use pandas' data-slicing method, loc:
df.loc[start_row_index:end_row_index, ['timestamp', 'proxy_host', 'srcip']]
# Select the variables 'timestamp', 'proxy_host', 'srcip' (keep 'module' for filtering later)
analysis_data = analysis_data.loc[:, ['timestamp', 'proxy_host', 'srcip', 'module']]
daily_proxy_data = analysis_data[analysis_data.module == 'proxy']
daily_proxy_visited_count = daily_proxy_data.timestamp.value_counts().sort_index()
daily_proxy_visited_count.plot()
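The daily-visit count above can be sketched with toy data (the dates, modules and IPs below are fabricated): filter to the proxy module, count rows per timestamp, and sort by date so the series plots in chronological order.

```python
import pandas as pd

analysis_data = pd.DataFrame({
    'timestamp': ['2018-01-01', '2018-01-01', '2018-01-02',
                  '2018-01-02', '2018-01-02'],
    'module': ['proxy', 'scan', 'proxy', 'proxy', 'proxy'],
    'srcip': ['1.1.1.1', '2.2.2.2', '1.1.1.1', '3.3.3.3', '2.2.2.2'],
})

# Keep only proxy-module records, then count visits per day
daily_proxy_data = analysis_data[analysis_data.module == 'proxy']
daily_proxy_visited_count = daily_proxy_data.timestamp.value_counts().sort_index()
print(daily_proxy_visited_count)
```

Calling `.plot()` on the resulting Series (with matplotlib available) draws the daily trend line.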
Besides rows with invalid values, some redundant columns should also be discarded in this step, such as unneeded index or description columns. By dropping them and generating a new DataFrame, we effectively reduce the data volume and improve computational efficiency.
daily_proxy_data = analysis_data[analysis_data.module == 'proxy']
daily_proxy_visited_count = daily_proxy_data.groupby(['proxy_host']).srcip.nunique()
daily_proxy_visited_count.plot()
host_associate_ip = daily_proxy_data.loc[:, ['proxy_host', 'srcip']]
grouped_host_ip = host_associate_ip.groupby(['proxy_host']).srcip.nunique()
print(grouped_host_ip.sort_values(ascending=False).head(10))
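With invented hosts and IPs, counting distinct visitors per host looks like this; `nunique()` counts each source IP once per host no matter how many times it appears:

```python
import pandas as pd

# Fabricated proxy records: which source IP visited which host
proxy_data = pd.DataFrame({
    'proxy_host': ['baidu.com', 'baidu.com', 'baidu.com',
                   'qq.com', 'qq.com', 'google.com'],
    'srcip': ['1.1.1.1', '2.2.2.2', '1.1.1.1',
              '1.1.1.1', '3.3.3.3', '4.4.4.4'],
})

host_associate_ip = proxy_data.loc[:, ['proxy_host', 'srcip']]
# Distinct source IPs per host, largest first
top_hosts = (host_associate_ip.groupby(['proxy_host'])
             .srcip.nunique()
             .sort_values(ascending=False))
print(top_hosts.head(10))
```

Grouping by `srcip` and counting `proxy_host.nunique()` instead gives the opposite view: how many distinct hosts each source IP visited.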
Checking the log data, we find activity such as scraping second-hand car prices, job recruitment information and so on. Judging by the hottest hosts, people mainly use the proxy to access Baidu, QQ, Google, Bing and other sites that everyone knows.
host_associate_ip = daily_proxy_data.loc[:, ['proxy_host', 'srcip']]
grouped_host_ip = host_associate_ip.groupby(['srcip']).proxy_host.nunique()
print(grouped_host_ip.sort_values(ascending=False).head(10))
Well, the user whose IP is 123.*.*.155 has a lot of access records; looking at his log, he has been scraping hotel information in bulk.
date_ip = analysis_data.loc[:, ['timestamp', 'srcip']]
grouped_date_ip = date_ip.groupby(['timestamp', 'srcip'])
# Calculate how long each source IP used the proxy
all_srcip_duration_times = ....
# Find the longest one
duration_date_cnt = count_date(all_srcip_duration_times)
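The `count_date` helper is not shown in the post, so here is one hedged way to approximate each source IP's usage duration: take last-seen minus first-seen timestamp per IP. The timestamps and IPs below are fabricated:

```python
import pandas as pd

date_ip = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2018-01-01 10:00', '2018-01-01 18:00',   # one IP active for 8 hours
        '2018-01-02 09:00', '2018-01-02 09:05',   # another for 5 minutes
    ]),
    'srcip': ['80.x.x.38', '80.x.x.38', '1.2.3.4', '1.2.3.4'],
})

# Duration per source IP: last seen minus first seen
duration_per_ip = (date_ip.groupby('srcip').timestamp
                   .agg(lambda t: t.max() - t.min()))
# The source IP with the longest usage span
longest_ip = duration_per_ip.idxmax()
print(longest_ip)
```

This is only a sketch under the assumption that "duration" means the span between the first and last log entry; the author's actual `count_date` may count distinct active days instead.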
Now we have a rough idea of what those people do and who has used the proxy the longest. Pulling the log of the user whose IP is 80.*.*.38, we find that this young man had been fetching Sohu images for a long time.
node = analysis_data[analysis_data.module == 'scan']
node = node.loc[:, ['srcip', 'origin_details']]
grouped_node_count = node.groupby(['srcip']).count()
print(grouped_node_count.sort_values(['origin_details'], ascending=False).head(10))
From the above two tables, we can draw a conclusion: the user whose source IP is 182.*.*.205 scanned the honeypot nodes for such a long time that it is marked as a dangerous user.
-