## Camelot

Documentation: https://camelot-py.readthedocs.io/en/master/index.html

PyPI: https://pypi.org/project/camelot-py/

In [137]:
import camelot
tables = camelot.read_pdf('Mirai1.pdf', pages='all')
tables

<TableList n=8>

### Extract ID, Name and Identified Sentence

In [138]:
import re

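# Carry ID and name across rows; start as None so a leading row with empty cells cannot raise NameError
ID, name = None, None

for t in tables: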
    # Convert table to pandas dataframe
    df = t.df
    
    # Preprocess Dataframe
    df.columns = ['ID', 'Name', 'Identified Sentence']
    
    # If first row consists of labels
    if df['ID'].iloc[0] == 'ID':
        # Remove that row permanently with inplace=True
        df.drop([0], inplace=True)
        
    # Loop over each df's rows
    for index, row in df.iterrows():
        
        # If ID value is not empty, update ID variable to current row's ID value
        # Else continue with previous value for ID
        if row['ID'] != '':
            ID = row['ID']
        
        
        # Repeat for name value
        if row['Name'] != '':
            name = row['Name']
            
            # Preprocess Name, can use repr() to check
            name = name.replace('\n', '')
    
        
        # Preprocess Sentences
        sentences = row['Identified Sentence']  
        
        # Split the table cell of text into individual sentences
        # sentences_list is a list of the sentences in the table cell
        sentences_list = re.split(' \n  \n|\n\n | \n  \n  \n  \n', sentences)
        for s in sentences_list:
            # Replace any new lines separating parts of the sentence
            s = s.replace('\n', ' ')
            
            # Replace any double spaces which might result from previous step with a single space
            s = s.replace('  ', ' ')
            
            # Do a length check to skip empty strings and random punctuation
            if len(s) < 3:
                continue
                
            # Finally, we have the desired sentence to be inserted into database
            # Use repr() to avoid resolving escape characters: print(repr(s), '\n')
            # insert(ID, name, s)
            print("ID:", ID)
            print("NAME:", name)
            print("SENTENCE:", s)
            print()
            

ID: T1046
NAME: Network Service Scanning
SENTENCE: Much of the new mirai variant that scans port 7547 has been covered by various sources.

ID: T1065
NAME: Uncommonly Used Port
SENTENCE: The following table lists first seen time of “old” mirai and the “new” ones that hit our honeypot. 

ID: T1065
NAME: Uncommonly Used Port
SENTENCE: You can see the variant on port 7547 first shown up on 2016-11-26 21:27:23, and first observed for the variant on port 5555 was one day after on 2016-11-27 17:04:02(all GMT +8). 

ID: T1043
NAME: Commonly Used Port
SENTENCE: The following table lists first seen time of “old” mirai and the “new” ones that hit our honeypot. 

ID: T1043
NAME: Commonly Used Port
SENTENCE: You can see the variant on port 7547 first shown up on 2016-11-26 21:27:23, and first observed for the variant on port 5555 was one day after on 2016-11-27 17:04:02(all GMT +8). 

ID: T1065
NAME: Uncommonly Used Port
SENTENCE: [](/content/images/2016/11/03-bot-current-growth-rate-on-all- port.
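The extraction cell leaves `insert(ID, name, s)` commented out, so the database side is not shown. One possible shape for it is sketched here with the standard-library `sqlite3` module. The table name, column names and the in-memory database are assumptions made for illustration, not the schema this notebook actually targets.

```python
import sqlite3

# Hypothetical schema: one row per (technique ID, technique name, sentence).
conn = sqlite3.connect(":memory:")  # swap for a file path to persist the data
conn.execute(
    "CREATE TABLE IF NOT EXISTS sentences ("
    "technique_id TEXT NOT NULL, "
    "technique_name TEXT NOT NULL, "
    "sentence TEXT NOT NULL)"
)


def insert(technique_id, name, sentence):
    """Store one extracted sentence; values are bound as parameters, not string-formatted."""
    with conn:  # commits on success, rolls back if the statement raises
        conn.execute(
            "INSERT INTO sentences VALUES (?, ?, ?)",
            (technique_id, name, sentence.strip()),
        )


insert(
    "T1046",
    "Network Service Scanning",
    "Much of the new mirai variant that scans port 7547 has been covered by various sources.",
)
rows = conn.execute(
    "SELECT technique_id, technique_name, sentence FROM sentences"
).fetchall()
print(rows)
```

Binding values with `?` placeholders matters here because the extracted sentences contain quotes and other punctuation that would break a hand-built SQL string.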

### Sample table data

In [3]:
[t.df for t in tables]

[       0                           1  \
 0     ID                        Name   
 1  T1046  Network Service \nScanning   
 2  T1065      Uncommonly Used \nPort   
 3  T1043          Commonly Used Port   
 4  T1065      Uncommonly Used \nPort   
 
                                                    2  
 0                                Identified Sentence  
 1  Much of the new mirai variant that scans port ...  
 2  The following table lists first seen time of “...  
 3  The following table lists first seen time of “...  
 4  [](/content/images/2016/11/03-bot-current-grow...  ,
        0                                         1  \
 0  T1046                Network Service \nScanning   
 1  T1043                        Commonly Used Port   
 2  T1048  Exfiltration Over \nAlternative Protocol   
 
                                                    2  
 0  [](/content/images/2016/11/03-bot-current-grow...  
 1  [](/content/images/2016/11/03-bot-current-grow...  
 2  The total number of b

In [133]:
print(tables[5].df['Name'], '\n')
print(repr(tables[5].df['Name'][0]))
bool(tables[5].df['Name'][0])

0                          
1         Fallback Channels
2    Uncommonly Used \nPort
Name: Name, dtype: object 

''


False

In [43]:
# Notice how separate sentences have 2 or more new line characters, 
# whereas parts of a single sentence might be separated by only a single new line 
tables[6].df['Identified Sentence'][2]

'[](/content/images/2016/11/07-two-c2-server-and-two-report-\nserver-in-one-\nmarai-sample.jpg)  \n  \nBot Overlap  \n  \nIn other words, the operator of new variant on port 7547 and \nthe previous\nmirai operator are very likely the same group of people.  \n  \nThe following diagram shows the overlap of all the bots we \ncaptured in our\nhoneypot that have scanned port 23/2323/5555/7547.  \n  \nWe can see that:  \n  \n96.4% of the Mirai Bots scan port 23 or port 2323.'
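The rule noted in this cell (a blank line separates sentences, a single newline only wraps one) can be written as a single regular expression instead of a list of exact whitespace patterns. The following is a minimal sketch of that idea. `split_cell` and `min_len` are hypothetical names, and the sample string is a shortened copy of the cell shown above.

```python
import re


def split_cell(cell, min_len=3):
    """Split a table cell into sentences on blank (whitespace-only) lines."""
    # A newline followed by one or more whitespace-only lines marks a boundary.
    chunks = re.split(r"\n(?:[ \t]*\n)+", cell)
    sentences = []
    for chunk in chunks:
        # A single newline only wraps a sentence, so join the parts with spaces;
        # str.split() with no argument also collapses runs of whitespace.
        s = " ".join(chunk.split())
        # Skip empty strings and stray punctuation, as the extraction loop does.
        if len(s) >= min_len:
            sentences.append(s)
    return sentences


cell = (
    "Bot Overlap  \n  \nIn other words, the operator of new variant on port 7547 and \n"
    "the previous\nmirai operator are very likely the same group of people.  \n  \n"
    "We can see that:  \n  \n96.4% of the Mirai Bots scan port 23 or port 2323."
)
result = split_cell(cell)
for sentence in result:
    print(sentence)
```

Like the `replace('\n', ' ')` step in the extraction loop, this joins wrapped lines with a space, so a word hyphenated across a line break (such as `Mirai-\ninfected`) keeps a stray space after the hyphen.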

In [94]:
# Sample regex splitting
import re
a = 'Beautiful, is; better*than\nugly'
# Raw string so that \* reaches the regex engine as an escaped asterisk
re.split(r'; |, |\*|\n', a)

['Beautiful', 'is', 'better', 'than', 'ugly']

In [27]:
tables[1].df

       0                                         1                                                  2
0  T1046                Network Service \nScanning  [](/content/images/2016/11/03-bot-current-grow...
1  T1043                        Commonly Used Port  [](/content/images/2016/11/03-bot-current-grow...
2  T1048  Exfiltration Over \nAlternative Protocol  The total number of bots on current port 7547 ...


In [40]:
tables[1].df[2][2]

"The total number of bots on current port 7547 has already \nexceeded 30,000.  \n  \nBot growth rate on port 7547, per 10 minutes:  \n  \n  \n  \nThe figure shows that, bot's growth rate quickly reached a peak, \nand smoothly\nmaintained at a high level.  \n  \nOn the other hand, from the perspective of the backbone \nnetwork, the scan on\nport 7547 began to rise sharply in the evening of 2016-11-26.  \n  \n  \n  \nIn terms of the geographical distribution of the newly infected \nbot, Brazil is\nstill far ahead of the others, which is consistent with the \ngeographical\ndistribution of the existing mirai botnet.  \n  \n  \n  \nWe provide various statistics and data downloads of Mirai-\ninfected devices at\nhttp://data.netlab.360.com/mirai-scanner for researchers.  \n  \nFor those who have been using API to access our bot list, \nplease re-download\nthe data from 2016-11-26 and later to obtain updates for port \n7547 and 5555\ndata.  \n  \nThe New Variant Shares Some of the Infrastructu

In [41]:
tables[1].df[2][2].split('\n')

['The total number of bots on current port 7547 has already ',
 'exceeded 30,000.  ',
 '  ',
 'Bot growth rate on port 7547, per 10 minutes:  ',
 '  ',
 '  ',
 '  ',
 "The figure shows that, bot's growth rate quickly reached a peak, ",
 'and smoothly',
 'maintained at a high level.  ',
 '  ',
 'On the other hand, from the perspective of the backbone ',
 'network, the scan on',
 'port 7547 began to rise sharply in the evening of 2016-11-26.  ',
 '  ',
 '  ',
 '  ',
 'In terms of the geographical distribution of the newly infected ',
 'bot, Brazil is',
 'still far ahead of the others, which is consistent with the ',
 'geographical',
 'distribution of the existing mirai botnet.  ',
 '  ',
 '  ',
 '  ',
 'We provide various statistics and data downloads of Mirai-',
 'infected devices at',
 'http://data.netlab.360.com/mirai-scanner for researchers.  ',
 '  ',
 'For those who have been using API to access our bot list, ',
 'please re-download',
 'the data from 2016-11-26 and later to obtain 

In [42]:
# Clean chunk of text
for s in tables[1].df[2][2].split('\n'):
    if s != '  ':
        print(s)

The total number of bots on current port 7547 has already 
exceeded 30,000.  
Bot growth rate on port 7547, per 10 minutes:  
The figure shows that, bot's growth rate quickly reached a peak, 
and smoothly
maintained at a high level.  
On the other hand, from the perspective of the backbone 
network, the scan on
port 7547 began to rise sharply in the evening of 2016-11-26.  
In terms of the geographical distribution of the newly infected 
bot, Brazil is
still far ahead of the others, which is consistent with the 
geographical
distribution of the existing mirai botnet.  
We provide various statistics and data downloads of Mirai-
infected devices at
http://data.netlab.360.com/mirai-scanner for researchers.  
For those who have been using API to access our bot list, 
please re-download
the data from 2016-11-26 and later to obtain updates for port 
7547 and 5555
data.  
The New Variant Shares Some of the Infrastructure of the 
Existing Mirai Botnet


### Pseudocode

In [134]:
# # How to iterate over df rows? df.iterrows() yields (index, row) pairs
# for each row in df:
#     # If ID is an empty string, keep the previous ID; otherwise update it
#     if row['ID'] == '':
#         id = id (previous id)
#     else:
#         id = row['ID']
    
#     # If name is an empty string, keep the previous name
#     if row['Name'] == '':
#         name = name (previous name)
#     else:
#         name = row['Name']
        
#     # Clean sentences
#     sents = row['Identified Sentence']
#     prep_sents = f(sents)
    
#     # Insert data
#     for s in prep_sents:
#         insert(id, name, s)
        
        

# for i in range(len(ids)):
#     ids[i]
#     names[i]
#     sents[i]


# # Check that a variable assigned in one iteration persists in later ones
# for i in range(10):
#     if i == 0:
#         name = 'wassup'
#         continue
#     else:
#         print(name)
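The carry-forward step from the pseudocode can be tried without camelot or pandas by running it over plain tuples. This is only a sketch: `fill_down` is a hypothetical helper, and the rows are made-up stand-ins shaped like the extracted tables, with placeholder sentence text.

```python
def fill_down(rows):
    """Yield (id, name, text), replacing empty id/name cells with the last seen value."""
    current_id, current_name = None, None
    for raw_id, raw_name, text in rows:
        # An empty string is falsy, so `or` keeps the previous value.
        current_id = raw_id or current_id
        # Names can wrap inside a cell; drop the embedded newline before comparing.
        current_name = raw_name.replace("\n", "") or current_name
        if current_id is None:
            # Nothing to inherit yet: skip rather than emit a record without an ID.
            continue
        yield current_id, current_name, text


# Hypothetical rows shaped like the extracted tables; sentence texts are placeholders.
rows = [
    ("T1046", "Network Service \nScanning", "first"),
    ("", "", "second"),
    ("T1065", "Uncommonly Used \nPort", "third"),
]
records = list(fill_down(rows))
for record in records:
    print(record)
```

Keeping the state in a generator rather than in notebook-level variables means a re-run cannot pick up a stale `ID` or `name` from an earlier cell. On a DataFrame, replacing the empty strings with a missing-value marker and calling `ffill()` is another option.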