# Part 1
### Define and explain the graphs being used to model the Android applications (from the HinDroid paper). Explain any computationally relevant portions of the process of writing (text processing) code to extract the nodes and edges of the graphs.
The graphs used to model Android applications are:
* app-to-api: maps each app to all the different APIs that it uses
* api-to-api (co-occurrence): links two APIs if they coexist in the same code block
* api-to-api (same package): links two APIs if they are in the same package. To save space, this is stored as an API<->package mapping. If M is the API-by-package incidence matrix, the API<->API matrix is simply MM<sup>T</sup>
* api-to-api (same invoke method): links two APIs if they are invoked in the same way. To save space, this is stored as an API<->invoke method mapping. If M is the API-by-invoke-method incidence matrix, the API<->API matrix is again MM<sup>T</sup>
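The incidence-matrix trick used for the last two graphs can be sketched on toy data. This is only an illustration under stated assumptions: the matrix name `M`, the API names, and the packages below are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical toy data: 4 APIs spread over 2 packages.
apis = ['Ljava/lang/Object;,<init>', 'Ljava/lang/Object;,toString',
        'Ljava/util/List;,add', 'Ljava/util/List;,size']
packages = ['Ljava/lang/Object;', 'Ljava/util/List;']

# M[i, j] = 1 if API i belongs to package j (API x package incidence matrix).
M = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=np.int64)

# (M @ M.T)[i, k] counts the packages shared by API i and API k,
# so a nonzero entry means the two APIs are in the same package.
same_package = M @ M.T
print(same_package)
```

With real data the incidence matrix would be kept sparse (for example a `scipy.sparse` matrix), where the `@` operator works the same way. Storing the incidence matrix instead of the full API<->API matrix is what saves space.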

To extract the nodes and edges from the smali text, we need to find the method boundaries, remember which APIs have been called inside the current method, and have some way of associating each API call with its node in the graph. Then we can simply take the nodes collected for a method and create edges between them.
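The parsing step described above might look roughly like the following. It is a minimal sketch, assuming smali `invoke-*` lines of the form `invoke-kind {registers}, Lclass;->method(args)ret`; the helper name `extract_api_calls` and the sample lines are hypothetical.

```python
def extract_api_calls(smali_lines):
    """Yield (enclosing_method, invoke_type, api_class, api_name) per invoke line."""
    current_method = None
    for raw in smali_lines:
        line = raw.strip()
        if line.startswith('.method'):
            # Remember which method (code block) we are currently inside.
            current_method = line[len('.method'):].strip()
        elif line == '.end method':
            current_method = None
        elif line.startswith('invoke-') and '}, ' in line and '->' in line:
            opcode = line.split(' ', 1)[0]
            # "invoke-virtual/range" -> "virtual"
            invoke_type = opcode[len('invoke-'):].split('/')[0]
            target = line.split('}, ', 1)[1]
            api_class, rest = target.split('->', 1)
            api_name = rest.split('(', 1)[0]
            yield current_method, invoke_type, api_class, api_name


# Hypothetical smali snippet used only to exercise the parser.
sample = [
    '.method public run()V',
    '    invoke-direct {p0}, Ljava/lang/Object;-><init>()V',
    '    invoke-virtual/range {v0 .. v2}, Ljava/lang/StringBuilder;->append(Ljava/lang/String;)Ljava/lang/StringBuilder;',
    '.end method',
]
calls = list(extract_api_calls(sample))
for call in calls:
    print(call)
```

Each tuple carries everything needed for the four graphs: the enclosing method gives the code block, the invoke type and class give the two incidence mappings, and the class plus method name identifies the API node.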

In [None]:
from data_pipeline import download_and_process_apks

download_and_process_apks(50)

ignoring because size is 71.600000
ignoring because size is 626.600000
ignoring 2048 because exists
ignoring 2048 because exists
4636 downloaded
ignoring because size is 84.900000
five-craft-nights downloaded
modern-sniper downloaded
ignoring guns because exists
dual downloaded
ignoring flappy-crush because exists
batman-the-flash-hero-run downloaded
gun-shoot-war downloaded
ignoring because size is 69.600000
sonic-jump-fever downloaded
ignoring because size is 99.500000
ignoring because size is 89.100000
ignoring because size is 75.700000
prop-hunt-portable downloaded
deathrun-portable downloaded
ignoring because size is 61.700000
ignoring because size is 53.600000
mountain-sniper-shooting-3d-fps downloaded
ignoring ninja-revenge because exists
zombie-frontier downloaded
ignoring birds-vs-zombies-2 because exists
zombie-shooter-3d downloaded
city-battle-war downloaded
ignoring assassin-ape-open-world-game because exists
ignoring because size is 80.100000
101-skateboard-racing-3d downl

### EDA on number of classes per app

In [None]:
import os

total_classes_regular = 0
total_apps_regular = 0
print('Regular Apps classes per app:')
for directory in next(os.walk('data'))[1]:
    total_apps_regular += 1
    
    classes = len(next(os.walk('data/' + directory))[2])
    total_classes_regular += classes
    print(directory + ": " + str(classes))
print('Average classes per app: %i' % (total_classes_regular / total_apps_regular))
        
print('\nMalware:')
for directory in next(os.walk('/datasets/dsc180a-wi20-public/Malware/amd_data_smali'))[1]:
    total_classes_variety = 0
    total_apps_variety = 0
    for variety in next(os.walk('/datasets/dsc180a-wi20-public/Malware/amd_data_smali/' + directory))[1]:
        for sample in next(os.walk('/datasets/dsc180a-wi20-public/Malware/amd_data_smali/%s/%s' % (directory, variety)))[1]:
            total_apps_variety += 1

            classes = sum([len(files) for r, d, files in os.walk('/datasets/dsc180a-wi20-public/Malware/amd_data_smali/%s/%s/%s/' % (directory, variety, sample))])
            total_classes_variety += classes
            print('%s/%s/%s: ' % (directory, variety, sample) + str(classes))
    print('Average classes for %s: %i' % (directory, total_classes_variety / total_apps_variety))

Looking at the class counts, it seems that a lot of malware is simply repackaged and rebranded as a different app while the internals stay the same. For example, every Stealer sample has either 120 or 122 classes. The same goes for many other families, such as BankBot, whose samples mostly have 48, 348, 938, or 1462 classes. Because some malware has as few as 19 classes while other samples have over 1000, the class count on its own is not really indicative of malware when compared to benign apps.
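One way to make the repackaging pattern easier to see is to tally how often each class count occurs within a family. The sketch below assumes the per-sample counts have been collected as `(family, class_count)` pairs; the helper name `class_count_histogram` and the sample values are hypothetical and are not the real measurements.

```python
from collections import Counter, defaultdict


def class_count_histogram(samples):
    """Map each malware family to a Counter of {class_count: number_of_samples}."""
    histogram = defaultdict(Counter)
    for family, class_count in samples:
        histogram[family][class_count] += 1
    return histogram


# Hypothetical (family, class_count) pairs, for illustration only.
samples = [('Stealer', 120), ('Stealer', 122), ('Stealer', 120),
           ('BankBot', 48), ('BankBot', 348), ('BankBot', 48)]
histogram = class_count_histogram(samples)
for family, counts in histogram.items():
    print(family, counts.most_common())
```

A family whose histogram has only a handful of distinct class counts is a candidate for repackaged variants of the same code base.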

In [None]:
os.listdir('malware/GingerMaster/variety3/2a96b4721c638ec5d67b9b318bb0b3e0')

In [None]:
import threading

threads = []
for directory in next(os.walk('testing/benign'))[1]:
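    # Parenthesize the path so that % formats the full argument instead of relying on operator precedence
    thread = threading.Thread(target=os.system, args=('./reorganize-testing.sh %s' % ('testing/benign/' + directory),))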
    threads.append(thread)
    thread.start()
# for directory in next(os.walk('testing/malware'))[1]:
#     thread = threading.Thread(target=os.system, args=('./reorganize-testing.sh %s' % 'testing/malware/' + directory,))
#     threads.append(thread)
#     thread.start()
for thread in threads:
    thread.join()
print('done')

In [30]:
import os
import networkx as nx
from tqdm import tqdm_notebook
from collections import defaultdict
import pandas as pd

config = {}
# config['path_benign'] = '/datasets/dsc180a-wi20-public/Malware/testing/benign'
# config['path_benign'] = 'test'
config['path_benign'] = 'data'
config['path_malware'] = '/datasets/dsc180a-wi20-public/Malware/testing/malware'

api_list = []
app_list = []
package_list = []
seen_api = set()
# api_counts = defaultdict(int)

# pandas method
# data = pd.DataFrame()

app_to_api = nx.Graph()
api_cooccur = nx.Graph()
api_same_invoke = nx.Graph()
api_same_package = nx.Graph()

for directory in tqdm_notebook(next(os.walk(config['path_benign']))[1]):
    app_list.append(directory)
    app_to_api.add_node(directory)
    
    # pandas method
#     app_data = []
    
    for subdir, dirs, files in os.walk(config['path_benign'] + '/' + directory):
        for file in files:
            filepath = subdir + os.sep + file

            with open(filepath, 'r') as fp:
                in_method = ''
                api_calls = set()
                
                for line in fp:
                    stripped = line.strip()
                    if stripped.startswith('.method'):
                        in_method = stripped[8:]
                    if stripped == '.end method':
                        in_method = ''
                        api_calls.clear()
                    if stripped[:6] == 'invoke':
                        invoke_method = stripped.split(' {')[0][7:].split('/')[0]
                        
                        splitted = line.split('}, ')
                        fns = splitted[1].split('->')
                        api_package = fns[0]
                        method = fns[1].split('(')[0]
                        
                        current_method_call = api_package + ',' + method
    
                        api_calls.add(current_method_call)
                        
                        # pandas method
#                         app_data.append([directory, invoke_method, '%s,%s,%s' % (directory, file, in_method), api_package, current_method_call])
    
                        if current_method_call not in seen_api:
                            api_list.append(current_method_call)
                            seen_api.add(current_method_call)
                    
                        # app to api generation
                        if current_method_call not in app_to_api:
                            app_to_api.add_node(current_method_call)
                        app_to_api.add_edge(directory, current_method_call)
                        
                        # api to api co-occurrence generation
                        if in_method != '':
                            if current_method_call not in api_cooccur:
                                api_cooccur.add_node(current_method_call)
                            for api_call in api_calls:
                                if not api_cooccur.has_edge(current_method_call, api_call):
                                    api_cooccur.add_edge(current_method_call, api_call)
                                
                        # api to api invoke generation
                        # can go from invoke method <-> method call to method call <-> method call by doing A*(A^T)
                        if invoke_method not in api_same_invoke:
                            api_same_invoke.add_node(invoke_method)
                        if current_method_call not in api_same_invoke:
                            api_same_invoke.add_node(current_method_call)
                        if not api_same_invoke.has_edge(invoke_method, current_method_call):
                            api_same_invoke.add_edge(invoke_method, current_method_call)
                                
                        # api to api package generation
                        # can go from package <-> method call to method call <-> method call by doing A*(A^T)
                        if api_package not in api_same_package:
                            package_list.append(api_package)
                            api_same_package.add_node(api_package)
                        if current_method_call not in api_same_package:
                            api_same_package.add_node(current_method_call)
                        if not api_same_package.has_edge(api_package, current_method_call):
                            api_same_package.add_edge(api_package, current_method_call)
    # pandas method
#     data = data.append(app_data)
# data.columns = ['app', 'invoke', 'method', 'package', 'api']
# nx.draw(app_to_api)

In [31]:
# app-api
matrix_A = nx.adjacency_matrix(app_to_api, api_list + app_list)[-len(app_list):, :-len(app_list)]
matrix_A

<3x17656 sparse matrix of type '<class 'numpy.int64'>'
	with 18857 stored elements in Compressed Sparse Row format>

In [32]:
# api-api co-occur
matrix_B = nx.adjacency_matrix(api_cooccur, api_list)
matrix_B

<17656x17656 sparse matrix of type '<class 'numpy.int64'>'
	with 479378 stored elements in Compressed Sparse Row format>

In [33]:
# api-to-api same invoke
from scipy.sparse import csr_matrix
import scipy.sparse
import numpy as np
invoke_types = ['direct', 'static', 'virtual', 'super', 'interface']
matrix_C = nx.adjacency_matrix(api_same_invoke, nodelist=api_list + invoke_types)[-len(invoke_types):, :-len(invoke_types)]
# scipy.sparse.save_npz('matrix_C.npz', matrix_C)
# matrix_C_disk = np.memmap('matrix_C.npz', dtype='float32', mode='r')
# matrix_C_disk.shape
# matrix_C_disk_final = np.memmap('matrix_C_final.npz', dtype='float32', mode='w+', shape=(matrix_C_disk.shape[1], matrix_C_disk.shape[1]))
# matrix_C_disk_final[:] = matrix_C_disk.T.dot(matrix_C_disk)
matrix_C = matrix_C.transpose() @ matrix_C
# matrix_C = csr_matrix(np.matmul(matrix_C.T, matrix_C))
matrix_C

<17656x17656 sparse matrix of type '<class 'numpy.int64'>'
	with 103752202 stored elements in Compressed Sparse Column format>

In [34]:
# api-to-api same package
matrix_D = nx.adjacency_matrix(api_same_package, nodelist=api_list + package_list)[-len(package_list):, :-len(package_list)]
matrix_D = matrix_D.transpose() @ matrix_D
matrix_D

<17656x17656 sparse matrix of type '<class 'numpy.int64'>'
	with 239330 stored elements in Compressed Sparse Column format>

### Pandas implementation for deriving matrices

In [None]:
# pandas app-api
from scipy.sparse import csr_matrix
matrix_A = csr_matrix(pd.pivot_table(data, index='app', columns='api', aggfunc='count').fillna(0).clip(0, 1).values)
matrix_A

In [12]:
# pandas api-api same package
from scipy.sparse import csr_matrix
import numpy as np
part_package = csr_matrix(pd.pivot_table(data, index='api', columns='package', aggfunc='count').fillna(0).clip(0, 1).values)
matrix_package = part_package @ part_package.transpose()
matrix_package

<17656x17656 sparse matrix of type '<class 'numpy.float64'>'
	with 239330 stored elements in Compressed Sparse Row format>

In [11]:
len(data['api'].unique())

17656

In [None]:
# Print the k highest-degree nodes by bucketing nodes by degree: O(n + max_degree) instead of O(n log k) with a heap

freq_map = defaultdict(list)
max_degree = 0
for node in app_to_api.degree:
    if max_degree < node[1]:
        max_degree = node[1]
    freq_map[node[1]].append(node[0])

highest_k = 100
should_break = False
for i in range(max_degree, -1, -1):
    if i in freq_map:
        for api in freq_map[i]:
            print('%s: %i' % (api, i))
            highest_k -= 1
            if highest_k == 0:
                should_break = True
                break
    if should_break:
        break

The app with the most API calls in my sample set is a cryptocurrency wallet, with a total of 63992 unique API calls. The most widely used APIs are core Java utilities, which are used by all 50 apps. These include the initialization of Object (from super() calls due to inheritance), StringBuilders, Java Collection objects, and IOStreams.
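The degree ranking behind these observations can be wrapped in a small helper. This is a sketch rather than the notebook's exact code: the function name `top_k_by_degree` and the toy degrees are hypothetical, and it sorts the distinct degree values instead of scanning every integer down from the maximum.

```python
from collections import defaultdict
import heapq


def top_k_by_degree(degrees, k):
    """Return up to k (node, degree) pairs with the highest degree, using buckets."""
    buckets = defaultdict(list)
    for node, degree in degrees:
        buckets[degree].append(node)
    result = []
    # Walk the buckets from the highest degree down until k nodes are collected.
    for degree in sorted(buckets, reverse=True):
        for node in buckets[degree]:
            result.append((node, degree))
            if len(result) == k:
                return result
    return result


# Hypothetical (node, degree) pairs, shaped like the items of a networkx degree view.
degrees = [('app_a', 3), ('api_x', 2), ('api_y', 1), ('api_z', 2)]
top = top_k_by_degree(degrees, 2)
print(top)

# The heap-based alternative from the standard library, for comparison.
heap_top = heapq.nlargest(2, degrees, key=lambda pair: pair[1])
print(heap_top)
```

Because the degree view of `app_to_api` mixes app nodes and API nodes, the pairs can be filtered by node type before ranking if only one kind is of interest.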