# Virus, Malware and Executables: Feature Functions to Implement

# To Do

Descriptive Stats | Inferential Stats| Predictive Stats
--|--|--
Description or summary of features of data we have at hand. | Make generalization based on sample data we have at hand. | Make future projections based on data we have at hand.


* Cluster all `IP` by `classes` (A,B,C) and check `Descriptive Stats` of each class using: <br> `(Boxplot, Q-Q plot, Histogram, ANOVA)`. Boxplot to show median, quantiles, etc. *Var: IPs, IP Classes, Continents, Countries, Cities, malware, domains, etc* 


* Can we approximate the IPs distribution as a `normal distribution`? How about each `IP Class`? Are they approximately normal? Use `Q-Q plot` to show relationship for each class (law of large numbers and central limit theorem).


* Does the `IP Class` that an IP belongs to affect its vulnerability to attack? Use `F-Ratio test ANOVA`. If true, which class is potentially more vulnerable?


* Measure time effect (by days or weeks) `dayselapsed` on the `frequency` of attack on the DNS. `Cluster` by `IPclass`. Use scatterplot for days, boxplot for weeks.


* Which `domain | IP` has the `highest & Lowest` vulnerability, i.e. which is the most targeted DNS? `Boxplot` and `Bubbleplot`


* Which virus/malware `(MD5Hash)` has the `highest & Lowest` penetration frequency? `Boxplot` and `Bubbleplot`


* Use `IP LookUp API` to determine the geographic location of each IP. Map the geographic location onto a world map. `Country, long/lat, Continent, etc`


* Plot charts by `Country, Continent, Class, City, State, Type, ISP` etc using `Boxplot` and `Bubbleplot`


* Scrape offline and online lists of blacklisted IPs and create histograms of the IPs in our dataset identified as blacklisted.


* Query the `IP` of incoming users (new data), use it as input to the `IP LookUp API` and obtain feature set data about the user. Based on the `Class`, `predict` which virus/malware `MD5Hash` the user IP is vulnerable to.
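
The `F-Ratio test ANOVA` item in the list above can be sketched in plain Python. This is only an illustration of the statistic itself: the function name `f_ratio` and the attack counts are made up for the example, and a real analysis would also need the p-value from an F distribution (for instance from a statistics library).

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F statistic for a dict of {label: [values]}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups (IP classes)
    n = len(all_values)      # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                     for vals in groups.values())
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(vals)) ** 2
                    for vals in groups.values() for v in vals)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical attack counts per IP, grouped by class (made-up numbers)
attacks_by_class = {
    "Class A": [1, 2, 3],
    "Class B": [2, 3, 4],
    "Class C": [5, 6, 7],
}
f_stat = f_ratio(attacks_by_class)
print(round(f_stat, 2))
```

A large F value suggests that the class means differ by more than the spread within each class would explain; which class is "more vulnerable" then has to be read from the group means or a post-hoc test.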

# `IP LookUp API` 

Given an `IP Address`, the `IP LookUp` API returns `CountryName`, `CountryCode`, `City`, `State`, `CityLocation (Long/Lat)`, `CountryLocation (Long/Lat)`, `Zipcode` and `IP StatusMessage` in `JSON` format. The project can also aid in `Fraud detection`.

* https://github.com/fiorix/freegeoip
* http://freegeoip.net/

Request: http://freegeoip.net/json/70.173.245.105

Return Values
`{
"ip": "70.173.245.105",
"country_code": "US",
"country_name": "United States",
"region_code": "NV",
"region_name": "Nevada",
"city": "Las Vegas",
"zip_code": "89108",
"time_zone": "America/Los_Angeles",
"latitude": 36.2143,
"longitude": -115.2131,
"metro_code": 839
}`
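
As a small sketch of how such a response could be consumed, the snippet below parses the sample body shown above with the standard `json` module and keeps a few fields. The helper name `extract_features` and the choice of fields are assumptions made for illustration, not part of the API.

```python
import json

# Sample response body copied from the request above
raw = """{
"ip": "70.173.245.105",
"country_code": "US",
"country_name": "United States",
"region_code": "NV",
"region_name": "Nevada",
"city": "Las Vegas",
"zip_code": "89108",
"time_zone": "America/Los_Angeles",
"latitude": 36.2143,
"longitude": -115.2131,
"metro_code": 839
}"""


def extract_features(payload):
    """Pick a few fields from one API response (hypothetical helper)."""
    record = json.loads(payload)
    return {
        "ip": record["ip"],
        # Empty strings are treated as missing values
        "country": record.get("country_name") or None,
        "city": record.get("city") or None,
        "lat_long": (record.get("latitude"), record.get("longitude")),
    }


features = extract_features(raw)
print(features["city"], features["lat_long"])
```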


# IP Geolocation Data Preparation & Extraction

In [1]:
import os, time, threading, requests
import pandas as pd


# Locate API key for the geoips application - www.geoips.com
keys_dir = os.path.dirname(os.path.dirname(os.getcwd()))

# Read the enterprise key and strip the trailing newline.
# Swap in 'geoips_basic_key.txt' to use the basic key instead.
with open(os.path.join(keys_dir, 'geoips_enterp_key.txt'), 'r') as f_entr:
    geoips_key = f_entr.read().strip()



# Prepare `input` data file

## Break the `400,000` rows of unique IPs into `csv` files of size `9,000` rows

`FREEGEOIP` allows up to 10,000 queries per hour by default. Once this limit is reached, all requests return HTTP `403` (Forbidden) until the one-hour quota is cleared. To avoid `403` and other server errors, we could set up a `cronjob` that queries the `API` service every hour at a lower rate of `9,000 queries/hr`. First, we need to divide the `400,000` rows of unique IPs into files of `9,000` rows each. For traceability, we name each file after the day and hour at which it will be processed (`file_day_hr`), e.g. `free_gip_19_01`.
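
The arithmetic behind this split can be sketched as follows. The generator `split_into_batches` is a hypothetical helper written for this example, and a plain list of integers stands in for the real rows of IPs.

```python
import math


def split_into_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


total_rows = 400_000
batch_size = 9_000

# Round up: a final partial batch still needs its own file
n_files = math.ceil(total_rows / batch_size)

batches = list(split_into_batches(list(range(total_rows)), batch_size))
print(n_files, len(batches), len(batches[-1]))
```

Rounding up matters here: the last file holds fewer rows than the others but still needs its own hourly slot.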

# Function to determine the `IP Class`

In [1]:
def assign_ipclass(ip):
    """ Assign class A,B,C,D,E or Special to an IP address. """
    
    # Split IP and take first and second octets.
    octets = ip.split('.')
    first_octet = int(octets[0])
    scnd_octet = int(octets[1])
    
    # Determine class of IP based on 1st and 2nd octet
    if (0 <= first_octet <= 126):
        # Private range 10.0.0.0/8
        if (first_octet == 10):
            return 'Private Class A'
        else:
            return 'Class A'
    
    elif (first_octet == 127):
        return 'Loopback IP' 
    
    elif (128 <= first_octet <= 191):
        # Private range 172.16.0.0/12, i.e. 172.16.x.x through 172.31.x.x
        if (first_octet == 172) and (16 <= scnd_octet <= 31):
            return 'Private Class B'
        else:
            return 'Class B'
        
    elif (192 <= first_octet <= 223):
        # Private range 192.168.0.0/16
        if (first_octet == 192) and (scnd_octet == 168):
            return 'Private Class C'
        else:
            return 'Class C'
    
    elif (224 <= first_octet <= 239):
        return 'Class D'
    
    elif (240 <= first_octet <= 255):
        return 'Class E'
    else:
        return 'Unknown IP'
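
As a cross-check on the private and loopback labels, Python's standard `ipaddress` module exposes similar flags. The sketch below is not the notebook's classifier: `special_range` is a hypothetical helper, it knows nothing about classes A to E, and `is_private` also covers some other non-routable ranges beyond the three private blocks above, which is why the more specific flags are tested first.

```python
import ipaddress


def special_range(ip):
    """Label an IPv4 address using the flags of the stdlib `ipaddress` module."""
    addr = ipaddress.ip_address(ip)
    # Order matters: loopback and reserved addresses also count as "private"
    if addr.is_loopback:
        return "loopback"
    if addr.is_multicast:
        return "multicast"
    if addr.is_reserved:
        return "reserved"
    if addr.is_private:
        return "private"
    return "public"


for ip in ["10.1.2.3", "172.16.5.4", "172.32.0.1", "192.168.1.1", "127.0.0.1", "8.8.8.8"]:
    print(ip, special_range(ip))
```

Comparing these labels with the output of `assign_ipclass` on a sample of addresses is a cheap way to spot range mistakes such as an off-by-one on the second octet.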



# Update `IP Address`  with the `Classes` and display

In [None]:

# Assign classes to IPs
uniqueIPs['IP_Class'] = uniqueIPs.IP_Address.apply(assign_ipclass)

# Check the Classes assigned
print('\nAssigned IPs : ',uniqueIPs.IP_Class.unique())
uniqueIPs.head(3)


### Initialize

In [2]:
api_hr_limit = 9500
uniqueIPs = pd.read_csv("uniqueIPs.csv", index_col=0)

# Number of unique IPs already processed with the `geoips` API
uniques_done = 354000
# uniqueIPs[37500:uniques_done].to_csv("for_geoips.csv")

# The remaining unique IPs will be processed by `free_geoip`
uniqueIPs = uniqueIPs[uniques_done:]

# Expected number of API calls (hourly batches) for the list of IPs we want to extract geodata for.
# Ceiling division, so that a final partial batch is still counted.
total_api_call = -(-uniqueIPs.index.size // api_hr_limit)
print("\nAPI calls required : %d " %total_api_call)


API calls required : 5 


In [23]:
# Specify date and time that cron job will begin

total_api_call = 7
def make_dates():
    """ Build the schedule of API runs: one row per run, giving the
        day of month and the hour at which the run starts.
    """
    # `pd.datetime` has been removed from recent pandas releases; use `pd.Timestamp` instead
    date_rng = pd.Series(pd.date_range(start=pd.Timestamp.today(), periods=total_api_call, freq='3H'))
    rng_dt = pd.DataFrame()
    rng_dt['days'] = date_rng.dt.day
    rng_dt['hours'] = date_rng.dt.hour
    return rng_dt




# Call the function
rg_dates = make_dates()
    
rg_dates.head()


Unnamed: 0,days,hours
0,5,23
1,6,2
2,6,5
3,6,8
4,6,11


### Process the rest on `freegeoip.net`

In [6]:
# Set filenames
inp_filenames = ["free_gip_{0}_{1}.csv".format(tm[0],tm[1]) for tm in rg_dates.values]
# print(inp_filenames)
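
The naming scheme can also be reproduced without pandas, which makes it easy to check by hand. In this sketch the helper `schedule_filenames` is hypothetical, and the start date is made up (only the day `5` and hour `23` mirror the table above).

```python
from datetime import datetime, timedelta


def schedule_filenames(start, n_files, step_hours):
    """Return one `free_gip_<day>_<hour>.csv` name per scheduled run."""
    names = []
    for i in range(n_files):
        slot = start + timedelta(hours=step_hours * i)
        names.append("free_gip_{0}_{1}.csv".format(slot.day, slot.hour))
    return names


# Hypothetical start: day 5 at 23:00, one run every 3 hours
start = datetime(2016, 1, 5, 23)
names = schedule_filenames(start, n_files=3, step_hours=3)
print(names)
```

Because the names carry only the day of month and the hour, a schedule running longer than a month would reuse names; adding the month to the pattern would avoid that.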

In [12]:
# Divide the dataframes of IPs into separate .csv files. 

p_start = 0
p_end = api_hr_limit 

inp_ip_file_path = os.path.join(os.getcwd(), 'cron_data', 'input')


for f_name in inp_filenames:
    data_extract = uniqueIPs[p_start:p_end]
    data_extract.to_csv(os.path.join(inp_ip_file_path, f_name))
    p_start = p_end
    p_end += api_hr_limit


# Connect to `API`, fetch data using `freegeoip API`. Save as `Output` data file

In [7]:
def lookup_ip_geodata(list_of_ips):
    """ Function to lookup geolocation data of given IP addresses. 
        Returns dataframe of geolocation for given IPs.
    """
    ip_lookup = []
    for ip in list_of_ips:
        r = requests.get("http://freegeoip.net/json/"+ip)
        if r.ok:
            ip_lookup.append(r.json())
    return pd.DataFrame(ip_lookup)



# Test the function
lookup_ip_geodata(uniqueIPs.IP_Address.values[:2])


Unnamed: 0,city,country_code,country_name,ip,latitude,longitude,metro_code,region_code,region_name,time_zone,zip_code
0,Hangzhou,CN,China,115.28.44.153,30.2936,120.1614,0,33.0,Zhejiang Sheng,Asia/Shanghai,
1,,GB,United Kingdom,109.74.205.147,51.5,-0.13,0,,,Europe/London,
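
One possible refinement of the lookup loop is to stop as soon as the hourly quota is exhausted (HTTP `403`) and remember which IPs are still pending. The sketch below shows only that control flow: `lookup_until_quota` is a hypothetical helper, and `fetch` is a stand-in for the real HTTP call so that the example runs without network access.

```python
def lookup_until_quota(ips, fetch):
    """Call `fetch(ip)` for each IP and stop at the first HTTP 403.

    `fetch` must return a (status_code, payload_dict) tuple.
    Returns (records, remaining_ips).
    """
    records = []
    for position, ip in enumerate(ips):
        status, payload = fetch(ip)
        if status == 403:
            # Quota exhausted: hand back everything not yet fetched
            return records, list(ips[position:])
        if status == 200:
            records.append(payload)
        # Any other status: skip this IP and carry on
    return records, []


def make_fake_fetch(quota):
    """Build a fake fetcher that answers 403 after `quota` calls."""
    calls = {"count": 0}

    def fetch(ip):
        calls["count"] += 1
        if calls["count"] > quota:
            return 403, {}
        return 200, {"ip": ip}

    return fetch


records, remaining = lookup_until_quota(
    ["1.1.1.1", "2.2.2.2", "3.3.3.3"], make_fake_fetch(quota=2))
print(len(records), remaining)
```

With the real service, `fetch` could wrap `requests.get(...)` and return `(r.status_code, r.json())`; the remaining IPs could then be written back to an input file for the next hourly run.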


In [24]:
inp_filenames = ['free_gip_6_0.csv']
print(inp_filenames)

['free_gip_5_20.csv']


### `Code to execute at every hour`. 
With the `csv` data files ready, we need to configure a `cronjob` that executes at a regular `1 hour` interval. Each hour, the job reads one `csv` file and performs a `GET request` for every input IP in that file. On completion, it saves the `IP geodata` fetched through the `API` into another `csv` file as output. 



In [20]:
files_treated = []
out_geo_ip_file_path = os.path.join(os.getcwd(), 'cron_data', 'output')


def readip_lookupgeo_out2file():
    """ This function reads one .csv file as input. It calls `lookup_ip_geodata`,
        which returns a dataframe of geo data for the set of IPs in the csv file. 
        The dataframe is then saved as a .csv output file. 
    """
    # Take a single timestamp so that day and hour always belong together
    now = pd.Timestamp.today()
    start_day = now.day
    start_hr = now.hour
#     start_min = round(now.minute, -1)
    file_to_pull = "free_gip_{0}_{1}.csv".format(start_day,start_hr)
    
    # Check if file exist and read it. Else print msg to log and exit function
    file_url = os.path.join(inp_ip_file_path, file_to_pull)
    
    if not os.path.isfile(file_url):
        print("File '{}' does not exist!".format(file_to_pull))
        return 
    elif file_to_pull in files_treated:
        print("File '{}' already fetched.".format(file_to_pull))
        return 
    else:
        try:
            inp_ips_addrs = pd.read_csv(file_url, index_col=0).IP_Address.values
            ip_geo_df = lookup_ip_geodata(inp_ips_addrs)
            files_treated.append(file_to_pull)
        except requests.exceptions.RequestException:
            print("Connection Problem on '{}'. Moving on.".format(file_to_pull))
            return
    
    file_fetched = "out_free_gip_{0}_{1}.csv".format(start_day, start_hr)
    out_path = os.path.join(out_geo_ip_file_path, file_fetched)
    ip_geo_df.to_csv(out_path)
    
    print("{0}: {1} of {2} successfully completed!".format(
            file_fetched, inp_filenames.index(file_to_pull)+1, len(inp_filenames)))
    return



## Setup `cronjob` or `timer` function to call `readip_lookupgeo_out2file()` every `65 mins`

The function below calls `readip_lookupgeo_out2file()` every `65 mins` (converted to seconds). The timer keeps looping until all the `csv` input files have been treated, that is, for `len(inp_filenames)` iterations.

In [26]:

def lookup_ip_geo_timer():
    """ Continuously call function till all filenames are treated. """
    for fn in inp_filenames:
        readip_lookupgeo_out2file()
        time.sleep(65 * 60)              # 65 mins interval, in seconds
    return
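
A variant of this timer that is easier to test takes the processing function and the sleep function as parameters, and skips the wait after the last file. This is only a sketch: `run_batches` and its parameters are hypothetical names, and the fake `sleep` below just records the requested delays instead of waiting.

```python
import time


def run_batches(filenames, process, interval_s=65 * 60, sleep=time.sleep):
    """Process each file in turn, waiting `interval_s` seconds between files."""
    for i, name in enumerate(filenames):
        process(name)
        # No need to wait once the last file has been handled
        if i < len(filenames) - 1:
            sleep(interval_s)


# Dry run: record what would be processed and how long we would wait
processed, waits = [], []
run_batches(["a.csv", "b.csv", "c.csv"], processed.append, sleep=waits.append)
print(processed, waits)
```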


### Call the timer function that pulls data using `API`

In [28]:
# Call function and print progress

lookup_ip_geo_timer()


# Function to `Update` missing values in data extracted from `FreeGeoIP`

In [23]:


gips_base_url = "http://api.geoips.com/ip"
append_url = "output/json/hostname/true/timezone/true"


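def update_free_geoip_data(df):
    """ Fill the missing geolocation fields of one row using the `geoips` API.
        `df` is a single dataframe row (as passed by `DataFrame.apply(axis=1)`)
        whose name is the IP address. Returns the updated row, or None on failure.
    """
    geoips_url = "{0}/{1}/key/{2}/{3}".format(gips_base_url, df.name, geoips_key, append_url)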
    r = requests.get(geoips_url)
    
    if r.ok and 'error' not in r.json():
        returned_data = r.json()['response']
        
        # Get request ok, no error but no entry was found
        if 'Success' not in returned_data['message']:
            print(returned_data['notes'])
            return 
        # Request is ok and success msg was received.
        else:
            ip_location_data = returned_data['location']
            
            # Rename keys to match existing dataframe naming format
            ip_location_data.pop('county_name')
            ip_location_data.pop('ip')
            ip_location_data['city'] = ip_location_data.pop('city_name')
            ip_location_data['time_zone'] = ip_location_data.pop('timezone')
            
            # Update missing data in dataframe with fetched API data
            for fetched_key, fetched_value in ip_location_data.items():
                if pd.isnull(df[fetched_key]):
                    df[fetched_key] = fetched_value  
                    
            df['latitude'] = ip_location_data['latitude']
            df['longitude'] = ip_location_data['longitude']  

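    # Request failed, or the API returned an error payload
    else:
        stop_point_log.append("API stopped at {0}".format(df.name))
        # Only read the error notes when the response is OK and so carries the error payload
        print(r.json()['error']['notes'] if r.ok else "HTTP {0}".format(r.status_code))
        return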
    
    return df



## Import data file of  `IP Geo LookUp`

In [None]:
# Array to store log information about error during API calls and fetches
stop_point_log = []      

# Temp variables to be deleted when function moves
out_geo_ip_file_path = os.path.join(os.getcwd(), 'cron_data', 'output')



def loadcsv_call_updater(csv_fname):
    """ Load the CSV file with the given filename, 
        change its structure and call the update function on it. 
    """
    # Load dataframe from csv
    df = pd.read_csv(os.path.join(out_geo_ip_file_path, csv_fname), index_col='ip')
    df.drop('Unnamed: 0', axis=1, inplace=True)
    
    # Update structure of existing dataframe
    df['continent_name'] = None
    df['continent_code'] = None
    df['owner'] = None
    df['hostname'] = None

    # Get those rows missing some data properties
    has_missing_data = df[df.country_name.isnull()]

    # Update missing rows with new data from API calls 
    updated_row = has_missing_data.apply(update_free_geoip_data, axis=1)
    
    # Merge inplace into original dataframe using .update() and resave as .csv file into `phase2` directory.
    df.update(updated_row)
    df.to_csv(os.path.join(os.getcwd(), 'cron_data', 'phase2' ,csv_fname))
    print("{} Done!".format(csv_fname))
    return 




## Main function
The loop below calls `loadcsv_call_updater(csv_fname)` for each output file, which in turn calls `update_free_geoip_data(df)`. `loadcsv_call_updater(csv_fname)` loads a `.csv` of geolocation rows into a dataframe, selects the rows with missing data properties into a temporary dataframe, and calls `df.apply(update_free_geoip_data)` on each of those rows. `update_free_geoip_data()` uses another `IP geolocation API service` to extract finer data. (So we perform API data extraction twice!)

In [None]:
# Build the filenames of the `/output/out_free_gip_XX_YY.csv` files (XX = day, YY = hour),
# matching the names written by `readip_lookupgeo_out2file()`


out_filenames = ["out_free_gip_{0}_{1}.csv".format(tm[0],tm[1]) for tm in rg_dates.values]

for csv_file in out_filenames:
    loadcsv_call_updater(csv_file)
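
The core rule of the update step, filling a field from the second API only when the first one left it empty, can be shown on plain dictionaries. This is a simplified sketch: `is_missing` and `fill_missing` are hypothetical helpers, the sample values are made up, and unlike `update_free_geoip_data` it does not overwrite latitude and longitude unconditionally.

```python
import math


def is_missing(value):
    """Treat None, empty strings and NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def fill_missing(primary, fallback):
    """Return a copy of `primary` with its missing fields taken from `fallback`."""
    merged = dict(primary)
    for key, value in fallback.items():
        # Only touch fields that exist in the primary record and are empty
        if key in merged and is_missing(merged[key]):
            merged[key] = value
    return merged


# Made-up records standing in for the two API responses
first_api = {"country_name": "United Kingdom", "city": float("nan"),
             "region_name": None, "time_zone": "Europe/London"}
second_api = {"country_name": "UK", "city": "London",
              "region_name": "England", "owner": "Example ISP"}

merged = fill_missing(first_api, second_api)
print(merged)
```

Values already present in the first response win, so the second API refines the data without silently changing fields that were already filled.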



## Merge `API` data file with the data file containing `IP class` info

Merge all the API data files into single dataframe and save as `.csv` file

In [3]:
pieces = []

for f_csv in out_filenames[:2]:
    api_file = pd.read_csv(os.path.join(os.getcwd(), 'cron_data', 'phase2', f_csv), index_col=0)
    pieces.append(api_file)


# Concatenate the list of files into single dataframe
ip_geodata_lookup = pd.concat(pieces) 

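# Rearrange the columns
column_names = ['ip', 'country_name', 'country_code', 'latitude', 'longitude', 
                'region_name', 'region_code', 'city', 'metro_code', 'zip_code', 'time_zone']

# Select by name to reorder; assigning to `.columns` would only relabel the columns in place
ip_geodata_lookup = ip_geodata_lookup.reset_index()[column_names]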


# Merge with the `uniqueIPs` data file containing CLASS information
augmented_ip_geodata = pd.merge(uniqueIPs, ip_geodata_lookup, left_on='IP_Address', right_on='ip')
augmented_ip_geodata.drop('ip', axis=1, inplace=True)

augmented_ip_geodata.to_csv("augmented_ip_geodata.csv")
augmented_ip_geodata.head(3)
