---

Title: Acquisition Phase - Collecting the data needed to analyze global cyberattack patterns and enhance cybersecurity.

---

---
### 🛠️ Importing Libraries

In [1]:
# Basic/Standard libraries to manage data
import numpy as np
import pandas as pd

# Necessary libraries to use the api
import requests
import json
from pprint import pprint # for better visualization of json structure

# pip install tqdm (only if this library is not already installed)
from tqdm import tqdm # Adds a progress bar to a for loop (useful for tracking the iterations and the execution time)

---
### 1️⃣ Loading First Data Source 
#### (Kaggle Dataset: updated_cybersecurity_attacks.csv)

In [None]:
path = r""
attacks = pd.read_csv(path, delimiter = ',') # First data source

In [None]:
# See the head of our dataset
attacks.head()

In [None]:
# List the column names
attacks.columns

In [5]:
# Define the columns to keep
columns_to_keep = [
    'Source IP Address', 'Destination IP Address', 'Protocol', 'Packet Type', 
    'Packet Length', 'Traffic Type', 'Malware Indicators', 'Anomaly Scores', 
    'Alerts/Warnings', 'Attack Type', 'Action Taken', 'Severity Level', 
    'Log Source', 'Browser', 'Device/OS', 'Year', 'Month', 'Day'
    ]

In [6]:
# Filter the DataFrame to keep only the selected columns
# (.copy() avoids a SettingWithCopyWarning when a new column is added later)
attacks = attacks[columns_to_keep].copy()

In [7]:
# Create a new column 'Date' by merging Year, Month, and Day
attacks['Date'] = pd.to_datetime(attacks[['Year', 'Month', 'Day']])

In [8]:
# Drop the Year, Month, and Day columns
attacks = attacks.drop(columns=['Year', 'Month', 'Day'])

In [None]:
# Display the first 5 rows of the updated DataFrame
attacks.head()

In [None]:
# Updated shape of our dataset
attacks.shape

---
### 🔍 Define & Explore our API (essential element of our second data source)

For more information and details, see the links below:
 
[API Landing Page](https://ipgeolocation.io/)

[API Documentation](https://ipgeolocation.io/documentation.html)

In [None]:
API_KEY = ""
ip = "142.250.186.174" # Example of a "typical" IP (a lot of information can be extracted from it)
#ip = "10.90.47.5" # Example of an IP for which no information can be extracted
api_base_url = f"https://api.ipgeolocation.io/ipgeo?apiKey={API_KEY}&ip={ip}"
options = "&fields=geo&include=security" # Include geolocation and security info
users_endpoint = f"{api_base_url}{options}"
print(f"Your user endpoint: {users_endpoint}")
response = requests.get(users_endpoint)
response = response.json()
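The endpoint in the cell above is assembled by string concatenation. As a minimal sketch, the same URL can also be built with the standard library's `urllib.parse.urlencode`, which escapes parameter values automatically. `build_endpoint` and the `DEMO_KEY` placeholder are hypothetical names used only for illustration, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.ipgeolocation.io/ipgeo"

def build_endpoint(api_key: str, ip: str) -> str:
    """Return the lookup URL for one IP (hypothetical helper, not part of the notebook)."""
    params = {
        "apiKey": api_key,
        "ip": ip,
        "fields": "geo",        # geolocation fields only
        "include": "security",  # add the nested security block
    }
    return f"{BASE_URL}?{urlencode(params)}"

# DEMO_KEY is a placeholder, not a real key
endpoint = build_endpoint("DEMO_KEY", "142.250.186.174")
print(endpoint)
```

Keeping the parameters in a dictionary makes it easy to see which options are requested and to change them in one place.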

In [None]:
# See the structure of our API response
pprint(response)

In [None]:
type(response)

In [None]:
# See the nested piece
pprint(response["security"])
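The second example address, 10.90.47.5, belongs to the private 10.0.0.0/8 range, which is a likely reason why no information can be extracted for it. The sketch below assumes that non-public addresses are the main cause of such empty answers (the API may reject other addresses too) and shows how the standard `ipaddress` module could flag them before a paid request is spent. `is_publicly_routable` is a hypothetical helper name.

```python
import ipaddress

def is_publicly_routable(ip: str) -> bool:
    """Return True only for syntactically valid, globally routable addresses."""
    try:
        return ipaddress.ip_address(ip).is_global
    except ValueError:
        # Not a valid IPv4/IPv6 literal at all
        return False

for candidate in ["142.250.186.174", "10.90.47.5"]:
    print(candidate, is_publicly_routable(candidate))
```

This check is only a local pre-filter; the API response remains the reference for what can actually be resolved.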

---
### 2️⃣ Second Data Source "Generation/Collection"
This section is dedicated to the definition and usage of two important functions:
1. **"JSON_frame_generation"** function: given a list of IP addresses, it saves a JSON file containing the information that the API returns for each of them;
2. **"convert_JSON_to_CSV_DATAFRAME"** function: given a JSON file as input, it converts the content into a DataFrame object and saves it as a CSV file.

Note that:
- The first function is important because it lets us collect data faster and more reliably. Reliability matters because the service we use is not free: we paid for a fixed number of requests. We have 150000 requests available and need 80000 of them to collect the information for the attackers' IPs (40000) and the victims' IPs (40000), so we have only about two attempts at collecting our data.

- The second function matters because we know both the structure of our data and the structure we need for the upcoming analysis (EDA). That analysis needs a table as input (the same holds for other uses, such as an ML project, linear regression analysis, dashboard creation, cluster analysis, etc.). Since the structure of our data is also almost completely fixed, we opt to store our data sources in a relational database.
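The request budget described in the first note can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted above; the variable names are illustrative.

```python
TOTAL_QUOTA = 150_000    # requests purchased
N_ATTACKER_IPS = 40_000  # source IP addresses
N_VICTIM_IPS = 40_000    # destination IP addresses

requests_per_run = N_ATTACKER_IPS + N_VICTIM_IPS

# How many complete collection runs fit in the quota, and what is left over
full_runs, leftover = divmod(TOTAL_QUOTA, requests_per_run)
coverage_of_next_run = leftover / requests_per_run

print(f"Requests per complete run: {requests_per_run}")
print(f"Complete runs available:   {full_runs}")
print(f"Requests left afterwards:  {leftover} ({coverage_of_next_run:.1%} of another run)")
```

One complete run fits in the quota and the remainder covers most, but not all, of a second one, which is why a failed run is expensive.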

#### JSON_frame_generation function - definition

In [15]:
def JSON_frame_generation(ip_list, API_KEY, saving_path, file_name):

    '''
    Arguments:
    ip_list = A list containing the IPs for which you want to extract related information
    API_KEY = The API key
    saving_path = The path where you want to save the output (the JSON file)
    file_name = The name of the output file (without extension)

    Output:
    A JSON file containing the information related to the input IPs.

    This function was implemented to let you work with a batch approach.
    You can decide to use only a portion of the input list (e.g. from index 0 to 100 = ip_list[0:100]).

    Step 1 - Choose your usage type
    Choose 1: if you want to extract information for a specific portion of the input list (from a to b = ip_list[a:b])
    Choose 2: if you want to start extracting from a specific position in the input list until the end (from c until the end = ip_list[c:])
    
    Step 2 - Set the starting and (if needed) the ending batch position
    '''

    dict_info_att = {}

    print("Choose your usage: (1 = from a to b | 2 = from a until the end).")
    use_method = int(input("Select 1 (if you need a starting and an ending batch position) or 2 (if you need only a starting batch position): "))
    if use_method == 1:
        strt_btc_pos = int(input("Select the starting batch position: "))
        end_btc_pos = int(input("Select the ending batch position: "))
        ip_list = ip_list[strt_btc_pos:end_btc_pos]
        print(f"Number of selected rows: {len(ip_list)}")
        print(f"You start collecting data from row {strt_btc_pos} (included) to row {end_btc_pos} (excluded).")
    elif use_method == 2:
        strt_btc_pos = int(input("Select the starting batch position: "))
        ip_list = ip_list[strt_btc_pos:]
        print(f"Number of selected rows: {len(ip_list)}")
        print(f"You start collecting data from row {strt_btc_pos} (included) until the end.")
    else:
        # Stop here instead of silently querying the whole list and consuming the request quota
        raise ValueError("Usage type must be 1 or 2.")

    options = "&fields=geo&include=security" # Include geolocation and security info

    for att_ip in tqdm(ip_list, desc = "IP info extraction"):
        api_base_url = f"https://api.ipgeolocation.io/ipgeo?apiKey={API_KEY}&ip={att_ip}"
        users_endpoint = f"{api_base_url}{options}"
        response = requests.get(users_endpoint)
        response = response.json()
        
        # A successful response has several fields, including "ip";
        # otherwise the response contains only a "message" that starts with the quoted IP
        if len(response) > 1:
            dict_key = response["ip"]
        else:
            list_message = response["message"].split(" ")
            dict_key = list_message[0].strip("'")

        dict_info_att[dict_key] = response

    print("Saving process has started, please wait...")
    full_path = f"{saving_path}\\{file_name}.json"
    with open(full_path, "w") as file:
        json.dump(dict_info_att, file, indent=4)
    print(f"Full path: {full_path}")

    print("File saved successfully!!!")
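The dictionary key chosen inside the loop is the least obvious step of `JSON_frame_generation`, so here is a minimal offline sketch of it. `derive_key` is a hypothetical helper that mirrors the function's logic, and both sample responses are made up: in particular, the wording of the error message is an assumption; the code only relies on the message starting with the quoted IP.

```python
def derive_key(response: dict) -> str:
    """Pick the dictionary key for one API response (mirrors JSON_frame_generation)."""
    if len(response) > 1:
        # Normal response: the IP is returned in its own field
        return response["ip"]
    # Error response: only "message" is present, and its first token is the quoted IP
    first_token = response["message"].split(" ")[0]
    return first_token.strip("'")

# Made-up responses for illustration only
ok_response = {"ip": "142.250.186.174", "country_name": "United States"}
error_response = {"message": "'10.90.47.5' is a bogon IP address."}  # assumed wording

key_ok = derive_key(ok_response)
key_error = derive_key(error_response)
print(key_ok, key_error)
```

Keying every entry by IP, even for failed lookups, is what allows the later length checks to account for all requested addresses.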

#### JSON_frame_generation function - usage

In [16]:
# Attackers' IP list
att_ip_list = attacks["Source IP Address"]
att_ip_list = att_ip_list.tolist()

In [None]:
# Useful info about them
print(att_ip_list) # See the list 
print(len(att_ip_list)) # See its length
print(type(att_ip_list)) # Check that type = list
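Each element of the list costs one API request, while the output dictionary of `JSON_frame_generation` is keyed by IP, so a repeated address would be paid for twice but stored once. Should the list contain duplicates, a minimal sketch for removing them while keeping the original order is shown below. `unique_in_order` is a hypothetical helper and the sample list is illustrative.

```python
def unique_in_order(ips):
    """Drop repeated IPs while preserving first-seen order."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(ips))

# Illustrative sample, not taken from the dataset
sample_ips = ["103.216.15.12", "142.250.186.174", "103.216.15.12"]
deduplicated = unique_in_order(sample_ips)

print(f"{len(sample_ips)} addresses, {len(deduplicated)} unique")
```

If the two lengths are equal for the real list, no request is wasted and this step can be skipped.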

In [None]:
# JSON_frame_generation function - usage for attackers (extract info from the attackers' IPs)
#API_KEY = ""
#path = r""
#file_name = r"attakers_json_file"
#JSON_frame_generation(att_ip_list, API_KEY, path, file_name)
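When the collection is run in several batches, each call writes its own JSON file. The following is a minimal sketch of how such files could be combined into a single dictionary afterwards; `merge_json_batches`, the file names, and the sample entries are hypothetical, and the demonstration writes to a temporary folder only.

```python
import json
import os
import tempfile

def merge_json_batches(batch_paths, merged_path):
    """Merge several {ip: response} JSON files into one file; return the entry count."""
    merged = {}
    for batch_path in batch_paths:
        with open(batch_path, "r") as file:
            merged.update(json.load(file))  # a later batch overwrites a repeated IP
    with open(merged_path, "w") as file:
        json.dump(merged, file, indent=4)
    return len(merged)

# Demonstration with two tiny made-up batches
with tempfile.TemporaryDirectory() as folder:
    batch_1 = os.path.join(folder, "batch_1.json")
    batch_2 = os.path.join(folder, "batch_2.json")
    with open(batch_1, "w") as file:
        json.dump({"142.250.186.174": {"ip": "142.250.186.174"}}, file)
    with open(batch_2, "w") as file:
        json.dump({"10.90.47.5": {"message": "placeholder"}}, file)

    merged_file = os.path.join(folder, "merged.json")
    n_entries = merge_json_batches([batch_1, batch_2], merged_file)
    with open(merged_file, "r") as file:
        merged_data = json.load(file)

print(n_entries)
```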

In [18]:
# Victims' IP list
vic_ip_list = attacks["Destination IP Address"]
vic_ip_list = vic_ip_list.tolist()

In [None]:
# Useful info about them
print(vic_ip_list) # See the list
print(len(vic_ip_list)) # See its length
print(type(vic_ip_list)) # Check that type = list

In [None]:
# JSON_frame_generation function - usage for victims (extract info from the victims' IPs)
#API_KEY = ""
#path = r""
#file_name = r"victims_json_file"
#JSON_frame_generation(vic_ip_list, API_KEY, path, file_name)

#### convert_JSON_to_CSV_DATAFRAME function - definition

##### Some useful checks before starting with the definition of the function

In [None]:
# Load the attackers' JSON file (as an example)
# Once you are done with the attackers, do the same for the victims
path = r""
with open(path, "r") as file:
    data = json.load(file)

In [None]:
# See the structure
pprint(data)

In [None]:
# Check the length
len(data)

In [None]:
# Check the length of a single element
len(data['103.216.15.12']) # for example the length for '103.216.15.12'

In [24]:
# Compute the length of each element
ls = []
for ip in data:
    count = len(data[ip])
    ls.append(count)

In [25]:
n_16 = 0 # Number of elements with length = 16 (IPs for which we could collect related information)
n_1 = 0 # Number of elements with length = 1 (IPs for which it wasn't possible to collect related information)
ls_other = []
for el in ls:
    if el == 16:
        n_16 += 1
    elif el == 1:
        n_1 +=1
    else:
        ls_other.append(el)

In [None]:
n_16

In [None]:
n_1

In [None]:
n = n_16 + n_1
n # If n_16 + n_1 equals the total number of elements/IPs in the JSON file (40000 in our case), everything is OK
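The checks above count how many records have 16 fields and how many have only 1. As a compact alternative, `collections.Counter` can tally every record length in one pass. This is a minimal sketch on a made-up dictionary (`data_example`) that only imitates the shape of the real file.

```python
from collections import Counter

# Made-up stand-in for the loaded JSON: two complete records and one error record
complete_record = {f"field_{i}": None for i in range(16)}
data_example = {
    "142.250.186.174": dict(complete_record),
    "103.216.15.12": dict(complete_record),
    "10.90.47.5": {"message": "placeholder"},
}

# Map each record length to the number of records that have it
length_counts = Counter(len(record) for record in data_example.values())

# Any length other than 16 or 1 would signal an unexpected response shape
unexpected = {length: n for length, n in length_counts.items() if length not in (16, 1)}

print(length_counts)
print(f"Unexpected lengths: {unexpected}")
```

An empty `unexpected` dictionary corresponds to the case where `n_16 + n_1` equals the total number of entries.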

##### Definition

In [29]:
def convert_JSON_to_CSV_DATAFRAME(full_json_file_path, full_destination_path, are_attakers):

    '''
    This function takes a JSON file as input and converts it into a DataFrame object that is saved as a CSV file.

    Arguments:
    full_json_file_path = The full path (folder path + file name + .json) of the saved JSON file
    full_destination_path = The full path (folder path + file name + .csv) where you want to save the output (the CSV file)
    are_attakers = True (the JSON file contains the attackers' IP info) or False (the JSON file contains the victims' IP info)

    Output:
    A CSV file containing the same information as the input, stored in a tabular structure/format.
    
    Note: the form of the output depends on the argument are_attakers. If are_attakers = True, the output column names start with "att_"; otherwise they start with "vic_".
    '''

    with open(full_json_file_path, "r") as file:
        data = json.load(file)

    # Column prefix and progress-bar label depend on which file is being processed
    prefix = "att_" if are_attakers else "vic_"
    label = "attackers" if are_attakers else "victims"

    # Top-level (geolocation) fields of a response, in output column order
    geo_fields = [
        "ip", "continent_code", "continent_name", "country_code2", "country_code3",
        "country_name", "country_name_official", "is_eu", "state_prov", "state_code",
        "district", "city", "zipcode", "latitude", "longitude"
        ]

    # Fields nested under the "security" key of a response, in output column order
    security_fields = [
        "threat_score", "is_tor", "is_proxy", "proxy_type", "is_anonymous",
        "is_known_attacker", "is_spam", "is_bot", "is_cloud_provider"
        ]

    rows = []
    for ip_key in tqdm(data, desc = f"Processing {label} data..."):

        record = data[ip_key]

        if len(record) == 16:
            # Complete record: copy every geolocation and security field
            row = {f"{prefix}{field}": record[field] for field in geo_fields}
            row.update({f"{prefix}{field}": record["security"][field] for field in security_fields})
            row[f"{prefix}message"] = ""
        else:
            # Error record: leave every field empty and flag the row in the message column
            row = {f"{prefix}{field}": "" for field in geo_fields + security_fields}
            row[f"{prefix}message"] = "message"

        rows.append(row)

    # Fix the column order explicitly so that it does not depend on the input
    columns = [f"{prefix}{field}" for field in geo_fields + security_fields + ["message"]]
    df = pd.DataFrame(rows, columns = columns)
    df.to_csv(full_destination_path, index = False)
    print("File successfully saved!!!")
    print(f"Your file location: {full_destination_path}")
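To make the conversion concrete, the sketch below flattens a single record into one table row, following the same rules as `convert_JSON_to_CSV_DATAFRAME`: 16-field records are copied field by field, and anything else becomes an empty row flagged in the message column. `flatten_record` is a hypothetical helper, and the sample record holds placeholder values rather than real API output.

```python
GEO_FIELDS = [
    "ip", "continent_code", "continent_name", "country_code2", "country_code3",
    "country_name", "country_name_official", "is_eu", "state_prov", "state_code",
    "district", "city", "zipcode", "latitude", "longitude",
]
SECURITY_FIELDS = [
    "threat_score", "is_tor", "is_proxy", "proxy_type", "is_anonymous",
    "is_known_attacker", "is_spam", "is_bot", "is_cloud_provider",
]

def flatten_record(record: dict, prefix: str) -> dict:
    """Turn one nested API record into a flat row with prefixed column names."""
    if len(record) == 16:
        row = {prefix + field: record[field] for field in GEO_FIELDS}
        row.update({prefix + field: record["security"][field] for field in SECURITY_FIELDS})
        row[prefix + "message"] = ""
    else:
        # Error record: every column empty, message column flagged
        row = {prefix + field: "" for field in GEO_FIELDS + SECURITY_FIELDS}
        row[prefix + "message"] = "message"
    return row

# Placeholder record: 15 geolocation fields plus the nested "security" block
sample_record = {field: f"<{field}>" for field in GEO_FIELDS}
sample_record["security"] = {field: False for field in SECURITY_FIELDS}

complete_row = flatten_record(sample_record, "vic_")
error_row = flatten_record({"message": "placeholder"}, "vic_")
print(list(complete_row)[:3], len(complete_row))
```

Both kinds of rows share the same 25 columns, which is what lets them be stacked into one table.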

#### convert_JSON_to_CSV_DATAFRAME function - usage

##### Usage for the attackers

In [None]:
# convert_JSON_to_CSV_DATAFRAME function - usage for attackers
#json_file_path = r""
csv_df_destination_path = r""
#are_attakers = True
#convert_JSON_to_CSV_DATAFRAME(json_file_path, csv_df_destination_path, are_attakers)

###### Test and inspect the output for the attackers

In [31]:
# Load the csv file
attakers_csv = pd.read_csv(csv_df_destination_path, delimiter = ',') 

In [None]:
# First look at the output
attakers_csv.head() 

In [None]:
# See the shape (row x columns)
attakers_csv.shape

In [None]:
# Extract the names of all the columns
attakers_csv.columns

##### Usage for the victims

In [None]:
# convert_JSON_to_CSV_DATAFRAME function - usage for victims
#json_file_path = r""
csv_df_destination_path = r""
#are_attakers = False
#convert_JSON_to_CSV_DATAFRAME(json_file_path, csv_df_destination_path, are_attakers)

###### Test and inspect the output for the victims

In [36]:
# Load the csv file
victims_csv = pd.read_csv(csv_df_destination_path, delimiter = ',') 

In [None]:
# First look at the output
victims_csv.head()

In [None]:
# See the shape (row x columns)
victims_csv.shape

In [None]:
# Extract the names of all the columns
victims_csv.columns

---
### ⏭️ Next Step: 
$\rightarrow$ Data Storage, Integration and Enrichment (see 2_dataStorage_Integration_Enrichment_withPostgreSQL.ipynb)