---

Title: Data Quality - Assessment and Improvement

---

---
### 🛠️ Importing Libraries

In [1]:
# Basic/Standard libraries to manage data
import numpy as np
import pandas as pd

# Install with `pip install tqdm` if the library is not already available on your machine
from tqdm import tqdm # Adds a progress bar to a for loop (useful for tracking iterations and execution time)

from collections import Counter  # Simplifies counting hashable objects. Ideal for counting elements, finding frequent items, or performing operations on frequencies

---
### 🗃️ Loading Data  
#### Starting Point:  
We begin with the cleaned version of the enriched dataframe, extracted using PostgreSQL.  
Key details:  
- **Enriched Dataframe File**: `final_df_ExtractedWithPostgreSQL.csv`  
- **Cleaned Version File**: `cleaned_data.csv`

In [2]:
path = r"cleaned_data.csv" # For Windows users
#path = "cleaned_data.csv" # For macOS users
data = pd.read_csv(path, delimiter = ',')

--- 
### 1. Assessing "Overall Data Completeness"

The initial results show that the dataset has **39,657 observations** and **37 columns**, and that every column is free of missing values. In other words, each column has a value for every observation, so the current data is **100% complete**.  

However, the previous version of the dataset (prior to the completion of the cleaning process but after dropping unnecessary columns; see `Data_Cleaning.ipynb` for more details) originally contained **40,000 observations**. The difference between the number of rows in the previous version (**40,000**) and the current version (**39,657**) shows that **343 rows** were dropped. This equates to a reduction of approximately **0.86%** (less than 1%) during the cleaning process.  

Thus, we can conclude that regardless of whether we consider the previous or the current version of the data, the dataset maintains a **very high level of completeness**, which can be described as either excellent or near-perfect.
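
The completeness figures above can also be derived per column rather than read off `data.info()`. The following is a minimal sketch, not the notebook's own method: the helper name `completeness_report` and the toy frame are hypothetical, and only the row counts (40,000 and 39,657) come from the text.

```python
import pandas as pd


def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column non-null counts and completeness in percent."""
    return pd.DataFrame({
        "non_null": df.notna().sum(),
        "completeness_pct": (df.notna().mean() * 100).round(2),
    })


# Toy frame standing in for `data` (values are made up for illustration)
toy = pd.DataFrame({"a": [1, 2, None, 4], "b": ["x", "y", "z", "w"]})
report = completeness_report(toy)
print(report)

# Row retention relative to the pre-cleaning dataset (row counts from the text)
rows_before, rows_after = 40_000, 39_657
dropped_pct = (rows_before - rows_after) / rows_before * 100
print(f"Dropped rows: {rows_before - rows_after} ({dropped_pct:.2f}%)")
```

Calling `completeness_report(data)` on the real dataframe should report 100% for every column if the dataset is as complete as described.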

In [3]:
# First look at the data
rows_before = 40000 # Number of rows in "final_df_ExtractedWithPostgreSQL.csv" (before cleaning)
rows_dropped = rows_before - len(data) # Rows removed during the cleaning phase
print("Shape:", data.shape) # Shape of our dataframe (rows, columns)
print("#rows in final_df_ExtractedWithPostgreSQL.csv - #rows in cleaned_data.csv:", rows_dropped)
# The percentage is relative to the pre-cleaning row count, so the denominator is rows_before
print("Percentage of dropped rows during the cleaning phase:", round(rows_dropped/rows_before*100, 4))

Shape: (39657, 37)
#rows in final_df_ExtractedWithPostgreSQL.csv - #rows in cleaned_data.csv: 343
Percentage of dropped rows during the cleaning phase: 0.8575


In [4]:
# See info about the variables
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39657 entries, 0 to 39656
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   vic_ip                  39657 non-null  object 
 1   vic_continent_name      39657 non-null  object 
 2   vic_country_code2       39657 non-null  object 
 3   vic_country_name        39657 non-null  object 
 4   vic_city                39657 non-null  object 
 5   vic_latitude            39657 non-null  float64
 6   vic_longitude           39657 non-null  float64
 7   att_ip                  39657 non-null  object 
 8   att_continent_name      39657 non-null  object 
 9   att_country_code2       39657 non-null  object 
 10  att_country_name        39657 non-null  object 
 11  att_city                39657 non-null  object 
 12  att_latitude            39657 non-null  float64
 13  att_longitude           39657 non-null  float64
 14  att_threat_score        39657 non-null

---  
### 2. Assessment of Data Quality for Numeric Variables  

In this section, we will evaluate the **consistency** of the following numeric variables:  

**Variables Under Consideration:**  
- **att_latitude:** Latitude of the attacker  
- **att_longitude:** Longitude of the attacker  
- **vic_latitude:** Latitude of the target  
- **vic_longitude:** Longitude of the target  
- **att_threat_score:** Threat score of the IP address, ranging from 0 to 100 (100 represents the highest threat level, while lower scores indicate less threat)  
- **Packet Length:** Size of the packet, measured in bytes

#### 2.1. Consistency Check on Latitude and Longitude for Hackers and Targets  

This section aims to verify the validity of latitude and longitude coordinates by applying the following rules:  

- **Latitude** must fall within the range **[-90, 90]** (inclusive).  
- **Longitude** must fall within the range **[-180, 180]** (inclusive).  

To perform this validation, the minimum and maximum values for each column are checked in the provided code.  

In our case, all columns comply with the specified constraints. Therefore, we can conclude that the latitude and longitude values are consistent for both the Hackers and Targets variables.

In [5]:
# Some basic stats on Latitude and Longitude for Hackers and Targets
data[["att_latitude", "att_longitude", "vic_latitude", "vic_longitude"]].describe()

| | att_latitude | att_longitude | vic_latitude | vic_longitude |
| --- | --- | --- | --- | --- |
| count | 39657.0 | 39657.0 | 39657.0 | 39657.0 |
| mean | 34.465133 | -9.860342 | 34.550893 | -9.701187 |
| std | 19.339776 | 89.066137 | 19.152156 | 89.144209 |
| min | -46.41787 | -175.22904 | -45.87955 | -158.08239 |
| 25% | 32.71568 | -82.89573 | 32.79166 | -82.89573 |
| 50% | 39.10291 | -6.24827 | 39.38314 | -6.20226 |
| 75% | 44.80171 | 72.85573 | 44.80808 | 73.00551 |
| max | 71.28884 | 178.45276 | 69.96004 | 178.45276 |
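
Reading the minimum and maximum from `describe()` works, but the range rules can also be checked explicitly so that any violation is counted. The sketch below is illustrative only: `COORD_BOUNDS`, `count_out_of_range`, and the toy values are hypothetical names and data, while the bounds themselves are the ones stated above.

```python
import pandas as pd

# Inclusive bounds per column, taken from the rules stated in this section
COORD_BOUNDS = {
    "att_latitude": (-90, 90),
    "vic_latitude": (-90, 90),
    "att_longitude": (-180, 180),
    "vic_longitude": (-180, 180),
}


def count_out_of_range(df: pd.DataFrame, bounds: dict) -> dict:
    """Return {column: number of values outside the inclusive [low, high] range}.

    Missing values are counted as violations, because `between` returns False for NaN.
    """
    return {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in bounds.items()
    }


# Toy data with deliberate violations (made up for illustration)
toy = pd.DataFrame({
    "att_latitude": [34.5, 91.0],
    "vic_latitude": [-45.9, 10.0],
    "att_longitude": [-175.2, 178.5],
    "vic_longitude": [181.0, -200.0],
})
violations = count_out_of_range(toy, COORD_BOUNDS)
print(violations)
```

The same helper could be reused for `att_threat_score` with bounds `(0, 100)` by passing a different dictionary.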


#### 2.2. Quality Check on IP Address’ Threat Score  

According to the [documentation](https://ipgeolocation.io/ip-location-api.html#documentation-overview) of our API, the IP address’ threat score ranges from **0 to 100**, where **100 represents the highest threat level**, and lower values indicate a lesser threat.  

To validate this, we examine the minimum and maximum values of the `att_threat_score`. In our dataset, the threat score ranges from **0 to 90**, which falls within the valid range of **[0, 100]**.  

Thus, the constraints are met, and we can conclude that the `att_threat_score` values in our data are consistent and valid.

In [6]:
# Some basic stats on IP address’ Threat Score
data[["att_threat_score"]].describe()

| | att_threat_score |
| --- | --- |
| count | 39657.0 |
| mean | 2.970724 |
| std | 12.530444 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 90.0 |


#### 2.3. Quality Check on Packet Length  

According to this [article](https://labs.apnic.net/index.php/2024/10/04/the-size-of-packets/) ([APNIC](https://en.wikipedia.org/wiki/APNIC), October 4, 2024, by [Geoff Huston](https://blog.apnic.net/author/geoff-huston/)), "the pragmatic default Internet answer these days is that an Internet packet is between **20 and 1,500 bytes** in size."  

In our dataset, the size of packets ranges from **64 bytes to 1,500 bytes**, which falls within this plausible interval and aligns with real-world expectations.  

In [7]:
# Some basic stats on Packet Length
data[["Packet Length"]].describe()

| | Packet Length |
| --- | --- |
| count | 39657.0 |
| mean | 781.30383 |
| std | 416.123634 |
| min | 64.0 |
| 25% | 420.0 |
| 50% | 782.0 |
| 75% | 1143.0 |
| max | 1500.0 |


---
### 3. Data Quality Assessment for Nominal Variables  

The following code starts by considering a list of categorical/binary variables for which we want to check for any anomalies or suspected inconsistencies in their attributes.  

As seen in the results, there are no variables with any anomalies or suspected issues with their attributes.

In [8]:
# List of the variables that we want to check
ls_VariableToCheck = [
       'att_is_tor', 'att_is_proxy', 'att_is_anonymous',
       'att_is_known_attacker', 'att_is_spam', 'att_is_bot',
       'att_is_cloud_provider', 'Protocol', 'Packet Type', 'Traffic Type',
       'Malware Indicators', 'Alerts/Warnings', 'Attack Type', 'Action Taken',
       'Severity Level', 'Log Source', 'Browser', 'Device/OS']

# Iterate through each variable in the list "ls_VariableToCheck"
for variable in ls_VariableToCheck:
    # Get the unique values of the current variable from the DataFrame "data"
    unique_values = data[variable].unique()
    print(f"Unique Values for {variable}: {unique_values}")

Unique Values for att_is_tor: [False  True]
Unique Values for att_is_proxy: [False  True]
Unique Values for att_is_anonymous: [False  True]
Unique Values for att_is_known_attacker: [False  True]
Unique Values for att_is_spam: [False  True]
Unique Values for att_is_bot: [False  True]
Unique Values for att_is_cloud_provider: [False  True]
Unique Values for Protocol: ['ICMP' 'UDP' 'TCP']
Unique Values for Packet Type: ['Data' 'Control']
Unique Values for Traffic Type: ['HTTP' 'DNS' 'FTP']
Unique Values for Malware Indicators: ['IoC Detected' 'No Detection']
Unique Values for Attack Type: ['Malware' 'DDoS' 'Intrusion']
Unique Values for Action Taken: ['Logged' 'Blocked' 'Ignored']
Unique Values for Severity Level: ['Low' 'Medium' 'High']
Unique Values for Log Source: ['Server' 'Firewall']
Unique Values for Browser: ['Mozilla' 'Opera']
Unique Values for Device/OS: ['Windows' 'Macintosh' 'Linux' 'iPod' 'iPhone' 'iPad' 'Android']
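
Inspecting the unique values by eye is feasible for a handful of variables. A possible complement is to compare each variable against an explicit set of expected values. The sketch below is an assumption-laden illustration: `EXPECTED_DOMAINS` and `unexpected_values` are hypothetical names, and the allowed sets simply mirror the values printed above for two of the variables.

```python
import pandas as pd

# Expected categories, copied from the unique values printed above (two variables only)
EXPECTED_DOMAINS = {
    "Protocol": {"ICMP", "UDP", "TCP"},
    "Severity Level": {"Low", "Medium", "High"},
}


def unexpected_values(df: pd.DataFrame, domains: dict) -> dict:
    """Return {column: set of observed values that are not in the expected domain}."""
    issues = {}
    for col, allowed in domains.items():
        observed = set(df[col].dropna().unique())
        extra = observed - allowed
        if extra:
            issues[col] = extra
    return issues


# Toy data with one badly formatted category (made up for illustration)
toy = pd.DataFrame({
    "Protocol": ["TCP", "udp", "ICMP"],
    "Severity Level": ["Low", "High", "Medium"],
})
issues = unexpected_values(toy, EXPECTED_DOMAINS)
print(issues)
```

An empty dictionary means that every observed value belongs to its expected domain.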


---
### 4. Assessing the Validity of a Given IP  

In our case, the simplest way to assess the validity of a given IP address is by referring to the [documentation](https://ipgeolocation.io/ip-location-api.html#documentation-overview) of our API. The API works as follows:  

- **Input**: The API takes an IP address as input.  
- **Validation Process**: The API checks whether the IP meets specific constraints. The key point here is that:  
    - There are multiple reasons why the API might fail to provide information about a specific IP.  
    - A complete list of these reasons (also known as error codes) is detailed in the documentation. Here are a few examples:  
        - **E400**: Bad Request  
        - **E401**: Unauthorized  
        - **E402**: Forbidden  
        - Etc.  
    - **If all checks are passed**:  
        $\rightarrow$ **Output**: The requested information about the given IP is provided.  
    - **Otherwise**:  
        $\rightarrow$ **Output**: An error message explaining the issue is returned.  

- **Conclusion**: If information has been successfully collected for an IP, it means the IP address satisfies all constraints (e.g., it is a valid IP) and does not generate any errors. Therefore, the IP can be considered valid.  
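
The argument above relies on the API having accepted each address. An independent, purely syntactic check is also possible with Python's standard `ipaddress` module. This is a sketch of an additional check, not part of the original workflow; the helper names `is_valid_ip` and `is_public_ip` are hypothetical.

```python
import ipaddress


def is_valid_ip(value) -> bool:
    """True if `value` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(str(value))
        return True
    except ValueError:
        return False


def is_public_ip(value) -> bool:
    """True for syntactically valid addresses that are globally routable."""
    return is_valid_ip(value) and ipaddress.ip_address(str(value)).is_global


# Example addresses (chosen for illustration)
for candidate in ["8.8.8.8", "192.168.1.1", "256.1.1.1", "::1"]:
    print(candidate, is_valid_ip(candidate), is_public_ip(candidate))
```

On the real data, something like `data["att_ip"].map(is_valid_ip).all()` would confirm that every attacker IP is at least well formed.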

--- 
### 5. Quality Assessment for Geospatial Information  

In this final step, our goal is to validate the following geospatial information extracted using the API:  

- **`vic_continent_name`**: Target's continent name  
- **`vic_country_code2`**: Target's country code (two-letter format)  
- **`vic_country_name`**: Target's country name  
- **`vic_city`**: Target's city  
- **`att_continent_name`**: Hacker's continent name  
- **`att_country_code2`**: Hacker's country code (two-letter format)  
- **`att_country_name`**: Hacker's country name  
- **`att_city`**: Hacker's city  

#### General Approach:  
The idea is to validate the geospatial information collected through our API by cross-checking it with two official resources (datasets) that contain a list of officially recognized cities, countries, and continents. The process is as follows:  

- **If the geospatial information is present in the official dataset**, it is considered validated and recognized as an official city, country, or continent.  
- **If not**, potential issues arise:  
    1. **City/Country/Continent Not Recognized**: If a geospatial attribute is not found in the dataset, it decreases the **accuracy** (i.e., the number of correct entries relative to the total).  
    2. **Inconsistency**: If the name of a city, country, or continent is misspelled or formatted differently compared to the dataset (e.g., *Milano ≠ milano ≠ MILANO*), it decreases the **consistency** of the data.  
    3. **Dataset Limitation**: Since we are using the **FREE VERSION** of the official datasets, fewer cities and countries are included. This means that some valid geospatial information may not appear in the dataset, and therefore, it cannot be validated, even though it is correct. 
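
The inconsistency case (*Milano ≠ milano ≠ MILANO*) can be reduced by normalizing names on both sides before comparing them. The sketch below is one possible approach under stated assumptions, not the method used in this notebook: `normalize_name` is a hypothetical helper that ignores case, extra whitespace, and combining accents. Letters that have no accent decomposition (for example *ø*) are left unchanged.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Case-fold a name, strip combining accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.casefold().split())


# Reference names written without diacritics (made up for illustration)
admitted = {normalize_name(city) for city in ["Milano", "Joao Pessoa"]}

for city in ["MILANO", "milano", "João Pessoa", "Ishøj"]:
    print(city, "->", normalize_name(city), normalize_name(city) in admitted)
```

Because a reference column such as `FULL_NAME_ND` stores names without diacritics, comparing normalized forms may validate names that differ from the reference only by accents or letter case.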


#### 5.1. Datasets Used for Validation

##### 1. World Cities Database (FREE VERSION)  
A simple, accurate, and up-to-date database of the world's cities and towns. It is built using authoritative sources such as NGIA, US Geological Survey, US Census Bureau, and NASA. More details [here](https://simplemaps.com/data/world-cities).  

**Advantages:**  
- **Variables of interest:**  
    - **city**: Name of the city/town as a [Unicode](https://en.wikipedia.org/wiki/UTF-8) string (e.g., Goiânia).  
    - **country**: Name of the city/town's country.  
    - **iso2** ($=$ country_code2): [Alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) ISO country code.  

**Disadvantages:**  
- No information about the continent name.  
- Includes data for only **47,868 cities**; a paid version is required for a larger dataset.  

In [9]:
# 1. World Cities Database (FREE VERSION)
path = r"worldcities.csv" # For Windows users
#path = "worldcities.csv" # For macOS users
worldcities = pd.read_csv(path, delimiter = ',')

In [10]:
# First look on data
worldcities.head()

| | city | city_ascii | lat | lng | country | iso2 | iso3 | admin_name | capital | population | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Tokyo | Tokyo | 35.6897 | 139.6922 | Japan | JP | JPN | Tōkyō | primary | 37732000.0 | 1392685764 |
| 1 | Jakarta | Jakarta | -6.175 | 106.8275 | Indonesia | ID | IDN | Jakarta | primary | 33756000.0 | 1360771077 |
| 2 | Delhi | Delhi | 28.61 | 77.23 | India | IN | IND | Delhi | admin | 32226000.0 | 1356872604 |
| 3 | Guangzhou | Guangzhou | 23.13 | 113.26 | China | CN | CHN | Guangdong | admin | 26940000.0 | 1156237133 |
| 4 | Mumbai | Mumbai | 19.0761 | 72.8775 | India | IN | IND | Mahārāshtra | admin | 24973000.0 | 1356226629 |


In [11]:
# Shape of data (rows, columns)
worldcities.shape

(47868, 11)

##### 2. GeoDataSource - World Cities Database (FREE VERSION)  
A free text-format database of worldwide cities suitable for applications requiring a comprehensive list of cities and country codes. It is a subset of the paid GeoDataSource World Cities Database editions (Basic, Premium, Gold, Platinum, Titanium). More details [here](https://www.geodatasource.com/world-cities-database/free).  

**Advantages:**  
- Most accurate and up-to-date data source.  
- Comprehensive list of cities and related items (2,995,619 entries).  
- Covers over 260 countries, territories, and sovereign lands.  
- **Variables of interest:**  
    - **full_name_nd** ($=$ city): Feature’s full name without diacritics (special characters replaced with Roman characters).  
    - **cc_iso** ($=$ iso2 $=$ country_code2): ISO 3166 primary country code (two-letter format) uniquely identifying countries, dependencies, and sovereign territories.  

**Disadvantages:**  
- No information about the country name.  
- No information about the continent name. 

In [12]:
# 2. GeoDataSource - World Cities Database (FREE VERSION)
path = r"GEODATASOURCE-CITIES-FREE.TXT" # For Windows users
#path = "GEODATASOURCE-CITIES-FREE.TXT" # For macOS users
GEODATASOURCE = pd.read_csv(path, delimiter = '\t')

In [13]:
# First look on data
GEODATASOURCE.head()

| | CC_FIPS | CC_ISO | FULL_NAME_ND |
| --- | --- | --- | --- |
| 0 | AN | AD | Aixirivall |
| 1 | AN | AD | Aixovall |
| 2 | AN | AD | Aixas |
| 3 | AN | AD | Andorra la Vella |
| 4 | AN | AD | Ansalonga |


In [14]:
# Compare the shapes (rows, columns) of the two datasets
print("1. World Cities Database:", worldcities.shape)
print("2. GeoDataSource:", GEODATASOURCE.shape) # The larger one

1. World Cities Database: (47868, 11)
2. GeoDataSource: (2995618, 3)


#### 5.2. Functions Used  

In this section, we develop and explain the functions used to assess the quality of geospatial information.  

Specifically, we will discuss **three functions** that we created:  
1. **`CheckCitiesValidity()`**: Validates city names.  
2. **`CheckCountriesValidity()`**: Validates country names.  
3. **`CheckCountryCodes2Validity()`**: Validates country codes.  

Below, you will find the code for each function and an explanation of its key aspects.  

**Important Notes:**  
- For the **`CheckCitiesValidity()`** and **`CheckCountryCodes2Validity()`** functions, we used the **second dataset** (the larger one) to achieve the best results possible, ensuring the validation of the highest number of cities.  
- For the **`CheckCountriesValidity()`** function, we had to use the **first dataset** because the second one does not contain country name information.

In [15]:
# Function used to check the validity of city names
def CheckCitiesValidity(ls_cities_to_check):
    
    """
    Arguments:
    ls_cities_to_check: a list of cities to check.

    Mid-output (printed) --> Results:
        - Number of checked cities.
        - List of non-admitted cities.
        - Count of non-admitted cities.
        - Count of admitted cities.
        - Proportion (%) of non-admitted cities.

    Final output (returned) --> findNonAdmittedCities: a list containing non-admitted cities.
    """
    
    # Reference set used for the validation
    #admittedCities_set = set(worldcities["city"]) # Alternative: validate the cities against the first dataset
    admittedCities_set = set(GEODATASOURCE["FULL_NAME_ND"]) # Used because it contains more cities

    # Find the non-admitted cities (set membership keeps each lookup fast)
    findNonAdmittedCities = [city for city in ls_cities_to_check if city not in admittedCities_set]

    # Count the number of non-admitted cities
    NonAdmmitted_count = len(findNonAdmittedCities)

    # Print the results (mid-output)
    print("Results:")
    print("Number of Checked Cities:", len(ls_cities_to_check))
    print("Non Admmitted Cities Preview:", findNonAdmittedCities)
    print("Count of Non Admmitted Cities:", NonAdmmitted_count)
    print("Count of Admmitted Cities:", (len(ls_cities_to_check)-NonAdmmitted_count))
    print("Proportion (%) of Non Admitted Cities:", round(((NonAdmmitted_count/len(ls_cities_to_check))*100), 4))

    # Final output
    return findNonAdmittedCities 


In [16]:
# Function used to check the validity of country names
def CheckCountriesValidity(ls_countries_to_check):

    """
    Arguments:
    ls_countries_to_check: a list of countries to check.

    Mid-output (printed) --> Results:
        - Number of checked countries.
        - List of non-admitted countries.
        - Count of non-admitted countries.
        - Count of admitted countries.
        - Proportion (%) of non-admitted countries.

    Final output (returned) --> findNonAdmittedCountries: a list containing non-admitted countries.
    """

    # Reference set used for the validation
    admittedCountries_set = set(worldcities["country"]) # The only dataset of the two that contains country names

    # Find the non-admitted countries
    findNonAdmittedCountries = [country for country in ls_countries_to_check if country not in admittedCountries_set]

    # Count the number of non-admitted countries
    NonAdmmitted_count = len(findNonAdmittedCountries)

    # Print the results (mid-output)
    print("Results:")
    print("Number of Checked Countries:", len(ls_countries_to_check))
    print("Non Admmitted Countries Preview:", findNonAdmittedCountries)
    print("Count of Non Admmitted Countries:", NonAdmmitted_count)
    print("Count of Admmitted Countries:", (len(ls_countries_to_check)-NonAdmmitted_count))
    print("Proportion (%) of Non Admitted Countries:", round(((NonAdmmitted_count/len(ls_countries_to_check))*100), 4))

    # Final output
    return findNonAdmittedCountries

In [17]:
# Function used to check the validity of the two-letter country codes (iso2)
def CheckCountryCodes2Validity(ls_CountryCodes2_to_check):
    
    """
    Arguments:
    ls_CountryCodes2_to_check: a list of two-letter country codes (iso2) to check.

    Mid-output (printed) --> Results:
        - Number of checked country codes (iso2).
        - List of non-admitted country codes (iso2).
        - Count of non-admitted country codes (iso2).
        - Count of admitted country codes (iso2).
        - Proportion (%) of non-admitted country codes (iso2).

    Final output (returned) --> findNonAdmittedCountryCodes2: a list containing non-admitted country codes (iso2).
    """
    
    # Reference set used for the validation
    #admittedCountryCodes2_set = set(worldcities["iso2"]) # Alternative: validate the country codes against the first dataset
    admittedCountryCodes2_set = set(GEODATASOURCE["CC_ISO"]) # Used because it contains more country codes (iso2)

    # Find the non-admitted country codes (iso2)
    findNonAdmittedCountryCodes2 = [CountryCode2 for CountryCode2 in ls_CountryCodes2_to_check if CountryCode2 not in admittedCountryCodes2_set]

    # Count the number of non-admitted country codes (iso2)
    NonAdmmitted_count = len(findNonAdmittedCountryCodes2)

    # Print the results (mid-output)
    print("Results:")
    print("Number of Checked Country Codes2 (or iso2):", len(ls_CountryCodes2_to_check))
    print("Non Admmitted Country Codes2 (or iso2) Preview:", findNonAdmittedCountryCodes2)
    print("Count of Non Admmitted Country Codes2 (or iso2):", NonAdmmitted_count)
    print("Count of Admmitted Country Codes2 (or iso2):", (len(ls_CountryCodes2_to_check)-NonAdmmitted_count))
    print("Proportion (%) of Non Admitted Country Codes2 (or iso2):", round(((NonAdmmitted_count/len(ls_CountryCodes2_to_check))*100), 4))

    # Final output
    return findNonAdmittedCountryCodes2
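
The three functions validate each attribute on its own, so a city name passes as long as it exists in any country. A possible extension, shown here only as a sketch, is to validate the (country code, city) pair against the reference data. `find_unmatched_pairs` and the toy frames are hypothetical; the column names `CC_ISO` and `FULL_NAME_ND` are those of the GeoDataSource file shown earlier.

```python
import pandas as pd


def find_unmatched_pairs(pairs, reference, code_col="CC_ISO", city_col="FULL_NAME_ND"):
    """Return the (country code, city) pairs that do not appear together in `reference`."""
    admitted = set(zip(reference[code_col], reference[city_col]))
    return [pair for pair in pairs if pair not in admitted]


# Toy reference table with the same columns as the GeoDataSource file (made-up rows)
reference = pd.DataFrame({
    "CC_ISO": ["AD", "AD", "IT"],
    "FULL_NAME_ND": ["Aixirivall", "Andorra la Vella", "Milano"],
})

# Pairs to validate, e.g. built with zip(data["att_country_code2"], data["att_city"])
pairs = [("AD", "Andorra la Vella"), ("IT", "Andorra la Vella"), ("IT", "Milano")]
unmatched = find_unmatched_pairs(pairs, reference)
print(unmatched)
```

A pair check is stricter than the name-only check, so it would be expected to report at least as many non-admitted entries.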

#### 5.3. Check Cities Validity

The first two code chunks highlight the following:  
- When **considering the unique values** of cities, 999 cities cannot be validated.  
    - With a total of 6665 unique cities, this means that approximately 15% of our unique cities cannot be validated (cannot be admitted as cities).
- Looking at the **results in general**, including all occurrences of cities **(duplicates included)**, this corresponds to:  
    - 11,142 non-admitted cities.  
    - About 14% of the total.  

$\implies$ Further analysis is needed, along with possible correction, processing, or cleaning efforts to improve the quality of these cities.

In [18]:
# List of all the unique cities (for both Hackers and Targets) in our dataset (cleaned_data.csv)
ls_city = list(set(data["att_city"].unique()) | set(data["vic_city"].unique()))

# Check the validity of the cities and extract the list of non-admitted cities
NonAdmittedCities = CheckCitiesValidity(ls_city) # N.B.: this list contains only unique values (no duplicates)

Results:
Number of Checked Cities: 6665
Non Admmitted Cities Preview: ['Aksayskiy Rayon', 'Aqqayiñ audani', 'Kreisfreie Stadt München', 'Paris Township', 'João Pessoa', 'Fréjus', 'Qizilzhar audani', 'Hsinchu City', 'Goianésia do Pará', 'Hünenberg', 'Asterousioi', 'Nouméa', 'Togliatti', 'Neckargemünd', 'Kreisfreie Stadt Berlin', 'Paju-si', 'Salmiya', 'Charleville-Mézières', 'Capital Township', 'Tel Aviv-Yafo', 'Ribeirão', 'Alcalá de Henares', 'Ishøj', 'Catalão', 'Saint Florian', 'Xuyên Mộc', 'Pfaeffikon / Pfaeffikon', 'Vallabh Vidhyanagar', 'Kirovskiy Rayon', 'São José dos Campos', 'Viña del Mar', 'Wereda 02', 'Nan Ning Shi', 'Petrópolis', 'Björklinge', 'Patrocínio', 'Zhunan Township', 'Châteauneuf-sur-Cher', 'St Gallen', 'Ciudad De México', 'Shin-Gye-Dong', 'Bangalore', 'Ponte San Nicolò', 'Portes-lès-Valence', 'Konakovskiy Rayon', "Saint-Martin-d'Hères", 'São Félix do Xingu', 'Muckleneuk', 'Førde', 'Góra', 'Pedersöre', 'São Leopoldo', 'Gießen', 'Taguig City', 'Tanami East', 'FB Area B

In [19]:
# List of all the cities, duplicates included (for both Hackers and Targets), in our dataset (cleaned_data.csv)
att_vic_city = list(data["att_city"]) + list(data["vic_city"])


# Check the validity of the cities and extract the list of non-admitted cities
NonAdmittedCities = CheckCitiesValidity(att_vic_city) # N.B.: this list contains duplicates

Results:
Number of Checked Cities: 79314
Non Admmitted Cities Preview: ['Paranavaí', 'Capital Functional Core Area', 'Chiyoda City', 'Trojanów', 'Taipei City', 'Shen Yang Shi', 'Capital Functional Core Area', 'Nizhny Novgorod', 'Chiyoda City', 'Chiyoda City', 'Jakarta', 'Mt Laurel', 'Jakarta', 'Chang-hua', 'Ciudad De México', 'Chiyoda City', 'Ciudad De México', 'Capital Functional Core Area', 'Buenos Aires City', 'Chiyoda City', 'Schönefeld', 'Chiyoda City', 'Chiyoda City', 'Shao Xing Shi', 'Chiyoda City', 'Villa Constitución', 'Capital Functional Core Area', 'Chiyoda City', 'Chiyoda City', 'Capital Functional Core Area', 'Capital Functional Core Area', 'Capital Functional Core Area', 'Kecamatan Mampang Prapatan', 'Shao Xing Shi', 'Chiyoda City', 'Mangilao', 'Holbæk', 'South Korea', 'Minato City', 'Großmehring', 'Capital Functional Core Area', 'Capital Functional Core Area', 'Buenos Aires City', 'Chiyoda City', 'Chiyoda City', 'Ciudad de Guatemala', 'Valencina de la Concepción', 'Nowy 

A city may fail validation for several reasons, including:  
1. Human errors during the development of the API used to extract this information.  
2. Other potential errors related to the API functionality.  
3. Cities being recognized in different ways (e.g., *Mexico City* and *Ciudad de México* or *Taipei City* and *Taipei*).  
4. Cities that cannot be validated because we are using a free version of the dataset, which does not include all possible city names worldwide (a paid subscription is required to access more cities).  

**Our Approach:** 
1. First, we perform a count of non-admitted cities stratified by city. Essentially, the following code shows the count of each city in our dataset.  
2. We know that our final objective is to explore the data through an Exploratory Data Analysis (EDA) and answer the research questions set at the beginning of this project.  
3. The dataset we are using for this specific task is fixed and will not be updated in real time or otherwise.  
4. Our goal in this phase is not to achieve 100% data quality but to reach a sufficient level of quality (a level where a small amount of insignificant dirty data does not pose a real problem for our final conclusions). This ensures that we can provide reliable results.  
5. For these reasons, we decided to address (i.e., clean, process, and fix) the issue only for the top 10 most frequent sources of "dirty" data (non-admitted cities).  

By cleaning the most significant sources of dirty data, we expect to achieve good data quality for the cities, even if we do not resolve the issue for every single city but only for the top 10 most frequent ones.  

In [20]:
# Count the occurrences of each non-admitted city to find the top 10 sources of "dirty" data

NonAdmittedCities_df = pd.DataFrame(Counter(NonAdmittedCities).items(), columns = ["city", "count"])

# Sort the dataframe by descending count and show the 10 most frequent cities
NonAdmittedCities_df.sort_values(by = "count", ascending = False).head(10)

| | city | count |
| --- | --- | --- |
| 2 | Chiyoda City | 2659 |
| 1 | Capital Functional Core Area | 1545 |
| 19 | Minato City | 752 |
| 4 | Taipei City | 617 |
| 8 | Mt Laurel | 374 |
| 11 | Buenos Aires City | 294 |
| 30 | Bogota, D.C. | 270 |
| 10 | Ciudad De México | 218 |
| 7 | Jakarta | 168 |
| 46 | Quận Đống Đa | 146 |


**Predicting Final Results After Cleaning the Top 10 Sources of Dirty Data**

To proceed, let's estimate the expected result if we clean the top 10 most frequent sources of "dirty" data for city validity:  
1. **Initial total of cities to check**: We know the total number of cities to check (considering both *Hackers* and *Targets*) in our data is **79,314**.  
2. **Next step**: Count the number of cities belonging to the top 10 non-admitted cities (this represents the amount of dirty data that would be dropped).  
3. **Remove this amount** from the total number of non-admitted cities.  
4. **Calculate the final proportion** by dividing this difference by the initial total number of cities (79,314).  
5. The result is the expected proportion of non-admitted cities **after** cleaning the top 10 most frequent sources of dirty data.  

**Formula to Calculate the Proportion of Non-Admitted Cities**  
The formula for calculating the proportion of non-admitted cities after cleaning is:  

$$
P_{\text{non-admitted}} = \frac{C_{\text{total non-admitted}} - C_{\text{top 10 dirty}}}{C_{\text{initial total}}}
$$

Where:  
- $C_{\text{total non-admitted}}$: Total number of non-admitted cities before cleaning.  
- $C_{\text{top 10 dirty}}$: Number of non-admitted cities from the top 10 dirty sources.  
- $C_{\text{initial total}}$: Total number of cities to check (in this case, 79,314).  

In [21]:
TotCitiesToCheck = len(att_vic_city) # Total number of cities to check (79,314 in this run)
TotNonAdmittCities = len(NonAdmittedCities) # Total number of non-admitted cities before cleaning (11,142 in this run)
# Number of non-admitted cities covered by the top 10 dirty sources (sum only the "count" column)
NonAdmittCitiesTop10 = NonAdmittedCities_df.sort_values(by = "count", ascending = False)["count"].head(10).sum()

ExpectedProp = round(((TotNonAdmittCities-NonAdmittCitiesTop10)/TotCitiesToCheck)*100, 4)
print("The expected Proportion (%) of Non Admitted Cities after cleaning is:", ExpectedProp)

The expected Proportion (%) of Non Admitted Cities after cleaning is: 5.1681


As the results above show, after cleaning the top 10 sources of "dirty" data (the top 10 non-admitted cities), we expect only about 5% of our cities to remain unvalidated. This is a good result in terms of data quality, because it lets us expect reliable results from the EDA. In other words, this residual 5% of dirty data should not jeopardize our analysis or conclusions.
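
The choice of the top 10 can be examined by computing the same proportion for several cut-offs. The sketch below applies the formula above to a generic cut-off `k`; `remaining_share_after_top_k` and the toy list are hypothetical and stand in for `NonAdmittedCities` and the total of 79,314.

```python
from collections import Counter


def remaining_share_after_top_k(non_admitted, total_checked, k):
    """Percent of all checked entries still non-admitted after fixing the k most frequent names."""
    counts = Counter(non_admitted)
    fixed = sum(count for _, count in counts.most_common(k))
    return (len(non_admitted) - fixed) / total_checked * 100


# Toy data: 10 non-admitted entries out of 100 checked (made up for illustration)
toy_non_admitted = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
toy_total = 100

shares = {k: remaining_share_after_top_k(toy_non_admitted, toy_total, k) for k in range(4)}
print(shares)
```

With the real data, plotting this share against `k` would show how quickly the benefit of cleaning additional cities levels off.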

##### 5.3.1. Assessment for Capital Functional Core Area

The **Capital Functional Core Area (CFCA)** typically refers to the central zone of a country's capital city that hosts key governmental, administrative, economic, and cultural functions. It is a concept often used in urban planning and geography to delineate the most important and functional areas of a capital city. Below are some key features of a CFCA:

**Key Characteristics:**
1. **Government Institutions:**
   - Headquarters of the executive, legislative, and judicial branches of the government.
   - Embassies, consulates, and other diplomatic offices.
   - Ministries and government departments.

2. **Economic Activities:**
   - Concentration of major businesses, financial institutions, and corporate headquarters.
   - Central Business District (CBD), often overlapping with the CFCA.
   - Areas for high-level commerce and trade.

3. **Cultural and Historical Significance:**
   - Prominent museums, theaters, and cultural institutions.
   - Heritage sites, monuments, and landmarks of national importance.

4. **Transportation Hubs:**
   - Centralized public transportation networks (bus, metro, trains).
   - Major roads, highways, and possibly airports serving the core.

5. **Population and Urban Density:**
   - High density of workers, residents, and visitors.
   - Diverse demographics, often with a mix of locals and international residents.

6. **Urban Design and Infrastructure:**
   - Planned urban layout with prominent squares, parks, and green spaces.
   - Iconic architecture and high-rise buildings.

**Examples:**
- **Washington, D.C. (USA):** The National Mall and surrounding areas with the White House, Capitol Building, Supreme Court, and major monuments.
- **London (UK):** Westminster, housing Parliament, Buckingham Palace, and other governmental functions.
- **New Delhi (India):** Central Vista, including the Rashtrapati Bhavan (Presidential Palace), Parliament House, and India Gate.

**Importance of CFCA:**
- **Governance:** Acts as the nerve center for decision-making and administration.
- **Economic Impact:** Drives a significant portion of the country's economy through commerce and trade activities.
- **Cultural Representation:** Reflects national identity and heritage.
- **Tourism:** Attracts visitors, contributing to the local and national economy.

In urban planning, the Capital Functional Core Area (CFCA) is often governed by specific zoning and development regulations to maintain its vital functions and significance. In our case, a city labeled as CFCA implies the following:

- **For Hacker City (variable `att_city`)**: If the attack originates from a CFCA, it means it was launched from a central zone of a country's capital city that hosts essential governmental, administrative, economic, and cultural activities.  
- **For Target City (variable `vic_city`)**: If the attack is aimed at a CFCA, it targets such a central zone.

This distinction provides valuable insights for our analysis, so it is important to retain this information.

**Proposed Approach:**
1. Create an updated version of the cleaned dataset (from `cleaned_data.csv` to `data_cleaned_V2`) with two new variables:
   - `is_att_city_CFCA`:
     - **Yes** if `att_city` (Hacker City) equals "Capital Functional Core Area."
     - **No** otherwise.
   - `is_vic_city_CFCA`:
     - **Yes** if `vic_city` (Target City) equals "Capital Functional Core Area."
     - **No** otherwise.

   Adding these variables ensures that we preserve this information for subsequent steps.

2. Replace occurrences of `"Capital Functional Core Area"` in `att_city` and `vic_city` with the corresponding country's capital city. This adjustment allows us to validate all the cities labeled as "Capital Functional Core Area," which would otherwise remain ambiguous.

In [22]:
# Step 1: Create an updated version of the cleaned dataset (from `cleaned_data.csv` to `data_cleaned_V2`) with two new variables for CFCA status:

# Work on a copy so that the new columns and replacements do not alter the original `data`
data_cleaned_V2 = data.copy()

data_cleaned_V2["is_att_city_CFCA"] = data_cleaned_V2["att_city"].apply(lambda x: "Yes" if x == "Capital Functional Core Area" else "No")
data_cleaned_V2["is_vic_city_CFCA"] = data_cleaned_V2["vic_city"].apply(lambda x: "Yes" if x == "Capital Functional Core Area" else "No")
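One point about Step 1 is worth illustrating on toy data: in pandas, a plain assignment such as `new = old` binds a second name to the same DataFrame, so columns added through one name also appear under the other, whereas `.copy()` yields an independent frame. The sketch below is only an illustration under that assumption; the names `toy`, `alias`, and `independent` are hypothetical, and `np.where` is used as a vectorized alternative to the row-wise lambda.

```python
import numpy as np
import pandas as pd

CFCA = "Capital Functional Core Area"
toy = pd.DataFrame({"att_city": [CFCA, "Rome"]})

# Plain assignment: `alias` and `toy` are the same object
alias = toy
alias["is_att_city_CFCA"] = np.where(alias["att_city"] == CFCA, "Yes", "No")
alias_modified_toy = "is_att_city_CFCA" in toy.columns

# Start again from a frame without the flag, this time working on a copy
toy = toy.drop(columns="is_att_city_CFCA")
independent = toy.copy()
independent["is_att_city_CFCA"] = np.where(independent["att_city"] == CFCA, "Yes", "No")
copy_modified_toy = "is_att_city_CFCA" in toy.columns

print(alias_modified_toy, copy_modified_toy)
```

Keeping the source frame untouched makes it easier to compare the dataset before and after the CFCA processing.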

In [23]:
# Distribution of is_att_city_CFCA
data_cleaned_V2["is_att_city_CFCA"].value_counts()

is_att_city_CFCA
No     38885
Yes      772
Name: count, dtype: int64

In [24]:
# Distribution of is_vic_city_CFCA
data_cleaned_V2["is_vic_city_CFCA"].value_counts()

is_vic_city_CFCA
No     38884
Yes      773
Name: count, dtype: int64

In [25]:
# Dictionary mapping countries to their capital cities

# Filter only rows where the column 'capital' equals "primary"
# This step ensures we only consider primary capital cities for each country
capital_cities = worldcities[worldcities["capital"] == "primary"]

# Create a dictionary mapping each country to its corresponding capital city
# Use the 'zip' function to pair country names (keys) with city names (values)
# The 'dict' function converts these pairs into a dictionary
country_capital_mapping = dict(zip(capital_cities["country"], capital_cities["city"]))

# Print the resulting dictionary to verify the mapping
print(country_capital_mapping)


{'Japan': 'Tokyo', 'Indonesia': 'Jakarta', 'Philippines': 'Manila', 'Korea, South': 'Seoul', 'Mexico': 'Mexico City', 'Egypt': 'Cairo', 'Bangladesh': 'Dhaka', 'China': 'Beijing', 'Thailand': 'Bangkok', 'Russia': 'Moscow', 'Argentina': 'Buenos Aires', 'Iran': 'Tehran', 'Congo (Kinshasa)': 'Kinshasa', 'United Kingdom': 'London', 'France': 'Paris', 'Peru': 'Lima', 'Angola': 'Luanda', 'Malaysia': 'Putrajaya', 'Vietnam': 'Hanoi', 'Colombia': 'Bogotá', 'Sudan': 'Khartoum', 'Hong Kong': 'Hong Kong', 'Saudi Arabia': 'Riyadh', 'Chile': 'Santiago', 'Spain': 'Madrid', 'Iraq': 'Baghdad', 'Singapore': 'Singapore', 'Kenya': 'Nairobi', 'Turkey': 'Ankara', 'Burma': 'Nay Pyi Taw', 'United States': 'Washington', 'Côte d’Ivoire': 'Yamoussoukro', 'Germany': 'Berlin', 'South Africa': 'Bloemfontein', 'Afghanistan': 'Kabul', 'Mali': 'Bamako', 'Jordan': 'Amman', 'Nigeria': 'Abuja', 'Algeria': 'Algiers', 'Greece': 'Athens', 'Ethiopia': 'Addis Ababa', 'Guatemala': 'Guatemala City', 'Kuwait': 'Kuwait City', 'Hun

In [26]:
# Step 2: Replace "Capital Functional Core Area" with the country's capital city

# For Hacker City "att_city"
data_cleaned_V2["att_city"] = data_cleaned_V2.apply(
    lambda row: country_capital_mapping[row["att_country_name"]]
    if row["att_city"] == "Capital Functional Core Area" else row["att_city"],
    axis=1
)

# For Target City "vic_city"
data_cleaned_V2["vic_city"] = data_cleaned_V2.apply(
    lambda row: country_capital_mapping[row["vic_country_name"]]
    if row["vic_city"] == "Capital Functional Core Area" else row["vic_city"],
    axis=1
)
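The dictionary lookup in Step 2 raises a `KeyError` if a CFCA row belongs to a country whose name is not a key of the mapping (for instance, a name spelled differently from `worldcities`). A more defensive variant is sketched below on toy data; `replace_cfca` and the toy mapping are hypothetical names introduced for illustration, not part of the notebook.

```python
import pandas as pd

CFCA = "Capital Functional Core Area"

def replace_cfca(df, city_col, country_col, capital_by_country):
    """Return the city column with CFCA labels replaced by the country's capital.

    Rows whose country is missing from the mapping keep the CFCA label
    instead of raising a KeyError.
    """
    is_cfca = df[city_col].eq(CFCA)
    capitals = df[country_col].map(capital_by_country)  # NaN when country unknown
    # Series.where keeps values where the condition is True and takes `capitals` elsewhere
    return df[city_col].where(~(is_cfca & capitals.notna()), capitals)

toy = pd.DataFrame({
    "att_city": [CFCA, "Rome", CFCA],
    "att_country_name": ["Japan", "Italy", "South Korea"],
})
toy_capitals = {"Japan": "Tokyo", "Italy": "Rome"}  # "South Korea" deliberately absent

result = replace_cfca(toy, "att_city", "att_country_name", toy_capitals)
print(result.tolist())
```

Rows left with the CFCA label can then be inspected separately rather than stopping the whole cell.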

**Check Cities Validity after assessment for Capital Functional Core Area (CFCA)**

By running the code below, we can observe that the cleaning and processing approach applied to cities labeled as "Capital Functional Core Area" has reduced the proportion of non-admitted cities from 14% to 12%. This indicates a slight improvement in data quality. 
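As a rough consistency check on this reduction, the CFCA counts reported earlier (772 attacker rows and 773 target rows) can be compared with the total number of city entries (two per row, over 39,657 rows). The sketch only redoes that arithmetic; it assumes every substituted capital is recognized by the validation dataset, so the figure is an upper bound on the improvement from this step rather than an exact prediction.

```python
n_rows = 39_657          # rows in the cleaned dataset
cfca_att = 772           # rows with att_city labeled as CFCA
cfca_vic = 773           # rows with vic_city labeled as CFCA

total_city_entries = 2 * n_rows                      # attacker + target cities
cfca_share = 100 * (cfca_att + cfca_vic) / total_city_entries

print(f"CFCA share of city entries: {cfca_share:.2f}%")
```

A share of roughly 2 percentage points is in line with the reported move from 14% to 12%.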

It is worth noting that by introducing the two new variables, `is_att_city_CFCA` and `is_vic_city_CFCA`, we retained the information related to cities labeled as "Capital Functional Core Area." This was achieved even though, in Step 2, we replaced the label "Capital Functional Core Area" with the corresponding country's capital city for the variables `att_city` and `vic_city`. This approach allowed us to assess and enhance data quality without losing any crucial information.

The next step involves assessing the validity of the following cities: Chiyoda, Minato, Taipei, and Buenos Aires.

In [27]:
# A list containing every city entry, duplicates included (both for Hackers and Targets), in our dataset (data_cleaned_V2)
att_vic_city = list(data_cleaned_V2["att_city"]) + list(data_cleaned_V2["vic_city"])


# Check the cities' validity and extract the list of non-admitted cities
NonAdmittedCities = CheckCitiesValidity(att_vic_city) # N.B.: THIS LIST CONTAINS DUPLICATES

Results:
Number of Checked Cities: 79314
Non Admmitted Cities Preview: ['Paranavaí', 'Chiyoda City', 'Trojanów', 'Taipei City', 'Shen Yang Shi', 'Nizhny Novgorod', 'Chiyoda City', 'Chiyoda City', 'Jakarta', 'Mt Laurel', 'Jakarta', 'Chang-hua', 'Ciudad De México', 'Chiyoda City', 'Ciudad De México', 'Buenos Aires City', 'Chiyoda City', 'Schönefeld', 'Chiyoda City', 'Chiyoda City', 'Shao Xing Shi', 'Chiyoda City', 'Villa Constitución', 'Chiyoda City', 'Chiyoda City', 'Kecamatan Mampang Prapatan', 'Shao Xing Shi', 'Chiyoda City', 'Mangilao', 'Holbæk', 'South Korea', 'Minato City', 'Großmehring', 'Buenos Aires City', 'Chiyoda City', 'Chiyoda City', 'Ciudad de Guatemala', 'Valencina de la Concepción', 'Nowy Tomyśl', 'Chiyoda City', 'University of the Western Cape', 'Minato City', 'dongmenli', 'Mt Laurel', 'Ciudad De México', 'Persequor', 'Chiyoda City', 'Chiyoda City', 'Minato City', 'Minato City', 'Kiev', 'Damüls', 'Douglas Township', 'Buenos Aires City', 'Bogota, D.C.', 'Chiyoda City', 'K

##### 5.3.2. Assessment for Chiyoda City, Minato City, Taipei City, Buenos Aires City

As mentioned at the beginning of this chapter, there are various reasons why a city cannot be validated. One of these reasons is not necessarily related to the city being invalid but rather to discrepancies in how the city name is written compared to its officially recognized name. For instance, in our dataset, five of the top 10 sources of "dirty" (non-admitted) cities are due to such discrepancies. 

For example:
- **Chiyoda City** and **Chiyoda** are the same, but when using the GEODATASOURCE dataset for validation, only **Chiyoda** is recognized as the official name, even though both names are valid.
- Similar cases include **Minato City** and **Minato**, **Taipei City** and **Taipei**, and **Buenos Aires City** and **Buenos Aires**.

However, this does not hold for every city name with a similar structure. For instance, **México City** and **México** do not refer to the same place. This is another reason why we decided to clean and process only the top 10 non-admitted cities: applying such a rule universally risks introducing more errors and biases, which would not improve our data quality.

In [28]:
# Dictionary mapping each non-admitted city name to its admitted form (e.g. Chiyoda City --> Chiyoda)

# {non-admitted city name: admitted city name}
city_mapping = {"Chiyoda City": "Chiyoda",  
                "Minato City": "Minato",
                "Taipei City": "Taipei",
                "Buenos Aires City": "Buenos Aires"}

In [29]:
# Replace non admitted city names with the corresponding admitted city names

# For Hacker City "att_city"
data_cleaned_V2["att_city"] = data_cleaned_V2.apply(
    lambda row: city_mapping[row["att_city"]]
    if row["att_city"] in city_mapping else row["att_city"],
    axis=1
)

# For Target City "vic_city"
data_cleaned_V2["vic_city"] = data_cleaned_V2.apply(
    lambda row: city_mapping[row["vic_city"]]
    if row["vic_city"] in city_mapping else row["vic_city"],
    axis=1
)
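Since the replacement depends only on the city value itself, the row-wise `apply` can also be expressed with pandas' built-in `replace`, which accepts a `{old: new}` dictionary and matches whole values exactly. The snippet below is a minimal sketch on toy data; `toy` and `city_mapping_demo` are illustrative names, not objects from the notebook.

```python
import pandas as pd

city_mapping_demo = {"Chiyoda City": "Chiyoda", "Minato City": "Minato"}

toy = pd.DataFrame({
    "att_city": ["Chiyoda City", "Osaka"],
    "vic_city": ["Minato City", "Chiyoda City"],
})

# Replace exact matches in both city columns at once; other values are left untouched
city_cols = ["att_city", "vic_city"]
toy[city_cols] = toy[city_cols].replace(city_mapping_demo)

print(toy)
```

On a dataset of this size both approaches give the same result; the vectorized form is mainly shorter and avoids a Python-level loop over rows.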

**Check Cities Validity After Replacing Non-Admitted City Names with Corresponding Admitted City Names**

As shown below, this additional cleaning step reduces the proportion of non-admitted cities from 12% to 6.6%. Next, we address a similar issue involving the cities Mt Laurel, Bogota, D.C., and Ciudad De México.

In [30]:
# A list containing every city entry, duplicates included (both for Hackers and Targets), in our dataset (data_cleaned_V2)
att_vic_city = list(data_cleaned_V2["att_city"]) + list(data_cleaned_V2["vic_city"])


# Check the cities' validity and extract the list of non-admitted cities
NonAdmittedCities = CheckCitiesValidity(att_vic_city) # N.B.: THIS LIST CONTAINS DUPLICATES

Results:
Number of Checked Cities: 79314
Non Admmitted Cities Preview: ['Paranavaí', 'Trojanów', 'Shen Yang Shi', 'Nizhny Novgorod', 'Jakarta', 'Mt Laurel', 'Jakarta', 'Chang-hua', 'Ciudad De México', 'Ciudad De México', 'Schönefeld', 'Shao Xing Shi', 'Villa Constitución', 'Kecamatan Mampang Prapatan', 'Shao Xing Shi', 'Mangilao', 'Holbæk', 'South Korea', 'Großmehring', 'Ciudad de Guatemala', 'Valencina de la Concepción', 'Nowy Tomyśl', 'University of the Western Cape', 'dongmenli', 'Mt Laurel', 'Ciudad De México', 'Persequor', 'Kiev', 'Damüls', 'Douglas Township', 'Bogota, D.C.', 'Kiev', 'Petah Tikva', 'Mt Laurel', 'Vällingby', 'Jonkoping', 'Anápolis', 'Jakarta', 'Lomonosovskiy Rayon', 'Marataízes', 'Fällanden', 'Sixth of October', 'Bogota, D.C.', 'First 6th of October', 'Bogota, D.C.', "St John's", 'Mumbai City', 'Shen Yang Shi', 'Törökbálint', 'Eemshaven', 'Maxéville', 'Jakarta Pusat', 'Quận Đống Đa', 'El Qahera El Gididaa', 'București', 'Mt Laurel', 'Kecamatan Mampang Prapatan', 'M

##### 5.3.3. Assessment for Mt Laurel, Bogota, D.C., and Ciudad De México

In this step, we focus on correcting the mapping of city names that are written "incorrectly" by replacing them with their proper, officially recognized forms. Specifically, we will map the cities of interest as follows:  
- **Mt Laurel** → **Mount Laurel**  
- **Bogota, D.C.** → **Bogota**  
- **Ciudad De México** → **Mexico City**

This is the same approach we applied earlier to Chiyoda, Minato, Taipei, and Buenos Aires.

In [31]:
# Dictionary mapping each non-admitted city name to its admitted form (e.g. Mt Laurel --> Mount Laurel)

# {non-admitted city name: admitted city name}
city_mapping_2 = {"Mt Laurel": "Mount Laurel",
                  "Bogota, D.C.": "Bogota",
                  "Ciudad De México": "Mexico City"}

In [32]:
# Replace non admitted city names with the corresponding admitted city names

# For Hacker City "att_city"
data_cleaned_V2["att_city"] = data_cleaned_V2.apply(
    lambda row: city_mapping_2[row["att_city"]]
    if row["att_city"] in city_mapping_2 else row["att_city"],
    axis=1
)

# For Target City "vic_city"
data_cleaned_V2["vic_city"] = data_cleaned_V2.apply(
    lambda row: city_mapping_2[row["vic_city"]]
    if row["vic_city"] in city_mapping_2 else row["vic_city"],
    axis=1
)

**Check Cities Validity After Replacing Non-Admitted City Names with Corresponding Admitted City Names**

As shown in the code and results below, this final step has reduced the proportion of non-admitted cities to **5.6%**. This represents a significant improvement in data quality, especially considering that at the beginning of this process, the proportion of non-admitted cities was approximately **14%**. 

Earlier in this analysis, we estimated that addressing the top 10 sources of "dirty" data would bring the proportion of non-admitted cities down to around **5.1681%**. The observed result (**5.564%**) is close to that estimate, about 0.4 percentage points higher.

Given this outcome, we conclude the data quality improvement process for city names here. With only about **5.6%** of "dirty" data remaining, the overall quality of the dataset is unlikely to jeopardize the final conclusions regarding general data quality or the insights we will derive during the Exploratory Data Analysis (EDA).

In [33]:
# A list containing every city entry, duplicates included (both for Hackers and Targets), in our dataset (data_cleaned_V2)
att_vic_city = list(data_cleaned_V2["att_city"]) + list(data_cleaned_V2["vic_city"])

# Check the cities' validity and extract the list of non-admitted cities
NonAdmittedCities = CheckCitiesValidity(att_vic_city) # N.B.: THIS LIST CONTAINS DUPLICATES

Results:
Number of Checked Cities: 79314
Non Admmitted Cities Preview: ['Paranavaí', 'Trojanów', 'Shen Yang Shi', 'Nizhny Novgorod', 'Jakarta', 'Jakarta', 'Chang-hua', 'Schönefeld', 'Shao Xing Shi', 'Villa Constitución', 'Kecamatan Mampang Prapatan', 'Shao Xing Shi', 'Mangilao', 'Holbæk', 'South Korea', 'Großmehring', 'Ciudad de Guatemala', 'Valencina de la Concepción', 'Nowy Tomyśl', 'University of the Western Cape', 'dongmenli', 'Persequor', 'Kiev', 'Damüls', 'Douglas Township', 'Kiev', 'Petah Tikva', 'Vällingby', 'Jonkoping', 'Anápolis', 'Jakarta', 'Lomonosovskiy Rayon', 'Marataízes', 'Fällanden', 'Sixth of October', 'First 6th of October', "St John's", 'Mumbai City', 'Shen Yang Shi', 'Törökbálint', 'Eemshaven', 'Maxéville', 'Jakarta Pusat', 'Quận Đống Đa', 'El Qahera El Gididaa', 'București', 'Kecamatan Mampang Prapatan', 'Akita Shi', 'Río Cuarto', 'El Qahera El Gididaa', 'Shen Yang Shi', 'dongmenli', 'Benito Juárez', 'First 6th of October', 'First 6th of October', 'Córdoba', 'Para

#### 5.4. Check Countries Validity

As shown by the code and results below:  
- Of the **197 unique countries** in our dataset (`data_cleaned_V2`), **15** (about 7.6%) cannot be validated: their names do not match any entry in the `worldcities` dataset. Note that we are using the free version of the `worldcities` dataset, which may lack some details available in the paid version.  
- If we consider all occurrences of country names in our dataset (both for Hackers and Targets), we are unable to validate **3.5%** of the total countries. Specifically, out of **79,314 checked country entries** (including duplicates), **2,769 entries** (3.5%) cannot be validated.  

From further analysis, we found that **2,474 out of 2,769 non-admitted entries** stem from a single issue: the country name is written as **"South Korea"** in our dataset, whereas it appears as **"Korea, South"** in the `worldcities` dataset.  

It is evident that resolving this issue would significantly reduce the proportion of non-admitted countries, bringing it close to zero.  

To address this problem, we can follow a similar approach to the one used for cities: map the incorrectly recognized version of the country name to the correct version as recognized by the `worldcities` dataset.
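The expected effect of this single fix can be worked out directly from the counts quoted above. The sketch below only reproduces that arithmetic and assumes that every "South Korea" entry becomes valid once it is renamed.

```python
total_entries = 79_314        # country entries checked (Hackers + Targets)
non_admitted = 2_769          # entries that could not be validated
south_korea = 2_474           # non-admitted entries written as "South Korea"

share_before = 100 * non_admitted / total_entries
share_after = 100 * (non_admitted - south_korea) / total_entries

print(f"Before the fix: {share_before:.2f}%")
print(f"After the fix:  {share_after:.2f}%")
```

Under that assumption, the non-admitted share falls from about 3.5% to about 0.4% of all country entries.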

In [34]:
# A list containing all the unique countries (both for Hackers and Targets) in our dataset (data_cleaned_V2)
ls_countries = list(set(data_cleaned_V2["att_country_name"].unique()) | set(data_cleaned_V2["vic_country_name"].unique()))


# Check the countries' validity and extract the list of non-admitted countries
NonAdmittedCountries = CheckCountriesValidity(ls_countries) # N.B.: THIS LIST CONTAINS ONLY UNIQUE VALUES (NO DUPLICATES)

Results:
Number of Checked Countries: 197
Non Admmitted Countries Preview: ['Macao', 'British Virgin Islands', 'Czech Republic', 'Aland Islands', 'Curacao', 'Bahamas', 'Gambia', 'South Korea', 'East Timor', 'Republic of the Congo', 'Myanmar', 'Bonaire, Saint Eustatius and Saba', 'Democratic Republic of the Congo', "Cote d'Ivoire (Ivory Coast)", 'Palestinian Territory']
Count of Non Admmitted Countries: 15
Count of Admmitted Countries: 182
Proportion (%) of Non Admitted Countries: 7.6142


In [35]:
# A list containing every country entry, duplicates included (both for Hackers and Targets), in our dataset (data_cleaned_V2)
att_vic_countries = list(data_cleaned_V2["att_country_name"]) + list(data_cleaned_V2["vic_country_name"])

# Check the countries' validity and extract the list of non-admitted countries
NonAdmittedCountries = CheckCountriesValidity(att_vic_countries) # N.B.: THIS LIST CONTAINS DUPLICATES

Results:
Number of Checked Countries: 79314
Non Admmitted Countries Preview: ['South Korea', 'Czech Republic', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'Republic of the Congo', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'Czech Republic', 'Czech Republic', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'Gambia', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South Korea', 'South K

In [36]:
# Count the top 10 sources of "dirty" data (non-admitted countries)
NonAdmittedCountries_df = pd.DataFrame(Counter(NonAdmittedCountries).items(), columns = ["country", "count"])

# Sort the dataframe by descending count
NonAdmittedCountries_df.sort_values(by = "count", ascending = False).head(10)

Unnamed: 0,country,count
0,South Korea,2474
1,Czech Republic,195
7,Cote d'Ivoire (Ivory Coast),29
8,Palestinian Territory,23
3,Gambia,10
5,Democratic Republic of the Congo,8
4,Macao,7
2,Republic of the Congo,5
10,Curacao,4
11,Bahamas,4


In [37]:
# Check whether, and under which name, the worldcities dataset recognizes countries such as "South Korea"
list({country for country in worldcities["country"] if "Korea" in country})

['Korea, South', 'Korea, North']

In [38]:
# Dictionary mapping each non-admitted country name to its admitted form

# {non-admitted country name: admitted country name}
country_name_mapping = {"South Korea": "Korea, South"}

# Replace non-admitted country names with the corresponding admitted names

# For Hacker Country "att_country_name"
data_cleaned_V2["att_country_name"] = data_cleaned_V2.apply(
    lambda row: country_name_mapping[row["att_country_name"]]
    if row["att_country_name"] in country_name_mapping else row["att_country_name"],
    axis=1
)

# For Target Country "vic_country_name"
data_cleaned_V2["vic_country_name"] = data_cleaned_V2.apply(
    lambda row: country_name_mapping[row["vic_country_name"]]
    if row["vic_country_name"] in country_name_mapping else row["vic_country_name"],
    axis=1
)

**Check Countries Validity After Replacing Non-Admitted Country Names with Corresponding Admitted Country Names**

As expected, after replacing the primary source of inconsistencies in country names (the incorrectly labeled **"South Korea"**) with the officially recognized name **"Korea, South"** (as per the `worldcities` database), the proportion of non-admitted countries has been reduced to approximately **0.4%**. This represents a highly satisfactory result in terms of data quality and validity for our dataset.

In [39]:
# A list containing every country entry, duplicates included (both for Hackers and Targets), in our dataset (data_cleaned_V2)
att_vic_countries = list(data_cleaned_V2["att_country_name"]) + list(data_cleaned_V2["vic_country_name"])

# Check the countries' validity and extract the list of non-admitted countries
NonAdmittedCountries = CheckCountriesValidity(att_vic_countries) # N.B.: THIS LIST CONTAINS DUPLICATES

Results:
Number of Checked Countries: 79314
Non Admmitted Countries Preview: ['Czech Republic', 'Republic of the Congo', 'Czech Republic', 'Czech Republic', 'Gambia', 'Czech Republic', 'Republic of the Congo', 'Macao', 'Czech Republic', 'Czech Republic', 'Democratic Republic of the Congo', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Bonaire, Saint Eustatius and Saba', "Cote d'Ivoire (Ivory Coast)", 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Palestinian Territory', 'Palestinian Territory', 'Czech Republic', 'Czech Republic', 'Palestinian Territory', 'Czech Republic', "Cote d'Ivoire (Ivory Coast)", 'Czech Republic', 'Czech Republic', 'Czech Republic', 'East Timor', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', "Cote d'Ivoire (Ivory Coast)", 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Czech Republic', 'Palestinian Territory

#### 5.5. Check Country Codes2 (ISO2) Validity

As shown by the code below, there are no non-admitted country codes2 (ISO2) in our dataset. In other words, the proportion of non-admitted country codes2 (ISO2) is **0%**. Therefore, no further action is required.

In [40]:
# A list containing all the unique country codes2 (both for Hackers and Targets) in our dataset (data_cleaned_V2)
ls_CountryCodes2 = list(set(data_cleaned_V2["att_country_code2"].unique()) | set(data_cleaned_V2["vic_country_code2"].unique()))

# Check the country codes2 validity and extract the list of non-admitted country codes2
NonAdmittedCountryCodes2 = CheckCountryCodes2Validity(ls_CountryCodes2) # N.B.: THIS LIST CONTAINS ONLY UNIQUE VALUES (NO DUPLICATES)

Results:
Number of Checked Country Codes2 (or iso2): 197
Non Admmitted Country Codes2 (or iso2) Preview: []
Count of Non Admmitted Country Codes2 (or iso2): 0
Count of Admmitted Country Codes2 (or iso2): 197
Proportion (%) of Non Admitted Country Codes2 (or iso2): 0.0


In [41]:
# A list containing every country code2 entry, duplicates included (both for Hackers and Targets), in our dataset (data_cleaned_V2)
att_vic_CountryCodes2 = list(data_cleaned_V2["att_country_code2"]) + list(data_cleaned_V2["vic_country_code2"])

# Check the country codes2 validity and extract the list of non-admitted country codes2
# (stored under its own name so that the NonAdmittedCities list is not overwritten)
NonAdmittedCountryCodes2 = CheckCountryCodes2Validity(att_vic_CountryCodes2) # N.B.: THIS LIST CONTAINS DUPLICATES

Results:
Number of Checked Country Codes2 (or iso2): 79314
Non Admmitted Country Codes2 (or iso2) Preview: []
Count of Non Admmitted Country Codes2 (or iso2): 0
Count of Admmitted Country Codes2 (or iso2): 79314
Proportion (%) of Non Admitted Country Codes2 (or iso2): 0.0


#### 5.6. Check Continent Validity

It is not necessary to build any algorithm or function using a specific dataset to verify continent validity, as this can be easily checked manually. As observed, there are no inconsistencies or invalid continent names in our dataset. All continent names in the dataset are evidently valid.

In [42]:
# A list containing all the unique continent names (both for Hackers and Targets) in our dataset (data_cleaned_V2)
list(set(data_cleaned_V2["att_continent_name"].unique()) | set(data_cleaned_V2["vic_continent_name"].unique()))

['Europe', 'Oceania', 'North America', 'Asia', 'South America', 'Africa']
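Although the manual check is sufficient here, the same verification can be written as a set difference, which is convenient if the data is ever refreshed. The reference set below follows the seven-continent convention and is an assumption of this sketch; it is not taken from the validation datasets used earlier.

```python
# Assumed reference set (seven-continent convention)
expected_continents = {
    "Africa", "Antarctica", "Asia", "Europe",
    "North America", "Oceania", "South America",
}

# Unique continent names observed in the dataset (copied from the output above)
observed_continents = ["Europe", "Oceania", "North America", "Asia", "South America", "Africa"]

# Any name outside the reference set would need attention
unexpected_continents = set(observed_continents) - expected_continents

print(unexpected_continents)
```

An empty result confirms that every observed continent name belongs to the reference set.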

--- 
### 6. Update and Save the dataset for the EDA 
#### (cleaned_data.csv $\rightarrow$ cleaned_data_V2.csv )

In [43]:
# Save the updated dataset as a new version
#data_cleaned_V2.to_csv(r"cleaned_data_V2.csv", index=False)

#print("New dataset saved as 'cleaned_data_V2.csv'")
#print(r"Location: write here the location where you have saved the file")

# Load the updated dataset
path = r"cleaned_data_V2.csv" # For Windows users
#path = "cleaned_data_V2.csv" # For macOS users
cleaned_data_V2 = pd.read_csv(path, delimiter = ',')

New dataset saved as 'cleaned_data_V2.csv'
Location: D:\UniVault\UniProjects\DataScience\Year_1\DataMan\data\csv_file


#### A brief exploration of the updated version of the dataset (cleaned_data_V2)

In [44]:
# Shows the first 5 observations
cleaned_data_V2.head()

Unnamed: 0,vic_ip,vic_continent_name,vic_country_code2,vic_country_name,vic_city,vic_latitude,vic_longitude,att_ip,att_continent_name,att_country_code2,...,Alerts/Warnings,Attack Type,Action Taken,Severity Level,Log Source,Browser,Device/OS,Date,is_att_city_CFCA,is_vic_city_CFCA
0,84.9.164.252,Europe,GB,United Kingdom,London,51.50115,-0.09951,103.216.15.12,Asia,CN,...,no,Malware,Logged,Low,Server,Mozilla,Windows,2023-05-30,No,No
1,66.191.137.154,North America,US,United States,Rochester,44.01212,-92.4802,78.199.217.198,Europe,FR,...,no,Malware,Blocked,Low,Firewall,Mozilla,Windows,2020-08-26,No,No
2,198.219.82.17,North America,US,United States,Montgomery,32.40286,-86.24044,63.79.210.48,North America,US,...,yes,DDoS,Ignored,Low,Firewall,Mozilla,Windows,2022-11-13,No,No
3,101.228.192.255,Asia,CN,China,Shanghai,31.23042,121.4737,163.42.196.10,Asia,JP,...,yes,Malware,Blocked,Medium,Firewall,Mozilla,Macintosh,2023-07-02,No,No
4,189.243.174.238,North America,MX,Mexico,Mexico City,19.27774,-99.16447,71.166.185.76,North America,US,...,yes,DDoS,Blocked,Low,Firewall,Mozilla,Windows,2023-07-16,No,No


In [45]:
# See all the columns names
cleaned_data_V2.columns

Index(['vic_ip', 'vic_continent_name', 'vic_country_code2', 'vic_country_name',
       'vic_city', 'vic_latitude', 'vic_longitude', 'att_ip',
       'att_continent_name', 'att_country_code2', 'att_country_name',
       'att_city', 'att_latitude', 'att_longitude', 'att_threat_score',
       'att_is_tor', 'att_is_proxy', 'att_is_anonymous',
       'att_is_known_attacker', 'att_is_spam', 'att_is_bot',
       'att_is_cloud_provider', 'Source IP Address', 'Destination IP Address',
       'Protocol', 'Packet Type', 'Packet Length', 'Traffic Type',
       'Severity Level', 'Log Source', 'Browser', 'Device/OS', 'Date',
       'is_att_city_CFCA', 'is_vic_city_CFCA'],
      dtype='object')

In [46]:
# Rows x Columns
cleaned_data_V2.shape

(39657, 39)

--- 
### 7. Conclusions

In conclusion, after checking, processing, and cleaning our dataset, we find that it presents high data quality in terms of completeness, validity, and consistency. In some cases, we chose to leave a small portion of "dirty" data (about 5.6% of city entries and 0.4% of country entries) to avoid introducing potential bias or additional errors. This remaining share is small and is unlikely to affect the subsequent analysis (EDA). We therefore use the "cleaned_data_V2.csv" dataset for the EDA, with limited risk that data quality issues distort the results.

---
### ⏭️ Next Step: 
$\rightarrow$ Data Preparation (see 5_EDA.ipynb)