# 📝 Data cleaning

## 📊 Overview

As described in the acquisition file of this project, we used **two datasets** for exploratory data analysis (EDA):
1. An **API dataset**, fetched with our API key.  
2. A **Kaggle dataset**, which served as the primary foundation for our analysis.

By combining these datasets, we aimed to build a richer and more informative dataset for our study.

---

## 🌐 API Data Acquisition
- Because the API key has **request limits** and additional requests incur costs 💰, we used the available requests to fetch **all possible data** up front and then filtered it to fit our needs.  
- The fetched data included detailed information such as **IP addresses, locations, threat scores, and attack patterns**, which significantly enriched the Kaggle dataset.
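
The enrichment step itself is not shown in this notebook. As a rough sketch only, assuming the API responses were collected into a table keyed by IP address, they could be attached to the Kaggle rows with a left merge. The frames `kaggle` and `api`, the `ip`, `country_name`, and `threat_score` field names, and all values are hypothetical placeholders (the IPs come from documentation ranges).

```python
import pandas as pd

# Hypothetical stand-ins: the real Kaggle and API tables are not shown here.
kaggle = pd.DataFrame({
    "Source IP Address": ["192.0.2.1", "198.51.100.7", "203.0.113.9"],
    "Protocol": ["TCP", "UDP", "ICMP"],
})
api = pd.DataFrame({
    "ip": ["192.0.2.1", "198.51.100.7"],
    "country_name": ["Country A", "Country B"],
    "threat_score": [10, 3],
})

# Prefix the API fields with att_ so attacker attributes stay distinguishable.
api_att = api.add_prefix("att_")

enriched = kaggle.merge(
    api_att,
    how="left",                  # keep every Kaggle row, even without an API match
    left_on="Source IP Address",
    right_on="att_ip",
    validate="many_to_one",      # raise if the API table repeats an IP
)

print(enriched)
```

Rows without an API match end up with `NaN` in the `att_` columns, which is the kind of gap a later cleaning step has to handle.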

---

## 🛠️ Data Cleaning and Column Selection
During the data cleaning process, we performed the following steps to prepare the dataset for analysis:

### ✅ Columns Retained
We kept only the **most relevant columns** for EDA, chosen by how much they contribute to our planned analysis. These columns provide the insight into patterns and trends that we need to answer our research questions.

### ❌ Columns Dropped
Several columns were removed for the following reasons:
1. **Irrelevance**:  
   - These columns were not needed for our specific analytical goals.  
2. **High Missing Values**:  
   - Columns with **more than 80% missing data** were dropped because they carried too little information to add value.
3. **Zero Variability**:  
   - Columns where all rows had the **same value** were deemed uninformative and removed.
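
The last two rules can be checked mechanically. The following is a minimal sketch on a toy frame, not the code used for this project; the column names `mostly_empty` and `constant` and the constant `MISSING_THRESHOLD` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged dataset (hypothetical column names).
toy = pd.DataFrame({
    "ip": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5", "192.0.2.6"],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, np.nan, "x"],  # 5 of 6 values missing
    "constant": ["tcp"] * 6,
})

MISSING_THRESHOLD = 0.80  # the 80% threshold quoted in the text

# Fraction of missing values per column.
missing_share = toy.isna().mean()
too_sparse = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

# nunique(dropna=False) counts NaN as a value, so a column is flagged
# only when every row holds exactly the same thing.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

to_drop = sorted(set(too_sparse) | set(constant_cols))
reduced = toy.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Kept:", reduced.columns.tolist())
```

The irrelevance rule has no mechanical test; it depends on the research questions and stays a manual decision.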

---

## 🔍 Streamlined Dataset
The final dataset is focused, clean, and ready for detailed analysis, and every included feature contributes meaningfully.

---

## 🎯 Next Steps
With this curated dataset, we are now well-prepared to dive into **exploratory data analysis**, identify trends, and generate actionable insights. Stay tuned for more exciting discoveries! 🚀


In [1]:
# Load needed library
import pandas as pd


# 🛠️ Filtering and Summary


This code builds a `Date` column from the `Year`, `Month`, and `Day` columns, filters the dataset to retain only the columns relevant for analysis, and prints the dataset's dimensions (number of rows and columns).


In [4]:

# Load your dataset
df = pd.read_csv('/Users/alesarabandi/data_man_project/final_df.csv')

df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']]).dt.date

# List of columns to keep
columns_to_keep = [
    'vic_ip', 'vic_continent_name', 'vic_country_code2', 'vic_country_name', 'vic_city', 
    'vic_latitude', 'vic_longitude', 'att_ip', 'att_continent_name', 'att_country_code2', 
    'att_country_name', 'att_city', 'att_latitude', 'att_longitude', 'att_threat_score', 
    'att_is_tor', 'att_is_proxy', 'att_is_anonymous', 'att_is_known_attacker', 
    'att_is_spam', 'att_is_bot', 'att_is_cloud_provider', 'Source IP Address', 
    'Destination IP Address', 'Protocol', 'Packet Type', 'Packet Length', 'Traffic Type', 
    'Malware Indicators', 'Alerts/Warnings', 'Attack Type', 'Action Taken', 'Severity Level', 
    'Log Source', 'Browser', 'Device/OS', 'Date'
]

# Filter the DataFrame to keep only the specified columns
df = df.loc[:, [col for col in columns_to_keep if col in df.columns]]

# Report the retained columns and the dataset's dimensions
print("Columns after filtering:")
print(df.columns.tolist())
num_columns = len(df.columns)
num_rows = len(df)

print(f"The dataset has {num_columns} columns.")
print(f"The dataset has {num_rows} rows.")


Columns after filtering:
The dataset has 37 columns.
The dataset has 40000 rows.
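
The keep-list contains 37 names and the output reports 37 columns, so every requested column was found. The `if col in df.columns` guard would otherwise skip a misspelled or missing name without any message. A small sketch of how such omissions could be surfaced, using a toy frame and a hypothetical keep-list `wanted` in which `att_city` is deliberately absent:

```python
import pandas as pd

# Toy frame; "att_city" is deliberately missing from it.
toy = pd.DataFrame({"vic_ip": ["192.0.2.1"], "Protocol": ["TCP"], "extra": [1]})
wanted = ["vic_ip", "att_city", "Protocol"]  # hypothetical keep-list

present = [c for c in wanted if c in toy.columns]
absent = [c for c in wanted if c not in toy.columns]

# Make silent omissions visible instead of dropping them quietly.
if absent:
    print(f"Requested but not found: {absent}")

filtered = toy.loc[:, present]
print(filtered.columns.tolist())
```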


# 🔍 Column Analysis Summary

This code examines the dataset's columns to identify unique values, missing data, data types, and sample entries. The findings are compiled into a structured DataFrame for better readability.


In [5]:
# Inspecting unique values and patterns for each column
column_overview = {}

for column in df.columns:
    column_overview[column] = {
        'Unique Values': df[column].nunique(),
        'Sample Values': df[column].dropna().unique()[:5].tolist(),  # Show first 5 unique values
        'Missing Values': df[column].isnull().sum(),
        'Data Type': df[column].dtype
    }

# Convert overview to a DataFrame for better readability
column_summary = pd.DataFrame(column_overview).T
column_summary.reset_index(inplace=True)
column_summary.rename(columns={'index': 'Column'}, inplace=True)

column_summary


| | Column | Unique Values | Sample Values | Missing Values | Data Type |
|---|---|---|---|---|---|
| 0 | vic_ip | 39841 | [84.9.164.252, 66.191.137.154, 198.219.82.17, ... | 159 | object |
| 1 | vic_continent_name | 6 | [Europe, North America, Asia, Africa, South Am... | 159 | object |
| 2 | vic_country_code2 | 171 | [GB, US, CN, MX, NO] | 163 | object |
| 3 | vic_country_name | 172 | [United Kingdom, United States, China, Mexico,... | 159 | object |
| 4 | vic_city | 4539 | [London, Rochester, Montgomery, Shanghai, Ciud... | 163 | object |
| 5 | vic_latitude | 9387 | [51.50115, 44.01212, 32.40286, 31.23042, 19.27... | 159 | float64 |
| 6 | vic_longitude | 9378 | [-0.09951, -92.4802, -86.24044, 121.4737, -99.... | 159 | float64 |
| 7 | att_ip | 39829 | [103.216.15.12, 78.199.217.198, 63.79.210.48, ... | 171 | object |
| 8 | att_continent_name | 6 | [Asia, Europe, North America, South America, A... | 171 | object |
| 9 | att_country_code2 | 186 | [CN, FR, US, JP, IN] | 173 | object |
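
The same kind of overview can also be assembled without an explicit loop. This is an alternative sketch, not the notebook's method: it relies on the column-wise pandas methods `nunique`, `isna`, and `dtypes`, adds a `Missing %` column that the original summary does not have, and leaves out the sample values. The toy frame is illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing row.
toy = pd.DataFrame({
    "vic_city": ["London", "Rochester", None, "London"],
    "vic_latitude": [51.50115, 44.01212, np.nan, 51.50115],
})

summary = (
    pd.DataFrame({
        "Unique Values": toy.nunique(),                    # distinct non-null values
        "Missing Values": toy.isna().sum(),                # count of nulls
        "Missing %": (toy.isna().mean() * 100).round(2),   # share of nulls
        "Data Type": toy.dtypes.astype(str),
    })
    .rename_axis("Column")
    .reset_index()
)

print(summary)
```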


# 🧹 Missing Values Removal

This code removes all rows with missing values from the dataset and displays the shape of the original and cleaned DataFrames for comparison.


In [6]:
# Drop all rows with missing values
df_cleaned = df.dropna()

# Check the shape of the cleaned DataFrame
print(f"Original dataset had {df.shape[0]} rows and {df.shape[1]} columns.")
print(f"Cleaned dataset has {df_cleaned.shape[0]} rows and {df_cleaned.shape[1]} columns.")


Original dataset had 40000 rows and 37 columns.
Cleaned dataset has 39657 rows and 37 columns.
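
The output above implies that 343 rows (40000 − 39657) contained at least one missing value. A short sketch, on a toy frame with placeholder values, of how to see which columns cause the drops, and how `dropna(subset=...)` would limit the check to chosen columns if dropping on every column turned out to be too strict:

```python
import pandas as pd

# Toy frame: one row lacks vic_ip, another lacks att_ip.
toy = pd.DataFrame({
    "vic_ip": ["192.0.2.1", None, "192.0.2.3", "192.0.2.4"],
    "att_ip": ["198.51.100.1", "198.51.100.2", None, "198.51.100.4"],
    "Protocol": ["TCP", "UDP", "TCP", "ICMP"],
})

# Rows that dropna() would remove: any missing value in any column.
rows_with_gaps = toy.isna().any(axis=1)
cleaned = toy.dropna()
dropped = len(toy) - len(cleaned)

# Columns that contain gaps, with their missing counts.
culprits = toy.isna().sum().loc[lambda s: s > 0]

# Alternative: only require selected columns to be complete.
cleaned_subset = toy.dropna(subset=["vic_ip"])

print(f"Dropped {dropped} of {len(toy)} rows")
print(culprits)
```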


# 📊 Column Analysis Post-Cleaning

This code evaluates the unique values, sample data, missing values, and data types for each column in the cleaned dataset. The results are summarized in a structured DataFrame for clarity.


In [7]:
# Inspecting unique values and patterns for each column after cleaning
column_overview = {}

for column in df_cleaned.columns:
    column_overview[column] = {
        'Unique Values': df_cleaned[column].nunique(),
        'Sample Values': df_cleaned[column].dropna().unique()[:5].tolist(),  # Show first 5 unique values
        'Missing Values': df_cleaned[column].isnull().sum(),
        'Data Type': df_cleaned[column].dtype
    }

# Convert overview to a DataFrame for better readability
column_summary = pd.DataFrame(column_overview).T
column_summary.reset_index(inplace=True)
column_summary.rename(columns={'index': 'Column'}, inplace=True)

column_summary


| | Column | Unique Values | Sample Values | Missing Values | Data Type |
|---|---|---|---|---|---|
| 0 | vic_ip | 39657 | [84.9.164.252, 66.191.137.154, 198.219.82.17, ... | 0 | object |
| 1 | vic_continent_name | 6 | [Europe, North America, Asia, Africa, South Am... | 0 | object |
| 2 | vic_country_code2 | 171 | [GB, US, CN, MX, NO] | 0 | object |
| 3 | vic_country_name | 171 | [United Kingdom, United States, China, Mexico,... | 0 | object |
| 4 | vic_city | 4520 | [London, Rochester, Montgomery, Shanghai, Ciud... | 0 | object |
| 5 | vic_latitude | 9343 | [51.50115, 44.01212, 32.40286, 31.23042, 19.27... | 0 | float64 |
| 6 | vic_longitude | 9334 | [-0.09951, -92.4802, -86.24044, 121.4737, -99.... | 0 | float64 |
| 7 | att_ip | 39657 | [103.216.15.12, 78.199.217.198, 63.79.210.48, ... | 0 | object |
| 8 | att_continent_name | 6 | [Asia, Europe, North America, South America, A... | 0 | object |
| 9 | att_country_code2 | 186 | [CN, FR, US, JP, IN] | 0 | object |
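
In the summary above, `vic_ip` and `att_ip` each have 39657 unique values, which equals the row count, so every row carries a distinct IP on both sides. Checks like this can be made explicit rather than read off the table. The helper `check_clean` below is a hypothetical illustration, not part of the project code.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame, unique_cols=()) -> dict:
    """Collect basic quality facts about a frame (hypothetical helper)."""
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isna().sum().sum()),
        # Columns expected to be unique that actually contain repeats.
        "non_unique": [c for c in unique_cols if not frame[c].is_unique],
    }

# Toy frame: att_ip repeats, vic_ip does not.
toy = pd.DataFrame({
    "vic_ip": ["192.0.2.1", "192.0.2.2"],
    "att_ip": ["198.51.100.1", "198.51.100.1"],
})

report = check_clean(toy, unique_cols=["vic_ip", "att_ip"])
print(report)
```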


In [8]:
# Save the cleaned DataFrame to a CSV file
output_file = "/Users/alesarabandi/data_man_project/cleaned_data1.csv"
df_cleaned.to_csv(output_file, index=False)

print(f"Cleaned data has been saved to '{output_file}'.")


Cleaned data has been saved to '/Users/alesarabandi/data_man_project/cleaned_data1.csv'.
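
One detail to keep in mind when the saved file is loaded again: CSV stores the `Date` column as plain text, so it does not come back as a date type unless `read_csv` is told to parse it. A minimal sketch using a temporary file and a toy frame (the file name `cleaned_toy.csv` and the values are placeholders):

```python
import datetime as dt
import os
import tempfile

import pandas as pd

# Toy frame with a Python date, like the Date column built earlier.
toy = pd.DataFrame({"vic_ip": ["192.0.2.1"], "Date": [dt.date(2023, 5, 30)]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")  # placeholder file name
    toy.to_csv(path, index=False)

    plain = pd.read_csv(path)                         # Date is read back as text
    parsed = pd.read_csv(path, parse_dates=["Date"])  # Date is parsed as datetime

print(plain["Date"].dtype)
print(parsed["Date"].dtype)
```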


## Next step: Data Quality