### Data Analysis Mathematics, Algorithms and Modeling
`PROG8431 - Fall 2025 - Section 1`

Problem Analysis Workshop 1 - Data Collection and Analysis


#### THE TEAM
1. George Jose
2. Lawal Oluwafemi
3. Kamamo Lesley 




#### `FIELD OF INQUIRY`
Cybersecurity

#### `The Problem`
**How do poorly defined cybersecurity protocols affect telecom networks?**

#### `The Prompt to Gen AI`
You are a Cybersecurity analyst. Analyze the impact of poorly defined research questions in cybersecurity. Provide examples where vague or overly broad security goals led to ineffective strategies or wasted resources. Compare these with cases where well-defined, specific questions produced actionable results. Summarize key recommendations for framing precise and effective research questions in cybersecurity.

#### `ESSAY ON FINDINGS`

Below is the short, 100-word essay from Gen AI on which we base our findings in the field of cybersecurity:

Poorly defined research questions in cybersecurity often lead to wasted resources and weak defenses. Broad goals like “prevent cyberattacks” lack focus, causing organizations to invest in generic tools rather than addressing real threats. For example, vague cloud security strategies often overlook identity management, a frequent breach cause. In contrast, well-defined questions such as “How to cut phishing success rates by 50% in one year?” enable measurable, actionable solutions like user training and email filtering. Effective cybersecurity research demands precision, scope, and context. Clear questions drive targeted defenses, maximize resources, and strengthen resilience against evolving threats.


#### `Goal of this project:` 

The goal is to detect potential intrusions (cyberattacks) in network sessions using features like packet size, login attempts, session duration, IP reputation, and unusual access patterns. 

#### `Sources of the Information`
https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset




In [2]:
# import libraries
import pandas as pd
import numpy as np

In [5]:
# import dataset
file_path = "../data/cybersecurity_intrusion_data.csv"
df = pd.read_csv(file_path)

# display the first 5 rows of the dataset
df.head()


| | session_id | network_packet_size | protocol_type | login_attempts | session_duration | encryption_used | ip_reputation_score | failed_logins | browser_type | unusual_time_access | attack_detected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SID_00001 | 599 | TCP | 4 | 492.983263 | DES | 0.606818 | 1 | Edge | 0 | 1 |
| 1 | SID_00002 | 472 | TCP | 3 | 1557.996461 | DES | 0.301569 | 0 | Firefox | 0 | 0 |
| 2 | SID_00003 | 629 | TCP | 3 | 75.044262 | DES | 0.739164 | 2 | Chrome | 0 | 1 |
| 3 | SID_00004 | 804 | UDP | 4 | 601.248835 | DES | 0.123267 | 0 | Unknown | 0 | 1 |
| 4 | SID_00005 | 453 | TCP | 5 | 532.540888 | AES | 0.054874 | 1 | Firefox | 0 | 0 |


In [6]:
df.columns

Index(['session_id', 'network_packet_size', 'protocol_type', 'login_attempts',
       'session_duration', 'encryption_used', 'ip_reputation_score',
       'failed_logins', 'browser_type', 'unusual_time_access',
       'attack_detected'],
      dtype='object')

In [7]:
# display the data types
df.dtypes

session_id              object
network_packet_size      int64
protocol_type           object
login_attempts           int64
session_duration       float64
encryption_used         object
ip_reputation_score    float64
failed_logins            int64
browser_type            object
unusual_time_access      int64
attack_detected          int64
dtype: object

### Data Exploration

In [8]:
# display shape - entire dataset
df.shape

(9537, 11)

In [9]:
df.describe()

| | network_packet_size | login_attempts | session_duration | ip_reputation_score | failed_logins | unusual_time_access | attack_detected |
|---|---|---|---|---|---|---|---|
| count | 9537.0 | 9537.0 | 9537.0 | 9537.0 | 9537.0 | 9537.0 | 9537.0 |
| mean | 500.430639 | 4.032086 | 792.745312 | 0.331338 | 1.517773 | 0.149942 | 0.447101 |
| std | 198.379364 | 1.963012 | 786.560144 | 0.177175 | 1.033988 | 0.357034 | 0.49722 |
| min | 64.0 | 1.0 | 0.5 | 0.002497 | 0.0 | 0.0 | 0.0 |
| 25% | 365.0 | 3.0 | 231.953006 | 0.191946 | 1.0 | 0.0 | 0.0 |
| 50% | 499.0 | 4.0 | 556.277457 | 0.314778 | 1.0 | 0.0 | 0.0 |
| 75% | 635.0 | 5.0 | 1105.380602 | 0.453388 | 2.0 | 0.0 | 1.0 |
| max | 1285.0 | 13.0 | 7190.392213 | 0.924299 | 5.0 | 1.0 | 1.0 |


In [10]:
# Check for missing values
df.isnull().sum()

session_id                0
network_packet_size       0
protocol_type             0
login_attempts            0
session_duration          0
encryption_used        1966
ip_reputation_score       0
failed_logins             0
browser_type              0
unusual_time_access       0
attack_detected           0
dtype: int64
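
The raw counts above are easier to judge as a share of all rows. The following is a minimal sketch of that calculation, using a small made-up frame (`toy`, not part of the notebook) that borrows two of the dataset's column names. The last line applies the same ratio to the figures reported above (1966 missing out of 9537 rows).

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "encryption_used": ["DES", np.nan, "AES", "DES", np.nan],
    "failed_logins": [1, 0, 2, 0, 1],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing cells.
missing_share = toy.isnull().mean() * 100
print(missing_share.round(1))

# Same ratio for the counts reported in the notebook output.
reported_share = 1966 / 9537 * 100
print(round(reported_share, 1))
```

On the real data, `df.isnull().mean() * 100` would give the equivalent per-column percentages; roughly one in five sessions has no `encryption_used` value.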

In [12]:
# Check for duplicates
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
9532    False
9533    False
9534    False
9535    False
9536    False
Length: 9537, dtype: bool
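
The boolean Series above is truncated in display, so it does not by itself show whether any row is a duplicate. A minimal sketch of how the flags could be summarized follows; the `toy` frame and its values are hypothetical stand-ins for `df`.

```python
import pandas as pd

# Toy stand-in for df; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "session_id": ["SID_A", "SID_B", "SID_B", "SID_C"],
    "login_attempts": [4, 3, 3, 5],
})

# duplicated() marks every row that repeats an earlier row; sum() counts the marks.
n_duplicate_rows = int(toy.duplicated().sum())

# Restricting the check to the identifier column also catches repeated
# session IDs whose other columns differ.
n_duplicate_ids = int(toy.duplicated(subset=["session_id"]).sum())

print(n_duplicate_rows, n_duplicate_ids)
```

Applied to the notebook's data, `df.duplicated().sum()` would return a single number instead of 9537 flags.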

### Data Cleaning

- This process standardizes the data so that the dataset is better suited for analysis.
- Some of the questions we could ask ourselves:
1. How should we deal with missing values (empty cells)?
2. Should we normalize data types (for example, convert numbers stored as text to numeric types)?
3. How should we deal with duplicated values?
4. How should we fix structural inconsistencies (changing text to upper/lower case, replacing some text such as gender labels, dropping unnecessary columns)?
5. Are there outliers to check for?
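
The five questions above can each be tried out in a line or two of pandas. What follows is a sketch under stated assumptions, not the team's chosen method: the `toy` frame and its values are invented, treating a missing `encryption_used` as "no encryption" is an assumption about the data, and the 1.5 × IQR rule is one common outlier convention among several.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the dataset's column names; all values are made up.
toy = pd.DataFrame({
    "session_id": ["SID_A", "SID_B", "SID_B", "SID_C", "SID_D"],
    "protocol_type": ["TCP", "udp ", "udp ", "TCP", "ICMP"],
    "login_attempts": ["4", "3", "3", "5", "2"],  # numbers stored as text
    "encryption_used": ["DES", np.nan, np.nan, "AES", np.nan],
    "session_duration": [492.9, 75.0, 75.0, 601.2, 7190.4],
})

# 1. Missing values: label them explicitly (assumes NaN means no encryption).
toy["encryption_used"] = toy["encryption_used"].fillna("None")

# 2. Data types: convert text digits to a numeric dtype.
toy["login_attempts"] = pd.to_numeric(toy["login_attempts"])

# 3. Duplicates: drop exact repeats, keeping the first occurrence.
toy = toy.drop_duplicates().reset_index(drop=True)

# 4. Structural inconsistencies: trim whitespace and unify case.
toy["protocol_type"] = toy["protocol_type"].str.strip().str.upper()

# 5. Outliers: flag durations outside 1.5 * IQR of the quartiles.
q1, q3 = toy["session_duration"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["duration_outlier"] = ~toy["session_duration"].between(
    q1 - 1.5 * iqr, q3 + 1.5 * iqr
)

print(toy)
```

Flagging outliers in a separate column, rather than deleting them, keeps the decision reversible; unusually long sessions may themselves be a signal of an attack.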
