# The Rising Impact of Data Breaches in the U.S. 📝

![Banner](./assets/banner.jpeg)

## Topic
*What problem are you (or your stakeholder) trying to address?*
📝 <!-- Answer Below -->
I want to examine cybersecurity and data privacy, specifically patterns in data 
breaches in the United States. Every year, millions of people are affected by compromised personal 
data, which can lead to identity theft, financial loss, and declining trust in businesses and 
technology. The growth of cloud platforms, artificial intelligence, and digital services has 
increased the collection of sensitive data, but laws and safeguards frequently lag behind, 
which makes the topic important today. Analyzing breach trends can help identify weak points and 
improve data security for both individuals and corporations.

## Project Question
*What specific question are you seeking to answer with this project?*
*This is not the same as the questions you ask to limit the scope of the project.*
📝 <!-- Answer Below -->
- How has breach frequency changed over time in the U.S.?
- Which industries are most frequently targeted?
- What types of data (PII, PHI, financial) are most often exposed?
- Can a model predict whether a breach contains PII/PHI based on the breach summary?

## What would an answer look like?
*What is your hypothesized answer to your question?*
📝 <!-- Answer Below -->
- The number of reported breaches increases over time (more reporting & attacks)
- Finance and healthcare will be among the most-targeted industries
- PII and health data will appear frequently
- A text classifier can detect PII mentions in breach summaries with reasonable accuracy

## Data Sources
*What 3 data sources have you identified for this project?*
*How are you going to relate these datasets?*
📝 <!-- Answer Below -->
Data Sources Used:
1. Dataset 1 (Df_1.csv)
2. Dataset 2 (breach_report.csv)
3. Dataset 3 (cyber_security_breaches.csv)

I relate the datasets through the fields they have in common, such as breach date, organization name, industry sector, breach type, and the number of records exposed. These shared fields let me combine the datasets into a single, cohesive dataset, even though each one concentrates on a different facet of cybersecurity incidents. Where datasets cannot be combined directly, I can still analyze them independently and use the findings to support the main conclusions. Merging the data this way lets me analyze trends across industries, time periods, and breach characteristics, giving a broader and more comprehensive picture of U.S. data breaches.
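
One way to relate the files is to map each file's column names onto a shared schema and then stack the rows. The sketch below uses plain Python dictionaries so the logic is visible. The column names in `COLUMN_MAPS` and the sample rows are hypothetical placeholders, not the actual headers or contents of the three CSV files.

```python
# Shared schema that every source is mapped onto.
SHARED_FIELDS = ["organization", "breach_date", "industry", "breach_type", "records_exposed"]

# Hypothetical column mappings; the real headers must be checked in each file.
COLUMN_MAPS = {
    "source_a": {"Company": "organization", "Date": "breach_date", "Sector": "industry"},
    "source_b": {"entity_name": "organization", "date_of_breach": "breach_date",
                 "type_of_breach": "breach_type"},
}

def harmonize(rows, column_map, source):
    """Rename columns to the shared schema; fields a source lacks become None."""
    out = []
    for row in rows:
        renamed = {column_map[k]: v for k, v in row.items() if k in column_map}
        record = {field: renamed.get(field) for field in SHARED_FIELDS}
        record["source"] = source  # keep provenance so results can be traced back
        out.append(record)
    return out

# Made-up rows for illustration only.
rows_a = [{"Company": "Acme", "Date": "2015-03-01", "Sector": "Retail"}]
rows_b = [{"entity_name": "Clinic X", "date_of_breach": "2014-07-09",
           "type_of_breach": "Theft"}]

combined = (harmonize(rows_a, COLUMN_MAPS["source_a"], "source_a")
            + harmonize(rows_b, COLUMN_MAPS["source_b"], "source_b"))
```

Keeping a `source` field makes it possible to fall back to per-dataset analysis when a shared field turns out to be missing in one file.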



## Approach and Analysis
*What is your approach to answering your project question?*
*How will you use the identified data to answer your project question?*
My approach starts with cleaning and standardizing all of the breach data so that it can be analyzed uniformly. I then explore the data to find trends, outliers, and relationships between factors such as year, industry, breach type, and records exposed. Next, I build visualizations that show these trends, such as how breach frequency varies over time or which industries are most affected. I also use machine-learning models to test whether breach characteristics, such as severity or the exposure of PII/PHI, can be predicted from the dataset's attributes. Combining the datasets, exploring the trends, creating visuals, and applying ML models together let me answer my project question and show how data breaches have changed and which factors matter most in security incidents.

Using the combined breach datasets, I will measure the frequency of breaches, the industries most affected, and the kinds of data exposed. Examining dates, industry classifications, breach types, and record counts lets me spot trends such as rising breach frequency over time or industries with the highest incidence. Summary statistics can highlight significant changes or anomalous events, and visualizations make these trends easy to see. I will also train machine-learning models that aim to forecast breach characteristics such as severity, in order to test whether particular attributes, such as industry or breach type, have a significant impact. Collectively, the datasets provide the evidence needed to answer my project's questions and support findings about cybersecurity and data-privacy threats in the US.
📝 <!-- Start Discussing the project here; you can add as many code cells as you need -->



## Data Cleaning Pipeline (summary)
- Normalize column names (lowercase, underscores)  
- Parse and standardize date/year fields  
- Impute missing or malformed numeric values if needed (median imputation used for numeric summaries)  
- Drop duplicate rows  
- Create `pii_exposed` label via keyword detection in `summary`  
- Save a cleaned dataset for reproducibility
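
The median-imputation step listed above can be sketched in plain Python as follows. This is a minimal illustration under the assumption that missing values are represented as `None`; the sample numbers are made up. The median is used rather than the mean because record counts in breach data are often heavily skewed by a few very large incidents.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn a median from
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Made-up record counts with two missing entries.
records = [500, None, 1200, 300, None]
filled = impute_median(records)
```

Imputed values are only suitable for numeric summaries; rows with imputed counts should not be read as real observations.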


## Import Libraries Cell
Summary:
This section documents the dataset used in the project so that the source of the data is transparent. The primary dataset, `Cyber Security Breaches.csv`, contains breach narratives, dates, organizations, locations, and additional metadata. Identifying the data source clearly supports reproducibility and lays the groundwork for the analysis that follows.

In [None]:
#Imports
import os, re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
import joblib

plt.rcParams['figure.figsize'] = (10,5)


In [None]:
# Load the dataset (expects Cyber Security Breaches.csv in the working directory; adjust DATA_PATH if it lives in data/)
DATA_PATH = "Cyber Security Breaches.csv"
df = pd.read_csv(DATA_PATH, low_memory=False)

# Normalize column names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("Dataset shape:", df.shape)
display(df.head(3))
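
Replacing only spaces leaves other punctuation (hyphens, dots, slashes) in column names. A slightly more defensive normalizer might look like the sketch below; the headers in `headers` are hypothetical examples rather than the actual columns of the CSV.

```python
import re

def normalize_column(name):
    """Lowercase a header and collapse any run of non-alphanumerics to one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")  # drop underscores left at either end

# Hypothetical headers showing the kinds of punctuation that can appear.
headers = ["Date of Breach", " Type-of-Breach ", "Individuals.Affected", "State"]
normalized = [normalize_column(h) for h in headers]
```

In the notebook the same function could be applied with `df.columns = [normalize_column(c) for c in df.columns]`.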

In [None]:
# Basic info
df.info()
display(df.describe(include='all').T)

# Missing values and duplicates
print("Missing values per column:\n", df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())

In [None]:
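# Ensure the summary column exists and holds strings (fill NaN before casting,
# otherwise missing values become the literal text 'nan')
if 'summary' in df.columns:
    df['summary'] = df['summary'].fillna('').astype(str)
else:
    df['summary'] = ''

# List date-like columns for reference when picking a field to derive the year from
date_cols = [c for c in df.columns if 'date' in c]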
# Prefer common fields
if 'date_of_breach' in df.columns:
    df['date_of_breach'] = pd.to_datetime(df['date_of_breach'], errors='coerce')
    df['year'] = df['date_of_breach'].dt.year
elif 'breach_start' in df.columns:
    df['breach_start'] = pd.to_datetime(df['breach_start'], errors='coerce')
    df['year'] = df['breach_start'].dt.year
elif 'year' in df.columns:
    df['year'] = pd.to_numeric(df['year'], errors='coerce')
else:
    df['year'] = pd.NA

# Standardize columns we may use (create if missing)
for col in ['industry','breach_type','state','organization']:
    if col not in df.columns:
        df[col] = pd.NA

# Drop exact duplicates
df = df.drop_duplicates()

# Attempt to normalize any numeric 'records' columns if present
possible_records = [c for c in df.columns if 'record' in c or 'expos' in c or 'comprom' in c]
if possible_records:
    rc = possible_records[0]
    # Keep the decimal point so a float such as 1000.0 is not read as 10000
    df['records_exposed'] = pd.to_numeric(df[rc].astype(str).str.replace(r'[^0-9.]', '', regex=True), errors='coerce')
else:
    df['records_exposed'] = pd.NA

# Print counts after cleaning
print("After cleaning shape:", df.shape)
print("Year nulls:", df['year'].isnull().sum(), "Records_exposed nulls:", df['records_exposed'].isnull().sum())
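
Record counts often arrive as messy strings. Stripping every non-digit character from a value such as `1000.0` would yield `10000`, so a parser that handles separators and the decimal point explicitly is safer. The helper below is a hypothetical sketch, and the example inputs are made up.

```python
import re

def parse_count(raw):
    """Convert a messy count such as '1,200' or '1000.0' to an int, or None."""
    if raw is None:
        return None
    text = str(raw).strip().replace(",", "")
    # Accept whole numbers, optionally followed by a zero-only decimal part.
    match = re.fullmatch(r"(\d+)(?:\.0*)?", text)
    return int(match.group(1)) if match else None

# Made-up inputs covering the common cases.
examples = ["1,200", "1000.0", 350, "unknown", None, "2.5 million"]
parsed = [parse_count(x) for x in examples]
```

Values that cannot be interpreted unambiguously, such as `"2.5 million"`, are returned as `None` rather than guessed.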

In [None]:
# Build a binary target: whether the summary mentions PII / PHI indicators
pii_terms = [
    r'\bssn\b', 'social security', 'patient', 'protected health', r'\bphi\b',
    'health information', 'personal information', 'credit card', 'card number',
    'financial account', 'date of birth', r'\bdob\b', 'driver license', 'passport',
    'medical record', 'email', 'address'
]
pattern = re.compile('|'.join(pii_terms), flags=re.IGNORECASE)

df['pii_exposed'] = df['summary'].apply(lambda s: 1 if pattern.search(str(s)) else 0)
print("PII label counts:\n", df['pii_exposed'].value_counts())
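
Plain substring terms such as `address` also match unrelated words like "addressed", which can inflate the positive class. The sketch below contrasts a broad pattern with a word-bounded variant on two made-up summaries. The bounded pattern is an illustrative assumption, not a vetted PII lexicon.

```python
import re

# Broad substring terms versus word-bounded variants (illustrative subset only).
broad = re.compile(r"email|address", flags=re.IGNORECASE)
bounded = re.compile(
    r"\bemails?\b|\b(?:home|mailing|street)\s+address(?:es)?\b",
    flags=re.IGNORECASE,
)

def label(pattern, text):
    """Return 1 if the pattern matches anywhere in the text, else 0."""
    return 1 if pattern.search(text) else 0

# Made-up summaries: the first mentions PII, the second does not.
samples = [
    "Patient names and home addresses were on the stolen laptop.",
    "The covered entity addressed the incident by retraining staff.",
]
broad_labels = [label(broad, s) for s in samples]
bounded_labels = [label(bounded, s) for s in samples]
```

Here the broad pattern flags both summaries, while the bounded pattern flags only the first. Spot-checking a sample of labeled rows by hand is a cheap way to estimate how often this happens in the real data.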

In [None]:
# 1: Breaches per year (line)
if df['year'].notna().any():
    year_counts = df['year'].dropna().astype(int).value_counts().sort_index()
    plt.figure()
    plt.plot(year_counts.index, year_counts.values, marker='o')
    plt.title("Number of Reported Breaches per Year")
    plt.xlabel("Year")
    plt.ylabel("Count of Breaches")
    plt.grid(True)
    plt.show()
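
To put a number on the trend in the line chart, the yearly counts can be turned into year-over-year percentage changes. This is a minimal sketch; the years in `sample_years` are made up and are not values from the dataset.

```python
from collections import Counter

def year_over_year_change(years):
    """Return {year: percent change versus the previous year present in the data}."""
    counts = Counter(years)
    ordered = sorted(counts)
    changes = {}
    for prev, curr in zip(ordered, ordered[1:]):
        changes[curr] = 100.0 * (counts[curr] - counts[prev]) / counts[prev]
    return changes

# Made-up breach years for illustration.
sample_years = [2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012]
changes = year_over_year_change(sample_years)
```

Note that a year with no recorded breaches is skipped, so each change is relative to the previous year that appears in the data.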

In [None]:
# 2: Top industries by count
if 'industry' in df.columns:
    top_ind = df['industry'].fillna('Unknown').value_counts().head(10)
    plt.figure()
    top_ind.plot(kind='barh')
    plt.title("Top 10 Industries by Number of Breaches")
    plt.xlabel("Number of Breaches")
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

In [None]:
# 3: PII presence pie
# Reindex so both classes are always present and line up with the labels
vals = df['pii_exposed'].value_counts().reindex([0, 1], fill_value=0)
labels = ['No PII', 'PII Mentioned']
plt.figure()
plt.pie(vals.values, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title("Proportion of Breaches Mentioning PII/PHI in Summary")
plt.show()

In [None]:
# 4: Summary length histogram
lengths = df['summary'].str.len()
plt.figure()
plt.hist(lengths.dropna(), bins=30)
plt.title("Distribution of Summary Length (characters)")
plt.xlabel("Characters")
plt.ylabel("Count")
plt.show()

In [None]:
# 5: Year vs summary length scatter (log y)
if df['year'].notna().any():
    plt.figure()
    idx = df['year'].dropna().index
    plt.scatter(df.loc[idx, 'year'].astype(int), df.loc[idx, 'summary'].str.len(), alpha=0.4)
    plt.yscale('log')
    plt.title("Summary Length vs Year (log scale)")
    plt.xlabel("Year")
    plt.ylabel("Summary Length (log scale)")
    plt.show()

In [None]:
# 6: Top words in PII summaries
def top_n_words(texts, n=15):
    words = Counter()
    for t in texts:
        toks = re.findall(r'\b[a-zA-Z]{3,}\b', t.lower())
        words.update(toks)
    return words.most_common(n)

pii_top = top_n_words(df[df['pii_exposed']==1]['summary'].astype(str).tolist(), n=20)
if pii_top:
    words, counts = zip(*pii_top)
    plt.figure(figsize=(10,5))
    plt.bar(words, counts)
    plt.title("Top Words in PII-Related Summaries")
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
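
Raw word counts tend to be dominated by function words such as "the" and "and". A variant that filters a stop-word list may give a more informative chart. The list below is a small illustrative assumption, and the sample summaries are made up.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a fuller list would be needed in practice.
STOP_WORDS = {"the", "and", "was", "were", "that", "for", "with", "from", "has", "had"}

def top_n_content_words(texts, n=5):
    """Count words of 3+ letters, skipping common function words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Made-up summaries for illustration.
sample = [
    "The laptop was stolen and the patient records were exposed.",
    "Patient records were mailed to the wrong address.",
]
top_words = top_n_content_words(sample, n=2)
```

With the filter in place, the two most frequent words in the sample are "patient" and "records" rather than "the" and "were".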

In [None]:
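# Prepare data for ML (text classification)
# Note: pii_exposed was derived from keywords in this same summary text, so the
# model largely relearns the keyword rule and test scores will look optimistic.
X = df['summary'].astype(str)
y = df['pii_exposed']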

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123, stratify=y)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=123))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:,1]

print("Train size:", len(X_train), "Test size:", len(X_test))
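
To make the TF-IDF step less of a black box, the sketch below computes smoothed TF-IDF weights with L2 normalization by hand, following the formula scikit-learn documents for its default settings: `idf = ln((1 + n) / (1 + df)) + 1`. The tokenizer here is simplified and no stop words are removed, so the numbers would not match `TfidfVectorizer` exactly. The documents are made up.

```python
import math
import re
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2 normalization, one {term: weight} dict per document."""
    tokenized = [re.findall(r"[a-z]{2,}", d.lower()) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        raw = {t: count * idf[t] for t, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Made-up summaries for illustration.
docs = ["laptop stolen patient records", "patient records mailed", "server hacked"]
vectors = tfidf(docs)
```

In the first document, "laptop" receives a higher weight than "patient" because it appears in fewer documents, which is the property the classifier relies on.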

In [None]:
# Classification report
print("Classification report:\n")
print(classification_report(y_test, y_pred, zero_division=0))

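# Confusion matrix (fix the label order so the display labels always line up)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
disp = ConfusionMatrixDisplay(cm, display_labels=['No PII','PII'])
fig, ax = plt.subplots()
disp.plot(cmap='Blues', ax=ax)  # draw on this axes instead of leaving an empty figure
ax.set_title("Confusion Matrix - PII detection")
plt.show()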

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0,1],[0,1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - PII detection")
plt.legend()
plt.show()
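
The figures in the classification report follow directly from the four confusion-matrix cells. For labels `[0, 1]`, scikit-learn lays the matrix out as `[[tn, fp], [fn, tp]]`. The sketch below recomputes precision, recall, and F1 for the positive class from such counts; the counts used are made up and are not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only.
metrics = binary_metrics(tn=40, fp=10, fn=5, tp=45)
```

With the real matrix, the counts could be unpacked with `tn, fp, fn, tp = cm.ravel()`.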

In [None]:
# Show top positive predictive terms
clf = pipeline.named_steps['clf']
tfidf = pipeline.named_steps['tfidf']
feature_names = tfidf.get_feature_names_out()
coefs = clf.coef_[0]
top_pos_idx = np.argsort(coefs)[-20:][::-1]

print("Top features predictive of PII (positive coefficients):")
for i in top_pos_idx[:20]:
    print(f"{feature_names[i]}: {coefs[i]:.3f}")
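
A logistic-regression coefficient can be read as a change in log-odds: raising a feature by `delta` multiplies the odds of the positive class by `exp(coef * delta)`. Because L2-normalized TF-IDF values lie between 0 and 1, small deltas are the realistic case. The coefficient below is hypothetical, not a value from the fitted model.

```python
import math

def odds_multiplier(coef, delta):
    """Factor by which the odds of the positive class change when a feature rises by delta."""
    return math.exp(coef * delta)

# Hypothetical coefficient of 2.0 and a TF-IDF increase of 0.1.
example = odds_multiplier(coef=2.0, delta=0.1)
```

Under these assumed numbers the odds rise by roughly 22 percent, which gives a more intuitive scale than the raw coefficient.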

In [None]:
# Save trained pipeline + cleaned CSV
os.makedirs("models", exist_ok=True)
joblib.dump(pipeline, "models/pii_detector_tfidf_logreg.joblib")
cleaned_csv = "cleaned_cyber_breaches.csv"
df.to_csv(cleaned_csv, index=False)
print("Saved model to models/pii_detector_tfidf_logreg.joblib")
print("Saved cleaned data to", cleaned_csv)


## Resources and References
*What resources and references have you used for this project?*
📝 <!-- Answer Below -->
- https://www.nist.gov/cyberframework
- https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf
- https://privacyrights.org/
- https://privacyrights.org/data-breaches
- https://www.idtheftcenter.org/
- https://www.idtheftcenter.org/breach-alert/
- https://www.cisa.gov/
- https://www.verizon.com/business/resources/reports/dbir/
- https://jupyter-notebook.readthedocs.io/en/stable/
- https://pandas.pydata.org/docs/
- https://numpy.org/doc/
- https://matplotlib.org/stable/users/index.html
- https://scikit-learn.org/stable/
- https://scikit-learn.org/stable/modules/model_evaluation.html
- https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
- https://www.nltk.org/
- https://docs.python.org/3/library/re.html

In [None]:
# ⚠️ Make sure you run this cell at the end of your notebook before every submission!
!jupyter nbconvert --to python source.ipynb



[NbConvertApp] Converting notebook source.ipynb to python
[NbConvertApp] Writing 1271 bytes to source.py
