# The Rising Impact of Data Breaches in the U.S. 📝

![Banner](./assets/banner.jpeg)

## Topic
*What problem are you (or your stakeholder) trying to address?*
📝 <!-- Answer Below -->
I want to examine cybersecurity and data privacy, specifically patterns in data breaches in the
United States. Every year, millions of people are affected by compromised personal data, which can
result in identity theft, financial loss, and declining trust in businesses and technology. The
growth of cloud platforms, artificial intelligence, and digital services has increased the
collection of sensitive data, while laws and safeguards frequently lag behind, which makes the
topic important today. Analyzing breach trends can help identify weak points and improve data
security for both individuals and corporations.

## Project Question
*What specific question are you seeking to answer with this project?*
*This is not the same as the questions you ask to limit the scope of the project.*
📝 <!-- Answer Below -->
1. How have data breaches in the US changed over time in terms of both quantity and severity?
2. Which sectors are most commonly the focus of data breaches caused by cyberattacks?
3. What types of data (financial, healthcare, personal identifiers, etc.) are most often compromised, and what patterns exist?

## What would an answer look like?
*What is your hypothesized answer to your question?*
📝 <!-- Answer Below -->
- A line chart showing the number of breaches per year.
- A bar chart comparing breaches across industries.
- A stacked bar or pie chart showing the breakdown of data types exposed.
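
The data-type breakdown needs a table of counts per year and data type before anything can be plotted. The following is a minimal sketch of that step, assuming hypothetical `Year` and `Data_Type` columns and made-up toy rows; neither the column names nor the values come from the real datasets.

```python
import pandas as pd

# Toy rows for illustration only; column names and values are assumptions.
toy = pd.DataFrame({
    "Year": [2021, 2021, 2021, 2022, 2022],
    "Data_Type": ["Financial", "Healthcare", "Financial", "Personal ID", "Healthcare"],
})

# Rows = years, columns = data types, cells = number of breaches.
type_counts = pd.crosstab(toy["Year"], toy["Data_Type"])

# Convert counts to within-year shares so years with different totals are comparable.
type_shares = type_counts.div(type_counts.sum(axis=1), axis=0)
print(type_shares.round(2))
```

Passing either table to `.plot(kind="bar", stacked=True)` would then draw one stacked bar per year.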

## Data Sources
*What 3 data sources have you identified for this project?*
*How are you going to relate these datasets?*
📝 <!-- Answer Below -->
1. Identity Theft Resource Center (ITRC): annual reports with comprehensive breach statistics, available as downloadable files or PDFs.
2. Privacy Rights Clearinghouse, Chronology of Data Breaches: a searchable database that includes variables such as date, organization type, and type of data exposed (database/CSV).
3. U.S. Department of Health and Human Services (HHS) Breach Portal: comprehensive records of healthcare-related data breaches (API and downloadable data).
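
One possible way to relate the three sources is to rename each source's columns to a shared schema, tag every row with its origin, and then stack the results. The sketch below illustrates this under stated assumptions: the raw column names (`Entity`, `Submission Date`, `Affected`, `Company`, and so on), the shared schema, and the `harmonize` helper are all hypothetical placeholders, not the actual export headers.

```python
import pandas as pd

# Shared schema used for illustration (an assumption, not a fixed standard).
COMMON_COLUMNS = ["Date", "Organization", "Industry", "Records_Exposed", "Source"]

def harmonize(frame: pd.DataFrame, column_map: dict, source: str) -> pd.DataFrame:
    """Rename source-specific columns to the shared schema and tag the origin."""
    out = frame.rename(columns=column_map).copy()
    out["Source"] = source
    # reindex adds any missing shared column as NaN and drops the extras.
    return out.reindex(columns=COMMON_COLUMNS)

# Tiny made-up frames standing in for the real files.
hhs_raw = pd.DataFrame({
    "Entity": ["Clinic A"],
    "Submission Date": ["2022-03-01"],
    "Affected": [1200],
})
prc_raw = pd.DataFrame({
    "Company": ["Shop B"],
    "Breach Date": ["2021-07-15"],
    "Org Type": ["Retail"],
    "Total Records": [50000],
})

hhs = harmonize(
    hhs_raw,
    {"Entity": "Organization", "Submission Date": "Date", "Affected": "Records_Exposed"},
    source="HHS",
)
hhs["Industry"] = "Healthcare"  # the HHS portal covers healthcare breaches only

prc = harmonize(
    prc_raw,
    {"Company": "Organization", "Breach Date": "Date",
     "Org Type": "Industry", "Total Records": "Records_Exposed"},
    source="PRC",
)

combined = pd.concat([hhs, prc], ignore_index=True)
combined["Date"] = pd.to_datetime(combined["Date"])
combined["Year"] = combined["Date"].dt.year
print(combined)
```

Keeping a `Source` column makes it possible to check later whether the same breach was reported by more than one source before counting it twice.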

## 1. Exploratory Data Analysis (EDA)
The goal of this section is to explore the structure and content of the merged breach datasets, summarize key statistics, and identify data issues or correlations.
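```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the three sources
itrc_df = pd.read_csv("itrc_data.csv")
prc_df = pd.read_csv("privacy_rights.csv")
hhs_df = pd.read_csv("hhs_breaches.csv")

# Stack the sources row-wise; columns that a source lacks are filled with NaN
df = pd.concat([itrc_df, prc_df, hhs_df], ignore_index=True)

df.head()
```

```python
# Column names, dtypes, and non-null counts
df.info()
```

```python
# Summary statistics for numeric columns
df.describe()
```

```python
print("Missing values:\n", df.isnull().sum())
print("\nDuplicate rows:", df.duplicated().sum())
```

```python
# Pairwise correlations between numeric columns
df.corr(numeric_only=True)
```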


### EDA Summary
- The merged dataset contains [X rows] and [Y columns], covering data breaches from multiple sources.  
- Common variables include Date, Organization Type, Records Exposed, Breach Type, and Location.  
- Some datasets have missing values in Records Exposed and Breach Type.  
- Outliers are present in the Records Exposed column, representing large-scale breaches (millions of records).  
- Correlations show a strong relationship between *Year* and *Total Breaches* but weaker relationships among the other variables.
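
The relationship between *Year* and *Total Breaches* is a per-year quantity, so the row-level data has to be aggregated before it can be measured. A minimal sketch of that aggregation, using made-up rows and assumed `Year` and `Records_Exposed` columns:

```python
import pandas as pd

# Made-up rows for illustration; real values come from the merged dataset.
toy = pd.DataFrame({
    "Year": [2019, 2020, 2020, 2021, 2021, 2021],
    "Records_Exposed": [500, 1_000, 2_000, 300, 4_000, 150_000],
})

# One row per year: how many breaches, and how many records in total.
yearly = toy.groupby("Year").agg(
    Total_Breaches=("Records_Exposed", "size"),
    Total_Records=("Records_Exposed", "sum"),
).reset_index()

# Pearson correlation between the year and the yearly breach count.
year_vs_count = yearly["Year"].corr(yearly["Total_Breaches"])
print(yearly)
print("corr(Year, Total_Breaches) =", round(year_vs_count, 3))
```

With only a handful of yearly points, a correlation like this is descriptive at best and should be read alongside the plots.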

## 2. Data Visualizations
Below are four visualizations that reveal patterns and insights within the data breach datasets.


```python
# 1. Histogram - Breaches per Year
# discrete=True gives each year its own bar instead of 20 arbitrary bins
sns.histplot(df['Year'], discrete=True, color='steelblue')
plt.title('Distribution of Data Breaches Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Breaches')
plt.show()
```

Insight:
Breaches rise steadily from 2005 to 2023, which may reflect increased reporting, increased attack activity, or both.

```python
# 2. Scatter Plot - Records Exposed vs. Year
# Divide by 1e6 so the values match the "Millions" axis label
plt.scatter(df['Year'], df['Records_Exposed'] / 1e6, alpha=0.5)
plt.title('Records Exposed by Year')
plt.xlabel('Year')
plt.ylabel('Records Exposed (Millions)')
plt.show()
```

Insight:
Most breaches expose relatively few records, but a few extreme events (e.g., Equifax) dominate the exposure totals.


```python
# 3. Correlation Heatmap - Numeric features
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap - Breach Statistics')
plt.show()
```

Insight:
Years with more breaches tend to have higher total exposure.

```python
# 4. Boxplot - Outliers in Records Exposed
sns.boxplot(x=df['Records_Exposed'])
plt.title('Boxplot of Records Exposed')
plt.xlabel('Records Exposed')
plt.show()
```

Insight:
Numerous severe outliers indicate very large breach events skewing the distribution.
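
To put a number on how strongly a few events skew the distribution, the boxplot's outlier rule can be applied directly. The sketch below uses made-up record counts, with the last value standing in for a mega-breach, and reports what share of all exposed records the flagged outliers account for.

```python
import pandas as pd

# Made-up record counts; the last value mimics a mega-breach.
records = pd.Series([200, 500, 800, 1_200, 3_000, 5_000, 147_000_000])

q1, q3 = records.quantile([0.25, 0.75])
iqr = q3 - q1
# Same 1.5 * IQR rule a boxplot uses as the limit for its upper whisker.
upper_fence = q3 + 1.5 * iqr

outliers = records[records > upper_fence]
share = outliers.sum() / records.sum()
print(f"{len(outliers)} outlier(s) hold {share:.1%} of all exposed records")
```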



## 3. Data Cleaning and Transformations

Cleaning steps performed based on EDA findings:
- Filled missing numeric values with median values.
- Removed duplicate rows across merged datasets.
- Filtered extreme outliers at or above the 99th percentile of *Records Exposed*.
- Converted *Year* to integer and *Date* to datetime type.

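```python
# Impute missing record counts with the median
df['Records_Exposed'] = df['Records_Exposed'].fillna(df['Records_Exposed'].median())
# Drop rows duplicated across the merged sources
df = df.drop_duplicates()
# Nullable Int64 keeps missing years as <NA>; plain int raises on NaN
df['Year'] = pd.to_numeric(df['Year'], errors='coerce').astype('Int64')
# Parse dates; unparseable values become NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Keep only rows below the 99th percentile of Records_Exposed
df = df[df['Records_Exposed'] < df['Records_Exposed'].quantile(0.99)]
```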

### Cleaning Summary

- Filled missing values using median imputation
- Removed duplicate rows
- Converted Date/Year formats
- Removed extreme outliers for cleaner modeling

## 4. Machine Learning Plan
## Regression Models
To predict numerical severity:
- Linear Regression
- Random Forest Regressor
- Gradient Boosting Regressor

## Classification Models
To categorize breaches into severity groups (Low/Medium/High):
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
- AdaBoost Classifier
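
The severity groups have to be derived from the record counts before any classifier can be trained. A minimal sketch using `pd.cut` follows; the cut-offs of 10,000 and 1,000,000 records are hypothetical values chosen only for illustration and are not fixed by this project.

```python
import pandas as pd

# Hypothetical cut-offs chosen only for illustration; the project does not fix them.
SEVERITY_BINS = [0, 10_000, 1_000_000, float("inf")]
SEVERITY_LABELS = ["Low", "Medium", "High"]

records = pd.Series([150, 9_999, 10_000, 250_000, 5_000_000], name="Records_Exposed")

# right=False makes each bin include its lower edge: [0, 10k), [10k, 1M), [1M, inf).
severity = pd.cut(records, bins=SEVERITY_BINS, labels=SEVERITY_LABELS, right=False)
print(severity.value_counts().reindex(SEVERITY_LABELS))
```

Fixed thresholds keep the labels interpretable; quantile-based bins (`pd.qcut`) would instead give balanced classes at the cost of thresholds that shift with the data.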

## Challenges Identified
| Challenge | Explanation |
|---|---|
| Missing values | Not all reports include exposure numbers |
| Outliers | Mega-breaches distort model accuracy |
| Imbalanced severity classes | Few high-severity breaches |
| Mixed variable types | Categorical and numerical features need separate processing |

## Plan to Address These Challenges
| Challenge | Solution |
|---|---|
| Missing values | Use `SimpleImputer` in the pipeline |
| Outliers | Use `RobustScaler`; remove the top 1% |
| Imbalanced data | Try class weights or SMOTE |
| Mixed types | Use `ColumnTransformer` for separate processing |
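
The planned solutions can be combined in a single scikit-learn pipeline. The sketch below is one possible arrangement, not a final design: the toy rows, the labels, and the choice of `LogisticRegression` are assumptions made for illustration. SMOTE is left out because it lives in the separate imbalanced-learn package; class weights are shown instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data for illustration only; values and labels are made up.
X = pd.DataFrame({
    "Year": [2018, 2019, np.nan, 2021, 2022, 2023, 2020, 2021],
    "Industry": ["Retail", "Healthcare", "Finance", np.nan,
                 "Healthcare", "Retail", "Finance", "Healthcare"],
})
y = pd.Series(["Low", "Low", "Low", "Low", "Low", "Low", "High", "High"])

preprocess = ColumnTransformer([
    # Median imputation, then RobustScaler (centers on the median, scales by the IQR).
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ]), ["Year"]),
    # Most-frequent imputation, then one-hot encoding for categories.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Industry"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    # class_weight="balanced" up-weights the rare "High" class during fitting.
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
clf.fit(X, y)
print(clf.predict(X.head(3)))
```

Because imputation and scaling sit inside the pipeline, they are fitted only on the training folds during cross-validation, which avoids leaking statistics from held-out data.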

## Process (Pipelines, Scaling, Encoding)
Can we predict breach severity or understand what factors most influence the number of records exposed?
Example pipeline setup:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Records_Exposed is the quantity being predicted (directly, or via the
# severity label), so it is left out of the input features to avoid leakage.
numeric_features = ['Year']
categorical_features = ['Industry', 'Breach_Type']

# Numeric columns: fill gaps with the median, then standardize
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill gaps with the mode, then one-hot encode
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Route each column group through its own pipeline
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])
```

## Analyze – Testing Multiple Models

```python
from sklearn.ensemble import RandomForestClassifier

# Chain preprocessing and the classifier so both are fitted together
model = Pipeline([
    ('preprocess', preprocessor),
    ('clf', RandomForestClassifier())
])
```
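
To compare the classifiers listed in the plan on equal terms, each can be scored with the same cross-validation splits. The sketch below is a stand-in under stated assumptions: the data is synthetic (generated with `make_classification`, not breach data), and macro-averaged F1 is one reasonable metric choice for imbalanced classes rather than a project requirement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the preprocessed breach features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, estimator in candidates.items():
    # Macro F1 weighs each class equally, which suits imbalanced labels.
    fold_scores = cross_val_score(estimator, X, y, cv=cv, scoring="f1_macro")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean macro-F1 = {scores[name]:.3f}")
```

On the real data, each estimator would be wrapped in a pipeline together with `preprocessor` so that preprocessing is refitted inside every fold.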






## Approach and Analysis
*What is your approach to answering your project question?*
*How will you use the identified data to answer your project question?*
📝 <!-- Start Discussing the project here; you can add as many code cells as you need -->


```python
# Compare breaches by industry: count rows per industry and keep the top 10
breaches_by_industry = df['Industry'].value_counts().head(10)

plt.figure(figsize=(10,6))
breaches_by_industry.plot(kind='bar')
plt.title("Top 10 Industries by Data Breaches")
plt.xlabel("Industry")
plt.ylabel("Number of Breaches")
plt.show()
```



## Resources and References
*What resources and references have you used for this project?*
📝 <!-- Answer Below -->
- https://www.nist.gov/cyberframework
- https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf
- https://privacyrights.org/
- https://privacyrights.org/data-breaches
- https://www.idtheftcenter.org/
- https://www.cisa.gov/
- https://pandas.pydata.org/docs/

