<a href="https://colab.research.google.com/github/DartDoesData/python-practice/blob/main/Week_4_Day_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🛡️ **Week 4, Day 3: Python for Cybersecurity**

In this lesson, we will explore how Python can be used for cybersecurity use cases.

Python’s data analysis libraries (like pandas) and its ability to interact with APIs make it a helpful tool for cybersecurity tasks such as detecting phishing attempts and analyzing malicious activity.

## 🔍 What is PhishTank?

[PhishTank](https://phishtank.org/) is a clearinghouse for data and information about phishing. It allows anyone to submit, verify, track, and access phishing data. The data includes details about phishing URLs, verification status, and the target brand being impersonated.

PhishTank provides a rich dataset of phishing attempts that have been verified by a community of users. It is a great source of real-world data for practicing cybersecurity analysis and understanding common phishing trends.

**Column Definitions**

[PhishTank Developer Info](https://phishtank.org/developer_info)
1. `phish_id`: The unique ID number for each phishing attempt.
2. `phish_detail_url`: A URL with details about the phishing attempt.
3. `url`: The actual phishing URL.
4. `submission_time`: The date and time the phishing attempt was reported.
5. `verified`: Indicates if the phishing attempt was verified (always 'yes' in this dataset).
6. `verification_time`: The date and time the phishing attempt was verified.
7. `online`: Indicates if the phishing attempt is still online.
8. `target`: The name of the company or brand being impersonated.
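
To make the column definitions concrete, here is a minimal sketch of how records with these fields could be loaded into a DataFrame. The two records are hypothetical (the IDs, URLs, timestamps, and brand name are made up for illustration), and the ISO-style timestamp format is an assumption, so check the real data before relying on it.

```python
import pandas as pd

# Hypothetical records that follow the column definitions above.
# All values are invented for illustration only.
sample_records = [
    {
        "phish_id": 1,
        "phish_detail_url": "http://example.com/detail/1",
        "url": "http://example.com/fake-login",
        "submission_time": "2024-01-01T10:00:00+00:00",
        "verified": "yes",
        "verification_time": "2024-01-01T12:30:00+00:00",
        "online": "yes",
        "target": "Other",
    },
    {
        "phish_id": 2,
        "phish_detail_url": "http://example.com/detail/2",
        "url": "http://example.com/fake-bank",
        "submission_time": "2024-01-02T08:15:00+00:00",
        "verified": "yes",
        "verification_time": "2024-01-02T09:00:00+00:00",
        "online": "yes",
        "target": "ExampleBank",
    },
]

sample_df = pd.DataFrame(sample_records)

# Timestamps arrive as strings, so convert them before doing date arithmetic
for col in ["submission_time", "verification_time"]:
    sample_df[col] = pd.to_datetime(sample_df[col], utc=True)

print(sample_df.dtypes)
```

Converting the two time columns up front makes later steps, such as grouping by week or measuring verification delay, much simpler.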



## 🧠 Lesson 1: Thinking Like an Analyst

In cybersecurity, analysts often have to explore the data on their own and ask questions about what insights can be gained. Rarely are they given a detailed list of tasks. Instead, they need to look at the data and think critically about potential analyses.

Review the data from the API endpoint to determine analysis use cases.

In [None]:
import requests
import pandas as pd
from pprint import pprint

# Fetch phishing data from PhishTank
phishtank_data_url = 'http://data.phishtank.com/data/online-valid.json'
# A timeout keeps the cell from hanging if the server does not respond
response = requests.get(phishtank_data_url, timeout=60)

if response.status_code == 200:
    response_json = response.json()
    print(f'API call successful: {response.status_code}')
else:
    # response_json is not defined in this case; use the alternate method below
    print(f'Error with the PhishTank API: {response.status_code}')


In [None]:
# # ALTERNATE METHOD for downloading Phishing data
# # If the API call above is not successful

# import gdown
# import json

# # Google Drive file ID
# file_id = "1-gpQvp3iOAgZoflrr8BHKMXIe5Ys6yH6"

# # Download the file using gdown
# url = f"https://drive.google.com/uc?id={file_id}"
# response_json_file = "phishing_data.json"
# gdown.download(url, response_json_file, quiet=False)

# # Load the JSON contents into response_json
# with open(response_json_file, 'r') as file:
#     response_json = json.load(file)

In [None]:
# Print the response data type
print(type(response_json))

# Print the number of records in the response
print(len(response_json))

# Preview the response (two records)
response_json[:2]

In [None]:
# Convert this JSON into a DataFrame
phishing_df = pd.DataFrame(response_json)

# Display the DataFrame
display(phishing_df.head())

# Get a summary of the dataset
phishing_df.info()
print(len(phishing_df))


## 📝 Practice exercise 1: Top Targeted Brands

One of the key questions we can answer is: **Which brands are most frequently targeted by phishing attacks?**

Using your `phishing_df` DataFrame, find the top 20 targets for phishing attacks. Filter out "Other" from your dataset first.
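
As a hint, the general pattern (filter with a boolean mask, then count values) can be sketched on a small toy DataFrame. The `toy_df` name and the brand names are hypothetical; apply the same idea to `phishing_df` yourself.

```python
import pandas as pd

# Toy data standing in for the 'target' column (brand names are made up)
toy_df = pd.DataFrame({
    "target": ["Other", "BrandA", "BrandB", "BrandA",
               "Other", "BrandC", "BrandA", "BrandB"]
})

# Keep only rows whose target is a named brand
known_targets = toy_df[toy_df["target"] != "Other"]

# value_counts() sorts by frequency (descending); head(20) keeps the top 20
top_targets = known_targets["target"].value_counts().head(20)

print(top_targets)
```

With only three brands in the toy data, `head(20)` simply returns all of them; on the full dataset it trims the result to the 20 most frequent targets.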


In [None]:
# Filter out "Other" from the 'target' column
# YOUR CODE HERE

# Get the top 20 most targeted brands
# YOUR CODE HERE

# Display the results
# YOUR CODE HERE


## 📝 Practice exercise 2: Trend of Phishing Activity Over Time

Another important analysis involves examining the trend of phishing attempts over time. This can help us identify spikes in phishing activity.

Work with Kisha to create a horizontal bar chart that shows the number of phishing submissions per week over the past 90 days.
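
One possible way to prepare the weekly counts is sketched below on toy data. The dates and the fixed `reference_time` are assumptions chosen so the result is reproducible; in the notebook you would likely use the current time instead. Grouping by calendar week with `resample("W")` is just one reasonable choice.

```python
import pandas as pd

# Toy submissions (dates are made up for illustration)
toy_df = pd.DataFrame({
    "phish_id": [1, 2, 3, 4, 5, 6, 7],
    "submission_time": [
        "2023-11-15", "2024-02-05", "2024-02-06", "2024-02-20",
        "2024-03-11", "2024-03-12", "2024-03-13",
    ],
})
toy_df["submission_time"] = pd.to_datetime(toy_df["submission_time"], utc=True)

# Fixed reference date for a reproducible example;
# pd.Timestamp.now(tz="UTC") would be the live equivalent
reference_time = pd.Timestamp("2024-04-01", tz="UTC")
cutoff = reference_time - pd.Timedelta(days=90)

# Keep only submissions from the last 90 days
recent = toy_df[toy_df["submission_time"] >= cutoff]

# Count submissions per week (weeks end on Sunday by default)
weekly_counts = recent.resample("W", on="submission_time")["phish_id"].count()

print(weekly_counts)
```

If matplotlib is available, as it is in Colab, calling `weekly_counts.plot.barh()` on the result should draw the horizontal bar chart.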

In [None]:
# YOUR CODE HERE
# Work with "Kisha" on this


## 📝 Practice exercise 3: Average Time to Verify Phishing Attempts

Analyzing the time taken to verify phishing attempts can provide insights into how quickly these threats are identified and addressed.

Work with Kisha to calculate the average time (in hours) between the submission of a phishing attempt and its verification.
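
The calculation comes down to subtracting two datetime columns and averaging the result. A minimal sketch on toy data follows; the timestamps are invented, and the column names match the column definitions earlier in this lesson.

```python
import pandas as pd

# Toy records: verification happens 2 hours and 4 hours after submission
toy_df = pd.DataFrame({
    "submission_time": ["2024-03-01T08:00:00+00:00", "2024-03-02T09:00:00+00:00"],
    "verification_time": ["2024-03-01T10:00:00+00:00", "2024-03-02T13:00:00+00:00"],
})

# Convert the string timestamps to timezone-aware datetimes
for col in ["submission_time", "verification_time"]:
    toy_df[col] = pd.to_datetime(toy_df[col], utc=True)

# Subtracting datetime columns yields a Timedelta column
verification_delay = toy_df["verification_time"] - toy_df["submission_time"]

# Convert to hours, then average
average_hours = verification_delay.dt.total_seconds().mean() / 3600

print(f"Average time to verify: {average_hours:.2f} hours")
```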

In [None]:
# YOUR CODE HERE
# Work with "Kisha" on this

## 📄 Generating Web Content with Kisha

We'll use the prompt below to generate a web page with a data table and a horizontal bar chart as an example.

First, create a filtered version of the DataFrame.

```
# Filter out records where 'target' is 'Other' (only for testing)
phish_report_df = phishing_df[phishing_df['target'] != 'Other']

# Limit the DataFrame to the first 2000 rows (only for testing)
phish_report_df = phish_report_df[:2000]

# Display the filtered DataFrame
phish_report_df.head()
```

Then prompt the AI to create a web page using this data.
```
Generate a full, interactive single web page from the phish_report_df DataFrame using Bootstrap for the UI.

The web page should meet the following requirements:
1. Data Source: Use the contents of phish_report_df as the data source.
2. Paginated Table: Use DataTables from DataTables.net to display phish_report_df in a paginated, searchable, and sortable table.
3. Embed data in the web page: Ensure data is embedded in the file and not expected to come from an endpoint.
4. Interactive Bar Chart: Use Plotly for a horizontal bar chart that shows the count of records grouped by phish_report_df['target'].
5. Elegant UI: Use Bootstrap for styling to create a polished and responsive design.
6. Download Option: Save the generated HTML file in Colab and include code to allow downloading it directly to my computer.
```
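
Requirement 3 (embedding the data in the page) is the part most likely to go wrong in generated code. A rough sketch of the idea is shown here: serialize the DataFrame to JSON and place it directly inside a `<script>` tag. The `toy_df` data, the `phishData` variable name, and the bare-bones HTML are hypothetical, and the sketch omits Bootstrap, DataTables, and Plotly; the page the AI generates will look different.

```python
import pandas as pd

# Toy stand-in for phish_report_df (values are made up)
toy_df = pd.DataFrame({
    "target": ["BrandA", "BrandB"],
    "url": ["http://example.com/a", "http://example.com/b"],
})

# Serialize rows as a JSON array of objects, one object per record
records_json = toy_df.to_json(orient="records")

# Embed the JSON directly in the page so no endpoint is needed
html_page = f"""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Phishing report (sketch)</title></head>
<body>
<script>
  const phishData = {records_json};
  document.body.insertAdjacentHTML(
    "beforeend", "<p>" + phishData.length + " records embedded</p>");
</script>
</body>
</html>"""

print(html_page[:80])
```

Writing `html_page` to a file with `open(...)` and then calling `files.download(...)` from `google.colab` is one way to cover the download requirement in Colab.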

In [None]:
# Recreate the phishing_df from the JSON response
phishing_df = pd.DataFrame(response_json)

# Filter the DataFrame
phish_report_df = phishing_df[phishing_df['target'] != 'Other']

# Limit the DataFrame to the first 2000 rows (only for testing)
phish_report_df = phish_report_df[:2000]

# Display the filtered DataFrame
phish_report_df.head()

Use the cell below for your prompt: choose the `generate` option and paste the prompt that follows.

---

_Generate a full, interactive single web page from the phish_report_df DataFrame using Bootstrap for the UI._

_The web page should meet the following requirements:_
1. _Data Source: Use the contents of phish_report_df as the data source._
2. _Paginated Table: Use DataTables from DataTables.net to display phish_report_df in a paginated, searchable, and sortable table._
3. _Embed data in the web page: Ensure data is embedded in the file and not expected to come from an endpoint._
4. _Interactive Bar Chart: Use Plotly for a horizontal bar chart that shows the count of records grouped by phish_report_df['target']._
5. _Elegant UI: Use Bootstrap for styling to create a polished and responsive design._
6. _Download Option: Save the generated HTML file in Colab and include code to allow downloading it directly to my computer._