# Assignment 03 – Visualising Email Domain Distribution

This notebook analyses a dataset of 1,000 individuals to explore the distribution of email domains. The results are presented using a pie chart, styled for clarity and visual appeal. This task demonstrates basic data wrangling and visualisation skills using Python.


### 1. Import Libraries

In [None]:
# Import necessary libraries
import pandas as pd  # Data manipulation – see: https://pandas.pydata.org/
import matplotlib.pyplot as plt  # Plotting charts – see: https://matplotlib.org/stable/contents.html
import seaborn as sns  # Styling plots – see: https://seaborn.pydata.org/
import requests  # Downloading files from the web – see: https://docs.python-requests.org/en/latest/
from pathlib import Path  # Filesystem path handling – see: https://docs.python.org/3/library/pathlib.html

# Set a consistent visual theme for seaborn plots
# See: https://seaborn.pydata.org/generated/seaborn.set_theme.html
sns.set_theme(style="whitegrid")


### 2. Download the dataset

The dataset is hosted on Google Drive. The following code downloads it and saves it locally.

In [None]:
# Define download URL and local save path
url = "https://drive.google.com/uc?id=1AWPf-pJodJKeHsARQK_RHiNsE8fjPCVK&export=download"
data_path = Path("data/assignment03_people.csv")
data_path.parent.mkdir(parents=True, exist_ok=True)

# Download the file; fail loudly on an HTTP error instead of saving an error page
response = requests.get(url, timeout=30)
response.raise_for_status()
data_path.write_bytes(response.content)

print(f"✅ Dataset saved to: {data_path.resolve()}")
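A file-sharing link can sometimes respond with an HTML page (for example a sign-in or confirmation page) rather than the file itself, and such a page may still arrive with a successful status code. The sketch below shows one possible guard before writing to disk. The helper names `looks_like_html` and `save_dataset` are hypothetical and not part of the assignment, and the check is only a heuristic.

```python
from pathlib import Path


def looks_like_html(content: bytes) -> bool:
    """Heuristic: True if the payload starts like an HTML page rather than a CSV."""
    head = content[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def save_dataset(content: bytes, path: Path) -> Path:
    """Write the downloaded bytes to `path` only if they pass basic sanity checks."""
    if not content:
        raise ValueError("Download returned an empty body")
    if looks_like_html(content):
        raise ValueError("Download returned an HTML page, not a CSV file")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path


# Small offline demonstration with made-up payloads
print(looks_like_html(b"<!DOCTYPE html><html><body>Sign in</body></html>"))
print(looks_like_html(b"Name,Email\nAda,ada@example.com\n"))
```

In this notebook, `save_dataset(response.content, data_path)` could stand in for the direct `data_path.write_bytes(response.content)` call.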


### 3. Load the Data

Load the CSV file into a DataFrame and inspect its structure.

In [None]:
# Load the CSV file into a pandas DataFrame
# This reads structured tabular data from the specified file path
df = pd.read_csv(data_path)  
# Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

# Preview the first five rows of the dataset
# Useful for checking column names, data types, and general structure
df.head()  
# Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html


### 4. Extract Email Domains

We extract the domain (e.g. gmail.com) from each email address so that we can count how often each domain occurs.


In [None]:
# Extract domain from email safely
df['domain'] = (
    df['Email']
    .astype(str)
    .str.lower()
    .str.extract(r'@([\w\.-]+)$')[0]
)

# Count frequency of each domain
domain_counts = df['domain'].value_counts()

# Count number of unique domain types
unique_domain_count = df['domain'].nunique()

# Display results
print(f"📌 Total unique email domain types: {unique_domain_count}")
print("📊 Domain frequency table:")
print(domain_counts)
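To see what the pattern `@([\w\.-]+)$` accepts and rejects, the same regular expression can be tried on a few strings with the standard `re` module. This is a minimal sketch: the sample addresses are made up, and `extract_domain` is a hypothetical helper that mirrors the lower-casing and extraction steps above for a single value.

```python
import re

# Same pattern as in the pandas chain: everything after an "@" up to the end of the string
DOMAIN_PATTERN = re.compile(r'@([\w\.-]+)$')


def extract_domain(email):
    """Return the lower-cased domain of `email`, or None if the pattern does not match."""
    match = DOMAIN_PATTERN.search(str(email).lower())
    return match.group(1) if match else None


# Made-up sample values, including two that should not match
samples = ["Alice@Example.com", "bob@mail.example.org", "no-at-sign", "trailing@space.com "]
for sample in samples:
    print(repr(sample), "->", extract_domain(sample))
```

The last sample shows that a trailing space prevents a match, because the character class does not include whitespace. If the `Email` column could contain stray whitespace, adding `.str.strip()` before `.str.extract(...)` would be one way to handle it; the notebook does not establish whether this dataset needs it.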


### 5. 🥧 Create the Pie Chart

Only three unique domains exist in the dataset, so the chart shows all of them directly and no additional grouping is required.
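For comparison, if a dataset had many more domains, the smaller ones could be folded into a single slice before plotting. The sketch below is illustrative only and is not used in this notebook: `group_small_domains`, the `"Other"` label, and the example counts are all assumptions.

```python
from collections import Counter


def group_small_domains(counts, top_n=5, other_label="Other"):
    """Keep the `top_n` most frequent domains and sum the rest under `other_label`."""
    ranked = Counter(counts).most_common()  # sorted by count, highest first
    grouped = dict(ranked[:top_n])
    remainder = sum(count for _, count in ranked[top_n:])
    if remainder:
        grouped[other_label] = remainder
    return grouped


# Made-up counts for illustration
example_counts = {
    "example.com": 40,
    "example.org": 30,
    "example.net": 20,
    "a.test": 6,
    "b.test": 4,
}
print(group_small_domains(example_counts, top_n=3))
```

With the notebook's data, `domain_counts.to_dict()` could be passed in the same way, although with only three domains the result would be unchanged.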

In [None]:
# Prepare labels and sizes from full domain distribution
labels = domain_counts.index.tolist()
sizes = domain_counts.values.tolist()
colours = sns.color_palette('pastel', n_colors=len(labels))

# Function to show percentage and count
def make_autopct(values):
    def autopct(pct):
        total = sum(values)
        count = int(round(pct * total / 100.0))
        return f"{pct:.1f}%\n({count})"
    return autopct

# Create pie chart
fig, ax = plt.subplots(figsize=(8, 8))
pie_result = ax.pie(
    sizes,
    labels=labels,
    autopct=make_autopct(sizes),
    startangle=140,
    colors=colours,
    wedgeprops={'edgecolor': 'white'},
    textprops={'fontsize': 12}
)

# Unpack result safely
if len(pie_result) == 3:
    wedges, texts, autotexts = pie_result
else:
    wedges, texts = pie_result
    autotexts = []

# Finalise chart
ax.set_title('Distribution of Email Domains', fontsize=16)
ax.axis('equal')  # Ensures pie is circular
plt.tight_layout()
plt.show()
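The `make_autopct` closure recovers each slice's count from the percentage that matplotlib passes to the formatter. The sketch below runs the same logic without plotting, using made-up sizes, to show why the result is rounded: a percentage computed in floating point may be slightly off, so multiplying it back could otherwise truncate to the wrong integer.

```python
def make_autopct(values):
    """Build a formatter that turns a slice percentage into 'pct%' plus the raw count."""
    total = sum(values)

    def autopct(pct):
        # Round rather than truncate, so 299.99998 becomes 300 and not 299
        count = int(round(pct * total / 100.0))
        return f"{pct:.1f}%\n({count})"

    return autopct


# Made-up slice sizes for illustration
sizes = [500, 300, 200]
formatter = make_autopct(sizes)

for size in sizes:
    pct = 100.0 * size / sum(sizes)
    print(repr(formatter(pct)))

# A percentage carrying a small floating-point error still maps to the right count
print(repr(formatter(29.999998)))
```

Computing `total` once in the outer function, rather than on every call as in the cell above, is a small optional tidy-up and does not change the labels.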


### 6. 💾 Save the Chart

Save the chart to the plots directory for examination and submission.

In [None]:
plot_path = Path("plots/assignment-03-pie-chart.jpg")
plot_path.parent.mkdir(exist_ok=True)
fig.savefig(plot_path, dpi=150)
print(f"📁 Chart saved to: {plot_path.resolve()}")