# NERSC USERS and APPLICATION POWER and ENERGY ANALYSIS
![HPC Analysis](https://images.unsplash.com/photo-1542744173-05336fcc7ad4?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3w0NDgzMDl8MHwxfHNlYXJjaHwxfHxkYXRhJTIwYW5hbHlzaXMlMjBrZXklMjBmaW5kaW5nc3xlbnwwfHx8fDE2OTA5NTY5MTB8MA&ixlib=rb-4.0.3&q=80&w=400)
This notebook provides tools for analyzing High-Performance Computing (HPC) power and energy consumption, based on user projects that are active on Perlmutter at NERSC. The analysis covers active users, top projects, GPU utilization, and more.

# Exploring the World of Data with Visual Flair 📊

![Data Exploration](https://images.unsplash.com/photo-1551288049-bebda4e38f71?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3w0NDgzMDl8MHwxfHNlYXJjaHw0fHxkYXRhJTIwZXhwbG9yYXRpb258ZW58MHx8fHwxNjkwOTU3NjgxfDA&ixlib=rb-4.0.3&q=80&w=400)

Welcome to our data exploration journey! In this notebook, we'll be diving into a fascinating dataset, uncovering insights, patterns, and stories hidden within the numbers. But worry not, we won't be alone on this adventure. We have some of the most powerful Python libraries by our side:

- **Pandas**: Our trusty data manipulation tool, capable of slicing and dicing the data just the way we want it.
- **NumPy**: The mathematical wizard, handling all the numerical operations with ease and grace.
- **Matplotlib**: The artist of the group, painting our insights in the form of beautiful and informative plots.
- **Seaborn**: Matplotlib's sophisticated sibling, adding a touch of elegance and simplicity to our visualizations.

Together, these tools will help us unravel the mysteries within our data. We'll start by setting a clean and crisp style for our plots, thanks to Seaborn's 'whitegrid'. Then, we'll read in our data file (don't forget to upload it or provide the path), and take a sneak peek at the first few rows.

Ready to embark on this exciting exploration? Let's dive in! 🚀


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the style for the plots
sns.set_style('whitegrid')

# Reading the data (please upload the file or provide the path)
# data = pd.read_csv('path_to_file.csv')

# Displaying the first few rows of the data
# data.head()

In [None]:
# Sample data
data_sample = '''
User Name,Organization,Organization Type,Organization Country,Is Active,CPU Compute Allocation,CPU Node Hours Charged,GPU Compute Allocation,GPU Node Hours Charged,PI Name,Project Description,Department,Organization City
u232,Auburn University,UNIV,United States of America,True,23399.0,2426.4169444444447,150.0,0.0,Michael Pindzola,Computational Atomic and Molecular Physics for Fusion Energy Sciences,Department of Physics,Auburn
u344,CompX - Computational Modeling and Software Development,SMBUS,United States of America,True,5000.0,0.0,7500.0,0.0,Jin Myung Park,AToM-2 SciDAC - Advanced Tokamak Modeling Environment,Fusion Theory,Del Mar
u431,Princeton Plasma Physics Laboratory (PPPL),DOELAB,United States of America,True,81000.0,13781.63361111111,10075.0,1.1616666666666666,Stephen Jardin,3D Extended MHD simulation of fusion plasmas,Theory and Computation,Princeton
u460,Princeton Plasma Physics Laboratory (PPPL),DOELAB,United States of America,True,0.0,0.0,100.0,0.0,Richard Gerber,NERSC overhead account for users with no active repo,,Princeton
u617,Pacific Northwest National Laboratory (PNNL),DOELAB,United States of America,True,15170.3,0.0,29525.0,0.0,Sotiris Xantheas,"Guest-host interactions in the gas phase, in aqueous systems and hydrate lattices","Advanced Computing, Mathematics and Data",Richland
u650,CompX - Computational Modeling and Software Development,SMBUS,United States of America,True,42800.0,0.0,11925.0,0.0,Paul Bonoli,Center for Integrated Simulation of Fusion Relevant RF Actuators: SciDAC Project,Fusion,Del Mar
u1103,University of Alaska Fairbanks,UNIV,United States of America,True,2062.5,0.0,15075.0,0.0,Jean-Noel Leboeuf,Gyrokinetic Studies of Non-Diffusive Transport,Department of Physics,Fairbanks
u1165,University of Maryland,UNIV,United States of America,True,1000.0,0.0,2500.0,0.0,William Dorland,"Turbulence, Transport and Magnetic Reconnection in High Temperature Plasma",,College Park
u70270,Lawrence Berkeley National Laboratory,DOELAB,United States of America,True,85.45555555555555,0.0,7.5,0.0,C. William McCurdy,Electron and Photon Collisions with Atoms and Molecules,Chemistry,Berkeley
u1446,Oak Ridge National Laboratory,DOELAB,United States of America,True,2000.0,0.0,2500.0,0.0,Jin Myung Park,AToM-2 SciDAC - Advanced Tokamak Modeling Environment,Fusion Energy Division,Oak Ridge
'''

# Reading the full dataset from the CSV file
# (data_sample above is not parsed; it only shows the expected column layout)
file_path = 'data/nersc_userdata2023.csv'
data = pd.read_csv(file_path)

# Displaying the first few rows of the data
print("First few rows of the data:")
print(data.head())

# Displaying column information for analysis and visualization
print("\nColumn information:")
data.info()  # info() prints its report directly and returns None

# Displaying summary statistics for numeric columns
print("\nSummary statistics for numeric columns:")
print(data.describe())

## Overview of the Dataset
Let's start by getting an overview of the dataset. We'll explore the basic statistics, data types, and null values in the dataset. This will give us a good understanding of the data we're working with.

In [None]:
# Basic statistics of the dataset
data_description = data.describe()
data_description

In [None]:
# Data types of the columns
data_types = data.dtypes
data_types

In [None]:
# Checking for missing values
missing_values = data.isnull().sum()
missing_values

## Distribution of Numerical Columns
We will visualize the distribution of numerical columns using histograms. This will help us understand the range and distribution of values in these columns.

In [None]:
# Plotting histograms for numerical columns
numerical_columns = ['CPU Compute Allocation', 'CPU Node Hours Charged', 'GPU Compute Allocation', 'GPU Node Hours Charged']
for column in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(data[column], bins=20, kde=True)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()
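
Most rows in the sample shown earlier have zero node hours charged while a few have thousands, so a linear-scale histogram can collapse into one tall bar near zero. The following is a minimal sketch, assuming the full file is skewed in the same way: it reports the share of zeros separately and bins the positive values on a log10 scale. The values are copied from the sample rows, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# 'CPU Node Hours Charged' values copied from the ten sample rows
charged = pd.Series(
    [2426.4169444444447, 0.0, 13781.63361111111, 0.0, 0.0,
     0.0, 0.0, 0.0, 0.0, 0.0],
    name='CPU Node Hours Charged',
)

# Share of users who were charged nothing at all
zero_share = (charged == 0).mean()

# Bin only the positive values, on a log10 scale
positive = charged[charged > 0]
counts, log_edges = np.histogram(np.log10(positive), bins=5)

print(f"Share of zero values: {zero_share:.0%}")
print("Counts per log10 bin:", counts.tolist())
```

The same idea carries over to plotting: the positive values can be drawn on a logarithmic x-axis while the zero share is stated in the caption.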

## Distribution of Categorical Columns
We will explore the distribution of categorical columns such as 'Organization Type', 'Organization Country', and 'Is Active'. Understanding the distribution of these categories will provide insights into the composition of the dataset.

In [None]:
# Plotting bar plots for categorical columns
categorical_columns = ['Organization Type', 'Organization Country', 'Is Active']
for column in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=data, x=column, order=data[column].value_counts().index)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()

## Analyzing Relationships Between Variables
We will analyze the relationships between different variables in the dataset. This includes exploring correlations between numerical variables and understanding how different factors such as organization type and country influence compute allocations.

In [None]:
# Correlation matrix for numerical variables
correlation_matrix = data[numerical_columns].corr()

# Plotting the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix for Numerical Variables')
plt.show()
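
`DataFrame.corr()` computes Pearson correlation by default, which a handful of very large allocations can dominate. As a hedged alternative, the sketch below computes a rank-based (Spearman) correlation by ranking each column and then taking the Pearson correlation of the ranks. The values are copied from the first five sample rows; whether rank correlation is the better summary for the full dataset is an assumption to check against the histograms.

```python
import pandas as pd

# Allocation and usage values copied from the first five sample rows
sample = pd.DataFrame({
    'CPU Compute Allocation': [23399.0, 5000.0, 81000.0, 0.0, 15170.3],
    'CPU Node Hours Charged': [2426.4169444444447, 0.0, 13781.63361111111, 0.0, 0.0],
})

# Pearson correlation on the raw values (the pandas default)
pearson = sample.corr()

# Spearman correlation: Pearson correlation of the average ranks
spearman = sample.rank().corr()

print(pearson.round(3))
print(spearman.round(3))
```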

### Compute Allocations by Organization Type
We will visualize the CPU and GPU compute allocations across different organization types. This will help us understand how resources are allocated among various types of organizations.

In [None]:
# Plotting CPU and GPU compute allocations by organization type
plt.figure(figsize=(14, 6))
sns.barplot(x='Organization Type', y='CPU Compute Allocation', data=data, ci=None, color='blue', alpha=0.6, label='CPU Allocation')
sns.barplot(x='Organization Type', y='GPU Compute Allocation', data=data, ci=None, color='green', alpha=0.6, label='GPU Allocation')
plt.title('Compute Allocations by Organization Type')
plt.xlabel('Organization Type')
plt.ylabel('Compute Allocation')
plt.legend()
plt.xticks(rotation=45)
plt.show()
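
`sns.barplot` plots the mean of each group by default, so the chart above compares average per-user allocations rather than totals. A small summary table makes the difference explicit. This is a sketch on five of the sample rows; the name `summary` is illustrative.

```python
import pandas as pd

# Organization type and allocations copied from the first five sample rows
sample = pd.DataFrame({
    'Organization Type': ['UNIV', 'SMBUS', 'DOELAB', 'DOELAB', 'DOELAB'],
    'CPU Compute Allocation': [23399.0, 5000.0, 81000.0, 0.0, 15170.3],
    'GPU Compute Allocation': [150.0, 7500.0, 10075.0, 100.0, 29525.0],
})

# Mean (what the bar chart shows), total, and number of rows per organization type
summary = (
    sample
    .groupby('Organization Type')[['CPU Compute Allocation', 'GPU Compute Allocation']]
    .agg(['mean', 'sum', 'count'])
)

print(summary)
```

Reading the mean next to the count helps avoid over-interpreting a tall bar that is backed by only one or two users.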

### Compute Allocations by Organization Country
We will visualize the CPU and GPU compute allocations across different countries. This will provide insights into the distribution of resources among various countries.

In [None]:
# Plotting CPU and GPU compute allocations by organization country
plt.figure(figsize=(14, 6))
sns.barplot(x='Organization Country', y='CPU Compute Allocation', data=data, ci=None, color='blue', alpha=0.6, label='CPU Allocation')
sns.barplot(x='Organization Country', y='GPU Compute Allocation', data=data, ci=None, color='green', alpha=0.6, label='GPU Allocation')
plt.title('Compute Allocations by Organization Country')
plt.xlabel('Organization Country')
plt.ylabel('Compute Allocation')
plt.legend()
plt.xticks(rotation=90)
plt.show()

### Relationship Between Compute Allocations and Node Hours Charged
We will explore the relationship between compute allocations (both CPU and GPU) and the corresponding node hours charged. This analysis will help us understand how the allocated resources are being utilized.

In [None]:
# Plotting the relationship between CPU Compute Allocation and CPU Node Hours Charged
plt.figure(figsize=(10, 6))
sns.scatterplot(x='CPU Compute Allocation', y='CPU Node Hours Charged', data=data, hue='Organization Type')
plt.title('Relationship Between CPU Compute Allocation and CPU Node Hours Charged')
plt.xlabel('CPU Compute Allocation')
plt.ylabel('CPU Node Hours Charged')
plt.legend(title='Organization Type')
plt.show()

# Plotting the relationship between GPU Compute Allocation and GPU Node Hours Charged
plt.figure(figsize=(10, 6))
sns.scatterplot(x='GPU Compute Allocation', y='GPU Node Hours Charged', data=data, hue='Organization Type')
plt.title('Relationship Between GPU Compute Allocation and GPU Node Hours Charged')
plt.xlabel('GPU Compute Allocation')
plt.ylabel('GPU Node Hours Charged')
plt.legend(title='Organization Type')
plt.show()
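
The scatter plots show allocation and usage side by side; a single utilization fraction per user can be easier to compare. The sketch below divides node hours charged by the allocation and returns NaN where nothing was allocated, so that zero allocations do not produce division errors or infinities. The helper name `utilization` is hypothetical, and the three rows are copied from the sample data (users u232, u431, u460).

```python
import pandas as pd

def utilization(allocation, charged):
    """Fraction of an allocation that was charged; NaN where the allocation is zero."""
    allocation = allocation.astype(float)
    # where() keeps positive allocations and replaces the rest with NaN
    return charged / allocation.where(allocation > 0)

# CPU values copied from sample rows u232, u431 and u460
sample = pd.DataFrame({
    'User Name': ['u232', 'u431', 'u460'],
    'CPU Compute Allocation': [23399.0, 81000.0, 0.0],
    'CPU Node Hours Charged': [2426.4169444444447, 13781.63361111111, 0.0],
})

sample['CPU Utilization'] = utilization(
    sample['CPU Compute Allocation'], sample['CPU Node Hours Charged']
)

print(sample[['User Name', 'CPU Utilization']])
```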

### Analysis of Active Users
We will analyze the active users in the dataset, focusing on their compute allocations and node hours charged. Understanding the behavior of active users can provide insights into resource utilization and efficiency.

In [None]:
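# Filtering active users
# ('Is Active' may be parsed as bool or left as text, so normalize it to lowercase text first)
active_users = data[data['Is Active'].astype(str).str.lower() == 'true']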

# Plotting CPU and GPU compute allocations for active users
plt.figure(figsize=(14, 6))
sns.barplot(x='Organization Type', y='CPU Compute Allocation', data=active_users, ci=None, color='blue', alpha=0.6, label='CPU Allocation')
sns.barplot(x='Organization Type', y='GPU Compute Allocation', data=active_users, ci=None, color='green', alpha=0.6, label='GPU Allocation')
plt.title('Compute Allocations for Active Users by Organization Type')
plt.xlabel('Organization Type')
plt.ylabel('Compute Allocation')
plt.legend()
plt.xticks(rotation=45)
plt.show()
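
When `pd.read_csv` reads a column that contains only `True` and `False`, it parses the column as bool, so a comparison against the string `'True'` matches no rows. A small helper can make the filter work whether the column arrives as bool or as text. This is a sketch: the helper name `is_active` and the user `u999` are hypothetical and are not part of the dataset.

```python
import io
import pandas as pd

def is_active(flags):
    """Return a boolean mask whether the column was parsed as bool or left as text."""
    if pd.api.types.is_bool_dtype(flags):
        return flags
    return flags.astype(str).str.strip().str.lower().eq('true')

# Two sample users plus one hypothetical inactive user (u999) for illustration
csv_text = "User Name,Is Active\nu232,True\nu344,True\nu999,False\n"
frame = pd.read_csv(io.StringIO(csv_text))

print(frame['Is Active'].dtype)          # the column is parsed as bool
active = frame[is_active(frame['Is Active'])]
inactive = frame[~is_active(frame['Is Active'])]
print(len(active), len(inactive))
```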

### Analysis of Top Projects and Departments
We will analyze the top projects and departments in terms of compute allocations and node hours charged. Understanding the leading projects and departments can provide insights into the main areas of focus and resource consumption.

In [None]:
# Analyzing top projects by CPU Compute Allocation
top_projects_cpu = data.nlargest(10, 'CPU Compute Allocation')[['Project Description', 'CPU Compute Allocation']]
plt.figure(figsize=(12, 6))
sns.barplot(x='CPU Compute Allocation', y='Project Description', data=top_projects_cpu, palette='viridis')
plt.title('Top 10 Projects by CPU Compute Allocation')
plt.xlabel('CPU Compute Allocation')
plt.ylabel('Project Description')
plt.show()

# Analyzing top departments by GPU Compute Allocation
top_departments_gpu = data.nlargest(10, 'GPU Compute Allocation')[['Department', 'GPU Compute Allocation']]
plt.figure(figsize=(12, 6))
sns.barplot(x='GPU Compute Allocation', y='Department', data=top_departments_gpu, palette='inferno')
plt.title('Top 10 Departments by GPU Compute Allocation')
plt.xlabel('GPU Compute Allocation')
plt.ylabel('Department')
plt.show()
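
Each row in the dataset is a user, and the same project or department can appear on several rows (in the sample, the AToM-2 project is listed for both u344 and u1446). Taking the ten largest rows therefore ranks users, not projects. The sketch below aggregates per group before ranking. It assumes that summing per-user allocations is the right way to get a project total, which should be confirmed against how NERSC reports allocations; the helper name `top_groups` is hypothetical.

```python
import pandas as pd

def top_groups(frame, group_col, value_col, n=10):
    """Sum value_col within each group and return the n largest totals."""
    totals = frame.groupby(group_col)[value_col].sum()
    return totals.nlargest(n)

# Project and CPU allocation copied from sample rows u344, u431 and u1446
sample = pd.DataFrame({
    'Project Description': [
        'AToM-2 SciDAC - Advanced Tokamak Modeling Environment',
        '3D Extended MHD simulation of fusion plasmas',
        'AToM-2 SciDAC - Advanced Tokamak Modeling Environment',
    ],
    'CPU Compute Allocation': [5000.0, 81000.0, 2000.0],
})

top_projects = top_groups(sample, 'Project Description', 'CPU Compute Allocation')
print(top_projects)
```

Rows with a missing group value (for example an empty `Department`) are dropped by `groupby` by default, which also keeps blank labels out of the chart.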

### Geographic Distribution of Organizations
We will visualize the geographic distribution of organizations based on their city and country. This analysis will provide a spatial understanding of where the organizations in the dataset are located.

In [None]:
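# Importing geopy to geocode locations
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="geoapi")
# Nominatim's usage policy allows at most one request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Function to geocode a city and country pair
def geocode_location(row):
    location = geocode(f"{row['Organization City']}, {row['Organization Country']}")
    if location:
        return location.latitude, location.longitude
    return None, None

# Geocoding each unique (city, country) pair once, then merging the result back onto the data
locations = data[['Organization City', 'Organization Country']].drop_duplicates().copy()
locations['Latitude'], locations['Longitude'] = zip(*locations.apply(geocode_location, axis=1))
data = data.merge(locations, on=['Organization City', 'Organization Country'], how='left')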

# Plotting the geographic distribution
plt.figure(figsize=(14, 7))
plt.scatter(data['Longitude'], data['Latitude'], c=data['CPU Compute Allocation'], cmap='viridis', s=50, alpha=0.6)
plt.title('Geographic Distribution of Organizations')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.colorbar(label='CPU Compute Allocation')
plt.grid(True)
plt.show()

### Analysis of GPU Utilization
We will analyze GPU usage across organization types by comparing GPU compute allocations with GPU node hours charged. Understanding GPU utilization can provide insights into the computational demands and efficiency of various research and development activities.

In [None]:
# Plotting GPU Compute Allocation by Organization Type
plt.figure(figsize=(14, 6))
sns.barplot(x='Organization Type', y='GPU Compute Allocation', data=data, ci=None, color='purple', alpha=0.6)
plt.title('GPU Compute Allocation by Organization Type')
plt.xlabel('Organization Type')
plt.ylabel('GPU Compute Allocation')
plt.xticks(rotation=45)
plt.show()

# Plotting GPU Node Hours Charged by Organization Type
plt.figure(figsize=(14, 6))
sns.barplot(x='Organization Type', y='GPU Node Hours Charged', data=data, ci=None, color='orange', alpha=0.6)
plt.title('GPU Node Hours Charged by Organization Type')
plt.xlabel('Organization Type')
plt.ylabel('GPU Node Hours Charged')
plt.xticks(rotation=45)
plt.show()
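
The two charts above show allocation and charged hours separately. To express utilization as a single number per organization type, one option is to divide total hours charged by total hours allocated within each group. The sketch below does this on five sample rows; the column name `GPU Utilization` is an illustrative addition, and groups with no allocation are left as NaN.

```python
import pandas as pd

# GPU values copied from the first five sample rows
sample = pd.DataFrame({
    'Organization Type': ['UNIV', 'SMBUS', 'DOELAB', 'DOELAB', 'DOELAB'],
    'GPU Compute Allocation': [150.0, 7500.0, 10075.0, 100.0, 29525.0],
    'GPU Node Hours Charged': [0.0, 0.0, 1.1616666666666666, 0.0, 0.0],
})

# Total allocation and total usage per organization type
totals = sample.groupby('Organization Type')[
    ['GPU Compute Allocation', 'GPU Node Hours Charged']
].sum()

# Ratio of totals; groups with a zero total allocation become NaN
allocated = totals['GPU Compute Allocation']
totals['GPU Utilization'] = totals['GPU Node Hours Charged'] / allocated.where(allocated > 0)

print(totals)
```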

### Investigating Inactive Users
We will investigate the inactive users in the dataset, focusing on their previous compute allocations and node hours charged. Understanding the characteristics of inactive users can provide insights into resource allocation strategies and potential areas for optimization.

In [None]:
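# Filtering inactive users
# ('Is Active' may be parsed as bool or left as text, so normalize it to lowercase text first)
inactive_users = data[data['Is Active'].astype(str).str.lower() == 'false']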

# Plotting CPU and GPU compute allocations for inactive users
plt.figure(figsize=(14, 6))
sns.barplot(x='Organization Type', y='CPU Compute Allocation', data=inactive_users, ci=None, color='blue', alpha=0.6, label='CPU Allocation')
sns.barplot(x='Organization Type', y='GPU Compute Allocation', data=inactive_users, ci=None, color='red', alpha=0.6, label='GPU Allocation')
plt.title('Compute Allocations for Inactive Users by Organization Type')
plt.xlabel('Organization Type')
plt.ylabel('Compute Allocation')
plt.legend()
plt.xticks(rotation=45)
plt.show()

### Summary and Recommendations
Based on the analysis conducted, we can summarize key findings and provide recommendations for optimizing resource allocations, enhancing efficiency, and supporting various research and development activities.

#### Key Findings
- **Active Users:** The analysis of active users showed variations in compute allocations across different organization types. Understanding the behavior of active users provided insights into resource utilization and efficiency.
- **Top Projects and Departments:** The leading projects and departments were identified based on compute allocations, highlighting the main areas of focus and resource consumption.
- **Geographic Distribution:** The spatial distribution of organizations provided an understanding of where the organizations are located, with variations in CPU compute allocation.
- **GPU Utilization:** GPU utilization across different organization types was analyzed, revealing insights into computational demands and efficiency.
- **Inactive Users:** Investigating inactive users and their previous compute allocations provided insights into resource allocation strategies and potential areas for optimization.

#### Recommendations
- **Optimize Resource Allocation:** Regularly review and reallocate resources based on user activity and project requirements to ensure efficient utilization.
- **Support Key Projects and Departments:** Allocate additional resources to top-performing projects and departments to foster innovation and research.
- **Enhance GPU Utilization:** Evaluate GPU demands and allocate resources accordingly to support computationally intensive tasks.
- **Monitor Inactive Users:** Implement monitoring and notification mechanisms to identify inactive users and reallocate resources as needed.
- **Consider Geographic Distribution:** Take into account the geographic distribution of organizations when planning infrastructure and support services.