# Top 500 Supercomputers Analysis
Welcome to our interactive Jupyter notebook where we will explore and analyze data from the TOP500 supercomputers! This notebook includes interactive exercises to help you practice and apply your knowledge. Let's get started!

## Loading the Data
Let's start by loading the data and getting a glimpse of what it contains. We have information about various supercomputers, such as their ranks, performance metrics, energy efficiency, and more.

Make sure to run the code cell below to load the data and set up the necessary libraries.

In [None]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

scale = 1.75  # pick something less than 2
sns.set(rc={'figure.figsize': (4 * scale, 3 * scale)})

data_path = "data/TOP500_202306.csv"
data = pd.read_csv(data_path)

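## The following lines do some basic formatting/housekeeping, don't worry about these for now!
dfraw = data.copy()  # keep an untouched copy of the raw data
# Assign the filled columns back instead of calling fillna(..., inplace=True) on a
# column selection, which newer pandas versions warn about and may not apply.
data["Name"] = data["Name"].fillna(data["Site ID"])
data["Accelerator/Co-Processor Cores"] = data["Accelerator/Co-Processor Cores"].fillna(0)
data = data.drop(columns=["Nmax", "Nhalf", "HPCG [TFlop/s]", "Memory", "Previous Rank", "Site ID", "System ID"])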

# Display the first few rows of the dataframe
data.head(5)

### Exercise 1: Data Inspection
Now that we have loaded the data, can you identify the columns with missing values and the data types of each column? Write the code in the cell below to inspect the data.
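
As a hint, here is a minimal sketch of the inspection pattern on a tiny made-up frame. The `toy` frame and its values are invented for illustration and are not taken from the TOP500 file; the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["A", None, "C"],
    "Rmax [TFlop/s]": ["1,200.5", "800.0", "95.2"],  # text, because of the comma
    "Power (kW)": [2500.0, np.nan, 310.0],
})

# Data type of each column; text columns show up as object
# (or as a string dtype in newer pandas versions).
print(toy.dtypes)

# Number of missing values per column, keeping only columns that have any.
missing = toy.isna().sum()
print(missing[missing > 0])
```

A column that should be numeric but is reported as text is a good candidate for cleaning later on.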

## Preparing the Data for Analysis
Before we dive into analysis, let's take a moment to look at the data. We need to handle missing values and convert certain columns to the appropriate data types.

### Exercise 2: Data Cleaning
1. Identify the columns that need to be converted to numbers and write code to convert them.
2. Identify which rows have to be dropped because of missing values, and write code to remove them.

(Note: You can click on the cell below and write your code!)

## Exploring the Data
Now that our data is cleaned up, let's explore it through visualization. We will start by visualizing the distribution of some important features.

### Histograms
We will use histograms to look at the distributions of the "Power (kW)" and "Energy Efficiency [GFlops/Watts]" columns. Run the code cell below to draw them.

In [None]:
distplt_power = sns.displot(data=data, x="Power (kW)", aspect=16/9)
distplt_energy = sns.displot(data=data, x="Energy Efficiency [GFlops/Watts]", aspect=16/9)

### Exercise 3: Histogram Analysis
Analyze the histograms above and answer the following questions:
1. What can you infer about the distribution of "Power (kW)"?
2. Are there any outliers in the "Energy Efficiency [GFlops/Watts]" distribution?
3. How would you describe the overall trends in these features?

(Note: You can click on the cell below and write your answers!)

## Top 500 Ranking
The TOP500 list ranks supercomputers by "Rmax [TFlop/s]", the maximum performance each system achieved on the High-Performance LINPACK (HPL) benchmark. Let's visualize the relationship between the rankings and the "Rmax [TFlop/s]" value.

In [None]:
scatter = sns.scatterplot(data=data, x="Rank", y="Rmax [TFlop/s]", hue="Power (kW)")
plt.yscale('log')

### Exercise 4: Ranking Analysis
1. Can you find out what the rankings would be if they were based on "Energy Efficiency [GFlops/Watts]"? Write code to sort the data based on that column and replot the rankings.
2. Compare the rankings based on "Rmax [TFlop/s]" and "Energy Efficiency [GFlops/Watts]". What differences do you observe?

(Note: You can click on the cell below and write your code and answers!)
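
If you are unsure where to start, one possible approach is to sort by the column and number the rows. This is only a sketch on a made-up frame: `toy` and the "Efficiency Rank" column are names invented here, not part of the dataset.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Rank": [1, 2, 3, 4],
    "Energy Efficiency [GFlops/Watts]": [20.0, 52.5, 8.1, 35.0],
})

col = "Energy Efficiency [GFlops/Watts]"

# Sort from most to least efficient, then number the rows 1..n.
by_eff = toy.sort_values(col, ascending=False).reset_index(drop=True)
by_eff["Efficiency Rank"] = by_eff.index + 1

print(by_eff[["Name", "Rank", "Efficiency Rank"]])
```

Plotting "Efficiency Rank" against the original "Rank" is one way to see how much the two orderings disagree.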

## Exploring Regions and Vendors
Let's explore the distribution of the supercomputers among different countries and manufacturers. We will focus on the top five countries and their corresponding manufacturers.

In [None]:
countries = data.groupby("Country").count()
countries.sort_values('Rank', ascending=False)["Rank"].plot(kind="bar")
plt.xlabel("Country")
plt.ylabel("Number of Supercomputers")

### Exercise 5: Region Analysis
1. Which countries have the most supercomputers in the top 500 list?
2. What insights can you gather about the distribution of supercomputers among different regions?

(Note: You can click on the cell below and write your answers!)

In [None]:
top5countries = list(countries.sort_values('Rank', ascending=False).index[:5])
top5countrydata = data.loc[data['Country'].isin(top5countries)]

sns.swarmplot(data=top5countrydata, x="Energy Efficiency [GFlops/Watts]", y="Country", order=top5countries, size=4)

### Exercise 6: Vendor Analysis
1. Which manufacturers dominate the top five countries in terms of supercomputer presence?
2. How does the distribution of supercomputers vary among different manufacturers within the top five countries?

(Note: You can click on the cell below and write your answers!)

## Exploring Energy Efficiency
Energy efficiency is a critical aspect of supercomputers. Let's explore how energy efficiency varies among different supercomputers and identify trends and insights.

In [None]:
scatter_energy_efficiency = sns.scatterplot(data=data, x="Rank", y="Energy Efficiency [GFlops/Watts]", hue="Power (kW)")
plt.yscale('log')

### Exercise 7: Energy Efficiency Analysis
1. What trends do you observe in the energy efficiency of the top 500 supercomputers?
2. How does energy efficiency correlate with other features such as performance and power consumption?

(Note: You can click on the cell below and write your answers!)
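
To put a number on the second question, a correlation matrix is one option. The sketch below uses a made-up frame (`toy` and its values are invented for illustration) and assumes the columns have already been converted to numbers.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Rmax [TFlop/s]": [1000.0, 500.0, 250.0, 100.0],
    "Power (kW)": [20000.0, 9000.0, 6000.0, 1500.0],
    "Energy Efficiency [GFlops/Watts]": [50.0, 55.6, 41.7, 66.7],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr.round(2))
```

Values close to 1 or -1 suggest a strong linear relationship, while values near 0 suggest little linear relationship.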

### Exercise 8: Energy Efficiency Comparison
1. Compare the energy efficiency of supercomputers from different countries and manufacturers.
2. Identify the top 10 most energy-efficient supercomputers and analyze their characteristics.

(Note: You can click on the cell below and write your answers!)
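
For the second task, pandas' `nlargest` is one convenient way to pick out the top rows. This sketch uses a made-up frame and `n=3` to keep the output short; `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Country": ["X", "Y", "X", "Z", "Y"],
    "Energy Efficiency [GFlops/Watts]": [20.0, 52.5, 8.1, 35.0, 61.0],
})

# Keep the n rows with the largest values in the chosen column
# (n=3 here; use 10 on the real data).
top = toy.nlargest(3, "Energy Efficiency [GFlops/Watts]")
print(top)
```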

## Clean Up Time!
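Let's clean up some of our data. Some numeric columns contain commas as thousands separators, so they are loaded as strings rather than numbers. Cleaning means removing the commas and converting the values to a numeric type.

For instance, the value "500,000,000" needs to be cleaned to "500000000" and then converted to a number (a float in the code below).

Python allows us to do this with a lambda function:

```python
# Strip thousands separators, then convert the string to a float
to_number = lambda s: float(s.replace(',', ''))
```

Now, apply the above function to each value in a column using the `map` method:

```python
# "Column_Name" is a placeholder for the column being cleaned
data["Column_Name"] = data["Column_Name"].map(to_number)
```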
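
To see the conversion on concrete values, here is a minimal sketch on a made-up Series. The vectorized route with `str.replace` and `pd.to_numeric` is an alternative offered here for comparison, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical values formatted with thousands separators.
raw = pd.Series(["500,000,000", "1,194.5", "87"])

# Element-wise version: strip commas, then convert each string to a float.
cleaned_map = raw.map(lambda s: float(s.replace(",", "")))

# Vectorized alternative: replace on the whole Series, then convert in one call.
cleaned_vec = pd.to_numeric(raw.str.replace(",", "", regex=False))

print(cleaned_map.tolist())
print(cleaned_vec.tolist())
```

Both versions assume every value is a string. A column that pandas has already parsed as numeric does not need this step.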

### Exercise 2: Data Cleaning
1. Use the `.notna()` method to build a boolean mask and keep only the rows where the "Energy Efficiency [GFlops/Watts]" column is not missing.
2. Apply the cleaning process to the columns: "Rmax [TFlop/s]", "Rpeak [TFlop/s]", "Processor Speed (MHz)", and "Power (kW)".

Write your code in the cell below.
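
As a hint for the first step, this is roughly how a `.notna()` mask filters rows. The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Energy Efficiency [GFlops/Watts]": [20.0, np.nan, 8.1],
})

# notna() is True where a value is present; indexing with the mask keeps those rows.
mask = toy["Energy Efficiency [GFlops/Watts]"].notna()
kept = toy[mask]

print(kept)
```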

## Exploring the Data
Now that our data is cleaned up, let's explore it through visualization. We will start by visualizing the distribution of some important features.

### Histograms
We will use histograms to look at the distributions of the "Power (kW)" and "Energy Efficiency [GFlops/Watts]" columns. Run the code cell below to draw them.

In [None]:
distplt_power = sns.displot(data=data, x="Power (kW)", aspect=16/9)
distplt_energy = sns.displot(data=data, x="Energy Efficiency [GFlops/Watts]", aspect=16/9)

### Exercise 3: Histogram Analysis
Based on the histograms above, can you interpret the distributions of the "Power (kW)" and "Energy Efficiency [GFlops/Watts]" features? Are there any outliers or patterns that you can identify? Write your observations in the cell below.

## Top 500 Ranking
The TOP500 list ranks supercomputers based on their performance, specifically the "Rmax [TFlop/s]" value. Let's visualize the relationship between the rankings and the "Rmax [TFlop/s]" value.

In [None]:
scatter = sns.scatterplot(data=data, x="Rank", y="Rmax [TFlop/s]", hue="Power (kW)")
plt.yscale('log')

### Exercise 4: Ranking Analysis
Can you find out what the rankings would be if they were based on "Energy Efficiency [GFlops/Watts]"? Use the `np.argsort` function to sort the data based on that column and replot the rankings. Write your code in the cell below.
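
If `np.argsort` is new to you, the sketch below shows the idea on a made-up frame: it returns the positions that would sort an array, which can then be used to reorder the rows. The names `toy`, `order`, and `reranked` are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Energy Efficiency [GFlops/Watts]": [20.0, 52.5, 8.1, 35.0],
})

values = toy["Energy Efficiency [GFlops/Watts]"].to_numpy()

# argsort returns the positions that would sort the array in ascending order;
# negating the values gives a descending order instead.
order = np.argsort(-values)
reranked = toy.iloc[order].reset_index(drop=True)

print(reranked["Name"].tolist())
```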

## Exploring Regions and Vendors
Let's explore the distribution of the supercomputers among different countries and manufacturers. We will focus on the top five countries and their corresponding manufacturers.

In [None]:
countries = data.groupby("Country").count()
countries.sort_values('Rank', ascending=False)["Rank"].plot(kind="bar")
plt.xlabel("Country")
plt.ylabel("Number of Supercomputers")

### Exercise 5: Country Analysis
Based on the bar chart above, can you identify the top five countries with the most supercomputers? Write your observations in the cell below.

Now, we will take a closer look at the top five countries and visualize how energy efficiency is distributed within each of them.

In [None]:
top5countries = list(countries.sort_values('Rank', ascending=False).index[:5])
top5countrydata = data.loc[data['Country'].isin(top5countries)]

sns.swarmplot(data=top5countrydata, x="Energy Efficiency [GFlops/Watts]", y="Country", order=top5countries, size=4)

### Exercise 6: Energy Efficiency Analysis
Based on the swarm plot above, can you identify any patterns or insights about the energy efficiency of supercomputers in the top five countries? Write your observations in the cell below.

Next, let's explore the vendors in these top five countries.

In [None]:
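# Count systems per manufacturer within the top five countries
vendors = top5countrydata.groupby("Manufacturer").count()
top5vendors = list(vendors.sort_values('Rank', ascending=False).index[:5])
# Filter the top-five-country subset (not the full data) so both selections use the same rows
top5vendata = top5countrydata.loc[top5countrydata['Manufacturer'].isin(top5vendors)]

sns.swarmplot(data=top5vendata, x="Energy Efficiency [GFlops/Watts]", y="Country", order=top5countries, size=4, hue="Manufacturer")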

### Exercise 7: Vendor Analysis
Based on the swarm plot above, can you identify any patterns or insights about the energy efficiency of supercomputers from different manufacturers in the top five countries? Write your observations in the cell below.

## Conclusion
We have explored the data of the top 500 supercomputers, cleaned it, and visualized various aspects of the systems. You can continue the analysis further and discover exciting insights about these impressive machines!

### Exercise 8: Further Analysis
Based on what you've learned in this notebook, can you come up with a new analysis or visualization that could provide further insights into the data? Write your code in the cell below.

In [None]:
distplt_power = sns.displot(data=data, x="Power (kW)", aspect=16/9)
distplt_energy = sns.displot(data=data, x="Energy Efficiency [GFlops/Watts]", aspect=16/9)

### Exercise 3: Histogram Analysis
Based on the histograms above, can you interpret the distributions of "Power (kW)" and "Energy Efficiency [GFlops/Watts]" columns? Identify any outliers or patterns and write your observations in the cell below.

## Top 500 Ranking
The TOP500 list ranks supercomputers based on their performance, specifically the "Rmax [TFlop/s]" value. Let's visualize the relationship between the rankings and the "Rmax [TFlop/s]" value.

In [None]:
scatter = sns.scatterplot(data=data, x="Rank", y="Rmax [TFlop/s]", hue="Power (kW)")
plt.yscale('log')

### Exercise 4: Ranking Analysis
Can you find out what the rankings would be if they were based on "Energy Efficiency [GFlops/Watts]"? Write the code to sort the data based on that column and replot the rankings. Share your observations on how the rankings change when based on energy efficiency.

## Exploring Regions and Vendors
Let's explore the distribution of the supercomputers among different countries and manufacturers. We will focus on the top five countries and their corresponding manufacturers.

In [None]:
countries = data.groupby("Country").count()
countries.sort_values('Rank', ascending=False)["Rank"].plot(kind="bar")
plt.xlabel("Country")
plt.ylabel("Number of Supercomputers")

### Exercise 5: Country Analysis
Based on the bar chart above, can you identify the top 10 countries with the most supercomputers? Write the code to extract this information and visualize the results using a pie chart or another suitable visualization. Share your observations on the distribution of supercomputers among the top 10 countries.
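
One possible way to extract the counts is `value_counts`, sketched here on a made-up column. The `toy` frame is invented for illustration, and `head(2)` stands in for `head(10)` on the real data.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({"Country": ["X", "Y", "X", "Z", "X", "Y"]})

# value_counts() counts rows per country, sorted from most to least frequent.
counts = toy["Country"].value_counts()
top = counts.head(2)  # use head(10) on the real data

# Share of all systems held by each of the selected countries.
share = top / counts.sum()
print(share)
```

From here, `top.plot(kind="pie")` or `top.plot(kind="bar")` would turn the counts into a chart.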

In [None]:
top5countries = list(countries.sort_values('Rank', ascending=False).index[:5])
top5countrydata = data.loc[data['Country'].isin(top5countries)]

sns.swarmplot(data=top5countrydata, x="Energy Efficiency [GFlops/Watts]", y="Country", order=top5countries, size=4)

### Exercise 6: Vendor Analysis
Now, let's explore the vendors in these top five countries.

1. Write code to group the data by "Manufacturer" and count the number of supercomputers for each manufacturer in the top five countries.
2. Identify and visualize the top 5 manufacturers in these countries using a suitable plot.
3. Share your observations on the distribution of supercomputers among the top manufacturers.
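
As a starting point for the first step, the grouping might look like the sketch below. It runs on a made-up frame, and `top_countries` is a stand-in for the notebook's `top5countries` list.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["X", "X", "Y", "Z", "Y", "W"],
    "Manufacturer": ["M1", "M2", "M1", "M1", "M3", "M2"],
})
top_countries = ["X", "Y"]  # stand-in for top5countries

# Restrict to the selected countries, then count systems per manufacturer.
subset = toy[toy["Country"].isin(top_countries)]
per_vendor = subset.groupby("Manufacturer").size().sort_values(ascending=False)

print(per_vendor)
```

Taking `per_vendor.head(5)` on the real data would give the top five manufacturers to plot.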

## Exploring Energy Efficiency
Energy efficiency is a crucial aspect of supercomputing. Let's take a closer look at the top five countries and visualize how energy efficiency is distributed within each of them.

In [None]:
sns.swarmplot(data=top5countrydata, x="Energy Efficiency [GFlops/Watts]", y="Country", order=top5countries, size=4)

### Exercise 7: Energy Efficiency Analysis
Based on the swarm plot above, analyze the energy efficiency of the top five countries.

1. Identify the country with the highest and lowest energy efficiency.
2. Write code to calculate the average energy efficiency for each of the top five countries.
3. Visualize the average energy efficiency using a suitable plot.
4. Share your insights on how energy efficiency varies among the top five countries.
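
For the second step, a grouped mean is one way to go. The sketch below uses a made-up frame (`toy` and its values are invented for illustration) and assumes the efficiency column is already numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["X", "X", "Y", "Y", "Z"],
    "Energy Efficiency [GFlops/Watts]": [20.0, 30.0, 50.0, np.nan, 8.0],
})

# Mean per country; missing values are skipped by default.
avg = (
    toy.groupby("Country")["Energy Efficiency [GFlops/Watts]"]
    .mean()
    .sort_values(ascending=False)
)
print(avg)
```

Calling `avg.plot(kind="bar")` afterwards would give a simple bar chart of the averages.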

## Conclusion
We have explored the data of the top 500 supercomputers, cleaned it, and visualized various aspects of the systems. You can continue the analysis further and discover exciting insights about these impressive machines!

### Exercise 8: Further Exploration
Now that you have analyzed various aspects of the top 500 supercomputers, it's time to apply your knowledge and creativity.

1. Choose an aspect of the data that interests you (e.g., a specific country, manufacturer, or performance metric).
2. Formulate a question or hypothesis related to that aspect.
3. Write code to analyze the data and answer your question or test your hypothesis.
4. Visualize the results using appropriate plots.
5. Write a brief conclusion summarizing your findings and insights.

Feel free to use all the tools and techniques you've learned in this notebook. Happy exploring!