# Title

Text about the whole point of this project

## Loading the Data

In [None]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
scale = 1.75 #pick something less than 2
sns.set(rc={'figure.figsize':(4*scale,3*scale)})


Make sure you update the path to where you have put the data! This should be a path on the NERSC system, not your local machine. 

In [None]:
data_path = "data/TOP500_202306.csv"

# Load the data into a pandas DataFrame
data = pd.read_csv(data_path)

##The following lines do some basic formatting/housekeeping, don't worry about these for now!
dfraw = data.copy()
data["Name"] = data["Name"].fillna(data["Site ID"])
data["Accelerator/Co-Processor Cores"] = data["Accelerator/Co-Processor Cores"].fillna(0)
data = data.drop(columns = ["Nmax", "Nhalf", "HPCG [TFlop/s]", "Memory", "Previous Rank", "Site ID", "System ID"])

# Display the first few rows of the dataframe
data.head(5)

In [None]:
# Let's take a look at the data types!
data.info()

## Preparing the Data for Analysis

Notice something strange about the data? For example, are there missing values? Also, are all the numeric values actually numbers, or are some of them "objects"?

Noticing this is an important part of data analysis, and draws on your background as a scientist, engineer, or mathematician, or on other experience!

Can you figure out which columns need to be converted to numbers? 

Can you figure out which data need to be removed in order to correctly achieve our goal analysis?

### Exercise

You can click on this cell and type your guesses here!

### Clean Up Time!

While you think about that, let's clean up some of our data. What does cleaning mean?

For example, many of the numeric values are written in the format "500,000,000" with commas, which we can understand. But Python does not understand that this is the number five hundred million, or 500000000. This is the case for many of the values in this data. So we have to clean the data, by removing the commas, and casting the value into the appropriate data type. 

First, let's figure out what needs to be done to the value to "clean it." 

The value x is "500,000,000". We need to: 
- Remove the commas
- Cast the value into an integer

In Python, strings have a `replace` method that searches for the substring given as the first positional argument and replaces every occurrence of it with the second.

`x.replace(',','')`

This will turn x into "500000000". This is still a string! So we need to cast it. This is done with the `int` or `float` function, for casting to an integer or a floating point number respectively.

`int(x.replace(',',''))`
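As a quick check, the two steps can be tried on a single value. This is a minimal sketch; the variable names `raw`, `no_commas`, `cleaned`, and `as_float` are made up for illustration, and the second value is an invented example of a decimal number.

```python
# A single value as it might appear in the raw file (illustrative value)
raw = "500,000,000"

# Step 1: remove the commas; the result is still a string
no_commas = raw.replace(",", "")
print(type(no_commas), no_commas)

# Step 2: cast the string to an integer
cleaned = int(no_commas)
print(type(cleaned), cleaned)

# float() works the same way and also accepts decimals
as_float = float("1,194.00".replace(",", ""))
print(as_float)
```

Note that `int` would raise an error on a string like "1194.00", which is one reason to use `float` for columns that may contain decimals.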

Great! We now know how to convert one value easily. What about a whole column?

Luckily, we can tell Python that we have a function that takes one value and returns another (based on the input value), and ask it to apply that function to each value in a column. We will do this using the `map` method together with a `lambda` function.

An example of a lambda function is:

`myx = lambda a: a + 10`

This means that calling `myx` with any value returns that value plus 10. A lambda is a really fast way to define a simple function. In our case, we want to take each value in our column, apply `replace`, and then convert the result to a number. So our function is:

`clean = lambda x: float(x.replace(',',''))`
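To see that a lambda is just a compact way of writing a small function, here is a sketch that compares one to an ordinary `def`. The names `add_ten`, `add_ten_def`, and `to_number` are hypothetical and only used for this illustration.

```python
# A lambda and an equivalent regular function (names are illustrative)
add_ten = lambda a: a + 10

def add_ten_def(a):
    return a + 10

# Both give the same result for the same input
print(add_ten(5), add_ten_def(5))

# The same idea applied to the cleaning step
to_number = lambda s: float(s.replace(",", ""))
print(to_number("2,000"))
```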

And lastly, we want to apply that function to every value in a column individually (element-wise operation). We can do this by using the `map` function: 

```df["col_name"] = df["col_name"].map(lambda x: int(x.replace(',','')))```
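Before running this on the real data, it may help to try it on a tiny table. The sketch below uses a made-up DataFrame called `toy` with a hypothetical `cores` column; the values are invented.

```python
import pandas as pd

# Toy data standing in for a column of comma-formatted strings (values are made up)
toy = pd.DataFrame({"cores": ["1,000", "25,000", "8,730,112"]})
print(toy["cores"].dtype)   # a string-like dtype before cleaning

# Apply the cleaning function to every value in the column
toy["cores"] = toy["cores"].map(lambda x: int(x.replace(",", "")))
print(toy["cores"].dtype)   # a numeric dtype after cleaning

# Numeric operations now work as expected
total = toy["cores"].sum()
print(total)
```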


But first...

If we don't cut out the rows with missing data, we will get errors when we try to clean the data. So, did you figure out which data to remove? If you did, use the `.notna()` function to tell pandas to keep only the rows in the dataframe where the specified column's value is not NA, which drops the rows where it is missing:

`data = data[data[column_to_use_for_cutting].notna()]`
What should `column_to_use_for_cutting` be?
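To see what this filtering pattern does, here is a minimal sketch on a made-up table. The names `toy`, `system`, `power`, `mask`, and `toy_clean` are hypothetical and do not refer to columns in the real data.

```python
import numpy as np
import pandas as pd

# Toy table with one missing value (column names and values are made up)
toy = pd.DataFrame({
    "system": ["A", "B", "C"],
    "power": ["1,200", np.nan, "350"],
})

# notna() returns True where a value is present and False where it is missing
mask = toy["power"].notna()
print(mask.tolist())  # [True, False, True]

# Indexing with the mask keeps only the rows where it is True
toy_clean = toy[mask]
print(toy_clean)
```

Row "B" is dropped because its `power` value is missing, so a later call to `.replace` on that column would no longer hit a non-string value.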

### Exercise

In [None]:
## This will eliminate any data rows where the power value is not specified
#column_to_use_for_cutting = #what goes here?
#data = data[data[column_to_use_for_cutting].notna()]

## Solution
column_to_use_for_cutting = "Energy Efficiency [GFlops/Watts]"
data = data[data[column_to_use_for_cutting].notna()]

# Solutions can include the Power col or the Power source col (assuming the same rows are missing)


In [None]:
## This uses the above formula to convert object/string values to an integer
data["Total Cores"] = data["Total Cores"].map(lambda x: int(x.replace(',','')))

## Try it out! Follow the formula above to convert the following values correctly:
# data["Rmax [TFlop/s]"] = data["Rmax [TFlop/s]"]#What goes here??
# data["Rpeak [TFlop/s]"] = data["Rpeak [TFlop/s]"]#What goes here??
# data["Processor Speed (MHz)"] = data["Processor Speed (MHz)"]#What goes here??                               
# data["Power (kW)"] = data["Power (kW)"]#What goes here??

### Solution                                                                  
data["Rmax [TFlop/s]"] = data["Rmax [TFlop/s]"].map(lambda x: float(x.replace(',','')))
data["Rpeak [TFlop/s]"] = data["Rpeak [TFlop/s]"].map(lambda x: float(x.replace(',','')))
data["Processor Speed (MHz)"] = data["Processor Speed (MHz)"].map(lambda x: float(x.replace(',','')))
data["Power (kW)"] = data["Power (kW)"].map(lambda x: float(x.replace(',',''))) 

Now let's double check that the data is cleaned up and ready to use!

In [None]:
data.info()
columns = list(data.columns)

To make it easier to select multiple columns at the same time, we can use the index value associated with each column:

In [None]:
## You can pick which columns to describe - remember they have to be numeric!

cols = [11,13,14,18,20]
cols_to_describe = [columns[i] for i in cols]
print(cols_to_describe)
data[cols_to_describe].describe()
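Hard-coded positions can shift if columns are added or dropped. One alternative, shown here only as a sketch on a made-up table, is to let pandas pick out the numeric columns with `select_dtypes`. The names `toy`, `numeric_cols`, and `summary` and the column names are hypothetical.

```python
import pandas as pd

# Toy table with one text column and two numeric columns (values are made up)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "cores": [1000, 25000, 8730112],
    "power_kw": [120.5, 900.0, 22703.0],
})

# Keep only the columns whose dtype is numeric
numeric_cols = list(toy.select_dtypes(include="number").columns)
print(numeric_cols)

# describe() then summarizes just those columns
summary = toy[numeric_cols].describe()
print(summary)
```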

## Exploring the Data

A good way to get a sense of the data is to plot it. Histograms are especially useful, because they show how the values are distributed.
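To make concrete what a histogram shows, here is a sketch that counts a few made-up values into equal-width bins with `np.histogram`. A histogram plot draws counts of this kind as bars, although a plotting library may choose its bins differently. The names `values`, `counts`, and `edges` are illustrative.

```python
import numpy as np

# A handful of made-up measurements
values = np.array([1, 2, 2, 3, 3, 3, 8, 9])

# Split the range of the data into 4 equal-width bins and count values per bin
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin boundaries
print(counts)  # number of values falling in each bin
```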

In [None]:
distplt = sns.displot(data = data, x = "Power (kW)", aspect=16/9)

In [None]:
distplt = sns.displot(data = data, x = "Energy Efficiency [GFlops/Watts]", aspect=16/9)

## Top 500 Ranking:

blurb!

## Exercise: 
- Ask students to determine which quantity the systems are ranked on.
- This can then be compared and contrasted with the Green500 data.

In [None]:
# We can see that these systems are ranked based on this Rmax value:
scatter = sns.scatterplot(data = data, x = "Rank", y = "Rmax [TFlop/s]", hue = "Power (kW)")
plt.yscale('log')

### Exercise:
What would the rankings be if they were ranked on ____?
- use something like np.argsort to sort on a different column and replot.

In [None]:
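## argsort gives the positions that sort the values in ascending order; reversing them ranks from highest to lowest
inds = np.argsort(data["Energy Efficiency [GFlops/Watts]"].to_numpy())
sns.scatterplot(x = range(len(inds)), y = data["Energy Efficiency [GFlops/Watts]"].iloc[inds[::-1]])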

In [None]:
## We can reorder the data to have our new ranking!
new_ranking = data.sort_values("Energy Efficiency [GFlops/Watts]", ascending = False)
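As a sketch of how `np.argsort` and `sort_values` relate, the example below reorders a made-up table both ways and attaches a new rank column. The names `toy`, `eff`, `order`, `ranked`, and `new_rank` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy table of systems with a made-up efficiency value
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "eff": [12.5, 60.1, 3.2, 45.0],
})

# np.argsort gives the positions that would sort the values in ascending order
order = np.argsort(toy["eff"].to_numpy())
print(order[::-1])  # positions from highest to lowest value

# sort_values does the reordering directly on the DataFrame
ranked = toy.sort_values("eff", ascending=False).reset_index(drop=True)

# After resetting the index, position + 1 is the new rank
ranked["new_rank"] = ranked.index + 1
print(ranked)
```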

## Exploring regions and vendors

In [None]:
countries = data.groupby("Country").count()
countries.head()

We can sort these values and look at which countries have the most systems in the top 500:

In [None]:
countries.sort_values('Rank', ascending=False)["Rank"].plot(kind = "bar")
plt.xlabel("Country")
plt.ylabel("Number of systems")
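Note that `groupby(...).count()` counts the non-missing entries in every column, so reading the result from the `Rank` column relies on `Rank` never being missing. A more direct way to count rows per category is `value_counts`. The sketch below uses made-up data; `toy`, `counts`, `via_groupby`, and `top2` are hypothetical names.

```python
import pandas as pd

# Toy table with invented country labels
toy = pd.DataFrame({
    "Country": ["X", "Y", "X", "Z", "X", "Y"],
    "Rank": [1, 2, 3, 4, 5, 6],
})

# value_counts returns the number of rows per country, largest first
counts = toy["Country"].value_counts()
print(counts)

# The groupby approach gives the same numbers when Rank has no missing values
via_groupby = toy.groupby("Country").count()["Rank"]
print(via_groupby)

# The first entries of the index are the most common categories
top2 = list(counts.index[:2])
print(top2)
```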

Let's take a deep dive into the top 5 countries:

In [None]:
top5countries = list(countries.sort_values('Rank', ascending=False).index[:5])
print(top5countries)
top5countrydata = data.loc[data['Country'].isin(top5countries)]
sns.scatterplot(data = top5countrydata, x = "Rank", y = "Energy Efficiency [GFlops/Watts]", hue = "Country")

That is really hard to interpret! Let's try another type of plot:

In [None]:
sns.swarmplot(data = top5countrydata, x = "Energy Efficiency [GFlops/Watts]", 
              y = "Country", order = top5countries, size = 4)


Let's look into the vendors of these machines

In [None]:
vendors = top5countrydata.groupby("Manufacturer").count()
vendors.sort_values('Rank', ascending=False)
top5vendors = list(vendors.sort_values('Rank', ascending=False).index[:5])
print(top5vendors)
top5vendata = data.loc[data['Manufacturer'].isin(top5vendors)]
top5vendata

In [None]:
sns.swarmplot(data = top5vendata, x = "Energy Efficiency [GFlops/Watts]", y = "Manufacturer", order = top5vendors, size = 4, hue = "Country")


Oops! We only want our top 5 countries for now:

In [None]:
top5vcountry = top5vendata.loc[top5vendata['Country'].isin(top5countries)]
sns.swarmplot(data = top5vcountry, x = "Energy Efficiency [GFlops/Watts]", y = "Manufacturer", order = top5vendors, size = 4, hue = "Country")


In [None]:
sns.swarmplot(data = top5vcountry, x = "Energy Efficiency [GFlops/Watts]", y = "Country", order = top5countries, size = 4, hue = "Manufacturer")
