# data visualization

we will use `matplotlib`, an open-source data visualization library written for Python (but we call it from Julia using `PyPlot`). 

* [documentation](https://matplotlib.org/) for `matplotlib`
* a `matplotlib` [gallery](https://matplotlib.org/gallery/index.html).

In [1]:
using DataFrames
using CSV
using PyPlot # plotting library

# check out all of the styles! https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html
PyPlot.matplotlib.style.use("seaborn-pastel")

# change settings for all plots at once, e.g. font size
rcParams = PyPlot.PyDict(PyPlot.matplotlib."rcParams")
rcParams["font.size"] = 16

16

## bar plots: steak preferences

*goal*: visualize how Americans like their steak done [[source](https://fivethirtyeight.com/features/how-americans-like-their-steak/)] via a bar plot showing preference for rare, medium rare, etc.

a problem here when reading in the file: the column names are missing from `steak.csv`!
* first column: "do you eat steak?"
* second column: "how do you like your steak done?"

(for ambitious Beavers only) I'm running a shell command here from Julia to look at the head (first 3 lines) of the file.
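for a version that does not depend on a shell, the first few lines of a file can also be read with Base Julia alone. this is only a sketch: `head_lines` is a hypothetical helper name, and the demo file and its contents are made up.

```julia
# hypothetical helper: return the first `n` lines of a text file
function head_lines(filename::AbstractString, n::Integer=3)
    open(filename) do io
        # eachline is lazy, so reading stops after the first n lines
        collect(Iterators.take(eachline(io), n))
    end
end

# demo on a throwaway file with made-up content
demo_file = tempname()
write(demo_file, "a,b\n1,2\n3,4\n5,6\n")
foreach(println, head_lines(demo_file, 3))
```

calling `head_lines("steak.csv")` would then show whether the file starts with a header row.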

we can pass the column names to `CSV.read` manually via `header`. we also pass `copycols=true` to make the columns mutable, since we will later drop `missing` values.

In [None]:
df_steak = CSV.read("steak.csv", header=[:eats_steak, :how_cooked], copycols=true)
first(df_steak, 10) # display only the first ten rows to avoid clutter in notebook
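to see what `header` does without needing `steak.csv`, here is a small sketch on made-up data. it assumes a recent CSV.jl release, where `CSV.read` takes the output type (`DataFrame`) as a required second argument; the call in the cell above follows the older form without it.

```julia
using CSV, DataFrames

# made-up two-column data with no header row, standing in for steak.csv
raw = "Yes,Medium rare\nNo,\nYes,Rare\n"

# assumption: recent CSV.jl, where the sink (DataFrame) must be passed explicitly
df_demo = CSV.read(IOBuffer(raw), DataFrame; header=[:eats_steak, :how_cooked])

names(df_demo)  # the names we passed in, since the data has no header row
```

the empty field in the second row is read as `missing`, which is what we drop in the next step.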

remove missing values (folks who do not eat steak should not count in our visualization)

In [5]:
dropmissing!(df_steak)
first(df_steak, 10)

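a minimal sketch of what `dropmissing!` does, on a made-up table (the values are for illustration only):

```julia
using DataFrames

# made-up responses; `missing` marks a question that was not answered
df_demo = DataFrame(eats_steak=["Yes", "No", "Yes"],
                    how_cooked=["Rare", missing, "Medium"])

# removes, in place, every row that has a missing value in any column
dropmissing!(df_demo)
```

to only consider some columns when looking for `missing`s, pass them as a second argument, e.g. `dropmissing!(df_demo, :how_cooked)`.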

let's make sure everyone in this data set now eats steak and thus can answer the question

In [6]:
unique(df_steak[:, :eats_steak])


now let's count the number of folks who prefer rare, medium rare, etc.

In [7]:
df_steak_pref = aggregate(df_steak, :how_cooked, length)

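`aggregate` was removed in later DataFrames.jl releases. assuming a newer version, the same count can be sketched with `groupby` and `combine`; the data below is made up, and the count column is named `:count` here rather than `:eats_steak_length`.

```julia
using DataFrames

# made-up preferences
df_demo = DataFrame(how_cooked=["Rare", "Medium", "Rare", "Well", "Medium", "Rare"])

# one row per distinct value of :how_cooked, with the group size in :count
df_counts = combine(groupby(df_demo, :how_cooked), nrow => :count)
```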

let's re-order the rows, since the categories are ordinal: they are ranked by how long the steak is cooked.

In [8]:
df_steak_pref = df_steak_pref[[2, 1, 3, 4, 5], :]

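hard-coding `[2, 1, 3, 4, 5]` only works if the rows happen to come out in that particular order. a more robust sketch computes the permutation from an explicit ranking. the label strings below are assumptions about what appears in the data, and `doneness`, `labels`, `ranks`, and `perm` are names introduced only for this example.

```julia
# desired order, least to most cooked (assumed label spellings)
doneness = ["Rare", "Medium rare", "Medium", "Medium Well", "Well"]

# made-up example of the order in which a group-by might return the labels
labels = ["Medium rare", "Rare", "Medium", "Medium Well", "Well"]

# rank of each label in `doneness`, then the permutation that sorts by rank
ranks = [findfirst(==(l), doneness) for l in labels]
perm = sortperm(ranks)

labels[perm]  # the labels in the desired order
```

the rows could then be re-ordered with `df_steak_pref[perm, :]`.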

let's define the percentage of the population that likes each way of cooking steak.
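one way to turn counts into percentages, sketched on made-up counts (the notebook itself goes on to plot raw counts):

```julia
# made-up number of responses per preference
counts = [10, 30, 40, 20]

# share of respondents in each category, in percent
percentages = 100 .* counts ./ sum(counts)
```

the dots broadcast the arithmetic over each element of `counts`.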

In [10]:
nb_preferences = size(df_steak_pref)[1] # will be useful later


finally, let's make a bar graph!

In [11]:
# construct a blank figure panel
figure()
# make bar plot (h = horizontal)
#  first argument: y-positions to center the bars [1, 2, 3, 4, 5]
#  second argument: length of the bars (which we want to set as the number of ppl that eat steak that way)
barh(1:nb_preferences, df_steak_pref[!, :eats_steak_length])
# specify wut to label the y-ticks
#  first argument: where to draw the y-ticks
#  second argument: wut text to put as a y-tick label
yticks(1:nb_preferences, df_steak_pref[!, :how_cooked])
# label x-axis, give a title
xlabel("# responses")
title("how do you like your steak?")
# save as pdf
tight_layout() # needed to adjust bounding box to include labels
savefig("steak.pdf", format="pdf")


... we can do better by using colors to indicate ordinality of the ways to cook the steak.

see matplotlib's choice of [colormaps](https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html) for representing numerical scales with color.

let's construct the `hot` colormap. this is a function that takes in floating point numbers from 0 to 1 and returns a color (R, G, B, alpha) where R = red, G = green, B = blue, alpha = opacity (1.0 is fully opaque).

In [12]:
hot_colormap = PyPlot.matplotlib.cm.get_cmap("hot")
hot_colormap(0.0) # color representing 0.0

(0.0416, 0.0, 0.0, 1.0)

assuming equal spacing between each way of cooking the steak, let's color each bar accordingly.

In [13]:
colors = [hot_colormap(1.0 - p / nb_preferences) for p = 1:nb_preferences]

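to see which positions on the colormap the expression above picks, here is a sketch that evaluates only the arithmetic, assuming five categories (`nb` and `positions` are names used just for this example):

```julia
nb = 5  # assumed number of categories

# equally spaced colormap positions, one per bar
positions = [1.0 - p / nb for p = 1:nb]
```

the positions step down by `1 / nb` from `1 - 1 / nb` to (approximately) zero, so position 1.0, the lightest end of `hot`, is never used, and the last bar gets the darkest color.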

In [14]:
figure()
# pass colors of each bar
barh(1:nb_preferences, df_steak_pref[!, :eats_steak_length], color=colors)
yticks(1:nb_preferences, df_steak_pref[!, :how_cooked])
xlabel("# responses")
title("how do you like your steak?")
tight_layout()
savefig("steak_colored.pdf", format="pdf")


## histograms: salaries of professors at Oregon State University

I extracted these from [here](https://hr.oregonstate.edu/employees/administrators-supervisors/classification-compensation/salary-reports).

*goal*: visualize the distribution of 9-month salaries among OSU professors (assistant, associate, and full)

In [15]:
where_my_salary_file = joinpath("salaries", "osu_salaries.csv")

"salaries\\osu_salaries.csv"

In [16]:
df_salaries = CSV.read(where_my_salary_file, copycols=true) # copycols so we have mutable columns
first(df_salaries, 10)


filter out all employees but professors: specifically, keep only Assistant Professors, Associate Professors, and Professors

In [17]:
filter!(row -> row[:job_title] in ["Assistant Professor", "Associate Professor", "Professor"], df_salaries)
first(df_salaries, 10)


for an equal comparison, only plot those on 9-month salaries

In [18]:
filter!(row -> row[:nb_months] == 9, df_salaries)
first(df_salaries, 10)


In [19]:
figure()
# hist for histogram
hist(df_salaries[:, :salary])
xlabel("9-month salary, \$")
ylabel("# professors")
title("salary of OSU profs")


we can break this down by rank: assistant, associate, and full professor

In [20]:
figure()
# partition data into groups by their job title, plot histograms separately on top of each other
for df_s in groupby(df_salaries, :job_title)
    job_in_this_group = df_s[1, :job_title]
    # label will go into legend later
    # alpha sets the opacity (0 = fully transparent, 1 = fully opaque)
    hist(df_s[:, :salary], label=job_in_this_group, alpha=0.5)
end
legend()
xlabel("9-month salary, \$")
ylabel("# professors")
title("salary of OSU profs")


an awkward attribute of the above graph is that the bins are different for each histogram. let's specify the bins ourselves.

In [21]:
bins = range(minimum(df_salaries[:, :salary]), stop=maximum(df_salaries[:, :salary]), length=20)

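note that when `bins` is a sequence, matplotlib treats its entries as bin edges, so a range of length 20 gives 19 bins. a small sketch on made-up numbers (`x`, `edges`, and `nb_bins` are names used just for this example):

```julia
x = [1.0, 2.0, 4.0, 7.0, 10.0]  # made-up data

# 20 equally spaced edges from the smallest to the largest value
edges = range(minimum(x), stop=maximum(x), length=20)

# n edges delimit n - 1 bins
nb_bins = length(edges) - 1
```

for exactly 20 bins, one would pass `length=21` instead.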

In [22]:
figure()
# partition data into groups by their job title, plot histograms separately on top of each other
for df_s in groupby(df_salaries, :job_title)
    job_in_this_group = df_s[1, :job_title]
    # pass bins instead of letting matplotlib choose for us
    hist(df_s[:, :salary], label=job_in_this_group, alpha=0.5, bins=bins)
end
legend()
xlabel("9-month salary, \$")
ylabel("# professors")
title("salary of OSU profs")


## line plots: Xe/Kr gas adsorption in a porous material, noria

[source](https://onlinelibrary.wiley.com/doi/pdf/10.1002/chem.201602131)

*goal*: visualize the amount of xenon and krypton (mmol/g) adsorbed in the material as a function of [pure-gas] pressure (torr). there will be two curves here, one for xenon, one for krypton.

In [23]:
df_gas = Dict{String, DataFrame}()
for gas in ["Xe", "Kr"] 
    df_gas[gas] = CSV.read(joinpath("gas_adsorption", "$gas.csv"), copycols=true)
end
first(df_gas["Xe"], 10)


unfortunately, the column names contain spaces and parentheses, so they cannot be written as `Symbol` literals; they must be constructed from `String`s with `Symbol(...)`. i.e., there is no way to write the first column as `:pressure (torr)`.

In [24]:
names(df_gas["Xe"])

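a short sketch of building `Symbol`s from `String`s in Base Julia (`col` and `yr` are names used just for this example):

```julia
col = Symbol("pressure (torr)")  # a Symbol whose name contains a space and parentheses
yr = Symbol("2010")              # a Symbol whose name consists of digits

String(col)     # converts back to the original string
:2010 isa Int   # true: `:2010` is the integer 2010, not a Symbol
```

the last line is why a column named `2010` also needs `Symbol("2010")` rather than `:2010`.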

In [25]:
figure()
xlabel("pressure [torr]")
ylabel("gas uptake [mmol/g]")
for gas in ["Xe", "Kr"]
    plot(df_gas[gas][:, Symbol("pressure (torr)")], df_gas[gas][:, Symbol("gas uptake (mmol/g)")],
        linewidth=3, marker="o", label=gas)
end
legend()
title("gas adsorption in Noria")


## scatter plots: GDP vs life expectancy

*goal*: scatter plot life expectancy against income per person, where each point is a country. use the data from the year 2010.

In [26]:
df_income = CSV.read(joinpath("gapminder", "GDP_per_capita.csv"))
first(df_income, 10)


In [27]:
df_life = CSV.read(joinpath("gapminder", "life_expectancy.csv"))
first(df_life, 10)


*problem 1*: the rows are not in the same order, so it's not as easy as:
`plot(df_income[:, Symbol("2010")], df_life[:, Symbol("2010")])`

so, we need to `join` the two `DataFrames`. we choose an `inner` join because, for a country to appear on our scatter plot, we need both income and life expectancy for that country. we use the `:country` column as the key for joining.

*problem 2*: in both `DataFrame`s, `df_income` and `df_life`, the relevant column is named `2010` (as a `Symbol`, `Symbol("2010")`). after a join, we could not tell the two `2010` columns apart. so we rename these columns of interest so that they are unique across the two `DataFrame`s.

In [28]:
rename!(df_life, Symbol("2010") => :life_expectancy);
rename!(df_income, Symbol("2010") => :income);


In [29]:
df_gap = join(df_life[:, [:country, :life_expectancy]], 
    df_income[:, [:country, :income]], 
    on=:country, kind=:inner)
dropmissing!(df_gap)
first(df_gap, 10)

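in later DataFrames.jl releases, `join(..., kind=:inner)` was replaced by `innerjoin`. assuming such a version, here is a sketch of an inner join on two tiny tables; the countries and numbers are made up for illustration only.

```julia
using DataFrames

# made-up tables whose rows are in different orders
df_a = DataFrame(country=["A", "B", "C"], life_expectancy=[70.0, 80.0, 75.0])
df_b = DataFrame(country=["C", "A", "D"], income=[3000.0, 1000.0, 500.0])

# keep only the countries present in both tables, matched on :country
df_joined = innerjoin(df_a, df_b, on=:country)
```

countries "B" and "D" each appear in only one table, so they are dropped, which is the behavior we want for the scatter plot.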

In [30]:
figure()
xlabel("income per person [USD, \$]")
ylabel("life expectancy [years]")
# scatter plot, pass x, then y
scatter(df_gap[:, :income], df_gap[:, :life_expectancy])

xscale("log")
