## Plot Ideas
-------------

* Normal Q-Q plot to check whether Hailstone Size is normally distributed
    * If it is not normal, find a distribution that fits better (maybe log-normal?)
    * Generate a plot that demonstrates the fit

* Histograms for some of the variables, especially Hailstone Size; maybe heatmaps for some of the others?

* For all duplicates, see how far apart the actual values are; is it worth carrying three times as much data for little benefit?
    * Maybe do this with stacked histograms, a line plot, or something else?
    * Calculate the mean of each set of duplicate variables: is that a better indicator, or should we use whichever variable is closest to the mean?

* Correlation matrix for the data, made real pretty; consider whether we need all these variables or can use PCA/SVD/LASSO to reduce dimensionality

* Maybe scale the data?

* Boxplots to check spread and central tendency, maybe even two-dimensional versions or a facet grid
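
A rough sketch of what the "scale the data" idea could look like, assuming plain z-score standardization is what we end up wanting. The column names and values here are made up for illustration and are not from `pretty_data.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are hypothetical, not from the real file
demo_rng = np.random.default_rng(100)
demo = pd.DataFrame({
    "temperature_K": demo_rng.normal(loc=290, scale=10, size=200),
    "shear_m_s": demo_rng.normal(loc=15, scale=5, size=200),
})

# Z-score scaling: subtract each column's mean, divide by its sample standard deviation
scaled = (demo - demo.mean()) / demo.std()

print(scaled.describe().loc[["mean", "std"]])
```

After this, every column has mean 0 and standard deviation 1, so variables measured in different units can share one axis in boxplots or heatmaps.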

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from math import floor
import statsmodels.api as sm


In [33]:
%matplotlib qt
plt.rcParams["font.family"] = "Fantasy"

In [None]:
path = "/Users/joshuaelms/Desktop/github_repos/CSCI-B365/Meteorology_Modeling_Project/data/pretty_data.csv"

df = pd.read_csv(path, index_col=0)
df.index += 1
df.iloc[:, [0, 10, 20]].corr()

In [5]:
rng = np.random.default_rng(100)

In [34]:
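plt.close("all")  # close stale figures instead of clearing the most recent one

x = [1, 2, 3, 4]
y = [10, 8, 6, 4]

# Set the seaborn theme before creating the figure so the new axes pick it up.
# Note that set_theme() also resets font.family unless a font is passed to it.
sns.set_theme()
fig, [ax1, ax2] = plt.subplots(2, 1)

fig.suptitle("The Quick Brown Fox Jumped Over the Lazy Dog")
sns.pointplot(x=y, y=x, ax=ax1)
sns.pointplot(x=x, y=y, ax=ax2)
plt.show()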

In [18]:
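### Histogram and normal Q-Q plot for hail size

step = 0.25
# Bin edges every `step` units, from the floored minimum up through the maximum
breaks = np.arange(floor(df["Hailstone Size"].min()), df["Hailstone Size"].max() + step, step)

# Reference standard-normal sample of the same length (not used in the plots yet)
normal = rng.standard_normal(size=df["Hailstone Size"].shape[0])

fig, [ax1, ax2] = plt.subplots(nrows=1, ncols=2, sharex=False, sharey=False)
fig.patch.set_facecolor("xkcd:powder blue")
# discrete=True forces unit-width bins and overrides `bins`, so it is dropped here
sns.histplot(data=df, x="Hailstone Size", bins=breaks, ax=ax1)
# fit=True standardizes the sample so the 45-degree reference line is meaningful
sm.qqplot(data=df["Hailstone Size"], line="45", fit=True, ax=ax2)
ax1.legend(["Hailstone Size"])
ax2.legend(["Sample quantiles", "45-degree line"])
ax1.set_title("Histogram of Hailstone Size")
ax2.set_title("Normal Q-Q Plot")
plt.tight_layout()
plt.show()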
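
One way to follow up on the "maybe log-normal?" idea without eyeballing plots: compare how straight the Q-Q line is for the raw values versus their logs. This is only a sketch on synthetic data. `qq_correlation` is a helper name made up here, the fake sizes are drawn from a log-normal on purpose, and the real hail sizes would need to be strictly positive for the log to work.

```python
from statistics import NormalDist

import numpy as np


def qq_correlation(sample):
    """Correlation between the sorted sample and standard normal quantiles.

    Values close to 1 mean the normal Q-Q plot is close to a straight line.
    """
    values = np.sort(np.asarray(sample, dtype=float))
    n = values.size
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(theoretical, values)[0, 1])


# Synthetic stand-in for hail sizes (assumption: NOT the real data)
demo_rng = np.random.default_rng(100)
fake_sizes = demo_rng.lognormal(mean=0.0, sigma=0.75, size=500)

r_raw = qq_correlation(fake_sizes)
r_log = qq_correlation(np.log(fake_sizes))
print(f"Q-Q correlation, raw values: {r_raw:.3f}")
print(f"Q-Q correlation, log values: {r_log:.3f}")
```

If the log version comes out noticeably closer to 1 on the real data, that would support trying a log-normal fit and plotting the Q-Q of the logged sizes.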

In [24]:
### Corr plot for ten duplicates
plt.close("all")

### group plots by variable: each panel shows the correlations among the three calculation methods for one variable
fig, ax_lst = plt.subplots(nrows=5, ncols=2, figsize=(8, 10))

# Columns cnt, cnt+10, and cnt+20 hold the three versions of the same variable
for cnt, ax in enumerate(ax_lst.flat):
    correlations = df.iloc[:, [cnt, cnt + 10, cnt + 20]].corr()
    title = correlations.columns[0].split()[-1]
    sns.heatmap(data=correlations, vmin=-1, vmax=1, ax=ax, cmap="magma")
    ax.tick_params(axis='x', rotation=0)
    ax.set_title(title)

fig.suptitle("Correlation Plots for 3 Methods of Calculating 10 Meteorological Metrics")
fig.patch.set_facecolor("xkcd:light grey")
plt.tight_layout()
plt.show()
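
For the question of whether the mean of the duplicates beats any single one of them, something like the following might be a starting point. It is a sketch on synthetic data: `summarize_duplicates` and the `method_*` column names are invented here, and on the real frame the input would be something like `df.iloc[:, [cnt, cnt + 10, cnt + 20]]`.

```python
import numpy as np
import pandas as pd


def summarize_duplicates(frame):
    """Compare each duplicate column against the row-wise mean of all of them."""
    row_mean = frame.mean(axis=1)
    return pd.DataFrame({
        # Average absolute distance of each column from the row-wise mean
        "mean_abs_dev": frame.sub(row_mean, axis=0).abs().mean(),
        # Pearson correlation of each column with the row-wise mean
        "corr_with_mean": frame.corrwith(row_mean),
    })


# Synthetic stand-in: one underlying signal measured three ways with different noise
demo_rng = np.random.default_rng(100)
truth = demo_rng.normal(size=1000)
demo = pd.DataFrame({
    "method_a metric": truth + demo_rng.normal(scale=0.1, size=1000),
    "method_b metric": truth + demo_rng.normal(scale=0.5, size=1000),
    "method_c metric": truth + demo_rng.normal(scale=1.0, size=1000),
})

summary = summarize_duplicates(demo)
closest = summary["mean_abs_dev"].idxmin()
print(summary)
print("closest to the mean:", closest)
```

If one column sits very close to the mean for every metric, keeping just that column (or just the mean) would cut the duplicated data to a third.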


In [55]:
### Corr plot overall

plt.close("all")

# Column positions to include: 20 through 45 and 48 through 53
desired = list(range(20, 46)) + list(range(48, 54))

fig, ax1 = plt.subplots()
df_corr = df.corr().iloc[desired, desired]

sns.heatmap(data=df_corr, vmin=-1, vmax=1, ax=ax1, xticklabels=1, yticklabels=1)

plt.tight_layout()
plt.show()
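
If the overall correlation matrix shows a lot of strongly correlated blocks, a quick PCA can say roughly how many independent directions the variables really span. This is a minimal sketch using NumPy's SVD on synthetic data. The 3 latent factors, 10 columns, and 95% threshold are all assumptions picked for illustration, not properties of the real dataset.

```python
import numpy as np

# Synthetic stand-in: 10 observed columns driven by 3 hidden factors plus a little noise
demo_rng = np.random.default_rng(100)
latent = demo_rng.normal(size=(300, 3))
mixing = demo_rng.normal(size=(3, 10))
fake_data = latent @ mixing + demo_rng.normal(scale=0.05, size=(300, 10))

# Standardize columns so PCA works on correlations rather than raw scales
standardized = (fake_data - fake_data.mean(axis=0)) / fake_data.std(axis=0, ddof=1)

# PCA via SVD: squared singular values are proportional to explained variance
_, singular_values, _ = np.linalg.svd(standardized, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 95%
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("explained variance ratio:", np.round(explained, 3))
print("components needed for 95%:", n_components_95)
```

On the real data, `fake_data` would be replaced by the numeric columns of `df` (with missing values handled first), and a small `n_components_95` relative to the column count would suggest that many of the variables are redundant.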