# Data Science Ex 04 - Advanced Plotting & Binning/Sampling

24.02.2021, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some more Fun with Visualization and Binning/Sampling!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Introduction

### Advanced Plotting

#### Bar Charts

Besides lines and points, we cannot survive without bar charts - so here you go.

In [None]:
rng = np.random.RandomState(42)
values = rng.randn(1000).round().astype(int)    # generating 1000 normal distributed values N(0,1)
x = np.arange(values.min(), values.max() + 1)   # get bins (int) from min to max (inclusive)
h = [(values == i).sum() for i in x]            # count occurrences per bin

fig, ax = plt.subplots()
ax.bar(x, h)
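
The counting step can also be delegated to NumPy. The following is a minimal sketch, assuming the same seed as above: `np.unique()` with `return_counts=True` returns the distinct values together with their frequencies. Note that it only reports values that actually occur, so a value without any occurrence would not get a (zero-height) bar.

```python
import numpy as np

rng = np.random.RandomState(42)
values = rng.randn(1000).round().astype(int)

# distinct values (sorted ascending) and the number of occurrences of each
x, h = np.unique(values, return_counts=True)

print(dict(zip(x, h)))
```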

It is also possible to draw a horizontal bar chart.

In [None]:
fig, ax = plt.subplots()
ax.barh(x, h)

And we can stack values on top of each other - simply by calling the method multiple times.

In [None]:
h2 = rng.randint(50, 150, len(x))

fig, ax = plt.subplots(3, figsize=(7, 10), constrained_layout=True)
ax[0].bar(x, h)
ax[0].bar(x, h2, bottom=h) # here we define how to stack - basically, where the bottom should be
ax[0].set(title="Vertical Stack", ylim=(0,500))

ax[1].bar(x, h)
ax[1].bar(x, h2, alpha=.5) # without the bottom-parameter, bars overlap
ax[1].set(title="No Stack, but Overlap", ylim=(0,500))

ax[2].barh(x, h)
ax[2].barh(x, h2, left=h)
ax[2].set(title="Horizontal Stack", xlim=(0,500))

Or we can group bars.
It's a bit tricky, but you get the idea.
We just have to tweak the location of our *x* values a bit.

In [None]:
fig, ax = plt.subplots()
width = .4
ax.barh(x - width/2, h, width)
ax.barh(x + width/2, h2, width)

#### Pie Charts

How could we live without pie charts?
\- We cannot!

Thus...

In [None]:
labels = ["1/3", "1/4", "1/6", "1/9", "1/12", "1/18"]
sizes = [1/3, 1/4, 1/6, 1/9, 1/12, 1/18]
explode = [0, 0.1, 0, 0, 0, 0.5] # define the breakouts

fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, autopct="%1.2f%%", shadow=True, startangle=45)

#### Histograms

Bar charts and pie charts are good to show quantities, but they do not respect the order of values.
If the order is important, we can use histograms.

In [None]:
rng = np.random.RandomState(42)
normalDist = rng.randn(10000)

fig, ax = plt.subplots()
ax.hist(normalDist)

As you may have noticed, we threw an array of 10000 values into the method, and we got bars.
So there must be some sort of binning (spoiler alert - upcoming topic in this notebook) within the method.
The default number of bins is `10`, but we can easily set it to another value.

In [None]:
fig, ax = plt.subplots()
ax.hist(normalDist, bins=50)

If we only want to see a subset of all bins, we can define an array of bin boundaries and the histogram will only show the values of these bins.

In [None]:
fig, ax = plt.subplots()
ax.hist(normalDist, bins=[-1,-.5,0,.5,1,1.5,2])
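
To see the binning that happens inside `hist()`, we can call `np.histogram()` directly, which Matplotlib uses under the hood. This is a small sketch that reuses the seed from above; the variable names `counts`, `edges`, `subCounts` and `subEdges` are our own.

```python
import numpy as np

rng = np.random.RandomState(42)
normalDist = rng.randn(10000)

# default: 10 equal-width bins between min and max
counts, edges = np.histogram(normalDist)
print(len(counts), len(edges))      # n bins are described by n + 1 boundaries

# explicit boundaries: values outside of [-1, 2] are not counted
subCounts, subEdges = np.histogram(normalDist, bins=[-1,-.5,0,.5,1,1.5,2])
print(subCounts.sum(), "of", len(normalDist), "values fall into the given bins")
```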

Besides the number of bins, we have some more parameters we can use to customize our histograms.
- `density` normalizes the histogram so that the sum of the area is equal to 1
- `cumulative` adds to each bin the sum of all bins to its left
- `histtype` changes the appearance of the histogram
  - `bar` draws bars (default)
  - `barstacked` draws bars on top of each other when multiple histograms are plotted
  - `step` draws a line
  - `stepfilled` draws an area

In [None]:
fig, ax = plt.subplots()
ax.hist(normalDist, bins=20, density=True, alpha=.5, color="r", edgecolor="b")

In [None]:
fig, ax = plt.subplots()
ax.hist(normalDist, bins=40, cumulative=True)

In [None]:
fig, ax = plt.subplots()
ax.hist(normalDist, bins=50, density=True, cumulative=True)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,4))
ax[0].hist(normalDist, bins=20, histtype="step")
ax[0].set(title="step")
ax[1].hist(normalDist, bins=20, histtype="stepfilled")
ax[1].set(title="stepfilled")
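
What `density` and `cumulative` do can be checked with plain NumPy. The sketch below assumes the same data as above and uses `np.histogram()` instead of `hist()`, so no plot is drawn.

```python
import numpy as np

rng = np.random.RandomState(42)
normalDist = rng.randn(10000)

counts, edges = np.histogram(normalDist, bins=20)
density, _ = np.histogram(normalDist, bins=20, density=True)

# density: bar height times bar width, summed over all bins, is 1
area = (density * np.diff(edges)).sum()
print(area)

# cumulative: each bin holds its own count plus all counts to its left
cumulative = np.cumsum(counts)
print(cumulative[-1])   # the last bin contains all values
```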

#### Visualizing Errors

When working with empirical data, we know that there is uncertainty.
And as you can imagine, Matplotlib has you covered for this case as well - with the `errorbar()` method.

In [None]:
fig, ax = plt.subplots()
x = np.linspace(0, 10, 20)
y = np.sin(x)
ax.plot(x, y, c="red")
ax.errorbar(x, y, fmt="o", yerr=.5, capsize=3, ecolor="gray")

We can also show errors in multiple directions and individually for each point.

In [None]:
rng = np.random.RandomState(42)
px = rng.randn(25)
ex = rng.rand(25) / 2
py = rng.randn(25)
ey = rng.rand(25) / 2

fig, ax = plt.subplots()
ax.errorbar(px, py, fmt=".", xerr=ex, yerr=ey, ecolor="r", capsize=2)
ax.scatter(px, py, marker="s", c="b")
ax.axis("equal") # keeping distance in both directions the same
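
In practice, the error values often come from the data itself. The following sketch uses made-up measurements (30 noisy readings per position, an assumption for illustration only) and shows the mean per position with the sample standard deviation as error bar.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)
x = np.linspace(0, 10, 20)

# hypothetical data: 30 noisy measurements for each of the 20 positions
measurements = np.sin(x) + rng.normal(scale=0.3, size=(30, len(x)))

y = measurements.mean(axis=0)            # mean per position
err = measurements.std(axis=0, ddof=1)   # sample standard deviation per position

fig, ax = plt.subplots()
ax.errorbar(x, y, yerr=err, fmt="o", capsize=3, ecolor="gray")
```

Since `yerr` receives an array with one entry per point, every point gets its own error bar length.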

#### Text & Annotations

Although visualizing data helps us understand it better than just looking at numbers, we still need the ability to show text to highlight findings.
Until now, we learned how to apply labels and titles to plots.
But what if we want to write into the figure?

Matplotlib offers a `text()` method to show individual text within a figure, and `annotate()` to highlight points.

`text()` needs at least the following information
- a value on the horizontal axis
- a value on the vertical axis
- the text to show

Additionally, we can define the horizontal alignment `ha` of the text and other style information (like `color`, `size`, etc.).

In the example below, we use the information returned from the `hist()` method.
We get a tuple with 3 different values.
The first value `freq` contains the count of each bin.
The second value `bins` contains the boundaries of the bins.
And the last value we can ignore (we need a variable, though, to unpack the tuple).

Thus, `freq[0]` contains the height of the first bin, and `freq[-1]` contains the height of the last bin.
The same is true for `bins`: `bins[1]` is the right edge of the leftmost bin, and `bins[-2]` is the left edge of the rightmost bin.
The location of the mode is calculated by finding the left and right side of the corresponding bin.
We use this information to place our texts (the `+ 5` is just here so the text is above the top of the bin).

In [None]:
rng = np.random.RandomState(42)
tri = rng.triangular(2, 6, 8, 10000)

fig, ax = plt.subplots(figsize=(20,4))
(freq, bins, p) = ax.hist(tri, bins=50)
ax.text(bins[1], freq[0] + 5, "min", ha="right", color="g", size=14)
ax.text(bins[-2], freq[-1] + 5, "max", ha="left", size=14)
i = freq.argmax() # index of the highest bin = mode (the first one if there are ties)
ax.text(np.mean([bins[i], bins[i+1]]), freq.max() + 5, "mode", ha="center", size=18, alpha=.5)

If simple text is not enough, we can go a step further and use the `annotate()` method.
In contrast to `text()`, which takes one coordinate for the text, `annotate()` takes two coordinates: the position of the text and the point it refers to.
Between text and target, an arrow is drawn.
We won't go too much into detail, but you can find the documentation [here](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.annotate.html).

In [None]:
rng = np.random.RandomState(42)
tri = rng.triangular(2, 6, 8, 10000)

fig, ax = plt.subplots(figsize=(20,4))
(freq, bins, p) = ax.hist(tri, bins=50)

# min
minPoint = (np.mean([bins[0], bins[1]]), freq[0] + 2)
ax.annotate("min", xy=minPoint, xycoords="data", xytext=(-20, 100), textcoords="offset points",
            ha="right", color="g", size=14,
            arrowprops=dict(arrowstyle="fancy", ec="g", fc="g", connectionstyle="angle3,angleA=0,angleB=-120"))

# max
maxPoint = (bins[-1], freq[-1] + 2)
ax.annotate("max", xy=maxPoint, xycoords="data", xytext=(50, 50), textcoords="offset points",
            ha="left", size=14,
            arrowprops=dict(arrowstyle="-|>", ec="k"))

# mode
i = freq.argmax() # index of the highest bin = mode (the first one if there are ties)
modePoint = (np.mean([bins[i], bins[i+1]]), freq.max() + 2)
ax.annotate("mode", xy=modePoint, xycoords="data", xytext=(-100, -10), textcoords="offset points",
            ha="center", size=18,
            bbox=dict(boxstyle="round", fc="white", ec="k", alpha=.5),
            arrowprops=dict(arrowstyle="-", fc="gray", ec="gray", connectionstyle="angle3,angleA=45,angleB=90"))

# range
startRange = (3,250)
endRange = (4, 250)
ax.annotate("", xy=startRange, xycoords="data", xytext=endRange, textcoords="data",
            arrowprops=dict(arrowstyle="|-|,widthA=.8,widthB=.4", ec="r"))
ax.annotate("Range from 3 to 4", xy=(3.5, 250), xycoords="data", xytext=(0,5), textcoords="offset points",
            ha="center", c="r", size=14)
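
Stripped down to the essentials, an annotation only needs the text, the point to highlight (`xy`) and the position of the text (`xytext`). The following is a minimal sketch on a sine curve; the curve and the chosen coordinates are just an illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# xy = point to highlight, xytext = where the text is placed
# (without xycoords/textcoords, both are interpreted as data coordinates)
note = ax.annotate("maximum", xy=(np.pi / 2, 1), xytext=(3, .75),
                   arrowprops=dict(arrowstyle="->"))
```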

### Plots with Pandas

Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

For the following examples, we'll work with a dataset containing game sales information.

In [None]:
games = pd.read_csv("./Demo_VideoGames.csv")
games.head(5)

You now know how to do basic data visualization with Matplotlib.
Luckily for us, many of the previously introduced methods are already available for `DataFrames` and `Series`.
So there is usually no need to throw values at an axis, but you still can.

In [None]:
top5Games = games.head(5).sort_values("Global_Sales")

fig, ax = plt.subplots()
ax.barh(top5Games["Name"], top5Games["Global_Sales"])

The following plots are directly supported by `DataFrames` or `Series`:
- `plot()` generic method used by every other method
- `bar()` and `barh()`
- `hist()`
- `box()`
- `kde()` or `density()`
- `area()`
- `scatter()`
- `hexbin()`
- `pie()`

These methods (with the exception of `plot()`) are accessible through the `plot` property on a `DataFrame` or `Series`.
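
As a small sketch of how this property works, the following uses a made-up `Series` (the platform names are only illustrative). Calling `plot.bar()` returns a Matplotlib axis, so everything from the previous sections still applies; `plot(kind="bar")` is an equivalent spelling.

```python
import pandas as pd

platforms = pd.Series(["Wii", "PS2", "Wii", "DS", "PS2", "Wii"])
counts = platforms.value_counts()

ax = counts.plot.bar(rot=0)     # same as counts.plot(kind="bar", rot=0)
ax.set(xlabel="Platform", ylabel="Games")
```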

We will show some examples of how you can use these methods, but we won't go too much into detail.
Using plots is mostly learning by doing and relies heavily on one's imagination.
And it's up to you to do some exploration on your own.

Let's say, we want to show the number of games per platform in a bar chart.

In [None]:
games["Platform"].value_counts().plot.bar(figsize=(12,6))

We can also combine this method with the already known plot methods.
This is probably the best approach as soon as you want to arrange multiple Pandas plots as subplots, since you create the axes first and then define where to plot what.
But in the end, it's up to you how you want to arrange your plots.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
games["Platform"].value_counts().plot.bar(ax=ax)

Let's go on with some more visualizations.

We can also show the number of games released per year.

In [None]:
releases = games["Year_of_Release"]
years = int(releases.max() - releases.min()) + 1 # one bin per year (both ends included); bins must be an integer
releases.plot.hist(figsize=(20,6), bins=years)

Having sales numbers, we can show how the sales of the best-selling game are distributed among the regions.

In [None]:
bestSoldGame = games[games["Global_Sales"] == games["Global_Sales"].max()]
sales = bestSoldGame[["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]]
print("01 - Sales")
print(sales)

sales = sales.T
print()
print("02 - Sales as DataFrame")
print(sales)

sales = sales.rename(columns={0 : "Regional Sales"})
print()
print("03 - Defining Name for Column")
print(sales)

sales = sales.rename(index=lambda i : i.replace("_Sales", ""))
print()
print("04 - Removing '_Sales' suffix from index")
print(sales)

highlightEU = [0,.2,0,0]
sales.plot.pie(subplots=True, shadow=True, explode=highlightEU, startangle=45,
               title=f"Global Sales of {bestSoldGame['Name'].iloc[0]}",
               autopct="%.2f%%", legend=False)

Or we can show how many titles per genre were released each year.

In [None]:
genrePerYear = games.groupby(["Year_of_Release", "Genre"])["Genre"].count()
print("01 - Genre per Year")
print(genrePerYear)

genrePerYear = genrePerYear.unstack()
print()
print("02 - First 5 Years with Genre as Columns")
print(genrePerYear[:5])

ax = genrePerYear.plot.area(figsize=(20,6))
ax.set(xlabel="Release Date", ylabel="Games per Year")
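
The combination of `groupby()` and `unstack()` is easier to follow on a tiny dataset. The sketch below uses made-up rows (the column names `Year` and `Genre` and all values are assumptions for illustration) and `size()` to count rows per group; `fill_value=0` replaces the gaps that `unstack()` would otherwise fill with `NaN`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year":  [2000, 2000, 2000, 2001, 2001],
    "Genre": ["Action", "Action", "Sports", "Action", "Puzzle"],
})

# long format: one count per (Year, Genre) combination that occurs
perYear = toy.groupby(["Year", "Genre"]).size()
print(perYear)

# wide format: years as rows, genres as columns
wide = perYear.unstack(fill_value=0)
print(wide)
```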

Or how popular genres are by year.

In [None]:
genres = games["Genre"].unique()
genreIds = dict(zip(genres, range(len(genres))))
print("01 - Assigned Numbers to Genres")
print(genreIds)

games["GenreId"] = games["Genre"].replace(genreIds)
print()
print("02 - Adding GenreId as new Column")
print(games[["Genre", "GenreId"]])

ax = games.plot.scatter(x="Year_of_Release", y="Global_Sales", s=200, c="GenreId", cmap="rainbow", alpha=.5, colorbar=False, figsize=(20,5))
ax.set(xlabel="Release Year", ylabel="Global Sales", title="Global Sales per Genre")

colorbar = ax.get_figure().colorbar(ax=ax, mappable=ax.collections[0])
colorbar.set_ticks(list(genreIds.values()))
colorbar.set_ticklabels(list(genreIds.keys()))

And we can go one step further.
How about a comparison of sales in the European region with North America and Japan, by platform and global sales, for the top 1% of games of each region?

In [None]:
platforms = ["PC", "XB", "X360", "XOne", "PS", "PS2", "PS3", "PS4", "PSV", "N64", "Wii", "WiiU", "GB", "GC", "GBA", "DS", "3DS"]
myGames = games[(games["Platform"].isin(platforms))]

xlim = (-1, myGames["EU_Sales"].max() + 1)
ylim = (-1, myGames[["NA_Sales", "JP_Sales"]].max().max() + 5)

platformNumbers = list(range(len(platforms)))
platformIds = dict(zip(platforms, platformNumbers))
cmap = "rainbow"
alpha = .6
style = {"cmap":cmap, "vmin":platformNumbers[0], "vmax":platformNumbers[-1], "alpha":alpha, "colorbar": False}

def setColorbar(fig, ax, labelDict):
    colorbar = fig.colorbar(ax=ax, mappable=ax.collections[0])
    colorbar.set_ticks(list(labelDict.values()))
    colorbar.set_ticklabels(list(labelDict.keys()))

def getSizeAndColors(games):
    return games["Global_Sales"] * 10, games["Platform"].replace(platformIds)

topPercent = .01    

naGames = myGames[myGames["NA_Sales"] >= myGames["NA_Sales"].quantile(1-topPercent)]
naSizes, naColors = getSizeAndColors(naGames)

jpGames = myGames[myGames["JP_Sales"] >= myGames["JP_Sales"].quantile(1-topPercent)]
jpSizes, jpColors = getSizeAndColors(jpGames)

fig, ax = plt.subplots(2,1, figsize=(20, 10))
axNA = naGames.plot.scatter(ax=ax[0], x="EU_Sales", y="NA_Sales", s=naSizes, c=naColors, **style)
axNA.set(xlim=xlim, ylim=ylim, xlabel="EU", ylabel="NA", title="Regional Sales Comparison (EU vs NA)")
setColorbar(fig, axNA, platformIds)
ax[0].plot([0, 50], [0, 50], ":k", alpha=.6) # line to see correlation

axJP = jpGames.plot.scatter(ax=ax[1], x="EU_Sales", y="JP_Sales", s=jpSizes, c=jpColors, **style)
axJP.set(xlim=xlim, ylim=ylim, xlabel="EU", ylabel="JP", title="Regional Sales Comparison (EU vs JP)")
setColorbar(fig, axJP, platformIds)
ax[1].plot([0, 50], [0, 50], ":k", alpha=.6) # line to see correlation

# Findings
def getCoord(game, sales, offset=None):
    corr = offset or (0,0)
    row = game.iloc[0] # 'game' is a filtered DataFrame; take its first row to get plain numbers
    return (row["EU_Sales"] + corr[0], row[sales] + corr[1])

wiiSports = games[games["Name"] == "Wii Sports"]
axNA.annotate("Wii Sports", xy=getCoord(wiiSports, "NA_Sales", (-.2,-3)), xytext=(-100, -50), textcoords="offset points", 
              arrowprops=dict(arrowstyle="fancy", ec="g", fc="g", connectionstyle="angle3,angleA=0,angleB=-120"))
axJP.annotate("Wii Sports", xy=getCoord(wiiSports, "JP_Sales", (-.2,3)), xytext=(-100, 50), textcoords="offset points", 
              arrowprops=dict(arrowstyle="fancy", ec="g", fc="g", connectionstyle="angle3,angleA=0,angleB=120"))

axNA.annotate("XBox Games", xy=(4,12), xytext=(50, 50), textcoords="offset points", 
              arrowprops=dict(arrowstyle="fancy", ec="purple", fc="blue", alpha=.5, connectionstyle="angle3,angleA=0,angleB=120"))
axJP.annotate("No XBox Games", xy=(4,15), xytext=(50, 50), textcoords="offset points", 
              arrowprops=dict(arrowstyle="fancy", ec="purple", fc="blue", alpha=.5, connectionstyle="angle3,angleA=0,angleB=-120"))

axJP.annotate("", xy=(0, 10), xytext=(5,10), arrowprops=dict(arrowstyle="|-|,widthA=.4,widthB=.4",ec="r"))
axJP.annotate("Handheld Games", xy=(2.5, 10), xytext=(0,5), textcoords="offset points", ha="center")

pokemonRedBlue = games[games["Name"] == "Pokemon Red/Pokemon Blue"]
axJP.annotate("Pokemon Red/Pokemon Blue", xy=getCoord(pokemonRedBlue, "JP_Sales", (.1,2)), xytext=(100, 50), textcoords="offset points", 
              arrowprops=dict(arrowstyle="fancy", ec="r", fc="orange", alpha=.5, connectionstyle="angle3,angleA=0,angleB=75"))
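
The filter for the top 1% is the core of the example above, so here it is in isolation. This is a sketch with made-up sales numbers (exponentially distributed random values, an assumption for illustration): `quantile(1 - topPercent)` returns the threshold, and the comparison keeps only the rows at or above it.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
toy = pd.DataFrame({"NA_Sales": rng.exponential(scale=1.0, size=1000)})

topPercent = .01
threshold = toy["NA_Sales"].quantile(1 - topPercent)

# boolean mask: True for every row that belongs to the top 1%
top = toy[toy["NA_Sales"] >= threshold]
print(f"{len(top)} of {len(toy)} rows are at or above {threshold:.2f}")
```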

#### 3D

And if you really want to up your game, there is also 3D support.
But we won't go into details.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# Line
xline = np.linspace(0, 15, 1000)
yline = np.sin(xline)
zline = np.cos(xline)
ax.plot(xline, yline, zline, c="k", alpha=.5)

# Points
xdata = 15 * np.random.random(100)
ydata = np.sin(xdata)
zdata = np.cos(xdata)
ax.scatter(xdata, ydata, zdata, c=xdata, cmap="rainbow")

And if you want to have even more fun, have a look at [Basemap](https://matplotlib.org/basemap/) - I'm just saying...

<tr>
    <td><img src="https://matplotlib.org/basemap/_images/ortho_full.png" style="height:250px;float:left" /></td>
    <td><img src="https://matplotlib.org/basemap/_images/background3.png" style="height:250px;float:left" /></td>
    <td><img src="https://matplotlib.org/basemap/_images/background4.png" style="height:250px;float:left" /></td>
</tr>

### Binning

You have already seen simple approaches to binning values.
Above, you have examples of histograms where the bins are returned when plotting.
And there are quite simple methods to group numeric values, for example by rounding.

In [None]:
rng = np.random.RandomState(42)
numbers = pd.DataFrame(rng.rand(1000000) * 100, columns=["values"])
numbers.head(5)

With these values at hand, we can round them so that similar values end up in the same bin.

In [None]:
numbers["bin (.1)"] = np.round(numbers["values"], 1)
numbers["bin (10)"] = np.round(numbers["values"] / 10) * 10
numbers.head(5)

In [None]:
numbers.groupby("bin (.1)")["values"].count()

In [None]:
numbers.groupby("bin (10)")["values"].count()
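
One detail of binning by rounding is worth a closer look: the outermost bins only cover half the width of the others. The following sketch (with fewer values than above and our own variable names) compares rounding with `np.floor()`, which yields bins of equal width.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
values = pd.Series(rng.rand(100000) * 100)

# rounding: bins 0, 10, ..., 100 - but 0 only covers [0, 5) and 100 only [95, 100)
rounded = (np.round(values / 10) * 10).value_counts().sort_index()
print(rounded)

# flooring: bins 0, 10, ..., 90 - each covers a full range of width 10
floored = (np.floor(values / 10) * 10).value_counts().sort_index()
print(floored)
```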

Besides these two approaches, Pandas offers some handy functions to do more complex binning.

- `pd.cut()` creates bins according to given boundaries (equal width)*
- `pd.qcut()` creates bins according to quantiles (equal frequency)*

*if not specified otherwise

In [None]:
games = pd.read_csv("./Demo_VideoGames.csv")
games.head(5)

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

In [None]:
equalWidth = pd.cut(games["Global_Sales"], bins=4)
print(equalWidth.value_counts())
print(equalWidth)

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

In [None]:
equalFreq = pd.qcut(games["Global_Sales"], q=4)
print(equalFreq.value_counts())
print(equalFreq)

In [None]:
fig, ax = plt.subplots(2,2,figsize=(15,10), constrained_layout=True)
equalWidth.value_counts(sort=False).plot.bar(ax=ax[0,0], rot=0, title="cut()")
equalWidth.value_counts(sort=False).plot.bar(ax=ax[1,0], rot=0, title="cut() - zoomed-in", ylim=(0,50))

equalFreq.value_counts(sort=False).plot.bar(ax=ax[0,1], rot=0, title="qcut()")
ax[1,1].remove() # not showing the right bottom panel
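
The difference between the two functions also shows on synthetic data. The sketch below uses made-up, right-skewed values (exponentially distributed, an assumption that only mimics the shape of sales data): `cut()` produces bins of equal width with very different counts, while `qcut()` moves the boundaries so that every bin holds the same number of values.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
sales = pd.Series(rng.exponential(scale=2.0, size=1000))

# equal width: most values end up in the first bin
widthCounts = pd.cut(sales, bins=4).value_counts().sort_index()
print(widthCounts)

# equal frequency: boundaries are the quartiles, each bin holds a quarter of the values
freqCounts = pd.qcut(sales, q=4).value_counts().sort_index()
print(freqCounts)
```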

These methods also offer some more features.
If we want, we can get the boundaries back separately by using the `retbins` parameter.

In [None]:
result, bins = pd.cut(games["Global_Sales"], bins=4, retbins=True)
print(bins)

Or we can define our own boundaries.

In [None]:
result = pd.cut(games["Global_Sales"], bins=[0, 10, 25, 60, 90])
print(result)
print(result.value_counts())

In [None]:
result = pd.qcut(games["Global_Sales"], q=[0, .5, .75, .99, 1])
print(result)
print(result.value_counts())

And we can even provide our own labels.

In [None]:
games["Feedback"] = pd.cut(games["Global_Sales"], bins=[0, 10, 40, 75, 100], labels=["Crappy", "Mediocre", "Good", "Bestseller"])
print(games["Feedback"].value_counts(sort=False))
games[games["Feedback"].isin(["Good", "Bestseller"])]

### Sampling

When working with large datasets, it can be a good idea to reduce the number of entries you work with, e.g. for better performance or to create test and training datasets (topic of another exercise).
Using Numpy and Pandas, there are two simple ways to reduce a `DataFrame`.

- `np.random.choice()`
- `DataFrame.sample()`

Reference: https://numpy.org/doc/1.18/reference/random/generated/numpy.random.choice.html

`np.random.choice()` is the generic approach from Numpy.
You can use it in two ways.

1. You generate an array of indices
2. You pick items of an array

In [None]:
rng = np.random.RandomState(42)
sample = rng.choice(20, 10)
sample

As you can see, some values get picked multiple times.
This is due to the fact that the `replace` parameter is set to `True` by default.

In [None]:
fruits = ["Apple", "Raspberry", "Banana", "Orange", "Pineapple", "Kiwi", "Grapefruit", "Pear", "Tomato", "Dragonfruit", "Strawberry", "Lemon"]
rng = np.random.RandomState(42)
sample = rng.choice(fruits, 6, replace=False)
sample
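
The effect of `replace` can be checked directly. In the following sketch, sampling without replacement never returns a value twice, which also means that we cannot draw more items than the population contains.

```python
import numpy as np

rng = np.random.RandomState(42)

withoutReplacement = rng.choice(20, 10, replace=False)
print(len(np.unique(withoutReplacement)))   # every value is picked at most once

# asking for 10 items out of 5 without replacement is not possible
try:
    rng.choice(5, 10, replace=False)
    failed = False
except ValueError as error:
    failed = True
    print(error)
```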

In [None]:
games = pd.read_csv("./Demo_VideoGames.csv")

Working with Pandas, we could use `np.random.choice()` to select rows by generating indices.

In [None]:
rng = np.random.RandomState(42)
idx = rng.choice(len(games), 3000, replace=False)
idxSample = games.iloc[idx]
print(len(idxSample))
idxSample

In [None]:
rng = np.random.RandomState(42)
idx = rng.choice(games.index, 5000, replace=False)
idxSample = games.loc[idx]
print(len(idxSample))
idxSample

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

`sample()` is a method offered on each instance of a `DataFrame` to select a subset of the dataset.
As with the `RandomState` or `seed()` method, we also have the ability to define a `random_state` to ensure reproducibility.
`sample()` also has a `replace` parameter, but in contrast to `choice()` from above, it is `False` by default - which makes sense when selecting a random sample out of a `DataFrame`.

In [None]:
someGames = games.sample(n=4000, random_state=42)
sampleSize = len(someGames)
gamesSize = len(games)
print(f"Sample size: {sampleSize} of {gamesSize}")
someGames

In [None]:
someOtherGames = games.sample(frac=.33, random_state=42)
sampleSize = len(someOtherGames)
gamesSize = len(games)
print(f"Sample size: {sampleSize} of {gamesSize} => {np.round(sampleSize/gamesSize, 5)}")
someOtherGames
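
To see what `random_state` buys us, the following sketch uses a small made-up `DataFrame` (the column `id` is an assumption for illustration). The same seed returns the same rows, and since the sample keeps the original index, `drop()` gives us the remaining rows.

```python
import pandas as pd

df = pd.DataFrame({"id": range(100)})

first = df.sample(frac=.33, random_state=42)
second = df.sample(frac=.33, random_state=42)
print(first.equals(second))     # same seed, same rows in the same order

# all rows that were not picked
rest = df.drop(first.index)
print(len(first), len(rest))
```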

There are other approaches to getting samples out of a dataset, but we will introduce them when they are needed later in this course.

## Exercises

### Ex01 - Bar Charts

Draw a bar chart of the following data.

In [None]:
category = np.array(["Sunny", "Rainy", "Cloudy", "Foggy", "Snowy"])
days = np.array([184, 84, 21, 25, 51])

Sort the days so the highest occurrence is on the left, and the lowest occurrence is on the right.
*Hint: `np.argsort()` returns an array of indices.*

Now draw the data as a horizontal bar chart; the categories should be sorted alphabetically with A at the top and Z at the bottom.

#### Solutions

In [None]:
# %load ./Ex04_01_Sol.py

### Ex02 - Pie Charts

Create a pie chart in which each slice has the size of 1/3, the value precision is 3 (e.g. 0.333), and no slice touches another.

Create a pie chart with 50%, 25% and twice 12.5%.
The slice with 50% should be on the left side and exposed.

#### Solutions

In [None]:
# %load ./Ex04_02_Sol.py

### Ex03 - Histogram

Generate a histogram containing 101 bins with 10'000 uniformly (`uniform()`) distributed values between -100 and 100.

Generate a histogram binning 100'000 normally (`normal()`) distributed values in 80 bins.
The mean should be at 5 and the standard deviation is 3.

#### Solutions

In [None]:
# %load ./Ex04_03_Sol.py

### Ex04 - Plots with Pandas

Load the dataset **Ex04_04_Data.csv** and show the first 5 lines.
It contains all Airbnb listings in New York City of 2019.

Before we start visualizing data, we add some values.
Add a series containing the *minimum price per stay* by multiplying the *price* with the *minimum nights*.
And another series containing the *expected revenue per year* by multiplying the *price* with the *availability per year* (availability_365).

Get the 10 most expensive offerings that were available in 2019 (based on the minimum price you just calculated) and list them.

Now show in a bar chart in which neighborhood group they are located.

Let's see how all the listings are distributed amongst the neighborhoods.
Use a pie chart to show the percentages per group.

What are the mean and median prices per neighborhood?
Show it in a bar chart.

Show the distribution of room types as pie chart.
What are the percentages?

Show the mean and median price per room type in a bar chart.

Show in a bar chart how many listings are offered by the same host.
Use the host_id to find listings belonging to the same host.
Use 5 bins, for 1 to 4 listings and the fifth for 5 and more.

What's the mean number of listings per host?

Airbnb has defined that every listing available for 60 or more days per year (availability_365) is labeled as highly available.
Add a new column to the dataset containing `high` or `low` if the listing is highly available or not, respectively.

Now use this new column to show within a pie chart the distribution between high and low availability.
The highly available slice should be on the upper side and have a gap to the other slice.
What's the distribution in percent?

#### Challenges!

Plot a line, indicating how many hosts (y) have how much yearly revenue (x).
Limit the revenue to 100'000 and ignore every host who has no revenue at all.
*You may need to use [reset_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) to create a `DataFrame` from a group.*

Create a scatter plot showing the availability per year (x) compared to the minimum nights (y).
The color of the points should indicate their neighborhood and the size is bound to the price, but divide it by 5 (thus the points don't take too much space).
Since the dataset has nearly 50'000 entries, we just want to see those that have an availability greater than 0 and cost more than 1000.
Also limit the y axis to (-50, 400), so you can see all points.

#### Solutions

In [None]:
# %load ./Ex04_04_Sol.py

### Ex05 - Binning & Sampling

We will keep using the data from the exercise before.
So just use the same variable or load **Ex04_04_Data.csv** again.

Create bins containing 25% each based on the number of reviews.
Where are the boundaries of the bins?

Take a sample of 5000 entries from all the listings.

Take 25 samples from each neighborhood group.
You should have a total of 125 samples, since there are 5 of them.

#### Solutions

In [None]:
# %load ./Ex04_05_Sol.py