# Visualization

This week we will use pandas and seaborn to explore different ways to visualize your machine learning project. In particular, we will:

- visualize categorical and continuous variables of the data set
- visualize the machine learning methods we have learned so far
- visualize the performance of our algorithms


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#### Let's get started with the (unprocessed) charity dataset

In [None]:
df = pd.read_csv("CharityData/census.csv")
df.head()

#### Pie charts can be a simple way to illustrate the categorical variables in your dataset

In [None]:
series = pd.Series(3 * np.random.rand(4), index=["a", "b", "c", "d"], name="series")
print(series)
series.plot.pie(figsize=(6, 6))

##### To use a pie chart on the charity data we first have to create a series

In [None]:
workclass = df.workclass.value_counts()
workclass.plot.pie(figsize=(6, 6))

##### There are different ways to change the colors: use the colormap or colors argument
- use a matplotlib colormap: https://matplotlib.org/stable/tutorials/colors/colormaps.html
- use a seaborn colormap (called a palette): https://seaborn.pydata.org/tutorial/color_palettes.html
- specify your own color list

In [None]:
plt.figure(figsize=(20, 6))

plt.subplot(131)
workclass.plot.pie( colormap='Paired')
plt.title('Matplotlib: Paired', size=18)

plt.subplot(132)
col = sns.color_palette("Spectral", n_colors=10)
workclass.plot.pie(colors=col)
plt.title('Seaborn: Spectral', size=18)

plt.subplot(133)
colors=['red', '#000000',(0.7, 0.6,0.4), '#666666', '#ffffff', 'blue', 'orange']
workclass.plot.pie(colors=colors)
plt.title('Handpicked', size=18)

##### Use the parameters of df.plot to set properties of your plot
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html#pandas.DataFrame.plot

In [None]:
col = sns.color_palette("Spectral", n_colors=10)
workclass.plot.pie(figsize=(6, 6), colors=col, legend=True, autopct="%.2f", fontsize=14)
plt.legend()

##### You can combine the pandas plot commands with matplotlib commands such as plt.legend and plt.title

In [None]:
col = sns.color_palette("Spectral", n_colors=10)
workclass.plot.pie(figsize=(6, 6), colors=col, legend=True, autopct="%.2f", fontsize=14, labeldistance=None, label='')
plt.legend(bbox_to_anchor=(0.95, 0.6), loc=2, borderaxespad=0.,  fontsize=14, ncol=2)
plt.title('Workclass', size=20)

#### You can use df.groupby to make several pie charts

In [None]:
series_mf = df.groupby('sex').workclass.value_counts()
print(series_mf)

In [None]:
series_mf = df.groupby('sex').workclass.value_counts().unstack(0)
print(series_mf)
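
To see what `unstack(0)` does to the grouped counts, here is a minimal sketch on a hand-made toy table. The five rows below are invented for illustration and are not taken from census.csv; the names `toy`, `long_counts` and `wide_counts` are placeholders.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "workclass": ["Private", "Private", "Private", "State-gov", "State-gov"],
})

# Long format: one count per (sex, workclass) pair, stored in a Series
# with a two-level index
long_counts = toy.groupby("sex").workclass.value_counts()
print(long_counts)

# Wide format: unstack(0) moves index level 0 (sex) into the columns,
# so every column can be drawn as its own pie with subplots=True
wide_counts = long_counts.unstack(0)
print(wide_counts)
```

Each column of the wide table corresponds to one pie in the plot that follows.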

In [None]:
series_mf.plot.pie(subplots=True, legend=False, figsize=(12,4), cmap='Pastel1', label='')

# But: 
although pie charts can look very pretty, it is not recommended to use them for dataset visualization. Can you think of a reason?

### A bar chart is just a pie chart that doesn't look like a pie
and we can produce it just the same way: 

In [None]:
df.workclass.value_counts().plot.bar()
plt.xticks(size=16, rotation=-45)

### For the next ten minutes do the following: 
- group the charity dataset by income
- make a bar chart showing the education level for both income classes
- make sure your graph has clearly readable labels and a title
- choose some nice colors
- add a legend

### For plotting continuous data we will use seaborn's penguins dataset

#### this is supposed to be the new Iris dataset: https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris

In [None]:
penguins = sns.load_dataset("penguins")
penguins.info()

#### Let's start with a simple histogram of the flipper length: 

In [None]:
penguins.flipper_length_mm.hist()

##### We can improve this plot by changing the bins and color and by adding labels

In [None]:
penguins.flipper_length_mm.hist(bins=20, grid=False, color='purple', edgecolor='black', alpha=0.3)
plt.xlabel("flipper length [mm]", size=16)
plt.ylabel("frequency", size=16)

#### we can also add an estimate of the probability distribution

In [None]:
penguins.flipper_length_mm.hist(bins =20, grid=0, color='purple', edgecolor='black', alpha=0.3)
penguins.flipper_length_mm.plot.kde(linewidth=3, color='blue')
plt.xlabel("flipper length [mm]", size=16)
plt.ylabel("frequency", size=16)


#### why does this not work?
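
One possible answer: the histogram shows counts, while the KDE is a density that integrates to 1, so the two curves live on very different y scales. A minimal sketch of one fix is to normalize the histogram with `density=True`. The data here is synthetic (drawn from a normal distribution as a stand-in for the flipper lengths), so the numbers are an assumption and not the penguin measurements.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for penguins.flipper_length_mm (assumption, not real data)
rng = np.random.default_rng(0)
lengths = pd.Series(rng.normal(200, 14, size=300), name="flipper_length_mm")

fig, ax = plt.subplots(figsize=(6, 4))

# density=True rescales the bars so that their total area is 1,
# which puts them on the same scale as the KDE curve
lengths.plot.hist(bins=20, density=True, color="purple",
                  edgecolor="black", alpha=0.3, ax=ax)
lengths.plot.kde(linewidth=3, color="blue", ax=ax)

ax.set_xlabel("flipper length [mm]", size=16)
ax.set_ylabel("density", size=16)

# Total area of the histogram bars (height times width, summed over all bars)
bar_area = sum(p.get_height() * p.get_width() for p in ax.patches)
print(bar_area)
```

Note that the y axis now shows a density, so the label changes from "frequency" to "density".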

#### and just like with pies and bars we can make histograms for several groups: 

In [None]:
pengs_by_spec = penguins.pivot(columns='species', values="flipper_length_mm")

#### Why can't we use the same approach as before?
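
A hint, sketched on a tiny invented table (the four rows are made up and are not the penguin data): `value_counts()` counts how often each exact value occurs, which is useful for categories but not for continuous measurements. A histogram needs the raw values, and `pivot` keeps them while giving every species its own column.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "flipper_length_mm": [181.0, 217.0, 190.0, 196.0],
})

# Counting exact values of a continuous variable tells us very little:
# almost every measurement occurs only once
print(toy.flipper_length_mm.value_counts())

# pivot keeps one row per penguin and creates one column per species;
# cells that belong to another species are filled with NaN
wide = toy.pivot(columns="species", values="flipper_length_mm")
print(wide)
```

The NaN entries are ignored when pandas bins each column for the histogram.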

In [None]:
pengs_by_spec.plot.hist(color=['magenta', 'purple', 'blue'], alpha=0.5)

In [None]:
pengs_by_spec.plot.hist(color=['magenta', 'purple', 'blue'], alpha=0.5, stacked=True)

### For the next couple of minutes do the following: 
- use only female penguins and group them by island
- make a histogram of the body masses
- choose some good colors and alpha values, add labels, a title and a legend

####  Can you think of some situations where we would not want to use a histogram?

### Boxplots

In [None]:
pengs_by_spec.boxplot()

#### you can customize your boxplot

In [None]:
color_dict = {"boxes": "green","whiskers": "blue","medians": "magenta", "caps": "black" }
boxprops = dict(linestyle='--', linewidth=3, color='darkgoldenrod')

In [None]:
pengs_by_spec.boxplot(color=color_dict, boxprops= boxprops, grid=False)

##### But seaborn has a very pretty default boxplot function

In [None]:
sns.boxplot(x='species', y = 'flipper_length_mm', data=penguins)

###### you can even add the datapoints to the boxplot

In [None]:
sns.boxplot(x='species', y = 'flipper_length_mm', data=penguins)
sns.swarmplot(x='species', y = 'flipper_length_mm', data=penguins, color="grey")

### when should you use a boxplot vs a histogram?
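
One case worth keeping in mind for this question is sketched below: a boxplot summarizes a distribution by its quartiles, so it can hide features such as two separate peaks that a histogram shows directly. The data is synthetic (two normal clusters chosen for illustration), not taken from the penguins.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic bimodal data: two well separated clusters (assumption for illustration)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# The histogram shows both peaks and the gap between them
ax_hist.hist(data, bins=40, color="purple", edgecolor="black", alpha=0.3)
ax_hist.set_title("Histogram", size=16)

# The boxplot draws a single box whose median sits inside the gap
ax_box.boxplot(data)
ax_box.set_title("Boxplot", size=16)

# How much of the data actually lies close to the median?
median = np.median(data)
share_near_median = np.mean(np.abs(data - median) < 0.5)
print(median, share_near_median)
```

On the other hand, boxplots are compact, which makes them convenient for comparing many groups side by side.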

### The best of both worlds: Violin plots

In [None]:
sns.violinplot(x='species', y = 'flipper_length_mm', data=penguins)

##### You can even add more variables

In [None]:
sns.violinplot(x='species', y = 'flipper_length_mm', hue='sex', data=penguins, split=True)

#### Take the next couple of minutes to: 
- Display the age distribution in the charity dataset for men and women and both income classes
- As always, choose good colors and make sure your plot is labelled

## Scatter plots

##### You have already used the basic scatter plot many times

In [None]:
plt.scatter(penguins['bill_length_mm'],penguins['flipper_length_mm'])

#### pandas has its own scatter function

In [None]:
penguins.plot.scatter(x='bill_length_mm', y='flipper_length_mm')

#### and so has seaborn
In seaborn you can use categorical variables to specify color, size and marker style.

In [None]:
plt.figure(figsize=(5,5))
sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="sex", size="island", style='species', palette='Set2')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

#### Seaborn's jointplot is even more powerful: 

In [None]:
sns.jointplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", kind="reg")

### Take the next 10 minutes to check out the documentation on seaborn's jointplot 
and create some interesting visualizations of the relationship between bill length and flipper length: 
https://seaborn.pydata.org/generated/seaborn.jointplot.html

## Scatter plots allow you to see a relationship between different variables

In [None]:
from sklearn.linear_model import LinearRegression

#### let's reproduce the joint plot regression line with sklearn

In [None]:
lin_reg = LinearRegression()

In [None]:
penguins = penguins.dropna()

In [None]:
lin_reg.fit(penguins.bill_length_mm.values.reshape(len(penguins), 1),
            penguins.flipper_length_mm.values.reshape(len(penguins), 1))

In [None]:
plt.figure(figsize=(5,5))
plt.scatter(penguins.bill_length_mm,penguins.flipper_length_mm)
plt.xlabel('bill length [mm]', size=16)
plt.ylabel('flipper length [mm]', size=16)

#### now where do we get the line from?
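
One way to get the line, sketched under assumptions: a fitted `LinearRegression` stores the slope in `coef_` and the offset in `intercept_`, and `predict` evaluates the line at any x values we pass in. The data below is synthetic (generated from a made-up linear relation plus noise), and the names `bill`, `flipper` and `model` are placeholders for the penguin columns and `lin_reg`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for bill and flipper length (assumption, not real data)
rng = np.random.default_rng(1)
bill = rng.uniform(32, 60, size=200)
flipper = 127 + 1.7 * bill + rng.normal(0, 10, size=200)

# sklearn expects a 2D feature array of shape (n_samples, n_features)
X = bill.reshape(-1, 1)
model = LinearRegression().fit(X, flipper)

# Evaluate the fitted line on an evenly spaced grid of x values
x_line = np.linspace(bill.min(), bill.max(), 100).reshape(-1, 1)
y_line = model.predict(x_line)

# The same line written out by hand: y = slope * x + intercept
slope = model.coef_[0]
intercept = model.intercept_

plt.figure(figsize=(5, 5))
plt.scatter(bill, flipper, alpha=0.5)
plt.plot(x_line, y_line, color="red", linewidth=3)
plt.xlabel("bill length [mm]", size=16)
plt.ylabel("flipper length [mm]", size=16)
```

Drawing `slope * x_line + intercept` instead of `model.predict(x_line)` gives the same line.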

### Let's do the same thing for logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
pengs_gentoo = penguins[penguins.species == 'Gentoo']

In [None]:
sns.scatterplot(data=pengs_gentoo, x="bill_length_mm", y="flipper_length_mm", hue="sex", palette='Set2')

### Take the next ten minutes to: 
apply logistic regression to predict the sex of the penguin from its bill length

In [None]:
log_reg = LogisticRegression()
pengs_gentoo = pengs_gentoo.dropna()

#### the logistic function is defined as: 
## $$\frac{1}{1+e^{-x}}$$
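
To connect the formula to sklearn, here is a minimal sketch: for a two-class problem, the probability returned by `predict_proba` for the second class should equal the logistic function applied to the linear score `coef * x + intercept`. The data is synthetic (two invented groups with different mean bill lengths), and `logistic`, `model` and the other names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def logistic(z):
    """The logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


# Synthetic stand-in for the bill lengths of two groups (assumption, not real data)
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(45, 2, 60), rng.normal(49, 2, 60)])
labels = np.array([0] * 60 + [1] * 60)

X = x.reshape(-1, 1)
model = LogisticRegression().fit(X, labels)

# Linear score of each sample, then squashed into the interval (0, 1)
score = model.coef_[0, 0] * x + model.intercept_[0]
manual_proba = logistic(score)

# Probability of class 1 as reported by sklearn
sklearn_proba = model.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```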

### Let's check how well our logistic regression works
##### Which measures are there for the performance of a classification model? And how could we illustrate them?
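
As one possible starting point, the sketch below computes accuracy, precision and recall and draws a confusion matrix with sklearn. The two label arrays are invented by hand for illustration; in the notebook they would be the true sexes and the output of `log_reg.predict`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             confusion_matrix, precision_score, recall_score)

# Hand-made labels, invented for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # correct among predicted class 1
recall = recall_score(y_true, y_pred)        # found among true class 1
print(accuracy, precision, recall)

# Draw the confusion matrix as a colored grid
ConfusionMatrixDisplay(cm, display_labels=["class 0", "class 1"]).plot(cmap="Blues")
plt.title("Confusion matrix", size=16)
```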

## A new measure: the ROC Curve

In [None]:
from IPython.display import Image
from IPython.display import display

In [None]:
im = Image('1280px-Roc-draft-xkcd-style.svg.png', width=800)
display(im)
print("this image is taken from: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:Roc-draft-xkcd-style.svg")

##### When we use logistic regression, we can either predict class labels or class probabilities

In [None]:
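# log_reg has to be fitted first (see the exercise above), using the same single feature that is passed here
y = log_reg.predict_proba(pengs_gentoo.flipper_length_mm.values.reshape(len(pengs_gentoo), 1))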
x = pengs_gentoo.flipper_length_mm.values
plt.scatter(x, y[:, 1])
plt.xlabel('??', size=18)
plt.ylabel('??', size=18)

#### let's plot an ROC curve for our logistic regression model

In [None]:
from sklearn.metrics import roc_curve
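
A minimal sketch of how `roc_curve` could be used: it takes the true labels and the predicted probabilities of the positive class and returns the false positive rate and true positive rate for a range of thresholds. The data is synthetic (two invented groups), and for brevity the curve is computed on the training data, which tends to look more optimistic than a held-out test set would.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in for one feature of two classes (assumption, not real data)
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(45, 2, 60), rng.normal(49, 2, 60)])
labels = np.array([0] * 60 + [1] * 60)

X = x.reshape(-1, 1)
model = LogisticRegression().fit(X, labels)

# The ROC curve needs scores (probabilities of class 1), not class labels
scores = model.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)

plt.figure(figsize=(5, 5))
plt.plot(fpr, tpr, linewidth=3, label=f"logistic regression (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey", label="random guessing")
plt.xlabel("false positive rate", size=16)
plt.ylabel("true positive rate", size=16)
plt.legend()
```

The closer the curve gets to the top left corner, the better the classifier separates the two classes.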

### Decision Boundaries

In [None]:
plt.figure(figsize=(5,5))
sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species", palette='Set1')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

### Take the next 10 minutes to: 
- apply K means clustering to the data (using k=3)
- draw a scatterplot with the colors indicating the found clusters, the style indicating the true clusters and big black x's showing the cluster centers

In [None]:
from sklearn.cluster import KMeans

In [None]:
km = KMeans(n_clusters=3)

In [None]:
class_lables = km.fit_predict(penguins[['bill_length_mm', 'flipper_length_mm']])

In [None]:
km.cluster_centers_.shape

In [None]:
plt.figure(figsize=(5,5))
sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species", style= class_lables, palette='Set1')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1], marker='X', color='k', s=120)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.scatter([37], [220], color='black', s=200, marker="$?$")

In [None]:
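def plot_decision_boundary(data, feature_names, predictor, res=30):
    """Shade the plane by the label that predictor assigns to each point of a res x res grid."""
    x = data[feature_names[0]]
    y = data[feature_names[1]]

    x_min, x_max = x.min(), x.max()
    y_min, y_max = y.min(), y.max()

    # Build a regular grid that spans the observed range of both features
    xx = np.linspace(x_min, x_max, res)
    yy = np.linspace(y_min, y_max, res)
    xv, yv = np.meshgrid(xx, yy)
    mesh = np.vstack((xv.flatten(), yv.flatten())).T

    # Predict a label for every grid point and draw the result as filled contours
    pred = predictor.predict(mesh)
    plt.contourf(xv, yv, pred.reshape(xv.shape), cmap=plt.cm.coolwarm)

    plt.colorbar()
    # Leave a 10% margin around the data on both axes
    plt.xlabel(feature_names[0], size=16)
    plt.xlim(x_min - 0.1 * (x_max - x_min), x_max + 0.1 * (x_max - x_min))
    plt.ylabel(feature_names[1], size=16)
    plt.ylim(y_min - 0.1 * (y_max - y_min), y_max + 0.1 * (y_max - y_min))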

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [None]:
le.fit(penguins["species"])

In [None]:
plot_decision_boundary(penguins, ['bill_length_mm', 'flipper_length_mm'], km)
plt.scatter(penguins['bill_length_mm'],penguins['flipper_length_mm'], c=le.transform(penguins["species"]))

In [None]:
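# This cell needs plotly and a DataFrame df_countries with the columns
# 'Code', 'Count' and 'Country'; df_countries is not created in this notebook.
import plotly.graph_objects as go
from plotly.offline import iplot

data = dict(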
        type = 'choropleth',
        locations = df_countries.Code,
        z = df_countries['Count'],
        text = df_countries['Country'],
        colorbar = {'title' : 'Starbucks Stores - World Wide'},
        zmax = 100,
        zmin =0
      )
layout = dict(
    title = 'Stores Count',
    geo = dict(
            showframe = False,
            projection = {'type':'natural earth'}
    )
)
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)