# Data Visualization
---
Visualization Libraries:
1. [Matplotlib](https://matplotlib.org): The general-purpose plotting library; useful for most data visualization tasks. <br><br>
2. [Seaborn](https://seaborn.pydata.org): Built on top of Matplotlib; useful for statistical plots and themes that take more work in Matplotlib alone. <br><br>
3. [D3 (Data-Driven Documents)](https://d3js.org/): Optional JavaScript library for building visualizations by manipulating documents based on data.
---
Example Datasets: 

1. [Netflix Data Visualization](https://www.kaggle.com/joshuaswords/netflix-data-visualization/) <br><br>
2. [SciKit Learn Datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)

#### Import Libraries and Datasets:

In [None]:
import numpy as np
import pandas as pd                 # For DataFrames
import matplotlib as mpl            # For Data Visualization
import matplotlib.pyplot as plt     # For plotting
import seaborn as sns               # For themes & dataset

from sklearn.datasets import load_wine, load_iris

##### Some miscellaneous methods for datasets:

In [None]:
# note: load_wine is a loader function, not the dataset itself.

# use the built-in dir() function to list the attributes of the function object:
dir(load_wine)

In [None]:
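# Example: __sizeof__() returns the memory footprint (in bytes) of the load_wine
# function object, not the size of the wine dataset.
# the size of the dataset itself comes from the "shape" of the loaded arrays (checked further down).
load_wine.__sizeof__()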

#### Manipulating datasets primer:

In [None]:
# Load the dataset for Wine Data into a new variable:
wine = load_wine()

# display dataset of the object "wine":
# note: may be a partial showing if output exceeds size limit
wine

In [None]:
# check the type of the new variable
type(wine)

# What is a "Bunch"? It is a dictionary-like object whose keys (data, target, metadata) can also be read as attributes
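As a small illustration of the "Bunch" idea, the sketch below reads the same array by dictionary key and by attribute. It assumes scikit-learn is installed; the names `keys` and `same_object` are made up for this example.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# A Bunch behaves like a dictionary, so its keys can be listed
keys = sorted(wine.keys())
print(keys)

# The same array is reachable as a dictionary item or as an attribute
same_object = wine["data"] is wine.data
print(same_object)
```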

In [None]:
# display attributes of wine variable
dir(wine)

In [None]:
# show array data from wine.data
wine.data

In [None]:
# What is the shape of wine.data array?
wine.data.shape

In [None]:
# show array data for wine.target:
wine.target

In [None]:
# show shape of the array for wine.target:

wine.target.shape

In [None]:
# create a test dataframe using pandas from wine.data
test_df = pd.DataFrame(data=wine.data)

# and check its shape:
test_df.shape

In [None]:
# create a 2nd test dataframe from wine.target to concatenate to the previous one
test_df2 = pd.DataFrame(data=wine.target)

# and check its shape
test_df2.shape

In [None]:
# display the head of test_df dataframe:

test_df.head()

In [None]:
# display the head of the test_df2 dataframe:

test_df2.head()

In [None]:
# using numpy, "np.c_" stacks its arguments side by side as columns:
# it takes the 1st argument (test_df) as a matrix,
# then concatenates the 2nd argument (test_df2) to it as an additional column.
# the result is a NumPy array stored in the new variable "new_test"

new_test = np.c_[test_df,test_df2]
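To see the column-wise concatenation on something small, here is a sketch with made-up arrays (`features` and `labels` are illustrative names, not part of the wine data). It also shows `np.column_stack`, which should give the same result for this case.

```python
import numpy as np

# Made-up 3x2 feature matrix and a 1-D label array of matching length
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
labels = np.array([0, 1, 0])

# np.c_ appends the 1-D array as one extra column
combined = np.c_[features, labels]

# np.column_stack does the same job with a function-call syntax
alternative = np.column_stack([features, labels])

print(combined.shape)
print(combined)
```

Note that the integer labels are converted to floats, because a NumPy array holds a single data type.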

In [None]:
# check to make sure the data was concatenated:

new_test.shape

In [None]:
# put new_test into a dataframe using pandas again:

new_test_df = pd.DataFrame(data=new_test)

# if you skip this step and try "new_test.head()", you will get an error,
# because a NumPy array has no "head" method.

In [None]:
# see the head of the new DataFrame:

new_test_df.head()

# now we have a new dataframe concatenated from the original "wine.data" and "wine.target"

#### Now let's try for real:

In [None]:
# previously, we created a new "wine" variable 
# from the dataset "load_wine" from sklearn.datasets

# check the shape of the wine.data array:

wine.data.shape

In [None]:
# check the shape of the wine.target array:

wine.target.shape

In [None]:
# create a new variable "wine_data" and
# concatenate the two arrays column-wise using numpy:

wine_data = np.c_[wine.data,wine.target]

In [None]:
# notice the new shape has one additional column:
# the 13 feature columns from "wine.data" plus the column from "wine.target"

# note that this is a NumPy array, not a DataFrame yet, so the command
# "wine_data.head()" will generate an error message.

wine_data.shape

In [None]:
# create a new variable which will be our new dataframe
# using pandas from our new wine_data variable

wine_df = pd.DataFrame(data=wine_data)

In [None]:
# display the head of the new dataframe

wine_df.head()

In [None]:
# now we want to name the columns with "wine.feature_names" instead of numbers
# so first we view the names we want to use:

wine.feature_names

In [None]:
# check the data type of the data we want to add to the DataFrame:

type(wine.feature_names)

# note that it is a list

In [None]:
# create a new variable holding a copy of the list of feature names
# (a copy, so that appending to it later does not modify wine.feature_names):

feat = list(wine.feature_names)

# and display them here:

feat

In [None]:
# except we are missing the name of the last column, 'target'
# so we can append it to the list like this
# (run this cell only once, otherwise 'target' is appended again):

feat.append('target')

# and display the new list with 'target' appended at the end:

feat
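A side note on lists, sketched with made-up names: assigning a list to a second variable does not copy it, so `append` through either name changes the same list. Wrapping the list in `list(...)` makes an independent copy. In this notebook, that means a variable assigned directly from `wine.feature_names` shares its contents with `wine.feature_names`.

```python
original = ["alcohol", "malic_acid"]

alias = original          # same list object, no copy is made
alias.append("target")    # this also changes "original"

fresh = ["alcohol", "malic_acid"]
copied = list(fresh)      # independent copy
copied.append("target")   # "fresh" is left untouched

print(original)
print(fresh)
```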

In [None]:
# now we can modify the "wine_df" to have the properly named columns:

wine_df = pd.DataFrame(data=wine_data, columns=feat)

# and display the head of the new DataFrame:

wine_df.head()
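A possible shortcut, assuming scikit-learn 0.23 or later: `load_wine(as_frame=True)` returns a Bunch whose `frame` attribute is already a DataFrame with the named feature columns plus `target`. The name `wine_frame` is made up for this sketch.

```python
from sklearn.datasets import load_wine

# Ask scikit-learn to build the DataFrame directly
wine_frame = load_wine(as_frame=True).frame

print(wine_frame.shape)
print(wine_frame.columns[-1])
```

One difference from the manual route: `target` stays an integer column here, whereas `np.c_` converts it to float along with the features.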

#### Analyzing the Wine Dataframe:

In [None]:
# declare a new variable as a copy of "wine_df"

df = wine_df.copy()

# and display the head:

df.head()

In [None]:
# Here's how we check the # of unique values in the 'alcohol' column:
# This can come in handy when we need to start making plots

df['alcohol'].nunique()

In [None]:
# let's look at a statistical description of this dataset:

df.describe()

In [None]:
# transpose the dataset so we don't have to scroll right to see all the columns:

df.describe().transpose()

In [None]:
# and view some information on the data structure itself:
# in this case there are no "surprises" to deal with: no missing values, and every column is numeric

df.info()

#### Data Visualization using MatPlotLib and Seaborn:

Some additional documentation: [MatPlotLib hist method](https://matplotlib.org/stable/gallery/statistics/hist.html)

In [None]:
# Get a histogram of the alcohol column
# First we need to declare a figure and an axis using matplotlib.pyplot:

fig, ax = plt.subplots()

# we want to display the histogram of the alcohol content of each row of the dataframe

ax.hist(df['alcohol'])

# Set titles and x,y labels for the histogram:

ax.set_title('Alcohol Levels in Wine')
ax.set_xlabel('Alcohol Level')
ax.set_ylabel('Frequency in Data')
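For a closer look at what `ax.hist` produces, the sketch below runs it on a made-up random sample (not the wine data) and keeps the return values. By default Matplotlib uses 10 bins, which is written out explicitly here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample standing in for an alcohol-like column
rng = np.random.default_rng(0)
sample = rng.normal(loc=13.0, scale=0.8, size=200)

fig, ax = plt.subplots()

# hist returns the bar heights, the bin edges, and the drawn bars
counts, edges, patches = ax.hist(sample, bins=10)

print(counts.sum())   # every sample falls in exactly one bin
print(len(edges))     # there is one more edge than there are bins
```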

In [None]:
# another way to do it:
df['alcohol'].plot.hist(title='Alcohol Content in Wine')

##### Using Seaborn:

In [None]:
# plot the kernel density estimate (KDE), a smoothed version of the histogram, of the alcohol content:
sns.kdeplot(df['alcohol'])
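A rough sketch of what a kernel density estimate computes: one Gaussian bump is placed on each data point and the bumps are averaged. The function name `gaussian_kde_sketch`, the sample values, and the fixed bandwidth of 0.5 are all assumptions for illustration; Seaborn chooses a bandwidth automatically and its exact result will differ.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # distance of every grid point to every sample, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Made-up values in the range of an alcohol-like column
samples = np.array([12.0, 12.5, 13.0, 13.2, 14.1])
grid = np.linspace(8.0, 18.0, 1001)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.5)

# a density should integrate to roughly 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(area)
```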

In [None]:
# to plot both the frequency histogram and the kde density plot together:
# using seaborn
sns.displot(df['alcohol'],bins=10,kde=True)

# note that changing the # of bins will affect the # of bars in the graph

In [None]:
# reminder: # of unique values in the 'alcohol' column of the DataFrame:
df['alcohol'].nunique()

In [None]:
# Number of unique values in the 'target' column of DataFrame:
df['target'].nunique()

In [None]:
# List of columns in the dataframe:
df.columns

# note the last column is 'target'

##### Plotting the 'target' classes using MatPlotLib:

In [None]:
# Easiest way is to plot through pandas
df['target'].plot.bar()

# This is not what we want though...
# it draws one bar per row of the DataFrame, so the bottom of the graph is a jumble of labels.

In [None]:
# This is the data we want to plot in a bar graph: the number of rows in each class.
df['target'].value_counts()

# The left column is the class, the right column is its count

In [None]:
# Here we can plot a bar graph
df['target'].value_counts().plot.bar()

# But wait... the bars are ordered by count (largest first), not by class

In [None]:
# Sort the data and add color:
df['target'].value_counts().sort_index().plot.bar(color=['red','blue','green'])

# This is what we were looking for
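The difference between the two orderings can be seen on a tiny made-up Series (the values below are illustrative, not the wine labels): `value_counts()` orders by count, and `sort_index()` reorders by label.

```python
import pandas as pd

# Made-up labels: class 1 appears most often, class 2 least often
labels = pd.Series([1, 0, 1, 2, 1, 0])

by_count = labels.value_counts()               # largest count first
by_label = labels.value_counts().sort_index()  # ordered by class label

print(by_count)
print(by_label)
```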

##### Another way to create the bar plot:

In [None]:
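# Declare a new variable to contain the data, sorted by class
# so that the counts line up with the labels 'Class0', 'Class1', 'Class2' used afterwards:
target = df['target'].value_counts().sort_index()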

# Check the data we want to plot:
target

In [None]:
# declare a new list for the names of the items on the x axis:
index = ['Class0','Class1','Class2']

# declare a new array that contains the target.values:
values = target.values

# Declare a blank figure and an axis using matplotlib.pyplot:
fig1, ax1 = plt.subplots()

# Now add data and color to it
ax1.bar(index,values, color=['red','blue','green'])

In [None]:
# if you want a horizontal bar graph plot:
# declare a new blank figure and axis using matplotlib.pyplot:
fig2, ax2 = plt.subplots()

# and use the 'barh' method instead
ax2.barh(index,values, color=['red','blue','green'])

##### Same bar plot using Seaborn (sns):

In [None]:
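# use the 'countplot' function, passing the column name and the DataFrame
# (keyword arguments work across Seaborn versions):
sns.countplot(x='target', data=df)

# Uses the 'target' column of the DataFrame that we already created previously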

#### Plotting Correlation using Pandas:

Additional Documentation: [pandas.DataFrame.corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)

In [None]:
# How to find correlation
# declare a new variable to store the correlation data:
corr = df.corr(method='pearson')

# display the correlation data:
corr

# A table of raw numbers like this is hard to read at a glance

In [None]:
# Instead, using seaborn, we can display a more useful heatmap of the correlation data:
sns.heatmap(corr)

# note that the last column 'target' is a class label rather than a measurement,
# so it should have been dropped before computing the correlation.
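Dropping the label column before computing the correlation could look like the sketch below. The DataFrame `toy` and its columns are made up so that the expected correlations are obvious: `b` rises with `a`, and `c` falls as `a` rises.

```python
import pandas as pd

# Made-up data with a label column that should not enter the correlation
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "target": [0, 0, 1, 1],
})

# drop the label, then compute the Pearson correlation of the features only
features_only = toy.drop(columns="target")
toy_corr = features_only.corr(method="pearson")

print(toy_corr)
```

The resulting table can be passed to `sns.heatmap` in the same way as before.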

##### Alcohol Concentration based on class (target)

Useful Documentation: [Pandas Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)

In [None]:
# view the head of our complete DataFrame
df.head()

In [None]:
# view the 'target' column of our DataFrame:
df['target']

# not much use to us yet...

In [None]:
# the original dataset, which we loaded into the variable 'wine',
# has a 'target' array with values 0, 1, or 2.
# we added that data to our dataframe as an extra column named 'target'.
# now we want to add another column named 'class',
# which maps each 'target' value to its class name

# reminder: here is our original dataset
wine

In [None]:
# We want to add another column named 'class' to the dataframe.
# since we are mapping integer codes to categorical values, we use 'pd.Categorical.from_codes':
# it takes an array of codes and a list of categories, and returns the category for each code.
# Here each code in wine.target is replaced by
# the corresponding entry of wine.target_names, and the result is stored
# in a new 'class' column.

df['class'] = pd.Categorical.from_codes(wine.target,wine.target_names)

# view the new dataframe and notice the last column and its values:
df
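The mapping from codes to categories is easier to follow on a short made-up example. The `codes` and `names` below are illustrative; each code is used as a position in the list of names.

```python
import pandas as pd

# Made-up integer codes and the category names they index into
codes = [0, 2, 1, 0]
names = ["class_0", "class_1", "class_2"]

# code 0 -> names[0], code 1 -> names[1], code 2 -> names[2]
labels = pd.Categorical.from_codes(codes, categories=names)

print(list(labels))
print(list(labels.codes))
```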

In [None]:
# check the value_count of the new 'class' column:
df['class'].value_counts()

In [None]:
# note that this matches the 'target' value counts, which is what we want
df['target'].value_counts()

In [None]:
# group the dataframe by class
# and take the mean alcohol content of each class (max() or min() can be used as well)
# then sort the values in descending order and keep at most the top 10 with [:10]
# (there are only 3 classes, so all of them are kept)
# then plot the bars in red, blue, and green; the colors follow bar position, not class

df.groupby('class').alcohol.mean().sort_values(ascending=False)[:10].plot.bar(color=['red','blue','green'])

In [None]:
# Here is another example, using the min() value, plotted in ascending order

df.groupby('class').alcohol.min().sort_values(ascending=True)[:10].plot.bar(color=['red','blue','green'])
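Because a plain color list is applied by bar position, sorting the bars changes which class gets which color. One way around that, sketched on made-up data (the DataFrame `toy` and the `palette` dictionary are assumptions for illustration), is to look up the color of each class after sorting.

```python
import pandas as pd

# Made-up measurements for three classes
toy = pd.DataFrame({
    "class": ["class_0", "class_0", "class_1", "class_1", "class_2", "class_2"],
    "alcohol": [14.0, 13.0, 12.0, 12.5, 13.0, 13.2],
})

# mean per class, largest first
means = toy.groupby("class")["alcohol"].mean().sort_values(ascending=False)

# fixed color per class, looked up in the sorted order
palette = {"class_0": "red", "class_1": "blue", "class_2": "green"}
colors = [palette[name] for name in means.index]

print(means)
print(colors)
```

Passing `color=colors` to `means.plot.bar(...)` then keeps each class in its own color regardless of the sort order.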