# Extra Practice for Data Viz and Machine Learning
### Alex
This notebook is for extra practice with `seaborn` and `scikit-learn`. We will be using a couple of different datasets from the `vega-datasets` library. If you don't know what that is, that's fine - it's just a library with some really nice practice datasets. This challenge does not have any tests, partly because the goal is to explore what you can do with `seaborn`, but you can compare your plots visually to the examples shown. Please look at the documentation and see what other cool plots you can come up with!   
  
For reference, here are links to the documentation of the `seaborn` functions you should use:
 - [catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html)
 - [relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html)
 - [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
 - [plot a distribution](https://seaborn.pydata.org/generated/seaborn.kdeplot.html)
 - [plot a regression line](https://seaborn.pydata.org/generated/seaborn.regplot.html)
 - [plot bivariate data with axis graphs](https://seaborn.pydata.org/generated/seaborn.jointplot.html)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()

cars = pd.read_csv('cars.csv')
barley = pd.read_csv('barley.csv')

# only for Jupyter
%matplotlib inline

The `cars` dataset covers a bunch of different car models, with several statistics about each one, including horsepower, acceleration, the year it was released, and its country of origin. Here's what it looks like:

In [None]:
cars.head()

You may have seen the `barley` dataset before - it's a bunch of measurements of barley yields at several different sites and over multiple years.

In [None]:
barley.head()

# Data Viz and ML with Quantitative Data
For this section, we will be working with the `cars` dataset. The majority of the columns in the dataset are *quantitative*, which means they hold numeric measurements (such as horsepower or weight) rather than labels that sort each row into a category. If you're unclear about different types of data, take another look at the lessons, or see [this article on types of data](https://www.dummies.com/education/math/statistics/types-of-statistical-data-numerical-categorical-and-ordinal/).  
  
As a reminder, here's what the `cars` dataset looks like.

In [None]:
cars.head()
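
If you're not sure which columns count as quantitative, one rough check is to ask `pandas` which columns have a numeric dtype. The sketch below uses a small made-up table (`toy_cars` is a stand-in, not the real dataset). A numeric dtype is only a hint: a column like `Cylinders` is stored as a number but could also be treated as ordinal.

```python
import pandas as pd

# Toy stand-in for the cars data; the values are made up for illustration.
toy_cars = pd.DataFrame({
    'Name': ['car a', 'car b', 'car c'],
    'Horsepower': [130.0, 165.0, 95.0],
    'Cylinders': [8, 8, 4],
    'Origin': ['USA', 'USA', 'Japan'],
})

# Columns with a numeric dtype are the candidates for quantitative data.
quantitative_cols = toy_cars.select_dtypes(include='number').columns.tolist()

# The remaining columns here hold text labels, i.e. categorical data.
categorical_cols = toy_cars.select_dtypes(exclude='number').columns.tolist()

print(quantitative_cols)
print(categorical_cols)
```

The same two `select_dtypes` calls should work on `cars` itself once it is loaded.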

## Problem 1) Scatterplot

Create a scatterplot using `seaborn` and the `cars` dataset that plots `Horsepower` on the x-axis and `Miles_per_Gallon` on the y-axis. The points should be colored by the country of origin. The x-axis title should be `Horsepower` and the y-axis label should be `Miles per Gallon`. The plot should be titled 'Horsepower vs. Miles per Gallon'.  
Your plot should look like this:  
  
![plot](ml_dataviz_img/hp_vs_mpg.png)

In [None]:
# Enter your solution here!
sns.relplot(y='Miles_per_Gallon', x='Horsepower', hue='Origin', data=cars)
plt.ylabel('Miles per Gallon')
plt.title('Horsepower vs. Miles per Gallon')
plt.show()

## Problem 2) Distribution

Plot a distribution of the horsepower of the `cars` dataset. The x-axis should be labeled 'Horsepower', and the plot should be titled 'Distribution of Horsepower'. Your plot should look like this:  
  
![plot](ml_dataviz_img/hp_dist.png)

In [None]:
sns.kdeplot(cars['Horsepower'])
plt.xlabel('Horsepower')
plt.title('Distribution of Horsepower')
plt.show()

## Problem 3) Bivariate Data

Plot a *bivariate distribution* - the distributions of two quantitative variables plotted against each other - of miles per gallon on the y-axis and horsepower on the x-axis. Compare this to the corresponding scatterplot - what do you see?  
  
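You will need to use `dropna`, and you will need to pass both variables to `kdeplot`: older versions of `seaborn` take the second variable through the optional `data2` argument, while `seaborn` 0.11 and later take `x` and `y` instead. Your plot should look like this:  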
  
![plot](ml_dataviz_img/hp_vs_mpg_dist.png)

In [None]:
cars.head()

In [None]:
cars = cars.dropna()
# seaborn 0.11+ takes x= and y=; older versions used data= and data2= instead
sns.kdeplot(x='Horsepower', y='Miles_per_Gallon', data=cars)
plt.title('Distributions of Horsepower vs Miles per Gallon')
plt.ylabel('Miles per Gallon')
plt.show()

## Problem 4) Regression Plot

Plot a linear regression, with acceleration on the x-axis and weight on the y-axis. There should be a linear regression line. The x-axis should have a label of 'Acceleration', and the y-axis should have a label of 'Weight (lbs)'. The plot should be titled 'Regression of Acceleration vs. Weight (lbs)'. Also, for an added bit of fun, set the 'marker' to be '*' and the color to be 'r'.  
  
![plot](ml_dataviz_img/acc_vs_lbs_reg.png)

In [None]:
sns.regplot(x='Acceleration', y='Weight_in_lbs', data=cars, marker='*', color='r')
plt.title('Regression of Acceleration vs. Weight (lbs)')
plt.ylabel('Weight (lbs)')
plt.show()

## Problem 5) Joint hexplot and binned histograms

Plot a hexplot of horsepower on the x-axis and weight on the y-axis, so that the distributions of horsepower and weight are shown along the x and y axes (see the documentation of `jointplot` on the seaborn website). Instead of setting the y label, try renaming the `DataFrame` column to something else. What happens if you use `plt.ylabel` on this type of chart? What happens if you try to add a title? Use magenta as the color. Your plot should look like this:  
  
![plot](ml_dataviz_img/lbs_vs_hp_joint.png)

In [None]:
# Copy the column under a display-friendly name so the axis label reads well
cars['Weight (lbs)'] = cars['Weight_in_lbs']
sns.jointplot(x='Horsepower', y='Weight (lbs)', kind='hex', data=cars, color='m')
plt.show()


## Problem 6) ML with Quantitative Data

Using the `cars` dataset, create and train a model that will predict the Horsepower of a car. Think about the type of data you are trying to predict - what model should you use to predict quantitative data? Make sure to split training and testing data, and check the mean squared error of your model.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

cars = cars.dropna()
features = cars.loc[:, cars.columns != 'Horsepower']
features = pd.get_dummies(features)
labels = cars['Horsepower']

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.2)

model = DecisionTreeRegressor()
model.fit(features_train, labels_train)

train_predictions = model.predict(features_train)
test_predictions = model.predict(features_test)

# scikit-learn metrics take the true values first, then the predictions
train_mse = mean_squared_error(labels_train, train_predictions)
test_mse = mean_squared_error(labels_test, test_predictions)
print(f'Training Mean Squared Error: {train_mse}')
print(f'Testing Mean Squared Error: {test_mse}')
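
Mean squared error is measured in squared units (here, horsepower squared), which can make the number hard to judge. Taking the square root puts it back in horsepower. The sketch below shows this on a few made-up values; `actual` and `predicted` are illustrative names, not part of the solution above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up horsepower values, purely for illustration.
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 200.0])

mse = mean_squared_error(actual, predicted)  # units: horsepower squared
rmse = np.sqrt(mse)                          # units: horsepower

print(f'MSE:  {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
```

Here the errors are 10, -10 and 0, so the MSE is 200 / 3 (about 66.67) and the RMSE is about 8.16 horsepower.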

# Data Viz and ML with Categorical Data

For this section, we will be using both the `barley` dataset and the `cars` dataset. You will get some experience setting up ML for categorical data, and see a few more types of plots you can do with categorical data, and some you can do to combine quantitative and categorical data.

In [None]:
# here's a reminder of the barley dataset
barley.head()

## Problem 1) Bar Chart

Plot a bar chart of the origin in the `cars` dataset. Remember to use the right function in `seaborn` - don't use `barplot` yet, it won't work quite right. Remember to add descriptive axis labels and a title to your plot. Your plot should look like this:  
  
![plot](ml_dataviz_img/origin_bar.png)

In [None]:
sns.catplot(x='Origin', kind='count', data=cars)
plt.ylabel('Count')
plt.title('Origin of Cars')
plt.savefig('ml_dataviz_img/origin_bar.png', bbox_inches='tight')

## Problem 2) Combined Categorical with Quantitative Plot

Plot a categorical chart of the `barley` dataset with the site along the x-axis, the yield along the y-axis, and the color according to the variety of barley. There are two variants of this graph you can try out. The first is what you get if you just plot the data as-is. For the second, if you did the extra practice with `groupby`, group by the variety and the site at once, take the mean, and reset the index before plotting. Here are a couple of plots you could come up with:  
  
![plot](ml_dataviz_img/yield_vs_site_overall.png) ![plot](ml_dataviz_img/yield_vs_site_mean.png)

In [None]:
barley['Variety'] = barley['variety']
barley['Yield'] = barley['yield']
barley['Site'] = barley['site']
barley['Year'] = barley['year']

heatmap_barley = barley[['Site', 'Variety', 'Yield']]
heatmap_barley = heatmap_barley.groupby(['Variety', 'Site']).mean()
heatmap_barley = heatmap_barley.reset_index()

order = ['University Farm', 'Waseca', 'Morris', 'Crookston', 'Grand Rapids', 'Duluth']
sns.catplot(x='Site', y='Yield', hue='Variety', data=heatmap_barley, order=order)
plt.title('Mean of Barley Yield at Different Sites')
plt.ylabel('Mean of Yield')
plt.xticks(rotation=-45)
plt.savefig('ml_dataviz_img/yield_vs_site_mean', bbox_inches='tight')
plt.show()

## ML with Categorical Data

Create, train, and test a model that will predict the country of origin for the `cars` dataset. Remember, this is categorical data, so you will need to use a different type of model than you did for the `Horsepower` model.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cars = cars.dropna()
features = cars.loc[:, cars.columns != 'Origin']
features = pd.get_dummies(features)
labels = cars['Origin']

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(features_train, labels_train)

train_predictions = model.predict(features_train)
test_predictions = model.predict(features_test)

# scikit-learn metrics take the true values first, then the predictions
train_accuracy = accuracy_score(labels_train, train_predictions)
test_accuracy = accuracy_score(labels_test, test_predictions)
print(f'Training Accuracy: {train_accuracy}')
print(f'Testing Accuracy: {test_accuracy}')

## Just for Kicks: Barley Heatmap

`seaborn` has a `heatmap` function which produces some entertaining results. It takes two categorical or ordinal variables and plots them against each other, with the color of each box determined by a third, quantitative variable. To get data in that shape, we will have to `groupby` the `barley` dataset, then use `DataFrame.pivot` to change the index and columns of the `DataFrame`. Take a look at the `pivot` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html) if this interests you.
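
If `pivot` is new to you, it may help to see it on a tiny table first. The sketch below uses made-up numbers (`toy_long` is not the real `barley` data) and turns a long table with one row per year and site into a wide one with years as rows and sites as columns.

```python
import pandas as pd

# Tiny long-format table; the numbers are made up for illustration.
toy_long = pd.DataFrame({
    'Year': [1931, 1931, 1932, 1932],
    'Site': ['Duluth', 'Morris', 'Duluth', 'Morris'],
    'Yield': [10, 20, 30, 40],
})

# One row per Year, one column per Site, cells filled from Yield.
toy_wide = toy_long.pivot(index='Year', columns='Site', values='Yield')
print(toy_wide)
```

`pivot` needs each (index, column) pair to appear only once, which is why the real data is grouped and summed first.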

In [None]:
barley_pivot = barley.loc[:, ['Year', 'Site', 'Yield']]
barley_pivot = barley_pivot.groupby(['Year', 'Site']).sum()
barley_pivot = barley_pivot.reset_index()
barley_pivot['Yield'] = barley_pivot['Yield'].astype(int)
# Keyword arguments are required by recent pandas versions
barley_pivot = barley_pivot.pivot(index='Year', columns='Site', values='Yield')

Now, we can create our heatmap. Notice that we can very clearly see which years and sites did well overall - the same point could have been shown with a different plot, but this is still interesting to see.

In [None]:
sns.heatmap(barley_pivot, annot=True, fmt='d')