# Part 1: Regression and Classification with Wine Data

You have recently been hired by a famous wine merchant who wants you to use your machine learning skills to help grow his business. For this exercise you will use the UCI wine quality dataset to build a regression model and a classification model.

Let's take a look at the data:

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

In [3]:
wine = pd.read_csv('wine_quality.csv')

In [4]:
wine.head()

| Unnamed: 0 | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | Colour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.7 | 0.25 | 0.27 | 11.5 | 0.04 | 24.0 | 120.0 | 0.99411 | 3.33 | 0.31 | 10.8 | 6 | White |
| 1 | 9.8 | 0.42 | 0.48 | 9.85 | 0.034 | 5.0 | 110.0 | 0.9958 | 2.87 | 0.29 | 10.0 | 5 | White |
| 2 | 7.1 | 0.47 | 0.29 | 14.8 | 0.024 | 22.0 | 142.0 | 0.99518 | 3.12 | 0.48 | 12.0 | 8 | White |
| 3 | 6.0 | 0.34 | 0.66 | 15.9 | 0.046 | 26.0 | 164.0 | 0.9979 | 3.14 | 0.5 | 8.8 | 6 | White |
| 4 | 6.7 | 0.46 | 0.21 | 4.0 | 0.034 | 12.0 | 88.0 | 0.99016 | 3.26 | 0.54 | 13.0 | 6 | White |


## Brief

In this exercise you are going to build a regression model to predict the quality of a bottle of wine, and then a classification model to predict its colour. 


We have left this assignment fairly open: you can research and use more advanced models and techniques if you want, or use simpler models if you are more comfortable with them. Either way, you are expected to evaluate the models you build.

## 1) Explore the data

Never neglect your EDA! 

 - Calculate summary statistics for your data
 - Check the distributions: are all fields normally distributed? Does this matter?
 - Check for null values
 - Check for outliers. How can you visualise them? What do you think you should do with them?
 - Check correlations and produce a heatmap to show them. Which fields are most correlated with quality? Are there any examples of multicollinearity?
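
As a starting point, the checks above can be gathered into one helper. This is only a minimal sketch, not a prescribed approach: the function name `eda_report` is hypothetical, the 1.5 × IQR rule is just one common way to flag outliers, and it assumes the `Unnamed: 0` column is a leftover row index that can be dropped.

```python
import numpy as np
import pandas as pd


def eda_report(df: pd.DataFrame, target: str = "quality") -> dict:
    """Collect basic EDA checks for the numeric columns of df."""
    # Assumption: 'Unnamed: 0' is a saved row index, not a real feature.
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    numeric = df.select_dtypes(include="number")

    # Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

    corr = numeric.corr()
    return {
        "summary": numeric.describe(),
        "skew": numeric.skew(),  # values near 0 suggest a symmetric distribution
        "nulls": df.isnull().sum(),
        "outlier_counts": outside.sum(),
        "corr": corr,
        # Correlation of each feature with the target, strongest first.
        "target_corr": corr[target]
        .drop(target)
        .sort_values(key=np.abs, ascending=False),
    }
```

With the imports from the first cell, `report = eda_report(wine)` followed by `sns.heatmap(report["corr"], annot=True)` draws the correlation heatmap, and `sns.boxplot(data=wine[["alcohol", "pH"]])` is one way to look at outliers for a couple of columns.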

## 2) Regression

Your first task is to build a model that predicts the quality of a type of wine. You will need to build a linear regression model, but you can use as many features as you like; any you include must be justified from your EDA. Once you have created your model, evaluate it by reporting at least the r-squared.

Required:

 - Select and justify features to be included in your model
 - Use the train_test_split function to create a training and testing set
 - Build a regression model to predict quality from your chosen features
 - Report the r-squared and at least one other performance metric
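
The required steps could be wired together roughly as follows. Treat this as a sketch under assumptions: `fit_quality_regression` is a hypothetical helper, the 80/20 split and `random_state` are arbitrary choices, and the feature list you pass in should come from your own EDA.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_quality_regression(df, features, target="quality", seed=42):
    """Fit a linear regression and score it on a held-out test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    metrics = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": np.sqrt(mean_squared_error(y_test, pred)),
    }
    return model, metrics
```

A call might look like `model, metrics = fit_quality_regression(wine, ["alcohol", "volatile_acidity"])`, where the two feature names are placeholders for whatever your EDA supports.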
 
Optional: 

 - Use cross validation to evaluate the performance of your model 
 - Use StandardScaler to standardise your features and compare the model's performance
 - Plot the model coefficients against each other. Which variable is the most important when predicting quality?
 - Plot the set of predicted values against the target variable. What does this show?
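
For the optional tasks, one possible shape is sketched below. The helper names `cv_r2` and `standardised_coefficients` are hypothetical, and five folds is an arbitrary default. Note that for a plain, unregularised linear regression, standardising the features should leave the r-squared unchanged; what it changes is that the coefficients become comparable with each other.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler


def cv_r2(df, features, target="quality", folds=5):
    """Mean and standard deviation of r-squared across k folds."""
    scores = cross_val_score(
        LinearRegression(), df[features], df[target], cv=folds, scoring="r2"
    )
    return scores.mean(), scores.std()


def standardised_coefficients(df, features, target="quality"):
    """Coefficients of a linear model fitted on z-scored features."""
    # The scaler is fitted on all rows here because the goal is to inspect
    # coefficients, not to estimate test performance.
    X_scaled = StandardScaler().fit_transform(df[features])
    model = LinearRegression().fit(X_scaled, df[target])
    coefs = pd.Series(model.coef_, index=features)
    # Largest absolute effect first.
    return coefs.sort_values(key=np.abs, ascending=False)
```

The returned Series can be plotted with `standardised_coefficients(wine, features).plot.barh()`, and `plt.scatter(y_test, model.predict(X_test))` gives the predicted-versus-actual plot.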
 
Extend:

 - Optimise the model by using regularisation
 - Research cross validation estimators (e.g. ElasticNetCV)
 - Research sklearn Pipeline and use it in building your model
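
If you attempt the extension, a pipeline that combines scaling with a cross-validated elastic net might look like the sketch below. The helper name `fit_elastic_net` and the candidate `l1_ratio` values are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_elastic_net(X_train, y_train, seed=42):
    """Scale the features, then pick alpha and l1_ratio by 5-fold CV."""
    pipe = make_pipeline(
        StandardScaler(),
        # l1_ratio=1.0 is the lasso; values near 0 behave more like ridge.
        ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, random_state=seed),
    )
    return pipe.fit(X_train, y_train)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of each fold, which avoids leaking test-fold statistics into the model. After fitting, `pipe.named_steps["elasticnetcv"].alpha_` and `.l1_ratio_` hold the selected values, and `pipe.score(X_test, y_test)` reports the r-squared.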

## 3) Classification

Now that you have made a model to predict the quality of a wine, you have been asked to predict whether it is red or white based on its features.

You should use logistic regression, but you can research and use other classification methods if you would like. 

Required:

 - Use train_test_split to create a training and testing set
 - Build a classification model to predict 'Colour' using your chosen features
 - Report the accuracy and baseline accuracy
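
A rough sketch of the required classification steps follows. `fit_colour_classifier` is a hypothetical helper; the stratified 80/20 split, the scaling step and `max_iter=1000` are assumptions. The baseline here is the accuracy you would get by always predicting the most common colour in the training set.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_colour_classifier(df, features, target="Colour", seed=42):
    """Fit logistic regression and return (model, accuracy, baseline)."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features],
        df[target],
        test_size=0.2,
        stratify=df[target],  # keep the red/white ratio equal in both sets
        random_state=seed,
    )
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)

    accuracy = clf.score(X_test, y_test)
    # Baseline: always predict the majority class seen in training.
    majority = y_train.mode()[0]
    baseline = (y_test == majority).mean()
    return clf, accuracy, baseline
```

A model is only useful if `accuracy` is clearly above `baseline`; with imbalanced classes a high accuracy on its own can be misleading.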
 
Optional:
 
  - Produce a confusion matrix to show how effective your model is
  - Calculate the precision and recall to explain how well your model predicts each colour
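
The confusion matrix, precision and recall can be computed together, as in this sketch. The helper name is hypothetical, and it assumes the `Colour` column holds exactly the labels `"Red"` and `"White"` (only `"White"` is visible in the preview above), with red treated as the positive class.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score, recall_score


def evaluate_colour_predictions(y_true, y_pred, positive="Red", negative="White"):
    """Return a labelled confusion matrix plus precision and recall."""
    labels = [positive, negative]
    # Rows are the actual classes, columns are the predicted classes.
    matrix = pd.DataFrame(
        confusion_matrix(y_true, y_pred, labels=labels),
        index=[f"actual {label}" for label in labels],
        columns=[f"predicted {label}" for label in labels],
    )
    precision = precision_score(y_true, y_pred, pos_label=positive)
    recall = recall_score(y_true, y_pred, pos_label=positive)
    return matrix, precision, recall
```

Precision answers "of the wines predicted red, how many really were red?", while recall answers "of the wines that really were red, how many did the model find?".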
 
Extend:
 
  - Research and use other classification models (KNN, SVC, etc) and compare them to your logistic regression model
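
One way to compare several classifiers on equal terms is to cross-validate each inside the same scaling pipeline. The sketch below is an assumption-laden starting point: `compare_classifiers` is a hypothetical helper, and all three models use scikit-learn's default hyperparameters apart from `max_iter`, so tuning could change the ranking.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_classifiers(X, y, folds=5):
    """Mean cross-validated accuracy per model, best first."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "knn": KNeighborsClassifier(),
        "svc": SVC(),
    }
    scores = {
        # KNN and SVC are distance-based, so scaling matters for them.
        name: cross_val_score(
            make_pipeline(StandardScaler(), estimator),
            X,
            y,
            cv=folds,
            scoring="accuracy",
        ).mean()
        for name, estimator in candidates.items()
    }
    return pd.Series(scores).sort_values(ascending=False)
```

For example, `compare_classifiers(wine[features], wine["Colour"])` returns a Series you can print or plot as a bar chart.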


# Part 2

Load in a dataset from your work and create a regression and classification model using whatever features you like. This can be as simple or complex as you'd like, but ensure you have justified your decisions through your EDA and evaluated the performance metrics of each model. 