# Assignment: Wines, Team 1
## Learning goals
In this assignment, you:
1. learn to conduct linear regression analysis.
2. improve your data manipulation skills in Python.

## Assignment
In this assignment, you analyse numerical data on wine properties.

The data sets are available in the Documents/Methods/Data/Wine folder in the course’s Oma workspace. Alternatively, the data sets can be downloaded from the UCI repository at http://archive.ics.uci.edu/ml/datasets/Wine+Quality.

First, choose either red or white wines as the target of the study.

Then choose a trait from two options: 1) wine quality or 2) wine alcohol content.

Now, your task is to build a regression model that predicts the values of your chosen response variable as well as possible.

You should provide evidence-based answers to the following questions:

1. What is the regression equation for estimating your chosen trait values?
2. What are the five most useful variables for estimating the trait values?
3. Provide a validation-based error estimate for your model. As the data set is large, use split validation that divides the data set into separate training and testing sets.


### Libraries and Data

#### Attribute descriptions from winequality.names

We target wine quality in our regression model.

7. Attribute information:

   For more information, read [Cortez et al., 2009].

   Input variables (based on physicochemical tests):
   
   1 - fixed acidity
   
   2 - volatile acidity
   
   3 - citric acid
   
   4 - residual sugar
   
   5 - chlorides
   
   6 - free sulfur dioxide
   
   7 - total sulfur dioxide
   
   8 - density
   
   9 - pH
   
   10 - sulphates
   
   11 - alcohol

   Output variable (based on sensory data):
   
   12 - quality (score between 0 and 10)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error



In [2]:
# fetch data for red wines
file = "winequality-red.csv"
data = pd.read_csv(file, delimiter=';')

# have a look at the first wine, column names
print(data.head(1))

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4               0.7          0.0             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  


### Building the Model

In [3]:
# Separate features and target
X = data.drop(columns=['quality'])
y = data['quality']

# 80% of data for training, rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the linear regression model and train it
model = LinearRegression()
model.fit(X_train, y_train)

### Regression equation
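Each coefficient gives the expected change in predicted quality when that variable increases by one unit while all other variables are held fixed. A positive coefficient means predicted quality goes up as the variable increases; a negative coefficient means it goes down. Because the variables are measured in different units, the size of a coefficient depends on the scale of its variable.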

In [4]:
# Coefficients of the regression equation
coefficients = model.coef_
intercept = model.intercept_

# The equation itself
regression_equation = "Quality = {:.2f}".format(intercept)
for i in range(len(coefficients)):
    regression_equation += " + {:.2f} * {}".format(coefficients[i], X.columns[i])

print("Regression Equation:")
print(regression_equation)

Regression Equation:
Quality = 14.36 + 0.02 * fixed acidity + -1.00 * volatile acidity + -0.14 * citric acid + 0.01 * residual sugar + -1.81 * chlorides + 0.01 * free sulfur dioxide + -0.00 * total sulfur dioxide + -10.35 * density + -0.39 * pH + 0.84 * sulphates + 0.28 * alcohol
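
The printed equation contains terms such as `+ -1.00` and `-0.00`, which are a little hard to read. As a minimal sketch (the helper name `format_equation` and its `decimals` parameter are our own illustration, not part of the notebook above), the sign can be separated from the magnitude when building the string:

```python
def format_equation(intercept, coefficients, names, target="Quality", decimals=2):
    """Build a readable regression equation string.

    Hypothetical helper: writes '- 1.00 * x' instead of '+ -1.00 * x'.
    """
    number = "{:." + str(decimals) + "f}"
    terms = [("{} = " + number).format(target, intercept)]
    for coef, name in zip(coefficients, names):
        sign = "-" if coef < 0 else "+"
        terms.append(("{} " + number + " * {}").format(sign, abs(coef), name))
    return " ".join(terms)


# Small illustration using two of the coefficients printed above
example = format_equation(14.36, [0.02, -1.00], ["fixed acidity", "volatile acidity"])
print(example)
# prints: Quality = 14.36 + 0.02 * fixed acidity - 1.00 * volatile acidity
```

With the fitted model this could be called as `format_equation(intercept, coefficients, X.columns, decimals=4)`. More decimals may be useful for small coefficients such as the one for total sulfur dioxide, which rounds to zero at two decimals.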


#### Sanity-checking the equation with the first 10 wines

In [5]:
# Selecting features for the first 10 wines
first_10 = X.head(10)

# Calculating predicted quality for each of the first 10 wines
predicted_quality = []
for i in range(len(first_10)):
    prediction = intercept
    for j in range(len(coefficients)):
        prediction += coefficients[j] * first_10.iloc[i, j]
    predicted_quality.append(prediction)

# Printing predicted quality and real quality for the first 10 wines
for i in range(len(predicted_quality)):
    print("Wine", i+1, ": Predicted vs Real Quality: {:.2f}/{:.2f}\n".format(predicted_quality[i], y.iloc[i]))

Wine 1 : Predicted vs Real Quality: 5.05/5.00

Wine 2 : Predicted vs Real Quality: 5.15/5.00

Wine 3 : Predicted vs Real Quality: 5.21/5.00

Wine 4 : Predicted vs Real Quality: 5.68/6.00

Wine 5 : Predicted vs Real Quality: 5.05/5.00

Wine 6 : Predicted vs Real Quality: 5.08/5.00

Wine 7 : Predicted vs Real Quality: 5.11/5.00

Wine 8 : Predicted vs Real Quality: 5.36/7.00

Wine 9 : Predicted vs Real Quality: 5.33/7.00

Wine 10 : Predicted vs Real Quality: 5.60/5.00
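
The nested loops above apply the equation one term at a time. The same check can be sketched as a single matrix product and compared against `model.predict`. The example below runs on synthetic stand-in data (the columns `a`, `b`, `c` and all values are made up) so that it works without the wine file, and `manual_predictions` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def manual_predictions(model, features):
    """Apply the fitted equation by hand: intercept + features . coefficients."""
    return features.to_numpy() @ model.coef_ + model.intercept_


# Synthetic stand-in for the wine data (hypothetical columns and values)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
demo_y = 1.0 + demo_X["a"] - 2.0 * demo_X["b"] + rng.normal(scale=0.1, size=50)
demo_model = LinearRegression().fit(demo_X, demo_y)

# The hand-computed values should agree with the library's predictions
by_hand = manual_predictions(demo_model, demo_X.head(10))
by_sklearn = demo_model.predict(demo_X.head(10))
print(np.allclose(by_hand, by_sklearn))
```

For the wine model the equivalent call would be `manual_predictions(model, X.head(10))`. Note that most of the first 10 wines probably belong to the training set, so this confirms that the equation is written correctly rather than estimating the prediction error.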



### Top 5 Variables

In [6]:
# Get absolute coefficients and corresponding variable names and combine them
coefficients_abs = abs(coefficients)
variable_names = X.columns
coefficients_df = pd.DataFrame({'Variable': variable_names, 'Coefficient': coefficients_abs})

# Sort by coefficient magnitude in descending order
coefficients_df = coefficients_df.sort_values(by='Coefficient', ascending=False)

# Get the top 5 variables by raw coefficient magnitude
# (note: raw magnitudes depend on each variable's units and scale)
top_5_variables = coefficients_df.head(5)
print("\nTop 5 most useful variables for estimating wine quality:")
print(top_5_variables)


Top 5 most useful variables for estimating wine quality:
           Variable  Coefficient
7           density    10.351594
4         chlorides     1.806503
1  volatile acidity     1.001304
9         sulphates     0.841172
8                pH     0.393688
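
Raw coefficient magnitudes depend on the units of each variable: density varies only in the third decimal place, so a large coefficient is needed for it to have any effect at all. One common alternative, sketched here as an assumption rather than as the method used above, is to standardize the features before fitting so that each coefficient describes the effect of a one-standard-deviation change. The helper name `standardized_coefficients` and the synthetic demo data are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(features, target):
    """Fit a linear model on z-scored features and rank by |coefficient|."""
    scaled = StandardScaler().fit_transform(features)
    model = LinearRegression().fit(scaled, target)
    ranking = pd.DataFrame({
        "Variable": features.columns,
        "Std. coefficient": model.coef_,
        "Magnitude": np.abs(model.coef_),
    })
    return ranking.sort_values(by="Magnitude", ascending=False).reset_index(drop=True)


# Synthetic demo (made-up data): 'narrow' has a tiny range, 'wide' a large one
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "narrow": rng.normal(loc=1.0, scale=0.002, size=200),
    "wide": rng.normal(loc=10.0, scale=1.0, size=200),
})
demo_target = 50.0 * demo["narrow"] + 0.5 * demo["wide"] + rng.normal(scale=0.1, size=200)

# Raw coefficients favour 'narrow'; standardized ones favour 'wide'
raw_coef = LinearRegression().fit(demo, demo_target).coef_
ranking = standardized_coefficients(demo, demo_target)
print(raw_coef)
print(ranking)
```

On the wine data the call would be `standardized_coefficients(X_train, y_train).head(5)`. The resulting top five may differ from the list above, so the two rankings are worth comparing before answering the question.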


### Evaluation

In [7]:
# Make predictions on the testing set
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error for the model is {:.2f}".format(mse))

Mean squared error for the model is 0.39
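
An MSE of 0.39 corresponds to a root mean squared error of about 0.62 quality points, which is easier to interpret on the 0 to 10 scale. The number also means more next to a baseline. The sketch below (the helper name `summarize_errors` and the demo numbers are made up for illustration) reports a few standard metrics from `sklearn.metrics` together with the error of always predicting the training-set mean:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def summarize_errors(y_true, y_pred, y_train_mean):
    """Collect test-set error metrics plus a mean-predictor baseline."""
    mse = mean_squared_error(y_true, y_pred)
    # Baseline: predict the training mean for every test wine
    baseline = np.full(len(y_true), y_train_mean)
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
        "baseline_mse": mean_squared_error(y_true, baseline),
    }


# Made-up values, only to show the call
demo_true = np.array([5, 6, 5, 7])
demo_pred = np.array([5.5, 5.5, 5.0, 6.0])
summary = summarize_errors(demo_true, demo_pred, y_train_mean=5.5)
print(summary)
```

For the model above the call would be `summarize_errors(y_test, y_pred, y_train.mean())`. If the model's MSE is clearly below the baseline MSE, the features carry real predictive information.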
