# Regression Assignment

## Read data

In [51]:
import pandas as pd

# Load the Red Wine Quality dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
        
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [52]:
df.shape

(1599, 12)

In [53]:
# Count the number of missing (null) values in each column
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

## Divide data into training & testing sets

In [54]:
from sklearn.model_selection import train_test_split

# Split the data into input and output variables
X = df.drop('quality', axis=1).values  # features: the 11 physicochemical measurements
y = df['quality'].values  # target: the quality score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

## Training Linear Regression on the dataset

In [55]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import numpy as np

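# Standardize the features using statistics from the training set only,
# then apply the same transformation to the test set
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Standardize the target the same way; StandardScaler expects a 2-D array,
# so add a column axis and flatten the result back to 1-D
sc_y = StandardScaler()
sc_y.fit(y_train[:, np.newaxis])
y_train_std = sc_y.transform(y_train[:, np.newaxis]).flatten()
y_test_std = sc_y.transform(y_test[:, np.newaxis]).flatten()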


# Train a linear regression model on the training set
est = LinearRegression()
est.fit(X_train_std, y_train_std)


# Make predictions on the training and testing sets
y_train_pred = est.predict(X_train_std)
y_test_pred = est.predict(X_test_std)

## Report the R^2 value of the trained model

In [56]:
# Calculate the mean squared error (MSE) and R^2 of the trained model on the
# training and test sets (both computed on the standardized target)
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train_std, y_train_pred),
        mean_squared_error(y_test_std, y_test_pred)))

print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train_std, y_train_pred),
        r2_score(y_test_std, y_test_pred)))

MSE train: 0.629, test: 0.624
R^2 train: 0.371, test: 0.340
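
Because the target was standardized with `sc_y`, the MSE values above are expressed in squared standard deviations of the training quality scores, not in squared quality points. The sketch below illustrates the conversion between the two on made-up numbers rather than the wine data; the helper `mse` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy targets and predictions in original units (illustrative, not wine data).
y = np.array([5.0, 5.0, 6.0, 7.0, 4.0, 6.0])
y_hat = np.array([5.2, 5.4, 5.8, 6.3, 4.9, 5.9])

# Standardize both with the mean and population standard deviation of y,
# which is how StandardScaler treats a single column.
mu, sigma = y.mean(), y.std()
y_std = (y - mu) / sigma
y_hat_std = (y_hat - mu) / sigma

mse_std = mse(y_std, y_hat_std)   # error in standardized units
mse_orig = mse(y, y_hat)          # error in original units

# The two differ by exactly the factor sigma**2.
print(np.isclose(mse_std * sigma ** 2, mse_orig))
```

In the notebook itself, the corresponding factor would be `sc_y.scale_[0] ** 2`. R^2 is unaffected by this rescaling, which is why it can be reported directly from the standardized values.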


## Data Description
The <strong>Red Wine Quality</strong> dataset is a collection of 1599 red wine samples, each described by 11 physicochemical features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The quality of each wine is rated as an integer score on a scale from 0 to 10, with higher scores indicating better quality. The dataset was created by Paulo Cortez, António Cerdeira, and colleagues, and is available from the UCI Machine Learning Repository.

## Regression Task
The regression task is to predict the quality of a red wine from its physicochemical features. The input variables (features) are continuous. The output variable (quality) is recorded as an integer score, but we treat it as a continuous value so that the problem can be handled as a standard regression.

## Regression Method
We use a <strong>linear regression</strong> model to perform the regression task. Linear regression is a widely used method for modeling the relationship between a dependent variable and one or more independent variables. In our case, the dependent variable is the quality of red wine, and the independent variables are the 11 physicochemical features. The model assumes that the output is a linear function of the inputs and finds the coefficients that minimize the squared error between its predictions and the observed values.
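
To make the fitting step concrete, here is a minimal sketch of ordinary least squares solved directly with NumPy. It uses synthetic data with known coefficients rather than the wine data, and the variable names (`X_design`, `true_coef`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, known coefficients plus small noise.
X = rng.normal(size=(200, 3))
true_coef = np.array([0.5, -1.0, 2.0])
true_intercept = 0.3
y = X @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([X, np.ones(len(X))])

# Solve the least-squares problem: minimize ||X_design @ w - y||^2 over w.
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coef, intercept = w[:-1], w[-1]

print("coefficients:", coef)
print("intercept:", intercept)
```

The recovered coefficients should land close to the values used to generate the data. scikit-learn's `LinearRegression` fits the same ordinary least squares objective.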

## Assessment of Results
We used the <strong>R^2</strong> score (the coefficient of determination) to evaluate the performance of our model. R^2 measures the proportion of variance in the output variable that is explained by the model's predictions. A higher R^2 score indicates a better fit between the model and the data, with 1 corresponding to a perfect fit.

We obtained an R^2 score of 0.340 on the test set, which means that our model explains about 34% of the variance in red wine quality from its physicochemical features. The training score of 0.371 is only slightly higher, so the low test score is not a result of overfitting. This suggests that the linear regression model may not be the best choice for this problem, or that more sophisticated methods or feature engineering may be needed to improve performance.

Therefore, while the Red Wine Quality dataset is useful for regression tasks in machine learning, the results here appear to be limited by the choice of algorithm or the feature representation. Other regression methods, such as decision trees, random forests, or support vector machines, may perform better on this dataset. More data may also be needed to train a more accurate and robust model.
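
One way to test that suggestion is to compare candidate models under the same cross-validation protocol. The sketch below is a hedged example, not a result for the wine data: `compare_models` is a hypothetical helper, the hyperparameters are arbitrary, and the demonstration runs on synthetic nonlinear data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def compare_models(X, y, cv=5):
    """Return the mean cross-validated R^2 for each candidate model."""
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=1),
    }
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
        for name, model in models.items()
    }

# Synthetic data with a nonlinear target (illustrative, not the wine data).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-2, 2, size=(300, 4))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

scores = compare_models(X_demo, y_demo)
print(scores)
```

Calling `compare_models(X, y)` on the feature matrix and quality scores from the notebook would give a like-for-like comparison. Whether the random forest actually beats linear regression on the wine data has to be checked by running it.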