# Linear Regression

For the linear regression, I split the data into separate white and red wine data sets. The chemical properties of the two types of wine are likely to differ, so I fit a separate model for each.

## Importing Dataset

In [32]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
white = pd.read_csv("white_wine_data.csv")
red = pd.read_csv("red_wine_data.csv")

In [11]:
white.info()
red.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
type                    4898 non-null object
fixed acidity           4890 non-null float64
volatile acidity        4891 non-null float64
citric acid             4896 non-null float64
residual sugar          4896 non-null float64
chlorides               4896 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4891 non-null float64
sulphates               4896 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 497.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 13 columns):
type                    1599 non-null object
fixed acidity           1597 non-null float64
volatile acidity        1598 non-nu
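
`info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. A minimal sketch of a more direct check is below. The `toy` frame and its values are made up for illustration and stand in for `white` or `red`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine data; the values are made up for illustration.
toy = pd.DataFrame({
    "fixed acidity": [7.0, np.nan, 8.1, 7.2],
    "pH": [3.0, 3.3, np.nan, np.nan],
    "alcohol": [8.8, 9.5, 10.1, 9.9],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Running the same two lines on `white` and `red` should list exactly the columns that need filling in the cleaning steps.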

## Cleaning White Dataset

In [31]:
clean_white = white.copy()

white_fixed_acidity_mean = clean_white["fixed acidity"].mean()
white_volatile_acidity_mean = clean_white["volatile acidity"].mean()
white_citric_acid_mean = clean_white["citric acid"].mean()
white_chlorides_mean = clean_white["chlorides"].mean()
white_residual_sugar_mean = clean_white["residual sugar"].mean()
white_pH_mean = clean_white["pH"].mean()
white_sulphates_mean = clean_white["sulphates"].mean()

# Assign each filled column back to the frame. Calling fillna(inplace=True) on a
# selected column may not update clean_white in newer pandas versions.
clean_white["fixed acidity"] = clean_white["fixed acidity"].fillna(white_fixed_acidity_mean)
clean_white["volatile acidity"] = clean_white["volatile acidity"].fillna(white_volatile_acidity_mean)
clean_white["citric acid"] = clean_white["citric acid"].fillna(white_citric_acid_mean)
clean_white["residual sugar"] = clean_white["residual sugar"].fillna(white_residual_sugar_mean)
clean_white["chlorides"] = clean_white["chlorides"].fillna(white_chlorides_mean)
clean_white["pH"] = clean_white["pH"].fillna(white_pH_mean)
clean_white["sulphates"] = clean_white["sulphates"].fillna(white_sulphates_mean)

clean_white.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
type                    4898 non-null object
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 497.5+ KB
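
The column-by-column mean fill above could also be written once for every numeric column. The sketch below is one way to do that. The helper name `fill_numeric_nans_with_mean` and the toy values are my own assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def fill_numeric_nans_with_mean(df):
    """Return a copy of df with NaNs in numeric columns replaced by the column mean."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # fillna accepts a Series that maps each column name to its fill value.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out


# Toy frame with made-up values, standing in for `white`.
toy = pd.DataFrame({
    "type": ["white", "white", "white"],
    "pH": [3.0, np.nan, 3.4],
    "sulphates": [0.4, 0.5, np.nan],
})
clean_toy = fill_numeric_nans_with_mean(toy)
print(clean_toy)
```

Note that filling with the mean of the full data set before splitting lets the test rows influence the fill values. Computing the means on the training rows only would be the stricter choice.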


## Cleaning Red Dataset

In [13]:
clean_red = red.copy()

red_fixed_acidity_mean = clean_red["fixed acidity"].mean()
red_volatile_acidity_mean = clean_red["volatile acidity"].mean()
red_citric_acid_mean = clean_red["citric acid"].mean()
red_pH_mean = clean_red["pH"].mean()
red_sulphates_mean = clean_red["sulphates"].mean()

# Assign each filled column back to the frame. Calling fillna(inplace=True) on a
# selected column may not update clean_red in newer pandas versions.
clean_red["fixed acidity"] = clean_red["fixed acidity"].fillna(red_fixed_acidity_mean)
clean_red["volatile acidity"] = clean_red["volatile acidity"].fillna(red_volatile_acidity_mean)
clean_red["citric acid"] = clean_red["citric acid"].fillna(red_citric_acid_mean)
clean_red["pH"] = clean_red["pH"].fillna(red_pH_mean)
clean_red["sulphates"] = clean_red["sulphates"].fillna(red_sulphates_mean)

clean_red.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 13 columns):
type                    1599 non-null object
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 162.5+ KB


## Splitting White Dataset

In [14]:
from sklearn.model_selection import train_test_split

white_train_set, white_test_set = train_test_split(clean_white, test_size=0.2, random_state=123)
print(len(white_train_set), len(white_test_set))
print(white_train_set.head())
print(white_test_set.head())

3918 980
       type  fixed acidity   ...     alcohol  quality
4120  white            7.4   ...         9.0        5
1169  white            7.5   ...         9.6        6
1941  white            6.8   ...         9.0        5
1971  white            6.3   ...        11.7        7
3124  white            6.5   ...         9.2        6

[5 rows x 13 columns]
       type  fixed acidity   ...     alcohol  quality
1088  white            7.4   ...         9.2        6
4366  white            5.9   ...        12.6        6
92    white            6.9   ...        12.6        7
2901  white            6.5   ...        11.9        7
2330  white            7.5   ...         9.3        6

[5 rows x 13 columns]
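
The 3918/980 row counts follow from `test_size=0.2`. As far as I understand it, `train_test_split` rounds a fractional test share up to a whole number of rows and gives the rest to the training set. The small check below reproduces the counts. The helper name `split_sizes` is my own and is not a scikit-learn function.

```python
import math


def split_sizes(n_rows, test_size):
    """Row counts (train, test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


print(split_sizes(4898, 0.2))  # white: (3918, 980)
print(split_sizes(1599, 0.2))  # red: (1279, 320)
```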


## Splitting Red Dataset

In [19]:
red_train_set, red_test_set = train_test_split(clean_red, test_size=0.2, random_state=123)
print(len(red_train_set), len(red_test_set))
print(red_train_set.head())
print(red_test_set.head())

1279 320
     type  fixed acidity  volatile acidity   ...     sulphates  alcohol  quality
1076  red            9.9              0.32   ...          0.73     11.4        6
847   red            7.4              0.68   ...          0.70      9.9        6
582   red           11.7              0.49   ...          0.43      9.2        5
172   red            8.0              0.42   ...          0.61      9.2        6
779   red            7.1              0.52   ...          0.60      9.8        5

[5 rows x 13 columns]
     type  fixed acidity  volatile acidity   ...     sulphates  alcohol  quality
912   red           10.0              0.46   ...          0.62     12.2        6
772   red            9.5              0.57   ...          0.55      9.4        5
1037  red            7.3              0.91   ...          0.56      9.2        5
1106  red            8.2              0.23   ...          0.54     12.3        6
263   red            7.9              0.37   ...          0.67      9.3      

### Heat Map for White Wine

In [20]:
# Drop the non-numeric "type" column so corr() only sees numeric columns
correlation = white_train_set.drop(columns="type").corr()
plt.figure(figsize=(14, 12))
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9))

### Heat Map for Red Wine

In [18]:
# Drop the non-numeric "type" column so corr() only sees numeric columns
correlation = red_train_set.drop(columns="type").corr()
plt.figure(figsize=(14, 12))
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, cmap="RdBu_r")
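
A heat map is easier to act on alongside a ranked list of how strongly each feature correlates with quality. The sketch below shows one way to produce that list. The `toy` frame is randomly generated for illustration, so its numbers say nothing about the real wine data.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in for a training set (not real wine data).
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.5, 1.0, n)
toy = pd.DataFrame({
    "type": "white",
    "alcohol": alcohol,
    "density": rng.normal(0.994, 0.003, n),
    "quality": 0.5 * alcohol + rng.normal(0, 0.5, n),
})

# Correlation of every numeric feature with quality, excluding quality itself.
corr_with_quality = toy.drop(columns="type").corr()["quality"].drop("quality")

# Order by absolute strength so strong negative correlations rank high too.
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)
print(ranked)
```

Applied to `white_train_set` and `red_train_set`, a list like this could guide which features to add in later rounds.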

## Initial Training for White Wine 

For the first round of training I use alcohol as the only feature. Alcohol can make a wine taste more bitter, so it may be important in determining the quality of the wine.

In [21]:
from sklearn.linear_model import LinearRegression
white_reg = LinearRegression()

white_X = white_train_set[["alcohol"]]
white_Y = white_train_set["quality"]
white_reg.fit(white_X, white_Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
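
With a single feature, the fitted model is just a line, and its slope and intercept can be read from `coef_` and `intercept_`. The sketch below uses made-up points that lie exactly on a known line so the recovered values are easy to check. It does not use the wine data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points on an exact line: quality = 2.0 + 0.3 * alcohol
alcohol = np.array([[9.0], [10.0], [11.0], [12.0]])
quality = 2.0 + 0.3 * alcohol.ravel()

reg = LinearRegression().fit(alcohol, quality)
slope = reg.coef_[0]        # change in predicted quality per unit of alcohol
intercept = reg.intercept_  # predicted quality when alcohol is 0
print(f"quality ~ {intercept:.2f} + {slope:.2f} * alcohol")
```

The same two attributes on `white_reg` would show how much predicted quality changes per percentage point of alcohol.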

In [22]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='alcohol', data=white_train_set, ax=axs)
plt.title('quality VS alcohol')
plt.tight_layout()
plt.show()
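
By default, seaborn's `barplot` draws the mean of the y variable for each x category, so the bars here are the mean alcohol per quality score. The same numbers can be printed with a `groupby`, sketched below on a made-up frame.

```python
import pandas as pd

# Made-up rows standing in for white_train_set.
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "alcohol": [9.0, 9.4, 10.0, 10.6, 11.8],
})

# Mean alcohol for each quality score, which is what the bar heights show.
mean_alcohol = toy.groupby("quality")["alcohol"].mean()
print(mean_alcohol)
```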



## Initial Results

In [23]:
from sklearn.metrics import mean_squared_error

print("The R2 score is:", white_reg.score(white_X,white_Y))

white_y_pred = white_reg.predict(white_X)
white_mse = mean_squared_error(white_Y, white_y_pred)
print("Mean Squared Error is: ",  white_mse)

The R2 score is: 0.18294068827397758
Mean Squared Error is:  0.6436278365855493
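
To make the two numbers easier to interpret, here is how they are defined. MSE is the average squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes both by hand on a few made-up values, not on the wine data.

```python
def mean_squared_error_by_hand(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up quality scores and predictions.
y_true = [5, 6, 5, 7, 6]
y_pred = [5.5, 5.5, 5.5, 6.5, 6.0]

toy_mse = mean_squared_error_by_hand(y_true, y_pred)
toy_r2 = r2_by_hand(y_true, y_pred)
print(toy_mse, toy_r2)
```

An R2 of 0 means the model does no better than always predicting the mean quality, and 1 means a perfect fit.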


### Notes: 
An R2 of about 0.18 means alcohol alone explains roughly 18% of the variance in quality on the training set, which is weak. More features need to be tested.

## Initial Training for Red Wine

In [24]:
red_reg = LinearRegression()

red_X = red_train_set[["alcohol"]]
red_Y = red_train_set["quality"]
red_reg.fit(red_X, red_Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [25]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='alcohol', data=red_train_set, ax=axs)
plt.title('quality VS alcohol')
plt.tight_layout()
plt.show()



## Initial Results

In [26]:
print("The R2 score is:", red_reg.score(red_X,red_Y))

red_y_pred = red_reg.predict(red_X)
red_mse = mean_squared_error(red_Y, red_y_pred)
print("Mean Squared Error is: ",  red_mse)

The R2 score is: 0.22944655975826067
Mean Squared Error is:  0.4997760282831427


### Notes: 
Using the same feature (alcohol) as the white wine model, the red wine model did a bit better (R2 of about 0.23 versus 0.18), but it still needs work.

## Round 2 Training White Wine

In [27]:
white_reg2 = LinearRegression()

white_X2 = white_train_set[["alcohol", "volatile acidity", "residual sugar", "sulphates"]]
white_Y2 = white_train_set["quality"]
white_reg2.fit(white_X2, white_Y2)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [28]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='volatile acidity', data=white_train_set, ax=axs)
plt.title('quality VS volatile acidity')
plt.tight_layout()
plt.show()



In [29]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='residual sugar', data=white_train_set, ax=axs)
plt.title('quality VS residual sugar')
plt.tight_layout()
plt.show()

In [170]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='sulphates', data=white_train_set, ax=axs)
plt.title('quality VS sulphates')
plt.tight_layout()
plt.show()

## Round 2 Results

In [146]:
print("The R2 score is:", white_reg2.score(white_X2,white_Y2))

white_y_pred2 = white_reg2.predict(white_X2)
white_mse2 = mean_squared_error(white_Y2, white_y_pred2)
print("Mean Squared Error is: ",  white_mse2)

The R2 score is: 0.25546985764902086
Mean Squared Error is:  0.5864939275727611


### Notes:
The R2 score improved after adding three more features (from about 0.18 to 0.26), but it is still low.
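
Training R2 never goes down when features are added to a linear regression, so some of the improvement could come from having more parameters. Adjusted R2 is one common way to account for that. It is not used in the original analysis, and the sketch below only applies the standard formula to the training scores reported above, with n = 3918 training rows.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: penalizes R2 for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


n_train = 3918  # rows in white_train_set

# Training R2 values copied from the Initial and Round 2 results above.
adj_one_feature = adjusted_r2(0.18294068827397758, n_train, 1)
adj_four_features = adjusted_r2(0.25546985764902086, n_train, 4)
print(round(adj_one_feature, 4), round(adj_four_features, 4))
```

With this many rows the penalty is tiny, so the gain from the extra features looks real rather than an artifact of the parameter count.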

## Round 2 Training Red Wine

In [154]:
red_reg2 = LinearRegression()

red_X2 = red_train_set[["alcohol", "volatile acidity", "sulphates", "chlorides"]]
red_Y2 = red_train_set["quality"]
red_reg2.fit(red_X2, red_Y2)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [171]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='volatile acidity', data=red_train_set, ax=axs)
plt.title('quality VS volatile acidity')
plt.tight_layout()
plt.show()

In [172]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='sulphates', data=red_train_set, ax=axs)
plt.title('quality VS sulphates')
plt.tight_layout()
plt.show()

In [173]:
fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='chlorides', data=red_train_set, ax=axs)
plt.title('quality VS chlorides')
plt.tight_layout()
plt.show()

## Round 2 Results

In [155]:
print("The R2 score is:", red_reg2.score(red_X2,red_Y2))

red_y_pred2 = red_reg2.predict(red_X2)
red_mse2 = mean_squared_error(red_Y2, red_y_pred2)
print("Mean Squared Error is: ",  red_mse2)

The R2 score is: 0.34623470873378615
Mean Squared Error is:  0.4240279825314856


### Notes: 
The red wine model also improved with more features. Note that the feature sets that helped differ between the two data sets: the red model uses chlorides where the white model uses residual sugar. The R2 score rose to about 0.35 but is still not high.

## Testing White Wine

In [160]:
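# Note: this fits a new model on the test set instead of scoring white_reg2 on it,
# so the scores below are not a held-out evaluation of the Round 2 model.
white_reg3 = LinearRegression()

white_X3 = white_test_set[["alcohol", "volatile acidity", "residual sugar", "sulphates"]]
white_Y3 = white_test_set["quality"]
white_reg3.fit(white_X3, white_Y3)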

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [161]:
print("The R2 score is:", white_reg3.score(white_X3,white_Y3))

white_y_pred3 = white_reg3.predict(white_X3)
white_mse3 = mean_squared_error(white_Y3, white_y_pred3)
print("Mean Squared Error is: ",  white_mse3)

The R2 score is: 0.2908622327476922
Mean Squared Error is:  0.5450784051448285
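
The cell above fits a fresh model on the test rows. A held-out evaluation would instead keep the model fitted on the training rows and only score it on the test rows. The sketch below shows that pattern on randomly generated data. The data, the coefficients, and the variable names are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for four features and a noisy linear target.
rng = np.random.default_rng(123)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.4, -0.3, 0.1, 0.2]) + rng.normal(scale=0.7, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Fit on the training rows only.
model = LinearRegression().fit(X_train, y_train)

# Score the same fitted model on both splits; the test rows were never used to fit.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_r2, test_r2, test_mse)
```

In this notebook's terms, that would mean calling `white_reg2.score(white_X3, white_Y3)` and `white_reg2.predict(white_X3)` rather than fitting `white_reg3`.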


## Testing Red Wine

In [156]:
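# Note: this fits a new model on the test set instead of scoring red_reg2 on it,
# so the scores below are not a held-out evaluation of the Round 2 model.
red_reg3 = LinearRegression()

red_X3 = red_test_set[["alcohol", "volatile acidity", "sulphates", "chlorides"]]
red_Y3 = red_test_set["quality"]
red_reg3.fit(red_X3, red_Y3)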

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [157]:
print("The R2 score is:", red_reg3.score(red_X3,red_Y3))

red_y_pred3 = red_reg3.predict(red_X3)
red_mse3 = mean_squared_error(red_Y3, red_y_pred3)
print("Mean Squared Error is: ",  red_mse3)

The R2 score is: 0.33885365578048865
Mean Squared Error is:  0.43903603770100147


## Overall Results

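On the test sets the scores were close to the training scores (white: R2 0.29 vs 0.26, MSE 0.55 vs 0.59; red: R2 0.34 vs 0.35, MSE 0.44 vs 0.42), but none of them were high. Because the MSE values are so close, the models do not appear to be overfitted. Note, however, that the test-set models above were fit on the test sets themselves, so these numbers are not a true held-out evaluation of the training models. The relationship between these features and quality does not appear to be very linear, so after some research I think a decision tree or a random forest might predict wine quality better.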
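
As a rough illustration of why a tree-based model can help when the relationship is not linear, the sketch below compares a linear regression with a random forest using 5-fold cross-validation. The data is randomly generated with a deliberately curved relationship, so it only shows the comparison pattern and does not predict how either model would do on the wine data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Randomly generated data with a quadratic (non-linear) relationship.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=300)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean cross-validated R2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
print(scores)
```

The same loop could be run with the wine features and `quality` as the target to see whether a random forest actually improves on the linear models here.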