# Exploratory Data Analysis (EDA) 

The wine dataset was first loaded from a CSV file and cleaned. The categorical columns were then encoded, the Price column was normalized, and the result was explored through visualization and outlier detection.

For data cleaning, the text in categorical columns was corrected to fix encoding errors, and the relevant columns—Type, Grape, and Price—were selected. The Price column was cleaned by removing non-numeric characters and converting the values to float type. Rows with missing values in the selected columns were also removed.

The encoders used were One-hot for the Type column and Binary for the Grape column. The encoded data was combined into a new DataFrame. The Price column was scaled using MinMaxScaler to normalize the data between 0 and 1.
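To make the encoding step more concrete, here is a minimal sketch of the ideas behind binary encoding and min-max scaling, written in plain Python. The grape names and prices are made up for illustration, and the exact ordinal assignment and column layout produced by `category_encoders.BinaryEncoder` may differ from this toy version.

```python
from math import ceil, log2


def one_hot_width(n_categories, drop_first=True):
    """Number of columns produced by one-hot encoding n_categories levels."""
    return n_categories - 1 if drop_first else n_categories


def binary_codes(categories):
    """Map each category to the binary digits of its ordinal index (starting at 1)."""
    ordered = sorted(set(categories))
    width = max(1, ceil(log2(len(ordered) + 1)))
    return {
        cat: [int(bit) for bit in format(i, f"0{width}b")]
        for i, cat in enumerate(ordered, start=1)
    }


def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical grape names, not taken from the dataset
grapes = ["Merlot", "Malbec", "Syrah", "Chardonnay", "Pinot Noir"]
for grape, bits in binary_codes(grapes).items():
    print(grape, bits)

print("one-hot columns:", one_hot_width(len(grapes)))
print("scaled prices:", min_max_scale([5.0, 12.99, 29.99, 105.0]))
```

With many distinct grapes, binary encoding needs roughly log2 of the number of categories as columns, while one-hot encoding needs one column per category, which is presumably why the two encoders were assigned to Grape and Type respectively.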

For data visualization, distribution plots (histplots) of the target variable Price were created, a countplot was used to visualize the distribution of wine types, and a correlation matrix with its corresponding heatmap was generated to observe relationships between variables.

To detect outliers, the interquartile range (IQR) was calculated to identify outliers in the Price column, and these outliers were visualized using a boxplot. Due to the nature of the data and its fidelity to the market reality, the outliers were not removed.

The main insights from the EDA are as follows:

- Q1 (£12.99): 25% of prices are below this value
- Q3 (£29.99): 75% of prices are below this value
- IQR (£17.00): range containing the middle 50% of prices (Q3 - Q1)
- Red and White types are the clear market leaders, each with almost 550 units
- Price is only weakly correlated with the encoded Type and Grape features
- From PCA, we can see that the data is not linearly separable
- The Price distribution is right-skewed, with a long tail of high prices, which means that the data is not normally distributed; a polynomial regression model was therefore considered the closest fit to the data distribution
- The Orange and Tawny types do not have a significant number of samples, which can lead to bias in the model
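The quartiles reported above are enough to work out the boxplot whisker limits by hand. The sketch below uses only the Q1 and Q3 values from the list and the conventional 1.5 × IQR rule; the `is_outlier` helper is a hypothetical name introduced for illustration.

```python
q1, q3 = 12.99, 29.99  # quartiles reported in the insights above

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(price):
    """Flag a price that falls outside the 1.5 * IQR fences."""
    return price < lower_fence or price > upper_fence


print(f"IQR: {iqr:.2f}")
print(f"Lower fence: {lower_fence:.2f}")
print(f"Upper fence: {upper_fence:.2f}")
print("Is 60.00 an outlier?", is_outlier(60.0))
```

Since the lower fence is negative and prices cannot be, every outlier flagged by this rule sits in the high-price tail, which is consistent with the right-skewed distribution noted above.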

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import (OneHotEncoder, MinMaxScaler, 
                                   StandardScaler, PolynomialFeatures)

# Regression Models
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet, 
                                  SGDRegressor)
# Dimensionality Reduction
from sklearn.decomposition import PCA

# Metrics and Model/Data Selection
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score

# Feature engineering
import category_encoders as ce
from ftfy import fix_text
from sklearn.pipeline import Pipeline

In [None]:
# Seaborn and Matplotlib configurations for modern aesthetics
sns.set_theme(style="whitegrid", palette="muted")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["axes.titlesize"] = 16
plt.rcParams["axes.labelsize"] = 14
plt.rcParams["xtick.labelsize"] = 12
plt.rcParams["ytick.labelsize"] = 12

# Load dataset
df = pd.read_csv('../datasets/WineDataset.csv')

# Fix text-encoding errors in every string column
for col in df.columns:
    if df[col].dtype == 'object': 
        df[col] = df[col].apply(lambda x: fix_text(x) if isinstance(x, str) else x)

# Print after cleaning to verify data still exists
print("\nAfter cleaning:")
print("Number of rows:", len(df))

# Select relevant columns (copy so later assignments do not trigger SettingWithCopyWarning)
df = df[['Type', 'Grape', 'Price']].copy()

# Clean the 'Price' column by stripping the currency symbol and unit text
df['Price'] = df['Price'].str.replace('£', '', regex=False)
df['Price'] = df['Price'].str.replace('per bottle', '', regex=False)

# Remove extra whitespace and convert to float
df['Price'] = df['Price'].str.strip()
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Drop rows with missing or invalid prices
df = df.dropna(subset=['Price'])

# Drop rows with missing values in categorical columns
df = df.dropna(subset=['Type', 'Grape'])

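# One-hot encode 'Type'; reuse df's index so that pd.concat aligns rows correctly
# (df no longer has a contiguous index after the dropna calls above)
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')
type_encoded = pd.DataFrame(
    one_hot_encoder.fit_transform(df[['Type']]),
    columns=one_hot_encoder.get_feature_names_out(['Type']),
    index=df.index,
)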

# Initialize the binary encoder
binary_encoder = ce.BinaryEncoder(cols=['Grape'])

# Apply the encoder to the 'Grape' column
grape_encoded = binary_encoder.fit_transform(df['Grape'])

# Combine all encoded data into a single dataframe
df_encoded = pd.concat([df[['Price']], type_encoded, grape_encoded], axis=1)
df_encoded = df_encoded.dropna()  # Assign the result back

scaler = MinMaxScaler()
df_encoded['Price'] = scaler.fit_transform(df_encoded[['Price']])

print(df_encoded)

# Plot distribution of Price
sns.histplot(df['Price'], kde=True, bins=30, color='blue')
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Plot Type Distribution
sns.countplot(x='Type', data=df, palette='cool')
plt.title('Type Distribution')
plt.xlabel('Type')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

# Correlation Matrix Visualization
correlation_matrix = df_encoded.corr()
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

# Detecting Outliers using Boxplot

# Calculate IQR
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1

# Calculate whisker boundaries
whisker_min = Q1 - 1.5 * IQR 
whisker_max = Q3 + 1.5 * IQR

# Print IQR statistics
print(f'Q1: {Q1}')
print(f'Q3: {Q3}')
print(f'IQR: {IQR}')
print(f'Lower whisker: {whisker_min}')
print(f'Upper whisker: {whisker_max}')

# Boxplot
plt.figure(figsize=(10, 6))
max_price = df['Price'].max()
plt.xticks(np.arange(0, max_price + 50, 50))  
sns.boxplot(x=df['Price'], color='red')
plt.title('Outliers in Price')
plt.xlabel('Price')
plt.tight_layout()
plt.show()


# PCA Analysis

We use PCA for dimensionality reduction. Projecting the data onto a 2D space lets us visualize the high-dimensional features and inspect how they relate to the target variable, Price.

In [None]:
df_encoded['Price'] = df['Price']

df_encoded_clean = df_encoded.dropna()
features = df_encoded_clean.drop(columns=['Price']).values
labels = df_encoded_clean['Price'].values

pca = PCA(n_components=2)
df_pca = pca.fit_transform(features)

unique_labels = np.unique(labels)
step = 10

for i in range(0, len(unique_labels), step):
    plt.figure(figsize=(8, 6))
    subset_labels = unique_labels[i:i + step]
    
    for label in subset_labels:
        indices = np.where(labels == label)[0]
        plt.scatter(
            df_pca[indices, 0],
            df_pca[indices, 1],
            label=f'Price: {label:.4f}'
        )
    
    plt.title('PCA (2D) - Colored by Price')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(title='Price Range', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()
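When reading a 2D PCA scatter plot, it helps to know how much of the total variance the two components retain. The sketch below computes a PCA projection by hand with NumPy's SVD on made-up binary features (not the wine data); `pca_2d` is a hypothetical helper. With scikit-learn, the same ratios are available from the fitted object's `explained_variance_ratio_` attribute.

```python
import numpy as np


def pca_2d(features):
    """Project features onto their first two principal components via SVD."""
    centered = features - features.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    projected = centered @ vt[:2].T
    return projected, explained[:2]


# Made-up 0/1 features standing in for the encoded Type and Grape columns
rng = np.random.default_rng(0)
toy_features = rng.integers(0, 2, size=(50, 6)).astype(float)

projected, ratio = pca_2d(toy_features)
print("Projected shape:", projected.shape)
print("Variance retained by two components:", ratio.sum())
```

If the retained variance is low, the lack of visible structure in the plot may reflect information lost in the projection rather than a property of the data itself.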

In [None]:
# Initialize the one-hot encoder
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')

df_copy = df.copy()

# Drop rows whose Type is 'Orange' or 'Tawny' (too few samples)
values = ['Tawny', 'Orange']
df_copy = df_copy[~df_copy['Type'].isin(values)]

# Apply the encoder to the 'Type' column, keeping df_copy's index for alignment
type_encoded = pd.DataFrame(
    one_hot_encoder.fit_transform(df_copy[['Type']]),
    columns=one_hot_encoder.get_feature_names_out(['Type']),
    index=df_copy.index,
)

# Combine all encoded data into a single dataframe; dropna() discards the
# grape_encoded rows that belong to the removed types
df_encoded = pd.concat([df_copy[['Price']], type_encoded, grape_encoded], axis=1)
df_encoded = df_encoded.dropna()



print(df_encoded)

# Split the data into training and testing sets

X = df_encoded.drop(columns=['Price'])
y = df_encoded['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Model Polynomial Regression Price not Encoded

In [None]:
# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

# Fit model
model = LinearRegression()
model.fit(X_poly, y_train)

# Print metrics
print('R² score:', model.score(X_poly_test, y_test))
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)

# Predict prices
y_pred = model.predict(X_poly_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
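As a rough illustration of what `PolynomialFeatures(degree=2)` generates, the sketch below enumerates the bias, linear, squared, and interaction terms for a few column names. The names are hypothetical placeholders, and `poly_terms` is an illustrative helper rather than part of scikit-learn.

```python
from itertools import combinations_with_replacement
from math import comb


def poly_terms(feature_names, degree=2):
    """List every monomial up to the given degree, starting with the bias term."""
    terms = [()]  # the empty tuple stands for the constant (bias) column
    for d in range(1, degree + 1):
        terms.extend(combinations_with_replacement(feature_names, d))
    return terms


# Hypothetical encoded column names
names = ["Type_Red", "Type_White", "Grape_0"]
terms = poly_terms(names)

for term in terms:
    print(" * ".join(term) if term else "1")

# The count matches the closed form C(n + degree, degree)
print("Number of terms:", len(terms), "expected:", comb(len(names) + 2, 2))
```

Because the unscaled inputs here are 0/1 indicators, each squared term equals its linear counterpart (0² = 0 and 1² = 1), so the expanded matrix contains duplicated columns and only the interaction terms add new information.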

# Linear Model Regression Price not Encoded

In [None]:
# Fit linear model
model = LinearRegression()
model.fit(X_train, y_train)

# Print metrics
print('R² score:', model.score(X_test, y_test))
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)

# MAE (Mean Absolute Error), MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) metrics


# Predict prices
y_pred = model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
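For reference, the three error metrics used throughout can be written out in a few lines. This is a plain-Python sketch with made-up prices, intended only to show how one large miss affects each metric.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: MSE expressed in the target's units."""
    return sqrt(mse(y_true, y_pred))


# Made-up prices: three close predictions and one large miss on an expensive wine
y_true = [10.0, 20.0, 30.0, 100.0]
y_pred = [12.0, 18.0, 33.0, 60.0]

print("MAE:", mae(y_true, y_pred))
print("MSE:", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

RMSE comes out well above MAE in this example because squaring emphasizes the single large error, which is worth keeping in mind for a right-skewed target such as Price.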

# Linear and Polynomial Regression Model Price Encoded

In [None]:

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)

# Linear Regression with scaled features
linear_model = LinearRegression()
linear_model.fit(X_scaled, y_train)

# Polynomial Regression with scaled features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_scaled)
X_poly_test = poly.transform(X_scaled_test)

poly_model = LinearRegression()
poly_model.fit(X_poly, y_train)

# Print metrics
print("Linear Regression Results:")
print('R² score:', linear_model.score(X_scaled_test, y_test))
print('Coefficients:', linear_model.coef_)
print('Intercept:', linear_model.intercept_)

print("\nPolynomial Regression Results:")
print('R² score:', poly_model.score(X_poly_test, y_test))
print('Coefficients:', poly_model.coef_)
print('Intercept:', poly_model.intercept_)

# Predict prices (the linear model was fit on scaled features, so predict on them too)
y_pred = linear_model.predict(X_scaled_test)
y_pred_poly = poly_model.predict(X_poly_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print metrics
print('\nLinear Regression Metrics:')
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)

# Calculate metrics for polynomial regression
mae = mean_absolute_error(y_test, y_pred_poly)
mse = mean_squared_error(y_test, y_pred_poly)
rmse = np.sqrt(mse)

# Print metrics
print('\nPolynomial Regression Metrics:')
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)

# Linear Regression Model Price Encoded and with Log transformation

In [None]:
# Prepare data with log transformation of price
X = df_encoded.drop('Price', axis=1).values
y = np.log1p(df_encoded['Price'].values)  # log1p handles zero values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)

# Fit model
model = LinearRegression()
model.fit(X_scaled, y_train)

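# Predictions, transformed back to the original price scale
y_pred = np.expm1(model.predict(X_scaled_test))
y_test_price = np.expm1(y_test)

# Print metrics (R² is computed on the log scale)
print('R² score:', model.score(X_scaled_test, y_test))

# Calculate error metrics with both arguments on the original price scale
mae = mean_absolute_error(y_test_price, y_pred)
mse = mean_squared_error(y_test_price, y_pred)
rmse = np.sqrt(mse)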

# Print metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
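When the target is fit as `log1p(Price)`, an error metric is only meaningful if both of its arguments are on the same scale. The sketch below uses made-up prices and pretends the model's log-space output is known, then compares the three possible pairings.

```python
from math import expm1, log1p


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up true prices and predictions, for illustration only
prices_true = [10.0, 20.0, 100.0]
prices_pred = [12.0, 18.0, 60.0]

log_true = [log1p(p) for p in prices_true]
log_pred = [log1p(p) for p in prices_pred]  # stands in for model output in log space
back_transformed = [expm1(v) for v in log_pred]

mae_log = mae(log_true, log_pred)             # both on the log scale
mae_price = mae(prices_true, back_transformed)  # both on the price scale
mae_mixed = mae(log_true, back_transformed)   # scales mixed: not interpretable

print("MAE (log scale):", mae_log)
print("MAE (price scale):", mae_price)
print("MAE (mixed scales):", mae_mixed)
```

The log-scale and price-scale values answer different questions (relative versus absolute error), while the mixed value corresponds to neither and should not be reported.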

# Polynomial Regression Model Price Encoded and with Log transformation

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_scaled)
X_poly_test = poly.transform(X_scaled_test)

model = LinearRegression()
model.fit(X_poly, y_train)

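# Predictions, transformed back to the original price scale
y_pred = np.expm1(model.predict(X_poly_test))
y_test_price = np.expm1(y_test)

# R² is computed on the log scale
print('R² score:', model.score(X_poly_test, y_test))

# Calculate error metrics with both arguments on the original price scale
mae = mean_absolute_error(y_test_price, y_pred)
mse = mean_squared_error(y_test_price, y_pred)
rmse = np.sqrt(mse)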

# Print metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)

In [None]:
# Prepare data
X = df_encoded.drop('Price', axis=1).values
y = np.log1p(df_encoded['Price'].values)

# Ridge Regression with CV
ridge_pipe = Pipeline([
   ('scaler', StandardScaler()),
   ('poly', PolynomialFeatures(degree=2)),
   ('ridge', Ridge(alpha=1.0))
])

ridge_scores = cross_val_score(ridge_pipe, X, y, cv=5)
print("Ridge CV scores:", ridge_scores)
print("Ridge mean CV score:", ridge_scores.mean())

# Lasso Regression with CV
lasso_pipe = Pipeline([
   ('scaler', StandardScaler()),
   ('poly', PolynomialFeatures(degree=2)),
   ('lasso', Lasso(alpha=1.0))
])

lasso_scores = cross_val_score(lasso_pipe, X, y, cv=5)
print("\nLasso CV scores:", lasso_scores)
print("Lasso mean CV score:", lasso_scores.mean())

# ElasticNet with CV
elastic_pipe = Pipeline([
   ('scaler', StandardScaler()),
   ('poly', PolynomialFeatures(degree=2)),
   ('elastic', ElasticNet(alpha=1.0, l1_ratio=0.5))
])

elastic_scores = cross_val_score(elastic_pipe, X, y, cv=5)
print("\nElasticNet CV scores:", elastic_scores)
print("ElasticNet mean CV score:", elastic_scores.mean())
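For a regressor and an integer `cv`, scikit-learn's documented default is an unshuffled K-fold split, with each fold scored by the estimator's own `score` method (R² for these models). The following plain-Python sketch shows how such contiguous folds are laid out; `kfold_indices` is a hypothetical helper written for illustration, not the library's implementation.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


for fold, (train, test) in enumerate(kfold_indices(12, n_splits=5)):
    print(f"Fold {fold}: test={test}, train size={len(train)}")
```

Because the folds are contiguous, any ordering in the rows (for example, a file sorted by type or price) carries over into the folds, so shuffling the data beforehand may be worth considering.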

# Polynomial and Linear Regression Model Price Encoded and with Log transformation and Price Categorized

In [None]:
# Convert 'Price' to ordinal bins of 20 pounds each (the divisor used below)
df_encoded['Price'] = np.ceil(df['Price'] / 20).astype(int)

scaler = MinMaxScaler(feature_range=(0, 1))

df_encoded['Price'] = scaler.fit_transform(df_encoded[['Price']])

print(df_encoded['Price'])

# Summary statistics for the binned 'Price': range, number of unique values, mean, standard deviation, and mode

print('Max:', df_encoded['Price'].max())
print('Min:', df_encoded['Price'].min())
print('Unique:', df_encoded['Price'].nunique())
print('Mean:', df_encoded['Price'].mean())
print('Std:', df_encoded['Price'].std())
print('Most common:', df_encoded['Price'].mode().values[0])

X = df_encoded.drop(columns=['Price'])
y = df_encoded['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_scaled)
X_poly_test = poly.transform(X_scaled_test)

model = LinearRegression()
model.fit(X_poly, y_train)

print('R² score:', model.score(X_poly_test, y_test))

# Predict prices (the target here is not log-transformed, so no expm1 is applied)
y_pred = model.predict(X_poly_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
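The binning step in the cell above applies `np.ceil(Price / 20)`, which assigns each price to the bin whose upper edge is the next multiple of 20. A small plain-Python sketch with made-up prices shows where the bin boundaries fall; `price_bin` is a hypothetical helper.

```python
from math import ceil


def price_bin(price, width=20):
    """Return the 1-based bin index; bin k covers (width * (k - 1), width * k]."""
    return ceil(price / width)


# Made-up prices chosen to sit on and around the bin edges
for price in [12.99, 20.0, 20.01, 29.99, 105.0]:
    print(f"{price:>7.2f} -> bin {price_bin(price)}")
```

Note that a price exactly on a boundary, such as 20.00, falls into the lower bin, and that binning discards the within-bin differences the regression would otherwise try to explain.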

In [None]:
# Fit a linear regressor trained with stochastic gradient descent (SGD)
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, learning_rate='optimal')
sgd_reg.fit(X_train, y_train)

# Predictions
y_pred = sgd_reg.predict(X_test)

print(X_train)
print(X_test)

print('R² score:', sgd_reg.score(X_test, y_test))

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)

# Test Polynomial Regression Model Normalized Price with Log Transformation Removing Price Outliers

In [None]:
# Keep only rows with a price of at most 40 (the threshold used in the filter below)

df_copy = df_copy[df_copy['Price'] <= 40]

df_encoded = pd.concat([df_copy[['Price']], type_encoded, grape_encoded], axis=1)
df_encoded = df_encoded.dropna()  # Assign the result back

scaler = MinMaxScaler()
df_encoded['Price'] = scaler.fit_transform(df_encoded[['Price']])

X = df_encoded.drop(columns=['Price'])
y = df_encoded['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_scaled)
X_poly_test = poly.transform(X_scaled_test)

model = LinearRegression()
model.fit(X_poly, y_train)

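# The target in this cell is min-max scaled but not log-transformed,
# so predictions are used as-is rather than passed through expm1
y_pred = model.predict(X_poly_test)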

print('R² score:', model.score(X_poly_test, y_test))

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
