# Pre Data Preprocessing - Task [002]

## Purpose

1. Splitting text components for better granularity.
2. Normalizing numerical values to ensure consistency.
3. Removing unnecessary symbols or irrelevant text.

The preprocessing will address these specific attributes in the dataset:
- **Style**
- **Characteristics**
- **Price**
- **Capacity**
- **ABV (Alcohol by Volume)**
- **Vintage**
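
The three steps above can be illustrated on a few hand-written values. This is a minimal sketch, not the notebook's implementation: the sample strings and the helper names `to_liters`, `to_number` and `split_terms` are hypothetical and are not taken from the dataset.

```python
import re

def to_liters(text):
    """Convert strings like '75CL' or '750 ml' to liters; None if unrecognized."""
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*(CL|ML|LTR|LITRE|L)\s*', str(text).upper())
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    divisor = {'CL': 100, 'ML': 1000}.get(unit, 1)  # liters need no conversion
    return value / divisor

def to_number(text):
    """Strip currency symbols and units, keeping digits and the decimal point."""
    digits = re.sub(r'[^\d.]', '', str(text))
    return float(digits) if digits else None

def split_terms(text, sep):
    """Split on a separator and trim whitespace around each term."""
    return [part.strip() for part in str(text).split(sep) if part.strip()]

print(to_liters('75CL'))                 # centiliters to liters
print(to_number('£9.99 per bottle'))     # symbols and words removed
print(split_terms('Rich & Juicy', '&'))  # one list entry per style
```

Returning `None` for unrecognized input keeps unknown formats visibly missing instead of silently turning them into a number.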

In [149]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
import datetime
import re

# Load the dataset
file_path = '../datasets/WineDataset.csv'
merged_df = pd.read_csv(file_path)

def convert_to_liters(capacity):
    """Convert a capacity string such as '75CL', '750ML' or '1.5LTR' to liters."""
    capacity = str(capacity).strip().upper()
    digits = re.sub(r'[^\d.]', '', capacity)
    if not digits:
        return np.nan  # No numeric part (e.g. a missing value)
    value = float(digits)
    if 'CL' in capacity:  # Centiliters to liters
        return value / 100
    elif 'ML' in capacity:  # Milliliters to liters
        return value / 1000
    elif 'L' in capacity:  # Liters already (covers 'L', 'LTR' and 'LITRE')
        return value
    else:
        return np.nan  # Unknown unit

def preprocess_data(df):

    numeric_cols = ['Price', 'ABV', 'Capacity']

    # Operate on the DataFrame passed in, not on the global merged_df
    df['Capacity'] = df['Capacity'].apply(convert_to_liters)

    if not df.empty:
        for col in numeric_cols:
            if col in df.columns:
                # Remove non-numeric characters and convert to float
                df[col] = df[col].apply(lambda x: re.sub(r'[^\d.]', '', str(x)).strip() if str(x).strip() else np.nan)
                df[col] = pd.to_numeric(df[col], errors='coerce')
                
                if df[col].notnull().any():  # Check if there's valid data for scaling
                    scaler = MinMaxScaler()
                    df[col] = scaler.fit_transform(df[[col]])
                
                df[col] = df[col].round(3)

        # Clean and split the 'Style' column
        if 'Style' in df.columns:
            df['Style'] = (
                df['Style']
                .str.replace(r'[^\w\s&]', '', regex=True)
                .str.split('&')
                .apply(lambda x: [item.strip() for item in x] if isinstance(x, list) else x)  # Clean whitespace
            )

            # This code divides the 'Style' array into several columns, each representing a position in that array
            max_len = df['Style'].apply(lambda x: len(x) if isinstance(x, list) else 0).max()

            for i in range(1, max_len + 1):
                df[f'Style {i}'] = df['Style'].apply(lambda x: x[i-1] if isinstance(x, list) and len(x) >= i else '')

            df = df.drop(columns=['Style'])

        # Clean and split the 'Characteristics' column
        if 'Characteristics' in df.columns:
            df['Characteristics'] = (
                df['Characteristics']
                .str.replace(r'[^\w\s,]', '', regex=True)
                .str.split(',') 
                .apply(lambda x: [item.strip() for item in x] if isinstance(x, list) else x)  # Clean whitespace
            )
            
            # This code divides the 'Characteristics' array into several columns, each representing a position in that array
            max_len = df['Characteristics'].apply(lambda x: len(x) if isinstance(x, list) else 0).max()

            for i in range(1, max_len + 1):
                df[f'Characteristic {i}'] = df['Characteristics'].apply(lambda x: x[i-1] if isinstance(x, list) and len(x) >= i else '')

            df = df.drop(columns=['Characteristics'])
            


    return df

# Preprocess the dataset
df_cleaned = preprocess_data(merged_df)

# Save or display the cleaned dataset
df_cleaned.to_csv('../datasets/cleaned_wines.csv', index=False)
df_cleaned.head()


Unnamed: 0,Title,Description,Price,Capacity,Grape,Secondary Grape Varieties,Closure,Country,Unit,Per bottle / case / each,...,Characteristic 1,Characteristic 2,Characteristic 3,Characteristic 4,Characteristic 5,Characteristic 6,Characteristic 7,Characteristic 8,Characteristic 9,Characteristic 10
0,"The Guv'nor, Spain",We asked some of our most prized winemakers wo...,0.012,0.081,Tempranillo,,Natural Cork,Spain,10.5,per bottle,...,Vanilla,Blackberry,Blackcurrant,,,,,,,
1,Bread & Butter 'Winemaker's Selection' Chardon...,This really does what it says on the tin. It’...,0.026,0.081,Chardonnay,,Natural Cork,USA,10.1,per bottle,...,Vanilla,Almond,Coconut,Green Apple,Peach,Pineapple,Stone Fruit,,,
2,"Oyster Bay Sauvignon Blanc 2022, Marlborough",Oyster Bay has been an award-winning gold-stan...,0.018,0.081,Sauvignon Blanc,,Screwcap,New Zealand,9.8,per bottle,...,Tropical Fruit,Gooseberry,Grapefruit,Grass,Green Apple,Lemon,Stone Fruit,,,
3,Louis Latour Mâcon-Lugny 2021/22,We’ve sold this wine for thirty years – an...,0.031,0.081,Chardonnay,,Natural Cork,France,10.1,per bottle,...,Peach,Apricot,Floral,Lemon,,,,,,
4,Bread & Butter 'Winemaker's Selection' Pinot N...,Bread & Butter is that thing that you can coun...,0.026,0.081,Pinot Noir,,Natural Cork,USA,10.1,per bottle,...,Smoke,Black Cherry,Cedar,Raspberry,Red Fruit,,,,,


# Pre Data Preprocessing - Task [92]

## Purpose

1. Merging `updated_wines.csv`, which holds the mean ratings, into `merged_wine_dataset.csv`, the result of Report 2 (id: 65), so that the ratings from the first dataset are added to the second.
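
A left merge keeps every row of the second dataset and leaves the rating empty where no match exists. The sketch below shows one way to surface such rows with the `indicator` parameter of `DataFrame.merge`; the miniature frames and their values are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
ratings = pd.DataFrame({
    'WineName': ['Chardonnay', 'Pinot Noir'],
    'WineryName': ['Bread & Butter', 'Bread & Butter'],
    'Ratings': [4.1, 4.2],
})
wines = pd.DataFrame({
    'WineName': ['Chardonnay', 'Pinot Noir', 'Gavi'],
    'WineryName': ['Bread & Butter', 'Bread & Butter', 'La Raia'],
    'Price': [15.99, 15.99, 14.99],
})

# Left join keeps every row of `wines`; indicator=True adds a `_merge` column
merged = wines.merge(ratings, on=['WineName', 'WineryName'], how='left', indicator=True)

# Rows marked 'left_only' found no matching rating
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['WineName', 'WineryName']])
```

Checking `_merge` gives the same information as testing `Ratings` for missing values, but it also distinguishes a failed match from a rating that was already missing in the source file.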

In [150]:
import pandas as pd
from ftfy import fix_text

file1 = "../datasets/updated_wines.csv"
file2 = "../datasets/merged_wine_dataset.csv"

df1 = pd.read_csv(file1) 
df2 = pd.read_csv(file2) 

# Merge the datasets based on WineName and WineryName
merged_df = df2.merge(df1[['WineName', 'WineryName', 'Ratings']], on=['WineName', 'WineryName'], how='left')

# Repair mis-encoded text with ftfy in every text (object-dtype) column;
# numeric columns and non-string values are left untouched
for col in merged_df.columns:
    if merged_df[col].dtype == 'object': 
        merged_df[col] = merged_df[col].apply(lambda x: fix_text(x) if isinstance(x, str) else x)

# Save the new dataset
output_file = "../datasets/PLNTD_dataset.csv"
merged_df.to_csv(output_file, index=False)

print(f"PLNTD_dataset created and saved to {output_file}")

missing_ratings = merged_df[merged_df['Ratings'].isna()]

# Sanity check: report any rows left without a rating after the merge
if not missing_ratings.empty:
    print("WARNING: Some rows in the dataset are missing a rating.")
    print(missing_ratings)
else:
    print("SUCCESS: All rows have a rating.")


PLNTD_dataset created and saved to ../datasets/PLNTD_dataset.csv
SUCCESS: All rows have a rating.


### Preprocessing

In [151]:
numeric_cols = ['Price', 'ABV', 'Body']

mapping_acidity = {'Low': 1, 'Medium': 2, 'High': 3}
mapping_body = {'Very light-bodied': 1, 'Light-bodied': 2, 'Medium-bodied': 3, 'Full-bodied': 4, 'Very full-bodied': 5}
merged_df['Acidity'] = merged_df['Acidity'].map(mapping_acidity)
merged_df['Body'] = merged_df['Body'].map(mapping_body)

for col in numeric_cols:
    if col in merged_df.columns:
        # Remove non-numeric characters and convert to float
        merged_df[col] = merged_df[col].apply(lambda x: re.sub(r'[^\d.]', '', str(x)).strip() if str(x).strip() else np.nan)
        merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce')

# Min-max normalize all five criteria to [0, 1]
scaler = MinMaxScaler()
criteria_cols = ['Acidity', 'Body', 'Price', 'ABV', 'Ratings']
merged_df[criteria_cols] = scaler.fit_transform(merged_df[criteria_cols])
# Rescale [0, 1] to the Saaty scale [1, 9]
merged_df[criteria_cols] = (merged_df[criteria_cols] * 8 + 1).round(7)

# Keep only the identifying columns and the five criteria
df = merged_df[['WineName', 'WineryName', 'ABV', 'Price', 'Acidity', 'Body', 'Ratings']]

display(df)

Unnamed: 0,WineName,WineryName,ABV,Price,Acidity,Body,Ratings
0,Chardonnay,Bread & Butter,4.333333,1.395604,9.0,7.0,5.058823
1,Pinot Noir,Bread & Butter,4.333333,1.395604,9.0,5.0,5.176471
2,Rioja Reserva,Marqués de Riscal,5.000000,1.483516,9.0,7.0,5.941177
3,Cabernet Sauvignon,Bread & Butter,4.333333,1.395604,9.0,7.0,5.176471
4,Gavi,La Raia,4.000000,1.351648,9.0,3.0,4.117647
...,...,...,...,...,...,...,...
217,Torre Muga,Muga,5.000000,3.901099,9.0,7.0,7.823529
218,Marsanne-Roussanne,Trizanne Signature Wines,4.000000,1.351648,9.0,5.0,2.647059
219,Vat 1 Sémillon,Tyrrell's,3.000000,2.450550,9.0,5.0,6.294118
220,Sémillon,Tyrrell's,3.000000,2.450550,9.0,5.0,2.823529
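
The rescaling used above maps the smallest value of each column to 1 and the largest to 9. A minimal sketch of that mapping on its own, using NumPy only; the function name `to_saaty_scale`, the sample ABV values, and the handling of a constant column are assumptions made for illustration.

```python
import numpy as np

def to_saaty_scale(values):
    """Min-max normalize to [0, 1], then map linearly onto [1, 9]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # Assumption of this sketch: a constant column maps to 1 everywhere
        return np.ones_like(values)
    normalized = (values - values.min()) / span
    return normalized * 8 + 1

abv = np.array([11.0, 12.5, 13.0, 14.0])  # hypothetical ABV values
print(to_saaty_scale(abv))
```

With scikit-learn, `MinMaxScaler(feature_range=(1, 9))` should produce the same range in a single step, without the separate `x * 8 + 1` conversion.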


# AHP

### Pairwise Matrices

In [152]:
# Creates a pairwise matrix
def create_pairwise_matrix(values, minimize=False):
    values = np.asarray(values, dtype=float)
    # matrix[i][j] = values[i] / values[j]
    matrix = values[:, None] / values[None, :]
    # For criteria to minimize, invert every ratio: matrix[i][j] = values[j] / values[i]
    return matrix.T if minimize else matrix

# Creates a pairwise matrix, normalizes it and calculates the row averages
def create_pairwise_matrix_and_normalize(values, minimize=False):
    # Creating the pairwise matrix
    matrix = create_pairwise_matrix(values, minimize)

    # Matrix normalization: divide each column by its sum
    normalized_matrix = matrix / matrix.sum(axis=0)

    # Row averages calculation (preference vector)
    row_averages = normalized_matrix.mean(axis=1)

    return normalized_matrix, row_averages


columns_to_compare = ['Price', 'ABV', 'Acidity', 'Body', 'Ratings']
minimize_criteria = ['Price']

# Pairwise matrices for each criterion
alternatives_matrices = {}

for column in columns_to_compare:
    # Check whether this criterion should be minimized
    is_minimize = column in minimize_criteria
    normalized_matrix, row_averages = create_pairwise_matrix_and_normalize(
        df[column].values, minimize=is_minimize
    )

    # Convert to DataFrame and add averages as a new column
    multi_index = pd.MultiIndex.from_arrays([df["WineName"], df["WineryName"]], names=["Wine Name", "Winery Name"])
    pairwise_df = pd.DataFrame(normalized_matrix, index=multi_index, columns=multi_index)
    
    pairwise_df["Preference Vector"] = row_averages

    alternatives_matrices[column] = pairwise_df
    

#Create pairwise matrix for the criteria comparison
criteria_values = [1, 1, 1, 1, 1]

criteria_matrix, criteria_averages = create_pairwise_matrix_and_normalize(criteria_values)
criteria_df = pd.DataFrame(criteria_matrix, index=columns_to_compare, columns=columns_to_compare)
criteria_df["Preference Vector"] = criteria_averages
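
Because every pairwise entry is a ratio of two directly assigned values, the resulting preference vector is simply the values divided by their sum (or the reciprocals divided by their sum for a criterion to minimize). The sketch below checks this on three hypothetical scores; `preference_vector` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

def preference_vector(values, minimize=False):
    values = np.asarray(values, dtype=float)
    matrix = values[:, None] / values[None, :]   # a_ij = v_i / v_j
    if minimize:
        matrix = matrix.T                        # a_ij = v_j / v_i
    normalized = matrix / matrix.sum(axis=0)     # each column sums to 1
    return normalized.mean(axis=1)               # row averages

scores = [1.0, 2.0, 4.0]  # hypothetical Saaty-scale values for three wines
print(preference_vector(scores))                 # proportional to the values
print(preference_vector(scores, minimize=True))  # proportional to 1 / value
```

For the maximizing case the result equals `[1/7, 2/7, 4/7]`, and for the minimizing case `[4/7, 2/7, 1/7]`, so the lowest value receives the highest preference.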

### Overall Score

In [153]:
# Build a matrix holding, for each criterion, the preference vector of the alternatives
preference_vectors_matrix = pd.DataFrame({column: alternatives_matrices[column]["Preference Vector"] for column in columns_to_compare})

criteria_weights = criteria_df["Preference Vector"].values
print(criteria_weights)

weighted_matrix = preference_vectors_matrix.multiply(criteria_weights, axis=1)
preference_vectors_matrix["Ranking"] = weighted_matrix.sum(axis=1)

# Sorting according to the ranking
ahp_results_df = preference_vectors_matrix.sort_values(by='Ranking', ascending=False)

display(ahp_results_df.head())

[0.2 0.2 0.2 0.2 0.2]


Wine Name,Winery Name,Price,ABV,Acidity,Body,Ratings,Ranking
Matsu,Matsu,0.006362,0.00493,0.004854,0.006311,0.005824,0.005656
Matsu,Matsu,0.005561,0.00493,0.004854,0.006311,0.005824,0.005496
El Viejo,Matsu,0.00413,0.005258,0.004854,0.006311,0.006716,0.005454
Cabernet Sauvignon,Silver Ghost,0.00776,0.003615,0.002697,0.006311,0.006663,0.005409
El Recio,Matsu,0.005561,0.00493,0.004854,0.006311,0.005247,0.005381
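
The overall score is a weighted sum: each wine's preference under a criterion is multiplied by that criterion's weight and the products are added up. A small sketch with hypothetical numbers (three wines, two criteria), written as a matrix-vector product:

```python
import numpy as np

# Rows: three hypothetical wines; columns: two hypothetical criteria.
# Each column is a preference vector and sums to 1.
preference_vectors = np.array([
    [0.5, 0.2],
    [0.3, 0.3],
    [0.2, 0.5],
])
criteria_weights = np.array([0.75, 0.25])  # hypothetical weights, summing to 1

overall = preference_vectors @ criteria_weights  # one score per wine
ranking = np.argsort(-overall)                   # wine indices from best to worst

print(overall)
print(ranking)
```

When every column and the weight vector each sum to 1, the overall scores also sum to 1, which explains why scores shrink as the number of alternatives grows.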


### Dominance Analysis

In [154]:
min_scores = []
max_scores = []

for index, row in preference_vectors_matrix.iterrows():
    wine_entry_min_score = float('inf')
    wine_entry_max_score = float('-inf')
    
    # Iterate over different weights to calculate the min and max scores for each wine
    for col_idx in range(len(preference_vectors_matrix.columns) - 1):

        # Create a weight vector where this column is 1 and others are 0 so that only this column is considered
        weight_vector = np.zeros(len(preference_vectors_matrix.columns) - 1)
        weight_vector[col_idx] = 1

        # Calculate the weighted score for this alternative
        weighted_score = 0
        for idx in range(len(preference_vectors_matrix.columns) - 1):
            weighted_score += row.iloc[idx] * weight_vector[idx]

        wine_entry_min_score = min(wine_entry_min_score, weighted_score)
        wine_entry_max_score = max(wine_entry_max_score, weighted_score)
    
    # Store the min and max scores for this wine
    min_scores.append(wine_entry_min_score)
    max_scores.append(wine_entry_max_score)

# Compare min and max scores for full dominance
for i, (min_i, max_i) in enumerate(zip(min_scores, max_scores)):
    alternative_name = preference_vectors_matrix.index[i]

    # Check if this alternative is always the best
    is_always_best = all(min_i > max_scores[j] for j in range(len(max_scores)) if j != i)
    
    # Check if this alternative is always the worst
    is_always_worst = all(max_i < min_scores[j] for j in range(len(min_scores)) if j != i)

    # Print results
    if is_always_best:
        print(f"{alternative_name} is always the best, independent of weight assignments.")
    if is_always_worst:
        print(f"{alternative_name} is always the worst, independent of weight assignments.")

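The dominance analysis found no wine that is always the best or always the worst, regardless of the weight assigned to each criterion: no wine's lowest single-criterion score exceeds every other wine's highest score, and no wine's highest score falls below every other wine's lowest.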
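
The check works because, for any non-negative weights that sum to 1, a wine's weighted score lies between its lowest and highest single-criterion score. Comparing those bounds is therefore a sufficient (though not necessary) test for full dominance. The sketch below restates the test with NumPy on a hypothetical score matrix; `full_dominance` is an illustrative name.

```python
import numpy as np

def full_dominance(scores):
    """Return (always_best, always_worst) index lists for a score matrix.

    scores[i, k] is the preference of alternative i under criterion k.
    """
    scores = np.asarray(scores, dtype=float)
    row_min = scores.min(axis=1)  # worst case for each alternative
    row_max = scores.max(axis=1)  # best case for each alternative
    always_best, always_worst = [], []
    for i in range(len(scores)):
        others_max = np.delete(row_max, i)
        others_min = np.delete(row_min, i)
        if np.all(row_min[i] > others_max):
            always_best.append(i)
        if np.all(row_max[i] < others_min):
            always_worst.append(i)
    return always_best, always_worst

# Hypothetical scores: alternative 0 leads and alternative 2 trails on every criterion
example = [
    [0.6, 0.7],
    [0.3, 0.2],
    [0.1, 0.1],
]
print(full_dominance(example))
```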

### Consistency Analysis

In [155]:
not_normalized_criteria_matrix = create_pairwise_matrix(criteria_values)
not_normalized_criteria_df = pd.DataFrame(not_normalized_criteria_matrix, index=columns_to_compare, columns=columns_to_compare)

# Weighted sum of each row: (pairwise matrix) x (criteria weights)
sums_array = []
for i in range(len(not_normalized_criteria_df.columns)):
    weighted_sum = 0  # Named to avoid shadowing the built-in sum()
    for j in range(len(not_normalized_criteria_df.columns)):
        weighted_sum += not_normalized_criteria_df.iloc[i, j] * criteria_weights[j]
    sums_array.append(weighted_sum)

print("Sums array: ", sums_array)

for i in range(len(sums_array)):
    sums_array[i] = sums_array[i] / criteria_weights[i]

print("Consistency Vector: ", sums_array)

print("Average of the sums: ", np.mean(sums_array))

#Calculate the Consistency Index
consistency_index = (np.mean(sums_array) - len(not_normalized_criteria_df.columns)) / (len(not_normalized_criteria_df.columns) - 1)
print("Consistency Index: ", consistency_index)

#Get the Random Index
random_index = [0, 0, 0.58, 0.9, 1.12, 1.24, 1.32, 1.41, 1.45, 1.49] #RI values table
random_index = random_index[len(not_normalized_criteria_df.columns) - 1]
print("Random Index: ", random_index)

#Calculate the Consistency Ratio
consistency_ratio = consistency_index / random_index

print("Consistency Ratio: ", consistency_ratio)
if(consistency_ratio < 0.1):
    print("Consistency Ratio is below 0.1, the matrix is consistent")
else:
    print("Consistency Ratio is above 0.1, the matrix is inconsistent")


Sums array:  [1.0, 1.0, 1.0, 1.0, 1.0]
Consistency Vector:  [5.0, 5.0, 5.0, 5.0, 5.0]
Average of the sums:  5.0
Consistency Index:  0.0
Random Index:  1.12
Consistency Ratio:  0.0
Consistency Ratio is below 0.1, the matrix is consistent


The consistency ratio is 0, which indicates perfect consistency. This is expected: the criteria matrix is built from ratios of directly assigned values (`values[i] / values[j]`), so every entry satisfies a_ij · a_jk = a_ik by construction. Assigning one value per criterion on the Saaty scale (1 to 9) avoids asking users to compare every pair of criteria by hand. With five criteria, that would mean 10 pairwise judgments instead of 5 values, doubling the input required, which would be impractical for daily users. Structuring the process this way keeps the system accessible while adhering to AHP’s principles.
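
To see the difference between a matrix derived from one value per criterion and one filled in by direct pairwise judgments, the consistency calculation can be wrapped in a small function. This is a sketch under assumptions: the function name `consistency_ratio`, the three sample values and the judged matrix are hypothetical, and the random index list repeats the RI table used in the cell above.

```python
import numpy as np

# Random index values, indexed by matrix size n - 1
RANDOM_INDEX = [0, 0, 0.58, 0.9, 1.12, 1.24, 1.32, 1.41, 1.45, 1.49]

def consistency_ratio(matrix):
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.shape[0]
    weights = (matrix / matrix.sum(axis=0)).mean(axis=1)  # preference vector
    lambda_max = np.mean((matrix @ weights) / weights)    # mean of the consistency vector
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n - 1]

# Built from one value per criterion: consistent by construction
values = np.array([1.0, 3.0, 5.0])
from_values = values[:, None] / values[None, :]

# Hypothetical direct judgments: a_12 * a_23 = 3, but a_13 is given as 9
judged = np.array([
    [1.0, 3.0, 9.0],
    [1 / 3, 1.0, 1.0],
    [1 / 9, 1.0, 1.0],
])

print(consistency_ratio(from_values))
print(consistency_ratio(judged))
```

The first ratio is zero up to floating-point error, while the second comes out above the 0.1 threshold, which is the situation the value-based construction rules out.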