In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
path = "/content/drive/MyDrive/Datasets for CMPE 255/wine_quality.csv"
df = pd.read_csv(path)

The **Knowledge Discovery in Databases (KDD)** process typically consists of the following steps:

1. **Data Selection**: Choose the relevant data from a data source.

2. **Data Preprocessing**: Cleanse and preprocess the raw data. This can include handling missing values, removing outliers, and converting data types.

3. **Data Transformation**: Convert the cleansed data into a format suitable for mining. This can involve normalization, discretization, or one-hot encoding.

4. **Data Mining**: Use statistical and machine learning algorithms to discover patterns and knowledge from the transformed data.

5. **Evaluation and Interpretation**: Evaluate the mined patterns and interpret the results.

6. **Knowledge Discovery and Use**: Apply the knowledge extracted in the previous step to the specific application or domain, presenting it in a digestible form such as tables, reports, or visualizations. This final step drives decision-making for that application.

Let's walk through each of these steps using a dataset. For this exercise, I'll use a "Wine Quality" dataset which contains various physicochemical properties of wines, and a target variable indicating quality.

# **Step 1: Data Selection**

`The wine quality dataset contains information on various wines, their chemical properties, and their quality.`

`LOAD THE DATASET.`

In [3]:
import pandas as pd
path = "/content/drive/MyDrive/Datasets for CMPE 255/wine_quality.csv"
df = pd.read_csv(path)

# **Step 2: Data Preprocessing**

`The dataset is already loaded, so we'll examine its structure first and then move on to preprocessing.`

`Let's take a look at the first few rows.`



In [4]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,pH,sulphates,alcohol,quality
0,7.9,0.6,0.06,1.6,0.069,3.18,0.46,9.4,5
1,11.2,0.28,0.56,1.9,0.075,3.16,0.58,9.8,6
2,7.3,0.65,0.0,1.2,0.065,3.3,0.47,10.0,7
3,7.4,0.7,0.0,1.9,0.076,3.51,0.56,9.4,5
4,7.9,0.6,0.06,1.6,0.069,3.18,0.46,9.4,5


**Here's the structure of the dataset:**

fixed acidity: Amount of tartaric acid in wine (g/dm^3)

volatile acidity: Amount of acetic acid in wine (g/dm^3). High levels can lead to an unpleasant, vinegar taste.

citric acid: Found in small quantities, citric acid can add 'freshness' and flavor to wines (g/dm^3).

residual sugar: Amount of sugar remaining after fermentation stops (g/dm^3).

chlorides: Amount of salt in the wine (g/dm^3).

pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4.

sulphates: A wine additive that can contribute to sulfur dioxide (SO2) levels (g/dm^3).

alcohol: Alcohol content (% volume).

quality: Score between 0 and 10.

`Next, we'll check for missing values and carry out other preprocessing tasks.`

In [5]:
# Checking for missing values in the dataset
missing_values = df.isnull().sum()
missing_values

fixed acidity       0
volatile acidity    0
citric acid         0
residual sugar      0
chlorides           0
pH                  0
sulphates           0
alcohol             0
quality             0
dtype: int64
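
Missing values are only one of the checks listed under preprocessing. In the rows displayed earlier, rows 0 and 4 carry identical values, so duplicate rows may also be worth checking. As a minimal sketch, the hypothetical helper `preprocessing_report` below (not part of the original notebook) counts missing values and duplicate rows and lists column dtypes. The toy DataFrame is a stand-in for `df`.

```python
import pandas as pd

def preprocessing_report(frame: pd.DataFrame) -> dict:
    """Summarize basic data-quality checks (hypothetical helper, not from the notebook)."""
    return {
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": {col: str(dtype) for col, dtype in frame.dtypes.items()},
    }

# Toy data standing in for df: the first and last rows are identical.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4],
    "quality": [5, 6, 5],
})
report = preprocessing_report(toy)
print(report)
```

Whether duplicates should be dropped depends on whether they are genuine repeated measurements or data-entry repeats, so the sketch only reports them.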

# **Step 3: Data Transformation**

Typically in this step, we might:

  1. Normalize or standardize features if they have different scales.

  2. One-hot encode categorical features.

  3. Create new features based on the existing ones.

Given the nature of this dataset, we'll standardize the features since all of them are numerical and on different scales. This step is especially crucial if we plan on using algorithms that are sensitive to feature scales, such as k-means clustering or support vector machines.

`Let's standardize the features`.

In [6]:
from sklearn.preprocessing import StandardScaler

# Excluding target variable 'quality' for standardization
features = df.drop(columns=['quality'])
standardized_features = StandardScaler().fit_transform(features)

# Convert the standardized features back to a dataframe for better visualization
standardized_df = pd.DataFrame(standardized_features, columns=features.columns)
standardized_df.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,pH,sulphates,alcohol
0,-0.006824,-0.237403,-0.217707,-0.559917,-0.830327,-1.212714,-1.357101,-0.872384
1,3.125778,-2.400475,2.571847,-0.332511,-0.207077,-1.36553,-0.098975,0.290338
2,-0.576388,0.100577,-0.552453,-0.863125,-1.245827,-0.295817,-1.252257,0.871699
3,-0.481461,0.438558,-0.552453,-0.332511,-0.103202,1.308753,-0.308662,-0.872384
4,-0.006824,-0.237403,-0.217707,-0.559917,-0.830327,-1.212714,-1.357101,-0.872384


`The features have now been standardized, which means each one has a mean of 0 and a standard deviation of 1.`
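
To make this concrete, here is a small sketch of what standardization computes, using NumPy on made-up values rather than the wine data. Each column has its mean subtracted and is then divided by its standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np

# Made-up values for two columns on very different scales (not from the dataset).
X = np.array([
    [7.4, 0.076],
    [7.8, 0.098],
    [11.2, 0.075],
    [7.9, 0.069],
])

# z = (x - column mean) / column standard deviation (population, ddof=0)
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
Z = (X - col_mean) / col_std

print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column
```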

# **Step 4: Data Mining**

Given that our target variable, "quality," is numeric, this is a regression problem. We can use various regression models to predict wine quality based on the features. For simplicity, let's use a linear regression model to see how well we can predict wine quality based on the given physicochemical properties.

I'll split the data into training and testing sets, train a linear regression model on the training set, and evaluate its performance on the test set.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(standardized_features, df['quality'], test_size=0.3, random_state=42)

# Training a linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Making predictions
y_pred = lr.predict(X_test)

`This fits the model to the training data and then uses it to make predictions on the test data`.
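
One caveat with the cell above: the scaler was fitted on all rows before the split, so test-set statistics influence the training features. A common alternative, sketched here on synthetic data (the arrays and the choice of 8 columns are stand-ins, not the wine data), is to wrap the scaler and the model in a scikit-learn `Pipeline` so that the scaling parameters are learned from the training split only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 feature columns and the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# fit() learns the scaling statistics from X_train only,
# then predict() reuses them to transform X_test.
model = Pipeline([
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred.shape)
```

For plain linear regression the predictions are usually unaffected by this change, but the habit matters for scale-sensitive models and for cross-validation.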

# **Step 5: Evaluation and Interpretation**

After mining, it's crucial to evaluate and interpret the results to understand the model's performance and derive insights.

The root mean squared error (RMSE) provides a quantitative measure of the model's performance. To understand the model qualitatively, we can also look at the regression coefficients to see which features are most influential in predicting wine quality.

`Evaluate the model's performance`:

In [10]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5

rmse

3.888619978213844e-14

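The RMSE indicates roughly how far our predictions deviate from the actual wine quality scores, expressed in the same units as the score. The value here, about 3.9e-14, is zero to within floating-point precision, meaning the model reproduces the test-set scores almost exactly. A fit this perfect is unusual for real measurements and is worth double-checking, for example in case the target in this file is an exact linear function of the features.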
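
Since RMSE is in the same units as the quality score, it is easiest to judge against a naive baseline. The sketch below, on made-up numbers rather than the notebook's predictions, computes RMSE by hand and compares it with the RMSE of always predicting the mean score. The function name `rmse` is illustrative.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up scores for illustration only.
y_true = np.array([5, 6, 7, 5, 6])
y_hat = np.array([5.5, 6.0, 6.0, 5.0, 6.5])

model_rmse = rmse(y_true, y_hat)
# Baseline: predict the mean of the observed scores for every wine.
baseline_rmse = rmse(y_true, np.full(len(y_true), y_true.mean()))

print(model_rmse, baseline_rmse)
```

A model is only informative to the extent that its RMSE falls below the baseline's.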

`Interpret the model`:

In [11]:
coefficients = pd.DataFrame({
    'Feature': features.columns,
    'Coefficient': lr.coef_
}).sort_values(by='Coefficient', ascending=False)

coefficients

Unnamed: 0,Feature,Coefficient
3,residual sugar,8.938219
0,fixed acidity,8.08893
7,alcohol,3.931479
4,chlorides,3.507323
5,pH,2.491853
1,volatile acidity,-3.853254
6,sulphates,-7.797169
2,citric acid,-11.210901
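
Because the model was trained on standardized features, each coefficient is the change in predicted quality per one standard deviation of that feature. As a sketch on synthetic data (names such as `raw_coef` are illustrative, and `np.linalg.lstsq` is used here in place of `LinearRegression`), dividing a standardized coefficient by the feature's standard deviation gives the change per original unit, which can be easier to communicate.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on different scales (stand-ins, not the wine columns).
X = rng.normal(loc=[10.0, 0.5], scale=[2.0, 0.1], size=(100, 2))
true_raw_coef = np.array([0.3, -4.0])
y = 5.0 + X @ true_raw_coef  # noiseless, so the fit is exact

# Standardize, then fit ordinary least squares with an intercept column.
mean, std = X.mean(axis=0), X.std(axis=0)
Z = (X - mean) / std
design = np.column_stack([np.ones(len(Z)), Z])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
std_coef = solution[1:]  # change in y per one standard deviation

# Convert to change in y per original unit of each feature.
raw_coef = std_coef / std
print(std_coef, raw_coef)
```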


# **Step 6: Knowledge Discovery**

***Positive Coefficients***:

  **Residual Sugar**: The highest positive coefficient, 8.938219, suggests that for each standardized unit (one standard deviation) increase in residual sugar, the wine quality increases by approximately 8.938219 units, all else being equal. This might indicate that wines with higher residual sugar are perceived as higher quality in this dataset.
  
  **Fixed Acidity**: Similarly, for each standardized unit increase in fixed acidity, wine quality increases by about 8.088930 units.
  
  **Alcohol**: An increase in the standardized value of alcohol content is associated with a rise in wine quality by approximately 3.931479 units.
  
  **Chlorides**: An increase in the standardized value of chlorides increases wine quality by about 3.507323 units.
  
  **pH**: A standardized unit increase in pH corresponds to a 2.491853 unit increase in wine quality.



***Negative Coefficients***:

  
  **Volatile Acidity**: For each standardized unit increase in volatile acidity, wine quality decreases by about 3.853254 units. This suggests that higher volatile acidity might negatively impact the perception of wine quality.
  
  **Sulphates**: Similarly, an increase in the standardized value of sulphates decreases wine quality by approximately 7.797169 units.
  
  **Citric Acid**: This has the largest negative coefficient. For each standardized unit increase in citric acid, wine quality drops by around 11.210901 units. This suggests that wines with higher citric acid content are perceived as lower quality in this dataset.
  

***Conclusions***:

  
  **Sweetness Preference**: The positive coefficient for residual sugar might indicate a preference for sweeter wines in this dataset. However, this can vary based on regions, types of wine, or the demographics of the tasters.

  
  **Acidity Balance**: While fixed acidity positively influences wine quality, volatile acidity has a negative effect. This balance between different types of acidity can be crucial in wine production. Too much volatile acidity can give an unpleasant, vinegar taste.

  
  **Alcohol Content**: The positive coefficient for alcohol might suggest that wines with higher alcohol content are preferred or are associated with higher quality. This is often seen in many wine evaluations, where higher alcohol can contribute to the body and mouthfeel of the wine.

  
  **Citric Acid Caution**: Given the substantial negative coefficient for citric acid, winemakers might be cautious about the citric acid levels in their wines, as it seems to negatively influence the perceived quality in this dataset.

It's essential to note that these interpretations are based on this specific dataset and model. The actual impact of these features on wine quality can be more complex and influenced by many other factors not captured in this dataset. Additionally, correlation does not imply causation, so while these features might be associated with wine quality, changing them won't necessarily result in the expected change in quality.