
Question 1: What libraries would you need to import for data manipulation, preprocessing (scaling and encoding), splitting data, and building a pipeline to streamline transformations in this project?


Question 2: How would you load the Wine Quality dataset from a URL into a pandas DataFrame? What delimiter would you use, and how could you display the first few rows to inspect the data structure?

Question 3: How would you check for missing values in the dataset and remove any duplicates? Additionally, how would you filter the dataset to ensure pH values are within a reasonable range (e.g., between 2.5 and 4.5)?



Question 4: How would you preprocess the numeric features of the dataset by creating a pipeline that applies StandardScaler to standardize them? What steps would you take to apply this transformation and convert the preprocessed data back to a pandas DataFrame?

Question 5: How would you split the dataset into training and testing sets, ensuring that 20% of the data is used for testing and 80% for training? How can you ensure the split is reproducible?

Question 6: After splitting the data, how would you check and display the shape of the training and testing sets to verify that the data was correctly split?

Question 1: What libraries would you need to import for data manipulation, preprocessing (scaling and encoding), splitting data, and building a pipeline to streamline transformations in this project?

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

Question 2: How would you load the Wine Quality dataset from a URL into a pandas DataFrame? What delimiter would you use, and how could you display the first few rows to inspect the data structure?

In [25]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, delimiter=';')
print("First few rows of the dataset:")
display(data.head())

First few rows of the dataset:


,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
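
To see why the `;` delimiter matters without depending on network access, here is a minimal sketch that parses a small semicolon-separated sample from an in-memory string. The `sample_csv` text is a hypothetical excerpt built from a few of the values shown above, not the real file, and the variable names are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the same semicolon-separated layout as the file.
sample_csv = (
    "fixed acidity;volatile acidity;pH;quality\n"
    "7.4;0.7;3.51;5\n"
    "7.8;0.88;3.2;5\n"
)

# With the correct delimiter, each field becomes its own column.
sample = pd.read_csv(io.StringIO(sample_csv), delimiter=";")

# With the default comma delimiter, each line collapses into a single column.
wrong = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)  # (2, 4)
print(wrong.shape)   # (2, 1)
```

A single-column result like `wrong` is usually the first sign that the delimiter does not match the file.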


Question 3: How would you check for missing values in the dataset and remove any duplicates? Additionally, how would you filter the dataset to ensure pH values are within a reasonable range (e.g., between 2.5 and 4.5)?

In [26]:
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

Missing values in each column:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


In [27]:
data['pH'].max()

4.01

In [28]:
data['pH'].min()

2.74

In [29]:
print(f"Data shape before removing duplicates: {data.shape}")
data = data.drop_duplicates()
print(f"Data shape after removing duplicates: {data.shape}")

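# Keep only rows whose pH lies in [2.5, 4]. This upper bound is stricter than
# the 4.5 suggested in the question, so rows with pH above 4 (the maximum is 4.01) are dropped.
data = data[(data['pH'] >= 2.5) & (data['pH'] <= 4)]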
print(f"Data shape after filtering pH values: {data.shape}")

Data shape before removing duplicates: (1599, 12)
Data shape after removing duplicates: (1359, 12)
Data shape after filtering pH values: (1357, 12)
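
The same cleaning steps can be tried on a small hypothetical frame where the effect of each step is easy to verify by hand. The `toy` data is invented for illustration, and the 2.5 to 4.5 bounds are the example range from the question rather than the bounds used in the cell above. `Series.between` includes both endpoints by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value, one exact duplicate row,
# and one pH value outside the 2.5 to 4.5 range.
toy = pd.DataFrame(
    {
        "pH": [3.51, 3.20, 3.20, 4.80, np.nan],
        "alcohol": [9.4, 9.8, 9.8, 10.1, 9.5],
    }
)

# Count missing values per column.
missing = toy.isnull().sum()

# Drop rows with missing values, then exact duplicate rows.
cleaned = toy.dropna().drop_duplicates()

# Keep rows whose pH lies in [2.5, 4.5]; both bounds are inclusive.
cleaned = cleaned[cleaned["pH"].between(2.5, 4.5)]

print(missing["pH"])  # 1
print(cleaned.shape)  # (2, 2)
```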


Question 4: How would you preprocess the numeric features of the dataset by creating a pipeline that applies StandardScaler to standardize them? What steps would you take to apply this transformation and convert the preprocessed data back to a pandas DataFrame?

In [34]:
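# Select every numeric column. This includes the target column `quality`,
# so the target is standardized along with the features.
numeric_features = data.select_dtypes(include=["float64", "int64"]).columns

# Pipeline with a single step that standardizes each column to zero mean and unit variance.
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

# Apply the numeric pipeline to the selected columns.
preprocessor = ColumnTransformer(
    transformers=[("num", numeric_transformer, numeric_features)]
)

# Fit the scaler on the data and transform it; the result is a NumPy array.
data_preprocessed = preprocessor.fit_transform(data)

# Wrap the array in a DataFrame with the original column names.
# The original row index is not carried over; rows are renumbered from 0.
data_preprocessed_df = pd.DataFrame(data_preprocessed, columns=numeric_features)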

print("First few rows of the preprocessed data:")
display(data_preprocessed_df.head())

First few rows of the preprocessed data:


,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,-0.527939,0.93392,-1.396283,-0.462586,-0.247014,-0.468194,-0.383804,0.582419,1.317546,-0.578798,-0.953427,-0.755926
1,-0.297191,1.917957,-1.396283,0.05522,0.198674,0.871376,0.603592,0.045611,-0.712013,0.124151,-0.582901,-0.755926
2,-0.297191,1.261933,-1.1915,-0.166697,0.077123,-0.085459,0.214618,0.152972,-0.319195,-0.051586,-0.582901,-0.755926
3,1.664161,-1.362165,1.470681,-0.462586,-0.267273,0.105908,0.394145,0.68978,-0.973891,-0.46164,-0.582901,0.458028
4,-0.527939,0.715245,-1.396283,-0.536559,-0.267273,-0.276827,-0.204277,0.582419,1.317546,-0.578798,-0.953427,-0.755926
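
One variation worth considering, sketched here on hypothetical data: scale only the feature columns (leaving the `quality` target untouched), fit the scaler on the training rows alone so that no test-set statistics influence it, and pass the original index when rebuilding the DataFrame. The `toy` frame and all variable names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the wine data: two features plus the target.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "pH": rng.normal(3.3, 0.15, size=50),
        "alcohol": rng.normal(10.4, 1.0, size=50),
        "quality": rng.integers(3, 9, size=50),
    }
)

# Separate features from the target before any scaling.
X = toy.drop(columns="quality")
y = toy["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

feature_cols = list(X.columns)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("scaler", StandardScaler())]), feature_cols)
    ]
)

# Fit on the training rows only, then reuse the fitted scaler on the test rows.
X_train_scaled = pd.DataFrame(
    preprocessor.fit_transform(X_train), columns=feature_cols, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    preprocessor.transform(X_test), columns=feature_cols, index=X_test.index
)

# Training columns should have mean close to 0 and population std close to 1.
print(X_train_scaled.mean().round(6))
print(X_train_scaled.std(ddof=0).round(6))
```

The test columns will generally not have exactly zero mean or unit variance, which is expected because they are scaled with the training statistics.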


Question 5: How would you split the dataset into training and testing sets, ensuring that 20% of the data is used for testing and 80% for training? How can you ensure the split is reproducible?

In [31]:
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Fixing random_state (here 42) makes the shuffle deterministic,
# so the same train/test split is reproduced on every run.
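
A minimal sketch of what reproducibility means in practice, using a hypothetical 10-row frame: two independent calls with the same `random_state` select exactly the same rows. The `toy` frame and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 10 rows; the index makes the selected rows easy to compare.
toy = pd.DataFrame({"x": np.arange(10)})

# Two independent calls with the same seed.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

# Both the membership and the order of the rows match.
same_split = test_a.index.equals(test_b.index) and train_a.index.equals(train_b.index)

print(same_split)  # True
```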

Question 6: After splitting the data, how would you check and display the shape of the training and testing sets to verify that the data was correctly split?

In [32]:
train_df.shape

(1085, 12)

In [33]:
test_df.shape

(272, 12)
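
The shapes can also be checked against the arithmetic. With a float `test_size`, scikit-learn rounds the test count up, so 1357 rows give ceil(0.2 * 1357) = 272 test rows and 1357 - 272 = 1085 training rows, matching the output above. The sketch below verifies this on a hypothetical array of 1357 placeholder rows; `rows` and the other names are assumptions for illustration.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1357  # row count after cleaning, as reported above
rows = np.arange(n_rows)  # hypothetical placeholder for the real rows

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

expected_test = math.ceil(0.2 * n_rows)  # 272
expected_train = n_rows - expected_test  # 1085

# Every row lands in exactly one of the two sets.
total = len(train_rows) + len(test_rows)

print(len(train_rows), len(test_rows))  # 1085 272
print(total == n_rows)                  # True
```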