## Wine Quality Dataset - Data Preprocessing ##

In this tutorial, we'll walk through the data preprocessing steps for the Wine Quality Dataset. The goal is to clean and prepare the data by handling missing values, detecting outliers, and performing any necessary transformations to get the data ready for further analysis.

**Step 1: Importing Libraries and Loading the Data**

We begin by importing the necessary libraries: pandas for data manipulation, scipy.stats for statistical analysis (Z-scores for outlier detection), and scikit-learn for feature scaling and train/test splitting.

Next, we load the Wine Quality dataset. It is stored in a CSV file that uses a semicolon (;) as the delimiter, so we specify that when loading the data.

In [13]:
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load your dataset
df = pd.read_csv('/home/shahzaib/Downloads/Assignment(4) (2)/Assignment/winequality.csv', delimiter=';')
df.head()




| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
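
As a small illustration of the delimiter handling described above, the sketch below parses a semicolon-separated sample from memory instead of from disk. The two-row sample and the names `sample_csv` and `sample` are assumptions made for this example; they are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout
sample_csv = io.StringIO(
    "fixed acidity;volatile acidity;pH;quality\n"
    "7.4;0.70;3.51;5\n"
    "7.8;0.88;3.20;5\n"
)

# sep=";" does the same job as delimiter=";" in pd.read_csv
sample = pd.read_csv(sample_csv, sep=";")

# Quick sanity checks after loading: shape, column names, dtypes
print(sample.shape)
print(sample.columns.tolist())
print(sample.dtypes)
```

If the delimiter were left at its default (a comma), each line would likely be read as a single column, so checking the shape right after loading is a cheap safeguard.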


**Step 2: Checking for Missing Values**

It’s essential to check for any missing values before analysis, as these can affect the outcome of your models.

In [12]:
print(df.isnull().sum())

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
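
This dataset has no missing values, so nothing needs to be fixed here. For cases where the check does report gaps, the following is a minimal sketch of two common options, shown on a hypothetical frame named `toy` (an assumption for illustration, not data from the wine file).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the wine data above has none
toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26, 3.16],
    "alcohol": [9.4, 9.8, np.nan, 9.8],
})

# Count of missing values per column
print(toy.isnull().sum())

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill each column's gaps with that column's median
filled = toy.fillna(toy.median())

print(dropped)
print(filled)
```

Dropping rows is simplest but discards data; median filling keeps every row at the cost of introducing estimated values. Which one is appropriate depends on how much data is missing and why.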


**Step 3: Detecting Outliers Using Z-Score**

Outliers can distort statistical results, so it’s important to identify them. Here, we use the Z-score method to detect outliers.

- `stats.zscore(df)`: calculates the Z-score for each value in the dataset, which indicates how many standard deviations a value lies from its column mean.
- Threshold of 3: values with a Z-score greater than 3 or less than -3 are typically considered outliers, meaning they are far from the average value.

In [5]:
# Use Z-scores to detect outliers
z_scores = stats.zscore(df)
# Count values more than 3 standard deviations from the mean, in either direction
outliers = (abs(z_scores) > 3).sum(axis=0)
print(z_scores)
print(outliers)

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0         -0.528360          0.961877    -1.391472       -0.453218  -0.243707   
1         -0.298547          1.967442    -1.391472        0.043416   0.223875   
2         -0.298547          1.297065    -1.186070       -0.169427   0.096353   
3          1.654856         -1.384443     1.484154       -0.453218  -0.264960   
4         -0.528360          0.961877    -1.391472       -0.453218  -0.243707   
...             ...               ...          ...             ...        ...   
1594      -1.217796          0.403229    -0.980669       -0.382271   0.053845   
1595      -1.390155          0.123905    -0.877968       -0.240375  -0.541259   
1596      -1.160343         -0.099554    -0.723916       -0.169427  -0.243707   
1597      -1.390155          0.654620    -0.775267       -0.382271  -0.264960   
1598      -1.332702         -1.216849     1.021999        0.752894  -0.434990   

      free sulfur dioxide  
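
To make the Z-score rule concrete, here is a minimal sketch on a hypothetical single-column frame (`toy`, with 29 typical values and one extreme value, invented for this example). It computes the score by hand, compares it with `scipy.stats.zscore`, and then removes the flagged row. Whether outliers should be removed at all is a separate judgment call that this sketch does not settle.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical column: 29 typical values and one extreme value
toy = pd.DataFrame({"chlorides": [0.08] * 29 + [0.60]})

# Z-score by hand: (x - mean) / std, using the population std (ddof=0),
# which is also the default used by scipy.stats.zscore
col = toy["chlorides"]
manual_z = (col - col.mean()) / col.std(ddof=0)
scipy_z = stats.zscore(col)

# Flag values more than 3 standard deviations from the mean, in either direction
is_outlier = np.abs(scipy_z) > 3

# Keep only the rows that were not flagged
cleaned = toy[~is_outlier]
print(int(is_outlier.sum()), len(cleaned))
```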

**Step 4: Data Transformation**

Many algorithms work better when all features are on a similar scale. Here the raw ranges differ widely: total sulfur dioxide takes values in the tens, while chlorides stays below 0.1 in the rows shown above.
StandardScaler: rescales each feature to have a mean of 0 and a standard deviation of 1. This matters for algorithms that are sensitive to feature scale, such as distance-based or gradient-based methods.

In [6]:

# Scale the features using StandardScaler
X = df.drop('quality', axis=1)
y = df['quality']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(scaler, X_scaled)


StandardScaler() [[-0.52835961  0.96187667 -1.39147228 ...  1.28864292 -0.57920652
  -0.96024611]
 [-0.29854743  1.96744245 -1.39147228 ... -0.7199333   0.1289504
  -0.58477711]
 [-0.29854743  1.29706527 -1.18607043 ... -0.33117661 -0.04808883
  -0.58477711]
 ...
 [-1.1603431  -0.09955388 -0.72391627 ...  0.70550789  0.54204194
   0.54162988]
 [-1.39015528  0.65462046 -0.77526673 ...  1.6773996   0.30598963
  -0.20930812]
 [-1.33270223 -1.21684919  1.02199944 ...  0.51112954  0.01092425
   0.54162988]]
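
The following sketch shows what standardization does numerically, using a hypothetical four-row matrix `X_toy` (values chosen for illustration). It checks that `StandardScaler` matches the by-hand formula and that the scaled columns have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature matrix on very different scales
X_toy = np.array([[3.51, 34.0],
                  [3.20, 67.0],
                  [3.26, 54.0],
                  [3.16, 60.0]])

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)

# Same result by hand: subtract the column mean, divide by the population std
by_hand = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_toy_scaled.std(axis=0))   # approximately 1 for each column

# The fitted scaler can also undo the transformation
X_restored = scaler.inverse_transform(X_toy_scaled)
```

When a model is evaluated on held-out data, a common practice is to fit the scaler on the training rows only and reuse it to transform the test rows, so that test-set statistics do not influence preprocessing.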


**Step 5: Binning the Quality Variable**

Binning converts numeric values into categories by dividing their range into intervals. This can make the data easier to analyze and visualize.

In this step, we will bin the quality column of the Wine Quality dataset into three categories: 'low', 'medium', and 'high'. With the bin edges [2, 5, 7, 9] and include_lowest=True, scores from 2 to 5 become 'low', scores of 6 and 7 become 'medium', and scores of 8 and 9 become 'high'.

In [7]:
bins = [2, 5, 7, 9]
labels = ['low', 'medium', 'high']
df['quality_binned'] = pd.cut(df['quality'], bins=bins, labels=labels, include_lowest=True)
print(df['quality_binned'])

0          low
1          low
2          low
3       medium
4          low
         ...  
1594       low
1595    medium
1596    medium
1597       low
1598    medium
Name: quality_binned, Length: 1599, dtype: category
Categories (3, object): ['low' < 'medium' < 'high']
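
To see exactly where the bin edges fall, this sketch applies the same `pd.cut` call to a handful of hypothetical scores (`scores` and `out_of_range` are names invented for this example). Intervals are closed on the right by default, so 5 lands in 'low' and 7 lands in 'medium'.

```python
import pandas as pd

# Hypothetical quality scores sitting on and around the bin edges
scores = pd.Series([3, 5, 6, 7, 8])

binned = pd.cut(scores, bins=[2, 5, 7, 9],
                labels=['low', 'medium', 'high'], include_lowest=True)

print(binned.tolist())
print(binned.value_counts(sort=False))

# Values outside the outermost edges are not assigned to any bin (NaN)
out_of_range = pd.cut(pd.Series([1, 10]), bins=[2, 5, 7, 9],
                      labels=['low', 'medium', 'high'], include_lowest=True)
print(out_of_range.isna().tolist())
```

Checking for NaN after binning is a quick way to confirm that the chosen edges cover every value in the column.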


**Step 6: Creating Interaction Features**

Interaction features are created by combining two or more features to capture relationships that might not be apparent when analyzing each feature individually. In this step, we will create a new feature by multiplying the pH and density columns of the Wine Quality dataset. This interaction feature may help to uncover patterns or improve model performance.

In [9]:
df['pH_density_interaction'] = df['pH'] * df['density']
print(df['pH_density_interaction'])

0       3.502278
1       3.189760
2       3.250220
3       3.153680
4       3.502278
          ...   
1594    3.432405
1595    3.502822
1596    3.405431
1597    3.553828
1598    3.374711
Name: pH_density_interaction, Length: 1599, dtype: float64
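
For more than one or two interactions, writing each product by hand becomes tedious. One possible alternative, sketched here on a hypothetical three-row frame `toy`, is scikit-learn's `PolynomialFeatures` with `interaction_only=True`, which generates all pairwise products. This is an illustration of the idea, not the approach used in the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical sample with the two columns used in the interaction
toy = pd.DataFrame({"pH": [3.51, 3.20, 3.26],
                    "density": [0.9978, 0.9968, 0.9970]})

# Manual interaction, as in the step above
manual = toy["pH"] * toy["density"]

# Automatic pairwise interactions: output columns are pH, density, pH * density
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
expanded = np.asarray(poly.fit_transform(toy))

print(manual.round(6).tolist())
print(expanded.shape)
```

With many columns the number of pairwise products grows quickly, so it is usually worth limiting this to features where an interaction is plausible.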


**Step 7: Splitting the Dataset into Training and Testing Sets**

In machine learning, it’s important to evaluate the performance of your model on unseen data. To achieve this, we split the dataset into two subsets: a training set, used to train the model, and a testing set, used to evaluate its performance.

In this step, we will split the Wine Quality dataset into features (X) and target (y), and then create training and testing sets.
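
As a minimal sketch of the mechanics, the example below splits a hypothetical 10-row frame (`toy`, invented for illustration) with the same `test_size` and `random_state` settings. It also drops a column derived from the target before splitting, on the assumption that such columns should not be used as features because they reveal the answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame with a target and a column derived from it
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 10.5, 9.5, 10.0, 9.5, 10.5],
    "quality": [5, 5, 5, 6, 5, 7, 5, 7, 7, 5],
})
toy["quality_binned"] = pd.cut(toy["quality"], bins=[2, 5, 7, 9],
                               labels=["low", "medium", "high"],
                               include_lowest=True)

# Drop the target and anything computed from it so the features cannot leak it
X_toy = toy.drop(columns=["quality", "quality_binned"])
y_toy = toy["quality"]

# 80% of rows go to training, 20% to testing; random_state makes it repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```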

In [10]:

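# Note: df now also contains 'quality_binned' and 'pH_density_interaction'.
# 'quality_binned' is derived from the target, so drop it before training a model.
X = df.drop('quality', axis=1)
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.head())
print(y_train.head())
print(X_test.head())
print(y_test.head())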

     fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
493            8.7             0.690         0.31             3.0      0.086   
354            6.1             0.210         0.40             1.4      0.066   
342           10.9             0.390         0.47             1.8      0.118   
834            8.8             0.685         0.26             1.6      0.088   
705            8.4             1.035         0.15             6.0      0.073   

     free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
493                 23.0                  81.0  1.00020  3.48       0.74   
354                 40.5                 165.0  0.99120  3.25       0.59   
342                  6.0                  14.0  0.99820  3.30       0.75   
834                 16.0                  23.0  0.99694  3.32       0.47   
705                 11.0                  54.0  0.99900  3.37       0.49   

     alcohol quality_binned  pH_density_interaction  
493     