### Question 3 - Normalization

### **Preprocessing Steps**
a. **Data Cleaning**:  
   - Removed the `Id` column as it was unnecessary for analysis and modeling.

b. **Handling Missing Values**:  
   - Checked for missing values and confirmed that the dataset is complete with no missing entries.

c. **Dimensionality Reduction**:  
   - Initially considered PCA for dimensionality reduction but decided to postpone it to **question 15** to avoid losing interpretability of the original features.

d. **Data Balancing**:  
   - Observed that the dataset is imbalanced, with far fewer samples at the extreme quality levels (e.g., quality 3, 4, and 8).  
   - Addressed this imbalance in **question 14** by oversampling minority classes to improve model performance.

e. **Normalization**:  
   - Applied **Z-score normalization** to standardize the features, so that every variable has mean 0 and standard deviation 1 and all features are on the same scale. This suits our scenario because it preserves the shape of each feature's distribution and is less sensitive to outliers than min-max scaling, although it does not remove them.
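
As a minimal sketch of what Z-score normalization does, the snippet below standardizes a small made-up feature matrix by hand and compares the result with `StandardScaler`. The array `X_toy` and its values are illustrative assumptions, not rows from the wine dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values, two features on very different scales)
X_toy = np.array([
    [7.4, 34.0],
    [7.8, 67.0],
    [11.2, 60.0],
    [7.9, 40.0],
])

# Manual z-score: subtract the column mean, divide by the column standard deviation
# (population standard deviation, ddof=0, which is what StandardScaler uses)
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(X_toy)

# Both approaches should agree, and each column should have mean 0 and std 1
print("manual == sklearn:", np.allclose(z_manual, z_sklearn))
print("column means:", z_manual.mean(axis=0))
print("column stds: ", z_manual.std(axis=0))
```

Because the transformation is affine per feature, the relative spacing of the values within each column is unchanged; only the location and scale differ.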



### Question 4 - Train/test split
The data is split 80/20: 80% for training and 20% for testing.
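
The sketch below illustrates an 80/20 split on synthetic data. It is a variation rather than the notebook's exact procedure: `stratify` (which keeps class proportions equal in both parts) and fitting the scaler on the training part only are assumptions added for illustration, and `X_demo` / `y_demo` are made-up arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 3 features, two equally frequent classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([5, 6], 50)

# 80% training / 20% testing; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training part only, then apply it to both parts,
# so no statistics from the test set influence the preprocessing
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("train shape:", X_tr.shape)
print("test shape: ", X_te.shape)
```
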
### Question 5 - Logistic Regression model (multi-class)
### Question 6 - Evaluate the model

```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Load the dataset
df = pd.read_csv("WineQT.csv")

# Step 2: Drop the 'Id' column (not needed)
df.drop(columns=["Id"], inplace=True)

# Step 3: Separate features (X) and target (y)
X = df.drop(columns=["quality"])  # Features
y = df["quality"]  # Target variable (discrete classes)

# Step 4: Perform Z-score normalization (standardization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert the scaled features back to a DataFrame
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Step 5: Split the data into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.2, random_state=42)

# Step 6: Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Step 7: Train the model
clf.fit(X_train, y_train)

# Step 8: Make predictions on the test set
y_pred = clf.predict(X_test)

# Step 9: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

```
Accuracy: 0.5283842794759825

Classification Report:
               precision    recall  f1-score   support

           3       0.00      0.00      0.00         0
           4       0.14      0.17      0.15         6
           5       0.64      0.58      0.61        96
           6       0.52      0.52      0.52        99
           7       0.42      0.50      0.46        26
           8       0.00      0.00      0.00         2

    accuracy                           0.53       229
   macro avg       0.29      0.29      0.29       229
weighted avg       0.54      0.53      0.53       229


Confusion Matrix:
 [[ 0  0  0  0  0  0]
 [ 0  1  2  2  1  0]
 [ 3  4 56 31  2  0]
 [ 2  2 30 51 14  0]
 [ 0  0  0 13 13  0]
 [ 0  0  0  1  1  0]]
```




### Classifier Output:

#### 1. **Precision, Recall, and F1-Score**:
These metrics evaluate the classification performance per class (wine quality from 3 to 8).
- **Precision**: The ratio of true positive predictions to all positive predictions made by the model.
- **Recall**: The ratio of true positive predictions to all actual positive instances.
- **F1-Score**: The harmonic mean of Precision and Recall, balancing both metrics.
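
To make these definitions concrete, the sketch below recomputes the per-class metrics directly from the confusion matrix printed in the output (rows are actual classes, columns are predicted classes). The helper `safe_divide` is a hypothetical name introduced here; it returns 0 where a denominator is 0, mirroring the 0.00 entries in the report.

```python
import numpy as np

# Confusion matrix copied from the output; rows = actual, columns = predicted
labels = [3, 4, 5, 6, 7, 8]
cm = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 2, 2, 1, 0],
    [3, 4, 56, 31, 2, 0],
    [2, 2, 30, 51, 14, 0],
    [0, 0, 0, 13, 13, 0],
    [0, 0, 0, 1, 1, 0],
])

def safe_divide(num, den):
    """Element-wise num / den, returning 0 where den is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)

tp = np.diag(cm)                              # correct predictions per class
precision = safe_divide(tp, cm.sum(axis=0))   # column sums = predicted counts
recall = safe_divide(tp, cm.sum(axis=1))      # row sums = actual counts (support)
f1 = safe_divide(2 * precision * recall, precision + recall)

for lab, p, r, f in zip(labels, precision, recall, f1):
    print(f"class {lab}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For class 5, for instance, 56 of the 88 samples predicted as 5 are correct (precision 0.64), and 56 of the 96 actual class-5 samples are found (recall 0.58).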

#### 2. **Support**:
The support indicates the number of occurrences of each class in the test set.
- For example, there are 6 instances of class 4, 96 of class 5, and none of class 3.

#### 3. **Accuracy**:
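This is the overall correctness of the model. It is calculated as the number of correct predictions (the diagonal of the confusion matrix) divided by the total number of predictions:

$$\text{Accuracy} = \frac{\text{correct predictions}}{\text{total predictions}} = \frac{1 + 56 + 51 + 13}{229} = \frac{121}{229} \approx 0.528$$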

#### 4. **Macro Average**:
- **Macro Average Precision, Recall, F1-Score**: The unweighted mean of each metric across all classes. Every class counts equally, regardless of its frequency in the dataset.

#### 5. **Weighted Average**:
- **Weighted Average Precision, Recall, F1-Score**: The mean of each metric weighted by support (class frequency), so classes with more instances contribute more.
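
The difference between the two averages can be checked with a few lines of NumPy. This sketch uses the rounded per-class precision and support values from the classification report, so the results agree with the report only to two decimals.

```python
import numpy as np

# Per-class precision and support copied from the classification report
# (classes 3, 4, 5, 6, 7, 8)
precision = np.array([0.00, 0.14, 0.64, 0.52, 0.42, 0.00])
support = np.array([0, 6, 96, 99, 26, 2])

# Macro average: plain mean, every class counts once
macro_precision = precision.mean()

# Weighted average: each class counts in proportion to its support
weighted_precision = np.average(precision, weights=support)

print(f"macro avg precision:    {macro_precision:.2f}")
print(f"weighted avg precision: {weighted_precision:.2f}")
```

The macro average is pulled down by the classes with precision 0.00, while the weighted average is dominated by classes 5 and 6, which together make up 195 of the 229 test samples.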


### Question 6: Model Evaluation

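- **Accuracy**: The model achieved an accuracy of **52.83%**, meaning it correctly classified about 53% of the test samples (121 of 229).
- **Precision, Recall, and F1-Score**: The model performs poorly on the rare classes: class 4 has a precision of 0.14 and a recall of 0.17, and class 8 has 0 precision and recall. Class 3 has no samples in the test set. For classes 5 and 6, it performs better, with precision and recall between 0.52 and 0.64. Class 7 reaches a recall of 0.50 but a precision of only 0.42, so many samples predicted as class 7 actually belong to class 6.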
- **Macro Average**: The macro average precision, recall, and F1-score are low (around 0.29), indicating that the model struggles with balancing performance across all classes.
- **Weighted Average**: The weighted averages are higher (around 0.54 for precision and 0.53 for F1-score), reflecting better performance for the more frequent classes (class 5 and 6).

In summary, the model's performance is not strong, especially for rare classes (4 and 8), but it performs reasonably well for more frequent ones.