# Day 02 - Project 1

## Wine Quality Prediction using Random Forest Regressor

### Step 1: Import Required Libraries
We import all necessary libraries for data handling, modeling, evaluation, and visualization.

In [None]:
# Numpy and Pandas for data manipulation
import numpy as np
import pandas as pd

# Matplotlib and Seaborn for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for model building and evaluation
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

### Step 2: Load the Dataset
We load the red wine quality dataset, a CSV file that uses semicolons rather than commas as separators.

In [None]:
# Load the dataset (the CSV must be in the notebook's working directory)
df = pd.read_csv('winequality-red.csv', sep=';')
# Display first few rows to understand structure
df.head()
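
To see why `sep=';'` matters, here is a minimal sketch that parses a tiny in-memory sample twice, once with the default separator and once with a semicolon. The three rows are made-up values that only mimic the file's layout; they are not taken from the dataset.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they mimic the semicolon-separated layout.
sample = io.StringIO(
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.70;5\n"
    "7.8;0.88;5\n"
    "11.2;0.28;6\n"
)

# With the default comma separator, every row lands in a single column.
wrong = pd.read_csv(sample)
print(wrong.shape)

# Rewind the buffer and parse again with the semicolon separator.
sample.seek(0)
right = pd.read_csv(sample, sep=';')
print(right.shape)
print(list(right.columns))
```

If `df.head()` ever shows one wide column whose name contains semicolons, the separator is the first thing to check.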

### Step 3: Explore the Dataset
Check data types, missing values, summary statistics, and the distribution of the target variable.

In [None]:
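# Display column dtypes and non-null counts (info() prints directly)
df.info()

# Statistical summary of numerical features
# (wrapped in print() because a notebook cell only displays its last expression)
print(df.describe())

# Count missing values in each column
print(df.isnull().sum())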

In [None]:
# Plot the distribution of wine quality scores
sns.countplot(x='quality', data=df)
plt.title('Distribution of Wine Quality')
plt.xlabel('Quality Score')
plt.ylabel('Frequency')
plt.show()
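
The count plot has a numeric counterpart that can be easier to read when some scores are rare. The sketch below uses a short hypothetical `quality` Series in place of `df['quality']`; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical quality scores standing in for df['quality'].
quality = pd.Series([5, 6, 5, 7, 5, 6, 4, 6, 5, 8], name='quality')

# Number of wines per score, ordered by score rather than by frequency.
counts = quality.value_counts().sort_index()

# Share of each score, useful for spotting imbalance between scores.
shares = (counts / counts.sum()).round(2)

print(pd.DataFrame({'count': counts, 'share': shares}))
```

On the real data, the same two lines applied to `df['quality']` would show how many wines fall into each score.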

### Step 4: Prepare Features and Target Variable
We separate the input features (X) and the target (y).

In [None]:
# X = all columns except 'quality'
X = df.drop('quality', axis=1)
# y = the target column we want to predict
y = df['quality']

### Step 5: Train-Test Split
Split the data so the model trains on one part and is evaluated on held-out rows it has never seen.

In [None]:
# 80% training and 20% testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Confirm shapes of the split
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
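
The following sketch illustrates what `test_size=0.2` and `random_state=42` do, using a synthetic 100-row frame instead of the wine data. The names `X_demo`, `y_demo`, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine features and target (100 rows, 3 features).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
y_demo = pd.Series(rng.integers(3, 9, size=100), name='quality')

# Same arguments as in the split above: 20% held out, fixed seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# A fixed random_state returns the same rows on every call.
same_rows = X_te.index.equals(split_b[1].index)
print(same_rows)

# Train and test rows never overlap.
overlap = set(X_tr.index) & set(X_te.index)
print(overlap)
```

Changing `random_state` would produce a different but equally valid split, so metrics can shift slightly from seed to seed.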

### Step 6: Train the Random Forest Regressor
Fit the model on the training data. With default settings, scikit-learn builds 100 decision trees; `random_state=42` makes the result reproducible.

In [None]:
# Create a Random Forest Regressor model
model = RandomForestRegressor(random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)
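
A random forest regressor predicts by averaging the predictions of its individual trees. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and the target formula are assumptions made for the demo, and `n_estimators=10` is chosen only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the wine features; the target is noisy and non-linear.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = 5 + 2 * X_demo[:, 0] ** 2 - X_demo[:, 1] + rng.normal(0, 0.1, size=200)

# n_estimators=10 keeps the demo small; the default is 100 trees.
forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_; average their predictions by hand.
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The hand-computed average matches the forest's own prediction.
matches = np.allclose(manual_mean, forest.predict(X_demo[:5]))
print(matches)
```

Averaging many trees, each grown on a bootstrap sample, is what smooths out the high variance of a single deep tree.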

### Step 7: Make Predictions
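Use the trained model to predict wine quality on the test set. Because this is a regressor, the predictions are continuous values (for example 5.4), not integer scores.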

In [None]:
# Predict on test data
y_pred = model.predict(X_test)
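
Since quality scores are integers and a regressor outputs continuous values, one optional way to compare them directly is to round the predictions. This is a sketch of that idea, not part of the workflow above; the arrays are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical regressor outputs and true scores, for illustration only.
y_pred_demo = np.array([5.2, 5.8, 6.4, 4.9, 7.4])
y_true_demo = np.array([5, 6, 7, 5, 7])

# Round to the nearest integer and keep the result inside the 0 to 10 scoring scale.
y_pred_int = np.clip(np.rint(y_pred_demo), 0, 10).astype(int)

# Fraction of wines whose rounded prediction equals the true score.
exact_match = (y_pred_int == y_true_demo).mean()

print(y_pred_int)
print(exact_match)
```

Rounding discards information, so the regression metrics in the next step should still be computed on the unrounded `y_pred`.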

### Step 8: Evaluate the Model
Measure the model's performance using MSE, RMSE, and R² score. Lower MSE and RMSE are better; an R² closer to 1 is better.

In [None]:
# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
# Root Mean Squared Error
rmse = np.sqrt(mse)
# R-squared Score
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f'MSE: {mse:.4f}')
print(f'RMSE: {rmse:.4f}')
print(f'R² Score: {r2:.4f}')
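
To make the three metrics concrete, the sketch below computes them by hand with NumPy on four hypothetical predictions, and adds a mean-only baseline to show what an R² of 0 means. All values and the `_demo` names are invented for illustration.

```python
import numpy as np

# Hypothetical true scores and predictions, for illustration only.
y_true_demo = np.array([5.0, 6.0, 5.0, 7.0])
y_pred_demo = np.array([5.2, 5.8, 5.4, 6.6])

residuals = y_true_demo - y_pred_demo

# MSE: mean of squared residuals. RMSE: its square root, in quality-score units.
mse_demo = np.mean(residuals ** 2)
rmse_demo = np.sqrt(mse_demo)

# R²: 1 minus the ratio of residual variation to variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

# Baseline: always predicting the mean of y gives an R² of 0.
baseline_pred = np.full_like(y_true_demo, y_true_demo.mean())
r2_baseline = 1 - np.sum((y_true_demo - baseline_pred) ** 2) / ss_tot

print(f'MSE: {mse_demo:.4f}  RMSE: {rmse_demo:.4f}  R²: {r2_demo:.4f}')
print(f'Baseline R²: {r2_baseline:.4f}')
```

A model is only useful if its R² is clearly above that baseline and its RMSE is small relative to the spread of the scores.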

### Step 9: Feature Importance
See which features the model relied on most. `feature_importances_` holds impurity-based importances, which are computed from the training data and sum to 1.

In [None]:
# Get importance of each feature from the model
importances = model.feature_importances_
feature_names = X.columns

# Create a DataFrame for visualization
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importances')
plt.show()
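
Impurity-based importances are computed from the training data. A common cross-check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below compares the two on synthetic data in which only the first column drives the target; the data and the choice of `n_estimators=50` and `n_repeats=10` are assumptions made for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first column drives the target; the rest is noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances, derived from the training data; they sum to 1.
impurity = forest.feature_importances_

# Permutation importance: drop in held-out R² when one column is shuffled.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

for i in range(X_demo.shape[1]):
    print(
        f'feature {i}: impurity={impurity[i]:.3f}, '
        f'permutation={perm.importances_mean[i]:.3f}'
    )
```

When the two rankings agree, as they should here for the first column, the importance plot is more trustworthy; when they disagree, the held-out permutation result is usually the safer guide.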

### ✅ Final Notes
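- Random Forest Regressors can capture non-linear relationships and feature interactions without feature scaling.
- RMSE is expressed in the same units as the quality score and summarizes the typical prediction error, weighting large errors more heavily than small ones.
- Feature importances show which inputs the fitted model relied on; they describe the model, not cause and effect in the wine itself.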
- Always explore and understand your data before training models.