# Ans 1.
'''
The wine quality dataset typically contains features such as:

Fixed Acidity: Determines the wine's sharpness or crispness. Important for flavor balance.
Volatile Acidity: High levels lead to an unpleasant vinegar taste, so it negatively affects quality.
Citric Acid: Contributes freshness and enhances flavor.
Residual Sugar: Adds sweetness; very low or very high values may affect consumer preference.
Chlorides: Reflects salinity; higher levels usually degrade quality.
Free Sulfur Dioxide: Prevents microbial growth and oxidation; impacts wine preservation.
Total Sulfur Dioxide: Overly high levels can affect taste and safety.
Density: Indicates sugar and alcohol content, influencing body and mouthfeel.
pH: Measures acidity; affects microbial stability and flavor.
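Sulphates: An additive that can contribute to sulfur dioxide levels, acting as an antimicrobial and antioxidant; supports wine stability.
Alcohol: Tends to be positively correlated with the quality score in this dataset, likely through its effect on body and flavor.
Wine Quality Score (Target): A sensory score assigned by wine tasters (on a 0 to 10 scale), serving as the target variable.'''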

# Ans 2.
'''
Techniques Used:

Mean/Median/Mode Imputation: Replaces missing values with the mean (numerical, roughly symmetric distributions), median (numerical, skewed distributions), or mode (categorical).
Advantages: Simple, fast, and retains dataset size.
Disadvantages: Shrinks the variance of the imputed feature and ignores relationships between features, which can bias results.

K-Nearest Neighbors (KNN) Imputation: Fills each missing value from the most similar rows (nearest neighbors) based on the observed features.
Advantages: Preserves feature relationships.
Disadvantages: Computationally expensive for large datasets and sensitive to feature scaling.

Multiple Imputation by Chained Equations (MICE): Iteratively models each feature with missing values as a regression on the other features.
Advantages: Handles complex relationships well.
Disadvantages: Computationally intensive.

Dropping Rows/Columns: Removes rows or features with missing data.
Advantages: Simple; no imputed (artificial) values are introduced.
Disadvantages: Risk of losing valuable information, and of bias if the data are not missing completely at random.'''
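
# The techniques above can be sketched with scikit-learn's imputers. This is a minimal
# illustration on a hypothetical toy DataFrame (not the wine data). IterativeImputer is
# used here as a MICE-style stand-in: it is inspired by MICE but returns a single
# imputation rather than multiple ones, so treat it as an approximation.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Hypothetical toy data with missing values (values are illustrative only)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.5, 11.2, 12.0],
    "pH": [3.51, 3.20, 3.26, np.nan, 3.40, 3.30],
    "density": [0.9978, 0.9968, 0.9970, 0.9960, np.nan, 0.9940],
})

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
    "mice_like": IterativeImputer(max_iter=10, random_state=0),
}

# Fit each imputer and keep the result as a DataFrame with the original column names
imputed = {
    name: pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
    for name, imputer in imputers.items()
}

# Dropping incomplete rows is the remaining option; here it discards half of the toy data
dropped = toy.dropna()

for name, frame in imputed.items():
    print(name, "-> remaining NaNs:", int(frame.isna().sum().sum()))
print("rows kept after dropna:", len(dropped), "of", len(toy))
```

# On real data, KNNImputer is usually applied after scaling, since it relies on distances.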


# Ans 3.
'''
Key Factors:

Socio-economic Status (SES): Parents' education and income.
Study Habits: Regular study patterns and time management.
School Resources: Access to quality teaching and infrastructure.
Mental and Physical Health: Stress levels, sleep quality, and nutrition.
Learning Environment: Peer influence and classroom engagement.

Statistical Analysis:

Use correlation analysis to identify which factors are associated with performance.
Perform regression modeling to estimate the impact of each predictor while controlling for the others.
Conduct factor analysis to reduce dimensionality and group related variables.'''
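
# A minimal sketch of the correlation and regression steps described above. The data is
# synthetic and every column name (study_hours, sleep_hours, parent_education_years,
# exam_score) is a hypothetical stand-in, since no student dataset is given here.
# Factor analysis is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic, hypothetical student data (column names are illustrative only)
students = pd.DataFrame({
    "study_hours": rng.uniform(0, 20, n),
    "sleep_hours": rng.uniform(4, 10, n),
    "parent_education_years": rng.integers(8, 21, n),
})
# The score is simulated so that study hours dominate by construction
students["exam_score"] = (
    40
    + 2.0 * students["study_hours"]
    + 1.0 * students["sleep_hours"]
    + 0.5 * students["parent_education_years"]
    + rng.normal(0, 5, n)
)

# Correlation analysis: linear association of each predictor with the score
correlations = students.corr()["exam_score"].drop("exam_score")
print(correlations.sort_values(ascending=False))

# Regression modeling: estimated effect of each predictor, holding the others fixed
predictors = students.drop(columns="exam_score")
model = LinearRegression().fit(predictors, students["exam_score"])
coefficients = pd.Series(model.coef_, index=predictors.columns)
print(coefficients)
print("R^2:", model.score(predictors, students["exam_score"]))
```

# Correlations describe pairwise association only; the regression coefficients are the
# better guide to each factor's contribution once the others are accounted for.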

# Ans 4.
'''
Selection:

Identify important features such as parental education, study hours, and test preparation.
Use feature importance scores from models like Random Forest.

Transformation:

Normalize numerical variables for algorithms sensitive to scale.
Encode categorical variables (e.g., one-hot encoding for gender, test preparation).
Create interaction terms, such as combining study hours and parental support.

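Handling Missing Data:

Impute missing values (e.g., median for numerical variables, mode for categorical ones) rather than dropping rows where possible.

Outlier Detection:

Identify and treat outliers using methods like the IQR rule (flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR).'''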
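
# The feature engineering steps above can be sketched as follows. The records, column
# names, and the choice to clip outliers (rather than drop them) are assumptions made
# for illustration, not part of any particular student dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical student records (names and values are illustrative only)
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "test_prep": ["none", "completed", "none", "completed", "none", "none"],
    "study_hours": [5.0, 7.0, 6.0, 8.0, 6.5, 40.0],  # 40.0 is a deliberate outlier
    "parental_support": [1, 3, 2, 3, 2, 1],  # ordinal scale, 1 (low) to 3 (high)
})

# Outlier detection with the IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["study_hours"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["study_hours_outlier"] = ~df["study_hours"].between(lower, upper)

# One possible treatment: clip values to the IQR fences instead of dropping rows
df["study_hours"] = df["study_hours"].clip(lower, upper)

# Interaction term combining study hours and parental support
df["study_x_support"] = df["study_hours"] * df["parental_support"]

# One-hot encode the categorical variables
df = pd.get_dummies(df, columns=["gender", "test_prep"])

# Standardize numeric variables for scale-sensitive algorithms
numeric_cols = ["study_hours", "parental_support", "study_x_support"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df.head())
```

# In a modeling pipeline the scaler would be fit on the training split only, to avoid
# leaking information from the test data.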

# Ans 5 and 6.

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.stats import boxcox

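# Load the wine quality dataset
# Assumes a comma-separated CSV file named 'winequality.csv'; modify the path as necessary.
# Note: the original UCI files (winequality-red.csv, winequality-white.csv) are
# semicolon-delimited, so pass sep=';' when loading those.
data = pd.read_csv('winequality.csv')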

"""
Q5: Perform Exploratory Data Analysis (EDA)
- Identify the distribution of each feature
- Highlight features exhibiting non-normality
"""

# Display basic statistics
print(data.describe())

# Plot histograms for feature distributions
data.hist(bins=20, figsize=(12, 10), grid=False)
plt.tight_layout()
plt.show()

# Identify non-normal features using skewness (numeric columns only)
skewed_features = data.skew(numeric_only=True).sort_values(ascending=False)
print("\nSkewness by feature:\n", skewed_features)

# Example: apply a log1p transformation to strongly right-skewed features (skewness > 1)
for feature in skewed_features[skewed_features > 1].index:
    # log1p(x) = log(1 + x) is safe for non-negative values, including zeros
    if (data[feature] >= 0).all():
        data[feature] = np.log1p(data[feature])
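
# A quick way to check whether a transformation actually helps is to compare skewness
# before and after. The sketch below uses a synthetic right-skewed sample (an assumption,
# standing in for a skewed wine feature) and also tries Box-Cox, which is an alternative
# to the log transform rather than part of the analysis above.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(42)
# Synthetic right-skewed, strictly positive sample (stand-in for a skewed feature)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_transformed = np.log1p(sample)
# boxcox requires strictly positive input; it returns the transformed data and the fitted lambda
boxcox_transformed, fitted_lambda = boxcox(sample)

print("skew before:      ", skew(sample))
print("skew after log1p: ", skew(log_transformed))
print("skew after boxcox:", skew(boxcox_transformed))
print("fitted lambda:    ", fitted_lambda)
```

# Skewness closer to 0 indicates a more symmetric distribution; it does not by itself
# establish normality.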

"""
Q6: Perform Principal Component Analysis (PCA)
- Reduce the number of features while preserving most of the variance
"""

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data.drop(columns=['quality']))  # Exclude target variable

# Apply PCA
pca = PCA(n_components=0.95)  # Keep the fewest components that explain at least 95% of the variance
principal_components = pca.fit_transform(scaled_data)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
print("\nExplained Variance Ratio by Components:\n", explained_variance)

# Plot cumulative variance explained (x-axis starts at 1 component)
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_variance) + 1), np.cumsum(explained_variance), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Explained Variance')
plt.grid()
plt.show()
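
# To interpret a fitted PCA, it can help to look at how many components were kept and at
# the loadings (how each original feature contributes to each component). The sketch
# below uses synthetic data with hypothetical feature names, and the names X, pca_demo,
# and loadings are introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: five observed features driven by two latent factors plus small noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = pd.DataFrame(
    latent @ mixing + 0.05 * rng.normal(size=(300, 5)),
    columns=[f"feature_{i}" for i in range(5)],  # hypothetical feature names
)

X_scaled = StandardScaler().fit_transform(X)
pca_demo = PCA(n_components=0.95).fit(X_scaled)

# Number of components actually kept to reach at least 95% explained variance
print("components kept:", pca_demo.n_components_)
print("cumulative variance:", np.cumsum(pca_demo.explained_variance_ratio_))

# Loadings: rows are components, columns are the original features
loadings = pd.DataFrame(
    pca_demo.components_,
    columns=X.columns,
    index=[f"PC{i + 1}" for i in range(pca_demo.n_components_)],
)
print(loadings.round(2))
```

# Features with large absolute loadings on a component are the ones that component
# mostly summarizes.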

"""
Summary:
1. Strongly right-skewed features (skewness > 1) were identified and log-transformed.
2. Dimensionality was reduced using PCA, keeping enough components to explain at least 95% of the variance.
"""
