# Exploratory Data Analysis - 1

## Q1. What are the key features of the wine quality data set? Discuss the importance of each feature in predicting the quality of wine.

**Answer:**

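The data set contains 11 physicochemical features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The target is `quality`, a sensory score. Each feature can influence quality: high volatile acidity tends to lower it because excess acetic acid gives a vinegar-like taste, while alcohol and sulphates are often positively associated with quality. The importance of each feature should be confirmed on the data itself, for example by checking its correlation with `quality`.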
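
One way to check these claims is to rank features by their correlation with the target. The sketch below is illustrative only: the data is synthetic (generated so that alcohol helps and volatile acidity hurts), and `rank_features_by_correlation` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def rank_features_by_correlation(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in for the wine data (assumed effect sizes, not real measurements)
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
ph = rng.normal(3.3, 0.15, n)  # unrelated to quality in this toy example
quality = (
    5.5
    + 0.5 * (alcohol - 10.4)
    - 2.0 * (volatile_acidity - 0.5)
    + rng.normal(0, 0.5, n)
)

demo = pd.DataFrame(
    {
        "alcohol": alcohol,
        "volatile acidity": volatile_acidity,
        "pH": ph,
        "quality": quality,
    }
)

ranking = rank_features_by_correlation(demo)
print(ranking)
```

On the real data, replace `demo` with the loaded wine DataFrame. Correlation only captures linear, pairwise association, so it is a first look rather than a full measure of importance.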

## Q2. How did you handle missing data in the wine quality data set during the feature engineering process? Discuss the advantages and disadvantages of different imputation techniques.

**Answer:**

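First check how much data is missing, e.g., with `wine.isnull().sum()`. Common options are dropping rows, mean/median imputation, mode imputation (for categorical features), and KNN imputation. Mean/median imputation is simple and fast but shrinks variance and can distort distributions and correlations. KNN imputation preserves relationships between features but is computationally expensive and sensitive to feature scaling. Dropping rows is the simplest option but reduces the sample size and can bias results when values are not missing completely at random.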
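
The trade-offs can be seen side by side on a small example. This is a minimal sketch on a made-up toy table (the values are illustrative, not taken from the data set), using scikit-learn's `SimpleImputer` and `KNNImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data with two missing values (illustrative numbers only)
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, np.nan, 11.2, 12.5, 10.0],
        "density": [0.9978, 0.9968, 0.9970, np.nan, 0.9940, 0.9965],
    }
)

# Option 1: drop every row that has a missing value
dropped = toy.dropna()

# Option 2: fill each column with its own median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy), columns=toy.columns
)

# Option 3: fill from the 2 nearest rows (in practice, scale features first)
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)

print("Rows kept after dropping:", len(dropped))
print(median_filled)
print(knn_filled)
```

Dropping loses two of six rows here, while both imputers keep all six; the median fill ignores the other column entirely, whereas KNN uses it to pick similar rows.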

## Q3. What are the key factors that affect students' performance in exams? How would you go about analyzing these factors using statistical techniques?

**Answer:**

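Key factors include study time, attendance, parental education, previous grades, socio-economic status, and health. To analyze them, use correlation to measure the strength of pairwise relationships, regression to estimate each factor's effect while controlling for the others, and hypothesis tests (e.g., t-tests or ANOVA for group differences) to check which predictors are statistically significant.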
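
As a rough illustration of the correlation and regression steps, the sketch below uses `scipy.stats` on synthetic data. The variable names and the assumed effect sizes (3 points per study hour, 1 point lost per absence) are invented for the example.

```python
import numpy as np
from scipy import stats

# Synthetic student data (assumed relationships, for illustration only)
rng = np.random.default_rng(42)
n = 200
study_hours = rng.uniform(0, 10, n)
absences = rng.poisson(4, n)
score = 50 + 3.0 * study_hours - 1.0 * absences + rng.normal(0, 8, n)

# Correlation with a significance test (H0: no linear association)
r, p_value = stats.pearsonr(study_hours, score)
print(f"Pearson r = {r:.2f}, p-value = {p_value:.3g}")

# Simple linear regression of score on study hours
fit = stats.linregress(study_hours, score)
print(f"Estimated points per study hour: {fit.slope:.2f} (p = {fit.pvalue:.3g})")
```

With several predictors, a multiple regression would be the natural next step so that each effect is estimated while holding the others fixed.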

## Q4. Describe the process of feature engineering in the context of the student performance data set. How did you select and transform the variables for your model?

**Answer:**

The process is: identify relevant features (e.g., study time, absences); encode categorical variables (e.g., gender, school) as numbers; scale numerical features so they are comparable; create new features (e.g., average grade) from existing ones; and remove irrelevant or redundant variables, such as the raw columns a derived feature replaces.
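
These steps can be sketched on a tiny table. The column names (`school`, `sex`, `studytime`, `absences`, `G1`, `G2`) and values are assumptions made for the example, and `engineer_features` is a hypothetical helper rather than the notebook's actual pipeline.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy student records (made-up values; column names are assumed)
students = pd.DataFrame(
    {
        "school": ["GP", "GP", "MS", "MS"],
        "sex": ["F", "M", "F", "M"],
        "studytime": [2, 3, 1, 4],
        "absences": [6, 0, 10, 2],
        "G1": [10, 14, 8, 16],
        "G2": [11, 15, 7, 17],
    }
)


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Create a new feature, then drop the columns it summarizes
    out["avg_grade"] = out[["G1", "G2"]].mean(axis=1)
    out = out.drop(columns=["G1", "G2"])
    # One-hot encode categoricals; drop_first avoids a redundant column
    out = pd.get_dummies(out, columns=["school", "sex"], drop_first=True)
    # Standardize numeric features to zero mean and unit variance
    numeric_cols = ["studytime", "absences", "avg_grade"]
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols])
    return out


features = engineer_features(students)
print(features)
```

In a real modeling workflow the scaler should be fit on the training split only and then applied to the test split, to avoid leaking information.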

## Q5. Load the wine quality data set and perform exploratory data analysis (EDA) to identify the distribution of each feature. Which feature(s) exhibit non-normality, and what transformations could be applied to these features to improve normality?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the wine quality dataset (update path as needed).
# Note: the original UCI file is semicolon-delimited; pass sep=';' in that case.
wine = pd.read_csv('winequality-red.csv')

# Plot distributions
wine.hist(bins=20, figsize=(15, 10))
plt.tight_layout()
plt.show()

# Check skewness, most right-skewed features first (0 means symmetric)
print(wine.skew().sort_values(ascending=False))

**Answer:**

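Features with high absolute skewness (a common rule of thumb is |skew| > 1), such as residual sugar and chlorides, deviate from normality. A log transformation (`log1p` if zeros are present) or a Box-Cox transformation (strictly positive values only) can reduce the skew; Yeo-Johnson is an alternative that also handles zero and negative values.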
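
The effect of these transformations can be checked by comparing skewness before and after. The sketch below uses a synthetic right-skewed (log-normal) sample as a stand-in for a feature like residual sugar; it is not the real column.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed, strictly positive sample (stand-in for a real feature)
rng = np.random.default_rng(1)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=1000))

# Log transform: log1p computes log(1 + x), which is safe at zero
log_transformed = np.log1p(skewed)

# Box-Cox: estimates the power parameter lambda from the data
boxcox_values, fitted_lambda = stats.boxcox(skewed.to_numpy())
boxcox_transformed = pd.Series(boxcox_values)

print(f"Skew before:        {skewed.skew():.2f}")
print(f"Skew after log1p:   {log_transformed.skew():.2f}")
print(f"Skew after Box-Cox: {boxcox_transformed.skew():.2f} (lambda = {fitted_lambda:.2f})")
```

Box-Cox usually gets closer to symmetry because it tunes the transformation to the data, at the cost of a less interpretable scale.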

## Q6. Using the wine quality data set, perform principal component analysis (PCA) to reduce the number of features. What is the minimum number of principal components required to explain 90% of the variance in the data?

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

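# Load the data (update path as needed; the original UCI file needs sep=';')
wine = pd.read_csv('winequality-red.csv')
X = wine.drop('quality', axis=1)

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first k components
explained_var = pca.explained_variance_ratio_.cumsum()
for i, var in enumerate(explained_var):
    if var >= 0.9:
        print(f"Number of components to explain 90% variance: {i+1}")
        break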
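
As an alternative to the manual loop, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and keeps just enough components to reach that share of variance. The sketch below demonstrates this on synthetic data (8 correlated features built from 3 latent factors, an assumption made for the example) and cross-checks it against the cumulative-sum approach.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 observed features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 8))
X_scaled = StandardScaler().fit_transform(X)

# Let PCA choose the number of components needed for 90% of the variance
pca_90 = PCA(n_components=0.90).fit(X_scaled)

# Cross-check with the cumulative explained variance of a full PCA
full = PCA().fit(X_scaled)
cumulative = np.cumsum(full.explained_variance_ratio_)
manual_count = int(np.argmax(cumulative >= 0.90)) + 1

print("Components chosen by PCA(n_components=0.90):", pca_90.n_components_)
print("Components from cumulative sum:", manual_count)
```

On the wine data, pass the standardized feature matrix from the cell above in place of the synthetic `X_scaled`.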