In [None]:
pip install pandas numpy matplotlib seaborn scikit-learn

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer

# Load the datasets
wine_df = pd.read_csv("WineQT.csv")
student_df = pd.read_csv("student_data.csv")

# Display the first few rows of each dataset
display(wine_df.head(), student_df.head())


1: Key Features of the Wine Quality Dataset
The Wine Quality Dataset consists of several chemical properties that influence the taste, texture, and overall quality of the wine. Each feature represents a different chemical characteristic, such as acidity, sugar content, or alcohol percentage, all of which contribute to the final quality rating. For example, fixed acidity affects the wine’s tartness, while volatile acidity influences the vinegar-like aroma. Similarly, residual sugar impacts the wine's sweetness, and alcohol content often has a strong correlation with perceived quality. Understanding these features is essential in predicting wine quality, as each plays a role in shaping the sensory experience of the wine.

In [None]:
# Display column names
print("Wine Quality Dataset Columns:", wine_df.columns.tolist())

# Display basic statistics
print(wine_df.describe())
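
The claim that some chemical properties track quality more closely than others can be checked by ranking each feature's correlation with the rating. The following is a minimal sketch on a made-up stand-in frame; the helper name `rank_by_correlation`, the column names, and the values are assumptions for illustration, not taken from WineQT.csv. It also assumes pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Order by absolute strength but keep the sign of each correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy stand-in for the wine data; values are invented for illustration only
toy_wine = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "volatile acidity": [0.70, 0.50, 0.60, 0.40, 0.30],
    "quality": [4, 5, 6, 7, 8],
})

ranking = rank_by_correlation(toy_wine, "quality")
print(ranking)
```

On the real data, the same helper could be called as `rank_by_correlation(wine_df, "quality")`, assuming the rating column is named `quality`.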


2: Handling Missing Data in Wine Quality Dataset
Handling missing data is crucial for the accuracy and reliability of any analysis. In the Wine Quality Dataset, missing values can arise from incomplete measurements or data collection errors, and several imputation techniques can fill them in. Mean imputation replaces missing values with the feature's average; it preserves the mean but shrinks the variance and can distort relationships between features. Median imputation is more robust for skewed data, while mode imputation suits categorical features. More advanced techniques such as K-Nearest Neighbors (KNN) imputation estimate missing values from the most similar rows, at a higher computational cost. The right choice depends on the dataset's characteristics and on how much the missing values affect the analysis.

In [None]:
# Check for missing values
print("Missing Values in Wine Dataset:\n", wine_df.isnull().sum())

# Apply mean imputation
imputer = SimpleImputer(strategy="mean")
wine_df_imputed = pd.DataFrame(imputer.fit_transform(wine_df), columns=wine_df.columns)
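
To make the difference between the imputation strategies concrete, the sketch below fills the same gap with mean, median, and KNN imputation. The toy frame, its values, and the helper name `impute` are assumptions for illustration; `n_neighbors=2` is an arbitrary choice for such a small sample.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy frame with one missing value and one outlier; values are invented
toy = pd.DataFrame({
    "residual sugar": [1.0, 2.0, 2.0, 3.0, np.nan, 20.0],
    "alcohol": [9.0, 9.5, 9.6, 10.0, 9.8, 13.0],
})

def impute(df: pd.DataFrame, imputer) -> pd.DataFrame:
    """Fit the imputer and return a frame with the original labels restored."""
    filled = imputer.fit_transform(df)
    return pd.DataFrame(filled, columns=df.columns, index=df.index)

mean_filled = impute(toy, SimpleImputer(strategy="mean"))
median_filled = impute(toy, SimpleImputer(strategy="median"))
knn_filled = impute(toy, KNNImputer(n_neighbors=2))

# Compare the value each strategy puts in the gap (row 4)
for name, frame in [("mean", mean_filled), ("median", median_filled), ("knn", knn_filled)]:
    print(name, frame.loc[4, "residual sugar"])
```

The outlier pulls the mean well above the typical value, while the median and the KNN estimate stay close to the bulk of the data, which is why the median is often preferred for skewed features.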


3: Key Factors Affecting Student Performance in Exams
Student performance is influenced by multiple factors, including academic preparation, parental support, and personal study habits. One effective way to analyze these factors is correlation analysis, which measures the strength of linear relationships between variables. For instance, a strong positive correlation between study time and exam scores suggests that students who study more tend to perform better. Similarly, parental education level might influence a student’s academic success, as parents with higher education may provide better guidance and resources. By visualizing these relationships in a correlation heatmap, we can see which factors are most strongly associated with student performance, keeping in mind that correlation alone does not establish cause.

In [None]:
# Display column names
print("Student Performance Dataset Columns:", student_df.columns.tolist())

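# Correlation heatmap (numeric columns only; categorical columns such as
# gender would otherwise raise an error in recent pandas versions)
plt.figure(figsize=(10,6))
sns.heatmap(student_df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")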
plt.title("Correlation Matrix of Student Performance Data")
plt.show()


4: Feature Engineering for Student Performance Dataset
Feature engineering is a crucial step in preparing data for analysis, as it involves creating new features or transforming existing ones to improve model performance. In the Student Performance Dataset, we can enhance our analysis by creating a new feature called "average_score", which represents the mean of math, reading, and writing scores. This provides a single metric to evaluate overall student performance rather than analyzing each subject separately. Additionally, categorical variables such as gender, parental education, and test preparation course completion might need to be converted into numerical values through one-hot encoding to make them useful for machine learning models. These transformations help in improving data interpretability and ensuring better predictive performance.

In [None]:
# Create a new feature for overall performance
# axis=1 averages across the three score columns for each student (row-wise)
student_df["average_score"] = student_df[["math_score", "reading_score", "writing_score"]].mean(axis=1)

# Display the first few rows with the new feature
display(student_df.head())
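
The one-hot encoding step mentioned above can be sketched with `pd.get_dummies`. The toy frame below is a stand-in: the column names `gender` and `test_preparation_course` and all values are assumptions for illustration and may not match the headers in student_data.csv.

```python
import pandas as pd

# Toy stand-in for the student data; names and values are invented
toy_students = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "test_preparation_course": ["none", "completed", "none"],
    "math_score": [72, 69, 90],
})

# Replace each categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(
    toy_students,
    columns=["gender", "test_preparation_course"],
    dtype=int,
)
print(encoded.columns.tolist())
print(encoded.head())
```

For linear models, passing `drop_first=True` removes one indicator per category and avoids perfectly collinear columns; tree-based models generally do not need this.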


5: Exploratory Data Analysis (EDA) on Wine Quality Dataset
EDA helps us understand the distribution of features and identify potential data issues. By plotting histograms of different features, we can observe their spread and detect any skewness in the dataset. Many features in the Wine Quality Dataset exhibit skewed distributions, meaning they do not follow a normal bell-shaped curve. This non-normality can hurt the performance of statistical and machine learning models that assume normally distributed data.

In [None]:
# Plot histograms to check feature distribution
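# DataFrame.hist creates its own figure, so no separate plt.figure() call is needed;
# omitting layout lets pandas pick a grid that fits every numeric column
wine_df.hist(bins=30, figsize=(12,8))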
plt.suptitle("Feature Distribution in Wine Quality Dataset")
plt.show()


Identifying Non-Normal Features
To quantify skewness, we calculate the skewness value of each feature. An absolute skewness greater than 1 (below -1 or above +1) indicates a strong deviation from symmetry, and therefore from normality.

In [None]:
# Check skewness of features
skewness = wine_df.skew()
print("Skewness of Features:\n", skewness)
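
The threshold rule above can be applied directly to pick out the features that need attention. This is a minimal sketch on invented data; the helper name `highly_skewed` and the toy columns are assumptions, and the cutoff of 1 is the rule of thumb from the text rather than a strict statistical test.

```python
import pandas as pd

def highly_skewed(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return skewness of numeric columns whose absolute skewness exceeds `threshold`."""
    skew = df.skew(numeric_only=True)
    flagged = skew[skew.abs() > threshold]
    # Most strongly skewed features first
    return flagged.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy data: one symmetric column and one with a long right tail (values invented)
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 50.0],
})

flagged = highly_skewed(toy)
print(flagged)
```

On the real data this would be called as `highly_skewed(wine_df)`, and the resulting index gives the columns that are candidates for transformation.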


Applying Transformations to Improve Normality
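When a dataset has skewed features, transformations can bring them closer to a normal distribution. One effective method is the power transformation, which stabilizes variance and makes the distribution more Gaussian-like. scikit-learn's PowerTransformer uses the Yeo-Johnson method by default, which handles zero and negative values, and standardizes the output to zero mean and unit variance. This can improve the performance of machine learning models that assume normality.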

In [None]:
# Apply Power Transformation to normalize skewed features
pt = PowerTransformer()
wine_df_transformed = pd.DataFrame(pt.fit_transform(wine_df), columns=wine_df.columns)

# Visualize transformed features
# DataFrame.hist creates its own figure; pandas picks a grid that fits all columns
wine_df_transformed.hist(bins=30, figsize=(12,8))
plt.suptitle("Transformed Feature Distribution")
plt.show()
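
Histograms give a visual impression; comparing skewness before and after the transformation gives a number to go with it. The sketch below uses synthetic log-normal data as a stand-in for a skewed feature, so the column name `skewed_feature`, the seed, and the sample size are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed data standing in for a skewed wine feature
rng = np.random.default_rng(0)
toy = pd.DataFrame({"skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=500)})

# Yeo-Johnson is the default method; it is named here to make the choice explicit
pt = PowerTransformer(method="yeo-johnson")
transformed = pd.DataFrame(pt.fit_transform(toy), columns=toy.columns, index=toy.index)

# Side-by-side skewness: values near 0 indicate a roughly symmetric distribution
comparison = pd.DataFrame({"before": toy.skew(), "after": transformed.skew()})
print(comparison)
```

The same comparison on the notebook's data would use `wine_df.skew()` and `wine_df_transformed.skew()` in place of the toy frames.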
