## Insight Extraction

This section extracts **meaningful insights** from the dataset, building on skills explored in `insights_extraction.ipynb`.  

Highlights include:
- Correlation of each numeric feature with wine quality
- Comparison of high-quality (quality ≥ 7) vs poor-quality (quality ≤ 5) wines
- Z-score normalization of fixed acidity using NumPy

These examples demonstrate the ability to **connect numerical data to actionable insights**, a key skill for data analysis roles.
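
As a minimal sketch of two of the techniques listed above (z-score normalization and ranking features by correlation with quality), the snippet below uses a small made-up DataFrame so that it runs without the CSV file. The `toy` data and the `zscore` helper are illustrative assumptions, not part of the notebook. Note that `np.std` defaults to the population standard deviation (`ddof=0`), whereas the `std` row printed by `DataFrame.describe()` is the sample standard deviation (`ddof=1`), so the two differ slightly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the wine dataset (not real measurements).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "volatile acidity": [0.70, 0.88, 0.60, 0.45, 0.36, 0.30],
    "quality": [5, 5, 6, 6, 7, 8],
})


def zscore(values, ddof=0):
    """Return (x - mean) / std; ddof=0 gives the population std, as np.std does by default."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=ddof)


# Standardize one feature: the result has mean ~0 and population std ~1.
z_alcohol = zscore(toy["alcohol"])
print("z-scores:", z_alcohol)

# Correlation of each feature with quality, dropping quality's self-correlation.
corr_with_quality = toy.corr(numeric_only=True)["quality"].drop("quality")

# Order by absolute value so strong negative correlations are not pushed to the bottom.
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)
print("Ranked by |correlation|:\n", ranked)
```

Sorting by absolute value is a design choice: a plain descending sort on the signed values places strongly negative correlations last, where they are easy to overlook.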


In [2]:
# ===============================================
# 📗 insights_extraction.ipynb
# Purpose: Perform statistical insights and correlations
# using the Wine Quality dataset.
# ===============================================

import numpy as np
import pandas as pd
import os
# Machine-specific working directory; adjust this path to your environment
os.chdir(r"C:\Users\Naspers_Labs\desktop\udacity\aws_ai_scientist\data-analysis-python\numpy")

# Load dataset
data = pd.read_csv("winequality-red.csv", sep=';')

# Basic descriptive statistics
print("Summary Statistics:\n")
print(data.describe())

# Mean alcohol by quality
mean_alcohol_quality = data.groupby("quality")["alcohol"].mean()
print("\nMean alcohol content by quality:\n", mean_alcohol_quality)

# Correlation matrix
correlation = data.corr(numeric_only=True)
print("\nCorrelation Matrix:\n", correlation)

# Rank features by signed correlation with quality
# (descending, so strong negative correlations appear last)
correlation_target = correlation["quality"].sort_values(ascending=False)
print("\nTop correlations with wine quality:\n", correlation_target)

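# NumPy-based analysis: z-score normalization and variance
# np.std and np.var default to population statistics (ddof=0),
# unlike the sample std (ddof=1) reported by data.describe()
acidity = data["fixed acidity"].values
normalized_acidity = (acidity - np.mean(acidity)) / np.std(acidity)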

print("\nFirst 5 normalized acidity values:\n", normalized_acidity[:5])
print("Variance of acidity:", np.var(acidity))

# Conditional insights: good = quality >= 7, poor = quality <= 5
# (wines rated 6 fall into neither group)
good_wines = data[data["quality"] >= 7]
poor_wines = data[data["quality"] <= 5]

print("\nNumber of good wines:", len(good_wines))
print("Number of poor wines:", len(poor_wines))
print("Average alcohol (good wines):", good_wines["alcohol"].mean())
print("Average alcohol (poor wines):", poor_wines["alcohol"].mean())


Summary Statistics:

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000         