# Mini-Project: Array Operations and Insights

## Purpose
This mini-project demonstrates my ability to work with a **real-world dataset** using NumPy and pandas.  
It combines array manipulation, statistical analysis, and insight extraction, showcasing skills developed in the accompanying practice notebooks:
- `array_operations.ipynb`: Core NumPy fundamentals (indexing, slicing, reshaping, and boolean operations)
- `insights_extraction.ipynb`: Analytical experimentation, correlation analysis, and exploratory statistics

The project uses the **Wine Quality (Red Wine) dataset** (1,599 records, 12 numeric columns: 11 measured properties plus a quality rating) to illustrate practical, real-world data handling.

In [None]:
# Import libraries
import os

import numpy as np
import pandas as pd

# Machine-specific working directory: change this to the folder that contains winequality-red.csv
os.chdir(r"C:\Users\Naspers_Labs\desktop\udacity\aws_ai_scientist\data-analysis-python\numpy")

## 1. Dataset Overview

We start by loading the dataset to understand its **structure, size, and features**.  
This section sets the stage for array operations and insights extraction.


In [None]:
# Load the Wine Quality (Red Wine) dataset; the file is semicolon-delimited, hence sep=";"
# The resulting DataFrame is the foundation for all subsequent array and analysis operations
df = pd.read_csv("winequality-red.csv", sep=";")

print("Dataset shape:", df.shape)
df.head()

Dataset shape: (1599, 12)


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## 2. Key Array Operations

In this section, I demonstrate selected **array-level manipulations** inspired by `array_operations.ipynb`.  
The focus is on **practical array handling skills**:
- Column and row selection
- Boolean indexing to filter data
- Reshaping and sorting arrays

These operations illustrate foundational data handling skills critical for any data-driven project.
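
The operations listed above can be sketched on a small array. This is a minimal illustration, not the code from `array_operations.ipynb`: it assumes a hand-built array holding three columns (fixed acidity, alcohol, quality) for the first five wines shown in the dataset overview, and the variable names are hypothetical.

```python
import numpy as np

# First five wines from the dataset overview, restricted to three columns:
# fixed acidity, alcohol, quality (hand-copied subset, for illustration only)
sample = np.array([
    [7.4, 9.4, 5],
    [7.8, 9.8, 5],
    [7.8, 9.8, 5],
    [11.2, 9.8, 6],
    [7.4, 9.4, 5],
])

# Column and row selection
alcohol = sample[:, 1]        # every row, second column
first_two_rows = sample[:2]   # first two rows, all columns

# Boolean indexing: keep rows whose quality is at least 6
quality_mask = sample[:, 2] >= 6
rated_six_plus = sample[quality_mask]

# Reshaping: flatten to 1-D, then restore the original 5 x 3 layout
flat = sample.reshape(-1)
restored = flat.reshape(5, 3)

# Sorting: order rows by fixed acidity, highest first
order = np.argsort(sample[:, 0])[::-1]
sorted_by_acidity = sample[order]

print(alcohol)
print(rated_six_plus)
print(sorted_by_acidity[0])
```

On the full dataset the same indexing would apply to `df.to_numpy()`, with column positions taken from the column order of the DataFrame.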

In [None]:
# General information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


## 3. Insight Extraction

This section extracts **meaningful insights** from the dataset, building on skills explored in `insights_extraction.ipynb`.  

Highlights include:
- Statistical summaries
- Correlation analysis with wine quality
- Comparison of high-quality vs poor-quality wines
- Z-score normalization of a feature (fixed acidity) using NumPy

These examples demonstrate the ability to **connect numerical data to actionable insights**, a key skill for data analysis roles.

In [4]:
# Summary statistics for all numeric columns
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


### Quality Distribution

In [5]:
# Count how many wines fall into each quality category
quality_counts = df["quality"].value_counts().sort_index()
print(quality_counts)

quality
3     10
4     53
5    681
6    638
7    199
8     18
Name: count, dtype: int64
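
The counts above are easier to compare as proportions. A minimal NumPy sketch, assuming the counts are hand-copied from the printed output rather than read from `df` (the array names are hypothetical):

```python
import numpy as np

# Counts per quality score, copied from the value_counts() output above
scores = np.array([3, 4, 5, 6, 7, 8])
counts = np.array([10, 53, 681, 638, 199, 18])

# Proportion of wines at each score
shares = counts / counts.sum()

# Boolean mask for the medium range (scores 5 and 6)
medium_mask = (scores >= 5) & (scores <= 6)
medium_share = shares[medium_mask].sum()

for score, share in zip(scores, shares):
    print(f"quality {score}: {share:.1%}")
print(f"medium (5-6): {medium_share:.1%}")
```

With the DataFrame loaded, `df["quality"].value_counts(normalize=True)` should give the same proportions directly.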


### Mean Alcohol Content by Quality

In [6]:
# Group wines by quality and compute average alcohol content
mean_alcohol_by_quality = df.groupby("quality")["alcohol"].mean()

print("Average alcohol content by quality:\n")
print(mean_alcohol_by_quality)

Average alcohol content by quality:

quality
3     9.955000
4    10.265094
5     9.899706
6    10.629519
7    11.465913
8    12.094444
Name: alcohol, dtype: float64


### Correlation Analysis

In [7]:
# Compute correlation matrix
correlation_matrix = df.corr()

# Correlation with quality
quality_correlations = correlation_matrix["quality"].sort_values(ascending=False)

print("Correlation of features with wine quality:\n")
print(quality_correlations)

Correlation of features with wine quality:

quality                 1.000000
alcohol                 0.476166
sulphates               0.251397
citric acid             0.226373
fixed acidity           0.124052
residual sugar          0.013732
free sulfur dioxide    -0.050656
pH                     -0.057731
chlorides              -0.128907
density                -0.174919
total sulfur dioxide   -0.185100
volatile acidity       -0.390558
Name: quality, dtype: float64
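
`df.corr()` computes Pearson correlation by default. As a sketch of what that coefficient measures, the snippet below computes it by hand as the mean product of z-scores and compares the result with `np.corrcoef`. It assumes only the alcohol and quality values of the first five wines from the dataset overview, so the number it produces is illustrative and not an estimate for the full dataset.

```python
import numpy as np

# Alcohol and quality for the first five wines (hand-copied, illustration only)
alcohol = np.array([9.4, 9.8, 9.8, 9.8, 9.4])
quality = np.array([5, 5, 5, 6, 5], dtype=float)

# Pearson r as the mean product of z-scores (population std, ddof=0)
z_alcohol = (alcohol - alcohol.mean()) / alcohol.std()
z_quality = (quality - quality.mean()) / quality.std()
r_manual = np.mean(z_alcohol * z_quality)

# Same coefficient from NumPy's correlation matrix
r_numpy = np.corrcoef(alcohol, quality)[0, 1]

print("manual and np.corrcoef agree:", np.isclose(r_manual, r_numpy))
```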


In [8]:
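# Normalize fixed acidity to z-scores using NumPy
fixed_acidity = df["fixed acidity"].to_numpy()

mean_fa = np.mean(fixed_acidity)
# np.std defaults to the population standard deviation (ddof=0),
# which is slightly smaller than the sample std reported by df.describe()
std_fa = np.std(fixed_acidity)

# Subtract the mean and divide by the standard deviation
normalized_fixed_acidity = (fixed_acidity - mean_fa) / std_fa

print("First 5 normalized fixed acidity values:\n", normalized_fixed_acidity[:5])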

First 5 normalized fixed acidity values:
 [-0.52835961 -0.29854743 -0.29854743  1.65485608 -0.52835961]
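
One detail worth checking is which standard deviation is used. `np.std` defaults to the population formula (`ddof=0`), while pandas `.std()` and `df.describe()` use the sample formula (`ddof=1`). The sketch below shows the difference and confirms that z-scores built with `ddof=0` have mean 0 and standard deviation 1. It assumes only the five fixed acidity values from the dataset overview, so its numbers differ from the full-dataset output above.

```python
import numpy as np

# Fixed acidity for the first five wines (hand-copied, illustration only)
values = np.array([7.4, 7.8, 7.8, 11.2, 7.4])

population_std = np.std(values)       # ddof=0, NumPy's default
sample_std = np.std(values, ddof=1)   # ddof=1, the pandas default

# Z-scores using the population standard deviation
z_scores = (values - values.mean()) / population_std

print("mean is 0:", np.isclose(z_scores.mean(), 0.0))
print("std is 1:", np.isclose(z_scores.std(), 1.0))
print("population std < sample std:", population_std < sample_std)
```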


## 4. Comparing Good and Poor Quality Wines

This part highlights **conditional analysis**:
- Good wines: quality >= 7
- Poor wines: quality <= 5

The comparison showcases practical reasoning:
- How key features (like alcohol) differ between high- and low-quality wines
- Ability to summarize results in a meaningful way

In [9]:
# Define quality thresholds
good_wines = df[df["quality"] >= 7]
poor_wines = df[df["quality"] <= 5]

print("Number of good wines:", len(good_wines))
print("Number of poor wines:", len(poor_wines))

print("\nAverage alcohol (good wines):", good_wines["alcohol"].mean())
print("Average alcohol (poor wines):", poor_wines["alcohol"].mean())

Number of good wines: 217
Number of poor wines: 744

Average alcohol (good wines): 11.518049155145931
Average alcohol (poor wines): 9.926478494623655
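
As a consistency check, the two group means can be rebuilt from the per-quality counts and means printed earlier, using a count-weighted average. This is a sketch with hand-copied values and hypothetical array names. Small differences from the output above are expected because the per-quality means are rounded to six decimals.

```python
import numpy as np

# Per-quality counts and mean alcohol, copied from the earlier outputs
scores = np.array([3, 4, 5, 6, 7, 8])
counts = np.array([10, 53, 681, 638, 199, 18])
mean_alcohol = np.array([9.955000, 10.265094, 9.899706,
                         10.629519, 11.465913, 12.094444])

# Same thresholds as the cell above
good = scores >= 7
poor = scores <= 5

# Group mean = per-quality means weighted by the number of wines at each score
good_mean = np.average(mean_alcohol[good], weights=counts[good])
poor_mean = np.average(mean_alcohol[poor], weights=counts[poor])

print("Good wines:", counts[good].sum(), "mean alcohol:", round(good_mean, 4))
print("Poor wines:", counts[poor].sum(), "mean alcohol:", round(poor_mean, 4))
print("Difference:", round(good_mean - poor_mean, 2))
```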


### Key Insights

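- Higher-rated wines tend to have more alcohol: good wines (quality >= 7) average about 11.5, against about 9.9 for poor wines (quality <= 5). The trend is not strictly monotonic, since quality 5 has the lowest mean.
- Alcohol has the strongest positive correlation with quality (0.48), and volatile acidity has the strongest negative correlation (-0.39).
- Most wines fall within the medium quality range (5–6): 1,319 of 1,599, or about 82%.
- Normalizing features to z-scores puts attributes measured on different scales onto a common scale for comparison.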


## 5. Conclusion and Project Significance

This mini-project ties together:
- **Technical skills**: Array manipulation, slicing, boolean logic, reshaping
- **Analytical skills**: Statistical summaries, correlations, comparisons
- **Storytelling skills**: Explaining workflow, highlighting reasoning

By linking practice notebooks (`array_operations.ipynb` and `insights_extraction.ipynb`) with this curated showcase, the project communicates both **ability to execute technical tasks** and **capacity to synthesize insights for a real-world audience**.

This structure ensures reviewers see **not only what I can do**, but also **why it matters**.