<a href="https://colab.research.google.com/github/SANGRAMLEMBE/Hands_on_ML/blob/main/Hands_on/NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What is NumPy?

NumPy (short for Numerical Python) is a fundamental Python package for scientific computing.

#### Key Features:
- Works with 1D, 2D, and higher-dimensional arrays
- Fast math operations (such as sum and mean)
- Supports matrix operations and linear algebra



# Using NumPy for Smart Fitness Score Calculation

## 🏋️ Scenario: Fitness Performance Scoring

You run a fitness platform that evaluates daily performance using three metrics:
- Minutes of exercise
- Liters of water consumed
- Hours of sleep

You want to assign a **daily fitness score** based on how well a user balances these metrics.

---

## Fitness Score Formula

We will use a simple linear formula:

`score = w1 * exercise + w2 * water + w3 * sleep`


Where `w1`, `w2`, and `w3` reflect the importance of each metric.

---

## Define the Weights and Sample Data


In [1]:
# Assigning importance to each factor
weights = [0.5, 0.2, 0.3]  # More emphasis on exercise

# Sample user data: [exercise_minutes, water_liters, sleep_hours]
day1 = [45, 2.0, 6.5]
day2 = [30, 1.5, 8.0]
day3 = [60, 3.0, 7.0]

In [2]:
def compute_score(day_data, weights):
    result = 0
    for i in range(len(day_data)):
        result += day_data[i] * weights[i]
    return result

# Example usage
print(compute_score(day1, weights))  # Output: Fitness score for day1
print(compute_score(day2, weights))
print(compute_score(day3, weights))


24.849999999999998
17.7
32.7


In [3]:
weight = [0.4,0.6,0.3]

# Sample student data: [college_lecture, self_study, hackathon]
stud_1 = [40,15,2]
stud_2 = [35,20,3]
stud_3 = [45,10,1]

def chances_of_placement(stud_data , weight):
  result = 0
  for i in range(len(stud_data)):
    result += stud_data[i]*weight[i]
  return result

print(chances_of_placement(stud_1,weight))
print(chances_of_placement(stud_2,weight))
print(chances_of_placement(stud_3,weight))

25.6
26.9
24.3


### Using the `zip` Function
The zip() function in Python is a built-in function used to combine multiple iterables (like lists, tuples, or strings) element-wise into a single iterable of tuples. It pairs corresponding elements from each input iterable.

In [4]:
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']
zipped_result = zip(list1, list2)

# Convert the zip object to a list to view the contents
print(list(zipped_result))

[(1, 'a'), (2, 'b'), (3, 'c')]


In [5]:
list1 = ["sangram","Jeevan","Himanshu","Divyanshu"]
list2 = [80,85,65,30]
zipped_list = zip(list1,list2)
print(list(zipped_list))
print(zipped_list)

[('sangram', 80), ('Jeevan', 85), ('Himanshu', 65), ('Divyanshu', 30)]
<zip object at 0x7fc12c9a94c0>
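The second `print` above shows a `zip` object rather than the pairs because `zip` returns a lazy, single-use iterator. The following is a minimal sketch of that behaviour, using shortened versions of the lists above; it also shows that `zip` stops at the shortest input.

```python
names = ["sangram", "Jeevan", "Himanshu"]
marks = [80, 85, 65]

pairs = zip(names, marks)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left to yield

print(first_pass)    # [('sangram', 80), ('Jeevan', 85), ('Himanshu', 65)]
print(second_pass)   # []

# zip stops at the shortest input and silently drops the extra items.
short = list(zip([1, 2, 3], ["a", "b"]))
print(short)         # [(1, 'a'), (2, 'b')]
```

If the pairs are needed more than once, storing `list(zip(...))` in a variable is usually the simplest option.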


In [6]:
print(list(zip(day1,weights)))

[(45, 0.5), (2.0, 0.2), (6.5, 0.3)]


In [7]:
print(list(zip(stud_2,weight)))

[(35, 0.4), (20, 0.6), (3, 0.3)]


In [8]:
def compute_score(day_data, weights):
    result = 0
    for data,w in zip(day_data,weights):
        result += data*w

    return result

# Example usage
print(compute_score(day1, weights))
print(compute_score(day2, weights))
print(compute_score(day3, weights))

24.849999999999998
17.7
32.7


In [9]:
def chances_of_placement(stud_data,weight):
  result = 0
  for data,w in zip(stud_data,weight):
    result += data*w
  return result

print(chances_of_placement(stud_1,weight))
print(chances_of_placement(stud_2,weight))
print(chances_of_placement(stud_3,weight))

25.6
26.9
24.3


## Now Use NumPy

In [10]:
import numpy as np


### Convert Lists to Arrays

In [11]:
day1 = np.array([45, 2.0, 6.5])
weights = np.array([0.5, 0.2, 0.3])


In [12]:
type(day1)
type(weights)

numpy.ndarray

In [13]:
day1[0]

np.float64(45.0)

In [14]:
weights[1]

np.float64(0.2)

In [15]:
a = 5.0
print(type(a))

<class 'float'>
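The cells above show that indexing a NumPy array returns a NumPy scalar (`np.float64`), while a literal such as `5.0` is a plain Python `float`. As a small sketch of how the two relate, `.item()` converts a NumPy scalar into the matching Python type:

```python
import numpy as np

day1 = np.array([45, 2.0, 6.5])

np_value = day1[0]          # NumPy scalar (np.float64)
py_value = day1[0].item()   # plain Python float

print(type(np_value))
print(type(py_value))
print(np_value == py_value)  # the values compare equal
```

The conversion mostly matters when passing values to code that expects built-in types, such as JSON serialization.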


# Dot method


In [16]:
score = np.dot(day1, weights)
print("Fitness Score:", score)


Fitness Score: 24.849999999999998


In [17]:
placement_chance = np.dot(stud_2,weight)
print("Student 2 Placement Chances score is ",placement_chance)

Student 2 Placement Chances score is  26.9


### Or:

In [18]:
print(day1*weights)

[22.5   0.4   1.95]


In [19]:
(day1 * weights).sum()


np.float64(24.849999999999998)

In [20]:
print(stud_2,weight)
print(type(stud_2))

# convert list into numpy array

stud_2 = np.array(stud_2)
weight = np.array(weight)


print(stud_2 * weight)
print(type(stud_2))

(stud_2 * weight).sum()

[35, 20, 3] [0.4, 0.6, 0.3]
<class 'list'>
[14.  12.   0.9]
<class 'numpy.ndarray'>


np.float64(26.9)

### Compare performance of Python loop vs NumPy dot product

In [21]:

workout_minutes = list(range(30, 1000000))
calories_burned = list(range(30, 1000000))

# Convert to NumPy arrays
workout_np = np.array(workout_minutes)
calories_np = np.array(calories_burned)



In [22]:
%%time
energy_score_loop = 0
for w, c in zip(workout_minutes, calories_burned):
    energy_score_loop += w * c
print(energy_score_loop)

333332833333491445
CPU times: user 413 ms, sys: 0 ns, total: 413 ms
Wall time: 629 ms


### Using NumPy vectorized dot product

In [23]:
%%time
energy_score_np = np.dot(workout_np, calories_np)
print(energy_score_np)

333332833333491445
CPU times: user 3.27 ms, sys: 28 µs, total: 3.29 ms
Wall time: 2.1 ms


In [24]:
%%time
energy_score_loop = 0
for s, w in zip(stud_1, weight):
    energy_score_loop += s * w
print(energy_score_loop)

25.6
CPU times: user 903 µs, sys: 0 ns, total: 903 µs
Wall time: 2.56 ms


In [25]:
%%time
energy_score_np = np.dot(stud_1,weight)
print(energy_score_np)

25.6
CPU times: user 749 µs, sys: 0 ns, total: 749 µs
Wall time: 3.56 ms
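With only three elements (cells 24 and 25), the timings are dominated by fixed overhead, so NumPy shows no clear advantage there; the speedup appears on large arrays. A single `%%time` run is also noisy. The sketch below uses the standard-library `timeit` module to repeat each measurement; the array size and the helper name `loop_dot` are choices made for this illustration only.

```python
import timeit
import numpy as np

n = 100_000
a_list = list(range(n))
b_list = list(range(n))
a_np = np.array(a_list, dtype=np.int64)
b_np = np.array(b_list, dtype=np.int64)

def loop_dot(xs, ys):
    """Dot product with a plain Python loop (hypothetical helper)."""
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total

# Run each version 5 times and report the total elapsed time.
loop_time = timeit.timeit(lambda: loop_dot(a_list, b_list), number=5)
numpy_time = timeit.timeit(lambda: np.dot(a_np, b_np), number=5)

print(f"loop:  {loop_time:.4f} s for 5 runs")
print(f"numpy: {numpy_time:.4f} s for 5 runs")
```

Actual numbers depend on the machine, so no expected output is given here.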


### Evaluate a Week at Once

In [26]:
week_data = np.array([
    [45, 2.0, 6.5],
    [30, 1.5, 8.0],
    [60, 3.0, 7.0],
    [50, 2.5, 7.5],
    [40, 1.8, 6.0],
    [55, 2.2, 6.8],
    [35, 1.6, 7.2]
])
# 2D array
print(week_data.shape)

(7, 3)


In [27]:
# print(week_data)
print(week_data.shape)

(7, 3)


In [28]:
x=np.array([1,2,3])
print(x.shape)

(3,)


In [29]:
y = np.array([[3,4,5],
 [3,7,9]])
print(y.shape)

(2, 3)


In [30]:
weights = np.array([0.5, 0.2, 0.3])
# 1D array

In [31]:

arr = np.array([[[1, 2, 3], [4, 5, 6]],

                 [[1, 2, 3], [4, 5, 6]]])

print(arr.shape)

(2, 2, 3)


In [32]:
arr = np.array([
    [[[1,2,3],[2,4,6],[5,6,7]],[[2,3,4],[4,4,5],[6,7,8]]],
    [[[1,8,3],[3,5,6],[3,9,7]],[[2,3,5],[4,4,5],[6,7,8]]]
])

print(arr.shape)

(2, 2, 3, 3)
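Alongside `.shape`, two related attributes are often useful: `.ndim` (the number of dimensions) and `.size` (the total number of elements). A minimal sketch, using an array of zeros with the same shape as the 4D example above:

```python
import numpy as np

arr = np.zeros((2, 2, 3, 3))

print(arr.shape)  # (2, 2, 3, 3)
print(arr.ndim)   # 4
print(arr.size)   # 36, which is 2 * 2 * 3 * 3
```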


All elements in a NumPy array share the same data type. You can inspect it with `.dtype`.

In [33]:
arr = np.array([
    [8, 2, 1],
    [9, 3, 2],
    [7, 1, 2]
])
print("arr dtype:", arr.dtype)

arr dtype: int64


In [34]:
arr = np.array([[2,3],[4,7],[1,5]])
print(arr.dtype)

int64


Add Floating Point to One Entry

In [35]:
# Introduce one float into the array
arr2 = np.array([
    [8.0, 2, 1],
    [9, 3, 2],
    [7, 1, 2]
])
print("arr2 dtype:", arr2.dtype)  # the dtype changes to float64


arr2 dtype: float64
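The dtype can also be chosen explicitly rather than inferred. The sketch below, on small made-up arrays, shows the `dtype` argument of `np.array` and the `astype` method; note that converting floats to integers truncates the fractional part.

```python
import numpy as np

mixed = np.array([1, 2.5, 3])                      # one float upcasts everything
as_float = np.array([1, 2, 3], dtype=np.float64)   # request float64 explicitly
truncated = mixed.astype(np.int64)                 # 2.5 becomes 2

print(mixed.dtype)      # float64
print(as_float.dtype)   # float64
print(truncated)        # [1 2 3]
```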


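- For 1D and 2D arrays, `np.dot` gives the same result as the `@` operator (`np.matmul`); the two differ only for arrays with more than two dimensions or for scalar operands.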

In [36]:
weekly_scores = np.dot(week_data, weights)
print("Weekly Scores:", weekly_scores)


Weekly Scores: [24.85 17.7  32.7  27.75 22.16 29.98 19.98]


In [37]:
weekly_scores1=week_data @ weights
print("Weekly Scores1:", weekly_scores1)

Weekly Scores1: [24.85 17.7  32.7  27.75 22.16 29.98 19.98]


In [38]:
weekly_scores = np.dot(week_data, weight)
print("weekly score :",weekly_scores)

weekly_scores = week_data @ weight
print("weekly score :",weekly_scores)

weekly score : [21.15 15.3  27.9  23.75 18.88 25.36 17.12]
weekly score : [21.15 15.3  27.9  23.75 18.88 25.36 17.12]
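Both forms require the last dimension of the data to match the length of the weight vector. A minimal sketch, using the first two days of `week_data` and a deliberately wrong weight vector (`bad_weights` is a name invented for this example):

```python
import numpy as np

week_data = np.array([[45, 2.0, 6.5],
                      [30, 1.5, 8.0]])
weights = np.array([0.5, 0.2, 0.3])

via_dot = np.dot(week_data, weights)
via_matmul = week_data @ weights
print(np.allclose(via_dot, via_matmul))  # True

# (2, 3) @ (2,) is invalid: 3 columns cannot pair with 2 weights.
bad_weights = np.array([0.5, 0.5])
try:
    week_data @ bad_weights
except ValueError as exc:
    print("Shape mismatch:", exc)
```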


# NumPy Analysis on Wine Quality Dataset

We will use NumPy to explore the [Wine Quality dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality), which contains physicochemical and quality-related properties of red and white wine samples.

---

## Step 1: Load CSV Data using NumPy


In [39]:
# Load data from the UCI Wine Quality dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'


- **genfromtxt**  --> Reads delimited text from a file path or URL into a NumPy array
- **delimiter**  --> The character that separates values on each line (for example `,`, `;`, or a space)
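To see what `delimiter` and `skip_header` do without downloading anything, the sketch below parses a tiny in-memory CSV. The column names mimic the wine file, but the three rows are made up for illustration.

```python
import io
import numpy as np

# Semicolon-separated text with one header line (illustrative values only).
csv_text = (
    '"fixed acidity";"alcohol";"quality"\n'
    "7.4;9.4;5\n"
    "7.8;9.8;5\n"
    "11.2;9.8;6\n"
)

sample = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)
print(sample.shape)  # (3, 3)
print(sample.dtype)  # float64
```

Without `skip_header=1`, the header line would be read as data and its text fields would come back as `nan`.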

In [40]:
wine_data = np.loadtxt("/content/winequality-white.csv", delimiter=";", skiprows=1)   # skip the header row: it holds column names (strings), which cannot be parsed as numbers
print(wine_data.shape)
print(wine_data)


(4898, 12)
[[ 7.    0.27  0.36 ...  0.45  8.8   6.  ]
 [ 6.3   0.3   0.34 ...  0.49  9.5   6.  ]
 [ 8.1   0.28  0.4  ...  0.44 10.1   6.  ]
 ...
 [ 6.5   0.24  0.19 ...  0.46  9.4   6.  ]
 [ 5.5   0.29  0.3  ...  0.38 12.8   7.  ]
 [ 6.    0.21  0.38 ...  0.32 11.8   6.  ]]


In [41]:
# Use genfromtxt with proper delimiter and header skip
wine_data = np.genfromtxt(url, delimiter=';', skip_header=1)
print("Shape of dataset:", wine_data.shape)


Shape of dataset: (1599, 12)


In [42]:

print(wine_data)

[[ 7.4    0.7    0.    ...  0.56   9.4    5.   ]
 [ 7.8    0.88   0.    ...  0.68   9.8    5.   ]
 [ 7.8    0.76   0.04  ...  0.65   9.8    5.   ]
 ...
 [ 6.3    0.51   0.13  ...  0.75  11.     6.   ]
 [ 5.9    0.645  0.12  ...  0.71  10.2    5.   ]
 [ 6.     0.31   0.47  ...  0.66  11.     6.   ]]


Download and Save the CSV File:

In [43]:
import urllib.request

# URL of the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'

# Local filename to save as
filename = 'wine-quality-red.csv'

# Download the file from `url` and save it locally under `filename`
urllib.request.urlretrieve(url, filename)

print(f"File saved as: {filename}")


File saved as: wine-quality-red.csv


In [44]:

wine_data = np.genfromtxt(filename, delimiter=';', skip_header=1)  # `filename` is the file saved in the previous cell
print(wine_data.shape)

(1599, 12)


In [45]:
wine_data

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])

Check Column-wise Statistics

In [46]:
# COLUMN WISE --> axis = 0

mean_vals = np.mean(wine_data, axis=0)
std_vals = np.std(wine_data, axis=0)
print("Means:\n", mean_vals)
print("Standard Deviations:\n", std_vals)


Means:
 [ 8.31963727  0.52782051  0.27097561  2.5388055   0.08746654 15.87492183
 46.46779237  0.99674668  3.3111132   0.65814884 10.42298311  5.63602251]
Standard Deviations:
 [1.74055180e+00 1.79003704e-01 1.94740214e-01 1.40948711e+00
 4.70505826e-02 1.04568856e+01 3.28850367e+01 1.88674370e-03
 1.54338181e-01 1.69453967e-01 1.06533430e+00 8.07316877e-01]


In [47]:
# ROW WISE --> axis = 1

mean_vals = np.mean(wine_data, axis=1)
std_vals = np.std(wine_data, axis=1)
print("Means:\n", mean_vals)
print("Standard Deviations:\n", std_vals)

Means:
 [ 6.21198333 10.25456667  8.30825    ...  8.37347833  8.76795583
  7.7077075 ]
Standard Deviations:
 [ 9.12775181 18.3785345  14.4697315  ... 12.29881939 13.60147485
 11.52640133]


In [48]:
x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
mean_vals = np.mean(x, axis=1)
print(mean_vals)


[2. 5. 8.]


In [49]:
x = np.array([[2, 3, 6, 7],
              [1, 2, 9, 4],
              [5, 8, 3, 6]])
print(x)

[[2 3 6 7]
 [1 2 9 4]
 [5 8 3 6]]


In [50]:
mean_value = np.mean(x, axis = 0)
print(mean_value)

[2.66666667 4.33333333 6.         5.66666667]


In [51]:
mean_value = np.mean(x, axis = 1)
print(mean_value)

[4.5 4.  5.5]
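One way to remember the `axis` argument is that it names the dimension that gets collapsed: `axis=0` collapses the rows and leaves one value per column, while `axis=1` collapses the columns and leaves one value per row. The sketch below reuses the 3×4 array from above and also shows `keepdims=True`, which keeps the collapsed axis with length 1 so the result can be subtracted from the original array.

```python
import numpy as np

x = np.array([[2, 3, 6, 7],
              [1, 2, 9, 4],
              [5, 8, 3, 6]])

col_means = x.mean(axis=0)                    # shape (4,): one mean per column
row_means = x.mean(axis=1)                    # shape (3,): one mean per row
row_means_2d = x.mean(axis=1, keepdims=True)  # shape (3, 1)

# Subtract each row's mean from that row.
centered = x - row_means_2d

print(col_means.shape, row_means.shape, row_means_2d.shape)
print(centered.mean(axis=1))  # approximately zero for every row
```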


### Indexing and Filtering Examples

Filter wines with alcohol > 10

- `wine_data` is a 2D NumPy array: each row is one wine sample and each column is one feature.
- `wine_data[:, 10]` selects every row but only the 11th column (indexing is zero-based, so column index 10 is the 11th column).
- `> 10` creates a boolean mask: each value in that column is compared with 10, giving a 1D NumPy array of `True`/`False` values, one per row.

In [52]:
wine_data[:,10]>10

array([False, False, False, ...,  True,  True,  True])

#### How the Code Works

`high_alcohol = wine_data[wine_data[:, 10] > 10]` creates a new array containing only the rows of `wine_data` where the alcohol content (column index 10) is greater than 10. It works in two steps:

- `wine_data[:, 10] > 10` builds the boolean mask: a 1D array of `True`/`False` values, one for each row in `wine_data`.
- `wine_data[ ... ]` applies the mask: placed inside the square brackets of the original 2D array, it keeps only the rows where the mask is `True`.

The rest of the code prints the new array and the number of such "high alcohol" wines (`high_alcohol.shape[0]`).

#### Why Does wine_data Appear Twice?

The first use (`wine_data[:, 10] > 10`) generates the filter: it checks the condition on every row and produces the mask.

The second use (`wine_data[ ... ]`) applies that filter: it uses the mask to extract the subset of rows that meet the condition.

This is a common and efficient NumPy pattern. No Python loop is needed, and the result is a new array (a copy) holding only the selected rows, so `wine_data` itself is left unchanged.





In [53]:
high_alcohol = wine_data[wine_data[:, 10] > 10]
print(high_alcohol)
high_alcohol.shape
print("High alcohol wines:", high_alcohol.shape[0])


[[ 7.5    0.5    0.36  ...  0.8   10.5    5.   ]
 [ 7.5    0.5    0.36  ...  0.8   10.5    5.   ]
 [ 8.5    0.28   0.56  ...  0.75  10.5    7.   ]
 ...
 [ 6.3    0.51   0.13  ...  0.75  11.     6.   ]
 [ 5.9    0.645  0.12  ...  0.71  10.2    5.   ]
 [ 6.     0.31   0.47  ...  0.66  11.     6.   ]]
High alcohol wines: 852


Filter wines with quality (last column) ≥ 7

In [54]:
good_wines = wine_data[wine_data[:, -1] >= 7]
print("Good quality wines:", good_wines.shape[0])


Good quality wines: 217
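The two filters above can also be combined. Boolean masks are joined elementwise with `&` (and) and `|` (or); Python's `and`/`or` keywords do not work on arrays. The sketch below uses a small made-up array with only two columns, so the column indices differ from those of `wine_data`.

```python
import numpy as np

# Made-up rows: [alcohol, quality]
wines = np.array([[ 9.4, 5],
                  [10.5, 7],
                  [11.0, 6],
                  [12.1, 8],
                  [ 9.8, 7]])

high_alcohol = wines[:, 0] > 10
good_quality = wines[:, 1] >= 7

both = high_alcohol & good_quality     # rows meeting both conditions
either = high_alcohol | good_quality   # rows meeting at least one

print(both.sum())     # 2
print(either.sum())   # 4
print(wines[both])
```

When the comparisons are written inline, each one needs its own parentheses, for example `(wines[:, 0] > 10) & (wines[:, 1] >= 7)`, because `&` binds more tightly than `>`.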


In [55]:
lst=[1,2,3,4,5,6]
print(lst[:5])

[1, 2, 3, 4, 5]


Slice first 5 rows and columns

In [56]:
print(wine_data[:5, :5])


[[ 7.4    0.7    0.     1.9    0.076]
 [ 7.8    0.88   0.     2.6    0.098]
 [ 7.8    0.76   0.04   2.3    0.092]
 [11.2    0.28   0.56   1.9    0.075]
 [ 7.4    0.7    0.     1.9    0.076]]


Compute average alcohol content per quality level

In [57]:
np.unique(wine_data[:, -1])

array([3., 4., 5., 6., 7., 8.])

In [58]:
wine_data[:, -1] == 5

array([ True,  True,  True, ..., False,  True, False])

In [59]:
np.mean(wine_data[wine_data[:, -1] == 5][:,10])

np.float64(9.899706314243758)

In [60]:
for quality in np.unique(wine_data[:, -1]):
    avg_alcohol = np.mean(wine_data[wine_data[:, -1] == quality][:, 10])
    print(f"Quality {int(quality)}: Avg Alcohol = {avg_alcohol:.2f}")


Quality 3: Avg Alcohol = 9.96
Quality 4: Avg Alcohol = 10.27
Quality 5: Avg Alcohol = 9.90
Quality 6: Avg Alcohol = 10.63
Quality 7: Avg Alcohol = 11.47
Quality 8: Avg Alcohol = 12.09


Compute correlation between alcohol and quality

In [61]:
alcohol = wine_data[:, 10]
quality = wine_data[:, -1]
correlation = np.corrcoef(alcohol, quality)[0, 1]
print("Correlation between alcohol and quality:", correlation)


Correlation between alcohol and quality: 0.47616632400113584
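`np.corrcoef` returns a 2×2 correlation matrix, and the `[0, 1]` entry is the Pearson coefficient between the two inputs. As a sketch of what that number means, the code below computes the coefficient by hand (sum of products of deviations divided by the square root of the product of summed squared deviations) and compares it with NumPy's result. The five data points are made up for illustration.

```python
import numpy as np

# Illustrative values only, not taken from the dataset.
alcohol = np.array([9.4, 9.8, 10.5, 11.2, 12.0])
quality = np.array([5.0, 5.0, 6.0, 6.0, 7.0])

a_dev = alcohol - alcohol.mean()
q_dev = quality - quality.mean()

manual_r = (a_dev * q_dev).sum() / np.sqrt((a_dev ** 2).sum() * (q_dev ** 2).sum())
numpy_r = np.corrcoef(alcohol, quality)[0, 1]

print(np.isclose(manual_r, numpy_r))  # True
```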


### Compute Weighted Wine Score
Let’s say you care about three columns:

- Alcohol (index 10)
- Sulphates (index 9)
- Volatile acidity (index 1)

In [62]:
# Select relevant columns
features = wine_data[:, [10, 9, 1]]

# Assign weights: high alcohol is good, low volatile acidity is good
weights = np.array([0.4, 0.3, -0.3])

# Compute custom wine score
scores = features @ weights
print("Wine scores (first 5):", scores[:5])


Wine scores (first 5): [3.718 3.86  3.887 4.01  3.718]


## Broadcasting

NumPy lets you perform mathematical operations with operators like `+`, `-`, `*`, `/` on arrays.
You can use these with either a single number (scalar) or another array of the same shape.
Here are some useful examples:


In [63]:
import numpy as np

# Example arrays
matrixA = np.array([[4, 7, 2, 5],
                    [6, 3, 8, 1],
                    [0, 9, 2, 6]])

matrixB = np.array([[14, 11, 16, 12],
                    [13, 19, 10, 15],
                    [21, 14, 18, 13]])

# matrix sum
print("# matrix sum\n", matrixA + matrixB)

# Subtract arrays
print("\n# Subtract arrays\n", matrixB - matrixA)

# Add a scalar
print("# Add a scalar\n", matrixA + 2)


# Divide by scalar
print("\n# Divide by scalar\n", matrixA / 2)

# Elementwise multiplication
print("\n# Elementwise multiplication\n", matrixA * matrixB)

# Modulus with scalar
print("\n# Modulus with scalar\n", matrixA % 3)


# matrix sum
 [[18 18 18 17]
 [19 22 18 16]
 [21 23 20 19]]

# Subtract arrays
 [[10  4 14  7]
 [ 7 16  2 14]
 [21  5 16  7]]
# Add a scalar
 [[ 6  9  4  7]
 [ 8  5 10  3]
 [ 2 11  4  8]]

# Divide by scalar
 [[2.  3.5 1.  2.5]
 [3.  1.5 4.  0.5]
 [0.  4.5 1.  3. ]]

# Elementwise multiplication
 [[ 56  77  32  60]
 [ 78  57  80  15]
 [  0 126  36  78]]

# Modulus with scalar
 [[1 1 2 2]
 [0 0 2 1]
 [0 0 2 0]]


## Array Broadcasting

Broadcasting describes how NumPy handles operations between arrays of different shapes. Instead of forcing you to create arrays with the exact same dimensions, NumPy automatically "broadcasts" the smaller array across the larger one so that their shapes match.

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimension and works its way left. Two dimensions are compatible when they are equal, or one of them is 1.
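The compatibility rule can be checked without building any arrays. The sketch below uses `np.broadcast_shapes` (available in NumPy 1.20 and later), which applies the rule to plain shape tuples and either returns the resulting shape or raises `ValueError`.

```python
import numpy as np

# Trailing dimensions match (4 == 4), so the result is (3, 4).
print(np.broadcast_shapes((3, 4), (4,)))          # (3, 4)

# A dimension of size 1 stretches to match the other array.
print(np.broadcast_shapes((2, 3), (2, 1)))        # (2, 3)
print(np.broadcast_shapes((1, 3, 5), (4, 3, 1)))  # (4, 3, 5)

# Trailing dimensions 4 and 2 differ and neither is 1: incompatible.
try:
    np.broadcast_shapes((3, 4), (2,))
except ValueError as exc:
    print("Incompatible:", exc)
```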

In [64]:
a = np.array([1, 2, 3])
b = 10
print(a + b)

[11 12 13]


2D array and 1D array


In [65]:
A = np.array([[1, 2, 3],

              [4, 5, 6]])

B = np.array([10, 20, 30])
print(A + B)
# Here, B is broadcast across each row of A

[[11 22 33]
 [14 25 36]]


In [66]:
matrixA = np.array([[4, 7, 2, 5],
                    [6, 3, 8, 1],
                    [0, 9, 2, 6]])
vectorC = np.array([2, 4, 6, 8])

print("matrixA.shape:", matrixA.shape)
print("vectorC.shape:", vectorC.shape)
print("\n# Broadcasting add\n", matrixA + vectorC)

vectorD = np.array([5, 7])
# This will error:
try:
    matrixA + vectorD
except Exception as e:
    print("\n# Broadcasting error:", e)


matrixA.shape: (3, 4)
vectorC.shape: (4,)

# Broadcasting add
 [[ 6 11  8 13]
 [ 8  7 14  9]
 [ 2 13  8 14]]

# Broadcasting error: operands could not be broadcast together with shapes (3,4) (2,) 


In [67]:
A = np.array([[1, 2, 3],

              [4, 5, 6]])

B = np.array([[10],

              [20]])
print(A + B)


[[11 12 13]
 [24 25 26]]


In [68]:
a = np.array([1, 2, 3])
b = np.array([1, 2])
a + b  # This will raise an error because their shapes are not compatible!

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

In [None]:
A = np.array([[1],
              [2],
              [3]])   # Shape (3, 1)
B = np.array([10, 20, 30])      # Shape (3,)
print(A.shape)
print(B.shape)
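The cell above only prints the two shapes. As a sketch of what adding them would produce: a `(3, 1)` column and a `(3,)` vector broadcast to a `(3, 3)` grid, where every element of `A` is paired with every element of `B`.

```python
import numpy as np

A = np.array([[1], [2], [3]])   # shape (3, 1)
B = np.array([10, 20, 30])      # shape (3,)

C = A + B
print(C.shape)  # (3, 3)
print(C)
# [[11 21 31]
#  [12 22 32]
#  [13 23 33]]
```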

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5])
print(a.shape,b.shape)
a+b

In [None]:
np.ones((4, 3, 5))

In [None]:
a = np.ones((1, 3, 5))
b = np.ones((4,3,1))
print(a.shape)
print(b.shape)
# No error: each size-1 axis is stretched, so a + b has shape (4, 3, 5)
c=a + b

print(c.shape)

## Array Comparison

Elementwise comparisons in NumPy return boolean arrays.
You can use operations like `==`, `!=`, `>`, `<`, `>=`, `<=`.

Example:


In [None]:
X = np.array([[5, 1, 8], [2, 4, 7]])
Y = np.array([[3, 1, 8], [2, 9, 6]])

print("# X == Y\n", X == Y)
print("\n# X >= Y\n", X >= Y)
print("\n# Count not equal:", (X != Y).sum())
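A boolean array from a comparison is often reduced to a single answer. The sketch below, on the same `X` and `Y`, uses `.any()` and `.all()` for that purpose, and `np.array_equal` as a shortcut for testing whether two arrays match in shape and in every element.

```python
import numpy as np

X = np.array([[5, 1, 8], [2, 4, 7]])
Y = np.array([[3, 1, 8], [2, 9, 6]])

equal_mask = X == Y

print(equal_mask.any())              # True: at least one position matches
print(equal_mask.all())              # False: not every position matches
print(np.array_equal(X, Y))          # False
print(np.array_equal(X, X.copy()))   # True
```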


## Array Indexing and Slicing

You can select single elements, slices, or subarrays in NumPy arrays using indices and ranges.
Here are a few examples:


In [None]:
cube = np.array([
    [[ 3,  5,  7], [ 8, 10, 12]],
    [[13, 15, 17], [18, 20, 22]],
    [[23, 25, 27], [28, 30, 32]]
])

# print("# Shape:", cube.shape)

# Single element
print("\n# cube[2, 1, 0]:", cube[2, 1, 0])

# Subarray using ranges
print("\n# cube[1:, :, 1]:\n", cube[1:, :, 1])

# Mixing indices and ranges
print("\n# cube[1, :, 1:]:\n", cube[1, :, 1:])

# Fewer indices (returns 2D slice)
print("\n# cube[2]:\n", cube[2])
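One detail worth knowing about slicing: basic indexing with integers and ranges returns a view that shares memory with the original array, whereas boolean indexing returns a copy. The sketch below demonstrates this on a small array built with `np.arange` (not the `cube` defined above).

```python
import numpy as np

cube = np.arange(12).reshape(2, 2, 3)

# Basic indexing returns a view: writing through it changes `cube`.
view = cube[0]
view[0, 0] = 99
print(cube[0, 0, 0])   # 99

# An explicit copy is independent of the original.
independent = cube[1].copy()
independent[0, 0] = -1
print(cube[1, 0, 0])   # 6, unchanged

# Boolean indexing returns a copy, so this write does not reach `cube`.
picked = cube[cube > 50]
picked[0] = 0
print(cube[0, 0, 0])   # still 99
```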


## Different Ways to Initialize NumPy Arrays

NumPy offers many built-in methods to create arrays with preset or random values.  
Here are some useful examples with different shapes and values.  
Check the [official docs](https://numpy.org/doc/stable/reference/routines.array-creation.html) for more options!


In [None]:
import numpy as np

# All zeros array
zero_grid = np.zeros((4, 2))
print("# All zeros array\n", zero_grid)

# All ones, higher dimension
ones_cube = np.ones((2, 3, 2))
print("\n# All ones (3D)\n", ones_cube)

# Identity matrix
identity = np.eye(4)
print("\n# Identity matrix\n", identity)

# Random vector (0 to 1)
rand_vec = np.random.rand(6)
print("\n# Random vector\n", rand_vec)

# Random matrix, standard normal distribution
rand_matrix = np.random.randn(3, 4)
print("\n# Random matrix (normal dist)\n", rand_matrix)

# Array with all entries set to a fixed value
fixed_arr = np.full((3, 2), 77)
print("\n# All 77s\n", fixed_arr)

# Array with range and step
range_arr = np.arange(5, 50, 7)
print("\n# Range with step\n", range_arr)

# Equally spaced points in an interval
eq_space = np.linspace(2, 18, 9)
print("\n# Evenly spaced in [2,18]\n", eq_space)
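The random examples above use the legacy `np.random.rand` and `np.random.randn` functions, which still work. As an alternative sketch, NumPy's documentation recommends the `Generator` API for new code; passing a seed (42 here is an arbitrary choice) makes the results reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

uniform_vec = rng.random(6)                   # uniform values in [0, 1)
normal_matrix = rng.standard_normal((3, 4))   # standard normal values

# A second generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(rng_again.random(6), uniform_vec))  # True
```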


# Assignment


Use the [Wine Quality Red dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) from UCI.

**Rules:**
- Use only NumPy (not Pandas, not scikit-learn, not SciPy)
- Download, load, and manipulate the data with NumPy only

### 1. What is the shape of the data array? (How many wine samples and how many columns of features does it have?)


In [None]:
print("Shape of wine data:", wine_data.shape)


### 2. Calculate the average (mean) quality score of the red wines. (Hint: this is the mean of the last column in the data array.)



In [None]:
# It explicitly selects all rows but only the last column ([:, -1]) and then computes the mean of that slice.

mean_quality = wine_data[:,-1].mean()
print("Average quality score of red Wine:",mean_quality)

In [None]:
mean_quality = np.mean(wine_data[:,-1])
print("Average quality score of red Wine:",mean_quality)

In [None]:
# Common mistake: wine_data[:-1] (without the comma) slices rows, not columns.
# It means "all rows except the last one, across all columns",
# so this is the mean of nearly the whole array, not the mean quality.

not_mean_quality = wine_data[:-1].mean()
print("Mean over all columns (not the quality score):", not_mean_quality)


### 3. What are the minimum and maximum pH values observed in the dataset? (Use NumPy to find the smallest and largest pH.)







In [None]:
min_ph = np.min(wine_data[:,-4])
max_ph = np.max(wine_data[:,-4])
print("Minimum pH value:", min_ph)
print("Maximum pH value:", max_ph)

In [None]:
ph_column = 8
min_ph = np.min(wine_data[:,ph_column])
max_ph = np.max(wine_data[:,ph_column])
print("Minimum pH value:", min_ph)
print("Maximum pH value:", max_ph)

### 4. Determine the mean and standard deviation of the alcohol content in these wines. (Calculate the average alcohol percentage and how much it varies.)


In [None]:
alcohol_column = 10
mean_alcohol = np.mean(wine_data[:,alcohol_column])
std_alcohol = np.std(wine_data[:,alcohol_column])
print("Mean alcohol content:", mean_alcohol)
print("Standard deviation of alcohol content:", std_alcohol)

### 5. List all the unique quality ratings present in the dataset. (What distinct quality values appear? Use np.unique on the quality column.)


In [None]:
unique_ratings = np.unique(wine_data[:,-1])
print("Unique quality ratings:", unique_ratings)

### 6. How many wine samples have a quality rating of 7? (Count the number of entries where the quality column is exactly 7.)


In [None]:
count_7 = np.count_nonzero(wine_data[:,-1] == 7)
print("Number of wines with quality 7:", count_7)

# For comparison: the same count for quality 5 and quality 6
count_5 = np.count_nonzero(wine_data[:,-1] == 5)
print("Number of wines with quality 5:", count_5)

count_6 = np.count_nonzero(wine_data[:,-1] == 6)
print("Number of wines with quality 6:", count_6)

### 7. How many wines have an alcohol content greater than 10%? (Hint: create a boolean mask for alcohol > 10 and sum it or use np.where.)



In [None]:
alcohol_column = 10
alcohol_greater_than_10 = wine_data[:,alcohol_column] > 10
num_wines = np.sum(alcohol_greater_than_10)
print("Number of wines with alcohol content > 10%:", num_wines)

In [None]:
alcohol_greater_than_10 = wine_data[:,-2] > 10
num_wines = np.sum(alcohol_greater_than_10)
print("Number of wines with alcohol content > 10%:", num_wines)

### 8. What is the average citric acid concentration across all red wines? (Compute the mean of the citric acid column.)


In [70]:
average_citric_acid = np.mean(wine_data[:,2])
print("Average citric acid concentration:", average_citric_acid)

Average citric acid concentration: 0.2709756097560976


In [71]:
average_citric_acid = wine_data[:,2].mean()
print("Average citric acid concentration:", average_citric_acid)

Average citric acid concentration: 0.2709756097560976


### 9. Determine the median residual sugar content in the dataset. (Hint: you can use np.median or np.percentile with 50th percentile to find the median residual sugar.)


In [72]:
median_residual_sugar = np.median(wine_data[:,3])
print("Median residual sugar content:", median_residual_sugar)

Median residual sugar content: 2.2


In [73]:
median_residual_sugar = np.percentile(wine_data[:,3],50)
print("Median residual sugar content:", median_residual_sugar)

Median residual sugar content: 2.2


### 10. What is the 75th percentile (upper quartile) of the alcohol content? (In other words, 25% of the wines have an alcohol content above what value?)


In [74]:
alcohol_content = np.percentile(wine_data[:,10],75)
print("75th percentile of alcohol content:", alcohol_content)

75th percentile of alcohol content: 11.1



### 11. Determine how many wines fall into each quality score category. (Find the count of samples for each unique quality value.) Which quality rating is most common in the red wine dataset?



In [81]:
quality_scores = wine_data[:, -1]
unique_qualities, counts = np.unique(quality_scores, return_counts=True)

print("Quality scores:", unique_qualities)
print("Counts:", counts)

total_unique_qualities = len(unique_qualities)
print("Total unique quality scores:", total_unique_qualities)

Quality scores: [3. 4. 5. 6. 7. 8.]
Counts: [ 10  53 681 638 199  18]
Total unique quality scores: 6
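The cell above prints the counts but does not pick out the most common rating. One way to do that, sketched here with the values from the printed output hard-coded so the snippet runs on its own, is to take `np.argmax` of the counts and use it to index the unique values.

```python
import numpy as np

# Values copied from the output above.
unique_qualities = np.array([3., 4., 5., 6., 7., 8.])
counts = np.array([10, 53, 681, 638, 199, 18])

# argmax gives the position of the largest count.
most_common = unique_qualities[np.argmax(counts)]
print("Most common quality:", int(most_common), "with", counts.max(), "samples")
# Most common quality: 5 with 681 samples
```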



### 12. Which five wine samples have the highest alcohol content? Identify their alcohol values and corresponding quality scores. (Hint: use np.argsort to sort by the alcohol column and pick the top 5 entries.)


In [82]:
import numpy as np

# Assuming wine_data is already loaded as a NumPy array

# Specify the alcohol column index
alcohol_col_index = -2
quality_col_index = -1  # last column

# Get indices that would sort the alcohol column in ascending order
sorted_indices = np.argsort(wine_data[:, alcohol_col_index])

# Pick the indices of the top 5 samples with highest alcohol, reverse for descending
top5_indices = sorted_indices[-5:][::-1]

# Extract the alcohol and quality values for these samples
top5_alcohol = wine_data[top5_indices, alcohol_col_index]
top5_quality = wine_data[top5_indices, quality_col_index]

# Print results
for i, idx in enumerate(top5_indices):
    print(f"Sample index {idx}: Alcohol = {top5_alcohol[i]}, Quality = {int(top5_quality[i])}")


Sample index 652: Alcohol = 14.9, Quality = 5
Sample index 144: Alcohol = 14.0, Quality = 6
Sample index 142: Alcohol = 14.0, Quality = 6
Sample index 1270: Alcohol = 14.0, Quality = 6
Sample index 821: Alcohol = 14.0, Quality = 7


### 13. Do wines with higher alcohol content tend to have higher quality? To investigate, compare the average quality of wines with above-average alcohol to those with below-average alcohol. (Calculate the mean quality for wines where alcohol is above the overall average, and compare it to the mean quality for wines where alcohol is below the average.)

### 14. Compute the average alcohol percentage for each quality score in the dataset. (Group the data by the quality column and calculate the mean alcohol content for each quality value. Which quality level has the highest average alcohol content?)

### 15. Define total acidity as the sum of fixed acidity and volatile acidity for each wine. First, calculate the total acidity for every sample. Next, add this as a new column to the dataset (so the array now has 13 columns). Which wine has the highest total acidity, and what is its quality rating? (Find the index of the max total acidity and check the quality at that index.)

### 16. Compute the Pearson correlation coefficient between alcohol content and quality. (Use NumPy to see if higher alcohol correlates with higher quality. A positive correlation would indicate that as alcohol increases, quality tends to increase.)

### 17. How many wines have quality >= 7 and alcohol > 10%? (In other words, count the wines that are high quality and also have relatively high alcohol. Use a boolean condition combining both criteria.)

### 18. Which feature exhibits the greatest variability among all the wines? (Calculate the standard deviation of each feature column, considering the 11 physicochemical features, and identify which feature has the highest standard deviation.)

### 19. Compare the mean volatile acidity of high-quality wines versus lower-quality wines. (Consider wines with quality ≥ 7 as "high quality" and those with quality ≤ 6 as "lower quality". Compute the average volatile acidity in each group. Do high-quality wines have lower volatile acidity on average?)

### 20. What is the highest quality score attained by any red wine in the dataset, and how many samples achieved that score? (Find the maximum value in the quality column, and count how many wines have that maximum quality.)
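Question 15 relies on a technique not shown earlier in the notebook: appending a computed column to a 2D array. The sketch below illustrates one possible approach with `np.column_stack` and `np.argmax`. It uses a made-up three-column array, so the column indices differ from those in `wine_data`, and it is an illustration of the technique rather than the answer for the real dataset.

```python
import numpy as np

# Made-up rows: [fixed acidity, volatile acidity, quality]
data = np.array([[ 7.4, 0.70, 5],
                 [11.2, 0.28, 6],
                 [ 7.8, 0.88, 5]])

# New feature: sum of the two acidity columns, one value per row.
total_acidity = data[:, 0] + data[:, 1]

# Append it as an extra column; the original array is not modified.
extended = np.column_stack([data, total_acidity])
print(extended.shape)  # (3, 4)

# Row index of the largest total acidity, then that row's quality.
idx = np.argmax(total_acidity)
print("Row", idx, "has quality", int(data[idx, 2]))  # Row 1 has quality 6
```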