# Chapter 4: NumPy Basics: Arrays and Vectorized Computation

In [23]:
import numpy as np

## Dataset: 80 Cereals - Nutrition data on 80 cereal products

The data file can be downloaded from [Kaggle.com](https://www.kaggle.com/crawford/80-cereals)

>If you like to eat cereal, do yourself a favor and avoid this dataset at all costs. After seeing these data it will never be the same for me to eat Fruity Pebbles again. - Kaggle

- Download the zip file from Kaggle (login required)
- Unzip it to get the `cereal.csv` file
- Move the CSV file to a folder of your choice
- Open the CSV file in Notepad and Excel to examine its contents

In [55]:
# Load the CSV file with np.loadtxt()
# Spoiler alert: in the next chapter we will learn a more user-friendly
# way of loading data.
import os
print("My current working directory:", os.getcwd())
print("Make sure the CSV file exists:", os.listdir('/Users/baboury/Desktop/CMP646DATA'))

# How to use np.loadtxt()?
# ?np.loadtxt   # Display documentation
# ??np.loadtxt  # Display source code

My current working directory: /Users/baboury/CMP464-Fall2019
Make sure the CSV file exists: ['cereal.csv']


In [26]:
# Try an example from documentation
from io import StringIO
c = StringIO("0 1\n2 3")
np.loadtxt(c)

array([[0., 1.],
       [2., 3.]])
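
The same in-memory trick works for comma-separated text. The sketch below assumes a tiny made-up table (not the cereal file) and shows the two arguments used in the next cell: `delimiter=","` to split on commas, and `dtype=str` to keep every field as text, since a header row and a name column cannot be parsed as floats.

```python
from io import StringIO
import numpy as np

# Hypothetical three-line CSV, standing in for a real file on disk
csv_text = StringIO("name,calories\nA,70\nB,120")

table = np.loadtxt(csv_text, dtype=str, delimiter=",")
print(table.shape)                  # (3, 2)
print(table[0, :])                  # header row
print(table[1:, 1].astype(float))   # numeric column converted from text
```

Because everything is stored as text, any column used in arithmetic has to be converted with `astype(float)` first.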

In [62]:
# Load cereal.csv as a numpy array named raw_data

raw_data = np.loadtxt("/Users/baboury/Desktop/CMP646DATA/cereal.csv", dtype=str, delimiter=",")
print(raw_data[0,:])


['name' 'mfr' 'type' 'calories' 'protein' 'fat' 'sodium' 'fiber' 'carbo'
 'sugars' 'potass' 'vitamins' 'shelf' 'weight' 'cups' 'rating']


In [65]:
# What is the shape of raw_data?

print("The shape of raw_data is:", raw_data.shape)

The shape of raw_data is: (78, 16)


In [70]:
# Create an array of feature names (call it feature_names)
feature_names = raw_data[0,:]
#print("list of feature names:", feature_names)
# Print the names in a nicer format:
# build a string that joins all values from the array.
# Keep feature_names itself as an array so it can be searched by name later.
print(",".join(feature_names))


name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating


In [73]:
# Assign the remaining rows (everything after the header) to data

data = raw_data[1:, :]
print("the rest of the data is:", data)

# Print the shape of data

print("the shape of the new data is : ", data.shape)



the rest of the data is: [['100% Bran' 'N' 'C' ... '1' '0.33' '68.402973']
 ['100% Natural Bran' 'Q' 'C' ... '1' '1' '33.983679']
 ['All-Bran' 'K' 'C' ... '1' '0.33' '59.425505']
 ...
 ['Wheat Chex' 'R' 'C' ... '1' '0.67' '49.787445']
 ['Wheaties' 'G' 'C' ... '1' '1' '51.592193']
 ['Wheaties Honey Gold' 'G' 'C' ... '1' '0.75' '36.187559']]
the shape of the new data is :  (77, 16)


### Content
What are the features?

Next, let's examine some of the important features.

In [74]:
# ---------Names------------
# Display the list of cereal names
print("\n".join(data[:,0]))


100% Bran
100% Natural Bran
All-Bran
All-Bran with Extra Fiber
Almond Delight
Apple Cinnamon Cheerios
Apple Jacks
Basic 4
Bran Chex
Bran Flakes
Cap'n'Crunch
Cheerios
Cinnamon Toast Crunch
Clusters
Cocoa Puffs
Corn Chex
Corn Flakes
Corn Pops
Count Chocula
Cracklin' Oat Bran
Cream of Wheat (Quick)
Crispix
Crispy Wheat & Raisins
Double Chex
Froot Loops
Frosted Flakes
Frosted Mini-Wheats
Fruit & Fibre Dates; Walnuts; and Oats
Fruitful Bran
Fruity Pebbles
Golden Crisp
Golden Grahams
Grape Nuts Flakes
Grape-Nuts
Great Grains Pecan
Honey Graham Ohs
Honey Nut Cheerios
Honey-comb
Just Right Crunchy  Nuggets
Just Right Fruit & Nut
Kix
Life
Lucky Charms
Maypo
Muesli Raisins; Dates; & Almonds
Muesli Raisins; Peaches; & Pecans
Mueslix Crispy Blend
Multi-Grain Cheerios
Nut&Honey Crunch
Nutri-Grain Almond-Raisin
Nutri-grain Wheat
Oatmeal Raisin Crisp
Post Nat. Raisin Bran
Product 19
Puffed Rice
Puffed Wheat
Quaker Oat Squares
Quaker Oatmeal
Raisin Bran
Raisin Nut Bran
Raisin Squares
Rice Chex
Rice Kr

In [77]:
# The names are nicely sorted. How do we sort an array ourselves?
ary = np.array([1,3,2,7,5])
print(ary)
# ary = np.sort(ary)  # alternative: returns a sorted copy
ary.sort()            # sorts the array in place

print(ary)





[1 3 2 7 5]
[1 2 3 5 7]


In [78]:
matrix = np.array([[1, 4, 23, 19], 
                   [5, 2, 6, -20]])
print(matrix)
print("sorting each row:")
print(np.sort(matrix, axis=1)) # or: np.sort(matrix)
print("sorting each column:")
print(np.sort(matrix, axis=0))



[[  1   4  23  19]
 [  5   2   6 -20]]
sorting each row:
[[  1   4  19  23]
 [-20   2   5   6]]
sorting each column:
[[  1   2   6 -20]
 [  5   4  23  19]]
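
`np.sort` returns the sorted values but forgets where they came from. When a second array has to follow the same order (for example, names reordered by a score), `np.argsort` returns the positions instead. A minimal sketch with made-up values; `labels` and `scores` are placeholder names:

```python
import numpy as np

# Hypothetical labels and scores, for illustration only
labels = np.array(["a", "b", "c", "d"])
scores = np.array([3.0, 1.0, 4.0, 2.0])

order = np.argsort(scores)        # indices that would sort scores ascending
print(order)                      # [1 3 0 2]
print(labels[order])              # labels from lowest to highest score
print(labels[order[::-1][:2]])    # the two labels with the highest scores
```

Reversing `order` and slicing is one simple way to pick the top few items; the same pattern applies to any pair of parallel arrays.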


In [85]:
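# The names are nicely sorted. How do we sort an array ourselves?

# To make an example, shuffle a copy of one column first.
# Column 15 holds the ratings (the names are in column 0).
rating_column = data[:, 15]
example = np.array(rating_column)  # np.array() copies, so data is left untouched
np.random.shuffle(example)
print("\n".join(example))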


28.025765
35.782791
29.924285
63.005645
45.811716
45.863324
60.756112
40.400208
40.105965
28.742414
23.804043
53.371007
30.313351
46.658844
37.038562
36.187559
28.592785
32.207582
37.136863
39.259197
40.560159
72.801787
59.425505
36.176196
33.174094
93.704912
30.450843
31.230054
34.139765
27.753301
39.106174
39.703400
53.131324
38.839746
74.472949
59.363993
21.871292
37.840594
52.076897
49.120253
64.533816
49.511874
31.072217
51.592193
41.503540
50.828392
29.509541
39.241114
31.435973
53.313813
36.471512
19.823573
35.252444
68.235885
40.448772
44.330856
22.736446
22.396513
26.734515
54.850917
59.642837
58.345141
33.983679
68.402973
41.998933
18.042851
55.333142
46.895644
40.692320
45.328074
41.445019
50.764999
49.787445
40.917047
41.015492
36.523683
34.384843
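
One caveat when sorting the shuffled values back: `data` was loaded with `dtype=str`, so its entries are text, and `np.sort` on text compares character by character rather than by magnitude. The sketch below uses a few made-up strings to show the difference and the usual fix, converting with `astype(float)` first.

```python
import numpy as np

# Hypothetical numeric strings, like the entries of a column loaded with dtype=str
text_values = np.array(["9.5", "10.25", "100.0", "28.0"])

print(np.sort(text_values))                 # text order: '10.25' '100.0' '28.0' '9.5'
print(np.sort(text_values.astype(float)))   # numeric order: 9.5 10.25 28.0 100.0
```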


In [None]:
# ------------- Weight -------------
# What is the index of weight in feature_names?



# How many different weights per serving are there?
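
One possible approach to the two questions above, sketched on a small made-up table rather than the cereal data: compare the header array with the wanted name to find its column index, then pass that column to `np.unique` to list its distinct values. The names `header` and `rows` are placeholders, and the lookup assumes the header is still a NumPy array of strings, not a joined string.

```python
import numpy as np

# Hypothetical header and rows, all stored as text as np.loadtxt(dtype=str) would
header = np.array(["name", "weight", "cups"])
rows = np.array([["A", "1", "0.33"],
                 ["B", "1.33", "0.67"],
                 ["C", "1", "1"]])

# Position of the first header entry equal to "weight"
weight_idx = int(np.where(header == "weight")[0][0])
print(weight_idx)

# Distinct weights and how many rows carry each one
weights = rows[:, weight_idx].astype(float)
values, counts = np.unique(weights, return_counts=True)
print(values, counts)
print("number of different weights:", len(values))
```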




## Measure nutrition by serving

The following project is inspired by [this Kaggle kernel](https://www.kaggle.com/frankwwu/how-cereal-manufacturers-mislead-consumers).

Manufacturers report nutrition per serving, but each manufacturer defines a serving with its own weight and cup size. For consumers, comparing nutrition facts measured against different servings is confusing in practice: standing in a grocery store with cereals that each use a different serving, you would need a calculator and a piece of paper to compare them.
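
To put every product on the same footing, a nutrient can be divided by the serving weight, which gives an amount per unit of weight. A minimal sketch with made-up numbers (the arrays `sugars` and `weight` below are placeholders, not values from the dataset):

```python
import numpy as np

# Hypothetical per-serving values stored as text, as in an array loaded with dtype=str
sugars = np.array(["6", "8", "5"]).astype(float)        # sugar per serving
weight = np.array(["1", "1.33", "0.5"]).astype(float)   # weight of one serving

# Elementwise division: one result per product, no loop needed
sugars_per_weight = sugars / weight
print(sugars_per_weight)
```

The division works position by position because both arrays have the same length, so row `i` of the result belongs to product `i`.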

In [None]:
# Divide sugars by weight



#### Arithmetic with NumPy arrays

In [None]:
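arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print(arr)
print(arr * arr)  # elementwise product, not matrix multiplication
print(arr - arr)  # elementwise difference

In [None]:
print(1 / arr)     # a scalar is applied to every element
print(arr ** 0.5)  # elementwise square root

In [None]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
print(arr2)
print(arr2 > arr)  # elementwise comparison gives a boolean array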

In [1]:
# Plot ratings vs. unified_sugars_data




In [None]:
# What is the maximum and minimum amount of sugar in a unified serving?
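
For a question like this one, `max()` and `min()` give the extreme values, while `argmax()` and `argmin()` give their positions, which can then index a parallel array of names. A sketch on made-up values; `names` and `amounts` are placeholders:

```python
import numpy as np

# Hypothetical names and normalized amounts, for illustration only
names = np.array(["A", "B", "C"])
amounts = np.array([6.0, 0.0, 15.0])

print("max:", amounts.max(), "for", names[amounts.argmax()])
print("min:", amounts.min(), "for", names[amounts.argmin()])
```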




### Create our own ratings

- good-cereal-rating = protein + fiber + vitamins
- bad-cereal-rating = fat + sodium + potass + sugars

In [None]:
# Columns: 4 = protein, 7 = fiber, 11 = vitamins (stored as text, so convert first)
good_rating = (data[:, 4].astype(float)
               + data[:, 7].astype(float)
               + data[:, 11].astype(float))
print(good_rating)

In [None]:
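import matplotlib.pyplot as plt
rating_data = data[:, 15].astype(float)  # column 15 is the published rating
plt.plot(rating_data, good_rating, 'b.')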

# Week 2 Homework
1. Fat
    - Calculate fat per gram
    - What is the maximum and minimum value for fat per gram?
2. Calories
    - Calculate calories per gram
    - Find the 5 cereals with the highest calories
    - Find the 5 cereals with the lowest calories
3. Bad rating
    - Calculate bad-cereal-rating for each cereal
    - Plot Ratings vs. Bad-Cereal-Ratings.
    - Do they agree with each other?