### Introduction to NumPy: Numerical Python

Sarah records her second-grade class's grades in an online spreadsheet. Her web browser records that she visited that spreadsheet, in addition to every other site she's visited. Those sites record her location, the time she spent on them, and where she visited next. The world is chock-full of all sorts of different datasets, and learning how to create, analyze, and manipulate these datasets can give us some insight and control over our digital surroundings.

In this lesson, we'll be constructing and manipulating single-variable datasets. One way to think of a single-variable dataset is that it contains answers to a question. For instance, we might ask 100 people, “How tall are you?” Their heights in inches would form our dataset.

To work with our datasets, we'll be using a powerful Python module known as NumPy, which stands for Numerical Python.

NumPy has many uses including:
* Efficiently working with many numbers at once
* Generating random numbers
* Performing many different numerical functions (e.g., calculating sin, cos, tan, mean, and median)

In the following exercises, we'll learn how to construct one- and two-dimensional arrays and perform basic array operations.

### NumPy Arrays

NumPy includes a powerful data structure known as an array. A NumPy array is similar to a Python list: it’s a data structure that organizes multiple items, which can be numbers, strings, or even other arrays. Unlike a list, however, an array stores all of its items as a single data type. If you mix types, such as numbers and strings, NumPy converts every item to one common type rather than keeping each item's original type.

Arrays are most powerful when they are used to store numbers. This is because arrays give us special ways of performing mathematical operations that are both simpler to write and more efficient computationally. We’ll get more into this later.

A NumPy array looks a lot like a Python list:

my_array = np.array([1, 2, 3, 4, 5, 6])
We can transform a regular list into a NumPy array by using np.array() and saving the value to a new variable:

my_list = [1, 2, 3, 4, 5, 6]
my_array = np.array(my_list)
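
As a quick check of the conversion above, the following sketch builds an array from a list and inspects the result. The `dtype` attribute and the `mixed` variable are additions for illustration and are not part of the lesson's exercises.

```python
import numpy as np

my_list = [1, 2, 3, 4, 5, 6]
my_array = np.array(my_list)

# np.array builds a new object; the original list is left unchanged.
print(type(my_list))    # <class 'list'>
print(type(my_array))   # <class 'numpy.ndarray'>

# Every element of an array shares one data type (dtype).
print(my_array.dtype.kind)   # i  (an integer type)

# Mixing integers and floats makes NumPy convert everything to floats.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype.kind)      # f  (a floating-point type)
```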

In [1]:
import numpy as np

In [2]:
test_1 = np.array([92, 94, 88, 91, 87])

#### Creating an Array from a CSV

Typically, you won't be entering data directly into an array. Instead, you'll be importing the data from somewhere else.

We're able to transform CSV (comma-separated values) files into arrays using the np.genfromtxt() function:

Consider the following CSV, sample.csv,

34,9,12,11,7
We can import this into a NumPy array using the following code:

csv_array = np.genfromtxt('sample.csv', delimiter=',')
Note that in this case, our file sample.csv has values separated by commas, so we use delimiter=',', but sometimes you'll find files with other delimiters, the most common being tabs or colons.

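Once imported, this CSV will create the following array. Note that np.genfromtxt reads values as floats by default, which is why each number is followed by a decimal point:

>>> csv_array
array([34.,  9., 12., 11.,  7.])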
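
To try the import without creating a file, the sketch below feeds np.genfromtxt an in-memory text buffer in place of sample.csv. The use of `io.StringIO` and the tab-separated variant are assumptions made for illustration; with a real file you would pass the filename as shown above.

```python
import io
import numpy as np

# Stand-in for sample.csv so the example runs without a file on disk.
comma_file = io.StringIO("34,9,12,11,7")
csv_array = np.genfromtxt(comma_file, delimiter=',')
print(csv_array.dtype)     # float64
print(csv_array.tolist())  # [34.0, 9.0, 12.0, 11.0, 7.0]

# The same values separated by tabs need a different delimiter.
tab_file = io.StringIO("34\t9\t12\t11\t7")
tab_array = np.genfromtxt(tab_file, delimiter='\t')
print(np.array_equal(csv_array, tab_array))  # True
```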

In [3]:
# test_2 = np.genfromtxt('test_2.csv', delimiter=',')

#### Operations with NumPy Arrays

Generally, NumPy arrays are more efficient than lists. One reason is that they allow you to do element-wise operations. An element-wise operation allows you to quickly perform an operation, such as addition, on each element in an array.

Let's compare how to add a number to each value in a Python list versus a NumPy array:

With a list
l = [1, 2, 3, 4, 5]
l_plus_3 = []
for i in range(len(l)):
    l_plus_3.append(l[i] + 3)

With an array
a = np.array(l)
a_plus_3 = a + 3
As we can see, if we were to add 3 to every number in a list, we would have to use a for loop or a list comprehension. With an array, we can just add 3. The same is true for subtraction, multiplication, and division.
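
The sketch below extends the comparison to the other arithmetic operators. It also shows one way the two types differ that is easy to trip over: multiplying a list repeats it, while multiplying an array scales each element. The variable names are illustrative only.

```python
import numpy as np

l = [1, 2, 3, 4, 5]
a = np.array(l)

# A list needs a loop or a comprehension for each operation.
l_times_2 = [value * 2 for value in l]

# An array applies the operation to every element at once.
a_minus_1 = a - 1   # subtraction
a_times_2 = a * 2   # multiplication
a_halved = a / 2    # division produces floats

print(l_times_2)            # [2, 4, 6, 8, 10]
print(a_times_2.tolist())   # [2, 4, 6, 8, 10]
print(a_minus_1.tolist())   # [0, 1, 2, 3, 4]
print(a_halved.tolist())    # [0.5, 1.0, 1.5, 2.0, 2.5]

# Multiplying the list itself repeats it instead of scaling it.
print(l * 2)                # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```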

We can also use NumPy Arrays to find the squares or square roots of each value.

Squaring each value:

>>> a ** 2
array([ 1,  4,  9, 16, 25])
(Note: ** is the exponent notation in Python. For example, 3 squared can be calculated using 3 ** 2.)

Taking the square root of each value:

>>> np.sqrt(a)
array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798])

In [4]:
test_1 = np.array([92, 94, 88, 91, 87])
test_2 = np.array([79, 100, 86, 93, 91])
test_3 = np.array([87, 85, 72, 90, 92])

test_3_fixed = test_3 + 2

print(test_3_fixed)

[89 87 74 92 94]


In [5]:
total_grade = test_1 + test_2 + test_3_fixed

final_grade = total_grade / 3

print(final_grade)

[ 86.66666667  93.66666667  82.66666667  92.          90.66666667]


#### Two-Dimensional Arrays

In Python, we can create lists that are made up of other lists. Similarly, in NumPy we can create an array of arrays. If the arrays that make up our bigger array are all the same size, then it has a special name: a two-dimensional array.

In the previous exercises, we stored the students' test scores in separate one-dimensional arrays for each test:

test_1 = np.array([92, 94, 88, 91, 87])
test_2 = np.array([79, 100, 86, 93, 91])
test_3 = np.array([87, 85, 72, 90, 92])
But we could have also stored all of this data in a single, two-dimensional array:

np.array([[92, 94, 88, 91, 87], 
          [79, 100, 86, 93, 91],
          [87, 85, 72, 90, 92]])
Here, each row represents a test, and each column represents a student. This allows us to store all of our data in a single array without losing any of its organization.
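
One way to confirm that the data really is organized as tests by students is to inspect the array's dimensions. The sketch below uses the `ndim` and `shape` attributes, which this lesson does not otherwise cover, and a hypothetical variable name `test_scores`.

```python
import numpy as np

test_scores = np.array([[92, 94, 88, 91, 87],
                        [79, 100, 86, 93, 91],
                        [87, 85, 72, 90, 92]])

# ndim is the number of dimensions; shape is (rows, columns).
print(test_scores.ndim)    # 2
print(test_scores.shape)   # (3, 5)  -> 3 tests, 5 students
```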

As we mentioned, a two-dimensional array is a list of lists where each list has the same number of elements. Here are some examples that are not two-dimensional arrays.

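This code will not create a two-dimensional array because the lists have different numbers of elements. Older versions of NumPy run it but produce a one-dimensional array of list objects, while NumPy 1.24 and later raise a ValueError instead: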

np.array([[29, 49,  6], 
          [77,  1]])
This code will not run because the [] for the outer lists are missing:

np.array([68, 16, 73],
         [61, 79, 30])

In [7]:
coin_toss = np.array([1, 0, 0, 1, 0])

coin_toss_again = np.array(
    [
        [1, 0, 0, 1, 0],
        [0, 0, 1, 1, 1],
    ]
)

#### Selecting Elements from a 1-D Array

NumPy allows us to select elements from an array using their indices. Consider the one-dimensional array

a = np.array([5, 2, 7, 0, 11])
If we wanted to select the first element in this array, we would call:

>>> a[0]
5
In typical Python fashion, the indices for an array start at 0. This is known as zero-indexed numbering. In the array above, 5 is known as the zeroth element, a[0]. It follows that 2 is the first element, a[1].

We can also use negative indices, which count from the opposite end of the array and start at -1. This is particularly useful when you want to access the last element or two of an array:

>>> a[-1]
11
>>> a[-2]
0
If we want to select multiple elements in the array, we can define a range, such as a[1:3], which selects the elements from a[1] up to, but not including, a[3].

>>> a[1:3]
array([2, 7])
Similarly, if we wanted to select all elements before a[3] we would use:

>>> a[:3]
array([5, 2, 7])
We can also use negative indices to select multiple elements. Let's say we want to select the last 3 elements in an array:

>>> a[-3:]
array([7, 0, 11])
Notice that when we select multiple elements, we get an array.
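
The selections above can be collected into one runnable sketch. The variable names are illustrative; the values come from the array `a` defined in this section.

```python
import numpy as np

a = np.array([5, 2, 7, 0, 11])

first = a[0]          # single element at index 0
last = a[-1]          # single element at the end
middle = a[1:3]       # indices 1 and 2 (index 3 is excluded)
up_to_three = a[:3]   # indices 0, 1, and 2
last_three = a[-3:]   # the final three elements

print(first, last)            # 5 11
print(middle.tolist())        # [2, 7]
print(up_to_three.tolist())   # [5, 2, 7]
print(last_three.tolist())    # [7, 0, 11]

# The length of a slice a[start:stop] is stop - start.
print(len(a[1:3]))            # 2
```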

In [8]:
test_1 = np.array([92, 94, 88, 91, 87])
test_2 = np.array([79, 100, 86, 93, 91])
test_3 = np.array([87, 85, 72, 90, 92])

jeremy_test_2 = test_2[3]

manual_adwoa_test_1 = test_1[1:3]

#### Selecting Elements from a 2-D Array

Selecting elements from a 2-d array is very similar to selecting them from a 1-d array; we just have two indices to select from. The syntax for selecting from a 2-d array is a[row,column] where a is the array.

It's important to note that when we work with arrays that have more than one dimension, the relationship between the interior arrays is defined in terms of axes. A two-dimensional array has two axes: axis 0 represents the values that share the same indexical position (are in the same column), and axis 1 represents the values that share an array (are in the same row). This is illustrated below.

Diagram showing the axes in an array

Consider the array

a = np.array([[32, 15, 6, 9, 14], 
              [12, 10, 5, 23, 1],
              [2, 16, 13, 40, 37]])
We can select specific elements using their indices:

>>> a[2,1]
16
Let's say we wanted to select an entire column. We can put : in place of the row index to select every row:

# selects the first column
>>> a[:,0]
array([32, 12,  2])
The same works if we want to select an entire row:

# selects the second row
>>> a[1,:]
array([12, 10,  5, 23,  1])
We can further narrow it down and select a range from a specific row:

# selects the first three elements of the first row
>>> a[0,0:3]
array([32, 15,  6])
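
The following sketch gathers the 2-d selections above and adds one extension not shown in the lesson: using a range for both the row and the column index to pull out a rectangular block. The variable names are hypothetical.

```python
import numpy as np

a = np.array([[32, 15, 6, 9, 14],
              [12, 10, 5, 23, 1],
              [2, 16, 13, 40, 37]])

element = a[2, 1]        # row 2, column 1
first_column = a[:, 0]   # every row, column 0
second_row = a[1, :]     # row 1, every column
block = a[0:2, 1:3]      # rows 0-1, columns 1-2

print(element)                 # 16
print(first_column.tolist())   # [32, 12, 2]
print(second_row.tolist())     # [12, 10, 5, 23, 1]
print(block.tolist())          # [[15, 6], [10, 5]]
```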

A two-dimensional array has two axes: 

* axis 0 represents the values that share the same indexical position (are in the same column)
* axis 1 represents the values that share an array (are in the same row). 

In [9]:
student_scores = np.array([[92, 94, 88, 91, 87],
                           [79, 100, 86, 93, 91],
                           [87, 85, 72, 90, 92]])

tanya_test_3 = student_scores[2, 0]

print(tanya_test_3)

cody_test_scores = student_scores[:,4]

print(cody_test_scores)

87
[87 91 92]


#### Logical Operations with Arrays

Another useful thing that arrays can do is perform element-wise logical operations. For instance, suppose we want to know how many elements in an array are greater than 5. We can easily write some code that checks whether this statement evaluates to True for each item in the array, without having to use a for loop:

>>> a = np.array([10, 2, 2, 4, 5, 3, 9, 8, 9, 7])
>>> a > 5
array([True, False, False, False, False, False, True, True, True, True], dtype=bool)
We can then use logical operators to evaluate and select items based on certain criteria. To select all elements from the previous array that are greater than 5, we'd write the following:

>>> a[a > 5]
array([10, 9, 8, 9, 7])
We can also combine logical statements to further specify our criteria. To do so, we place each statement in parentheses and use boolean operators like & (and) and | (or).

In our example, we can use combined statements to find the elements that are greater than five or less than two:

>>> a[(a > 5) | (a < 2)]
array([10, 9, 8, 9, 7])
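
As a sketch of the ideas above, the code below stores the comparison result in a variable (called `mask` here, an illustrative name), counts the matches, and combines conditions. The parentheses around each comparison are required because & and | bind more tightly than > and < in Python.

```python
import numpy as np

a = np.array([10, 2, 2, 4, 5, 3, 9, 8, 9, 7])

# The comparison produces one True/False value per element.
mask = a > 5
print(mask.sum())          # 5  (True counts as 1)
print(a[mask].tolist())    # [10, 9, 8, 9, 7]

# "or": keep elements that satisfy either condition.
either = a[(a > 5) | (a < 3)]
print(either.tolist())     # [10, 2, 2, 9, 8, 9, 7]

# "and": keep elements that satisfy both conditions.
both = a[(a > 2) & (a < 8)]
print(both.tolist())       # [4, 5, 3, 7]
```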

In [10]:
a = np.array([10, 2, 2, 4, 5, 3, 9, 8, 9, 7])

In [11]:
a > 5

array([ True, False, False, False, False, False,  True,  True,  True,  True], dtype=bool)

In [12]:
a[a > 5]

array([10,  9,  8,  9,  7])

In [13]:
a[(a > 5) | (a < 3)] 

array([10,  2,  2,  9,  8,  9,  7])

In [14]:
porridge = np.array([79, 65, 50, 63, 56, 90, 85, 98, 79, 51])

cold = porridge[porridge < 60]
print(cold)

hot = porridge[porridge > 80]
print(hot)

just_right = porridge[(porridge >= 60) & (porridge <= 80)]
print(just_right)

[50 56 51]
[90 85 98]
[79 65 63 79]


#### Review

Let's take a second and review. In this lesson, you learned the basics of the NumPy package. Here are some key points:

* Arrays are a special type of list that allows us to store values in an organized manner.
* An array can be created by either defining it directly using np.array() or by importing a CSV using np.genfromtxt('file.csv', delimiter=',').
* An operation (such as addition) can be performed on every element in an array by simply performing it on the array itself.
* Elements can be selected from arrays using their indices, which start at 0 for both rows and columns.
* Logical operations can be used to create new, more focused arrays out of larger arrays.

The next lesson will explore how to analyze these arrays and use means, medians, and standard deviations to tell a story. But first, practice what you've learned by working through the following checkpoints.

In [23]:
# temperatures = np.genfromtxt('temperature_data.csv', delimiter=',')

# rows: Monday, Tuesday, Wednesday, Thursday, Friday
# columns: 0:00, 6:00, 12:00, and 18:00 hours

temperatures = np.array(
    [
        [43.6,  45.1,  58.8,  53. ],
        [47.0,  44.5,  58.3,  52.6],
        [46.7,  44.2,  57.9,  52.2],
        [46.5,  44.1,  57.6,  51.9],
        [46.2,  43.9,  57.2,  51.5],
    ]
)

# Add 3 to all readings...
temperatures_fixed = temperatures + 3
print(temperatures_fixed)

[[ 46.6  48.1  61.8  56. ]
 [ 50.   47.5  61.3  55.6]
 [ 49.7  47.2  60.9  55.2]
 [ 49.5  47.1  60.6  54.9]
 [ 49.2  46.9  60.2  54.5]]


In [26]:
monday_temperatures = temperatures_fixed[0, :]
print(monday_temperatures)

thursday_friday_morning = temperatures_fixed[3:5, 1]
print(thursday_friday_morning)

temperature_extremes = temperatures_fixed[(temperatures_fixed < 50) | (temperatures_fixed > 60)]
print(temperature_extremes)

[ 46.6  48.1  61.8  56. ]
[ 47.1  46.9]
[ 46.6  48.1  61.8  47.5  61.3  49.7  47.2  60.9  49.5  47.1  60.6  49.2
  46.9  60.2]


In [27]:
cupcakes = np.array([2, .75, 2, 1, .5])
print(cupcakes)

recipes = np.genfromtxt('recipes.csv', delimiter=',')
print(recipes)

eggs = recipes[:, 2]
print(eggs)

print(eggs == 1)

cookies = recipes[2, :]
print(cookies)

double_batch = cupcakes * 2
print(double_batch)

grocery_list = cupcakes + cookies
print(grocery_list)

[ 2.    0.75  2.    1.    0.5 ]
[[ 2.     0.75   2.     1.     0.5  ]
 [ 1.     0.125  1.     1.     0.125]
 [ 2.75   1.5    1.     0.     1.   ]
 [ 4.     0.5    2.     2.     0.5  ]]
[ 2.  1.  1.  2.]
[False  True  True False]
[ 2.75  1.5   1.    0.    1.  ]
[ 4.   1.5  4.   2.   1. ]
[ 4.75  2.25  3.    1.    1.5 ]


### Introduction to Statistics with NumPy

You're a citizen scientist who has started collecting data about rising water in the river next to where you live. For months, you painstakingly measure the water levels and enter your findings into a notebook. But at the end of it, what exactly do you have? What can all this data tell us?

In this lesson, we'll explore how we can use NumPy to analyze data. We'll learn different methods to calculate common statistical properties of a dataset, such as finding the mean and standard deviation. By the end, you'll be able to do basic analysis of a dataset and understand how we can use statistics to come to conclusions about data.

The statistical concepts that we'll cover include:
* Mean
* Median
* Percentiles
* Interquartile Range
* Outliers
* Standard Deviation

To start, we'll be analyzing single-variable datasets. One way to think of a single-variable dataset is that it contains answers to a question. For instance, we might ask 100 people, “How tall are you?” Their heights in inches would form our dataset.

#### NumPy and Mean

The first statistical concept we'll explore is the mean, also commonly referred to as the average. The mean is a useful measurement of the center of a dataset. NumPy has a built-in function to calculate the mean of an array: np.mean.

Let's say we want to find the average number of pounds of produce a person purchases per week. We administered a survey and received 1,000 responses:

survey_responses = [5, 10.2, 4, .3 ... 6.6]
We can then transform the dataset into a NumPy array and use the function np.mean to calculate the average:

>>> survey_array = np.array(survey_responses)
>>> np.mean(survey_array)
5.220
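
To see what np.mean computes, the sketch below compares it with the sum divided by the number of items. The four values in `sample_responses` are made up for illustration and are not the survey data described above.

```python
import numpy as np

# Hypothetical responses, in pounds of produce per week.
sample_responses = np.array([3.5, 8.0, 4.5, 6.0])

# The mean is the sum of the values divided by how many there are.
manual_mean = sample_responses.sum() / len(sample_responses)
numpy_mean = np.mean(sample_responses)

print(manual_mean)   # 5.5
print(numpy_mean)    # 5.5
```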

In [2]:
store_one = np.array([2, 5, 8, 3, 4, 10, 15, 5])
store_two = np.array([3, 17, 18,  9,  2, 14, 10])
store_three = np.array([7, 5, 4, 3, 2, 7, 7])

store_one_avg = np.mean(store_one)
print(store_one_avg)

store_two_avg = np.mean(store_two)
print(store_two_avg)

store_three_avg = np.mean(store_three)
print(store_three_avg)

6.5
10.4285714286
5.0


#### Mean and Logical Operations

We can also use np.mean to calculate the percent of array elements that have a certain property.

As we know, a logical operator will evaluate each item in an array to see if it matches the specified condition. If the item matches the given condition, the item will evaluate as True and equal 1. If it does not match, it will be False and equal 0.

When np.mean calculates a logical statement, the resulting mean value will be equivalent to the total number of True items divided by the total array length.

In our produce survey example, we can use this calculation to find out the percentage of people who bought more than 8 pounds of produce each week:

>>> np.mean(survey_array > 8)
0.2
The logical statement survey_array > 8 evaluates which survey answers were greater than 8, and assigns them a value of 1. np.mean adds all of the 1s up and divides them by the length of survey_array. The resulting output tells us that 20% of responders purchased more than 8 pounds of produce.
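
The sketch below walks through the same calculation on a small, made-up dataset so that each step can be checked by hand. The values in `sample_responses` are hypothetical.

```python
import numpy as np

# Hypothetical responses, in pounds of produce per week.
sample_responses = np.array([5, 10.2, 4, 9.1, 6.6, 3, 7.5, 2, 8.4, 1])

# True where the response is greater than 8, False elsewhere.
over_eight = sample_responses > 8
print(over_eight.sum())        # 3 responses are above 8

# Mean of the True/False values = count of True / total count.
fraction = np.mean(over_eight)
print(fraction)                # 0.3, i.e. 30% of responses
```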

In [3]:
class_year = np.array([1967, 1949, 2004, 1997, 1953, 1950, 1958, 1974, 1987, 2006, 2013, 1978, 1951, 1998, 1996, 1952, 2005, 2007, 2003, 1955, 1963, 1978, 2001, 2012, 2014, 1948, 1970, 2011, 1962, 1966, 1978, 1988, 2006, 1971, 1994, 1978, 1977, 1960, 2008, 1965, 1990, 2011, 1962, 1995, 2004, 1991, 1952, 2013, 1983, 1955, 1957, 1947, 1994, 1978, 1957, 2016, 1969, 1996, 1958, 1994, 1958, 2008, 1988, 1977, 1991, 1997, 2009, 1976, 1999, 1975, 1949, 1985, 2001, 1952, 1953, 1949, 2015, 2006, 1996, 2015, 2009, 1949, 2004, 2010, 2011, 2001, 1998, 1967, 1994, 1966, 1994, 1986, 1963, 1954, 1963, 1987, 1992, 2008, 1979, 1987])

millennials = np.mean(class_year >= 2005)
print(millennials)

0.21


#### Calculating the Mean of 2D Arrays

If we have a two-dimensional array, np.mean can calculate the means of the larger array as well as the interior values.

Let's imagine a game of ring toss at a carnival. In this game, you have three different chances to get all three rings onto a stick. In our ring_toss array, each interior array (the arrays within the larger array) is one try, and each number is one ring toss. 1 represents a successful toss, 0 represents a fail.

First, we can use np.mean to find the mean across all the arrays:

>>> ring_toss = np.array([[1, 0, 0], 
                          [0, 0, 1], 
                          [1, 0, 1]])
>>> np.mean(ring_toss)
0.44444444444444442
To find the means of each interior array, we specify axis 1 (the "rows"):

>>> np.mean(ring_toss, axis=1)
array([ 0.33333333,  0.33333333,  0.66666667])
To find the means of each index position (i.e., mean of all 1st tosses, mean of all 2nd tosses, ...), we specify axis 0 (the "columns"):

>>> np.mean(ring_toss, axis=0)
array([ 0.66666667,  0.        ,  0.66666667])
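
If the axis argument is hard to keep straight, one way to check it is to compute the same means by slicing out each row or column explicitly. This is a sketch for verification only; the `manual_` variable names are illustrative.

```python
import numpy as np

ring_toss = np.array([[1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 1]])

per_try = np.mean(ring_toss, axis=1)        # one mean per row
per_position = np.mean(ring_toss, axis=0)   # one mean per column

# Same results, computed by selecting each row or column in turn.
manual_per_try = np.array([np.mean(ring_toss[i, :]) for i in range(3)])
manual_per_position = np.array([np.mean(ring_toss[:, j]) for j in range(3)])

print(np.allclose(per_try, manual_per_try))             # True
print(np.allclose(per_position, manual_per_position))   # True
```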

In [4]:
allergy_trials = np.array([[6, 1, 3, 8, 2], 
                           [2, 6, 3, 9, 8], 
                           [5, 2, 6, 9, 9]])

total_mean = np.mean(allergy_trials)
print(total_mean)

trial_mean = np.mean(allergy_trials, axis=1)
print(trial_mean)

patient_mean = np.mean(allergy_trials, axis=0)
print(patient_mean)

5.26666666667
[ 4.   5.6  6.2]
[ 4.33333333  3.          4.          8.66666667  6.33333333]


#### Outliers

As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our data set. What happens when one of those values is significantly different from the rest?

Values that don’t fit within the majority of a dataset are known as outliers. It’s important to identify outliers because if they go unnoticed, they can skew our data and lead to error in our analysis (like determining the mean). They can also be useful in pointing out errors in our data collection.

When we're able to identify outliers, we can then determine if they were due to an error in sample collection or whether or not they represent a significant but real deviation from the mean.

Suppose we want to determine the average height for 3rd graders. We measure several students at the local school, but accidentally measure one student in centimeters rather than in inches. If we're not paying attention, our dataset could end up looking like this:

[50, 50, 51, 49, 48, 127]
In this case, 127 would be an outlier.

Some outliers aren’t the result of a mistake. For instance, suppose that one of our 3rd graders had skipped a grade and was actually a year younger than everyone else in the class:

[50, 50, 51, 49, 48, 45]
Her height of 45" is an accurate measurement, not a mistake, but it would still be an outlier.

Suppose that another student was just unusually tall for his age:

[50, 50, 51, 49, 48, 58.5]
His height of 58.5" would also be an outlier.

#### Sorting and Outliers

One way to quickly identify outliers is by sorting our data. Once our data is sorted, we can quickly glance at the beginning or end of an array to see if some values lie far beyond the expected range. We can use the NumPy function np.sort to sort our data.

Let’s go back to our 3rd grade height example, and imagine an 8th grader walked into our experiment:

>>> heights = np.array([49.7, 46.9, 62, 47.2, 47, 48.3, 48.7])
If we use np.sort, we can immediately identify the taller student since their height (62") is noticeably outside the range of the dataset:

>>> np.sort(heights)
array([ 46.9,  47. ,  47.2,  48.3,  48.7,  49.7,  62])
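
Instead of scanning the sorted output by eye, we can look directly at both ends of the sorted array. The sketch below is one informal way to do that; comparing the gaps at each end is an illustration, not a formal outlier test.

```python
import numpy as np

heights = np.array([49.7, 46.9, 62, 47.2, 47, 48.3, 48.7])

# np.sort returns a sorted copy; the original order is untouched.
sorted_heights = np.sort(heights)
print(heights[2])           # 62.0, still in its original position
print(sorted_heights[0])    # 46.9, the smallest value
print(sorted_heights[-1])   # 62.0, the largest value

# Gap between the two largest values versus the two smallest.
top_gap = sorted_heights[-1] - sorted_heights[-2]     # about 12.3
bottom_gap = sorted_heights[1] - sorted_heights[0]    # about 0.1
print(top_gap > bottom_gap)  # True
```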

In [6]:
temps = np.array([86, 88, 94, 85, 97, 90, 87, 85, 94, 93, 92, 95, 98, 85, 94, 91, 97, 88, 87, 86, 99, 89, 89, 99, 88, 96, 93, 96, 85, 88, 191, 95, 96, 87, 99, 93, 90, 86, 87, 100, 187, 98, 101, 101, 96, 94, 96, 87, 86, 92, 98,94, 98, 90, 99, 96, 99, 86, 97, 98, 86, 90, 86, 94, 91, 88, 196, 195,93, 97, 199, 87, 87, 90, 90, 98, 88, 92, 97, 88, 85, 94, 88, 93, 198, 90, 91, 90, 92, 92])

sorted_temps = np.sort(temps)
print(sorted_temps)

[ 85  85  85  85  85  86  86  86  86  86  86  86  87  87  87  87  87  87
  87  88  88  88  88  88  88  88  88  89  89  90  90  90  90  90  90  90
  90  91  91  91  92  92  92  92  92  93  93  93  93  93  94  94  94  94
  94  94  94  95  95  96  96  96  96  96  96  97  97  97  97  97  98  98
  98  98  98  98  99  99  99  99  99 100 101 101 187 191 195 196 198 199]


#### NumPy and Median

Another key metric that we can use in data analysis is the median. The median is the middle value of a dataset that’s been ordered in terms of magnitude (from lowest to highest).

Let's look at the following array:

np.array( [1, 1, 2, 3, 4, 5, 5])
In this example, the median would be 3, because it is the middle value of the sorted data: three values lie below it and three values lie above it.

If the length of our dataset was an even number, the median would be the value halfway between the two central values. So in the following example, the median would be 3.5:

np.array( [1, 1, 2, 3, 4, 5, 5, 6])
But what if we had a very large dataset? It would get very tedious to count all of the values. Luckily, NumPy also has a function to calculate the median, np.median:

>>> my_array = np.array([50, 38, 291, 59, 14])
>>> np.median(my_array)
50.0
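
The sketch below reuses the mis-measured heights from the Outliers section to compare how the mean and the median react to a single extreme value. The variable names are illustrative.

```python
import numpy as np

# Heights in inches, with and without the value entered in centimeters.
without_error = np.array([50, 50, 51, 49, 48])
with_error = np.array([50, 50, 51, 49, 48, 127])

print(np.mean(without_error), np.median(without_error))   # 49.6 50.0
print(np.mean(with_error), np.median(with_error))         # 62.5 50.0
```

In this small example the outlier pulls the mean up by almost 13 inches while the median stays at 50.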

In [7]:
data_set = np.array([50000, 27500, 75000, 62500, 37500])
print(np.median(data_set))

50000.0


#### Percentiles, Part I

As we know, the median is the middle of a dataset: it is the number for which 50% of the samples are below, and 50% of the samples are above. But what if we wanted to find a point at which 40% of the samples are below, and 60% of the samples are above?

This type of point is called a percentile. The Nth percentile is defined as the point below which N% of the samples lie. So the point where 40% of samples are below is called the 40th percentile. Percentiles are useful measurements because they can tell us where a particular value is situated within the greater dataset.

Let's look at the following array:

d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]
There are 11 numbers in the dataset. The 40th percentile will have 40% of the 10 remaining numbers below it (40% of 10 is 4) and 60% of the numbers above it (60% of 10 is 6). So in this example, the 40th percentile is 4.

In NumPy, we can calculate percentiles using the function np.percentile, which takes two arguments: the array and the percentile to calculate.

Here's how we would use NumPy to calculate the 40th percentile of array d:

>>> d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7,  8, 8])
>>> np.percentile(d, 40)
4.0
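
The counting argument above corresponds to the linear interpolation that np.percentile uses by default. The sketch below reimplements that rule in a hypothetical helper, `percentile_linear`, purely to show the arithmetic; it is not how NumPy is implemented internally.

```python
import numpy as np

def percentile_linear(values, q):
    """Hypothetical helper: q-th percentile by linear interpolation."""
    ordered = np.sort(np.asarray(values, dtype=float))
    # Fractional index: q% of the way through the n - 1 gaps.
    position = (len(ordered) - 1) * q / 100
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    fraction = position - lower
    # Blend the two neighbouring values when position is not whole.
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

print(percentile_linear(d, 40), np.percentile(d, 40))   # 4.0 4.0
print(percentile_linear(d, 25), np.percentile(d, 25))   # 3.5 3.5
```

For the 40th percentile the position is exactly index 4, so no blending is needed. For the 25th percentile the position is 2.5, halfway between the values 3 and 4.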

In [9]:
patrons = np.array([ 2, 6, 14, 4, 3, 9, 1, 11, 4, 2, 8])

print(np.sort(patrons))

thirtieth_percentile = np.percentile(patrons, 30)
print(thirtieth_percentile)

seventieth_percentile = np.percentile(patrons, 70)
print(seventieth_percentile)

[ 1  2  2  3  4  4  6  8  9 11 14]
3.0
8.0


#### Percentiles, Part II

Some percentiles have specific names:

* The 25th percentile is called the first quartile
* The 50th percentile is called the median
* The 75th percentile is called the third quartile

The minimum, first quartile, median, third quartile, and maximum of a dataset are called a five-number summary. This set of numbers is a great thing to compute when we get a new dataset.

The difference between the first and third quartile is called the _interquartile range_. 50% of the dataset will lie within the interquartile range. The interquartile range gives us an idea of how spread out our data is.

d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]
We can calculate the 25th and 75th percentiles using np.percentile:

>>> np.percentile(d, 25)
3.5

>>> np.percentile(d, 75)
6.5

Then to find the interquartile range, we subtract the value of the 25th percentile from the value of the 75th:

6.5 - 3.5 = 3
The smaller the interquartile range, the less spread out the middle half of our dataset is. The greater the value, the more spread out it is.
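
The five-number summary and the interquartile range can be computed together, as in the sketch below. Passing a list of percentiles to np.percentile returns one value per percentile. The final check shows that, for a small dataset with repeated values, the share of points inside the interquartile range is only approximately 50%.

```python
import numpy as np

d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

# Ask for both quartiles in a single call.
first_quartile, third_quartile = np.percentile(d, [25, 75])

five_number_summary = [np.min(d), first_quartile, np.median(d),
                       third_quartile, np.max(d)]
print([float(x) for x in five_number_summary])   # [1.0, 3.5, 4.0, 6.5, 8.0]

interquartile_range = third_quartile - first_quartile
print(interquartile_range)                       # 3.0

# Fraction of values that fall between the two quartiles.
inside = np.mean((d >= first_quartile) & (d <= third_quartile))
print(inside)                                    # about 0.45 (5 of 11 values)
```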

In [10]:
movies_watched = np.array([2, 3, 8, 0, 2, 4, 3, 1, 1, 0, 5, 1, 1, 7, 2])

first_quarter = np.percentile(movies_watched, 25)
print(first_quarter)
third_quarter = np.percentile(movies_watched, 75)
print(third_quarter)

# 50% of the dataset will lie within the interquartile range
interquartile_range = (third_quarter - first_quarter)
print(interquartile_range)

1.0
3.5
2.5


#### NumPy and Standard Deviation, Part I

While the mean and median can tell us about the center of our data, they do not reflect the range of the data. That's where standard deviation comes in.

Similar to the interquartile range, the standard deviation tells us the spread of the data. The larger the standard deviation, the more spread out our data is from the center. The smaller the standard deviation, the more the data is clustered around the mean.

#### NumPy and Standard Deviation, Part II

As we saw in the last exercise, knowing the standard deviation of a dataset can help us understand how spread out our dataset is.

We can find the standard deviation of a dataset using the NumPy function np.std:

>>> nums = np.array([65, 36, 52, 91, 63, 79])

>>> np.std(nums)
17.716909687891082
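
To show what np.std measures, the sketch below computes the same value step by step: take each value's distance from the mean, square it, average the squares, and take the square root. By default np.std divides by the number of values (the population standard deviation); passing `ddof=1` divides by one less, which gives the sample standard deviation. The variable names are illustrative.

```python
import numpy as np

nums = np.array([65, 36, 52, 91, 63, 79])

# Step by step: deviations from the mean, squared, averaged, rooted.
deviations = nums - np.mean(nums)
manual_std = np.sqrt(np.mean(deviations ** 2))
print(np.isclose(manual_std, np.std(nums)))   # True

# ddof=1 divides by n - 1 instead of n, so the result is larger.
sample_std = np.std(nums, ddof=1)
print(sample_std > np.std(nums))              # True
```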

In [11]:
pumpkin = np.array([68, 1820, 1420, 2062, 704, 1156, 1857, 1755, 2092, 1384])

acorn_squash = np.array([20, 43, 99, 200, 12, 250, 58, 120, 230, 215])

pumpkin_avg = np.mean(pumpkin)
print(pumpkin_avg)

acorn_squash_avg = np.mean(acorn_squash)
print(acorn_squash_avg)

pumpkin_std = np.std(pumpkin)
acorn_squash_std = np.std(acorn_squash)

print(pumpkin_std)

print(acorn_squash_std)

1431.8
124.7
611.318378588
87.2250537403


#### Review

Let's review! In this lesson, you learned how to use NumPy to analyze single-variable datasets. Here's what we covered:

* Using the np.sort function to locate outliers.
* Calculating central positions of a dataset using np.mean and np.median.
* Understanding the spread of our data using percentiles and the interquartile range.
* Finding the standard deviation of a dataset using np.std.

In [12]:
rainfall = np.array([5.21, 3.76, 3.27, 2.35, 1.89, 1.55, 0.65, 1.06, 1.72, 3.35, 4.82, 5.11])

rain_mean = np.mean(rainfall)
print(rain_mean)

rain_median = np.median(rainfall)
print(rain_median)

first_quarter = np.percentile(rainfall, 25)
print(first_quarter)

third_quarter = np.percentile(rainfall, 75)
print(third_quarter)

interquartile_range = third_quarter - first_quarter
print(interquartile_range)

rain_std = np.std(rainfall)
print(rain_std)

2.895
2.81
1.6775
4.025
2.3475
1.52673125773


#### CrunchieMunchies

You work in marketing for a food company, YummyCorps, which is developing a new kind of tasty, wholesome cereal called CrunchieMunchies. You want to demonstrate to consumers how healthy your cereal is in comparison to other leading brands, so you've dug up nutritional data on several different competitors.

Your task is to use NumPy statistical calculations to analyze this data and prove that your CrunchieMunchies cereal is the healthiest choice for consumers.

In [13]:
calorie_stats = np.genfromtxt('cereal.csv', delimiter=',')

average_calories = np.mean(calorie_stats)
print(average_calories)

calories_stats_sorted = np.sort(calorie_stats)
print(calories_stats_sorted)

median_calories = np.median(calorie_stats)
print(median_calories)

print(np.percentile(calorie_stats, 3))
nth_percentile = 3

print(np.percentile(calorie_stats, 4))

percentage = np.mean(calorie_stats > 60)
print(percentage)
more_calories = 96.10

calorie_std = np.std(calorie_stats)
print(calorie_std)

106.883116883
[  50.   50.   50.   70.   70.   80.   90.   90.   90.   90.   90.   90.
   90.  100.  100.  100.  100.  100.  100.  100.  100.  100.  100.  100.
  100.  100.  100.  100.  100.  100.  110.  110.  110.  110.  110.  110.
  110.  110.  110.  110.  110.  110.  110.  110.  110.  110.  110.  110.
  110.  110.  110.  110.  110.  110.  110.  110.  110.  110.  110.  120.
  120.  120.  120.  120.  120.  120.  120.  120.  120.  130.  130.  140.
  140.  140.  150.  150.  160.]
110.0
55.6
70.0
0.961038961039
19.3571853339
