## Fundamentals and Basics of Python

### Data Types and Operators

`+` Addition

`-` Subtraction

`*` Multiplication

`/` Division

`%` Mod (the remainder after dividing)

`**` Exponentiation (note that `^` does not perform this operation, as you might have seen in other languages)

`//` Divides and rounds down to the nearest integer
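As a quick check of the operators listed above, the following sketch evaluates each one on two small integers. The variable names and values are illustrative only and do not come from the course material.

```python
# Illustrative values; any two numbers would do.
a, b = 7, 2

total = a + b          # addition
difference = a - b     # subtraction
product = a * b        # multiplication
quotient = a / b       # division always returns a float
remainder = a % b      # remainder after dividing
power = a ** b         # exponentiation
floored = a // b       # divides and rounds down to the nearest integer
xor_result = a ^ b     # ^ is bitwise XOR, not exponentiation

# Rounding down also applies to negative results.
negative_floored = -7 // 2

print(total, difference, product, quotient, remainder, power, floored, xor_result)
print(negative_floored)
```

Note that `//` rounds toward negative infinity, so `-7 // 2` gives `-4` rather than `-3`.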

#### Variables

Variables are used all the time in Python! Below is the example you saw in the video where we performed the following:

`mv_population = 74728`

Here mv_population is a variable, which holds the value of 74728. This assigns the item on the right to the name on the left, which is actually a little different than mathematical equality, as 74728 does not hold the value of mv_population.

In any case, whatever term is on the left side, is now a name for whatever value is on the right side. Once a value has been assigned to a variable name, you can access the value from the variable name.

Besides writing variable names that are descriptive, there are a few things to watch out for when naming variables in Python.

1. Only use ordinary letters, numbers and underscores in your variable names. They can’t have spaces, and need to start with a letter or underscore.

2. You can’t use Python's reserved words, or "keywords," as variable names. There are reserved words in every programming language that have important purposes, and you’ll learn about some of these throughout this course. Creating names that are descriptive of the values often will help you avoid using any of these keywords. The full list of reserved words is given in the Python documentation.

3. The pythonic way to name variables is to use all lowercase letters and underscores to separate words.

Example:
`my_height = 58`

`my_lat = 40`

`my_long = 105`
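The naming rules above can also be checked in code. The sketch below uses the standard library `keyword` module together with `str.isidentifier()`. The helper `is_valid_name` is a hypothetical name introduced here for illustration, not part of the course material.

```python
import keyword

def is_valid_name(name):
    """Return True if name could be used as a variable name (hypothetical helper)."""
    # isidentifier() checks the character rules; iskeyword() rejects reserved words.
    return name.isidentifier() and not keyword.iskeyword(name)

print(keyword.kwlist)               # reserved words for the running Python version
print(is_valid_name("my_height"))   # True
print(is_valid_name("2nd_place"))   # False: starts with a digit
print(is_valid_name("my height"))   # False: contains a space
print(is_valid_name("class"))       # False: reserved word
```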

#### Python Best Practices
For all the best practices, see the PEP 8 guidelines.

You can use the atom package linter-python-pep8 to use pep8 within your own programming environment in the Atom text editor, but more on this later. If you aren't familiar with text editors yet, and you are performing all of your programming in the classroom, no need to worry about this right now.

Follow these guidelines to make other programmers and future you happy!

#### String Methods

A method in Python behaves similarly to a function. Methods actually are functions that are called using dot notation. For example, lower() is a string method that can be used like this, on a string variable called sample_string: sample_string.lower().

Methods are specific to the data type for a particular variable. So there are some built-in methods that are available for all strings, different methods that are available for all integers, etc.

Below is an image that shows some methods that are possible with any string.

Each of these methods accepts the string itself as the first argument of the method. However, they also could receive additional arguments, that are passed inside the parentheses. Let's look at the output for a few examples.
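To make the "string itself as the first argument" idea concrete, here is a minimal sketch that calls the same method in two ways. The value of `sample_string` is made up for illustration.

```python
sample_string = "Sample String"

# Dot notation passes sample_string as the first argument automatically.
via_method = sample_string.lower()
# Calling through the str type makes that first argument explicit.
via_type = str.lower(sample_string)

print(via_method == via_type)    # True

# Additional arguments go inside the parentheses.
print(sample_string.count("S"))  # 2
```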

###### BASIC METHODS

In [None]:
my_string = 'Gustavo Aguiar'

In [None]:
my_string.lower()

'gustavo aguiar'

In [None]:
my_string.upper()

'GUSTAVO AGUIAR'

In [None]:
my_string.islower()

False

In [None]:
my_string.isupper()

False

In [None]:
my_string.count('a')

2

In [None]:
my_string.find('a')

4

In [None]:
my_string.lstrip('Gus')

'tavo Aguiar'

In [None]:
my_string.replace('Aguiar', 'Martins')

'Gustavo Martins'

In [None]:
my_string.rsplit()

['Gustavo', 'Aguiar']

In [None]:
my_string.rstrip()

'Gustavo Aguiar'

###### FORMAT()

One important string method: format()

We will be using the format() string method a good bit in our future work in Python, and you will find it very valuable in your coding, especially with your print statements.

In [None]:
print("Mohammed has {} balloons".format(27))

Mohammed has 27 balloons


In [None]:
animal = "dog"
action = "bite"
print("Does your {} {}?".format(animal, action))

Does your dog bite?


In [None]:
maria_string = "Maria loves {} and {}"
print(maria_string.format("math", "statistics"))

Maria loves math and statistics


Notice how in each example, the number of pairs of curly braces {} you use inside the string is the same as the number of replacements you want to make using the values inside format().

###### SPLIT

The .split() method is helpful when working with strings. It returns a data container called a list that contains the words from the input string. We will introduce the concept of lists in the next video.

The split method has two additional arguments (sep and maxsplit). The sep argument stands for "separator". It can be used to identify how the string should be split up (e.g., whitespace characters like space, tab, return, newline; specific punctuation (e.g., comma, dashes)). If the sep argument is not provided, the default separator is whitespace.

True to its name, the maxsplit argument provides the maximum number of splits. With maxsplit set, the new list has at most maxsplit + 1 elements, and whatever remains of the string is returned as the last element. You can read more about these methods in the Python documentation too.

Here are some examples for the .split() method.

In [None]:
new_str = "The cow jumped over the moon."
new_str.split()

['The', 'cow', 'jumped', 'over', 'the', 'moon.']

In [None]:
new_str.split(' ', 3)

['The', 'cow', 'jumped', 'over the moon.']

In [None]:
new_str.split('.')

['The cow jumped over the moon', '']

In [None]:
new_str.split(None, 3)

['The', 'cow', 'jumped', 'over the moon.']
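The same arguments work with separators other than whitespace. The sketch below uses a made-up comma-separated string to show how sep and maxsplit interact, and how the default whitespace behaviour differs from an explicit separator.

```python
record = "2024,widget,blue,large"  # made-up sample data

print(record.split(","))       # ['2024', 'widget', 'blue', 'large']
print(record.split(",", 1))    # ['2024', 'widget,blue,large']
print(record.split(",", 2))    # ['2024', 'widget', 'blue,large']

# With no sep, runs of whitespace count as a single separator.
print("  a   b \n c ".split())  # ['a', 'b', 'c']

# With an explicit sep, consecutive separators produce empty strings.
print("a,,b".split(","))        # ['a', '', 'b']
```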

#### Lists

In [None]:
month = 8
days_in_month = [31,28,31,30,31,30,31,31,30,31,30,31]  # or days_in_month = list((31,28,31,30,31,30,31,31,30,31,30,31))

Use list indexing to determine how many days are in a particular month based on the integer variable month, and store that value in the integer variable num_days. For example, if month is 8, num_days should be set to 31, since the eighth month, August, has 31 days.

In [None]:
num_days = days_in_month[month - 1]
print(num_days)

31


Select the three most recent dates from this list using list slicing notation. Hint: negative indexes work in slices!

In [None]:
eclipse_dates = ['June 21, 2001', 'December 4, 2002', 'November 23, 2003',
                 'March 29, 2006', 'August 1, 2008', 'July 22, 2009',
                 'July 11, 2010', 'November 13, 2012', 'March 20, 2015',
                 'March 9, 2016']
                 
print(eclipse_dates[-3:])

['November 13, 2012', 'March 20, 2015', 'March 9, 2016']


###### List Methods

Useful built-in functions for lists:

* len() returns how many elements are in a list.

* max() returns the greatest element of the list. How the greatest element is determined depends on what type of objects are in the list. The maximum element in a list of numbers is the largest number. The maximum element in a list of strings is the element that would occur last if the list were sorted alphabetically. This works because the max function is defined in terms of the greater than comparison operator. Calling max on a list that contains elements of different, incomparable types (such as numbers and strings) raises a TypeError.

* min() returns the smallest element in a list. min is the opposite of max, which returns the largest element in a list.

* sorted() returns a copy of a list in order from smallest to largest, leaving the list unchanged
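A short sketch of these four functions in action follows. The lists `scores` and `names` are made-up sample data.

```python
scores = [88, 42, 97, 65]
names = ["delta", "alpha", "charlie", "bravo"]

print(len(scores))      # 4
print(max(scores))      # 97
print(min(scores))      # 42
print(max(names))       # delta (last in alphabetical order)
print(sorted(scores))   # [42, 65, 88, 97]
print(scores)           # [88, 42, 97, 65], the original list is unchanged

# Mixing incomparable types fails because > is not defined between them.
try:
    max([1, "two", 3])
except TypeError as error:
    print("TypeError:", error)
```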

Join Method

In [None]:
new_str = "\n".join(["fore", "aft", "starboard", "port"])
print(new_str)

fore
aft
starboard
port


In [None]:
name = "-".join(["García", "O'Kelly"])
print(name)

García-O'Kelly


Append Method

In [None]:
letters = ['a', 'b', 'c', 'd']
letters.append('z')
print(letters)

['a', 'b', 'c', 'd', 'z']


#### Tuples

A tuple is another useful container. It's a data type for immutable ordered sequences of elements. They are often used to store related pieces of information. Consider this example involving latitude and longitude:

In [None]:
location = (13.4125, 103.866667)
print("Latitude:", location[0])
print("Longitude:", location[1])

Latitude: 13.4125
Longitude: 103.866667


Tuples are similar to lists in that they store an ordered collection of objects which can be accessed by their indices. Unlike lists, however, tuples are immutable - you can't add and remove items from tuples, or sort them in place.
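Immutability is easy to see by trying to change a tuple. The following sketch reuses the `location` tuple from above and catches the error that an item assignment raises.

```python
location = (13.4125, 103.866667)

try:
    location[0] = 0.0  # tuples do not support item assignment
except TypeError as error:
    print("TypeError:", error)

# sorted() still works because it builds a new list instead of changing the tuple.
print(sorted(location, reverse=True))  # [103.866667, 13.4125]
print(location)                        # (13.4125, 103.866667)
```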

Tuples can also be used to assign multiple variables in a compact way.

In [None]:
dimensions = (52, 40, 100)  # or dimensions = 52, 40, 100
length, width, height = dimensions
print("The dimensions are {} x {} x {}".format(length, width, height))

The dimensions are 52 x 40 x 100


The parentheses are optional when defining tuples, and programmers frequently omit them if parentheses don't clarify the code.

In the second line, three variables are assigned from the content of the tuple dimensions. This is called tuple unpacking. You can use tuple unpacking to assign the information from a tuple into multiple variables without having to access them one by one and make multiple assignment statements.

If we won't need to use dimensions directly, we could shorten those two lines of code into a single line that assigns three variables in one go!

In [None]:
length, width, height = 52, 40, 100
print("The dimensions are {} x {} x {}".format(length, width, height))

The dimensions are 52 x 40 x 100


#### Sets

A set is a data type for mutable unordered collections of unique elements. One application of a set is to quickly remove duplicates from a list.

In [None]:
numbers = [1, 2, 6, 3, 1, 1, 6]
unique_nums = set(numbers)
print(unique_nums)

{1, 2, 3, 6}


Sets support the in operator the same as lists do. You can add elements to sets using the add method, and remove elements using the pop method, similar to lists. However, when you pop an element from a set, an arbitrary element is removed, because sets, unlike lists, are unordered, so there is no "last element".

In [None]:
fruit = {"apple", "banana", "orange", "grapefruit"}  # define a set

print("watermelon" in fruit)  # check for element

fruit.add("watermelon")  # add an element
print(fruit)

print(fruit.pop())  # remove an arbitrary element
print(fruit)

False
{'apple', 'grapefruit', 'orange', 'banana', 'watermelon'}
apple
{'grapefruit', 'orange', 'banana', 'watermelon'}


#### Dictionaries

A dictionary is a mutable data type that stores mappings of unique keys to values. Here's a dictionary that stores elements and their atomic numbers.

In [None]:
elements = {"hydrogen": 1, "helium": 2, "carbon": 6}

Dictionaries can have keys of any immutable type, like integers or tuples, not just strings. It's not even necessary for every key to have the same type! We can look up values or insert new values in the dictionary using square brackets that enclose the key.

In [None]:
print(elements["helium"])  # print the value mapped to "helium"
elements["lithium"] = 3  # insert "lithium" with a value of 3 into the dictionary

2


In [None]:
print("carbon" in elements)
print(elements.get("dilithium"))

True
None


Carbon is in the dictionary, so True is printed. Dilithium isn’t in our dictionary so None is returned by get and then printed. If you expect lookups to sometimes fail, get might be a better tool than normal square bracket lookups because errors can crash your program.
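The difference between the two lookup styles can be sketched as follows. The fallback string passed to get is an arbitrary value chosen for illustration.

```python
elements = {"hydrogen": 1, "helium": 2, "carbon": 6}

# Square brackets raise KeyError for a missing key.
try:
    elements["dilithium"]
except KeyError as error:
    print("KeyError:", error)

# get returns None, or a default you choose, instead of raising.
print(elements.get("dilithium"))                     # None
print(elements.get("dilithium", "no such element"))  # no such element
print(elements.get("helium", "no such element"))     # 2
```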

We can define the dictionary like this:

In [None]:
population = {'Shanghai': 17.8,
              'Istanbul': 13.3,
              'Karachi': 13.0,
              'Mumbai': 12.5}
print(population)

{'Shanghai': 17.8, 'Istanbul': 13.3, 'Karachi': 13.0, 'Mumbai': 12.5}


I chose to put each key-value pair on its own line to make this dictionary definition easier to read, but where and whether you use line breaks is simply a stylistic choice. This code works just as well:

In [None]:
population = {'Shanghai': 17.8, 'Istanbul': 13.3, 'Karachi': 13.0, 'Mumbai': 12.5}

We can include containers in other containers to create compound data structures. For example, this dictionary maps keys to values that are also dictionaries!

In [None]:
elements = {"hydrogen": {"number": 1,
                         "weight": 1.00794,
                         "symbol": "H"},
              "helium": {"number": 2,
                         "weight": 4.002602,
                         "symbol": "He"}}

In [None]:
helium = elements["helium"]  # get the helium dictionary
print(helium)
hydrogen_weight = elements["hydrogen"]["weight"]  # get hydrogen's weight
print(hydrogen_weight)

{'number': 2, 'weight': 4.002602, 'symbol': 'He'}
1.00794


In [None]:
oxygen = {"number":8,"weight":15.999,"symbol":"O"}  # create a new oxygen dictionary 
elements["oxygen"] = oxygen  # assign 'oxygen' as a key to the elements dictionary
print('elements = ', elements)

elements =  {'hydrogen': {'number': 1, 'weight': 1.00794, 'symbol': 'H'}, 'helium': {'number': 2, 'weight': 4.002602, 'symbol': 'He'}, 'oxygen': {'number': 8, 'weight': 15.999, 'symbol': 'O'}}


In [None]:
elements = {'hydrogen': {'number': 1, 'weight': 1.00794, 'symbol': 'H'},
            'helium': {'number': 2, 'weight': 4.002602, 'symbol': 'He'}}

elements['hydrogen']['is_noble_gas'] = False
elements['helium']['is_noble_gas'] = True

In [None]:
print(elements)

{'hydrogen': {'number': 1, 'weight': 1.00794, 'symbol': 'H', 'is_noble_gas': False}, 'helium': {'number': 2, 'weight': 4.002602, 'symbol': 'He', 'is_noble_gas': True}}


### Control Flow

###### Building Dictionaries

In [None]:
book_title =  ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin']

In [None]:
word_counter = {}

for word in book_title:
    if word not in word_counter:
        word_counter[word] = 1
    else:
        word_counter[word] += 1

print(word_counter)

{'great': 2, 'expectations': 1, 'the': 2, 'adventures': 2, 'of': 2, 'sherlock': 1, 'holmes': 1, 'gasby': 1, 'hamlet': 1, 'huckleberry': 1, 'fin': 1}


What's happening here?

The for loop iterates through each element in the list. For the first iteration, word takes the value 'great'.

Next, the if statement checks whether word is already a key in the word_counter dictionary.
Since it isn't yet, the statement word_counter[word] = 1 adds 'great' as a key to the dictionary with a value of 1.

Then, it leaves the if else statement and moves on to the next iteration of the for loop. word now takes the value 'expectations' and the process repeats.

When the if condition is not met, it is because that word already exists in the word_counter dictionary, and the statement word_counter[word] += 1 increases the count of that word by 1.

Once the for loop has iterated through every element of the list, word_counter holds the final count for each word.

Using .get() method

In [None]:
word_counter = {}

for word in book_title:
    word_counter[word] = word_counter.get(word, 0) + 1

print(word_counter)

{'great': 2, 'expectations': 1, 'the': 2, 'adventures': 2, 'of': 2, 'sherlock': 1, 'holmes': 1, 'gasby': 1, 'hamlet': 1, 'huckleberry': 1, 'fin': 1}


What's happening here?

The for loop iterates through the list as we saw earlier. The for loop feeds 'great' to the next statement in the body of the for loop.

In this line: word_counter[word] = word_counter.get(word,0) + 1, since the key 'great' doesn't yet exist in the dictionary, get() will return the value 0 and word_counter[word] will be set to 1.

Once it encounters a word that already exists in word_counter (e.g. the second appearance of 'the'), get() returns the current count for that key, which is then incremented by 1. On the second appearance of 'the', the count goes from 1 to 2.
Once the for loop has iterated through every element of the list, word_counter holds the final count for each word.
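As an aside that is not part of the original lesson, the standard library offers `collections.Counter`, which does the same counting in one call. The sketch below compares it with the get() approach on the same list.

```python
from collections import Counter

book_title = ['great', 'expectations', 'the', 'adventures', 'of', 'sherlock',
              'holmes', 'the', 'great', 'gasby', 'hamlet', 'adventures', 'of',
              'huckleberry', 'fin']

# Manual counting with get(), as in the lesson.
manual_counter = {}
for word in book_title:
    manual_counter[word] = manual_counter.get(word, 0) + 1

# Counter builds the same mapping directly from the list.
auto_counter = Counter(book_title)

print(auto_counter == manual_counter)  # True
print(auto_counter['great'])           # 2
print(auto_counter['missing'])         # 0, missing keys count as zero
```

Counter is a dict subclass, so it supports the same lookups and iteration as the hand-built dictionary.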


###### Iterating through dicts

In [None]:
cast = {
           "Jerry Seinfeld": "Jerry Seinfeld",
           "Julia Louis-Dreyfus": "Elaine Benes",
           "Jason Alexander": "George Costanza",
           "Michael Richards": "Cosmo Kramer"
       }

for key in cast:
    print(key)

Jerry Seinfeld
Julia Louis-Dreyfus
Jason Alexander
Michael Richards


In [None]:
for key, value in cast.items():
    print("Actor: {}    Role: {}".format(key, value))

Actor: Jerry Seinfeld    Role: Jerry Seinfeld
Actor: Julia Louis-Dreyfus    Role: Elaine Benes
Actor: Jason Alexander    Role: George Costanza
Actor: Michael Richards    Role: Cosmo Kramer


In [None]:
result = 0
basket_items = {'apples': 4, 'oranges': 19, 'kites': 3, 'sandwiches': 8}
fruits = ['apples', 'oranges', 'pears', 'peaches', 'grapes', 'bananas']

for item, count in basket_items.items():
    if item in fruits:
        result += count

print("There are {} fruits in the basket.".format(result))

There are 23 fruits in the basket.


In [None]:
fruit_count, not_fruit_count = 0, 0
basket_items = {'apples': 4, 'oranges': 19, 'kites': 3, 'sandwiches': 8}
fruits = ['apples', 'oranges', 'pears', 'peaches', 'grapes', 'bananas']

# Iterate through the dictionary
for item, count in basket_items.items():
    if item in fruits:
        fruit_count += count
    else:
        not_fruit_count += count

print("The number of fruits is {}.  There are {} objects that are not fruits.".format(fruit_count, not_fruit_count))

The number of fruits is 23.  There are 11 objects that are not fruits.


##### ZIP

zip returns an iterator that combines multiple iterables into one sequence of tuples. Each tuple contains the elements in that position from all the iterables. For example, printing

list(zip(['a', 'b', 'c'], [1, 2, 3])) would output [('a', 1), ('b', 2), ('c', 3)].

As with range(), we need to convert it to a list or iterate through it with a loop to see the elements.

You could unpack each tuple in a for loop like this.

In [None]:
letters = ['a', 'b', 'c']
nums = [1, 2, 3]

for letter, num in zip(letters, nums):
    print("{}: {}".format(letter, num))

a: 1
b: 2
c: 3


In [None]:
some_list = [('a', 1), ('b', 2), ('c', 3)]
letters, nums = zip(*some_list)
print(letters, nums)

('a', 'b', 'c') (1, 2, 3)


In [None]:
x_coord = [23, 53, 2, -12, 95, 103, 14, -5]
y_coord = [677, 233, 405, 433, 905, 376, 432, 445]
z_coord = [4, 16, -6, -42, 3, -6, 23, -1]
labels = ["F", "J", "A", "Q", "Y", "B", "W", "X"]

points = []
for point in zip(labels, x_coord, y_coord, z_coord):
    points.append("{}: {}, {}, {}".format(*point))

for point in points:
    print(point)

F: 23, 677, 4
J: 53, 233, 16
A: 2, 405, -6
Q: -12, 433, -42
Y: 95, 905, 3
B: 103, 376, -6
W: 14, 432, 23
X: -5, 445, -1


Build dict from lists

In [None]:
cast_names = ["Barney", "Robin", "Ted", "Lily", "Marshall"]
cast_heights = [72, 68, 72, 66, 76]

cast = dict(zip(cast_names, cast_heights))
print(cast)

{'Barney': 72, 'Robin': 68, 'Ted': 72, 'Lily': 66, 'Marshall': 76}


Transpose with Zip

In [None]:
data = ((0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11))

data_transpose = tuple(zip(*data))
print(data_transpose)

((0, 3, 6, 9), (1, 4, 7, 10), (2, 5, 8, 11))


Using enumerate

In [None]:
cast = ["Barney Stinson", "Robin Scherbatsky", "Ted Mosby", "Lily Aldrin", "Marshall Eriksen"]
heights = [72, 68, 72, 66, 76]

for i, character in enumerate(cast):
    cast[i] = character + " " + str(heights[i])

print(cast)

['Barney Stinson 72', 'Robin Scherbatsky 68', 'Ted Mosby 72', 'Lily Aldrin 66', 'Marshall Eriksen 76']


###### List Comprehension

In Python, you can create lists really quickly and concisely with list comprehensions. This example from earlier:

In [None]:
cities = ('rio de janeiro', 'são paulo', 'brasília', 'recife', 'salvador')
capitalized_cities = []
for city in cities:
    capitalized_cities.append(city.title())
print(capitalized_cities)

['Rio De Janeiro', 'São Paulo', 'Brasília', 'Recife', 'Salvador']


can be reduced to:

In [None]:
capitalized_cities = [city.title() for city in cities]
print(capitalized_cities)

['Rio De Janeiro', 'São Paulo', 'Brasília', 'Recife', 'Salvador']


List comprehensions allow us to create a list using a for loop in one step.

You create a list comprehension with brackets [], including an expression to evaluate for each element in an iterable. This list comprehension above calls city.title() for each element city in cities, to create each element in the new list, capitalized_cities.

In [None]:
squares = [x**2 if x % 2 == 0 else x + 3 for x in range(9)]
print(squares)

[0, 4, 4, 6, 16, 8, 36, 10, 64]


In [None]:
scores = {
             "Rick Sanchez": 70,
             "Morty Smith": 35,
             "Summer Smith": 82,
             "Jerry Smith": 23,
             "Beth Smith": 98
          }

passed = [name for name, score in scores.items() if score >= 65]
print(passed)

['Rick Sanchez', 'Summer Smith', 'Beth Smith']
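The last two cells use conditionals in two different positions. The sketch below, with made-up inputs, separates the two forms: a trailing `if` filters elements out, while a conditional expression before `for` chooses a value for every element.

```python
numbers = range(10)

# Filter: the trailing `if` decides whether an element is included at all.
evens = [n for n in numbers if n % 2 == 0]
print(evens)   # [0, 2, 4, 6, 8]

# Transform: a conditional expression before `for` picks the value,
# and every element of the iterable produces one entry.
labels = ["even" if n % 2 == 0 else "odd" for n in range(4)]
print(labels)  # ['even', 'odd', 'even', 'odd']
```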


### Functions

###### Function Structure

Function Header

Let's start with the function header, which is the first line of a function definition.

The function header always starts with the def keyword, which indicates that this is a function definition.

Then comes the function name (here, cylinder_volume), which follows the same naming conventions as variables. You can revisit the naming conventions below.
Immediately after the name are parentheses that may include arguments separated by commas (here, height and radius). Arguments, or parameters, are values that are passed in as inputs when the function is called, and are used in the function body. If a function doesn't take arguments, these parentheses are left empty.

The header always ends with a colon :.

Function Body

The rest of the function is contained in the body, which is where the function does its work.

The body of a function is the code indented after the header line. Here, it's the two lines that define pi and return the volume.

Within this body, we can refer to the argument variables and define new variables, which can only be used within these indented lines.

The body will often include a return statement, which is used to send back an output value from the function to the statement that called the function. A return statement consists of the return keyword followed by an expression that is evaluated to get the output value for the function. If there is no return statement, the function simply returns None.
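The cylinder_volume function that this walkthrough refers to is not reproduced in these notes. A sketch consistent with the description (the name cylinder_volume, the arguments height and radius, and a two-line body that defines pi and returns the volume) might look like the following. The exact value used for pi is an assumption.

```python
def cylinder_volume(height, radius):
    pi = 3.14159  # approximation of pi, assumed for this sketch
    return height * pi * radius ** 2

# Call the function and print the value it returns.
print(cylinder_volume(10, 3))
```

Here the header is the `def` line, and the two indented lines form the body.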

Naming Conventions for Functions

Function names follow the same naming conventions as variables.

Only use ordinary letters, numbers and underscores in your function names. They can’t have spaces, and need to start with a letter or underscore.

You can’t use Python's reserved words or keywords for function names, as discussed earlier with variable names.

Try to use descriptive names that can help readers understand what the function does.

Example 1

In [None]:
def population_density(population, land_area):
    return population/land_area

# test cases for your function
test1 = population_density(10, 1)
expected_result1 = 10
print("expected result: {}, actual result: {}".format(expected_result1, test1))

test2 = population_density(864816, 121.4)
expected_result2 = 7123.6902801
print("expected result: {}, actual result: {}".format(expected_result2, test2))

expected result: 10, actual result: 10.0
expected result: 7123.6902801, actual result: 7123.690280065897


Example 2

In [None]:
def readable_timedelta(days):
    # use integer division to get the number of weeks
    weeks = days // 7
    # use % to get the number of days that remain
    remainder = days % 7
    return "{} week(s) and {} day(s).".format(weeks, remainder)

# test your function
print(readable_timedelta(10))

1 week(s) and 3 day(s).


###### Documentation

Documentation is used to make your code easier to understand and use. Functions are especially easy to document because they can carry documentation strings, or docstrings. A docstring is a string literal, placed as the first statement in the function body, that explains the purpose of the function and how it should be used. Here's a function for population density with a docstring.

In [None]:
def population_density(population, land_area):
    """Calculate the population density of an area. """
    return population / land_area

Docstrings are surrounded by triple quotes. The first line of the docstring is a brief explanation of the function's purpose. If you feel that this is sufficient documentation you can end the docstring at this point; single line docstrings are perfectly acceptable, as in the example above.

In [None]:
def population_density(population, land_area):
    """Calculate the population density of an area.

    INPUT:
    population: int. The population of that area
    land_area: int or float. This function is unit-agnostic: if you pass in values in terms
    of square km or square miles, the function will return a density in those units.

    OUTPUT: 
    population_density: population / land_area. The population density of a particular area.
    """
    return population / land_area

If you think that a longer description would be appropriate for the function, you can add more information after the one-line summary. In the example above, you can see that we wrote an explanation of the function's arguments, stating the purpose and types of each one. It's also common to provide some description of the function's output.

Every piece of the docstring is optional; however, docstrings are a part of good coding practice. You can read more about docstring conventions in PEP 257.

Example 1

In [None]:
def readable_timedelta(days):
    """Return a string of the number of weeks and days included in days.

    Args:
        days (int): number of days to convert

    Returns:
        string of the number of weeks and days included in days
    """
    weeks = days // 7
    remainder = days % 7
    return "{} week(s) and {} day(s)".format(weeks, remainder)

In [None]:
help(readable_timedelta)

Help on function readable_timedelta in module __main__:

readable_timedelta(days)
    Return a string of the number of weeks and days included in days.
    
    Args:
        days (int): number of days to convert
    
    Returns:
        string of the number of weeks and days included in days



###### Lambda Functions

You can use lambda expressions to create anonymous functions. That is, functions that don’t have a name. They are helpful for creating quick functions that aren’t needed later in your code. This can be especially useful for higher order functions, or functions that take in other functions as arguments.

With a lambda expression, this function:

In [None]:
def multiply(x, y):
    return x * y

can be reduced to:

In [None]:
multiply = lambda x, y: x * y

In [None]:
multiply(4, 7)

28

**Components of a Lambda Function**

The lambda keyword is used to indicate that this is a lambda expression.

Following lambda are one or more arguments for the anonymous function separated by commas, followed by a colon :. Similar to functions, the way the arguments are named in a lambda expression is arbitrary.

Last is an expression that is evaluated and returned in this function. This is a lot like an expression you might see as a return statement in a function.
With this structure, lambda expressions aren’t ideal for complex functions, but can be very useful for short, simple functions.
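One common place for a short lambda is the key argument of sorted(), which is itself a higher-order function. The sketch below sorts made-up (name, height) pairs by height.

```python
cast_heights = [("Barney", 72), ("Robin", 68), ("Ted", 72), ("Lily", 66), ("Marshall", 76)]

# key is a function applied to each element to get the value to sort by.
by_height = sorted(cast_heights, key=lambda pair: pair[1])
print(by_height)
# [('Lily', 66), ('Robin', 68), ('Barney', 72), ('Ted', 72), ('Marshall', 76)]
```

Because the lambda is only needed for this one call, there is no reason to give it a name.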

**Lambda with Map**

map() is a higher-order built-in function that takes a function and iterable as inputs, and returns an iterator that applies the function to each element of the iterable. The code below uses map() to find the mean of each list in numbers to create the list averages. Instead of defining a separate mean function, it passes a lambda expression directly in the call to map(), which makes the code more concise.

In [None]:
numbers = [
              [34, 63, 88, 71, 29],
              [90, 78, 51, 27, 45],
              [63, 37, 85, 46, 22],
              [51, 22, 34, 11, 18]
           ]

averages = list(map(lambda x: sum(x) / len(x), numbers))
print(averages)

[57.0, 58.2, 50.6, 27.2]
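For comparison, the longer form that the lambda replaces could look like the sketch below. The function name `mean` and its parameter name are assumptions. The data is the same as in the cell above.

```python
numbers = [
    [34, 63, 88, 71, 29],
    [90, 78, 51, 27, 45],
    [63, 37, 85, 46, 22],
    [51, 22, 34, 11, 18]
]

def mean(num_list):
    """Return the arithmetic mean of a list of numbers."""
    return sum(num_list) / len(num_list)

# map() receives the named function instead of a lambda.
averages = list(map(mean, numbers))
print(averages)  # [57.0, 58.2, 50.6, 27.2]
```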


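**Lambda with Filter**

filter() is a higher-order built-in function that takes a function and iterable as inputs, and returns an iterator with the elements from the iterable for which the function returns True. The code below uses filter() with a lambda expression to keep only the city names that are shorter than 10 characters.

In [None]: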
cities = ["New York City", "Los Angeles", "Chicago", "Mountain View", "Denver", "Boston"]

short_cities = list(filter(lambda x: len(x) < 10, cities))
print(short_cities)

['Chicago', 'Denver', 'Boston']


###### Iterators & Generators

Iterables are objects that can return one of their elements at a time, such as a list. Many of the built-in functions we’ve used so far, like 'enumerate,' return an iterator.

An iterator is an object that represents a stream of data. This is different from a list, which is also an iterable, but is not an iterator because it is not a stream of data.
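The distinction can be seen with the built-in functions iter() and next(). In this sketch, the list `lessons` is sample data. The list is the iterable, and the object returned by iter() is the iterator that hands out one element at a time.

```python
lessons = ["Control Flow", "Functions", "Scripting"]

lesson_iterator = iter(lessons)  # ask the iterable for an iterator
print(next(lesson_iterator))     # Control Flow
print(next(lesson_iterator))     # Functions
print(next(lesson_iterator))     # Scripting

# Once the stream is used up, next() raises StopIteration.
try:
    next(lesson_iterator)
except StopIteration:
    print("no more elements")

# The list itself is untouched and can hand out a fresh iterator.
print(list(iter(lessons)))       # ['Control Flow', 'Functions', 'Scripting']
```

A for loop performs these same steps behind the scenes and stops when StopIteration is raised.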

Generators are a simple way to create iterators using functions. You can also define iterators using classes, which the Python documentation covers in more detail.

Here is an example of a generator function called my_range, which produces an iterator that is a stream of numbers from 0 to (x - 1).

In [None]:
def my_range(x):
    i = 0
    while i < x:
        yield i
        i += 1

Notice that instead of using the return keyword, it uses yield. This allows the function to hand back values one at a time and resume where it left off each time the next value is requested. This yield keyword is what differentiates a generator from a typical function.

Remember, since this returns an iterator, we can convert it to a list or iterate through it in a loop to view its contents. For example, this code:

In [None]:
for x in my_range(5):
    print(x)

0
1
2
3
4


**Why Generators?**

You may be wondering why we'd use generators over lists. Here’s an excerpt from a Stack Overflow page that addresses this:

Generators are a lazy way to build iterables. They are useful when the fully realized list would not fit in memory, or when the cost to calculate each list element is high and you want to do it as late as possible. But they can only be iterated over once.

In [None]:
lessons = ["Why Python Programming", "Data Types and Operators", "Control Flow", "Functions", "Scripting"]

def my_enumerate(iterable, start=0):
    count = start
    for element in iterable:
        yield count, element
        count += 1

for i, lesson in my_enumerate(lessons, 1):
    print("Lesson {}: {}".format(i, lesson))

Lesson 1: Why Python Programming
Lesson 2: Data Types and Operators
Lesson 3: Control Flow
Lesson 4: Functions
Lesson 5: Scripting


In [None]:
def chunker(iterable, size):
    """Yield successive chunks from iterable of length size."""
    for i in range(0, len(iterable), size):
        yield iterable[i:i + size]

for chunk in chunker(range(25), 4):
    print(list(chunk))

[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]
[12, 13, 14, 15]
[16, 17, 18, 19]
[20, 21, 22, 23]
[24]


**Generator Expressions**

Here's a cool concept that combines generators and list comprehensions! You can actually create a generator in the same way you'd normally write a list comprehension, except with parentheses instead of square brackets. For example:

In [None]:
sq_list = [x**2 for x in range(10)]  # this produces a list of squares

sq_iterator = (x**2 for x in range(10))  # this produces an iterator of squares

print(sq_list, sq_iterator)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81] <generator object <genexpr> at 0x7fa2c3834750>
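Printing a generator only shows the object, because nothing is computed until the values are requested. The following sketch consumes a generator expression and shows that it can only be iterated over once.

```python
sq_iterator = (x**2 for x in range(10))

print(sum(sq_iterator))   # 285
print(list(sq_iterator))  # [] because the generator is already exhausted

# Passing a generator expression straight to a function avoids
# building the intermediate list.
largest_square = max(x**2 for x in range(10))
print(largest_square)     # 81
```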


### Scripting

* Python Installation and Environment Setup

* Running and Editing Python Scripts

* Interacting with User Input

* Handling Exceptions

* Reading and Writing Files

* Importing Local, Standard, and Third-Party Modules

* Experimenting with an Interpreter

###### Python Installation

Before We Install Python:

1. **Prepare to Use Command Line**

To install Python and follow this lesson, you will need to use the command line. We will walk you through all the details, so don't worry if you have never used it before! If you would like to learn or brush up on the command line, we strongly recommend going through this free Shell Workshop lesson, where you can set up and learn how to use Unix Shell commands.

**Note to Windows Users: Install Git Bash**

As noted in the free Shell Workshop linked above, we recommend you install Git Bash and use it as your terminal for this lesson. During installation, select the checkbox "Use Git and Optional Unix tools from the Windows Command Prompt". This will allow you to use Unix commands while in Windows. If you'd rather use PowerShell, those commands are also provided in this lesson. For more information on the different command shells, check out the Shell Workshop lesson linked above.

2. **Is Python Already Installed On Your Computer?**

In this course, we're using the most recent major version of Python - Python 3. Although Python 2 is still being used in many places, it is no longer being updated. In order to keep up compatibility with future improvements to Python, we recommend using Python 3.

Mac OS X and Linux usually come with Python 2 already installed. We DO NOT recommend that you make any changes to this Python, since parts of the operating system are using Python. However, it shouldn't do any harm to your system to install Python 3 separately, too.

Windows doesn't usually come with Python included, but you can still check whether you have it installed before going ahead. So, first, check that you’ve not already got Python 3 installed.

Open up your Terminal or Command Line (this would be Git Bash on Windows).

In a new terminal or command prompt, type

`python --version`

and press Enter.

You might get a response that the Python version installed is something like Python 2.7.9. In that case, it would tell you that you have Python 2 installed, and you'll want to follow the steps in the next couple of sections to update it to Python 3.

If instead the version number starts with a 3, then you already have Python 3 installed! Don't install Python again!

Alternatively, you might see an error message - don't worry about that for now, just try the steps in the next couple of sections.



###### Anaconda Installation

* What is Anaconda?

Welcome to this lesson on using Anaconda to manage your packages and environments when working with Python. With Anaconda, it's simple to install the packages you'll use often when analyzing data. You'll also use it to create virtual environments that make working on several projects at once much simpler. Anaconda has simplified my workflow and solved a number of issues that came up when I had to deal with multiple packages and Python versions at the same time.

Anaconda is actually a software distribution that contains conda, Python, and more than 150 scientific packages and their dependencies. The conda application is the package and environment manager. Anaconda is a fairly large download (~500 MB) because it comes with the most common Python data science packages. If you don't need all of the packages, or want to save bandwidth or disk space, there is also Miniconda, a smaller distribution that includes only conda and Python. You can still install any of the available packages with conda; they just don't come with that distribution.

Conda is a program you'll use only from the command line, so if you aren't comfortable with that, take a look at this command line tutorial for Windows or our Linux Command Line Basics course for OSX/Linux.

You probably already have Python installed and may be wondering why you should bother with all this. First, because Anaconda comes with a bunch of data science packages, you'll be ready to work with data right away. Second, using conda to manage your packages and environments will reduce the number of future problems caused by the various libraries you'll be using.

* Managing packages

Package managers are used to install libraries and other software on your computer. You probably already know pip, the default package manager for Python libraries. Conda is similar to pip, except that its available packages are focused on data science, while pip is for general use. However, conda is not Python-specific the way pip is; it can also install non-Python packages. It is a package manager for any software. That said, not all Python libraries are available from the Anaconda distribution and conda. You can (and will) still use pip alongside conda to install packages.

Conda installs precompiled packages. For example, the Anaconda distribution comes with Numpy, Scipy and Scikit-learn compiled with the MKL library, which speeds up various math operations. The packages are maintained by contributors to the distribution, which means the latest releases are not always available right away. But because someone had to build the packages for many systems, they tend to be more stable (and therefore more convenient).

* Environments

Along with managing packages, conda also manages virtual environments, similar to virtualenv and pyenv, other popular environment managers.

Environments let you separate and isolate the packages you use for different projects. You will often work with code that depends on different versions of the same library. For example, you might have code that uses new features of Numpy, or code that uses old features that were removed from newer versions. It's practically impossible to have two versions of Numpy installed at the same time. Instead, give each Numpy version its own environment, and work on each project in the environment that has the version it needs.

The same issue comes up with Python 2 and Python 3. You may work with old code that doesn't run on Python 3, or new code that doesn't run on Python 2. Having both installed side by side can cause a lot of confusion and bugs; it's much better to keep them in separate environments.

You can also export the list of packages in an environment to a file and include that file with your code. This lets other people easily install all the dependencies your code needs. pip offers similar functionality with `pip freeze > requirements.txt`.

* Installation

Anaconda is available for Windows, Mac OS X, and Linux. You can find the installers and installation instructions at https://www.continuum.io/downloads.

If you already have Python installed on your computer, this installation won't break anything. Instead, the default Python used by your scripts and programs will be the one that comes with Anaconda.

Choose the Python 3.6 version. You can install Python 2 later if you want. Also choose the 64-bit installer if you have a 64-bit operating system; otherwise, choose the 32-bit installer. Download the appropriate version and install it. Continue once that's done!

After installation, you're automatically in the default conda environment with all packages installed. You can check your installation by typing `conda list` in your terminal.

Several applications are installed along with Anaconda:

* Anaconda Navigator, a graphical interface for managing your environments and packages
* Anaconda Prompt, a terminal where you can use the command line interface to manage your environments and packages
* Spyder, an IDE geared toward scientific development

To avoid problems later, it's best to update all the packages in the default environment. Open the Anaconda Prompt application and type the following commands:

```
conda upgrade conda
conda upgrade --all
```

Confirm when prompted. The packages that come with the initial installation tend to be out of date. Updating them now prevents future errors caused by outdated software.

Note: in the previous step, running `conda upgrade conda` should not be necessary, because the `--all` in the following command should include the conda package itself, but some users have reported errors when skipping it.

For the rest of this lesson, you'll need to type a number of commands in the terminal. I strongly recommend that you start working with Anaconda this way and only later move to a graphical interface, if you like.

* Managing packages

Once Anaconda is installed, managing packages is fairly straightforward. To install a package, type `conda install package_name` in your terminal. For example, to install numpy, type `conda install numpy`.

You can install multiple packages at the same time. Something like `conda install numpy scipy pandas` installs all of them in one go. You can also specify which version of a package you want by adding the version number, as in `conda install numpy=1.10`.

Conda also installs dependencies for you automatically. For example, scipy depends on numpy: it uses and requires numpy. If you install just scipy (`conda install scipy`), conda will also install numpy if it isn't already installed.

Most of the commands are intuitive. To uninstall a package, use `conda remove package_name`. To update a package, use `conda update package_name`. If you want to update all packages in an environment, which is often useful, use `conda update --all`. Finally, to list installed packages, use `conda list`, which you saw earlier.

If you don't know the exact name of the package you're looking for, you can try to find it with `conda search search_term`. For example, I know I want to install Beautiful Soup, but I'm not sure of the exact package name, so I type `conda search beautifulsoup`. This returns a list of the available Beautiful Soup packages with the appropriate package name, beautifulsoup4.

We can check the version of Python in the currently active Conda environment with the following command:

In [None]:
python --version

To create a new Conda environment, open up your terminal and type the following command:

In [None]:
conda create -n datasci-env pandas numpy scikit-learn matplotlib seaborn

If you have an existing file containing project dependencies, you can recreate the environment by running the following:

In [None]:
conda env create -f environment.yaml

Once your environment is created, you can activate it using:

In [None]:
conda activate datasci-env

Let’s deactivate the environment we just created:

In [None]:
conda deactivate

You can list all available environments using:

In [None]:
conda env list

If you're done with a project and no longer need an environment, you can remove it using the following command:

In [None]:
conda env remove -n datasci-env

* Managing environments

As mentioned before, conda can be used to create environments that isolate your projects. To create an environment, use `conda create -n env_name list of packages` in your terminal. Here `-n env_name` sets the name of your environment (`-n` for name), and `list of packages` is the list of packages you want installed in it. For example, to create an environment named my_env and install numpy in it, type `conda create -n my_env numpy`.

When creating an environment, you can specify which version of Python to install in it. This is useful when you're working with code in both Python 2.x and Python 3.x. To create an environment with a specific Python version, do something like `conda create -n py3 python=3` or `conda create -n py2 python=2`. I have both of these environments on my personal computer. I use them as general environments that aren't tied to any specific project, so that each Python version is easily accessible for general work. These commands install the most recent versions of Python 3 and Python 2, respectively. To install a specific version, use, for example, `conda create -n py python=3.3` for Python 3.3.

* Entering an environment

Once you have created an environment, use `source activate my_env` to enter it on OSX/Linux. On Windows, use `activate my_env`. (Recent versions of conda use `conda activate my_env` on every platform.)

When you're in the environment, you'll see its name in the terminal prompt, something like `(my_env) ~ $`. The environment has only a few packages installed by default, plus the ones you named when creating it. You can check this with `conda list`. Installing packages in the environment works the same way as before: `conda install package_name`. The only difference is that the packages you install are available only while the environment is active. To leave the environment, type `source deactivate` (on OSX/Linux). On Windows, use `deactivate`. (Recent versions of conda use `conda deactivate`.)

Which command would you use to create an environment named data with Python 3.6, Numpy, and Pandas installed?

`conda create -n data python=3.6 numpy pandas`

* Saving and loading environments

A really useful feature of sharing environments is that others can install all the packages used in your code, with the correct versions. You can save the packages to a YAML file with `conda env export > environment.yaml`. The first part, `conda env export`, writes out all the packages in the environment, including the Python version.

The exported output lists the environment's name and all of its dependencies (along with their versions). The second part of the export command, `> environment.yaml`, writes the exported text to a YAML file named environment.yaml. This file can then be shared with others to create the same environment you used for your project.

To create an environment from an environment file, use `conda env create -f environment.yaml`. This creates a new environment with the name given in environment.yaml.

* Listing environments

If you forget the names of your environments (it happens to me sometimes), use `conda env list` to list all the environments you've created. You should see a list of environments, with an asterisk next to the one that is currently active. The default environment, the one used when no other is active, is called root (recent versions of conda call it base).

* Removing environments

If there are environments you no longer use, `conda env remove -n env_name` removes the specified environment (here, the one named env_name).

* Recommendations

One thing that has helped me tremendously is having separate environments for Python 2 and Python 3. I used `conda create -n py2 python=2` and `conda create -n py3 python=3` to create two separate environments, py2 and py3. Now I have a general-purpose environment for each Python version. In each of them, I've installed most of the standard data science packages (Numpy, Scipy, Pandas, etc.).

I've also found it useful to create an environment for each project I'm working on. This works well even for projects unrelated to data, such as web apps built with Flask. For example, I have an environment for my personal blog, which uses Pelican.

**Sharing environments**

When sharing your code on GitHub, it's good practice to make an environment file and include it in the repository. This makes it easier for people to install all the dependencies for your code. I also usually include a requirements.txt file made with `pip freeze` for people who aren't using conda.

###### Summary

In [None]:
# create new environment
conda create -n ENV_NAME

# create new environment with different version of python
conda create -n ENV_NAME python=VERSION

# create environment from existing environment.yaml
conda env create -f environment.yaml

# update existing environment from environment.yaml
conda env update --file environment.yaml

# activate environment
conda activate ENV_NAME

# deactivate environment
conda deactivate

# delete/remove environment
conda env remove -n ENV_NAME

# list all environments
conda env list

# export requirements.txt with conda
conda list --export > requirements.txt

# export requirements.txt with pip
pip freeze > requirements.txt

# export environment.yaml
conda env export > environment.yaml

# export environment.yaml with no builds
conda env export --no-builds > environment.yaml

# install dependencies
conda install PACKAGE_NAME
# or, from the conda-forge channel
conda install -c conda-forge PACKAGE_NAME

# install dependencies with a specific version
conda install PACKAGE_NAME=VERSION

# uninstall dependencies
conda uninstall PACKAGE_NAME

# list all dependencies
conda list

# create ipython kernel (for jupyter and jupyterlab)
conda install -c conda-forge jupyterlab ipykernel
ipython kernel install --user --name=KERNEL_NAME

# list all ipython kernels
jupyter kernelspec list

# uninstall/remove existing ipython kernel
jupyter kernelspec uninstall KERNEL_NAME

###### Jupyter Notebooks

* What are Jupyter notebooks?

Welcome to this lesson on using Jupyter notebooks. The notebook is a web application that lets you combine explanatory text, math equations, code, and visualizations in a single, easily shareable document. For example, one of my favorite notebooks is the analysis of gravitational waves from two colliding black holes, detected by the LIGO experiment. You can download the data, run the notebook's code, and repeat the analysis, detecting the gravitational waves yourself! Notebooks have quickly become an essential tool for working with data. You'll see them used for data cleaning and exploration, visualization, machine learning, and even big data analysis.

I made an example notebook for my personal blog that shows off many of the things notebooks can do. Typically you'd do this kind of work in a terminal, either in the plain Python shell or with IPython. Visualizations would appear in separate windows, and the documentation would live in other files, along with the various scripts for functions and classes. With notebooks, all of this is in one place and can be read together. Notebooks are also rendered automatically on GitHub, which is a great feature for sharing your work. There is also http://nbviewer.jupyter.org/, which renders notebooks from a GitHub repository or from anywhere else they are stored.

* Literate programming

Notebooks are a form of literate programming. Proposed by Donald Knuth in 1984, this approach has the documentation written as a narrative alongside the code instead of being kept separately. In Donald Knuth's words:

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

After all, code is written for humans, not computers. Notebooks help a lot with this. You can write the documentation as narrative text, right alongside the code. This is useful not only for the people reading your notebooks, but also for your future self when you come back to that code.

Just a side note: recently, the idea of literate programming has been extended to a whole programming language, called Eve.

* How notebooks work

Jupyter notebooks grew out of the IPython project started by Fernando Perez. IPython is an interactive shell, similar to the normal Python shell but with great features like syntax highlighting and code completion. Originally, notebooks worked by sending messages from the web app (the notebook you see in the browser) to an IPython kernel (an IPython application running in the background). The kernel executed the code, then sent the results back to the notebook.

The central point is the notebook server. Your browser connects to the server, and the notebook is rendered as a web app. Code you write in the web app is sent through the server to the kernel. The kernel runs the code and sends the output back to the server, and the output is then rendered in the browser. When you save a notebook, it is written to the server as a JSON file with the .ipynb extension.
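
To give a feel for what "a JSON file with the .ipynb extension" means, here is a simplified, hand-written sketch of a notebook's structure. The field names follow my understanding of version 4 of the notebook format and the cell contents are made up for illustration; a notebook saved by Jupyter contains more metadata, so treat this as an approximation rather than a complete specification.

```python
import json

# A simplified notebook with one Markdown cell and one code cell (assumed layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# My analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # the cell has not been run yet
            "outputs": [],            # the kernel's results would be stored here
            "source": ["print('hello')"],
        },
    ],
}

text = json.dumps(notebook, indent=1)  # roughly what gets written to the .ipynb file
loaded = json.loads(text)              # reading it back is plain JSON parsing
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Because the file is plain JSON, both the code and its saved outputs travel together, which is what lets sites render a notebook without running it.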

The great part of this architecture is that the kernel doesn't need to run Python. Since the notebook and the kernel are separate, code in any language can be sent between them. For example, two of the earliest non-Python kernels were for the R and Julia languages. With an R kernel, code written in R is sent to the R kernel, where it is executed, exactly as Python code is with a Python kernel. IPython notebooks were renamed once they became independent of any specific language. The new name, Jupyter, comes from combining the names Julia, Python, and R. If you're interested, the Jupyter project maintains a list of available kernels.

Another benefit is that the server can be run anywhere and accessed over the Internet. Typically you'll run the server on your own machine, where all your data and notebook files are stored. But you could also set up a server on a remote machine or a cloud instance such as Amazon's EC2. Then you can access the notebooks in a browser from anywhere in the world.

* Installing Jupyter Notebook

By far the easiest way to install Jupyter is with Anaconda. Jupyter notebooks come with the distribution, so you can use notebooks right away from the default environment.

To install Jupyter notebooks in a conda environment, use `conda install jupyter notebook`.

Jupyter notebooks are also available through pip with `pip install jupyter notebook`.

* Launching the notebook server

To start a notebook server, type `jupyter notebook` in your terminal or console. This starts the server in the directory where you ran the command, which means any notebook files will be saved in that directory. Typically you'd start the server in the directory where your notebooks live. However, you can also navigate through your file system to where the notebooks are.

When you run the command (try it yourself!), the server home page should open in your browser. By default, the notebook server runs at http://localhost:8888. If you aren't familiar with this, localhost means your computer and 8888 is the port the server is using. As long as the server is running, you can always come back to it by going to http://localhost:8888 in your browser.

If you start another server, it will try to use port 8888, but since that port is occupied, the new server will run on port 8889. To connect to this new server, go to http://localhost:8889. Every additional notebook server increments the port number like this.

You may see some files and folders listed on the home page, depending on the directory where you started the server.

In the top right, you can click "New" to create a new notebook, text file, folder, or terminal. The list under "Notebooks" shows the kernels you have installed. In my case, I'm running the server in a Python 3 environment, so I have a Python 3 kernel available. You might see Python 2 here. I've also installed kernels for Scala 2.10 and 2.11, which appear in the list as well.

If you run a Jupyter notebook server from a conda environment, you'll also be able to choose kernels from any of your other environments. To create a new notebook, click on the kernel you want to use.

The tabs at the top show Files, Running, and Clusters. Files shows all the files and folders in the current directory. Clicking the Running tab lists all the currently running notebooks, which you can manage from there.

Clusters used to be where you'd create multiple kernels for parallel computing. That has since been taken over by ipyparallel, so there isn't much to do there.

If you're running the notebook server from a conda environment, you'll also have access to a "Conda" tab. There you can manage your environments from within Jupyter: create new environments, install packages, update packages, export environments, and more.


* Shutting down Jupyter

You can shut down individual notebooks by marking the checkbox next to each notebook on the server home page and then clicking "Shutdown". Make sure you've saved your work before doing this! Any changes made since the last save will be lost. You'll also need to rerun the code the next time you start the notebook.

You can also shut down the entire server by pressing control + C twice in the terminal. Again, this immediately shuts down all the running notebooks, so make sure your work is saved!

##### Running and Editing a Python Script

* Running

Download the zip file first_script attached at the bottom of this page (click it to unzip the file, then move it to an appropriate directory on your computer). This might be a good time to set up a new directory for your learning if you don't have one already.

Open your terminal and use cd to navigate to the directory containing that downloaded file.

Now that you’re in the directory with the file, you can run it by typing `python first_script.py` and pressing enter. Note: You may have to enter python3 instead of python to execute Python 3 if you have both versions installed on your computer.

You’ll know you’ve run the script successfully if you see this message printed to your terminal:

Congratulations on running this script!!

* Configure Your Own Python Programming Setup

Now that you've seen my setup, take a moment to get comfortable on your own computer.

Below you will find a number of code editors to choose from. For all of our courses we recommend Atom, which works on all operating systems. If you decide not to use Atom, Sublime Text is also popular with first-time coders.

For Mac and Linux:

1. Visual Studio Code

2. Atom

3. Sublime Text

4. emacs

5. vim

For Windows:

1. Visual Studio Code

2. Atom

3. Sublime Text

4. Notepad++

Get your screen set up with a text editor, terminal/command line and the Udacity classroom in a web browser, so you can iterate on your Python script. Play with the display options to see what you find most comfortable to look at, and see if you can find a tab-to-four-spaces option - that'll be very useful for Python indentation.

#### Scripting With Raw Input

We can get raw input from the user with the built-in function input, which takes in an optional string argument that you can use to specify a message to show to the user when asking for input.

In [None]:
name = input("Enter your name: ")
print("Hello there, {}!".format(name.title()))

This prompts the user to enter a name and then uses the input in a greeting. The input function takes in whatever the user types and stores it as a string. If you want to interpret their input as something other than a string, like an integer, as in the example below, you need to wrap the result with the new type to convert it from a string.

In [None]:
num = int(input("Enter an integer: "))
print("hello" * num)
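
The difference the conversion makes can be seen without any user interaction. In this small sketch, the variable `raw` is a stand-in (an assumption for illustration) for the string that `input` would return:

```python
raw = "3"  # stands in for what input() would return: always a string

repeated = raw * 2      # string repetition
doubled = int(raw) * 2  # integer arithmetic after conversion

print(repeated)  # 33
print(doubled)   # 6
```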

We can also interpret user input as a Python expression using the built-in function eval. This function evaluates a string as a line of Python.

In [None]:
result = eval(input("Enter an expression: "))
print(result)

If the user inputs 2 * 3, this outputs 6.
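
Because `eval` runs whatever expression the user types, including function calls, it is worth knowing a more restrictive option. The sketch below uses `ast.literal_eval` from the standard library, which accepts only Python literals (numbers, strings, lists, dictionaries, and so on). The helper name `parse_literal` is hypothetical, and note the trade-off: an arithmetic expression such as `2 * 3` is not a literal, so it is rejected too.

```python
import ast

def parse_literal(text):
    """Return the Python literal written in text, or None if it is not a plain literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, TypeError, SyntaxError):
        return None

print(parse_literal("[1, 2, 3]"))         # a list literal is accepted
print(parse_literal("{'a': 1}"))          # so is a dictionary literal
print(parse_literal("2 * 3"))             # an expression is rejected: None
print(parse_literal("__import__('os')"))  # a function call is rejected: None
```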

##### Generate Messages

Imagine you're a teacher who needs to send a message to each of your students reminding them of their missing assignments and grade in the class. You have each of their names, number of missing assignments, and grades on a spreadsheet and just have to insert them into placeholders in this message you came up with:

Hi [insert student name],

This is a reminder that you have [insert number of missing assignments] assignments left to submit before you can graduate. Your current grade is [insert current grade] and can increase to [insert potential grade] if you submit all assignments before the due date.

You can just copy and paste this message to each student and manually insert the appropriate values each time, but instead you're going to write a program that does this for you.

Write a script that does the following:

Ask for user input 3 times. Once for a list of names, once for a list of missing assignment counts, and once for a list of grades. Use this input to create lists for names, assignments, and grades.

Use a loop to print the message for each student with the correct values. The potential grade is simply the current grade added to two times the number of missing assignments.

In [None]:
names = input("Enter names separated by commas: ").title().split(",")
assignments = input("Enter assignment counts separated by commas: ").split(",")
grades = input("Enter grades separated by commas: ").split(",")

message = "Hi {},\n\nThis is a reminder that you have {} assignments left to \
submit before you can graduate. Your current grade is {} and can increase \
to {} if you submit all assignments before the due date.\n\n"

for name, assignment, grade in zip(names, assignments, grades):
    print(message.format(name, assignment, grade, int(grade) + int(assignment)*2))

Enter names separated by commas: Gustavo
Enter assignment counts separated by commas: 2
Enter grades separated by commas: 6
Hi Gustavo,

This is a reminder that you have 2 assignments left to submit before you can graduate. Your current grade is 6 and can increase to 10 if you submit all assignments before the due date.




#### Errors and Exceptions

* Syntax errors occur when Python can’t interpret our code, since we didn’t follow the correct syntax for Python. These are errors you’re likely to get when you make a typo, or you’re first starting to learn Python.

* Exceptions occur when unexpected things happen during execution of a program, even if the code is syntactically correct. There are different types of built-in exceptions in Python, and you can see which exception is thrown in the error message.

* ValueError: An object of the correct type but inappropriate value is passed as input to a built-in operation or function.
* AssertionError: An assert statement fails.
* IndexError: A sequence subscript is out of range.
* KeyError: A key can't be found in a dictionary.
* TypeError: An object of an unsupported type is passed as input to an operation or function.
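
Each of the exceptions listed above can be triggered with a line of ordinary code. The sketch below is only an illustration: the function names and the `exception_name` helper are hypothetical and exist just to show which exception each line raises.

```python
def value_error():
    int("ten")         # a str is the right type, but "ten" is not a valid integer

def assertion_error():
    assert 1 + 1 == 3  # the asserted condition is False

def index_error():
    [1, 2, 3][5]       # index 5 is past the end of the list

def key_error():
    {"a": 1}["b"]      # "b" is not a key in the dictionary

def type_error():
    "hello" + 5        # a str and an int cannot be added

def exception_name(func):
    """Call func and return the name of the exception it raises, if any."""
    try:
        func()
    except Exception as error:
        return type(error).__name__
    return None

for func in (value_error, assertion_error, index_error, key_error, type_error):
    print(func.__name__, "->", exception_name(func))
```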

#### Try Statement

We can use try statements to handle exceptions. There are four clauses you can use (one more in addition to those shown in the video).

* try: This is the only mandatory clause in a try statement. The code in this block is the first thing that Python runs in a try statement.
* except: If Python runs into an exception while running the try block, it will jump to the except block that handles that exception.
* else: If Python runs into no exceptions while running the try block, it will run the code in this block after running the try block.
* finally: Before Python leaves this try statement, it will run the code in this finally block under any conditions, even if it's ending the program. E.g., if Python ran into an error while running code in the except or else block, this finally block will still be executed before stopping the program.

We can actually specify which error we want to handle in an except block like this:

In [None]:
try:
    # some code
except ValueError:
    # some code

Now, it catches the ValueError exception, but not other exceptions. If we want this handler to address more than one type of exception, we can include a parenthesized tuple after the except with the exceptions.

In [None]:
try:
    # some code
except (ValueError, KeyboardInterrupt):
    # some code

Or, if we want to execute different blocks of code depending on the exception, you can have multiple except blocks.

In [None]:
try:
    # some code
except ValueError:
    # some code
except KeyboardInterrupt:
    # some code
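The two patterns above, separate except blocks and a single block with a tuple of exceptions, might look like this in practice. The functions `read_quantity` and `read_quantity_or_zero` and the `inventory` dictionary are hypothetical and exist only to illustrate the syntax.

```python
def read_quantity(inventory, item):
    """Look up item and convert its quantity to an int (hypothetical helper)."""
    try:
        return int(inventory[item])
    except KeyError:
        # The item is not a key in the dictionary.
        print("No entry for {!r}.".format(item))
    except ValueError:
        # The entry exists but cannot be converted to an integer.
        print("The quantity for {!r} is not a whole number.".format(item))
    return None


def read_quantity_or_zero(inventory, item):
    """Handle both exception types in the same way with a tuple."""
    try:
        return int(inventory[item])
    except (KeyError, ValueError):
        return 0


inventory = {"cookies": "24", "cakes": "two"}
print(read_quantity(inventory, "cookies"))       # 24
print(read_quantity(inventory, "cakes"))         # warning message, then None
print(read_quantity(inventory, "pies"))          # warning message, then None
print(read_quantity_or_zero(inventory, "pies"))  # 0
```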

##### Example of handling errors

The party_planner function below takes as input a number of party people and cookies and figures out how many cookies each person gets at the party, assuming equitable distribution of cookies. Then, it returns that number along with how many cookies will be left over.

Right now, calling the function with an input of 0 people will cause an error, because it creates a ZeroDivisionError exception. Edit the party_planner function to handle this invalid input. If it runs into this exception, it should print a warning message to the user and request they input a different number of people.

After you've edited the function, try running the file again and make sure it does what you intended. Try it with several different input values, including 0 and other values for the number of people.

In [None]:
def party_planner(cookies, people):
    leftovers = None
    num_each = None
    # The try-except block below makes sure a
    # ZeroDivisionError does not crash the program.
    if people > cookies:
        num_each = round(cookies / people, 1)
        leftovers = 0
    else:
        try:
            num_each = cookies // people
            leftovers = cookies % people
        except ZeroDivisionError:
            print("Oops, you entered 0 people will be attending.")
            print("Please enter a good number of people for a party.")

    return(num_each, leftovers)

# The main code block is below; do not edit this
lets_party = 'y'
while lets_party == 'y':

    cookies = int(input("How many cookies are you baking? "))
    people = int(input("How many people are attending? "))

    cookies_each, leftovers = party_planner(cookies, people)

    if cookies_each:  # if cookies_each is not None
        message = "\nLet's party! We'll have {} people attending, they'll each get to eat {} cookies, and we'll have {} left over."
        print(message.format(people, cookies_each, leftovers))
    if cookies_each == 0:
        print("No cookie = No party")

    lets_party = input("\nWould you like to party more? (y or n) ")

#### Accessing Error Messages

When you handle an exception, you can still access its error message like this:

In [None]:
try:
    # some code
except ZeroDivisionError as e:
   # some code
   print("ZeroDivisionError occurred: {}".format(e))

ZeroDivisionError occurred: integer division or modulo by zero

So you can still access error messages, even if you handle them to keep your program from crashing!

If you don't have a specific error you're handling, you can still access the message like this:

In [None]:
try:
    # some code
except Exception as e:
   # some code
   print("Exception occurred: {}".format(e))
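As a concrete, runnable version of the pattern above, this sketch triggers a ValueError on purpose and keeps its message. The variable name `message` is an assumption chosen for the example.

```python
try:
    # int() cannot convert this string, so it raises ValueError.
    number = int("forty-two")
except ValueError as e:
    # str(e) (used implicitly by format) is the error message text.
    message = "ValueError occurred: {}".format(e)
    print(message)
```

The program keeps running after the except block, and the text Python would have shown in the traceback is still available through `e`.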

#### Reading and Writing Files

Here's how we read files in Python.

In [None]:
f = open('my_path/my_file.txt', 'r')
file_data = f.read()
f.close()

First open the file using the built-in function, open. This requires a string that shows the path to the file. The open function returns a file object, which is a Python object through which Python interacts with the file itself. Here, we assign this object to the variable f.

There are optional parameters you can specify in the open function. One is the mode in which we open the file. Here, we use r or read only. This is actually the default value for the mode argument.

Use the read method to access the contents from the file object. This read method takes the text contained in a file and puts it into a string. Here, we assign the string returned from this method into the variable file_data.

When finished with the file, use the close method to free up any system resources taken up by the file.

##### Writing to a File

In [None]:
f = open('my_path/my_file.txt', 'w')
f.write("Hello there!")
f.close()

Open the file in writing ('w') mode. If the file does not exist, Python will create it for you. If you open an existing file in writing mode, any content that it had contained previously will be deleted. If you're interested in adding to an existing file, without deleting its content, you should use the append ('a') mode instead of write.

Use the write method to add text to the file.

Close the file when finished.
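The difference between write ('w') and append ('a') mode can be seen in a short sketch. It assumes a temporary directory (created with the standard `tempfile` module) so that no real files are touched; the file name is only an example.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_file.txt")

    f = open(path, "w")   # the file does not exist yet, so it is created
    f.write("Hello there!")
    f.close()

    f = open(path, "w")   # opening again in 'w' mode erases the old content
    f.write("First line\n")
    f.close()

    f = open(path, "a")   # 'a' mode keeps the content and adds to the end
    f.write("Second line\n")
    f.close()

    f = open(path, "r")
    file_data = f.read()
    f.close()

print(file_data)
```

Only the two lines written last remain, because the second open in 'w' mode discarded "Hello there!".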

Python provides a special syntax that auto-closes a file for you once you're finished using it.

In [None]:
with open('my_path/my_file.txt', 'r') as f:
    file_data = f.read()

The with keyword allows you to open a file, do operations on it, and automatically close it after the indented code is executed, in this case, reading from the file. Now, we don’t have to call f.close()! You can only use the file object, f, to work with the file inside this indented block; once the block ends, the file is closed.
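To check that the file really is closed once the indented block ends, here is a small sketch. It assumes a temporary file so that it can run anywhere; `closed_inside`, `closed_after`, and `read_error` are names invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_file.txt")

    # Create a small file to read back.
    with open(path, "w") as f:
        f.write("Hello there!")

    with open(path, "r") as f:
        file_data = f.read()
        closed_inside = f.closed   # False: the file is still open here

    closed_after = f.closed        # True: the with block closed it

    try:
        f.read()                   # reading a closed file raises ValueError
        read_error = None
    except ValueError as e:
        read_error = str(e)

print(file_data, closed_inside, closed_after)
print(read_error)
```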

##### Reading Line by Line

\n in blocks of text are newline characters. The newline character marks the end of a line, and tells a program (such as a text editor) to go down to the next line. However, looking at the stream of characters in the file, \n is just another character.

Fortunately, Python knows that these are special characters and you can ask it to read one line at a time. Let's try it!

camelot.txt:

We're the knights of the round table

We dance whenever we're able

Conveniently, Python will loop over the lines of a file using the syntax for line in file. I can use this to create a list of lines in the file. Because each line still has its newline character attached, I remove this using .strip().

In [None]:
camelot_lines = []
with open("camelot.txt") as f:
    for line in f:
        camelot_lines.append(line.strip())

print(camelot_lines)

Graham Chapman,  Various / ... (46 episodes, 1969-1974)
Eric Idle,  Various / ... (46 episodes, 1969-1974)
Terry Jones,  Various / ... (46 episodes, 1969-1974)
Michael Palin,  It's Man / ... (46 episodes, 1969-1974)
Terry Gilliam,  Various / ... (46 episodes, 1969-1974)
John Cleese,  Announcer / ... (40 episodes, 1969-1973)
Carol Cleveland,  Various / ... (34 episodes, 1969-1974)

You're going to create a list of the actors who appeared in the television programme Monty Python's Flying Circus.

Write a function called create_cast_list that takes a filename as input and returns a list of actors' names. It will be run on the file flying_circus_cast.txt (this information was collected from imdb.com). Each line of that file consists of an actor's name, a comma, and then some (messy) information about roles they played in the programme. You'll need to extract only the name and add it to a list. You might use the .split() method to process each line.

In [None]:
def create_cast_list(filename):
    cast_list = []
    with open(filename) as f:
        for line in f:
            name = line.split(",")[0]
            cast_list.append(name)

    return cast_list

cast_list = create_cast_list('flying_circus_cast.txt')
for actor in cast_list:
    print(actor)

#### Importing Local Scripts

We can actually import Python code from other scripts, which is helpful if you are working on a bigger project where you want to organize your code into multiple files and reuse code in those files. If the Python script you want to import is in the same directory as your current script, you just type import followed by the name of the file, without the .py extension.

In [None]:
import useful_functions

It's the standard convention for import statements to be written at the top of a Python script, each one on a separate line. This import statement creates a module object called useful_functions. Modules are just Python files that contain definitions and statements. To access objects from an imported module, you need to use dot notation.

In [None]:
import useful_functions
useful_functions.add_five([1, 2, 3, 4])

We can add an alias to an imported module to reference it with a different name.

In [None]:
import useful_functions as uf
uf.add_five([1, 2, 3, 4])

It is good practice to include assertions in a script of functions so that you can test their results.

##### Using a main block

To avoid running executable statements in a script when it's imported as a module in another script, include these lines in an `if __name__ == "__main__"` block. Alternatively, include them in a function called `main()` and call this function in the `if __name__ == "__main__"` block.

Whenever we run a script like this, Python sets a special built-in variable called `__name__` for every module. When we run a script, Python recognizes this module as the main program and sets its `__name__` variable to the string `"__main__"`. For any modules that are imported in this script, `__name__` is set to the name of that module. Therefore, the condition `if __name__ == "__main__"` just checks whether this module is the main program.

Create these scripts in the same directory and run them in your terminal! Experiment with the `if __name__ == "__main__"` block and with accessing objects from the imported module.

In [None]:
# demo.py

import useful_functions as uf

scores = [88, 92, 79, 93, 85]

mean = uf.mean(scores)
curved = uf.add_five(scores)

mean_c = uf.mean(curved)

print("Scores:", scores)
print("Original Mean:", mean, " New Mean:", mean_c)

print(__name__)
print(uf.__name__)

In [None]:
# useful_functions.py

def mean(num_list):
    return sum(num_list) / len(num_list)

def add_five(num_list):
    return [n + 5 for n in num_list]

def main():
    print("Testing mean function")
    n_list = [34, 44, 23, 46, 12, 24]
    correct_mean = 30.5
    assert(mean(n_list) == correct_mean)

    print("Testing add_five function")
    correct_list = [39, 49, 28, 51, 17, 29]
    assert(add_five(n_list) == correct_list)

    print("All tests passed!")

if __name__ == '__main__':
    main()

#### Best Standard Python Modules


The Python Standard Library has a lot of modules! To help you get familiar with what's available, here is a selection of our favourite Python Standard Library modules and why we use them!

* csv: very convenient for reading and writing csv files
* collections: useful extensions of the usual data types including OrderedDict, defaultdict and namedtuple
* random: generates pseudo-random numbers, shuffles sequences randomly and chooses random items
* string: more functions on strings. This module also contains useful collections of letters like string.digits (a string containing all characters which are valid digits).
* re: pattern-matching in strings via regular expressions
* math: some standard mathematical functions
* os: interacting with operating systems
* os.path: submodule of os for manipulating path names
* sys: work directly with the Python interpreter
* json: good for reading and writing json files (good for web work)
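A quick tour of a few of the modules listed above might look like the following sketch. The sample data (`word_counts`, `Point`, and the strings) is made up for illustration; every function used is part of the standard library.

```python
import json
import math
import os.path
import re
import string
from collections import defaultdict, namedtuple

# collections: count words without checking whether the key exists first
word_counts = defaultdict(int)
for word in "spam eggs spam".split():
    word_counts[word] += 1

# collections: a lightweight record type with named fields
Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)

print(dict(word_counts))                             # {'spam': 2, 'eggs': 1}
print(math.hypot(p.x, p.y))                          # 5.0
print(string.digits)                                 # 0123456789
print(re.findall(r"\d+", "46 episodes, 1969-1974"))  # ['46', '1969', '1974']
print(os.path.splitext("camelot.txt"))               # ('camelot', '.txt')
print(json.dumps({"cookies": 12}))                   # {"cookies": 12}
```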

#### Techniques for Importing Modules

To import an individual function or class from a module:

In [None]:
from module_name import object_name

To import multiple individual objects from a module:

In [None]:
from module_name import first_object, second_object

To rename a module:

In [None]:
import module_name as new_name

To import an object from a module and rename it:

In [None]:
from module_name import object_name as new_name

To import every object individually from a module:

In [None]:
from module_name import *

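Importing with * is generally discouraged, because it becomes unclear where a name came from and imported names can silently overwrite names you already defined. If you really want to use all of the objects from a module, use the standard import module_name statement instead and access each of the objects with dot notation: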

In [None]:
import module_name
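The placeholders module_name and object_name above can be swapped for real standard library modules to try each technique. This is only a sketch; the alias names `fact` and `col` are arbitrary choices for the example.

```python
# Import one object from a module
from math import sqrt

# Import several objects from a module
from math import floor, pi

# Rename a module
import collections as col

# Import an object and rename it
from math import factorial as fact

# Import the whole module and use dot notation
import math

print(sqrt(16))                   # 4.0
print(floor(pi))                  # 3
print(col.Counter("aab")["a"])    # 2
print(fact(5))                    # 120
print(math.sqrt is sqrt)          # True: both names refer to the same function
```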

#### Modules, Packages, and Names

In order to manage the code better, modules in the Python Standard Library are split down into sub-modules that are contained within a package. A package is simply a module that contains sub-modules. A sub-module is specified with the usual dot notation.

Modules that are submodules are specified by the package name and then the submodule name separated by a dot. You can import the submodule like this.

In [None]:
import package_name.submodule_name
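For instance, os.path from the list of standard modules above is a submodule of os. A minimal sketch, using the example path from the file-reading section:

```python
import os.path

# The submodule's functions are reached with the full dotted name.
file_name = os.path.basename("my_path/my_file.txt")
directory = os.path.dirname("my_path/my_file.txt")

print(file_name)   # my_file.txt
print(directory)   # my_path
```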

#### Third-Party Libraries

There are tens of thousands of third-party libraries written by independent developers! You can install them using pip, a package manager that is included with Python 3. pip is the standard package manager for Python, but it isn't the only one. One popular alternative is Anaconda which is designed specifically for data science.

To install a package using pip, just enter "pip install" followed by the name of the package in your command line like this: pip install package_name. This downloads and installs the package so that it's available to import in your programs. Once installed, you can import third-party packages using the same syntax used to import from the standard library.

##### Using a requirements.txt File

Larger Python programs might depend on dozens of third party packages. To make it easier to share these programs, programmers often list a project's dependencies in a file called requirements.txt. This is an example of a requirements.txt file.

In [None]:
beautifulsoup4==4.5.1
bs4==0.0.1
pytz==2016.7
requests==2.11.1

Each line of the file includes the name of a package and its version number. The version number is optional, but it usually should be included. Libraries can change subtly, or dramatically, between versions, so it's important to use the same library versions that the program's author used when they wrote the program.

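You can use pip to install all of a project's dependencies at once by typing `pip install -r requirements.txt` in your command line. To generate a requirements.txt file from the packages installed in your current environment, type `pip list --format=freeze > requirements.txt`.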

##### Useful Third-Party Packages

Being able to install and import third party libraries is useful, but to be an effective programmer you also need to know what libraries are available for you to use. People typically learn about useful new libraries from online recommendations or from colleagues. If you're a new Python programmer you may not have many colleagues, so to get you started here's a list of packages that are popular with engineers at Udacity.

* IPython - A better interactive Python interpreter
* requests - Provides easy to use methods to make web requests. Useful for accessing web APIs.
* Flask - a lightweight framework for making web applications and APIs.
* Django - A more featureful framework for making web applications. Django is particularly good for designing complex, content heavy, web applications.
* Beautiful Soup - Used to parse HTML and extract information from it. Great for web scraping.
* pytest - extends Python's builtin assertions and unittest module.
* PyYAML - For reading and writing YAML files.
* NumPy - The fundamental package for scientific computing with Python. It contains among other things a powerful N-dimensional array object and useful linear algebra capabilities.
* pandas - A library containing high-performance, data structures and data analysis tools. In particular, pandas provides dataframes!
* matplotlib - a 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments.
* ggplot - Another 2D plotting library, based on R's ggplot2 library.
* Pillow - The Python Imaging Library adds image processing capabilities to your Python interpreter.
* pyglet - A cross-platform application framework intended for game development.
* Pygame - A set of Python modules designed for writing games.
* pytz - World Timezone Definitions for Python

### NumPy

Even though Python lists are great on their own, NumPy has a number of key features that give it great advantages over Python lists. Below are a few convincingly strong features:

One such feature is speed. When performing operations on large arrays NumPy can often perform several orders of magnitude faster than Python lists. This speed comes from the nature of NumPy arrays being memory-efficient and from optimized algorithms used by NumPy for doing arithmetic, statistical, and linear algebra operations.

Another great feature of NumPy is that it has multidimensional array data structures that can represent vectors and matrices. You will learn all about vectors and matrices in the Linear Algebra section of this course later on, and as you will soon see, a lot of machine learning algorithms rely on matrix operations. For example, when training a Neural Network, you often have to carry out many matrix multiplications. NumPy is optimized for matrix operations and it allows us to do Linear Algebra operations effectively and efficiently, making it very suitable for solving machine learning problems.

Another great advantage of NumPy over Python lists is that NumPy has a large number of optimized built-in mathematical functions. These functions allow you to do a variety of complex mathematical computations very fast and with very little code (avoiding the use of complicated loops) making your programs more readable and easier to understand.

These are just some of the key features that have made NumPy an essential package for scientific computing in Python. In fact, NumPy has become so popular that a lot of Python packages, such as Pandas, are built on top of NumPy.

In [None]:
# Why use NumPy?
import time
import numpy as np
x = np.random.random(100000000)

# Case 1
start = time.time()
sum(x) / len(x)
print(time.time() - start)

# Case 2
start = time.time()
np.mean(x)
print(time.time() - start)

19.353888034820557
0.0756673812866211


But before we can dive in and start using NumPy to create ndarrays we need to import it into Python. We can import packages into Python using the import command and it has become a convention to import NumPy as np. Therefore, you can import NumPy by typing the following command in your Jupyter notebook:

In [None]:
import numpy as np

###### NDARRAY

At the core of NumPy is the ndarray, where nd stands for n-dimensional. An ndarray is a multidimensional array of elements all of the same type. In other words, an ndarray is a grid that can take on many shapes and can hold either numbers or strings. In many Machine Learning problems you will often find yourself using ndarrays in many different ways. For instance, you might use an ndarray to hold the pixel values of an image that will be fed into a Neural Network for image classification.

There are several ways to create ndarrays in NumPy. In the following lessons we will see two ways to create ndarrays:

Using regular Python lists

Using built-in NumPy functions

In this section, we will create ndarrays by providing Python lists to the NumPy np.array() function. This can create some confusion for beginners, but it is important to remember that np.array() is NOT a class, it is just a function that returns an ndarray. We should note that for the purposes of clarity, the examples throughout these lessons will use small and simple ndarrays. Let's start by creating 1-Dimensional (1D) ndarrays.

In [None]:
# We import NumPy into Python
import numpy as np

# We create a 1D ndarray that contains only integers
x = np.array([1, 2, 3, 4, 5])

# Let's print the ndarray we just created using the print() command
print('x = ', x)

x =  [1 2 3 4 5]


###### NDIM

The ndim attribute returns the number of array dimensions.

Let's pause for a second to introduce some useful terminology. We refer to 1D arrays as rank 1 arrays. In general N-Dimensional arrays have rank N. Therefore, we refer to a 2D array as a rank 2 array.

In [None]:
# 1-D array
x = np.array([1, 2, 3])
x.ndim

# 2-D array
Y = np.array([[1,2,3],[4,5,6],[7,8,9], [10,11,12]])
Y.ndim

# Here, zeros() is a built-in NumPy function that you'll study on the next page.
# The tuple (2, 3, 4) passed as an argument represents the shape of the ndarray
y = np.zeros((2, 3, 4))
y.ndim

3

###### SHAPE

The shape attribute returns a tuple representing the array dimensions.

Another important property of arrays is their shape. The shape of an array is the size along each of its dimensions. For example, the shape of a rank 2 array will correspond to the number of rows and columns of the array. As you will see, NumPy ndarrays have attributes that allow us to get information about them in a very intuitive way. For example, the shape of an ndarray can be obtained using the .shape attribute. The shape attribute returns a tuple of N positive integers that specify the sizes of each dimension.

###### DTYPE

The dtype attribute tells us the data-type of the elements. Remember, a NumPy array is homogeneous, meaning all elements will have the same data-type. In the example below, we will create a rank 1 array and learn how to obtain its shape, its type, and the data-type (dtype) of its elements.

In [None]:
# We create a 1D ndarray that contains only integers
x = np.array([1, 2, 3, 4, 5])

# We print information about x
print('x = ', x)
print('x has dimensions:', x.shape)
print('x is an object of type:', type(x))
print('The elements in x are of type:', x.dtype)

x =  [1 2 3 4 5]
x has dimensions: (5,)
x is an object of type: <class 'numpy.ndarray'>
The elements in x are of type: int64


We can see that the shape attribute returns the tuple (5,) telling us that x is of rank 1 (i.e. x only has 1 dimension ) and it has 5 elements. The type() function tells us that x is indeed a NumPy ndarray. Finally, the .dtype attribute tells us that the elements of x are stored in memory as signed 64-bit integers. Another great advantage of NumPy is that it can handle more data-types than Python lists. You can check out all the different data types NumPy supports in the link below:

NumPy Data Types

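As mentioned earlier, ndarrays can also hold strings; you create them in the same manner as before, by providing the np.array() function a Python list of strings. You can also set the data-type explicitly with the dtype keyword. In the example below, we create a rank 1 ndarray from a list of floats but set the dtype to int64, so NumPy converts each float to an integer by dropping its decimal part.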

In [None]:
# We create a rank 1 ndarray of floats but set the dtype to int64
x = np.array([1.5, 2.2, 3.7, 4.0, 5.9], dtype = np.int64)

# We print the dtype x
print('x = ', x)
print('The elements in x are of type:', x.dtype)

x =  [1 2 3 4 5]
The elements in x are of type: int64
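An ndarray of strings is created the same way as a numeric one. The sketch below also shows what NumPy does with a list that mixes integers and floats; the variable names `names` and `mixed` are assumptions for this example, and the exact string dtype shown depends on the longest string in the list.

```python
import numpy as np

# A rank 1 ndarray of strings, built from a Python list of strings
names = np.array(['Hello', 'World'])
print('names = ', names)
print('names has dimensions:', names.shape)
print('The elements in names are of type:', names.dtype)  # a Unicode string dtype

# Mixing integers and floats: NumPy converts everything to floats,
# because all elements of an ndarray must share one data-type
mixed = np.array([1, 2.5, 4])
print('mixed = ', mixed)
print('The elements in mixed are of type:', mixed.dtype)
```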


In [None]:
# We create a rank 2 ndarray that only contains integers
Y = np.array([[1,2,3],[4,5,6],[7,8,9], [10,11,12]])

print('Y = \n', Y)

# We print information about Y
print('Y has dimensions:', Y.shape)
print('Y has a total of', Y.size, 'elements')
print('Y is an object of type:', type(Y))
print('The elements in Y are of type:', Y.dtype)

Y = 
 [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
Y has dimensions: (4, 3)
Y has a total of 12 elements
Y is an object of type: <class 'numpy.ndarray'>
The elements in Y are of type: int64


###### Save Arrays

Once you create an ndarray, you may want to save it to a file to be read later or to be used by another program. NumPy provides a way to save arrays into files for later use.

In [None]:
# We create a rank 1 ndarray
x = np.array([1, 2, 3, 4, 5])

# We save x into the current directory as my_array.npy
np.save('my_array', x)

# We load the saved array from our current directory into variable y
y = np.load('my_array.npy')

# We print y
print()
print('y = ', y)
print()

# We print information about the ndarray we loaded
print('y is an object of type:', type(y))
print('The elements in y are of type:', y.dtype)


y =  [1 2 3 4 5]

y is an object of type: <class 'numpy.ndarray'>
The elements in y are of type: int64


###### Built-in functions to create arrays

One great time-saving feature of NumPy is its ability to create ndarrays using built-in functions. These functions allow us to create certain kinds of ndarrays with just one line of code. Below we will see a few of the most useful built-in functions for creating ndarrays that you will come across when doing AI programming.

Let's start by creating an ndarray with a specified shape that is full of zeros. We can do this by using the np.zeros() function. The function np.zeros(shape) creates an ndarray full of zeros with the given shape. So, for example, if you wanted to create a rank 2 array with 3 rows and 4 columns, you will pass the shape to the function in the form of (rows, columns), as in the example below:

In [None]:
# We create a 3 x 4 ndarray full of zeros. 
X = np.zeros((3,4))

# We print X
print()
print('X = \n', X)
print()

# We print information about X
print('X has dimensions:', X.shape)
print('X is an object of type:', type(X))
print('The elements in X are of type:', X.dtype)


X = 
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

X has dimensions: (3, 4)
X is an object of type: <class 'numpy.ndarray'>
The elements in X are of type: float64


As we can see, the np.zeros() function creates by default an array with dtype float64. If desired, the data type can be changed by using the keyword dtype.

Similarly, we can create an ndarray with a specified shape that is full of ones. We can do this by using the np.ones() function. Just like the np.zeros() function, the np.ones() function takes as an argument the shape of the ndarray you want to make. Let's see an example:

In [None]:
# We create a 3 x 2 ndarray full of ones. 
X = np.ones((3,2))

# We print X
print()
print('X = \n', X)
print()

# We print information about X
print('X has dimensions:', X.shape)
print('X is an object of type:', type(X))
print('The elements in X are of type:', X.dtype) 


X = 
 [[1. 1.]
 [1. 1.]
 [1. 1.]]

X has dimensions: (3, 2)
X is an object of type: <class 'numpy.ndarray'>
The elements in X are of type: float64


As we can see, the np.ones() function also creates by default an array with dtype float64. If desired, the data type can be changed by using the keyword dtype.

We can also create an ndarray with a specified shape that is full of any number we want. We can do this by using the np.full() function. The np.full(shape, constant value) function takes two arguments. The first argument is the shape of the ndarray you want to make and the second is the constant value you want to populate the array with. Let's see an example:

In [None]:
# We create a 2 x 3 ndarray full of fives. 
X = np.full((2,3), 5) 

# We print X
print()
print('X = \n', X)
print()

# We print information about X
print('X has dimensions:', X.shape)
print('X is an object of type:', type(X))
print('The elements in X are of type:', X.dtype)  


X = 
 [[5 5 5]
 [5 5 5]]

X has dimensions: (2, 3)
X is an object of type: <class 'numpy.ndarray'>
The elements in X are of type: int64


The np.full() function creates by default an array with the same data type as the constant value used to fill in the array. If desired, the data type can be changed by using the keyword dtype.
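The dtype keyword mentioned for np.zeros(), np.ones(), and np.full() can be tried out as follows. This is a minimal sketch; the variable names are chosen only for the example.

```python
import numpy as np

# Override the default float64 of np.zeros() and np.ones()
zeros_int = np.zeros((3, 4), dtype=np.int64)
ones_int = np.ones((3, 2), dtype=np.int64)

# np.full() takes its data type from the fill value...
fives_float = np.full((2, 3), 5.0)

# ...unless dtype is given explicitly
fives_cast = np.full((2, 3), 5, dtype=np.float64)

for name, array in [('zeros_int', zeros_int), ('ones_int', ones_int),
                    ('fives_float', fives_float), ('fives_cast', fives_cast)]:
    print(name, 'has dtype', array.dtype)
```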

As you will learn later, a fundamental array in Linear Algebra is the Identity Matrix. An Identity matrix is a square matrix that has only 1s in its main diagonal and zeros everywhere else. The function np.eye(N) creates a square N x N ndarray corresponding to the Identity matrix. Since all Identity Matrices are square, the np.eye() function only needs a single integer as an argument. Let's see an example:

In [None]:
# We create a 5 x 5 Identity matrix. 
X = np.eye(5)

# We print X
print()
print('X = \n', X)
print()

# We print information about X
print('X has dimensions:', X.shape)
print('X is an object of type:', type(X))
print('The elements in X are of type:', X.dtype)  


X = 
 [[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]

X has dimensions: (5, 5)
X is an object of type: <class 'numpy.ndarray'>
The elements in X are of type: float64


In [None]:
# Create a 4 x 4 diagonal matrix that contains the numbers 10,20,30, and 50
# on its main diagonal
X = np.diag([10,20,30,50])

# We print X
print()
print('X = \n', X)
print()


X = 
 [[10  0  0  0]
 [ 0 20  0  0]
 [ 0  0 30  0]
 [ 0  0  0 50]]



In [None]:
# We create a rank 1 ndarray that has evenly spaced integers from 1 to 13 in steps of 3.
x = np.arange(1,14,3)

# We print the ndarray
print()
print('x = ', x)
print()

# We print information about the ndarray
print('x has dimensions:', x.shape)
print('x is an object of type:', type(x))
print('The elements in x are of type:', x.dtype) 


x =  [ 1  4  7 10 13]

x has dimensions: (5,)
x is an object of type: <class 'numpy.ndarray'>
The elements in x are of type: int64


In [None]:
# We create a rank 1 ndarray that has 10 evenly spaced numbers between 0 and 25.
x = np.linspace(0,25,10)

# We print the ndarray
print()
print('x = \n', x)
print()

# We print information about the ndarray
print('x has dimensions:', x.shape)
print('x is an object of type:', type(x))
print('The elements in x are of type:', x.dtype) 


x = 
 [ 0.          2.77777778  5.55555556  8.33333333 11.11111111 13.88888889
 16.66666667 19.44444444 22.22222222 25.        ]

x has dimensions: (10,)
x is an object of type: <class 'numpy.ndarray'>
The elements in x are of type: float64


In [None]:
# We create a rank 1 ndarray that has 10 evenly spaced numbers between 0 and 25,
# with 25 excluded.
x = np.linspace(0,25,10, endpoint = False)

# We print the ndarray
print()
print('x = ', x)
print()

# We print information about the ndarray
print('x has dimensions:', x.shape)
print('x is an object of type:', type(x))
print('The elements in x are of type:', x.dtype) 


x =  [ 0.   2.5  5.   7.5 10.  12.5 15.  17.5 20.  22.5]

x has dimensions: (10,)
x is an object of type: <class 'numpy.ndarray'>
The elements in x are of type: float64


So far, we have only used the built-in functions np.arange() and np.linspace() to create rank 1 ndarrays. However, we can use these functions to create rank 2 ndarrays of any shape by combining them with the np.reshape() function. The np.reshape(ndarray, new_shape) function converts the given ndarray into the specified new_shape. It is important to note that the new_shape should be compatible with the number of elements in the given ndarray. For example, you can convert a rank 1 ndarray with 6 elements into a 3 x 2 rank 2 ndarray, or a 2 x 3 rank 2 ndarray, since both of these rank 2 arrays will have a total of 6 elements. However, you can't reshape the rank 1 ndarray with 6 elements into a 3 x 3 rank 2 ndarray, since this rank 2 array will have 9 elements, which is greater than the number of elements in the original ndarray.

Using the function reshape

In [None]:
# We create a rank 1 ndarray with sequential integers from 0 to 19
x = np.arange(20)

# We print x
print()
print('Original x = ', x)
print()

# We reshape x into a 4 x 5 ndarray 
x = np.reshape(x, (4,5))

# We print the reshaped x
print()
print('Reshaped x = \n', x)
print()

# We print information about the reshaped x
print('x has dimensions:', x.shape)
print('x is an object of type:', type(x))
print('The elements in x are of type:', x.dtype) 


Original x =  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


Reshaped x = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

x has dimensions: (4, 5)
x is an object of type: <class 'numpy.ndarray'>
The elements in x are of type: int64


Using the method reshape

In [None]:
# We create a rank 1 ndarray with sequential integers from 0 to 19 and
# reshape it to a 4 x 5 array 
Y = np.arange(20).reshape(4, 5)

# We print Y
print()
print('Y = \n', Y)
print()

# We print information about Y
print('Y has dimensions:', Y.shape)
print('Y is an object of type:', type(Y))
print('The elements in Y are of type:', Y.dtype) 


Y = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

Y has dimensions: (4, 5)
Y is an object of type: <class 'numpy.ndarray'>
The elements in Y are of type: int64
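The compatibility rule for reshaping can be checked with the 6-element example described earlier. In this sketch the names `a`, `b`, and `error_message` are assumptions; NumPy raises a ValueError when the new shape does not match the number of elements.

```python
import numpy as np

# A rank 1 ndarray with 6 elements
x = np.arange(6)

# Both shapes hold exactly 6 elements, so these work
a = x.reshape(3, 2)
b = np.reshape(x, (2, 3))
print('a has dimensions:', a.shape)
print('b has dimensions:', b.shape)

# A 3 x 3 array would need 9 elements, so NumPy raises a ValueError
try:
    x.reshape(3, 3)
    error_message = None
except ValueError as e:
    error_message = str(e)

print('Reshaping to 3 x 3 failed with:', error_message)
```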


The last type of ndarrays we are going to create are random ndarrays. Random ndarrays are arrays that contain random numbers. Often in Machine Learning, you need to create random matrices, for example, when initializing the weights of a Neural Network. NumPy offers a variety of random functions to help us create random ndarrays of any shape.

Let's start by using the np.random.random(shape) function to create an ndarray of the given shape with random floats in the half-open interval [0.0, 1.0).

In [None]:
# We create a 3 x 3 ndarray with random floats in the half-open interval [0.0, 1.0).
X = np.random.random((3,3))

# We print X
print()
print('X = \n', X)
print()

# We print information about X
print('X has dimensions:', X.shape)
print('X is an object of type:', type(X))
print('The elements in x are of type:', X.dtype)


X = 
 [[0.07693719 0.15896112 0.86752101]
 [0.77757925 0.86280979 0.26035322]
 [0.27586422 0.74377315 0.49344008]]

X has dimensions: (3, 3)
X is an object of type: <class 'numpy.ndarray'>
The elements in x are of type: float64


NumPy also allows us to create ndarrays with random integers within a particular interval. The function np.random.randint(start, stop, size = shape) creates an ndarray of the given shape with random integers in the half-open interval [start, stop). Let's see an example:

In [None]:
# We create a 3 x 2 ndarray with random integers in the half-open interval [4, 15).
X = np.random.randint(4,15,size=(3,2))

# We print X
print()
print('X = \n', X)
print()

# We print information about X
print('X has dimensions:', X.shape)
print('X is an object of type:', type(X))
print('The elements in X are of type:', X.dtype)


X = 
 [[9 8]
 [8 8]
 [7 5]]

X has dimensions: (3, 2)
X is an object of type: <class 'numpy.ndarray'>
The elements in X are of type: int64
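
To make the half-open interval concrete, the sketch below draws many integers from [4, 15) and checks that every value falls between 4 and 14, so the stop value 15 never appears. The sample size of 10,000 and the variable name samples are arbitrary choices for illustration.

```python
import numpy as np

# Draw 10,000 random integers from the half-open interval [4, 15)
samples = np.random.randint(4, 15, size=10000)

# The start value can be drawn, the stop value cannot
print('Smallest value drawn:', samples.min())
print('Largest value drawn:', samples.max())
print('Was 15 ever drawn?', (samples == 15).any())
```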


In some cases, you may need to create ndarrays with random numbers that satisfy certain statistical properties. For example, you may want the random numbers in the ndarray to have an average of 0. NumPy allows you to create random ndarrays with numbers drawn from various probability distributions. The function np.random.normal(mean, standard deviation, size=shape), for example, creates an ndarray with the given shape that contains random numbers picked from a normal (Gaussian) distribution with the given mean and standard deviation. Let's create a 1,000 x 1,000 ndarray of random floating point numbers drawn from a normal distribution with a mean (average) of zero and a standard deviation of 0.1.

In [None]:
# We create a 1000 x 1000 ndarray of random floats drawn from normal (Gaussian) distribution
# with a mean of zero and a standard deviation of 0.1.
X = np.random.normal(0, 0.1, size=(1000,1000))

# We print X
print()
print('X = \n', X)
print()

# We print information about X
print('X has dimensions:', X.shape)
print('X is an object of type:', type(X))
print('The elements in X are of type:', X.dtype)
print('The elements in X have a mean of:', X.mean())
print('The maximum value in X is:', X.max())
print('The minimum value in X is:', X.min())
print('X has', (X < 0).sum(), 'negative numbers')
print('X has', (X > 0).sum(), 'positive numbers')


X = 
 [[ 0.11062096 -0.18848199 -0.13933814 ... -0.22922672  0.08560342
   0.09494847]
 [-0.21465375  0.17890888  0.08674283 ... -0.02971678 -0.06704127
  -0.16835026]
 [ 0.01510206 -0.0979285   0.15652037 ...  0.0947675  -0.07576427
  -0.00190856]
 ...
 [ 0.01065001 -0.13870851 -0.08652141 ... -0.07782912  0.08430694
  -0.06871916]
 [ 0.11884119  0.03570906  0.02110066 ...  0.1519784  -0.11096837
   0.1534686 ]
 [ 0.05319877  0.09314124 -0.12593448 ... -0.03214836  0.10479879
  -0.10768772]]

X has dimensions: (1000, 1000)
X is an object of type: <class 'numpy.ndarray'>
The elements in X are of type: float64
The elements in X have a mean of: 4.261350583990718e-05
The maximum value in X is: 0.4594371834520016
The minimum value in X is: -0.4963215962662151
X has 500553 negative numbers
X has 499447 positive numbers


As we can see, the average of the random numbers in the ndarray is close to zero, the maximum and minimum values in X are roughly symmetric about zero (the average), and we have about the same number of positive and negative numbers.
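
The same kind of check can be extended to the standard deviation. The sketch below is illustrative only: the seed value is an arbitrary assumption used to make the draws repeatable, and the names sample_mean, sample_std, and within_one_std are not part of the lesson. For a normal distribution, roughly 68% of the values are expected to fall within one standard deviation of the mean.

```python
import numpy as np

# Seeding makes the random draws repeatable (the seed value is arbitrary)
np.random.seed(0)

# Same distribution as in the lesson: mean 0, standard deviation 0.1
X = np.random.normal(0, 0.1, size=(1000, 1000))

sample_mean = X.mean()
sample_std = X.std()

# Fraction of elements that lie within one standard deviation of the mean
within_one_std = (np.abs(X) < 0.1).mean()

print('Sample mean:', sample_mean)
print('Sample standard deviation:', sample_std)
print('Fraction within one standard deviation:', within_one_std)
```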

###### Accessing elements

We will now see how NumPy allows us to effectively manipulate the data within ndarrays. NumPy ndarrays are mutable, meaning that the elements in ndarrays can be changed after the ndarray has been created. NumPy ndarrays can also be sliced, which means that ndarrays can be split in many different ways. This allows us, for example, to retrieve any subset of the ndarray that we want. Often in Machine Learning you will use slicing to separate data, for example, when dividing a data set into training, cross-validation, and testing sets.
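
As a rough preview of that use case, the sketch below splits a small array into three parts with slices. The data set, the 60/20/20 proportions, and the variable names are all assumptions made for illustration; the slicing syntax itself is covered in the Slicing Ndarrays section.

```python
import numpy as np

# A hypothetical data set with 10 samples (rows) and 3 features (columns)
data = np.arange(30).reshape(10, 3)

# Illustrative 60/20/20 split; the proportions are an assumption
train = data[:6]        # first 6 rows
validation = data[6:8]  # next 2 rows
test = data[8:]         # remaining 2 rows

print('train has dimensions:', train.shape)
print('validation has dimensions:', validation.shape)
print('test has dimensions:', test.shape)
```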

We will start by looking at how the elements of an ndarray can be accessed or modified by indexing. Elements can be accessed using indices inside square brackets, [ ]. NumPy allows you to use both positive and negative indices to access elements in the ndarray. Positive indices are used to access elements from the beginning of the array, while negative indices are used to access elements from the end of the array. Let's see how we can access elements in rank 1 ndarrays:

In [None]:
# We create a rank 1 ndarray that contains integers from 1 to 5
x = np.array([1, 2, 3, 4, 5])

# We print x
print()
print('x = ', x)
print()

# Let's access some elements with positive indices
print('This is First Element in x:', x[0]) 
print('This is Second Element in x:', x[1])
print('This is Fifth (Last) Element in x:', x[4])
print()

# Let's access the same elements with negative indices
print('This is First Element in x:', x[-5])
print('This is Second Element in x:', x[-4])
print('This is Fifth (Last) Element in x:', x[-1])


x =  [1 2 3 4 5]

This is First Element in x: 1
This is Second Element in x: 2
This is Fifth (Last) Element in x: 5

This is First Element in x: 1
This is Second Element in x: 2
This is Fifth (Last) Element in x: 5


In [None]:
# We create a rank 1 ndarray that contains integers from 1 to 5
x = np.array([1, 2, 3, 4, 5])

# We print the original x
print()
print('Original:\n x = ', x)
print()

# We change the fourth element in x from 4 to 20
x[3] = 20

# We print x after it was modified 
print('Modified:\n x = ', x)


Original:
 x =  [1 2 3 4 5]

Modified:
 x =  [ 1  2  3 20  5]


In [None]:
# We create a 3 x 3 rank 2 ndarray that contains integers from 1 to 9
X = np.array([[1,2,3],[4,5,6],[7,8,9]])

# We print X
print()
print('X = \n', X)
print()

# Let's access some elements in X
print('This is (0,0) Element in X:', X[0,0])
print('This is (0,1) Element in X:', X[0,1])
print('This is (2,2) Element in X:', X[2,2])


X = 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

This is (0,0) Element in X: 1
This is (0,1) Element in X: 2
This is (2,2) Element in X: 9


In [None]:
# We create a 3 x 3 rank 2 ndarray that contains integers from 1 to 9
X = np.array([[1,2,3],[4,5,6],[7,8,9]])

# We print the original X
print()
print('Original:\n X = \n', X)
print()

# We change the (0,0) element in X from 1 to 20
X[0,0] = 20

# We print X after it was modified 
print('Modified:\n X = \n', X)


Original:
 X = 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

Modified:
 X = 
 [[20  2  3]
 [ 4  5  6]
 [ 7  8  9]]


Delete

In [None]:
# We create a rank 1 ndarray 
x = np.array([1, 2, 3, 4, 5])

# We create a rank 2 ndarray
Y = np.array([[1,2,3],[4,5,6],[7,8,9]])

# We print x
print()
print('Original x = ', x)

# We delete the first and last element of x
x = np.delete(x, [0,4])

# We print x with the first and last element deleted
print()
print('Modified x = ', x)

# We print Y
print()
print('Original Y = \n', Y)

# We delete the first row of Y
w = np.delete(Y, 0, axis=0)

# We delete the first and last column of Y
v = np.delete(Y, [0,2], axis=1)

# We print w
print()
print('w = \n', w)

# We print v
print()
print('v = \n', v)
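
For reference, here is a small sketch of what np.delete returns for the same inputs. Note that np.delete returns a new ndarray and leaves the original untouched; the name x_trimmed is introduced here only to keep the original x available for comparison.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
Y = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Delete the first and last elements of x (indices 0 and 4)
x_trimmed = np.delete(x, [0, 4])

# Delete the first row of Y (axis=0 refers to rows)
w = np.delete(Y, 0, axis=0)

# Delete the first and last columns of Y (axis=1 refers to columns)
v = np.delete(Y, [0, 2], axis=1)

print('x_trimmed =', x_trimmed)     # [2 3 4]
print('w = \n', w)                  # rows [4 5 6] and [7 8 9]
print('v = \n', v)                  # a single column holding 2, 5, 8
print('Y still has dimensions:', Y.shape)
```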

Append

In [None]:
# We create a rank 1 ndarray 
x = np.array([1, 2, 3, 4, 5])

# We create a rank 2 ndarray 
Y = np.array([[1,2,3],[4,5,6]])

# We print x
print()
print('Original x = ', x)

# We append the integer 6 to x
x = np.append(x, 6)

# We print x
print()
print('x = ', x)

# We append the integer 7 and 8 to x
x = np.append(x, [7,8])

# We print x
print()
print('x = ', x)

# We print Y
print()
print('Original Y = \n', Y)

# We append a new row containing 7,8,9 to Y
v = np.append(Y, [[7,8,9]], axis=0)

# We append a new column containing 9 and 10 to Y
q = np.append(Y,[[9],[10]], axis=1)

# We print v
print()
print('v = \n', v)

# We print q
print()
print('q = \n', q)


Original x =  [1 2 3 4 5]

x =  [1 2 3 4 5 6]

x =  [1 2 3 4 5 6 7 8]

Original Y = 
 [[1 2 3]
 [4 5 6]]

v = 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

q = 
 [[ 1  2  3  9]
 [ 4  5  6 10]]
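
The axis keyword changes how np.append behaves, which is worth seeing side by side. In this sketch (variable names are illustrative), omitting axis flattens both inputs, while supplying axis requires the appended values to have the same rank as Y and a matching shape along the other axis.

```python
import numpy as np

Y = np.array([[1, 2, 3], [4, 5, 6]])

# Without axis, both inputs are flattened before appending
flat = np.append(Y, [7, 8, 9])

# With axis=0, the appended values must be rank 2 with 3 columns, like Y
rows = np.append(Y, [[7, 8, 9]], axis=0)

# A rank 1 list does not match the rank of Y, so this raises a ValueError
try:
    np.append(Y, [7, 8, 9], axis=0)
    append_failed = False
except ValueError as err:
    append_failed = True
    print('Append failed:', err)

print('flat =', flat)
print('rows = \n', rows)
```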


Insert

In [None]:
# We create a rank 1 ndarray 
x = np.array([1, 2, 5, 6, 7])

# We create a rank 2 ndarray 
Y = np.array([[1,2,3],[7,8,9]])

# We print x
print()
print('Original x = ', x)

# We insert the integer 3 and 4 between 2 and 5 in x. 
x = np.insert(x,2,[3,4])

# We print x with the inserted elements
print()
print('x = ', x)

# We print Y
print()
print('Original Y = \n', Y)

# We insert a row between the first and last row of Y
w = np.insert(Y,1,[4,5,6],axis=0)

# We insert a column full of 5s between the first and second column of Y
v = np.insert(Y,1,5, axis=1)

# We print w
print()
print('w = \n', w)

# We print v
print()
print('v = \n', v)


Original x =  [1 2 5 6 7]

x =  [1 2 3 4 5 6 7]

Original Y = 
 [[1 2 3]
 [7 8 9]]

w = 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

v = 
 [[1 5 2 3]
 [7 5 8 9]]


HSTACK & VSTACK

np.hstack() returns an array formed by stacking the given arrays in sequence horizontally (column-wise). See the NumPy documentation for in-depth details.

np.vstack() returns an array, at least 2-D, formed by stacking the given arrays in sequence vertically (row-wise). See the NumPy documentation for in-depth details.

NumPy also allows us to stack ndarrays on top of each other, or to stack them side by side. The stacking is done using either the np.vstack() function for vertical stacking, or the np.hstack() function for horizontal stacking. It is important to note that in order to stack ndarrays, their shapes must match along every axis except the one being stacked. Let's see some examples:

In [None]:
# We create a rank 1 ndarray 
x = np.array([1,2])

# We create a rank 2 ndarray 
Y = np.array([[3,4],[5,6]])

# We print x
print()
print('x = ', x)

# We print Y
print()
print('Y = \n', Y)

# We stack x on top of Y
z = np.vstack((x,Y))

# We stack x on the right of Y. We need to reshape x in order to stack it on the right of Y. 
w = np.hstack((Y,x.reshape(2,1)))

# We print z
print()
print('z = \n', z)

# We print w
print()
print('w = \n', w)


x =  [1 2]

Y = 
 [[3 4]
 [5 6]]

z = 
 [[1 2]
 [3 4]
 [5 6]]

w = 
 [[3 4 1]
 [5 6 2]]
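
The shape requirement is easiest to see with rank 1 inputs. This is a small sketch with illustrative names: np.vstack treats each rank 1 array as a row, np.hstack joins rank 1 arrays end to end, and arrays of different lengths cannot be stacked vertically.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

# vstack treats each rank 1 array as a row, giving a 2 x 2 array
stacked_rows = np.vstack((a, b))

# hstack joins rank 1 arrays end to end, giving a rank 1 array of length 4
joined = np.hstack((a, b))

# Rows of different lengths cannot be stacked on top of each other
try:
    np.vstack((a, np.array([5, 6, 7])))
    stack_failed = False
except ValueError as err:
    stack_failed = True
    print('Stack failed:', err)

print('stacked_rows = \n', stacked_rows)
print('joined =', joined)
```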


###### Slicing Ndarrays

As we mentioned earlier, in addition to being able to access individual elements one at a time, NumPy provides a way to access subsets of ndarrays. This is known as slicing. Slicing is performed by combining indices with the colon : symbol inside the square brackets. In general you will come across three types of slicing:

1. ndarray[start:end]
2. ndarray[start:]
3. ndarray[:end]

The first method selects the elements between the start and end indices. The second selects all elements from the start index through the last index. The third selects all elements from the first index up to the end index. In methods 1 and 3 the end index is excluded. Since ndarrays can be multidimensional, you usually specify a slice for each dimension of the array.


In [None]:
# We create a 4 x 5 ndarray that contains integers from 0 to 19
X = np.arange(20).reshape(4, 5)

# We print X
print()
print('X = \n', X)
print()

# We select all the elements that are in the 2nd through 4th rows and in the 3rd to 5th columns
Z = X[1:4,2:5]

# We print Z
print('Z = \n', Z)

# We can select the same elements as above using method 2
W = X[1:,2:5]

# We print W
print()
print('W = \n', W)

# We select all the elements that are in the 1st through 3rd rows and in the 3rd to 5th columns
Y = X[:3,2:5]

# We print Y
print()
print('Y = \n', Y)

# We select all the elements in the 3rd row
v = X[2,:]

# We print v
print()
print('v = ', v)

# We select all the elements in the 3rd column
q = X[:,2]

# We print q
print()
print('q = ', q)

# We select all the elements in the 3rd column but return a rank 2 ndarray
R = X[:,2:3]

# We print R
print()
print('R = \n', R)


X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

Z = 
 [[ 7  8  9]
 [12 13 14]
 [17 18 19]]

W = 
 [[ 7  8  9]
 [12 13 14]
 [17 18 19]]

Y = 
 [[ 2  3  4]
 [ 7  8  9]
 [12 13 14]]

v =  [10 11 12 13 14]

q =  [ 2  7 12 17]

R = 
 [[ 2]
 [ 7]
 [12]
 [17]]


Notice that when we selected all the elements in the 3rd column, variable q above, the slice returned a rank 1 ndarray instead of a rank 2 ndarray. However, slicing X in a slightly different way, variable R above, we can actually get a rank 2 ndarray instead.

It is important to note that when we perform slices on ndarrays and save them into new variables, as we did above, the data is not copied into the new variable. This is one feature that often causes confusion for beginners. Therefore, we will look at this in a bit more detail.

In the above examples, when we make assignments, such as:

In [None]:
Z = X[1:4,2:5]

the slice of the original array X is not copied into the variable Z. Rather, Z refers to the same underlying data as the selected part of X. We say that slicing only creates a view of the original array. This means that if you make changes in Z you will in effect be changing the elements in X as well. Let's see this with an example:

In [None]:
# We create a 4 x 5 ndarray that contains integers from 0 to 19
X = np.arange(20).reshape(4, 5)

# We print X
print()
print('X = \n', X)
print()

# We select all the elements that are in the 2nd through 4th rows and in the 3rd to 5th columns
Z = X[1:4,2:5]

# We print Z
print()
print('Z = \n', Z)
print()

# We change the last element in Z to 555
Z[2,2] = 555

# We print X
print()
print('X = \n', X)
print()


X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]


Z = 
 [[ 7  8  9]
 [12 13 14]
 [17 18 19]]


X = 
 [[  0   1   2   3   4]
 [  5   6   7   8   9]
 [ 10  11  12  13  14]
 [ 15  16  17  18 555]]



numpy.ndarray.copy

The ndarray.copy() method returns a copy of the array. More details about the arguments are available in the NumPy documentation.

However, if we want to create a new ndarray that contains a copy of the values in the slice we need to use the np.copy() function. The np.copy(ndarray) function creates a copy of the given ndarray. This function can also be used as a method, in the same way as we did before with the reshape function. Let's do the same example we did before but now with copies of the arrays. We'll use copy both as a function and as a method.

In [None]:
# We create a 4 x 5 ndarray that contains integers from 0 to 19
X = np.arange(20).reshape(4, 5)

# We print X
print()
print('X = \n', X)
print()

# We create a copy of the slice using the np.copy() function
Z = np.copy(X[1:4,2:5])

# We create a copy of the slice using copy as a method
W = X[1:4,2:5].copy()

# We change the last element in Z to 555
Z[2,2] = 555

# We change the last element in W to 444
W[2,2] = 444

# We print X
print()
print('X = \n', X)

# We print Z
print()
print('Z = \n', Z)

# We print W
print()
print('W = \n', W)


X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]


X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

Z = 
 [[  7   8   9]
 [ 12  13  14]
 [ 17  18 555]]

W = 
 [[  7   8   9]
 [ 12  13  14]
 [ 17  18 444]]
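
One way to check whether an array is a view or a copy is np.shares_memory(), which reports whether two arrays overlap in memory. The sketch below uses it on a slice and on a copy of the same slice; the variable name C and the values -1 and -2 are chosen only for illustration.

```python
import numpy as np

X = np.arange(20).reshape(4, 5)

# A slice is a view: it shares its underlying data with X
Z = X[1:4, 2:5]
print('Z shares memory with X:', np.shares_memory(X, Z))

# An explicit copy owns its own data
C = X[1:4, 2:5].copy()
print('C shares memory with X:', np.shares_memory(X, C))

# Writing through the view changes X; writing to the copy does not
Z[0, 0] = -1
C[0, 1] = -2
print('X[1, 2] =', X[1, 2])
print('X[1, 3] =', X[1, 3])
```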


In [None]:
# We create a 4 x 5 ndarray that contains integers from 0 to 19
X = np.arange(20).reshape(4, 5)

# We create a rank 1 ndarray that will serve as indices to select elements from X
indices = np.array([1,3])

# We print X
print()
print('X = \n', X)
print()

# We print indices
print('indices = ', indices)
print()

# We use the indices ndarray to select the 2nd and 4th row of X
Y = X[indices,:]

# We use the indices ndarray to select the 2nd and 4th column of X
Z = X[:, indices]

# We print Y
print()
print('Y = \n', Y)

# We print Z
print()
print('Z = \n', Z)


X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

indices =  [1 3]


Y = 
 [[ 5  6  7  8  9]
 [15 16 17 18 19]]

Z = 
 [[ 1  3]
 [ 6  8]
 [11 13]
 [16 18]]
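
Unlike slicing, selecting elements with an ndarray of indices returns a new array rather than a view. The following sketch illustrates this with the same X and indices; the value 999 is arbitrary.

```python
import numpy as np

X = np.arange(20).reshape(4, 5)
indices = np.array([1, 3])

# Selecting rows with an index array produces a copy of the selected data
Y = X[indices, :]

# Changing Y therefore leaves X as it was
Y[0, 0] = 999

print('Y[0, 0] =', Y[0, 0])
print('X[1, 0] =', X[1, 0])
print('Y shares memory with X:', np.shares_memory(X, Y))
```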


numpy.diag

np.diag() extracts a diagonal or constructs a diagonal array. More details about the arguments are available in the NumPy documentation.

NumPy also offers built-in functions to select specific elements within ndarrays. For example, the np.diag(ndarray, k=N) function extracts the elements along the diagonal defined by N. The default is k=0, which refers to the main diagonal. Values of k > 0 are used to select elements in diagonals above the main diagonal, and values of k < 0 are used to select elements in diagonals below the main diagonal. Let's see an example:

In [None]:
# We create a 5 x 5 ndarray that contains integers from 0 to 24
X = np.arange(25).reshape(5, 5)

# We print X
print()
print('X = \n', X)
print()

# We print the elements in the main diagonal of X
print('z =', np.diag(X))
print()

# We print the elements above the main diagonal of X
print('y =', np.diag(X, k=1))
print()

# We print the elements below the main diagonal of X
print('w = ', np.diag(X, k=-1))


X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

z = [ 0  6 12 18 24]

y = [ 1  7 13 19]

w =  [ 5 11 17 23]
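
The example above covers extraction. When np.diag is given a rank 1 ndarray instead, it constructs a square rank 2 ndarray with those values on the chosen diagonal. A brief sketch, with illustrative names D and U:

```python
import numpy as np

# Given a rank 1 ndarray, np.diag builds a square array with those values on the main diagonal
D = np.diag(np.array([1, 2, 3]))

# With k=1 the values go one diagonal above the main one, so the result is 4 x 4
U = np.diag(np.array([1, 2, 3]), k=1)

print('D = \n', D)
print('U = \n', U)

# Extracting the main diagonal of D recovers the original values
print('diag(D) =', np.diag(D))
```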


numpy.unique

In [None]:
# Create 3 x 3 ndarray with repeated values
X = np.array([[1,2,3],[5,2,8],[1,2,3]])

# We print X
print()
print('X = \n', X)
print()

# We print the unique elements of X 
print('The unique elements in X are:',np.unique(X))


X = 
 [[1 2 3]
 [5 2 8]
 [1 2 3]]

The unique elements in X are: [1 2 3 5 8]
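
np.unique can also report how often each value occurs through its return_counts keyword. The sketch below applies it to the same X; the names values and counts are illustrative.

```python
import numpy as np

X = np.array([[1, 2, 3], [5, 2, 8], [1, 2, 3]])

# return_counts=True returns a second array with the number of occurrences of each unique value
values, counts = np.unique(X, return_counts=True)

for value, count in zip(values, counts):
    print(value, 'appears', count, 'time(s)')
```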


###### Boolean Indexing, Set Operations, and Sorting

Up to now we have seen how to make slices and select elements of an ndarray using indices. This is useful when we know the exact indices of the elements we want to select. However, there are many situations in which we don't know the indices of the elements we want to select. For example, suppose we have a 10,000 x 10,000 ndarray of random integers ranging from 1 to 15,000 and we only want to select those integers that are less than 20. Boolean indexing can help us in these cases by allowing us to select elements using logical arguments instead of explicit indices. Let's see some examples:

In [None]:
# We create a 5 x 5 ndarray that contains integers from 0 to 24
X = np.arange(25).reshape(5, 5)

# We print X
print()
print('Original X = \n', X)
print()

# We use Boolean indexing to select elements in X:
print('The elements in X that are greater than 10:', X[X > 10])
print('The elements in X that less than or equal to 7:', X[X <= 7])
print('The elements in X that are between 10 and 17:', X[(X > 10) & (X < 17)])

# We use Boolean indexing to assign the elements that are between 10 and 17 the value of -1
X[(X > 10) & (X < 17)] = -1

# We print X
print()
print('X = \n', X)
print()


Original X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

The elements in X that are greater than 10: [11 12 13 14 15 16 17 18 19 20 21 22 23 24]
The elements in X that less than or equal to 7: [0 1 2 3 4 5 6 7]
The elements in X that are between 10 and 17: [11 12 13 14 15 16]

X = 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 -1 -1 -1 -1]
 [-1 -1 17 18 19]
 [20 21 22 23 24]]
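
It can help to look at the Boolean array itself. In this sketch (the names mask, outside, and not_large are illustrative), a comparison produces a Boolean ndarray with the same shape as X, conditions are combined with &, |, and ~, and summing the mask counts the matching elements.

```python
import numpy as np

X = np.arange(25).reshape(5, 5)

# A comparison produces a Boolean ndarray with the same shape as X
mask = X > 10
print('mask has dimensions:', mask.shape, 'and type:', mask.dtype)

# Conditions are combined with & (and), | (or) and ~ (not);
# each comparison needs its own parentheses
outside = X[(X < 3) | (X > 21)]
not_large = X[~mask]

# Summing a Boolean mask counts the True entries
print('Number of elements greater than 10:', mask.sum())
print('Elements below 3 or above 21:', outside)
print('Elements that are not greater than 10:', not_large)
```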



In addition to Boolean indexing, NumPy also allows for set operations. This is useful when comparing ndarrays, for example, to find common elements between two ndarrays. Let's see some examples:

In [None]:
# We create a rank 1 ndarray
x = np.array([1,2,3,4,5])

# We create a rank 1 ndarray
y = np.array([6,7,2,8,4])

# We print x
print()
print('x = ', x)

# We print y
print()
print('y = ', y)

# We use set operations to compare x and y:
print()
print('The elements that are both in x and y:', np.intersect1d(x,y))
print('The elements that are in x that are not in y:', np.setdiff1d(x,y))
print('All the elements of x and y:',np.union1d(x,y))


x =  [1 2 3 4 5]

y =  [6 7 2 8 4]

The elements that are both in x and y: [2 4]
The elements that are in x that are not in y: [1 3 5]
All the elements of x and y: [1 2 3 4 5 6 7 8]
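
A few related calls round out the picture. The sketch below, using the same x and y, shows that np.setdiff1d depends on argument order, that np.setxor1d returns the elements found in exactly one of the arrays, and that np.isin gives an element-by-element membership test. The result names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([6, 7, 2, 8, 4])

# setdiff1d is not symmetric: swapping the arguments gives the elements only in y
only_in_y = np.setdiff1d(y, x)

# Elements that are in exactly one of the two arrays
in_exactly_one = np.setxor1d(x, y)

# For each element of x, whether it also appears in y
x_in_y = np.isin(x, y)

print('The elements that are in y that are not in x:', only_in_y)
print('The elements that are in exactly one of x and y:', in_exactly_one)
print('Which elements of x are in y:', x_in_y)
```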


numpy.ndarray.sort method

The ndarray.sort() method sorts an array in place.

As with other functions we saw before, sort can be used as a method as well as a function. In this case, the difference lies in whether the original ndarray is modified.

When np.sort() is used as a function, it sorts the ndarray out of place, meaning that it returns a sorted copy and doesn't change the original ndarray.
On the other hand, when sort is used as a method, ndarray.sort() sorts the ndarray in place, meaning that the original array is changed to the sorted one.
Let's see some examples:

In [None]:
# We create an unsorted rank 1 ndarray
x = np.random.randint(1,11,size=(10,))

# We print x
print()
print('Original x = ', x)

# We sort x and print the sorted array using sort as a function.
print()
print('Sorted x (out of place):', np.sort(x))

# When we sort out of place the original array remains intact. To see this we print x again
print()
print('x after sorting:', x)


Original x =  [10  5  7  3  2  3  4  2  2 10]

Sorted x (out of place): [ 2  2  2  3  3  4  5  7 10 10]

x after sorting: [10  5  7  3  2  3  4  2  2 10]


In [None]:
# We create an unsorted rank 1 ndarray
x = np.random.randint(1,11,size=(10,))

# We print x
print()
print('Original x = ', x)

# We sort x in place using sort as a method.
x.sort()

# When we sort in place the original array is changed to the sorted array. To see this we print x again
print()
print('x after sorting:', x)


Original x =  [5 9 8 9 8 2 1 4 7 3]

x after sorting: [1 2 3 4 5 7 8 8 9 9]
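
A common slip is to assign the result of the method call. Because ndarray.sort() works in place, it returns None, whereas np.sort() returns the sorted copy. A short sketch with a fixed array so the values are repeatable (the names sorted_copy, before, and result are illustrative):

```python
import numpy as np

x = np.array([5, 9, 8, 2, 1])

# The function returns a sorted copy and leaves x alone
sorted_copy = np.sort(x)
before = x.tolist()

# The method sorts x itself and returns None
result = x.sort()

print('Sorted copy:', sorted_copy)
print('x before the method call:', before)
print('Return value of x.sort():', result)
print('x after the method call:', x)
```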


In [None]:
# We create an unsorted rank 2 ndarray
X = np.random.randint(1,11,size=(5,5))

# We print X
print()
print('Original X = \n', X)
print()

# We sort the columns of X and print the sorted array
print()
print('X with sorted columns :\n', np.sort(X, axis = 0))

# We sort the rows of X and print the sorted array
print()
print('X with sorted rows :\n', np.sort(X, axis = 1))


Original X = 
 [[ 8  5  1  8  5]
 [ 7  2  3  4  9]
 [ 1  1  7  8  9]
 [ 7  8  3  8  8]
 [ 7  2 10  5  4]]


X with sorted columns :
 [[ 1  1  1  4  4]
 [ 7  2  3  5  5]
 [ 7  2  3  8  8]
 [ 7  5  7  8  9]
 [ 8  8 10  8  9]]

X with sorted rows :
 [[ 1  5  5  8  8]
 [ 2  3  4  7  9]
 [ 1  1  7  8  9]
 [ 3  7  8  8  8]
 [ 2  4  5  7 10]]


###### Arithmetic operations and Broadcasting

We have reached the last lesson in this Introduction to NumPy. In this last lesson we will see how NumPy does arithmetic operations on ndarrays. NumPy allows element-wise operations on ndarrays as well as matrix operations. In this lesson we will only be looking at element-wise operations on ndarrays. In order to do element-wise operations, NumPy sometimes uses something called Broadcasting. Broadcasting is the term used to describe how NumPy handles element-wise arithmetic operations with ndarrays of different shapes. For example, broadcasting is used implicitly when doing arithmetic operations between scalars and ndarrays.

Let's start by doing element-wise addition, subtraction, multiplication, and division between ndarrays. To do this, NumPy provides two approaches: a functional approach, where we use functions such as np.add(), and arithmetic symbols, such as +, which more closely resemble how we write mathematical equations. Both forms perform the same operation; the only difference is that the functions usually have options that you can tweak using keywords. It is important to note that when performing element-wise operations, the ndarrays being operated on must have the same shape or be broadcastable. We'll explain more about this later in this lesson. Let's start by performing element-wise arithmetic operations on rank 1 ndarrays:

In [None]:
# We create two rank 1 ndarrays
x = np.array([1,2,3,4])
y = np.array([5.5,6.5,7.5,8.5])

# We print x
print()
print('x = ', x)

# We print y
print()
print('y = ', y)
print()

# We perform basic element-wise operations using arithmetic symbols and functions
print('x + y = ', x + y)
print('add(x,y) = ', np.add(x,y))
print()
print('x - y = ', x - y)
print('subtract(x,y) = ', np.subtract(x,y))
print()
print('x * y = ', x * y)
print('multiply(x,y) = ', np.multiply(x,y))
print()
print('x / y = ', x / y)
print('divide(x,y) = ', np.divide(x,y))


x =  [1 2 3 4]

y =  [5.5 6.5 7.5 8.5]

x + y =  [ 6.5  8.5 10.5 12.5]
add(x,y) =  [ 6.5  8.5 10.5 12.5]

x - y =  [-4.5 -4.5 -4.5 -4.5]
subtract(x,y) =  [-4.5 -4.5 -4.5 -4.5]

x * y =  [ 5.5 13.  22.5 34. ]
multiply(x,y) =  [ 5.5 13.  22.5 34. ]

x / y =  [0.18181818 0.30769231 0.4        0.47058824]
divide(x,y) =  [0.18181818 0.30769231 0.4        0.47058824]
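
As an example of the keyword options the functions offer, the sketch below uses the out and where keywords of np.divide to skip positions where the divisor is zero. The input values and the choice to leave 0 in the skipped positions are assumptions made for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 0.0, 4.0, 0.0])

# out supplies the array that receives the result;
# where selects the positions that are actually computed.
# Positions where y is zero keep the initial value from out, which is 0 here.
result = np.divide(x, y, out=np.zeros_like(x), where=(y != 0))

print('result =', result)
```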


In [None]:
# We create two rank 2 ndarrays
X = np.array([1,2,3,4]).reshape(2,2)
Y = np.array([5.5,6.5,7.5,8.5]).reshape(2,2)

# We print X
print()
print('X = \n', X)

# We print Y
print()
print('Y = \n', Y)
print()

# We perform basic element-wise operations using arithmetic symbols and functions
print('X + Y = \n', X + Y)
print()
print('add(X,Y) = \n', np.add(X,Y))
print()
print('X - Y = \n', X - Y)
print()
print('subtract(X,Y) = \n', np.subtract(X,Y))
print()
print('X * Y = \n', X * Y)
print()
print('multiply(X,Y) = \n', np.multiply(X,Y))
print()
print('X / Y = \n', X / Y)
print()
print('divide(X,Y) = \n', np.divide(X,Y))


X = 
 [[1 2]
 [3 4]]

Y = 
 [[5.5 6.5]
 [7.5 8.5]]

X + Y = 
 [[ 6.5  8.5]
 [10.5 12.5]]

add(X,Y) = 
 [[ 6.5  8.5]
 [10.5 12.5]]

X - Y = 
 [[-4.5 -4.5]
 [-4.5 -4.5]]

subtract(X,Y) = 
 [[-4.5 -4.5]
 [-4.5 -4.5]]

X * Y = 
 [[ 5.5 13. ]
 [22.5 34. ]]

multiply(X,Y) = 
 [[ 5.5 13. ]
 [22.5 34. ]]

X / Y = 
 [[0.18181818 0.30769231]
 [0.4        0.47058824]]

divide(X,Y) = 
 [[0.18181818 0.30769231]
 [0.4        0.47058824]]


In [None]:
# We create a rank 1 ndarray
x = np.array([1,2,3,4])

# We print x
print()
print('x = ', x)

# We apply different mathematical functions to all elements of x
print()
print('EXP(x) =', np.exp(x))
print()
print('SQRT(x) =',np.sqrt(x))
print()
print('POW(x,2) =',np.power(x,2)) # We raise all elements to the power of 2


x =  [1 2 3 4]

EXP(x) = [ 2.71828183  7.3890561  20.08553692 54.59815003]

SQRT(x) = [1.         1.41421356 1.73205081 2.        ]

POW(x,2) = [ 1  4  9 16]


Another great feature of NumPy is that it has a wide variety of statistical functions. Statistical functions provide us with statistical information about the elements in an ndarray.

In [None]:
# We create a 2 x 2 ndarray
X = np.array([[1,2], [3,4]])

# We print x
print()
print('X = \n', X)
print()

print('Average of all elements in X:', X.mean())
print('Average of all elements in the columns of X:', X.mean(axis=0))
print('Average of all elements in the rows of X:', X.mean(axis=1))
print()
print('Sum of all elements in X:', X.sum())
print('Sum of all elements in the columns of X:', X.sum(axis=0))
print('Sum of all elements in the rows of X:', X.sum(axis=1))
print()
print('Standard Deviation of all elements in X:', X.std())
print('Standard Deviation of all elements in the columns of X:', X.std(axis=0))
print('Standard Deviation of all elements in the rows of X:', X.std(axis=1))
print()
print('Median of all elements in X:', np.median(X))
print('Median of all elements in the columns of X:', np.median(X,axis=0))
print('Median of all elements in the rows of X:', np.median(X,axis=1))
print()
print('Maximum value of all elements in X:', X.max())
print('Maximum value of all elements in the columns of X:', X.max(axis=0))
print('Maximum value of all elements in the rows of X:', X.max(axis=1))
print()
print('Minimum value of all elements in X:', X.min())
print('Minimum value of all elements in the columns of X:', X.min(axis=0))
print('Minimum value of all elements in the rows of X:', X.min(axis=1))


X = 
 [[1 2]
 [3 4]]

Average of all elements in X: 2.5
Average of all elements in the columns of X: [2. 3.]
Average of all elements in the rows of X: [1.5 3.5]

Sum of all elements in X: 10
Sum of all elements in the columns of X: [4 6]
Sum of all elements in the rows of X: [3 7]

Standard Deviation of all elements in X: 1.118033988749895
Standard Deviation of all elements in the columns of X: [1. 1.]
Standard Deviation of all elements in the rows of X: [0.5 0.5]

Median of all elements in X: 2.5
Median of all elements in the columns of X: [2. 3.]
Median of all elements in the rows of X: [1.5 3.5]

Maximum value of all elements in X: 4
Maximum value of all elements in the columns of X: [3 4]
Maximum value of all elements in the rows of X: [2 4]

Minimum value of all elements in X: 1
Minimum value of all elements in the columns of X: [1 2]
Minimum value of all elements in the rows of X: [1 3]


In [None]:
# We create a 2 x 2 ndarray
X = np.array([[1,2], [3,4]])

# We print x
print()
print('X = \n', X)
print()

print('3 * X = \n', 3 * X)
print()
print('3 + X = \n', 3 + X)
print()
print('X - 3 = \n', X - 3)
print()
print('X / 3 = \n', X / 3)


X = 
 [[1 2]
 [3 4]]

3 * X = 
 [[ 3  6]
 [ 9 12]]

3 + X = 
 [[4 5]
 [6 7]]

X - 3 = 
 [[-2 -1]
 [ 0  1]]

X / 3 = 
 [[0.33333333 0.66666667]
 [1.         1.33333333]]


In [None]:
# We create a rank 1 ndarray
x = np.array([1,2,3])

# We create a 3 x 3 ndarray
Y = np.array([[1,2,3],[4,5,6],[7,8,9]])

# We create a 3 x 1 ndarray
Z = np.array([1,2,3]).reshape(3,1)

# We print x
print()
print('x = ', x)
print()

# We print Y
print()
print('Y = \n', Y)
print()

# We print Z
print()
print('Z = \n', Z)
print()

print('x + Y = \n', x + Y)
print()
print('Z + Y = \n',Z + Y)


x =  [1 2 3]


Y = 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]


Z = 
 [[1]
 [2]
 [3]]

x + Y = 
 [[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]

Z + Y = 
 [[ 2  3  4]
 [ 6  7  8]
 [10 11 12]]
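
The rule behind these results is that NumPy compares the shapes starting from the last dimension, and two dimensions are compatible when they are equal or when one of them is 1. The sketch below (with illustrative names) shows a (3, 1) array and a (3,) array broadcasting to a 3 x 3 result, and an incompatible pair raising a ValueError.

```python
import numpy as np

x = np.array([1, 2, 3])   # shape (3,)
Z = x.reshape(3, 1)       # shape (3, 1)

# (3, 1) and (3,) broadcast to (3, 3): entry [i, j] is Z[i, 0] + x[j]
table = Z + x
print('table = \n', table)

# Shapes (2,) and (3, 3) are not compatible: 2 does not equal 3 and neither is 1
try:
    np.array([1, 2]) + np.ones((3, 3))
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print('Broadcast failed:', err)
```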


### Pandas

Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas incorporates two additional data structures into Python, namely Pandas Series and Pandas DataFrame. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

In the following lessons you will learn:

* How to import Pandas

* How to create Pandas Series and DataFrames using various methods

* How to access and change elements in Series and DataFrames

* How to perform arithmetic operations on Series

* How to load data into a DataFrame

* How to deal with Not a Number (NaN) values

The following lessons assume that you are already familiar with NumPy and have gone over the previous NumPy lessons. Therefore, to avoid being repetitive we will omit a lot of details already given in the NumPy lessons. Consequently, if you haven't seen the NumPy lessons we suggest you go over them first.

**Why Use Pandas?**

The recent success of machine learning algorithms is partly due to the huge amounts of data that we have available to train our algorithms on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important. It often happens that large datasets don’t come ready to be fed into your learning algorithms. More often than not, large datasets will have missing values, outliers, incorrect values, etc. Having data with a lot of missing or bad values, for example, is not going to allow your machine learning algorithms to perform well.

Therefore, one very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis. This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, as well as being flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:

* Allows the use of labels for rows and columns

* Can calculate rolling statistics on time series data

* Easy handling of NaN values

* Is able to load data of different formats into DataFrames

* Can join and merge different datasets together

* It integrates with NumPy and Matplotlib

For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.



Let's start by importing Pandas into Python. By convention, Pandas is imported as pd, so you can import it by typing the following command in your Jupyter notebook:

In [None]:
import pandas as pd

###### Pandas Series

A Pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings, and can optionally carry axis labels.

**Difference between NumPy ndarrays and Pandas Series**

One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want.

Another big difference between Pandas Series and NumPy ndarrays is that Pandas Series can hold data of different data types.

Let's begin by creating a Pandas Series. You can create a Pandas Series by using the command pd.Series(data, index), where index is a list of index labels. Let's use a Pandas Series to store a grocery list. We will use the food items as index labels and the quantity we need to buy of each item as our data.

In [None]:
# We import Pandas as pd into Python
import pandas as pd

# We create a Pandas Series that stores a grocery list
groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])

# We display the Groceries Pandas Series
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

We see that Pandas Series are displayed with the indices in the first column and the data in the second column. Notice that the data is not indexed 0 to 3 but rather it is indexed with the names of the food we put in, namely eggs, apples, etc... Also, notice that the data in our Pandas Series has both integers and strings.

Just like NumPy ndarrays, Pandas Series have attributes that allow us to get information from the series in an easy way. Let's see some of them:

In [None]:
# We print some information about Groceries
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements


We can also print the index labels and the data of the Pandas Series separately. This is useful if you don't happen to know what the index labels of the Pandas Series are.

In [None]:
# We print the index and data of Groceries
print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

The data in Groceries is: [30 6 'Yes' 'No']
The index of Groceries is: Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')
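
If you only need to know whether a particular label is present, a membership test with Python's in operator is enough. The sketch below rebuilds the same groceries Series so that it runs on its own; the names has_bread and has_bananas are illustrative and not part of the lesson.

```python
import pandas as pd

# Same grocery list as in the lesson
groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# The in operator checks the index labels, not the stored values
has_bread = 'bread' in groceries
has_bananas = 'bananas' in groceries

print('Is bread an index label in Groceries:', has_bread)
print('Is bananas an index label in Groceries:', has_bananas)
```

Because the test looks at labels only, 'Yes' in groceries would be False even though 'Yes' is one of the stored values.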


Now let's look at how we can access or modify elements in a Pandas Series. One great advantage of Pandas Series is that they let us access data in many different ways. Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays. Since we can use numerical indices, we can use both positive and negative integers to access data from the beginning or from the end of the Series, respectively.

Because elements can be accessed in several ways, Pandas Series have two attributes, .loc and .iloc, that remove any ambiguity about whether we mean an index label or a numerical index. The attribute .loc stands for location and explicitly states that we are using a labeled index. Similarly, the attribute .iloc stands for integer location and explicitly states that we are using a numerical index. Note that recent Pandas versions deprecate plain square brackets with integer positions on a Series whose labels are not integers, such as groceries[0], so .iloc is the safer choice for positional access. Let's see some examples:

In [None]:
# We access elements in Groceries using index labels:

# We use a single index label
print('How many eggs do we need to buy:', groceries['eggs'])
print()

# we can access multiple index labels
print('Do we need milk and bread:\n', groceries[['milk', 'bread']]) 
print()

# we use loc to access multiple index labels
print('How many eggs and apples do we need to buy:\n', groceries.loc[['eggs', 'apples']]) 
print()

# We access elements in Groceries using numerical indices:

# we use multiple numerical indices
print('How many eggs and apples do we need to buy:\n',  groceries[[0, 1]]) 
print()

# We use a negative numerical index
print('Do we need bread:\n', groceries[[-1]]) 
print()

# We use a single numerical index
print('How many eggs do we need to buy:', groceries[0]) 
print()
# we use iloc to access multiple numerical indices
print('Do we need milk and bread:\n', groceries.iloc[[2, 3]]) 

How many eggs do we need to buy: 30

Do we need milk and bread:
 milk     Yes
bread     No
dtype: object

How many eggs and apples do we need to buy:
 eggs      30
apples     6
dtype: object

How many eggs and apples do we need to buy:
 eggs      30
apples     6
dtype: object

Do we need bread:
 bread    No
dtype: object

How many eggs do we need to buy: 30

Do we need milk and bread:
 milk     Yes
bread     No
dtype: object
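
The ambiguity that .loc and .iloc resolve is easiest to see when the index labels are themselves integers. The following is a minimal sketch with a hypothetical Series named codes (not part of the lesson) whose labels 10, 20, 30 differ from the positions 0, 1, 2.

```python
import pandas as pd

# Hypothetical Series: integer labels that do not match the positions
codes = pd.Series(data=['a', 'b', 'c'], index=[10, 20, 30])

by_label = codes.loc[10]      # the element whose label is 10
by_position = codes.iloc[0]   # the element at position 0
last_item = codes.iloc[-1]    # negative positions count from the end

print('Label 10:', by_label)
print('Position 0:', by_position)
print('Last position:', last_item)
```

Here codes.loc[0] would raise a KeyError, since no element carries the label 0, while codes.iloc[0] is always the first element.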


Pandas Series are also mutable like NumPy ndarrays, which means we can change the elements of a Pandas Series after it has been created. For example, let's change the number of eggs we need to buy from our grocery list:

In [None]:
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We change the number of eggs to 2
groceries['eggs'] = 2

# We display the changed grocery list
print()
print('Modified Grocery List:\n', groceries)

Original Grocery List:
 eggs       30
apples      6
milk      Yes
bread      No
dtype: object

Modified Grocery List:
 eggs        2
apples      6
milk      Yes
bread      No
dtype: object


We can also delete items from a Pandas Series by using the .drop() method. The Series.drop(label) method removes the given label from the given Series. We should note that the Series.drop(label) method drops elements from the Series out-of-place, meaning that it doesn't change the original Series being modified. Let's see how this works:

In [None]:
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We remove apples from our grocery list. The drop function removes elements out of place
print()
print('We remove apples (out of place):\n', groceries.drop('apples'))

# When we remove elements out of place the original Series remains intact. To see this
# we display our grocery list again
print()
print('Grocery List after removing apples out of place:\n', groceries)

Original Grocery List:
 eggs        2
apples      6
milk      Yes
bread      No
dtype: object

We remove apples (out of place):
 eggs       2
milk     Yes
bread     No
dtype: object

Grocery List after removing apples out of place:
 eggs        2
apples      6
milk      Yes
bread      No
dtype: object
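
To remove an item from the original Series itself, the .drop() method accepts the keyword inplace = True. The sketch below contrasts the two behaviours on a fresh copy of the grocery list; the name without_apples is illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[2, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# Out of place: a new Series is returned and groceries keeps all four items
without_apples = groceries.drop('apples')
print('Apples still in groceries:', 'apples' in groceries)

# In place: groceries itself is modified and the call returns None
groceries.drop('apples', inplace=True)
print('Apples in groceries after the in-place drop:', 'apples' in groceries)
```

Reassigning the result, as in groceries = groceries.drop('apples'), has the same effect and is often easier to read.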



###### Arithmetic Operations

Just like with NumPy ndarrays, we can perform element-wise arithmetic operations on Pandas Series. In this lesson we will look at arithmetic operations between Pandas Series and single numbers. Let's create a new Pandas Series that will hold a grocery list of just fruits.

In [None]:
# We create a Pandas Series that stores a grocery list of just fruits
fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])

# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols
print()
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print()
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print()
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
print()
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
print()

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

fruits + 2:
 apples     12
oranges     8
bananas     5
dtype: int64

fruits - 2:
 apples     8
oranges    4
bananas    1
dtype: int64

fruits * 2:
 apples     20
oranges    12
bananas     6
dtype: int64

fruits / 2:
 apples     5.0
oranges    3.0
bananas    1.5
dtype: float64



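You can also apply mathematical functions from NumPy, such as np.sqrt(), to all elements of a Pandas Series.

In [None]: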
# We import NumPy as np to be able to use the mathematical functions
import numpy as np

# We print fruits for reference
print('Original grocery list of fruits:\n', fruits)

# We apply different mathematical functions to all elements of fruits
print()
print('EXP(X) = \n', np.exp(fruits))
print() 
print('SQRT(X) =\n', np.sqrt(fruits))
print()
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2

Original grocery list of fruits:
 apples     10
oranges     6
bananas     3
dtype: int64

EXP(X) = 
 apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

SQRT(X) =
 apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

POW(X,2) =
 apples     100
oranges     36
bananas      9
dtype: int64
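
Arithmetic does not have to cover the whole Series. As a small sketch, the operations below are applied only to selected elements of fruits, combining the label and position access shown earlier; the result variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

# A single element selected by label
bananas_plus_two = fruits['bananas'] + 2

# A single element selected by position
apples_minus_two = fruits.iloc[0] - 2

# Several elements selected by a list of labels
doubled = fruits[['apples', 'oranges']] * 2
halved = fruits.loc[['apples', 'oranges']] / 2

print('bananas + 2:', bananas_plus_two)
print('apples - 2:', apples_minus_two)
print('apples and oranges doubled:\n', doubled)
print('apples and oranges halved:\n', halved)
```

These operations are defined for numeric data. On a Series with mixed types, such as the groceries list, an operation like division raises an error because it is not defined for strings.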


###### Creating DataFrames

Pandas DataFrames are two-dimensional data structures with labeled rows and columns, that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a spreadsheet. We can create Pandas DataFrames manually or by loading data from a file. In this lesson, we will start by learning how to create Pandas DataFrames manually from dictionaries, and later we will see how we can load data into a DataFrame from a data file.

**Create a DataFrame manually**

We will start by creating a DataFrame manually from a dictionary of Pandas Series. It is a two-step process:

1. Create the dictionary of Pandas Series.

2. Pass the dictionary to the pd.DataFrame() function.

We will create a dictionary that contains items purchased by two people, Alice and Bob, on an online store. The Pandas Series will use the price of the items purchased as data, and the purchased items will be used as the index labels to the Pandas Series. Let's see how this is done in code:

In [None]:
# We import Pandas as pd into Python
import pandas as pd

# We create a dictionary of Pandas Series 
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# We print the type of items to see that it is a dictionary
print(type(items))

<class 'dict'>


Now that we have a dictionary, we are ready to create a DataFrame by passing it to the pd.DataFrame() function. We will create a DataFrame that could represent the shopping carts of various users. In this case we have only two users, Alice and Bob.

In [None]:
# We create a Pandas DataFrame by passing it a dictionary of Pandas Series
shopping_carts = pd.DataFrame(items)

# We display the DataFrame
shopping_carts

Unnamed: 0,Bob,Alice
bike,245.0,500.0
book,,40.0
glasses,,110.0
pants,25.0,45.0
watch,55.0,


There are several things to notice here, as explained below:

1. We see that DataFrames are displayed in tabular form, much like an Excel spreadsheet, with the labels of rows and columns in bold.

2. Also, notice that the row labels of the DataFrame are built from the union of the index labels of the two Pandas Series we used to construct the dictionary. And the column labels of the DataFrame are taken from the keys of the dictionary.

3. Another thing to notice is that the columns appear in the order of the dictionary keys, Bob and then Alice, while the row labels are sorted alphabetically. Older versions of Pandas sorted the columns alphabetically as well. When we load data into a DataFrame from a data file, the columns keep the order they have in the file.

4. The last thing we want to point out is that some NaN values appear in the DataFrame. NaN stands for Not a Number, and is Pandas' way of indicating that it doesn't have a value for that particular row and column index. For example, if we look at the column of Alice, we see that it has NaN in the watch index. You can see why this is the case by looking at the dictionary we created at the beginning: it has no item for Alice labeled watch. So whenever a DataFrame is created, if a particular column doesn't have values for a particular row index, Pandas will put a NaN value there.

5. If we were to feed this data into a machine learning algorithm, we would have to remove these NaN values first. In a later lesson, we will learn how to deal with NaN values and clean our data. For now, we will leave these values in our DataFrame.

In the example above, we created a Pandas DataFrame from a dictionary of Pandas Series that had clearly defined indexes. If we don't provide index labels to the Pandas Series, Pandas will use numerical row indexes when it creates the DataFrame. Let's see an example:

In [None]:
# We create a dictionary of Pandas Series without indexes
data = {'Bob' : pd.Series([245, 25, 55]),
        'Alice' : pd.Series([40, 110, 500, 45])}

# We create a DataFrame
df = pd.DataFrame(data)

# We display the DataFrame
df

Unnamed: 0,Bob,Alice
0,245.0,40
1,25.0,110
2,55.0,500
3,,45


We can see that Pandas indexes the rows of the DataFrame starting from 0, just like NumPy indexes ndarrays.

Now, just like with Pandas Series, we can also extract information from DataFrames using attributes. Let's print some information from our shopping_carts DataFrame:

In [None]:
# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
 [[245. 500.]
 [ nan  40.]
 [ nan 110.]
 [ 25.  45.]
 [ 55.  nan]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Bob', 'Alice'], dtype='object')


When creating the shopping_carts DataFrame we passed the entire dictionary to the pd.DataFrame() function. However, there might be cases when you are only interested in a subset of the data. Pandas allows us to select which data we want to put into our DataFrame by means of the keywords columns and index. Let's see some examples:

In [None]:
# We Create a DataFrame that only has Bob's data
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

# We display bob_shopping_cart
bob_shopping_cart

Unnamed: 0,Bob
bike,245
pants,25
watch,55
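
The index keyword works the same way for rows, and it can be combined with columns. The sketch below reuses the items dictionary from above; the names sel_shopping_cart and alice_sel_shopping_cart are illustrative.

```python
import pandas as pd

items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Alice': pd.Series(data=[40, 110, 500, 45],
                            index=['book', 'glasses', 'bike', 'pants'])}

# Keep only the rows labeled pants and book, for both people
sel_shopping_cart = pd.DataFrame(items, index=['pants', 'book'])

# Keep only the glasses and bike rows of Alice's column
alice_sel_shopping_cart = pd.DataFrame(items, index=['glasses', 'bike'],
                                       columns=['Alice'])

print(sel_shopping_cart)
print()
print(alice_sel_shopping_cart)
```

Bob has no book, so that cell of sel_shopping_cart is NaN, just as in the full shopping_carts DataFrame.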


You can also manually create DataFrames from a dictionary of lists (arrays). The procedure is the same as before: we start by creating the dictionary and then pass the dictionary to the pd.DataFrame() function. In this case, however, all the lists (arrays) in the dictionary must be of the same length. Let's see an example:

In [None]:
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame 
df = pd.DataFrame(data)

# We display the DataFrame
df

Unnamed: 0,Integers,Floats
0,1,4.5
1,2,8.2
2,3,9.6


Notice that since the data dictionary we created doesn't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, put labels to the row index by using the index keyword in the pd.DataFrame() function. Let's see an example

In [None]:
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

# We display the DataFrame
df

Unnamed: 0,Integers,Floats
label 1,1,4.5
label 2,2,8.2
label 3,3,9.6


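You can also create a DataFrame from a list of Python dictionaries. Each dictionary becomes one row, and its keys become the column labels.

In [None]: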
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame
store_items

Unnamed: 0,bikes,pants,watches,glasses
store 1,20,30,35,
store 2,15,5,10,50.0


###### Accessing Elements in Pandas DataFrames

We can access elements in Pandas DataFrames in many different ways. In general, we can access rows, columns, or individual elements of the DataFrame by using the row and column labels. We will use the same store_items DataFrame created in the previous lesson. Let's see some examples:

In [None]:
# We print the store_items DataFrame
print(store_items)

# We access rows, columns and elements using labels
print()
print('How many bikes are in each store:\n', store_items[['bikes']])
print()
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print()
print('What items are in Store 1:\n', store_items.loc[['store 1']])
print()
print('How many bikes are in Store 2:', store_items['bikes']['store 2'])

         bikes  pants  watches  glasses
store 1     20     30       35      NaN
store 2     15      5       10     50.0

How many bikes are in each store:
          bikes
store 1     20
store 2     15

How many bikes and pants are in each store:
          bikes  pants
store 1     20     30
store 2     15      5

What items are in Store 1:
          bikes  pants  watches  glasses
store 1     20     30       35      NaN

How many bikes are in Store 2: 15


It is important to know that when accessing individual elements in a DataFrame, as we did in the last example above, the labels should always be provided with the column label first, i.e. in the form dataframe[column][row]. For example, when retrieving the number of bikes in store 2, we first used the column label bikes and then the row label store 2. If you provide the row label first you will get an error.
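
As a sketch of the alternatives, the example below rebuilds the two-store DataFrame and reads the same element twice: once with the chained dataframe[column][row] form, and once with .loc, which takes the row label and the column label together in a single call. It also shows the error raised when the row label comes first in the chained form. The variable names are illustrative.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# Chained lookup: column label first, then row label
bikes_chained = store_items['bikes']['store 2']

# .loc takes both labels in a single call, in [row, column] order
bikes_loc = store_items.loc['store 2', 'bikes']

# Putting the row label first in a chained lookup fails, because plain
# square brackets on a DataFrame look the label up among the columns
try:
    store_items['store 2']['bikes']
    row_first_failed = False
except KeyError:
    row_first_failed = True

print(bikes_chained, bikes_loc, row_first_failed)
```

For assignments, the single-call .loc form is generally the safer of the two, because a chained lookup may end up modifying a temporary copy rather than the DataFrame itself.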

We can also modify our DataFrames by adding rows or columns. Let's start by learning how to add new columns to our DataFrames. Let's suppose we decided to add shirts to the items we have in stock at each store. To do this, we will need to add a new column to our store_items DataFrame indicating how many shirts are in each store. Let's do that:

In [None]:
# We add a new column named shirts to our store_items DataFrame indicating the number of
# shirts in stock at each store. We will put 15 shirts in store 1 and 2 shirts in store 2
store_items['shirts'] = [15,2]

# We display the modified DataFrame
store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts
store 1,20,30,35,,15
store 2,15,5,10,50.0,2


We can also add new columns to our DataFrame by using arithmetic operations between other columns in our DataFrame. Let's see an example

In [None]:
# We make a new column called suits by adding the number of shirts and pants
store_items['suits'] = store_items['pants'] + store_items['shirts']

# We display the modified DataFrame
store_items

Suppose now that you opened a new store and need to add its stock to your DataFrame. We can do this by adding a new row to the store_items DataFrame. To add rows to our DataFrame we first create a new DataFrame and then concatenate it with the original one using pd.concat(). Older tutorials use the DataFrame.append() method for this, but it was removed in Pandas 2.0. Let's see how this works:

In [None]:
# We create a list of Python dictionaries that will contain the number of different items at the new store
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

# We create a new DataFrame with the new_items and provide an index labeled store 3
new_store = pd.DataFrame(new_items, index = ['store 3'])

# We display the items at the new store
new_store

# We add store 3 to our store_items DataFrame (pd.concat replaces the removed .append() method)
store_items = pd.concat([store_items, new_store])

# We display the modified DataFrame
store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts
store 1,20,30,35,,15.0
store 2,15,5,10,50.0,2.0
store 3,20,30,35,4.0,


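We can also add a new column that takes its data from an existing column. In the example below, the new watches column copies the watches values from every row except the first one, so store 1 gets a NaN value: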

In [None]:
# We add a new column using data from particular rows in the watches column
store_items['new watches'] = store_items['watches'][1:]

# We display the modified DataFrame
store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,new watches
store 1,20,30,35,,15.0,
store 2,15,5,10,50.0,2.0,10.0
store 3,20,30,35,4.0,,35.0


It is also possible to insert new columns into a DataFrame anywhere we want. The dataframe.insert(loc, label, data) method inserts a new column in the DataFrame at location loc, with the given column label and the given data. Let's add a new column named shoes at numerical index 4. In the DataFrame displayed above, that position is held by the shirts column, so shoes will be placed right before it. Let's see how this works:

In [None]:
# We insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8,5,0])

# we display the modified DataFrame
store_items

Unnamed: 0,bikes,pants,watches,glasses,shoes,shirts,new watches
store 1,20,30,35,,8,15.0,
store 2,15,5,10,50.0,5,2.0,10.0
store 3,20,30,35,4.0,0,,35.0


Just as we can add rows and columns we can also delete them. To delete rows and columns from our DataFrame we will use the .pop() and .drop() methods. The .pop() method only allows us to delete columns, while the .drop() method can be used to delete both rows and columns by use of the axis keyword. Let's see some examples

In [None]:
# We remove the new watches column
store_items.pop('new watches')

# we display the modified DataFrame
store_items

Unnamed: 0,bikes,pants,watches,glasses,shoes,shirts
store 1,20,30,35,,8,15.0
store 2,15,5,10,50.0,5,2.0
store 3,20,30,35,4.0,0,


In [None]:
# We remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)

# we display the modified DataFrame
store_items

Unnamed: 0,bikes,pants,glasses,shirts
store 1,20,30,,15.0
store 2,15,5,50.0,2.0
store 3,20,30,4.0,


In [None]:
# We remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)

# we display the modified DataFrame
store_items

Unnamed: 0,bikes,pants,glasses,shirts
store 3,20,30,4.0,


Sometimes we might need to change the row and column labels. Let's change the bikes column label to hats using the .rename() method

In [None]:
# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})

# we display the modified DataFrame
store_items

Unnamed: 0,hats,pants,glasses,shirts
store 3,20,30,4.0,


In [None]:
# We change the row label from store 3 to last store
store_items = store_items.rename(index = {'store 3': 'last store'})

# we display the modified DataFrame
store_items

Unnamed: 0,hats,pants,glasses,shirts
last store,20,30,4.0,


You can also change the index to be one of the columns in the DataFrame.

In [None]:
# We change the row index to be the data in the pants column
store_items = store_items.set_index('pants')

# we display the modified DataFrame
store_items

pants,hats,glasses,shirts
30,20,4.0,


###### Dealing with NaN

As mentioned earlier, before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need to have a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type of bad data we encounter most often is missing values. As we saw earlier, Pandas assigns NaN values to missing data. In this lesson we will learn how to detect and deal with NaN values.

We will begin by creating a DataFrame with some NaN values in it.

In [None]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We display the DataFrame
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


We can clearly see that the DataFrame we created has 3 NaN values: one in store 1 and two in store 3. However, when we load very large datasets into a DataFrame, possibly with millions of items, the NaN values are not easy to spot by eye. For these cases, we can use a combination of methods to count the number of NaN values in our data. The following example combines the .isnull() and the .sum() methods to count the number of NaN values in our DataFrame:

In [None]:
# We count the number of NaN values in store_items
x =  store_items.isnull().sum().sum()

# We print x
print('Number of NaN values in our DataFrame:', x)

Number of NaN values in our DataFrame: 3


In the above example, the .isnull() method returns a Boolean DataFrame of the same size as store_items and indicates with True the elements that have NaN values and with False the elements that are not. Let's see an example:

In [None]:
store_items.isnull()

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,False,False,False,False,False,False,True
store 2,False,False,False,False,False,False,False
store 3,False,False,False,True,False,True,False


In Pandas, logical True values have numerical value 1 and logical False values have numerical value 0. Therefore, we can count the number of NaN values by counting the number of logical True values. In order to count the total number of logical True values we use the .sum() method twice. We have to use it twice because the first sum returns a Pandas Series with the sums of logical True values along columns, as we see below:

In [None]:
store_items.isnull().sum()

bikes      0
pants      0
watches    0
shirts     1
shoes      0
suits      1
glasses    1
dtype: int64

The second sum will then add up the 1s in the above Pandas Series.
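
The same idea extends to counting NaN values per row by passing axis = 1 to the first .sum(). The sketch below rebuilds store_items so that it runs on its own; the names nan_per_column, nan_per_row and total_nan are illustrative.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# One count per column (the default, axis = 0)
nan_per_column = store_items.isnull().sum()

# One count per row
nan_per_row = store_items.isnull().sum(axis=1)

# Adding up either Series gives the total number of NaN values
total_nan = int(nan_per_column.sum())

print('NaN values per row:\n', nan_per_row)
print('Total number of NaN values:', total_nan)
```

The per-row counts match what we read off the table earlier: one NaN value in store 1, none in store 2, and two in store 3.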

Instead of counting the number of NaN values we can also do the opposite, we can count the number of non-NaN values. We can do this by using the .count() method as shown below:

In [None]:
# We print the number of non-NaN values in our DataFrame
print()
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())


Number of non-NaN values in the columns of our DataFrame:
 bikes      3
pants      3
watches    3
shirts     2
shoes      3
suits      2
glasses    2
dtype: int64


**Eliminating NaN Values**

Now that we learned how to know if our dataset has any NaN values in it, the next step is to decide what to do with them. In general, we have two options, we can either delete or replace the NaN values. In the following examples, we will show you how to do both.

We will start by learning how to eliminate rows or columns from our DataFrame that contain any NaN values. The .dropna(axis) method eliminates any rows with NaN values when axis = 0 is used and will eliminate any columns with NaN values when axis = 1 is used.

Tip: Remember, you learned that you can read axis = 0 as "down" and axis = 1 as "across" the given NumPy ndarray or Pandas DataFrame object.

In [None]:
# We drop any rows with NaN values
store_items.dropna(axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 2,15,5,10,2.0,5,7.0,50.0


In [None]:
# We drop any columns with NaN values
store_items.dropna(axis = 1)

Unnamed: 0,bikes,pants,watches,shoes
store 1,20,30,35,8
store 2,15,5,10,5
store 3,20,30,35,10


Notice that the .dropna() method eliminates (drops) the rows or columns with NaN values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword inplace = True inside the dropna() function.
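
A minimal sketch of the difference between the two modes is shown below. The in-place call is made on a copy, named cleaned here purely for illustration, so that store_items stays available for comparison.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Out of place: the result is a new DataFrame, store_items is unchanged
rows_without_nan = store_items.dropna(axis=0)
print('store_items shape after out-of-place dropna:', store_items.shape)

# In place: the DataFrame the method is called on is modified directly
cleaned = store_items.copy()
cleaned.dropna(axis=1, inplace=True)
print('cleaned shape after in-place dropna:', cleaned.shape)
```

Which mode to use is largely a matter of style. Assigning the returned DataFrame to a new name keeps the original data around in case a different cleaning strategy is needed later.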

**Substituting NaN Values**

Now, instead of eliminating NaN values, we can replace them with suitable values. We could choose for example to replace all NaN values with the value 0. We can do this by using the .fillna() method as shown below.

In [None]:
# We replace all NaN values with 0
store_items.fillna(0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,0.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,0.0,10,0.0,4.0


We can also use the .fillna() method to replace NaN values with previous values in the DataFrame. This is known as forward filling. When replacing NaN values with forward filling, we can use previous values taken from columns or rows. The .fillna(method = 'ffill', axis) will use the forward filling (ffill) method to replace NaN values using the previous known value along the given axis. Note that recent Pandas versions deprecate the method keyword of .fillna() in favor of the dedicated .ffill() and .bfill() methods, which take the same axis argument. Let's see some examples:

In [None]:
# We replace NaN values with the previous value in the column
store_items.fillna(method = 'ffill', axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


Notice that the two NaN values in store 3 have been replaced with previous values in their columns. However, notice that the NaN value in store 1 didn't get replaced. That's because there are no previous values in this column, since the NaN value is the first value in that column. However, if we do forward fill using the previous row values, this won't happen. Let's take a look:

In [None]:
# We replace NaN values with the previous value in the row
store_items.fillna(method = 'ffill', axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,35.0,10.0,10.0,4.0


We see that in this case all the NaN values have been replaced with the previous values in their rows.

Similarly, you can choose to replace the NaN values with the values that go after them in the DataFrame, this is known as backward filling. The .fillna(method = 'backfill', axis) will use the backward filling (backfill) method to replace NaN values using the next known value along the given axis. Just like with forward filling we can choose to use row or column values. Let's see some examples:

In [None]:
# We replace NaN values with the next value in the column
store_items.fillna(method = 'backfill', axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,50.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


Notice that the NaN value in store 1 has been replaced with the next value in its column. However, notice that the two NaN values in store 3 didn't get replaced. That's because there are no next values in these columns, since these NaN values are the last values in those columns. However, if we do backward fill using the next row values, this won't happen. Let's take a look:

Backward fill NaN values across the rows (axis = 1) of the DataFrame

In [None]:
# We replace NaN values with the next value in the row
store_items.fillna(method = 'backfill', axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,10.0,10.0,4.0,4.0


Notice that the .fillna() method replaces (fills) the NaN values out of place. This means that the original DataFrame is not modified. You can always replace the NaN values in place by setting the keyword inplace = True inside the fillna() function.
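
In recent pandas releases (2.1 onward), passing `method=` to `.fillna()` is deprecated in favor of the dedicated `.ffill()` and `.bfill()` methods. The sketch below shows the equivalent calls and checks that the original DataFrame is left untouched. The construction of `store_items` here is an assumption, reconstructed from the outputs shown above.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of store_items, based on the outputs shown above
store_items = pd.DataFrame(
    {
        "bikes": [20, 15, 20],
        "pants": [30, 5, 30],
        "watches": [35, 10, 35],
        "shirts": [15.0, 2.0, np.nan],
        "shoes": [8, 5, 10],
        "suits": [45.0, 7.0, np.nan],
        "glasses": [np.nan, 50.0, 4.0],
    },
    index=["store 1", "store 2", "store 3"],
)

# Forward fill down each column (same idea as method='ffill', axis=0)
col_ffill = store_items.ffill(axis=0)

# Backward fill along each row (same idea as method='backfill', axis=1)
row_bfill = store_items.bfill(axis=1)

# Both calls return new DataFrames, so the original still has its 3 NaN values
remaining_nans = int(store_items.isnull().sum().sum())
print(col_ffill)
print(row_bfill)
print("NaN values left in the original:", remaining_nans)
```

To keep the filled result, assign it back to a name (for example `store_items = store_items.ffill(axis=0)`).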

We can also choose to replace NaN values by using different interpolation methods. For example, the .interpolate(method = 'linear', axis) method will use linear interpolation to replace NaN values using the values along the given axis. Let's see some examples:

In [None]:
# We replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


Notice that the two NaN values in store 3 have been filled. Because they are the last values in their columns, there is no later data point to interpolate toward, so the last known values (2.0 and 7.0) are carried forward. However, notice that the NaN value in store 1 didn't get replaced. That's because the NaN value is the first value in that column, and since there is no data before it, the interpolation function can't calculate a value. Now, let's interpolate using row values instead:

In [None]:
# We replace NaN values by using linear interpolation using row values
store_items.interpolate(method = 'linear', axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,22.5,10.0,7.0,4.0
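
To make the arithmetic behind the row-wise result concrete, here is a minimal sketch that interpolates store 3's row on its own. The Series below is copied from the example values; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Store 3's row from the example, as a standalone Series
row = pd.Series(
    [20.0, 30.0, 35.0, np.nan, 10.0, np.nan, 4.0],
    index=["bikes", "pants", "watches", "shirts", "shoes", "suits", "glasses"],
)

filled = row.interpolate(method="linear")

# shirts sits halfway between watches (35) and shoes (10): (35 + 10) / 2
# suits sits halfway between shoes (10) and glasses (4):   (10 + 4) / 2
print(filled["shirts"], filled["suits"])

# Edge cases: a leading NaN has nothing before it and stays NaN,
# while a trailing NaN is filled with the last known value
edge = pd.Series([np.nan, 50.0, np.nan])
edge_filled = edge.interpolate(method="linear")
print(edge_filled)
```

With the default settings, linear interpolation treats the values as equally spaced and ignores the index labels.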


###### Loading and Manipulating Data

The GOOG.csv and fake_company.csv files are available to download at the bottom of this page. If a file doesn't download when you click it, right-click and choose the "Save As..." option.

In machine learning you will most likely use datasets from many sources to train your learning algorithms. Pandas allows us to load data in different formats into DataFrames. One of the most popular formats used to store data is CSV. CSV stands for Comma Separated Values and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the pd.read_csv() function. Let's load Google stock data into a Pandas DataFrame. The GOOG.csv file contains Google stock data from 8/19/2004 through 10/13/2017, taken from Yahoo Finance.

In [None]:
# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('/content/goog-1.csv')

# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)

Google_stock is of type: <class 'pandas.core.frame.DataFrame'>
Google_stock has shape: (3313, 7)


We see that we have loaded the GOOG.csv file into a Pandas DataFrame and it consists of 3,313 rows and 7 columns. Now let's look at the stock data. 

We see that it is quite a large dataset and that Pandas has automatically assigned numerical row indices to the DataFrame. Pandas also used the labels that appear in the data in the CSV file to assign the column labels.

When dealing with large datasets like this one, it is often useful to look at just the first few rows of data instead of the whole dataset. We can take a look at the first 5 rows of data using the .head() method, as shown below:

In [None]:
Google_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
1,2004-08-20,50.178635,54.187561,49.925285,53.80505,53.80505,23005800
2,2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
3,2004-08-24,55.260582,55.439419,51.450363,52.096165,52.096165,15361800
4,2004-08-25,52.140873,53.651051,51.604362,52.657513,52.657513,9257400


We can also take a look at the last 5 rows of data by using the .tail() method:

In [None]:
Google_stock.tail()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
3308,2017-10-09,980.0,985.424988,976.109985,977.0,977.0,891400
3309,2017-10-10,980.0,981.570007,966.080017,972.599976,972.599976,968400
3310,2017-10-11,973.719971,990.710022,972.25,989.25,989.25,1693300
3311,2017-10-12,987.450012,994.119995,985.0,987.830017,987.830017,1262400
3312,2017-10-13,992.0,997.210022,989.0,989.679993,989.679993,1157700
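
If you don't have the CSV file at hand, the same steps can be tried on an in-memory sample. The sketch below feeds the first few rows shown above to pd.read_csv() through `StringIO`, and passes a row count to .head() and .tail(). The `parse_dates` argument is an optional extra, not something the original cell uses; it asks pandas to convert the Date column to real datetimes instead of plain strings.

```python
from io import StringIO

import pandas as pd

# A few rows copied from the output above, used as a stand-in for GOOG.csv
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
2004-08-20,50.178635,54.187561,49.925285,53.80505,53.80505,23005800
2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
2004-08-24,55.260582,55.439419,51.450363,52.096165,52.096165,15361800
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(StringIO(csv_text), parse_dates=["Date"])

# head() and tail() take an optional number of rows (the default is 5)
first_two = sample.head(2)
last_one = sample.tail(1)

print(sample.shape)
print(sample.dtypes)
print(first_two)
print(last_one)
```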


Let's do a quick check to see whether we have any NaN values in our dataset. To do this, we will use the .isnull() method followed by the .any() method to check whether any of the columns contain NaN values.

In [None]:
Google_stock.isnull().any()

Date         False
Open         False
High         False
Low          False
Close        False
Adj Close    False
Volume       False
dtype: bool

We see that we have no NaN values.
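
Since the stock data has no missing values, the check above returns False everywhere. As a small illustration of what a positive result looks like, the sketch below uses a made-up DataFrame (the `prices` name and its values are hypothetical) and also counts the NaN values per column with .sum():

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the Open column
prices = pd.DataFrame(
    {
        "Open": [1.0, np.nan, 3.0],
        "Close": [1.5, 2.5, 3.5],
    }
)

# True for every column that contains at least one NaN
has_nan = prices.isnull().any()

# Number of NaN values in each column (True counts as 1)
nan_counts = prices.isnull().sum()

# Total number of NaN values in the whole DataFrame
total = int(nan_counts.sum())

print(has_nan)
print(nan_counts)
print("Total NaN values:", total)
```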

When dealing with large datasets, it is often useful to get statistical information from them. Pandas provides the .describe() method to get descriptive statistics on each column of the DataFrame. Let's see how this works:

In [None]:
# We get descriptive statistics on our stock data
Google_stock.describe()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
count,3313.0,3313.0,3313.0,3313.0,3313.0,3313.0
mean,380.186092,383.49374,376.519309,380.072458,380.072458,8038476.0
std,223.81865,224.974534,222.473232,223.85378,223.85378,8399521.0
min,49.274517,50.541279,47.669952,49.681866,49.681866,7900.0
25%,226.556473,228.394516,224.003082,226.40744,226.40744,2584900.0
50%,293.312286,295.433502,289.929291,293.029114,293.029114,5281300.0
75%,536.650024,540.0,532.409973,536.690002,536.690002,10653700.0
max,992.0,997.210022,989.0,989.679993,989.679993,82768100.0


In [None]:
# We get descriptive statistics on a single column of our DataFrame
Google_stock['Adj Close'].describe()

count    3313.000000
mean      380.072458
std       223.853780
min        49.681866
25%       226.407440
50%       293.029114
75%       536.690002
max       989.679993
Name: Adj Close, dtype: float64
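
The percentile rows that .describe() reports can be changed through its `percentiles` parameter. The following sketch uses a small made-up Series (not the stock data) to show this, and confirms that the 50% row is the median:

```python
import pandas as pd

# Made-up values, chosen only to keep the numbers easy to follow
values = pd.Series([10, 20, 30, 40, 50])

# Ask for the 10th, 50th and 90th percentiles instead of the default 25/50/75
stats = values.describe(percentiles=[0.1, 0.5, 0.9])

print(stats)
print("Median:", values.median())
```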

Similarly, you can also look at one statistic by using one of the many statistical functions Pandas provides. Let's look at some examples:

In [None]:
# We print information about our DataFrame
print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
# numeric_only=True skips the text Date column, which recent pandas
# versions refuse to average
print('Average value of each column:\n', Google_stock.mean(numeric_only=True))


Maximum values of each column:
 Date         2017-10-13
Open                992
High             997.21
Low                 989
Close            989.68
Adj Close        989.68
Volume         82768100
dtype: object

Minimum Close value: 49.681866

Average value of each column:
 Open         3.801861e+02
High         3.834937e+02
Low          3.765193e+02
Close        3.800725e+02
Adj Close    3.800725e+02
Volume       8.038476e+06
dtype: float64
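
Note that the average above covers only the numeric columns. Since pandas 2.0, calling .mean() on a DataFrame that also holds a text column such as Date raises an error unless you pass `numeric_only=True`. Here is a minimal sketch on two rows copied from the stock data (the `stock` name is only illustrative):

```python
import pandas as pd

# Two rows from the stock data, with Date kept as plain text
stock = pd.DataFrame(
    {
        "Date": ["2004-08-19", "2004-08-20"],
        "Close": [49.845802, 53.80505],
        "Volume": [44994500, 23005800],
    }
)

# Average only the numeric columns, skipping Date
averages = stock.mean(numeric_only=True)

# A single column can be summarized directly
close_median = stock["Close"].median()

print(averages)
print("Median Close:", close_median)
```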


Another important statistical measure is data correlation. Correlation tells us, for example, how strongly the values in one column move together with the values in another. We can use the .corr() method to get the correlation between each pair of columns, as shown below:

In [None]:
# We display the correlation between the numeric columns
# (numeric_only=True skips the text Date column)
Google_stock.corr(numeric_only=True)

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
Open,1.0,0.999904,0.999845,0.999745,0.999745,-0.564258
High,0.999904,1.0,0.999834,0.999868,0.999868,-0.562749
Low,0.999845,0.999834,1.0,0.999899,0.999899,-0.567007
Close,0.999745,0.999868,0.999899,1.0,1.0,-0.564967
Adj Close,0.999745,0.999868,0.999899,1.0,1.0,-0.564967
Volume,-0.564258,-0.562749,-0.567007,-0.564967,-0.564967,1.0


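A correlation value of 1 tells us the two columns are perfectly positively correlated, a value of -1 tells us they are perfectly negatively correlated, and a value of 0 tells us there is no linear correlation at all. Here the price columns are almost perfectly correlated with one another, while Volume is moderately negatively correlated with price (about -0.56).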
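
To see the extreme correlation values in isolation, here is a small sketch on made-up columns (all names and numbers are hypothetical). One column rises in step with `x`, one falls in step with it, and one follows a symmetric pattern that has no linear relationship with `x`:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["up"] = 2 * df["x"] + 1         # rises in a straight line with x
df["down"] = -3 * df["x"]          # falls in a straight line with x
df["symmetric"] = [2, -1, 0, -1, 2]  # mirrored around the middle of x

# Pairwise correlation (Pearson by default) between all columns
corr = df.corr()

print(corr["x"])
```

The `symmetric` column clearly depends on `x`, yet its correlation with `x` is zero, which is why a correlation of 0 only rules out a linear relationship.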

**groupby() method**

We will end this Introduction to Pandas by taking a look at the .groupby() method. The .groupby() method allows us to group data in different ways. Let's see how we can group data to get different types of information. For the next examples, we are going to load fake data about a fictitious company.

In [None]:
# We load fake Company data in a DataFrame
data = pd.read_csv('/content/fake-company.csv')

data

Unnamed: 0,Year,Name,Department,Age,Salary
0,1990,Alice,HR,25,50000
1,1990,Bob,RD,30,48000
2,1990,Charlie,Admin,45,55000
3,1991,Dakota,HR,26,52000
4,1991,Elsa,RD,31,50000
5,1991,Frank,Admin,46,60000
6,1992,Grace,Admin,27,60000
7,1992,Hoffman,RD,32,52000
8,1992,Inaar,Admin,28,62000


We see that the data contains information for the years 1990 through 1992. For each year we see the names of the employees, the department they work for, their age, and their annual salary. Now, let's use the .groupby() method to get information.

Example 10. Demonstrate groupby() and sum() method
Let's calculate how much money the company spent on salaries each year. To do this, we will group the data by Year using the .groupby() method and then we will add up the salaries of all the employees by using the .sum() method.

In [None]:
# We display the total amount of money spent in salaries each year
data.groupby(['Year'])['Salary'].sum()

Year
1990    153000
1991    162000
1992    174000
Name: Salary, dtype: int64

Now, let's suppose we want to know the average salary for each year. In this case, we will group the data by Year using the .groupby() method, just as we did before, and then use the .mean() method to get the average salary. Let's see how this works:



In [None]:
# We display the average salary per year
data.groupby(['Year'])['Salary'].mean()

Year
1990    51000
1991    54000
1992    58000
Name: Salary, dtype: int64
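
The total and the average can also be computed in one pass by handing several function names to .agg() after grouping. The sketch below rebuilds the fake company data by hand from the table above, so it runs without the CSV file:

```python
import pandas as pd

# The fake company data, retyped from the table above
data = pd.DataFrame(
    {
        "Year": [1990, 1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992],
        "Name": ["Alice", "Bob", "Charlie", "Dakota", "Elsa", "Frank",
                 "Grace", "Hoffman", "Inaar"],
        "Department": ["HR", "RD", "Admin", "HR", "RD", "Admin",
                       "Admin", "RD", "Admin"],
        "Age": [25, 30, 45, 26, 31, 46, 27, 32, 28],
        "Salary": [50000, 48000, 55000, 52000, 50000, 60000,
                   60000, 52000, 62000],
    }
)

# One row per year, one column per aggregation
summary = data.groupby("Year")["Salary"].agg(["sum", "mean", "max"])

print(summary)
```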

Now let's see how much each employee was paid over those three years. In this case, we will group the data by Name using the .groupby() method and then add up each employee's salaries across the years. Since each employee appears only once in this dataset, each total equals a single year's salary. Let's see the result:

In [None]:
# We display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()

Name
Alice      50000
Bob        48000
Charlie    55000
Dakota     52000
Elsa       50000
Frank      60000
Grace      60000
Hoffman    52000
Inaar      62000
Name: Salary, dtype: int64

Now let's see how the salaries were distributed per department per year. In this case, we will group the data by Year and by Department using the .groupby() method and then add up the salaries for each department. Let's see the result:

In [None]:
# We display the salary distribution per department per year.
data.groupby(['Year', 'Department'])['Salary'].sum()

Year  Department
1990  Admin          55000
      HR             50000
      RD             48000
1991  Admin          60000
      HR             52000
      RD             50000
1992  Admin         122000
      RD             52000
Name: Salary, dtype: int64

We see that in 1990 the Admin department paid 55,000 dollars in salaries, the HR department paid 50,000, and the RD department paid 48,000, while in 1992 the Admin department paid 122,000 dollars in salaries and the RD department paid 52,000.
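
A two-level grouped result like this can be easier to compare as a table with one row per year and one column per department. One possible way to do that is .unstack(), sketched below on the relevant columns retyped from the table above. HR has no entry for 1992 in this data, so `fill_value=0` is used here to show 0 instead of NaN; that choice is an assumption about how you want missing groups displayed.

```python
import pandas as pd

# Year, Department and Salary columns retyped from the fake company table
data = pd.DataFrame(
    {
        "Year": [1990, 1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992],
        "Department": ["HR", "RD", "Admin", "HR", "RD", "Admin",
                       "Admin", "RD", "Admin"],
        "Salary": [50000, 48000, 55000, 52000, 50000, 60000,
                   60000, 52000, 62000],
    }
)

# Total salary per (Year, Department) pair, as in the example above
per_dept = data.groupby(["Year", "Department"])["Salary"].sum()

# Move the Department level into columns; missing pairs become 0
table = per_dept.unstack(fill_value=0)

print(table)
```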

https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html


LINK: https://github.com/udacity/AIPND