# A Practical Introduction to Python for Data Analysis

![image.png](attachment:image.png)


#### Lewys Brace
l.brace@exeter.ac.uk

# 1. Background Knowledge

### What is Python?

> Python is a high-level, interpreted, general-purpose programming language with a "batteries included" philosophy.

What does this mean? There are four points to focus on in this description:
1. General-purpose programming language
2. High-level
3. Interpreted
4. Batteries included philosophy

### 1. General-purpose programming languages
- A general-purpose programming language is one that is designed to be used in the widest possible variety of contexts.

- This means that general-purpose programming languages do not include language constructs that are designed to be used within a specific application domain.

- This is in contrast to more specialised languages, such as R, whose constructs are designed around specific tasks (in R's case, statistical computing).

### 2. High-level language
- High-level programming languages are called such because they have a syntax that is closer to human languages than machine languages. More on this below.

- These high-level languages allow programmers to write scripts that are more or less independent of a particular type of computer.

### 3. Interpreted vs. compiled languages
- Python is an interpreted language.
- This means that your code is not run directly by the hardware. It is instead passed to a virtual machine, which is just another program that reads and interprets your code. If your code uses the ‘+’ operation, this is recognised by the interpreter at run time, which then calls its own internal function ‘add(a,b)’, which in turn executes the machine code ‘ADD’.
- This is in contrast to compiled languages, such as C, where your code is translated into native machine instructions, which are then directly executed by the hardware. Here, the ‘+’ in your code would be translated directly into the ‘ADD’ machine code.

### 4. Batteries included philosophy
- This refers to the large collection of ready-made code that comes with Python to carry out specific tasks, which a programmer can import and use rather than having to write code for every single task.
- These are called modules.
- In Python, these modules are loaded into your Python script by using the import statement.
- There are modules for most tasks. For example:

In [None]:
import antigravity

### Advantages of Python

Because Python is a high-level, interpreted language, it has a number of advantages:
- Automatic memory management.
- Expressivity and a syntax that is close to English.
- Ease of programming.
- Minimises development time.
- Python also has a focus on importing modules, a feature that makes it useful for scientific computing.

### Disadvantages of Python

However, Python does also have a number of disadvantages:
- Interpreted languages are generally slower than compiled languages.
- The modules that you import are developed in a decentralised manner; different modules may rest on different assumptions and conventions, which can cause inconsistencies.
- Multi-threading can be difficult in Python.

### So which language is best?

![image.png](attachment:image.png)

- No one language is better than all others. 
- The ‘best’ language depends on the task you are using it for and your personal preference.

### Versions of Python
- There are two major versions of Python that you may encounter: Python 2 and Python 3.
- Python 3 is not backward compatible with Python 2.
- A lot of the imported modules were only available in Python 2 for quite some time, leading to a slow adoption of Python 3. However, this is not really an issue anymore.
- Official support for Python 2 ended on 1 January 2020, so new work should use Python 3.

# 2. Variables

- Variable names in Python can contain letters, digits, and underscores, but cannot begin with a digit.
- By convention, it is common to have variable names that start with lower case letters and have class names beginning with a capital letter.
- Some keywords are reserved and cannot be used as variable names because they have a built-in meaning in Python; e.g. and, continue, break. Your IDE will usually let you know if you try to use one of these.
- Python is dynamically typed, meaning that the type of the variable is derived from the value it is assigned.

### Variable types

Python has a number of in-built variable types:
- Integer (int)
- Float (float)
- String (str)
- Boolean (bool)
- Complex (complex)
[…]
- User defined (classes)

A variable is assigned using the = operator.

The print() function is used to print something to the screen.

In [None]:
int_var = 5
float_var = 3.2
string_var = "Food"
print(int_var)
print(float_var)
print(string_var)

The print() used above is a built-in function: a named, reusable piece of code that carries out a specific task. It is not a reserved keyword. Your editor will typically display function names in a different colour to the rest of your syntax.

It is important to note that in regards to string variables, such as string_var above, Python will interpret everything contained within "" literally. In other words, strings are case- and whitespace-sensitive. For example:

"Food" is not equal to "food"

"this is a string" is not equal to "this_is_a_string" or "thisisastring"

You can always check the type of a variable using the type() function:

In [None]:
print(type(int_var))

You can also embed functions within one another in order to produce specific output, like we did here to print the type of a variable to the screen.

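Variables can be cast to different types. Note that casting a float to an integer with int() drops the fractional part rather than rounding to the nearest whole number.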

In [None]:
share_of_rent = 295.50/2
print("1: ", share_of_rent)
print("2: ", type(share_of_rent))
rounded_share = int(share_of_rent)
print("3: ", rounded_share)
print("4: ", type(rounded_share))

You can feed multiple arguments into some functions, such as with the print() function here.
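As a small sketch of casting, the block below contrasts int(), which truncates, with the built-in round(), which rounds to the nearest integer. The variable names truncated, rounded, as_text and from_text are illustrative only and do not appear elsewhere in this tutorial.

```python
share_of_rent = 295.50 / 2          # 147.75

truncated = int(share_of_rent)      # int() drops the fractional part
rounded = round(share_of_rent)      # round() goes to the nearest integer

as_text = str(truncated)            # cast an integer to a string
from_text = float("3.2")            # cast a string to a float

print(truncated, rounded)
print(as_text, type(as_text))
print(from_text, type(from_text))
```
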

# 3. Operators

### Arithmetic operators

Python uses the usual arithmetic operators:
Addition: +, subtraction: -, multiplication: * , division: /, power: **, etc.

In [None]:
x = 10
y = 2
z = x/y
print(z)

### The increment operator

Python has a common idiom that is not necessary, but which is frequently used and useful:

x += 1 is the same as x = x + 1

This also works for other operators:

	x += y 		 adds y to the value of x 
	x *= y 		 multiplies x by the value y 
	x -= y 		 subtracts y from x 
	x /= y 		 divides x by y
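
The idiom above can be checked with a short sketch; the starting value and the numbers used are arbitrary. Note that / always produces a float in Python 3, so the final value is 6.0 rather than 6.

```python
x = 10
x += 2     # x is now 12
x *= 3     # x is now 36
x -= 6     # x is now 30
x /= 5     # x is now 6.0 (division always returns a float)
print(x, type(x))
```
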
    
### Boolean operators

Boolean operators are useful when making conditional statements, which we will cover in depth later. The three Boolean operators are:

- and
- or
- not
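
A minimal sketch of the three Boolean operators follows. The variables is_raining, have_umbrella and age are made up for illustration.

```python
is_raining = True
have_umbrella = False

both = is_raining and have_umbrella     # True only if both are True
either = is_raining or have_umbrella    # True if at least one is True
dry = not is_raining                    # flips True to False and vice versa
print(both, either, dry)

# Boolean operators are often combined with comparisons.
age = 25
working_age = age >= 18 and age < 65
print(working_age)
```
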

### Comparison operators

Python also uses all of the standard comparison operators: greater than: > , less than: < , greater than or equal to: >= , less than or equal to: <= , is equal to: == , is not equal to: != .


In [None]:
int_var = 5
float_var = 3.2

if int_var > float_var:
    print("Int_var is larger than float_var")

if int_var == 5:
    print("Int_var is equal to 5")

Do note the difference between the 'greater than' and 'greater than or equal to' operators, and their 'less than' counterparts:

In [None]:
int_var = 5
int_var2 = 5

if int_var > int_var2:
    print("Int_var is larger than int_var2")

if int_var >= int_var2:
    print("Int_var is greater than or equal to int_var2")

You'll notice that the first if statement did not print anything out. This is because its condition was not met. We'll discuss if statements more later, but for the time being, they essentially state "If what follows is true, then do something".

# 4. Indexing

Indices allow you to access specific elements in a sequence, where a sequence can be strings, lists (discussed below), etc, by their location.

Indexing is an important thing to practice as it is quite easy to introduce an error into your programme when using it.

It is crucial to remember that indexing in Python is 0-based, meaning that the first element in a string, list, array, etc, has an index of 0. The second element then has an index of 1, and so on.

We're going to introduce ourselves to indexing by using it on a string to begin with. We'll use it in other contexts later on.

We use indexing by specifying the sequence we want to access and then adding some square brackets at the end. The integer we put in these square brackets is the index location that we wish to access.

Python indexing starts at zero, so if we want the first element of a string:

In [None]:
test_string = "Dogs are better than cats"
print("First element: ", test_string[0])

If we want the second element:

In [None]:
print("second element: ", test_string[1])

You can also index backwards from the end of a sequence by placing - in front of your index number. However, do note that reverse indexing starts at -1 rather than 0. So, to get the last element of a given sequence, you would use:

In [None]:
print("Last element: ", test_string[-1])

To get the second-to-last element you would use -2, and so on:

In [None]:
print("Second-to-last element: ", test_string[-2])

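You can also gain access to the middle section(s) of a sequence by including two numbers separated by a : within the [] indexing. This is called slicing, in Python terminology. The slice starts at the first index and stops just before the second, so the element at the second index is not included: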

In [None]:
print("Middle elements: ", test_string[6:10])

If you include just one number and a : within the [], you'll get everything from that index onwards (number before the :) or everything up to, but not including, that index (number after the :):

In [None]:
print("All elements from location 5 onwards: ", test_string[5:])
print("All elements before location 5: ", test_string[:5])
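
The slicing rules can be sketched on the same string. The particular index values are chosen only to pick out whole words; the variable names first_word, second_word and last_word are illustrative.

```python
test_string = "Dogs are better than cats"

first_word = test_string[:4]      # indices 0, 1, 2, 3
second_word = test_string[5:8]    # indices 5, 6, 7 (8 is excluded)
last_word = test_string[-4:]      # the final four characters

print(first_word)
print(second_word)
print(last_word)
print(len(test_string))           # number of characters in the string
```
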

# 5. Commenting out code

In Python, if you wish to "comment out" a piece of code, i.e. write some code or text that you do not want Python to run, you can just place a # in front of it. Python will not run anything that appears on a line after a #. For example:

In [None]:
print("This code piece will run") #print("This will not")

This is useful if you want to leave notes to yourself within your code, stating how it works; something that I, and anyone who has ever coded, recommend you do.

# 6. Data containers

Python has a number of data containers. You can think of them as tupperware for your data. However, each container is slightly different, and the one you will want to use will depend on the task at hand.

### Dictionaries

Python dictionaries are the same, in principle, as language dictionaries, in the sense that you have a term you want to look up in order to get a definition. In Python dictionaries, the thing you use to look up is called a key; keys are often strings or numbers. The equivalent of the definition is called a value. In short, Python dictionaries are essentially a collection of key-value pairs.

You create a dictionary by first typing a variable name, which will be the name of the dictionary. You then add your key-value pairs between a set of {}. Here, we are going to create a dictionary of the prices of some of my favourite food items:

In [None]:
prices = {"eggs": 2.30, "steak": 13.50, "bacon": 2.30, "beer": 14.95}
print(prices)

We can check the type of our data container:

In [None]:
print(type(prices))

We can then fetch the price of an item (the value) by using the name of the item (the key). We do this by typing the name of the dictionary and feeding in our key as a string value between a set of [], just like with standard indexing:

In [None]:
print("The price of steak is: ", prices['steak'])
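
Dictionaries can also be changed after they are created. The sketch below uses the same prices dictionary; the extra items "bread" and "caviar" are invented for illustration.

```python
prices = {"eggs": 2.30, "steak": 13.50, "bacon": 2.30, "beer": 14.95}

prices["bread"] = 1.20     # a new key adds a new key-value pair
prices["eggs"] = 2.50      # an existing key has its value replaced

# Check whether a key exists before looking it up.
print("steak" in prices)
print("caviar" in prices)

# .get() returns a fallback value instead of raising an error
# when the key is missing.
print(prices.get("caviar", "not stocked"))

# Loop over every key-value pair.
for item, price in prices.items():
    print(item, price)
```
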

### Tuples

Tuples are containers that are immutable. This means that once a tuple has been created, its contents cannot be changed. This makes them useful if you have a series of data that you wish to use, but don't wish to accidentally overwrite; e.g. seed numbers for a computer simulation. A tuple can hold elements of different types; e.g. a single tuple can contain both integers and strings.

You create a tuple using parentheses:

In [None]:
tuple1 = (5, 10)
print("Tuple1 contents: ", tuple1)
print("container type: ", type(tuple1))
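
As a brief sketch of what immutability means in practice, the block below tries to overwrite an element of a tuple and catches the resulting error. The variable names are illustrative.

```python
tuple1 = (5, 10)

# Reading from a tuple works just like indexing a string.
first = tuple1[0]

# Trying to change an element raises a TypeError.
try:
    tuple1[0] = 99
    changed = True
except TypeError as error:
    changed = False
    print("Could not change the tuple:", error)

# A tuple can be unpacked into separate variables.
low, high = tuple1
print(first, low, high, changed)
```
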

### Lists

Python lists are probably the container that you will use most frequently. Lists can contain elements of different types; lists can even contain other lists! The key difference between tuples and lists is that lists are mutable whereas tuples are not; tuples can be marginally more efficient, but lists are much more versatile. NOTE: lists are denoted by [] and not the () used by tuples.

In [None]:
number_list = [1, 2, 3]
print("List contents: ", number_list)
print("Type of container: ", type(number_list))
second_list = [1, number_list, "Hello"]
print("List 2 contents: ", second_list)
print("Type of container 2: ", type(second_list))

Because lists are mutable, you can alter their contents. There are numerous ways to add, amend, and remove elements in an existing list.

First, you can change an element of a list using indexing:

In [None]:
numbers = [1, 2, 3]
print("Contents of numbers list", numbers)
numbers[1] = 5
print("Contents of numbers list now", numbers)

If you wish to add an element to a specific location in a list, but do not wish to overwrite an element that is already in the list like we did above, you can use the .insert() function.

Note: The functions we've encountered before, such as print() and type(), are called by typing their name and feeding your arguments into the parentheses. The .insert() function is a little different. Here, you first type the name of the container you want to alter, then ".insert", and then feed your arguments into the brackets. Functions that are called on an object in this way are known as methods; they belong to the object's type, which in this case is list.

In [None]:
numbers = [1, 2, 3]
print("Contents of numbers list", numbers)
#We'll now insert the string "Surprise!" into location 2, which 
#will move the elements from that position onwards 'up' by one.
numbers.insert(2, "Surprise!")
print("Contents of numbers list again", numbers)

The .append() function will add your new element to the end of the list.

In [None]:
numbers = [1, 2, 3]
print("Contents of numbers list", numbers)
numbers.append(4)
print("Contents of numbers list again", numbers)
numbers.append(6)
print("Contents of numbers list again again", numbers)

Equally, you are likely to want to remove an item from a list, and there are multiple ways in which to do this as well.

First, you can use the .remove() function:

In [None]:
numbers = [1, 2, 3, 3]
print("Contents of numbers list", numbers)
numbers.remove(3)
print("Contents of numbers list again", numbers)

However, as you can see above, the .remove() function only removes the first occurrence of the value you feed into it. This can lead to confusion. If you know the position of the element you wish to remove, you can instead use the del statement, which uses indexing. Here, we encounter yet another piece of syntax: we type del, then the variable name, and then the index of the element we wish to remove:

In [None]:
numbers = [1, 2, 3, 4]
print("Contents of numbers list", numbers)
#Remove the second element of a list, remember 
#that indexing in Python starts with 0
del numbers[1]
print("Contents of numbers list again", numbers)

# 7. Loops

### For loops

The for loop is used to iterate over elements in a sequence, and is often used when you have a piece of code that you want to repeat a number of times.

For loops essentially say:
    
> "For all elements in a sequence, do something"

We have a list of species:

In [None]:
species = ['dog', 'cat', 'shark', 'falcon', 'deer', 'tyrannosaurus rex']
print(species)

We are now going to write a for loop that iterates through each of the elements in our species list above, and prints each element to the screen in turn. Note: the name i is quite arbitrary. You could just as easily replace it with ‘animal’, ‘t’, or anything else.

In [None]:
for i in species:
    print(i)

The syntax above contains multiple parts. The "for" designates that we're using a for loop. The i is the loop variable, a placeholder that takes the value of each element in turn. The use of "i" here is quite arbitrary; you could in fact use the word "biscuit" and it would still work. We then specify the sequence we are going to iterate through. Finally, we tell Python what we want it to do for each of the elements. In this case, print it to the screen.

As another example, we can use for loops to carry out other operations. For example:

In [None]:
numbers = [1, 20, 18, 5, 15, 160]
total = 0
#below, we'll use "value" instead of "i"
for value in numbers:
    total += value
print(total)

### The range() function

The range() function generates a sequence of integers, which can be iterated over within for loops. It can take a number of parameters: range([start], stop, [step])

Notes: 
- Above, [] denotes an optional argument.
- All parameters must be integers.
- The range() function (and Python in general) is 0-index based, meaning that counting starts at 0, not 1; e.g. the syntax to access the first element of a list is mylist[0]. By default range() therefore starts at 0, and the last integer it generates is up to, but not including, the stop number.

Our first example only has the stop argument. This generates a number of integers, starting from zero and up to the number specified minus 1.

In [None]:
for i in range(5):
    print(i)

Second example: Start at a given number and stop at the other given number minus 1:

In [None]:
for i in range(3, 6):
    print(i)

Third example: Start at a given number, stop at a given number minus 1, and increase by a certain number each time:

In [None]:
for i in range(4, 10, 2):
    print(i)
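
To see the numbers that each call produces without writing a loop, one option is to wrap range() in list(). This is only a sketch; the last example, which counts downwards with a negative step, is an extra illustration not covered above.

```python
only_stop = list(range(5))            # start defaults to 0
start_and_stop = list(range(3, 6))    # stop value is excluded
with_step = list(range(4, 10, 2))     # increase by 2 each time
counting_down = list(range(10, 0, -3))  # a negative step counts backwards

print(only_stop)
print(start_and_stop)
print(with_step)
print(counting_down)
```
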

### The break statement

To terminate a loop, you can use the break statement.

The break statement breaks out of the innermost enclosing for loop or while loop.

In [None]:
for i in range(1, 10):
    if i == 3:
        break
    print(i)

### While loops

The while loop tells the computer to do something as long as a specific condition is met.

It essentially says:
> "while this is true, do this".

- When working with while loops, it's important to remember how the comparison operators behave; e.g. using < rather than <= changes the number of times the loop runs.
- While loops use the break and continue statements in the same way as a for loop does.

The code below has a counter variable, "i". We then write the "while" to denote that we're using a while loop. The code below then says "while i is less than 3, print the animal name at the current index location, denoted by i, and then add 1 to i". This naturally means that once i = 3, the while loop will terminate.

In [None]:
species = ['dog', 'cat', 'shark', 'falcon', 'deer', 'tyrannosaurus rex']
i = 0
while i < 3:
    print(species[i])
    i += 1

Be aware that it is very easy to create a while loop whose condition never becomes false, meaning that the loop never terminates. This is called an infinite loop. An example is:


In [None]:
#counter = 0
#while counter <= 100:
#    print(counter)
#    counter + 99

This loop will never terminate and will print 0 to the screen indefinitely. Can you work out why this is the case? The expression counter + 99 calculates a value but never assigns it back to counter, so counter is always 0 and the condition counter <= 100 is always true.
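
One possible fix for the loop above is to use += so that the result is stored back in counter. The sketch below also collects the printed values in a list, here called printed_values purely for illustration, so the behaviour can be inspected afterwards.

```python
counter = 0
printed_values = []
while counter <= 100:
    print(counter)
    printed_values.append(counter)
    counter += 99   # the result is assigned back to counter

# The loop runs for counter = 0 and counter = 99, then stops
# because 198 is greater than 100.
print(printed_values)
```
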

### For loops vs. while loops

- You will use for loops more often than while loops.
- The for loop is the natural choice for cycling through a list, characters in a string, etc; basically, anything of determinate size.
- The while loop is the natural choice if you are cycling through something, such as a sequence of numbers, an indeterminate number of times until some condition is met.

### Nested loops

In some situations, you may want a loop within a loop. This is known as a nested loop. The code below will first calculate the first value of x multiplied by the first value of y; 1\*1=1. Then it will do the same for the second value of y, but leave x as is; i.e. 1\*2 = 2. It will carry on like this until y reaches 10. At which point, it will move to the second value of x and multiply it by the first value of y; 2\*1=2. And so on until all possible combinations of x and y have been calculated.

In [None]:
for x in range(1, 11):
    for y in range(1, 11):
        print("%d * %d = %d" %(x, y, x*y))

The line: print("%d * %d = %d" %(x, y, x*y)) uses %-style string formatting. Here, each %d tells Python that we want to insert an integer at this location. Python fills the placeholders from left to right using the values in the brackets following the %; the first %d receives the first value, the second %d the second value, and so on.
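
For comparison, the same string can be built in two other common ways: the .format() method and f-strings (available from Python 3.6). This is a sketch with arbitrary values for x and y; all three approaches give the same text.

```python
x = 3
y = 4

percent_style = "%d * %d = %d" % (x, y, x * y)
format_style = "{} * {} = {}".format(x, y, x * y)
f_string_style = f"{x} * {y} = {x * y}"

print(percent_style)
print(format_style)
print(f_string_style)
```
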

# 8. Conditionals

There are three main conditional statements in Python: if, elif, and else.

We have already used if when looking at comparison operators and the break statement.

Conditionals essentially say:
    
> "If this condition(s) is met, do this"

The code below has a variable that states whether or not I have to go to work tomorrow. This is a boolean variable; i.e. it is either True or False. The "if" then says "If I have work tomorrow, then print that I cannot have beer tonight". The "else" clause then says "If the "if" statement is not true, print that I can have beer tonight".

In [None]:
work_night = True
if work_night == True:
    print("No beer")
else:
    print("You may have beer")

By switching the value of the boolean variable to False, we get a happier result:

In [None]:
work_night = False
if work_night == True:
    print("No beer")
else:
    print("You may have beer")

You can also use elif in order to have a complex conditional statement. Here is a, somewhat unhappy, example:

In [None]:
Lewys_has_teach = False
Lewys_is_hungry = True
if Lewys_has_teach == True:
    print("Lewys has to teach, so no time for food")
elif Lewys_is_hungry is True:
    print("No food for Lewys")
else:
    print("Go on, have a biscuit")
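
Conditions do not have to be boolean variables; any comparison will do. The sketch below uses a made-up temperature variable to show that Python checks each condition in order and runs only the first branch that is true. It also shows that a boolean variable can be tested directly, without writing == True.

```python
temperature = 10

if temperature < 0:
    description = "freezing"
elif temperature < 15:
    description = "cold"
else:
    description = "mild or warmer"
print(description)

# A boolean can be used as the condition on its own.
work_night = False
if work_night:
    message = "No beer"
else:
    message = "You may have beer"
print(message)
```
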

# 9. Functions

A function is a block of code which only runs when it is called.

They are really useful if you have operations that need to be done repeatedly; i.e. calculations.

The function must be defined before it is called. In other words, the block of code that makes up the function must come before the block of code that makes use of the function.

In [None]:
#We first define our function, which here just multiplies two
#numbers together. This function takes two variables as inputs;
#a and b. It then assigns the result of the multiplication to
#the answer variable and hands it back to the caller using the
#"return" statement.
def practice_function(a, b):
    answer = a * b
    return answer
#Create two variables to test our function.
x = 5
y = 4
#"calculated" is the result of multiplying the two numbers together.
#The function then takes our two variables (note that the two
#variable names fed into the function call do not need to match
#those in the function definition, but they are fed in in sequential
#order); i.e. x becomes a and y becomes b. The "answer" that is
#returned is then assigned to "calculated".
calculated = practice_function(x, y)
print(calculated)

### Multiple returns

You can have a function return multiple outputs.

In [None]:
def practice_function(a, b):
    answer = a * b
    answer2 = answer * answer
    return answer, answer2

x = 5
y = 4

calculated, calculated2 = practice_function(x, y)
print("First output: ", calculated)
print("Second output: ", calculated2)
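
Under the hood, a function that returns several values is returning a single tuple, which can either be kept whole or unpacked into separate variables. The function square_and_cube below is a hypothetical example written for this sketch.

```python
def square_and_cube(number):
    # The comma-separated values are packed into one tuple.
    return number ** 2, number ** 3

# Keep the result as a single tuple...
result = square_and_cube(3)
print(result, type(result))

# ...or unpack it into one variable per returned value.
squared, cubed = square_and_cube(3)
print(squared, cubed)
```
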

# 10. Reading and writing to files in Python: The file object

File handling in Python can easily be done with the built-in file object.

The file object provides all of the basic functions necessary in order to manipulate files.

### The open() function

Before you can work with a file, you first have to open it using Python’s in-built open() function.

The open() function takes two arguments; the name of the file that you wish to use and the mode for which we would like to open the file. There are a number of different file opening modes. The most common are: ‘r’= read, ‘w’=write, ‘r+’=both reading and writing, ‘a’=appending.

In [None]:
#working_file = open("training_file.txt", 'r')

The open() function takes two arguments. The first is the file name, and you have to specify the file type here. In this example, we are reading in a text file, so we specify .txt at the end of the file name.

The second argument specifies what mode you want to open the file in, which is related to what you want to do with the file. By default, the open() function opens a file in ‘read mode’; this is what the ‘r’ above signifies.

Do note that, if the file you want to read in is not in your current working directory, then you will need to include the full file path to the file in the string that contains your file name, e.g.:

In [None]:
#working_file = open("C:/username/Documents/data_folder/training_file.txt", 'r')

### The close() function

Likewise, once you’re done working with a file, you can close it with the .close() function.

Using this function will free up any system resources that are being used up by having the file open and will limit the chances of you accidentally deleting the data contained therein.


In [None]:
#working_file.close()

### Reading in a file and printing to screen example

There are numerous ways in which to load in the contents of a file.

One way is to iterate through every line in a file by using a for loop.

In [None]:
working_file = open("training_file.txt", 'r')
for line in working_file:
    print(line)
working_file.close()
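
A common alternative to calling .close() yourself is the with statement, which closes the file automatically when the indented block ends. The sketch below first creates a small throwaway file in a temporary folder so that it can run anywhere; the file name example_file.txt and its contents are assumptions made for this example, not the training_file.txt used above.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example_file.txt")
with open(path, 'w') as out_file:
    out_file.write("first line\nsecond line\n")

# Read the file line by line; each line ends with a newline
# character, which .rstrip("\n") removes.
lines = []
with open(path, 'r') as working_file:
    for line in working_file:
        lines.append(line.rstrip("\n"))

print(lines)
print(working_file.closed)   # the file was closed automatically
```
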

### The read() functions

Python's file objects also have three in-built functions for reading in file contents.

The first is .read(), which returns the entire contents of the file as a single string.

In [None]:
working_file = open("training_file.txt", 'r')
print(working_file.read())
working_file.close()

The second is .readline(), which returns one line at a time.

In [None]:
working_file = open("training_file.txt", 'r')
print(working_file.readline())
working_file.close()

The third is .readlines(), which returns a list of lines.

In [None]:
working_file = open("training_file.txt", 'r')
print(working_file.readlines())
working_file.close()

### The write() function

Likewise, there are two similar in-built functions for getting Python to write to a file.

The first is .write(), which writes a specified sequence of characters to a file. Do note that, when loading this file in, we want to write to it so we replace the 'r' with a 'w'.

In [None]:
working_file = open("training_file.txt", 'w')
working_file.write("Add this line of text to the file")
working_file.close()

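The second is .writelines(), which writes a list of strings to a file. Note that .writelines() does not add line breaks between the strings; include "\n" at the end of each string if you want each one on its own line.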

In [None]:
test_list = ["text piece 1", "text piece 2", "text piece 3"]
working_file = open("training_file.txt", 'w')
working_file.writelines(test_list)
working_file.close()

Important: Opening a file in 'w' mode and then using the write() or writelines() function will overwrite anything contained within a file, if a file of the same name already exists in the working directory.

### Writing to a file without overwriting contents

If you do not want to overwrite a file’s contents, you can open the file in append mode.

To append to an existing file, simply put ‘a’ instead of ‘r’ or ‘w’ in the open() call when opening a file.

In [None]:
working_file = open("training_file.txt", 'a')
working_file.write("Add this line of text to the file without overwriting")
working_file.close()
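
The difference between the 'w' and 'a' modes can be sketched as follows. The example writes to a throwaway file in a temporary folder (the name example_file.txt is an assumption for this sketch) so that no real data is overwritten.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example_file.txt")

# 'w' creates the file, or empties it if it already exists.
working_file = open(path, 'w')
working_file.write("original text\n")
working_file.close()

# Opening in 'w' mode again discards the original text.
working_file = open(path, 'w')
working_file.write("replacement text\n")
working_file.close()

# 'a' keeps the existing contents and adds to the end.
working_file = open(path, 'a')
working_file.writelines(["appended 1\n", "appended 2\n"])
working_file.close()

working_file = open(path, 'r')
contents = working_file.read()
working_file.close()
print(contents)
```
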

# 11. A word on import

As stated at the start, Python has a "batteries included" philosophy, meaning that you do not need to create code for every single task you wish to carry out. Instead, you can import pre-built pieces of code, known as modules or packages, to carry out such a task. This is one of the features of Python that make it so popular.

To use a module in your code, you must first make it accessible by importing it.

The below, for example, uses the datetime module in order to get the current date and time from your computer's clock.

In [None]:
import datetime
current_time = datetime.datetime.now()
print(current_time)
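
There are a few common forms of the import statement. The sketch below shows three of them using the same datetime module; the alias dt and the chosen date are arbitrary.

```python
# 1. Import the whole module and use its full name.
import datetime
new_year = datetime.date(2020, 1, 1)

# 2. Import the module under a shorter alias.
import datetime as dt
same_day = dt.date(2020, 1, 1)

# 3. Import one specific name from the module.
from datetime import date
also_same_day = date(2020, 1, 1)

print(new_year)
print(new_year == same_day == also_same_day)
```
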

# 12. Introduction to plotting in Python

Before creating any plots, it is first worth spending some time familiarising ourselves with the matplotlib module.

### Some history

Matplotlib was originally developed by a neurobiologist in order to emulate aspects of the MATLAB software.

The pythonic concept of importing is not utilised by MATLAB, and this is why something called Pylab exists.

Pylab is a module within the Matplotlib library that was built to mimic the MATLAB style. It only exists in order to bring aspects of NumPy and Matplotlib into the namespace, thus making for an easier transition for ex-MATLAB users, because they only had to do one import in order to access the necessary functions:

> from pylab import *

However, using the above command is now considered bad practice, and Matplotlib actually advises against using it because it creates many opportunities for name-conflict bugs.

### Getting started

Without Pylab, we can get away with just one canonical import; the top line from the example below.

We are also going to import NumPy, which we are going to use to generate random data for our examples.

     
        

In [None]:
import matplotlib.pyplot as plt
import numpy as np

### Our first plot

In [None]:
plt.plot([1,2,3,4])
plt.ylabel('Some numbers')
plt.xlabel("A meaningless axis")
plt.show()

You may be wondering why the x-axis ranges from 0-3 and the y-axis from 1-4.

If you provide a single list or array to the plot() command, Matplotlib assumes it is a sequence of y values, and automatically generates the x values for you. Since Python indexing starts at 0, the default x vector has the same length as y but starts with 0 and ends at N-1. Hence the x data are [0,1,2,3].

### The plot() function

The plot() function is quite versatile, and will take any arbitrary collection of numbers. For example, we can supply a list of x values as well as a list of y values:

In [None]:
plt.plot([1,2,3,4,5], [1,4,9,50,2])
plt.ylabel('Some numbers')
plt.xlabel("A meaningless axis")
plt.show()

The plot() function has an optional third argument that specifies the appearance of the data points.

If you omit it, Matplotlib draws a solid line, which is the blue line seen in the last two examples; the format string 'b-' asks for a blue solid line explicitly. The full list of styles can be found here:
https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html 

In the example below, we tell the .plot() command to produce a green data point that is a circle. This is what the "go" denotes.

In [None]:
plt.plot([1,2,3,4,5], [1,4,9,50,2], 'go')

Likewise, we could have red data points shaped as addition symbols.

In [None]:
plt.plot([1,2,3,4,5], [1,4,9,50,2], 'r+')

You can quite easily alter the properties of the line with the plot() function.

In [None]:
plt.plot([1,2,3,4,5], [1,4,9,50,2], '-', linewidth=5.0)
plt.axis([0,10,0,60])
plt.show()

### Altering tick labels

The plt.xticks() and plt.yticks() functions allow you to manually set the ticks on the x-axis and y-axis respectively.

Note that the tick values have to be contained within a list object.

In [None]:
plt.plot([1,2,3,4,5], [1,4,9,50,2], '-', linewidth=5.0)
plt.axis([0,10,0,60])
plt.xticks([0,5,10])
plt.yticks([0,25,50,60])
plt.show()

### The axis() function

The axis() function allows us to specify the range of both axes.

It requires a list that contains the following:

[The min x-axis value, the max x-axis value, the min y-axis, the max y-axis value]

In [None]:
plt.plot([1,2,3,4,5], [1,4,9,50,2], 'bo')
plt.axis([0,10,0,60])
plt.show()

### Matplotlib and NumPy arrays

Normally when working with numerical data, you’ll be using NumPy arrays, which we'll cover in detail later. For the time being, you just need to know that these work really well with Matplotlib. In fact, all sequences are converted into NumPy arrays internally by Matplotlib anyway.

In [None]:
t = np.arange(0., 5., 0.2)
plt.plot(t,t,'rx',t,t**2,'b*', t, t**3, 'go')
plt.show()

### Working with text

There are a number of different ways in which to add text to your graph:

- title() = Adds a title to your graph; takes a string as an argument.
- xlabel() = Adds a label to the x-axis; also takes a string as an argument.
- ylabel() = Same as xlabel(), but for the y-axis.
- text() = Can be used to add text to an arbitrary location on your graph. Requires the following arguments: text(x-axis location, y-axis location, the string of text to be added)

Note: Matplotlib uses TeX equation expressions. So, as an example, if you wanted to put the $\sigma$ symbol in one of the text blocks, you would write plt.title(r'$\sigma_i=15$').
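
As a rough illustration of the text functions listed above, the sketch below labels a simple curve. The data values, the label wording, and the position of the note are all made up for the example.

```python
import matplotlib.pyplot as plt

# Illustrative data only: x values and their squares.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, 'bo')
# A raw string (r'...') stops Python from interpreting the backslash in the TeX expression.
title = plt.title(r'An example with $\sigma_i=15$')
plt.xlabel('Dummy for x')
plt.ylabel('Dummy for y')
# text(x location, y location, string): the location is given in data coordinates.
note = plt.text(2, 20, 'y = x squared')
plt.show()
```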

### Annotating data points

The annotate() function allows you to easily annotate data points or a specific area on a graph.

In [None]:
t = np.arange(0., 5., 0.2)
lines = plt.plot(t, t, 'b-', t, t**2, 'r-', t, t**3, 'g-', linewidth=2.0)
plt.xlabel('Dummy for x')
plt.ylabel('Dummy for y')
plt.title('An example graph')
plt.annotate('Divergence point', xy=(1.4,3), xytext=(3,1.5),
            arrowprops=dict(facecolor='black', shrink=0.05,))
plt.setp(lines, 'color', 'r', 'linewidth', 2.0)
plt.show()

### Legend

The location of a legend is specified by the loc argument of plt.legend(). Each of the in-built locations has a number (and an equivalent string, such as 'upper left' for 2), and you change the location by changing this value. Full details can be found here: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html.

You can then use the bbox_to_anchor argument to place the legend manually or, when it is used together with loc, to make slight alterations to the placement.

In [None]:
t = np.arange(0., 5., 0.2)
lines = plt.plot(t,t,'b-',linewidth=2.0, label='Thing 1')
lines = plt.plot(t,t**2,'r-',linewidth=2.0, label='Thing 2')
lines = plt.plot(t,t**3,'g-',linewidth=2.0, label='Thing 3')
plt.xlabel('Dummy for X')
plt.ylabel('Dummy for Y')
plt.title('An example graph')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

### Saving a figure as a file

The plt.savefig() function allows you to save your plot as a file.

This function takes a string as an argument, which will be the name of the file. You must remember to state which file type you want the figure saved as; i.e. .png or .jpeg. If you want to save the image file to a folder that is not your current working directory, then you have to include the file path in the file name string, just like we did when we read in files earlier.

Make sure you call plt.savefig() before plt.show(). Otherwise, the saved file may be blank. Also call plt.close() after saving the file, so that later plotting commands do not draw onto the same figure and so that the memory used by the figure is released.

In [None]:
t = np.arange(0., 5., 0.2)
lines = plt.plot(t,t,'b-',linewidth=2.0, label='Thing 1')
lines = plt.plot(t,t**2,'r-',linewidth=2.0, label='Thing 2')
lines = plt.plot(t,t**3,'g-',linewidth=2.0, label='Thing 3')
plt.xlabel('Dummy for X')
plt.ylabel('Dummy for Y')
plt.title('An example graph')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.savefig('test.png')
plt.close()
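
The sketch below shows one way of saving a figure into a folder other than the current working directory. The temporary folder is only a stand-in for your own file path, and the dpi and bbox_inches arguments are optional extras that were not used above; bbox_inches='tight' is included here because the legend sits outside the plot area.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Hypothetical output folder: a temporary directory stands in for your own path.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'test.png')

plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25], 'b-', linewidth=2.0, label='Thing 1')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# dpi sets the resolution; bbox_inches='tight' asks Matplotlib to fit the
# saved image around everything that was drawn, including the legend.
plt.savefig(out_path, dpi=150, bbox_inches='tight')
plt.close()

print("Saved to:", out_path)
print("File exists:", os.path.exists(out_path))
```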

# 13. Debugging

![image.png](attachment:image.png)

Debugging is a fundamental aspect of coding, and you will probably spend more time debugging than actually writing code. It is important to remember that EVERYONE has to debug; it is nothing to be ashamed of.

In fact, you should be particularly concerned if you do write a programme that does not display any obvious errors when run, as it likely means that you are just unaware of some bugs.

There are a number of debugging programmes available to coders. However, debugging the most common issues that you’ll encounter when developing programmes can be done by following a few key principles.

### Print everything

When debugging, the most important tool at your disposal is the print() function. Every coder uses this as a debugging tool, regardless of their amount of experience.

You should have some sense as to what every line of code you have written does. If not, print those lines out. You will then be able to see how the values of variables are changing as the programme runs through.

Even if you think you know what each line does, it is still recommended that you print out certain lines as often this can aid you in realising errors that you may have overlooked.
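
A minimal sketch of print-based debugging is shown below. The running_total() function is made up for this example; the point is the temporary print() call inside the loop, which shows how the variables change on every pass.

```python
def running_total(values):
    """Add up a list of numbers, printing the state after each step."""
    total = 0
    for i, value in enumerate(values):
        total += value
        # Temporary debugging output: remove it once the code behaves as expected.
        print(f"step {i}: value={value}, total={total}")
    return total

result = running_total([3, 5, -2])
print("final result:", result)
```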

### Run your code when you make changes

Do not sit down and code for an hour or so without running the code you are writing. Chances are, you will never get to the bottom of all of the errors that your programme reports when it runs.

Instead, you should run your script every few minutes. It is not possible to run your code too many times.

Remember, the more code you write or edit between test runs, the more places you are going to have to go back and investigate when your code hits an error.

### Read your error messages

Do not be disheartened when you get an error message. More often than not, you’ll realise what the error is as soon as you read the message; i.e. the for loop doesn’t work on a list because the list is empty.

This is particularly the case with Python, which provides you with error messages in ‘clear English’ compared to the cryptic messages offered by other languages.

At the very least, the error message will let you know which line is experiencing the error. However, this may not be the line causing the error. Still, this offers a good starting point for your bug search.
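
To make this concrete, here is a small sketch with a made-up function, mean_of(). The error is reported on the line that does the division, but the real cause is the empty list passed in from elsewhere. The try/except block is only there so that the example can print the message instead of stopping.

```python
import traceback

def mean_of(values):
    return sum(values) / len(values)   # the error is reported on this line

try:
    mean_of([])                        # ...but the cause is the empty list passed in here
except ZeroDivisionError as err:
    error_type = type(err).__name__
    # The last line of a Python error message gives the error type and a description.
    print(f"{error_type}: {err}")
    # The lines above it list which lines of code were running when the error occurred.
    report = traceback.format_exc()
    print(report)
```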

### Google the error message

![image.png](attachment:image.png)

If you cannot work out the cause of an error message, google the error code and description.

This can sometimes be a bit hit-or-miss, depending on the nature of the error.

If your error is fairly specific, then there will nearly always be a webpage where someone has already asked for help with an error that is either identical or very similar to the one you are experiencing; stackoverflow.com is the most common page you’ll come across in this scenario.

Do make sure that you read the description of the problem carefully to ensure that the problem is the same as the one you are dealing with. Then read the first two or three replies to see if the page contains a workable solution.

### Comment out code

You can often comment out bits of code that are not related to the chunk of code that contains the error.

This will obviously make the code run faster and might make it easier to isolate the error.

### Binary searches

This method draws upon a lot of the methods we have already covered.

Here, you want to break the code into chunks; normally two chunks, hence this method’s name.

You then isolate which chunk of code the error is in.

After which, you take the chunk of code in question, and divide that up, and work out which of these new chunks contains the error.

And so on, until you’ve isolated the cause of the error.
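
The halving idea can be sketched in code. The example below is only an illustration under some assumptions: the programme is a list of made-up steps applied one after another, the starting value is valid, the final result is not, and once a value has gone wrong it stays wrong. The names run_steps, first_bad_step and is_valid are hypothetical.

```python
def run_steps(steps, value, n):
    """Apply only the first n steps to value."""
    for step in steps[:n]:
        value = step(value)
    return value

def first_bad_step(steps, start_value, is_valid):
    """Find the index of the step that first produces an invalid value."""
    # Invariant: the result after `low` steps is valid, after `high` steps it is not.
    low, high = 0, len(steps)
    while high - low > 1:
        mid = (low + high) // 2
        if is_valid(run_steps(steps, start_value, mid)):
            low = mid      # the problem is in the second half
        else:
            high = mid     # the problem is in the first half
    return high - 1

# Four made-up steps; the third one contains the "bug" that makes the value negative.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 100, lambda x: x * 3]
bad_index = first_bad_step(steps, 5, is_valid=lambda x: x >= 0)
print("The error is introduced by step", bad_index)
```

In practice you would usually do this by hand, with print statements or by commenting out chunks, rather than with a helper function; the sketch is just meant to show why halving the search area finds the culprit quickly.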

### Walk away

If you have been trying to fix an error for a prolonged period of time, 30 minutes or so, get up and walk away from the screen and do something else for a while.

Often the answer to your issue will present itself upon your return to the computer, as if by magic.

### Phrase your problem as a question

Many software developers have been trained to phrase their problem as a question.

The idea here is that phrasing your issue in this manner often helps you to realise the cause of the problem.

This often works!

### Ask someone

If all else fails, do not hesitate to ask a colleague or friend who is a coder, and who may be familiar with the language, for help.

They may not even need to be a specialist. Sometimes a fresh pair of eyes belonging to someone who is not invested in the project is more efficient at helping you work out your issue than spending hours trying to solve it on your own or getting lost on the internet trying to find a solution.

# 14. Numerical Python (NumPy)

NumPy is the most foundational package for numerical computing in Python.

If you are going to work on data analysis or machine learning projects, then having a solid understanding of NumPy is nearly mandatory.

Indeed, many other libraries, such as pandas and scikit-learn, use NumPy’s array objects as the lingua franca for data exchange.

One of the reasons as to why NumPy is so important for numerical computations is because it is designed to work efficiently with large arrays of data in a number of ways, including:

1. Storing data internally in a contiguous block of memory, independent of other in-built Python objects.
2. Performing complex computations on entire arrays without the need for for loops.
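
As a small sketch of the second point, the example below squares ten numbers twice: once with an explicit for loop over a list, and once with a single operation on a NumPy array. The numbers are arbitrary, and the import line is explained in the "Importing NumPy" section below.

```python
import numpy as np

values = list(range(10))

# Plain Python: square every element with an explicit loop.
squares_loop = []
for v in values:
    squares_loop.append(v ** 2)

# NumPy: the same operation applied to the whole array at once, with no loop.
arr = np.array(values)
squares_vectorised = arr ** 2

print(squares_loop)
print(squares_vectorised)
```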

### What you’ll find in NumPy

- ndarray: an efficient multidimensional array providing fast array-orientated arithmetic operations and flexible broadcasting capabilities.

- Mathematical functions for fast operations on entire arrays of data without having to write loops.

- Tools for reading/writing array data to disk and working with memory-mapped files.

- Linear algebra, random number generation, and Fourier transform capabilities.

- A C API for connecting NumPy with libraries written in C, C++, and FORTRAN. This is why Python is the language of choice for wrapping legacy codebases.

### Importing NumPy

We first need to import NumPy into our working environment, which we do using the import statement. You'll also notice that we add "as np" afterwards. This "as np" allows us to type "np" when we call a NumPy function instead of "numpy"; i.e. np.array instead of numpy.array. This is similar to how we imported Matplotlib's pyplot module above as "plt". In principle, you could use anything as the shorthand, i.e. "import numpy as nump", but "np" is the standard convention.

In [None]:
import numpy as np

### The NumPy ndarray: A multi-dimensional array object

The NumPy ndarray object is a fast and flexible container for large data sets in Python. Ndarrays are a bit like Python lists, but are still a very different beast at the same time.

### Ndarray vs. lists

By now, you are familiar with Python lists and how incredibly useful they are.

So, you may be asking yourself:

> "I can store numbers and other objects in a Python list and do all sorts of computations and manipulations through list comprehensions, for-loops etc. What do I need a NumPy array for?"

There are some very significant advantages of using NumPy arrays over lists when working with numerical data.

### Creating a NumPy array

To understand these advantages, let's create an array.

One of the most common, of the many, ways to create a NumPy array is to create one from a list by passing it to the np.array() function.

In [None]:
list1 = [0,1,2,3,4]
arr = np.array(list1)
print(type(arr))
print(arr)

### Differences between lists and ndarrays

The key difference between an array and a list is that arrays are designed to handle vectorised operations, while Python lists are not. That means that, if you apply a function to an array, it is performed on every item in the array, rather than on the whole array object.

It should also be noted that, once a NumPy array is created, you cannot increase its size. To do so, you will have to create a new array.

For example, adding 2 to an array adds 2 to every element:

In [None]:
list1 = [0,1,2,3,4]
arr = np.array(list1)
print("The array: ", arr)
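print("The array plus 2: ", arr + 2)
#The original array is unchanged, because arr + 2 creates a new array.
print("The array again: ", arr)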
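
The point about array size can be sketched with np.append(), which was not used above. It does not grow the existing array in place; it returns a new, longer array and leaves the original untouched.

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4])
# np.append returns a new array; arr itself keeps its original size.
longer = np.append(arr, [5, 6])

print("Original: ", arr)
print("New array: ", longer)
print("Same object? ", longer is arr)
```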

### Create a 2D array from a list of lists

You can convert a list of lists into a 2D array through the use of the np.array() function. When doing so, NumPy treats each of the sub-lists as a row in the output array, meaning that the first elements in each sub-list, for example, will form the first column.

In [None]:
list2 = [[0,1,2], [3,4,5,], [6,7,8]]
print("List2: ", list2)
arr2 = np.array(list2)
print("Array2: ", arr2)

### The dtype argument

You can specify the data type of the elements in an array (integers, floats, etc.) by using the dtype argument of np.array().

In [None]:
list2 = [[0,1,2], [3,4,5,], [6,7,8]]
print("List2: ", list2)
arr3 = np.array(list2, dtype="float")
print("Array3: ", arr3)

### The astype() method

You can also convert a pre-existing array with elements of one particular type to an array with elements of a different type using the .astype() method.

In [None]:
print("Array3: ", arr3)
arr3_int = arr3.astype('int')
print(arr3_int)

### dtype='object'

You can force NumPy to create an array with elements of different types by creating a list that contains your elements then feeding that into the np.array() function using the dtype='object' argument.

In [None]:
arr_obj = np.array([1, 'a'], dtype='object')
print(arr_obj)

However, it is strongly advised that you do NOT do this; it is technically a "work around". You should not have an array of different types because doing so will eventually cause you issues in your programme when it comes to carrying out calculations.
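
A short sketch of the kind of issue this can cause is given below. With a numeric array, arithmetic behaves as expected. With a mixed object array, each element follows its own Python rules, so the same operation can give surprising results or fail outright. The try/except block is only there so that the example prints the error rather than stopping.

```python
import numpy as np

arr_num = np.array([1, 2, 3])
arr_obj = np.array([1, 'a'], dtype='object')

print(arr_num * 2)   # ordinary element-wise multiplication of numbers
print(arr_obj * 2)   # each element uses its own rule: 1*2 for the integer, 'a'*2 for the string

try:
    arr_obj + 1      # 1 + 1 is fine, but adding 1 to the string 'a' is not defined
    addition_failed = False
except TypeError as err:
    addition_failed = True
    print("TypeError:", err)
```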

### The tolist() function

Alternatively, you could convert your array into a list with the .tolist() method.

In [None]:
arr_list = arr_obj.tolist()
print(arr_list)

### Inspecting a NumPy array

There are a range of functions built into NumPy that allow you to inspect different aspects of an array.

In [None]:
list2 = [[0,1,2], [3,4,5], [6,7,8]]
arr3 = np.array(list2, dtype='float')
#We can view the shape of the array with .shape. Here, we see it has 3 rows and 3 columns.
print("Shape: ", arr3.shape)
#We can use .dtype to tell us the type of the elements.
print("Type of elements: ", arr3.dtype)
#We can get the size with .size. Here it is 9 (3 rows * 3 columns).
print("Size: ", arr3.size)
#We get the number of dimensions of an array with ndim.
print("Dimensions: ", arr3.ndim)

### Extracting specific items from an array

You can extract certain elements of an array using indices, much like when you’re working with lists. Unlike lists, however, arrays can accept as many indices in the square brackets as the array has dimensions, separated by commas.

In [None]:
print("Whole array: ", arr3)
print("First two elements of the first two rows: ", arr3[:2, :2])
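
A few more indexing patterns are sketched below, using the same 3x3 array. The first index always refers to the row and the second to the column, and a colon on its own means "everything along this dimension".

```python
import numpy as np

arr3 = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype='float')

single = arr3[1, 2]      # one element: row 1, column 2 (counting from 0)
first_row = arr3[0, :]   # every column of row 0
last_col = arr3[:, -1]   # every row of the last column

print("Single element: ", single)
print("First row: ", first_row)
print("Last column: ", last_col)
```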

### Boolean indexing

A boolean index array has the same shape as the array being filtered, but it contains only True and False values, where each value depends on whether the corresponding element fulfils a certain criterion.

In [None]:
arr_bool = arr3>2
print(arr_bool)

So we see here that the boolean array tells us which elements in arr3 contain values greater than 2.
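
The boolean array becomes useful when it is placed inside the square brackets of the original array, as in the sketch below. This returns only the elements for which the boolean array is True. Summing the boolean array counts those elements, because True is treated as 1.

```python
import numpy as np

arr3 = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype='float')
arr_bool = arr3 > 2

# Use the boolean array as an index: the result is a 1D array of the matching elements.
selected = arr3[arr_bool]
# True counts as 1 and False as 0, so the sum is the number of matches.
count = arr_bool.sum()

print("Elements greater than 2: ", selected)
print("How many: ", count)
```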

# 15. Pandas

- Pandas, like NumPy, is one of the most popular Python libraries for data analysis.
- It is a high-level abstraction over the lower-level NumPy, much of which is implemented in C.
- The main benefit of Pandas is that it provides high-performance, easy-to-use, data structures and data analysis tools.
- There are two main data structures in Pandas: dataframes and series.

When importing pandas, it's conventional to import it "as pd", much like the "as np" when we import NumPy.


In [None]:
import pandas as pd

#### Pandas Series

A pandas series is similar to a list, but differs in the fact that a series associates a label with each element, forming an index. If an index is not explicitly provided by the user, pandas creates a RangeIndex ranging from 0 to N-1; where N is the number of elements in your series.

In [None]:
#First, let's convert a list of integers into a Series.
list1 = [5, 6, 7, 8, 9,10]
new_series = pd.Series(list1)
#We'll see here that Pandas has auto-created an index of 0-5 for the Series.
print(new_series)

### Indices in Series

As you may suspect by this point, a series has ways to extract all of the values in the series, as well as individual elements by index. If you didn't specify an index, and instead allowed Pandas to autocreate one, then you can use indexing in exactly the same way as you would with a list:

In [None]:
print(new_series[4])

However, if you used the index=[] argument when you created your series, then you can use your custom index in order to access a specific element. You can even use strings as indices in Pandas data structures.

In [None]:
new_series = pd.Series(list1, index=['a','b','c', 'd', 'e', 'f'])
print(new_series['c'])

You can retrieve several elements simultaneously by feeding index values in as a list.

In [None]:
print(new_series[['a', 'd', 'f']])

You can also use indexing to alter the value of one or more elements.

In [None]:
new_series[['b', 'd']] = 0
print(new_series)
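
The start of this section mentions extracting all of the values in a series. A minimal sketch of doing so is given below, using the .values and .index attributes and the .tolist() method, none of which were shown above.

```python
import pandas as pd

new_series = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])

all_values = new_series.values      # the values as a NumPy array
all_labels = new_series.index       # the index labels
values_list = new_series.tolist()   # the values as an ordinary Python list

print(all_values)
print(all_labels)
print(values_list)
```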

### Filtering and maths operations

Filtering and maths operations are easy with Pandas as well. To do this, you specify the variable name of the series, and then in the following [], you write the series name again, followed by the conditional statement.

In [None]:
#print all elements in the series that have a value greater than 2:
print(new_series[new_series>2])
print("_______")
#Print all elements with a value greater than two and multiply each of these values by 2.
print(new_series[new_series>2]*2)

### Pandas dataframe

- The dataframe object, often abbreviated to df, is the data structure that makes Pandas such a powerful and useful package, particularly for data analysts.
- Simplistically, you can think of it as a table, where the columns are variables and the rows are observations. For example:

![image.png](attachment:image.png)

### Creating a Pandas dataframe

In order to create a new Pandas dataframe, let's first create a dictionary.

In [None]:
country_dict = {'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                   'Population': [17.04, 143.5, 9.5, 45.5],
                   'Square_km': [2724902, 17125191, 207600, 603628]}
print(type(country_dict))
print(country_dict)

Now, let's convert our dictionary into a Pandas dataframe. To do this, we feed our dictionary into Pandas' DataFrame() constructor.

In [None]:
df = pd.DataFrame(country_dict)
print(df)

Note: When you create a dataframe from a dictionary, Pandas will get the column (variable) names for your dataframe from the dictionary's keys.
    
You can also create a dataframe from a list of lists. When you do so, however, you need to use the columns=[] argument in order to specify the names of your columns.

In [None]:
list2 = [[0,1,2], [3, 4, 5], [6, 7, 8]]
df2 = pd.DataFrame(list2, columns = ['V1', 'V2', 'V3'])
print(df2)

### Dataframe variables

Note that all columns in a Pandas dataframe are of series type:

In [None]:
print(type(df['Country']))

### Indexing in a dataframe

- A Pandas dataframe object has two indices: a column index and a row index.
- As with the series object, if you do not provide one, Pandas will create a RangeIndex from 0 to N-1.
- You can address a single column (variable) of your dataframe by using similar indexing methods to that which we used in order to get values from a series using string indexing.

In [None]:
print(df['Country'])

There are numerous ways to provide row indices explicitly. For example, you could provide an index when creating a dataframe. Here, we are going to use strings for indexing.

In [None]:
df = pd.DataFrame(country_dict, index=['KZ', 'RU', 'BY', 'UA'])
print(df)

Alternatively, you can set the index after the dataframe has been created. Here, we are also going to name the index 'Country_code'.

In [None]:
df = pd.DataFrame(country_dict)
df.index = ['KZ', 'RU', 'BY', 'UA']
df.index.name = "Country_code"
print(df)

Row access using the index can be performed in several ways. If you're using string indexing, you can use the .loc indexer, which takes square brackets rather than parentheses:

In [None]:
print(df.loc['KZ'])

If you're using numerical indexing, you can use .iloc in the same way:

In [None]:
print(df.iloc[1])

Note: If you specify a string index, you can still use numerical indexing; one does not replace the other.
    
You can equally use both types of indexing simultaneously in order to get a specific cell from a particular column. 

In [None]:
print(df['Country'].iloc[1])

Particular rows and columns can be selected in the following way. The first argument in the indexing brackets is the indexing for the rows and the second is the indexing for the columns. Here, we are going to provide a list of indices for the rows so that we can get two rows.

In [None]:
print(df.loc[['KZ', 'RU'], 'Population'])

You can also use slicing, like we did with lists in the previous sessions.

In [None]:
print(df.loc[['KZ', 'RU'], :'Population'])

### Filtering

We can also filter a dataframe using Boolean arrays. For example, we want to get the values for the Country and Square_km variables where the population value is above 18.

In [None]:
print(df[df['Population'] > 18][['Country', 'Square_km']])
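
Several conditions can be combined in one filter. The sketch below rebuilds the same dataframe so that it runs on its own, and picks out countries with a population above 18 and an area below 1,000,000 square km (the thresholds are arbitrary). Each condition needs its own parentheses, and & is used for "and" (| would be "or").

```python
import pandas as pd

country_dict = {'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                'Population': [17.04, 143.5, 9.5, 45.5],
                'Square_km': [2724902, 17125191, 207600, 603628]}
df = pd.DataFrame(country_dict)

# Build a boolean series from two conditions; both must be True for a row to be kept.
mask = (df['Population'] > 18) & (df['Square_km'] < 1000000)
# .loc takes the row filter first and the wanted columns second.
filtered = df.loc[mask, ['Country', 'Square_km']]
print(filtered)
```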

### Deleting columns

You can delete a column using the .drop() function. The axis argument states whether you want to drop labels from the rows (0 or ‘index’) or columns (1 or ‘columns’).

In [None]:
df = df.drop(['Population'], axis='columns')
print(df)

### Reading from and writing to a file

- Pandas supports many popular file formats including CSV, XML, HTML, Excel, SQL, JSON, etc.
- Out of all of these, CSV is the file format that you will work with the most.
- Using the pd.read_csv() function, you can read data directly from a CSV file into a Pandas dataframe. This function takes two main arguments. The first is the filename as a string. If your file is NOT in the current working directory, then this string also has to include the path to the directory where the file is stored. It is also important to end this string with the file extension; in this case, .csv. The second argument is sep, which lets Pandas know which character separates the columns.

> df = pd.read_csv('Username/filepath/filename.csv', sep=',')

You can also write a dataframe to a .csv file using .to_csv()

> df.to_csv('path/to/directory/filename.csv')
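
A small round trip is sketched below: a dataframe is written to a CSV file and then read back in. The temporary folder and the file name countries.csv are stand-ins for your own path. The index_col=0 argument is an addition to the calls shown above; it tells Pandas to use the first column of the file as the row index again.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Country': ['Kazakhstan', 'Russia'],
                   'Population': [17.04, 143.5]},
                  index=['KZ', 'RU'])
df.index.name = 'Country_code'

# Hypothetical location: a temporary folder stands in for your own directory.
csv_path = os.path.join(tempfile.mkdtemp(), 'countries.csv')

# to_csv writes the row index as the first column of the file.
df.to_csv(csv_path)
# index_col=0 turns that first column back into the row index when reading.
df_back = pd.read_csv(csv_path, sep=',', index_col=0)
print(df_back)
```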

### Final notes on NumPy and Pandas
NumPy and Pandas have the ability to do so much more than has been discussed here. While it is worth developing a strong understanding of both packages, I would recommend against spending a disproportionate amount of time learning NumPy. This is because it is a vast package and you will learn the parts of NumPy that you need to learn, as you spend more time coding different tasks in Python.


# 16. Exploratory data analysis (EDA)


Exploring your data is a crucial step in data analysis. It involves:
- Organising the data set
- Plotting aspects of the data set
- Maybe producing some numerical summaries; central tendency and spread, etc.

> “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.”
                                                    - John Tukey.


### Download the data

First, let's download the dataset for the rest of this tutorial.

> https://github.com/LewBrace/da_and_vis_python

We're going to be researching Pokemon for the remainder of this tutorial. First, go to the link above and download the file as a .zip folder. Then, unzip the folder, and save the data file in a location you’ll remember.

![image.png](attachment:image.png)

Now, let's import all the packages we'll need for the following exercises. 

We'll then load in the dataset using read_csv. The index_col=0 argument tells Pandas to use the first column of the data file as the row index (the column headers are taken from the first row of the file by default). The encoding argument tells Pandas which character encoding to use when reading the file, so that special characters, such as accented letters, are read correctly. Remember, you will need to alter the file path part of the first string argument being fed into the pd.read_csv() function in order to match the file path to where you saved the data file.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.read_csv("C:/filepath/to/your/directory/pokemon_dataset.csv", index_col=0, encoding="ISO-8859-1")

### Examine the dataset

You can use the .head() and .tail() functions in order to view the first 5 or last 5 rows of data in a dataframe, respectively. You can also put n=10 within the parentheses if you want to see the top or bottom 10 records, for example. The ... between the two middle columns is Pandas' way of informing you that there are additional variables in the dataframe, but they can't all fit on the screen.

In [None]:
print(df.head())

The .describe() function provides you with summary statistics for the numerical variables in the dataframe.

In [None]:
print(df.describe())
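
The sketch below uses a tiny made-up dataframe (the column names and values are hypothetical, not taken from the Pokemon file) to show what .describe() covers. By default only numerical columns are summarised; passing include='all' adds the non-numerical columns as well.

```python
import pandas as pd

# Made-up data, just to keep the example self-contained.
small_df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                         'Speed': [45, 60, 80]})

numeric_summary = small_df.describe()            # numerical columns only
full_summary = small_df.describe(include='all')  # every column

print(numeric_summary)
print(full_summary)
```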

# 17. Plotting a histogram in Python

We could spend time staring at these numbers, but that is unlikely to offer us any form of insight. We could begin by conducting all of our statistical tests. However, a good field commander never goes into battle without first doing a reconnaissance of the terrain, and this is exactly what EDA is for.

Histograms are really useful for looking at the distribution of a variable; that is, the frequency of each value within a variable.

![image.png](attachment:image.png)

Now, let's look at how to create such a histogram.

In [None]:
#We call matplotlib's histogram function with plt.hist. We then say that we want a histogram of the speed variable.
#The histtype argument tells Python that we want a bar-type histogram, and the ec argument tells it that we want
#black lines around our bars.
g = plt.hist(df['Speed'], histtype='bar', ec='black')
#Label our x-axis and y-axis
g = plt.xlabel('Pokemon Speed')
g = plt.ylabel('Frequency')
#We then tell Python to show us the graph.
plt.show()

### Bins

You may have noticed the two histograms we’ve seen so far look different, despite using the exact same data. This is because they have different bin values. This refers to the width and number of bars in the graph. The left graph used the default bins generated by plt.hist(), while the one on the right used bins that I specified.

![image.png](attachment:image.png) ![image.png](attachment:image.png)

There are a couple of ways to manipulate bins in matplotlib. Below, we specify where the edges of the bars of the histogram are located; the bin edges. The first bin will begin at 0 and end at 10, the second bin will begin at 10 and end at 20, etc. We create a list of these bin edges, and feed it into the bins argument.

In [None]:
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
g = plt.hist(df['Speed'], histtype='bar', ec='black', bins=bin_edges)
g = plt.xlabel('Pokemon Speed')
g = plt.ylabel('Frequency')
plt.show()

You can also specify the total number of bins you want, and Matplotlib will automatically generate a number of evenly spaced bins.

In [None]:
g = plt.hist(df['Speed'], histtype='bar', ec='black', bins=15)
g = plt.xlabel('Pokemon Speed')
g = plt.ylabel('Frequency')
plt.show()
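
Typing out a long list of bin edges by hand is error-prone, so one option is to generate them with np.arange(). The sketch below does this and uses np.histogram() to show how many values fall into each bin without drawing anything. The speed values are made up and simply stand in for df['Speed'].

```python
import numpy as np

# Made-up values standing in for df['Speed'].
speeds = np.array([15, 35, 45, 45, 60, 65, 80, 90, 100, 120])

# Edges from 0 up to and including 150 in steps of 10 (the stop value 160 is excluded).
bin_edges = np.arange(0, 160, 10)
# np.histogram returns the count in each bin and the edges that were used.
counts, edges = np.histogram(speeds, bins=bin_edges)

print(bin_edges)
print(counts)
```

The resulting bin_edges array could be passed to the bins argument of plt.hist() in the same way as the hand-written list.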

# 18. Seaborn

Matplotlib is a powerful, but sometimes unwieldy, Python library. Indeed, for a long while the default behaviour of matplotlib was to produce graphs like the following:

![image.png](attachment:image.png)


Using Jupyter notebooks and Spyder saves us from having to tinker with all the various parameters that you would need to set in order to get a graph to look like the following:

![image.png](attachment:image.png)

This happens by "borrowing" some of the default settings from a package called Seaborn. Seaborn provides a high-level interface to Matplotlib and makes it easier to produce graphs like the second one above. In this sense, it is essentially a wrapper: you create graphs with Seaborn's functions, and they are then drawn using all of the best features of matplotlib.
    

### Benefits of Seaborn

Seaborn offers:

- Default themes that are aesthetically pleasing.
- Setting of custom colour palettes.
- Making attractive statistical plots.
- Ease and flexibility in visualising distributions.
- Visualising information from matrices and dataframes.

The last three points have led to Seaborn becoming the exploratory data analysis tool of choice for many Python users.

### Plotting with Seaborn
- One of Seaborn's greatest strengths is its diversity of plotting functions. 
- Most plots can be created with one line of code.

For example….

### Histograms

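Seaborn can draw a histogram with a density curve (a kernel density estimate) overlaid on top of it.

In [None]:
#Seaborn has a number of styles you can import to make the graphs look different. Here, we'll use the default style.
sns.set_style()
#histplot() with kde=True draws the histogram plus a density curve. Older versions of Seaborn used distplot() for this.
sns.histplot(df['Speed'], kde=True)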

### Scattergraph

Scattergraphs are incredibly useful for understanding the relationship between two variables. Creating a scatterplot with a line of best fit is also incredibly simple with Seaborn, using the lmplot() function.

In [None]:
#Specify, as a string variable, the variable we want on the x- and y- axis. 
#Then specify the particular dataframe that we want to draw the variables from using the data= argument.
sns.lmplot(x='Attack', y='Defense', data=df)

lmplot() is Seaborn's linear model plot function, which is why it draws a line of best fit by default (recent versions of Seaborn also provide a dedicated scatterplot() function). However, if we do not want the line of best fit on our graph, we can remove it by setting the fit_reg argument to False.

In [None]:
sns.lmplot(x='Attack', y='Defense', data=df, fit_reg=False)

### The hue argument

Seaborn allows us to add useful features to graphs that could otherwise be tricky to implement. The hue argument, for example, allows us to colour our data points in accordance with each data point's value in a particular variable. You may be aware that Pokemon evolve through different stages. So, let's re-create our attack and defence graph, but this time, let's colour the data points in accordance with the Pokemon's evolutionary stage. To do this, we just feed the 'Stage' variable into the hue argument.

In [None]:
sns.lmplot(x='Attack', y='Defense', data=df, fit_reg=False, hue='Stage')

### Categorical plots with catplot()

If we instead wanted to create three separate plots, one for each of the different evolutionary stages, we can do that with Seaborn's catplot() function (called factorplot() in older versions of Seaborn). The arguments here are the same as those for the lmplot() function, with just two additions. The first is the col argument, which is how we tell Seaborn how to break down the data into a number of different graphs. Since we want a graph for each of the evolutionary stages, we are going to feed the "Stage" variable into it. The second is kind, which is how we tell Seaborn what kind of plot we want, and we are going to ask for a swarmplot. Swarmplots are useful in situations where you have categorical variables on the x-axis and continuous data on the y-axis. Here, we'll plot the type of the Pokemon (fire, water, etc.) against their attack level. The set_xticklabels() function allows us to change the settings for the x-axis tick labels. Here, we are going to put them on an incline.

In [None]:
g = sns.catplot(x='Type 1', y='Attack', data=df, hue='Stage', col='Stage', kind='swarm')
g.set_xticklabels(rotation=-45)

### Boxplot

A boxplot can easily be created in Seaborn with the .boxplot() function.

In [None]:
g = sns.boxplot(data=df)
#We get a list of labels from g using g.get_xticklabels(), then feed it into .set_xticklabels(), so that
#we can manipulate the axis labels
g.set_xticklabels(g.get_xticklabels(), rotation=-45)

The Total, Stage, and Legendary variables are not combat stats, so we should remove them. Pandas makes this easy to do: we just use the .drop() function to create a new dataframe that doesn’t include the variables we don’t want.

In [None]:
#When using the .drop function, we list the columns that we do NOT want in our new dataframe, and use axis=1 to say that they are columns.
stats_df = df.drop(['Total', 'Stage', 'Legendary'], axis=1)
#Feed our new dataframe into the boxplot() function.
sns.boxplot(data=stats_df)

### Seaborn themes

It was mentioned above that seaborn has a number of default themes that you can import in order to change the appearance of graphs. Until now, we have been using the default Seaborn style, which we set by including the sns.set_style() at the top of our code. Now, we'll set the whitegrid style, in order to create graphs with y-axis lines going across the background of our graphs.

In [None]:
sns.set_style('whitegrid')
sns.boxplot(data=stats_df)

### Violin plots

If you want to plot categorical data on the x-axis and continuous data on the y-axis in a way that allows you to easily see the variance between categories, you can use a violin plot.

In [None]:
g = sns.violinplot(x='Type 1', y='Attack', data=df)
g.set_xticklabels(g.get_xticklabels(), rotation=-45)

Dragon types tend to have higher Attack stats than Ghost types, but they also have greater variance. But there is something not right here…

The colours! The grass type Pokemon should not be coloured pink or red, but should surely be coloured green.

### Seaborn's colour palettes

This brings us to colour palettes. Seaborn allows us to easily set custom colour palettes by providing it with an ordered list of colour hex values. Seaborn also has a number of built-in colour palettes that you can use, details of which can be found at https://seaborn.pydata.org/tutorial/color_palettes.html. Here, however, we are going to use a custom colour palette, which we first define as a list containing one colour for each of the Pokemon types.

In [None]:
type_colours = ['#78C850', #Grass
               '#F08030', #Fire
               '#6890F0', #Water
               '#A8B820', #Bug
               '#A8A878', #Normal
               '#A040A0', #Poison
               '#F8D030', #Electric
               '#E0C068', #Ground
               '#EE99AC', #Fairy
               '#C03028', #Fighting
               '#F85888', #Psychic
               '#B8A038', #Rock
               '#705898', #Ghost
               '#98D8D8', #Ice
               '#7038F8', #Dragon
               ]

We can then feed our custom colour palette into the palette argument.

In [None]:
g = sns.violinplot(x='Type 1', y='Attack', data=df, palette=type_colours)
g.set_xticklabels(g.get_xticklabels(), rotation=-45)
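
A list palette is applied in order, so the colours only line up if the categories appear on the x-axis in the same order as the list. One way to make the pairing explicit is to build a dictionary that maps each type to its colour. The sketch below assumes the type order given in the comments of the type_colours list above; the names type_names and type_palette are hypothetical and introduced only for this example.

```python
# Type names in the same order as the hex codes in type_colours.
type_names = ['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison',
              'Electric', 'Ground', 'Fairy', 'Fighting', 'Psychic',
              'Rock', 'Ghost', 'Ice', 'Dragon']

type_colours = ['#78C850', '#F08030', '#6890F0', '#A8B820', '#A8A878',
                '#A040A0', '#F8D030', '#E0C068', '#EE99AC', '#C03028',
                '#F85888', '#B8A038', '#705898', '#98D8D8', '#7038F8']

# Pair each type with its colour so the mapping no longer depends on order.
type_palette = dict(zip(type_names, type_colours))

print(type_palette['Grass'])  # prints #78C850
```

Seaborn's palette argument can generally take a dictionary like this in place of a list, so passing palette=type_palette should colour each type correctly even if the rows of the dataframe are reordered. Check the documentation for the Seaborn version you have installed, as the exact behaviour has changed between releases.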

We only have a limited number of observations, so we could use a swarmplot. In a swarmplot, each data point is an observation, but they are grouped together by a variable's values. Here, we'll group the data points by the Type of the pokemon.

In [None]:
g = sns.swarmplot(x='Type 1', y='Attack', data=df, palette=type_colours)
g.set_xticklabels(g.get_xticklabels(), rotation=-45)

### Overlapping plots

Both the swarmplot and the violinplot show similar information, so you may wish to combine the two.

In [None]:
#Set the size of the figure canvas
plt.figure(figsize=(10,6))
#We add one additional argument, inner, which removes the bars inside of the violins
sns.violinplot(x='Type 1', y='Attack', data=df, palette=type_colours, inner=None)
#Set the data points to black and use the alpha argument to give the dots some transparency.
sns.swarmplot(x='Type 1', y='Attack', data=df, color = 'k', alpha=0.7)
#Add a title to the plot
plt.title('Attack by Type')

# 19. Data wrangling with Pandas

What if we wanted to create such a plot that included all of the other stats as well? In our current dataframe, all of the variables are in different columns:


In [None]:
print(df.head())

So, if we wanted to visualise all the stats, then we'll have to "melt" the dataframe.

In [None]:
#We again use the .drop() function to create our dataframe without the following 3 variables.
stats_df = df.drop(['Total', 'Stage', 'Legendary'], axis=1)

melted_df = pd.melt(stats_df, #The dataframe that we're going to melt.
                   id_vars=['Name', 'Type 1', 'Type 2'], #These are the variables we want to keep.
                   var_name='Stat') #This will be the name of the new variable that holds the names of the melted columns.
print(melted_df.head())

It's hard to see what we did from the above print statement, but we can see it by comparing the .shape attribute of the two dataframes.

In [None]:
print(stats_df.shape)
print(melted_df.shape)

The above print statements show that the number of columns has decreased from 9 to 5, because the 6 Pokemon statistics columns (HP, Attack, Defense, etc.) have been merged into two: Stat, which holds the name of the statistic, and value, which holds its number. The number of rows has also increased sixfold, from 151 to 906. Looking at the Stat column below, we see that the HP stat is listed first for all of the individual Pokemon. Then, at row 151, we go back to the first Pokemon and see its Attack stat, and so on for all of the stats.

In [None]:
print(melted_df.iloc[0])
print("_______")
print(melted_df.iloc[151])
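
The effect of melting is easier to follow on a very small dataframe. The sketch below uses made-up values for two hypothetical Pokemon with three stats each; wide_df and long_df are names introduced only for this example.

```python
import pandas as pd

# Made-up values for two hypothetical Pokemon, for illustration only.
wide_df = pd.DataFrame({
    'Name': ['A', 'B'],
    'Type 1': ['Grass', 'Fire'],
    'HP': [45, 39],
    'Attack': [49, 52],
    'Speed': [45, 65],
})

# Keep Name and Type 1 as identifiers; stack the three stat columns.
long_df = pd.melt(wide_df, id_vars=['Name', 'Type 1'], var_name='Stat')

print(wide_df.shape)  # (2, 5): 2 rows, 2 id columns + 3 stat columns
print(long_df.shape)  # (6, 4): 2 rows x 3 stats, 2 id columns + Stat + value
print(long_df)
```

The rows come out grouped by stat: both HP rows first, then both Attack rows, then both Speed rows. This is the same pattern as in melted_df, where the first 151 rows are HP and row 151 starts the Attack block.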

We can now create a swarmplot using the newly created stats variable as the categorical variable on the x-axis.

In [None]:
sns.swarmplot(x='Stat', y='value', data=melted_df, hue='Type 1')
#the loc argument places the legend in a rough location. The bbox_to_anchor then moves the legend into a more specific
#location.
plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))

This graph needs a few tweaks in order to make it look better.

In [None]:
plt.figure(figsize=(10,6)) #Set the size of the figure canvas
sns.swarmplot(x='Stat', #The variable for the x-axis
              y='value', #The variable for the y-axis
              data=melted_df, #The dataframe from which the data will be drawn.
              hue='Type 1', #Colour the data points by their primary type.
              dodge=True, #Separate the points by hue (this argument was called split in older Seaborn versions).
              palette=type_colours) #Use our special Pokemon colour palette.
plt.ylim(0, 260) #Adjust the y-axis
plt.legend(bbox_to_anchor=(1,1), loc=2) #Place the legend in a specific location

# 20. Empirical cumulative distribution functions (ECDFs)

An alternative way of visualising the distribution of a variable in a large dataset is to use an ECDF. Here we have an ECDF that shows the cumulative proportion of Pokemon at each attack strength.

![image.png](attachment:image.png)

- The x-value of an ECDF is the quantity you are measuring; i.e. attack strength.
- The y-value is the fraction of data points that have a value less than or equal to the corresponding x-value. For example…

![image.png](attachment:image.png)

In [None]:
x = np.sort(df['Attack']) #The variable for the x-axis
y = np.arange(1, len(x)+1) / len(x) #Calculate the fractions for the y-axis
g = plt.plot(x, y, marker='.', linestyle='none')
g = plt.xlabel('Attack')
g = plt.ylabel('ECDF')
plt.margins(0.02) #Add a little padding around the data so that points at the edges of the graph are not cut off.
plt.show()

You can also plot multiple ECDFs on the same plot by making multiple plt.plot() calls. As an example, here we have an ECDF for Pokemon attack, speed, and defence levels:
    
![image.png](attachment:image.png)

We can see here that defence levels tend to be a little less than the other two.
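
When plotting several ECDFs, it can help to wrap the calculation in a small function. The following is a minimal sketch; the function name ecdf and the demo values are assumptions made for this example, not part of the workbook's dataset.

```python
import numpy as np

def ecdf(data):
    """Return the sorted values and the fraction of points <= each value."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up attack values, for illustration only.
x_demo, y_demo = ecdf([55, 49, 62, 49])
print(x_demo)
print(y_demo)
```

With the Pokemon dataframe, you could then call x, y = ecdf(df['Attack']) followed by plt.plot(x, y, marker='.', linestyle='none'), and repeat the two calls for 'Speed' and 'Defense' before calling plt.show().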

### The usefulness of ECDFs

It is often quite useful to plot the ECDF first as part of your workflow as it shows all the data and gives a complete picture as to how the data are distributed.

# 21. Other types of Seaborn graphs

To finish this tutorial, we are just going to look at a couple of other popular graph types and how to create them.

### Heatmap

Heatmaps are useful for visualising matrix-like data. Here, we’ll plot the correlation of the stats_df variables.

In [None]:
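#The .corr() function calculates the correlation between every pair of numeric variables in a dataframe.
#We select the numeric columns first, because recent versions of Pandas raise an error
#if text columns such as Name are passed to .corr().
corr = stats_df.select_dtypes(include='number').corr()
#We then just feed our data into Seaborn's heatmap function to create a heatmap.
sns.heatmap(corr)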
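
To get a feel for the matrix that the heatmap is drawing, here is a small sketch with made-up numbers chosen so that the correlations are easy to predict. The names demo_df and demo_corr are hypothetical.

```python
import pandas as pd

# Made-up numbers chosen so the correlations are easy to predict.
demo_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],   # text column, excluded below
    'Attack': [10, 20, 30, 40],
    'Defense': [20, 40, 60, 80],    # rises in step with Attack
    'Speed': [40, 30, 20, 10],      # falls as Attack rises
})

# Keep only the numeric columns, then correlate every pair of them.
demo_corr = demo_df.select_dtypes(include='number').corr()
print(demo_corr.round(2))
```

The result is a square table with one row and one column per numeric variable. Each variable correlates perfectly with itself, so the diagonal is all ones; the heatmap simply colours each cell of this table according to its value.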


### Bar plot

Bar plots visualise the distribution of a categorical variable by showing how many observations fall into each category. To create one, we use Seaborn's countplot function.

In [None]:
sns.countplot(x='Type 1', data=df, palette=type_colours) #We even fed in our fancy custom colour palette.
plt.xticks(rotation=-45)
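
The bar heights in a count plot are simply the number of rows in each category. As a rough sketch, the same numbers can be obtained with Pandas' value_counts() function; the series below is made up for illustration and the names types and type_counts are hypothetical.

```python
import pandas as pd

# A made-up column of primary types, for illustration only.
types = pd.Series(['Grass', 'Fire', 'Grass', 'Water', 'Grass', 'Fire'],
                  name='Type 1')

# Count how many times each category occurs, most frequent first.
type_counts = types.value_counts()
print(type_counts)
```

Running df['Type 1'].value_counts() on the Pokemon dataframe is a quick way to check the numbers behind the bars.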

### Joint distribution plot

Joint distribution plots combine information from scatter plots and histograms to give you detailed information about bivariate distributions.

In [None]:
sns.jointplot(x='Attack', y='Defense', data=df)

# End of workbook


# 22. Useful resources

There are two great online resources for learning Python through practical examples. These are Codecademy (https://www.codecademy.com/catalog/subject/web-development) and DataCamp (https://www.datacamp.com/).