## Variables, Data Types and Data Structures

Follow along with the code by running cells as you encounter them. 

*Chapter Overview and Learning Objectives*:
* Variables and naming objects
* Common Data Types ( `Int`, `Float`, `Str`, `Bool` ) and how to convert them
* Objects and Classes – including `lists` , `tuples` and `dictionaries`
* How to select items (indexing) from our common objects
* A basic introduction to functions, methods and attributes
* A basic introduction to the Pandas package, and the Series and DataFrame objects.
* How to access help



<a id='help'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Variables

A variable is a name that acts like a label: a reference to an object that lives in memory.

Without this variable we can’t “find” the object again in memory and won’t be able to use it.

In [None]:
x = 4 + 3

In [None]:
x

By default, a Jupyter notebook cell only displays the value of the last expression in the cell, so the cell below shows `y` but not `x`.

In [None]:
x = 4 + 3
y = "Hello"

x
y

To print both, use the `print()` function:

In [None]:
print(x,y)

*You can add text as well*

In [None]:
print(y, "I am", x, "years old!")
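An f-string is another common way to mix text and variables. The sketch below recreates the `x` and `y` values from above; the name `message` is only used for illustration.

```python
x = 4 + 3
y = "Hello"

# An f-string replaces each {name} with the value of that variable
message = f"{y}, I am {x} years old!"
print(message)
```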

<a id='help'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Naming conventions

Choosing sensible names saves time and energy later, when you try to remember what you've called something.

Descriptive naming allows you to figure out what an object contains without having to inspect it.

A variable name:

* Must start with a letter or an underscore
* Can’t begin with a number
* Can only contain alphanumeric characters and underscores
* Is case sensitive (MY_VARIABLE and my_variable and My_vArIaBle are treated independently)
* Must not have hyphens
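The rules above can be checked in code. This is a minimal sketch using the built-in string method `.isidentifier()` and the standard library `keyword` module; the candidate names are made up for illustration.

```python
import keyword

# Hypothetical names to test against the naming rules
candidate_names = ["my_variable", "_hidden", "2nd_value", "name-list", "class"]

# A usable name follows the spelling rules and is not a reserved word
validity = {
    name: name.isidentifier() and not keyword.iskeyword(name)
    for name in candidate_names
}

for name, is_valid in validity.items():
    print(name, "->", is_valid)
```

Note that `class` follows the spelling rules but is rejected because it is a reserved word in Python.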

Consistency is important – if you pick up your code after 6 months, or pass code on to your colleagues, consistent names make the code easier to read and maintain.

``` python
music_df  # The contents should be a DataFrame

names_list # The contents should be a list

```

The following variable name will lead to an error, because Python reads the hyphen as a minus sign.

In [None]:
name-list = ["Dan", "Mark", "Susan"]

In [None]:
name_list = ["Dan", "Mark", "Susan"]
name_list

To see what variables you've already declared, use:

`%whos`

In [None]:
%whos

To delete variables, use 
`del`

In [None]:
del y
%whos

Trying to use the variable `y` after it has been deleted will lead to a `NameError`.

In [None]:
y

<a id='help'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## 1.0 Data types

### (1.1) Numeric

**int** (plain integers) are positive or negative (or zero) whole numbers.

**floats** are decimal numbers.

Check the data type with `type`

In [None]:
type(x)
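As a small illustration of the two numeric types, the sketch below uses made-up variable names. It also shows that dividing with `/` gives a float even when the answer is a whole number.

```python
whole_number = 7
decimal_number = 7.0

print(type(whole_number))    # <class 'int'>
print(type(decimal_number))  # <class 'float'>

# Division with / always produces a float
result = 8 / 2
print(result, type(result))  # 4.0 <class 'float'>
```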

### (1.2) Strings

Strings are sequences of character data. The type in Python is called **str**. 

Strings are contained within either 'single' or "double" quotation marks.

In [None]:
print(type("Hello"), type("This is a string"), type('This string contains 5 words!'))

The choice between single and double quotes is up to the user.

We recommend that if you’re creating strings that use apostrophes or single quote marks, use double quotes to open and close your string.

The code below will produce an error message, because the apostrophe in `Tom's` ends the string early.

In [None]:
print('Tom's phone')

In [None]:
print("Tom's phone")

### (1.3) Boolean

Boolean values are sometimes called logical values in other languages. 

There are two values `True` and `False`. 

In Python they must be spelt out fully and have a capital first letter. They are **not** text values, so do **not** require quote marks. They are reserved words, and so are displayed in bold green text.

As we’ll see later Python often evaluates expressions in a Boolean context; something is either “true” or “false”.

`True` has the value of 1 and `False` has the value of 0.
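Because Booleans behave like the numbers 1 and 0, we can do arithmetic with them. A short sketch, using a made-up list called `answers`:

```python
# True counts as 1 and False counts as 0
print(True + True)   # 2
print(False * 10)    # 0

# Summing a list of Booleans counts how many are True
answers = [True, False, True, True]
number_true = sum(answers)
print(number_true)   # 3
```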


In [None]:
print(type(True), type(False))

### Type conversion

* `int()` converts things to an integer (n.b. any digits after the decimal point are removed; this is not rounding behaviour! `round(5.7, 0)` will round to the nearest whole number),
* `float()` converts things to a float,
* `str()` converts things to a string.

In [None]:
print(float(3), int(5.7), str(3))
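The difference between truncating and rounding is easiest to see side by side. The sketch below also converts text to numbers; all variable names are illustrative.

```python
# int() drops everything after the decimal point
truncated = int(5.7)
truncated_negative = int(-5.7)

# round() goes to the nearest whole number
rounded = round(5.7)

# Text that looks like a number can be converted too
number_from_text = int("42")
float_from_text = float("3.5")

# Once converted to strings, + joins text rather than adding
joined_text = str(3) + str(4)

print(truncated, truncated_negative, rounded)          # 5 -5 6
print(number_from_text, float_from_text, joined_text)  # 42 3.5 34
```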

<a id='help'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## 2.0 Data Structures

So far we have only stored one piece of information, whereas we will usually want to store many different pieces. These data structures provide us with a particular way of organising data so it can be accessed efficiently.

### (2.1) Lists

A list holds a collection of items. 

Lists are **mutable**: we can change items in them and add or delete items.

We create lists using square brackets `[ ]` and separate each item with a comma.

Lists
* Are the most versatile of the built-in data structures
* Can hold any sequence of objects
* Can hold mixed objects (like strings, integers and Booleans together)
* Are mutable (can be changed)

In [None]:
# Create my list

the_pythons = ["Chapman", "Idle", "Gilliam", "Jones", "Cleese", "Palin"]

# Display my list

the_pythons

We can also create lists using the `list()` function.

In [None]:
team = "Pragya", "Hannah", "Katherine"
team_list = list(team)
team_list

### (2.2) Tuples
Tuples are similar to lists; the two main differences are:

* Tuples are immutable (unchangeable)
* Tuples are created with round brackets `( )`

In [None]:
# Create my tuple

days_of_the_week = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Print out my tuple

days_of_the_week

We can also create tuples using the `tuple()` function. 

In [None]:
team = "Pragya", "Hannah", "Katherine"  # This automatically is a tuple data type
team_tuple = tuple(team) # But we can use the function too
team_tuple

When we say something is mutable or immutable we mean changing the contents programmatically. There are **methods** we’ll see later where we can add or remove a value from a list; like `.append()`, `.extend()` or `.pop()`
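To make the difference concrete, the sketch below tries the same change on a list and on a tuple. The `colours_list` and `colours_tuple` names are hypothetical and are not used elsewhere in this chapter.

```python
# Hypothetical example data
colours_list = ["red", "green", "blue"]
colours_tuple = ("red", "green", "blue")

# Lists are mutable: we can replace an item in place
colours_list[0] = "purple"
print(colours_list)  # ['purple', 'green', 'blue']

# Tuples are immutable: the same operation raises a TypeError
try:
    colours_tuple[0] = "purple"
except TypeError as error:
    tuple_error = str(error)
    print("Tuple unchanged:", tuple_error)
```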

See https://www.w3schools.com/python/python_ref_list.asp for more methods. 

### (2.3) Dictionaries 

Dictionaries also store a collection of objects.

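Whereas items in lists and tuples are looked up by position (an index), items in dictionaries are looked up by key. Since Python 3.7 dictionaries keep the order in which items were inserted, but they cannot be indexed by position.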


Dictionaries:
* Are mutable (can be altered)
* Can contain lists and other (nested) dictionaries.

Dictionaries are created using curly braces `{ }`

Dictionaries contain key value pairs.

**Keys** are usually integers or strings (keys must be an immutable data type)

**Values** can be any type of object

Keys and values are separated by a colon ( `:` )


In [None]:
# Create my dictionary

brian_information = {"name": "Brian Cohen",
                     "occupation": ["messiah", "very naughty boy"],  # more than one value gets stored in a list
                     "enemy": "Pontius Pilate",
                     "outlook": "Look on the bright side of life"}

# Print the dictionary

brian_information

In [None]:
# Example - Call the values of a key in a dictionary
brian_information['occupation']
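Because dictionaries are mutable, we can also add or change key value pairs after creation. A minimal sketch with a made-up `parrot` dictionary; `.get()` is a built-in dictionary method that returns a fallback value instead of raising an error when a key is missing.

```python
# Hypothetical example dictionary
parrot = {"name": "Polly", "species": "Norwegian Blue"}

# Assigning to a new key adds a key value pair
parrot["age"] = 4

# Assigning to an existing key overwrites its value
parrot["name"] = "Pol"

# .get() returns the fallback ("unknown") when the key is not present
colour = parrot.get("colour", "unknown")

print(parrot)
print(list(parrot.keys()), colour)
```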

### Indexing 
Each item within our list (or tuple) has an index. This relates to its position in the list.

Python starts indexing at 0 – which may be different to some other languages you’ve used. (Think of it as starting on the ground floor of a building, then going up to the first floor).

In order to return the element in the list we can simply give the object name, then the index of the item we want to return within square brackets.

The code below will return the value `Jones`

In [None]:
the_pythons = ["Chapman", "Idle", "Gilliam", "Jones", "Cleese", "Palin"]

the_pythons[3]

We can also use what’s known as negative indexing.

The code in the next cell returns the same value twice: index `0` counts from the start of the list and index `-6` counts back from the end. Negative indexing is useful if you want to return the last element (index `-1`) without knowing how many elements there are in your object. **Caution:** R uses negative indexing to delete elements! Take care when trying this in other languages.

In [None]:
print(the_pythons[0], the_pythons[-6])

If we want to select more than one sequential item we can use a colon ( `:` ) between the first index position (inclusive) and the last index position (exclusive).

It is important to remember that the last index is **exclusive** in this case, it will not be included.

In [None]:
the_pythons[1:3]

In [None]:
the_pythons[3:]

These methods also work for tuples.

Using the `days_of_the_week` tuple we declared earlier, the cell below returns Wednesday, the weekdays and the weekend days.

In [None]:
print("The element at index 2 is", days_of_the_week[2])
print("The weekdays are:", days_of_the_week[:5])
print("The weekend days are:", days_of_the_week[-2: ])

There's no direct inbuilt way of accessing non-sequential items. In other languages like R this is simple to do, but not so much in Python.

We can use a function called `itemgetter` to do this. This takes the indexes we want to return (2 and 5) and then the list.

In [None]:
from operator import itemgetter
itemgetter(2, 5)(the_pythons)
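A list comprehension is another option for picking out non-sequential items, and a slice with a step covers evenly spaced items. This is a sketch using a made-up `letters` list; list comprehensions are not covered in this chapter, so treat the syntax as a preview.

```python
# Hypothetical example list
letters = ["a", "b", "c", "d", "e", "f"]

# Look up each wanted position in turn and collect the results
wanted_positions = [2, 5]
selected = [letters[position] for position in wanted_positions]
print(selected)  # ['c', 'f']

# A third value in a slice is the step: here, every second item
every_second = letters[::2]
print(every_second)  # ['a', 'c', 'e']
```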

## Exercise 

Create a list and store five shopping items in the list. Give the list an appropriate name. 

Create another list with corresponding numbers of items bought. Give the list an appropriate name. 

Access the second item in the list 

Print the second to fifth item (inclusive) in the list

<a id='help'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## 3.0 Accessing Functions, Methods and Attributes

In Python we can perform actions, or get information, in a variety of ways:
* Functions
* Methods
* Attributes


### (3.1) Functions
Functions are pieces of code that are called by name. They are a set of statements that take inputs, do something to them, and produce an output. The idea is that we can perform a common or repeated task just by calling the function.

We can create our own user defined functions. Information on how to do this can be found in the course "Python: Loops, Control Flow and Functions", available for self-directed learning on the Learning Hub.

For now let’s have a look at some of the inbuilt functions in Python. You can find out what inbuilt functions are available at these links - for [Python 3.3](https://docs.python.org/3.3/library/functions.html) or [Python 3.8](https://docs.python.org/3.8/library/functions.html)

We’ve already seen 
* `type()`  - tells us the data type of what we put within the brackets. 
* `int()`, `float()`, `str()` - for type conversion

### (3.2) Methods and attributes

Methods and attributes relate to the objects that we perform them on.

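Each object has its own methods and attributes that relate to that object.

**Methods** we've seen include `.append()`, `.extend()` or `.pop()`. `.append()` and `.extend()` each have one mandatory parameter; the index given to `.pop()` is optional and defaults to the last item.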

We can use the method `.sort()` on a `parrots` list we create. `sort()` has no mandatory parameters (brackets can be left empty).

* **Firstly** we tell python the object we want to work on – here by giving the variable/identifier `parrots`.
* **Secondly** we use the full stop or “dot notation”
* **Thirdly** we call our method – here `sort` – and finish it with round brackets

In [None]:
parrots = ["Cockatoo", "Macaw", "Parakeet", "Lorikeet", "Norwegian Blue", "Conure"]

# Use the sort method to sort the list
parrots.sort()

# Print out the parrots list to see what's happened.
parrots

In [None]:
team_list

In [None]:
# the append() method adds an element at the end of the list
team_list.append("Beth")
team_list

In [None]:
# the pop() method removes (and returns) the element at the specified index
team_list.pop(2)
team_list

In [None]:
# The extend() method adds the specified list elements (or any iterable) to the end of the current list.
fruits = ['apple', 'banana', 'cherry']
team_list.extend(fruits)
team_list

Sometimes we’ll want to find something out about an object. If something is descriptive about an object, we’ll often use **attributes** to return information.

Attributes are accessed like methods, using dot notation. However, they do *not* have round brackets.
e.g.

`my_dataframe.shape` returns the number of rows and columns in a dataframe

### What methods and attributes can I use?

To find the methods and attributes for an object you can use

In [None]:
dir(team_list)

This function returns the attributes and methods of a specified object, here a list.

It doesn’t tell you if something is an attribute or a method, and it doesn’t give you any values to pass to the methods; but it can be useful to see if something you want to do has a function or method that looks about right.
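One way to separate the two is the built-in `callable()` function, which returns `True` for methods and `False` for plain attributes. The sketch below uses a complex number rather than a list, because a complex number has both; the variable names are illustrative.

```python
number = 3 + 4j

# Ignore the special names that start with an underscore
public_names = [name for name in dir(number) if not name.startswith("_")]

# getattr() fetches each name from the object; callable() tells us if it is a method
methods = [name for name in public_names if callable(getattr(number, name))]
attributes = [name for name in public_names if not callable(getattr(number, name))]

print("Methods:", methods)
print("Attributes:", attributes)

# Attributes need no brackets; methods do
print(number.real, number.imag, number.conjugate())
```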

<a id='pandas_objects'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## 4.0 Pandas objects

`pandas` is a Python library originally written by Wes McKinney, and is one of the most widely used tools for wrangling tabular data within Python.

Libraries like `pandas` are often referred to as "packages".

### Packages

Packages are a collection of functions, objects and compiled code which are stored in a “library” of code within Python.

For installing packages that are not included within the Anaconda distribution there are a few options, the most popular being `pip` and `conda`. Python packages can be found on PyPI [(the Python Package Index)](https://pypi.org/), as well as being shared by colleagues or downloaded from the internet. Caution should be taken when using the latter, as there could be malicious software in the bundle.

Other government departments have their own rules regarding installation of packages; please contact your IT department or Data Science department for guidance.


### Importing pandas

In order to use the functionality of the Pandas package within our Python code; we first need to import it.

We need to import our packages in every script we write.

You should always import the packages you’re going to use at the very **top** of your script.

To use the functions within a package we need to import it by using an import statement:

In [None]:
import pandas as pd

When we import a package we don't get any feedback that it has worked. The cell will change from green to blue, and the line number will increase.

* We start by using the key word `import` - this is a reserved word in Python, you can see the syntax highlighting, it is now bold and green.


* We then give the package name


* The keyword `as` allows us to give a nickname to the package


* Which here is `pd`

We don’t have to give nicknames to packages; however, it’s a very common convention that saves us some typing. We’ll often need to reference the package containing the function, e.g. `pd.Series()`, which is quicker to type than `pandas.Series()`.

All packages have versions; this allows code to be updated, extended or modified as time goes on.


For reproducibility purposes, as well as searching for help it can be important to know which version of a package you’re using.

In [None]:
pd.__version__ 

# Note here there's TWO underscores before *and* after version.

Pandas gives us two new object types.

### (4.1) Series

A `Series`:

* Is a one dimensional array
* Acts like a column in a spreadsheet
* Has a single data type for all of its items (e.g. int, float or str); mixed items are stored with the general `object` type
* Has an index, which defaults to starting at 0


Series have a large number of special methods and attributes associated with them.


In [None]:
# I'm manually creating a Series here - we'll talk about why we don't teach this in a second.

python_movies = pd.Series(["And Now For Something Completely Different",
                           "Monty Python and the Holy Grail",
                           "Monty Python's Life of Brian",
                           "Monty Python Live at the Hollywood Bowl",
                           "Monty Python's Meaning of Life",
                           "Monty Python Live (Mostly)"])


# Display the series below the cell.

python_movies
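The single data type of a Series can be inspected with its `.dtype` attribute. The sketch below assumes `pandas` is installed and uses made-up Series names to show how the type depends on the items supplied.

```python
import pandas as pd

# All whole numbers: stored as an integer type
whole_numbers = pd.Series([1, 2, 3])

# A mix of ints and floats: everything is stored as a float
mixed_numbers = pd.Series([1, 2.5, 3])

# A mix of numbers and text: stored with the general object type
mixed_objects = pd.Series([1, "two", 3.0])

print(whole_numbers.dtype)
print(mixed_numbers.dtype)
print(mixed_objects.dtype)

# The default index starts at 0
print(list(whole_numbers.index))  # [0, 1, 2]
```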


### (4.2) DataFrames

`DataFrames` are

* A two dimensional version of the series object
* Like a whole spreadsheet
* Essentially a collection of series objects (one series per column)

where

* A column can only have one data type
* Each column can have a different data type
* The dimensions are labelled similarly to a series object
	* **index** refers to the row labels – and defaults to starting at 0
	* **columns** refers to the column labels – or headers.

`DataFrames` share some methods with `Series` and have some of their own.

In [None]:
# First I'm setting up some Series of data to be my columns.
# Note these are in the same order as python_movies
# python_movies already exists from my previous cell so I can recycle it

python_live = pd.Series([False, False, False, False, False, True])
python_year = pd.Series([1971, 1975, 1979, 1982, 1983, 2014])

# Now I'll create my Dataframe

python_movies_df = pd.DataFrame({"film_name": python_movies,
                                 "live": python_live, 
                                 "year": python_year})

# This is passing a dictionary { } to the pd.DataFrame command.
# In the dictionary I have each column header (as a string) and then the Series data.
# These are separated by a colon - like we saw earlier.
 
# then print out the Data Frame

python_movies_df

## Exercise 

Create a dataframe with your shopping list and number of items bought list. 

<a id='help'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Accessing Help

In [None]:
python_movies_df.sample()

By default `.sample()` returns one row of a DataFrame.

What if I want to change the behaviour to return multiple rows? To find out how to do this I could search for the method online.

However, it’s often easier to look in the documentation contained within Python.

We can do this in a few different ways; one is to use the function `help()` on our code:

In [None]:
# Run this Cell to show the help
help(python_movies_df.sample)

The help documentation has a few different sections. As the contents of these are defined by the creator of the function some may have more information in them than others.

This one has:

`Signature` - the text we use to call the function, with the parameters (think options) and their default arguments (values) inside the brackets. Here we have several: the parameter `n = ` controls the number of rows returned, and there is a parameter called `replace = ` whose default argument is `False`.

Underneath this is the `Docstring`, which is information about the function.

Thirdly `Parameters` gives us more information about the parameters (options) we can set within this function.

Going back to our `replace = ` parameter, we can see that this takes a `Boolean` argument (`True` or `False`) and that it controls if we sample with or without replacement.

At the bottom we have `Returns` which tells us what the function returns; and underneath this we have `Examples` - not all functions will have this level of detail in the docstring.
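The signature and docstring can also be read in code, using the standard library `inspect` module and the `__doc__` attribute. The sketch below uses a hypothetical function, `pick_items`, as a simple stand-in; it is not how `.sample()` is implemented.

```python
import inspect


def pick_items(items, n=1, replace=False):
    """Return the first n items (hypothetical stand-in; replace is ignored)."""
    return items[:n]


# The signature lists each parameter and its default argument
signature = inspect.signature(pick_items)
print(signature)  # (items, n=1, replace=False)

# Collect only the parameters that have a default argument
defaults = {
    name: parameter.default
    for name, parameter in signature.parameters.items()
    if parameter.default is not inspect.Parameter.empty
}
print(defaults)

# The docstring is stored on the function itself
print(pick_items.__doc__)
```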

Now we know other parameters we can set for `python_movies_df.sample()` let’s try returning a sample of 4 rows, and turn `replace = ` to `True`

In [None]:
python_movies_df.sample(n=4, replace=True)

`.sample()` is a fairly simple method and doesn’t have many parameters and arguments.

How many parameters and arguments there are depends on what your function does.

Have a look at the Docstring below for `pd.read_csv()`, which we use for reading CSV files into Python.

In [None]:
help(pd.read_csv)

## Summary

In this chapter we’ve explored:
* Variables and naming objects
* Common Data Types ( `Int`, `Float`, `Str`, `Bool` ) and how to convert them
* Objects and Classes – including `lists` , `tuples` and `dictionaries`
* How to select items (indexing) from our common objects
* A basic introduction to functions, methods and attributes
* A basic introduction to the Pandas package, and the Series and DataFrame objects.
* How to access help
