<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

In [None]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

# The Pandas Library - Objects for Data Science



## 1.1 General Purpose Objects

Lists, tuples, and dictionaries are the core, in-built data structures available in python. They are always available for use and are easily created and destroyed. They are general purpose, which means they are useful in many situations. However, they may not allow us to complete specific tasks as easily as we would like.

Nonetheless, it is really important to know a bit about lists, tuples and dictionaries because they are used throughout python, and you are very likely to encounter them in some form: perhaps as an output from a function or method, as a way of providing input to a procedure, or as the way that the library you are using stores properties.
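As a quick reminder, here is a minimal sketch of the three general purpose objects. The variable names and the ages are made up for illustration; the port codes are the ones used in the titanic data later in this notebook.

```python
# A list is ordered and can be changed after it is created
ages = [29, 35, 41]
ages.append(52)

# A tuple is ordered but cannot be changed once created
port = ("S", "Southampton")

# A dictionary maps keys to values
ports = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

print(ages)        # [29, 35, 41, 52]
print(port[1])     # Southampton
print(ports["C"])  # Cherbourg
```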

Many advanced python libraries that have been built for specific purposes, like data science, take advantage of the fact that objects are extensible. Programmers can create new classes based on existing objects and add in specific properties or methods that do specific things. This means that general purpose objects can be customised for particular tasks.

In data science applications, the Pandas library provides specialised objects that make working with data tables a lot easier than if we had to rely on python's general purpose objects. However, because pandas objects are specialised, they are also more complicated than general purpose objects.

Pandas adds two important data structures that we need to know about: the one-dimensional `Series` and the two-dimensional `DataFrame`.


## 1.2 Series Objects

* A Series is like a one-dimensional array (effectively a list) with a label for each observation.

* You can think of a Series as being similar to a single column in a spreadsheet: a list of values.

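* However, unlike a list, a Series has a single data type (dtype) that applies to all of its items. Mixed values can still be stored, but pandas then falls back to the general purpose `object` dtype.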

* The Series index provides a label for each observation. The default is consecutive integer labels starting at 0.

* However, the labels in a Series index do not have to be consecutive numbers. They can be non-unique numbers, non-consecutive numbers, or string objects or tuples.

* As such, the Series itself is a bit like a list, whereas the Series index is a bit like a dictionary, and is able to return a value in the Series based on the index, like a key-value pair.

* The Series object implements a large number of special behaviours (methods, procedures) that can be called on the data held in a Series to get useful outputs. This includes mathematical functions, like taking the average of all values in a numerical Series.
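The points above can be sketched with a small Series. This is only an illustration: the values and the labels `p1`, `p2`, `p3` are made up, not taken from any dataset.

```python
import pandas as pd

# A Series with string labels instead of the default integer index
ages = pd.Series([22.0, 38.0, 30.0], index=["p1", "p2", "p3"])

print(ages.dtype)   # float64 - one data type for the whole Series
print(ages["p2"])   # 38.0 - look up a value by its label, like a dictionary key
print(ages.mean())  # 30.0 - a method that summarises the values

# Without an explicit index, the labels are 0, 1, 2, ...
default = pd.Series([10, 20, 30])
print(list(default.index))  # [0, 1, 2]
```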

## 1.3 DataFrame Objects

* A DataFrame is a two-dimensional version of a Series object.

* In effect, a DataFrame is a collection of Series objects - one Series for each column in the DataFrame.

* A DataFrame is thus similar to a whole spreadsheet, like an Excel file, a Stata .dta file, a SAS file, an SPSS .sav file etc.

* A DataFrame can hold columns with different data types, but as per Series above, any single column must contain values of the same data type.

* The dimensions are labeled in a similar way to the Series object:

    * **index** - refers to the row labels
    * **columns** - refers to the column labels
    
* Having indexed data allows fast look-up and powerful relational operations.

* Each row has a label and each column has a label.

Most of the time you'll be working with DataFrame objects in Pandas. However, as DataFrames are composed of Series, it is very likely that you'll encounter Series objects too, through indexing, selection, creation of new columns, and analysis. Luckily, DataFrame and Series objects are similar. DataFrames effectively extend the Series into two dimensions, and so implement some additional properties and methods that are not relevant to a Series on its own.
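To make the relationship between the two objects concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The names, values and row labels are invented for illustration.

```python
import pandas as pd

# Each key becomes a column; the index argument supplies the row labels
people = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "age": [22.0, 38.0, 30.0],
        "survived": [1, 0, 1],
    },
    index=["p1", "p2", "p3"],
)

print(people.index.tolist())    # ['p1', 'p2', 'p3'] - the row labels
print(people.columns.tolist())  # ['name', 'age', 'survived'] - the column labels

# Selecting a single column gives back a Series
age_column = people["age"]
print(type(age_column).__name__)  # Series

# Row and column labels together identify a single value
print(people.loc["p2", "age"])  # 38.0
```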

## 1.4 Import Pandas


In [None]:
# importing pandas is this simple!
import pandas as pd

The above code cell has two lines.

The first line is a comment - Python (and other languages) "skip" anything with a # in front of it.

On the second line I've written my import statement. There are two parts to this statement:
1. `import pandas`
2. `as pd`

In python, it is sufficient just to write `import pandas`.

`pd` is a common nickname or 'alias' used for pandas - rather than write `pandas` each time I can just write `pd`.

Many packages have nicknames that are consistently applied by members of the wider python community, such as pandas being pd. You don't have to adhere to these conventions, but you'll probably see them used a lot in training and help materials online.

When you successfully import a module, you don't usually get any feedback. However, if you get it wrong you'll get an error message that should give you some idea of what happened, for instance:

In [None]:
# If you run this code cell you should get a 'ModuleNotFoundError' as there is no 'pandahs' module!
import pandahs # by the way, I can also comment on the end of lines too!
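If you would rather see the error message without the cell stopping, one option is to catch the error. This is only a sketch of that idea; `pandahs` is the same deliberately misspelt name used above.

```python
try:
    import pandahs  # deliberately misspelt, so the import fails
except ModuleNotFoundError as err:
    # Keep the message so we can look at it instead of halting the notebook
    message = str(err)
    print(f"Import failed: {message}")
```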

## 1.5 Reading in Data

Now that you've loaded pandas, let's read some data into python and see what that looks like.

Pandas can read data in a variety of different formats. In the code cell below type `pd.read` then press **Tab**.

Pressing Tab gives you all the things you can do with pandas that start with 'read'. You probably recognise some of the data input types (e.g. Excel, SAS, Stata), while others you may not have heard of (e.g. pickle, json). You can then select whichever method you wish to use.

**Tab** is a really useful shortcut to know for finding or completing commands in notebooks. 

As you can see, we can read in a large variety of data using pandas data readers.
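If Tab completion isn't available, a rough equivalent is to list the names in the pandas namespace that start with `read_`. This is a sketch; the exact list you get depends on the version of pandas you have installed.

```python
import pandas as pd

# Collect every name in the pandas namespace that starts with 'read_'
readers = [name for name in dir(pd) if name.startswith("read_")]
print(readers)
```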

(NB Recent versions of pandas, from 0.25 onwards, include `pd.read_spss` for SPSS .sav files. It relies on the separate pyreadstat library, which needs to be installed first. With older versions of pandas you need another library, such as savReaderWriter.)

## 1.6 Reading a CSV file

Now that you've seen the different options, try reading a simple csv (comma separated value) dataset. 

To read a csv file, the only piece of information we absolutely need is the location of the file.

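The file is called 'titanic.csv' and is in the data folder. The code cell below uses an absolute path, so change it to match wherever the file is stored on your machine.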

The code below demonstrates how you can read these data.

In [None]:
# Read the data using an absolute path
titanic = pd.read_csv('C:/intro to Python/titanic.csv')
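Because the right path differs from machine to machine, here is a self-contained sketch that doesn't depend on any particular folder. It writes a tiny, made-up csv file to a temporary directory and reads it back; the file name `example.csv` and its contents are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build the file location from its parts rather than typing one long string
    csv_path = Path(tmp) / "example.csv"
    csv_path.write_text("name,age\nAnn,22\nBob,38\n")

    # The location of the file is the only thing read_csv needs
    example = pd.read_csv(csv_path)

print(example)
```

`pd.read_csv` accepts a `Path` object as well as a plain string, so the same pattern works with your own files.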

## 1.7 Exploring the Data

There are a number of variables within this dataset:
* pclass = Passenger class of travel.
* survived = 1 if the passenger survived the sinking, 0 if not.
* name = Full name of the passenger, including title.
* sex = Passenger gender.
* age = Passenger age.
* sibsp = Count of siblings or spouse also aboard.
* parch = Count of parents or children also aboard.
* ticket = Ticket reference.
* fare = fare paid.
* cabin = Cabin number.
* embarked = Port of embarkation. (S = Southampton (UK); C = Cherbourg (France); Q = Queenstown (Cobh, Ireland))



The first thing we may want to do having read in a DataFrame is to take a quick overview and check it looks right.

There are several DataFrame methods we might use to do this: `.head()`, `.tail()` and `.sample()`

In [None]:
# head shows the first n rows of the dataframe (5 by default).
titanic.head()

In [None]:
# tail shows the last n rows of a dataframe (5 by default)
titanic.tail(3)

In [None]:
# sample shows n rows chosen at random from the dataframe (1 by default)
titanic.sample(4)
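The three methods are easier to compare on a DataFrame where you already know what to expect. The sketch below uses a made-up DataFrame holding the numbers 0 to 9. The `random_state` argument is optional; it makes `.sample()` return the same rows each time the cell is run.

```python
import pandas as pd

numbers = pd.DataFrame({"value": range(10)})

first_five = numbers.head()                       # rows 0 to 4
last_three = numbers.tail(3)                      # rows 7 to 9
random_four = numbers.sample(4, random_state=0)   # 4 rows, repeatable

print(first_five["value"].tolist())  # [0, 1, 2, 3, 4]
print(last_three["value"].tolist())  # [7, 8, 9]
print(len(random_four))              # 4
```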

In [None]:
# dimensions of dataframe
titanic.shape

In [None]:
# get the number of rows and columns
# NB This line 'unpacks' the tuple returned by shape
numrows, numcols = titanic.shape

In [None]:
# number of rows using python's in-built len() function (len meaning 'length')
numrows = len(titanic)
# print the number of rows using f-string formatting.
print(f"There are {numrows} rows in the titanic dataframe")

In [None]:
# type of object
type(titanic)

In [None]:
# get column data types in dataframe
titanic.dtypes

Remember our different data types:

* **int** indicates integers. The 'parch' column, for instance, records counts as whole numbers, such as: 0, 1, 2 etc.
* **float** indicates 'floating-point numbers', effectively decimal numbers like in the age column.
* **object** indicates text, also known as 'string' data. The 'name' column gives passenger full names and titles.
* **bool** indicates Boolean values, which encode True or False.

Other data types you might see include:
* **datetime** which encodes date and time values.
* **category** which is a special Pandas datatype for categorical or factor variables.
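The sketch below builds a small, made-up DataFrame with one column of each kind so you can see how they are reported by `.dtypes`. The column names `adult` and `boarded` and all of the values are hypothetical. The exact labels shown (for example for the text column) can vary between pandas versions.

```python
import pandas as pd

example = pd.DataFrame(
    {
        "parch": [0, 1, 2],                 # whole numbers
        "age": [29.0, 0.5, 30.0],           # decimal numbers
        "name": ["Ann", "Bob", "Cy"],       # text
        "adult": [True, False, True],       # True / False
        "embarked": ["S", "C", "S"],        # text for now, converted below
        "boarded": ["2024-01-01", "2024-01-02", "2024-01-03"],
    }
)

# Convert two of the text columns to more specific data types
example["embarked"] = example["embarked"].astype("category")
example["boarded"] = pd.to_datetime(example["boarded"])

print(example.dtypes)
```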


In [None]:
# Column names are really easy to get!
colnames = titanic.columns
colnames

In [None]:
# .values returns the column names as a NumPy array rather than an index object
columnsNamesArr = titanic.columns.values
columnsNamesArr

In [None]:
# colnames is an index object, if we wanted a list we could use:

colnames = list(colnames)
colnames
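The three cells above return the same names in three different containers. A minimal sketch on a made-up, one-row DataFrame shows the difference; `.to_numpy()` and `.tolist()` are used here as alternatives to `.values` and `list()`.

```python
import pandas as pd

small = pd.DataFrame({"pclass": [1], "survived": [1]})

as_index = small.columns             # a pandas Index object
as_array = small.columns.to_numpy()  # a NumPy array
as_list = small.columns.tolist()     # a plain python list

print(type(as_index).__name__)  # Index
print(type(as_array).__name__)  # ndarray
print(as_list)                  # ['pclass', 'survived']
```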

In [None]:
# DataFrames also have an .info() method, which prints a concise summary of the data.
titanic.info()

Note the different counts in the information above. This suggests that some variables are completely observed (e.g. pclass, survived), while others have missing data values (e.g. age, embarked, cabin).
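If you want the missing values as numbers rather than reading them off the summary, one approach is to count them directly. The sketch below uses a small, made-up DataFrame; the same two calls should work on any DataFrame.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "pclass": [1, 2, 3],
        "age": [29.0, np.nan, 30.0],
        "cabin": ["B5", None, None],
    }
)

non_missing = small.count()   # observed values per column, as shown by .info()
missing = small.isna().sum()  # missing values per column

print(non_missing)
print(missing)
```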

When you read a DataFrame into pandas, the data are loaded in memory. This means that any changes you make won't be reflected in the original file you loaded. If you want to preserve the changes you make to the dataset you have to export the DataFrame object to a file.

Pandas has a number of file writers. Let's save the `titanic` DataFrame as an excel file.

In [None]:
# File writers are prefixed with .to_ - press Tab to find the available options.
titanic.to_excel('../Save/titanic.xlsx')
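Note that `to_excel` needs an Excel writer engine, such as openpyxl, to be installed. The sketch below shows the same write-then-read idea using `to_csv`, which needs no extra library. The DataFrame and the file name `copy.csv` are made up, and the file is written to a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

original = pd.DataFrame({"name": ["Ann", "Bob"], "age": [22, 38]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "copy.csv"

    # index=False leaves the row labels out of the file
    original.to_csv(out_path, index=False)

    # Read the file back in to check what was saved
    reloaded = pd.read_csv(out_path)

print(reloaded)
```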

## 1.8 Getting Help 

Most python modules include information on how to use their functions with something called a 'docstring'. You can look at docstrings using python's in-built help() function. Run the code below to look at the docstring for the pandas read_csv function.

In [None]:
# Have a look at the docstring using python.
help(pd.read_csv)

When you're using a notebook, you can also access the docstring by pressing Shift-Tab with the cursor somewhere in the object. This produces a **tool-tip**: a small pop-up box with relevant information.

Pressing Shift-Tab once shows the parameters available for the particular object you are looking at.  
Pressing Shift-Tab twice shows the whole docstring.  
Pressing Shift-Tab a third time makes the tool-tip linger for 10 seconds.  
Pressing Shift-Tab a fourth time puts the tool-tip into a larger pane in the browser.  

The tool-tip provides the same information as the `help()` function, but with nicer formatting!
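Outside a notebook, the same information can be reached from plain python. This is a sketch of one way to do it, using the `__doc__` attribute and the standard library's `inspect` module.

```python
import inspect

import pandas as pd

# The docstring is stored as ordinary text on the function itself
doc = pd.read_csv.__doc__
print(doc.strip().splitlines()[0])  # the first line of the docstring

# inspect.signature lists the parameters the function accepts
param_names = list(inspect.signature(pd.read_csv).parameters)
print(param_names[:5])
```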

In [None]:
# Try it here: place the cursor inside read_csv and press Shift-Tab to see the tool-tip.
help(pd.read_csv)

**Docstrings**


Docstrings look a bit scary at first, but once you learn how to read them they're super useful!

The first block of text is the function we're looking at - read_csv - followed by all the possible parameters you can use to modify how pandas reads a csv file.

Note that the first parameter is 'filepath_or_buffer'. That's currently all you're passing to this function: the filepath of 'titanic.csv'.

If you scroll down a little, you can then see all of the possible parameters listed and described.

For example, if your data were semi-colon delimited you'd have to specify the 'sep' parameter.
If there is no header row (i.e. no column names), you could set 'header' to None.

At the very bottom of the docstring, you can see that the read_csv function returns a 'DataFrame'.
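Here is a minimal sketch of the 'sep' and 'header' parameters in action. The data are made up and held in a text string rather than a file, which `read_csv` accepts when the string is wrapped in `io.StringIO`.

```python
import io

import pandas as pd

# Semi-colon delimited text with no header row
raw = "Ann;22\nBob;38\n"

no_header = pd.read_csv(io.StringIO(raw), sep=";", header=None)

print(type(no_header).__name__)    # DataFrame
print(no_header.columns.tolist())  # [0, 1] - default labels, as there are no column names
print(no_header)
```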

Don't get too hung up on docstrings for now though. As you become more experienced they'll begin to make more sense!