# Welcome to the Python Course

Welcome to the Python Course. This is an introduction to Python, with a focus on data analysis. We start with the obligatory "Hello World in Python", then look at a short snippet of code that imports and views a dataset to whet our appetite. After that we go through the basics: data structures, functions, methods, and the numpy and pandas packages.

Enjoy!

# Whetting your appetite
A short example to show how useful Python is in data analysis, before we start on the basics. We will import a CSV file from GitHub into a dataset and plot a time-series chart.

In [None]:
print("Hello Python")

In [None]:
# Import the modules
import pandas as pd
import matplotlib.pyplot as plt
# Load our data from a CSV file on GitHub
df = pd.read_csv("https://raw.githubusercontent.com/MarkWilcock/Python-Course/main/tfl-daily-cycle-hires-nov-2020-cleaned.csv")
df

In [None]:
df.info()
#df.info

In [None]:
# Plot the dataset as a line chart.
plt.plot(df['Number of Bike Hires'])
plt.show()

In [None]:
# Convert the Date column from text to a proper datetime type
df['ProperDate'] = pd.to_datetime(df['Date'])
df.info()

In [None]:
plt.plot(df['ProperDate'], df['Number of Bike Hires'])
plt.show()

# Getting started
## The basics - Python data types, assignments, operators and the print() function

In [None]:
print("Let's get started")

Python variables can be one of a few basic data types, including int, float, str and bool. We don't need to specify a data type when assigning a variable; Python works it out for us.

In [None]:
# an int (integer) datatype is a whole number
answer = 33
type(answer)

In [None]:
# a float is a real number with a fractional part
e = 2.718
type(e)

In [None]:
# define a string in either single or double quotes
todays_date = '2020-01-05'
type(todays_date)

In [None]:
# a bool (boolean) is either True or False
today_is_monday = True
type(today_is_monday)

In [None]:
# The behaviour of operators depends on the datatypes
print(2 + 3) # adding numbers, what you expect
print('2' + '3') # adding strings concatenates them
print('hello ' * 3) # multiplying strings repeats them
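
The same idea explains a common early error: + is not defined between an int and a str, so we have to convert one side first. A minimal sketch (the variable names are made up for illustration):

```python
# Illustrative names only; not part of the course material
count = 3
label = '3'

# int + str is not defined, so Python raises a TypeError
try:
    count + label
    mixed_ok = True
except TypeError:
    mixed_ok = False

as_number = count + int(label)   # convert the str to an int, then add
as_text = str(count) + label     # convert the int to a str, then concatenate

print(mixed_ok, as_number, as_text)
```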

# Data Structures

This lesson covers these data structures:
*   strings
*   lists
*   dicts
*   tuples

## Strings
A string is an ordered collection of characters.

In [None]:
title = 'Premier League Table'
len(title)

### Indexing and Slicing
We can get a single element by using the position of that element. 
Python counts from 0!

In [None]:
title = 'Premier League Table'
len(title)
title[0]
title[10]
#title[19]
# title[100] #Oops!


We can get a range of elements by slicing.


In [None]:
print('title[0:5]', title[0:5])

# we don't have to start the slice at the beginning of the string
print('title[10:15]', title[10:15])

# we can slice with negative integers
print('title[-5:-1]', title[-5:-1])
# will return the last 5 characters apart from the final one

# omit the 2nd index to go to the end of the string
print('title[-5:]', title[-5:]) # will return the last 5 characters
print('title[8:]', title[8:]) # will return all characters from the 9th to the end of the string

# omit the 1st index to start at the start of the string
print('title[:9]', title[:9]) # will return the first 9 characters
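
One difference between indexing and slicing is worth seeing in action: an index past the end of the string is an error, but a slice past the end is not. A small sketch, reusing the title string from above:

```python
title = 'Premier League Table'

# Indexing past the end raises an IndexError
try:
    title[100]
    index_ok = True
except IndexError:
    index_ok = False

# Slicing past the end does not fail; it simply stops at the end of the string
tail = title[15:100]
empty = title[50:60]

print(index_ok, repr(tail), repr(empty))
```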


## Lists
A list is an ordered collection of elements.  The elements can be of different data types.  Create a list with [].

In [None]:
football_teams = ['Chelsea', 'Liverpool', 'Arsenal']
goalkeepers = ['Chelsea G', 'Liverpool G', 'Arsenal G']
points = [17, 19, 15]

#  This list contains a string, int, bool and another list
a_diverse_list = ['Chelsea', 15, True, points]
a_diverse_list

In [None]:
# We can get an element of the list from its position
# Python starts counting from 0
football_teams[0]

*Aside (since we have not yet looked at loops)*
We can iterate through the elements. Notice the : and the indentation.
Using indentation to mark the body of a block sets Python apart from many other languages.

In [None]:
for p in points:
  print(p**2)

In [None]:
#  This elegant expression is a list comprehension 
#  Question: does it need to be p?
[p + 10 for p in points]   
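
For comparison, here is a sketch of the same calculation written as an explicit loop and then as a comprehension with a different loop variable. The name score is just an illustration; any valid variable name works.

```python
points = [17, 19, 15]

# The loop version: build the list one element at a time
plus_ten_loop = []
for p in points:
    plus_ten_loop.append(p + 10)

# The comprehension version; the loop variable can have any valid name
plus_ten_comp = [score + 10 for score in points]

print(plus_ten_loop, plus_ten_comp)
```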

What happens if not all the elements are numbers?

In [None]:
strange_points = [17, 'eighteen', 19, 15, 'twenty', True]
[p + 10 for p in strange_points if type(p) == int]   
# [p + 10 for p in strange_points]   

We can find the number of elements in a list with the len() function.

In [None]:
len(football_teams)

Lists are mutable: they can be "edited". We can change an item in the list or append to the list.

For example, let's append an element to a list.
*This is the first time that we have seen a method.*

In [None]:
football_teams.append('Fulham')
football_teams.sort()
football_teams
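
Changing an existing item works by assigning to its position. A short sketch on a fresh copy of the list ('Everton' is just an example value, not part of the course data):

```python
football_teams = ['Chelsea', 'Liverpool', 'Arsenal']

# Change an item in place by assigning to its position
football_teams[1] = 'Everton'

# sort() also changes the list in place, and returns None rather than a new list
result = football_teams.sort()

print(football_teams, result)
```

Because sort() returns None, writing football_teams = football_teams.sort() would lose the list.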

## Dictionaries
A dict is a collection of key/value pairs, kept in insertion order (Python 3.7 and later).
Create a dict with this syntax: {key1: value1, key2: value2}

In [None]:
football_ages = {'Kane' : 24, 'Pickford' : 22, 'Dele' : 20}
# We can get the value of a pair by providing the key
print('Dele is ', football_ages['Dele'], 'years old')
# We can get all the keys and all the values of a dict
print('keys', football_ages.keys())
print('values', football_ages.values())
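
To actually loop through the pairs, one option is the items() method. A minimal sketch, reusing the football_ages dict (the updated age is an invented value):

```python
football_ages = {'Kane': 24, 'Pickford': 22, 'Dele': 20}

# items() yields (key, value) pairs, which we unpack in the for statement
for player, age in football_ages.items():
    print(player, 'is', age, 'years old')

# Add or update a pair by assigning to its key
football_ages['Dele'] = 21
```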

The key can be any hashable (immutable) data type, such as a str, an int or a tuple. The value can be any data type.

In [None]:
dinner = {
        'starter' : 'salad',
        'main course' : ['meat' , 
                         'potatoes', {
                                 'veg1': 'peas', 
                                 'veg2' : 'beans'}],
        'is_dessert_included': True,
        20 : 'cost'
        }
dinner

## Tuples
A tuple is an **immutable** ordered sequence of elements.



In [None]:
# Create a tuple with parentheses (i.e. round brackets)
my_tuple = ('alpha', 'beta', True, 42)

#  We can assign variables to each element in a tuple as follows
a, b, c, d = my_tuple
b
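
What does immutable mean in practice? A small sketch: reading an element works as it does for a list, but assigning to one raises a TypeError ('gamma' is just an example value).

```python
my_tuple = ('alpha', 'beta', True, 42)

# Reading by position works just as it does for a list
first = my_tuple[0]

# Assigning to a position fails because tuples are immutable
try:
    my_tuple[0] = 'gamma'
    changed = True
except TypeError:
    changed = False

print(first, changed)
```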

# Functions

Functions are reusable building blocks of useful functionality. Many are written for us, e.g. print() and len(). Some we have to write ourselves.
```
def <function name>():
--->docstring
--->indented implementation
```
Let's define our first function.

In [None]:
def hello():
  """
  Prints a greeting
  """
  print('Hello there')

Let's invoke our function.  Note that we see the docstring after we type the (.

In [None]:
hello()

*Exercise:* write a function named goodbye.

Let's improve our function so it says the student's name and makes smalltalk. The function now has an argument which tailors its behaviour.


In [None]:
def hello2(student_name):
  print('Hello', student_name)
  print("How are you?")

hello2('Mark')
hello2('Mike')

Most students like smalltalk but a few don't. So let's add an argument make_smalltalk with a default value of True. This is the first time we have seen an if statement in Python.


In [None]:
def hello3(student_name, make_smalltalk = True):
  print('Hello', student_name)
  if make_smalltalk:
    print('How are you')

In [None]:
hello3('Mark')
hello3('Mary')
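
Both calls above rely on the default value. Here is a sketch of overriding it, either by position or by name (the function is repeated so the cell runs on its own):

```python
def hello3(student_name, make_smalltalk = True):
  print('Hello', student_name)
  if make_smalltalk:
    print('How are you')

hello3('Mark')                          # uses the default, so we get the smalltalk
hello3('Mary', False)                   # override by position
hello3('Mike', make_smalltalk = False)  # override by name, which is easier to read
```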

Every function returns a result, even if that result is None.


In [None]:
result = hello3('Anne')
print(result)


Let's improve our function to return something.

In [None]:
def hello4(student_name, make_smalltalk = True):
  print('Hello', student_name)
  if make_smalltalk:
    print('How are you')
  return 42  

In [None]:
result = hello4('Sara')
print(result)

In [None]:
def divide_by_7(some_number):
  return some_number / 7

divide_by_7(56)

Let's calculate our body mass index. 
The formula is *bmi = weight / (height * height)*  where
height is in metres, weight is in kg
https://www.nhs.uk/Tools/Pages/Healthyweightcalculator.aspx


In [None]:
def bmi(height, weight):
  return weight / (height * height)

In [None]:
bmi(1.75, 76)

We want to calculate the BMI of three (unnamed) students. Here's the data in two lists. Can we use our shiny new bmi() function to calculate the results for a group of people? What will happen here?



In [None]:
weights = [74, 75, 76]
heights = [1.77, 1.76, 1.75]
#bmi(heights, weights)  # unfortunately this does not work with lists

So let's write something ugly. We'll write it more elegantly later with a list comprehension.

In [None]:
bmi_results = [] # an empty list to hold the results
for i in range(len(weights)):
  bmi_result = bmi(heights[i], weights[i])
  print(i, weights[i], heights[i], bmi_result)
  bmi_results.append(bmi_result)

bmi_results
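
One possible list-comprehension version of the loop, sketched here with the built-in zip() function, which the course has not introduced yet. The names h and w are just illustrative.

```python
def bmi(height, weight):
  return weight / (height * height)

weights = [74, 75, 76]
heights = [1.77, 1.76, 1.75]

# zip() pairs up the i-th height with the i-th weight
bmi_results = [bmi(h, w) for h, w in zip(heights, weights)]
bmi_results
```

This avoids the index variable i altogether, so there is no range() to keep in step with the length of the lists.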


# Methods
Methods are functions owned by objects of a particular data type or class.
We will look at some methods of strings and lists; both have lots of useful methods.

Invoke a method on a string with the dot notation.
Instead of capitalize(title), use title.capitalize().


In [None]:
title = 'monty python'
len(title) # len() is a function
# capitalize(title) #Oops
title.capitalize()


In [None]:
print("title.count('o')", title.count('o'))
print("title.count('on')", title.count('on'))
print("title.count('ont')", title.count('ont'))
print("title.replace('o', 'i')", title.replace('o', 'i'))
print("title.index('o')", title.index('o'))

Lists also have methods. Some of these have the same name as string methods and
have similar (but not the same) behaviour.


In [None]:
squares = [1,4,9,16]
squares.index(9)
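
To make "similar but not the same" concrete, here is a small sketch comparing index() on a string and on a list. The variable names are only for illustration.

```python
title = 'monty python'
squares = [1, 4, 9, 16]

# str.index() finds a substring, which may be several characters long
start_of_python = title.index('python')

# list.index() looks for one whole element equal to the argument
position_of_nine = squares.index(9)

# Both raise a ValueError when nothing is found
try:
    squares.index(25)
    found = True
except ValueError:
    found = False

print(start_of_python, position_of_nine, found)
```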

# Basics of the Numpy Package

Just enough numpy to do our data analysis in pandas.

numpy is a very important package in Python.
*   It provides the np.ndarray data type (I'll call it an array for short).
*   Unlike lists, all the elements in an array must be the same data type (usually int or float).
*   We can do element-wise operations on these arrays to make our code simpler and shorter.
*   Operations on arrays typically run a lot faster than the same operations on lists, especially for large amounts of data.

In [None]:
# import numpy under its usual alias, np
import numpy as np

In [None]:
# Create an array from a list
np_squares = np.array([1, 4, 9, 16, 25])
type(np_squares) # an n-dimensional array
np_squares

In [None]:
# Let's add 10 to these squares
# The 'old' way was to write a for loop
# The new way with element-wise operations
np_squares + 10
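
A side-by-side sketch of the 'old' and new ways, which also shows that the * operator means different things for lists and arrays. The variable names are illustrative.

```python
import numpy as np

squares = [1, 4, 9, 16, 25]
np_squares = np.array(squares)

# With a plain list we need a loop or a comprehension
list_plus_ten = [s + 10 for s in squares]

# With an array the operation is applied to every element
array_plus_ten = np_squares + 10

# * repeats a list but multiplies the elements of an array
list_times_two = squares * 2        # a list of 10 elements
array_times_two = np_squares * 2    # an array of 5 doubled values

print(list_plus_ten, array_plus_ten, len(list_times_two), array_times_two)
```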

*Exercise:* Create a numpy array, team_played, to hold the number of games played - based on the arrays of those won, drawn and lost

In [None]:
# These are the results of three teams
team_won = np.array([4,3,1])
team_drawn = np.array([2,1,1])
team_lost = np.array([1,3,4])




*Exercise:* Create a numpy array, team_score, to hold the teams' scores (3 for a win, 1 for a draw, 0 for a loss)

In [None]:
# The Titanic dataset has a row for each passenger
# and has two integer columns, sib_sp and parch.
# To find the size of the family group, family_size,
# we add sib_sp to parch and add 1.
# Let's look at 6 passengers
titanic_sib_sp = [2, 0, 3, 1, 4, 0]
titanic_parch = [1, 2, 1, 1, 2, 3]

In [None]:
# Exercise - calculate the family size 
# for each of these passengers
titanic_family_size_np = 0 # Answer at the end

In [None]:
# Optional exercise: take the BMI data from the previous session
# and do the calculation with numpy

weights = [74, 75, 76]
heights = [1.77, 1.76, 1.75]

In [None]:
# Answer

np_weights = np.array(weights)
np_heights = np.array(heights)

# BMI is weight divided by height squared, applied element-wise
np_weights / (np_heights ** 2)


In [None]:
# EXERCISE ANSWERS

team_played = team_won + team_drawn + team_lost
team_score = 3 * team_won + team_drawn


titanic_sib_sp_np = np.array(titanic_sib_sp)
titanic_parch_np = np.array(titanic_parch)
titanic_family_size_np = titanic_parch_np + titanic_sib_sp_np + 1


In [None]:
# If you want to show off, write a function like this
def calc_family_size(sib_sp_list, parch_list):
    """
    sib_sp_list: a list of the count of siblings and spouses of each passenger
    parch_list: a list of the count of parents and children of the passenger
    Assumes lists of equal lengths and both contain only integers
    
    Returns: a numpy array of the count of the family size of each passenger
    """
    sib_sp_array = np.array(sib_sp_list)
    parch_array = np.array(parch_list)
    return sib_sp_array + parch_array + 1

titanic_family_size_np_calc = calc_family_size(titanic_sib_sp, titanic_parch)
titanic_family_size_np_calc

# Data analysis with the pandas package

Creating a dataset, adding new columns, selecting columns, filtering rows and grouping
using pandas.

*   A pandas DataFrame is a 2D object (a table).
*   A pandas Series is a 1D labelled object.
*   pandas rows have labels, which are usually unique (a strange concept to SQL people).

## Creating a pandas DataFrame

We can build a pandas DataFrame from a dict
where the keys are the column names
and the values are the columns.

In [None]:
import pandas as pd
import numpy as np

In [None]:
fb_labels = ['MCY', 'LIV', 'TOT', 'CHE', 'ARL']

fb_dict = {
        'city' : ['Manchester', 'Liverpool', 'London', 'London', 'London'],
        'team' : ['Manchester City', 'Liverpool', 'Tottenham Hotspur', 'Chelsea', 'Arsenal'],
        'won' : [6, 6, 6, 5, 5],
        'drawn' : [1, 1, 0, 2, 0],
        'lost' : [0, 0, 2, 0, 2],
        'form' : ['DWWWW', 'WWWWD', 'LLWWW', 'WWWDD', 'WWWWW']
        }


Exercise: add Arsenal, a north London team,
to the data structure above.
They have won 5, drawn none, lost 2
and have won all of their last 5 games.
(The cell above already includes the answer.)


In [None]:
fb_df = pd.DataFrame(fb_dict)
fb_df.info()
fb_df.index = fb_labels
fb_df.info()
fb_df

In [None]:
# Calculate the number of games played.
# Add a new column 'played'
fb_df['played'] = fb_df['won'] + fb_df['drawn']  + fb_df['lost']

Exercise: calculate the points for each team.   
Add a new column 'points'.
Teams get 3 points for a win, 1 for a draw, none for a loss.


In [None]:
fb_df['points'] = fb_df['won'] * 3 + fb_df['drawn']
fb_df.info()
fb_df.info  # without the parentheses we get the method itself, not its output

In [None]:
# Add a new column team_caps -
# the team name in upper case
fb_df['team_caps'] = fb_df['team'].apply(str.upper)

# The next two statements have the same effect:
# both keep the first five characters of the team name.
# Which do you prefer?
fb_df['team_short'] = fb_df['team'].str[:5]

# uses a list comprehension (no filter, so we get one value per row)
fb_df['team_short2'] = [x[0:5] for x in fb_df['team']]

## Filtering Rows And Choosing Columns

In [None]:
# use loc to filter rows and select columns
# get a single row either as a Series or as a DataFrame
mcy_series = fb_df.loc['MCY']
mcy_df = fb_df.loc[['MCY']]

In [None]:
# get a single column either as a Series or as a DataFrame
fb_team_series = fb_df.loc[:, 'team'] # Series
fb_one_col_df = fb_df.loc[:, ['team']] # 1 column DataFrame
fb_many_col_df = fb_df.loc[:, ['team', 'points']] # many column DataFrame


In [None]:
# Filter only teams that have never lost
# and keep only the team and points columns
never_lost_series = fb_df['lost'] == 0 # a Series of booleans
never_lost_df = fb_df[never_lost_series] # the Series of booleans filters the rows

# We can do this in one step
never_lost_df_wide = fb_df.loc[fb_df['lost'] == 0]
# the Series of booleans filters the rows; all columns are kept

never_lost_df_narrow = fb_df.loc[
        fb_df['lost'] == 0,
        ['team', 'points']
        ] # the first argument filters the rows, the second selects the columns
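
Boolean Series can also be combined before filtering. A minimal sketch on a cut-down copy of the course data (small_df and the mask names are made up so that the cell runs on its own):

```python
import pandas as pd

# A cut-down copy of the course data, just enough to run on its own
small_df = pd.DataFrame(
    {'team': ['Manchester City', 'Tottenham Hotspur', 'Chelsea'],
     'won': [6, 6, 5],
     'lost': [0, 2, 0]},
    index=['MCY', 'TOT', 'CHE'])

never_lost = small_df['lost'] == 0   # a Series of booleans, one per row
won_six = small_df['won'] == 6

# Combine masks with & (and) or | (or), not the Python keywords and / or
both_df = small_df.loc[never_lost & won_six, ['team', 'won']]
both_df
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, e.g. (small_df['lost'] == 0) & (small_df['won'] == 6).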


Exercise: Can we find those teams that have never lost,
at least in the last 5 games,
using only the information in the form column?

Hint:
*   use np.where to add a new column never_lost with possible values yes and no
*   search for "Python string contains"
*   filter on this never_lost column


## Group By

In [None]:
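# Let's group by city to get the totals per city

#columns_to_keep = ['city', 'played']
#fb_df_slim = fb_df.loc[:, columns_to_keep]

by_city_df = fb_df.groupby(['city'], as_index = False)
# numeric_only=True sums just the numeric columns and skips text columns such as team and form
r = by_city_df.sum(numeric_only=True)
r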
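
If we only want one total per city, another option is to pick the column before aggregating. A sketch on a small stand-alone copy of the data (small_df is a made-up name; the points values are 3 * won + drawn from the table above):

```python
import pandas as pd

small_df = pd.DataFrame(
    {'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
     'points': [19, 19, 18, 17, 15]})

# Select the column to aggregate, then apply the aggregation per group
points_by_city = small_df.groupby('city')['points'].sum()
points_by_city
```

The result is a Series whose labels are the city names, sorted alphabetically by default.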


ANSWERS

In [None]:
fb_df['points'] = fb_df['won'] * 3 + fb_df['drawn']

# How do we find one string  inside another string?
'he' in 'hello'
'the' in 'hello'
'L' in 'LLWWW'
'L' in 'WWWDD'
 # But this is a dead end!

# How does np.where() work exactly?
# With only a condition, it returns the positions where the condition is True
tmp_won = np.where(fb_df['won'] == 5)
tmp_won
tmp_lost = np.where(fb_df['won'] != 5)
tmp_lost

form_series = fb_df['form']
type(form_series)
form_series.str.contains('L')

# With three arguments, np.where() gives 'no' where the condition is True and 'yes' elsewhere
fb_df['never_lost'] = np.where(form_series.str.contains('L'), 'no', 'yes')

never_lost_v2_df = fb_df.loc[fb_df['never_lost'] == 'yes']


# Visualisation

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
survey = pd.read_csv("https://raw.githubusercontent.com/MarkWilcock/CourseDatasets/main/COVID%20Infection%20Survey/Covid%20Infection%20Survey%202021-02-12.csv")
survey['ProperDate'] = pd.to_datetime(survey['Date'], format = '%Y-%m-%d')

In [None]:
survey.head(2)

In [None]:
fig, ax = plt.subplots()
ax.plot(survey["ProperDate"], survey["London-Central"], marker = "o", color = "blue")
ax.plot(survey["ProperDate"], survey["London-Upper"], marker = "v", color = "grey")
ax.plot(survey["ProperDate"], survey["London-Lower"], marker = "v", color = "grey")
ax.set_xlabel("Date")
ax.set_ylabel("Infection Rate")
ax.set_title("Estimated infection rate in London with 95% credible intervals over a recent period")
fig.set_size_inches(4,3)
plt.show()