# Python for Data Science

### Brief note on how to use a Jupyter notebook

If you have never used a Jupyter notebook, the next few shortcuts should prove extremely useful. Below, `:` indicates that you should release the previous key before pressing the next one, whereas `+` indicates that you should press both keys simultaneously.

The *current cell* designates the cell in which you currently have your cursor. 

* `ESC:A` (resp. `ESC:B`) adds a new cell above (resp. below) the current one.
* `ESC:D:D` deletes the current cell
* `SHIFT+ENTER` executes the current cell
* `ESC:L` toggles the line numbering in the current cell
* `ESC:M` changes the current cell to a **markdown** cell (a cell in which you can write formatted text)
* `ESC:Y` changes the current cell to a **code** cell (opposite effect)

**For Windows and Linux users: replace `ESC` with `Ctrl + M`.**

# Introduction to Python

Before we start, please check that the following libraries are installed by running the cell below:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

%matplotlib inline

print("libraries all imported, ready to go")

If an import fails, install the missing package from your terminal by typing the following (note that `sklearn` is installed under the name `scikit-learn`):

* `pip install <package name>`

# Mathematical operations

Every programming language has some way of working with numbers and doing math. Here is how Python does it:

* \+ plus
* \- minus
* \* times
* / division
* \*\* to the power
* % modulo (remainder of a division)
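
As a quick illustration of the operators listed above, here is a minimal sketch. The values `17` and `5` are arbitrary and are not part of the exercises that follow.

```python
# Illustrative values only (assumed for this sketch)
a = 17
b = 5

total = a + b        # plus
difference = a - b   # minus
product = a * b      # times
quotient = a / b     # division always returns a float in Python 3
power = a ** 2       # to the power
remainder = a % b    # modulo: what is left over after dividing a by b

print(total, difference, product, quotient, power, remainder)
# → 22 12 85 3.4 289 2
```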

In [None]:
# What is 1 + 1?

In [None]:
# What is the remainder of 7 divided by 3

# Boolean Logic

In Python we have the following terms (characters and phrases) for determining whether something is "True" or "False". Logic on a computer is all about checking whether some combination of these terms and some variables is True at that point in the program.

* and 
* or 
* not 
* != (not equal)
* == (equal)
* \>= (greater than or equal)
* <= (less than or equal)
* True
* False


In [None]:
# Less than, equal, not equal
a = 10
b = 100
print(a < b)   # a is less than b: True
print(a == b)  # a equals b: False
print(a != b)  # a is not equal (!=) to b: True

In [None]:
# Try more!
# Is 'testing' the same as "testing"?


In [None]:
# What is this?
print(1 == 1 and not ("testing" == 1 or 1 == 0))

## Challenge: XOR

The XOR gate (sometimes EOR gate, or EXOR gate and pronounced as Exclusive OR gate) is a digital logic gate that gives a true (1 or HIGH) output when the number of true inputs is odd. 

Can you use the operators we have just learnt to reproduce the following truth table?

|A|B|A XOR B|
|----| ------------- | ------------- |
|1| 1  | 0  |
|1| 0  | 1  |
|0| 1  | 1  |
|0| 0  | 0  |



In [None]:
# Try it out yourself! 

# Comments
Code comments are important! They help other people understand what your code does.
Use `#` to comment your code. See more examples below.
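
A minimal sketch of the two common comment styles. The variable names and values are made up for illustration.

```python
# A full-line comment explains the step that follows.
radius = 2.0  # an inline comment documents a single line

# Python ignores everything after the # on a line,
# so the assignment below is never executed:
# radius = 100.0

area = 3.14159 * radius ** 2  # approximate area of a circle
print(area)
```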

# Writing Functions

A function is a block of organized, reusable code that performs a single, related action. See the following example.

In [None]:
# Addition
def addition(a, b):  # define a function named addition with two parameters
    return a + b     # return hands the result back to whoever calls the function
addition(1, 2)       # call the function; the notebook displays the result

In [None]:
# Write a Quadratic Function

Run the following function. What do you see?

In [None]:
def cheese_and_crackers(cheese_count, boxes_of_crackers):
    print("You have %d cheeses!" % cheese_count)
    print("You have %d boxes of crackers!" % boxes_of_crackers)
    print("Man that's enough for a party!")
    print("Get a blanket.\n")

It's time to execute the function.

In [None]:
#Execute the function
cheese_and_crackers(10, 100)

You can include multiple statements and actions in a function.

# If statements

Run the following example.

In [None]:
weight = float(input("How many pounds does your suitcase weigh?"))
if weight > 50:
    print("There is a $25 charge for luggage that heavy.")
else:
    print("We will not charge you any extra")
    print("Thank you for your business.")

You can also embed an if-else condition in a list comprehension.

In [None]:
# If-else inside a list comprehension
# Syntax: [value_if_true if condition else value_if_false for item in sequence]
# Each element x of the list becomes 1 if x < 3, otherwise 0
a = [1, 2, 3, 4, 5, 6, 7]            # create the list
b = [1 if x < 3 else 0 for x in a]  # check the condition for each element of a
print(b)
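
The comprehension above maps every element to 1 or 0. A related form, sketched here with a made-up list, puts the `if` at the end instead, which filters the list rather than transforming each element.

```python
values = [1, 2, 3, 4, 5, 6, 7]  # illustrative list

# Conditional expression before `for`: one output per input element
flags = [1 if v < 3 else 0 for v in values]

# Trailing `if` after `for`: keeps only the elements that pass the test
small = [v for v in values if v < 3]

print(flags)  # → [1, 1, 0, 0, 0, 0, 0]
print(small)  # → [1, 2]
```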

## Exercise:
Write a function that takes two integers and returns their product. If either input has more than 2 digits, return `"Mental math is too hard for me"` instead.

In [None]:
# Try it out.

# While loop

Loops are used to repeatedly execute a block of program statements.

The while loop executes its block for as long as the expression (condition) evaluates to True. The condition is checked at the beginning of every iteration; the first time it evaluates to False, the loop stops without executing the remaining statement(s) in the block.

In [None]:
# Run the following code and see what happens
x = 0
while x**2<100:
    print(x)
    x += 1

In [None]:
x = 0
while x*2<100:
    print(x**2)
    x += 1

# For loop

The for loop iterates over the elements of a sequence. It is often
used when you have a piece of code that you want to repeat n times.


In [None]:
for x in range(10):
    print(x)

In [None]:
for x in range(3):
    print(x**2)    # x to the power 2

Can you explain the difference between while loop and for loop? 

## Exercise

Use a for loop to print out the first 10 odd numbers.

Now try it with a while loop.

In [None]:
# Print out the first 10 odd numbers

In [None]:
# Multiple conditions
# Hint: use a for loop with if / elif / else inside it
# Question: write a loop that prints x**2 for x = [1,2,..,10], 3*x for x = [11,..,20] and
#           "come on, this is boring" for x = [21,..,30]

## Challenge: Prime numbers 

Combining what we have done so far, can you write some code to print out the first 10 prime numbers?

In [None]:
# Print out the first 10 prime numbers 

# Containers

Containers in Python are objects that store other objects. The most common containers are the following:

* List
* Dictionary
* NumPy array
* Pandas DataFrame

In [None]:
#list
l_1 = [1, 2, 3, 4, 5]
print(l_1[1])

In [None]:
#Dictionaries
d_1 = dict()
d_1["key_1"] = "pair1"
d_1["key_2"] = "pair2"
d_1["key_3"] = 3
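
Once a list or dictionary exists, the usual next step is reading values back out. A minimal sketch follows; the names and ages are made up for illustration.

```python
# Hypothetical example data
ages = {"Tom": 10, "Peter": 13}
ages["Mary"] = 11                 # add a new key-value pair

print(ages["Tom"])                # look up a value by its key
print(ages.get("Susan"))          # .get returns None when the key is missing
print("Peter" in ages)            # membership test on the keys

for name, age in ages.items():    # iterate over key-value pairs
    print(name, age)

numbers = [1, 2, 3, 4, 5]
first, last = numbers[0], numbers[-1]   # list indices start at 0; -1 is the last element
middle = numbers[1:3]                   # slice: indices 1 and 2
print(first, last, middle)
```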

# Numpy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

* a powerful N-dimensional array object
* sophisticated (broadcasting) functions
* useful linear algebra (e.g. `np.linalg`), statistics (e.g. `np.mean()`), and random number capabilities (`np.random`)

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. In the following exercises, you will learn how to use numpy to solve some mathematical problems.
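
To make "broadcasting" a little more concrete, here is a minimal sketch with made-up arrays: NumPy stretches the smaller operand so that element-wise operations work without an explicit loop.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)

shifted = matrix + row               # row is added to each row of the matrix
doubled = matrix * 2                 # a scalar is broadcast to every element

col_means = matrix.mean(axis=0)      # mean of each column
centered = matrix - col_means        # subtract each column's mean from that column

print(shifted)
print(centered)
```

In this sketch `shifted` is `[[11, 22, 33], [14, 25, 36]]`, and every column of `centered` has mean zero.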

### Getting started with Numpy

Start by loading the `numpy` library (with alias `np`) and create a first array `m1` corresponding to the following matrix:

$m_1 = \left(\begin{array}{ccccc}1&2&3&4&5\\5&4&3&2&1\end{array}\right)$

Using the attribute `dtype` of the array, show that NumPy inferred that it is an array of integers.

You can also query the dimensions of the array by examining its attribute `shape`.

In [None]:
# numpy is pre-installed if you use Anaconda
# You have to import a library before using its functions
# General format: import library as short_form
# A library contains many functions/packages. If you only need a particular one,
# you can use the following command: from library import function_or_package as short_form
import numpy as np

In [None]:
# Create and display the array
m1 = np.array([[1,2,3,4,5],[5,4,3,2,1]])
print(m1)

# Display the type of the array
print(m1.dtype)

# Display its dimension
print(m1.shape)

In [None]:
# Create array with [0, 1, 2, 3, 4,5]
a = np.arange(6)
print(a)

In [None]:
# Reshape the array 
B = a.reshape(2,3)  ##change the array to 2 x 3 array
print(B)

In [None]:
# Length of array
len(a)

In [None]:
# Shape of Array
B.shape

In [None]:
# Create array
a = np.array([2,8,5])
print(a)

# Create a two-dimensional array
b = np.array([[2,8,5],[3,2,1]])
print(b)

In [None]:
# Create arrays of zeros, ones, or uninitialised values
# zeros array
print(np.zeros((2, 3)))   # np.zeros((nrow, ncol))

# ones array
print(np.ones((3, 4)))    # np.ones((nrow, ncol))

# empty array: 6 x 2, values are whatever happens to be in memory
print(np.empty((6, 2)))   # np.empty((nrow, ncol))

In [None]:
# create an array of 3 evenly spaced numbers from 0 to 4 (both ends included)
np.linspace(0, 4, 3)

In [None]:
# Indexing
a = np.arange(10)*3         # each element in array multiply by 3
print(a)
print(a[2])                 # 3rd element of the array
print(a[2:5])               # 3rd to 5th element of the array

In [None]:
# Manipulate array
print(a.mean()) # Mean
print(a.std())  # standard deviation
print(a.sum())  # sum

# Note that we can also compute the mean, standard deviation, sum, etc. in the following way
# np.mean(a)
# np.std(a)
# np.sum(a)
# The np.function(array) form adds a little call overhead, so it can be marginally slower
# We will not cover computational time in this workshop. (Maybe next one!)

### Generating Random Numbers

In [None]:
normalnumber = np.random.randn(10)  # generate 10 numbers from the standard normal distribution
print(normalnumber)

# Challenge 

Consider a 1D array of the form [5,6,7,8,9,…]. Can you write a function that builds a new vector with 4 consecutive zeros inserted between each value in the array?

In [None]:
#Function here 

# Pandas
Aim: handling data, for example merging different datasets and summarizing data.

In [None]:
import pandas as pd

### Object Creation

In [None]:
data1 = pd.DataFrame({'Name': ['Tom', 'Peter', 'Mary', 'Susan'],
                      'Age' : np.array([10, 13, 11, np.nan])}, columns=['Name','Age'])
data2 = pd.DataFrame({'Name': ['Tom', 'Peter', 'Alan', 'Susan', 'Harry'],
                      "Gender": ['Male', 'Male', 'Female', 'Female', np.nan]}, columns=['Name','Gender'])

### Import Data

In [None]:
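##delimiter: "\t", "," or " "
##header: None (no header row) or 'infer' (use the first row as column names)
##'file path' is a placeholder: replace it with the path to your own file
data = pd.read_csv('file path', delimiter=',', header='infer')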

### View Data

In [None]:
##first 2 rows of data1
data1.head(2) 

In [None]:
##last 2 rows of data2
data2.tail(2)

In [None]:
##obtain the summary of the data  
##five number summary, mean, standard deviation and count for numeric value
##count, number of unique element, element with highest frequency and the corresponding frequency for categorical variable
data1.describe()

In [None]:
##Sorting by the values of a column
data1.sort_values(by='Age')

### Selection of Data

In [None]:
##select a column
##data1.Age gives the same result
print(data1['Age'])
##select rows (here the first three)
print(data1[0:3])

In [None]:
##select by position
data1.iloc[:,0]

In [None]:
##Boolean Indexing
data1[data1.Name=='Tom']

### Missing Value

There are many methods for dealing with missing values. The easiest is to drop the affected rows. We can also fill in a particular value,
but we have to justify the choice.

In [None]:
##Drop value
data2.dropna(how='any')

In [None]:
##Filling data
data2.fillna(value='Male')
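
For a numeric column, one fill value that is often easier to justify than a constant is the column mean. A minimal sketch, using a small made-up table (the names and ages are assumptions, not workshop data):

```python
import numpy as np
import pandas as pd

# Small illustrative table with one missing age
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"],
                       "Age": [10.0, np.nan, 14.0]})

mean_age = people["Age"].mean()             # NaN values are skipped by default
filled = people.fillna({"Age": mean_age})   # fill only the Age column

n_missing_before = int(people["Age"].isna().sum())
n_missing_after = int(filled["Age"].isna().sum())
print(filled)
print(n_missing_before, n_missing_after)
```

Passing a dictionary to `fillna` restricts the fill to the named columns, so a text column such as `Name` is left untouched. Whether the mean is an appropriate choice depends on the data.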

# More to come
Datasets usually come from different sources, and we have to combine them!

We can combine them in different ways:
* concatenate
* join (merge)
* append

In [None]:
##Concatenate
##axis: 0 or 1  (concatenate along row or column)
data3 = pd.concat([data1, data2], axis=0)
data4 = pd.concat([data1, data2], axis=1)
print(data3)
print(data4)

In [None]:
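##Appending a row is a special case of concatenation
##(DataFrame.append was removed in pandas 2.0, so we use pd.concat)
newline = pd.DataFrame({'Name': ['Steven'], 'Age': [18]})
RESULT = pd.concat([data1, newline], ignore_index=True)
print(RESULT)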

In [None]:
##Merge
result = pd.merge(data1, data2, on='Name', how='inner')
print(result)
# We can use Pandas to do similar things to SQL joins
# SQL equivalents of the `how` argument:
# how='left'  : LEFT OUTER JOIN
# how='right' : RIGHT OUTER JOIN
# how='outer' : FULL OUTER JOIN
# how='inner' : INNER JOIN


# Challenge
Perform a LEFT OUTER JOIN, a RIGHT OUTER JOIN and a FULL OUTER JOIN on the dataframes above, and compare the results with the INNER JOIN. What do you see?

In [None]:
# Exercise
# Outer Join

# Left Join

# Right Join

# Can you explain the output?

# Matplotlib
We like figures!
Matplotlib is a library that helps visualize data.
More advanced modules: ggplot and seaborn (not included in this workshop).

In [None]:
import matplotlib.pyplot as plt

### Basic Plotting 

In [None]:
# create some data points for y = sin(x) using numpy
xmin = -10
xmax = 10
npts = 100
x = np.linspace(xmin, xmax, npts)
y = np.sin(x)

In [None]:
# Plot y=sin(x)
plt.plot(x, y, "b--")  # blue dashed line
plt.plot(x, y, "ro")   # red circles, no line
plt.show()             # show the plot

In [None]:
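# Save the same image to a file
# The format follows the file extension, e.g. pdf, png or jpeg
# Redraw first: once a figure has been shown, the current figure is empty
plt.plot(x, y, "b--")
plt.plot(x, y, "ro")
plt.savefig("sin_plot.pdf")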

In [None]:
##Labels and legend (add descriptions to the figure)
plt.figure()                                  ##Create a new figure
plt.plot(x, y, linewidth=4, label='sin(x)')   ##Plot, with a label for the legend
plt.title('y=sin(x)')                         ##Title for the figure
plt.xlabel('x')                               ##x-axis label
plt.ylabel('y')                               ##y-axis label
plt.legend()                                  ##Show the legend
plt.show()                                    ##Show the figure

# Subplot
We can draw several plots in the same figure.

In [None]:
# Example
fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0, 0].plot(x, y, 'b-')   # top-left panel
axes[1, 1].plot(x, y, 'r-')   # bottom-right panel
plt.show()

### Exercise

In [None]:
# General subplot
# We want to plot four y=sin(x) graphs with different sizes and colors
# Hint: subplot2grid

# Answer


# Scikit-learn

Scikit-learn is a set of Python modules for machine learning and data mining, widely used by ML practitioners. There are other ML modules in Python, such as Keras, TensorFlow, XGBoost and PyTorch, that serve different purposes.

In the remaining section, we will introduce the basic scikit-learn framework by working through an example.

## k-Nearest Neighbours

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
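
The majority vote described above can be sketched in a few lines of plain Python. This is an illustrative toy version, not the scikit-learn implementation: the function name `knn_predict`, the sample points and the labels are all made up, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Two small, well-separated clusters (hypothetical data)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (4.0, 4.0), (4.2, 3.9), (3.8, 4.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # → a
print(knn_predict(train_points, train_labels, (4.1, 4.0), k=1))  # → b
```

With `k=1` the function simply returns the label of the single closest point, which matches the special case mentioned above.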

In the following example, we will be using the famous IRIS dataset.

In [None]:
#Import the modules and load the data
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
print("feature names: %r" %iris.feature_names)
print("target names: %r" %iris.target_names)
print(np.unique(iris_y))

## Training set and testing set

While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit it, as this would not evaluate the performance of the estimator on new data. This is why datasets are often split into training and test data.

In [None]:
# Split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0) #what have we done here?
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-50]]
iris_y_train = iris_y[indices[:-50]]
iris_X_test  = iris_X[indices[-50:]]
iris_y_test  = iris_y[indices[-50:]]
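
As an alternative to the manual permutation above, scikit-learn provides the helper `train_test_split`. The sketch below uses a small synthetic dataset (made up for illustration) rather than the Iris data, and the choice of `stratify=y` is an assumption about what you may want, not part of the workshop code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 15 samples, 2 features, 3 balanced classes
X = np.arange(30).reshape(15, 2)
y = np.array([0, 1, 2] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=6,       # an integer is read as the number of test samples
    random_state=0,    # fixes the shuffle so the split is reproducible
    stratify=y,        # keep the class proportions equal in both parts
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying keeps every class represented in the test set, which a purely random split does not guarantee on small datasets.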

In [None]:
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(iris_X_train, iris_y_train) 

knn_result = knn.predict(iris_X_test)
print(knn_result)

In [None]:
#How would you evaluate the result?
#Compute the misclassification error!

# Challenge time! 

Can you come up with a way to improve the performance of the algorithm?

# Final challenge

This final challenge will be a wrap up of all the things we have learnt today! In the following, you will be analysing an online retail case study. The dataset is available from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/Online+Retail). The dataset has been modified for the workshop.

We are interested in modelling the behaviour of the customers ("returning" or "non-returning") based on their purchase patterns (such as balance, maximum spent and number of orders). Follow the instructions below to complete the challenge!

In [None]:
# Load the data using pd.read_csv, set "CustomerID" as the index column
retail = pd.read_csv("retail_data.csv", index_col="CustomerID")

### Exploratory data analysis
Answer the following questions.

In [None]:
# What is the dimension of the dataset?

In [None]:
# Is there any missing data?

In [None]:
#Which customer spent the most? 


In [None]:
#Which customer spent the least?


In [None]:
#Who orders the most frequently?

In [None]:
#Plot a histogram of the balance to understand its approximate distribution


In [None]:
# Plot a bar chart showing the count of each number of orders in the dataset

In [None]:
# Write a for loop to change the class of each customer to 1 if they are returning, 0 if they are not.

# Is there a quicker way?

### k-NN again

In [None]:
# Model the returning class as the outcome: run a k-NN algorithm with a number of neighbours of your choice.
# Remember that you should never use your test set during the training phase

In [None]:
# Split into input matrix X and class vector y

In [None]:
# Evaluate your performance
# Look up the ROC curve online, plot a ROC curve and return an AUC score. Compare it with your neighbour's. Who did better?