<a href="https://colab.research.google.com/github/DavidLedvinka/Python-for-Math-and-Statistics-Workshop/blob/master/Python_for_Math_and_Stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Python Workshop for Mathematics and Statistics
===

### About Python

* Python is an interpreted, high-level, general-purpose programming language \\
(R is interpreted and high-level but not general-purpose)
* Python supports multiple programming paradigms, including procedural, object-oriented, and functional, but is primarily object-oriented \\
(R is also multi-paradigm but is primarily functional)


### Basic Python Syntax and Structure


In Python we have the typical basic data types, such as:

In [0]:
3     # Integers
3.14  # Floats (floating-point numbers)
"pi"  # Strings (can use double quotes or single quotes)
True  # Booleans

We have data structures such as:

In [0]:
[3, ".1", 4]            # Lists
(3, ".1", 4)            # Tuples (like lists but immutable)
{3, "1", 4}             # Sets (unordered, with no duplicate elements)
{"pi": 3.14, "e": 2.71} # Dictionaries (values looked up by keys of your choice)
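A short sketch of how these four structures differ in practice. The variable names are only for illustration and are not used elsewhere in the workshop.

```python
# Lists are mutable: elements can be changed in place
values = [3, ".1", 4]
values[1] = 0.1

# Tuples are immutable: assigning to an element raises a TypeError
point = (3, ".1", 4)
try:
  point[1] = 0.1
  tuple_changed = True
except TypeError:
  tuple_changed = False

# Sets keep only one copy of each element and have no guaranteed order
unique = {3, 3, 4, 4, 4}

# Dictionaries look up values by key rather than by position
constants = {"pi": 3.14, "e": 2.71}
pi_value = constants["pi"]

print(values, tuple_changed, unique == {3, 4}, pi_value)
```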

We assign variables by:

In [0]:
x = 3.14

We define functions by:

In [0]:
def foo(x):
  val = 2 * (x ** 2) + 1
  output = "2({})^2 + 1 = {}".format(x,val)
  return output

In [0]:
a = foo(2)
print(a)

We have the following control structures:

In [0]:
# While Loop
i = 1
s = 0
while i <= 10:
  s += i
  i += 1

print(s)

In [0]:
# For Loop
s = 0
for i in range(11): # INCLUDES i = 0 !!
  s += i

print(s)
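The two loops above give the same sum because `range(11)` runs from 0 to 10, and adding 0 changes nothing. A small sketch of how the endpoints of `range` behave:

```python
# range(stop) starts at 0 and stops just before `stop`
first_five = list(range(5))
print(first_five)            # [0, 1, 2, 3, 4]

# range(start, stop) lets you choose the starting value
one_to_ten = list(range(1, 11))
print(one_to_ten)            # the integers 1 through 10

# The built-in sum gives the same total as the loops
total = sum(range(1, 11))
print(total)                 # 55
```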

In [0]:
# You can also for loop over a list
for letter in ["H","E","L","L","O"]:
  print(letter)

In [126]:
# If/Else Statements
a = 0
if a == 420:
  print("a is 420")
else:
  print("a is not 420")

a is not 420


R, being a language focused on statistical computing, has some powerful core data structures that serve that purpose. Most R programs revolve around manipulating arrays and data frames.

Python, on the other hand, is general-purpose, so we generally need to import libraries that give us the tools for specific kinds of tasks, including libraries whose data structures behave similarly to arrays (NumPy) and data frames (Pandas).

Furthermore, Python is object-oriented, meaning we will often be dealing with different kinds of data structures (objects) built specifically to handle a certain set of behaviours or tasks.

Here are a few more tricks with lists that we will use later on:

In [0]:
# You can extract elements by indexing
[3,".1",4][0]

In [0]:
# You can "slice" them
["H","E","L","L","O"][1:4]

In [0]:
# You can do list comprehension
[i for i in range(11) if i > 5]

In [0]:
# You can add them to concatenate
['a','b','c'] + ['e','f','g']

In [0]:
# And multiply too!
2 * ("H" + 20 * "A" + " ")

### What is Object Oriented Programming?

Object-oriented programming is a programming paradigm where the structure of programs is based around "objects" that contain data (called attributes) and have a set of functions which can be applied to them (called methods). 

In Python: \\
A  **Class** is a template for a type of object \\
An **Object** is an instance of a class

A Class defines the set of attributes that its objects should have and methods that can be applied to them.


#### Example

In Python, libraries are called modules. We are going to import a module to help us with our example. This module is also extremely useful for data science and math applications in general.

In [0]:
import numpy as np

This module gives us an array data structure which behaves similarly to the arrays from R.
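As a rough illustration of that similarity, arithmetic on NumPy arrays is applied element by element, much like arithmetic on R vectors. The arrays `v` and `w` below are made up for the example.

```python
import numpy as np

v = np.array([1, 2, 3])
w = np.array([10, 20, 30])

# Arithmetic on arrays is elementwise
print(v + w)    # elementwise sum: 11, 22, 33
print(2 * v)    # each element doubled: 2, 4, 6
print(v ** 2)   # each element squared: 1, 4, 9

# Contrast with plain lists, where + concatenates instead
joined = [1, 2, 3] + [10, 20, 30]
print(joined)   # [1, 2, 3, 10, 20, 30]
```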

We will make a Class whose objects represent polynomials: $a_n x^n + \ldots + a_1 x + a_0$ \\
<details>
<summary> What attributes (data) should a polynomial have? </summary>
An array of its coefficients
</details>


In [0]:
class Polynomial:

  # Define Initializer
  def __init__(self, coefficients, variable):
    # Coefficients should be a list where the kth value is
    # the coefficient of x^k
    self.coefficients = coefficients
    # Variable should be a string
    self.variable = variable

In [0]:
# Create a Polynomial object
p = Polynomial([1,2,3],'x')

In [0]:
# Get its coefficients
p.coefficients
# Get its variable
p.variable

<details>
<summary> What are some possible methods for a polynomial? </summary>

* Addition
* Subtraction
* Multiplication
* Differentiation

</details>

In [0]:
class Polynomial:

  # Define Initializer
  def __init__(self, coefficients):
    # Coefficients should be a NumPy array where the kth value is
    # the coefficient of x^k
    self.coefficients = coefficients

  # Define what the + operator does for Polynomial objects
  # (this returns the array of summed coefficients)
  def __add__(self, summand):
    return self.coefficients + summand.coefficients

In [0]:
p = Polynomial(np.array([1,4,5]))
q = Polynomial(np.array([2,2,1]))
print(p.__add__(q))
print(p + q)

As an exercise, try to implement one (or several) of the other method ideas for polynomials.
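One possible starting point is sketched here; it is not the only reasonable design. Unlike the cell above, this `__add__` returns a new `Polynomial` rather than a bare array, and it pads the shorter coefficient array with zeros so polynomials of different degree can be added. The method name `differentiate` is a hypothetical choice.

```python
import numpy as np

class Polynomial:

  def __init__(self, coefficients):
    # The kth value is the coefficient of x^k
    self.coefficients = np.asarray(coefficients)

  def __add__(self, summand):
    # Pad the shorter coefficient array with zeros so the lengths match
    n = max(len(self.coefficients), len(summand.coefficients))
    a = np.pad(self.coefficients, (0, n - len(self.coefficients)))
    b = np.pad(summand.coefficients, (0, n - len(summand.coefficients)))
    # Return a new Polynomial so results can be combined again
    return Polynomial(a + b)

  def differentiate(self):
    # d/dx of a_k x^k is k * a_k x^(k-1): drop a_0 and scale the rest
    powers = np.arange(1, len(self.coefficients))
    return Polynomial(powers * self.coefficients[1:])

p = Polynomial([1, 4, 5])   # 1 + 4x + 5x^2
q = Polynomial([2, 2])      # 2 + 2x
print((p + q).coefficients)             # coefficients of 3 + 6x + 5x^2
print(p.differentiate().coefficients)   # coefficients of 4 + 10x
```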

We can define a function which makes it easier to create polynomials

### Web Scraping

Web scraping is the process of extracting data from websites. Python has some of the best libraries (modules) for automating this task. 

We are going to need the following modules:

In [0]:
import requests # get html from web
from bs4 import BeautifulSoup # html parser

The **requests** module lets us get the HTML code for the webpage that has the target data. The **BeautifulSoup** module is an HTML parser that lets us search that HTML for the desired data. There is an alternative to requests called **selenium**, which lets you automate a web browser (Firefox or Chrome). One should use requests over selenium whenever possible, since it is easier to use, much faster, and safer. However, there are times when a tool like selenium is necessary, for example if one needs to interact with JavaScript on the page to access the target data.

Suppose we wanted to get the game by game stats of every player in the NBA. First we need to find a website that has the desired data:

https://www.basketball-reference.com

Since the game by game stats appear on individual pages per player, getting the stats for every player would require a loop over all players in the NBA. But first, let's just write a program that extracts the data for one player.

Let's use Kawhi Leonard as a template. The url with his game by game data is:

https://www.basketball-reference.com/players/l/leonaka01/gamelog/2020/

First we need to get the html code from the webpage:


In [0]:
url = "https://www.basketball-reference.com/players/l/leonaka01/gamelog/2020/"
response = requests.get(url)
# Stop early with an error if the request failed (e.g. 404 or 429)
response.raise_for_status()
html = response.content

In [0]:
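# html is a bytes object, so decode it to text; show only the first 1000 characters
print(html.decode("utf-8", errors="replace")[:1000])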

Next we want to create a BeautifulSoup object:

In [0]:
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

Then find the table with the game by game data:

In [0]:
gamelog = soup.find('table', {'id': 'pgl_basic'})
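The same kind of lookup can be tried offline on a small hand-written page. The HTML below is made up for illustration and is far simpler than the real site; only the table id matches the one used above. It also shows why checking for `td` cells separates data rows from header rows, which use `th` cells instead.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for real HTML
html = """
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="pgl_basic">
  <tr><th>G</th><th>PTS</th></tr>
  <tr><td>1</td><td>30</td></tr>
  <tr><td>2</td><td>27</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# find returns the first <table> whose id attribute is "pgl_basic"
table = soup.find("table", {"id": "pgl_basic"})

rows = []
for row in table.find_all("tr"):
  cells = row.find_all("td")
  # Header rows use <th> cells, so they have no <td> and are skipped
  if cells:
    rows.append([td.text for td in cells])

print(rows)   # [['1', '30'], ['2', '27']]
```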

Before extracting the data, we need a place to put it. Let's create a csv file:

In [0]:
csv = open("kawhi_leonard_gamelog.csv", 'w')

The 'w' stands for "write" and means that the file will be overwritten if it already exists and created if it doesn't exist. There is also a read option 'r' and append option 'a'.
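A small sketch of the three modes in action. It writes to a temporary folder so no real files are touched; the file name `demo.txt` is arbitrary.

```python
import os
import tempfile

# Work in a temporary directory so no real files are affected
folder = tempfile.mkdtemp()
path = os.path.join(folder, "demo.txt")

f = open(path, "w")   # creates the file (or empties an existing one)
f.write("first\n")
f.close()

f = open(path, "w")   # opening with 'w' again discards "first"
f.write("second\n")
f.close()

f = open(path, "a")   # 'a' keeps the existing content and adds to the end
f.write("third\n")
f.close()

f = open(path, "r")   # 'r' opens the file for reading only
contents = f.read()
f.close()

print(contents)
```

After these steps the file holds only the lines `second` and `third`, since the second `'w'` call wiped out `first`.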

In [0]:
header = "Rk,G,Date,Age,Tm,Away,Opp,Res,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-\n"
csv.write(header)

and then extract the data:

In [0]:
# Initialize a counter for the rank
rank = 0
# Loop over all rows in the table
for row in gamelog.find_all('tr'):
  # Initialize a string for the row entry to the csv
  row_entry = ""
  # Get the data in the row
  row_data = row.find_all('td')
  # Only process data rows (header rows contain no <td> cells)
  if row_data: 
    # Add one to the rank
    rank += 1
    # Add the rank to the entry 
    row_entry += "{},".format(rank) 
    # Get the first entry G
    G = row_data[0].text
    # Add G to the row entry
    # ... in the case that G is empty
    if not G:
      row_entry += "DNP,"
    # ... in the case that G is nonempty
    else:
      row_entry += "{},".format(G)
    # Add the next three entries Date,Age,Tm
    row_entry += ','.join([td.text for td in row_data[1:4]]) + ','
    # Set Away to 1 if player was on the away team, else 0
    if row_data[4].text:
      row_entry += "1,"
    else:
      row_entry += "0,"
    # Add the next three entries Opp, Res, GS
    row_entry += ','.join([td.text for td in row_data[5:8]]) + ','
    # If the player didn't play, set all the remaining entries to DNP
    if not G:
      row_entry += "DNP," * 20 + "DNP"
    else:
      # Otherwise add the remaining entries
      row_entry += ",".join([td.text for td in row_data[8:]])
    # Add a newline character to the end of the entry
    row_entry += "\n"
    # Write the row to the file
    csv.write(row_entry)

Finally, close the file:

In [0]:
csv.close()
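An alternative sketch, which is not what the workshop code above does: a `with` block closes the file automatically, even if an error occurs part-way through, and the standard library's `csv` module takes care of the commas and newlines. The rows and the file name are made up for illustration. Since a variable called `csv` would hide the module of the same name, the file object is called `f` here.

```python
import csv
import os
import tempfile

# Made-up rows standing in for scraped data
header = ["Rk", "G", "PTS"]
rows = [[1, "1", "30"], [2, "DNP", "DNP"]]

path = os.path.join(tempfile.mkdtemp(), "gamelog_demo.csv")

# The file is closed automatically when the with block ends
with open(path, "w", newline="") as f:
  writer = csv.writer(f)
  writer.writerow(header)
  writer.writerows(rows)

# Read the file back to check what was written (values come back as strings)
with open(path, newline="") as f:
  read_back = list(csv.reader(f))

print(read_back)
```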

I feel obligated to mention that many websites have a policy on the use of bots on their site. You can usually find it at *url*/robots.txt, for example: https://www.basketball-reference.com/robots.txt
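Python's standard library can read such a policy for you. The sketch below uses `urllib.robotparser` on a made-up set of rules, so it runs without a network connection; the rules and the `example.com` URLs are assumptions for illustration and do not reflect any real site's policy. For a real site you would instead point the parser at the live file with `set_url` and `read`.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration, NOT the real policy of any site
example_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch(user_agent, url) reports whether the rules allow a request
allowed = parser.can_fetch("*", "https://example.com/players/")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```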