# Week 1 - Python Programming for Data Mining



## Python
Python is a high-level, interpreted programming language known for its readability and versatility.
- Easy-to-learn syntax
- Extensive standard library
- Supports multiple programming paradigms (procedural, object-oriented, functional)
- Large community and ecosystem of third-party libraries

Applications: Web development, Data Science, Machine Learning, Automation, etc.

Key Concepts:

- Dynamic typing: no need to declare variable types
- Indentation-based syntax: no braces, use indentation for code blocks
- Interpreted language: execute code line by line
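
As a small illustration of dynamic typing, the sketch below rebinds one name to values of different types. The variable name `value` is chosen only for this example.

```python
# The same name can refer to objects of different types over time.
value = 10
print(type(value))    # <class 'int'>

value = "ten"
print(type(value))    # <class 'str'>

value = [1, 0]
print(type(value))    # <class 'list'>
```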


In [None]:
x = 10
if x > 5:
    print("x is greater than 5")

## Jupyter Notebook and IPython
**Jupyter Notebook** is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and text. 

- Supports multiple programming languages (Python, R, Julia, etc.)
- Interactive and real-time code execution
- Rich media support: plots, text, Markdown, LaTeX, and images
- Widely used in data science, education, and scientific research.




In [None]:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)

**IPython** stands for Interactive Python, an enhanced Python shell. Its two most important features for this class are:
1. Magic commands
2. Easy execution of shell commands within Python

For a full list of magic commands, [click here](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

Magic commands we will be using in class:
- `%time` and `%timeit`:

In [None]:
def run_iteration(n):
    s = 0
    for i in range(n):
        s = s + 1
# use %time and %timeit to time an execution
%timeit run_iteration(1000000)
%timeit run_iteration(2000000)
%time run_iteration(4000000)
%time run_iteration(8000000)
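
Outside IPython, similar measurements can be taken with the standard library's `timeit` module. The sketch below is one way to do it and is not meant to reproduce what `%timeit` does internally; the repeat count of 5 and the input size are arbitrary choices for illustration.

```python
import timeit

def run_iteration(n):
    s = 0
    for i in range(n):
        s = s + 1
    return s

# Time 5 calls of run_iteration(100000) and report the average per call.
total_seconds = timeit.timeit(lambda: run_iteration(100000), number=5)
print(f"{total_seconds / 5:.6f} seconds per call")
```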

- `%%writefile`: write text to a file

In [None]:
%%writefile greet.py
for i in range(5):
    print("Hello COSC 437")

- `%run`: Run a Python script from an external file.

In [None]:
%run greet.py


- `%pwd`: Print the current working directory.


In [None]:
%pwd

- `%ls`: List the contents of the current directory

In [None]:
%ls

- `%system`: Run a shell command and capture its output as a list of lines. You may also use the `!` operator, which works in much the same way but prints the output directly.

In [None]:
%system dir

- `%matplotlib inline`: Display plots inline in Jupyter Notebook. It is usually not needed, because recent versions of Jupyter display plots inline by default.

## Introduction to NumPy
### Python's built-in lists

In [None]:
py_list = [1,2,3,4]
py_list.append(5)
print(type(py_list))
print(py_list)

In [None]:
py_list.append("what is love")
print(py_list)

In [None]:
py_list.append(["baby don't hurt me", "don't hurt me", "no more"])
print(py_list)

In [None]:
# this raises a TypeError: a list cannot be added to a number
print(py_list + 1)
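
The expression `py_list + 1` fails because a list does not define addition with a number. With plain lists, element-wise arithmetic needs an explicit loop or a list comprehension, as in this small sketch (the name `numbers` is used only for illustration):

```python
numbers = [1, 2, 3, 4, 5]

# Build a new list by adding 1 to every element.
plus_one = [n + 1 for n in numbers]
print(plus_one)         # [2, 3, 4, 5, 6]

# Note: + between two lists concatenates them instead of adding element-wise.
print(numbers + [6])    # [1, 2, 3, 4, 5, 6]
```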

### NumPy and `numpy.ndarray`

In [None]:
import numpy as np

# create a NumPy array from a Python list
number_list = [1,2,3,4,5]
arr = np.array(number_list)
print(arr)
print(type(arr))

The `numpy.ndarray` class is the core of NumPy, and we are going to use it throughout this course. See the [ndarray documentation](https://numpy.org/doc/2.1/reference/generated/numpy.ndarray.html).

In [None]:
arr + 1

In [None]:
# unlike lists, elements in an array must be of the same data type
print(arr.dtype)

# the data type of the array itself is always numpy.ndarray
print(type(arr))

# We may be working with memory-intensive tasks. Sometimes we need to estimate
# how much memory our program will require
print(arr.itemsize)
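
For a rough memory estimate, the number of elements times `itemsize` gives the bytes used by the array's data, which NumPy also exposes as `nbytes`. A minimal sketch, using an arbitrary example array of one million `float64` values:

```python
import numpy as np

# One million 64-bit floats (8 bytes each).
big = np.zeros(1_000_000, dtype=np.float64)

print(big.itemsize)               # 8 bytes per element
print(big.size * big.itemsize)    # 8000000 bytes in total
print(big.nbytes)                 # same value, computed by NumPy
print(big.nbytes / 1024**2)       # size in MiB, roughly 7.63
```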

In [None]:
# The length of an array, using Python's built-in len() function
print(len(arr))

# The size of an array; size and length are different, as we will soon see
print(arr.size)
print(np.size(arr))

# The shape of an array
print(arr.shape)

# The number of dimensions of an array
print(arr.ndim)

In [None]:
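# Let's see a two-dimensional array
two_dimensional = np.array([[1,2,3],[4,5,6]])
# len() is the size of the first dimension (2); size is the total number of elements (6)
print(len(two_dimensional), two_dimensional.size, two_dimensional.shape)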

### Creating Arrays

In [None]:
# No one actually uses the ndarray constructor
# Create an array from a list
np.array([1,2,3,4,5])

# Create an array from a range object
np.array(range(100))

# The arange() function (not a typo) does the two things together! 
np.arange(100)

In [None]:
# create an array of evenly spaced values
# linearly-spaced values. the end point is actually included
start = 0
end = 100
numbers_of_numbers = 101
np.linspace(start, end, numbers_of_numbers)

# logarithmically spaced values (default base is 10)
start = 0
end = 5
numbers_of_numbers = 6
np.logspace(start, end, numbers_of_numbers)
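
To make the relationship concrete: `np.logspace(start, end, n)` returns the base (10 by default) raised to the `n` evenly spaced exponents that `np.linspace(start, end, n)` would produce. A quick check, with small illustrative inputs:

```python
import numpy as np

exponents = np.linspace(0, 5, 6)     # [0. 1. 2. 3. 4. 5.]
powers = np.logspace(0, 5, 6)        # 1, 10, 100, ..., 100000

print(exponents)
print(powers)
# logspace matches raising 10 to the linspace values
print(np.allclose(powers, 10.0 ** exponents))    # True
```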

In [None]:
# Most commonly used are these four:
# create an uninitialized array; its values are whatever was in memory (not really random)
shape = (4,5)
arr1 = np.empty(shape)

# create an array of randomly generated values, in the range of [0,1)
arr2 = np.random.random(shape)

# create an array of zeros
arr3 = np.zeros(shape)

# create an array of all ones
arr4 = np.ones(shape)

In [None]:
# the element data type is chosen automatically
arr = np.array([1,2,3])
print(arr.dtype)

In [None]:
# dtype may be specified at creation
arr = np.array([1,2,3], dtype=np.float32)
print(arr.dtype)

# or converted with astype(), which returns a new array and leaves arr unchanged
arr.astype(np.int16)

### Indexing, Slicing, and Masking

In [None]:
arr_1d = np.arange(100)
arr_2d = np.arange(100).reshape(5,20)
arr_3d = np.arange(100).reshape(5,5,4)

In [None]:
# indexing with a simple index
print(arr_1d[3])
print(arr_2d[3])
print(arr_3d[3])
# notice that each result has one dimension fewer than the array it came from

In [None]:
# indexing with a list/array
print(arr_1d[[2,1,2,3]])
print(arr_2d[[2,1,2,3]])
print(arr_3d[[2,1,2,3]])
# useful for reordering elements and matrix permutation


In [None]:
# slicing
print(arr_1d[3:10])
print(arr_1d[10:50:2])
print(arr_2d[1:3,1:3])

In [None]:
# masking
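
Masking means indexing with a boolean array: comparing an array with a value yields a boolean array of the same shape, and indexing with it keeps only the elements where it is `True`. A minimal sketch follows; the array and the conditions are illustrative assumptions, not the original notebook's example.

```python
import numpy as np

values = np.arange(10)          # [0 1 2 3 4 5 6 7 8 9]

# A comparison produces a boolean array (the mask) of the same shape.
mask = values > 5
print(mask)

# Indexing with the mask keeps only the elements where it is True.
print(values[mask])             # [6 7 8 9]

# Conditions can be combined with & (and), | (or), and ~ (not).
print(values[(values > 2) & (values % 2 == 0)])    # [4 6 8]

# A mask can also be used to assign to the selected elements.
values[values < 3] = 0
print(values)                   # [0 0 0 3 4 5 6 7 8 9]
```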


### Aggregation functions
`numpy.ndarray` has many useful aggregation functions:
- `sum()`: Sum
- `prod()`: Product
- `mean()`: Mean, average
- `max()`: Maximum
- `min()`: Minimum
- `std()`: Standard deviation
- `var()`: Variance

In [None]:
# we use sum as an example
print(arr_1d.sum())
print(arr_2d.sum())
print(arr_3d.sum())

In [None]:
# calculate sum over an axis
print(arr_2d.sum(axis=0))
print(arr_2d.sum(axis=1))
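
One way to read the `axis` argument: it names the dimension that is summed away, so that dimension disappears from the shape of the result. A small sketch with an illustrative 2-by-3 array named `m`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)

# axis=0 sums down the rows, leaving one value per column.
print(m.sum(axis=0))            # [5 7 9]

# axis=1 sums across the columns, leaving one value per row.
print(m.sum(axis=1))            # [ 6 15]

print(m.sum(axis=0).shape, m.sum(axis=1).shape)    # (3,) (2,)
```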

### Index Functions
If we want to find out not only the maximum but also *where* it is in the array, we can use the `argmax()` function, which returns the index of the largest element.

In [None]:
random_arr = (100 * np.random.random((10))).astype(np.int16)
print(random_arr)

In [None]:
print(random_arr.argmax())

Other index functions:
- `argmin()` returns the index of the smallest element
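- `argsort()` returns the indices that would sort the array (indexing the array with them gives the sorted array)
- `np.argwhere()` takes a boolean expression over the array, such as `arr > 5`, and returns the indices where it is true; unlike the others, it is a NumPy function rather than an `ndarray` method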
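
A short sketch of these functions on a small array. The array `scores` and the threshold 25 are made up for illustration.

```python
import numpy as np

scores = np.array([40, 10, 30, 20])

print(scores.argmin())          # 1, the position of the smallest value (10)

order = scores.argsort()
print(order)                    # [1 3 2 0]
print(scores[order])            # [10 20 30 40], the sorted array

# np.argwhere takes a boolean array and returns the indices where it is True,
# one row per match (here indices 0 and 2).
print(np.argwhere(scores > 25))
```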

### `ndarray` operations

In [None]:
arr1 = np.array([1,2,3,4]).reshape(2,2)
arr2 = np.array([2,1,2,1]).reshape(2,2)
print(arr1)
print(arr2)

In [None]:
# If both operands are arrays of the same shape, operations are element-wise
print(arr1 + arr2)
print(arr1 * arr2)
print(arr1 ** arr2)

If the operands have different numbers of dimensions, the one with fewer dimensions is broadcast to match the other (when the shapes are compatible):

In [None]:
# simplest example: 1D operates with 0D (single number)
arr = np.array([1,2,3,4])
print(arr + 5)

In [None]:
# 2D operates with 1D:
arr_2d = np.array([1,2,3,4]).reshape(2,2)
arr_1d = np.array([5,6])
print(arr_2d)
print(arr_1d)
print(arr_2d + arr_1d)
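
Two shapes are compatible when, compared from the last dimension backwards, each pair of sizes is either equal or one of them is 1. The sketch below uses small illustrative arrays to show a row being broadcast, a column being broadcast, and an incompatible pair:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)       # shape (2, 3)
row = np.array([10, 20, 30])            # shape (3,)
column = np.array([[100], [200]])       # shape (2, 1)

print(grid + row)       # row is added to every row of grid
print(grid + column)    # column is added to every column of grid

# Shapes (2, 3) and (2,) do not line up from the right, so this fails.
try:
    grid + np.array([1, 2])
except ValueError as err:
    print("cannot broadcast:", err)
```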

### Vector and Matrix Multiplications
These are the basic linear algebra operations. In NumPy, vectors and matrices are represented with 1-dimensional and 2-dimensional arrays.

#### Dot product of two vectors
Let $\mathbf{a} = [a_1, a_2, ... , a_n]$ and $\mathbf{b} = [b_1, b_2, ... , b_n]$ be two vectors of the same length. Their dot product is calculated as $$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^n a_i  b_i$$

In [None]:
a = np.array([1,2,3])
b = np.array([2,3,4])
print((a * b).sum())
print(np.dot(a,b))

Matrix multiplication is defined as follows:

Given two matrices, 
$$\mathbf{A}_{m \times n} = 
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
$$ and $$\mathbf{B}_{n \times p} =
\begin{bmatrix}
b_{11} & b_{12} & \cdots & b_{1p} \\
b_{21} & b_{22} & \cdots & b_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
b_{n1} & b_{n2} & \cdots & b_{np}
\end{bmatrix}
$$, their product $\mathbf{C} = \mathbf{A} \times \mathbf{B} $, where $ \mathbf{C} $ is an $ m \times p $ matrix, is: 
$$
\mathbf{C}_{m \times p} =
\begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1p} \\
c_{21} & c_{22} & \cdots & c_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
c_{m1} & c_{m2} & \cdots & c_{mp}
\end{bmatrix}$$
Each element $c_{ij}$ of matrix $ \mathbf{C} $ is calculated as:
$$c_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj}$$
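
The formula for $c_{ij}$ can be written out directly as nested loops. The sketch below does so for two small illustrative matrices and compares the result with NumPy's `@` operator; it is meant to mirror the definition, not to be an efficient implementation.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])      # m x n = 3 x 2
B = np.array([[1, 0, 2], [0, 1, 3]])        # n x p = 2 x 3

m, n = A.shape
p = B.shape[1]
C = np.zeros((m, p), dtype=A.dtype)

for i in range(m):
    for j in range(p):
        # c_ij is the sum over k of a_ik * b_kj
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]

print(C)
print(np.array_equal(C, A @ B))    # True
```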

The dot product of two vectors is actually a matrix multiplication if you treat the first vector as a 1 by $n$ matrix (a row vector) and the second as an $n$ by 1 matrix (a column vector).
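
To check this in NumPy, reshape the first vector into a 1-by-$n$ row and the second into an $n$-by-1 column; their matrix product is a 1-by-1 matrix holding the dot product. The vectors below are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

row = a.reshape(1, 3)       # 1 x n row vector
column = b.reshape(3, 1)    # n x 1 column vector

print(np.dot(a, b))         # 20
print(row @ column)         # [[20]]
```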



In [None]:
# Matrix multiplication in NumPy
a = np.arange(1,7).reshape(3,2)
b = np.arange(8,0,-1).reshape(2,4)
print(a)
print(b)

In [None]:
# It is recommended to use np.matmul() or @ operator to perform matrix multiplication
print(np.matmul(a,b)) 
print(a@b)

# for 2d matrices, np.dot() will also work, although not recommended.
print(np.dot(a,b))

Why linear algebra? Recall how data is organized in a data warehouse. A record is just a multi-dimensional data point: each row is a data point, and each column is an attribute. Suppose we assign each attribute a weight:
- How would you calculate the weighted sum of a record?
- How would you calculate the weighted sum of all records together?
- If we have another set of weights, can we do the same for different sets of weights together?
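
One possible way to answer these questions with the operations above is sketched below, using made-up records and weights (all numbers and names are illustrative assumptions). A single record's weighted sum is a dot product, all records at once are a matrix-vector product, and several sets of weights at once are a matrix-matrix product.

```python
import numpy as np

# 3 records (rows) with 2 attributes (columns); values are made up.
records = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
weights = np.array([0.5, 2.0])              # one weight per attribute

# Weighted sum of a single record: a dot product.
print(np.dot(records[0], weights))          # 4.5

# Weighted sums of all records together: matrix times vector.
print(records @ weights)                    # [ 4.5  9.5 14.5]

# Two sets of weights, one per column: matrix times matrix.
weight_sets = np.array([[0.5, 1.0],
                        [2.0, 0.0]])        # shape (2 attributes, 2 sets)
print(records @ weight_sets)                # one column of sums per weight set
```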


### NumPy File I/O
Consider the following file:

In [None]:
%%writefile data.csv
1,2,3,4,5,6,7
2,3,4,5,6,7,8
3,4,5,6,7,8,9

In [None]:
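# Python's basic I/O
data = []
with open("data.csv") as f:
    for line in f:
        # strip the trailing newline, then split on commas
        values = line.strip().split(",")
        data.append(values)
# the values are read as strings, so convert them to numbers
arr = np.array(data).astype(float)
print(arr)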

In [None]:
# NumPy has loadtxt() function
arr = np.loadtxt("data.csv", delimiter=",")
print(arr)


In [None]:
# or a more flexible function, np.genfromtxt()
arr = np.genfromtxt("data.csv", delimiter=",")
print(arr)
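
One example of that extra flexibility is missing values: with the default settings, `np.genfromtxt()` fills an empty field with `nan` instead of failing. A minimal sketch, using an in-memory string in place of a file (the data is made up):

```python
import io
import numpy as np

# The second row has an empty middle field.
text = io.StringIO("1,2,3\n4,,6\n7,8,9")

arr = np.genfromtxt(text, delimiter=",")
print(arr)                      # the missing entry appears as nan
print(np.isnan(arr).sum())      # 1
```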

In [None]:
# you can skip rows, which is handy when your data has a header
arr = np.loadtxt("data.csv", delimiter=",", skiprows=1)
print(arr)

arr = np.genfromtxt("data.csv", delimiter=",", skip_header=1)
print(arr)

In [None]:
# save data to a text file
np.savetxt("output.csv", arr, delimiter=",")

np.savetxt("output2.csv", arr, fmt="%.2f", delimiter=",")

In [None]:
# `type` prints a file's contents on Windows; on macOS or Linux use `cat` instead
!type output.csv
!type output2.csv