# Basic Python Data Manipulation

<a target="_blank" href="https://colab.research.google.com/github/andrew-nash/CS6421-labs-2025/blob/main/CS6421_Lab_01.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This lab covers the basics of NumPy and Pandas, followed by a brief introduction to basic operations in TensorFlow.

While teaching Scientific Python is outside the scope of the course, we will touch on the use of these packages throughout the term.

There are plenty of resources online that cover this topic in much more detail, such as:

https://github.com/guiwitz/NumpyPandas_course



## NumPy (https://numpy.org/)

Described as 'the fundamental package for scientific computing with Python'.



In [2]:
import numpy as np

The main purpose of NumPy is to allow us to perform mathematical operations easily and efficiently over multi-dimensional arrays

### Python Lists

In [4]:
simple_list = [1,2,3,4,5]

In [None]:
print(simple_list[0])

In [None]:
print(simple_list[0:3])

In [None]:
###### EXERCISE ######
# create a new list from simple_list, with values double that of simple_list
new_list = []

# .. your code here

In [20]:
simple_2d_list = [[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]]

In [None]:
print(simple_2d_list[0])

In [None]:
print(simple_2d_list[0][0:3])

In [None]:
###### EXERCISE ######
# Is it possible to slice the 2D list, to get the first 3 elements of the first 2
# rows in a new 2D list?

small_slice = None # ... your code

### Creating NumPy Arrays


In [None]:
np_1d_list = np.array(simple_list)
np_1d_list

In [22]:
np_2d_list = np.array(simple_2d_list)
np_2d_list

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [None]:
np.zeros(5)

In [None]:
np.zeros((2,2))

When creating NumPy arrays, it is also possible to declare the data type (dtype) that they will contain

In [5]:
np_1d_list = np.array(simple_list, dtype=np.float32)
np_1d_list

array([1., 2., 3., 4., 5.], dtype=float32)

In [6]:
np_1d_list = np.array(simple_list, dtype=np.complex128)
np_1d_list

array([1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j])

In [7]:
np_1d_list = np.array(simple_list, dtype=str)
np_1d_list

array(['1', '2', '3', '4', '5'], dtype='<U1')

In [8]:
np_1d_list = np.array(simple_list, dtype=np.int16)
np_1d_list

array([1, 2, 3, 4, 5], dtype=int16)
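
The dtype also fixes how many bytes each element occupies and how values are converted. The following is a minimal sketch of inspecting and converting dtypes with `astype`; the variable names and values are illustrative and not part of the lab.

```python
import numpy as np

# Illustrative values, not taken from the lab's data
small_ints = np.array([1, 2, 3, 4, 5], dtype=np.int16)

# itemsize is the number of bytes per element; nbytes is the total for the array
print(small_ints.itemsize, small_ints.nbytes)

# astype returns a converted copy; the original array keeps its dtype
as_float = small_ints.astype(np.float64)
print(as_float.dtype, as_float.nbytes)

# Converting floats to integers truncates toward zero rather than rounding
truncated = np.array([1.7, -1.7]).astype(np.int32)
print(truncated)
```

Smaller dtypes save memory, at the cost of a narrower range of representable values.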

### Indexing and Slicing

Selecting elements of NumPy arrays is similar to selecting elements of standard Python lists, but with much more flexibility

In [None]:
np_1d_list[0]

In [None]:
np_1d_list[0:2]

In [None]:
np_2d_list[0,2]

In [None]:
np_2d_list[0:2,0:3] # slicing rows and columns together like this needs a NumPy array, not a nested list

In [9]:
elements_selected = np.array([True,False,False,True,True])
np_1d_list[elements_selected] # only the elements where the mask is True are selected

array([1, 4, 5], dtype=int16)
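
Boolean masks are usually built from a condition rather than typed out by hand. A short sketch of that, along with stepped slices and index lists, on an illustrative array (`values` is a stand-in name, not one of the lab's variables):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5], dtype=np.int16)

# A comparison produces a boolean array of the same shape
mask = values > 2
selected = values[mask]

# Slices accept a step, and negative indices count from the end
every_other = values[::2]
last_two = values[-2:]

# A list of indices picks out arbitrary positions
first_and_last = values[[0, -1]]

print(mask)
print(selected, every_other, last_two, first_and_last)
```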

### Mathematical Operations

In [12]:
np_1d_list * 5 # multiplies each element of the array by 5


array([ 5, 10, 15, 20, 25], dtype=int16)

In [13]:
np_1d_list / 2

array([0.5, 1. , 1.5, 2. , 2.5])

In [14]:
np.exp(np_1d_list)

array([  2.718282 ,   7.3890557,  20.085537 ,  54.59815  , 148.41316  ],
      dtype=float32)

In [15]:
np.max(np_1d_list)

5

In [16]:
second_np_1d_list = np.array([10,20,30,40,50])

In [17]:
np_1d_list+second_np_1d_list

array([11, 22, 33, 44, 55])


These same operations work with higher dimensional arrays

In [25]:
np_2d_list+2


array([[ 3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12],
       [13, 14, 15, 16, 17]])

In [26]:
print(np_2d_list)
print(np_1d_list)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]
[1 2 3 4 5]


What do you expect the outcome of multiplying a 1D array with a 2D array will be?

In [27]:
np_2d_list * np_1d_list # each row of the 2D array is multiplied element-wise by the 1D array; this is not matrix multiplication

array([[ 1,  4,  9, 16, 25],
       [ 6, 14, 24, 36, 50],
       [11, 24, 39, 56, 75]])
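
This behaviour is known as broadcasting: NumPy lines shapes up from the last dimension, and a dimension of size 1 (or a missing one) is stretched to match. A minimal sketch with stand-in arrays (`matrix`, `row` and `col` are illustrative names):

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4, 5],
                   [6, 7, 8, 9, 10],
                   [11, 12, 13, 14, 15]])   # shape (3, 5)
row = np.array([1, 2, 3, 4, 5])             # shape (5,)
col = np.array([[10], [20], [30]])          # shape (3, 1)

# (3, 5) * (5,): the 1D array is applied to every row
by_row = matrix * row

# (3, 5) * (3, 1): the column is applied to every column
by_col = matrix * col

# (3, 5) * (3,): the trailing dimensions 5 and 3 do not match
try:
    matrix * np.array([1, 2, 3])
    failed = False
except ValueError as err:
    failed = True
    print("cannot broadcast:", err)

print(by_row)
print(by_col)
```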

### Linear Algebra in NumPy

Basic linear algebraic operations can also be performed in NumPy

Vector-matrix multiplication

In [28]:
np_2d_list @ np_1d_list # the @ operator performs matrix multiplication

array([ 55, 130, 205])

This also works for matrix-matrix multiplication

In [31]:
a = np.array([[1,0],[0,1]])
b = np.array([[4,1],[2,2]])

a@b

array([[4, 1],
       [2, 2]])

**EXERCISE**

Why does the following code fail?

`np_1d_list @ np_2d_list`

In [32]:
np_1d_list @ np_2d_list # fails due to dimensionality

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 5)

#### Shapes

NumPy arrays have some very useful attributes - particularly size, shape and dtype

1. Size tracks the number of scalar values contained within the array (and any sub-arrays)
2. Shape contains the size of each dimension of the array - e.g. shape=(3,4) corresponds to a 3x4 matrix

In [33]:
np_2d_list.shape

(3, 5)

In [34]:
np_1d_list.shape

(5,)

In [35]:
np_2d_list.size

15

In [36]:
np_1d_list.shape # 5 elements

(5,)

This becomes even more important when working with tensors:

In [38]:
np_tensor_a = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
np_tensor_a

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [39]:
np_tensor_a.shape

(2, 2, 3)
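
The attributes are related: the size is the product of the entries of the shape, and the number of entries in the shape is the number of dimensions. A small sketch on a 3D array with the same values as above (`tensor` is an illustrative name):

```python
import numpy as np

tensor = np.array([[[1, 2, 3], [4, 5, 6]],
                   [[7, 8, 9], [10, 11, 12]]])

print(tensor.ndim)    # number of dimensions
print(tensor.shape)   # length of each dimension
print(tensor.size)    # total number of scalar values

# size equals the product of the shape entries
size_from_shape = int(np.prod(tensor.shape))

# one index per dimension selects a single scalar
element = tensor[1, 0, 2]
print(size_from_shape, element)
```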

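It is possible to transpose an array (swap its rows and columns) with `.T`. For arrays with more than two dimensions `.T` reverses the order of the axes, and for a 1D array it has no effect.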

In [41]:
np_2d_list.T

array([[ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14],
       [ 5, 10, 15]])

In [40]:
np_1d_list.T

array([1, 2, 3, 4, 5], dtype=int16)

In [43]:
np_tensor_a.T

array([[[ 1,  7],
        [ 4, 10]],

       [[ 2,  8],
        [ 5, 11]],

       [[ 3,  9],
        [ 6, 12]]])
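
To make the effect of `.T` on a 3D array concrete, the sketch below checks that it reverses the axis order, so the element at `[i, j, k]` moves to `[k, j, i]`. The names are illustrative.

```python
import numpy as np

tensor = np.array([[[1, 2, 3], [4, 5, 6]],
                   [[7, 8, 9], [10, 11, 12]]])   # shape (2, 2, 3)

transposed = tensor.T                             # shape (3, 2, 2)

# .T is the same as asking np.transpose for the axes in reverse order
same_as_reversed_axes = np.array_equal(transposed, np.transpose(tensor, (2, 1, 0)))

# The element at [0, 1, 2] is found at [2, 1, 0] after transposing
print(tensor[0, 1, 2], transposed[2, 1, 0])

# A 1D array has a single axis, so transposing leaves it unchanged
vector = np.array([1, 2, 3, 4, 5])
print(vector.T.shape)
```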

**EXERCISE**

What are the requirements (in terms of shape) for two matrices to be multiplicable?

#### Re-shaping

Given a NumPy array, it is possible to change its shape - provided that the total number of elements matches between the original and new shapes

In [44]:
flat_list = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
flat_list

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [45]:
flat_list.reshape( (2,6) )

array([[ 1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12]])

In [46]:
flat_list.reshape( (2,3,2) )

array([[[ 1,  2],
        [ 3,  4],
        [ 5,  6]],

       [[ 7,  8],
        [ 9, 10],
        [11, 12]]])

In [47]:
flat_list.reshape( (12,1) )

array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12]])

In [48]:
square_mat = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
square_mat

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [51]:
square_mat.reshape( (12,) ) # creates a 1D array with all 12 elements of the 4x3 array

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [52]:
square_mat.reshape( (1,12) )

array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])

In [53]:
square_mat.reshape( (1,6,2) )

array([[[ 1,  2],
        [ 3,  4],
        [ 5,  6],
        [ 7,  8],
        [ 9, 10],
        [11, 12]]])
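
Two further details of `reshape` are worth seeing in code: one dimension may be given as `-1`, in which case NumPy infers it from the total number of elements, and a shape with the wrong element count raises an error. A minimal sketch with an illustrative array:

```python
import numpy as np

flat = np.arange(1, 13)              # the values 1..12

# -1 asks NumPy to infer this dimension: 12 elements / 3 rows = 4 columns
auto_cols = flat.reshape((3, -1))
print(auto_cols.shape)

# reshape(-1) flattens back to 1D
flattened = auto_cols.reshape(-1)
print(flattened.shape)

# 5 * 3 = 15 does not match the 12 available elements
try:
    flat.reshape((5, 3))
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("reshape failed:", err)
```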

## Pandas (https://pandas.pydata.org/)

Pandas is a powerful data analysis and manipulation package built on top of NumPy, that targets tabular data


In [54]:
import pandas as pd

The basic data structures of Pandas are the Series and DataFrame

In [55]:
data = np.array([2,3,62,1,2])
series1 = pd.Series(data, name="Example Series 1")

In [56]:
data2 = np.log(data)
series2 = pd.Series(data2, name="Example Series 2")
series2

Unnamed: 0,Example Series 2
0,0.693147
1,1.098612
2,4.127134
3,0.0
4,0.693147


In [57]:
df = pd.DataFrame({"Series 1": series1, "Series 2": series2})
df

Unnamed: 0,Series 1,Series 2
0,2,0.693147
1,3,1.098612
2,62,4.127134
3,1,0.0
4,2,0.693147


In [58]:
df.describe()

Unnamed: 0,Series 1,Series 2
count,5.0,5.0
mean,14.0,1.322408
std,26.842131,1.616886
min,1.0,0.0
25%,2.0,0.693147
50%,2.0,0.693147
75%,3.0,1.098612
max,62.0,4.127134


Pandas supports many useful operations over DataFrames

1. Indexing over rows and columns
2. Simple summary statistics can be calculated over the columns
3. Transformations can be applied to different columns
4. SQL-like queries over the whole DataFrame, including grouping
5. SQL-like merge and intersection operations between DataFrames
6. And more ...

In [59]:
df.iloc[0:2,0]

Unnamed: 0,Series 1
0,2
1,3


In [60]:
df['Series 1'].sum()

70

In [61]:
df['Series 1'].apply(lambda x: "High" if x>6 else "Low")

Unnamed: 0,Series 1
0,Low
1,Low
2,High
3,Low
4,Low


In [62]:
df[(df["Series 1"]>5) & (df["Series 2"]>0)]

Unnamed: 0,Series 1,Series 2
2,62,4.127134


In [64]:
df[(df['Series 1']>1) & (df['Series 2']<1)]

Unnamed: 0,Series 1,Series 2
0,2,0.693147
4,2,0.693147
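
When combining conditions, each comparison needs its own parentheses: in Python `&` and `|` bind more tightly than `>` or `<`, so an unparenthesised `a > 1 & b < 1` is not read as two separate comparisons. A small sketch on a stand-in DataFrame holding the same values as `df` (`frame`, `a` and `b` are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [2, 3, 62, 1, 2]})
frame["b"] = np.log(frame["a"])

# Each comparison is wrapped in parentheses before combining with & (and)
both = frame[(frame["a"] > 1) & (frame["b"] < 1)]

# | (or) follows the same rule
either = frame[(frame["a"] > 5) | (frame["b"] == 0)]

print(both.index.tolist())
print(either.index.tolist())
```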


In [65]:
df["Series 1"].to_numpy() #NumPy array

array([ 2,  3, 62,  1,  2])

In [66]:
df.to_numpy()

array([[ 2.        ,  0.69314718],
       [ 3.        ,  1.09861229],
       [62.        ,  4.12713439],
       [ 1.        ,  0.        ],
       [ 2.        ,  0.69314718]])
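
The grouping and merge operations mentioned in the list of DataFrame operations are not shown in the cells above. The following is a minimal sketch using made-up data; the `sales` and `managers` tables and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical example tables
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 5, 7, 3],
})
managers = pd.DataFrame({
    "region": ["north", "south"],
    "manager": ["Ada", "Grace"],
})

# Similar to SQL: SELECT region, SUM(units) ... GROUP BY region
totals = sales.groupby("region")["units"].sum()

# Similar to SQL: an inner join of the two tables on the shared column
merged = sales.merge(managers, on="region", how="inner")

print(totals)
print(merged)
```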

### Reading from CSV

In [67]:
!wget https://github.com/datasciencedojo/datasets/raw/refs/heads/master/titanic.csv

--2025-01-26 14:15:43--  https://github.com/datasciencedojo/datasets/raw/refs/heads/master/titanic.csv
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv [following]
--2025-01-26 14:15:43--  https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’


2025-01-26 14:15:44 (4.51 MB/s) - ‘titanic.csv’ saved [60302/60302]



In [68]:
tianic_df = pd.read_csv("titanic.csv")

In [69]:
tianic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [70]:
selected_columns = ["Pclass","Fare"]
tianic_df[selected_columns]

Unnamed: 0,Pclass,Fare
0,3,7.2500
1,1,71.2833
2,3,7.9250
3,1,53.1000
4,3,8.0500
...,...,...
886,2,13.0000
887,1,30.0000
888,3,23.4500
889,1,30.0000
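
Empty fields in a CSV file are read as `NaN`, as in the Cabin and Age columns above. A small sketch of how that shows up in practice, reading a hypothetical three-row sample from a string instead of a file:

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as a few Titanic columns;
# the Age field of the last row is deliberately left empty
csv_text = """PassengerId,Pclass,Age,Fare
1,3,22.0,7.25
2,1,38.0,71.2833
3,3,,7.925
"""

sample = pd.read_csv(io.StringIO(csv_text))

# isna() marks missing values; summing the booleans counts them
missing_age = sample["Age"].isna().sum()

# Summary statistics skip NaN values by default
mean_age = sample["Age"].mean()

# Selecting several columns takes a list of names
subset = sample[["Pclass", "Fare"]]

print(missing_age, mean_age)
print(subset)
```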


In [73]:
######## EXERCISE
# Find the average fare for first class passengers
x=tianic_df[tianic_df['Pclass']==1]
x['Fare'].mean()

84.1546875

In [91]:
######## EXERCISE
# Get a NumPy array containing the ticket class of passengers under 18 who did not survive the sinking
under_18_not_survived = tianic_df[(tianic_df['Age'] < 18) & (tianic_df['Survived'] == 0)]
lists=under_18_not_survived['Pclass'].to_numpy()
lists

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       2, 3, 3, 3, 2, 3, 3, 3])

In [96]:
######## ADVANCED EXERCISE
# find the number of each class in this array
# hint:  https://numpy.org/doc/stable/reference/routines.html
x=np.sort(lists)
# manual approach: tally each class in a dictionary
y={}
for i in x:
  if i in y:
    y[i]=y[i]+1
  else:
    y[i]=1

# the same counts in a single NumPy call
np.unique(x,return_counts=True)

(array([1, 2, 3]), array([ 1,  2, 49]))

# Brief intro to TensorFlow (https://www.tensorflow.org)

TensorFlow is an end-to-end platform for machine learning.

## Tensors (https://www.tensorflow.org/guide/basics)

The basic data structure in TensorFlow is the tf.Tensor, which is very similar to the np.array

In [3]:
import tensorflow as tf

In [98]:
# An immutable Tensor
x = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])
# A mutable Tensor
vx = tf.Variable([[1., 2., 3.],
                 [4., 5., 6.]])


In [99]:
x

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [100]:
vx

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

## Mathematical operations

These can be performed in much the same way as NumPy

In [101]:
x = tf.constant(1.75)
x*2

<tf.Tensor: shape=(), dtype=float32, numpy=3.5>

In [102]:
tf.exp(x) #e^1.75

<tf.Tensor: shape=(), dtype=float32, numpy=5.754603>

In [103]:
A = tf.constant([[1,2,3],[4,5,6]])
B = tf.constant([[1,2,3,4],[5,6,7,8],[9,10,11,12]])

C=tf.matmul(A,B) #matrix multiplication
C

<tf.Tensor: shape=(2, 4), dtype=int32, numpy=
array([[ 38,  44,  50,  56],
       [ 83,  98, 113, 128]], dtype=int32)>

In [104]:
C.shape

TensorShape([2, 4])

## Auto-differentiation

One of the most important differences from NumPy is TensorFlow's ability to automatically differentiate user-defined functions

In [105]:
def f(x):
  y = x**2 + 2*x - 5
  return y
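
Since f is a simple polynomial, its derivative can be worked out by hand, which gives reference values for checking a gradient computed at a point such as $x = 2$:

\begin{equation}
  \begin{aligned}
  f(x) &= x^2 + 2x - 5 \\
  \frac{df}{dx} &= 2x + 2 \\
  f(2) &= 3, \qquad \left.\frac{df}{dx}\right|_{x=2} = 6
  \end{aligned}
\end{equation}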

In [107]:
x = tf.Variable(2.0)

with tf.GradientTape() as tape:
  y = f(x) # Evaluate f(x) while the tape records the operations
print(y)
g_x = tape.gradient(y, x) # Compute the gradient ∂y/∂x
g_x

tf.Tensor(3.0, shape=(), dtype=float32)


<tf.Tensor: shape=(), dtype=float32, numpy=6.0>

This also works for functions that take a vector of values as input

In [108]:
def f2(x):
  # y = 5*x + 2*exp(x)
  A = tf.constant(5.0)
  B = tf.constant(2.0)
  y = tf.add(tf.multiply(x,A), tf.multiply(B, tf.exp(x)))
  return y

In [109]:
x = tf.Variable([1.0,2.0,3.0,4.0,5.0])

with tf.GradientTape() as tape:
  y = f2(x)

g_x = tape.gradient(y, x)
g_x

<tf.Tensor: shape=(5,), dtype=float32, numpy=
array([ 10.4365635,  19.778112 ,  45.171074 , 114.1963   , 301.82632  ],
      dtype=float32)>
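
Each element of `y` depends only on the matching element of `x`, so the gradient above should equal the element-wise derivative $5 + 2e^{x}$. As a sanity check that does not need TensorFlow, the sketch below evaluates that formula and a central finite difference in NumPy; `f2_np` and the step size `h` are illustrative choices, not part of the lab.

```python
import numpy as np

def f2_np(x):
    # NumPy version of f2: y = 5*x + 2*exp(x)
    return 5.0 * x + 2.0 * np.exp(x)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Derivative worked out by hand: dy/dx = 5 + 2*exp(x)
analytic = 5.0 + 2.0 * np.exp(x)

# Central finite-difference approximation of the same derivative
h = 1e-6
numeric = (f2_np(x + h) - f2_np(x - h)) / (2 * h)

print(analytic)
print(numeric)
```

Both arrays should agree with the TensorFlow result to within floating-point precision.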

# Keras - a preview

In practice, you will not often work at the level of basic operations (like the above) to create models. Instead, your code will look something like the following

In [4]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

x_train

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])

Notice how x_train is simply a NumPy array

In [5]:
x_train.shape

(60000, 28, 28)

In [6]:
y_train.shape

(60000,)

Creating a model:

In [7]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Input(shape=(28, 28)),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

Performing Inference:

In [8]:
model(x_train[0].reshape(1,28,28)).numpy()

array([[-0.23399705,  0.2929467 ,  0.56989455,  0.02095149, -0.01088616,
         0.4030909 ,  0.01429404,  0.19887796,  0.1191797 , -0.13164903]],
      dtype=float32)
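
The final `Dense(10)` layer has no activation, so these ten numbers are unnormalised scores (logits), one per digit class, rather than probabilities. A minimal sketch of converting them with a softmax written in NumPy, using the values printed above; the `softmax` helper is written here for illustration, and since the model is untrained at this point the resulting prediction is essentially arbitrary.

```python
import numpy as np

# Logits copied from the untrained model's output above
logits = np.array([[-0.23399705, 0.2929467, 0.56989455, 0.02095149, -0.01088616,
                    0.4030909, 0.01429404, 0.19887796, 0.1191797, -0.13164903]])

def softmax(z):
    # Subtracting the row maximum keeps exp() numerically stable
    shifted = z - z.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

probs = softmax(logits)

# The predicted class is the index of the largest probability
predicted = probs.argmax(axis=-1)

print(probs.round(3))
print(predicted)
```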

Training your model:

In [9]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test,  y_test, verbose=2)


Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.8574 - loss: 0.4877
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.9572 - loss: 0.1464
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9686 - loss: 0.1021
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.9736 - loss: 0.0867
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.9763 - loss: 0.0719
313/313 - 1s - 4ms/step - accuracy: 0.9787 - loss: 0.0706


[0.07055143266916275, 0.9786999821662903]

# Optional Practice - not graded

In [None]:
!wget https://github.com/datasets/house-prices-uk/raw/refs/heads/main/data/data.csv

## NumPy

1. Create a vector v1 of the values $1,2,\dots,24$
2. Create a second vector v2 of the values $1,3,5,\dots,47$ (bonus: try to use np.arange)
3. Create a vector consisting of the values of v1/v2
4. Find the dot product between v1 and v2 using the '@' and '.T' operations. Make sure that the shape of the result is (1,)

## Pandas
1. Load the data from data.csv to a Pandas dataframe
2. Get the values of 'Price (New)' and 'Price (Modern)' for years where 'Change (All)' was negative
3. Get the mean of 'Price (New)' and 'Price (Modern)'

## TensorFlow

1. Create the following matrix as a tf.constant

\begin{equation}
  W = \left(\begin{array}{cc}
  7.0 & -5.0 \\
  2.5 & 3.0
  \end{array}\right)
\end{equation}

2. Define the following function where $Wx$ is matrix multiplication with 2-vector x:

\begin{equation}
  f(x) = \frac{1}{1+e^{-Wx}}
\end{equation}

3. Get the gradient of $f$, using tf's auto-differentiation, at

\begin{equation}
  x = \left(\begin{array}{c}
    2.56 \\
    1.75
  \end{array}\right)
\end{equation}