# Introduction to Python and Pandas

# Python basic commands

Check the current version of Python by executing the following command. In a notebook, a leading ! runs the rest of the line as a terminal (shell) command.

In [1]:
!python --version

Python 3.9.13


## Print function

In [2]:
print("Hello all - we should be around 40 students today.")

Hello all - we should be around 40 students today.


Actually, we are probably fewer than 40 students, maybe 32 or so. We would like to store the number of students in a variable.

In [5]:
students = "32"
print("Hello all - we should be around " + students +" students today.")

Hello all - we should be around 32 students today.


Note that students is stored as the string "32" here. With the integer 32, the same line raises a TypeError, because the plus (+) operator cannot concatenate a string and an integer. You can fix that by casting the integer value students to a string with str(students), or by initializing it as a string in the first place, as above. Try it!

Another convenient solution uses format, which is a very handy way of controlling what you print:

In [5]:
students = 70
print("Hello all - we should be around {} students today.".format(students))

Hello all - we should be around 70 students today.
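The casting fix and an f-string (available since Python 3.6) produce the same text as format. The sketch below is only an illustration; the names message_cast and message_f are made up for this example.

```python
students = 32

# Option 1: cast the integer to a string before concatenating with +
message_cast = "Hello all - we should be around " + str(students) + " students today."

# Option 2: an f-string inserts the value directly inside the braces
message_f = f"Hello all - we should be around {students} students today."

print(message_cast)
print(message_f)
```

Both lines print the same sentence, so the choice is mostly a matter of readability.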


## Variables

As you can see above, you don't declare variables in Python. A variable is created the first time you assign to it, and its type comes from the assigned value, so it is not necessary to declare a type.

In [6]:
x = 10
y = 'AI'

print(x, y)

10 AI
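Because the type belongs to the value and not to the name, the same name can later point to a value of a different type. A minimal sketch (type_before and type_after are names chosen for this example):

```python
x = 10
type_before = type(x)   # the name x currently refers to an int

x = 'ten'               # rebinding x to a string is allowed
type_after = type(x)

print(type_before, type_after)
```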


## Types of data

Common data types are integer, float and string. You have to be careful when you combine different data types.

In [6]:
integer_variable = 10
string_variable = 'Hello World'
float_variable = 10.12

print("integer_variable with value {} is of type: {}".format(integer_variable, type(integer_variable)))
print("string_variable with value {} is of type: {}".format(string_variable, type(string_variable)))
print("float_variable with value {} is of type: {}".format(float_variable, type(float_variable)))

integer_variable with value 10 is of type: <class 'int'>
string_variable with value Hello World is of type: <class 'str'>
float_variable with value 10.12 is of type: <class 'float'>


In [7]:
integer_variable / float_variable

0.9881422924901186

In [8]:
float_variable / integer_variable

1.012

In [9]:
float_variable + string_variable

TypeError: unsupported operand type(s) for +: 'float' and 'str'

The plus operation doesn't work with a string and a float; it does work with two strings:

In [10]:
float_variable_str = str(float_variable)
float_variable_str + string_variable

'10.12Hello World'
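Conversion also works in the other direction: a string that holds a number can be turned into an int or float before doing arithmetic. The sketch below assumes a made-up variable students_text; it also shows that mixing an int with a float gives a float.

```python
integer_variable = 10
float_variable = 10.12

# An int combined with a float gives a float
mixed_sum = integer_variable + float_variable
print(type(mixed_sum))

# A numeric string must be converted before it can be used in arithmetic
students_text = "32"
students_plus_guests = int(students_text) + 8
print(students_plus_guests)
```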

## Conditionals

In [12]:
x = 10
y = 20
z = 50

if x == 10:
    print('x is 10')

x is 10


In [13]:
if x == 10 and y == 20: 
    print('x is 10 and y is 20')

x is 10 and y is 20


In [14]:
if x == 11: 
    print('x is 11')
else: 
    print('x is not 11')

x is not 11


In [15]:
if x == 11: 
    print(' x is 11')
elif y == 20: 
    print('x is not 11 and y is 20')
else: 
    print('x is not 11 and y is not 20')

x is not 11 and y is 20


Nested if statements

In [16]:
if x == 10: 
    if y == 21 or z == 50: 
        print('Nested if executed')

Nested if executed
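A nested if like the one above can often be written as a single condition. This is one possible rewrite, not the only one; condition_met is a name introduced for this example.

```python
x = 10
y = 20
z = 50

# Same logic as the nested version: x must be 10, and either y is 21 or z is 50.
# The parentheses matter, because "and" binds tighter than "or".
condition_met = x == 10 and (y == 21 or z == 50)

if condition_met:
    print('Combined condition executed')
```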


In [17]:
number_list = [1,2,3,4,5,6]
number = 3

if number in number_list: 
    print('{} is in number_list'.format(number))

3 is in number_list


## Loops
The two most common varieties are while and for.

In [11]:
counter = 0
while counter < 5: 
    print(counter)
    counter += 1

0
1
2
3
4


In [19]:
for x in range(5): 
    print(x)

0
1
2
3
4


It is also possible to loop over items in a list

In [20]:
for number in number_list: 
    print(number)

1
2
3
4
5
6
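If you need the position of each item as well as the item itself, the built-in enumerate function yields both. A small sketch (the list pairs is only there to collect the results for this example):

```python
number_list = [1, 2, 3, 4, 5, 6]

pairs = []
for index, number in enumerate(number_list):
    # index starts at 0, just like list indexing
    pairs.append((index, number))
    print("Index: {}, number: {}".format(index, number))
```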


## Lists

Python lists are not fixed in size; they can grow or shrink as needed. You can always append extra elements to a list and slice parts of it out easily.

In [14]:
# Declaration of a list

nums = []
nums_1 = list()
nums_2 = [1,2,3,4,5]

display(nums)
display(nums_1)
display(nums_2)

[]

[]

[1, 2, 3, 4, 5]

In [22]:
# Grab a specific element in a list - the index starts at 0

display(nums_2[2])

# Add element to a list

nums_2.append(6)

display(nums_2)

3

[1, 2, 3, 4, 5, 6]
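Slicing and shrinking a list can be sketched as follows. The variable names first_three, last_two and removed are chosen for this example only.

```python
nums_2 = [1, 2, 3, 4, 5, 6]

# Slicing returns a new list; the end index is exclusive
first_three = nums_2[:3]
last_two = nums_2[-2:]
print(first_three, last_two)

# pop() removes and returns the last element
removed = nums_2.pop()

# remove() deletes the first occurrence of a value
nums_2.remove(3)
print(nums_2)

# insert() puts a value back at a given index
nums_2.insert(2, 3)
print(nums_2)
```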

## Dictionaries

A dictionary is a collection of key-value pairs that is changeable and indexed by key. Since Python 3.7, dictionaries also keep their insertion order. They can be very handy for mapping string values to integers.

In [23]:
dic = {'teacher': 0, 'student':1}

In [24]:
# Access value from dic
display(dic.get('teacher'), dic.get('student'))

0

1

In [25]:
# Loop through all keys
for key in dic:
    print(key)

teacher
student


In [26]:
# Loop through all values
for val in dic.values():
    print(val)

0
1


In [27]:
# Loop through key-value pairs: 

for key, val in dic.items(): 
    print("Key: {}, value: {}".format(key, val))

Key: teacher, value: 0
Key: student, value: 1
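As a sketch of the string-to-integer mapping mentioned above, the dictionary can encode a list of labels. The list roles and the key 'visitor' are made up for this example. Note that get returns None (or a default you supply) for a missing key, while square brackets would raise a KeyError.

```python
dic = {'teacher': 0, 'student': 1}
roles = ['student', 'teacher', 'student']

# Look up each label to get its integer code
encoded = [dic[role] for role in roles]
print(encoded)

# get() with a default avoids a KeyError for keys that are not in the dictionary
unknown = dic.get('visitor', -1)
print(unknown)
```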


## Functions

In [28]:
nums = [1,2,3,4,5]

def remove3(l): 
    # Print every item except 3; the list itself is not modified
    for item in l: 
        if item != 3: 
            print(item)

In [29]:
def square(n): 
    return n ** 2

In [30]:
remove3(nums)

1
2
4
5


In [31]:
squared = square(5)
squared

25

In [32]:
def print_list_squared(l): 
    for item in l: 
        item_squared = square(item)
        print("Item: {}, Squared {}".format(item, item_squared))

In [33]:
print_list_squared(nums)

Item: 1, Squared 1
Item: 2, Squared 4
Item: 3, Squared 9
Item: 4, Squared 16
Item: 5, Squared 25
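The functions above print their results. When the result is needed later, it is usually more useful to return it. The sketch below is a hypothetical variant of remove3, called without_three here, that returns a new list instead of printing.

```python
def without_three(items):
    """Return a new list with every 3 left out; the input list is unchanged."""
    return [item for item in items if item != 3]

nums = [1, 2, 3, 4, 5]
filtered = without_three(nums)

print(filtered)
print(nums)
```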


# Pandas


Pandas is a library for handling data. You will be using it throughout the course, primarily its dataframes. Often when you're starting a machine learning project, your data will come in different formats. Pandas makes it easy to read that data and store it in a tidy table.

In [16]:
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
# convert scikit learn dataset to pandas dataframe
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df.columns = ['sepal length', 'sepal width', 'petal length',
       'petal width']
df['species'] = pd.Series(data.target)




How would you get the first 15 rows? The last 6 rows? The whole dataframe? Try it.

In [17]:
df.tail(6)

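| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | 2 |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |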


In [18]:
# print descriptive statistics of the dataset
df.describe()

| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [19]:
# unique values within a column and their counts
df.species.value_counts()

0    50
1    50
2    50
Name: species, dtype: int64

We currently have four feature columns plus the species label. Often in an ML project, we do some feature engineering by combining two or more of the columns. In the following, we create a sepal length-to-width ratio by adding a new column computed from two existing columns.

In [20]:
# Simply create the new column by assigning to its name - in this case sepal_length_width
df['sepal_length_width'] = df['sepal length'] / df['sepal width']
df.head(5)

| | sepal length | sepal width | petal length | petal width | species | sepal_length_width |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 1.457143 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 1.633333 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 1.46875 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 1.483871 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 1.388889 |


Try adding another column that holds the ratio of petal length to petal width.

In [21]:
df['petal_length_width'] = df['petal length'] / df['petal width']



Say we want to select all the instances where petal width is smaller than 0.2. We can do this by combining a conditional expression with pandas indexing.

In [22]:
df_test = df[df['petal width'] < 0.2]
df_test

| | sepal length | sepal width | petal length | petal width | species | sepal_length_width | petal_length_width |
|---|---|---|---|---|---|---|---|
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 | 1.580645 | 15.0 |
| 12 | 4.8 | 3.0 | 1.4 | 0.1 | 0 | 1.6 | 14.0 |
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | 0 | 1.433333 | 11.0 |
| 32 | 5.2 | 4.1 | 1.5 | 0.1 | 0 | 1.268293 | 15.0 |
| 37 | 4.9 | 3.6 | 1.4 | 0.1 | 0 | 1.361111 | 14.0 |
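Several conditions can be combined in one selection. In pandas this is done with & (and) and | (or) rather than the Python keywords, and each condition needs its own parentheses. The sketch below uses a small made-up dataframe called toy, not the real iris data.

```python
import pandas as pd

# Small made-up sample, not the real iris data
toy = pd.DataFrame({
    'petal width': [0.1, 0.2, 0.1, 1.3, 2.5],
    'sepal width': [3.1, 3.5, 4.1, 2.8, 3.3],
})

# Keep rows where petal width is below 0.2 AND sepal width is above 4.0
narrow_and_wide = toy[(toy['petal width'] < 0.2) & (toy['sepal width'] > 4.0)]
print(narrow_and_wide)
```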


Try selecting all instances where petal width is smaller than sepal width

Only 5 observations have a petal width smaller than 0.2. If these were outliers, we could remove them from the original dataset, by:

In [23]:
df = df[df['petal width'] >= 0.2]
df.describe()

| | sepal length | sepal width | petal length | petal width | species | sepal_length_width | petal_length_width |
|---|---|---|---|---|---|---|---|
| count | 145.0 | 145.0 | 145.0 | 145.0 | 145.0 | 145.0 | 145.0 |
| mean | 5.878621 | 3.046897 | 3.84 | 1.237241 | 1.034483 | 1.971095 | 3.983276 |
| std | 0.817872 | 0.432219 | 1.737991 | 0.74684 | 0.811495 | 0.395262 | 1.761815 |
| min | 4.4 | 2.0 | 1.0 | 0.2 | 0.0 | 1.277778 | 2.125 |
| 25% | 5.1 | 2.8 | 1.6 | 0.4 | 0.0 | 1.5625 | 2.789474 |
| 50% | 5.8 | 3.0 | 4.4 | 1.3 | 1.0 | 2.035714 | 3.266667 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 | 2.225806 | 4.333333 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 | 2.961538 | 9.5 |


We have successfully removed those five observations; 145 rows remain.

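It is possible to drop a column either temporarily or for good. By default, drop returns a new dataframe and leaves the original untouched. If the inplace argument is set to True, the column is removed from the original dataframe for good.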
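The difference between the two ways of dropping a column can be sketched on a tiny made-up dataframe (toy and its columns a, b, c are assumptions for this example):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Default: drop returns a new dataframe, toy still has all three columns
without_b = toy.drop(['b'], axis=1)
print(list(without_b.columns))
print(list(toy.columns))

# inplace=True modifies toy itself and returns None
toy.drop(['c'], axis=1, inplace=True)
print(list(toy.columns))
```

Assigning the result to a new name, as in the first call, is often the safer habit, since the original data stays available.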

## Split Training and Test Set

When training an ML model, we split the data into features (x variables) and a target variable (y variable). This is often done in the following way. Setting axis = 1 means that drop removes a column rather than a row.

In [24]:
x = df.drop(['species'], axis = 1)
y = df.species

In [25]:
display(x.head(3))
display(y.head(3))

| | sepal length | sepal width | petal length | petal width | sepal_length_width | petal_length_width |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 1.457143 | 7.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 1.633333 | 7.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 1.46875 | 6.5 |


0    0
1    0
2    0
Name: species, dtype: int64
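The cells above separate features from the target. Dividing the rows into a training set and a test set is a further step. In practice scikit-learn's train_test_split is commonly used for this; the sketch below instead uses only pandas, on a made-up dataframe called toy, and the 80/20 ratio and random_state value are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up data standing in for the real dataframe
toy = pd.DataFrame({
    'feature': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'species': [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
})

x = toy.drop(['species'], axis=1)
y = toy['species']

# Randomly pick 80% of the rows for training; the rest become the test set
x_train = x.sample(frac=0.8, random_state=42)
x_test = x.drop(x_train.index)

# Select the matching target values by index label
y_train = y.loc[x_train.index]
y_test = y.loc[x_test.index]

print(len(x_train), len(x_test))
```

This relies on the index labels being unique, which holds for a default dataframe index.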

## Summary information about dataframe

In [30]:
# Sum of values in a data frame
print(f"Sum of values in a data frame:\n{(df.sum())}")

# Lowest value of a data frame
print("\nLowest values of a data frame:\n{}".format(df.min()))

# Highest value of a data frame
print("\nHighest values of a data frame:\n{}".format(df.max()))

# Index of the lowest value
print("\nIndex of the lowest values:\n{}".format(x.idxmin()))

# Index of the highest value
print("\nIndex of the highest values:\n{}".format(x.idxmax()))

# Average values
print("\nAverage values:\n{}".format(df.mean()))

# Median values
print("\nMedian values:\n{}".format(df.median()))

# Correlation between columns
print("\nCorrelation between columns:\n{}".format(df.corr()))

# To get these values for only one column, just select it like this
print("\nMedian of Sepal length: {}".format(df["sepal_length"].median()))

# Statistical summary of the data frame, with quartiles, median, etc.
df.describe()

Sum of values in a data frame:
sepal length          852.400000
sepal width           441.800000
petal length          556.800000
petal width           179.400000
species               150.000000
sepal_length_width    285.808748
petal_length_width    577.574964
dtype: float64

Lowest values of a data frame:
sepal length          4.400000
sepal width           2.000000
petal length          1.000000
petal width           0.200000
species               0.000000
sepal_length_width    1.277778
petal_length_width    2.125000
dtype: float64

Highest values of a data frame:
sepal length          7.900000
sepal width           4.400000
petal length          6.900000
petal width           2.500000
species               2.000000
sepal_length_width    2.961538
petal_length_width    9.500000
dtype: float64

Index of the lowest values:
sepal length            8
sepal width            60
petal length           22
petal width             0
sepal_length_width     22
petal_length_width    114
dtype: in

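KeyError: 'sepal_length'

The cell stops with a KeyError because the column is named 'sepal length' (with a space), not 'sepal_length'. Selecting df["sepal length"] instead avoids the error.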