# Lab 2 Transition to Python from R

In this lab, you will see the major differences between R and Python. After this lab, you are expected to perform these basic operations in Python: 

1. [Documentation](#Documentation)
2. [Declare variables](#Declare-Variables)
3. [Data types](#Data-Types)
    - [Int and Float](#Int-and-Float)
    - [String](#String)
    - [Boolean](#Boolean)
    - [List](#List)
    - [Tuple](#Tuple)
    - [Set](#Set)
    - [Dictionary](#Dictionary)
    - [Object Reference](#Object-Reference)
4. [If-Else](#If-Else)
5. [Loops](#Loops)
6. [Functions](#Functions)
7. [Import Libraries](#Import-Libraries)
8. [Numpy Arrays](#Numpy-Arrays)
9. [Data Frames](#Data-Frames)
10. [Some practice](#Some-Practice)
    - [Binary Search](#Binary-Search)
    - [Two-Sum](#Two-Sum)
    - [Pandas](#Pandas)

## Documentation

In R, you can access the documentation of a function by typing `?function_name`. Python has a built-in `help()` function for the same purpose, but in Python, as in most modern programming languages, we usually read the full documentation online. If you are using an IDE such as PyCharm or a code editor such as VS Code, you can also see a function's documentation by hovering over its name.

For example, to find out what a `list` can do, I would go to the [official documentation](https://docs.python.org/3/tutorial/datastructures.html) to find the answer.

Reading the documentation is a very important skill for a programmer. It is not only about how to use a function, but also about how to use it efficiently. 
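
Documentation is also available from inside Python itself: every documented object carries its text in a docstring. The snippet below is a minimal sketch using the standard-library `inspect` module; `list.append` is just an illustrative choice.

```python
import inspect

# inspect.getdoc returns the cleaned-up docstring of any object
doc = inspect.getdoc(list.append)
print(doc)

# In an interactive session, help(list.append) prints the same
# text together with the function signature.
```

This is handy for a quick reminder, while the online documentation remains the better place for full explanations and examples.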

## Declare Variables

In R, you can declare a variable like this: 
```R
x <- 5
```
But in Python, you declare a variable like this:

In [1]:
x = 5
x

5

## Data Types

Let's take a look at the most basic data types in Python: int, float, string, and boolean.

### Int and Float
Integers exist in most programming languages, where they usually have a fixed width. In C++, for example, an `int` is typically 32 bits, so it can store a number from -2^31 to 2^31 - 1. Python is very different: there is no such limit, and you can store a very large number in an integer variable.

Floats are numbers with a decimal point. In Python, the built-in float type is an implementation of a double-precision (64-bit) floating-point number, typically following the IEEE 754 standard. This means it has:

- About 15–17 significant digits of precision.
- A magnitude range of roughly 2.2 × 10^(-308) to 1.8 × 10^(308) for normal values.
- Special representations for infinity and NaN.

Unlike Python’s int, which can grow arbitrarily large given enough memory, the size and precision of a float are limited by its internal binary representation.
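
The contrast between the two types can be checked directly. The sketch below reads the float limits of the running interpreter from the standard-library `sys.float_info`; the specific numbers used (`2 ** 100`, `0.1 + 0.2`) are only illustrative.

```python
import math
import sys

# int: arbitrary precision, so this does not overflow
big = 2 ** 100
print(big)

# float: fixed 64-bit representation with hard limits
print(sys.float_info.max)  # largest finite float
print(sys.float_info.min)  # smallest positive normal float
print(sys.float_info.dig)  # decimal digits that can be faithfully represented

# limited precision means exact comparisons can surprise you
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True

# NaN is not equal to anything, including itself
nan = float('nan')
print(nan == nan)       # False
print(math.isnan(nan))  # True
```

When comparing floats, `math.isclose` is usually a safer choice than `==`.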

In [65]:
# declare an integer
x = 5
type(x)

int

In [71]:
# store a very big number into integer
x = 300000000000000
type(x)

int

In [66]:
# declare a float
y = 5.0
type(y)

float

In [72]:
# inf in float
y = float('inf')
y

inf

In [73]:
# nan in float
y = float('nan')
y

nan

### String

In Python, you can declare a string like this:

In [74]:
x = 'Hello, World!'
print(x)

Hello, World!


Concatenating strings in Python is very easy:

In [75]:
x = 'Hello, ' + 'World!'
print(x)

Hello, World!


If you have a variable that you want to insert into a string, you have several options:

In [76]:
name = 'William'

# option 1 with f-string
x = f'Hello, {name}!'

# option 2 with format function
y = 'Hello, {}!'.format(name)

# option 3 with % operator
z = 'Hello, %s!' % name

# option 4 with + operator
w = 'Hello, ' + name + '!'

print(x)
print(y)
print(z)
print(w)

Hello, William!
Hello, William!
Hello, William!
Hello, William!
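
Of these options, f-strings are often the most convenient because they also accept a format specification after a colon. A short sketch, with made-up values chosen only for illustration:

```python
pi = 3.14159265
count = 1234567
ratio = 0.256
lang = 'R'

rounded = f'{pi:.2f}'     # two digits after the decimal point
grouped = f'{count:,}'    # thousands separator
percent = f'{ratio:.1%}'  # percentage with one decimal
padded = f'{lang:>5}|'    # right-align in a field of width 5

print(rounded)  # 3.14
print(grouped)  # 1,234,567
print(percent)  # 25.6%
print(padded)   #     R|
```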


### Boolean

Boolean is a data type that has only two possible values: True and False. In Python, boolean variables are defined by the True and False keywords.

In [77]:
x = True
type(x)

bool

In [78]:
y = False
type(y)

bool

In [80]:
print(x and y)
print(x & y)

False
False


In [81]:
print(x or y)
print(x | y)

True
True


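1 and 0 are not boolean values in Python; they are integers. But you can use them where a boolean is expected: 0 is treated as False and any nonzero number, such as 1, is treated as True. Note that `and` and `or` return one of their operands rather than a `bool`, which is why the results below are 0 and 1.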

In [84]:
1 and 0

0

In [85]:
1 and 1

1

In [87]:
0 and 0

0

In [86]:
1 or 0

1
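
The rules behind these results can be sketched in a few lines. The values are arbitrary; the point is the difference between the logical operators (`and`, `or`) and the bitwise operators (`&`, `|`).

```python
# bool is a subclass of int, so True and False behave like 1 and 0
print(isinstance(True, int))  # True
print(True + True)            # 2

# any object can be tested for truthiness
print(bool(0), bool(2), bool(''), bool('a'))  # False True False True

# and / or return one of their operands, not necessarily a bool
print(2 and 3)  # 3  (first operand is truthy, so the second is returned)
print(0 or 5)   # 5  (first operand is falsy, so the second is returned)

# & and | are bitwise operators; on integers they differ from and / or
print(2 & 1)    # 0  (binary 10 & 01)
print(2 and 1)  # 1
```

For `True` and `False` themselves the two families give the same answer, but for other values it is safer to use `and` and `or` for logic.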

In Python, you will find these data structures very useful: list, tuple, set, and dictionary. They are all built-in data types; we will talk about useful libraries later.

### List

A list can be declared like this:

In [108]:
x = [1, 20, 32, 41, 57]
type(x)

list

Please note that indexing in Python, as in most common programming languages, starts from 0, whereas R starts from 1. So, if you want to access the first element in the list, you should use `x[0]` instead of `x[1]`.

In [109]:
# access the first element
x[0]

1

In [110]:
# access the last element
x[-1]

57

In [112]:
# remove specific element
x.remove(20)
x

[1, 32, 41, 57]

In [113]:
# remove specific element by index
x.pop(2) # remove the third element
x

[1, 32, 57]

In [114]:
# add an element at index
x.insert(1, 20) # add 20 at index 1
x

[1, 20, 32, 57]

Note that lists are very flexible: a single list can even store different types of data.

In [118]:
x = ["A", 1, True, False]
x

['A', 1, True, False]

And it is very easy to concatenate two lists in Python:

In [119]:
x = [1, 2, 3]
y = [4, 5, 6]
z = x + y
z

[1, 2, 3, 4, 5, 6]

In [121]:
x = [1,2,3]
y = [4,5]
x.extend(y)
print(x)

[1, 2, 3, 4, 5]
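
A few more everyday list operations, shown as a small sketch on the same example values used earlier in this section:

```python
x = [1, 20, 32, 41, 57]

print(len(x))   # 5
print(x[1:3])   # [20, 32]  (start included, stop excluded)
print(x[:2])    # [1, 20]
print(x[-2:])   # [41, 57]
print(32 in x)  # True

x.append(60)    # add one element at the end
print(x)        # [1, 20, 32, 41, 57, 60]
print(sorted(x, reverse=True))  # [60, 57, 41, 32, 20, 1]
```

Note that `x + y` builds a new list, while `append` and `extend` modify the existing list in place.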


### Tuple

A tuple is something like a list, but it is immutable. Usually, a tuple is used when you want to make sure the data is not changed.

In [116]:
# unchangeable
x = (1, 2, 3, 4, 5)
x[1] = 3


TypeError: 'tuple' object does not support item assignment

In [117]:
# access the first element
x[0]

1

In [115]:
# convert a string to a tuple
x = "hello"
x = tuple(x)
x

('h', 'e', 'l', 'l', 'o')
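
Tuples are also commonly used for unpacking several values at once, for example when a function returns more than one result. A minimal sketch; the function name `min_max` is hypothetical and only serves the illustration.

```python
point = (3, 4)
x, y = point  # unpack into two variables
x, y = y, x   # swap the two values without a temporary variable
print(x, y)   # 4 3


def min_max(values):
    """Return the smallest and largest value as a tuple."""
    return min(values), max(values)


low, high = min_max([5, 2, 9])
print(low, high)  # 2 9
```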

### Set

Now, let's talk about sets. A set is an unordered collection of items. Every element is unique (no duplicates) and must be hashable, which in practice means immutable values such as numbers, strings, and tuples.

In [40]:
x = {1, 2, 3, 4, 5}

In [41]:
x[0] # unordered

TypeError: 'set' object is not subscriptable

In [123]:
x = {1, 2, 3, 4, 5, 5}
x # no duplicates

{1, 2, 3, 4, 5}

In [124]:
x.pop() # removes and returns an arbitrary element (here it happened to be 1)
x.remove(2) # the way to remove a specific element
x

{3, 4, 5}

In [125]:
# add an element
x.add(6)
x

{3, 4, 5, 6}

In [126]:
# combine two sets
y = {7, 8, 9}
x.union(y)

{3, 4, 5, 6, 7, 8, 9}
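
Sets also support the usual mathematical operations through operators. In the sketch below the example values are arbitrary, and `sorted` is used only to make the printed order predictable.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(sorted(a | b))  # union: [1, 2, 3, 4, 5]
print(sorted(a & b))  # intersection: [3, 4]
print(sorted(a - b))  # difference: [1, 2]
print(sorted(a ^ b))  # symmetric difference: [1, 2, 5]

# union() and | return a new set; the originals are unchanged
print(sorted(a))      # [1, 2, 3, 4]

# a quick way to drop duplicates from a list
unique = set([1, 1, 2, 3, 3])
print(sorted(unique))  # [1, 2, 3]
```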

### Dictionary

And the last one is the dictionary. A dictionary is a changeable collection of key-value pairs, indexed by key. Since Python 3.7, a dictionary also preserves the order in which keys were inserted. In Python, dictionaries are written with curly brackets, and they have keys and values.

In [127]:
x = {'a': 1, 
     'b': 2, 
     'c': 3}

x['a']

1

In [128]:
# show all keys
x.keys()

dict_keys(['a', 'b', 'c'])

In [129]:
# show all values
x.values()

dict_values([1, 2, 3])

In [130]:
# remove an element
x.pop('a')
x

{'b': 2, 'c': 3}

In [131]:
# add an element
x['d'] = 4
x

{'b': 2, 'c': 3, 'd': 4}

In [133]:
# combine two dictionaries
y = {'e': 5, 'f': 6}
x.update(y)
x

{'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}
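
Two patterns come up constantly with dictionaries: looking up a key that may be missing, and looping over key-value pairs. A small sketch with made-up data:

```python
scores = {'a': 1, 'b': 2}

# get() returns a default instead of raising KeyError
print(scores.get('z', 0))  # 0
print('a' in scores)       # True (membership tests look at keys)

# iterate over keys and values together
for key, value in scores.items():
    print(key, value)

# count how often each character appears in a string
counts = {}
for ch in 'hello':
    counts[ch] = counts.get(ch, 0) + 1
print(counts)  # {'h': 1, 'e': 1, 'l': 2, 'o': 1}
```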

Sets and dictionaries are implemented as hash tables in Python, so they are very fast when you want to search for an element. The order of the elements in a set is not guaranteed, while a dictionary keeps its keys in insertion order (Python 3.7 and later). Both data types are a very good option to speed up your code. This is something that you can't easily find in R.

### Object Reference

Very interestingly, when you store a list `[1, 2, 3, 4, 5]` in a variable `x`, the variable actually holds a reference to the list (you can think of it as the list's memory address). So, if you assign `x` to another variable `y`, both names refer to the same list. This is why modifying the list through `y` also changes what you see through `x`. When you want to copy a list to make some modifications without changing the original, you should use the `copy()` method. The same applies to dictionaries and sets. Tuples are also assigned by reference, but because they are immutable, sharing one is harmless.

In [None]:
x = [1, 2, 3, 4, 5]
y = x.copy()
y.append(6)
print(x)

In [None]:
x = [1, 2, 3, 4, 5]
y = x
y.append(6)
print(x)
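
The `is` operator tells you whether two names refer to the same object. The sketch below also shows one caveat: `copy()` is shallow, so nested lists are still shared. For that case the standard-library `copy.deepcopy` can be used. The variable names are illustrative.

```python
import copy

x = [1, 2, 3]
y = x           # same object, two names
z = x.copy()    # new list with the same elements

print(y is x)   # True
print(z is x)   # False
print(z == x)   # True (equal contents)

# copy() is shallow: inner lists are shared with the original
nested = [[1, 2], [3, 4]]
shallow = nested.copy()
deep = copy.deepcopy(nested)

nested[0].append(99)
print(shallow[0])  # [1, 2, 99]
print(deep[0])     # [1, 2]
```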

## If-Else

If-Else in Python is very easy to use. You can use it like this:

In [138]:
x = 5

if x > 5:
    print('x is greater than 5')
else:
    print('x is no greater than 5')

x is no greater than 5


In [139]:
x = 5

#elif
if x > 5:
    print('x is greater than 5')
elif x == 5:
    print('x is equal to 5')
else:
    print('x is less than 5')

x is equal to 5
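
For simple cases, Python also has a one-line conditional expression, similar in spirit to writing `if (cond) a else b` as an expression in R. Comparisons can be chained as well. A minimal sketch with arbitrary values:

```python
x = 5

# conditional expression: value_if_true if condition else value_if_false
label = 'big' if x > 5 else 'small'
print(label)     # small

# chained comparison, equivalent to (0 < x) and (x <= 10)
in_range = 0 < x <= 10
print(in_range)  # True
```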


## Loops

In R, you can use a for loop like this:
```R
for (i in 1:5) {
    print(i)
}
```

In Python, it is very similar:

In [34]:
for i in range(1, 6):
    print(i)

1
2
3
4
5


In R, you define a while loop like this:
```R
i <- 1
while (i <= 5) {
    print(i)
    i <- i + 1
}
```
In Python, it's like:

In [35]:
i = 1
while i <= 5:
    print(i)
    i += 1

1
2
3
4
5


A loop can be used very flexibly in Python. For example, you can write a loop inside square brackets, which is called a list comprehension, to create a list like this:

In [134]:
x = [i for i in range(1, 6)]
x

[1, 2, 3, 4, 5]

And you can also use a loop to create a dictionary like this:

In [135]:
x = {i: i**2 for i in range(1, 6)}
x

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

You can also use a loop to filter a list:

In [136]:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [i for i in x if i % 2 == 0]
y

[2, 4, 6, 8]
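
Two built-in helpers make loops more convenient: `enumerate` gives you the index together with the element, and `zip` walks through several sequences in parallel. A short sketch with made-up data:

```python
names = ['x', 'y', 'z']
values = [10, 20, 30]

# range(start, stop) excludes stop, so R's 1:5 corresponds to range(1, 6)
print(list(range(1, 6)))  # [1, 2, 3, 4, 5]

# enumerate yields (index, element) pairs, starting from 0
for i, name in enumerate(names):
    print(i, name)

# zip pairs up elements from both lists
pairs = list(zip(names, values))
print(pairs)   # [('x', 10), ('y', 20), ('z', 30)]

# the pairs can be turned directly into a dictionary
lookup = dict(zip(names, values))
print(lookup)  # {'x': 10, 'y': 20, 'z': 30}
```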

## Functions

In R, you can define a function like this:
```R
add <- function(x, y) {
    return(x + y)
    }
```

In Python, you can define a function like this:

In [27]:
def add(x, y):
    """
    This function returns the sum of x and y
    
    :param x: first number
    :param y: second number
    :return: the sum of x and y
    """
    return x + y
add(3, 4)

7
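
As in R, Python functions can have default values and can be called with named arguments. The function `power` below is a hypothetical example written only to show the syntax.

```python
def power(base, exponent=2):
    """
    This function raises base to the given exponent

    :param base: the number to raise
    :param exponent: the exponent, 2 if not given
    :return: base raised to exponent
    """
    return base ** exponent


print(power(3))                   # 9  (uses the default exponent)
print(power(3, 3))                # 27 (positional arguments)
print(power(exponent=3, base=2))  # 8  (keyword arguments, any order)
```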

Python uses indentation to define the body of a function, while R uses curly braces. So, be careful with the indentation in Python. For example, the following function won't work because its indentation is inconsistent:

In [30]:
def add(x, y):
      x+1
    return x + y

add(3, 4)

IndentationError: unindent does not match any outer indentation level (<string>, line 3)

## Import Libraries

In R, you import libraries like this:
```R
library(tidyverse)
```
But in Python, you import libraries like this:

In [None]:
import pandas as pd
pd.__version__

## Numpy Arrays

We have talked about Python's built-in lists before; now let me show you a more common way to create arrays in Python. You can use the `numpy` library to create arrays like this:

In [4]:
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

`numpy` is very widely used in Python for numerical operations. It plays a role similar to R's `matrix` and `array` functions, and you will see it in a lot of Python code. For example, when you want to perform a matrix multiplication, you can use the `np.dot` function like this:

In [11]:
x = np.array([[13, 21, 3], [24, 30, 16], [7, 2, 1]])
y = np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]])

np.dot(x, y)

array([[1298, 1705, 2112],
       [2816, 3586, 4356],
       [ 242,  352,  462]])

And you can also use `np.transpose` function to transpose a matrix like this:

In [12]:
np.transpose(x)

array([[13, 24,  7],
       [21, 30,  2],
       [ 3, 16,  1]])

You can also use `np.linalg.inv` function to get the inverse of a matrix:

In [13]:
np.linalg.inv(x)

array([[-0.00149701, -0.01122754,  0.18413174],
       [ 0.06586826, -0.00598802, -0.10179641],
       [-0.12125749,  0.09056886, -0.08532934]])

And you can use `np.linalg.det` function to get the determinant of a matrix like this:

In [14]:
np.linalg.det(x)

1335.9999999999995
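
A common source of confusion for R users is that `*` on NumPy arrays is element-wise, as it is for R matrices, while matrix multiplication (R's `%*%`) is written with the `@` operator or `np.dot`. The sketch below reuses the two matrices from above and assumes `numpy` is installed.

```python
import numpy as np

x = np.array([[13, 21, 3], [24, 30, 16], [7, 2, 1]])
y = np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]])

matmul = x @ y       # matrix product, same result as np.dot(x, y)
elementwise = x * y  # multiplies matching entries

print(np.array_equal(matmul, np.dot(x, y)))  # True
print(elementwise[0, 0])                     # 143 (13 * 11)

# x.T is a shorthand for np.transpose(x)
print(np.array_equal(x.T, np.transpose(x)))  # True

# a matrix times its inverse is the identity, up to rounding error
identity_check = x @ np.linalg.inv(x)
print(np.allclose(identity_check, np.eye(3)))  # True
```

Because of floating-point rounding, results such as the determinant above (1335.9999999999995 rather than 1336) should be compared with `np.allclose` or `np.isclose` instead of `==`.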

## Data Frames

In R, you can create a data frame like this:
```R
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
```
But in Python, you can create a pandas DataFrame like this:

In [16]:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df

|   | x | y |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


Pandas is a very powerful library in Python for data manipulation. You can do a lot of operations on a DataFrame. For example, you can select a column like this:


In [19]:
df['x']

0    1
1    2
2    3
Name: x, dtype: int64

You can select multiple columns with:

In [20]:
df[['x', 'y']]

|   | x | y |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


You can select a row like this:

In [21]:
df.iloc[0]

x    1
y    4
Name: 0, dtype: int64

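You can also select a cell with the expression below (`df.loc[0, 'x']` is an equivalent single-step form, generally preferred when assigning values):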

In [22]:
df['x'][0]

1

You can select a subset of the DataFrame like this:

In [23]:
df.iloc[0:2]

|   | x | y |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |


And more interestingly, you can filter the DataFrame like this:

In [24]:
df[df['x'] > 1]

|   | x | y |
|---|---|---|
| 1 | 2 | 5 |
| 2 | 3 | 6 |


And even use SQL-like queries like this:

In [26]:
df.query('x > 1 and y < 6')

|   | x | y |
|---|---|---|
| 1 | 2 | 5 |


## Some Practice

### Binary Search

Binary search is a very common algorithm in computer science. You can implement it like this in R:
```R
binary_search <- function(arr, x) {
    low <- 1
    high <- length(arr)
    
    while (low <= high) {
        mid <- (low + high) %/% 2
        if (arr[mid] == x) {
            return(mid)
        } else if (arr[mid] < x) {
            low <- mid + 1
        } else {
            high <- mid - 1
        }
    }
    
    return(-1)
}
```

Well, this is how we do it in Python:

In [53]:
def binary_search(arr, x):
    """
    This function implements binary search algorithm
    
    :param arr: array that you want to search
    :param x: the item that you want to search in the array
    :return: the index of the item in the array, if not found, return -1
    """
    low = 0
    high = len(arr) - 1
    
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] < x:
            low = mid + 1
        else:
            high = mid - 1
            
    return -1

arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]
binary_search(arr, 5)

4
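
In practice you rarely need to write binary search by hand, because the standard library provides it in the `bisect` module. The wrapper below is a sketch that mirrors the return convention used above (index if found, otherwise -1); the name `binary_search_bisect` is our own.

```python
from bisect import bisect_left


def binary_search_bisect(arr, x):
    """
    This function searches a sorted list using the bisect module

    :param arr: sorted list that you want to search
    :param x: the item that you want to search in the list
    :return: the index of the item in the list, if not found, return -1
    """
    # bisect_left returns the leftmost position where x could be inserted
    i = bisect_left(arr, x)
    if i < len(arr) and arr[i] == x:
        return i
    return -1


arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(binary_search_bisect(arr, 5))   # 4
print(binary_search_bisect(arr, 10))  # -1
```

Both versions require the input to be sorted; on unsorted data the result is meaningless.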

### Two-Sum

This is a question from [Leetcode 1. Two Sum](https://leetcode.com/problems/two-sum):

Given an array of integers, return indices of the two numbers such that they add up to a specific target.

Example 1:
```
Input: nums = [2, 7, 11, 15], target = 9
Output: [0, 1]
Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].
```

Example 2:
```
Input: nums = [3,2,4], target = 6
Output: [1,2]
```

Example 3:
```
Input: nums = [3,3], target = 6
Output: [0,1]
```


In [142]:
def two_sum(nums, target):
    """
    This function returns indices of the two numbers such that they add up to a specific target.
    
    :param nums: the list of integers
    :param target: the target sum
    :return: the indices of the two numbers
    """
    dic = {}
    for i, num in enumerate(nums):
        if target - num in dic:
            return [dic[target - num], i]
        dic[num] = i
    
    return []

two_sum([2, 7, 11, 15], 9)

[0, 1]
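
For comparison, here is a brute-force sketch that checks every pair of positions. It gives the same answers on the three examples, but it needs two nested loops, while the dictionary version above goes through the list only once. The name `two_sum_brute_force` is our own.

```python
def two_sum_brute_force(nums, target):
    """
    This function checks every pair of positions until one adds up to target

    :param nums: the list of integers
    :param target: the target sum
    :return: the indices of the two numbers, or an empty list if none exist
    """
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []


print(two_sum_brute_force([2, 7, 11, 15], 9))  # [0, 1]
print(two_sum_brute_force([3, 2, 4], 6))       # [1, 2]
print(two_sum_brute_force([3, 3], 6))          # [0, 1]
```

The speed-up of the dictionary version comes from the hash table lookup `target - num in dic`, which does not need to scan the list.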

### Pandas

This is a question from [Leetcode 595. Big Countries](https://leetcode.com/problems/big-countries/):

Find the name, population, and area of the big countries. A country is big if it has an area of at least 3000000 square km or a population of at least 25000000.

Example:

Input:
| name         | continent | area    | population | gdp          |
|--------------|-----------|---------|------------|--------------|
| Afghanistan  | Asia      | 652230  | 25500100   | 20343000000  |
| Albania      | Europe    | 28748   | 2831741    | 12960000000  |
| Algeria      | Africa    | 2381741 | 37100000   | 188681000000 |
| Andorra      | Europe    | 468     | 78115      | 3712000000   |
| Angola       | Africa    | 1246700 | 20609294   | 100990000000 |

Output:
| name         | population | area    |
|--------------|------------|---------|
| Afghanistan  | 25500100   | 652230  |
| Algeria      | 37100000   | 2381741 |

In [55]:
data = [
    {
        'name': 'Afghanistan',
        'continent': 'Asia',
        'area': 652230,
        'population': 25500100,
        'gdp': 20343000000
    },
    {
        'name': 'Albania',
        'continent': 'Europe',
        'area': 28748,
        'population': 2831741,
        'gdp': 12960000000
    },
    {
        'name': 'Algeria',
        'continent': 'Africa',
        'area': 2381741,
        'population': 37100000,
        'gdp': 188681000000
    },
    {
        'name': 'Andorra',
        'continent': 'Europe',
        'area': 468,
        'population': 78115,
        'gdp': 3712000000
    },
    {
        'name': 'Angola',
        'continent': 'Africa',
        'area': 1246700,
        'population': 20609294,
        'gdp': 100990000000
    }
]

df = pd.DataFrame(data)
df

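|   | name        | continent | area    | population | gdp          |
|---|-------------|-----------|---------|------------|--------------|
| 0 | Afghanistan | Asia      | 652230  | 25500100   | 20343000000  |
| 1 | Albania     | Europe    | 28748   | 2831741    | 12960000000  |
| 2 | Algeria     | Africa    | 2381741 | 37100000   | 188681000000 |
| 3 | Andorra     | Europe    | 468     | 78115      | 3712000000   |
| 4 | Angola      | Africa    | 1246700 | 20609294   | 100990000000 |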


In [56]:
# Solution 1
def big_countries_solution_1(world: pd.DataFrame):
    """
    This function returns the countries that are big. A country is big if it has an area of at least 3000000 square km or a population of at least 25000000.
    
    :param world: the DataFrame that contains the information of the countries 
    :return: data frame that contains the name, population, and area of the big countries
    """
    return world.query("area >= 3000000 or population >= 25000000")[["name", "population", "area"]]

big_countries_solution_1(df)

|   | name        | population | area    |
|---|-------------|------------|---------|
| 0 | Afghanistan | 25500100   | 652230  |
| 2 | Algeria     | 37100000   | 2381741 |


In [57]:
# Solution 2
def big_countries_solution_2(world: pd.DataFrame) -> pd.DataFrame:
    """
    This function returns the countries that are big. A country is big if it has an area of at least 3000000 square km or a population of at least 25000000.
    
    :param world: the DataFrame that contains the information of the countries 
    :return: data frame that contains the name, population, and area of the big countries
    """
    world = world[(world['population'] >= 25000000) | (world['area'] >= 3000000)]
    result = world[['name', 'population', 'area']]
    return result

big_countries_solution_2(df)

|   | name        | population | area    |
|---|-------------|------------|---------|
| 0 | Afghanistan | 25500100   | 652230  |
| 2 | Algeria     | 37100000   | 2381741 |


What did you notice from the running time? Why?

In [58]:
import time

start = time.time()
big_countries_solution_1(df)
print('Solution 1:', time.time() - start)

start = time.time()
big_countries_solution_2(df)
print('Solution 2:', time.time() - start)

Solution 1: 0.0012960433959960938
Solution 2: 0.0005402565002441406
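
A single measurement with `time.time()` is noisy, especially for calls that take about a millisecond. The standard-library `timeit` module repeats the call many times and is usually more reliable. In the sketch below, `work` is a hypothetical stand-in; in the notebook you would time something like `lambda: big_countries_solution_1(df)` instead.

```python
import timeit


def work():
    # hypothetical stand-in for the function you want to time
    return sum(i * i for i in range(1000))


calls_per_run = 100

# run the function calls_per_run times, and repeat that measurement 5 times
times = timeit.repeat(work, number=calls_per_run, repeat=5)

# the minimum is the measurement least disturbed by other activity
best = min(times) / calls_per_run
print(f'best time per call: {best:.6f} s')
```

In a Jupyter notebook, the `%timeit` magic command does the same thing with less typing.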
