# Lab 2 Transition to Python from R

In this lab, you will see the major differences between R and Python. After this lab, you should be able to perform these basic operations in Python:

1. [Import libraries](#Import-Libraries)
2. [Declare variables](#Declare-Variables)
3. [Data types](#Data-Types)
4. [Loops](#Loops)
5. [Functions](#Functions)
6. [Classes](#Classes)
7. [Numpy Arrays](#Numpy-Arrays)
8. [Data Frames](#Data-Frames)
9. [Some practice](#Some-Practice)
    - [Binary Search](#Binary-Search)
    - [Pandas](#Pandas)

## Import Libraries

In R, you import libraries like this:
```R
library(tidyverse)
```
But in Python, you import libraries like this:

In [2]:
import pandas as pd
pd.__version__

'2.2.2'
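Python also lets you import a module under an alias or import a single name from it. Here is a small sketch using only the standard library; the `math` and `statistics` modules are picked just for illustration and are not otherwise used in this lab:

```python
# Import a whole module under its own name
import math

# Import a module under a shorter alias (same pattern as `import pandas as pd`)
import statistics as stats

# Import one specific name from a module
from math import sqrt

print(math.pi)                  # access through the module name
print(stats.mean([1, 2, 3]))    # access through the alias
print(sqrt(16))                 # use the imported name directly
```

Unlike `library(tidyverse)` in R, `import math` does not put every function into your namespace, so you normally write the module name (or alias) in front of the function.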

## Declare Variables

In R, you can declare a variable like this: 
```R
x <- 5
```
But in Python, you declare a variable like this:

In [1]:
x = 5
x

5

## Data Types

In Python, you will find these data structures very useful: list, tuple, set, and dictionary. They are all built-in data types; we will talk about useful libraries later.

A list can be declared like this:

In [60]:
x = [1, 2, 3, 4, 5]
type(x)

list

Note that indexing in Python starts from 0, as in most common programming languages (R, which starts from 1, is one of the exceptions). So, to access the first element of the list, use `x[0]` instead of `x[1]`.

In [61]:
x[0]

1
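Two more indexing habits are worth seeing next to the 0-based rule above: negative indices and slices. This is a minimal sketch; the variable names `first`, `last`, and `middle` are made up for the example:

```python
x = [1, 2, 3, 4, 5]

first = x[0]       # first element (R: x[1])
last = x[-1]       # negative indices count from the end (R: x[length(x)])
middle = x[1:4]    # slice: the start index is included, the stop index is excluded

print(first, last, middle)
```

Be careful with `x[-1]`: in R it drops the first element, while in Python it returns the last one.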

A tuple is like a list, but it is immutable: once created, its elements cannot be reassigned. A tuple is usually used when you want to make sure the data is not changed.

In [39]:
x = (1, 2, 3, 4, 5)
x[1] = 3


TypeError: 'tuple' object does not support item assignment
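Since a tuple cannot be modified in place, the usual pattern is to unpack it and build a new tuple. The sketch below uses a hypothetical `point` tuple and catches the error so the cell keeps running:

```python
point = (3, 4)

# Unpack the tuple into separate variables
px, py = point

# "Changing" a tuple means building a new one; the original stays untouched
moved = (px + 1, py)

try:
    point[0] = 10              # item assignment is not allowed on tuples
except TypeError as err:
    message = str(err)

print(point, moved)
print(message)
```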

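Now, let's talk about sets. A set is an unordered collection of items. Every element is unique (no duplicates) and must be hashable, which in practice means immutable values such as numbers, strings, or tuples. The set itself can still be changed by adding or removing elements.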

In [40]:
x = {1, 2, 3, 4, 5}

In [41]:
x[0] # unordered

TypeError: 'set' object is not subscriptable

In [50]:
x = {1, 2, 3, 4, 5, 5}
x # no duplicates

{1, 2, 3, 4, 5}

In [51]:
x.pop() # removes and returns an arbitrary element (a set has no "first" element)
x.remove(2) # the way to remove a specific element
x

{3, 4, 5}
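Sets also support the usual set algebra, which plays the role of `union()`, `intersect()`, and `setdiff()` in R. A short sketch with two made-up sets `a` and `b`:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b          # elements in either set
common = a & b         # elements in both sets
only_a = a - b         # elements in a but not in b
has_three = 3 in a     # membership test

print(union, common, only_a, has_three)
```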

The last one is the dictionary. A dictionary is a changeable collection of key-value pairs, indexed by key. Since Python 3.7 it preserves the insertion order of its keys. In Python, dictionaries are written with curly brackets.

In [52]:
x = {'a': 1, 
     'b': 2, 
     'c': 3}

x['a']

1

Sets and dictionaries are implemented as hash tables in Python, so searching for an element (or a key) is very fast. The order of the elements in a set is not guaranteed; if you need to keep the order, use a list or a dictionary, since dictionaries preserve insertion order as of Python 3.7. Both data types are a very good option for speeding up your code. Base R does not offer an equally convenient built-in equivalent.
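To see a hash table at work, here is a small sketch that counts words with a dictionary. The `words` list and the `counts` name are invented for the example:

```python
# Count how often each word occurs, using a dict as a hash table
words = ["a", "b", "a", "c", "b", "a"]

counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1   # get() returns 0 for keys not seen yet

# Keys come back in insertion order (guaranteed since Python 3.7)
print(list(counts.items()))

# `in` on a dict or a set is a hash lookup, not a scan through all elements
print("c" in counts, "z" in counts)
```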

## Loops

In R, you can use a for loop like this:
```R
for (i in 1:5) {
    print(i)
}
```

In Python, it is very similar. Note that `range(1, 6)` stops before 6, so it yields 1 to 5:

In [34]:
for i in range(1, 6):
    print(i)

1
2
3
4
5
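A Python `for` loop can also walk over the elements of a list directly, and `enumerate` gives you the position as well. A minimal sketch with a made-up `fruits` list:

```python
fruits = ["apple", "banana", "cherry"]

# Loop over the elements directly, without an index
for fruit in fruits:
    print(fruit)

# enumerate() yields (index, element) pairs, starting at 0 by default
labels = []
for i, fruit in enumerate(fruits):
    labels.append(f"{i}: {fruit}")

print(labels)
```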


In R, you define a while loop like this:
```R
i <- 1
while (i <= 5) {
    print(i)
    i <- i + 1
}
```
In Python, it's like:

In [35]:
i = 1
while i <= 5:
    print(i)
    i += 1

1
2
3
4
5


## Functions

In R, you can define a function like this:
```R
add <- function(x, y) {
    return(x + y)
    }
```

In Python, you can define a function like this:

In [27]:
def add(x, y):
    return x + y
add(3, 4)

7

Python uses indentation to define the scope of a function, while R uses curly braces. So, be careful with the indentation in Python. For example, the following function won't work because the lines in its body are not indented consistently:

In [30]:
def add(x, y):
      x+1
    return x + y

add(3, 4)

IndentationError: unindent does not match any outer indentation level (<string>, line 3)
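For comparison, here is a correctly indented version. It also sketches default and keyword arguments, which work much like they do in R; the default `y=1` is an assumption added for the example:

```python
def add(x, y=1):
    """Return x + y; y defaults to 1 when it is omitted."""
    total = x + y      # every line in the body uses the same indentation
    return total

print(add(3, 4))         # both arguments given by position
print(add(3))            # y falls back to its default
print(add(y=10, x=2))    # keyword arguments can be given in any order
```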

## Classes

You may rarely need to write classes in R, but in Python, classes are very common. You can define a class like this:

In [33]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        print(f'Hello, my name is {self.name} and I am {self.age} years old.')
        
p = Person('William', 18)
p.greet()
p.age

Hello, my name is William and I am 18 years old.


18

Classes are an important concept in Python and OOP (Object Oriented Programming). In the class above, you can see it has two methods: `__init__` and `greet`. `__init__` is a special method in Python that is called when you create an object of the class. `self` is a reference to the object itself. You can access the attributes of the object with `self`.
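Two things you will meet quickly after `__init__` are other special methods and inheritance. The sketch below is an illustration only: the `Student` subclass and the school name are hypothetical, and `greet` returns the string instead of printing it so that it is easier to check.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        return f"Hello, my name is {self.name} and I am {self.age} years old."

    def __repr__(self):
        # Controls how the object is displayed, e.g. as the last line of a cell
        return f"Person(name={self.name!r}, age={self.age})"


class Student(Person):
    """Hypothetical subclass: inherits greet() and __repr__() from Person."""

    def __init__(self, name, age, school):
        super().__init__(name, age)    # reuse the parent initializer
        self.school = school


s = Student("William", 18, "Example University")
print(s.greet())
print(repr(s))
print(isinstance(s, Person))
```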

## Numpy Arrays

We have talked about the built-in list type before; now let me show you a more common way to create arrays in Python. You can use the `numpy` library to create arrays like this:

In [4]:
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

`numpy` is very widely used in Python for numerical operations. Its arrays are very similar to what R's `matrix` and `array` functions create. You will see it in a lot of operations in Python. For example, to perform a matrix multiplication, you can use the `np.dot` function like this:

In [11]:
x = np.array([[13, 21, 3], [24, 30, 16], [7, 2, 1]])
y = np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]])

np.dot(x, y)

array([[1298, 1705, 2112],
       [2816, 3586, 4356],
       [ 242,  352,  462]])

You can also use the `np.transpose` function to transpose a matrix like this:

In [12]:
np.transpose(x)

array([[13, 24,  7],
       [21, 30,  2],
       [ 3, 16,  1]])

You can use the `np.linalg.inv` function to get the inverse of a matrix:

In [13]:
np.linalg.inv(x)

array([[-0.00149701, -0.01122754,  0.18413174],
       [ 0.06586826, -0.00598802, -0.10179641],
       [-0.12125749,  0.09056886, -0.08532934]])

And you can use the `np.linalg.det` function to get the determinant of a matrix like this:

In [14]:
np.linalg.det(x)

1335.9999999999995
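If you come from R, the operators are an easy place to slip: in NumPy, `*` is element-wise (as in R), and the matrix product that R writes as `%*%` is `@`. The sketch below reuses the matrices from above and also checks the inverse numerically, which explains why the determinant printed as 1335.9999999999995 rather than exactly 1336: these routines work in floating point.

```python
import numpy as np

x = np.array([[13, 21, 3], [24, 30, 16], [7, 2, 1]])
y = np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]])

matmul = x @ y         # matrix product; same result as np.dot(x, y) for 2-D arrays
elementwise = x * y    # element-by-element product, NOT a matrix product

# A matrix times its inverse should be the identity, up to rounding error
identity_check = np.allclose(x @ np.linalg.inv(x), np.eye(3))

print(matmul)
print(elementwise)
print(identity_check)
```

`np.allclose` compares with a small tolerance, which is the usual way to compare floating-point results instead of `==`.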

## Data Frames

In R, you can create a data frame like this:
```R
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
```
But in Python, you can create a pandas DataFrame like this:

In [16]:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df

|   | x | y |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


Pandas is a very powerful library in Python for data manipulation. You can do a lot of operations on a DataFrame. For example, you can select a column like this:


In [19]:
df['x']

0    1
1    2
2    3
Name: x, dtype: int64

You can select multiple columns with:

In [20]:
df[['x', 'y']]

|   | x | y |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


You can select a row like this:

In [21]:
df.iloc[0]

x    1
y    4
Name: 0, dtype: int64

You can also select a single cell. The chained form below works, but `df.loc[0, 'x']` is the more idiomatic way to do it:

In [22]:
df['x'][0]

1

You can select a subset of the DataFrame like this:

In [23]:
df.iloc[0:2]

|   | x | y |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |


And more interestingly, you can filter the DataFrame like this:

In [24]:
df[df['x'] > 1]

|   | x | y |
|---|---|---|
| 1 | 2 | 5 |
| 2 | 3 | 6 |


You can even use SQL-like queries like this:

In [26]:
df.query('x > 1 and y < 6')

|   | x | y |
|---|---|---|
| 1 | 2 | 5 |
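The selections above can also be written with `.loc`, which takes row labels (or a boolean filter) and column names in one step. This is a sketch on the same small DataFrame; the names `cell`, `subset`, and `both` are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

cell = df.loc[0, 'x']                    # one cell: row label 0, column 'x'
subset = df.loc[df['x'] > 1, ['y']]      # filter rows and pick columns together

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = df[(df['x'] > 1) & (df['y'] < 6)]

print(cell)
print(subset)
print(both)
```

The last line selects the same row as `df.query('x > 1 and y < 6')`.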


## Some Practice

### Binary Search

Binary search is a very common algorithm in computer science. You can implement it like this in R:
```R
binary_search <- function(arr, x) {
    low <- 1
    high <- length(arr)
    
    while (low <= high) {
        mid <- (low + high) %/% 2
        if (arr[mid] == x) {
            return(mid)
        } else if (arr[mid] < x) {
            low <- mid + 1
        } else {
            high <- mid - 1
        }
    }
    
    return(-1)
}
```

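This is how we do it in Python. Because Python indices start at 0, `low` starts at 0 and `high` at `len(arr) - 1`, and the returned position is 0-based: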

In [53]:
def binary_search(arr, x):
    low = 0
    high = len(arr) - 1
    
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] < x:
            low = mid + 1
        else:
            high = mid - 1
            
    return -1

arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]
binary_search(arr, 5)

4
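In everyday code you rarely write the loop yourself, because the standard library's `bisect` module already implements binary search on sorted lists. Here is a sketch of the same lookup built on `bisect_left`; the wrapper name `binary_search_bisect` is made up for this example:

```python
from bisect import bisect_left

def binary_search_bisect(arr, x):
    """Return the index of x in the sorted list arr, or -1 if x is absent."""
    i = bisect_left(arr, x)      # leftmost position where x could be inserted
    if i < len(arr) and arr[i] == x:
        return i
    return -1

arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(binary_search_bisect(arr, 5))
print(binary_search_bisect(arr, 10))
```

Like the hand-written version, this assumes `arr` is already sorted.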

### Pandas

This is a question from [Leetcode 595. Big Countries](https://leetcode.com/problems/big-countries/):

Find the name, population, and area of the big countries. A country is big if it has an area of at least 3000000 square km or a population of at least 25000000.

Example:

Input:
| name         | continent | area    | population | gdp          |
|--------------|-----------|---------|------------|--------------|
| Afghanistan  | Asia      | 652230  | 25500100   | 20343000000  |
| Albania      | Europe    | 28748   | 2831741    | 12960000000  |
| Algeria      | Africa    | 2381741 | 37100000   | 188681000000 |
| Andorra      | Europe    | 468     | 78115      | 3712000000   |
| Angola       | Africa    | 1246700 | 20609294   | 100990000000 |

Output:
| name         | population | area    |
|--------------|------------|---------|
| Afghanistan  | 25500100   | 652230  |
| Algeria      | 37100000   | 2381741 |

In [55]:
data = [
    {
        'name': 'Afghanistan',
        'continent': 'Asia',
        'area': 652230,
        'population': 25500100,
        'gdp': 20343000000
    },
    {
        'name': 'Albania',
        'continent': 'Europe',
        'area': 28748,
        'population': 2831741,
        'gdp': 12960000000
    },
    {
        'name': 'Algeria',
        'continent': 'Africa',
        'area': 2381741,
        'population': 37100000,
        'gdp': 188681000000
    },
    {
        'name': 'Andorra',
        'continent': 'Europe',
        'area': 468,
        'population': 78115,
        'gdp': 3712000000
    },
    {
        'name': 'Angola',
        'continent': 'Africa',
        'area': 1246700,
        'population': 20609294,
        'gdp': 100990000000
    }
]

df = pd.DataFrame(data)
df

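|   | name        | continent | area    | population | gdp          |
|---|-------------|-----------|---------|------------|--------------|
| 0 | Afghanistan | Asia      | 652230  | 25500100   | 20343000000  |
| 1 | Albania     | Europe    | 28748   | 2831741    | 12960000000  |
| 2 | Algeria     | Africa    | 2381741 | 37100000   | 188681000000 |
| 3 | Andorra     | Europe    | 468     | 78115      | 3712000000   |
| 4 | Angola      | Africa    | 1246700 | 20609294   | 100990000000 |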


In [56]:
# Solution 1
def big_countries_solution_1(world: pd.DataFrame):
    return world.query("area >= 3000000 or population >= 25000000")[["name", "population", "area"]]

big_countries_solution_1(df)

|   | name        | population | area    |
|---|-------------|------------|---------|
| 0 | Afghanistan | 25500100   | 652230  |
| 2 | Algeria     | 37100000   | 2381741 |


In [57]:
# Solution 2
def big_countries_solution_2(world: pd.DataFrame) -> pd.DataFrame:
    world = world[(world['population'] >= 25000000) | (world['area'] >= 3000000)]
    result = world[['name', 'population', 'area']]
    return result

big_countries_solution_2(df)

|   | name        | population | area    |
|---|-------------|------------|---------|
| 0 | Afghanistan | 25500100   | 652230  |
| 2 | Algeria     | 37100000   | 2381741 |


What do you notice about the running times? Why?

In [58]:
import time

start = time.time()
big_countries_solution_1(df)
print('Solution 1:', time.time() - start)

start = time.time()
big_countries_solution_2(df)
print('Solution 2:', time.time() - start)

Solution 1: 0.0012960433959960938
Solution 2: 0.0005402565002441406
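A single `time.time()` measurement of such a short call is noisy, so the gap above may change from run to run. A steadier comparison, sketched below, repeats each call many times with the standard `timeit` module. The small `world` table and the function names `with_query` and `with_mask` are stand-ins invented for this example, and the repeat count of 200 is arbitrary:

```python
import timeit
import pandas as pd

# Small stand-in table; the values mirror three rows of the example above
world = pd.DataFrame({
    'name': ['Afghanistan', 'Albania', 'Algeria'],
    'area': [652230, 28748, 2381741],
    'population': [25500100, 2831741, 37100000],
})

def with_query(df):
    return df.query("area >= 3000000 or population >= 25000000")[["name", "population", "area"]]

def with_mask(df):
    mask = (df['population'] >= 25000000) | (df['area'] >= 3000000)
    return df[mask][['name', 'population', 'area']]

# Run each solution n times and report the average seconds per call
n = 200
t_query = timeit.timeit(lambda: with_query(world), number=n) / n
t_mask = timeit.timeit(lambda: with_mask(world), number=n) / n

print(f"query: {t_query:.6f} s per call")
print(f"mask:  {t_mask:.6f} s per call")
```

One plausible reason for a difference on a table this small is that `query` first has to parse the expression string, while the boolean mask is evaluated directly; on much larger tables the picture can change.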
