# A (very) brief Python refresher

### Strings

A string is defined by enclosing text in single quotes (') or double quotes (").

In [3]:
my_str = "This is a string"

Individual characters are accessed by index; for a string of length n, the indices run from 0 to n-1.

In [2]:
my_str[0]

'T'

Slice notation "[a:b:c]" means "count in increments of c starting at a inclusive, up to b exclusive".

In [3]:
my_str[0:4]

'This'
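
To illustrate the third slice component, here is a minimal sketch. The variable name `sample` is an assumption made for this example and is not part of the notebook; omitted slice parts fall back to their defaults.

```python
# `sample` is an illustrative name, not a variable from the cells above
sample = "This is a string"

every_second = sample[0:8:2]   # indices 0, 2, 4, 6
first_four = sample[:4]        # start omitted: begins at index 0
last_six = sample[-6:]         # negative start counts from the end

print(every_second)
print(first_four)
print(last_six)
```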

A string can be reversed with a slice. The first index is the start, the second is the end, and the third is the step. With the start and end omitted and a step of -1, the slice walks the whole string backwards.

In [4]:
my_str[::-1]

'gnirts a si sihT'

A string can also be split on a delimiter. `split()` returns a list of the pieces.

In [5]:
my_str = "one,two,three,four,five"
my_str.split(',')

['one', 'two', 'three', 'four', 'five']
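
The reverse operation is `str.join()`, which glues a list of strings back together with a separator. A small sketch, with `csv_line`, `parts` and `rejoined` as illustrative names only:

```python
# Illustrative names; the text matches the string used in the cell above
csv_line = "one,two,three,four,five"

parts = csv_line.split(',')        # list of five strings
rejoined = ' | '.join(parts)       # the separator goes between the items

print(rejoined)
```

Joining with the original delimiter (`','.join(parts)`) gives back the original string.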

Leading and trailing whitespace can be removed with `strip()`.

In [6]:
my_str = " hello "
print(my_str)
my_str.strip()

 hello 


'hello'

#### Lists

In [8]:
# A list can hold items of mixed types
my_list = ['item1', 'item2', 100, 3.14]


In [9]:
# List elements can be accessed by the indices starting from 0 to n-1
my_list[2]

100

In [10]:
# Function to find the length of the list
len(my_list)

4

In [11]:
# range function to generate a range object
num_list = range(0,10)
num_list

range(0, 10)

In [12]:
# use list() to get a list out of the range object
list(range(0,10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [13]:
# Iterate over a list using for loop
for num in num_list:
    print(num)

0
1
2
3
4
5
6
7
8
9


#### List comprehension

new_list = [expression(item) for item in old_list]

In [14]:
num_squares = [num * num for num in num_list]
num_squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [15]:
# The list can be filtered based on a condition
num_evens = [num for num in num_list if num %2 == 0]
num_evens

[0, 2, 4, 6, 8]
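
A comprehension can filter and transform in one expression. The sketch below (variable names are illustrative) shows this next to the explicit loop it is shorthand for:

```python
numbers = range(0, 10)

# Comprehension: the `if` filters, the leading expression transforms
even_squares = [n * n for n in numbers if n % 2 == 0]

# Equivalent explicit loop
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n * n)

print(even_squares)
print(even_squares == even_squares_loop)
```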

In [16]:
# zip() function to combine 2 lists

country_list = ["Australia", "France", "USA", "Italy"]
capital_list = ["Canberra", "Paris", "Washington DC", "Rome"]
pairs = zip(country_list, capital_list)
for country, capital in pairs:
    print("The country is {0} and the capital is {1}".format(country, capital))

The country is Australia and the capital is Canberra
The country is France and the capital is Paris
The country is USA and the capital is Washington DC
The country is Italy and the capital is Rome
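
Two further properties of `zip()` are worth seeing in code: the pairs can be passed straight to `dict()`, and zipping stops at the shorter input. A minimal sketch with illustrative variable names:

```python
countries = ["Australia", "France", "USA", "Italy"]
capitals = ["Canberra", "Paris", "Washington DC", "Rome"]

# Build a lookup table from the zipped pairs
capital_of = dict(zip(countries, capitals))
print(capital_of["France"])

# zip() stops as soon as the shorter input is exhausted
short_pairs = list(zip(countries, ["Canberra", "Paris"]))
print(short_pairs)
```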


In [17]:
# sort() sorts a list in place: the original list is modified and nothing is returned.

my_list = [954, 341, 100, 3.14]
my_list.sort()
my_list

[3.14, 100, 341, 954]
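
When the original order should be kept, the built-in `sorted()` returns a new list instead of modifying the input. A short sketch (variable names are illustrative):

```python
values = [954, 341, 100, 3.14]

ascending = sorted(values)                  # new list; `values` is unchanged
descending = sorted(values, reverse=True)   # same, largest first
print(values, ascending, descending)

result_of_sort = values.sort()              # in place; returns None
print(result_of_sort, values)
```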

#### Loops

In [18]:
# Loops can be run over list of lists as well
languages = [['Spanish', 'English',  'French', 'German'], ['Python', 'Java', 'Javascript', 'C++']]

In [29]:
for lang in languages:
    print(lang)

['Spanish', 'English', 'French', 'German']
['Python', 'Java', 'Javascript', 'C++']


As we see above, in each iteration, we get one list at a time. 

If each element of the nested list is needed, then a nested loop should be written as below:

In [19]:
for lang_list in languages:
    print("--------------")
    for lang in lang_list:
        print(lang)

--------------
Spanish
English
French
German
--------------
Python
Java
Javascript
C++


There are various ways to manipulate the functioning of a loop. 

* continue: This skips the remaining statements of the current iteration and moves on to the next iteration.
* break: This exits the innermost enclosing loop and resumes at the first statement after it. In a nested loop, the outer loop keeps running.

In [20]:
for lang_list in languages:
    print("--------------")
    for lang in lang_list:
        if lang == "German":
            continue
        print(lang)
print("End of loops")

--------------
Spanish
English
French
--------------
Python
Java
Javascript
C++
End of loops


In [21]:
for lang_list in languages:
    print("--------------")
    for lang in lang_list:
        if lang == "Java":
            break
        print(lang)
print("End of loops")

--------------
Spanish
English
French
German
--------------
Python
End of loops


In [22]:
# another example of continue
from math import sqrt
number = 0

for i in range(10):
    number = i ** 2
    if i % 2 == 0:
        continue    # continue here
    
    print(str(round(sqrt(number))) + ' squared is equal to ' + str(number))

1 squared is equal to 1
3 squared is equal to 9
5 squared is equal to 25
7 squared is equal to 49
9 squared is equal to 81


#### Sets

A set is an unordered collection of unique elements. Common uses include membership testing, removing duplicates from a sequence, and computing standard set operations such as intersection, union, difference, and symmetric difference.


A set can be created using curly braces `{}`.

In [23]:
my_set = {1, 2, 3}
print(my_set)

{1, 2, 3}


In [24]:
my_list = [1,2,3,4,2,3]
print(set(my_list))

{1, 2, 3, 4}


In [25]:
# A set can be made of mixed types as well. 
my_set = {1.0, "Hello", (1, 2, 3)}
print(my_set)

{1.0, 'Hello', (1, 2, 3)}


In [26]:
# Even if duplicate elements are given while initialising, they are removed.
my_set = {1,2,3,4,3,2}
print(my_set)

{1, 2, 3, 4}


In [7]:
# Creating an empty set is a bit tricky: {} creates an empty dict, so use set() instead.
my_set = {}
print(type(my_set))
my_set = set()
print(type(my_set))

<class 'dict'>
<class 'set'>


Elements can be added individually with `add()`, or several at a time from a list with `update()`.

In [27]:
my_set = {1, 2, 3}
my_set.add(4)
print(my_set)
my_set.update([6, 7, 8])
my_set

{1, 2, 3, 4}


{1, 2, 3, 4, 6, 7, 8}

In [30]:
# Union
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
A|B  # or A.union(B)

{1, 2, 3, 4, 5, 6, 7, 8}

In [32]:
# Intersection
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
A&B # A.intersection(B)

{4, 5}
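
The other operations mentioned earlier (difference, symmetric difference and membership testing) follow the same pattern. A minimal sketch using the same two sets:

```python
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

only_in_a = A - B        # difference, or A.difference(B)
in_exactly_one = A ^ B   # symmetric difference, or A.symmetric_difference(B)
has_three = 3 in A       # membership test

print(only_in_a, in_exactly_one, has_three)
```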

#### Dictionary

A dictionary is a container that stores key-value pairs. Since Python 3.7, dictionaries preserve insertion order, but values are looked up by key rather than by position.

Other programming languages might call this a 'hash', 'hashtable' or 'hashmap'.

In [33]:
dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
dict1

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

In [34]:
# Adding a key to the dictionary
dict1['e'] = 5
dict1

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

The `keys()` method returns the keys of the dictionary.

In [35]:
dict1.keys()

dict_keys(['a', 'b', 'c', 'd', 'e'])

In [36]:
# items() method can be used to get all the pairs of the dictionary.
dict1.items()

dict_items([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])
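
The pairs returned by `items()` are usually consumed in a loop, unpacking each pair into a key and a value. A small sketch; `scores` and `lines` are illustrative names, and `get()` is shown as one way to look up a key that may be missing:

```python
scores = {'a': 1, 'b': 2, 'c': 3}

lines = []
for key, value in scores.items():   # each item is a (key, value) tuple
    lines.append(f"{key} -> {value}")
print(lines)

# get() returns a default instead of raising KeyError for a missing key
missing = scores.get('z', 0)
print(missing)
```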

Dictionary comprehension can be used to manipulate the elements of a dictionary. 

In [37]:
double_dict1 = {k:v*2 for (k,v) in dict1.items()}
double_dict1

{'a': 2, 'b': 4, 'c': 6, 'd': 8, 'e': 10}

In [38]:
dict1_cond = {k:v for (k,v) in dict1.items() if v>2}
dict1_cond

{'c': 3, 'd': 4, 'e': 5}

#### Functions

A function is a block of organized, reusable code that performs a single, well-defined action. Functions make an application more modular and allow a high degree of code reuse.

In [41]:
# Function definition
def hello():
    print("Hello World") 
    return

In [42]:
# Function calling
hello()

Hello World


Parameters vs arguments: 
Parameters are a and b. Arguments are 2 and 5.

In [12]:
def plus(a,b):
    return a + b
plus(2, 5)


7

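A function can also return nothing: a bare `return` (or reaching the end of the function) gives `None`. Below, `run()` returns as soon as `x` is 2, so `Run!` is never printed.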

In [43]:
def run():
    for x in range(10):
        if x == 2:
            return
    print("Run!")
    

In [44]:
run()
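
To make the `None` visible, the return value can be captured and printed. A minimal sketch that repeats the definition so it runs on its own:

```python
def run():
    for x in range(10):
        if x == 2:
            return          # leaves the function immediately
    print("Run!")           # never reached, because the loop always hits x == 2

result = run()
print(result)               # the captured return value
```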

#### Keyword arguments with default values

In [48]:
# Here the value of parameter `b` is 2 by default if the value is not passed. 
def plus(a,b = 2):
    return a + b
  

In [49]:
# Call `plus()` with only `a` parameter
print(plus(a=1))

3


In [50]:
# Call `plus()` with `a` and `b` parameters
print(plus(a=1, b=3))

4


#### Variable arguments:



In [51]:

def plus(*args):
    return sum(args)

# Calculate the sum
plus(1,4,5)

10

In [52]:
def concatenate(**kwargs):
    result = ""
    # Iterating over the Python kwargs dictionary
    for arg in kwargs.values():
        result = result + arg + " "
    return result

print(concatenate(a="Real", b="Python", c="Is", d="Great", e="!"))

Real Python Is Great ! 


The correct order for parameters in a function definition is:

    Standard arguments
    *args arguments
    **kwargs arguments
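
A sketch of a definition that follows this order. The function `describe` and its parameter names are hypothetical, chosen only for illustration:

```python
# Hypothetical function: standard parameters first, then *args, then **kwargs
def describe(first, second=2, *args, **kwargs):
    return first, second, args, kwargs

full = describe(1, 3, 4, 5, unit="cm")
minimal = describe(1)

print(full)      # extra positionals land in args, extra keywords in kwargs
print(minimal)   # default used for `second`, empty args and kwargs
```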


In [53]:
# Anonymous functions: lambda
# `add()` lambda function (named `add` so the built-in `sum()` is not shadowed)
add = lambda x, y: x + y

# Call the `add()` anonymous function
add(4,5)

# "Translate" to a UDF
# def add(x, y):
#     return x+y


9
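
Lambdas are most often passed directly as arguments rather than given a name. One common case, sketched here with an illustrative list, is the `key` argument of `sorted()`:

```python
words = ["pear", "fig", "banana"]

# Sort by length instead of alphabetically; the lambda is used once and never named
by_length = sorted(words, key=lambda word: len(word))
print(by_length)
```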

#### Use of main()

In [54]:
def hello():
    print("Hello World") 
    return

# Define `main()` function
def main():
    hello()
    print("This is a main function")

main()

# As is, if the script is imported, it will execute the main function.

Hello World
This is a main function


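The guard below is best demonstrated by running the file as a script: `main()` runs only when the file is executed directly, not when it is imported as a module.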

In [56]:
# Define `main()` function
def main():
    hello()
    print("This is a main function")
    
# Execute `main()` function 
if __name__ == '__main__':
    main()

Hello World
This is a main function


#### Global vs local variables

In [57]:
# Global variable `init`
init = 1

# Define `plus()` function to accept a variable number of arguments
def plus(*args):
    # Local variable `total`
    total = 0
    for i in args:
        total += i
    return total
  
# Access the global variable
print("this is the initialized value " + str(init))

# (Try to) access the local variable
print("this is the sum " + str(total))

this is the initialized value 1


NameError: name 'total' is not defined
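
The local variable only exists while the function runs, so the usual way to get its value out is to return it and bind the result to a name in the calling scope. A minimal sketch:

```python
def plus(*args):
    total = 0               # local to plus()
    for i in args:
        total += i
    return total            # hand the value back to the caller

# This module-level `total` is a separate name from the local one
total = plus(1, 4, 5)
print("this is the sum " + str(total))
```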

#### json

JSON: JavaScript Object Notation.

When exchanging data between a browser and a server, the data can only be text.

Python has a built-in package called json, which can be used to work with JSON data.

In [59]:
import json

In [63]:
# Convert from JSON to Python:

# some JSON:
x =  '{ "name":"John", "age":30, "city":"New York"}'

# parse x:
y = json.loads(x)

# the result is a Python dictionary:
print(y["age"])

30


In [64]:
y

{'name': 'John', 'age': 30, 'city': 'New York'}

In [65]:
# Convert from Python to JSON

# a Python object (dict):
x = {
  "name": "John",
  "age": 30,
  "city": "New York"
}

# convert into JSON:
y = json.dumps(x)

# the result is a JSON string:
print(y)

{"name": "John", "age": 30, "city": "New York"}


In [67]:
# Indentation
y = json.dumps(x, indent=4)
print(y)


{
    "name": "John",
    "age": 30,
    "city": "New York"
}
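
`dumps()` and `loads()` are inverses for plain dictionaries like this one. A short sketch of a round trip; the variable names are illustrative, and `sort_keys=True` is used only to make the output order predictable:

```python
import json

person = {"name": "John", "age": 30, "city": "New York"}

text = json.dumps(person, sort_keys=True)   # Python dict -> JSON string
restored = json.loads(text)                 # JSON string -> Python dict

print(text)
print(restored == person)
```

Not every Python type survives unchanged: a tuple, for example, is written as a JSON array and comes back as a list.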


### Dataframes

One of the most powerful data structures in the Python ecosystem is the Pandas `dataframe`. It allows tabular data, including `csv` (comma-separated values) and `tsv` (tab-separated values), to be processed and manipulated. People familiar with Excel will likely find it intuitive. Since `csv` (or `tsv`) has become the de facto standard for sharing datasets both large and small, the Pandas dataframe is a natural choice.

Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [2]:
import pandas as pd # importing the package and using `pd` as the alias 

Suppose we wanted to create a dataframe as follows,

| name | title     |
|------|-----------|
| sam  | physicist |
| rob  | economist |

Let's create a dictionary whose keys are the column headers and whose values are lists of column values:

In [4]:
data = {'name': ['sam', 'rob'], 'title': ['physicist', 'economist']}

Converting the same to a dataframe,

In [7]:
df = pd.DataFrame(data)
df

|   | name | title     |
|---|------|-----------|
| 0 | sam  | physicist |
| 1 | rob  | economist |


To read an external file, use the `read_csv()` function:
```python
pd.read_csv(filename, sep=',')
```

Similarly, to export a Pandas dataframe to a `csv` file, pass the file path to `to_csv()`:
```python
df.to_csv(filename, index=False)
```
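
Both functions can be tried without touching the disk. The sketch below assumes pandas is installed and uses an in-memory buffer in place of a file; when `to_csv()` is called without a path it returns the CSV text as a string.

```python
import io
import pandas as pd

data = {'name': ['sam', 'rob'], 'title': ['physicist', 'economist']}
df = pd.DataFrame(data)

# No path given, so the CSV text is returned instead of being written to a file
csv_text = df.to_csv(index=False)

# StringIO stands in for a file on disk
round_trip = pd.read_csv(io.StringIO(csv_text), sep=',')
print(round_trip)
```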

### Regex

Regular expressions or regex are a powerful tool to extract key pieces of data from raw text. 

Try your regex here : https://pythex.org/

In [13]:
html = r'''
         <!DOCTYPE html>
        <html>
        <head>
        <title>site</title>
        </head>
        <body>

        <h1>Sam</h1>
        <h2>Physicist</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.</p>

        <h1>Rob</h1>
        <h2>Economist</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et.</p>


        </body>
        </html> 
        '''

In [14]:
print(html)


         <!DOCTYPE html>
        <html>
        <head>
        <title>site</title>
        </head>
        <body>

        <h1>Sam</h1>
        <h2>Physicist</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.</p>

        <h1>Rob</h1>
        <h2>Economist</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et.</p>


        </body>
        </html> 
        


Now, if we are only interested in:
- names, i.e. the data inside the `<h1></h1>` tags, and
- titles, i.e. the data inside the `<h2></h2>` tags

we can extract them using regex.

First let's import Python's regex module, called `re`:

In [16]:
import re

Now let's define the expressions (or patterns) to capture all text between the tags as follows:

- `<h1>(.*?)</h1>` : capture all text contained within `<h1></h1>` tags
- `<h2>(.*?)</h2>` : capture all text contained within `<h2></h2>` tags



In [26]:
# Raw strings (r'...') keep backslashes literal, which is the usual convention for patterns
regex_h1 = re.compile(r'<h1>(.*?)</h1>')
regex_h2 = re.compile(r'<h2>(.*?)</h2>')

and use `findall()` to return all the instances that match with our pattern,

In [27]:
names = regex_h1.findall(html)
titles = regex_h2.findall(html)

print(names, titles)

['Sam', 'Rob'] ['Physicist', 'Economist']
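
The `?` in `(.*?)` makes the match non-greedy, so it stops at the first closing tag. The sketch below contrasts it with the greedy form and pairs each name with its title; `snippet` is a shortened, illustrative stand-in for the HTML above.

```python
import re

# Shortened illustrative input, written on one line
snippet = "<h1>Sam</h1><h2>Physicist</h2><h1>Rob</h1><h2>Economist</h2>"

lazy = re.findall(r'<h1>(.*?)</h1>', snippet)    # stops at the first </h1>
greedy = re.findall(r'<h1>(.*)</h1>', snippet)   # runs on to the last </h1>

# Pair each name with the title found at the same position
titles = re.findall(r'<h2>(.*?)</h2>', snippet)
pairs = list(zip(lazy, titles))

print(lazy)
print(greedy)
print(pairs)
```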
