# <a name="0">Data Analysis with Python: Full Course for Beginners</a>
> Version 1.0 [Course](https://www.udemy.com/course/data-analysis-with-python-full-course-for-beginners/)
by Jason Zhang
>
> Contact: jszhang0001@gmail.com;

[Chapter 1:Introduction](#introduction)
- [1.1 What is Python](#What_is_Python)
- [1.2 Install Python](#install_python)
- [1.3 Jupyter notebook](#jupyter_notebook)

[Chapter 2:Python Basics](#python_basics)
- [2.1 print](#print)
- [2.2 variables](#variables)
- [2.3 comments](#comments)
- [2.4 Data types](#data_types)
- [2.5 String Formatting](#string_formatting)
- [2.6 Operators](#operators)
- [2.7 Control Flow](#control_flow)
- [2.8 Built-in Functions](#builtin_functions)
- [2.9 Functions](#functions)

[Chapter 3:Data Structure](#data_structure)
- [3.1 List](#list)
- [3.2 Tuple](#tuple)
- [3.3 Dictionary](#dictionary)
- [3.4 Set](#set)

[Chapter 4:Pandas Guidebook](#pandas)
- [4.1 packages](#packages)
- [4.2 regular imports](#imports)
- [4.3 pandas intro](#pandas_intro)
- [4.4 View data](#view_data)
- [4.5 Slicing & Indexing](#slice_index_pandas)
- [4.6 Operations](#operations_pandas)
- [4.7 Merge Data](#merge_data)
- [4.8 Concatenate Data](#concat_data)
- [4.9 Group data](#group_data)
- [4.10 Get data in/out](#get_data_inout)
- [4.11 Handle missing values](#handle_missing_value)

[Chapter 5:Numpy Guidebook](#numpy)
- [5.1 array creation](#array_creation)
- [5.2 array attribute](#array_attributes)
- [5.3 Basic operations](#array_operations)
- [5.4 Slicing & Indexing](#slice_index_numpy)
- [5.5 Functions & Methods](#functions_methods)
- [5.6 Data processing using array](#data_processing_numpy)

[Chapter 6:Functional Programming](#functional_programming)
- [6.1 Lambda Function](#lambda)
- [6.2 Filter](#filter)
- [6.3 Map](#map)
- [6.4 Reduce](#reduce)

[Chapter 7:Exception handling](#exception_handling)

[Chapter 8:Work with Files](#work_with_files_in_pythons)
- [8.1 Read from files](#read_files)
- [8.2 Write to files](#write_files)
- [8.3 Working with JSON files](#work_with_json)


## Chapter 1: Introduction<a id='introduction'></a>
<a href="#0">Go to top</a>
***

### §1.1 What is Python<a id='What_is_Python'></a>

Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation.


- An interpreted high-level programming language
- Emphasizes code readability
- With tons of data science libraries

#### What can python do?
- Software and web application development ([Django](https://www.djangoproject.com/), [Flask](https://palletsprojects.com/p/flask/)...)
- Data processing, scientific computing ([Pandas](https://pandas.pydata.org/),[scipy](https://www.scipy.org/)...), 
- Machine learning, artificial intelligence ([scikit-learn](https://scikit-learn.org/stable/),[PySpark](https://spark.apache.org/docs/latest/api/python/index.html),[PyTorch](https://pytorch.org/)...)
- System scripting, robotic processing
- Blockchain ([Ethereum Development](https://ethereum.org/en/developers/docs/programming-languages/python/)...)
- and many more...

#### Why do we use Python
- Easy to read
- Easy to write
- Easy to learn

### §1.2 Install Python3<a id='install_python'></a>
#### Windows
Download the Python installer from the official website, matching your Windows version (32-bit or 64-bit).  
Run the downloaded installer, and remember to select `Add Python 3.x to PATH` before clicking `Install Now`.

To run the Python3 REPL (an interactive prompt; REPL stands for Read-Evaluate-Print-Loop), type `python` in Command Prompt.  
If you see the error message `'python' is not recognized as an internal or external command`, add the folder containing `python.exe` to the `PATH` environment variable.

> Type `exit()` to quit the Python REPL.

#### MacOS
Older versions of macOS come with Python 2.7 pre-installed, so Python3 has to be installed separately. There are 2 ways to install Python3:  
Method 1:  
Download installation package from Python official website:  
[https://www.python.org/downloads/](https://www.python.org/downloads/)

Method 2:  
Run command `brew install python3` if Homebrew is installed.

To run Python3 REPL, type `python3` in Terminal application.

### §1.3 Jupyter Notebook<a id='jupyter_notebook'></a>
#### Install Jupyter Notebook through Anaconda
[https://www.anaconda.com/download](https://www.anaconda.com/download)

> Note: Python and pip (Python package manager) are included in Anaconda. After the installation, the default python path is modified by Anaconda.

#### Jupyter Notebook Shortcuts
**Edit mode**: text area is focused, left border is **<span style="color:#66BB6A">green</span>**  
**Command mode**: text area is out of focus, left border is **<span style="color:#42A5F5">blue</span>**
<h4><center>Keyboard shortcut</center></h4>

| Key | Description | Mode |
|---|---|---|
| `ctrl` + `enter`       | run cell                           | Command  |
| `a`                    | insert cell above                  | Command  |
| `b`                    | insert cell below                  | Command  |
| `c`                    | copy the selected cell             | Command  |
| `v`                    | paste cell                         | Command  |
| `d`, `d`               | delete the selected cell           | Command  |
| `shift` + `m`          | merge selected cells               | Command  |
| `y`                    | change cell to Code                | Command  |
| `m`                    | change cell to Markdown            | Command  |
| `h`                    | view all keyboard shortcuts        | Command  |
| `cmd/ctrl` + `click`   | multi-cursor editing               | Edit     |
| `cmd/ctrl` + `/`       | toggle comment on the current line | Edit     |
| `ctrl` + `shift` + `-` | split cell                         | Edit     |
| `cmd` + `shift` + `p`  | search commands                    | Any mode |

Press `Esc` to exit `Edit mode` and enter `Command mode`. Use the up/down arrow keys to select a different cell.  
Press `Enter` to enter the `Edit mode` of the selected cell.
  
> Tips:  
to clean a cell's output,  
select the target cell, then press `Esc`, `r`, `y` in sequence.

## Chapter 2: Python Basics<a id='python_basics'></a>
<a href="#0">Go to top</a>
***

### §2.1 print<a id='print'></a>

Output values to the console.

In [None]:
print("Hello World!")

In Python3, `print` is a function.  
A function is invoked by writing the function name, followed by arguments enclosed in a pair of parentheses. We will look at functions in more detail later.  


If no argument is given, `print` will output an empty line.

In [None]:
print()

The `print` function accepts an arbitrary number of arguments separated by commas, and outputs a single space between consecutive arguments.

In [None]:
print("1 + 2 =", 1 + 2)

In [None]:
print(12*25)

The call `print("1 + 2 =", 1 + 2)` above passes 2 arguments to `print`:
- string `"1 + 2 ="`
- the evaluated value of expression `1 + 2`, which is `3`.

### §2.2 Variables<a id='variables'></a>

A variable (its name is also called an `identifier`) is a name that refers to a value stored in the computer's memory.

We can imagine a look-up table stored inside the computer: every time we define a variable, Python inserts one key-value pair into that table, where the `key` is our variable name and the `value` is the value that the variable name refers to.  

Let's define a variable, and assign a value to it, then print it out.

In [None]:
name = "Jason Zhang"
print(name)

In the above program, we created a variable called `name`, and assigned the value `"Jason Zhang"` to it. The first line of the program is called an `assignment statement`.  

`name ← "Jason Zhang"`

Variables' values can be changed (re-assigned).

In [None]:
a = 'abc'
a = 3
print(a)

In the above program, variable `a` is defined with an initial value `'abc'`.
Then a new value `3` is assigned to the variable `a`, thus the value of `a` is `3` at the final state.

An `assignment statement` is evaluated from right to left: the `expression` on the right-hand side of `=` is evaluated first, then the result is assigned to the variable name on the left-hand side of `=`.  

In [None]:
a = 5
a = a + 3
print(a)

Try to explain the following program's output.

In [None]:
a = 3
b = a = 5
print(a, b)

Python is case sensitive: the variable name `val` is different from `Val`.

You can assign values to multiple variables within one `assignment statement`:

In [None]:
a, b, c = 12, "Wow!", False
print(a, b, c)

Try the following program, and think about why `a` and `b` have different values eventually.

In [None]:
a = 1
b = a
a = 2
print(a, b)

### 2.2 Code Challenge

How can we swap two variables?


In [None]:
#TO DO




### §2.3 Comments<a id='comments'></a>
Comments are used to annotate your source code. Python ignores all comments during execution.
A comment starts with the hash character `#` and runs to the end of the line.

In [None]:
# This is my variable
myVar = 100

Python only has single-line comments, but a common trick is to write a triple-quoted string that is not assigned to anything and use it as a multi-line comment.

> Try to add meaningful comments to your program. They help not only you, but also anyone else who reads your code.
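
As a minimal sketch of the triple-quote trick (the variable `tax_rate` is hypothetical and used only for illustration): a string literal that is not assigned to a name is evaluated and then discarded, so it behaves much like a comment.

```python
"""
This triple-quoted string is not assigned to any variable,
so Python evaluates it and throws it away.
It can therefore serve as a multi-line comment.
"""
tax_rate = 0.07  # hypothetical value; this is a normal single-line comment

print(tax_rate)
```

Strictly speaking this is a string expression rather than a true comment, so `#` remains the usual choice for short notes.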

### §2.4 Simple Data Types<a id='data_types'></a>
There are several commonly used primary data types in Python.  
Use `type()` to get the data type of a specific value or a variable.

In [None]:
type_1 = type(123)
type_2 = type(123.0)
type_3 = type("123")
type_4 = type(True)
type_5 = type(print)
type_6 = type(None)

print("type_1 =", type_1)
print("type_2 =", type_2)
print("type_3 =", type_3)
print("type_4 =", type_4)
print("type_5 =", type_5)
print("type_6 =", type_6)

#### Integers (`int`)
Python integers can be arbitrarily large or small, including negative values.  
For instance: `0`, `-99999999`, `12345`.  

#### Floating Point Numbers (`float`)
A floating point number is a number that has a fractional part. It is called floating point number because the position of the decimal point can be changed in the scientific notation. For example, `3.1415e9` is the same as `314.15e7`.  

Floating point numbers are stored as binary approximations in the computer's memory, so a result can differ slightly from the exact decimal value.

In [None]:
print(5.1 + 2.3)

We can round this number using `round` function.

In [None]:
round(5.1 + 2.3, 2) # round to 2 digits precision after the decimal point
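
Because of this approximation, comparing two floats with `==` can give a surprising answer. The sketch below, which assumes only the standard-library function `math.isclose`, shows one common way to compare floats within a small tolerance instead.

```python
import math

total = 0.1 + 0.2

is_exact = (total == 0.3)             # False: total is 0.30000000000000004
is_close = math.isclose(total, 0.3)   # True: equal within a small tolerance

print(total)
print(is_exact, is_close)
```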

#### String (`str`)
A string is text surrounded by a pair of single quotes `'` or double quotes `"`, for example `'This is a string.'` or `"This is another string."`. The opening and closing quotation marks must match, and they are not part of the string.

Use escape character (`\`) to output new line `\n`, tab `\t`, back slash `\\`, single quote `\'`, double quote `\"` etc. inside a string.

In [None]:
print('First line\nSecond line\nI\'m the third line')

Prefix a string with `r` (a raw string) to disable escape characters:

In [None]:
print('This is \\ a string.')
print(r'This is \\ a string.')

To define a string which spans multiple lines, you can use triple quotes (`'''` or `"""`):

In [None]:
print('''
First line
Second line
''')

> Tips:  
> Triple-quoted strings can also be used for multi-line comments.

Use [`split`](https://docs.python.org/3/library/stdtypes.html#str.split) to split a string into a list.  
`'a b c'.split()` => `['a', 'b', 'c']`  

Use [`join`](https://docs.python.org/3/library/stdtypes.html#str.join) to convert a list to a string.  
`' '.join(['a', 'b', 'c'])` => `'a b c'`  
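
The two methods are often used together. The following sketch (the sample sentence is made up for illustration) splits a string into words, joins them with a different separator, and splits again on that separator.

```python
sentence = 'data analysis with python'

words = sentence.split()        # no argument: split on whitespace
csv_line = ','.join(words)      # join the words with a comma instead of a space
fields = csv_line.split(',')    # split on an explicit separator

print(words)
print(csv_line)
print(fields == words)
```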

#### Boolean (`bool`)
A boolean value can be either `True` or `False` in Python.
> Be careful with the capitalization.

#### None (`None`)
In Python, `None` stands for an empty value.


### §2.5 String Formatting<a id='string_formatting'></a>
When programming, we frequently need to build strings that contain the values of variables. We can do this by concatenation using the `+` operator.
For example, suppose we have the following variables:

In [None]:
name = 'Tom'
age = 23

string = name + ' is ' + str(age) + ' years old'
print(string)

A few things to note here. First, a string cannot be concatenated with a non-string value, so we have to convert non-string values using the `str` function. Second, we need to add leading or trailing spaces to some strings to make sure words are separated properly.  
With the string formatting feature in Python, we don't have to worry about such issues.

#### Basic formatting
In Python, every string has a `format` method.  
`format_string.format(arguments)`  
In the above example, we can do something like this:

In [None]:
string = '{} is {} years old'.format(name, age)
print(string)

`{}` is a placeholder for an argument in the format string. We can re-arrange the order by putting an index number in the braces; for example, `{1}` refers to the second argument, since indices start at `0`.
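
A short sketch of re-ordering placeholders, reusing the `name` and `age` values from the example above. The keyword names `person` and `years` are chosen here only for illustration.

```python
name = 'Tom'
age = 23

# Positional indices: {0} is the first argument, {1} is the second.
by_index = '{1} is the age of {0}'.format(name, age)

# Keyword arguments give each placeholder a readable name.
by_name = '{person} is {years} years old'.format(person=name, years=age)

print(by_index)
print(by_name)
```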

#### Padding and Aligning
By default, the formatted string takes up as many characters as needed to represent the content, but we can also pad a value to a specific length.

In [None]:
'{:10}'.format('Python')

Add a colon `:` followed by a number to specify how many characters you want to use for this variable. By default, the variable is aligned to the left.  
To change to right aligned:

In [None]:
'{:>10}'.format('Python')

To center align the string:

In [None]:
'{:^10}'.format('Python')

We can also define the padding character, for example an underscore:

In [None]:
'{:_^15}'.format('Python')

#### Truncating
We can truncate overly long strings to a specific length.
To specify the precision of the output string, we can put a dot `.` followed by a number indicating the length.

In [None]:
'{:.7}'.format('Data Analysis with Python')

#### Formatting Numbers
Padding and aligning also apply to numbers.  
Use `d` to indicate that the value is an integer.

In [None]:
'{:=^6d}'.format(246)

For floating point numbers, use the letter `f`. Similar to truncating strings, add a number after `.` to limit the number of digits after the decimal point.

In [None]:
'{:06.2f}'.format(3.14159265358)
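
Python 3.6 and later also accept the same format specifications inside formatted string literals (f-strings). This goes slightly beyond the `format` method covered here, so treat the sketch below only as a side-by-side comparison; the variable names are illustrative.

```python
pi = 3.14159265358
language = 'Python'

padded_number = f'{pi:06.2f}'        # same spec as '{:06.2f}'.format(pi)
centered_text = f'{language:_^15}'   # same spec as '{:_^15}'.format(language)

print(padded_number)
print(centered_text)
print(padded_number == '{:06.2f}'.format(pi))
```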


### §2.6 Operators<a id='operators'></a>
#### Basic Operators
| Operator | Description    | Example     |
|----------|----------------|-------------|
| `+`        | Addition       | `3 + 5` ⇒ `8`   |
| `-`        | Subtraction    | `5 - 3` ⇒ `2`   |
| `*`        | Multiplication | `5 * 3` ⇒ `15`  |
| `/`        | Division       | `9 / 2` ⇒ `4.5` |
| `//`       | Floor Division | `9 // 2` ⇒ `4`  |
| `%`        | Modulus        | `9 % 2` ⇒ `1`   |
| `**`       | Exponent       | `2 ** 3` ⇒ `8`  |
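
Floor division and modulus are often used as a pair. As a small worked example (the value `135` is arbitrary), converting a number of seconds into minutes and remaining seconds:

```python
total_seconds = 135

minutes = total_seconds // 60    # floor division: whole minutes
seconds = total_seconds % 60     # modulus: seconds left over

# The two results always recombine into the original value.
recombined = minutes * 60 + seconds

# `/` always returns a float, even when the division is exact.
exact = 10 / 5

print(minutes, seconds, recombined, exact)
```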

#### Assignment Operators
An assignment operator applies a certain operation to the variable and the right operand, then assigns the result to the variable on the left.

| Operator | Example                  |
|----------|--------------------------|
| `=`        | `a = 10`                 |
| `+=`       | `a += 5` ⇒ `a = a + 5`   |
| `-=`       | `a -= 5` ⇒ `a = a - 5`   |
| `*=`       | `a *= 5` ⇒ `a = a * 5`   |
| `/=`       | `a /= 5` ⇒ `a = a / 5`   |
| `//=`      | `a //= 5` ⇒ `a = a // 5` |
| `%=`       | `a %= 5` ⇒ `a = a % 5`   |
| `**=`      | `a **= 5` ⇒ `a = a ** 5` |

#### Comparison and Membership Operators
| Operator | Description                  |
|----------|--------------------------|
| `>`        | greater than                 |
| `>=`       | greater than or equal   |
| `<`       | less than   |
| `<=`       | less than or equal   |
| `==`       | equal   |
| `!=`       | not equal   |
| `in`      | is in sequence |
| `not in`       | is not in sequence   |

> A sequence can be a list, tuple, etc.; see [Chapter 3](#data_structure)

#### Boolean Operations
| Operator | Description    | Example     |
|----------|----------------|-------------|
| `and`        | Both conditions are True ⇒ result is True  | True and True ⇒ True   |
| `or`        | At least one of the conditions is True ⇒ result is True   | False or True ⇒ True   |
| `not`        | Negate the boolean value | not True ⇒ False  |
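
Comparison, membership and boolean operators can be combined into a single condition. The sketch below uses made-up values (`score`, `name`, `enrolled`) purely for illustration.

```python
score = 75
name = 'Alice'
enrolled = ['Alice', 'Bob']

passed = score >= 60
passed_but_not_top = score >= 60 and score < 90
same_check = 60 <= score < 90                # chained comparison, same meaning
can_sit_exam = name in enrolled and not score < 60

print(passed, passed_but_not_top, same_check, can_sit_exam)
```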


> ⇨ Exercise 2.6.1  
Try different operators on the same or different data types.


### §2.7 Control Flow<a id='control_flow'></a>
#### if statement

In [None]:
score = 70
if score >= 60:
    print('Pass')

Be careful with the indentation. The usual convention is 4 spaces per level; avoid mixing tabs and spaces.  
If the condition is `True`, the indented part (`if clause`) will be executed. Otherwise, the `if clause` will be skipped.
You can also add an `else clause`, which will be executed when the condition evaluates to `False`.

In [None]:
score = 50
if score >= 60:
    print('Pass')
else:
    print('Fail')

> Important:  
> Don't miss the colon `:`.  

If you want to add more condition branches, you can use `elif` clause:

In [None]:
score = 75
if score >= 90:
    print('A')
elif score >= 80:
    print('B')
elif score >= 70:
    print('C')
elif score >= 60:
    print('D')
else:
    print('F')

#### for loops
To calculate the sum of three numbers, we can write:

In [None]:
result = 1 + 2 + 3
print(result)

However, if we are going to calculate 1 + 2 + 3 + ... + 99999, it will not be practical to list all the numbers in the statement. Thus we need loop statements.  
There are two types of loop statements in Python: `for` loops and `while` loops.

To print out all the content in a list:

In [None]:
countries = ['Singapore', 'Vietnam', 'Thailand', 'Indonesia']
for country in countries:
    print(country)

Let's calculate the sum of all the integers between 1 and 100.
We don't have to list all the numbers, because Python provides the `range()` function: `range(n)` produces the integers from `0` up to, but not including, `n`.
We can check the generated range by converting it to a list:

In [None]:
r = range(10)
print(list(r))

Let's try calculating the sum from 1 to 100:

In [None]:
total = 0  # named `total` to avoid shadowing the built-in `sum` function
for x in range(101):
    total += x
print(total)
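
`range()` also accepts a start value and a step. The sketch below shows the three forms and checks the loop result against the closed-form formula n(n + 1)/2; the variable names are illustrative.

```python
first_five = list(range(1, 6))       # start at 1, stop before 6
evens = list(range(0, 10, 2))        # step of 2
countdown = list(range(5, 0, -1))    # a negative step counts down

# Sum 1..100 with an explicit start, then compare with n * (n + 1) / 2.
total = 0
for x in range(1, 101):
    total += x
formula = 100 * 101 // 2

print(first_five, evens, countdown)
print(total, formula)
```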

#### While Loop
In a `while` loop, the `loop body` is repeated as long as the condition is `True`, and it stops when the condition evaluates to `False`.
For example, to calculate the factorial of a given number:

In [None]:
def factorial(n):
    fact = 1  # start from 1 so that factorial(0) and factorial(1) return 1
    while n > 1:
        fact *= n
        n -= 1
    return fact

print(factorial(5))

In [None]:
i = 0
while i < 5:
    print("i is {}".format(i)) 
    i = i + 1


### 2.7 Code Challenge

Write a program to get sum of all the odd numbers within 100 using loops in at least three ways.

In [None]:
# TO DO

#### Break
We can use `break` statement to exit a loop.

In [None]:
for x in [2, 4, 6, -1, 8, 0]:
    if x < 0:
        break
    print(x)
print('END')

#### Continue
We can use `continue` statement to skip current iteration, and continue with the next one.  
For example, we are going to get the sum of all the non-negative numbers in a list:

In [None]:
total = 0  # named `total` to avoid shadowing the built-in `sum` function
for x in [2, -2, 4, 6, -1, 8, 0]:
    if x < 0:
        print('skip', x)
        continue
    total += x
print('END')
print('result =', total)

### §2.8 Some Commonly Used Builtin Functions<a id='builtin_functions'></a>
[int](https://docs.python.org/3/library/functions.html#int): Return an integer object constructed from a number or string x, or return 0 if no arguments are given.  
[float](https://docs.python.org/3/library/functions.html#float): Return a floating point number constructed from a number or string x.  
[str](https://docs.python.org/3/library/stdtypes.html#str): Return a string version of object.  
[bool](https://docs.python.org/3/library/functions.html#bool): Return a Boolean value, i.e. one of True or False.  
[isinstance](https://docs.python.org/3/library/functions.html#isinstance): Return true if the object argument is an instance of the classinfo argument, or of a (direct, indirect or virtual) subclass thereof.  
[abs](https://docs.python.org/3/library/functions.html#abs): Return the absolute value of a number.  
[max](https://docs.python.org/3/library/functions.html#max): Return the largest item in an iterable or the largest of two or more arguments.  
[min](https://docs.python.org/3/library/functions.html#min): Return the smallest item in an iterable or the smallest of two or more arguments.  
[len](https://docs.python.org/3/library/functions.html#len): Return the length (the number of items) of an object.  
[range](https://docs.python.org/3/library/stdtypes.html#typesseq-range): The range type represents an immutable sequence of numbers and is commonly used for looping a specific number of times in for loops.  
[type](https://docs.python.org/3/library/functions.html#type): With one argument, return the type of an object.  
[round](https://docs.python.org/3/library/functions.html#round): Return number rounded to ndigits precision after the decimal point.  
[map](https://docs.python.org/3/library/functions.html#map): Return an iterator that applies function to every item of iterable, yielding the results.  
[filter](https://docs.python.org/3/library/functions.html#filter): Construct an iterator from those elements of iterable for which function returns true.  
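
A quick tour of several of these built-ins in one place. All values below are arbitrary examples.

```python
count = int('42') + 1               # string -> int
ratio = float('2.5') * 2            # string -> float
label = str(10) + ' items'          # int -> string
largest = max(3, 9, 4)              # largest of several arguments
smallest = min([3, 9, 4])           # smallest item in an iterable
distance = abs(-7.5)                # absolute value
n_chars = len('python')             # number of characters
rounded = round(3.14159, 2)         # 2 digits after the decimal point
is_int = isinstance(count, int)     # type check

print(count, ratio, label, largest, smallest, distance, n_chars, rounded, is_int)
```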

### §2.9 Functions<a id='functions'></a>
We all know the following formula to calculate the area of a circle:
\begin{equation*}
S = \pi r^2
\end{equation*}

Suppose we are going to calculate the areas of three different circles:

In [None]:
r1 = 5.23
r2 = 45.6
r3 = 9.4

s1 = 3.14 * r1 ** 2
s2 = 3.14 * r2 ** 2
s3 = 3.14 * r3 ** 2

print('s1={:.2f}, s2={:.2f}, s3={:.2f}'.format(s1, s2, s3))

As you can see, we repeated `3.14 * r ** 2` every time we calculated a circle's area. Besides, if we later want a more precise value of π, we have to update every place where `3.14` is used, one by one.  
We can extract the circle area calculation into a function, so that we can simply call the function to do the work for us.

In [None]:
def circle_area(r):
    return 3.14 * r ** 2

print('The circle area is {}'.format(circle_area(5.6)))
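
With the calculation in one place, switching to a more precise π is a one-line change. The variant below is a sketch that assumes the standard-library constant `math.pi`; the name `PI` is chosen here for illustration.

```python
import math

PI = math.pi  # defined once; every call below uses the same value


def circle_area(r):
    """Return the area of a circle of radius r."""
    return PI * r ** 2


s1 = circle_area(5.23)
s2 = circle_area(45.6)
s3 = circle_area(9.4)

print('s1={:.2f}, s2={:.2f}, s3={:.2f}'.format(s1, s2, s3))
```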

`Functions` are defined using the `def` keyword. After this keyword comes an `identifier` name for the function, followed by a pair of parentheses which may enclose some parameter names, and a final colon that ends the line. Next follows the indented block of statements that make up the function body. An example shows that this is actually very simple:

In [None]:
def say_hello():
    # block belonging to the function
    print('hello world')
# End of function

say_hello()  # call the function
say_hello()  # call the function again

In [None]:
def print_max(a, b):
    if a > b:
        print(a, 'is maximum')
    elif a == b:
        print(a, 'is equal to', b)
    else:
        print(b, 'is maximum')

# directly pass literal values
print_max(3, 4)

x = 5
y = 7

# pass variables as arguments
print_max(x, y)

## Chapter 3: Data Structure<a id='data_structure'></a>
<a href="#0">Go to top</a>
***

Data structures give us specific ways of storing and organizing data so that it can be accessed and worked with efficiently.

![image.png](attachment:image.png)

### §3.1 List<a id='list'></a>
A `list` is a sequential collection of Python data values, where each value is identified by an index. 
The values that make up a `list` are called its elements. 
Lists are like strings, which are ordered collections of characters, except that the elements of a list can be of any type, and a single list can mix items of different types.

In [None]:
vocabulary = ["iteration", "selection", "control"]
numbers = [17, 123]
empty = []
mixedlist = ["hello", 2.0, 5*2, [10, 20]]

print(numbers)
print(mixedlist)
newlist = [ numbers, vocabulary ]
print(newlist)

The variable `vocabulary` is a `list`. You can get the number of items in it by using `len()`:

In [None]:
len(vocabulary)

Visit elements in the list by using indexing. Bear in mind that the index starts from `0`.

In [None]:
numbers = [17, 123, 87, 34, 66, 8398, 44]
print(numbers[0])
print(numbers[9 - 8])
print(numbers[-2])
print(numbers[len(numbers) - 1])
print(numbers[7]) # out of range, will cause error

We can get the index number of the last element in a list by calculating `len(numbers) - 1`.  
Or we can use `-1` to get the last index directly:

In [None]:
print(numbers[-1])

`list` is a mutable sequence, which means you can add items to it, or remove items from it.
####  append()
To add one more item to the end of the list:

In [None]:
fruit = ['apple', 'banana', 'orange', 'cherry', 'kiwi']
fruit.append('coconut')
print(fruit)

#### extend()
Add every item of another list to the end of the list:

In [None]:
fruit.extend(['avocado', 'mango'])
print(fruit)

Another way to extend a list:

In [None]:
fruit = ['apple', 'banana', 'orange', 'cherry', 'kiwi', 'coconut'] + ['avocado', 'mango']
print(fruit)

####  insert()
Insert an item to a specific position in the list:

In [None]:
fruit = ['apple', 'banana', 'orange', 'cherry', 'kiwi']
fruit.insert(2, 'mango')
print(fruit)

#### pop()
Remove the last item.

In [None]:
fruit = ['apple', 'banana', 'orange', 'cherry', 'kiwi']
print(fruit.pop()) # pop() returns the popped item
print(fruit)
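
`pop()` also accepts an index, and the list method `remove()` deletes an item by value. A short sketch using the same fruit list:

```python
fruit = ['apple', 'banana', 'orange', 'cherry', 'kiwi']

first = fruit.pop(0)       # remove and return the item at index 0
fruit.remove('cherry')     # remove the first item equal to 'cherry'

print(first)
print(fruit)
```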

The data types in a list can be different. For example:

In [None]:
my_list = [123, 'abc', False]

`list` can be nested, which means a `list` can be in another list:

In [None]:
my_list = [1, 2, 3, [4, 5, 6], 7]
len(my_list)

If there is nothing in a `list`, its length is `0`:

In [None]:
empty_list = []
len(empty_list)

#### Slicing
It is a common operation to get a part of a `list`.
For instance, to get the first 3 items in the following list.

In [None]:
fruit = ['apple', 'banana', 'orange', 'cherry', 'kiwi']

In [None]:
fruit[0:3]

`fruit[0:3]` means getting the items from index `0` to index `3`, but `3` is not included. Thus the result is equivalent to `[ fruit[0], fruit[1], fruit[2] ]`.

We can omit the starting index if it is `0`.

In [None]:
# from the beginning to the 3rd item -> the first 3 items
fruit[:3]

Since `fruit[-2]` represents the second last item, we can apply the same idea in slicing:

In [None]:
# from the second last to the end -> the last 2 items
fruit[-2:]

In [None]:
# from the 2nd last up to (but not including) the last -> only the 2nd last item
fruit[-2:-1]

If the ending index is omitted, the slice runs to the end of the list. Remember that index `-1` represents the last item.  
A third number, after a second colon `:`, sets the step.

In [None]:
# starting at 0 and continuing until end, 
# take every other item from `fruit`
fruit[::2]
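
A few more slices as a sketch, including a negative step and a string, since slicing works the same way on strings. The word `'Python'` is just a sample value.

```python
fruit = ['apple', 'banana', 'orange', 'cherry', 'kiwi']
word = 'Python'

every_other = fruit[::2]     # indices 0, 2, 4
backwards = fruit[::-1]      # a step of -1 walks the list in reverse
middle = fruit[1:-1]         # everything except the first and last item
prefix = word[:2]            # slicing a string gives a string
reversed_word = word[::-1]

print(every_other, backwards, middle, prefix, reversed_word)
```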

#### List and Loops

It is also possible to perform list traversal using iteration by item as well as iteration by index.

In [None]:
fruits = ["apple", "orange", "banana", "cherry"]

for afruit in fruits:     # by item
    print(afruit)

It almost reads like natural language: For (every) fruit in (the list of) fruits, print (the name of the) fruit.

We can also use the indices to access the items in an iterative fashion.

In [None]:
fruits = ["apple", "orange", "banana", "cherry"]

for position in range(len(fruits)):     # by index
    print(fruits[position])
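
When both the index and the item are needed, the built-in `enumerate()` yields them together. This is an alternative to `range(len(...))`, sketched here with an illustrative `labels` list:

```python
fruits = ["apple", "orange", "banana", "cherry"]

labels = []
for position, afruit in enumerate(fruits):   # yields (index, item) pairs
    labels.append('{}: {}'.format(position, afruit))

print(labels)
```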

Since lists are mutable, it is often desirable to traverse a list, modifying each of its elements as you go. The following code squares all the numbers from 1 to 5 using iteration by position.

In [None]:
numbers = [1, 2, 3, 4, 5]
print(numbers)

for i in range(len(numbers)):
    numbers[i] = numbers[i] ** 2

print(numbers)


The following example shows how we can get the maximum value from a list of integers.

In [None]:
nums = [9, 3, 8, 11, 5, 29, 2]
best_num = nums[0]  # start from the first item, so lists of only negative numbers also work
for n in nums:
    if n > best_num:
        best_num = n
print(best_num)

#### List generation

Whenever you need to write a function that creates and returns a list, the pattern is usually:

In [None]:
def doubleStuff(a_list):
    new_list = []
    for value in a_list:
        new_elem = 2 * value
        new_list.append(new_elem)
    return new_list

things = [2, 5, 9]
print(things)
things = doubleStuff(things)
print(things)

In [None]:
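# `is_prime` is not defined in the original; a simple trial-division helper is assumed here.
def is_prime(n):
    """ Return True if n is a prime number. """
    return n > 1 and all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def primes_upto(n):
    """ Return a list of all prime numbers less than n. """
    result = []
    for i in range(2, n):
        if is_prime(i):
            result.append(i)
    return result

print(primes_upto(20))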

`Slicing` also works on `string` and `tuple` values.

#### List Comprehension
List comprehension is a handy way to generate lists in Python.  
For example, to generate a list `[1, 2, 3, 4, 5]`, we can easily try `list(range(1, 6))`. However, what if we want a list like `[1**2, 2**2, 3**2, 4**2, 5**2]`?  
We can write a loop like this:

In [None]:
num = []
for i in range(1, 6):
    num.append(i ** 2)
print(num)

By using `list comprehension`, we can easily do the same thing with only one line of code:

In [None]:
[i * i for i in range(1, 6)]

In list comprehension, we first write down the expression to calculate every item in the list, followed by a `for` loop.  
You can also add a condition after the `for` loop to filter the source list.

In [None]:
[i * i for i in range(1, 6) if i % 2 == 0]

It is possible to have multiple levels of `for` loops.  
Try to explain the difference between the following 2 list comprehensions:

In [None]:
list_1 = [x + str(y) for x in 'abc' for y in range(3)]
list_2 = [x + str(y) for y in range(3) for x in 'abc']

print(list_1)
print(list_2)
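
As a hint for reading such comprehensions: the `for` clauses nest from left to right, so the first `for` is the outer loop. The sketch below expands `list_1` into ordinary nested loops (the name `expanded` is illustrative).

```python
list_1 = [x + str(y) for x in 'abc' for y in range(3)]

expanded = []
for x in 'abc':            # first `for` clause: the outer loop
    for y in range(3):     # second `for` clause: the inner loop
        expanded.append(x + str(y))

print(expanded)
print(expanded == list_1)
```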

### §3.2 Tuple<a id='tuple'></a>
A `tuple` is very similar to a `list`, except that a `tuple` is immutable: you cannot modify it, which includes adding or removing items.
Instead of the square brackets `[]` used for a list, we use parentheses `()` to write tuples.

In [None]:
colors = ('red', 'green', 'blue')

Once a tuple is initialized, its elements cannot be changed, so methods like `append()` or `insert()` are not available on tuples. But you can do indexing and slicing on tuples just like on lists.

In [None]:
colors[-1]

In [None]:
colors[:-2]
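
Two details that often surprise beginners, shown as a short sketch: "adding" to a tuple builds a new tuple, and a one-item tuple needs a trailing comma.

```python
colors = ('red', 'green', 'blue')

more_colors = colors + ('yellow',)   # builds a new tuple; `colors` is unchanged
single = ('red',)                    # the trailing comma makes a 1-item tuple
not_a_tuple = ('red')                # without the comma this is just a string

print(colors)
print(more_colors)
print(type(single), type(not_a_tuple))
```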

#### Tuple as return values

Functions can return tuples as return values. This is very useful: we often want to know a batsman's highest and lowest score, or the mean and the standard deviation, or the year, the month, and the day; or, in ecological modeling, the number of rabbits and the number of wolves on an island at a given time. In each case, a function (which can only return a single value) can create a single tuple holding multiple elements.

For example, we could write a function that returns both the area and the circumference of a circle of radius r.

In [None]:
def circleInfo(r):
    """ Return (circumference, area) of a circle of radius r """
    c = 2 * 3.14159 * r
    a = 3.14159 * r * r
    return (c, a)

print(circleInfo(10))
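
The caller can unpack the returned tuple straight into separate variables. The sketch below uses a variant named `circle_info` that takes π from the standard-library `math` module; both choices are assumptions made for this illustration.

```python
import math


def circle_info(r):
    """ Return (circumference, area) of a circle of radius r """
    return (2 * math.pi * r, math.pi * r * r)


circumference, area = circle_info(10)   # unpack the tuple into two names

print('circumference={:.2f}, area={:.2f}'.format(circumference, area))
```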

### §3.3 Dictionary<a id='dictionary'></a>
Python has a dictionary data structure, called `dict`. A `dict` stores key-value pairs, and looking up a value by its key is very fast (on average, the time does not depend on the size of the `dict`).
The following dict contains name-age pairs, where names are the keys:

In [None]:
ages = {
    'Elbert': 23,
    'Bob': 55,
    'David': 8,
    'Alice': 20,
    'Frank': 14,
    'Calvin': 42,
}
ages['Calvin']

We can add new key-value pairs into a `dict`:

In [None]:
ages['Harry'] = 19
ages

Since keys are unique in a `dict`, assigning a value to an existing key will replace the old value:

In [None]:
ages['Alice'] = 30
ages

> Note:  
Since Python 3.7, a `dict` preserves the insertion order of its key-value pairs (in older versions the order was not guaranteed).  
Keys must be immutable; mutable objects like `list` cannot be used as keys.

If a key is not in the dict, trying to get its value will cause a KeyError:

In [None]:
ages['Jack']

To check if a key is in a `dict`, we can use `in` keyword, just like checking if a value is in a `list`:

In [None]:
'David' in ages

#### keys method

The keys method returns what Python 3 calls a view of its underlying keys. We can iterate over the view or turn the view into a list by using the list conversion function.

In [None]:
inventory = {'apples': 430, 'bananas': 312, 'oranges': 525, 'pears': 217}

for akey in inventory.keys():     # keys come back in insertion order (Python 3.7+)
    print("Got key", akey, "which maps to value", inventory[akey])

ks = list(inventory.keys())
print(ks)

#### values method

The values and items methods are similar to keys. They return view objects which can be turned into lists or iterated over directly. Note that the items are shown as tuples containing the key and the associated value.

In [None]:
inventory = {'apples': 430, 'bananas': 312, 'oranges': 525, 'pears': 217}

print(list(inventory.values()))

In [None]:
inventory = {'apples': 430, 'bananas': 312, 'oranges': 525, 'pears': 217}
print(list(inventory.items()))

for (k,v) in inventory.items():
    print("Got", k, "that maps to", v)

#### Get method

A safer way to look up a value is `get()`, which returns a default value instead of raising a `KeyError` when the key does not exist:

In [None]:
print(inventory.get('apples'))
print(inventory.get('mango'))

`None` is returned by default if the key cannot be found.  
You can also define your own default value as the second argument:

In [None]:
ages = {
    'Elbert': 23,
    'Bob': 55,
    'David': 8,
    'Alice': 20,
    'Frank': 14,
    'Calvin': 42,
}
print(ages.get('Jack', 0))
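
A common use of the default argument is counting. The sketch below tallies words with `get(word, 0)`; the sentence and the variable names are made up for illustration.

```python
# Count how often each word occurs, using get() to supply 0 for unseen words
words = "the cat and the hat and the bat".split()

counts = {}
for word in words:
    # get() returns the current count, or 0 if the word is not a key yet
    counts[word] = counts.get(word, 0) + 1

print(counts)
```

Without `get()`, the first occurrence of each word would need a separate `if word in counts` check to avoid a `KeyError`.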

Use `pop(key)` to remove a key-value pair from a `dict` and return its value:

In [None]:
popped_value = ages.pop('Elbert')
print(popped_value)
print(ages)

If you only need to remove a certain key-value pair and do not care about the removed value, you can use the `del` statement:

In [None]:
del ages['Bob']   # 'Elbert' was already popped above, so deleting it again would raise a KeyError
print(ages)

Use a `for` loop to iterate over all the key-value pairs in the dictionary:

In [None]:
for key in ages:
    print(key, ages[key])

In [None]:
for key, value in ages.items():
    print(key, value)

### 3.3 Code Challenge

Write a program called alice_words.py that creates a text file named alice_words.txt containing an alphabetical listing of all the words, and the number of times each occurs, in the text version of Alice’s Adventures in Wonderland. (You can obtain a free plain text version of the book, along with many others, from http://www.gutenberg.org.) The first 10 lines of your output file should look something like this

![image.png](attachment:image.png)

In [None]:
# TO DO




### §3.4 Set<a id='set'></a>
A `set` is like a `dict` that stores only keys. Since keys are always unique, a `set` contains no duplicates.

In [None]:
s = {2, 1, 1, 3, 2}
print(s)
print(s[0]) # error: a set is unordered, so it does not support indexing

Or you can also construct a `set` by passing a list to the `set()` function:

In [None]:
s = set([2, 1, 1, 3, 2])
print(s)

To create an empty set, you should use the `set()` function without an argument. You cannot use `{}` since it represents an empty dictionary.

In [None]:
empty_set = set()
print(empty_set)

As the examples above show, duplicates in the input are removed when a `set` is created. Unlike a `dict`, a `set` is unordered: its elements have no guaranteed order.  
#### add

In [None]:
s = set([2, 1, 1, 3, 2])
s.add(9)
print(s)

#### remove

In [None]:
s.remove(2)
print(s)

#### difference
The `difference()` method returns a new set containing the items that are in the first set but not in the second. Here `z` holds the items of `y` that are not in `x`.

In [None]:
x = {"apple", "banana", "cherry"}
y = {"google", "microsoft", "apple"}
z = y.difference(x)

print(z)

#### update
The `update()` method modifies the current set in place by adding the items from another set.

In [None]:
x = {"apple", "banana", "cherry"}
y = {"google", "microsoft", "apple"}

x.update(y)

print(x)
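
Sets also support operators for the usual set algebra. The sketch below reuses the two example sets and shows the operator next to its equivalent method; the result variable names are chosen only for this illustration.

```python
x = {"apple", "banana", "cherry"}
y = {"google", "microsoft", "apple"}

only_in_x = x - y        # same as x.difference(y)
in_both = x & y          # same as x.intersection(y)
in_either = x | y        # same as x.union(y); x and y are left unchanged
in_exactly_one = x ^ y   # same as x.symmetric_difference(y)

print(only_in_x, in_both, in_either, in_exactly_one, sep="\n")
```

Unlike `update()`, none of these operators modifies `x` or `y`; each one builds a new set.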

## Chapter 4: Pandas Guidebook<a id='pandas'></a>
<a href="#0">Go to top</a>
***


### §4.1 Packages<a id='packages'></a>

A Python program becomes hard to maintain when too many variables and functions live in a single file. Grouping functions into separate modules keeps each file small. In Python, each `.py` file can be regarded as a `module`.

To prevent name conflicts between modules, Python provides packages to organize them.

For example, two modules can both be called `mod` as long as they live in different packages, say `packageone` and `packagetwo`. They are then identified as `packageone.mod` and `packagetwo.mod`:

```
packageone
    ├─ __init__.py
    └─ mod.py

packagetwo
    ├─ __init__.py
    └─ mod.py
```

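A regular package directory contains an `__init__.py` file, which tells Python to treat the directory as a package rather than a normal folder. (Since Python 3.3, a directory without `__init__.py` can still be imported as a namespace package, but including the file remains the conventional choice.) `__init__.py` can be an empty file.  
Packages can be nested, as in the following directory structure: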

```
mypackage
    │
    ├─ innerpackage
    │    ├─ __init__.py
    │    ├─ one.py
    │    └─ two.py
    │
    ├─ __init__.py
    └─ mod.py
```

`one.py`'s module name is `mypackage.innerpackage.one`.  
The Python file for module `mypackage.innerpackage` is `__init__.py` in `mypackage/innerpackage`.
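
To see the naming rule in action, the sketch below builds part of the `mypackage` layout in a temporary directory and imports `one.py` by its full module name. The `hello` function and the temporary location are assumptions made only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build mypackage/innerpackage/one.py inside a throwaway directory
root = Path(tempfile.mkdtemp())
inner = root / "mypackage" / "innerpackage"
inner.mkdir(parents=True)
(root / "mypackage" / "__init__.py").write_text("")
(inner / "__init__.py").write_text("")
(inner / "one.py").write_text("def hello():\n    return 'hello from one'\n")

# Make the directory that contains mypackage importable
sys.path.insert(0, str(root))
importlib.invalidate_caches()

one = importlib.import_module("mypackage.innerpackage.one")
print(one.__name__)   # the full dotted module name
print(one.hello())
```

In a real project you would normally write `from mypackage.innerpackage import one` instead of calling `importlib` directly.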

> Note:  
The names of your own modules should never clash with standard library modules such as `os` or `sys`.


In [None]:
import sys
!{sys.executable} -m pip install pandas

### §4.2 Regular Imports<a id='imports'></a>
To use a certain module/package in our code, we can simply import it at the beginning of the file:

In [None]:
import math

result = math.factorial(5)
print(result)

We can import multiple modules in one line:

In [None]:
import math, time

We can also rename the package/module imported:

In [None]:
import numpy as np

np.average([1,2,3,4])

Sometimes we may just want to import a certain part of a module or library:

In [None]:
from math import sqrt
sqrt(9)

In [None]:
import math
math.sqrt(9)

In Jupyter (IPython), you can view a function's documentation by appending `?`:

In [None]:
math.sqrt?

You can also import everything from a package/module using `*` (this manner is usually not recommended):

In [None]:
from math import *

The built-in function `dir()` is used to find out which names a module defines. It returns a sorted list of strings:

In [None]:
import sys
dir(sys)

### §4.3 Pandas intro<a id='pandas_intro'></a>

Pandas is an open-source Python library for performing highly specialized data analysis. Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.

- It provides a single library for data analysts to easily process, extract, and manipulate data.
- Pandas provides two new data structures: Series and DataFrame.
- The new data structures provide data manipulation capabilities within Python comparable to those of an SQL-based relational database.


In [None]:
!pip install pandas 
import pandas as pd

#### Create Series

Create a Series by passing a list of values; pandas then creates a default integer index, starting from 0.

In [None]:
s = pd.Series([1, 3, 5, 4, 6, 8])
print(s)

#### Create Dataframe

You can create a DataFrame by passing a dict of Series (the indexes are aligned, and missing entries become `NaN`), or by passing a NumPy array together with labeled columns:

In [None]:
import numpy as np
import pandas as pd
df = pd.DataFrame(
            {
               "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
               "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"])
           }
       )

print(df)

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
print(df)

You can also create a DataFrame by passing a dict of objects that can each be converted to a Series-like column:

In [None]:
df2 = pd.DataFrame(
                {
                "A": 1.0,
                "B": pd.Timestamp("20210402"),
                "C": pd.Series(1, index=list(range(4)), dtype="float32"),
                "D": np.array([3] * 4, dtype="int32"),
                "E": pd.Categorical(["test", "train", "test", "train"]),
                "F": "foo",
            }
        )
print(df2)

**df.info()** reports the index, the column dtypes, and the non-null counts, while **df.describe()** shows a quick statistical summary of your data:

In [None]:
df2.info()

In [None]:
df2.describe()

### §4.4 View Data<a id='view_data'></a>

We can view the top and bottom rows of the frame using **head()** and **tail()**:

In [None]:
df.head() #default top 5 rows

In [None]:
df.tail(2)

We can also inspect the DataFrame's index, column names, data types, and shape:

In [None]:
# index
print(df.index)

# column names
print(df.columns)

# data types
print(df.dtypes)

# shape
print(df.shape)

### §4.5 Slicing & Indexing<a id='slice_index_pandas'></a>

There are several ways to do [slicing & indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing), which you will use a lot to get and set subsets of pandas objects.

- `[]`: indexing with `[]` (a.k.a. `__getitem__` for those familiar with implementing class behavior in Python) selects lower-dimensional slices.
- .loc: `.loc` is primarily label based, but may also be used with a boolean array. `.loc` will raise `KeyError` when the items are not found.
    1. A single label, e.g. `5` or `'a'` (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).

    2. A list or array of labels `['a', 'b', 'c']`.

    3. A slice object with labels `'a':'f'` (Note that, contrary to usual Python slices, both the start and the stop are included when present in the index.)

    4. A boolean array (any NA values will be treated as False).

    5. A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above) 
- .iloc: `.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. `.iloc` will raise `IndexError` if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing.
    1. An integer e.g. `5`.

    2. A list or array of integers `[4, 3, 0]`.

    3. A slice object with ints `1:7`.

    4. A boolean array (any NA values will be treated as False).

    5. A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
- .at: similar to `loc`, `at` provides label-based scalar lookups.
- .iat: similar to `iloc`, `iat` provides integer-position-based scalar lookups.
- .isin: `isin()` returns a boolean vector marking which values are in a given list; the vector can then be used to select rows from a DataFrame.
- .where: unlike boolean indexing with `[]`, `where()` returns an object of the same shape as the original, with the values where the condition is `False` replaced by `NaN`.

In [None]:
df["A"] # select the column A

In [None]:
df[0:2] # select the first 2 rows

In [None]:
df.loc[:, ["A", "B"]] # select all rows in column A, column B

In [None]:
df2.loc[1:2, ["A", "B"]] # the 2nd and 3rd row in column A and column B

In [None]:
df.iloc[3] # the 4th row 

In [None]:
df.iloc[1:3, :] # 2nd and 3rd row in all columns

In [None]:
df.iat[1, 1] # the 2nd row, 2nd column value
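
To contrast label-based and position-based selection, the sketch below uses a small frame with string labels. The frame `df_demo` and its values are made up for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

by_label = df_demo.loc["a":"c", "x"]   # label slice: "c" is included, so 3 rows
by_position = df_demo.iloc[0:2, 0]     # position slice: the stop is excluded, so 2 rows

print(by_label)
print(by_position)
print(df_demo.at["b", "y"])            # scalar lookup by label
print(df_demo.iat[1, 1])               # the same cell, looked up by position
```

The difference in endpoint handling between `.loc` and `.iloc` is a frequent source of off-by-one mistakes, so it is worth checking which one a piece of code uses.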

In [None]:
df[df["A"] > 0] # use boolean index to filter the dataframe

In [None]:
df[df > 0]

Use the isin() method for filtering:

In [None]:
df["E"] = ["one", "one", "two", "three", "four", "three"] # new column is added to the dataframe.

In [None]:
df

In [None]:
df[df["E"].isin(["two", "four"])]

In [None]:
df.where(df['B'] > 0) # keeps the shape of df: rows where the condition is False become NaN (df[df['B'] > 0] would drop them)
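
The difference between `where()` and boolean indexing is easier to see on fixed numbers. This is a minimal sketch; `df_demo` and its values are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, -2, 3], "B": [-1, 5, -6]})

masked = df_demo.where(df_demo["B"] > 0)   # same shape; non-matching rows become NaN
filtered = df_demo[df_demo["B"] > 0]       # non-matching rows are dropped

print(masked)
print(filtered)
```

Only the second row has `B > 0`, so `masked` still has three rows (two of them all `NaN`), while `filtered` has a single row.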

### §4.6 Operations<a id='operations_pandas'></a>

- `sum()`: Aggregation function that gives the total sum of a column

- `mean()`: Aggregation function that gives the average value of a column

- `std()`: Aggregation function that gives the standard deviation value of a column

- `count()`: Counts the non-null values in each column; missing values are ignored.

- `unique()`: Similar to `set()`, but only works on a Series object, e.g. `data['Age']`

- `nunique()`: Much like `unique()`, except that it returns the number of unique elements in the Series.

- `value_counts()`: Counts how many times each unique value occurs


In [None]:
df.mean(numeric_only=True) # mean of each numeric column (column E holds strings)

In [None]:
df.mean(axis=1, numeric_only=True) # mean of each row

In [None]:
df['E'].nunique() # number of unique values in column E

In [None]:
df['E'].value_counts()

In [None]:
df[['A','B','C','D']].apply(np.cumsum) # cumulative sum down each numeric column

In [None]:
df[['A','B','C','D']].apply(lambda x: x.max() - x.min())
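
The counting methods listed above treat missing values differently, which the following sketch makes visible on a tiny Series. The Series `s` is made up for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", None])

print(s.count())          # non-null values only
print(s.unique())         # distinct values, including the missing one
print(s.nunique())        # number of distinct non-null values
print(s.value_counts())   # occurrences of each non-null value
```

Note that `unique()` reports the missing value as one of the distinct values, while `nunique()` and `value_counts()` skip it by default.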

### §4.7 Merge Data in Pandas<a id='merge_data'></a>

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

![image.png](attachment:image.png)

Suppose we have two tables, `df_1` and `df_2`:

In [None]:
df_1 = pd.DataFrame({"Name": ["Andy","Bob","Charlie","Elsa","Felicia"], "Gender": ["M","M","M","F","F"]})
df_2 = pd.DataFrame({"Name": ["Andy","Bob","Damian","Elsa","Gertrude"], "Age": [15,23,27,18,46]})

print(df_1)
print(df_2)

In [None]:
# inner join
df_inner = pd.merge(df_1, df_2, on = "Name",how = "inner")
print(df_inner)

# left join
df_left = pd.merge(df_1, df_2, on =  "Name",how = "left")
print(df_left)

# right join
df_right = pd.merge(df_1, df_2, on =  "Name",how = "right")
print(df_right)

# outer join
df_outer = pd.merge(df_1, df_2, on =  "Name",how = "outer")
print(df_outer)
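
When checking a join, it can help to see which table each row came from. The sketch below repeats the outer join with the `indicator=True` option of `pd.merge`, which adds a `_merge` column; the variable names `merged` and `counts` are chosen for this illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({"Name": ["Andy", "Bob", "Charlie", "Elsa", "Felicia"],
                     "Gender": ["M", "M", "M", "F", "F"]})
df_2 = pd.DataFrame({"Name": ["Andy", "Bob", "Damian", "Elsa", "Gertrude"],
                     "Age": [15, 23, 27, 18, 46]})

# _merge is "both", "left_only" or "right_only" for every row
merged = pd.merge(df_1, df_2, on="Name", how="outer", indicator=True)
counts = merged["_merge"].value_counts()

print(merged)
print(counts)
```

The rows marked `both` are exactly the ones an inner join would return.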

### §4.8 Concatenate Data in Pandas<a id='concat_data'></a>


Concatenate pandas objects together with `concat()`:

In [None]:
result = pd.concat([df_1, df_2])
print(result)

In [None]:
df1 = pd.DataFrame(
        {
            "A": ["A0", "A1", "A2", "A3"],
            "B": ["B0", "B1", "B2", "B3"],
            "C": ["C0", "C1", "C2", "C3"],
            "D": ["D0", "D1", "D2", "D3"],
        },
        index=[0, 1, 2, 3],
   )

df2 = pd.DataFrame(
        {
            "A": ["A0", "A5", "A6", "A7"],
            "B": ["B0", "B5", "B6", "B7"],
            "C": ["C0", "C5", "C6", "C7"],
            "D": ["D0", "D5", "D6", "D7"],
        },
        index=[4, 5, 6, 7],
    )

df3 = pd.DataFrame(
        {
            "A": ["A0", "A9", "A10", "A11"],
            "B": ["B0", "B9", "B10", "B11"],
            "C": ["C0", "C9", "C10", "C11"],
            "D": ["D0", "D9", "D10", "D11"],
        },
        index=[8, 9, 10, 11],
   )

frames = [df1, df2, df3]
result = pd.concat(frames)

![image.png](attachment:image.png)

In [None]:
result

It can be useful when we want to combine multiple files into one single file for our data analysis.
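
When the frames being combined each carry their own 0-based index, the result contains repeated index labels. The sketch below shows two options of `pd.concat` that deal with this, `ignore_index` and `keys`; the frames `a` and `b` are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

stacked = pd.concat([a, b])                             # index labels repeat: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)       # fresh index: 0, 1, 2, 3
labelled = pd.concat([a, b], keys=["first", "second"])  # outer index level names the source

print(stacked)
print(renumbered)
print(labelled.loc["second"])
```

`keys` is handy when you later need to know which file or source a row came from.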

> Note:
`concat()` (and therefore the older `append()` method) makes a full copy of the data, so calling it repeatedly, for example inside a loop, can cause a significant performance hit. If you need to combine several datasets, build a list of frames first, for example with a list comprehension, and call `concat()` once.

```python
frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
```

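Older tutorials often use the `append()` instance method on Series and DataFrame as a shortcut for `concat()`. `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use `pd.concat()` instead.

`pd.concat()` takes a list, so it can concatenate any number of objects.

In [None]:
result = pd.concat([df1, df2]) # replaces df1.append(df2)
result

In [None]:
result = pd.concat([df1, df2, df3]) # replaces df1.append([df2, df3])
result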

#### drop duplicate values

After joining, concatenating, or merging, there may be duplicate rows that need to be removed.
We can use `drop_duplicates()` to remove the duplicate rows.

In [None]:
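result.drop_duplicates() # returns a new DataFrame; the original index labels are kept

In [None]:
result.drop_duplicates(inplace=False) # the default: result itself is left unchanged
result

In [None]:
result.drop_duplicates(inplace=True) # modifies result in place and returns None
result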

In [None]:
result.drop_duplicates(inplace=True)
result.reset_index(drop = True)
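
By default a row counts as a duplicate only if every column matches. The sketch below shows the `subset` and `keep` parameters of `drop_duplicates()`, which restrict the comparison to chosen columns and control which copy survives. The `orders` frame is made up for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "bob"],
    "amount": [10, 20, 30, 20],
})

# Full-row comparison: only the two identical ("bob", 20) rows collapse into one
full = orders.drop_duplicates()

# Compare on "customer" only and keep the last row seen for each customer
latest = orders.drop_duplicates(subset="customer", keep="last")

print(full)
print(latest)
```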

### §4.9 Grouping Data in Pandas<a id='group_data'></a>

In data analysis, we often need to answer business questions by summarizing data. For example,

"What were the department's sales in Q1?"

"How many new customers have we acquired in the past year?"

"Which product saw the biggest increase during the past 6 months?"

To answer these questions, we need to group the data by certain dimension(s) and calculate the metrics. By “group by” we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria

- **Applying** a function to each group independently

- **Combining** the results into a data structure

In [None]:
df = pd.DataFrame(
        {
            "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
            "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
            "C": np.random.randn(8),
            "D": np.random.randn(8),
        })
df

In [None]:
# Grouping by column A, and then applying the sum() function to the resulting groups.
df.groupby("A").sum()

In [None]:
# Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df.groupby(["A", "B"]).sum()

In [None]:
# Grouping by column A and apply the maximum function on column C.
df.groupby("A")["C"].max()

In [None]:
# Grouping by column A and aggregating by multiple functions to find mean of column C and sum of column D
df.groupby(['A']).agg({'C':'mean', 'D':'sum'})
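
The dict form of `agg` keeps the original column names. As an alternative sketch, pandas also supports named aggregation, where each keyword names an output column and maps to a `(column, function)` pair. The frame below uses fixed numbers, made up for illustration, so the result is easy to verify by hand.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10, 20, 30, 40],
})

# Output column name = (input column, aggregation function)
summary = df_demo.groupby("A").agg(
    c_mean=("C", "mean"),
    d_sum=("D", "sum"),
)

print(summary)
```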

> Bonus example: We can build a summary table by grouping the data and using a lambda function to apply several aggregations at once.

In [None]:
df = pd.DataFrame(
        data={
            "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
            "City": [
                "Toronto",
                "Montreal",
                "Vancouver",
                "Calgary",
                "Edmonton",
                "Winnipeg",
                "Windsor",
            ],
            "Sales": [13, 6, 16, 8, 4, 3, 1],
        })
df

In [None]:
df.groupby(['Province']).apply(lambda x: pd.Series({'Sales_sum': x['Sales'].sum(),
                                             'Sales_Max': x['Sales'].max(),
                                             'Sales_Min': x['Sales'].min(),
                                             'Sales_Avg': x['Sales'].mean(),
                                             'Sales_# of Unique': x['Sales'].nunique(),
                                             'Sales_Unique': x['Sales'].unique()}))

In [None]:
grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]

df = pd.DataFrame(
        {
            "ID": ["x%d" % r for r in range(10)],
            "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
            "ExamYear": [
                "2007",
                "2007",
                "2007",
                "2008",
                "2008",
                "2008",
                "2008",
                "2009",
                "2009",
                "2009",
            ],
            "Class": [
                "algebra",
                "stats",
                "bio",
                "algebra",
                "algebra",
                "stats",
                "stats",
                "algebra",
                "bio",
                "bio",
            ],
            "Participated": [
                "yes",
                "yes",
                "yes",
                "yes",
                "no",
                "yes",
                "yes",
                "yes",
                "yes",
                "yes",
            ],
            "Passed": ["yes" if x > 50 else "no" for x in grades],
            "Employed": [
                True,
                True,
                True,
                False,
                False,
                False,
                False,
                True,
                True,
                False,
            ],
            "Grade": grades,
        }
    )

df

In [None]:
df.groupby("ExamYear").agg(
        {
            "Participated": lambda x: x.value_counts()["yes"],
            "Passed": lambda x: sum(x == "yes"),
            "Employed": lambda x: sum(x),
            "Grade": lambda x: sum(x)/len(x),
        }
    )

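#### Pivot tables

A pivot table summarizes one column (`values`) for every combination of row labels (`index`) and column labels (`columns`). Because `df` was reassigned above, we first rebuild the sales data:

In [None]:
sales = pd.DataFrame(
        {
            "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
            "City": ["Toronto", "Montreal", "Vancouver", "Calgary", "Edmonton", "Winnipeg", "Windsor"],
            "Sales": [13, 6, 16, 8, 4, 3, 1],
        })

table = pd.pivot_table(
        sales,
        values=["Sales"],
        index=["Province"],
        columns=["City"],
        aggfunc="sum",
        margins=True)
table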

### §4.10 Getting data in/out<a id='get_data_inout'></a>

We can use pandas to read data from files. pandas supports a wide range of file formats, such as `csv`, `excel`, `json`, `HTML`, and `SQL`. For the full list, see the [pandas IO tools guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#).

#### Read in CSV files

In [None]:
df = pd.read_csv("../data/world-happiness-report.csv")

In [None]:
print(df.head(5))   # a cell displays only its last expression, so print the earlier ones
print(df.shape)
df.info()
df.describe()

In [None]:
top_10_GDP = df.groupby('Country name')['Log GDP per capita'].mean().sort_values(ascending=False)[:10]
top_10_GDP

#### Write to CSV file

In [None]:
top_10_GDP.to_csv('../data/top10gdp_country.csv')

#### Read in multiple CSV files

The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all of the individual frames into a list, and then combine the frames in the list using `pd.concat()`:

In [None]:
# generate sample csv files
for i in range(3):
    data = pd.DataFrame(np.random.randn(10, 4))
    data.to_csv("../data/file_{}.csv".format(i), index=False)
    
files = ["../data/file_0.csv", "../data/file_1.csv", "../data/file_2.csv"]

# concatenate files into single dataframe by putting all individual frames into a list
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

result

You can use the same approach to read all files matching a pattern. Here is an example using `glob`:

In [None]:
import glob

files = sorted(glob.glob("../data/file_*.csv")) # glob returns matches in arbitrary order, so sort them
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

result

#### Read/Write excel files

In the most basic use-case, `read_excel` takes a path to an Excel file, and the `sheet_name` indicating which sheet to parse.

The `to_excel()` instance method is used for saving a DataFrame to Excel. Generally the semantics are similar to working with csv data.

```Python
pd.read_excel("path_to_file.xls", sheet_name="Sheet1")
```

In [None]:
apple_price = pd.read_excel('data/AAPL_excel.xls', sheet_name= 'AAPL')
apple_price.head()

### §4.11 Handle missing value<a id='handle_missing_value'></a>

Before you start cleaning a data set, it’s a good idea to just get a general feel for the data. After that, you can put together a plan to clean the data.

- What are the features?
- What are the expected types (int, float, string, boolean)?
- Is there obvious missing data (values that Pandas can detect)?
- Are there other types of missing data that are not so obvious (that pandas can't easily detect)?

Sources of Missing Values:
- User forgot to fill in a field.
- Data was lost while transferring manually from a legacy database.
- There was a programming error.
- Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.

For more details, see this [document on analysis with missing data](https://github.com/matthewbrems/ODSC-missing-data-may-18/blob/master/Analysis%20with%20Missing%20Data.pdf).

In [None]:
import pandas as pd

# Read csv file into a pandas dataframe
df = pd.read_csv("../data/property_data.csv")

# Take a look at df
df

In [None]:
# Total missing values for each feature
df.isnull().sum()
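
Missing values that are written as placeholder text are not detected automatically. The sketch below reads a small in-memory CSV twice, the second time passing the placeholders through the `na_values` parameter of `read_csv`. The column names `OWN_OCCUPIED` and `NUM_BEDROOMS` and the placeholders `--` and `na` are assumptions made for this illustration, not necessarily the contents of `property_data.csv`.

```python
import io
import pandas as pd

csv_text = """PID,OWN_OCCUPIED,NUM_BEDROOMS
1,Y,3
2,N,--
3,,na
"""

# Default parsing: the empty field is detected, "--" and "na" are read as text
default_read = pd.read_csv(io.StringIO(csv_text))

# Tell pandas which extra strings should be treated as missing
custom_read = pd.read_csv(io.StringIO(csv_text), na_values=["--", "na"])

print(default_read.isnull().sum())
print(custom_read.isnull().sum())
```

With the placeholders declared, `NUM_BEDROOMS` is also parsed as a numeric column instead of text.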

#### Remove missing values

One way to handle missing values is to drop the rows that contain them.

In [None]:
df.dropna(how="any") # drop every row that has at least one missing value

In [None]:
df.dropna(subset=['PID', 'ST_NUM']) # drop rows that have missing data on columns PID or ST_NUM.

#### Impute data

Another strategy is to replace (impute) missing values with another value. The following strategies are common:

- for numerical values, replace the missing value with the mean or median of the column
- for categorical values, replace the missing value with the most frequent value of the column

In [None]:
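df['ST_NUM'] = df['ST_NUM'].fillna(125) # replace missing values with a fixed number
df

In [None]:
# Alternatively, replace with the median. Re-read the file before running this cell:
# after the previous cell, ST_NUM has no missing values left to fill.
median = df['ST_NUM'].median()
df['ST_NUM'] = df['ST_NUM'].fillna(median)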

df
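
Both imputation strategies can be sketched on a small frame with one categorical and one numerical column. The frame `df_demo` and its column names are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "color": ["red", None, "red", "blue"],
    "size": [1.0, None, 3.0, 5.0],
})

# Categorical column: fill with the most frequent value (the mode)
most_frequent = df_demo["color"].mode()[0]
df_demo["color"] = df_demo["color"].fillna(most_frequent)

# Numerical column: fill with the mean of the observed values
df_demo["size"] = df_demo["size"].fillna(df_demo["size"].mean())

print(df_demo)
```

`mode()` returns a Series because several values can tie for most frequent; taking `[0]` picks the first of them.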

## Chapter 5: NumPy Guidebook<a id='numpy'></a>
<a href="#0">Go to top</a>
***

NumPy is a package for scientific computing and data analysis.

NumPy provides a multi-dimensional array data structure, `ndarray`, which is a fast, flexible container for large data sets in Python.

In an ndarray, all the elements must be of the same type.
Every ndarray has a `shape`, which is a tuple indicating the size of each dimension, and a `dtype`, describing the data type of the elements in the array.

In [None]:
import numpy as np
a = np.arange(15).reshape(3, 5)
print(a)

print(a.shape)

print(a.ndim)

print(a.dtype.name)

print(a.size)

### §5.1 Array Creation<a id='array_creation'></a>

The most common way to create an ndarray is from a regular Python list or tuple using the `array` function:

In [None]:
import numpy as np
data = [1,2,3]
a = np.array(data)
print(a)
a.dtype

A frequent error is calling `array` with multiple arguments instead of providing a single sequence as the argument:

In [None]:
a = np.array(1,2,3,4) # wrong: raises a TypeError

In [None]:
b = np.array([1,2,3,4]) # right: a single list
print(b.dtype)

> Make sure the values are passed as a single sequence, such as a `list`.

The nesting level determines the number of dimensions. For example, a sequence (list, tuple, etc.) of sequences becomes a two-dimensional array.

In [None]:
arr = np.array([(2.1, 4, 7.0), (-5, 72, 4.3)])
arr

While creating an array, `np.array` will try to choose a proper data type for you.

In [None]:
arr.dtype
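
How the type is chosen can be checked with a few small arrays. This is a minimal sketch; the variable names are made up, and the exact integer width (for example `int64` or `int32`) depends on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])     # an int mixed with floats: upcast to float64
strings = np.array([1, "a"])      # a number mixed with text: everything becomes a string

print(ints.dtype, mixed.dtype, strings.dtype)
```

NumPy picks a type that can represent every element, which is why a single float or string changes the type of the whole array.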

There are some other functions for creating new arrays:

In [None]:
np.zeros( (5,3) )

In [None]:
np.ones( (2, 4) )

In [None]:
np.empty( (5, 6) ) # with garbage values

In [None]:
np.arange(8)

In [None]:
np.eye(4) # identity

In [None]:
np.linspace(0, 18, 10)  # 10 numbers from 0 to 18

In [None]:
import matplotlib.pyplot as plt

x_ = np.linspace(-5, 5, 100)
y_ = 4 * (x_**3) + 2 * (x_**2) + 5 * x_
plt.plot(x_, y_)
plt.show()

You can specify the data type of the created array by setting the `dtype` argument:

In [None]:
np.array([1,2,3], dtype=np.float64)

> Note  
You can explicitly convert an array from one dtype to another using `astype`.  
e.g. `arr.astype(np.int32)`

When you print an array, NumPy displays it in a similar way to nested lists, but with the following layout:

- the last axis is printed from left to right,

- the second-to-last is printed from top to bottom,

- the rest are also printed from top to bottom, with each slice separated from the next by an empty line.

One-dimensional arrays are then printed as rows, two-dimensional arrays as matrices, and three-dimensional arrays as lists of matrices.

In [None]:
a = np.arange(6)  # 1d array
print(a)

In [None]:
b = np.arange(12).reshape(4,3) # 2d array
print(b)

In [None]:
c = np.arange(24).reshape(2,3,4) # 3d array
print(c)

### §5.2 Array Attributes<a id='array_attributes'></a>

`.ndim`  
the number of dimensions (rank)

`.shape`  
the dimensions of the array represented as a tuple

`.size`  
the number of items in the array, which is also the product of the numbers in `.shape`

`.dtype`  
the data type of the items in the array

`.itemsize`  
the size in bytes of each item in the array


In [None]:
arr = np.array( [ [1, 2, 3, 4, 5],
                  [9, 7, 5, 3, 1],
                  [-2, 4, -6, 8, -10]] )
print( "Dimension:", arr.ndim )
print( "Shape:    ", arr.shape )
print( "Size:     ", arr.size )
print( "Type:     ", arr.dtype )
print( "Item size:", arr.itemsize )

### §5.3 Basic Operations<a id='array_operations'></a>
Any arithmetic operation between equal-size arrays is applied elementwise:


In [None]:
arr = np.array([[1, 3, 5], [2, 4, 6]])
arr

In [None]:
arr + arr

In [None]:
arr * arr

In arithmetic operations with a scalar, the operation is applied to each item:

In [None]:
arr / 2

In [None]:
arr ** 0.5

In [None]:
arr < 5

Unlike in many matrix languages, the product operator `*` operates elementwise on NumPy arrays. The matrix product can be performed using the `@` operator (in Python >= 3.5) or the `dot` function or method:

In [None]:
X = np.array( [ [2, 3],   # [ [a,b],
                [0, 1] ]) #   [c,d] ]
Y = np.array( [ [1, 2],   # [ [e,f],
                [3, 4] ]) #   [g,h] ]
print(X @ Y)
print(X.dot(Y))

#[ [a*e+b*g, a*f+b*h]
#  [c*e+d*g, c*f+d*h] ]
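
To make the contrast concrete, the sketch below computes both products for the same two matrices; the names `elementwise` and `matrix` are chosen for this illustration.

```python
import numpy as np

X = np.array([[2, 3],
              [0, 1]])
Y = np.array([[1, 2],
              [3, 4]])

elementwise = X * Y   # multiplies matching positions: [[2*1, 3*2], [0*3, 1*4]]
matrix = X @ Y        # rows of X times columns of Y

print(elementwise)
print(matrix)
```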

In-place assignment operators, such as `+=` and `*=`, modify the existing array rather than creating a new one.

In [None]:
arr = np.zeros( (3, 2) )
arr += 5
arr
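
One way to observe the in-place behavior is to keep a second name for the same array. In this sketch, `alias` is a made-up variable that refers to the same object as `arr`.

```python
import numpy as np

arr = np.zeros(3)
alias = arr        # a second name for the same array, not a copy

arr += 5           # in place: the change is visible through alias too
print(alias)

arr = arr + 1      # builds a new array and rebinds arr; alias keeps the old one
print(arr)
print(alias)
```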

### §5.4 Basic Indexing and Slicing<a id='slice_index_numpy'></a>

Indexing and slicing on a one-dimensional array is the same as on a list.  
For multi-dimensional arrays, the indices are given as a tuple separated by commas,  
i.e. `arr[(1st dimension), (2nd dimension), (3rd dimension), ...]`

In [None]:
arr = np.arange(9).reshape(3,3)
arr

In [None]:
arr[1,2] # the element on the 2nd row, 3rd column

In [None]:
arr[0:3, 1] # rows 0 to 2 (the stop index 3 is excluded) in the 2nd column

In [None]:
arr[:, 1] # same as above

In [None]:
arr[1:3, :] # 2nd and 3rd (1:3) rows of all columns

>  Try to predict the following slicing result on a 3*3 array  
`arr[:2, 1:]`
>  
`arr[2]`
`arr[2, :]`
`arr[2:, :]`
>  
`arr[:, :2]`
>  
`arr[1, :2]`
`arr[1:2, :2]`

### Array index tricks

NumPy offers more indexing facilities than regular Python sequences. In addition to indexing by integers and slices, as we saw before, arrays can be indexed by arrays of integers and arrays of booleans.

In [None]:
a = np.arange(12)**2 # the first 12 square numbers
i = np.array([1, 1, 3, 8, 5]) # an array of indices
print(a)

a[i]

In [None]:
j = np.array([[3, 4], [9, 7]]) # a bidimensional array of indices
a[j]

In [None]:
palette = np.array([[0, 0, 0],         # black
                    [255, 0, 0],       # red
                    [0, 255, 0],       # green
                    [0, 0, 255],       # blue
                    [255, 255, 255]])  # white

image = np.array([[0, 1, 2, 0],        
                  [0, 3, 4, 0]])
# each value corresponds to a color in the palette

palette[image] # the (2, 4, 3) color image

The most natural way one can think of for boolean indexing is to use boolean arrays that have the same shape as the original array:

In [None]:
a = np.arange(12).reshape(3,4)
b = a > 4
b # b is a boolean array with the same shape as a

In [None]:
a[b] # 1d array with the selected elements

This property can be very useful in assignments:

In [None]:
a[b] = 0
a

### §5.5 Functions and Methods<a id='functions_methods'></a>
`.random.random`  
Return random floats in the half-open interval [0.0, 1.0).

In [None]:
np.random.random(5)

`.random.normal`  
Draw random samples from a normal (Gaussian) distribution.

In [None]:
import matplotlib.pyplot as plt
values = np.random.normal(size=100)
plt.hist(values, 30)
plt.show()

`.random.permutation`  
Randomly permute a sequence, or return a permuted range.

In [None]:
np.random.permutation(5)

In [None]:
np.random.permutation([3, 5, 2, 10, 0])

`.floor`  
Return the floor of the input, element-wise.

In [None]:
arr = np.array([-2.5, -1.7, -2.0, 0.2, 1.3])
np.floor(arr)

`.vstack`  
Stack arrays in sequence vertically (row wise).

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack((a,b))

`.hstack`  
Stack arrays in sequence horizontally (column wise).

In [None]:
a = np.array([(1,2),(3,4)])
b = np.array([(5,6),(7,8)])
np.hstack((a,b))

`.hsplit`  
Split an array into multiple sub-arrays horizontally (column-wise).

In [None]:
x = np.arange(12).reshape(3, 4)
np.hsplit(x, 2)

`.vsplit`  
Split an array into multiple sub-arrays vertically (row-wise).

In [None]:
x = np.arange(8).reshape(4, 2)
np.vsplit(x, 2)

`.split`  
`.array_split`  
Split an array into multiple sub-arrays.

In [None]:
import numpy as np
x = np.arange(7)
np.array_split(x, 2)
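
The two functions differ when the array cannot be divided evenly. The sketch below, with made-up variable names, shows that `array_split` returns parts of unequal size while `split` raises an error.

```python
import numpy as np

x = np.arange(7)

# array_split tolerates an uneven division: the first part gets the extra element
parts = np.array_split(x, 2)
print(parts)

# split insists on equal parts and raises ValueError otherwise
try:
    np.split(x, 2)
    split_failed = False
except ValueError as err:
    split_failed = True
    print("np.split failed:", err)
```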

`.diag`  
Extract a diagonal or construct a diagonal array.

In [None]:
x = np.arange(9).reshape((3,3))
x

In [None]:
np.diag(x)

In [None]:
np.diag(np.diag(x))

#### Universal Functions
A universal function, or `ufunc` is a function that performs elementwise operations on data in `ndarray`.  
**Unary ufuncs**  
`abs`  

In [None]:
arr = np.array([3, -6, 0, 19, -5.4])
np.abs(arr)

`sqrt`  
`square`  
`exp`  
`log`,`log10`, `log2`   
`ceil`  
`floor`  

**Binary ufuncs**  
`add`  

In [None]:
arr1 = np.array([1,2,3,4])
arr2 = np.array([5,6,7,8])
np.add(arr1, arr2)

`subtract`  
`multiply`  
`divide`, `floor_divide`  
`power`  

`maximum`  

In [None]:
arr1 = np.array([1,4,5,8])
arr2 = np.array([2,3,6,7])
np.maximum(arr1, arr2)

`minimum`  
`mod`  
`greater`, `greater_equal`  
`less`, `less_equal`, `equal`  
`not_equal`  

#### Mathematical and Statistical Methods
`.sum`  
Sum of all the elements in the array or along an axis. Zero-length arrays have sum 0.  

In [None]:
arr = np.arange(10)
arr.sum()

`.mean`  
`.std`, `.var`  
`.min`, `.max`  
`.argmin`, `.argmax`  
Indices of minimum and maximum elements, respectively.  
`.cumsum`  
Cumulative sum of elements starting from 0.  
`.cumprod`   
Cumulative product of elements starting from 1.  
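
The methods above can be tried on a small array. The sketch below uses made-up sample values and also shows the `axis` argument on a 2-D array:

```python
import numpy as np

arr = np.array([3, 1, 4, 1, 5])

mean_value = arr.mean()       # arithmetic mean
min_index = arr.argmin()      # index of the first minimum value
max_index = arr.argmax()      # index of the maximum value
running_sum = arr.cumsum()    # cumulative sum
running_prod = arr.cumprod()  # cumulative product
print(mean_value, min_index, max_index, running_sum, running_prod)

# for a 2-D array, axis=0 aggregates down each column and axis=1 across each row
grid = np.arange(6).reshape(2, 3)
column_sums = grid.sum(axis=0)
row_sums = grid.sum(axis=1)
print(column_sums, row_sums)
```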

### §5.6 Data Processing Using Arrays<a id='data_processing_numpy'></a>
**Expressing Conditional Logic as Array Operations**  
There is a `where` function in NumPy that allows us to create a new array of values based on another array.

In [None]:
arr = np.random.randn(3,3)
arr

In [None]:
# replace values less than 0 with 0; leave the other values unchanged
np.where(arr < 0, 0, arr) 
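
Since `randn` produces different values on every run, a fixed array makes the behaviour of `where` easier to follow. A minimal sketch with hypothetical scores:

```python
import numpy as np

scores = np.array([35, 80, 62, 49])   # hypothetical sample values

# pick 'pass' where the condition holds and 'fail' elsewhere
labels = np.where(scores >= 50, 'pass', 'fail')

# replace values below 50 with 50; keep the others unchanged
raised = np.where(scores < 50, 50, scores)

print(labels, raised)
```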

**Methods for Boolean Arrays**  


`sum` is often used to count `True` values in a boolean array.

In [None]:
(arr > 0).sum()

`any` and `all` are two methods that are especially useful on boolean arrays.  
`any` checks whether at least one value in an array is `True`.  
`all` checks whether every value in an array is `True`.

In [None]:
bools = np.array([True, False, True, True])

print( bools.any() )
print( bools.all() )

### Code Challenge

Write a program to solve sudoku.
We represent the sudoku puzzle as a 2-D array (here, a list of lists). Each empty cell is filled with zero.

In [None]:
puzzle = [[5,3,0,0,7,0,0,0,0],
            [6,0,0,1,9,5,0,0,0],
            [0,9,8,0,0,0,0,6,0],
            [8,0,0,0,6,0,0,0,3],
            [4,0,0,8,0,3,0,0,1],
            [7,0,0,0,2,0,0,0,6],
            [0,6,0,0,0,0,2,8,0],
            [0,0,0,4,1,9,0,0,5],
            [0,0,0,0,8,0,0,7,9]]

In [None]:
# TO DO





## Chapter 6: Functional Programming<a id='functional_programming'></a>
<a href="#0">Go to top</a>
***

Python's functional programming features make data processing more convenient.

### §6.1 Lambda Function<a id='lambda'></a>
In Python, `lambda` functions are just anonymous functions.

A lambda function can take any number of arguments but can only have one expression.

In [None]:
x = lambda a, b, c : a + b + c
print(x(5, 6, 2))

In [None]:
lambda x: x**3

The above `lambda` returns the cube of the input value:

In [None]:
(lambda x: x**3)(2)

The `x` before `:` is the input argument; the expression after `:` is evaluated and returned. A `lambda` cannot contain a `return` statement.  
The above `lambda` does the same job as the following normal function:

In [None]:
def cube(x):
    return x ** 3

print(cube(2))

If a function is "small" enough and only used once, it is convenient to use `lambda`. `lambda` also accepts multiple arguments:

In [None]:
(lambda x, y: x + y)(2, 5)
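
A typical "small and used once" case is passing a `lambda` as the `key` argument of the built-in `sorted()` function. A minimal sketch with hypothetical sample data:

```python
# hypothetical sample data: (name, age) pairs
people = [('Tom', 32), ('Amy', 25), ('Bob', 41)]

# sort by the second item of each tuple; the lambda is used once and never named
by_age = sorted(people, key=lambda person: person[1])
print(by_age)
```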

### §6.2 Filter<a id='filter'></a>
`filter()` keeps the elements for which a function/lambda returns `True`.

The `filter()` function returns an iterator in which the items have been filtered through a function that tests whether each item is accepted or not.

Filter the list, and return only the values equal to or above 18:

In [None]:
ages = [5, 12, 17, 18, 24, 32]

def myFunc(x):
  if x < 18:
    return False
  else:
    return True

adults = filter(myFunc, ages)

for x in adults:
  print(x)

In [None]:
ages = [5, 12, 17, 18, 24, 32]
list(filter(lambda x: x >= 18, ages))

### §6.3 Map<a id='map'></a>

The `map()` function executes a specified function for each item in an iterable. Each item is passed to the function as an argument.

In other words, it applies a function/lambda to every item in a list.

In [None]:
def addition(n): 
    return n + n 
  
# We double all numbers using map() 
numbers = (1, 2, 3, 4) 
result = map(addition, numbers) 
print(list(result))

In [None]:
nums_1 = [1, 2, 3, 4]
nums_2 = [5, 6, 7, 8]
list(map(lambda x, y: x + y, nums_1, nums_2))

In [None]:
nums_1 = [1, 2, 3, 4]

list(map(lambda x: x + 1, nums_1))

### §6.4 Reduce<a id='reduce'></a>
`reduce()` performs a computation on a list and returns a single result.
It applies a rolling computation to sequential pairs of values in the list.

In [None]:
import functools

lis = [1, 3, 5, 6, 2]

# using reduce to compute the sum of the list
print("The sum of the list elements is : ", end="")
print(functools.reduce(lambda a, b: a + b, lis))

# using reduce to compute the maximum element of the list
print("The maximum element of the list is : ", end="")
print(functools.reduce(lambda a, b: a if a > b else b, lis))

In [None]:
from functools import reduce

nums = [1, 2, 3, 4, 5]
reduce(lambda x, y: x * y, nums)

We can also write a for loop to get the same result:

In [None]:
nums = [1, 2, 3, 4, 5]
result = 1
for num in nums:
    result *= num
print(result)
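
Just as the loop starts from `result = 1`, `reduce()` accepts an optional third argument that serves as the starting value. A short sketch (the starting value `100` is only an illustration):

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# the third argument is the starting value of the accumulator
total = reduce(lambda acc, n: acc + n, nums, 100)
print(total)

# with a starting value, an empty list is safe and simply returns that value
empty_product = reduce(lambda x, y: x * y, [], 1)
print(empty_product)
```

Without a starting value, calling `reduce()` on an empty list raises a `TypeError`.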

#### Map lambda function as a list comprehension

The `map()` function runs a lambda function over the list `[1, 2, 3, 4, 5]` and returns an iterator over the results, which `list()` turns into a list, like this:

In [None]:
list(map(lambda n: n * 2, [1, 2, 3, 4, 5]))

In [None]:
strs = ['Python', 'is', 'great']
list(map(lambda s: s.upper() + '!', strs))
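
The same results can be written as list comprehensions, which many Python programmers find easier to read. A minimal sketch comparing the two styles (the variable names are only illustrative):

```python
nums = [1, 2, 3, 4, 5]

doubled_map = list(map(lambda n: n * 2, nums))
doubled_comp = [n * 2 for n in nums]   # same result as the map() call

# a condition in the comprehension plays the role of filter()
evens_filter = list(filter(lambda n: n % 2 == 0, nums))
evens_comp = [n for n in nums if n % 2 == 0]

strs = ['Python', 'is', 'great']
shouted = [s.upper() + '!' for s in strs]

print(doubled_comp, evens_comp, shouted)
```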

## Chapter 7: Exception Handling<a id='exception_handling'></a>
<a href="#0">Go to top</a>
***

Sometimes syntactically correct code still causes errors when it runs; such errors are called exceptions.  
For example, trying to concatenate a number to a string directly will raise a `TypeError` exception.

In [None]:
print('Result:' + 200)

Of course, we can prevent the above exception by casting the data types properly. However, it is not always possible to guarantee that no exception will occur, especially when dealing with IO (input/output) and network requests.  
For example, your program asks the user to input an integer, but the user inputs a letter:

In [None]:
user_input = input('Input a number: ')
print(int(user_input))

Another example: you are trying to read data from a file that does not exist:

In [None]:
with open('no_such_file.txt', 'r') as f:
    print(f.read())

**Typical Exceptions Description**

![image.png](attachment:image.png)

![image.png](attachment:image.png)

#### Handling Exceptions
You can detect exceptions and handle them properly using try-except blocks.  
```
try:
    run your code
except:
    this part will be executed if any exception occurs
```

In [None]:
try:
    user_input = 'a'
    print(int(user_input))
except:
    print('Error: invalid input.')

In [None]:
while True:
    try:
        user_input = input('Input a number: ')
        print(int(user_input))
        break
    except:
        print('[invalid number]')
        continue

The examples above raise different types of exceptions. You can catch one specific type of exception and bind it to a variable:

In [None]:
try:
    user_input = input('Input a number: ')
    print(int(user_input))
except ValueError as error:
    print(error)
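
A `try` statement can also have several `except` clauses, one per exception type, plus an optional `else` clause that runs only when no exception was raised. The following is a minimal sketch; `parse_and_divide` is a hypothetical helper written for this illustration:

```python
def parse_and_divide(text, divisor):
    """Hypothetical helper: convert text to an int and divide it by divisor."""
    try:
        number = int(text)
        result = number / divisor
    except ValueError as error:
        # int() could not parse the text
        return 'ValueError: ' + str(error)
    except ZeroDivisionError as error:
        # divisor was 0
        return 'ZeroDivisionError: ' + str(error)
    else:
        # runs only when no exception was raised
        return result

print(parse_and_divide('10', 4))
print(parse_and_divide('abc', 4))
print(parse_and_divide('10', 0))
```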

The `try` statement has another optional clause, `finally`, which is intended to define clean-up actions that must be executed under all circumstances.

In [None]:
try:
    raise KeyboardInterrupt
finally:
    print('Goodbye, world!')
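
To see the order in which the clauses run, the sketch below records each step in a list. The function `risky` and the list `events` are hypothetical names used only for this illustration:

```python
events = []

def risky(fail):
    try:
        events.append('try')
        if fail:
            raise ValueError('something went wrong')
    except ValueError:
        events.append('except')
    finally:
        # executed whether or not an exception was raised
        events.append('finally')

risky(fail=False)
risky(fail=True)
print(events)
```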

In [None]:
try:
   f = open("my_file.txt", "w")
   try:
      f.write("Writing some data to the file")
   finally:
      f.close()
except IOError:
   print("Error: my_file.txt does not exist or it can't be opened for output.")

**Debugging**

There are different kinds of errors that can occur in a program. It is useful to be able to distinguish them:

1. Syntax error: It usually indicates something wrong with the syntax of your code, such as
    - Omitting the colon at the end of a function definition
    - Wrong indentation
    - Mismatched quotation marks around a string
    - Unclosed brackets such as {, (
    - Using = where == is intended

2. Runtime error: The program is syntactically correct, but it fails while it is running, such as
    - Infinite loop
    - Infinite recursion
    - Unhandled exceptions, such as NameError, TypeError, KeyError, IndexError etc.

3. Semantic error: The program runs without any error message, but it does not give what we expected. Semantic errors are hard to debug because the compiler and runtime system provide no information about them. Some tips:
    - Break the program into smaller components and test each component independently.
    - Get some rest before you get frustrated.
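
One simple way to test a component independently is to call it with inputs whose correct answers are known and compare the results with `assert`. A minimal sketch, where `average` is a hypothetical component:

```python
def average(values):
    """Hypothetical component: return the arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# small, independent checks; a failing assert points directly at the faulty component
assert average([2, 4, 6]) == 4
assert average([5]) == 5
print('all checks passed')
```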


## Chapter 8: Working with Files in Python<a id='work_with_files_in_pythons'></a>
<a href="#0">Go to top</a>
***

In real life, data resides in files. Reading and writing files are common input/output (IO) operations.

In Python, we must open files before we can use them and close them when we are done with them. 

### §8.1 Reading from Files<a id='read_files'></a>

Python provides an `open()` function to create a file object, and we can read data from the file object.

In [None]:
f = open('../folder/data.txt', 'r')

In the above example, `'../folder/data.txt'` is the path to the file we want to read, and `'r'` stands for reading mode.
If the file does not exist, `open()` will raise a `FileNotFoundError` exception (a subclass of `OSError`, for which `IOError` is an alias in Python 3).

![image.png](attachment:image.png)

If the specified file can be successfully opened, we can call the file object's `read()` method to get the contents from it:

In [None]:
f.read()

The final step is calling the `close()` method to close the file. Opened files must be closed in order to release system resources.

In [None]:
f.close()

There is a `with` statement in Python that closes the opened file for us automatically:

In [None]:
with open('../data/graduate course type in Singapore.csv', 'r') as f:
    print(f.read())

If the file is small, `read()` conveniently returns all of its contents at once. For a large file this is not practical; instead, we can pass a size argument to `read()` and call it repeatedly, or iterate over the file object to process one line at a time:

In [None]:
gcfile = open("../data/graduate course type in Singapore.txt", "r")

for aline in gcfile:
    values = aline.split(",")
    print('In Year ', values[0], ", There are ",values[3], " ", values[1], " graduates in course ", values[2] )

gcfile.close()
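
The size-argument approach mentioned above can be sketched as follows. To stay self-contained, the example first writes a small temporary file; the file name `sample.txt` and the chunk size of 4 characters are arbitrary choices for this illustration.

```python
import os
import tempfile

# create a small temporary file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('abcdefghij')

chunks = []
with open(path, 'r') as f:
    while True:
        chunk = f.read(4)   # read at most 4 characters per call
        if not chunk:       # an empty string means end of file
            break
        chunks.append(chunk)

print(chunks)
```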

Instead of reading the whole file content, we can read one line at a time.
The table below summarizes the methods we can use.
At the end of the file, `readline()` returns an empty string and `readlines()` returns an empty list.

![image.png](attachment:image.png)

In [None]:
infile = open("../data/graduate course type in Singapore.txt", "r")
line = infile.readline()
while line:
    values = line.split(",")
    print('In Year ', values[0], ", There are ",values[3], " ", values[1], " graduates in course ", values[2] )
    line = infile.readline()

infile.close()

In [None]:
f = open('../data/graduate course type in Singapore.txt', 'r')
f.readlines(5)

Calling the `readlines()` method reads the file into a list of lines, so that we can process it line by line:

In [None]:
for line in f.readlines():
    print(line.strip()) # strip() removes '\n' at the end of the line

In `'rb'` mode, we can read binary files such as images and videos.

In [None]:
f = open('../data/binary_file.bin', 'rb')

### §8.2 Writing to Files<a id='write_files'></a>
Writing files is similar to reading files, but we use `'w'` or `'wb'` mode to write text or binary files respectively.

In [None]:
f = open('data.txt', 'w')
f.write('01 02 03')
f.close()

Remember to close the file; otherwise the last part of the data may be lost.  
Similarly, we can write files using the `with` statement, which closes the file for us:

In [None]:
with open('data.txt', 'w') as f:
    f.write('1234567')

If the file we are going to write already exists, its original content will be replaced. To append new content to the existing file, we can use `'a'` mode instead of `'w'`.
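
The difference between the two modes can be checked with a small experiment. This is a minimal sketch that writes to a temporary directory; the file name and contents are only illustrations.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.txt')

with open(path, 'w') as f:   # 'w' creates the file (or empties an existing one)
    f.write('first\n')

with open(path, 'w') as f:   # opening with 'w' again discards 'first'
    f.write('second\n')

with open(path, 'a') as f:   # 'a' keeps the existing content and adds to the end
    f.write('third\n')

with open(path, 'r') as f:
    content = f.read()
print(content)
```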

### §8.3 Working with JSON Data<a id='work_with_json'></a>


JSON is one of the most popular formats for transferring data through APIs nowadays.  
Python comes with a built-in module for encoding and decoding JSON data.

In [None]:
import json

The following API returns a list of GitHub API endpoint URLs in JSON format.

In [None]:
import requests
r = requests.get('https://api.github.com/')
print(r.text)

We can convert this JSON object (saved as a Python string) to a Python dictionary using `json.loads`

In [None]:
json.loads(r.text)

If the response is not in JSON format, calling `json.loads` raises an exception. Again, it is better to surround this piece of code with a try-except block.

In [None]:
try:
    data = json.loads(r.text)
    print(data['emails_url'])
except:
    print('invalid JSON data')
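
A bare `except` also hides other problems, such as a missing key. One possible refinement is to catch `json.JSONDecodeError` and `KeyError` separately. The sketch below works on hard-coded sample strings instead of a live response, and `get_field` is a hypothetical helper:

```python
import json

def get_field(text, key):
    """Hypothetical helper: parse text as JSON and return the value stored under key."""
    try:
        data = json.loads(text)
        return data[key]
    except json.JSONDecodeError:
        # the text is not valid JSON
        return 'invalid JSON data'
    except KeyError:
        # the JSON is valid but does not contain the key
        return 'missing key: ' + key

# sample strings made up for this illustration
print(get_field('{"emails_url": "https://api.github.com/user/emails"}', 'emails_url'))
print(get_field('<html>not json</html>', 'emails_url'))
print(get_field('{}', 'emails_url'))
```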

`json.dumps` can convert a Python dictionary to a JSON formatted string.

In [None]:
json.dumps({'key': 'abc', 'value': 'Hello World', 'valid': True})
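
To work with JSON files rather than strings, the `json` module also provides `json.dump` and `json.load`, which write to and read from an open file object. A minimal sketch using a temporary file (the file name `record.json` is only an illustration):

```python
import json
import os
import tempfile

record = {'key': 'abc', 'value': 'Hello World', 'valid': True}
path = os.path.join(tempfile.mkdtemp(), 'record.json')

# json.dump writes the JSON text straight to an open file
with open(path, 'w') as f:
    json.dump(record, f, indent=2)

# json.load reads it back into a Python dictionary
with open(path, 'r') as f:
    loaded = json.load(f)

print(loaded == record)
```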

#### Request data from online
Nowadays, most programs and applications require access to the internet to download or upload certain data. In this part, let's look at how we can download data in Python in a convenient way.

#### Using requests
Making a request to a certain URL using the `requests` module is very simple. First, install the `requests` module by running
```
pip install requests
```
then import it in your code

In [None]:
import requests

Now, let's try to get some data from an API:

In [None]:
r = requests.get('https://api.github.com')
print(r.text)

By passing the URL to the `requests.get` function, we can get the response from the URL, which is usually HTML for a web page or, as here, JSON for an API.  
If you copy and paste the URL into your browser, you can see the same result.

Try the URL `https://api.github.com` in your browser. Right-click the webpage and select `View Page Source` to see the raw data.

If you want to extract certain information from a webpage, especially when you are writing a web crawler, you have to deal with strings containing HTML tags. A JSON response like this one can instead be parsed with `json.loads`:

In [None]:
r = requests.get('https://api.github.com')
user_url = json.loads(r.text)['current_user_url']
print(user_url)

It is possible to encounter network issues while requesting online resources, so it is important to surround your requests with try-except blocks, check the response status, and set a timeout.

In [None]:
try:
    # set the timeout to 5 seconds to prevent endless waiting
    r = requests.get('https://api.github.com/', timeout=5)
    if r.status_code == 200:
        # HTTP 200 status response code indicates the request has succeeded
        print('Valid response status')
    else:
        print('Invalid response status')
except requests.exceptions.RequestException:
    # raised by requests for connection errors, timeouts and similar problems
    print('Error occurred')