# Quick Guide to Python and Jupyter Notebook

## Python Basics

### What is Python?

> Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. 

### The goal of this quick guide is to allow you to

- do basic calculations;
- import modules, such as `Numpy`, `Pandas` and `Statsmodels`, and use built-in functions within them;
- manage datasets with the help of `Pandas`.

### For more detailed and comprehensive tutorials, please refer to

1. BU TechWeb Tutorials:
https://www.bu.edu/tech/support/research/training-consulting/rcs-tutorial-videos-and-third-party-tutorials/intro-python/

2. Python 3 Documentation:
https://docs.python.org/3/tutorial/index.html

*If you want to install Python on your local machine, we strongly recommend the [Anaconda distribution](https://www.anaconda.com/). Please download it [here](https://www.anaconda.com/download), and follow the instructions.*

### Operators

The operators in Python are similar to those in other programming languages, and are easy to read. Please try out some of the following operators in the code cells below.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky">Operators in Python</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">Arithmetic</td>
    <td class="tg-0pky">+;  -;   *;   /;   % (remainder);  ** (power)</td>
  </tr>
  <tr>
    <td class="tg-0pky">Logical</td>
    <td class="tg-0pky">and; or; not</td>
  </tr>
  <tr>
    <td class="tg-0pky">Comparison</td>
    <td class="tg-0pky">&gt;;&nbsp;&nbsp;&lt;;&nbsp;&nbsp;&gt;=;&nbsp;&nbsp;&lt;=;&nbsp;&nbsp;!=;&nbsp;&nbsp;==</td>
  </tr>
  <tr>
    <td class="tg-0pky">Membership</td>
    <td class="tg-0pky">in; not in</td>
  </tr>
  <tr>
    <td class="tg-0pky">Assignment</td>
    <td class="tg-0pky">=</td>
  </tr>
  <tr>
    <td class="tg-0pky">Comments</td>
    <td class="tg-0pky"># (single line); '''(text)''' (triple-quoted string, often used as a block comment)</td>
  </tr>
</tbody>
</table>
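
As a quick illustration of the table above, the following sketch uses operators from the arithmetic, comparison, logical, and membership rows. The variable names (`total`, `remainder`, and so on) are made up for this example.

```python
# Arithmetic
total = 7 + 3          # addition
remainder = 7 % 3      # remainder of 7 divided by 3
power = 2 ** 5         # 2 to the power of 5
quotient = 7 / 2       # division always returns a float

# Comparison and logical operators return Booleans
is_between = (total > 5) and (total <= 10)
is_not_equal = not (remainder == 1)

# Membership
has_y = 'y' in 'python'
missing = 4 not in [1, 2, 3]

print(total, remainder, power, quotient)
print(is_between, is_not_equal, has_y, missing)
```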

### Modules

> Modules, a.k.a. libraries or packages, add functionality to the core Python language.

- The `import` command is used to load a module.
- The name of the module is prepended to function names and data structures in the module.
- This allows different modules to have the same function names – when loaded the module name keeps them separate.
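
To illustrate the last point, here is a minimal sketch using two modules from the standard library, `math` and `cmath`, which both define a function named `sqrt`. The variable names are chosen only for this example.

```python
import math    # real-valued math functions
import cmath   # complex-valued math functions

# Both modules define a function called sqrt; the module prefix
# tells Python (and the reader) which one is meant.
real_root = math.sqrt(4)        # square root of a real number
complex_root = cmath.sqrt(-4)   # square root of a negative number is complex

print(real_root, complex_root)
```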

Here, we list some modules that will be used in this course:
- `Numpy`: powerful tool to deal with n-dimensional arrays;
- `SciPy`: fundamental algorithms for scientific computing;
- `Pandas`: provides high-performance, easy-to-use data structures and data analysis tools;
- `Statsmodels`: provides classes and functions for the estimation of many different statistical models (regressions), as well as for conducting statistical tests, and statistical data exploration;
- `PyFixest`: provides fixed effects regression methods;
- `Seaborn`: statistical data visualization;
- `Stargazer`: helps to generate beautiful tables of regression results. 

In [1]:
# import the module and give it the alias 'np'
import numpy as np  

np.sqrt(5) # call a function from the module

2.23606797749979

Theoretically, you can give the module any alias you like. However, following the usual conventions makes your code more readable.

In [2]:
import pandas as pd
import statsmodels.api as sm
import seaborn as sns

Alternatively, we can import selected functions from a module. For example, suppose we only want to use `stats.norm` from the `SciPy` module.

In [3]:
from scipy.stats import norm 

# return the 97.5% quantile of the standard normal distribution. 
norm.ppf(0.975) 

# Note that we don't need to specify the module name before the function now.


1.959963984540054

Among the modules mentioned above, `Pandas` and `Seaborn` are the most relevant in this course. Please take a look at the following links:

- Pandas Tutorials: https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
- An Intro to Seaborn: https://seaborn.pydata.org/tutorial/introduction.html

### Variables

- Variables are assigned values using the "=" operator.
- The variable type is not declared (Python determines the type from the value assigned).
- Types can be changed with a reassignment.
- A built-in function, `type()`, returns the type of the data assigned to a variable.
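
The points above can be sketched in a few lines. The names `value`, `first_type`, etc. are hypothetical and used only for illustration.

```python
value = 10                 # an integer
first_type = type(value)

value = 'ten'              # reassigning changes the type to a string
second_type = type(value)

value = True               # ... and now to a Boolean
third_type = type(value)

print(first_type, second_type, third_type)
```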

1. Boolean: `True` and `False`

2. Numbers: integers (arbitrary precision) and floating-point numbers (64-bit)
  - [more examples](https://docs.python.org/3/tutorial/introduction.html#numbers)

In [4]:
a = 1.0

type(a)

float

In [5]:
b = 2

type(b)

int

In [6]:
# operation between integers can return a float
c = 2 / 3

c

0.6666666666666666

3. Strings
  - Indicated using pairs of single '' or double "" quotes.
  - Strings can be indexed and sliced (see below). Note that in Python, we always count from 0.
  - [more examples](https://docs.python.org/3/tutorial/introduction.html#text)

In [7]:
c = 'python'

len(c) # length of the string

6

In [8]:
c[0] # character in position 0

'p'

In [9]:
c[-1] # the last character

'n'

In [10]:
c[1:3] # characters from position 1 (included) to 3 (excluded)

'yt'

4. Lists
  - Indicated using square brackets [].
  - Lists can be indexed and sliced (in the same way as strings).
  - Lists can be nested.
  - [more examples](https://docs.python.org/3/tutorial/introduction.html#lists)
  - [advanced methods](https://docs.python.org/3/tutorial/datastructures.html?highlight=dictionary#more-on-lists)

In [11]:
l = [1, 2, 3, 5, 8]

l[0:3] # slicing returns a new list

[1, 2, 3]

In [12]:
l[-2] # indexing returns the item

5

In [13]:
# concatenate using '+'

l + ['a', 'b', 'c']

[1, 2, 3, 5, 8, 'a', 'b', 'c']

In [14]:
# change value in a certain position

l[3] = 99

l

[1, 2, 3, 99, 8]

In [15]:
# use append() to add an element

l.append(11)

l

[1, 2, 3, 99, 8, 11]

In [16]:
# nested lists

x = [[1, 2, 3], [4, 5, 6]]

x[0][2]

3

5. `numpy.ndarray`
  - [intro](https://numpy.org/doc/stable/reference/arrays.ndarray.html#constructing-arrays)
  - Please try out indexing, slicing, and assigning values in the code cell below.

In [17]:
import numpy as np # import the numpy module

y = np.array(x) # create ndarray from list

y

array([[1, 2, 3],
       [4, 5, 6]])

We can apply element-wise operations on `ndarray`s.

In [18]:
np.sqrt(y)

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])
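
If you are unsure where to start with indexing, slicing, and assignment on an `ndarray`, the sketch below may help. It rebuilds the same values as `y` under a new, hypothetical name `arr` so that it runs on its own.

```python
import numpy as np

# Same values as the array y above, rebuilt here so the cell runs on its own
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

element = arr[0, 2]      # row 0, column 2
first_row = arr[0]       # the whole first row
second_col = arr[:, 1]   # every row, column 1

arr[1, 0] = 40           # assign a new value to a single position
doubled = arr * 2        # element-wise arithmetic

print(element, first_row, second_col)
print(doubled)
```

Note that `arr[0, 2]` uses a single pair of brackets with a comma, whereas the nested list needed `x[0][2]`.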

6. Dictionaries

  - A dictionary is a set of 'key: value' pairs. 
  - A pair of braces creates an empty dictionary: {}.

In [19]:
# create a dictionary

data = {'person': [1, 2, 3, 4, 5, 6],
        'A': [1, 1, 1, 0, 0, 0],
        'Y': [1, 1, 1, 0, 1, 0]}

# returns the list of keys in the dictionary
data.keys() 

dict_keys(['person', 'A', 'Y'])

In [20]:
data['person']

[1, 2, 3, 4, 5, 6]

In [21]:
# an example of a list comprehension

# create the key 'Z' and assign values as (A + Y)
data['Z'] = [x + y for x,y in zip(data['A'], data['Y'])] 

data

{'person': [1, 2, 3, 4, 5, 6],
 'A': [1, 1, 1, 0, 0, 0],
 'Y': [1, 1, 1, 0, 1, 0],
 'Z': [2, 2, 2, 0, 1, 0]}
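
For readers new to comprehensions, the one-line expression in the cell above can be written as an explicit loop. This is only an equivalent sketch; the loop variable names are made up, and the dictionary is rebuilt with just the two columns needed.

```python
data = {'A': [1, 1, 1, 0, 0, 0],
        'Y': [1, 1, 1, 0, 1, 0]}

# zip() pairs up the i-th elements of the two lists
sums = []
for a_value, y_value in zip(data['A'], data['Y']):
    sums.append(a_value + y_value)

data['Z'] = sums
print(data['Z'])
```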

7. `pandas.DataFrame`

  - We store data in a `DataFrame`.
  - We can create a DataFrame from a dictionary; the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame.
  - Please refer to [the tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html) for more details.

In [22]:
import pandas as pd # import the pandas module

# create a DataFrame from the dictionary
df = pd.DataFrame(data) 

df # show the DataFrame

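|   | person | A | Y | Z |
|---|--------|---|---|---|
| 0 | 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 | 2 |
| 2 | 3 | 1 | 1 | 2 |
| 3 | 4 | 0 | 0 | 0 |
| 4 | 5 | 0 | 1 | 1 |
| 5 | 6 | 0 | 0 | 0 |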


Each column in a DataFrame is a `Series`.

In [23]:
df['A']

0    1
1    1
2    1
3    0
4    0
5    0
Name: A, dtype: int64

In [24]:
type(df['A'])

pandas.core.series.Series

We can extract several columns into a sub-DataFrame.

In [25]:
subdf = df[['A', 'Y']]

subdf

|   | A | Y |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 0 | 0 |
| 4 | 0 | 1 |
| 5 | 0 | 0 |


In [26]:
type(subdf)

pandas.core.frame.DataFrame

We can use `read_csv` to create DataFrames from CSV files. To write data to files, use `to_csv`.

In [27]:
# read data from CSV file. Note the path!
df_titanic = pd.read_csv("../data/titanic.csv") 

# it's a long table, so we display only the first 5 rows.
df_titanic.head(5) 

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [28]:
# write the DataFrame to CSV file
df_titanic.to_csv("../data/titanic2.csv", index = False)

# read the new file
df_temp = pd.read_csv("../data/titanic2.csv")

df_temp.tail(5) # this time we display the last 5 rows

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


We may be interested in filtering specific rows based on some conditions. For example, we may want to see, in the Titanic dataset, the information on passengers who were older than 35.

In [29]:
above_35 = df_titanic[df_titanic["Age"] > 35]

above_35.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S |
| 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | NaN | S |
| 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0 | NaN | S |


Use `loc` to select specific rows and columns. E.g., I’m interested in the names and sex info of the passengers older than 35 years.

In [30]:
adult_names = df_titanic.loc[df_titanic["Age"] > 35, ["Name", "Sex"]]

adult_names.head()

|   | Name | Sex |
|---|---|---|
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |
| 6 | McCarthy, Mr. Timothy J | male |
| 11 | Bonnell, Miss. Elizabeth | female |
| 13 | Andersson, Mr. Anders Johan | male |
| 15 | Hewlett, Mrs. (Mary D Kingcome) | female |
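
Filters can also combine several conditions. The sketch below uses a small made-up DataFrame (`df_small`, with the same values as the toy example earlier) rather than the Titanic file, so that it runs without any data on disk. The name `selected` is hypothetical.

```python
import pandas as pd

df_small = pd.DataFrame({'person': [1, 2, 3, 4, 5, 6],
                         'A': [1, 1, 1, 0, 0, 0],
                         'Y': [1, 1, 1, 0, 1, 0]})

# Combine conditions with & (and) or | (or);
# each condition must be wrapped in its own parentheses.
selected = df_small.loc[(df_small['A'] == 0) & (df_small['Y'] == 1),
                        ['person', 'Y']]

print(selected)
```

The keywords `and` and `or` do not work element-wise on a `Series`, which is why `&` and `|` are used here.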


To create a new column based on existing columns, simply apply operations on `Series`. Consider our simple DataFrame:

In [31]:
df

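|   | person | A | Y | Z |
|---|--------|---|---|---|
| 0 | 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 | 2 |
| 2 | 3 | 1 | 1 | 2 |
| 3 | 4 | 0 | 0 | 0 |
| 4 | 5 | 0 | 1 | 1 |
| 5 | 6 | 0 | 0 | 0 |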


In [32]:
# define the new variable X = 2Y - Z
df['X'] = df['Y'] * 2 - df['Z'] 

df

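|   | person | A | Y | Z | X |
|---|--------|---|---|---|---|
| 0 | 1 | 1 | 1 | 2 | 0 |
| 1 | 2 | 1 | 1 | 2 | 0 |
| 2 | 3 | 1 | 1 | 2 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 |
| 4 | 5 | 0 | 1 | 1 | 1 |
| 5 | 6 | 0 | 0 | 0 | 0 |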


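The calculation of the values is done element-wise. Note the difference from the dictionary example, where the values were plain lists and we had to pair up their elements with `zip()` inside a list comprehension.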
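
To see why the dictionary needed a comprehension while the DataFrame did not, compare how `*` and `+` behave on a plain list and on a `Series`. This is a minimal sketch with made-up values.

```python
import pandas as pd

values = [1, 1, 0]

# On a plain list, * repeats the list and + concatenates
repeated = values * 2
joined = values + values

# On a Series, the same operators work element by element
series = pd.Series(values)
doubled = series * 2
summed = series + series

print(repeated, joined)
print(doubled.tolist(), summed.tolist())
```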

To combine data from multiple DataFrames, use `concat()` and `merge()`. Please refer to [the tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html) for examples.
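
As a rough sketch of the two functions, the example below stacks two small tables with `concat()` and then joins the result to a third table with `merge()`. All three DataFrames (`left`, `more`, `outcomes`) and their values are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'person': [1, 2, 3], 'A': [1, 1, 0]})
more = pd.DataFrame({'person': [4, 5], 'A': [0, 0]})
outcomes = pd.DataFrame({'person': [1, 2, 3, 4], 'Y': [1, 0, 1, 0]})

# concat() stacks DataFrames with the same columns on top of each other
stacked = pd.concat([left, more], ignore_index=True)

# merge() joins two DataFrames on a shared key column;
# how='left' keeps every row of the left table
merged = pd.merge(stacked, outcomes, on='person', how='left')

print(merged)
```

Person 5 has no match in `outcomes`, so its `Y` value is missing (`NaN`) in the merged table.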

`Pandas` also provides handy methods for statistical analysis. Please refer to [the tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html) to see how to calculate summary statistics in DataFrames.
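
A few of these methods are sketched below on a small made-up DataFrame (`df_stats`, with the same `A` and `Y` values as the toy example above).

```python
import pandas as pd

df_stats = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0],
                         'Y': [1, 1, 1, 0, 1, 0]})

mean_y = df_stats['Y'].mean()                      # average of one column
mean_by_group = df_stats.groupby('A')['Y'].mean()  # average of Y within each value of A
summary = df_stats.describe()                      # count, mean, std, quartiles, ...

print(mean_y)
print(mean_by_group)
print(summary)
```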

## Jupyter Notebook

> A Jupyter notebook integrates `Python` code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media. In other words: it's a single document where you can run code, display the output, and also add explanations, formulas, charts, and make your work more transparent, understandable, repeatable, and shareable.

Please take about 30 minutes to watch the following YouTube video.

[Jupyter Notebook Tutorial: Introduction, Setup, and Walkthrough](https://www.youtube.com/watch?v=HW29067qVWk)

The Anaconda distribution already includes Jupyter notebook. In this course, we will use Codespaces for most of the work.

There are two types of cells in Jupyter notebooks: markdown cells and code cells.

### Markdown Cell

The markdown cell is where you can type plain text in [the markdown style](https://www.markdownguide.org/basic-syntax/). More specifically,

- '\#' indicates the title and headings. 
>
> \# This is the title
>
> \#\# This is a level 1 heading
>
> \#\#\# This is a level 2 heading
>
> etc.

- `$(equation)$` and `$$(equation)$$` for math equations (in LaTeX style).
> `$y = 2x$` will show $y = 2x$ (inline equation);
>
> `$$\hat{Y} = \alpha + \beta X$$` will show (display equation)
> $$\hat{Y} = \alpha + \beta X$$

- '\*(text)\*' will make the inside text italic: *(text)*.
- '\*\*(text)\*\*' or '\_\_(text)\_\_' will make the inside text bold: **(text)** or __(text)__.

- To insert tables, you may try this tool: [Table generator](https://www.tablesgenerator.com/). Simply set up your table, press 'Generate', then copy the result and paste it where you want to put the table!
> The markdown cells can work with HTML source code. To make your table look nicer, I suggest inserting tables in HTML, as I did in the *Operators* section.

### Code Cell

The code cell is where Python code is written. Press 'Run' on the left to execute the code in the current cell. For more details (shortcuts, etc.), refer to the YouTube video.

### Export Jupyter Notebook as PDF Report

You will be required to generate a formal report from your Jupyter notebook, including all markdown text, code, and outputs. Before you export, please double-check that your code cells display the proper outputs - I will do 'Clear All Outputs' and 'Run All' to make sure everything works and is in the correct order. Once you are ready, open a terminal and do the following:

1. If you want to export your notebook as a webpage (.html):
(in terminal)
```
pip install nbconvert
pip install pyppeteer
jupyter nbconvert --to html path/filename.ipynb
```
where you should change `path/filename` to the path and filename of your own notebook. If you already have the modules installed, only the third line is needed.

To convert the webpage to a PDF file, open it in a browser and (in Windows, for example) choose 'Print - Save as PDF', where you may change the page layout, margins and scale.

2. If you want to export your notebook to PDF:
(in terminal)
```
pip install nbconvert[webpdf]
pip install pyppeteer
jupyter nbconvert --to webpdf --allow-chromium-download path/filename.ipynb 
```
If you still run into errors regarding `playwright`, please try (still in terminal):
```
playwright install-deps
```
and run
```
jupyter nbconvert --to webpdf --allow-chromium-download path/filename.ipynb
```
again.

## Ending Remarks

This quick guide just gives you a taste of Python coding and Jupyter notebooks. In order to succeed in programming and data analytics, please:

- try the code out on your own;
- make the best use of official documentation and tutorials;
- google it (modules, functions, etc.).

Happy coding!