# LAB 1 - INTRO TO PYTHON

This lab consists of three parts:

- 1. Introduction to Jupyter Notebooks

- 2. Python Language and NumPy Library

- 3. Data Manipulation with Pandas Library


## 1. INTRODUCTION TO JUPYTER NOTEBOOKS

Open the file "Lab1.ipynb"

`"File -> Open..."` 

Navigate to where you saved the files you downloaded for this class and double-click the "Lab1.ipynb" file.

### Running notebook cells

The notebook is divided into cells. Each cell can contain text, code, or HTML. Running a non-code cell simply advances to the next cell. Always type commands in a Code cell; you can verify the cell type in the drop-down menu in the toolbar above.

To run a code cell, use `Shift + Enter` or `Ctrl + Enter`. On a Mac, use `Shift + Enter` or `Command + Enter`.

Try running these lines:

In [1]:
8*6

48

In [2]:
2**16

65536

If a command is incomplete, Jupyter displays a `SyntaxError`:

In [3]:
2^

SyntaxError: invalid syntax (3817186346.py, line 1)

Your turn:

In [5]:
# EXERCISE: Compute 284455 divided by 3.67778

284455/3.67778

77344.21308506763

In Python, any text following a hash sign (`#`) in a code cell is a comment and is not executed.

### Interrupting the kernel

When debugging, we often want to interrupt the currently running process. This can be done by pressing the stop button.

When a process is running, the circle in the upper right corner is filled. When the kernel is idle, the circle is empty.

Interrupting sometimes does not work. You can reset the state by restarting the kernel. This is done by clicking Kernel/Restart or the Refresh button in the toolbar above.

### Undoing

To undo changes in each cell, hit `Command-z` for Mac and `Ctrl-z` for Windows.
To undo `Delete Cell`, select `Edit->Undo Delete Cell`.

### Saving the notebook

To save your notebook, either select `"File->Save and Checkpoint"` or hit `Command-s` for Mac and `Ctrl-s` for Windows

### Other Notebook tips
- To add a new cell, either select `"Insert->Insert New Cell Below"` or click the white plus button
- You can change the cell mode from code to text in the pulldown menu by selecting `Markdown`.
- `Help->Keyboard Shortcuts` has a list of keyboard shortcuts

## 2. PYTHON LANGUAGE AND NUMPY LIBRARY

## Data Types

### Floats and Integers

In [6]:
x = 4
print(x, type(x))

4 <class 'int'>


In [7]:
x = 1 / 4
print(x, type(x))

0.25 <class 'float'>


### Strings

Double quotes and single quotes are the same thing. Both represent strings. `'+'` concatenates strings

In [8]:
"IEOR " + '242'

'IEOR 242'

### Lists

A list is a mutable collection of data; that is, we can change it after we create it. Lists are created using square brackets `[]`.


Important functions: 
- `+` concatenates lists.
- `len(x)` returns the length of `x`.

In [9]:
x = ["IEOR"] + [2, 4, 2]
print(x)

['IEOR', 2, 4, 2]


In [10]:
print(len(x))

4
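
To make "mutable" concrete, here is a minimal sketch of changing a list in place. The variable names `course` and `removed` and the replacement values are chosen only for illustration.

```python
# Lists are mutable: elements can be replaced, added, or removed in place.
course = ["IEOR", 2, 4, 2]

course[0] = "INDENG"      # replace the first element
course.append("Lab 1")    # add one element to the end
removed = course.pop(1)   # remove and return the element at index 1

print(course)   # ['INDENG', 4, 2, 'Lab 1']
print(removed)  # 2
```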


### Tuples

A tuple is an immutable collection of data. Tuples are created using round brackets `()`.
They are often used as inputs and outputs to functions.

In [11]:
t = ("I", "E", "O", "R") + (2, 4, 2)
print(t)

('I', 'E', 'O', 'R', 2, 4, 2)


In [12]:
# Cannot do assignment to a tuple after creation - it's immutable
t[4] = 3 # will error

# Note: errors in notebook appear inline

TypeError: 'tuple' object does not support item assignment
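
As a sketch of how tuples serve as function outputs, the example below returns two values at once and unpacks them. The function `min_and_max` is hypothetical and written only for this illustration; function definitions with `def` are covered later in this lab.

```python
def min_and_max(values):
    """Return the smallest and largest item of a non-empty sequence as a tuple."""
    return min(values), max(values)

result = min_and_max([2, 4, 2])
print(result, type(result))   # (2, 4) <class 'tuple'>

# Tuple unpacking assigns each element to its own variable
lowest, highest = result
print(lowest, highest)        # 2 4
```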

## Functions and Variables

A function can take in several arguments or inputs, and returns an output value.
Python has some built-in functions:

In [13]:
abs(-65)

65

In [14]:
max([2, 4, 2])

4

In [15]:
# Get help on any function:
max?

Docstring:
max(iterable, *[, default=obj, key=func]) -> value
max(arg1, arg2, *args, *[, key=func]) -> value

With a single iterable argument, return its biggest item. The
default keyword-only argument specifies an object to return if
the provided iterable is empty.
With two or more arguments, return the largest argument.
Type:      builtin_function_or_method


Basic variable naming rules: 
- Don't use spaces (use underscores or capital letters instead), e.g., `time_reader` or `timeReader`
- Don't start names with a number
- Variable names are case sensitive - capital and lowercase letters are different

In [16]:
# EXERCISE: 
# Create a variable called "SecondsDay" that is equal to the number of seconds in a day, and output its value.
SecondsDay = 3600 * 24
print(SecondsDay)


86400


### User-defined Functions

We can define functions ourselves, by using `def` and passing the expected inputs, as well as stating the returned output.
In this example we create a function that takes two numbers `x` and `y`, and returns the sum.

In [17]:
def my_function(x, y):
    
    result = x+y
    
    return result

In [18]:
my_function(5, 3)

8

## Linear Algebra with Numpy

The NumPy array, also known as an `ndarray`, is like a list with multidimensional support and more functions.
The linear algebra routines are documented at https://numpy.org/doc/stable/reference/routines.linalg.html

Important NumPy Array functions:

- `.shape` returns the dimensions of the array.

- `.ndim` returns the number of dimensions. 

- `.size` returns the number of entries in the array.

- `len()` returns the size of the first dimension.


To use functions in NumPy, we have to import NumPy to our workspace. This is done by the command `import numpy`. By convention, we rename `numpy` as `np` for convenience.
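
The attributes listed above can be checked on a small array. This is a minimal sketch; the array `M` is made up for illustration.

```python
import numpy as np

M = np.array([[1, 2, 8],
              [3, 2, 9]])

print(M.shape)  # (2, 3): 2 rows, 3 columns
print(M.ndim)   # 2
print(M.size)   # 6
print(len(M))   # 2, the size of the first dimension
```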

### Arrays

In [19]:
import numpy as np

a = np.array([1,2,3])
a.shape

(3,)

In [20]:
2*a

array([2, 4, 6])

In [21]:
# Element-wise multiplication
b = np.array([3,3,3])
np.multiply(a,b)

array([3, 6, 9])

In [22]:
# Inner product
inner_product = np.dot(a,b)
print(inner_product)

18


### Slicing

Basic slicing in NumPy returns a view into the existing array rather than an implicit copy. This is particularly helpful with very large arrays because copying can be slow.

In [23]:
x= np.array([1,2,3,4,5,6])
y = x[0:4]
print(y)

[1 2 3 4]


In [24]:
y[3] = 5

In [25]:
x

array([1, 2, 3, 5, 5, 6])

Because slicing does not copy the array, changing `y` changes `x`. To actually copy x, we should use `.copy()`. 
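
A short sketch contrasting a view with an explicit copy. The names `view` and `independent` are illustrative, and `np.shares_memory` is used only to confirm which arrays share their data.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

view = x[0:4]                 # basic slicing returns a view that shares memory with x
independent = x[0:4].copy()   # .copy() allocates a separate array

independent[3] = 5            # modifies only the copy
print(x)                      # [1 2 3 4 5 6]

view[3] = 5                   # modifies x as well
print(x)                      # [1 2 3 5 5 6]

print(np.shares_memory(x, view))         # True
print(np.shares_memory(x, independent))  # False
```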

### Matrices

In [26]:
# Create a matrix
A = np.array([[1, 2, 8],
             [3, 2, 9]])
print(A)

[[1 2 8]
 [3 2 9]]


In [27]:
# Matrix multiplication
B = np.array([[1, 2],
              [3, 8],
              [2, 9]])

# There are two ways to perform matrix multiplication:
print(np.matmul(A,B))

# Alternatively:
print(A@B)

[[ 23  90]
 [ 27 103]]
[[ 23  90]
 [ 27 103]]


In [28]:
# Transpose a matrix
A.T

array([[1, 3],
       [2, 2],
       [8, 9]])

In [29]:
# Compute the inverse
C = np.array([[1, 2],
             [3, 2]])
D = np.linalg.inv(C)
C@D

array([[1.00000000e+00, 1.11022302e-16],
       [2.22044605e-16, 1.00000000e+00]])
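
The off-diagonal entries above are on the order of 1e-16 rather than exactly zero because of floating-point rounding. One way to check that the product is the identity up to a small tolerance, sketched here with `np.allclose`:

```python
import numpy as np

C = np.array([[1, 2],
              [3, 2]])
D = np.linalg.inv(C)

# Tiny entries such as 1.11e-16 are rounding error, not true nonzeros.
# np.allclose compares two arrays element-wise up to a small tolerance.
print(np.allclose(C @ D, np.eye(2)))  # True
```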

## 3. DATA MANIPULATION WITH PANDAS LIBRARY

`pandas` is designed to make it easier to work with structured data. Most of the analyses you might perform will likely involve using tabular data, e.g., from .csv files or relational databases (e.g., SQL). The `DataFrame` object in `pandas` is "a two-dimensional tabular, column-oriented data structure with both row and column labels."

If you're curious:

>The `pandas` name itself is derived from *panel data*, an econometrics term for multidimensional structured data sets, and *Python data analysis* itself. After getting introduced, you can consult the full [`pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/).

### Setting the working directory

Before loading the data, let's begin by setting the right working directory. To change the working directory, we use the `os` library.

In [30]:
import os
os.getcwd()

'c:\\Users\\minon\\Downloads'

Change the working directory to whichever path contains your Python files and data files:

In [33]:
os.chdir("c:\\Users\\minon\\Downloads")
print(os.getcwd())

# to go one folder level back use a double dot
os.chdir("..")
print(os.getcwd())

c:\Users\minon\Downloads
c:\Users\minon


### Loading CSV files

Now we can use the `pandas` library to load the data. Import `pandas` using the conventional abbreviation `pd` and call the `read_csv` function on your file's path name.

In [36]:
import pandas as pd

# WHO = pd.read_csv("WHO.csv")
# If the default decoding fails, pass an explicit encoding ('unicode_escape' is another option to try)
WHO = pd.read_csv("c:\\Users\\minon\\Downloads\\WHO_AsiaEurope.csv", encoding="ISO-8859-1")
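
If you do not have the data file at hand, `read_csv` can be tried on in-memory text. This is a minimal sketch: the three rows are copied from the WHO table in this lab, and `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# Three rows typed as CSV text; an empty field becomes NaN
csv_text = """Country,Region,Population,LiteracyRate
Albania,Europe,3162,
Armenia,Europe,2969,99.6
Yemen,Eastern Mediterranean,23852,63.9
"""

# read_csv accepts a path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)                         # (3, 4)
print(sample["LiteracyRate"].isna().sum())  # 1
```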

### The Dataframe

In [37]:
WHO

Unnamed: 0.1,Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
0,0,Afghanistan,Eastern Mediterranean,29825,47.42,3.82,5.40,60,98.5,54.26,,1140.0,,
1,1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
2,3,Andorra,Europe,78,15.20,22.86,,82,3.2,75.49,,,78.4,79.4
3,7,Armenia,Europe,2969,20.34,14.06,1.74,71,16.4,103.57,99.6,6100.0,,
4,9,Austria,Europe,8464,14.51,23.52,1.44,81,4.0,154.78,,42050.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,181,Ukraine,Europe,45530,14.18,20.76,1.45,71,10.7,122.98,99.7,7040.0,90.8,91.5
82,182,United Arab Emirates,Eastern Mediterranean,9206,14.41,0.81,1.84,76,8.4,148.62,,47890.0,,
83,183,United Kingdom,Europe,62783,17.54,23.06,1.90,80,4.8,130.75,,36010.0,99.8,99.6
84,187,Uzbekistan,Europe,28541,28.90,6.38,2.38,68,39.6,91.65,99.4,3420.0,93.3,91.0


In [38]:
# Structure of the data
WHO.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     86 non-null     int64  
 1   Country                        86 non-null     object 
 2   Region                         86 non-null     object 
 3   Population                     86 non-null     int64  
 4   Under15                        86 non-null     float64
 5   Over60                         86 non-null     float64
 6   FertilityRate                  83 non-null     float64
 7   LifeExpectancy                 86 non-null     int64  
 8   ChildMortality                 86 non-null     float64
 9   CellularSubscribers            83 non-null     float64
 10  LiteracyRate                   41 non-null     float64
 11  GNI                            69 non-null     float64
 12  PrimarySchoolEnrollmentMale    48 non-null     float64
 13  PrimarySchoolEnrollmentFemale  48 non-null     float64

In [None]:
# Recent statistics from the World Health Organization (WHO)
# The variables are: 
# the name of the country
# the region the country is in
# the population in thousands
# the percentage of the population under 15 and over 60
# the fertility rate (average number of children per woman)
# the Life Expectancy in years
# the Child Mortality rate (the number of children who die by age 5 per 1000 births)
# the number of cellular subscribers per 100 population
# the literacy rate among adults aged >= 15
# the gross national income per capita
# the percentage of male children enrolled in primary school
# the percentage of female children enrolled in primary school

In [39]:
# Statistical summary of the data:
WHO.describe()

Unnamed: 0.1,Unnamed: 0,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
count,86.0,86.0,86.0,86.0,83.0,86.0,86.0,83.0,41.0,69.0,48.0,48.0
mean,101.255814,38995.8,22.752326,14.746163,2.260482,73.965116,20.973256,106.259036,91.521951,19958.405797,94.758333,94.364583
std,56.024194,137444.8,9.005119,8.080025,1.096263,6.966217,26.891439,38.679951,12.791687,17635.245593,5.264406,7.138523
min,0.0,31.0,13.17,0.81,1.26,50.0,2.2,2.57,56.8,1140.0,78.4,66.5
25%,60.25,3266.0,15.1475,6.71,1.495,69.25,4.225,87.875,91.2,5930.0,93.3,92.55
50%,98.5,9257.5,18.825,15.87,1.93,75.0,9.6,109.35,97.9,14470.0,96.6,96.75
75%,154.0,28477.75,28.8375,22.9825,2.51,80.0,25.2,126.675,99.6,31020.0,98.95,99.2
max,191.0,1240000.0,47.42,26.97,6.77,83.0,147.4,191.24,99.8,86440.0,99.8,100.0


In [40]:
# Display a few data points at the "head" (start) of the dataset, i.e. the first few records
WHO.head()

Unnamed: 0.1,Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
0,0,Afghanistan,Eastern Mediterranean,29825,47.42,3.82,5.4,60,98.5,54.26,,1140.0,,
1,1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
2,3,Andorra,Europe,78,15.2,22.86,,82,3.2,75.49,,,78.4,79.4
3,7,Armenia,Europe,2969,20.34,14.06,1.74,71,16.4,103.57,99.6,6100.0,,
4,9,Austria,Europe,8464,14.51,23.52,1.44,81,4.0,154.78,,42050.0,,


In [41]:
# The last 6 records
WHO.tail(6)

Unnamed: 0.1,Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
80,178,Turkmenistan,Europe,5173,28.65,6.3,2.38,63,52.8,68.77,99.6,8690.0,,
81,181,Ukraine,Europe,45530,14.18,20.76,1.45,71,10.7,122.98,99.7,7040.0,90.8,91.5
82,182,United Arab Emirates,Eastern Mediterranean,9206,14.41,0.81,1.84,76,8.4,148.62,,47890.0,,
83,183,United Kingdom,Europe,62783,17.54,23.06,1.9,80,4.8,130.75,,36010.0,99.8,99.6
84,187,Uzbekistan,Europe,28541,28.9,6.38,2.38,68,39.6,91.65,99.4,3420.0,93.3,91.0
85,191,Yemen,Eastern Mediterranean,23852,40.72,4.54,4.35,64,60.0,47.05,63.9,2170.0,85.5,70.5


### Subsets of data

In [42]:
# Subset with only the countries in Europe
WHO_Europe = WHO[WHO['Region'] == 'Europe']
WHO_Europe.head()

Unnamed: 0.1,Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
1,1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
2,3,Andorra,Europe,78,15.2,22.86,,82,3.2,75.49,,,78.4,79.4
3,7,Armenia,Europe,2969,20.34,14.06,1.74,71,16.4,103.57,99.6,6100.0,,
4,9,Austria,Europe,8464,14.51,23.52,1.44,81,4.0,154.78,,42050.0,,
5,10,Azerbaijan,Europe,9309,22.25,8.24,1.96,71,35.2,108.75,,8960.0,85.3,84.1


In [43]:
WHO_Europe.count()

Unnamed: 0                       53
Country                          53
Region                           53
Population                       53
Under15                          53
Over60                           53
FertilityRate                    50
LifeExpectancy                   53
ChildMortality                   53
CellularSubscribers              51
LiteracyRate                     26
GNI                              48
PrimarySchoolEnrollmentMale      38
PrimarySchoolEnrollmentFemale    38
dtype: int64

In [44]:
# Other subsets
WHO_AsiaEurope = WHO[(WHO['Region'] == 'Europe') | (WHO['Region'] == 'South-East Asia') | (WHO['Region'] == "Eastern Mediterranean")] 
WHO_AsiaEurope.count()

Unnamed: 0                       86
Country                          86
Region                           86
Population                       86
Under15                          86
Over60                           86
FertilityRate                    83
LifeExpectancy                   86
ChildMortality                   86
CellularSubscribers              83
LiteracyRate                     41
GNI                              69
PrimarySchoolEnrollmentMale      48
PrimarySchoolEnrollmentFemale    48
dtype: int64
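
Boolean conditions like the ones above can also be written with `isin` or combined with `&`. The sketch below uses a small hand-made frame (rows taken from the WHO table) so it runs without the data file; `df`, `regions`, `subset`, and `big_europe` are illustrative names.

```python
import pandas as pd

# Small hand-made frame; the values are taken from the WHO table above
df = pd.DataFrame({
    "Country": ["Albania", "Yemen", "Austria"],
    "Region": ["Europe", "Eastern Mediterranean", "Europe"],
    "Population": [3162, 23852, 8464],
})

# isin() replaces a chain of == comparisons joined with |
regions = ["Europe", "South-East Asia", "Eastern Mediterranean"]
subset = df[df["Region"].isin(regions)]
print(len(subset))  # 3

# Combine conditions with & (and); each condition needs its own parentheses
big_europe = df[(df["Region"] == "Europe") & (df["Population"] > 5000)]
print(big_europe["Country"].tolist())  # ['Austria']
```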

### Saving dataframe to CSV file

In [45]:
WHO_AsiaEurope.to_csv("WHO_AsiaEurope.csv")

In [46]:
# EXERCISE: How many countries have population greater than 50 million? 
Answer = WHO[WHO['Population'] > 50000]
len(Answer)


14

### More Data Analysis

To access a variable (column) in a data frame, index the data frame with square brackets and pass the column's name as a string.

In [47]:
# This will give you an error!
LifeExpectancy

NameError: name 'LifeExpectancy' is not defined

In [48]:
# Now, run this.
WHO['LifeExpectancy']

0     60
1     74
2     82
3     71
4     81
      ..
81    71
82    76
83    80
84    68
85    64
Name: LifeExpectancy, Length: 86, dtype: int64

### Statistics

In [49]:
# Statistics about this variable
print(WHO['LifeExpectancy'].mean())
print(WHO['LifeExpectancy'].max())
print(WHO['LifeExpectancy'].min())

73.96511627906976
83
50


In [50]:
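# Standard deviation: .std() is a method, so it needs parentheses.
# A cell displays only its last expression, so only describe() is shown below.
WHO['LifeExpectancy'].std()
WHO['LifeExpectancy'].describe()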

count    86.000000
mean     73.965116
std       6.966217
min      50.000000
25%      69.250000
50%      75.000000
75%      80.000000
max      83.000000
Name: LifeExpectancy, dtype: float64
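
One detail worth knowing about the `std` row: pandas computes the sample standard deviation (dividing by n - 1) by default, while `np.std` divides by n by default. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 6])

sample_std = values.std()                    # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(values.to_numpy())   # NumPy default: ddof=0 (divide by n)

print(sample_std > population_std)                     # True
print(np.isclose(values.std(ddof=0), population_std))  # True
```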

In [51]:
WHO['GNI'].describe()

# What's different here?



count       69.000000
mean     19958.405797
std      17635.245593
min       1140.000000
25%       5930.000000
50%      14470.000000
75%      31020.000000
max      86440.000000
Name: GNI, dtype: float64

In [52]:
# Identify countries corresponding to max and min
idx_min = WHO['LifeExpectancy'].argmin()
print(WHO['Country'][idx_min])

idx_max = WHO['LifeExpectancy'].argmax()
print(WHO['Country'][idx_max])

Somalia
San Marino
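
`argmin` and `argmax` return integer positions, which match the index labels here only because `WHO` has the default index 0, 1, 2, and so on. When the labels differ (for example after rows are dropped), `idxmin` and `idxmax` return labels instead. A minimal sketch with a hand-made Series:

```python
import pandas as pd

# Hand-made Series whose index labels are not 0, 1, 2
life = pd.Series([74, 60, 82], index=[10, 20, 30])

print(life.argmin())  # 1  (integer position)
print(life.idxmin())  # 20 (index label)

# Use .iloc with positions and .loc with labels
print(life.iloc[life.argmin()])  # 60
print(life.loc[life.idxmin()])   # 60
```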


In [58]:
# EXERCISE:
# What is the largest population value among all countries?
# Which country has the largest population?

pop_max = WHO['Population'].max()

idx_country = WHO['Population'].argmax()



In [60]:
print('The country having the largest population is : {} and its population is : {}'.format(WHO['Country'][idx_country], pop_max))

The country having the largest population is : India and its population is : 1240000


### Dealing with missing data

In [61]:
WHO

Unnamed: 0.1,Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
0,0,Afghanistan,Eastern Mediterranean,29825,47.42,3.82,5.40,60,98.5,54.26,,1140.0,,
1,1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
2,3,Andorra,Europe,78,15.20,22.86,,82,3.2,75.49,,,78.4,79.4
3,7,Armenia,Europe,2969,20.34,14.06,1.74,71,16.4,103.57,99.6,6100.0,,
4,9,Austria,Europe,8464,14.51,23.52,1.44,81,4.0,154.78,,42050.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,181,Ukraine,Europe,45530,14.18,20.76,1.45,71,10.7,122.98,99.7,7040.0,90.8,91.5
82,182,United Arab Emirates,Eastern Mediterranean,9206,14.41,0.81,1.84,76,8.4,148.62,,47890.0,,
83,183,United Kingdom,Europe,62783,17.54,23.06,1.90,80,4.8,130.75,,36010.0,99.8,99.6
84,187,Uzbekistan,Europe,28541,28.90,6.38,2.38,68,39.6,91.65,99.4,3420.0,93.3,91.0


In [62]:
# Dealing with NAs
# Try:
WHO['LiteracyRate'].head()

0     NaN
1     NaN
2     NaN
3    99.6
4     NaN
Name: LiteracyRate, dtype: float64

In [65]:
WHO.dropna(subset = ['LiteracyRate'], inplace=True)
WHO['LiteracyRate'].head()

3     99.6
6     91.9
7     56.8
11    97.9
13    98.8
Name: LiteracyRate, dtype: float64
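
Note that `inplace=True` modifies `WHO` itself, so the dropped rows are gone for the rest of the notebook. A sketch of the non-destructive alternative on a small hand-made frame (values taken from the table above; `df` and `clean` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["Afghanistan", "Armenia", "Austria"],
    "LiteracyRate": [np.nan, 99.6, np.nan],
})

# Count the missing values in a column
print(df["LiteracyRate"].isna().sum())  # 2

# Without inplace=True, dropna returns a new frame and leaves df unchanged
clean = df.dropna(subset=["LiteracyRate"])
print(len(clean), len(df))  # 1 3
```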

# References
- [1] Special thanks to the [EECS127 Fall 2019](https://inst.eecs.berkeley.edu/~ee127/fa19/) for providing a great starting point for Intro to Jupyter
- [2] D-lab intro to Pandas [Link](https://github.com/dlab-berkeley/Python-Data-Wrangling)
- [3] The official Python 3 language documentation. [Link](https://docs.python.org/3/).
- [4] The official NumPy and SciPy documentation. [Link](https://docs.scipy.org/doc/).

# Other useful material

- Towards Data Science tutorial on pandas [Link](https://medium.com/towards-data-science/be-a-more-efficient-data-scientist-today-master-pandas-with-this-guide-ea362d27386)
- Coursera beginner-level class on Pandas (2 hours long, project-based teaching) [Link](https://www.coursera.org/projects/python-for-data-analysis-numpy)
- [Stack Overflow](https://stackoverflow.com): when you are unsure how to code something, you can usually find the answer you are looking for
- [Coursera](https://www.coursera.org): popular website for online learning, with good material for beginners and a gentle introduction to Python