## Python Intro

Before we begin, let's cover a few basics about Jupyter notebooks (formerly known as IPython notebooks). To run a cell and move to the next one, press Shift+Enter. To run a cell without advancing, press Ctrl+Enter.

IPython is the interactive Python shell that Jupyter uses behind the scenes to run the code in each cell.

### Variable assignment, basic calculations, and data types

In [None]:
# This is a comment. Comments will not appear in the output when a cell is run.

a = 45    # assigning the value 45 to the variable "a"
print(a)

As you may have guessed, the `print()` function displays the value of the argument that's passed to it (i.e. whatever is inside the parentheses).

In [None]:
# Let's assign a couple more variables 
b = 12
c = a + b
print(c)

In [None]:
# Let's increment "c" by 2
# print() accepts several comma-separated arguments, and they can be of different data types

print('c =', c)
c = c + 2
print('c + 2 =', c)

Try running the previous cell again. What happens?

In [None]:
# Another useful method to increment

print('c =', c)
c += 2
print('Now c =', c)

Decrementing works similarly (the operator is `-=`).

Other useful operations:

- subtraction: `a - b`
- multiplication: `a * b`
- division: `a / b`
- floor division (the integer part, or quotient, of a division operation): `a // b`
- modulo (remainder): `a % b`

In [None]:
# Some of the above in action
print('a', a)
print('b', b)
div = a/b
print('a / b is', div)
floor = a//b
# Floor division (//) returns the largest integer less than or equal to the true quotient.
print('a // b is', floor)
mod = a%b
print('a % b is', mod)
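
Floor division and modulo are two halves of the same calculation: for integers, `b * (a // b) + (a % b)` always gives back `a`. The sketch below checks this with the values used above. The built-in `divmod()` does not appear elsewhere in this notebook and is shown only as a convenience.

```python
a = 45
b = 12

quotient = a // b      # integer part of the division
remainder = a % b      # what is left over

# The two results always recombine into the original number
print(b * quotient + remainder == a)    # True

# divmod() returns the quotient and remainder together as a tuple
print(divmod(a, b))                     # (3, 9)

# Floor division rounds toward negative infinity, not toward zero
print(-7 // 2)                          # -4
print(-7 % 2)                           # 1
```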

So far, you've seen three data types: strings, integers, and floats. We can use the function `type()` to find out what data type a value or variable represents.

In [None]:
type(3.75)      # this is a float
# A float is a floating-point number, i.e. a number with a decimal point.

In [None]:
type(3)     # this is an integer
# An integer is a whole number with no fractional part.

In [None]:
type('This is a string.')      # this is a string
# A string is a sequence of characters.

In [None]:
type("This is also a string.")     # double quotes or single quotes can be used

The `math` module is part of the standard library and has a lot of useful functions. To use it, we need to import it into this notebook.

In [None]:
import math

math.sqrt(9)

In [None]:
math.pi

To learn more about the various functions belonging to the `math` module, call the `help()` function on it. Alternatively, you can read the online documentation for this module here: https://docs.python.org/3/library/math.html. This applies to any module, class, function, etc. that you may want more information on.

In [None]:
# help(math) prints documentation for the whole module; for a single function, call e.g. help(math.sqrt)

help(math)

To convert a number from a float into an integer, use the function `int()`:

In [None]:
# Converting (casting) a float to an integer

a = int(6.0)
type(a)
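
One detail worth knowing: `int()` drops the fractional part rather than rounding. A short sketch with illustrative values (the variable name `whole` is just an example):

```python
print(int(6.9))      # 6   (the fractional part is dropped, not rounded)
print(int(-6.9))     # -6  (truncation is toward zero)
print(round(6.9))    # 7   (round() goes to the nearest integer instead)

whole = int(6.0)
print(type(whole))   # <class 'int'>
```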

What if we have a number in string format?

In [None]:
# This doesn't work: Python raises a TypeError because a string and a number can't be added

'6.5'+7

In [None]:
# This works

float('6.5')+7

If we want to convert an integer or float to a string, we can use the function `str()`:

In [None]:
str(100)

### Strings: cleaning and manipulation

Indexing in Python starts from 0. That means that the first element of any string, list, array, etc. is actually considered to be element # 0.

In [None]:
myString = 'Rutgers is one of the top 10 oldest colleges in the U.S.'

Accessing characters in the string:

In [None]:
myString[0]

In [None]:
myString[1]

You can also access characters with reference to the end of the string:

In [None]:
myString[-1]

In [None]:
myString[-2]

To access larger portions (called "slices") of the string, we can use the following syntax: `string[startIndex:endIndex:stepSize]`. The string returned will start from the character at index `startIndex` and end with the character at index `endIndex - 1`. If not specified, `stepSize` = 1, `startIndex` = 0, and `endIndex` = one beyond the last index.

For example,

In [None]:
myString[5:11]

In [None]:
myString[1:11:2]

In [None]:
myString[:11]

In [None]:
myString[11:]

In [None]:
myString[:]
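
The slicing rules are easier to verify on a short string, where you can count the indices by hand. This is a small sketch using a made-up variable `s`; the negative step in the third line is not covered above but follows the same start:end:step pattern.

```python
s = 'Rutgers'      # indices: R=0, u=1, t=2, g=3, e=4, r=5, s=6

print(s[0:3])      # Rut      (indices 0, 1, 2; index 3 is excluded)
print(s[::2])      # Rtes     (every second character)
print(s[::-1])     # sregtuR  (a negative step walks backwards)
print(s[-3:])      # ers      (the last three characters)
```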

Let's look at other useful string operations.

In [None]:
# How long is the string?

len(myString)

In [None]:
# Concatenation

myString2 = '; it was originally "Queen\'s College".'
print(myString + myString2)



There's a lot you can do with strings! Let's go through a few useful methods:

In [None]:
# Converting to all lower case

new = 'RU'
new.lower()

In [None]:
# Finding the first position (index) of a character
v = '...Alice twisted her ankle playing basketball on Saturday...'
v.find('l')

In [None]:
# str.find() can also be used for a substring; it returns the index of the first character in the substring

v.find('Alice')

In [None]:
# Replacing all instances of a character or substring

y = 'Day 1: Prep. Day 2: Execute. Day 3: Review.'
y.replace('Day ', '')
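
Two behaviors of these methods are easy to miss: `find()` returns -1 when nothing matches, and string methods return a new string instead of changing the original. The sketch below shows both. `strip()` is an extra method not introduced above, and the name `cleaned` is only illustrative.

```python
v = '...Alice twisted her ankle playing basketball on Saturday...'

print(v.find('Bob'))          # -1: find() returns -1 when the substring is absent
print('Alice' in v)           # True: "in" is the simplest membership check

# String methods return new strings; the original is left unchanged
cleaned = v.strip('.')        # remove leading and trailing periods
print(cleaned)
print(v == cleaned)           # False: v still has its periods

# Methods can be chained, each one working on the result of the previous
print(cleaned.lower().replace('alice', 'bob'))
```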

### Lists: working with a data collection

Many of the operations we used with strings can be applied to lists as well, including
- indexing
- slicing
- finding the length
- concatenation

In [None]:
# Creating a list of the top 40 U.S. cities by population

topcities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia', 'Phoenix', 'San Antonio', 'San Diego',
         'Dallas', 'San Jose', 'Austin', 'Jacksonville', 'San Francisco', 'Indianapolis', 'Columbus', 'Fort Worth',
         'Charlotte', 'Seattle', 'Denver', 'El Paso', 'Detroit', 'Washington', 'Boston', 'Memphis', 'Nashville', 'Portland',
         'Oklahoma City', 'Las Vegas', 'Baltimore', 'Louisville', 'Milwaukee', 'Albuquerque', 'Tucson', 'Fresno', 'Sacramento',
         'Kansas City', 'Long Beach', 'Mesa', 'Atlanta', 'Colorado Springs']
print(topcities)

In [None]:
# Which is the 5th most populous city?

topcities[4]

In [None]:
# Which cities are ranked #11-#20?

topcities[10:20]

In [None]:
# Are there really 40 cities in the list?

len(topcities)

In [None]:
# Let's add the next 5 cities to topcities

cities41to45 = ['Virginia Beach', 'Raleigh', 'Omaha', 'Miami', 'Oakland']
topcities + cities41to45

However, **lists, unlike strings, are mutable** - their contents can be changed in place without creating a new list. Note that the `+` operation above returned a new, longer list and left `topcities` itself unchanged.

In [None]:
# Another way to add multiple items to the list is to use the "extend" method

topcities.extend(cities41to45)
len(topcities)
#print(topcities)
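
To make the idea of mutability concrete, here is a minimal sketch with small made-up lists (`letters`, `same_list`, and `combined` are illustrative names). It contrasts an in-place change, a `+` concatenation, and an attempted change to a string.

```python
letters = ['a', 'b']
same_list = letters          # a second name for the SAME list object
letters.append('c')          # modifies the list in place
print(same_list)             # ['a', 'b', 'c']

combined = letters + ['d']   # "+" builds a NEW list
print(letters)               # ['a', 'b', 'c']  (unchanged by "+")
print(combined)              # ['a', 'b', 'c', 'd']

word = 'cat'
try:
    word[0] = 'b'            # strings do not support item assignment
except TypeError as err:
    print('TypeError:', err)
```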

In [None]:
# You can also add items one at a time using the "append" method

topcities.append('Minneapolis')
topcities[-2:]   # let's just look at the end of the list

In [None]:
# Is Orlando in the list?

'Orlando' in topcities

In [None]:
# Is Dallas in the list?

'Dallas' in topcities

In [None]:
# Which position is Dallas in?

topcities.index('Dallas')

In [None]:
# Let's get the list in alphabetical order

topcities.sort()
topcities[:10]   # just looking at the first 10 to verify sorting

In [None]:
# Sorting also works on numbers

newList = [3,53,7,768,7,4,563]
newList.sort()
newList
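
Because `sort()` works in place, it returns `None` rather than the sorted list. If you want to keep the original order, the built-in `sorted()` function (not used above) returns a new list instead. A short sketch with the same numbers, under the illustrative name `nums`:

```python
nums = [3, 53, 7, 768, 7, 4, 563]

ordered = sorted(nums)            # returns a new sorted list
print(ordered)                    # [3, 4, 7, 7, 53, 563, 768]
print(nums)                       # [3, 53, 7, 768, 7, 4, 563]  (original order kept)

result = nums.sort(reverse=True)  # sorts in place, largest first
print(result)                     # None: sort() does not return the list
print(nums)                       # [768, 563, 53, 7, 7, 4, 3]
```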

You can even create a list of lists.

In [None]:
# Indexing with a list of lists

nestedList = [[1,2,3],[2,4,6],[3,6,9],[4,8,12]]
print(nestedList[0])

In [None]:
# Accessing a single element in a sublist

print(nestedList[0][2])

In [None]:
# Slicing

print(nestedList[:2])

There's a lot you can do with lists. A brief overview can be found here: https://www.tutorialspoint.com/python/python_lists.htm; full documentation can be found at the official Python documentation page.

## Now the real fun begins...

Before we start playing with data files, we need to cover one more really important section.

### Loops, conditionals, and functions

If you have used a programming language before, you're probably familiar with the for-loop. For everyone else, a for-loop is a way of iterating through a data structure (a string, a list, a dictionary, etc.) or a file. It executes the same piece of code multiple times, with a variable updated on every iteration.

What is a data structure? In computer science, a data structure is a format for organizing, managing, and storing data that enables efficient access and modification.

In [None]:
# The go-to first example of a for-loop

for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    print(i)

The `range()` function is useful here. It accepts up to three arguments: `range([start], stop, [step])` (brackets indicate optional arguments). As with slicing in strings and lists, the stop value itself is excluded, so with the default step of 1 the last value produced is `stop - 1`.

In [None]:
# range(1, 11) would reproduce the loop above without typing out the whole list.
# Adding a step of 2 prints every other number; the stop value (11) is excluded.

for i in range(1,11, 2):
    print(i)
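
Wrapping `range()` in `list()` is a quick way to see exactly which values a given call produces. The calls below are illustrative; the last one uses a negative step, which is not covered above.

```python
print(list(range(5)))           # [0, 1, 2, 3, 4]
print(list(range(1, 11)))       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(list(range(1, 11, 2)))    # [1, 3, 5, 7, 9]
print(list(range(10, 0, -3)))   # [10, 7, 4, 1]
```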

You may be wondering why we chose the letter "i" to iterate through the lists above. The answer is that it's just convention - letters like "i" and "j" are often used for iteration, but in practice you can use any valid variable name, including a single underscore.

One common use of a for-loop is to iteratively append elements to a list.

In [None]:
# What are the first ten multiples of 3?

multiples = []    # initializing list
for i in range(1,11):
    multiples.append(3*i)

print(multiples)

As you begin writing more complex code, you may find it helpful to use this step-by-step visualization tool: http://pythontutor.com/visualize.html#.

What if we want to control what code gets executed based on certain conditions? That's where conditional statements come in. Let's look at some operators you'll likely use:

In [None]:
print(3 < 4)    # less than
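
For reference, here is a sketch of the other comparison operators, along with `and`, `or`, and `not` for combining conditions. The variables `x` and `y` are illustrative.

```python
x = 3
y = 4

print(x < y)              # True   less than
print(x <= y)             # True   less than or equal to
print(x >= y)             # False  greater than or equal to
print(x == y)             # False  equal to (note the double equals sign)
print(x != y)             # True   not equal to

# Conditions can be combined or negated
print(x < y and y < 10)   # True   both conditions hold
print(x > y or y == 4)    # True   at least one condition holds
print(not x < y)          # False  negation of a True condition

# Python also allows chained comparisons
print(1 < x < 5)          # True
```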

The return values of True/False are called "Booleans". These are actual values that can be assigned to a variable:

In [None]:
var = True
new_var = False

print('var:', var, '\nnew_var:', new_var)

Earlier, in the section on lists, we saw another case where the result was a Boolean: we checked whether "Orlando" was in the `topcities` list. The `in` and `not in` membership checks return True or False.

Now, we can implement some of these comparisons in what's called an if-else statement. The gist of it is this: if {some condition}, execute some code; for all other cases, execute some other code.

In [None]:
# On which days could we potentially have a picnic?

forecast7Day = ['rain', 'mostly cloudy', 'rain', 'mostly cloudy', 'sunny', 'partly cloudy', 'rain']
picnic = []
for i in forecast7Day:
    if i == 'rain':
        picnic.append('no')
    else:
        picnic.append('yes')
        
print(picnic)
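
When there are more than two cases, `elif` (short for "else if") adds further conditions between `if` and `else`. The sketch below is a variation on the picnic example; the three-way rule (and the names `plans` and `day`) are made up for illustration.

```python
forecast7Day = ['rain', 'mostly cloudy', 'rain', 'mostly cloudy', 'sunny', 'partly cloudy', 'rain']
plans = []
for day in forecast7Day:
    if day == 'rain':
        plans.append('no')
    elif day == 'sunny':         # only checked when the first condition is False
        plans.append('yes')
    else:                        # everything that is neither rain nor sunny
        plans.append('maybe')

print(plans)    # ['no', 'maybe', 'no', 'maybe', 'yes', 'maybe', 'no']
```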

One last type of control structure - the while-loop. The general structure is the following: while {some condition}, execute some code. Iteration will continue until that condition is no longer true.

In [None]:
# Using up a gift card

balance = 110     # initial balance = $110
while balance - 20 >= 0:
    print('Your balance is now $' + str(balance))
    balance -= 20    # using up $20 for each purchase
print('Final balance: $' + str(balance))

Finally, functions. Functions are extremely useful when you want to execute a section of code repeatedly with different inputs. The names listed in the function definition are called "parameters"; the values supplied for them when the function is called are called "arguments". Functions are defined with `def` followed by a user-provided name.

In [None]:
# A function to generalize the gift card code in the previous example

def giftCard(init_balance, purchase_size):    # this function has two parameters
    balance = init_balance
    while balance - purchase_size >= 0:
        print('Your balance is now $' + str(balance))
        balance -= purchase_size
    print( 'Final balance: $' + str(balance))

What happened when you ran the previous cell?

In order to use the function, we have to call it.

In [None]:
# Calling the giftCard function

newCard = giftCard(200, 50)    # initial balance = $200, purchase_size = $50
print(newCard)    # prints None: giftCard() prints its messages but has no return statement

Try calling `giftCard()` with different parameters.
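
If you want to use a function's result later in your code, the function needs a `return` statement. Here is a minimal sketch of a variant that returns values instead of printing them. The name `gift_card_remaining` and the purchase counter are hypothetical additions, and the function assumes `purchase_size` is positive (otherwise the loop would never end).

```python
def gift_card_remaining(init_balance, purchase_size):
    """Return the leftover balance and the number of purchases made.

    Assumes purchase_size is a positive number.
    """
    balance = init_balance
    purchases = 0
    while balance - purchase_size >= 0:
        balance -= purchase_size
        purchases += 1
    return balance, purchases    # two values are returned together as a tuple

remaining, count = gift_card_remaining(110, 20)
print('Purchases made:', count)            # Purchases made: 5
print('Final balance: $' + str(remaining)) # Final balance: $10
```

Because the values are returned, they can be stored in variables and reused, which is not possible with a function that only prints.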

## NumPy Arrays

In [None]:
import numpy as np

In [None]:
arr = np.array([1,3,5,7])
arr

Why ndarrays?
- efficient vectorized, elementwise operations for homogeneous data (sometimes orders of magnitude faster than in "pure Python")
- provides foundation for operations in **pandas**

In [None]:
# Comparison of timing
import time
import numpy as np

size_of_vec = 1000

def pure_python_version():
    t1 = time.perf_counter()    # perf_counter() has finer resolution than time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]    # elementwise sum, one item at a time
    return time.perf_counter() - t1

def numpy_version():
    t1 = time.perf_counter()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y    # vectorized elementwise sum
    return time.perf_counter() - t1


t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)
print("In this example, NumPy is " + str(t1/t2) + " times faster!")

### Creating an ndarray

In [None]:
# We already saw this

np.array([1,3,5,7])

In [None]:
# Google Colab only: opens a file picker so you can upload loadarray.txt
from google.colab import files
files.upload()

# From a text file
my_arr = np.loadtxt('loadarray.txt')
my_arr

### Investigating an ndarray

In [None]:
# Number of dimensions

my_arr.ndim

In [None]:
# Shape

my_arr.shape

In [None]:
# Data type

my_arr.dtype
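
If you don't have `loadarray.txt` at hand, the same attributes can be explored on a small array built in code. This is a sketch with a made-up 2 x 3 array named `grid`; `astype()` is an extra method not covered above.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)     # 2       (two dimensions: rows and columns)
print(grid.shape)    # (2, 3)  (2 rows, 3 columns)
print(grid.dtype)    # an integer type; the exact width depends on the platform

# All elements of an ndarray share one data type; astype() converts them together
as_floats = grid.astype(float)
print(as_floats.dtype)    # float64
```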

### Indexing and slicing

In [None]:
sales = np.array([20, 30, 31, 33, 33, 35, 40, 410, 410, 45])
sales[7:9]

### Mathematical and statistical operations, functions, and methods

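Remember, these operations are applied *elementwise*: each element of one array is combined with the element at the same position in the other.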

In [None]:
arr1 = np.arange(10)
arr2 = arr1 + 2

In [None]:
arr1

In [None]:
arr2

In [None]:
arr1 + arr2

In [None]:
arr2 * arr2

In [None]:
arr2 ** 0.5

In [None]:
1/arr2

In [None]:
# Equivalent to arr2 * arr2

np.square(arr2)

In [None]:
# Equivalent to arr2 ** 0.5

np.sqrt(arr2)

This is only a small sample of the many universal functions (ufuncs, functions that operate elementwise on arrays) that NumPy provides. For more, check out the documentation: https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs

### Array aggregation

In [None]:
my_arr

In [None]:
# Sum of all elements

my_arr.sum()

In [None]:
# Arithmetic mean

my_arr.mean()

In [None]:
# Standard deviation (optionally, adjust degrees of freedom used in calculation via ddof parameter)

my_arr.std()

In [None]:
# Variance (ddof adjustable)

my_arr.var()

In [None]:
# Maximum of all elements

my_arr.max()

In [None]:
# Minimum of all elements

my_arr.min()

In [None]:
# What if I want the maximum value in each row?

my_arr.max(axis=1)

In [None]:
# Finding the index of the maximum element (the position in the flattened array when no axis is given)

my_arr.argmax()
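
The `axis` and `ddof` options are easier to follow on an array small enough to check by hand. The sketch below uses a made-up 2 x 3 array named `grid`; `np.unravel_index()` is an extra function, shown as one way to turn the flat index from `argmax()` into a (row, column) pair.

```python
import numpy as np

grid = np.array([[1, 8, 3],
                 [4, 5, 6]])

print(grid.sum())          # 27       (all elements)
print(grid.max(axis=0))    # [4 8 6]  (maximum of each column)
print(grid.max(axis=1))    # [8 6]    (maximum of each row)

flat_index = grid.argmax() # position of the maximum in the flattened array
print(flat_index)          # 1
row, col = np.unravel_index(flat_index, grid.shape)
print(row, col)            # 0 1      (row 0, column 1)

# ddof=1 divides by (n - 1) instead of n, giving the sample standard deviation
print(grid.std(ddof=1) > grid.std())    # True
```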

## Pandas

In [None]:
import pandas as pd

In [None]:
# Google Colab only: opens a file picker so you can upload the CSV file read below
from google.colab import files
files.upload()

In [None]:
# Read in the data

persons = pd.read_csv('unhcr_popstats_export_persons_of_concern_all_data.csv', header=3, na_values = '*')

In [None]:
persons.tail()

In [None]:
persons.info()

In [None]:
per_renamed = persons.rename(index=str, columns ={'Country / territory of asylum/residence': 'Residence',
                                    'Refugees (incl. refugee-like situations)': 'Refugees',
                                    'Asylum-seekers (pending cases)': 'Asylum-seekers',
                                    'Internally displaced persons (IDPs)': 'IDPs'})
per_renamed.head()

How do we look at specific columns?

In [None]:
per_renamed['Origin']

Let's say we want to focus on persons from Somalia.

In [None]:
somali = per_renamed[per_renamed['Origin'] == 'Somalia']
somali

Now, let's say we want to be more specific and focus on Somalis residing in the U.S. in the years 2000 through 2016.

In [None]:
somali_us = somali[(somali['Residence'] == 'United States of America') & (somali['Year'] >= 2000) & (somali['Year'] <= 2016)]
somali_us
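
The filtering expression works by building a Boolean "mask" (one True/False value per row) and then keeping only the rows where the mask is True. Each condition needs its own parentheses, and conditions are combined with `&` (and) or `|` (or). Here is a minimal sketch on a tiny DataFrame with made-up values; `between()` is an extra method, shown as an alternative way to write an inclusive range check.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    'Year': [1999, 2005, 2010, 2018],
    'Origin': ['A', 'B', 'A', 'A'],
    'Refugees': [10, 20, 30, 40],
})

# One True/False value per row
mask = (df['Origin'] == 'A') & (df['Year'] >= 2000) & (df['Year'] <= 2016)
print(mask.tolist())    # [False, False, True, False]

subset = df[mask]       # keeps only the rows where mask is True
print(subset)

# between() is inclusive on both ends by default
alternative = df[(df['Origin'] == 'A') & df['Year'].between(2000, 2016)]
print(subset.equals(alternative))    # True
```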

## Data Visualization

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.cm as cm
from google.colab import files

In [None]:
# Line plot

x = np.arange(0,101,10)    # creating an array from 0 to 100 in steps of 10
y = np.square(x)           # creating an array by element-wise squaring of array "x"

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Part of a Parabolic Plot')

plt.savefig('parabolic.png')    # saving plot to the current working directory (overwrites the file if it already exists)
files.download("parabolic.png")    # Google Colab only: downloads the saved file to your computer

xmin, xmax = plt.xlim()    # get values for current x-axis limits
print('xmin:', xmin, 'xmax:', xmax)

What if we want to change the axes ranges?

In [None]:
# Line plot again, with adjusted axes

x = np.arange(0,101,10)   
y = np.square(x)           

plt.plot(x, y)

plt.xlim(0, 40)    # setting x-axis limits
plt.ylim(0, 2000)  # setting y-axis limits

plt.xlabel('x')
plt.ylabel('y')
plt.title('Part of a Parabolic Plot - Zoomed In')

plt.show()

Alright, let's say that we're happy with these x- and y- limits. What if we want to change the color and style of the plotted line?

`plt.plot()` allows you to specify formatting options for each x-y dataset you're plotting. Each formatting type is optional (not specifying one means the default is used), and all of them are combined into a single format string. This notebook writes them in the order *color* + *marker style* + *line style*; Matplotlib also accepts other orderings, such as marker + line + color, as long as the string can be parsed unambiguously. See the following example:

In [None]:
# Line plot again, this time with customized formatting

x = np.arange(0,101,10)   
y = np.square(x)           

plt.plot(x, y, 'mo--')    # color = magenta, marker = circle, line style = dashed
plt.xlabel('x')
plt.ylabel('y')
plt.title('Part of a Parabolic Plot - Magenta')

plt.show()

In [None]:

# Scatter plot of the same data

x = np.arange(0,101,10)   
y = np.square(x)           

plt.scatter(x, y)    # default marker (circle) and default color
plt.xlabel('x')
plt.ylabel('y')
plt.title('Part of a Parabolic Plot - Scatter')

plt.show()

What if you want to plot multiple data series? If you're calling `plt.plot()` on both of them, you can just add the x, y, and formatting options in the same call:

In [None]:
# Two series plotted as lines

x = np.arange(0,101,10)   
y1 = np.square(x)
y2 = np.power(x, 2.2)

plt.plot(x, y1, 'mo--', x, y2, 'gD-')    # series 1: magenta circles, dashed line; series 2: green diamonds, solid line
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['Series 1','Series 2'])
plt.title('Plotting Two Series')

plt.show()

In [None]:
# Line plot and scatter plot

x1 = np.arange(0,101,10)
x2 = np.arange(10,91,5)
y1 = np.square(x1)  
y2 = np.power(x2, 2.2)

plt.plot(x1, y1, 'mo--')
plt.scatter(x2, y2)

plt.xlabel('x')
plt.ylabel('y')
plt.legend(['Line','Scatter'])
plt.title('Line Plot and Scatter Plot')

plt.show()

In [None]:
# Example distribution using sample data from normal distribution

np.random.seed(10)
distr = 10 + np.random.randn(1000)*5

# Creating a histogram from the distribution with 20 bins, counts normalized to get probability density
histogram = plt.hist(distr, bins=20, density=True, alpha=0.7)
n, bins, patches = histogram        # plt.hist() returns the bin heights, the bin edges, and the patches used to draw the histogram
plt.xlabel('x', size=12)
plt.ylabel('Probability density', size=12)
plt.show()

print('Values for each bin:', n)
print('Edges of the bins:', bins)
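
With `density=True`, the bin heights are scaled so that the total area of the bars (height times width, summed over all bins) equals 1. The sketch below checks this using `np.histogram()`, which computes the same heights and edges without drawing a plot. It is an extra function not used above, and it regenerates the same sample data so that it runs on its own.

```python
import numpy as np

np.random.seed(10)
distr = 10 + np.random.randn(1000) * 5

# Same binning as the plt.hist() call, but without plotting
n, bins = np.histogram(distr, bins=20, density=True)

print(len(n), len(bins))         # 20 21  (20 bins need 21 edges)

widths = np.diff(bins)           # width of each bin
area = np.sum(n * widths)        # total area under the histogram
print(np.isclose(area, 1.0))     # True
```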

*References*:

The following materials were consulted during development of this notebook:

J. Zelle, *Python Programming: An Introduction to Computer Science*, 2nd ed. Sherwood, Oregon: Franklin, Beedle & Associates Inc., 2010.

Python 3 Documentation from the Python Software Foundation: https://docs.python.org/3/