# Introduction to Python

In this workshop, we will cover the basics of Python as well as two very useful libraries that you will need to store and manipulate your data.

The two libraries are:

1.   Numpy - *Numerical Python, the fundamental package for scientific computing in Python*
2.   Pandas - *Python Data Analysis Library, which provides high-performance, easy-to-use data structures*



## Python as a calculator

In [0]:
2**3 # to power of
42/15 # division (integer division in Python 2, as in the output below; Python 3 gives 2.8)
42/15. # float division
42//15. # floor division
8%3 # modulo

8
2
2.8
2.0
2
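The integer result for 42/15 above is Python 2 behaviour. As a minimal sketch of how the same operators behave in Python 3 (the variable names are only illustrative):

```python
# Python 3 semantics for the division operators
true_div = 42 / 15                # true division always returns a float
floor_div = 42 // 15              # floor division of two ints returns an int
remainder = 42 % 15               # modulo: what is left after floor division
quotient, rest = divmod(42, 15)   # quotient and remainder in one call

print(true_div, floor_div, remainder, quotient, rest)
```

If you need the same result in both Python versions, using // for floor division and a float operand for true division is a common choice.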


## Using Variables

In [0]:
width=20
height=5
print (width*height)
print (width==height)
width=width+50
print (width)
height+=10
print (height)

100
False
70
15


### Booleans


In [0]:
True & False

False

In [0]:
True | False 

True

In [0]:
z= True 
z2 = False 
z &= z2 
print(z)

False


In [0]:
verylongvariable = 10
verylongvariable -=2
verylongvariable

8

## Strings

In [0]:
print ('spam eggs')

print ('doesn\'t') 
# OR
print ("hadn't")

print ("""This is a 
multiline string""") 
# OR
print ("First line \n second line")

print ("first part, " + "second part")




spam eggs
doesn't
hadn't
This is a 
multiline string
First line 
 second line
first part, second part


In [0]:
word="University"
print (word[0])
print (word[-1])


U
y


## Type casting

In [0]:
#It is often useful to change the type of an object
a=2
b=3
print (a+b)
print ('Value of b: ' + str(b))

5
Value of b: 3
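Casting also works in the other direction, from strings to numbers. The sketch below uses made-up values and variable names purely for illustration, and shows what happens when the cast is forgotten:

```python
# Converting between strings and numbers
count = int("42")                     # string -> int
price = float("3.5")                  # string -> float
label = "Total: " + str(count + 1)    # int -> string before concatenation

print(count, price, label)

# Concatenating a string and a number without str() raises a TypeError
try:
    "Value: " + 3
except TypeError as err:
    print("TypeError:", err)
```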


## Lists
Lists can be used to group together other items. In Python (unlike arrays in some other languages), a list can contain items of different types.

In [0]:
squares=[1,4,9,16,25]
print (squares[0])
print (squares[2:])
print (squares[2:4])
print (squares[1:-1])
print (len(squares))  # The len function returns the length of a list. 
print ([1,2]+[3,4])

1
[9, 16, 25]
[9, 16]
[4, 9, 16]
5
[1, 2, 3, 4]


## Data Types
**Mutable vs Immutable:** in Python, some objects can be changed in place without changing their identity (mutable) and others cannot (immutable).

Some immutable objects (i.e. objects that can't be modified after creation):
*   int
*   float
*   string
*   boolean


Some mutable objects:
*   list
*   dictionary


If this goes unnoticed, it can cause some nasty bugs (errors) when defining functions, so it is important to understand.

NOTE: Mutable objects (for example lists and dicts) are useful when you need to change the contents or size of an object in place. Immutable objects are useful when you need to ensure that the object you created always stays the same.


In [0]:
# INT is immutable
x = 2
y = x
print(x)
print(y)
print(id(x))
print(id(y))

2
2
94048004720416
94048004720416


id(x) == id(y)

id(y) == id(2)

In [0]:
x = x + 1
print(x)
print(y)
print(id(x))
print(id(y))

3
2
94048004720392
94048004720416


id(x) != id(y)

id(y) == id(2)

In [0]:
# LIST is mutable
m = list([1, 2, 3])
n = m
print(m)
print(n)
print(id(m))
print(id(n))

[1, 2, 3]
[1, 2, 3]
139727692220536
139727692220536


id(m) == id(n)

In [0]:
m.pop()
print(m)
print(n)
print(id(m))
print(id(n))

[1, 2]
[1, 2]
139727692220536
139727692220536


id(m) == id(n)
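The same sharing happens when a list is passed to a function, which is where the bugs mentioned earlier tend to appear. A minimal sketch, with two hypothetical helper functions written only for this illustration:

```python
def add_item_in_place(items, value):
    """Append to the caller's list (mutates the object that was passed in)."""
    items.append(value)
    return items


def add_item_copy(items, value):
    """Build and return a new list; the caller's list is left untouched."""
    return items + [value]


original = [1, 2, 3]

add_item_copy(original, 4)
print(original)      # the original list is unchanged

add_item_in_place(original, 4)
print(original)      # the original list now has a fourth element
```

Which version to prefer depends on whether the caller expects its list to change; the important part is to be aware of the difference.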

Notice the following error:

In [0]:
word = "university" 
word[2]='r' 

TypeError: 'str' object does not support item assignment

But it works for lists, for example:

In [0]:
squares=[1,4,9,16,25]
squares[-2]=100
print (squares)

[1, 4, 9, 100, 25]


Examples of identity, comparison and bitwise operations:

In [0]:
x=2
y=2
print(x is y)

True


In [0]:
x=2
y=5
print(x != y)

True


In [0]:
x=2
y=3
print(x | y)

3


In [0]:
x=2
y=3
print(x & y)

2


## Control Flow

In [0]:
#Fibonacci series
numbers=[1,2]
while numbers[-1] + numbers[-2] < 100:
    new_num = numbers[-1] + numbers[-2]
    numbers = numbers + [new_num]
print(numbers)


[1, 2, 3, 5, 8, 13, 21, 34, 55, 89]


In [0]:
#Fibonacci series
numbers=[1,2]
while numbers[-1] + numbers[-2] < 100:
    new_num = numbers[-1] + numbers[-2]
    numbers = numbers + [new_num]
    print(numbers)


[1, 2, 3]
[1, 2, 3, 5]
[1, 2, 3, 5, 8]
[1, 2, 3, 5, 8, 13]
[1, 2, 3, 5, 8, 13, 21]
[1, 2, 3, 5, 8, 13, 21, 34]
[1, 2, 3, 5, 8, 13, 21, 34, 55]
[1, 2, 3, 5, 8, 13, 21, 34, 55, 89]


**Notice:**


*   The while loop executes as long as the condition remains true.
*   The body of the loop is indented. Indentation is how Python groups statements. You can use tabs or spaces for indentation, but you must be consistent within a block, and Python 3 raises an error when tabs and spaces are mixed inconsistently. It is good practice to use exactly 4 spaces per level.
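Since lists are mutable, the loop can also grow the list in place with append instead of building a new list on every iteration. A minimal sketch of the same Fibonacci loop written that way:

```python
# Fibonacci series, extending the same list object in place
numbers = [1, 2]
while numbers[-1] + numbers[-2] < 100:
    numbers.append(numbers[-1] + numbers[-2])

print(numbers)
```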



### If statements

In [0]:
x=int(input("Please enter an integer:"))
if x < 0:
    x = 0  # actually change the value, as the message says
    print ('Negative, changed to zero')
elif x == 0:
    print ('Zero')
elif x == 1:
    print ('Single')
else:
    print ('More')  

Please enter an integer:2
More


### For loop

The for loop iterates over the items of a list (or the characters of a string) in the order that they appear. In many other languages, for loops instead step a numeric counter, in the form of

     for (i= 1 to 10) --> loop iterator
           perform action

In [0]:
#Measure some strings
words=['cat','window','university']
for w in words:
    print (w, len(w))

('cat', 3)
('window', 6)
('university', 10)


In [0]:
#If you want to iterate over numbers use the range function
a=['Mary','had','a','little','lamb']
for i in range(len(a)):
    print (i, a[i])

(0, 'Mary')
(1, 'had')
(2, 'a')
(3, 'little')
(4, 'lamb')


In [0]:
for i in enumerate (['Mary','had','a','little','lamb']):
    print(i)

(0, 'Mary')
(1, 'had')
(2, 'a')
(3, 'little')
(4, 'lamb')
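Each item produced by enumerate is an (index, value) pair, so it can be unpacked directly in the for statement. A short sketch, where the names pairs, i and word are illustrative only:

```python
words = ['Mary', 'had', 'a', 'little', 'lamb']

pairs = []
for i, word in enumerate(words):
    # i is the position, word is the item at that position
    pairs.append((i, word))
    print(i, word)

# enumerate accepts a start value if counting from 1 is more natural
numbered = list(enumerate(words, start=1))
```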


## Functions

In [0]:
def equal(x, y):
    if (x==y):
        return True        
    else: return False         

print (equal(2,3))
print (equal(2,2))

False
True
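Because a comparison already evaluates to True or False, the function can also return the comparison directly. An equivalent, shorter sketch of the same function:

```python
def equal(x, y):
    # x == y is itself a boolean, so no if/else is needed
    return x == y


print(equal(2, 3))
print(equal(2, 2))
```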


# Introduction to Numpy

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy. 

Source found [here](http://cs231n.github.io/python-numpy-tutorial/).


In [0]:
#importing Numpy library
import numpy as np

## Arrays
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:

In [0]:
a = np.array([1, 2, 3])   # Create a rank 1 array

print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"

<type 'numpy.ndarray'>
(3,)
(1, 2, 3)


In [0]:
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"


[5 2 3]


In [0]:
b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"

(2, 3)
(1, 2, 4)
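The rank, shape and element type described above can be read directly from array attributes. A minimal sketch using the same rank 2 array:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

print(b.ndim)    # number of dimensions (the rank)
print(b.shape)   # size along each dimension
print(b.size)    # total number of elements
print(b.dtype)   # common element type; the integer width depends on the platform
```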


Numpy also provides many functions to create arrays:

In [0]:
a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[ 0.  0.]
                      #          [ 0.  0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[ 1.  1.]]"


[[0. 0.]
 [0. 0.]]
[[1. 1.]]
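A few more creation functions are worth knowing. The sketch below is a small selection rather than a complete list, and the variable names are illustrative:

```python
import numpy as np

c = np.full((2, 2), 7)         # constant array filled with 7
d = np.eye(2)                  # 2x2 identity matrix
e = np.arange(0, 10, 2)        # evenly spaced integers, stop value excluded
f = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values, both ends included

print(c)
print(d)
print(e)
print(f)
```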


## Array Indexing

Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

In [0]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a, a.shape)

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]
print(b, b.shape)


# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

(array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]]), (3, 4))
(array([[2, 3],
       [6, 7]]), (2, 2))
2
77
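If you want a slice that can be modified without touching the original array, take an explicit copy. A minimal sketch contrasting the two cases (the names view and independent are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]                  # shares memory with a
independent = a[:2, 1:3].copy()    # owns its own data

independent[0, 0] = 77
print(a[0, 1])                     # a is unaffected by changes to the copy

view[0, 0] = 99
print(a[0, 1])                     # a changes together with the view

print(np.shares_memory(a, view), np.shares_memory(a, independent))
```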


One useful trick with integer array indexing is selecting or mutating one element from each row of a matrix:

In [0]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

print(a)  # Prints "[[ 1  2  3]
          #          [ 4  5  6]
          #          [ 7  8  9]
          #          [10 11 12]]"

# Create an array of indices
b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
print(a[np.arange(4), b])  # Prints "[ 1  6  7 11]"


[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[ 1  6  7 11]
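The same index arrays can be used to mutate one element from each row, not only to select it. A short sketch that adds 10 to the selected elements:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
b = np.array([0, 2, 0, 1])   # column to pick in each row

# Add 10 to a[0, 0], a[1, 2], a[2, 0] and a[3, 1]
a[np.arange(4), b] += 10

print(a)
```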


## Array Math

In [0]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
# OR
print(np.add(x, y))

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
# OR
print(np.multiply(x, y))


[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]
[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


Note:

(*) is elementwise multiplication, not matrix multiplication. 

Instead, use the dot function (x.dot(y) or np.dot(x, y)) to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices.

In [0]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

print(x*y)

print(x.dot(y))
# OR
print(np.dot(x,y))

[[ 5 12]
 [21 32]]
[[19 22]
 [43 50]]
[[19 22]
 [43 50]]
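In Python 3.5 and later, the @ operator offers another way to write these products with Numpy arrays. A minimal sketch, where the vectors v and w are made up for this example:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

matrix_product = x @ y   # matrix times matrix
mat_vec = x @ v          # matrix times vector
inner = v @ w            # inner product of two vectors

print(matrix_product)
print(mat_vec)
print(inner)
```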


## Changing Array Dimensions

In [0]:
#creating an array a
a = np.arange(6)
b = a.reshape((3, 2))
print('Before reshaping: ', a)
print('After reshaping: ', b)

('Before reshaping: ', array([0, 1, 2, 3, 4, 5]))
('After reshaping: ', array([[0, 1],
       [2, 3],
       [4, 5]]))
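Two details of reshape are easy to miss: one dimension can be given as -1 so that Numpy infers it, and the result shares data with the original array whenever Numpy can avoid a copy. A small sketch under those assumptions:

```python
import numpy as np

a = np.arange(6)
b = a.reshape((-1, 2))   # -1 lets Numpy infer the number of rows

print(b.shape)

# For a contiguous array like this one, reshape returns a view,
# so writing through b is visible in a as well.
b[0, 0] = 100
print(a)
print(np.shares_memory(a, b))
```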


# Introduction to Pandas

Pandas stands for “Python Data Analysis Library”. It is one of the most widely used tools for data munging/wrangling in Python, if not the most used one. Pandas is open source and free to use (under a BSD license).

Sources can be found [here](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673) and [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).

In [0]:
#importing Pandas library
import pandas as pd


## Object Creation

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.) 

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [0]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
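The NaN entry marks a missing value. As a hedged sketch of a few common ways to deal with it (which one is appropriate depends on your data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

missing_count = s.isna().sum()   # number of missing values
average = s.mean()               # NaN is skipped by default
dropped = s.dropna()             # new Series without the missing entry
filled = s.fillna(0)             # new Series with NaN replaced by 0

print(missing_count, average)
print(dropped)
print(filled)
```

None of these calls change s itself; each returns a new object.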

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:


In [0]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [0]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

                   A         B         C         D
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074
2013-01-02  2.339862  0.706279  0.810967 -1.138736
2013-01-03  0.384474  0.759304 -1.669582  0.100087
2013-01-04 -1.274437  0.118514  0.433957 -0.470793
2013-01-05  1.008659  0.357255  1.040918  0.926418
2013-01-06 -1.249155  1.257724 -1.774290 -0.371438


Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [0]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


Notice the columns of the resulting DataFrame have different dtypes.

In [0]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## Viewing Data

Here is how to view the top and bottom rows of the frame:

In [0]:
df.head() # view the first 5 rows of the DataFrame

                   A         B         C         D
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074
2013-01-02  2.339862  0.706279  0.810967 -1.138736
2013-01-03  0.384474  0.759304 -1.669582  0.100087
2013-01-04 -1.274437  0.118514  0.433957 -0.470793
2013-01-05  1.008659  0.357255  1.040918  0.926418


In [0]:
df.head(3) # view the first 3 rows of the DataFrame

                   A         B         C         D
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074
2013-01-02  2.339862  0.706279  0.810967 -1.138736
2013-01-03  0.384474  0.759304 -1.669582  0.100087


In [0]:
df.tail(2) #view last 2 rows

                   A         B         C         D
2013-01-05  1.008659  0.357255  1.040918  0.926418
2013-01-06 -1.249155  1.257724 -1.774290 -0.371438


Display the index and the columns:

In [0]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [0]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

describe() shows a quick statistical summary of your data:

In [0]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.118858  0.411392 -0.400629 -0.068232
std    1.413241  0.680796  1.300305  0.747499
min   -1.274437 -0.730723 -1.774290 -1.138736
25%   -1.060930  0.178199 -1.563623 -0.445955
50%   -0.055892  0.531767 -0.405894 -0.135676
75%    0.852613  0.746048  0.716714  0.433827
max    2.339862  1.257724  1.040918  0.926418


Sorting by values:

In [0]:
df.sort_values(by='A') # returns a sorted copy; df itself is not changed

                   A         B         C         D
2013-01-04 -1.274437  0.118514  0.433957 -0.470793
2013-01-06 -1.249155  1.257724 -1.774290 -0.371438
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074
2013-01-03  0.384474  0.759304 -1.669582  0.100087
2013-01-05  1.008659  0.357255  1.040918  0.926418
2013-01-02  2.339862  0.706279  0.810967 -1.138736


Notice the following:

In [0]:
print(df)
df = df.sort_values(by='A') # assign the result back to keep the sorted order
print(df)

                   A         B         C         D
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074
2013-01-02  2.339862  0.706279  0.810967 -1.138736
2013-01-03  0.384474  0.759304 -1.669582  0.100087
2013-01-04 -1.274437  0.118514  0.433957 -0.470793
2013-01-05  1.008659  0.357255  1.040918  0.926418
2013-01-06 -1.249155  1.257724 -1.774290 -0.371438
                   A         B         C         D
2013-01-04 -1.274437  0.118514  0.433957 -0.470793
2013-01-06 -1.249155  1.257724 -1.774290 -0.371438
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074
2013-01-03  0.384474  0.759304 -1.669582  0.100087
2013-01-05  1.008659  0.357255  1.040918  0.926418
2013-01-02  2.339862  0.706279  0.810967 -1.138736
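Sorting can also be done in descending order, and sort_index restores the order of the index. A minimal sketch on a small made-up frame (the data and the name frame are assumptions for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'A': [3, 1, 2], 'B': [10, 20, 30]},
                     index=['x', 'y', 'z'])

by_a = frame.sort_values(by='A')                          # ascending by column A
by_b_desc = frame.sort_values(by='B', ascending=False)    # descending by column B
restored = by_a.sort_index()                              # back to index order

print(by_a)
print(by_b_desc)
print(restored)
```

As with the example above, frame itself keeps its original order because each call returns a new DataFrame.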


## Selection

In [0]:
df['A'] # selecting column by name

2013-01-01   -0.496258
2013-01-02    2.339862
2013-01-03    0.384474
2013-01-04   -1.274437
2013-01-05    1.008659
2013-01-06   -1.249155
Freq: D, Name: A, dtype: float64

In [0]:
df[0:3] # selecting rows 

                   A         B         C         D
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074
2013-01-02  2.339862  0.706279  0.810967 -1.138736
2013-01-03  0.384474  0.759304 -1.669582  0.100087


In [0]:
df['20130102':'20130104']

                   A         B         C         D
2013-01-04 -1.274437  0.118514  0.433957 -0.470793
2013-01-03  0.384474  0.759304 -1.669582  0.100087
2013-01-02  2.339862  0.706279  0.810967 -1.138736


***loc*** gets rows (or columns) with particular labels from the index. 

***iloc*** gets rows (or columns) at particular positions in the index (so it only takes integers).

In [0]:
df.loc[dates[0]] # selecting by label

A   -0.496258
B   -0.730723
C   -1.245745
D    0.545074
Name: 2013-01-01 00:00:00, dtype: float64

In [0]:
df.loc['20130102':'20130104', ['A', 'B']]

In [0]:
df.iloc[3:5, 0:2] # selection by integer position

                   A         B
2013-01-03  0.384474  0.759304
2013-01-05  1.008659  0.357255
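One difference between the two that often surprises people: label slices with loc include the end label, while position slices with iloc exclude the end position. A small sketch on a made-up frame with string labels:

```python
import pandas as pd

frame = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]},
                     index=['a', 'b', 'c', 'd'])

by_label = frame.loc['b':'c', ['A']]   # rows 'b' and 'c' (end label included)
by_position = frame.iloc[1:3, 0:1]     # positions 1 and 2 (end position excluded)
single = frame.loc['d', 'B']           # one value, by row and column label

print(by_label)
print(by_position)
print(single)
```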


## Boolean Indexing

In [0]:
df[df.A > 0]

                   A         B         C         D
2013-01-03  0.384474  0.759304 -1.669582  0.100087
2013-01-05  1.008659  0.357255  1.040918  0.926418
2013-01-02  2.339862  0.706279  0.810967 -1.138736


In [0]:
#Using the isin() method for filtering:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

                   A         B         C         D      E
2013-01-04 -1.274437  0.118514  0.433957 -0.470793    one
2013-01-06 -1.249155  1.257724 -1.774290 -0.371438    one
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074    two
2013-01-03  0.384474  0.759304 -1.669582  0.100087  three
2013-01-05  1.008659  0.357255  1.040918  0.926418   four
2013-01-02  2.339862  0.706279  0.810967 -1.138736  three


In [0]:
df2[df2['E'].isin(['two', 'four'])]

                   A         B         C         D     E
2013-01-01 -0.496258 -0.730723 -1.245745  0.545074   two
2013-01-05  1.008659  0.357255  1.040918  0.926418  four
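Several conditions can be combined with & (and), | (or) and ~ (not). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A minimal sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'A': [-1.0, 0.5, 2.0, -0.3],
                      'B': [1.0, -2.0, 3.0, 4.0]})

both_positive = frame[(frame['A'] > 0) & (frame['B'] > 0)]     # A and B positive
either_negative = frame[(frame['A'] < 0) | (frame['B'] < 0)]   # A or B negative
not_positive_a = frame[~(frame['A'] > 0)]                      # A not positive

print(both_positive)
print(either_negative)
print(not_positive_a)
```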


## Operations 

### Statistics

In [0]:
df.mean() # mean of each column

A    0.118858
B    0.411392
C   -0.400629
D   -0.068232
dtype: float64

In [0]:
# Mean of each row
df.mean(axis=1)

2013-01-04   -0.298190
2013-01-06   -0.534290
2013-01-01   -0.481913
2013-01-03   -0.106429
2013-01-05    0.833313
2013-01-02    0.679593
dtype: float64