# Data Analysis in Python




## 1. Types of containers in Python

### Lists

* Lists are good when the order matters but there is no label associated with the data
* Common list operations: the methods append, extend, remove, and sort, plus the del statement
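
As a minimal sketch of the list operations named above (the name `scores` and its values are made up for illustration):

```python
scores = [3, 1, 2]
scores.append(5)        # add one element at the end -> [3, 1, 2, 5]
scores.extend([4, 1])   # add every element of another iterable -> [3, 1, 2, 5, 4, 1]
del scores[0]           # delete by position -> [1, 2, 5, 4, 1]
scores.remove(1)        # delete the first element equal to 1 -> [2, 5, 4, 1]
scores.sort()           # sort in place -> [1, 2, 4, 5]
print(scores)
```

Note that `del` is a statement rather than a method, and that `sort` modifies the list in place and returns `None`.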

### Dictionaries

* Dictionaries are suitable when each data item has a unique label (key) and the order does not matter
* Common operations: the update method and the del statement
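
A small sketch of these two dictionary operations (the name `prices` and its contents are hypothetical):

```python
prices = {"apple": 1.0, "pear": 2.0}
prices.update({"pear": 2.5, "plum": 3.0})  # overwrite "pear", add "plum"
del prices["apple"]                        # delete an entry by its key
print(prices)  # {'pear': 2.5, 'plum': 3.0}
```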

### Comprehensions

* Shorthand for writing loops that build lists or dictionaries. Examples:

In [1]:
list1 = [i**2 for i in range(10)]
dict1 = {i: i**2 for i in range(10)}
dict2 = {i: i**2 for i in range(30) if i % 3 == 0}
print(list1,dict1,dict2)


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81] {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81} {0: 0, 18: 324, 3: 9, 21: 441, 6: 36, 24: 576, 9: 81, 27: 729, 12: 144, 15: 225}


## 2. Packages and Modules

### Numpy

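* Faster than lists when dealing with large amounts of data of the same (numerical) type; an array holds a single dtype, so mixed values are converted to a common type
* Common dtypes: int, float (bool and complex are also supported)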
* Numpy is good at creating special vectors and matrices in a Matlab/R fashion. Examples:

In [6]:
import numpy as np
print(np.zeros((3,3),'d'))  # 3x3 array of zeros; 'd' means double-precision float

[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]


In [8]:
print(np.linspace(0,10,5))

[  0.    2.5   5.    7.5  10. ]


In [9]:
print(np.arange(0,10,2))

[0 2 4 6 8]


In [13]:
list1 = np.random.standard_normal([2,3])
list2 = np.random.standard_normal([2,3])
test1 = np.vstack([list1, list2])
test2 = np.hstack([list1, list2])
print(test1)
print(test2)

[[-0.06975694  1.37733576  1.35501899]
 [-0.09499279 -0.53974045  0.73653713]
 [ 0.22177781 -1.16348898 -0.39544124]
 [ 1.18731143  0.23417283  0.10172748]]
[[-0.06975694  1.37733576  1.35501899  0.22177781 -1.16348898 -0.39544124]
 [-0.09499279 -0.53974045  0.73653713  1.18731143  0.23417283  0.10172748]]


Numpy arrays can be saved and loaded by the following commands:
* np.save('test.npy',test1)
* loadtest1 = np.load('test.npy')
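
A save/load round trip can be checked along these lines. This is only a sketch: the temporary directory and the names `arr`, `path`, and `loaded` are illustrative choices, not part of the notes above.

```python
import os
import tempfile

import numpy as np

arr = np.arange(6).reshape(2, 3)

# Write to a temporary directory so the example leaves no file behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.npy")
    np.save(path, arr)      # binary .npy format, keeps shape and dtype
    loaded = np.load(path)  # the array is read fully into memory

print(np.array_equal(arr, loaded))  # True
```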

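__Warning__: slicing a Numpy array returns a view, not a copy. If you create a subset of an array by slicing and later modify the subset, the data in the original array change as well; call .copy() on the subset to avoid this. Lists behave differently: slicing a list creates a new list.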
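
The difference can be seen in a short sketch (the names `arr`, `view`, `safe`, `lst`, and `sub` are illustrative):

```python
import numpy as np

arr = np.arange(5)        # values 0, 1, 2, 3, 4
view = arr[1:3]           # basic slicing shares memory with arr
view[0] = 99
print(arr)                # element 1 of the original is now 99

safe = arr[1:3].copy()    # .copy() makes an independent array
safe[0] = -1
print(arr)                # unchanged by the edit to safe

lst = [0, 1, 2, 3, 4]
sub = lst[1:3]            # list slicing creates a new list
sub[0] = 99
print(lst)                # [0, 1, 2, 3, 4]
```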

### Slicing using indexing


In [14]:
test2[:,::2] # every second column, starting from column index 0

array([[-0.06975694,  1.35501899, -1.16348898],
       [-0.09499279,  0.73653713,  0.23417283]])

In [15]:
test2[:,::3] # every third column, starting from column index 0

array([[-0.06975694,  0.22177781],
       [-0.09499279,  1.18731143]])

In [21]:
test2[:,2:] # from the third column (index 2) to the last column

array([[ 1.35501899,  0.22177781, -1.16348898, -0.39544124],
       [ 0.73653713,  1.18731143,  0.23417283,  0.10172748]])

In [22]:
test2[:,2:-1] # from the third column (index 2) up to, but not including, the last column

array([[ 1.35501899,  0.22177781, -1.16348898],
       [ 0.73653713,  1.18731143,  0.23417283]])

### Pandas: data analysis module in Python

__Series in Pandas:__ think of a Series as a one-dimensional array whose entries carry labels, much like the keys of a dictionary. Here, the set of labels is called the index. A Series can be indexed either by integer position or by label. To be explicit, and whenever ambiguity exists, use "iloc" for positional indexing and "loc" for label-based indexing. Example:

In [2]:
import pandas as pd

ser1 = pd.Series({"A":"Nov.", "B":"Dec."})

In [3]:
ser2 = pd.Series(["Nov.", "Dec."], index = ["A", "B"])

In [4]:
ser1 == ser2

A    True
B    True
dtype: bool

In [5]:
print(ser1.iloc[1])  # positional lookup; plain ser1[1] is deprecated in recent pandas when the index holds labels
print(ser2["B"])

Dec.
Dec.


In [6]:
print(ser1.loc["A"])
print(ser2.iloc[0])

Nov.
Nov.


Note that slicing with "loc" includes the end label, while slicing with "iloc" excludes the end position:

In [7]:
ser1.iloc[0:1]

A    Nov.
dtype: object

In [8]:
ser1.loc["A":"B"]

A    Nov.
B    Dec.
dtype: object

__Dataframes in Pandas.__ 

* Multi-indexing, "stack" and "unstack": these methods correspond to "long" and "wide" panel datasets. A stacked dataset is a long one, where both indices are on the left end; an unstacked dataset is a wide one, where one index is on the left while the other is on the top.
* By-group aggregation and pivot tables. By-group aggregation is similar to "aggregate" in R. The pivot table command is a more convenient wrapper for the same purpose. Syntax:
    + df.groupby(["gender","age"]).mean()
    + df.groupby(["gender","age"])["sales"].mean()
    + pd.pivot_table(df, "sales", ["gender","age"])

The second and third commands give the same numbers (the second as a Series, the third as a one-column DataFrame). Note that if the square brackets in the third command are removed, so that "age" is passed as the columns argument, the resulting table will be in wide rather than long format (thus the name "pivot table").
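
The long and wide layouts can be compared on a toy dataset. The sketch below assumes a made-up DataFrame `df` with the columns "gender", "age", and "sales" used in the syntax lines above; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notes above.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "age":    [20, 30, 20, 30, 20, 30],
    "sales":  [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Long format: both grouping keys form a MultiIndex on the rows.
long_series = df.groupby(["gender", "age"])["sales"].mean()
long_table = pd.pivot_table(df, values="sales", index=["gender", "age"])

# Wide format: "gender" on the rows, "age" across the columns.
wide_table = pd.pivot_table(df, values="sales", index="gender", columns="age")

print(long_series)
print(wide_table)

# unstack() moves the inner row index to the columns (long -> wide);
# stack() is the reverse operation (wide -> long).
print(long_series.unstack())
```

Here `pivot_table` relies on its default aggregation, the mean, which is why it matches the `groupby(...).mean()` result.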

## 3. Syntax Shortcuts

* Multiple assignments: 

In [1]:
test1, test2 = "A","B"
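
The same tuple-unpacking mechanism also allows swapping values and collecting leftover items. A brief sketch, with illustrative names chosen so as not to overwrite `test1` and `test2`:

```python
first, second = "A", "B"
first, second = second, first      # swap without a temporary variable
head, *rest = [1, 2, 3, 4]         # the starred name collects the remaining items
print(first, second, head, rest)   # B A 1 [2, 3, 4]
```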

* "except": allows code to keep running despite errors and to print the error message. See the example below and the documentation:
https://docs.python.org/3/tutorial/errors.html

In [6]:
import sys
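try:
    print(test1)
    print(A)  # A is undefined, so this line raises NameError
except Exception:  # avoid a bare "except:", which would also catch KeyboardInterrupt
    print("System error message:", sys.exc_info()[0])

print("Continue execution")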

A
System error message: <class 'NameError'>
Continue execution
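
When the kind of error is known in advance, it is usually safer to catch only that exception class, so that unexpected errors still surface. A minimal sketch, where `safe_divide` is a hypothetical helper written for this example:

```python
def safe_divide(numerator, denominator):
    """Hypothetical helper: return the quotient, or None when dividing by zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError as err:
        # Only the anticipated error is handled; anything else still propagates.
        print("Caught:", type(err).__name__, "-", err)
        return None

print(safe_divide(6, 3))
print(safe_divide(1, 0))
print("Continue execution")
```

Binding the exception with `as err` gives direct access to the error object, which avoids the need for `sys.exc_info()`.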
