- NumPy is a great tool for dealing with numeric matrices and vectors in Python
- For more complex data, it is limited.
- Fortunately, when dealing with __complex data__ we can use a powerful Python data analysis toolkit, pandas
- Pandas is an open source library providing __high-performance and easy-to-use data structures__ for the Python programming language
- Used primarily for data manipulation and analysis
- Resources: http://pandas.pydata.org/pandas-docs/version/0.13.1/pandas.pdf

### Pandas introduces two new data structures to Python:
- (i) Series and
- (ii) DataFrame

- Both Series and DataFrame are built on top of __NumPy__ (which means they are very fast)


# Pandas: Series

#### A Series is a one-dimensional object similar to an array, list, or column in a table
- Pandas will assign a labelled index to each item in the Series
- By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

 S = pd.Series(data, index=index) : data can be many different things, such as a NumPy array, a list of scalar values, or a dictionary
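As a minimal sketch of the three kinds of input mentioned above (the variable names and values are made up for illustration), a Series can be built from a list, a dictionary, or a NumPy array:

```python
import numpy as np
import pandas as pd

# From a list of scalars: pandas assigns the default labels 0..N-1
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'x': 1.5, 'y': 2.5})

# From a NumPy array with an explicit index
from_array = pd.Series(np.array([7, 8, 9]), index=['a', 'b', 'c'])

print(list(from_list.index))   # [0, 1, 2]
print(list(from_dict.index))   # ['x', 'y']
print(from_array['b'])         # 8
```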

In [6]:
import numpy as np
import pandas as pd

s1 = pd.Series( np.random.randn(5), index=['a','b','c','d','e'])

s2 = pd.Series(np.random.randn(5))
print ("S1\n\n")
print (s1)
print ("\n\nS2\n\n")
print (s2)

print ("\n\nS2[1]:\n\n")
print (s2[1])

print ("\n\n S2[[2, 2]]:\n\n")
print (s2[[2, 2]])


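# Select the entries labelled 1 and 2 into a new Series
s3 = s2[[1, 2]]
# Label 0 does not exist in s3, so this assignment appends a new entry labelled 0
s3[0] = 12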

print ("\n\ns3:\n\n")
print (s3)

print ("\n\nnp.square(s3):\n\n")
print (np.square(s3))

S1


a    0.724320
b    0.857300
c   -1.377617
d    0.127625
e    1.840001
dtype: float64


S2


0   -0.108797
1    0.980716
2    1.001213
3    1.747460
4    1.038069
dtype: float64


S2[1]:


0.9807155580636534


 S2[[2, 2]]:


2    1.001213
2    1.001213
dtype: float64


s3:


1     0.980716
2     1.001213
0    12.000000
dtype: float64


np.square(s3):


1      0.961803
2      1.002427
0    144.000000
dtype: float64


#### You will notice that the functionality and syntax of a Series are quite similar to those of a NumPy array.
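The similarity to NumPy can be sketched with a small example (the values are made up for illustration). Element-wise arithmetic, NumPy functions, and reductions all work on a Series, and the index labels are carried along:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

doubled = s * 2          # element-wise arithmetic, as with an ndarray
roots = np.sqrt(s)       # NumPy functions accept a Series and keep its index
total = s.sum()          # reductions work the same way

# Assigning to a label that does not exist appends a new entry
s['d'] = 16.0

print(roots)
print(total)             # 14.0
print(list(s.index))     # ['a', 'b', 'c', 'd']
```

One difference from a plain array is visible in the last step: a Series is indexed by label, so assigning to an unknown label grows the Series rather than raising an error.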

In [8]:
# Dictionary with annual car robberies in each Irish city
d = {'Dublin': 245, 'Cork': 150, 'Limerick': 125,'Galway': 360, 'Belfast': 300}

irishCities = pd.Series(d)

print (irishCities <230)

print ("\n", irishCities [ irishCities <230  ]  )



print ( type ( irishCities[irishCities <200] ) )

Dublin      False
Cork         True
Limerick     True
Galway      False
Belfast     False
dtype: bool

 Cork        150
Limerick    125
dtype: int64
<class 'pandas.core.series.Series'>


#### As with a NumPy array, indexing with the result of a relational operator returns a separate copy of the data. The original Series and the filtered Series do not refer to the same data.
- Another useful feature of a Series is selection using boolean conditions
- __irishCities < 230__ returns a Series of True/False values, which we then pass to our Series irishCities, returning only the entries where the value is True
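The copy behaviour can be checked with a short sketch. It reuses three of the city values from the cell above; the variable names are chosen for illustration:

```python
import pandas as pd

cities = pd.Series({'Dublin': 245, 'Cork': 150, 'Limerick': 125})

mask = cities < 230          # boolean Series
low = cities[mask]           # new Series holding only the matching entries

low['Cork'] = 0              # modify the filtered result only

print(cities['Cork'])        # 150, the original is unchanged
print(low['Cork'])           # 0
```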

# Dataframe

- A DataFrame is a data structure comprised of rows and columns of data.
- It is similar to a __spreadsheet__ or a database table.

- You can also think of a DataFrame as a __collection of Series objects__ that share an index

- The syntax for creating a data frame is as follows:

__DataFrame(data, columns=listOfColumns)__

- Using the columns parameter allows us to tell the constructor how we'd like the columns ordered.
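A minimal sketch of the columns parameter, using a made-up dictionary of lists as the data:

```python
import pandas as pd

data = {'one': [1, 2], 'two': [3, 4], 'three': [5, 6]}

default_order = pd.DataFrame(data)
custom_order = pd.DataFrame(data, columns=['three', 'one', 'two'])

print(list(default_order.columns))  # ['one', 'two', 'three']
print(list(custom_order.columns))   # ['three', 'one', 'two']
```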

### Creating a DataFrame


In [3]:
seriesA = pd.Series( np.random.rand(3), index=['a', 'b', 'c']  )
seriesB = pd.Series( np.random.rand(4), index=['a', 'b', 'c', 'd'] )
seriesC = pd.Series( np.random.rand(3), index=['b', 'c', 'd'] )

df = pd.DataFrame( { 'one' : seriesA,  'two' : seriesB, 'three' : seriesC } )
print (df)

        one       two     three
a  0.798410  0.885450       NaN
b  0.753788  0.897512  0.724706
c  0.866511  0.125578  0.363893
d       NaN  0.402872  0.400120


#### Remember we mentioned you can view a DataFrame as a group of Series objects. Here we create a DataFrame by passing it a number of Series objects. Where a Series has no entry for an index label, the DataFrame holds NaN.
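The way the Series are lined up by index label can be sketched with two small, made-up Series whose indexes only partly overlap:

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=['a', 'b'])
b = pd.Series([3.0, 4.0], index=['b', 'c'])

# The row index is the union of both indexes; gaps are filled with NaN
df = pd.DataFrame({'one': a, 'two': b})

print(df)
print(list(df.index))             # ['a', 'b', 'c']
print(df.isna().sum().to_dict())  # {'one': 1, 'two': 1}
```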

### In the example below we create a DataFrame from a 2D NumPy array. The array (arr) is passed as an argument when the DataFrame is created

In [4]:
import pandas as pd
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], float)

df = pd.DataFrame( arr )

print (df)

     0    1    2
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0


# Revert from DataFrame to NumPy Array

- It is very easy to convert from a DataFrame object to a NumPy array using .values. 
- We can also convert a Series object to a NumPy array in the same way!

In [6]:
import pandas as pd
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], float)

df = pd.DataFrame( arr )

dataArr = df.values

print (dataArr)
print (type(dataArr))

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
<class 'numpy.ndarray'>
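The same conversion for a Series can be sketched as follows. This example uses the `to_numpy()` method, which recent pandas documentation recommends over `.values`; the small DataFrame is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

frame_arr = df.to_numpy()        # 2D array, one row per DataFrame row
series_arr = df['x'].to_numpy()  # 1D array from a single column

print(frame_arr.shape)   # (2, 2)
print(series_arr.shape)  # (2,)
print(type(series_arr))  # <class 'numpy.ndarray'>
```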


## The most common way of creating a dataframe is by reading existing data directly into a dataframe:

There are a number of ways of doing this: 
read_csv,
read_excel,
read_hdf,
read_sql,
read_json,
read_sas …

We will look at how to read from a CSV file. 

### VARIABLE DESCRIPTIONS:
- survival: Survival (0 = No; 1 = Yes)
- pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- name: Name
- sex: Sex
- age: Age
- sibsp: Number of Siblings/Spouses Aboard
- parch: Number of Parents/Children Aboard
- ticket: Ticket Number
- fare: Passenger Fare
- cabin: Cabin
- embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

- To pull in the text file, we will use the pandas read_csv function.

- read_csv has a very large number of parameters, for example to specify the delimiter or whether the file includes a header row

- Typically it’s not very useful to print out an entire dataframe. 

- However, there are some useful functions you can use to get summary data. 

In [7]:
import pandas as pd

df = pd.read_csv("titanic.csv")

print (df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
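As a sketch of the read_csv parameters mentioned earlier, the example below reads a small, made-up piece of semicolon-separated text with no header row. It uses `io.StringIO` in place of a file so that it runs without titanic.csv:

```python
import io
import pandas as pd

# Made-up sample: semicolon-separated, no header line
raw = "Braund;22;male\nCumings;38;female\n"

df = pd.read_csv(
    io.StringIO(raw),               # a file path or a file-like object
    sep=';',                        # field delimiter (the default is ',')
    header=None,                    # the data has no header line
    names=['Name', 'Age', 'Sex'],   # column names to use instead
)

print(df)
print(df.shape)  # (2, 3)
```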


# Describing a DataFrame
DataFrames also have a useful __describe__ method, which is used for viewing basic statistics about the dataset's numeric columns.
It returns information on every column with a numeric datatype, so some of the output (for example the statistics for PassengerId) may not be of use.
The value returned is itself a DataFrame.

In [8]:
import pandas as pd

df = pd.read_csv("titanic.csv")

#print (df.info())

print (df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


### Data Stat
We can easily see the average age of the passengers is 29.7 years, with the youngest being 0.42 and the oldest being 80. The median age is 28, with the youngest quartile of passengers being about 20 or younger, and the oldest quartile being at least 38
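Because describe returns a DataFrame, individual statistics can be read out with loc instead of being picked off the printed table. A minimal sketch, using a small made-up frame in place of the Titanic data:

```python
import pandas as pd

# Small made-up frame standing in for the Titanic data
df = pd.DataFrame({'Age': [20.0, 30.0, 40.0], 'Name': ['A', 'B', 'C']})

stats = df.describe()            # only the numeric column 'Age' is summarised

print(type(stats))               # <class 'pandas.core.frame.DataFrame'>
print(list(stats.columns))       # ['Age']
print(stats.loc['mean', 'Age'])  # 30.0
print(stats.loc['50%', 'Age'])   # 30.0 (the median)
```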

# Accessing Column Data

To select a column, we index with the name of the column:
__dataframe['columnName']__

- Note this column is returned as a Series object

- Alternatively, a column of data may be accessed using the dot notation with the column name as an attribute (df.Age). Although it works with this particular example, it is not best practice and is prone to error and misuse. 

- Column names with spaces or special characters cannot be accessed in this manner.


In [9]:
import pandas as pd

df = pd.read_csv("titanic.csv")

print ( df['Age'] )
print ( "\n\n",df.Age)

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64


 0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
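The caveats about dot notation can be illustrated with a sketch. The column names here are hypothetical, chosen to show two problem cases: a name containing spaces, and a name that clashes with an existing DataFrame method:

```python
import pandas as pd

# Hypothetical column names chosen to show the problem cases
df = pd.DataFrame({
    'Age': [22, 38],
    'Port of Embarkation': ['S', 'C'],
    'count': [1, 2],
})

# Bracket access works for any column name
print(df['Port of Embarkation'].tolist())  # ['S', 'C']
print(df['count'].tolist())                # [1, 2]

# df.count is the DataFrame method, not the column named 'count'
print(callable(df.count))                  # True
```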


# Accessing Columns

We mentioned in a previous slide that you can also think of a DataFrame as __a group of Series objects__ that share an index. When you access an individual column from a dataframe the data type returned is a series. 
Note that if you extract multiple columns, the data type returned is __still__ a DataFrame.

In [10]:
df = pd.read_csv("titanic.csv")

ages = df['Age']
print (type(ages))

moreInfo = df[['Age', 'Name']]
print (type(moreInfo))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
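The return type depends on what is passed inside the brackets, not on how many columns come back. A small sketch with made-up data shows that a list containing a single column name still produces a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0], 'Name': ['A', 'B']})

as_series = df['Age']      # a column label gives a Series
as_frame = df[['Age']]     # a list of labels gives a DataFrame, even with one label

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```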


# Using Head and Tail

- To view a small sample of a Series or DataFrame object, use the head (start) and tail (end) methods. The default number of elements to display is five, but you can pass a number as an argument.

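- For example, to capture the last 7 age values in the dataset we can call tail(7). The cell below also uses loc with a boolean condition to select the passengers younger than 22.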

In [13]:
import pandas as pd
import numpy as np

df = pd.read_csv("titanic.csv")
freqAges = df['Age']
#print (freqAges.head())

#print (freqAges.tail(7))


age = df.loc[df["Age"] < 22] # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
print (age)
type(age) 

     PassengerId  Survived  Pclass                                  Name  \
7              8         0       3        Palsson, Master. Gosta Leonard   
9             10         1       2   Nasser, Mrs. Nicholas (Adele Achem)   
10            11         1       3       Sandstrom, Miss. Marguerite Rut   
12            13         0       3        Saundercock, Mr. William Henry   
14            15         0       3  Vestrom, Miss. Hulda Amanda Adolfina   
..           ...       ...     ...                                   ...   
869          870         1       3       Johnson, Master. Harold Theodor   
875          876         1       3      Najib, Miss. Adele Kiamie "Jane"   
876          877         0       3         Gustafsson, Mr. Alfred Ossian   
877          878         0       3                  Petroff, Mr. Nedelio   
887          888         1       1          Graham, Miss. Margaret Edith   

        Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked  
7      male   2.0 

pandas.core.frame.DataFrame
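The head and tail calls that are commented out in the cell above can be sketched on a short, made-up Series of ages:

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 19.0, 26.0, 32.0])

print(ages.head().tolist())    # [22.0, 38.0, 26.0, 35.0, 35.0], five by default
print(ages.head(2).tolist())   # [22.0, 38.0]
print(ages.tail(3).tolist())   # [19.0, 26.0, 32.0]
```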

# Counting Values in Columns

In [12]:
df = pd.read_csv("titanic.csv")
print (df['Sex'].value_counts())

male      577
female    314
Name: Sex, dtype: int64


- A very useful method, value_counts(), can be used to count the number of occurrences of each entry in a column (it returns a Series object)

- It presents the results in descending order
- For example, how many male and female passengers are in the dataset
- Passing normalize=True returns proportions instead of counts

In [13]:
import pandas as pd

df = pd.read_csv("titanic.csv")
print (df['Sex'].value_counts(normalize=True))

male      0.647587
female    0.352413
Name: Sex, dtype: float64


- Read data in from the titanic dataset and determine the four most common ages represented. 

In [14]:
import pandas as pd

df = pd.read_csv("titanic.csv")
freqAges = df['Age']
print (freqAges.value_counts().head(4))

24.0    30
22.0    27
18.0    26
19.0    25
Name: Age, dtype: int64
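The Age column contains missing values, and value_counts leaves them out by default. A minimal sketch on made-up data, using the `dropna` parameter of value_counts to include them:

```python
import numpy as np
import pandas as pd

ages = pd.Series([24.0, 22.0, 24.0, np.nan, np.nan, np.nan])

counts = ages.value_counts()                    # NaN is ignored by default
with_missing = ages.value_counts(dropna=False)  # NaN gets its own row

print(counts.to_dict())       # {24.0: 2, 22.0: 1}
print(with_missing.iloc[0])   # 3, the missing values are the most frequent entry
```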


### NumPy 

- [row, column] access
- Slice operations [start:stop:step]
- Performing operations along a specific axis ( np.sum(arr1, axis = 0) )
- Comparison Operators
- Advanced Index (Boolean index with comparison operation, integer list)
- Logical Operators
- ...
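As a quick recap, the NumPy operations listed above can be sketched on one small array (the array itself is just an illustration):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr[1, 2])                   # 6, [row, column] access
print(arr[0, ::2])                 # [1 3], slice with a step of 2
print(np.sum(arr, axis=0))         # [12 15 18], column sums
print(arr[arr > 6])                # [7 8 9], boolean index from a comparison
print(arr[[0, 2]])                 # rows 0 and 2, integer-list index
print(arr[(arr > 2) & (arr < 6)])  # [3 4 5], logical AND of two conditions
```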

### Pandas

- Series and DataFrame 
- Accessing Columns
- Reading csv ...
- Producing stats 
- ...

# Titanic Data Set

### SPECIAL NOTES:

- Pclass is a proxy for socio-economic status (SES)
- 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

- Age is in Years; Fractional if Age less than One (1)
- If the Age is Estimated, it is in the form xx.5

- With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored.  The following are the definitions used for __sibsp__ and __parch__.

- Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
- Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiancees Ignored)
- Parent:   Mother or Father of Passenger Aboard Titanic
- Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic