# Pandas

In [11]:
# import pandas
import pandas as pd

## Series

**A Series** is a ***one-dimensional object*** containing a sequence of values, very similar to a column of data in an Excel spreadsheet. Every Series has an associated index, whose entries can be either numbers or strings (i.e., named labels). If no index is supplied, pandas assigns the integer positions 0, 1, 2, and so on.

In [12]:
marks_series = pd.Series([45, 67, 98, 23, 17])

marks_series

0    45
1    67
2    98
3    23
4    17
dtype: int64

In [13]:
marks_series[0:2]

0    45
1    67
dtype: int64

### Using named labels
Each data point (row) can be given a label by passing a list of the desired labels to the `index` argument of the Series constructor.

In [14]:
marks_update = pd.Series([76, 78, 67, 93, 81], index=['John', 'Alex', 'Elon', 'Jacinta', 'Lily'])

marks_update

John       76
Alex       78
Elon       67
Jacinta    93
Lily       81
dtype: int64

In [15]:
# Accessing labelled series
marks_update['John':'Jacinta'] # inclusive

John       76
Alex       78
Elon       67
Jacinta    93
dtype: int64

In [16]:
marks_update[0:4] # exclusive

John       76
Alex       78
Elon       67
Jacinta    93
dtype: int64
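The two slices above return the same rows because one is read as labels (end included) and the other as positions (end excluded). A minimal sketch of the same selections written with the explicit `.loc` (label-based) and `.iloc` (position-based) indexers, which makes the intent unambiguous; the variable names here are illustrative only:

```python
import pandas as pd

marks = pd.Series([76, 78, 67, 93, 81],
                  index=['John', 'Alex', 'Elon', 'Jacinta', 'Lily'])

# Label-based slice: the end label 'Jacinta' is included
by_label = marks.loc['John':'Jacinta']

# Position-based slice: the end position 4 is excluded
by_position = marks.iloc[0:4]

# Both selections contain the same four entries
print(by_label.equals(by_position))

# A single element can be fetched either way
print(marks.loc['Elon'], marks.iloc[2])
```

Using `.loc` and `.iloc` is a design choice that avoids guessing whether a plain `[]` slice will be treated as labels or positions.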

### Elementwise operations


In [18]:
# Multiplication
marks_update * 1.05

John       79.80
Alex       81.90
Elon       70.35
Jacinta    97.65
Lily       85.05
dtype: float64

In [19]:
# Addition
marks_update + 3

John       79
Alex       81
Elon       70
Jacinta    96
Lily       84
dtype: int64

In [24]:
# series addition
cat_marks = pd.Series([12, 14, 22, 17, 20], index=['John', 'Lily', 'Jacinta', 'Alex', 'Elon'])

new_marks = marks_update + cat_marks

# Note: the order of the index does not matter, because pandas aligns values by label.
# A label must exist in both Series; otherwise the result for that label is NaN.

new_marks

Alex        95
Elon        87
Jacinta    115
John        88
Lily        95
dtype: int64
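The addition above works cleanly because both Series carry exactly the same labels. As a hedged illustration of what happens otherwise, the sketch below uses made-up marks (not the data from this notebook) in which each Series has one label the other lacks, and shows `Series.add` with `fill_value` as one way to treat a missing mark as zero:

```python
import pandas as pd

# Hypothetical data: 'Elon' is missing from cat, 'Lily' is missing from exam
exam = pd.Series([76, 78, 67], index=['John', 'Alex', 'Elon'])
cat = pd.Series([12, 17, 14], index=['John', 'Alex', 'Lily'])

# Plain addition aligns on labels; unmatched labels become NaN
plain_sum = exam + cat

# Series.add with fill_value substitutes 0 for a value missing on one side
filled_sum = exam.add(cat, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether a missing mark should count as zero or stay missing depends on the analysis, so `fill_value=0` is an assumption to check rather than a default to rely on.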

In [25]:
# Boolean selection
new_marks < 100

Alex        True
Elon        True
Jacinta    False
John        True
Lily        True
dtype: bool

In [26]:
# Keep only the entries for which the condition is True
new_marks[new_marks > 90]

Alex        95
Jacinta    115
Lily        95
dtype: int64
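Boolean masks like the one above can be combined. A short sketch, re-creating the summed marks from the previous cells so that it runs on its own: conditions are joined with `&` (and), `|` (or), and negated with `~`, and each condition needs its own parentheses.

```python
import pandas as pd

new_marks = pd.Series([95, 87, 115, 88, 95],
                      index=['Alex', 'Elon', 'Jacinta', 'John', 'Lily'])

# Both conditions must hold: above 90 AND below 100
in_range = new_marks[(new_marks > 90) & (new_marks < 100)]

# Negate a mask with ~ to keep entries that are NOT above 90
not_high = new_marks[~(new_marks > 90)]

print(in_range)
print(not_high)
```

Python's `and`, `or`, and `not` keywords do not work elementwise on a Series, which is why the operator forms are used here.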

## DataFrame

The **DataFrame** is the other core data structure in pandas. A DataFrame is a ***two-dimensional*** table of values. Much like an Excel spreadsheet, it is a rectangular table of columns and rows with both a row index and a column index. It can be thought of as a collection of Series objects that share the same row index, each of which can have a different data type (e.g., strings, numbers).

Typically, pandas will import data (e.g., from a csv file) as a DataFrame. However, a DataFrame may be constructed, for example, by using a dictionary (key/value pairs) with the key being the column name and the value being a list of row values.
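As a rough sketch of the CSV route, the snippet below reads a small in-memory CSV with `pd.read_csv`. The CSV text and the name `df_csv` are illustrative assumptions; with a real file you would pass its path instead of the `io.StringIO` wrapper.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,gender,age
Aloyce,Male,30
Njeri,Female,24
Wambua,Female,35
"""

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)    # (number of rows, number of columns)
print(df_csv.dtypes)   # one data type per column
```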

In [30]:
data = {
    'name': ['Aloyce', 'Njeri', 'Wambua', 'Rashid', 'Lulu', 'Kipyegon', 'Wanjala', 'Betty', 'Clare', 'Didimus'],
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'age': [30, 24, 35, 42, 29, 19, 27, 49, 28, 39]
}

df = pd.DataFrame(data=data)

df

       name  gender  age
0    Aloyce    Male   30
1     Njeri  Female   24
2    Wambua  Female   35
3    Rashid    Male   42
4      Lulu  Female   29
5  Kipyegon    Male   19
6   Wanjala    Male   27
7     Betty  Female   49
8     Clare  Female   28
9   Didimus    Male   39


In [31]:
# head() returns the first rows of the DataFrame
df.head() # default is 5 rows; pass an integer to change the number of rows

     name  gender  age
0  Aloyce    Male   30
1   Njeri  Female   24
2  Wambua  Female   35
3  Rashid    Male   42
4    Lulu  Female   29


In [32]:
df.head(10)

       name  gender  age
0    Aloyce    Male   30
1     Njeri  Female   24
2    Wambua  Female   35
3    Rashid    Male   42
4      Lulu  Female   29
5  Kipyegon    Male   19
6   Wanjala    Male   27
7     Betty  Female   49
8     Clare  Female   28
9   Didimus    Male   39


In [33]:
# using tail() to display the last 5 rows
df.tail()

       name  gender  age
5  Kipyegon    Male   19
6   Wanjala    Male   27
7     Betty  Female   49
8     Clare  Female   28
9   Didimus    Male   39


### Column selection
You can retrieve a column from a DataFrame by putting the column's name between square brackets. The result is a Series object:

In [34]:
df['name']

0      Aloyce
1       Njeri
2      Wambua
3      Rashid
4        Lulu
5    Kipyegon
6     Wanjala
7       Betty
8       Clare
9     Didimus
Name: name, dtype: object

You can also use dot notation to retrieve a column (Series) from a DataFrame, but only if the column name is a valid Python variable name (e.g., no spaces) and does not clash with an existing DataFrame attribute or method:

In [36]:
df.name

0      Aloyce
1       Njeri
2      Wambua
3      Rashid
4        Lulu
5    Kipyegon
6     Wanjala
7       Betty
8       Clare
9     Didimus
Name: name, dtype: object
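To illustrate where dot notation falls short, here is a small sketch using a made-up DataFrame (`df_demo` and its column names are hypothetical, chosen only to trigger the two problem cases). Square brackets work for any column name, while dot notation cannot express a name with a space and returns the method when a column shares its name with one.

```python
import pandas as pd

# Hypothetical columns: one with a space, one named like a DataFrame method
df_demo = pd.DataFrame({'first name': ['Aloyce', 'Njeri'], 'count': [1, 2]})

# Brackets work for any column name
col = df_demo['first name']

# df_demo.first name  -> would be a SyntaxError, so it is left commented out
# df_demo.count       -> refers to the DataFrame.count method, not the column
is_method = callable(df_demo.count)

print(col)
print(is_method)
print(df_demo['count'])
```

For this reason bracket notation is the safer habit in scripts, with dot notation kept for quick interactive exploration.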

<div class="alert alert-block alert-success">
<b>Note:</b> Passing a slice inside the square brackets selects rows (e.g., df[:3]).
    Passing a column name, or a list of column names, selects columns (e.g., df["age"] or df[["age", "name"]]).
</div>

In [38]:
print(df[:3])

print(df['gender'])

print(df[['age', 'name']])

     name  gender  age
0  Aloyce    Male   30
1   Njeri  Female   24
2  Wambua  Female   35
0      Male
1    Female
2    Female
3      Male
4    Female
5      Male
6      Male
7    Female
8    Female
9      Male
Name: gender, dtype: object
   age      name
0   30    Aloyce
1   24     Njeri
2   35    Wambua
3   42    Rashid
4   29      Lulu
5   19  Kipyegon
6   27   Wanjala
7   49     Betty
8   28     Clare
9   39   Didimus
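Rows and columns can also be selected in a single step. The sketch below rebuilds a shortened version of the DataFrame so that it is self-contained, then uses `.iloc` for positional row selection and `.loc` with a boolean condition plus a list of column names. The age threshold is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aloyce', 'Njeri', 'Wambua', 'Rashid'],
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [30, 24, 35, 42],
})

# Rows by position: the first two rows, all columns
first_two = df.iloc[:2]

# Rows by condition and columns by label in one call
subset = df.loc[df['age'] > 29, ['name', 'age']]

print(first_two)
print(subset)
```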
