# Pandas and NumPy

In [3]:
import numpy as np
import pandas as pd

## NumPy arrays

These are similar to Python lists but support element-wise operations and multiple dimensions. They're called ndarrays because they can have any number (n) of dimensions (d). An ndarray holds a collection of values of a single data type and can be a vector (one-dimensional), a matrix (two-dimensional) or have more dimensions.

In [4]:
# Creating a one-dimensional ndarray

list1 = [1,2,3,4] # Creates Python list
array1 = np.array(list1) # Converts to ndarray
print(array1)

[1 2 3 4]


In [5]:
# Creating a two-dimensional ndarray

list2 = [[1,2,3],[4,5,6]]
array2 = np.array(list2)
print(array2)

[[1 2 3]
 [4 5 6]]
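
The number of dimensions and the single data type described above can be inspected directly on an array. This is a minimal sketch reusing the values from the cells above; the `mixed` array is an illustrative addition, not part of the original notebook.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])            # one-dimensional
array2 = np.array([[1, 2, 3], [4, 5, 6]])  # two-dimensional

# ndim is the number of dimensions, shape is the size along each one
print(array1.ndim, array1.shape)
print(array2.ndim, array2.shape)

# An ndarray holds a single data type, so mixing ints and a float
# converts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```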


### Mathematical operations

Operations can be performed on all values in an ndarray in one go, rather than having to loop through the values one by one as with a standard Python list.

In [7]:
array3 = np.array([5,8,3,6])
print(array3 - 2)

[3 6 1 4]
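
To make the contrast with a list concrete, the sketch below does the same subtraction both ways and then combines two arrays element by element. The variable names are illustrative only.

```python
import numpy as np

values = [5, 8, 3, 6]

# With a plain list, each element has to be handled individually
list_result = [v - 2 for v in values]

# With an ndarray, the operation applies to every element at once
array_result = np.array(values) - 2

print(list_result)
print(array_result)

# Two arrays of the same shape are combined element by element
product = np.array(values) * np.array([1, 2, 3, 4])
print(product)
```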


## Pandas series

The Series is a core object of the pandas library. It's similar to a one-dimensional ndarray, except that its values can be indexed using labels. This makes it possible to access data by name rather than only by numeric position.

With a Series object, the indices are set to 0,1,2,3... by default, but they can be customised so that, for example, ages can be accessed by name. A Series holds data of any one type and can be created by passing a scalar value, Python list, dictionary, or ndarray to the pandas Series constructor. If a dictionary is passed, its keys are used as the indices.

In [8]:
# Create a Series using a NumPy array, using default numerical indices
ages = np.array([34,45,22,23])
series1 = pd.Series(ages)
print(series1)

0    34
1    45
2    22
3    23
dtype: int64


When printing a series, the data type of its elements is also printed. To customize the indices of a Series object, use the index argument of the Series constructor.

In [9]:
# Create a Series using a NumPy array of ages but customize the indices to be the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)

Emma      13
Swetha    25
Serajh    19
dtype: int64
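
As a rough illustration of the dictionary case mentioned earlier, the sketch below builds the same Series from a dictionary and reads values back by label and by position. It reuses the names and ages from the cell above.

```python
import pandas as pd

# Dictionary keys become the index labels
ages = pd.Series({'Emma': 13, 'Swetha': 25, 'Serajh': 19})

print(ages['Swetha'])       # look up a value by its label
print(ages.iloc[0])         # look up a value by its position
print(ages.index.tolist())  # the labels, in insertion order
```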


## Pandas DataFrames

A dataframe is similar to a two-dimensional ndarray but, like a series, its rows and columns can be indexed with numbers or string labels. A single dataframe can contain several types of data, but each column generally holds a single data type because each column is a series. All columns have the same number of rows.

In [25]:
# Create a basic dataframe, setting the column names

df1 = pd.DataFrame([
    ['Sarah Cabbages','123 Main St',34],
    ['Gary Peas', '456 Maple Ave',28],
    ['Hannah Strawberry', '789 Broadway',51],
    ['Bob Turnip', '66a Halfway Street',32],
    ['Elsie Cauliflower', '12 Bobton Vale',75],
    ['Harry Pineapple', '43 Street Road',24]
    ],
    columns=['name','address','age'])

print(df1)

                name             address  age
0     Sarah Cabbages         123 Main St   34
1          Gary Peas       456 Maple Ave   28
2  Hannah Strawberry        789 Broadway   51
3         Bob Turnip  66a Halfway Street   32
4  Elsie Cauliflower      12 Bobton Vale   75
5    Harry Pineapple      43 Street Road   24


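This still uses the default 0,1,2 row indices, but this can be changed with set_index(). Note that set_index() returns a new dataframe rather than modifying the original, so the result must be assigned to a variable to be kept.

In [26]:
df1.set_index('name') # Returns a new dataframe indexed by name; df1 itself is unchanged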
print(df1)

                name             address  age
0     Sarah Cabbages         123 Main St   34
1          Gary Peas       456 Maple Ave   28
2  Hannah Strawberry        789 Broadway   51
3         Bob Turnip  66a Halfway Street   32
4  Elsie Cauliflower      12 Bobton Vale   75
5    Harry Pineapple      43 Street Road   24
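
The output above still shows the numeric index because the result of `set_index` was not stored. A minimal sketch of keeping the result, using a shortened copy of the data and the illustrative names `people` and `by_name`:

```python
import pandas as pd

people = pd.DataFrame(
    [['Sarah Cabbages', '123 Main St', 34],
     ['Gary Peas', '456 Maple Ave', 28]],
    columns=['name', 'address', 'age'])

# set_index returns a new dataframe, so assign the result to keep it
by_name = people.set_index('name')

print(by_name.index.tolist())
print(by_name.loc['Gary Peas', 'age'])  # look up a value by row label and column
print(people.index.tolist())            # the original dataframe is unchanged
```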


## Loading & saving CSV content

In [None]:
# Create a dataframe from CSV content

dataframe = pd.read_csv('my-csv-file.csv')

# Save a CSV from dataframe content

dataframe.to_csv('new-csv-file.csv')
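
By default `to_csv` also writes the row index as an unnamed first column, which `read_csv` then loads back as a column called `Unnamed: 0`. The sketch below shows one way to avoid that with `index=False`. The file name and temporary folder are hypothetical and used only so the example can run on its own.

```python
import os
import tempfile

import pandas as pd

people = pd.DataFrame({'name': ['Sarah Cabbages', 'Gary Peas'], 'age': [34, 28]})

# Hypothetical file location inside a temporary directory
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'people.csv')

    # index=False leaves the 0,1,2... row labels out of the file
    people.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded)
```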

## Viewing a dataframe

### Printing content

Use .head() to print the first few rows of a dataframe.

In [33]:
# Print the entire dataframe
df1

                name             address  age
0     Sarah Cabbages         123 Main St   34
1          Gary Peas       456 Maple Ave   28
2  Hannah Strawberry        789 Broadway   51
3         Bob Turnip  66a Halfway Street   32
4  Elsie Cauliflower      12 Bobton Vale   75
5    Harry Pineapple      43 Street Road   24


In [34]:
# Print the first 5 rows
df1.head()

# Print the first 10 rows
df1.head(10)

                name             address  age
0     Sarah Cabbages         123 Main St   34
1          Gary Peas       456 Maple Ave   28
2  Hannah Strawberry        789 Broadway   51
3         Bob Turnip  66a Halfway Street   32
4  Elsie Cauliflower      12 Bobton Vale   75
5    Harry Pineapple      43 Street Road   24


### Getting stats

Use .info() to view information about the dataframe, including the number of rows and columns, the data type of each column and how many non-null (non-empty) values each column contains.

In [36]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     6 non-null      object
 1   address  6 non-null      object
 2   age      6 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 272.0+ bytes
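
The individual pieces of information that `.info()` reports can also be read separately. This is a small sketch on made-up data with one missing age, so that the empty-cell count is visible. The dataframe `people` is illustrative only.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Sarah Cabbages', 'Gary Peas', 'Hannah Strawberry'],
    'age': [34, None, 51]})  # None marks a missing value

print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # data type of each column

# Count the empty cells in each column
missing = people.isna().sum()
print(missing)
```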


### Finding type

Use type() to find out whether an object is a dataframe or a series.

In [37]:
type(df1)

pandas.core.frame.DataFrame

## Selecting content

### Selecting columns

There are two ways to select all the values in a column. The first is to select it like a dictionary key; the second is to select it like an attribute, using a dot. The second can only be used if the column name is a valid Python name (it doesn't start with a number or contain spaces or special characters) and doesn't clash with an existing dataframe method or attribute.

In [22]:
# First method
age = df1['age']

# Second method 
age = df1.age

# Selecting multiple columns 

new_df = df1[['name', 'age']]
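
The sketch below shows what each kind of selection returns, and a column name that only works with square brackets. The `home town` column and its value are hypothetical, added purely to illustrate a name containing a space.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Gary Peas'], 'home town': ['Bobton'], 'age': [28]})

single = df['age']             # one column name gives a Series
several = df[['name', 'age']]  # a list of column names gives a DataFrame

print(type(single))
print(type(several))

# A column name containing a space cannot be written as df.home town,
# so it has to be selected with square brackets
print(df['home town'])
```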

### Selecting rows

Rows can be selected by position using .iloc. Positions are zero-based, so the first row is at position 0.

In [38]:
# Selecting the third row 

row = df1.iloc[2]
row

name       Hannah Strawberry
address         789 Broadway
age                       51
Name: 2, dtype: object

In [None]:
# Selecting multiple rows

df1.iloc[2:4] 
# Selects all rows starting at 2 and up to but not including 4.

df1.iloc[:4] 
# Selects all rows up to but not including 4.

df1.iloc[-3:] 
# Selects the last 3 rows: from the 3rd-to-last row up to and including the final row.
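
For comparison, rows can also be selected by label with `.loc`. The sketch below uses a small made-up dataframe with names as the index to show that `.iloc` slices exclude the end position while `.loc` slices include the end label.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [34, 28, 51, 32]},
    index=['Sarah', 'Gary', 'Hannah', 'Bob'])

by_position = df.iloc[1:3]       # positions 1 and 2; position 3 is excluded
by_label = df.loc['Gary':'Bob']  # labels Gary to Bob; the end label is included

print(by_position)
print(by_label)
print(df.iloc[-1])               # the last row, returned as a Series
```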

### Selecting with logic

Standard comparison operators can be used to select content, such as == != <= >= etc.
Multiple conditions can be combined with and (&), or (|) etc., with each condition wrapped in parentheses.

In [31]:
# Select all rows where the age is less than or equal to 30
df1[df1.age <= 30] 



              name         address  age
1        Gary Peas   456 Maple Ave   28
5  Harry Pineapple  43 Street Road   24


In [39]:
# Select all rows where the age is less than or equal to 30 and the name is Gary Peas
# Note the extra parentheses around each condition inside the square brackets.

df1[(df1.age <= 30) & (df1.name == 'Gary Peas')]

        name        address  age
1  Gary Peas  456 Maple Ave   28
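
The same pattern works for "or" and "not". A comparison on a column produces a Series of True/False values, one per row, and that Series decides which rows are kept. A minimal sketch on a shortened copy of the data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Sarah Cabbages', 'Gary Peas', 'Elsie Cauliflower'],
    'age': [34, 28, 75]})

# A comparison produces a boolean Series, one True/False per row
under_30 = df['age'] < 30
print(under_30.tolist())

# Keep rows where either condition holds (or)
young_or_old = df[(df['age'] < 30) | (df['age'] > 70)]
print(young_or_old)

# Invert a condition with ~ (not)
not_under_30 = df[~under_30]
print(not_under_30)
```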
