# Tutorial 4: NumPy & Pandas


## Objectives

After this tutorial you will be able to:

*   understand the basics of the NumPy and Pandas libraries
*   use NumPy arrays for efficient data manipulation
*   use Pandas DataFrames for structured data manipulation
*   load data from different sources into NumPy arrays and Pandas DataFrames
*   perform data manipulation tasks such as filtering, sorting, and aggregating data


<h2>Table of Contents</h2>

<ol>
    <li>
        <details>
            <summary><a href="#numpy">NumPy</a></summary>
            <ul>
                <li><a href="#numpy-create">Create and initialize NumPy arrays</a></li>
                <li><a href="#numpy-perform">Perform basic arithmetic operations on NumPy arrays</a></li>
                <li><a href="#numpy-index">Index and slice NumPy arrays</a></li>
                <li><a href="#numpy-shape">Change the shape of NumPy arrays</a></li>
            </ul>
        </details>
    </li>
    <br>
    <li>
        <details>
            <summary><a href="#pandas">Pandas</a></summary>
            <ul>
                <li><a href="#pandas-create">Create and initialize Pandas DataFrames</a></li>
                <li><a href="#pandas-index">Index and slice Pandas DataFrames</a></li>
                <li><a href="#pandas-select">Select columns from Pandas DataFrames</a></li>
                <li><a href="#pandas-filter">Filter and sort Pandas DataFrames</a></li>
                <li><a href="#pandas-aggregate">Aggregate data in Pandas DataFrames</a></li>
            </ul>
        </details>
    </li>
    <br>
    <li>
        <details>
            <summary><a href="#data">Loading and Manipulating Data</a></summary>
            <ul>
                <li><a href="#data-csv">Load data from CSV files into Pandas DataFrames</a></li>
                <li><a href="#data-json">Load data from JSON files into Pandas DataFrames</a></li>
                <li><a href="#data-sql">Load data from SQL database into Pandas DataFrames</a></li>
            </ul>
        </details>
    </li>
    <br>    
</ol>


<hr id="numpy">

<h2>1. NumPy</h2>

NumPy is a library for working with arrays.  
It also provides useful functions for linear algebra and matrices.
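
As a small illustration of the linear algebra support mentioned above, the sketch below solves a 2x2 system of equations with `np.linalg.solve`. The coefficients are made up for illustration only.

```python
import numpy as np

# Hypothetical system of equations:
#   2x +  y = 5
#    x + 3y = 10
coefficients = np.array([[2.0, 1.0], [1.0, 3.0]])
constants = np.array([5.0, 10.0])

# Solve for [x, y]; the exact answer is x = 1, y = 3
solution = np.linalg.solve(coefficients, constants)
print(solution)

# Verify by multiplying back: coefficients @ solution should match constants
print(np.allclose(coefficients @ solution, constants))
```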

<h5 id="numpy-create">Create and initialize NumPy arrays</h5>

Import the NumPy library

In [None]:
# np is the most popular alias for numpy
import numpy as np

Create a 1D NumPy array

In [None]:
array_1d = np.array([1, 2, 3, 4, 5])
array_1d

Create a 2D NumPy array

In [None]:
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array_2d

To get the number of dimensions of an array, use the attribute `ndim`.

In [None]:
# Show the numpy array dimensions
print(array_1d.ndim)
print(array_2d.ndim)

To get the size of each dimension, use the attribute `shape`.


In [None]:
print(array_1d.shape)
print(array_2d.shape)

To get the total number of elements in the array, use the attribute `size`.


In [None]:
# Show the numpy array size
print(array_1d.size)
print(array_2d.size)
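
The three attributes are related: `size` is the product of the entries in `shape`, and `ndim` is the length of `shape`. A minimal sketch on a hypothetical 2x3 array:

```python
import numpy as np

# Hypothetical example array with 2 rows and 3 columns
grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.ndim)   # number of dimensions
print(grid.shape)  # size of each dimension
print(grid.size)   # total number of elements

# ndim is the length of shape, and size is the product of its entries
print(grid.ndim == len(grid.shape))
print(grid.size == grid.shape[0] * grid.shape[1])
```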

<h5 id="numpy-perform">Perform basic arithmetic operations on NumPy arrays</h5>

In [None]:
# create 2 new arrays
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

# Addition
print('Addition: ', A + B)

# Subtraction
print('Subtraction: ', A - B)

# Multiplication
print('Multiplication: ', A * B)

# Division
print('Division: ', A / B)

# Dot product
print('Dot product: ', A.dot(B))
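
Note that `*` multiplies arrays element by element; it is not matrix multiplication. The sketch below contrasts the two on a pair of made-up 2x2 arrays, using the `@` operator, which for 2D arrays gives the same result as `dot`.

```python
import numpy as np

# Hypothetical 2x2 arrays used only for illustration
M = np.array([[1, 2], [3, 4]])
N = np.array([[5, 6], [7, 8]])

# Element-wise product: multiplies values at matching positions
elementwise = M * N
print(elementwise)

# Matrix product: rows of M combined with columns of N
matrix_product = M @ N
print(matrix_product)

# For 2D arrays, @ and dot agree
print(np.array_equal(matrix_product, M.dot(N)))
```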

<h5 id="numpy-index">Index and slice NumPy arrays</h5>

In [None]:
# accessing elements in a 2d array
print(array_2d[0, 0])
print(array_2d[1][2])
print(array_2d[2, -1])

# slicing a 2d array
print(array_2d[0, :])
print(array_2d[:, 0])
print(array_2d[0:2, 1:3])
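
Slices also accept negative indices and a step, as with Python lists. A short sketch on a hypothetical 3x3 array (the name `grid` is illustrative):

```python
import numpy as np

# Hypothetical 3x3 array holding the numbers 1..9
grid = np.arange(1, 10).reshape(3, 3)

# Negative indices count from the end: -1 is the last row
last_row = grid[-1, :]
print(last_row)

# A step of 2 keeps every other column (columns 0 and 2)
every_other_column = grid[:, ::2]
print(every_other_column)

# grid[1][2] and grid[1, 2] refer to the same element;
# the comma form is the usual NumPy style
print(grid[1][2] == grid[1, 2])
```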


<h5 id="numpy-shape">Change the shape of NumPy arrays</h5>

In [None]:
array = np.array([1, 2, 3, 4, 5, 6])

# Reshape the array into a 2D array
array_2d = array.reshape((2, 3))
print(array_2d)

In [None]:
# Transpose the array
print(array_2d.T)
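
When reshaping, one dimension can be given as `-1` so that NumPy infers it from the number of elements. The sketch below uses illustrative variable names and is not part of the original cells.

```python
import numpy as np

values = np.arange(1, 7)  # six elements: 1, 2, 3, 4, 5, 6

# -1 lets NumPy work out the column count: 6 elements / 3 rows = 2 columns
three_rows = values.reshape((3, -1))
print(three_rows.shape)

# Transposing swaps the two dimensions
print(three_rows.T.shape)

# reshape(-1) flattens the array back to one dimension
flat_again = three_rows.reshape(-1)
print(flat_again)
```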

<hr id="pandas">

<h2>2. Pandas</h2>

Pandas is a Python library for data manipulation and analysis. It is a versatile tool that works with a wide variety of data, including tabular data, time series data, and categorical data.

Pandas provides two data structures for working with data:
1. Series: a one-dimensional array of indexed data.  
2. DataFrame: a two-dimensional data structure, with data arranged in rows and columns in a tabular manner.
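
To make the two structures concrete, here is a minimal sketch with made-up values. It builds a Series with a custom index, then shows that selecting a single column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series: one-dimensional values with an index (labels are hypothetical)
ages = pd.Series([30, 25, 27], index=['John', 'Jane', 'Mary'], name='age')
print(ages['Jane'])

# A DataFrame: rows and columns that share the same row index
people = pd.DataFrame(
    {'age': [30, 25, 27], 'occupation': ['Engineer', 'Doctor', 'Teacher']},
    index=['John', 'Jane', 'Mary'],
)
print(people.shape)

# Each column of a DataFrame is itself a Series
print(type(people['age']))
```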

<h5 id="pandas-create">Create and initialize Pandas DataFrames</h5>

Import the Pandas library

In [None]:
# pd is the most popular alias for pandas
import pandas as pd

We can create a pandas DataFrame from a list of dictionaries where each dictionary has:
- the column headers as *dict keys*
- the row values as *dict values*.

In [None]:
# Create a DataFrame from a list of dictionaries
data = [
    {'name': 'John Doe', 'age': 30, 'occupation': 'Engineer'}, 
    {'name': 'Jane Doe', 'age': 25, 'occupation': 'Data scientist'},
    {'name': 'Mary Smith', 'age': 27, 'occupation': 'Software developer'},
    {'name': 'Mark Smith', 'age': 28, 'occupation': 'Data analyst'}
]
df = pd.DataFrame(data)
df

We can create a pandas DataFrame from a dictionary of lists where:
- the keys are the column headers
- the lists are the column values

In [None]:
data = {
    'name': ['John Doe', 'Jane Doe', 'Mike Smith'], 
    'age': [30, 25, 40], 
    'occupation': ['Engineer', 'Doctor', 'Teacher']
}
df2 = pd.DataFrame(data)
df2

We can create a pandas DataFrame from a 2D NumPy array and a list of column names.

In [None]:
# Create a DataFrame from a NumPy array
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df3 = pd.DataFrame(array, columns=['A', 'B', 'C'])
df3

<h5 id="pandas-index">Index and slice Pandas DataFrames</h5>

We have 2 ways to index and slice Pandas DataFrames:
1. `iloc[]`: a position-based (integer) indexer, similar to indexing lists and NumPy arrays. It is used with square brackets, not called like a function.

In [None]:
# print the DataFrame for reference
df

In [None]:
# access the 1st row and 1st column
print(df.iloc[0, 0])

In [None]:
# access the 2nd row and 3rd column
print(df.iloc[1, 2])

In [None]:
# slice the 1st 2 rows and 1st 2 columns
print(df.iloc[0:2, 0:2])

2. `loc[]`: a label-based indexer, also used with square brackets.
When slicing with `loc[]`, the upper limit is included.

In [None]:
# print the DataFrame for reference
df

In [None]:
# access the 1st row age
print(df.loc[0, 'age'])

We can also set the DataFrame index to a certain column instead of the default numerical index (0, 1, 2, ...).

In [None]:
# set index to name
df.set_index('name', inplace=True)
print(df)

In that case, we can use the `loc[]` indexer with the row/index name and the column/header name.

In [None]:
# access the 1st row age
print('\n1st row age: ', df.loc['John Doe', 'age'])

In [None]:
# slice the 1st 2 rows and 1st 2 columns
print('\n1st 2 rows and 1st 2 columns: ')
print(df.loc['John Doe':'Jane Doe', 'age':'occupation'])
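
The difference in how `iloc` and `loc` treat the end of a slice is easy to miss. The sketch below builds a small stand-alone DataFrame (the name `people` is illustrative) and compares the two.

```python
import pandas as pd

people = pd.DataFrame(
    {
        'age': [30, 25, 27],
        'occupation': ['Engineer', 'Data scientist', 'Software developer'],
    },
    index=['John Doe', 'Jane Doe', 'Mary Smith'],
)

# iloc slices by position and excludes the end: positions 0 and 1 only
by_position = people.iloc[0:2]
print(len(by_position))

# loc slices by label and includes both end labels: all three rows
by_label = people.loc['John Doe':'Mary Smith']
print(len(by_label))
```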

<h5 id="pandas-select">Select columns from Pandas DataFrames</h5>

In [None]:
# print the DataFrame for reference
df

In [None]:
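# Re-create the DataFrame with the default index, because 'name' was moved into the index above
# (`data` now refers to the dictionary of lists defined earlier)
df = pd.DataFrame(data)
# Select the 'name' column as a Series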
name_column = df['name']
print(name_column)
type(name_column)

In [None]:
# select the 'name' column as a DataFrame
name_column = df[['name']]
print(name_column)
type(name_column)

In [None]:
# Select multiple columns as a DataFrame
multiple_columns = df[['name', 'age']]
print(multiple_columns)

<h5 id="pandas-filter">Filter and sort Pandas DataFrames</h5>

In [None]:
# print the DataFrame for reference
df

In [None]:
# filter rows based on a condition
df[df['age'] > 25]

In [None]:
# sort the DataFrame by age in ascending order
df.sort_values(by='age', ascending=True, inplace=True)
df
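
Several conditions can be combined in one filter, and sorting can return a new DataFrame instead of modifying the original. A minimal sketch with a stand-alone DataFrame (the name `people` is illustrative):

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['John Doe', 'Jane Doe', 'Mike Smith'],
    'age': [30, 25, 40],
    'occupation': ['Engineer', 'Doctor', 'Teacher'],
})

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
older_non_teachers = people[(people['age'] > 25) & (people['occupation'] != 'Teacher')]
print(older_non_teachers)

# Without inplace=True, sort_values returns a sorted copy
# and leaves `people` unchanged
oldest_first = people.sort_values(by='age', ascending=False)
print(oldest_first)
```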

<h5 id="pandas-aggregate">Aggregate data in Pandas DataFrames</h5>

In [None]:
# Create a DataFrame
df = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'C'],
    'price': [10, 12, 15, 20, 25]
})

# Group the rows by product and sum the prices in each group (the result is a Series)
grouped_df = df.groupby(['product'])['price'].sum()
grouped_df
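
More than one aggregate can be computed per group by passing a list of function names to `agg`. The sketch below reuses the same product and price values in a stand-alone DataFrame named `sales` (an illustrative name).

```python
import pandas as pd

sales = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'C'],
    'price': [10, 12, 15, 20, 25],
})

# One row per product, one column per aggregate
summary = sales.groupby('product')['price'].agg(['sum', 'mean', 'count'])
print(summary)

# Individual results can be read with loc[row label, column label]
print(summary.loc['B', 'mean'])
```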

<hr id="data">

<h2>3. Loading and Manipulating Data</h2>

Pandas provides useful built-in functions for loading data from different sources.

<h5 id="data-csv">Load data from CSV files into Pandas DataFrames</h5>

In [None]:
# load data from a CSV file
df = pd.read_csv('harry_potter.csv', delimiter=';')
df.head()   # shows the first 5 rows
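
To try the `delimiter` argument without a file on disk, `read_csv` also accepts a file-like object. The sketch below uses `io.StringIO` with made-up semicolon-separated content; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text standing in for a CSV file
csv_text = "id;label\n1;first\n2;second\n"

# delimiter=';' tells pandas that fields are separated by semicolons
# instead of the default comma
sample = pd.read_csv(io.StringIO(csv_text), delimiter=';')
print(sample)
print(sample.columns.tolist())
```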

<h5 id="data-json">Load data from JSON files into Pandas DataFrames</h5>


In [None]:
# import data from a JSON file
df = pd.read_json('pokemon.json')
df.head()

In [None]:
# flatten the nested 'stats' column into separate columns
stats = pd.json_normalize(df['stats'])
stats.rename(columns={x: f'stats.{x}' for x in stats.columns}, inplace=True)     # prefix each new column name with 'stats.'
stats.head()
df.drop('stats', axis=1, inplace=True)
df = pd.concat([df, stats], axis=1)
df.head()
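
When the nested data is available as a list of Python dictionaries, `pd.json_normalize` can flatten it in one step and name the new columns with a dotted prefix. The records below are made up for illustration and do not come from the file used above.

```python
import pandas as pd

# Hypothetical nested records: each one has a nested 'stats' dictionary
records = [
    {'name': 'alpha', 'stats': {'hp': 45, 'attack': 49}},
    {'name': 'beta', 'stats': {'hp': 60, 'attack': 62}},
]

# Nested keys become columns named '<outer>.<inner>'
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```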

<h5 id="data-sql">Load data from SQL database into Pandas DataFrames</h5>


In [5]:
# import SQLite library
import sqlite3      # sqlite3 ships with Python's standard library, so no installation is needed

# create a connection to the database
conn = sqlite3.connect('employee_data.db')

# see the tables in the database
df_tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn)   # SQL string literals use single quotes
df_tables

Unnamed: 0,name
0,employees
1,departments


In [6]:
# read employee details table
df_employees = pd.read_sql('SELECT * FROM employees', conn)
df_employees.head()

Unnamed: 0,emp_idno,emp_fname,emp_lname,emp_dept
0,127323,Michale,Robbin,57
1,328717,Jhon,Snares,63
2,444527,Joseph,Dosni,47
3,526689,Carlos,Snares,63
4,539569,George,Mardy,27


In [7]:
# read department details table
df_departments = pd.read_sql('SELECT * FROM departments', conn)
df_departments.head()

Unnamed: 0,dpt_code,dpt_name,dpt_allotment
0,57,IT,65000
1,63,Finance,15000
2,47,HR,240000
3,27,RD,55000
4,89,QC,75000


In [8]:
# read employee details table with department name using the "LEFT JOIN" clause
query = '''
    SELECT employees.emp_idno, employees.emp_fname, employees.emp_lname, departments.dpt_name
    FROM employees
    LEFT JOIN departments ON employees.emp_dept = departments.dpt_code
'''
df_full = pd.read_sql(query, conn)
df_full.head()

Unnamed: 0,emp_idno,emp_fname,emp_lname,dpt_name
0,127323,Michale,Robbin,IT
1,328717,Jhon,Snares,Finance
2,444527,Joseph,Dosni,HR
3,526689,Carlos,Snares,Finance
4,539569,George,Mardy,RD
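
The same join can be tried without the database file by using an in-memory SQLite database. The sketch below creates two small tables with made-up rows (the column names follow the tables above, the values are hypothetical) and shows that a `LEFT JOIN` keeps employees whose department has no match. It also does the equivalent join in pandas with `merge`.

```python
import sqlite3
import pandas as pd

# In-memory database with hypothetical rows; nothing is written to disk
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employees (emp_idno INTEGER, emp_fname TEXT, emp_dept INTEGER)')
conn.execute('CREATE TABLE departments (dpt_code INTEGER, dpt_name TEXT)')
conn.executemany(
    'INSERT INTO employees VALUES (?, ?, ?)',
    [(1, 'Ann', 10), (2, 'Bob', 20), (3, 'Cid', 99)],  # department 99 does not exist
)
conn.executemany('INSERT INTO departments VALUES (?, ?)', [(10, 'IT'), (20, 'HR')])
conn.commit()

# LEFT JOIN keeps every employee; unmatched departments come back as missing values
joined_sql = pd.read_sql(
    '''
    SELECT e.emp_fname, d.dpt_name
    FROM employees AS e
    LEFT JOIN departments AS d ON e.emp_dept = d.dpt_code
    ORDER BY e.emp_idno
    ''',
    conn,
)
print(joined_sql)

# The same join done in pandas after loading both tables
employees = pd.read_sql('SELECT * FROM employees', conn)
departments = pd.read_sql('SELECT * FROM departments', conn)
joined_pandas = employees.merge(
    departments, how='left', left_on='emp_dept', right_on='dpt_code'
)
print(joined_pandas[['emp_fname', 'dpt_name']])

conn.close()
```

Joining in SQL lets the database do the work before the data reaches pandas, while `merge` is convenient when both tables are already loaded as DataFrames.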


<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>

<h2>References</h2>
<a href="https://www.w3schools.com/python/default.asp">w3schools.com</a>
<br>
<a href="https://www.kaggle.com">kaggle.com</a>