# Tutorial 4: NumPy, Pandas & Data Gathering


## Objectives

After this tutorial you will be able to:

*   understand the basics of the NumPy and Pandas libraries
*   use NumPy arrays for efficient data manipulation
*   use Pandas DataFrames for structured data manipulation
*   load data from different sources into NumPy arrays and Pandas DataFrames
*   perform data manipulation tasks such as filtering, sorting, and aggregating data


<h2>Table of Contents</h2>

<ol>
    <li>
        <details>
            <summary><a href="#numpy">NumPy</a></summary>
            <ul>
                <li><a href="#numpy-create">Create and initialize NumPy arrays</a></li>
                <li><a href="#numpy-perform">Perform basic arithmetic operations on NumPy arrays</a></li>
                <li><a href="#numpy-index">Index and slice NumPy arrays</a></li>
                <li><a href="#numpy-shape">Change the shape of NumPy arrays</a></li>
            </ul>
        </details>
    </li>
    <br>
    <li>
        <details>
            <summary><a href="#pandas">Pandas</a></summary>
            <ul>
                <li><a href="#pandas-create">Create and initialize Pandas Series and DataFrames</a></li>
                <li><a href="#pandas-index">Index and slice Pandas DataFrames</a></li>
                <li><a href="#pandas-select">Select columns from Pandas DataFrames</a></li>
                <li><a href="#pandas-filter">Filter and sort Pandas DataFrames</a></li>
                <li><a href="#pandas-aggregate">Aggregate data in Pandas DataFrames</a></li>
            </ul>
        </details>
    </li>
    <br>
    <li>
        <details>
            <summary><a href="#data">Loading and Manipulating Data</a></summary>
            <ul>
                <li><a href="#data-csv">Load data from CSV files into Pandas DataFrames</a></li>
                <li><a href="#data-json">Load data from JSON files into Pandas DataFrames</a></li>
                <li><a href="#data-sql">Load data from SQL database into Pandas DataFrames</a></li>
            </ul>
        </details>
    </li>
    <br>    
</ol>


<hr id="numpy">

<h2>1. NumPy</h2>

NumPy is a library for working with arrays.  
It also provides useful functions for linear algebra and matrices.

<h5 id="numpy-create">Create and initialize NumPy arrays</h5>

Import the NumPy library

In [None]:
# np is the most popular alias for numpy
import numpy as np

Create a 1D NumPy array

In [None]:
# avoid naming the variable `list`, which would shadow the built-in type
values = [1, 2, 3, 4, 5]
array_1d = np.array(values)
array_1d

Create a 2D NumPy array

In [None]:
array_2d = np.array(
    [
        [1, 2, 3], 
        [4, 5, 6], 
        [7, 8, 9]
    ]
)
array_2d

To get the number of dimensions of an array, use the attribute `ndim`.

In [None]:
# Show the numpy array dimensions
print(array_1d.ndim)
print(array_2d.ndim)

To get the size of each dimension, use the attribute `shape`.


In [None]:
print(array_1d.shape)
print(array_2d.shape)

To get the total number of elements in the array, use the attribute `size`.


In [None]:
# Show the numpy array size
print(array_1d.size)
print(array_2d.size)
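
The three attributes are related: `ndim` is the length of `shape`, and `size` is the product of the entries in `shape`. A minimal sketch on a made-up 3D array (`cube` is an illustrative name, not part of the tutorial data):

```python
import numpy as np

# hypothetical 3D array: 2 blocks, each with 3 rows and 4 columns
cube = np.zeros((2, 3, 4))

print(cube.ndim)    # number of dimensions
print(cube.shape)   # size of each dimension
print(cube.size)    # total number of elements

# ndim equals the length of shape; size equals the product of shape
print(cube.ndim == len(cube.shape))
print(cube.size == 2 * 3 * 4)
```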

<h5 id="numpy-perform">Perform basic arithmetic operations on NumPy arrays</h5>

In [None]:
# create 2 new arrays
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

# element-wise operations
# Addition
print('Addition: ', A + B)

# Subtraction
print('Subtraction: ', A - B)

# Multiplication
print('Multiplication: ', A * B)

# Division
print('Division: ', A / B)

In [None]:
# matrix operation
# Dot product A.B
print('Dot product: ', A.dot(B))

# 2D matrices
X = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
]) # 3x2

Y = np.array([
    [7, 8, 9],
    [0, 1, 2],
]) # 2x3

# X x Y
print('Matrix multiplication: ', X.dot(Y))
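
For 2D arrays, the `@` operator performs the same matrix multiplication as `dot`. The sketch below repeats the `X` and `Y` arrays from the cell above so that it runs on its own, and checks the shape of the result: a 3x2 matrix times a 2x3 matrix gives a 3x3 matrix. The name `product` is illustrative.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])    # 3x2
Y = np.array([[7, 8, 9], [0, 1, 2]])      # 2x3

product = X @ Y                            # matrix multiplication
print(product.shape)
print(np.array_equal(product, X.dot(Y)))  # same result as dot

# element-wise multiplication needs matching (or broadcastable) shapes,
# so X * Y would raise a ValueError for these two arrays
```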

<h5 id="numpy-index">Index and slice NumPy arrays</h5>

Accessing elements in a 2D array

In [None]:
# for reference
array_2d

In [None]:
# first row and first column
array_2d[0, 0]

In [None]:
# second row and third column
print(array_2d[1, 2])

In [None]:
# third row and last column
print(array_2d[2, -1])

Slicing a 2D array

In [None]:
# first row (all columns)
print(array_2d[0, :])

In [None]:
# first column (all rows)
print(array_2d[:, 0])

In [None]:
# first two rows and second and third columns
print(array_2d[0:2, 1:3])
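
An integer index removes a dimension, while a slice keeps it. A short sketch of the difference, re-creating the 3x3 array from earlier so that it runs on its own (`grid`, `row` and `row_2d` are illustrative names):

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row = grid[0, :]         # integer row index -> 1D array
row_2d = grid[0:1, :]    # row slice -> still a 2D array
print(row.shape)
print(row_2d.shape)

# grid[1][2] and grid[1, 2] return the same element;
# the second form indexes in a single step
print(grid[1][2] == grid[1, 2])

# negative indices count from the end
print(grid[-1, -1])
```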

<h5 id="numpy-shape">Change the shape of NumPy arrays</h5>

In [None]:
array = np.array([1, 2, 3, 4, 5, 6])

# Reshape the array into a 2D array
array_2d = array.reshape((2, 3))
print(array_2d)

In [None]:
# Transpose the array
print(array_2d.T)
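
`reshape` only works when the new shape holds the same number of elements as the original array. One dimension may be given as `-1`, in which case NumPy infers it from the array size. A small sketch (`values`, `inferred` and `flat` are illustrative names):

```python
import numpy as np

values = np.arange(1, 7)            # [1 2 3 4 5 6]

inferred = values.reshape((3, -1))  # NumPy infers 2 columns
print(inferred.shape)

flat = inferred.reshape(-1)         # back to a 1D array
print(flat)

# a shape with a different number of elements is rejected
try:
    values.reshape((4, 2))
except ValueError as err:
    print('reshape failed:', err)
```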

<hr id="pandas">

<h2>2. Pandas</h2>

Pandas is a Python library for data manipulation and analysis. It works with many kinds of data, including tabular data, time series data, and categorical data.

Pandas provides two data structures for working with data:
1. Series: a one-dimensional array of indexed data.  
2. DataFrame: a two-dimensional data structure, with data arranged in rows and columns in a tabular manner.

<h5 id="pandas-create">Create and initialize Pandas Series and DataFrames</h5>

Import the Pandas library

In [None]:
# pd is the most popular alias for pandas
import pandas as pd

We can create a pandas Series from a list of values and specify a custom list of indices

In [None]:
# create data and corresponding indices
temp_data = [25, 27, 26, 27]
indices = ['2023-10-01', '2023-10-02', '2023-10-03', '2023-10-04']

# create a Pandas Series from the data
series = pd.Series(temp_data, index=indices)
series

In [None]:
# access the value corresponding to day/index = "2023-10-03"
series['2023-10-03']

We can create a pandas DataFrame from a **list of dictionaries** where each dictionary has:
- the column headers as *dict keys*
- the row values as *dict values*.

In [None]:
# Create a DataFrame from a list of dictionaries
data = [
    {'name': 'John Doe',    'age': 30,  'occupation': 'Engineer'}, 
    {'name': 'Jane Doe',    'age': 25,  'occupation': 'Data scientist'},
    {'name': 'Mary Smith',  'age': 27,  'occupation': 'Software developer'},
    {'name': 'Mark Smith',  'age': 28,  'occupation': 'Data analyst'}
]
df = pd.DataFrame(data)
df

We can create a pandas DataFrame from a **dictionary of lists** where:
- the keys are the column headers
- the lists are the column values

In [None]:
data = {
    'name': ['John Doe', 'Jane Doe', 'Mary Smith', 'Mark Smith'], 
    'age': [30, 25, 27, 28], 
    'occupation': ['Engineer', 'Data scientist', 'Software developer', 'Data analyst']
}
df2 = pd.DataFrame(data)
df2

We can create a pandas DataFrame from a **NumPy 2D array**, and a **columns list** (column headers).

In [None]:
# Create a DataFrame from a NumPy array
array = np.array([
    [1, 2, 3], 
    [4, 5, 6], 
    [7, 8, 9]
])
df3 = pd.DataFrame(array, columns=['A', 'B', 'C'])
df3

<h5 id="pandas-index">Index and slice Pandas DataFrames</h5>

We have two ways to index and slice Pandas DataFrames:
1. `iloc[]`: an **integer-position-based** indexer, used with square brackets like lists and NumPy arrays

In [None]:
# print the DataFrame for reference
df

In [None]:
# access the 1st row and 1st column
print(df.iloc[0, 0])

In [None]:
# access the 2nd row and 3rd column
print(df.iloc[1, 2])

In [None]:
# slice the 1st 2 rows and 1st 2 columns
print(df.iloc[0:2, 0:2])

2. `loc[]`: a **label-based** indexer, also used with square brackets.  
When slicing with `loc`, the upper limit is included.
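
The difference in slicing is easiest to see on a DataFrame with the default integer index, where positions and labels look alike. A minimal sketch on made-up data (`people`, `by_position` and `by_label` are illustrative names):

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'age': [31, 25, 40, 28],
})

by_position = people.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = people.loc[0:2]       # labels 0, 1 and 2; label 2 is included

print(len(by_position))
print(len(by_label))
```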

In [None]:
# print the DataFrame for reference
df

In [None]:
# access the 1st row age
print(df.loc[0, 'age'])

We can also set the DataFrame index to a certain column instead of the default numerical index of (0, 1, 2, ...)

In [None]:
# set index to name
df.set_index('name', inplace=True)
print(df)

In that case, we can use `loc[]` with the row/index name and the column/header name.

In [None]:
# access the 1st row age
print('\n1st row age: ', df.loc['John Doe', 'age'])

In [None]:
# slice the 1st 2 rows and 1st 2 columns
print('\n1st 2 rows and 1st 2 columns: ')
print(df.loc['John Doe':'Jane Doe', 'age':'occupation'])

<h5 id="pandas-select">Select columns from Pandas DataFrames</h5>

In [None]:
# reset and print the DataFrame for reference
df = pd.DataFrame(data)
df

In [None]:
# Select the 'name' column as a series
name_column = df['name']
print(name_column)
type(name_column)

In [None]:
# select the 'name' column as a DataFrame
name_column = df[['name']]
print(type(name_column))
name_column

In [None]:
# rename the 'occupation' column to 'job'
updated_columns = {
    'occupation': 'job',
}
df.rename(columns=updated_columns, inplace=True)
df

In [None]:
# Select multiple columns as a DataFrame
multiple_columns = df[['name', 'age']]
multiple_columns

<h5 id="pandas-filter">Filter and sort Pandas DataFrames</h5>

In [None]:
# print the DataFrame for reference
df

In [None]:
# filter rows based on a condition:
# build a boolean mask, then keep only the rows where it is True
mask = df['age'] > 25
print(mask)
df[mask]
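
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). When the comparisons are written inline, each one needs its own parentheses because these operators bind more tightly than `>` or `<`. A sketch that repeats the names and ages used above so that it runs on its own (`people`, `between` and `inline` are illustrative names):

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['John Doe', 'Jane Doe', 'Mary Smith', 'Mark Smith'],
    'age': [30, 25, 27, 28],
})

older_than_25 = people['age'] > 25
younger_than_30 = people['age'] < 30

# keep rows where both conditions are True
between = people[older_than_25 & younger_than_30]
print(between)

# the same filter written inline, with parentheses around each comparison
inline = people[(people['age'] > 25) & (people['age'] < 30)]
print(inline.equals(between))
```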

In [None]:
# sort the DataFrame by age in ascending order
df.sort_values(by='age', ascending=True, inplace=True)
df

<h5 id="pandas-aggregate">Aggregate data in Pandas DataFrames</h5>

In [None]:
# Create a sales data DataFrame
df = pd.DataFrame({
    'category': ['Tech', 'Tech', 'Cons', 'Tech', 'Cons'],
    'product': ['A', 'A', 'B', 'C', 'B'],
    'price': [10, 12, 15, 20, 25]
})
df

In [None]:
# Group the DataFrame by category and product, then sum the prices in each group
grouped_df = df.groupby(['category', 'product'])['price'].sum()
grouped_df
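
Grouping by two keys and summing one column gives a Series with a two-level index. The sketch below shows two common follow-ups on the same sales data, repeated here so that it runs on its own: `reset_index` turns the group keys back into ordinary columns, and `agg` computes several aggregates at once. The names `sales`, `totals` and `summary` are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    'category': ['Tech', 'Tech', 'Cons', 'Tech', 'Cons'],
    'product': ['A', 'A', 'B', 'C', 'B'],
    'price': [10, 12, 15, 20, 25],
})

# total price per (category, product), with the keys as regular columns
totals = sales.groupby(['category', 'product'])['price'].sum().reset_index()
print(totals)

# several aggregates per category in one call
summary = sales.groupby('category')['price'].agg(['sum', 'mean', 'count'])
print(summary)
```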

<hr id="data">

<h2>3. Loading and Manipulating Data</h2>

Pandas provides useful built-in functions for loading data from different sources.

<h5 id="data-csv">Load data from CSV files into Pandas DataFrames</h5>

In [None]:
# load data from a CSV file
df = pd.read_csv('harry_potter.csv', delimiter=';')
df.set_index('Id', inplace=True)
df.head()   # shows the first 5 rows
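
The file `harry_potter.csv` is not reproduced here, so the sketch below uses a small made-up semicolon-separated text held in memory; its column names and rows are invented for illustration. `read_csv` accepts file-like objects as well as paths, `sep` is the parameter that `delimiter` is an alias for, and `index_col` sets the index while loading instead of calling `set_index` afterwards.

```python
import io
import pandas as pd

# made-up data standing in for a semicolon-separated file
csv_text = "Id;Name;House\n1;Alice;Red\n2;Bob;Blue\n"

sample = pd.read_csv(io.StringIO(csv_text), sep=';', index_col='Id')
print(sample)
```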

<h5 id="data-json">Load data from JSON files into Pandas DataFrames</h5>


In [None]:
# import data from a JSON file
df = pd.read_json('pokemon.json')
df.set_index('id', inplace=True)
df.head()

In [None]:
# normalize the nested 'stats' data into a separate DataFrame
stats = pd.json_normalize(df['stats'].tolist())
stats.index = df.index  # keep the 'id' index so a later concat aligns the rows
updated_cols = {x: f'stats.{x}' for x in stats.columns}
stats.rename(columns=updated_cols, inplace=True)
stats.head()

In [None]:
# drop the 'stats' column from the original DataFrame
# and add the normalized DataFrame instead
df.drop('stats', axis=1, inplace=True)
df = pd.concat([df, stats], axis=1)
df.head()
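
`json_normalize` can also flatten whole records in one step, naming nested fields `outer.inner`. A sketch on made-up records (the field names and values are invented and are not taken from `pokemon.json`):

```python
import pandas as pd

# made-up records with one nested dictionary each
records = [
    {'id': 1, 'name': 'alpha', 'stats': {'hp': 45, 'speed': 60}},
    {'id': 2, 'name': 'beta', 'stats': {'hp': 50, 'speed': 40}},
]

# nested keys become columns such as 'stats.hp' and 'stats.speed'
flat = pd.json_normalize(records).set_index('id')
print(flat)
```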

<h5 id="data-sql">Load data from SQL database into Pandas DataFrames</h5>


In [None]:
# import the SQLite library
import sqlite3      # part of the Python standard library, so no installation is needed

# connect to the database
conn = sqlite3.connect('employee_data.db')

In [None]:
# see the tables in the database
df_tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn)
df_tables

In [None]:
# read employee details table
df_employees = pd.read_sql('SELECT * FROM employees', conn)
df_employees.head()

In [None]:
# read department details table
df_departments = pd.read_sql('SELECT * FROM departments', conn)
df_departments.head()

In [None]:
# read employee details with the department name using a LEFT JOIN
query = '''
    SELECT employees.emp_idno, employees.emp_fname, employees.emp_lname, departments.dpt_name
    FROM employees
    LEFT JOIN departments ON employees.emp_dept = departments.dpt_code
'''
df_full = pd.read_sql(query, conn)
df_full.head()

In [None]:
# close database connection
conn.close()
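
The database file `employee_data.db` is not reproduced here, so the sketch below builds a throwaway in-memory SQLite database to show how a `LEFT JOIN` behaves. The table and column names mirror the query above, but the rows are made up, and `demo_conn` and `joined` are illustrative names. An employee whose department code has no match keeps its row and gets a missing department name. The `params` argument passes values to the `?` placeholder instead of formatting them into the SQL string.

```python
import sqlite3
import pandas as pd

# throwaway in-memory database with made-up rows
demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript("""
    CREATE TABLE departments (dpt_code INTEGER, dpt_name TEXT);
    CREATE TABLE employees (emp_idno INTEGER, emp_fname TEXT, emp_dept INTEGER);
    INSERT INTO departments VALUES (10, 'Finance');
    INSERT INTO employees VALUES (1, 'Ann', 10), (2, 'Bob', 99);
""")

query = """
    SELECT e.emp_idno, e.emp_fname, d.dpt_name
    FROM employees AS e
    LEFT JOIN departments AS d ON e.emp_dept = d.dpt_code
    WHERE e.emp_idno >= ?
    ORDER BY e.emp_idno
"""
joined = pd.read_sql(query, demo_conn, params=(1,))
demo_conn.close()

# Bob's department code 99 has no match, so his dpt_name is missing
print(joined)
```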

<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>
