# 3.0 Python Programming and Libraries

**Learning Objectives:** By the end of the lesson, students should have a basic understanding of the key Python libraries used for data analysis, how to use them for simple tasks, and where to go for further learning.

Python libraries are collections of pre-written code that provide functions, classes, and modules to help with specific tasks, making development faster and easier. These libraries allow you to avoid reinventing the wheel, as they offer well-tested, optimized, and reusable code to handle common problems or tasks.

There are many different types of Python libraries, depending on the domain or purpose. Below are some categories and examples of popular libraries in Python:

**Data Manipulation and Analysis**
These libraries help you manage, analyze, and process data, especially useful in data science, machine learning, and statistical analysis.
* Pandas: Provides data structures like DataFrame for handling structured data, such as CSV files, databases, and Excel sheets. `Example: import pandas as pd`
* NumPy: Used for numerical computing. It provides support for arrays, matrices, and a large collection of mathematical functions. `Example: import numpy as np`
* SciPy: Builds on NumPy and provides algorithms for optimization, integration, interpolation, eigenvalue problems, and other advanced mathematics. `Example: from scipy import stats`

**Visualization**
These libraries help you create charts, plots, and other visualizations to communicate data insights.
* Matplotlib: The most commonly used library for creating static, animated, and interactive visualizations in Python. `Example: import matplotlib.pyplot as plt`
* Seaborn: Built on top of Matplotlib, it provides a higher-level interface for creating more attractive and informative statistical graphics. `Example: import seaborn as sns`
* Plotly: Used for creating interactive plots and dashboards. `Example: import plotly.express as px`

**Machine Learning and Artificial Intelligence**
These libraries provide tools and algorithms for building, training, and deploying machine learning models.
* Scikit-learn: A popular machine learning library for classical machine learning algorithms like regression, classification, clustering, and dimensionality reduction. `Example: from sklearn.ensemble import RandomForestClassifier`
* TensorFlow: A powerful library for deep learning and neural networks, often used for training large-scale machine learning models. `Example: import tensorflow as tf`
* Keras: High-level neural networks API, which can run on top of TensorFlow, designed to be simple and user-friendly. `Example: import keras`
* PyTorch: Another deep learning framework, known for its flexibility and ease of use, especially in research settings. `Example: import torch`
  
**Web Development**
Python has libraries for building web applications and APIs, mainly on the server (backend) side.
* Flask: A lightweight web framework for building small to medium-sized web applications. `Example: from flask import Flask`
* Django: A high-level web framework that promotes rapid development and clean, pragmatic design. `Example: import django`
* FastAPI: A modern, fast web framework for building APIs with Python 3.7+ based on standard Python type hints. `Example: from fastapi import FastAPI`
  
**Data Serialization**
These libraries help with reading and writing data in formats such as JSON, CSV, and Python's own pickle format.

* JSON: Built-in Python module to handle JSON data serialization. `Example: import json`
* Pickle: A built-in library for serializing and deserializing Python objects (often used for saving machine learning models). Only unpickle data from sources you trust, because loading a pickle can execute arbitrary code. `Example: import pickle`
* CSV: Built-in Python module for reading and writing CSV files. `Example: import csv`
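
As a minimal sketch of the serialization modules listed above, the following round-trips a few records through JSON and CSV using only the standard library. The `records` list is hypothetical sample data, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io
import json

# Hypothetical sample records, used only for illustration.
records = [
    {"name": "Alice", "age": 24},
    {"name": "Bob", "age": 27},
]

# JSON: serialize to a string, then parse it back.
json_text = json.dumps(records)
restored = json.loads(json_text)
print(restored == records)  # True: JSON preserves ints and strings

# CSV: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age"])
writer.writeheader()
writer.writerows(records)

# Reading CSV back yields strings only; numeric fields must be converted.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["age"], type(rows[0]["age"]).__name__)  # 24 str
```

Note the difference between the two formats: JSON keeps the distinction between numbers and strings, while CSV stores everything as text.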

**Web Scraping**
Libraries to extract data from websites.

* BeautifulSoup: A library for parsing HTML and XML documents, used for web scraping. `Example: from bs4 import BeautifulSoup`
* Scrapy: A powerful framework for extracting data from websites and processing it as structured data. `Example: import scrapy`
* Selenium: A tool for automating web browsers, useful for scraping dynamic web pages (JavaScript-heavy sites). `Example: from selenium import webdriver`

**File I/O and System Operations** 
Libraries for file handling and interacting with the system.

* os: Built-in Python module for interacting with the operating system (e.g., file operations, environment variables). `Example: import os`
* shutil: Built-in module for high-level file operations, like copying or moving files. `Example: import shutil`
* glob: For finding all the pathnames matching a specified pattern. `Example: import glob`
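
The three modules above are often used together. The sketch below, which works inside a temporary directory and uses made-up file names, writes a file, copies it with `shutil`, and lists the results with `glob`.

```python
import glob
import os
import shutil
import tempfile

# Work inside a temporary directory so nothing on disk is left behind.
with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical file names, chosen only for this illustration.
    source = os.path.join(workdir, "report.txt")
    with open(source, "w") as handle:
        handle.write("hello")

    # shutil.copy duplicates the file; os.path.join builds portable paths.
    backup = os.path.join(workdir, "report_backup.txt")
    shutil.copy(source, backup)

    # glob returns every path matching the wildcard pattern.
    matches = sorted(glob.glob(os.path.join(workdir, "*.txt")))
    names = [os.path.basename(path) for path in matches]
    print(names)  # ['report.txt', 'report_backup.txt']
```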

## 3.1. NumPy (arrays, mathematical operations)
NumPy (Numerical Python) is one of the most essential libraries in Python for data science, machine learning, scientific computing, and any domain that involves heavy manipulation of numerical data. It provides powerful tools for working with arrays, matrices, and high-level mathematical functions.

### 3.1.1. Common NumPy Functions and Methods
**1) Creating NumPy Arrays**

NumPy's core data structure is the ndarray (N-dimensional array), which is used to store elements of the same type. You can create arrays from Python lists, tuples, or even using NumPy functions directly.
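
To illustrate the "same type" rule, here is a small sketch showing how NumPy chooses a single dtype for the whole array. The values are arbitrary.

```python
import numpy as np

# Integers only: NumPy picks an integer dtype.
ints = np.array([1, 2, 3])
print(ints.dtype.kind)  # i (the exact width, e.g. int64, depends on the platform)

# Mixing ints and floats: every element is upcast to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
print(mixed)        # [1.  2.5 3. ]

# Tuples work the same way as lists.
from_tuple = np.array((4, 5, 6))
print(from_tuple.shape)  # (3,)
```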

**a) Creating NumPy arrays from Python lists:**

In [None]:
import numpy as np

# 1D array
arr1 = np.array([1, 2, 3, 4])

# 2D array (Matrix)
arr2 = np.array([[1, 2], [3, 4]])
print("1D array, shape", arr1.shape)
print(arr1)
print("2D array, shape", arr2.shape)
print(arr2)

**b) Creating NumPy arrays from NumPy functions:**

In [None]:
# Creating a 1D array with evenly spaced values between 0 and 10
arr_linspace = np.linspace(0, 10, 5)  # 5 values between 0 and 10
print("Linspace Array:", arr_linspace)

# Creating an array of zeros
arr_zeros = np.zeros((3, 3))  # 3x3 matrix of zeros
print("Zeros Array:\n", arr_zeros)

# Creating an array of ones
arr_ones = np.ones((2, 4))  # 2x4 matrix of ones
print("Ones Array:\n", arr_ones)

**2) Indexing and Slicing Arrays**
NumPy allows advanced indexing and slicing techniques for extracting or modifying subsets of data.

**a) Accessing Single Elements:**

In [None]:
arr = np.array([10, 20, 30, 40])
print(arr[0])  # Access the first element: Output: 10
print(arr[2])  # Access the third element: Output: 30

**b) Slicing Arrays:**

In [None]:
arr = np.array([1, 2, 3, 4, 5])
print(arr[1:4])  # Access elements from index 1 to 3: Output: [2 3 4]

# Slicing a 2D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr_2d[0, 1])  # Access the element in the first row, second column: Output: 2
print(arr_2d[:, 1])   # Access all rows in the second column: Output: [2 5 8]

**c) Modifying Elements:**

In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr[2] = 10  # Change the third element
print(arr)  # Output: [ 1  2 10  4  5]
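
Two further indexing behaviors are worth knowing: boolean masks and the fact that basic slices are views rather than copies. The sketch below uses arbitrary values to show both.

```python
import numpy as np

arr = np.array([5, 12, 7, 20, 3])

# Boolean mask: True where the condition holds.
mask = arr > 6
print(arr[mask])  # [12  7 20]

# Assigning through a mask modifies the matching elements in place.
capped = arr.copy()
capped[capped > 10] = 10
print(capped)  # [ 5 10  7 10  3]

# Basic slices are views: writing to the slice changes the original array.
view = arr[1:3]
view[0] = 99
print(arr)  # [ 5 99  7 20  3]
```

Call `.copy()` on a slice when you need an independent array.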

**3) Array Operations**
NumPy allows for fast and efficient element-wise mathematical operations on arrays. This includes addition, subtraction, multiplication, division, and more.

NumPy supports vectorized operations, which means you can perform mathematical operations directly on arrays without the need for explicit loops.

Common Operations:

* Arithmetic: Addition, subtraction, multiplication, division, etc.
* Element-wise Operations: Operations on individual elements of the array.
* Broadcasting: Allows NumPy to perform element-wise operations on arrays of different shapes.
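
As a small sketch of what "vectorized" means, the following computes the same result twice: once with an explicit Python loop and once with a single array expression. The input values are arbitrary.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Explicit Python loop: one element at a time.
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 2 + 1

# Vectorized: the same arithmetic applied to the whole array at once.
# The scalars 2 and 1 are broadcast to every element.
vectorized = values * 2 + 1

print(vectorized)                          # [3. 5. 7. 9.]
print(np.array_equal(looped, vectorized))  # True
```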

**a) Element-wise Arithmetic:**

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Element-wise addition
sum_array = arr1 + arr2
print("Sum:", sum_array)  # Output: [5 7 9]

# Element-wise multiplication
product_array = arr1 * arr2
print("Product:", product_array)  # Output: [ 4 10 18]

**b) Using Mathematical Functions:**

In [None]:
arr = np.array([1, 4, 9, 16])

# Square root
sqrt_arr = np.sqrt(arr)
print("Square Root:", sqrt_arr)  # Output: [1. 2. 3. 4.]

# Exponential
exp_arr = np.exp(arr)
print("Exponential:", exp_arr)  # Output: [2.71828183e+00 5.45981500e+01 8.10308393e+03 8.88611052e+06]

# Logarithm (natural log)
log_arr = np.log(arr)
print("Logarithm:", log_arr)  # Output: [0.         1.38629436 2.19722458 2.77258872]

**4) Linear Algebra with NumPy**
NumPy provides a variety of functions for linear algebra tasks, such as matrix multiplication, computing determinants, and finding eigenvalues.

**a) Matrix Multiplication:**

In [None]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication (dot product)
dot_product = np.dot(A, B)
print("Matrix Product:\n", dot_product)

**b) Matrix Transpose:**

In [None]:
A = np.array([[1, 2], [3, 4]])
transpose_A = np.transpose(A)
print("Transpose of A:\n", transpose_A)  # Output: [[1 3] [2 4]]

**c) Determinant of a Matrix:**

In [None]:
matrix = np.array([[1, 2], [3, 4]])
det_matrix = np.linalg.det(matrix)
print("Determinant of matrix:", det_matrix)  # Output: approximately -2.0 (floating-point rounding may show -2.0000000000000004)

**d) Eigenvalues and Eigenvectors:**

In [None]:
matrix = np.array([[1, 2], [3, 4]])
eigenvalues, eigenvectors = np.linalg.eig(matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
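
The results of these linear algebra routines can be checked numerically. The sketch below verifies the eigen-decomposition of the same matrix and also solves a small linear system; the right-hand side `b` is a made-up example.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# For 2D arrays, the @ operator gives the same result as np.dot.
print(np.array_equal(A @ B, np.dot(A, B)))  # True

# Each column of `eigenvectors` pairs with the eigenvalue at the same index,
# so A @ v should equal eigenvalue * v up to floating-point error.
eigenvalues, eigenvectors = np.linalg.eig(A)
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print(np.allclose(A @ v, eigenvalues[i] * v))  # True

# Solve the linear system A x = b (b is a made-up right-hand side).
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(np.allclose(x, [1, 2]))  # True: 1*1 + 2*2 = 5 and 3*1 + 4*2 = 11
```

Because these routines work in floating point, compare results with `np.allclose` rather than `==`.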

**5) Broadcasting in NumPy**
Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes by "stretching" the smaller array to match the larger array. This eliminates the need for explicit loops and makes code cleaner and more efficient.

In [None]:
arr1 = np.array([1, 2, 3])  # 1D array
arr2 = np.array([[1], [2], [3]])  # 2D array (3x1)

# Broadcasting: arr1 (shape (3,)) and arr2 (shape (3, 1)) are both stretched to shape (3, 3), then added element-wise
result = arr1 + arr2
print(result)  # Output: [[2 3 4] [3 4 5] [4 5 6]]

In [None]:
arr = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Broadcasting adds the array to each row of the matrix
result = matrix + arr
print(result)
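
A common practical use of broadcasting is centering each column of a data matrix. The sketch below uses hypothetical measurements and also shows what happens when two shapes cannot be broadcast together.

```python
import numpy as np

# Hypothetical measurements: 3 samples (rows) x 2 features (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Column means have shape (2,); broadcasting repeats them across the 3 rows.
col_means = data.mean(axis=0)
centered = data - col_means
print(col_means)          # [ 2. 20.]
print(centered.tolist())  # [[-1.0, -10.0], [0.0, 0.0], [1.0, 10.0]]

# Shapes that cannot be aligned raise an error instead of silently stretching.
try:
    data + np.array([1.0, 2.0, 3.0])  # (3, 2) + (3,) is incompatible
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcast failed:", err)
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or one of them is 1.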

**6) Aggregating Functions**
NumPy provides many functions for aggregating data, such as calculating the sum, mean, median, standard deviation, and more across arrays.

**Calculating sum, mean, std:**

In [None]:
arr = np.array([1, 2, 3, 4, 5])

# Sum of elements
sum_arr = np.sum(arr)
print("Sum:", sum_arr)  # Output: 15

# Mean of elements
mean_arr = np.mean(arr)
print("Mean:", mean_arr)  # Output: 3.0

# Standard deviation of elements
std_arr = np.std(arr)
print("Standard Deviation:", std_arr)  # Output: 1.4142135623730951
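
Aggregations can also run along one axis of a multi-dimensional array. The following sketch, using an arbitrary 2x3 matrix, shows the `axis` argument and the `ddof` option of `np.std`.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(np.sum(matrix))          # 21: all elements
print(np.sum(matrix, axis=0))  # [5 7 9]: one sum per column
print(np.sum(matrix, axis=1))  # [ 6 15]: one sum per row
print(np.median(matrix))       # 3.5

# np.std defaults to the population formula (ddof=0);
# pass ddof=1 for the sample standard deviation.
arr = np.array([1, 2, 3, 4, 5])
sample_std = np.std(arr, ddof=1)
print(round(sample_std, 4))  # 1.5811
```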

**7) Reshaping Arrays**
NumPy allows you to easily change the shape of an array. This is especially useful in machine learning tasks where data often needs to be reshaped to fit into a model.

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped_arr = arr.reshape(2, 3)  # Reshape to a 2x3 matrix
print("Reshaped Array:\n", reshaped_arr)

[[1 2 3]
 [4 5 6]]
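
When only one dimension is known in advance, `-1` lets NumPy infer the other. The sketch below shows this, along with the error raised when the element count does not match the requested shape.

```python
import numpy as np

arr = np.arange(1, 13)  # the integers 1 through 12

# -1 asks NumPy to work out that dimension: 12 elements / 4 columns = 3 rows.
grid = arr.reshape(-1, 4)
print(grid.shape)  # (3, 4)

# The total element count must match, otherwise reshape raises ValueError.
try:
    arr.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)

# ravel() flattens back to one dimension.
flat = grid.ravel()
print(flat.shape)  # (12,)
```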


## 3.2. Pandas (data manipulation, DataFrames)
The Pandas library in Python is a powerful, flexible, and easy-to-use open-source data analysis and manipulation tool. It is widely used for working with structured data (e.g., tabular data in the form of spreadsheets, CSV files, or databases). Pandas provides high-level data structures and methods for efficiently manipulating, analyzing, and visualizing data.

### 3.2.1. Key Features of Pandas
**Data Structures:**

* **Series:** A one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). It is similar to a Python list or NumPy array, but each element also carries an index label, which enables label-based lookup and automatic alignment.

**Example of a Series:**

In [None]:
import pandas as pd

data = [1, 2, 3, 4]
s = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(s)

* **DataFrame:** A two-dimensional labeled data structure with columns of potentially different types (e.g., integers, floats, strings). It is essentially a table (like a spreadsheet or SQL table) where data is organized into rows and columns.

**Example of a DataFrame:**

In [None]:
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [24, 27, 22],
    'city': ['New York', 'San Francisco', 'Los Angeles']
}

df = pd.DataFrame(data)
print(df)

**Data Manipulation:**

* **Filtering and Selection:** You can filter data based on conditions or select specific rows, columns, or subsets of the data.
* **Indexing and Alignment:** DataFrames and Series come with automatic and flexible indexing, allowing for easy alignment of data across multiple tables or Series, even if they have different indices.
* **Missing Data Handling:** Pandas provides robust methods for dealing with missing data, such as filling or dropping missing values (NaN).
* **Data Transformation:** Pandas has numerous functions for transforming data, including aggregation, merging, reshaping, and sorting. These functions allow users to quickly perform common data manipulation tasks.
* **GroupBy:** The groupby functionality allows for splitting data into groups based on some criteria and applying functions (e.g., sum, mean) to each group, making it useful for aggregating or summarizing data.
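
A few of the features listed above can be sketched on small, made-up data: index alignment between two Series, a GroupBy aggregation, and filtering combined with sorting.

```python
import pandas as pd

# Index alignment: values are matched by label, not by position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2
print(total)  # 'a' and 'd' become NaN because each exists in only one Series

# GroupBy: split rows by city, then average the ages within each group.
# The records below are made up for illustration.
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dana"],
    "age": [24, 27, 22, 30],
    "city": ["New York", "Boston", "New York", "Boston"],
})
mean_age = people.groupby("city")["age"].mean()
print(mean_age["New York"])  # 23.0
print(mean_age["Boston"])    # 28.5

# Filtering and sorting combined.
over_23 = people[people["age"] > 23].sort_values("age", ascending=False)
print(over_23["name"].tolist())  # ['Dana', 'Bob', 'Alice']
```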

**Data Import/Export:**

* Pandas makes it easy to load data from various sources, such as CSV files, Excel spreadsheets, SQL databases, JSON, and even web-based data sources. Similarly, it can export data to multiple formats, such as CSV, Excel, and SQL databases.

**Other Features:**

* **Data Visualization:** While not a full-fledged visualization library like Matplotlib or Seaborn, Pandas integrates well with these libraries and allows for basic plotting of data using simple .plot() methods. It is useful for quickly generating charts and visualizing trends in data.
* **Performance:** Pandas is built on top of NumPy, so column-wise operations run as optimized array code rather than Python loops. It handles in-memory datasets efficiently, including manipulation tasks that would be slow or awkward in plain Python.

### 3.2.2 Common Pandas Functions and Methods

**1) Loading Data:**

In [None]:
import pandas as pd
# Assumes a file named data.csv exists in the working directory
df = pd.read_csv('data.csv')  # Load a CSV file into a DataFrame
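
If no CSV file is at hand, loading and saving can be tried entirely in memory. This is a minimal sketch with made-up data; `sample`, `csv_text`, and `loaded` are illustrative names, and `io.StringIO` stands in for a file.

```python
import io
import pandas as pd

# Made-up data for illustration.
sample = pd.DataFrame({"product": ["pen", "book"], "price": [1.5, 12.0]})

# to_csv returns the CSV text when no path is given.
csv_text = sample.to_csv(index=False)
print(csv_text)

# read_csv accepts file-like objects as well as paths.
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded.equals(sample))  # True

# Without index=False, the row index is written as an extra first column.
with_index = sample.to_csv()
print(with_index.splitlines()[0])  # ,product,price
```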

**2) DataFrame Operations:**

In [None]:
print(df.head())         # View the first few rows of a DataFrame
print(df.tail())         # View the last few rows of a DataFrame
df.info()                # Print a summary of the DataFrame (info() prints directly and returns None)
print(df.describe())     # Get statistical summaries of numerical columns
print(df.columns)        # Get the column names of the DataFrame
print(df.shape)         # Get the number of rows and columns (tuple)

**3) Data Selection:**

In [None]:
print(df['name'])            # Select a specific column
print(df[['name', 'age']])   # Select multiple columns
print(df.iloc[0])            # Select the first row by integer position
print(df.loc[0])             # Select the row whose index label is 0 (the first row with the default index)
print(df.iloc[0:2])          # First two rows
print(df[df['age'] > 5])     # Filter rows based on a condition
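
The difference between `loc` and `iloc` is easiest to see when the index is not the default 0, 1, 2. The sketch below uses a made-up DataFrame with string labels.

```python
import pandas as pd

# Made-up data with string labels as the index instead of 0, 1, 2.
df_labeled = pd.DataFrame(
    {"age": [24, 27, 22]},
    index=["alice", "bob", "charlie"],
)

print(df_labeled.iloc[0]["age"])     # 24: first row by position
print(df_labeled.loc["bob", "age"])  # 27: row labeled 'bob'

# Slices differ: iloc excludes the end position, loc includes the end label.
print(len(df_labeled.iloc[0:2]))               # 2
print(len(df_labeled.loc["alice":"charlie"]))  # 3

# Combine conditions with & and |, wrapping each condition in parentheses.
selected = df_labeled[(df_labeled["age"] > 22) & (df_labeled["age"] < 27)]
print(selected.index.tolist())  # ['alice']
```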

**4) Data Aggregation:**

In [None]:
print(df.groupby('column_name').sum())                    # Group by a column and sum the values
print(df.groupby('column_name').mean(numeric_only=True))  # Group by a column and average the numeric columns

**5) Missing Data Handling:**

In [None]:
print(df.isnull())         # Check for missing data
print(df.dropna())         # Drop rows with missing values
print(df.fillna(0))        # Fill missing values with a specific value
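
To see these methods on concrete values, here is a sketch with a made-up DataFrame containing two missing entries. It also shows that `fillna` accepts a dictionary, so each column can get a replacement that suits its type.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing entries.
df_missing = pd.DataFrame({
    "sales": [100.0, np.nan, 300.0],
    "region": ["north", "south", None],
})

missing_counts = df_missing.isnull().sum()
print(missing_counts)  # one missing value in each column

# dropna and fillna return new DataFrames; df_missing itself is unchanged.
complete_rows = df_missing.dropna()
print(len(complete_rows))  # 1

# Fill each column with a value that suits its type.
filled = df_missing.fillna(
    {"sales": df_missing["sales"].mean(), "region": "unknown"}
)
print(filled["sales"].tolist())   # [100.0, 200.0, 300.0]
print(filled["region"].tolist())  # ['north', 'south', 'unknown']
```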

**6) Merging and Joining:**

In [None]:
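# df1 and df2 are assumed to be existing DataFrames that share a 'key' column
print(df1.merge(df2, on='key'))      # Merge two DataFrames on a common column (inner join by default)
# join() aligns on the index; overlapping column names need lsuffix/rsuffix
print(df1.join(df2, lsuffix='_left', rsuffix='_right'))  # Join DataFrames on their index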
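
A self-contained sketch of merging and joining follows. The two small tables are made up, and `left_val` and `right_val` are illustrative column names.

```python
import pandas as pd

# Two small made-up tables sharing a 'key' column.
df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Default is an inner join: only keys present in both tables survive.
inner = df1.merge(df2, on="key")
print(inner["key"].tolist())  # ['b', 'c']

# how='left' keeps every row of df1; unmatched rows get NaN on the right.
left = df1.merge(df2, on="key", how="left")
print(left["right_val"].isna().tolist())  # [True, False, False]

# join() aligns on the index, so move 'key' into the index first.
joined = df1.set_index("key").join(df2.set_index("key"))
print(joined.shape)  # (3, 2)
```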

**7) Sorting Data:**

In [None]:
# Sort by a single column
df.sort_values(by='age', ascending=False, inplace=True)

# Sort by multiple columns
df.sort_values(by=['city', 'age'], ascending=[True, False], inplace=True)

**8) Exporting Data:**

In [None]:
df.to_csv('output.csv')  # Save DataFrame to a CSV file
df.to_excel('output.xlsx')  # Save DataFrame to an Excel file (requires an Excel engine such as openpyxl)

**Example Usage: Pandas can be used to load data, filter it, and perform some basic analysis**

In [None]:
import pandas as pd

# Load data from a CSV file into a DataFrame
df = pd.read_csv('sales_data.csv')

# Preview the first 5 rows
print(df.head())

# Filter data for sales greater than $5000
high_sales = df[df['sales'] > 5000]

# Group data by product and calculate the average sales
avg_sales_by_product = df.groupby('product')['sales'].mean()

# Handle missing values by filling with zero
df.fillna(0, inplace=True)

# Save the processed data to a new CSV file
df.to_csv('processed_sales_data.csv', index=False)

## 3.3. Matplotlib and Seaborn (visualization)
Matplotlib and Seaborn are two of the most popular libraries in Python for data visualization. They are often used to create static, animated, and interactive visualizations to help understand and interpret data more effectively.

### 3.3.1. Matplotlib
Matplotlib is a comprehensive and flexible library used to create a wide range of static, animated, and interactive plots in Python. It provides low-level functionality for creating almost any type of plot or graph, and it is the foundation upon which many other visualization libraries (like Seaborn) are built.

**Example of Matplotlib Usage:**
The following script will create a simple line plot with labeled axes, a title, and a grid.

In [None]:
import matplotlib.pyplot as plt

# Simple Line Plot
x = [0, 1, 2, 3, 4, 5]
y = [0, 1, 4, 9, 16, 25]

plt.plot(x, y, label='y = x^2', color='blue', marker='o')  # Line plot
plt.title('Simple Line Plot')  # Title of the plot
plt.xlabel('x-axis')  # X-axis label
plt.ylabel('y-axis')  # Y-axis label
plt.legend()  # Show legend
plt.grid(True)  # Show grid
plt.show()
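
Matplotlib also offers an object-oriented interface, which is convenient once a figure holds more than one plot. The sketch below draws the same data as a line plot and a bar chart side by side. It selects the non-interactive `Agg` backend and writes the image to an in-memory buffer so that it runs without a display; in a notebook, you would likely drop the `matplotlib.use` line and call `plt.show()` instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5]
y = [value ** 2 for value in x]

# One figure with two side-by-side axes.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, y, marker="o", label="y = x^2")
ax_line.set_title("Line plot")
ax_line.set_xlabel("x")
ax_line.set_ylabel("y")
ax_line.legend()

ax_bar.bar(x, y, color="gray")
ax_bar.set_title("Bar chart")

fig.tight_layout()

# Render to an in-memory PNG; pass a file name instead to save to disk.
image_buffer = io.BytesIO()
fig.savefig(image_buffer, format="png")
plt.close(fig)
```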

### 3.3.2. Seaborn
Seaborn is built on top of Matplotlib and provides a higher-level interface for creating attractive and informative statistical graphics. While Matplotlib is more low-level and flexible, Seaborn is designed to work well with Pandas DataFrames and simplifies the process of creating complex visualizations. It comes with better default aesthetics and several functions tailored for statistical plots.

**Example of Seaborn Usage:**
In the following example, Seaborn creates a pairplot for the Iris dataset (a famous dataset with measurements for three species of flowers), coloring the points by the species.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load a built-in dataset from Seaborn (e.g., Iris dataset)
data = sns.load_dataset('iris')

# Create a pairplot (scatterplot matrix) with histograms on the diagonal
sns.pairplot(data, hue='species', palette='coolwarm')

# Show the plot
plt.show()

In the following example, Seaborn creates a heatmap of monthly passenger counts from the `flights` dataset, after pivoting the data into a month-by-year table.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the flights dataset and pivot it into a month-by-year table
data = sns.load_dataset('flights')
pivot_data = data.pivot(index='month', columns='year', values='passengers')

# Plot a heatmap of passenger counts (fmt='d' prints the annotations as integers)
sns.heatmap(pivot_data, annot=True, fmt='d', cmap='YlGnBu', linewidths=.5)

# Show the plot
plt.show()