


## 1. Introduction to Pandas

**Pandas** is a powerful Python library used for data manipulation and analysis. It is particularly well-suited for handling structured data, like spreadsheets or SQL tables, and allows for easy data cleaning, transformation, and exploration.

Pandas provides two primary data structures:

- **Series:** A one-dimensional array-like object with labeled indices, perfect for handling data like a column in a spreadsheet.
- **DataFrame:** A two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). It's similar to a table in a database or an Excel spreadsheet.

### Why Pandas?

- **Data Handling:** Pandas works efficiently with tabular datasets that fit in memory, and it can import data from many formats, including CSV, Excel, and SQL databases.
- **DataFrames:** DataFrames support filtering, reshaping, and statistical summaries, which makes them central to data exploration and preprocessing.
- **Integration:** Pandas integrates with other data science libraries, such as NumPy, Matplotlib, and scikit-learn, making it a cornerstone of the Python data analysis stack.


Pandas is your go-to tool for data wrangling, offering a suite of functions to clean, modify, analyze, and visualize data.



## 2. Creating Pandas Series and DataFrames

### 2.1 Creating a Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding any data type. It can be thought of as a single column of data. 

For example, you might want to represent the number of vehicles counted at different times of the day at a particular intersection in Kathmandu.


```python
import pandas as pd

# Traffic data for a specific day in Kathmandu
traffic_series = pd.Series([3200, 4000, 4500], index=["Morning", "Noon", "Evening"])
print("Traffic Series for a specific day in Kathmandu:")
print(traffic_series)
```

Output:

```
Traffic Series for a specific day in Kathmandu:
Morning    3200
Noon       4000
Evening    4500
dtype: int64
```
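Because a Series carries labels, its values can be read by label or by position, and arithmetic applies to every element at once. The following is a minimal sketch that reuses the counts from the example above; the variable names `noon_count`, `first_count`, and `share_of_day` are illustrative.

```python
import pandas as pd

# Same illustrative counts as in the Series example above
traffic_series = pd.Series([3200, 4000, 4500], index=["Morning", "Noon", "Evening"])

# Label-based access with .loc, position-based access with .iloc
noon_count = traffic_series.loc["Noon"]
first_count = traffic_series.iloc[0]

# Vectorized arithmetic applies to every element and keeps the labels
share_of_day = traffic_series / traffic_series.sum()

print(noon_count)                # 4000
print(first_count)               # 3200
print(traffic_series.idxmax())   # label of the busiest period: Evening
print(share_of_day.round(3))
```

`idxmax` returns the index label of the largest value, which is often more useful than the value itself when the labels carry meaning.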



### 2.2 Creating a Pandas DataFrame

A Pandas DataFrame is a two-dimensional labeled data structure with columns that can be of different data types. It's like a spreadsheet or a SQL table. 

For instance, you might want to represent daily traffic data from two intersections in Kathmandu using a DataFrame.


```python
# Creating a DataFrame to represent daily traffic at two intersections in Kathmandu
data = {
    "Intersection 1": [3200, 4000, 4500],
    "Intersection 2": [6300, 8800, 6400]
}
traffic_df = pd.DataFrame(data, index=["Morning", "Noon", "Evening"])
print("Traffic DataFrame for two intersections in Kathmandu:")
print(traffic_df)
```

Output:

```
Traffic DataFrame for two intersections in Kathmandu:
         Intersection 1  Intersection 2
Morning            3200            6300
Noon               4000            8800
Evening            4500            6400
```
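To illustrate that columns can hold different data types, here is a small sketch that mixes integer, boolean, and text columns and then inspects the result. The `Peak` and `Surveyor` columns and their values are hypothetical additions, not part of the original traffic data.

```python
import pandas as pd

# "Peak" and "Surveyor" are hypothetical columns added only to show mixed dtypes
mixed_df = pd.DataFrame(
    {
        "Intersection 1": [3200, 4000, 4500],  # integer counts
        "Peak": [False, False, True],          # boolean flag (assumed)
        "Surveyor": ["A", "B", "A"],           # text labels (assumed)
    },
    index=["Morning", "Noon", "Evening"],
)

print(mixed_df.shape)          # (number of rows, number of columns)
print(mixed_df.dtypes)         # one dtype per column
print(list(mixed_df.columns))  # column labels
print(list(mixed_df.index))    # row labels
```

Checking `shape` and `dtypes` right after building or loading a DataFrame is a quick way to confirm the data has the structure you expect.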



## 3. Data Operations in Pandas

Pandas provides various methods for selecting, filtering, and modifying data within a DataFrame.

### 3.1 Selecting and Filtering Data

You can select specific rows or columns in a DataFrame or filter data based on conditions.

#### Example: Selecting Morning Traffic Data


```python
# Selecting traffic data for the morning period
morning_traffic = traffic_df.loc["Morning"]
print(morning_traffic)
```

Output:

```
Intersection 1    3200
Intersection 2    6300
Name: Morning, dtype: int64
```
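Filtering on a condition works by building a boolean mask and passing it to the DataFrame. The sketch below shows column selection, a single condition, a combined condition, and a single-cell lookup; the thresholds `3500` and `7000` are arbitrary cutoffs chosen for illustration.

```python
import pandas as pd

traffic_df = pd.DataFrame(
    {"Intersection 1": [3200, 4000, 4500], "Intersection 2": [6300, 8800, 6400]},
    index=["Morning", "Noon", "Evening"],
)

# Select a single column as a Series
intersection_2 = traffic_df["Intersection 2"]

# Boolean mask: keep rows where Intersection 1 exceeds an assumed threshold
busy_threshold = 3500  # hypothetical cutoff
busy_periods = traffic_df[traffic_df["Intersection 1"] > busy_threshold]

# Combine conditions with & (and) or | (or); each condition needs parentheses
both_busy = traffic_df[
    (traffic_df["Intersection 1"] > busy_threshold)
    & (traffic_df["Intersection 2"] > 7000)
]

# Select one cell by row label and column label
noon_value = traffic_df.loc["Noon", "Intersection 2"]

print(busy_periods)
print(both_busy)
print(noon_value)  # 8800
```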



### 3.2 Modifying Data

You can modify the data in a DataFrame by adding new columns, updating existing data, or performing operations on the data.

#### Example: Adding a Correction Factor to Traffic Data


```python
# Adding a correction factor of 100 vehicles to the counts for Intersection 1
traffic_df["Intersection 1"] = traffic_df["Intersection 1"] + 100
print(traffic_df)
```

Output:

```
         Intersection 1  Intersection 2
Morning            3300            6300
Noon               4100            8800
Evening            4600            6400
```
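Adding a new column follows the same assignment pattern as updating an existing one. The sketch below derives a total column in place and then uses `assign`, which returns a new DataFrame instead of changing the original; the column names `Total` and `share_1` are hypothetical.

```python
import pandas as pd

traffic_df = pd.DataFrame(
    {"Intersection 1": [3300, 4100, 4600], "Intersection 2": [6300, 8800, 6400]},
    index=["Morning", "Noon", "Evening"],
)

# New column computed from existing columns ("Total" is a hypothetical name)
traffic_df["Total"] = traffic_df["Intersection 1"] + traffic_df["Intersection 2"]

# assign returns a new DataFrame and leaves traffic_df unchanged
with_share = traffic_df.assign(
    share_1=traffic_df["Intersection 1"] / traffic_df["Total"]
)

print(traffic_df)
print(with_share.round(3))
```

Using `assign` is convenient when you want to keep the original DataFrame intact while experimenting with derived columns.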



### 3.3 Handling Missing Data

Real-world datasets often contain missing data. Pandas provides several methods to handle this, such as `fillna` for filling missing values and `dropna` for dropping rows with missing data.

#### Example: Filling Missing Data with Default Values


```python
# Introducing missing data: None is stored as NaN, so the column becomes float64
traffic_df["Intersection 3"] = [3200, None, 4500]

# Filling missing data with a default value (0) and keeping the result
traffic_df = traffic_df.fillna(0)
print(traffic_df)
```

Output:

```
         Intersection 1  Intersection 2  Intersection 3
Morning            3300            6300          3200.0
Noon               4100            8800             0.0
Evening            4600            6400          4500.0
```
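Filling with 0 treats a missing count as "no vehicles", which lowers sums and averages. Depending on the analysis, dropping the incomplete rows or filling with a column statistic may be more appropriate. The following sketch shows both alternatives on a reduced copy of the data; choosing the column mean as the fill value is an assumption made for illustration, not a recommendation from the original material.

```python
import pandas as pd

traffic_df = pd.DataFrame(
    {
        "Intersection 1": [3300, 4100, 4600],
        "Intersection 3": [3200, None, 4500],  # Noon reading is missing
    },
    index=["Morning", "Noon", "Evening"],
)

# Count missing values per column before deciding how to handle them
missing_counts = traffic_df.isna().sum()

# Option 1: drop every row that contains at least one missing value
dropped = traffic_df.dropna()

# Option 2: fill the gap with the column mean (skips NaN when averaging)
column_mean = traffic_df["Intersection 3"].mean()
filled = traffic_df.fillna({"Intersection 3": column_mean})

print(missing_counts)
print(dropped)
print(filled)
```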



## 4. Data Manipulation Techniques

Pandas makes it easy to manipulate and transform data. You can sort data, group it, aggregate it, and merge different datasets together.

### 4.1 Sorting Data

You can sort data based on specific columns in ascending or descending order.

#### Example: Sorting Traffic Data by Intersection 1 Counts


```python
# Sorting traffic data by the traffic count in Intersection 1, highest first
sorted_traffic = traffic_df.sort_values(by="Intersection 1", ascending=False)
print(sorted_traffic)
```

Output:

```
         Intersection 1  Intersection 2  Intersection 3
Evening            4600            6400          4500.0
Noon               4100            8800             0.0
Morning            3300            6300          3200.0
```
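Two related operations are sorting by row label and picking the top rows for a column. The sketch below uses `sort_index` and `nlargest` on the same illustrative counts; note that these methods, like `sort_values`, return a new DataFrame and leave the original order unchanged.

```python
import pandas as pd

traffic_df = pd.DataFrame(
    {
        "Intersection 1": [3300, 4100, 4600],
        "Intersection 2": [6300, 8800, 6400],
    },
    index=["Morning", "Noon", "Evening"],
)

# Sort by row label (alphabetical here) instead of by value
by_label = traffic_df.sort_index()

# The two busiest periods at Intersection 2
top_two = traffic_df.nlargest(2, "Intersection 2")

print(by_label)
print(top_two)
print(list(traffic_df.index))  # original order is unchanged
```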



### 4.2 Grouping and Aggregating Data

Grouping and aggregating data allows you to calculate summary statistics for specific groups within your dataset.
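When observations are stored in long format, with one row per measurement, splitting rows into groups is typically done with `groupby`. The following is a minimal sketch; the `records` table and its column names (`Period`, `Intersection`, `Vehicles`) are assumptions made for illustration, reusing the counts from section 2.2.

```python
import pandas as pd

# Long-format records: one row per (period, intersection) observation
records = pd.DataFrame(
    {
        "Period": ["Morning", "Morning", "Noon", "Noon", "Evening", "Evening"],
        "Intersection": ["Intersection 1", "Intersection 2"] * 3,
        "Vehicles": [3200, 6300, 4000, 8800, 4500, 6400],
    }
)

# Total vehicles per period (group keys are sorted alphabetically by default)
per_period = records.groupby("Period")["Vehicles"].sum()

# Several summary statistics per intersection in one call
per_intersection = records.groupby("Intersection")["Vehicles"].agg(["mean", "max"])

print(per_period)
print(per_intersection)
```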

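#### Example: Calculating the Total Traffic for Each Time Period

Because each row of `traffic_df` already holds one time period, a row-wise sum (`axis=1`) gives the total across intersections without an explicit grouping step.

```python
# Calculating the total traffic across all intersections for each time period
total_traffic = traffic_df.sum(axis=1)
print(total_traffic)
```

Output:

```
Morning    12800.0
Noon       12900.0
Evening    15500.0
dtype: float64
```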
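Section 4 also mentions merging datasets. A minimal sketch of `merge` and `concat` follows; the weather observations and the `Night` row are invented for illustration and are not part of the original data.

```python
import pandas as pd

traffic = pd.DataFrame(
    {"Period": ["Morning", "Noon", "Evening"], "Intersection 1": [3300, 4100, 4600]}
)

# Hypothetical weather observations; Evening is deliberately missing
weather = pd.DataFrame({"Period": ["Morning", "Noon"], "Weather": ["Clear", "Rain"]})

# A left join keeps every traffic row; periods without a match get NaN
merged = traffic.merge(weather, on="Period", how="left")

# concat stacks DataFrames that share the same columns
night = pd.DataFrame({"Period": ["Night"], "Intersection 1": [1200]})  # made-up row
stacked = pd.concat([traffic, night], ignore_index=True)

print(merged)
print(stacked)
```

The `how` argument controls which rows survive the join: `"left"` keeps all rows from the first table, while `"inner"` would keep only periods present in both.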



## 5. Python Pandas - IO Tools

Pandas provides a wide array of IO tools to handle data import and export. These functions are essential for reading data from various file formats and writing processed data back to files.

### 5.1 Reading CSV Files with `read_csv`

One of the most common operations is reading data from a CSV file into a Pandas DataFrame. This is particularly useful when dealing with large datasets stored in CSV format, such as census data or traffic logs.

#### Example: Reading Traffic Data from a CSV File

Assume you have a CSV file named `kathmandu_traffic.csv` containing traffic data collected over a week. Let's load this data into a Pandas DataFrame.


```python
# Example of reading a CSV file (assuming the CSV file exists)
# Note: Since we cannot actually read files in this environment, this is a hypothetical example.

# traffic_df = pd.read_csv('kathmandu_traffic.csv')
# print(traffic_df)

# For demonstration, let's simulate what the DataFrame might look like:
data = {
    "Day": ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
    "Intersection 1": [3200, 3400, 3000, 4100, 4500, 5000, 6200],
    "Intersection 2": [6300, 6600, 6200, 7000, 7400, 8000, 9000]
}
traffic_df = pd.DataFrame(data)
print("Simulated Traffic Data from 'kathmandu_traffic.csv':")
print(traffic_df)
```

Output:

```
Simulated Traffic Data from 'kathmandu_traffic.csv':
         Day  Intersection 1  Intersection 2
0     Sunday            3200            6300
1     Monday            3400            6600
2    Tuesday            3000            6200
3  Wednesday            4100            7000
4   Thursday            4500            7400
5     Friday            5000            8000
6   Saturday            6200            9000
```
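To exercise `read_csv` and `to_csv` without a file on disk, CSV text can be written to and read from an in-memory buffer. The sketch below assumes a shortened three-day version of the simulated data; `io.StringIO` stands in for a real file path such as `kathmandu_traffic.csv`.

```python
import io

import pandas as pd

# Shortened version of the simulated weekly data
traffic_df = pd.DataFrame(
    {
        "Day": ["Sunday", "Monday", "Tuesday"],
        "Intersection 1": [3200, 3400, 3000],
        "Intersection 2": [6300, 6600, 6200],
    }
)

# With no path given, to_csv returns the CSV text; index=False omits row labels
csv_text = traffic_df.to_csv(index=False)

# read_csv accepts a file path or any file-like object
loaded = pd.read_csv(io.StringIO(csv_text))

# usecols limits the columns that are read; index_col sets the row labels
indexed = pd.read_csv(
    io.StringIO(csv_text), usecols=["Day", "Intersection 1"], index_col="Day"
)

print(csv_text)
print(loaded)
print(indexed)
```

With a real file, the same calls would take a path instead, for example `traffic_df.to_csv("kathmandu_traffic.csv", index=False)` followed by `pd.read_csv("kathmandu_traffic.csv")`.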


## Summary Table: Introduction to Pandas

| **Section**                      | **Key Points**                                                                                                                                                    |
|----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **1. Introduction to Pandas**    | - Pandas is a powerful Python library for data manipulation and analysis.                                                                                         |
|                                  | - It provides two primary data structures: Series (1D) and DataFrame (2D).                                                                                       |
|                                  | - Pandas integrates well with other data science libraries like NumPy and Matplotlib.                                                                            |
|                                  | - **Use Cases in Kathmandu:** Analyzing census data, transforming traffic data for peak hours, etc.                                                              |
| **2. Creating Pandas Series and DataFrames** | - **Series:** One-dimensional labeled arrays, useful for handling single columns of data.                                                                          |
|                                  | - **DataFrame:** Two-dimensional labeled data structures, similar to tables in databases or spreadsheets.                                                        |
|                                  | - **Example:** Creating a Series and DataFrame to represent traffic data from intersections in Kathmandu.                                                        |
| **3. Data Operations in Pandas** | - **Selecting and Filtering Data:** Choose specific rows/columns or filter based on conditions.                                                                  |
|                                  | - **Modifying Data:** Add new columns, update existing data, or perform operations on the DataFrame.                                                             |
|                                  | - **Handling Missing Data:** Fill missing values or drop rows/columns with missing data.                                                                         |
|                                  | - **Example:** Selecting morning traffic data, adding correction factors, handling missing data in a DataFrame.                                                  |
| **4. Data Manipulation Techniques** | - **Sorting Data:** Sort DataFrame based on specific columns.                                                                                                    |
|                                  | - **Grouping and Aggregating Data:** Group data by specific columns and calculate summary statistics.                                                            |
|                                  | - **Merging and Concatenating DataFrames:** Combine multiple DataFrames to analyze related datasets together.                                                    |
|                                  | - **Example:** Sorting traffic data, calculating total traffic, merging traffic data with weather data.                                                          |
| **5. Python Pandas - IO Tools**  | - **Reading CSV Files:** Use `read_csv` to load data from CSV files into DataFrames.                                                                              |
|                                  | - **Writing to CSV:** Save processed DataFrames back to CSV files for storage or sharing.                                                                         |
|                                  | - **Example:** Simulating the loading of traffic data from a CSV file using `read_csv`.                                                                          |