## **What is Pandas?**
- Pandas is the most widely used Python library for manipulating and analyzing tabular data. Think of it as a programmable spreadsheet (like Excel or Google Sheets) that scales to datasets too large to open comfortably in a spreadsheet and keeps every step as reproducible code.
- It introduces two primary data structures: the Series (1-dimensional) and the DataFrame (2-dimensional).
- **Why Pandas?**
    1. **Labeled Data:** Unlike NumPy arrays, which are accessed by integer position, Pandas allows you to use custom labels for rows (the Index) and columns. This makes your data and code much more intuitive and readable (e.g., `df['age']` instead of `arr[:, 3]`).
    2. **Handling Different Data Types:** A single DataFrame can contain columns of different types (integers, floats, strings, dates, etc.), just like a real-world dataset.
    3. **Data I/O:** It provides readers and writers for many formats (CSV, Excel, SQL databases, JSON, and more), usually as a single function call such as `pd.read_csv`.
    4. **Rich Functionality:** It has a massive set of functions for cleaning, transforming, merging, reshaping, and analyzing data.
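
To make the "Labeled Data" point concrete, here is a minimal sketch comparing positional access in NumPy with label-based access in Pandas. The array contents and the column names `height_cm` and `age` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each row is (height_cm, age).
arr = np.array([[170, 25], [182, 30], [165, 35]])

# NumPy: you have to remember that age lives in column position 1.
ages_by_position = arr[:, 1]

# Pandas: the same data with column labels attached.
df = pd.DataFrame(arr, columns=["height_cm", "age"])
ages_by_label = df["age"]

print(ages_by_position)
print(ages_by_label.tolist())
```

Both lines print the same three ages; the difference is that `df["age"]` still reads correctly if the column order changes later.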

## **1. Importing Pandas**
By convention, Pandas is always imported with the alias **pd**.

In [1]:
import pandas as pd
import numpy as np # It's standard practice to import numpy alongside pandas

## **2. The Pandas Series**
A Series is a one-dimensional labeled array, much like a single column in a spreadsheet. It consists of two main parts: the **data** and the **index**.

In [2]:
# Creating a Series from a list
numbers = [10, 20, 30, 40]
s1 = pd.Series(numbers)
print("Series from list:")
print(s1)

# Notice the default integer index on the left and the data on the right.
print(f"\nValues: {s1.values}") # The data is a NumPy array!
print(f"Index: {s1.index}")

# Creating a Series with a custom index
ages = {'Alice': 25, 'Bob': 30, 'Charlie': 35}
s2 = pd.Series(ages)
print("\nSeries from dictionary (keys become index):")
print(s2)

# Accessing data using the index label
print(f"\nBob's age: {s2['Bob']}")

Series from list:
0    10
1    20
2    30
3    40
dtype: int64

Values: [10 20 30 40]
Index: RangeIndex(start=0, stop=4, step=1)

Series from dictionary (keys become index):
Alice      25
Bob        30
Charlie    35
dtype: int64

Bob's age: 30
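
A dictionary is not the only way to get a custom index. The sketch below passes the labels through the `index=` argument and then contrasts label-based lookup (`.loc`) with position-based lookup (`.iloc`). The variable names are illustrative only.

```python
import pandas as pd

# Same ages as above, with the labels passed explicitly via index=.
s = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"])

by_label = s.loc["Bob"]               # label-based lookup
by_position = s.iloc[1]               # position-based lookup (second element)
subset = s.loc[["Alice", "Charlie"]]  # a list of labels returns a Series

print(by_label, by_position)
print(subset)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or a position.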


## **3. The Pandas DataFrame**
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It's the most important object in Pandas. Think of it as a spreadsheet, a SQL table, or a dictionary of Series objects.
- **Creating a DataFrame**
  - **From a Dictionary of Lists (most common):**

In [4]:
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print("DataFrame from a dictionary of lists:")
df

DataFrame from a dictionary of lists:


|   | name | age | city |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
| 3 | David | 40 | Houston |
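
The "dictionary of Series" description can be checked directly: selecting one column from a DataFrame returns a Series. The sketch below also shows a second common way to build a DataFrame, from a list of row dictionaries. The records are hypothetical sample data.

```python
import pandas as pd

# Hypothetical records, one dictionary per row.
records = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "Los Angeles"},
]
people = pd.DataFrame(records)

age_column = people["age"]  # selecting one column returns a Series
print(type(age_column).__name__)
print(people.shape)
```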


**Setting a different column as the index:**

In [8]:
df_with_name_index = df.set_index('name')
print("\nDataFrame with 'name' as the index:")
df_with_name_index


DataFrame with 'name' as the index:


| name | age | city |
|---|---|---|
| Alice | 25 | New York |
| Bob | 30 | Los Angeles |
| Charlie | 35 | Chicago |
| David | 40 | Houston |
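
Two details of `set_index` are easy to miss: it returns a new DataFrame rather than modifying the original, and once a column is the index you can look rows up by label with `.loc`. A minimal sketch, rebuilding the same small table so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 40],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
})

indexed = df.set_index("name")         # returns a new DataFrame; df is unchanged
bob_city = indexed.loc["Bob", "city"]  # row label, then column label
restored = indexed.reset_index()       # moves 'name' back to a regular column

print(bob_city)
print(list(df.columns))
print(list(restored.columns))
```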


## **4. Reading Data from Files**
This is the most common way you'll create a DataFrame in the real world.

**Reading a CSV: simple_sales.csv**

In [8]:
# Make sure 'simple_sales.csv' is in the same directory as your notebook
sales_df = pd.read_csv("simple_sales.csv")
sales_df

|   | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price |
|---|---|---|---|---|---|---|
| 0 | 1001 | 2023-01-15 | Electronics | Laptop | 2 | 1200 |
| 1 | 1002 | 2023-01-16 | Office Supplies | Pen Set | 10 | 15 |
| 2 | 1003 | 2023-01-16 | Electronics | Mouse | 5 | 25 |
| 3 | 1004 | 2023-01-17 | Home Goods | Coffee Maker | 1 | 80 |
| 4 | 1005 | 2023-01-18 | Office Supplies | Notebook | 20 | 5 |
| 5 | 1006 | 2023-01-18 | Electronics | Laptop | 1 | 1250 |
| 6 | 1007 | 2023-01-19 | Home Goods | Blender | 2 | 50 |
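
If you don't have `simple_sales.csv` on disk, you can still follow along. `pd.read_csv` accepts a file-like object as well as a path, so the sketch below feeds it CSV text from memory. The text is typed in from the first three rows of the table above, on the assumption that the file is a plain comma-separated file with a header row; `csv_text` and `sample_df` are illustrative names.

```python
import io
import pandas as pd

# First three rows of the sales table, typed in as CSV text (assumed layout).
csv_text = """Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price
1001,2023-01-15,Electronics,Laptop,2,1200
1002,2023-01-16,Office Supplies,Pen Set,10,15
1003,2023-01-16,Electronics,Mouse,5,25
"""

# read_csv accepts a path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```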


## **5. Inspecting a DataFrame**
Once you have a DataFrame, these are the methods and attributes you'll typically reach for first to understand it.
- **.head(n):** Shows the first n rows (default is 5).
- **.tail(n):** Shows the last n rows (default is 5).
- **.info():** Provides a concise summary: index type, column types, non-null counts, and memory usage. It is usually the quickest way to spot missing values and columns that were read with an unexpected type.
- **.describe():** Generates descriptive statistics for numerical columns (count, mean, std, min, max, etc.).
- **.shape:** A tuple representing the dimensions (rows, columns).
- **.columns:** An index object containing the column labels.
- **.dtypes:** Shows the data type of each column.

In [13]:
print("--- First 5 rows of the sales data ---")
sales_df.head()

--- First 5 rows of the sales data ---


|   | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price |
|---|---|---|---|---|---|---|
| 0 | 1001 | 2023-01-15 | Electronics | Laptop | 2 | 1200 |
| 1 | 1002 | 2023-01-16 | Office Supplies | Pen Set | 10 | 15 |
| 2 | 1003 | 2023-01-16 | Electronics | Mouse | 5 | 25 |
| 3 | 1004 | 2023-01-17 | Home Goods | Coffee Maker | 1 | 80 |
| 4 | 1005 | 2023-01-18 | Office Supplies | Notebook | 20 | 5 |


In [14]:
print("\n--- Concise summary of the DataFrame ---")
sales_df.info()


--- Concise summary of the DataFrame ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    7 non-null      int64 
 1   Date              7 non-null      object
 2   Product Category  7 non-null      object
 3   Product Name      7 non-null      object
 4   Units Sold        7 non-null      int64 
 5   Unit Price        7 non-null      int64 
dtypes: int64(3), object(3)
memory usage: 468.0+ bytes
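
One thing the summary above reveals is that the `Date` column was read as text (`object`), not as dates. A minimal sketch of one way to fix that, using `pd.to_datetime` on a small hypothetical stand-in for the sales data (the name `sales_sample` and its values are for illustration):

```python
import pandas as pd

# Hypothetical mini version of the sales data, with dates stored as strings.
sales_sample = pd.DataFrame({
    "Date": ["2023-01-15", "2023-01-16", "2023-01-18"],
    "Units Sold": [2, 10, 20],
})
print(sales_sample["Date"].dtype)  # a text dtype, not a date dtype

# Convert the strings to real datetime values.
sales_sample["Date"] = pd.to_datetime(sales_sample["Date"])
print(sales_sample["Date"].dtype)

# Date-specific operations are now available through the .dt accessor.
print(sales_sample["Date"].dt.day_name().tolist())
```

Alternatively, `pd.read_csv` has a `parse_dates` parameter that performs the conversion while reading the file.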


In [18]:
print("\n--- Descriptive statistics for numerical columns ---")
sales_df.describe().T


--- Descriptive statistics for numerical columns ---


|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Transaction ID | 7.0 | 1004.0 | 2.160247 | 1001.0 | 1002.5 | 1004.0 | 1005.5 | 1007.0 |
| Units Sold | 7.0 | 5.857143 | 7.010197 | 1.0 | 1.5 | 2.0 | 7.5 | 20.0 |
| Unit Price | 7.0 | 375.0 | 581.36334 | 5.0 | 20.0 | 50.0 | 640.0 | 1250.0 |
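
By default `.describe()` on a DataFrame skips text columns such as `Product Category`. Calling `.describe()` on a single text column gives a different summary (count, number of unique values, most frequent value and its frequency), and `.value_counts()` gives the full frequency table. A minimal sketch on a hypothetical subset of the sales data:

```python
import pandas as pd

# Hypothetical subset of the sales columns.
sample = pd.DataFrame({
    "Product Category": ["Electronics", "Office Supplies", "Electronics", "Home Goods"],
    "Units Sold": [2, 10, 5, 1],
})

# Summarize the text column by selecting it first.
category_summary = sample["Product Category"].describe()
print(category_summary)

# Frequency of each category, most common first.
category_counts = sample["Product Category"].value_counts()
print(category_counts)
```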


In [22]:
print(f"\nShape of the DataFrame: {sales_df.shape}")
print(f"\nColumn labels: {sales_df.columns}")
print(f"\nData types of each column:\n{sales_df.dtypes}")


Shape of the DataFrame: (7, 6)

Column labels: Index(['Transaction ID', 'Date', 'Product Category', 'Product Name',
       'Units Sold', 'Unit Price'],
      dtype='object')

Data types of each column:
Transaction ID       int64
Date                object
Product Category    object
Product Name        object
Units Sold           int64
Unit Price           int64
dtype: object


## **Exercises**

**1. Series Creation:**
- Create a Pandas Series named countries from a Python list of strings: ["USA", "Canada", "Mexico", "Brazil"].
- Create another Pandas Series named capitals from a Python list of strings: ["Washington D.C.", "Ottawa", "Mexico City", "Brasília"].
- Now, create a Series where the capitals are the data and the countries are the index. Print this Series.

In [28]:
countries = pd.Series(["USA", "Canada", "Mexico", "Brazil"])
capitals  = pd.Series(["Washington D.C.", "Ottawa", "Mexico City", "Brasília"])
countries_capital = pd.Series(data = capitals.values, index = countries.values)

print(f"Series with Capitals as Data and Countries as Index:\n{countries_capital}")

Series with Capitals as Data and Countries as Index:
USA       Washington D.C.
Canada             Ottawa
Mexico        Mexico City
Brazil           Brasília
dtype: object


**2. DataFrame Creation:**
- Create a Python dictionary to store data about students:
- `student_data = {
    'student_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['CS', 'Physics', 'Math', 'CS']
}`
- Convert this dictionary into a Pandas DataFrame.
- Set the student_id column as the index of the DataFrame.
- Print the final DataFrame.

In [6]:
student_data = {
    'student_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['CS', 'Physics', 'Math', 'CS']
}

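# Use a dedicated name so the earlier `df` from Section 3 is not overwritten
students_df = pd.DataFrame(student_data)
print(f"Student Data:\n{students_df}")
df_with_id_index = students_df.set_index('student_id')
df_with_id_index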

Student Data:
   student_id     name    major
0         101    Alice       CS
1         102      Bob  Physics
2         103  Charlie     Math
3         104    David       CS


| student_id | name | major |
|---|---|---|
| 101 | Alice | CS |
| 102 | Bob | Physics |
| 103 | Charlie | Math |
| 104 | David | CS |


**3. DataFrame Inspection:**
- Using the sales_df DataFrame you created by reading simple_sales.csv:
- Print the last 3 rows.
- Print the data type of just the Unit Price column.
- Calculate and print the total number of Units Sold across all transactions (Hint: select the column and call a familiar aggregation method like .sum()).

In [9]:
sales_df.tail(3)

|   | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price |
|---|---|---|---|---|---|---|
| 4 | 1005 | 2023-01-18 | Office Supplies | Notebook | 20 | 5 |
| 5 | 1006 | 2023-01-18 | Electronics | Laptop | 1 | 1250 |
| 6 | 1007 | 2023-01-19 | Home Goods | Blender | 2 | 50 |


In [21]:
# A single column is a Series, so use .dtype (singular) for its data type
print(f"\nData Type of Unit Price column: {sales_df['Unit Price'].dtype}")


Data Type of Unit Price column: int64


In [22]:
print(f"\nTotal Units Sold: {sales_df['Units Sold'].sum()}")


Total Units Sold: 41
