# 📊 Beginner-Friendly Guide to Pandas

This notebook introduces **Pandas**, one of the most widely used Python libraries for data analysis. We'll cover the basics of Pandas and then practice Exploratory Data Analysis (EDA) on sample datasets loaded through Seaborn.

## 📑 Table of Contents

- **Part 1: Introduction to Pandas**
  - What is Pandas and why use it?
  - Installing and importing Pandas
  - Core data structures: Series and DataFrame
  - Creating Series and DataFrames
  - Reading and writing CSV files
  - Basic DataFrame operations

- **Part 2: Exploratory Data Analysis (EDA) with Pandas**
  - Loading datasets
  - Inspecting data
  - Handling missing data
  - Value counts and unique values
  - Grouping and aggregation
  - Creating new columns
  - Encoding categorical variables

- **Part 3: Visual EDA (with Pandas + Seaborn)**
  - Histogram
  - Bar plot
  - Box plot
  - Correlation heatmap
  - Line plot

- **Part 4: Summary and Practice**
  - Recap
  - Practice exercises


# Part 1: Introduction to Pandas

## What is Pandas and why use it?

- **Pandas** is a Python library for data analysis and manipulation.
- It provides two main data structures: **Series** (1D) and **DataFrame** (2D).
- It makes it easy to clean, analyze, and visualize tabular data.
- It works well with NumPy, Matplotlib, and Seaborn.

## Installing and Importing Pandas

In [1]:
# If not installed: !pip install pandas seaborn
import pandas as pd
import seaborn as sns
import numpy as np

## Core Data Structures: Series and DataFrame

In [2]:
# A Series is like a 1D labeled array
s = pd.Series([10, 20, 30, 40], name="Numbers")
print(s)

# A DataFrame is like a 2D table
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Paris", "London"]
}
df = pd.DataFrame(data)
print(df)
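
To make the relationship between the two structures concrete, here is a minimal sketch showing that a single DataFrame column is itself a Series that shares the DataFrame's row labels. The `people` frame and its `a`/`b` labels are made up for illustration.

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration
people = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30]},
    index=["a", "b"],  # custom row labels instead of the default 0, 1, ...
)

ages = people["Age"]          # selecting one column gives a Series
print(type(ages).__name__)
print(ages.index.tolist())    # the Series keeps the DataFrame's row labels
print(ages.name)              # the Series is named after the column
```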

## Creating Series and DataFrames from Lists, Dicts, NumPy Arrays

In [3]:
# From list
series_from_list = pd.Series([1, 2, 3, 4])
print(series_from_list)

# From dict
df_from_dict = pd.DataFrame({"A": [1,2], "B": [3,4]})
print(df_from_dict)

# From NumPy array
arr = np.array([[1,2,3],[4,5,6]])
df_from_array = pd.DataFrame(arr, columns=["X","Y","Z"])
print(df_from_array)
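
Two other common inputs are a dict passed to `pd.Series` and a list of dicts passed to `pd.DataFrame`. The sketch below uses made-up values (`prices`, `rows`) to show how keys map to labels and what happens when a row lacks a key.

```python
import pandas as pd

# A dict passed to Series: the keys become the index labels
prices = pd.Series({"apple": 1.5, "pear": 2.0})
print(prices)

# A list of dicts passed to DataFrame: each dict is one row;
# a key missing from a row is filled with NaN
rows = [{"A": 1, "B": 3}, {"A": 2}]
df_from_rows = pd.DataFrame(rows)
print(df_from_rows)
```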

## Reading and Writing CSV Files

In [4]:
# Example: Writing and reading a CSV file
df.to_csv("sample.csv", index=False)
read_df = pd.read_csv("sample.csv")
print(read_df)
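
The `index=False` argument matters more than it looks. As a rough illustration, the sketch below writes a small made-up frame to CSV text in memory (no file on disk) with and without the index, then reads the index-free version back.

```python
import io
import pandas as pd

small = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# to_csv returns the CSV text as a string when no path is given
with_index = small.to_csv()
without_index = small.to_csv(index=False)
print(with_index)      # header starts with an empty, unnamed index column
print(without_index)   # header contains only the real columns

# Reading the index-free text back reproduces the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.columns.tolist())
```

Without `index=False`, reading the file back would produce an extra `Unnamed: 0` column.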

## Basic DataFrame Operations

In [5]:
# Viewing data
print(df.head())      # first 5 rows (all 3 here)
print(df.tail())      # last 5 rows (all 3 here)
df.info()             # column types and non-null counts; prints directly and returns None
print(df.describe())  # summary statistics for numeric columns

In [6]:
# Selecting columns and rows
print(df["Name"])  # column
print(df.loc[0])   # row by label
print(df.iloc[1])  # row by index position
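
With the default index (0, 1, 2, ...) `loc` and `iloc` return the same row, which hides the difference between them. A small sketch with deliberately shuffled labels (the `scores` frame is made up for illustration) shows where they diverge:

```python
import pandas as pd

scores = pd.DataFrame(
    {"score": [90, 80, 70]},
    index=[2, 0, 1],  # row labels deliberately out of order
)

by_label = scores.loc[0, "score"]   # row whose label is 0 (the second row)
by_position = scores.iloc[0, 0]     # row at position 0 (the first row)
print(by_label, by_position)
```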

In [7]:
# Filtering rows
print(df[df["Age"] > 25])
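
Filters can be combined. The sketch below rebuilds the same three-row table under the hypothetical name `people` and applies two conditions at once. Note that pandas uses `&` and `|` rather than Python's `and` and `or`, and each condition needs its own parentheses.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Paris", "London"],
})

# & means "and", | means "or"; wrap each condition in parentheses
over_25_not_paris = people[(people["Age"] > 25) & (people["City"] != "Paris")]
print(over_25_not_paris)

# isin checks membership in a list of values
in_europe = people[people["City"].isin(["Paris", "London"])]
print(in_europe)
```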

In [8]:
# Sorting values
print(df.sort_values("Age", ascending=False))

In [9]:
# Renaming and dropping columns
renamed_df = df.rename(columns={"Name": "FullName"})
print(renamed_df)
print(df.drop("City", axis=1))
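
Both `rename` and `drop` return a new DataFrame and leave the original untouched, which is why `df` above still has its `Name` and `City` columns. A minimal sketch with a made-up one-row frame:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice"], "City": ["Paris"]})

dropped = people.drop(columns="City")  # same effect as drop("City", axis=1)
print(dropped.columns.tolist())   # the new frame has lost the column
print(people.columns.tolist())    # the original is unchanged

# To keep a change, assign the result back to a variable
people = people.rename(columns={"Name": "FullName"})
print(people.columns.tolist())
```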

# Part 2: Exploratory Data Analysis (EDA) with Pandas

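Let's use Seaborn's sample dataset **tips**. `sns.load_dataset` downloads it from Seaborn's online data repository the first time it is called, so an internet connection is needed.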

In [10]:
tips = sns.load_dataset("tips")
print(tips.head())

In [11]:
# Checking data shape, types, missing values
print(tips.shape)           # (rows, columns)
tips.info()                 # prints directly and returns None, so no print() needed
print(tips.isnull().sum())  # missing values per column

In [12]:
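# Handling missing data (tips has no missing values, so the shape is unchanged)
print(tips.dropna().shape)  # drop rows that contain missing values
# Fill numeric columns only: 0 is not a valid value for the categorical columns
print(tips.fillna({"total_bill": 0, "tip": 0}).head())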
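
Because `tips` has no gaps, the calls above leave the data unchanged. To see them do something, here is a sketch on a tiny made-up frame (`bills`) that does contain `NaN` values, filling each column with its own mean rather than a constant:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, for illustration only
bills = pd.DataFrame({
    "total_bill": [10.0, np.nan, 30.0],
    "tip": [1.0, 2.0, np.nan],
})

print(bills.isnull().sum())   # one missing value in each column
print(bills.dropna().shape)   # only the first row has no gaps

# Passing a Series of per-column means fills each column with its own mean
filled = bills.fillna(bills.mean())
print(filled)
```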

In [13]:
# Value counts and unique values
print(tips["day"].value_counts())
print(tips["smoker"].unique())
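
Two related helpers are worth knowing: `value_counts(normalize=True)` returns shares instead of counts, and `nunique()` returns the number of distinct values. A short sketch on a made-up Series:

```python
import pandas as pd

days = pd.Series(["Sat", "Sun", "Sat", "Thur"])

shares = days.value_counts(normalize=True)  # fractions that sum to 1
print(shares)
print(days.nunique())                       # number of distinct values
```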

In [14]:
# Grouping and aggregation
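# observed=True keeps only the category values that actually occur in the data
print(tips.groupby("day", observed=True)["tip"].mean())
print(tips.groupby("sex", observed=True).agg({"total_bill": "mean", "tip": "max"}))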
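
The dict form of `agg` keeps the original column names, which can be ambiguous (is `tip` a mean or a max?). One alternative is named aggregation, sketched here on a small made-up frame; the output names `avg_bill` and `max_tip` are arbitrary choices.

```python
import pandas as pd

meals = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "total_bill": [10.0, 20.0, 40.0, 30.0],
    "tip": [1.0, 3.0, 5.0, 4.0],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = meals.groupby("sex").agg(
    avg_bill=("total_bill", "mean"),
    max_tip=("tip", "max"),
)
print(summary)
```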

In [15]:
# Creating new columns
tips["tip_percent"] = tips["tip"] / tips["total_bill"] * 100
print(tips.head())

In [16]:
# Encoding categorical variables
encoded = pd.get_dummies(tips["sex"])
print(encoded.head())
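
Calling `get_dummies` on a single column returns only the indicator columns. To keep the rest of the table alongside them, one option is to pass the whole DataFrame with `columns=[...]`, as in this sketch on a made-up two-row frame:

```python
import pandas as pd

meals = pd.DataFrame({"sex": ["Female", "Male"], "tip": [1.0, 3.0]})

# The other columns are kept, and the new indicator columns
# are prefixed with the name of the column they came from
encoded_meals = pd.get_dummies(meals, columns=["sex"])
print(encoded_meals.columns.tolist())
```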

# Part 3: Visual EDA (with Pandas + Seaborn)

In [17]:
# Histogram
tips["total_bill"].hist()

In [18]:
# Bar plot
sns.barplot(x="day", y="total_bill", data=tips)

In [19]:
# Box plot
sns.boxplot(x="day", y="total_bill", data=tips)

In [20]:
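# Correlation heatmap (numeric columns only; recent pandas versions raise an error on text or categorical columns)
sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap="coolwarm")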
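
The heatmap is a colored view of the correlation matrix, so it can help to inspect that matrix directly. The sketch below uses a small made-up frame and assumes pandas 1.5 or later, where `corr` accepts `numeric_only`.

```python
import pandas as pd

meals = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0],
    "tip": [1.0, 2.0, 3.0],
    "day": ["Sat", "Sun", "Sat"],  # text column, skipped by numeric_only=True
})

corr = meals.corr(numeric_only=True)
print(corr)  # square table: one row and one column per numeric column
```

Here `tip` is exactly 10% of `total_bill` in every row, so their correlation is 1.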

In [21]:
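# Line plot (using flights dataset; the 12 monthly values per year are averaged,
# and the shaded band is a confidence interval around that mean)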
flights = sns.load_dataset("flights")
sns.lineplot(x="year", y="passengers", data=flights)

# Part 4: Summary and Practice

## Recap
- **Pandas** helps in handling tabular data.
- We covered creating DataFrames, selecting and filtering rows, grouping and aggregating, and handling missing values.
- Used Seaborn datasets for EDA.
- Visualized data with Pandas + Seaborn.

## Practice Exercises
1. Find the average tip for each day in the `tips` dataset.
2. The `flights` dataset has no missing values as loaded, so set a few `passengers` values in a copy to `NaN`, then fill them with the column median.
3. Create a new column in `tips` indicating if the total bill is above the median.
4. Plot a histogram of `tip_percent` from the `tips` dataset.
5. Use `groupby` to find the average passengers per year in the `flights` dataset.