# Pandas

## 🔑 Core Topics to Learn in Pandas

### 1. **Pandas Basics**

* What is Pandas? Why use it?
* Installing and importing (`import pandas as pd`)
* Pandas data structures:

  * **Series** (1D)
  * **DataFrame** (2D)

---

### 2. **Data Input/Output (I/O)**

* Reading data:
  `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()`, `pd.read_sql()`
* Writing data:
  `.to_csv()`, `.to_excel()`, `.to_json()`

---

### 3. **Exploring Data**

* `df.head()`, `df.tail()`
* `df.info()`, `df.shape`
* `df.describe()`
* Checking columns & index → `df.columns`, `df.index`

---

### 4. **Selection & Indexing**

* Column selection: `df['col']`, `df[['col1','col2']]`
* Row selection: `.iloc[]` (by position), `.loc[]` (by label)
* Boolean indexing: `df[df['age'] > 30]`
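
The difference between label-based and position-based selection is easiest to see on a frame whose index is not the default 0, 1, 2. A minimal sketch, assuming a small made-up frame (the column names and index labels are illustrative only):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cara"], "age": [28, 35, 41]},
    index=["x", "y", "z"],
)

by_position = df.iloc[0]         # first row, whatever its label is
by_label = df.loc["y"]           # row whose index label is "y"
label_slice = df.loc["x":"y"]    # label slices include both endpoints
position_slice = df.iloc[0:1]    # position slices exclude the stop
over_30 = df[df["age"] > 30]     # boolean mask keeps the matching rows
```

Note that `.loc` slices are inclusive while `.iloc` slices follow normal Python slicing.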

---

### 5. **Data Cleaning**

* Handling missing values: `df.isnull()`, `df.fillna()`, `df.dropna()`
* Removing duplicates: `df.drop_duplicates()`
* String operations: `df['col'].str.lower()`, `.str.contains()`

---

### 6. **Data Transformation**

* Renaming columns: `df.rename()`
* Changing data types: `df.astype()`
* Replacing values: `df.replace()`
* Apply functions: `df.apply()`, `df.applymap()` (renamed to `df.map()` in pandas 2.1, where `applymap` is deprecated)

---

### 7. **Sorting & Filtering**

* Sorting: `df.sort_values(by='col')`
* Filtering conditions: multiple conditions with `&` and `|`

---

### 8. **Aggregation & Grouping**

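* `df.groupby('col').mean()` (pass `numeric_only=True` if the frame has non-numeric columns; pandas 2.0+ raises instead of silently dropping them)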
* Aggregations: `.sum()`, `.mean()`, `.count()`
* Multiple aggregations: `.agg({'col1':'mean', 'col2':'sum'})`
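
As a rough illustration of the grouping calls listed above, the sketch below aggregates a tiny invented table. The `scores` frame and its column names are assumptions made for the example.

```python
import pandas as pd

scores = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [10, 20, 30, 50],
    "fouls": [1, 3, 2, 2],
})

# One aggregation per column, keyed by column name
summary = scores.groupby("team").agg({"points": "mean", "fouls": "sum"})

# A single aggregation on one selected column
total_points = scores.groupby("team")["points"].sum()
```

Selecting the column before aggregating (`["points"]`) avoids computing statistics for columns you do not need.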

---

### 9. **Merging, Joining & Concatenation**

* `pd.concat([df1, df2])`
* `pd.merge(df1, df2, on='key')`
* Different join types: inner, outer, left, right
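
The four join types differ only in which keys survive the merge. A small sketch with two invented frames (`left`, `right`, and their columns are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = pd.merge(left, right, on="key", how="inner")        # keys present in both
outer = pd.merge(left, right, on="key", how="outer")        # keys present in either
left_join = pd.merge(left, right, on="key", how="left")     # every key from `left`
right_join = pd.merge(left, right, on="key", how="right")   # every key from `right`

# concat stacks frames vertically; ignore_index renumbers the rows
stacked = pd.concat([left, left], ignore_index=True)
```

Rows without a match on the other side get `NaN` in the columns that come from that side.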

---

### 10. **Reshaping Data**

* Pivot tables: `df.pivot_table()`
* Melting: `pd.melt()`
* Stacking & unstacking
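
Melting and pivoting are roughly inverse operations, which a round trip makes visible. A minimal sketch, assuming an invented table of test scores (stacking and unstacking are left out here):

```python
import pandas as pd

wide = pd.DataFrame({"name": ["Ann", "Ben"], "math": [90, 80], "art": [70, 85]})

# Wide -> long: one row per (name, subject) pair
long_df = pd.melt(wide, id_vars=["name"], var_name="subject", value_name="score")

# Long -> wide again, averaging any duplicate (name, subject) pairs
table = long_df.pivot_table(
    index="name", columns="subject", values="score", aggfunc="mean"
)
```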

---

### 11. **Time Series**

* Parsing dates: `pd.to_datetime()`
* Setting datetime index
* Resampling: `df.resample('M').mean()` (month-end; pandas 2.2+ spells this alias `'ME'` and deprecates `'M'`)
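
The three steps above (parse, index, resample) can be sketched on generated data. The frame below is invented for illustration, and the example resamples weekly (`'W'`) because that alias behaves the same across pandas versions.

```python
import pandas as pd

raw = pd.DataFrame({
    # Dates stored as plain strings, as they often arrive from a CSV
    "date": list(pd.date_range("2024-01-01", periods=14, freq="D").strftime("%Y-%m-%d")),
    "sales": list(range(1, 15)),
})

raw["date"] = pd.to_datetime(raw["date"])   # strings -> datetime64
indexed = raw.set_index("date")             # resample needs a datetime index
weekly = indexed["sales"].resample("W").sum()   # weekly bins ending on Sunday
```

2024-01-01 is a Monday, so the 14 days fall into exactly two Sunday-ending weeks.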

---

### 12. **Visualization (with Pandas)**

* `df['col'].plot(kind='hist')`
* `df.plot(kind='bar')`, `df.plot(kind='line')`

---

## 🔥 Extended Core Topics in Pandas

### 13. **Advanced Indexing**

* MultiIndex (hierarchical indexing):
  `df.set_index(['col1','col2'])`
  `df.loc[('A', 'B')]`
* Index alignment in operations
* Reindexing: `df.reindex()`
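
A short sketch of the three ideas above, using invented data (the `fares` frame and the series `a` and `b` are assumptions for the example):

```python
import pandas as pd

fares = pd.DataFrame({
    "cls": [1, 1, 2, 2],
    "sex": ["f", "m", "f", "m"],
    "fare": [80.0, 60.0, 20.0, 15.0],
})

# Two columns become a hierarchical index; sorting it keeps lookups efficient
indexed = fares.set_index(["cls", "sex"]).sort_index()
one_cell = indexed.loc[(1, "f"), "fare"]   # full key -> single value
first_class = indexed.loc[1]               # partial key -> all rows with cls == 1

# Arithmetic aligns on labels; labels missing on either side produce NaN
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
aligned_sum = a + b

# reindex conforms a Series to a new set of labels, filling gaps on request
filled = a.reindex(["x", "y", "z"], fill_value=0)
```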

---

### 14. **Window Functions**

* Rolling window: `df['col'].rolling(7).mean()`
* Expanding: `df['col'].expanding().sum()`
* Exponentially weighted: `df['col'].ewm(span=5).mean()`
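
To compare the three window types side by side, the sketch below applies each to the same short series. The window sizes are shrunk from the bullets above so the results are easy to check by hand.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

rolling_mean = s.rolling(3).mean()    # NaN until 3 values are available
running_total = s.expanding().sum()   # window grows from the first row
smoothed = s.ewm(span=3).mean()       # recent values carry more weight
```

A rolling window has a fixed size, an expanding window always starts at the first row, and an exponentially weighted window never fully forgets older values.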

---

### 15. **Categorical Data**

* `pd.Categorical()` for memory-efficient storage
* `.cat.codes`, `.cat.categories`
* Useful when you have many repeated string labels
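
The memory claim can be checked directly. This sketch uses `astype("category")`, which is a common alternative to calling `pd.Categorical()` yourself. The repeated labels are invented for the example.

```python
import pandas as pd

labels = pd.Series(["low", "high", "low", "mid"] * 1000)
as_category = labels.astype("category")

categories = list(as_category.cat.categories)   # unique labels, sorted
codes = as_category.cat.codes                   # small integers pointing into categories

bytes_plain = labels.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
```

Each row stores only a small integer code, so the saving grows with the number of repeated labels.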

---

### 16. **Performance Optimization**

* Vectorization vs loops (`apply` vs native methods)
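* Using `.query()` and `.eval()` for filtering and expressions (can be faster on large frames when `numexpr` is installed; on small frames plain boolean indexing is usually at least as fast)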
* Chunk processing with `pd.read_csv(..., chunksize=)`
* Memory usage check: `df.memory_usage(deep=True)`
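
A small sketch of the calls listed above. An in-memory CSV string stands in for a large file, and the `people` frame is invented for the example.

```python
import io
import pandas as pd

people = pd.DataFrame({"age": [25, 35, 45], "city": ["NY", "LA", "NY"]})

# The same filter written two ways
mask_result = people[(people["age"] > 30) & (people["city"] == "NY")]
query_result = people.query("age > 30 and city == 'NY'")

# Chunked reading: process a few rows at a time instead of loading everything
csv_text = "x\n1\n2\n3\n4\n5\n"
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["x"].sum())
    n_chunks += 1

# Per-column memory in bytes; deep=True also counts the string contents
bytes_per_column = people.memory_usage(deep=True)
```

On a frame this small the timing difference is not measurable, so the example only shows that both filters return the same rows.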

---

### 17. **Sparse & Large Data Handling**

* Sparse data structures (`pd.Series.sparse`)
* Efficient storage for large datasets
* Working with HDF5/Parquet: `pd.read_parquet()`, `df.to_parquet()` (Parquet support needs `pyarrow` or `fastparquet` installed)
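
A minimal sketch of the sparse accessor, assuming a mostly-zero series made up for the example. Parquet is left out because it depends on an optional engine being installed.

```python
import pandas as pd

dense = pd.Series([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0])

# Store only the values that differ from fill_value
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

density = sparse.sparse.density         # fraction of values actually stored
round_trip = sparse.sparse.to_dense()   # back to an ordinary Series
```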

---

### 18. **Advanced Merging & Joins**

* Merge with multiple keys
* Cross joins (`how='cross'`)
* Index-based joins
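
Each of the three join variants above fits in a few lines. All frames and column names here are invented for illustration. `how="cross"` requires pandas 1.2 or newer.

```python
import pandas as pd

passengers = pd.DataFrame(
    {"cls": [1, 1, 2], "sex": ["f", "m", "f"], "name": ["Ann", "Ben", "Cara"]}
)
rates = pd.DataFrame(
    {"cls": [1, 1, 2], "sex": ["f", "m", "f"], "rate": [0.9, 0.4, 0.8]}
)

# Multiple keys: rows match only when every key column agrees
multi_key = passengers.merge(rates, on=["cls", "sex"], how="left")

# Cross join: every row of one frame paired with every row of the other
sizes = pd.DataFrame({"size": ["S", "M"]})
colors = pd.DataFrame({"color": ["red", "blue", "green"]})
cross = sizes.merge(colors, how="cross")

# Index-based join: DataFrame.join matches on index labels
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [10, 20]}, index=["y", "z"])
by_index = left.join(right, how="inner")
```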

---

### 19. **Styling & Reporting**

* `df.style` for pretty outputs (color scales, highlights)
* `.to_html()`, `.to_latex()` for exporting reports

---

### 20. **Integration with Other Libraries**

* NumPy: vectorized operations on the underlying array (`df.to_numpy()`, which the pandas docs recommend over `df.values`)
* Matplotlib/Seaborn: `.plot()` integration
* Scikit-learn: using Pandas DataFrames as ML input/output

---

### 21. **Advanced Time Series**

* Date offsets (`pd.DateOffset`)
* Shifting and lagging: `df['col'].shift(1)`
* Rolling (as-of) joins on time-based keys, e.g. `pd.merge_asof()`
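
The sketch below touches each item above on invented price and trade data. It assumes that "rolling joins" refers to as-of joins, which pandas provides as `pd.merge_asof()`.

```python
import pandas as pd

prices = pd.DataFrame({
    "time": pd.to_datetime(
        ["2024-01-01 09:00", "2024-01-01 09:05", "2024-01-01 09:10"]
    ),
    "price": [100.0, 101.0, 103.0],
})
trades = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 09:03", "2024-01-01 09:10"]),
    "qty": [5, 7],
})

# Lag: each row sees the previous row's price
previous_price = prices["price"].shift(1)

# Calendar arithmetic with an offset
shifted_date = pd.Timestamp("2024-01-01") + pd.DateOffset(days=30)

# As-of join: each trade gets the latest price at or before its time
# (both frames must already be sorted by the key)
matched = pd.merge_asof(trades, prices, on="time")
```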

---

### 22. **Multi-Dataset Analysis**

* Combining multiple datasets (like in data engineering/ETL pipelines)
* `pd.concat()` with hierarchical keys
* Panel-like analysis with `groupby` + reshaping
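
Hierarchical keys record which source each row came from. A minimal sketch with two invented monthly frames:

```python
import pandas as pd

jan = pd.DataFrame({"sales": [10, 20]})
feb = pd.DataFrame({"sales": [30, 40]})

# keys adds an outer index level naming the source of each block of rows
combined = pd.concat([jan, feb], keys=["jan", "feb"], names=["month", "row"])

feb_only = combined.loc["feb"]                               # select one source
per_month = combined.groupby(level="month")["sales"].sum()   # aggregate per source
```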

---

### 23. **Testing & Validation**

* Assertions: `pd.testing.assert_frame_equal(df1, df2)`
* Data validation with constraints (unique, non-null, ranges)
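
One possible shape for these checks is sketched below. The `validate` helper and its rules are hypothetical and written for this example. Only `pd.testing.assert_frame_equal` comes from pandas itself.

```python
import pandas as pd

expected = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28]})
actual = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28]})

# Raises AssertionError with a readable diff if the frames differ
pd.testing.assert_frame_equal(actual, expected)


def validate(df):
    """Hypothetical helper: return a list of problems, empty if none found."""
    problems = []
    if df["age"].isna().any():
        problems.append("age has missing values")
    if (df["age"] < 0).any():
        problems.append("age has negative values")
    if (df["fare"] < 0).any():
        problems.append("fare has negative values")
    if df.duplicated().any():
        problems.append("duplicate rows found")
    return problems


clean_report = validate(actual)
bad_report = validate(pd.DataFrame({"age": [-1, 5], "fare": [1.0, 2.0]}))
```

Returning a list of messages, rather than raising on the first failure, lets a pipeline report every problem in one pass.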

---

### 24. **Best Practices & Patterns**

* Method chaining (`.pipe()`, `.assign()`) → cleaner code
* Writing reusable transformations
* Avoiding loops, sticking to vectorization
* Consistent column naming conventions
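
A sketch of what a chained pipeline might look like. The `drop_missing_age` helper and the `raw` frame are invented for illustration.

```python
import pandas as pd


def drop_missing_age(df):
    """Reusable step: keep only rows that have an age."""
    return df.dropna(subset=["age"])


raw = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cara"],
    "age": [30.0, None, 50.0],
    "fare": [10.0, 20.0, 40.0],
})

result = (
    raw
    .rename(columns=str.lower)                              # consistent column names
    .pipe(drop_missing_age)                                 # plug in a custom function
    .assign(fare_per_year=lambda d: d["fare"] / d["age"])   # derived column
    .sort_values("fare_per_year", ascending=False)
    .reset_index(drop=True)
)
```

Each step returns a new frame, so `raw` is left untouched and the chain can be read top to bottom.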

---


## 📝 Pandas Cheat Sheet (Quick Ref)

```python
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'Name':['A','B'], 'Age':[25,30]})

# I/O
df = pd.read_csv("file.csv")   # read
df.to_csv("file.csv", index=False)  # write (skip the index column)

# Inspect
df.head(), df.tail()
df.info(), df.describe()

# Select
df['col'], df[['col1','col2']]
df.loc[0], df.iloc[0]
df[df['Age'] > 25]

# Clean
df.dropna(), df.fillna(0)
df.drop_duplicates()
df['col'].str.lower()

# Transform
df.rename(columns={'A':'a'})
df.astype({'Age':float})
df.replace({'M':'Male','F':'Female'})
df.apply(lambda x: x*2)

# Sort & Filter
df.sort_values(by='Age')
df[(df['Age'] > 25) & (df['Name']=='A')]

# Group & Aggregate
df.groupby('col').mean(numeric_only=True)
df.agg({'Age':['mean','max']})

# Merge & Concatenate
pd.concat([df1,df2])
pd.merge(df1, df2, on='id', how='left')

# Reshape
df.pivot_table(values='Age', index='Name', aggfunc='mean')
pd.melt(df, id_vars=['Name'])

# Time Series
df['date'] = pd.to_datetime(df['date'])
df.set_index('date').resample('M').mean()  # month-end; the alias is 'ME' on pandas >= 2.2

# Plot
df['Age'].plot(kind='hist')
```

---


## 🔰 Beginner Level

### 1. Pandas Basics

* **Exercise:** Create a Series and DataFrame from Python lists/dicts.
* Print shape, columns, index, data types.
* **Dataset:** Manual small list or dict.

---

### 2. Data Input/Output (I/O)

* **Exercise:** Load a CSV, Excel, and JSON file.
* Save the DataFrame back as CSV & Excel.
* **Dataset:** Titanic dataset (`titanic.csv`).

---

### 3. Exploring Data

* **Exercise:** Use `.head()`, `.tail()`, `.info()`, `.describe()`.
* Find column names, row count.
* **Dataset:** Titanic dataset.

---

### 4. Selection & Indexing

* **Exercise:** Select a single column, multiple columns, rows by index.
* Slice rows using `.iloc` and `.loc`.
* **Dataset:** Titanic.

---

### 5. Data Cleaning

* **Exercise:** Fill missing ages with the mean age, drop rows that still contain nulls, remove duplicates.
* Convert names to lowercase.
* **Dataset:** Titanic.

---

### 6. Data Transformation

* **Exercise:** Rename columns (`Survived` → `is_survived`), convert Age to `int` (fill or drop missing ages first, since `NaN` cannot be cast to `int`).
* Replace male/female with 0/1.
* Apply a custom function: double the fare.
* **Dataset:** Titanic.

---

### 7. Sorting & Filtering

* **Exercise:** Sort passengers by Age and Fare.
* Select passengers who are older than 40 and paid a fare above 50.
* **Dataset:** Titanic.

---

### 8. Aggregation & Grouping

* **Exercise:** Find average fare by class, survival rate by gender.
* Multiple aggregation: min/max age by class.
* **Dataset:** Titanic.

---

### 9. Merging, Joining & Concatenation

* **Exercise:** Merge Titanic passengers with a new dataset of `class → avg fare`.
* Concatenate two DataFrames vertically.
* **Dataset:** Titanic + custom dataset.

---

### 10. Reshaping Data

* **Exercise:** Create pivot table (avg age by class and gender).
* Melt columns like Age, Fare into a long format.
* **Dataset:** Titanic.

---

### 11. Time Series

* **Exercise:** Create a DataFrame of dates and random sales.
* Resample to weekly, monthly.
* Find moving average of sales.
* **Dataset:** Generated with `pd.date_range()`.

---

### 12. Visualization (with Pandas)

* **Exercise:** Plot histogram of Age, bar chart of survival by class, line chart of sales over time.
* **Dataset:** Titanic + sales dataset.

---


## 🚀 Intermediate Level

### 13. Advanced Indexing

* **Exercise:** Set `['class','sex']` as MultiIndex, slice data.
* Reindex to add missing categories.
* **Dataset:** Titanic.

---

### 14. Window Functions

* **Exercise:** Calculate 7-day rolling average of sales.
* Expanding sum of sales.
* EWM for smoothing.
* **Dataset:** Sales data.

---

### 15. Categorical Data

* **Exercise:** Convert Titanic `class` to categorical.
* Encode categories → codes.
* Compare memory usage before/after.
* **Dataset:** Titanic.

---

### 16. Performance Optimization

* **Exercise:** Compare filtering with `.query()` vs normal boolean indexing.
* Chunk read a large CSV (`chunksize=5000`).
* **Dataset:** NYC taxi trips dataset (big).

---

### 17. Sparse & Large Data Handling

* **Exercise:** Convert a DataFrame with many zeros to sparse.
* Save/load to Parquet for speed.
* **Dataset:** Random sparse matrix.

---

### 18. Advanced Merging & Joins

* **Exercise:** Multi-key merge (`class` + `sex`).
* Perform left, right, inner, outer joins.
* Cross join two datasets.
* **Dataset:** Titanic + extra table.

---

### 19. Styling & Reporting

* **Exercise:** Style Titanic survival table with background gradient.
* Export to HTML report.
* **Dataset:** Titanic.

---

### 20. Integration with Other Libraries

* **Exercise:** Use NumPy ufuncs on DataFrame.
* Pass Pandas DataFrame into Scikit-learn (fit a LogisticRegression).
* **Dataset:** Titanic.

---


## ⚡ Advanced Level

### 21. Advanced Time Series

* **Exercise:** Shift stock prices by 1 day (lagging).
* Use a rolling (as-of) join on the datetime column, e.g. with `pd.merge_asof()`.
* Add 30-day offset.
* **Dataset:** Stock price dataset (Yahoo Finance).

---

### 22. Multi-Dataset Analysis

* **Exercise:** Combine sales data from 3 months, analyze trends.
* Use hierarchical keys with concat.
* **Dataset:** Monthly sales CSVs.

---

### 23. Testing & Validation

* **Exercise:** Create 2 DataFrames and test equality.
* Validate column ranges (Age >= 0, Fare >= 0).
* **Dataset:** Titanic.

---

### 24. Best Practices & Patterns

* **Exercise:** Use method chaining (`.pipe()`, `.assign()`) to clean and analyze Titanic in a single chained expression.
* Build a reusable transformation function.
* **Dataset:** Titanic.

---


```shell
pip install pandas numpy seaborn matplotlib
```

```text
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2
Note: you may need to restart the kernel to use updated packages.
```


```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```

```python
# From a list
numbers = [10, 20, 30, 40, 50]
series = pd.Series(numbers)

print("Series:\n", series)
print("Values:", series.values)
print("Index:", series.index)
print("Data type:", series.dtype)
```

```text
Series:
 0    10
1    20
2    30
3    40
4    50
dtype: int64
Values: [10 20 30 40 50]
Index: RangeIndex(start=0, stop=5, step=1)
Data type: int64
```


```python
# From a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

print("DataFrame:\n", df)
print("\nShape:", df.shape)      # (rows, columns)
print("Columns:", df.columns)
print("Index:", df.index)
print("Data types:\n", df.dtypes)
```

```text
DataFrame:
       Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston

Shape: (4, 3)
Columns: Index(['Name', 'Age', 'City'], dtype='object')
Index: RangeIndex(start=0, stop=4, step=1)
Data types:
 Name    object
Age      int64
City    object
dtype: object
```


```python
# From a list of dictionaries
people = [
    {"Name": "Eve", "Age": 29, "City": "San Francisco"},
    {"Name": "Frank", "Age": 35, "City": "Seattle"},
]

df2 = pd.DataFrame(people)

print("DataFrame from list of dicts:\n", df2)
print("\nShape:", df2.shape)
print("Columns:", df2.columns.tolist())
```

```text
DataFrame from list of dicts:
     Name  Age           City
0    Eve   29  San Francisco
1  Frank   35        Seattle

Shape: (2, 3)
Columns: ['Name', 'Age', 'City']
```


```python
# Show only column "Name"
print(df['Name'])

# Show first 2 rows
print(df.head(2))

# Show info about the DataFrame
# (info() prints its report itself and returns None, so no print() is needed)
df.info()
```

```text
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
    Name  Age         City
0  Alice   24     New York
1    Bob   27  Los Angeles
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
```


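```python
import pandas as pd
import numpy as np

# Empty Series; an explicit dtype keeps the result the same across pandas versions
empty_series = pd.Series([], dtype=object)

# From a list
values = [1, 2, 3, 4, 5, 6]
list_series = pd.Series(values)

# From a NumPy array
a = np.array([1.1, 2.2, 3.3, 4.4])
array_series = pd.Series(a)

# From a dict: the keys become the index
d = {"a": 20, "b": 30, "c": 10, "d": 5}
dict_series = pd.Series(d)

# From a tuple
t = (10, 3, 4, 5, 1, 3)
tuple_series = pd.Series(t)
```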

```python
print(empty_series)
print(list_series)
print(array_series)
print(dict_series)
print(tuple_series)
```

```text
Series([], dtype: object)
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
0    1.1
1    2.2
2    3.3
3    4.4
dtype: float64
a    20
b    30
c    10
d     5
dtype: int64
0    10
1     3
2     4
3     5
4     1
5     3
dtype: int64
```
