### What is a DataFrame?

- A **DataFrame** is a two-dimensional, size-mutable, heterogeneous **tabular data structure** with labeled axes:
  - **Rows** → `index`
  - **Columns** → `columns`

- Think of it like:
  - A spreadsheet (Excel)
  - A SQL table
  - A dictionary of Series (columns share the same row index)
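
The "dictionary of Series" view can be sketched as follows. The sample data and the names `df` and `age` are made up for illustration; this is a minimal sketch, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'Name': ['Ana', 'Ben'], 'Age': [23, 25]})

print(df.index)    # row labels (a default RangeIndex here)
print(df.columns)  # column labels

# Selecting one column returns a Series that shares the DataFrame's row index
age = df['Age']
print(type(age).__name__)
print(age.index.equals(df.index))
```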

### Why use a DataFrame?
- Easy and intuitive **row/column selection**
- Built-in support for **missing data handling**
- Powerful **grouping and aggregation** tools
- Seamless **I/O with CSV, Excel, SQL, JSON**, and more
- Built on **NumPy** → **fast** and **vectorized** computations
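
A short sketch of several of these features together. It assumes a made-up `sales` table; the column names `Region` and `Units` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
sales = pd.DataFrame({'Region': ['North', 'South', 'North', 'South'],
                      'Units':  [10, np.nan, 7, 5]})

# Row/column selection: boolean mask on rows, label on columns
north_units = sales.loc[sales['Region'] == 'North', 'Units']

# Missing data handling: replace NaN with 0
filled = sales['Units'].fillna(0)

# Grouping and aggregation: total units per region (sum skips NaN)
totals = sales.groupby('Region')['Units'].sum()

# Vectorized computation: one expression over the whole column, no Python loop
doubled = filled * 2

print(totals)
```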

---

### Creating a DataFrame
#### 1. From a Dictionary

#### Syntax:
```python
pd.DataFrame(data, index=None, columns=None, dtype=None)
```

Rule of thumb: Each key becomes a column; each value supplies that column’s data.

### Variant A – Dict of Lists / Arrays

```python
import pandas as pd

# 1 – Basic example: a string column and an integer column
data = {'Name': ['Ana', 'Ben', 'Cara'],
        'Age':  [23,   25,   22]}
df1 = pd.DataFrame(data)
print(df1)
```

```
   Name  Age
0   Ana   23
1   Ben   25
2  Cara   22
```


```python
df1
```

```
   Name  Age
0   Ana   23
1   Ben   25
2  Cara   22
```


#### 2 – Mixed dtypes + custom row index

```python
data = {'City':     ['Pune', 'Delhi', 'Mumbai'],
        'Temp_C':   [32.0,   36.5,    34.2],
        'Humidity': [60,     55,      70]}
df2 = pd.DataFrame(data, index=['Mon', 'Tue', 'Wed'])
print(df2)
```

```
       City  Temp_C  Humidity
Mon    Pune    32.0        60
Tue   Delhi    36.5        55
Wed  Mumbai    34.2        70
```
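
With a dict of lists, every list must have the same length, and an explicit `index` must supply exactly one label per row. The sketch below shows both failure cases; the flag names `unequal_ok` and `short_index_ok` are illustrative only.

```python
import pandas as pd

# Lists of unequal length cannot form a rectangular table
try:
    pd.DataFrame({'City': ['Pune', 'Delhi'], 'Temp_C': [32.0]})
    unequal_ok = True
except ValueError as exc:
    unequal_ok = False
    print('unequal lengths:', exc)

# An explicit index must have one label per row
try:
    pd.DataFrame({'City': ['Pune', 'Delhi', 'Mumbai']}, index=['Mon', 'Tue'])
    short_index_ok = True
except ValueError as exc:
    short_index_ok = False
    print('short index:', exc)
```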


#### 3 – Select / reorder columns at construction

```python
cols = ['Temp_C', 'City']          # omit Humidity on purpose
df3  = pd.DataFrame(data, columns=cols, index=['Mon', 'Tue', 'Wed'])
print(df3)
```

```
     Temp_C    City
Mon    32.0    Pune
Tue    36.5   Delhi
Wed    34.2  Mumbai
```


```python
cols = ['City', 'Humidity']        # omit Temp_C on purpose
df3  = pd.DataFrame(data, columns=cols, index=['Mon', 'Tue', 'Wed'])
print(df3)
```

```
       City  Humidity
Mon    Pune        60
Tue   Delhi        55
Wed  Mumbai        70
```
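
A related case worth knowing: a label listed in `columns=` that is not a key of the dict still produces a column, filled with missing values. A minimal sketch, where `Rainfall` is a hypothetical label chosen for illustration:

```python
import pandas as pd

data = {'City':     ['Pune', 'Delhi', 'Mumbai'],
        'Temp_C':   [32.0, 36.5, 34.2],
        'Humidity': [60, 55, 70]}

# 'Rainfall' is a hypothetical label that is not a key in `data`
df_extra = pd.DataFrame(data, columns=['City', 'Rainfall'])
print(df_extra)
```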


### Variant B – Creating a DataFrame from a **Dictionary of Series**

> **Pattern** `pd.DataFrame({col_name: series, …}, index=None)`

#### 🧩 Key Ideas
- **Column mapping**  
  - Each **dictionary key** becomes a **column label** in the new DataFrame.  
  - Each **value** is a **`pd.Series` object**; unlike a plain list or array, a Series carries its own index, which is what drives the alignment described below.

- **Automatic row‑index union**  
  - Pandas takes the **union** of the indexes of all supplied Series to build the DataFrame’s **row index**.  
  - This behaves like a full outer join on the row labels.

- **Label‑based alignment**  
  - Values align **by matching index labels**, not by positional order.  
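  - If a Series lacks a label present in the union, Pandas inserts **`NaN`** (or `pd.NA` for nullable extension dtypes).

- **Data type preservation**  
  - Each column keeps the **dtype** of its originating Series when alignment introduces no missing values.  
  - When alignment does introduce missing values, an `int64` column is up-cast to `float64`; a Series that already uses the nullable `Int64` dtype stays `Int64` and holds `pd.NA`.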

- **Optional `index=` argument**  
  - Supplying an explicit `index` lets you **re‑index** every Series to that exact label set.  
  - Any labels in `index` but absent from a Series → filled with `NaN`.  
  - Labels present in a Series but **not** in `index` are **dropped**.

- **Order & subset control**  
  - Use the `columns=` parameter to **select** or **re‑order** columns independent of the dict’s key order.

- **Memory view vs copy**  
  - By default the constructor **copies** dict input (`copy=None` behaves like `copy=True` for a dict, as of pandas 1.3), so updating a Series **after** construction does **not** affect the DataFrame.

- **Typical use cases**  
  - Combining multiple pre‑computed Series (e.g., KPI time‑series) into a single table.  
  - Aligning disparate data feeds that share some, but not all, time stamps or IDs.

- **Performance note**  
  - Large, sparsely overlapping indexes can explode memory usage because of many `NaN`s. Consider merging/joining selectively when union size is huge.

- **Edge cases**  
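  - **Duplicate labels** inside a Series are **preserved** when every Series shares the same index, so the DataFrame can have a non-unique row index. If the indexes differ, aligning on duplicate labels typically raises a `ValueError`.  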
  - If **all Series share identical index labels**, the result is equivalent to joining on that exact index (no `NaN`s introduced).
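
Label-based alignment, from the key ideas above, can be sketched with two Series whose labels appear in different orders. The names `s_a`, `s_b`, and `aligned` are hypothetical.

```python
import pandas as pd

# Hypothetical Series holding the same labels in different orders
s_a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s_b = pd.Series([30, 10, 20], index=['z', 'x', 'y'])

aligned = pd.DataFrame({'a': s_a, 'b': s_b})
print(aligned)

# Values are matched by label: 'x' pairs 1 with 10, not 1 with 30
print(aligned.loc['x'].tolist())
```

Because both Series cover the same label set, no missing values are introduced here.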

#### Rule of Thumb
> *“A DataFrame built from a dict of Series behaves like an **outer join on the row labels**, forming one column per Series.”*
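
The point under "Memory view vs copy" can be checked with a small sketch. It assumes pandas 1.3 or later; the names `s` and `df_copy` are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
df_copy = pd.DataFrame({'x': s})

# Mutate the Series after the DataFrame has been built
s.loc['a'] = 99

print(s.loc['a'])             # the Series sees the new value
print(df_copy.loc['a', 'x'])  # the DataFrame keeps the value it was built with
```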


```python
s_sales  = pd.Series([250, 300, 400], index=['Q1', 'Q2', 'Q3'])
s_profit = pd.Series([ 80, 110],      index=['Q1', 'Q4'])
df4 = pd.DataFrame({'Sales': s_sales, 'Profit': s_profit})
print(df4)
```

```
    Sales  Profit
Q1  250.0    80.0
Q2  300.0     NaN
Q3  400.0     NaN
Q4    NaN   110.0
```
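
In the output above, both integer Series were up-cast to `float64` so they could hold `NaN`. A sketch of the nullable alternative, assuming the same quarterly figures are stored with the `Int64` extension dtype (the `_n` names are hypothetical):

```python
import pandas as pd

# Same figures, but stored with the nullable 'Int64' dtype
s_sales_n  = pd.Series([250, 300, 400], index=['Q1', 'Q2', 'Q3'], dtype='Int64')
s_profit_n = pd.Series([80, 110],       index=['Q1', 'Q4'],       dtype='Int64')

df_nullable = pd.DataFrame({'Sales': s_sales_n, 'Profit': s_profit_n})
print(df_nullable)
print(df_nullable.dtypes)  # columns stay integer; gaps are shown as <NA>
```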


#### 2 – Supplying an explicit overall index

```python
df5 = pd.DataFrame({'Sales': s_sales, 'Profit': s_profit},
                   index=['Q1', 'Q2', 'Q3', 'Q5'])
df5
```

```
    Sales  Profit
Q1  250.0    80.0
Q2  300.0     NaN
Q3  400.0     NaN
Q5    NaN     NaN
```
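
One way to read the `index=` argument is as a reindex of the union table. The sketch below compares the two routes; `direct`, `via_reindex`, and `labels` are illustrative names.

```python
import pandas as pd

s_sales  = pd.Series([250, 300, 400], index=['Q1', 'Q2', 'Q3'])
s_profit = pd.Series([80, 110],       index=['Q1', 'Q4'])
labels = ['Q1', 'Q2', 'Q3', 'Q5']

# Passing index= at construction ...
direct = pd.DataFrame({'Sales': s_sales, 'Profit': s_profit}, index=labels)

# ... matches building the union first and reindexing afterwards
via_reindex = pd.DataFrame({'Sales': s_sales, 'Profit': s_profit}).reindex(labels)

print(direct.equals(via_reindex))
print('Q4' in direct.index)  # Q4 was not requested, so it is dropped
```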


#### 3 – Adding a constant (scalar) column

```python
df5['Currency'] = 'INR'
df5
```

```
    Sales  Profit Currency
Q1  250.0    80.0      INR
Q2  300.0     NaN      INR
Q3  400.0     NaN      INR
Q5    NaN     NaN      INR
```
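
For contrast with scalar assignment, the sketch below also assigns a Series as a new column, which is aligned on the row labels rather than broadcast. The `Target` column and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Sales': [250.0, 300.0, 400.0]}, index=['Q1', 'Q2', 'Q3'])

# A scalar is broadcast to every row
df['Currency'] = 'INR'

# A Series is aligned on the row labels: labels it lacks become NaN,
# and labels the DataFrame lacks (Q9 here) are ignored
df['Target'] = pd.Series({'Q3': 420.0, 'Q1': 260.0, 'Q9': 999.0})
print(df)
```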


### Variant C – Creating a DataFrame from a **Nested Dictionary**
> Pattern: `pd.DataFrame({col1: {row1: val1, …}, col2: {…}})`

---


#### 🔹 1. Outer keys ➜ Columns
- Each **outer dictionary key** becomes a **column label**.
- Each **inner dictionary** contains key–value pairs where:
  - **Inner keys** become **row labels (index)**.
  - **Inner values** become **data values** in the respective column.


```python
nested = {'Math': {'Alice': 85, 'Bob': 78},
          'Sci' : {'Bob': 82, 'Cara': 91}}
df6 = pd.DataFrame(nested)
df6
```

```
       Math   Sci
Alice  85.0   NaN
Bob    78.0  82.0
Cara    NaN  91.0
```
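
If the outer keys should become rows instead of columns, `pd.DataFrame.from_dict` with `orient='index'` is one option. A minimal sketch using the same nested dict; `by_col` and `by_row` are illustrative names.

```python
import pandas as pd

nested = {'Math': {'Alice': 85, 'Bob': 78},
          'Sci':  {'Bob': 82, 'Cara': 91}}

# Default constructor: outer keys are columns
by_col = pd.DataFrame(nested)

# orient='index': outer keys become rows instead
by_row = pd.DataFrame.from_dict(nested, orient='index')

print(by_row)
```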
