### -------------------------------- CHAPTER 5 ---------------------------------------

- Series -> a single-column Excel sheet, but with labels
- DataFrames -> a full-fledged Excel sheet, but supercharged
------
**SERIES (PYTHON LIST WITH SOME ATTITUDE)**
- Instead of asking where is the second element you can directly ask what's under 'ohio'😉
- Just like NumPy arrays, you can slice, dice, and multiply a Series
- Think of it as a Python dictionary: index labels play the role of keys, the data plays the role of values
- `loc` and `iloc` are indexers that take square brackets `[]`: `loc` selects by label, `iloc` by integer position
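
A minimal sketch of the label-versus-position idea above. The state names and numbers are made up for illustration only:

```python
import pandas as pd

# Hypothetical figures, used only to have something to index
pop = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000})

by_label = pop.loc["Texas"]            # label lookup, like a dict key
by_position = pop.iloc[1]              # position lookup, like a list index
label_slice = pop.loc["Ohio":"Texas"]  # label slices include the end label
pos_slice = pop.iloc[0:1]              # position slices exclude the end position

print(by_label, by_position)
print(list(label_slice.index), list(pos_slice.index))
```

Both lookups land on the same element here; only the way of asking differs.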


In [None]:
import pandas as pd
import numpy as np

In [None]:
obj = pd.Series([1, 4, -6, 8])  # Default index: 0 to len-1
obj2 = pd.Series([1, 4, -5, 7], index=["a", "b", "c", "d"])  # Custom index
obj2.index = ["c", "d", "a", "b"]  # Rename index just because we can (the values stay where they are)

obj2["b"]        # Access single value (7, because "b" now labels the last element)
obj2[["a", "b", "d"]]  # Multiple row access like a boss
obj2.loc["c":"a"]  # Label slicing includes the end label (returns c, d, a); obj2["c":"a"] works too

# NumPy-like operations (because pandas = numpy + swagger)
obj2[obj2 > 0]   # Filter positives
obj2 * 2         # Multiply everything by 2 (easy gains)
np.exp(obj2)     # Exponentials, no calculator needed

# Series works like a dictionary (but cooler)
"b" in obj2      # Membership test

# Create Series from a dictionary (upgrade complete)
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

obj3.to_dict()   # Back to dictionary, if you’re nostalgic

# Custom index (adds NaN where data's missing)
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

pd.isna(obj4)    # True where data is missing (also available as the method obj4.isna())
pd.notna(obj4)   # False where data is missing (also available as obj4.notna())

# Automatic alignment on index when combining Series
obj3 + obj4

# Name your Series and index to feel important
obj3.index.name = "hello"
obj3.name = "world"

**DATAFRAMES (DF)**

- A DF is a dictionary of Series sharing the same index (2D but can represent hierarchical structures).
- Assigning a Series to a column aligns it by index; missing matches → `NaN`.
- Transposing mixed-type columns results in `dtype=object`.
- Nested dictionary structure:
  - Outer keys → columns.
  - Inner keys → indexes (missing keys → `NaN`).
- Dot notation (`df.col`) **won’t work if**:
  - Column names have spaces or special characters.
  - Column names conflict with DataFrame methods.
- Assigning lists/arrays to columns requires matching length.
- `loc` and `iloc` -> `loc[row_label, column_label]` selects by label, `iloc[row_position, column_position]` selects by integer position
- Slices with `iloc` exclude the end position; slices with `loc` include the end label
- Normal indexing with square brackets isn’t great because it makes assumptions. On a DataFrame, it defaults to columns. With numbers, it might treat them as labels, which gets confusing fast.
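
A small sketch of three of the points above: index alignment on column assignment, the dot-notation clash, and inclusive versus exclusive slicing. The frame and its values are invented for the example:

```python
import pandas as pd

# Hypothetical data; the row labels a, b, c are chosen just for the demo
df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["a", "b", "c"],
)

# Assigning a Series aligns on the index; "b" has no match, so it becomes NaN
df["debt"] = pd.Series([10.0, 30.0], index=["a", "c"])

# Dot notation clashes with DataFrame.pop: df.pop is the method, not the column
pop_is_method = callable(df.pop)

loc_rows = df.loc["a":"b", "year"]  # end label included -> rows a and b
iloc_rows = df.iloc[0:1, 0]         # end position excluded -> row a only

print(df)
print(pop_is_method, len(loc_rows), len(iloc_rows))
```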



In [None]:
# Creating a DataFrame from a dictionary
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)  # keys become columns, values become data

# Viewing rows (head & tail)
frame.head()  # top 5 rows
frame.tail()  # bottom 5 rows

# Rearranging columns and adding new ones (with NaN)
pd.DataFrame(frame, columns=["pop", "year", "state"])
pd.DataFrame(frame, columns=["pop", "year", "state", "debt"])  # 'debt' doesn't exist yet → NaN

# Accessing columns -> plain [] on a DataFrame defaults to columns
frame["pop"]  # Series access
frame.state   # dot notation (frame.pop would return the DataFrame.pop method, not the column!)

# Accessing rows -> `loc` selects by index label, `iloc` by integer position.
frame.loc[1]   # by label (the default index labels here are the integers 0 to 5)
frame.iloc[2]  # by position

# Assigning values to columns
frame["debt"] = 16.5  # broadcasts to the entire column
frame["pop"] = np.arange(6.)  # replaces 'pop' with array values (lengths must match)
val = pd.Series([12, 5, 6], index=["A", "B", "C"])  # custom index Series
frame["debt"] = val  # aligns by index; none of A, B, C match the labels 0 to 5, so every row → NaN

# Creating and deleting columns
frame["easter"] = frame["state"] == "Ohio"  # boolean condition as a new column
del frame["easter"]  # delete the column

# Transposing the DataFrame
frame.T  # flip rows and columns

# Nested dictionary to DataFrame
populations = {
    "Gotham": {2020: 5.5, 2021: 6.0, 2022: 6.8},
    "Metropolis": {2021: 8.2, 2022: 8.5}
}
frame = pd.DataFrame(populations)  # Outer keys = columns, inner keys = index
# Missing data becomes NaN

# Dictionary of Series to DataFrame
pdata = {
    "Gotham": frame["Gotham"][:-1],        # Excludes last row (2022)
    "Metropolis": frame["Metropolis"][:2]  # First two rows (2020, 2021)
}
new_frame = pd.DataFrame(pdata)  # Indexes are aligned, missing values filled with NaN

# Naming index and columns
frame.index.name = "Year"        # Sets index name
frame.columns.name = "City"      # Sets columns name

# Convert DataFrame to NumPy array
raw_data = frame.to_numpy()      # Returns 2D ndarray without index/column labels

# DataFrame with mixed data types
crazy_data = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "City": ["Gotham", "Gotham", "Gotham"],
    "Population": [5.5, 6.0, 6.8],
    "Growth_Rate": [None, 0.1, 0.15]
})
crazy_numpy = crazy_data.to_numpy()  # dtype becomes object because of mixed types

**INDEX OBJECTS IN PANDAS**
- What is an Index? -> An Index holds the row or column labels (plus metadata such as the axis name) in pandas.<br>
-----------INDEX METHODS & PROPERTIES SUMMARY-----------<br>
---------
**Set-like operations:**
- append(): Concatenate with additional Index objects
- difference(): Compute set difference as an Index
- intersection(): Compute set intersection
- union(): Compute set union
- isin(): Boolean array indicating whether each value is in another collection

**Element operations:**
- delete(): Delete element at a specific index
- drop(): Compute new Index by deleting passed values
- insert(): Insert element at a specific index

**Properties and methods:**
- is_monotonic_increasing: True if each element is greater than or equal to the previous one (the old `is_monotonic` alias was removed in pandas 2.0)
- is_unique: True if there are no duplicate values
- unique(): Compute an array of unique values in the Index

**You may not directly interact with Index objects often BUT they show up in most pandas operations:**
- Merging
- Joining
- Aligning
- Reindexing
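
A quick sketch of the set-like Index methods listed above, using two made-up Index objects (`left` and `right` are illustration names, not anything from the chapter):

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

common = left.intersection(right)   # labels present in both
combined = left.union(right)        # labels present in either
only_left = left.difference(right)  # labels in left but not in right
mask = left.isin(["a", "c"])        # boolean array, one entry per label

dup = pd.Index(["foo", "foo", "bar"])  # duplicates are allowed

print(list(common), list(combined), list(only_left))
print(mask, dup.is_unique, left.is_monotonic_increasing)
```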



In [None]:
# Index objects are immutable
try:
    obj.index[1] = "d"
except TypeError as err:
    print(err)  # an Index cannot be modified in place
# Example of sharing Index objects
labels = pd.Index(np.arange(3))  # integer Index holding 0, 1, 2 (shown as Int64Index in pandas < 2.0)
obj2 = pd.Series([1.5, -2.5, 0], index=labels)  # obj2 reuses these labels as its index
# INDEX OBJECT BEHAVIOR: ARRAY + SET HYBRID
# - Fixed-size set (supports membership checks)
# Unlike Python sets, Index objects CAN have duplicates!
dup_index = pd.Index(["foo", "foo", "bar", "bar"])
# Selections with duplicate labels return all occurrences.

**REINDEXING**
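- Reindexing creates a new object whose data is rearranged to match a new set of labels (rows, columns, or both); labels that were not there before get `NaN`
- `loc` and `iloc` can also reorder data, but they only work with labels or positions that already exist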
---------
**📝 Table 5-3: `reindex()` Function Arguments**
- labels--->>> New index sequence (a.k.a. your VIP list—whether they show up is another story).  
- index--->>> Set new row labels (rearrange who sits where in the row section; musical chairs, DataFrame edition).  
- columns--->>> Set new column labels (finally give Utah a seat at the table—whether it deserves it or not).  
- axis--->>> Tell pandas where to work its magic--->>> rows (`index`) or columns (`columns`). Default is rows, because pandas likes to start at the bottom.  
- method--->>> Filling strategy for missing data—`'ffill'` (drag the last known value forward, like clinging to your ex), or `'bfill'` (borrow from the next one, like sneaking into the future).  
- fill_value--->>> The "emergency backup" value when data’s MIA. Defaults to `NaN`, but you can throw in whatever makes sense (or doesn't).  
- limit--->>> Max gaps to fill when doing `ffill` or `bfill`. It’s like saying, "I'll help... but only up to 3 times. I have boundaries."  
- tolerance--->>> Max numeric distance to tolerate before filling stops. Basically, "I’ll stretch... but only so far before I snap."  
- level--->>> For MultiIndex—pick the level to reindex (because sometimes you only have energy to deal with one layer of chaos).  
- copy--->>> `True` makes a shiny new copy every time (pandas the perfectionist), `False` skips the extra effort when it’s already good (lazy but efficient).
--------
**EXTRA**
- Use `drop` like `reindex` to remove indexes (index, columns, etc.). You *can* do it with `reindex`, but `drop` makes it simpler and cleaner.
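
To make a few of these arguments concrete, here is a small sketch on a made-up two-element Series, showing `fill_value`, `method="ffill"`, and `drop` side by side:

```python
import pandas as pd

s = pd.Series([4.5, 7.2], index=["b", "d"])
new_labels = ["a", "b", "c", "d", "e"]

zero_filled = s.reindex(new_labels, fill_value=0)  # missing labels get 0 instead of NaN
forward = s.reindex(new_labels, method="ffill")    # missing labels copy the previous value
dropped = s.drop("d")                              # remove a label without listing the rest

print(zero_filled.tolist())
print(forward.tolist())
print(list(dropped.index))
```

Note that `"a"` stays `NaN` under `ffill`, since there is no earlier value to carry forward.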



In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
print(obj.reindex(["a", "b", "c", "d", "e"]))  # Adds 'e' as NaN, the ghost town of indexes.
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
print(obj3.reindex(np.arange(6), method="ffill"))  # Forward fill: handing down the family heirlooms.
# DataFrame reindexing: rows first, columns second. Like dressing up your DataFrame.
frame = pd.DataFrame(np.arange(9).reshape(3, 3),index=["a", "c", "d"],columns=["Ohio", "Texas", "California"])
# Row reindexing: making room for guests that may never show up (hello, NaNs).
frame2 = frame.reindex(index=["a", "b", "c", "d", "e"])
# Column reindexing: adding a column from Utah because why not?
frame2 = frame2.reindex(columns=["Texas", "Utah", "California"])
# Same as above but flexing with axis param.
states = ["Texas", "Utah", "California"]
frame2 = frame2.reindex(states, axis="columns")
# Careful! loc needs **square brackets**, not round ones, and every label you ask for must already exist.
frame2 = frame2.loc[["a", "c", "d"], states]  # select from frame2, because frame has no "Utah" column (KeyError)

frame2.drop(columns=["Utah"]) #similarly you can use index , axis etc

**INDEXING , SELECTING , FILTERING**
----- Pandas Indexing Cheatsheet (With Just Enough Sass)

- `df[column]` → grabs column(s)… or starts a rebellion if you confuse it
- `df.loc[rows]` → rows by label… because names matter
- `df.loc[:, cols]` → columns by label… politely asks for columns, gets them
- `df.loc[rows, cols]` → rows *and* columns by label… VIP access only
- `df.iloc[rows]` → rows by position… doesn’t care who you are, just where you sit
- `df.iloc[:, cols]` → columns by position… seating chart vibes
- `df.iloc[rows, cols]` → rows and columns by position… full-on Battleship mode
- `df.at[row, col]` → sniper precision, one value by label… no drama
- `df.iat[row, col]` → sniper precision, one value by position… faster than `at` but colder
- `df.reindex()` → shuffle the deck… hope you know what you’re doing
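
A compact sketch of the cheatsheet in action, on a tiny made-up frame so the returned values are easy to check by eye:

```python
import numpy as np
import pandas as pd

# Values 0..5 laid out as [[0, 1, 2], [3, 4, 5]]
df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["Ohio", "Utah"],
    columns=["one", "two", "three"],
)

by_label = df.at["Utah", "two"]                # single value by label
by_position = df.iat[1, 1]                     # the same cell by position
row_part = df.loc["Utah", ["one", "three"]]    # one row, two columns, by label
first_col = df.iloc[:, 0]                      # first column, by position

print(by_label, by_position)
print(row_part.tolist(), first_col.tolist())
```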


In [None]:
"""dataframe with normal and loc or iloc indexing"""
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=["Ohio", "Colorado", "Utah", "New York"],columns=["one", "two", "three", "four"])
data["three"] #Normal indexing
data[["three" , "one"]] #Multiple indexing
data[0:3] #row-level selection (integer slices with [] select rows by position)
data["three"][0:3] #pick a column first, then slice its rows
data[data["three"] > 5] #conditional indexing
data < 5 #dataframe of boolean values
data[data < 5] = 8

data.loc["Colorado",["two","three"]]
data.iloc[[1,2],[3,0,1]]
data.loc[:"Utah","two"] # Both indexing functions work with slices
data.iloc[:,:3][data.three > 5] #It is like making a small dataframe and then indexing it with boolean values
data.loc[data.three >= 2] #A boolean Series works with loc; iloc only accepts a plain boolean array, e.g. data.iloc[(data.three >= 2).to_numpy()]

**INTEGER INDEXING PITFALLS**
**series**
- With integer labels you can't use negative indexes the way you would on a Python list: `ser[-1]` raises a `KeyError`
- When the labels are integers, pandas won't guess whether you mean labels or positions, so plain `[]` is treated as label-based. With string labels there is no such ambiguity, but positional access through `[]` is deprecated in recent pandas versions, so `iloc` is the safer habit there too
- With `iloc` and `loc` you will face no such problems<br>
--------
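
A short sketch of the integer-label pitfall. The Series and its deliberately shuffled labels are made up for the demo:

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=[2, 0, 1])  # integer labels, out of order on purpose

by_label = ser.loc[0]      # the element whose label is 0
by_position = ser.iloc[0]  # the first element
last = ser.iloc[-1]        # negative positions are fine with iloc

try:
    ser[-1]                # plain [] looks for the label -1
    plain_failed = False
except KeyError:
    plain_failed = True    # no such label, so pandas raises KeyError

print(by_label, by_position, last, plain_failed)
```
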
**pitfalls with chained indexing**
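1. This does not work reliably: the first indexing step may return a copy instead of a view of the original DataFrame, so the assignment lands on a temporary object, the original can stay unchanged, and pandas typically warns about it (`SettingWithCopyWarning`)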
- df[condition]['column'] = 5
- df.loc[rows]['column'] = 2
2. This works totally fine 
- df.loc[df['age'] > 30, 'salary'] = 100000  # ✅ Works properly!<br>
**A good rule of thumb is to avoid chained indexing when doing assignments**
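
A minimal sketch of the single-step `loc` assignment recommended above. The `age` and `salary` columns follow the example in the text; the numbers are invented:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35, 45], "salary": [50000, 60000, 70000]})

# One loc call selects the rows and the column together, so the assignment
# is applied to df itself rather than to an intermediate object
df.loc[df["age"] > 30, "salary"] = 100000

print(df)
```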

**ARITHMETIC AND DATA ALIGNMENT**
- For Series, you have already seen how matching index labels of two different objects get aligned automatically
- For DataFrames, the same happens on both the index and the columns

**arithmetic methods**
- `add`, `radd`: addition (`+`)
- `sub`, `rsub`: subtraction (`-`)
- `div`, `rdiv`: division (`/`)
- `floordiv`, `rfloordiv`: floor division (`//`)
- `mul`, `rmul`: multiplication (`*`)
- `pow`, `rpow`: exponentiation (`**`)
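
The `r`-prefixed methods flip the operands. A small sketch on a made-up frame, comparing each reflected method with the operator form it stands for:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [4.0, 8.0]})

reciprocal = df.rdiv(1)  # same as 1 / df
remainder = df.rsub(10)  # same as 10 - df

same_div = bool((reciprocal == 1 / df).all().all())
same_sub = bool((remainder == 10 - df).all().all())

print(reciprocal)
print(remainder)
print(same_div, same_sub)
```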


In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),columns=list("abcde"))

df1 + df2 #NaN appears wherever a row or column label exists in only one of the two frames
df1.add(df2 , fill_value = 0) #a value missing from one frame is treated as 0 before adding
df1.reindex(columns=df2.columns, fill_value=0) #new columns are filled with 0 instead of NaN
df1.rdiv(1) #reflected division: same as 1 / df1 (df1.div(1) would be df1 / 1)

- By default, when a DataFrame and Series do math, pandas lines up the Series **index** with the DataFrame **columns**, then broadcasts **down the rows**.
- If a label appears in only one of them, the result is indexed by the **union** of both and the unmatched entries become `NaN`.
- Want to broadcast **across columns** instead? Use an arithmetic method like `add()` and tell pandas to match on the **index**, not columns.

In [None]:
arr = np.arange(12).reshape((3,4))
arr - arr[0] #broadcasting in each row
#something similar happens in DataFrame and Series arithmetic
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list("bde"),index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0] #series with index = ["b", "d", "e"]
frame - series #the Series index is matched against the DataFrame columns, then the operation broadcasts down the rows

series2 = pd.Series(range(3), index=["b", "e", "f"])
frame + series2 #labels found in only one of them ("d" and "f") turn into all-NaN columns
series3 = frame["d"] #this time it is a column with index being ["Utah", "Ohio", "Texas", "Oregon"]

frame.sub(series3 , axis="index") #without axis="index" the state labels are matched against the columns b, d, e, nothing matches, and the result is all NaN

**FUNCTION APPLICATION AND MAPPING**
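- NumPy ufuncs (element-wise array functions) work on pandas objects too
- `apply` is often unnecessary: many common statistics (`sum`, `mean`, ...) are already DataFrame methods
- `apply()` runs a function on 1D slices along the given axis; by default (`axis="index"`) the function is called once per column
- `map`/`applymap` work element-wise; prefer `map`, because `DataFrame.applymap` is deprecated in recent pandas versions (`Series.map` has always been element-wise)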
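
A small sketch contrasting the three styles on a made-up frame. It sticks to `Series.map` for the element-wise step, since `DataFrame.map` only exists in newer pandas versions; the helper name `spread` is just for illustration:

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.0, 4.0], "d": [3.0, -2.0]}, index=["Utah", "Ohio"])

def spread(values):
    """Range of a 1D slice (hypothetical helper for the demo)."""
    return values.max() - values.min()

per_column = frame.apply(spread)                 # called once per column
per_row = frame.apply(spread, axis="columns")    # called once per row
builtin = frame.max() - frame.min()              # same as per_column, no apply needed
labels = frame["b"].map(lambda x: f"{x:.1f}%")   # element-wise on a Series

print(per_column.tolist(), per_row.tolist(), labels.tolist())
```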

In [None]:
frame = pd.DataFrame(np.random.standard_normal((4,3)) , columns=list("bde") , index=["Utah" , "Ohio" , "Texas", "Oregon"])
np.abs(frame) #absolute values of the dataframe
f1 = lambda x: x.max() - x.min()
frame.apply(f1) #applying a function on each column
frame.apply(f1 , axis="columns") #applying a function on each row
f2 = lambda x: pd.Series([x.max(),x.min()],index=["max","min"])
frame.apply(f2) #applying a function that returns a series
f3 = lambda x:f"{x}%"
frame.map(f3) #applying a function element wise
frame["e"].map(f3) #applying a function on a series

**SORTING AND RANKING**
- We can sort by index (`sort_index`) or by values (`sort_values`)
- values -> You can sort by multiple columns: the first one decides the order, later ones break ties, and the rest of each row obediently falls in line, like students when the teacher walks in.
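
A sketch of a multi-column sort with a different direction per column. The frame is made up, and passing a list to `ascending` is one way to mix directions:

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" ascending first, then break ties with "b" descending
sorted_frame = frame.sort_values(["a", "b"], ascending=[True, False])

# Sorting by index afterwards restores the original row order
restored = sorted_frame.sort_index()

print(sorted_frame)
print(list(restored.index))
```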

In [None]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj.sort_index(ascending=False) #No need to define axis since it is always 0 , ascending by default is True
frame = pd.DataFrame(np.arange(8).reshape((2,4)) , index=["three" , "one"] , columns=["d" , "a" , "b" , "c"])
frame.sort_index(axis="columns" , ascending=False) #By default axis is 0 and ascending is True

obj = pd.Series([4, 7,np.nan , -3, 2,np.nan])
obj.sort_values(na_position="first") #NaN values are sorted to the end by default , so we chose first
frame = pd.DataFrame({"b": [4,7,-3,2] , "a" :[0,1,0,1]})
frame.sort_values(["a","b"]) #Sorting by multiple columns , you can even choose single column and rest is just same
