1. Explain the difference between .loc[] and .iloc[]. When would you use each?

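| Indexer   | Selects by                               |
| --------- | ---------------------------------------- |
| `.loc[]`  | Labels (index and column names)          |
| `.iloc[]` | Integer position (0-based)               |

Use `.loc[]` when rows are identified by meaningful labels or when filtering with a boolean condition. Use `.iloc[]` when you need rows or columns by position regardless of their labels, such as the first n rows.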

In [5]:
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])
print(df.loc["y",'A'])     # Access by index label
print(df.iloc[1,0])      # Access by integer position

20
20
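
One practical difference worth illustrating is slicing: label slices with `.loc[]` include both endpoints, while position slices with `.iloc[]` exclude the stop. A minimal sketch reusing the same frame (the names `by_label` and `by_position` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])

# Label slice: both "x" and "y" are included
by_label = df.loc["x":"y", "A"]

# Position slice: row 0 only, because the stop (1) is excluded
by_position = df.iloc[0:1, 0]

print(by_label.tolist())     # [10, 20]
print(by_position.tolist())  # [10]
```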


2. How do you handle missing data in pandas? Describe multiple strategies and when each is appropriate (e.g., drop, fill, interpolation).

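Common strategies:

Drop rows/columns with dropna(): appropriate when only a few values are missing and the affected rows or columns can be discarded.

Fill values with fillna(): appropriate when a sensible default or statistic exists (0, mean, median, previous value).

Interpolate with interpolate(): appropriate for ordered numeric data such as time series, where neighboring values inform the gap.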

In [9]:
df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})
df_drop = df.dropna()
df_fill = df.fillna(0)
df_interp = df.interpolate(method='nearest')  # 'nearest' requires SciPy; trailing NaN is not extrapolated
df_interp


     A    B
0  1.0  4.0
1  1.0  5.0
2  3.0  NaN
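
To make the trade-offs between these strategies more concrete, here is a small sketch on the same kind of data. The frame name `df_missing` and the choice of the column mean as a fill value are assumptions made for illustration, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# Drop only the rows where column A is missing (B may still contain NaN)
dropped_subset = df_missing.dropna(subset=["A"])

# Fill each column with its own mean
filled_mean = df_missing.fillna(df_missing.mean())

# Carry the last valid observation forward (common for time-ordered data)
filled_ffill = df_missing.ffill()

# Linear interpolation between neighboring valid values
linear = df_missing.interpolate(method="linear")

print(dropped_subset.index.tolist())  # [0, 2]
print(filled_mean["B"].tolist())      # [4.0, 5.0, 4.5]
print(linear["A"].tolist())           # [1.0, 2.0, 3.0]
```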


3. What are vectorized operations in pandas, and why are they preferred over apply() or loops?

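Vectorized operations apply a computation to an entire column at once in optimized compiled code. They avoid the per-element Python function call that apply() and explicit loops incur, so they are typically much faster.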

In [10]:
df["C"] = df["A"] * 2  # Fast vectorization

df["D"] = df["A"].apply(lambda x: x * 2)  # Slower than vectorized
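
A rough way to see the difference is to time both forms on a larger column. This is only a sketch: the frame `df_vec` and its size are arbitrary choices, and the measured times depend on the machine, so no specific numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

df_vec = pd.DataFrame({"A": np.arange(100_000)})

vectorized = df_vec["A"] * 2                       # one call into compiled code
applied = df_vec["A"].apply(lambda x: x * 2)       # one Python call per element

# Both approaches produce the same values
same_values = bool((vectorized == applied).all())

# Time each approach over a few repetitions
t_vec = timeit.timeit(lambda: df_vec["A"] * 2, number=5)
t_apply = timeit.timeit(lambda: df_vec["A"].apply(lambda x: x * 2), number=5)

print(f"same values: {same_values}")
print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s")
```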

4. How would you optimize performance in pandas when working with very large datasets? Discuss memory usage, types, chunking, Dask, etc.

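Use categorical dtypes for string columns with few distinct values

Convert strings that hold numbers to numeric types, and downcast numerics where the value range allows

Use chunking when reading large files

Avoid Python loops; prefer vectorized operations

Use .query() and .eval() on large frames, where they can reduce intermediate copies

For data that does not fit in memory, consider an out-of-core or parallel library such as Dask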

In [12]:
# for chunk in pd.read_csv("big.csv", chunksize=100000):
#     process(chunk)
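
Since `big.csv` is not available here, the sketch below simulates a file with an in-memory buffer to show chunked reading and the memory effect of a categorical dtype. The CSV content, the column names `city` and `amount`, and the chunk size are all made up for illustration:

```python
import io

import pandas as pd

# Build a small CSV in memory as a stand-in for a large file on disk
rows = [f"{'Paris' if i % 2 else 'Oslo'},{i}" for i in range(1000)]
csv_text = "city,amount\n" + "\n".join(rows)

# Process the file in chunks, keeping only a running aggregate in memory
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total += int(chunk["amount"].sum())

# Compare memory for a low-cardinality string column before and after conversion
full = pd.read_csv(io.StringIO(csv_text))
bytes_as_string = full["city"].memory_usage(deep=True)
bytes_as_category = full["city"].astype("category").memory_usage(deep=True)

print(total)  # 499500
print(bytes_as_string, bytes_as_category)
```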


5. Explain the difference between merge(), join(), and concat(). Provide examples of when to use each.

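| Function   | Purpose                                   |
| ---------- | ----------------------------------------- |
| `merge()`  | SQL-style join on keys                    |
| `join()`   | Join using index or key column            |
| `concat()` | Append DataFrames vertically/horizontally |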

In [13]:
a = pd.DataFrame({"id":[1,2],"A":[10,20]})
b = pd.DataFrame({"id":[1,2],"B":[30,40]})

pd.merge(a, b, on="id")        # Merge on key
a.join(b.set_index("id"), on="id")  # Join using index
pd.concat([a, b], axis=0)      # Append rows


   id     A     B
0   1  10.0   NaN
1   2  20.0   NaN
0   1   NaN  30.0
1   2   NaN  40.0
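
The example above uses frames whose keys match exactly. When keys only partly overlap, the `how` argument of `merge()` decides which rows survive. The following sketch uses two hypothetical frames, `left` and `right`, to show an inner join, an outer join with `indicator=True`, and a column-wise `concat()` aligned on the index:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "A": [10, 20, 30]})
right = pd.DataFrame({"id": [1, 2, 4], "B": [30, 40, 50]})

# Inner join: only ids present in both frames
inner = pd.merge(left, right, on="id", how="inner")

# Outer join: all ids; the _merge column records where each row came from
outer = pd.merge(left, right, on="id", how="outer", indicator=True)
origin = outer.set_index("id")["_merge"].astype(str).to_dict()

# Column-wise concat: aligns on the index instead of a key column
side_by_side = pd.concat([left.set_index("id"), right.set_index("id")], axis=1)

print(inner["id"].tolist())  # [1, 2]
print(origin)
print(side_by_side.shape)    # (4, 2)
```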


6. How do you group data in pandas and compute aggregated metrics? Show knowledge of groupby(), multiple aggregations, and transform().

In [16]:
#groupby
df = pd.DataFrame({
    "team": ["A","A","B","B"],
    "score": [10,20,30,40],
    "time": [5,10,15,20]
})

df_group = df.groupby("team").agg({"score": "mean", "time": "sum"})
df_group

      score  time
team             
A      15.0    15
B      35.0    35


In [18]:
# transform
df["score_mean"] = df.groupby("team")["score"].transform("mean")
df

  team  score  time  score_mean
0    A     10     5        15.0
1    A     20    10        15.0
2    B     30    15        35.0
3    B     40    20        35.0
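
For several aggregations at once, named aggregation keeps the output columns flat and readable. The sketch below rebuilds the same data under the name `df_scores`; the output column names (`score_mean`, `score_max`, `time_total`, `score_share`) are chosen here for illustration:

```python
import pandas as pd

df_scores = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10, 20, 30, 40],
    "time": [5, 10, 15, 20],
})

# Named aggregation: output_name=(input_column, function)
summary = df_scores.groupby("team").agg(
    score_mean=("score", "mean"),
    score_max=("score", "max"),
    time_total=("time", "sum"),
)

# transform returns a result aligned to the original rows,
# so it can be combined with the original column directly
team_total = df_scores.groupby("team")["score"].transform("sum")
df_scores["score_share"] = df_scores["score"] / team_total

print(summary)
print(df_scores)
```

`agg()` reduces each group to one row, while `transform()` keeps the original shape, which is why it suits per-row features such as a share of the group total.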


7. What is the difference between a DataFrame and a Series?

| Feature       | `Series` | `DataFrame` |
| ------------- | -------- | ----------- |
| Structure     | 1D       | 2D          |
| Equivalent to | Column   | Table       |


In [19]:
series = pd.Series([1,2,3])
df = pd.DataFrame({"col":[1,2,3]})
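
The relationship between the two shows up when selecting columns: single brackets return a Series, double brackets return a DataFrame. A short sketch with a hypothetical frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"col": [1, 2, 3], "other": [4, 5, 6]})

one_col = df_demo["col"]          # Series: 1D, has a name and an index
one_col_frame = df_demo[["col"]]  # DataFrame: 2D, even with a single column

print(type(one_col).__name__, one_col.shape)              # Series (3,)
print(type(one_col_frame).__name__, one_col_frame.shape)  # DataFrame (3, 1)
```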


8. How do you convert data types efficiently, especially when reading data from external sources? Discuss astype(), category types, and parsing options.

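Choosing appropriate dtypes reduces memory use and speeds up computation. Convert after loading with astype(), pd.to_datetime(), or pd.to_numeric(), or declare types at read time through the dtype and parse_dates arguments of read_csv().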

In [21]:
# df["date"] = pd.to_datetime(df["date"])
# df["category_col"] = df["category_col"].astype("category")
# df["num"] = pd.to_numeric(df["num"], errors="coerce")
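
The commented lines above assume columns that are not defined in this notebook. A runnable sketch with a small in-memory CSV follows; the data, including the deliberately invalid value `oops`, is invented for illustration:

```python
import io

import pandas as pd

raw = io.StringIO(
    "date,category_col,num\n"
    "2024-01-01,a,1\n"
    "2024-01-02,b,oops\n"
    "2024-01-03,a,3\n"
)

# Declare types while reading, so no second pass over the data is needed
df_raw = pd.read_csv(
    raw,
    dtype={"category_col": "category"},
    parse_dates=["date"],
)

# Values that cannot be parsed become NaN instead of raising an error
df_raw["num"] = pd.to_numeric(df_raw["num"], errors="coerce")

print(df_raw.dtypes)
print(df_raw)
```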


9. Explain the role of index in pandas. How do you set, reset, and use multi-indexing?

In pandas, the index labels each row. It plays a role similar to a primary key in a relational database, although pandas does not require index values to be unique.

| Benefit                   | Example usage                     |
| ------------------------- | --------------------------------- |
| Faster lookups            | `df.loc['ID_101']`                |
| Cleaner joins             | Merging based on index            |
| Hierarchical organization | MultiIndexes for grouped data     |
| Reshaping data            | `stack()`, `unstack()`, `pivot()` |
| Labeled access            | Slice by label instead of integer |


In [25]:
import pandas as pd

data = {
    "country": ["USA", "USA", "Canada", "Canada"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 150, 90, 120],
    "profit": [30, 50, 25, 40]
}

df = pd.DataFrame(data)
print(df)
# The default index is 0, 1, 2, 3.


  country  year  sales  profit
0     USA  2022    100      30
1     USA  2023    150      50
2  Canada  2022     90      25
3  Canada  2023    120      40


In [27]:
#set a column as an index
df_index = df.set_index("year")
print(df_index)


     country  sales  profit
year                       
2022     USA    100      30
2023     USA    150      50
2022  Canada     90      25
2023  Canada    120      40


In [30]:
#reset index
df_reset = df_index.reset_index()
print(df_reset)


   year country  sales  profit
0  2022     USA    100      30
1  2023     USA    150      50
2  2022  Canada     90      25
3  2023  Canada    120      40


In [32]:
#multi index
df_multi = df.set_index(['country', 'year'])
print(df_multi)


              sales  profit
country year               
USA     2022    100      30
        2023    150      50
Canada  2022     90      25
        2023    120      40


In [35]:
#select by multi index
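df_multi.loc[('USA', 2022)]              # one row: both levels specified
df_multi.loc['Canada']                   # all rows for one outer-level label
df_multi.loc[pd.IndexSlice[:, 2023], :]  # every country for 2023 (only this last result is displayed)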


              sales  profit
country year               
USA     2023    150      50
Canada  2023    120      40


Useful MultiIndex operations:

| Operation                       | Code                                                               |
| ------------------------------- | ------------------------------------------------------------------ |
| Swap multi-index levels         | `df_multi.swaplevel()`                                             |
| Remove one index level          | `df_multi.reset_index(level='year')`                               |
| Flatten MultiIndex column names | `df.columns = ['_'.join(col) for col in df.columns]` after groupby |
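
The operations in the table can be tried on the sales data from this section. The sketch below rebuilds that frame under the name `df_sales` so it runs on its own; `xs()` is shown as one additional way to select on an inner level:

```python
import pandas as pd

df_sales = pd.DataFrame({
    "country": ["USA", "USA", "Canada", "Canada"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 150, 90, 120],
    "profit": [30, 50, 25, 40],
})
df_multi = df_sales.set_index(["country", "year"])

# Swap the two index levels, then sort so lookups on the new outer level are efficient
swapped = df_multi.swaplevel().sort_index()

# Cross-section: all countries for one year, dropping the year level
year_2023 = df_multi.xs(2023, level="year")

# Multiple aggregations produce MultiIndex columns; join the parts to flatten them
agg = df_sales.groupby("country").agg({"sales": ["sum", "mean"]})
agg.columns = ["_".join(col) for col in agg.columns]

print(list(swapped.index.names))  # ['year', 'country']
print(year_2023)
print(list(agg.columns))          # ['sales_sum', 'sales_mean']
```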


10. Describe how you would detect and remove duplicate records. Use duplicated() and drop_duplicates() with subset handling.

In [36]:
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Bob", "Alice"],
    "email": [
        "alice@mail.com",
        "bob@mail.com",
        "charlie@mail.com",
        "bob@mail.com",     # duplicate
        "alice@mail.com"    # duplicate
    ],
    "age": [25, 30, 35, 31, 26]  # different values for duplicates
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)


Original DataFrame:
      name             email  age
0    Alice    alice@mail.com   25
1      Bob      bob@mail.com   30
2  Charlie  charlie@mail.com   35
3      Bob      bob@mail.com   31
4    Alice    alice@mail.com   26


In [37]:
#detect duplicates
dup_mask = df.duplicated(subset=['email'], keep='first')
print("\nDuplicate Mask:")
print(dup_mask)



Duplicate Mask:
0    False
1    False
2    False
3     True
4     True
dtype: bool


In [39]:
#drop duplicates
df_cleaned = df.drop_duplicates(subset=['email'], keep='first')
print("\nCleaned DataFrame:")
print(df_cleaned)



Cleaned DataFrame:
      name             email  age
0    Alice    alice@mail.com   25
1      Bob      bob@mail.com   30
2  Charlie  charlie@mail.com   35
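
Keeping the first occurrence is not always the right rule. Two common variations are flagging every member of a duplicate group with `keep=False`, and sorting before dropping so that a chosen record survives. In this sketch the frame is rebuilt as `df_people`, and treating the highest `age` as the most recent record is purely an assumption for illustration:

```python
import pandas as pd

df_people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Bob", "Alice"],
    "email": [
        "alice@mail.com",
        "bob@mail.com",
        "charlie@mail.com",
        "bob@mail.com",
        "alice@mail.com",
    ],
    "age": [25, 30, 35, 31, 26],
})

# keep=False marks every row that belongs to a duplicate group, for inspection
all_dups = df_people[df_people.duplicated(subset=["email"], keep=False)]

# Sort first, then keep the last row per email (here: the highest age)
latest = (
    df_people.sort_values("age")
    .drop_duplicates(subset=["email"], keep="last")
    .sort_index()
)

print(all_dups.index.tolist())  # [0, 1, 3, 4]
print(latest["age"].tolist())   # [35, 31, 26]
```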
