1. Explain the difference between .loc[] and .iloc[]. When would you use each?

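| Accessor  | Indexing type          | Use when                                                                 |
| --------- | ---------------------- | ------------------------------------------------------------------------ |
| `.loc[]`  | Label-based            | Selecting by index/column labels or a boolean mask; slices include the end label |
| `.iloc[]` | Integer position-based | Selecting by position regardless of labels; slices exclude the end position      |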

In [5]:
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])
print(df.loc["y",'A'])     # Access by index label
print(df.iloc[1,0])      # Access by integer position

20
20
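The slicing behaviour is where the two accessors most often surprise people. A minimal sketch on the same small frame (the names `by_label` and `by_position` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])

# Label slice: both endpoints ("x" and "y") are included
by_label = df.loc["x":"y", "A"]

# Position slice: the end position (1) is excluded, like a Python list slice
by_position = df.iloc[0:1, 0]

print(len(by_label))     # 2
print(len(by_position))  # 1
```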


2. How do you handle missing data in pandas? Describe multiple strategies and when each is appropriate (e.g., drop, fill, interpolation).

Common methods:

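Drop rows/columns: dropna(), when only a small share of rows or columns is affected and they can be discarded

Fill values: fillna(), when a sensible default exists (a constant, the column mean/median, or the previous value)

Interpolate: interpolate(), for ordered numeric data such as time series, where a missing value can be estimated from its neighbours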

In [9]:
df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})
df_drop = df.dropna()
df_fill = df.fillna(0)
df_interp = df.interpolate(method='nearest')  # method='nearest' requires SciPy
df_interp


     A    B
0  1.0  4.0
1  1.0  5.0
2  3.0  NaN
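Before choosing a strategy it usually helps to count the missing values first. The following is a small sketch of a few common fills on the same toy data; the variable names are illustrative, and which fill is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# How many values are missing in each column?
missing_per_column = df.isna().sum()

# Fill each column with its own mean (A -> 2.0, B -> 4.5)
filled_mean = df.fillna(df.mean())

# Carry the last valid value forward (typical for time-ordered data)
filled_ffill = df.ffill()

# Default interpolation is linear between neighbouring valid values
linear = df.interpolate()

print(missing_per_column)
print(filled_mean)
```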


3. What are vectorized operations in pandas, and why are they preferred over apply() or loops?

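Vectorized operations act on entire columns (arrays) at once and run in optimized compiled code instead of a Python-level loop. That makes them much faster than apply() or explicit loops, which call a Python function once per element.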

In [10]:
df["C"] = df["A"] * 2  # Fast vectorization

df["D"] = df["A"].apply(lambda x: x * 2)  # Slower than vectorized
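Conditional logic can often be vectorized too. The sketch below compares `np.where` with an equivalent `apply()` call and times both; the timings depend on the machine, so no specific numbers are claimed here.

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(200_000)})

# Vectorized: the condition is evaluated on the whole column at once
start = time.perf_counter()
vectorized = np.where(df["A"] % 2 == 0, "even", "odd")
t_vec = time.perf_counter() - start

# apply(): the lambda is called once per element in Python
start = time.perf_counter()
with_apply = df["A"].apply(lambda x: "even" if x % 2 == 0 else "odd")
t_apply = time.perf_counter() - start

# Both approaches produce the same labels
same = vectorized.tolist() == with_apply.tolist()
print(same)
print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s")
```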

4. How would you optimize performance in pandas when working with very large datasets? Discuss memory usage, types, chunking, Dask, etc.

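Use categorical dtypes for low-cardinality string columns

Convert numeric strings to numeric types, and downcast where the value range allows

Use chunking when reading large files

Avoid Python loops; prefer vectorized operations

Use .query() and .eval() on large frames (they can use numexpr, if installed, to speed up expression evaluation)

Consider an out-of-core tool such as Dask when the data does not fit in memory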

In [12]:
# for chunk in pd.read_csv("big.csv", chunksize=100000):
#     process(chunk)
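To make the chunking and dtype advice concrete, here is a self-contained sketch. It reads from an in-memory CSV (a stand-in for a real file such as "big.csv"); the column names `city` and `amount` are made up for illustration.

```python
import io

import pandas as pd

# Build a small CSV in memory as a stand-in for a large file on disk
rows = "\n".join(f"{'NYC' if i % 2 else 'LA'},{i}" for i in range(1000))
csv_text = "city,amount\n" + rows

# Chunking: aggregate piece by piece instead of loading everything at once
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total += chunk["amount"].sum()
print(total)  # 499500

# Dtypes: compare memory before and after conversion
df = pd.read_csv(io.StringIO(csv_text))
before = df.memory_usage(deep=True).sum()
df["city"] = df["city"].astype("category")                      # few distinct values
df["amount"] = pd.to_numeric(df["amount"], downcast="integer")  # smallest int type that fits
after = df.memory_usage(deep=True).sum()
print(before, after)
```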


5. Explain the difference between merge(), join(), and concat(). Provide examples of when to use each.

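| Function   | Purpose                                   |
| ---------- | ----------------------------------------- |
| `merge()`  | SQL-style join on keys                    |
| `join()`   | Join using index or key column            |
| `concat()` | Append DataFrames vertically/horizontally |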

In [13]:
a = pd.DataFrame({"id":[1,2],"A":[10,20]})
b = pd.DataFrame({"id":[1,2],"B":[30,40]})

pd.merge(a, b, on="id")        # Merge on key
a.join(b.set_index("id"), on="id")  # Join using index
pd.concat([a, b], axis=0)      # Append rows


   id     A     B
0   1  10.0   NaN
1   2  20.0   NaN
0   1   NaN  30.0
1   2   NaN  40.0
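The difference becomes clearer when the keys only partly overlap. A short sketch with made-up frames, using the `how` and `indicator` parameters of `merge()`:

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2, 3], "A": [10, 20, 30]})
b = pd.DataFrame({"id": [2, 3, 4], "B": [200, 300, 400]})

# Inner join (default): only ids present in both frames
inner = pd.merge(a, b, on="id")

# Outer join: all ids; the _merge column records where each row came from
outer = pd.merge(a, b, on="id", how="outer", indicator=True)

# concat along columns aligns on the index instead of a key column
side_by_side = pd.concat([a.set_index("id"), b.set_index("id")], axis=1)

print(inner)
print(outer)
print(side_by_side)
```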


6. How do you group data in pandas and compute aggregated metrics? Show knowledge of groupby(), multiple aggregations, and transform().

In [48]:
#groupby
df = pd.DataFrame({
    "team": ["A","A","B","B"],
    "score": [10,20,30,40],
    "time": [5,10,15,20]
})

df_group = df.groupby("team").agg({"score": "mean", "time": "sum"})
df_group

      score  time
team
A      15.0    15
B      35.0    35


In [49]:
# transform
df["score_mean"] = df.groupby("team")["score"].transform("mean")
df

  team  score  time  score_mean
0    A     10     5        15.0
1    A     20    10        15.0
2    B     30    15        35.0
3    B     40    20        35.0
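For several aggregations at once, named aggregation keeps the output columns flat and readable. A sketch on the same data (the output column names are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10, 20, 30, 40],
    "time": [5, 10, 15, 20],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = df.groupby("team").agg(
    score_mean=("score", "mean"),
    score_max=("score", "max"),
    time_total=("time", "sum"),
)

# transform keeps the original row count, so it can feed per-row calculations
df["score_share"] = df["score"] / df.groupby("team")["score"].transform("sum")

print(summary)
print(df)
```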


7. What is the difference between a DataFrame and a Series?

| Feature       | `Series` | `DataFrame` |
| ------------- | -------- | ----------- |
| Structure     | 1D       | 2D          |
| Equivalent to | Column   | Table       |


In [19]:
series = pd.Series([1,2,3])
df = pd.DataFrame({"col":[1,2,3]})
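The two types are closely related: selecting one column from a DataFrame gives a Series, while selecting with a list of columns gives a DataFrame. A small sketch (the `other` column is added only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3], "other": [4, 5, 6]})

as_series = df["col"]    # single brackets -> Series (1D)
as_frame = df[["col"]]   # list of columns -> DataFrame (2D)

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```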


8. How do you convert data types efficiently, especially when reading data from external sources? Discuss astype(), category types, and parsing options.

Choosing the right dtypes reduces memory usage and speeds up operations.

In [21]:
# df["date"] = pd.to_datetime(df["date"])
# df["category_col"] = df["category_col"].astype("category")
# df["num"] = pd.to_numeric(df["num"], errors="coerce")


In [42]:
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "age": ["25", "30", "35", "40"],
    "salary": ["50,000", "60,000", "70,000", "80,000"],
    "department": ["HR", "IT", "IT", "Finance"],
    "join_date": ["2021-01-15", "2020-06-01", "2019-09-23", "2018-03-10"],
    "active": ["True", "False", "True", "True"]
})

df.dtypes

id            object
age           object
salary        object
department    object
join_date     object
active        object
dtype: object

In [43]:
# Convert one column at a time
df["age"] = df["age"].astype(int)
df["id"] = df["id"].astype(int)

# Or convert several columns in a single call
df = df.astype({
    "age": "int",
    "id": "int"
})

# Not recommended: any non-empty string, including "False", becomes True
# df["active"] = df["active"].astype(bool)

# Safer approach: map the string values explicitly
df["active"] = df["active"].map({"True": True, "False": False})

# Handling errors: errors="ignore" returns the column unchanged if the conversion fails
df["age"] = df["age"].astype(int, errors="ignore")

# Limitation of astype():
# it cannot clean or parse values (e.g., "50,000" → 50000)

df.dtypes

id             int32
age            int32
salary        object
department    object
join_date     object
active          bool
dtype: object

Categorical (category) dtype

Category is a special pandas dtype for columns with repeated, limited values.

Benefits:
- Reduced memory usage
- Faster comparisons and grouping
- Optional ordering of values

In [44]:
df["department"] = df["department"].astype("category")
df.dtypes
df["department"].cat.categories

Index(['Finance', 'HR', 'IT'], dtype='object')

Parsing options (cleaning + conversion)

Parsing means interpreting strings into proper types (numbers, dates, booleans).

In [45]:
#Problem

df['salary']

0    50,000
1    60,000
2    70,000
3    80,000
Name: salary, dtype: object

In [46]:
df["salary"] = pd.to_numeric(
    df["salary"].str.replace(",", ""),
    errors="coerce"
)

#you can also use astype
# df["salary"] = (
#     df["salary"]
#     .str.replace(",", "", regex=False)
#     .astype(int)
# )

In [33]:
df["salary"]

0    50000
1    60000
2    70000
3    80000
Name: salary, dtype: int64

In [40]:
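# Parsing dates
# Without a format, pandas has to infer it, which is slower and can misread ambiguous dates
df["join_date"] = pd.to_datetime(df["join_date"])

# Safer and faster: specify the format explicitly
df["join_date"] = pd.to_datetime(
    df["join_date"],
    format="%Y-%m-%d"
)

# To handle errors: unparseable values become NaT instead of raising
df["join_date"] = pd.to_datetime(
    df["join_date"],
    format="%Y-%m-%d",
    errors="coerce"
)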

df.dtypes

id                     int32
age                    int32
salary                 int64
department          category
join_date     datetime64[ns]
active                  bool
dtype: object

| Task                  | Best Tool                          |
| --------------------- | ---------------------------------- |
| Force dtype change    | `astype()`                         |
| Clean numeric strings | `str.replace()` + `to_numeric()`   |
| Parse dates           | `to_datetime()`                    |
| Memory optimization   | `category`                         |
| Handle invalid values | `errors="coerce"`                  |
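When the data comes from a file, much of this conversion can be requested at read time. The sketch below uses an in-memory CSV as a stand-in for an external source and relies on the `dtype`, `thousands`, and `parse_dates` parameters of `pd.read_csv`; the sample rows are made up.

```python
import io

import pandas as pd

csv_text = (
    "id,age,salary,department,join_date\n"
    '1,25,"50,000",HR,2021-01-15\n'
    '2,30,"60,000",IT,2020-06-01\n'
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int32", "age": "int32", "department": "category"},
    thousands=",",             # parse "50,000" as the number 50000
    parse_dates=["join_date"], # convert this column to datetime
)

print(df.dtypes)
print(df["salary"].tolist())  # [50000, 60000]
```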


9. Explain the role of index in pandas. How do you set, reset, and use multi-indexing?

In pandas, the index labels each row and is used for alignment, selection, and joins. It plays a role similar to a primary key in a relational database, but pandas does not require index values to be unique.

| Benefit                   | Example usage                     |
| ------------------------- | --------------------------------- |
| Faster lookups            | `df.loc['ID_101']`                |
| Cleaner joins             | Merging based on index            |
| Hierarchical organization | MultiIndexes for grouped data     |
| Reshaping data            | `stack()`, `unstack()`, `pivot()` |
| Labeled access            | Slice by label instead of integer |


In [25]:
import pandas as pd

data = {
    "country": ["USA", "USA", "Canada", "Canada"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 150, 90, 120],
    "profit": [30, 50, 25, 40]
}

df = pd.DataFrame(data)
print(df)
# The default index is 0, 1, 2, 3.


  country  year  sales  profit
0     USA  2022    100      30
1     USA  2023    150      50
2  Canada  2022     90      25
3  Canada  2023    120      40


In [27]:
#set a column as an index
df_index = df.set_index("year")
print(df_index)


     country  sales  profit
year                       
2022     USA    100      30
2023     USA    150      50
2022  Canada     90      25
2023  Canada    120      40


In [30]:
#reset index
df_reset = df_index.reset_index()
print(df_reset)


   year country  sales  profit
0  2022     USA    100      30
1  2023     USA    150      50
2  2022  Canada     90      25
3  2023  Canada    120      40


In [32]:
#multi index
df_multi = df.set_index(['country', 'year'])
print(df_multi)


              sales  profit
country year               
USA     2022    100      30
        2023    150      50
Canada  2022     90      25
        2023    120      40


In [35]:
#select by multi index
df_multi.loc[('USA', 2022)]
df_multi.loc['Canada']
df_multi.loc[pd.IndexSlice[:, 2023], :]


              sales  profit
country year
USA     2023    150      50
Canada  2023    120      40


Useful MultiIndex operations

| Operation                       | Code                                                               |
| ------------------------------- | ------------------------------------------------------------------ |
| Swap multi-index levels         | `df_multi.swaplevel()`                                             |
| Remove one index level          | `df_multi.reset_index(level='year')`                               |
| Flatten MultiIndex column names | `df.columns = ['_'.join(col) for col in df.columns]` after groupby |
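Two of these operations in action, as a rough sketch on the same sales data: a cross-section with `xs()` as an alternative to `pd.IndexSlice`, and flattening the two-level column names that a multi-function `agg()` produces.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "USA", "Canada", "Canada"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 150, 90, 120],
})

# Sorting the MultiIndex makes label slicing efficient and predictable
df_multi = df.set_index(["country", "year"]).sort_index()

# Cross-section: all rows for 2023; the "year" level is dropped from the result
sales_2023 = df_multi.xs(2023, level="year")

# Several aggregations per column produce MultiIndex columns; join them into flat names
agg = df.groupby("country").agg({"sales": ["sum", "mean"]})
agg.columns = ["_".join(col) for col in agg.columns]

print(sales_2023)
print(list(agg.columns))  # ['sales_sum', 'sales_mean']
```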


10. Describe how you would detect and remove duplicate records. Use duplicated() and drop_duplicates() with subset handling.

In [36]:
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Bob", "Alice"],
    "email": [
        "alice@mail.com",
        "bob@mail.com",
        "charlie@mail.com",
        "bob@mail.com",     # duplicate
        "alice@mail.com"    # duplicate
    ],
    "age": [25, 30, 35, 31, 26]  # different values for duplicates
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)


Original DataFrame:
      name             email  age
0    Alice    alice@mail.com   25
1      Bob      bob@mail.com   30
2  Charlie  charlie@mail.com   35
3      Bob      bob@mail.com   31
4    Alice    alice@mail.com   26


In [37]:
#detect duplicates
dup_mask = df.duplicated(subset=['email'], keep='first')
print("\nDuplicate Mask:")
print(dup_mask)



Duplicate Mask:
0    False
1    False
2    False
3     True
4     True
dtype: bool


In [39]:
#drop duplicates
df_cleaned = df.drop_duplicates(subset=['email'], keep='first')
print("\nCleaned DataFrame:")
print(df_cleaned)



Cleaned DataFrame:
      name             email  age
0    Alice    alice@mail.com   25
1      Bob      bob@mail.com   30
2  Charlie  charlie@mail.com   35
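Two variations that come up often: flagging every member of a duplicate group with `keep=False`, and choosing which duplicate survives by sorting first. In the sketch below the row with the higher `age` is treated as the more recent record, which is purely an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Bob", "Alice"],
    "email": ["alice@mail.com", "bob@mail.com", "charlie@mail.com",
              "bob@mail.com", "alice@mail.com"],
    "age": [25, 30, 35, 31, 26],
})

# keep=False marks all rows that share an email, not just the later ones
all_dups = df[df.duplicated(subset=["email"], keep=False)]

# Sort so the preferred row comes last, then keep the last row per email
latest = (
    df.sort_values("age")
      .drop_duplicates(subset=["email"], keep="last")
      .sort_index()
)

print(all_dups)
print(latest)
```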


11. How do you invert (negate) a boolean mask?

In [4]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan, 31]
})


~df['age'].isna()

0     True
1    False
2     True
3    False
4     True
Name: age, dtype: bool
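The `~` operator is the element-wise NOT for boolean Series (Python's `not` raises a ValueError on a Series). A short sketch showing that `~isna()` matches `notna()`, and how the inverted mask combines with another condition:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 31]})

mask = df["age"].isna()
inverted = ~mask

# ~isna() gives the same result as notna()
print(inverted.equals(df["age"].notna()))  # True

# Combine with other conditions using & and |, wrapping comparisons in parentheses
known_over_30 = df[~mask & (df["age"] > 30)]
print(known_over_30)
```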

12. How do you insert NaN values at random positions?

In [13]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random(size=(5, 3)))

# Probability of NaN (20%)
mask = np.random.rand(*df.shape) < 0.2

df[mask] = np.nan
print(df)

          0         1         2
0       NaN  0.450999  0.382686
1  0.964669  0.728758  0.198475
2       NaN  0.526033       NaN
3  0.679556  0.668343  0.934766
4       NaN  0.968687  0.431033
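For reproducible results, a seeded generator can be used, and `DataFrame.mask()` returns a new frame instead of modifying the original. A sketch under those assumptions (the seed 42 and the column names are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the same cells are chosen each run

df = pd.DataFrame(rng.random((5, 3)), columns=["a", "b", "c"])

# Each cell is selected independently with probability 0.2
nan_mask = rng.random(df.shape) < 0.2

# mask() replaces values where the condition is True (NaN by default)
df_with_nan = df.mask(nan_mask)

print(df_with_nan)
print(int(nan_mask.sum()), "cells set to NaN")
```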


13. What is the difference between map and filter in pandas? Explain with examples.

map in pandas

Purpose: Apply a function, dict, or Series lookup to each element of a Series (for example, a single DataFrame column).

Works on: Series objects.

Returns: A Series with the same index but transformed values.

In [1]:
import pandas as pd

# Sample Series
s = pd.Series([1, 2, 3, 4, 5])

# Using map to square each value
squared = s.map(lambda x: x**2)
print(squared)

0     1
1     4
2     9
3    16
4    25
dtype: int64


In [2]:
# You can also pass a dictionary; map returns a new Series, so assign the result
s = pd.Series(['yes','yes','no','yes'])
mapped = s.map({'yes':True,'no':False})
mapped

0     True
1     True
2    False
3     True
dtype: bool
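One caveat with dictionary mapping: values that have no matching key become NaN. A small sketch, with a hypothetical `lookup` dict and an extra "maybe" value added to show the effect:

```python
import pandas as pd

s = pd.Series(["yes", "no", "maybe"])
lookup = {"yes": True, "no": False}

# "maybe" is not a key in the dict, so it maps to NaN
mapped = s.map(lookup)
missing = mapped.isna().tolist()
print(missing)  # [False, False, True]

# A function gives control over the fallback value
with_default = s.map(lambda v: lookup.get(v, False))
print(with_default.tolist())  # [True, False, False]
```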

filter in pandas

Purpose: Select columns or rows from a DataFrame by their labels (names). It does not filter on the data values; use boolean indexing or query() for that.

Works on: DataFrame objects (columns by default; pass axis=0 to filter rows by index label).

Returns: A DataFrame with only the filtered rows/columns.

In [17]:
# Sample DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
    "department": ["HR", "IT", "IT", "Finance"],
    "start_year": [2018, 2016, 2015, 2012]
})
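# Filter columns whose name contains 'm' (here: name, department)
filtered_cols = df.filter(like='m', axis=1)
# print(filtered_cols)

# Filter specific columns by name
df.filter(items=["name", "salary"])

# Filter columns whose name matches a regex (here: names starting with 's')
df.filter(regex="^s")

# Filter rows by index label instead of columns
df.filter(items=[0, 2], axis=0)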

      name  age  salary department  start_year
0    Alice   25   50000         HR        2018
2  Charlie   35   70000         IT        2015
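Since `filter()` only looks at labels, selecting rows by their values is done with boolean indexing or `query()`. A sketch on a trimmed version of the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
    "department": ["HR", "IT", "IT", "Finance"],
})

# Boolean indexing: keep rows where the condition on the values holds
over_30 = df[df["age"] > 30]

# query(): the same idea written as an expression string
it_high_paid = df.query("department == 'IT' and salary > 60000")

print(over_30["name"].tolist())       # ['Charlie', 'David']
print(it_high_paid["name"].tolist())  # ['Charlie']
```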


apply

Works on: Series and DataFrames.

Purpose: Apply a function along an axis (for DataFrames) or element-wise (for Series).

Input: Any function that can take a row, column, or element (depending on context).

Returns: Series or DataFrame depending on the function.

In [3]:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [10, 20, 30]
})

# Sum across columns for each row
row_sum = df.apply(lambda row: row.sum(), axis=1)
print(row_sum)

0    11
1    22
2    33
dtype: int64


| Feature     | `map()`                                                          | `apply()`                  |
| ----------- | ---------------------------------------------------------------- | -------------------------- |
| Works on    | **Series** (element-wise `DataFrame.map()` exists since pandas 2.1) | **Series or DataFrame**    |
| Input       | dict, Series, or function                                        | function (very flexible)   |
| Output      | Series                                                           | Series or DataFrame        |
| Typical use | Element-wise value mapping                                       | Row/column-wise operations |
| Speed       | Usually faster                                                   | Slower (more general)      |

