# 📌 Pandas Interview Q&A

---

### **1. What is Pandas and why is it used in AI/ML?**

**Answer:**

* Pandas is a **Python library** for data manipulation and analysis.
* Provides two main data structures:

  * **Series** → 1D labeled array
  * **DataFrame** → 2D labeled table (rows & columns)

✅ **Relevance in ML/AI:**

* Data preprocessing (cleaning, missing values, encoding).
* Exploratory Data Analysis (EDA).
* Feature engineering before feeding into ML models.

---

### **2. Difference between Series and DataFrame**

**Answer:**

| Feature        | Series          | DataFrame                      |
| -------------- | --------------- | ------------------------------ |
| Dimensionality | 1D              | 2D (rows & cols)               |
| Example        | Column in Excel | Full Excel sheet               |
| Usage          | Single variable | Dataset with multiple features |

```python
import pandas as pd

s = pd.Series([10, 20, 30])
df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]})
```

---

### **3. How do you read and write data using Pandas?**

```python
# CSV
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)

# Excel
df = pd.read_excel("data.xlsx")
df.to_excel("output.xlsx", index=False)
```

✅ **Scenario:** Reading large CSV datasets for ML training.
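
When a CSV is too large to load in one go, one common option is to stream it in chunks. The sketch below is illustrative only: it uses an in-memory string as a stand-in for a large file, and the column names (`id`, `value`) and the chunk size are assumptions, not part of any real dataset.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file (hypothetical data).
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

total = 0
n_rows = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["value"].sum()
    n_rows += len(chunk)

# Running totals let us compute a mean without holding the whole file in memory.
mean_value = total / n_rows
print(mean_value)
```

Passing `usecols` or `dtype` to `read_csv` can reduce memory use further when only some columns are needed.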

---

### **4. How to inspect a dataset in Pandas?**

```python
df.head()      # first 5 rows
df.tail()      # last 5 rows
df.info()      # column types, non-null counts
df.describe()  # summary stats
df.shape       # (rows, cols)
```

✅ **Scenario:** Quickly checking dataset quality in data preprocessing.

---

### **5. How to select and filter data in Pandas?**

```python
# Column selection
df["A"]        # single column
df[["A","B"]]  # multiple columns

# Row selection
df.iloc[0]     # first row (by integer position)
df.loc[0, "A"] # value at row label 0, column "A"

# Conditional filtering
df[df["A"] > 5]
```
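
The difference between `.loc` (label-based) and `.iloc` (position-based) only becomes visible when the index is not the default `0..n-1`. A minimal sketch, assuming a hypothetical frame `df_idx` with index labels 10, 20, 30:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from row positions.
df_idx = pd.DataFrame({"A": [1, 6, 9]}, index=[10, 20, 30])

first_by_position = df_idx.iloc[0]["A"]  # position 0 is the row labeled 10
value_by_label = df_idx.loc[20, "A"]     # label 20 is the second row

# .loc slices include the end label; .iloc slices exclude the end position.
loc_slice = df_idx.loc[10:20]    # rows labeled 10 and 20
iloc_slice = df_idx.iloc[0:1]    # only the first row

# Combine conditions with & / | and wrap each condition in parentheses.
filtered = df_idx[(df_idx["A"] > 5) & (df_idx["A"] < 9)]
print(filtered)
```

Interviewers often probe exactly this point: `df.loc[0]` raises a `KeyError` here because no row carries the label 0.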

---

### **6. How to handle missing data in Pandas?**

```python
df.isnull().sum()   # count missing values per column
df = df.dropna()    # option 1: drop rows containing NaN
df = df.fillna(0)   # option 2: replace NaN with 0
```

✅ **Scenario:** Most ML estimators cannot handle NaN, so missing values are usually imputed or dropped before model training.
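
In practice, imputation statistics are usually computed on the training split only and then reused on the test split, to avoid leaking test information. A minimal sketch of that idea; the `train`/`test` frames and the `age`/`city` columns are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test splits with missing values.
train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "city": ["A", None, "A"]})
test = pd.DataFrame({"age": [np.nan, 30.0], "city": [None, "B"]})

# Learn the fill values from the training data only.
age_median = train["age"].median()     # median ignores NaN by default
city_mode = train["city"].mode()[0]    # most frequent non-missing value

# fillna accepts a dict mapping column name -> fill value.
fill_values = {"age": age_median, "city": city_mode}
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

Median and mode are just one reasonable choice here; the right strategy depends on the feature and the model.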

---

### **7. Explain GroupBy in Pandas with an example**

```python
data = {"Dept": ["IT", "HR", "IT", "HR"],
        "Salary": [50000, 40000, 60000, 45000]}
df = pd.DataFrame(data)

grouped = df.groupby("Dept")["Salary"].mean()
print(grouped)
# HR    42500.0
# IT    55000.0
```

✅ **Scenario:** Aggregating statistics like mean salary per department → useful in feature engineering.
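
For feature engineering, the group statistic often needs to be attached back to every row rather than collapsed to one row per group. A minimal sketch using the same department/salary data; the frame name `df_g` and the new column names are illustrative:

```python
import pandas as pd

df_g = pd.DataFrame({"Dept": ["IT", "HR", "IT", "HR"],
                     "Salary": [50000, 40000, 60000, 45000]})

# Named aggregation: several statistics per group in one call.
summary = df_g.groupby("Dept").agg(
    mean_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Salary", "count"),
)

# transform returns a result aligned with the original rows,
# so it can be assigned directly as a new feature column.
df_g["DeptMeanSalary"] = df_g.groupby("Dept")["Salary"].transform("mean")
df_g["SalaryVsDeptMean"] = df_g["Salary"] - df_g["DeptMeanSalary"]
print(df_g)
```

`agg` reduces each group to one row, while `transform` keeps the original shape; being able to state that distinction is a frequent follow-up question.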

---

### **8. What are vectorized operations in Pandas?**

```python
df["Bonus"] = df["Salary"] * 0.1
```

✅ **Explanation:**

* Instead of looping over rows, Pandas applies operations efficiently on entire columns.
* Internally built on **NumPy vectorization**.
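
To make the contrast concrete, the sketch below computes the same bonus with a row-by-row loop and with a vectorized expression, then uses `np.where` for vectorized conditional logic. The frame `df_v`, the `Band` column, and the 50,000 threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df_v = pd.DataFrame({"Salary": [50000, 40000, 60000, 45000]})

# Row-by-row loop: works, but is typically much slower on large frames.
bonus_loop = [row["Salary"] * 0.1 for _, row in df_v.iterrows()]

# Vectorized: one expression applied to the whole column.
df_v["Bonus"] = df_v["Salary"] * 0.1

# Vectorized if/else via np.where(condition, value_if_true, value_if_false).
df_v["Band"] = np.where(df_v["Salary"] >= 50000, "high", "standard")
print(df_v)
```

Both approaches give the same numbers; the vectorized form avoids Python-level iteration over rows.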

---

### **9. How do you merge/join/concatenate DataFrames?**

```python
# Concatenate vertically
pd.concat([df1, df2])

# Merge on column
pd.merge(df1, df2, on="ID", how="inner")
```

✅ **Scenario:** Combining multiple datasets before ML model training.
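
The `how` argument controls which keys survive a merge. A minimal sketch comparing join types; the `left`/`right` frames and their columns are hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"ID": [2, 3, 4], "y": [20, 30, 40]})

# ignore_index=True rebuilds a clean 0..n-1 index after stacking.
stacked = pd.concat([left, left], ignore_index=True)

# inner: only keys present in both frames (IDs 2 and 3).
inner = pd.merge(left, right, on="ID", how="inner")

# left: all keys from the left frame; unmatched rows get NaN in "y".
# indicator=True adds a "_merge" column showing where each row came from.
left_join = pd.merge(left, right, on="ID", how="left", indicator=True)

# outer: union of keys from both frames (IDs 1 to 4).
outer = pd.merge(left, right, on="ID", how="outer")
print(left_join)
```

Checking row counts or the `_merge` column after a join is a cheap way to catch unexpected key mismatches.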

---

### **10. Common Pandas interview coding questions**

#### **a) Find the mean of each column**

```python
df.mean(numeric_only=True)  # skip non-numeric columns (required in pandas 2.x if any exist)
```

#### **b) Normalize all numeric columns (0–1 range)**

```python
num = df.select_dtypes(include="number")  # min-max scaling only applies to numeric columns
df_norm = (num - num.min()) / (num.max() - num.min())
```

#### **c) One-hot encode categorical columns**

```python
pd.get_dummies(df, columns=["Category"])
```

#### **d) Sort by a column**

```python
df.sort_values(by="Salary", ascending=False)
```

#### **e) Apply a custom function**

```python
df["Salary_in_LPA"] = df["Salary"].apply(lambda x: x/100000)
```

---

✅ **Why Pandas is critical in ML interviews**

* A common rule of thumb is that most ML work (often quoted as roughly **80%**) goes into data cleaning & feature engineering, with the rest going to model training.
* Pandas is the go-to library for handling structured data before feeding into ML models.




# 📌 Hands-on Pandas Coding Tasks

---

### **1. Load dataset and inspect**

**Q:** Read a CSV, show first 5 rows, shape, and column info.

```python
import pandas as pd

df = pd.read_csv("employees.csv")

print(df.head())     # first 5 rows
print(df.shape)      # (rows, cols)
df.info()            # datatypes & null counts (prints directly; returns None)
```

✅ **Why asked:** To test if you can quickly explore datasets.

---

### **2. Select rows based on condition**

**Q:** Get all employees with salary > 50,000.

```python
high_salary = df[df["Salary"] > 50000]
print(high_salary)
```

✅ **Why asked:** Filtering rows is common in preprocessing.

---

### **3. Compute group-wise aggregation**

**Q:** Find average salary per department.

```python
dept_salary = df.groupby("Department")["Salary"].mean()
print(dept_salary)
```

✅ **Why asked:** Feature engineering (aggregate stats) in ML.

---

### **4. Handle missing values**

**Q:** Replace missing salaries with the mean salary.

```python
# Assign the result back; chained inplace fillna on a column is deprecated in pandas 2.x
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())
```

✅ **Why asked:** Missing value imputation before ML training.

---

### **5. Create a new column using vectorized operations**

**Q:** Add a 10% bonus column.

```python
df["Bonus"] = df["Salary"] * 0.10
```

✅ **Why asked:** Tests knowledge of Pandas vectorization (no loops).

---

### **6. Sort data**

**Q:** Sort employees by salary in descending order.

```python
df_sorted = df.sort_values(by="Salary", ascending=False)
print(df_sorted.head())
```

✅ **Why asked:** Sorting helps in ranking/feature extraction.

---

### **7. One-hot encoding categorical variables**

**Q:** Convert the `Department` column to dummy variables.

```python
df_encoded = pd.get_dummies(df, columns=["Department"])
print(df_encoded.head())
```

✅ **Why asked:** Preparing categorical features for ML models.
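
A common follow-up is what happens when the test data lacks a category that was present in training: the dummy columns no longer line up. One way to handle this is to reindex the test columns against the training columns. A minimal sketch with hypothetical `train` and `test` frames:

```python
import pandas as pd

train = pd.DataFrame({"Department": ["IT", "HR", "Finance"]})
test = pd.DataFrame({"Department": ["IT", "IT"]})  # no HR or Finance rows

# dtype=int gives 0/1 columns instead of booleans.
train_enc = pd.get_dummies(train, columns=["Department"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Department"], dtype=int)

# Align test to the training columns; categories absent from test become 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

For linear models, `drop_first=True` is sometimes added to avoid perfectly collinear dummy columns.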

---

### **8. Merge two DataFrames**

**Q:** Merge employee info with department info on `DeptID`.

```python
df1 = pd.DataFrame({"EmpID":[1,2,3], "DeptID":[10,20,30]})
df2 = pd.DataFrame({"DeptID":[10,20,30], "DeptName":["IT","HR","Finance"]})

merged = pd.merge(df1, df2, on="DeptID", how="inner")
print(merged)
```

✅ **Why asked:** Merging datasets is critical in real-world data pipelines.

---

### **9. Apply custom function**

**Q:** Create a column `Salary_in_LPA` (Lakhs per annum).

```python
df["Salary_in_LPA"] = df["Salary"].apply(lambda x: round(x/100000,2))
```

✅ **Why asked:** Apply transformation for feature engineering.

---

### **10. Normalize numeric columns (0–1 scale)**

**Q:** Normalize all numeric features for ML.

```python
numeric_cols = df.select_dtypes(include="number")
df[numeric_cols.columns] = (numeric_cols - numeric_cols.min()) / (numeric_cols.max() - numeric_cols.min())
```

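✅ **Why asked:** Many ML models (e.g. distance-based or gradient-descent-based ones) are sensitive to feature scale; tree-based models generally are not.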
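
One edge case worth mentioning in an interview: a constant column has `max - min = 0`, so plain min-max scaling produces NaN. A minimal sketch of one possible guard, replacing a zero range with 1 so constant columns map to 0; the frame `df_n` and its columns are hypothetical:

```python
import pandas as pd

df_n = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "const": [5.0, 5.0, 5.0],   # constant column: max - min == 0
    "name": ["x", "y", "z"],    # non-numeric, left untouched
})

num_cols = df_n.select_dtypes(include="number").columns
col_min = df_n[num_cols].min()
col_range = df_n[num_cols].max() - col_min

# Avoid division by zero: a zero range becomes 1, so the column scales to 0.
col_range = col_range.replace(0, 1)

df_n[num_cols] = (df_n[num_cols] - col_min) / col_range
print(df_n)
```

As with imputation, the min and range would normally be computed on the training split and reused on the test split.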

---

⚡ **Pro interview twist**: Sometimes they ask you to **combine multiple steps** (e.g., filter, group, aggregate, encode) in **one pipeline**.
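
A combined task of this kind can be written as a single method chain. The sketch below filters, derives a column, groups, aggregates, and sorts in one expression; the `employees` frame and the 40,000 threshold are illustrative assumptions:

```python
import pandas as pd

employees = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "HR", "Finance"],
    "Salary": [50000, 40000, 60000, 45000, None],
})

result = (
    employees
    .dropna(subset=["Salary"])                    # drop rows with no salary
    .query("Salary > 40000")                      # keep higher earners
    .assign(Bonus=lambda d: d["Salary"] * 0.10)   # vectorized new column
    .groupby("Department", as_index=False)
    .agg(mean_salary=("Salary", "mean"),
         total_bonus=("Bonus", "sum"))
    .sort_values("mean_salary", ascending=False)
    .reset_index(drop=True)
)
print(result)
```

Each step returns a new DataFrame, so the chain reads top to bottom and leaves `employees` unchanged.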

