## Pandas Interview Questions
---
### Question 1: Creating a DataFrame

Create a Pandas DataFrame from a dictionary that contains two lists: `name` with values `['Alice', 'Bob']` and `age` with values `[25, 30]`.

**Solution:**

In [3]:
import pandas as pd

dic = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
df = pd.DataFrame(dic)
print(df)

    name  age
0  Alice   25
1    Bob   30


### Question 2: Conditional Selection

Given a DataFrame `df`, select all rows where the 'Score' column is greater than 80 and the 'Status' column is 'Passed'.

**Solution:**

In [5]:
data = {'Name': ['Dan', 'Eva', 'Frank', 'Grace'],
        'Score': [85, 92, 74, 88],
        'Status': ['Passed', 'Passed', 'Failed', 'Passed']}
df = pd.DataFrame(data)

filtered_df = df[(df['Score'] > 80) & (df['Status'] == 'Passed')]
print(filtered_df)

    Name  Score  Status
0    Dan     85  Passed
1    Eva     92  Passed
3  Grace     88  Passed


**Explanation:** Similar to NumPy, Pandas uses boolean indexing. Each condition produces a boolean Series, and the two are combined with the element-wise AND operator (`&`). Each condition must be enclosed in parentheses because `&` binds more tightly than comparison operators such as `>` and `==` in Python.
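
As a rough illustration of the same filter written two ways, the sketch below compares the boolean-mask form with `DataFrame.query`. The variable names `by_mask` and `by_query` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Dan', 'Eva', 'Frank', 'Grace'],
    'Score': [85, 92, 74, 88],
    'Status': ['Passed', 'Passed', 'Failed', 'Passed'],
})

# Boolean-mask form: each comparison needs its own parentheses.
mask = (df['Score'] > 80) & (df['Status'] == 'Passed')
by_mask = df[mask]

# Query-string form: plain `and` is allowed inside the expression.
by_query = df.query("Score > 80 and Status == 'Passed'")

print(by_mask.equals(by_query))
```

The query form can be easier to read for long conditions, while the mask form is handy when the mask itself is reused.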
***
### Question 3: Handling Missing Data

You have a DataFrame `df` with `NaN` values in the 'Sales' column. How do you replace these `NaN` values with the mean of that same column?

**Solution:**

In [6]:
import numpy as np
data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr'], 'Sales': [200, 210, np.nan, 220]}
df = pd.DataFrame(data)

mean_sales = df['Sales'].mean()
df['Sales'] = df['Sales'].fillna(mean_sales)  # assign back instead of chained inplace=True
print(df)

  Month  Sales
0   Jan  200.0
1   Feb  210.0
2   Mar  210.0
3   Apr  220.0


**Explanation:** First, the mean of the 'Sales' column is calculated; `.mean()` skips `NaN` values by default. Then `.fillna()` replaces every `NaN` in 'Sales' with that mean, and the result is assigned back to the column. Avoid the chained form `df['Sales'].fillna(mean_sales, inplace=True)`: pandas emits a `FutureWarning` for it, and from pandas 3.0 it no longer updates `df`, because the intermediate object `df['Sales']` always behaves as a copy. Use `df['Sales'] = df['Sales'].fillna(mean_sales)` or `df.fillna({'Sales': mean_sales}, inplace=True)` instead.
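
A minimal sketch of the DataFrame-level form, where `fillna` takes a dict that maps column names to fill values. The name `filled` is chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'Sales': [200, 210, np.nan, 220],
})

# Fill only the 'Sales' column; other columns are left untouched.
# Without inplace=True, fillna returns a new DataFrame.
filled = df.fillna({'Sales': df['Sales'].mean()})

print(filled)
print(df['Sales'].isna().sum())  # the original still has its NaN
```

Returning a new object keeps the original data available, which can help when comparing before and after.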
***
### Question 4: Grouping and Aggregation

Given a DataFrame of employee data with 'Department' and 'Salary' columns, how would you find the average salary for each department?

**Solution:**

In [None]:
data = {'Department': ['HR', 'IT', 'HR', 'IT', 'IT'],
        'Salary': [60000, 85000, 62000, 90000, 88000]}
df = pd.DataFrame(data)

avg_salary_by_dept = df.groupby('Department')['Salary'].mean()

print(avg_salary_by_dept)
avg_salary_by_dept.dtype

Department
HR    61000.000000
IT    87666.666667
Name: Salary, dtype: float64


dtype('float64')

**Explanation:** The `.groupby('Department')` method groups the rows by the unique values in the 'Department' column. Selecting `['Salary']` and calling `.mean()` then computes the average salary within each group. The result is a Series indexed by department, with dtype `float64`.
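
If an interviewer asks for more than one statistic, one possible approach is `.agg()` with a list of aggregation names. This is a sketch on the same data; `summary` is a name picked for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'IT', 'IT'],
    'Salary': [60000, 85000, 62000, 90000, 88000],
})

# One row per department, one column per aggregation.
summary = df.groupby('Department')['Salary'].agg(['mean', 'max', 'count'])

print(summary)
```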
***
### Question 5: Merging DataFrames

You have two DataFrames: `df1` with columns `['user_id', 'name']` and `df2` with `['user_id', 'order_count']`. How do you perform an inner merge on `user_id` to combine them?

**Solution:**

In [None]:
df1 = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'user_id': [1, 2, 4], 'order_count': [5, 2, 10]})

df_merge = pd.merge(df1, df2, how = 'inner', on = 'user_id')
df3 = pd.concat([df1, df2], ignore_index=True) # Stacking vertically (default axis=0)

print(df_merge)
print(df3)

   user_id   name  order_count
0        1  Alice            5
1        2    Bob            2
   user_id     name  order_count
0        1    Alice          NaN
1        2      Bob          NaN
2        3  Charlie          NaN
3        1      NaN          5.0
4        2      NaN          2.0
5        4      NaN         10.0


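**Explanation:** `pd.merge()` is the standard function for joining DataFrames on key columns. `on='user_id'` specifies the common column to join on, and `how='inner'` ensures that only `user_id`s present in *both* DataFrames are included in the result. The `pd.concat()` call is shown only for contrast: it stacks the two DataFrames vertically, aligns columns by name, and fills the columns a DataFrame lacks with `NaN`, which is why `order_count` becomes a float column in the second output.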
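
To see how the `how` argument changes the result, the following sketch runs a left and an outer merge on the same two DataFrames. `indicator=True` adds a `_merge` column that records where each row came from; the names `left` and `outer` are just for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'user_id': [1, 2, 4], 'order_count': [5, 2, 10]})

# Left merge: keep every row of df1; unmatched order_count becomes NaN.
left = pd.merge(df1, df2, how='left', on='user_id')

# Outer merge: keep every user_id from either side and label its origin.
outer = pd.merge(df1, df2, how='outer', on='user_id', indicator=True)

print(left)
print(outer)
```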
***
### Question 6: Adding a New Column

Given a DataFrame with 'Price' and 'Quantity' columns, how do you create a new column called 'Total' which is the product of 'Price' and 'Quantity'?

**Solution:**

In [16]:
data = {'Price': [10.0, 5.5, 8.0], 'Quantity': [3, 4, 5]}
df = pd.DataFrame(data)

df['Total'] = df['Price'] * df['Quantity']
print(df)

   Price  Quantity  Total
0   10.0         3   30.0
1    5.5         4   22.0
2    8.0         5   40.0


**Explanation:** You can create a new column by simply assigning the result of a vectorized operation to a new column name. Pandas automatically performs the multiplication element-wise for each row.
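
An alternative that leaves the original DataFrame unchanged is `DataFrame.assign`, sketched below. `with_total` is a name chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'Price': [10.0, 5.5, 8.0], 'Quantity': [3, 4, 5]})

# assign returns a new DataFrame with the extra column; df is not modified.
with_total = df.assign(Total=df['Price'] * df['Quantity'])

print(with_total)
print('Total' in df.columns)
```

This style chains well with other methods, since each step returns a new DataFrame.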
***
### Question 7: Applying a Function

You have a DataFrame `df` with a 'Name' column. How would you create a new column 'Name_Length' that contains the length of each name?

**Solution:**

In [17]:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Samantha', 'Peter']})

df['Name_Length'] = df['Name'].apply(len)
print(df)

       Name  Name_Length
0      John            4
1  Samantha            8
2     Peter            5


**Explanation:** The `.apply()` method is used to apply a function to each element in a Series. Here, it applies Python's built-in `len()` function to every name in the 'Name' column.
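
For string columns, the vectorized `.str` accessor is a common alternative to `.apply(len)`. The sketch below also includes a missing name (an assumption added for illustration) to show that `.str.len()` yields a missing value there, whereas `len(None)` would raise a `TypeError`.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Samantha', 'Peter']})
df['Name_Length'] = df['Name'].str.len()
print(df)

# Hypothetical data with a missing name, not part of the original question.
with_missing = pd.Series(['John', None, 'Peter'])
lengths = with_missing.str.len()  # the missing entry stays missing
print(lengths)
```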
***
### Question 8: Selection using `.loc` and `.iloc`

Given the DataFrame `df`, explain how to select the element in the third row and second column using both `.loc` and `.iloc`. Assume the index is `[10, 20, 30, 40]`.

**Solution:**

In [22]:
import pandas as pd
data = {'col1': [100, 200, 300, 400], 'col2': [150, 250, 350, 450]}
df = pd.DataFrame(data, index=[10, 20, 30, 40])

# Using .loc (label-based)
val_loc = df.loc[30, 'col2']

# Using .iloc (integer position-based)
val_iloc = df.iloc[2, 1]

print(f"Using .loc: {val_loc}")
print(f"Using .iloc: {val_iloc}")

Using .loc: 350
Using .iloc: 350


**Explanation:**
* `.loc` is used for **label-based** indexing. The label for the third row is `30` and the label for the second column is `'col2'`.
* `.iloc` is used for **integer position-based** indexing. The third row is at integer position `2` (0-indexed) and the second column is at integer position `1`.
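
A related point that often comes up is slicing: `.loc` slices include the end label, while `.iloc` slices exclude the end position, as in ordinary Python. The sketch below reuses the same DataFrame; `by_label` and `by_position` are names made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [100, 200, 300, 400], 'col2': [150, 250, 350, 450]},
    index=[10, 20, 30, 40],
)

# Label slice: rows labelled 10 through 30, end label included.
by_label = df.loc[10:30, 'col2']

# Position slice: positions 0 and 1 only, end position excluded.
by_position = df.iloc[0:2, 1]

print(by_label.tolist())
print(by_position.tolist())

# For a single scalar, .at and .iat are the label and position shortcuts.
print(df.at[30, 'col2'], df.iat[2, 1])
```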
***
### Question 9: Value Counts

You have a DataFrame `df` with a 'Category' column. How would you count the occurrences of each unique category?

**Solution:**

In [23]:
import pandas as pd
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A']}
df = pd.DataFrame(data)

category_counts = df['Category'].value_counts()
print(category_counts)

Category
A    3
B    2
C    1
Name: count, dtype: int64


**Explanation:** The `.value_counts()` method is a convenient Series method that returns a new Series containing the counts of unique values, sorted in descending order by default.
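
When proportions are wanted instead of raw counts, `value_counts` accepts `normalize=True`. A short sketch on the same data, with `shares` as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'A']})

# Fractions of the total, still sorted from most to least frequent.
shares = df['Category'].value_counts(normalize=True)

print(shares)
```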
***
### Question 10: Dropping Columns

Given a DataFrame `df`, how do you permanently remove the column named 'temp_data'?

**Solution:**

In [30]:
import pandas as pd
data = {'A': [1, 2], 'B': [3, 4], 'temp_data': [5, 6]}
df = pd.DataFrame(data)

df.drop('temp_data', axis = 1, inplace = True)

print(df)

   A  B
0  1  3
1  2  4
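
The solution above uses `axis=1` to target columns and `inplace=True` to modify `df` directly. As a sketch of an equivalent style, the `columns=` keyword avoids `axis`, and reassigning the result avoids `inplace`. The `errors='ignore'` call is an optional extra shown here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'temp_data': [5, 6]})

# drop returns a new DataFrame; reassign to make the removal stick.
df = df.drop(columns=['temp_data'])
print(df)

# Dropping a column that is already gone raises KeyError by default;
# errors='ignore' turns that into a no-op.
df = df.drop(columns=['temp_data'], errors='ignore')
print(list(df.columns))
```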
