# Data Cleaning in Pandas

## Suppose you have a pandas DataFrame called `sales_data` with a column called `price` that contains prices in the format `$xx.xx`, where `x` is a digit. You want to convert the `price` column to a numerical data type while keeping the exact price. Which of the following code snippets will achieve this?

- A) ***
``` python
sales_data["price"] = sales_data["price"].str.replace("$", "")
sales_data["price"] = sales_data["price"].astype(float)
```

- B)
``` python
sales_data["price"] = sales_data["price"].astype(float)
sales_data["price"] = sales_data["price"].str.replace("$", "")
```

- C)
``` python
sales_data["price"] = sales_data["price"].astype(int)
sales_data["price"] = sales_data["price"].str.replace("$", "")
```

- D)
``` python
sales_data["price"] = sales_data["price"].str.replace("$", "")
sales_data["price"] = sales_data["price"].astype(int)
```
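The strip-then-cast order can be checked with a minimal sketch. The three sample prices are invented for illustration, and `regex=False` is an addition not present in the options: it makes the literal match on `$` explicit regardless of the default in the installed pandas version.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
sales_data = pd.DataFrame({"price": ["$12.50", "$3.99", "$100.00"]})

# Remove the literal dollar sign first; the column still holds strings here.
cleaned = sales_data["price"].str.replace("$", "", regex=False)

# Cast to float so the cents survive the conversion.
sales_data["price"] = cleaned.astype(float)
```

Reversing the two steps fails because a string such as `"$12.50"` cannot be parsed as a number while the dollar sign is still attached.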

## Suppose you have a pandas DataFrame called `sales_data` with a column called `region` that contains the regions where sales were made. The `region` column has a small number of unique values compared to the total number of rows in the DataFrame. What data type should the `region` column be to optimize memory usage?

- A) `int`
- B) `bool`
- C) `string`
- D) `category` ***
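A rough way to see the effect is to measure the column before and after the conversion. The sketch below assumes a made-up `region` column with four distinct values repeated over 40,000 rows; actual savings depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical data: four region names repeated across 40,000 rows.
regions = ["North", "South", "East", "West"] * 10_000
sales_data = pd.DataFrame({"region": regions})

# deep=True includes the memory held by the string values themselves.
before = sales_data["region"].memory_usage(deep=True)

# category stores each distinct label once, plus a small integer code per row.
sales_data["region"] = sales_data["region"].astype("category")
after = sales_data["region"].memory_usage(deep=True)

print(f"before: {before} bytes, after: {after} bytes")
```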



## Suppose you have a pandas DataFrame called `employee_data` with a column called `employee_id` that is meant to contain a unique identifier for each employee. You want to select the rows whose employee ID repeats one that appeared earlier in the DataFrame. Which of the following code snippets will achieve this?

- A) `employee_data[employee_data["employee_id"].duplicated()]` ***
- B) `employee_data.duplicated(subset="employee_id")`
- C) `employee_data[employee_data.duplicated()]`
- D) `employee_data["employee_id"].duplicated()`
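As a small sketch with invented employee records, the boolean mask from `duplicated()` can be used to select rows. By default (`keep="first"`) the first occurrence of each ID is not flagged; passing `keep=False` flags every row that shares an ID with another row.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
employee_data = pd.DataFrame({
    "employee_id": [101, 102, 101, 103, 102],
    "employee_name": ["Ana", "Ben", "Ana", "Cy", "Ben"],
})

# Default keep="first": only the second and later occurrences are selected.
repeats = employee_data[employee_data["employee_id"].duplicated()]

# keep=False: every row whose ID appears more than once is selected.
all_dupes = employee_data[employee_data["employee_id"].duplicated(keep=False)]
```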

## Suppose you have a pandas DataFrame called `sales_data` with columns `date` and `sales_amount`. You want to sort the DataFrame in ascending order based on the `sales_amount` column. Which of the following code snippets will achieve this without redundancy?

- A) `sales_data.sort_values("sales_amount")`
- B) `sales_data.sort_values(by="sales_amount")` ***
- C) `sales_data.sort_values("sales_amount", ascending=True)`
- D) `sales_data.sort_values(by="sales_amount", ascending=False)`
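A short sketch with invented figures shows why `ascending=True` does not need to be written out: it is the default for `sort_values`. The call also returns a new DataFrame rather than sorting in place.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
sales_data = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "sales_amount": [250.0, 90.0, 175.0],
})

# ascending=True is the default, so it can be omitted.
sorted_sales = sales_data.sort_values(by="sales_amount")

# sales_data itself keeps its original row order; only sorted_sales is sorted.
```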

## Suppose you have a pandas DataFrame called `employee_data`. How can you check whether there are any `NaN` values in the DataFrame?

- A) `employee_data.isna()` ***
- B) `employee_data.isnan()`
- C) `employee_data.hasna()`
- D) `employee_data.hasnan()`
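Note that `isna()` returns a boolean DataFrame of the same shape, not a single yes/no answer. One possible way to reduce that mask, sketched here with invented data, is to chain `any()` or `sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration.
employee_data = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "salary": [52000.0, np.nan, 61000.0],
})

# Boolean mask: True wherever a value is missing.
mask = employee_data.isna()

# Collapse the mask to a single answer: is anything missing at all?
has_missing = bool(mask.any().any())

# Count the missing values in each column.
missing_per_column = mask.sum()
```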

## Suppose you have a pandas DataFrame called `sales_data` with columns `date`, `product`, and `sales_amount`. The DataFrame contains some duplicate rows where the same product was sold on the same date. You want to calculate the average sales amount for each product on each date, averaging across the duplicate rows. Which of the following code snippets will achieve this?

- A) `sales_data.groupby(["date", "product"])["sales_amount"].mean()` ***
- B) `sales_data.groupby(["product", "sales_amount"])["date"].mean()`
- C) `sales_data.groupby("date")["product", "sales_amount"].mean()`
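Grouping by both `date` and `product` collapses each set of duplicate rows into one averaged value. The sketch below uses four invented rows, two of which share the same date and product.

```python
import pandas as pd

# Hypothetical sample data: "widget" appears twice on 2024-01-01.
sales_data = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "product": ["widget", "widget", "gadget", "widget"],
    "sales_amount": [10.0, 20.0, 5.0, 30.0],
})

# One mean per (date, product) pair; the result is a Series with a MultiIndex.
avg_sales = sales_data.groupby(["date", "product"])["sales_amount"].mean()
```

Here the two widget rows from 2024-01-01 are averaged into a single entry, while the other pairs keep their single value.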

## Suppose you have a pandas DataFrame called `employee_data` with columns `employee_id`, `employee_name`, and `department`. You want to filter the DataFrame to only include rows where the `department` column contains either `"Sales"` or `"Marketing"`. Which of the following code snippets will achieve this?

- A) `employee_data.filter(items=["Sales", "Marketing"], axis=0)`
- B) `employee_data.loc[employee_data["department"] == ["Sales", "Marketing"]]`
- C) `employee_data[employee_data["department"].isin(["Sales", "Marketing"])]` ***
- D) `employee_data[employee_data.isin(["Sales", "Marketing"])]` 
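A minimal sketch of membership filtering, using invented employee records: `isin()` on the column tests each value against the list and yields a boolean mask, whereas `==` against a list compares element by element instead of testing membership.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
employee_data = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "employee_name": ["Ana", "Ben", "Cy", "Dee"],
    "department": ["Sales", "Engineering", "Marketing", "Finance"],
})

# True for rows whose department is one of the listed values.
mask = employee_data["department"].isin(["Sales", "Marketing"])

# Keep only the matching rows; all columns are retained.
filtered = employee_data[mask]
```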

## Suppose you have a pandas DataFrame called `sales_data` with several columns, including a column called `customer_age` that contains the age of each customer. Some of the rows in the DataFrame have missing values in the `customer_age` column. You want to remove all the rows with missing values in the `customer_age` column before analyzing the data. Which of the following code snippets would achieve this goal?

- A) `sales_data.dropna('customer_age')`
- B) `sales_data.dropna(subset=['customer_age'])` ***
- C) `sales_data.dropna(columns='customer_age')`
- D) `sales_data.dropna(how='all')`
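The `subset` argument restricts the missing-value check to the named columns. In this sketch the extra `city` column and all values are invented; a missing `city` does not cause a row to be dropped, only a missing `customer_age` does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration.
sales_data = pd.DataFrame({
    "customer_age": [34.0, np.nan, 51.0],
    "city": ["Lyon", "Oslo", None],
})

# Only customer_age is inspected; the row with a missing city is kept.
cleaned = sales_data.dropna(subset=["customer_age"])
```

`dropna` returns a new DataFrame, so `sales_data` still has all three rows afterwards.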

## Suppose you have a pandas DataFrame called `student_grades` with several columns, including a column called `grade` that contains the grade of each student. Some of the rows in the DataFrame have missing values in the `grade` column. You want to identify all the rows with missing values in the `grade` column. Which of the following code snippets would achieve this goal?

- A) `student_grades.isnull('grade')`
- B) `student_grades.isnull(subset=['grade'])`
- C) `student_grades['grade'].isnull()` ***
- D) `student_grades.isnull()`
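Calling `isnull()` on the single column gives one boolean per row, which can then be used to pull out the affected rows. The student names and grades below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration.
student_grades = pd.DataFrame({
    "student": ["Ana", "Ben", "Cy"],
    "grade": [88.0, np.nan, 92.0],
})

# One boolean per row: True where the grade is missing.
missing_mask = student_grades["grade"].isnull()

# Use the mask to view the rows that lack a grade.
missing_rows = student_grades[missing_mask]
```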