# Data Cleaning in Pandas

## Suppose you have a Pandas dataframe called `sales_data` with a column called `price` that contains prices in the format `$xx.xx`, where `x` is a digit. You want to convert the `price` column to a numerical data type while keeping the exact price. Which of the following code snippets will achieve this?

- This code ***
``` python
sales_data["price"] = sales_data["price"].str.replace("$", "")
sales_data["price"] = sales_data["price"].astype(float)
```

- This code
``` python
sales_data["price"] = sales_data["price"].astype(float)
sales_data["price"] = sales_data["price"].str.replace("$", "")
```

- This code
``` python
sales_data["price"] = sales_data["price"].astype(int)
sales_data["price"] = sales_data["price"].str.replace("$", "")
```

- This code
``` python
sales_data["price"] = sales_data["price"].str.replace("$", "")
sales_data["price"] = sales_data["price"].astype(int)
```
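
To illustrate the order of operations in the marked answer, here is a minimal sketch. The sample prices are made up for the example, and `regex=False` is an addition that makes pandas treat `"$"` as a literal character rather than a regular-expression pattern.

```python
import pandas as pd

# Hypothetical sample data; the real sales_data is not shown in the quiz.
sales_data = pd.DataFrame({"price": ["$12.50", "$3.99", "$100.00"]})

# Step 1: strip the currency symbol while the column still holds strings.
sales_data["price"] = sales_data["price"].str.replace("$", "", regex=False)

# Step 2: convert to float so the cents are preserved (int would fail on "12.50").
sales_data["price"] = sales_data["price"].astype(float)

print(sales_data["price"].tolist())
```

Reversing the two steps fails because `"$12.50"` cannot be parsed as a number until the symbol is removed.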

## Suppose you have a Pandas dataframe called `sales_data` with a column called `region` that contains the regions where sales were made. The `region` column has a small number of unique values compared to the total number of rows in the dataframe. What data type should the `region` column be to optimize memory usage?

- `int`
- `bool`
- `string`
- `category` ***
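
The memory effect can be checked directly. The sketch below assumes a made-up column with four region labels repeated over 100,000 rows and compares `memory_usage(deep=True)` before and after the conversion.

```python
import pandas as pd

# Hypothetical data: 4 distinct region labels repeated across 100,000 rows.
regions = ["North", "South", "East", "West"] * 25_000
sales_data = pd.DataFrame({"region": regions})

# Memory used when every row stores its own string value.
before_bytes = sales_data["region"].memory_usage(deep=True)

# category stores each distinct label once plus a small integer code per row.
sales_data["region"] = sales_data["region"].astype("category")
after_bytes = sales_data["region"].memory_usage(deep=True)

print(before_bytes, after_bytes)
```

The saving shrinks as the number of distinct values approaches the number of rows.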



## Suppose you have a Pandas dataframe called `employee_data`. How can you check whether there are any `NaN` values in the dataframe?

- `employee_data.isna()` ***
- `employee_data.isnan()`
- `employee_data.hasna()`
- `employee_data.hasnan()`
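
Note that `isna()` returns a boolean mask rather than a single yes/no answer. A minimal sketch, using invented employee rows, of how that mask is usually reduced:

```python
import pandas as pd

# Hypothetical sample data with one missing department.
employee_data = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "department": ["Sales", None, "Marketing"],
})

# Boolean dataframe of the same shape: True where a value is missing.
mask = employee_data.isna()

# Reduce over rows, then over columns, to get a single True/False.
has_missing = mask.any().any()

# Count of missing values per column.
missing_per_column = mask.sum()

print(has_missing)
print(missing_per_column)
```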

## Suppose you have a Pandas dataframe called `employee_data` with columns `employee_id`, `employee_name`, and `department`. You want to filter the dataframe to only include rows where the `department` column contains either `"Sales"` or `"Marketing"`. Which of the following code snippets will achieve this?

- `employee_data.filter(items=["Sales", "Marketing"], axis=0)`
- `employee_data.loc[employee_data["department"] == ["Sales", "Marketing"]]`
- `employee_data[employee_data["department"].isin(["Sales", "Marketing"])]` ***
- `employee_data[employee_data.isin(["Sales", "Marketing"])]` 
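
A small sketch of the `isin` filter on one column, using invented rows for the three columns named in the question:

```python
import pandas as pd

# Hypothetical sample data.
employee_data = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "employee_name": ["Ana", "Ben", "Cy", "Dee"],
    "department": ["Sales", "HR", "Marketing", "Sales"],
})

# isin builds a boolean Series: True where department is one of the listed values.
in_target_departments = employee_data["department"].isin(["Sales", "Marketing"])

# Boolean indexing keeps only the rows where the mask is True.
selected = employee_data[in_target_departments]

print(selected)
```

Calling `isin` on the whole dataframe instead would test every cell, which yields a mask of the same shape as the dataframe, not a row filter.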

## Suppose you have a Pandas dataframe called `sales_data` with several columns, including a column called `customer_age` that contains the age of each customer. Some of the rows in the dataframe have missing values in the `customer_age` column. You want to remove all the rows with missing values in the `customer_age` column before analyzing the data. Which of the following code snippets would achieve this goal?

- `sales_data.dropna('customer_age')`
- `sales_data.dropna(subset=['customer_age'])` ***
- `sales_data.dropna(columns='customer_age')`
- `sales_data.dropna(how='all')`
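
The effect of `subset` can be seen in a short sketch. The `amount` column and all values here are assumptions added for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing age, one missing amount.
sales_data = pd.DataFrame({
    "customer_age": [34, np.nan, 51],
    "amount": [20.0, 15.5, np.nan],
})

# Only customer_age is inspected; a missing amount does not drop the row.
cleaned = sales_data.dropna(subset=["customer_age"])

print(cleaned)
```

`dropna` returns a new dataframe here, so `sales_data` itself is left unchanged unless the result is assigned back.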

## Suppose you have a Pandas dataframe called `customers` and you want to remove all rows that are **exact** duplicates from the existing dataframe. Which code snippet would you use?

- `customers['customer_id'].drop_duplicates()`
- `customers.drop_duplicates(columns=all)`
- `customers.drop_duplicates(inplace = True)` ***
- `customers_nonduplicated = customers[~customers.duplicated()]`
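
A minimal sketch of in-place de-duplication, with made-up customer rows:

```python
import pandas as pd

# Hypothetical sample data: the second and third rows are identical in every column.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Ana", "Ben", "Ben", "Cy"],
})

# With no subset given, rows count as duplicates only if all columns match.
# inplace=True modifies customers directly and returns None.
customers.drop_duplicates(inplace=True)

print(customers)
```

By default the first occurrence of each duplicated row is kept.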

## Creating a new categorical column from a column of continuous data can be achieved by which process?

- Binning ***
- Chopping
- Normalising
- Aggregating
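
In pandas, binning is commonly done with `pd.cut`. The bin edges and labels in this sketch are illustrative assumptions, not part of the quiz.

```python
import pandas as pd

# Hypothetical continuous data.
ages = pd.Series([5, 17, 30, 64, 80])

# Intervals are right-inclusive by default: (0, 18], (18, 65], (65, 120].
age_group = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
)

print(age_group.tolist())
```

The result is a categorical Series that can be assigned as a new column alongside the original values.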


## The `.nunique()` method returns:

- The unique values in a column
- The non-unique values in a column
- The count of unique values in a column ***
- The indices of the unique values in a column
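
The difference between `.nunique()` and `.unique()` is easy to see on a small, invented Series:

```python
import pandas as pd

# Hypothetical sample data with one missing value.
regions = pd.Series(["North", "South", "North", None])

# Number of distinct values; missing values are excluded by default.
distinct_count = regions.nunique()

# Same count, but treating the missing value as its own entry.
distinct_count_with_na = regions.nunique(dropna=False)

# The distinct values themselves, as an array.
distinct_values = regions.unique()

print(distinct_count, distinct_count_with_na, len(distinct_values))
```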