# One-Hot Encoding vs. `pd.get_dummies` Exercise

In this exercise, we will explore how to use both `OneHotEncoder` and `pd.get_dummies` to handle categorical data. We'll use the Seaborn `tips` dataset for this task.


### Step 1: Load the Data
1. Import the necessary libraries:
    - `pandas` as `pd`
    - `seaborn` as `sns`
2. Load the `tips` dataset using `sns.load_dataset()`.

3. Display the first 5 rows of the dataset using `.head()`.
4. Use `.info()` to get an overview of the data types and missing values.

### Step 2: Using `pd.get_dummies`
1. Use the `pd.get_dummies` function to encode the categorical variables:
    - Select the columns `sex`, `smoker`, and `day`.
    - Set `drop_first=True` to drop one level per variable and avoid perfect multicollinearity (the dummy variable trap).

2. Merge the encoded columns back into the original dataset, dropping the original categorical columns.
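
A minimal sketch of these two steps. To stay runnable offline it uses a small hand-made frame, `tips`, with made-up values as a stand-in for the real data; in the exercise you would use the frame returned by `sns.load_dataset("tips")`. The names `cat_cols`, `dummies`, and `tips_dummies` are illustrative, not prescribed.

```python
import pandas as pd

# Made-up stand-in for `tips` (assumption: real data comes from sns.load_dataset("tips")).
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 15.0, 30.0, 25.0, 12.0],
    "tip": [1.5, 3.0, 2.0, 5.0, 4.0, 2.0],
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "smoker": ["No", "Yes", "No", "No", "Yes", "No"],
    "day": ["Thur", "Fri", "Sat", "Sun", "Sat", "Sun"],
    "size": [2, 3, 2, 4, 3, 2],
})

cat_cols = ["sex", "smoker", "day"]

# Encode only the selected columns; drop_first=True removes one level per variable.
dummies = pd.get_dummies(tips[cat_cols], drop_first=True)

# Drop the original categorical columns and attach the encoded ones.
tips_dummies = pd.concat([tips.drop(columns=cat_cols), dummies], axis=1)

print(tips_dummies.head())
```

`pd.get_dummies(tips, columns=cat_cols, drop_first=True)` should give the same result in a single call, since it replaces the listed columns with their dummies.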

---

### Step 4: Using `OneHotEncoder`
1. Use `OneHotEncoder` from `sklearn.preprocessing` to encode the same columns (`sex`, `smoker`, and `day`).
2. Combine the encoded columns into a new DataFrame with meaningful column names.
3. Add the encoded columns to the original dataset, dropping the original categorical columns.

---

### Step 5: Comparison of Results
- Compare the results from `pd.get_dummies` and `OneHotEncoder`:
    - Do they produce the same results?
    - What differences do you observe in the output?


### If You Have Extra Time
1. Use the `tips` dataset (with categorical variables encoded) to predict the `total_bill` column based on all other features.
    - Use `LinearRegression` from `sklearn.linear_model`.
    - Split the data into training and testing sets using `train_test_split`.
    - Evaluate the model using `mean_squared_error` and calculate the RMSE.
2. Which encoding method (`OneHotEncoder` or `pd.get_dummies`) performs better?


### Note:
- Use `OneHotEncoder` with `sparse_output=False` for easier handling of the encoded DataFrame (requires scikit-learn 1.2 or newer; older versions use `sparse=False`).
- Use `pd.concat()` to merge encoded features back into the original DataFrame.
- When using `train_test_split`, set `random_state=42` for consistent results.
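- For regression, consider dropping the `tip` column. The tip is determined after the bill and correlates strongly with `total_bill`, so using it as a predictor leaks information about the target. (Multicollinearity refers to correlation among predictors, not between a predictor and the target.)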
