IMPORTANT! Before beginning any lab assignment, be sure to **make your own copy** of the notebook and name it "lastname - Lab 4" or something similar.

# Lab 4: Predictive Analytics in Python (Part 2)

## Objective

In this lab, we will continue working with the churn dataset from Lab 3. The goal is to finish preparing the dataset for modeling (and then - visualize it!).

### The Scenario: Customer Churn in Telecommunications 📞 (Continued)

In case you need a reminder from Lab 3:
* You're a data analyst contracted by a telecommunications company that is experiencing a high rate of customer churn.
* The company has noticed a recent decline in active subscriptions.
* It's your job to analyze this customer churn, identify trends, and recommend strategies to improve customer retention.

In the previous activity, we learned that **we had outliers** (at least one); we also learned there were **missing values** in less than one percent of the `TotalCharges` data. Now that we've had a chance to reflect on our understanding of the business and data - and completed our DQP (Data Quality Plan) - we can move on to the Data Preparation stage.

**Your goal:** Prepare the dataset by handling data quality issues, engineering relevant features, and implementing initial predictive models to classify whether a customer is likely to churn.

Before we start, we should load the dataset and refresh ourselves on the details. Recall that the dataset contains various features about customers' account information and service usage patterns. The target variable `Churn` indicates whether a customer has churned (i.e., left the service) or not.

- `customerID`: The unique ID of each customer
- `tenure`: The time for which a customer has been using the service.
- `PhoneService`: Whether a customer has a landline phone service along with the internet service.
- `Contract`: The type of contract a customer has chosen.
- `PaperlessBilling`: Whether a customer has opted for paperless billing.
- `PaymentMethod`: Specifies the method by which bills are paid.
- `MonthlyCharges`: Specifies the money paid by a customer each month.
- `TotalCharges`: The total money paid by the customer to the company.
- `Churn`: This is the target variable which specifies if a customer has churned or not.

Let's get started by loading the dataset into a Polars DataFrame. If you haven't already downloaded the dataset, the code below will do so for you.  It keeps a local copy in case you need to restart the notebook.

**NOTE**: Later in this lab we will also create a Pandas version of the Polars DataFrame for compatibility with Seaborn, which we will use for visualization.

In [None]:
import requests
from pathlib import Path
import polars as pl

# Set the URL and local filename
url = "https://raw.githubusercontent.com/JuanCab/csis446_lab04/refs/heads/main/data/churn_data.csv"
local_filename = "churn_data.csv"

# Check if the file already exists to avoid re-downloading
if not Path(local_filename).is_file():
    print(f"Downloading dataset from {url}...")
    # Download the dataset using requests library
    r = requests.get(url)
    # Check if the request was successful, if not, raise an error
    r.raise_for_status()

    # Save the content to a local file
    with open(local_filename, "wb") as f:
        f.write(r.content)
        print(f"Dataset downloaded and saved as '{local_filename}'.")

# Load the dataset from the local file into a Polars DataFrame
churn_df = pl.read_csv(local_filename)

# Display the first few rows of the DataFrame to verify successful loading
churn_df.head()

Let's remind ourselves we can review some of the basic information about the dataset after loading it.

In [None]:
# Let's get the description/statistics of the dataset
churn_df.describe()

We built lists of categorical and continuous features in the last lab.  Those will be useful again here.  Let's recreate them.

In [None]:
def column_types(dataframe):
    """Returns two lists: categorical and continuous column names."""
    # Lists to track column types
    categorical_cols = []
    continuous_cols = []

    for column in dataframe.columns:
        # .dtype is the data type of the column
        # pl.Float64 and pl.Int64 are Polars data types for float and integer
        if dataframe[column].dtype in (pl.Float64, pl.Int64):
            continuous_cols.append(column)
        else:
            categorical_cols.append(column)
    return categorical_cols, continuous_cols

categorical_cols, continuous_cols = column_types(churn_df)
print(f"{categorical_cols=}")
print(f"{continuous_cols=}")

## Part A: Data Preparation

Before we build any models, we need to ensure our data is clean and reliable.

In Lab 3, you should have built a draft Data Quality Plan (DQP). It should have looked something like this:

| Feature         | Issue Identified                  | Recommended Action                |
|-----------------|----------------------------------|-----------------------------------|
| `customerID`      | High cardinality (unique values) | Exclude from analysis             |
| `tenure`          | Bimodal distribution (new vs long-term customers) | Consider binning or segmentation |
| `PhoneService`    | Binary categorical (cardinality 2) | Eventually convert to Boolean type        |
| `Contract`        | Categorical variable (low cardinality) | None |
| `PaperlessBilling` | Binary categorical (cardinality 2) | Eventually convert to Boolean type        |
| `PaymentMethod`   | Categorical variable (low cardinality) | None |
| `MonthlyCharges`  | 14 outliers found | Investigate outliers, consider keeping or adjusting|
| `TotalCharges`    | Missing values (11 out of 7043, 0.16%), also 19 outliers | Remove rows or impute with mean/median.  Review outlier handling. |
| `Churn`           | Target variable (cardinality 2) | Eventually convert to Boolean type        |

### 1. Addressing Cardinality Issues

We already noted that `customerID` has high cardinality (unique values for each customer), which means it is not useful for modeling.  Let's drop this feature from the dataset.  In Polars, this can be done using the `drop` method: pass the name of the column to be dropped as a string (or a list of strings for multiple columns).

In [None]:
# TO DO: Drop the customerID column as it has high cardinality

# Drop this column from the categorical columns list as well
categorical_cols.remove('customerID')

We didn't mention this in class, but for the binary features (those with a cardinality of 2), we can consider converting them to actual Boolean True/False types for better performance during modeling (since most modeling algorithms only work with numeric data).  Let's do that now.

In [None]:
# Build a cardinalities dictionary using a dictionary comprehension
cardinalities_comp = { col: churn_df[col].n_unique() for col in categorical_cols }

# Loop through dictionary and convert every feature in the polars dataframe
# with a cardinality of 2 to Boolean type mapping "Yes"/"No" values into True/False
for col, card in cardinalities_comp.items():
    if card == 2:
        churn_df = churn_df.with_columns(
            pl.when(pl.col(col) == "Yes")
            .then(True).otherwise(False)
            .alias(f"{col}_Bool")
        )

print("\nUpdated DataFrame:")
churn_df.head()

Explain what you think the `.when`, `.then`, and `.otherwise` methods are doing in this context.  Write your explanation below.

```
WRITE DOWN YOUR ANSWERS HERE.  KEEP THEM BETWEEN THE TRIPLE BACKTICKS TO HAVE THEM BE FORMATTED AS FIXED-WIDTH TEXT, WHICH MAKES THEM STAND OUT FOR GRADING.
```

### 2. Addressing outliers

In Lab 3, we identified some outliers in the `MonthlyCharges` and `TotalCharges` features during our data exploration using the code below.

In [None]:
# This is a slightly modified code from Lab 3 used to identify potential 
# outliers using the IQR method. It stores the lower and upper bounds for each
# continuous column in dictionaries for later use.

# Initialize dictionaries to store lower and upper bounds
lower_bounds = {}
upper_bounds = {}

for col in continuous_cols:
    print(f"\nAnalyzing column: {col}")

    # Define Q1, Q3, IQR, lower_bound, upper_bound values for 'col'
    Q1 = churn_df[col].quantile(0.25)
    Q3 = churn_df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Store bounds in dictionaries
    lower_bounds[col] = lower_bound
    upper_bounds[col] = upper_bound

    # Print IQR and bounds
    print(f"IQR for {col}:", IQR)
    print(f"Lower/Upper Bounds for {col}:", 
          lower_bounds[col], upper_bounds[col])

    # Check against the dataframe to find potential outliers
    outliers = churn_df.filter((pl.col(col) < lower_bounds[col]) | 
                               (pl.col(col) > upper_bounds[col]))

    print(f"Number of Potential Outliers for {col}: {outliers.height}")

    # Show the values of the outliers
    print(f"\nPotential Outliers for {col}:")
    print(outliers)

Let's examine the outliers visually using box plots.  We'll use Seaborn for this, which works with Pandas DataFrames, so we'll use the Pandas version of our data.  Examine the plots and describe what you see below.


In [None]:
# Create a pandas version of the Polars DataFrame for compatibility
# with Seaborn
churn_pd_df = churn_df.to_pandas()

# Make box plots to visualize outliers in MonthlyCharges and TotalCharges
import seaborn as sns
import matplotlib.pyplot as plt

# Set up the matplotlib figure with two subplots
plt.figure(figsize=(12, 6))
# Split the figure into 2 rows and 1 column, and place next plot in 
# the first subplot
plt.subplot(2, 1, 1)
sns.boxplot(x=churn_pd_df['MonthlyCharges'], color='skyblue', orient='h')
plt.title('Box Plot of MonthlyCharges')
# TO DO: Now place the TotalCharges box chart in the second subplot
plt.subplot(2, 1, 2)
sns.boxplot(x=churn_pd_df['TotalCharges'], color='orange', orient='h')
plt.title('Box Plot of TotalCharges')

# Adjust layout to prevent overlap and show the plots
plt.tight_layout()
plt.show()

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

To keep these outliers from skewing our visualizations and models, we can use a **clamp transformation**, which sets extreme values to predefined upper and lower limits based on the IQR.

1. Apply a **clamp transformation** using the `.clip()` method to cap extreme values.
2. Replace the modified numeric columns in the main dataset.
3. Visualize distributions with boxplots after handling outliers to confirm changes.

In [None]:
# Since we are making modifications to the data, we work on a COPY of
# the data until we are happy with the results.
churn_clipped_df = churn_df.clone()

# TO DO: Apply clamp transformation to handle outliers in churn_clipped_df.
# Hint: Clip values based on lower and upper bounds stored in the
#   lower_bounds and upper_bounds dictionaries from IQR calculations above.


# Show the effects of handling outliers
print("\nSummary statistics after handling outliers:")
churn_clipped_df.describe()

In [None]:
# TO DO: Redo the box plots to visualize distributions after handling
# outliers. You can mostly copy code from above, but remember to remake
# a pandas version of the Polars DataFrame using churn_clipped_df you
# just created.

# Adjust layout to prevent overlap and show the plots
plt.tight_layout()
plt.show()

Notice the remaining outliers in the `MonthlyCharges` boxplot after clamping.  These are actually all of our clamped values for `MonthlyCharges`! When the box plot RECALCULATES the quartiles and median for the clamped data, it is clear these clamped values are still outliers relative to the rest of the data.  This happened because our outliers were several times larger than the upper bound and were themselves initially distorting the computed IQR. Some people deal with situations like this by iteratively recalculating the IQR and clamping again.

However, it turns out a detailed examination of the outliers revealed they were all the result of typographical errors during data entry (e.g., an extra digit added by mistake). Therefore, we will remove these outliers from the dataset instead of clamping them. (We didn't show how to do this in class, so the code to do this is provided; just run it.)

In [None]:
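# Filter out rows with outliers in either MonthlyCharges or TotalCharges.
# NOTE: comparing a null value gives null, and .filter() drops every row
# whose condition is not True. Rows with a missing TotalCharges are kept
# explicitly here so they can be handled in the next section.
churn_filtered_df = churn_df.filter(
    (pl.col('MonthlyCharges') >= lower_bounds['MonthlyCharges']) &
    (pl.col('MonthlyCharges') <= upper_bounds['MonthlyCharges']) &
    (
        pl.col('TotalCharges').is_null() |
        ((pl.col('TotalCharges') >= lower_bounds['TotalCharges']) &
         (pl.col('TotalCharges') <= upper_bounds['TotalCharges']))
    )
)

# Show the statistics of the filtered DataFrame (print is needed because
# this is not the last expression in the cell)
print(churn_filtered_df.describe())

# Make churn_df the filtered DataFrame for further analysis and re-create
# the pandas version
churn_df = churn_filtered_df
churn_pd_df = churn_df.to_pandas()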

In [None]:
# Create the plots again
plt.figure(figsize=(12, 6))
plt.subplot(2, 1, 1)
sns.boxplot(x=churn_pd_df['MonthlyCharges'], color='skyblue', orient='h')
plt.title('Box Plot of MonthlyCharges')
plt.subplot(2, 1, 2)
sns.boxplot(x=churn_pd_df['TotalCharges'], color='orange', orient='h')
plt.title('Box Plot of TotalCharges')

# Adjust layout to prevent overlap and show the plots
plt.tight_layout()
plt.show()

### 3. Addressing Missing Values

Our Data Quality Plan identified 11 missing values in the `TotalCharges` feature.  Since this is a small percentage of the dataset (0.16%), we can choose to remove these rows without significantly impacting our analysis.  That said, we were warned to only remove entire rows if the missing values are for the target feature. As such, we are going to try imputation first.

For continuous features like `TotalCharges`, a common imputation method is to use the mean or median of the feature.  If `TotalCharges` still had some outliers, we would use the median for imputation, as it is more robust to outliers. However, since we have already handled the outliers, we can use the mean for imputation.

In [None]:
# TO DO: Deal with the missing values in TotalCharges by imputing with
# the mean. In class we showed a case using .fill_null() to fill missing
# values in an entire dataframe; go ahead and use that if you want.


# TO DO: Check to see if there are any missing values left

Have the missing values been handled?  How can you tell?

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

## Part B: Continued Data Exploration and Visualization (Relationships)

We addressed some of our data quality issues in Part A.  Now, let's continue our data exploration by examining relationships between features.  This will help us understand how different features interact with each other and with the target variable `Churn`.

Recall that we basically have different kinds of plots we use for different kinds of features we are comparing:

1. **Continuous vs. Continuous:** Scatter Plots, SPLOM
2. **Categorical vs. Categorical:** Small Multiples of Bar Charts or Stacked Bar Charts
3. **Categorical vs. Continuous:** Small Multiples of Histograms or Box Plots

We will experiment with each of these kinds of plots below.

### 1. Continuous vs. Continuous

We only have three continuous features in this dataset: `tenure`, `MonthlyCharges`, and `TotalCharges`.  Let's create a scatter plot matrix (SPLOM) to visualize the relationships between these continuous features (you can find an example of this in the class notes).  Then write down your comments on what you observe in the cell below the plot.

In [None]:
# TO DO: Import Seaborn (sns) and Matplotlib (plt) libraries for 
# visualization

# TO DO: Create a pairplot (scatter plot matrix) using Seaborn, only for
# the continuous features (Remember to use the Pandas DataFrame version
# we created earlier for compatibility with Seaborn)

# Add a title for entire figure and then show it
plt.suptitle("Scatter Plot Matrix of Continuous Features", y=1.02)
plt.show()

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

### 2. Categorical vs. Categorical

We have five categorical features in this dataset (not including `customerID`): `PhoneService`, `Contract`, `PaperlessBilling`, `PaymentMethod`, and the target feature `Churn`. In class we reviewed how to visualize relationships between categorical variables using small multiples of bar charts or stacked bar charts.

Let's create small multiples of bar charts to visualize the relationships between these categorical features.  Then write down your comments on what you observe in the cell below the plot.  **NOTE**: As you are using Seaborn, remember to use the Pandas DataFrame version of the data, `churn_pd_df`.

First, let's examine the relationship between `Contract` and `Churn` using small multiples of bar charts.  Finish the code below to create the bar charts (**HINT**: Most of this is shown in the class notes; you just need to fill in the missing pieces).


In [None]:
# Define the categorical feature to make bar charts for
target_col = 'Churn'
# Define the categorical feature to segment by
col = 'Contract'
categories = churn_df[col].unique().to_list() # Get unique values for feature
n_types = len(categories) # Number of unique values

# We will make bar charts, one for all data and one for each contract type.
# Create a figure with 1 row x (n_types + 1) columns of subplots
fig, axes = plt.subplots(1, n_types+1, figsize=(4*(n_types+1), 4))
# Adjust the horizontal and vertical spacing between subplots
fig.subplots_adjust(hspace=0.2, wspace=0.35)

# TO DO: Create a combined count plot (bar chart) of all data in 'axes[0]'



# Since we made the 'Churn' a binary feature (0/1), we can manually
# set the xticks to show 'No' and 'Yes' instead of 0 and 1
axes[0].set_xticks(["No", "Yes"])

# Set the labels and title for the subplot of all data
axes[0].set_ylabel("Count")
axes[0].set_title(f'All {col} Types')

# The following loop goes through each categorical value and creates
# a bar chart in its respective subplot (which should be axes[i+1]).
for i, category in enumerate(categories):
    # Subset the DataFrame for the current category value
    subset = churn_pd_df[churn_pd_df[col] == category]
    # TO DO: Create a count plot for the subset in the corresponding subplot


    # TO DO: Set title for each subplot


Did you notice any interesting patterns or trends in the relationship between `Contract` type and customer churn?  Write your observations below.

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

We can also create stacked bar charts to visualize the same relationship.  We're going to go beyond what was shown in class by showing how to make stacked bar charts in terms of both raw counts and proportions.

The next three cells don't need to be completed, just run.  However, look at the results of each cell and make sure you understand what is going on.

In [None]:
# Group the data by the categorical feature and target column
df_grouped = churn_pd_df.groupby([col, target_col]).size().unstack()
df_grouped

In [None]:
# The following line divides (.div) our original counts dataframe by the
# sum of counts (.sum) in each row. This gives us the proportion of each target
# class within each categorical feature value. (NOTE: axis=0 means we are
# dividing along rows, i.e., for each category value, whereas the axis=1
# in sum(axis=1) means we are summing across columns)
df_grouped_prop = df_grouped.div(df_grouped.sum(axis=1), axis=0)
df_grouped_prop

In [None]:
# We can visualize the same relationship using stacked bar charts both
# in terms of counts and proportions.

# Create a figure with 1 row x 2 columns of subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(hspace=0.2, wspace=0.2)

# Create stacked bar chart
df_grouped.plot(kind='bar', stacked=True, ax=axes[0])
axes[0].set_ylabel("Count")
axes[0].set_title(f'Stacked Bar Chart of {target_col} by {col}')

# Now make a stacked bar chart to visualize the same relationship but
# show the proportions instead of counts.
# Create stacked bar chart for proportions
df_grouped_prop.plot(kind='bar', stacked=True, ax=axes[1])
axes[1].set_ylabel("Proportion")
axes[1].set_title(f'Stacked Bar Chart of {target_col} Proportions by {col}')
plt.show()

Look at the stacked bar charts produced by the above code and compare them to each other.  They are based on the same data, but visualized differently.  Are the raw counts or the proportions more useful in this case?  Why?  Write your thoughts below.

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

### 3. Categorical vs. Continuous

We can also check the relationships between categorical and continuous features using small multiples of histograms or box plots. It is really a question of how you want to visualize the distribution of the continuous feature for each category.

Let's create small multiples of histograms to visualize the relationship between `Contract` type and `MonthlyCharges`.   The quickest way to do this is to use Seaborn's `FacetGrid` to make a grid of histograms based on the `Contract` type.  

The code below "works", but recall that `MonthlyCharges` had some high-value outliers, which means the histograms may be misleading.  After running the code, adjust the number of bins used until you get a better sense of the distribution of `MonthlyCharges` for each `Contract` type.  You may also want to change the upper and lower bounds to be something other than the default (which is the entire range of the data). Write your observations below the code.

In [None]:
# Again, a reminder that Seaborn needs pandas DataFrames churn_pd_df

# Set up the plot
target_col = 'Contract'
seg_col = 'MonthlyCharges'

# Create a grid of plots
g = sns.FacetGrid(churn_pd_df, col=target_col)

# TO DO: Here we execute a matplotlib 'hist' command to get a histogram
# in each facet. However, you need to adjust the number of bins and
# probably hand-tune the range to get a better sense of the distribution
# of MonthlyCharges for each Contract type.
n_bins = 6  # Number of bins for histogram
lower_bound = churn_pd_df[seg_col].min()
upper_bound = churn_pd_df[seg_col].max()

# Define the range for the histogram and generate the histograms
range_vals = (lower_bound, upper_bound)
g.map(plt.hist, seg_col, bins=n_bins, range=range_vals,
      edgecolor="black")
plt.show()

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

Now let's see if the different `Churn` categories have different distributions of `MonthlyCharges` (maybe people churn because they can get a better deal?).

To do this we will use small multiples of box plots to visualize the relationship between `Churn` and `MonthlyCharges`.   Most of this code is a variant of what we saw in the class notes.  You just need to fill in the missing pieces.

In [None]:
# Again, a reminder that Seaborn needs pandas DataFrames churn_pd_df

# Set up the plot
target_col = 'Churn'
seg_col = 'MonthlyCharges'

# Make two plots side by side (using two in 1x3 grid, one subplot will
# be 2x wider than the other)
fig, _ = plt.subplots(1, 3, figsize=(8,6))
plt.subplots_adjust(wspace=0.5)  # Increase horizontal spacing
fig.clear()

# Create Left axes (1 unit wide) and Right axes (spans 2 units wide)
ax1 = plt.subplot2grid((1, 3), (0, 0))
ax2 = plt.subplot2grid((1, 3), (0, 1), colspan=2)

# Boxplot of Seg_Col values
sns.boxplot(y=churn_pd_df[seg_col], ax=ax1, legend=False)
ax1.set_title(f"{seg_col} Distribution")
ax1.set_ylabel(f"{seg_col}")
# ax1.set_ylim((0, 150)) # Force y-axis limits for better comparison

# TO DO: Boxplot of Seg_Col by Target_Col

# Label this axis
ax2.set_title(f"{seg_col} Distribution by {target_col}")
ax2.set_ylabel(f"{seg_col}")
# ax2.set_ylim((0, 150)) # Force y-axis limits for better comparison

# Show the plots
plt.show()

Can you tell what the relationship is between `MonthlyCharges` and `Churn` from these box plots?  Write your observations below.


```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

## Part C: Feature Engineering

Feature engineering normally refers to the process of creating new features from existing ones to improve model performance. 

In this case, we will create a new feature called `AvgMonthlyCharges`, which is calculated by dividing `TotalCharges` by `tenure`. This feature represents the average monthly charges for each customer over their tenure with the company. We have reason to suspect that customers with higher average monthly charges may be more likely to churn, as they may perceive the service as being too expensive.

In [None]:
# TO DO: We can easily compute the average monthly charges by dividing
# TotalCharges by tenure.  We will create a new feature called
# AvgMonthlyCharges by dividing pl.col('TotalCharges') by
# pl.col('tenure') and assigning it to alias `AvgMonthlyCharges`.





# See the first few rows to verify the new feature
churn_df.head()

Furthermore, if a user is paying more now than their average monthly charge, they may be more likely to churn.  We can create a categorical feature called `MonthlyChargeTrend` that is either `rising`, `falling`, or `stable` based on whether their current `MonthlyCharges` is above, below, or equal to their `AvgMonthlyCharges`.  This feature could help us identify customers who are experiencing changes in their billing that may influence their decision to churn.

**BIG HINT**: You can do this using the `.with_columns()` method along with conditional expressions using `.when()`, `.then()`, and `.otherwise()` methods as shown in class.  Since we have 3 conditions to check, you will need to chain multiple `.when()` statements together. To have `.alias()` apply to the entire conditional expression, you will need to wrap the entire expression in parentheses.  Also, to set values to strings, you need to use `pl.lit("string_value")` (`pl.lit` defines a "literal").

In [None]:
# TO DO: Let's create the categorical feature MonthlyChargeTrend based on
# whether MonthlyCharges is above, below, or equal to AvgMonthlyCharges
# for each customer.







# Create the pandas version again for visualization with Seaborn
churn_pd_df = churn_df.to_pandas()

# Show the first few rows to verify the new feature
churn_df.head()

Try to see visually if either of these new features has a relationship with `Churn` by making the appropriate plots.  Write your observations below the plots.

**CREATE CODE CELLS WITH YOUR PLOTS AND ADD MARKDOWN CELLS WITH YOUR OBSERVATIONS**

## Part D: Normalization and Scaling

Once we have all our features ready, we may need to normalize or scale them before feeding them into a machine learning model. This is especially important for algorithms that are sensitive to the scale of input features, such as k-nearest neighbors (KNN) and support vector machines (SVM).

For this dataset, we can apply Min-Max Scaling to the continuous features `tenure`, `MonthlyCharges`, `TotalCharges`, and `AvgMonthlyCharges`. This technique scales the features to a fixed range, typically [0, 1], which helps in normalizing the data.

In [None]:
# Set up a list of the features to be normalized/scaled
continuous_features = ['tenure', 'MonthlyCharges', 'TotalCharges', 
                       'AvgMonthlyCharges']

from sklearn.preprocessing import MinMaxScaler

# Initialize scaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Extract columns to be scaled
features_to_scale = churn_df.select(continuous_features)

# Create scaled_data numpy array as the scaled version of the features
scaled_data = scaler.fit_transform(features_to_scale)

# Add normalized features back to the DataFrame with new column names
for i, col in enumerate(continuous_features):
    churn_df = churn_df.with_columns(
        pl.Series(f"{col}_norm", scaled_data[:, i])
    )

# Show the first few rows to verify the new normalized features
churn_df.head()

It turns out `AvgMonthlyCharges` has some pretty extreme large-value outliers. Why is this a problem when using Min-Max Scaling?

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

We might consider standardization (z-score normalization) as an alternative scaling method, which is less sensitive to outliers. However, given that we don't see a particularly normal distribution for `AvgMonthlyCharges`, standardization may not be the best choice either.  We will hold off on doing any more for now.

You may have noticed, I tended to create new columns when normalizing or binning features rather than replacing the original columns.  Why might this be a good idea?

```
WRITE DOWN YOUR ANSWER HERE AS BEFORE.
```

## What you need to submit

Once you have completed this lab, you need to submit your work. You should submit the `.ipynb` notebook file. To do this, go to the File menu and select **Download > Download .ipynb** (this is the native Jupyter notebook format).  Submit that file to the Lab 4 dropbox on D2L.