<a href="https://colab.research.google.com/github/Jezreel114/CCDATSCL_EXERCISES/blob/main/Exercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 2

<img src="https://vsqfvsosprmjdktwilrj.supabase.co/storage/v1/object/public/images/insights/1753644539114-netflix.jpeg"/>


In this activity, you will explore two fundamental preprocessing techniques used in data science and machine learning: feature scaling and discretization (binning).

These techniques are essential when working with datasets that contain numerical values on very different scales, or continuous variables that may be more useful when grouped into categories.


We will use a subset of the Netflix Movies and TV Shows dataset, which contains metadata such as release year, duration, ratings, and other attributes of titles currently or previously available on Netflix. Although the dataset was not originally designed for numerical modeling, it contains several features suitable for preprocessing practice, such as:
- Release Year
- Duration (in minutes)
- Number of Cast Members
- Number of Listed Genres
- Title Word Count
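
Several of the features listed above (cast count, genre count, title word count) are not columns in the raw file and would have to be derived from text fields. The following is a minimal sketch of one way to do that on a two-row toy frame; the new column names `cast_count`, `genre_count`, and `title_word_count` and the helper `count_items` are hypothetical, not part of the exercise.

```python
import pandas as pd

# Toy rows shaped like the dataset's `title`, `cast`, and `listed_in` columns.
toy_df = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", "Blood & Water"],
    "cast": [None, "Ama Qamata, Khosi Ngema, Gail Mabalane"],
    "listed_in": ["Documentaries", "International TV Shows, TV Dramas, TV Mysteries"],
})


def count_items(text):
    """Count non-empty, comma-separated entries in a string."""
    return len([item for item in text.split(",") if item.strip()])


# Missing values are treated as empty strings, so they count as zero entries.
toy_df["cast_count"] = toy_df["cast"].fillna("").apply(count_items)
toy_df["genre_count"] = toy_df["listed_in"].fillna("").apply(count_items)

# Split the title on whitespace and count the resulting words.
toy_df["title_word_count"] = toy_df["title"].str.split().str.len()

print(toy_df[["cast_count", "genre_count", "title_word_count"]])
```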

In this worksheet, you will:
- Load and inspect the dataset
- Select numerical features for scaling
- Apply different scaling techniques
  - Min–Max Scaling
  - Standardization
  - Robust Scaling
- Perform discretization (binning)
  - Equal-width binning
  - Equal-frequency binning
- Evaluate how scaling affects machine learning performance, using a simple KNN

In [None]:
import pandas as pd
import os
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub


## 1. Setup and Data Loading



Load the Netflix dataset into a DataFrame named df.

In [None]:

# Download latest version
path = kagglehub.dataset_download("shivamb/netflix-shows")

print("Path to dataset files:", path)


if os.path.isdir(path):
  print(True)

contents = os.listdir(path)
contents

mydataset = os.path.join(path, contents[0])
mydataset


df = pd.read_csv(mydataset)

Using Colab cache for faster access to the 'netflix-shows' dataset.
Path to dataset files: /kaggle/input/netflix-shows
True


## 2. Data Understanding

Store the dataset’s column names in a variable called cols.

In [None]:
# put your answer here..
cols = df.columns.tolist()

Store the shape of the dataset as a tuple (rows, columns) in shape_info.

In [None]:
# put your answer here
shape_info = df.shape

## 3. Data Cleaning
Count missing values per column and save to missing_counts.

In [None]:
# put your answer here
missing_counts = df.isnull().sum()

Drop rows where duration is missing. Save to df_clean.

In [None]:
# put your answer here
df_clean = df.dropna(subset=['duration'])

## 4. Selecting Relevant Numeric Features

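Many Netflix datasets include fields that look numeric, such as:
- release_year
- duration (stored as text in this dataset, e.g. `90 min` or `2 Seasons`)
- rating (a content rating such as `PG-13`, not a number)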


Create a DataFrame `df_num` containing only numeric columns.

In [None]:
# put your answer here
df_num = df.select_dtypes(include=['number'])
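
As a small illustration of what `select_dtypes(include=['number'])` keeps, here is a sketch on a toy frame. The frame `toy_df` is made up for this example; it only mimics the dtypes of three dataset columns, where `duration` is stored as text.

```python
import pandas as pd

# `release_year` is an integer column; `title` and `duration` are text.
toy_df = pd.DataFrame({
    "title": ["Sankofa", "Kota Factory"],
    "release_year": [1993, 2021],
    "duration": ["125 min", "2 Seasons"],
})

# Only columns with a numeric dtype survive the selection.
toy_num = toy_df.select_dtypes(include=["number"])
print(toy_num.columns.tolist())
```

Because `duration` is text, it is not selected until it has been converted to a number.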

## 5. Feature Scaling

Focus on a single numeric column (e.g., duration).


Extract the column duration into a Series named `dur`.

In [None]:
# put your answer here
dur = df['duration']

In [None]:
from IPython.display import display, HTML

# Convert the 'duration' column into a scrollable HTML table
display(HTML(df[['duration']].to_html(max_rows=None, max_cols=1).replace('<table ',
                  '<table style="max-height: 300px; overflow-y: scroll; display: block;" ')))

Unnamed: 0,duration
0,90 min
1,2 Seasons
2,1 Season
3,1 Season
4,2 Seasons
5,1 Season
6,91 min
7,125 min
8,9 Seasons
9,104 min


In [None]:
from IPython.display import display, HTML

display(HTML(df.to_html(max_rows=None, max_cols=None, notebook=True)
             .replace('<table ', '<table style="display:block; max-height:400px; overflow:auto;" ')))


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...


In [None]:
# Get the column names of the DataFrame df
column_names = df.columns

# Display the column names
print(column_names)

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'duration_equal_width_bin', 'duration_equal_freq_bin'],
      dtype='object')


In [None]:
import pandas as pd
import math  # To handle NaN (if needed)

# Function to convert duration (seasons or minutes) to minutes
def convert_to_minutes(duration):
    if isinstance(duration, str):  # Ensure we're dealing with a string
        if 'min' in duration:  # If it contains minutes
            return int(duration.split()[0])  # Extract the number before 'min'
        elif 'Season' in duration:  # If it contains seasons
            seasons = int(duration.split()[0])  # Extract the number of seasons
            return seasons * 500  # Convert seasons to minutes (500 minutes per season)
    elif isinstance(duration, (int, float)):  # If it's already a number (e.g., float or int)
        # Handle NaN values and return 0 or None
        if math.isnan(duration):  # If it's NaN
            return None  # You can choose to return 0 instead of None if desired
        return int(duration)  # Treat it as minutes

    return None  # In case of unexpected format

dur_converted = dur.apply(convert_to_minutes)

# dur_converted now holds the durations in minutes (df['duration'] itself is unchanged)
print(dur_converted)

0         90.0
1       1000.0
2        500.0
3        500.0
4       1000.0
         ...  
8802     158.0
8803    1000.0
8804      88.0
8805      88.0
8806     111.0
Name: duration, Length: 8807, dtype: float64
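
The same conversion can also be written without a row-by-row function. This is a sketch of a vectorized alternative on a toy Series, keeping the same rough assumption of 500 minutes per season; the names `sample`, `MINUTES_PER_SEASON`, and `sample_minutes` are illustrative only.

```python
import pandas as pd

# Toy stand-in for df['duration'], covering the formats shown above plus a missing value.
sample = pd.Series(["90 min", "2 Seasons", "1 Season", None, "125 min"])

MINUTES_PER_SEASON = 500  # same rough assumption as convert_to_minutes

# Extract the leading number and the unit word that follows it.
parts = sample.str.extract(r"(?P<value>\d+)\s+(?P<unit>min|Season)")
value = pd.to_numeric(parts["value"])  # NaN where the pattern did not match

# Season counts are multiplied by the assumed season length; minutes pass through.
factor = parts["unit"].map({"min": 1, "Season": MINUTES_PER_SEASON})
sample_minutes = value * factor

print(sample_minutes)
```

Missing or unrecognized entries stay as `NaN`, which matches the `None` returned by `convert_to_minutes`.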


Apply Min–Max Scaling to `dur`. Store the result as `dur_minmax`.

In [None]:
# put your answer here
dur_minmax = (dur_converted - dur_converted.min()) / (dur_converted.max() - dur_converted.min())
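
To make the Min–Max formula concrete, here is a small worked sketch on made-up values (the Series `toy` is not taken from the dataset). Each value is shifted by the minimum and divided by the range, so the smallest value maps to 0 and the largest to 1.

```python
import pandas as pd

# Four made-up durations in minutes.
toy = pd.Series([90.0, 500.0, 1000.0, 125.0])

# Min-Max scaling: (x - min) / (max - min)
toy_minmax = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_minmax.round(3))
```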

Apply Z-score Standardization to `dur`. Store in `dur_zscore`.

In [None]:
# put your answer here
dur_zscore = (dur_converted - dur_converted.mean()) / dur_converted.std()
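
The worksheet outline also lists Robust Scaling, which has no cell of its own. The sketch below compares z-score standardization with a hand-written robust scaling (median and interquartile range) on made-up values that include one extreme outlier; the Series `toy` and the result names are illustrative.

```python
import pandas as pd

# Made-up durations with one extreme value.
toy = pd.Series([90.0, 100.0, 110.0, 120.0, 8500.0])

# Z-score: pandas' .std() uses the sample standard deviation (ddof=1) by default.
toy_zscore = (toy - toy.mean()) / toy.std()

# Robust scaling: center on the median and divide by the interquartile range.
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
toy_robust = (toy - toy.median()) / (q3 - q1)

print(pd.DataFrame({"zscore": toy_zscore, "robust": toy_robust}))
```

The outlier pulls the mean and standard deviation, so the four ordinary values end up squeezed together under z-scoring, while the median and IQR are unaffected by it. Note that scikit-learn's `StandardScaler` divides by the population standard deviation (ddof=0), so its output differs slightly from the pandas expression.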


## 6. Discretization (Binning)
Apply equal-width binning to dur into 5 bins. Store as `dur_width_bins`.


- Use `pandas.cut()` to divide the duration values into 5 equal-width bins.
- Add the resulting bins as a new column named:
`duration_equal_width_bin`

In [None]:
# put your answer here
dur_width_bins = pd.cut(dur_zscore, bins=5, labels=[1, 2, 3, 4, 5])
df['duration_equal_width_bin'] = dur_width_bins

In [None]:
# put your answer here
from IPython.display import display, HTML

# Convert the 'duration_equal_width_bin' column into a scrollable HTML table
display(HTML(df[['duration_equal_width_bin']].to_html(max_rows=None, max_cols=1).replace('<table ',
                  '<table style="max-height: 300px; overflow-y: scroll; display: block;" ')))

Unnamed: 0,duration_equal_width_bin
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
5,1.0
6,1.0
7,1.0
8,3.0
9,1.0


In [None]:
# Count titles per bin, ordered by bin label (1 to 5) rather than by frequency
count_1_to_5 = df['duration_equal_width_bin'].value_counts().sort_index()

# Display the result
print(count_1_to_5)

duration_equal_width_bin
1    8545
2     193
3      56
4       7
5       3
Name: count, dtype: int64


Describe the characteristics of each bin

- What are the bin edges produced by equal-width binning?
- How many movies fall into each bin?

In [None]:
# Each bin covers the same width of duration values, but the titles are spread very unevenly:
# bin 1: 8545, bin 2: 193, bin 3: 56, bin 4: 7, bin 5: 3
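
The bin edges themselves can be read back from `pandas.cut()` by passing `retbins=True`. This is a minimal sketch on made-up values, not the dataset; `toy`, `toy_width_bins`, and `width_edges` are illustrative names.

```python
import pandas as pd

# Made-up durations in minutes, skewed like the real column.
toy = pd.Series([90, 91, 104, 125, 500, 1000, 4500])

# retbins=True returns the computed edges alongside the bin labels.
toy_width_bins, width_edges = pd.cut(
    toy, bins=5, labels=[1, 2, 3, 4, 5], retbins=True
)

print(width_edges)  # 6 edges define 5 bins
print(toy_width_bins.value_counts().sort_index())
```

The edges are evenly spaced between the minimum and maximum (pandas extends the lowest edge slightly so the minimum is included), which is why a few very large values leave most titles in the first bin.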

Apply equal-frequency binning to dur into 5 bins. Store as `dur_quantile_bins`.

- Use `pandas.qcut()` to divide the duration values into 5 equal-frequency bins.
- Add the result as a new column named:
`duration_equal_freq_bin`

In [None]:
# put your answer here
dur_quantile_bins = pd.qcut(dur_zscore, q=5, labels=[1, 2, 3, 4, 5])
df['duration_equal_freq_bin'] = dur_quantile_bins

In [None]:
# put your answer here
from IPython.display import display, HTML

# Convert the 'duration_equal_freq_bin' column into a scrollable HTML table
display(HTML(df[['duration_equal_freq_bin']].to_html(max_rows=None, max_cols=1).replace('<table ',
                  '<table style="max-height: 300px; overflow-y: scroll; display: block;" ')))

Unnamed: 0,duration_equal_freq_bin
0,2
1,5
2,4
3,4
4,5
5,4
6,2
7,3
8,5
9,3


In [None]:
# Count titles per bin, ordered by bin label (1 to 5) rather than by frequency.
# Slicing the frequency-sorted counts with .loc[1:5] would silently skip any bin
# that sorts ahead of label 1.
count_1_to_5 = df['duration_equal_freq_bin'].value_counts().sort_index()

# Display the result
print(count_1_to_5)

duration_equal_freq_bin
1    1838
2    1714
3    1757
4    2612
5     883
Name: count, dtype: int64


Describe the characteristics of each bin

- What are the bin ranges produced by equal-frequency binning?
- How many movies fall into each bin? Are they nearly equal?

In [None]:
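# The bins are only roughly equal in size.
# Bins 1 to 3 each hold about 1,700 to 1,800 titles, but bin 5 holds only 883.
# Many titles share the same duration (every 1-season show maps to 500 minutes),
# and tied values cannot be split across bins, so the counts drift from an even 20% each.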
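
The effect of repeated values on equal-frequency binning can be seen on a small made-up example. The sketch below uses `pandas.qcut()` with `retbins=True` on ten values, four of which are tied at 500; all names are illustrative.

```python
import pandas as pd

# Ten made-up durations; four of them are tied at 500 minutes.
toy = pd.Series([90, 95, 100, 110, 500, 500, 500, 500, 1000, 1500])

# Ask for 5 quantile bins and return the edges that were used.
toy_freq_bins, freq_edges = pd.qcut(
    toy, q=5, labels=[1, 2, 3, 4, 5], retbins=True
)

print(freq_edges)
print(toy_freq_bins.value_counts().sort_index())
```

With no ties, each bin would hold 2 of the 10 values. Here all four tied values land in the same bin, leaving another bin empty, so the counts are no longer equal.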

## 7. KNN Before & After Scaling


Create a feature matrix X using any two numeric columns and a target y (e.g., classification by genre or type). Create a train/test split.

Train a KNN classifier without scaling. Store accuracy in acc_raw.

In [None]:
print(X.isnull().sum())  # Missing values in feature columns
print(y.isnull().sum())

duration_equal_freq_bin    0
release_year               0
dtype: int64
0


In [None]:
df = df.dropna(subset=['duration_equal_freq_bin'])

In [None]:
# put your answer here
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Feature matrix X and target variable y
X = df[['duration_equal_freq_bin', 'release_year']]
y = df['type']



# Create a train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the KNN classifier without scaling
knn.fit(X_train, y_train)

# Predict on the test data
y_pred = knn.predict(X_test)

# Calculate accuracy
acc_raw = accuracy_score(y_test, y_pred)

# Print accuracy
print(f'Accuracy (before scaling): {acc_raw:.4f}')

Accuracy (before scaling): 0.9097


Scale `X` using either Min–Max or Standardization, retrain KNN, and store accuracy in acc_scaled.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assume df is your DataFrame and 'duration_equal_freq_bin', 'release_year' are your features
X = df[['duration_equal_freq_bin', 'release_year']]
y = df['type']

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Min-Max Scaler
scaler_minmax = MinMaxScaler()

# Apply Min-Max scaling to the features
X_train_scaled = scaler_minmax.fit_transform(X_train)  # Fit and transform the training data
X_test_scaled = scaler_minmax.transform(X_test)  # Only transform the test data

# Initialize KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the KNN classifier on scaled data
knn.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred_scaled = knn.predict(X_test_scaled)

# Calculate accuracy
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print accuracy
print(f'Accuracy (after Min-Max Scaling): {acc_scaled:.4f}')


Accuracy (after Min-Max Scaling): 0.9205


In [None]:
print(f'Accuracy (before scaling): {acc_raw:.4f}')
print(f'Accuracy (after Min-Max Scaling): {acc_scaled:.4f}')

Accuracy (before scaling): 0.9120
Accuracy (after Min-Max Scaling): 0.9205


Did scaling improve accuracy? Explain why.

In [None]:
# put your answer here
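# Yes. Accuracy rose slightly, from 0.9120 to 0.9205, after Min-Max scaling.
# KNN is distance-based: unscaled, release_year spans decades while the duration bin
# only ranges from 1 to 5, so year differences dominate the distance calculation.
# Scaling puts both features on the same [0, 1] range so each contributes comparably.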
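
A small numeric sketch shows how scaling changes which neighbor is nearest. The three points are hypothetical titles described by (duration bin, release year), and the ranges used for scaling (bin 1 to 5, year 1925 to 2021) are illustrative assumptions rather than values read from the dataset.

```python
import math

# Hypothetical titles as (duration bin, release_year).
a = (1, 2020)
b = (5, 2019)  # opposite duration bin, almost the same year
c = (1, 2000)  # same duration bin, 20 years older

# Raw Euclidean distances: the year difference dominates.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)


def minmax_scale(point, lo=(1, 1925), hi=(5, 2021)):
    """Scale each coordinate to [0, 1] using assumed feature ranges."""
    return tuple((v - l) / (h - l) for v, l, h in zip(point, lo, hi))


# Distances after scaling: the bin difference now carries real weight.
scaled_ab = math.dist(minmax_scale(a), minmax_scale(b))
scaled_ac = math.dist(minmax_scale(a), minmax_scale(c))

print(f"raw:    d(a,b)={raw_ab:.3f}  d(a,c)={raw_ac:.3f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

Before scaling, `b` is the nearer neighbor of `a` even though it sits in the opposite duration bin; after scaling, `c` is nearer.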