<a href="https://colab.research.google.com/github/Direspecific/CCDATSCL_EXERCISE_COM22/blob/main/Exercise_2_FAT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 2

<img src="https://vsqfvsosprmjdktwilrj.supabase.co/storage/v1/object/public/images/insights/1753644539114-netflix.jpeg"/>


In this activity , you will explore two fundamental preprocessing techniques used in data science and machine learning: feature scaling and discretization (binning).

These techniques are essential when working with datasets that contain numerical values on very different scales, or continuous variables that may be more useful when grouped into categories.


We will use a subset of the Netflix Movies and TV Shows dataset, which contains metadata such as release year, duration, ratings, and other attributes of titles currently or previously available on Netflix. Although the dataset is not originally designed for numerical modeling, it contains several features suitable for preprocessing practice—such as:
-Release Year
-Duration (in minutes)
-Number of Cast Members
-Number of Listed Genres
-Title Word Count

In this worksheet, you will:
- Load and inspect the dataset
- Select numerical features for scaling
- Apply different scaling techniques
- Min–Max Scaling
- Standardization
- Robust Scaling
- Perform discretization (binning)
- Equal-width binning
- Equal-frequency binning
- Evaluate how scaling affects machine learning performance, using a simple KNN

In [90]:
import pandas as pd
import os
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub


## 1. Setup and Data Loading



Load the Netflix dataset into a DataFrame named df.

In [91]:

# Download latest version
path = kagglehub.dataset_download("shivamb/netflix-shows")

print("Path to dataset files:", path)


if os.path.isdir(path):
  print(True)

contents = os.listdir(path)
contents

mydataset = path + "/" + contents[0]
mydataset


df = pd.read_csv(mydataset)

Using Colab cache for faster access to the 'netflix-shows' dataset.
Path to dataset files: /kaggle/input/netflix-shows
True


## 2. Data Understanding

Store the dataset’s column names in a variable called cols.

In [92]:
df.columns.tolist()

['show_id',
 'type',
 'title',
 'director',
 'cast',
 'country',
 'date_added',
 'release_year',
 'rating',
 'duration',
 'listed_in',
 'description']

Store the shape of the dataset as a tuple (rows, columns) in shape_info.

In [93]:
print(df.shape)

(8807, 12)


## 3. Data Cleaning
Count missing values per column and save to missing_counts.

In [94]:
missing_counts = df.isnull().sum()

print(missing_counts)

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64


In [95]:
df.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


Drop rows where duration is missing. Save to df_clean.

In [96]:
df_clean = df.dropna(subset=['duration'])

df_clean.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


4. Selecting Relevant Numeric Features

Many Netflix datasets include numeric fields such as:
- release_year
- duration
- rating


Create a DataFrame `df_num` containing only numeric columns.

In [97]:
df_num = df.select_dtypes(include=['number'])

df_num.head()

Unnamed: 0,release_year
0,2020
1,2021
2,2021
3,2021
4,2021


## 5. Feature Scaling

Focus on a single numeric column (e.g., duration).


Extract the column duration into a Series named `dur`.

In [98]:
dur = df['duration']
dur.head()

Unnamed: 0,duration
0,90 min
1,2 Seasons
2,1 Season
3,1 Season
4,2 Seasons


Apply Min–Max Scaling to `dur`. Store the result as `dur_minmax`.

In [99]:
# Make a copy of df_clean
df_temp = df_clean.copy()

# Function to convert duration to minutes with updated rules
def convert_duration_to_minutes(row):
    duration_str = str(row['duration'])
    media_type = row['type']
    listed_in = str(row['listed_in'])  # Categories/genres

    if pd.isna(duration_str) or duration_str.lower() == 'nan':
        return np.nan

    minutes = np.nan

    # ----------------------------
    # MOVIES: "120 min"
    # ----------------------------
    if media_type == 'Movie':
        match = re.match(r'(\d+)\s*min', duration_str)
        if match:
            minutes = int(match.group(1))

    # ----------------------------
    # TV SHOWS: Convert Seasons → Episodes → Minutes
    # 1 Season = 12 Episodes, 1 Episode = 90 mins
    # ----------------------------
    elif media_type == 'TV Show':
        # A) Season(s)
        season_match = re.match(r'(\d+)\s*Season(?:s)?', duration_str)
        if season_match:
            seasons = int(season_match.group(1))
            episodes = seasons * 12          # 1 Season = 12 Episodes
            minutes = episodes * 90          # 1 Episode = 90 mins

        # B) Episode(s)
        episode_match = re.match(r'(\d+)\s*Episode(?:s)?', duration_str)
        if episode_match:
            episodes = int(episode_match.group(1))
            minutes = episodes * 90

    # ----------------------------
    # Increase duration by number of categories in listed_in
    # ----------------------------
    if pd.notna(minutes):
        num_categories = listed_in.count(',') + 1  # e.g., "Action, Comedy" → 2
        minutes *= num_categories

    return minutes

# Apply conversion
df_temp['duration_in_minutes'] = df_temp.apply(convert_duration_to_minutes, axis=1)

# Drop rows with missing durations
df_temp = df_temp.dropna(subset=['duration_in_minutes'])

# Optional: show sample
print(df_temp[['type', 'duration', 'listed_in', 'duration_in_minutes']].sample(10))


         type  duration                                          listed_in  \
5338    Movie   160 min   Action & Adventure, Dramas, International Movies   
2624    Movie    96 min                     Comedies, International Movies   
2188    Movie    44 min                           Children & Family Movies   
8776    Movie    90 min                 Children & Family Movies, Comedies   
3353    Movie    89 min  Documentaries, International Movies, Music & M...   
6034    Movie   110 min                                             Dramas   
7616    Movie    53 min                                      Documentaries   
6609    Movie   133 min                                 Action & Adventure   
7247    Movie   149 min      Dramas, International Movies, Romantic Movies   
5030  TV Show  1 Season  British TV Shows, Docuseries, International TV...   

      duration_in_minutes  
5338                  480  
2624                  192  
2188                   44  
8776                  180  
3

Apply Z-score Standardization to `dur`. Store in `dur_zscore`.

In [100]:
dur_mean = numeric_dur.mean()
dur_std = numeric_dur.std()

# Handle the case where standard deviation is zero to avoid division by zero
if dur_std == 0:
    dur_zscore = pd.Series(0.0, index=numeric_dur.index)
else:
    dur_zscore = (numeric_dur - dur_mean) / dur_std

dur_zscore.head()

Unnamed: 0,duration_in_minutes
0,-0.448698
1,1.19376
2,0.324223
3,0.318185
4,1.19376


## 6. Discretization (Binning)
Apply equal-width binning to dur into 5 bins. Store as `dur_width_bins`.


- Use `pandas.cut()` to divide duration_minutes into 4 `equal-width bins`.
- Add the resulting bins as a new column named:
`duration_equal_width_bin`

In [101]:
dur_width_bins = pd.cut(numeric_dur, bins=5)
print("dur_width_bins sample:")
print(dur_width_bins.head())

dur_width_bins sample:
0    (0.763, 2460.4]
1    (0.763, 2460.4]
2    (0.763, 2460.4]
3    (0.763, 2460.4]
4    (0.763, 2460.4]
Name: duration_in_minutes, dtype: category
Categories (5, interval[float64, right]): [(0.763, 2460.4] < (2460.4, 4907.8] < (4907.8, 7355.2] <
                                           (7355.2, 9802.6] < (9802.6, 12250.0]]


In [102]:
df_temp['duration_equal_width_bin'] = pd.cut(
    df_temp['duration_in_minutes'],
    bins=4
)

print("\nSample of duration_equal_width_bin:")
print(df_temp[['duration_in_minutes', 'duration_equal_width_bin']].sample(5))



Sample of duration_equal_width_bin:
      duration_in_minutes duration_equal_width_bin
1341                  212       (-42.594, 12154.5]
5436                 3240       (-42.594, 12154.5]
6137                 6480       (-42.594, 12154.5]
2736                 9720       (-42.594, 12154.5]
6762                   61       (-42.594, 12154.5]


Describe the characteristics of each bin

- What are the bin edges produced by equal-width binning?
- How many movies fall into each bin?

In [103]:
bin_counts = dur_width_bins.value_counts().sort_index()
print("\nNumber of movies/TV shows in each equal-width bin:")
print(bin_counts)


Number of movies/TV shows in each equal-width bin:
duration_in_minutes
(0.763, 2460.4]      8545
(2460.4, 4907.8]      193
(4907.8, 7355.2]       56
(7355.2, 9802.6]        7
(9802.6, 12250.0]       3
Name: count, dtype: int64


Apply equal-frequency binning to dur into 5 bins. Store as `dur_quantile_bins`.

- Use `pandas.qcut()` to divide duration_minutes into 4 equal-frequency bins.
- Add the result as a new column named:
`duration_equal_freq_bin`

In [104]:
# 4 equal-frequency bins for the DataFrame
df_temp['duration_equal_freq_bin'] = pd.qcut(
    df_temp['duration_in_minutes'],
    q=4
)

print("\nSample of duration_equal_freq_bin:")
print(df_temp[['duration_in_minutes', 'duration_equal_freq_bin']].sample(5))


Sample of duration_equal_freq_bin:
      duration_in_minutes duration_equal_freq_bin
4385                  266          (180.0, 294.0]
7182                 3240       (2160.0, 48600.0]
376                  1080         (294.0, 2160.0]
1084                 3240       (2160.0, 48600.0]
1861                  495         (294.0, 2160.0]


Describe the characteristics of each bin

- What are the bin ranges produced by equal-frequency binning?
- How many movies fall into each bin? Are they nearly equal?

In [105]:
bin_summary = numeric_dur.groupby(dur_quantile_bins).agg(['count', 'min', 'max', 'mean'])
print("\nSummary statistics for each equal-frequency bin:")
print(bin_summary)



Summary statistics for each equal-frequency bin:
                     count   min    max         mean
duration_in_minutes                                 
(2.999, 89.0]         1838    13    104    78.034276
(89.0, 102.0]         1714    95    117   106.626604
(102.0, 127.0]        1757   108    142   124.816164
(127.0, 540.0]        2612   133    735   552.322741
(540.0, 9180.0]        883  1445  12250  2400.016988


  bin_summary = numeric_dur.groupby(dur_quantile_bins).agg(['count', 'min', 'max', 'mean'])


## 7. KNN Before & After Scaling


Create a feature matrix X using any two numeric columns and a target y (e.g., classification by genre or type). Create a train/test split.

Train a KNN classifier without scaling. Store accuracy in acc_raw.

In [106]:
# ----------------------------
# 1. Select features (X) and target (y)
# ----------------------------
# Example numeric columns: duration_in_minutes, release_year
X = df_temp[['duration_in_minutes', 'release_year']].copy()
y = df_temp['type']  # Target: Movie or TV Show

# ----------------------------
# 2. Train/test split
# ----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ----------------------------
# 3. Train KNN classifier (without scaling)
# ----------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# ----------------------------
# 4. Predict and compute accuracy
# ----------------------------
y_pred = knn.predict(X_test)
acc_raw = accuracy_score(y_test, y_pred)

print("Accuracy of KNN classifier without scaling:", acc_raw)


Accuracy of KNN classifier without scaling: 0.9994321408290744


Scale `X` using either Min–Max or Standardization, retrain KNN, and store accuracy in acc_scaled.

In [107]:
# ----------------------------
# 1. Scale X (using Standardization)
# ----------------------------
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ----------------------------
# 2. Train/test split (same as before)
# ----------------------------
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# ----------------------------
# 3. Train KNN on scaled data
# ----------------------------
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_s, y_train_s)

# ----------------------------
# 4. Predict and compute accuracy
# ----------------------------
y_pred_scaled = knn_scaled.predict(X_test_s)
acc_scaled = accuracy_score(y_test_s, y_pred_scaled)

print("Accuracy of KNN classifier after scaling:", acc_scaled)


Accuracy of KNN classifier after scaling: 0.9988642816581488


Did scaling improve accuracy? Explain why.

In [108]:
# Scaling did not improve accuracy here; in fact, it slightly decreased from 0.9994 to 0.9989 because the features were already on similar scales and KNN could classify effectively without scaling.