### Scaling Data (Normalisation / Standardisation)

In [None]:
import pandas as pd
movies = pd.read_csv(r"x:REDACTED\03 - Scaling\imdb-top-1000.csv")

# from google.colab import files
# uploaded = files.upload()
# movies = pd.read_csv('imdb-top-1000.csv')



Scaling columns in a pandas DataFrame means transforming the numerical values in selected columns so that they fall within a fixed range or have a standardised distribution. It is an important preprocessing step in data analysis and machine learning, especially when features have very different scales or units of measurement.
### Purpose of Scaling:
- Comparable Contribution: Puts all features on a comparable footing in the analysis or model training, so that features with larger raw values do not dominate simply because of their units.
- Improved Algorithm Performance: Distance-based algorithms (e.g., K-Nearest Neighbors, Support Vector Machines) are sensitive to the scale of features, and models trained with gradient descent typically converge faster on scaled data.
- Enhanced Comparability: Makes it easier to compare and interpret features that were originally on different scales or units.

### Common Scaling Methods:
- Min-Max Scaling (Normalisation): Scales values to a specific range, typically between 0 and 1. The formula is:
  - `X_norm = (X - X_min) / (X_max - X_min)`
- Standardisation: Transforms data to have a mean of 0 and a standard deviation of 1. The formula is:
  - `Z = (X - μ) / σ`
  - where `μ` is the mean and `σ` is the standard deviation.
- Robust Scaling: Similar to standardisation but uses the median and interquartile range (IQR) instead of mean and standard deviation, making it more robust to outliers.
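
To make the three formulas concrete, here is a minimal sketch that applies each one to a small made-up Series. The values, including the outlier of 100, are hypothetical and are not taken from the IMDB data; the robust-scaling line uses the usual `(X - median) / IQR` form, which the list above describes only in words.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier (100.0)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (X - X_min) / (X_max - X_min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation: (X - mean) / std
# Note: pandas Series.std() is the sample standard deviation (ddof=1)
x_std = (x - x.mean()) / x.std()

# Robust scaling: (X - median) / IQR
iqr = x.quantile(0.75) - x.quantile(0.25)
x_robust = (x - x.median()) / iqr

print(pd.DataFrame({"raw": x, "minmax": x_minmax,
                    "standard": x_std, "robust": x_robust}))
```

In this toy case the outlier squeezes the four ordinary values into roughly the bottom 3% of the min-max range, while robust scaling keeps them spread between -1 and 0.5.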

In [2]:
# from sklearn.preprocessing import StandardScaler

# # Select numerical columns
# numerical_cols = movies.select_dtypes(include=['int64', 'float64']).columns

# # Initialize StandardScaler
# scaler = StandardScaler()

# # Scale the numerical columns
# movies[numerical_cols] = scaler.fit_transform(movies[numerical_cols])

# # Display the first few rows of the scaled data
# display(movies.head())

## Normalisation Scaling

In [3]:
# Select numerical columns
numerical_cols = movies.select_dtypes(include=['int64', 'float64']).columns

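# Manually normalise numerical columns (overwrites movies in place).
# Assumes no column is constant; otherwise maximum - minimum is 0 and the result is NaN.
for col in numerical_cols:
    minimum = movies[col].min()
    maximum = movies[col].max()
    movies[col] = (movies[col] - minimum) / (maximum - minimum)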

# Display the first few rows of the scaled data
display(movies.head())

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
0,The Shawshank Redemption,1994,0.351449,Drama,1.0,Frank Darabont,Tim Robbins,1.0,0.030257,0.722222
1,The Godfather,1972,0.471014,Crime,0.941176,Francis Ford Coppola,Marlon Brando,0.688207,0.144092,1.0
2,The Dark Knight,2008,0.387681,Action,0.823529,Christopher Nolan,Christian Bale,0.982797,0.571025,0.777778
3,The Godfather: Part II,1974,0.568841,Crime,0.823529,Francis Ford Coppola,Al Pacino,0.476641,0.061173,0.861111
4,12 Angry Men,1957,0.184783,Crime,0.823529,Sidney Lumet,Henry Fonda,0.286778,0.004653,0.944444
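
The loop above overwrites `movies` and would produce NaN for any column whose values are all identical. As one possible alternative, the sketch below wraps the same formula in a helper that returns a copy and maps constant columns to 0. The name `min_max_scale`, the choice of 0 for constant columns, and the toy DataFrame are all assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every numeric column rescaled to [0, 1]."""
    scaled = df.copy()
    numeric_cols = scaled.select_dtypes(include="number").columns
    for col in numeric_cols:
        lo, hi = scaled[col].min(), scaled[col].max()
        if hi == lo:
            # Constant column: avoid dividing by zero (assumed convention: all zeros)
            scaled[col] = 0.0
        else:
            scaled[col] = (scaled[col] - lo) / (hi - lo)
    return scaled

# Hypothetical toy data, not the IMDB file
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "runtime": [90, 120, 150],
    "constant": [5, 5, 5],
})
scaled_toy = min_max_scale(toy)
print(scaled_toy)
```

Because the helper works on a copy, the original values stay available for a later step such as standardisation.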


## Standardisation Scaling

In [None]:
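# Select numerical columns
numerical_cols = movies.select_dtypes(include=['int64', 'float64']).columns

# Manually standardise numerical columns.
# If the normalisation cell above has already run, movies holds min-max scaled values;
# the z-scores are the same either way, because standardisation is unaffected by a
# prior linear rescaling. Series.std() is the sample standard deviation (ddof=1).
for col in numerical_cols:
    mean = movies[col].mean()
    std = movies[col].std()
    movies[col] = (movies[col] - mean) / std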

# Display the first few rows of the scaled data
display(movies.head())

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
0,The Shawshank Redemption,1994,0.680189,Drama,4.902879,Frank Darabont,Tim Robbins,6.321288,-0.496923,0.163902
1,The Godfather,1972,1.854831,Crime,4.539891,Francis Ford Coppola,Marlon Brando,4.113581,0.033257,1.77992
2,The Dark Knight,2008,1.036141,Action,3.813915,Christopher Nolan,Christian Bale,6.199476,2.021671,0.487106
3,The Godfather: Part II,1974,2.815901,Crime,3.813915,Francis Ford Coppola,Al Pacino,2.615548,-0.35293,0.971911
4,12 Angry Men,1957,-0.957191,Crime,3.813915,Sidney Lumet,Henry Fonda,1.271187,-0.616168,1.456717
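
Two details of manual standardisation are easy to miss, and the sketch below checks both on a small made-up Series (the values and the helper name `standardise` are hypothetical). First, standardising min-max scaled values gives the same z-scores as standardising the raw values. Second, pandas `std()` defaults to the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two approaches differ slightly on small datasets.

```python
import pandas as pd

def standardise(s: pd.Series, ddof: int = 1) -> pd.Series:
    """Z-score a Series: (X - mean) / std, with a configurable ddof."""
    return (s - s.mean()) / s.std(ddof=ddof)

# Hypothetical toy data: mean 5, population standard deviation 2
raw = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# 1) A prior min-max rescaling does not change the z-scores
normalised = (raw - raw.min()) / (raw.max() - raw.min())
z_from_raw = standardise(raw)
z_from_normalised = standardise(normalised)
same_z_scores = bool((z_from_raw - z_from_normalised).abs().max() < 1e-12)

# 2) Sample (ddof=1) versus population (ddof=0) standard deviation
z_sample = standardise(raw, ddof=1)
z_population = standardise(raw, ddof=0)

print("Same z-scores after normalisation:", same_z_scores)
print(pd.DataFrame({"ddof=1": z_sample, "ddof=0": z_population}))
```

On a dataset of about 1,000 rows the `ddof` choice changes the z-scores only marginally, but it explains small mismatches when comparing manual results against `StandardScaler`.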
