# Khipus.ai
## Introduction to Machine Learning
### Data Preparation
### Case Study: Netflix
<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>

### About Dataset
This dataset provides a comprehensive collection of all titles (Movies and TV Series) available on Netflix. In addition to basic information, it includes IMDb-specific data like IMDb ID, Average Rating, and Number of Votes.

Source: https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset?resource=download

## Data Exploration
Understanding the data is the first step in any machine learning workflow.

In [51]:
# Import necessary libraries
import pandas as pd


In [52]:
# Load the dataset
data = pd.read_csv('Netflix.csv')

# Display the first few rows of the dataset
data.head()

| | title | type | genres | releaseYear | imdbId | imdbAverageRating | imdbNumVotes | availableCountries |
|---|---|---|---|---|---|---|---|---|
| 0 | The Fifth Element | movie | Action, Adventure, Sci-Fi | 1997.0 | tt0119116 | 7.6 | 520476.0 | AT, CH, DE |
| 1 | Kill Bill: Vol. 1 | movie | Action, Crime, Thriller | 2003.0 | tt0266697 | 8.2 | 1232113.0 | AE, AL, AO, AT, AU, AZ, BG, BH, BY, CI, CM, CZ... |
| 2 | Jarhead | movie | Biography, Drama, War | 2005.0 | tt0418763 | 7.0 | 213209.0 | AD, AE, AG, AO, BH, BM, BS, BZ, CI, CM, CO, CR... |
| 3 | Unforgiven | movie | Drama, Western | 1992.0 | tt0105695 | 8.2 | 447667.0 | AU, BA, BG, CZ, HR, HU, MD, ME, MK, NZ, PL, RO... |
| 4 | Eternal Sunshine of the Spotless Mind | movie | Drama, Romance, Sci-Fi | 2004.0 | tt0338013 | 8.3 | 1117918.0 | AD, AE, AG, AL, AO, AR, AU, AZ, BA, BB, BE, BG... |


In [53]:
# Display basic information about the dataset
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20609 entries, 0 to 20608
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               19997 non-null  object 
 1   type                20609 non-null  object 
 2   genres              20279 non-null  object 
 3   releaseYear         20576 non-null  float64
 4   imdbId              19130 non-null  object 
 5   imdbAverageRating   18945 non-null  float64
 6   imdbNumVotes        18945 non-null  float64
 7   availableCountries  20609 non-null  object 
dtypes: float64(3), object(5)
memory usage: 1.3+ MB


In [54]:
# Summary statistics for numerical columns
data.describe()


| | releaseYear | imdbAverageRating | imdbNumVotes |
|---|---|---|---|
| count | 20576.0 | 18945.0 | 18945.0 |
| mean | 2013.161693 | 6.398955 | 31364.8 |
| std | 14.358143 | 1.098073 | 119918.7 |
| min | 1913.0 | 1.2 | 5.0 |
| 25% | 2011.0 | 5.7 | 325.0 |
| 50% | 2018.0 | 6.5 | 1579.0 |
| 75% | 2021.0 | 7.2 | 10007.0 |
| max | 2025.0 | 9.5 | 2991460.0 |


## Data Cleaning
Cleaning the data involves handling missing values, removing duplicates, and correcting data types.

In [65]:
# Check for missing values
data.isnull().sum()


title                  612
type                     0
genres                 330
releaseYear             33
imdbId                1479
imdbAverageRating     1664
imdbNumVotes          1664
availableCountries       0
dtype: int64
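
These counts follow directly from the `data.info()` output: each missing count is the total number of rows minus the column's non-null count. A small sketch of that arithmetic, using only figures reported above (the variable names are ours), also expresses each count as a share of the dataset:

```python
# Figures copied from the data.info() output above
total_rows = 20609
non_null = {
    "title": 19997,
    "genres": 20279,
    "releaseYear": 20576,
    "imdbId": 19130,
    "imdbAverageRating": 18945,
    "imdbNumVotes": 18945,
}

# Missing count = total rows - non-null count
missing = {col: total_rows - count for col, count in non_null.items()}

# Missing share as a percentage of all rows
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}

for col in missing:
    print(f"{col:<18} {missing[col]:>5}  ({missing_pct[col]}%)")
```

Seeing the shares helps when deciding whether dropping rows is affordable or whether a column needs a gentler treatment.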

In [66]:
# Handle missing values by dropping rows with missing data (for simplicity)
data_cleaned = data.dropna()

# Verify there are no missing values
data_cleaned.isnull().sum()


title                 0
type                  0
genres                0
releaseYear           0
imdbId                0
imdbAverageRating     0
imdbNumVotes          0
availableCountries    0
dtype: int64
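
Dropping every incomplete row is the simplest option, but it is not the only one. As a hedged alternative that this notebook does not use, the sketch below drops only the rows that lack the prediction target and fills the remaining numeric gaps with each column's median. The toy values are hypothetical and only mimic the dataset's column names.

```python
import pandas as pd

# Hypothetical toy frame that reuses three of the dataset's column names
toy = pd.DataFrame({
    "releaseYear": [2000.0, None, 2010.0, 2020.0],
    "imdbNumVotes": [100.0, 200.0, None, 400.0],
    "imdbAverageRating": [7.0, 6.5, 8.0, None],
})

# Rows without a target value cannot be used for supervised training
imputed = toy.dropna(subset=["imdbAverageRating"]).copy()

# Fill the remaining numeric gaps with the per-column median
feature_cols = ["releaseYear", "imdbNumVotes"]
medians = imputed[feature_cols].median()
imputed[feature_cols] = imputed[feature_cols].fillna(medians)

print(imputed)
```

Median imputation keeps more rows at the cost of injecting estimated values, so whether it beats `dropna()` depends on how much data is missing and in which columns.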

In [67]:
# Check for duplicates
# (wrapped in print() because a notebook only displays a cell's last expression)
print(data_cleaned.duplicated().sum())

# Drop duplicates if any
data_cleaned = data_cleaned.drop_duplicates()
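
The cells above handle missing values and duplicates but leave the data types as loaded: `releaseYear` and `imdbNumVotes` are stored as `float64` even though they hold whole numbers. A minimal sketch of correcting the types, shown on a toy frame built from the first three rows of `data.head()`, could look like this. It assumes missing values have already been removed, since `int64` cannot hold `NaN`; converting `type` to a categorical is our own suggestion.

```python
import pandas as pd

# First three rows from data.head(), reduced to the columns of interest
toy = pd.DataFrame({
    "releaseYear": [1997.0, 2003.0, 2005.0],
    "imdbNumVotes": [520476.0, 1232113.0, 213209.0],
    "type": ["movie", "movie", "movie"],
})

# Whole-number columns become integers; the low-cardinality text column becomes categorical
typed = toy.astype({
    "releaseYear": "int64",
    "imdbNumVotes": "int64",
    "type": "category",
})

print(typed.dtypes)
```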


## Feature engineering 
### One-Hot Encoding (Modify Features)

One-hot encoding converts a categorical variable into binary indicator columns that machine learning algorithms can consume directly. In this notebook, we apply it to the `genres` column of the dataset. Because a title can belong to several genres at once, a row may contain more than one 1. The process involves the following steps:

1. **Splitting the Genres**: The `genres` column contains multiple genres separated by commas. We split each string on the `, ` separator to obtain the individual genres.
2. **Creating Binary Columns**: For each unique genre, a new column is created with binary values (0 or 1). A value of 1 indicates the presence of that genre for a particular title, while a value of 0 indicates its absence.
3. **Concatenating with Original Data**: The newly created binary columns are concatenated with the original dataframe, and the original `genres` column is dropped.

This transformation gives the machine learning model numeric inputs: each genre becomes a 0/1 feature that the model can weight independently.
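
To make the three steps concrete, here is a small sketch on a two-row toy frame that reuses titles and genres from `data.head()`. It builds the binary columns by hand and then checks that pandas' `str.get_dummies` produces the same result; the names `toy`, `binary`, and `result` are ours.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["The Fifth Element", "Unforgiven"],
    "genres": ["Action, Adventure, Sci-Fi", "Drama, Western"],
})

# Step 1: split each comma-separated string into a list of genres
genre_lists = toy["genres"].str.split(", ")

# Step 2: one binary column per unique genre (1 = present, 0 = absent)
all_genres = sorted({g for genres in genre_lists for g in genres})
binary = pd.DataFrame(
    {g: [int(g in genres) for genres in genre_lists] for g in all_genres}
)

# Step 3: attach the binary columns and drop the original genres column
result = pd.concat([toy.drop(columns="genres"), binary], axis=1)
print(result)

# The one-line pandas equivalent of steps 1 and 2
encoded = toy["genres"].str.get_dummies(sep=", ")
```

The manual version is only there to show what happens under the hood; in practice the single `str.get_dummies` call is shorter and less error-prone.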

In [68]:
# One-hot encode the genres column
genres_encoded = data_cleaned['genres'].str.get_dummies(sep=', ')

# Concatenate the one-hot encoded genres with the original dataframe
data_cleaned_encoded = pd.concat([data_cleaned, genres_encoded], axis=1)

# Drop the original genres column
data_cleaned_encoded = data_cleaned_encoded.drop('genres', axis=1)

# Display the first few rows of the updated dataframe
data_cleaned_encoded.head()

| | title | type | releaseYear | imdbId | imdbAverageRating | imdbNumVotes | availableCountries | Action | Adventure | Animation | ... | News | Reality-TV | Romance | Sci-Fi | Short | Sport | Talk-Show | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Fifth Element | movie | 1997.0 | tt0119116 | 7.6 | 520476.0 | AT, CH, DE | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Kill Bill: Vol. 1 | movie | 2003.0 | tt0266697 | 8.2 | 1232113.0 | AE, AL, AO, AT, AU, AZ, BG, BH, BY, CI, CM, CZ... | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Jarhead | movie | 2005.0 | tt0418763 | 7.0 | 213209.0 | AD, AE, AG, AO, BH, BM, BS, BZ, CI, CM, CO, CR... | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | Unforgiven | movie | 1992.0 | tt0105695 | 8.2 | 447667.0 | AU, BA, BG, CZ, HR, HU, MD, ME, MK, NZ, PL, RO... | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | Eternal Sunshine of the Spotless Mind | movie | 2004.0 | tt0338013 | 8.3 | 1117918.0 | AD, AE, AG, AL, AO, AR, AU, AZ, BA, BB, BE, BG... | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


## Feature Selection
In this step we select the features that are most relevant to the machine learning model.

The goal of the model is to predict each title's IMDb rating (`imdbAverageRating`).

In [59]:
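# Select the features used for modeling: vote count, release year, and genre indicators
# (note: the Talk-Show column created by the encoding step is not in this list)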
features = ['imdbNumVotes', 'releaseYear', 'Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'News', 'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Thriller', 'War', 'Western']
target = 'imdbAverageRating'

# Create a new dataframe with the selected features
data_selected = data_cleaned_encoded[features + [target]]
data_selected.head()


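| | imdbNumVotes | releaseYear | Action | Adventure | Animation | Biography | Comedy | Crime | Documentary | Drama | ... | News | Reality-TV | Romance | Sci-Fi | Short | Sport | Thriller | War | Western | imdbAverageRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 520476.0 | 1997.0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 7.6 |
| 1 | 1232113.0 | 2003.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 8.2 |
| 2 | 213209.0 | 2005.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 7.0 |
| 3 | 447667.0 | 1992.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8.2 |
| 4 | 1117918.0 | 2004.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 8.3 |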
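
The feature list above was chosen by hand. One simple, hedged way to check how relevant each candidate feature is would be to rank features by the absolute value of their correlation with the target. The sketch below does this on a toy frame whose values are hypothetical; it is not part of the original notebook, and correlation only captures linear relationships.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    "imdbNumVotes": [100.0, 2000.0, 50000.0, 800.0, 120000.0, 300.0],
    "releaseYear": [2019.0, 2005.0, 1999.0, 2021.0, 1994.0, 2015.0],
    "Drama": [0, 1, 1, 0, 1, 0],
    "Horror": [1, 0, 0, 1, 0, 0],
    "imdbAverageRating": [5.1, 7.0, 8.1, 4.8, 8.6, 6.2],
})
target = "imdbAverageRating"

# Pearson correlation of every feature with the target (target itself removed)
correlations = toy.corr()[target].drop(target)

# Rank by strength of the relationship, regardless of sign
ranking = correlations.abs().sort_values(ascending=False)
print(ranking)
```

On the real data the same two lines could be run against `data_selected` to see which genre indicators carry the most signal.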


## Splitting Training and Test Data
To evaluate the model on data it has not seen during training, we split the dataset into training and test sets. The split produces four parts:

- **X_train**: the features (input variables) used to train the model.
- **X_test**: the features used to test the model after it has been trained.
- **y_train**: the target variable (output) for the training rows.
- **y_test**: the target variable for the test rows.

Because the test rows play no part in training, the model's score on them is a more honest estimate of how it will perform on new data.


In [60]:
from sklearn.model_selection import train_test_split

# Define features and target variable
X = data_cleaned_encoded[features]
y = data_cleaned_encoded[target]

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((15156, 26), (3789, 26), (15156,), (3789,))
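
The shapes above can be checked by hand. The two splits add up to 18,945 rows, which is what remains after cleaning. The sketch below reproduces the row counts under the assumption that, for a fractional `test_size`, `train_test_split` rounds the test count up and assigns the remaining rows to training.

```python
import math

n_rows = 18945     # rows after cleaning (15156 + 3789 in the shapes above)
test_size = 0.2

# Assumed rounding rule: the test count is rounded up, training gets the rest
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 15156 3789
```

The second number in each feature shape, 26, is simply the length of the `features` list.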

In [61]:
# Inspect the training features (row count, dtypes, non-null counts)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15156 entries, 20264 to 16848
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   imdbNumVotes  15156 non-null  float64
 1   releaseYear   15156 non-null  float64
 2   Action        15156 non-null  int64  
 3   Adventure     15156 non-null  int64  
 4   Animation     15156 non-null  int64  
 5   Biography     15156 non-null  int64  
 6   Comedy        15156 non-null  int64  
 7   Crime         15156 non-null  int64  
 8   Documentary   15156 non-null  int64  
 9   Drama         15156 non-null  int64  
 10  Family        15156 non-null  int64  
 11  Fantasy       15156 non-null  int64  
 12  History       15156 non-null  int64  
 13  Horror        15156 non-null  int64  
 14  Music         15156 non-null  int64  
 15  Musical       15156 non-null  int64  
 16  Mystery       15156 non-null  int64  
 17  News          15156 non-null  int64  
 18  Reality-TV    15156 non-nul

In [62]:
# Inspect the test features (row count, dtypes, non-null counts)
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3789 entries, 16052 to 14156
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   imdbNumVotes  3789 non-null   float64
 1   releaseYear   3789 non-null   float64
 2   Action        3789 non-null   int64  
 3   Adventure     3789 non-null   int64  
 4   Animation     3789 non-null   int64  
 5   Biography     3789 non-null   int64  
 6   Comedy        3789 non-null   int64  
 7   Crime         3789 non-null   int64  
 8   Documentary   3789 non-null   int64  
 9   Drama         3789 non-null   int64  
 10  Family        3789 non-null   int64  
 11  Fantasy       3789 non-null   int64  
 12  History       3789 non-null   int64  
 13  Horror        3789 non-null   int64  
 14  Music         3789 non-null   int64  
 15  Musical       3789 non-null   int64  
 16  Mystery       3789 non-null   int64  
 17  News          3789 non-null   int64  
 18  Reality-TV    3789 non-null 

In [63]:
# Inspect the training target
y_train.info()

<class 'pandas.core.series.Series'>
Index: 15156 entries, 20264 to 16848
Series name: imdbAverageRating
Non-Null Count  Dtype  
--------------  -----  
15156 non-null  float64
dtypes: float64(1)
memory usage: 236.8 KB


In [64]:
# Inspect the test target
y_test.info()

<class 'pandas.core.series.Series'>
Index: 3789 entries, 16052 to 14156
Series name: imdbAverageRating
Non-Null Count  Dtype  
--------------  -----  
3789 non-null   float64
dtypes: float64(1)
memory usage: 59.2 KB
