*This assignment consists of four sections.*

**Section 1 – Data Collection and Preprocessing [20%]**

• Select a dataset relevant to a predictive modeling task.

• Provide a brief description of the dataset, including feature descriptions and target
variable.

• Perform the following preprocessing steps:
1. Handle any missing values and outliers.
2. Encode categorical variables as needed.
3. Scale features if necessary.
4. Split the data into training and testing sets (e.g., 70% train, 30% test).

**DATASET SOURCE LINK**

As required, I have provided a link to the source for my dataset. It is from Kaggle and the link is listed below:

https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset/data

**Step 1: Load the Dataset**

I load the `data.csv` file into a pandas DataFrame. This dataset contains information about Netflix movies and shows.

To confirm that the file loaded correctly, I then display the first few rows of the dataset.


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Display the first few rows of the DataFrame
print("First few rows of the dataset:")
print(df.head())


ModuleNotFoundError: No module named 'pandas'

**Step 2: Explore the Dataset**

To understand the dataset better, I use the following methods:

`df.info()` gives me a summary of the DataFrame, including the number of entries, columns, and data types.

`df.describe()` provides summary statistics of numerical columns, such as mean, median, and standard deviation.

`df.isnull().sum()` checks for missing values across the dataset, and I print the number of missing values in each column.


In [None]:
# Display basic information about the DataFrame
print("\nBasic Information (df.info()):")
df.info()

# Display summary statistics for numerical columns
print("\nSummary Statistics (df.describe()):")
print(df.describe())

# Check for missing values
print("\nMissing Values (df.isnull().sum()):")
print(df.isnull().sum())


**Step 3: Handle Missing Values**


I identify missing values using `df.isnull().sum()` and then fill them.
For categorical columns like title and genres, I replace missing values with 'Unknown Title' and 'Unknown Genre' respectively.

For numerical columns, I replace missing values with the mean of each column.

In [None]:
# Display rows with missing values
print("\nRows with Missing Values:")
print(df[df.isnull().any(axis=1)])

# Fill missing values for categorical columns without using inplace=True
df['title'] = df['title'].fillna('Unknown Title')
df['genres'] = df['genres'].fillna('Unknown Genre')

# Fill missing values for numerical columns
df.fillna(df.select_dtypes(include=['number']).mean(), inplace=True)


**Step 4: Verify Missing Values Are Handled**


After filling missing values, I use `df.isnull().sum()` again to verify that no missing values remain.

In [None]:
# Check if missing values are filled
print("\nAfter filling missing values (df.isnull().sum()):")
print(df.isnull().sum())


The following results show that the replacement worked as intended.

`title`: All missing values were replaced with 'Unknown Title' (count is now 0).

`genres`: All missing values were replaced with 'Unknown Genre' (count is now 0).

`releaseYear`, `imdbAverageRating`, `imdbNumVotes`: Missing numerical values were successfully filled with the column mean (counts are 0).

`type` and `availableCountries`: These columns had no missing values from the start.

`imdbId`: This column still **has 1,309 missing values** because it is a text column, so the numerical mean fill does not touch it, and my code did not handle it explicitly. I replace these missing values with the placeholder 'Unknown ID'.

**The code is shown below**


In [None]:
df['imdbId'] = df['imdbId'].fillna('Unknown ID')

print(df.isnull().sum())

Once this is done, my DataFrame should have no missing values, which a count of 0 for every column confirms.

This also means that:

 • For `title` and `genres`, missing values were replaced with **Unknown Title** and **Unknown Genre**, respectively. 
 
 • For numerical columns, missing values were replaced with their respective column means using the `fillna` method. 
 
 • The dataset is now **complete**, with no missing values left to handle.
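
To illustrate why `imdbId` was skipped by the numerical fill, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the Netflix dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Toy frame (hypothetical values) with one gap in a text column and one in a numeric column
toy = pd.DataFrame({
    "imdbId": ["tt001", None, "tt003"],
    "imdbNumVotes": [100.0, None, 300.0],
})

# The means are computed for numeric columns only,
# so this Series has no entry for 'imdbId'
numeric_means = toy.select_dtypes(include=["number"]).mean()

# fillna with a Series only fills the columns named in its index
filled = toy.fillna(numeric_means)
remaining = filled.isnull().sum()
print(remaining)

# The text column needs its own placeholder
filled["imdbId"] = filled["imdbId"].fillna("Unknown ID")
print(filled)
```

The numeric gap is filled with the column mean (200.0 here), while the text gap stays until it is filled separately.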

**Step 5: Encode Categorical Data**

I encode the `genres` column as categorical data by converting it into numerical codes using pandas' `astype('category').cat.codes` method.

In [None]:
# Encode categorical columns (example for 'genres')
df['genres'] = df['genres'].astype('category').cat.codes
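
As a small illustration of what `cat.codes` produces, the sketch below runs it on a few made-up genre strings (assumed values, not rows from the dataset). Each distinct string, including each multi-genre combination, gets its own integer, the integers follow alphabetical order of the strings, and a missing value becomes -1.

```python
import pandas as pd

# Hypothetical genre strings, including a multi-genre entry and a missing value
genres = pd.Series(["Drama", "Comedy, Drama", "Comedy", "Drama", None])

as_category = genres.astype("category")
codes = as_category.cat.codes

print(list(as_category.cat.categories))  # the distinct strings, sorted
print(codes.tolist())                    # one integer per row
```

Because the codes are just positions in a sorted list, their numeric order carries no real meaning, which is worth keeping in mind when a model treats them as ordinary numbers.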


**Step 6: Split the Dataset into Training and Testing Sets**


I split the dataset into features (`X`) and target (`y`), where the target is the `imdbAverageRating`. 

Then, I use `train_test_split` to divide the data into 80% training and 20% testing sets.

In [None]:
# Split the dataset into training and testing sets
X = df.drop('imdbAverageRating', axis=1)  # Features
y = df['imdbAverageRating']  # Target

# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the shapes of training and testing data
print("\nTraining and Testing Data Shapes:")
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
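
To show the idea behind an 80/20 hold-out split, here is a minimal pure-Python sketch. The `holdout_split` helper is hypothetical and written only for illustration; it is not how `train_test_split` is implemented internally, although the principle (shuffle, then cut) is the same.

```python
import random

def holdout_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # stand-in for ten dataset rows
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=42`: the same rows end up in the test set on every run.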


**At this point, I've completed the data preprocessing steps, including handling missing values, encoding categorical variables, and splitting the data into training and testing sets.**

**This prepares the dataset for the next steps, such as model selection and evaluation in future sections.**

**Section 2 – Model Selection and Training [25%]**


• Select two machine learning algorithms suitable for the dataset and prediction task (e.g.,
classification or regression).

• For each model, implement the training process in Python, using the training dataset from
Section 1.

*Required Steps:*

1. Define each model and provide a brief explanation of why it is suitable for the task.
2. Train both models and save the trained models for evaluation.
Present code snippets and a brief justification for each model choice. Include any
hyperparameter tuning or optimizations performed.

***In this section, I will select two machine learning algorithms that are suitable for predicting IMDb average ratings, which is a regression task. I will choose Linear Regression and Random Forest Regressor, and then train both models using the training dataset from Section 1.***

1. **Define Each Model and Provide a Brief Explanation**

*Linear Regression*

Why is it suitable?
Linear regression is a simple and well-understood model used for regression tasks. It assumes a linear relationship between the features and the target variable (IMDb rating in this case). Linear regression is easy to implement and interpret, making it a good baseline model for this dataset.
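
To make the "linear relationship" assumption concrete, here is a minimal sketch of an ordinary least squares fit with a single feature. The `fit_line` helper and the data are hypothetical, and `LinearRegression` handles many features at once, but the underlying idea is the same.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

When the true relationship is not a straight line, a fit like this can only approximate it, which is why I treat Linear Regression as a baseline.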

*Random Forest Regressor*

Why is it suitable?
Random Forest is an ensemble method that builds multiple decision trees and combines their predictions. It's particularly useful for datasets with complex patterns, and it can model non-linear relationships between the features and the target. Because it averages many trees, Random Forest is less prone to overfitting than a single decision tree and is often effective on a wide range of datasets.
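
The "combines their predictions" step can be sketched as a simple average. The per-tree predictions below are made up for illustration; in a real forest each tree would be trained on a different bootstrap sample of the data.

```python
from statistics import mean

# Hypothetical rating predictions from five individual trees for three titles
tree_predictions = [
    [6.0, 7.5, 5.0],
    [6.5, 7.0, 5.5],
    [5.5, 8.0, 4.5],
    [6.0, 7.5, 5.0],
    [6.0, 7.5, 5.0],
]

# For regression, the forest prediction for each title is the mean over the trees
forest_prediction = [mean(per_title) for per_title in zip(*tree_predictions)]
print(forest_prediction)
```

Individual trees that are too high or too low tend to cancel out in the average, which is where the stability of the ensemble comes from.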

**Step 1: Library Imports and Data Loading**

After importing the required libraries, I started by loading the dataset from a CSV file (`data.csv`) using pandas. I then displayed the first few rows and the column names to get an initial understanding of the data's structure. This helped me identify key features and check for any obvious issues, and it also lets anyone else see what the dataset looks like.

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Load the dataset
url = "data.csv" 
df = pd.read_csv(url)

# Displaying the first few rows to understand the data structure
print(df.head())

# Checking the column names in your dataset
print(df.columns)

**Step 2: Handling Missing Values**

Due to the nature of the dataset, there are some missing values. For the models to work properly, every feature must be numerical and free of missing values, so I needed to either fill the missing values or drop the affected column. I filled the missing values in text columns such as `title`, `genres`, `imdbId` and `availableCountries` with the placeholder values **Unknown Title**, **Unknown Genre**, **Unknown IMDb ID** and **Unknown Country**.

For numerical columns, I used `df.fillna()` to replace missing values with the mean of the respective column.

Finally, I verified that all missing values were handled, and there are no longer any missing entries in the dataset.

In [None]:
## Checking for missing values
print("\nMissing Values (df.isnull().sum()):")
print(df.isnull().sum())

# Filling missing values for categorical columns
df['title'] = df['title'].fillna('Unknown Title')
df['genres'] = df['genres'].fillna('Unknown Genre')
df['availableCountries'] = df['availableCountries'].fillna('Unknown Country')
df['imdbId'] = df['imdbId'].fillna('Unknown IMDb ID')

# Filling missing values for numerical columns
df.fillna(df.select_dtypes(include=['number']).mean(), inplace=True)

# Checking if missing values are filled
print("\nAfter filling missing values (df.isnull().sum()):")
print(df.isnull().sum())

My dataset is now free of missing values, and I'm ready to proceed with my analysis and model training.

**Step 3: Data Preprocessing**

I then performed one-hot encoding on the `genres` and `availableCountries` columns to convert them into numerical values, and used label encoding (`LabelEncoder`) for the `type` column.

As mentioned, one-hot encoding converts these text fields into numerical 0/1 columns, which the models require and which avoids errors during training.

I also had to drop `title` and `imdbId`, along with the original `genres` and `availableCountries` columns, as they were either unnecessary for prediction or not numerical.

I also verified that all features were numeric for model compatibility.

In [None]:
# One-hot encoding for 'genres'
df = df.join(df['genres'].str.get_dummies(sep=', '))

# One-hot encoding for 'availableCountries'
df = df.join(df['availableCountries'].str.get_dummies(sep=', '))

# Dropping the original 'genres' and 'availableCountries' columns (now one-hot encoded) and the 'imdbId' identifier
df.drop(columns=['genres', 'availableCountries','imdbId'], inplace=True)

# Encoding the 'type' column as integer labels using LabelEncoder
le = LabelEncoder()
df['type'] = le.fit_transform(df['type'])

# Dropping 'title', which is free text and not used for prediction
df.drop(columns=['title'], inplace=True)

# Final check for data types to ensure everything is numeric
print("\nData types of features after encoding (df.dtypes):")
print(df.dtypes)
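
To show what `str.get_dummies(sep=', ')` does to a multi-value column, here is a minimal sketch on three made-up genre strings (assumed values, not rows from the dataset). Each individual genre becomes its own 0/1 column, so a title with two genres gets a 1 in two columns.

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated style as the dataset
genres = pd.Series(["Drama", "Comedy, Drama", "Action, Comedy"])

# Split on ', ' and create one indicator column per distinct genre
dummies = genres.str.get_dummies(sep=", ")
print(dummies)
```

Unlike integer codes, these indicator columns do not impose any ordering on the genres, at the cost of adding one column per distinct value.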

**Step 4: Splitting the Dataset**

I split the dataset into features (X) and the target (y), and then split these into training and testing sets using an 80/20 ratio. The split is random, so the test set should be broadly representative of the full dataset, and the fixed `random_state` makes it reproducible.

In [None]:
# Separating the features from the target
X = df.drop('imdbAverageRating', axis=1)  # Features
y = df['imdbAverageRating']  # Target

# Splitting the dataset into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


**Step 5: Training and Evaluating the Linear Regression Model**

I trained a Linear Regression model using the training data and tested its performance on the test dataset. To evaluate how well the model performed, I used the Mean Squared Error (MSE) and R-squared (R²) metrics.

In [None]:
# Defining the Linear Regression model
linear_reg = LinearRegression()

# Training the model
linear_reg.fit(X_train, y_train)

# Predicting using the test set
y_pred_lr = linear_reg.predict(X_test)

# Evaluating the model
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

# Printing results for Linear Regression
print("Linear Regression Model Results:")
print(f"Mean Squared Error: {mse_lr}")
print(f"R-squared: {r2_lr}")

The output of my Linear Regression model gives two key metrics: **Mean Squared Error (MSE)** and **R-squared**. Here's what they suggest:

1. **Mean Squared Error (MSE): 0.7876**

MSE measures how well the model's predictions match the actual values. It calculates the average of the squared differences between predicted and actual values.

A *lower* MSE indicates *better* performance, as the model's predictions are closer to the actual values. In my case, the MSE is 0.7876, which corresponds to a root mean squared error of about 0.89 rating points. This suggests that while the model is performing reasonably, there's still significant error in its predictions.


2. **R-squared: 0.2784**

R-squared (R²) is a statistical measure of how well the independent variables explain the variation in the dependent variable. In simple terms, it tells you how much of the variability in the target variable is captured by your model.

An R-squared value of 0.2784 means that **approximately 27.84% of the variability in the target variable is explained by the model**. This is relatively low, which suggests that the model is not capturing much of the relationship between the features and the target variable.

An R² value closer to 1 would indicate that the model is doing a good job of explaining the target variable. An R² closer to 0 suggests that the model is not effective at predicting the target based on the features.

**In summary, my linear regression model isn't performing great, but this is a useful starting point to see how the features interact with the target.**
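
The two metrics described above can be computed directly from their definitions. The sketch below uses made-up ratings rather than my model's output, and the helper names `mse` and `r_squared` are only for illustration; `mean_squared_error` and `r2_score` compute the same quantities.

```python
def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [6.0, 7.0, 8.0, 5.0]     # hypothetical true ratings
predicted = [6.5, 6.5, 7.0, 6.0]  # hypothetical model predictions

print(mse(actual, predicted))
print(r_squared(actual, predicted))

# A model that always predicts the mean rating scores an R² of exactly 0
baseline = [sum(actual) / len(actual)] * len(actual)
print(r_squared(actual, baseline))
```

This also shows why an R² near 0 is a weak result: it is what a model gets by ignoring the features and always predicting the average rating.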



**Step 6: Training and Evaluating the Random Forest Regressor**

I then trained a Random Forest Regressor model using the training dataset. To assess its performance, I evaluated the predictions on the test dataset using Mean Squared Error (MSE) and R-squared (R²) metrics.

In [None]:
# Defining the Random Forest Regressor with 100 trees
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Training the model
rf_model.fit(X_train, y_train)

# Predicting using the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluating the model (metrics were imported in Step 1)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Printing results for Random Forest
print("Random Forest Regressor Results:")
print(f"Random Forest Mean Squared Error: {mse_rf}")
print(f"Random Forest R-squared: {r2_rf}")


*The results suggest several things.*

First of all, the **MSE** for the Random Forest model is *lower than the MSE from the Linear Regression model (0.6388 vs. 0.7876)*. This indicates that the *Random Forest model is making smaller errors in its predictions, suggesting it's a better fit for this dataset compared to Linear Regression*.

**R-squared**: The R-squared value of 0.4147 means that *about 41.47% of the variability in the target variable is explained by the model*. This is a **significant improvement over the Linear Regression model’s R-squared of 0.2784 (27.84%)**. It indicates that the Random Forest model is better at capturing the relationships between the features and the target variable.
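
To put the two reported MSE values on the same scale as the ratings, the short calculation below converts them to RMSE and works out the relative reduction in MSE. It only uses the two figures quoted above; the variable names are my own.

```python
import math

mse_linear = 0.7876  # reported MSE for Linear Regression
mse_forest = 0.6388  # reported MSE for Random Forest

# RMSE is in rating points, which is easier to interpret than squared points
rmse_linear = math.sqrt(mse_linear)
rmse_forest = math.sqrt(mse_forest)

# Fraction by which the Random Forest reduces the MSE
relative_reduction = (mse_linear - mse_forest) / mse_linear

print(f"RMSE, Linear Regression: {rmse_linear:.3f}")
print(f"RMSE, Random Forest: {rmse_forest:.3f}")
print(f"Relative MSE reduction: {relative_reduction:.1%}")
```

This works out to a typical error of roughly 0.89 rating points for Linear Regression against roughly 0.80 for Random Forest, an MSE reduction of about 19%.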

**In conclusion, switching to the Random Forest Regressor model has clearly improved the performance, and it's a more suitable model for this task compared to Linear Regression.**



As you can see, I have now **completed section 2**

**Section 3 – Prediction and Evaluation [30%]**

Use the trained models from Question 2 to generate predictions on the test dataset.

Evaluate each model’s performance using appropriate metrics:

For classification tasks, report metrics such as accuracy, precision, recall, and F1-
score.

For regression tasks, report metrics such as Mean Absolute Error (MAE), Root
Mean Square Error (RMSE), or R-squared (R²).

Compare the models’ performances and discuss which model performs better and why.

Include Python code for generating predictions and calculating performance metrics. Interpret
and compare the results for each model

The code below uses the **Linear Regression** and **Random Forest** models to predict the `imdbAverageRating` values for the test set.
It then calculates:

1. **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual values.


2. **Root Mean Square Error (RMSE)**: Measures the square root of the average squared difference between predicted and actual values.

3. **R-squared (R²)**: Indicates the proportion of variance in the target variable explained by the model.

In [None]:
from sklearn.metrics import mean_absolute_error

# Generating Predictions
y_pred_lr = linear_reg.predict(X_test)  # Predictions from Linear Regression
y_pred_rf = rf_model.predict(X_test)   # Predictions from Random Forest Regressor

#Evaluating Metrics for Linear Regression
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = mse_lr ** 0.5
r2_lr = r2_score(y_test, y_pred_lr)

print("\nLinear Regression Evaluation:")
print(f"Mean Absolute Error (MAE): {mae_lr}")
print(f"Root Mean Square Error (RMSE): {rmse_lr}")
print(f"R-squared (R²): {r2_lr}")

# Evaluation for Random Forest
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = mse_rf ** 0.5
r2_rf = r2_score(y_test, y_pred_rf)

print("\nRandom Forest Regressor Evaluation:")
print(f"Mean Absolute Error (MAE): {mae_rf}")
print(f"Root Mean Square Error (RMSE): {rmse_rf}")
print(f"R-squared (R²): {r2_rf}")
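
To make the difference between MAE and RMSE concrete, here is a minimal sketch on made-up predictions (not my model's output). Both sets of predictions have the same MAE, but the one containing a single large miss has a higher RMSE, because squaring penalises large errors more heavily.

```python
import math

def mae(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [6.0, 6.0, 6.0, 6.0]         # hypothetical true ratings
even_errors = [7.0, 5.0, 7.0, 5.0]    # every prediction is off by 1
one_big_miss = [6.0, 6.0, 6.0, 10.0]  # three exact, one off by 4

print(mae(actual, even_errors), rmse(actual, even_errors))
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))
```

Reporting both metrics therefore says something about whether a model's errors are spread evenly or concentrated in a few badly predicted titles.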

Judging from the results, it is clear that the **Random Forest Regressor** outperformed my **Linear Regression** model.

One of the key points to take from the results is that the **Random Forest model** has a significantly lower *Mean Absolute Error (MAE)* and *Root Mean Square Error (RMSE)*.

This means the **Random Forest** is making more accurate predictions compared to Linear Regression, with smaller average errors and deviations from the true values.

As well as that, the *R² value* for **Random Forest** is 0.4318, which is notably higher than Linear Regression's 0.2830.

This indicates that the Random Forest model explains a larger proportion of the variance in the target variable (`imdbAverageRating`) than Linear Regression, so its predictions track the actual ratings more closely.



To make this clearer, the code below prints the two models' results side by side and states which model performs better based on R².

In [None]:
# 4. Compare Results
print("\nPerformance Comparison:")
print("Linear Regression:")
print(f"  - MAE: {mae_lr}, RMSE: {rmse_lr}, R²: {r2_lr}")
print("Random Forest Regressor:")
print(f"  - MAE: {mae_rf}, RMSE: {rmse_rf}, R²: {r2_rf}")

if r2_rf > r2_lr:
    print("\nThe Random Forest Regressor outperforms the Linear Regression model based on R².")
else:
    print("\nThe Linear Regression model performs better based on R².")

**Section 3 is now completed.** 

**Section 4 – Visualization and Insights [25%]**

Visualize key aspects of your machine learning project to help understand model
performance and data distribution.

Required Visualizations:
1. For *classification* tasks, provide a confusion matrix or ROC curve.
2. For *regression tasks*, plot predicted values against actual values.
3. Visualize feature importance (if applicable) to understand which features
contribute most to predictions.

Include code for each visualization and describe the insights gained from these visualizations.
Summarize your findings and observations based on the entire workflow.

My scenario is a **regression** task, so, as mentioned, I will plot predicted values against actual values.

To start, the code below imports the necessary libraries and creates the plots for **Linear Regression** and **Random Forest**.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot for Linear Regression
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred_lr, alpha=0.6, label='Linear Regression')
sns.lineplot(x=y_test, y=y_test, color='red', label='Ideal Predictions')
plt.title("Linear Regression: Predicted vs. Actual")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.legend()
plt.show()

# Plot for Random Forest
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred_rf, alpha=0.6, label='Random Forest Regressor')
sns.lineplot(x=y_test, y=y_test, color='red', label='Ideal Predictions')
plt.title("Random Forest Regressor: Predicted vs. Actual")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.legend()
plt.show()

As intended, the code generated a scatter plot for each model. The plots show several things, which I break down below.

**The Predicted/Actual Values**:

**Blue Dots**: These represent the predictions made by the model shown in each plot (Linear Regression in the first plot, Random Forest in the second).

**X-axis**: This axis represents the actual values of the target variable, which are the true values from my test data.

**Y-axis**: This axis shows the predicted values from my model for each corresponding actual value.

**Ideal Predictions (Red Line)**: 

The red diagonal line represents ideal predictions, where the predicted values perfectly match the actual values. Points closer to this line signify more accurate predictions.

**Scatter Analysis**:

The scatter of points around the red line is tighter for Random Forest than for Linear Regression, which is consistent with Random Forest's ability to model non-linear relationships.

**This suggests that the Random Forest Regressor is capturing complex patterns in the data, leading to improved predictive performance**.
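
How tightly the points cluster around the red line can also be expressed as a number. The sketch below uses a hypothetical helper, `within_tolerance`, and made-up values to compute the share of predictions that land within half a rating point of the actual value; it is an illustration of the idea, not a result from my models.

```python
def within_tolerance(actual, predicted, tolerance=0.5):
    """Share of predictions whose distance from the ideal line is at most `tolerance`."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)

actual = [6.0, 7.0, 8.0, 5.0]  # hypothetical true ratings
tight = [6.2, 6.8, 7.6, 5.3]   # points close to the ideal line
loose = [6.4, 6.0, 6.5, 6.5]   # points scattered further away

print(within_tolerance(actual, tight))
print(within_tolerance(actual, loose))
```

Applied to `y_test` and each model's predictions, a figure like this would give a numerical counterpart to the visual impression from the scatter plots.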


**Short Summary** 

From the two scatter plots, we can see how well each model predicts the target variable by comparing the predicted values against the actual values.

In the **Linear Regression plot**, the points are more scattered, especially at the higher end of the actual values, which shows that the model struggles to make accurate predictions. The predicted values deviate more significantly from the actual values, showing that the model isn't capturing the patterns well.

On the other hand, the **Random Forest Regressor** plot shows a tighter clustering of points around the red prediction line, which shows that it makes more accurate predictions overall. The Random Forest model seems to better capture the relationship between the features and the target variable, leading to fewer errors.

Overall, based on these visualizations, the ***Random Forest Regressor performs better than the Linear Regression model***, as it shows more precise predictions with less scatter. This is reflected in the higher R² score and lower error metrics for the Random Forest model as well.
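
Feature importance can be read from a fitted Random Forest through its `feature_importances_` attribute. The sketch below is a minimal example on synthetic data, not the Netflix dataset: the feature names are made up, and the target is deliberately built from the first feature only so that the expected ranking is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption for illustration): only the first feature drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# Importances sum to 1; a larger value means the feature was more useful for splitting
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The same pattern should carry over to `rf_model` with `X_train.columns` as the names, and the ranked values could then be drawn as a horizontal bar chart with `plt.barh`.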






