
# QUESTION (CT-1 - 2):
Assume you are working on a recommendation system where you need to suggest movies to users based on their
past ratings. You decide to use the KNN algorithm for this task.

• Discuss the importance of the value of 'k' in the KNN algorithm and its impact on the model’s performance.

• Write a Python code snippet using sklearn to implement a KNN-based recommendation system, including the selection of the optimal 'k' value.

• Describe the steps involved in cross-validation and how you would evaluate the performance of the
recommendation system.

# 1. LOAD THE DATASET

In [None]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the dataset
data = pd.read_csv("https://raw.githubusercontent.com/shreyaswankhede/IMDb-Web-Scraping-and-Sentiment-Analysis/master/imdbmovies.csv", delimiter=';')
data.head()

Unnamed: 0,IMDBID,Title,Genre,Year,URL,Audience_Rating,Critic_Rating,Budget_In_Millions,User_Review,Polarity
0,tt0035423,Kate & Leopold,"Comedy, Fantasy, Romance",2001,https://m.media-amazon.com/images/M/MV5BNmNlN2...,6.4,4.4,48.0,"Funny lines, bad plot, one hot actor",Negative
1,tt0091288,Jean de Florette,Drama,1987,https://m.media-amazon.com/images/M/MV5BMTgxND...,8.1,0.0,0.0,Escape to a rural idyll from another age,Positive
2,tt0108148,Iron Monkey,"Action, Crime, Drama",2001,https://m.media-amazon.com/images/M/MV5BYjJmMG...,7.6,7.9,11.0,"Very enjoyable, well done action fest with a f...",Positive
3,tt0118589,Glitter,"Drama, Music, Romance",2001,https://m.media-amazon.com/images/M/MV5BMGJkN2...,2.2,1.4,22.0,Oh the pain.... Mariah believes she is Streisa...,Negative
4,tt0118926,The Dancer Upstairs,"Crime, Drama, Thriller",2003,https://m.media-amazon.com/images/M/MV5BODE2OD...,7.0,6.4,0.0,A very interesting movie with a semi-political...,Positive


# 2. PREPROCESS THE DATASET

In [None]:
# Dropping rows with missing values
data.dropna(inplace=True)

In [None]:
# Encode the categorical columns as integers.
# LabelEncoder numbers classes in sorted order, so Polarity becomes Negative=0, Positive=1
# (assuming those are the only two labels in the column).
data['Polarity'] = LabelEncoder().fit_transform(data['Polarity'])
# Genre is encoded for completeness but is not used as a feature below
data['Genre'] = LabelEncoder().fit_transform(data['Genre'])
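
The integer codes produced by `LabelEncoder` can be checked rather than assumed. The sketch below uses a small hypothetical list of polarity labels (not the actual column) to show how the `classes_` attribute reveals the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of polarity labels, for illustration only
sample_polarity = ["Positive", "Negative", "Positive", "Negative"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_polarity)

# classes_ holds the original labels in sorted order; each label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'Negative': 0, 'Positive': 1}
print(encoded.tolist())  # [1, 0, 1, 0]
```

Keeping a reference to the fitted encoder (instead of calling `LabelEncoder()` inline) also makes `inverse_transform` available later for turning predictions back into label names.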

In [None]:
# Selecting features for KNN - These are numerical features that will help in predicting the target
X = data[['Audience_Rating', 'Critic_Rating', 'Budget_In_Millions']].values

In [None]:
# Target variable
y = data['Polarity'].values

In [None]:
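# Standardize the features (important for KNN, which is distance-based).
# Note: fitting the scaler on the full dataset lets test rows influence the
# scaling statistics; fitting it on the training split only avoids that leakage.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)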

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
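
The cells above scale the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. One possible alternative, sketched here on synthetic data from `make_classification` (an assumption standing in for the IMDb features), is to split first and put the scaler inside a `Pipeline`, so that each cross-validation fold fits its own scaler on training rows only. The names `X_demo`, `pipe`, and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three rating/budget features (not the IMDb data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is refit on the training portion of every CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
search = GridSearchCV(
    pipe, {"knn__n_neighbors": list(range(1, 31))}, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

best_k_pipe = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_te, y_te)
print("Best k:", best_k_pipe)
print(f"Held-out accuracy: {test_acc:.3f}")
```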

# 3. TRAIN THE MODEL

In [None]:
# Implementing KNN with Cross-Validation to find the best 'k' value
knn = KNeighborsClassifier()

# Defining the parameter grid for GridSearchCV (searching for optimal 'k')
param_grid = {'n_neighbors': list(range(1, 31))}

# Using GridSearchCV to find the best 'k' based on cross-validation
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Getting the best k value
best_k = grid_search.best_params_['n_neighbors']
print(f"Optimal value of k: {best_k}")

# Train the KNN model with the best k value
# (GridSearchCV refits by default, so grid_search.best_estimator_ is an equivalent model)
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train, y_train)

Optimal value of k: 26
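
To make the steps behind `cv=5` explicit, the sketch below reproduces the cross-validation loop by hand on synthetic data (an assumption, since the point is the procedure and not the IMDb results). For each candidate k the training data is split into five stratified folds, the model is fit on four of them, scored on the held-out fold, and the five scores are averaged. A small k tends to follow noise in individual neighbours, while a large k smooths the decision towards the majority class, so the mean score per k is what the choice is based on. The helper `cv_accuracy` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the IMDb features)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

def cv_accuracy(k, X, y, n_splits=5):
    """Mean held-out accuracy of a k-NN classifier over stratified folds."""
    folds = StratifiedKFold(n_splits=n_splits)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X[train_idx], y[train_idx])                 # fit on 4 folds
        scores.append(model.score(X[val_idx], y[val_idx]))    # score on the 5th
    return float(np.mean(scores))

# Evaluate every candidate k and keep the one with the highest mean accuracy
cv_scores = {k: cv_accuracy(k, X_demo, y_demo) for k in range(1, 31)}
best_k_demo = max(cv_scores, key=cv_scores.get)

print(f"Best k on the synthetic data: {best_k_demo}")
print(f"Mean CV accuracy at that k: {cv_scores[best_k_demo]:.3f}")
```

In the notebook itself, the same per-k scores are available from `grid_search.cv_results_['mean_test_score']`.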


# 4. TEST THE MODEL

In [None]:
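# Evaluate the model on the test set
y_pred = knn_best.predict(X_test)
# For 0/1 class labels, MSE equals the misclassification rate (1 - accuracy),
# so it adds no information beyond accuracy here
mse = mean_squared_error(y_test, y_pred)
accuracy = knn_best.score(X_test, y_test)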

print(f"Mean Squared Error: {mse}")
print(f"Accuracy: {accuracy * 100:.2f}%")

Mean Squared Error: 0.4558951965065502
Accuracy: 54.41%
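
The two numbers above sum to 1 (0.4559 + 0.5441), which is what one would expect when mean squared error is applied to 0/1 class labels: each wrong prediction contributes exactly 1. Classification metrics such as a confusion matrix or per-class precision and recall are usually more informative. The sketch below uses small hypothetical label arrays (not the notebook's predictions) to show both points.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error,
)

# Hypothetical true labels and predictions, for illustration only
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 1, 0])

acc = accuracy_score(y_true_demo, y_pred_demo)
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
print(acc, mse_demo)  # 0.625 0.375

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class precision, recall and F1; the class names are assumed
print(classification_report(
    y_true_demo, y_pred_demo, target_names=["Negative", "Positive"]
))
```

An accuracy close to 50% on a two-class problem should also be compared against the share of the majority class in `y_test`, since a model that always predicts that class sets the baseline.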
