# IMDB
IMDb is the world's most popular and authoritative source for movie, TV and celebrity content. Find ratings and reviews for the newest movies and TV shows.

Data Source: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

<img src="imdb1.jpg" alt="drawing" width="1000" height="100" align = "left">

# Objectives

- Perform feature engineering, then clean, wrangle, and tidy the data and save the new dataset to a .csv file. The new file will be used in machine learning models (KNN and Decision Tree).




Data Dictionary:
- `Poster_Link` - Link to the poster that IMDb uses
- `Series_Title` - Name of the movie
- `Released_Year` - Year in which the movie was released
- `Certificate` - Certificate earned by the movie
- `Runtime` - Total runtime of the movie
- `Genre` - Genre of the movie
- `IMDB_Rating` - Rating of the movie on the IMDb site
- `Overview` - Mini story / summary
- `Meta_score` - Score earned by the movie
- `Director` - Name of the director
- `Star1`, `Star2`, `Star3`, `Star4` - Names of the stars
- `No_of_Votes` - Total number of votes
- `Gross` - Money earned by the movie

# Import Packages and Load Data

In [71]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

In [72]:
# load the dataset and preview the first two rows
df = pd.read_csv('imdb_top_1000.csv')
print(df.shape)
df.head(2)

(1000, 16)


| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
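
The later cells work on a cleaned frame called `df_new`, whose construction is not shown in this section. As a rough sketch only, the cleaning step from the objectives might look like the code below. The helper name `clean_imdb`, the toy rows, the output file name `imdb_clean.csv`, and the choice to drop incomplete rows are all assumptions made for illustration, not the notebook's actual steps.

```python
import pandas as pd

# Toy rows shaped like the preview; all values are made up for illustration.
raw = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C"],
    "Released_Year": ["1994", "1972", "PG"],   # a non-numeric year, to show coercion
    "Runtime": ["142 min", "175 min", "96 min"],
    "Meta_score": [80.0, 100.0, 65.0],
    "Gross": ["28,341,469", "134,966,411", None],
})

NUMERIC_COLS = ["Released_Year", "Runtime", "Meta_score", "Gross"]

def clean_imdb(df):
    """Hypothetical helper: turn text columns into numbers and drop incomplete rows."""
    out = df.copy()
    # "142 min" -> 142
    out["Runtime"] = out["Runtime"].str.replace(" min", "", regex=False).astype(int)
    # anything that is not a number becomes NaN
    out["Released_Year"] = pd.to_numeric(out["Released_Year"], errors="coerce")
    # strip thousands separators if present, then convert
    gross_text = out["Gross"].astype(str).str.replace(",", "", regex=False)
    out["Gross"] = pd.to_numeric(gross_text, errors="coerce")
    # keep only rows where every numeric column is available
    out = out.dropna(subset=NUMERIC_COLS).reset_index(drop=True)
    out["Released_Year"] = out["Released_Year"].astype(int)
    return out

df_new = clean_imdb(raw)
df_new.to_csv("imdb_clean.csv", index=False)  # assumed file name
print(df_new.dtypes)
```

Dropping incomplete rows is the simplest option; filling missing `Meta_score` or `Gross` values with a median would keep more rows at the cost of some noise.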


# Feature Selection

In [None]:
plt.figure(figsize=(18, 13)) 
sns.heatmap(data = np.round(df_new.corr(),2), annot = True)
plt.title('Correlation Heatmap')
plt.show()

## K Nearest Neighbor

We will classify each movie as good or bad.

First, let's create a new column that is 1 if IMDB_Rating is 8 or higher and 0 otherwise.

In [None]:
# 1 if IMDB_Rating is 8.0 or higher, otherwise 0
# (a single boolean comparison avoids pandas chained-assignment warnings)
df_new['good_bad'] = (df_new['IMDB_Rating'] >= 8.0).astype(int)

In [None]:
# checking proportion of good and bad movies
df_new['good_bad'].value_counts()

#### Visualize K values

In [None]:
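# plot test accuracy for k = 1..100
# requires x_train, x_test, y_train, y_test from the split and scaling cells further down
accuracies = []
k_list = range(1, 101)

for k in k_list:
    classifier = KNeighborsClassifier(n_neighbors = k)
    classifier.fit(x_train, y_train)
    accuracies.append(classifier.score(x_test, y_test))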

plt.figure(figsize=(15, 5)) 

ax1 = plt.subplot(1,2,1)
plt.plot(k_list, accuracies)
plt.title('K Values Accuracies')
plt.xlabel('K values range')
plt.ylabel('Accuracy')

ax2 = plt.subplot(1,2,2)
plt.plot(k_list, accuracies)
plt.axvline(13, color = 'r', linestyle = '--', label = 'K peak value')
plt.legend()
plt.xlim(10,19)
plt.title('Zoom in')
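
The loop above scores every k on the test set, so the accuracy reported for the chosen k tends to be optimistic. A common alternative is to pick k by cross-validation on the training data only and keep the test set for one final check. The sketch below shows this with `GridSearchCV` on synthetic data from `make_classification`. The data, the k range of 1 to 50, and the variable names are assumptions for illustration, not results from the IMDb frame.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for six numeric movie features and a 0/1 label.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Putting the scaler inside the pipeline refits it on each training fold,
# so no statistics from the validation fold leak into the scaling step.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 51))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # test set used once, at the end
print(best_k, round(test_accuracy, 3))
```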

In [None]:
# classifier instance
classifier = KNeighborsClassifier(n_neighbors = 13)

In [None]:
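# define training points (features) and training labels
# caution: good_bad is derived from IMDB_Rating, so keeping IMDB_Rating as a feature leaks the label
training_points = df_new[['Gross', 'Runtime', 'Released_Year', 'Meta_score', 'IMDB_Rating', 'No_of_Votes']]
training_labels = df_new['good_bad']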

In [None]:
# split into training (80%) and test (20%) sets
x_train, x_test, y_train, y_test = train_test_split(training_points, training_labels,  test_size = 0.2, random_state = 42)

In [None]:
# standardize the features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
# fitting training data
classifier.fit(x_train, y_train)

# Testing our model

In [None]:
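# use movies that are not in our dataset, so a sample cannot be its own nearest neighbor
# sample movies, in the same column order as training_points:
# Gross, Runtime, Released_Year, Meta_score, IMDB_Rating, No_of_Votes
Spiderman = [[1800000000, 148, 2021, 71, 83, 697614]]
badmovie_sample = [[180000, 3, 1920, 7, 7, 5000]]

# scale the samples with the scaler fitted on the training data
Spiderman = scaler.transform(Spiderman)
badmovie_sample = scaler.transform(badmovie_sample)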

In [None]:
# Test result
print(classifier.predict(Spiderman))
print(classifier.predict(badmovie_sample))

# K-Nearest Neighbor Regressor
Instead of classifying a new movie as either good or bad, we are now going to predict its IMDb rating as a real number.
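
To make the idea concrete, the sketch below computes a distance-weighted KNN prediction by hand and compares it with `KNeighborsRegressor(weights='distance')`. Each of the k nearest neighbors contributes its label with weight 1/distance, so closer neighbors count more. The one-dimensional toy data and the names `X_toy`, `y_toy`, and `query` are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training set: one feature, four movies with made-up ratings.
X_toy = np.array([[0.0], [1.0], [2.0], [4.0]])
y_toy = np.array([7.0, 8.0, 9.0, 7.5])
query = np.array([[1.5]])
k = 3

# Manual prediction: find the k nearest points, weight each by 1 / distance.
distances = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(distances)[:k]
weights = 1.0 / distances[nearest]
manual_prediction = np.sum(weights * y_toy[nearest]) / np.sum(weights)

# Same prediction from scikit-learn.
model = KNeighborsRegressor(n_neighbors=k, weights="distance")
model.fit(X_toy, y_toy)
sklearn_prediction = model.predict(query)[0]

print(manual_prediction, sklearn_prediction)
```

With `weights='uniform'` the prediction would instead be the plain average of the k neighbor labels.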

In [None]:
# KNN regression instance
# weights='distance' gives closer neighbors more influence on the predicted rating
regr = KNeighborsRegressor(n_neighbors = 13, weights = 'distance')

In [None]:
# Set IMDB Rating as our new training label
training_labels2 = df_new['IMDB_Rating']

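# split with the same random_state as before, so the rows match the classifier split
x_train2, x_test2, y_train2, y_test2 = train_test_split(training_points, training_labels2,  test_size = 0.2, random_state = 42)

# scale with the scaler fitted earlier; the sample movies are scaled, so the model must be trained on scaled features too
x_train2 = scaler.transform(x_train2)
x_test2 = scaler.transform(x_test2)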

In [None]:
# Fit training data
regr.fit(x_train2, y_train2)

#### Testing

The regression prediction for the bad-movie sample seems off. Can we do better?

`Pending`
- Use feature selection to improve our regression model

In [None]:
# Note: the Spiderman and badmovie_sample values were already scaled above
print(regr.predict(Spiderman))
print(regr.predict(badmovie_sample))

# Model Evaluation
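- `Mean Squared Error`: Average of the squared differences between the actual and predicted values (lower = better). The cells below report its square root (RMSE), which is in the same units as the target.
- `R2`: The coefficient of determination, i.e. the proportion of the variance in the dependent variable that is explained by the independent variables (higher = better)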
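
As a small worked example of both metrics, the sketch below computes them by hand with NumPy and checks the results against scikit-learn. The four rating pairs are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted ratings.
y_true = np.array([8.0, 7.6, 9.0, 8.4])
y_pred = np.array([7.8, 7.9, 8.6, 8.4])

# Mean squared error: average of the squared residuals (lower is better).
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)  # same units as the rating itself

# R2 = 1 - (residual sum of squares / total sum of squares) (higher is better).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

R2 can be negative when the model predicts worse than always guessing the mean rating.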


### Classification

In [None]:
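# out-of-fold predictions: cross_val_predict refits a copy of the classifier on 5 folds of the test set
y_predict = cross_val_predict(classifier, x_test, y_test, cv=5)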

In [None]:
from math import sqrt
sqrt(mean_squared_error(y_test, y_predict))

In [None]:
r2_score(y_test, y_predict)
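
RMSE and R2 are designed for continuous targets. For a 0/1 label such as `good_bad`, accuracy and a confusion matrix are usually easier to interpret. The sketch below uses made-up labels (`labels_true`, `labels_pred`), not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted good/bad labels (1 = good, 0 = bad).
labels_true = [1, 0, 0, 1, 0, 1, 0, 0]
labels_pred = [1, 0, 1, 1, 0, 0, 0, 0]

# Share of movies classified correctly.
accuracy = accuracy_score(labels_true, labels_pred)

# Rows are the true class, columns the predicted class:
# [[true negatives, false positives],
#  [false negatives, true positives]]
matrix = confusion_matrix(labels_true, labels_pred)

print(accuracy)
print(matrix)
```

In the notebook, the same two calls could be applied to `y_test` and `y_predict`.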

### Regression

In [None]:
y_predict2 = cross_val_predict(regr, x_test2, y_test2, cv=5)
# note: y_test2 is the actual rating while y_predict2 is our predicted IMDb rating

In [None]:
sqrt(mean_squared_error(y_test2, y_predict2))

In [None]:
r2_score(y_test2, y_predict2)

### Can we do better?
Both the classification and regression models have very low scores. Let's do some feature engineering to select better features (training points) for both models.