# Phase 3: Data Analysis & Prediction

## Shreya Kumar (861279837)

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", message="invalid value encountered")
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats

df = pd.read_csv('imdb_1000.csv')
df1 = df[['Title', 'Genre', 'Rating', 'Votes']]
df1.head(20)

| | Title | Genre | Rating | Votes |
|---|---|---|---|---|
| 0 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | 8.1 | 757074 |
| 1 | Prometheus | Adventure,Mystery,Sci-Fi | 7.0 | 485820 |
| 2 | Split | Horror,Thriller | 7.3 | 157606 |
| 3 | Sing | Animation,Comedy,Family | 7.2 | 60545 |
| 4 | Suicide Squad | Action,Adventure,Fantasy | 6.2 | 393727 |
| 5 | The Great Wall | Action,Adventure,Fantasy | 6.1 | 56036 |
| 6 | La La Land | Comedy,Drama,Music | 8.3 | 258682 |
| 7 | Mindhorn | Comedy | 6.4 | 2490 |
| 8 | The Lost City of Z | Action,Adventure,Biography | 7.1 | 7188 |
| 9 | Passengers | Adventure,Drama,Romance | 7.0 | 192177 |


Raw vote counts vary widely from movie to movie, which makes them awkward for computing distances between movies, so we'll create a new DataFrame that contains min-max normalized values of `Votes` and `Rating`. A normalized vote value of 0 corresponds to the movie with the fewest votes in the dataset, and a value of 1 corresponds to the most-voted movie.

In [2]:
popularity = df1[['Votes','Rating']]
movieNormalizedNumRatings = popularity.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head(20)


| | Votes | Rating |
|---|---|---|
| 0 | 0.422474 | 0.873239 |
| 1 | 0.271093 | 0.71831 |
| 2 | 0.087923 | 0.760563 |
| 3 | 0.033755 | 0.746479 |
| 4 | 0.219697 | 0.605634 |
| 5 | 0.031239 | 0.591549 |
| 6 | 0.144331 | 0.901408 |
| 7 | 0.001356 | 0.633803 |
| 8 | 0.003977 | 0.732394 |
| 9 | 0.107216 | 0.71831 |


In [3]:
df.Genre = pd.factorize(df.Genre)[0]

print(df["Year"].min())
print(df["Year"].max())

2006
2016


In [4]:
movieDict = {}
for index, row in df.iterrows():
    name = row.Title
    genre = row.Genre
    votes = movieNormalizedNumRatings.loc[index].get('Votes')
    avgRating = row.Rating
    movieDict[index] = (name,genre,votes,avgRating)
    

## Part 1: Data Analysis

### Fit Model and Produce Intercepts/Coefficients using Linear Regression

In [5]:
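# Replace missing values with 0; assigning back avoids in-place edits on a column selection.
df['Revenue (Millions)'] = df['Revenue (Millions)'].fillna(0)
df['Metascore'] = df['Metascore'].fillna(0)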
#df.isnull().sum()


imdb_train = df.loc[:500].copy()
imdb_test = df.loc[501:].copy()

#imdb_train["log(rating)"] = np.log(imdb_train["Rating"])

imdb_model = LinearRegression()
imdb_model.fit(
    X=imdb_train[["Votes","Revenue (Millions)", "Metascore"]],
    y=imdb_train["Rating"]
)

imdb_model.predict(
    X=imdb_test[["Votes", "Revenue (Millions)", "Metascore"]]
)

print("Intercept: ", imdb_model.intercept_)
print("Coefficients: ", imdb_model.coef_)

Intercept:  5.5926524415057015
Coefficients:  [ 1.90186219e-06 -7.49789883e-04  1.67875397e-02]
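
The predictions for the held-out rows are not compared against the actual ratings here. A minimal sketch of how that comparison could look follows. It uses synthetic data in place of `imdb_1000.csv`, and the generated values, the split point, and the choice of mean absolute error and R² as metrics are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in for the three features (assumption: not the real IMDB data).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(0, 1_000_000, n),  # votes
    rng.uniform(0, 500, n),        # revenue in millions
    rng.uniform(0, 100, n),        # metascore
])
# Hypothetical linear relationship plus noise, only to have something to fit.
y = 5.5 + 2e-6 * X[:, 0] + 0.017 * X[:, 2] + rng.normal(0, 0.3, n)

# Positional split, mirroring the train/test slicing used in the notebook.
split = 140
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)  # keep the predictions so they can be scored

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Test MAE:", round(mae, 3))
print("Test R^2:", round(r2, 3))
```

Scoring every held-out row gives a broader picture of accuracy than checking a single movie by hand.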


### Check Accuracy of Intercept/Coefficients

In [6]:
votes = df["Votes"].loc[3]
revenue = df["Revenue (Millions)"].loc[3]
meta = df["Metascore"].loc[3]
rating = imdb_model.intercept_ + ((imdb_model.coef_[0]*votes)+ (imdb_model.coef_[1]*revenue) + (imdb_model.coef_[2]*meta))

print("The predicted rating for", df["Title"].loc[3], "is", rating, ".")
print("The actual rating for", df["Title"].loc[3], "is", df["Rating"].loc[3], ".")

The predicted rating for Sing is 6.495582327612669 .
The actual rating for Sing is 7.2 .


In [7]:
pearson_coef= stats.pearsonr(df["Votes"], df["Rating"])
pearson_coef1= stats.pearsonr(df["Revenue (Millions)"], df["Rating"])
pearson_coef2= stats.pearsonr(df["Metascore"], df["Rating"])

print("The magnitude of association between Votes and Rating is: ", pearson_coef[0], "\n")
print("The magnitude of association between Revenue (Millions) and Rating is: ", pearson_coef1[0], "\n")
print("The magnitude of association between Metascore and Rating is: ", pearson_coef2[0], "\n")

The magnitude of association between Votes and Rating is:  0.5115373197657553 

The magnitude of association between Revenue (Millions) and Rating is:  0.2517210757661679 

The magnitude of association between Metascore and Rating is:  0.47244562342549046 



### Analysis of Linear Regression Results

The linear regression gives an intercept of about 5.59 and coefficients of 1.90e-06 for Votes, -7.50e-04 for Revenue (Millions), and 1.68e-02 for Metascore. The intercept is the predicted rating when all three features are 0. Taken literally that does not make sense, since a movie with 0 votes, 0 revenue, and a Metascore of 0 cannot have a rating. Still, the intercept serves a purpose: it gives the line of best fit a starting point that every prediction builds on.

The positive coefficients for Votes and Metascore suggest that higher values of these features are associated with higher ratings, while the negative coefficient for Revenue (Millions) suggests that, with the other two features held fixed, higher revenue is associated with a slightly lower rating. Logically, that doesn't make very much sense, which leads us to question whether Revenue (Millions) has any real relation to Rating. To check, I calculated the Pearson correlation coefficient between Revenue (Millions) and Rating. The value is about 0.25, which indicates a weak positive relationship, compared with about 0.51 for Votes and 0.47 for Metascore.
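
A negative regression coefficient alongside a positive Pearson correlation is not contradictory when the predictors are correlated with each other: the coefficient measures the association with the other features held fixed, while Pearson's r ignores them. The sketch below illustrates this on synthetic data. The variable names and the generating equation are hypothetical and say nothing about the actual IMDB columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two correlated predictors (assumption: synthetic, loosely named after the columns).
votes = rng.normal(size=n)
revenue = votes + 0.5 * rng.normal(size=n)

# The outcome depends positively on votes and slightly negatively on revenue.
rating = 1.0 * votes - 0.3 * revenue + 0.1 * rng.normal(size=n)

# Marginal association: Pearson correlation between revenue and rating.
r_revenue = np.corrcoef(revenue, rating)[0, 1]

# Partial association: ordinary least squares with an intercept and both predictors.
X = np.column_stack([np.ones(n), votes, revenue])
coef, *_ = np.linalg.lstsq(X, rating, rcond=None)

print("Pearson r (revenue, rating):", round(r_revenue, 3))
print("OLS coefficient on revenue: ", round(coef[2], 3))
```

In this toy setup revenue correlates positively with rating only because it travels with votes, and the regression separates the two effects.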

## Part 2: K-NN to Predict Movie Rating

In [8]:
from scipy import spatial

def calcDist(a, b):
    genreA = a[1]
    genreB = b[1]
    genreDist = spatial.distance.cosine(genreA, genreB)
    popA = a[2]
    popB = b[2]
    popDist = abs(popA - popB)
    return genreDist + popDist
    

#### Function 1: The purpose of the above function is to calculate the distance between the genres and popularity of a and b, which indicates how similar the two movies are (the higher the distance, the less similar).
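
Note that `Genre` was factorized into a single integer per movie in cell `In [3]`, so `calcDist` compares two scalars. Treated as one-dimensional vectors, any two positive numbers point in the same direction, so their cosine distance is 0 and the genre term cannot tell movies apart. One possible alternative, sketched below, encodes each comma-separated genre string as a multi-hot vector before taking the cosine distance. The helper names `genre_vector` and `cosine_distance` and the tiny vocabulary are hypothetical, and this is not the method that produced the results in this notebook.

```python
import math

def genre_vector(genre_string, vocabulary):
    """Return a multi-hot list: 1 where the movie has the genre, else 0."""
    genres = set(genre_string.split(","))
    return [1 if g in genres else 0 for g in vocabulary]

def cosine_distance(u, v):
    """1 - cos(angle between u and v); returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat "no genre information" as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

# Genre strings taken from the first rows of the dataset shown earlier.
movies = {
    "Sing": "Animation,Comedy,Family",
    "Mindhorn": "Comedy",
    "Split": "Horror,Thriller",
}

# Build the vocabulary from the genres that actually occur.
vocabulary = sorted({g for s in movies.values() for g in s.split(",")})
vectors = {title: genre_vector(s, vocabulary) for title, s in movies.items()}

print(cosine_distance(vectors["Sing"], vectors["Mindhorn"]))  # shares "Comedy"
print(cosine_distance(vectors["Sing"], vectors["Split"]))     # no shared genre
```

With this encoding, movies that share genres end up closer than movies that share none, which is the behavior the genre term is meant to provide.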

In [9]:
import operator

def getNeighbor(movieIndex, K):
    distances = []
    neighbors = []
    for movie in movieDict:
        if (movie != movieIndex):
            dist = calcDist(movieDict[movieIndex], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors


#### Function 2: The purpose of the above function is to return the K movies (15 in this notebook) that are most similar to the searched movie, using the distance function to compare them.

In [10]:
K = 15
avgRating = 0
myMovieIndex = 3
neighbors = getNeighbor(myMovieIndex, K)
print("The 15 movies most similar to", movieDict[myMovieIndex][0], "are: \n")
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print (movieDict[neighbor][0] + " " + str(movieDict[neighbor][3]))
    
avgRating /= K

The 15 movies most similar to Sing are: 

Guardians of the Galaxy 8.1
The Great Wall 6.1
Bad Moms 6.2
Silence 7.3
Why Him? 6.3
Resident Evil: The Final Chapter 5.6
Bahubali: The Beginning 8.3
Underworld: Blood Wars 5.8
Trolls 6.5
The Founder 7.2
The Autopsy of Jane Doe 6.8
Hidden Figures 7.8
Mother's Day 5.6
Gold 6.7
Lion 8.1


#### Main Function: The purpose of the above code is to take the list of 15 movies returned by `getNeighbor`, look up their ratings, and use the average of those ratings as the prediction for the rating of the searched movie.

In [11]:
print("The predicted rating for", movieDict[myMovieIndex][0], "based off of the", K, "movies most similar to it is:", avgRating, ".")


The predicted rating for Sing based off of the 15 movies most similar to it is: 6.826666666666666 .


In [12]:
print("The actual rating for", movieDict[myMovieIndex][0], "is", movieDict[myMovieIndex][3])

The actual rating for Sing is 7.2


### Analysis of K-NN Results

The nearest-neighbor algorithm performed pretty well on this example: it predicted about 6.83 for Sing against an actual rating of 7.2, an under-prediction of roughly 0.37. Since we're only looking at 2 features, that may have something to do with its inaccuracy. I think if we added more relevant features, such as the Metascore, we could possibly see a more accurate prediction. We could also look into adjusting the K value. In fact, I initially started with K = 10 but found that I got my most accurate prediction at K = 15. I tried K = 20 but ended up with a less accurate prediction.
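
Choosing K by checking a single movie risks tuning to that one example. A more robust comparison would score each candidate K across every movie, for instance with a leave-one-out error. The sketch below shows the idea on a toy dataset. The popularity values, ratings, and the helpers `knn_predict` and `leave_one_out_mae` are hypothetical, and the distance here is simply the absolute popularity difference.

```python
def knn_predict(data, target_index, k):
    """Average the ratings of the k items closest in popularity to the target."""
    target_pop = data[target_index][0]
    others = [
        (abs(pop - target_pop), rating)
        for i, (pop, rating) in enumerate(data)
        if i != target_index  # leave the target itself out
    ]
    others.sort(key=lambda pair: pair[0])
    nearest = others[:k]
    return sum(rating for _, rating in nearest) / len(nearest)

def leave_one_out_mae(data, k):
    """Mean absolute error when each item is predicted from all the others."""
    errors = [abs(knn_predict(data, i, k) - data[i][1]) for i in range(len(data))]
    return sum(errors) / len(errors)

# Toy (normalized popularity, rating) pairs; assumption: not real IMDB rows.
toy = [
    (0.02, 6.1), (0.05, 6.4), (0.10, 6.8), (0.15, 7.0),
    (0.30, 7.4), (0.45, 7.9), (0.60, 8.1), (0.90, 8.6),
]

scores = {k: leave_one_out_mae(toy, k) for k in (1, 3, 5)}
for k, mae in scores.items():
    print(f"K={k}: leave-one-out MAE = {mae:.3f}")
best_k = min(scores, key=scores.get)
print("Best K on this toy data:", best_k)
```

The same loop could be run over `movieDict` with `calcDist` as the distance, so that the choice of K reflects all movies rather than one.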

## Which Method Was More Accurate?

The K-NN Met