# Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

We **Team ...**, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

We understand that non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Movie Recommendation Challenge

### Problem Statement

Recommender systems play a crucial role in today's technology-driven world, enabling individuals to make informed choices about the content they engage with on a daily basis. In particular, movie content recommendations rely on intelligent algorithms to help viewers discover great titles from a vast array of options. Companies like Netflix, Amazon Prime, Showmax, and Disney have successfully employed recommendation algorithms to suggest personalized content to their users. The challenge at hand is to construct a recommendation algorithm based on content or collaborative filtering that accurately predicts how a user will rate a movie they have not yet viewed, leveraging their historical preferences.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Introduction</a>

<a href=#one>2. Importing Packages</a>

<a href=#two>3. Setting Up Comet</a>

<a href=#two>4. Loading Data</a>

<a href=#three>5. Exploratory Data Processing</a>

<a href=#four>6. Feature Extraction</a>

<a href=#five>7. Modeling- Selection, Evaluation, and Fine-Tuning</a>

<a href=#six>8. Model Performance</a>

<a href=#seven>9. Model Explanations</a>

<a href=#two>10. Submission</a>

## 1. Introduction
In the era of digital media consumption, recommender systems have become pivotal for guiding users towards relevant and engaging content. Platforms such as Netflix, Amazon Prime, Showmax, and Disney have mastered the art of providing personalized recommendations, enhancing user satisfaction and driving revenue. Behind these recommendations lies a sophisticated algorithm that analyzes user preferences and historical data to predict their potential interest in unexplored movies.

In this Jupyter notebook, we tackle the challenge presented by EDSA: building a functional recommender system that accurately predicts user ratings for unseen movies. Using content-based or collaborative filtering techniques, we will develop an algorithm that draws on historical user data to generate meaningful recommendations.

The value of constructing an effective recommender system is immense, both economically and socially. A successful solution to this challenge can open doors to increased user engagement, platform affinity, and revenue generation. The evaluation metric for this competition is the Root Mean Square Error (RMSE), a widely used measure in regression analysis and forecasting. By minimizing the RMSE, we can enhance the accuracy and reliability of our recommendation algorithm, improving user satisfaction and driving platform success.
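As a minimal sketch of the metric described above, RMSE is the square root of the mean squared difference between actual and predicted ratings. The function name `rmse` and the sample ratings are illustrative assumptions, not part of the competition code.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error between two equal-length rating sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one rating is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, for illustration only
actual_ratings = [4.0, 4.5, 5.0, 2.0]
predicted_ratings = [3.5, 4.5, 4.0, 3.0]
score = rmse(actual_ratings, predicted_ratings)
print(score)
```

Because the errors are squared before averaging, one prediction that is off by a full star costs more than two predictions that are each off by half a star.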

To participate in this competition, submission files should adhere to a specific format. Each submission should include two columns: "Id" and "rating." The "Id" column should consist of a concatenation of the userID and movieID, separated by an underscore (_). The "rating" column should contain the predicted rating for the corresponding user-movie pair.
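The submission format described above could be assembled roughly as follows. This is a sketch only: the user-movie pairs and predicted ratings are made-up values, and the variable names are assumptions rather than the names used later in this notebook.

```python
import pandas as pd

# Hypothetical user-movie pairs and predictions, for illustration only
pairs = pd.DataFrame({"userId": [1, 1, 2], "movieId": [2011, 4144, 5]})
predicted = [3.5, 4.0, 2.5]

# "Id" is userId and movieId joined by an underscore; "rating" is the prediction
submission = pd.DataFrame({
    "Id": pairs["userId"].astype(str) + "_" + pairs["movieId"].astype(str),
    "rating": predicted,
})

# index=False keeps the row index out of the file so only the two columns remain
csv_text = submission.to_csv(index=False)
print(csv_text)
```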

Let's dive into the challenge and develop an innovative recommendation algorithm that brings users closer to the movies they love.

### Data Overview

The dataset provided for this challenge consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has been widely used by both industry and academic researchers to enhance the performance of recommender systems. In this challenge, we will be working with a special version of the MovieLens dataset that has been enriched with additional data and resampled to ensure fair evaluation.

Source:
The MovieLens dataset is maintained by the GroupLens research group at the University of Minnesota's Department of Computer Science and Engineering. Additional movie content data has been legally scraped from IMDB to enrich the dataset.

Supplied Files:

1. ##### genome_scores.csv: 
This file contains scores that map the strength between movies and tag-related properties. It provides valuable insights into the characteristics and attributes associated with movies.

2. ##### genome_tags.csv: 
User-assigned tags corresponding to the genome-related scores are provided in this file.

3. ##### imdb_data.csv: 
This file includes additional movie metadata that was scraped from IMDB using the links.csv file.

4. ##### links.csv: 
It provides a mapping between the MovieLens ID and associated IMDB and TMDB IDs, allowing cross-referencing of movie data from different sources.

5. ##### sample_submission.csv: 
This file serves as a sample submission format for the hackathon.

6. ##### tags.csv: 
User-assigned tags for the movies within the dataset are provided in this file.

7. ##### test.csv: 
This file contains the test split of the dataset, which includes user and movie IDs but no rating data.

8. ##### train.csv: 
The training split of the dataset is provided in this file. It contains user and movie IDs with associated rating data.

###### Additional Information:
The following information is derived directly from the MovieLens dataset description files:

###### Ratings Data File Structure (train.csv):
The train.csv file contains all the ratings in the dataset. Each line represents one rating of one movie by one user and follows the format: userId, movieId, rating, timestamp. The lines in the file are ordered first by userId and then, within each user, by movieId. Ratings are provided on a 5-star scale with half-star increments.
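The rating scale and timestamp column can be sanity-checked with a few lines of pandas. This is a hedged sketch: the two rows are copied from the `train.csv` preview in this notebook, while the 0.5 to 5.0 range and the reading of `timestamp` as seconds since the Unix epoch are assumptions taken from the public MovieLens documentation rather than from the text above.

```python
import pandas as pd

# Two sample rows copied from the train.csv preview in this notebook
ratings = pd.DataFrame({
    "userId": [5163, 106343],
    "movieId": [57669, 5],
    "rating": [4.0, 4.5],
    "timestamp": [1518349992, 1206238739],
})

# Half-star increments: doubling a valid rating gives a whole number from 1 to 10
# (assumes the scale runs from 0.5 to 5.0)
doubled = ratings["rating"] * 2
is_valid = (doubled % 1 == 0) & doubled.between(1, 10)

# Assumption: timestamps are seconds since the Unix epoch (UTC)
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s")
print(ratings[["rating", "rated_at"]])
```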

###### Tags Data File Structure (tags.csv):
The tags.csv file contains all the user-assigned tags for movies in the dataset. Each line represents one tag applied to one movie by one user and follows the format: userId, movieId, tag, timestamp. The lines in the file are ordered first by userId and then, within each user, by movieId. Tags are user-generated metadata about movies, typically represented by a single word or short phrase.

###### Movies Data File Structure (movies.csv):
The movies.csv file contains information about each movie in the dataset. Each line represents one movie and follows the format: movieId, title, genres. Movie titles are manually entered or imported from https://www.themoviedb.org/ and include the year of release in parentheses. The genres are listed as pipe-separated values.
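The title and genre conventions above can be unpacked with plain Python. The helper names `split_title` and `split_genres` are hypothetical, and the sketch assumes the release year is always a four-digit number in trailing parentheses, returning `None` for titles that do not follow that pattern.

```python
import re

def split_title(title):
    """Return (name, year) from a title such as 'Toy Story (1995)'.

    Year is None when the title has no trailing four-digit year in parentheses.
    """
    match = re.search(r"^(.*)\s\((\d{4})\)\s*$", title)
    if match is None:
        return title.strip(), None
    return match.group(1).strip(), int(match.group(2))

def split_genres(genres):
    """Split a pipe-separated genre string into a list of genre names."""
    return genres.split("|") if genres else []

name, year = split_title("Toy Story (1995)")
genre_list = split_genres("Adventure|Animation|Children|Comedy|Fantasy")
print(name, year, genre_list)
```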

###### Links Data File Structure (links.csv):
The links.csv file provides identifiers that can be used to link to other sources of movie data. Each line represents one movie and follows the format: movieId, imdbId, tmdbId. The movieId corresponds to the identifier used by https://movielens.org. imdbId corresponds to the identifier used by http://www.imdb.com, and tmdbId corresponds to the identifier used by https://www.themoviedb.org.
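To illustrate how these identifiers link out to the other sites, the sketch below builds URLs for Toy Story from the first row of `links.csv`. The helper names are hypothetical, and the URL patterns (an IMDb `tt` prefix with the number zero-padded to seven digits, and a TMDB `/movie/` path) are assumptions based on the public MovieLens documentation.

```python
def imdb_url(imdb_id):
    """Build an IMDb title URL from the numeric imdbId in links.csv.

    Assumption: IMDb title identifiers are 'tt' followed by the number
    zero-padded to at least seven digits.
    """
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"

def tmdb_url(tmdb_id):
    """Build a TMDB movie URL; tmdbId may be read as a float such as 862.0."""
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"

# First row of links.csv: movieId 1, imdbId 114709, tmdbId 862.0
toy_story_imdb = imdb_url(114709)
toy_story_tmdb = tmdb_url(862.0)
print(toy_story_imdb)
print(toy_story_tmdb)
```

Rows with a missing `tmdbId` would need to be filtered out first, since a missing value cannot be converted to an integer.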

###### Tag Genome (genome-scores.csv and genome-tags.csv):
The tag genome represents how strongly movies exhibit specific properties or characteristics encoded by tags, such as being atmospheric, thought-provoking, or realistic. The genome-scores.csv file contains movie-tag relevance data in the format: movieId, tagId, relevance. The genome-tags

## Importing Packages

In [1]:
import re
import warnings
from math import sqrt

import numpy as np
import pandas as pd
import cufflinks as cf
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Text processing
from wordcloud import WordCloud, STOPWORDS
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

# scikit-learn: similarity, vectorisers, metrics, and models
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
# Aliased so that Surprise's train_test_split (imported below) does not shadow it
from sklearn.model_selection import train_test_split as sk_train_test_split

# Surprise: collaborative filtering algorithms and evaluation helpers
from surprise import Dataset, Reader, accuracy
from surprise import SVD, KNNWithMeans, NormalPredictor, BaselineOnly, NMF, SlopeOne, CoClustering
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

# Plot styling
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (15,10)})

# Plotly offline mode for interactive charts
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

## Setting up Comet

In [3]:
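!pip install --root-user-action=ignore comet_ml
from kaggle_secrets import UserSecretsClient  # Kaggle secrets store for the API key
from comet_ml import Experiment  # Base class for logging via Comet-ML

# Read the API key from Kaggle secrets rather than hard-coding it in the notebook.
# Assumption: the key is saved under the label "COMET_API_KEY" (hypothetical name).
comet_api_key = UserSecretsClient().get_secret("COMET_API_KEY")

experiment = Experiment(
    api_key=comet_api_key,
    project_name="movie-recommender-system-unsupervised",
    workspace="andisiwe-jafta"
)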

COMET INFO: Couldn't find a Git repository in '/kaggle/working' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
COMET INFO: Experiment is live on comet.com https://www.comet.com/andisiwe-jafta/movie-recommender-system-unsupervised/05b993c445874c5e9fee9bf0dd310388



## Loading Data

In [5]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/edsa-movie-recommendation-predict/sample_submission.csv
/kaggle/input/edsa-movie-recommendation-predict/movies.csv
/kaggle/input/edsa-movie-recommendation-predict/imdb_data.csv
/kaggle/input/edsa-movie-recommendation-predict/genome_tags.csv
/kaggle/input/edsa-movie-recommendation-predict/genome_scores.csv
/kaggle/input/edsa-movie-recommendation-predict/train.csv
/kaggle/input/edsa-movie-recommendation-predict/test.csv
/kaggle/input/edsa-movie-recommendation-predict/tags.csv
/kaggle/input/edsa-movie-recommendation-predict/links.csv


In [6]:
movies_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/movies.csv')
imdb_data_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/imdb_data.csv')
tags_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/tags.csv')
train_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/train.csv')
test_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/test.csv')
links_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/links.csv')
genome_tags_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/genome_tags.csv')
genome_scores_db = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/genome_scores.csv')

In [11]:
# View the first five rows of the movies data
movies_db.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [10]:
imdb_data_db.head()

| | movieId | title_cast | director | runtime | budget | plot_keywords |
|---|---|---|---|---|---|---|
| 0 | 1 | Tom Hanks\|Tim Allen\|Don Rickles\|Jim Varney\|Wal... | John Lasseter | 81.0 | \$30,000,000 | toy\|rivalry\|cowboy\|cgi animation |
| 1 | 2 | Robin Williams\|Jonathan Hyde\|Kirsten Dunst\|Bra... | Jonathan Hensleigh | 104.0 | \$65,000,000 | board game\|adventurer\|fight\|game |
| 2 | 3 | Walter Matthau\|Jack Lemmon\|Sophia Loren\|Ann-Ma... | Mark Steven Johnson | 101.0 | \$25,000,000 | boat\|lake\|neighbor\|rivalry |
| 3 | 4 | Whitney Houston\|Angela Bassett\|Loretta Devine\|... | Terry McMillan | 124.0 | \$16,000,000 | black american\|husband wife relationship\|betra... |
| 4 | 5 | Steve Martin\|Diane Keaton\|Martin Short\|Kimberl... | Albert Hackett | 106.0 | \$30,000,000 | fatherhood\|doberman\|dog\|mansion |


In [12]:
tags_db.head()


| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 3 | 260 | classic | 1439472355 |
| 1 | 3 | 260 | sci-fi | 1439472256 |
| 2 | 4 | 1732 | dark comedy | 1573943598 |
| 3 | 4 | 1732 | great dialogue | 1573943604 |
| 4 | 4 | 7569 | so bad it's good | 1573943455 |


In [13]:
train_db.head()


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 5163 | 57669 | 4.0 | 1518349992 |
| 1 | 106343 | 5 | 4.5 | 1206238739 |
| 2 | 146790 | 5459 | 5.0 | 1076215539 |
| 3 | 106362 | 32296 | 2.0 | 1423042565 |
| 4 | 9041 | 366 | 3.0 | 833375837 |


In [14]:
links_db.head()


| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


In [15]:
genome_tags_db.head()


| | tagId | tag |
|---|---|---|
| 0 | 1 | 007 |
| 1 | 2 | 007 (series) |
| 2 | 3 | 18th century |
| 3 | 4 | 1920s |
| 4 | 5 | 1930s |


In [16]:
genome_scores_db.head()

| | movieId | tagId | relevance |
|---|---|---|---|
| 0 | 1 | 1 | 0.02875 |
| 1 | 1 | 2 | 0.02375 |
| 2 | 1 | 3 | 0.0625 |
| 3 | 1 | 4 | 0.07575 |
| 4 | 1 | 5 | 0.14075 |


In [17]:
test_db.head()

| | userId | movieId |
|---|---|---|
| 0 | 1 | 2011 |
| 1 | 1 | 4144 |
| 2 | 1 | 5767 |
| 3 | 1 | 6711 |
| 4 | 1 | 7318 |
