# **Unsupervised Learning**

**This notebook has been prepared by:**
* Zizipho Tyeko
* Siyamanga Malawu
* Lejone Malokosta
* Pfano Phungo
* Mogau Mogashoa
* Dunyiswa Matshaya

### **How is the notebook going to work?**

This notebook lays out a recommender system that predicts the rating a user is likely to give a movie. It works through recommender-system methods and techniques in sequential steps, ending with the predicted ratings.

# **Movie Recommendation Challenge**

## **Recommender System**

Recommender systems are among the best-known applications of data science today. They are used to predict the "rating" or "preference" that a user would give to an item. They do this by searching through large volumes of dynamically generated information to provide users with personalised content and services.
In practice, a recommender system predicts whether a particular user would prefer an item or not, based on the user’s profile.

## **Two types of Recommender System**

* Content-Based Recommender System
* Collaborative Filtering Recommender System

## Collaborative Filtering Recommender System
Collaborative filtering recommender systems use the past interactions recorded between users and items to produce new recommendations. These interactions are stored in the so-called “user-item interactions matrix”.
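
As a minimal sketch of this idea, assuming a tiny made-up set of `(userId, movieId, rating)` triples rather than the competition data, the interaction matrix can be held as a nested dictionary and two users compared by the cosine similarity of their ratings. The helper names `build_matrix` and `cosine_similarity_users` are illustrative only.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy ratings: (userId, movieId, rating)
ratings = [
    (1, 10, 4.0), (1, 20, 5.0), (1, 30, 1.0),
    (2, 10, 4.0), (2, 20, 5.0),
    (3, 10, 1.0), (3, 30, 5.0),
]

def build_matrix(triples):
    """Return {userId: {movieId: rating}}, a sparse user-item matrix."""
    matrix = defaultdict(dict)
    for user, movie, rating in triples:
        matrix[user][movie] = rating
    return dict(matrix)

def cosine_similarity_users(matrix, u, v):
    """Cosine similarity of two users, treating unrated movies as 0."""
    common = set(matrix[u]) & set(matrix[v])
    dot = sum(matrix[u][m] * matrix[v][m] for m in common)
    norm_u = sqrt(sum(r * r for r in matrix[u].values()))
    norm_v = sqrt(sum(r * r for r in matrix[v].values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

matrix = build_matrix(ratings)
sim_12 = cosine_similarity_users(matrix, 1, 2)
sim_13 = cosine_similarity_users(matrix, 1, 3)
print(round(sim_12, 3), round(sim_13, 3))
```

In this toy data user 1 comes out closer to user 2 than to user 3, so a user-based approach would lean on user 2's ratings when predicting for user 1.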

### Advantages of Collaborative Filtering Recommender System
* Works for any kind of item since no feature selection is needed
* Requires no content analysis or extraction
* Independent of any machine-readable representation of the objects being recommended
* More diverse and serendipitous recommendations

### Disadvantages of Collaborative Filtering Recommender System
* Cold start problem
* Popularity bias
* Sparsity: it is hard to find users that have rated the same item

## **Content-Based Recommender System**
Content-based recommender systems use additional information about users and/or items to make predictions. For users, this additional information can be, for example, age, sex, job or any other personal information.
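
A minimal content-based sketch, assuming item information comes as pipe-separated genre strings (the titles and genres below are made up for illustration): each movie is reduced to a set of genres, and the most similar movie is the one with the highest Jaccard overlap. The helpers `genre_set`, `jaccard` and `most_similar` are hypothetical names, not part of any library.

```python
# Hypothetical toy catalogue: title -> pipe-separated genres
catalogue = {
    "Movie A": "Adventure|Animation|Children",
    "Movie B": "Adventure|Children|Fantasy",
    "Movie C": "Crime|Thriller",
}

def genre_set(genres):
    """Split a pipe-separated genre string into a set of genres."""
    return set(genres.split("|"))

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(title, items):
    """Return the closest other title and the score of every other title."""
    target = genre_set(items[title])
    scores = {
        other: jaccard(target, genre_set(genres))
        for other, genres in items.items()
        if other != title
    }
    return max(scores, key=scores.get), scores

best, scores = most_similar("Movie A", catalogue)
print(best, scores)
```

Only the item descriptions are used here; no ratings from other users are needed, which is why a brand-new item can still be recommended.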

### Advantages of Content-Based Recommender System
* Content representations are varied and they open up the options to use different approaches like: text processing techniques, the use of semantic information, inferences, etc…
* It is easy to make a more transparent system: we use the same content to explain the recommendations.
* We can avoid the “new item problem”

### Disadvantages of Content-Based Recommender System
* Content-based recommender systems tend to over-specialise: they recommend items similar to those already consumed, with a tendency to create a “filter bubble”.
* Methods based on collaborative filtering have been shown, empirically, to be more precise when generating recommendations


# **Introduction**

The aim of this notebook is to predict how a user will rate a movie they have not yet viewed, based on their historical preferences on a movie website or application, e.g. Netflix, Showmax or Amazon Prime.

Movie websites and applications can improve their reliability and enhance their customer experience by using a recommender system to estimate the rating or preference a user would give a movie.

Recommender systems are essential economically and socially in today's technology-driven world. They can help movie companies ensure that their users make appropriate choices about the content they regularly engage with.

## **Problem Statement**
Can we construct a recommendation algorithm, based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed, based on their historical preferences?

## **Aim**
Design a recommender system that predicts the rating a user is likely to give a movie they have not yet viewed, based on the user's history of movie ratings.

## **Scope**
The scope of this project is to analyse and search through large volumes of dynamically generated information consisting of movie ratings given by users and information describing the movies.
These ratings will be used to train machine learning models to help with the prediction of the ratings given by a user on an unseen movie. This could also help with providing users with personalised content and services.

<img src="https://posteet.com/wp-content/uploads/2019/11/movies.png" width=90%>

# **Table of Contents**

1. Import packages
2. Loading Datasets
3. Data Description
4. Exploratory Data Analysis
5. Data Filtering
6. Variable Selection
7. Modeling
8. Model Comparison
9. Model Explanation
10. Submission
11. Application Pickled files


In [None]:
!pip install scikit-surprise

In [None]:
!pip install surprise

## 1. Importing packages

In [None]:
# utilities
import numpy as np
import pandas as pd

#pre-processing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#plotting
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns
import cufflinks as cf
plt.style.use('ggplot')
%matplotlib inline
sns.set()

from sklearn.metrics import mean_squared_error
from scipy import stats
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate


from sklearn.model_selection import train_test_split

# 2. Loading Datasets

In [None]:
train = pd.read_csv('../input/edsa-recommender-system-predict/train.csv') 
test = pd.read_csv('../input/edsa-recommender-system-predict/test.csv')
scores = pd.read_csv('../input/edsa-recommender-system-predict/genome_scores.csv')
tags = pd.read_csv('../input/edsa-recommender-system-predict/genome_tags.csv')
imbd = pd.read_csv('../input/edsa-recommender-system-predict/imdb_data.csv') 
links = pd.read_csv('../input/edsa-recommender-system-predict/links.csv') 
movies = pd.read_csv('../input/edsa-recommender-system-predict/movies.csv')
sample = pd.read_csv('../input/edsa-recommender-system-predict/sample_submission.csv')

# 3. Data Description

In [None]:
train.head()

In [None]:
train['rating'].unique()

In [None]:
test.head()

In [None]:
scores.head()

In [None]:
tags.head()

In [None]:
imbd.head()

In [None]:
imbd_df = imbd.copy()


In [None]:
import re

In [None]:
imbd_df.dtypes

In [None]:
# Strip currency symbols, separators and any other non-numeric characters,
# then convert to float; values with nothing numeric left become NaN
imbd_df['budget'] = imbd_df['budget'].replace(regex=True, to_replace=r'[^0-9.\-]', value=r'')
imbd_df['budget'] = pd.to_numeric(imbd_df['budget'], errors='coerce')
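
The budget cleaning above can be sketched on single values with the standard `re` module. This is a plain-Python illustration, assuming budgets arrive as strings such as `"$30,000,000"`; the sample values and the helper name `parse_budget` are made up for the example.

```python
import re

def parse_budget(raw):
    """Keep only digits, dots and minus signs, then convert to float.

    Returns None when the value is missing or nothing numeric is left.
    """
    if raw is None:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", str(raw))
    try:
        return float(cleaned)
    except ValueError:
        return None

# Hypothetical raw values, not taken from imdb_data.csv
samples = ["$30,000,000", "GBP 1,500,000", None, "unknown"]
parsed = [parse_budget(s) for s in samples]
print(parsed)
```

Note that this kind of cleaning discards the currency marker, so budgets quoted in different currencies end up on the same numeric scale.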

In [None]:
imbd_df.head()

In [None]:
imbd_df.dtypes

In [None]:
links.head()

In [None]:
movies.head(100)

In [None]:
movies_df = movies.copy()

In [None]:

# Extract the four-digit release year from the trailing "(YYYY)" in the title
movies_df['Year'] = movies_df['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
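
A small sketch of year extraction with the standard `re` module, assuming titles follow the "Title (YYYY)" convention. Anchoring the pattern on the trailing parentheses matters for titles that contain other numbers; `YEAR_PATTERN` and `extract_year` are illustrative names.

```python
import re

# Four digits inside parentheses at the very end of the title
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the release year as an int, or None if the title has none."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None

titles = [
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",
    "Untitled Project",
]
years = [extract_year(t) for t in titles]
print(years)
```

A pattern that simply takes the first run of digits would pick up 2001 for the second title instead of 1968.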

In [None]:
movies_df.head()

In [None]:
sample.head()

## 4. Pre-processing

In [None]:
train.head()

In [None]:
train.isnull().sum()

In [None]:
# create short list of unwanted columns
labels = ['timestamp']

# declare the features to be all columns, less the unwanted ones from above
features = [col for col in train.columns if col not in labels]

In [None]:
# configure cufflinks for offline plotting with the ggplot theme
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')

# using plotly to plot the boxplot
train[features].iplot(kind='box', title="Boxplots of Features (Unscaled)")

### Checking for duplicates

In [None]:
# flag rows that repeat the same (movieId, userId, rating) combination; timestamp is not considered
dup_bool = train.duplicated(['movieId','userId','rating'])
dups = dup_bool.sum()
print("There are {} duplicate rating entries in the data..".format(dups))

# 4. **Exploratory Data Analysis**

In [None]:
train.describe()['rating']

**Boxplot**

In [None]:
box = train['rating']
plt.boxplot(box)
plt.show()

**Total Number of ratings, users and movies**

In [None]:
print("Total data ")
print("-"*50)
print("\nTotal no of ratings :",train.shape[0])
print("Total No of Users   :", len(np.unique(train.userId)))
print("Total No of movies  :", len(np.unique(train.movieId)))
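
The three counts printed above are enough to estimate how sparse the user-item matrix is. The sketch below is a rough illustration with made-up counts, assuming each user rates a given movie at most once; in the notebook the inputs would be `train.shape[0]` and the numbers of unique users and movies.

```python
def sparsity(n_ratings, n_users, n_movies):
    """Fraction of user-movie pairs that have no rating."""
    possible = n_users * n_movies
    if possible == 0:
        raise ValueError("need at least one user and one movie")
    return 1.0 - n_ratings / possible

# Hypothetical counts, not the competition figures
example = sparsity(n_ratings=1_000, n_users=100, n_movies=50)
print(f"{example:.1%} of the matrix is empty")
```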

In [None]:
# method to make y-axis more readable
def human(num, units = 'M'):
    units = units.lower()
    num = float(num)
    if units == 'k':
        return str(num/10**3) + " K"
    elif units == 'm':
        return str(num/10**6) + " M"
    elif units == 'b':
        return str(num/10**9) +  " B"

In [None]:
fig, ax = plt.subplots()
plt.title('Distribution of ratings over Training dataset', fontsize=15)
sns.countplot(x=train.rating, ax=ax)
ax.set_yticklabels([human(item, 'M') for item in ax.get_yticks()])
ax.set_ylabel('No. of Ratings(Millions)')

plt.show()

In [None]:
#number of rated movies per user
no_of_rated_movies_per_user = train.groupby(by='userId')['rating'].count().sort_values(ascending=False)

no_of_rated_movies_per_user.head()

In [None]:
no_of_rated_movies_per_user.describe()

In [None]:
no_of_ratings_per_movie = train.groupby(by='movieId')['rating'].count().sort_values(ascending=False)

fig = plt.figure(figsize=plt.figaspect(.5))
ax = plt.gca()
plt.plot(no_of_ratings_per_movie.values)
plt.title('# RATINGS per Movie')
plt.xlabel('MovieId')
plt.ylabel('No of Users who rated a movie')
ax.set_xticklabels([])

plt.show()

In [None]:
movie_data = pd.merge(train, movies, on='movieId')

In [None]:
movie_data.head(2)

In [None]:
#mean rating per title, sorted in descending order
movie_data.groupby('title')['rating'].mean().sort_values(ascending=False).head()

In [None]:
#number of ratings per title, sorted in descending order
movie_data.groupby('title')['rating'].count().sort_values(ascending=False).head()

In [None]:
#mean rating and number of ratings per title
ratings_mean_count = pd.DataFrame(movie_data.groupby('title')['rating'].mean())
ratings_mean_count['rating_counts'] = movie_data.groupby('title')['rating'].count()
ratings_mean_count.head()

The dataframe above shows each movie title along with its average rating and its number of ratings.

Next is a histogram of the number of ratings per movie, taken from the "rating_counts" column of that dataframe.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
%matplotlib inline

plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_mean_count['rating_counts'].hist(bins=50)

From the output, you can see that most of the movies have received fewer than 50 ratings, while the number of movies with more than 5000 ratings is very low.

A histogram for average ratings

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_mean_count['rating'].hist(bins=50)

You can see that integer values have taller bars than fractional values, since most users assign whole-number ratings, i.e. 1, 2, 3, 4 or 5. The data is also roughly normally distributed, with a mean of around 3.5 and a few outliers.

Average ratings against the number of ratings:

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
sns.jointplot(x='rating', y='rating_counts', data=ratings_mean_count, alpha=0.4)

**Budget of movies**

In [None]:
budget = pd.merge(imbd, movies, on='movieId')
budget.head(2)

In [None]:
#top 5 movies with longest running time
budget['runtime'] = budget['runtime'].astype(float)
b = budget.drop(['movieId','title_cast', 'director', 'budget', 'plot_keywords', 'genres'], axis=1)
b.head()
b.nlargest(5,['runtime'])

The longest movie is Taken (2002)

## Fitting the model

In [None]:
#Independent feature of the train dataframe
X = train.drop(['rating'], axis=1)
#Dependent feature of the train dataframe
y=train['rating']
#Independent feature of test dataframe
x_unseen=test['movieId'] #test independent feature

In [None]:
#Splitting the train dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
vectoriser = TfidfVectorizer(stop_words='english', 
                             min_df=1, 
                             max_df=0.9, 
                             ngram_range=(1, 2))

In [None]:
#fitting the vectoriser
vectoriser.fit(X, y)

In [None]:
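#transformation of the datasets
# TfidfVectorizer expects text documents; X holds only numeric columns
# (userId, movieId, timestamp), so the numeric train/test split is passed
# to the model unchanged
#X_train = vectoriser.transform(X_train)
#X_test  = vectoriser.transform(X_test)
#x_unseen =  vectoriser.transform(x_unseen)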

In [None]:
params={
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15,20,30,40],
 "min_child_weight" : [ 1, 3, 5, 7 ,9,10,11,12],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4,1,2],
 "colsample_bytree" : [ 0.2,0.3, 0.4, 0.5 , 0.7,0.8 ]
 
    
}

In [None]:
## Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import xgboost as xgb
boost = xgb.XGBRegressor()

In [None]:
random_search=RandomizedSearchCV(boost,param_distributions=params,n_iter=2,n_jobs=1,cv=2,verbose=True)
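
To put `n_iter=2` in context, the sketch below counts how many combinations the grid above contains and draws two of them at random. It uses only the standard library; `sample_candidates` is a hypothetical helper, and scikit-learn's own sampler differs in detail (for list-valued grids it avoids drawing the same combination twice).

```python
import random
from math import prod

# Same search space as the params dictionary above
params = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15, 20, 30, 40],
    "min_child_weight": [1, 3, 5, 7, 9, 10, 11, 12],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4, 1, 2],
    "colsample_bytree": [0.2, 0.3, 0.4, 0.5, 0.7, 0.8],
}

# Total number of distinct parameter combinations in the grid
grid_size = prod(len(values) for values in params.values())

def sample_candidates(param_grid, n_iter, seed=0):
    """Draw n_iter random combinations (with replacement) from the grid."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in param_grid.items()}
        for _ in range(n_iter)
    ]

candidates = sample_candidates(params, n_iter=2)
print(grid_size, candidates)
```

With 2 candidates and `cv=2`, the search fits 4 models before the final refit, so it explores only a very small part of the 22,176 possible combinations.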

In [None]:
random_search.fit(X,y)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
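# initialize our first XGBoost model and fit it on the training split only;
# rows are selected from X by the split's index labels
boost = xgb.XGBRegressor()
boost.fit(X.loc[y_train.index], y_train)


In [None]:
# Getting predictions for the held-out split
pred0 = boost.predict(X.loc[y_test.index])
# checking score (root mean squared error)
np.sqrt(mean_squared_error(y_test, pred0))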
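
The score in the cell above is a root mean squared error (RMSE). As a plain-Python sketch of what that number means, with made-up ratings and predictions and an illustrative helper name `rmse`:

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between two sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one observation")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true ratings and predictions
score = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
print(round(score, 4))
```

Because the errors are squared before averaging, one prediction that is off by a full star weighs more than several that are off by a fraction of a star.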