# <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section we import, and briefly discuss, the libraries used throughout the analysis and modelling. The train and test data are loaded afterwards, in the Loading the Data section. |

---

In [2]:
import numpy as np 
import pandas as pd

In [3]:
#pip install plotly

In [4]:
import scipy as sp # <-- The sister of NumPy, used in our code for numerical efficiency.
import matplotlib.pyplot as plt
import seaborn as sns

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration
import heapq # <-- Efficient sorting of large lists

# Imported for our sanity
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
from pandas import MultiIndex

from plotly import graph_objects as go
# set plot style
sns.set()

# <a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section we load the data from the `train.csv` file into a DataFrame for the train data and from the `test.csv` file into a DataFrame for the test data. We use the pandas package to read the CSV files, assigning the train data to `train` and the test data to `test`. |

---
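
With about ten million rating rows, memory use at load time can matter. The following is a minimal sketch of passing explicit dtypes to `pd.read_csv`, not the loading code used in this notebook. It assumes the ratings file has the columns `userId`, `movieId`, `rating` and `timestamp`; the in-memory CSV text, the `dtypes` mapping and the name `sample` are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv so the sketch runs without the real file
csv_text = """userId,movieId,rating,timestamp
1,10,4.0,1518349992
1,20,3.5,1518350000
2,10,5.0,1518351000
"""

# Assumed dtypes: narrower integer/float types reduce memory on large files
dtypes = {"userId": "int32", "movieId": "int32", "rating": "float32"}
sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(sample.dtypes)
print(sample.shape)
```

Whether narrower types are safe depends on the actual ID ranges in the data, so check them before relying on this.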

In [5]:
# Load the train and test ratings into DataFrames


train = pd.read_csv('../input/database/train.csv')
test = pd.read_csv('../input/database/test.csv')

In [6]:
moviesfile = pd.read_csv('../input/database/movies.csv')
tags = pd.read_csv('../input/database/tags.csv')
imdb = pd.read_csv('../input/database/imdb_data.csv')
genome_tags = pd.read_csv('../input/database/genome_tags.csv')
genome_scores = pd.read_csv('../input/database/genome_scores.csv')

# <a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section we perform an in-depth analysis of all the variables in the DataFrame. We begin with EDA, a vital component, to better understand the dataset we are working with and to gain insight into the features and labels through univariate or multivariate, non-graphical or graphical analysis. |




In [7]:
# Loading and displaying an overview of the data
print('Dimension of train is: ', train.shape)
print('Dimension of test is: ', test.shape)

In [8]:
print (f'Number of ratings in dataset: {train.shape[0]}')

# Let's take a look at our data

In [9]:
# The first ten rows of the training dataset
train.head(10)

After looking at the first ten rows of the DataFrame, we can see that it has four (4) columns. The test DataFrame contains only the features.

We have three features and one label. The features are:

- userId
- movieId
- timestamp

label:

- rating

Now let's look at the data types in the DataFrame using `train.info()` to get more information about it.

In [10]:
train.info()

In [11]:
train.describe()

In [12]:
moviesfile.describe()

In [13]:
tags.describe()

In [14]:
#checking null values in the training data
train.isnull().sum()

Our training data shows that we have 0 null values which means we don't have any missing values.
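
The same check works on any of the frames. A small sketch on a made-up frame (the name `toy_tags` and its values are hypothetical) shows what `isnull().sum()` reports when a value is missing, and that `dropna()` returns a new DataFrame rather than changing the original:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing tag
toy_tags = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 20, 30],
    "tag": ["funny", np.nan, "classic"],
})

# Number of missing values in each column
missing_per_column = toy_tags.isnull().sum()
print(missing_per_column)

# dropna() returns a new DataFrame; toy_tags itself is left unchanged
cleaned = toy_tags.dropna()
print(len(toy_tags), len(cleaned))
```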

In [15]:
moviesfile.isnull().sum()

In [16]:
tags.isnull().sum()

In [17]:
# dropna() returns a new DataFrame; assign it back so the rows stay dropped
tags = tags.dropna()

In [18]:
# list the column names
train.columns

In [19]:
# Count how often each rating value occurs
train['rating'].value_counts()

In [20]:
# Histogram of ratings
plt.hist(train['rating'])

plt.title("Histogram")

# Display the plot
plt.show()

In [21]:
table = pd.merge(train,moviesfile, on = 'movieId', how = 'outer')

In [22]:
with sns.axes_style('white'):
    # factorplot was renamed to catplot and later removed from seaborn
    g = sns.catplot(x="rating", data=table, aspect=2.0, kind='count')
    g.set_ylabels("Total number of ratings")
print(f'Average rating in dataset: {np.mean(table["rating"])}')
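
An outer merge keeps rows from both frames, so `table` can contain ratings with no matching movie and movies with no ratings, each padded with `NaN`. The sketch below uses two made-up frames (`ratings`, `movies` and their values are hypothetical) and the `indicator=True` option to show where each merged row came from:

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and movies frames
ratings = pd.DataFrame({"userId": [1, 2], "movieId": [10, 99], "rating": [4.0, 3.0]})
movies = pd.DataFrame({"movieId": [10, 20], "title": ["A (2000)", "B (2001)"]})

# indicator=True adds a _merge column: both, left_only or right_only
merged = pd.merge(ratings, movies, on="movieId", how="outer", indicator=True)
print(merged)

# Series.mean() skips NaN, so unrated movies do not affect the average
print(merged["rating"].mean())
```

The `table.shape[0]` count after an outer merge can therefore exceed the number of ratings in `train`.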

# <a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
|⚡ Description: Data engineering ⚡ |
|:--------------------------- |
| In this section you are required to clean the dataset and possibly create new features. As identified in the EDA phase, our datasets contain a non-numerical column, so certain preprocessing steps must be carried out. |


#### Data processing




In [23]:
# Loading and displaying an overview of the data
print('Dimension of train is: ', train.shape)
print('Dimension of test is: ', test.shape)

1. There are 4 columns and 10000038 rows for the Train Data.
2. There are 2 columns and 5000019 rows for the Test Data.

A look at the first five rows of our data

In [24]:
train.head(5)

In [25]:
moviesfile.head(5)

In [26]:
tags.head(5)

In [27]:
table.head(5)

In [28]:
print (f'Number of ratings in dataset: {table.shape[0]}')

# <a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
|⚡ Description: Modelling on the movie recommendations system ⚡ |
|:--------------------------- |



#### Train - Test - Split

Before anything else, we have to split our train data into features and a target variable, and then split it into a train set and a validation set. This allows us to evaluate model performance and choose the best model to use for our submission.

---
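
The split described above can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration on a made-up frame, not the notebook's own split: the name `toy`, its values, the 80/20 ratio and the `random_state` are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ratings with the same column names as the train data
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 40, 20, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 1.0, 4.0, 3.5, 5.0],
})

# Features and target
X = toy[["userId", "movieId"]]
y = toy["rating"]

# Hold out 20% of the rows as a validation set (assumed ratio)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)
```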

In [29]:
train.head(5)

In [30]:
moviesfile.head(5)

In [31]:
tags.head(5)

In [32]:
table.head(5)

In [33]:
#

In [44]:
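# Build the text field used for TF-IDF from the genres column
# (missing genres become empty strings)
table['auth_tags'] = (pd.Series(table[[ 'genres']]
                      .fillna('')
                      .values.tolist()).str.join(' '))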

In [45]:
# Convenient indexes to map between our movie titles and indexes of 
# the merged dataframe
titles = table['title']
indices = pd.Series(table.index, index=table['title'])

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [47]:
table.head(5)

In [5]:
from scipy import spatial
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances

In [6]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2),
                     min_df=1, stop_words='english')

# Produce a feature matrix, where each row corresponds to a row of `table`,
# with TF-IDF features as columns 
tf_authTags_matrix = tf.fit_transform(table['auth_tags'])

In [3]:
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances

cosine_sim_authTags = cosine_similarity(tf_authTags_matrix,
                                        tf_authTags_matrix)


In [2]:
cosine_sim_authTags = cosine_similarity(tf_authTags_matrix,
                                        tf_authTags_matrix)

In [1]:
cosine_sim_authTags[:5]
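
The similarity matrix has one row and one column per input row, so its size grows with the square of the number of rows. The sketch below runs the same TF-IDF and cosine-similarity steps on a frame with one row per movie. Both the frame `toy_movies` and the idea of working per movie rather than per rating are assumptions made for illustration, not something this notebook does.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical frame with one row per movie and pipe-separated genres
toy_movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Action|Adventure", "Romance|Drama"],
})

# The default tokenizer splits on the pipe, so each genre becomes a term
tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                     min_df=1, stop_words="english")
matrix = tf.fit_transform(toy_movies["genres"].fillna(""))

# Pairwise similarity between all movies: an n x n dense array
sim = cosine_similarity(matrix, matrix)
print(sim.shape)
print(sim.round(2))
```

Movies with identical genre strings get a similarity of 1, and movies with no genre in common get 0.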

### Recommendations

In [63]:
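def content_generate_rating_estimate(movie_title, user, rating_data, k=20, threshold=0.0):
    def row_index(title):
        # `indices` can hold several rows per title (one per rating); any of
        # them works because rows with the same title share the same genres
        idx = indices[title]
        return int(idx.iloc[0]) if isinstance(idx, pd.Series) else int(idx)

    # Convert the movie title to a numeric index for our 
    # similarity matrix
    movie_idx = row_index(movie_title)
    neighbors = [] # <-- Stores our collection of similarity values 

    # Gather the similarity between each movie the user has rated
    # and the reference movie 
    for index, row in rating_data[rating_data['userId']==user].iterrows():
        # `indices` holds 0-based row positions, so no offset is applied
        sim = cosine_sim_authTags[movie_idx, row_index(row['title'])]
        neighbors.append((sim, row['rating']))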
    # Select the top-N values from our collection
    k_neighbors = heapq.nlargest(k, neighbors, key=lambda t: t[0])

    # Compute the weighted average using similarity scores and 
    # user item ratings. 
    simTotal, weightedSum = 0, 0
    for (simScore, rating) in k_neighbors:
        # Ensure that similarity ratings are above a given threshold
        if (simScore > threshold):
            simTotal += simScore
            weightedSum += simScore * rating
    try:
        predictedRating = weightedSum / simTotal
    except ZeroDivisionError:
        # Cold-start problem - No ratings given by user. 
        # We use the average rating for the reference item as a proxy in this case 
        predictedRating = np.mean(rating_data[rating_data['title']==movie_title]['rating'])
    return predictedRating
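
The prediction step in the function above is a similarity-weighted average over the top-k neighbours: the sum of similarity times rating, divided by the sum of similarities. A small worked sketch with made-up `(similarity, rating)` pairs (all values here are hypothetical) isolates that arithmetic:

```python
import heapq

# (similarity, rating) pairs; hypothetical values for illustration
neighbors = [(0.9, 4.0), (0.1, 1.0), (0.5, 3.0), (0.0, 5.0)]

k, threshold = 3, 0.0

# Keep the k pairs with the highest similarity
k_neighbors = heapq.nlargest(k, neighbors, key=lambda t: t[0])

# Weighted average over neighbours whose similarity exceeds the threshold
sim_total = sum(s for s, _ in k_neighbors if s > threshold)
weighted_sum = sum(s * r for s, r in k_neighbors if s > threshold)
predicted = weighted_sum / sim_total
print(round(predicted, 2))
```

The pair with similarity 0.0 falls outside the top three, and highly similar movies dominate the estimate.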

In [7]:
# Subset of ratings from user 314
table[table['userId'] == 314][3:10]

In [8]:
title = "Three Musketeers, The (1993)"
actual_rating = table[(table['userId'] == 314) & (table['title'] == title)]['rating'].values[0]
pred_rating = content_generate_rating_estimate(movie_title=title, user=314, rating_data=table)
print (f"Title - {title}")
print ("---")
print (f"Actual rating: \t\t {actual_rating}")
print (f"Predicted rating: \t {pred_rating}")