# A Netflix Tour of Data Science - Film suggestion by diffusion on graphs
# Signal - Vote average

### Students:

    * Team     : 17
    * Students : Edwige Avignon, Kenneth Nguyen, Pierre Fourcade  
    * Dataset  : Kaggle dataset - Films and Crew

## About this notebook:

This notebook extracts and exports the signal used in the project: the average vote of each film (`vote_average`).

Note that when we computed the adjacency matrices, we did not remove the isolated nodes, which usually, and in our case, carry no meaningful information.
However, keeping them does no harm and makes the signal extraction more convenient: we simply take the full signal from the dataset, with no need to reorder or remove any node.
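
To illustrate why keeping isolated nodes is harmless, here is a minimal sketch on a toy graph. The matrix `adjacency` and the values in `toy_signal` are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical 4-node adjacency matrix: node 3 has no edges (isolated).
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

# Hypothetical signal, one value per node, in the same order as the rows.
toy_signal = np.array([6.1, 7.3, 5.8, 4.0])

# A node is isolated when its degree (row sum) is zero.
degrees = adjacency.sum(axis=1)
isolated = np.flatnonzero(degrees == 0)

# Because no node was dropped, toy_signal[i] still belongs to node i.
assert len(toy_signal) == adjacency.shape[0]
print(isolated)  # [3]
```

Had the isolated node been removed from the graph, the matching entry would also have to be removed from the signal to keep both aligned.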

## 0 - Libraries and dataset

In [1]:
import numpy as np
import pandas as pd
import pygsp as pg
import networkx as nx
import matplotlib.pyplot as plt
import ast
from collections import Counter

In [2]:
credits = pd.read_csv('Dataset_Exports/tmdb_5000_credits.csv')
movies = pd.read_csv('Dataset_Exports/tmdb_5000_movies.csv')

## 1 - Signal - Vote average

Let's extract the signal from the dataset.

In [3]:
movie_vote_average = np.array(movies.vote_average)

We export the signal in two formats: a CSV file that we can load in Gephi and a plain-text file to work with PyGSP.
We also export a binned version of the signal. As the average vote ranges from 0 to 10, we create 11 bins (0, 1, 2, ..., 10) and assign each vote to the nearest integer.

The binned signal is more convenient for visualization.

In [4]:
# movie_vote_average is already a NumPy array, so we copy it as floats
# instead of filling a zero array element by element.
signal = movie_vote_average.astype(float)

In [5]:
np.savetxt('Dataset_Exports/Vote_average_base_signal.txt', signal)

df_signal = pd.DataFrame(signal)
df_signal.to_csv('Dataset_Exports/Vote_average_base_signal.csv')
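
As a sanity check, the two exported files can be read back and compared with the original array. The sketch below uses a temporary directory and made-up values (`toy_signal`) instead of the real export paths; the file names are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical values standing in for the real vote averages.
toy_signal = np.array([7.2, 6.9, 0.0, 8.1])

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, 'toy_signal.txt')
    csv_path = os.path.join(tmp_dir, 'toy_signal.csv')

    # Same export calls as in the cell above.
    np.savetxt(txt_path, toy_signal)
    pd.DataFrame(toy_signal).to_csv(csv_path)

    # The text file loads back as a 1-D float array.
    signal_from_txt = np.loadtxt(txt_path)

    # The CSV stores the row index in its first column
    # and the values in a column named '0'.
    signal_from_csv = pd.read_csv(csv_path, index_col=0)['0'].to_numpy()

print(np.allclose(signal_from_txt, toy_signal))
print(np.allclose(signal_from_csv, toy_signal))
```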

In [6]:
# We place the signal into the bins described:

signal_bin = np.zeros(len(signal))

for i in range(len(signal_bin)):
    if movie_vote_average[i] < 0.5:
        signal_bin[i] = 0
    elif movie_vote_average[i] >= 0.5 and movie_vote_average[i] < 1.5:
        signal_bin[i] = 1
    elif movie_vote_average[i] >= 1.5 and movie_vote_average[i] < 2.5:
        signal_bin[i] = 2
    elif movie_vote_average[i] >= 2.5 and movie_vote_average[i] < 3.5:
        signal_bin[i] = 3
    elif movie_vote_average[i] >= 3.5 and movie_vote_average[i] < 4.5:
        signal_bin[i] = 4
    elif movie_vote_average[i] >= 4.5 and movie_vote_average[i] < 5.5:
        signal_bin[i] = 5
    elif movie_vote_average[i] >= 5.5 and movie_vote_average[i] < 6.5:
        signal_bin[i] = 6
    elif movie_vote_average[i] >= 6.5 and movie_vote_average[i] < 7.5:
        signal_bin[i] = 7
    elif movie_vote_average[i] >= 7.5 and movie_vote_average[i] < 8.5:
        signal_bin[i] = 8
    elif movie_vote_average[i] >= 8.5 and movie_vote_average[i] < 9.5:
        signal_bin[i] = 9
    else:
        signal_bin[i] = 10
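
The chain of conditions above amounts to rounding each vote to the nearest integer, with halves rounded up. A vectorized sketch of the same rule is shown below on made-up values (`toy_votes`); it assumes the votes contain no missing values, since a `NaN` would fall into bin 10 in the loop but stay `NaN` here.

```python
import numpy as np

# Hypothetical vote averages chosen to sit on and around the bin edges.
toy_votes = np.array([0.0, 0.4, 0.5, 1.4, 2.5, 6.9, 9.4, 9.5, 10.0])

# Round half up (0.5 -> 1, 2.5 -> 3), then clip to the 0..10 range of the bins.
toy_bins = np.clip(np.floor(toy_votes + 0.5), 0, 10)

# Resulting bins: 0, 0, 1, 1, 3, 7, 9, 10, 10
print(toy_bins)
```

`np.rint` or `np.round` would not reproduce the loop exactly, because they round halves to the nearest even integer (0.5 becomes 0 and 2.5 becomes 2).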

In [7]:
np.savetxt('Dataset_Exports/Vote_average_base_signal_bin.txt', signal_bin)

df_signal_bin = pd.DataFrame(signal_bin)
df_signal_bin.to_csv('Dataset_Exports/Vote_average_base_signal_bin.csv')

We also add the binned signal as a `vote_average0` column to each of the node tables made with Gephi.

In [8]:
Graph_Cast_Nodes = pd.read_csv('Dataset_Exports/Cast/Graph_Cast_Nodes.csv')
Graph_Cast_Nodes['vote_average0'] = signal_bin
Graph_Cast_Nodes.to_csv('Dataset_Exports/Cast/Graph_Cast_Nodes.csv')

In [9]:
Graph_First_Role_Nodes = pd.read_csv('Dataset_Exports/First_Role/Graph_First_Role_Nodes.csv')
Graph_First_Role_Nodes['vote_average0'] = signal_bin
Graph_First_Role_Nodes.to_csv('Dataset_Exports/First_Role/Graph_First_Role_Nodes.csv')

In [10]:
Graph_Genres_Nodes = pd.read_csv('Dataset_Exports/Genres/Graph_Genres_Nodes.csv')
Graph_Genres_Nodes['vote_average0'] = signal_bin
Graph_Genres_Nodes.to_csv('Dataset_Exports/Genres/Graph_Genres_Nodes.csv')

In [11]:
Graph_Crew_Nodes = pd.read_csv('Dataset_Exports/Crew/Graph_Crew_Nodes.csv')
Graph_Crew_Nodes['vote_average0'] = signal_bin
Graph_Crew_Nodes.to_csv('Dataset_Exports/Crew/Graph_Crew_Nodes.csv')
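
The four cells above repeat the same read, assign and write steps. A minimal sketch of a helper that factors them out is given below. The function name `add_signal_column`, the length check and the `index=False` argument are assumptions added for illustration, and the demonstration runs on a temporary file with made-up node data rather than on the real exports.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def add_signal_column(nodes_path, values, column='vote_average0'):
    """Add `values` as a column of the node table stored at `nodes_path`."""
    nodes = pd.read_csv(nodes_path)
    # Guard against a node table whose length differs from the signal.
    if len(nodes) != len(values):
        raise ValueError(
            f'{nodes_path}: {len(nodes)} nodes but {len(values)} signal values'
        )
    nodes[column] = values
    # Assumption: index=False avoids writing an extra unnamed index column
    # each time the cell is rerun.
    nodes.to_csv(nodes_path, index=False)
    return nodes


# Toy demonstration on a temporary file with hypothetical node data.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'Toy_Nodes.csv')
    pd.DataFrame({'Id': [0, 1, 2], 'Label': ['a', 'b', 'c']}).to_csv(
        toy_path, index=False
    )

    toy_bins = np.array([6.0, 7.0, 5.0])
    add_signal_column(toy_path, toy_bins)
    add_signal_column(toy_path, toy_bins)  # a rerun leaves the columns unchanged
    toy_result = pd.read_csv(toy_path)

print(list(toy_result.columns))
```

With the real files, the helper could be called once per graph (`Cast`, `First_Role`, `Genres`, `Crew`) with `signal_bin` as the values. Whether dropping the index column suits the Gephi import should be checked before adopting it.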