# Data analysis of the IMDb database

The dataset can be obtained from https://www.kaggle.com/karrrimba/movie-metadatacsv

We will use the pandas library for Python. Some tutorials can be found at https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Reading csv data
data = pd.read_csv("./Data/movie_metadata.csv")

In [None]:
data.head()

In [None]:
# Quick summary statistics of the numeric columns
data.describe()

In [None]:
# Drop every row that contains at least one NaN
data = data.dropna(how="any")

## Plotting data

In [None]:
# Keep only the movies with a non-zero number of Facebook likes
data_sample = data[data["movie_facebook_likes"] > 0]
data_sample.plot.scatter(x="movie_facebook_likes", y="gross")
plt.show()

### Question 1
Plot different variables and see whether you can spot some correlations. 
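
One possible starting point, before plotting, is to look at the Pearson correlation between pairs of columns. The sketch below runs on a small synthetic DataFrame named `toy` (a hypothetical stand-in: the column names mirror the dataset, but the values are invented) so that it is self-contained. With the real data, the same `.corr()` call could be applied to the numeric columns of `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: all values are invented for illustration.
rng = np.random.default_rng(0)
budget = rng.lognormal(mean=16.0, sigma=1.0, size=500)
toy = pd.DataFrame({
    "budget": budget,
    # gross loosely follows budget, with multiplicative noise
    "gross": budget * rng.lognormal(mean=0.0, sigma=0.5, size=500),
    # an unrelated column, for contrast
    "duration": rng.normal(loc=110.0, scale=20.0, size=500),
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))

# Heavy-tailed quantities are often easier to compare on a log scale
log_corr = np.corrcoef(np.log(toy["budget"]), np.log(toy["gross"]))[0, 1]
print("corr(log budget, log gross) =", round(log_corr, 2))
```

A correlation coefficient only measures linear association, so it complements the scatter plots rather than replacing them.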

## Gradient descent algorithm

We want to fit a linear regression of y = log(gross) on x = log(budget).

Linear regression rests on the hypothesis

$h_\theta (x) = \theta_0 + \theta_1 x$

with the cost function

$J(\theta_0, \theta_1) = \frac{1}{2N}
\sum_{i=1}^N \left( h_\theta (x^{(i)}) - y^{(i)} \right)^2$

The iterative procedure of the gradient descent algorithm is then 

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

with $\alpha$ the learning rate and

$\frac{\partial J}{\partial \theta_0} = \frac{1}{N}
\sum_{i=1}^N \left( h_\theta (x^{(i)}) - y^{(i)} \right)$

$\frac{\partial J}{\partial \theta_1} = \frac{1}{N}
\sum_{i=1}^N \left( h_\theta (x^{(i)}) - y^{(i)} \right)x^{(i)}$
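
The two partial derivatives above translate almost literally into NumPy. The following is a minimal sketch, not a reference solution: the helper names `cost` and `gradient` and the synthetic arrays `x_demo`, `y_demo` are assumptions made for illustration. It also compares the analytic gradient with a central finite difference, which is a cheap way to catch sign or indexing mistakes.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2N) * sum of squared residuals."""
    residual = theta0 + theta1 * x - y
    return np.mean(residual ** 2) / 2.0

def gradient(theta0, theta1, x, y):
    """Partial derivatives of J with respect to theta0 and theta1."""
    residual = theta0 + theta1 * x - y
    return np.mean(residual), np.mean(residual * x)

# Synthetic data, invented for illustration only
rng = np.random.default_rng(1)
x_demo = rng.normal(size=200)
y_demo = 2.0 + 0.5 * x_demo + rng.normal(scale=0.1, size=200)

# Analytic gradient at an arbitrary point (theta0, theta1) = (3, 1)
g0, g1 = gradient(3.0, 1.0, x_demo, y_demo)

# Central finite differences of J at the same point
eps = 1e-6
g0_num = (cost(3.0 + eps, 1.0, x_demo, y_demo)
          - cost(3.0 - eps, 1.0, x_demo, y_demo)) / (2 * eps)
g1_num = (cost(3.0, 1.0 + eps, x_demo, y_demo)
          - cost(3.0, 1.0 - eps, x_demo, y_demo)) / (2 * eps)

print("dJ/dtheta0:", g0, "finite difference:", g0_num)
print("dJ/dtheta1:", g1, "finite difference:", g1_num)
```

Because $J$ is quadratic in the parameters, the central difference should agree with the analytic value up to rounding error.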

In [None]:
# Log-transform budget and gross (np.log requires strictly positive values)
x = np.log(data["budget"].to_numpy())
y = np.log(data["gross"].to_numpy())
plt.plot(x,y,'.')
plt.show()

### Question 2
- Complete the following code to implement a gradient descent algorithm
- Check how the convergence varies with $\alpha$ (a plot of J as a function of t may help)
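
As one possible approach to the second point, the sketch below records $J$ at every step for several values of $\alpha$. It is an illustration under stated assumptions rather than the expected solution: the function name `run_gradient_descent`, the synthetic data and the list of step sizes are all invented here, and the sums over $i$ are written as vectorized NumPy means instead of an explicit inner loop.

```python
import numpy as np

def run_gradient_descent(x, y, alpha, T=100, theta0=3.0, theta1=1.0):
    """Batch gradient descent; returns the final parameters and the history of J."""
    history = []
    for t in range(T):
        residual = theta0 + theta1 * x - y          # h_theta(x) - y for all points
        history.append(np.mean(residual ** 2) / 2.0)
        grad0 = np.mean(residual)
        grad1 = np.mean(residual * x)
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1, np.array(history)

# Synthetic stand-in for (log budget, log gross); the values are invented.
rng = np.random.default_rng(2)
x_demo = rng.normal(loc=16.0, scale=1.5, size=1000)
y_demo = 1.0 + 0.95 * x_demo + rng.normal(scale=1.0, size=1000)

results = {}
for alpha in (0.0003, 0.003, 0.01):
    _, _, J_hist = run_gradient_descent(x_demo, y_demo, alpha)
    results[alpha] = J_hist
    print(f"alpha={alpha}: J[0]={J_hist[0]:.3g}, J[-1]={J_hist[-1]:.3g}")
```

Plotting each entry of `results` against the step index shows the convergence directly. On this synthetic data the largest step size should make $J$ grow instead of shrink, because the values of x are far from zero; the threshold on the real data may differ.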

In [None]:
T = 100  # number of steps
alpha = 0.003  # learning rate
theta0 = 3.  # initial value
theta1 = 1.  # initial value
N = len(x)  

#for t in range(T):
#    for i in range(N):
       

### Plotting the result

In [None]:
plt.plot(x,y,'.')
plt.plot(x,theta0 + theta1 * x)
plt.show()

### Question 3
The gradient descent algorithm can be improved.
- Implement the same gradient descent algorithm but with rescaled data
- Implement the algorithm with stochastic gradient descent (at each time step a single data point is picked at random and the parameters are updated using only that point, i.e. a batch of size 1)
- Implement the algorithm with mini-batches (at each time step n < N data points are picked at random and the parameters are updated using only those points)
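
The three variants can share a single loop, since stochastic gradient descent is the special case of a mini-batch of size 1. The sketch below is one way to write it, under assumptions that are not part of the exercise: the function name `minibatch_gd`, its default hyperparameters and the synthetic data are invented for illustration, and the inputs are expected to be NumPy arrays.

```python
import numpy as np

def minibatch_gd(x, y, alpha=0.1, T=500, batch_size=32, seed=0):
    """Mini-batch gradient descent; batch_size=1 is stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    N = len(x)
    for t in range(T):
        # draw a random batch of distinct indices
        idx = rng.choice(N, size=batch_size, replace=False)
        residual = theta0 + theta1 * x[idx] - y[idx]
        # residual is computed once, so both updates use the old parameters
        theta0 -= alpha * np.mean(residual)
        theta1 -= alpha * np.mean(residual * x[idx])
    return theta0, theta1

# Synthetic stand-in for (log budget, log gross); the values are invented.
rng = np.random.default_rng(3)
x_demo = rng.normal(loc=16.0, scale=1.5, size=1000)
y_demo = 1.0 + 0.95 * x_demo + rng.normal(scale=1.0, size=1000)

# Rescale x to zero mean and unit standard deviation
mu, sigma = np.mean(x_demo), np.std(x_demo)
x_scaled = (x_demo - mu) / sigma

t0, t1 = minibatch_gd(x_scaled, y_demo, batch_size=32)               # mini-batch
s0, s1 = minibatch_gd(x_scaled, y_demo, batch_size=1, alpha=0.01, T=5000)  # SGD

# Map the parameters back to the original x: y = t0 + t1 * (x - mu) / sigma
slope_mb, intercept_mb = t1 / sigma, t0 - t1 * mu / sigma
slope_sgd, intercept_sgd = s1 / sigma, s0 - s1 * mu / sigma
print("mini-batch:", slope_mb, intercept_mb)
print("SGD:       ", slope_sgd, intercept_sgd)
```

Rescaling makes the two parameter directions comparably sensitive, which is why a much larger learning rate can be used here than with the raw x. The random batches add noise to each update, so the final parameters fluctuate around the least-squares solution instead of settling exactly on it.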

In [None]:
x_rescaled = (x - np.mean(x)) / np.std(x)
#...

## Regression by direct method
The parameters $\theta_0$ and $\theta_1$ can also be obtained directly in closed form, where $\langle \cdot \rangle$ denotes the average over the $N$ data points:

$\theta_1 = \frac{\langle (x - \langle x \rangle)(y - \langle y \rangle) \rangle}{\langle (x - \langle x \rangle)^2 \rangle}$

$\theta_0 = \langle y \rangle - \theta_1 \langle x \rangle$
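
These two formulas are the covariance of x and y divided by the variance of x, followed by a shift that makes the line pass through the point of means. A minimal sketch on synthetic data (the arrays `x_demo` and `y_demo` are invented for illustration) can be cross-checked against `np.polyfit`, which performs a least-squares fit of a degree-1 polynomial:

```python
import numpy as np

# Synthetic stand-in for (log budget, log gross); the values are invented.
rng = np.random.default_rng(4)
x_demo = rng.normal(loc=16.0, scale=1.5, size=1000)
y_demo = 1.0 + 0.95 * x_demo + rng.normal(scale=1.0, size=1000)

# Direct formulas: averages over the data points
xm, ym = np.mean(x_demo), np.mean(y_demo)
theta1 = np.mean((x_demo - xm) * (y_demo - ym)) / np.mean((x_demo - xm) ** 2)
theta0 = ym - theta1 * xm

# Cross-check: np.polyfit returns coefficients from highest degree to lowest
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, deg=1)

print("direct: ", theta0, theta1)
print("polyfit:", intercept_ref, slope_ref)
```

Both approaches minimize the same cost $J$, so they should agree up to rounding error.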

### Question 4
Complete the code below to calculate $\theta_0$ and $\theta_1$ directly.

In [None]:
x = np.log(data["budget"])
xm = x.mean()
#...

### Question 5
Compare the different methods graphically and discuss why the results may be different. 