## Multivariate Reduction Practice

This practice continues our discussion of multivariate data reduction, focusing on dimensionality reduction with principal component analysis (PCA). The code cells are partially complete; you may have to debug, modify, or complete them to generate the desired output.

**Load the data** into the `movies_data` data frame.

In [None]:
movies_data <- read.csv("/dsa/data/all_datasets/movies/movie_metadata.csv", header = TRUE, sep = ",")
head(movies_data)

Remove the rows that contain any NA values.

In [None]:
# Count number of rows in the dataset
nrow(movies_data)

# Omit rows from  the dataset that contain NA values
movies_data <- na.omit(movies_data)

# Count number of rows again in the dataset
nrow(movies_data)

# Create a new dataframe called less_data that keeps only the numeric columns of movies_data
less_data <- movies_data[sapply(movies_data, is.numeric)]
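The two cleaning steps above act on different axes: `na.omit()` drops rows, while a column filter such as `sapply(..., is.numeric)` drops non-numeric columns. A minimal sketch on a made-up data frame (`toy` and its columns are hypothetical and not part of the movies data):

```r
# Hypothetical toy data: one text column, two numeric columns, a few NAs
toy <- data.frame(
  title  = c("a", "b", "c", "d"),
  score  = c(7.1, NA, 6.4, 8.0),
  budget = c(10, 20, NA, 40),
  stringsAsFactors = FALSE
)

# Row-wise step: remove every row that has at least one NA
toy_complete <- na.omit(toy)

# Column-wise step: keep only the numeric columns
toy_numeric <- toy_complete[sapply(toy_complete, is.numeric)]

nrow(toy_complete)
names(toy_numeric)
```

Filtering with `is.numeric` should behave the same whether the text columns were read in as factors or as character strings.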

#### Correlation Matrix


In [None]:
cor(less_data) # get the correlations for less_data

The `cor()` function returns a matrix of correlation coefficients, one for every pair of variables in the dataset. 
A variable's correlation with itself is always 1.
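One way to read a large correlation matrix is to pull out a single column and rank it. The sketch below uses the built-in `mtcars` data as a stand-in for less_data, with `mpg` as an arbitrary target; the names `cors`, `target`, and `ranked` are chosen here for illustration only.

```r
# mtcars ships with R; it stands in for less_data in this sketch
cors <- cor(mtcars)

# The diagonal holds each variable's correlation with itself
diag(cors)

# Rank the remaining variables by absolute correlation with one target column
target <- "mpg"
others <- setdiff(rownames(cors), target)
ranked <- sort(abs(cors[others, target]), decreasing = TRUE)
head(ranked, 5)
```

Ranking by absolute value keeps strong negative correlations near the top, which a plain sort would push to the bottom.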

The variables most correlated with imdb_score are:

- movie_facebook_likes
- num_user_for_reviews
- num_voted_users
- num_critic_for_reviews
- duration

<div>
    <ul>
        <br>
        <li> 
            <span style="color:#cc1652">cast_total_facebook_likes</span> has a strong positive correlation with the <span style="color:#d38032">actor_1_facebook_likes</span>, and has smaller positive correlation with both <span style="color:#30a5d3">actor_2_facebook_likes</span> and <span style="color:#08cc6d">actor_3_facebook_likes</span>
        </li>
        <br>
        <li>
            <span style="color:#d32e44">movie_facebook_likes</span> has a strong correlation with <span style="color:#8934c1">num_critic_for_reviews</span>, suggesting that a movie's popularity on social networks goes hand in hand with the attention it receives from critics (the correlation alone does not show which one drives the other)
        </li>
        <br>
        <li> <span style="color:#d32e44">movie_facebook_likes</span> has a decent amount of correlation with the <span style="color:#1d48d3">num_voted_users</span>
        </li>
        <br>
        <li> <span style="color:#cc1652">gross</span> has a strong positive correlation with the <span style="color:#1d48d3">num_voted_users</span>
        </li>
    </ul>
</div>


Contradicting correlations
---------------------------

<div>
    <ul>
        <br>
        <li> 
            <span style="color:#cc1652">imdb_score</span> has a very small positive correlation with <span style="color:#08cc6d">director_facebook_likes</span>, so we can't guarantee that a popular director's movie will be great.
        </li>
        <br>
        <li>
            <span style="color:#cc1652">imdb_score</span> has a very small positive correlation with <span style="color:#d32e44">actor_1_facebook_likes</span>. Just as with a famous director, we can't guarantee that a popular actor's movie will be great.
        </li>
        <br>
        <li> <span style="color:#cc1652">imdb_score</span> has a small but positive correlation with <span style="color:#d32e44">duration</span>. Highly rated movies tend to be longer in duration.
        </li>
        <br>
        <li> <span style="color:#cc1652">num_voted_users</span> and <span style="color:#1d48d3">num_user_for_reviews</span> demonstrate a small positive correlation. Maybe more reviews are made on good movies.
        </li>
        <br>
        <li> <span style="color:#cc1652">imdb_score</span> has almost no correlation with <span style="color:#11c627">budget</span>. Big budget movies will not necessarily turn out great
        </li>
    </ul>
</div>

**Question 1.a:** Which correlations surprise you and/or seem interesting for investigation?

Let's continue with PCA. As we saw in the lab notebook, the variables must be standardized first, because PCA is sensitive to the scale of each variable. 

**Question 2:** Use the scale() function to standardize the numeric variables of movies_data (held in less_data) and assign the result to a variable called standard_vars.

In [None]:
standard_vars <- as.data.frame(scale(less_data))
dim(standard_vars)
head(standard_vars)
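To see what `scale()` does, the sketch below standardizes a small made-up data frame (`toy` is hypothetical) and checks that each column ends up with mean 0 and standard deviation 1.

```r
# Hypothetical toy data with two columns on very different scales
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 30, 20, 50, 40))

toy_std <- as.data.frame(scale(toy))

# After standardizing, column means are (numerically) 0 and sds are 1
col_means <- colMeans(toy_std)
col_sds   <- sapply(toy_std, sd)
round(col_means, 10)
col_sds

# The same result by hand for column a: subtract the mean, divide by the sd
manual_a <- (toy$a - mean(toy$a)) / sd(toy$a)
all.equal(toy_std$a, manual_a)
```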

**Question 3:** Run the prcomp() function on the standard_vars created above and assign the result to movies_data_pca.

In [None]:
# Compute the principal components. Run the prcomp() function on the standardized variables created above.
movies_data_pca <- prcomp(standard_vars)

In [None]:
help(prcomp)

**If you go to the help page for `prcomp` you will find in the details section,**

`The calculation is done by a singular value decomposition of the (centered and scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy.`

For `princomp()` you will see,

`The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp.`
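The two routes described in the help pages should agree up to numerical error. As a rough check, the sketch below uses the built-in `mtcars` data (a stand-in, not the movies data) and compares the component variances from `prcomp()`, from a hand-rolled SVD, and from `eigen()` on the correlation matrix.

```r
# Centered and scaled data matrix (mtcars is only a stand-in dataset)
X <- scale(mtcars)
n <- nrow(X)

# SVD route, as used inside prcomp()
var_prcomp <- prcomp(X)$sdev^2

# The same SVD done by hand: squared singular values divided by n - 1
var_svd <- svd(X)$d^2 / (n - 1)

# Eigen route: eigenvalues of the correlation matrix
var_eigen <- eigen(cor(mtcars))$values

all.equal(var_prcomp, var_svd)
all.equal(var_prcomp, var_eigen)
```

On well-conditioned data like this the results match closely; the numerical advantage of the SVD shows up mainly when variables are nearly collinear.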

In [None]:
summary(movies_data_pca)

In [None]:
screeplot(movies_data_pca, type="lines")
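The proportions reported by `summary()` and drawn by `screeplot()` can be recomputed from the `sdev` element of a `prcomp` result. A minimal sketch on the built-in `mtcars` data, where the 80% threshold is an arbitrary choice made for illustration:

```r
# PCA on a stand-in dataset, standardizing inside prcomp()
pca <- prcomp(mtcars, scale. = TRUE)

# Each component's variance is its squared standard deviation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)
round(cum_explained, 3)

# Smallest number of components whose cumulative share reaches the threshold
k <- which(cum_explained >= 0.80)[1]
k
```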

**Question 4:** What are your observations from the plot below? Write a few words about how you interpret the points and the vectors.

In [None]:
biplot(movies_data_pca) 

Look at the dimensions of the PCA we ran. We are interested in the x part of movies_data_pca for the dimensions.

In [None]:
dim(movies_data_pca$x)
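The `x` element holds one row per observation and one column per component. It is the standardized data projected onto the loadings stored in `rotation`. A sketch on the built-in `mtcars` data (a stand-in; `std` and `scores_manual` are names chosen for this example):

```r
std <- scale(mtcars)
pca <- prcomp(std)

dim(pca$x)         # observations x components
dim(pca$rotation)  # variables x components

# Scores are the (centered) data multiplied by the loading matrix
scores_manual <- std %*% pca$rotation
all.equal(unname(scores_manual), unname(pca$x))
```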

**Question 5:** Fit a multiple regression model to predict imdb_score on less_data using the **first 4 principal components created above**. 

In [None]:
# movies_data_pca$x is a matrix whose columns are the principal component scores. You can access
# individual components using column subscripts [,1], [,2], [,3], and so on

fit = lm(less_data$imdb_score ~ <YOUR CODE HERE>)
summary(fit)
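For reference, the general pattern of regressing on principal components can be sketched on a different dataset. The example below uses the built-in `mtcars` data with `mpg` as the response and only two components. It also leaves the response out of the PCA, which is a choice made for this sketch and not necessarily how the cells above are set up. The names `predictors`, `pcs`, and `pc_fit` are illustrative.

```r
# PCA on the predictors only (mpg is held out as the response)
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
pca <- prcomp(predictors, scale. = TRUE)

# Collect the first two component scores alongside the response
pcs <- as.data.frame(pca$x[, 1:2])
pcs$mpg <- mtcars$mpg

# Regress the response on the selected components
pc_fit <- lm(mpg ~ PC1 + PC2, data = pcs)
pc_r2 <- summary(pc_fit)$r.squared
pc_r2
```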

Let's try to fit a multiple linear regression model using the most correlated variables we found.

**Question 6.a:** Fit a multiple regression model on movies_data to predict imdb_score using variables movie_facebook_likes, num_user_for_reviews, num_voted_users, num_critic_for_reviews and duration.

In [None]:
fit1=lm(<YOUR CODE HERE>,
       data=movies_data)
summary(fit1)

**Question 6.b:** Compare the $R^2$ values of models fit1 and fit. Write your opinion about the models.
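The $R^2$ values can be read off the printed summaries, or extracted programmatically so that models sit side by side. A sketch with two illustrative models on the built-in `mtcars` data (`small_fit` and `full_fit` are hypothetical names, unrelated to fit and fit1):

```r
# Two nested models on a stand-in dataset
small_fit <- lm(mpg ~ wt, data = mtcars)
full_fit  <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Pull R-squared and adjusted R-squared out of each summary
comparison <- data.frame(
  model         = c("small_fit", "full_fit"),
  r_squared     = c(summary(small_fit)$r.squared, summary(full_fit)$r.squared),
  adj_r_squared = c(summary(small_fit)$adj.r.squared, summary(full_fit)$adj.r.squared)
)
comparison
```

Adjusted $R^2$ penalizes extra predictors, so it is usually the fairer number when the models use different numbers of variables.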

**Question 7:** Build a model to predict imdb_score using all the independent features of less_data.

In [None]:
fit2=lm(<YOUR CODE HERE>)
summary(fit2)

**Question 8:** Compare the model built using the first 4 principal components to the models built using the variables in the datasets less_data and movies_data. 

**Question 9.a:** Run the factanal() function to generate 2 factors for less_data.  

In [None]:
factors <- factanal(<YOUR CODE HERE>)
factors


**Question 9.b:** Look at the factor loadings; can you group the variables by the two factors, as in the lab notebooks? 
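As an illustration of grouping variables by their loadings, the sketch below runs `factanal()` on simulated data with a known two-factor structure. Everything in it (`sim`, `x1` to `x6`, the 0.8 and 0.6 weights, the seed) is made up for the example and unrelated to the movies data.

```r
set.seed(42)
n  <- 500
f1 <- rnorm(n)  # first hidden factor
f2 <- rnorm(n)  # second hidden factor

# x1-x3 are driven by f1, x4-x6 by f2, each with independent noise
sim <- data.frame(
  x1 = 0.8 * f1 + 0.6 * rnorm(n),
  x2 = 0.8 * f1 + 0.6 * rnorm(n),
  x3 = 0.8 * f1 + 0.6 * rnorm(n),
  x4 = 0.8 * f2 + 0.6 * rnorm(n),
  x5 = 0.8 * f2 + 0.6 * rnorm(n),
  x6 = 0.8 * f2 + 0.6 * rnorm(n)
)

fa <- factanal(sim, factors = 2)

# Convert the loadings to a plain matrix: one row per variable, one column per factor
loadings_mat <- unclass(fa$loadings)

# Assign each variable to the factor on which it loads most strongly
group <- apply(abs(loadings_mat), 1, which.max)
split(names(group), group)
```

Real data rarely separates this cleanly, so variables with similar loadings on both factors deserve a closer look before being assigned to a group.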

# SAVE YOUR NOTEBOOK