<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Note" data-toc-modified-id="Note-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Note</a></span></li><li><span><a href="#Load-data-and-the-rvest-package" data-toc-modified-id="Load-data-and-the-rvest-package-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load data and the <code>rvest</code> package</a></span></li><li><span><a href="#Parse-data-via-tags-and-classes" data-toc-modified-id="Parse-data-via-tags-and-classes-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Parse data via tags and classes</a></span><ul class="toc-item"><li><span><a href="#Parse-Ranking" data-toc-modified-id="Parse-Ranking-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Parse Ranking</a></span></li><li><span><a href="#Parse-Titles" data-toc-modified-id="Parse-Titles-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Parse Titles</a></span></li><li><span><a href="#Parse-Movie-descriptions" data-toc-modified-id="Parse-Movie-descriptions-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Parse Movie descriptions</a></span></li><li><span><a href="#Parse-Runtime" data-toc-modified-id="Parse-Runtime-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Parse Runtime</a></span></li><li><span><a href="#Parse-IMDB-ratings" data-toc-modified-id="Parse-IMDB-ratings-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Parse IMDB ratings</a></span></li><li><span><a href="#Parse-Number-of-Votes" data-toc-modified-id="Parse-Number-of-Votes-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Parse Number of Votes</a></span></li><li><span><a href="#Parse-Director-Name" data-toc-modified-id="Parse-Director-Name-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span>Parse Director Name</a></span></li><li><span><a href="#Parse-Lead-Actor-Name" data-toc-modified-id="Parse-Lead-Actor-Name-3.8"><span class="toc-item-num">3.8&nbsp;&nbsp;</span>Parse Lead Actor Name</a></span></li><li><span><a href="#Parse-Metascore-Ratings" 
data-toc-modified-id="Parse-Metascore-Ratings-3.9"><span class="toc-item-num">3.9&nbsp;&nbsp;</span>Parse Metascore Ratings</a></span></li><li><span><a href="#Parse-Gross-Revenue" data-toc-modified-id="Parse-Gross-Revenue-3.10"><span class="toc-item-num">3.10&nbsp;&nbsp;</span>Parse Gross Revenue</a></span></li></ul></li><li><span><a href="#Combine-all-lists-to-form-one-data-frame" data-toc-modified-id="Combine-all-lists-to-form-one-data-frame-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Combine all lists to form one data frame</a></span></li></ul></div>

# Note
**Credit** <br>
Written by my team member Zoiz Bouikidis. <br>
Edited, reorganized, and commented on by me. 

**Order** <br>
This web-scraping workflow requires a fair amount of cleaning for each column, most of which is simple. 
We ran this task in R with a step-by-step description; the Python version would be more streamlined.

**Why this relates to the comparison between Python and R**
<br>The major difference between the two languages (and packages) is in how `rvest` can rely primarily on tags and classes to do its work, while Python's `bs4` requires the user to understand the webpage's structure. This gives R a higher score for the beginner persona we decided upon.  

# Load data and the `rvest` package

In [36]:
install.packages('rvest')
#Loading the rvest package
library('rvest')

Installing package into 'C:/Users/linhd/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
"unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
"package 'rvest' is in use and will not be installed"

We will be focusing on this IMDB list of the best comedy films of all time. <br>
First we retrieve the webpage to be scraped by providing a URL.

In [37]:
# Specifying the url for desired website to be scraped
url <- 'https://www.imdb.com/list/ls000729643/'
# Reading the HTML code from the website
webpage <- read_html(url)

Let's take a look at what we get back. We see that the webpage was retrieved as an XML document.

In [38]:
print(webpage)

{xml_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="styleguide-v2" class="fixed">\n\n            <img height="1" wi ...


# Parse data via tags and classes

## Parse Ranking

In [39]:
#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')
#Converting the ranking data to text
rank_data <- html_text(rank_data_html)
#Let's have a look at the rankings
head(rank_data)

In [40]:
#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)
#Let's have another look at the rankings
head(rank_data)
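
A minimal sketch of what this conversion does, using made-up strings rather than the scraped values (the vector `example_ranks` is hypothetical): `as.numeric()` turns any text it cannot parse into `NA` with a warning, so counting the `NA`s afterwards is a cheap sanity check.

```r
# Hypothetical rank strings; the last one is deliberately malformed
example_ranks <- c("1.", "2.", "3.", "n/a")

# as.numeric() parses "1." as 1 and returns NA (with a warning) for "n/a"
example_ranks_num <- suppressWarnings(as.numeric(example_ranks))

# Count the values that failed to convert
sum(is.na(example_ranks_num))
```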

## Parse Titles

In [41]:
#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')
#Converting the title data to text
title_data <- html_text(title_data_html)
#Let's have a look at the title
head(title_data)

## Parse Movie descriptions

In [42]:
#Using CSS selectors to scrape the description section
description_data_html <- html_nodes(webpage,'.ratings-metascore+ p')
#Converting the description data to text
description_data <- html_text(description_data_html)
#Let's have a look at the description data
head(description_data)

In [43]:
#Data-Preprocessing: removing '\n'
description_data<-gsub("\n","",description_data)
#Let's have another look at the description data
head(description_data)
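
The structure output of the final data frame suggests that the descriptions still carry leading spaces after the newline is removed. One possible extra step, sketched here on a made-up raw string (`raw_description` is hypothetical), is to follow `gsub()` with `trimws()`.

```r
# Hypothetical raw description with a leading newline and padding
raw_description <- "\n    Singer Dewey Cox overcomes adversity to become a musical legend."

# Remove the newline, then strip leading and trailing whitespace
clean_description <- trimws(gsub("\n", "", raw_description))
clean_description
```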

## Parse Runtime

In [44]:
#Using CSS selectors to scrape the movie runtime section
runtime_data_html <- html_nodes(webpage,'.runtime')
#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)
#Let's have a look at the runtime
head(runtime_data)

In [45]:
#Data-Preprocessing: removing mins and converting it to numerical
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
#Let's have another look at the runtime data
head(runtime_data)
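
As an alternative sketch, the suffix can be dropped by keeping only the digits, which does not depend on the exact spelling of the unit. The input vector below is illustrative and assumes the scraped text looks like `"119 min"`.

```r
# Illustrative runtime strings in the assumed "<number> min" form
runtime_example <- c("119 min", "94 min", "96 min")

# Remove every non-digit character, then convert to numeric
runtime_minutes <- as.numeric(gsub("[^0-9]", "", runtime_example))
runtime_minutes
```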

## Parse IMDB ratings

In [46]:
#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ipl-rating-star__rating')
#Converting the ratings data to text
rating_data <- html_text(rating_data_html)
#Let's have a look at the ratings
head(rating_data)

In [47]:
remove(rating_data)
#Using CSS selectors to scrape the IMDB rating section, restricted to the small star widget
rating_data_html <- html_nodes(webpage,'.ipl-rating-star.small .ipl-rating-star__rating')
#Converting the ratings data to text
rating_data <- html_text(rating_data_html)
#Let's have a look at the ratings
head(rating_data)

In [48]:
#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)
#Let's have another look at the ratings data
head(rating_data)
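
To illustrate why the narrower selector helps, here is a small sketch on invented markup. The HTML snippet and the class `ipl-rating-interactive` are assumptions made for the example, not copied from the IMDB page. A class-only selector matches every element with that class, while a descendant selector limits the match to elements nested inside a specific parent.

```r
library(rvest)

# Hypothetical markup: one list item with two widgets sharing an inner class
snippet <- read_html('
<div class="lister-item">
  <div class="ipl-rating-star small">
    <span class="ipl-rating-star__rating">7.0</span>
  </div>
  <div class="ipl-rating-star ipl-rating-interactive">
    <span class="ipl-rating-star__rating">Rate</span>
  </div>
</div>')

# Broad selector: matches both spans
broad <- html_text(html_nodes(snippet, ".ipl-rating-star__rating"))

# Narrow selector: only spans inside an element that has both classes
narrow <- html_text(html_nodes(snippet, ".ipl-rating-star.small .ipl-rating-star__rating"))

length(broad)
as.numeric(narrow)
```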

## Parse Number of Votes

In [49]:
#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.text-muted+ span:nth-child(2)')
votes_data_html

{xml_nodeset (53)}
 [1] <span></span>
 [2] <span></span>
 [3] <span></span>
 [4] <span name="nv" data-value="314206">314,206</span>
 [5] <span name="nv" data-value="308168">308,168</span>
 [6] <span name="nv" data-value="62649">62,649</span>
 [7] <span name="nv" data-value="246430">246,430</span>
 [8] <span name="nv" data-value="668365">668,365</span>
 [9] <span name="nv" data-value="226641">226,641</span>
[10] <span name="nv" data-value="329041">329,041</span>
[11] <span name="nv" data-value="206931">206,931</span>
[12] <span name="nv" data-value="203704">203,704</span>
[13] <span name="nv" data-value="131449">131,449</span>
[14] <span name="nv" data-value="203635">203,635</span>
[15] <span name="nv" data-value="90946">90,946</span>
[16] <span name="nv" data-value="137799">137,799</span>
[17] <span name="nv" data-value="64535">64,535</span>
[18] <span name="nv" data-value="109795">109,795</span>
[19] <span name="nv" data-value="49147">49,147</span>
[20] <span name="nv" data-value="532

In [50]:
#Converting the votes data to text
votes_data <- html_text(votes_data_html)

#Let's have a look at the votes data
head(votes_data)

In [51]:
remove(votes_data)
remove(votes_data_html)
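
We dropped the votes here, but if they were needed, the text could be cleaned roughly as follows. This is only a sketch: the vector mimics the first few entries of the node set printed above, and it assumes the empty spans are stray matches rather than placeholders for films.

```r
# Text as it would come out of the first six nodes shown above
votes_example <- c("", "", "", "314,206", "308,168", "62,649")

# Drop the empty strings produced by the empty <span> nodes
votes_nonempty <- votes_example[votes_example != ""]

# Remove the thousands separators and convert to numeric
votes_numeric <- as.numeric(gsub(",", "", votes_nonempty))
votes_numeric
```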

## Parse Director Name

In [52]:
#Using CSS selectors to scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted a:nth-child(1)')
#Converting the directors data to text
directors_data <- html_text(directors_data_html)
#Let's have a look at the directors data
head(directors_data)

In [53]:
#Data-Preprocessing: converting directors data into factors
directors_data<-as.factor(directors_data)
#Using CSS selectors to scrape the actors section
actors_data_html <- html_nodes(webpage,'.ghost+ a')
#Converting the actors data to text
actors_data <- html_text(actors_data_html)
#Let's have a look at the actors data
head(actors_data)

## Parse Lead Actor Name

In [54]:
actors_data_html

{xml_nodeset (50)}
 [1] <a href="/name/nm0005562/?ref_=ttls_li_st_0">Owen Wilson</a>
 [2] <a href="/name/nm0002071/?ref_=ttls_li_st_0">Will Ferrell</a>
 [3] <a href="/name/nm0000604/?ref_=ttls_li_st_0">John C. Reilly</a>
 [4] <a href="/name/nm0002071/?ref_=ttls_li_st_0">Will Ferrell</a>
 [5] <a href="/name/nm0302108/?ref_=ttls_li_st_0">Zach Galifianakis</a>
 [6] <a href="/name/nm0515296/?ref_=ttls_li_st_0">Ron Livingston</a>
 [7] <a href="/name/nm0000120/?ref_=ttls_li_st_0">Jim Carrey</a>
 [8] <a href="/name/nm0000196/?ref_=ttls_li_st_0">Mike Myers</a>
 [9] <a href="/name/nm0000196/?ref_=ttls_li_st_0">Mike Myers</a>
[10] <a href="/name/nm0001774/?ref_=ttls_li_st_0">Ben Stiller</a>
[11] <a href="/name/nm0005561/?ref_=ttls_li_st_0">Luke Wilson</a>
[12] <a href="/name/nm0000409/?ref_=ttls_li_st_0">Brendan Fraser</a>
[13] <a href="/name/nm0000134/?ref_=ttls_li_st_0">Robert De Niro</a>
[14] <a href="/name/nm0000134/?ref_=ttls_li_st_0">Robert De Niro</a>
[15] <a href="/name/nm0000188/?ref_=t

In [55]:
#Data-Preprocessing: converting actors data into factors
actors_data<-as.factor(actors_data)

## Parse Metascore Ratings

In [56]:
#Using CSS selectors to scrape the metascore section
metascore_data_html <- html_nodes(webpage,'.metascore')
#Converting the metascore data to text
metascore_data <- html_text(metascore_data_html)
#Let's have a look at the metascore
head(metascore_data)

In [57]:
#Data-Preprocessing: removing extra space in metascore
metascore_data<-gsub(" ","",metascore_data)

#Lets check the length of metascore data
length(metascore_data)
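
The length check matters because `data.frame()` needs every column to have the same number of rows, and a film without a Metascore would make this vector shorter than the others. One way to keep the alignment, sketched on invented markup (the container class `lister-item-content` is an assumption, not taken from the scraped page), is to select each film's container first and then call `html_node()` (singular), which returns one result per container and a missing value where nothing matches.

```r
library(rvest)

# Hypothetical markup: three list items, the second has no Metascore
snippet <- read_html('
<div class="lister-item-content"><span class="metascore">64</span></div>
<div class="lister-item-content"></div>
<div class="lister-item-content"><span class="metascore">63</span></div>')

# One node per film
items <- html_nodes(snippet, ".lister-item-content")

# html_node() keeps one entry per item; html_text() gives NA for the missing one
metascore_per_item <- html_text(html_node(items, ".metascore"))
metascore_num <- as.numeric(metascore_per_item)
metascore_num
```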

## Parse Gross Revenue

In [58]:
#Using CSS selectors to scrape the gross revenue section
gross_data_html <- html_nodes(webpage,'.text-muted .ghost~ .text-muted+ span')

#Converting the gross revenue data to text
gross_data <- html_text(gross_data_html)

#Let's have a look at the gross revenue data
head(gross_data)

In [59]:
# Data-Preprocessing: removing '$' and 'M' signs
gross_data<-gsub("M","",gross_data)
head(gross_data)
# Drop the leading '$' by keeping characters 2 through 6
gross_data<-substring(gross_data,2,6)

In [60]:
# Show the cleaning process
head(gross_data)
length(gross_data)
summary(gross_data)

   Length     Class      Mode 
       50 character character 

In [61]:
#Data-Preprocessing: converting gross to numerical
gross_data<-as.numeric(gross_data)
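
Note that `substring(x, 2, 6)` keeps at most five characters, so a string such as `"$209.22"` ends up as `209.2`. A sketch of an alternative that strips the two symbols by pattern instead of by position is shown below. The input strings are illustrative and assume the `"$<number>M"` form implied by the cleaning step above.

```r
# Illustrative gross strings in the assumed "$<number>M" form
gross_example <- c("$209.22M", "$85.29M", "$18.32M")

# Position-based approach after removing "M": cut to characters 2..6
substring(gsub("M", "", gross_example), 2, 6)

# Pattern-based approach: remove '$' and 'M' wherever they occur
gross_millions <- as.numeric(gsub("[$M]", "", gross_example))
gross_millions
```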

# Combine all lists to form one data frame

In [62]:
#Combining all the lists to form a data frame
movies_df <- data.frame(Rank = rank_data,
                        Title = title_data,
                        Description = description_data,
                        Runtime = runtime_data,
                        Rating = rating_data,
                        Metascore = metascore_data,
                        Gross_Earning_in_Mil = gross_data,
                        Director = directors_data,
                        Actor = actors_data)
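
In R versions before 4.0, `data.frame()` converts character vectors to factors by default, which is why `Title` and `Description` show up as factors in the structure output. If plain text columns are preferred, a sketch like the following (with a small made-up data frame, `example_df`) keeps them as characters.

```r
# Small illustrative vectors, not the scraped data
titles <- c("Wedding Crashers", "Step Brothers")
ranks <- c(1, 4)

# stringsAsFactors = FALSE keeps character columns as character
example_df <- data.frame(Rank = ranks, Title = titles, stringsAsFactors = FALSE)
str(example_df)
```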

In [63]:
#Structure of the data frame
str(movies_df)

'data.frame':	50 obs. of  9 variables:
 $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Title               : Factor w/ 50 levels "¡Three Amigos!",..: 48 8 47 36 41 31 16 9 10 35 ...
 $ Description         : Factor w/ 50 levels "    6 Los Angeles celebrities are stuck in James Franco's house after a series of devastating events just destr"| __truncated__,..: 32 35 36 43 40 41 37 2 24 48 ...
 $ Runtime             : num  119 94 96 98 100 89 107 89 95 101 ...
 $ Rating              : num  7 7.2 6.8 6.9 7.7 7.8 7.3 7 6.6 6.1 ...
 $ Metascore           : Factor w/ 30 levels "26","37","41",..: 16 15 15 8 23 18 3 8 12 11 ...
 $ Gross_Earning_in_Mil: num  209.2 85.3 18.3 100.4 277.3 ...
 $ Director            : Factor w/ 34 levels "Adam McKay","Ben Stiller",..: 8 1 16 1 33 29 31 17 17 33 ...
 $ Actor               : Factor w/ 38 levels "Anna Faris","Ben Stiller",..: 26 36 15 36 38 30 12 25 25 2 ...


In [64]:
head(movies_df,10)

| Rank | Title | Description | Runtime | Rating | Metascore | Gross_Earning_in_Mil | Director | Actor |
|---|---|---|---|---|---|---|---|---|
| 1 | Wedding Crashers | John Beckwith and Jeremy Grey, a pair of committed womanizers who sneak into weddings to take advantage of the romantic tinge in the air, find themselves at odds with one another when John meets and falls for Claire Cleary. | 119 | 7.0 | 64 | 209.2 | David Dobkin | Owen Wilson |
| 2 | Anchorman: The Legend of Ron Burgundy | Ron Burgundy is San Diego's top-rated newsman in the male-dominated broadcasting of the 1970s, but that's all about to change for Ron and his cronies when an ambitious woman is hired as a new anchor. | 94 | 7.2 | 63 | 85.29 | Adam McKay | Will Ferrell |
| 3 | Walk Hard: The Dewey Cox Story | Singer Dewey Cox overcomes adversity to become a musical legend. | 96 | 6.8 | 63 | 18.32 | Jake Kasdan | John C. Reilly |
| 4 | Step Brothers | Two aimless middle-aged losers still living at home are forced against their will to become roommates when their parents marry. | 98 | 6.9 | 51 | 100.4 | Adam McKay | Will Ferrell |
| 5 | The Hangover | Three buddies wake up from a bachelor party in Las Vegas, with no memory of the previous night and the bachelor missing. They make their way around the city in order to find their friend before his wedding. | 100 | 7.7 | 73 | 277.3 | Todd Phillips | Zach Galifianakis |
| 6 | Office Space | Three company workers who hate their jobs decide to rebel against their greedy boss. | 89 | 7.8 | 68 | 10.82 | Mike Judge | Ron Livingston |
| 7 | Dumb and Dumber | The cross-country adventures of 2 good-hearted but incredibly stupid friends. | 107 | 7.3 | 41 | 127.1 | Peter Farrelly | Jim Carrey |
| 8 | Austin Powers: International Man of Mystery | A 1960s secret agent is brought out of cryofreeze to oppose his greatest enemy in the 1990s, where his social attitudes are glaringly out of place. | 89 | 7.0 | 51 | 53.88 | Jay Roach | Mike Myers |
| 9 | Austin Powers: The Spy Who Shagged Me | Dr. Evil is back and has invented a new time machine that allows him to go back to the 1960s and steal Austin Powers' mojo, inadvertently leaving him "shagless". | 95 | 6.6 | 59 | 206.0 | Jay Roach | Mike Myers |
| 10 | Starsky & Hutch | Two streetwise cops bust criminals in their red and white Ford Gran Torino, with the help of a police snitch called "Huggy Bear". | 101 | 6.1 | 55 | 88.24 | Todd Phillips | Ben Stiller |


**Now we have a well-structured data frame ready to be exported and analyzed.**
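
As a sketch of the export step, the data frame could be written to a CSV file with base R's `write.csv()`. The example below uses a tiny stand-in data frame and a temporary file (`example_df` and `out_file` are illustrative names); in practice `movies_df` and a real path would take their place.

```r
# Tiny stand-in for movies_df
example_df <- data.frame(
  Rank = c(1, 2),
  Title = c("Wedding Crashers", "Anchorman: The Legend of Ron Burgundy"),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file without the row-name column
out_file <- tempfile(fileext = ".csv")
write.csv(example_df, out_file, row.names = FALSE)

# Read it back to confirm the round trip
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
reloaded
```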