YouTube Data Analysis

YouTube allows billions of people to connect, inform, and inspire others across the globe using originally created videos.

In this project we analyze YouTube video data to identify the top 5 categories with the most uploaded videos. The dataset is gathered using the YouTube API and stored in the Hadoop Distributed File System (HDFS). A MapReduce job then processes the dataset to identify these categories.

Because we are configuring a master-slave architecture, we first apply the common changes to the Hadoop config files (i.e. those shared by both master and slave nodes) before distributing the Hadoop files to the rest of the machines/nodes. These changes are made on your single-node Hadoop setup. After that, we make the changes specific to the master and slave nodes respectively. Changes:
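The counting logic can be tried locally before involving a cluster. The sketch below is not the project's MapReduce code; it mimics the map, shuffle, and reduce stages with standard Unix tools. It assumes tab-separated records with the category in column 4, which is a hypothetical layout since the dataset format is not described here.

```shell
# Local stand-in for a category-count job: map -> shuffle -> reduce -> top N.
# ASSUMPTION: tab-separated records, category in column 4 (hypothetical layout).
top_categories() {
  file=$1
  n=${2:-5}
  cut -f4 "$file" |        # map: emit the category of each video
    sort |                 # shuffle: bring equal categories together
    uniq -c |              # reduce: count videos per category
    sort -k1,1nr -k2 |     # order by count, highest first
    head -n "$n" |
    awk '{ count = $1; $1 = ""; sub(/^ +/, ""); print $0 "\t" count }'
}

# Tiny made-up sample: id, uploader, age, category
sample=$(mktemp)
printf 'v1\tu1\t10\tMusic\nv2\tu2\t11\tComedy\nv3\tu1\t12\tMusic\n' > "$sample"
printf 'v4\tu3\t13\tSports\nv5\tu4\t14\tMusic\nv6\tu2\t15\tComedy\n' >> "$sample"

top_categories "$sample" 5
```

Each output line is a category name followed by a tab and its video count.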

Update core-site.xml

Update this file by changing the hostname from localhost to HadoopMaster.

To edit the file, run the command below:

ubuntu@hadoopmaster1:/opt/hadoop/etc/hadoop$ sudo gedit core-site.xml

Paste these lines inside the <configuration> tag, wrapped in a <property> element, or just update the existing entry by replacing localhost with HadoopMaster:

<name>fs.default.name</name>

<value>hdfs://HadoopMaster:9000</value>
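For reference, a complete core-site.xml with this setting might look like the file written below. This is a minimal sketch: the CONF_DIR variable is an assumption introduced here, and it defaults to a scratch directory so nothing under /opt/hadoop is overwritten until you point it there.

```shell
# Write a minimal core-site.xml.
# ASSUMPTION: CONF_DIR is a helper variable for this sketch; set
# CONF_DIR=/opt/hadoop/etc/hadoop to write into the real config directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://HadoopMaster:9000</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```

Newer Hadoop releases prefer the key fs.defaultFS and treat fs.default.name as a deprecated alias, so either name should work on Hadoop 2.x.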

Update hdfs-site.xml

Update this file by changing the replication factor from 1 to 3.

To edit the file, run the command below:

ubuntu@hadoopmaster1:/opt/hadoop/etc/hadoop$ sudo gedit hdfs-site.xml

Paste or update these lines inside the <configuration> tag, wrapped in a <property> element:

<name>dfs.replication</name>

<value>3</value>
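One way to confirm the edit took effect is to write the file and read the value back. The sketch below does both against a scratch directory; CONF_DIR and the get_property helper are names made up for this example, and the helper assumes the <name> and <value> elements sit on consecutive lines.

```shell
# ASSUMPTION: CONF_DIR is a helper variable for this sketch (scratch dir by default).
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF

# Hypothetical helper: print the value that follows a given <name> line.
# Only handles the simple layout above; it is not a general XML parser.
get_property() {
  grep -F -A1 "<name>$2</name>" "$1" |
    sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

get_property "$CONF_DIR/hdfs-site.xml" dfs.replication
```

On a machine with Hadoop installed, `hdfs getconf -confKey dfs.replication` should report the value Hadoop actually loaded.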

Update yarn-site.xml

Update the following three properties in this file by changing the hostname from localhost to HadoopMaster.

To edit the file, run the command below:

ubuntu@hadoopmaster1:/opt/hadoop/etc/hadoop$ sudo gedit yarn-site.xml

Paste or update these lines inside the <configuration> tag, with each name/value pair wrapped in its own <property> element:

<name>yarn.resourcemanager.resource-tracker.address</name>

<value>HadoopMaster:8025</value>

<name>yarn.resourcemanager.scheduler.address</name>

<value>HadoopMaster:8035</value>

<name>yarn.resourcemanager.address</name>

<value>HadoopMaster:8050</value>
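Since the three properties differ only in key and port, the whole file can be generated from a short list. The sketch below assumes the same hostname and ports as above; CONF_DIR and MASTER are helper variables introduced for this example, and the output goes to a scratch directory by default.

```shell
# ASSUMPTION: CONF_DIR and MASTER are helper variables for this sketch.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
MASTER="${MASTER:-HadoopMaster}"

{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<configuration>'
  # One <property> element per "key port" line in the list below.
  while read -r name port; do
    printf '  <property>\n    <name>%s</name>\n    <value>%s:%s</value>\n  </property>\n' \
      "$name" "$MASTER" "$port"
  done <<'EOF'
yarn.resourcemanager.resource-tracker.address 8025
yarn.resourcemanager.scheduler.address 8035
yarn.resourcemanager.address 8050
EOF
  echo '</configuration>'
} > "$CONF_DIR/yarn-site.xml"

cat "$CONF_DIR/yarn-site.xml"
```

Generating the file this way keeps the hostname in one place, which helps if the master is later renamed.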

Update mapred-site.xml

Update this file by updating and adding the following properties.

To edit the file, run the command below:

ubuntu@hadoopmaster1:/opt/hadoop/etc/hadoop$ sudo gedit mapred-site.xml

Paste or update these lines inside the <configuration> tag, with each name/value pair wrapped in its own <property> element:

<name>mapreduce.job.tracker</name>

<value>HadoopMaster:5431</value>

<name>mapreduce.framework.name</name>

<value>yarn</value>
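A complete mapred-site.xml along these lines is sketched below, followed by a quick scan for any config file that still mentions localhost. Note that it uses mapreduce.framework.name, the key Hadoop 2.x looks up to select YARN. CONF_DIR is a helper variable introduced for this example and defaults to a scratch directory.

```shell
# ASSUMPTION: CONF_DIR is a helper variable for this sketch (scratch dir by default).
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>HadoopMaster:5431</value>
  </property>
  <property>
    <!-- Key that Hadoop 2.x reads to choose the execution framework -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

# Flag any XML config in the directory that still points at localhost.
if grep -l 'localhost' "$CONF_DIR"/*.xml; then
  echo "the files listed above still reference localhost"
else
  echo "no localhost references left in $CONF_DIR"
fi
```

On Hadoop 2.x installs the directory may contain only mapred-site.xml.template, in which case the real file has to be created first.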

Update slaves

Update the list of slave nodes in the Hadoop cluster.

To edit the file, run the command below:

ubuntu@hadoopmaster1:/opt/hadoop/etc/hadoop$ sudo gedit slaves

Add the names of the slave nodes, one per line:

hadoopslave1
hadoopslave2
hadoopslave3
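The slaves file is plain text with one hostname per line, so it can also be written without an editor. A minimal sketch, using the three hostnames from this guide and a hypothetical CONF_DIR variable that defaults to a scratch directory:

```shell
# ASSUMPTION: CONF_DIR is a helper variable for this sketch; set
# CONF_DIR=/opt/hadoop/etc/hadoop to write the real file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# printf repeats the format for each argument, giving one host per line.
printf '%s\n' hadoopslave1 hadoopslave2 hadoopslave3 > "$CONF_DIR/slaves"

cat "$CONF_DIR/slaves"
```

Hadoop 3 renamed this file to workers, so the filename depends on the installed version.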

Format Namenode (Run on MasterNode) :

Run this command from the master node:

ubuntu@hadoopmaster1:/opt/hadoop$ hdfs namenode -format

Copy Hadoop distribution to other nodes:

sudo scp -r /opt/hadoop/ ubuntu@hadoopslave1:/opt/

sudo scp -r /opt/hadoop/ ubuntu@hadoopslave2:/opt/

sudo scp -r /opt/hadoop/ ubuntu@hadoopslave3:/opt/
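With more than a few slaves, a loop avoids retyping the command. The sketch below uses the recursive flag `-r` and prints the commands by default; copy_hadoop and DRY_RUN are names made up for this example, and setting DRY_RUN=0 would actually run the copies.

```shell
# Hypothetical helper: copy /opt/hadoop to each host given as an argument.
# DRY_RUN=1 (the default) only prints the commands that would be run.
copy_hadoop() {
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "sudo scp -r /opt/hadoop/ ubuntu@$host:/opt/"
    else
      sudo scp -r /opt/hadoop/ "ubuntu@$host:/opt/" || return 1
    fi
  done
}

copy_hadoop hadoopslave1 hadoopslave2 hadoopslave3
```

Stopping at the first failed copy makes it easier to see which node still needs the distribution.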

Starting Namenode, Datanode and ResourceManager:

start-all.sh

Check that Hadoop started as desired using the jps command.

Obtaining YouTube API access key
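jps prints one "pid ClassName" pair per line, so the check can be scripted. The sketch below reports which expected daemons are absent; missing_daemons is a hypothetical helper, and the daemon list is what a master node typically shows after start-all.sh (slaves typically show DataNode and NodeManager instead).

```shell
# Hypothetical helper: print each expected daemon that is absent from
# jps-style output ("<pid> <ClassName>" per line).
missing_daemons() {
  jps_output=$1
  shift
  for daemon in "$@"; do
    printf '%s\n' "$jps_output" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}

if command -v jps >/dev/null 2>&1; then
  # Daemons usually expected on the master node.
  missing_daemons "$(jps)" NameNode SecondaryNameNode ResourceManager
else
  echo "jps not found; is the JDK on PATH?"
fi
```

An empty result means every listed daemon is running on that node.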

Use the following link to obtain an API access key.

https://youtu.be/JbWnRhHfTDA

Install Node.js and node package manager

sudo apt-get update

sudo apt-get install nodejs

sudo apt-get install npm

After installing Node.js and npm, go to the WebContent folder and run the command below to download all the dependencies.

npm install

Run The Project

Run the Node.js server using the commands below.

cd YouTube-Data-Analysis/WebContent/

nodejs app.js

The project will be up and running at http://localhost:8080

Click the Get More Data option in the sidebar to fetch new data via the YouTube API. After the data is stored on the server, a script runs in the background to store the data in the Hadoop file system.

Click the Statistics option in the sidebar to run the Hadoop MapReduce algorithm on the data. The Analyze data button runs the script that starts the Hadoop MapReduce job. The result is displayed on the same webpage.
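The contents of the analysis script are not shown here, so the following is only a guess at the general shape such a script could take, not the project's actual analyzedata.sh. The jar path, main class, and HDFS paths are all hypothetical, and the sketch prints the commands by default instead of running them.

```shell
# ASSUMPTIONS (all hypothetical): jar path, main class, and HDFS paths.
JAR="${JAR:-hadoopjars/youtubecategory.jar}"
MAIN_CLASS="${MAIN_CLASS:-YoutubeCategory}"
INPUT="${INPUT:-/youtube/input}"
OUTPUT="${OUTPUT:-/youtube/output/category}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

analyze() {
  # A MapReduce job fails if its output directory already exists.
  run hdfs dfs -rm -r -f "$OUTPUT"
  run hadoop jar "$JAR" "$MAIN_CLASS" "$INPUT" "$OUTPUT"
  # The quoted glob is expanded by HDFS, not by the local shell.
  run hdfs dfs -cat "$OUTPUT/part-*"
}

analyze
```

With DRY_RUN=0 on a running cluster, the last command's output could be piped through sort and head to keep only the top entries.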

Description of each file

| Filename | Purpose | New/Modified | Comments |
|---|---|---|---|
| YoutubeCategory.java | Mapper Reducer code to get top 5 categories | New | Create JAR of this file to run in Hadoop system |
| YoutubeUploader.java | Mapper Reducer code to get top uploaders | New | Create JAR of this file to run in Hadoop system |
| YoutubeView.java | Mapper Reducer code to get most viewed videos | New | Create JAR of this file to run in Hadoop system |
| analyzedata.sh | Shell script to execute Hadoop commands | New | Merged Sorting commands in the file |
| getdata.sh | Shell script to copy the data file from server to HDFS | New | No Comments |
| app.js | Main configuration file to run the entire application | Modified | Changed client server communication from AJAX to socket.io |
| searchapi.js | Connect to YouTube data API to fetch data in a file | Modified | Changed callbacks and data to be fetched |

