<a id=section1></a>
# 1. Problem Statement

The objective is to predict the value of the “Views” attribute from the given features of the test data.

<a id=section2></a>
# 2. Importing Packages

In [None]:
import numpy as np                     

import pandas as pd

# To suppress pandas warnings.
pd.set_option('mode.chained_assignment', None) 

# To display all the data in each column (None means no truncation)
pd.set_option('display.max_colwidth', None)

# To display up to 10000 rows
pd.set_option('display.max_rows', 10000)

# To display every column of the dataset in head()
pd.options.display.max_columns = 100               

import warnings
warnings.filterwarnings('ignore')     

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# To apply seaborn styles to the plots.
import seaborn as sns
sns.set(style='whitegrid', font_scale=1.3, color_codes=True)      

<a id=section3></a>

# 3. Loading Data

In [None]:
# loading data from csv file to a data frame
df_train = pd.read_csv('./data/train.csv', index_col = "Video_id")
df_test = pd.read_csv('./data/test.csv', index_col = "Video_id")

print(df_train.shape)
print(df_test.shape)

(3198, 20)
(1335, 17)


**NOTE :** We observe a mismatch in the number of columns between train and test. Let's find out which columns differ.

In [None]:
set(df_train.columns) - set(df_test.columns)

{'Unnamed: 19', 'Unnamed: 20', 'views'}

**NOTE :**

1. **views** is the target variable, so we can ignore it.
2. Let's understand the remaining columns, i.e. 'Unnamed: 19' and 'Unnamed: 20'.

In [None]:
df_train['Unnamed: 19'].value_counts()

2544.0    1
Name: Unnamed: 19, dtype: int64

In [None]:
df_train['Unnamed: 20'].value_counts()

False    1
Name: Unnamed: 20, dtype: int64

**NOTE :** We observe that **Unnamed: 19** and **Unnamed: 20** have a value in only one row. Let's delete that row and drop both columns so that the train data is clean.

In [None]:
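# NOTE: 'Unnamed: 19' stores the number 2544.0, so the comparison with the string
# '2544.0' is True for every row and this filter keeps all rows (see the shape below).
df_train = df_train[(df_train['Unnamed: 19'] != '2544.0') | (df_train['Unnamed: 20'] != False)]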
df_train.drop(['Unnamed: 19', 'Unnamed: 20'], axis = 1, inplace = True)
df_train.shape

(3198, 18)
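
The intent of the note above, removing the single row in which the two unnamed columns are populated, can be sketched on a toy frame. The data here is hypothetical; only the column names `Unnamed: 19` and `Unnamed: 20` come from the notebook, and the sketch assumes those columns are missing everywhere except in the malformed row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the two 'Unnamed' column names come from the notebook.
toy = pd.DataFrame(
    {
        "views": [100, 200, 300],
        "Unnamed: 19": [np.nan, 2544.0, np.nan],
        "Unnamed: 20": [np.nan, False, np.nan],
    },
    index=["a", "b", "c"],
)

extra_cols = ["Unnamed: 19", "Unnamed: 20"]

# A row is treated as malformed if any of the extra columns holds a value.
malformed = toy[extra_cols].notna().any(axis=1)

# Keep the well-formed rows, then drop the now-empty columns.
toy_clean = toy.loc[~malformed].drop(columns=extra_cols)
print(toy_clean.shape)
```

Testing for missing values with `notna()` avoids comparing against a literal such as `'2544.0'`, whose type may not match the column's dtype.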

In [None]:
# Adding new column 'Is_Test_Data' so that we can easily separate train and test 
# data during prediction process
df_train['Is_Test_Data'] = 0



df_test['Is_Test_Data'] = 1

# concat train and test data for data pre processing
df_views_video = pd.concat([df_train,df_test])

del df_train
del df_test

df_views_video.head()

Video_id,Is_Test_Data,Tag_count,Trend_day_count,Trend_tag_count,category_id,channel_title,comment_count,comment_disabled,description,dislike,like dislike disabled,likes,publish_date,subscriber,tag appered in title,tags,title,trending_date,views
HDR9SQc79,0,21,6.0,6,22,CaseyNeistat,,falSE,SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\n'Diamond Veins (Blowsom remix)' by French 79 http://hyperurl.co/DiamondVeinsRMX\n'Moon' by Kid Francescoli http://hyperurl.co/KID_PlayMeAgain\n\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\nwith this lens -- http://amzn.to/2rUJOmD\nbig drone - http://tinyurl.com/h4ft3oy\nOTHER GEAR --- http://amzn.to/2o3GLX5\nSony CAMERA http://amzn.to/2nOBmnv\nOLD CAMERA,6089,falSE,13342,2017-11-13,9086142.0,False,SHANtell martin,WE WANT TO TALK ABOUT OUR MARRIAGE,2017-11-20,1978978
KNH52UF?48,0,23,1.0,1,24,LastWeekTonight,116266.0,TrUe,"One year after the presidential election, John Oliver discusses what we've learned so far and enlists our catheter cowboy to teach Donald Trump what he hasn't.\n\nConnect with Last Week Tonight online...\n\nSubscribe to the Last Week Tonight YouTube channel for more almost news as it almost happens: www.youtube.com/user/LastWeekTonight\n\nFind Last Week Tonight on Facebook like your mom would: http://Facebook.com/LastWeekTonight\n\nFollow us on Twitter for news about jokes and jokes about news: http://Twitter.com/LastWeekTonight\n\nVisit our official site for all that other stuff at once: http://www.hbo.com/lastweektonight",3044,FaLSE,5761,2017-11-13,5937292.0,False,last week tonight trump presidency|last week tonight donald trump|john oliver trump|donald trump,The Trump Presidency: Last Week Tonight with John Oliver (HBO),2017-11-20,1487870
QTW28IRG36,0,22,10.0,3,23,Rudy Mancuso,257850.0,true,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confirmation=1\n\nTHANKS FOR WATCHING! LIKE & SUBSCRIBE FOR MORE VIDEOS!\n-----------------------------------------------------------\nFIND ME ON: \nInstagram | http://instagram.com/rudymancuso\nTwitter | http://twitter.com/rudymancuso\nFacebook | http://facebook.com/rudymancuso\n\nCAST: \nRudy Mancuso | http://youtube.com/c/rudymancuso\nLele Pons | http://youtube.com/c/lelepons\nKing Bach | https://youtube.com/user/BachelorsPadTv\n\nVideo Effects: \nCaleb Natale | https://instagram.com/calebnatale\n\nPA:\nPaulina Gregory\n\n\nShots Studios Channels:\nAlesso | https://youtube.com/c/alesso\nAnitta | http://youtube.com/c/anitta\nAnwar Jibawi | http://youtube.com/c/anwar\nAwkward Puppets | http://youtube.com/c/awkwardpuppets\nHannah Stocking | http://youtube.com/c/hannahstocking\nInanna Sarkis | http://youtube.com/c/inanna\nLele Pons | http://youtube.com/c/lelepons\nMaejor | http://youtube.com/c/maejor\nMike Tyson | http://youtube.com/c/miketyson \nRudy Mancuso | http://youtube.com/c/rudymancuso\nShots Studios | http://youtube.com/c/shots\n\n#Rudy\n#RudyMancuso,0,TRUE,0,2017-11-12,4191209.0,True,racist superman|rudy|mancuso|king|bach|racist|superman|love|rudy mancuso poo bear black white official music video|iphone x by pineapple|lelepons|hannahstocking|rudymancuso|inanna|anwar|sarkis|shots|shotsstudios|alesso|anitta|brazil|Getting My Driver's License | Lele Pons,"Racist Superman | Rudy Mancuso, King Bach & Lele Pons",2017-11-20,1502102
MGL76WI]26,0,17,12.0,5,24,Good Mythical Morning,263939.0,true,Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\nDon't miss an all new Ear Biscuits: https://goo.gl/xeZNQt\nWatch Part 4: https://youtu.be/MhCdiiB8CQg | Watch Part 2: https://youtu.be/7qiOrNao9fg\nWatch today's episode from the start: http://bit.ly/GMM1218\n\nPick up all of the official GMM merch only at https://mythical.store\n\nFollow Rhett & Link: \nInstagram: https://instagram.com/rhettandlink\nFacebook: https://facebook.com/rhettandlink\nTwitter: https://twitter.com/rhettandlink\nTumblr: https://rhettandlink.tumblr.com\nSnapchat: @realrhettlink\nWebsite: https://mythical.co/\n\nCheck Out Our Other Mythical Channels:\nGood Mythical MORE: https://youtube.com/goodmythicalmore\nRhett & Link: https://youtube.com/rhettandlink\nThis Is Mythical: https://youtube.com/thisismythical\nEar Biscuits: https://applepodcasts.com/earbiscuits\n\nWant to send us something? https://mythical.co/contact\nHave you made a Wheel of Mythicality intro video? 
Submit it here: https://bit.ly/GMMWheelIntro\n\nIntro Animation by Digital Twigs: https://www.digitaltwigs.com\nIntro & Outro Music by Jeff Zeigler & Sarah Schimeneck https://www.jeffzeigler.com\nWheel of Mythicality theme: https://www.royaltyfreemusiclibrary.com/\nAll Supplemental Music fromOpus 1 Music: https://opus1.sourceaudio.com/\nWe use ‘The Mouse’ by Blue Microphones https://www.bluemic.com/mouse/,0,True,0,2017-11-13,13186408.0,True,rhett and link|gmm|good mythical morning|rhett and link good mythical morning|good mythical morning rhett and link|mythical morning|Season 12|nickelback lyrics|nickelback lyrics real or fake|nickelback|nickelback songs|nickelback song|rhett link nickelback|gmm nickelback|lyrics (website category)|nickelback (musical group)|rock|music|lyrics|chad kroeger|music (industry)|mythical|gmm challenge|comedy|funny|the betrayal|the betrayal act III|how you remind me,Nickelback Lyrics: Real or Fake?,2017-11-20,3519302
TWP93KXT70,0,15,11.0,7,224,nigahiga,268085.0,True,"I know it's been a while since we did this show, but we're back with what might be the best episode yet!\nLeave your dares in the comment section! \n\nOrder my book how to write good \nhttp://higatv.com/ryan-higas-how-to-write-good-pre-order-links/\n\nJust Launched New Official Store\nhttps://www.gianthugs.com/collections/ryan\n\nHigaTV Channel\nhttp://www.youtube.com/higatv\n\nTwitter\nhttp://www.twitter.com/therealryanhiga\n\nFacebook\nhttp://www.facebook.com/higatv\n\nWebsite\nhttp://www.higatv.com\n\nInstagram\nhttp://www.instagram.com/notryanhiga\n\nSend us mail or whatever you want here!\nPO Box 232355\nLas Vegas, NV 89105",0,TRUE,0,2017-11-12,20563106.0,True,ryan|higa|higatv|nigahiga|i dare you|idy|rhpc|dares|no truth|comments|comedy|funny|stupid|fail,I Dare You: GOING BALD!?,2017-11-19,4835374
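
Once preprocessing is done, the `Is_Test_Data` flag lets the combined frame be split back into its two parts. The following is a minimal sketch on hypothetical miniature frames (the `likes` and `views` values are made up), not the notebook's own split code.

```python
import pandas as pd

# Hypothetical miniature train/test frames; the test frame has no target column.
train = pd.DataFrame({"likes": [10, 20], "views": [100, 200]}, index=["t1", "t2"])
test = pd.DataFrame({"likes": [30]}, index=["s1"])

train["Is_Test_Data"] = 0
test["Is_Test_Data"] = 1

# Test rows get NaN in 'views' because the test frame lacks that column.
combined = pd.concat([train, test], sort=False)

# Recover the two parts after the shared preprocessing.
train_part = combined[combined["Is_Test_Data"] == 0].drop(columns="Is_Test_Data")
test_part = combined[combined["Is_Test_Data"] == 1].drop(columns=["Is_Test_Data", "views"])

print(train_part.shape, test_part.shape)
```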


<a id=section301></a>
## 3.1 Description of the Datasets

#### a. Check shape

In [None]:
#shape of data
df_views_video.shape

(4533, 19)

#### b. info

Video_id:- ID of the uploaded video

Publish_date:- Date when the video was published

Trending_date:- Last date the video was trending in a top 5 spot

Category_id:- Category the video belongs to

Channel_title:- Name of the channel

Subscriber:- Number of people subscribed to the channel

Title:- Title of the uploaded video

Tags:- Tags that appear with the video

Description:- Description of the video

Trend_day_count:- Number of days the video was trending

Tag_count:- Number of tags on the video

Trend_tag_count:- Number of trending tags among the video's tags

Tag appered in title:- Whether a tag appears in the video title (the column is spelled `tag appered in title` in the data)

views:- Total views on the video after 1 week.





In [None]:
df_views_video.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4533 entries, HDR9SQc79 to LOI64QVq10
Data columns (total 19 columns):
Is_Test_Data             4533 non-null int64
Tag_count                4532 non-null object
Trend_day_count          4532 non-null float64
Trend_tag_count          4532 non-null object
category_id              4526 non-null object
channel_title            4530 non-null object
comment_count            4533 non-null object
comment_disabled         4533 non-null object
description              4443 non-null object
dislike                  4533 non-null object
like dislike disabled    4533 non-null object
likes                    4533 non-null object
publish_date             4531 non-null object
subscriber               4502 non-null float64
tag appered in title     4532 non-null object
tags                     4324 non-null object
title                    4530 non-null object
trending_date            4531 non-null object
views                    3198 non-null object
dtypes: f

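**Observations :**  

1. Several columns have missing values, most notably `tags` (4324 of 4533 non-null), `description` (4443) and `subscriber` (4502).
2. `views` has only 3198 non-null values because the test rows carry no target.
3. Count-like columns such as `Tag_count`, `likes` and `dislike` are stored as `object`, so their datatypes need fixing.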
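
The missing-value picture can also be summarised per column instead of being read off `info()`. The sketch below uses a hypothetical toy frame; the column names mirror the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are invented.
toy = pd.DataFrame(
    {
        "tags": ["a|b", None, "c", None],
        "subscriber": [1.0, 2.0, np.nan, 4.0],
        "likes": ["1", "2", "3", "4"],
    }
)

# Count missing values per column and express them as a percentage of all rows.
missing = toy.isna().sum()
summary = pd.DataFrame(
    {"missing": missing, "percent": (missing / len(toy) * 100).round(2)}
)

# Show only the columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```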

#### c. describe

In [None]:
df_views_video.describe()

,Is_Test_Data,Trend_day_count,subscriber
count,4533.0,4532.0,4502.0
mean,0.294507,7.534863,3571822.0
std,0.455871,66.006302,24202160.0
min,0.0,0.0,0.0
25%,0.0,4.0,242880.0
50%,0.0,7.0,1195770.0
75%,1.0,10.0,3766915.0
max,1.0,4444.0,1576229000.0


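**Observations :** There appear to be outliers: `Trend_day_count` has a maximum of 4444 against a 75th percentile of 10, and `subscriber` has a maximum of about 1.58e9 against a 75th percentile of about 3.77e6. Let's confirm this with pandas profiling in the next step.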

 <a id=section302></a>
## 3.2 Pandas Profiling before Data Preprocessing

In [None]:
# To install pandas profiling please run this command.

#!pip install folium==0.2.1
#!pip install pandas-profiling --upgrade

In [None]:
import pandas_profiling

# Running pandas profiling to get better understanding of data
pandas_profiling.ProfileReport(df_views_video)

 <a id=section4></a>
# 4. Data Preprocessing

 <a id=section401></a>
## 4.1 Remove columns with least variance in data

###  **Unnamed: 19** and **Unnamed: 20** hold a value in only one row. Let's drop them

# errors='ignore' keeps this step safe to run when the columns are already gone
df_views_video.drop(['Unnamed: 19', 'Unnamed: 20'], axis = 1, inplace=True, errors='ignore')
df_views_video.shape

**NOTE :** We can observe that country has only one value, i.e. AU. Let's drop it.

 <a id=section402></a>
## 4.2 Fixing datatypes of columns

In [None]:
df_views_video.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4533 entries, HDR9SQc79 to LOI64QVq10
Data columns (total 19 columns):
Is_Test_Data             4533 non-null int64
Tag_count                4532 non-null object
Trend_day_count          4532 non-null float64
Trend_tag_count          4532 non-null object
category_id              4526 non-null object
channel_title            4530 non-null object
comment_count            4533 non-null object
comment_disabled         4533 non-null object
description              4443 non-null object
dislike                  4533 non-null object
like dislike disabled    4533 non-null object
likes                    4533 non-null object
publish_date             4531 non-null object
subscriber               4502 non-null float64
tag appered in title     4532 non-null object
tags                     4324 non-null object
title                    4530 non-null object
trending_date            4531 non-null object
views                    3198 non-null object
dtypes: f

In [None]:
df_views_video.head()



#### a. Tag_count



In [None]:
df_views_video['Tag_count'].value_counts()

24                                                                             217
12                                                                             202
23                                                                             199
13                                                                             185
21                                                                             185
14                                                                             181
16                                                                             180
10                                                                             178
11                                                                             177
17                                                                             177
18                                                                             172
25                                                                             171
8   

**NOTE :** We can see a string value and a few missing values in **Tag_count**. Let's replace them with the median.
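
As a compact alternative to matching the offending string literally, the same cleanup can be sketched with `pd.to_numeric(..., errors="coerce")`, which turns anything non-numeric into `NaN` so that a single `fillna` handles both cases. The series below is hypothetical and this is not the notebook's own approach.

```python
import pandas as pd

# Hypothetical raw column with numeric strings, a stray text value and a gap.
raw = pd.Series(["21", "17", "not a number", None, 9], name="Tag_count")

# Non-numeric entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

# Median of the valid values, truncated to an int as in the notebook.
median = int(numeric.median())

# Fill every gap with the median and store the column as integers.
cleaned = numeric.fillna(median).astype(int)
print(cleaned.tolist())
```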

In [None]:
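try:
    df_views_video['Tag_count'] = df_views_video['Tag_count'].astype(int)
except (ValueError, TypeError) as e:
    # raised when a value (a stray string or a missing value) cannot be converted to int
    print(e) 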

In [None]:
# extracting out valid Tag_count values
Tag_count_valid = df_views_video[(df_views_video['Tag_count'] != 'alissa ashley|alissa ashley makeup|hooded eye makeup|makeup for hooded eyes') & (~df_views_video['Tag_count'].isna())]['Tag_count']

# converting dtype from object to int
Tag_count_int = Tag_count_valid.astype(int)

# calculating median
Tag_count_median = Tag_count_int.median()
Tag_count_median = int(Tag_count_median)

Tag_count_median

17

In [None]:
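# replacing string value with median (only in the Tag_count column, not the whole row)
bad_tag_count = df_views_video['Tag_count'] == 'alissa ashley|alissa ashley makeup|hooded eye makeup|makeup for hooded eyes'
df_views_video.loc[bad_tag_count, 'Tag_count'] = Tag_count_median

# filling missing values with median and storing the result
df_views_video['Tag_count'] = df_views_video['Tag_count'].fillna(Tag_count_median)
df_views_video['Tag_count']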

Video_id
HDR9SQc79     21
KNH52UF?48    23
QTW28IRG36    22
MGL76WI]26    17
TWP93KXT70    15
JDJ37HWR29    9 
INH29DD?32    17
ENJ69DGJ93    14
ZXD32BTa68    20
NCA33YGN27    8 
BQQ21ZVm59    22
SEF40YEp58    10
OQU60INj45    22
QTI12DNb53    18
SHP12LIj51    10
TAE9URl23     22
MWG80DHU39    17
TVY50JW\54    18
QPK66WVC92    23
ZGJ23BH=92    19
LNA70LIB36    22
IOR89PLQ18    23
JLP11BVk71    9 
TKQ4JYs6      25
GIV34ORj61    18
SAS76OI?59    23
CCF5FMU29     25
WOD95ZZm76    17
ACP17YA_30    15
ZUW72GLK1     21
              ..
KHE42DN<84    17
NDB28CR_31    22
VSG82ZXG61    19
VHU1ZB^45     10
YKP34OMn11    13
VSM44VFf37    17
QDN61PSW4     15
USK49BSc65    24
DDK71RPA71    13
RSP25NOo83    22
TWE53OFA61    15
NMB51QRF66    11
LVD1DXU24     8 
UDK4FIS26     14
EXX13RVl47    9 
IBE44QMk63    23
BZA90UEF14    16
NZG46OKW19    21
LEN76WCg39    15
SQP66IXG95    16
ZVI84ZDn45    17
WEG65XHT54    15
WJD70UKo20    12
ZQJ67GM<38    12
EJB69XNC26    21
UII58AGX12    10
ECJ91UNP40    8 
AQR71

#### b. Trend_tag_count          

In [None]:
df_views_video['Trend_tag_count'].value_counts()

6       475
3       473
5       461
2       460
4       450
7       446
1       429
2       206
4       201
7       191
6       187
5       187
3       182
1       181
>       1  
9903    1  
17      1  
Name: Trend_tag_count, dtype: int64

**NOTE :** We can see a string value and a few missing values in **Trend_tag_count**. Let's replace them with the median.

In [None]:
# extracting out valid Trend_tag_count values
Trend_tag_count_valid = df_views_video[(df_views_video['Trend_tag_count'] != '>') & (~df_views_video['Trend_tag_count'].isna())]['Trend_tag_count']

# converting dtype from object to int
Trend_tag_count_int = Trend_tag_count_valid.astype(int)

# calculating median
Trend_tag_count_median = Trend_tag_count_int.median()
Trend_tag_count_median = int(Trend_tag_count_median)

Trend_tag_count_median

4

In [None]:
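# replacing string value with median (only in the Trend_tag_count column, not the whole row)
bad_trend_tag_count = df_views_video['Trend_tag_count'] == '>'
df_views_video.loc[bad_trend_tag_count, 'Trend_tag_count'] = Trend_tag_count_median

# filling missing values with median and storing the result
df_views_video['Trend_tag_count'] = df_views_video['Trend_tag_count'].fillna(Trend_tag_count_median)
df_views_video['Trend_tag_count']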

Video_id
HDR9SQc79     6
KNH52UF?48    1
QTW28IRG36    3
MGL76WI]26    5
TWP93KXT70    7
JDJ37HWR29    1
INH29DD?32    5
ENJ69DGJ93    7
ZXD32BTa68    1
NCA33YGN27    2
BQQ21ZVm59    3
SEF40YEp58    2
OQU60INj45    4
QTI12DNb53    7
SHP12LIj51    2
TAE9URl23     4
MWG80DHU39    3
TVY50JW\54    6
QPK66WVC92    3
ZGJ23BH=92    1
LNA70LIB36    2
IOR89PLQ18    4
JLP11BVk71    4
TKQ4JYs6      1
GIV34ORj61    4
SAS76OI?59    1
CCF5FMU29     4
WOD95ZZm76    5
ACP17YA_30    7
ZUW72GLK1     1
             ..
KHE42DN<84    1
NDB28CR_31    2
VSG82ZXG61    1
VHU1ZB^45     5
YKP34OMn11    7
VSM44VFf37    1
QDN61PSW4     3
USK49BSc65    5
DDK71RPA71    3
RSP25NOo83    1
TWE53OFA61    2
NMB51QRF66    5
LVD1DXU24     1
UDK4FIS26     6
EXX13RVl47    1
IBE44QMk63    5
BZA90UEF14    1
NZG46OKW19    1
LEN76WCg39    4
SQP66IXG95    6
ZVI84ZDn45    5
WEG65XHT54    6
WJD70UKo20    1
ZQJ67GM<38    2
EJB69XNC26    2
UII58AGX12    1
ECJ91UNP40    1
AQR71GB@63    1
PPD49TIn30    2
LOI64QVq10    6
Name: Trend_tag

#### c. comment_count 

#### d. Change Popularity column from string to integer

In [None]:
#converting string to int
df_views_songs['Popularity'] = df_views_songs['Popularity'].astype('int64')

#### e. Change Timestamp column to pandas datetime

In [None]:
df_views_songs['Timestamp'] = pd.to_datetime(df_views_songs['Timestamp'])

In [None]:
df_views_songs.info()

 <a id=section402></a>
## 4.2 Filling missing values

In [None]:
column_names = list(df_views_songs.columns)
column_names.remove('Views')

In [None]:
columns_to_be_dropped = []
for col in column_names:
    # percentage of missing values in this column
    missing_percent = df_views_songs[col].isna().mean() * 100
    if missing_percent == 0:
        continue

    # mark the column for dropping if 70% or more of its values are missing
    if missing_percent >= 70:
        columns_to_be_dropped.append(col)
    # otherwise fill with the mode (object columns) or the median (numeric columns);
    # assigning the result back avoids relying on inplace fillna on a column selection
    elif df_views_songs[col].dtype == 'object':
        df_views_songs[col] = df_views_songs[col].fillna(df_views_songs[col].mode()[0])
    elif df_views_songs[col].dtype in ('float64', 'int64'):
        df_views_songs[col] = df_views_songs[col].fillna(df_views_songs[col].median())

# dropping all columns that have 70% or more missing values
df_views_songs.drop(columns_to_be_dropped, axis=1, inplace=True)

In [None]:
df_views_songs.info()

**Observations:**

We don't have any missing values. Good to go.

 <a id=section403></a>
## 4.3 Remove highly correlated columns

In [None]:
# extracting feature columns
feature_cols = list(df_views_songs.columns)
feature_cols.remove('Views')
feature_cols.remove('Is_Test_Data')
feature_cols

In [None]:
# extracting highly correlated columns(except target variable) to drop

# Create correlation matrix (numeric feature columns only)
corr_matrix = df_views_songs[feature_cols].select_dtypes(include='number').corr().abs()

# Select upper triangle of correlation matrix (np.bool was removed from NumPy; use bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with correlation greater than 0.70
cols_to_drop = [column for column in upper.columns if any(upper[column] > 0.70)]
cols_to_drop

**Observations :** We have two highly correlated independent columns. Let's drop one.

In [None]:
# lets drop Popularity
df_views_songs.drop('Popularity', axis=1, inplace=True)
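
To make the upper-triangle selection concrete, here is a small sketch on hypothetical data in which column `b` is an exact linear function of `a` and column `c` is unrelated. The column names and values are illustrative only; the 0.70 threshold is the one used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is a linear function of 'a', 'c' alternates between 1 and -1.
a = np.arange(10, dtype=float)
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": [1.0, -1.0] * 5})

# Absolute pairwise correlations.
corr = toy.corr().abs()

# Keep only the entries above the diagonal so each pair is looked at once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged if it correlates strongly with any earlier column.
threshold = 0.70
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Because only the upper triangle is inspected, the later column of each correlated pair is flagged and the earlier one is kept.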

 <a id=section404></a>
## 4.4 Handling Outliers

The concept of outliers applies only to continuous variables.

NOTE:

1. Remove outliers if their percentage is less than 2%.

2. Fill the remaining outlier values with the median (continuous) or mode (categorical), depending on the data.
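
One reading of the rule above can be sketched as follows, using the classic 1.5 × IQR fences on a hypothetical series. The choice of the 25th/75th percentiles and the sample values are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Hypothetical continuous column with one extreme value.
values = pd.Series([4, 5, 6, 7, 8, 9, 10, 11, 12, 400], dtype=float)

# Classic IQR fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A value is an outlier if it falls below the lower fence OR above the upper fence.
is_outlier = (values < lower) | (values > upper)
outlier_percent = is_outlier.mean() * 100

if outlier_percent < 2:
    # Few outliers: drop the affected rows.
    cleaned = values[~is_outlier]
else:
    # Otherwise keep the rows and replace the outliers with the median.
    cleaned = values.mask(is_outlier, values.median())

print(outlier_percent, cleaned.tolist())
```

Here one value in ten is flagged (10%), so the second branch applies and the extreme value is replaced by the median.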



In [None]:
# storing non-object columns in 'continuous_columns' and object columns in 'categorical_columns'
continuous_columns = []
categorical_columns = []

for col in df_views_songs.columns:
  if df_views_songs[col].dtype != 'object':
    continuous_columns.append(col)
  else:
    categorical_columns.append(col)

# Timestamp and the target are not treated as continuous features
continuous_columns.remove('Timestamp')
continuous_columns.remove('Views')

print(continuous_columns)
print(categorical_columns)

In [None]:
for col in continuous_columns:
  df_temp = df_views_songs[col]
  # 10th and 90th percentiles (wider than the usual 25th/75th quartiles)
  p10, p90 = np.percentile(df_temp, [10, 90])
  spread = p90 - p10
  lower_bound = p10 - (1.5 * spread)
  upper_bound = p90 + (1.5 * spread)
  # a value is an outlier if it lies below the lower bound OR above the upper bound
  has_outliers = ((df_temp < lower_bound) | (df_temp > upper_bound)).any()

  print(col, has_outliers)

**Observations:** We don't have any outliers in the data.

 <a id=section405></a>
## 4.5 Pandas Profiling after Data Preprocessing

In [None]:
# Running pandas profiling to get better understanding of data
#df_views_songs.profile_report(title='Pandas Profiling after Data Preprocessing', style={'full_width':True})

 <a id=section406></a>
## 4.6 Exploratory Data Analysis

We do EDA to gain a better understanding of the data, which may eventually help in selecting the best model for prediction.

### 1. Top 10 artists with most views

In [None]:
df_views_songs.groupby('Name')['Views'].sum().sort_values(ascending=False).head(10)

In [None]:
# plotting
df = pd.DataFrame(
	{
    'Views':df_views_songs.groupby('Name')['Views'].sum().sort_values(ascending=False).head(10)
	}
	) 
df.plot.bar(rot=0,figsize=(32, 7))

### 2. Top 10 songs with most views

In [None]:
df_views_songs.groupby('Song_Name')['Views'].sum().sort_values(ascending=False).head(10)

In [None]:
df = pd.DataFrame(
	{
    'Views':df_views_songs.groupby('Song_Name')['Views'].sum().sort_values(ascending=False).head(10)
	}
	) 
df.plot.bar(rot=0,figsize=(32, 7))

### 3. Top 10 genres with most views

In [None]:
df_views_songs.groupby('Genre')['Views'].sum().sort_values(ascending=False).head(10)

In [None]:
df = pd.DataFrame(
	{
    'Views':df_views_songs.groupby('Genre')['Views'].sum().sort_values(ascending=False).head(10)
	}
	) 
df.plot.bar(rot=0,figsize=(32, 7))

In [None]:
df_views_songs.columns

 <a id=section5></a>
# 5. Data preparation for model building

 <a id=section501></a>
## 5.1 Dummification / One-Hot Encoding of categorical variables

In [None]:
# lets look at how many unique labels each category has
for i in range(0, len(categorical_columns)):
  print(categorical_columns[i], " - ", df_views_songs[categorical_columns[i]].nunique())

**Observations:**
 
Some columns have a large number of categories.

Can we apply the results of the paper below?

http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf


**Summary:**

In the winning solution of the KDD Cup 2009, "Winning the KDD Cup Orange Challenge with Ensemble Selection", the authors limit one-hot encoding to the 10 most frequent labels of each variable. This means that they make one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, which in this case is dropped. Thus, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation.

In [None]:
for col in categorical_columns:
    imp_labels = list(df_views_songs[col].value_counts().head(10).index)
    
    for label in imp_labels:
        # str(label) keeps the column name valid even when a label is not a string
        df_views_songs[col+'_'+str(label)] = np.where(df_views_songs[col] == label, 1, 0)
    
    df_views_songs.drop(col, axis = 1, inplace=True)
    
df_views_songs.head()

**Observations:**

We have 36* columns after one-hot encoding
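
The top-N encoding is easy to check on a tiny example. The helper `top_n_one_hot` below is a hypothetical name introduced for illustration, applied to made-up data with N = 2 so that the effect on a rare label is visible.

```python
import pandas as pd

def top_n_one_hot(df, column, n=10):
    """Replace `column` with dummy columns for its n most frequent labels (hypothetical helper)."""
    out = df.copy()
    top_labels = out[column].value_counts().head(n).index
    for label in top_labels:
        # 1 where the row carries this label, 0 otherwise.
        out[f"{column}_{label}"] = (out[column] == label).astype(int)
    # Rows with any other label end up with 0 in every dummy column.
    return out.drop(columns=column)

# Made-up data: 'a' and 'b' are frequent, 'c' is rare.
toy = pd.DataFrame({"channel": ["a", "a", "a", "b", "b", "c"]})
encoded = top_n_one_hot(toy, "channel", n=2)
print(encoded)
```

The row with the rare label `c` is all zeros, which is what "grouping the other labels into a dropped category" amounts to.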

In [None]:
feature_cols = list(df_views_songs.columns)

for col in continuous_columns:
    if col in feature_cols:
        feature_cols.remove(col)

# let's remove Timestamp from prediction
feature_cols.remove('Timestamp')

feature_cols.remove('Views')

categorical_columns = feature_cols

print('continuous_columns length : {} '.format(len(continuous_columns)))
print('categorical_columns length : {}'.format(len(categorical_columns)))

 <a id=section502></a>
 ## 5.2 Standardizing continuous variables

In [None]:
continuous_columns

In [None]:
from sklearn.preprocessing import StandardScaler

continuous_columns.remove('Is_Test_Data')
# standardizing of data
scaler = StandardScaler().fit(df_views_songs[continuous_columns])
data = scaler.transform(df_views_songs[continuous_columns])
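
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below reproduces that arithmetic with NumPy on made-up numbers (the array `toy` is an assumption for illustration, not the notebook's data):

```python
import numpy as np

# Toy matrix: 4 rows, 2 features (illustrative values only)
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

# Column-wise mean and population standard deviation (ddof=0)
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)

# z = (x - mean) / std, applied per column
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

Note that the scaler in this notebook is fitted on train and test rows together, so the test rows also influence the mean and standard deviation. Fitting on the training rows only is a common alternative.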

In [None]:
# forming dataframe after standardization
df_views_songs_sd= pd.DataFrame(data)
df_views_songs_sd.columns = continuous_columns
df_views_songs_sd.index = df_views_songs.index
print(df_views_songs_sd.shape)

#### Merging all columns together.

In [None]:
# merge categorical and continuous columns
df_views_songs_sd = pd.concat([df_views_songs_sd, df_views_songs[categorical_columns]],axis=1).reindex(df_views_songs.index)
df_views_songs_sd.shape

In [None]:
# add Is_Test_Data column
df_views_songs_sd = pd.concat([df_views_songs_sd, df_views_songs['Is_Test_Data']],axis=1).reindex(df_views_songs.index)

df_views_songs_sd.shape

In [None]:
# add Views column
df_views_songs_sd = pd.concat([df_views_songs_sd, df_views_songs['Views']],axis=1).reindex(df_views_songs.index)
df_views_songs_sd.shape

 <a id=section6></a>
 # 6. Ensemble Modelling and Prediction
 
 Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or by using different training data sets. The ensemble model then aggregates the predictions of the base models and produces one final prediction for the unseen data.
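
One simple way to aggregate base-model predictions is to average them. The sketch below assumes three hypothetical models whose predictions are hard-coded. It only illustrates the aggregation step and is not the procedure used later in this notebook, which compares models and picks one.

```python
import numpy as np

# Hypothetical predictions from three base models for four samples
pred_model_a = np.array([10.0, 20.0, 30.0, 40.0])
pred_model_b = np.array([12.0, 18.0, 33.0, 39.0])
pred_model_c = np.array([11.0, 22.0, 27.0, 41.0])

# Stack into shape (n_models, n_samples) and average across models
stacked = np.vstack([pred_model_a, pred_model_b, pred_model_c])
ensemble_pred = stacked.mean(axis=0)

print(ensemble_pred)
```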

 <a id=section601></a>
 ## 6.1 Linear Regression
 
 
Linear regression is a basic and commonly used type of predictive analysis.  The overall idea of regression is to examine two things: 

1. Does a set of predictor variables do a good job in predicting an outcome (dependent) variable?  
2. Which variables in particular are significant predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign of the beta estimates–impact the outcome variable?  

These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.  The simplest form of the regression equation with one dependent and one independent variable is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.
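
As a small worked example of the formula y = c + b*x, the sketch below computes the least-squares estimates of `b` and `c` in closed form on toy data generated from y = 2 + 3*x. The data are made up for illustration.

```python
import numpy as np

# Toy data generated from y = 2 + 3*x (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Closed-form least-squares estimates for a single predictor:
# b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# c = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - b * x.mean()

print(c, b)
```

Because the toy data contain no noise, the estimates recover the constant 2 and the coefficient 3 up to floating-point error.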


 <a id=section60101></a>
### 6.1.1  Checking assumptions of Linear Regression

#### a. Independent variables shouldn't be correlated

**NOTE:**

The above assumption was taken care of in [Remove highly correlated columns](#section403). Moving forward.

#### b.  Independent variables and target variable should have a linear relation

In [None]:
continuous_columns

In [None]:
feature_cols = continuous_columns

In [None]:
cols_to_drop = list()

# iterate through the feature columns and collect every column
# that has less than 0.1 absolute correlation with the target variable
for col in feature_cols:
    corr_matrix = df_views_songs_sd[['Views', col]].corr().abs()
    if corr_matrix.iloc[0, 1] < 0.1:
        cols_to_drop.append(col)

#dropping all uncorrelated columns
df_views_songs_sd.drop(cols_to_drop, axis = 1, inplace=True)
df_views_songs_sd.shape

**NOTE:**

Dropped all columns that have less than **0.1 absolute correlation** with the target variable.

#### c. Target variable should be normally distributed


In [None]:
views = df_views_songs_sd[df_views_songs_sd['Is_Test_Data'] == 0]['Views']
sns.distplot(views, color="b")

**Observations:**

1. We observe that the target variable, **Views**, is not normally distributed.
2. Let's apply a transformation and check again.

In [None]:
views_trans = views.apply(lambda x : x**(1/10))
sns.distplot(views_trans, color="b")

**Observations:** The tenth-root transformation has made the distribution a little better. Let's continue.

In [None]:
views_trans = pd.DataFrame({'Views':views_trans})

**NOTE:**

Let's assign the newly transformed 'Views' column after the train-test split.
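
To illustrate what the tenth-root transformation does, the sketch below applies it to a synthetic right-skewed sample and then inverts it by raising to the tenth power. The lognormal sample and the `skewness` helper are assumptions made for illustration; the notebook's actual `Views` values are not used.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive stand-in for a view count
toy_views = rng.lognormal(mean=10.0, sigma=1.0, size=1000)

def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = np.asarray(a, dtype=float)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

toy_trans = toy_views ** (1 / 10)   # forward transform (tenth root)
toy_back = toy_trans ** 10          # inverse transform (tenth power)

print(skewness(toy_views), skewness(toy_trans))
print(np.allclose(toy_back, toy_views))
```

If a model is trained on the transformed target, its predictions are on the tenth-root scale and need the inverse transform before they can be read as view counts.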

 <a id=section60102></a>
 ### 6.1.2 Segregating Train and Test data

In [None]:
df_views_songs_train = df_views_songs_sd[df_views_songs_sd['Is_Test_Data'] == 0]
df_views_songs_test = df_views_songs_sd[df_views_songs_sd['Is_Test_Data'] == 1]

In [None]:
# dropping Is_Test_Data column
df_views_songs_train.drop('Is_Test_Data', axis=1, inplace=True)
df_views_songs_test.drop('Is_Test_Data', axis=1, inplace=True)

In [None]:
print(df_views_songs_train.shape)
print(df_views_songs_test.shape)

In [None]:
feature_cols = list(df_views_songs_train.columns)
feature_cols.remove('Views')
feature_cols

In [None]:
X = df_views_songs_train[feature_cols]
y = df_views_songs_train['Views']

##### Splitting train data again into train and test data


In [None]:
from sklearn.model_selection import train_test_split

def split(X,y):
    return train_test_split(X, y, test_size=0.30, random_state=1)

In [None]:
X_train_lr, X_test_lr, y_train_lr, y_test_lr=split(X,y)
print('Train cases as below')
print('X_train shape: ',X_train_lr.shape)
print('y_train shape: ',y_train_lr.shape)
print('\nTest cases as below')
print('X_test shape: ',X_test_lr.shape)
print('y_test shape: ',y_test_lr.shape)

##### Defining Linear Regression function for modelling

In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error

def rmse_scorer(y_actual, y_predicted):
    # root mean squared error between actual and predicted values
    return sqrt(mean_squared_error(y_actual, y_predicted))

In [None]:
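from sklearn.metrics import make_scorer
# greater_is_better=False makes scikit-learn negate the RMSE, so that a higher score is still better
my_scorer = make_scorer(rmse_scorer, greater_is_better=False)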

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

def linear_reg(gridsearch = False):
    linreg = LinearRegression()
    # NOTE: the 'normalize' option was removed from LinearRegression in scikit-learn 1.2;
    # the continuous features are already standardized, so 'fit_intercept' is searched instead
    parameters = {'fit_intercept': [True, False], 'copy_X': [True, False]}
    if gridsearch:
        # exhaustive search over every parameter combination
        return GridSearchCV(linreg, parameters, cv = 10, refit = True, scoring = my_scorer)
    # random search; n_iter is capped at the 4 available combinations
    return RandomizedSearchCV(linreg, parameters, n_iter = 4, cv = 10, refit = True, scoring = my_scorer)

 <a id=section60102></a>
### 6.1.2 Using Default Model

 <a id=section6010201></a>
#### 6.1.2.1 Building Model and Prediction

In [None]:
linreg = LinearRegression()
linreg.fit(X_train_lr,y_train_lr)

In [None]:
# print the intercept and coefficients
print('Intercept:',linreg.intercept_)
print('Coefficients:',linreg.coef_)  

In [None]:
# make predictions on the training set
y_pred_train_lr = linreg.predict(X_train_lr)

In [None]:
# make predictions on the testing set
y_pred_test_lr = linreg.predict(X_test_lr)

In [None]:
RMSE_MAP = {}

 <a id=section6010202></a>
#### 6.1.2.2 Model Evaluation

#### a. RMSE


In [None]:
from sklearn import metrics
RMSE_train = np.sqrt( metrics.mean_squared_error(y_train_lr, y_pred_train_lr))
print('RMSE for training set is {}'.format(RMSE_train))
RMSE_MAP['lr_train_d'] = RMSE_train

RMSE_test = np.sqrt( metrics.mean_squared_error(y_test_lr, y_pred_test_lr))
print('RMSE for testing set is {}'.format(RMSE_test))
RMSE_MAP['lr_test_d'] = RMSE_test

#### b. MAPE


In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
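
The MAPE formula divides by `y_true`, so it is undefined whenever an actual value is zero. The sketch below shows one possible guard that skips zero targets. The helper name `safe_mape` is hypothetical; it is not part of this notebook or of scikit-learn.

```python
import numpy as np

def safe_mape(y_true, y_pred):
    """MAPE in percent, computed only over rows where y_true is non-zero.

    Hypothetical helper for illustration.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    if not mask.any():
        # every target is zero, so the percentage error is undefined
        return float("nan")
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)

# The third row has a zero target and is ignored
print(safe_mape([100, 200, 0], [110, 180, 5]))
```

Skipping rows changes what the metric measures, so the number of excluded rows should be reported alongside the score.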

In [None]:
mape_train_error = mean_absolute_percentage_error(y_train_lr, y_pred_train_lr)
mape_test_error = mean_absolute_percentage_error(y_test_lr, y_pred_test_lr)
print('MAPE for training set is {}'.format(mape_train_error))
print('MAPE for testing set is {}'.format(mape_test_error))

#### c. R-Squared Error

In [None]:
from sklearn.metrics import r2_score

r_squared_train_lr = r2_score(y_train_lr,y_pred_train_lr )
print('R-Squared for training set is {}'.format(r_squared_train_lr))

In [None]:
r_squared_test_lr = r2_score(y_test_lr,y_pred_test_lr )
print('R-Squared for testing set is {}'.format(r_squared_test_lr))

#### d. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_lr = 1 - (1-r_squared_train_lr)*(len(y_train_lr)-1)/(len(y_train_lr)-X_train_lr.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_lr))

In [None]:
adjusted_r_squared_test_lr = 1 - (1-r_squared_test_lr)*(len(y_test_lr)-1)/(len(y_test_lr)-X_test_lr.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_lr))
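
The adjusted R-squared expression above is repeated for every model in this notebook. A small helper could keep the formula in one place. The sketch below assumes a hypothetical function name `adjusted_r2` and uses made-up inputs.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denom = n_samples - n_features - 1
    if denom <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denom

# Example with made-up numbers: R^2 = 0.80, 101 samples, 10 features
print(adjusted_r2(0.80, 101, 10))
```

With the notebook's variables, a call could look like `adjusted_r2(r_squared_test_lr, len(y_test_lr), X_test_lr.shape[1])`.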

 <a id=section60103></a>
### 6.1.3 Using GridSearchCV

 <a id=section6010301></a>
#### 6.1.3.1 Building Model and Prediction

In [None]:
linreg_gs = linear_reg(True)
linreg_gs.fit(X_train_lr,y_train_lr)

In [None]:
print("best_params after cross-validation : ", linreg_gs.best_params_)   

In [None]:
# make predictions on the training set
y_pred_train_lr_gs = linreg_gs.predict(X_train_lr)

In [None]:
# make predictions on the testing set
y_pred_test_lr_gs = linreg_gs.predict(X_test_lr)

 <a id=section6010302></a>
#### 6.1.3.2 Model Evaluation

#### a. RMSE


In [None]:
from sklearn import metrics
RMSE_train = np.sqrt( metrics.mean_squared_error(y_train_lr, y_pred_train_lr_gs))
print('RMSE for training set is {}'.format(RMSE_train))
RMSE_MAP['lr_train_gs'] = RMSE_train

RMSE_test = np.sqrt( metrics.mean_squared_error(y_test_lr, y_pred_test_lr_gs))
print('RMSE for testing set is {}'.format(RMSE_test))
RMSE_MAP['lr_test_gs'] = RMSE_test

#### b. MAPE


In [None]:
mape_train_error = mean_absolute_percentage_error(y_train_lr, y_pred_train_lr_gs)
mape_test_error = mean_absolute_percentage_error(y_test_lr, y_pred_test_lr_gs)
print('MAPE for training set is {}'.format(mape_train_error))
print('MAPE for testing set is {}'.format(mape_test_error))

#### c. R-Squared Error

In [None]:
from sklearn.metrics import r2_score

r_squared_train_lr_gs = r2_score(y_train_lr,y_pred_train_lr_gs )
print('R-Squared for training set is {}'.format(r_squared_train_lr_gs))

In [None]:
r_squared_test_lr_gs = r2_score(y_test_lr,y_pred_test_lr_gs )
print('R-Squared for testing set is {}'.format(r_squared_test_lr_gs))

#### d. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_lr_gs = 1 - (1-r_squared_train_lr_gs)*(len(y_train_lr)-1)/(len(y_train_lr)-X_train_lr.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_lr_gs))

In [None]:
adjusted_r_squared_test_lr_gs = 1 - (1-r_squared_test_lr_gs)*(len(y_test_lr)-1)/(len(y_test_lr)-X_test_lr.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_lr_gs))

 <a id=section60104></a>
### 6.1.4 Using RandomizedSearchCV

 <a id=section6010401></a>
#### 6.1.4.1 Building Model and Prediction

In [None]:
linreg_rs = linear_reg(False)
linreg_rs.fit(X_train_lr,y_train_lr)

In [None]:
print("best_params after cross-validation : ", linreg_rs.best_params_)   

In [None]:
# make predictions on the training set
y_pred_train_lr_rs = linreg_rs.predict(X_train_lr)

In [None]:
# make predictions on the testing set
y_pred_test_lr_rs = linreg_rs.predict(X_test_lr)

 <a id=section6010402></a>
#### 6.1.4.2 Model Evaluation

#### a. RMSE


In [None]:
from sklearn import metrics
RMSE_train = np.sqrt( metrics.mean_squared_error(y_train_lr, y_pred_train_lr_rs))
print('RMSE for training set is {}'.format(RMSE_train))
RMSE_MAP['lr_train_rs'] = RMSE_train

RMSE_test = np.sqrt( metrics.mean_squared_error(y_test_lr, y_pred_test_lr_rs))
print('RMSE for testing set is {}'.format(RMSE_test))
RMSE_MAP['lr_test_rs'] = RMSE_test

#### b. MAPE


In [None]:
mape_train_error = mean_absolute_percentage_error(y_train_lr, y_pred_train_lr_rs)
mape_test_error = mean_absolute_percentage_error(y_test_lr, y_pred_test_lr_rs)
print('MAPE for training set is {}'.format(mape_train_error))
print('MAPE for testing set is {}'.format(mape_test_error))

#### c. R-Squared Error

In [None]:
from sklearn.metrics import r2_score

r_squared_train_lr_rs = r2_score(y_train_lr,y_pred_train_lr_rs )
print('R-Squared for training set is {}'.format(r_squared_train_lr_rs))

In [None]:
r_squared_test_lr_rs = r2_score(y_test_lr,y_pred_test_lr_rs )
print('R-Squared for testing set is {}'.format(r_squared_test_lr_rs))

#### d. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_lr_rs = 1 - (1-r_squared_train_lr_rs)*(len(y_train_lr)-1)/(len(y_train_lr)-X_train_lr.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_lr_rs))

In [None]:
adjusted_r_squared_test_lr_rs = 1 - (1-r_squared_test_lr_rs)*(len(y_test_lr)-1)/(len(y_test_lr)-X_test_lr.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_lr_rs))

 <a id=section602></a>
## 6.2 Decision Tree

 <a id=section60201></a>
### 6.2.1 Using Default Model

##### Splitting train and test data


In [None]:
X_train_dt, X_test_dt, y_train_dt, y_test_dt=split(X,y)
print('Train cases as below')
print('X_train shape: ',X_train_dt.shape)
print('y_train shape: ',y_train_dt.shape)
print('\nTest cases as below')
print('X_test shape: ',X_test_dt.shape)
print('y_test shape: ',y_test_dt.shape)

 <a id=section6020101></a>
#### 6.2.1.1 Building Model and Prediction

In [None]:
from sklearn.tree import DecisionTreeRegressor

# using default model for building 
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train_dt, y_train_dt)

In [None]:
#prediction on training data
y_pred_train_dt = dt_reg.predict(X_train_dt)

#prediction on testing data
y_pred_test_dt = dt_reg.predict(X_test_dt)

<a id=section6020102></a>
#### 6.2.1.2 Model Evaluation

#### a. RMSE


In [None]:
RMSE_train_dt = np.sqrt( metrics.mean_squared_error(y_train_dt, y_pred_train_dt))
print('RMSE for training set is {}'.format(RMSE_train_dt))
RMSE_MAP['dt_train_d'] = RMSE_train_dt

RMSE_test_dt = np.sqrt( metrics.mean_squared_error(y_test_dt, y_pred_test_dt))
print('RMSE for testing set is {}'.format(RMSE_test_dt))
RMSE_MAP['dt_test_d'] = RMSE_test_dt

#### b. MAPE


In [None]:
mape_train_error = mean_absolute_percentage_error(y_train_dt, y_pred_train_dt)
mape_test_error = mean_absolute_percentage_error(y_test_dt, y_pred_test_dt)
print('MAPE for training set is {}'.format(mape_train_error))
print('MAPE for testing set is {}'.format(mape_test_error))

#### c. R-Squared Error

In [None]:
r_squared_train_dt = r2_score(y_train_dt,y_pred_train_dt )
print('R-Squared for training set is {}'.format(r_squared_train_dt))

In [None]:
r_squared_test_dt = r2_score(y_test_dt,y_pred_test_dt )
print('R-Squared for testing set is {}'.format(r_squared_test_dt))

#### d. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_dt = 1 - (1-r_squared_train_dt)*(len(y_train_dt)-1)/(len(y_train_dt)-X_train_dt.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_dt))

In [None]:
adjusted_r_squared_test_dt = 1 - (1-r_squared_test_dt)*(len(y_test_dt)-1)/(len(y_test_dt)-X_train_dt.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_dt))

 <a id=section60202></a>
### 6.2.2 Using GridSearchCV

 <a id=section6020201></a>
#### 6.2.2.1 Building Model and Prediction

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Number of features to consider at every split
# (None means all features; the older 'auto' option behaved the same way
# for regressors and was removed in scikit-learn 1.3)
max_features = [None, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Create the parameter grid
param_grid = {
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf
}
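
Grid search evaluates every combination of the listed values, so its cost grows multiplicatively with each added parameter. The sketch below counts the candidates for a grid of the same shape as the one above, using only the standard library. The dictionary `grid` is a stand-in written for this illustration.

```python
from itertools import product

# Same shape as the grid above: 2 x 12 x 3 x 3 candidate values
grid = {
    "max_features": [None, "sqrt"],                  # None means all features
    "max_depth": list(range(10, 111, 10)) + [None],  # 10, 20, ..., 110, None
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Exhaustive search evaluates every combination of values
combos = list(product(*grid.values()))
n_candidates = len(combos)

cv_folds = 3  # matches cv = 3 used for the decision tree search
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

Each candidate is fitted once per cross-validation fold, which is why the number of fits is the number of candidates times `cv`.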

In [None]:
# Instantiate the grid search model
dt_reg_gs = GridSearchCV(estimator = dt_reg, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, scoring = my_scorer, verbose = 2)

In [None]:
dt_reg_gs.fit(X_train_dt, y_train_dt)

In [None]:
#prediction on training data
y_pred_train_dt_gs = dt_reg_gs.predict(X_train_dt)

#prediction on testing data
y_pred_test_dt_gs = dt_reg_gs.predict(X_test_dt)

<a id=section6020202></a>
#### 6.2.2.2 Model Evaluation

#### a. RMSE


In [None]:
RMSE_train_dt_gs = np.sqrt( metrics.mean_squared_error(y_train_dt, y_pred_train_dt_gs))
print('RMSE for training set is {}'.format(RMSE_train_dt_gs))
RMSE_MAP['dt_train_gs'] = RMSE_train_dt_gs

RMSE_test_dt_gs = np.sqrt( metrics.mean_squared_error(y_test_dt, y_pred_test_dt_gs))
print('RMSE for testing set is {}'.format(RMSE_test_dt_gs))
RMSE_MAP['dt_test_gs'] = RMSE_test_dt_gs

#### b. MAPE


In [None]:
mape_train_error = mean_absolute_percentage_error(y_train_dt, y_pred_train_dt_gs)
mape_test_error = mean_absolute_percentage_error(y_test_dt, y_pred_test_dt_gs)
print('MAPE for training set is {}'.format(mape_train_error))
print('MAPE for testing set is {}'.format(mape_test_error))

#### c. R-Squared Error

In [None]:
r_squared_train_dt_gs = r2_score(y_train_dt,y_pred_train_dt_gs )
print('R-Squared for training set is {}'.format(r_squared_train_dt_gs))

In [None]:
r_squared_test_dt_gs = r2_score(y_test_dt,y_pred_test_dt_gs )
print('R-Squared for testing set is {}'.format(r_squared_test_dt_gs))

#### d. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_dt_gs = 1 - (1-r_squared_train_dt_gs)*(len(y_train_dt)-1)/(len(y_train_dt)-X_train_dt.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_dt_gs))

In [None]:
adjusted_r_squared_test_dt_gs = 1 - (1-r_squared_test_dt_gs)*(len(y_test_dt)-1)/(len(y_test_dt)-X_train_dt.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_dt_gs))

 <a id=section60203></a>
### 6.2.3 Using RandomizedSearchCV

 <a id=section6020301></a>
#### 6.2.3.1 Building Model and Prediction

In [None]:
# Number of features to consider at every split
# (None means all features; the older 'auto' option behaved the same way
# for regressors and was removed in scikit-learn 1.3)
max_features = [None, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Create the random grid
random_grid = {
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf
}


In [None]:
# Instantiate the random search model
dt_reg_rs = RandomizedSearchCV(estimator = dt_reg, param_distributions = random_grid, n_iter = 100, cv = 3, 
                               verbose=2, random_state=42, scoring = my_scorer, n_jobs = -1)

In [None]:
dt_reg_rs.fit(X_train_dt, y_train_dt)

In [None]:
#prediction on training data
y_pred_train_dt_rs = dt_reg_rs.predict(X_train_dt)

#prediction on testing data
y_pred_test_dt_rs = dt_reg_rs.predict(X_test_dt)

<a id=section6020302></a>
#### 6.2.3.2 Model Evaluation

#### a. RMSE


In [None]:
RMSE_train_dt_rs = np.sqrt( metrics.mean_squared_error(y_train_dt, y_pred_train_dt_rs))
print('RMSE for training set is {}'.format(RMSE_train_dt_rs))
RMSE_MAP['dt_train_rs'] = RMSE_train_dt_rs

RMSE_test_dt_rs = np.sqrt( metrics.mean_squared_error(y_test_dt, y_pred_test_dt_rs))
print('RMSE for testing set is {}'.format(RMSE_test_dt_rs))
RMSE_MAP['dt_test_rs'] = RMSE_test_dt_rs


#### b. MAPE


In [None]:
mape_train_error = mean_absolute_percentage_error(y_train_dt, y_pred_train_dt_rs)
mape_test_error = mean_absolute_percentage_error(y_test_dt, y_pred_test_dt_rs)
print('MAPE for training set is {}'.format(mape_train_error))
print('MAPE for testing set is {}'.format(mape_test_error))

#### c. R-Squared Error

In [None]:
r_squared_train_dt_rs = r2_score(y_train_dt,y_pred_train_dt_rs )
print('R-Squared for training set is {}'.format(r_squared_train_dt_rs))

In [None]:
r_squared_test_dt_rs = r2_score(y_test_dt,y_pred_test_dt_rs )
print('R-Squared for testing set is {}'.format(r_squared_test_dt_rs))

#### d. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_dt_rs = 1 - (1-r_squared_train_dt_rs)*(len(y_train_dt)-1)/(len(y_train_dt)-X_train_dt.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_dt_rs))

In [None]:
adjusted_r_squared_test_dt_rs = 1 - (1-r_squared_test_dt_rs)*(len(y_test_dt)-1)/(len(y_test_dt)-X_train_dt.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_dt_rs))

In [None]:
RMSE_MAP

 <a id=section603></a>
## 6.3 Random Forest

##### Splitting train and test data


In [None]:
X_train_rf, X_test_rf, y_train_rf, y_test_rf=split(X,y)
print('Train cases as below')
print('X_train shape: ',X_train_rf.shape)
print('y_train shape: ',y_train_rf.shape)
print('\nTest cases as below')
print('X_test shape: ',X_test_rf.shape)
print('y_test shape: ',y_test_rf.shape)

 <a id=section60301></a>

### 6.3.1 Using Default Model


 <a id=section6030101></a>

#### 6.3.1.1 Building Model and Prediction

In [None]:
from sklearn.ensemble import RandomForestRegressor

# using default model for building
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train_rf, y_train_rf)

In [None]:
y_pred_train_rf = rf_reg.predict(X_train_rf)
y_pred_test_rf = rf_reg.predict(X_test_rf)

 <a id=section6030102></a>

#### 6.3.1.2 Model Evaluation

##### a. RMSE


In [None]:
from sklearn import metrics
RMSE_train = np.sqrt( metrics.mean_squared_error(y_train_rf, y_pred_train_rf))
print('RMSE for training set is {}'.format(RMSE_train))
RMSE_MAP['rf_train_d'] = RMSE_train

RMSE_test = np.sqrt( metrics.mean_squared_error(y_test_rf, y_pred_test_rf))
print('RMSE for testing set is {}'.format(RMSE_test))
RMSE_MAP['rf_test_d'] = RMSE_test


##### b. MAPE


In [None]:
mape_train_error = mean_absolute_percentage_error(y_train_rf, y_pred_train_rf)
mape_test_error = mean_absolute_percentage_error(y_test_rf, y_pred_test_rf)
print('MAPE for training set is {}'.format(mape_train_error))
print('MAPE for testing set is {}'.format(mape_test_error))

##### c. R-Squared Error

In [None]:
r_squared_train_rf = r2_score(y_train_rf,y_pred_train_rf )
print('R-Squared for training set is {}'.format(r_squared_train_rf))

In [None]:
r_squared_test_rf = r2_score(y_test_rf,y_pred_test_rf )
print('R-Squared for testing set is {}'.format(r_squared_test_rf))

##### d. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_rf = 1 - (1-r_squared_train_rf)*(len(y_train_rf)-1)/(len(y_train_rf)-X_train_rf.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_rf))

In [None]:
adjusted_r_squared_test_rf = 1 - (1-r_squared_test_rf)*(len(y_test_rf)-1)/(len(y_test_rf)-X_train_rf.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_rf))

 <a id=section60302></a>
### 6.3.2 Using GridSearchCV

 <a id=section6030201></a>
#### 6.3.2.1 Building Model and Prediction

In [None]:
# Number of features to consider at every split
# (None means all features; the older 'auto' option behaved the same way
# for regressors and was removed in scikit-learn 1.3)
max_features = [None, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the parameter grid
param_grid = {
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               # Number of trees in random forest
               'n_estimators' : [1, 5, 10, 15, 20, 25, 30],
               'bootstrap': bootstrap
              }

In [None]:
# Instantiate the grid search model
rf_reg_gs = GridSearchCV(estimator = rf_reg, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2, scoring = my_scorer)

In [None]:
rf_reg_gs.fit(X_train_rf, y_train_rf)

In [None]:
#prediction on training data
y_pred_train_rf_gs = rf_reg_gs.predict(X_train_rf)

#prediction on testing data
y_pred_test_rf_gs = rf_reg_gs.predict(X_test_rf)

 <a id=section6030202></a>

#### 6.3.2.2 Model Evaluation

##### a. RMSE


In [None]:
from sklearn import metrics
RMSE_train = np.sqrt( metrics.mean_squared_error(y_train_rf, y_pred_train_rf_gs))
print('RMSE for training set is {}'.format(RMSE_train))
RMSE_MAP['rf_train_gs'] = RMSE_train

RMSE_test = np.sqrt( metrics.mean_squared_error(y_test_rf, y_pred_test_rf_gs))
print('RMSE for testing set is {}'.format(RMSE_test))
RMSE_MAP['rf_test_gs'] = RMSE_test


##### b. R-Squared Error

In [None]:
r_squared_train_rf_gs = r2_score(y_train_rf,y_pred_train_rf_gs )
print('R-Squared for training set is {}'.format(r_squared_train_rf_gs))

In [None]:
r_squared_test_rf_gs = r2_score(y_test_rf,y_pred_test_rf_gs )
print('R-Squared for testing set is {}'.format(r_squared_test_rf_gs))

##### c. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_rf_gs = 1 - (1-r_squared_train_rf_gs)*(len(y_train_rf)-1)/(len(y_train_rf)-X_train_rf.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_rf_gs))

In [None]:
adjusted_r_squared_test_rf_gs = 1 - (1-r_squared_test_rf_gs)*(len(y_test_rf)-1)/(len(y_test_rf)-X_train_rf.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_rf_gs))

 <a id=section60303></a>
### 6.3.3 Using RandomizedSearchCV

 <a id=section6030301></a>
#### 6.3.3.1 Building Model and Prediction

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 5)]
# Number of features to consider at every split
# (None means all features; the older 'auto' option behaved the same way
# for regressors and was removed in scikit-learn 1.3)
max_features = [None, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'n_estimators' : n_estimators,
               'bootstrap': bootstrap
              }
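
Unlike grid search, random search tries only `n_iter` combinations drawn from the search space. When every parameter is given as a list, `RandomizedSearchCV` samples without replacement. The sketch below imitates that idea with Python's `random` module on a space of the same shape as the one above. The particular candidates it draws are an assumption of this sketch and will differ from the ones scikit-learn draws.

```python
import random
from itertools import product

# Same shape as the random grid above: 2 x 12 x 3 x 3 x 5 x 2 values
space = {
    "max_features": [None, "sqrt"],                  # None means all features
    "max_depth": list(range(10, 111, 10)) + [None],  # 10, 20, ..., 110, None
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [200, 650, 1100, 1550, 2000],
    "bootstrap": [True, False],
}

all_combos = list(product(*space.values()))

# Random search draws n_iter distinct combinations instead of trying all
rng = random.Random(42)
sampled = rng.sample(all_combos, k=100)

print(len(all_combos), len(sampled))
# First sampled candidate, shown as a parameter dict
print(dict(zip(space.keys(), sampled[0])))
```

With `cv = 3`, sampling 100 candidates means 300 fits, compared with three fits for each of the 2160 combinations in an exhaustive search over the same space.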

In [None]:
# Instantiate the random search model
rf_reg_rs = RandomizedSearchCV(estimator = rf_reg, param_distributions = random_grid, n_iter = 100, cv = 3, 
                               verbose=2, random_state=42, n_jobs = -1, scoring = my_scorer)

In [None]:
rf_reg_rs.fit(X_train_rf, y_train_rf)

In [None]:
#prediction on training data
y_pred_train_rf_rs = rf_reg_rs.predict(X_train_rf)

#prediction on testing data
y_pred_test_rf_rs = rf_reg_rs.predict(X_test_rf)

 <a id=section6030302></a>

#### 6.3.3.2 Model Evaluation

##### a. RMSE


In [None]:
from sklearn import metrics
RMSE_train = np.sqrt( metrics.mean_squared_error(y_train_rf, y_pred_train_rf_rs))
print('RMSE for training set is {}'.format(RMSE_train))
RMSE_MAP['rf_train_rs'] = RMSE_train

RMSE_test = np.sqrt( metrics.mean_squared_error(y_test_rf, y_pred_test_rf_rs))
print('RMSE for testing set is {}'.format(RMSE_test))
RMSE_MAP['rf_test_rs'] = RMSE_test


##### b. R-Squared Error

In [None]:
r_squared_train_rf_rs = r2_score(y_train_rf,y_pred_train_rf_rs )
print('R-Squared for training set is {}'.format(r_squared_train_rf_rs))

In [None]:
r_squared_test_rf_rs = r2_score(y_test_rf,y_pred_test_rf_rs )
print('R-Squared for testing set is {}'.format(r_squared_test_rf_rs))

##### c. Adjusted R-Squared Error

In [None]:
adjusted_r_squared_train_rf_rs = 1 - (1-r_squared_train_rf_rs)*(len(y_train_rf)-1)/(len(y_train_rf)-X_train_rf.shape[1]-1)
print('Adjusted R-Squared for training set is {}'.format(adjusted_r_squared_train_rf_rs))

In [None]:
adjusted_r_squared_test_rf_rs = 1 - (1-r_squared_test_rf_rs)*(len(y_test_rf)-1)/(len(y_test_rf)-X_train_rf.shape[1]-1)
print('Adjusted R-Squared for testing set is {}'.format(adjusted_r_squared_test_rf_rs))

 <a id=section7></a>

# 7. Conclusion

## 7.1 Choosing Best Model for prediction

 <a id=section701></a>
<img src="./images/Model_Comparision.jpg" height='400px' width='100%'><br/>

NOTE:

As we can observe, the Random Forest algorithm has the **best scores in terms of RMSE**. Let's use the Random Forest GridSearchCV model to predict our output.
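
The same comparison can be made programmatically from a dictionary shaped like `RMSE_MAP`. The sketch below uses made-up RMSE values purely to show the selection logic. They are not the results of this notebook, and `example_rmse_map` is a hypothetical name.

```python
# Illustrative RMSE values only; these are NOT the notebook's results
example_rmse_map = {
    "lr_test_d": 1.30, "lr_test_gs": 1.29,
    "dt_test_d": 1.55, "dt_test_gs": 1.21,
    "rf_test_d": 1.10, "rf_test_gs": 1.05,
    "rf_train_gs": 0.60,
}

# Keep only held-out (test) scores, then pick the smallest RMSE
test_scores = {k: v for k, v in example_rmse_map.items() if "_test_" in k}
best_key = min(test_scores, key=test_scores.get)

print(best_key, test_scores[best_key])
```

Filtering on the test keys matters because training RMSE is usually lower and would otherwise favour models that overfit.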


 <a id=section702></a>
## 7.2 Final Prediction

In [None]:
# dropping existing Views column
df_views_songs_test.drop('Views', axis=1, inplace=True)

In [None]:
# predicting test data
y_pred = rf_reg_gs.predict(df_views_songs_test)

In [None]:
# storing predicted output in a dataframe
views_predicted = pd.DataFrame({'Views' : np.array(y_pred)})
views_predicted.index = df_views_songs_test.index

In [None]:
# scaling back to original value
# (apply returns a new DataFrame; views_predicted itself is left unchanged)
views_predicted.apply(lambda x : x**10)

In [None]:
# storing output in xlsx format
views_predicted.to_excel('Views_Prediction.xlsx')