<a href="https://colab.research.google.com/github/SanaSoren/TED-Talk-views-pridiction/blob/main/TED_Talk_Views_Prediction_Regression_CapstonProject_sanasoren.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **TED Talk Views Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Problem Statement**


The goal of this project is to develop a predictive model that accurately forecasts the number of views a TED talk video will have. Through the analysis of past TED talks and their associated video metrics, the model will be able to identify trends in viewership and suggest ways to better optimize for higher views. By utilizing the data of past TED talks, the model will be able to create a predictive tool that will allow TED producers to better understand which topics and speakers will be more likely to attract a larger audience.

# ***Let's Begin !***

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from datetime import datetime
import calendar

import ast

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
 
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
# Load Data
TED_talk = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Ted Talk Supervised Learning Project/data_ted_talks.csv')

### **Let's Describe our Data first**

In [None]:
# Dataset first look
TED_talk.head()

In [None]:
# Dataset Rows & Columns 
TED_talk.shape

In [None]:
# Dataset Info
TED_talk.info()

In [None]:
# Checking for Duplicate Value
len(TED_talk[TED_talk.duplicated()])

To predict TED Talk video views well, I try to use as many features from the dataset as possible. Nevertheless, I have decided not to use some of them (e.g., url, speaker_1 name, etc.) because they are unlikely to help predict the views. Some other features (such as description, topics) I have left for further work.

♦ **Talk_ID -** ID of the talk

♦ **Title -** Title of the talk

♦ **Speaker_1 -** Name of the speaker who leads the talk; the same speaker rarely gives more than one talk

♦ **All_Speakers -** Names of all speakers who took part in the talk

♦ **About_Speakers -** Short description of each speaker

♦ **Views -** Number of times the video has been watched

♦ **Event -** Name of the event the talk is part of

♦ **Native_Lang -** Language in which the talk was given

♦ **Available_Lang -** Languages in which the talk is available, from which I take the count

♦ **Duration -** Duration of the video

♦ **Recorded_date, Published_date -** Date of recording and publishing the talk, from which we get:

        ▪ Day of the week
        ▪ Month
        ▪ Year

♦ **Related_talks -** A dictionary of 6 related talks keyed by talk_id, from which I extract the average daily views.

I've excluded the **comments** and **ratings** features, as I consider using them cheating: the point of the task is to predict the number of views for a video that has just been released or is yet to be released. After going through the data analysis notebooks I mentioned earlier, I decided to exclude the following features:

♦ **Comments -** number of comments on the video

♦ **Url -** Url link to the talk

The following features I leave for future work:

♦ **Description -** Description of the talk, will need to encode this information

♦ **Transcript -** Transcript of the talk, will need to encode this information

♦ **Topics -** Topics that are associated with the talk

♦ **Occupation -** Occupation of the speaker

# **TED Talks Data Analysis**

### **Cleaning The Data**
Various datasets frequently have missing values, so I start off by checking whether the TED Talks dataset has any.

In [None]:
# Finding number of unique and null values in each columns
pd.DataFrame([[col, TED_talk[col].nunique(), TED_talk[col].isna().sum()]  for  col  in TED_talk],
             columns = ['Column Name', 'Unique Count', 'Missing Count'])

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(TED_talk.isnull(), cbar= False)

### **What do we learn about the dataset?**

Several columns contain null values: occupations has 522, about_speakers 503, comments 655, all_speakers 4, and recorded_date only 1.

## **Now we can start our Exploratory Data Analysis**

In [None]:
# Dataset Describe
TED_talk.describe()

### **Formatting DateTime**

In [None]:
today=datetime.now()
today.strftime('%Y-%m-%d')

The recorded_date and published_date columns are stored as strings, so we have to convert them to datetime format.

In [None]:
# Recorded date formatting:
TED_talk['recorded_date']= pd.to_datetime(TED_talk['recorded_date'])

# Published date formatting:
TED_talk['published_date']= pd.to_datetime(TED_talk['published_date'])

In [None]:
TED_talk[['recorded_date','published_date']].info()

In [None]:
# Number of days Ted talk has been published
last_publishing_date= TED_talk['published_date'].max()
TED_talk['time_passed_since_published']= last_publishing_date - pd.DatetimeIndex(TED_talk['published_date'])

In [None]:
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
day_order   = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Create new columns for publish_month, publish_year, publish_day, publish_week_day
# (the .dt accessor avoids `import datetime`, which would shadow the datetime class imported earlier)
TED_talk['publish_month'] = TED_talk['published_date'].dt.month.apply(lambda m: calendar.month_abbr[m])
TED_talk['publish_year'] = TED_talk['published_date'].dt.year
TED_talk['publish_day'] = TED_talk['published_date'].dt.day
TED_talk['publish_week_day'] = TED_talk['published_date'].dt.weekday.apply(lambda d: day_order[d])


# **Creating variable for Daily Views(Target)**

In [None]:
# Daily views/Talk:
TED_talk['daily_views'] = TED_talk['views'] / ( TED_talk['time_passed_since_published'].apply(lambda x : x.days) + 1 )

In [None]:
TED_talk[['publish_month','publish_year','publish_day','publish_week_day','daily_views']].head()

# **Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables**

# **Univariate Analysis**
**Why do you do a univariate analysis?**

▪ Univariate analysis examines one variable at a time and summarizes its distribution (center, spread, and shape).

▪ Univariate analysis can help to assess the distribution of data, identify outliers, and gain a better understanding of the data set as a whole.

▪ It can also provide insight into the potential influence of a single variable on a dependent variable.
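
The points above can be sketched numerically as well as with plots. The snippet below is only an illustration on synthetic data: `univariate_summary` and `demo_views` are hypothetical names, and the 1.5 × IQR rule for flagging outliers is one common convention, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

def univariate_summary(values: pd.Series) -> dict:
    """Summarise one numeric variable: centre, skewness and IQR-based outlier count."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
    outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),
        "n_outliers": int(outliers.count()),
    }

# Synthetic right-skewed data standing in for a column such as views (an assumption).
rng = np.random.default_rng(0)
demo_views = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))
summary_demo = univariate_summary(demo_views)
print(summary_demo)
```

A mean well above the median together with a positive skew is the numeric counterpart of the long right tail seen in a distribution plot.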

### **Continuous variables**


In [None]:
fig = plt.figure(figsize=(15,5))

plt.subplot(1,3,1)
plt.title("views")
sns.distplot(x= TED_talk['views'])

plt.subplot(1,3,2)
plt.title("number of comments")
sns.distplot(x= TED_talk['comments'])

plt.subplot(1,3,3)
plt.title("duration of talk")
sns.distplot(x= TED_talk['duration'])

plt.show()

# **Bivariate analysis with dependent variable**
**What is a dependent variable in data analysis?**

▪ A dependent variable is a variable in a data analysis that is affected by the changes in an independent variable.

▪ It is the variable that is being measured or tested in an experiment.



**speaker vs Duration**

In [None]:
# Total talk duration per speaker (top 8), together with each speaker's talk count
temp = TED_talk.groupby(['speaker_1'],as_index=False).agg({'duration':'sum','talk_id':'count'}).sort_values('duration',ascending=False).reset_index(drop=True)[:8]
# Average duration per talk (stored in the talk_id column)
temp['talk_id']=temp['duration']/temp['talk_id']
plt.figure(figsize=(15,6))
ax=sns.barplot(x='speaker_1',y='duration',data=temp)
plt.setp(ax.get_xticklabels(), rotation = 60);


### **Speaker Vs Number of talks delivered**

In [None]:
# rename_axis/reset_index(name=...) gives the same column names on pandas 1.x and 2.x
data_speaker_count = TED_talk['speaker_1'].value_counts().rename_axis('Speaker').reset_index(name='Number of talks')

most_talks = data_speaker_count.nlargest(5, 'Number of talks')
plt.figure(figsize=(10,6))
sns.barplot(x = 'Speaker', y = 'Number of talks', data = most_talks)
plt.show()

## **speaker_1 vs daily_views**

In [None]:
# Top 5 speakers by total daily views
temp = TED_talk.groupby(['speaker_1'],as_index=False)['daily_views'].sum().sort_values('daily_views',ascending=False)[:5]
plt.figure(figsize=(8,6))
ax=sns.barplot(x='speaker_1', y='daily_views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=50);
plt.title('Top 5 speaker according to daily_views')
ax.grid(False)

## **Speaker vs Comments**

In [None]:
temp = TED_talk.groupby(['speaker_1'],as_index=False)['comments'].sum().sort_values('comments',ascending=False)[:5]
plt.figure(figsize=(10,6))
ax=sns.barplot(x='speaker_1',y='comments',data=temp);
plt.setp(ax.get_xticklabels(), rotation=70);
plt.title('Most popular speaker according to comments',fontsize=20)
plt.show()

## **Speaker vs Average Views**

In [None]:
# Speaker most popular video
temp = TED_talk[['speaker_1','views']].sort_values('views',ascending=False)[:5]
plt.figure(figsize=(8,6))
ax=sns.barplot(x='speaker_1',y='views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=60);
plt.title('Speaker_1 with most popular video')
plt.ylabel('Views')
ax.grid(False)

# **Target Encoding**
Target encoding converts a categorical variable into a numerical one by replacing each category with a statistic of the target, here the mean daily views, computed over the rows in that category. It is useful for variables with many levels, such as speaker or event, where one-hot encoding would create a very large number of sparse columns. Because the encoded value is derived from the target itself, it can leak target information and encourage overfitting, especially for categories with only a few rows, unless the means are computed on training data only.
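
As a minimal sketch of a leakage-aware variant, the category means can be learned on training rows only and then applied to unseen rows, with a global-mean fallback for categories that never appeared in training. The helper names and the toy data below are assumptions made for illustration; this is not the encoding used in the cells that follow.

```python
import pandas as pd

def fit_target_encoding(train: pd.DataFrame, cat_col: str, target_col: str):
    """Learn per-category target means from training rows only."""
    mapping = train.groupby(cat_col)[target_col].mean()
    global_mean = train[target_col].mean()
    return mapping, global_mean

def apply_target_encoding(df: pd.DataFrame, cat_col: str, mapping, global_mean) -> pd.Series:
    """Replace each category by its learned mean; unseen categories get the global mean."""
    return df[cat_col].map(mapping).fillna(global_mean)

# Toy data, made up for illustration (not taken from the TED dataset).
train_demo = pd.DataFrame({
    "speaker": ["A", "A", "B", "C"],
    "daily_views": [100.0, 300.0, 50.0, 150.0],
})
test_demo = pd.DataFrame({"speaker": ["A", "B", "D"]})

mapping, global_mean = fit_target_encoding(train_demo, "speaker", "daily_views")
encoded_test = apply_target_encoding(test_demo, "speaker", mapping, global_mean)
print(encoded_test.tolist())  # [200.0, 50.0, 150.0]
```

Speaker "D" never appears in the training rows, so it receives the global mean of 150.0 rather than a value derived from its own target.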

### **Applying Target encoding on speaker_1**

In [None]:
# Mean daily views per speaker, broadcast back to every talk by that speaker
TED_talk['speaker_1 average views'] = TED_talk.groupby('speaker_1')['daily_views'].transform('mean')

plt.figure(figsize=(10,5))
sns.distplot(TED_talk['speaker_1 average views'])
plt.show()
     

# **Event**
Event is also a categorical variable, so we apply target encoding to it as well.

In [None]:
# Mean daily views per event, broadcast back to every talk from that event
TED_talk['Event wise Average Views'] = TED_talk.groupby('event')['daily_views'].transform('mean')

plt.figure(figsize=(10,5))
sns.distplot(TED_talk['Event wise Average Views'])
plt.show()

# **Top TED Talk Events**

In [None]:
temp = TED_talk.groupby(['event','publish_year'],as_index=False).agg({'daily_views':'sum','talk_id':'count'}).sort_values('daily_views',ascending=False).reset_index()[:8]
temp['talk_id'] = temp['daily_views']/temp['talk_id']
plt.figure(figsize=(10,6))
ax = sns.barplot(x='event',y='daily_views',data=temp)
labels = ax.get_xticklabels()
plt.title('Top TED Events by daily views')
plt.ylabel('daily views in million')
plt.setp(labels, rotation=50);
     

## **Top TED Events by Average daily views**

In [None]:
plt.figure(figsize=(20,6))
ax = sns.barplot(x='event',y='talk_id',data=temp)
labels = ax.get_xticklabels()
plt.title('Top TED Events by Average daily views')
plt.xlabel('Events')
plt.ylabel('Average daily views in millions')
plt.show()

**Available Language Variable**

In [None]:
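# available_lang is stored as the string form of a list, so parse it before counting
TED_talk['number of language'] = TED_talk['available_lang'].apply(lambda x: len(ast.literal_eval(x)))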
sns.distplot(TED_talk['number of language'])
plt.show()

### **Number of Topics from Topic variable**

In [None]:
# topics is stored as the string form of a list; parse it into a real list
TED_talk['topics'] = TED_talk['topics'].apply(ast.literal_eval)
TED_talk['number of topics'] = TED_talk['topics'].apply(len)
# graph:
plt.figure(figsize=(15,5))
sns.distplot(TED_talk['number of topics'])

### **Number of Unique Topics**

In [None]:
# Checking for unique topics
unique_topics=[]
for topics in TED_talk['topics']:
  for topic in topics:
    if topic not in unique_topics:
      unique_topics.append(topic)

len(unique_topics)

### **Target encoding on Unique Topics**

In [None]:
# Fetching the average views with respect to each topic in another dict unique_topics_avg_view_dict
unique_topics_avg_view_dict={}
for topic in unique_topics:
  temp=0
  count=0
  for i in range(0,len(TED_talk)):
    temp2 = TED_talk['topics'][i]
    if(topic in temp2):
      temp+= TED_talk['daily_views'][i]
      count+=1
  unique_topics_avg_view_dict[topic]=temp//count

In [None]:
# Storing the average views w.r.t topic for each talk
topics_wise_avg_views=[]
for i in range(0,len(TED_talk)):
  temp=0
  temp_topic = TED_talk['topics'][i]
  for ele in temp_topic:
    temp+= unique_topics_avg_view_dict[ele]
  
  topics_wise_avg_views.append(temp//len(temp_topic))

se = pd.Series(topics_wise_avg_views)
TED_talk['Topics wise average views'] = se.values

# Graph:
plt.figure(figsize=(10,6))
sns.distplot(TED_talk['Topics wise average views'])

# **Related Talk Variable**

The related_talks column contains a dictionary describing the related videos, with talk_id as the key and the video title as its value. We take the mean of the daily views of all related talks.

In [None]:
TED_talk['related_talks'] = TED_talk['related_talks'].apply(lambda x: ast.literal_eval(x))

In [None]:
# Defining a new feature called related_views:
# the daily views of each talk's related talks, summed and divided by 6
views_by_id = TED_talk.set_index('talk_id')['daily_views'].to_dict()

# A dictionary lookup replaces the nested loop over every row of the dataset
TED_talk['related_views'] = TED_talk['related_talks'].apply(
    lambda related: sum(views_by_id.get(talk_id, 0) for talk_id in related) // 6)

# Graph of related_views column
plt.figure(figsize=(10,5))
sns.distplot(TED_talk['related_views'])
plt.show()

### **Converting time passed since published into integer**

In [None]:
TED_talk['time_passed_since_published'] = TED_talk['time_passed_since_published'].dt.days.astype('int16')

# **Feature Engineering and Data Preprocessing**
▪ Feature engineering and data processing are important steps in the ML workflow. Feature engineering involves creating, selecting, and transforming features to create an informative dataset for model training. Data processing involves cleaning, normalizing, and preparing the data for model training. Feature engineering is essential for creating a dataset that is suitable for training an ML model, while data processing ensures that the data is in an appropriate format for the model. Together, feature engineering and data processing enable ML models to train on data that is accurate and reliable.

### **Verifying OLS assumptions**
▪ Verifying OLS (ordinary least squares) assumptions in ML (machine learning) is a critical step in any ML model-building process. OLS assumptions help to ensure that the model is accurately representing the underlying data and that the results of the model are reliable. OLS assumptions include linearity, independence, normality, homoscedasticity, and lack of multicollinearity. By validating these assumptions, you can determine if the data is suitable for OLS regression and can identify areas of improvement that may be needed.
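
One of these assumptions, lack of multicollinearity, is commonly checked with the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² is obtained by regressing feature j on all other features. The snippet below is a small NumPy sketch of that formula on synthetic data; the function name `vif` and the three made-up columns are assumptions for illustration only.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every column except column j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of `a`
vif_demo = vif(np.column_stack([a, b, c]))
print(vif_demo.round(1))
```

Columns `a` and `c` carry almost the same information and therefore get large VIF values, while the independent column `b` stays close to 1.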

## **Linearity**
▪ Linearity means that the target can be written as a weighted sum of the input features plus noise. Linear regression relies on this assumption, so I plot each feature against the target to check whether the relationship is roughly a straight line. Where it is not, a transformation such as the logarithm can bring the relationship closer to linear. Linear models are also easier to interpret, because each weight describes the effect of one feature on the output.
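
As a rough illustration of why a log transform can help, consider a relationship that is multiplicative rather than additive. The data below are synthetic and the power-law form is an assumption chosen for the example; the Pearson correlation is used here only as a crude measure of how straight the relationship is.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=3, sigma=1, size=2000)                # skewed feature
y = 5.0 * x ** 2 * rng.lognormal(sigma=0.3, size=2000)       # multiplicative relationship

# Correlation on the raw scale versus the log-log scale.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(np.log(x), np.log(y))[0, 1]
print(f"raw: {r_raw:.2f}, log-log: {r_log:.2f}")
```

On the log-log scale the power law becomes a straight line, log(y) = log(5) + 2·log(x) + noise, which is the form a linear model can fit.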

In [None]:
# checking for Linearity

features = ['comments', 'duration', 'time_passed_since_published', 'publish_year',
            'publish_day', 'speaker_1 average views', 'Event wise Average Views',
            'number of language', 'number of topics', 'Topics wise average views',
            'related_views']

fig = plt.figure(figsize=(15,10))
for i, col in enumerate(features, start=1):
    plt.subplot(3,4,i)
    plt.title(col)
    # x and y must be passed as keywords in recent seaborn versions
    sns.scatterplot(x=TED_talk[col], y=TED_talk['daily_views'])

plt.tight_layout()
plt.show()

# **Transformation for Linearity**

In [None]:
# Transformation
TED_talk['log_daily_views'] = np.log(TED_talk['daily_views'])
TED_talk['log_comments'] = np.log(TED_talk['comments'])
TED_talk['log_speaker_1_avg_views'] = np.log(TED_talk['speaker_1 average views'])
TED_talk['log_event_wise_average_views'] = np.log(TED_talk['Event wise Average Views'])
TED_talk['log_duration'] = np.log(TED_talk['duration'])
TED_talk['log_topics_wise_average_views'] = np.log(TED_talk['Topics wise average views'])
TED_talk['log_related_views'] = np.log(TED_talk['related_views'])

In [None]:
# Linearity
log_features = ['log_comments', 'log_duration', 'time_passed_since_published', 'publish_year',
                'publish_day', 'log_speaker_1_avg_views', 'log_event_wise_average_views',
                'number of language', 'number of topics', 'log_topics_wise_average_views',
                'log_related_views']

fig = plt.figure(figsize=(15,10))
for i, col in enumerate(log_features, start=1):
    plt.subplot(3,4,i)
    plt.title(col)
    # x and y must be passed as keywords in recent seaborn versions
    sns.scatterplot(x=TED_talk[col], y=TED_talk['log_daily_views'])

plt.tight_layout()
plt.show()

▶ **INFERENCE :** Not all features show a linear relationship with the target, and many features show heteroscedasticity.
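
Heteroscedasticity can also be checked numerically rather than by eye. The sketch below, on synthetic data, fits a straight line and compares the residual variance in the lower and upper halves of the feature; this split-and-compare check is a simple stand-in chosen for illustration, not a formal test and not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=2000)
y = 2.0 * x + rng.normal(scale=0.5 * x)      # noise grows with x: heteroscedastic

# Fit a straight line and compute its residuals.
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (slope * x + intercept)

# Compare residual spread in the lower and upper halves of x.
order = np.argsort(x)
low, high = resid[order[:1000]], resid[order[1000:]]
spread_ratio = high.var() / low.var()
print(f"residual variance ratio (high x / low x): {spread_ratio:.1f}")
```

A ratio far from 1 indicates that the spread of the errors depends on the feature, which is what a fan-shaped scatter plot shows visually.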

# **Outliers Detection**

In [None]:
# Boxplots
fig = plt.figure(figsize=(10,8))

plt.subplot(3,4,1)
#plt.title("log_comments")
sns.boxplot(x= TED_talk['log_comments'])

plt.subplot(3,4,2)
#plt.title("duration")
sns.boxplot(x= TED_talk['log_duration'])

plt.subplot(3,4,3)
#plt.title("time_passed_since_published")
sns.boxplot(x= TED_talk['time_passed_since_published'])

plt.subplot(3,4,4)
#plt.title("publish_year")
sns.boxplot(x= TED_talk['publish_year'])

plt.subplot(3,4,5)
#plt.title("publish_day")
sns.boxplot(x= TED_talk['publish_day'])

plt.subplot(3,4,6)
#plt.title("log_speaker_1_avg_views")
sns.boxplot(x= TED_talk['log_speaker_1_avg_views'])

plt.subplot(3,4,7)
#plt.title("log_event_wise_avg_views")
sns.boxplot(x= TED_talk['log_event_wise_average_views'])


plt.subplot(3,4,8)
#plt.title("number_of_lang")
sns.boxplot(x= TED_talk['number of language'])

plt.subplot(3,4,9)
#plt.title("num_of_topics")
sns.boxplot(x= TED_talk['number of topics'])

plt.subplot(3,4,10)
#plt.title("log_daily_views")
sns.boxplot(x= TED_talk['log_daily_views'])

plt.subplot(3,4,11)
#plt.title("log_topics_wise_avg_views")
sns.boxplot(x= TED_talk['log_topics_wise_average_views'])

plt.subplot(3,4,12)
#plt.title("log_related_views")
sns.boxplot(x= TED_talk['log_related_views'])

plt.tight_layout()
plt.show()

In [None]:
# Removing outliers column by column: keep rows strictly between the 1st and 99th
# percentiles, recomputing the percentiles on the rows that survived the previous step
outlier_columns = ['log_comments', 'log_duration', 'log_speaker_1_avg_views',
                   'log_event_wise_average_views', 'number of language',
                   'number of topics', 'log_daily_views',
                   'log_topics_wise_average_views', 'log_related_views']

df_filtered = TED_talk
for col in outlier_columns:
    q_low = df_filtered[col].quantile(0.01)
    q_hi  = df_filtered[col].quantile(0.99)
    if col == 'number of topics':
        # only the upper tail is trimmed for the topic count
        df_filtered = df_filtered[df_filtered[col] < q_hi]
    else:
        df_filtered = df_filtered[(df_filtered[col] < q_hi) & (df_filtered[col] > q_low)]

# copy so that later column assignments do not act on a view of TED_talk
df_filtered = df_filtered.copy()

In [None]:
# New Boxplots
fig = plt.figure(figsize=(10,8))

plt.subplot(3,4,1)
#plt.title("log_comments")
sns.boxplot(x= df_filtered['log_comments'])

plt.subplot(3,4,2)
#plt.title("duration")
sns.boxplot(x= df_filtered['log_duration'])

plt.subplot(3,4,3)
#plt.title("time_passed_since_published")
sns.boxplot(x= df_filtered['time_passed_since_published'])

plt.subplot(3,4,4)
#plt.title("publish_year")
sns.boxplot(x= df_filtered['publish_year'])

plt.subplot(3,4,5)
#plt.title("publish_day")
sns.boxplot(x= df_filtered['publish_day'])

plt.subplot(3,4,6)
#plt.title("log_speaker_1_avg_views")
sns.boxplot(x= df_filtered['log_speaker_1_avg_views'])

plt.subplot(3,4,7)
#plt.title("log_event_wise_avg_views")
sns.boxplot(x= df_filtered['log_event_wise_average_views'])


plt.subplot(3,4,8)
#plt.title("number_of_lang")
sns.boxplot(x= df_filtered['number of language'])

plt.subplot(3,4,9)
#plt.title("num_of_topics")
sns.boxplot(x= df_filtered['number of topics'])

plt.subplot(3,4,10)
#plt.title("log_daily_views")
sns.boxplot(x= df_filtered['log_daily_views'])

plt.subplot(3,4,11)
#plt.title("log_topics_wise_avg_views")
sns.boxplot(x= df_filtered['log_topics_wise_average_views'])

plt.subplot(3,4,12)
#plt.title("log_related_views")
sns.boxplot(x= df_filtered['log_related_views'])

plt.tight_layout()
plt.show()

# **Removing Irrelevant Features**


In [None]:
df_filtered.columns

In [None]:
unwanted_features=['talk_id', 'title', 'speaker_1', 'all_speakers', 'occupations',
       'about_speakers', 'views', 'recorded_date', 'published_date', 'event',
       'native_lang', 'available_lang', 'topics',
       'related_talks', 'url', 'description', 'transcript', 'comments', 'duration', 'daily_views','speaker_1 average views',
       'Event wise Average Views','Topics wise average views','related_views']

In [None]:
# Reassign instead of dropping in place, which avoids modifying a slice of TED_talk
df_filtered = df_filtered.drop(columns=unwanted_features)

# **Removing Collinearity**

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
# numeric_only=True skips the string columns (publish_month, publish_week_day)
sns.heatmap(np.abs(df_filtered.corr(numeric_only=True)), annot= True, cmap= 'GnBu_r',ax=ax)
plt.show()

# **Variance Inflation Factor Analysis**

In [None]:
vif_data = df_filtered.drop(['publish_week_day','publish_month','log_daily_views','publish_year','number of language','log_comments','log_related_views','log_topics_wise_average_views','log_duration','log_event_wise_average_views'],axis=1)

In [None]:
vif_df=pd.DataFrame()
vif_df['features']=vif_data.columns
vif_df['VIF']=[variance_inflation_factor(vif_data.values,i) for i in range(vif_data.shape[1])]
vif_df

# **Let's check for normal distribution of features in data**

In [None]:
# Plotting distributions of features

fig = plt.figure(figsize=(10,8))

plt.subplot(2,3,1)
plt.title("time_passed_since_published")
sns.distplot(x= df_filtered['time_passed_since_published'])

plt.subplot(2,3,2)
plt.title("publish_day")
sns.distplot(x= df_filtered['publish_day'])

plt.subplot(2,3,3)
plt.title("num_of_topics")
sns.distplot(x= df_filtered['number of topics'])

plt.subplot(2,3,4)
plt.title("log_speaker_1_avg_views")
sns.distplot(x= df_filtered['log_speaker_1_avg_views'])

#plt.subplot(2,3,5)
#plt.title("log_daily_views")
#sns.histplot(x= np.log(df_filtered['related_views']))

plt.subplot(2,3,5)
plt.title("log_daily_views")
sns.distplot(x= df_filtered['log_daily_views'])

plt.tight_layout()
plt.show()

# **Transformation**

In [None]:
# Transformation
df_filtered['sqrt_publish_day']=np.sqrt(df_filtered['publish_day'])
df_filtered['log_num_of_topics']=np.log(df_filtered['number of topics'])
df_filtered['log_time_passed_since_published']=np.log(df_filtered['time_passed_since_published'])

In [None]:
# Plotting distributions of features

fig = plt.figure(figsize=(10,8))

plt.subplot(2,3,1)
plt.title("log_time_passed_since_published")
sns.distplot(x= df_filtered['log_time_passed_since_published'])

plt.subplot(2,3,2)
plt.title("sqrt_publish_day")
sns.distplot(x= df_filtered['sqrt_publish_day'])

plt.subplot(2,3,3)
plt.title("log_num_of_topics")
sns.distplot(x= df_filtered['log_num_of_topics'])

plt.subplot(2,3,4)
plt.title("log_speaker_1_avg_views")
sns.distplot(x= df_filtered['log_speaker_1_avg_views'])

plt.subplot(2,3,5)
plt.title("log_daily_views")
sns.distplot(x= df_filtered['log_daily_views'])

plt.tight_layout()
plt.show()

In [None]:
df_filtered.columns

In [None]:
data = df_filtered.drop(['log_topics_wise_average_views','time_passed_since_published','publish_year','log_duration','log_comments','log_event_wise_average_views','number of language','publish_day','number of topics','log_related_views'],axis=1)

In [None]:
data.columns

# **Let's Start The Model Implementation**

In [None]:
data['log_daily_views'].describe()

### **One-hot encoding the categorical features**

In [None]:
data_dummy = pd.get_dummies(data,drop_first=True)
data_dummy.shape

## **Defining dependent and independent features**

In [None]:
y = data_dummy['log_daily_views']
X = data_dummy.drop(columns='log_daily_views')

In [None]:
X.head()

## **Next we will standardize the features**

In [None]:
scaler = StandardScaler()
scaler.fit(X)
x = scaler.transform(X)

## **Let's split the data into training and testing**

In [None]:
# Splitting dataset into training and test
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

## **Implementing Linear Regression Training Model**

In [None]:
# Regression
reg=LinearRegression()
reg.fit(x_train,y_train)

## **Model Accuracy On Train Data**

In [None]:
y_pred = reg.predict(x_train)

In [None]:
plt.scatter(y_train,y_pred)
plt.xlabel('Target(y_train)',fontsize=20)
plt.ylabel('Predictions(yhat)',fontsize=20)
plt.title('Prediction VS Target',fontsize=20)
plt.show()

## **Scatter plot must be as close to the 45 degree line from origin as possible for best predictions**

In [None]:
# Other way to judge the model
sns.distplot(y_train-y_pred)
plt.title('Residual PDF',fontsize=15)
plt.show()

## **Model Evaluation Metrics**

In [None]:
# R-square: the share of variability our model is able to explain
R2 = reg.score(x_train,y_train)
R2

In [None]:
# Adjusted R-square
n=len(x_train)
p=x_train.shape[1]
adj_r_sqr=1-((1-reg.score(x_train,y_train))*(n-1)/(n-p-1))
adj_r_sqr

In [None]:
variability_df = pd.DataFrame({"R-Square":R2,"Adjusted R-Square":adj_r_sqr},index=["Values"])
variability_df

# **Let's See The Model Parameters**

### **Intercept**

In [None]:
reg.intercept_

## **Rest Of The Parameters**

In [None]:
summary = pd.DataFrame({'Features':X.columns,'Weight':reg.coef_})
summary

# **Weights Interpretation**

### **Continuous Variable**
1. A **Positive Weight** shows that as the feature increases in value, log_daily_views (and therefore daily_views) increases.
2. A **Negative Weight** shows that as the feature increases in value, log_daily_views (and therefore daily_views) decreases.

### **Dummy Variables**
1. A **Positive Weight** shows that the respective category has higher daily views than the benchmark (the category dropped by `drop_first=True`).
2. A **Negative Weight** shows that the respective category has lower daily views than the benchmark.
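
Because the model predicts log_daily_views, a weight can be read as a multiplicative effect on daily_views: a one-unit increase in a feature multiplies the prediction by exp(weight). Since the features were standardized, one unit corresponds to one standard deviation of that feature. The helper below is a hypothetical illustration with made-up weights, not values from the fitted model.

```python
import numpy as np

def pct_change_in_views(weight: float) -> float:
    """Percent change in predicted daily views for a one-unit increase in a feature,
    when the model predicts log(daily_views)."""
    return (np.exp(weight) - 1.0) * 100.0

# Hypothetical weights, chosen only for illustration.
for w in (0.10, -0.10, 0.69):
    print(f"weight {w:+.2f} -> {pct_change_in_views(w):+.1f}% daily views")
```

For small weights the percent change is close to 100 × weight, while larger weights grow faster than that: a weight of 0.69 roughly doubles the predicted daily views.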

# **TESTING**

In [None]:
y_pred_test=reg.predict(x_test)
plt.scatter(y_test,y_pred_test,alpha=0.2)
plt.xlabel('Expected',fontsize=20)
plt.ylabel('Predicted',fontsize=20)
plt.title('Daily Views (Prediction / Expected)',fontsize=20)
plt.show()

In [None]:
plt.figure(figsize=(8,5))
plt.plot(np.exp(y_pred_test))
plt.plot(np.array(np.exp(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
pf_df = pd.DataFrame({'Predictions':np.exp(y_pred_test)})
pf_df.head()

In [None]:
y_test = y_test.reset_index(drop=True)
pf_df['Target(expected values)'] = np.exp(y_test)

In [None]:
pf_df.head()

In [None]:
pf_df['Residual'] = pf_df['Target(expected values)']-pf_df['Predictions']
pf_df['Difference_percentage'] = np.absolute(pf_df['Residual']/pf_df['Target(expected values)']*100)
pf_df.describe()

**The percentage error is low between the 25th and 75th percentiles, which suggests that our model performs well on the test data.**

### **Error metrics**

In [None]:
MSE = mean_squared_error(np.exp(y_test), np.exp(y_pred_test))

In [None]:
# np.sqrt avoids the numpy.math alias, which is not available in recent NumPy releases
RMSE = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(y_pred_test)))

In [None]:
# Mean Absolute Error on the original (back-transformed) scale,
# computed in one vectorized step without shadowing the built-in sum()
MAE = np.mean(np.abs(np.exp(np.asarray(y_test)) - np.exp(np.asarray(y_pred_test))))

# display
print("Mean absolute error : " + str(MAE))

In [None]:
MAPE = mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred_test))

In [None]:
r2 = r2_score(np.exp(y_test), np.exp(y_pred_test))
ar2=1-(1-r2_score(np.exp(y_test), np.exp(y_pred_test)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))
     

In [None]:
error_metric = pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE,MAPE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE','MAPE'])
error_metric

▶ **INFERENCE :** The error metrics confirm the same observation: the error on the test dataset is low.
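
The same six metrics are recomputed for every model in this notebook, so they can be gathered into one helper. The sketch below uses NumPy only and assumes, as the cells above do, that targets and predictions are on a natural-log scale and are compared after `np.exp`. The function name `regression_metrics` and the demo numbers are hypothetical.

```python
import numpy as np

def regression_metrics(y_true_log, y_pred_log, n_features):
    """Return error metrics on the original scale for log-scale inputs."""
    y_true = np.exp(np.asarray(y_true_log, dtype=float))
    y_pred = np.exp(np.asarray(y_pred_log, dtype=float))
    residual = y_true - y_pred
    n = len(y_true)

    mse = np.mean(residual ** 2)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {
        "R-Square": r2,
        "Adj. R-Square": 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residual)),
        # Expressed as a fraction (0.05 means 5%), like scikit-learn's MAPE.
        "MAPE": np.mean(np.abs(residual) / np.abs(y_true)),
    }

# Tiny made-up example: log-scale targets and predictions.
demo_true = np.log([100.0, 200.0, 400.0, 800.0])
demo_pred = np.log([110.0, 190.0, 420.0, 760.0])
demo_metrics = regression_metrics(demo_true, demo_pred, n_features=1)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.4f}")
```

Under the same assumptions, a call such as `regression_metrics(y_test, y_pred_test, x_test.shape[1])` should give the values collected in the table above.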

# **Let's check for overfitting in our model**

## **Lasso Regression Model**
▪ Lasso regression (Least Absolute Shrinkage and Selection Operator) is a type of regularized linear regression that uses shrinkage: an L1 penalty on the coefficients pulls their estimates toward zero. It reduces model complexity and helps prevent overfitting by penalizing large coefficients, and because some coefficients are shrunk exactly to zero it also performs feature selection. This helps identify the most important predictors and is particularly useful when there are many features.
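
To illustrate why an L1 penalty produces coefficients that are exactly zero, here is a minimal NumPy sketch of Lasso solved by proximal gradient descent (ISTA) on synthetic data. It is not the coordinate-descent solver that scikit-learn's `Lasso` uses. The helper names, the data, and the absence of an intercept are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries within [-t, t] become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=500):
    """Minimise (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by proximal gradient descent."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
# Only the first two features carry signal; the last three are pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

w_moderate = lasso_ista(X_demo, y_demo, alpha=0.5)
# Above this threshold the penalty dominates and every coefficient is zero.
alpha_max = np.max(np.abs(X_demo.T @ y_demo)) / len(y_demo)
w_heavy = lasso_ista(X_demo, y_demo, alpha=1.1 * alpha_max)

print("alpha = 0.5          :", np.round(w_moderate, 3))
print("alpha = 1.1*alpha_max:", np.round(w_heavy, 3))
```

The soft-thresholding step is what sets small coefficients to exactly zero. A larger `alpha` widens the threshold and removes more features.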

### **Running Grid Search Cross Validation**
▪ Grid search cross-validation is a hyperparameter tuning method that trains and evaluates the model on every combination of hyperparameter values in a grid, scoring each one with cross-validation. This matters for Lasso because its regularization strength (often written lambda in the literature and exposed as `alpha` in scikit-learn) greatly influences the model's performance. Here the grid search is used to pick the `alpha` with the best cross-validated score.
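
A small self-contained sketch of this procedure on synthetic data may help when reading the output of the next cells. With `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean squared error, so its sign has to be flipped to get an MSE. The data and the logarithmically spaced grid are assumptions for illustration, not the notebook's own.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Log-spaced candidates cover several orders of magnitude evenly.
param_grid = {"alpha": np.logspace(-4, 1, 6)}
search = GridSearchCV(Lasso(max_iter=10_000), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_cv_mse = -search.best_score_  # flip the sign to recover the MSE

print("best alpha:", best_alpha)
print("cross-validated MSE at best alpha:", best_cv_mse)
```

`search.cv_results_["mean_test_score"]` holds one averaged score per candidate, which is useful for checking whether the best value sits at the edge of the grid.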

In [None]:
# Cross validation
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(x_train, y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
y_pred_lasso = lasso_regressor.predict(x_test)

In [None]:
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
plt.plot(np.exp(y_pred_lasso))
plt.plot(np.exp(np.array(y_test)))
plt.legend(["Predicted","Expected"])
plt.title('Daily Views (Prediction / Expected)',fontsize=20)

plt.subplot(1,2,2)
plt.scatter(y_test,y_pred_lasso,alpha=0.2)
plt.xlabel('Expected',fontsize=20)
plt.ylabel('Predicted',fontsize=20)
plt.title('Daily Views (Prediction / Expected)',fontsize=20)

plt.tight_layout()
plt.show()

In [None]:
MSE = mean_squared_error(np.exp(y_test), np.exp(y_pred_lasso))

In [None]:
# np.sqrt avoids the numpy.math alias, which is not available in recent NumPy releases
RMSE = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(y_pred_lasso)))

In [None]:
# Mean Absolute Error on the original (back-transformed) scale,
# computed in one vectorized step without shadowing the built-in sum()
MAE = np.mean(np.abs(np.exp(np.asarray(y_test)) - np.exp(np.asarray(y_pred_lasso))))

# display
print("Mean absolute error : " + str(MAE))

In [None]:
MAPE = mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred_lasso))

In [None]:
r2 = r2_score(np.exp(y_test), np.exp(y_pred_lasso))
ar2 = 1-(1-r2_score(np.exp(y_test), np.exp(y_pred_lasso)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))

In [None]:
error_metric_lasso = pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE,MAPE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE','MAPE'])
error_metric_lasso

## **Ridge Regression Model**
▪ Ridge regression is a regularized regression technique that is often used to address multicollinearity in linear regression. It extends least squares regression with an L2 penalty term that shrinks the magnitude of the coefficients toward zero, which helps reduce overfitting and improve generalization. Unlike Lasso, Ridge does not set coefficients exactly to zero, so it does not perform feature selection; the relative size of the shrunken coefficients can still hint at which features matter most.
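
The shrinkage effect can be seen in a minimal NumPy sketch that solves the ridge problem in closed form on two nearly identical synthetic features. The helper `ridge_closed_form`, the data, and the absence of an intercept are assumptions for illustration. The objective follows the form used by scikit-learn's `Ridge`, the squared error plus `alpha` times the squared L2 norm of the weights.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw||^2 + alpha*||w||^2 (no intercept) via the normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
# Two nearly identical columns mimic multicollinearity.
X_demo = np.column_stack([x0, x0 + rng.normal(scale=0.01, size=200)])
y_demo = 2.0 * x0 + rng.normal(scale=0.5, size=200)

weights_by_alpha = {a: ridge_closed_form(X_demo, y_demo, a) for a in (0.0, 1.0, 100.0)}
for a, w in weights_by_alpha.items():
    print(f"alpha={a:>6}: weights={np.round(w, 3)}, norm={np.linalg.norm(w):.3f}")
```

As `alpha` grows, the weight vector gets shorter and the two correlated features tend to share the effect, but neither weight becomes exactly zero.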

### **Running Grid Search Cross Validation**
▪ Grid search cross-validation is used with the Ridge model to find the regularization strength `alpha` that generalizes best. Each candidate value is evaluated with 5-fold cross-validation, and the one with the lowest average validation error is selected. Because every candidate is scored on data it was not trained on, the selected value is less likely to overfit the training set and more likely to give accurate predictions on unseen data.
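
What the grid search does internally can be sketched by hand with NumPy: for each candidate `alpha`, fit on four folds, score on the fifth, and average. The sketch uses a closed-form ridge fit without an intercept and contiguous, unshuffled folds. The helper names, the data, and the grid are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def cv_mse(X, y, alpha, n_folds=5):
    """Average validation MSE of ridge over contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False  # hold this fold out
        w = ridge_fit(X[train_mask], y[train_mask], alpha)
        scores.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 8))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=1.0, size=100)

alpha_grid = [1e-3, 1e-1, 1, 10, 100, 1000]
cv_scores = {alpha: cv_mse(X_demo, y_demo, alpha) for alpha in alpha_grid}
best_alpha = min(cv_scores, key=cv_scores.get)

print("best alpha:", best_alpha, "CV MSE:", round(cv_scores[best_alpha], 4))
```

`GridSearchCV` automates this loop and, by default, refits the best candidate on the full training set afterwards.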

In [None]:
# Hyperparameter tuning
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(x_train,y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
# Model Prediction
y_pred_ridge = ridge_regressor.predict(x_test)

In [None]:
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
plt.plot(np.exp(y_pred_ridge))
plt.plot(np.exp(np.array(y_test)))
plt.legend(["Predicted","Expected"])
plt.title('Daily Views (Prediction / Expected)',fontsize=20)

plt.subplot(1,2,2)
plt.scatter(y_test,y_pred_ridge,alpha=0.2)
plt.xlabel('Expected',fontsize=20)
plt.ylabel('Predicted',fontsize=20)
plt.title('Daily Views (Prediction / Expected)',fontsize=20)

plt.tight_layout()
plt.show()

In [None]:
MSE = mean_squared_error(np.exp(y_test), np.exp(y_pred_ridge))

In [None]:
# np.sqrt avoids the numpy.math alias, which is not available in recent NumPy releases
RMSE = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(y_pred_ridge)))

In [None]:
# Mean Absolute Error on the original (back-transformed) scale,
# computed in one vectorized step without shadowing the built-in sum()
MAE = np.mean(np.abs(np.exp(np.asarray(y_test)) - np.exp(np.asarray(y_pred_ridge))))

# display
print("Mean absolute error : " + str(MAE))

In [None]:
MAPE = mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred_ridge))

In [None]:
r2 = r2_score(np.exp(y_test), np.exp(y_pred_ridge))
ar2=1-(1-r2_score(np.exp(y_test), np.exp(y_pred_ridge)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))

In [None]:
error_metric_ridge=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE,MAPE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE','MAPE'])
error_metric_ridge

## **Elastic Net Regression Model**
▪ Elastic Net regression is a regularized linear regression that combines the penalties of Lasso (L1) and Ridge (L2). The L1 part can shrink some coefficients exactly to zero and so selects features, while the L2 part stabilizes the estimates when features are correlated. This makes it useful for datasets with many, possibly correlated, features, where it can reduce model complexity and improve generalization.
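
How the two penalties are blended can be shown with a short NumPy sketch of the penalty term in scikit-learn's `ElasticNet` objective, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper name and the coefficient vector are hypothetical.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term of the Elastic Net objective for a weight vector w."""
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

w_demo = np.array([1.0, -2.0, 0.0, 0.5])  # hypothetical coefficient vector

pure_l1 = elastic_net_penalty(w_demo, alpha=1.0, l1_ratio=1.0)  # Lasso-style penalty
pure_l2 = elastic_net_penalty(w_demo, alpha=1.0, l1_ratio=0.0)  # Ridge-style penalty
blended = elastic_net_penalty(w_demo, alpha=1.0, l1_ratio=0.5)  # equal mix

print("l1_ratio=1.0:", pure_l1)
print("l1_ratio=0.0:", pure_l2)
print("l1_ratio=0.5:", blended)
```

Setting `l1_ratio` closer to 1 makes the model behave more like Lasso, and closer to 0 more like Ridge.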

### **Running Grid Search Cross Validation**
▪ Grid search cross-validation is used with the Elastic Net model to tune its two hyperparameters, `alpha` (the overall penalty strength) and `l1_ratio` (the mix between the L1 and L2 penalties). Every combination in the grid is scored with cross-validation, and the best-scoring combination is kept. Choosing the penalty this way helps keep the model from overfitting the training data.
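
With two hyperparameters the grid grows multiplicatively, which is worth keeping in mind before running the search. The sketch below enumerates the combinations for the `alpha` and `l1_ratio` lists used in the next cell and counts the model fits for 5-fold cross-validation. It uses only the standard library, and the variable names are illustrative.

```python
from itertools import product

alphas = [1e-15, 1e-13, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1,
          1, 5, 10, 20, 30, 40, 45, 50, 55, 60, 100]
l1_ratios = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
n_folds = 5

# Every (alpha, l1_ratio) pair is one candidate model.
candidates = [{"alpha": a, "l1_ratio": r} for a, r in product(alphas, l1_ratios)]
n_cv_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_cv_fits, "cross-validation fits")
```

By default `GridSearchCV` adds one more fit at the end, refitting the best candidate on the full training set.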

In [None]:
elastic = ElasticNet()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100],'l1_ratio':[0.3,0.4,0.5,0.6,0.7,0.8]}
elastic_regressor = GridSearchCV(elastic, parameters, scoring='neg_mean_squared_error',cv=5)
elastic_regressor.fit(x_train, y_train)

In [None]:
print("The best fit hyperparameters are found out to be :" ,elastic_regressor.best_params_)
print("\nUsing ",elastic_regressor.best_params_, " the negative mean squared error is: ", elastic_regressor.best_score_)

In [None]:
y_pred_elastic = elastic_regressor.predict(x_test)

In [None]:
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
plt.plot(np.exp(y_pred_elastic))
plt.plot(np.exp(np.array(y_test)))
plt.legend(["Predicted","Expected"])
plt.title('Daily Views (Prediction / Expected)',fontsize=20)

plt.subplot(1,2,2)
plt.scatter(y_test,y_pred_elastic,alpha=0.2)
plt.xlabel('Expected',fontsize=20)
plt.ylabel('Predicted',fontsize=20)
plt.title('Daily Views (Prediction / Expected)',fontsize=20)

plt.tight_layout()
plt.show()

In [None]:
MSE = mean_squared_error(np.exp(y_test), np.exp(y_pred_elastic))

In [None]:
# np.sqrt avoids the numpy.math alias, which is not available in recent NumPy releases
RMSE = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(y_pred_elastic)))

In [None]:
# Mean Absolute Error on the original (back-transformed) scale,
# computed in one vectorized step without shadowing the built-in sum()
MAE = np.mean(np.abs(np.exp(np.asarray(y_test)) - np.exp(np.asarray(y_pred_elastic))))

# display
print("Mean absolute error : " + str(MAE))

In [None]:
MAPE = mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred_elastic))

In [None]:
r2 = r2_score(np.exp(y_test), np.exp(y_pred_elastic))
ar2=1-(1-r2_score(np.exp(y_test), np.exp(y_pred_elastic)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))

In [None]:
error_metric_Elastic=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE,MAPE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE','MAPE'])
error_metric_Elastic

In [None]:
Model_Summary= pd.DataFrame({'Linear Regression':error_metric['Values'],
                             'Lasso':error_metric_lasso['Values'],
                             'Ridge':error_metric_ridge['Values'],
                             'Elastic': error_metric_Elastic['Values']})

In [None]:
Model_Summary

## **Conclusion**

Comparing all the models on the basis of RMSE, our base linear regression model still performs best, followed by the Lasso, Ridge, and ElasticNet regression models. However, our data contains a large number of outliers, and RMSE is strongly affected by outliers. We therefore use MAE as our evaluation metric, according to which the Lasso regressor has the best performance.

We can also see that the Lasso and Ridge regression models perform better than the base linear regression model, likely because of the regularization applied in both models (of the two, only Lasso performs feature selection).

In conclusion, the Lasso regression model has the best performance on the given dataset based on the evaluation metric MAE.
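
The reason for preferring MAE when outliers are present can be seen in a small NumPy sketch: a single large error moves RMSE much more than MAE, because RMSE squares each error before averaging. The error values are made up for illustration.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))

typical_errors = np.array([1.0, -1.0, 1.0, -1.0])
with_outlier = np.array([1.0, -1.0, 1.0, -101.0])  # one talk badly mispredicted

print("no outlier  : MAE =", mae(typical_errors), " RMSE =", rmse(typical_errors))
print("with outlier: MAE =", mae(with_outlier), " RMSE =", round(rmse(with_outlier), 2))
```

In this example both metrics equal 1 without the outlier, while with it RMSE rises to roughly twice the MAE.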

### **Future Work**
▪ Improve feature engineering

▪ Remove unimportant and correlated features

▪ Normalise the data

▪ Improve the hyperparameters of the models

▪ Use PCA