# **Project Name**    - **TED Talks Views Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Member**           - Chandu Chokkam


# **Project Summary -**

Project Summary: TED Talk View Prediction
Objective:
The goal of this project is to develop a predictive model to estimate the number of views a TED Talk will receive based on features extracted from the dataset, such as the talk's topic, duration, description, publication year, and speaker details.

Key Components:

Problem Definition:

Predict the number of views a TED Talk receives.
Identify the key factors contributing to the popularity of a talk.

Dataset:

A publicly available TED Talks dataset containing attributes such as:
Title: Name of the talk.
Speaker name: The person delivering the talk.
Duration: Length of the talk in seconds.
Tags: Keywords associated with the talk.
Number of comments: Engagement metric.
Event: The event where the talk was given.
Publication date: When the talk was made public.
Views: Target variable for prediction.

Exploratory Data Analysis (EDA):

Analyze the distribution of views and other features.
Examine correlations between views and other attributes.
Visualize popular topics, durations, and other key trends.

Feature Engineering:

Extract and process features such as:
Text analysis of title, description, and tags (e.g., sentiment, word count).
Time-based features (e.g., year, month of publication).
Event-based popularity trends.
Encode categorical variables and normalize numerical data.

Model Development:

Split data into training and testing sets.
Use machine learning models such as:
Linear Regression
Random Forest
Gradient Boosting (e.g., XGBoost, LightGBM)
Neural Networks (optional for advanced users).
Evaluate model performance using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R² score.

Insights and Recommendations:

Identify key drivers of high viewership.
Provide actionable insights for TED organizers and speakers to optimize talks for popularity.

Deployment (Optional):

Build a simple web-based interface to input talk details and predict expected views using the trained model.

Potential Applications:

Help speakers understand factors that drive TED Talk popularity.
Assist event organizers in scheduling and promoting talks more effectively.


# **GitHub Link -**

https://github.com/CHOKKAM-CHANDU/TED_TALKS_VIEW_PREDICTION

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Data visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Datetime library for date columns
from datetime import datetime
import datetime as dt

# For detecting multicollinearity (variance inflation factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Preprocessing libraries
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PowerTransformer

# For building pipelines
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
# Machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.ensemble import VotingRegressor,StackingRegressor


# for plot decision tree
from sklearn import tree

# Model selection libraries
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

# importing XGB regressor
from xgboost import XGBRegressor

# Metrics libraries for model evaluation
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error

# Warnings module handles warnings in Python
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
d = pd.read_csv('/content/ted.csv')

### Dataset First View

In [None]:
# Dataset First Look
d.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
d.shape

### Variables Description


*   talk_id: A unique identifier for each TED Talk video.
*   title: The title of the talk.
*   speaker_1: The primary speaker for the talk.

*   all_speakers: A list of all the speakers for the talk.

*   occupations: The occupations of the speakers.
*   about_speakers: Information about the speakers, such as their backgrounds and expertise.
*   recorded_date: The date the talk was recorded.


*   published_date: The date the talk was published on the TED Talks YouTube channel.


*   event: The name of the TED event where the talk was given.

*   native_lang: The language the talk was given in.
*   available_lang: The languages the talk is available in.

*   duration: The length of the video, in seconds.
*   topics: The topics covered in the talk.

*   related_talks: Other TED Talks that are related to this talk.
*   url: The URL of the video.

*   description: A brief description of the talk.
*   transcript: A transcript of the talk.



### Dataset Information

In [None]:
# Dataset Info
d.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
d.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
d.isnull().sum()



*   The comments, occupations, and about_speakers columns have many missing values.
*   The comments column has 655 NaN values. The most plausible explanation is that comments were disabled for those videos. Another possibility is an inconsistency introduced when the data was collected. We deal with these NaN values later on.
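
To put the missing-value counts in proportion, they can be expressed as a percentage of all rows. The sketch below uses a small made-up frame; `missing_summary` is a hypothetical helper name, and in the notebook it would be called on `d`.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data standing in for the TED dataset (values are illustrative only)
toy = pd.DataFrame({
    "views": [100, 200, 300, 400],
    "comments": [10, np.nan, np.nan, 5],
    "occupations": ["writer", None, "artist", "chef"],
})
report = missing_summary(toy)
print(report)
```

Sorting by the count puts the columns that need the most attention at the top.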





In [None]:
# Count duplicated entries in the occupations column
d['occupations'].duplicated().sum()

In [None]:
# Count duplicated entries in the about_speakers column
d['about_speakers'].duplicated().sum()

In [None]:
# Show rows where recorded_date is missing
d[d['recorded_date'].isnull()]

In [None]:
# Show rows where all_speakers is missing
d[d['all_speakers'].isnull()]

### What did you know about your dataset?

The TED Talks dataset contains 4,005 entries with features like title, speaker_1, views, duration, topics, and description. Some fields have missing data, such as occupations (3483 non-null) and comments (3350 non-null). It includes dates (recorded_date, published_date), multilingual data (native_lang, available_lang), and engagement metrics like views and comments. Text fields like description and transcript can be leveraged for NLP tasks.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
d.columns

In [None]:
# Dataset Describe
d.describe(include='all')

In [None]:
# describe the numerical dataset
d.describe().T

In [None]:
d.describe(percentiles=[.25,.50,.75,.80,.85,.90,.95,.96,.97,.98,.99])



*   The minimum value of views is 0.

*   The minimum value of comments is also 0.

*   The views, comments, and duration columns contain outliers.






In [None]:
# find rows where column comments have 0 value
d[d['comments']==0.0]

In [None]:
# find rows where column views have 0 value
d[d['views']==0]



*   There are 6 rows where views = 0 and comments = NaN. We treat this as MCAR data (missing completely at random) and remove these rows, because a talk published on the TED website cannot plausibly have 0 views.
*   The comments column has 655 NaN values in total, so those values also have to be filled.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in d.columns:
  print(f'The unique values in {i} are {d[i].nunique()}')

**Numerical Columns =** talk_id, views, comments, duration.

**categorical columns =** title, speaker_1, all_speakers, occupations, about_speakers, event, native_lang, available_lang, topics,
                      related_talks, url, description, transcript

**Datetime columns =** recorded_date, published_date




*   Incorrect data types are assigned to recorded_date, published_date, and comments.




In [None]:
# Check Unique Values for 'native_lang' variable.
d['native_lang'].unique()

In [None]:

d.describe(include='O').T

In [None]:
# Show the speakers for rows where both occupations and about_speakers are missing

d[(d['occupations'].isnull() & d['about_speakers'].isnull())][['speaker_1', 'all_speakers']]

## 3. ***Data Wrangling***

### Data Wrangling Code

# All Issues with the dataset

1. Dirty data (low-quality data)

*   The comments, occupations, and about_speakers columns have many missing values; comments alone has 655. This is a completeness issue (missing data).

*   Incorrect data types are assigned to recorded_date, published_date, and comments.

*   The minimum value of views is 0, in 6 rows in total. These rows have to be deleted.

*   The minimum value of comments is also 0, but in only 2 rows, while another 655 rows are null. We fill the nulls with 0 for now. Because this column is important, the zeros are replaced with proper values in the feature engineering part. This is an accuracy issue (inaccurate values).

*   Two columns carry speaker details, speaker_1 and all_speakers, so one of them is to be deleted.

*   The url and talk_id columns are not useful for predicting views, so both are deleted.

2. Messy data (untidy data)

*   topics and available_lang are stored in list format. This untidy data has to be split so the features can be related to views; we do this in the feature transformation part.

*   occupations, about_speakers, and related_talks are stored in an untidy dictionary format. These columns are not important, so we remove them later if needed.

In [None]:
# Creating copy of the original dataset
df = d.copy()

In [None]:
# filling missing value.

values = {'comments':0, 'occupations':'no data', 'about_speakers': 'no data', 'all_speakers' : 'no data'}

df = df.fillna(value=values)

In [None]:
# No null values now, I'll take care of views and comment zero values later in feature engineering part.

df.isnull().sum()

In [None]:
#Changing the wrongly assigned data types
df = df.astype({'talk_id': 'int32', 'views':'int32','comments':'int32', 'duration':'int32'})

df['published_date'] = pd.to_datetime(df['published_date'])

df['recorded_date'] = pd.to_datetime(df['recorded_date'])

In [None]:
#Dropping the unnecessary columns & renaming the speaker1 column to speaker column

df.drop(['talk_id', 'all_speakers', 'url'], axis = 1, inplace=True)

df.rename(columns={'speaker_1':'speaker'}, inplace=True)

In [None]:
# Drop rows where 'views' is 0.
# .copy() makes df an independent DataFrame, so later column assignments
# do not trigger SettingWithCopyWarning
df = df[df['views'] != 0].copy()

In [None]:
# Cross-Checking the above operation
print((df['views'] == 0).sum())

In [None]:
# Removing few more columns which are not important
df.drop(['occupations', 'about_speakers', 'related_talks','description','transcript'], axis=1, inplace=True)

In [None]:
# Checking new shape of dataframe
df.shape

In [None]:
#Checking random samples
df.sample(5)

In [None]:
# find popular talk show titles and speakers based on views

popular_talks = df[['title', 'speaker', 'views']].sort_values('views', ascending=False)[0:15]
popular_talks

**Observations :**

*   Ken Robinson's talk "Do Schools Kill Creativity?" is the most popular TED Talk of all time, with 65.05 million views.
*   Coincidentally, it is also one of the first talks ever uploaded to the TED site (the main dataset is sorted by published date).
*   Robinson's talk is closely followed by Amy Cuddy's "Your Body Language May Shape Who You Are".
*   Only 3 talks have surpassed the 50 million mark, and 12 talks have crossed the 30 million mark.





In [None]:
# create a dataframe with top 15 speakers by views
top15_views = df.groupby('speaker').views.sum().nlargest(15)
top15_views = top15_views.reset_index()
top15_views


In [None]:
# create a dataframe with top 15 speakers by comments
top15_comments = df.groupby('speaker').comments.sum().nlargest(15)
top15_comments = top15_comments.reset_index()
top15_comments

### What all manipulations have you done and insights you found?

The dataset was first copied to preserve the original data, and missing values were handled by filling `comments` with `0`, and `occupations`, `about_speakers`, and `all_speakers` with `'no data'`. Data types were corrected by converting `talk_id`, `views`, `comments`, and `duration` to `int32`, while `published_date` and `recorded_date` were converted to `datetime`. Unnecessary columns, including `talk_id`, `all_speakers`, `url`, `occupations`, `about_speakers`, `related_talks`, `description`, and `transcript`, were dropped to streamline the dataset. The `speaker_1` column was renamed to `speaker` for clarity. Rows with `views` equal to `0` were removed, and a validation check confirmed that no such rows remained. These manipulations ensured a cleaner and more consistent dataset for further analysis and feature engineering.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# create the figure and subplots
fig, axs = plt.subplots(2,1, figsize=(18,12))

# create a barplot with top 15 speakers by views
sns.barplot(x='views', y='speaker', data=top15_views, ax=axs[0])
axs[0].set_title('Top 15 Speakers by Views')

# create a barplot with top 15 speakers by comments
sns.barplot(x='comments', y='speaker', data=top15_comments, ax=axs[1])
axs[1].set_title('Top 15 Speakers by Comments')


plt.tight_layout()
plt.show()

### Questions and Answers:

1. **Why did you pick the specific chart?**  
   The bar plots were chosen because they effectively showcase rankings and comparisons, making it easy to identify the most viewed and commented speakers at a glance.

2. **What is/are the insight(s) found from the chart?**  
   Popular speakers like Alex Gendler and Sir Ken Robinson have high views, while Richard Dawkins and Sir Ken Robinson lead in comments. Themes like education, inspiration, and controversy tend to drive audience engagement.

3. **Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth?**  
   Yes, the insights can guide future content strategies by identifying trending themes and speakers, boosting engagement and revenue. However, over-reliance on popular themes may stifle innovation and lead to audience fatigue. Balancing popular and unique content is essential.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#checking corr. with views column

plt.figure(figsize=(10,6))
sns.scatterplot(x='comments', y='views', data=df)

### Q&A Based on the Chart

1. **Why did you pick the specific chart?**  
   This scatter plot effectively shows the relationship between the number of views and comments for TED Talks. It allows us to identify patterns, such as whether higher views lead to higher comments or if there are any outliers.

2. **What is/are the insight(s) found from the chart?**  
   - The majority of TED Talks cluster at low view and comment counts, indicating modest engagement for the typical talk.  
   - Outliers with exceptionally high views or comments highlight specific talks that resonate significantly with the audience.

3. **Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
   - **Positive Impact**: Understanding which talks generate high engagement helps TED focus on popular themes or speakers to maximize audience interaction.  
   - **Negative Impact**: Over-prioritizing popular themes might reduce diversity and innovation, as niche or emerging topics with lower engagement could be overlooked.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
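# checking distribution of comments column
# (histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases)

plt.figure(figsize=(10,5))
sns.histplot(df['comments'], kde=True, stat='density', color='red')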

### Q&A Based on the Chart

1. **Why did you pick the specific chart?**  
   A density plot with a histogram provides a clear visualization of the distribution of comments. It shows the frequency and concentration of comments for TED Talks, making it easy to observe where most talks lie on the spectrum of engagement.

2. **What is/are the insight(s) found from the chart?**  
   - The majority of TED Talks receive fewer than 500 comments, with a sharp decline as the number of comments increases.  
   - A long tail exists for talks with higher comment counts, indicating a few talks generate significantly more discussion than the rest.

3. **Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
   - **Positive Impact**: Identifying the characteristics of highly commented talks can help TED optimize future content for engagement.  
   - **Negative Impact**: Overfocusing on the majority of low-comment talks might lead to missing opportunities to create more impactful and engaging content.
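
The long right tail can also be quantified rather than only read off the plot. The sketch below uses made-up comment counts, not the TED data; it reports the sample skewness before and after a `log1p` transform, plus a high percentile that could serve as a cut-off for extreme values.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed sample (not the real TED data)
comments = pd.Series([5, 12, 20, 33, 41, 58, 75, 90, 150, 2500])

skew_raw = comments.skew()            # sample skewness; > 0 means a long right tail
skew_log = np.log1p(comments).skew()  # log1p compresses the tail
p99 = comments.quantile(0.99)         # candidate cut-off for extreme values

print(f"skew raw={skew_raw:.2f}, after log1p={skew_log:.2f}, 99th pct={p99:.0f}")
```

A skewness well above zero supports treating the largest values separately, whether by capping, dropping, or transforming them.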

In [None]:
print(len(df[df['comments'] > 1100]))

In [None]:
#Dropping indexes where comments are greater than 1100
df.drop(df[df['comments']>1100].index, inplace=True)

In [None]:
# Treat 0 as missing and fill it with the median of the column
# (assigning the result back avoids relying on inplace=True on a column selection)

df['comments'] = df['comments'].replace(0, np.nan)
df['comments'] = df['comments'].fillna(df['comments'].median())

In [None]:
# checking distribution of comments column

plt.figure(figsize=(10,5))
sns.histplot(df['comments'], kde=True, stat='density', color='red')

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# check distribution of views column

plt.figure(figsize=(10,5))
sns.histplot(df['views'], kde=True, stat='density', color='green')


### Q&A Based on the Chart

1. **Why did you pick the specific chart?**  
   A density plot with a histogram was chosen to show the distribution of views for TED Talks. This visualization effectively highlights how viewership is concentrated and the range of outliers.

2. **What is/are the insight(s) found from the chart?**  
   - Most TED Talks receive a relatively low number of views, with the majority concentrated under 10 million.  
   - A long tail exists for videos with exceptionally high views, indicating a few highly successful talks dominate the overall viewership.

3. **Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
   - **Positive Impact**: Insights can guide TED to analyze the common traits of highly viewed talks (e.g., topics, speakers) and replicate their success.  
   - **Negative Impact**: Over-reliance on creating content similar to highly viewed talks might reduce diversity and innovation in the topics covered.

#### Chart - 5


In [None]:
# Chart - 5 visualization code
# check distribution of duration column

plt.figure(figsize=(10,5))
sns.histplot(df['duration'], kde=True, stat='density', color='orange')

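### Q&A Based on the Chart

1. **Why did you pick the specific chart?**  
   A density plot with a histogram shows how talk durations are distributed, which helps judge which video lengths are typical.

2. **What is/are the insight(s) found from the chart?**  
   - TED Talks have varying durations, with a possible preference for mid-length videos.  
   - Extreme durations may not be optimal for engagement.

3. **Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth?**  
   - **Positive Impact**: Video lengths can be tailored to audience preferences.  
   - **Negative Impact**: Over-standardizing lengths could limit creative freedom.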

In [None]:
# Convert duration from seconds to minutes

df['duration'] = df['duration'] / 60

In [None]:
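# Create a new column 'speaker_popularity' in the main DataFrame and assign the categories
# Note: the categories are binned from 'views', the prediction target, so when used
# as a model feature this column carries information about the target itself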

df['speaker_popularity'] = ""
df.loc[df['views'] <= 500000, 'speaker_popularity'] = 'not_popular'
df.loc[(df['views'] > 500000) & (df['views'] <= 1500000), 'speaker_popularity'] = 'avg_popular'
df.loc[(df['views'] > 1500000) & (df['views'] <= 2500000), 'speaker_popularity'] = 'popular'
df.loc[(df['views'] > 2500000) & (df['views'] <= 3500000), 'speaker_popularity'] = 'high_popular'
df.loc[df['views'] > 3500000, 'speaker_popularity'] = 'extreme_popular'

# check the dataset

df.sample(2)
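
The same binning can be written more compactly with `pd.cut`. This is a sketch on made-up view counts, assuming every value is greater than 0 (zero-view rows were dropped earlier); the bin edges mirror the thresholds in the cell above.

```python
import pandas as pd

# Same view thresholds as the cell above; edges are right-inclusive like the `<=` checks
bins = [0, 500_000, 1_500_000, 2_500_000, 3_500_000, float("inf")]
labels = ["not_popular", "avg_popular", "popular", "high_popular", "extreme_popular"]

# Illustrative view counts, not taken from the dataset
views = pd.Series([120_000, 500_000, 900_000, 2_000_000, 3_000_000, 65_000_000])
popularity = pd.cut(views, bins=bins, labels=labels)  # ordered categorical
print(popularity.tolist())
```

`pd.cut` returns an ordered categorical, so the category order does not have to be repeated when plotting or encoding.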

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(18,6))
sns.barplot(data=df, x='speaker_popularity', y='comments',
            order=['not_popular', 'avg_popular', 'popular', 'high_popular', 'extreme_popular'])

### Q&A Based on the Chart

1. **Why did you pick the specific chart?**  
   A bar plot compares the average number of comments across the ordered `speaker_popularity` categories, making it easy to see how discussion volume changes from `not_popular` to `extreme_popular` talks.

2. **What is/are the insight(s) found from the chart?**  
   - Each bar shows the mean comment count for one popularity category, so the chart relates the view-based categories directly to engagement.  
   - Differences in bar height between categories indicate how strongly comment activity varies with popularity.

3. **Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
   - **Positive Impact**: Identifying the characteristics of highly commented talks can help TED optimize future content for engagement.  
   - **Negative Impact**: Overfocusing on the majority of low-comment talks might lead to missing opportunities to create more impactful and engaging content.

In [None]:
# Create a new column 'video_rating' in the main DataFrame and assign the categories

df['video_rating'] = ""
df.loc[df['comments'] <= 50, 'video_rating'] = 1
df.loc[(df['comments'] > 50) & (df['comments'] <= 120), 'video_rating'] = 2
df.loc[(df['comments'] > 120) & (df['comments'] <= 200), 'video_rating'] = 3
df.loc[(df['comments'] > 200) & (df['comments'] <= 300), 'video_rating'] = 4
df.loc[df['comments'] > 300, 'video_rating'] = 5

# check the dataset
df.sample(2)

In [None]:
# add new column available_languages using existing column available_lang
# available_lang holds a list written as a string (e.g. "['en', 'es']"), so parse it
# before counting; len() on the raw string would count characters, not languages
import ast

df['available_languages'] = df['available_lang'].apply(lambda x: len(ast.literal_eval(x)))
pd.DataFrame(df['available_languages'])

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# check the distribution of this new column available_languages

plt.figure(figsize=(8,6))
sns.histplot(df['available_languages'], kde=True, stat='density', color='darkblue')
plt.show()


Why did you pick the specific chart?
A density plot with a histogram was chosen to visualize the distribution of available languages for TED Talks because it effectively highlights the frequency and spread of data. This combination helps understand how multilingual accessibility is distributed across the dataset.

What is/are the insight(s) found from the chart?

The distribution of available languages is approximately normal, with most TED Talks clustered around a central number of translations.
Few talks sit at either extreme, showing a central focus on talks with translations in a moderate range.
The tails suggest that only a small fraction of talks deviate significantly in either direction.
Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

Positive Impact: Knowing the typical number of available languages can help TED prioritize resources for translation services to maximize accessibility.
Negative Impact: Overextending translation efforts for less popular talks might lead to wasted resources without significantly improving audience engagement.
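
Values in `available_lang` come from a CSV file, so each one is likely a string containing a list literal rather than a Python list. The sketch below uses a made-up value to show the difference between the string's length and the number of languages it lists; `count_languages` is a hypothetical helper name.

```python
import ast

def count_languages(raw: str) -> int:
    """Count entries in a string such as "['en', 'es']" (assumed CSV format)."""
    return len(ast.literal_eval(raw))

raw_value = "['en', 'es', 'fr']"
print(len(raw_value))              # 18: number of characters in the string
print(count_languages(raw_value))  # 3: number of languages
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.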


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Select only numeric columns for correlation calculation
numeric_columns = d.select_dtypes(include=['number'])

correlation_matrix = numeric_columns.corr()

# Create the heatmap
plt.figure(figsize=(10, 8))  # Adjust figure size as needed
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)

# Add title and display the heatmap
plt.title("Correlation Heatmap")
plt.show()

Why did you pick the specific chart?

A correlation heatmap is a powerful visualization to understand the relationships between numerical variables in a dataset. It provides an easy-to-read summary of how strongly features are correlated with each other, aiding in feature selection and understanding dependencies.

What is/are the insight(s) found from the chart?

Positive Correlation: There is a moderate positive correlation of 0.50 between views and comments, suggesting that talks with higher views tend to have more comments.

Negative Correlation: There is a weak negative correlation between duration and talk_id (-0.26), indicating minimal dependency between the two variables.
Low Correlation: Features like duration and views (0.07) have almost no correlation, which means they are largely independent of each other.

Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

Positive Impact: The moderate positive correlation between views and comments suggests that engaging and widely viewed talks tend to spark more conversations. By focusing on promoting content that encourages interaction, TED can optimize user engagement.

Negative Impact: Misinterpreting weak correlations (e.g., duration and views) could lead to ineffective efforts in designing talks based on length, which may not drive significant growth. It's essential to rely on strong correlations when making strategic decisions.
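
`DataFrame.corr()` computes Pearson correlation by default, and Pearson is sensitive to outliers like those seen in `views` and `comments`. The sketch below, on made-up numbers, compares it with Spearman rank correlation, computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Illustrative data with one extreme talk (not the real TED data)
toy = pd.DataFrame({
    "views":    [1_000, 2_000, 3_000, 4_000, 5_000, 1_000_000],
    "comments": [10, 25, 20, 40, 35, 900],
})

# Pearson: linear association, dominated here by the single extreme row
pearson = toy["views"].corr(toy["comments"])

# Spearman: Pearson correlation computed on the ranks of each column
spearman = toy["views"].rank().corr(toy["comments"].rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

When the two coefficients differ noticeably, a few extreme rows are probably driving the Pearson value.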

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt




# Create the pair plot
sns.pairplot(d, diag_kind="kde", plot_kws={"alpha": 0.6})

# Display the plot
plt.show()


### Why did you pick the specific chart?  
The pair plot effectively visualizes pairwise relationships and distributions across numerical features, making it easy to detect correlations, trends, and outliers in the dataset.

### What is/are the insight(s) found from the chart?  
1. `talk_id` has no meaningful relationship with other features as it's an identifier.  
2. A positive relationship exists between `views` and `comments`; talks with higher views tend to receive more comments.  
3. Longer durations cluster around moderate views and comments but show no strong correlation.  
4. Distributions of `views` and `comments` are highly skewed, with significant outliers.

### Will the gained insights help create a positive business impact?  
**Positive Impact**:  
1. Insights on `views` and `comments` can help TED optimize content for engagement.  
2. Understanding `duration` trends enables tailoring talk lengths for better audience retention.  
**Negative Impact**:  
1. Overemphasis on popular talks might ignore niche audiences.  
2. Unhandled outliers could bias future content strategies.  

**FEATURE ENGINEERING**

In [None]:
# Making separate columns for day, month and year of upload

df['published_year'] = df['published_date'].dt.year
df['published_month'] = df['published_date'].dt.month
df['published_day'] = df["published_date"].dt.day_name()

# storing weekdays in order of numbers from 0 to 6 value

daydict = {'Sunday' : 0, 'Monday' : 1, 'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6}

# making new column holding information of day number

df['published_daynumber'] = df['published_day'].map(daydict)


In [None]:
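# add one more column published_months_ago: months elapsed up to a reference date
# (only the reference year 2023 is given; January is assumed as the reference month)

ref_year, ref_month = 2023, 1
df['published_months_ago'] = (ref_year - df['published_year'])*12 + (ref_month - df['published_month'])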
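
Elapsed months between two year/month pairs are the year difference times 12 plus the month difference. A small sketch follows; the function name and the January 2023 reference date are assumptions made for illustration.

```python
def months_between(year: int, month: int, ref_year: int = 2023, ref_month: int = 1) -> int:
    """Whole calendar months from (year, month) up to (ref_year, ref_month)."""
    return (ref_year - year) * 12 + (ref_month - month)

print(months_between(2022, 12))  # 1
print(months_between(2022, 1))   # 12
print(months_between(2006, 6))   # 199
```

Subtracting the publication month means a talk published later in a year gets a smaller age than one published earlier in the same year.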


In [None]:
df.sample(1)

In [None]:
# there are lot of TED events

print(df['event'].value_counts().head(10))

In [None]:
# add new column of each TED event type using existing column event

ted_categories = ['TED-Ed','TEDx', 'TED', 'TEDGlobal', 'TEDSummit', 'TEDWomen', 'TED Residency']


df['TEDevent_type'] = df['event'].map(lambda x: "TEDx" if x[0:4] == "TEDx" else x)
df['TEDevent_type'] = df['TEDevent_type'].map(lambda x: "TED-Ed" if x[0:6] == "TED-Ed" else x)
df['TEDevent_type'] = df['TEDevent_type'].map(lambda x: "TED" if x[0:4] == "TED2" else x)
df['TEDevent_type'] = df['TEDevent_type'].map(lambda x: "TEDGlobal" if x[0:4] == "TEDG" else x)
df['TEDevent_type'] = df['TEDevent_type'].map(lambda x: "TEDWomen" if x[0:4] == "TEDW" else x)
df['TEDevent_type'] = df['TEDevent_type'].map(lambda x: "TEDSummit" if x[0:4] == "TEDS" else x)
df['TEDevent_type'] = df['TEDevent_type'].map(lambda x: "TED Residency" if x[0:13] == "TED Residency" else x)
df['TEDevent_type'] = df['TEDevent_type'].map(lambda x: "Other TED" if x not in ted_categories else x)

In [None]:

# check the all events talkshows counts

pd.DataFrame(df['TEDevent_type'].value_counts()).reset_index()
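
The chain of prefix checks can also be expressed as one function with an ordered rule list. This is a sketch rather than a drop-in replacement. It matches full prefixes, so an event starting with `TEDS` but not with `TEDSummit` falls into `Other TED`, unlike the four-character checks above. The event names in the usage lines are illustrative.

```python
# Prefix rules checked in order; category names mirror `ted_categories` above
PREFIX_RULES = [
    ("TEDx", "TEDx"),
    ("TED-Ed", "TED-Ed"),
    ("TEDGlobal", "TEDGlobal"),
    ("TEDSummit", "TEDSummit"),
    ("TEDWomen", "TEDWomen"),
    ("TED Residency", "TED Residency"),
    ("TED2", "TED"),  # main conference events such as "TED2019"
]

def event_type(event: str) -> str:
    """Map a raw event name to a coarse TED event category."""
    for prefix, category in PREFIX_RULES:
        if event.startswith(prefix):
            return category
    return "Other TED"

for name in ["TEDxBoston", "TED2019", "TEDGlobal 2012", "TED-Ed", "TEDSalon NY2011"]:
    print(name, "->", event_type(name))
```

With a helper like this, the column could be built in one step via `df['event'].map(event_type)`.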

In [None]:
import ast

# use duplicate dataframe for topics analysis
dff = df.copy()

dff['topics'] = dff['topics'].apply(lambda x: ast.literal_eval(x))
s = dff.apply(lambda x: pd.Series(x['topics']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'topic'

dff = dff.drop('topics', axis=1).join(s)
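
For reference, pandas also provides `DataFrame.explode`, which produces the same one-row-per-topic layout without building a `Series` per row. The sketch below uses a toy frame with made-up titles and topics.

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    "title": ["Talk A", "Talk B"],
    "topics": ["['science', 'health']", "['design']"],  # list literals stored as text
})

# Parse the text into real lists, then emit one row per list element
toy["topics"] = toy["topics"].apply(ast.literal_eval)
long_form = toy.explode("topics").rename(columns={"topics": "topic"})
print(long_form)
```

The original row index is repeated for each topic, so the long table can still be traced back to the source talk.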

In [None]:
# plot a bar chart of the most common topics on the TED website

pop_topic = pd.DataFrame(dff['topic'].value_counts()).reset_index()
pop_topic.columns = ['topic', 'TEDtalks']

plt.figure(figsize=(20,6))
sns.barplot(x='topic', y='TEDtalks', data=pop_topic.head(12))
plt.show()


### Feature Manipulation & Selection

In [None]:
df.drop(labels = ["speaker", "title", "recorded_date", "published_date", "event", "native_lang", "available_lang", "topics"],axis = 1, inplace = True)


In [None]:
# again change data-types of columns

df = df.astype({'comments':'int64', 'views':'int64','video_rating':'int64'})

df = df.astype({
    'speaker_popularity': 'category',
    'published_day': 'category',
    'TEDevent_type': 'category'
})

# Multicollinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# create a new DataFrame with only numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'int32', 'float32', 'float64']).drop(['views'], axis=1)

# calculate VIF for each column
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(numeric_cols.values, i) for i in range(numeric_cols.shape[1])]
vif["features"] = numeric_cols.columns

# print the results
vif

The columns published_year and published_months_ago are highly correlated with each other and have high VIF values. We drop both, together with published_month and video_rating (which is derived from comments), and check the VIF again.
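
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below implements this with NumPy on synthetic data. It adds an intercept to each regression, so its numbers need not match `variance_inflation_factor` from statsmodels when that function is given data without a constant column. The feature names are reused only for illustration.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic features: months_ago is nearly a linear function of year
rng = np.random.default_rng(0)
year = rng.integers(2006, 2021, size=200).astype(float)
months_ago = (2023 - year) * 12 + rng.integers(0, 12, size=200)
duration = rng.normal(12, 4, size=200)  # unrelated feature
X = np.column_stack([year, months_ago, duration])

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two date-derived columns inflate each other's VIF while the unrelated column stays close to 1, which is the pattern that motivates dropping one of a redundant pair.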

In [None]:
# Step 1: drop published_year, published_month, published_months_ago and video_rating

df.drop(['published_year','published_month', 'published_months_ago','video_rating'], axis=1, inplace=True)


# Step 2: calculate VIF

numeric_cols = df.select_dtypes(include=['int64', 'int32', 'float32', 'float64']).drop(['views'], axis=1)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(numeric_cols.values, i) for i in range(numeric_cols.shape[1])]
vif["features"] = numeric_cols.columns


# print the results

vif


In [None]:
# Apply the Yeo-Johnson transform (PowerTransformer's default) to the views column, then train-test split the data

pt = PowerTransformer()
df['views'] = pt.fit_transform(pd.DataFrame(df['views']))
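
The cell above fits the transformer on the full `views` column. An alternative that is sometimes preferred is to learn the transform from the training split only and to map predictions back to raw view counts with `inverse_transform`. It is shown here only as a sketch on synthetic data, not as the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed stand-in for view counts (not the real data)
rng = np.random.default_rng(0)
views = rng.lognormal(mean=13, sigma=1, size=500).reshape(-1, 1)

views_train, views_test = train_test_split(views, test_size=0.2, random_state=0)

pt = PowerTransformer(method="yeo-johnson")  # default method; output is standardized
train_t = pt.fit_transform(views_train)      # parameters learned from training data only
test_t = pt.transform(views_test)            # same parameters reused on the test data

# Values on the transformed scale can be mapped back to raw view counts
recovered = pt.inverse_transform(test_t)
print(np.allclose(recovered, views_test, rtol=1e-4))
```

Keeping the fitted `pt` object around matters because error metrics on the transformed scale are not in units of views.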

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
# split the Dataset into independent(x) and dependent(y) Dataset

X = df.drop(columns=['views'])
y = df['views']


In [None]:
# display independent variables dataframe

X

In [None]:
# display dependent variable dataframe

y

In [None]:
# calling train_test_split() to get the training and testing data.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

# split sizes
print(X_train.shape)
print(X_test.shape)


In [None]:
# step1: a ColumnTransformer that scales, power-transforms and encodes columns selected by position
# note: the transformers run independently, so columns 0, 1 and 3 appear twice in the output (once scaled, once power-transformed)
step1 = ColumnTransformer(transformers=[
    ('col_tnf', StandardScaler(),[0,1,3,5]),
    ('col_tnf1', PowerTransformer(),[0,1,3]),
    ('col_tnf2', OneHotEncoder(sparse_output=False, drop='first'),[4,6]), # 'sparse_output' replaced 'sparse' in scikit-learn 1.2
    ('col_tnf3', OrdinalEncoder(categories=[['not_popular','avg_popular','popular','high_popular','extreme_popular']]),[2])
],remainder='passthrough')



# display pipeline

from sklearn import set_config
set_config(display='diagram')

By utilizing a ColumnTransformer, we can efficiently apply multiple pre-processing steps, such as scaling, encoding and function transformation, to our data in a single step. This simplifies the pre-processing phase and allows us to build pipelines with different algorithms, performing hyperparameter tuning to find the best results for our model.

In [None]:
# apply LinearRegression algorithm as step2

step2 = LinearRegression()


# make pipeline
pipe1 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# fit the pipeline on training dataset
pipe1.fit(X_train,y_train)

# predict the train and test dataset
y_pred_train = pipe1.predict(X_train)
y_pred = pipe1.predict(X_test)

# display pipeline diagram
display(pipe1)
# LinearRegression model all output scores
print('\033[1mTraining data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_train,y_pred_train))
print('Adjusted R2 score', (1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))))

print('\n')
print('\033[1mTesting data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_test,y_pred))
print('Adjusted R2 score', (1-(1-r2_score(y_test,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))))

print('\n')
print('\033[1mThe performance metrics\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('MAE',mean_absolute_error(y_test,y_pred))
print('MSE',mean_squared_error(y_test,y_pred))
print('RMSE',np.sqrt(mean_squared_error(y_test,y_pred)))
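
The adjusted R2 expression printed above is repeated for every model, so it could be wrapped in a small helper. This is a sketch only: `adjusted_r2` is a hypothetical name, and the numbers are made up to exercise it. It uses the same formula as the print statements, 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features.

```python
import numpy as np
from sklearn.metrics import r2_score


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R2 for n = len(y_true) samples and p = n_features predictors."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


# Made-up values, only to show the call
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.4, 10.7, 13.1])

r2 = r2_score(y_true, y_hat)
adj = adjusted_r2(y_true, y_hat, n_features=2)
print('R2 score', r2)
print('Adjusted R2 score', adj)
```

With the notebook's variables, the call would look like `adjusted_r2(y_test, y_pred, X_test.shape[1])`. Note that `X_test.shape[1]` counts the raw columns; after one-hot encoding the model sees more columns than that.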


In [None]:

# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()


### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation
# apply Ridge regression with hyperparameter tuning (GridSearchCV over alpha) as step2


# giving parameters
parameters = {'alpha': [1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1,3,5,8,12,15,18,21,25]}

# the dataset is small, so an exhaustive GridSearchCV is affordable and RandomizedSearchCV is not needed
Reg_ridge = GridSearchCV(Ridge(), parameters, cv=10)

step2 = Reg_ridge

# make pipeline
pipe2 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# fit the pipeline on training dataset
pipe2.fit(X_train,y_train)

# predict the train and test dataset
y_pred_train = pipe2.predict(X_train)
y_pred = pipe2.predict(X_test)

# display pipeline diagram
display(pipe2)
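
The Ridge cell above fits the pipeline but does not print any scores. The sketch below shows one way to read the tuned alpha and the usual metrics from a pipeline whose last step is a GridSearchCV. It runs on synthetic data from `make_regression`, so the numbers it prints say nothing about the TED results, and the names `demo_pipe`, `search` and `report` are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real features
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipe = Pipeline([
    ('step1', StandardScaler()),
    ('step2', GridSearchCV(Ridge(), {'alpha': [1e-3, 1e-2, 1e-1, 1, 5, 25]}, cv=10)),
])
demo_pipe.fit(Xtr, ytr)

# The fitted search object is reachable through the pipeline's named steps
search = demo_pipe.named_steps['step2']
pred = demo_pipe.predict(Xte)

mse = mean_squared_error(yte, pred)
report = {
    'best_params': search.best_params_,
    'cv_score': search.best_score_,
    'R2': r2_score(yte, pred),
    'MAE': mean_absolute_error(yte, pred),
    'MSE': mse,
    'RMSE': np.sqrt(mse),
}
for key, value in report.items():
    print(key, value)
```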

In [None]:

# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# apply LassoRegression algorithm with hyperparameter tuning as step2


# giving parameters
parameters = {'alpha': [1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1,2,3,4,5,8,12,15,18,21,25]}

# the dataset is small, so an exhaustive GridSearchCV is affordable and RandomizedSearchCV is not needed
Reg_Lasso = GridSearchCV(Lasso(), parameters, cv=10)

step2 = Reg_Lasso

# make pipeline
pipe3 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# fit the pipeline on training dataset
pipe3.fit(X_train,y_train)

# predict the train and test dataset
y_pred_train = pipe3.predict(X_train)
y_pred = pipe3.predict(X_test)

# display pipeline diagram
display(pipe3)

# Lasso Regression model all output scores
print('\033[1mTraining data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_train,y_pred_train))
print('Adjusted R2 score', (1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))))

print('\n')
print('\033[1mTesting data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_test,y_pred))
print('Adjusted R2 score', (1-(1-r2_score(y_test,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))))

print('\n')
print('\033[1mCross-validation score and best params\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print("The best parameters are", Reg_Lasso.best_params_)
print('cross-validation score', Reg_Lasso.best_score_)

print('\n')
print('\033[1mThe performance metrics\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('MAE',mean_absolute_error(y_test,y_pred))
print('MSE',mean_squared_error(y_test,y_pred))
print('RMSE',np.sqrt(mean_squared_error(y_test,y_pred)))


In [None]:

# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()
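
One point worth keeping in mind when reading the alpha grid for Lasso: the penalty strength interacts with the scale of the target. The sketch below, on synthetic data with standardized features and a standardized target (an assumption that only partly matches the notebook, where some columns are encoded rather than scaled), counts how many coefficients stay non-zero for a few alpha values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, only 3 of them informative
X_demo, y_demo = make_regression(n_samples=300, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)
X_std = StandardScaler().fit_transform(X_demo)
y_std = (y_demo - y_demo.mean()) / y_demo.std()

# Count surviving coefficients for increasing penalty strength
nonzero = {}
for alpha in [1e-4, 1e-2, 1e-1, 1, 5]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_std, y_std)
    nonzero[alpha] = int(np.sum(model.coef_ != 0))

for alpha, count in nonzero.items():
    print(f'alpha={alpha}: {count} non-zero coefficients')
```

Under these assumptions, alpha values of 1 or more shrink every coefficient to zero, because no standardized feature can have a correlation with the target larger than 1. Large grid values therefore tend to act as a constant-prediction baseline rather than as competitive candidates.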

### ML Model - 4

In [None]:
# apply DecisionTreeRegressor algorithm with hyperparameter tuning as step2


# giving parameters
parameters = {
    'criterion':['squared_error'],     # 'friedman_mse', 'absolute_error'
    'splitter' :['best'],              # random
    'max_depth' :[6],                  #4,5,6,7,8,9,None
    'max_features' :[1.0]              #0.25,0.50,0.75,0.85
}

# the dataset is small, so an exhaustive GridSearchCV is affordable and RandomizedSearchCV is not needed
dtr = GridSearchCV(DecisionTreeRegressor(), param_grid=parameters , cv=10, n_jobs=-1)

step2 = dtr

# make pipeline
pipe4 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# fit the pipeline on training dataset
pipe4.fit(X_train,y_train)
# predict the train and test dataset
y_pred_train = pipe4.predict(X_train)
y_pred = pipe4.predict(X_test)

# display pipeline diagram
display(pipe4)

# DecisionTreeRegressor model all output scores
print('\033[1mTraining data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_train,y_pred_train))
print('Adjusted R2 score', (1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))))

print('\n')
print('\033[1mTesting data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_test,y_pred))
print('Adjusted R2 score', (1-(1-r2_score(y_test,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))))
print('\n')
print('\033[1mCross-validation score and best params\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print("The best parameters are", dtr.best_params_)
print('cross-validation score', dtr.best_score_)

print('\n')
print('\033[1mThe performance metrics\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('MAE',mean_absolute_error(y_test,y_pred))
print('MSE',mean_squared_error(y_test,y_pred))
print('RMSE',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:

# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()
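
The decision tree grid above holds a single value per parameter, with the other candidates left in comments, so the search only cross-validates one configuration. The sketch below shows how a grid with several candidates could be searched, here with GridSearchCV wrapped around the whole pipeline so that preprocessing is refitted inside each fold. This is an alternative arrangement, not the notebook's setup, and it runs on synthetic data. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the real features
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

tree_pipe = Pipeline([
    ('step1', StandardScaler()),
    ('step2', DecisionTreeRegressor(random_state=0)),
])

# 4 x 3 = 12 candidate configurations
param_grid = {
    'step2__max_depth': [4, 6, 8, None],
    'step2__max_features': [0.5, 0.75, 1.0],
}

tree_search = GridSearchCV(tree_pipe, param_grid=param_grid, cv=5)
tree_search.fit(X_demo, y_demo)

print('The best parameters are', tree_search.best_params_)
print('cross-validation score', tree_search.best_score_)
```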


### ML Model - 5

In [None]:
# apply RandomForestRegressor algorithm with hyperparameter tuning as step2


# giving parameters
parameters = {
    'n_estimators':[58],      # 50,55,60,70,80,90,100
    'max_depth' :[6],         # 4,5,6,7,8,9,None
    'max_features' :[None],   # 'sqrt','log2'
    'max_samples' :[0.85]     # 0.40,0.50,0.60,0.70,0.75,0.85,1.0
}

# the dataset is small, so an exhaustive GridSearchCV is affordable and RandomizedSearchCV is not needed
rfr = GridSearchCV(RandomForestRegressor(), param_grid=parameters , cv=10, n_jobs=-1)

step2 = rfr

# make pipeline
pipe5 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])
# fit the pipeline on training dataset
pipe5.fit(X_train,y_train)

# predict the train and test dataset
y_pred_train = pipe5.predict(X_train)
y_pred = pipe5.predict(X_test)

# display pipeline diagram
display(pipe5)

# RandomForestRegressor model all output scores
print('\033[1mTraining data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_train,y_pred_train))
print('Adjusted R2 score', (1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))))

print('\n')
print('\033[1mTesting data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_test,y_pred))
print('Adjusted R2 score', (1-(1-r2_score(y_test,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))))

print('\n')
print('\033[1mCross-validation score and best params\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print("The best parameters are", rfr.best_params_)
print('cross-validation score', rfr.best_score_)
print('\n')
print('\033[1mThe performance metrics\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('MAE',mean_absolute_error(y_test,y_pred))
print('MSE',mean_squared_error(y_test,y_pred))
print('RMSE',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:

# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()

# **Conclusion**

After evaluating several regression models on the dataset, the Random Forest Regressor performed better than the other models. It has a higher R2 score and lower error metrics, and it generalizes well to unseen data.

Linear Regression and Lasso Regression scored slightly lower than the Random Forest Regressor.

The Decision Tree Regressor has a lower R2 score and higher error metrics than the best model, and it overfits the data slightly, so it is not the best model to use.
Therefore, based on the evaluation results, the Random Forest Regressor was chosen as the best model for our objective. In future work, other optimization techniques could be tried to improve the results further.

🥇RandomForest with hyperparameter tuning🥇

Training data R2 and Adjusted R2 Score

R2 score 0.9108

Adjusted R2 score 0.9106

Testing data R2 and Adjusted R2 Score

R2 score 0.8977

Adjusted R2 score 0.

Cross-validation score 0.8974

The performance metrics

MAE 0.2613

MSE 0.1055

RMSE 0.3249
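
A comparison like the one summarized above could be collected in a single table instead of being read off separate cells. The sketch below does this on synthetic data with untuned models, so the ranking it prints is not the ranking reported for the TED data. The variable names are assumptions of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the prepared TED features
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'DecisionTree': DecisionTreeRegressor(max_depth=6, random_state=0),
    'RandomForest': RandomForestRegressor(n_estimators=58, max_depth=6, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    mse = mean_squared_error(yte, pred)
    rows.append({
        'model': name,
        'train_R2': r2_score(ytr, model.predict(Xtr)),
        'test_R2': r2_score(yte, pred),
        'MAE': mean_absolute_error(yte, pred),
        'RMSE': np.sqrt(mse),
    })

# One row per model, best test R2 first
comparison = pd.DataFrame(rows).set_index('model').sort_values('test_R2', ascending=False)
print(comparison)
```

A large gap between `train_R2` and `test_R2` in such a table is the usual sign of the overfitting mentioned for the decision tree.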

Finally, thank you for going through the project to the very end; I genuinely appreciate your time. Happy Learning!

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***