<a href="https://colab.research.google.com/github/Ashish-Kumar-Vaish/TED-Talk-Regression/blob/main/Ted_Talk_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - TED Talk Regression



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name**          - Ashish Kumar Vaish

# **Project Summary -**

This project aims to analyze and predict the popularity of TED Talks using structured data scraped from the TED website. The popularity of these talks, measured in terms of view count, is influenced by numerous factors such as topic, duration, speaker credentials, and audience engagement (for example, the number of comments).

In this project, we begin with exploratory data analysis (EDA) to gain insights into the dataset and identify any data quality issues. Our EDA includes univariate, bivariate, and multivariate analyses, supported by meaningful visualizations that follow the UBM rule.

After thorough data preprocessing, including handling missing values, outlier treatment, and scaling, we apply regression-based machine learning models such as Linear Regression, Random Forest, and Gradient Boosting to predict the number of views a TED Talk might receive. We also perform hyperparameter tuning and evaluate the models using metrics like R², RMSE, and MAE.

The project offers insights that can guide TED's content strategy by highlighting the key factors that make a talk go viral.

# **GitHub Link -**

https://github.com/Ashish-Kumar-Vaish/TED-Talk-Regression

# **Problem Statement**


**To build a regression model that predicts the number of views a TED Talk will receive based on available features such as title, speaker, occupation, duration, date, event, languages, topics, and engagement metrics.**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import ast
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/data_ted_talks.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
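# Visualizing the missing values
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]  # keep only columns with missing values

plt.figure(figsize=(10,6))
# Pass x and y explicitly so each column gets its own bar on any seaborn version.
sns.barplot(x=missing_counts.index, y=missing_counts.values)
plt.title("Missing Values Bar Plot")
plt.xlabel("Columns")
plt.ylabel("Missing Values Count")
plt.show()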

### What did you know about your dataset?

- The dataset contains one row per TED Talk; `df.shape` above gives the exact number of rows and columns.
- Columns include metadata like `title`, `speaker_1`, `all_speakers`, `occupations`, `event`, `duration`, `views`, `comments`, `available_lang`, `topics`, etc.
- Fields such as `topics`, `occupations`, `all_speakers`, and `available_lang` are stored as strings and need parsing, and `published_date` needs conversion to a datetime.
- There are missing values in `occupations`, `all_speakers`, and `comments`.
- No duplicate rows are present.
- The target variable for prediction is clearly `views`.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

- `talk_id` - Identifier of the talk.
- `title` - Title of the talk.
- `speaker_1` - Name of the first-listed speaker.
- `all_speakers` - Dictionary of all speakers of the talk.
- `occupations` - Dictionary mapping speakers to lists of their occupations.
- `about_speakers` - Text about the speakers.
- `views` - Total number of views (Target variable).
- `recorded_date` - Date the talk was recorded.
- `published_date` - Date the talk was published online.
- `event` - Name of the event where the talk was given.
- `available_lang` - List of languages the talk is available in.
- `comments` - Number of comments on the TED Talk.
- `duration` - Duration of the talk in seconds.
- `topics` - List of topics assigned to the talk.
- `related_talks` - Related talks.
- `url` - URL of the talk.
- `description` - Short description of the talk.
- `transcript` - Transcript of the talk.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
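# Write your code to make your dataset analysis ready.
# Drop identifiers and free-text columns that are not used in the analysis.
df = df.drop(columns=['talk_id', 'title', 'about_speakers', 'recorded_date', 'related_talks', 'url', 'description', 'transcript'])

# Parse the publication date into a datetime column.
df['published_date'] = pd.to_datetime(df['published_date'])

# These columns hold stringified Python literals; convert them back to lists/dicts.
# Missing `occupations` and `all_speakers` entries become empty dicts.
df['topics'] = df['topics'].apply(ast.literal_eval)
df['occupations'] = df['occupations'].fillna('{}').apply(ast.literal_eval)
df['all_speakers'] = df['all_speakers'].fillna('{}').apply(ast.literal_eval)
df['available_lang'] = df['available_lang'].apply(ast.literal_eval)

# Treat a missing comment count as zero comments.
df['comments'] = df['comments'].fillna(0)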
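
The list- and dict-valued columns arrive as strings, which is why `ast.literal_eval` is applied above. The following is a minimal sketch of a reusable parser on a toy frame. The helper name `parse_literal` and the sample rows are hypothetical, and the real column contents may differ.

```python
import ast

import numpy as np
import pandas as pd


def parse_literal(value, default):
    """Parse a stringified Python literal, returning `default` for missing values."""
    if pd.isna(value):
        return default
    return ast.literal_eval(value)


# Toy data standing in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    "topics": ["['science', 'health']", "['design']"],
    "occupations": ["{0: ['writer']}", np.nan],
})

# Extra keyword arguments to Series.apply are forwarded to the function.
toy["topics"] = toy["topics"].apply(parse_literal, default=[])
toy["occupations"] = toy["occupations"].apply(parse_literal, default={})

print(toy["topics"].apply(len).tolist())
print(toy["occupations"].tolist())
```

Handling the missing case inside the helper avoids the separate `fillna('{}')` step, at the cost of sharing one default object across all missing rows.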

In [None]:
all_topics = []

for sublist in df['topics']:
  for topic in sublist:
    all_topics.append(topic)

unique_topics = sorted(set(all_topics))
print(unique_topics)

In [None]:
all_occupations = []

for sublist in df['occupations']:
  for occupation in sublist.values():
    for i in occupation:
      all_occupations.append(i)

unique_occupations = sorted(set(all_occupations))
print(unique_occupations)
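
The nested loops above flatten list-valued columns by hand. As an alternative sketch, `Series.explode` can do the flattening and `value_counts` then gives topic frequencies as well as the unique values. The sample data and the names `topics_demo`, `topic_counts`, and `unique_topics_demo` are made up for illustration.

```python
import pandas as pd

# Toy list-valued column standing in for df['topics'].
topics_demo = pd.Series([["science", "health"], ["science"], ["design", "science"]])

# explode() emits one row per list element; value_counts() tallies them.
topic_counts = topics_demo.explode().value_counts()
unique_topics_demo = sorted(topic_counts.index)

print(topic_counts)
print(unique_topics_demo)
```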

### What all manipulations have you done and insights you found?

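The wrangling step drops columns that are not used in the analysis (`talk_id`, `title`, `about_speakers`, `recorded_date`, `related_talks`, `url`, `description`, `transcript`), converts `published_date` to a datetime, parses the stringified `topics`, `occupations`, `all_speakers`, and `available_lang` columns into Python lists and dictionaries with `ast.literal_eval`, and fills missing `comments` with 0. The sorted lists of unique topics and occupations are then printed for inspection.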

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(df['views'][df['views'] < 2e7] / 1e7, bins=50)
plt.title('Distribution of TED Talk Views')
plt.xlabel('Views (in 10 Millions)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(df['comments'][df['comments'] < 2000], bins=50, color='violet')
plt.title('Distribution of TED Talk Comments')
plt.xlabel('Number of Comments < 2k')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(df['duration'][df['duration'] < 3000], bins=40, color='red')
plt.title('Distribution of TED Talk Duration')
plt.xlabel('Duration (in seconds)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 5))
top_events = df.groupby('event')['views'].sum().sort_values(ascending=False).head(10)
sns.barplot(x=top_events.index,y=top_events.values / 1e7)
plt.title('Top 10 TED Events by Number of Views')
plt.xlabel('Event')
plt.ylabel('Views (in 10M)')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(9, 5))
sns.histplot(df['available_lang'].apply(len), bins=30, color='teal')
plt.title('Distribution of Available Languages per TED Talk')
plt.xlabel('Number of Languages available')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 5))
sns.histplot(df['duration'] / 60, bins=50, color='blue')
plt.title('Distribution of TED Talk Durations')
plt.xlabel('Duration (in Minutes)')
plt.ylabel('Number of Talks')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(12, 5))
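# Count talks per year and order the bars chronologically.
year_counts = df['published_date'].dt.year.value_counts().sort_index()
sns.barplot(x=year_counts.index, y=year_counts.values, color='red')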
plt.title('Number of TED Talks Published Each Year')
plt.xlabel('Year')
plt.ylabel('Number of Talks')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(12, 6))
sns.scatterplot(x=df['duration'] / 60,y=df['views'] / 1e7, alpha=0.5)
plt.title('Duration vs Views')
plt.xlabel('Duration (in minutes)')
plt.ylabel('Views (in 10M)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 5))
sns.scatterplot(x= df['available_lang'].apply(len), y=df['views'] / 1e7, alpha=0.5)
plt.title('Number of Languages vs Views')
plt.xlabel('Number of Available Languages')
plt.ylabel('Views (in 10M)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 5))
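# Filter the rows once so that x and y come from the same subset.
comments_subset = df[df['comments'] < 4000]
sns.scatterplot(x=comments_subset['comments'], y=comments_subset['views'] / 1e7, alpha=0.5)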
plt.title('Number of Comments vs Views')
plt.xlabel('Number of Comments')
plt.ylabel('Views (in 10M)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10, 5))
sns.scatterplot(x=df['topics'].apply(len), y=df['views'] / 1e7, alpha=0.5)
plt.title('Number of Topics vs Views')
plt.xlabel('Number of Topics')
plt.ylabel('Views (in 10M)')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(12, 5))
speaker_views = df.groupby('speaker_1')['views'].sum().sort_values(ascending=False).head(10)

sns.barplot(x=speaker_views.index, y=speaker_views.values / 1e7, color="yellow")
plt.title('Top 10 Speakers by Total Views')
plt.xlabel('Speaker')
plt.ylabel('Total Views (in 10M)')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_df = df[['views', 'duration', 'comments']].copy()
numeric_df['num_topics'] = df['topics'].apply(len)
numeric_df['num_languages'] = df['available_lang'].apply(len)

plt.figure(figsize=(10, 6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sample_df = df[['views', 'duration', 'comments']].copy()
sample_df['num_topics'] = df['topics'].apply(len)
sample_df['num_languages'] = df['available_lang'].apply(len)
sample_df = sample_df.sample(500, random_state=42)

sns.pairplot(sample_df)
plt.suptitle("Pair Plot of Numerical Features")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* H0: The number of available languages for a TED Talk has no significant relationship with the number of views.
* H1: TED Talks available in more languages tend to receive more views.

#### 2. Perform an appropriate statistical test.

In [None]:
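# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# pearsonr returns both the correlation coefficient and a two-sided p-value.
num_languages = df['available_lang'].apply(len)
lang_views_corr, lang_views_p = pearsonr(num_languages, df['views'])
print(f"Correlation between number of languages and views: {lang_views_corr:.4f}")
print(f"P-value: {lang_views_p:.4g}")

##### Which statistical test have you done to obtain P-Value?

Pearson’s correlation test via `scipy.stats.pearsonr`, which measures the linear relationship and returns the correlation coefficient together with a two-sided p-value.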

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* H0: The number of comments on a TED Talk has no effect on the number of views.
* H1: Talks with more comments tend to have more views.

#### 2. Perform an appropriate statistical test.

In [None]:
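# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# pearsonr returns both the correlation coefficient and a two-sided p-value.
comments_views_corr, comments_views_p = pearsonr(df['comments'], df['views'])
print(f"Correlation between number of comments and views: {comments_views_corr:.4f}")
print(f"P-value: {comments_views_p:.4g}")

##### Which statistical test have you done to obtain P-Value?

Pearson’s correlation test via `scipy.stats.pearsonr`, which measures the linear relationship and returns the correlation coefficient together with a two-sided p-value.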

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

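Missing values were handled during data wrangling: missing `comments` were filled with 0, and missing `occupations` and `all_speakers` entries were replaced with empty dictionaries so that those columns can be parsed uniformly. The cell above re-checks the remaining null counts.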

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
plt.figure(figsize=(10, 5))
sns.boxplot(x=df['views'])
plt.title("Boxplot of Views")
plt.show()
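
The cell above only visualizes the outliers. One possible treatment, shown here as a sketch rather than the approach used in this notebook, is to clip values to the interquartile-range fences. The helper name `clip_iqr`, the multiplier `k=1.5`, and the sample values are assumptions for illustration.

```python
import pandas as pd


def clip_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up view counts with one extreme value.
views_demo = pd.Series([100, 120, 130, 150, 160, 10_000])
clipped = clip_iqr(views_demo)

print(clipped.tolist())
```

Clipping keeps every row but caps the influence of extreme talks. A log transform of `views` is another common option for a heavily right-skewed target.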

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
df['num_topics'] = df['topics'].apply(len)
df['num_languages'] = df['available_lang'].apply(len)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
df = df.drop(columns=['speaker_1', 'event'])

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Scaling

In [None]:
# Scaling your data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['duration', 'comments', 'num_topics', 'num_languages']])

##### Which method have you used to scale your data and why?

### 6. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x = df[['duration', 'comments', 'num_topics', 'num_languages']]
y = df['views']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.15,random_state=42)
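
When scaling is needed, fitting the scaler on the training rows only keeps test-set statistics out of the model. A minimal sketch of this with a scikit-learn pipeline follows. It uses synthetic data in place of the TED features, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four numeric features and the target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

# The scaler is fitted inside the pipeline, so it only sees the training rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

print("Test R^2:", round(pipe.score(X_te, y_te), 4))
```

Tree-based models such as Random Forest and Gradient Boosting are insensitive to feature scaling, so the pipeline matters mostly for linear and distance-based models.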

##### What data splitting ratio have you used and why?

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm
model = LinearRegression()
model.fit(x_train, y_train)

# Predict on the model
y_pred = model.predict(x_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Linear Regression Performance:")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, x, y, cv=5, scoring='r2')
print("Cross-validated R² Scores:", cv_scores)
print("Mean CV R² Score:", np.mean(cv_scores))

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:
model = RandomForestRegressor(random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)

# Evaluation Metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Random Forest RMSE:", round(rmse, 2))
print("Random Forest R² Score:", round(r2, 3))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Hyperparameter grid to search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf = RandomForestRegressor(random_state=42)

# Fit the Algorithm
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
grid_search.fit(x_train, y_train)

best = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

# Predict on the model and re-evaluate on the test set
y_pred_best_rf = best.predict(x_test)
print("Tuned Random Forest RMSE:", round(np.sqrt(mean_squared_error(y_test, y_pred_best_rf)), 2))
print("Tuned Random Forest R² Score:", round(r2_score(y_test, y_pred_best_rf), 3))


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(x_train, y_train)

# Predict on the model
y_pred_gbr = gbr.predict(x_test)

# Evaluation metrics
mse_gbr = mean_squared_error(y_test, y_pred_gbr)
mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
r2_gbr = r2_score(y_test, y_pred_gbr)

print(f"Gradient Boosting Regressor:\nMSE: {mse_gbr:.2f}\nMAE: {mae_gbr:.2f}\nR² Score: {r2_gbr:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Hyperparameter grid to search
param_grid_gbr = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 4]
}

# Fit the Algorithm
grid_gbr = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid_gbr, cv=5, scoring='r2', n_jobs=-1)
grid_gbr.fit(x_train, y_train)

print("Best Parameters:", grid_gbr.best_params_)

# Predict using best estimator
best_gbr = grid_gbr.best_estimator_
y_pred_best_gbr = best_gbr.predict(x_test)

# Re-evaluate
mse_best_gbr = mean_squared_error(y_test, y_pred_best_gbr)
mae_best_gbr = mean_absolute_error(y_test, y_pred_best_gbr)
r2_best_gbr = r2_score(y_test, y_pred_best_gbr)

print(f"Tuned GBR:\nMSE: {mse_best_gbr:.2f}\nMAE: {mae_best_gbr:.2f}\nR² Score: {r2_best_gbr:.4f}")


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For evaluating model performance, we considered the following metrics:

* **Mean Absolute Error (MAE):** Indicates the average magnitude of errors in predictions without considering their direction. MAE is easy to understand and gives a real-world magnitude of error in predicted views.

* **Mean Squared Error (MSE):** Penalizes larger errors more than MAE. Helps identify models that occasionally make large prediction errors.

* **Root Mean Squared Error (RMSE):** Square root of MSE; interpretable in the same unit as the target (views). RMSE gives higher weight to large errors, making it a critical metric when minimizing high-impact mispredictions.

* **R² Score (Coefficient of Determination):** Measures how well the model explains variance in the data. Higher R² indicates that the model is capturing more variability and thus making better predictions.
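
To make the four metrics above concrete, here is a small sketch that computes them together on three made-up predictions. The helper name `regression_report` is hypothetical. The underlying scikit-learn functions are the same ones imported at the top of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # same unit as the target
        "R2": r2_score(y_true, y_pred),
    }


# Errors are -1, 0 and +2, so the single larger error dominates MSE and RMSE.
report = regression_report([3, 5, 7], [2, 5, 9])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In this toy case MAE is 1.0 while RMSE is about 1.29, which shows how RMSE weights the larger error more heavily.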

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

After comparing multiple regression models (Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor), we selected the Gradient Boosting Regressor as our final prediction model.
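
A side-by-side table makes this kind of model comparison easier to read. The sketch below fits the three model families on synthetic data from `make_regression` and collects RMSE and R² into one DataFrame. The data, the variable names, and the resulting ranking are illustrative only and say nothing about the TED results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the TED features.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

rows = []
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    pred = estimator.predict(X_te)
    rows.append({
        "model": name,
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
        "R2": r2_score(y_te, pred),
    })

# Lowest RMSE first.
comparison = pd.DataFrame(rows).set_index("model").sort_values("RMSE")
print(comparison)
```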

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

We used the Gradient Boosting Regressor from sklearn.ensemble, a powerful ensemble learning technique that builds models sequentially, each trying to correct the errors made by the previous one. It's especially effective for regression tasks with complex patterns and interactions between features.
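
Feature importance for a fitted Gradient Boosting model can be inspected in at least two ways: the built-in impurity-based `feature_importances_` attribute and `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data whose target is deliberately driven mostly by `comments`. That relationship is an assumption for illustration and not a finding about the TED data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

feature_names = ["duration", "comments", "num_topics", "num_languages"]

# Synthetic features; the target depends mainly on `comments` by construction.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 4)), columns=feature_names)
y_demo = (
    5.0 * X_demo["comments"]
    + 1.0 * X_demo["num_languages"]
    + rng.normal(scale=0.5, size=400)
)

gbr_demo = GradientBoostingRegressor(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances (normalized to sum to 1).
impurity = pd.Series(gbr_demo.feature_importances_, index=feature_names)

# Permutation importances: drop in score when one column is shuffled.
perm = permutation_importance(
    gbr_demo, X_demo, y_demo, n_repeats=5, random_state=42
)
permutation = pd.Series(perm.importances_mean, index=feature_names)

print(impurity.sort_values(ascending=False))
print(permutation.sort_values(ascending=False))
```

In practice, permutation importance is usually computed on held-out data so that it reflects generalization rather than fit to the training set.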

# **Conclusion**

In this project, we conducted exploratory data analysis and regression modeling on the TED Talks dataset to understand the key factors that influence the number of views a talk receives. Through structured visualization and correlation analysis, we examined how the number of available languages, the number of comments, the number of topics associated with a talk, and the talk duration relate to views.

We engineered features such as num_topics and num_languages, and employed multiple regression models including Linear Regression, Random Forest, and Gradient Boosting to predict views. After tuning and evaluation, Gradient Boosting emerged as the best-performing model with the lowest RMSE and highest R² score.

The notebook runs end-to-end in Google Colab, from loading and transformation through training and prediction, provided the dataset CSV is available at the Google Drive path used in the loading cell.