<a href="https://colab.research.google.com/github/MonaRansing/Ted_talks_views_prediction/blob/main/Ted_talk_view_prediction_project_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -   **TED Talk Views Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

This project aimed to build a model that predicts the number of views of TED talk videos uploaded on the TEDx website. The dataset contained over 4,000 TED talks, including transcripts in many languages. The data was explored and preprocessed to handle missing values, outliers, and categorical variables. Feature engineering was performed to extract additional information from the data, such as the length of the talk and the number of speakers.

Several regression algorithms were evaluated to determine which model would provide the best predictions. Linear regression, Ridge regression, and Lasso regression were applied to the dataset. The performance of these models was compared using various evaluation metrics, including mean absolute error, mean squared error, and root mean squared error. The results showed that Lasso regression outperformed the other models, with the lowest mean squared error and root mean squared error.
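
The three error metrics named above can be written out directly. The block below is a minimal, dependency-free sketch on made-up numbers, not the evaluation code used later in this notebook; the function name `regression_errors` and the sample values are assumptions for illustration.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / len(residuals)   # mean absolute error
    mse = sum(r * r for r in residuals) / len(residuals)    # mean squared error
    return mae, mse, math.sqrt(mse)                         # RMSE is the root of MSE

# Hypothetical values, for illustration only
mae, mse, rmse = regression_errors([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 8.0])
print(mae, mse, rmse)
```

Because MSE squares each residual, a single large miss moves MSE and RMSE much more than it moves MAE, which is why the three are usually reported together.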

Feature selection was also performed using the Lasso regression model to identify the features that contribute most to the prediction of the number of views. The selected features were the duration of the talk, the number of comments, and the number of languages the talk was translated into.

Overall, this project demonstrated that it is possible to build a predictive model that estimates the number of views of TED talk videos uploaded on the TEDx website.

# **GitHub Link -**

https://github.com/MonaRansing/Ted_talks_views_prediction.git

# **Problem Statement**


TED is devoted to spreading powerful ideas on just about any topic. These datasets contain over 4,000 TED talks, including transcripts in many languages. Founded in 1984 by Richard Saul Wurman as a nonprofit organization that aimed at bringing experts from the fields of Technology, Entertainment and Design together, TED Conferences have gone on to become the Mecca of ideas from virtually all walks of life. As of 2015, TED and its sister TEDx chapters have published more than 2000 talks for free consumption by the masses, and its speaker list boasts of the likes of Al Gore, Jimmy Wales, Shahrukh Khan and Bill Gates. The main objective is to build a predictive model that helps predict the views of the videos uploaded on the TEDx website.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.

*   Explain the ML model used and its performance using an evaluation metric score chart.

*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data manipulation and statistics libraries
import math
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import ttest_ind

# preprocessing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PowerTransformer

# model selection libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

# machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.ensemble import VotingRegressor,StackingRegressor


# metrics libraries
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error

from xgboost import XGBRegressor

# data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go

# datetime library
from datetime import datetime
import datetime as dt

# for detecting multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor


# pipeline libraries
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

import warnings
warnings.filterwarnings('ignore')

import ast


### Dataset Loading

In [None]:
# Load Dataset

# google drive mounted 
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
# dataset reading using read_csv
ted_talk_dataset = '/content/drive/MyDrive/Almabetter/Data Science/dataset/data_ted_talks.csv'
df = pd.read_csv(ted_talk_dataset)

### Dataset First View

In [None]:
# Dataset First Look
# first 5 rows 
df.head()

In [None]:
# last five rows
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

The given dataset has **4005 rows** and **19 columns**.

### Dataset Information

In [None]:
# Understanding dataset information
df.info()

In [None]:
# making copy of dataset
df1 = df.copy()
df1

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values= df1.duplicated().value_counts()
duplicate_values

The given dataset has no duplicate rows.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df1.isnull().sum().sort_values(ascending=False)[:5]
missing_values

In [None]:
# Visualizing the missing values (five columns with the most missing entries)
# counts are computed from the data instead of being typed in by hand
missing_values = df1.isnull().sum().sort_values(ascending=False)[:5]
plt.figure(figsize=(14,7))
sns.barplot(x=missing_values.index, y=missing_values.values, palette='husl')
plt.title('Visualisation of missing values', fontsize = 20)
plt.xlabel('Name of column', fontsize = 15)
plt.ylabel('Count of missing Values', fontsize = 15)
plt.show()

### What did you know about your dataset?

**The given dataset has 4005 rows and 19 columns. There are no duplicate rows. Five columns have missing values: comments, occupations, about_speakers, all_speakers and recorded_date.**

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe().T

In [None]:
df1.head()

In [None]:
df1.columns

### Variables Description 

* **talk_id:** A unique identifier for each TED Talk video.**(Numerical)**

* **title:** The title of the talk.**(Categorical)**

* **speaker_1:** The primary speaker for the talk.**(Categorical)**

* **all_speakers:** A list of all the speakers for the talk.**(Categorical)**

* **occupations:** The occupations of the speakers.**(Categorical)**

* **about_speakers:** Information about the speakers, such as their backgrounds and expertise.**(Categorical)**

* **views:** The number of views the video has received.**(Numerical)**

* **recorded_date:** The date the talk was recorded.**(Datetime)**

* **published_date:** The date the talk was published on the TED Talks YouTube channel.**(Datetime)**

* **event:** The name of the TED event where the talk was given.**(Categorical)**

* **native_lang:** The language the talk was given in.**(Categorical)**

* **available_lang:** The languages the talk is available in.**(Categorical)**

* **comments:** The number of comments on that video.**(Numerical)**

* **duration:** The length of the video.(in sec.)**(Numerical)**

* **topics:** The topics covered in the talk.**(Categorical)**

* **related_talks:** Other TED Talks that are related to this talk.**(Categorical)**

* **url:** The URL of the video.**(Categorical)**

* **description:** A brief description of the talk.**(Categorical)**

* **transcript:** A transcript of the talk.**(Categorical)**

* Four columns have numerical values: talk_id, views, comments and duration.
* 13 columns have categorical values: title, speaker_1, all_speakers, occupations, about_speakers, event, native_lang, available_lang, topics, related_talks, url, description and transcript.
* 2 columns have datetime values: recorded_date and published_date.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(df1.apply(lambda col: col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# description of dataset
df1.describe().T

In [None]:
# boxplot to visualize outliers in comment column
plt.figure(figsize=(7,7))
sns.boxplot(x=df1['comments'],data=df1)
plt.show()

In [None]:
# boxplot to visualize outliers in views column
plt.figure(figsize=(6,6))
sns.boxplot(x=df1['views'],data=df1)
plt.show()

In [None]:
# boxplot to visualize outliers in duration column.
plt.figure(figsize=(6,6))
sns.boxplot(x=df1['duration'],data=df1)
plt.show()

* The minimum value of views is 0.
* The minimum value of comments is 0.
* There are outliers in columns views, comments and duration.

In [None]:
# Find the row where column comments have 0 value
df1[df1['comments']==0]

**The dataset has 2 rows with a value of 0 in the comments column.**

In [None]:
# Find the row where column views have 0 value
df1[df1['views']==0]

* There are 6 rows with a value of 0 in the views column, and their comments values are NaN.
* These 6 NaN values in the comments column also need to be filled.

In [None]:
df1.describe(include='O').T

Two columns in the table above have the same content, so one of them should be dropped.

In [None]:
# Filling The missing values.
missing_values = df1.isnull().sum().sort_values(ascending=False)[:5]
missing_values

In [None]:
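# fill missing text values with the placeholder 'other'
null_columns=['occupations','about_speakers','all_speakers']
for col in null_columns:
  # assign back instead of chained inplace, which newer pandas may not apply to df1
  df1[col] = df1[col].fillna('other')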

In [None]:
df1['comments'] = df1['comments'].fillna(0)

In [None]:
df1.isnull().sum().sort_values(ascending=False)

In [None]:
# change the datatype of some columns
df1 = df1.astype({'talk_id':'int32','views':'int32','comments':'int32','duration':'int32',})
df1['recorded_date']=pd.to_datetime(df1['recorded_date'])
df1['published_date']=pd.to_datetime(df1['published_date'])

In [None]:
df1.info()

In [None]:
# drop all_speakers because it duplicates speaker_1, and drop url and talk_id because they are not required.
df1.drop(['all_speakers', 'url', 'talk_id'], axis=1, inplace=True)

In [None]:
# speaker_1 is renamed as speakers
df1.rename(columns={"speaker_1":"speakers"}, inplace=True)

In [None]:
df1= df1[df1['views']!=0]

In [None]:
print((df1['views']==0).sum())

In [None]:
df1.shape

### What all manipulations have you done and insights you found?

* The columns 'comments', 'occupations', 'about_speakers' and 'all_speakers' have missing values, which are filled using fillna(): 'comments' is filled with 0 and the other three columns are filled with 'other'.
* The columns talk_id, views, comments, duration, recorded_date and published_date are converted to more suitable datatypes.
* The columns 'all_speakers', 'url' and 'talk_id' are not required for the analysis, so they are dropped.
* The 'views' column has six rows with a value of 0. A published TED talk with zero views is implausible, so those rows are dropped from the dataset.
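
The manipulations listed above can be sketched on a toy frame. This is only an illustration under assumptions: the values are invented, only three of the notebook's columns are included, and the assignment form of `fillna` is used rather than `inplace=True`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names follow the notebook
toy = pd.DataFrame({
    "views": [1200, 0, 56000, 890],
    "comments": [15.0, np.nan, 230.0, np.nan],
    "occupations": ["{0: ['author']}", None, None, "{0: ['designer']}"],
})

# Fill text gaps with a placeholder and numeric gaps with 0, then fix the dtype
toy["occupations"] = toy["occupations"].fillna("other")
toy["comments"] = toy["comments"].fillna(0).astype("int32")

# Drop talks recorded with zero views
toy = toy[toy["views"] != 0].reset_index(drop=True)
print(toy)
```

Assigning the result back to the column keeps the behaviour explicit and avoids relying on in-place modification of a column selection.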

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

* The dependent variable in the dataset is "views". Columns such as 'occupations', 'about_speakers', 'related_talks', 'description' and 'transcript' are not directly related to the dependent variable in this analysis, so they are removed from the dataset.

In [None]:
# remove unwanted columns
df1.drop(['occupations','about_speakers','related_talks','description','transcript'], axis=1,inplace=True)

#### Chart - 1

### 1) Which are the top 10 most popular TED talks by title, speaker and views?

---



In [None]:
# Chart - 1 visualization code
popular_talk = df1[['title','speakers','views']].sort_values('views', ascending=False)[0:10]
popular_talk

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x=popular_talk['speakers'],y=popular_talk['views'],data=popular_talk)
plt.title('Most popular TED talks', fontsize=20)
plt.ylabel('Views', fontsize=15)
plt.xlabel('Speakers of popular TED talks', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot because a bar plot is easy to read and understand.

##### 2. What is/are the insight(s) found from the chart?

From the above chart I found following insights:
* Sir Ken Robinson's talk "Do schools kill creativity?" is the most popular, with 65051954 views.
* Amy Cuddy's talk "Your body language may shape who you are" is the second most popular, with 57074270 views.

#### Chart - 2

### 2) Which are the top 10 most popular speakers based on views?

In [None]:
# Chart - 2 visualization code
popular_speakrs = df1.groupby('speakers')['views'].sum().nlargest(10).reset_index()
popular_speakrs


In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x=popular_speakrs['speakers'],y=popular_speakrs['views'],data=popular_speakrs)
plt.title('Top 10 speakers', fontsize=20)
plt.ylabel('Views', fontsize=15)
plt.xlabel('Speakers', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot because a bar plot is easy to read and understand.

##### 2. What is/are the insight(s) found from the chart?

* Alex Gendler is the most popular speaker, followed by Sir Ken Robinson.
* Alex Gendler's videos have 117619583 views in total, and the second most popular speaker, Sir Ken Robinson, has 84380518 views in total.



In [None]:
# skewness of the numeric columns only (text columns cannot be skew-tested)
df1.skew(numeric_only=True)

1.   The views column has **positive skewness** of **8.104312**, which suggests that the distribution of the views variable is **highly skewed to the right**. This means that there are a few videos with a very large number of views, while the majority of videos have a relatively small number of views.
2.   The comments column has a **positive skewness** of **9.158308**, which indicates that the distribution of the comments variable is also **highly skewed to the right**. Similar to the views variable, there are a few videos with a very large number of comments, while the majority of videos have a relatively small number of comments.
3.  The duration column has a **positive skewness** of **1.186224**, which suggests that the distribution of the duration variable is also slightly skewed to the right, but to a lesser extent than the views and comments variables. This means that the majority of videos have a shorter duration, while a few videos have a longer duration.
4. So I focus on the numerical columns: deal with outliers and null values first, then re-check the skewness of the columns.



#### Chart - 3

### 3) Dealing with outliers in the comments, views and duration columns

In [None]:
# Checking correlation with view column

plt.figure(figsize=(14,7))
sns.scatterplot(x='comments', y="views", data = df1, color = 'green')
plt.show()


The scatterplot above shows that comments and views are both right-skewed and have somewhat similar distributions. A highly right-skewed distribution indicates many high-value outliers.


In [None]:
# check distribution of comments column
# histplot with a KDE curve replaces the deprecated distplot
plt.figure(figsize=(14,7))
sns.histplot(df1['comments'], kde=True, stat='density', color='green')
plt.show()


In [None]:
# remove outliers of comments column
df1.drop(df1[df1['comments']>1000].index, inplace=True)

In [None]:
df1.shape

In [None]:
# treat 0 comments as missing and fill with the median of the column
df1['comments'] = df1['comments'].replace(0, np.nan)
df1['comments'] = df1['comments'].fillna(df1['comments'].median())

I use the median because it is more robust to outliers than the mean.
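
A small numeric illustration of this choice, using the standard library only. The comment counts below are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical comment counts: five typical talks and one viral outlier
comment_counts = [40, 55, 60, 72, 90, 6000]

print("mean:", mean(comment_counts))      # pulled far above the typical talk
print("median:", median(comment_counts))  # stays near the middle of the data
```

One extreme value is enough to drag the mean well away from the bulk of the data, while the median barely moves, so a median fill keeps imputed values close to a typical talk.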

In [None]:
# Distribution plot after removal of outliers
plt.figure(figsize=(10,5))
sns.histplot(df1['comments'], kde=True, stat='density', color='green')

After removing the outliers, the comments column is still right-skewed.

In [None]:
# Check distribution of duration column

plt.figure(figsize=(14,7))
sns.histplot(df1['duration'], kde=True, stat='density', color='green')

In [None]:
# Check correlation of duration and views using scatter plot
plt.figure(figsize=(14,7))
sns.scatterplot(x='duration', y='views', data=df1, color='green')
plt.show()

We can see that duration and views are not correlated with each other. Also, the duration column has outliers.

In [None]:
# Check outliers using boxplot in duration and views column
columns = ['views','duration']
n = 1
plt.figure(figsize=(14,7))

for i in columns:
  plt.subplot(3,3,n)
  n=n+1
  sns.boxplot(x=df1[i])  # passing x= draws a horizontal box plot
  plt.title(i)
  plt.tight_layout()

In [None]:
# Outlier treatment: replace values above Q3 + 1.5*IQR with the column mean
columns = ['views','duration']

for i in columns:
  iqr = df1[i].quantile(0.75)-df1[i].quantile(0.25)
  df1[i]=df1[i].mask(df1[i]>(df1[i].quantile(0.75)+1.5*iqr), df1[i].mean())

In [None]:
columns = ['views','duration']
n = 1
plt.figure(figsize=(14,7))

for i in columns:
  plt.subplot(3,3,n)
  n=n+1
  sns.boxplot(x=df1[i])  # passing x= draws a horizontal box plot
  plt.title(i)
  plt.tight_layout()

In [None]:
# distribution of the views and duration columns after outlier treatment
fig,axs=plt.subplots(1,2,figsize=(14,7))
sns.histplot(df1['views'], kde=True, stat='density', color='green', ax=axs[0])
axs[0].set_title('Distribution of views')

sns.histplot(df1['duration'], kde=True, stat='density', color='green', ax=axs[1])
axs[1].set_title('Distribution of Duration')

plt.tight_layout()
plt.show()

* After replacing the outliers with the mean, the views column is slightly right-skewed but close to a normal distribution, and the duration column has a bimodal distribution.
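
The IQR rule behind this treatment can be shown on a handful of numbers. The sketch below uses the standard library and hypothetical durations; `replace_upper_outliers` is an illustrative helper name, and `method="inclusive"` is chosen on the assumption that it matches the linear interpolation pandas uses by default for quantiles.

```python
from statistics import mean, quantiles

def replace_upper_outliers(values):
    """Replace values above Q3 + 1.5*IQR with the mean of all values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    fill = mean(values)  # computed before replacement, so outliers are included
    return [fill if v > upper_fence else v for v in values], upper_fence

durations = [300, 420, 480, 540, 600, 660, 720, 5400]  # hypothetical seconds
cleaned, fence = replace_upper_outliers(durations)
print(fence, cleaned)
```

Note that the fill value is the mean of the untreated column, so it can itself lie above the fence when the outliers are extreme. Capping at the fence or filling with the median are common alternatives when that matters.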

In [None]:
# skewness of the numeric columns after outlier treatment
df1.skew(numeric_only=True)

#### Chart - 4

### 4) Check speaker popularity.

In [None]:
# Chart - 4 visualization code
# create a new column 'speaker_popularity'
df1['speaker_popularity'] = ""
df1.loc[df1['views']<=500000, 'speaker_popularity'] = 'not_popular'
df1.loc[(df1['views']> 500000) & (df1['views'] <= 1500000), 'speaker_popularity'] = 'avg_popular'
df1.loc[(df1['views']>1500000) & (df1['views']<=2500000), 'speaker_popularity'] = 'popular'
df1.loc[(df1['views']>2500000) & (df1['views']<=3500000), 'speaker_popularity'] = 'high_popular'
df1.loc[df1['views']>3500000, 'speaker_popularity'] = 'extreme_popular'

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(data=df1, x='speaker_popularity', y='comments',
            order=['not_popular','avg_popular','popular','high_popular','extreme_popular'])
plt.title("Speakers popularity according to comments")
plt.show()

* The bar plot suggests that speaker_popularity and comments are strongly related: as the number of comments increases, speaker popularity also increases.
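
The popularity buckets used for this chart can also be built in one step with `pd.cut`. This is an alternative sketch, not the code that produced the chart; the bin edges and labels mirror the thresholds in the cell above, and the sample view counts are hypothetical.

```python
import pandas as pd

# Bin edges and labels mirror the thresholds used for speaker_popularity
edges = [-float("inf"), 500_000, 1_500_000, 2_500_000, 3_500_000, float("inf")]
labels = ["not_popular", "avg_popular", "popular", "high_popular", "extreme_popular"]

views = pd.Series([120_000, 500_000, 900_000, 2_000_000, 3_000_000, 9_000_000])  # hypothetical
# right=True makes each bin include its upper edge, matching the <= comparisons
popularity = pd.cut(views, bins=edges, labels=labels, right=True)
print(popularity.tolist())
```

The result is an ordered categorical, so plots and encoders can reuse the label order without restating it.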

#### Chart - 5

### 5) Check video rating

In [None]:
# Chart - 5 visualization code
# create a new column video_rating
df1['video_rating'] = ""
df1.loc[df1['comments']<=50, 'video_rating'] = 1
df1.loc[(df1['comments']>50) & (df1['comments']<=100), "video_rating"] = 2
df1.loc[(df1['comments']>100) & (df1['comments']<=200), "video_rating"] = 3
df1.loc[(df1['comments']>200) & (df1['comments']<=300), "video_rating"] = 4
df1.loc[df1['comments']>300, "video_rating"] = 5

In [None]:
rating_counts = df1['video_rating'].value_counts().sort_values(ascending=False)
rating_counts

In [None]:
plt.figure(figsize=(14,7))
rating_counts.plot(kind='bar', color='green')
plt.title('Distribution of Video Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Because this plot is easy to understand. 

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows that 1385 videos have a rating of 2, meaning they received more than 50 and at most 100 comments.

#### Chart - 6

### 6) Check available languages

In [None]:

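# Chart - 6 visualization code
# count the languages per talk; if the list is stored as text, parse it first
# so that len() counts language codes rather than characters
df1["available_langauges"] = df1["available_lang"].apply(lambda x: len(ast.literal_eval(x)) if isinstance(x, str) else len(x))
pd.DataFrame(df1["available_langauges"])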

In [None]:
plt.figure(figsize=(14,7))
sns.histplot(df1["available_langauges"], kde=True, stat='density', color='green')

##### 1. What is/are the insight(s) found from the chart?

The distribution of available languages is slightly right-skewed, with most values concentrated in the middle of the range.
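
When counting languages, it matters whether a cell holds a real list or the text of a list. The sketch below assumes the second case, which is how CSV files typically store list-like columns; the cell value is hypothetical.

```python
import ast

cell = "['en', 'es', 'fr', 'pt-br']"  # hypothetical cell value stored as text

char_count = len(cell)                    # length of the text itself
lang_count = len(ast.literal_eval(cell))  # number of language codes
print(char_count, lang_count)
```

The character count grows with the number of languages but is not the same quantity, so parsing the cell first gives a feature that is easier to interpret.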

## Published Date

In [None]:
# making separate columns for day, month, and year
df1['published_year'] = df1["published_date"].dt.year
df1['published_month'] = df1["published_date"].dt.month
df1['published_day'] = df1["published_date"].dt.day_name()

# storing weekdays as numbers from 0 (Sunday) to 6 (Saturday)
weekdays = {'Sunday':0, 'Monday':1, 'Tuesday':2, 'Wednesday':3, 'Thursday':4, 'Friday':5, 'Saturday':6}

# making new column which holds the day number

df1["published_daynumb"] = df1["published_day"].map(weekdays)
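
The weekday numbering used here (Sunday = 0 through Saturday = 6) can also be derived arithmetically rather than typed by hand, which avoids misspelled or duplicated dictionary entries. The helper name `sunday_based_daynumber` is hypothetical; the sketch relies only on the standard library.

```python
from datetime import date

def sunday_based_daynumber(d):
    """Map a date to 0..6 with Sunday=0, Monday=1, ..., Saturday=6."""
    # date.weekday() is Monday=0 .. Sunday=6, so shift by one and wrap around
    return (d.weekday() + 1) % 7

sample = date(2020, 4, 26)  # a Sunday
print(sample.isoformat(), sunday_based_daynumber(sample))
```

In pandas, `Series.dt.dayofweek` follows the same Monday = 0 convention as `date.weekday()`, so the same shift applies to a datetime column.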

In [None]:
df1.sample()

#### Chart - 7

### 7) Check different events of TED up to year 2013.

In [None]:
# Chart - 7 visualization code
# name the columns explicitly so the result does not depend on the pandas version
TED_events = df1["event"].value_counts().head(10).rename_axis('events').reset_index(name='Counts')
TED_events

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x=TED_events['events'], y=TED_events['Counts'], order=TED_events['events'])
plt.title('TED events count', fontsize=20)
plt.ylabel('Counts', fontsize=15)
plt.xlabel('Events', fontsize=15)
plt.show()

In [None]:
# Add new column TED event type by using existing column event
ted_categories = ['TED-ED', 'TEDx','TED','TEDGlobal','TEDSummit','TEDWomen','TED Residency']

df1['TEDevent_type'] = df1['event'].map(lambda x : "TEDx" if x[0:4] == "TEDx" else x)
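# compare six characters (case-insensitively); a four-character slice can never equal "TED-ED"
df1['TEDevent_type'] = df1['TEDevent_type'].map(lambda x : "TED-ED" if x[0:6].upper() == "TED-ED" else x)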
df1['TEDevent_type'] = df1['TEDevent_type'].map(lambda x : "TED" if x[0:4] == "TED2" else x)
df1['TEDevent_type'] = df1['TEDevent_type'].map(lambda x : "TEDGlobal" if x[0:4] == "TEDG" else x)
df1['TEDevent_type'] = df1['TEDevent_type'].map(lambda x : "TEDWomen" if x[0:4] == "TEDW" else x)
df1['TEDevent_type'] = df1['TEDevent_type'].map(lambda x : "TEDSummit" if x[0:4] == "TEDS" else x)
df1['TEDevent_type'] = df1['TEDevent_type'].map(lambda x : "TED Residency" if x[0:13] == "TED Residency" else x)
df1['TEDevent_type'] = df1['TEDevent_type'].map(lambda x : "Other TED" if x not in ted_categories else x)

In [None]:
df1.sample(1)

In [None]:
TEDevent_type = df1['TEDevent_type'].value_counts().rename_axis('TEDevent_type').reset_index(name='Counts')
TEDevent_type

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x=TEDevent_type['TEDevent_type'], y=TEDevent_type['Counts'], order=TEDevent_type['TEDevent_type'])
plt.title('Barplot of TEDevent type', fontsize=20)
plt.ylabel('Counts', fontsize=15)
plt.xlabel('TEDevent type', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Because a bar plot is easy to read and understand.

##### 2. What is/are the insight(s) found from the chart?

In this run, events grouped under 'Other TED' have the highest count, 1231.
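
The event grouping can be expressed as a single prefix lookup. The sketch below is an illustrative alternative, not the notebook's exact rule: it matches full prefixes such as "TEDSummit" instead of four-character slices, so an event like "TEDSalon" falls under "Other TED". The function name and the sample event names are assumptions.

```python
def ted_event_type(event):
    """Collapse a raw event name into a coarse TED event family."""
    # Longer, more specific prefixes are checked before the generic ones
    prefixes = [
        ("TED Residency", "TED Residency"),
        ("TED-Ed", "TED-ED"),
        ("TEDGlobal", "TEDGlobal"),
        ("TEDSummit", "TEDSummit"),
        ("TEDWomen", "TEDWomen"),
        ("TEDx", "TEDx"),
        ("TED2", "TED"),  # main conferences such as "TED2019"
    ]
    for prefix, family in prefixes:
        if event.startswith(prefix):
            return family
    return "Other TED"

# Hypothetical event names, for illustration only
for name in ["TED2019", "TEDxBoston", "TED-Ed", "TEDGlobal 2013", "TEDSalon NY2012"]:
    print(name, "->", ted_event_type(name))
```

Keeping the rules in one ordered list makes it easy to see which prefix wins when several could match, and the function can be applied to a column with `Series.map`.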

#### Chart - 8

### 8) Topics of TED talk

In [None]:
# Chart - 8 visualization code
df_1 = df1.copy()

df_1['topics'] = df_1['topics'].apply(lambda x: ast.literal_eval(x))
s = df_1.apply(lambda x: pd.Series(x['topics']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'topic'

df_1 = df_1.drop('topics', axis=1).join(s)

In [None]:
df_1.head()

In [None]:
Popular_topics = pd.DataFrame(df_1['topic'].value_counts()).reset_index()
Popular_topics.columns=['topic','TEDtalks']

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x="topic", y="TEDtalks", data=Popular_topics.head(10))
plt.title('Popular topics in TED talk', fontsize=20)
plt.ylabel('TEDtalks', fontsize=15)
plt.xlabel('Topics', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Because visualization becomes easy when we use barplot.

##### 2. What is/are the insight(s) found from the chart?

* The 10 most frequent topics in TED talks are science, technology, culture, TEDx, TED-Ed, global issues, society, design, social change and animation.
* Among these 10 topics, science and technology are the most popular.
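
The topic counts behind this chart come from parsing each stringified list and tallying its entries. The sketch below shows the same idea with the standard library on three hypothetical rows; it is not the pandas code used above.

```python
import ast
from collections import Counter

# Hypothetical rows; each value is a list stored as text, as in the topics column
raw_topics = [
    "['science', 'technology', 'design']",
    "['science', 'culture']",
    "['technology', 'science']",
]

topic_counts = Counter()
for cell in raw_topics:
    # literal_eval safely parses the text into a real Python list
    topic_counts.update(ast.literal_eval(cell))

print(topic_counts.most_common(2))
```

Each talk contributes once to every topic it lists, so the totals count talk-topic pairs and add up to more than the number of talks.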

In [None]:
# plot stacked bar charts of the most frequent topics per year and check the trend

# keep talks from the 12 most frequent topics, excluding the 'TEDx' and 'TED-Ed' tags
excluded = ['TEDx', 'TED-Ed']
pop_theme_talks = df_1[df_1['topic'].isin(Popular_topics.head(12)['topic']) & ~df_1['topic'].isin(excluded)].copy()
pop_theme_talks['published_year'] = pop_theme_talks['published_year'].astype('int')
pop_theme_talks = pop_theme_talks[pop_theme_talks['published_year'] > 2008]

# top 10 topics without the excluded tags
themes = [t for t in Popular_topics.head(10)['topic'] if t not in excluded]

# share of each topic within a year (each row sums to 1)
ctab = pd.crosstab([pop_theme_talks['published_year']], pop_theme_talks['topic']).apply(lambda x: x/x.sum(), axis=1)
ctab[themes].plot(kind='bar', stacked=True, colormap='plasma', figsize=(12,8)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

### **Feature Selection**

In [None]:
df1.sample(2)

In [None]:
df1.drop(labels = ["speakers", "title", "recorded_date", "published_date", "event", "native_lang", "available_lang","topics"],axis = 1, inplace = True)

In [None]:
df1.sample(1)

In [None]:
df1.shape

In [None]:
df1 = df1.astype({'comments':'int64', 'views':'int64','video_rating':'int64'})

df1 = df1.astype({'speaker_popularity': 'category','published_day': 'category','TEDevent_type':'category'})

In [None]:
df1.info()

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# correlate numeric columns only; category columns are skipped
correlmap = df1.corr(numeric_only=True)
plt.figure(figsize=(14,7))
sns.heatmap(correlmap, annot=True)

##### 1. Why did you pick the specific chart?

A heatmap is the best choice for visualizing the correlation between the columns of a dataset, because the strength of every pairwise relationship can be read at a glance.

##### 2. What is/are the insight(s) found from the chart?

The dependent column in the dataset is views, and the columns comments and available_langauges are the most correlated with it.

### **Check Multicollinearity and Remove**


In [None]:
# select numeric columns
numeric_columns = df1.select_dtypes(include=['int64','int32','float32','float64']).drop(['views'],axis=1)

# replace infinite and NaN values with appropriate values
numeric_columns = numeric_columns.replace([np.inf, -np.inf], np.nan)
numeric_columns = numeric_columns.fillna(numeric_columns.mean())

# calculate VIF for each feature
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(numeric_columns.values, i) for i in range(numeric_columns.shape[1])]
vif['features'] = numeric_columns.columns
print(vif)


In [None]:
df1.drop(['published_year','published_month','video_rating'], axis=1, inplace=True)

In [None]:
# select numeric columns
numeric_columns = df1.select_dtypes(include=['int64','int32','float32','float64']).drop(['views'],axis=1)

# replace infinite and NaN values with appropriate values
numeric_columns = numeric_columns.replace([np.inf, -np.inf], np.nan)
numeric_columns = numeric_columns.fillna(numeric_columns.mean())

# calculate VIF for each feature
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(numeric_columns.values, i) for i in range(numeric_columns.shape[1])]
vif['features'] = numeric_columns.columns
print(vif)
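
For reference, the variance inflation factor of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the remaining features. The sketch below computes it with NumPy on synthetic data. It is an assumption-laden illustration rather than a replacement for the statsmodels call: it adds an intercept to each auxiliary regression, so its values need not match the ones printed above, and the function name `vif` and the random data are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        # regress column i on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)  # nearly a copy of a
scores = vif(np.column_stack([a, b, c]))
print([round(s, 1) for s in scores])
```

Columns that are close to linear combinations of others, such as `a` and `c` here, get large factors, while an unrelated column like `b` stays near 1.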

In [None]:
df1.sample(1)

In [None]:
# power-transform the target column to reduce its skewness
pt = PowerTransformer()
df1['views'] = pt.fit_transform(df1[['views']]).ravel()

In [None]:
df1.skew(numeric_only=True)

## ***7. ML Model Implementation***

In [None]:
df1.isnull().sum()

In [None]:
values = {'published_daynumb':0}

df1 = df1.fillna(value=values)

In [None]:
df1.isnull().sum()

In [None]:
X = df1.drop(columns=['views'])
y = df1['views']

In [None]:
X

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state = 0)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
from sklearn import datasets, linear_model, metrics

In [None]:
reg = linear_model.LinearRegression()

In [None]:
df1.sample(2)

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf', StandardScaler(),[0,1,3,5]),
    ('col_tnf1', PowerTransformer(),[0,1,3]),
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
    ('col_tnf2', OneHotEncoder(sparse_output=False, drop='first'),[4,6]),
    ('col_tnf3', OrdinalEncoder(categories=[['not_popular','avg_popular','popular','high_popular','extreme_popular']]),[2])
],remainder='passthrough')



# display pipeline

from sklearn import set_config
set_config(display='diagram')

### ML Model - 1

In [None]:
# apply LinearRegression algorithm as step2

step2 = LinearRegression()


# make pipeline
pipe1 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# fit the pipeline on training dataset
pipe1.fit(X_train,y_train)

# predict the train and test dataset 
y_pred_train = pipe1.predict(X_train)
y_pred = pipe1.predict(X_test)

# display pipeline diagram
display(pipe1)

# LinearRegression model all output scores
print('\033[1mTraining data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_train,y_pred_train))
print('Adjusted R2 score', (1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))))

print('\n')
print('\033[1mTesting data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_test,y_pred))
print('Adjusted R2 score', (1-(1-r2_score(y_test,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))))

print('\n')
print('\033[1mThe performance metrics\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('MAE',mean_absolute_error(y_test,y_pred))
print('MSE',mean_squared_error(y_test,y_pred))
print('RMSE',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:
# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()

### **Observations**

* Training data R2 and Adjusted R2 scores are 0.8872985205276127 and 0.8870319066863465, respectively.

* Testing data R2 and Adjusted R2 scores are 0.8849387907523739 and 0.8841185988330935, respectively.

* The performance metrics on the test set are:

    **MAE** 0.27380534823783437

    **MSE** 0.11420981723987313

    **RMSE** 0.33794943000377015
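
The adjusted R2 printed by the model cell is computed inline as `1 - (1 - R2) * (n - 1) / (n - p - 1)`, where `n` is the number of rows and `p` the number of predictors. A small helper, sketched below, would keep the print statements shorter. The function name `adjusted_r2` is hypothetical and the example values are made up.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 for a model with n_features predictors fit on n_samples rows."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up example values, not results from the notebook.
example = adjusted_r2(0.90, n_samples=100, n_features=5)
print(example)
```

In the notebook, `p` is taken from `X_train.shape[1]`, the number of raw columns. After one-hot encoding and the duplicated numeric columns, the fitted model sees a different number of inputs, so which `p` to use is a judgment call.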


### ML Model - 2

In [None]:
# apply RidgeRegression algorithm with hyperparameter tuning as step2


# candidate regularization strengths (alpha) for the grid search
parameters = {'alpha': [1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1,3,5,8,12,15,18,21,25]}

# the dataset is small, so an exhaustive GridSearchCV is affordable and RandomizedSearchCV is not needed
Reg_ridge = GridSearchCV(Ridge(), parameters, cv=10)

step2 = Reg_ridge

# make pipeline
pipe2 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# fit the pipeline on training dataset
pipe2.fit(X_train,y_train)

# predict the train and test dataset 
y_pred_train = pipe2.predict(X_train)
y_pred = pipe2.predict(X_test)

# display pipeline diagram
display(pipe2)

# Ridge Regression model all output scores
print('\033[1mTraining data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_train,y_pred_train))
print('Adjusted R2 score', (1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))))

print('\n')
print('\033[1mTesting data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_test,y_pred))
print('Adjusted R2 score', (1-(1-r2_score(y_test,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))))

print('\n')
print('\033[1mCross-validation score and best params\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print("The best parameters are", Reg_ridge.best_params_)
print('cross-validation score', Reg_ridge.best_score_)

print('\n')
print('\033[1mThe performance metrics\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('MAE',mean_absolute_error(y_test,y_pred))
print('MSE',mean_squared_error(y_test,y_pred))
print('RMSE',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:
# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()

### **Observations**

* Training data R2 and Adjusted R2 scores are 0.8872978749437399 and 0.8870312595752391, respectively.

* Testing data R2 and Adjusted R2 scores are 0.8849257240863323 and 0.8841054390238112, respectively.

* Cross-validation score and best params are as follows:

    **Best parameters** {'alpha': 0.001}

    **Cross-validation score** (mean R2 over the 10 folds) 0.884838466969393

* The performance metrics on the test set are as follows:

     **MAE** 0.27379874378382013

     **MSE** 0.11422278721950664

     **RMSE** 0.3379686186904143
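
In the Ridge cell, the grid search is the last step of the pipeline, so the preprocessing in `step1` is fit once on all of `X_train` before the cross-validation folds are split. An alternative arrangement, sketched below on synthetic data, wraps the whole pipeline in `GridSearchCV` so that the preprocessing is refit inside each fold. The toy data, the `StandardScaler`-only preprocessing and all `_demo` names are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical, not the TED dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

pipe_demo = Pipeline([
    ("step1", StandardScaler()),
    ("step2", Ridge()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid_demo = {"step2__alpha": [1e-3, 1e-2, 1e-1, 1, 10]}

# Each fold refits the scaler and the Ridge model on that fold's training part only.
search_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=5)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(search_demo.best_score_)
```

With this layout, `search_demo.predict` uses the pipeline refit on the full training data with the best `alpha`, so it can stand in for `pipe2.predict`.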

### ML Model - 3

In [None]:
# apply LassoRegression algorithm with hyperparameter tuning as step2


# candidate regularization strengths (alpha) for the grid search
parameters = {'alpha': [1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1,2,3,4,5,8,12,15,18,21,25]}

# the dataset is small, so an exhaustive GridSearchCV is affordable and RandomizedSearchCV is not needed
Reg_Lasso = GridSearchCV(Lasso(), parameters, cv=10)

step2 = Reg_Lasso

# make pipeline
pipe3 = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# fit the pipeline on training dataset
pipe3.fit(X_train,y_train)

# predict the train and test dataset
y_pred_train = pipe3.predict(X_train)
y_pred = pipe3.predict(X_test)

# display pipeline diagram
display(pipe3)

# Lasso Regression model all output scores
print('\033[1mTraining data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_train,y_pred_train))
print('Adjusted R2 score', (1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))))

print('\n')
print('\033[1mTesting data R2 and Adjusted R2 Score\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('R2 score',r2_score(y_test,y_pred))
print('Adjusted R2 score', (1-(1-r2_score(y_test,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))))

print('\n')
print('\033[1mCross-validation score and best params\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print("The best parameters are", Reg_Lasso.best_params_)
print('cross-validation score', Reg_Lasso.best_score_)

print('\n')
print('\033[1mThe performance metrics\033[0m')
print('\033[1m' + '-----------------------------------------' + '\033[0m')
print('MAE',mean_absolute_error(y_test,y_pred))
print('MSE',mean_squared_error(y_test,y_pred))
print('RMSE',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:
# Plot the figure
plt.figure(figsize=(20,8))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No. of Test Data')
plt.show()

### **Observations**

* Training data R2 and Adjusted R2 scores are as follows:

    **R2 score** 0.8856772370433994

    **Adjusted R2 score** 0.885406787790038


* Testing data R2 and Adjusted R2 scores are as follows:

    **R2 score** 0.8824740403319331

    **Adjusted R2 score** 0.8816362789086374


* Cross-validation score and best params are as follows:

    **Best parameters** {'alpha': 1e-08}

    **Cross-validation score** (mean R2 over the 10 folds) 0.8832671645096054


* The performance metrics are as follows:

    **MAE** 0.275916310909462

    **MSE** 0.11665632981262587

    **RMSE** 0.3415498935918819
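
Lasso can shrink some coefficients exactly to zero, which is what makes it usable for feature selection. The sketch below shows one way to inspect which coefficients survive. It uses synthetic data and an arbitrary `alpha=0.5`; neither comes from the notebook. With the best alpha of 1e-08 reported above, very little shrinkage would be expected.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only the first two of five features drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# alpha=0.5 is an arbitrary illustration value, large enough to zero out the noise features.
lasso_demo = Lasso(alpha=0.5)
lasso_demo.fit(X_demo, y_demo)

# Indices of the features whose coefficients were not shrunk to zero.
selected = np.flatnonzero(lasso_demo.coef_)
print("coefficients:", lasso_demo.coef_)
print("selected feature indices:", selected)
```

For a Lasso model tuned inside a pipeline, the same check could be run on the fitted estimator's `coef_` attribute.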


# **Conclusion**

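* The performance metrics of all three models (linear regression, Ridge regression, and Lasso regression) are very similar, with test R2 scores ranging from 0.882 to 0.885 and test RMSE values ranging from 0.338 to 0.342. Linear regression and Ridge regression are practically tied (test R2 of about 0.885, RMSE of about 0.338), while Lasso regression is slightly behind (test R2 of about 0.882, RMSE of about 0.342).

* The grid searches selected very small regularization strengths (alpha = 0.001 for Ridge and alpha = 1e-08, the smallest candidate, for Lasso), which suggests that regularization adds little on this feature set.

* Therefore, based on the performance metrics, both linear regression and Ridge regression appear to be good choices for this dataset.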
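
To compare the three models side by side, the test-set metrics reported in the Observations sections can be collected in one place. The sketch below hard-codes those reported values, rounded to four decimals; the dictionary layout and variable names are just one possible arrangement.

```python
# Test-set metrics copied from the Observations sections (rounded to 4 decimals).
results = {
    "Linear": {"R2": 0.8849, "MAE": 0.2738, "RMSE": 0.3379},
    "Ridge": {"R2": 0.8849, "MAE": 0.2738, "RMSE": 0.3380},
    "Lasso": {"R2": 0.8825, "MAE": 0.2759, "RMSE": 0.3415},
}

# Print a small aligned table.
print(f"{'Model':<8}{'R2':>8}{'MAE':>8}{'RMSE':>8}")
for name, m in results.items():
    print(f"{name:<8}{m['R2']:>8.4f}{m['MAE']:>8.4f}{m['RMSE']:>8.4f}")

# Pick the model with the lowest test RMSE.
best = min(results, key=lambda name: results[name]["RMSE"])
print("Lowest test RMSE:", best)
```

The gap between linear and Ridge regression is in the fourth decimal place, so the ranking between those two should not be over-interpreted.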
