<a href="https://colab.research.google.com/github/Aman1647/TED-Talks-Views-Prediction/blob/main/TED_Talks_Views_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - Regression/TED Talks Views Prediction
##### **Contribution**    - Individual


# **Project Summary -**

We had a dataset of over 4,000 TED events (talks), and the main objective was to build a model that predicts the views generated by these events. The dataset was feature-rich but heavily influenced by outliers, which were handled using the Interquartile Range (IQR) method. Feature engineering produced several new features that yielded important insights. One-hot encoding was used to convert object-type columns into binary columns. The regression models implemented were Linear Regression with Lasso, Ridge and Elastic-Net regularization, followed by Decision Trees and their ensemble techniques. Models were evaluated on MAE and R2 score, and Random Forest was the best of all the models, with a train score of 98%, a test score of 80% and the lowest MAE. Feature importance was calculated using Shapley values (SHAP), and the features generated through feature engineering were found to be the most important and to have the largest impact on the model's predictions.

It was concluded that TED events can focus more on the themes of self-improvement, education and personality development, stay within a time limit of 16 minutes with 4-7 topics per event, and be made available in as many languages as possible. These factors were associated with the wide reach and popularity of TED events. Events published at the start of the year, close to March, generated the most views. On average, a new TED event following the above features and factors could be expected to fetch about 2.17 million views.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**TED is devoted to spreading powerful ideas on just about any topic. These datasets contain over 4,000 TED talks, including transcripts in many languages. Founded in 1984 by Richard Saul Wurman as a nonprofit organization that aimed to bring together experts from the fields of Technology, Entertainment, and Design, TED Conferences have gone on to become the Mecca of ideas from virtually all walks of life. As of 2015, TED and its sister TEDx chapters have published more than 2,000 talks for free consumption by the masses, and its speaker list boasts the likes of Al Gore, Jimmy Wales, Shah Rukh Khan, and Bill Gates. The main objective is to build a predictive model that could help in predicting the views of the videos uploaded on the TEDx website.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**know your data**

Features mentioned in the dataset :
1. talk_id: Talk identification number provided by TED
2. title: Title of the talk
3. speaker_1: First speaker in TED's speaker list
4. all_speakers: Speakers in the talk
5. occupations: Occupations of the speakers
6. about_speakers: Blurb about each speaker
7. recorded_date: Date the talk was recorded
8. published_date: Date the talk was published to TED.com
9. event: Event or medium in which the talk was given
10. native_lang: Language the talk was given in
11. available_lang: All available languages (lang_code) for a talk
12. comments: Count of comments
13. duration: Duration in seconds
14. topics: Related tags or topics for the talk
15. related_talks: Related talks (key='talk_id',value='title')
16. description: Description of the talk
17. transcript: Full transcript of the talk
18. url: URL of the talk

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import ast
from datetime import datetime

from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn import ensemble
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
import xgboost as xgb
from sklearn.neighbors import KNeighborsRegressor

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from scipy.stats import uniform
from scipy.stats import norm
from scipy.stats import chi2
from scipy.stats import t
from scipy.stats import f

### Dataset Loading

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Almabetter/Capstone Projects/Regression/Copy of data_ted_talks.csv")

ted_df = df.copy()

### Dataset First View

In [None]:

ted_df.head()

### Dataset Rows & Columns count

In [None]:

ted_df.shape

### Dataset Information

In [None]:
ted_df.info()

#### Duplicate Values

In [None]:
ted_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
ted_df.isnull().sum()

### What did you know about your dataset?

The dataset contains 4005 rows and 19 columns. The target variable (`views`) is of integer datatype. Some columns contain null values, which need to be handled.

## ***2. Understanding Your Variables***

In [None]:
ted_df.columns

In [None]:
ted_df.describe(include = 'all').T

**Views**

In [None]:
plt.figure(figsize=(8,6))
sns.distplot((ted_df['views']))
plt.xlabel('Views')
     

In [None]:
plt.figure(figsize=(8,6))
sns.distplot(np.sqrt(ted_df['views']))          
plt.xlabel('sqrt(Views)')

In [None]:
ted_df['views'].describe()

**Duration**

In [None]:
plt.figure(figsize=(8,6))
sns.distplot((ted_df['duration']))
plt.xlabel('Duration')

In [None]:
ted_df['duration'].describe()

**Comments**

In [None]:
plt.figure(figsize=(8,6))
sns.distplot((ted_df['comments']))
plt.xlabel('Comments')

In [None]:
ted_df['comments'].describe()


### Variables Description 

1. The average number of views garnered by TED events is 2.14 million and the median is 1.37 million. This suggests that TED events are very popular.
2. The average duration of TED events is around 714 seconds (about 12 minutes), whereas the majority of TED events have a duration of around 960 seconds (16 minutes), which is a very good time distribution.
3. TED events receive an average of 90 comments, with a median of 68, which is low considering the popularity of TED events. The most-commented events receive more than 135 comments.

### Check Unique Values for each variable.

In [None]:
for col in ted_df:
    print(col ,":\n", ted_df[col].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

Grouping events based on decades and unique prefixes to reduce complexity

In [None]:
ted_df['event'].value_counts()
     

In [None]:
# Map each event name to a coarse category based on its prefix.
# Order matters: the generic 'TED' prefix must be checked last.
event_prefixes = [
    ('TED20', 'TED2000s'), ('TED19', 'TED1900s'), ('TEDx', 'TEDx'),
    ('TED@', 'TED@'), ('TEDSalon', 'TEDSalon'), ('TED-Ed', 'TED-Ed'),
    ('TEDWomen', 'TEDWomen'), ('TEDMED', 'TEDMED'), ('TED', 'TED_Other'),
]

def categorize_event(event):
    # Return the category of the first matching prefix, or 'Others' if none match.
    for prefix, category in event_prefixes:
        if str(event).startswith(prefix):
            return category
    return 'Others'

# Assign the whole column at once instead of chained item assignment in a loop.
ted_df['Event_Category'] = ted_df['event'].apply(categorize_event)

In [None]:
ted_df['Event_Category'].unique()

In [None]:
ted_df['Event_Category'].value_counts()

Creating new datetime features (day, month, year, day of month and time_since_published) for better understanding and evaluation of the dataset

In [None]:
ted_df['recorded_date'] = pd.to_datetime(ted_df['recorded_date'], format= "%Y-%m-%d")
ted_df['published_date'] = pd.to_datetime(ted_df['published_date'], format = "%Y-%m-%d")

Last date of published event

In [None]:
last_date = ted_df['published_date'].max()
last_date

First date of published event

In [None]:
first_date = ted_df['published_date'].min()
first_date
     

Extracting day name, month name, year and day of month from the publication date

In [None]:
ted_df['Day'] = ted_df['published_date'].dt.day_name()
     

In [None]:
ted_df['Month'] = ted_df['published_date'].dt.month_name()

In [None]:
ted_df["year"] = ted_df["published_date"].apply(lambda x: x.year)
ted_df["day_num"] = ted_df["published_date"].apply(lambda x: x.day)

In [None]:
ted_df['time_since_published'] = (last_date - ted_df['published_date']).apply(lambda x:x.days)

In [None]:
ted_df['daily_views'] = ted_df['views'] / (ted_df['time_since_published'])

In [None]:
ted_df.head(2)

Finding number of unique topics in dataset

In [None]:
temp = ted_df['topics'].iloc[1]
temp

In [None]:
temp_eval = ast.literal_eval(temp)
temp_eval

In [None]:
def get_num_topics(temp):
  # Parse the stringified list and return how many topics it contains.
  return len(ast.literal_eval(temp))


ted_df['Num_of_Topics'] = ted_df['topics'].map(get_num_topics)

In [None]:
get_num_topics(temp)

Number of Available Languages Count

In [None]:
temp = ted_df['available_lang'][0]
temp

In [None]:
temp_eval = ast.literal_eval(temp)

In [None]:
def get_lanuages(temp):
  # Parse the stringified list and return how many languages it contains.
  return len(ast.literal_eval(temp))


ted_df['Available_lang'] = ted_df['available_lang'].map(get_lanuages)

In [None]:
get_lanuages(temp)

Speaker_1 average views

In [None]:
# Average views per first speaker, broadcast back to every talk by that speaker.
ted_df['Speaker_avg_views'] = ted_df.groupby('speaker_1')['views'].transform('mean')

In [None]:
ted_df.head(2)

### What all manipulations have you done and insights you found?

1. Grouping multiple TED events together into easily interpretable groups
2. Creating new datetime features like day, year, month, time_since_published for better evaluation of data
3. Finding the number of topics and the number of available languages for each talk
4. Finding average views of Speaker
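The list-length features above rely on `ast.literal_eval` to turn a stringified list into a Python object. The following is a minimal sketch on made-up data, assuming that unparseable or missing entries should count as 0. `count_items` is a hypothetical helper, not a function from the notebook.

```python
import ast
import pandas as pd

def count_items(raw):
    """Return the number of elements in a stringified list, or 0 if it cannot be parsed."""
    try:
        return len(ast.literal_eval(raw))
    except (ValueError, SyntaxError, TypeError):
        # Missing values and malformed strings end up here (assumption: count them as 0).
        return 0

# Toy stand-in for the 'topics' and 'available_lang' columns (values are made up).
toy = pd.DataFrame({
    "topics": ["['science', 'education', 'design']", "['music']", None],
    "available_lang": ["['en', 'fr']", "['en']", "[]"],
})
toy["Num_of_Topics"] = toy["topics"].map(count_items)
toy["Available_lang"] = toy["available_lang"].map(count_items)
print(toy[["Num_of_Topics", "Available_lang"]])
```

Returning 0 for bad input is only one possible choice. Raising an error instead would surface data problems earlier.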

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
temp=ted_df.groupby(['speaker_1'],as_index=False)['views'].sum().sort_values('views',ascending=False)[:5]
plt.figure(figsize=(10,8)) 
ax=sns.barplot(x='speaker_1', y='views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=50);
plt.title('Top 5 speaker according to views')
plt.xlabel('Speaker')
plt.ylabel('Views (In Millions)')
ax.grid(False)

In [None]:
speaker_count = ted_df['speaker_1'].value_counts()
speaker_count

##### 1. Why did you pick the specific chart?

A bar plot compares an aggregated value across categories, here the total views per speaker

##### 2. What is/are the insight(s) found from the chart?

Chart describes the top 5 speakers wrt views generated

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Chart gives insight into most popular speakers and correspondingly the topics and available languages

#### Chart - 2

In [None]:
speaker_count = ted_df['speaker_1'].value_counts()
speaker_count = pd.DataFrame(speaker_count).reset_index()
speaker_count.columns = ['Speaker_Name', 'talks_delivered']
most_talks = speaker_count.nlargest(5, 'talks_delivered')
plt.figure(figsize=(10,6))
sns.barplot(x = 'Speaker_Name', y = 'talks_delivered', data = most_talks)

##### 1. Why did you pick the specific chart?

A bar plot compares a count across categories, here the number of talks delivered per speaker

##### 2. What is/are the insight(s) found from the chart?

Chart describes the Top 5 speakers wrt talks delivered from all events combined

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Chart gives insight into most active speakers and whether it is linked to popularity(views) of the event

#### Chart - 3

In [None]:
temp=ted_df.groupby(['title'],as_index=False)['views'].sum().sort_values('views',ascending=False)[:5]
plt.figure(figsize=(10,8)) 
ax=sns.barplot(x='title', y='views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=65);
plt.title('Top 5 title according to views')
plt.xlabel('Titles')
plt.ylabel('Views (In Millions)')
ax.grid(False)

##### 1. Why did you pick the specific chart?

A bar plot compares an aggregated value across categories, here the total views per title

##### 2. What is/are the insight(s) found from the chart?

Chart describes Top titles that generate maximum views

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Chart gives insight into overall theme of the event that audience is most interested in

#### Chart - 4

In [None]:
temp=ted_df.groupby(['Event_Category'],as_index=False)['views'].sum().sort_values('views',ascending=False)[:5]
plt.figure(figsize=(8,6)) 
ax=sns.barplot(x='Event_Category', y='views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=50);
plt.title('Top 5 events according to views')
plt.xlabel('Events')
plt.ylabel('Views (In Millions)')
ax.grid(False)

##### 1. Why did you pick the specific chart?

A bar plot compares an aggregated value across categories, here the total views per event category

##### 2. What is/are the insight(s) found from the chart?

When events are grouped together, the TED events held in the 2000s (`TED2000s`) are the most viewed, followed by `TED_Other`, which contains various small events organised by TED

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Chart gives insight into the popularity of particular TED events and which events need more focus and development

#### Chart - 5

In [None]:
temp=ted_df.groupby(['Month'],as_index=False)['views'].sum().sort_values('views',ascending=False)
plt.figure(figsize=(8,6)) 
ax= sns.barplot(x='Month', y='views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=50);
plt.xlabel('Months')
plt.ylabel('Views (In Millions)')
ax.grid(False)

##### 1. Why did you pick the specific chart?

To get a detailed insight into months with most active audiences

##### 2. What is/are the insight(s) found from the chart?

It can be seen that March is the most popular month & August the least popular month as far as TED events are concerned

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Chart gives insight into the period of year when audience are more active

#### Chart - 6

In [None]:
temp=ted_df.groupby(['Day'],as_index=False)['views'].sum().sort_values('views',ascending=False)
plt.figure(figsize=(8,6)) 
ax= sns.barplot(x='Day', y='views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=50);
plt.xlabel('Day')
plt.ylabel('Views (In Millions)')
ax.grid(False)

##### 1. Why did you pick the specific chart?

To find the most popular day of the week for published events

##### 2. What is/are the insight(s) found from the chart?

While March was the most popular month, Friday is the day with the most active audiences

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Chart gives insight into when to schedule events with respect to day of the week and time of year

#### Chart - 7

In [None]:
temp=ted_df.groupby(['Num_of_Topics'],as_index=False)['views'].sum().sort_values('views',ascending=False)
plt.figure(figsize=(8,6)) 
ax= sns.barplot(x='Num_of_Topics', y='views',data=temp)
plt.setp(ax.get_xticklabels(), rotation=50);
plt.xlabel('Num_of_Topics')
plt.ylabel('Views (In Millions)')
ax.grid(False)

##### 1. Why did you pick the specific chart?

To see how views relate to the number of topics per event

##### 2. What is/are the insight(s) found from the chart?

It can be seen that maximum views are generated when Num_of_Topics lies in the range of 4-7, whereas views keep decreasing as Num_of_Topics increases beyond that, which may also be associated with the longer duration of such events

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Chart gives insight into the number of topics to schedule for each event

#### Chart - 8

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(ted_df['Available_lang'])

##### 1. Why did you pick the specific chart?

A distribution plot shows the density of a continuous variable

##### 2. What is/are the insight(s) found from the chart?

On average, a TED event is available in around 25 languages, which helps explain its wide reach and popularity

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Each event should be made available in as many languages as possible

#### Chart - 9

In [None]:
temp=ted_df.groupby(['duration'],as_index=False)['views'].sum().sort_values('views',ascending=False)
plt.figure(figsize=(8,6)) 
ax= plt.scatter(x='duration', y='views',data=temp)
plt.xlabel('duration(in sec)')
plt.ylabel('Views (In Millions)')

##### 1. Why did you pick the specific chart?

A scatter plot shows whether the relationship between two numeric features is linear

##### 2. What is/are the insight(s) found from the chart?

The data is heavily influenced by outliers, and no strong relationship between duration and views is visible

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

No insight wrt business growth

#### Chart - 10

In [None]:
comments_views = ted_df.groupby('comments')['views'].mean().astype(int).reset_index()
comments_views = comments_views.sort_values(by ='views', ascending = False)

comments_views.plot(x='comments',
                 y='views',
                 kind='scatter',
                 figsize = (10, 6))

##### 1. Why did you pick the specific chart?

A scatter plot shows whether the relationship between two numeric features is linear

##### 2. What is/are the insight(s) found from the chart?

There is a slight relationship, but the data is heavily influenced by outliers

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

No insight wrt business growth

#### Chart - 11 - Correlation Heatmap

In [None]:
plt.figure(figsize = (10,9))
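# Restrict the correlation to numeric columns; text and datetime columns cannot be correlated.
correlation = ted_df.corr(numeric_only=True)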
sns.heatmap(abs(correlation), annot=True, cmap = 'flare')

##### 1. Why did you pick the specific chart?

Heatmap describes Multicollinearity between all the features in the dataset

##### 2. What is/are the insight(s) found from the chart?

There is no significant collinearity between any two features, except `time_since_published` vs `year` and `Speaker_avg_views` vs `views`. These features are important for the model, so no changes were made.
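Reading pairs off a heatmap by eye can miss entries. As a rough sketch, the pairs above a chosen cutoff can also be listed programmatically. The 0.8 threshold, the helper name `correlated_pairs` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: 'b' is almost a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})
result = correlated_pairs(toy)
print(result)
```

Applied to the notebook's data frame, a helper like this would be expected to report the same pairs the heatmap highlights, although the exact cutoff is a judgment call.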

#### Chart - 12 - Pair Plot 

In [None]:
sns.pairplot(ted_df[1:])

##### 1. Why did you pick the specific chart?

To understand relation between all features in the dataset

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows no strong relationship between most pairs of features, which suggests little multicollinearity

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
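# Assign the filled columns back instead of calling fillna(inplace=True) on a column selection.
ted_df['occupations'] = ted_df['occupations'].fillna('NA')
ted_df['about_speakers'] = ted_df['about_speakers'].fillna('NA')
ted_df['comments'] = ted_df['comments'].fillna(0)
ted_df['all_speakers'] = ted_df['all_speakers'].fillna('NA')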

#### What all missing value imputation techniques have you used and why did you use those techniques?

1. Missing values in object-type columns were replaced with 'NA'
2. Missing values in the numeric `comments` column were replaced with 0

### 2. Handling Outliers

Duration

In [None]:
sns.boxplot(x=ted_df['duration'])
     

In [None]:
percentile_q1 = np.percentile(ted_df['duration'],25)
print(percentile_q1)
percentile_q2 = np.percentile(ted_df['duration'],50)
print(percentile_q2)
percentile_q3 = np.percentile(ted_df['duration'],75)
print(percentile_q3)

In [None]:
iqr_duration =  percentile_q3 - percentile_q1
print(iqr_duration)

In [None]:
ted_df['duration']= ted_df['duration'].mask(ted_df['duration']>(ted_df['duration'].quantile(0.75)+1.5* iqr_duration), ted_df['duration'].mean())

In [None]:
sns.boxplot(x=ted_df['duration'])

Comments

In [None]:
sns.boxplot(x=ted_df['comments'])

In [None]:
iqr_comments = np.percentile(ted_df['comments'], 75) - np.percentile(ted_df['comments'],25)
print(iqr_comments)

In [None]:
ted_df['comments']= ted_df['comments'].mask(ted_df['comments']>(ted_df['comments'].quantile(0.75)+1.5*iqr_comments), ted_df['comments'].mean())

In [None]:
sns.boxplot(x=ted_df['comments'])

Available languages

In [None]:
sns.boxplot(x=ted_df['Available_lang'])

In [None]:
iqr_lang = np.percentile(ted_df['Available_lang'], 75) - np.percentile(ted_df['Available_lang'],25)
print(iqr_lang)
     

In [None]:
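# The source line is truncated after "ted_df['Ava"; the replacement value below is an assumption
# that follows the same pattern as the duration and comments cells.
ted_df['Available_lang']= ted_df['Available_lang'].mask(ted_df['Available_lang']>(ted_df['Available_lang'].quantile(0.75)+1.5* iqr_lang), ted_df['Available_lang'].mean())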

In [None]:
sns.boxplot(x=ted_df['Available_lang'])

##### What all outlier treatment techniques have you used and why did you use those techniques?

The interquartile range (IQR) method was used to treat outliers, since it shows how the data is distributed around the median. Values above Q3 + 1.5 × IQR were replaced with the column mean.
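The three outlier cells above repeat the same steps, which can be sketched as a single helper. `replace_upper_outliers` is a hypothetical name and the toy durations are made up. Like the notebook, the sketch only treats values above the upper fence.

```python
import pandas as pd

def replace_upper_outliers(series, factor=1.5):
    """Replace values above Q3 + factor * IQR with the mean of the original series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    upper_fence = q3 + factor * (q3 - q1)
    return series.mask(series > upper_fence, series.mean())

# Toy durations in seconds (made-up values); 5000 is an obvious outlier.
durations = pd.Series([300, 600, 700, 750, 800, 900, 5000], dtype=float)
cleaned = replace_upper_outliers(durations)
print(cleaned.tolist())
```

In this toy example Q1 is 650 and Q3 is 850, so the fence sits at 1150. The replacement mean (about 1293) is itself above the fence, because the outlier inflates it. Replacing with the median or clipping to the fence are common alternatives when that matters.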

### 3. Categorical Encoding

In [None]:
ted_df.columns

In [None]:
pair_df = [ted_df[['duration', 'comments', 'Available_lang', 'Num_of_Topics', 'daily_views', 'Speaker_avg_views', 'time_since_published', ]], 
              pd.get_dummies(ted_df[['Day','year', 'Month', 'day_num', 'Event_Category']], drop_first =False),
              ted_df['views'] ]
one_hot_df = pd.concat(pair_df, axis=1)
one_hot_df.head(2)

#### What all categorical encoding techniques have you used & why did you use those techniques?

One-hot encoding was used for categorical encoding, since the categorical features contain many categories
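As a small illustration of what one-hot encoding produces, here is a sketch on a toy frame. The values are made up and only the column names mirror the notebook's.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook's.
toy = pd.DataFrame({
    "Day": ["Friday", "Monday", "Friday"],
    "Event_Category": ["TEDx", "TED2000s", "TEDx"],
    "duration": [600, 900, 750],
})

# Expand the two categorical columns into 0/1 indicator columns;
# 'duration' is not listed in `columns`, so it passes through unchanged.
encoded = pd.get_dummies(toy, columns=["Day", "Event_Category"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

When `columns` is not given, `pd.get_dummies` only expands object, string and category columns, so integer columns such as `year` and `day_num` stay numeric in the notebook's call.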

### 4. Feature Manipulation & Selection

#### 1. Feature Selection




In [None]:
one_hot_df.dropna(inplace=True)
one_hot_df.replace([np.inf, -np.inf], 1, inplace=True)

In [None]:
features = [i for i in one_hot_df.columns if i not in ['views']]

In [None]:
len(features)

In [None]:
X = one_hot_df[features].apply(zscore)
y = (one_hot_df['views']).replace([np.inf, -np.inf], 0)

In [None]:
X.shape,y.shape

### 2. Data Scaling

In [None]:
scaler = MinMaxScaler()
scaler.fit(X)
scaled_features = scaler.transform(X)

##### Which method have you used to scale you data and why?

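MinMaxScaler rescales each feature to the range [0, 1]. It preserves the shape of each feature's distribution and doesn't change the information embedded in the original data. In this notebook, `X` has already been standardized with `zscore`, and the train/test split uses `X` rather than `scaled_features`.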
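Min-max scaling maps each value to x' = (x - min) / (max - min), computed per feature. The sketch below shows one common way to apply it, assuming the scaler is fitted on the training rows only and then reused for the test rows. This differs from the cell above, which fits on all of `X`, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature matrix (made-up values) standing in for the notebook's X.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(100, 3))

X_tr, X_te = train_test_split(X_toy, test_size=0.3, random_state=0)

# Fit on the training rows only, then apply the same min/max to the test rows.
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

Test values can fall slightly outside [0, 1] with this approach, since the test rows did not contribute to the fitted minimum and maximum.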

### 3. Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y , test_size = 0.3, random_state = 0) 
print(X_train.shape)
print(X_test.shape)

##### What data splitting ratio have you used and why? 

1. A test split ratio of 0.3 was used, since increasing the test set any further decreased the test score and sometimes caused the model to overfit
2. Decreasing the split ratio caused some models to underfit

## ***7. ML Model Implementation***

### ML Model - 1 : Linear Regression

Linear Regression

In [None]:
# Instantiating Linear Model
lr = LinearRegression()
# Fit the Algorithm
lr.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

In [None]:
lr.coef_
     

In [None]:
y_pred = lr.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MAE= mean_absolute_error((y_test),(y_pred))
print("MAE :" ,MAE)

MSE  = mean_squared_error((y_test),(y_pred))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)
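The same five metrics are computed for every model in this notebook, so the cell above could be wrapped in a helper. The sketch below is one way to do that. `regression_report` is a hypothetical name, and the targets and predictions are made up purely to show the call. Adjusted R2 follows the same formula as above, 1 - (1 - R2)(n - 1) / (n - k - 1), with n samples and k features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 as a dict."""
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    # Adjusted R2 penalizes R2 for the number of features used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2,
        "Adjusted R2": adj_r2,
    }

# Made-up targets and predictions, only to show the call.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0, 5.5])
report = regression_report(y_true, y_pred, n_features=2)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In the notebook this would be called as `regression_report(y_test, y_pred, X_test.shape[1])` for each model, which keeps the metric definitions in one place.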

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=4)
lasso_regressor.fit(X_train, y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
y_pred_lasso = lasso_regressor.predict(X_test)

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_lasso))
print("MSE :" , MSE)

MAE=mean_absolute_error((y_test),(y_pred_lasso))
print("MAE :" ,MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test),(y_pred_lasso))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

Ridge (L2) Regularization

In [None]:
ridge = Ridge(alpha=0.05).fit(X_train, y_train)      # alpha = 0.05 gives the least MSE and max r2
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))


In [None]:
y_pred_r = ridge.predict(X_test)

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_r))
print("MSE :" , MSE)

MAE=mean_absolute_error((y_test), (y_pred_r))
print("MAE :" ,MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_r))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_r[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

### ML Model - 1.2 : Lasso (L1) *Regularization*

Lasso (L1) Regularization

In [None]:
lasso = Lasso(alpha=0.1).fit(X_train, y_train)              # 0.1 gives the least MSE and max r2
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

In [None]:
y_pred_l = lasso.predict(X_test)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_l[:100]))   # plot the Lasso predictions, not the Ridge ones
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Lasso (L1) regularization of the linear model was used for model optimization, though there is no significant improvement in performance

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_l))
print("MSE :" , MSE)

MAE=mean_absolute_error((y_test),(y_pred_l))
print("MAE :" ,MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test),(y_pred_l))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_l[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=4)
lasso_regressor.fit(X_train, y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
y_pred_lasso = lasso_regressor.predict(X_test)
     

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_lasso))
print("MSE :" , MSE)

MAE=mean_absolute_error((y_test),(y_pred_lasso))
print("MAE :" ,MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test),(y_pred_lasso))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

**Which hyperparameter optimization technique have you used and why?**

GridSearchCV was used. It performs an exhaustive search: the model is fitted and cross-validated on every combination of the hyperparameter values in the grid, and the best-scoring combination is kept.
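The exhaustive search can be seen on a small example. This sketch uses synthetic data from `make_regression` rather than the TED dataset, and a shorter alpha grid than the notebook's, both chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not the TED dataset).
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

param_grid = {"alpha": [0.01, 0.1, 1, 10]}
search = GridSearchCV(Lasso(max_iter=10000), param_grid,
                      scoring="neg_mean_squared_error", cv=4)
search.fit(X_toy, y_toy)

# One cross-validated score per candidate: 4 alphas x 4 folds = 16 fits, plus a final refit.
print(search.best_params_)
print(len(search.cv_results_["params"]))
```

Because every candidate is fitted on every fold, the cost grows with the product of the grid sizes. `RandomizedSearchCV`, which is already imported in this notebook, samples a fixed number of candidates instead.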

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

There is no significant improvement in performance, even after hyperparameter optimization

**3. Explain each evaluation metric's indication towards business and the business impact of the ML model used**

An R2 score of 0.76 is reasonable and indicates that the model explains most of the variance in views. However, the MAE is also high, which indicates large errors in individual predictions.

### ML Model - 1.3 : Linear Regression : Ridge Regression

In [None]:
ridge = Ridge(alpha=0.05).fit(X_train, y_train)      # alpha=0.05 gave the lowest MSE and highest R2 among the values tried
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

In [None]:
y_pred_r = ridge.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_r))
print("MSE :" , MSE)

MAE=mean_absolute_error((y_test), (y_pred_r))
print("MAE :" ,MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_r))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_r[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(X_train,y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
y_pred_ridge = ridge_regressor.predict(X_test)

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_ridge))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_ridge))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

**Which hyperparameter optimization technique have you used and why?**

GridSearchCV was used. It exhaustively fits the model on every combination of the supplied hyperparameter values, scores each combination with cross-validation, and keeps the best-scoring one.

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

There was no significant improvement in performance after hyperparameter tuning.

### ML Model - 1.4 : Elastic Net Regression

In [None]:
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)
elasticnet.fit(X_train,y_train)

In [None]:
print("Training set score: {:.2f}".format(elasticnet.score(X_train, y_train)))
print("Test set score: {:.2f}".format(elasticnet.score(X_test, y_test)))

In [None]:
y_pred_e = elasticnet.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MSE  = mean_squared_error((y_test),(y_pred_e))
print("MSE :" , MSE)

MAE = mean_absolute_error(y_test, y_pred_e)
print("MAE :", MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test),(y_pred_e))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

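The R2 score for Elastic-Net regression is lower than for Ridge and Lasso. A likely reason is that Elastic-Net tends to pay off on larger datasets with many correlated features; in addition, its `alpha` and `l1_ratio` were fixed here rather than tuned.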
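
To make the relationship between the three linear models concrete, the sketch below fits Lasso and Elastic-Net on synthetic stand-in data (hypothetical, not the TED dataset). With `l1_ratio=1.0` the Elastic-Net penalty reduces to the Lasso penalty, while `l1_ratio=0.5` mixes L1 and L2 shrinkage, so `l1_ratio` is worth tuning alongside `alpha`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in data (hypothetical; not the TED dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)   # pure L1 penalty
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)  # half L1, half L2

print("Lasso coefficients:       ", np.round(lasso_demo.coef_, 3))
print("ElasticNet, l1_ratio=1.0: ", np.round(enet_l1.coef_, 3))
print("ElasticNet, l1_ratio=0.5: ", np.round(enet_mix.coef_, 3))
```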

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_e[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()
     

### ML Model - 2 : Decision Tree Regression

In [None]:
dtree = DecisionTreeRegressor(random_state=0, max_depth=25)        # max_depth=25 gave the lowest MSE and highest R2 among the values tried

In [None]:
dtree.fit(X_train, y_train)

In [None]:
print("Training set score: {:.3f}".format(dtree.score(X_train, y_train)))
print("Test set score: {:.3f}".format(dtree.score(X_test, y_test)))

In [None]:
y_pred_d = dtree.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_d))
print("MSE :" , MSE)

MAE = mean_absolute_error(y_test, y_pred_d)
print("MAE", MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_d))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_d[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()


The gap between the training score and the test score shows that the model overfits.
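
One way to see this kind of overfitting is to compare the train/test R2 gap at several tree depths. The sketch below does so on synthetic stand-in data (hypothetical, not the TED dataset); the depths 2, 5 and 25 are illustrative choices, with 25 matching the `max_depth` used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with noise (hypothetical; not the TED dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

gaps = {}
for depth in [2, 5, 25]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(Xtr, ytr)
    train_r2 = tree.score(Xtr, ytr)
    test_r2 = tree.score(Xte, yte)
    # A large positive gap means the tree memorised noise in the training set
    gaps[depth] = train_r2 - test_r2
    print(f"max_depth={depth:>2}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

A deep tree typically reaches a near-perfect training score while the test score lags, which is the pattern described above. Limiting `max_depth` or `min_samples_leaf`, or moving to an ensemble, are the usual remedies.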

### ML Model - 3 : Bagging Regression

In [None]:
bagging = BaggingRegressor(random_state=0)
bagging.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(bagging.score(X_train, y_train)))
print("Test set score: {:.2f}".format(bagging.score(X_test, y_test)))

In [None]:
y_pred_bag = bagging.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MSE  = mean_squared_error((y_test),(y_pred_bag))
print("MSE :" , MSE)

MAE = mean_absolute_error(y_test, y_pred_bag)
print("MAE :", MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_bag))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

### ML Model - 4 : Random-Forest Regression

In [None]:
rfr = RandomForestRegressor(random_state=0)

In [None]:
rfr.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(rfr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(rfr.score(X_test, y_test)))

In [None]:
y_pred_rfr = rfr.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MSE  = mean_squared_error((y_test),(y_pred_rfr))
print("MSE :" , MSE)

MAE = mean_absolute_error(y_test, y_pred_rfr)
print("MAE :", MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_rfr))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_rfr[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

### ML Model - 5 : Gradient Boosting Regression

In [None]:
GradientBoosting = GradientBoostingRegressor(random_state=0)

In [None]:
GradientBoosting.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(GradientBoosting.score(X_train, y_train)))
print("Test set score: {:.2f}".format(GradientBoosting.score(X_test, y_test)))

In [None]:
y_pred_gb = GradientBoosting.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MSE  = mean_squared_error((y_test), (y_pred_gb))
print("MSE :" , MSE)

MAE =mean_absolute_error(y_test, y_pred_gb)
print("MAE :", MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_gb))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_gb[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

### ML Model - 6 : XG-Boost Regression

In [None]:
xg_boost = XGBRegressor(random_state=0)

In [None]:
xg_boost.fit(X_train,y_train)

In [None]:
print("Training set score: {:.2f}".format(xg_boost.score(X_train, y_train)))
print("Test set score: {:.2f}".format(xg_boost.score(X_test, y_test)))

In [None]:
y_pred_xgb = xg_boost.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
MSE  = mean_squared_error((y_test),(y_pred_xgb))
print("MSE :" , MSE)

MAE = mean_absolute_error(y_test, y_pred_xgb)
print("MAE", MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test), (y_pred_xgb))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_xgb[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

### ML Model - 7 : KNN Regression

In [None]:
neighbors = np.arange(1,10)
train_accuracy =np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i,k in enumerate(neighbors):
    # Set up a KNN regressor with k neighbors
    knn = KNeighborsRegressor(n_neighbors=k)

    # Fit the model
    knn.fit(X_train, y_train)

    # R2 score on the training set (score() returns R2 for regressors)
    train_accuracy[i] = knn.score(X_train, y_train)

    # R2 score on the test set
    test_accuracy[i] = knn.score(X_test, y_test)

In [None]:
plt.title('k-NN: R2 score for varying number of neighbors')
plt.plot(neighbors, test_accuracy, label='Test R2')
plt.plot(neighbors, train_accuracy, label='Training R2')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('R2 score')
plt.show()

In [None]:
knn_reg = KNeighborsRegressor(n_neighbors = 6)            # default metric is Minkowski with p=2, i.e. Euclidean distance (p=1 gives Manhattan)
knn_reg.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(knn_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(knn_reg.score(X_test, y_test)))

In [None]:
y_pred_knn = knn_reg.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

A KNN regressor was implemented. Its high MSE, MAE, and RMSE may be because KNN works best with lower-dimensional data (fewer input variables).

In [None]:
MSE  = mean_squared_error(y_test,y_pred_knn)
print("MSE :" , MSE)

MAE = mean_absolute_error(y_test, y_pred_knn)   # use the KNN predictions, not the XGBoost ones
print("MAE :", MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_test),(y_pred_knn))
print("R2 :" ,r2)

n= X_test.shape[0]
k= X_test.shape[1]
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print("Adjusted R2 :", adj_r2_score)

In [None]:
plt.figure(figsize=(15,7))
plt.plot(np.square(y_pred_knn[:100]))
plt.plot(np.square(np.array(y_test[:100])))
plt.legend(["Predicted","Actual"])
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Model performance was assessed with the following regression evaluation metrics:

1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Root Mean Squared Error (RMSE)
4. R2 score
5. Adjusted R2 score
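
Since the same five metrics are computed for every model, they can be gathered in one helper. The sketch below is one possible way to do that; the function name `regression_report` and the toy inputs are hypothetical, and the adjusted R2 uses the same formula as the cells above, `1 - (1 - r2) * (n - 1) / (n - k - 1)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2,
        # Penalises R2 for the number of predictors k = n_features
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

# Toy example (hypothetical values)
demo = regression_report([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0], n_features=1)
for metric_name, value in demo.items():
    print(f"{metric_name}: {value:.4f}")
```

In the notebook this would be called as, for example, `regression_report(y_test, y_pred_rfr, X_test.shape[1])`.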

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
models = [
           ['Lasso', Lasso(alpha=0.1)],
           ['Ridge', Ridge(alpha=0.05)],
           ['Elastic-Net', ElasticNet(alpha=0.1, l1_ratio=0.5)],
           ['Bagging', BaggingRegressor(random_state=0)],
           ['RandomForest', RandomForestRegressor(random_state=0)],
           ['Gradient-Boosting', GradientBoostingRegressor(random_state=0)],
           ['XGB-Regressor', XGBRegressor(random_state=0)],
           ['KNN-Regressor', KNeighborsRegressor(n_neighbors=6)]
        ]

In [None]:
model_data = []
for name, curr_model in models:
    curr_model.fit(X_train, y_train)

    # Predict once per split and reuse the predictions for every metric
    pred_train = curr_model.predict(X_train)
    pred_test = curr_model.predict(X_test)

    model_data.append({
        "Name": name,
        "MAE_train": mean_absolute_error(y_train, pred_train),
        "MAE_test": mean_absolute_error(y_test, pred_test),
        "R2_Score_train": r2_score(y_train, pred_train),
        "R2_Score_test": r2_score(y_test, pred_test),
        "RMSE_Score_train": np.sqrt(mean_squared_error(y_train, pred_train)),
        "RMSE_Score_test": np.sqrt(mean_squared_error(y_test, pred_test)),
    })

In [None]:
results_df = pd.DataFrame(model_data)
results_df

In [None]:
plt.figure(figsize=(12, 8))
ax=sns.barplot(x='Name', y='R2_Score_test',data=results_df)
plt.setp(ax.get_xticklabels(), rotation=20);
plt.title('R2 score of models')
plt.xlabel('Models')
plt.ylabel('R2 Score')
ax.grid(False)

In [None]:
plt.figure(figsize=(12, 8))
ax=sns.barplot(x='Name', y='MAE_test',data=results_df)
plt.setp(ax.get_xticklabels(), rotation=20);
plt.title('Mean Absolute Error for models')
plt.xlabel('Models')
plt.ylabel('Mean Absolute Error')
ax.grid(False)

Considering the R2 score and the MAE, XG-Boost has the best prediction performance.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**SHAPLEY Method**

In [None]:
pip install shap

In [None]:
import shap 

In [None]:
# Create the object that can calculate shap values
explainer = shap.Explainer(rfr)
# Calculate shap values
shap_values = explainer(X_train)
     

In [None]:
shap.initjs()
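# Explain the first training row; the feature values must come from the same data the SHAP values were computed on
shap.force_plot(explainer.expected_value, shap_values.values[0,:], shap_values.data[0,:], feature_names=shap_values.feature_names)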

In [None]:
# Obtain a Bar Summary Plot
shap.summary_plot(shap_values, X_train, plot_type="bar")

In [None]:
# Beeswarm plot
shap.plots.beeswarm(shap_values)

1. The plots above show the top features and how much each one contributes to pushing the model output away from the base value (the average model output over the training data passed to the explainer). In the force plot, features pushing the prediction higher are shown in red and those pushing it lower are shown in blue.
2. Daily_views and Speaker_avg_values push the prediction lower, while time_since_published and year push it higher.
3. The base value (mean model output) is around 1 million views.
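
The way the base value and the per-feature contributions combine can be stated compactly. For a single row $x$ with $M$ features, SHAP values satisfy an additivity property:

$$
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x)
$$

Here $f(x)$ is the model output for that row, $\phi_0$ is the base value (`explainer.expected_value` in the code above), and $\phi_i(x)$ is the SHAP value of feature $i$ for that row. A positive $\phi_i(x)$ moves the prediction above the base value and a negative one moves it below, which is what the red and blue segments of the force plot visualise.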

# **Conclusion**

**Summary**

1. We trained 8 machine learning models on the training dataset, using the best parameters found for each model.
2. The performance of each model was evaluated using a comparison plot of predicted versus actual values and a set of regression evaluation metrics.
3. We started with Linear Regression and then applied regularization to it (Lasso, Ridge, and Elastic-Net).
4. To capture more complex relationships in the data, we then used Decision Trees and their ensemble techniques.
5. Considering the errors and the R2 score together, Random-Forest has the lowest errors and the best R2 score across the training and test datasets.


**Conclusion :**

1. Focus should be placed on specific topics, the number of topics, and the number of available languages.
2. The majority of the events focus on the themes of Education and Self-development.
3. Events should be made available in as many languages as possible.
4. Speakers should be selected who have made a significant contribution to their respective field and possess significant knowledge of it.
5. Events should preferably be published online on a Friday and in the month of March.
6. The average predicted views for a new TED event, based on the given dataset, are around 2.17 million.
