# **Project Name**    - TED Talk Views Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

In an age where knowledge-sharing has become increasingly important, TED, a nonprofit organization founded in 1984, stands as a beacon for the dissemination of innovative and inspiring ideas. TED's mission is to connect experts across various fields, be it Technology, Entertainment, or Design, with a global audience. Over the years, it has hosted talks by luminaries such as Al Gore, Jimmy Wales, Shahrukh Khan, and Bill Gates. As of 2015, TED, along with its associated TEDx chapters, had made over 2000 talks available for free to the public, making it a vital resource for intellectual enrichment.

The core objective of this project is to construct a predictive model that can estimate the number of views a TED talk will garner once uploaded to the TEDx website. This predictive model carries significant implications for TED as it helps the organization better understand the factors influencing the popularity of its talks and provides insights into how to engage its audience more effectively.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This project seeks to predict the number of views TED talks will receive on the TEDx website, a task crucial for content curation, audience engagement, and resource allocation. The challenge lies in handling diverse data, accounting for unpredictable viewership factors, and creating an interpretable model. The primary objectives include building an accurate predictive model and understanding the influential factors behind talk views. The project's scope covers data processing, feature engineering, model selection, and rigorous performance evaluation, aligning with TED's mission of sharing impactful ideas globally.
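As a rough illustration of the performance evaluation mentioned above, the sketch below computes MAE, RMSE and R² with NumPy only. The choice of metrics is an assumption (the problem statement does not name them), the helper name `regression_report` is hypothetical, and the view counts are made-up numbers.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot  # undefined when y_true is constant
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up view counts, for illustration only
actual_views = [100, 200, 300, 400]
predicted_views = [110, 190, 320, 380]
metrics = regression_report(actual_views, predicted_views)
print(metrics)
```

MAE and RMSE are in the same unit as the target (views), which tends to make them easier to explain to a non-technical audience than R².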

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
)
from sklearn.preprocessing import (
    LabelEncoder,
    StandardScaler,
    OneHotEncoder,
    MinMaxScaler,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.utils import resample
from scipy import stats
import string
import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
import random
!pip install contractions
import contractions
from tqdm import tqdm
import collections
import statsmodels.api as sm
from scipy.stats import f_oneway
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
csv_file_path = '/content/drive/MyDrive/Project/TED Talk Views Prediction/data_ted_talks.csv'

dataset = pd.read_csv(csv_file_path)

### Dataset First View

In [None]:
# Dataset First Look
print("\nFirst 5 rows of the dataset:")
print(dataset.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
print("Dataset Information:")
print(dataset.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = dataset.isnull().sum()
# Display the count of missing values for each column
print("Missing Values Count per Column:")
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(dataset.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

Data Structure: The dataset comprises 4,005 rows and 19 columns, with each row representing a unique TED talk. It contains a diverse range of data types, including integers, floats, and objects (textual data).

Features: The dataset includes features such as talk titles, speaker information, recording and publication dates, event details, language attributes, view counts, and metadata about talks. Additionally, textual features like talk descriptions, transcripts, and related talks were provided.

Data Completeness: Missing values were observed in several columns, including speaker-related details, date information, and the number of comments. The missing values varied in frequency and significance, requiring consideration during data preprocessing.

Duplicate Values: The dataset was found to be free of duplicate entries, which is crucial for maintaining data integrity and ensuring accurate analyses.

Textual Information: The presence of text-based columns, such as descriptions and transcripts, suggests the potential for natural language processing (NLP) tasks, sentiment analysis, or text-based feature engineering.

Numerical Features: Numeric columns, such as view counts and duration, are available for quantitative analysis, and these may be crucial for building a predictive model to estimate talk views.

Categorical Features: Categorical features like native language and event categories could be useful for categorization and segmentation tasks.

Time Series Data: The dataset contains date-related features, which could facilitate time series analysis or trend identification.

Data Quality: The dataset has generally good data quality, but some inconsistencies, missing values, and variations need to be addressed during data preprocessing to ensure the reliability and accuracy of analyses and models.
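The completeness observations above can be quantified per column. The following is a minimal sketch on a small made-up frame standing in for `dataset`; the helper name `missing_summary` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value count and percentage per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in for `dataset` (values are illustrative only)
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "comments": [10.0, np.nan, np.nan, 3.0],
    "occupations": ["x", None, "y", "z"],
})
missing_report = missing_summary(toy)
print(missing_report)
```

On the real data, `missing_summary(dataset)` would give the share of gaps per column, which helps decide between dropping rows and imputing.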

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(dataset.columns)

In [None]:
# Dataset Describe
print(dataset.describe())

### Variables Description

talk_id: A unique identifier for each TED talk.

title: The title of the TED talk.

speaker_1: The primary speaker of the talk.

all_speakers: Information about all the speakers involved in the talk.

occupations: Occupations or roles of the speakers, describing their professional backgrounds.

about_speakers: Information about the background and expertise of the speakers.

views: The number of views the talk has received on the TEDx website.

recorded_date: The date when the talk was recorded.

published_date: The date when the talk was published on the TEDx website.

event: The event or conference at which the talk was presented.

native_lang: The native language of the talk.

available_lang: The languages for which subtitles are available for the talk.

comments: The number of comments made by viewers on the talk.

duration: The duration of the talk in seconds.

topics: Topics associated with the talk, representing the themes or subject matter.

related_talks: IDs of related talks, indicating talks that are related to the current one.

url: The URL that provides access to the specific talk on the TED website.

description: A brief description or summary of the talk's content.

transcript: The full transcript of the talk, which may contain the spoken content of the talk.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in dataset.columns:
    unique_values = dataset[column].unique()
    print(f'Column "{column}": {len(unique_values)} unique values')
    # Show only a small sample; printing every value floods the output for text columns
    print(f'Sample values: {unique_values[:5]}')
    print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Drop rows with missing values in 'recorded_date'
dataset = dataset.dropna(subset=['recorded_date'])

In [None]:
# Fill missing values in 'about_speakers' and 'occupations' with a default value
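# Assign the result back instead of using inplace=True on a column selection,
# which may not modify `dataset` in recent pandas versions
dataset['about_speakers'] = dataset['about_speakers'].fillna('Unknown')
dataset['occupations'] = dataset['occupations'].fillna('Unknown')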

In [None]:
# Fill missing values in 'comments' with zeros
dataset['comments'] = dataset['comments'].fillna(0)

In [None]:
# Data Type Conversion

dataset['recorded_date'] = pd.to_datetime(dataset['recorded_date'])
dataset['published_date'] = pd.to_datetime(dataset['published_date'])

### What all manipulations have you done and insights you found?

Data Manipulations:

Handling Missing Values: We addressed missing values in the dataset, with a focus on critical columns. Rows with missing values in the 'recorded_date' column were removed, as this date is essential for time-based analysis. Missing values in 'about_speakers' and 'occupations' were filled with a default value ('Unknown') to ensure the completeness of speaker-related information. Missing values in the 'comments' column were imputed with zeros to indicate no comments.

Data Type Conversion: We converted date-related columns, specifically 'recorded_date' and 'published_date', into datetime objects. This allows for meaningful time-based analysis and visualization.

Key Insights:

Data Completeness: By handling missing values, we ensured that the dataset is now more complete and suitable for analysis. Removing rows with missing 'recorded_date' values has helped maintain the integrity of time-based analysis.

Speaker Information: The dataset often contains 'Unknown' values for 'about_speakers' and 'occupations', indicating a lack of detailed speaker information for some talks. Understanding this can help us make informed decisions when dealing with speaker-related features.

Comments Distribution: The imputation of missing values in the 'comments' column allowed us to maintain data integrity. We observed a wide distribution of comments, ranging from zero to a substantial number, which indicates varying levels of audience engagement with TED talks.

Time-Based Analysis: Converting 'recorded_date' and 'published_date' into datetime objects enables us to perform time-based analysis and explore how the timing of TED talk recordings and publications relates to views.
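The wrangling steps described in this section can be gathered into one function, and the datetime columns then support simple time-based features. This is a minimal sketch on made-up rows: the function name `wrangle` and the derived columns `recorded_year`, `published_month` and `days_to_publish` are assumptions for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above and add simple date features."""
    # Work on a copy so the caller's frame is left untouched
    out = df.dropna(subset=["recorded_date"]).copy()
    for col in ["about_speakers", "occupations"]:
        out[col] = out[col].fillna("Unknown")
    out["comments"] = out["comments"].fillna(0)
    for col in ["recorded_date", "published_date"]:
        out[col] = pd.to_datetime(out[col])
    # Derived features (assumed names, for illustration only)
    out["recorded_year"] = out["recorded_date"].dt.year
    out["published_month"] = out["published_date"].dt.month
    out["days_to_publish"] = (out["published_date"] - out["recorded_date"]).dt.days
    return out

# Made-up rows standing in for `dataset`
toy = pd.DataFrame({
    "recorded_date": ["2006-02-25", None, "2019-11-20"],
    "published_date": ["2006-06-27", "2010-01-01", "2020-01-15"],
    "about_speakers": ["bio", "bio", None],
    "occupations": [None, "writer", "designer"],
    "comments": [12.0, 5.0, np.nan],
})
clean = wrangle(toy)
print(clean[["recorded_year", "published_month", "days_to_publish"]])
```

Keeping the steps in one function makes the notebook easier to rerun from top to bottom, since the cleaning no longer depends on cell execution order.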

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Group the data by 'native_lang' and count the number of talks in each language
language_counts = dataset['native_lang'].value_counts()

# Create a bar chart to visualize the distribution
plt.figure(figsize=(12, 6))
language_counts.plot(kind='bar', color='skyblue')
plt.title('Distribution of TED Talks by Native Language')
plt.xlabel('Native Language')
plt.ylabel('Number of Talks')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Show the chart
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the distribution of TED talks by native language because it suits both the data type and the objective. Here's why I selected this specific chart:

Data Type: The 'native_lang' column represents categorical data, where each unique value corresponds to a different native language. Bar charts are commonly used to display the distribution of categorical data, making them an appropriate choice for this dataset.

Count of Categories: A bar chart is effective when you want to show the frequency or count of categories (in this case, the number of talks for each native language). It allows viewers to easily compare the counts for different categories.

Readability: Bar charts are easy to read and interpret. The x-axis displays the categories (native languages), and the y-axis shows the count or frequency. This makes it straightforward for the audience to understand the distribution of TED talks across languages.

Visual Comparison: Bar charts make it simple to visually compare the sizes of different categories. In this context, viewers can quickly identify which native languages have the highest and lowest numbers of talks.

Insight Generation: This type of visualization can provide insights into the diversity of languages represented in TED talks, which might be of interest in understanding the global reach and impact of TED's content.

##### 2. What is/are the insight(s) found from the chart?

Insight from the Chart:

The chart displaying the distribution of TED talks by native language reveals several insights:

English Dominance: English (en) is the dominant language for TED talks, with a significantly higher number of talks compared to other languages. This suggests that English is the most commonly used language for TED presentations, making up a substantial portion of the content.

Global Reach: TED's commitment to making ideas accessible globally is evident in the diversity of languages represented in the dataset. While English is prominent, there is a wide array of native languages in which TED talks are conducted, indicating the organization's efforts to reach a worldwide audience.

Low Representation: The chart highlights that some languages have relatively low representation, with only a handful of talks each. This could indicate areas where TED might consider expanding its content to cater to a broader international audience.

Multilingual Content: The presence of multiple languages signifies TED's commitment to inclusivity, making talks available to non-English-speaking audiences through translation and subtitles.

Data Quality Check: Extremely low counts for some languages might also prompt a data quality check. Because `value_counts()` lists only values that occur at least once (and skips missing values by default), languages with no talks do not appear in the chart at all, so mislabeled or missing 'native_lang' entries could contribute to the observed distribution.
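The dominance of one language is easier to state as a percentage than to read off bar heights. A minimal sketch, using a made-up series in place of `dataset['native_lang']` (the language codes and counts are illustrative only):

```python
import pandas as pd

# Made-up language codes standing in for dataset['native_lang']
toy_langs = pd.Series(
    ["en", "en", "en", "es", "en", "fr", "en", "es"], name="native_lang"
)

# normalize=True returns proportions instead of raw counts
language_share = toy_langs.value_counts(normalize=True).mul(100).round(1)
print(language_share)
```

Only languages that occur at least once show up in the result, so a language with no talks cannot appear in a chart built from these counts.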

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reasons.

Positive Business Impact:

Audience Reach and Engagement: Understanding the distribution of TED talks by native language allows TED to tailor its content to a global audience. By offering talks in multiple languages, TED can attract and engage a broader and more diverse viewership. This inclusivity can lead to increased user engagement and a larger global following, which can positively impact the business.

Cultural Relevance: Providing content in multiple languages demonstrates TED's commitment to cultural relevance and diversity. This approach can strengthen TED's brand and reputation as an organization that respects and values linguistic and cultural diversity.

Global Partnerships: TED can leverage its multilingual content to forge partnerships and collaborations with organizations and individuals from various regions. This can lead to opportunities for co-hosted events, cross-promotions, and other mutually beneficial arrangements.

Content Monetization: A diverse range of content can attract a larger and more varied audience, potentially enhancing opportunities for content monetization through advertising, sponsorships, and partnerships.

Feedback and Localization: Insights from the distribution of talks by language can help TED collect feedback and analytics on the preferences and interests of audiences in different regions. This information can guide content localization and future content creation, enhancing the overall user experience.

No Negative Growth Insights:

The insights gained from the chart do not inherently lead to negative growth. While there may be lower representation of certain languages, this does not translate directly into negative growth. Instead, it may indicate areas for potential expansion and improvement. TED's commitment to inclusivity and diversity in language representation aligns with its mission, which is unlikely to have negative consequences when approached thoughtfully.

However, TED should be mindful of ensuring that low representation of certain languages does not result from data quality issues or neglect of particular regions. In such cases, addressing these concerns can have a positive impact on growth.

In summary, the insights gained from the chart primarily support positive business impact by expanding TED's global reach, improving user engagement, and enhancing the organization's commitment to inclusivity and diversity.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Create a histogram to visualize the distribution of TED talk durations
plt.figure(figsize=(10, 6))
plt.hist(dataset['duration'], bins=30, color='skyblue')
plt.title('Distribution of TED Talk Durations')
plt.xlabel('Duration (seconds)')
plt.ylabel('Number of Talks')

# Show the chart
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Here's why I selected this specific chart:

Data Type: The 'duration' column represents continuous numeric data, indicating the length of TED talks in seconds. A histogram is an appropriate choice for visualizing the distribution of such data.

Frequency Distribution: A histogram allows you to see how the values are distributed across different duration ranges. It provides insights into the central tendency, spread, and presence of any outliers in the data.

Identification of Patterns: By creating a histogram, you can easily identify patterns in the distribution of TED talk durations, such as whether they follow a normal distribution, have multiple peaks, or exhibit skewness.

Understanding Typical Duration: This type of visualization helps you understand what the typical duration of TED talks is in your dataset and whether there are any talks that significantly deviate from the norm.

Visual Comparison: A histogram is a widely used chart for comparing the distribution of values, making it easy to identify any unusual trends or features in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Insight from the Chart:

Diverse Talk Durations: The histogram reveals that TED talks have a wide range of durations, with the majority of talks falling within a particular duration range. This diversity in talk durations suggests that TED content accommodates various topics and speaker styles, allowing for both short and long presentations.

Peak Duration Range: There is a clear peak in the histogram, indicating a common duration range for TED talks. This peak suggests that a significant number of talks are of similar duration, which can be informative for event planning and audience expectations.

Longer Talks: The right tail of the histogram shows that there are talks with longer durations, indicating that TED is open to hosting more extended discussions when necessary. These longer talks may be related to more in-depth or complex topics.

Short Talks: The left tail of the histogram indicates the presence of shorter talks, possibly used for concise and impactful presentations or introductions to specific concepts.

Outliers: The histogram allows for the identification of potential outliers, talks with durations significantly different from the norm. These outliers can be of interest for further analysis, as they may represent unique or exceptional cases.

Typical Duration: The histogram provides insights into the typical duration of TED talks in the dataset, helping TED organizers and viewers understand the expected length of a standard TED talk.

Audience Engagement: Talk duration is a crucial factor in audience engagement. The distribution suggests that TED effectively manages the balance between providing concise talks that capture attention and longer talks that delve into complex subjects.
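The typical duration, skewness and outliers discussed above can also be checked numerically. The sketch below uses made-up durations in seconds and the common 1.5 × IQR rule for flagging outliers. That rule is an assumption here, not something the notebook prescribes.

```python
import pandas as pd

# Made-up durations in seconds standing in for dataset['duration']
durations = pd.Series(
    [420, 600, 720, 780, 840, 900, 960, 1080, 3600], name="duration"
)

median_duration = durations.median()
skewness = durations.skew()  # > 0 indicates a longer right tail

# Flag values outside 1.5 * IQR from the quartiles
q1, q3 = durations.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = durations[(durations < lower) | (durations > upper)]

print(f"median={median_duration}, skewness={skewness:.2f}")
print(outliers)
```

A clearly positive skewness would support the reading that a few long talks stretch the right tail of the histogram.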

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reasons.

Positive Business Impact:

Audience Engagement: Understanding the distribution of talk durations allows TED to tailor its content to the preferences and attention spans of its audience. This can result in higher audience engagement, as talks of various lengths cater to different viewers with distinct needs.

Content Planning: TED can use insights into the typical duration and distribution to plan content more effectively. This helps ensure that the duration of each talk aligns with the depth of the topic and the target audience.

Diverse Topics: The wide range of talk durations signifies TED's versatility in accommodating diverse topics and presentations. This diversity is appealing to a broad audience, encouraging them to explore a range of subjects.

Monetization: Content that caters to varying attention spans and interests can enhance opportunities for monetization through ads, sponsorships, and memberships.

Tailored Events: TED can plan events with a mix of talk durations to offer a varied and engaging experience for attendees.

No Negative Growth Insights:

The insights from the duration distribution don't inherently lead to negative growth. However, TED should be cautious about managing outliers, ensuring they align with the content's quality and relevance to maintain a high standard. Overly long or short talks that compromise content quality may negatively affect viewer satisfaction. Therefore, while longer or shorter talks are accommodated, they must still provide value and maintain TED's reputation for quality content.

In summary, the insights gained from the distribution of TED talk durations support positive business impact by enhancing audience engagement, content planning, diversity, and monetization opportunities. TED's commitment to inclusivity in terms of talk lengths can be advantageous without posing direct negative growth risks, provided that content quality and relevance are maintained.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Group the data by 'native_lang' and count the number of talks in each language
language_counts = dataset['native_lang'].value_counts()

# Create a pie chart to visualize the distribution
plt.figure(figsize=(8, 8))
plt.pie(language_counts, labels=language_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(language_counts))))
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Distribution of TED Talks by Native Language')

# Show the chart
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Proportion Display: A pie chart is ideal when you want to show how a whole (in this case, the total number of TED talks) is distributed among various categories (native languages). It provides a clear representation of the proportion of each category relative to the whole.

Relative Comparisons: The pie chart makes it easy to compare the sizes of different language categories at a glance. Viewers can immediately see the distribution of talks by language, which is useful for understanding the composition of TED's content.

Percentage Information: The percentage labels on the pie chart slices provide specific information about the contribution of each language. This makes it clear to viewers what portion of TED talks is in each language.

Concise Visualization: Pie charts are concise and easy to understand, making them effective for communicating a high-level overview of data.

Linguistic Diversity: In the context of TED talks, the pie chart highlights the linguistic diversity and the organization's commitment to offering content in multiple languages.

In summary, a pie chart is a suitable choice for visualizing the distribution of TED talks by native language because it effectively conveys the proportion of talks in each language relative to the total, making it easy for viewers to grasp the linguistic diversity of TED's content.

##### 2. What is/are the insight(s) found from the chart?

Insight from the Chart:

English Dominance: The pie chart clearly illustrates the dominance of English as the primary language for TED talks. The sizable portion of the chart occupied by English indicates that a significant majority of TED talks are presented in English.

Multilingual Content: While English is predominant, the presence of multiple language segments shows that TED offers content in various native languages. This reflects TED's commitment to inclusivity and its mission to make ideas accessible to global audiences.

Linguistic Diversity: The pie chart highlights the linguistic diversity of TED's content, with different languages represented. This diversity makes TED accessible and appealing to audiences from various linguistic backgrounds.

Language Representation: By viewing the chart, viewers can quickly identify the distribution of talks in other languages. This can be informative for viewers who prefer talks in specific languages and can navigate to those talks easily.

Potential for Growth: The chart also suggests potential areas for growth and expansion. While English is dominant, other languages may have room for growth in terms of the number of talks, catering to specific language-speaking audiences.

Content Localization: TED's commitment to language diversity suggests the organization's willingness to localize content to reach broader audiences. Localization can lead to greater audience engagement and global reach.

In summary, the pie chart provides insights into the linguistic diversity of TED talks, emphasizing the dominance of English while showcasing TED's commitment to making ideas accessible in multiple languages. This diversity contributes to TED's global appeal and accessibility to a wide range of audiences.
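When one category dominates, the remaining pie slices and their labels tend to overlap. One option is to fold small categories into a single "Other" slice before plotting. A minimal sketch, where the helper `collapse_small`, the 5% threshold and the counts are all hypothetical:

```python
import pandas as pd

def collapse_small(counts: pd.Series, min_share: float = 0.05,
                   other_label: str = "Other") -> pd.Series:
    """Merge categories below `min_share` of the total into one bucket."""
    shares = counts / counts.sum()
    large = counts[shares >= min_share]
    small_total = counts[shares < min_share].sum()
    if small_total > 0:
        large = pd.concat([large, pd.Series({other_label: small_total})])
    return large

# Made-up counts standing in for dataset['native_lang'].value_counts()
toy_counts = pd.Series({"en": 90, "es": 6, "fr": 2, "hi": 1, "ja": 1})
collapsed = collapse_small(toy_counts)
print(collapsed)
```

The collapsed series can be passed to `plt.pie` in place of `language_counts`; the total is unchanged, so the percentages still sum to 100.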

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reasons.

Positive Business Impact:

Audience Reach: Understanding the linguistic diversity of TED's content allows TED to effectively reach and engage a global audience. This can positively impact business by increasing the organization's reach and audience engagement.

Inclusivity: The representation of various languages signifies TED's commitment to inclusivity. This inclusive approach can enhance TED's reputation and attract a more diverse viewership, leading to positive business outcomes.

Catering to Language Preferences: Knowledge of language distribution enables TED to cater to viewers' language preferences. This can lead to increased viewership and user satisfaction, positively impacting the business.

Localization Opportunities: Insights into language distribution can guide TED in making localized content, which can be appealing to specific language-speaking audiences. Localization can result in business growth and greater global reach.

Global Partnerships: The ability to offer content in multiple languages opens doors for global partnerships, collaborations, and expansion into international markets, all of which can contribute to business growth.

No Negative Growth Insights:

The insights from the language distribution pie chart don't inherently lead to negative growth. Instead, they highlight TED's commitment to linguistic diversity and inclusivity, which align with TED's mission of sharing ideas globally. There is no direct negative impact associated with this insight.

However, it's crucial for TED to ensure that the quality and relevance of content in different languages are maintained to uphold its reputation for high-quality talks.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Set the number of top events to display
top_n_events = 10  # You can adjust this number as needed

# Group the data by 'event' and count the number of talks in each event
event_counts = dataset['event'].value_counts()

# Sort the events by the number of talks in descending order
event_counts = event_counts.sort_values(ascending=False)

# Select the top N events
top_event_counts = event_counts.head(top_n_events)

# Create a bar chart to visualize the distribution of the top events
plt.figure(figsize=(12, 8))
top_event_counts.plot(kind='bar', color='lightblue')
plt.title(f'Top {top_n_events} TED Events/Conferences by Number of Talks')
plt.xlabel('Event/Conference')
plt.ylabel('Number of Talks')

# Show the chart
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Frequency Representation: A bar chart is ideal when you want to represent the frequency or count of occurrences for different categories (in this case, TED events or conferences). Each bar represents a specific event, and its height corresponds to the number of talks associated with that event.

Comparative Analysis: A bar chart allows for easy visual comparison of the number of talks across different events. Viewers can quickly identify which events have hosted the most talks and compare the distribution.

Top N Selection: By plotting the top N events with the most talks, the chart simplifies the visualization and focuses on the most significant contributors, making it easier to identify trends.

Ranking: The ordering of bars in descending order allows for clear ranking, making it evident which events have the highest and lowest talk counts.

Readability: Bar charts are highly readable and straightforward, making them suitable for conveying the distribution of discrete categories to a wide audience.

##### 2. What is/are the insight(s) found from the chart?

Insight from the Chart:

Event Popularity: The bar chart reveals the popularity and significance of various TED events and conferences. Some events have hosted a considerably higher number of talks than others, indicating their importance within the TED community.

Top Events: By focusing on the top events with the most talks, the chart identifies the key events that have contributed significantly to TED's content. This insight is valuable for event organizers and audiences interested in specific TED events.

Diversity: While certain events stand out in terms of talk count, the presence of multiple events with substantial talk contributions suggests TED's commitment to diversity and the exploration of a wide range of topics.

Content Curation: The distribution of talks by event reflects TED's content curation strategy. Different events may have specific themes or areas of focus, and this is evident in the distribution of talks across events.

Opportunities for Growth: The chart can highlight events with fewer talks, potentially indicating opportunities for growth and expansion. TED may consider increasing the number of talks at events that show potential.

Event Influence: The chart showcases the influence of certain events within the TED community and the impact they have on the dissemination of ideas and knowledge.

In summary, the bar chart provides insights into the distribution of TED talks across different events, offering visibility into the popularity of specific events, diversity in content, opportunities for growth, and the impact of events on TED's mission of sharing innovative ideas. These insights can inform content curation, event planning, and audience engagement strategies.
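To put a number on how concentrated the talks are in the leading events, the share of all talks covered by the top N events can be computed from the same value counts. The following is a minimal sketch on a small made-up `event` column; the sample values and the helper name `top_event_share` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def top_event_share(events: pd.Series, top_n: int) -> float:
    """Return the fraction of all talks hosted by the top_n most frequent events."""
    counts = events.value_counts()  # value_counts sorts in descending order by default
    return counts.head(top_n).sum() / counts.sum()

# Hypothetical sample data, not taken from the real dataset
sample_events = pd.Series([
    "TED2010", "TED2010", "TED2010",
    "TEDGlobal 2011", "TEDGlobal 2011",
    "TEDxExample",
])
share = top_event_share(sample_events, top_n=2)
print(f"Top 2 events cover {share:.0%} of the sample talks")
```

On the real data this would be called as `top_event_share(dataset['event'], top_n_events)`, assuming `dataset` and `top_n_events` are defined as in the chart code.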

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

preferences and interests of specific audiences. Positive business outcomes include higher viewership and audience satisfaction.

Event Promotion: TED can leverage insights about the most popular events to promote and market those events more effectively. This can lead to increased event attendance, sponsorships, and revenue generation.

Sponsorship and Partnerships: Events with high talk counts can attract more sponsors and potential partners, leading to revenue opportunities and business growth.

Audience Engagement: By focusing on top events and tailoring content for these events, TED can enhance audience engagement and encourage event attendees to explore other TED content, contributing to a positive business impact.

Event Growth Opportunities: The chart can help identify events with fewer talks, providing opportunities for growth. Expanding the number of talks at these events can lead to increased audience reach and event attendance, benefiting TED's business.

Content Monetization: Popular events with a substantial talk count can be monetized more effectively through ads, sponsorships, and memberships, enhancing revenue potential.

No Negative Growth Insights:

The insights from the distribution of talks by event don't inherently lead to negative growth. Instead, they provide valuable information for optimizing content curation, event promotion, and audience engagement. There are no direct negative impacts associated with this insight.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Create a scatter plot to visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(dataset['duration'], dataset['views'], alpha=0.5, color='blue')
plt.title('Relationship between TED Talk Duration and Views')
plt.xlabel('Duration (seconds)')
plt.ylabel('Number of Views')

# Show the chart
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Correlation Assessment: A scatter plot is an effective choice when you want to assess the relationship or correlation between two numeric variables, such as talk duration and the number of views. It allows for a visual examination of how changes in one variable relate to changes in another.

Continuity Representation: Scatter plots are ideal for displaying continuous data, which is the case for both talk duration (measured in seconds) and view counts. They help viewers understand how these continuous variables interact.

Pattern Identification: Scatter plots enable the identification of patterns, trends, clusters, or outliers in the data. This can provide insights into whether there is a relationship between the length of a talk and its popularity.

Quantitative Assessment: Viewers can quantitatively assess the data points' distribution, dispersion, and any potential linear or non-linear relationships. This is particularly important when exploring variables that may impact user engagement, such as the duration of content and the number of views.

Visual Clarity: Scatter plots are easy to understand and interpret, making them a straightforward choice for presenting the relationship between two numeric variables.

In summary, a scatter plot is a suitable choice for visualizing the relationship between the duration of TED talks and the number of views because it facilitates the exploration of correlations, pattern identification, and quantitative assessment, which are important for understanding user engagement with content.

##### 2. What is/are the insight(s) found from the chart?

Insight from the Chart:

No Clear Linear Relationship: The scatter plot shows a dispersed distribution of data points, indicating that there is no clear linear relationship between the duration of TED talks and the number of views. In other words, the length of a talk does not directly determine its popularity in a linear fashion.

Varied Talk Lengths: TED talks vary in duration, with some being relatively short and others longer. Despite the variation, talks of different lengths receive a wide range of view counts, suggesting that viewership is influenced by other factors beyond talk duration.

Popularity of Short Talks: While there is no strict linear correlation, the scatter plot shows that shorter talks also receive a significant number of views. Some of them have become highly popular, demonstrating that concise content can resonate with audiences.

Diversity in Content: The plot reflects the diversity of TED content, with talks of different durations appealing to various audiences. TED's commitment to diverse topics and ideas is evident in the range of talk durations and their associated views.

Audience Engagement: The lack of a linear correlation suggests that audience engagement is influenced by factors beyond talk duration, such as topic, speaker, and the presentation style. This insight underscores the importance of captivating content and effective communication.

Opportunities for Impact: TED can continue to leverage the diversity of talk durations to reach a wide range of audiences. By offering both short and long talks, TED can cater to viewers with different preferences and optimize content delivery.
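The visual impression of "no clear linear relationship" can be backed with correlation coefficients. The sketch below computes Pearson correlation (linear association) and Spearman correlation (monotonic association, obtained here as the Pearson correlation of the ranks). The sample numbers and the helper name `duration_view_correlations` are hypothetical; only the column names `duration` and `views` come from the notebook.

```python
import pandas as pd

def duration_view_correlations(df: pd.DataFrame) -> dict:
    """Pearson and Spearman correlation between talk duration and views."""
    duration, views = df["duration"], df["views"]
    return {
        # Linear association between the raw values
        "pearson": duration.corr(views),
        # Spearman = Pearson correlation computed on the ranks
        "spearman": duration.rank().corr(views.rank()),
    }

# Hypothetical sample data, not taken from the real dataset
sample = pd.DataFrame({
    "duration": [300, 600, 900, 1200, 1500],
    "views": [50_000, 20_000, 90_000, 10_000, 60_000],
})
result = duration_view_correlations(sample)
print(result)
```

Values close to zero for both coefficients would support the reading of the scatter plot. A Spearman value clearly larger in magnitude than the Pearson value would instead hint at a monotonic but non-linear relationship.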

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Content Diversity: The diversity of TED talk durations, as highlighted in the scatter plot, is an asset. It allows TED to cater to a wide range of audience preferences. Viewers can choose talks that fit their available time and interests. This diversity can lead to increased viewership and audience satisfaction, contributing to a positive impact on business.

Audience Engagement: By recognizing that the length of a talk is not the sole factor driving viewership, TED can focus on the quality of content, speaker expertise, and the impact of ideas presented. This can enhance audience engagement and loyalty, leading to positive business outcomes.

Content Monetization: The diverse content offerings, as revealed in the plot, can be leveraged for content monetization. Different talks can be monetized through various models, such as ads, sponsorships, and memberships, potentially increasing revenue.

Customized Content: TED can continue to offer talks of varying durations and cater to different audience segments. By providing customized content experiences, TED can maintain and expand its global audience, furthering its mission.

No Negative Growth Insights:

The insights from the scatter plot do not lead to negative growth. Instead, they provide valuable information for optimizing content diversity, audience engagement, and content monetization. There are no direct negative impacts associated with this insight.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
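# Parse the stringified topic lists and count how many talks carry each topic
# (assumes 'topics' holds strings such as "['science', 'design']", as parsed in Chart 9)
import ast
topic_counts = dataset['topics'].apply(ast.literal_eval).explode().value_counts()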

# Select the top N topics for better readability
top_topics = topic_counts.head(10)

# Create a pie chart to visualize the distribution of talks by topic
plt.figure(figsize=(10, 10))
plt.pie(top_topics, labels=top_topics.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of TED Talks by Topic')
plt.axis('equal')  # Equal aspect ratio ensures a circular pie chart

# Show the chart
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Composition Representation: A pie chart gives a clear and intuitive view of how a whole is divided into its parts. Here the whole is the set of topic tags belonging to the 10 most frequent topics, and each slice is one topic's share of those tags. This aids in understanding the relative prevalence of the leading themes.

Percentage Visualization: Pie charts readily show the proportion of each part relative to the whole. Because a talk can carry several topic tags and only the top 10 topics are plotted, the percentages are shares of the top-10 tag count rather than shares of all talks. They still make it easy to see which of the leading themes dominate.

Limited Categories: A pie chart works best with a limited number of categories, which is why the chart is restricted to the 10 most frequent topics. It enables viewers to quickly grasp the main themes without excessive complexity.

Comparison: Viewers can easily compare the sizes of different topics in the pie chart, identifying which topics have a significant presence and which are less prominent. This can aid in identifying trends and areas of interest.

Engagement with Themes: The chart can engage viewers by highlighting the diversity of TED content. It allows for quick insights into the wide range of ideas and subjects covered in TED talks.
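Because a talk can carry several topic tags, a useful complement to the pie chart is the share of talks that carry each topic. The sketch below assumes that each `topics` value is a stringified Python list (as the parsing in Chart 9 suggests). The sample strings and the helper name `topic_shares` are hypothetical.

```python
import ast
from collections import Counter

def topic_shares(topic_strings, top_n=10):
    """Return [(topic, share_of_talks)] for the top_n most frequent topics."""
    # One set of tags per talk, so a repeated tag is counted once per talk
    parsed = [set(ast.literal_eval(s)) for s in topic_strings]
    counts = Counter(tag for tags in parsed for tag in tags)
    n_talks = len(parsed)
    return [(tag, count / n_talks) for tag, count in counts.most_common(top_n)]

# Hypothetical sample data, not taken from the real dataset
sample_topics = [
    "['technology', 'science']",
    "['technology', 'design']",
    "['science']",
    "['technology']",
]
shares = topic_shares(sample_topics, top_n=2)
print(shares)
```

These shares need not add up to 100%, since one talk can appear under several topics. That is the main difference from the slice percentages of the pie chart.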

##### 2. What is/are the insight(s) found from the chart?

Insight from the Chart:

Diversity of Topics: The pie chart highlights the wide diversity of topics covered in TED talks. TED is not limited to a specific subject but rather encompasses a broad range of themes and ideas, as evidenced by the numerous topics represented in the chart.

Prominent Themes: While many topics are present, a few themes stand out as prominently featured. These dominant themes capture a significant portion of the talks, indicating that certain subjects resonate more with TED's audience.

Balanced Distribution: The chart also shows a relatively balanced distribution of talks across various topics. While some topics have a larger presence, there is no extreme skew toward a single theme, reinforcing TED's commitment to diversity.

Popular Areas of Interest: Viewers can identify popular areas of interest by looking at the larger slices of the pie. These topics may have a broader appeal and attract more viewers, reflecting trends in audience engagement.

Opportunities for Exploration: Smaller slices represent niche topics that, while less prevalent, still have a place in TED's content library. These areas offer opportunities for viewers to explore unique and specialized subjects.

Alignment with TED's Mission: The diverse range of topics aligns with TED's mission to spread innovative and inspiring ideas across various fields, ensuring that the platform continues to deliver on its core purpose.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Diverse Audience Engagement: The diversity of topics covered in TED talks ensures that TED can engage a wide and varied audience. This diversity aligns with TED's mission to share innovative and inspiring ideas across various fields, and it can lead to positive business outcomes by attracting a broad audience.

Viewer Satisfaction: By offering a diverse range of topics, TED can cater to the interests of a broad spectrum of viewers. Viewer satisfaction and engagement are essential for the success of a content platform, and a diverse range of topics can contribute positively to this.

Content Monetization: The diversity of content themes opens up opportunities for content monetization. Different topics may appeal to different sponsorships, partnerships, or targeted advertising, potentially increasing revenue.

Global Reach: TED's diverse range of topics is not limited by geographic or cultural boundaries. This can help TED expand its global reach and impact, fostering positive growth.

Brand Reputation: A platform known for its diversity and inclusivity in content can enhance its brand reputation and positive image in the eyes of viewers and stakeholders.

No Negative Growth Insights:

The insights from the pie chart do not lead to negative growth. Instead, they provide valuable information for optimizing content diversity, audience engagement, content monetization, and brand reputation. There are no direct negative impacts associated with this insight.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Extract the year from the 'recorded_date' column
dataset['recorded_year'] = dataset['recorded_date'].dt.year

# Group the data by 'recorded_year' and count the number of talks recorded each year
recorded_year_counts = dataset['recorded_year'].value_counts().sort_index()

# Create a bar chart to visualize the distribution of talks by the year they were recorded
plt.figure(figsize=(12, 6))
recorded_year_counts.plot(kind='bar', color='skyblue')
plt.title('Distribution of TED Talks by Recorded Year')
plt.xlabel('Recorded Year')
plt.ylabel('Number of Talks')

# Show the chart
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Temporal Trends: A bar chart is well-suited for showing how the number of talks recorded each year has evolved over time. It allows viewers to easily identify trends and patterns in the data.

Yearly Distribution: This chart can provide insights into the growth and fluctuation of TED talks over the years. It's essential for understanding the historical development of the TED platform.

Comparative Analysis: Viewers can compare the number of talks recorded in different years, identifying peaks, declines, or stability in content creation. This facilitates data-driven decision-making.

Engagement and Relevance: Observing the distribution of talks by recorded year can help assess the audience's ongoing engagement with older talks and the relevance of past content.

Content Planning: TED can use this information to plan content for future years, adjusting topics or themes based on historical trends.

Audience Retention: Understanding how older talks continue to attract views can inform strategies to retain and engage the audience over time.

##### 2. What is/are the insight(s) found from the chart?

Insight from the Chart:

Historical Growth: The chart shows that TED talks have experienced significant growth over the years. The number of talks recorded annually has generally increased, indicating TED's expansion as a platform for sharing ideas.

Yearly Peaks: The chart reveals specific years with noticeable peaks in the number of recorded talks. These peaks may be associated with significant events, partnerships, or content strategies that led to a surge in content creation.

Steady Growth: Despite fluctuations, there is a steady upward trend in the number of talks recorded. This suggests TED's continuous commitment to its mission of spreading innovative and inspiring ideas.

Sustainability: The presence of talks recorded in earlier years, such as the early 2000s, shows that TED's older content remains relevant and continues to attract viewers. This underscores the sustainability and enduring appeal of TED's content.

Data-Driven Decision-Making: TED can use this historical data to make informed decisions about content planning, partnerships, and audience engagement strategies. Insights from the chart can guide future initiatives.

Content Variety: The chart counts talks per year and carries no topic information, so it cannot confirm topic diversity on its own. The topic charts (Chart 6 and Chart 9) are the better evidence that TED appeals to a broad audience with different interests.
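The growth described above can be quantified with a year-over-year percentage change on the yearly talk counts. This is a minimal sketch on made-up dates. The helper name `yearly_counts_with_change` is hypothetical, and the input is assumed to be a datetime column like `recorded_date`.

```python
import pandas as pd

def yearly_counts_with_change(recorded_dates: pd.Series) -> pd.DataFrame:
    """Talks per recorded year plus the relative change versus the previous year."""
    counts = recorded_dates.dt.year.value_counts().sort_index()
    return pd.DataFrame({"talks": counts, "yoy_change": counts.pct_change()})

# Hypothetical sample data, not taken from the real dataset
sample_dates = pd.Series(pd.to_datetime([
    "2006-02-01", "2006-02-02",
    "2007-03-01", "2007-03-02", "2007-03-03",
    "2008-02-01", "2008-02-02", "2008-02-03",
]))
summary = yearly_counts_with_change(sample_dates)
print(summary)
```

The first year has no predecessor, so its `yoy_change` is `NaN`. Years with large positive values correspond to the peaks visible in the bar chart.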

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Data-Driven Decision-Making: The insights derived from the chart provide TED with historical data on the growth and trends in content creation. This data can be instrumental in shaping content planning, audience engagement strategies, and resource allocation. Informed decisions based on data are more likely to yield positive outcomes.

Content Sustainability: The chart highlights the enduring appeal and sustainability of older content. Talks recorded in previous years continue to attract viewers, contributing positively to audience retention and engagement. This indicates that TED's content has a lasting impact.

Content Diversity: A growing number of talks per year gives TED room to cover a diverse range of topics, although this chart does not break talks down by topic. A diverse content library appeals to a broad and varied audience, fostering viewer satisfaction and engagement.

Potential Partnerships: TED can leverage the insights to identify peak years in content creation and explore partnerships, sponsorships, or promotions during those times. This can lead to increased visibility and revenue opportunities.

No Negative Growth Insights:

The insights from the chart do not lead to negative growth. Instead, they provide valuable historical data that can be harnessed to make informed decisions and enhance content planning, audience engagement, and partnerships. There are no direct negative impacts associated with this insight.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Extract the year and month from the 'recorded_date' column
dataset['recorded_year_month'] = dataset['recorded_date'].dt.to_period('M')

# Group the data by 'recorded_year_month' and calculate the total views per month
views_by_month = dataset.groupby('recorded_year_month')['views'].sum()

# Create a line chart to visualize the trends in TED talk views over time
plt.figure(figsize=(12, 6))
views_by_month.plot(kind='line', color='royalblue', marker='o')
plt.title('Trends in TED Talk Views Over Time')
plt.xlabel('Year-Month')
plt.ylabel('Total Views')

# Show the chart
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Temporal Trends: A line chart is well-suited for displaying how a numeric variable (in this case, TED talk views) changes over time. It is ideal for revealing temporal patterns, trends, and fluctuations.

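Viewership Growth: The chart can show whether talks recorded in later periods account for more or fewer views than talks recorded earlier. This insight is valuable for assessing the platform's overall impact and engagement.

Granular Detail: By using the recorded year and month as the time unit, the line chart provides granular insights into monthly trends. Each point is the sum of the view counts of all talks recorded in that month, so it reflects both how many talks were recorded and how popular they became, not the number of views that took place in that month. This level of detail can reveal seasonal patterns or events that influence viewership.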

Visualizing Data Trends: Line charts excel at visualizing data trends and making them easily interpretable. They connect data points over time, allowing viewers to see the trajectory of viewership.

Comparative Analysis: The line chart can facilitate comparative analysis by showing multiple lines for different years or specific events. This enables the identification of changes or anomalies in viewership.

Audience Engagement: Understanding how viewership has changed over time is crucial for assessing audience engagement and the long-term impact of TED's content.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

Steady Growth: The line chart demonstrates a consistent upward trend in TED talk views over time. This indicates that TED's online presence and viewership have grown steadily, reflecting the enduring appeal of its content.

Seasonal Fluctuations: The chart reveals periodic fluctuations in viewership. These fluctuations might be related to factors such as the timing of major events, conferences, or the release of specific talks. Identifying these patterns can help TED strategically plan content releases.

Impactful Talks: Sudden spikes in the line chart represent talks that have attracted a significantly higher number of views. These spikes can be correlated with specific talks that resonated strongly with the audience, contributing to TED's growth.

Long-Term Relevance: The presence of older data points in the chart, which continue to contribute to views, highlights the long-term relevance of TED's content. It indicates that viewers continue to discover and engage with older talks.

Data-Driven Decision-Making: TED can use the insights from the chart to make data-driven decisions regarding content planning, marketing strategies, and partnerships. Recognizing patterns in viewership trends allows for targeted efforts.

Content Strategy: The chart can guide TED in shaping its content strategy by identifying the most and least successful periods in terms of viewership. This information can influence the selection of topics and speakers.
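Since the plotted totals mix the number of talks recorded in a month with the popularity of those talks, it can help to look at the talk count and the mean views per talk alongside the total. The sketch below uses made-up values. The helper name `monthly_view_summary` is hypothetical, while the column names `recorded_date` and `views` follow the notebook.

```python
import pandas as pd

def monthly_view_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per recording month: number of talks, total views and mean views per talk."""
    month = df["recorded_date"].dt.to_period("M")
    return df.groupby(month)["views"].agg(
        talks="count", total_views="sum", mean_views="mean"
    )

# Hypothetical sample data, not taken from the real dataset
sample = pd.DataFrame({
    "recorded_date": pd.to_datetime(["2010-01-05", "2010-01-20", "2010-02-10"]),
    "views": [100, 300, 500],
})
summary = monthly_view_summary(sample)
print(summary)
```

A spike in `total_views` that comes with a high `talks` count points to a busy recording month such as a main conference. A spike in `mean_views` with few talks points to individual talks that performed unusually well.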

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Steady Growth: The chart illustrates a consistent upward trend in TED talk views, indicating that TED's online presence is flourishing. This growth signifies a positive impact, as it reflects an expanding and engaged audience.

Data-Driven Decision-Making: The insights from the chart provide TED with data that can guide content planning, marketing strategies, and resource allocation. Informed decisions based on viewership trends are more likely to yield positive results.

Seasonal Fluctuations: Understanding the seasonal fluctuations in viewership allows TED to strategically plan content releases during periods of high engagement. This can lead to increased viewership during specific times of the year.

Impactful Talks: Identifying talks that attracted significantly higher views can inform content selection and speaker invitations. Replicating the success of impactful talks can contribute positively to viewership.

Long-Term Relevance: The chart highlights the enduring relevance of older talks, indicating that TED's content has a lasting impact. This is a positive reflection of TED's mission to share valuable ideas.

Content Strategy: The insights from the chart can guide TED's content strategy, enabling them to focus on topics and speakers that resonate strongly with the audience.

No Negative Growth Insights:

The insights from the chart do not lead to negative growth. Instead, they provide valuable historical data that can be leveraged to enhance content planning, marketing, and partnerships. There are no direct negative impacts associated with this insight.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Extract and flatten the list of topics from the 'topics' column
# (ast.literal_eval safely parses the stringified lists, unlike eval)
import ast
import collections
all_topics = [topic for topics_list in dataset['topics'] for topic in ast.literal_eval(topics_list)]

# Take the 10 most common topics together with their counts
top_topic_counts = collections.Counter(all_topics).most_common(10)
top_topics = [topic for topic, count in top_topic_counts]
topic_counts = [count for topic, count in top_topic_counts]

# Create a bar chart to visualize the distribution of top TED talk topics
plt.figure(figsize=(12, 6))
plt.barh(top_topics, topic_counts, color='royalblue')
plt.gca().invert_yaxis()  # Put the most common topic at the top
plt.title('Top 10 TED Talk Topics')
plt.xlabel('Number of Talks')
plt.ylabel('Topic')

# Show the chart
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar Chart for Topic Distribution:

Comparative Analysis: A bar chart is ideal for displaying and comparing the frequencies of different categories (in this case, TED talk topics). It allows for a straightforward visual comparison of the number of talks in each category.

Topical Insights: The chart provides insights into the most popular TED talk topics. It helps identify which subjects have been most frequently addressed, providing valuable information for content planning.

Ranked Representation: By displaying the topics in descending order of frequency, the chart makes it easy to identify the most common and the less common subjects. This ranked representation is valuable for identifying trends.

Topical Diversity: TED is known for its diverse range of subjects, and the chart can highlight this diversity by showcasing a wide array of topics. It can also emphasize TED's commitment to exploring various areas of knowledge and expertise.

Visual Clarity: The bar chart offers clarity in presenting the data, making it accessible to a broad audience. Each horizontal bar represents a topic, and the length of the bar corresponds to the number of talks on that topic.

Data-Driven Insights: The chart can guide TED in choosing topics for future talks, addressing gaps in coverage, and maintaining a balance of subject matter.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

Diversity of Topics: The chart showcases the diverse range of topics covered in TED talks. This diversity reflects TED's commitment to exploring a wide array of subjects and areas of expertise.

Popular Themes: The chart highlights the most frequently addressed topics in TED talks. These popular themes include technology, science, and creativity, which have a significant presence in TED's content.

Society and Culture: Subjects related to society, culture, and human behavior also hold a prominent place in TED talks. This suggests TED's dedication to addressing issues that impact society.

Education and Learning: The presence of education and learning as a top topic category underscores TED's role as an educational platform that aims to share knowledge and inspire learning.

Health and Well-being: Health and well-being topics are also well-represented, reflecting the interest in personal and societal well-being.

Content Planning: The insights from the chart can guide TED in content planning. TED can identify areas with a high demand for talks, as well as subject areas that may require more coverage.

Balancing Themes: TED can use these insights to ensure a balanced mix of themes, addressing both popular and less common subjects to cater to a diverse audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Diverse Content Attraction: The chart showcases the diverse range of topics covered in TED talks, indicating that TED appeals to a broad audience with varied interests. This diversity is a positive factor as it broadens the platform's reach and engagement.

Popular Themes: Identifying the most frequently addressed topics, such as technology, science, and creativity, can help TED understand the areas that resonate most with the audience. This knowledge enables TED to continue producing content on popular themes, driving more views and engagement.

Addressing Societal Issues: The presence of topics related to society and culture reflects TED's commitment to addressing important societal issues. This aligns with TED's mission to share ideas and promote positive change.

Education and Learning: TED's focus on education and learning as a prominent topic category supports its role as an educational platform. This is a positive impact as it attracts individuals seeking knowledge and personal development.

Health and Well-being: The inclusion of health and well-being topics is beneficial as it caters to viewers interested in personal and societal well-being, potentially driving engagement and positive impact in these areas.

Content Planning: The insights from the chart can guide TED in content planning, helping them focus on topics with high demand and adjust their content strategy accordingly.

No Negative Growth Insights:

The insights from the chart do not lead to negative growth. Instead, they provide valuable data for TED to refine its content strategy and maintain a balance of subject matter. There are no direct negative impacts associated with these insights.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Create a histogram to visualize the distribution of TED talk durations
plt.figure(figsize=(10, 6))
plt.hist(dataset['duration'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of TED Talk Durations')
plt.xlabel('Duration (seconds)')
plt.ylabel('Number of Talks')

# Show the chart
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

##### 1. Why did you pick the specific chart?

Histogram for TED Talk Durations:

Distribution Analysis: A histogram is well-suited for analyzing the distribution of continuous data, such as the duration of TED talks. It allows you to see how talk durations are distributed across different time intervals.

Central Tendency: By examining the histogram, you can quickly identify the typical or central duration of TED talks. This provides insights into the average length of talks.

Variability: The histogram also reveals the variability in talk durations. You can see the range of durations, including any outliers or extreme values.

Audience Engagement: Understanding the distribution of talk durations is valuable for managing audience engagement. It helps ensure that TED talks are of an appropriate length to maintain viewer interest.

Content Planning: TED can use this information for content planning, ensuring a balance of talk durations that align with audience preferences.

Data-Driven Decisions: TED can make data-driven decisions about setting guidelines for talk durations based on the insights gained from the histogram.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

Varied Talk Durations: The histogram reveals a wide range of TED talk durations, indicating that TED hosts talks of various lengths. This diversity in talk durations allows TED to cater to different audience preferences.

Peak at Approximately 1,000 Seconds: The histogram shows a peak in the distribution at approximately 1,000 seconds (or around 16-17 minutes). This suggests that TED has a significant number of talks with this specific duration, which may be considered an ideal or typical length for TED talks.

Longer Talks: There is a long tail on the right side of the histogram, indicating the presence of longer talks. These talks are likely to be in-depth and may cover complex subjects.

Shorter Talks: On the left side of the histogram, there are shorter talks, which are likely to be concise and focused. These shorter talks may be engaging for viewers who prefer more succinct content.

Tailored Content: TED's ability to offer talks of varying lengths allows for tailored content delivery. Viewers can choose talks based on their available time and preferences, leading to higher audience engagement.

Content Strategy: The insights from the histogram can inform TED's content strategy, helping them maintain a balance of talk durations and ensuring that they continue to meet the expectations of their diverse audience.

In summary, the histogram of TED talk durations reveals a diverse distribution of talk lengths, with a peak around 1,000 seconds. TED can use this information to tailor its content and maintain a balance of talk durations, meeting the preferences of its audience and optimizing audience engagement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Audience Engagement: The ability to offer talks of varying durations allows TED to cater to a broad and diverse audience. This flexibility enhances audience engagement, as viewers can choose talks that align with their available time and preferences.

Content Planning: TED can use the insights from the histogram to fine-tune its content planning. They can ensure a balanced mix of talk durations, meeting the expectations of their audience and providing a variety of content.

Viewer Satisfaction: By offering talks of different lengths, TED can increase viewer satisfaction. Some viewers may prefer shorter, concise talks, while others may seek more in-depth, longer discussions. Meeting these preferences positively impacts user satisfaction.

Content Relevance: TED can optimize content relevance by tailoring talk durations to specific subjects. Some topics may require longer, more detailed discussions, while others can be effectively conveyed in shorter talks.

Data-Driven Decisions: The insights enable TED to make data-driven decisions about setting guidelines for talk durations and content planning, ultimately leading to a more efficient content strategy.

There are no negative growth insights associated with the distribution of talk durations. The insights gained from the histogram support TED's mission of sharing ideas effectively and engaging its diverse audience.

In summary, the insights from the histogram of talk durations positively impact TED's content strategy, audience engagement, and viewer satisfaction. There are no insights that lead to negative growth; rather, the findings enhance TED's ability to provide relevant and engaging content to a wide range of viewers.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothetical Statement 1:
"TED talks with a longer duration receive more views than talks with shorter durations."

Hypothetical Statement 2:
"The number of comments on TED talks is significantly influenced by the native language in which the talk is delivered."

Hypothetical Statement 3:
"TED talks with more speakers receive a higher number of views compared to talks with a single speaker."

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The duration of a TED talk does not significantly impact the number of views it receives.

Alternative Hypothesis (H1): The duration of a TED talk significantly impacts the number of views it receives.

We will perform hypothesis testing to evaluate whether the duration of TED talks has a significant effect on the number of views they receive.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Define the dependent variable (views) and independent variable (duration)
y = dataset['views']
X = dataset['duration']

# Add a constant (intercept) to the independent variable
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Get the summary statistics, including the p-value
summary = model.summary()

# Extract the p-value
p_value = model.pvalues['duration']

# Print the p-value
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value for the first hypothetical statement, we performed a simple linear regression analysis. The p-value was obtained as part of this regression analysis. In a linear regression analysis, the p-value associated with the coefficient of the independent variable of interest (in this case, the duration of TED talks) is used to test the significance of the relationship between the independent variable and the dependent variable (in this case, the number of views).

So, the statistical test used to obtain the p-value was a simple linear regression analysis.

##### Why did you choose the specific statistical test?

I chose a simple linear regression analysis as the specific statistical test to evaluate the relationship between the duration of TED talks and the number of views for the following reasons:

Nature of the Relationship: Linear regression is appropriate when we want to understand the relationship between a continuous independent variable (duration) and a continuous dependent variable (views). It is commonly used to assess how changes in one variable impact changes in another variable.

Hypothesis Testing: Linear regression allows us to perform hypothesis testing on the coefficient of the independent variable. In this case, we can test whether the duration of TED talks has a significant impact on the number of views, which aligns with the stated research hypothesis.

Interpretability: Linear regression provides a clear interpretation of the relationship. The coefficient of the duration variable tells us how a one-unit change in duration affects the number of views. The p-value associated with this coefficient informs us about the significance of this relationship.

Widely Accepted: Linear regression is a well-established and widely accepted statistical technique for analyzing relationships between variables. It is a common choice for examining the influence of one variable on another in regression analysis.

Ease of Implementation: The linear regression analysis can be easily performed using statistical libraries like statsmodels in Python, making it a practical choice for testing the hypothesis.

In summary, the choice of a simple linear regression analysis was appropriate for testing the specific research hypothesis, as it allows us to assess the significance and direction of the relationship between the duration of TED talks and the number of views.
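To make the reported p-value less of a black box, the sketch below computes the slope, intercept, and slope t-statistic of a simple linear regression by hand, using only the standard library. The toy durations and view counts are hypothetical and are not taken from the TED dataset. statsmodels derives the p-value for the coefficient from this t-statistic with n - 2 degrees of freedom.

```python
import math

def slope_t_statistic(x, y):
    """Return (slope, intercept, t_statistic) for a simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y around their means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and the standard error of the slope
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / se_slope

# Hypothetical toy data: durations in seconds and view counts
durations = [300, 600, 900, 1200, 1500]
views = [1200, 1900, 3100, 3900, 5200]
slope, intercept, t_stat = slope_t_statistic(durations, views)
print(f"slope={slope:.3f}, intercept={intercept:.1f}, t={t_stat:.2f}")
```

A large absolute t-statistic corresponds to a small p-value. The sign of the slope gives the direction of the relationship, which the two-sided p-value alone does not.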

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The native language in which a TED talk is delivered does not significantly influence the number of comments it receives.

Alternative Hypothesis (H1): The native language in which a TED talk is delivered significantly influences the number of comments it receives.

We will perform hypothesis testing to assess whether the native language of TED talks has a significant effect on the number of comments they receive.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# Group the dataset by native language and collect the comment counts of each group
# (rows with missing comment counts are dropped so the test does not return NaN)
grouped_data = dataset.dropna(subset=['comments']).groupby('native_lang')['comments'].apply(list)

# Perform one-way ANOVA to test for the influence of native language on comments
f_statistic, p_value = f_oneway(*grouped_data)

# Print the p-value
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value for the second hypothetical statement, we performed a one-way Analysis of Variance (ANOVA). The p-value was obtained as part of this ANOVA analysis.

ANOVA is used to assess the statistical significance of the differences between the means of two or more groups. In this case, we used ANOVA to test whether the native language in which a TED talk is delivered significantly influences the number of comments it receives.

So, the specific statistical test used to obtain the p-value was a one-way ANOVA.

##### Why did you choose the specific statistical test?

I chose a one-way Analysis of Variance (ANOVA) as the specific statistical test to evaluate the relationship between the native language in which TED talks are delivered and the number of comments received for the following reasons:

Multiple Groups: ANOVA is appropriate when we have multiple groups (in this case, different native languages) and we want to assess whether there are significant differences among these groups in terms of a continuous dependent variable (in this case, the number of comments).

Hypothesis Testing: ANOVA allows us to perform hypothesis testing to determine whether there are significant differences between the group means. In this case, we are testing whether the native language significantly influences the number of comments, which aligns with the research hypothesis.

Comparative Analysis: A significant ANOVA result indicates only that at least one group mean differs from the others; it does not show where the differences lie. Identifying which native languages differ requires a follow-up post-hoc comparison, such as Tukey's HSD test.

Widely Accepted: ANOVA is a widely accepted and robust statistical technique for comparing group means. It is commonly used for testing hypotheses about group differences and is well-suited for this type of analysis.

Clear Interpretation: ANOVA provides a clear interpretation of the results. A low p-value suggests that there are significant differences between the groups, which is essential for understanding the influence of native language on the number of comments.

In summary, the choice of a one-way ANOVA was appropriate for testing the specific research hypothesis as it allows us to assess the significance of differences between native languages in relation to the number of comments received.
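As a rough illustration of what the one-way ANOVA computes, the sketch below derives the F-statistic by hand as the ratio of between-group to within-group variance. The language codes and comment counts are hypothetical toy values. The p-value is then read from the F distribution with (k - 1, n - k) degrees of freedom, a step left to SciPy here.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F-statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical comment counts grouped by native language
comments_by_language = {"en": [10, 12, 11], "es": [20, 22, 21], "fr": [30, 29, 31]}
f_stat = one_way_f(list(comments_by_language.values()))
print(f"F = {f_stat:.1f}")
```

When the group means are far apart relative to the spread inside each group, as in this toy data, the F-statistic is large and the p-value is small.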

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The number of speakers in a TED talk does not significantly impact the number of views it receives.

Alternative Hypothesis (H1): The number of speakers in a TED talk significantly impacts the number of views it receives.

We will perform hypothesis testing to evaluate whether the number of speakers in a TED talk has a significant effect on the number of views it receives.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Define the dependent variable (views)
y = dataset['views']

# Extract the number of speakers from 'all_speakers' by counting commas and adding 1
dataset['num_speakers'] = dataset['all_speakers'].apply(lambda x: str(x).count(',') + 1)

# Add a constant (intercept) to the independent variable
X = sm.add_constant(dataset['num_speakers'])

# Fit a linear regression model
model = sm.OLS(y, X).fit()

# Get the summary statistics, including the p-value
summary = model.summary()

# Extract the p-value by label (positional indexing such as pvalues[1] is deprecated in recent pandas)
p_value = model.pvalues['num_speakers']  # The p-value for the number of speakers variable

# Print the p-value
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

I performed a linear regression analysis to obtain the p-value. Linear regression is used to analyze the relationship between a dependent variable and one or more independent variables. In this case, the dependent variable is the number of views, and the independent variable is the number of speakers in a TED talk. The p-value obtained from the regression analysis tests the null hypothesis that the number of speakers does not significantly impact the number of views.

The specific statistical test used here is the t-test for the coefficient of the independent variable (number of speakers). The p-value associated with this test helps determine whether the relationship is statistically significant.

##### Why did you choose the specific statistical test?

I chose a linear regression analysis as the specific statistical test for several reasons:

Relationship Analysis: Linear regression is commonly used to analyze the relationship between a dependent variable and one or more independent variables. In this case, we want to understand whether the number of speakers (independent variable) has a significant impact on the number of views (dependent variable).

Numeric Variables: The number of views is a numeric outcome, and the number of speakers is a discrete count that can be entered as a numeric predictor, which makes linear regression a workable choice.

Interpretability: Linear regression provides easily interpretable coefficients, allowing us to quantify the impact of the number of speakers on the number of views.

P-Value: The p-value associated with the coefficient of the number of speakers helps us assess the statistical significance of the relationship. A low p-value indicates a significant impact, while a high p-value suggests that the relationship is not statistically significant.

Widely Accepted: Linear regression is a widely accepted and commonly used statistical method in data analysis and hypothesis testing.

Overall, linear regression is a suitable and interpretable choice for testing the hypothesis that the number of speakers significantly impacts the number of views in the given dataset.
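The predictor in this test is derived by counting commas in 'all_speakers', which over-counts whenever a name itself contains a comma. The sketch below shows one possible alternative, assuming each entry is a dict-like string such as "{0: 'Speaker A', 1: 'Speaker B'}". That format is an assumption about the dataset, and `count_speakers` is a hypothetical helper rather than part of the original notebook.

```python
import ast

def count_speakers(raw):
    """Count speakers in a dict-like string; fall back to 1 when parsing fails."""
    try:
        parsed = ast.literal_eval(str(raw))
    except (ValueError, SyntaxError):
        # Missing or free-text entries are treated as a single speaker,
        # which matches what the comma-counting approach does for them
        return 1
    return len(parsed) if isinstance(parsed, (dict, list, tuple)) and parsed else 1

print(count_speakers("{0: 'Speaker A', 1: 'Speaker B'}"))
print(count_speakers("{0: 'Smith, Jr.'}"))
```

Under that assumption, it could be applied with `dataset['all_speakers'].apply(count_speakers)` in place of the comma count.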

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
from sklearn.impute import SimpleImputer

# Impute missing values for numeric columns with the column mean
numeric_cols = ['comments', 'duration']
for col in numeric_cols:
    imputer = SimpleImputer(strategy='mean')
    # fit_transform returns a 2-D array; flatten it before assigning to a single column
    dataset[col] = imputer.fit_transform(dataset[[col]]).ravel()

# Impute missing values for categorical columns (using a constant value as an example)
categorical_cols = ['about_speakers', 'occupations']
for col in categorical_cols:
    imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
    dataset[col] = imputer.fit_transform(dataset[[col]]).ravel()

# Display the count of missing values after imputation
print("\nMissing Values Count after Imputation:")
print(dataset.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

I used two common missing value imputation techniques:

Mean Imputation for Numeric Columns:

Technique: I used the mean imputation technique for numeric columns, such as 'comments' and 'duration'.

Reasoning: Mean imputation is a straightforward method where missing values are replaced with the mean value of the available data. This method is appropriate when missing values are missing completely at random, and the assumption is that the mean is a representative value for the missing entries.

Constant Imputation for Categorical Columns:

Technique: I used constant imputation for categorical columns, such as 'about_speakers' and 'occupations', replacing missing values with the constant value 'Unknown'.

Reasoning: For categorical variables, using a constant value (like 'Unknown') can be suitable when missing values may convey meaningful information or when there isn't a clear way to impute missing categorical values. This approach ensures that missing values are distinct from the existing categories and can be handled appropriately during analysis.

It's essential to choose imputation techniques based on the nature of the data and the underlying reasons for missingness. Other common imputation techniques include:

Median Imputation for Skewed Distributions: Suitable for numeric data with a skewed distribution.

Mode Imputation for Categorical Data: Appropriate for categorical variables when one category is dominant.

Regression Imputation: Predicting missing values using regression models based on other variables.

K-Nearest Neighbors (KNN) Imputation: Imputing missing values based on similarity to neighboring data points.

The choice of imputation technique depends on the characteristics of your dataset, the nature of the missing data, and the goals of your analysis. It's often advisable to assess the impact of imputation on your results and consider multiple imputation methods if needed.
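The difference between mean and median imputation on skewed data can be seen in a small standard-library sketch. The comment counts below are hypothetical, and `impute` is an illustrative helper, not a scikit-learn function.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical comment counts with one missing entry and one extreme value
comments = [10, 12, None, 11, 500]
mean_filled = impute(comments, "mean")
median_filled = impute(comments, "median")
print("mean fill:", mean_filled[2])
print("median fill:", median_filled[2])
```

A single extreme value pulls the mean far away from the typical talk, while the median stays close to it. This is why median imputation is often preferred for skewed columns such as comment counts.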

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Display basic statistics to identify potential outliers
numeric_cols = ['comments', 'duration', 'views']
print("\nBasic Statistics for Numeric Columns:")
print(dataset[numeric_cols].describe())

# Visualize potential outliers using box plots
plt.figure(figsize=(12, 6))
for i, col in enumerate(numeric_cols, start=1):
    plt.subplot(1, 3, i)
    sns.boxplot(x=dataset[col])
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()

# Outlier Treatment: Winsorizing (capping) extreme values
# Winsorize 'comments', 'duration', and 'views' columns at the 1% and 99% percentiles
for col in numeric_cols:
    lower_limit = dataset[col].quantile(0.01)
    upper_limit = dataset[col].quantile(0.99)
    dataset[col] = np.where(dataset[col] < lower_limit, lower_limit, dataset[col])
    dataset[col] = np.where(dataset[col] > upper_limit, upper_limit, dataset[col])

# Display basic statistics after outlier treatment
print("\nBasic Statistics after Outlier Treatment:")
print(dataset[numeric_cols].describe())

# Visualize box plots after outlier treatment
plt.figure(figsize=(12, 6))
for i, col in enumerate(numeric_cols, start=1):
    plt.subplot(1, 3, i)
    sns.boxplot(x=dataset[col])
    plt.title(f'Box Plot of {col} (After Treatment)')
plt.tight_layout()
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the Winsorizing technique for outlier treatment. Let me explain the technique and provide some context:

Winsorizing:

Technique: Winsorizing involves capping extreme values at a specified percentile, effectively reducing the impact of outliers. In the code, I Winsorized the 'comments', 'duration', and 'views' columns at the 1% and 99% percentiles.

Reasoning: Winsorizing is a robust method that addresses the impact of extreme values without removing them entirely. By setting a threshold at the tails of the distribution, Winsorizing retains information about the distribution's general shape while mitigating the influence of outliers.

Why Winsorizing?

Winsorizing is suitable when outliers may have a disproportionate impact on statistical measures or machine learning models. It provides a balance between addressing outliers and preserving the overall distribution of the data. In scenarios where extreme values may be influential but removing them entirely could result in information loss, Winsorizing is a conservative approach.

Other Outlier Treatment Techniques (Not Used in the Code):

Trimming:

Technique: Removing a certain percentage of extreme values from both ends of the distribution.

Consideration: Trimming is straightforward but can lead to information loss, especially if extreme values carry meaningful information.

Transformation:

Technique: Applying mathematical transformations (e.g., log transformation) to reduce the impact of extreme values.

Consideration: Transformations may be effective for certain types of data and distributions but may not be universally applicable.

Statistical Tests:

Technique: Using statistical tests to identify and remove outliers based on significance.

Consideration: This approach requires assumptions about the distribution and may be sensitive to the chosen statistical test.
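The capping step can be sketched without any libraries, which makes it easier to see that values are pulled in to the percentile limits and not dropped. The view counts are hypothetical. `percentile` and `winsorize` are illustrative helpers that use linear interpolation between neighbouring sorted values.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 1]) of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def winsorize(values, lower_q=0.01, upper_q=0.99):
    """Cap values below the lower percentile and above the upper percentile."""
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]

# Hypothetical view counts: 1..100 plus one extreme talk
views = list(range(1, 101)) + [1_000_000]
capped = winsorize(views)
print("rows kept:", len(capped), "max after capping:", max(capped))
```

The number of rows is unchanged after capping; only the extreme values move to the limits.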

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
print("\nData Types:")
print(dataset.dtypes)

# Identify categorical columns
categorical_cols = dataset.select_dtypes(include=['object']).columns
print("\nCategorical Columns:")
print(categorical_cols)

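# One-hot encode categorical columns
from sklearn.preprocessing import OneHotEncoder

# Note: scikit-learn >= 1.2 uses `sparse_output`; older releases use `sparse` instead
encoder = OneHotEncoder(drop='first', sparse_output=False)  # Use drop='first' to avoid multicollinearity
encoded_cols = pd.DataFrame(encoder.fit_transform(dataset[categorical_cols]),
                            columns=encoder.get_feature_names_out(categorical_cols),
                            index=dataset.index)  # keep the index aligned for the concat below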

# Concatenate the encoded columns with the original DataFrame
dataset_encoded = pd.concat([dataset, encoded_cols], axis=1)

# Drop the original categorical columns
dataset_encoded.drop(categorical_cols, axis=1, inplace=True)

# Display the first few rows of the encoded DataFrame
print("\nEncoded DataFrame:")
print(dataset_encoded.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used one-hot encoding as the categorical encoding technique. Let me explain this technique and briefly discuss other common categorical encoding methods:

One-Hot Encoding:

Technique: One-hot encoding is a binary encoding technique where a categorical variable with "n" distinct categories is transformed into "n" binary columns (0 or 1) representing the presence or absence of each category. In scikit-learn's OneHotEncoder, the drop='first' parameter drops the first level of each categorical variable to avoid multicollinearity, which leaves "n - 1" columns per variable.

Reasoning: One-hot encoding is suitable when categorical variables are nominal (unordered) and do not have a meaningful ordinal relationship. It helps prevent a model from assuming a natural ordering between categories that may not exist.

Other Categorical Encoding Techniques (Not Used in the Code):

Label Encoding:

Technique: Label encoding assigns a unique numerical label to each category. It essentially converts categories into integers.

Consideration: Label encoding is suitable for ordinal categorical variables where the order matters. However, for nominal variables, it may introduce unintended ordinal relationships.

Ordinal Encoding:

Technique: Ordinal encoding assigns numerical values based on the ordinal relationship between categories. It is suitable for ordinal categorical variables with a clear order.

Consideration: Care should be taken to ensure that the assigned numerical values reflect the true ordinal relationships in the data.

Frequency or Count Encoding:

Technique: Assigning each category its frequency or count in the dataset.

Consideration: Useful when the frequency of occurrence is relevant information. May lead to issues with rare categories.

Target Encoding (Mean Encoding):

Technique: Using the mean of the target variable for each category as the encoding.

Consideration: Useful for both regression and classification targets, but may lead to data leakage if the category means are computed on rows that are later used for validation or testing.
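A library-free sketch of one-hot encoding with the first category dropped may help show what the encoder produces. The language codes are hypothetical, and `one_hot` is an illustrative helper, not scikit-learn's implementation.

```python
def one_hot(values, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators for a list of categories."""
    categories = sorted(set(values))
    # Dropping the first category leaves n - 1 columns; the dropped level
    # is represented by a row of all zeros
    kept = categories[1:] if drop_first else categories
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

# Hypothetical native-language codes
langs = ["en", "es", "en", "fr"]
columns, encoded = one_hot(langs)
print(columns)
for lang, row in zip(langs, encoded):
    print(lang, row)
```

With three categories only two indicator columns remain. The dropped category acts as the baseline, which is what prevents the indicator columns from summing to a constant.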

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import contractions  # third-party package: pip install contractions

text_column = 'description'  # Column containing the text data

# Function to expand contractions in a given text
def expand_contractions(text):
    # Leave missing or non-string entries untouched
    if not isinstance(text, str):
        return text
    return contractions.fix(text)

# Apply the expansion function to the specified column
dataset[text_column] = dataset[text_column].apply(expand_contractions)

# Display the first few rows of the updated dataset
print("\nDataset with Expanded Contractions:")
print(dataset.head())

#### 2. Lower Casing

In [None]:
# Lower Casing
text_column = 'description'  # Replace with the actual column name

# Apply lowercasing to the specified column
dataset[text_column] = dataset[text_column].str.lower()

# Display the first few rows of the updated dataset
print("\nDataset with Lowercased Text:")
print(dataset.head())

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
text_column = 'description'  # Replace with the actual column name

# Function to remove punctuations from a given text
def remove_punctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Apply punctuation removal to the specified column
dataset[text_column] = dataset[text_column].apply(remove_punctuations)

# Display the first few rows of the updated dataset
print("\nDataset with Punctuations Removed:")
print(dataset.head())

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words that contain digits
import re

text_column = 'description'  # Replace with the actual column name

# Function to remove URLs from a given text
def remove_urls(text):
    # 'http\S+' already covers 'https', so two alternatives are enough
    return re.sub(r'http\S+|www\S+', '', text)

# Function to remove words containing digits from a given text
def remove_digits(text):
    return re.sub(r'\w*\d\w*', '', text)

# Apply URL and digit removal to the specified column
dataset[text_column] = dataset[text_column].apply(remove_urls)
dataset[text_column] = dataset[text_column].apply(remove_digits)

# Display the first few rows of the updated dataset
print("\nDataset with URLs and Digits Removed:")
print(dataset.head())

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
text_column = 'description'  # Replace with the actual column name

# Download stopwords and 'punkt' if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')

# Get the set of English stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a given text
def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(filtered_tokens)

# Apply stopword removal to the specified column
dataset[text_column] = dataset[text_column].apply(remove_stopwords)

# Display the first few rows of the updated dataset
print("\nDataset with Stopwords Removed:")
print(dataset.head())

In [None]:
# Remove White spaces
text_column = 'description'  # Replace with the actual column name

# Remove leading and trailing white spaces
dataset[text_column] = dataset[text_column].str.strip()

# Display the first few rows of the updated dataset
print("\nDataset with White Spaces Removed:")
print(dataset.head())

#### 6. Rephrase Text

In [None]:
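# Rephrase Text
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import sent_tokenize, word_tokenize

text_column = 'description'  # Replace with the actual column name

# Download NLTK resources if not already downloaded
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Function to rephrase a sentence: each noun is replaced by the first lemma
# of its first WordNet synset (multi-word lemmas come back joined by underscores)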
def rephrase_sentence(sentence):
    words = word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)

    new_sentence = []
    for word, pos in tagged_words:
        if pos.startswith('NN'):  # Noun
            synsets = wordnet.synsets(word, pos=wordnet.NOUN)
            if synsets:
                new_word = synsets[0].lemmas()[0].name()
                new_sentence.append(new_word)
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)

    return ' '.join(new_sentence)

# Apply rephrasing to the specified column
dataset[text_column] = dataset[text_column].apply(lambda x: ' '.join([rephrase_sentence(sentence) for sentence in sent_tokenize(x)]))

# Display the first few rows of the updated dataset
print("\nDataset with Rephrased Text:")
print(dataset.head())

#### 7. Tokenization

In [None]:
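# Tokenization
from nltk.tokenize import word_tokenize

text_column = 'description'  # Replace with the actual column name

# Function to tokenize a text
def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

# Store the tokens in a separate column so that the text column stays a string;
# the normalization step that follows only processes string entries
dataset[text_column + '_tokens'] = dataset[text_column].apply(tokenize_text)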

# Display the first few rows of the updated dataset
print("\nDataset with Tokenized Text:")
print(dataset.head())

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
text_column = 'description'  # Replace with the actual column name


# Initialize the Porter Stemmer and WordNet Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to perform stemming and lemmatization on a text
def normalize_text(text):
    if isinstance(text, str):  # Check if the entry is a string
        # Tokenize the text
        tokens = word_tokenize(text)

        # Remove stopwords
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token.lower() not in stop_words]

        # Perform stemming
        stemmed_tokens = [stemmer.stem(token) for token in tokens]

        # Perform lemmatization
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens]

        # Join the tokens back into a normalized text
        normalized_text = ' '.join(lemmatized_tokens)

        return normalized_text
    else:
        return text

# Apply normalization to the specified column
dataset[text_column] = dataset[text_column].apply(normalize_text)

# Display the first few rows of the updated dataset
print("\nDataset with Normalized Text:")
print(dataset.head())

##### Which text normalization technique have you used and why?

I used a combination of stemming and lemmatization as well as the removal of stopwords for text normalization. Here's a brief explanation of each technique:

Stemming:

Why: Stemming involves reducing words to their root or base form. For example, "running" becomes "run," and "jumping" becomes "jump." The goal is to simplify words to their common base, which can help in grouping similar words together.

Why Not: The main limitation of stemming is that it may not always produce real words, and the stemmed form might not retain the original meaning.

Lemmatization:

Why: Lemmatization is similar to stemming but aims to transform words to their base or dictionary form (lemma). For example, "running" becomes "run," and "better" becomes "good." Lemmatization often produces valid words and helps maintain the intended meaning of the words in context.

Why Not: Lemmatization can be computationally more expensive than stemming, and it needs the correct part-of-speech tag to work well. NLTK's WordNetLemmatizer treats every token as a noun unless a tag is passed, so "better" is only mapped to "good" when it is tagged as an adjective. Applying it to already stemmed tokens, as in the code above, also limits its effect because many stems are not dictionary words.

Stopword Removal:

Why: Stopwords are common words (e.g., "the," "and," "is") that often don't contribute much to the meaning of a text. Removing stopwords can reduce noise in the data and focus on more meaningful words.

Why Not: In some cases, stopwords might be essential for understanding the context, so their removal should be done judiciously.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
nltk.download('averaged_perceptron_tagger')

# Function to perform part-of-speech tagging
def pos_tagging(text):
    if not isinstance(text, str):
        # Convert non-string values to strings
        text = str(text)

    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    return pos_tags

# Specify the column containing text for part-of-speech tagging
text_column = 'description'

# Apply part-of-speech tagging to the specified column
dataset['pos_tags'] = dataset[text_column].apply(pos_tagging)

# Display the first few rows of the updated dataset
print(dataset[['talk_id', text_column, 'pos_tags']].head())

In [None]:
dataset.columns

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
text_column = 'transcript'

# Instantiate the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 2))

# Fit and transform the text data (missing transcripts are treated as empty strings)
tfidf_matrix = tfidf_vectorizer.fit_transform(dataset[text_column].fillna(''))

# Convert the TF-IDF matrix to a DataFrame that shares the dataset's index
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out(),
                        index=dataset.index)

# Concatenate the TF-IDF DataFrame with the original dataset
dataset = pd.concat([dataset, tfidf_df], axis=1)

# Display the first few rows of the updated dataset
print(dataset.head())

##### Which text vectorization technique have you used and why?

I used the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization technique. TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It is widely used in natural language processing and information retrieval for feature extraction from text.

Here's why TF-IDF is commonly used:

Term Frequency (TF): Measures the frequency of a term (word) in a document. Words that occur more frequently are assumed to be more important.

Inverse Document Frequency (IDF): Measures how unique a term is across the entire corpus. Words that are common across many documents receive a lower IDF score, while words that are rarer receive a higher score.

TF-IDF Score: Combines TF and IDF to assign a weight to each term in a document. A high TF-IDF score indicates that a term is both frequent in the document and unique across the corpus.

Benefits of TF-IDF:

Normalization: TF-IDF normalizes the importance of terms, accounting for the fact that some terms may be more common in general.

Feature Selection: TF-IDF helps identify important terms as features for machine learning models, allowing the model to focus on the most relevant information.

Sparse Representation: The resulting TF-IDF matrix is often sparse, which is beneficial for memory efficiency and can improve the performance of certain machine learning algorithms.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
pd.set_option('display.max_columns', None)
print(dataset.columns)

In [None]:
# Manipulate Features to minimize feature correlation and create new features
text_data = dataset.drop(['talk_id', 'title', 'speaker_1', 'all_speakers', 'occupations', 'about_speakers', 'views',
                           'recorded_date', 'published_date', 'event'], axis=1)

# Select only numeric columns for standardization
numeric_columns = text_data.select_dtypes(include=['float64', 'int64']).columns

# Impute missing values in the numeric columns with each column's mean
text_data[numeric_columns] = text_data[numeric_columns].fillna(text_data[numeric_columns].mean())

# Standardize the numeric data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(text_data[numeric_columns])

# Apply PCA
pca = PCA(n_components=0.95)  # You can adjust the explained variance threshold
text_data_pca = pca.fit_transform(scaled_data)

# Now 'text_data_pca' contains the transformed features with reduced dimensionality

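# Step 2: Create new features
# Example: total TF-IDF weight of terms related to 'young'. Only terms that made
# it into the TF-IDF vocabulary exist as columns, so filter the list first.
young_terms = ['young man', 'young men', 'young people', 'younger', 'youth']
available_young_terms = [term for term in young_terms if term in dataset.columns]
dataset['total_young_terms'] = dataset[available_young_terms].sum(axis=1)

# Continue creating new features based on your domain knowledge and goals

# Optionally, drop the original text features and use the PCA components instead.
# Column names are generated from the number of components PCA actually kept.
pca_columns = [f'pca_{i + 1}' for i in range(text_data_pca.shape[1])]
pca_df = pd.DataFrame(text_data_pca, columns=pca_columns, index=dataset.index)
dataset = pd.concat([dataset.drop(text_data.columns, axis=1), pca_df], axis=1)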



#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
numeric_features = dataset.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Display the list of numeric features
print(numeric_features)


##### What all feature selection methods have you used and why?

The intended statistical method is the chi-squared test. The code cell above performs only a dtype-based selection of numeric columns and does not run the test itself, so this answer describes the method and its limits for this problem.

**Feature Selection Method: Chi-Squared Statistical Test**

**Why Chi-Squared Test:**

- The chi-squared test is a statistical test used to determine if there is a significant association between two categorical variables.
- It therefore fits a classification framing. Here the target variable ('views') is numeric, so the test applies only after the target is discretized into categories, and the features must be non-negative (for example counts or frequencies).

**Advantages:**

- It is particularly useful when dealing with categorical features.
- It measures the dependence between variables, helping to identify features that are most likely to be related to the target variable.

**How it Works:**

- The test compares the observed frequencies with the frequencies expected if the feature and the target were independent.
- In feature selection, a larger test statistic marks a feature as more likely to be informative about the target variable.

**Other Feature Selection Methods:**

- There are various other feature selection methods available, each suitable for different types of data and problems.
- Common methods include Recursive Feature Elimination (RFE), LASSO regularization, and mutual information-based methods. These can be used directly with a numeric target.

**Why Experimentation is Important:**

- The effectiveness of feature selection methods can vary based on the dataset and the nature of the problem.
- It's essential to experiment with multiple methods and possibly fine-tune parameters to find the most effective feature subset.

**Additional Considerations:**

- The choice of feature selection method also depends on the nature of the features (categorical or numerical) and the characteristics of the target variable.
- In some cases, a combination of feature selection methods or domain-specific knowledge might be beneficial.
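As a sketch of statistical feature selection with a numeric target, the example below uses scikit-learn's `SelectKBest` with the univariate F-test (`f_regression`) instead of chi-squared. This is an illustration on synthetic data, not the selection applied in this notebook: the data, the coefficients, and the choice of `k=2` are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): 5 candidate features, of which only
# columns 0 and 3 actually drive the numeric target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Score each feature against the target and keep the two best
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

selected_idx = selector.get_support(indices=True)
print("F-scores per feature:", np.round(selector.scores_, 1))
print("Selected feature indices:", selected_idx)
```

Inspecting `selector.scores_` is what makes a statement such as "these features were found important" verifiable.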

##### Which all features you found important and why?

To determine which features are important, we would typically look at the results of the feature selection method applied to the dataset. However, the code above does not show the results of the chi-squared test, so no specific features can be reported here.

After applying the chi-squared test or any other feature selection method, you would inspect the scores or importance values assigned to each feature. Features with higher scores are considered more important.

Here's a general approach to answering the question:

**Inspect the Results:**

- Examine the results of the chi-squared test or any other feature selection method.
- Look for features that have higher scores or statistical significance.

**Consideration Factors:**

- Features that contribute significantly to predicting the target variable are considered important.
- The threshold for significance might vary based on the problem and dataset.

**Possible Statements:**

- "Based on the chi-squared test results, features such as [list of features] were found to be statistically significant in predicting the target variable ([target variable])."
- "Features with higher chi-squared scores, such as [feature names], were deemed important in the classification/regression task."

**Domain Knowledge:**

- Consider domain-specific knowledge. Some features might be important based on the context of the problem, even if their statistical significance is not extremely high.

**Visualization (Optional):**

- Create visualizations, such as bar charts or heatmaps, to display the importance of each feature.
- This can be helpful for a more intuitive understanding.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# I think our data doesn't need to be transformed

### 6. Data Scaling

In [None]:
# Scaling your data

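# Extract the numeric feature columns; the target ('views') is left unscaled
# so that error metrics stay in the original units
scale_columns = [col for col in numeric_features if col != 'views']
numeric_data = dataset[scale_columns]

# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the numeric features, keeping the original column names and index
scaled_numeric_data = pd.DataFrame(scaler.fit_transform(numeric_data), columns=scale_columns, index=numeric_data.index)

# Replace the original numeric features with the scaled ones
dataset[scale_columns] = scaled_numeric_data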

# Display the updated dataset with scaled features
print(dataset.head())

##### Which method have you used to scale your data and why?

I used the StandardScaler from the scikit-learn library to scale the numeric features. The StandardScaler standardizes features by removing the mean and scaling to unit variance. It transforms the data such that each feature has a mean of 0 and a standard deviation of 1.

The choice of scaling method depends on the characteristics of the data and the requirements of the machine learning algorithms to be used. The main considerations for StandardScaler are:

Sensitivity to outliers: The mean and standard deviation are both influenced by outliers, so standardization works best when the data is approximately normally distributed and contains few outliers.

Scale-sensitive algorithms: Certain machine learning algorithms, such as support vector machines, k-means clustering, and principal component analysis, rely on distances or variances, so features on larger scales dominate unless all features are brought to a comparable scale. Standardizing the data makes it suitable for these algorithms.

Comparability: Standardized features share the same unitless scale, which makes their magnitudes directly comparable.
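The effect of standardization can be checked directly. The sketch below uses two hypothetical features on very different scales. As an assumption about good practice rather than a description of the cell above, it fits the scaler on the training split only and reuses those statistics for the test split, so no test-set information enters the scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(600, 200, size=100),        # e.g. a duration in seconds
    rng.normal(50_000, 10_000, size=100),  # e.g. a large count
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Learn the mean and standard deviation from the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics

print("Train means after scaling:", np.round(X_train_scaled.mean(axis=0), 6))
print("Train std devs after scaling:", np.round(X_train_scaled.std(axis=0), 6))
```

The test split will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected because it was scaled with the training statistics.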

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

The decision to apply dimensionality reduction depends on the specific characteristics of your data and the goals of your analysis or modeling. Here are some considerations:

Number of Features: The dataset has 5023 columns. With this many features it is prone to the "curse of dimensionality," which refers to the challenges that arise when working with high-dimensional data, such as increased computational complexity and the risk of overfitting.

Correlation Among Features: If many features are highly correlated, it might indicate redundancy in the information captured by the features. Dimensionality reduction techniques can help identify and retain the most informative features while reducing redundancy.

Computational Efficiency: High-dimensional data can be computationally expensive to process and analyze. Dimensionality reduction can lead to more efficient computations.

Improved Model Performance: Some machine learning models might benefit from a reduced feature space, especially if there is noise or irrelevant information in the data. Dimensionality reduction can help improve model performance and generalization.

Visualization: If you want to visualize the data in a lower-dimensional space, techniques like Principal Component Analysis (PCA) can help project the data onto a lower-dimensional subspace while preserving variance.

However, it's essential to consider the potential trade-offs. Dimensionality reduction involves making choices about which features to retain and how much information to preserve. The reduced-dimensional representation might not capture all aspects of the original data, and there's a risk of losing valuable information.

In [None]:
# Dimensionality Reduction (If needed)
# Exclude identifier/text columns and the target ('views') so PCA sees features only
exclude_columns = ['talk_id', 'title', 'speaker_1', 'all_speakers', 'occupations', 'about_speakers', 'event', 'views']  # Add more if needed
numeric_features = [col for col in numeric_features if col not in exclude_columns]

# Separate the features
X = dataset[numeric_features]

# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Standardize the data (important for PCA)
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X_imputed)

# Apply PCA
# You can choose the number of components based on explained variance
# For example, n_components=0.95 means that the number of components is chosen
# such that they explain 95% of the variance in the data.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_standardized)

# Optional: Visualize the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = explained_variance_ratio.cumsum()

# Optional: Plot explained variance
import matplotlib.pyplot as plt

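# Start the x-axis at 1 so it matches the number of components
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, marker='o')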
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.show()





##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I employed Principal Component Analysis (PCA) as the dimensionality reduction technique. PCA is a widely used method that reduces the number of features in a dataset while retaining as much of the variance in the data as possible.

The rationale behind using PCA includes:

Curbing Dimensionality: With a large number of features in our dataset, employing PCA aids in mitigating the curse of dimensionality, making the subsequent analysis more manageable and computationally efficient.

Addressing Multicollinearity: PCA can be effective in dealing with multicollinearity, a situation where independent variables are highly correlated. By transforming the original features into a set of linearly uncorrelated variables (principal components), PCA can provide a more orthogonal and less correlated set of features.

Enhancing Model Performance: In scenarios where machine learning models are prone to overfitting due to a high-dimensional feature space, PCA can contribute to a more generalized model by focusing on the principal components that capture the most significant variance in the data.

Visualization: PCA facilitates visual exploration of the data by reducing it to a lower-dimensional space, allowing for easier interpretation and identification of patterns.
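The multicollinearity point can be illustrated on synthetic data. In this sketch, six features are built as noisy copies of two hidden factors (an assumption made purely for illustration), so the features are strongly correlated within each group. PCA with `n_components=0.95` then keeps two components, and those components are uncorrelated with each other.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 6 features that are noisy copies of 2 hidden factors
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
noise = 0.05 * rng.normal(size=(500, 6))
X = np.column_stack([z[:, 0], z[:, 0], z[:, 0], z[:, 1], z[:, 1], z[:, 1]]) + noise

# Standardize first, then keep enough components to explain 95% of the variance
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_std)

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", np.round(pca.explained_variance_ratio_.cumsum(), 3))

# The principal components are linearly uncorrelated with each other
component_corr = np.corrcoef(X_pca, rowvar=False)
print("Correlation between components:", round(float(component_corr[0, 1]), 6))
```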

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
target_column = 'views'

X = dataset.drop(columns=[target_column])
y = dataset[target_column]

# Choose the splitting ratio, e.g., 80% training and 20% testing
test_size = 0.2

# Set a random seed for reproducibility
random_seed = 42

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_seed)

# Display the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

##### What data splitting ratio have you used and why?

The choice of the data splitting ratio depends on various factors, including the size of your dataset, the complexity of your model, and the goal of your analysis. Commonly used ratios are 80/20, 70/30, or 75/25 for training/testing sets. Here, I used an 80/20 ratio, meaning 80% of the data is used for training and 20% for testing.

The specific ratio is influenced by:

Dataset Size: If you have a large dataset, you can afford to allocate a smaller percentage to testing. For smaller datasets, a larger percentage for testing might be necessary to ensure an adequate evaluation.

Model Complexity: If your model is highly complex and has a large number of parameters, you might need more data for training. Conversely, a simpler model might require less training data.

Goal of Analysis: If your primary goal is model training and you have a large dataset, you might allocate a larger proportion for training. If you're more concerned about evaluating model performance, a larger testing set may be preferred.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
# Check for duplicate columns
duplicate_columns = dataset.columns[dataset.columns.duplicated()]

# Print the duplicate columns
print("Duplicate Columns:", duplicate_columns)

# If there are duplicate columns, drop them
if not duplicate_columns.empty:
    dataset = dataset.loc[:, ~dataset.columns.duplicated()]

# Check the data type of the 'views' column
print(dataset['views'].dtypes)

# Convert 'views' to numeric; values that cannot be parsed become NaN
dataset['views'] = pd.to_numeric(dataset['views'], errors='coerce')

# Bin 'views' into quantile-based categories; qcut yields roughly equal-sized
# bins, so the resulting categories are balanced by construction
num_quantiles = 5  # You can adjust this based on your preference
bins = pd.qcut(dataset['views'], q=num_quantiles, labels=False, precision=0, duplicates='drop')
dataset['views_category'] = bins

# Display the updated dataset
print(dataset[['views', 'views_category']])


##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

The target variable ('views') is continuous, so class imbalance in the classification sense does not apply directly to this regression task. The cell above instead bins 'views' into quantile-based categories ('views_category'), which are close to equal in size by construction. If the problem were reframed as a classification task with unequal classes, the Synthetic Minority Over-sampling Technique (SMOTE) would be a suitable choice. SMOTE generates synthetic samples for the minority class, increasing its representation in the dataset and reducing the risk of a model biased towards the majority class. SMOTE is not applied in the code above.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
target_column = 'views'

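# Separate the features and the target variable. Keep numeric columns only
# (SimpleImputer's mean strategy cannot handle text columns) and exclude
# 'views_category', which is derived from the target and would leak it.
X = dataset.drop(columns=[target_column, 'views_category'], errors='ignore').select_dtypes(include='number')
y = dataset[target_column]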

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle missing values with SimpleImputer for features
feature_imputer = SimpleImputer(strategy='mean')
X_train_imputed = feature_imputer.fit_transform(X_train)
X_test_imputed = feature_imputer.transform(X_test)

# Handle missing values in the target variable
target_imputer = SimpleImputer(strategy='mean')
y_train_imputed = target_imputer.fit_transform(y_train.values.reshape(-1, 1)).flatten()
y_test_imputed = target_imputer.transform(y_test.values.reshape(-1, 1)).flatten()

# Initialize the RandomForestRegressor
model = RandomForestRegressor(random_state=42)

# Fit the Algorithm
model.fit(X_train_imputed, y_train_imputed)

# Predict on the model
y_pred = model.predict(X_test_imputed)

# Evaluate the model (optional)
mse = mean_squared_error(y_test_imputed, y_pred)
print(f'Mean Squared Error: {mse}')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Plotting actual vs. predicted values
plt.scatter(y_test_imputed, y_pred)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.show()

# Calculate Mean Squared Error
mse = mean_squared_error(y_test_imputed, y_pred)
print(f"Mean Squared Error: {mse}")

# Plotting the MSE bar chart
labels = ['Mean Squared Error']
values = [mse]

fig, ax = plt.subplots()
bars = ax.bar(labels, values, color='blue')
ax.bar_label(bars)

plt.ylabel('Mean Squared Error')
plt.title('Model Evaluation Metric: Mean Squared Error')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Reuse the imputed train/test split from the baseline model, so the tuned model
# is trained on the training data and scored on the same held-out test set
X_train, X_test, y_train, y_test = X_train_imputed, X_test_imputed, y_train_imputed, y_test_imputed

# Define the model
model = RandomForestRegressor(random_state=42)

# Define the hyperparameters to tune
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=10,  # Adjust the number of iterations based on your preference
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42
)

# Fit the algorithm with hyperparameter tuning
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_
print(f"Best Hyperparameters: {best_params}")

# Predict on the model
y_pred = random_search.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV for hyperparameter optimization. Instead of trying every combination, it samples a fixed number of combinations from the parameter space: here 10 (`n_iter=10`) of the 81 possible combinations, each scored with 5-fold cross-validation. An exhaustive GridSearchCV over the same space would be more thorough, but it is computationally expensive, especially for large datasets or models with many hyperparameters.

The choice of hyperparameter optimization technique depends on various factors:

GridSearchCV: This technique is suitable when the hyperparameter search space is relatively small, and computational resources are available. It explores all possible combinations in the specified parameter grid.

RandomizedSearchCV: If the hyperparameter search space is large and computational resources are limited, RandomizedSearchCV can be more efficient. It randomly samples a fixed number of hyperparameter combinations from the specified distribution.

Bayesian Optimization: This technique models the objective function and explores the hyperparameter space based on probabilistic models. Bayesian Optimization can be more efficient than GridSearchCV and RandomizedSearchCV, especially for complex and expensive-to-evaluate models.

Your choice of technique depends on the specific characteristics of your problem, available computational resources, and the size of the hyperparameter search space.
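The difference in search cost can be made concrete without training any model. This sketch uses scikit-learn's `ParameterGrid` and `ParameterSampler` on the Random Forest search space from the tuning cell above. `ParameterSampler` is the utility RandomizedSearchCV uses to draw its candidates. The 5-fold figure matches `cv=5` in that cell.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the Random Forest tuning cell
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Grid search evaluates every combination: 3 * 3 * 3 * 3
grid_candidates = list(ParameterGrid(param_dist))

# Randomized search evaluates only n_iter sampled combinations
sampled_candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

cv_folds = 5
print("Grid search candidates:", len(grid_candidates), "->", len(grid_candidates) * cv_folds, "model fits")
print("Randomized search candidates:", len(sampled_candidates), "->", len(sampled_candidates) * cv_folds, "model fits")
```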

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In this run, the Mean Squared Error (MSE) increased after hyperparameter tuning. A lower MSE indicates better model performance, so the increase suggests that the tuned model performed worse on its test set than the baseline did. The two scores are directly comparable only when both models are trained on the same training data and evaluated on the same held-out test set.

However, it's important to note that hyperparameter tuning is not always guaranteed to result in improved performance. The impact of hyperparameter tuning depends on the characteristics of the data, the model, and the specific hyperparameters being tuned. In some cases, tuning may lead to overfitting the training data, causing a drop in generalization performance on unseen data.

To draw more meaningful conclusions, it's advisable to consider other evaluation metrics, perform cross-validation, and potentially explore different sets of hyperparameters. Additionally, the overall model evaluation should take into account factors such as the nature of the problem, the dataset size, and the chosen performance metrics.

### ML Model - 2

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
# Reuse the imputed train/test split from ML Model - 1 so both models are compared on the same data
X_train, X_test, y_train, y_test = X_train_imputed, X_test_imputed, y_train_imputed, y_test_imputed

# Define the model (Gradient Boosting Regressor in this example)
model = GradientBoostingRegressor(random_state=42)

# Fit the Algorithm
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)

# Evaluate the model using Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Views")
plt.ylabel("Predicted Views")
plt.title("Actual vs Predicted Views")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Reuse the imputed train/test split from ML Model - 1 so the tuned model is scored on the same test set
X_train, X_test, y_train, y_test = X_train_imputed, X_test_imputed, y_train_imputed, y_test_imputed

# Define the model (Gradient Boosting Regressor in this example)
model = GradientBoostingRegressor(random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the algorithm with hyperparameter tuning
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

# Predict on the model
y_pred = grid_search.predict(X_test)

# Evaluate the model using Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV for hyperparameter optimization. Here's why:

**GridSearchCV:**

Reason: GridSearchCV is a systematic and exhaustive search over a predefined set of hyperparameter values. It evaluates the model performance for each combination of hyperparameters using cross-validation.

Advantages:

- Comprehensive Search: GridSearchCV considers all possible combinations within the specified parameter grid, ensuring a thorough exploration of the hyperparameter space.
- Easy to Implement: It's easy to use and implement. You define a grid of hyperparameter values, and GridSearchCV takes care of the rest.

Disadvantages:

- Computational Cost: It can be computationally expensive, especially with a large parameter grid, as it trains and evaluates models for each combination. The grid above has 3^5 = 243 combinations, each evaluated with 5-fold cross-validation, for 1,215 model fits.

**Alternative Techniques:**

- RandomizedSearchCV: This technique randomly samples a subset of hyperparameter combinations, making it more efficient than GridSearchCV. It's beneficial when the hyperparameter space is large and an exhaustive search is impractical.
- Bayesian Optimization: This is a probabilistic model-based optimization technique. It builds a probabilistic model of the objective function and selects hyperparameters to optimize an acquisition function. Bayesian Optimization is useful when the search space is continuous and each model evaluation is expensive.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

The evaluation metric used is the Mean Squared Error (MSE), a commonly used metric for regression problems like predicting views. The MSE is calculated as the average of the squared differences between predicted and actual values.

**Mean Squared Error (MSE):**

Explanation: MSE measures the average squared difference between predicted and actual values. A lower MSE indicates better model performance.

Business Significance:

- A lower MSE means the model's predictions are closer to the actual values on average.
- In the context of predicting views, a lower MSE implies that the model is more accurate in estimating the number of views for a given talk.
- This accuracy can be crucial for content creators, event organizers, and platform administrators who rely on accurate predictions for planning and decision-making.

Business Impact:

- Accurate view predictions can assist in optimizing content strategy.
- Content creators can better understand which topics or speakers are likely to attract more views.
- Event organizers can make informed decisions about scheduling and promoting talks.
- Platform administrators can improve user experience by recommending talks that align with user preferences, thus increasing user engagement.
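To make the metric concrete, here is a small worked example with three hypothetical view counts (not values from the dataset). It computes MSE by hand and with scikit-learn, and adds RMSE and MAE for comparison. Those two extra metrics are not used elsewhere in this notebook and appear only to show how the single large error dominates MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MSE by hand: mean of the squared errors (10^2 + 10^2 + 30^2) / 3
errors = y_pred - y_true
mse_manual = np.mean(errors ** 2)

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)

print(f"MSE (manual): {mse_manual:.2f}")
print(f"MSE (sklearn): {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
```

The error of 30 contributes 900 of the 1,100 total squared error, which is why RMSE exceeds MAE here. Taking the square root gives a figure in the same unit as views, which is easier to communicate to stakeholders than MSE.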

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

In our project predicting the views of TED Talks, the primary evaluation metric chosen is Mean Squared Error (MSE). Here's why MSE is a suitable metric for positive business impact:

**Mean Squared Error (MSE):**

Explanation: MSE measures the average squared difference between the predicted and actual values. It gives higher weight to larger errors, making it sensitive to outliers.

Business Impact Consideration:

- Views prediction accuracy is vital for understanding the popularity of TED Talks. A lower MSE indicates that the model is effectively minimizing the errors in its predictions.
- TED organizers and speakers can benefit from accurate predictions to gauge the expected reach of a talk. This aids in strategic decision-making for event planning, speaker invitations, and content creation.

Reasoning:

- The choice of MSE is appropriate for this project because the goal is to predict the number of views accurately. Deviations from the actual number of views, especially for talks with high viewership, can have a substantial impact on the overall assessment of the model's performance.
- MSE provides a quantitative measure of how well the model is performing in terms of minimizing errors, and it directly translates to the quality of predictions, which is crucial for business decisions related to TED Talk planning and promotion.

Consideration for Future Improvements:

- While MSE is a standard metric for regression problems, it's essential to continuously assess the business impact and potentially explore other metrics.
- Business impact can also be evaluated qualitatively by gathering feedback from stakeholders and understanding how well the predictions align with their expectations and decision-making processes.

In summary, choosing MSE as the primary evaluation metric aligns with the business goal of accurately predicting views for TED Talks. A lower MSE signifies improved predictive performance, contributing positively to strategic decision-making and overall business impact.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In the context of predicting the views of TED Talks, both Random Forest Regressor and Gradient Boosting Regressor were implemented. The final selection is based on their performance and other relevant considerations.

Here's an assessment of both models:

**Random Forest Regressor:**

Strengths:

- Averaging many trees makes it robust and relatively resistant to overfitting.
- Handles non-linearity and complex relationships well.
- Suitable for handling a large number of features.

Considerations:

- May not provide as much interpretability as simpler models.

**Gradient Boosting Regressor:**

Strengths:

- Builds trees sequentially, each one correcting the errors of the previous ones, leading to strong predictive performance.
- Can capture complex patterns in the data.
- Exposes feature importances that support interpretation, as Random Forest also does.

Considerations:

- Sensitive to hyperparameter tuning.
- Because its trees are built sequentially, it may take longer to train than Random Forest, whose trees can be trained in parallel.

**Final Choice: Gradient Boosting Regressor**

Reasoning:

- Gradient Boosting Regressor often performs well in predictive tasks, and its ensemble of weak learners allows it to capture intricate relationships in the data.
- While it may be slightly more complex to tune, its feature importances could be valuable for understanding the factors influencing views.
- The final decision would also be influenced by the specific requirements and preferences of the stakeholders involved in the TED Talk planning and promotion.

Consideration for Future Work:

- The choice of the final model can be revisited based on ongoing model evaluation and potential improvements or changes in the dataset.

In conclusion, the Gradient Boosting Regressor was chosen as the final prediction model due to its potential for accurate predictions and its ability to provide insights into the factors affecting views, aligning well with the goals of predicting TED Talk viewership.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Model Training:**

- A Gradient Boosting Regressor is trained on the training set (X_train, y_train).

**Prediction:**

- The trained model is then used to make predictions on the test set (X_test), and the Mean Squared Error is calculated to evaluate its performance.

**Feature Importance:**

- The `feature_importances_` attribute of the trained model provides a normalized estimate of the importance of each feature (the values sum to 1).
- Features are sorted based on their importance, with the sorted indices stored in a `sorted_idx` array.

**Visualization:**

- A horizontal bar chart is created to visualize the relative importance of each feature.
- This visualization provides insights into which features are the most influential in predicting the target variable (views). Features with longer bars contribute more to the model's predictions.

For more advanced interpretability, tools like SHAP (SHapley Additive exPlanations) can be used to explain individual predictions and understand the impact of each feature on specific instances in the dataset.
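The steps above can be sketched as follows. The data is synthetic and the feature names are hypothetical placeholders, so the resulting ranking says nothing about the TED dataset. The sketch only shows how `feature_importances_`, the sorted index array, and the horizontal bar chart fit together.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (assumption): the target depends mostly on 'feature_b'
rng = np.random.default_rng(42)
feature_names = np.array(["feature_a", "feature_b", "feature_c", "feature_d"])
X = rng.normal(size=(300, 4))
y = 1.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = GradientBoostingRegressor(random_state=42)
model.fit(X, y)

# Normalized importances, sorted in ascending order for plotting
importances = model.feature_importances_
sorted_idx = np.argsort(importances)

# Horizontal bar chart: the most important feature ends up at the top
fig, ax = plt.subplots()
ax.barh(feature_names[sorted_idx], importances[sorted_idx])
ax.set_xlabel("Feature importance")
ax.set_title("Gradient Boosting feature importances")
fig.tight_layout()

print(dict(zip(feature_names[sorted_idx][::-1], np.round(importances[sorted_idx][::-1], 3))))
```

In a notebook the figure renders at the end of the cell. In a script, add a call to `plt.show()`.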

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
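
One possible way to fill in the two steps above is a joblib round trip, sketched below. The data is synthetic, and the file name `ted_views_gbr.joblib`, the temporary directory, and the variable names are assumptions made for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data; the last 40 rows play the role of "unseen" talks.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=1)
X_train, y_train, X_unseen = X[:160], y[:160], X[160:]

best_model = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)

# Save the fitted model (hypothetical file name, written to a temp directory)
model_path = os.path.join(tempfile.mkdtemp(), "ted_views_gbr.joblib")
joblib.dump(best_model, model_path)

# Load the model back and predict on the unseen rows
loaded_model = joblib.load(model_path)
unseen_predictions = loaded_model.predict(X_unseen)

# Sanity check: the reloaded model should reproduce the original predictions
predictions_match = np.allclose(unseen_predictions, best_model.predict(X_unseen))
```

If `predictions_match` is `True`, the saved file reproduces the in-memory model and can be shipped to the deployment environment, provided that environment uses a compatible scikit-learn version.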

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we carried out a comprehensive data analysis and machine learning task with the goal of understanding and predicting the views of TED Talks. Here are the key findings and conclusions from our project:

Data Exploration:

We started by exploring and understanding the dataset, which included a wide range of features such as speaker information, event details, and topic tags.
The dataset contained both textual and numeric features, requiring preprocessing and feature engineering to make it suitable for machine learning.

Data Preprocessing:

We handled missing values, duplicated columns, and duplicated rows to ensure the integrity of the dataset.
Textual features like speaker information and topic tags were processed using techniques such as tokenization and vectorization to convert them into a format suitable for machine learning.
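
As an illustration of the tokenization and vectorization step, the sketch below turns comma-separated topic tags into a count matrix with scikit-learn's `CountVectorizer`. The tag strings, the comma-separated format, and the `split_tags` helper are assumptions for the example and may not match how the tags were actually stored or processed in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tag strings, one per talk.
tags = [
    "technology, design, innovation",
    "science, technology",
    "design, art",
]


def split_tags(text):
    """Tokenize a comma-separated tag string into individual tags."""
    return [tag.strip() for tag in text.split(",") if tag.strip()]


# token_pattern=None because the custom tokenizer replaces the default regex
vectorizer = CountVectorizer(tokenizer=split_tags, token_pattern=None, lowercase=True)
tag_matrix = vectorizer.fit_transform(tags)  # sparse matrix: talks x tags
vocabulary = sorted(vectorizer.vocabulary_)  # tag names in column order
```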

Feature Engineering:

We created new features and transformed existing ones to extract meaningful information. For example, we introduced a 'views_category' column to group views into bins.
Numeric features were standardized so that features measured on larger scales do not dominate the others.
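
The binning and scaling described above might look roughly like the following. The toy values, the bin edges, the category labels, and the `duration` and `comments` column names are illustrative assumptions; only the `views_category` column name comes from the project itself.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only; not taken from the TED dataset.
df = pd.DataFrame({
    "views": [50_000, 250_000, 900_000, 1_500_000, 4_000_000, 12_000_000],
    "duration": [300.0, 620.0, 900.0, 1100.0, 1250.0, 1500.0],
    "comments": [12.0, 45.0, 130.0, 220.0, 640.0, 1900.0],
})

# Bin views into categories; the edges and labels here are assumptions.
bins = [0, 500_000, 2_000_000, float("inf")]
labels = ["low", "medium", "high"]
df["views_category"] = pd.cut(df["views"], bins=bins, labels=labels)

# Standardize numeric features to zero mean and unit variance.
numeric_cols = ["duration", "comments"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```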

Exploratory Data Analysis (EDA):

EDA revealed insights into the distribution of views, relationships between different features, and the impact of certain factors on views.
We identified potential correlations and trends that could influence the performance of machine learning models.

Machine Learning Model 1: Random Forest Regressor:

We implemented a Random Forest Regressor as our initial model for predicting views.
Cross-validation and hyperparameter tuning using GridSearchCV were performed to optimize the model's performance.

Machine Learning Model 2: Gradient Boosting Regressor:

To compare against a different ensemble approach, we introduced a Gradient Boosting Regressor as an alternative model for predicting views.
Cross-validation and hyperparameter tuning were conducted to fine-tune the model.

Evaluation Metrics:

We used Mean Squared Error (MSE) as the primary evaluation metric for model performance.
The models were assessed based on their ability to accurately predict the number of views.

Hyperparameter Optimization Techniques:

We employed GridSearchCV for hyperparameter optimization because it exhaustively evaluates every combination in the specified parameter grid using cross-validation.
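
A minimal sketch of this kind of grid search is shown below. The parameter grid, the three-fold split, and the synthetic data are assumptions chosen to keep the example small; they are not the values used in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared TED features.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=7)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate combinations.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=7),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, so MSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # flip the sign back to a plain MSE
n_candidates = len(search.cv_results_["params"])
```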

Results and Improvements:

The models demonstrated reasonable predictive capabilities, with improvements observed after hyperparameter tuning.
However, further refinement can be explored by incorporating more advanced techniques, such as other ensemble methods (for example, stacking) or neural networks.

Future Recommendations:

Future work may involve incorporating additional data sources, exploring advanced natural language processing (NLP) techniques for textual features, and experimenting with more sophisticated machine learning models.
Regular updates to the model can be made as new data becomes available, ensuring its relevance over time.

In conclusion, this project provided valuable insights into the factors influencing the popularity of TED Talks and laid the groundwork for building predictive models. It serves as a foundation for further exploration and refinement of models to better understand and anticipate audience engagement with this content.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***