## Introduction

In this notebook, I will conduct **exploratory data analysis** for the [COVID-19 YouTube Comments dataset](https://www.kaggle.com/seungguini/youtube-comments-for-covid19-related-videos).

1. [Data Preparation](#data-preparation)
2. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Removing duplicate comments
    - Word counts and character counts
3. [Processing Text Data](#processing-text-data)
    - Removing non-English comments
    - Removing URLs, time-stamps, user-handles
    - Extracting contractions, tokenization
    - Lower-case, punctuation, and stop words
    - Lemmatization

## 1. Data Preparation

In this notebook, we will work with the YouTube comment data provided [here](https://www.kaggle.com/seungguini/youtube-comments-for-covid19-related-videos). Let's take a look at our available files.

In [None]:
import time

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# Text processing imports
!pip install contractions langid nltk==3.2.4

import re
import string

import nltk
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models required by word_tokenize

import contractions
import langid
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Wordcloud imports
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

# ML models
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import datetime

# Sentiment Analysis
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

### Initial Observations
Several important observations from the dataset description and preview:
- Our dataset has six files, each in the form of `youtube_comments_query.csv`. The term at the end of each .csv file stands for the query used to request videos.
- According to the dataset description, each _row_ represents a _comment_.
- Approximately 50 videos were scraped for each query, while approximately 100 comments were scraped for each video. This leads to roughly 5000 comments in total, per query.

First off, we will load the `covid_2021_1.csv` file.

In [None]:
df = pd.read_csv('../input/youtube-comments-for-covid19-related-videos/covid_2021_1.csv')
df.head(3)

- The query value seems to repeat itself across the rows.
- Each **row** represents a **comment**.

## 2. Exploratory Data Analysis
### Data content
Let's take a look at the shape and makeup of our dataset.

In [None]:
df.shape

Our data has 13 columns (a.k.a. features) and 41,588 rows (comments).

In [None]:
df.info()

- The `Non-Null Count` for each feature equals the number of comments (41,588), which means we aren't missing any values.
- 5 of our features are int64 values, while 8 of them are objects.

Let's see what makes up our 8 `object` features:

In [None]:
for column in df.columns:
    print(column, ":", type(df[column][0]))

All of our 8 `object` features contain `string` values.

Now let's take a look at each feature. First, we'll create a general statistical overview of our data with the `df.describe()` method.

> **_NOTE:_**  `df.describe()` displays statistics only for _numerical features_ by default. `.round(1)` rounds our statistical values to one decimal place, for better readability.

 > **_See Also:_**  _We'll extract statistical observations for the text data later (through attributes such as word and character counts)._

In [None]:
df.describe().round(1)

### Unique Values

Great! Let's also observe the number of unique values for each feature, with `df.nunique()`

In [None]:
df.nunique()

- The **number of titles (892)**, the **number of urls (895)**, and the **number of upload dates (894)** are all different. This indicates possible _duplicate values_ in our video data.
- The **number of comment texts (40094)** and the **number of comment dates (41209)** are different. This indicates possible _duplicate values_ in our comments data as well.
- Most rows are **not** simply double counted, since there are more unique `comment_date` values (41,209) than unique comment texts (40,094). Thus, our dataset may contain spammed comments (repeated comments of the same text), or perhaps the same author is posting multiple comments in the same video.

We can explore further by examining frequency counts for the unique features. First, the frequency count for unique comments:

In [None]:
df['comment_text'].value_counts()[:10]

Seems like we have some repeated comments in our data. This could potentially be a red flag for biased data (i.e. spam comments), so let's check the number of repeated comments:

In [None]:
# Use boolean indexing to find repeated comments
repeats = df[df.groupby('comment_text')['comment_text'].transform('size') > 1]
print("number of repeats:", len(repeats))
print("percentage of repeats:",np.round(len(repeats)/len(df) * 100, 1), "%")
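
The boolean-indexing step can be checked on a toy frame. The sketch below uses made-up comments (not rows from the dataset) and suggests that the `groupby(...).transform('size') > 1` mask selects the same rows as `duplicated(keep=False)`, which is a shorter alternative:

```python
import pandas as pd

# Toy data (hypothetical comments, not taken from the dataset)
toy = pd.DataFrame({
    'comment_text': ['stay safe', 'first!', 'stay safe', 'thanks doc', 'first!', 'first!']
})

# Mask used above: the size of each comment's group, broadcast back to every row
mask_groupby = toy.groupby('comment_text')['comment_text'].transform('size') > 1

# Alternative: flag every member of a duplicated group
mask_duplicated = toy['comment_text'].duplicated(keep=False)

n_repeats = int(mask_groupby.sum())
print(n_repeats, "of", len(toy), "rows are repeated comments")
print("masks agree:", bool((mask_groupby == mask_duplicated).all()))
```

Note that both masks count every copy of a repeated comment, including the first one, so the percentage of repeats is an upper bound on the share of rows that `drop_duplicates` would later remove.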

About 5% of our rows are repeated comments, which could cause bias later on in our analysis.

> _**NOTE**_ : We eliminate the repeated comments [here](#removing-duplicate-comments)


### Analyzing MetaData

In [None]:
df.columns

We will first analyze four video metadata fields: `views`, `likes`, `dislikes`, and `comment_count`. As we've observed earlier, each row of the dataset represents a comment. Since roughly 100 comments were scraped per video, the video metadata are repeated across rows, as shown below:

In [None]:
df.head(3)

We can see that the `title`, `views`, `likes`, and `dislikes` are repeated, because the first three comments are from the same video. This repetition can cause skew in our data. Fortunately, we can easily create a new dataframe with **video metadata** only with `df.drop_duplicates()`

In [None]:
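# .copy() gives an independent frame, so adding columns later does not trigger a SettingWithCopyWarning
df_videos = df.drop_duplicates(subset=['title']).copy()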
print(df_videos.shape)

We've removed the repeated rows, leaving one representative row for each video title. The first number printed above is the number of videos in this file, which should match the 892 unique titles counted earlier.

Let's visualize the distribution of the four video features:

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(10, 6))

# Each panel plots the column named in its title
sns.histplot(ax=ax[0][0], x='likes', data=df_videos, kde=True, color='tab:red')
ax[0][0].set(xlabel='likes', title='LIKES DISTRIBUTION')

sns.histplot(ax=ax[0][1], x='dislikes', data=df_videos, kde=True, color='tab:orange')
ax[0][1].set(xlabel='dislikes', title='DISLIKES DISTRIBUTION')

sns.histplot(ax=ax[1][0], x='views', data=df_videos, kde=True, color='tab:green')
ax[1][0].set(xlabel='views', title='VIEWS DISTRIBUTION')

sns.histplot(ax=ax[1][1], x='comment_count', data=df_videos, kde=True, color='tab:blue')
ax[1][1].set(xlabel='no. of comments', title='NO. OF COMMENTS DISTRIBUTION')

# Formatting for scientific notations
for i in range(0,2):
    for j in range(0,2):
        ax[i][j].get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
        ax[i][j].get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
        ax[i][j].tick_params('x', labelrotation=30)
plt.tight_layout()
plt.show()

- The histograms have a similar distribution shape, with a skewed tail to the right
- The histograms have outliers on the right as well

Let's check out the **quantiles** for our features:

In [None]:
print("Likes quantiles")
print(df_videos['likes'].quantile([.01,.25,.5,.75,.99]))
print("")
print("Dislikes quantiles")
print(df_videos['dislikes'].quantile([.01,.25,.5,.75,.99]))
print("")
print("Views quantiles")
print(df_videos['views'].quantile([.01,.25,.5,.75,.99]))
print("")
print("No. of comments quantiles")
print(df_videos['comment_count'].quantile([.01,.25,.5,.75,.99]))


Let's observe the correlation for the following relationships:
- `views` on `likes`
- `views` on `dislikes`
- `views` on `comment_count`
- `likes` on `dislikes`
- `likes` on `comment_count`
- `dislikes` on `comment_count`

In [None]:
fig, ax = plt.subplots(3,2, figsize=(9, 9))
sns.scatterplot(ax=ax[0][0], x="views", y="likes", data=df_videos, color='tab:red')
ax[0][0].set(title='VIEWS V. LIKES')

sns.scatterplot(ax=ax[1][0], x="views", y="dislikes", data=df_videos, color='tab:orange')
ax[1][0].set(title='VIEWS V. DISLIKES')

sns.scatterplot(ax=ax[2][0], x="views", y="comment_count", data=df_videos, color='tab:green')
ax[2][0].set(ylabel='no. of comments', title='VIEWS V. NO. OF COMMENTS')

sns.scatterplot(ax=ax[0][1], x="likes", y="dislikes", data=df_videos, color='tab:olive')
ax[0][1].set(title='LIKES V. DISLIKES')

sns.scatterplot(ax=ax[1][1], x="likes", y="comment_count", data=df_videos, color='tab:blue')
ax[1][1].set(ylabel='no. of comments', title='LIKES V. NO. OF COMMENTS')

sns.scatterplot(ax=ax[2][1], x="dislikes", y="comment_count", data=df_videos, color='tab:cyan')
ax[2][1].set(ylabel='no. of comments', title='DISLIKES V. NO. OF COMMENTS')

# Formatting for scientific notations
for i in range(0,3):
    for j in range(0,2):
        ax[i][j].get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
        ax[i][j].get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
        ax[i][j].tick_params('x', labelrotation=30)

plt.tight_layout()
plt.show()

- `likes`, `dislikes`, and `comment_count` all have a positive correlation with `views`
- This is understandable, since more views give a video more chances to collect likes, dislikes, and comments
- We can see that our **outlier videos** have _roughly 55,000,000 views_ and _75,000,000 views_.
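
A scatter plot shows the direction of a relationship, while a correlation coefficient puts a number on it. Here is a minimal sketch with `DataFrame.corr()` on made-up video statistics (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical per-video statistics
toy_videos = pd.DataFrame({
    'views':         [1_000, 5_000, 20_000, 100_000, 400_000],
    'likes':         [   40,   150,    900,   3_500,  16_000],
    'dislikes':      [    2,    10,     30,     200,     500],
    'comment_count': [    5,    30,    120,     400,   2_100],
})

# Pearson correlation (the pandas default) between every pair of columns
corr = toy_videos.corr()
print(corr.round(2))

# Correlation of each engagement metric with views
views_corr = corr['views'].drop('views')
print(views_corr.round(2))
```

On the real data, the same call could be applied to `df_videos[['views', 'likes', 'dislikes', 'comment_count']]`. Since the distributions are heavily right-skewed, a few outlier videos can dominate the Pearson coefficient; passing `method='spearman'` gives a rank-based alternative.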

### Additional Features
As we observed, videos with higher `views` tended to have higher `likes`, `dislikes`, and `comment_count`.
Let's conduct a deeper analysis of these variables by creating a `likes_ratio`, `dislikes_ratio`, and `comments_ratio`, each expressed as a percentage of `views`.

In [None]:
df_videos['likes_ratio'] = df_videos['likes'] / df_videos['views'] * 100
df_videos['dislikes_ratio'] = df_videos['dislikes'] / df_videos['views'] * 100
df_videos['comments_ratio'] = df_videos['comment_count'] / df_videos['views'] * 100

Now, let's visualize the distributions of our ratios!

In [None]:
plt.figure(figsize=(9, 6))

# Overlay the three KDE curves on one set of axes so they can be compared
sns.kdeplot(df_videos['dislikes_ratio'], color='red', label="Dislike")
sns.kdeplot(df_videos['likes_ratio'], color='green', label="Like")
sns.kdeplot(df_videos['comments_ratio'], color='blue', label="Comment")
plt.xlabel('ratio to views (%)')
plt.legend()
plt.show()

In [None]:
df['upload_date']

### Removing duplicate comments

In our Exploratory Data Analysis, we found some repeated comments in our data. These comments could bias our later analysis and modeling. No worries! We can easily remove the repeated comments with `drop_duplicates`, which keeps the first occurrence of each comment text.

> **_NOTE:_** setting `inplace=True` modifies our dataframe directly instead of returning a new one

In [None]:
df.drop_duplicates(subset=['comment_text'], inplace=True)
print(df['comment_text'].value_counts())

Wonderful! All of our duplicate comments have been removed.

### Generating word & character counts
As we can see from our output for `df.describe()` above, text data poses some challenges for statistical analysis. Unlike numerical data, text data does not explicitly give us figures to work with. However, we can derive useful numerical features from it for analysis and modeling. Deriving new features like this is often called **feature engineering**.

The two features we will be extracting are word and character counts. We will create four columns for our new features:
- `title_wc` : word count for the video title
- `title_cc` : character count for the video title
- `comment_wc` : word count for the comment
- `comment_cc` : character count for the comment

In [None]:
def wordcount(text):
    # Count whitespace-separated tokens
    return len(text.split())

df['title_wc'] = df['title'].apply(wordcount)
df['comment_wc'] = df['comment_text'].apply(wordcount)
df['title_cc'] = df['title'].apply(len)
df['comment_cc'] = df['comment_text'].apply(len)
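
To see what these two counts actually measure, here is a small sketch on a made-up string. It relies only on the behavior of `str.split()` with no argument, which splits on any run of whitespace:

```python
# Made-up sample with irregular spacing, a tab and a newline
sample = "  COVID-19   update:\tstay home,\nstay safe  "

words = sample.split()      # splits on any run of whitespace
word_count = len(words)
char_count = len(sample)    # counts every character, whitespace included

print(words)
print("words:", word_count, "characters:", char_count)
```

Punctuation stays attached to its neighboring word (`update:`, `home,`), so the word count is a count of whitespace-separated tokens rather than of dictionary words.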

Awesome! Now that we've created our word and character counts for both titles and comments, let's check out their distributions!

In [None]:
fig, ax = plt.subplots(2, figsize=(6, 9))
sns.histplot(ax=ax[0], x="title_wc", data=df, color='tab:red', kde=True, bins=100)
ax[0].set(title='TITLE WORD COUNT DISTRIBUTION')

sns.histplot(ax=ax[1], x="title_cc", data=df, color='tab:orange', kde=True)
ax[1].set(title='TITLE CHARACTER COUNT DISTRIBUTION')
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(2, figsize=(6, 9))
sns.histplot(ax=ax[0], x="comment_wc", data=df, color='tab:red', kde=True, bins=100)
ax[0].set(title='COMMENT WORD COUNT DISTRIBUTION')

sns.histplot(ax=ax[1], x="comment_cc", data=df, color='tab:orange', kde=True)
ax[1].set(title='COMMENT CHARACTER COUNT DISTRIBUTION')
plt.tight_layout()

Let's find some basic metrics about our newly added features! We'll use the pandas `.describe()` method used earlier :)

In [None]:
new_df = df[['title_wc','title_cc','comment_wc','comment_cc']]
new_df.describe()

## 3. Processing Text Data

So far, we have observed our data's shape and general correlations. Now, we would like to perform deeper analysis on our text data (most common words, sentiment analysis, etc). However, computers are unable to understand text the way humans can. To let our program analyze the data properly, we first process our **titles** and **comments**.

### Removing URLs, time-stamps, user-handles
> Some comments may hold URLs, time-stamps, or user-handles. We remove these using regular expressions

In [None]:
# Remove URLS
df['processed'] = df['comment_text'].apply(lambda comment : re.sub(r"http\S+", "", comment))

# Remove time-stamps
df['processed'] = df['processed'].apply(lambda comment : re.sub(r"\d+:\d{2}", "", comment))

# Remove user-handles
df['processed'] = df['processed'].apply(lambda comment : re.sub(r"@[^\s]+", "", comment))

# Remove numbers
df['processed'] = df['processed'].apply(lambda comment : re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", comment))
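
The effect of the four patterns is easiest to see on a single string. The sketch below wraps them in a helper (`strip_noise` is a hypothetical name) and applies them to a made-up comment:

```python
import re

def strip_noise(comment):
    comment = re.sub(r"http\S+", "", comment)                  # URLs
    comment = re.sub(r"\d+:\d{2}", "", comment)                # time-stamps such as 12:34
    comment = re.sub(r"@[^\s]+", "", comment)                  # user handles
    comment = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", comment)   # standalone numbers
    return comment

sample = "@DrSmith at 12:34 he cites 300 cases, see https://example.com/report"
cleaned = strip_noise(sample)
print(repr(cleaned))

# Consecutive numbers: each match consumes the whitespace around it
print(repr(strip_noise("a 1 2 3 b")))
```

The removals leave extra spaces behind, which is harmless here because the text is tokenized later. The second call shows a limitation of the number pattern: because each match consumes its surrounding whitespace, the middle number in a run such as `1 2 3` survives.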


### Removing non-English comments

For this notebook, we will work with only English comments. First, let's tag our comments with their language using the `langid` package:

In [None]:
# Function to detect the language
def detectLang(corpus) :
  try:
    lang = langid.classify(corpus)[0]
    if lang == 'en':
      return lang
    else:
      return None
  except:
    return None

# Tag comment language
%time df['lang'] = df['processed'].apply(lambda comment: detectLang(comment))

In [None]:
df['lang'].isnull().sum()

Approximately 7000 of our comments are not English. Let's drop these values:

In [None]:
# Drop the rows (not just the column values) whose language tag is missing
df.dropna(subset=['lang'], inplace=True)
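
Dropping rows based on missing values in one column is done at the dataframe level with `dropna(subset=...)`. A minimal sketch on a toy frame (made-up rows, with `None` standing in for a non-English tag):

```python
import pandas as pd

toy = pd.DataFrame({
    'processed': ['stay safe everyone', 'cuídense mucho', 'great video'],
    'lang': ['en', None, 'en'],
})

# Keep only the rows whose 'lang' value is present
english_only = toy.dropna(subset=['lang'])
print(len(toy), "->", len(english_only))
```

Re-running `df['lang'].isnull().sum()` after the drop is a quick way to confirm that no untagged rows remain.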

### Extracting contractions, tokenization

First of all, we will expand contractions _(don't --> do not)_ by using the `contractions` package.

> _NOTE_ : We apply `contractions.fix(word)` to each whitespace-separated word, so an expanded contraction such as `do not` ends up inside a single string. In order to properly tokenize, we join the words back into a comment before tokenizing it again.
> _EXAMPLE_ : `I don't care` --> `['I', 'do not', 'care']` --> `I do not care` --> `['I', 'do', 'not', 'care']`

In [None]:
%%time
# Start from the cleaned text so the URL, time-stamp and handle removal above is kept
df['no_contract'] = df['processed'].apply(lambda x: [contractions.fix(word) for word in x.split() ])
df['comment_str'] = [' '.join(map(str, l)) for l in df['no_contract']]

# Tokenize the comments again
df['tokenized'] = df['comment_str'].apply(lambda x : word_tokenize(x))
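
The expand-then-tokenize flow can be illustrated without the `contractions` package. In the sketch below, `TOY_CONTRACTIONS` and `expand` are hypothetical stand-ins for `contractions.fix`, and whitespace splitting stands in for `word_tokenize`:

```python
# Hypothetical mini-dictionary standing in for the `contractions` package
TOY_CONTRACTIONS = {"don't": "do not", "i'm": "I am", "it's": "it is"}

def expand(word):
    # Look the word up case-insensitively; leave unknown words unchanged
    return TOY_CONTRACTIONS.get(word.lower(), word)

comment = "I don't care"
expanded_words = [expand(w) for w in comment.split()]   # 'do not' is still one string
comment_str = ' '.join(expanded_words)                  # back to a single sentence
tokens = comment_str.split()                            # now every word is its own token

print(expanded_words)
print(tokens)
```

The intermediate join is what turns the two-word string `do not` into two separate tokens.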

### Lower-case, punctuation, and stop words
Next, we will convert our text into lower case and remove all punctuation symbols. Stop words ('the', 'is', 'a', etc.) can also be removed using NLTK's `stopwords` module, although the cell below does not apply that step.

In [None]:
%%time
# Convert characters to lowercase
df['lower'] = df['tokenized'].apply(lambda x: [word.lower() for word in x])

# Remove punctuations
punct = string.punctuation
# Make a list only if token is not a punctuation
df['processed'] = df['lower'].apply(lambda x: [word for word in x if word not in punct])
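
Here are the same steps, plus optional stop-word removal, on a toy token list. `TOY_STOP_WORDS` is a small hand-written set used only for illustration; NLTK's `stopwords.words('english')` is much longer:

```python
import string

# Hand-written stop-word set for illustration only
TOY_STOP_WORDS = {'the', 'is', 'a', 'not', 'do'}

tokens = ['The', 'vaccine', 'is', 'a', 'relief', '!']

lower = [t.lower() for t in tokens]
no_punct = [t for t in lower if t not in string.punctuation]
no_stop = [t for t in no_punct if t not in TOY_STOP_WORDS]

print(no_punct)
print(no_stop)
```

One design consideration: NLTK's English list contains negations such as `not`, so removing stop words before sentiment analysis can change the meaning of a comment.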

In [None]:
df.columns

In [None]:
# Drop unnecessary columns
df.drop(['lang', 'no_contract', 'comment_str', 'lower', 'tokenized'],axis=1,inplace=True)

## 4. Most Common Words
Now that our comments are processed and tokenized, let's analyze the most commonly used words throughout our comments.
We visualize our top used words with a `WordCloud`.

In [None]:
# Define a function to plot the wordcloud

def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(20, 10))
    # Display image
    plt.imshow(wordcloud) 
    # No axis details
    plt.axis("off");

In [None]:
# Format comments into one, large String for the wordcloud
df['processed_sentence'] = df['processed'].apply(lambda comment : " ".join(comment))
%time text = ' '.join(df['processed_sentence'])

In [None]:
# Create YouTube logo mask
mask = np.array(Image.open('../input/youtube-logo/youtube_social_square_white.png'))

# Generate the wordcloud
wordcloud = WordCloud(width = mask.shape[1], height = mask.shape[0], background_color='white', colormap='Reds', mask=mask, max_words=300, max_font_size=300)
wordcloud.generate(text)

# Plot
plot_cloud(wordcloud)

## 5. K-Means Clustering

To further examine what our comments are talking about, let's perform **K-means clustering** on our data. This will help divide our comments into specific clusters that could give us insight into the different topics and discussions in our dataset.

---

### TF-IDF Vectorization

Before we can run **K-Means Clustering**, we must first vectorize our dataset. We will use the **Term Frequency-Inverse Document Frequency (TF-IDF)** vectorization method on our dataset. More on TF-IDF can be found [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

The implementation of **TF-IDF** is provided by `sklearn`:

In [None]:
df['processed_sentence']

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
data = df['processed_sentence']
# data.head()

tf_idf_vectorizor = TfidfVectorizer(max_features = 5000)
%time tf_idf = tf_idf_vectorizor.fit_transform(data)
tf_idf_norm = normalize(tf_idf)
tf_idf_array = tf_idf_norm.toarray()
tf_idf_df = pd.DataFrame(tf_idf_array, columns=tf_idf_vectorizor.get_feature_names())
tf_idf_df.head()
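
To build intuition for what the vectorizer returns, here is a minimal sketch on a three-document toy corpus (made-up text). It uses `TfidfVectorizer` with default settings, whereas the cell above also caps the vocabulary with `max_features`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    'vaccine news today',
    'vaccine side effects',
    'lockdown news',
]

vec = TfidfVectorizer()
X = vec.fit_transform(toy_corpus)    # sparse matrix: documents x vocabulary
vocab = vec.vocabulary_              # maps each word to its column index

print(X.shape)
print(sorted(vocab))

# A word found in one document ('today') outweighs one found in two ('vaccine')
print(X[0, vocab['today']] > X[0, vocab['vaccine']])

# With the default norm='l2', every row already has unit length
row_norms = np.asarray(np.sqrt(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)
```

Because the rows already have unit L2 norm under the default settings, a further call to `normalize` leaves the matrix unchanged.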

In [None]:
tf_idf_df.shape

In [None]:
tf_idf_df.dropna(inplace=True)

In [None]:
%%time
print(tf_idf_df.shape)
pca = PCA(0.95)
reduced = pca.fit_transform(tf_idf_df)
print(reduced.shape)

In [None]:
sum(pca.explained_variance_ratio_)
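
Passing a float such as `0.95` to `PCA` asks for a share of variance rather than a number of components. The sketch below shows this on synthetic data (random numbers, not the TF-IDF matrix), where most of the variance is deliberately placed in the first three directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features with very different spreads
scales = np.array([10, 8, 6, 1, 1, 1, 0.5, 0.5, 0.5, 0.5])
X = rng.normal(size=(200, 10)) * scales

pca_toy = PCA(n_components=0.95)   # a float in (0, 1) is a variance target, not a count
X_reduced = pca_toy.fit_transform(X)

kept = pca_toy.n_components_
explained = pca_toy.explained_variance_ratio_.sum()
print("components kept:", kept)
print("variance explained:", round(float(explained), 3))
```

PCA keeps the smallest number of components whose cumulative explained variance reaches the target, which is why the sum of `explained_variance_ratio_` comes out at or slightly above 0.95.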

---

### Optimizing K with the Elbow Method

In order to figure out which K-value (the number of clusters to create) is optimal, we use the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering))

First, we run K-means with k ranging from 1 to 6. Then, we plot the score (inertia) against k and look for the elbow, the point where adding more clusters stops reducing the score substantially:

In [None]:
%%time

n_clusters = range(1, 7)

inertia = []
for i in n_clusters:
    kmeans = KMeans(n_clusters=i, max_iter=600, algorithm = 'auto')
    kmeans.fit(reduced)
    inertia.append(kmeans.inertia_)

plt.plot(n_clusters, inertia)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Method')
plt.show()

Though our elbow graph is not the clearest, the score (inertia) seems to decrease only marginally once there are between _3 and 5 clusters_. We'll start with 3 clusters.
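
For comparison, this is what a clear elbow looks like. The sketch uses three well-separated synthetic blobs (illustrative data, not our comments) and fixes `n_init` and `random_state` so that the run is repeatable:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Three well-separated synthetic blobs in 2-D
blob_centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in blob_centers])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
print(np.round(inertias, 1))
print(np.round(drops, 1))
```

The reduction is large up to k=3 and small afterwards, which is the elbow. On real text data the curve is usually much smoother, so the choice of k remains a judgment call.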

In [None]:
%time kmc = KMeans(n_clusters=3, max_iter=600, algorithm = 'auto').fit(reduced)
predicted_values = kmc.predict(reduced)

# Plot the first two principal components, colored by cluster
plt.scatter(reduced[:, 0], reduced[:, 1], c=predicted_values, s=50, cmap='viridis')

centers = kmc.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1],c='black', s=300, alpha=0.6);

In [None]:
def get_top_features_cluster(tf_idf_array, prediction, n_feats):
    labels = np.unique(prediction)
    dfs = []
    for label in labels:
        id_temp = np.where(prediction==label) # indices for each cluster
        x_means = np.mean(tf_idf_array[id_temp], axis = 0) # returns average score across cluster
        sorted_means = np.argsort(x_means)[::-1][:n_feats] # indices with top 20 scores
        features = tf_idf_vectorizor.get_feature_names()
        best_features = [(features[i], x_means[i]) for i in sorted_means]
        df = pd.DataFrame(best_features, columns = ['features', 'score'])
        dfs.append(df)
    return dfs

def plotWords(dfs, n_feats):
    for i in range(0, len(dfs)):
        # Create a new figure per cluster so every plot gets the same size
        plt.figure(figsize=(8, 4))
        plt.title(("Most Common Words in Cluster {}".format(i)), fontsize=10, fontweight='bold')
        sns.barplot(x = 'score' , y = 'features', orient = 'h' , data = dfs[i][:n_feats])
        plt.show()

In [None]:
n_feats = 20
# Rank words on the TF-IDF matrix (not the PCA output) so that column indices match the vocabulary
dfs = get_top_features_cluster(tf_idf_array, predicted_values, n_feats)
plotWords(dfs, 13)

## 6. Sentiment Analysis
There are many ways to perform sentiment analysis on our comments. For practice and comparison purposes, I will use several well-known sentiment analysis models ([VADER](https://github.com/cjhutto/vaderSentiment), [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html)), while also attempting to build my own models. To train my models, I will use the **1.6M Tweet Sentiment dataset**, available on [Kaggle](https://www.kaggle.com/kazanova/sentiment140).

### Sentiment classification using Logistic Regression
We will train a **logistic regression** from `sklearn` to create a model that predicts sentiments.
Before we dive into our models, it's important to clarify the **evaluation metrics** for our classification model.

> _DISCLAIMER_ : 
This [Medium article](https://towardsdatascience.com/sentiment-classification-with-logistic-regression-analyzing-yelp-reviews-3981678c3b44#c3b8) has been incredibly helpful in building the following model. Most of the concepts, explanations, and models are built off the article's contents. Please check it out if you are interested! 

For our tasks, here are several evaluation metrics:
1. **Precision** - True Positive/(True Positive + False Positive), meaning the proportion of points the model classifies as positive that are actually positive.
2. **Recall** - True Positive/(True Positive + False Negative), meaning the proportion of actual positives that are correctly classified by the model.
3. **F1 score** - the harmonic mean of precision and recall.

We will use the **F1 score** as the primary evaluation metric for our model
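
The three definitions above can be worked through by hand. The sketch below uses hypothetical confusion counts (80 true positives, 20 false positives, 40 false negatives), and `precision_recall_f1` is an illustrative helper rather than part of `sklearn`:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

# Hypothetical counts for the class treated as "positive"
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot reach a high F1 score by maximizing precision or recall alone.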

### Text Preprocessing

First, let's import our dataset from [Kaggle](https://www.kaggle.com/kazanova/sentiment140):

In [None]:
df_sentiment = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv', encoding='latin', header=None)
df_sentiment.head()

Our dataset is missing headers, so we will assign them:

In [None]:
df_sentiment.columns = ['sentiment', 'id', 'date', 'query', 'user_id', 'text']
df_sentiment.head()

Since we're using this dataset to create a Sentiment Analysis model, let's drop the unnecessary columns.

In [None]:
df_sentiment.drop(['id', 'date', 'query', 'user_id'], axis=1, inplace=True)

According to our dataset description, this is how the `sentiment` column has been annotated:
- `4` : positive
- `2` : neutral
- `0` : negative

We will convert the numerical values into `positive`, `negative`, or `neutral`:

In [None]:
annotation = {
    4: 'positive',
    2: 'neutral',
    0: 'negative'
}
df_sentiment['sentiment'] = df_sentiment['sentiment'].apply(lambda sentiment : annotation[sentiment])
df_sentiment.head()

Let's check the distribution of sentiments in our dataset

In [None]:
val_count = df_sentiment.sentiment.value_counts()

plt.figure(figsize=(8,4))
plt.bar(val_count.index, val_count.values)
plt.title("Sentiment Data Distribution")

Thankfully, our dataset is fairly balanced across sentiment classes. Just like we did for the **YouTube Comments Dataset**, let's perform **text preprocessing** on our new dataset. We will use similar methods as above:

In [None]:
start_time = time.time()

def process_sentence(sentence):
    words = []
    punct = string.punctuation
    
    # Remove URLS
    sentence = re.sub(r"http\S+", "", sentence)

    # Remove time-stamps
    sentence = re.sub(r"\d+:\d{2}", "", sentence)

    # Remove user-handles
    sentence = re.sub(r"@[^\s]+", "", sentence)
    
    for word in sentence.split():
        words.extend(contractions.fix(word).split())
    words = [word.lower() for word in words]
    words = [word for word in words if word not in punct]
    return words

df_sentiment['processed'] = df_sentiment['text'].apply(lambda sentence : process_sentence(sentence))
print("Time taken to preprocess text data: ", round((time.time() - start_time)/60, 2), " mins")

In [None]:
df_sentiment.head()

In [None]:
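# Drop intermediate columns if present; process_sentence does not create them,
# so errors='ignore' prevents a KeyError
df_sentiment.drop(['no_contract', 'comment_str', 'tokenized', 'lower'], axis=1, inplace=True, errors='ignore')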

### Count Vectorization
Before we train our model, we must first convert our text input into vectors. 
We will use the `CountVectorizer` from `sklearn` to extract features from our words.

First, let's split our dataset into train and test data for our models:

In [None]:
train, test = train_test_split(df_sentiment, random_state = 42)

Now, let's create a binary feature representation of our words:

In [None]:
start_time = time.time()
cv = CountVectorizer(binary=True, min_df = 10, max_df = 0.95)
# Learn the vocabulary on the training data and transform it in one step
train_feature_set = cv.fit_transform(train['text'].values)
test_feature_set=cv.transform(test['text'].values)
print("Time taken to convert text input into feature vector: ", round((time.time() - start_time)/60, 2), " mins")

Great! We've created a `CountVectorizer` matrix. Because we set `binary=True`, each entry records whether a vocabulary word occurs in a Tweet (1) or not (0), rather than how many times it occurs.
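
The difference between binary and count features is easiest to see on a tiny example. The sketch below fits two vectorizers on two made-up Tweets, leaving out the `min_df` and `max_df` filters used above so that no toy words are dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tweets = ['good good day', 'bad day']

binary_cv = CountVectorizer(binary=True)   # presence / absence
count_cv = CountVectorizer()               # default: raw term counts

binary_matrix = binary_cv.fit_transform(toy_tweets).toarray()
count_matrix = count_cv.fit_transform(toy_tweets).toarray()

col = binary_cv.vocabulary_['good']        # column index assigned to "good"
print("binary:", binary_matrix[0, col], "count:", count_matrix[0, col])
```

The word `good` appears twice in the first Tweet, so the count matrix stores 2 while the binary matrix stores 1.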

Now, we can set each Tweet's sentiment as our target value:

In [None]:
y_train = train['sentiment'].values
y_test = test['sentiment'].values

### Training our Logistic Regression Model for Sentiment Analysis

Awesome! Now that we've feature engineered our text data with `CountVectorizer` and created our target variables, let's train our **logistic regression** model:

In [None]:
start_time = time.time()
lr = LogisticRegression(solver = 'liblinear', random_state = 42, max_iter=1000)
lr.fit(train_feature_set,y_train)
y_pred = lr.predict(test_feature_set)
print("Time taken to train model and make predictions: ", round((time.time() - start_time)/60, 2), " mins")

We analyze our model's accuracy and F1 score:

In [None]:
print("Train Score", lr.score(train_feature_set, y_train))
print("Test Score", lr.score(test_feature_set, y_test))
print("Test Accuracy: ",round(metrics.accuracy_score(y_test,y_pred),3))
print("F1: ",round(metrics.f1_score(y_test, y_pred, pos_label="negative"),3))

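### Using our model to predict sentiments for COVID YouTube Comments

The cell below scores each comment with VADER, one of the off-the-shelf analyzers mentioned above. Its `compound` score ranges from -1 (most negative) to +1 (most positive).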

In [None]:
# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

start_time = time.time()
df['sentiment'] = df['processed_sentence'].apply(lambda comment : analyzer.polarity_scores(comment)['compound'])
print("Time taken to create sentiment analysis with VADER: ", round((time.time() - start_time)/60, 2), " mins")
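
The `compound` score is a number, so one more step is needed to get a sentiment label. The sketch below uses the ±0.05 thresholds suggested in the VADER README; `label_from_compound` is a hypothetical helper name and the scores are made up:

```python
def label_from_compound(compound, threshold=0.05):
    # +/-0.05 follows the convention suggested in the VADER README
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical compound scores
labels = [label_from_compound(score) for score in (0.62, 0.0, -0.41)]
print(labels)
```

On the dataframe this could be applied as `df['sentiment'].apply(label_from_compound)`. Note that VADER takes capitalization and punctuation into account, so scoring the raw `comment_text` may give different results from scoring the lower-cased, punctuation-free `processed_sentence`.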