### YouTube comments analysis
In this notebook, we have a dataset of user comments on YouTube videos related to animals or pets. We will attempt to identify cat or dog owners based on these comments, find the topics important to them, and then identify the video creators whose audiences include the most cat or dog owners.

The dataset provided for this coding test is a set of comments on videos related to animals and/or pets. It is 240 MB compressed:
https://drive.google.com/file/d/1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a/view?usp=sharing

The dataset file is comma-separated, with a header line defining the following field names: <br>
● creator_name. Name of the YouTube channel creator.<br>
● userid. Integer identifier for the users commenting on the YouTube channels.<br>
● comment. Text of the comments made by the users.

#### Download the data

In [4]:
# link: https://drive.google.com/file/d/1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a/view?usp=sharing
!pip install googledrivedownloader
from google_drive_downloader import GoogleDriveDownloader as gdd
gdd.download_file_from_google_drive(file_id='1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a', dest_path='./../../dbfs/Youtube/data/animal_comments.gz')


# 0. Data Exploration and Cleaning

In [6]:
# read data
df = spark.read.load('/Youtube/data/animal_comments.gz', format='csv', header = True, inferSchema = True)
df.show(10)

In [7]:
df.count() 

#### Check missing values

In [9]:
# Count null values in each column
print('Number of null values in creator_name: ',df.filter(df['creator_name'].isNull()).count())
print('Number of null values in userid: ',df.filter(df['userid'].isNull()).count())
print('Number of null values in comment: ',df.filter(df['comment'].isNull()).count())

#### Drop data with no comments

In [11]:
df = df.na.drop(subset=["comment"])
print('Number of rows after dropping null comments:', df.count())

#### Convert comment text to lower case

In [13]:
import pyspark.sql.functions as F
df_clean = df.withColumn('comment', F.lower(F.col('comment')))

#### Data overview

In [15]:
df_clean.show()

# Part 1. Data preprocessing

### Tokenize the text data and create Word2Vec features

Select 2 million rows of data to use for this project.

In [19]:
from pyspark.sql.functions import rand 

df_clean.orderBy(rand(seed=0)).createOrReplaceTempView("table")
df_clean = spark.sql("select * from table limit 2000000")

In [20]:
from pyspark.ml.feature import RegexTokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

# regular expression tokenizer
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")
word2Vec = Word2Vec(inputCol="words", outputCol="features")  # learns a vector per word that captures its semantic similarity to other words; each comment's feature is the average of its word vectors

In [21]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regexTokenizer, word2Vec])

# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(df_clean)
dataset = pipelineFit.transform(df_clean)

In [22]:
dataset.show(10)
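
To make the two pipeline stages above more concrete, here is a minimal plain-Python sketch of what they do conceptually: split a comment on non-word characters, then average the vectors of its words. This is an illustration only, not the Spark implementation. The 2-dimensional `toy_embeddings` are made up, and dividing by the number of known tokens is an assumption of this sketch; Spark's handling of out-of-vocabulary words and Unicode may differ.

```python
import re

def tokenize(comment):
    """Lowercase and split on non-word characters, dropping empty tokens."""
    return [tok for tok in re.split(r"\W", comment.lower()) if tok]

def average_vector(tokens, embeddings, size):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * size
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {"my": [0.0, 1.0], "dog": [1.0, 0.0], "cat": [1.0, 1.0]}

tokens = tokenize("My dog, my CAT!")
doc_vector = average_vector(tokens, toy_embeddings, size=2)
print(tokens)      # ['my', 'dog', 'my', 'cat']
print(doc_vector)  # [0.5, 0.75]
```

Because every comment is reduced to a single averaged vector, two comments with very different word order end up with the same features; this is one reason the later topic modeling step switches to term counts.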

### Select 1,000,000 rows to build the classifier

In [24]:
from pyspark.sql.functions import rand 

dataset.orderBy(rand(seed=0)).createOrReplaceTempView("table")
sampled = spark.sql("select * from table limit 1000000")

#### Label the data
This is an unlabeled dataset, and we want to train a classifier to identify cat and dog owners, so the first step is to label each comment. <br>
(1) Label a comment 1 when its text indicates that the user has a dog or cat (for example, it contains "my dog" or "i have a cat"). <br>
(2) Label a comment 0 when it contains no such phrase. <br>
Combine (1) and (2) as our training dataset; the rest of the dataset is the data we predict on. <br>

In [26]:
# Flag comments whose text suggests the user owns a dog or cat
from pyspark.sql.functions import when
from pyspark.sql.functions import col

sampled = sampled.withColumn("label", \
                           (when(col("comment").like("%my dog%"), 1) \
                            .when(col("comment").like("%my dogs%"), 1) \
                           .when(col("comment").like("%i have a dog%"), 1) \
                           .when(col("comment").like("%my cat%"), 1) \
                           .when(col("comment").like("%my cats%"), 1) \
                           .when(col("comment").like("%i have a cat%"), 1) \
                           .when(col("comment").like("%my puppy%"), 1) \
                           .when(col("comment").like("%i have a puppy%"), 1) \
                           .when(col("comment").like("%my pup%"), 1) \
                            .when(col("comment").like("%i have a pup%"), 1) \
                           .when(col("comment").like("%my kitty%"), 1) \
                           .when(col("comment").like("%my kitten%"), 1) \
                           .when(col("comment").like("%my pussy%"), 1) \
                           .when(col("comment").like("%i have a kitty%"), 1)
                           .when(col("comment").like("%i have a kitten%"), 1)
                           .when(col("comment").like("my own"), 1)  # no wildcards: matches only a comment that is exactly "my own"
                           .otherwise(0)))
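
The labeling rules above depend on how SQL `LIKE` treats wildcards, so here is a small plain-Python emulation to show the behavior. It is a sketch under simplifying assumptions (no escape characters, helper name `sql_like` is hypothetical), not Spark's implementation: `%` matches any run of characters, `_` matches one character, and the pattern must match the whole string.

```python
import re

def sql_like(text, pattern):
    """Minimal emulation of SQL LIKE: '%' is any run of characters, '_' is one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), text, flags=re.DOTALL) is not None

comment = "i have a kitty and she loves this channel"
print(sql_like(comment, "i have a kitty"))            # False: no wildcards, whole string must match
print(sql_like(comment, "%i have a kitty%"))          # True
print(sql_like("my dogs are loud", "%my dog%"))       # True: "%my dog%" already covers "my dogs"
print(sql_like("that is my caterpillar", "%my cat%")) # True: substring rules can over-match
```

Two practical consequences follow: a pattern without `%` behaves like an equality test, and substring patterns can produce false positives that end up as noisy labels.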

In [27]:
sampled.show(10)

### Downsampling the dataset

Note that the number of negative labels is around 100 times the number of positive labels, so we need to downsample the negative labels. As a rule of thumb, the gap should be no more than 10 times; here I balance them to a ratio of about 1:2 (1 positive : 2 negative).
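
The 1:2 downsampling described above can be sketched in plain Python as follows. This is an illustration on made-up rows, not the Spark code used in this notebook; the function name and the fixed `seed` are assumptions added so the sample is reproducible.

```python
import random

def downsample_negatives(rows, ratio=2, seed=0):
    """Keep all positives and at most `ratio` times as many randomly chosen negatives."""
    positives = [r for r in rows if r["label"] == 1]
    negatives = [r for r in rows if r["label"] == 0]
    keep = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # seeded so repeated runs pick the same negatives
    return positives + rng.sample(negatives, keep)

# Toy data: 10 positives and 990 negatives, roughly the 1:100 imbalance described above.
rows = [{"id": i, "label": 1 if i % 100 == 0 else 0} for i in range(1000)]
balanced = downsample_negatives(rows)
n_pos = sum(r["label"] for r in balanced)
n_neg = len(balanced) - n_pos
print(n_pos, n_neg)  # 10 20
```

Seeding the sampler matters when results are compared across runs, since a different negative sample changes both the training set and the reported metrics.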

In [30]:
own_pets = sampled.filter(col('label')==1)
no_pets = sampled.filter(col('label')==0)
print("Number of confirmed users who own dogs or cats in training data set: ", own_pets.count())
print("Number of confirmed users who don't have pets in training data set: ", no_pets.count())

In [31]:
from pyspark.sql.functions import rand 
no_pets.orderBy(rand()).createOrReplaceTempView("table")

Num_Pos_Label = own_pets.count() 
Num_Neg_Label = no_pets.count()

df_no_pets_down = spark.sql("select * from table limit {}".format(Num_Pos_Label*2))


In [32]:
print('Now after balancing the labels, we have ')   
print('Positive label: ', Num_Pos_Label)
print('Negative label: ', df_no_pets_down.count())

In [33]:
df_model = own_pets.union(df_no_pets_down)
df_model.show(10)

# Part 2. Model Training and Evaluation

In [35]:
train, test = df_model.randomSplit([0.8, 0.2], seed=12345)

### Logistic Regression

In [37]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression(featuresCol="features",labelCol="label" , maxIter=10, regParam=0.1, elasticNetParam=0.8)

# Fit logistic regression on the training set with the fixed hyperparameters above.
lrModel = lr.fit(train)

In [38]:
summary_lr = lrModel.summary
# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
import seaborn as sns
import matplotlib.pyplot as plt
roc = summary_lr.roc.toPandas()
plt.plot(roc['FPR'],roc['TPR'])
plt.plot([0, 1], [0, 1], 'k--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve')
plt.show()

In [39]:
print("areaUnderROC: " + str(summary_lr.areaUnderROC))

In [40]:
# Set the model threshold to maximize F-Measure (computed on the training summary)
fMeasure = summary_lr.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
                .select('threshold').head()['threshold']
# Make predictions on the test data using the threshold selected above.
predictions_lr = lrModel.transform(test,{lrModel.threshold: bestThreshold})
predictions_lr.show(10)
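
To show what "the threshold that maximizes F-Measure" means, here is a plain-Python sketch on a handful of made-up scores. It is not the Spark summary API; the function name is hypothetical, and treating a score equal to the threshold as positive (`>=`) is an assumption of this sketch.

```python
def best_f1_threshold(scores, labels):
    """Return (best_f1, threshold), predicting positive when score >= threshold."""
    best_f1, best_threshold = 0.0, None
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold

# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
f1, threshold = best_f1_threshold(scores, labels)
print(threshold, round(f1, 3))  # 0.4 0.857
```

In this toy case the best threshold is below 0.5: accepting one false positive is worth it because it recovers the third true positive. Since the threshold is tuned on the same data the model was trained on, a held-out validation set would give a less optimistic choice.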

In [41]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator


def get_evaluation_result(predictions):
  evaluator = BinaryClassificationEvaluator(
      labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
  AUC = evaluator.evaluate(predictions)

  TP = predictions[(predictions["label"] == 1) & (predictions["prediction"] == 1.0)].count()
  FP = predictions[(predictions["label"] == 0) & (predictions["prediction"] == 1.0)].count()
  TN = predictions[(predictions["label"] == 0) & (predictions["prediction"] == 0.0)].count()
  FN = predictions[(predictions["label"] == 1) & (predictions["prediction"] == 0.0)].count()

  accuracy = (TP + TN)*1.0 / (TP + FP + TN + FN)
  precision = TP*1.0 / (TP + FP)
  recall = TP*1.0 / (TP + FN)


  print ("True Positives:", TP)
  print ("False Positives:", FP)
  print ("True Negatives:", TN)
  print ("False Negatives:", FN)
  print ("Test Accuracy:", accuracy)
  print ("Test Precision:", precision)
  print ("Test Recall:", recall)
  print ("Test AUC of ROC:", AUC)

print("Prediction result summary for Logistic Regression Model:  ")
get_evaluation_result(predictions_lr)
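
The function above divides by `TP + FP` and `TP + FN`, which fails when a model predicts no positives or the test set contains none. A minimal sketch of the same metrics with guarded division, plus F1, is shown below. The function name and the confusion counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1, returning 0.0 when a denominator is zero."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical confusion counts, not results from this notebook.
metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)
print(metrics)

# A degenerate model that never predicts the positive class no longer raises an error.
degenerate = classification_metrics(tp=0, fp=0, tn=100, fn=5)
print(degenerate["precision"], degenerate["recall"])  # 0.0 0.0
```

Note how accuracy stays high in the degenerate case even though no owner is found, which is why precision and recall are reported alongside it for imbalanced labels.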

### Random Forest

In [43]:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Define a random forest classifier (rfModel is the untrained estimator).
rfModel = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=15)

# Fit the classifier on the training set.
model = rfModel.fit(train)

# Make predictions.
predictions_rf = model.transform(test)

# Select example rows to display.
predictions_rf.show(10)

In [44]:
print("Prediction result summary for Random Forest Model:  ")
get_evaluation_result(predictions_rf)

### Gradient Boosted Trees

In [46]:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Define a gradient-boosted trees (GBT) classifier (gbtModel is the untrained estimator).
gbtModel = GBTClassifier(featuresCol="features",labelCol="label", maxIter=10)

# Fit the classifier on the training set.
model = gbtModel.fit(train)

# Make predictions.
predictions_gbt = model.transform(test)

# Select example rows to display.
predictions_gbt.show(10)


In [47]:
print("Prediction result summary for Gradient Boosted Model:  ")
get_evaluation_result(predictions_gbt)

# Part 3. Hyperparameter Tuning

#### Tune hyperparameters for the GBT model

In [50]:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import numpy as np

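# Tune parameters using grid search and cross validation.
# The grid must reference params of the estimator being tuned (gbtModel);
# GBTClassifier has no numTrees, so maxIter sets the number of boosting rounds.
paramGrid_gbt = ParamGridBuilder()\
               .addGrid(gbtModel.maxIter, [int(x) for x in np.linspace(start = 10, stop = 50, num = 3)]) \
               .addGrid(gbtModel.maxDepth, [int(x) for x in np.linspace(start = 5, stop = 25, num = 3)]) \
               .build()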

evaluator=BinaryClassificationEvaluator()

crossval_gbt = CrossValidator(
               estimator=gbtModel,
               estimatorParamMaps=paramGrid_gbt,
               evaluator=evaluator,
               numFolds=3)
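
Grid search with cross validation multiplies the number of models trained, which is worth estimating before starting a long run. The sketch below enumerates a 3 x 3 grid with the same values as above and counts the fits for 3 folds. It is plain Python rather than the Spark API, and the key name `ensemble_size` is a hypothetical placeholder for the first tuned parameter.

```python
from itertools import product

# Same grid values as the cell above; parameter names are placeholders.
grid = {
    "ensemble_size": [10, 30, 50],
    "maxDepth": [5, 15, 25],
}
num_folds = 3

# Every combination of one value per parameter.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
total_fits = len(combos) * num_folds

print(len(combos))   # 9
print(total_fits)    # 27
print(combos[0])     # {'ensemble_size': 10, 'maxDepth': 5}
```

On top of these 27 fits, the cross validator refits the best combination once on the whole training set, so trimming the grid or the depth range is the most direct way to reduce running time.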

#### Get the best model with the best parameters

In [52]:
#Best model with tuned parameters
cvModel_gbt = crossval_gbt.fit(train)

# Make predictions.
predictions = cvModel_gbt.transform(test)

# Select example rows to display.
predictions.show(10)

print("Prediction result summary on test data for Gradient Boosted Model with tuned parameters:  ")
get_evaluation_result(predictions)

In [53]:
# Apply the tuned model to the full labeled dataset (which includes the training rows) and evaluate
predictions_full = cvModel_gbt.transform(df_model)
print("Prediction result summary on full dataset for Gradient Boosted Model with tuned parameters:  ")
get_evaluation_result(predictions_full)

In [54]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import itertools

# Define a function to print and plot the confusion matrix
def plot_confusion_matrix(predictions,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    y_true =predictions.select("label")
    y_true = y_true.toPandas()
    y_pred = predictions.select("prediction")
    y_pred = y_pred.toPandas()
    class_temp = predictions.select("label").groupBy("label")\
                        .count().sort('count', ascending=False).toPandas()
    class_names = class_temp["label"].values.tolist()
    
    cm = confusion_matrix(y_true, y_pred, labels = class_names)
   
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names)
    plt.yticks(tick_marks, class_names)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

In [55]:
plot_confusion_matrix(predictions_full)

We will use cvModel_gbt as our classifier to classify all the users and base our analysis on the result.

# Part 4. Model Application

#### Classify the full user data to identify pet owners

In [59]:
predictions= cvModel_gbt.transform(dataset)
pet_owners = predictions.filter(col('prediction')==1.0)
pet_owners.show(10)

#### Transform the full user data to a TF-IDF matrix
In order to build an LDA model to analyze the topics of pet owners, we use TF-IDF instead of Word2Vec to transform the user comments, because TF-IDF assigns unequal weights to words within documents. <br>
The intuition is that if a word occurs multiple times in a document, we should boost its relevance, as it is likely more meaningful than words that appear fewer times (TF). At the same time, if a word occurs many times in a document but also in many other documents, it is probably just a frequent word rather than a relevant or meaningful one (IDF).
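
The TF-IDF intuition above can be checked on a tiny corpus. The sketch below is plain Python on made-up documents, not the Spark `CountVectorizer`/`IDF` stages. It assumes the smoothed form idf = log((N + 1) / (df + 1)), which I believe matches Spark's `IDF`, but treat that as an assumption.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using raw counts and smoothed IDF."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]

# Made-up tokenized comments.
docs = [
    ["cat", "video", "cat"],
    ["dog", "video"],
    ["cat", "video", "fish"],
]
weights = tf_idf(docs)
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

"video" appears in every document, so its weight is zero everywhere, while "dog" and "fish", which each appear in a single document, receive the highest per-occurrence weight. "cat" is boosted in the first document because it occurs twice there.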

In [61]:
#Define a function to preprocess the text
#Remove stopwords

from pyspark.ml.feature import CountVectorizer, IDF, StopWordsRemover
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def text_process(df):
  # Use Spark's default stop word list
  remover = StopWordsRemover(inputCol="words", outputCol="words_no_stopw")

  # Both UDFs return an array of strings, so the column stays usable by CountVectorizer
  alpha_udf = udf(lambda tokens: [token for token in tokens if token.isalpha()], ArrayType(StringType()))
  length_udf = udf(lambda tokens: [token for token in tokens if len(token) > 2], ArrayType(StringType()))

  df = remover.transform(df)  # remove stop words
  #df = df.withColumn("words_no_stopw", alpha_udf("words_no_stopw"))  # optionally keep alphabetic tokens only
  df = df.withColumn("words_no_stopw", length_udf("words_no_stopw"))  # drop tokens of one or two characters
  return df

In [62]:
pet_owners = text_process(pet_owners)

pet_owners.show(10)

Transform the data: CountVectorizer takes this data and returns a sparse matrix of term frequencies attached to the original DataFrame. The same goes for IDF.

In [64]:
pet_owners = pet_owners.select("words_no_stopw")

In [65]:
# TF
cv = CountVectorizer(inputCol="words_no_stopw", 
                     outputCol="raw_features",  
                     vocabSize = 5000,
                     minDF=5.0) # the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.

cvmodel = cv.fit(pet_owners)
result_cv = cvmodel.transform(pet_owners)
result_cv.show()
print(len(cvmodel.vocabulary))  # vocabulary size 

In [66]:
# IDF
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_cv)
result_tfidf = idfModel.transform(result_cv) 
result_tfidf.show()

#### Build an LDA model - topic modeling

##### Why LDA?
LDA trades off two “conflicting” goals: <br>
1: For each document, allocate its words to as few topics as possible. <br>
2: For each topic, assign high probability to as few terms as possible. <br>
LDA is a probabilistic model of text used to find topics that describe a corpus.<br>
LDA casts the problem of discovering themes in large document collections as a posterior inference problem. LDA lets us visualize the hidden thematic structure in large collections and generalize new data to fit into that structure.

In [69]:

from pyspark.ml.clustering import LDA
num_topics = 8
lda = LDA(k = num_topics,
          maxIter = 50 #number of iterations
          )

ldaModel = lda.fit(result_tfidf)

##### Print the top 20 words in each topic

In [71]:
# Print topics and top-weighted terms
topics = ldaModel.describeTopics(maxTermsPerTopic=20) # returns, per topic, the indices of its top terms in the CountVectorizer vocabulary
vocabArray = cvmodel.vocabulary

ListOfIndexToWords = udf(lambda wl: list([vocabArray[w] for w in wl]))
#FormatNumbers = udf(lambda nl: ["{:1.4f}".format(x) for x in nl])

topics_words = topics.select(ListOfIndexToWords(topics.termIndices).alias('words'))
topics_words.show(truncate=False, n=num_topics)

##### Visualize the wordcloud of each topic

In [73]:
!pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
#Define a function to draw wordclouds of topics

def wordcloud(topics):
  wordcloud = []
  for i in range(len(topics)):
    text = str(topics.iloc[i,0])
    #text = " ".join([(k + " ")*v for k,v in topics.iloc[i].items()])
    #temp_cloud = WordCloud().generate(text)
    temp_cloud = WordCloud(background_color="white", max_words=10000, collocations = False,
               contour_width=3, contour_color='steelblue',max_font_size=40)
    temp_cloud = temp_cloud.generate(text) # Generate a word cloud image
    if len(wordcloud) == 0:
      wordcloud = [temp_cloud]
    else:
      wordcloud.append(temp_cloud)
  # Display the generated image:
  # the matplotlib way:
  w=10
  h=10
  fig=plt.figure(figsize=(40,10))
  columns = 4
  rows = 2
  for i in range(1, columns*rows +1):
      img = wordcloud [i-1]
      fig.add_subplot(rows, columns, i)
      plt.imshow(img)
      plt.axis("off")    
  plt.show()
  
 

In [74]:
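# Collect the per-topic word lists into a pandas DataFrame for plotting
topics_pdf = topics_words.toPandas()
wordcloud(topics_pdf)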

#### 4. Identify Creators With Cat And Dog Owners In The Audience

In [76]:
#Identify cat and dog owners
predictions= cvModel_gbt.transform(dataset)
pet_owner= predictions.filter(col('prediction')==1.0).select('creator_name','userid','comment')
pet_owner.show(truncate=False)

In [77]:
#Compute the fraction of cat or dog owners
owner_fr = round(pet_owner.count()/predictions.count()*100,2)
print(str(owner_fr)+ '% of the total users own cats or dogs in this Youtube comment dataset.')

#### 5. Analysis and Future work

Step 1: Identify Cat And Dog Owners
Find the users who are cat and/or dog owners.

Step 2: Build And Evaluate Classifiers
Build classifiers for the cat and dog owners and measure the performance of the classifiers.

Step 3: Classify All The Users
Apply the cat/dog classifiers to all the users in the dataset. Estimate the fraction of all users
who are cat/dog owners.

Step 4: Extract Insights About Cat And Dog Owners
Find topics important to cat and dog owners.

Step 5: Identify Creators With Cat And Dog Owners In The Audience
Find creators with the most cat and/or dog owners. Find creators with the highest statistically
significant percentages of cat and/or dog owners.
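
Step 5 is not implemented in the cells above, so here is one possible way to approach it, sketched in plain Python on made-up rows. It counts distinct users per creator, computes the share flagged as owners, and compares that share with the overall owner share using a one-sided z-test under a normal approximation. The function name, the `min_users` cutoff, and the choice of test are all assumptions of this sketch, not the method used in this notebook.

```python
import math
from collections import defaultdict

def creator_owner_stats(rows, min_users=5):
    """rows: list of (creator_name, userid, is_owner) tuples, one per comment."""
    rows = [r for r in rows if r[1] is not None]  # skip comments without a user id
    owner_ids = {uid for _, uid, is_owner in rows if is_owner}
    audience = defaultdict(set)
    for creator, uid, _ in rows:
        audience[creator].add(uid)
    all_ids = set().union(*audience.values()) if audience else set()
    if not all_ids:
        return []
    baseline = len(owner_ids) / len(all_ids)  # overall share of owners among users

    stats = []
    for creator, users in audience.items():
        n = len(users)
        if n < min_users:
            continue  # too few users for the approximation to be meaningful
        owners = len(users & owner_ids)
        share = owners / n
        se = math.sqrt(baseline * (1 - baseline) / n)
        z = (share - baseline) / se if se > 0 else 0.0
        p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z), one-sided
        stats.append({"creator": creator, "users": n, "owners": owners,
                      "share": share, "p_value": p_value})
    return sorted(stats, key=lambda s: s["p_value"])

# Made-up data: creator A has 8 owners among 10 users, creator B has 1 among 10.
toy_rows = ([("A", uid, uid <= 8) for uid in range(1, 11)]
            + [("B", uid, uid == 11) for uid in range(11, 21)])
ranked = creator_owner_stats(toy_rows)
for row in ranked:
    print(row)
```

Creators at the top of the ranking have an owner share that is unlikely to arise by chance given the overall rate. With thousands of creators, some correction for multiple comparisons would be needed before calling any single result significant, and the normal approximation is rough for small audiences.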

# Report
**Motivation:** I own a cat myself, and I love watching cat-related videos on YouTube. As I watched more and more videos, YouTube's algorithm learned my viewing habits and started recommending more related content. But I feel that once a viewer is identified as a cat owner, much more can be done beyond video recommendations. <br>

**Data and Method:** In this project, I used a sample of the YouTube comment text data to build a classifier that predicts cat and dog owners. Using the trained model, I then identified the cat and dog owners in the full dataset, built an LDA (Latent Dirichlet Allocation) model to group their comments into 8 topics, and plotted a word cloud for each topic. <br>
For example, after identifying the cat and dog owners, we can analyze their comments in depth to gain insights and find potential hot topics among this cohort. These findings can inform product design for pet service/product companies or pet video creators: they show what is heavily discussed and trending right now, so products and content can be tailored to this specific group of users. <br>
##### Let's walk through how this project was done!
**Step 1: Data Exploration and Cleaning**<br>
(1) There are 5,820,035 rows of user comments in total. Rows with a null user id have been kept, but those with null comments have been dropped. Comments have been converted to lowercase.<br> 
(2) Selected 2,000,000 rows of data for the whole project to keep the running time manageable. (I will probably rerun it on the whole dataset in the future.)<br> 
(3) Tokenized the comments and created Word2Vec features in order to train a classification model. <br> 
(4) Randomly sampled 1,000,000 rows of data to train the classifier and labeled the comments as 'has cat&dog' or 'no cat&dog' using keyword rules.<br> 
(5) This is a highly imbalanced dataset (negative labels are around 100 times more common than positive labels), so I downsampled the negative labels (1 positive : 2 negative).

**Step 2: Model Selection and Tuning**<br> 
(1) Split the data into a training set and a test set.<br>
(2) Trained classifiers using three types of machine learning algorithm: logistic regression, random forest, and gradient boosting.<br>
(3) Used ROC/AUC, confusion matrix, accuracy, precision and recall as metrics to evaluate the three models' performance.<br>
(4) Selected the gradient boosted algorithm for hyperparameter tuning since it had the best performance on the test data set.<br>
(5) Tuned the hyperparameters with grid search and re-evaluated the model performance: Accuracy (0.87), Precision (0.8), Recall (0.82), AUC of ROC (0.94).

**Step 3: Model Application - LDA**<br>
(1) Applied the best GBT model to the full dataset and identified users who own dogs or cats. 10.74% of users own dogs or cats. <br>
(2) Transformed the dog and cat owners' text data into TF-IDF matrices and built an LDA model with the number of topics set to 8. (This can be further improved.)<br>
(3) Obtained the LDA topics and visualized a word cloud for each topic based on its word frequencies.

**Step 4: Insights gained from topic modeling**<br>
(1) Based on the word clouds, the topics are not that easy to interpret, but I will try to extract as much information as possible using my business sense.<br>
(2) Semantic meaning of each topic: cat/dog crying; kitty eating fish; I love watching this dog/animal YouTube channel; cats are cute/adorable (chasing birds); dogs (huskies) are cute (chasing rabbits?); gender differences for dogs; horse-related content; people care for dogs/animals.

**Recommendation based on the insights:**<br>
(1) Dog/cat-related YouTube channels are very popular among dog/cat owners. This describes their viewing habits, and YouTubers can post more such videos to attract these users.<br>
(2) People love watching kittens eating fish or chasing birds, dogs (huskies in particular) chasing rabbits, and content related to horse riding. YouTubers can incorporate this content into their videos. <br>
(3) Following up on the topics above, ads for dog/cat competitions and social events can target this cohort, and horse-riding companies can see this cohort as potential customers as well, since people watching those videos may develop an intention to take up those activities in real life.<br>
(4) People are discussing pet gender and pet care, which suggests that dog and cat owners pay more attention to animal care. Ads about the pet community (donation, adoption) can target those users.

**Areas for improvement in this project:**<br>
(1) Using the whole dataset may give better results for both models.<br>
(2) Find a better way to label the comments; this can be done by researching pet owner behavior further.<br>
(3) Get more user features beyond user id, such as age, location, occupation, family size, and behavioral features like YouTube watch history. With those data, we can do user segmentation analysis before and after classification to better identify owners and analyze their preferences.<br>
(4) For the LDA part, we can calculate the model perplexity to select the optimal number of topics. We can also do more feature engineering to make the text data easier to process.
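
For reference, the perplexity mentioned in point (4) is commonly defined on a held-out set of M documents as below, where w_d denotes the words of document d and N_d its length. This is the standard textbook definition, given here as background rather than as something computed in this notebook. Lower values indicate a better fit.

```latex
\mathrm{perplexity}(D_{\text{test}})
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

In practice, one could fit LDA models for several values of k on a training split and compare a perplexity estimate on a held-out split. Spark's `LDAModel` exposes a `logPerplexity` method that reports an upper bound on the log of this quantity.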