### YouTube comments analysis

In this notebook, we work with a dataset of user comments on YouTube videos related to animals or pets. We attempt to identify cat or dog owners from these comments, find the topics that matter to them, and then identify the video creators whose audiences contain the most cat or dog owners.

The dataset is 240 MB compressed. The file is comma-separated, with a header line defining the field names, listed here:
● creator_name. Name of the YouTube channel creator.
● userid. Integer identifier for the users commenting on the YouTube channels.
● comment. Text of the comments made by the users.

Project Highlights

Step 1: Identify Cat And Dog Owners
Find the users who are cat and/or dog owners.

Step 2: Build And Evaluate Classifiers
Build classifiers for the cat and dog owners and measure the performance of the classifiers.

Step 3: Classify All The Users
Apply the cat/dog classifiers to all the users in the dataset. Estimate the fraction of all users
who are cat/dog owners.

Step 4: Extract Insights About Cat And Dog Owners
Find topics important to cat and dog owners.

Step 5: Identify Creators With Cat And Dog Owners In The Audience
Find creators with the most cat and/or dog owners. Find creators with the highest statistically
significant percentages of cat and/or dog owners.

Project Purposes:
- Identify pet owners so that we can send relevant advertisements to this user group (e.g., pet food).
- Understand which topics pet owners are really interested in so that we can recommend related videos to them and advise video creators who have many pet-owner fans.

#### 0. Data Exploration and Cleaning

In [4]:
df_clean=spark.read.csv("/FileStore/tables/animals_comments.csv",inferSchema=True,header=True)
df_clean.show(10)

In [5]:
df_clean.count() 

In [6]:
df_clean = df_clean.na.drop(subset=["comment"])
df_clean.count()

In [7]:
df_clean.show()

In [8]:
# Label comments whose text suggests that the author owns a dog or a cat
from pyspark.sql.functions import col, when

# Other heuristics could be substituted here to extract the label

df_clean = df_clean.withColumn("label", \
                           (when(col("comment").like("%my dog%"), 1) \
                           .when(col("comment").like("%I have a dog%"), 1) \
                           .when(col("comment").like("%my cat%"), 1) \
                           .when(col("comment").like("%I have a cat%"), 1) \
                           .when(col("comment").like("%my puppy%"), 1) \
                           .when(col("comment").like("%my pup%"), 1) \
                           .when(col("comment").like("%my kitty%"), 1) \
                           .when(col("comment").like("%my pussy%"), 1) \
                           .otherwise(0)))

In [9]:
df_clean.show()
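
The `like` patterns used for labeling are case-sensitive and match substrings, so "My dog" is missed while "my puppet" matches `%my pup%`. As a minimal sketch of a stricter rule, the plain-Python labeler below uses a case-insensitive regular expression with word boundaries. The names `OWNER_PATTERN` and `label_comment` are hypothetical, and the phrase list simply mirrors the patterns above. In Spark, a similar expression could probably be applied with `Column.rlike`, but that is not what this notebook does.

```python
import re

# Hypothetical pattern mirroring the LIKE phrases used in the labeling cell.
# \b word boundaries stop "my pup" from matching "my puppet".
OWNER_PATTERN = re.compile(
    r"\b(?:my (?:dog|cat|puppy|pup|kitty|pussy)s?|i have a (?:dog|cat))\b",
    re.IGNORECASE,
)

def label_comment(comment):
    """Return 1 if the comment contains an ownership phrase, else 0."""
    if comment is None:
        return 0
    return 1 if OWNER_PATTERN.search(comment) else 0

# Toy comments, for illustration only.
samples = ["My Dog is so cute", "I love my puppet show", "I HAVE A CAT", None]
print([label_comment(s) for s in samples])
```

This is still a heuristic: it only changes which surface strings count as a match and does not address negation or sarcasm.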

#### 1. Data preprocessing and Build the classifier

In [11]:
from pyspark.ml.feature import RegexTokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

# regular expression tokenizer
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")

word2Vec = Word2Vec(inputCol="words", outputCol="features")

In [12]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regexTokenizer, word2Vec])

# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(df_clean)
dataset = pipelineFit.transform(df_clean)

In [13]:
dataset.show()

In [14]:
# Split the positive comments (label == 1) 70/30 into training and test sets.
(pos_train, pos_test) = dataset.filter(col('label') == 1).randomSplit([0.7, 0.3], seed=100)
# Downsample the negative comments (label == 0): 0.5% for training,
# then 0.2% of the remainder for testing.
(neg_train, neg_rest) = dataset.filter(col('label') == 0).randomSplit([0.005, 0.995], seed=100)
(neg_test, _) = neg_rest.randomSplit([0.002, 0.998], seed=100)

In [15]:
trainingData = pos_train.union(neg_train)
testData = pos_test.union(neg_test)

In [16]:
print("Dataset Count: " + str(dataset.count()))
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))
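
The split keeps 70% of the positive (label 1) comments for training and 30% for testing, while the negative (label 0) comments are heavily downsampled: 0.5% go to training and 0.2% of the remaining 99.5% go to testing. The sketch below works through that arithmetic. The function name and the row counts are hypothetical, since the notebook's actual counts are not shown, and `randomSplit` only yields these sizes approximately.

```python
def expected_split_sizes(n_pos, n_neg):
    """Expected row counts per split; randomSplit sizes are only approximate."""
    neg_rest = 0.995 * n_neg  # negatives left over after the training sample
    return {
        "pos_train": 0.7 * n_pos,
        "pos_test": 0.3 * n_pos,
        "neg_train": 0.005 * n_neg,
        "neg_test": 0.002 * neg_rest,
    }

# Hypothetical counts, chosen only for illustration.
sizes = expected_split_sizes(n_pos=40_000, n_neg=5_000_000)
train_pos_share = sizes["pos_train"] / (sizes["pos_train"] + sizes["neg_train"])
print(sizes)
print("Share of positives in training data:", round(train_pos_share, 3))
```

Because the negatives are downsampled, the class balance seen by the classifier differs from the balance in the full dataset, which matters when the model is later applied to every comment.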

##### LogisticRegression

In [18]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = ParamGridBuilder()\
    .addGrid(lr.aggregationDepth,[2,5,10])\
    .addGrid(lr.elasticNetParam,[0.0, 0.5, 1.0])\
    .addGrid(lr.fitIntercept,[False, True])\
    .addGrid(lr.maxIter,[10, 100, 1000])\
    .addGrid(lr.regParam,[0.01, 0.5, 2.0]) \
    .build()

##### Parameter Tuning and K-fold cross-validation

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator=BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",labelCol="label")
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

In [21]:
cvModel = cv.fit(trainingData)
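
To get a feel for the cost of this search, the sketch below counts the parameter combinations in the logistic regression grid and multiplies by the number of folds. It is plain Python that only mirrors the grid values from the cell above; the dictionary `lr_grid` is a hypothetical stand-in for the `ParamGridBuilder` output.

```python
from itertools import product

# Same values as the ParamGridBuilder grid for logistic regression.
lr_grid = {
    "aggregationDepth": [2, 5, 10],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "fitIntercept": [False, True],
    "maxIter": [10, 100, 1000],
    "regParam": [0.01, 0.5, 2.0],
}
num_folds = 5

# Every combination is fitted once per fold.
combinations = list(product(*lr_grid.values()))
total_fits = len(combinations) * num_folds
print(len(combinations), "combinations,", total_fits, "fits during cross-validation")
```

The count grows multiplicatively with every added parameter. `aggregationDepth` only controls how gradients are aggregated across the cluster, so removing it from the grid would likely cut the work to a third without changing which model is selected.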

In [22]:
predictions = cvModel.transform(testData)
evaluator.evaluate(predictions)
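
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number: it is the probability that a randomly chosen positive comment receives a higher score than a randomly chosen negative one. The sketch below computes AUC from that pairwise definition on toy data. The helper `roc_auc` is hypothetical and is not how Spark computes the metric internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, for illustration only.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic, so it is only suitable for small samples; it is meant to clarify the metric, not to replace the evaluator.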

In [23]:
best_model_lr = cvModel.bestModel
print("**Best Model**")
print(" ElasticNetParam:" + str(best_model_lr._java_obj.parent().getElasticNetParam()))
print(" MaxIter:" + str(best_model_lr._java_obj.parent().getMaxIter()))
print(" RegParam:" + str(best_model_lr._java_obj.parent().getRegParam()))
print(" AggregationDepth:" + str(best_model_lr._java_obj.parent().getAggregationDepth()))
print(" fitIntercept:" + str(best_model_lr._java_obj.parent().getFitIntercept()))

##### RandomForest

In [25]:
from pyspark.ml.classification import RandomForestClassifier
# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 40, 60])
             .addGrid(rf.numTrees, [5, 20, 50])
             .build())

In [26]:
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cvModel_2 = cv.fit(trainingData)

In [27]:
predictions = cvModel_2.transform(testData)
evaluator.evaluate(predictions)

In [28]:
best_model_rf = cvModel_2.bestModel
print("**Best Model**")
print(" MaxDepth:" + str(best_model_rf._java_obj.parent().getMaxDepth()))
print(" MaxBins:" + str(best_model_rf._java_obj.parent().getMaxBins()))
print(" NumTrees:" + str(best_model_rf._java_obj.parent().getNumTrees()))

##### Gradient boosting

In [30]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol="label", featuresCol="features")
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [4, 6, 8])
             .addGrid(gbt.maxBins, [25, 30, 35])
             .addGrid(gbt.maxIter, [10, 20])
             .addGrid(gbt.minInfoGain, [0.0, 0.05])
             .build())

In [31]:
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cvModel_3 = cv.fit(trainingData)

In [32]:
predictions = cvModel_3.transform(testData)
evaluator.evaluate(predictions)

In [33]:
best_model_gbdt = cvModel_3.bestModel
print("**Best Model**")
print(" MaxDepth:" + str(best_model_gbdt._java_obj.parent().getMaxDepth()))
print(" MaxIter:" + str(best_model_gbdt._java_obj.parent().getMaxIter()))
print(" MaxBins:" + str(best_model_gbdt._java_obj.parent().getMaxBins()))
print(" minInfoGain:" + str(best_model_gbdt._java_obj.parent().getMinInfoGain()))

In [34]:
best_model_all = best_model_gbdt

According to the models above, the best model is the GBDT, with 20 MaxIter, 35 MaxBins, and 0 minInfoGain; its MaxDepth is the value printed above, chosen from the searched grid of 4, 6, and 8. We store this model as best_model_all.

#### 2. Classify All The Users

At this point we can predict from a single comment whether its author is a pet owner. In reality, though, one user can post many comments on YouTube, and he or she will not say "my cat" or "my dog" in every one of them. So I treat a user as a pet owner as long as at least one of his or her comments is predicted as a pet-owner comment. I also checked the average number of comments per user in this dataset: it is 2.29. Requiring two or three positive comments before calling someone a pet owner would give more confidence, but with so few comments per user that threshold is not practical. Therefore, given the size of the dataset, I use the naive rule: if any one of a user's comments is predicted as a pet-owner comment, the user is defined as a pet owner.
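
The user-level rule described here can be sketched in plain Python as follows. The function `owners_by_threshold` and its `min_positive` parameter are hypothetical: with `min_positive=1` it reproduces the naive "at least one positive comment" rule, and a larger value shows what the stricter alternative would look like.

```python
from collections import defaultdict

def owners_by_threshold(predictions, min_positive=1):
    """Return the user ids with at least `min_positive` comments predicted as pet-owner.

    `predictions` is an iterable of (userid, prediction) pairs, one per comment.
    """
    positives = defaultdict(int)
    for userid, prediction in predictions:
        positives[userid] += int(prediction == 1)
    return {user for user, count in positives.items() if count >= min_positive}

# Toy comment-level predictions, for illustration only.
rows = [(1, 1), (1, 0), (2, 0), (3, 1), (3, 1)]
print(owners_by_threshold(rows))                  # naive rule
print(owners_by_threshold(rows, min_positive=2))  # stricter rule
```

With an average of about 2.29 comments per user, a threshold of two or more would exclude most users simply because they commented only once or twice.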

In [38]:
predictions = best_model_gbdt.transform(dataset)
evaluator.evaluate(predictions)

In [39]:
predictions.createOrReplaceTempView('predict')

In [40]:
error_terms = spark.sql("SELECT comment, label, prediction FROM predict WHERE label != prediction")
display(error_terms)

comment,label,prediction
I shared this to my friends and mom the were lol,0,1.0
when I saw the end it said to adopt I saw different animal sites I was mad that they separated the cute little pups after being together for a long time,0,1.0
Holy crap. That is quite literally the most adorable pup Ive ever seen.,0,1.0
That mother cat looks like my own Im guessing she is a russian blue due to her looks and unusual coping skills.,0,1.0
cat drugs,0,1.0
I dont understand how you think she will make a good service dogs. SD are handled by a company here in Quebec (and given for free fully trained to people in need). For the first year they are fostered by families who expose them to as many things as possible and even then after a year the majority of them are deemed not fit for service work. They have to be ultra confident and never startled never afraid of anything etc... She seem like a good pet but the very opposite of what a service dog should be.,0,1.0
Im not allowed to have a dog because of money and my apartment doesnt allow dogs!! WHAT DO I DO!!!!???!?!?!?!?,0,1.0
Storm is like a giant the other dog bb8(bba babya baby8 i dont know his name.),0,1.0
Chestnut is so cute. Your videos areuoer helpful for me. Ii dont have a dog yet but learning to train a dog ahead of time is really good for me.,0,1.0
Name the Brindle one Brin! My aunt has a brindle dog named Brin!,0,1.0


In [41]:
# Get user-level labels. A pet owner will not say "my cat" or "my dog" in every comment,
# so a user is labeled as a pet owner if at least one of their comments is predicted positive.
# The query below lists the users predicted to be pet owners.
user_predict_true = spark.sql("SELECT userid as predict_user FROM predict WHERE prediction = 1 GROUP BY userid HAVING COUNT(*) > 0")
display(user_predict_true)

predict_user
141461.0
278459.0
284492.0
1001152.0
1005141.0
1127152.0
1581642.0
1588264.0
2352365.0
14269.0


In [42]:
print("Pet Owner number: " + str(user_predict_true.count()))

In [43]:
all_user = spark.sql("SELECT DISTINCT userid FROM predict")
print("The total number of users: " + str(all_user.count()))

From the analysis above, based on our best model, we find that 18.26% of the users who commented in this dataset are pet owners. Identifying these users lets us target them with advertisements they may be interested in and recommend videos to them.

I manually went through the first 100 wrongly predicted comments one by one and found that:
- 16% of those errors were mislabeled: people didn't use phrases such as "my cat" or "my dogs"; instead they used "mine" or "my own". These comments are in fact correctly predicted by our word2vec-based model.
- Another 20% are comments containing phrases such as "not have a dog", "not have a cat", or "want to have a dog". They contain words such as "dogs" and "cats", but the model fails to pick up the "not" in front of them.
- 64% of the errors are due to other issues.

So to tune the model further, we should focus on how the comments are labeled and on reducing the noise in the dataset.
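
One possible way to reduce the negation errors noted above is to flag comments in which a negation or wish word appears shortly before "have a dog" or "have a cat", and to exclude or relabel them before training. The sketch below is a rough heuristic under that assumption. The pattern, the three-word window, and the function name `has_negated_ownership` are all hypothetical and were not part of the original analysis.

```python
import re

# Hypothetical heuristic: a negation/wish word, up to three intervening words,
# then "have a/any dog|cat". The window is short, so distant negations are missed
# and unrelated negations close by can produce false alarms.
NEGATED_OWNERSHIP = re.compile(
    r"\b(?:not|don'?t|never|cannot|can'?t|want to)\b"
    r"(?:\s+\w+){0,3}?\s+have\s+(?:a|any)\s+(?:dog|cat)s?\b",
    re.IGNORECASE,
)

def has_negated_ownership(comment):
    """Return True if the comment looks like a denial of, or wish for, pet ownership."""
    return bool(comment) and NEGATED_OWNERSHIP.search(comment) is not None

# Phrases taken from or modeled on the misclassified comments in this notebook.
examples = [
    "Im not allowed to have a dog because of money",
    "Ii dont have a dog yet but learning to train a dog ahead of time",
    "I want to have a dog",
    "I have a dog and two cats",
]
print([has_negated_ownership(e) for e in examples])
```

A rule like this addresses only the 20% of errors attributed to negation; the mislabeled and "other" categories would need different treatment.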

#### 3. Get insights about users

In [47]:
import pandas as pd
import pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from nltk.corpus import stopwords
import re
from pyspark.ml.feature import CountVectorizer , IDF

from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.clustering import LDA as MLlibLDA

In [48]:
data = sqlContext.read.format("csv") \
   .options(header='true', inferschema='true') \
   .load("/FileStore/tables/animals_comments.csv")

In [49]:
import nltk
nltk.download('stopwords')

In [50]:
StopWords = stopwords.words("english")
print(StopWords)

In [51]:
reviews = data.rdd.map(lambda x : x['comment']).filter(lambda x: x is not None)
StopWords = set(stopwords.words("english"))  # a set makes the membership test in the filter below fast
tokens = reviews                                                   \
    .map( lambda document: document.strip().lower())               \
    .map( lambda document: re.split(" ", document))          \
    .map( lambda word: [x for x in word if x.isalpha()])           \
    .map( lambda word: [x for x in word if len(x) > 3] )           \
    .map( lambda word: [x for x in word if x not in StopWords])    \
    .zipWithIndex()

In [52]:
df_txts = sqlContext.createDataFrame(tokens, ["list_of_words",'index'])
# TF
cv = CountVectorizer(inputCol="list_of_words", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cvmodel = cv.fit(df_txts)
result_cv = cvmodel.transform(df_txts)
# IDF
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_cv)
result_tfidf = idfModel.transform(result_cv) 
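
As a reference for what the IDF stage computes: Spark's `IDF` uses a smoothed inverse document frequency, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term, and the TF-IDF feature is the raw count multiplied by that weight. The sketch below reproduces the formula in plain Python. The function names and the toy numbers are hypothetical.

```python
import math

def spark_style_idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

def tfidf(term_count, num_docs, doc_freq):
    """TF-IDF weight of a term that occurs `term_count` times in one document."""
    return term_count * spark_style_idf(num_docs, doc_freq)

# Toy corpus size and document frequencies, for illustration only.
num_docs = 1000
common = tfidf(1, num_docs, doc_freq=400)
rare = tfidf(1, num_docs, doc_freq=9)
print("common term:", common)
print("rare term:", rare)
```

When a word appears once in a comment, its feature value equals its IDF weight, so rarer vocabulary words receive larger values.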

In [53]:
display(result_tfidf)

list_of_words,index,raw_features,features
"List(shared, friends)",0,"List(0, 5000, List(270, 2033), List(1.0, 1.0))","List(0, 5000, List(270, 2033), List(5.866854968997447, 7.947146489478987))"
"List(super, cute)",1,"List(0, 5000, List(13, 138), List(1.0, 1.0))","List(0, 5000, List(13, 138), List(3.736430822650456, 5.286628740396462))"
"List(stop, saying, youre, literally, dumb, common, sense, dont, kind, fucking, retarded, swear)",2,"List(0, 5000, List(5, 58, 110, 111, 158, 222, 336, 694, 753, 813, 858, 2276), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(5, 58, 110, 111, 158, 222, 336, 694, 753, 813, 858, 2276), List(3.4216412885358745, 4.5770731123207415, 5.093898519457022, 5.063030083228325, 5.4846381555170876, 5.682744747010301, 6.02590292253235, 6.734754416375869, 6.82880779738424, 6.927713443778851, 6.952563852829649, 8.107552520951748))"
"List(tenho, jiboia, largato)",3,"List(0, 5000, List(), List())","List(0, 5000, List(), List())"
"List(wanna, happened, pigs, please)",4,"List(0, 5000, List(26, 208, 262, 1252), List(1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(26, 208, 262, 1252), List(4.194149976937117, 5.6189921623849, 5.811147351494145, 7.445694103641105))"
"List(well, shit, hungry)",5,"List(0, 5000, List(52, 197, 991), List(1.0, 1.0, 1.0))","List(0, 5000, List(52, 197, 991), List(4.409234013044385, 5.609047927930331, 7.133089754625043))"
"List(said, adopt, different, animal, sites, separated, cute, little, pups, together, long, time)",6,"List(0, 5000, List(13, 20, 24, 72, 80, 93, 177, 373, 407, 1178, 4323), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(13, 20, 24, 72, 80, 93, 177, 373, 407, 1178, 4323), List(3.736430822650456, 4.03301686100273, 4.095129371181182, 4.683496141777895, 4.812096027717315, 4.912623437446961, 5.5317795078743295, 6.1393995337132825, 6.229059015591715, 7.34739198913707, 8.940689850186336))"
"List(holy, quite, literally, adorable, ever)",7,"List(0, 5000, List(51, 151, 336, 456, 505), List(1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(51, 151, 336, 456, 505), List(4.383282796889985, 5.334571690770321, 6.02590292253235, 6.341505865089581, 6.469215087904641))"
List(),8,"List(0, 5000, List(), List())","List(0, 5000, List(), List())"
"List(call, teddy, larry)",9,"List(0, 5000, List(154, 2677, 3622), List(1.0, 1.0, 1.0))","List(0, 5000, List(154, 2677, 3622), List(5.371157153704966, 8.378452828771039, 8.723337312686905))"


In [54]:
num_topics = 5
max_iterations = 50
lda_model = MLlibLDA.train(
  result_tfidf.select("index", "features").rdd.mapValues(MLlibVectors.fromML).map(list), k=num_topics, maxIterations=max_iterations
)

In [55]:
topics = lda_model.topicsMatrix()
vocabArray = cvmodel.vocabulary

wordNumbers = 15  # number of words per topic
topicIndices = sc.parallelize(lda_model.describeTopics(maxTermsPerTopic = wordNumbers))

def topic_render(topic):  # specify vector id of words to actual words
    terms = topic[0]
    result = []
    for i in range(wordNumbers):
        term = vocabArray[terms[i]]
        result.append(term)
    return result

In [56]:
topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

for topic in range(len(topics_final)):
    print ("Topic" + str(topic) + ":")
    term_list = []
    for term in topics_final[topic]:
      term_list.append(term)
    print(term_list)  

Therefore, we can see the five most frequent topics that those pet owners want to know about: more videos about cute animals, seeing them wiggling, etc. This shows that video creators should focus more on making videos about cute animals such as dogs and cats, or perhaps, as the topics suggest, coyotes.

#### 4. Identify Creators With Cat And Dog Owners In The Audience

In [59]:
creators = spark.sql("SELECT creator_name, count(DISTINCT userid) as number FROM predict WHERE prediction = 1 GROUP BY creator_name ORDER BY number DESC")
display(creators)

creator_name,number
Brave Wilderness,65579
The Dodo,41254
Taylor Nicole Dean,33800
Brian Barczyk,25496
Hope For Paws - Official Rescue Channel,22885
Gohan The Husky,17028
Vet Ranch,16485
Robin Seplut,16383
Cole & Marmalade,12230
stacyvlogs,10699


#### 5. Analysis and Future work

To summarize the work in this notebook:
- First, we obtained a good model (GBDT) with 96.7% AUC on the test set. We applied this model to all the users in the dataset and found that about 18.26% of our users are pet owners.
- Second, we applied LDA to analyze which topics those pet owners are really interested in. We noticed that they want to see more cute pets, wiggling, and even some cute animals other than cats and dogs.
- Third, we found the video creators with the largest number of pet-owner fans. The top five are: Brave Wilderness, The Dodo, Taylor Nicole Dean, Brian Barczyk, and Hope For Paws - Official Rescue Channel.

So now, we can:
- Send pet-related advertisements to targeted users accurately.
- Advise video creators who have many pet-owner fans on the topics their fans are interested in.
- Improve the user experience on YouTube.