### YouTube comments analysis

In this project, a dataset of user comments on YouTube videos about animals or pets was analyzed. The comments were first used to identify likely cat or dog owners. Topics of interest to those owners were then studied, and finally the video creators with the largest audience of cat or dog owners were identified.

#### 0. Data Exploration and Cleaning

In [3]:
# read data
df = spark.read.load("/FileStore/tables/animals_comments__1_-034b5.csv", format='csv', header = True, inferSchema = True)
df.show(10)
print('Raw data size', df.count())

In [4]:
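# Find users whose comments suggest they own a dog or a cat
# Substring patterns such as '%I have dog%' also match plurals ('I have dogs') and some typos
# Note: LIKE is case-sensitive, so 'My dog ...' is not matched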
from pyspark.sql.functions import size,col,count,when
from pyspark.sql.types import *

# Select potential pet owners
cond = (df["comment"].like("%my dog%") | df["comment"].like("%I have a dog%") | df["comment"].like("%my dogs%") | df["comment"].like("%I have dog%")
        | df["comment"].like("%my cat%") | df["comment"].like("%my cats%") | df["comment"].like("%I have a cat%") | df["comment"].like("%I have cat%") 
        | df["comment"].like("%my puppy%") | df["comment"].like("%my puppies%") | df["comment"].like("%my kitty%") | df["comment"].like("%my kitties%") 
        | df["comment"].like("%I have a kitty%") | df["comment"].like("%I have kitties%") | df["comment"].like("%I have a puppy%") | df["comment"].like("%I have puppies%"))

df_clean = df.withColumn('dog_cat',  cond)

# Data cleaning: drop rows with a NULL in any column
for column in df_clean.columns:
  df_clean = df_clean.filter(df_clean[column].isNotNull())
# label 1.0: comment matches a cat/dog owner pattern; label 0.0: no match
df_clean = df_clean.withColumn('label', col("dog_cat").cast(IntegerType()).cast('double')) 
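
The labeling rule can be illustrated in plain Python. This is a minimal sketch, assuming the same substring patterns as the Spark condition; the helper name `matches_owner_pattern` and the sample comments are hypothetical. It also shows that the match is case-sensitive, so a comment that begins with "My dog" is not flagged.

```python
# Substring patterns mirroring the LIKE conditions in the Spark cell
# ("my dog" already covers "my dogs", "my cat" covers "my cats")
OWNER_PATTERNS = [
    "my dog", "I have a dog", "I have dog",
    "my cat", "I have a cat", "I have cat",
    "my puppy", "my puppies", "my kitty", "my kitties",
    "I have a kitty", "I have kitties", "I have a puppy", "I have puppies",
]


def matches_owner_pattern(comment):
    """Return True if the comment contains any owner pattern (case-sensitive)."""
    return any(pattern in comment for pattern in OWNER_PATTERNS)


# Hypothetical sample comments, not taken from the dataset
samples = [
    "I love my dogs so much",
    "My dog is cute",
    "I have dogs and a parrot",
    "cute video",
]
labels = [float(matches_owner_pattern(c)) for c in samples]
```

A case-insensitive variant would lowercase both the comment and the patterns before matching, at the cost of changing which comments are labeled.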

##### 0.1 Tokenize
In this step, each comment is split into a list of lowercase word tokens.

In [6]:
# data preprocessing 
from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="tokenized", pattern="\\W")
df_clean = regexTokenizer.transform(df_clean)
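
The effect of the `\W` pattern can be sketched without Spark. The example below assumes the `RegexTokenizer` defaults (the pattern marks gaps between tokens, text is lowercased, empty tokens are dropped); the function name `regex_tokenize` is hypothetical. The `re.ASCII` flag is used so that `\W` behaves like the ASCII-only character class of Java regular expressions.

```python
import re


def regex_tokenize(text):
    """Lowercase the text, split on each non-word character, drop empty tokens."""
    parts = re.split(r"\W", text.lower(), flags=re.ASCII)
    return [token for token in parts if token]


tokens = regex_tokenize("I have 2 dogs, they're great!")
```

Note that the apostrophe is a non-word character, so "they're" becomes the two tokens "they" and "re".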

##### 0.2 Remove stop words
In this step, stop words are removed because they carry little information about the content of a comment.

In [8]:
from pyspark.ml.feature import StopWordsRemover

# Define a list of stop words or use default list
remover = StopWordsRemover()
stopwords = remover.getStopWords() 

# Display some of the stop words
stopwords[:10]

In [9]:
# Specify input/output columns
remover.setInputCol("tokenized")
remover.setOutputCol("vector_no_stopw")

# Transform existing dataframe with the StopWordsRemover
df_clean = remover.transform(df_clean)

In [10]:
# Remove rows whose 'vector_no_stopw' array is very short or empty
# Arrays with 4 or fewer tokens are dropped (the threshold 4 is hard-coded)

df_clean = df_clean.where(size(col('vector_no_stopw')) > 4)

##### 0.3 Stemming

In [12]:
%sh /home/ubuntu/databricks/python/bin/pip install nltk

In [13]:
# Import stemmer library
from nltk.stem.porter import *

# Instantiate stemmer object
stemmer = PorterStemmer()

# Create stemmer python function
def stem(in_vec):
    out_vec = []
    for t in in_vec:
        t_stem = stemmer.stem(t)
        if len(t_stem) > 2:
            out_vec.append(t_stem)       
    return out_vec

# Create user defined function for stemming with return type Array<String>
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
stemmer_udf = udf(stem, ArrayType(StringType()))

# Create new column with vectors containing the stemmed tokens 
df_clean = df_clean.withColumn("vector_stemmed", stemmer_udf("vector_no_stopw"))  

#### 1. Build the classifier 
In this step, a classification model was built to identify cat or dog owners from their comments using logistic regression.
TF-IDF was used to vectorize the text. TF-IDF is a feature vectorization method widely used in text mining to reflect how important a term is to a document relative to the whole corpus.
<br>
<br> Results: 
<br> Training set areaUnderROC: 0.94326497772
<br> Testing set areaUnderROC 0.93922450574
<br> Training set accuracy: 0.900138546521
<br> Testing set accuracy 0.894351918899
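
The TF-IDF step can be sketched in plain Python to make the arithmetic concrete. This is an illustration under stated assumptions: `zlib.crc32` is a stand-in hash (Spark's `HashingTF` uses MurmurHash3, so bucket indices will differ), the function names and toy documents are hypothetical, and the IDF uses the smoothed form `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term.

```python
import math
import zlib

NUM_FEATURES = 200  # same bucket count as numFeatures in the notebook


def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = {}
    for token in tokens:
        # crc32 is a stand-in hash (assumption); Spark uses MurmurHash3
        idx = zlib.crc32(token.encode("utf-8")) % num_features
        counts[idx] = counts.get(idx, 0) + 1
    return counts


def idf_weights(docs_tf, num_docs):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1))."""
    doc_freq = {}
    for tf in docs_tf:
        for idx in tf:
            doc_freq[idx] = doc_freq.get(idx, 0) + 1
    return {idx: math.log((num_docs + 1) / (df + 1)) for idx, df in doc_freq.items()}


# Hypothetical stemmed documents
docs = [["dog", "love", "dog"], ["cat", "love"], ["fish", "tank"]]
docs_tf = [hashed_tf(d) for d in docs]
idf = idf_weights(docs_tf, len(docs))
tfidf = [{idx: cnt * idf[idx] for idx, cnt in tf.items()} for tf in docs_tf]
```

With only 200 buckets, distinct terms often share an index, so a larger `numFeatures` trades memory for fewer collisions.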

In [15]:
from pyspark.ml.feature import HashingTF, IDF
from pyspark.sql.functions import col,size,count,when,isnan
from pyspark.sql import *
from functools import reduce

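# DataFrames are immutable, so the result of each transformation must be assigned
df_clean = df_clean.na.drop()
hashingTF = HashingTF(inputCol="vector_stemmed", outputCol="tf", numFeatures=200)
featurizedData = hashingTF.transform(df_clean)
featurizedData = featurizedData.na.drop()

# Optional: cast 'userid' to an integer type. Left disabled because casting through
# 'float' (single precision) can lose precision for large ids
# featurizedData = featurizedData.withColumn('userid', col('userid').cast('float').cast(IntegerType()))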

In [16]:
idf = IDF(inputCol="tf", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.show()

In [17]:
# Cat or dog owners are only a small portion of the entire dataset. A classification model trained on this imbalanced dataset would tend to predict 'non cat or dog owner'. To address this, the 'non cat or dog owner' rows are randomly downsampled to roughly the same size as the 'cat or dog owner' rows.

# The selected data is randomly split 80/20: 80% for training and the remaining 20% for testing
pet = rescaledData.filter("label=1.0")
pet_train, pet_test = pet.randomSplit([0.8, 0.2])
nopet = rescaledData.filter("label=0.0")
sampleRatio = float(pet.count()) / float(nopet.count())
sample_nopet = nopet.sample(False, sampleRatio)
df_sample = pet.unionAll(sample_nopet)
sample_nopet_train, sample_nopet_test = sample_nopet.randomSplit([0.8, 0.2])

df_train = pet_train.unionAll(sample_nopet_train)
df_test = pet_test.unionAll(sample_nopet_test)
print ('training size',df_train.count())
print ('testing size',df_test.count())
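
The downsampling idea can be shown on a toy list. This is a minimal sketch, assuming each majority-class row is kept independently with probability `n_positive / n_negative` (sampling without replacement by fraction); the function name, seed, and toy rows are hypothetical. Because each row is kept at random, the sampled size is only approximately equal to the minority size.

```python
import random


def downsample_majority(rows, label_key="label", seed=42):
    """Keep all positive rows and a random fraction of the negative rows."""
    pos = [r for r in rows if r[label_key] == 1.0]
    neg = [r for r in rows if r[label_key] == 0.0]
    fraction = len(pos) / len(neg)  # same role as sampleRatio in the notebook
    rng = random.Random(seed)
    kept_neg = [r for r in neg if rng.random() < fraction]
    return pos + kept_neg, fraction


# Toy data: 100 positives and 900 negatives
rows = [{"label": 1.0}] * 100 + [{"label": 0.0}] * 900
balanced, fraction = downsample_majority(rows)
n_pos = sum(1 for r in balanced if r["label"] == 1.0)
n_neg = len(balanced) - n_pos
```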

In [18]:
# Use 5-fold cross-validation to select the best L2 regularization strength (regParam); the best model is stored in 'best_model'
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression(maxIter=10,featuresCol='features', labelCol='label')

paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1, 0.2, 0.4]) \
    .build()

evaluator=BinaryClassificationEvaluator()
crossval = CrossValidator(estimator = lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)

cvModel = crossval.fit(df_train)
best_model = cvModel.bestModel
trainingSummary = best_model.summary

In [19]:
# Save the trained model
path = "/FileStore/tables/PythonLDA/"

best_model.save(path + 'best_model')

In [20]:
# If the model has already been trained, load it from the same path used when saving
from pyspark.ml.classification import LogisticRegressionModel

path = "/FileStore/tables/PythonLDA/"
best_model = LogisticRegressionModel.load(path + 'best_model')


In [21]:
# trainingSummary = best_model.summary
prediction_train = best_model.transform(df_train)
prediction_test = best_model.transform(df_test)
accuracy_train = prediction_train.filter(prediction_train.label == prediction_train.prediction).count()/float(df_train.count())
accuracy_test = prediction_test.filter(prediction_test.label == prediction_test.prediction).count()/float(df_test.count())

print('Training set areaUnderROC: ' + str(evaluator.evaluate(prediction_train)))
print('Testing set areaUnderROC ' + str(evaluator.evaluate(prediction_test)))
print('Training set accuracy: ' + str(accuracy_train))
print('Testing set accuracy ' + str(accuracy_test))
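
For reference, the two metrics reported here can be computed by hand on a small example. This sketch is not how Spark computes them internally; it uses the pairwise definition of area under the ROC curve (the probability that a random positive scores higher than a random negative, with ties counted as one half). The function names and toy scores are hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted label equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def area_under_roc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
predictions = [1.0 if s >= 0.5 else 0.0 for s in scores]

acc = accuracy(labels, predictions)
auc = area_under_roc(labels, scores)
```

Accuracy depends on the 0.5 decision threshold, while the area under the ROC curve depends only on how the scores rank the rows.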

#### 2. Classify All The Users
We can now apply the cat/dog classifier to every comment in the dataset.
The ratio of predicted cat or dog owners to the entire user population is around 14%.

In [23]:
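# Apply the trained model to the entire dataset
prediction = best_model.transform(rescaledData)

# Note: the numerator counts predicted-positive comments (rows),
# while the denominator counts distinct users in the raw data
total_pet_owner = prediction.filter("prediction = 1.0").count()
total_population = df.select("userid").distinct().count()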
pet_owner_ratio = float(total_pet_owner)/float(total_population)
print('total_pet_owner :',total_pet_owner)
print('total_population :',total_population)
print('pet_owner_ratio :',pet_owner_ratio)
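
The ratio in this cell divides a count of predicted-positive comments by a count of distinct users. A user-level alternative would count each user once, treating a user as an owner if any of their comments is predicted positive. The sketch below illustrates the difference on toy data; the function name and rows are hypothetical, and it is not the computation that produced the figure reported in this notebook.

```python
def owner_share(comment_predictions):
    """Share of distinct users with at least one predicted-positive comment.

    comment_predictions: iterable of (userid, prediction) pairs, one per comment.
    """
    all_users = set()
    owner_users = set()
    for userid, prediction in comment_predictions:
        all_users.add(userid)
        if prediction == 1.0:
            owner_users.add(userid)
    return len(owner_users) / len(all_users)


# Toy data: user u1 has two positive comments
rows = [("u1", 1.0), ("u1", 1.0), ("u2", 0.0), ("u3", 0.0), ("u4", 1.0)]

share_by_user = owner_share(rows)
# Comment-level numerator over user-level denominator, as in the cell
share_by_comment = sum(1 for _, p in rows if p == 1.0) / len({u for u, _ in rows})
```

When users post several matching comments, the comment-level numerator overstates the share of owners relative to the user-level count.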

#### 3. Get Insights About Users
In this part, we use LDA to analyze the potential topics that cat or dog owners are also interested in.

The potential topics they are interested in are:
<br>Fish
<br>Rabbit
<br>Chicken
<br>Snake
<br>Deer
<br>Horse
<br>Hamster
<br>Train

In [25]:
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

pet_owner = prediction.filter("prediction = 1.0").select('userid','vector_stemmed')

cv = CountVectorizer(inputCol="vector_stemmed", outputCol="features",
                     minTF=2, # minimum number of times a word must appear in a document
                     minDF=4) # minimum number of documents a word must appear in

countVectorModel = cv.fit(pet_owner)

countVectors = (countVectorModel
                .transform(pet_owner)
                .select("userid", "features").cache())

print(len(countVectorModel.vocabulary))  # vocabulary size

numTopics = 10 # number of topics

lda = LDA(k = numTopics,
          maxIter = 50 # number of iterations
          )

ldaModel = lda.fit(countVectors)


# Print topics and top-weighted terms
topics = ldaModel.describeTopics(maxTermsPerTopic=20)
vocabArray = countVectorModel.vocabulary

ListOfIndexToWords = udf(lambda wl: list([vocabArray[w] for w in wl]))
FormatNumbers = udf(lambda nl: ["{:1.4f}".format(x) for x in nl])

topics.select(ListOfIndexToWords(topics.termIndices).alias('words')).show(truncate=False, n=numTopics)

#### 4. Identify Creators With Cat And Dog Owners In The Audience

'The Dodo' is the creator with the largest distinct cat or dog owner audience, followed by 'brave wilderness' and 'hope for paws'.

In [27]:
from pyspark.sql.functions import countDistinct
tmp = prediction.filter("prediction = 1.0")
tmp.groupBy('creator_name').agg(countDistinct('userid')).sort('count(DISTINCT userid)',ascending= False).show()
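
The aggregation can be mirrored in plain Python to show what is being counted: each predicted owner is counted once per creator, no matter how many comments they left. This is a small sketch on toy rows; the function name and the creator and user ids are hypothetical.

```python
from collections import defaultdict


def distinct_owners_per_creator(rows):
    """Rank creators by the number of distinct predicted owners in their audience.

    rows: iterable of (creator_name, userid, prediction) tuples, one per comment.
    """
    users_by_creator = defaultdict(set)
    for creator, userid, prediction in rows:
        if prediction == 1.0:
            users_by_creator[creator].add(userid)
    counts = {creator: len(users) for creator, users in users_by_creator.items()}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


rows = [
    ("creator_a", "u1", 1.0),
    ("creator_a", "u1", 1.0),  # repeat comment by the same user
    ("creator_a", "u2", 1.0),
    ("creator_b", "u1", 1.0),
    ("creator_b", "u3", 0.0),  # not a predicted owner, ignored
]
ranking = distinct_owners_per_creator(rows)
```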

#### 5. Analysis and Future work

According to this analysis, around 14.5% of the users who commented on YouTube in this dataset are dog or cat owners. The potential topics they are interested in include Fish, Rabbit, Chicken, Snake, Deer, Horse, Hamster, Train, etc. Videos related to these topics could be promoted to these cat or dog owners. Also, 'The Dodo', 'brave wilderness' and 'hope for paws' are the top three creators with the largest distinct cat or dog owner audience. Ads targeting cat or dog owners would potentially have the biggest payoff when run in cooperation with these creators.

For future work, this analysis could be improved in the following ways:
<br> 1. To select the cat or dog owners more accurately, the breeds of cats and dogs should be considered
<br> 2. A pipeline could be built to chain the tokenizing, stemming, and TF-IDF steps
<br> 3. Additional stop words could be removed to clean up the LDA results