### Project description

The goal of this project is to predict the number of recommendations a comment receives, using the New York Times Comments
dataset available at https://www.kaggle.com/datasets/aashita/nyt-comments.
The main challenges of this project are:
* handling a large data volume
* feature engineering
* etc.

### Imports

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import DateType
import pandas as pd

In [2]:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[8]").setAppName("big_data")
sc = SparkContext.getOrCreate(conf=conf)

In [3]:
spark = SparkSession.builder.getOrCreate()

### Reading data

In [4]:
def read_spark_df(path):
    # multiline=True lets quoted fields such as comment bodies span several lines
    reader = spark.read.option("multiline", True).option("lineSep", "\n").option("header", True)
    return reader.option("delimiter", ",").option("inferSchema", True).csv(path)

In [5]:
articles_df = read_spark_df('data/nyt-articles-2020.csv')
comments_df = read_spark_df('data/nyt-comments-2020.csv')

In [6]:
comments_df = comments_df.limit(10000)

In [7]:
comments_df

DataFrame[commentID: int, status: string, commentSequence: int, userID: int, userDisplayName: string, userLocation: string, userTitle: string, commentBody: string, createDate: string, updateDate: string, approveDate: string, recommendations: string, replyCount: string, editorsSelection: string, parentID: string, parentUserDisplayName: string, depth: string, commentType: string, trusted: string, recommendedFlag: int, permID: string, isAnonymous: string, articleID: string]

In [8]:
df = comments_df.withColumn("updateDate", comments_df['updateDate'].cast(DateType()))

In [9]:
articles_df.printSchema()

root
 |-- newsdesk: string (nullable = true)
 |-- section: string (nullable = true)
 |-- subsection: string (nullable = true)
 |-- material: string (nullable = true)
 |-- headline: string (nullable = true)
 |-- abstract: string (nullable = true)
 |-- keywords: string (nullable = true)
 |-- word_count: string (nullable = true)
 |-- pub_date: string (nullable = true)
 |-- n_comments: string (nullable = true)
 |-- uniqueID\r: string (nullable = true)



In [10]:
comments_df.printSchema()

root
 |-- commentID: integer (nullable = true)
 |-- status: string (nullable = true)
 |-- commentSequence: integer (nullable = true)
 |-- userID: integer (nullable = true)
 |-- userDisplayName: string (nullable = true)
 |-- userLocation: string (nullable = true)
 |-- userTitle: string (nullable = true)
 |-- commentBody: string (nullable = true)
 |-- createDate: string (nullable = true)
 |-- updateDate: string (nullable = true)
 |-- approveDate: string (nullable = true)
 |-- recommendations: string (nullable = true)
 |-- replyCount: string (nullable = true)
 |-- editorsSelection: string (nullable = true)
 |-- parentID: string (nullable = true)
 |-- parentUserDisplayName: string (nullable = true)
 |-- depth: string (nullable = true)
 |-- commentType: string (nullable = true)
 |-- trusted: string (nullable = true)
 |-- recommendedFlag: integer (nullable = true)
 |-- permID: string (nullable = true)
 |-- isAnonymous: string (nullable = true)
 |-- articleID: string (nullable = true)



Although we used inferSchema, several columns were read as strings and should be stored with a different data type (for example, the dates and the recommendations count). Let's fix that.

In [11]:
comments_df = comments_df.withColumn('recommendations', comments_df['recommendations'].cast("float"))\
                         .withColumn("createDate", comments_df['createDate'].cast(DateType()))\
                         .withColumn("updateDate", comments_df['updateDate'].cast(DateType()))\
                         .withColumn("approveDate", comments_df['approveDate'].cast(DateType()))\
                         .withColumn('replyCount', comments_df['replyCount'].cast("int"))\
                         .withColumn('depth', comments_df['depth'].cast("int"))\
                         .withColumn('isAnonymous', comments_df['isAnonymous'].cast("int"))\
                         .withColumn('editorsSelection', comments_df['editorsSelection'].cast("int"))
# Some of the columns above are really boolean flags. PySpark does provide BooleanType; we cast them to int here.
# Caveat: in Spark's non-ANSI mode a string-to-int cast returns null for non-numeric text such as "True"/"False".

In [12]:
comments_df.printSchema()

root
 |-- commentID: integer (nullable = true)
 |-- status: string (nullable = true)
 |-- commentSequence: integer (nullable = true)
 |-- userID: integer (nullable = true)
 |-- userDisplayName: string (nullable = true)
 |-- userLocation: string (nullable = true)
 |-- userTitle: string (nullable = true)
 |-- commentBody: string (nullable = true)
 |-- createDate: date (nullable = true)
 |-- updateDate: date (nullable = true)
 |-- approveDate: date (nullable = true)
 |-- recommendations: float (nullable = true)
 |-- replyCount: integer (nullable = true)
 |-- editorsSelection: integer (nullable = true)
 |-- parentID: string (nullable = true)
 |-- parentUserDisplayName: string (nullable = true)
 |-- depth: integer (nullable = true)
 |-- commentType: string (nullable = true)
 |-- trusted: string (nullable = true)
 |-- recommendedFlag: integer (nullable = true)
 |-- permID: string (nullable = true)
 |-- isAnonymous: integer (nullable = true)
 |-- articleID: string (nullable = true)



Let's take a look at our target variable, `recommendations`.

In [13]:
comments_df.describe(['recommendations']).show()

+-------+-----------------+
|summary|  recommendations|
+-------+-----------------+
|  count|            10000|
|   mean|          20.9887|
| stddev|97.33120007767721|
|    min|              0.0|
|    max|           3816.0|
+-------+-----------------+



The data contains outliers: the maximum is 3816 recommendations, while the mean is about 21. We remove them by dropping every comment at or above an approximate upper quantile.

In [14]:
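# Approximate 90th percentile; relativeError=0.05 lets the result fall anywhere between the 85th and 95th percentiles
upper_limit = comments_df.approxQuantile('recommendations', [0.9], 0.05)[0]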

In [15]:
comments_df = comments_df.filter(col('recommendations') < upper_limit)

In [16]:
comments_df.describe(['recommendations']).show()

+-------+-----------------+
|summary|  recommendations|
+-------+-----------------+
|  count|             8512|
|   mean|5.095277255639098|
| stddev|5.478552516337315|
|    min|              0.0|
|    max|             23.0|
+-------+-----------------+
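
The quantile filter above can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming an exact nearest-rank quantile instead of Spark's approximate one; the function name `upper_quantile_limit` and the toy values are hypothetical and only serve the illustration.

```python
import math


def upper_quantile_limit(values, q):
    """Exact q-quantile by nearest rank (approxQuantile only approximates this)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical toy values with one extreme outlier
recommendations = [0, 1, 1, 2, 3, 5, 8, 13, 21, 3816]

limit = upper_quantile_limit(recommendations, 0.9)
# Same strict comparison as the notebook's filter: keep values below the limit
kept = [r for r in recommendations if r < limit]

print(limit, kept)
```

Because the comparison is strict, the value equal to the limit is dropped together with the outlier. With a relative error of 0.05, Spark may return a value ranked anywhere between the 85th and 95th percentiles, which is consistent with 8512 of the 10000 rows remaining above.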



In [17]:
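import pyspark.sql.functions as F

# createDate is already a date, so this gives the epoch seconds of midnight on that day (session time zone)
comments_df = comments_df.withColumn("createDateInt", F.unix_timestamp(comments_df['createDate']))
comments_df.describe(['createDateInt']).show()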

+-------+-------------------+
|summary|      createDateInt|
+-------+-------------------+
|  count|               8512|
|   mean|1.577956250892857E9|
| stddev|  419763.5688200801|
|    min|         1577833200|
|    max|         1601503200|
+-------+-------------------+
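
The minimum and maximum above are not whole multiples of 86400 seconds, which hints that midnight was taken in a local time zone rather than in UTC. The sketch below, in plain Python, converts the two values from the output to UTC; the helper name `to_utc_iso` is hypothetical, and the reading that the session time zone is one to two hours ahead of UTC is an assumption drawn only from these two numbers.

```python
from datetime import datetime, timezone


def to_utc_iso(epoch_seconds):
    """Render epoch seconds as an ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# The min and max of createDateInt reported above
for ts in (1577833200, 1601503200):
    # Seconds past UTC midnight: non-zero means local midnight differs from UTC midnight
    print(ts, to_utc_iso(ts), ts % 86400)
```

If the feature should not depend on the machine's settings, one option is to fix the session time zone explicitly before computing the timestamp.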



In [18]:
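# Compute an 11-bin histogram of the target; the result is a pair (bin edges, counts)
rec_histogram = comments_df.select('recommendations').rdd.flatMap(lambda x: x).histogram(11)

# Load the computed histogram into a pandas DataFrame for plotting;
# zip pairs each count with the left edge of its bin
pd.DataFrame(
    list(zip(*rec_histogram)),
    columns=['bin', 'frequency']
).set_index(
    'bin'
).plot(kind='bar');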

Transform text to vectors

In [19]:
# from pyspark.mllib.feature import Word2Vec

In [20]:
from pyspark.ml.feature import StopWordsRemover,Word2Vec

In [21]:
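def get_vector(df, column='commentBody'):
    # Strip @mentions and every character that is not an ASCII letter or whitespace
    df = df.withColumn(column, trim(regexp_replace(column, r'(@\w+)|[^a-zA-Z\s]', '')))
    # Tokenize on single spaces; note that select() keeps only the tokenized column
    df = df.select(split(col(column), " ").alias(column))
    remover = StopWordsRemover(inputCol=column, outputCol="filtered")
    filtered = remover.transform(df)
    # Fit Word2Vec on the filtered tokens; each comment becomes the average of its word vectors
    word2vec = Word2Vec(inputCol="filtered", outputCol="vector")
    model = word2vec.fit(filtered)
    return model.transform(filtered)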
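
To see what the cleaning step in `get_vector` does to a single comment, here is a minimal sketch that applies the same pattern with Python's `re` module instead of Spark. The helper name `clean_and_split` and the sample strings are hypothetical; for plain ASCII input the pattern should behave the same way in both engines, but that equivalence is an assumption rather than something the notebook verifies.

```python
import re

PATTERN = r"(@\w+)|[^a-zA-Z\s]"


def clean_and_split(text):
    """Remove @mentions and non-letter characters, trim, then split on single spaces."""
    cleaned = re.sub(PATTERN, "", text).strip()
    return cleaned.split(" ")


# Removing "@john," leaves two consecutive spaces, so an empty token appears
print(clean_and_split("Hello @john, it's 2020!"))

# Digits are dropped, so "1st" becomes "st"
print(clean_and_split("1st you should"))

# Splitting on runs of whitespace instead avoids the empty token
print(re.sub(PATTERN, "", "Hello @john, it's 2020!").split())
```

Splitting on a single space can leave empty tokens wherever a removed character sat between two spaces. Splitting on a whitespace run (for example the pattern `\s+`) is one way to avoid them.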

In [22]:
df = get_vector(comments_df)

In [23]:
df.show()

+--------------------+--------------------+--------------------+
|         commentBody|            filtered|              vector|
+--------------------+--------------------+--------------------+
|[Here, is, someth...|[something, think...|[-0.0012416364657...|
|[I, have, used, m...|[used, VA, loan, ...|[-0.0203269722743...|
|[would, someone, ...|[someone, take, V...|[-0.0044137245160...|
|[here, in, the, A...|[Alabama, PNW, tr...|[0.00108524933036...|
|[just, a, guess, ...|[guess, doubt, cr...|[-0.0057377919769...|
|[st, you, should,...|[st, take, note, ...|[-0.0123027005052...|
|[SickBut, some, a...|[SickBut, action,...|[-0.0128891877830...|
|[I, totally, agre...|[totally, agree, ...|[0.01775400146531...|
|[Bill, Clinton, w...|[Bill, Clinton, p...|[0.00155126106284...|
|[Being, on, the, ...|[board, long, tim...|[0.00329689623675...|
|[until, you, get,...|      [get, elected]|[0.04873079177923...|
|[This, is, a, ter...|[terrific, idea, ...|[0.01907857486512...|
|[Barth, Oh, Jesu,...|[Ba

In [24]:
# comments_df=comments_df.withColumn("createDateInt",comments_df.createDateInt.cast('double'))
# # df=df.withColumn("vector",df.vector.cast('array<bigint>'))
# df.withColumn(
#     "vector", 
#     F.array_union(df.vector, F.array(comments_df.createDateInt))
# ).show()



In [25]:
comments_df.printSchema()

root
 |-- commentID: integer (nullable = true)
 |-- status: string (nullable = true)
 |-- commentSequence: integer (nullable = true)
 |-- userID: integer (nullable = true)
 |-- userDisplayName: string (nullable = true)
 |-- userLocation: string (nullable = true)
 |-- userTitle: string (nullable = true)
 |-- commentBody: string (nullable = true)
 |-- createDate: date (nullable = true)
 |-- updateDate: date (nullable = true)
 |-- approveDate: date (nullable = true)
 |-- recommendations: float (nullable = true)
 |-- replyCount: integer (nullable = true)
 |-- editorsSelection: integer (nullable = true)
 |-- parentID: string (nullable = true)
 |-- parentUserDisplayName: string (nullable = true)
 |-- depth: integer (nullable = true)
 |-- commentType: string (nullable = true)
 |-- trusted: string (nullable = true)
 |-- recommendedFlag: integer (nullable = true)
 |-- permID: string (nullable = true)
 |-- isAnonymous: integer (nullable = true)
 |-- articleID: string (nullable = true)
 |-- creat

In [26]:
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# get_vector() kept only the text columns, so we re-attach the vectors by row position.
# Caveats: this assumes both DataFrames keep the same row order, and a Window without
# partitionBy moves all rows into a single partition.
df = df.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
comments_df = comments_df.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
# The join leaves two columns named commentBody: the raw string and the token array from df.
comments_df = comments_df.join(df, on=["row_index"]).drop("row_index")
comments_df.take(1)

[Row(commentID=104387472, status='approved', commentSequence=104387472, userID=60215558, userDisplayName='magicisnotreal', userLocation='earth', userTitle=None, commentBody='Here is something I think is fraudulent that vets are subject to If you use your VA home loan option you have to pay higher interest rates regardless of your credit rating becuase supposedly it is more risky.How exactly is a guaranteed loan more risky than a not guaranteed commercial loan?', createDate=datetime.date(2020, 1, 1), updateDate=datetime.date(2020, 1, 1), approveDate=datetime.date(2020, 1, 1), recommendations=7.0, replyCount=5, editorsSelection=None, parentID=None, parentUserDisplayName=None, depth=1, commentType='comment', trusted='0', recommendedFlag=0, permID='104387472', isAnonymous=None, articleID='nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3ccbd', createDateInt=1577833200, commentBody=['Here', 'is', 'something', 'I', 'think', 'is', 'fraudulent', 'that', 'vets', 'are', 'subject', 'to', 'If', 'you'

In [27]:
from pyspark.ml.feature import VectorAssembler

numericCols = ['vector', 'createDateInt']
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
comments_df = assembler.transform(comments_df)

In [28]:
from pyspark.ml.feature import Bucketizer

# Bucket the target into four classes: [0, 6), [6, 12), [12, 18) and [18, inf]
bucketizer = Bucketizer(splits=[0, 6, 12, 18, float('Inf')], inputCol="recommendations", outputCol="buckets")
comments_df = bucketizer.setHandleInvalid("keep").transform(comments_df)
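
The mapping from a recommendation count to a bucket can be sketched in plain Python. This is only an illustration of the split semantics (left-closed, right-open intervals, with the last interval also including its upper bound); the function name `bucket_index` is hypothetical and the code does not call Spark.

```python
from bisect import bisect_right

SPLITS = [0, 6, 12, 18, float("inf")]


def bucket_index(value, splits=SPLITS):
    """Index of the interval [splits[i], splits[i+1]) that contains value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} lies outside the splits")
    # The last bucket also includes its upper bound, hence the min()
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


for recommendations in (0, 5.9, 6, 7.0, 17, 18, 23):
    print(recommendations, bucket_index(recommendations))
```

For example, the comment shown earlier with `recommendations=7.0` falls into bucket 1.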

In [29]:
comments_df.describe(['buckets']).show()
comments_df.take(1)

+-------+------------------+
|summary|           buckets|
+-------+------------------+
|  count|              8512|
|   mean|0.5236137218045113|
| stddev|0.8470961020589826|
|    min|               0.0|
|    max|               3.0|
+-------+------------------+



[Row(commentID=104387472, status='approved', commentSequence=104387472, userID=60215558, userDisplayName='magicisnotreal', userLocation='earth', userTitle=None, commentBody='Here is something I think is fraudulent that vets are subject to If you use your VA home loan option you have to pay higher interest rates regardless of your credit rating becuase supposedly it is more risky.How exactly is a guaranteed loan more risky than a not guaranteed commercial loan?', createDate=datetime.date(2020, 1, 1), updateDate=datetime.date(2020, 1, 1), approveDate=datetime.date(2020, 1, 1), recommendations=7.0, replyCount=5, editorsSelection=None, parentID=None, parentUserDisplayName=None, depth=1, commentType='comment', trusted='0', recommendedFlag=0, permID='104387472', isAnonymous=None, articleID='nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3ccbd', createDateInt=1577833200, commentBody=['Here', 'is', 'something', 'I', 'think', 'is', 'fraudulent', 'that', 'vets', 'are', 'subject', 'to', 'If', 'you'

In [30]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Use the prepared comments DataFrame as the modelling data.
data = comments_df

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="buckets", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# maxCategories is not set, so the default (20) applies: features with > 20 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model on the indexed label, so that predictions share
# the index space that labelConverter maps back to the original bucket values.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "buckets", "features").show(15)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)  # summary only

+--------------+-------+--------------------+
|predictedLabel|buckets|            features|
+--------------+-------+--------------------+
|           0.0|    2.0|[-0.0203269722743...|
|           0.0|    1.0|[-0.0044137245160...|
|           0.0|    0.0|[0.00155126106284...|
|           0.0|    1.0|[-0.0069727705781...|
|           0.0|    1.0|[-0.0134414144179...|
|           0.0|    1.0|[-4.5931638361742...|
|           0.0|    2.0|[-0.0070270372161...|
|           0.0|    1.0|[-0.0032263528072...|
|           0.0|    0.0|[-0.0133554951101...|
|           0.0|    1.0|[0.00158347106097...|
|           0.0|    1.0|[-0.0299132053359...|
|           0.0|    0.0|[-0.0029049653156...|
|           0.0|    1.0|[0.01227360839645...|
|           0.0|    3.0|[-0.0086663363401...|
|           0.0|    1.0|[-0.0092189190320...|
+--------------+-------+--------------------+
only showing top 15 rows

Test Error = 0.357727
RandomForestClassificationModel: uid=RandomForestClassifier_b447aad870e0, numT

In [31]:
rfModel.getPredictionCol()

'prediction'
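
Every row displayed in the prediction table above is assigned bucket 0.0, so it is worth comparing the model with a baseline that always predicts the most common bucket. The sketch below shows that comparison in plain Python; the function name `majority_baseline` and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical toy labels, not taken from the dataset
toy_buckets = [0, 0, 0, 1, 0, 2, 0, 1, 0, 3]

label, baseline_accuracy = majority_baseline(toy_buckets)
print(label, baseline_accuracy)
```

In the notebook, the per-bucket counts for the test set could be obtained with `testData.groupBy("buckets").count()`. If the share of the largest bucket is close to the model accuracy (1 minus the reported test error), the classifier is adding little beyond predicting the majority class.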