# Problem statement


# Data description
In order to collect data in a natural way:
<br>- we registered Twitter Developer account;
<br>- using credentials from Twitter Developer account we run script that collected tweets by the geolocation and saved them in mongodb;
<br>
<br><b>As a result:</b>
<br>- we collected  332548 tweets (10Gb in mongodb, ~100Mb in csv) from New-York geolocation since 30 of May up to 15 of June;
<br>- we collected  6617029 tweets (~1.69Gb in csv) from USA geolocation since 15 of June up to now.

### Import all needed libs

In [2]:
#Do not move!
import findspark
findspark.init()

In [3]:
import pyspark
import operator
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, StringType, IntegerType, StructField, StructType
from pyspark.sql.functions import udf, row_number,column

# processing
import re
from datetime import datetime

# text preprocessing
import re
import nltk
from nltk.stem import WordNetLemmatizer 
from pyspark.ml.feature import CountVectorizer,StopWordsRemover, HashingTF, IDF, Tokenizer
nltk.download('stopwords')
nltk.download('wordnet')

#staff for LDA
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector as oldVector, Vectors as oldVectors
from pyspark.ml.linalg import Vector as newVector, Vectors as newVectors

# import hardcoded variables
from variables import channels_not_to_consider

#for debug purpose only
import time

#pytrends - for acquiring google trends
from get_google_trends_data.pytrends.pytrends.request import TrendReq

# basically spark

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Global variables definition

**User-specific variables**  
Please feel free to tweak those variables as you wish. For example, you can set number of last hours to get hottest topics.

In [4]:
# if True locations from locations_to_consider will be used to filter
get_from_location = True

# locations to filter relevant tweets
locations_to_consider = [
                         'Manhattan, NY', 
                         'Brooklyn, NY', 
                         'Queens, NY', 
                         'Bronx, NY', 
                         'Staten Island, NY'
                         'New York, USA'
                        ]

geo = "US-NY" #US for USA

number_of_hours_to_get_topics = 2
num_of_top_interest = 15

# Set window time for interesting
frame_start_datetime = "Mon Jun 03 00:00:00 +0000 2019"
frame_finish_datetime = "Mon Jun 17 23:00:00 +0000 2019"

**Technical variables**  
Those variables are needed to connect to db and other technical stuff.

In [5]:
# LDA params
num_of_topics_LDA = 10
max_iterations_LDA = 100
nomber_of_words_to_for_topic = 15  # number of words per topic

# path to CSV
historical_tweets_data = './get-tweets-by-geolocation/data/new_york_training_tweets_15_06.csv'
#historical_tweets_data = './get-tweets-by-geolocation/training_tweets.csv'
# MongoDB table
real_time_tweets_table = "usa_training_tweets_04_07.training_tweets_collection"

### Create spark session

In [6]:
spark = SparkSession.builder.appName("pipeline") \
    .config('spark.mongodb.input.uri', 'mongodb://localhost:27017/'+real_time_tweets_table) \
    .config('spark.mongodb.output.uri', 'mongodb://localhost:27017/'+real_time_tweets_table) \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.1') \
    .config('spark.mongodb.input.partitioner', 'MongoPaginateBySizePartitioner') \
    .getOrCreate()
sc = spark.sparkContext

### Handy functions

**Text preprocessing and filtering**

In [7]:
def filter_tweet(tweet, channels_not_to_consider):
    
    if not isinstance(tweet, str):
        is_filtered = True
    elif len(tweet.split(' ')) < 3:
        is_filtered = True
    else: 
        is_filtered = False
        
    return not is_filtered
         
def process_tweet(tweet):
   
    tweet = tweet.lower() # get lowercase
    tweet = re.sub(r'@\w+', '', tweet) # filter words with non-letters at the beginning (mainly for mentions)
    tweet = re.sub(r'http://\S{,280}', '', tweet) # filter http
    tweet = re.sub(r'https://\S{,280}', '', tweet) # filter https
    tweet = re.sub(r'[^A-Za-z]', ' ', tweet) # filter all non-letters
    tweet = re.sub(r'\s{2,}', ' ', tweet) # remove multiply whitespaces
    tweet = re.sub(r'(.)\1{2,}', r'\1', tweet) # remove repeated chars (e.g. "greeeeat" -> "great")
    tweet = tweet.strip() # remove possible whitespaces from both sides of the tweet

    # lemmatize, tokenize and conquer
    processed_tweet = [lemmatizer.lemmatize(token) for token in tokenizer.tokenize(tweet)
                       if token not in stop_word_list]
    
    return processed_tweet

#### Google trends

In [8]:
#TODO: move this function to Handy function block 
def get_google_trends_by_geo(geo):
    if geo == 'US':
        return google_trends_search_topics_us, google_trends_search_queries_us
    elif geo == 'US-NY':
        return google_trends_search_topics_us_ny, google_trends_search_queries_us_ny
    
    return None, None

**Datetime handling**

In [9]:
wrong_date = datetime.strptime("Mon Jun 03 00:00:00 +0000 2000", '%a %b %d %H:%M:%S %z %Y')

def validate(date_text):
    try:
        if date_text != datetime.strptime(date_text, '%a %b %d %H:%M:%S %z %Y').strftime('%a %b %d %H:%M:%S %z %Y'):
            raise ValueError
        return True
    except ValueError:
        return False

def str_tweet_to_datetime(frame_datetime):
    if (validate(frame_datetime) == True):
        return datetime.strptime(frame_datetime,'%a %b %d %H:%M:%S %z %Y')
    else:
        return wrong_date

def datetime_to_tweet_str(frame_datetime):
    #print(type(frame_datetime))
    ts = datetime.strftime(frame_datetime, '%a %b %d %H:%M:%S %z %Y')
    return ts

# How to call this block with functions?

In [10]:
def tweet2google_timeframe(frame_start_datetime, frame_finish_datetime):
    start_date = str_tweet_to_datetime(frame_start_datetime)
    end_date = str_tweet_to_datetime(frame_finish_datetime)
    tim
    
def get_google_trends_by_geo(geo):
    if geo == 'US':
        return google_trends_search_topics_us, google_trends_search_queries_us
    elif geo == 'US-NY':
        return google_trends_search_topics_us_ny, google_trends_search_queries_us_ny
    
    return None, None

In [11]:
#TODO: move this function to utils
def str_rising_to_float(str):
    if str is None:
        return 0.0
    if str == '':
        return 0.0
    if str == 'Breakout':
        return 0.0
    
    str_value = str.split('%')[0]
    if '+' in str_value:
        str_value = str_value.split('+')[1]
        
    if ',' in str_value:
        str_value = str_value.replace(',', '.')
        value = 1000* float(str_value)
        return value
    return float(str_value)

In [12]:
#TODO: move this function to utils
def unique_google_trends_by_time_frame(df):
    data = df.collect()
    rising_dict = {}
    top_dict = {}
    
    geo = data[0]['geo']
    columns = df.columns

    for i in range(0, len(data)):
        rising_val = data[i][columns[1]]
        top_value = data[i][columns[2]]
        
        if rising_val in rising_dict:
            rising_dict[rising_val][0] += str_rising_to_float(data[i][columns[3]])
            rising_dict[rising_val][1] += 1
        else:
            rising_dict[rising_val] = [str_rising_to_float(data[i][columns[3]]), 1]
            
        if top_value in top_dict:
            top_dict[top_value][0] += float(data[i][columns[4]])
            top_dict[top_value][1] += 1
        else:
            top_dict[top_value] = [float(data[i][columns[4]]), 1]
    
    
    for key in top_dict:
        top_dict[key] = round(top_dict[key][0] / top_dict[key][1])
        
    for key in rising_dict:
        rising_dict[key] = round(rising_dict[key][0] / rising_dict[key][1])
    
    top_dict = sorted(top_dict.items(), key=operator.itemgetter(1), reverse=True)
    rising_dict = sorted(rising_dict.items(), key=operator.itemgetter(1), reverse=True)
    
    
    seq = []
    len_top = len(top_dict)
    len_rising = len(rising_dict)
    length = max(len_top, len_rising)
    
    row = Row(columns[1], columns[2], columns[3], columns[4], columns[5])
    
    for i in range(0, length):
        rising = rising_dict[i][0] if i < len_rising else ''
        rising_val = f"+{rising_dict[i][1]}%" if i < len_rising else None
        
        top = top_dict[i][0] if i < len_top else ''
        top_val = top_dict[i][1] if i < len_top else None
        
        seq.append(row(rising, top, rising_val, top_val, geo))
    
    dframe = spark.createDataFrame(seq)
    return dframe

In [13]:
def get_geo_name(geo):
    if geo == "US-NY":
        return "New York"
    elif geo == "US":
        return "United States"
    return ""

def print_google_trend_title(start_date, finish_date, name):
    start_date_str = start_date.strftime("%Y-%m-%d")
    if start_date == finish_date:
        print(f"\nGoogle trends {name} in {get_geo_name(geo)} during {start_date_str}")
    else:
        finish_date_str = finish_date.strftime("%Y-%m-%d")
        print(f"\nGoogle trends {name} in {get_geo_name(geo)} during {start_date_str} - {finish_date_str}")

In [14]:
def convert_datetime_in_interesting_google(df):
    columns = df.columns
    converted_df = df.rdd.map(lambda x : (
                                          x["Date"].strftime("%Y-%m-%d"), 
                                          x[columns[1]], 
                                          x[columns[2]], 
                                          x[columns[3]],
                                          x[columns[4]],
                                          x[columns[5]])).toDF([columns[0], columns[1], columns[2], columns[3], columns[4], columns[5]])
                                                
    return converted_df

# Load the data


## Loading Google Trends data

In [26]:
google_trends_search_queries_us = spark.read.csv('data/google-trends/google-trends-search-queries-US.csv', inferSchema=True, header=True)
google_trends_search_topics_us = spark.read.csv('data/google-trends/google-trends-search-topics-US.csv', inferSchema=True, header=True)
google_trends_search_queries_us_ny = spark.read.csv('data/google-trends/google-trends-search-queries-US-NY.csv', inferSchema=True, header=True)
google_trends_search_topics_us_ny = spark.read.csv('data/google-trends/google-trends-search-topics-US-NY.csv', inferSchema=True, header=True)

## Here should be "magic IF" (Yevhen)

In [13]:
def get_history_and_real_timeframe(requested_start, requested_finish):

    requested_start_dt = str_tweet_to_datetime(requested_start)
    requested_finish_dt = str_tweet_to_datetime(requested_finish)
    
    const_end_history_datetime = str_tweet_to_datetime("Fri Jul 05 00:00:00 +0000 2019")

    history_start_datetime = None
    history_finish_datetime = None
    realtime_start_datetime = None
    realtime_finish_datetime = None

    assert requested_finish_dt > requested_start_dt, "Finish dataframe MUST be greater than start"

    if (requested_start_dt >= const_end_history_datetime and requested_finish_dt > const_end_history_datetime):
        realtime_start_datetime = requested_start_dt
        realtime_finish_datetime = requested_finish_dt
    elif (requested_start_dt < const_end_history_datetime and requested_finish_dt <= const_end_history_datetime):
        history_start_datetime = requested_start_dt
        history_finish_datetime = requested_finish_dt
    else:
        history_start_datetime = requested_start_dt
        history_finish_datetime = const_end_history_datetime
        realtime_start_datetime = const_end_history_datetime
        realtime_finish_datetime = requested_finish_dt
        
    return (history_start_datetime, history_finish_datetime, realtime_start_datetime, realtime_finish_datetime)

print('Example of usage!')
times = get_history_and_real_timeframe(requested_start = frame_start_datetime, 
                                       requested_finish = frame_finish_datetime)

print("Range for csv: ", times[0], times[1])
print("Time range for mongodb: ", times[2], times[3])

Example of usage!
Range for csv:  2019-06-03 00:00:00+00:00 2019-06-17 23:00:00+00:00
Time range for mongodb:  None None


## Reading the historical data, it can take a while

In [14]:
df = spark.read.csv(historical_tweets_data, inferSchema=True, header=True)
# remove records with no date
df = df.na.drop(subset=["created_at"])

In [15]:
# convert string to desired date format
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType, TimestampType

func = udf(lambda x: str_tweet_to_datetime(x), TimestampType())

df = df.withColumn('created_at', func(col('created_at')))

In [16]:
# select history according to user's time request

times = get_history_and_real_timeframe(requested_start = frame_start_datetime, 
                                       requested_finish = frame_finish_datetime)
historical_start_time = times[0]
historical_finish_time = times[1]

print("Range for collected data (history): ", historical_start_time, historical_finish_time)

selected_history = None

if historical_start_time != None and historical_finish_time != None:
    selected_history = df.filter((df.created_at > historical_start_time) & (df.created_at < historical_finish_time))

Range for collected data (history):  2019-06-03 00:00:00+00:00 2019-06-17 23:00:00+00:00


In [17]:
selected_history.show(10)

+--------------------+------------+--------------+--------------------+---------------+----------------+---------------+--------------+-------------+------------+-----------------------+-------------------+----------+
|               tweet|country_code|  geo_location|        bounding_box|    screen_name|favourites_count|followers_count|statuses_count|friends_count|listed_count|user_described_location|         created_at|utc_offset|
+--------------------+------------+--------------+--------------------+---------------+----------------+---------------+--------------+-------------+------------+-----------------------+-------------------+----------+
|@MikePrevost3 Wha...|          US|Glen Ridge, NJ|[[[-74.218378, 40...|OmarShahJaffrey|           22016|            531|         16637|          583|          27|        New Jersey, USA|2019-06-04 15:04:39|      null|
|and this is the f...|          US| Manhattan, NY|[[[-74.026675, 40...|        dijellz|           41507|           1639|        

# Problem №1

# Mongo DB Staff

In [37]:
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

In [39]:
modified_df = df2.withColumn("coordinates", df2["coordinates"].cast("string"))
modified_df = modified_df.withColumn("geo", modified_df["geo"].cast("string"))
modified_df = modified_df.withColumn("place", modified_df["place"].cast("string"))
modified_df = modified_df.withColumn("quoted_status", modified_df["quoted_status"].cast("string"))

In [45]:
modified_df.createOrReplaceTempView("recent_data")
modified_df.printSchema()

root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- contributors: null (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: integer (containsNull = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- additional_media_info: struct (nullable = true)
 |    |    |    |    |-- monetizable: boolean (nullable = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)


In [41]:
recent_data = spark.sql('''
SELECT 
    recent_data.id_str, 
    recent_data.in_reply_to_screen_name, 
    recent_data.is_quote_status, 
    recent_data.reply_count, 
    recent_data.favorited, 
    recent_data.filter_level, 
    recent_data.quoted_status_id, 
    recent_data.source, 
    recent_data.in_reply_to_status_id_str, 
    recent_data.geo, 
    recent_data.entities, 
    recent_data.id, 
    recent_data.quoted_status, 
    recent_data.retweeted, 
    recent_data.timestamp_ms, 
    recent_data.text,
    recent_data.user, 
    recent_data.lang, 
    recent_data.truncated, 
    recent_data.in_reply_to_status_id, 
    recent_data.created_at, 
    recent_data.in_reply_to_user_id_str, 
    recent_data.contributors, 
    recent_data.retweet_count, 
    recent_data.place, 
    recent_data.favorite_count, 
    recent_data.possibly_sensitive, 
    recent_data.quote_count, 
    recent_data.quoted_status_permalink, 
    recent_data.in_reply_to_user_id, 
    recent_data.extended_tweet, 
    recent_data.extended_entities, 
    recent_data._id, 
    recent_data.quoted_status_id_str, 
    recent_data.display_text_range,
    recent_data.coordinates
FROM recent_data WHERE lang="en" AND place.country="US"
''')

AnalysisException: "Can't extract value from place#1280: need struct type but got string; line 39 pos 37"

In [46]:
recent_data = spark.sql('''
SELECT 
    recent_data._id, 
FROM recent_data 
''')

AnalysisException: "cannot resolve '`recent_data.id_str`' given input columns: []; line 3 pos 4;\n'Project ['recent_data.id_str, 'FROM AS recent_data#1465]\n+- OneRowRelation\n"

In [43]:
recent_data.show()

Py4JJavaError: An error occurred while calling o187.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18.0 (TID 18, localhost, executor driver): com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast DOCUMENT into a NullType (value: { "type" : "Point", "coordinates" : [-117.4998322, 34.1731682] })
	at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:200)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:39)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:37)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:37)
	at com.mongodb.spark.sql.MapFunctions$.castToStructType(MapFunctions.scala:222)
	at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:194)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:39)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:37)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:37)
	at com.mongodb.spark.sql.MongoRelation$$anonfun$buildScan$1.apply(MongoRelation.scala:58)
	at com.mongodb.spark.sql.MongoRelation$$anonfun$buildScan$1.apply(MongoRelation.scala:58)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast DOCUMENT into a NullType (value: { "type" : "Point", "coordinates" : [-117.4998322, 34.1731682] })
	at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:200)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:39)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:37)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:37)
	at com.mongodb.spark.sql.MapFunctions$.castToStructType(MapFunctions.scala:222)
	at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:194)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:39)
	at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:37)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:37)
	at com.mongodb.spark.sql.MongoRelation$$anonfun$buildScan$1.apply(MongoRelation.scala:58)
	at com.mongodb.spark.sql.MongoRelation$$anonfun$buildScan$1.apply(MongoRelation.scala:58)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


## Data filtering and merging

In [18]:
# select recent data according to user's time request

times = get_history_and_real_timeframe(requested_start = frame_start_datetime, 
                                       requested_finish = frame_finish_datetime)
recent_start_time = times[2]
recent_finish_time = times[3]

print("Range for recent data (mongodb): ", recent_start_time, recent_finish_time)

selected_recent = None

if recent_start_time != None and recent_finish_time != None:
    selected_recent = df.filter((recent_data.created_at > historical_start_time) 
                                & (recent_data.created_at < historical_finish_time))

Range for recent data (mongodb):  None None


In [19]:
# merge together selected_recent and selected_history

selected_df = None

if selected_history != None and selected_recent != None:
    selected_df = selected_history.union(selected_recent)
elif selected_history != None and selected_recent == None:
    selected_df = selected_history
elif selected_history != None and selected_recent == None:
    selected_df = selected_recent

assert selected_df != None, "Something goes wrong with selecting data from recent data/history data"

# Tweets preprocessing

Text cleaning is crucial for any text modelling process, especially for topic modelling. In our case it consists from those steps:  
1) Lowercase all words  
2) Filter words with non-letters at the beginning (mainly for mentions, e.g. "@some_user")  
3) Filter http/https  
4) Filter all non-letters (crucial to remove emoji)  
5) Remove multiply whitespaces  
6) Remove repeated chars (e.g. "greeeeat" -> "great")

In [20]:
df = selected_history

In [21]:
tokenizer = nltk.WordPunctTokenizer()
lemmatizer = WordNetLemmatizer()
stop_word_list = nltk.corpus.stopwords.words('english')

In [22]:
# filter nans
df = df.rdd.filter(lambda x: x[0] != None and x[1] != None and x[2] != None and x[4] != None)

# filter out channels not to consider
df = df.filter(lambda x: x[4] not in channels_not_to_consider)

# filter by country
df = df.filter(lambda x: x[1] in 'US')

# filter by precise location
if get_from_location:
    df = df.filter(lambda x: x[2] in locations_to_consider)

# filter tweet itself
df = df.filter(lambda x: filter_tweet(x[0], channels_not_to_consider=channels_not_to_consider))

# process tweet
df = df.map(lambda x: process_tweet(x[0]))

# final preprocesssing
df = df.filter(lambda x: len(x) > 0)

# make dataframes great again
df = df.map(lambda x: [x])

# schema for df
schema = StructType([StructField('tokens', ArrayType(StringType()), True)])
df = df.toDF(schema=schema)

In [24]:
df.show(10)

+--------------------+
|              tokens|
+--------------------+
|[fuckin, mood, to...|
|[welcome, tuesday...|
|[one, thing, get,...|
|[bainbridge, stre...|
|[one, difficult, ...|
|[like, bout, son,...|
|[summer, around, ...|
|[incredible, buyi...|
|[lincoln, dress, ...|
|[foot, took, nice...|
+--------------------+
only showing top 10 rows



# Topic modeling/Latent Dirichlet allocation(LDA)

In [27]:
# # this block can be commented, it's just a mock

# text_file = 'data/listings.csv'
# #df2 = spark.read.csv(text_file, inferSchema=True, header=True)
# #df2 = df2.select("id", "name").dropna(subset="name")
# df2=sc.parallelize(df2.collect())

# print(time.strftime('%m%d%Y %H:%M:%S'))

# tokenizer = Tokenizer(inputCol="name", outputCol="tokens")
# df2 = tokenizer.transform(df2)
# print(time.strftime('%m%d%Y %H:%M:%S'))

In [32]:
df2.show(10, True)

+-----+--------------------+--------------------+
|   id|                name|              tokens|
+-----+--------------------+--------------------+
| 2818|Quiet Garden View...|[quiet, garden, v...|
|20168|100%Centre-Studio...|[100%centre-studi...|
|25428|Lovely apt in Cit...|[lovely, apt, in,...|
|27886|Romantic, stylish...|[romantic,, styli...|
|28658|Cosy guest room n...|[cosy, guest, roo...|
|28871|Comfortable doubl...|[comfortable, dou...|
|29051|Comfortable singl...|[comfortable, sin...|
|31080|2-story apartment...|[2-story, apartme...|
|38266|Nice and quiet pl...|[nice, and, quiet...|
|41125|Amsterdam Center ...|[amsterdam, cente...|
+-----+--------------------+--------------------+
only showing top 10 rows



In [34]:
df2.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [28]:
print(time.strftime('%m%d%Y %H:%M:%S'))

cv = CountVectorizer(inputCol="tokens", outputCol="raw_features", vocabSize=5000, minDF=3.0)
cvmodel = cv.fit(df)

print(time.strftime('%m%d%Y %H:%M:%S'))

07052019 13:01:13
07052019 13:01:29


In [29]:
print(time.strftime('%m%d%Y %H:%M:%S'))
df = cvmodel.transform(df)
print(time.strftime('%m%d%Y %H:%M:%S'))

07052019 13:01:42
07052019 13:01:42


In [30]:
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features", minDocFreq=2)
idfModel = idf.fit(df)

df = idfModel.transform(df)


In [31]:
df.show(10, True)

+--------------------+--------------------+--------------------+
|              tokens|        raw_features|     tf_idf_features|
+--------------------+--------------------+--------------------+
|[fuckin, mood, to...|(5000,[11,26,528,...|(5000,[11,26,528,...|
|[welcome, tuesday...|(5000,[28,81,260,...|(5000,[28,81,260,...|
|[one, thing, get,...|(5000,[2,3,8,31,3...|(5000,[2,3,8,31,3...|
|[bainbridge, stre...|(5000,[60,175,417...|(5000,[60,175,417...|
|[one, difficult, ...|(5000,[3,21,137,2...|(5000,[3,21,137,2...|
|[like, bout, son,...|(5000,[0,89,297,3...|(5000,[0,89,297,3...|
|[summer, around, ...|(5000,[6,87,101,1...|(5000,[6,87,101,1...|
|[incredible, buyi...|(5000,[132,620,64...|(5000,[132,620,64...|
|[lincoln, dress, ...|(5000,[60,185,773...|(5000,[60,185,773...|
|[foot, took, nice...|(5000,[22,28,34,1...|(5000,[22,28,34,1...|
+--------------------+--------------------+--------------------+
only showing top 10 rows



In [32]:
#df = df.drop("name")
#df.show(10, False)

In [34]:
w = Window().orderBy(column("tokens"))
df = df.withColumn("id", row_number().over(w))

In [38]:
df.show(10, True)

+--------------------+--------------------+--------------------+---+
|              tokens|        raw_features|     tf_idf_features| id|
+--------------------+--------------------+--------------------+---+
|[aah, job, traini...|(5000,[0,61,140,1...|(5000,[0,61,140,1...|  1|
|         [aah, seen]|  (5000,[215],[1.0])|(5000,[215],[5.43...|  2|
|[aaliyah, dae, ad...|(5000,[1,9,79,166...|(5000,[1,9,79,166...|  3|
|[aaliyah, suppose...|(5000,[25,30,130,...|(5000,[25,30,130,...|  4|
|[aapl, headline, ...|(5000,[223,1214,3...|(5000,[223,1214,3...|  5|
|[aapl, strong, da...|(5000,[0,10,13,26...|(5000,[0,10,13,26...|  6|
|[aapl, sudden, lo...|(5000,[32,48,76,2...|(5000,[32,48,76,2...|  7|
|[aaple, phenomena...|(5000,[610,1351,1...|(5000,[610,1351,1...|  8|
|             [aaron]| (5000,[3010],[1.0])|(5000,[3010],[8.1...|  9|
|[aaron, coming, b...|(5000,[23,138,106...|(5000,[23,138,106...| 10|
+--------------------+--------------------+--------------------+---+
only showing top 10 rows



In [39]:
rs = df.rdd.map(lambda x: (x[3], oldVectors.fromML(x[2])))

In [40]:
rs_df = rs.toDF()
rs_df.show(10, False)

+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_1 |_2                                                                                                                                                                                                                                                                                                    |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |(5000,[0,61,140,1126,1840],[2.9690105061742442,4.612367939574509,5.176711868705877,7.0838875

In [48]:
# Run the LDA Topic Modeler
# Note the time before and after is printed in order to find out how much time it takes to process x number of records

print(time.strftime('%m%d%Y %H:%M:%S'))
lda_model = LDA.train(rs_df['_1', '_2'].rdd.map(list), k=num_of_topics_LDA, maxIterations=max_iterations_LDA)
print(time.strftime('%m%d%Y %H:%M:%S'))

07052019 13:07:45
07052019 13:11:24


In [49]:
wordNumbers = 15

print(time.strftime('%m%d%Y %H:%M:%S'))
topics = lda_model.topicsMatrix()
vocabArray = cvmodel.vocabulary

topicIndices = sc.parallelize(lda_model.describeTopics(maxTermsPerTopic = wordNumbers))

def topic_render(topic):  # specify vector id of words to actual words
    terms = topic[0]
    prob = topic[1]
    
    result = []
    for i in range(nomber_of_words_to_for_topic):
        term = str(round(prob[i],3))+"  "+vocabArray[terms[i]]
        result.append(term)
    return result
print(time.strftime('%m%d%Y %H:%M:%S'))

07052019 13:13:30
07052019 13:13:30


In [50]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics_final = topicIndices.map(lambda topic:topic_render(topic)).collect()
print(time.strftime('%m%d%Y %H:%M:%S'))

07052019 13:13:31
07052019 13:13:31


# Topics

In [51]:
# based on the simple vectors(+number of words)

for topic in range(len(topics_final)):
    print ("Topic #" + str(topic+1) + "")
    for term in topics_final[topic]:
        print (term)
    print ('\n')

Topic #1
0.048  like
0.031  know
0.029  people
0.025  really
0.023  think
0.02  look
0.017  even
0.016  feel
0.013  gonna
0.012  getting
0.012  stop
0.011  many
0.011  lmfao
0.011  something
0.01  wanna


Topic #2
0.03  go
0.024  year
0.022  right
0.02  let
0.018  take
0.018  first
0.017  game
0.015  oh
0.015  better
0.013  wait
0.012  ya
0.011  old
0.01  play
0.01  team
0.01  gotta


Topic #3
0.038  get
0.029  got
0.025  back
0.021  work
0.016  guy
0.015  thanks
0.012  keep
0.012  everyone
0.012  mean
0.012  wow
0.01  looking
0.009  around
0.009  fun
0.009  nice
0.009  hit


Topic #4
0.05  new
0.037  time
0.035  york
0.032  good
0.03  see
0.029  need
0.026  want
0.015  please
0.014  city
0.012  video
0.011  baby
0.01  job
0.01  manhattan
0.01  photo
0.009  cause


Topic #5
0.021  way
0.019  best
0.019  always
0.017  friend
0.017  well
0.016  ever
0.016  also
0.014  made
0.013  give
0.013  world
0.011  hope
0.011  long
0.011  one
0.011  coming
0.011  park


Topic #6
0.025  make
0.025  

### Hot topics in the USA from [Google trends](https://trends.google.com/trends/explore?geo=US)

In [52]:
start_date = str_tweet_to_datetime(frame_start_datetime)
finish_date = str_tweet_to_datetime(frame_finish_datetime)

In [55]:
google_trends_topics, google_trends_queries = get_google_trends_by_geo(geo) 

##### Google trends search queries

In [56]:
interesting_google_topics = google_trends_topics.filter(
    (google_trends_topics.Date >= start_date) & (google_trends_topics.Date <= finish_date))

In [57]:
print_google_trend_title(start_date, finish_date, "Search topics")
interest_google_topics = convert_datetime_in_interesting_google(interesting_google_topics)
interest_google_topics.select("Date","Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-03 - 2019-06-17
+----------+-------------------------------------------------------------+---------------------------------------+
|Date      |Search topics - rising                                       |Search topics - top                    |
+----------+-------------------------------------------------------------+---------------------------------------+
|2019-06-03|Jeopardy! - American television show                         |New York - City in New York            |
|2019-06-03|Stock - Topic                                                |New York - US State                    |
|2019-06-03|LinkedIn - Website                                           |2019 - Topic                           |
|2019-06-03|Eid al-Fitr - Topic                                          |Google Search - Topic                  |
|2019-06-03|Google Classroom - Topic                                     |Google - Technology company            |
|2019-06

In case when timeframe is more than 1 day, filter correctly this google-trends

In [19]:
interesing_google_topics_unique= unique_google_trends_by_time_frame(interesting_google_topics)
print_google_trend_title(start_date, finish_date, "Search topics")
interesing_google_topics_unique.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-27 - 2019-06-30
+--------------------------------------------------------+----------------------------------------+
|Search topics - rising                                  |Search topics - top                     |
+--------------------------------------------------------+----------------------------------------+
|Kamala Harris - United States Senator                   |New York - City in New York             |
|Marianne Williamson - American author                   |New York - US State                     |
|Brooklyn Nets - Basketball team                         |2019 - Topic                            |
|United States women's national soccer team - Soccer team|Weather - Topic                         |
|Pete Buttigieg - Mayor of South Bend                    |YouTube - Video sharing company         |
|Joe Biden - Former Vice President of the United States  |Film - Topic                            |
|Kevin Durant - American bas

##### Google trends search queries

In [20]:
interesting_google_queries = google_trends_queries.filter(
    (google_trends_queries.Date >= start_date) & (google_trends_queries.Date <= finish_date))

In [21]:
print_google_trend_title(start_date, finish_date, "Search queries")
interest_google_queries = convert_datetime_in_interesting_google(interesting_google_queries)
interest_google_queries.select("Date", "Search queries - rising", "Search queries - top").show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-27 - 2019-06-30
+----------+---------------------------+--------------------+
|Date      |Search queries - rising    |Search queries - top|
+----------+---------------------------+--------------------+
|2019-06-28|shay mitchell              |weather             |
|2019-06-28|marianne williamson        |google              |
|2019-06-28|kamala harris              |facebook            |
|2019-06-28|argentina vs venezuela 2019|youtube             |
|2019-06-28|usa france                 |world cup           |
|2019-06-28|brazil vs paraguay         |news                |
|2019-06-28|usa vs france              |amazon              |
|2019-06-28|colombia vs chile          |copa america        |
|2019-06-28|alex morgan                |debate              |
|2019-06-28|michael bennet             |instagram           |
|2019-06-28|argentina vs venezuela     |craigslist          |
|2019-06-28|megan rapinoe              |walmart            

In [22]:
interesing_google_queries_unique= unique_google_trends_by_time_frame(interesting_google_queries)
print_google_trend_title(start_date, finish_date, "Search queries")
interesing_google_queries_unique.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-27 - 2019-06-30
+---------------------------+--------------------+------+---+-----+
|Search queries - rising    |Search queries - top|Rising|Top|geo  |
+---------------------------+--------------------+------+---+-----+
|yy                         |weather             |+4950%|100|US-NY|
|shay mitchell              |pride               |+3700%|62 |US-NY|
|darren collison            |facebook            |+2950%|59 |US-NY|
|marianne williamson        |google              |+2750%|55 |US-NY|
|kamala harris              |youtube             |+2400%|48 |US-NY|
|deandre jordan             |news                |+2400%|44 |US-NY|
|india vs england           |amazon              |+1050%|44 |US-NY|
|argentina vs venezuela 2019|world cup           |+900% |42 |US-NY|
|usa france                 |debate              |+900% |35 |US-NY|
|brazil vs paraguay         |yankees             |+900% |34 |US-NY|
|usa vs france              |pride parade  

#### Hot topics - google trends (directly) (probably this will be removed)

In [23]:
start_date_str = start_date.strftime("%Y-%m-%d")
finish_date_str = finish_date.strftime("%Y-%m-%d")
pytrend = TrendReq()
pytrend.build_payload(kw_list=[' '], geo=geo, timeframe=f"{start_date_str} {finish_date_str}")

##### Search topics

In [24]:
topics_df = pytrend.related_top_search_topics(spark)

In [30]:
print_google_trend_title(start_date, finish_date, "Search topics")
topics_df.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-27 - 2019-06-30
+-------------------------------------------------------------+---------------------------------------+
|Search topics - rising                                       |Search topics - top                    |
+-------------------------------------------------------------+---------------------------------------+
|Pride parade - Topic                                         |New York - City in New York            |
|Parade - Topic                                               |New York - US State                    |
|Gay pride - Topic                                            |2019 - Topic                           |
|Debate - Topic                                               |Weather - Topic                        |
|2016 Democratic Party presidential debates and forums - Topic|YouTube - Video sharing company        |
|Fireworks - Topic                                            |Google - Technology company      

##### Search queries

In [26]:
queries_df = pytrend.related_top_search_queries(spark)

In [28]:
print_google_trend_title(start_date, finish_date, "Search queries")
queries_df.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-27 - 2019-06-30
+-------------------------+--------------------+-------+---+-----+
|Search queries - rising  |Search queries - top|Rising |Top|geo  |
+-------------------------+--------------------+-------+---+-----+
|marianne williamson      |weather             |+1,500%|100|US-NY|
|kamala harris            |facebook            |+1,450%|61 |US-NY|
|yankees vs red sox       |google              |+1,200%|60 |US-NY|
|yy                       |youtube             |+1,150%|52 |US-NY|
|tulsi gabbard            |amazon              |+1,050%|46 |US-NY|
|mexico vs costa rica     |news                |+950%  |46 |US-NY|
|kemba walker             |world cup           |+850%  |38 |US-NY|
|usa vs france            |craigslist          |+850%  |27 |US-NY|
|mackenzie lueck          |instagram           |+500%  |27 |US-NY|
|pride parade 2019 nyc    |yankees             |+450%  |25 |US-NY|
|pride parade             |movies              |+400%  |

### Conclusion