# Outline
1. [Problem statement](#Problem-statement)
2. [ML problem](#ML-problem)
3. [Data description](#Data-description)
4. [Demo](#Let's-start-our-demo-ride)
    * [Imports](#Import-all-libs-needed)
    * [Define variables](#Define-variables)
    * [Create Spark session](#Create-Spark-session)
    * [Read data](#Read-data)
    * [Tweets preprocessing](#Tweets-preprocessing)
    * [Topic modeling via Latent Dirichlet Allocation](#Topic-modeling-via-Latent-Dirichlet-Allocation)
    * [Term frequency-inverse document frequency (TF-IDF)](#TF-IDF)
    * [Run the LDA Topic Modeler](#Run-the-LDA-Topic-Modeler)
    * [Hot topics in the USA from Google trends](#Hot-topics-in-the-USA-from-Google-trends)
5. [Evaluation and justification](#Evaluation-and-justification)


# Problem statement 

The goal of our project is to identify the topics being actively discussed in a given area at a given moment.
<br> We chose this problem for two reasons. First, published news articles can be distorted. Second, there is a delay between an event and its publication in the news. Both factors are critical for market players, and those with instant access to reliable information have a clear advantage.
<br>We decided to secure this advantage for ourselves and for everyone else (the project is open source) by highlighting the topics discussed on Twitter in real time.
<br>We chose Twitter as the source because of its worldwide popularity, its legitimate access to real-time data, and the wide range of topics discussed there.
<br>For example, companies often use Twitter to spot vulnerabilities in their security systems, since information about them often appears on social networks.

# ML problem

Many algorithms exist for topic modeling. Some are based on classical mathematical approaches such as matrix decomposition, some use probabilistic methods, and some are based on deep learning. For our task we considered three candidate methods: <b>LSA</b>, <b>pLSA</b> and <b>LDA</b>. Each has its advantages and disadvantages.
<br> <b>LSA</b> - Latent Semantic Analysis - is based on singular value decomposition, under the assumption that words that are close in meaning occur in similar pieces of text. This method is simple to implement but requires a large amount of data for reliable results. 
<br><b>pLSA</b> - Probabilistic Latent Semantic Analysis - is based on probabilistic methods and on finding hidden variables (topics). In this approach, however, the number of parameters grows linearly with the number of documents. 
<br>That is why we settled on <b>LDA</b> - Latent Dirichlet Allocation - an unsupervised learning algorithm that assumes the topic distribution of each document and the word distribution of each topic both have Dirichlet priors. The technical part of the algorithm is described below.

# Data description

To collect data in a natural way:
<br>- we registered a Twitter Developer account;
<br>- using the credentials from that account, we ran a script that collected tweets by geolocation and saved them in MongoDB.
<br>
<br><b>As a result:</b>
<br>- we collected 332548 tweets (10 GB in MongoDB, ~100 MB in CSV) from the New York geolocation from 30 May to 15 June;
<br>- we collected 6617029 tweets (~1.69 GB in CSV) from the USA geolocation from 15 June up to now.

# Let's start our demo ride

### Import all libs needed

In [1]:
import findspark
findspark.init()

In [29]:
# essential pyspark
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, StringType, IntegerType, StructField, StructType
from pyspark.sql.functions import udf, row_number,column

# vectorizer
from pyspark.ml.feature import CountVectorizer, StopWordsRemover, HashingTF, IDF, Tokenizer

# stuff for LDA
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector as oldVector, Vectors as oldVectors
from pyspark.ml.linalg import Vector as newVector, Vectors as newVectors

# pytrends for acquiring google trends
from pytrends.pytrends.request import TrendReq

# import hardcoded variables
from utils.channels_to_filter import channels_not_to_consider

# custom text preprocessing
from utils.text_preprocessing import *

# custom tools to work with google trends 
from utils.trends import *

# handy functions for data merging
from utils.data_merge import *

# handy functions for handling topic modeling results
from utils.topic_modeling import *

# datetime handling
from datetime import datetime
import time

### Define variables

**Connecting to data**  
Specify the path to the CSV file with the data here.

In [3]:
# path to CSV
historical_tweets_data = 'data/tweets/new_york_training_tweets_15_06.csv'

**Time frames to pick data from**  
We picked several time frames to check whether our topic model can extract information about events that occurred during these periods.

In [4]:
# Champions League final
lc_final_start_datetime = "Sat Jun 01 00:00:00 +0000 2019"
lc_final_finish_datetime = "Sat Jun 01 23:59:59 +0000 2019"

# Stanley cup final
stanley_final_start_datetime = "Wed Jun 12 00:00:00 +0000 2019"
stanley_final_finish_datetime = "Wed Jun 12 23:59:59 +0000 2019"

# NBA draft
nba_final_start_datetime = "Thu Jun 20 00:00:00 +0000 2019"
nba_finish_final_datetime = "Sun Jun 23 23:59:59 +0000 2019"

Or you can specify your own dates.

In [5]:
start_datetime = ''
finish_datetime = ''

Use one of the dates defined above here (e.g., we use the Stanley Cup final dates).

In [6]:
frame_start_datetime = str_tweet_to_datetime(stanley_final_start_datetime)
frame_finish_datetime = str_tweet_to_datetime(stanley_final_finish_datetime)

assert (frame_finish_datetime - frame_start_datetime).days <= 3, "Date interval should not be bigger than 3 days"
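
The helper `str_tweet_to_datetime` comes from one of the `utils` modules imported above and is not shown in this notebook. As a rough sketch of what such a helper could look like, assuming it simply parses Twitter's `created_at` layout with the standard library (the name `parse_tweet_datetime` is hypothetical, not the project's actual function):

```python
from datetime import datetime

# Twitter's created_at layout, e.g. "Wed Jun 12 00:00:00 +0000 2019"
TWEET_DATETIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_tweet_datetime(value):
    """Parse a Twitter-style timestamp into a timezone-aware datetime."""
    return datetime.strptime(value, TWEET_DATETIME_FORMAT)

start = parse_tweet_datetime("Wed Jun 12 00:00:00 +0000 2019")
finish = parse_tweet_datetime("Wed Jun 12 23:59:59 +0000 2019")
print(start)                  # 2019-06-12 00:00:00+00:00
print((finish - start).days)  # 0
```

Because both values are timezone-aware, subtracting them gives a `timedelta`, which is what the 3-day interval check above relies on.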

**Location related variables**  
Here you can specify the exact locations from which to take tweets for topic modeling. We provide examples for NYC (local approach) and the whole US (global approach). You can explore the raw data to find more locations for filtering.

In [7]:
# if True locations from locations_to_consider will be used to filter
get_from_location = True

# locations to filter relevant tweets
locations_to_consider = [
                         'Manhattan, NY', 
                         'Brooklyn, NY', 
                         'Queens, NY', 
                         'Bronx, NY', 
                         'Staten Island, NY',
                         'New York, USA'
                        ]

# used to extract google trends
if get_from_location:
    geo = 'US-NY' 
else:
    geo = 'US'

**LDA parameters**  
We have tuned parameters of LDA in order to obtain reliable results. The evaluation will be discussed below.

In [83]:
# LDA params
num_of_topics_LDA = 15
max_iterations_LDA = 120

number_of_words_per_topic = 15  # number of top words shown per topic
num_of_top_interest = 15  # number of Google Trends rows to show

### Create Spark session

In [9]:
spark = SparkSession.builder.appName("pipeline").getOrCreate()
sc = spark.sparkContext

### Read data


**Load the historical data, it can take a while**  
Here we load the data and filter it by the dates chosen above.

In [10]:
times = (frame_start_datetime, frame_finish_datetime)
print("Time range to be extracted from ", historical_tweets_data, times[0], times[1])
selected_df = get_historical_df(historical_tweets_data=historical_tweets_data, historical_start_time=times[0], historical_finish_time=times[1], spark=spark)
assert selected_df is not None, "Something went wrong while selecting data from the recent/historical data"
selected_df.count()

Time range to be extracted from  data/tweets/new_york_training_tweets_15_06.csv 2019-06-12 00:00:00+00:00 2019-06-12 23:59:59+00:00
Range for collected data (history):  2019-06-12 00:00:00+00:00 2019-06-12 23:59:59+00:00


29538

### Tweets preprocessing  

**Basic filtering**

Here we do basic filtering. First, we drop rows with null values. Next, we filter out channels that are not useful for topic modeling: by applying LDA to different dates we found many channels that specialize in a particular topic (for example, weather, city traffic, photos, job postings). These topics were always picked out by the model, but since they carry no information about important events, we removed the specialized channels from consideration.
Also, we filter by global and local location (if local filtering is enabled). Finally, we check the length of the tweet text itself. 

In [54]:
df = selected_df

# drop rows with a missing value in columns 0, 1, 2 or 4 (the columns used below)
df = df.rdd.filter(lambda x: x[0] is not None and x[1] is not None and x[2] is not None and x[4] is not None)

# filter out channels not to consider
df = df.filter(lambda x: x[4] not in channels_not_to_consider)

# filter by country (exact match; `in 'US'` would also accept '', 'U' and 'S')
df = df.filter(lambda x: x[1] == 'US')

# filter by precise location
if get_from_location:
    df = df.filter(lambda x: x[2] in locations_to_consider)

# filter tweet itself
df = df.filter(lambda x: filter_tweet(x[0]))


**Now the most interesting - tweet cleaning**  
Text cleaning is crucial for any text modeling process, especially topic modeling. We tried three approaches: classic, hashtags only, and URLs only. In the classic approach we delete all non-words (including URLs, hashtags, emojis and mentions) and filter out common stop words, so that only plain text is left. Such data still contains a lot of noise and uninformative words, so we also tried using only hashtags (which should encode the most important information) and only URLs.

**Let's start with the classic approach**  
In our case it consists of the following steps:  
1) Lowercase all words  
2) Filter words with non-letters at the beginning (mainly for mentions, e.g. "@some_user")  
3) Filter http/https  
4) Filter all non-letters (crucial to remove emoji)  
5) Remove multiple whitespaces  
6) Remove repeated chars (e.g. "greeeeat" -> "great")
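
The six steps above can be sketched with the standard library. This is only an illustration of the listed steps, not the project's `process_text`; the function name `clean_tweet_text` and the exact regular expressions are assumptions, and stop-word removal is left out.

```python
import re

def clean_tweet_text(text):
    """Hypothetical stand-in for the classic cleaning; returns a token list."""
    # 1) lowercase all words
    text = text.lower()
    # 2) + 3) drop tokens starting with a non-letter (mentions, hashtags) and URLs
    kept = [
        tok for tok in text.split()
        if tok[0].isalpha() and not tok.startswith(("http://", "https://"))
    ]
    text = " ".join(kept)
    # 4) replace everything that is not a-z with a space (emoji, digits, punctuation)
    text = re.sub(r"[^a-z]+", " ", text)
    # 5) collapse multiple whitespaces
    text = re.sub(r"\s+", " ", text).strip()
    # 6) squeeze runs of three or more identical letters into one
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)
    return text.split()

print(clean_tweet_text("@some_user This is greeeeat!!! https://t.co/abc #NYC"))
# ['this', 'is', 'great']
```

Squeezing only runs of three or more letters keeps ordinary double letters ("good", "subway" is unaffected) intact, at the cost of occasionally leaving a misspelling such as "greeat" untouched.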

**Hashtags & URLs**  
These two approaches are straightforward: we keep only the hashtags or only the URLs of each tweet.
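
As a rough sketch of the hashtag and URL variants, assuming simple regular expressions are enough (the names `extract_hashtags` and `extract_urls` are hypothetical and stand in for the project's `process_hashtags` and `process_urls`, whose actual code is not shown here):

```python
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")
URL_PATTERN = re.compile(r"https?://\S+")

def extract_hashtags(text):
    """Return the lowercased hashtags of a tweet, without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

def extract_urls(text):
    """Return every http/https URL found in a tweet."""
    return URL_PATTERN.findall(text)

tweet = "Game 7 tonight! #StanleyCup #Game7 https://t.co/abc"
print(extract_hashtags(tweet))  # ['stanleycup', 'game7']
print(extract_urls(tweet))      # ['https://t.co/abc']
```

Note that `\w` also matches digits, underscores and non-Latin letters, so tags made of Korean characters or numbers survive, while a `#fragment` inside a URL would be picked up as a hashtag by this simple pattern.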

**Applying the approach discussed above**  
Now let's apply one of the approaches. But first, specify which one you want to use.

In [55]:
preprocessing_type = 'hashtags' # 'just-text', 'hashtags' or 'urls'

if preprocessing_type == 'just-text':
    process_tweet = process_text
elif preprocessing_type == 'hashtags':
    process_tweet = process_hashtags
elif preprocessing_type == 'urls':
    process_tweet = process_urls
else:
    raise ValueError("Unknown preprocessing_type: " + preprocessing_type)

# process tweet
df = df.map(lambda x: process_tweet(x[0]))

**Final postprocessing**  
Here we make sure that there are no entries without tokens, and we convert the data to a structure suitable for the pyspark LDA class.

In [56]:
# final preprocessing: drop entries without tokens
df = df.filter(lambda x: len(x) > 0)

# make dataframes great again
df = df.map(lambda x: [x])

# schema for df
schema = StructType([StructField('tokens', ArrayType(StringType()), True)])
df = df.toDF(schema=schema)

In [57]:
df.show(10)

+--------------------+
|              tokens|
+--------------------+
|        [whiteboard]|
|         [snotyboyz]|
| [remote, orchestra]|
|              [rbny]|
|[libertystatue, l...|
|[pazviola, mauric...|
|[originalcomposit...|
|          [freeship]|
|         [trioworks]|
|[autoindustry, auto]|
+--------------------+
only showing top 10 rows



In [58]:
df.count()

1924

### Topic modeling via Latent Dirichlet Allocation

A topic model is a type of statistical model that discovers the abstract “topics” occurring in a collection of documents and best representing the information in them.<br/>
The basic idea of LDA is that documents are represented as random mixtures of latent topics, where each topic is characterized by a distribution over words.<br/>

<img src="http://chdoig.github.io/pytexas2015-topic-modeling/images/lda-4.png" width=600/>
*http://chdoig.github.io/pytexas2015-topic-modeling/?source=post_page---------------------------#/3/4
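
To make the "mixture of topics" idea concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The vocabulary, the number of topics and the Dirichlet parameters are made-up illustration values, not settings used in this project.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, doc_alpha, vocabulary, length, rng):
    """Generate one document under the LDA generative story."""
    # per-document topic mixture
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocabulary, weights=topic_word_probs[topic])[0])
    return theta, words

rng = random.Random(0)
vocabulary = ["stanleycup", "nhl", "pride", "stonewall50"]
# each topic's word distribution is itself a Dirichlet draw
topic_word_probs = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(2)]
theta, words = generate_document(topic_word_probs, [0.5, 0.5], vocabulary, 5, rng)
print(theta, words)
```

Training LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.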

###### CountVectorizer helps to convert a collection of text documents to vectors of token counts. 

In [59]:
print(time.strftime('%m%d%Y %H:%M:%S'))

cv = CountVectorizer(inputCol="tokens", outputCol="raw_features", vocabSize=10000, minDF=2.0)
cvmodel = cv.fit(df)

print(time.strftime('%m%d%Y %H:%M:%S'))

07182019 23:52:29
07182019 23:52:57


In [60]:
print(time.strftime('%m%d%Y %H:%M:%S'))
df = cvmodel.transform(df)
print(time.strftime('%m%d%Y %H:%M:%S'))

07182019 23:52:57
07182019 23:52:57
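
What the vectorizer does can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: it keeps only terms that occur in at least `min_df` documents, orders the vocabulary by corpus frequency, and maps each document to sparse counts. The alphabetical tie-breaking is an assumption made here for determinism.

```python
from collections import Counter

def fit_vocabulary(docs, min_df=2, vocab_size=10000):
    """Keep terms seen in at least min_df documents, most frequent first."""
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    eligible = [t for t in term_freq if doc_freq[t] >= min_df]
    eligible.sort(key=lambda t: (-term_freq[t], t))
    return eligible[:vocab_size]

def to_counts(tokens, vocabulary):
    """Map a token list to {vocabulary index: count}, skipping unknown terms."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))

docs = [["nyc", "pride"], ["nyc"], ["whiteboard"], ["pride", "pride"]]
vocabulary = fit_vocabulary(docs)
print(vocabulary)                             # ['pride', 'nyc']
print(to_counts(["whiteboard"], vocabulary))  # {}
print(to_counts(["pride", "pride"], vocabulary))  # {0: 2}
```

A document whose tokens all fall below the document-frequency threshold ends up with an empty vector, which is why a rare hashtag on its own contributes nothing to the model.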


### TF-IDF

**TF-IDF weights each term count to reflect the importance of the term to a document in the corpus.**

In [61]:
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features", minDocFreq=2)
idfModel = idf.fit(df)

df = idfModel.transform(df)


In [62]:
df.show(10, True)

+--------------------+--------------------+--------------------+
|              tokens|        raw_features|     tf_idf_features|
+--------------------+--------------------+--------------------+
|        [whiteboard]|         (538,[],[])|         (538,[],[])|
|         [snotyboyz]|         (538,[],[])|         (538,[],[])|
| [remote, orchestra]|         (538,[],[])|         (538,[],[])|
|              [rbny]|         (538,[],[])|         (538,[],[])|
|[libertystatue, l...|(538,[5,62,149,31...|(538,[5,62,149,31...|
|[pazviola, mauric...|   (538,[252],[1.0])|(538,[252],[6.464...|
|[originalcomposit...|         (538,[],[])|         (538,[],[])|
|          [freeship]|     (538,[2],[1.0])|(538,[2],[4.03632...|
|         [trioworks]|         (538,[],[])|         (538,[],[])|
|[autoindustry, auto]|(538,[337,492],[1...|(538,[337,492],[6...|
+--------------------+--------------------+--------------------+
only showing top 10 rows
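
The weights in `tf_idf_features` can be reproduced by hand. The sketch below uses the smoothed formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term. The document frequency of 2 used in the example is an assumption chosen for illustration; it happens to be consistent with the weights starting with 6.464 in the table above for a corpus of 1924 documents.

```python
import math

def spark_style_idf(num_docs, doc_freq):
    """Inverse document frequency with +1 smoothing in numerator and denominator."""
    return math.log((num_docs + 1) / (doc_freq + 1))

def tf_idf(term_count, num_docs, doc_freq):
    """Weight of a term in one document: raw count times IDF."""
    return term_count * spark_style_idf(num_docs, doc_freq)

print(round(spark_style_idf(1924, 2), 3))  # 6.464
# a term that occurs twice in a document simply gets twice the IDF weight
print(tf_idf(2, 1924, 2) == 2 * spark_style_idf(1924, 2))  # True
```

Rare terms therefore receive large weights and very common ones weights close to zero.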



###### Add id field

In [63]:
w = Window().orderBy(column("tokens"))
df = df.withColumn("id", row_number().over(w))

In [64]:
df.show(10, True)

+--------------------+--------------------+--------------------+---+
|              tokens|        raw_features|     tf_idf_features| id|
+--------------------+--------------------+--------------------+---+
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  1|
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  2|
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  3|
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  4|
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  5|
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  6|
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  7|
|                 [1]|    (538,[12],[1.0])|(538,[12],[4.9977...|  8|
|              [1, 1]|    (538,[12],[2.0])|(538,[12],[9.9954...|  9|
|[1, freeship, mai...|(538,[2,12,14],[1...|(538,[2,12,14],[4...| 10|
+--------------------+--------------------+--------------------+---+
only showing top 10 rows



In [65]:
rs = df.rdd.map(lambda x: (x[3], oldVectors.fromML(x[2])))

In [66]:
rs_df = rs.toDF()

In [67]:
# rs_df.show(10)

### Run the LDA Topic Modeler

In [78]:
print(time.strftime('%m%d%Y %H:%M:%S'))
lda_model = LDA.train(rs_df['_1', '_2'].rdd.map(list), k=num_of_topics_LDA, maxIterations=max_iterations_LDA)
print(time.strftime('%m%d%Y %H:%M:%S'))

07182019 23:56:23
07182019 23:57:02


###### Now we prepare the LDA output for display

In [79]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics = lda_model.topicsMatrix()
vocabArray = cvmodel.vocabulary
print(time.strftime('%m%d%Y %H:%M:%S'))

07182019 23:57:30
07182019 23:57:31


In [80]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topicIndices = sc.parallelize(lda_model.describeTopics(maxTermsPerTopic = number_of_words_per_topic))
print(time.strftime('%m%d%Y %H:%M:%S'))

07182019 23:57:31
07182019 23:57:31


In [81]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics_final = topicIndices.map(lambda topic:topic_render(topic, number_of_words_per_topic, vocabArray)).collect()
print(time.strftime('%m%d%Y %H:%M:%S'))

07182019 23:57:31
07182019 23:57:32
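
`describeTopics` returns, for each topic, a pair of matching arrays: term indices and term weights. The project's `topic_render` helper (from `utils.topic_modeling`) is not shown here; the sketch below is a hypothetical equivalent, named `render_topic`, that assumes the helper only looks up each index in the vocabulary and formats the rounded weight.

```python
def render_topic(topic, max_terms, vocabulary):
    """Turn one (term_indices, term_weights) pair into printable lines."""
    term_indices, term_weights = topic
    lines = []
    for index, weight in list(zip(term_indices, term_weights))[:max_terms]:
        lines.append(f"{round(weight, 3)}  {vocabulary[index]}")
    return lines

example_topic = ([2, 0], [0.1234, 0.1])
print(render_topic(example_topic, 2, ["alpha", "beta", "gamma"]))
# ['0.123  gamma', '0.1  alpha']
```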


###### Topics based on tweets

In [82]:
for topic_number, terms in enumerate(topics_final, start=1):
    print("Topic #" + str(topic_number))
    for term in terms:
        print(term)
    print('\n')

Topic #1
0.12  manhattan
0.044  aidenkai
0.043  kim
0.043  harlem
0.035  blessed
0.028  thevessel
0.028  fifawwc
0.028  subway
0.027  hongkongprotest
0.027  proud
0.027  blackandwhite
0.027  rapper
0.027  midsommar
0.027  mytwitteranniversary
0.027  nycfc


Topic #2
0.061  ducktales
0.061  whentheyseeus
0.056  classof2019
0.043  finduagain
0.038  familytime
0.037  ghopen
0.037  wcw
0.037  lgbtq
0.031  stress
0.031  time
0.03  singer
0.03  photooftheday
0.03  whitneybiennial
0.03  speedtest
0.024  downwiththesickness


Topic #3
0.086  stanleycup
0.07  game7
0.039  chasdeihashem
0.039  shtisel
0.039  shtiselny
0.033  ryan
0.032  jwmg
0.032  nhlbruins
0.032  nhl
0.032  cosplay
0.032  etsy
0.031  repost
0.026  kevin
0.025  sketch
0.024  stanleycupfinal


Topic #4
0.249  newyork
0.056  ny
0.049  travel
0.042  frenchbulldogs
0.042  frenchbulldogsofinstagram
0.042  frenchbulldog
0.042  frenchbulldogsforsale
0.042  frenchbulldogpuppiesforsale
0.027  twastyles
0.027  twa
0.026  truth
0.026  fas

##### Result of topic modelling, actual topics are highlighted
###### Topic #1 - Hong Kong protests (31 March - 16 June)
0.12  manhattan
0.044  aidenkai
0.043  kim
0.043  harlem
0.035  blessed
0.028  thevessel
0.028  fifawwc
0.028  subway
0.027  hongkongprotest
0.027  proud
0.027  blackandwhite
0.027  rapper
0.027  midsommar
0.027  mytwitteranniversary
0.027  nycfc


Topic #2
0.061  ducktales
0.061  whentheyseeus
0.056  classof2019
0.043  finduagain
0.038  familytime
0.037  ghopen
0.037  wcw
0.037  lgbtq
0.031  stress
0.031  time
0.03  singer
0.03  photooftheday
0.03  whitneybiennial
0.03  speedtest
0.024  downwiththesickness


###### Topic #3 - Hockey tournament, 12 June - final
0.086  stanleycup
0.07  game7
0.039  chasdeihashem
0.039  shtisel
0.039  shtiselny
0.033  ryan
0.032  jwmg
0.032  nhlbruins
0.032  nhl
0.032  cosplay
0.032  etsy
0.031  repost
0.026  kevin
0.025  sketch
0.024  stanleycupfinal


Topic #4
0.249  newyork
0.056  ny
0.049  travel
0.042  frenchbulldogs
0.042  frenchbulldogsofinstagram
0.042  frenchbulldog
0.042  frenchbulldogsforsale
0.042  frenchbulldogpuppiesforsale
0.027  twastyles
0.027  twa
0.026  truth
0.026  fashnme
0.019  newjersey
0.019  apple
0.019  amazing


###### Topic #5 - Melt Social Summit, 12-13 June
0.143  meltwatersummit
0.105  manhattan_again
0.079  1
0.057  megatron
0.054  wednesdaywisdom
0.04  summer
0.028  womancrushwednesday
0.028  posefx
0.028  soho
0.022  runner
0.022  soundhound
0.022  goodmorning
0.021  museum
0.016  mensstyle
0.016  tishmanspeyer


Topic #6
0.1  music
0.045  wednesdaythoughts
0.045  youtube
0.037  nicoleharrison
0.037  dogpound
0.037  dj
0.037  adweekchat
0.029  rayonbrandtministries
0.029  church
0.029  people
0.029  life
0.029  bookedandblessed
0.029  ham
0.02  hamr
0.02  amateurr


Topic #7
0.168  brooklyn
0.078  blackmendontcheat
0.058  2
0.043  cd
0.037  comic
0.035  jewel
0.028  livemusic
0.028  williamsburg
0.027  theatre
0.027  mta
0.027  1trump
0.027  honorthemwithaction
0.019  nycfreaks
0.019  hometownbbq
0.019  nassaucounty


Topic #8
0.07  rock
0.063  futurefintech
0.056  fashion
0.036  follow
0.035  dance
0.027  twitter
0.027  filmmaker
0.027  vintage
0.027  jimin
0.027  win
0.026  producer
0.019  starwars
0.019  gangsta
0.019  lukeskywalker
0.019  roastbattle


Topic #9
0.08  pulse
0.059  centralpark
0.051  jordan
0.037  digital
0.037  family
0.036  curtis
0.036  motivation
0.029  work
0.028  marketing
0.028  humor
0.026  rap
0.02  nephew
0.02  hustle
0.02  opportunity
0.02  recordlabel


###### Topic #10 - June - month of pride 
0.131  pride
0.101  gh
0.08  ava
0.073  icfsummit
0.047  pridemonth
0.033  pride2019
0.033  socialmedia
0.027  jax
0.026  nina
0.026  arifitzgerald
0.026  stonewall50
0.025  sudanmassacre
0.018  adwords
0.018  sem
0.018  smo


Topic #11
0.33  nyc
0.112  newyorkcity
0.051  beermenus
0.031  photography
0.031  byjordana
0.026  nycphotographer
0.02  garyvee
0.02  free
0.02  sorrynotsorry
0.02  pinstripepride
0.019  nycity
0.014  pasta
0.014  foodosart
0.014  foodporn
0.014  nycgo


Topic #12
0.066  love
0.059  thebachelorette
0.053  gay
0.041  foodie
0.034  live
0.033  hufflepuff
0.026  teamexhib
0.026  coffee
0.025  vibe
0.025  mib
0.018  insight
0.018  progress
0.018  wd2019
0.018  redken
0.018  olaplex


Topic #13
0.166  freeship
0.09  case
0.075  dvd
0.075  generic
0.063  art
0.042  beauty
0.035  bronx
0.035  streetart
0.029  makeup
0.029  artist
0.024  sunset
0.023  graffiti
0.023  orlandostrong
0.022  brooklyndogs
0.022  view


Topic #14
0.064  wednesday
0.032  running
0.032  wednesdaymotivation
0.025  hot97
0.025  straightcash
0.025  power1051
0.025  cx
0.025  nbafinals
0.025  evolution2
0.025  groundwar
0.025  loveislove
0.025  nypd
0.025  bts
0.025  happyhour
0.025  design


Topic #15
0.078  sugarbaby
0.058  sugardaddy
0.051  6yearswithourhomebts
0.051  lgm
0.036  broadway
0.036  방탄6주년보라해
0.036  sugarbabyneeded
0.036  toystory4
0.035  legend
0.028  seekingarrangement
0.028  hbd_bts
0.028  sugarbabywanted
0.027  friend
0.027  chernobylhbo
0.02  flyer

#### Results based on the tweet text without hashtags
Topic #1
0.038  get
0.03  people
0.027  need
0.027  good
0.024  want
0.019  never
0.017  still
0.017  show
0.015  first
0.014  like
0.013  gonna
0.011  keep
0.011  baby
0.011  getting
0.01  hope


Topic #2
0.029  amp
0.02  work
0.018  great
0.017  th
0.014  well
0.012  tonight
0.012  pm
0.01  june
0.009  another
0.009  call
0.008  nice
0.008  open
0.008  hour
0.008  making
0.008  free


Topic #3
0.028  got
0.023  year
0.018  take
0.016  last
0.016  best
0.01  sure
0.01  two
0.009  already
0.009  try
0.009  wow
0.009  literally
0.009  read
0.009  song
0.008  point
0.008  black


Topic #4
0.032  love
0.03  day
0.024  really
0.022  make
0.022  shit
0.02  say
0.02  right
0.02  back
0.019  lmao
0.016  every
0.015  happy
0.015  come
0.015  friend
0.013  said
0.012  im


Topic #5
0.025  thank
0.02  thing
0.016  feel
0.015  yes
0.015  ever
0.013  real
0.013  woman
0.011  fucking
0.011  amazing
0.011  something
0.011  god
0.01  watch
0.009  like
0.009  white
0.009  men


Topic #6
0.032  time
0.032  u
0.03  know
0.026  see
0.019  going
0.018  even
0.017  let
0.017  much
0.014  fuck
0.012  made
0.011  could
0.01  thought
0.01  trying
0.01  old
0.009  park


Topic #7
0.035  one
0.022  think
0.021  would
0.014  also
0.013  someone
0.013  next
0.011  trump
0.011  week
0.011  job
0.01  tell
0.01  mean
0.01  many
0.01  anyone
0.01  talk
0.01  everyone


Topic #8
0.025  lol
0.018  life
0.017  way
0.016  man
0.014  better
0.013  nigga
0.013  stop
0.012  lmfao
0.01  bad
0.01  wanna
0.01  start
0.009  gotta
0.009  find
0.009  hard
0.009  help


Topic #9
0.048  new
0.036  york
0.023  today
0.022  ny
0.02  look
0.013  please
0.013  city
0.013  w
0.012  game
0.012  summer
0.01  like
0.01  video
0.01  put
0.009  hate
0.009  whole


Topic #10
0.022  nyc
0.02  go
0.015  brooklyn
0.014  guy
0.014  oh
0.013  girl
0.012  thanks
0.011  wait
0.011  night
0.01  yeah
0.01  b
0.009  always
0.009  street
0.008  manhattan
0.008  photo

###### As we can see, results are better when we use hashtags

### Hot topics in the USA from [Google trends](https://trends.google.com/trends/explore?geo=US)

To check the results obtained from the tweets, we acquire Google Trends data for the same location and period of time. We send a request to trends.google.com and get a response that contains the top search topics and top search queries. 
We use a modified pytrends API to get this data. We slightly modified pytrends (added the functions related_top_search_topics and related_top_search_queries to the interface) to get Google Trends data for a specific date and time frame, because the public API has no such functionality.
<br>
<br>
pytrends API: https://github.com/GeneralMills/pytrends
<br>
This repository was forked, and the changes are saved in our public repo: https://github.com/DmytroBabenko/pytrends 
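
One simple way to compare the two sources is to look for trend queries that appear inside topic terms once spaces are removed. The sketch below is only an illustration of that idea; it is not part of the project's evaluation, and the function name `match_trends_to_topics` and the substring rule are assumptions.

```python
def match_trends_to_topics(trend_queries, topic_terms):
    """Map each trend query to the topic terms that contain it (spaces removed)."""
    matches = {}
    for query in trend_queries:
        compact = query.lower().replace(" ", "")
        hits = [term for term in topic_terms if compact in term]
        if hits:
            matches[query] = hits
    return matches

queries = ["hong kong", "toy story 4", "boston bruins"]
terms = ["hongkongprotest", "toystory4", "nhlbruins", "stanleycup"]
print(match_trends_to_topics(queries, terms))
# {'hong kong': ['hongkongprotest'], 'toy story 4': ['toystory4']}
```

A plain substring rule misses related pairs such as "boston bruins" and "nhlbruins", so matches found this way are a lower bound on the real overlap.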

In [73]:
start_date_str = frame_start_datetime.strftime("%Y-%m-%d")
finish_date_str = frame_finish_datetime.strftime("%Y-%m-%d")

pytrend = TrendReq()
pytrend.build_payload(kw_list=[' '], geo=geo, timeframe=f"{start_date_str} {finish_date_str}")

##### Search topics

In [74]:
topics_df = pytrend.related_top_search_topics(spark)

In [75]:
print_google_trend_title(frame_start_datetime, frame_start_datetime, "Search topics", geo)
topics_df.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-12
+----------------------------+----------------------------------------+
|Search topics - rising      |Search topics - top                     |
+----------------------------+----------------------------------------+
|Mathematical game - Topic   |New York - City in New York             |
|Swimming pool - Topic       |New York - US State                     |
|Mathematics - Field of study|Google - Technology company             |
|nan                         |Google Search - Topic                   |
|nan                         |2019 - Topic                            |
|nan                         |Weather - Topic                         |
|nan                         |YouTube - Video sharing company         |
|nan                         |Facebook - Social networking service    |
|nan                         |Facebook, Inc. - Social network company |
|nan                         |Amazon.com - E-commerce company         |
|nan 

##### Search queries

In [76]:
queries_df = pytrend.related_top_search_queries(spark)

In [77]:
print_google_trend_title(frame_start_datetime, frame_start_datetime, "Search queries", geo)
queries_df.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-12
+-----------------------+--------------------+------+---+-----+
|Search queries - rising|Search queries - top|Rising|Top|geo  |
+-----------------------+--------------------+------+---+-----+
|max landis             |google              |+500% |100|US-NY|
|toy story 4            |weather             |+160% |77 |US-NY|
|iready                 |facebook            |+160% |56 |US-NY|
|elizabeth lederer      |youtube             |+140% |53 |US-NY|
|jon stewart            |amazon              |+130% |49 |US-NY|
|hong kong              |news                |+130% |46 |US-NY|
|boston bruins          |world cup           |+120% |27 |US-NY|
|dallas cowboys         |craigslist          |+110% |24 |US-NY|
|inmate lookup          |instagram           |+90%  |24 |US-NY|
|julianne hough         |translate           |+80%  |22 |US-NY|
|espn mlb               |gmail               |+80%  |21 |US-NY|
|nbt bank login         |nba                

# Evaluation and justification