# Outline
1. [Problem statement](#Problem-statement)
2. [ML problem](#ML-problem)
3. [Data description](#Data-description)
4. [Demo](#Let's-start-our-demo-ride)
    * [Imports](#Import-all-libs-needed)
    * [Define variables](#Define-variables)
    * [Create Spark session](#Create-Spark-session)
    * [Read data](#Read-data)
    * [Tweets preprocessing](#Tweets-preprocessing)
    * [Topic modeling via Latent Dirichlet Allocation](#Topic-modeling-via-Latent-Dirichlet-Allocation)
    * [Term frequency-inverse document frequency (TF-IDF)](#TF-IDF)
    * [Run the LDA Topic Modeler](#Run-the-LDA-Topic-Modeler)
    * [Hot topics in the USA from Google trends](#Hot-topics-in-the-USA-from-Google-trends)
5. [Evaluation and justification](#Evaluation-and-justification)


# Problem statement 

The goal of our project is to identify the topics under active discussion at the moment in a certain area.
<br> The motivation for choosing this topic is, firstly, the published news articles can be distorted. Secondly, there is a delay between the actual event and publication in the news. Both of these factors are critical for market players, and those who have instant access to reliable information have a clear advantage.
<br>We decided to organize this advantage for ourselves and for everyone (since this project is open source), highlighting the topics discussed in real-time on Twitter.
<br>As a source, we chose Twitter because of its popularity and prevalence throughout the world, legitimate access to real-time data and the many topics discussed in it.
<br>As an example, there are many cases when companies use Twitter to identify vulnerability in their security systems since information about it often comes in social networks.

# ML problem

Currently, exist a lot of algorithms that solve the problem topic-modeling. Some of them are based on classical mathematical approaches as matrix decomposition, some use probabilistic methods, some are based on deep learning. In the context of our task, we have considered 3 potential methods for solution: <b>LSA</b>, <b>pLSA</b> and <b>LDA</b>. Each of them has its advantages and disadvantages.
<br> <b>LSA</b> - Latent Semantic Analysis - is based on a singular matrix decomposition, under the assumption that words that are close in meaning will occur in similar pieces of text. This method is simple to implement, but for reliable results requires a large amount of data. 
<br><b>pLSA</b> - Latent Semantic Analysis - is based on probabilistic methods and finding hidden variables - topics. But in this approach, the number of parameters increases linearly with the number of documents. 
<br>That is why we settled on the <b>LDA</b> - Latent Dirichlet Allocation - unsupervised learning algorithm, which assumes that the topics of the documents have a Dirichlet distribution and the words in the topics also have a Dirichlet distribution. The technical part of the algorithm will be described below.

# Data description

In order to collect data in a natural way:
<br>- we registered Twitter Developer account;
<br>- using credentials from Twitter Developer account we run script that collected tweets by the geolocation and saved them in mongodb;
<br>
<br><b>As a result:</b>
<br>- we collected  332548 tweets (10Gb in mongodb, ~100Mb in csv) from New-York geolocation since 30 of May up to 15 of June;
<br>- we collected  6617029 tweets (~1.69Gb in csv) from USA geolocation since 15 of June up to now.
<br> All collected data can be downloaded via link: https://drive.google.com/file/d/1QxGI2esat6BnrPv0YE5Ud6uma49Lccty/view?usp=sharing

# Let's start our demo ride

### Import all libs needed

In [1]:
import findspark
findspark.init()

In [2]:
# essential pyspark
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, StringType, IntegerType, StructField, StructType
from pyspark.sql.functions import udf, row_number,column

# vectorizer
from pyspark.ml.feature import CountVectorizer, StopWordsRemover, HashingTF, IDF, Tokenizer

# staff for LDA
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector as oldVector, Vectors as oldVectors
from pyspark.ml.linalg import Vector as newVector, Vectors as newVectors

# pytrends for acquiring google trends
from pytrends.pytrends.request import TrendReq

# import hardcoded variables
from utils.channels_to_filter import channels_not_to_consider

# custom text preprocessing
from utils.text_preprocessing import *

# custom tools to work with google trends 
from utils.trends import *

# handy functions for data merging
from utils.data_merge import *

# handy functions for topic modeling result handeling
from utils.topic_modeling import *

# datetime handling
from datetime import datetime
import time

# lib to download file with tweets from google drive
from google_drive_downloader import GoogleDriveDownloader as gdd

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Define variables

**Connecting to data**  
File with tweets will be downloaded automatically.
<br> In case of any troubles it's possible to download file via https://drive.google.com/file/d/1gDdlNGaNZYrBg6IWirMy7lAj6e5tqcYk/view?usp=sharing and put in folder './data'.

In [3]:
# path to CSV
historical_tweets_data = 'data/tweets/new_york_training_tweets_15_06.csv'

gdd.download_file_from_google_drive(file_id='1gDdlNGaNZYrBg6IWirMy7lAj6e5tqcYk',
                                    dest_path=historical_tweets_data)

**Time frames to pick data from**  
We picked some time frames to get data from to check if our topic model can extract info about events that occured during this period.

In [4]:
# final of league championship 
lc_final_start_datetime = "Sat Jun 01 00:00:00 +0000 2019"
lc_final_finish_datetime = "Sat Jun 01 23:59:59 +0000 2019"

# Stanley cup final
stanley_final_start_datetime = "Wed Jun 12 00:00:00 +0000 2019"
stanley_final_finish_datetime = "Wed Jun 12 23:59:59 +0000 2019"

# Draft NBA
nba_final_start_datetime = "Thu Jun 20 00:00:00 +0000 2019"
nba_finish_final_datetime = "Sun Jun 23 23:59:59 +0000 2019"

Or you can specify your own dates.

In [5]:
start_datetime = ''
finish_datetime = ''

Use one of the dates defined above right here (e.g., we use Stanely cup final dates).

In [6]:
frame_start_datetime = str_tweet_to_datetime(lc_final_start_datetime)
frame_finish_datetime = str_tweet_to_datetime(lc_final_finish_datetime)

assert (frame_finish_datetime - frame_start_datetime).days <= 3, "Date interval should not be bigger than 3 days"

**Location related variables**  
Here you can specify exact location from which you want to get tweets for topic modeling. Here we have example for NYC (local approach) and whole US (global approach). You can explore raw data to find more locations for filtering.

In [7]:
# if True locations from locations_to_consider will be used to filter
get_from_location = True

# locations to filter relevant tweets
locations_to_consider = [
                         'Manhattan, NY', 
                         'Brooklyn, NY', 
                         'Queens, NY', 
                         'Bronx, NY', 
                         'Staten Island, NY'
                         'New York, USA'
                        ]

# used to extract google trends
if get_from_location:
    geo = 'US-NY' 
else:
    geo = 'US'

**LDA parameters**  
We have tuned parameters of LDA in order to obtain reliable results. The evaluation will be discussed below.

In [8]:
# LDA params
lda_seed = 42
num_of_topics_LDA = 15
max_iterations_LDA = 120

number_of_words_per_topic = 15  # number of words per topic
num_of_top_interest = 15 # number of topics

### Create Spark session

In [9]:
spark = SparkSession.builder.appName("pipeline").getOrCreate()
sc = spark.sparkContext

### Read data


**Load the historical data, it can take a while**  
Here we load data and filter by dates you chose above.

In [10]:
times = (frame_start_datetime, frame_finish_datetime)
print("Time range to be extracted from ", historical_tweets_data, times[0], times[1])
selected_df = get_historical_df(historical_tweets_data=historical_tweets_data, historical_start_time=times[0], historical_finish_time=times[1], spark=spark)
assert selected_df != None, "Something goes wrong with selecting data from recent data/history data"
selected_df.count()

Time range to be extracted from  data/tweets/new_york_training_tweets_15_06.csv 2019-06-01 00:00:00+00:00 2019-06-01 23:59:59+00:00
Range for collected data (history):  2019-06-01 00:00:00+00:00 2019-06-01 23:59:59+00:00


58318

### Tweets preprocessing  

**Basic filtering**

Here we do basic filtering based on null values. We filter out channels that are not important in topic modeling. We defined it by applying LDA on different dates and find out that there are a lot of channels which specialized on particular topics (for example, weather, traffic in the city, photos, hiring people for a job). These topics were always distinguished. But, since they don't provide any information about important events, we have removed specialized channels from consideration.
 Also, we do filtering based on global and local location (if local filtering is enabled). Finally, we check tweet itself for the length of the message. 

In [11]:
df = selected_df

# filter nans
df = df.rdd.filter(lambda x: x[0] != None and x[1] != None and x[2] != None and x[4] != None)

# filter out channels not to consider
df = df.filter(lambda x: x[4] not in channels_not_to_consider)

# filter by country
df = df.filter(lambda x: x[1] in 'US')

# filter by precise location
if get_from_location:
    df = df.filter(lambda x: x[2] in locations_to_consider)

# filter tweet itself
df = df.filter(lambda x: filter_tweet(x[0]))


**Now the most interesting - tweet cleaning**  
Text cleaning is crucial for any text modelling process, especially for topic modelling. We tried three different approaches: classic, using only hashtags, using only urls. In classic approach we delete all non-words (including urls, hashtags, emojis and mentions), filter out common stop words, so only plain text information is left. But still such data has a lot of noise and uninformative words, so we tried another approaches with using only hashtags (which should code the most important information) and urls.

**Let's start from classic approach**  
In our case it consists from those steps:  
1) Lowercase all words  
2) Filter words with non-letters at the beginning (mainly for mentions, e.g. "@some_user")  
3) Filter http/https  
4) Filter all non-letters (crucial to remove emoji)  
5) Remove multiply whitespaces  
6) Remove repeated chars (e.g. "greeeeat" -> "great")

**Hashtags & URLs**  
Those approaches are pretty clear. We simply get hashtags and urls from tweet with regex. In case of urls we query it **(one query can take up to 500ms, so it can take a while)** and get html response, then we parse meta for keywords and description. Some basic preprocessing and tokenization is applied to both, keywords and description. Finally, we merge keywords and description tokens together.

**Basically applying one of the approaches discussed above**  
Now let's apply one of the aproaches. But first specify which one you want to use. 

In [12]:
preprocessing_type = 'hashtags' # 'just-text', 'hashtags' or 'urls'

if preprocessing_type == 'just-text':
    process_tweet = process_text
elif preprocessing_type == 'hashtags':
    process_tweet = process_hashtags
elif preprocessing_type == 'urls':
    process_tweet = process_urls

# process tweet
df = df.map(lambda x: process_tweet(x[0]))

**Final postprocessing**  
Here we make sure that we don't have entries without tokens at all, also we change the structure suitable for pyspark LDA class.

In [13]:
# final preprocesssing
df = df.filter(lambda x: len(x) > 0)

# make dataframes great again
df = df.map(lambda x: [x])

# schema for df
schema = StructType([StructField('tokens', ArrayType(StringType()), True)])
df = df.toDF(schema=schema)

In [14]:
df.show(10)

+--------------------+
|              tokens|
+--------------------+
|[nationalsmileday...|
|      [whoraisedyou]|
|          [godzilla]|
| [hero, rachelhauck]|
|[latinosforice, r...|
| [blackmendontcheat]|
|[curiousincidento...|
|   [abstinencegoals]|
|              [wplj]|
|[farewellplj, plj...|
+--------------------+
only showing top 10 rows



In [15]:
df.count()

4060

### Topic modeling via Latent Dirichlet Allocation

Topic Model is a type of statistical model used for tagging abstract “topics” that occur in a collection of documents that best represents the information in them.<br/>
The basic idea in the LDA is that documents are represented as a random mixture of latent topics, where each topic is characterized by a distribution of words.<br/>

<img src="http://chdoig.github.io/pytexas2015-topic-modeling/images/lda-4.png" width=600/>
*http://chdoig.github.io/pytexas2015-topic-modeling/?source=post_page---------------------------#/3/4

###### CountVectorizer helps to convert a collection of text documents to vectors of token counts. 

In [16]:
print(time.strftime('%m%d%Y %H:%M:%S'))

cv = CountVectorizer(inputCol="tokens", outputCol="raw_features", vocabSize=10000, minDF=2.0)
cvmodel = cv.fit(df)

print(time.strftime('%m%d%Y %H:%M:%S'))

07202019 14:23:39
07202019 14:24:43


In [17]:
print(time.strftime('%m%d%Y %H:%M:%S'))
df = cvmodel.transform(df)
print(time.strftime('%m%d%Y %H:%M:%S'))

07202019 14:24:43
07202019 14:24:44


### TF-IDF

** Process text mining to reflect the importance of a term to a document in the corpus. **

In [18]:
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features", minDocFreq=2)
idfModel = idf.fit(df)

df = idfModel.transform(df)


In [19]:
df.show(10, True)

+--------------------+--------------------+--------------------+
|              tokens|        raw_features|     tf_idf_features|
+--------------------+--------------------+--------------------+
|[nationalsmileday...|(1096,[145,801],[...|(1096,[145,801],[...|
|      [whoraisedyou]|        (1096,[],[])|        (1096,[],[])|
|          [godzilla]|  (1096,[150],[1.0])|(1096,[150],[6.36...|
| [hero, rachelhauck]|        (1096,[],[])|        (1096,[],[])|
|[latinosforice, r...|        (1096,[],[])|        (1096,[],[])|
| [blackmendontcheat]|  (1096,[871],[1.0])|(1096,[871],[7.21...|
|[curiousincidento...|  (1096,[306],[1.0])|(1096,[306],[6.69...|
|   [abstinencegoals]|        (1096,[],[])|        (1096,[],[])|
|              [wplj]|  (1096,[366],[1.0])|(1096,[366],[6.92...|
|[farewellplj, plj...|  (1096,[893],[1.0])|(1096,[893],[7.21...|
+--------------------+--------------------+--------------------+
only showing top 10 rows



###### Add id field

In [20]:
w = Window().orderBy(column("tokens"))
df = df.withColumn("id", row_number().over(w))

In [21]:
df.show(10, True)

+--------------------+--------------------+--------------------+---+
|              tokens|        raw_features|     tf_idf_features| id|
+--------------------+--------------------+--------------------+---+
|                 [0]|        (1096,[],[])|        (1096,[],[])|  1|
|[00, freeship, ma...|(1096,[2,513,906]...|(1096,[2,513,906]...|  2|
|[000, freeship, m...|(1096,[2,513],[1....|(1096,[2,513],[3....|  3|
|                 [1]|   (1096,[39],[1.0])|(1096,[39],[5.364...|  4|
|                 [1]|   (1096,[39],[1.0])|(1096,[39],[5.364...|  5|
|                 [1]|   (1096,[39],[1.0])|(1096,[39],[5.364...|  6|
|                 [1]|   (1096,[39],[1.0])|(1096,[39],[5.364...|  7|
|                 [1]|   (1096,[39],[1.0])|(1096,[39],[5.364...|  8|
|                 [1]|   (1096,[39],[1.0])|(1096,[39],[5.364...|  9|
|                 [1]|   (1096,[39],[1.0])|(1096,[39],[5.364...| 10|
+--------------------+--------------------+--------------------+---+
only showing top 10 rows



In [22]:
rs = df.rdd.map(lambda x: (x[3], oldVectors.fromML(x[2])))

In [23]:
rs_df = rs.toDF()

### Run the LDA Topic Modeler

In [24]:
print(time.strftime('%m%d%Y %H:%M:%S'))
lda_model = LDA.train(rs_df['_1', '_2'].rdd.map(list), k=num_of_topics_LDA, maxIterations=max_iterations_LDA, seed=lda_seed)
print(time.strftime('%m%d%Y %H:%M:%S'))

07202019 14:27:55
07202019 14:29:12


###### Now we prepare output of LDA  to be shown

In [25]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics = lda_model.topicsMatrix()
vocabArray = cvmodel.vocabulary
print(time.strftime('%m%d%Y %H:%M:%S'))

07202019 14:29:12
07202019 14:29:13


In [26]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topicIndices = sc.parallelize(lda_model.describeTopics(maxTermsPerTopic = number_of_words_per_topic))
print(time.strftime('%m%d%Y %H:%M:%S'))

07202019 14:29:13
07202019 14:29:13


In [27]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics_final = topicIndices.map(lambda topic:topic_render(topic, number_of_words_per_topic, vocabArray)).collect()
print(time.strftime('%m%d%Y %H:%M:%S'))

07202019 14:29:13
07202019 14:29:13


###### Topics based on tweets

In [29]:
for topic in range(len(topics_final)):
    print ("Topic #" + str(topic+1) + "")
    for term in topics_final[topic]:
        print (term)
    print ('\n')

Topic #1
0.053  love
0.035  saturday
0.031  photooftheday
0.031  sunset
0.027  family
0.026  nofilter
0.026  loveislove
0.024  gay
0.024  fun
0.024  weekend
0.023  fashion
0.021  june
0.018  streetwear
0.018  instagood
0.018  3


Topic #2
0.255  nyc
0.153  newyork
0.09  newyorkcity
0.029  lgbtq
0.024  fashion
0.023  kevinshahroozi
0.019  party
0.019  lifestylesoftherichanddysfuntional
0.018  guncontrolnow
0.017  event
0.015  robertachiarellajewelry
0.014  home
0.012  handmadeintheusa
0.012  robertachiarella
0.012  style


Topic #3
0.067  joshuaruiz
0.061  bookcon19
0.043  boxing
0.03  smirk
0.03  comedy
0.03  goodomens
0.03  smile
0.023  humor
0.018  chuckle
0.018  laughter
0.018  lol
0.015  mom
0.015  grin
0.015  mirth
0.015  tee


Topic #4
0.129  yankee
0.094  mlb
0.077  ynwa
0.056  mets
0.035  deadwoodmovie
0.03  yankeestadium
0.025  dbacks
0.025  championsleague2019
0.025  nypd
0.022  phillies
0.022  newmusic
0.019  red
0.017  redsox
0.016  amjoy
0.016  summer


Topic #5
0.062  ufc

Topic #1
0.053  love
0.035  saturday
0.031  photooftheday
0.031  sunset
0.027  family
0.026  nofilter
0.026  loveislove
0.024  gay
0.024  fun
0.024  weekend
0.023  fashion
0.021  june
0.018  streetwear
0.018  instagood
0.018  3


Topic #2
0.255  nyc
0.153  newyork
0.09  newyorkcity
0.029  lgbtq
0.024  fashion
0.023  kevinshahroozi
0.019  party
0.019  lifestylesoftherichanddysfuntional
0.018  guncontrolnow
0.017  event
0.015  robertachiarellajewelry
0.014  home
0.012  handmadeintheusa
0.012  robertachiarella
0.012  style


##### Topic #3 - boxing Anthony Joshua vs. Andy Ruiz Jr. - 1 June
0.067  joshuaruiz
0.061  bookcon19
0.043  boxing
0.03  smirk
0.03  comedy
0.03  goodomens
0.03  smile
0.023  humor
0.018  chuckle
0.018  laughter
0.018  lol
0.015  mom
0.015  grin
0.015  mirth
0.015  tee


Topic #4
0.129  yankee
0.094  mlb
0.077  ynwa
0.056  mets
0.035  deadwoodmovie
0.03  yankeestadium
0.025  dbacks
0.025  championsleague2019
0.025  nypd
0.022  phillies
0.022  newmusic
0.019  red
0.017  redsox
0.016  amjoy
0.016  summer


###### Topic #5 - UFC Fight Night: Gustafsson vs. Smith - 1 June
0.062  ufcstockholm
0.049  mma
0.047  ufc
0.044  happyhour
0.044  art
0.036  foj
0.017  poem
0.017  artist
0.017  figmentnyc
0.017  bitcoin
0.017  plastic
0.014  lighthousepark
0.014  rooseveltisland
0.014  classic
0.014  life


Topic #6
0.173  whentheyseeus
0.047  falcon4
0.025  foodie
0.022  phantomthread
0.022  foodporn
0.022  uws
0.019  nevercursed
0.019  godzilla
0.019  legend
0.019  photography
0.016  jfk
0.016  godzillakingofthemonsters
0.016  author
0.014  lovelife
0.013  rip


###### Topic #7 -  UEFA Champions League Final - 1 June
0.061  manhattan
0.061  ny
0.044  centralpark
0.032  liverpool
0.032  pridemonth2019
0.029  championsleague
0.024  notmeus
0.024  mytwitteranniversary
0.022  spring
0.019  cbd
0.019  mta
0.019  blessed
0.019  happybirthday
0.019  uclfinal2019
0.016  stanleycup


Topic #8
0.104  alwaysbemymaybe
0.059  coys
0.044  repost
0.035  govball2019
0.035  youredoinggreat
0.032  championsleaguefinal
0.026  uclfinal19
0.023  dunkout
0.021  nowplaying
0.018  model
0.017  msg
0.017  brunch
0.015  mrjordanbc
0.015  womeninventors
0.015  bellaabzugpark


Topic #9
0.073  govballnyc
0.052  beermenus
0.033  medium
0.028  streetart
0.027  gh
0.027  virginiabeachshooting
0.027  timessquare
0.025  governorsball
0.019  nxttakeoverxxv
0.019  travel
0.018  nike
0.017  festival
0.017  graffiti
0.017  cdr
0.016  5


Topic #10
0.099  food
0.088  hookah
0.088  sugardaddysfridays
0.088  valet
0.074  sexyentertainers
0.068  nycnightlife
0.051  pride2019
0.019  saturdaythoughts
0.019  classof2019
0.016  rbny
0.014  amwriting
0.014  saturdaymorning
0.014  100reviews
0.014  deadwood
0.012  batman


Topic #11
0.135  uclfinal
0.059  lfc
0.025  broadway
0.025  shakespeareinthepark
0.023  rocketman
0.023  fbf
0.02  ucl
0.02  govball
0.018  free
0.018  fan
0.018  hadestown
0.015  neverreallyover
0.012  coffee
0.012  netflix
0.012  press


Topic #12
0.151  freeship
0.071  bookcon
0.07  case
0.056  generic
0.043  dvd
0.028  cd
0.028  jewel
0.026  harlem
0.021  sleeve
0.019  latergram
0.017  fitness
0.015  808mafia
0.014  graduation
0.014  mylife
0.012  storage


###### Topic #13 - June - month of pride
0.125  pride
0.05  1
0.045  queen
0.033  2
0.03  happypride
0.026  stonewall50
0.025  worldpride
0.022  subway
0.022  dog
0.022  theother5th
0.022  godzillamovie
0.02  beautiful
0.017  newprofilepic
0.017  kelseysarahbinge
0.017  championsleaguefinal2019


###### Topic #14 - BookCon - conference in NY (1-2 June)
0.113  brooklyn
0.047  music
0.043  bookcon2019
0.041  hiphoped
0.031  hiphop
0.026  bronx
0.026  btsatwembley
0.022  birthday
0.022  mood
0.019  bk
0.017  rose
0.017  2019
0.017  friend
0.017  saturdayvibes
0.014  youtube


Topic #15
0.144  pridemonth
0.047  nxttakeover
0.032  lga
0.027  flightdelay
0.025  impact
0.025  pinstripepride
0.023  impactontwitch
0.022  lgbt
0.02  wwe
0.02  boston
0.018  travelingmusician
0.015  oistrakhsymphony
0.015  stonewall
0.015  portrait
0.015  bushwick


-----------------------------------

#### Result based on the tweets text without hashtags
Topic #1
0.038  get
0.03  people
0.027  need
0.027  good
0.024  want
0.019  never
0.017  still
0.017  show
0.015  first
0.014  like
0.013  gonna
0.011  keep
0.011  baby
0.011  getting
0.01  hope


Topic #2
0.029  amp
0.02  work
0.018  great
0.017  th
0.014  well
0.012  tonight
0.012  pm
0.01  june
0.009  another
0.009  call
0.008  nice
0.008  open
0.008  hour
0.008  making
0.008  free


Topic #3
0.028  got
0.023  year
0.018  take
0.016  last
0.016  best
0.01  sure
0.01  two
0.009  already
0.009  try
0.009  wow
0.009  literally
0.009  read
0.009  song
0.008  point
0.008  black


Topic #4
0.032  love
0.03  day
0.024  really
0.022  make
0.022  shit
0.02  say
0.02  right
0.02  back
0.019  lmao
0.016  every
0.015  happy
0.015  come
0.015  friend
0.013  said
0.012  im


Topic #5
0.025  thank
0.02  thing
0.016  feel
0.015  yes
0.015  ever
0.013  real
0.013  woman
0.011  fucking
0.011  amazing
0.011  something
0.011  god
0.01  watch
0.009  like
0.009  white
0.009  men


Topic #6
0.032  time
0.032  u
0.03  know
0.026  see
0.019  going
0.018  even
0.017  let
0.017  much
0.014  fuck
0.012  made
0.011  could
0.01  thought
0.01  trying
0.01  old
0.009  park


Topic #7
0.035  one
0.022  think
0.021  would
0.014  also
0.013  someone
0.013  next
0.011  trump
0.011  week
0.011  job
0.01  tell
0.01  mean
0.01  many
0.01  anyone
0.01  talk
0.01  everyone


Topic #8
0.025  lol
0.018  life
0.017  way
0.016  man
0.014  better
0.013  nigga
0.013  stop
0.012  lmfao
0.01  bad
0.01  wanna
0.01  start
0.009  gotta
0.009  find
0.009  hard
0.009  help


Topic #9
0.048  new
0.036  york
0.023  today
0.022  ny
0.02  look
0.013  please
0.013  city
0.013  w
0.012  game
0.012  summer
0.01  like
0.01  video
0.01  put
0.009  hate
0.009  whole


Topic #10
0.022  nyc
0.02  go
0.015  brooklyn
0.014  guy
0.014  oh
0.013  girl
0.012  thanks
0.011  wait
0.011  night
0.01  yeah
0.01  b
0.009  always
0.009  street
0.008  manhattan
0.008  photo

###### As we can see, results are better when we use hashtags

### Hot topics in the USA from [Google trends](https://trends.google.com/trends/explore?geo=US)

To check actual result of tweets data, we acquire google trends data by the specific location and the same period of time. So, send request to trends.google.com and got responce which contains top search topics and top search queries. 
There are used modified pytrends API to get get this data. We modified a little bit (add fuinctions to interface related_top_search_topics, related_top_search_queries) pytrends API to get google trends date by specifi date and timeframe, because there are not such functional in public API.
<br>
<br>
pytrnds API: https://github.com/GeneralMills/pytrends
<br>
This repository was forked and changes saves in our public rep: https://github.com/DmytroBabenko/pytrends 

In [30]:
start_date_str = frame_start_datetime.strftime("%Y-%m-%d")
finish_date_str = frame_start_datetime.strftime("%Y-%m-%d")

pytrend = TrendReq()
pytrend.build_payload(kw_list=[' '], geo=geo, timeframe=f"{start_date_str} {finish_date_str}")

##### Search topics

In [31]:
topics_df = pytrend.related_top_search_topics(spark)

In [32]:
print_google_trend_title(frame_start_datetime, frame_start_datetime, "Search topics", geo)
topics_df.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-01
+---------------------------------------------+---------------------------------------+
|Search topics - rising                       |Search topics - top                    |
+---------------------------------------------+---------------------------------------+
|Liverpool F.C. - Football club               |New York - City in New York            |
|Tottenham Hotspur F.C. - Football club       |New York - US State                    |
|2018 UEFA Champions League Final - Tournament|2019 - Topic                           |
|UEFA Champions League - Football competition |Weather - Topic                        |
|Mega Millions - Topic                        |Film - Topic                           |
|Virginia Beach - City in Virginia            |Facebook, Inc. - Social network company|
|The Central Park Five - 2012 film            |Facebook - Social networking service   |
|Sports league - Topic                        |YouTube - Vide

##### Search queries

In [33]:
queries_df = pytrend.related_top_search_queries(spark)

In [34]:
print_google_trend_title(frame_start_datetime, frame_start_datetime, "Search queries", geo)
queries_df.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-01
+-----------------------+--------------------+--------+---+-----+
|Search queries - rising|Search queries - top|Rising  |Top|geo  |
+-----------------------+--------------------+--------+---+-----+
|dewayne craddock       |you                 |Breakout|100|US-NY|
|jose antonio reyes     |weather             |+3,350% |97 |US-NY|
|dwayne craddock        |facebook            |+2,550% |63 |US-NY|
|nelson figueroa        |google              |+1,750% |51 |US-NY|
|liverpool vs tottenham |youtube             |+1,450% |46 |US-NY|
|champions league final |amazon              |+1,100% |40 |US-NY|
|tottenham              |news                |+1,000% |39 |US-NY|
|champions league       |yankees             |+850%   |30 |US-NY|
|liverpool              |champions           |+850%   |29 |US-NY|
|champions              |lottery             |+650%   |27 |US-NY|
|ucl final              |movies              |+600%   |26 |US-NY|
|virginia beach 

# Evaluation and justification

Here, we considered the date 1 June, there was the final of Champions Leaque. Runnig LDA algorithm on only ** hashtags extracted from tweets **, we got several topics based on tweets, which are described above. In these topics we can find posts as **uclfinal2019**, **ucl**, ***championsleague2019***, ***liverpool***, ***championsleaguefinal***, ***tottenham***. Futhemore, google trends top search topics tell us about ***Liverpool F.C. - Football club***, ***Tottenham Hotspur F.C. - Football club***, ***UEFA Champions League - Football competition***, ***2018 UEFA Champions League Final - Tournament***, which so similar to tweets. Moreover, google seach queries for this date has ***liverpool vs tottenham***, ***tottenham***, ***liverpool***, ***champions league final***, ***uefa champions league final 2019***. It is clear to notice, that LDA tweets result are relevant to google trends for 1 June, when there was the final of Chamion League. <br>
Besides, the day before 1 June, there was Virginia Beach shooting. And, LDA gives us some relevant tweet to this event - ***virginiabeachshooting***, which is acceptbale by google trends results: ***Virginia Beach - City in Virginia*** and ***virginia beach shooting suspect***. <br>
Also, there are a lot of coincides about Gay Prode 2019, which was in June in New York. So, for this event we have sevaral tweets: ***pridemonth***, ***pride2019***, ***lgbtq***, ***loveislove***, ***gay***, ***love***, ***worldpride***. And relevant topics/queries from google trends - ***Gay pride - Topic***, ***pride month***.
<br>Also, there is highlight of book conference in New York during 1-2 June: ***Bookcon2019***

Overall, the tweets results are relevant to some popular events which are close to specific date. So we can say that the pipeline we developed is a working solution for the topic modeling problem.
<br>As an option to improve the performance of our algorithm, we see the purchase of the full amount of Twitter data on the territory, and their processing on a full-fledged cluster. This will remove the effect of a random selection of tweets and increase the amount of data, which should make the result even better.