# Problem Statement 
**TODO: describe the problem**

# ML problem
**TODO: describel our ml approach**

# Data description
In order to collect data in a natural way:
<br>- we registered Twitter Developer account;
<br>- using credentials from Twitter Developer account we run script that collected tweets by the geolocation and saved them in mongodb;
<br>
<br><b>As a result:</b>
<br>- we collected  332548 tweets (10Gb in mongodb, ~100Mb in csv) from New-York geolocation since 30 of May up to 15 of June;
<br>- we collected  6617029 tweets (~1.69Gb in csv) from USA geolocation since 15 of June up to now.

# Let's start our demo ride

### Import all libs needed

In [1]:
# crucial thing
import findspark
findspark.init()

In [2]:
# essential pyspark
import pyspark
import operator
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, StringType, IntegerType, StructField, StructType
from pyspark.sql.functions import udf, row_number,column

# vectorizer
from pyspark.ml.feature import CountVectorizer, StopWordsRemover, HashingTF, IDF, Tokenizer

# staff for LDA
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector as oldVector, Vectors as oldVectors
from pyspark.ml.linalg import Vector as newVector, Vectors as newVectors

# pytrends for acquiring google trends
from pytrends.pytrends.request import TrendReq

# import hardcoded variables
from variables import channels_not_to_consider

# custom text preprocessing
from text_preprocessing import *

# custom tools to work with google trends 
from trends import *

# handy functions
from utils import *

# datetime handling
from datetime import datetime
import time

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Define variables

**Global variables definition**  
We picked some time frames to get data from to check if our topic model can extract info about events that occured during this period.

In [3]:
# final of league championship 
lc_final_start_datetime = "Sat Jun 01 00:00:00 +0000 2019"
lc_finish_finish_datetime = "Sat Jun 01 23:59:59 +0000 2019"

# Stanley cup final
stanley_final_start_datetime = "Wed Jun 12 00:00:00 +0000 2019"
stanley_finish_finish_datetime = "Wed Jun 12 23:59:59 +0000 2019"

# Draft NBA
nba_final_start_datetime = "Thu Jun 20 00:00:00 +0000 2019"
nba_finish_finish_datetime = "Sun Jun 23 23:59:59 +0000 2019"

**User-specific variables**  
Please feel free to tweak those variables as you wish. For example, you can set number of last hours to get hottest topics.

In [4]:
# if True locations from locations_to_consider will be used to filter
get_from_location = True

# locations to filter relevant tweets
locations_to_consider = [
                         'Manhattan, NY', 
                         'Brooklyn, NY', 
                         'Queens, NY', 
                         'Bronx, NY', 
                         'Staten Island, NY'
                         'New York, USA'
                        ]

# choose US if you want to consider whole US area
# used by google trends
geo = "US-NY" 

# hyperparams
number_of_hours_to_get_topics = 2
num_of_top_interest = 15

# set window time for interesting
frame_start_datetime = str_tweet_to_datetime(stanley_final_start_datetime)
frame_finish_datetime = str_tweet_to_datetime(stanley_finish_finish_datetime)

assert (frame_finish_datetime - frame_start_datetime).days <= 3, "Date interval should not be bigger than 3 days"

**Technical variables**  
Those variables are needed to connect to db and other technical stuff.

In [5]:
# LDA params
num_of_topics_LDA = 10
max_iterations_LDA = 100
number_of_words_for_topic = 15  # number of words per topic

# path to CSV
historical_tweets_data = '../get-tweets-by-geolocation/data/new_york_training_tweets_15_06.csv'
# historical_tweets_data = './get-tweets-by-geolocation/training_tweets.csv'

### Create spark session

In [6]:
spark = SparkSession.builder.appName("pipeline").getOrCreate()
sc = spark.sparkContext

### Read the data


**Load Google Trends data**

In [7]:
google_trends_search_queries_us = spark.read.csv('../data/google-trends/google-trends-search-queries-US.csv', inferSchema=True, header=True)
google_trends_search_topics_us = spark.read.csv('../data/google-trends/google-trends-search-topics-US.csv', inferSchema=True, header=True)
google_trends_search_queries_us_ny = spark.read.csv('../data/google-trends/google-trends-search-queries-US-NY.csv', inferSchema=True, header=True)
google_trends_search_topics_us_ny = spark.read.csv('../data/google-trends/google-trends-search-topics-US-NY.csv', inferSchema=True, header=True)

**Load the historical data, it can take a while**

In [8]:
times = (frame_start_datetime, frame_finish_datetime)
print("Time range to be extracted from ", historical_tweets_data, times[0], times[1])
selected_df = get_historical_df(historical_tweets_data=historical_tweets_data, historical_start_time=times[0], historical_finish_time=times[1], spark=spark)
assert selected_df != None, "Something goes wrong with selecting data from recent data/history data"
selected_df.count()

Time range to be extracted from  ../get-tweets-by-geolocation/data/new_york_training_tweets_15_06.csv 2019-06-12 00:00:00+00:00 2019-06-12 23:59:59+00:00
Range for collected data (history):  2019-06-12 00:00:00+00:00 2019-06-12 23:59:59+00:00


29538

### Tweets preprocessing

Text cleaning is crucial for any text modelling process, especially for topic modelling. In our case it consists from those steps:  
1) Lowercase all words  
2) Filter words with non-letters at the beginning (mainly for mentions, e.g. "@some_user")  
3) Filter http/https  
4) Filter all non-letters (crucial to remove emoji)  
5) Remove multiply whitespaces  
6) Remove repeated chars (e.g. "greeeeat" -> "great")

In [11]:
df = selected_df

In [13]:
# filter nans
df = df.rdd.filter(lambda x: x[0] != None and x[1] != None and x[2] != None and x[4] != None)

# filter out channels not to consider
df = df.filter(lambda x: x[4] not in channels_not_to_consider)

# filter by country
df = df.filter(lambda x: x[1] in 'US')

# filter by precise location
if get_from_location:
    df = df.filter(lambda x: x[2] in locations_to_consider)

# filter tweet itself
df = df.filter(lambda x: filter_tweet(x[0], channels_not_to_consider=channels_not_to_consider))

# process tweet
df = df.map(lambda x: process_tweet(x[0]))

# final preprocesssing
df = df.filter(lambda x: len(x) > 0)

# make dataframes great again
df = df.map(lambda x: [x])

# schema for df
schema = StructType([StructField('tokens', ArrayType(StringType()), True)])
df = df.toDF(schema=schema)

In [14]:
df.show(10)

+--------------------+
|              tokens|
+--------------------+
|[sad, news, noise...|
|[first, thing, ge...|
|               [lit]|
|[ironically, star...|
|   [thank, bro, bro]|
|         [yeah, sir]|
|[indeed, stan, fo...|
|[good, draping, e...|
|[leftover, tuesda...|
|[anyone, want, em...|
+--------------------+
only showing top 10 rows



In [15]:
df.count()

17398

# Topic modeling/Latent Dirichlet allocation(LDA)
**TODO: describe ml solution**

In [38]:
# df2.show(10, True)

In [39]:
# df2.printSchema()

In [40]:
print(time.strftime('%m%d%Y %H:%M:%S'))

cv = CountVectorizer(inputCol="tokens", outputCol="raw_features", vocabSize=10000, minDF=2.0)
cvmodel = cv.fit(df)

print(time.strftime('%m%d%Y %H:%M:%S'))

07172019 22:31:29
07172019 22:32:00


In [41]:
print(time.strftime('%m%d%Y %H:%M:%S'))
df = cvmodel.transform(df)
print(time.strftime('%m%d%Y %H:%M:%S'))

07172019 22:32:00
07172019 22:32:00


In [42]:
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features", minDocFreq=2)
idfModel = idf.fit(df)

df = idfModel.transform(df)


In [43]:
df.show(10, True)

+--------------------+--------------------+--------------------+
|              tokens|        raw_features|     tf_idf_features|
+--------------------+--------------------+--------------------+
|[sad, news, noise...|(7583,[287,369,63...|(7583,[287,369,63...|
|[first, thing, ge...|(7583,[2,21,36,37...|(7583,[2,21,36,37...|
|               [lit]|  (7583,[753],[1.0])|(7583,[753],[6.54...|
|[ironically, star...|(7583,[403,862,11...|(7583,[403,862,11...|
|   [thank, bro, bro]|(7583,[18,150],[1...|(7583,[18,150],[3...|
|         [yeah, sir]|(7583,[120,675],[...|(7583,[120,675],[...|
|[indeed, stan, fo...|(7583,[141,259,42...|(7583,[141,259,42...|
|[good, draping, e...|(7583,[14,79,679,...|(7583,[14,79,679,...|
|[leftover, tuesda...|(7583,[154,1093,1...|(7583,[154,1093,1...|
|[anyone, want, em...|(7583,[20,108,706...|(7583,[20,108,706...|
+--------------------+--------------------+--------------------+
only showing top 10 rows



In [44]:
#df = df.drop("name")
#df.show(10, False)

In [45]:
w = Window().orderBy(column("tokens"))
df = df.withColumn("id", row_number().over(w))

In [46]:
df.show(10, True)

+--------------------+--------------------+--------------------+---+
|              tokens|        raw_features|     tf_idf_features| id|
+--------------------+--------------------+--------------------+---+
|[aapl, strong, da...|(7583,[0,10,17,21...|(7583,[0,10,17,21...|  1|
|[aaron, coming, b...|(7583,[32,148,144...|(7583,[32,148,144...|  2|
|[aaron, got, full...|(7583,[12,30,106,...|(7583,[12,30,106,...|  3|
|[abandoned, novel...|(7583,[16,115,174...|(7583,[16,115,174...|  4|
|[abingdon, market...|(7583,[1,4,24,708...|(7583,[1,4,24,708...|  5|
|[able, convert, s...|(7583,[11,525,657...|(7583,[11,525,657...|  6|
|[able, cop, gener...|(7583,[54,118,222...|(7583,[54,118,222...|  7|
|           [able, w]|(7583,[76,657],[1...|(7583,[76,657],[4...|  8|
| [aboard, tug, boat]|(7583,[3111,4114]...|(7583,[3111,4114]...|  9|
|[abortion, listen...| (7583,[1494],[1.0])|(7583,[1494],[7.1...| 10|
+--------------------+--------------------+--------------------+---+
only showing top 10 rows



In [47]:
rs = df.rdd.map(lambda x: (x[3], oldVectors.fromML(x[2])))

In [48]:
rs_df = rs.toDF()
rs_df.show(10, False)

+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_1 |_2                                                                                                                                                                                                                                                                                                 |
+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |(7583,[0,10,17,21,35,132,161,233,252,620,637,957,1235],[2.971823584815907,3.6199823781610703,3.861534

In [49]:
# Run the LDA Topic Modeler
# Note the time before and after is printed in order to find out how much time it takes to process x number of records

print(time.strftime('%m%d%Y %H:%M:%S'))
lda_model = LDA.train(rs_df['_1', '_2'].rdd.map(list), k=10, maxIterations=100)
print(time.strftime('%m%d%Y %H:%M:%S'))

07172019 22:34:04
07172019 22:35:18


In [50]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics = lda_model.topicsMatrix()
vocabArray = cvmodel.vocabulary

07172019 22:35:18


In [51]:
wordNumbers = 15

topicIndices = sc.parallelize(lda_model.describeTopics(maxTermsPerTopic = wordNumbers))

def topic_render(topic):  # specify vector id of words to actual words
    terms = topic[0]
    prob = topic[1]
    
    result = []
    for i in range(number_of_words_for_topic):
        term = str(round(prob[i],3))+"  "+vocabArray[terms[i]]
        result.append(term)
    return result
print(time.strftime('%m%d%Y %H:%M:%S'))

07172019 22:35:20


In [52]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics_final = topicIndices.map(lambda topic:topic_render(topic)).collect()
print(time.strftime('%m%d%Y %H:%M:%S'))

07172019 22:35:20
07172019 22:35:20


# Topics

In [51]:
# based on the simple vectors(+number of words)

for topic in range(len(topics_final)):
    print ("Topic #" + str(topic+1) + "")
    for term in topics_final[topic]:
        print (term)
    print ('\n')

Topic #1
0.032  u
0.032  love
0.024  thank
0.02  work
0.02  right
0.019  back
0.017  let
0.016  always
0.014  fuck
0.013  also
0.011  big
0.01  looking
0.01  trying
0.009  watch
0.009  lot


Topic #2
0.03  one
0.029  day
0.026  good
0.026  got
0.022  today
0.021  year
0.018  look
0.018  thing
0.016  much
0.015  last
0.015  feel
0.013  guy
0.012  gonna
0.011  night
0.011  week


Topic #3
0.028  see
0.027  like
0.026  lol
0.015  first
0.014  said
0.013  real
0.013  nigga
0.012  job
0.011  wait
0.011  tell
0.011  many
0.01  wanna
0.009  white
0.009  ok
0.008  team


Topic #4
0.048  new
0.036  york
0.031  time
0.025  go
0.022  think
0.015  come
0.014  oh
0.013  city
0.012  give
0.011  as
0.01  ya
0.01  world
0.01  park
0.009  try
0.008  manhattan


Topic #5
0.022  make
0.021  would
0.019  lmao
0.018  even
0.017  life
0.015  happy
0.015  best
0.012  someone
0.012  w
0.01  thought
0.01  mean
0.009  b
0.009  bitch
0.009  hate
0.009  person


Topic #6
0.039  get
0.028  need
0.025  want
0.013  

### Hot topics in the USA from [Google trends](https://trends.google.com/trends/explore?geo=US)

In [52]:
start_date = frame_start_datetime #str_tweet_to_datetime(frame_start_datetime)
finish_date = frame_finish_datetime #str_tweet_to_datetime(frame_finish_datetime)

In [53]:
google_trends_topics, google_trends_queries = get_google_trends_by_geo(geo) 

##### Google trends search queries

In [54]:
interesting_google_topics = google_trends_topics.filter(
    (google_trends_topics.Date >= start_date) & (google_trends_topics.Date <= finish_date))

In [55]:
print_google_trend_title(start_date, finish_date, "Search topics")
interest_google_topics = convert_datetime_in_interesting_google(interesting_google_topics)
interest_google_topics.select("Date","Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-12 - 2019-06-12
+----------+----------------------+----------------------------------------+
|Date      |Search topics - rising|Search topics - top                     |
+----------+----------------------+----------------------------------------+
|2019-06-12|Download - Topic      |New York - City in New York             |
|2019-06-12|null                  |New York - US State                     |
|2019-06-12|null                  |Google Search - Topic                   |
|2019-06-12|null                  |Google - Technology company             |
|2019-06-12|null                  |2019 - Topic                            |
|2019-06-12|null                  |Weather - Topic                         |
|2019-06-12|null                  |YouTube - Video sharing company         |
|2019-06-12|null                  |Facebook, Inc. - Social network company |
|2019-06-12|null                  |Facebook - Social networking service    |
|201

In case when timeframe is more than 1 day, filter correctly this google-trends

In [56]:
# interesing_google_topics_unique= unique_google_trends_by_time_frame(interesting_google_topics)
# print_google_trend_title(start_date, finish_date, "Search topics")
# interesing_google_topics_unique.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)

##### Google trends search queries

In [57]:
interesting_google_queries = google_trends_queries.filter(
    (google_trends_queries.Date >= start_date) & (google_trends_queries.Date <= finish_date))

In [58]:
interesing_google_queries_unique= unique_google_trends_by_time_frame(interesting_google_queries)
print_google_trend_title(start_date, finish_date, "Search queries")
interesing_google_queries_unique.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-12 - 2019-06-12
+-----------------------+--------------------+------+---+-----+
|Search queries - rising|Search queries - top|Rising|Top|geo  |
+-----------------------+--------------------+------+---+-----+
|sudanese massacre      |google              |+300% |100|US-NY|
|hong kong              |weather             |+250% |79 |US-NY|
|ny yankees score       |facebook            |+180% |59 |US-NY|
|iready                 |youtube             |+150% |54 |US-NY|
|boston bruins          |amazon              |+150% |49 |US-NY|
|raz kids               |news                |+110% |46 |US-NY|
|bruins                 |definition          |+110% |33 |US-NY|
|cool math games        |craigslist          |+100% |28 |US-NY|
|verizon wireless       |gmail               |+90%  |24 |US-NY|
|bitcoin price          |translate           |+90%  |23 |US-NY|
|scratch                |nba                 |+90%  |23 |US-NY|
|max landis             |instag

In [59]:
# print_google_trend_title(start_date, finish_date, "Search queries")
# interest_google_queries = convert_datetime_in_interesting_google(interesting_google_queries)
# interest_google_queries.select("Date", "Search queries - rising", "Search queries - top").show(num_of_top_interest, False)

#### Hot topics - google trends (directly) (probably this will be removed)

In [60]:
start_date_str = start_date.strftime("%Y-%m-%d")
finish_date_str = finish_date.strftime("%Y-%m-%d")
pytrend = TrendReq()
pytrend.build_payload(kw_list=[' '], geo=geo, timeframe=f"{start_date_str} {finish_date_str}")

##### Search topics

In [61]:
topics_df = pytrend.related_top_search_topics(spark)

In [62]:
print_google_trend_title(start_date, finish_date, "Search topics")
topics_df.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-12 - 2019-06-12
+-------------------------+----------------------------------------+
|Search topics - rising   |Search topics - top                     |
+-------------------------+----------------------------------------+
|Ice - Topic              |New York - City in New York             |
|Mathematical game - Topic|New York - US State                     |
|Design - Topic           |Google - Technology company             |
|nan                      |Google Search - Topic                   |
|nan                      |2019 - Topic                            |
|nan                      |Weather - Topic                         |
|nan                      |YouTube - Video sharing company         |
|nan                      |Facebook - Social networking service    |
|nan                      |Facebook, Inc. - Social network company |
|nan                      |Amazon.com - E-commerce company         |
|nan                      |Defi

##### Search queries

In [63]:
queries_df = pytrend.related_top_search_queries(spark)

In [64]:
print_google_trend_title(start_date, finish_date, "Search queries")
queries_df.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-12 - 2019-06-12
+-----------------------+--------------------+------+---+-----+
|Search queries - rising|Search queries - top|Rising|Top|geo  |
+-----------------------+--------------------+------+---+-----+
|hong kong              |google              |+350% |100|US-NY|
|courtney stodden       |weather             |+300% |80 |US-NY|
|sudanese massacre      |facebook            |+250% |60 |US-NY|
|khloe kardashian       |youtube             |+250% |56 |US-NY|
|cricinfo               |amazon              |+190% |53 |US-NY|
|iready                 |news                |+180% |42 |US-NY|
|neiman marcus          |craigslist          |+110% |27 |US-NY|
|jupiter ed             |world cup           |+100% |25 |US-NY|
|ny lottery post        |instagram           |+90%  |25 |US-NY|
|adidas                 |nba                 |+90%  |25 |US-NY|
|alex morgan            |gmail               |+80%  |22 |US-NY|
|pbs kids               |map   