# Problem Statement 
**TODO: describe the problem**

# ML problem
**TODO: describe our ML approach**

# Data description
To collect data in a natural way:
<br>- we registered a Twitter Developer account;
<br>- using the credentials from that account, we ran a script that collected tweets by geolocation and saved them in MongoDB.
<br>
<br><b>As a result:</b>
<br>- we collected 332548 tweets (10 GB in MongoDB, ~100 MB in CSV) from the New York geolocation between May 30 and June 15;
<br>- we collected 6617029 tweets (~1.69 GB in CSV) from the USA geolocation from June 15 up to now.

# Let's start our demo ride

### Import all libs needed

In [1]:
# crucial thing
import findspark
findspark.init()

In [2]:
# essential pyspark
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, StringType, IntegerType, StructField, StructType
from pyspark.sql.functions import udf, row_number, column

# vectorizer
from pyspark.ml.feature import CountVectorizer, StopWordsRemover, HashingTF, IDF, Tokenizer

# stuff for LDA
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector as oldVector, Vectors as oldVectors
from pyspark.ml.linalg import Vector as newVector, Vectors as newVectors

# pytrends for acquiring google trends
from pytrends.pytrends.request import TrendReq

# import hardcoded variables
from variables import channels_not_to_consider

# custom text preprocessing
from text_preprocessing import *

# custom tools to work with google trends 
from trends import *

# handy functions
from utils import *

# datetime handling
from datetime import datetime
import time

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Define variables

**Global variables definition**  
We picked several time frames to check whether our topic model can extract information about events that occurred during these periods.

In [3]:
# Champions League final
lc_final_start_datetime = "Sat Jun 01 00:00:00 +0000 2019"
lc_finish_finish_datetime = "Sat Jun 01 23:59:59 +0000 2019"

# Stanley cup final
stanley_final_start_datetime = "Wed Jun 12 00:00:00 +0000 2019"
stanley_finish_finish_datetime = "Wed Jun 12 23:59:59 +0000 2019"

# NBA draft
nba_final_start_datetime = "Thu Jun 20 00:00:00 +0000 2019"
nba_finish_finish_datetime = "Sun Jun 23 23:59:59 +0000 2019"
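
The frame boundaries above are written in the timestamp format Twitter uses for a tweet's `created_at` field. As a minimal sketch, such a string can be parsed with the standard library. Note that `parse_tweet_datetime` is a hypothetical stand-in for the `str_tweet_to_datetime` helper imported from `utils`, whose actual implementation is not shown in this notebook and may differ.

```python
from datetime import datetime, timezone

# Twitter-style timestamp layout, e.g. "Wed Jun 12 00:00:00 +0000 2019"
TWEET_DATETIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet_datetime(value):
    """Hypothetical helper: parse a Twitter-style timestamp into an aware datetime."""
    return datetime.strptime(value, TWEET_DATETIME_FORMAT)


start = parse_tweet_datetime("Wed Jun 12 00:00:00 +0000 2019")
finish = parse_tweet_datetime("Wed Jun 12 23:59:59 +0000 2019")

# Both values carry the +0000 offset, so they can be compared and subtracted safely
print(start.astimezone(timezone.utc).isoformat())
print((finish - start).days)
```

Keeping the offset in the parsed value avoids mixing naive and timezone-aware datetimes when the frame is later compared with tweet timestamps.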

**User-specific variables**  
Feel free to tweak these variables as you wish. For example, you can set the number of most recent hours used to find the hottest topics.

In [4]:
# if True locations from locations_to_consider will be used to filter
get_from_location = True

# locations to filter relevant tweets
locations_to_consider = [
    'Manhattan, NY',
    'Brooklyn, NY',
    'Queens, NY',
    'Bronx, NY',
    'Staten Island, NY',  # the comma matters: without it Python joins this string with the next one
    'New York, USA',
]

# choose US if you want to consider whole US area
# used by google trends
geo = "US-NY" 

# hyperparams
number_of_hours_to_get_topics = 2
num_of_top_interest = 15

# set the time window of interest
frame_start_datetime = str_tweet_to_datetime(stanley_final_start_datetime)
frame_finish_datetime = str_tweet_to_datetime(stanley_finish_finish_datetime)

assert (frame_finish_datetime - frame_start_datetime).days <= 3, "Date interval should not be bigger than 3 days"

**Technical variables**  
These variables configure the LDA model and the path to the stored tweets.

In [5]:
# LDA params
num_of_topics_LDA = 10
max_iterations_LDA = 100
number_of_words_for_topic = 15  # number of words per topic

# path to CSV
historical_tweets_data = '../get-tweets-by-geolocation/data/new_york_training_tweets_15_06.csv'
# historical_tweets_data = './get-tweets-by-geolocation/training_tweets.csv'

### Create spark session

In [6]:
spark = SparkSession.builder.appName("pipeline").getOrCreate()
sc = spark.sparkContext

### Read the data


**Load Google Trends data**

In [7]:
google_trends_search_queries_us = spark.read.csv('../data/google-trends/google-trends-search-queries-US.csv', inferSchema=True, header=True)
google_trends_search_topics_us = spark.read.csv('../data/google-trends/google-trends-search-topics-US.csv', inferSchema=True, header=True)
google_trends_search_queries_us_ny = spark.read.csv('../data/google-trends/google-trends-search-queries-US-NY.csv', inferSchema=True, header=True)
google_trends_search_topics_us_ny = spark.read.csv('../data/google-trends/google-trends-search-topics-US-NY.csv', inferSchema=True, header=True)

In [8]:
google_trends_dict = {}
google_trends_dict['US'] = [google_trends_search_topics_us, google_trends_search_queries_us]
google_trends_dict['US-NY'] = [google_trends_search_topics_us_ny, google_trends_search_queries_us_ny]

**Load the historical data (this can take a while)**

In [None]:
times = (frame_start_datetime, frame_finish_datetime)
print("Time range to be extracted from ", historical_tweets_data, times[0], times[1])
selected_df = get_historical_df(historical_tweets_data=historical_tweets_data, historical_start_time=times[0], historical_finish_time=times[1], spark=spark)
assert selected_df is not None, "Something went wrong while selecting data from recent/historical data"
selected_df.count()

### Tweets preprocessing

Text cleaning is crucial for any text modelling process, especially for topic modelling. In our case it consists of the following steps:  
1) Lowercase all words  
2) Filter words with non-letters at the beginning (mainly for mentions, e.g. "@some_user")  
3) Filter http/https  
4) Filter all non-letters (crucial to remove emoji)  
5) Remove multiple whitespaces  
6) Remove repeated chars (e.g. "greeeeat" -> "great")
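
The six steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `clean_tweet` is a hypothetical function, not the `process_tweet` imported from `text_preprocessing`, and the rule used for repeated characters (collapse runs of three or more to a single character) is one possible reading of step 6.

```python
import re


def clean_tweet(text):
    """Hypothetical cleaner following the six steps above; returns a list of tokens."""
    tokens = []
    # Step 1: lowercase. Splitting on whitespace also takes care of step 5.
    for token in text.lower().split():
        # Step 2: drop tokens that start with a non-letter (mentions, hashtags, emoji)
        if not token[0].isalpha():
            continue
        # Step 3: drop links
        if token.startswith(("http://", "https://")):
            continue
        # Step 4: keep letters only
        token = re.sub(r"[^a-z]", "", token)
        # Step 6: collapse runs of three or more identical characters
        token = re.sub(r"(.)\1{2,}", r"\1", token)
        if token:
            tokens.append(token)
    return tokens


print(clean_tweet("@some_user This is greeeeat!!! https://t.co/abc  #NYC"))
# ['this', 'is', 'great']
```

Collapsing a run to a single character is a blunt rule: it also shortens words with legitimate double letters when they are stretched (for example "coool" becomes "col"), so a real pipeline may prefer collapsing to two characters followed by a dictionary check.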

In [None]:
df = selected_df

In [None]:
# filter out rows with missing values in the columns used below
df = df.rdd.filter(lambda x: x[0] is not None and x[1] is not None and x[2] is not None and x[4] is not None)

# filter out channels not to consider
df = df.filter(lambda x: x[4] not in channels_not_to_consider)

# filter by country
df = df.filter(lambda x: x[1] == 'US')  # equality, not substring: `in 'US'` would also accept 'U', 'S' and ''

# filter by precise location
if get_from_location:
    df = df.filter(lambda x: x[2] in locations_to_consider)

# filter tweet itself
df = df.filter(lambda x: filter_tweet(x[0], channels_not_to_consider=channels_not_to_consider))

# process tweet
df = df.map(lambda x: process_tweet(x[0]))

# drop tweets that are empty after preprocessing
df = df.filter(lambda x: len(x) > 0)

# wrap each token list in a one-element row so the RDD can become a DataFrame again
df = df.map(lambda x: [x])

# schema for df
schema = StructType([StructField('tokens', ArrayType(StringType()), True)])
df = df.toDF(schema=schema)

In [None]:
df.show(10)

In [None]:
df.count()

# Topic modeling / Latent Dirichlet allocation (LDA)
**TODO: describe ml solution**

In [None]:
# df2.show(10, True)

In [None]:
# df2.printSchema()

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))

cv = CountVectorizer(inputCol="tokens", outputCol="raw_features", vocabSize=10000, minDF=2.0)
cvmodel = cv.fit(df)

print(time.strftime('%m%d%Y %H:%M:%S'))
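
To make the vectorization step concrete, here is a simplified pure-Python sketch of what a count vectorizer with a minimum document frequency does: it keeps terms that occur in at least `min_df` documents, orders them by corpus frequency, and maps each document to sparse term counts. The function names are hypothetical and this is not Spark's implementation, only the general idea.

```python
from collections import Counter


def fit_vocabulary(documents, vocab_size=10000, min_df=2):
    """Simplified vocabulary fit: keep terms found in at least min_df documents."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent terms first; ties are broken alphabetically to stay deterministic
    kept.sort(key=lambda term: (-term_freq[term], term))
    return kept[:vocab_size]


def count_vector(tokens, vocabulary):
    """Sparse term counts as {vocabulary index: count}; unknown tokens are ignored."""
    index = {term: i for i, term in enumerate(vocabulary)}
    return dict(Counter(index[t] for t in tokens if t in index))


docs = [["nba", "draft"], ["nba", "final"], ["stanley", "cup", "final"]]
vocabulary = fit_vocabulary(docs)
vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

With `min_df=2`, words that appear in a single tweet never enter the vocabulary, which removes most typos and one-off tokens before LDA sees the data.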

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))
df = cvmodel.transform(df)
print(time.strftime('%m%d%Y %H:%M:%S'))

In [None]:
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features", minDocFreq=2)
idfModel = idf.fit(df)

df = idfModel.transform(df)
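
The IDF step rescales raw counts so that terms present in many tweets weigh less. A minimal sketch, assuming the smoothed formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))` with `m` the number of documents, and a weight of 0 for terms below `minDocFreq`; the helper names are hypothetical.

```python
import math


def idf_weights(doc_freqs, num_docs, min_doc_freq=2):
    """Smoothed IDF per term: log((m + 1) / (df + 1)); 0.0 below min_doc_freq."""
    return [
        math.log((num_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for df in doc_freqs
    ]


def tf_idf(count_vector, weights):
    """Scale sparse term counts {index: count} by the per-term IDF weight."""
    return {i: count * weights[i] for i, count in count_vector.items()}


# Three documents; terms 0 and 1 occur in two of them, term 2 in only one
weights = idf_weights([2, 2, 1], num_docs=3)
scaled = tf_idf({0: 1, 2: 3}, weights)
print(weights)
print(scaled)
```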


In [None]:
df.show(10, True)

In [None]:
#df = df.drop("name")
#df.show(10, False)

In [None]:
w = Window().orderBy(column("tokens"))
df = df.withColumn("id", row_number().over(w))

In [None]:
df.show(10, True)

In [None]:
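# pair each document id with its TF-IDF vector, converted to the old mllib vector type
rs = df.rdd.map(lambda x: (x["id"], oldVectors.fromML(x["tf_idf_features"])))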

In [None]:
rs_df = rs.toDF()
rs_df.show(10, False)

In [None]:
# Run the LDA Topic Modeler
# Note the time before and after is printed in order to find out how much time it takes to process x number of records

print(time.strftime('%m%d%Y %H:%M:%S'))
lda_model = LDA.train(rs_df['_1', '_2'].rdd.map(list), k=num_of_topics_LDA, maxIterations=max_iterations_LDA)
print(time.strftime('%m%d%Y %H:%M:%S'))

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics = lda_model.topicsMatrix()
vocabArray = cvmodel.vocabulary

In [None]:
topicIndices = sc.parallelize(lda_model.describeTopics(maxTermsPerTopic=number_of_words_for_topic))

def topic_render(topic):
    """Map a (term indices, term weights) pair to "weight  word" strings."""
    terms, probs = topic

    # zip stops at the shorter list, so a topic with fewer terms cannot raise IndexError
    return [str(round(prob, 3)) + "  " + vocabArray[term_id] for term_id, prob in zip(terms, probs)]
print(time.strftime('%m%d%Y %H:%M:%S'))

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics_final = topicIndices.map(lambda topic:topic_render(topic)).collect()
print(time.strftime('%m%d%Y %H:%M:%S'))
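
The rendering step above turns pairs of term indices and term weights into readable lines. A small self-contained sketch of the same mapping, using a made-up vocabulary and made-up weights in place of the real `describeTopics` output; `render_topic` is a hypothetical name.

```python
def render_topic(term_ids, weights, vocabulary):
    """Turn parallel lists of term indices and weights into "weight  word" strings."""
    return [f"{round(weight, 3)}  {vocabulary[term_id]}" for term_id, weight in zip(term_ids, weights)]


# Made-up data for illustration only
vocabulary = ["final", "nba", "cup", "draft"]
described = [([2, 0], [0.41234, 0.2]), ([1, 3], [0.5, 0.25])]

rendered = [render_topic(term_ids, weights, vocabulary) for term_ids, weights in described]
for line in rendered[0]:
    print(line)
```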

# Topics

In [None]:
# based on the simple vectors(+number of words)

for topic_number, terms in enumerate(topics_final, start=1):
    print("Topic #" + str(topic_number))
    for term in terms:
        print(term)
    print('\n')

### Hot topics in the USA from [Google trends](https://trends.google.com/trends/explore?geo=US)

In [9]:
start_date = frame_start_datetime #str_tweet_to_datetime(frame_start_datetime)
finish_date = frame_finish_datetime #str_tweet_to_datetime(frame_finish_datetime)

In [10]:
google_trends_topics, google_trends_queries = get_google_trends_by_geo(geo, google_trends_dict) 

##### Google trends search topics

In [11]:
interesting_google_topics = google_trends_topics.filter(
    (google_trends_topics.Date >= start_date) & (google_trends_topics.Date <= finish_date))

In [12]:
print_google_trend_title(start_date, finish_date, "Search topics", geo)
interest_google_topics = convert_datetime_in_interesting_google(interesting_google_topics)
interest_google_topics.select("Date","Search topics - rising", "Search topics - top").show(num_of_top_interest, False)

NameError: name 'geo' is not defined

If the time frame is longer than one day, use the cell below to reduce the Google Trends rows to unique entries for the whole frame.

In [None]:
# interesing_google_topics_unique= unique_google_trends_by_time_frame(interesting_google_topics, geo)
# print_google_trend_title(start_date, finish_date, "Search topics")
# interesing_google_topics_unique.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)

##### Google trends search queries

In [None]:
interesting_google_queries = google_trends_queries.filter(
    (google_trends_queries.Date >= start_date) & (google_trends_queries.Date <= finish_date))

In [None]:
interesting_google_queries_unique = unique_google_trends_by_time_frame(interesting_google_queries, spark)
print_google_trend_title(start_date, finish_date, "Search queries", geo)
interesting_google_queries_unique.show(num_of_top_interest, False)

In [None]:
# print_google_trend_title(start_date, finish_date, "Search queries")
# interest_google_queries = convert_datetime_in_interesting_google(interesting_google_queries)
# interest_google_queries.select("Date", "Search queries - rising", "Search queries - top").show(num_of_top_interest, False)

#### Hot topics - Google Trends queried directly (this section will probably be removed)

In [None]:
start_date_str = start_date.strftime("%Y-%m-%d")
finish_date_str = finish_date.strftime("%Y-%m-%d")
pytrend = TrendReq()
pytrend.build_payload(kw_list=[' '], geo=geo, timeframe=f"{start_date_str} {finish_date_str}")
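
The `timeframe` argument built above is a string holding two `YYYY-MM-DD` dates separated by a space. A tiny sketch of that formatting on its own, with a hypothetical helper name, which makes it easy to check the string before any request is sent:

```python
from datetime import datetime


def build_timeframe(start, finish):
    """Format two datetimes as "YYYY-MM-DD YYYY-MM-DD"."""
    return f"{start:%Y-%m-%d} {finish:%Y-%m-%d}"


timeframe = build_timeframe(datetime(2019, 6, 12), datetime(2019, 6, 12, 23, 59, 59))
print(timeframe)  # 2019-06-12 2019-06-12
```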

##### Search topics

In [None]:
topics_df = pytrend.related_top_search_topics(spark)

In [None]:
print_google_trend_title(start_date, finish_date, "Search topics")
topics_df.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)

##### Search queries

In [None]:
queries_df = pytrend.related_top_search_queries(spark)

In [None]:
print_google_trend_title(start_date, finish_date, "Search queries")
queries_df.show(num_of_top_interest, False)