# Problem Statement 
**TODO: describe the problem**

# ML problem
**TODO: describe our ML approach**

# Data description
To collect data in a natural way:
<br>- we registered a Twitter Developer account;
<br>- using the credentials from that account, we ran a script that collected tweets by geolocation and saved them in MongoDB.
<br>
<br><b>As a result:</b>
<br>- we collected 332548 tweets (10 GB in MongoDB, ~100 MB in CSV) from the New York geolocation between May 30 and June 15;
<br>- we collected 6617029 tweets (~1.69 GB in CSV) from the USA geolocation from June 15 up to now.

# Let's start our demo ride

### Import all libs needed

In [1]:
# crucial thing
import findspark
findspark.init()

In [12]:
# essential pyspark
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, StringType, IntegerType, StructField, StructType
from pyspark.sql.functions import udf, row_number,column

# vectorizer
from pyspark.ml.feature import CountVectorizer, StopWordsRemover, HashingTF, IDF, Tokenizer

# stuff for LDA
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector as oldVector, Vectors as oldVectors
from pyspark.ml.linalg import Vector as newVector, Vectors as newVectors

# pytrends for acquiring google trends
from pytrends.pytrends.request import TrendReq

# import hardcoded variables
from utils.channels_to_filter import channels_not_to_consider

# custom text preprocessing
from utils.text_preprocessing import *

# custom tools to work with google trends 
from utils.trends import *

# handy functions
from utils import *

# datetime handling
from datetime import datetime
import time

ModuleNotFoundError: No module named 'variables'

### Define variables

**Connecting to data**  
Please specify the path to the CSV file with the data here.

In [8]:
# path to CSV
historical_tweets_data = '../data/tweets/new_york_training_tweets_15_06.csv'

**Time frames to pick data from**  
We picked several time frames to check whether our topic model can extract information about events that occurred during those periods.

In [9]:
# Champions League final
lc_final_start_datetime = "Sat Jun 01 00:00:00 +0000 2019"
lc_final_finish_datetime = "Sat Jun 01 23:59:59 +0000 2019"

# Stanley cup final
stanley_final_start_datetime = "Wed Jun 12 00:00:00 +0000 2019"
stanley_final_finish_datetime = "Wed Jun 12 23:59:59 +0000 2019"

# NBA draft
nba_final_start_datetime = "Thu Jun 20 00:00:00 +0000 2019"
nba_final_finish_datetime = "Sun Jun 23 23:59:59 +0000 2019"

Or you can specify your own dates.

In [10]:
start_datetime = ''
finish_datetime = ''

Select one of the time frames defined above here (in this example we use the Stanley Cup final dates).

In [11]:
frame_start_datetime = str_tweet_to_datetime(stanley_final_start_datetime)
frame_finish_datetime = str_tweet_to_datetime(stanley_final_finish_datetime)

assert (frame_finish_datetime - frame_start_datetime).days <= 3, "Date interval should not be bigger than 3 days"

NameError: name 'str_tweet_to_datetime' is not defined
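
The `NameError` above means that `str_tweet_to_datetime` was not available when the cell ran, most likely because the import cell failed earlier. For reference, a minimal sketch of what such a helper could look like is shown below. The name `parse_tweet_datetime` and the format string are assumptions derived from the timestamp strings defined above; this is not the actual code from `utils`.

```python
from datetime import datetime

# Assumed layout of the timestamp strings used above,
# e.g. "Wed Jun 12 00:00:00 +0000 2019".
TWEET_DATETIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_tweet_datetime(value):
    """Parse a tweet-style timestamp string into a timezone-aware datetime."""
    return datetime.strptime(value, TWEET_DATETIME_FORMAT)

start = parse_tweet_datetime("Wed Jun 12 00:00:00 +0000 2019")
finish = parse_tweet_datetime("Wed Jun 12 23:59:59 +0000 2019")

# Same kind of check as the assert in the cell above.
print(start, finish, (finish - start).days)
```

Because `%z` is part of the format, the parsed values carry a UTC offset, which matches the `+00:00` suffix printed later in the notebook.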

**Location related variables**  
Here you can specify the exact locations from which you want to get tweets for topic modeling. We provide examples for NYC (local approach) and for the whole US (global approach). You can explore the raw data to find more locations for filtering.

In [10]:
# if True locations from locations_to_consider will be used to filter
get_from_location = True

# locations to filter relevant tweets
locations_to_consider = [
                         'Manhattan, NY', 
                         'Brooklyn, NY', 
                         'Queens, NY', 
                         'Bronx, NY', 
                         'Staten Island, NY',
                         'New York, USA'
                        ]

# used to extract google trends
if get_from_location:
    geo = 'US-NY' 
else:
    geo = 'US'

**Magic machine learning variables (parameters)**  
You can tune these parameters to improve the LDA output. However, we don't recommend changing them: we ran a lot of experiments and this configuration worked best for us :)

In [11]:
# LDA params
num_of_topics_LDA = 20
max_iterations_LDA = 10

number_of_words_per_topic = 15  # number of top words shown per topic
num_of_top_interest = 15  # number of Google Trends rows to display

### Create spark session

In [12]:
spark = SparkSession.builder.appName("pipeline").getOrCreate()
sc = spark.sparkContext

### Read the data


**Load the historical data, it can take a while**  
Here we load the data and filter it by the dates you chose above.

In [13]:
times = (frame_start_datetime, frame_finish_datetime)
print("Time range to be extracted from ", historical_tweets_data, times[0], times[1])
selected_df = get_historical_df(historical_tweets_data=historical_tweets_data, historical_start_time=times[0], historical_finish_time=times[1], spark=spark)
assert selected_df is not None, "Something went wrong while selecting data from the recent/historical data"
selected_df.count()

Time range to be extracted from  ../get-tweets-by-geolocation/data/new_york_training_tweets_15_06.csv 2019-06-12 00:00:00+00:00 2019-06-12 23:59:59+00:00
Range for collected data (history):  2019-06-12 00:00:00+00:00 2019-06-12 23:59:59+00:00


29538

**Load Google Trends data**  
Here we load google trends data.

In [14]:
google_trends_search_queries_us = spark.read.csv('../data/google-trends/google-trends-search-queries-US.csv', inferSchema=True, header=True)
google_trends_search_topics_us = spark.read.csv('../data/google-trends/google-trends-search-topics-US.csv', inferSchema=True, header=True)
google_trends_search_queries_us_ny = spark.read.csv('../data/google-trends/google-trends-search-queries-US-NY.csv', inferSchema=True, header=True)
google_trends_search_topics_us_ny = spark.read.csv('../data/google-trends/google-trends-search-topics-US-NY.csv', inferSchema=True, header=True)

In [15]:
google_trends_dict = {}
google_trends_dict['US'] = [google_trends_search_topics_us, google_trends_search_queries_us]
google_trends_dict['US-NY'] = [google_trends_search_topics_us_ny, google_trends_search_queries_us_ny]

### Tweets preprocessing  

**Basic filtering**

Here we do basic filtering: we drop rows with null values and filter out channels we don't want to consider. We picked those channels manually during EDA because they add a lot of noise to the data (typical examples are profiles that report weather or traffic conditions). We also filter by global location (country) and, if local filtering is enabled, by local location. Finally, we check the length of the tweet text itself.

In [None]:
df = selected_df

# drop rows with missing values
df = df.rdd.filter(lambda x: x[0] is not None and x[1] is not None and x[2] is not None and x[4] is not None)

# filter out channels not to consider
df = df.filter(lambda x: x[4] not in channels_not_to_consider)

# filter by country
df = df.filter(lambda x: x[1] == 'US')

# filter by precise location
if get_from_location:
    df = df.filter(lambda x: x[2] in locations_to_consider)

# filter tweet itself
df = df.filter(lambda x: filter_tweet(x[0]))
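
The filtering steps above can be tried out without Spark on a few hand-made rows. The sketch below is only an illustration: the function name `keep_tweet_row`, the assumed column layout (text at index 0, country at 1, location at 2, channel at 4), the sample rows, and the minimum-word rule standing in for `filter_tweet` are all assumptions, not the project's actual code.

```python
def keep_tweet_row(row, blocked_channels, locations=None, min_words=3):
    """Return True if a row passes all basic filters.

    Assumed layout: row[0] text, row[1] country, row[2] location, row[4] channel.
    """
    text, country, location, channel = row[0], row[1], row[2], row[4]
    # drop rows with missing values
    if text is None or country is None or location is None or channel is None:
        return False
    # drop noisy channels (e.g. weather or traffic bots)
    if channel in blocked_channels:
        return False
    # global (country) filter
    if country != "US":
        return False
    # optional local filter
    if locations is not None and location not in locations:
        return False
    # hypothetical stand-in for filter_tweet: require a minimum number of words
    return len(text.split()) >= min_words

# made-up sample rows
rows = [
    ("Watching the hockey final tonight", "US", "Manhattan, NY", None, "fan_account"),
    ("Traffic update: delays on the bridge", "US", "Bronx, NY", None, "traffic_bot"),
    ("hi", "US", "Queens, NY", None, "someone"),
    (None, "US", "Brooklyn, NY", None, "someone"),
]
kept = [r for r in rows if keep_tweet_row(r, {"traffic_bot"}, {"Manhattan, NY", "Queens, NY"})]
print(len(kept))
```

Keeping the predicate as a plain function makes it easy to unit-test before passing it to `rdd.filter`.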


**Now the most interesting - tweet cleaning**  
Text cleaning is crucial for any text modelling process, especially for topic modelling. We tried three different approaches: classic, hashtags only, and URLs only. In the classic approach we delete all non-words (including URLs, hashtags, emojis and mentions) and filter out common stop words, so only plain text is left. Such data still contains a lot of noise and uninformative words, so we also tried the other two approaches, which use only hashtags (which should encode the most important information) or only URLs.

**Let's start from classic approach**  
In our case it consists of the following steps:  
1) Lowercase all words  
2) Filter words with non-letters at the beginning (mainly for mentions, e.g. "@some_user")  
3) Filter http/https  
4) Filter all non-letters (crucial to remove emoji)  
5) Remove multiple whitespace characters  
6) Remove repeated chars (e.g. "greeeeat" -> "great")
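
The six steps above can be sketched in plain Python as follows. This is a minimal illustration, not the `process_text` function from `utils`: the name `clean_tweet_text`, the restriction to ASCII letters, and the rule that collapses runs of three or more identical characters into one are assumptions.

```python
import re

def clean_tweet_text(tweet):
    """Apply the six cleaning steps and return a list of tokens."""
    # 1) lowercase
    words = tweet.lower().split()
    # 2) drop words that start with a non-letter (mentions, hashtags, emoji)
    words = [w for w in words if w[0].isalpha()]
    # 3) drop http/https links
    words = [w for w in words if not w.startswith(("http://", "https://"))]
    text = " ".join(words)
    # 4) replace everything that is not an ASCII letter or whitespace
    text = re.sub(r"[^a-z\s]", " ", text)
    # 5) collapse multiple whitespace characters
    text = re.sub(r"\s+", " ", text).strip()
    # 6) collapse runs of 3+ identical characters ("greeeeat" -> "great")
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return text.split()

tokens = clean_tweet_text(
    "Greeeeat game @some_user https://t.co/abc #NHL \U0001F3D2 Bruins!!!"
)
print(tokens)
```

Stop-word removal is not part of this sketch; in a Spark pipeline it could be handled by `StopWordsRemover`, which is already imported above.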

**Hashtags & URLs**  
These approaches are straightforward: we keep only the hashtags or only the URLs found in each tweet.
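
As an illustration, hashtag and URL extraction could look roughly like the sketch below. The function names `extract_hashtags` and `extract_urls` and both regular expressions are assumptions made for this example, not the notebook's implementation.

```python
import re

# Assumed patterns: a URL is "http(s)://" followed by non-space characters,
# a hashtag is "#" followed by word characters.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_urls(tweet):
    """Return all URLs found in the tweet."""
    return URL_PATTERN.findall(tweet)

def extract_hashtags(tweet):
    """Return lowercased hashtags (without '#'), ignoring URL fragments."""
    without_urls = URL_PATTERN.sub(" ", tweet)
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(without_urls)]

tweet = "Game 7 tonight! #StanleyCup #NHL https://t.co/abc123#top"
print(extract_hashtags(tweet))
print(extract_urls(tweet))
```

URLs are removed before hashtags are searched so that a fragment such as `#top` inside a link is not counted as a hashtag.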

**Basically applying approach discussed above**  
Now let's apply one of the approaches. But first, specify which one you want to use.

In [None]:
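def process_hashtags(tweet):
    pass  # placeholder: hashtag-only preprocessing is not implemented here

def process_urls(tweet):
    pass  # placeholder: URL-only preprocessing is not implemented here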

In [None]:
preprocessing_type = 'just-text' # or 'hashtags', or 'urls'

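if preprocessing_type == 'just-text':
    process_tweet = process_text
elif preprocessing_type == 'hashtags':
    process_tweet = process_hashtags
elif preprocessing_type == 'urls':
    process_tweet = process_urls
else:
    # fail early instead of hitting a NameError on process_tweet later
    raise ValueError(f"Unknown preprocessing_type: {preprocessing_type}")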

# process tweet
df = df.map(lambda x: process_tweet(x[0]))

**Final postprocessing**  
Here we make sure that there are no entries without tokens, and we convert the data into a structure suitable for the PySpark LDA class.

In [None]:
# final preprocessing: drop entries without tokens
df = df.filter(lambda x: len(x) > 0)

# make dataframes great again
df = df.map(lambda x: [x])

# schema for df
schema = StructType([StructField('tokens', ArrayType(StringType()), True)])
df = df.toDF(schema=schema)

### Topic modeling via Latent Dirichlet allocation


In [None]:
# df2.show(10, True)

In [None]:
# df2.printSchema()

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))

cv = CountVectorizer(inputCol="tokens", outputCol="raw_features", vocabSize=10000, minDF=2.0)
cvmodel = cv.fit(df)

print(time.strftime('%m%d%Y %H:%M:%S'))

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))
df = cvmodel.transform(df)
print(time.strftime('%m%d%Y %H:%M:%S'))

In [None]:
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features", minDocFreq=2)
idfModel = idf.fit(df)

df = idfModel.transform(df)
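
For intuition about what the `CountVectorizer` and `IDF` stages compute, here is a small pure-Python sketch on toy data. It is a simplification under stated assumptions: the function name `tf_idf` is hypothetical, rare terms are simply dropped (which merges the roles of `minDF` and `minDocFreq`), and the weight uses the smoothed formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))`, where `m` is the number of documents.

```python
import math
from collections import Counter

def tf_idf(docs, min_doc_freq=2):
    """Toy TF-IDF over tokenized documents (lists of strings)."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # keep only terms that occur in at least min_doc_freq documents
    idf = {
        term: math.log((n_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
        if df >= min_doc_freq
    }
    # weight raw term counts by the inverse document frequency
    return [
        {term: count * idf[term] for term, count in Counter(doc).items() if term in idf}
        for doc in docs
    ]

# made-up tokenized tweets
docs = [["bruins", "win"], ["bruins", "cup"], ["cup", "final", "cup"]]
weights = tf_idf(docs)
print(weights[2])
```

Terms that appear in only one document ("win", "final") get no weight here, which mirrors why the notebook sets `minDF=2.0` and `minDocFreq=2`: very rare tokens mostly add noise to the topics.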


In [None]:
df.show(10, True)

In [None]:
#df = df.drop("name")
#df.show(10, False)

In [None]:
w = Window().orderBy(column("tokens"))
df = df.withColumn("id", row_number().over(w))

In [None]:
df.show(10, True)

In [None]:
rs = df.rdd.map(lambda x: (x[3], oldVectors.fromML(x[2])))

In [None]:
rs_df = rs.toDF()
rs_df.show(10, False)

In [None]:
# Run the LDA Topic Modeler
# Note the time before and after is printed in order to find out how much time it takes to process x number of records


print(time.strftime('%m%d%Y %H:%M:%S'))
lda_model = LDA.train(rs_df['_1', '_2'].rdd.map(list), k=num_of_topics_LDA, maxIterations=max_iterations_LDA)
print(time.strftime('%m%d%Y %H:%M:%S'))

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics = lda_model.topicsMatrix()
vocabArray = cvmodel.vocabulary

In [None]:
topicIndices = sc.parallelize(lda_model.describeTopics(maxTermsPerTopic = number_of_words_per_topic))

def topic_render(topic):  # map vocabulary indices of a topic to the actual words
    terms = topic[0]
    prob = topic[1]

    result = []
    # zip stops at the shorter list, so topics with fewer terms are handled too
    for term_id, weight in zip(terms, prob):
        result.append(str(round(weight, 3)) + "  " + vocabArray[term_id])
    return result
print(time.strftime('%m%d%Y %H:%M:%S'))

In [None]:
print(time.strftime('%m%d%Y %H:%M:%S'))
topics_final = topicIndices.map(lambda topic:topic_render(topic)).collect()
print(time.strftime('%m%d%Y %H:%M:%S'))
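
To see what the rendering step produces without running Spark, the sketch below applies the same idea to toy data. The vocabulary, the indices and the weights are made up, and `render_topic` is a hypothetical stand-alone variant of `topic_render`; the input shape (a list of term indices plus a list of weights per topic) is an assumption based on how `topic_render` reads the `describeTopics` output.

```python
def render_topic(term_indices, weights, vocabulary):
    """Turn (indices, weights) of one topic into 'weight  word' strings."""
    return [
        f"{round(weight, 3)}  {vocabulary[index]}"
        for index, weight in zip(term_indices, weights)
    ]

# made-up vocabulary and topic descriptions
vocabulary = ["bruins", "cup", "final", "weather"]
described_topics = [([0, 1], [0.4123, 0.25]), ([3, 2], [0.3, 0.1])]

rendered = [render_topic(terms, weights, vocabulary) for terms, weights in described_topics]
for number, lines in enumerate(rendered, start=1):
    print(f"Topic #{number}", lines)
```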

# Topics

In [None]:
# based on the simple vectors(+number of words)

for topic_number, terms in enumerate(topics_final, start=1):
    print(f"Topic #{topic_number}")
    for term in terms:
        print(term)
    print('\n')

### Hot topics in the USA from [Google trends](https://trends.google.com/trends/explore?geo=US)

In [9]:
start_date = frame_start_datetime #str_tweet_to_datetime(frame_start_datetime)
finish_date = frame_finish_datetime #str_tweet_to_datetime(frame_finish_datetime)

In [10]:
google_trends_topics, google_trends_queries = get_google_trends_by_geo(geo, google_trends_dict) 

##### Google trends search topics

In [11]:
interesting_google_topics = google_trends_topics.filter(
    (google_trends_topics.Date >= start_date) & (google_trends_topics.Date <= finish_date))

In [12]:
print_google_trend_title(start_date, finish_date, "Search topics", geo)
interest_google_topics = convert_datetime_in_interesting_google(interesting_google_topics)
interest_google_topics.select("Date","Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-12 - 2019-06-12
+----------+----------------------+----------------------------------------+
|Date      |Search topics - rising|Search topics - top                     |
+----------+----------------------+----------------------------------------+
|2019-06-12|Download - Topic      |New York - City in New York             |
|2019-06-12|null                  |New York - US State                     |
|2019-06-12|null                  |Google Search - Topic                   |
|2019-06-12|null                  |Google - Technology company             |
|2019-06-12|null                  |2019 - Topic                            |
|2019-06-12|null                  |Weather - Topic                         |
|2019-06-12|null                  |YouTube - Video sharing company         |
|2019-06-12|null                  |Facebook, Inc. - Social network company |
|2019-06-12|null                  |Facebook - Social networking service    |
|201

When the time frame is longer than 1 day, these Google Trends results need to be filtered accordingly.

In [13]:
# interesing_google_topics_unique= unique_google_trends_by_time_frame(interesting_google_topics, geo)
# print_google_trend_title(start_date, finish_date, "Search topics")
# interesing_google_topics_unique.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)

##### Google trends search queries

In [14]:
interesting_google_queries = google_trends_queries.filter(
    (google_trends_queries.Date >= start_date) & (google_trends_queries.Date <= finish_date))

In [15]:
interesing_google_queries_unique= unique_google_trends_by_time_frame(interesting_google_queries, spark)
print_google_trend_title(start_date, finish_date, "Search queries", geo)
interesing_google_queries_unique.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-12 - 2019-06-12
+-----------------------+--------------------+------+---+-----+
|Search queries - rising|Search queries - top|Rising|Top|geo  |
+-----------------------+--------------------+------+---+-----+
|sudanese massacre      |google              |+300% |100|US-NY|
|hong kong              |weather             |+250% |79 |US-NY|
|ny yankees score       |facebook            |+180% |59 |US-NY|
|iready                 |youtube             |+150% |54 |US-NY|
|boston bruins          |amazon              |+150% |49 |US-NY|
|raz kids               |news                |+110% |46 |US-NY|
|bruins                 |definition          |+110% |33 |US-NY|
|cool math games        |craigslist          |+100% |28 |US-NY|
|verizon wireless       |gmail               |+90%  |24 |US-NY|
|bitcoin price          |translate           |+90%  |23 |US-NY|
|scratch                |nba                 |+90%  |23 |US-NY|
|max landis             |instag

In [16]:
# print_google_trend_title(start_date, finish_date, "Search queries")
# interest_google_queries = convert_datetime_in_interesting_google(interesting_google_queries)
# interest_google_queries.select("Date", "Search queries - rising", "Search queries - top").show(num_of_top_interest, False)

#### Hot topics - google trends (directly) (probably this will be removed)

In [17]:
start_date_str = start_date.strftime("%Y-%m-%d")
finish_date_str = finish_date.strftime("%Y-%m-%d")
pytrend = TrendReq()
pytrend.build_payload(kw_list=[' '], geo=geo, timeframe=f"{start_date_str} {finish_date_str}")

##### Search topics

In [18]:
topics_df = pytrend.related_top_search_topics(spark)

In [19]:
print_google_trend_title(start_date, finish_date, "Search topics", geo)
topics_df.select("Search topics - rising", "Search topics - top").show(num_of_top_interest, False)


Google trends Search topics in New York during 2019-06-12 - 2019-06-12
+----------------------+----------------------------------------+
|Search topics - rising|Search topics - top                     |
+----------------------+----------------------------------------+
|nan                   |New York - City in New York             |
|nan                   |New York - US State                     |
|nan                   |Google Search - Topic                   |
|nan                   |Google - Technology company             |
|nan                   |2019 - Topic                            |
|nan                   |Weather - Topic                         |
|nan                   |YouTube - Video sharing company         |
|nan                   |Facebook - Social networking service    |
|nan                   |Facebook, Inc. - Social network company |
|nan                   |Amazon.com - E-commerce company         |
|nan                   |Film - Topic                            |
|nan

##### Search queries

In [20]:
queries_df = pytrend.related_top_search_queries(spark)

In [21]:
print_google_trend_title(start_date, finish_date, "Search queries", geo)
queries_df.show(num_of_top_interest, False)


Google trends Search queries in New York during 2019-06-12 - 2019-06-12
+-----------------------+--------------------+------+---+-----+
|Search queries - rising|Search queries - top|Rising|Top|geo  |
+-----------------------+--------------------+------+---+-----+
|britt mchenry          |google              |+750% |100|US-NY|
|courtney stodden       |weather             |+450% |71 |US-NY|
|mega millions          |facebook            |+300% |58 |US-NY|
|hong kong              |youtube             |+250% |50 |US-NY|
|jon stewart            |amazon              |+150% |46 |US-NY|
|khan academy           |news                |+150% |44 |US-NY|
|john stewart           |craigslist          |+140% |26 |US-NY|
|elizabeth lederer      |instagram           |+140% |23 |US-NY|
|icc world cup 2019     |nba                 |+140% |22 |US-NY|
|bruins                 |translate           |+120% |21 |US-NY|
|boston bruins          |gmail               |+110% |20 |US-NY|
|run 3                  |drive 

## Evaluation and justification (Conclusion)