# Get Tweets

This notebook draws heavily on the following notebook for gathering tweets: https://github.com/alod83/data-science/blob/master/DataCollection/Twitter/get_tweets.ipynb

The MongoDB upload follows this article: https://medium.com/analytics-vidhya/how-to-upload-a-pandas-dataframe-to-mongodb-ffa18c0953c1

This script extracts all the tweets with the hashtag #DataLake posted the day before today (yesterday), saves them to a .csv file, and uploads them to MongoDB.
We use the `tweepy` library, which can be installed with the command `pip install tweepy`.

First, we import the libraries and the configuration file, `config.py`, which holds the Twitter credentials and is located in the same directory as this script.

In [3]:
import tweepy
import datetime
from config import *  # provides the TWITTER_* credentials used below
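
The contents of `config.py` are not shown in this notebook. A minimal sketch of such a file follows, assuming the secrets are supplied through environment variables. The environment variable names are an assumption; the constant names are the ones passed to `OAuthHandler()` and `set_access_token()`.

```python
# config.py (sketch): Twitter credentials read from environment variables.
# Assumption: the variables are exported in the shell before the notebook starts.
import os

TWITTER_CONSUMER_KEY = os.environ.get("TWITTER_CONSUMER_KEY", "")
TWITTER_CONSUMER_SECRET = os.environ.get("TWITTER_CONSUMER_SECRET", "")
TWITTER_ACCESS_TOKEN = os.environ.get("TWITTER_ACCESS_TOKEN", "")
TWITTER_ACCESS_TOKEN_SECRET = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET", "")
```

Keeping the secrets in a separate file (or in the environment) means the notebook itself can be shared without exposing them.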

We set up the connection to our Twitter App using the `OAuthHandler()` class and its `set_access_token()` method. Then we create the Twitter API client with the `API()` class. With `wait_on_rate_limit=True`, tweepy waits automatically whenever the rate limit is reached.

In [4]:
auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

Now we set up the dates. We need both today and yesterday.

In [22]:
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
today, yesterday

(datetime.date(2021, 7, 7), datetime.date(2021, 7, 6))

We search for tweets on Twitter using the `Cursor()` class.
We pass the `api.search` method to the cursor, along with the query string, which is specified through the `q` parameter.
The query string accepts many optional operators, such as:

* `from:` - to restrict the search to a specific Twitter user profile
* `since:` - to specify the start date of the search
* `until:` - to specify the end date of the search (exclusive)

The cursor also accepts other parameters, such as the language (`lang`) and `tweet_mode`. With `tweet_mode='extended'`, the full text of each tweet is returned; otherwise the text is truncated to 140 characters.
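
The query operators listed above can be combined into a single string. The following is a small sketch of a helper that does this; `build_query` and the `example_*` variables are hypothetical names introduced only for illustration.

```python
import datetime


def build_query(hashtag, since=None, until=None, from_user=None):
    """Assemble a search query string from the optional operators."""
    parts = [hashtag]
    if from_user:
        parts.append(f"from:{from_user}")
    if since:
        parts.append(f"since:{since}")  # a date formats as YYYY-MM-DD
    if until:
        parts.append(f"until:{until}")
    return " ".join(parts)


# Fixed dates so the result is reproducible
example_until = datetime.date(2021, 7, 7)
example_since = example_until - datetime.timedelta(days=1)
example_query = build_query("#DataLake", since=example_since, until=example_until)
print(example_query)  # #DataLake since:2021-07-06 until:2021-07-07
```

The resulting string is what would be passed as the `q` parameter of the cursor.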

In [26]:
query = f"#DataLake since:{yesterday} until:{today}"
tweets_list = tweepy.Cursor(api.search, q=query, tweet_mode='extended', lang='en').items()
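
Note that Tweepy 4.0 renamed `API.search` to `API.search_tweets`, so `api.search` raises an `AttributeError` on newer installations. A hedged sketch of a helper that picks whichever method exists is shown below; `resolve_search_method` is a hypothetical name, not part of tweepy.

```python
def resolve_search_method(api):
    """Return the standard-search method of a tweepy.API-like object.

    Prefers `search_tweets` (Tweepy 4.x) and falls back to `search` (Tweepy 3.x).
    """
    method = getattr(api, "search_tweets", None)
    if method is None:
        method = getattr(api, "search")
    return method


# Possible usage with the objects defined in this notebook (not executed here):
# tweets_list = tweepy.Cursor(resolve_search_method(api), q="#DataLake",
#                             tweet_mode='extended', lang='en').items()
```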

Now we loop over `tweets_list` and, for each tweet, extract the text, the creation date, the number of retweets, and the favourite count. We store each tweet as a dictionary in a list called `output`.

In [27]:
output = []
for tweet in tweets_list:
    # full_text is available because the cursor uses tweet_mode='extended'
    text = tweet.full_text
    print(text)
    favourite_count = tweet.favorite_count
    retweet_count = tweet.retweet_count
    created_at = tweet.created_at

    # One record per tweet; the keys become the DataFrame columns
    line = {'text': text, 'favourite_count': favourite_count, 'retweet_count': retweet_count, 'created_at': created_at}
    output.append(line)

RT @FactionInc: Helping enterprise architects unpack today’s cloud data challenges and select the most valuable and cost-effective solution…
Helping enterprise architects unpack today’s cloud data challenges and select the most valuable and cost-effective solution for your organization. Download the new white paper to learn more: https://t.co/PlT9NWTYwH #datawarehouse #datalake #enterprisearchitect #clouddata https://t.co/4uwxrCy7Yq
Ready to speed up your Presto queries?🎉 Sign up for a free trial of Ahana Cloud! We’ll be there if you get stuck. 👉👉https://t.co/zyBSIZadXK

#saas @prestodb #AWS #S3data #datalake https://t.co/SBRHPPD2zZ
#Gartner research says 61% of organisations are using a Data Warehouse as part of their infrastructure. Read the #NewYear #Blog from #DataVault  here https://t.co/Q0CJdh17Qy for some great insight into #DataWarehouse #BIWisdom  #CIO #DataLake #DataHub https://t.co/ipf7AkptOV
When do you use #data virtualization vs. ETL? 

I just can't imagine why you would 

Airbnb has blocked tens of thousands of bookings in party crackdown - https://t.co/5Lpf8dk6PJ - thanks @RichardEudes #Analytics,#BigData,#DataEngineering,#DataLake,#DataManagement
via @RichardEudes - Why ETL Needs Open Source to Address the Long Tail of Integrations https://t.co/fv99qm0CZ9 #bigdata, #compliance, #dataengineering, #datagovernance, #datalake, #datamanagement, #dataprivacy, #datascience, #datascience #ds, #gdpr https://t.co/nA9rtuUEha
Powering the Digital Next CPG Enterprise with Platforms of Intelligence - https://t.co/aoj8qleZiu - thanks @RichardEudes #DataScience #DS,#BigData,#Supplychain,#Compliance,#Cybersecurity,#DataEngineering,#DataGovernance,#DataLake,#DataManagement,#DataPrivacy,#DigitalTransformat…
Gamestop – Exchange leaders say GameStop saga highlights regulatory challenges - https://t.co/Oyy5JGtXaM - thanks @RichardEudes #DataScience #DS,#BigData,#DataLake,#Hadoop
Picture perfect: Chinese tourists flock to lake to recreate viral photos - https://t.co/Q6uMsn0

In [28]:
output

[{'text': 'RT @FactionInc: Helping enterprise architects unpack today’s cloud data challenges and select the most valuable and cost-effective solution…',
  'favourite_count': 0,
  'retweet_count': 1,
  'created_at': datetime.datetime(2021, 7, 6, 23, 58, 1)},
 {'text': 'Helping enterprise architects unpack today’s cloud data challenges and select the most valuable and cost-effective solution for your organization. Download the new white paper to learn more: https://t.co/PlT9NWTYwH #datawarehouse #datalake #enterprisearchitect #clouddata https://t.co/4uwxrCy7Yq',
  'favourite_count': 2,
  'retweet_count': 1,
  'created_at': datetime.datetime(2021, 7, 6, 23, 57, 43)},
 {'text': 'Ready to speed up your Presto queries?🎉 Sign up for a free trial of Ahana Cloud! We’ll be there if you get stuck. 👉👉https://t.co/zyBSIZadXK\n\n#saas @prestodb #AWS #S3data #datalake https://t.co/SBRHPPD2zZ',
  'favourite_count': 1,
  'retweet_count': 0,
  'created_at': datetime.datetime(2021, 7, 6, 23, 12)},
 {'te
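
The first record above is a retweet, and its text still ends with an ellipsis even in extended mode. For retweets, the untruncated text is normally found on the original tweet, exposed as `retweeted_status`. The following sketch shows one way to handle this; `tweet_to_record` and the `is_retweet` key are hypothetical additions, and the `SimpleNamespace` objects are stand-ins for tweepy status objects.

```python
from types import SimpleNamespace


def tweet_to_record(tweet):
    """Build one output record, preferring the original tweet's text for retweets."""
    original = getattr(tweet, "retweeted_status", None)
    source = original if original is not None else tweet
    return {
        'text': source.full_text,
        'is_retweet': original is not None,  # extra key, not in the original output
        'favourite_count': tweet.favorite_count,
        'retweet_count': tweet.retweet_count,
        'created_at': tweet.created_at,
    }


# Stand-in objects mimicking the attributes used above
plain = SimpleNamespace(full_text="A plain tweet", favorite_count=2,
                        retweet_count=1, created_at="2021-07-06 23:57:43")
retweet = SimpleNamespace(full_text="RT @user: A plain tw…", favorite_count=0,
                          retweet_count=1, created_at="2021-07-06 23:58:01",
                          retweeted_status=plain)

plain_record = tweet_to_record(plain)
retweet_record = tweet_to_record(retweet)
```

With this approach the retweet record carries the complete text of the original tweet, while the counts and creation date still describe the retweet itself.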

Finally, we convert the `output` list to a pandas `DataFrame` and append the results to a .csv file.

In [29]:
import pandas as pd

df = pd.DataFrame(output)
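# Append to the existing file without repeating the header row
df.to_csv('output.csv', mode='a', header=False)
# On the first run, create the file with a header row instead:
# df.to_csv('output.csv')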
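
Switching by hand between the first-run call and the append call is easy to forget. One possible alternative, sketched here under the assumption that the header should appear exactly once, is to write it only when the file does not exist yet or is empty. `append_to_csv` is a hypothetical helper, and unlike the cell above it drops the index column (`index=False`).

```python
import os

import pandas as pd


def append_to_csv(df, path):
    """Append df to path, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode='a', header=write_header, index=False)
```

Calling `append_to_csv(df, 'output.csv')` once per day would then keep a single header row at the top of the file.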

In [30]:
df.shape

(62, 4)

In [31]:
df.head(10)

| | text | favourite_count | retweet_count | created_at |
|---|---|---|---|---|
| 0 | RT @FactionInc: Helping enterprise architects ... | 0 | 1 | 2021-07-06 23:58:01 |
| 1 | Helping enterprise architects unpack today’s c... | 2 | 1 | 2021-07-06 23:57:43 |
| 2 | Ready to speed up your Presto queries?🎉 Sign u... | 1 | 0 | 2021-07-06 23:12:00 |
| 3 | #Gartner research says 61% of organisations ar... | 0 | 0 | 2021-07-06 21:20:00 |
| 4 | When do you use #data virtualization vs. ETL? ... | 0 | 0 | 2021-07-06 21:17:27 |
| 5 | RT @williammcknight: I’m on the latest #DataLe... | 0 | 5 | 2021-07-06 20:45:59 |
| 6 | RT @dremio: #ScholaryTuesday - Data Lake vs Da... | 0 | 1 | 2021-07-06 20:04:32 |
| 7 | #ScholaryTuesday - Data Lake vs Data Warehouse... | 2 | 1 | 2021-07-06 20:04:27 |
| 8 | To efficiently ingest data into a data lake, y... | 1 | 0 | 2021-07-06 20:04:03 |
| 9 | RT @dremio: Apache Iceberg: An Architectural L... | 0 | 3 | 2021-07-06 20:00:24 |


Next, we upload the pandas DataFrame with the #DataLake tweets to MongoDB.

In [32]:
import pymongo

In [33]:
# Connect to the MongoDB cluster (replace <password> with the real password)
client = pymongo.MongoClient("mongodb+srv://c_stewart:<password>@cscluster.wzvbj.mongodb.net/Twitter_Data?retryWrites=true&w=majority")
# Select the database
db = client["Twitter_Data"]
# Select the collection
collection = db["Tweets_About_Data_Lakes"]
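
Rather than pasting the password into the connection string, it can be read from the environment and percent-encoded, which MongoDB connection URIs require when the password contains characters such as `@` or `/`. This is a sketch only: `build_mongo_uri` and the `MONGODB_PASSWORD` variable name are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, database):
    """Build a mongodb+srv URI with percent-encoded credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?retryWrites=true&w=majority"
    )


# MONGODB_PASSWORD is an assumed environment variable name
uri = build_mongo_uri(
    "c_stewart",
    os.environ.get("MONGODB_PASSWORD", ""),
    "cscluster.wzvbj.mongodb.net",
    "Twitter_Data",
)
# client = pymongo.MongoClient(uri)  # not executed here: needs network access
```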


In [34]:
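# Turn the DataFrame index into a regular 'index' column so it is stored with each record
df.reset_index(inplace=True)
# Convert each row to a dictionary (one MongoDB document per tweet)
data_dict = df.to_dict("records")
# Insert the documents into the collection
collection.insert_many(data_dict)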


<pymongo.results.InsertManyResult at 0x119fa3e00>