# <center>Web Scraping by API </center>

## 1. Scrape data through APIs 
- Online content providers usually provide APIs for you to access data. Two types of APIs:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. OMDB APIs (http://www.omdbapi.com), or TMDB (https://developers.themoviedb.org/3/getting-started)
- Read each API's documentation to find out how to access its data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses `HTTP` requests to `GET`, `PUT`, `POST` and `DELETE` data
- Example:
    - https://groceries.asda.com/api/items/search<font color="blue"><b>?</b></font><font color='green'><b>keyword</b></font>=<font color='red'><b>yogurt</b></font><font color='purple'><b>&</b></font><font color='green'><b>r</b></font>=<font color='red'><b>json</b></font>, where
        - `?`: separates the API endpoint `https://groceries.asda.com/api/items/search` from the parameters
        - `keyword=yogurt`: sets the parameter `keyword` to `yogurt`, i.e. searches for `yogurt`
        - `&`: separates one `name=value` pair from the next
        - `r=json`: requests the result in JSON format
    - You can paste the URL above directly into your browser
    - Or issue the API call with the `requests` package
- Read the API documentation to learn which parameters are available and how to specify them
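The URL anatomy above can be reproduced with the standard library alone. The sketch below assembles the same kind of URL from an endpoint and a parameter dictionary, then splits it apart again. The keyword `greek yogurt` is an illustrative assumption, chosen only to show that spaces get escaped; no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://groceries.asda.com/api/items/search"
params = {"keyword": "greek yogurt", "r": "json"}  # illustrative values

# urlencode joins name=value pairs with "&" and escapes unsafe characters
query = urlencode(params)
url = endpoint + "?" + query
print(url)

# Reverse direction: split a URL back into its endpoint and its parameters
parts = urlsplit(url)
recovered_endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
recovered_params = parse_qs(parts.query)  # each value comes back as a list
print(recovered_endpoint)
print(recovered_params)
```

Building the query string this way avoids broken URLs when a search term contains spaces, `&`, or other reserved characters, which plain string concatenation does not handle.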

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json
import pandas as pd

In [2]:
import requests
import json

keyword = 'yogurt'


url="https://groceries.asda.com/api/items/search?keyword=" + keyword + "&r=json"

print(url)

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a JSON document
    # r.json() parses it into a Python object (here a dictionary)
    # json.dumps() converts a Python object into a JSON string
    result = r.json()
    print (json.dumps(result, indent=4))



https://groceries.asda.com/api/items/search?keyword=yogurt&r=json
{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "447",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_613f0638f04fd127608474cd^^^Disabled",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a34::false",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "7368400",
 

In [3]:
# Exercise 2.2.  Another way to pass parameters

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user', \
# auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    print (json.dumps(r.json(), indent=4))



{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "447",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_613f0638f04fd127608474cd^^^Disabled",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a34::false",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "7368400",
            "promoDetailFull": "10 for \u00a34",
            "avail

## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "Self-describing" and easy to understand
- JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### In Python, a parsed JSON document is:
- **a dictionary** (from a JSON object), or 
- a **list** (from a JSON array), very often a **list of dictionaries**
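A small sketch of how JSON syntax maps onto Python types may help here. The JSON text is a hand-written miniature loosely modelled on the API response shown earlier; the `note` field is made up purely to show how `null` is handled.

```python
import json

# Hand-written sample; "note" is a made-up field used only to demonstrate null
text = '''
{
    "keyword": "yogurt",
    "qusApplied": false,
    "errors": [],
    "items": [{"cin": "7368400"}, {"cin": "7368402"}],
    "note": null
}
'''

data = json.loads(text)

print(type(data).__name__)           # curly braces  -> dict
print(type(data["items"]).__name__)  # square brackets -> list
print(data["qusApplied"])            # JSON false    -> Python False
print(data["note"])                  # JSON null     -> Python None

# A document may also have an array at the top level
rows = json.loads('[{"cin": "7368400"}, {"cin": "7368402"}]')
print(type(rows).__name__, len(rows))
```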

### Useful JSON functions
- `json.dumps`: serialize a Python object to a JSON string
- `json.dump`: serialize a Python object and write it to a file
- `json.loads`: parse a JSON string into a Python object
- `json.load`: parse JSON read from a file into a Python object
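The four functions can be tried on a tiny hand-made list before applying them to a real API response. This is a minimal sketch: the single item copies two fields from the sample output, and the file is written to a temporary directory (an assumption made so the example leaves nothing behind). It also shows why `£` is printed as `\u00a3` in the outputs of this notebook: `json.dumps` escapes non-ASCII characters unless `ensure_ascii=False` is passed.

```python
import json
import os
import tempfile

items = [{"itemName": "Corner Vanilla Yogurt with Chocolate Balls",
          "price": "£0.60"}]

# dumps: Python object -> JSON string (non-ASCII characters escaped by default)
s = json.dumps(items, indent=4)
print(s)

# ensure_ascii=False keeps characters such as the pound sign readable
s_readable = json.dumps(items, ensure_ascii=False)
print(s_readable)

# loads: JSON string -> Python object
round_trip = json.loads(s)

# dump / load: same idea, but writing to and reading from a file
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(items, f)
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == items)
```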

In [4]:
# Exercise 3.1 API returns a JSON object 

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# if the API call returns a successful response
if r.status_code==200:
    result = r.json()
    
    df = pd.DataFrame(result["items"])
    df.head()
    

Unnamed: 0,shelfId,shelfName,deptId,deptName,isBundle,meatStickerDetails,extraLargeImageURL,bundledItemCount,scene7Host,cin,...,avgWeight,iconDetails,maxQty,pricePerWt,productURL,pricePerUOM,searchTuningScore,onSale,salePrice,positionChngByMargin
0,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,7368400,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,8717191.0,False,,0
1,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,7368402,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,5447849.0,False,,0
2,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,7368406,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,5098466.0,False,,0
3,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,7368408,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,4553693.5,False,,0
4,910000977060,"Diet, Low Fat & No Added Sugar",1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,6346362,...,,"{'promotionalIcons': ['59600051'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,3162789.0,False,,0
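Nearly every value in this response arrives as a string (`"price": "£0.60"`, `"totalReviewCount": "14"`, `"isBundle": "false"`), so numeric analysis needs a conversion step first. The sketch below shows one way to do that for a single item. The helper `clean_item` and its output field names are hypothetical, and the input values are copied from the sample output.

```python
# Values copied from the first item in the sample output
item = {
    "price": "£0.60",
    "avgStarRating": "4.6429",
    "totalReviewCount": "14",
    "isBundle": "false",
}

def clean_item(raw):
    """Hypothetical helper: convert string fields to numbers and booleans."""
    return {
        "price_gbp": float(raw["price"].lstrip("£")),       # drop the currency sign
        "avg_star_rating": float(raw["avgStarRating"]),
        "total_review_count": int(raw["totalReviewCount"]),
        "is_bundle": raw["isBundle"].lower() == "true",     # "false" -> False
    }

cleaned = clean_item(item)
print(cleaned)
```

A real response may contain empty strings for some of these fields, so production code would need to handle missing values before calling `float` or `int`.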


In [5]:
# Exercise 3.2. Parse JSON object (a dictionary)

# convert the first 2 items to string
s = json.dumps(result["items"][0:2], indent=4)
print(s)

# load back from a string
items = json.loads(s)
items

# save to file (the with block closes the file when done)
with open("items.json", "w") as f:
    json.dump(result["items"], f)

# load back from file
with open("items.json", "r") as f:
    items = json.load(f)
print("test loaded data\n")
len(items)
items[0]

[
    {
        "shelfId": "1215286383583",
        "shelfName": "Corners",
        "deptId": "1215341888021",
        "deptName": "Yogurts & Desserts",
        "isBundle": "false",
        "meatStickerDetails": "10::for::\u00a34::false",
        "extraLargeImageURL": "",
        "bundledItemCount": "0",
        "scene7Host": "https://ui.assets-asda.com:443/dm/",
        "cin": "7368400",
        "promoDetailFull": "10 for \u00a34",
        "availability": "A",
        "totalReviewCount": "14",
        "asdaSuggest": "",
        "itemName": "Corner Vanilla Yogurt with Chocolate Balls",
        "price": "\u00a30.60",
        "imageURL": "",
        "aisleName": "Yogurts & Fromage Frais",
        "id": "1000377031656",
        "promoId": "ls92280",
        "isFavourite": "false",
        "hasAlternates": "false",
        "wasPrice": "",
        "brandName": "Muller",
        "promoType": "No Promo",
        "weight": "124g      ",
        "promoOfferTypeCode": "15",
        "promoQty": "

[{'shelfId': '1215286383583',
  'shelfName': 'Corners',
  'deptId': '1215341888021',
  'deptName': 'Yogurts & Desserts',
  'isBundle': 'false',
  'meatStickerDetails': '10::for::£4::false',
  'extraLargeImageURL': '',
  'bundledItemCount': '0',
  'scene7Host': 'https://ui.assets-asda.com:443/dm/',
  'cin': '7368400',
  'promoDetailFull': '10 for £4',
  'availability': 'A',
  'totalReviewCount': '14',
  'asdaSuggest': '',
  'itemName': 'Corner Vanilla Yogurt with Chocolate Balls',
  'price': '£0.60',
  'imageURL': '',
  'aisleName': 'Yogurts & Fromage Frais',
  'id': '1000377031656',
  'promoId': 'ls92280',
  'isFavourite': 'false',
  'hasAlternates': 'false',
  'wasPrice': '',
  'brandName': 'Muller',
  'promoType': 'No Promo',
  'weight': '124g      ',
  'promoOfferTypeCode': '15',
  'promoQty': '10',
  'promoValue': '£4',
  'productAttribute': '',
  'scene7AssetId': '4025500277031',
  'promoDetail': '10 for £4',
  'bundleDiscount': '0.00',
  'avgStarRating': '4.6429',
  'name': 'Mull

test loaded data



60

{'shelfId': '1215286383583',
 'shelfName': 'Corners',
 'deptId': '1215341888021',
 'deptName': 'Yogurts & Desserts',
 'isBundle': 'false',
 'meatStickerDetails': '10::for::£4::false',
 'extraLargeImageURL': '',
 'bundledItemCount': '0',
 'scene7Host': 'https://ui.assets-asda.com:443/dm/',
 'cin': '7368400',
 'promoDetailFull': '10 for £4',
 'availability': 'A',
 'totalReviewCount': '14',
 'asdaSuggest': '',
 'itemName': 'Corner Vanilla Yogurt with Chocolate Balls',
 'price': '£0.60',
 'imageURL': '',
 'aisleName': 'Yogurts & Fromage Frais',
 'id': '1000377031656',
 'promoId': 'ls92280',
 'isFavourite': 'false',
 'hasAlternates': 'false',
 'wasPrice': '',
 'brandName': 'Muller',
 'promoType': 'No Promo',
 'weight': '124g      ',
 'promoOfferTypeCode': '15',
 'promoQty': '10',
 'promoValue': '£4',
 'productAttribute': '',
 'scene7AssetId': '4025500277031',
 'promoDetail': '10 for £4',
 'bundleDiscount': '0.00',
 'avgStarRating': '4.6429',
 'name': 'Muller Corner Vanilla Yogurt with Choco

## 4. Get Tweets

Reference: 
- https://github.com/scalto/snscrape-by-location/blob/main/snscrape_by_location_tutorial.ipynb
- https://medium.com/swlh/how-to-scrape-tweets-by-location-in-python-using-snscrape-8c870fa6ec25

Note: User object is not exposed by TwitterSearchScraper any more.
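Two ideas used in the cells that follow can be tried without any network access: composing a search query from a keyword plus Twitter's `since:` and `until:` date operators, and using `itertools.islice` to take only the first few items from an iterator that could otherwise run for a very long time. In this sketch `build_query` is a hypothetical helper and `fake_scraper` is a stand-in for a scraper's `get_items()`; neither is part of snscrape.

```python
import itertools

def build_query(keyword, since, until):
    """Hypothetical helper: combine a keyword with since:/until: operators."""
    return f"{keyword} since:{since} until:{until}"

def fake_scraper():
    """Stand-in for get_items(): lazily yields one result at a time, forever."""
    n = 0
    while True:
        n += 1
        yield {"id": n, "content": f"tweet {n}"}

query = build_query("blockchain", "2021-12-01", "2022-01-31")
print(query)

# islice stops after 5 items, so the endless generator is never exhausted
first_five = list(itertools.islice(fake_scraper(), 5))
print([t["id"] for t in first_five])
```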

In [6]:
import pandas as pd
import snscrape.modules.twitter as sntwitter
import itertools


In [7]:
#  search by keywords + time
# TwitterSearchScraper returns an iterator; islice takes only the first 500 items from it

df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    '"blockchain + since:2021-12-1 until:2022-1-31"').get_items(), 500))

print(len(df))
df.head()

500


Unnamed: 0,url,date,content,renderedContent,id,user,replyCount,retweetCount,likeCount,quoteCount,...,media,retweetedTweet,quotedTweet,inReplyToTweetId,inReplyToUser,mentionedUsers,coordinates,place,hashtags,cashtags
0,https://twitter.com/CryptoAnomalous/status/148...,2022-01-30 23:59:59+00:00,@BradSherman do not get in the way of US techn...,@BradSherman do not get in the way of US techn...,1487938677535907845,"{'username': 'CryptoAnomalous', 'id': 53283913...",0,0,0,0,...,,,,,,"[{'username': 'BradSherman', 'id': 30216513, '...",,,,
1,https://twitter.com/gerard_dache/status/148793...,2022-01-30 23:59:55+00:00,"Congrats to Bill Rockwood Jr, Esq., MBA, J.D.,...","Congrats to Bill Rockwood Jr, Esq., MBA, J.D.,...",1487938658023919619,"{'username': 'gerard_dache', 'id': 256189693, ...",0,0,7,3,...,,,,,,,,,,
2,https://twitter.com/Rekttrading8/status/148793...,2022-01-30 23:59:54+00:00,@kararesurrect You are still announcing it on ...,@kararesurrect You are still announcing it on ...,1487938655159140353,"{'username': 'Rekttrading8', 'id': 14337294741...",0,0,0,0,...,,,,1.48782e+18,"{'username': 'TraderZetas', 'id': 147446459766...",,,,,
3,https://twitter.com/KeanuBelieves/status/14879...,2022-01-30 23:59:50+00:00,@WatcherGuru @AffinityBSC. Being listing on a ...,@WatcherGuru @AffinityBSC. Being listing on a ...,1487938639820693508,"{'username': 'KeanuBelieves', 'id': 1425210713...",0,0,1,0,...,,,,1.487907e+18,"{'username': 'WatcherGuru', 'id': 138749787175...","[{'username': 'WatcherGuru', 'id': 13874978717...",,,[ADAPT],
4,https://twitter.com/duynguy40664441/status/148...,2022-01-30 23:59:35+00:00,@mine_blockchain scam mnet,@mine_blockchain scam mnet,1487938575937126400,"{'username': 'duynguy40664441', 'id': 14595808...",1,0,0,0,...,,,,1.486773e+18,"{'username': 'mine_blockchain', 'id': 13995421...","[{'username': 'mine_blockchain', 'id': 1399542...",,,,


In [8]:
df.content[0:20]


0     @BradSherman do not get in the way of US techn...
1     Congrats to Bill Rockwood Jr, Esq., MBA, J.D.,...
2     @kararesurrect You are still announcing it on ...
3     @WatcherGuru @AffinityBSC. Being listing on a ...
4                            @mine_blockchain scam mnet
5     46150 Well, who would have thought that? Minin...
6     👋 Hey! Wait! help me, help you, help us? Who's...
7     🐳 #Cardano $ADA Whale ❤️laced!\n💰 Transaction ...
8     @TheVunderkind I still think we are yet to app...
9     63208 Well, who would have thought that? Minin...
10    @NathanielBandy1 Nintendo has a hard time putt...
11    Love my #Straybears, the art work, the team, t...
12    39313 Well, who would have thought that? Minin...
13    Asian Company Uses Blockchain Technology to Pr...
14                         blockchain the piece of wood
15    📢 Crypto Market Status 📢  \n\n$GALA is trading...
16    Requesting $FTM funds from the #Stakely Faucet...
17    Blockchain and crypto are broken | Opinion

In [9]:
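# search by user
# Pass the bare username: wrapping it in an extra pair of quotes, as in
# '"zawphyowai199"', makes snscrape raise "ValueError: Invalid username".

df = pd.DataFrame(itertools.islice(sntwitter.TwitterUserScraper(
    'zawphyowai199').get_items(), 500))

print(len(df))
df.tail()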

## 5. Tweepy
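- Tweepy is a Python library for accessing the Twitter API. 
- `pip install tweepy`
- The cells in this section use method names from Tweepy 3.x; Tweepy 4 renamed `api.search` to `api.search_tweets` and `api.show_friendship` to `api.get_friendship`.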
- The Tweepy documentation has detailed explanations: https://docs.tweepy.org/en/stable/
- You need to apply for a developer account from here: https://developer.twitter.com/en/apply-for-access
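Once the developer account is approved you receive four credentials. Rather than pasting them into the notebook, one common option is to read them from environment variables. The sketch below is one way to do that; the variable names and the helper `load_credentials` are assumptions, not something Tweepy or Twitter prescribes.

```python
import os

# Assumed environment variable names; choose whatever fits your setup
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_KEY",
    "TWITTER_ACCESS_SECRET",
)

def load_credentials(env=os.environ):
    """Hypothetical helper: return the four credentials or fail loudly."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

# Demo with a fake environment so the example runs without real keys
fake_env = {name: f"dummy-{i}" for i, name in enumerate(REQUIRED)}
creds = load_credentials(fake_env)
print(sorted(creds))
```

The returned values can then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of hard-coded strings, which keeps secrets out of shared notebooks.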

In [None]:
import tweepy
import csv
import datetime

# https://docs.tweepy.org/en/stable/auth_tutorial.html

CONSUMER_KEY=''
CONSUMER_SECRET=''
ACCESS_KEY=''
ACCESS_SECRET=''

auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
api=tweepy.API(auth)

In [None]:
# Take a look at the public tweets from your account's home timeline 

public_tweets = api.home_timeline()
print(len(public_tweets))
for tweet in public_tweets[:2]:
    print(tweet.text)

20
ICYMI: Crypto miners are feeling the pinch as data shows global revenue from mining has dropped to just over $17 mi… https://t.co/q3pM4tQsYE
Authorities are responding to Long Beach where an officer-involved shooting has occurred. https://t.co/ArIHkRfhxA


In [None]:
# Is this useful information?
# Let's take a close look at the raw representation of ONE tweet

public_tweets[1]
# The raw Status object is very hard to read in this form


Status(_api=<tweepy.api.API object at 0x7ff0b86befd0>, _json={'created_at': 'Sun Oct 02 19:04:42 +0000 2022', 'id': 1576649394228805632, 'id_str': '1576649394228805632', 'text': 'Authorities are responding to Long Beach where an officer-involved shooting has occurred. https://t.co/ArIHkRfhxA', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/ArIHkRfhxA', 'expanded_url': 'https://www.cbsnews.com/losangeles/news/police-probe-officer-involved-shooting-in-long-beach/', 'display_url': 'cbsnews.com/losangeles/new…', 'indices': [90, 113]}]}, 'source': '<a href="http://www.socialnewsdesk.com" rel="nofollow">SocialNewsDesk</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 24928809, 'id_str': '24928809', 'name': 'CBS Los Angeles', 'screen_name': 'CBSLA', 'location': 'Los Angeles', 'description': 'CBS

In [None]:
# make it look better
# convert to string
json_str = json.dumps(public_tweets[0]._json)

# deserialise string into python object
parsed = json.loads(json_str)


print(json.dumps(parsed, indent=4, sort_keys=True))
# Now the nested structure of the JSON object is much easier to see

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Sun Oct 02 19:05:00 +0000 2022",
    "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [
            {
                "display_url": "twitter.com/i/web/status/1\u2026",
                "expanded_url": "https://twitter.com/i/web/status/1576649466886619141",
                "indices": [
                    117,
                    140
                ],
                "url": "https://t.co/q3pM4tQsYE"
            }
        ],
        "user_mentions": []
    },
    "favorite_count": 16,
    "favorited": false,
    "geo": null,
    "id": 1576649466886619141,
    "id_str": "1576649466886619141",
    "in_reply_to_screen_name": null,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "is_quote_status": false,
    "lang": "en",
    "place": null,
    "possibly_sensitive": false,
    "possibly_sensiti

### 5.1. Get tweets from users' timeline
- Make a timeline call to retrieve up to the 3,200 most recent tweets by a user (a limit set by Twitter).
    - Note: the time range you get depends on how often the user posts tweets. 
- Parameters for the timeline call
    - `count`: the number of results to try to retrieve per call. The maximum is 200, so retrieving all 3,200 tweets takes multiple calls. 
    - `max_id`: return only tweets with an ID less than or equal to this value; use it to step backwards through the timeline. 
    - `tweet_mode`: set to `"extended"` to replace the `text` attribute with `full_text`, so that a tweet longer than 140 characters is not truncated.
- Variables of tweet objects
    - https://docs.tweepy.org/en/stable/api.html#tweepy-api-twitter-api-wrapper
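Retrieving a long timeline means calling the API repeatedly, each time asking only for tweets older than the oldest one already seen (`max_id` is the `user_timeline` parameter the download function later in this section uses for that). The paging logic can be sketched and tested offline. Here `fake_fetch` is a stand-in for `api.user_timeline`, and `fetch_all` is a hypothetical helper, not part of Tweepy.

```python
from typing import Callable, List, Optional

def fetch_all(fetch_page: Callable[[int, Optional[int]], List[dict]],
              page_size: int = 200, limit: int = 3200) -> List[dict]:
    """Collect tweets page by page, walking backwards in time with max_id."""
    collected: List[dict] = []
    max_id: Optional[int] = None       # None: start from the newest tweet
    while len(collected) < limit:
        page = fetch_page(page_size, max_id)
        if not page:                   # empty page: nothing older is left
            break
        collected.extend(page)
        max_id = page[-1]["id"] - 1    # next page must be strictly older
    return collected[:limit]

# Stand-in timeline: 450 fake tweets with ids 450..1, newest first
FAKE_TIMELINE = [{"id": i} for i in range(450, 0, -1)]

def fake_fetch(count, max_id):
    eligible = [t for t in FAKE_TIMELINE
                if max_id is None or t["id"] <= max_id]
    return eligible[:count]

tweets = fetch_all(fake_fetch)
print(len(tweets), tweets[0]["id"], tweets[-1]["id"])
```

Subtracting 1 from the oldest ID matters because `max_id` is inclusive; without it the last tweet of each page would be downloaded twice.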

In [None]:
# Get the five most recent tweets of a user.
timeline = api.user_timeline(screen_name="KelloggCompany",count=5,tweet_mode="extended")

for status in timeline:
    print (status.id)
    print (status.full_text)

1575885449150795777
Our thoughts are with the people impacted by Hurricanes #Fiona + #Ian. To support victims + relief workers, Kellogg is donating more &gt;665,000 servings of cereal + snacks. You can help by donating to https://t.co/0OllBPhjsr or https://t.co/fJC7kLZJrj #betterdays https://t.co/zvawEgRPGl
1575826863754182665
We are extremely honored to be included among the world’s sustainability top-performing companies in the Dow Jones Sustainability Index. Read more: https://t.co/xvlXaEYdYv #DJSI #ESG #BetterDays https://t.co/gH0YyzRzLq
1575474217235136512
Kellogg is proud to have partnered with @FoodBanking for 15+  years to help local food banks #FillTheBlanks in solutions to climate change &amp; #hunger and create #betterdays for those in need.
How will you help #FillTheBlanks? Learn more: https://t.co/i1UQIHL7Rf
#FLWDay #IDAFLW https://t.co/sRodNWBfBX
1575221387832090625
Kellogg’s #WIC cereals provide iron, folic acid, vitamin D, and more. And cereal is often enjoyed with milk

In [None]:
# Step 1: get a list of tweets 
# Step 2: extract the variables you want

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth)
    
    # initialize the first call
    alltweets=[]
    new_tweets=api.user_timeline(screen_name=user_name, count=200)
    alltweets.extend(new_tweets)
    oldest=alltweets[-1].id-1  #next time start from the oldest one minus one 
    
    # continue to get tweets
    while len(new_tweets)>0:  
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name=user_name,count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
    
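    # extract the variables you want
    # note: in Python 3, .encode("utf-8") stores the text as a bytes literal (b'...') in the CSV;
    # drop the .encode() call if you want plain text instead
    outtweets = [[tweet.id_str, tweet.user.name, tweet.created_at, tweet.user.followers_count,
                  tweet.text.encode("utf-8")] for tweet in alltweets]
            
    # write out your variables (newline='' prevents blank rows on Windows)
    with open('%s_tweet.csv' % user_name, 'w', newline='') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","user_name","created_at","followers","text"])
        writer.writerows(outtweets)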

# use your function
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")
    

getting tweets before 1433127895623417857
...400 tweets downloaded so far
getting tweets before 1368699367109038085
...600 tweets downloaded so far
getting tweets before 1306220294751891455
...800 tweets downloaded so far
getting tweets before 1229863740306227199
...1000 tweets downloaded so far
getting tweets before 1164539938920177673
...1200 tweets downloaded so far
getting tweets before 1092430104154775551
...1400 tweets downloaded so far
getting tweets before 1001825648426602497
...1600 tweets downloaded so far
getting tweets before 927505873689300991
...1800 tweets downloaded so far
getting tweets before 868083017285324801
...2000 tweets downloaded so far
getting tweets before 798641013451538431
...2200 tweets downloaded so far
getting tweets before 749265326081249279
...2397 tweets downloaded so far
getting tweets before 703294718990532607
...2595 tweets downloaded so far
getting tweets before 656849723526115327
...2794 tweets downloaded so far
getting tweets before 591317091527

In [None]:
# Take a look at the table you got
df= pd.read_csv('KelloggCompany_tweet.csv', header=0)
df.head()

# how many tweets did we get?
len(df)

# The following tweet is a retweet. Take a look at the text.
# The index of this retweet is based on the table generated previously;
# if you run this at a different time, you will see a different tweet.
# Try to find a retweet and compare the text with the actual tweet.
# You can find each tweet on Twitter by its ID.
df.text[34]


Unnamed: 0,id,user_name,created_at,followers,text
0,1575885449150795777,Kellogg Company,2022-09-30 16:29:04,77625,b'Our thoughts are with the people impacted by...
1,1575826863754182665,Kellogg Company,2022-09-30 12:36:16,77625,b'We are extremely honored to be included amon...
2,1575474217235136512,Kellogg Company,2022-09-29 13:14:58,77625,b'Kellogg is proud to have partnered with @Foo...
3,1575221387832090625,Kellogg Company,2022-09-28 20:30:19,77625,b'Kellogg\xe2\x80\x99s #WIC cereals provide ir...
4,1575221380118851584,Kellogg Company,2022-09-28 20:30:17,77625,"b'In 2021, our cereals reached 1.3M families t..."


3240

"b'Our hearts are with the flooding victims in Kentucky and Missouri. We are donating nearly 450,000 servings of snack\\xe2\\x80\\xa6 https://t.co/pm9QC2Pfik'"

### 5.2. Deal with truncated text
- For text mining on Twitter, it is important to get the full text. 
    - Full text would be essential for topic modeling and sentiment analysis.
    - Full text is also important for extracting mention networks (note the previous example). 
- Use `tweet_mode="extended"` when calling a user's timeline.
    - In extended mode, the `text` attribute of the returned Status objects is replaced by a `full_text` attribute, which contains the entire untruncated text of the Tweet. 
- Full text for tweets that are retweets.
    - If the tweet is a retweet, its `full_text` may still be truncated. 
    - Access the full text through the `retweeted_status` attribute, which is itself a Status object. 
- For reference: https://docs.tweepy.org/en/stable/extended_tweets.html
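The retweet rule above boils down to a single check. The sketch below expresses it as a small function and tries it on stand-in objects built with `SimpleNamespace`, so it runs without API access. The helper name `get_full_text` is hypothetical, and the sample texts are shortened, made-up strings.

```python
from types import SimpleNamespace

def get_full_text(status):
    """Return the untruncated text, looking inside retweeted_status if present."""
    if hasattr(status, "retweeted_status"):      # the status is a retweet
        return status.retweeted_status.full_text
    return status.full_text

# Stand-ins for Status objects fetched with tweet_mode="extended"
original = SimpleNamespace(full_text="A long original tweet, shown in full.")
retweet = SimpleNamespace(
    full_text="RT @someone: A long original tweet, sho…",
    retweeted_status=original,
)

print(get_full_text(original))
print(get_full_text(retweet))
```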

In [None]:
# Let's deal with retweets

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth,wait_on_rate_limit=True)

    alltweets=[]
    new_tweets=api.user_timeline(screen_name=user_name, count=200,tweet_mode="extended")
    alltweets.extend(new_tweets)
    oldest=alltweets[-1].id-1  
    
    # set date condition; stop when a page comes back empty or reaches startDate
    startDate=datetime.datetime(2021, 1, 1, 0, 0, 0)
    while new_tweets and new_tweets[-1].created_at > startDate:
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name=user_name,count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
        
    # check if it's a retweet
    # When using extended mode with a Retweet, the full_text attribute of the Status object may be truncated    
    # However, since the retweeted_status attribute (of a Status object that is a Retweet) is itself a Status object
    # the full_text attribute of the Retweeted Status object can be used instead.
    
    outtweets_all=[]
    for tweet in alltweets:
        status = api.get_status(tweet.id, tweet_mode="extended")
        
        if hasattr(status, "retweeted_status"):  # is a retweet
            full_text=status.retweeted_status.full_text.encode("utf-8")
            
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            tweet.retweeted_status.user.name,tweet.retweeted_status.user.screen_name,tweet.retweeted_status.user.description]
            outtweets_all.append(outtweets)
  
        else: # not a retweet
            full_text=status.full_text.encode("utf-8")
                    
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            "no value","no value","no value"]
            outtweets_all.append(outtweets)

    with open('%s_full_tweet.csv' % user_name, 'w', newline='') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","created_at","full_text",
                        "user.name","user.screen_name","user.followers_count",
                        "retweeted_status.user.name","retweeted_status.user.screen_name","retweeted_status.user.description"])
        writer.writerows(outtweets_all)

        
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")


getting tweets before 1433127895623417857
...400 tweets downloaded so far
getting tweets before 1368699367109038085
...600 tweets downloaded so far


In [None]:
df= pd.read_csv('KelloggCompany_full_tweet.csv', header=0)
df.head()
len(df)
# please compare this full text with the truncated text above: what differences can you find?
df.full_text[34]

Unnamed: 0,id,created_at,full_text,user.name,user.screen_name,user.followers_count,retweeted_status.user.name,retweeted_status.user.screen_name,retweeted_status.user.description
0,1575885449150795777,2022-09-30 16:29:04,b'Our thoughts are with the people impacted by...,Kellogg Company,KelloggCompany,77625,no value,no value,no value
1,1575826863754182665,2022-09-30 12:36:16,b'We are extremely honored to be included amon...,Kellogg Company,KelloggCompany,77625,no value,no value,no value
2,1575474217235136512,2022-09-29 13:14:58,b'Kellogg is proud to have partnered with @Foo...,Kellogg Company,KelloggCompany,77625,no value,no value,no value
3,1575221387832090625,2022-09-28 20:30:19,b'Kellogg\xe2\x80\x99s #WIC cereals provide ir...,Kellogg Company,KelloggCompany,77625,no value,no value,no value
4,1575221380118851584,2022-09-28 20:30:17,"b'In 2021, our cereals reached 1.3M families t...",Kellogg Company,KelloggCompany,77625,no value,no value,no value


600

"b'Our hearts are with the flooding victims in Kentucky and Missouri. We are donating nearly 450,000 servings of snacks to help feed victims and relief workers. Visit https://t.co/ASoo3mAWvM to help.  #betterdays https://t.co/uBMpkcGOUH'"

### 5.3. Build Twitter networks
- **Follower-followee network**
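    - If you have a list of user accounts, you can retrieve, for each pair of accounts, a boolean value that says whether one follows the other. 
    - Parameters of `api.show_friendship` (identify each user by either ID or screen name)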
        * `source_id` – The user_id of the subject user.
        * `source_screen_name` – The screen_name of the subject user.
        * `target_id` – The user_id of the target user.
        * `target_screen_name` – The screen_name of the target user.
- **Retweet network**
    - Retweeted accounts can be extracted while scraping the API. 
    - Or retweeted accounts can be extracted from the text. 
- **Mention network**
    - Can be extracted from the full text. 
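Extracting retweet and mention edges from text can be sketched with two regular expressions. This is a rough approximation under stated assumptions: it treats any `@` followed by up to 15 word characters as a mention, so it can also match things that are not real accounts (for example part of an email address). The helper names are hypothetical; the sample texts are taken from the outputs shown elsewhere in this notebook.

```python
import re

MENTION_RE = re.compile(r"@(\w{1,15})")       # @username anywhere in the text
RETWEET_RE = re.compile(r"^RT @(\w{1,15}):")  # "RT @username:" at the start

def mention_edges(author, text):
    """One (author, mentioned_account) edge per mention found in the text."""
    return [(author, name) for name in MENTION_RE.findall(text)]

def retweeted_account(text):
    """Account being retweeted, or None if the text is not a retweet."""
    match = RETWEET_RE.match(text)
    return match.group(1) if match else None

edges = mention_edges(
    "KeanuBelieves",
    "@WatcherGuru @AffinityBSC. Being listing on a ...",
)
print(edges)

print(retweeted_account("RT @EnterNFT: This article discusses what a consensus mechanism is"))
print(retweeted_account("Blockchain and crypto are broken | Opinion"))
```

Because truncated text can cut off mentions near the end of a tweet, this kind of extraction is more reliable on the full text.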

In [None]:
# How to scrape the follower-followee network?
# we can directly retrieve a boolean value 

dog="Microsoft"
cat="Oracle"

# show_friendship returns a pair of Friendship objects: (source, target)
is_following = api.show_friendship(source_screen_name=cat,target_screen_name=dog)
print(is_following[1].following)

# Question: how to get the adjacency matrix of a follower-followee network?

False
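One possible approach to the adjacency-matrix question is to ask the pairwise "does A follow B?" question for every ordered pair of accounts and store the answers in a square table. The sketch below uses made-up accounts and a made-up lookup (`fake_follows`) in place of real API calls; with Tweepy, the lookup would presumably wrap `api.show_friendship`, at the cost of one call per pair.

```python
def adjacency_matrix(users, follows):
    """Row i, column j is 1 when users[i] follows users[j], otherwise 0."""
    return [[1 if a != b and follows(a, b) else 0 for b in users]
            for a in users]

# Made-up following relations standing in for real API lookups
FAKE_FOLLOWS = {("bob", "alice"), ("alice", "carol"), ("carol", "alice")}

def fake_follows(source, target):
    return (source, target) in FAKE_FOLLOWS

users = ["alice", "bob", "carol"]
matrix = adjacency_matrix(users, fake_follows)

for name, row in zip(users, matrix):
    print(f"{name:6s}", row)
```

The matrix is not symmetric, because following is a directed relation: `bob` follows `alice` here, but not the other way round. For n accounts this needs n × (n − 1) lookups, so rate limits become the main constraint for larger lists.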


### 5.4. Keyword search


In [None]:
# Define the search term and the date_since date as variables
search_words = "blockchain"
date_since = "2021-11-01"

# Collect tweets
tweets = tweepy.Cursor(api.search,
              q=search_words,
              lang="en",
              since=date_since).items(5)


# Iterate and print tweets
for tweet in tweets:
    print(tweet.text)

RT @MaximusMech: GM! 

Winning requires effort... 🤖🪛⚙️

RT, Like and Tag 3 friends for more Mechs!!

#NFTs #NFTCommunity #NFTGiveaway #Meta…
@toymories are about to take over @octoyber #toymories take over is real with a @ToysForTots_USA collab and staking… https://t.co/VhqSd64KyT
RT @EnterNFT: This article discusses what a consensus mechanism is, how it works, and the various types available today. Expand your blockc…
RT @NFAdotcrypto: Why do I hold $XRP? This video should make it clear. 😤🌍🔀🌐#Ripple #Crypto #Blockchain https://t.co/kvPf9BROrH
RT @NFTMintRadar: 🔥Today's #Ethereum Mints🔥

🔹@HEXGO_NFT | 0.2 ☰
🔹@Ibutsu_nft | 0.029 ☰
🔹@888NEKOCLUB | 0.008 ☰
🔹@313RabbitNFT | 0.3 ☰

#NF…


##### Twitter data resources
https://github.com/echen102/us-pres-elections-2020 <br>
https://github.com/echen102/COVID-19-TweetIDs