# <center>Web Scraping by API </center>

## 1. Scrape data through APIs 
- Online content providers usually offer APIs for accessing their data. Two common forms:
   * Python packages: e.g. the tweepy package, a third-party wrapper for the Twitter API
   * REST APIs: e.g. OMDB APIs (http://www.omdbapi.com), or TMDB (https://developers.themoviedb.org/3/getting-started)
- Read each API's documentation to find out how to access its data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses `HTTP` requests to `GET`, `PUT`, `POST` and `DELETE` data
- Example:
    - https://groceries.asda.com/api/items/search<font color="blue"><b>?</b></font><font color='green'><b>keyword</b></font>=<font color='red'><b>yogurt</b></font><font color='purple'><b>&</b></font><font color='green'><b>r</b></font>=<font color='red'><b>json</b></font>, where
        - `?`: separates the API endpoint `https://groceries.asda.com/api/items/search` from the parameters
        - `keyword=yogurt`: sets the parameter `keyword` to `yogurt`, i.e. searches for yogurt
        - `&`: separates one parameter from the next
        - `r=json`: requests the result in JSON format
    - You can paste the URL above directly into your browser
    - Or issue the API call with the `requests` package
- You need to read API documentation to understand how to specify parameters
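
The anatomy of the example URL can be checked offline. The following is a minimal sketch that uses only the standard library module `urllib.parse` (not part of the original exercises) to build the same URL from a dictionary of parameters and then split it back into endpoint and parameters. It makes no network call.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://groceries.asda.com/api/items/search"
params = {"keyword": "yogurt", "r": "json"}

# urlencode joins name=value pairs with "&" and escapes unsafe characters
query = urlencode(params)
url = endpoint + "?" + query
print(url)

# Splitting the URL recovers the endpoint and the parameters
parts = urlsplit(url)
recovered_endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
recovered_params = parse_qs(parts.query)
print(recovered_endpoint)
print(recovered_params)

# Values containing spaces are escaped automatically
print(urlencode({"keyword": "greek yogurt", "r": "json"}))
```

Building the query string from a dictionary avoids escaping mistakes that are easy to make when concatenating strings by hand.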

In [7]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json
import pandas as pd

In [8]:
import requests
import json

keyword = 'yogurt'


url="https://groceries.asda.com/api/items/search?keyword=" + keyword + "&r=json"

print(url)

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a JSON document
    # r.json() parses the response body into a Python object
    # json.dumps() converts a Python object into a JSON string
    result = r.json()
    print(json.dumps(result, indent=4))



https://groceries.asda.com/api/items/search?keyword=yogurt&r=json
{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "421",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_613f0638f04fd127608474cd^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a33.5::true",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "6362225",
 

In [8]:
# Exercise 2.2.  Another way to pass parameters

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user', \
# auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a JSON document
    # r.json() parses it into a Python object
    print(json.dumps(r.json(), indent=4))



{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "419",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "7",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_613f0638f04fd127608474cd^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a33.5::true",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "6362225",
            "promoDetailFull": "10 for \u00a33.5",
            "ava

## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "Self-describing" and easy to understand
- JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### A JSON document, once parsed in Python, is usually:
- **a dictionary** (from a JSON object), or
- a **list**, often a **list of dictionaries** (from a JSON array)

### Useful JSON functions
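- `json.dumps`: serialize a Python object to a JSON string
- `json.dump`: serialize a Python object and write it to a file
- `json.loads`: parse a JSON string into a Python object
- `json.load`: read a file containing JSON and parse it into a Python object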
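
As a quick offline illustration of the four functions, the sketch below round-trips a small record through a string and through a file. The record is made up for illustration and only loosely shaped like one search result; the temporary file path is likewise an assumption of this example.

```python
import json
import os
import tempfile

# An illustrative record (made-up values, not real API output)
item = {
    "itemName": "Vanilla Yogurt",
    "price": "\u00a30.55",
    "onSale": False,
    "tags": ["yogurt", "dessert"],
    "wasPrice": None,
}

# dumps / loads: Python object <-> JSON string
s = json.dumps(item, indent=4)
back = json.loads(s)

# dump / load: Python object <-> JSON file
path = os.path.join(tempfile.mkdtemp(), "item.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump([item], f)
with open(path, "r", encoding="utf-8") as f:
    from_file = json.load(f)

print(s)
print(type(back).__name__, type(from_file).__name__)
print(back == item, from_file[0] == item)
```

Note how Python's `False` and `None` appear as `false` and `null` in the JSON text, and how a JSON object comes back as a dictionary while a JSON array comes back as a list.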

In [9]:
# Exercise 3.1 API returns a JSON object 

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# if the API call returns a successful response
if r.status_code==200:
    result = r.json()
    
    df = pd.DataFrame(result["items"])
    df.head()
    

Unnamed: 0,shelfId,shelfName,deptId,deptName,isBundle,meatStickerDetails,extraLargeImageURL,bundledItemCount,scene7Host,cin,...,avgWeight,iconDetails,maxQty,pricePerWt,productURL,pricePerUOM,searchTuningScore,onSale,salePrice,positionChngByMargin
0,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£3.5::true,,0,https://ui.assets-asda.com:443/dm/,6362225,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,28751692.0,False,,0
1,910000976085,Kids Yogurts,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,6203595,...,,"{'promotionalIcons': ['45100001', '59600050'],...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,17800482.0,False,,0
2,910000976085,Kids Yogurts,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,3638102,...,,"{'promotionalIcons': ['59600051'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,14810072.0,False,,0
3,1215639963408,Greek Style,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,3922315,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,4545069.5,False,,0
4,1215639963408,Greek Style,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,5207976,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,3152679.5,False,,0


In [4]:
# Exercise 3.2. Parse JSON object (a dictionary)

# convert the first 2 items to string
s = json.dumps(result["items"][0:2], indent=4)
print(s)

# load back from a string
items = json.loads(s)
items

# save to file (the with-block closes the file when done)
with open("items.json", "w") as f:
    json.dump(result["items"], f)

# load back from file
with open("items.json", "r") as f:
    items = json.load(f)
print("test loaded data\n")
len(items)
items[0]

[
    {
        "shelfId": "1215286383583",
        "shelfName": "Corners",
        "deptId": "1215341888021",
        "deptName": "Yogurts & Desserts",
        "isBundle": "false",
        "meatStickerDetails": "10::for::\u00a33.5::true",
        "extraLargeImageURL": "",
        "bundledItemCount": "0",
        "scene7Host": "https://ui.assets-asda.com:443/dm/",
        "cin": "6362225",
        "promoDetailFull": "10 for \u00a33.5",
        "availability": "A",
        "totalReviewCount": "8",
        "asdaSuggest": "",
        "itemName": "Vanilla Yogurt with Chocolate\u00a0Balls",
        "price": "\u00a30.55",
        "imageURL": "",
        "aisleName": "Yogurts & Fromage Frais",
        "id": "1000120228362",
        "promoId": "ls91195",
        "isFavourite": "false",
        "hasAlternates": "false",
        "wasPrice": "",
        "brandName": "Muller Corner",
        "promoType": "No Promo",
        "weight": "130g",
        "promoOfferTypeCode": "15",
        "promoQty": 

{'shelfId': '1215286383583',
 'shelfName': 'Corners',
 'deptId': '1215341888021',
 'deptName': 'Yogurts & Desserts',
 'isBundle': 'false',
 'meatStickerDetails': '10::for::£3.5::true',
 'extraLargeImageURL': '',
 'bundledItemCount': '0',
 'scene7Host': 'https://ui.assets-asda.com:443/dm/',
 'cin': '6362225',
 'promoDetailFull': '10 for £3.5',
 'availability': 'A',
 'totalReviewCount': '8',
 'asdaSuggest': '',
 'itemName': 'Vanilla Yogurt with Chocolate\xa0Balls',
 'price': '£0.55',
 'imageURL': '',
 'aisleName': 'Yogurts & Fromage Frais',
 'id': '1000120228362',
 'promoId': 'ls91195',
 'isFavourite': 'false',
 'hasAlternates': 'false',
 'wasPrice': '',
 'brandName': 'Muller Corner',
 'promoType': 'No Promo',
 'weight': '130g',
 'promoOfferTypeCode': '15',
 'promoQty': '10',
 'promoValue': '£3.50',
 'productAttribute': '',
 'scene7AssetId': '4025500245221',
 'promoDetail': '10 for £3.5',
 'bundleDiscount': '0.00',
 'avgStarRating': '4.625',
 'name': 'Muller Corner Vanilla Yogurt with Ch

## 4. Get Tweets

Reference: 
- https://github.com/scalto/snscrape-by-location/blob/main/snscrape_by_location_tutorial.ipynb
- https://medium.com/swlh/how-to-scrape-tweets-by-location-in-python-using-snscrape-8c870fa6ec25

Note: User object is not exposed by TwitterSearchScraper any more.

In [2]:
import pandas as pd
import snscrape.modules.twitter as sntwitter
import itertools


In [4]:
# search by keywords + time
# TwitterSearchScraper returns an iterator; islice takes the first 500 items from it

df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    '"blockchain + since:2020-10-31 until:2020-11-03"').get_items(), 500))

print(len(df))
df.head()


500


Unnamed: 0,url,date,content,id,username,outlinks,outlinksss,tcooutlinks,tcooutlinksss
0,https://twitter.com/zawphyowai199/status/13234...,2020-11-02 23:59:55+00:00,Synic Token Airdrop is now Live🚀💰🏆\n\nClick on...,1323414568236806146,zawphyowai199,[https://t.me/synictoken_officialAirdropbot],https://t.me/synictoken_officialAirdropbot,[https://t.co/O83hXYOniH],https://t.co/O83hXYOniH
1,https://twitter.com/CryptoWatchBot/status/1323...,2020-11-02 23:59:53+00:00,"@NEO_Blockchain, #NEO is the coin with the bes...",1323414560204824576,CryptoWatchBot,[],,[],
2,https://twitter.com/Rayinhosen/status/13234145...,2020-11-02 23:59:46+00:00,"📌 CPX Airdrop is Live, 🎁 Join to get Free 7 CP...",1323414530769051648,Rayinhosen,[https://t.me/CrypxieAirdrop_bot?start=r017400...,https://t.me/CrypxieAirdrop_bot?start=r0174001...,[https://t.co/dcfBIyNYi4],https://t.co/dcfBIyNYi4
3,https://twitter.com/coinmarketnet/status/13234...,2020-11-02 23:59:43+00:00,"📌 CPX Airdrop is Live, 🎁 Join to get Free 7 CP...",1323414518651781120,coinmarketnet,[https://t.me/CrypxieAirdrop_bot?start=r076622...,https://t.me/CrypxieAirdrop_bot?start=r0766228434,[https://t.co/Ff0IdY5i1G],https://t.co/Ff0IdY5i1G
4,https://twitter.com/Link_Errors/status/1323414...,2020-11-02 23:59:42+00:00,Yearnify Finance Airdrop is now Live🚀💰🏆\n\nCli...,1323414512675016706,Link_Errors,[https://t.me/YearnifyAirdropBot],https://t.me/YearnifyAirdropBot,[https://t.co/6HpZOBOZzI],https://t.co/6HpZOBOZzI


In [12]:
# search by user

df = pd.DataFrame(itertools.islice(sntwitter.TwitterUserScraper(
    '"zawphyowai199"').get_items(), 500))

print(len(df))
df.tail()

500


Unnamed: 0,url,date,content,id,username,outlinks,outlinksss,tcooutlinks,tcooutlinksss
495,https://twitter.com/zawphyowai199/status/13669...,2021-03-03 02:51:11+00:00,@airdropinspect @Ashwsbreal @blazingbitcoin @m...,1366944211526819840,zawphyowai199,[],,[],
496,https://twitter.com/zawphyowai199/status/13669...,2021-03-03 02:47:55+00:00,@alexanhtuan #Phoswapper\n\n@mma728122 \n@mst5...,1366943388881154048,zawphyowai199,[],,[],
497,https://twitter.com/zawphyowai199/status/13667...,2021-03-02 14:55:21+00:00,@KoalaDefi @dfgr5eytsy56es1 @paisalnurpadil1 @...,1366764068342751238,zawphyowai199,[],,[],
498,https://twitter.com/zawphyowai199/status/13667...,2021-03-02 14:51:58+00:00,@BSCToolzApp Very good Project. I recommend.\n...,1366763216978731008,zawphyowai199,[],,[],
499,https://twitter.com/zawphyowai199/status/13667...,2021-03-02 11:50:32+00:00,@RocketGameVip @mma728122 \n@mst5792 \n@wt7276...,1366717558095835137,zawphyowai199,[],,[],


## 5. Tweepy
- Tweepy is a Python library for accessing the Twitter API. 
- `pip install tweepy`
- The Tweepy documentation has detailed explanations: https://docs.tweepy.org/en/stable/
- You need to apply for a developer account from here: https://developer.twitter.com/en/apply-for-access

In [3]:
import tweepy
import csv
import datetime

# https://docs.tweepy.org/en/stable/auth_tutorial.html

# enter your account information
CONSUMER_KEY=''
CONSUMER_SECRET=''
ACCESS_KEY=''
ACCESS_SECRET=''

auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
api=tweepy.API(auth)

In [5]:
# Take a look at the public tweets from your account's home timeline 

public_tweets = api.home_timeline()
for tweet in public_tweets[:2]:
    print(tweet.text)

How did your neighborhood vote in the California recall election?

Explore the latest precinct results from Los Ang… https://t.co/cdJRQzUbpU
A new fence, an information-sharing alert and ramped-up airport security are just a few of the ways law enforcement… https://t.co/cYps7sbW7A


In [22]:
# Is this useful information?
# Let's take a close look at ONE tweet's JSON

public_tweets[0]
# The raw Status object is printed on a single line, which is hard to read


Status(_api=<tweepy.api.API object at 0x7fe63da41d50>, _json={'created_at': 'Fri Sep 17 15:22:52 +0000 2021', 'id': 1438886178594324480, 'id_str': '1438886178594324480', 'text': 'How did your neighborhood vote in the California recall election?\n\nExplore the latest precinct results from Los Ang… https://t.co/cdJRQzUbpU', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/cdJRQzUbpU', 'expanded_url': 'https://twitter.com/i/web/status/1438886178594324480', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 16664681, 'id_str': '16664681', 'name': 'Los Angeles Times', 'screen_name': 'latimes', 'location': 'El Segundo, CA', 'description': 'Covering t

In [23]:
# make it look better
# convert to string
json_str = json.dumps(public_tweets[0]._json)

# deserialise string into python object
parsed = json.loads(json_str)

print(json.dumps(parsed, indent=4, sort_keys=True))
# Now the nested structure of the JSON object is much easier to see

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Fri Sep 17 15:22:52 +0000 2021",
    "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [
            {
                "display_url": "twitter.com/i/web/status/1\u2026",
                "expanded_url": "https://twitter.com/i/web/status/1438886178594324480",
                "indices": [
                    117,
                    140
                ],
                "url": "https://t.co/cdJRQzUbpU"
            }
        ],
        "user_mentions": []
    },
    "favorite_count": 1,
    "favorited": false,
    "geo": null,
    "id": 1438886178594324480,
    "id_str": "1438886178594324480",
    "in_reply_to_screen_name": null,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "is_quote_status": false,
    "lang": "en",
    "place": null,
    "possibly_sensitive": false,
    "possibly_sensitiv

### 5.1. Get tweets from users' timeline
- Make a timeline call to retrieve up to the 3,200 most recent tweets by a user (a limit set by Twitter).
    - Note: the time range you get depends on how often the user posts tweets. 
- Parameters for the timeline call
    - `count`: the number of results to try to retrieve per page. The maximum is 200, so retrieving all 3,200 tweets takes multiple calls. 
    - `max_id`: return only tweets with an ID less than or equal to this value; used to page backwards through the timeline. 
    - `tweet_mode`: when set to `"extended"`, the `text` attribute is replaced by `full_text`, which prevents an original tweet longer than 140 characters from being truncated.
- Variables of tweet objects
    - https://docs.tweepy.org/en/stable/api.html#tweepy-api-twitter-api-wrapper
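
The paging logic behind "multiple calls" can be tried without credentials. The sketch below is a simulation, not Tweepy code: `fetch_page` is a hypothetical stand-in for a timeline call that returns at most `count` tweet IDs, newest first, and accepts a `max_id` upper bound. Each new request asks for tweets at or below the oldest ID seen so far minus one.

```python
# Simulated timeline: 1000 fake tweet IDs, newest (largest) first
FAKE_TIMELINE = list(range(1000, 0, -1))

def fetch_page(count=200, max_id=None):
    """Hypothetical stand-in for a timeline call: up to `count` IDs <= max_id."""
    eligible = [i for i in FAKE_TIMELINE if max_id is None or i <= max_id]
    return eligible[:count]

def fetch_all(page_size=200):
    all_ids = []
    max_id = None
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:           # an empty page means the timeline is exhausted
            break
        all_ids.extend(page)
        max_id = page[-1] - 1  # next page starts just below the oldest ID seen
    return all_ids

ids = fetch_all()
print(len(ids), ids[0], ids[-1])
```

Subtracting one from the oldest ID keeps the last tweet of one page from reappearing as the first tweet of the next.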

In [11]:
# Get the five most recent tweets of a user.
timeline = api.user_timeline(screen_name="KelloggCompany", count=5, tweet_mode="extended")

for status in timeline:
    print (status.id)
    print (status.full_text)

1438671732513116163
RT @toendhunger: In the fight against #hunger, #schoolmeals are a critical investment in the next generation. Check out our recent webinar…
1438605434030600203
Proud to join @AlbertsonsCos to fight childhood hunger. Kellogg will commit $120,000 to @nokidhungry and @AlbertsonsCos foundation Nourishing Neighbors #BetterDays https://t.co/9AAMWecFpS
1438584502574452746
Land a Gr-r-reat internship with Kellogg’s #internship recruiting virtual event #employee https://t.co/ilgnrdfhQs
1438535106591961090
RT @MidwestRowCrop: What can member companies do together to advance regenerative agriculture? Here’s what Oliver Morton, Chief Customer Of…
1438164431322636294
“The truck drivers we work with are the best in the business. They are critical to our supply chain running successfully. We appreciate all they do!” – Matt Rose Sr. Dir., Kellogg North America Transportation #ThankATrucker #NTDAW2021 https://t.co/5bSABzeWLV


In [18]:
# Step 1: get a list of tweets 
# Step 2: extract the variables you want

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth)
    
    # initialize with the first call (up to 200 most recent tweets)
    alltweets=[]
    new_tweets=api.user_timeline(screen_name=user_name, count=200)
    alltweets.extend(new_tweets)
    if not alltweets:  # nothing to download, e.g. an empty timeline
        return
    oldest=alltweets[-1].id-1  # next time start from the oldest one minus one 
    
    # continue to get tweets until a call returns nothing
    while len(new_tweets)>0:  
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name=user_name, count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
    
    # extract the variables you want
    # note: encode() turns the text into bytes, so the CSV stores it as b'...'
    outtweets = [[tweet.id_str, tweet.user.name, tweet.created_at, tweet.user.followers_count,
                  tweet.text.encode("utf-8")] for tweet in alltweets]
            
    # write out your variables (newline='' avoids blank rows on Windows)
    with open('%s_tweet.csv' % user_name, 'w', newline='') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","user_name","created_at","followers","text"])
        writer.writerows(outtweets)

# use your function
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")
    

getting tweets before 1381701008527605762
...400 tweets downloaded so far
getting tweets before 1315649082661236735
...600 tweets downloaded so far
getting tweets before 1245073331252068353
...800 tweets downloaded so far
getting tweets before 1176214453777588224
...1000 tweets downloaded so far
getting tweets before 1103676050536628223
...1200 tweets downloaded so far
getting tweets before 1009496869305712639
...1400 tweets downloaded so far
getting tweets before 936346653279379455
...1600 tweets downloaded so far
getting tweets before 877615558744735745
...1800 tweets downloaded so far
getting tweets before 827194028668108801
...2000 tweets downloaded so far
getting tweets before 764075619474735103
...2197 tweets downloaded so far
getting tweets before 707276031858753536
...2395 tweets downloaded so far
getting tweets before 656861069776920575
...2594 tweets downloaded so far
getting tweets before 604011239203819519
...2794 tweets downloaded so far
getting tweets before 5252941128752

In [21]:
# Take a look at the table you got
df= pd.read_csv('KelloggCompany_tweet.csv', header=0)
df.head()

# How many tweets did we get?
len(df)

# The first tweet is a retweet. Take a look at the text.
df.text[0]

# Let's compare it with the actual tweet. 
# You can find each tweet by its ID. 
# https://twitter.com/KelloggCompany/status/1438671732513116163

Unnamed: 0,id,user_name,created_at,followers,text
0,1438671732513116163,Kellogg Company,2021-09-17 01:10:44,76099,b'RT @toendhunger: In the fight against #hunge...
1,1438605434030600203,Kellogg Company,2021-09-16 20:47:18,76099,b'Proud to join @AlbertsonsCos to fight childh...
2,1438584502574452746,Kellogg Company,2021-09-16 19:24:07,76099,b'Land a Gr-r-reat internship with Kellogg\xe2...
3,1438535106591961090,Kellogg Company,2021-09-16 16:07:50,76099,b'RT @MidwestRowCrop: What can member companie...
4,1438164431322636294,Kellogg Company,2021-09-15 15:34:54,76099,b'\xe2\x80\x9cThe truck drivers we work with a...


3196

"b'RT @toendhunger: In the fight against #hunger, #schoolmeals are a critical investment in the next generation. Check out our recent webinar\\xe2\\x80\\xa6'"

### 5.2. Deal with truncated text
- For text mining on Twitter, it is important to get the full text. 
    - Full text would be essential for topic modeling and sentiment analysis.
    - Full text is also important for extracting mention networks (note the previous example). 
- Use `tweet_mode="extended"` when calling a user's timeline.
    - In extended mode, the `text` attribute of the returned Status objects is replaced by a `full_text` attribute, which contains the entire untruncated text of the tweet. 
- Full text for retweets
    - If the tweet is a retweet, `full_text` may still be truncated. 
    - In that case, read the full text from the `retweeted_status` attribute, which is itself a Status object. 
- For reference: https://docs.tweepy.org/en/stable/extended_tweets.html
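
The rule above can be captured in a small helper. This is a minimal sketch, assuming Status-like objects that expose `full_text`, `text`, and (for retweets) `retweeted_status` as described; the helper name `get_full_text` and the `SimpleNamespace` stand-ins are hypothetical and used only so the example runs without API access.

```python
from types import SimpleNamespace

def get_full_text(status):
    """Return the untruncated text of a Status-like object."""
    # A retweet carries the original tweet in `retweeted_status`
    if hasattr(status, "retweeted_status"):
        return status.retweeted_status.full_text
    # Extended mode exposes `full_text`; otherwise fall back to `text`
    return getattr(status, "full_text", None) or status.text

# Stand-in objects that mimic the attributes of real Status objects
original = SimpleNamespace(full_text="A long original tweet that is not cut off.")
retweet = SimpleNamespace(
    full_text="RT @someone: A long original tweet that is\u2026",
    retweeted_status=original,
)
plain = SimpleNamespace(full_text="Just a normal tweet.")
legacy = SimpleNamespace(text="Compatibility-mode tweet")

for s in (retweet, plain, legacy):
    print(get_full_text(s))
```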

In [27]:
# Let's deal with retweets

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth,wait_on_rate_limit=True)

    alltweets=[]
    new_tweets=api.user_timeline(screen_name=user_name, count=200, tweet_mode="extended")
    alltweets.extend(new_tweets)
    oldest=alltweets[-1].id-1  
    
    # set date condition: keep paging until the oldest tweet in a batch is before startDate
    # (the comparison assumes created_at is a naive datetime, as in the run shown here)
    startDate=datetime.datetime(2021, 1, 1, 0, 0, 0)
    while new_tweets and new_tweets[-1].created_at > startDate:
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name=user_name, count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
        
    # check if it's a retweet
    # When using extended mode with a Retweet, the full_text attribute of the Status object may be truncated    
    # However, since the retweeted_status attribute (of a Status object that is a Retweet) is itself a Status object
    # the full_text attribute of the Retweeted Status object can be used instead.
    
    outtweets_all=[]
    for tweet in alltweets:
        status = api.get_status(tweet.id, tweet_mode="extended")
        
        if hasattr(status, "retweeted_status"):  # is a retweet
            full_text=status.retweeted_status.full_text.encode("utf-8")
            
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            tweet.retweeted_status.user.name,tweet.retweeted_status.user.screen_name,tweet.retweeted_status.user.description]
            outtweets_all.append(outtweets)
  
        else: # not a retweet
            full_text=status.full_text.encode("utf-8")
                    
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            "no value","no value","no value"]
            outtweets_all.append(outtweets)

    with open('%s_full_tweet.csv' % user_name, 'w', newline='', encoding='utf-8') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","created_at","full_text",
                        "user.name","user.screen_name","user.followers_count",
                        "retweeted_status.user.name","retweeted_status.user.screen_name","retweeted_status.user.description"])
        writer.writerows(outtweets_all)

        
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")


getting tweets before 1381701008527605762
...400 tweets downloaded so far


In [28]:
df= pd.read_csv('KelloggCompany_full_tweet.csv', header=0)
df.head()
len(df)
# Please compare this full text with the truncated text above. What differences can you find?
df.full_text[0]

Unnamed: 0,id,created_at,full_text,user.name,user.screen_name,user.followers_count,retweeted_status.user.name,retweeted_status.user.screen_name,retweeted_status.user.description
0,1438671732513116163,2021-09-17 01:10:44,"b'In the fight against #hunger, #schoolmeals a...",Kellogg Company,KelloggCompany,76099,Alliance To End Hunger,toendhunger,Building the will to #EndHunger at home and ab...
1,1438605434030600203,2021-09-16 20:47:18,b'Proud to join @AlbertsonsCos to fight childh...,Kellogg Company,KelloggCompany,76099,no value,no value,no value
2,1438584502574452746,2021-09-16 19:24:07,b'Land a Gr-r-reat internship with Kellogg\xe2...,Kellogg Company,KelloggCompany,76099,no value,no value,no value
3,1438535106591961090,2021-09-16 16:07:50,b'What can member companies do together to adv...,Kellogg Company,KelloggCompany,76099,Midwest Row Crop Collaborative,MidwestRowCrop,A coalition of leading companies and NGOs driv...
4,1438164431322636294,2021-09-15 15:34:54,b'\xe2\x80\x9cThe truck drivers we work with a...,Kellogg Company,KelloggCompany,76099,no value,no value,no value


400

"b'In the fight against #hunger, #schoolmeals are a critical investment in the next generation. Check out our recent webinar with @WFPUSA, @KelloggCompany, and @GCNFoundation. https://t.co/XLh0i39iwV'"

### 5.3. Build Twitter networks
- **Follower-followee network**
    - If you have a list of user accounts, you can retrieve, for each pair of accounts, a boolean value indicating whether one follows the other. 
    - Parameters of `api.show_friendship`
        * `source_id` – The user_id of the subject user.
        * `source_screen_name` – The screen_name of the subject user.
        * `target_id` – The user_id of the target user.
        * `target_screen_name` – The screen_name of the target user.
- **Retweet network**
    - Retweeted accounts can be extracted while scraping the API. 
    - Or retweeted accounts can be extracted from the text. 
- **Mention network**
    - Can be extracted from the full text. 
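
Extracting retweet and mention edges from text can be sketched with regular expressions. The example below is one possible approach, not the only one: the helper `extract_edges` and the two patterns are assumptions of this sketch, and the sample texts are shortened from the tweets shown earlier. As a simplification, mentions inside a retweet are attributed to the retweeting account, and self-mentions are dropped.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT @(\w+):")

def extract_edges(author, text):
    """Return (retweet_edges, mention_edges) as lists of (source, target) pairs."""
    rt = RETWEET.match(text)
    retweet_edges = [(author, rt.group(1))] if rt else []
    mentioned = MENTION.findall(text)
    if rt:
        mentioned = mentioned[1:]  # the first handle is the retweeted account
    mention_edges = [(author, m) for m in mentioned if m != author]
    return retweet_edges, mention_edges

tweets = [
    ("KelloggCompany",
     "RT @toendhunger: Check out our recent webinar with @WFPUSA, "
     "@KelloggCompany, and @GCNFoundation."),
    ("KelloggCompany",
     "Proud to join @AlbertsonsCos to fight childhood hunger. Kellogg will "
     "commit $120,000 to @nokidhungry and @AlbertsonsCos foundation"),
]

# Count how often each directed edge occurs; the counts serve as edge weights
retweet_net, mention_net = Counter(), Counter()
for author, text in tweets:
    rts, mentions = extract_edges(author, text)
    retweet_net.update(rts)
    mention_net.update(mentions)

print(retweet_net)
print(mention_net)
```

The resulting counters map each `(source, target)` pair to a weight and can be loaded into a network library or a DataFrame for further analysis.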

In [29]:
# How to scrape the follower-followee network?
# We can directly retrieve a boolean value for a pair of accounts

dog="Microsoft"
cat="Oracle"

# show_friendship returns a pair of Friendship objects: (source, target)
is_following = api.show_friendship(source_screen_name=dog,target_screen_name=cat)
print(is_following[1].following)

# Question: how to get the adjacency matrix of a follower-followee network?

True
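
One way to approach the adjacency-matrix question is to loop over every ordered pair of accounts and record the boolean result. The sketch below runs offline: `follows` is a hypothetical stand-in for a friendship lookup, and the relations in `FAKE_FOLLOWS` (as well as the third account) are made up for illustration, not real Twitter data.

```python
from itertools import permutations

accounts = ["Microsoft", "Oracle", "IBM"]

# Made-up relations standing in for the results of friendship lookups
FAKE_FOLLOWS = {("Microsoft", "Oracle"), ("Oracle", "IBM")}

def follows(source, target):
    """Hypothetical stand-in: does `source` follow `target`?"""
    return (source, target) in FAKE_FOLLOWS

# matrix[i][j] == 1 means accounts[i] follows accounts[j]
index = {name: i for i, name in enumerate(accounts)}
matrix = [[0] * len(accounts) for _ in accounts]
for a, b in permutations(accounts, 2):  # every ordered pair, no self-pairs
    if follows(a, b):
        matrix[index[a]][index[b]] = 1

for name, row in zip(accounts, matrix):
    print(f"{name:10s}", row)
```

With n accounts this sketch performs n(n-1) lookups, so with a real API the rate limit quickly becomes the bottleneck for large account lists.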


##### Twitter data resources
https://github.com/echen102/us-pres-elections-2020 <br>
https://github.com/echen102/COVID-19-TweetIDs