# Scraping Data

Scraping data means extracting data from an online platform. There are several ways to scrape data, and they all boil down to two types:

1. Using APIs to get access to the data from the platform's database
2. Reading the source code of the platform (reading the website)

Let me explain each briefly.

### 1. Using APIs

* Here, the online platform you plan to access provides a gateway (an API) for developers who want to use its data for learning purposes or to create new solutions.

* Companies give developers this leeway so that they can increase user traffic or gain innovative solutions to offer their users.
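
As a rough illustration of the API route, the sketch below parses a JSON payload of the kind an API endpoint might return. The payload, its field names, and the `extract_posts` helper are hypothetical stand-ins, not the response format of any particular platform.

```python
import json

# Hypothetical JSON body, standing in for what a platform's API might return
SAMPLE_RESPONSE = """
{
  "data": [
    {"id": "1", "user": "alice", "text": "My line is not working"},
    {"id": "2", "user": "bob", "text": "How do I buy bundles?"}
  ]
}
"""


def extract_posts(raw_json):
    """Return (user, text) pairs from a JSON response body."""
    payload = json.loads(raw_json)
    return [(item["user"], item["text"]) for item in payload.get("data", [])]


for user, text in extract_posts(SAMPLE_RESPONSE):
    print(user, "->", text)
```

Because the API already returns structured data, no parsing of page layout is needed; the fields can be read directly by name.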

### 2. Extracting platform source code

* Here you take the HTML code of a website and parse it to get the data you want.

* This gets difficult if you have more than one website to scrape, especially when the websites were built by different developers with different coding styles.
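
A minimal sketch of the parsing step, using only Python's standard `html.parser` module. The `SAMPLE_HTML` string and the `LinkTextParser` class are made up for illustration; a real page would be downloaded first and would usually need rules specific to its own markup.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text inside every <a> tag of an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Text can arrive in several chunks, so append to the current link
        if self._in_link:
            self.links[-1] += data


# Hypothetical page source standing in for a downloaded website
SAMPLE_HTML = (
    "<html><body><a href='/a'>First post</a><p>skip</p>"
    "<a href='/b'>Second post</a></body></html>"
)

parser = LinkTextParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

The rules in the parser are tied to the tags of one page, which is why every additional website tends to need its own parser.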



### Scraping Twitter

* Here we will use APIs to access Twitter data.
* Twitter provides several APIs for accessing different kinds of data. To use these APIs you need access tokens and keys for authentication, which is why you need an approved Twitter developer account.
* There are several libraries for working with the Twitter APIs. I have used two of them:
    1. Tweepy
    2. GetOldTweets3
* I worked with both because each has limitations that make it hard to gather the kind of data you are looking for.
* For Tweepy the limitations are:
    1. You can only get data from the last 30 days.
    2. You can't get more than 300 tweets per query, so you need to query for 300 tweets and wait 15 minutes before you query again.
* For GetOldTweets3 the limitation is:
    1. Although you can get a lot of tweets, including old ones, some attributes are missing from the tweet objects it returns.
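
The "query 300 tweets, then wait 15 minutes" pattern can be sketched as a small loop. This is only an illustration: `collect_in_batches` and `fetch_batch` are hypothetical names, and the batch size and wait time are simply the figures quoted in the list. Tweepy's `wait_on_rate_limit=True` option, which this notebook sets when it creates the API object, makes the library do the waiting itself.

```python
import time


def collect_in_batches(fetch_batch, total, batch_size=300,
                       wait_seconds=15 * 60, sleep=time.sleep):
    """Call fetch_batch(n) until `total` items are collected, pausing between batches.

    fetch_batch is a hypothetical callable returning up to n items;
    an empty result means the source has nothing left.
    """
    collected = []
    while len(collected) < total:
        batch = fetch_batch(min(batch_size, total - len(collected)))
        if not batch:
            break  # the source has run out of items
        collected.extend(batch)
        if len(collected) < total:
            sleep(wait_seconds)  # wait out the rate-limit window
    return collected


# Usage with a fake source and a recording "sleep" so the example runs instantly
fake_source = iter(range(700))


def fake_fetch(n):
    return [item for _, item in zip(range(n), fake_source)]


waits = []
items = collect_in_batches(fake_fetch, total=650, sleep=waits.append)
print(len(items), "items collected after", len(waits), "waits")
```

Passing the sleep function as a parameter keeps the sketch testable without actually waiting 15 minutes.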

***Import Libraries to use***

In [2]:
# Import Twitter API libraries
import tweepy as tp
import GetOldTweets3 as got

# Import libraries for data reading
import pandas as pd

# For reading the file that holds the secured access codes and tokens
import yaml

***Read access codes and tokens to authenticate the Twitter API***

In [3]:
# The Twitter API access token and consumer key, with their secrets, are read from a YAML file.
# Keep the secret keys private; never publish them.
with open("secret.yml") as file:
    # safe_load only builds plain Python objects, which is all a key/value file needs
    secret_list = yaml.safe_load(file)

# Access the Twitter API
auth = tp.OAuthHandler(secret_list["consumer_key"], secret_list["consumer_secret"])
auth.set_access_token(secret_list["access_token"], secret_list["access_secret"])
api = tp.API(auth, wait_on_rate_limit=True)
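
The authentication code above expects four entries in `secret.yml`: `consumer_key`, `consumer_secret`, `access_token` and `access_secret`. A small check like the sketch below could be run on the loaded dictionary before calling the API. The `missing_secrets` helper and the placeholder values are assumptions made for illustration.

```python
# Key names taken from the authentication code above
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")


def missing_secrets(secrets):
    """Return the required key names that are absent or empty in `secrets`."""
    return [key for key in REQUIRED_KEYS if not secrets.get(key)]


# Placeholder values only; real credentials should never appear in a notebook
example = {"consumer_key": "xxx", "consumer_secret": "xxx", "access_token": "xxx"}
print(missing_secrets(example))
```

Failing early with a list of missing names is usually easier to debug than an authentication error returned by the API.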

***Set up tweet query with GetOldTweets3***

In [4]:
tweet_query = "@AIRTEL_KE"
count = 200000

In [None]:
# Set the criteria for searching the tweets, capped at `count` tweets
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(tweet_query)\
                                            .setSince("2020-01-01")\
                                            .setMaxTweets(count)

#Query for the tweets
tweets = got.manager.TweetManager.getTweets(tweetCriteria)


In [None]:
# Create a list holding lists with tweet details we want
tweets_lst = [[tw.id, tw.date, tw.text, tw.username, tw.retweets, tw.favorites, tw.geo, tw.mentions, tw.hashtags] for tw in tweets]

In [None]:
# Confirm that we received the number of tweets requested
len(tweets_lst)

In [None]:
# Create a dataframe of the tweets we queried
tweets_df = pd.DataFrame(tweets_lst, columns=["ID", "Date", "Post", "Username","Retweets", "Favorites", "Geo", "Mentions", "Hashtags"])
tweets_df.sample(10)

In [None]:
# Keep the tweets that mention @AIRTEL_KE (in any letter case), since those are the tweets with questions and queries.
airtel_mention_df = tweets_df[tweets_df["Mentions"].str.contains("@AIRTEL_KE", case=False, na=False)]
print(airtel_mention_df.shape)

# To avoid having to repeat the querying process again, we save the results we got
airtel_mention_df.to_csv(path_or_buf="AirtelMentions1.csv")
airtel_mention_df.sample(20)

In [None]:
# Get the list we already created from the earlier query.
airtel_mention_df = pd.read_csv("AirtelMentions1.csv")
airtel_mention_df.drop(columns=['Unnamed: 0'], inplace=True)
airtel_mention_df.sample(20)

In [None]:
# This searches for replies to a tweet: it takes the username and the tweet ID, then looks through all tweets sent to that username after that tweet ID

# airtel_replies=[]
# for x, Id in enumerate(airtel_mention_df["ID"]):
#     tweet_id = Id
#     name = airtel_mention_df.Username.iloc[x]
#     replies = []
#     print(x)
#     for tweet in tp.Cursor(api.search,q='to:'+name, since_id = tweet_id, timeout=999999).items():

#         if hasattr(tweet, 'in_reply_to_status_id_str'):
#             if (tweet.in_reply_to_status_id_str==tweet_id):
#                 replies.append(tweet)
#             for tweet in replies:
#                 row = {'ID':tweet_id, 'Date': airtel_mention_df.Date.iloc[x], 'Username':name, 
#                         'Post': airtel_mention_df.Post.iloc[x],  'Replier': tweet.user.screen_name, 
#                         'Mentions': airtel_mention_df.Mentions.iloc[x],  'Hashtags': airtel_mention_df.Hashtags.iloc[x],  
#                         'Reply_date':tweet.created_at, 'Reply': tweet.text.replace('\n', ' '), 
#                         'Reply_mentions':' '.join(x['screen_name'] for x in tweet.entities['user_mentions']), 
#                         'Reply_Hashtags':' '.join(x['text'] for x in tweet.entities['hashtags'])}
#                 airtel_replies.append(row)


In [None]:
"""This function finds the tweets by AIRTEL_KE since the tweet 
    of the customer asking a question was posted.
    All those tweets are then added to a list of tweets,
    avoiding the creation of duplicates."""

def retriver(name, tweet_id, tweetsData):
    try:
        # Cursor fetches pages lazily, so iterate inside the try block to catch API errors
        for tweet in tp.Cursor(api.user_timeline, id='AIRTEL_KE', since_id=tweet_id, timeout=999999).items():
            if tweet not in tweetsData:
                tweetsData.append(tweet)
    except Exception:
        print('failed to get data')

    return tweetsData

In [None]:
# Testing our function to make sure it returns what we expect
# Data_tweets=[]
# ts = retriver('ntvkenya', '1294919890839773184', Data_tweets)
# ts[0].id

In [None]:
def get_replies(Data_tweets, df, tweet_id):
    """Build one row per tweet in Data_tweets that replies to tweet_id."""
    # Look up the original post by its ID instead of relying on the globals x and name.
    # IDs are compared as strings because in_reply_to_status_id_str is a string.
    post = df.loc[df["ID"].astype(str) == str(tweet_id)].iloc[0]
    replies = [tweet for tweet in Data_tweets
               if getattr(tweet, 'in_reply_to_status_id_str', None) == str(tweet_id)]
    airtel_replies = []
    for tweet in replies:
        row = {'ID': tweet_id, 'Date': post.Date, 'Username': post.Username,
               'Post': post.Post, 'Replier': tweet.user.screen_name,
               'Mentions': post.Mentions, 'Hashtags': post.Hashtags,
               'Reply_date': tweet.created_at, 'Reply': tweet.text.replace('\n', ' '),
               'Reply_mentions': ' '.join(m['screen_name'] for m in tweet.entities['user_mentions']),
               'Reply_Hashtags': ' '.join(h['text'] for h in tweet.entities['hashtags'])}
        airtel_replies.append(row)
    return airtel_replies
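
`get_replies` scans the whole list of downloaded tweets for every customer tweet. As the list grows, one alternative is to group the tweets once by the ID they reply to and then look each customer tweet up directly. The sketch below illustrates that idea with stand-in objects. `index_replies` is a hypothetical helper and the IDs and texts are made up; only the attribute name `in_reply_to_status_id_str` comes from the code above.

```python
from collections import defaultdict
from types import SimpleNamespace


def index_replies(tweets):
    """Group tweets by the ID of the tweet they reply to."""
    by_parent = defaultdict(list)
    for tweet in tweets:
        parent_id = getattr(tweet, "in_reply_to_status_id_str", None)
        if parent_id is not None:
            by_parent[parent_id].append(tweet)
    return by_parent


# Stand-in objects with made-up IDs, mimicking the attributes used above
tweets = [
    SimpleNamespace(id_str="11", in_reply_to_status_id_str="1", text="Hi, kindly DM your number"),
    SimpleNamespace(id_str="12", in_reply_to_status_id_str=None, text="New offer!"),
    SimpleNamespace(id_str="13", in_reply_to_status_id_str="1", text="This has been resolved"),
]

replies = index_replies(tweets)
print([t.id_str for t in replies["1"]])
```

The index would need rebuilding (or updating) whenever new tweets are downloaded, so this is a trade-off rather than a drop-in replacement.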

In [None]:
repliesData = []
sort_df = airtel_mention_df.sort_values(by = 'ID')
Data_tweets=[]

In [None]:
# Loop through the sorted dataframe to get replies for each tweet, starting with the oldest
for x, Id in enumerate(sort_df["ID"]):
    tweet_id = Id
    name = sort_df.Username.iloc[x]
    # Check whether a reply to this tweet has already been downloaded
    present = any(tw.in_reply_to_status_id_str == str(tweet_id) for tw in Data_tweets)
    print(x, len(Data_tweets), present)
    try:
        if not present:
            # No reply downloaded yet: fetch AIRTEL_KE's tweets posted since this tweet
            Data_tweets = retriver(name, tweet_id, Data_tweets)
            print("Run retriver")
        repliesData.extend(get_replies(Data_tweets, sort_df, tweet_id))
    except Exception as err:
        print('failed:', err)
    # Save the scraped data after every tweet to prevent loss in case the code crashes
    airtelData_df = pd.DataFrame(repliesData)
    airtelData_df.to_csv(path_or_buf="AirtelData.csv")

In [None]:
airtelData_df = pd.DataFrame(repliesData)

airtelData_df.to_csv(path_or_buf="AirtelData.csv")

In [None]:
airtelData_df.sample(20)

In [None]:
# test = api.get_status('1243081255102615552')
# test.entities

In [23]:
tC = got.manager.TweetCriteria().setUsername("barackobama").setSince("2015-09-10")\
                                            .setMaxTweets(1)
twts = got.manager.TweetManager.getTweets(tC)
# A Tweet object is not iterable, so list its attributes with vars() instead
for twe in twts:
    for attr, value in vars(twe).items():
        print(attr, value)

In [29]:
print(sort_df.ID.head(1))
print(sort_df.ID.tail(1))

199993    1169605566479687680
Name: ID, dtype: object
3    1295448699070554119
Name: ID, dtype: object


In [25]:
# A view of how a tweet object looks and its attributes
api.get_status('1169605566479687680')

Status(_api=<tweepy.api.API object at 0x0000022441941B00>, _json={'created_at': 'Thu Sep 05 13:37:51 +0000 2019', 'id': 1169605566479687680, 'id_str': '1169605566479687680', 'text': 'Please clarify this because I have visited your shop in Narok and they are saying the lines are not working… https://t.co/PNmw0GfW7G', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/PNmw0GfW7G', 'expanded_url': 'https://twitter.com/i/web/status/1169605566479687680', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [109, 132]}]}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 3026331367, 'id_str': '3026331367', 'name': 'Ronoh K Clinton 🇰🇪', 'screen_name': 'RonohClinton', 'location': 'Kenya', 'description': 'Di

In [30]:
1295448699070554119 - 1169605566479687680

125843132590866439
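
The difference above counts possible ID values, not tweets. Tweet IDs of this size are "snowflake" IDs: the bits above the lowest 22 hold a millisecond timestamp counted from Twitter's epoch (1288834974657 in Unix milliseconds), so most numbers in the range do not correspond to any tweet. The sketch below decodes the creation time of the two boundary IDs. `snowflake_to_datetime` is a helper name chosen here, and the sketch assumes both IDs follow the snowflake layout.

```python
from datetime import datetime, timezone

# Start of Twitter's snowflake epoch, in Unix milliseconds
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_datetime(tweet_id):
    """Return the UTC creation time encoded in a snowflake tweet ID."""
    timestamp_ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


for tweet_id in (1169605566479687680, 1295448699070554119):
    print(tweet_id, snowflake_to_datetime(tweet_id))
```

For the oldest ID, the decoded time can be checked against the `created_at` field in the `api.get_status` output shown earlier.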

In [None]:
# This code needs to run on a fast machine with a stable internet connection.
# It checks every possible tweet ID between the oldest and the newest tweet in our list
# and keeps the tweets that reply to one of the tweets we collected.
# IDs are compared as strings because in_reply_to_status_id_str is a string.
mention_ids = airtel_mention_df["ID"].astype(str)
status_ids = set(mention_ids)
dataAirtel = []

for x in range(1169605566479687680, 1295448699070554119):
    try:
        tweet = api.get_status(str(x))
    except Exception:
        # Most numbers in this range are not real tweet IDs, so skip failed lookups
        continue
    if tweet.in_reply_to_status_id_str in status_ids:
        df = airtel_mention_df.loc[mention_ids == tweet.in_reply_to_status_id_str]
        for rw in df.values.tolist():
            row = {'ID': rw[0], 'Date': rw[1], 'Username': rw[3],
                   'Post': rw[2], 'Mentions': rw[7], 'Hashtags': rw[8],
                   'Replier': tweet.user.screen_name,
                   'Reply_date': tweet.created_at, 'Reply': tweet.text.replace('\n', ' '),
                   'Reply_mentions': ' '.join(m['screen_name'] for m in tweet.entities['user_mentions']),
                   'Reply_Hashtags': ' '.join(h['text'] for h in tweet.entities['hashtags'])}
            dataAirtel.append(row)
