# Twitter Scraping Using TWINT

Twitter is erratic about granting developer/API access. If you can't get access, you can use twint to scrape instead. As a bonus, twint makes it easy to pull historical data.

In this script, we'll first set the geo coordinates we want (this only finds tweets tagged with geo coordinates),
then run the data pull in chunks by year, convert to a dataframe, and save to a csv.
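
The main function later in this notebook calls `get_target_dates_min_and_max(which_year)`, which is not defined here and presumably lives in the companion notebook. As a rough sketch of how a year might be turned into a date window, assuming the window runs from January 1 of the year to January 1 of the following year and that both values are `datetime.date` objects (both are assumptions, and the name `year_to_date_window` is hypothetical):

```python
import datetime


def year_to_date_window(which_year):
    """Return (first day of the year, first day of the following year)."""
    start = datetime.date(which_year, 1, 1)
    # Assumed: the upper bound is exclusive, i.e. midnight at the start of the next year
    end = datetime.date(which_year + 1, 1, 1)
    return start, end


window_start, window_end = year_to_date_window(2021)
print(window_start.isoformat(), window_end.isoformat())
```

Returning `date` objects keeps `.isoformat()` available, which is how the main function below builds the string it assigns to `c.Until`.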

In [None]:
# This notebook calls functions stored in a different notebook - you may need to install import-ipynb
#!pip install import-ipynb

# then, you may need to open and run twitter_scrape_by_twint.ipynb in your IDE or on
# the same running python kernel before running this file
#
# if you make changes to that file, you will need to restart your kernel and maybe your IDE for changes to apply

In [None]:
# necessary imports 
import twint 
import pandas as pd
import csv
import datetime
import time
import os

# import the more generic scraper function written and stored in the twitter_scrape_by_twint.ipynb file
import import_ipynb

import twitter_scrape_by_twint


In [None]:
# Main loop function - takes a city and year and sets off the most reliable method to date
# of coaxing twint into giving us as many geo-tagged results as possible for that year given
# all the problems and bugs in twint at the time of running.  
# 
# Known problems discovered and worked around: since/until inconsistent, limit inconsistent but performs best at 
# 500, prefers to work backwards from an "until" date, twitter cuts you off if you call it too much
#
# Problems not worked around: 500 limit effectively limits you to 500*365 max per year; overlap between city ranges;
# Twitter sometimes gives you back different results from run to run
def twitter_scrape_by_geo_year(city_name = "niamey", which_year = 2021, use_tor_channel = False):
    """Given a city name and year, scrapes as many tweets as possible with twint and stores them in a csv"""
    
    geo_str = ""
    city_name = city_name.lower()

    # take the year and reorganize into minimum and maximum target dates
    target_date_min, target_date_max = get_target_dates_min_and_max(which_year)

    # here are the geo coordinates and ranges for a number of major cities in Niger
    # the cities were chosen by size and interest
    # Diffa and the Lake Chad Basin were left out of this analysis
    if city_name == "niamey":
        geo_str = "13.5234,2.1167,75km"
    elif city_name == "agadez":
        geo_str = "16.9701,7.9856,75km"
    elif city_name == "tillaberi":
        geo_str = "14.2589,1.4671,75km"
    elif city_name == "tahoua":
        geo_str = "14.8939,5.2639,75km"
    elif city_name == "dosso":
        geo_str = "13.179,3.2071,75km"
    elif city_name == "zinder":
        geo_str = "13.804,8.9886,75km"
    elif city_name == "maradi":
        geo_str = "13.496,7.1081,75km"
    else:
        raise ValueError("city '{}' is not recognized".format(city_name))

    # instantiate twint
    c = twint.Config()

    c.Limit = 500           
    c.Pandas = True
    c.Debug = True
    c.Count = True
    c.Stats = True
    c.Hide_output=True

    c.Geo = geo_str
    c.Until = target_date_max.isoformat()

    print("will search {} tweet chunks back from 00:00am {}".format(c.Limit, c.Until))

    # if we want to use tor because we're getting locked out
    if use_tor_channel:
        # **optionally** route traffic through Tor
        # just start up the main Tor browser before running this
        print("using Tor")
        c.Proxy_host = "127.0.0.1"
        c.Proxy_port = 9150
        c.Proxy_type = "socks5"

    # let's create a file name in a subdirectory
    target_file_name = "./" + city_name.lower() + "_geosearch"

    if not os.path.exists(target_file_name):
        os.makedirs(target_file_name)

    target_file_name = target_file_name + "/" + str(which_year) + "_geo.csv"
    
    # here is where we call the twint operation defined in the twitter_scrape_by_twint notebook
    twitter_scrape_by_twint.twitter_scrape_given_twint_config(c, target_date_min, target_date_max, target_file_name, city_name)

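The city-to-coordinates `if`/`elif` chain in `twitter_scrape_by_geo_year` could also be written as a dictionary lookup, which makes adding a city a one-line change. A minimal sketch, with the coordinates and 75km radius copied from the function above; the names `CITY_CENTERS` and `geo_string_for` are hypothetical and not part of the original notebook:

```python
# Latitude,longitude pairs copied from the if/elif chain in twitter_scrape_by_geo_year.
# CITY_CENTERS and geo_string_for are illustrative names, not part of twint.
CITY_CENTERS = {
    "niamey": "13.5234,2.1167",
    "agadez": "16.9701,7.9856",
    "tillaberi": "14.2589,1.4671",
    "tahoua": "14.8939,5.2639",
    "dosso": "13.179,3.2071",
    "zinder": "13.804,8.9886",
    "maradi": "13.496,7.1081",
}


def geo_string_for(city_name, radius_km=75):
    """Build a 'lat,lon,radius' string in the format assigned to c.Geo."""
    key = city_name.lower()
    if key not in CITY_CENTERS:
        raise ValueError("city '{}' is not recognized".format(city_name))
    return "{},{}km".format(CITY_CENTERS[key], radius_km)


print(geo_string_for("Agadez"))
```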

# Main run block
Run the block below to actually conduct the scraping and saving. Generally this will be done one city at a time, with the provided loop running through multiple years in one go. Then manually change the city and/or year range and run again. Each city and year produces one csv file. Don't do this too fast or Twitter may refuse, throttle, or alter results.

These runs can take a while.  Short pauses are built in to reduce chances of Twitter refusal.  Hopefully you only need to run this once, then run once a year after that to get a series of data for later analysis.
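
The built-in pauses mentioned above presumably live in the companion notebook. If Twitter still cuts you off between years, one option is an extra pause between per-year calls. The following is only a sketch of that idea: the helper name `run_years_with_pauses`, its parameters, and the 60 second default are all assumptions, not something twint or the companion notebook provides.

```python
import time


def run_years_with_pauses(years, scrape_one_year, pause_seconds=60, sleep=time.sleep):
    """Call scrape_one_year(year) for each year, sleeping between consecutive calls."""
    completed = []
    for index, year in enumerate(years):
        if index > 0:
            # pause before every call except the first one
            sleep(pause_seconds)
        scrape_one_year(year)
        completed.append(year)
    return completed
```

With the function defined in this notebook, a call might look like `run_years_with_pauses(range(2010, 2021), lambda y: twitter_scrape_by_geo_year("agadez", y, False))`. The `sleep` parameter exists only so the pause can be swapped out when trying the helper without waiting.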

In [None]:

target_city = "agadez"      # options: "niamey" "agadez" "tillaberi" "tahoua" "dosso" "zinder" "maradi"
use_tor = False 

# start at 2010 (there's nothing before that), going up to 2020 in this example
for target_year in range(2010, 2020 + 1, 1):
    
    print("calling scrape for year {}".format(target_year))

    twitter_scrape_by_geo_year(target_city, target_year, use_tor)
    
    print("done scraping year {}".format(target_year))
