<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____
<a id='home'></a>

# Data collecting

1. [The GitHub Repo](#part1) 
2. [Uploading files](#part2)
3. [APIs](#part3) 
4. [Scraping](#part4) 
5. [Social media](#part5) 

____
<a id='part1'></a>

## 1. The GitHub Repo

**Where is the Data?**
**Where is the Code?**

We use GitHub to organize our work. It changes the way you access your code and data, taking them beyond your local machine; it is an easy way of experiencing the cloud. Follow these steps now:

1. Go to [github.com](https://github.com/), and sign up. Remember your _username_ and _password_.
2. Install the GitHub *desktop app* on your computer. It is available [here](https://desktop.github.com/). Sign in using your _username_ and _password_ from the GitHub website.
3. Once you are signed in, go back to the GitHub website and create a _repository_. Choose a name. **DO NOT forget** to select the option to add a README file, and choose a **LICENSE** too.
4. **Clone** the repository you just created onto your computer. You can select where to clone the repository, but avoid cloning into another app's folder (e.g. Dropbox).


After those steps, let's open the data file that you find [here](https://drive.google.com/drive/folders/1uH6S-8rns8THDnBRCy3OGLEki9LjyTCL?usp=sharing). Once you open the file, you will see something like this:


In [None]:
from IPython.display import IFrame  
wikiLink="https://docs.google.com/spreadsheets/d/e/2PACX-1vRXdVAxZTnQ6N7bI1xJ_XRQSoG-FsiucGI_fsyBuKCi6TcO3guGB_-6nk4i6so7SG__eIpfS0o8pUuZ/pubhtml?widget=true&headers=false"
IFrame(wikiLink, width=700, height=300)

Understanding what you see is not straightforward:

* What do these data represent?
* What does each column represent?
* What possible mistakes can appear in the data values?
* What variables will be useful for my hypotheses and goals?

If this is the first time you run into a data set like this, you first need to **read** the documentation of the data (available in the same folder).

At this stage, you can download this data from GoogleDocs into the folder cloned from GitHub (now in your local machine). Download it first as a CSV file into your computer. 

Now, open the GitHub Desktop app. You will see that the app is telling you that some changes have occurred in your local folder cloned from GitHub. Now, let's **Commit** and **Push**; that will synchronize the contents of your local repo and your cloud repo.

In *my own* repo, it looks like [this](https://github.com/EvansDataScience/data/blob/master/HSBfromGoogle.csv).

GitHub allows you to see the data contents (it is not always possible). Now, find the icon **raw** and copy its link address (**do not** copy the URL of the page). This is mine:

https://github.com/EvansDataScience/data/raw/master/HSBfromGoogle.csv
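
The page address and the raw link differ in one path segment: the page address contains `/blob/`, while the link we will read from Python contains `/raw/`. A minimal sketch of that conversion, assuming the repository follows the usual GitHub address pattern (the helper name `toRawLink` is hypothetical, not part of any library):

```python
def toRawLink(pageLink):
    # swap the first '/blob/' segment for '/raw/' (assumed GitHub pattern)
    return pageLink.replace('/blob/', '/raw/', 1)

pageLink = 'https://github.com/EvansDataScience/data/blob/master/HSBfromGoogle.csv'
rawLink = toRawLink(pageLink)
print(rawLink)
```

Copying the link from the **raw** icon remains the safest route; the helper only illustrates what changes.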

Now let's get ready for Python.

[home](#home)

____
<a id='part2'></a>

## 2. Uploading files

The file we are planning to read into Python is a data table. Python needs support from one of its libraries to deal with data tables: **PANDAS**

* Check if you already have Pandas.
* If you do not have it, install it.

In [1]:
# do I have it?
!pip show pandas

Name: pandas
Version: 1.2.3
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: 
Author-email: 
License: BSD
Location: /Users/JoseManuel/opt/anaconda3/envs/GovAnalytics/lib/python3.7/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: dtale, geopandas, itables, mapclassify, mizani, plotnine, pyreadstat, seaborn, statsmodels, xarray


In [None]:
# if you do not have it:
#!pip install pandas

Now that I have it, I can read in the data, so that I can work on it:

In [2]:
# call pandas
import pandas as pd # 'pd' is a nickname

# use a function from pandas to read the cloud data into a Python object
gitCloudRepo='https://github.com/EvansDataScience/data/raw/master/'
fileName="HSBfromGoogle.csv"
DFcsv=pd.read_csv(gitCloudRepo + fileName)

If Python has not **complained** (that is, you got no error messages), you are good to continue!

The **DFcsv** Python object holds the data:

In [3]:
DFcsv

Unnamed: 0,id,sex,race,ses,sctyp,hsp,locus,concpt,mot,car,rdg,wrtg,math,sci,civ
0,1,Female,Asian,Low,Public,Vocational/Technical,0.29,0.88,0.67,Professional_2,33.6,43.7,40.2,39.0,40.6
1,2,Male,Asian,Low,Public,Academic preparatory,-0.42,0.03,0.33,Craftsman,46.9,35.9,41.9,36.3,45.6
2,3,Female,Asian,Low,Public,Academic preparatory,0.71,0.03,0.67,Professional_1,41.6,59.3,41.9,44.4,45.6
3,4,Female,Asian,Medium,Public,Vocational/Technical,0.06,0.03,0.00,Service,38.9,41.1,32.7,41.7,40.6
4,5,Female,Asian,Medium,Public,Vocational/Technical,0.22,-0.28,0.00,Clerical,36.3,48.9,39.5,41.7,45.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,596,Male,Hispanic,High,Private,General,0.27,-1.05,0.33,Professional_1,60.1,54.1,56.3,55.3,55.6
596,597,Male,Hispanic,Medium,Private,General,0.55,0.34,1.00,Technical,62.7,61.9,72.9,63.4,55.6
597,598,Male,Hispanic,Medium,Private,General,-0.61,0.34,0.67,Technical,68.0,54.1,74.6,66.1,65.5
598,599,Female,Hispanic,Medium,Private,General,0.23,0.94,1.00,Professional_2,57.4,67.1,57.9,60.7,55.6


This object is of a particular type:

In [4]:
type(DFcsv)

pandas.core.frame.DataFrame

The type is **Data Frame** (DF). We will see several functions that can be applied to DFs. Let me show you some basic ones.

In [5]:
# dimensions (rows,columns)
DFcsv.shape

(600, 15)

In [6]:
# top / bottom
DFcsv.head(10) #tail()

Unnamed: 0,id,sex,race,ses,sctyp,hsp,locus,concpt,mot,car,rdg,wrtg,math,sci,civ
0,1,Female,Asian,Low,Public,Vocational/Technical,0.29,0.88,0.67,Professional_2,33.6,43.7,40.2,39.0,40.6
1,2,Male,Asian,Low,Public,Academic preparatory,-0.42,0.03,0.33,Craftsman,46.9,35.9,41.9,36.3,45.6
2,3,Female,Asian,Low,Public,Academic preparatory,0.71,0.03,0.67,Professional_1,41.6,59.3,41.9,44.4,45.6
3,4,Female,Asian,Medium,Public,Vocational/Technical,0.06,0.03,0.0,Service,38.9,41.1,32.7,41.7,40.6
4,5,Female,Asian,Medium,Public,Vocational/Technical,0.22,-0.28,0.0,Clerical,36.3,48.9,39.5,41.7,45.6
5,6,Male,Asian,Medium,Public,Academic preparatory,0.46,0.03,0.0,Proprietor,49.5,46.3,46.2,41.7,35.6
6,7,Male,Asian,Low,Public,General,0.44,-0.47,0.33,Professional_2,62.7,64.5,48.0,63.4,55.6
7,8,Female,Asian,Low,Public,General,0.68,0.25,1.0,Professional_1,44.2,51.5,36.9,49.8,55.6
8,9,Male,Asian,Medium,Public,General,0.06,0.56,0.33,Professional_1,46.9,41.1,45.3,47.1,55.6
9,10,Female,Asian,Low,Public,General,0.05,0.15,1.0,Proprietor,44.2,49.5,40.5,39.0,50.6


In [7]:
# column names
DFcsv.columns

Index(['id', 'sex', 'race', 'ses', 'sctyp', 'hsp', 'locus', 'concpt', 'mot',
       'car', 'rdg', 'wrtg', 'math', 'sci', 'civ'],
      dtype='object')

In [8]:
# access by index position
DFcsv.iloc[:,4]

0       Public
1       Public
2       Public
3       Public
4       Public
        ...   
595    Private
596    Private
597    Private
598    Private
599    Private
Name: sctyp, Length: 600, dtype: object

In [9]:
# access by index names
DFcsv.loc[:,'sctyp']

0       Public
1       Public
2       Public
3       Public
4       Public
        ...   
595    Private
596    Private
597    Private
598    Private
599    Private
Name: sctyp, Length: 600, dtype: object

In [10]:
# subdata frame
DFcsv[['sctyp']]

Unnamed: 0,sctyp
0,Public
1,Public
2,Public
3,Public
4,Public
...,...
595,Private
596,Private
597,Private
598,Private
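
The last three cells look alike but do not return the same kind of object. A small sketch with a made-up table (the values are illustrative, not the course data) shows the difference between single and double brackets:

```python
import pandas as pd

# a toy table, only for illustration
toyDF = pd.DataFrame({'id': [1, 2, 3],
                      'sctyp': ['Public', 'Public', 'Private']})

asSeries = toyDF['sctyp']      # single brackets: one column as a Series
asFrame = toyDF[['sctyp']]     # double brackets: a one-column DataFrame

print(type(asSeries).__name__, asSeries.shape)
print(type(asFrame).__name__, asFrame.shape)
```

The `iloc` and `loc` calls with a single column also give a Series; the double brackets keep the result as a data frame.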


And the most important for future sessions:

In [12]:
DFcsv.info() # the data types Python has assigned

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      600 non-null    int64  
 1   sex     600 non-null    object 
 2   race    600 non-null    object 
 3   ses     600 non-null    object 
 4   sctyp   600 non-null    object 
 5   hsp     600 non-null    object 
 6   locus   600 non-null    float64
 7   concpt  600 non-null    float64
 8   mot     600 non-null    float64
 9   car     600 non-null    object 
 10  rdg     600 non-null    float64
 11  wrtg    600 non-null    float64
 12  math    600 non-null    float64
 13  sci     600 non-null    float64
 14  civ     600 non-null    float64
dtypes: float64(8), int64(1), object(6)
memory usage: 70.4+ KB


Python will not care about the original file format (CSV, Excel, etc.): once a file is read into Python with pandas, you will have a DF.

**Proprietary Software**

Several times, you may find that you are given a file that was previously prepared with proprietary software. The most common in the policy field are:

* SPSS (file extension: **sav**).
* STATA (file extension: **dta**).
* EXCEL (file extension: **xlsx** or **xls**).

Getting these files up and running might not bring many preprocessing challenges. However, you may need different levels of effort to read them from **GitHub**: they will not be as easy to open as a CSV.

Download [these files](https://drive.google.com/drive/folders/1XxTztY6rFkGwbR7wUtO_xiD4xvBrFqJN?usp=share_link) into the repository where your CSV file is currently stored in your local machine, then commit and push.

**Exercise:**
    
Open the other files (excel, spss, stata). 

**TIPS**

Check these functions: 
* For STATA: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_stata.html
* For EXCEL: 
    - https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
    - https://openpyxl.readthedocs.io/en/stable/
    - https://xlrd.readthedocs.io/en/latest/
* For SPSS:
    - https://pandas.pydata.org/docs/reference/api/pandas.read_spss.html
    - https://github.com/Roche/pyreadstat

Opening SPSS is more challenging.


Check if all the functions work well with the DFs created. Verify if the results are the same.
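
One way to approach the verification is a round trip: save a small table in two formats, read both back, and compare. The sketch below is only an illustration with a made-up table and temporary files (none of the names come from the course repo). It uses STATA because pandas can write and read that format on its own; `pd.read_excel` and `pd.read_spss` follow the same idea but need the extra libraries linked above (openpyxl and pyreadstat).

```python
import os
import tempfile
import pandas as pd

# a tiny stand-in table (made-up values, not the course data)
small = pd.DataFrame({'id': [1, 2, 3],
                      'sex': ['Female', 'Male', 'Female'],
                      'math': [40.2, 41.9, 41.9]})

with tempfile.TemporaryDirectory() as folder:
    csvPath = os.path.join(folder, 'small.csv')
    dtaPath = os.path.join(folder, 'small.dta')
    small.to_csv(csvPath, index=False)
    small.to_stata(dtaPath, write_index=False)

    fromCSV = pd.read_csv(csvPath)
    fromStata = pd.read_stata(dtaPath)

# compare structure first, then the values column by column
sameShape = fromCSV.shape == fromStata.shape
sameColumns = list(fromCSV.columns) == list(fromStata.columns)
valuesMatch = all(fromCSV[c].tolist() == fromStata[c].tolist()
                  for c in fromCSV.columns)
print(sameShape, sameColumns, valuesMatch)

# the data types are worth a separate look: they may differ by format
print(fromCSV.dtypes)
print(fromStata.dtypes)
```

Even when the values agree, the data types reported by each reader may not be identical, so it is worth checking `info()` on every DF you create.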

[home](#home)

____

<a id='part3'></a>

## 3. Collecting data from APIs

Open data portals from the government and other organizations have APIs, a service that allows you to collect their data. Let's take a look at Seattle data on [Seattle Real Time Fire 911 Calls](https://data.seattle.gov/Public-Safety/Seattle-Real-Time-Fire-911-Calls/kzjm-xkqj).

That page tells you how to get the data into pandas. But first, you need to install **sodapy**. Then you can continue:

In [None]:
#!pip install sodapy

Let's follow some steps, according to the API:

In [14]:
from sodapy import Socrata

# Unauthenticated client (using 'None')

client = Socrata("data.seattle.gov", None)

# If you have credentials:
# client = Socrata("data.seattle.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)



You can see the results now:

In [15]:
results_df

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,:@computed_region_ru88_fbhk,:@computed_region_kuhn_3gp2,:@computed_region_q256_3sug,:@computed_region_2day_rhn5,:@computed_region_cyqu_gs94
0,Sw Roxbury St / 35th Ave Sw,Aid Response,2022-12-20T17:44:00.000,47.517488,-122.376885,"{'type': 'Point', 'coordinates': [-122.376885,...",F220154922,18,51,19581,,
1,4707 12TH AVE NE,Illegal Burn,2022-12-20T17:27:00.000,47.663142,-122.315263,"{'type': 'Point', 'coordinates': [-122.315263,...",F220154919,60,38,18383,,
2,4755 Fauntleroy Way Sw,Medic Response,2022-12-20T17:25:00.000,47.560074,-122.381457,"{'type': 'Point', 'coordinates': [-122.381457,...",F220154917,1,50,19581,,
3,125 Boren Ave S,Aid Response,2022-12-20T17:24:00.000,47.601549,-122.317677,"{'type': 'Point', 'coordinates': [-122.317677,...",F220154916,26,16,18379,,
4,11 W Aloha St,Aid Response,2022-12-20T17:22:00.000,47.627192,-122.356823,"{'type': 'Point', 'coordinates': [-122.356823,...",F220154915,50,40,19575,,
...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1700 7th Ave,Aid Response,2022-12-14T15:20:00.000,47.613098,-122.334718,"{'type': 'Point', 'coordinates': [-122.334718,...",F220152260,14,31,18081,,
1996,2121 26th Ave S,Low Acuity Response,2022-12-14T15:20:00.000,47.584234,-122.299024,"{'type': 'Point', 'coordinates': [-122.299024,...",F220152259,38,42,17919,,
1997,809 S Washington St,Medic Response- 6 per Rule,2022-12-14T15:20:00.000,47.601589,-122.322427,"{'type': 'Point', 'coordinates': [-122.322427,...",F220152261,26,16,18379,,
1998,4th Ave / Pine St,Aid Response,2022-12-14T15:14:00.000,47.611207,-122.337592,"{'type': 'Point', 'coordinates': [-122.337592,...",F220152258,14,24,18081,,


Data from APIs may need more preprocessing than the previous files. Besides, you should study the API documentation to know how to interact with the portal. Not every open data portal behaves the same.
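
A common first preprocessing step is fixing data types, since API fields often arrive as text. The sketch below assumes that is the case here (worth confirming with `results_df.info()`); it uses two hand-typed records shaped like the output above instead of calling the portal, and the name `demo_df` is just for this example.

```python
import pandas as pd

# two records shaped like the API output above, with every value as text
# (an assumption about how the portal delivers them)
records = [{'address': 'Sw Roxbury St / 35th Ave Sw', 'type': 'Aid Response',
            'datetime': '2022-12-20T17:44:00.000',
            'latitude': '47.517488', 'longitude': '-122.376885'},
           {'address': '4707 12TH AVE NE', 'type': 'Illegal Burn',
            'datetime': '2022-12-20T17:27:00.000',
            'latitude': '47.663142', 'longitude': '-122.315263'}]
demo_df = pd.DataFrame.from_records(records)

# text to timestamps
demo_df['datetime'] = pd.to_datetime(demo_df['datetime'])

# text to numbers
for col in ['latitude', 'longitude']:
    demo_df[col] = pd.to_numeric(demo_df[col])

print(demo_df.dtypes)
```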

[home](#home)
____
<a id='part4'></a>

## 4. Scraping

Sometimes you are interested in data from the web. Let me get a table from Wikipedia:

In [16]:
from IPython.display import IFrame  
wikiLink="https://en.wikipedia.org/wiki/Democracy_Index" 
IFrame(wikiLink, width=700, height=300)

I will use pandas to get the table, but you need to install these libraries first:
* html5lib 
* beautifulsoup4
* lxml

In [19]:
DFwiki=pd.read_html(wikiLink,header=0,flavor='bs4',attrs={'class': 'wikitable'})

Pandas has the function **read_html**, which saves a lot of coding. Above, I just gave it:
* The link to the webpage.
* The position of the header.
* The external library that will be used to extract the text (_flavor_).
* The attributes of the table.

DFwiki is not a data frame:

In [18]:
type(DFwiki)

list

The command **read_html** returned all the tables with the attribute _wikitable_. Since there may be more than one, a **list** of tables is returned. Lists are the most flexible container offered by Python:

In [20]:
aList=[1,2,'a','*']
aList

[1, 2, 'a', '*']

Lists have several interesting functions and properties:

In [21]:
# amount of elements
len(aList)

4

In [22]:
# get element
aList[2]

'a'

In [23]:
# replace
aList[3]="**"
aList

[1, 2, 'a', '**']

In [24]:
# add element
aList.append(4)
aList

[1, 2, 'a', '**', 4]

In [25]:
# erase element
del aList[0]
aList

[2, 'a', '**', 4]

And the nicest of all: **list comprehension**

In [26]:
# list of the squares of the first five non-negative integers (0 to 4)
easyList_1=[x**2 for x in range(5)]
easyList_1

[0, 1, 4, 9, 16]

In [27]:
# list of multiples of 5 smaller than 50
easyList_2=[x for x in range(50) if x%5==0 and x>0]
easyList_2

[5, 10, 15, 20, 25, 30, 35, 40, 45]

In [28]:
someNames=["Peter","John","Rob", "Ron", "Mike"]
easyList_3=[x for x in someNames if not x.startswith('R')]
easyList_3

['Peter', 'John', 'Mike']

In [31]:
import math

someNumbers=[-1,4,6,-3]
easyList_4a=[math.pow(x,2) for x in someNumbers ]
easyList_4a

[1.0, 16.0, 36.0, 9.0]

In [32]:
# when using 'else', write 'for' at the end
easyList_4b=[math.sqrt(x) if x >=0 else None for x in someNumbers]
easyList_4b

[None, 2.0, 2.449489742783178, None]

Coming back to our example from Wikipedia, we should first check how many DFs we have:

In [33]:
len(DFwiki)

5

Is ours the first one?

In [34]:
# remember that Python starts counting from ZERO!
DFwiki[0]

Unnamed: 0,Type of regime,Score,Countries,Countries.1,Proportion ofWorld population (%)
0,Type of regime,Score,Number,(%),Proportion ofWorld population (%)
1,Full democracies,9.01–10.00 8.01–9.00,21,12.6%,6.4%
2,Flawed democracies,7.01–8.00 6.01–7.00,53,31.7%,39.3%
3,Hybrid regimes,5.01–6.00 4.01–5.00,34,20.4%,17.2%
4,Authoritarian regimes,3.01–4.00 0–3.00,59,35.3%,37.1%


or the last one?

In [35]:
DFwiki[4]

Unnamed: 0,Rank,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank,Country,Regime type,Overall score,Δ Score,Elec­toral pro­cessand plura­lism,Func­tioningof govern­ment,Poli­ticalpartici­pation,Poli­ticalcul­ture,Civilliber­ties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,1,,Norway,Full democracy,9.75,0.06,10.00,9.64,10.00,10.00,9.12
2,2,2,New Zealand,Full democracy,9.37,0.12,10.00,8.93,9.44,8.75,9.71
3,3,3,Finland,Full democracy,9.27,0.07,10.00,9.29,8.89,8.75,9.41
4,4,1,Sweden,Full democracy,9.26,,9.58,9.29,8.33,10.00,9.12
...,...,...,...,...,...,...,...,...,...,...,...
166,162,2,Central African Republic,Authoritarian,1.43,0.11,1.25,0.00,1.67,1.88,2.35
167,164,2,Democratic Republic of the Congo,Authoritarian,1.40,0.27,0.75,0.00,2.22,3.13,0.88
168,165,2,North Korea,Authoritarian,1.08,,0.00,2.50,1.67,1.25,0.00
169,166,31,Myanmar,Authoritarian,1.02,2.04,0.00,0.00,1.67,3.13,0.29


Scraped tables will bring different cleaning challenges.
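
The last table shows two such challenges in its column names: a piece of CSS text glued to the front of one name, and invisible soft hyphens inside others. A minimal sketch of one possible cleaning step follows; the helper `clean_header` is hypothetical, and it assumes the unwanted CSS always ends with a closing brace.

```python
import re

def clean_header(name):
    # drop any CSS text that ends with '}' at the start of the name
    name = re.sub(r'^.*\}', '', name)
    # drop soft hyphens (invisible characters used for line breaking)
    name = name.replace('\u00ad', '')
    return name.strip()

rawNames = ['.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank',
            'Poli\u00adticalcul\u00adture',
            'Country']
cleanNames = [clean_header(n) for n in rawNames]
print(cleanNames)
```

On the real table, the same list comprehension could be applied to `DFwiki[4].columns`. Words that were glued together by the scraping would still need a separate fix.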

[home](#home)
____
<a id='part5'></a>

## 5. Social media data

Social media platforms also offer APIs that allow you to get _some_ data. To use this service, you need to register as a developer. For our Twitter example, you should go [here](https://developer.twitter.com/en).
Once you are a confirmed developer, Twitter, Facebook and others will allow you to get _some_ of their data (the more you pay, the more you get). 

Let's pay attention to Twitter. First, check if you have **tweepy**:

In [36]:
!pip show tweepy

Name: tweepy
Version: 4.12.1
Summary: Twitter library for Python
Home-page: https://www.tweepy.org/
Author: Joshua Roesslein
Author-email: tweepy@googlegroups.com
License: MIT
Location: /Users/JoseManuel/opt/anaconda3/envs/GovAnalytics/lib/python3.7/site-packages
Requires: oauthlib, requests, requests-oauthlib
Required-by: 


In [None]:
#!pip install tweepy

Tweepy is the key library, but you may need several other libraries according to your goals.

In [37]:
import tweepy

Let me introduce myself to Twitter:

In [38]:
# credentials from a file
import json
with open('APIkeys.txt','r') as keysFile:
    keysAPI = json.load(keysFile)

# getting info from the file
api_key = keysAPI['api_key']
api_key_secret = keysAPI['api_key_secret']
access_token = keysAPI['access_token']
access_token_secret = keysAPI['access_token_secret']

# introducing myself:
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api=tweepy.API(auth, wait_on_rate_limit=True,timeout=60,
               parser=tweepy.parsers.JSONParser())
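
The cell above assumes that `APIkeys.txt` holds JSON text with four keys. A sketch of what such a file could look like is below; the values are placeholders, the file is written to a temporary folder only for illustration, and the names `fakeKeys` and `keysBack` are made up for this example.

```python
import json
import os
import tempfile

# placeholder values only; real keys come from the developer portal
fakeKeys = {'api_key': 'xxxx',
            'api_key_secret': 'xxxx',
            'access_token': 'xxxx',
            'access_token_secret': 'xxxx'}

with tempfile.TemporaryDirectory() as folder:
    keysPath = os.path.join(folder, 'APIkeys.txt')
    # write the dict as JSON text
    with open(keysPath, 'w') as keysFile:
        json.dump(fakeKeys, keysFile)
    # read it back the same way the cell above does
    with open(keysPath, 'r') as keysFile:
        keysBack = json.load(keysFile)

print(sorted(keysBack.keys()))
```

Keeping the real file out of your GitHub repo is a sensible precaution, since anyone with those keys can act on your behalf.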

Let me ask for some tweets from a particular user:

In [43]:
who='@WAStateGov'
howMany=100
gottenTweets = api.user_timeline(screen_name = who, 
                                 count = howMany, 
                                 include_rts = True,
                                 tweet_mode="extended")

In the previous cases, I got a table (a data frame). You should always check what you have:

In [44]:
type(gottenTweets)

list

I have a list, so I can ask how many tweets I got (just to confirm):

In [45]:
len(gottenTweets)

100

Let me view what I have in the first tweet:

In [46]:
gottenTweets[0]

{'created_at': 'Tue Dec 20 23:48:03 +0000 2022',
 'id': 1605349339844710400,
 'id_str': '1605349339844710400',
 'full_text': 'RT @WSRCO: This holiday season, @GovInslee is giving hope for endangered salmon in his proposed budget, which includes more than $872 milli…',
 'truncated': False,
 'display_text_range': [0, 140],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'WSRCO',
    'name': 'WA State Recreation and Conservation Office',
    'id': 793542198025302016,
    'id_str': '793542198025302016',
    'indices': [3, 9]},
   {'screen_name': 'GovInslee',
    'name': 'Governor Jay Inslee',
    'id': 1077214808,
    'id_str': '1077214808',
    'indices': [32, 42]}],
  'urls': []},
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id'

It will take some time to become familiar with a [tweet object structure](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet). Let's find out how each tweet is currently stored:

In [47]:
type(gottenTweets[0])

dict

Now you know that each tweet is stored as a **dictionary**. 
Dictionaries or **dicts** are very flexible and important structures in Python. Let me show you a simple one: 

In [48]:
aDictExample={"name":"Peter",
             "speaks":['French', 'Spanish'],
             'country':'Morocco'}

# then
aDictExample

{'name': 'Peter', 'speaks': ['French', 'Spanish'], 'country': 'Morocco'}

Dicts are a basic structure in Python, and one that makes Python very appealing. Each element in a dict can be accessed via its **key**. Our _aDictExample_ has these keys:

In [49]:
aDictExample.keys()

dict_keys(['name', 'speaks', 'country'])

So, you access the info like this:

In [50]:
aDictExample['speaks']

['French', 'Spanish']

Then, let's see our **gottenTweets** keys:

In [51]:
gottenTweets[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])

In [52]:
# then:
gottenTweets[0]['created_at']

'Tue Dec 20 23:48:03 +0000 2022'

In [53]:
gottenTweets[1]['created_at']

'Tue Dec 20 14:40:02 +0000 2022'
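
Some values in a tweet are themselves dicts or lists, so reaching them takes more than one step. The sketch below uses a trimmed-down, hand-typed version of the first tweet (only a few of its fields) to show nested access, plus `.get()` for keys that may be missing.

```python
# a trimmed-down tweet, keeping only some fields shown in the output above
oneTweet = {'created_at': 'Tue Dec 20 23:48:03 +0000 2022',
            'entities': {'hashtags': [],
                         'user_mentions': [{'screen_name': 'WSRCO'},
                                           {'screen_name': 'GovInslee'}]}}

# walk down: tweet -> 'entities' -> 'user_mentions' -> each 'screen_name'
mentions = [m['screen_name'] for m in oneTweet['entities']['user_mentions']]

# .get() returns a default instead of raising KeyError when a key is absent
place = oneTweet.get('place', 'no place given')

print(mentions, place)
```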

We could prepare a data frame using the current tweets. First, let's prepare a list for each of the fields we want:

In [54]:
# list comprehensions
dates=[t['created_at'] for t in gottenTweets]
ids=[t['id'] for t in gottenTweets]
rts=[t['retweet_count'] for t in gottenTweets]
likes=[t['favorite_count'] for t in gottenTweets]
text=[t['full_text'] for t in gottenTweets]
rtw=[t['full_text'].startswith('RT') for t in gottenTweets]

Each of the objects created is a list (dates, ids, rts, likes, text and rtw). Let me show you one:

In [55]:
rtw

[True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True]
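
Checking whether the text starts with 'RT' is a shortcut. The keys listed earlier include `retweeted_status`, and an alternative is to test for that key itself. This assumes the key is present only in retweets, which is worth confirming in the tweet object documentation. A sketch with made-up tweets:

```python
# made-up tweets, only for illustration
toyTweets = [{'full_text': 'RT @someone: hello', 'retweeted_status': {'id': 1}},
             {'full_text': 'RT if you agree!'},       # original text that starts with 'RT'
             {'full_text': 'Press conference soon'}]

# the text-based check used above
byText = [t['full_text'].startswith('RT') for t in toyTweets]

# the key-based check
byKey = ['retweeted_status' in t for t in toyTweets]

print(byText)
print(byKey)
```

The two checks disagree on the second toy tweet, which is the kind of case where the text-based shortcut can mislead.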

Let me create a data frame with those lists:

In [59]:
dictOfColsAsLists={'dates':dates,'ids':ids,'rts':rts,
                   'likes':likes,'text':text,'rtw':rtw}
tweetsAsDF=pd.DataFrame(dictOfColsAsLists)

In [60]:
tweetsAsDF

Unnamed: 0,dates,ids,rts,likes,text,rtw
0,Tue Dec 20 23:48:03 +0000 2022,1605349339844710400,5,0,"RT @WSRCO: This holiday season, @GovInslee is ...",True
1,Tue Dec 20 14:40:02 +0000 2022,1605211427249483776,1,0,"RT @WAStateDOR: In November, more than 9 milli...",True
2,Mon Dec 19 20:26:32 +0000 2022,1604936241304469504,0,3,Gun violence is preventable. In addition to ot...,False
3,Mon Dec 19 19:37:22 +0000 2022,1604923866786463744,1,7,"Oh, @waDNR must have some thoughts! https://t....",False
4,Mon Dec 19 17:46:19 +0000 2022,1604895921367830528,1,1,Press conference starting soon… #waleg https:/...,False
...,...,...,...,...,...,...
95,Wed May 20 15:31:53 +0000 2020,1263130333240676354,249,0,RT @GovInslee: NEW: 10 more counties are eligi...,True
96,Mon May 18 20:03:29 +0000 2020,1262473905844838400,43,0,RT @GovInslee: WATCH: Today I’m talking about ...,True
97,Thu May 14 21:03:32 +0000 2020,1261039466178703360,27,0,RT @GovInslee: WATCH: Today I’m talking with r...,True
98,Thu May 07 15:41:07 +0000 2020,1258421613558509569,143,0,"RT @GovInslee: Nurses, CNAs, LPNs and all our ...",True


You can find out how many are retweets and how many are not:

In [61]:
tweetsAsDF[tweetsAsDF['rtw']==False].shape

(60, 6)

In [62]:
tweetsAsDF[tweetsAsDF['rtw']==True].shape

(40, 6)
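
Two follow-up steps may be useful here; both are sketched on a made-up three-row table rather than on the real tweets. First, `value_counts()` gives both counts in one call. Second, the dates are still text, and `pd.to_datetime` can turn them into timestamps; the `format` string below is my reading of the `created_at` values shown above.

```python
import pandas as pd

# a toy version of the tweets table
toyTweetsDF = pd.DataFrame({'dates': ['Tue Dec 20 23:48:03 +0000 2022',
                                      'Mon Dec 19 20:26:32 +0000 2022',
                                      'Wed May 20 15:31:53 +0000 2020'],
                            'rtw': [True, False, True]})

# one call instead of two filters
counts = toyTweetsDF['rtw'].value_counts()

# text dates to real timestamps
toyTweetsDF['dates'] = pd.to_datetime(toyTweetsDF['dates'],
                                      format='%a %b %d %H:%M:%S %z %Y')
years = toyTweetsDF['dates'].dt.year.tolist()

print(counts)
print(years)
```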

[home](#home)