
Learning to Code — Lessons from Twitter
---------

<img src=https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/Twitter-Anaylyze-Data-01.png width=500>


**1. Introduction**

As we have mentioned, in this class our main language for computation will be [Python](http://www.python.org). It was chosen because of its emphasis on **readability and code sharing.** There are plenty of online sources to help you on your journey learning the language, from [cheatsheets](https://www.pythoncheatsheet.org/) to [online tutorials](https://www.coursera.org/learn/python?specialization=python). You will quickly find that the web is a great place to find examples of code to do what you need to do. So, suppose you can't remember how to concatenate two strings...

<img src=https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/cc.jpeg width=500 style="border:1px solid black">

A word of caution: Make sure that when you find an answer on the web, it refers to **Python version 3.** There are two popular versions of the language in use, 2 and 3. This is a good example of eventually needing to move on from whatever technological environment you are used to. Apps have upgrades, your phone's operating system asks to be upgraded, and programming languages evolve. I mean everything can get better, right? We are learning Python version 3.
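
If you are ever unsure which version you are running, here is a small sketch that asks Python itself. It also shows the string concatenation mentioned above. The variable names (`first`, `second`, `phrase`) are made up for illustration.

```python
import sys

# sys.version_info records the version of the Python interpreter running this code
major_version = sys.version_info.major
print("Running Python version", major_version)

# In Python 3, print() is a function, so the parentheses are required.
# Concatenating (joining) two strings uses the + operator.
first = "learn to "
second = "code"
phrase = first + second
print(phrase)
```

If an example you find on the web writes `print` without parentheses, that is a strong hint it was written for Python 2.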

We will program in Python (write and execute Python expressions) using the Jupyter notebook. The name "Jupyter" is a mix of [Julia](http://julialang.org/), [Python](https://www.python.org/) and [R](https://www.r-project.org/), the three languages it originally supported. Programming with the notebook is often referred to as **"literate computing"**: you code a little, have a look, write a little, come up with more ideas, code a little more, write a little more and so on. To support this, there are two kinds of "cells", one for writing and one for programming. Think of it as a modern reporter's notebook.

<img src=https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/notebook.jpeg width=500>

The writing is done in **Markdown cells**. Markdown (as opposed to Mark-up) is a language that lets you write in a plain text editor (in the simplest cases Notepad or TextEdit would work fine too), using simple typographical shorthands to **make text bold** or *to put text in italics*. You can also make lists like

+ Bread
+ Milk
+ Dog food
+ Swiffer pads

These Markdown conventions are then translated into HTML — the upshot being that you don't have to know anything about HTML to create documents that look reasonably good, certainly good enough for your reporting notes. 

Double click in this window to see the "raw" Markdown. Notice that you can still recognize lists and emphasized text from the Markdown additions, and that's the other point of this. Your documents, while written in plain text, make use of typographical conventions that make the document's highlighting understandable even without translation to HTML. That's a good trick! 

You can find [the Markdown description here](http://daringfireball.net/projects/markdown/). As we said in the last class, for Monday please go through the [Markdown Tutorial](http://markdowntutorial.com). 

*To re-render the Markdown in this cell into HTML, click in the cell and hit Shift-Enter to execute the transformation.*

One last note. Many organizations (scientific research labs, journalistic organizations, and so on) are "publishing" dual works — one that goes into an official journal like Science or The New York Times summarizing a set of computations, and then another that is published in the form of a notebook that documents the author's computational work. Here are some examples from science and journalism.

> ["Peeling back the curtain — How the Economist is opening the data behind our reporting."](https://medium.economist.com/peeling-back-the-curtain-487bd3be0c47) In their words "We published these calculations in a Jupyter notebook, a tidy format for breaking scripts into small blocks and annotating them."

>[BuzzFeedNews/everything — An index of all our open-source data, analysis, libraries, tools, and guides.](https://github.com/BuzzFeedNews/everything#data-and-analyses) Data and code for many of their major stories, including ["Shoot Someone In A Major US City, And Odds Are You’ll Get Away With It"](https://www.buzzfeednews.com/article/sarahryley/police-unsolved-shootings?bftw=&utm_term=4ldqpfp#4ldqpfp) and ["How Russia’s Online Trolls Engaged Unsuspecting American Voters — And Sometimes Duped The Media"](https://www.buzzfeednews.com/article/peteraldhous/russia-online-trolls-viral-strategy).

>["The Need for Openness in Data Journalism"](http://nbviewer.jupyter.org/github/brianckeegan/Bechdel/blob/master/Bechdel_test.ipynb) by Brian Keegan. This is a little old, but Keegan makes good points about the benefits of working with a notebook.

>["Why Jupyter is data scientists’ computational notebook of choice,"](https://www.nature.com/articles/d41586-018-07196-1) a recent overview piece in Nature that marks the rise of notebooks this way — "One analysis of the code-sharing site GitHub counted more than 2.5 million public Jupyter notebooks in September 2018, up from 200,000 or so in 2015."  

>["The Architecture of Jupyter — Interactive by design."](http://scisoftdays.org/pdf/2016_slides/perez.pdf) Starting on page 23 of this PDF, Fernando Perez, one of the designers of Jupyter, describes how notebooks have been published alongside papers in journals like Nature and Science and Scientific American, to name a few.

You get the idea. There are an increasing number of projects like this — the fact that you can take someone else's notebook and examine the steps they followed to arrive at their conclusions is, ultimately, an important step toward transparency in data or computational journalism. *The notebooks become objects of coordination.*

**2. Why can't we have nice things?**

Today we are going to branch out from our humble start on Monday and dig more deeply into how we start to build more complex objects into our programming environment and conduct more interesting analyses. In short, we will be learning to code. 

Under the heading of "Why can't we have nice things?": last year the phrase "Learn to Code" and the hashtag `#LearnToCode` were weaponized as a meme directed at journalists as an insult. The meme started in the wake of the layoffs at news outlets around this time last year. [Know Your Meme](https://knowyourmeme.com/memes/learn-to-code) began tracking it a few days later, calling it "an expression used to mock journalists who were laid off from their jobs, encouraging them to learn software development as an alternate career path." 

<img src=https://github.com/computationaljournalism/columbia2019/raw/master/images/tc.jpg width=600>

Previously the hashtag `#LearnToCode` existed (although used at a low level) to promote sincere efforts encouraging people to take up coding, much in the spirit of this course. The "Learn to Code" meme, however, changed things. It evidently found its roots in criticisms of articles about retraining blue collar workers - criticisms portraying the media as out of touch with the working public. 

As it picked up activity, some suggested that Twitter was even suspending accounts directing this meme at fired journalists. 

In [109]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">I am told by a person in the know that tweeting &quot;learn to code&quot; at any recently laid off journalist will be treated as &quot;abusive behavior&quot; and is a violation of Twitter&#39;s Terms of Service</p>&mdash; Jon Levine (@LevineJonathan) <a href="https://twitter.com/LevineJonathan/status/1089905702146060288?ref_src=twsrc%5Etfw">January 28, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

There were reports that the meme was organized on 4chan and later that groups had coordinated on Gab.

In [110]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">btw, if any other journos targeted by layoffs are getting masses of “learn to code” harassment, it was coordinated on 4chan (of course) <a href="https://t.co/DtpinjWhID">pic.twitter.com/DtpinjWhID</a></p>&mdash; Talia Lavin (@chick_in_kiev) <a href="https://twitter.com/chick_in_kiev/status/1088590587731808256?ref_src=twsrc%5Etfw">January 25, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Some "news" reports circulated about the meme (in places you might expect) —
[The Ringer](https://www.theringer.com/tech/2019/1/29/18201695/learn-to-code-twitter-abuse-buzzfeed-journalists), 
[Fox News](https://www.foxnews.com/tech/twitter-fights-harassment-against-fired-journalists-told-to-learn-to-code), 
[Breitbart](https://www.breitbart.com/tech/2019/01/28/twitter-telling-fired-journalists-to-learn-to-code-is-targeted-harassment/), 
and 
[The Daily Wire](https://www.dailywire.com/news/42783/heres-where-learn-code-meme-originated-hint-not-ashe-schow). [Know Your Meme](https://knowyourmeme.com/memes/learn-to-code) indicates that it surfaced again at the end of December 2019 in a speech by Democratic Candidate Joe Biden who was campaigning in New Hampshire.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('VDRK0MyuuIM')

Information campaigns like this one often play out across several kinds of media and wind their way into the "real world". In previous incarnations of this class, we have seen stories organized on Reddit or 4chan or Gab.ai that then appear on Twitter and make their way from obscure accounts to the mainstream media and even into political speeches. Our information ecosystem is complex and the data to examine it comes from various sources. 

One example of the kind of analysis that is done (but we will do vastly better!) comes from sites like [Hoax.ly](http://hoax.ly). It provides an interface to how messages pass between Twitter and the mainstream media. Here is an example of the interface for a Breitbart article on "anchor babies."

<img src=https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/ab1.jpeg width=500 style="border:1px solid black">

And here's a closeup of the spike kicked off by Bill Mitchell.

<img src=https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/ab1.jpeg width=500 style="border:1px solid black">

For completeness, his tweet is here.

In [111]:
%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Anchor Baby Population in U.S. Exceeds One Year of American Births | Breitbart <a href="https://t.co/yTHhInImM1">https://t.co/yTHhInImM1</a></p>&mdash; Bill Mitchell (@mitchellvii) <a href="https://twitter.com/mitchellvii/status/1057405691051364357?ref_src=twsrc%5Etfw">October 30, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Let's take a second and have a look at some recent "claims" featured by Hoaxy and talk about the way computation surfaces in their analyses. Similarly, scan over [Know Your Meme](https://knowyourmeme.com/) and see the kind of work that goes into their narratives around popular memes.

Put notes here...

The point here is that our information ecosystem is complex and it calls for a deep understanding of how ideas flow from one area to another. Through computation you can track this movement and tell better stories. If Twitter and Facebook and other information sources, including news outlets, become your sources, you need computation to assess their veracity.

**3. Review: The basic data types**

As we've said, whether we code using a notebook or some other interface, our basic language will be Python. Python is an **object-oriented language**. Software objects are a kind of programming abstraction, a particular way of organizing information and actions. Software objects try to mimic the notion of objects in the physical world — that means they contain properties or **data** and also might have certain operations or **methods** that you can use to transform the object in some way. Take as an example one of our lovely rolling chairs — it has operations (it can support you when you sit, it has a foldable arm for writing, and it can move around the room) as well as "data" (it has a seat height, a desk height, maybe even an RGB value specifying its lovely seat color).
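
To make "data plus methods" concrete, here is a minimal sketch using a string, one of the built-in objects we have already met. The variable names are ours, chosen just for this illustration.

```python
# A string is an object: its data is the sequence of characters...
text = "finally putting in the effort to learn to code python"

# ...and its methods are operations we call with "dot" notation and parentheses.
shouted = text.upper()                # returns a new string in uppercase
n_to = text.count("to")               # counts non-overlapping occurrences of "to"
starts = text.startswith("finally")   # True or False

print(shouted)
print(n_to, starts)
```

Notice that none of these methods changed `text` itself. Each one handed back a new value.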

**Python has a series of built-in types of objects, meaning certain types of information that are so basic that they are needed by just about every programming exercise you'll attempt**. Which have we covered so far? As a kind of quiz, write some code that creates a variable having each of the data types we've seen so far using this "learn to code" tweet for inspiration. (This is a tweet posted on Sunday at 9:30 NYC time on January 20 of last year, before the phrase took a turn.)

In [112]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">finally putting in the effort to learn to code python and it has been enjoyable and rewarding thus far</p>&mdash; halleJOKEL (@halleJOKEL) <a href="https://twitter.com/halleJOKEL/status/1086994877496332288?ref_src=twsrc%5Etfw">January 20, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [None]:
# put your code here



**4. Combining data**

So far, we have created variables that contain a single value — a number or a string. With the tweet above, we see that we might want to represent an object in the world, say a tweet, as a collection of simple data types. Python provides a few built-in objects that are containers. The first we'll look at is a **dictionary.** 

Think back for a moment to how you used a literal dictionary (Webster's?). Word definitions were referenced by words. Finding a definition meant specifying the word we were after. This is the idea behind a dictionary in Python — store data (**"values"**) according to a name (a word or some kind of **"key"**). The result is a collection of key-value pairs.

For example on January 20, 2019, there were 474 tweets that included the hashtag `#LearnToCode`. We can store these facts in a variable called "activity", say.

In [None]:
activity = {"date":"January 20, 2019", "count":474}

# have a look at what we built
activity

The curly braces (not parentheses!) mean we are creating a dictionary, a set of key-value pairs. The names we give to the data (the word, if you will, we associate with the dictionary entry) are "date" and "count" (on the left of the colons) and the values, the data, are on the right. 

If we want to lookup or access data, we provide a name in square brackets.

In [None]:
activity["count"]

In [None]:
# extract the date



Now, recreate "activity" but add the fact that the 20th of January last year was a Sunday. 

In [None]:
# your code here



**A common mistake for people learning Python is to confuse the parentheses we use to indicate a function (or taking action) like `p.count('RT')` with the square brackets we will use to extract or subset data. They look similar so be careful! 😉**
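
Here is a small side-by-side sketch of the two kinds of brackets. The string `p` is a made-up example, not a real tweet.

```python
# A hypothetical piece of text and the dictionary from above
p = "RT @someone: learn to code RT"
activity = {"date": "January 20, 2019", "count": 474}

# Parentheses: we are calling a method, asking the string to DO something
n_rt = p.count("RT")

# Square brackets: we are looking up data stored under a key
n_tweets = activity["count"]

print(n_rt, n_tweets)
```

The word "count" appears in both lines, but it plays different roles: in the first it is the name of a method, in the second it is a key in our dictionary.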

Now, using the tweet we presented above that occurs just before the "Learn to Code" meme takes hold, create a dictionary that encapsulates as much of the tweet as you can. Call the dictionary "tweet". The HTML code for embedding the tweet also includes some data that isn't directly obvious from the displayed tweet itself. 

In [113]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">finally putting in the effort to learn to code python and it has been enjoyable and rewarding thus far</p>&mdash; halleJOKEL (@halleJOKEL) <a href="https://twitter.com/halleJOKEL/status/1086994877496332288?ref_src=twsrc%5Etfw">January 20, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [None]:
# Put your code here -- 



If you run across a tweet in the wild you will find a lot of data, data that is not obvious here - Twitter bundles a fair amount of "metadata" with each tweet, only some of which is shown in the display above. We can use the [Application Programming Interface or API](https://developer.twitter.com/en/docs/api-reference-index) to pull the full tweet as a dictionary. We can explore a dictionary visually by printing it out (too easy), or we can also ask for the kinds of data it contains with a method of a dictionary called `.keys()`.

In [None]:
tweet = {"created_at": "Sun Jan 20 14:32:28 +0000 2019", "id": 1086994877496332288, "id_str": "1086994877496332288", "full_text": "finally putting in the effort to learn to code python and it has been enjoyable and rewarding thus far", "truncated": False, "display_text_range": [0, 102], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", "in_reply_to_status_id": None, "in_reply_to_status_id_str": None, "in_reply_to_user_id": None, "in_reply_to_user_id_str": None, "in_reply_to_screen_name": None, "user": {"id": 144210472, "id_str": "144210472", "name": "halleJOKEL", "screen_name": "halleJOKEL", "location": "Raleigh, NC", "description": "sports was a mistake", "url": None, "entities": {"description": {"urls": []}}, "protected": False, "followers_count": 273, "friends_count": 482, "listed_count": 6, "created_at": "Sat May 15 16:24:53 +0000 2010", "favourites_count": 2044, "utc_offset": None, "time_zone": None, "geo_enabled": False, "verified": False, "statuses_count": 5736, "lang": None, "contributors_enabled": False, "is_translator": False, "is_translation_enabled": False, "profile_background_color": "131516", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme14/bg.gif", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme14/bg.gif", "profile_background_tile": True, "profile_image_url": "http://pbs.twimg.com/profile_images/1069453101311172608/s8lDcFQT_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1069453101311172608/s8lDcFQT_normal.jpg", "profile_banner_url": "https://pbs.twimg.com/profile_banners/144210472/1536632996", "profile_image_extensions_alt_text": None, "profile_banner_extensions_alt_text": None, "profile_link_color": "000000", "profile_sidebar_border_color": "EEEEEE", "profile_sidebar_fill_color": "EFEFEF", "profile_text_color": "333333", 
"profile_use_background_image": True, "has_extended_profile": True, "default_profile": False, "default_profile_image": False, "can_media_tag": True, "followed_by": False, "following": False, "follow_request_sent": False, "notifications": False, "translator_type": "none"}, "geo": None, "coordinates": None, "place": None, "contributors": None, "is_quote_status": False, "retweet_count": 0, "favorite_count": 4, "favorited": False, "retweeted": False, "lang": "en"}

In [None]:
tweet.keys()

This tells you there are keys or words that are used to reference data. For example, under `created_at` we have the time the tweet was authored (in GMT). We access the information as we did above, providing the key or word to look up.

In [None]:
tweet["created_at"]
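
Some of the values in a tweet are themselves dictionaries, so lookups can be chained. The sketch below uses a cut-down copy of the tweet above (we call it `small_tweet`, a name invented here) so that it can run on its own.

```python
# A small subset of the tweet dictionary shown above
small_tweet = {
    "created_at": "Sun Jan 20 14:32:28 +0000 2019",
    "favorite_count": 4,
    "user": {
        "screen_name": "halleJOKEL",
        "location": "Raleigh, NC",
        "followers_count": 273,
    },
}

# The top-level keys, turned into a list so they print nicely
top_keys = list(small_tweet.keys())

# The value stored under "user" is another dictionary...
user = small_tweet["user"]

# ...so we can follow one lookup with another
screen_name = small_tweet["user"]["screen_name"]

print(top_keys)
print(type(user), screen_name)
```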

Explore a little and tell me about the kinds of "metadata" that are packaged by Twitter with a tweet. If you need some help, consult [Twitter's description of their tweet objects. ](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)

In [None]:
# explore a little here



Twitter's documentation uses some notation that we haven't learned yet. When you come across it, take a note and let's talk about it. What did you find?

It's important to keep in mind that tweets have changed a lot since Twitter first launched. The way replies were handled in the metadata changed in 2007, same with retweets. Verified accounts appeared in 2009, along with geotagging of tweets. Each service change potentially had some impact on the metadata that comes along in a tweet.

The point? In this case, as in many data collection efforts you might be part of, the data definition is not stable but might change over time. The facts we have about each tweet look different in 2020 than they did in 2006.
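
One practical consequence: a key you expect may simply not be there. The sketch below uses two made-up, heavily simplified records (they are not real API output) to show two gentle ways of coping, the `in` operator and the dictionary method `.get()`.

```python
# Two hypothetical records; the older one lacks a key that the newer one carries
early_record = {"id_str": "example-1", "text": "an early tweet"}
recent_record = {"id_str": "example-2", "text": "a recent tweet", "geo_enabled": False}

# The `in` operator asks whether a key exists before we look it up
early_has_geo = "geo_enabled" in early_record
recent_has_geo = "geo_enabled" in recent_record

# .get() returns None (or a default we supply) when the key is missing,
# whereas early_record["geo_enabled"] would stop with a KeyError
early_geo = early_record.get("geo_enabled")
early_geo_default = early_record.get("geo_enabled", "unknown")

print(early_has_geo, recent_has_geo, early_geo, early_geo_default)
```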

**5. From single variables to lists**

It's one thing to store single values (a single number or a single string or a single dictionary), but as we know, we tend to collect a lot of data on different aspects of a person or thing in the world - we might offer a survey to 100 people that consists of 10 questions; or we might record facts about the last 100 of Donald Trump's tweets, including the time he tweeted and the number of retweets each earned; or we might consider all the tweets that promote the hashtag `#LearnToCode`. 

A **list** is another built-in data structure used to group information. As its name suggests, it is simply an **ordered collection** of objects. It has a well-defined first entry, a second entry and a last entry. It can hold different kinds of objects in each position. It is constructed using square brackets [ ] (as opposed to the curly braces for a dictionary).

For the moment we will look at counts of tweets containing "learn to code" or `#LearnToCode` or `#learn2code`. These were pulled from the Twitter API. We will have a lot to say about APIs and data access over time. Each count represents the number of tweets appearing on a single day, where the days range from January 20, 2019 to January 29, 2019.

In [None]:
counts = [474,540,679,970,6279,8412,7448,9209,37595,20250]

print("The type of 'counts' is", type(counts), "and its length is", len(counts))

Note that we've been tricky with `print()`. We have several items to print in one line all separated by commas. Sssssslick! Also, a new "global function" called `len()`. This function returns the number of elements in a list, or its **length.** It is a global function because it can be called meaningfully on a lot of objects. For example, it will also tell you the length or number of characters in a string.

In [None]:
len("learn to code")
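
As a quick sketch of `len()` at work on the three kinds of objects we have met so far (the variable names are ours):

```python
# len() on a string counts characters, spaces included
n_chars = len("learn to code")

# len() on a list counts elements
n_items = len([474, 540, 679])

# len() on a dictionary counts key-value pairs
n_pairs = len({"date": "January 20, 2019", "count": 474})

print(n_chars, n_items, n_pairs)
```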

As an object, a list carries both data as well as methods that you can apply. What kinds of things would you like to be able to do to this type of object? 

*Maybe add new objects to the list? `append()` does that.*

In [None]:
# print out the list of tweet counts

counts

In [None]:
# now add something to the back of the list

counts.append(2767)
counts

The number 2767 is the count for the 30th (the data were pulled on the 30th, but only up until 10am, so the data are not complete). What else would we like to do with a list?

*Maybe sort the list? `sort()` does that.*

In [None]:
counts.sort()
counts
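
One thing to be aware of: `sort()` rearranges the list itself, so the original day-by-day order is gone afterwards. If you want to keep the original, the built-in function `sorted()` hands back a new sorted list and leaves the old one alone. A small sketch, using the counts from above:

```python
# Daily counts in chronological order, January 20 through January 29
counts = [474, 540, 679, 970, 6279, 8412, 7448, 9209, 37595, 20250]

# sorted() builds a NEW list; counts keeps its chronological order
by_size = sorted(counts)
largest = by_size[-1]

# .sort() changes a list in place, so here we sort a copy instead
counts_copy = list(counts)
counts_copy.sort(reverse=True)   # reverse=True puts the largest first

print(counts[-1], largest, counts_copy[0])
```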

As a container object (an object that holds or groups other objects), the most obvious set of operations you would like to perform should involve storing and retrieving data from the list. As we said, a list stores objects in a well-defined order. There is a first, a second, a third, and so on. You access these objects using **an index.**  A small catch: Python refers to positions starting at 0 and not at 1. So the first object has index 0, the second has index 1 and so on. 

In [None]:
# the first element
counts[0]

In [None]:
# the third element
counts[2]

In [None]:
# the sixth element
counts[5]

Sometimes counting places from the back or righthand side of the list is easier. We use negative indices for that.

In [None]:
# the last element -- sneaky, right?
counts[-1]

In [None]:
# the fourth from the right
counts[-4]

We can take out "slices" from a list by asking for not just a single index but a range. The construction `m:n` means starting from index `m` take all the data in a list up to, but not including, the index `n`. So `3:6` means data stored behind indices 3, 4 and 5 (or actual positions 4, 5 and 6 since we count from zero). A slice returns another list containing just the specified objects. 

In [None]:
# Finally, you can pull more than one element with the : symbol to create a 'slice'
print("From the fourth element to the end:", counts[3:], "\n")

In [None]:
print("Up to but not including the third element:", counts[:2], "\n")

In [None]:
print("From the third up to but not including the sixth element:", counts[2:5], "\n")
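
Slices are easier to check on a list whose contents you know by heart. Here is a sketch with the days of the week (a list we made up for the purpose):

```python
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# m:n starts at index m and stops just before index n,
# so 3:6 picks up indices 3, 4 and 5
middle = days[3:6]

# Leaving out m means "from the start"; leaving out n means "to the end"
first_two = days[:2]
from_fourth = days[3:]

# Negative indices work in slices too
last_two = days[-2:]

print(middle, first_two, from_fourth, last_two)
```

A handy check: the slice `m:n` always contains `n - m` elements (as long as both indices fall inside the list).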

**6. Comparing lists and dictionaries**

Next, we will use a list to store data from a tweet by Donald Trump Jr. on the "Learn to Code" meme. In this list we store the time and day he tweeted, the content of the tweet, its ID, the number of people who retweeted it, the number of people who "liked" it, whether it was a retweet and the platform he used to tweet from.

Here's the tweet...

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Could someone explain to me why if I tell my kids to “learn to code” it’s likely sound parenting, but if I told a journalist the same it’s grounds for a <a href="https://twitter.com/Twitter?ref_src=twsrc%5Etfw">@twitter</a> suspension?</p>&mdash; Donald Trump Jr. (@DonaldJTrumpJr) <a href="https://twitter.com/DonaldJTrumpJr/status/1089958848742518785?ref_src=twsrc%5Etfw">January 28, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

... and here is a list representation. How does this compare to the dictionary we saw before? Compare the two ways of storing the same information. What is lost? What is gained?

In [None]:
djt = [
    "Mon Jan 28 18:50:14 +0000 2019",
    "Could someone explain to me why if I tell my kids to 'learn to code' it’s likely sound parenting, but if I told a journalist the same it’s grounds for a @twitter suspension?",
    "1089958848742518785",
    7565,
    31098,
    False,
    "Twitter for iPhone"
]
        
djt

And let's look at how we extract data from this container. We will use the square brackets again — `djt[0]` representing the first item in the list, `djt[2]` representing the third data item stored in the list and so on. Counting from zero rather than 1 is confusing but it will become natural.

In [None]:
print("The list has", len(djt), "elements", "\n")

# the first element in the list has index 0
print("The first element:", djt[0], "\n")

# and the fourth has index 3
print("The fourth element:", djt[3], "\n")

# and the last has index -1 — the negative indices count from the right!
print("The last element:", djt[-1], "\n")
print("The third from the last:", djt[-3], "\n")

# Finally, a slice - remember this returns a new list
print("From the fourth up to but not including the seventh element:", djt[3:6], "\n")
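
To help with the comparison, here is a sketch that stores three of the same facts both ways. The key names are borrowed from the tweet dictionary we explored earlier; the variable names `as_list` and `as_dict` are ours.

```python
# Position carries the meaning: you have to remember that index 1 is retweets
as_list = ["Mon Jan 28 18:50:14 +0000 2019", 7565, 31098]

# Names carry the meaning: each value is labeled by its key
as_dict = {
    "created_at": "Mon Jan 28 18:50:14 +0000 2019",
    "retweet_count": 7565,
    "favorite_count": 31098,
}

retweets_from_list = as_list[1]
retweets_from_dict = as_dict["retweet_count"]

print(retweets_from_list, retweets_from_dict)
```

Both lookups return the same number, but only one of them tells a reader what the number means.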

Just as you can pull data from a list,  you can also change the contents of one or more elements of a list.

In [None]:
djt[0] = 6000
djt

In [None]:
djt[1:4] = [1,10,100]
djt
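
A detail worth knowing about the second kind of assignment: when you assign to a slice, the replacement does not have to be the same length as the piece it replaces, so the list can grow or shrink. A small sketch on a made-up list:

```python
letters = ["a", "b", "c", "d", "e"]

# Assigning to a single index swaps out exactly one element
letters[0] = "A"

# Assigning to a slice swaps out the whole range, here three elements
# (indices 1, 2 and 3) are replaced by two
letters[1:4] = ["x", "y"]

print(letters, len(letters))
```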

Certain operations produce lists. For example, we can divide a character string into pieces by "splitting" on a character using the method `split()`. This gives us a crude way to pull words from a string that represents a sentence.

In [None]:
line = "Could someone explain to me why if I tell my kids to 'learn to code' it's likely sound parenting, but if I told a journalist the same it's grounds for a @twitter suspension?"

# divide into substrings using the space character " " as a breakpoint — this gives a rough division into words.
rough_words = line.split(" ")
rough_words

In [None]:
print("There are roughly", len(rough_words), "words in this tweet.")

Add text to this cell explaining what might be wrong with this approach to pulling words from text.



Finally, use "e" as a breakpoint to split the string. Make sure you understand what happened here.

In [None]:
line.split("e")
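
One more variation to try. The sketch below uses a made-up string with uneven spacing to compare `split(" ")` with `split()` called with no argument, which breaks on any run of whitespace. It also uses the string method `strip()` to trim a trailing comma.

```python
# A hypothetical string with extra spaces between some words
messy = "learn  to code,   they said"

# Splitting on a single space leaves empty strings wherever spaces repeat
by_space = messy.split(" ")

# With no argument, split() treats any run of whitespace as one breakpoint
by_whitespace = messy.split()

# strip(",") removes commas from the two ends of a string
word = by_whitespace[2].strip(",")

print(by_space)
print(by_whitespace)
print(word)
```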

Make sure you understand lists. Create one and try out forming subsets, changing values and so on.

In [None]:
# put your work here


**7. Higher-level objects: A DataFrame**

So we have seen lists and dictionaries, built-in structures that help us group data that are associated in some way. With dictionaries, we use names or keys to look up data. With lists, we use position to look things up. In many cases we actually need a mixture of both kinds of structures. The most common example is a table. 

Think about a spreadsheet. The basic structure involves rows and columns. In many cases the rows refer to different objects in the real world and the columns represent things we measure or record about each object. For example, if instead of one tweet from Donald Trump Jr., we had 100 or 1,000, we would have a series of rows, one per tweet. In each row the first entry could be the date and time he tweeted, the second could be the text of the tweet, the third could be the tweet's ID and so on. This happens so often that researchers have created a special object to emulate a spreadsheet. 

To see it in action, let's start with a count of the number of times the "Learn to Code" meme was referenced each day starting from 1/20 through 1/30 (again, with 1/30 we stopped recording at about 10am).

We will store each day as a dictionary, with one key for the day and another for the overall tweet count on the meme.

In [None]:
times = [
    {"day":"2019-01-30","count":2767},
    {"day":"2019-01-29","count":20250},
    {"day":"2019-01-28","count":37595},
    {"day":"2019-01-27","count":9209},
    {"day":"2019-01-26","count":7448},
    {"day":"2019-01-25","count":8412},
    {"day":"2019-01-24","count":6279},
    {"day":"2019-01-23","count":970},
    {"day":"2019-01-22","count":679},
    {"day":"2019-01-21","count":540},
    {"day":"2019-01-20","count":474}
  ]

print(type(times))
print(len(times))

This is certainly a fine way to store the data. We can select information about one day by "subsetting" just the third row, say.

In [None]:
times[2]

# why 2?

In [None]:
# extract the count from the fifth row



We will, from time to time, make our data sets "by hand" like this, so it's worth seeing how it might be done. Our data format, the list of dictionaries, is trying really hard to create essentially a **table**. That is, a grid of data, where each row refers to a time period and then each column refers to either the date or the tweet count for the meme. For our simple data above, that would be a table with 11 rows and 2 columns.
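
Because each row is a dictionary sitting inside a list, pulling out a single cell of this "table" takes two lookups: a position to pick the row, then a key to pick the column. A sketch using the first three rows from above:

```python
# The first three rows of the list of dictionaries
times = [
    {"day": "2019-01-30", "count": 2767},
    {"day": "2019-01-29", "count": 20250},
    {"day": "2019-01-28", "count": 37595},
]

# Step one: the index picks a row, which is a dictionary
first_row = times[0]

# Step two: the key picks a column within that row
first_day = times[0]["day"]
third_count = times[2]["count"]

print(first_row, first_day, third_count)
```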

Interacting with even this simple data in this format is a little cumbersome. We can appeal to a higher-level object to create a proper table for us. You are probably familiar with Excel or some spreadsheet. These programs are all about tables. In Python, the answer to Excel (or a popular answer) is a so-called Pandas **DataFrame**. Pandas refers to a package contributed by a Python developer who wanted to make working with tabular data easier. 

[You can read more about Pandas here](http://pandas.pydata.org/)

[And there are simple tutorials here](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb)

Pandas is a **package**, which means its author has published data, functions and a host of new objects for the community to use. Whereas the built-in objects are basic and get us pretty far, often we need something special to make our lives easier. In the case of Pandas, an object of type DataFrame will help us manipulate (compute with, make graphs of, etc) simple tabular data. 

We can use the `times` object (the list of dictionaries) and turn it into a DataFrame using the function `DataFrame()`. (Yeah, that might be confusing: the type of the object is "DataFrame" and the name of the function to turn your data into an object of that type is also called "DataFrame". This is a fairly common naming convention, and functions like this are called "constructors.") As arguments, it takes the data itself (the list of dictionaries).

We **import** the function "DataFrame" from the pandas package first. The import command is giving us super powers from the Pandas package to do things not built into the basic Python system. We will see this construction a lot.

In [None]:
from pandas import DataFrame

tweet_times = DataFrame(times)
tweet_times

Notice that the way our data looks has changed. It's much more like an actual table now with column headings and the like. The DataFrame has lots of wonderful things you can do to it — lots of ways to compute with the data contained in the underlying table. 

One simple thing is just to get its size. How many rows and columns? This is an attribute, a piece of information stored with the object, that we can again access with "dot" notation. Because we are looking up information and not computing something (like making strings lowercase, say), we don't need parentheses.

In [None]:
tweet_times.shape
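Here is a small self-contained sketch of the same steps on a three-row sample (the names `sample` and `small` are made up for illustration). It builds a DataFrame from a list of dictionaries and then looks up a couple of attributes. Note that attributes like `shape` and `columns` take no parentheses.

```python
from pandas import DataFrame

# A three-row sample in the list-of-dictionaries format.
sample = [
    {"day": "2019-01-30", "count": 2767},
    {"day": "2019-01-29", "count": 20250},
    {"day": "2019-01-28", "count": 37595},
]

# The constructor turns the list into a table.
small = DataFrame(sample)

# shape is an attribute: (number of rows, number of columns)
print(small.shape)

# the column names come from the dictionary keys
print(list(small.columns))

# len() counts the rows
print(len(small))
```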

**Aside — Installing Packages.** Python has a set of built-in functionality and data types that it knows about. We have seen some really basic things so far — numbers and `print()`ing, say. The power of the platform is that people are constantly adding new functionality, making way for new kinds of data and new kinds of computation. This new capacity is organized into packages. Hence, `pandas`. 

Now some packages come with Python, some are added by Anaconda and still others you have to install yourself. You can search through the collection [here](https://pypi.org/). In particular, we are going to add plotting functionality to our notebook. It's basic so don't get overly excited yet. The package uses the service `plot.ly`. 

Below we use a "UNIX shell command" called `pip` to install the Python package `plotly` that provides us with access to its plotting facilities from within Python. You can read a bit about it [here](https://plot.ly/python/).

In [None]:
%%sh
pip install plotly==4.5.0

**8. Making a plot**

Now, we `import` a function that lets us make easy line plots. With the command `line()` all we have to do is specify our x- and y-axes and maybe give the plot a title. The underlying `plotly` functionality allows for fairly general plots, but it has also made the basic plots very easy. You can read about the so-called plotly "express" [here](https://plot.ly/python/plotly-express/). 

In [None]:
from plotly.express import line

fig = line(tweet_times, x="day", y="count", title='Learn to Code')
fig.show()

**9. More with DataFrames**

Writing out data like we did to create the `tweet_times` data frame is really limiting. Instead, we can read data into a data frame from a variety of formats. The easiest is a [CSV](https://en.wikipedia.org/wiki/Comma-separated_values).

We have pulled tweets from Twitter and binned the counts into 10-minute intervals and stored them in a [CSV file](https://github.com/computationaljournalism/columbia2019/raw/master/data/learn_counts.csv). Click on the link and have a look. For each row in the file, you will see two fields separated by a comma. The first row of the file is called a "header" and gives you the names of the variables recorded in each row. So you will see `time` and `count`. There are two entries in the header so each row has two entries.

Each row after the first represents a ten minute period from the last few days, arranged so that the most recent are first and the oldest appear last in the file. Following the names in the header, the first entry in each row is the "created at time" and the second is a count of references to "`learn to code`" or `#LearnToCode` or `#Learn2Code`. Each row arranges the data about its time period according to the labels in the first row, and separates the entries by a comma. Hence CSV.
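To see that a CSV is really just text, here is a sketch that uses Python's built-in `csv` module on a few made-up lines. The times and counts are invented for illustration, and the exact time format in the real file may differ. Each row comes back as a dictionary keyed by the header, so we land right back at our list of dictionaries.

```python
import csv
import io

# Made-up CSV text in the same layout: a header, then one row per time period.
text = """time,count
2019-01-30 10:20:00,12
2019-01-30 10:10:00,7
2019-01-30 10:00:00,9
"""

# io.StringIO lets us treat a string as if it were a file.
reader = csv.DictReader(io.StringIO(text))
rows = list(reader)

# the names from the header
print(reader.fieldnames)

# one dictionary per row after the header
print(len(rows))
print(rows[0])

# careful: the csv module reads every entry as a string, even the counts
print(type(rows[0]["count"]))
```

The last line is a hint about why we reach for Pandas: the built-in reader leaves the counts as text, and we would have to convert them to numbers ourselves.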

In the cell below, we first import the function `read_csv()`. Unlike `DataFrame()`, which takes data we already have in Python, `read_csv()` takes a CSV file and creates a DataFrame from it. Its argument is either the URL of a CSV or the location of a CSV file on your computer. Here we supply the URL on GitHub.

In [None]:
from pandas import read_csv

# read in the tweets from the CSV file in our github data directory
counts = read_csv("https://github.com/computationaljournalism/columbia2020/raw/master/data/learn_counts.csv")

type(counts)

In [None]:
counts.shape

So we have 1,403 time periods (rows in the table) and 2 variables recorded for each. We can have a look at the "top" and "bottom" of the data set. These are printed with the `head()` and `tail()` methods.

In [None]:
counts.head()

In [None]:
counts.tail()

The `head()` and `tail()` methods of a DataFrame give you five time periods from the start and end of the data (and you can give an argument to see more or fewer). It's important to look at the top and bottom of the file to check that everything looks consistent (column entries seem to mean what they should) and see how the data might be organized.
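As a quick sketch of that optional argument, here we read a tiny made-up CSV from a string instead of a URL (the labels `t8` through `t1` and the counts are invented) and ask for a different number of rows.

```python
import io
from pandas import read_csv

# Eight made-up rows, most recent first, like the real file.
text = """time,count
t8,80
t7,70
t6,60
t5,50
t4,40
t3,30
t2,20
t1,10
"""

# read_csv() also accepts a file-like object, so no download is needed here.
small = read_csv(io.StringIO(text))

# with no argument we get five rows
print(small.head())

# with an argument we choose how many
print(small.head(3))
print(small.tail(2))
```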

We can now have a look at these values in a plot. Again, basic, but it motivates our investigations. Each point on the line is a time period. What should we be asking?

In [None]:
fig = line(counts, x="time", y="count", title='Learn to Code')
fig.show()

Just to cement ideas, we can also split out retweets from tweets. So if a tweet was a reply or an original tweet, we include it in the `tweet_count` total below and otherwise if it is a retweet we count it in the `retweet_count` column.

In [None]:
counts2 = read_csv("https://github.com/computationaljournalism/columbia2020/raw/master/data/learn_counts2.csv")
counts2.head()

**Slightly more advanced plotting.** Ideally we'd like to make a single plot with two lines, one for the retweets and one for the tweets. Do they have a similar pattern? For this, we need slightly more from plotly than just the "express". Here we create two `Scatter()` plots and add them to a `Figure()`.

In [None]:
from plotly.graph_objects import Scatter, Figure

fig = Figure()

fig.add_trace(Scatter(x=counts2["time"], y=counts2["tweet_count"], name="tweet count"))
fig.add_trace(Scatter(x=counts2["time"], y=counts2["retweet_count"],name="retweet count"))

fig.show()

Notice that something sneaky has happened in the `Scatter()` calls. We have "subset" the DataFrame to pull out the columns corresponding to the x- and y-axes. This subsetting, **like all subsetting we've seen today**, happens with square brackets. `counts2["time"]` pulls out just the column of data named `time` and uses it for the data on the x-axis.

In [None]:
counts2["time"]
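Here is a small sketch of column subsetting on made-up data (the values and the name `small` are ours, for illustration only). A single name in the square brackets gives back one column, which Pandas calls a Series. A list of names gives back a smaller DataFrame.

```python
from pandas import DataFrame

# Three made-up time periods with the same column names as counts2.
small = DataFrame([
    {"time": "t1", "tweet_count": 5, "retweet_count": 11},
    {"time": "t2", "tweet_count": 8, "retweet_count": 20},
    {"time": "t3", "tweet_count": 2, "retweet_count": 4},
])

# one name: a single column
one = small["tweet_count"]
print(type(one))
print(one.sum())

# a list of names: a table with just those columns
two = small[["tweet_count", "retweet_count"]]
print(two.shape)
```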

We will have a lot more to say about subsetting tables in your homework! Typically we identify rows and columns to summarize or exhibit in some way.

**10. Tables again (skippable)**

We created our first DataFrame from a list of dictionaries. Each row was an element of the list and the dictionary let us name the attributes we had about each item represented by a row. There is a more compact way of accomplishing the same thing if we know that each entry in our list holds data in the same order - day then count. Like this...

In [None]:
times = [
    ["2019-01-30",2767],
    ["2019-01-29",20250],
    ["2019-01-28",37595],
    ["2019-01-27",9209],
    ["2019-01-26",7448],
    ["2019-01-25",8412],
    ["2019-01-24",6279],
    ["2019-01-23",970],
    ["2019-01-22",679],
    ["2019-01-21",540],
    ["2019-01-20",474]
  ]

This time we have a list of lists! OK, this is now very much in the weeds, but we can turn this structure into a DataFrame as we did before. This time we have to provide a list with elements that name the columns.

In [None]:
counts = DataFrame(times,columns=["day","count"])
counts

We can think of a DataFrame as a grid where we know columns all have the same kind of data and each row refers to a different unit of observation. This format is central to most machine learning exercises and statistical analysis.
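To check that the two constructions really describe the same grid, here is a sketch that builds a two-row table both ways and compares them with the DataFrame method `equals()`. The variable names are made up for illustration.

```python
from pandas import DataFrame

# The same two rows, written as dictionaries...
as_dicts = [
    {"day": "2019-01-30", "count": 2767},
    {"day": "2019-01-29", "count": 20250},
]

# ...and as plain lists, where the order (day, then count) carries the meaning.
as_lists = [
    ["2019-01-30", 2767],
    ["2019-01-29", 20250],
]

from_dicts = DataFrame(as_dicts)
from_lists = DataFrame(as_lists, columns=["day", "count"])

# equals() reports whether two tables hold the same data
print(from_dicts.equals(from_lists))
```

The list-of-lists version is shorter to type, but it only works if every row keeps its entries in the same order.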

<img src=https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/t.jpeg width=300>

**11. To launch you into the world...**

We have created a CSV file where each row is a tweet. This file is big so you have to download it. I put it up [on Dropbox](https://www.dropbox.com/s/ene3qllvkwzolof/learn_tweets.csv?dl=0) — download it and put it in the same folder as your notebook. Then you should be able to read it in using the commands below.

In [None]:
from pandas import read_csv, set_option

# set the maximum number of characters in any cell
set_option("display.max_colwidth", 280)

tweets = read_csv("learn_tweets.csv")

In [None]:
tweets.head()

In homework, we will prepare you with some tools to help you extract the content from this table and ask about tweeting activity. Who tweeted the most in this little episode? Who was the most tweeted *at*? For example, we can figure out who tweeted the most on this topic with a method called `value_counts()`. It takes the data in the column `screen_name` and then tabulates how many times each person tweeted on the "Learning to code" meme (or at the very least used one of the hashtags we identified). That tabulation can then be printed out...

In [None]:
tweets['screen_name'].value_counts()
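If you want to see what `value_counts()` is doing without the big file, here is a sketch on six made-up tweets (the screen names are invented). The result is sorted so the most frequent name comes first, and you can look up any one person's tally by name.

```python
from pandas import DataFrame

# Six made-up tweets from three made-up accounts.
small = DataFrame([
    {"screen_name": "alice", "text": "learn to code"},
    {"screen_name": "bob", "text": "#LearnToCode"},
    {"screen_name": "alice", "text": "#Learn2Code"},
    {"screen_name": "carol", "text": "learn to code"},
    {"screen_name": "alice", "text": "#LearnToCode"},
    {"screen_name": "bob", "text": "learn to code"},
])

# tabulate how many rows each screen_name appears in
tally = small["screen_name"].value_counts()
print(tally)

# look up one person's count by name, much like a dictionary
print(tally["alice"])
```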

A bit later we'll see how we can subset not just columns but also rows. Here we pull out the rows (tweets) associated with the top tweeter. The symbol `==` is called an operator, a comparison operator, that returns another kind of built-in value, a "boolean". This type takes on just the values True and False. Applied to a column, `==` gives one True or False per row, and the subsetting below keeps just the rows associated with True (that is, rows where the `screen_name` of the tweeter is `ham_gretsky`). Can you see the potential here? Comparison operators let you subset rows where the number of followers is larger than 5,000 or the total number of tweets by a person is larger than 100. Much of our data analysis will involve simple steps like these.

In [None]:
tweets[tweets["screen_name"]=="ham_gretsky"]

In [None]:
tweets[tweets["screen_name"]=="SherlockRobo"]
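Here is a sketch that takes row subsetting one step at a time on made-up data. The column name `followers` and all the values are assumptions for illustration; the real table may name its columns differently. First we build the True/False column, then we use it to keep rows.

```python
from pandas import DataFrame

# Four made-up tweets; "followers" is a hypothetical column name.
small = DataFrame([
    {"screen_name": "alice", "followers": 12000},
    {"screen_name": "bob", "followers": 300},
    {"screen_name": "alice", "followers": 12000},
    {"screen_name": "carol", "followers": 7500},
])

# Step 1: the comparison gives one True or False per row.
mask = small["screen_name"] == "alice"
print(mask)

# Step 2: square brackets keep only the rows marked True.
print(small[mask])

# The same idea works with numbers: tweets from accounts with over 5,000 followers.
popular = small[small["followers"] > 5000]
print(len(popular))
```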

The rest waits for homework!