# Capstone Project 2: Milestone Report

## 1. Problem definition
Individuals often post sarcastic or ironic messages on social media sites, and businesses (and sometimes governments) need to be able to distinguish between what is intended literally and what is not.  The model generated for this project would enable organizations to perform this classification.

## 2. Client description
Businesses and government organizations that use Twitter data will benefit from this project.  The model generated by this project would enable such organizations to identify unlabeled sarcastic tweets and distinguish them from literal tweets.  This model could also aid in sentiment analysis, to determine when an utterance's sentiment should be negated or examined more closely.

## 3. Data overview
The tweets for this project were gathered between July and October of 2017 via the Twitter REST API.  The most recent 3,200 tweets made by each of the 3,573,510 followers of John Oliver were acquired (over 2 billion total tweets).  Retweets were excluded.  Of the total acquired tweets, 30,910 contain the sarcasm hashtag, which will be used as the true label.  The negative examples consist of tweets with the "happy" hashtag (12,639 tweets), the "sad" hashtag (40,861 tweets), and the "seriously" hashtag (11,450 tweets).  In addition, profile information was gathered for each user, including number of friends, number of followers, date the account was created, and location.
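To make the class balance implied by these counts explicit, the short sketch below computes the number of negative examples and the share of positive examples. It assumes the per-hashtag counts quoted above describe the set used for modeling; the variable names are illustrative only.

```python
# Tweet counts per hashtag, copied from the data overview above.
counts = {"sarcasm": 30_910, "happy": 12_639, "sad": 40_861, "seriously": 11_450}

# #sarcasm tweets are the positive class; the other three hashtags are negatives.
positives = counts["sarcasm"]
negatives = sum(n for tag, n in counts.items() if tag != "sarcasm")
total = positives + negatives
positive_share = positives / total

print(f"positives: {positives:,}")                # positives: 30,910
print(f"negatives: {negatives:,}")                # negatives: 64,950
print(f"positive share: {positive_share:.1%}")    # positive share: 32.2%
```

Roughly one tweet in three is a positive example, so the classes are imbalanced but not severely so.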

## 4. Data wrangling
I performed the following data wrangling and cleaning steps:

1. I removed all duplicate tweets (tweets that appeared more than once in the dataset)
2. I removed tweets that only contained a URL
3. I removed tweets that contained more than one hashtag, to make the comparison between sarcastic and non-sarcastic tweets clearer
4. I removed tweets where the hashtags of interest (#sarcasm, #happy, #sad, and #seriously) were *not* at the end of the tweet text
5. I removed the hashtag itself from the tweet text
6. I removed URLs from the tweet text when they were present
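The steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code used for the project: the function name `clean_tweets`, the regular expressions, and the choice to strip URLs before checking the hashtag position are all assumptions.

```python
import re

# Hashtags of interest, taken from the report; everything else is illustrative.
TARGET_TAGS = ("sarcasm", "happy", "sad", "seriously")

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
# A target hashtag at the very end of the text (trailing whitespace allowed).
TRAILING_TAG_RE = re.compile(
    r"#(" + "|".join(TARGET_TAGS) + r")\s*$", re.IGNORECASE
)


def clean_tweets(texts):
    """Return (cleaned_text, hashtag) pairs for tweets that survive all six steps."""
    seen = set()
    cleaned = []
    for text in texts:
        # Step 1: drop exact duplicate tweets.
        if text in seen:
            continue
        seen.add(text)
        # Steps 2 and 6: strip URLs, then drop tweets with nothing left.
        no_urls = URL_RE.sub("", text).strip()
        if not no_urls:
            continue
        # Step 3: keep only tweets with exactly one hashtag.
        if len(HASHTAG_RE.findall(no_urls)) != 1:
            continue
        # Step 4: that hashtag must be a target tag at the end of the text.
        match = TRAILING_TAG_RE.search(no_urls)
        if match is None:
            continue
        # Step 5: remove the hashtag itself, keeping it only as the label.
        body = no_urls[: match.start()].strip()
        cleaned.append((body, match.group(1).lower()))
    return cleaned


sample = [
    "Great, another Monday #sarcasm",
    "Great, another Monday #sarcasm",        # duplicate
    "https://t.co/abc",                      # URL only
    "So #happy about this #blessed",         # more than one hashtag
    "#sad that it rained today",             # hashtag not at the end
    "Look at this https://t.co/xyz #Happy",  # URL removed, tag kept as label
]
result = clean_tweets(sample)
```

On the sample input, only the first and last tweets survive, with the hashtag separated from the text so it can serve as the label.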

## 5. Initial findings

Below is a summary of the tweets collected (note: followers are individuals following the user, friends are individuals that the user is following).

### #sarcasm

The data include a total of 30,910 tweets with the sarcasm hashtag made by 23,509 unique users.  Table 1A summarizes the dataset.  On average, users have more followers than friends.  However, the variability in the number of followers is much higher than the variability in the number of friends.  As shown in Table 1B, the majority of users only use #sarcasm once.  Finally, as Figure 1A shows, the majority of #sarcasm users were in the Eastern, Central, and Pacific timezones.

**Table 1A. Summary statistics for users of #sarcasm**

|          | Followers |Friends  |Total Tweets|
|----------|-----------|---------|------------|
| **Mean** | 833       | 646     | 4,503      |
| **SD**   | 30,762    | 2,305   | 8,573      |
| **Min**  | 0         | 0       | 2          |
| **Max**  | 5,192,273 | 176,631 | 446,022    |

**Table 1B. Number of users that used #sarcasm X times**

|           | # of Users|
|-----------|-----------|
| **X = 1** | 19,213    |
| **X = 2** | 2,883     |
| **X = 3** | 773       |
| **X = 4** | 289       |
| **X = 5** | 159       |
| **X > 5** | 192       |
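A table like Table 1B can be derived by counting tweets per user and then counting users per usage level. The sketch below shows one way to do this with `collections.Counter`; the function name `usage_buckets` and the input format (one user ID per tweet) are assumptions, not the project's actual code.

```python
from collections import Counter


def usage_buckets(user_ids, cap=5):
    """Count how many users posted the hashtag X times, grouping X > cap together.

    `user_ids` holds one entry per tweet: the ID of the user who posted it.
    """
    tweets_per_user = Counter(user_ids)
    buckets = Counter()
    for n in tweets_per_user.values():
        key = f"X = {n}" if n <= cap else f"X > {cap}"
        buckets[key] += 1
    return dict(buckets)


# Toy input: user "a" tweeted twice, "b" and "d" once each, "c" six times.
example = usage_buckets(["a", "a", "b", "c", "c", "c", "c", "c", "c", "d"])
```

A useful sanity check is that the bucket counts sum to the number of unique users, as the rows of Table 1B do (23,509).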

**Figure 1A. Top 10 #sarcasm user timezones**

<img src="graphs/timezones_sarcasm.png">

### #happy

The data include a total of 12,639 tweets with the happy hashtag made by 10,149 unique users.  Table 2A summarizes the dataset.  As with users of #sarcasm, users of #happy have more followers than friends, on average.  Again, the variability in the number of followers is much higher than the variability in the number of friends.  As shown in Table 2B, the majority of users only use #happy once.  Finally, as Figure 2A shows, the users of #happy generally came from the same timezones as #sarcasm users; the majority of users were in the Eastern, Central, and Pacific timezones.

**Table 2A. Summary statistics for users of #happy**

|          | Followers |Friends  |Total Tweets|
|----------|-----------|---------|------------|
| **Mean** | 826       | 711     | 3,885      |
| **SD**   | 10,038    | 3,617   | 5,567      |
| **Min**  | 0         | 0       | 1          |
| **Max**  | 752,605   | 223,235 | 175,701    |

**Table 2B. Number of users that used #happy X times**

|           | # of Users|
|-----------|-----------|
| **X = 1** | 8,559     |
| **X = 2** | 1,111     |
| **X = 3** | 287       |
| **X = 4** | 103       |
| **X = 5** | 41        |
| **X > 5** | 48        |

**Figure 2A. Top 10 #happy user timezones**

<img src="graphs/timezones_happy.png">



### #sad

The data include a total of 40,861 tweets with the sad hashtag made by 26,456 unique users.  Table 3A summarizes the dataset.  As with users of #sarcasm and #happy, users of #sad have more followers than friends, on average.  The difference between the average number of followers and friends is larger than for the prior hashtags, however.  Again, the variability in the number of followers is much higher than the variability in the number of friends.  As shown in Table 3B, the majority of users only use #sad once.  Finally, as Figure 3A shows, the users of #sad generally came from the same timezones as the other hashtag users; the majority of users were in the Eastern, Pacific, and Central timezones.

**Table 3A. Summary statistics for users of #sad**

|          | Followers | Friends | Total Tweets |
|----------|-----------|---------|--------------|
| **Mean** | 1,227     | 772     | 5,015        |
| **SD**   | 24,606    | 3,451   | 10,008       |
| **Min**  | 0         | 0       | 2            |
| **Max**  | 2,136,334 | 485,012 | 384,987      |


**Table 3B. Number of users that used #sad X times**

|           | # of Users |
|-----------|------------|
| **X = 1** | 21,229     |
| **X = 2** | 3,047      |
| **X = 3** | 995        |
| **X = 4** | 367        |
| **X = 5** | 222        |
| **X > 5** | 596        |

**Figure 3A. Top 10 #sad user timezones**

<img src="graphs/timezones_sad.png">



### #seriously

The data include a total of 11,450 tweets with the seriously hashtag made by 9,033 unique users.  Table 4A summarizes the dataset.  Again, as with users of the other hashtags of interest, users of #seriously have more followers than friends, on average.  The variability in the number of followers is again much higher than the variability in the number of friends (an SD of 10,322 versus 951 in Table 4A).  As shown in Table 4B, the majority of users only use #seriously once.  Finally, as Figure 4A shows, users of #seriously generally came from the same timezones as the other hashtag users; the majority of users were in the Eastern, Central, and Pacific timezones.


**Table 4A. Summary statistics for users of #seriously**

|          | Followers | Friends | Total Tweets |
|----------|-----------|---------|--------------|
| **Mean** | 755       | 616     | 3,851        |
| **SD**   | 10,322    | 951     | 6,310        |
| **Min**  | 0         | 0       | 3            |
| **Max**  | 740,765   | 54,592  | 222,174      |

**Table 4B. Number of users that used #seriously X times**

|           | # of Users |
|-----------|------------|
| **X = 1** | 7,692      |
| **X = 2** | 944        |
| **X = 3** | 223        |
| **X = 4** | 87         |
| **X = 5** | 28         |
| **X > 5** | 49         |

**Figure 4A. Top 10 #seriously user timezones**

<img src="graphs/timezones_seriously.png">