# Project 03 - Due Monday, November 19 at 1pm

*Objectives*: Use Spark to process and perform basic analysis on non-relational data, including its DataFrame and SQL interfaces.

*Grading criteria*: The tasks should all be completed, and questions should all be answered with Python code, SQL queries, shell commands, and markdown cells.  The notebook itself should be completely reproducible (using an AWS EC2 instance based on the provided AMI) from start to finish; another person should be able to use the code to obtain the same results as yours.  Note that you will receive no more than partial credit if you do not add text/markdown cells explaining your thinking when appropriate.

*Attestation*: **Work in groups**.  At the end of your submitted notebook, identify the work each partner performed and attest that each contributed substantially to the work.

*Deadline*: Monday, November 19, 1pm.  One member of each group must submit the group's notebook to Blackboard; group members should not submit it separately.

## Part 1 - Setup

Begin by setting up Spark and fetching the project data.  

**Note**: you may want to use a larger EC2 instance type than normal.  This project was prepared using a `t2.xlarge` instance.  Just remember that the larger the instance, the higher the per-hour charge, so be sure to shut your instance down when you're done, as always.

### About the data

We will use JSON data from Twitter; we saw an example of this in class.  It should parse cleanly, allowing you to focus on analysis.

This data was gathered using GWU Libraries' [Social Feed Manager](http://sfm.library.gwu.edu/) application during a recent game of the MLB World Series featuring the Los Angeles Dodgers and Boston Red Sox.  This first file tells you a little bit about how it was gathered:

In [1]:
!wget https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d-README.txt

--2018-11-19 17:11:24--  https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d-README.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.84.125
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.84.125|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1145 (1.1K) [text/plain]
Saving to: 'ea26ccd641744d4a8dce84de0785186d-README.txt'


2018-11-19 17:11:24 (83.5 MB/s) - 'ea26ccd641744d4a8dce84de0785186d-README.txt' saved [1145/1145]



In [2]:
!cat ea26ccd641744d4a8dce84de0785186d-README.txt

This is an export created with Social Feed Manager.

EXPORT INFORMATION
Selected seeds: All seeds
Export id: ea26ccd641744d4a8dce84de0785186d
Export type: twitter_filter
Format: Full JSON
Export completed:  Oct. 30, 2018, 9:45:59 a.m. EDT
Deduplicate: Yes

COLLECTION INFORMATION
Collection name: 2018-world-series

Collection id: 4e2564448b144915b6a0eb1899075a44
Collection set: 2018-mlb-playoffs (collection set id 5a00efa0bddf4be2aa19c6df9788ff6e)
Harvest type: Twitter filter

Harvest options:

Seeds:
* Track: dodgers,redsox,red sox,bossox,world series,RedSoxVsDodgers,ladodgers,bostonredsox - Active

Change log:

Change to 2018-world-series (collection) on Oct. 30, 2018, 8:16:14 a.m. EDT by dchud:
Note: Series ended Sunday night after five games, waited ~36 hours.

Change to 2018-world-series (collection) on Oct. 23, 2018, 6:01:28 p.m. EDT by dchud:

Change to Track: dodgers,redsox,red sox,bossox,world series,RedSoxVsDodgers,ladodgers,bostonredsox (seed) on Oct. 23, 2018, 6:01:18 p.m. E

The most important pieces in that metadata are:

 * It tracked tweets that mentioned "dodgers" or "redsox" and several additional related terms.  Every item in this set should refer to one or more of these terms.
 * This data was deduplicated; we should not see individual tweets more than once.
 * Data was collected between October 23 and October 30.  All five games of the Series were played during this time.
 
You should not need to know anything about baseball to complete this assignment, but if you have baseball questions (or Twitter questions!) please ask on the discussion forum.

**Please note**: sometimes social media data contains offensive material.  This data set has not been filtered; if you do come across something inappropriate, please do your best to ignore it if you can.

## Fetch the data

The following files are available:

 * https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d_009.json
 * https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d_010.json
 * https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d_011.json
 * https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d_012.json
 
### Q1.1 - Select at least one and obtain it using `wget`.  Verify the file sizes using the command line.

Each file should contain exactly 100,000 tweets.  

*Note*: you are only required to use one of these files, but you may use more than one.  It will be easier to process more data if you use a larger EC2 instance type, as suggested above.  Use the exact same set of files throughout the assignment.

**Answer**

In [3]:
!wget https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d_009.json

--2018-11-19 17:11:31--  https://s3.amazonaws.com/2018-dmfa/project-3/ea26ccd641744d4a8dce84de0785186d_009.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.162.37
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.162.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 554277526 (529M) [application/json]
Saving to: 'ea26ccd641744d4a8dce84de0785186d_009.json'


2018-11-19 17:11:52 (25.3 MB/s) - 'ea26ccd641744d4a8dce84de0785186d_009.json' saved [554277526/554277526]
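
Q1.1 also asks for the file size to be verified. From the command line, `ls -l` on the downloaded file would show it. As a rough Python sketch of the same check, the snippet below compares the size on disk with the `Length:` value reported in the `wget` log above. The helper name `file_size_matches` and the `DOWNLOADS` dictionary are illustrative, not part of the assignment.

```python
import os


def file_size_matches(path, expected_bytes):
    """Return True if the file at `path` is exactly `expected_bytes` long."""
    return os.path.getsize(path) == expected_bytes


# Expected size copied from the `Length:` line of the wget log above.
DOWNLOADS = {
    "ea26ccd641744d4a8dce84de0785186d_009.json": 554277526,
}

for name, expected in DOWNLOADS.items():
    if not os.path.exists(name):
        status = "not downloaded"
    elif file_size_matches(name, expected):
        status = "OK"
    else:
        status = "SIZE MISMATCH"
    print(name, status)
```

If more than one file is downloaded, each would need its own entry, with the expected size taken from its own `wget` log.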



For your reference, here is the text of one Tweet, randomly selected from one of these files.  You might wish to study its structure and refer to it later.

In [4]:
!cat *.json | shuf -n 1 > example-tweet.json

In [5]:
import json
print(json.dumps(json.load(open("example-tweet.json")), indent=2))

{
  "geo": null,
  "favorited": false,
  "place": null,
  "in_reply_to_user_id_str": null,
  "coordinates": null,
  "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "id": 1056753589551538176,
  "retweet_count": 0,
  "lang": "en",
  "retweeted": false,
  "text": "Ugh Red Sox..",
  "quote_count": 0,
  "user": {
    "screen_name": "Carol_Dagny",
    "profile_link_color": "FA743E",
    "verified": false,
    "profile_use_background_image": false,
    "profile_background_image_url": "http://abs.twimg.com/images/themes/theme12/bg.gif",
    "follow_request_sent": null,
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1013446190107627522/nfrcPldi_normal.jpg",
    "time_zone": null,
    "translator_type": "none",
    "id": 1265692452,
    "favourites_count": 4263,
    "lang": "en",
    "profile_sidebar_border_color": "000000",
    "location": "United States",
    "followers_count": 199,
    "profile_background_tile": false

You can find several key elements in this example: the text, time, and language of the tweet, whether it was a reply to another user, the user's screen name along with their primary language and other account information like creation date, follower/friend/tweet counts, and perhaps their location.  If there are hashtags, user mentions, or URLs present in their tweet, they will be present in the `entities` section; these are not present in every tweet.  If this is a retweet, you will see the original tweet and its information nested within.
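
To make the structure described here concrete, the following sketch pulls those fields out of one parsed tweet using plain Python. The function name `summarize_tweet` is made up for illustration. The `retweeted_status` key is the name Twitter's JSON uses for the nested original tweet; it does not appear in the truncated example above, so treat it as an assumption here. The `sample` dictionary is a cut-down stand-in for `example-tweet.json`, using only values visible in this notebook.

```python
def summarize_tweet(tweet):
    """Pull a few commonly used fields out of one parsed tweet (a dict)."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    return {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        "lang": tweet.get("lang"),
        "in_reply_to_screen_name": tweet.get("in_reply_to_screen_name"),
        "screen_name": user.get("screen_name"),
        "followers_count": user.get("followers_count"),
        "location": user.get("location"),
        # Hashtags are optional; fall back to an empty list when absent.
        "hashtags": [h["text"] for h in entities.get("hashtags") or []],
        # Assumption: retweets carry the original under `retweeted_status`.
        "is_retweet": "retweeted_status" in tweet,
    }


# Cut-down stand-in for example-tweet.json (values taken from the output above).
sample = {
    "created_at": "Mon Oct 29 03:44:23 +0000 2018",
    "text": "Ugh Red Sox..",
    "lang": "en",
    "in_reply_to_screen_name": None,
    "user": {
        "screen_name": "Carol_Dagny",
        "followers_count": 199,
        "location": "United States",
    },
}

print(summarize_tweet(sample))
```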

### Q1.2 - Start up Spark, and verify the file sizes.

We will use our normal startup sequence here:

In [6]:
import findspark

In [7]:
findspark.init()

In [8]:
from pyspark import SparkContext

In [9]:
spark = SparkContext(appName='project-03')

In [10]:
spark

In [11]:
from pyspark import SQLContext

In [12]:
sqlc = SQLContext(spark)

In [13]:
sqlc

<pyspark.sql.context.SQLContext at 0x7f0ae67f4c88>

In [14]:
tweets = sqlc.read.json("ea26ccd6*.json")

Verify that Spark has loaded the same number of tweets you saw before:

**Answer**

In [15]:
# Expected number of tweets: each tweet is one line, so count the lines
!wc -l ea26ccd6*.json

100000 ea26ccd641744d4a8dce84de0785186d_009.json


In [16]:
# The number of tweets Spark has loaded
tweets.count()

100000

The two counts agree: the file contains 100,000 tweets, as stated above.
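
As an optional cross-check that depends on neither Spark nor `wc`, the sketch below counts the non-empty lines of a file and confirms that each one parses as JSON. It assumes one tweet per line, which is what the `wc -l` result above suggests. The helper name `count_json_lines` is illustrative.

```python
import json


def count_json_lines(path):
    """Count non-empty lines in `path`, checking that each parses as JSON."""
    total = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                total += 1
    return total


# Usage (reads the whole file, so it may take a little while):
# count_json_lines("ea26ccd641744d4a8dce84de0785186d_009.json")
```

If the result differs from `tweets.count()`, that would suggest blank or malformed lines in the file.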

## Part 2 - Comparing DataFrames and Spark SQL

For the next three questions, we will look at operations using both DataFrames and SQL queries. Note that `tweets` is already a DataFrame:

In [17]:
tweets

DataFrame[contributors: string, coordinates: struct<coordinates:array<double>,type:string>, created_at: string, display_text_range: array<bigint>, entities: struct<hashtags:array<struct<indices:array<bigint>,text:string>>,media:array<struct<additional_media_info:struct<description:string,embeddable:boolean,monetizable:boolean,title:string>,display_url:string,expanded_url:string,id:bigint,id_str:string,indices:array<bigint>,media_url:string,media_url_https:string,sizes:struct<large:struct<h:bigint,resize:string,w:bigint>,medium:struct<h:bigint,resize:string,w:bigint>,small:struct<h:bigint,resize:string,w:bigint>,thumb:struct<h:bigint,resize:string,w:bigint>>,source_status_id:bigint,source_status_id_str:string,source_user_id:bigint,source_user_id_str:string,type:string,url:string>>,symbols:array<struct<indices:array<bigint>,text:string>>,urls:array<struct<display_url:string,expanded_url:string,indices:array<bigint>,url:string>>,user_mentions:array<struct<id:bigint,id_str:string,indices:a

To issue SQL queries, we need to register a table based on `tweets`:

In [18]:
tweets.createOrReplaceTempView("tweets")

### Q2.1 - Which 10 languages are most commonly used in tweets?  Verify your result by executing it with both the dataframe and with SQL.

Hint: for the dataframe, use `groupBy`, `count`, and `orderBy`.  See the documentation at https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html for details on these and other functions.

**Answer**

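Using both the DataFrame API and a SQL query, the 10 most common languages were en (English), es (Spanish), pt (Portuguese), ja (Japanese), fr (French), ko (Korean), en-gb (British English), de (German), it (Italian), and ru (Russian).  Note that both queries group on `user.lang`, the language set on the user's account, rather than the tweet-level `lang` field.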

In [19]:
!head -1 example-tweet.json

{"created_at": "Mon Oct 29 03:44:23 +0000 2018", "id": 1056753589551538176, "id_str": "1056753589551538176", "text": "Ugh Red Sox..", "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "truncated": false, "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 1265692452, "id_str": "1265692452", "name": "Carol", "screen_name": "Carol_Dagny", "location": "United States", "url": null, "description": "On a journey to success.", "translator_type": "none", "protected": false, "verified": false, "followers_count": 199, "friends_count": 160, "listed_count": 6, "favourites_count": 4263, "statuses_count": 2513, "created_at": "Wed Mar 13 23:30:24 +0000 2013", "utc_offset": null, "time_zone": null, "geo_enabled": true, "lang": "en", "contributors_enabled": false, "is_translator": false, "profile_background_color": "000000", "profi

In [20]:
tweets.select('user.lang').groupBy('lang').count() \
    .orderBy('count',ascending=False) \
    .show(10)

+-----+-----+
| lang|count|
+-----+-----+
|   en|91282|
|   es| 7129|
|   pt|  604|
|   ja|  328|
|   fr|  163|
|   ko|  146|
|en-gb|  100|
|   de|   48|
|   it|   35|
|   ru|   22|
+-----+-----+
only showing top 10 rows



In [20]:
sqlc.sql('''
    SELECT user.lang AS language, count(user.lang) AS counts 
    FROM tweets 
    GROUP BY user.lang
    ORDER BY counts DESC
'''
).show(10)

+--------+------+
|language|counts|
+--------+------+
|      en| 91282|
|      es|  7129|
|      pt|   604|
|      ja|   328|
|      fr|   163|
|      ko|   146|
|   en-gb|   100|
|      de|    48|
|      it|    35|
|      ru|    22|
+--------+------+
only showing top 10 rows



The DataFrame and SQL results match.

### Q2.2 - Which 10 time zones are most common among users?  Verify your result with both the dataframe and SQL.

*Note*: for this question, you may leave NULL values present in your results, as a way to help you understand what data is present and what is missing.

**Answer**

Using both the DataFrame API and SQL, we found that `user.time_zone` is null for all 100,000 tweets, so the top-10 query returns only a single NULL row.

In [21]:
tweets.select('user.time_zone').groupBy('time_zone').count() \
    .orderBy('count',ascending=False) \
    .show(10)

+---------+------+
|time_zone| count|
+---------+------+
|     null|100000|
+---------+------+



In [22]:
sqlc.sql('''
    SELECT user.time_zone, count(*) AS counts 
    FROM tweets 
    GROUP BY user.time_zone
    ORDER BY counts DESC
'''
).show(10)

+---------+------+
|time_zone|counts|
+---------+------+
|     null|100000|
+---------+------+



The DataFrame and SQL results match.

### Q2.3 - How many tweets mention the Dodgers?  How many mention the Red Sox?  How many mention both?

You may use either the dataframe or SQL to answer.  Explain why you have chosen that approach.

Hint:  you will want to look at the value of the `text` field.

**Answer**

We used SQL for this question because the queries are clearer and easier to read. 
Matching case-insensitively on the tweet text, 24,555 tweets mention the Dodgers ("dodgers"), 56,951 mention the Red Sox ("red sox" or "redsox"), and 5,582 mention both. 

In [23]:
sqlc.sql('''
    SELECT count(text) AS D_counts
    FROM tweets 
    WHERE lower(text) like '%dodgers%'
'''
).show()

+--------+
|D_counts|
+--------+
|   24555|
+--------+



In [24]:
sqlc.sql('''
    SELECT count(text) AS R_counts
    FROM tweets 
    WHERE lower(text) like '%red sox%' or lower(text) like '%redsox%'
'''
).show()

+--------+
|R_counts|
+--------+
|   56951|
+--------+



In [25]:
sqlc.sql('''
    SELECT count(text) AS both_counts
    FROM tweets 
    WHERE lower(text) like '%dodgers%' and (lower(text) like '%red sox%' or lower(text) like '%redsox%')
'''
).show()

+-----------+
|both_counts|
+-----------+
|       5582|
+-----------+
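
The three counts above can be combined by inclusion-exclusion to see how many tweets mention at least one team, only one team, or neither. The short worked calculation below uses only the numbers reported in this notebook; the variable names are illustrative.

```python
total_tweets = 100000  # tweets.count()
dodgers = 24555        # text matches 'dodgers'
red_sox = 56951        # text matches 'red sox' or 'redsox'
both = 5582            # text matches both

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
either = dodgers + red_sox - both
only_dodgers = dodgers - both
only_red_sox = red_sox - both
neither = total_tweets - either

print("either team:  ", either)
print("only Dodgers: ", only_dodgers)
print("only Red Sox: ", only_red_sox)
print("neither team: ", neither)
```

The tweets that match neither pattern may have matched another tracked term, such as "world series", or may have matched somewhere other than the `text` field; this notebook does not investigate which.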



## Part 3 - More complex queries

For this section, you may choose to use dataframe queries or SQL.  If you wish, you may verify results by using both, as in Part 2, but this is not required for this section.

### Q3.1 - Team mentions by location

In which users' locations are the Red Sox and the Dodgers being mentioned the most?  Consider each team separately, one at a time.  Discuss your findings.

Hint:  you may use either the time zones or user-specified locations for this question.

**Answer**

We used SQL and the user-specified locations for this question, since we believe they describe location more directly than time zones do. According to the results, each team was mentioned most by users in its home city: Los Angeles, CA for the Dodgers (765 tweets) and Boston, MA for the Red Sox (1,808 tweets). 

In [27]:
sqlc.sql('''
    SELECT count(user.location) AS count, user.location
    FROM tweets
    WHERE lower(text) like '%dodgers%'
    GROUP BY user.location
    ORDER BY count DESC
'''
).show(1)

+-----+---------------+
|count|       location|
+-----+---------------+
|  765|Los Angeles, CA|
+-----+---------------+
only showing top 1 row



In [28]:
sqlc.sql('''
    SELECT count(user.location) AS count, user.location
    FROM tweets
    WHERE lower(text) like '%red sox%' or lower(text) like '%redsox%'
    GROUP BY user.location
    ORDER BY count DESC
'''
).show(1)

+-----+----------+
|count|  location|
+-----+----------+
| 1808|Boston, MA|
+-----+----------+
only showing top 1 row



### Q3.2 - Which Twitter users are being replied to the most?

Discuss your findings.

Hint: use the top-level `in_reply_to_screen_name` for this.

**Answer**

Looking at the ten most replied-to Twitter users, we found that RedSox and Dodgers receive by far the most replies (1,360 and 814), several times more than any other account. Other baseball-related accounts in the top ten include MLB, DodgersNation, Rangers, and MLBStatoftheDay, as well as the Red Sox player David Price (DAVIDprice24). Another interesting result is that the ESPN commentator Ernesto Jerez (EJerezESPN) was among the most replied-to users. 


In [29]:
sqlc.sql('''
    SELECT count(in_reply_to_screen_name) AS count, in_reply_to_screen_name AS name
    FROM tweets
    GROUP BY name
    ORDER BY count DESC
'''
).show(10)

+-----+---------------+
|count|           name|
+-----+---------------+
| 1360|         RedSox|
|  814|        Dodgers|
|  163|            MLB|
|  101|  DodgersNation|
|   77| DonnieWahlberg|
|   75|   DAVIDprice24|
|   59|        Rangers|
|   52|     EJerezESPN|
|   43|MLBStatoftheDay|
|   42|     VeniceMase|
+-----+---------------+
only showing top 10 rows



### Q3.3 - Which 10 verified users have the most followers?  Which 10 unverified users have the most followers?

Provide both the screen names and follower counts for each.

Discuss your findings.

**Answer**

Many users appear more than once with different `followers_count` values, so we take the maximum `followers_count` per `screen_name` as that user's value.
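
The deduplication idea can be sketched in plain Python on a handful of rows: keep the largest follower count seen for each screen name, then sort. This is only an illustration of the logic, not a replacement for the Spark queries in this section. The function name and the sample rows (`user_a`, `user_b`, `user_c`) are hypothetical.

```python
def top_users_by_max_followers(rows, verified, n=10):
    """rows: iterable of (screen_name, is_verified, followers_count).

    Keeps the maximum followers_count per screen_name among users whose
    verified flag equals `verified`, then returns the top `n` pairs.
    """
    best = {}
    for screen_name, is_verified, followers in rows:
        if is_verified == verified:
            best[screen_name] = max(followers, best.get(screen_name, 0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical rows: user_a tweets twice with different follower counts.
rows = [
    ("user_a", True, 1000),
    ("user_a", True, 1010),
    ("user_b", True, 500),
    ("user_c", False, 2000),
]

print(top_users_by_max_followers(rows, verified=True))
print(top_users_by_max_followers(rows, verified=False))
```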

The ten verified users with the most followers are shown below. Among individual accounts, there are journalists such as Lopez Doriga, the former professional basketball player Magic Johnson, and the late Stan Lee. Among organizational accounts, there are MLB and major Mexican newspapers such as El Universal.

In [30]:
tweets.filter('user.verified == true') \
    .select('user.screen_name', 'user.followers_count') \
    .groupBy('screen_name').max() \
    .orderBy('max(followers_count)', ascending=False) \
    .show(10)

+---------------+--------------------+
|    screen_name|max(followers_count)|
+---------------+--------------------+
|            MLB|             8296241|
|    lopezdoriga|             7678758|
|El_Universal_Mx|             4836985|
|   MagicJohnson|             4685432|
|        Milenio|             4165323|
|    MarketWatch|             3611461|
| TheRealStanLee|             3340350|
|       Newsweek|             3324621|
|  NateSilver538|             3156640|
|          Migos|             2556415|
+---------------+--------------------+
only showing top 10 rows



The ten unverified users with the most followers are shown below. It was interesting to see that several of them were sports-related accounts.

In [31]:
tweets.filter('user.verified == false') \
    .select('user.screen_name', 'user.followers_count') \
    .groupBy('screen_name').max() \
    .orderBy('max(followers_count)', ascending=False) \
    .show(10)

+---------------+--------------------+
|    screen_name|max(followers_count)|
+---------------+--------------------+
|   DRJAMESCABOT|             2185433|
|       PAMsLOvE|              688117|
| mlbtraderumors|              666915|
| Miguel_Gurwitz|              568885|
|        ruleiro|              520622|
| RealKentMurphy|              443846|
|thebrittanyxoxo|              325224|
|FakeSportsCentr|              318999|
|    CelticsLife|              296296|
|    milenagimon|              226625|
+---------------+--------------------+
only showing top 10 rows



### Q3.4 - What are the most popular sets of hashtags among users with many followers?  Are they the same as among users with few followers?

Decide for yourself exactly how many followers you believe to be "many", and explain your decision.  You may use queries and statistics to support this decision if you wish.

Hint: if your sample tweet above does not include hashtags under the `entities` field, generate a new example by running the `shuf` command again until you find one that does.

Hint 2: the hashtag texts will be in an array, so you may need some functions you haven't used before.  If you're using SQL, see the docs for [Hive SQL](https://docs.treasuredata.com/articles/hive-functions) for details (and consider `CONCAT_WS`, for example).

Discuss your findings.

**Answer**

We ranked all users by follower count and defined the top 25% as the most-followed users and the bottom 25% as the least-followed users.

In [32]:
tweets.describe('user.followers_count').show()
tweets.select("user.followers_count").approxQuantile("followers_count", [0.25, 0.75], 0.00)

+-------+----------------+
|summary| followers_count|
+-------+----------------+
|  count|          100000|
|   mean|      4104.94204|
| stddev|85071.9871101182|
|    min|               0|
|    max|         8296241|
+-------+----------------+



[141.0, 803.0]

This gives 141 and 803 as the 25th and 75th percentiles of `followers_count`. 
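
The summary statistics above show why quartiles are a more useful cutoff than the mean here: the mean (about 4,105) lies well above the 75th percentile (803), so a small number of very large accounts dominate it. A minimal sketch of the resulting bucketing is shown below. The strict comparisons mirror the `> 803` and `< 141` conditions used in the queries in this section; the function name and bucket labels are illustrative.

```python
LOW_CUTOFF = 141   # 25th percentile reported by approxQuantile
HIGH_CUTOFF = 803  # 75th percentile reported by approxQuantile


def follower_bucket(followers_count):
    """Classify an account as having 'many', 'few', or 'middle' followers."""
    if followers_count > HIGH_CUTOFF:
        return "many"
    if followers_count < LOW_CUTOFF:
        return "few"
    return "middle"


for n in (0, 141, 199, 803, 8296241):
    print(n, follower_bucket(n))
```

Accounts sitting exactly on a cutoff (141 or 803 followers) land in the middle group under this definition.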

Most-followed users (more than 803 followers)

In [33]:
sqlc.sql('''
        select concat_ws(' ',entities.hashtags['text']) as ht,count(concat_ws(' ',entities.hashtags['text'])) as count
        from tweets
        where size(entities.hashtags) != 0
        and user.followers_count > 803
        group by ht
        order by count(concat_ws(' ',entities.hashtags['text'])) desc
        ''').show(1)

+------+-----+
|    ht|count|
+------+-----+
|RedSox| 1228|
+------+-----+
only showing top 1 row



Least-followed users (fewer than 141 followers)

In [34]:
sqlc.sql('''
        select concat_ws(' ',entities.hashtags['text']) as ht,count(concat_ws(' ',entities.hashtags['text'])) as count
        from tweets
        where size(entities.hashtags) != 0
        and user.followers_count < 141
        group by ht
        order by count(concat_ws(' ',entities.hashtags['text'])) desc
        ''').show(1)

+-----------+-----+
|         ht|count|
+-----------+-----+
|WorldSeries| 1112|
+-----------+-----+
only showing top 1 row
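
One detail of the `concat_ws` approach is worth keeping in mind: it joins the hashtag texts in the order they appear in the tweet, with their original capitalization, so `RedSox WorldSeries` and `WorldSeries RedSox` are counted as different sets. The plain-Python sketch below illustrates the difference between that key and an order-insensitive, case-insensitive one. The function names and sample lists are hypothetical, and the normalized variant is an alternative, not what the queries above compute.

```python
from collections import Counter


def hashtag_key(hashtags, normalize=False):
    """Join hashtag texts with spaces, like concat_ws(' ', ...).

    With normalize=True, lowercase, de-duplicate, and sort first, so the
    key no longer depends on order or capitalization.
    """
    if normalize:
        hashtags = sorted({h.lower() for h in hashtags})
    return " ".join(hashtags)


def most_common_sets(tweet_hashtags, normalize=False, n=1):
    """Count hashtag sets across tweets, skipping tweets without hashtags."""
    keys = (hashtag_key(tags, normalize) for tags in tweet_hashtags if tags)
    return Counter(keys).most_common(n)


# Hypothetical hashtag lists for four tweets.
sample = [
    ["RedSox", "WorldSeries"],
    ["WorldSeries", "RedSox"],
    ["Dodgers"],
    [],
]

print(most_common_sets(sample, normalize=False, n=3))
print(most_common_sets(sample, normalize=True, n=3))
```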



### Q3.5 - Analyze common words in tweet text

Following the example in class, use `tweets.rdd` to find the most common interesting words in tweet text.  To keep it "interesting", add a filter that removes at least 10 common stop words found in tweets, like "a", "an", "the", and "RT" (you might want to derive these stop words from initial results).  To split lines into words, a simple split on text whitespace like we had in class is sufficient; you do not have to account for punctuation.

After you find the most common words, use dataframe or SQL queries to find patterns among how those words are used.  For example, are they more frequently used by Dodgers or Red Sox fans, or by people in one part of the country over another?  Explore and see what you can find, and discuss your findings.

Hint: don't forget all the word count pipeline steps we used earlier in class.

**Answer**

In [35]:
tweets.rdd.flatMap(lambda r: r['text'].split(' ')) \
    .map(lambda t: (t, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .takeOrdered(10, key=lambda pair: -pair[1])

[('RT', 53854),
 ('the', 48325),
 ('to', 21653),
 ('a', 21226),
 ('World', 16829),
 ('Red', 14855),
 ('@RedSox:', 14766),
 ('in', 13992),
 ('Sox', 13374),
 ('Series', 12775)]

Removing the stop words

In [36]:
import re
# Tokens are lowercased before filtering, so only lowercase stop words can match.
stopwords = {"rt", "the", "to", "a", "for", "in", "is", "i", "of", "and", "this"}
wordlist = tweets.rdd.flatMap(lambda r: r['text'].lower().split(' ')) \
    .map(lambda t: re.sub(r'[^a-z0-9]', '', t)) \
    .filter(lambda t: t not in stopwords) \
    .filter(lambda t: len(t) > 0) \
    .map(lambda t: (t, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .takeOrdered(10, key = lambda pair: -pair[1])
for wordtext, count in wordlist:
    print("{}\t{}".format(count, wordtext))

39905	redsox
24456	dodgers
24254	series
23631	world
19968	sox
18769	red
13045	worldseries
9152	are
8379	boston
7958	on
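
For checking the pipeline on a few strings without Spark, the same steps (lowercase, split on a single space, strip non-alphanumeric characters, drop stop words and empty tokens, count) can be sketched with `collections.Counter`. The function name `top_words` is illustrative. The first sample string is made up; the second is the example tweet text shown earlier.

```python
import re
from collections import Counter

STOPWORDS = {"rt", "the", "to", "a", "for", "in", "is", "i", "of", "and", "this"}


def top_words(texts, n=10):
    """Plain-Python analogue of the flatMap / map / reduceByKey pipeline."""
    counts = Counter()
    for text in texts:
        # Mirrors split(' ') above: tabs and newlines are not separators.
        for token in text.lower().split(" "):
            word = re.sub(r"[^a-z0-9]", "", token)
            if word and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


sample = [
    "RT @RedSox: The Red Sox win the World Series!",  # hypothetical text
    "Ugh Red Sox..",
]

for word, count in top_words(sample):
    print("{}\t{}".format(count, word))
```

Because the split is on a single space only, two words separated by a newline stay in one token and are merged once punctuation is stripped; splitting with `split()` and no argument would avoid that, at the cost of slightly different counts.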


In [37]:
listword=[x[0] for x in wordlist]
listword

['redsox',
 'dodgers',
 'series',
 'world',
 'sox',
 'red',
 'worldseries',
 'are',
 'boston',
 'on']

For the analysis, we looked at which countries each common word was tweeted from. For every word, by far the most tweets came from the United States, followed by Canada, Mexico, and a few Central American countries. 
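
One caveat when reading the per-country tables in the output that follows: `place.country` is a free-text label, so the same country can appear under several names (for example "United States", "Estados Unidos", and "USA", or "México" and "Mexico"). A rough sketch of merging such labels before ranking is shown below. The alias table covers only the variants visible in this notebook's output, and the function name is illustrative. The sample rows are the `redsox` counts from that output.

```python
from collections import Counter

# Only the variants seen in this notebook's output; extend as needed.
COUNTRY_ALIASES = {
    "Estados Unidos": "United States",
    "USA": "United States",
    "México": "Mexico",
}


def merge_country_counts(rows):
    """rows: iterable of (country_label, count). Returns merged, sorted pairs."""
    merged = Counter()
    for country, count in rows:
        merged[COUNTRY_ALIASES.get(country, country)] += count
    return merged.most_common()


# Counts for the word 'redsox', copied from the output in this notebook.
redsox_rows = [
    ("United States", 201), ("Canada", 6), ("Estados Unidos", 4),
    ("Guatemala", 2), ("México", 2), ("USA", 1), ("Mexico", 1),
    ("Aruba", 1), ("Costa Rica", 1), ("United Kingdom", 1),
]

for country, count in merge_country_counts(redsox_rows):
    print("{}\t{}".format(count, country))
```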

In [38]:
for wordtext in listword:
    # LIKE is case-sensitive, so this matches the lowercase word in the raw tweet text
    condition = 'text like "%' + wordtext + '%" and country is not null'
    print(wordtext)
    tweets.select('place.country').filter(condition) \
        .groupBy('country').count() \
        .orderBy('count', ascending=False).show(10)

redsox
+--------------+-----+
|       country|count|
+--------------+-----+
| United States|  201|
|        Canada|    6|
|Estados Unidos|    4|
|     Guatemala|    2|
|        México|    2|
|           USA|    1|
|        Mexico|    1|
|         Aruba|    1|
|    Costa Rica|    1|
|United Kingdom|    1|
+--------------+-----+
only showing top 10 rows

dodgers
+------------------+-----+
|           country|count|
+------------------+-----+
|     United States|  250|
|            Canada|    5|
|            México|    5|
|Dominican Republic|    2|
|            Mexico|    1|
|                  |    1|
+------------------+-----+

series
+--------------------+-----+
|             country|count|
+--------------------+-----+
|       United States|  226|
|              Canada|    6|
|              Mexico|    3|
|              Brazil|    2|
|   Republic of Korea|    1|
|              Italia|    1|
|Republic of the P...|    1|
|      Estados Unidos|    1|
|  Dominican Republic|    1|
|          