# Soccer tweets analysis about countries

In this exercise we take a CSV file exported directly from a MongoDB collection that holds data about Twitter users who have been tweeting about soccer.

We start with two files: soccer_tweets.csv, which contains the "tweet_text" field, and country-list.csv. Using both files we will report the countries with the most popularity on Twitter during this event. A good way to approach this problem is to count the words used in the tweets from our dataset and then find which countries were mentioned the most.

In [1]:
# Import and create a new SQLContext 
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In the next cell we load the countries dataset into a DataFrame that we can later join with the word counts computed from the tweets, that is, every word and how many times it is repeated.

In [2]:
# Read the country CSV file into an RDD.
path = 'file:///home/cloudera/Downloads/big-data-3/final-project/'
country_lines = sc.textFile(path + 'country-list.csv')

country_tuples = country_lines\
    .map(lambda line: tuple(line.split(",")))

countryDF = sqlContext.createDataFrame(country_tuples, ["country", "code"])
countryDF.printSchema()
countryDF.take(3)
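
The joined rows shown further down carry codes such as ' THA' with a leading space, which suggests that the country file has a space after each comma. Below is a minimal sketch of a parser that trims that whitespace, assuming every line has the form `Name, CODE`. The helper name `parse_country_line` and the sample lines are illustrative and not part of the original notebook.

```python
def parse_country_line(line):
    """Split one 'Name, CODE' line into a (country, code) tuple without stray spaces."""
    # Split on the last comma only, so a name that itself contains a comma stays intact.
    name, code = line.rsplit(",", 1)
    return (name.strip(), code.strip())


# Illustrative sample lines in the assumed 'Name, CODE' layout.
sample_lines = ["Afghanistan, AFG", "Albania, ALB", "Algeria, ALG"]
parsed = [parse_country_line(line) for line in sample_lines]
print(parsed)
```

In the Spark pipeline, a function like this could be passed to `country_lines.map(...)` in place of the `split` lambda.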

We do exactly the same with the tweet texts, but some tweets are empty, so we filter them out in order to avoid empty lines.

In [5]:
# Read tweets CSV file into RDD 
soccer_tweets= sc.textFile(path + 'soccer_tweets.csv')

cleaned_soccer_tweets = soccer_tweets\
    .filter(lambda line: len(line)>0)\
    .collect()
cleaned_soccer_tweets.pop(0) #removing header

At this point we write all the tweet contents to a file and then do the following:
* Split each line into words and store them in an RDD.
* Assign an initial count to each word by creating a (word, 1) tuple for each word.
* Sum all the count values for each word by using the reduceByKey() method.
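
The three steps above can be sketched in plain Python, without Spark, to show what each one produces. This is only a local illustration of the same logic. The function name `count_words` and the sample tweets are made up for the example.

```python
from collections import defaultdict


def count_words(lines):
    """Plain-Python equivalent of flatMap -> map -> reduceByKey."""
    # Step 1: split each line into words and flatten into one list (flatMap).
    words = [word for line in lines for word in line.split(" ")]
    # Step 2: pair every word with an initial count of 1 (map).
    pairs = [(word, 1) for word in words]
    # Step 3: add up the counts of pairs that share the same word (reduceByKey).
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Made-up sample tweets, for illustration only.
sample_tweets = ["France beat Norway", "Norway played well"]
counts = count_words(sample_tweets)
print(counts)
```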

In [7]:
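# Write the cleaned tweets to words.txt, one tweet per line.
# Mode 'w' creates the file or truncates an existing one, so a separate
# os.remove() call is not needed (it would fail when the file does not exist yet).
# Note: this writes to the current working directory, which is assumed
# to be the same directory that `path` points to.
with open('words.txt', 'w') as filehandle:
    for tweet in cleaned_soccer_tweets:
        filehandle.write('%s\n' % tweet)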
        
lines = sc.textFile(path + 'words.txt')
words = lines.flatMap(lambda line: line.split(" "))
tuples = words.map(lambda word : (word, 1))
words_counts = tuples.reduceByKey(lambda a, b: (a + b))
# words_counts is converted into a DataFrame in the next cell.

In [8]:
# Create the DataFrame of tweet word counts.
# The word column is named "country" so that it can be joined with countryDF.
wordsDF = sqlContext.createDataFrame(words_counts, ["country", "times"])
wordsDF.printSchema()
wordsDF.take(3)

root
 |-- country: string (nullable = true)
 |-- times: long (nullable = true)



[Row(country='', times=3292),
 Row(country='https://t.co/fQftAwGAad', times=1),
 Row(country='mobile', times=1)]
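
The first row of this output counts the empty string 3292 times. The most likely cause is that `split(" ")` returns an empty string for every extra space between words. Below is a hedged sketch of a stricter tokenizer, assuming we want to skip links and strip surrounding punctuation. The function name `tokenize` and the sample tweet (including its URL) are hypothetical. Cleaning tokens this way would change the counts reported in this notebook.

```python
import string


def tokenize(tweet):
    """Return the words of a tweet without empty tokens, links or edge punctuation."""
    tokens = []
    # split() with no argument splits on any run of whitespace,
    # so it never yields empty strings.
    for raw in tweet.split():
        if raw.startswith("http"):
            continue  # skip links
        # Remove punctuation at both ends, e.g. '#France' -> 'France', 'Norway!' -> 'Norway'.
        token = raw.strip(string.punctuation)
        if token:
            tokens.append(token)
    return tokens


# Hypothetical tweet text, for illustration only.
tokens = tokenize("Go  Norway!  #France vs. Wales https://t.co/example")
print(tokens)
```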

Once we have counted how many times each word appears and we have a DataFrame that contains the list of countries, we can join both DataFrames to accomplish the goal of this exercise: measuring the popularity of each country.

In [9]:
# Join the country and tweet DataFrames (on the appropriate column)
merge_result = countryDF.join(wordsDF, 'country')
merge_result.take(3)

[Row(country='Thailand', code=' THA', times=1),
 Row(country='Iceland', code=' ISL', times=2),
 Row(country='Mexico', code=' MEX', times=1)]
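
The join above is an inner join on exact string equality, so only words that match a country name character for character survive. A small plain-Python sketch of that behavior follows. The counts for Norway and Wales come from the outputs in this notebook, while the "United States" entry, its code, and the other counts are made-up values used to show which words fail to match.

```python
def inner_join(countries, word_counts):
    """Keep only countries whose exact name appears as a counted word."""
    return [
        (country, code, word_counts[country])
        for country, code in countries
        if country in word_counts
    ]


# "United States"/"USA" is a hypothetical entry, not taken from the country file.
countries = [("Norway", "NOR"), ("United States", "USA"), ("Wales", "WAL")]
# "norway", "United" and "States" carry made-up counts, for illustration only.
word_counts = {"Norway": 52, "Wales": 19, "norway": 3, "United": 4, "States": 4}

joined = inner_join(countries, word_counts)
print(joined)
```

Two consequences are worth keeping in mind when reading the results: lowercase mentions such as "norway" are not counted, and a multi-word name can never match because the tweets were split into single words.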

Finally, we have a DataFrame in which each row contains a country and how many times it has been mentioned in our tweets. This allows us to run some interesting queries using Spark SQL, as follows:

In [11]:
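# Total number of times countries were mentioned in tweets.
# The functions module is imported under an alias so that Spark's sum
# does not shadow Python's built-in sum.
import pyspark.sql.functions as F
merge_result.select(F.sum("times")).show()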

+----------+
|sum(times)|
+----------+
|       397|
+----------+



In [12]:
# Top three countries by popularity

from pyspark.sql.functions import desc

merge_result.sort(desc("times")).show(3)

+-------+----+-----+
|country|code|times|
+-------+----+-----+
| Norway| NOR|   52|
|Nigeria| NGA|   49|
| France| FRA|   42|
+-------+----+-----+
only showing top 3 rows
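
As a rough cross-check of this ranking, the same "top N" selection can be sketched in plain Python. The dictionary below holds only the seven countries whose counts appear in this notebook's outputs, not the full joined DataFrame, and 397 is the total reported by the `sum("times")` query.

```python
import heapq

# Counts copied from the outputs shown in this notebook (a partial list).
mentions = {
    "Norway": 52, "Nigeria": 49, "France": 42, "Wales": 19,
    "Iceland": 2, "Thailand": 1, "Mexico": 1,
}
total_mentions = 397  # total reported by the sum("times") query

# Pick the three entries with the largest counts, highest first.
top_three = heapq.nlargest(3, mentions.items(), key=lambda item: item[1])

# Fraction of all country mentions that the top three account for.
top_three_share = sum(times for _, times in top_three) / total_mentions

print(top_three)
print(round(top_three_share, 3))
```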



In [13]:
# Look up the number of mentions for a single country

merge_result.filter(merge_result["country"]=="Wales").show()

+-------+----+-----+
|country|code|times|
+-------+----+-----+
|  Wales| WAL|   19|
+-------+----+-----+



In [14]:
# Average number of mentions per matched country.

from pyspark.sql.functions import mean

merge_result.select(mean("times")).show()

+-----------------+
|       avg(times)|
+-----------------+
|9.022727272727273|
+-----------------+
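
The total and the average reported above also tell us how many countries matched at least one word, since the average is the total divided by the number of rows. A quick arithmetic check, using only the two figures printed by the queries:

```python
total_mentions = 397                 # from the sum("times") query
mean_mentions = 9.022727272727273    # from the mean("times") query

# mean = total / rows, so rows = total / mean.
matched_countries = round(total_mentions / mean_mentions)
print(matched_countries)
```

So the join kept 44 countries, and every country in the list that never appeared as a word in the tweets was dropped by the inner join.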

