# Reddite

In this notebook we show all the code used for the analysis, together with all the graphics associated with the data obtained.


## Introduction

We downloaded data from reddite using various methods, which did not all provide the same fields, so we standardized on the minimal set of fields that would still be useful. Those fields are:
* The date when the record was created.
* ID of the tweet, which is unique.
* The text of the comment.
* The user, which contains the user screen name (username) and the user ID.

Once the download period had ended, the amount of data gathered was **500000** unique tweets, which is a decent amount of data to analyse.

## Analysis

1) The first step is to filter the data into separate files, one per platform, which are used for the rest of the analysis. For now, the filter is keyword-based: the keywords are the platform names (i.e. Nintendo, Playstation, Xbox), and they are searched for in the text and the username. If a record matches more than one platform, so that there is no clear preference (as opposed to matching none at all), it is inserted into each of the matching files.

The ideal method to filter the data would be to build a database of keywords associated with each platform, so that for every record we could calculate, from all the words in its text, the probability that it belongs to each platform. This would require a Naive Bayes model, and like most predictive models it needs training data, which we do not have at this moment.
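
To make the idea concrete, here is a minimal sketch of such a classifier, written from scratch as a multinomial Naive Bayes with Laplace smoothing. It is not part of the analysis: the function names `train` and `platform_probabilities` and the tiny labelled examples are hypothetical, and a real model would need a properly labelled training set.

```python
import math
from collections import Counter, defaultdict


def train(labelled_texts):
    """Build word statistics from (text, platform) pairs (hypothetical training data)."""
    word_counts = defaultdict(Counter)  # platform -> word -> occurrences
    doc_counts = Counter()              # platform -> number of training texts
    for text, platform in labelled_texts:
        doc_counts[platform] += 1
        word_counts[platform].update(text.lower().split())
    vocabulary = set(w for counts in word_counts.values() for w in counts)
    return word_counts, doc_counts, vocabulary


def platform_probabilities(text, model):
    """Return a dict mapping each platform to the estimated probability for this text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for platform in doc_counts:
        total_words = sum(word_counts[platform].values())
        # Start from the prior: the share of training texts with this label.
        score = math.log(doc_counts[platform] / float(total_docs))
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[platform][word] + 1.0)
                              / (total_words + len(vocabulary)))
        log_scores[platform] = score
    # Convert log scores back to probabilities that sum to one.
    top = max(log_scores.values())
    weights = {p: math.exp(s - top) for p, s in log_scores.items()}
    norm = sum(weights.values())
    return {p: w / norm for p, w in weights.items()}


# Hypothetical labelled examples, for illustration only.
model = train([
    ("new zelda on switch", "Nintendo"),
    ("mario kart on switch", "Nintendo"),
    ("halo on xbox one", "Xbox"),
    ("god of war on ps4", "Playstation"),
])
probabilities = platform_probabilities("zelda switch", model)
```

With probabilities like these, a record could be assigned to the most likely platform, or to several platforms when no probability clearly dominates.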

**Note**: From here on, we refer to each file as follows:
* project_tweets01.data -> Nintendo
* project_tweets02.data -> Playstation
* project_tweets03.data -> Xbox
* project_tweets04.data -> Else (everything that did not fit in the other categories)
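
The keyword filter described in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper name `target_files` is hypothetical, matching is assumed to be a case-insensitive substring search, and the record layout (`text`, `user.screen_name`) is taken from the fields listed in the introduction.

```python
# Keyword -> output file, following the file naming in the note above.
PLATFORM_FILES = {
    "nintendo": "project_tweets01.data",
    "playstation": "project_tweets02.data",
    "xbox": "project_tweets03.data",
}
ELSE_FILE = "project_tweets04.data"


def target_files(record):
    """Return the files a record belongs to (hypothetical helper)."""
    # Search the keywords in both the text and the username.
    haystack = (record["text"] + " " + record["user"]["screen_name"]).lower()
    matches = sorted(filename for keyword, filename in PLATFORM_FILES.items()
                     if keyword in haystack)
    # A record matching several platforms goes to every matching file;
    # a record matching none goes to the "Else" file.
    return matches or [ELSE_FILE]


both = target_files({"text": "Halo on Xbox and PlayStation",
                     "user": {"screen_name": "someone"}})
by_name = target_files({"text": "great deals today",
                        "user": {"screen_name": "Nintendo_Fan"}})
neither = target_files({"text": "hello world",
                        "user": {"screen_name": "someone"}})
```

Because a record can land in more than one file, the per-platform files may add up to more records than the unfiltered data set contains.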

In [None]:
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONValueProtocol
from datetime import datetime
import itertools

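class MRWordFrequencyCount(MRJob):
	# Each input line is one JSON-encoded record.
	INPUT_PROTOCOL = JSONValueProtocol

	def mapper(self, _, record):
		# Key each record by its ISO week number and pass the raw timestamp along.
		date = datetime.fromtimestamp(int(record['created_at']))
		week = date.isocalendar()[1]
		yield week, int(record['created_at'])

	def reducer(self, key, values):
		# Emit one count per (week, hour of day) pair.
		for i in values:
			yield (key, datetime.fromtimestamp(i).hour), 1

	def max_reducer(self, stat, values):
		# Sum the counts for each (week, hour of day) pair.
		yield stat, sum(values)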

	def steps(self):
		return [MRStep(mapper=self.mapper, reducer=self.reducer),
				MRStep(reducer=self.max_reducer)]

if __name__ == '__main__':
	MRWordFrequencyCount.run()


2) The first analysis on the data, now filtered by platform, is to count the number of records per platform. This is done with the following script.

In [None]:
import time
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONValueProtocol
from datetime import datetime
import itertools

class MRWordFrequencyCount(MRJob):
	INPUT_PROTOCOL = JSONValueProtocol

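	def mapper(self, _, record):
		# Emit one count per record, keyed by its platform classification.
		yield record['classification'], 1

	def max_reducer(self, stat, values):
		# Total number of records for each classification.
		yield stat, sum(values)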

	def steps(self):
		return [MRStep(mapper=self.mapper, reducer=self.max_reducer)]

if __name__ == '__main__':
	MRWordFrequencyCount.run()

**Results**

project_tweets01.data

1317667 

project_tweets02.data

2337881 

project_tweets03.data

2260157 

project_tweets04.data

735410 

project_tweets.data

6542718 

Time taken to completion of the metric: 37.615491 in processor time

**Analysis**

From this we can see that the "else" category holds **11.24%** of all the data, which is not a small amount. Considering that our platform filter is a brute-force one, this is acceptable.
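
The percentage can be reproduced from the counts in the results. The sketch below assumes that `project_tweets.data` is the complete, unfiltered data set; under that assumption, the surplus of the four per-file counts over the total would correspond to extra copies of records that were written to more than one platform file.

```python
# Record counts reported in the results above.
records = {
    "project_tweets01.data": 1317667,  # Nintendo
    "project_tweets02.data": 2337881,  # Playstation
    "project_tweets03.data": 2260157,  # Xbox
    "project_tweets04.data": 735410,   # Else
}
total_records = 6542718  # project_tweets.data (assumed to be the unfiltered set)

# Share of the data that matched no platform keyword.
else_share = 100.0 * records["project_tweets04.data"] / total_records

# Surplus of the per-file counts over the total.
extra_copies = sum(records.values()) - total_records

print("Else share: {0:.2f}%".format(else_share))  # Else share: 11.24%
print("Extra copies: {0}".format(extra_copies))   # Extra copies: 108397
```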

3) The next analysis counts the number of unique users per platform. This is done with the following scripts.

In [None]:
import time
import sys

if __name__ == '__main__':
    time_start = time.clock()
    # Clean File.
    open("user_amount_by_platform_summary.txt", 'w').close()
    files = ["project_tweets01.data", "project_tweets02.data", "project_tweets03.data", "project_tweets04.data"]
    for _file in files:
        # Parameters for mrjob.
        # To run the job in multiple subprocesses with a few Hadoop features simulated, use "-r local".
        option1 = ""  # set to "-r" to select a runner
        option2 = ""  # set to "local" to use the local runner
        sys.argv = ['user_amount.py', option1, option2, _file]
        # Write to file in append mode.
        _fo = open("user_amount_by_platform_summary.txt", 'a')
        sys.stdout = _fo
        print _file
        execfile('user_amount.py')
        print "\n"

    time_end = time.clock()

    print "Time taken to completion of the metric: {0} in processor time".format(time_end - time_start)


In [None]:
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONValueProtocol
import time
import itertools
import sys

class MRWordFrequencyCount(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol

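    def mapper(self, _, record):
        # Emit every record keyed by its author's screen name.
        yield [record['user']['screen_name'], 1]

    def reducer(self, key, values):
        # Collapse all the records of one user into a single entry.
        yield [key, 1]

    def mapper2(self, key, values):
        # Re-key every user under one shared key so they can be summed.
        yield ['amount_users', values]

    def reducer2(self, key, values):
        # The sum of the per-user ones is the number of unique users.
        yield [key, sum(values)]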

    def steps(self):
        return [MRStep(mapper=self.mapper, reducer=self.reducer),
                MRStep(mapper=self.mapper2, reducer=self.reducer2)]


if __name__ == '__main__':
    #time_start = time.clock()
    MRWordFrequencyCount().run()
    #time_end = time.clock()
    #print "Time taken to completion of the metric: {0} in processor time".format(time_end - time_start)


**Results**

project_tweets01.data

"amount_users"	426450


project_tweets02.data

"amount_users"	574308


project_tweets03.data

"amount_users"	727273


project_tweets04.data

"amount_users"	265827


Time taken to completion of the metric: 426.030981 in processor time


**Analysis**

From this we can see that Xbox has more unique users than any other platform by a wide margin: about 1.7 times as many as Nintendo. Interestingly, the previous analysis showed that Xbox had fewer records than Playstation, yet it has roughly 150k more unique users. This suggests either that each Playstation user produces more content on average, or that a small number of users produce a large share of the content for that platform.
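
The comparison can be made explicit by dividing the record counts from step 2 by the unique user counts from step 3. This is a small sketch using only the figures reported in the results; the variable names are illustrative.

```python
# (records, unique users) per platform, taken from the results of steps 2 and 3.
platforms = {
    "Nintendo": (1317667, 426450),
    "Playstation": (2337881, 574308),
    "Xbox": (2260157, 727273),
    "Else": (735410, 265827),
}

# Average number of records per unique user on each platform.
records_per_user = {name: float(records) / users
                    for name, (records, users) in platforms.items()}

# How many more unique users Xbox has than Playstation, and its ratio to Nintendo.
user_gap = platforms["Xbox"][1] - platforms["Playstation"][1]
xbox_to_nintendo = float(platforms["Xbox"][1]) / platforms["Nintendo"][1]

for name in sorted(records_per_user):
    print("{0}: {1:.2f} records per user".format(name, records_per_user[name]))
```

An average alone cannot tell the two explanations apart; the distribution of records per user, such as the top-user ranking in the next step, is needed for that.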

4) The next analysis finds the top 10 users who generate the most content on each platform. This is done with the following scripts.

In [None]:
import time
import sys

if __name__ == '__main__':
    time_start = time.clock()
    # Clean File.
    open("top_users_by_platform_summary.txt", 'w').close()
    files = ["project_tweets01.data", "project_tweets02.data", "project_tweets03.data", "project_tweets04.data"]
    for _file in files:
        # Parameters for mrjob.
        # To run the job in multiple subprocesses with a few Hadoop features simulated, use "-r local".
        option1 = ""  # set to "-r" to select a runner
        option2 = ""  # set to "local" to use the local runner
        sys.argv = ['top_users.py', option1, option2, _file]
        # Write to file in append mode.
        _fo = open("top_users_by_platform_summary.txt", 'a')
        sys.stdout = _fo
        print _file
        execfile('top_users.py')
        print "\n"

    time_end = time.clock()

    print "Time taken to completion of the metric: {0} in processor time".format(time_end - time_start)


In [None]:
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONValueProtocol
import time
import itertools
import operator
import sys

class MRWordFrequencyCount(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, record):
        yield [record['user']['screen_name'], 1]

    def reducer(self, key, values):
        yield ["top_user", (sum(values), key)]

    def reducer2(self, key, values):
        # Each value is a [tweet_count, screen_name] pair from the first step.
        # Sort by tweet count, breaking ties by screen name, in descending order.
        top_users = sorted(values, key=lambda v: (v[0], v[1]), reverse=True)
        for tweet_count, screen_name in top_users[0:10]:
            yield [screen_name, tweet_count]

    def steps(self):
        return [MRStep(mapper=self.mapper, reducer=self.reducer),
                MRStep(reducer=self.reducer2)]


if __name__ == '__main__':
    #time_start = time.clock()
    MRWordFrequencyCount().run()
    #time_end = time.clock()
    #print "Time taken to completion of the metric: {0} in processor time".format(time_end - time_start)
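
For a quick check on a small sample that fits in memory, the same kind of ranking can be sketched without mrjob. This is an illustration only: the helper name `top_users` is hypothetical, and the ranking is assumed to be by record count, with ties broken by screen name in descending order.

```python
from collections import Counter


def top_users(records, n=10):
    """Return the n (screen_name, record_count) pairs with the most records."""
    counts = Counter(record["user"]["screen_name"] for record in records)
    # Rank by count first, then by screen name, both in descending order.
    ranked = sorted(counts.items(),
                    key=lambda item: (item[1], item[0]),
                    reverse=True)
    return ranked[:n]


# Hypothetical sample records, for illustration only.
sample = [
    {"user": {"screen_name": "a"}},
    {"user": {"screen_name": "b"}},
    {"user": {"screen_name": "a"}},
    {"user": {"screen_name": "c"}},
    {"user": {"screen_name": "a"}},
]
ranking = top_users(sample, n=2)
```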


**Results**

project_tweets01.data
* "savetimeandmoey"	11391
* "AuctionPorn"	8449
* "AmazonBay4u"	8352
* "retrodeals"	7435
* "Nintendo_Legend"	6895
* "retrodealsUK"	6744
* "LastChanceGamer"	5904
* "GameUP247"	5580
* "RetroNuss"	5441
* "Nintendoe3E3"	5438


project_tweets02.data
* "Cammie_Whybrew"	13616
* "AskPlayStation"	13282
* "eBayShopperNews"	11578
* "VideoGamesMall"	10274
* "savetimeandmoey"	8632
* "collinschristof"	5382
* "topnewskoeln"	5297
* "Gamifive"	4829
* "Xbox_360_Gamez"	4080
* "pressebank"	3799


project_tweets03.data
* "Xbox_360_Gamez"	34137
* "XboxSupport"	14421
* "VideoGamesMall"	11835
* "Xbox_One_Reddit"	10215
* "xboxgamersdeals"	9224
* "GameUP247"	9156
* "KingsleyNewz"	9139
* "bullzyy"	7576
* "savetimeandmoey"	7176
* "giveawayxfab"	7028


project_tweets04.data
* "savetimeandmoey"	15483
* "VideoGames_Up"	13804
* "giveawaygigatop"	8878
* "tw100_1"	6538
* "videogames_pt"	6357
* "videogames_fr"	5993
* "giveawayxfab"	5886
* "DMGG_Videogames"	5295
* "ShoppeWorld"	4824
* "VideoGames_TV"	3709


Time taken to completion of the metric: 335.930534 in processor time


**Analysis**

From this we can see that (requires more analysis).

5) Another analysis would be ...