# CULTURAL #NOWPLAYING DATA
*********************************

This dataset describes the listening events of users (extracted from the #nowplaying dataset), the emotion extracted from the hashtags used in the corresponding tweets, and the location of the user.


TWITTER AND TRACK DATA
-------------------------------
The listening events are contained in np_cultural.zip and are encoded as JSON. Each record holds the following information:

- id: the id of the underlying tweet [*]
- user_id: the id of the user who sent the tweet (MD5 of it)
- user_lang: The BCP 47 code for the user’s self-declared user interface language. [*]
- user_time_zone: the time zone the user declares themselves within [*]
- text: actual content of the tweet [*]
- tweet_lang: language of the tweet as detected by Twitter: a BCP 47 language identifier for the machine-detected language of the tweet text, or und if no language could be detected. [*]
- geo: deprecated version of coordinates; we still include it because the data predates this API change (cf. coordinates for a description)
- coordinates: Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude). [*]
- place: When present, indicates that the tweet is associated with (but not necessarily originating from) a Place. [*]
- created_at: time the tweet was sent. [*]
- source: Utility used to post the Tweet, as an HTML-formatted string. [*]
- track_title: title of the track the user tweeted about
- track_id: the unique id of the track (from #nowplaying dataset) 
- artist_name: name of artist performing the track
- artist_id: the unique id of the artist (from #nowplaying dataset) 
- hashtags: list of hashtags used in the tweet.

[*] for further information about the information gathered from Twitter, please consult https://dev.twitter.com/overview/api/tweets

Please note that we only add key-value pairs for geo/coordinates/place if this information was provided by the Twitter API (i.e., a missing key signals that the information is not available for the given tweet).
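
The following is a minimal sketch of how a single record could be read while tolerating the missing geo/coordinates/place keys. The sample record and its values are made up for illustration, and the helper name extract_lat_lon is hypothetical; only the key names and the longitude-first order come from the description above.

```python
import json

# Hypothetical record for illustration only; all values are made up.
sample_line = json.dumps({
    "id": 1,
    "user_id": "0123abcd",
    "track_title": "Example Track",
    "artist_name": "Example Artist",
    "hashtags": ["nowplaying", "happy"],
    "coordinates": {"type": "Point", "coordinates": [11.39, 47.27]},
})


def extract_lat_lon(tweet):
    """Return (latitude, longitude), or None if the tweet has no coordinates."""
    coords = tweet.get("coordinates")  # the key is absent when no location was provided
    if not coords:
        return None
    lon, lat = coords["coordinates"]  # geoJSON order: longitude first, then latitude
    return lat, lon


tweet = json.loads(sample_line)
print(extract_lat_lon(tweet))
print(extract_lat_lon({"id": 2}))  # record without any location keys
```

Using dict.get instead of direct indexing avoids a KeyError for the tweets that carry no location information.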


SENTIMENT DATA
-------------------------------
The sentiment of the hashtags (where one could be obtained) is contained in np_cultural_sentiment.csv, formatted as CSV. The sentiment scores were obtained by applying a set of well-known sentiment dictionaries and are scaled between 0 and 1 (very negative to very positive). For each dictionary, we list the minimum, maximum, sum and average sentiment score across all hashtags used within a tweet (most tweets feature only a single hashtag we can assign a sentiment value to).
The file contains the following information (in this particular order):
- name of the hashtag
- AFINN dictionary (min, max, sum, avg)
- Opinion Lexicon (min, max, sum, avg)
- Sentistrength Lexicon (min, max, sum, avg)
- Vader (min, max, sum, avg)
- Sentiment Hashtag Lexicon (min, max, sum, avg)


Please note that we only added hashtags for which we could obtain a sentiment value from at least one sentiment dictionary.
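
The column order above can be illustrated with a small sketch that groups one CSV row into per-dictionary statistics. The sample row is made up, the helper name parse_row and the short dictionary labels are hypothetical, and the sketch assumes that a dictionary without a score for a hashtag leaves its four cells empty; the actual encoding of missing values in the file may differ.

```python
import csv
import io

# Dictionary order as listed above; the short labels are chosen for illustration.
LEXICONS = ["AFINN", "OpinionLex", "Sentistrength", "Vader", "SentimentHashtag"]
STATS = ["min", "max", "sum", "avg"]


def parse_row(row):
    """Split one CSV row into the hashtag and a {lexicon: {stat: value}} mapping."""
    hashtag, values = row[0], row[1:]
    scores = {}
    for i, lexicon in enumerate(LEXICONS):
        chunk = values[i * 4:(i + 1) * 4]
        # Assumption: empty cells mean this dictionary has no score for the hashtag.
        if len(chunk) < 4 or any(cell == "" for cell in chunk):
            continue
        scores[lexicon] = dict(zip(STATS, map(float, chunk)))
    return hashtag, scores


# Hypothetical row: scores from AFINN and Vader only.
sample_row = ["happy"] + ["0.8"] * 4 + [""] * 8 + ["0.9"] * 4 + [""] * 4
reader = csv.reader(io.StringIO(",".join(sample_row)))
hashtag, scores = parse_row(next(reader))
print(hashtag, scores)
```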

In [2]:
import pymongo
from pymongo import MongoClient
import pandas as pd
import numpy as np 
import json
import csv
import pprint

In [3]:
#Call Mongo Instance
client = MongoClient()
db = client.now_playing

In [5]:
## NowPlaying Cultural Sentiment Data
cs_header = ["hashtag", "AFINN_min", "AFINN_max", "AFINN_sum", "AFINN_avg", "OpinionLex_min", "OpinionLex_max", "OpinionLex_sum", "OpinionLex_avg", "Sentistrength_min", "Sentistrength_max", "Sentistrength_sum", "Sentistrength_avg", "Vader_min", "Vader_max", "Vader_sum", "Vader_avg", "SentimentHashtag_min", "SentimentHashtag_max", "SentimentHashtag_sum", "SentimentHashtag_avg"]
np_cs = pd.read_csv('np_cultural_sentiment.csv',names = cs_header)

#IMPORTING Cultural Sentiment Data to MongoDB
np_cs_json = json.loads(np_cs.to_json(orient="records"))
db.cultural_sentiment.drop()
db.cultural_sentiment.insert_many(np_cs_json)

<pymongo.results.InsertManyResult at 0x114b84948>

In [3]:
## Transformation of NowPlaying Cultural Tweets to DataframeStructure
output = []
with open("np_cultural.json") as f:
    for line in f:     
        output.append(json.loads(line))
np_c = pd.DataFrame(output)
#IMPORTING Cultural Tweets to MongoDB
db.cultural_tweet.drop()
records = json.loads(np_c.T.to_json()).values()
db.cultural_tweet.insert_many(records)

<pymongo.results.InsertManyResult at 0x20a6a9438>

In [4]:
#db.cultural_tweet.find({"hashtag": "nobeats"})
user_tweets = db.cultural_tweet.find({"user_id": '1c10f9788fdcc4baf6cf6a2631fe78bc12102418'})

In [11]:
#TOTAL Number of Tweets
# Cursor.count() was removed in PyMongo 4; count_documents() is the supported call
db.cultural_tweet.count_documents({})

564301

In [15]:
user_ids = db.cultural_tweet.distinct("user_id")
len(user_ids)

9431
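
As a cross-check that does not need a running MongoDB instance, the tweet and user counts can also be derived from the parsed records directly. This is only a sketch: the records list below is made up and stands in for the parsed JSON lines, and only the user_id key is taken from the dataset description.

```python
from collections import Counter

# Hypothetical records standing in for the parsed JSON lines.
records = [
    {"user_id": "a", "track_title": "t1"},
    {"user_id": "b", "track_title": "t2"},
    {"user_id": "a", "track_title": "t3"},
]

# Count listening events per user.
tweets_per_user = Counter(record["user_id"] for record in records)

n_tweets = sum(tweets_per_user.values())  # total number of tweets
n_users = len(tweets_per_user)            # number of distinct users
most_active = tweets_per_user.most_common(1)[0]
print(n_tweets, n_users, most_active)
```

The same Counter also gives the per-user activity distribution, which the two counts alone do not show.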

In [16]:
for user_id in user_ids:
    pass  # the loop body is missing from the source

list