# CULTURAL #NOWPLAYING DATA
*********************************

This dataset describes the listening events of users (extracted from the #nowplaying dataset), the emotion extracted from the hashtags used in the corresponding tweets, and the location of the user.


TWITTER AND TRACK DATA
-------------------------------
The listening-event data is contained in np_cultural.zip and is encoded as JSON. It holds the following information:

- id: the id of the underlying tweet [*]
- user_id: the id of the user who sent the tweet (MD5 of it)
- user_lang: The BCP 47 code for the user’s self-declared user interface language. [*]
- user_time_zone: the time zone set in the user's profile [*]
- text: actual content of the tweet [*]
- tweet_lang: language of the tweet (as detected by Twitter; BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected.) [*]
- geo: Deprecated version of coordinates (however, we deal with data stemming from before this API change, therefore we still include it; cf. coordinates for a description)
- coordinates: Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude). [*]
- place: When present, indicates that the tweet is associated with (but not necessarily originating from) a Place. [*]
- created_at: time the tweet was sent. [*]
- source: Utility used to post the Tweet, as an HTML-formatted string. [*]
- track_title: title of the track the user tweeted about
- track_id: the unique id of the track (from #nowplaying dataset) 
- artist_name: name of artist performing the track
- artist_id: the unique id of the artist (from #nowplaying dataset) 
- hashtags: list of hashtags used in the tweet.

[*] for further information about the information gathered from Twitter, please consult https://dev.twitter.com/overview/api/tweets

Please note that we only add key-value pairs for geo/coordinates/place information if this information was provided by the Twitter API (i.e., missing keys signal that this information is not available for the given tweet).
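
Because the location keys may be missing, consumers should not index them directly. The following is a minimal sketch of reading a position from one tweet record. The helper name `tweet_lat_lon` is hypothetical, and the `[latitude, longitude]` order for the deprecated `geo` field is an assumption that should be checked against the data.

```python
def tweet_lat_lon(tweet):
    """Return (latitude, longitude) for a tweet dict, or None if unavailable."""
    # "coordinates" is GeoJSON: the inner array is [longitude, latitude].
    point = tweet.get("coordinates")
    if point and point.get("coordinates"):
        lon, lat = point["coordinates"]
        return (lat, lon)
    # Fall back to the deprecated "geo" field.
    # ASSUMPTION: its inner array is ordered [latitude, longitude].
    geo = tweet.get("geo")
    if geo and geo.get("coordinates"):
        lat, lon = geo["coordinates"]
        return (lat, lon)
    # Key missing or explicitly null: no location available.
    return None
```

Using `dict.get` covers both a missing key and a key whose value is null, so the same helper works on the raw JSON records and on records that were round-tripped through a DataFrame.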


SENTIMENT DATA
-------------------------------
The sentiment data for the hashtags (where any could be obtained) is contained in np_cultural_sentiment.csv in CSV format. The sentiment scores were obtained by applying a set of well-known sentiment dictionaries and are scaled between 0 and 1 (very negative to very positive). For each dictionary, we list the minimum, maximum, sum and average sentiment score across all hashtags used within a tweet (most tweets feature only a single hashtag to which we can assign a sentiment value).
It contains the following information (in this particular order):
- name of the hashtag
- AFINN dictionary (min, max, sum, avg)
- Opinion Lexicon (min, max, sum, avg)
- Sentistrength Lexicon (min, max, sum, avg)
- Vader (min, max, sum, avg)
- Sentiment Hashtag Lexicon (min, max, sum, avg)


Please note that we only added hashtags for which we could obtain a sentiment value from at least one sentiment dictionary.
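
The column order above can be turned into named fields when reading the file. This is a rough sketch using only the standard library. The column names and the function `parse_sentiment_rows` are illustrative choices, and treating unparseable cells as missing values is an assumption about how absent scores are encoded.

```python
import csv

LEXICONS = ["AFINN", "OpinionLex", "Sentistrength", "Vader", "SentimentHashtag"]
STATS = ["min", "max", "sum", "avg"]
# One hashtag column followed by four statistics per dictionary (21 columns).
COLUMNS = ["hashtag"] + ["%s_%s" % (lex, stat) for lex in LEXICONS for stat in STATS]

def parse_sentiment_rows(lines):
    """Yield one dict per CSV row, keyed by COLUMNS."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        record = dict.fromkeys(COLUMNS)  # every score defaults to None
        record["hashtag"] = row[0]
        for name, value in zip(COLUMNS[1:], row[1:]):
            try:
                record[name] = float(value)
            except ValueError:
                # ASSUMPTION: a cell that is not a number means "no score".
                record[name] = None
        yield record
```

`parse_sentiment_rows` accepts any iterable of lines, so an open file handle for np_cultural_sentiment.csv can be passed in directly.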

In [2]:
import pymongo
from pymongo import MongoClient
import pandas as pd
import numpy as np 
import json
import csv
import pprint
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials


In [3]:
#Call Mongo Instance
client = MongoClient()
db = client.now_playing

#Spotify Instance
# Credentials are read from the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables
client_credentials_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [27]:
## NowPlaying Cultural Sentiment Data
#cs_header = ["hashtag", "AFINN_min", "AFINN_max", "AFINN_sum", "AFINN_avg", "OpinionLex_min", "OpinionLex_max", "OpinionLex_sum", "OpinionLex_avg", "Sentistrength_min", "Sentistrength_max", "Sentistrength_sum", "Sentistrength_avg", "Vader_min", "Vader_max", "Vader_sum", "Vader_avg", "SentimentHashtag_min", "SentimentHashtag_max", "SentimentHashtag_sum", "SentimentHashtag_avg"]
#np_cs = pd.read_csv('np_cultural_sentiment.csv',names = cs_header)

#IMPORTING Cultural Sentiment Data to MongoDB
#np_cs_json = json.loads(np_cs.to_json(orient="records"))
#db.cultural_sentiment.drop()
#db.cultural_sentiment.insert_many(np_cs_json)

In [28]:
## Transformation of NowPlaying Cultural Tweets to DataframeStructure
#output = []
#with open("np_cultural.json") as f:
#    for line in f:     
#        output.append(json.loads(line))
#np_c = pd.DataFrame(output)
#IMPORTING Cultural Tweets to MongoDB
#db.cultural_tweet.drop()
#records = json.loads(np_c.T.to_json()).values()
#db.cultural_tweet.insert_many(records)

### Importing the MusicBrainz Mapping of artists and tracks

`mongoimport -d now_playing -c mb_tracks --type csv --file mb_tracks.csv --fields "np_id,mb_track_id"`

`mongoimport -d now_playing -c mb_artists --type csv --file mb_artists.csv --fields "np_id,mb_artist_id"`
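
Once the mapping collections are imported, a tweet's `track_id` can be resolved to its MusicBrainz id through the `np_id` field. The following is a small sketch of that lookup. The helper name `musicbrainz_track_id` is hypothetical, and it assumes `np_id` holds the same ids as `track_id` in the tweet documents.

```python
def musicbrainz_track_id(mb_tracks, tweet):
    """Return the MusicBrainz track id for a tweet document, or None.

    mb_tracks is a collection such as db.mb_tracks; only find_one is used.
    """
    # ASSUMPTION: np_id in the mapping equals track_id in the tweet.
    mapping = mb_tracks.find_one({"np_id": tweet["track_id"]})
    return mapping["mb_track_id"] if mapping else None
```

A call would look like `musicbrainz_track_id(db.mb_tracks, db.cultural_tweet.find_one())`. If many lookups are planned, an index created with `db.mb_tracks.create_index("np_id")` would probably help.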

In [29]:
## Transformation and Importing of Playlist Data from NowPlaying to MongoDB
#np_pl = pd.read_csv("playlist.csv", names=["playlist_id","track_id","playlist_name"])
#np_pl_json = json.loads(np_pl.to_json(orient="records"))
#db.playlist.drop()
#db.playlist.insert_many(np_pl_json)

In [30]:
#db.cultural_tweet.find({"hashtags": "nobeats"})
user_tweets = db.cultural_tweet.find({"user_id": '1c10f9788fdcc4baf6cf6a2631fe78bc12102418'})

In [31]:
#TOTAL Number of Tweets
db.cultural_tweet.count_documents({})

564301

In [4]:
user_ids = db.cultural_tweet.distinct("user_id")
len(user_ids)

9431

In [5]:
first = user_ids[0]
first_songs = db.cultural_tweet.find({"user_id": first})
initial_song = first_songs[0]
initial_date = initial_song["created_at"]
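
The `created_at` value is stored as a plain string, so it has to be parsed before dates can be compared or sorted. Here is a minimal sketch, assuming every record uses the `YYYY-MM-DD HH:MM:SS` layout seen in the sample record. The helper `parse_created_at` is hypothetical, and the result is a naive datetime because the document does not state the time zone.

```python
from datetime import datetime

# ASSUMPTION: all created_at strings follow this layout.
CREATED_AT_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_created_at(value):
    """Parse a created_at string into a naive datetime object."""
    return datetime.strptime(value, CREATED_AT_FORMAT)
```

For example, `parse_created_at(initial_date)` gives a value that supports subtraction and ordering, which the raw string only supports lexicographically.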

In [6]:
# Number of MusicBrainz mappings for this track (not displayed; only the last expression is shown)
db.mb_tracks.count_documents({"np_id": initial_song["track_id"]})
initial_song

{u'_id': ObjectId('59afa34769ab6e58375eb966'),
 u'artist_id': u'985d20e6778586b313ed8cc64c60c1f5',
 u'artist_name': u'Collective Soul',
 u'coordinates': None,
 u'created_at': u'2014-07-28 05:13:13',
 u'geo': None,
 u'hashtags': [u'tophits'],
 u'id': 493594999318392800L,
 u'place': None,
 u'source': u'SAM Broadcaster Song Info',
 u'text': u'#nowplaying #listenlive Streaming Now - Collective Soul - Shine - #tophits http://t.co/hWbSmQECMz',
 u'track_id': u'e4e9daf01adb3234277ee5c258d4caf2',
 u'track_title': u'Shine',
 u'tweet_lang': u'en',
 u'user_id': u'1b8339b2bd1d4494173af5ea20ae073b337a59f0',
 u'user_lang': u'en',
 u'user_location': u'US',
 u'user_time_zone': u'Central Time (US & Canada)'}

In [8]:
results = sp.search(q=initial_song["track_title"] +" "+ initial_song["artist_name"])
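
A free-text query like the one above can match the words anywhere in a track's metadata. Spotify's search also accepts field filters such as `track:` and `artist:`, which may give tighter matches. Below is a sketch of building such a query and picking the first hit. Both helper names are hypothetical, and whether the filtered query beats the free-text one on this dataset is untested.

```python
def spotify_track_query(track_title, artist_name):
    """Build a search string restricted to the track and artist fields."""
    return u"track:%s artist:%s" % (track_title, artist_name)

def first_track_id(results):
    """Return the Spotify id of the first track in a search response, or None."""
    items = results.get("tracks", {}).get("items", [])
    return items[0]["id"] if items else None
```

A call might look like `first_track_id(sp.search(q=spotify_track_query(title, artist), type="track", limit=1))`, where `None` signals that no match was found.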

In [11]:
results

{u'tracks': {u'href': u'https://api.spotify.com/v1/search?query=Shine+Collective+Soul&type=track&offset=0&limit=10',
  u'items': [{u'album': {u'album_type': u'album',
     u'artists': [{u'external_urls': {u'spotify': u'https://open.spotify.com/artist/4e5V1Q2dKCzbLVMQ8qbTn6'},
       u'href': u'https://api.spotify.com/v1/artists/4e5V1Q2dKCzbLVMQ8qbTn6',
       u'id': u'4e5V1Q2dKCzbLVMQ8qbTn6',
       u'name': u'Collective Soul',
       u'type': u'artist',
       u'uri': u'spotify:artist:4e5V1Q2dKCzbLVMQ8qbTn6'}],
     u'available_markets': [u'AD',
      u'AR',
      u'AT',
      u'AU',
      u'BE',
      u'BG',
      u'BO',
      u'BR',
      u'CA',
      u'CH',
      u'CL',
      u'CO',
      u'CR',
      u'CY',
      u'CZ',
      u'DE',
      u'DK',
      u'DO',
      u'EC',
      u'EE',
      u'ES',
      u'FI',
      u'FR',
      u'GB',
      u'GR',
      u'GT',
      u'HK',
      u'HN',
      u'HU',
      u'ID',
      u'IE',
      u'IS',
      u'IT',
      u'JP',
      u'LI',
     