# CULTURAL #NOWPLAYING DATA
*********************************

This dataset describes the listening events of users (extracted from the #nowplaying dataset), the emotion extracted from the hashtags used in the corresponding tweets, and the location of the user.


TWITTER AND TRACK DATA
-------------------------------
The listening-event data is contained in np_cultural.zip and is encoded as JSON. Each record holds the following fields:

- id: the id of the underlying tweet [*]
- user_id: the id of the user who sent the tweet (MD5 of it)
- user_lang: The BCP 47 code for the user’s self-declared user interface language. [*]
- user_time_zone: the time zone the user declared in their profile [*]
- text: actual content of the tweet [*]
- tweet_lang: language of the tweet (as detected by Twitter; BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected.) [*]
- geo: Deprecated version of coordinates (however, we deal with data stemming from before this API change, therefore we still include it; cf. coordinates for a description)
- coordinates: Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude). [*]
- place: When present, indicates that the tweet is associated with (but not necessarily originating from) a Place. [*]
- created_at: time the tweet was sent. [*]
- source: Utility used to post the Tweet, as an HTML-formatted string. [*]
- track_title: title of the track the user tweeted about
- track_id: the unique id of the track (from #nowplaying dataset) 
- artist_name: name of artist performing the track
- artist_id: the unique id of the artist (from #nowplaying dataset) 
- hashtags: list of hashtags used in the tweet.

[*] For further details on the fields gathered from Twitter, please consult https://dev.twitter.com/overview/api/tweets

Please note that we only add key-value pairs for geo/coordinates/place information if this information was provided by the Twitter API (i.e., a missing key signals that this information is not available for the given tweet).
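
As a minimal sketch of how the optional location keys might be handled when reading a record: the helper name `extract_lon_lat` and the two sample records below are illustrative assumptions and not part of the dataset tooling. The sketch treats a missing key and a null value the same way and relies on the GeoJSON order (longitude first, then latitude).

```python
import json

def extract_lon_lat(tweet):
    """Return (longitude, latitude) for a tweet record, or None if unavailable."""
    # The key may be absent entirely or present with a null value.
    point = tweet.get("coordinates")
    if not point:
        return None
    coords = point.get("coordinates")
    if not coords or len(coords) != 2:
        return None
    lon, lat = coords  # GeoJSON order: longitude first, then latitude
    return float(lon), float(lat)

# Hypothetical records, shaped like the fields described above
with_location = json.loads(
    '{"id": 1, "coordinates": {"type": "Point", "coordinates": [11.39, 47.27]}}'
)
without_location = json.loads('{"id": 2}')

print(extract_lon_lat(with_location))
print(extract_lon_lat(without_location))
```

Returning None for both the missing-key and the null case keeps downstream code to a single check.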


SENTIMENT DATA
-------------------------------
The sentiment data for the hashtags (where any could be obtained) is contained in np_cultural_sentiment.csv in CSV format. The sentiment scores were obtained by applying a set of well-known sentiment dictionaries and are scaled between 0 and 1 (very negative to very positive). For each dictionary, we list the minimum, maximum, sum and average sentiment score across all hashtags used within a tweet (most tweets feature only a single hashtag to which we can assign a sentiment value).
The file contains the following columns, in this order:
- name of the hashtag
- AFINN dictionary (min, max, sum, avg)
- Opinion Lexicon (min, max, sum, avg)
- Sentistrength Lexicon (min, max, sum, avg)
- Vader (min, max, sum, avg)
- Sentiment Hashtag Lexicon (min, max, sum, avg)


Please note that we only added hashtags for which we could obtain a sentiment value from at least one sentiment dictionary.
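
The column order above could be read along the following lines. This is only a sketch: it assumes a comma-separated file without a header row and assumes that an empty cell means the dictionary had no score for the hashtag. Neither detail is stated above, and the short dictionary keys, the function name `parse_sentiment_row` and the sample row are made up for illustration.

```python
import csv
import io

# Short keys chosen for this sketch, in the documented column order
DICTIONARIES = ["afinn", "opinion_lexicon", "sentistrength", "vader",
                "sentiment_hashtag_lexicon"]
STATS = ["min", "max", "sum", "avg"]

def parse_sentiment_row(row):
    """Map one CSV row (hashtag followed by 5 x 4 scores) to a nested dict."""
    hashtag, values = row[0], row[1:]
    if len(values) != len(DICTIONARIES) * len(STATS):
        raise ValueError("unexpected number of columns: %d" % len(row))
    parsed = {"hashtag": hashtag}
    for d, name in enumerate(DICTIONARIES):
        chunk = values[d * len(STATS):(d + 1) * len(STATS)]
        # Assumption: an empty cell means "no score from this dictionary"
        parsed[name] = {s: (float(v) if v != "" else None)
                        for s, v in zip(STATS, chunk)}
    return parsed

# Made-up row for illustration only (not taken from the dataset)
sample = ",".join(["happy"] + ["0.8"] * 4 + [""] * 4 + ["0.7"] * 4
                  + [""] * 4 + ["0.9"] * 4) + "\n"
rows = [parse_sentiment_row(r) for r in csv.reader(io.StringIO(sample))]
print(rows[0]["afinn"]["avg"], rows[0]["vader"]["avg"])
```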

In [1]:
import pymongo
from pymongo import MongoClient
import pandas as pd
import numpy as np 
import json
import time
import csv
import pprint
import spotipy
import requests
from spotipy.oauth2 import SpotifyClientCredentials
from difflib import SequenceMatcher


In [2]:
# Connect to the local MongoDB instance and select the now_playing database
client = MongoClient()
db = client.now_playing

# Load the Spotify API credentials from secrets.json
with open("secrets.json", "r") as f:
    client_data = json.load(f)

# Create the Spotify client using the client-credentials flow
client_credentials_manager = SpotifyClientCredentials(
    client_id=client_data["client_id"],
    client_secret=client_data["client_secret"],
)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

## Total Users Count

In [3]:
user_ids = db.cultural_tweet.distinct("user_id")
len(user_ids)

9431

## Helper Methods

In [7]:
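def similar(a, b):
    # Similarity ratio in [0, 1] between two strings
    return SequenceMatcher(None, a, b).ratio()

def convertMillis(millis):
    # Format a duration in milliseconds as "minutes:seconds" (seconds zero-padded)
    total_seconds = int(millis // 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return str(minutes) + ":" + str(seconds).zfill(2)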
    
def get_track_list(find_query):
    return db.cultural_tweet.find(find_query).distinct("track_id")

def get_complete_track(track):
    song_name = track["name"].lower()
    artist_name = " ".join(list(map(lambda x: x["name"], track["artists"]))).lower()
    return song_name + " " + artist_name

def get_searched_track_name(track):
    song_name = db.cultural_tweet.find_one({"track_id" : track}, {"track_title" : 1, "artist_name" : 1})
    return song_name["track_title"].lower() + " " + song_name["artist_name"].lower()

def get_tweets_from_user(user_id):
    return db.cultural_tweet.find({"user_id": user_id})

def get_created_date(song):
    return song["created_at"]

def timeout(STime):
    # Pause between requests, e.g. to stay below an API rate limit
    print('Sleep for ' + str(STime) + ' seconds..')
    time.sleep(STime)
    print('\n\n')
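
The `similar` helper returns the `SequenceMatcher` ratio, which is twice the number of matching characters divided by the combined length of both strings. The short sketch below repeats the helper so it runs on its own and shows why the other helpers lowercase both strings before comparing: the match is case-sensitive. The example strings are chosen for illustration only.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# "bcd" is the longest common block: 2 * 3 matching characters / 8 total
print(similar("abcd", "bcde"))  # 0.75

# Case matters, so differently cased strings score below 1.0 ...
print(similar("The The", "the the") < 1.0)  # True

# ... while lowercasing both sides first gives a perfect match
print(similar("The The".lower(), "the the"))  # 1.0
```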

In [15]:
song_duration = -1
name_found = ""
searched_name = dict()
spotify_id = -1
track_list = get_track_list({"name_found" : {"$exists" : False }, "spotify_not_found" : {"$exists" : False }})
tracks_size = len(track_list)
searched = 0
for i in range(len(track_list)):   
    search_query = get_searched_track_name(track_list[i])

    search_result = sp.search(q = search_query, limit= 3, type="track")
    print("Finished querying Spotify for "+ search_query)
    searched += 1
    score = 0
    print_text = ""
    if search_result:
        for t in search_result["tracks"]["items"]:
            result_matcher = get_complete_track(t)
            similarity_score = similar(search_query,result_matcher)
            if(similarity_score > score):
                print("Found " + result_matcher + " => Similarity Score: " + str(similarity_score))
                song_duration = t["duration_ms"]
                score = similarity_score
                name_found = result_matcher
                searched_name[track_list[i]] = search_query
                spotify_id = t["id"]
        
        if score > 0:
            db.cultural_tweet.update_many({"track_id": track_list[i]},
            {"$set" : {
                "duration_ms" : song_duration,
                "spotify_id" : spotify_id,
                "similarity_score" : score,
                "name_found" : name_found
            }})
            print_text = " Saved song "+ name_found
        else:
            db.cultural_tweet.update_many({"track_id": track_list[i]},
             {"$set" : {
                 "spotify_not_found" : True
             }})
            print_text = " Not found in Spotify"
             
    else:
        # Empty response from Spotify: the track stays unmarked, so the
        # query above will pick it up again on the next run
        print("Didn't find!")
    print("Completed " + str(searched) +" out of " + str(tracks_size) + print_text)         
print("saved!")

Finished querying Spotify for the beat(en) generation (remastered album version) the the
Completed 1 out of 1 Not found in Spotify
saved!
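
The loop above accepts the most similar of the three search results as long as its score is greater than 0. A possible variation, sketched here without any database or API calls, is to require a minimum score before accepting a match. The function names `describe` and `pick_best_match`, the `min_score` parameter and the candidate dictionaries are assumptions for illustration; the candidates only mimic the fields the loop reads (`name`, `artists`, `duration_ms`, `id`) and are not real API responses.

```python
from difflib import SequenceMatcher

def describe(track):
    """Lowercased 'title artist1 artist2 ...' string for a track dict."""
    artists = " ".join(a["name"] for a in track["artists"])
    return (track["name"] + " " + artists).lower()

def pick_best_match(query, candidates, min_score=0.0):
    """Return (track, score) for the most similar candidate, or (None, 0.0)."""
    best, best_score = None, 0.0
    for track in candidates:
        score = SequenceMatcher(None, query.lower(), describe(track)).ratio()
        # Keep the candidate only if it beats the current best and the threshold
        if score > best_score and score >= min_score:
            best, best_score = track, score
    return best, best_score

# Hand-written candidates with placeholder ids and durations
candidates = [
    {"id": "a1", "name": "Some Other Song",
     "artists": [{"name": "Someone"}], "duration_ms": 200000},
    {"id": "b2", "name": "The Beat(en) Generation",
     "artists": [{"name": "The The"}], "duration_ms": 180000},
]
track, score = pick_best_match("the beat(en) generation the the",
                               candidates, min_score=0.6)
print(track["id"] if track else None, score)
```

With `min_score=0.0` the selection behaves like the loop above; a higher value trades fewer false matches for more tracks marked as not found.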


In [8]:
track_list = get_track_list({"name_found" : {"$exists" : False }, "spotify_not_found" : {"$exists" : False }})
track_list

['db563230b098756cd1f48f6619e22720']

In [None]:
print("Songs with duration " + str(len(db.cultural_tweet.find({"duration_ms" : {"$exists" : True }}).distinct("track_id"))))
print("Songs without duration " + str(len(db.cultural_tweet.find({"duration_ms" : {"$exists" : False}}).distinct("track_id"))))
print("Songs in total " + str(len(db.cultural_tweet.distinct("track_id"))))


In [16]:
test = db.cultural_tweet.find_one({"track_id":"db563230b098756cd1f48f6619e22720"})
score = 0
search_result = sp.search(q = "The Beat(en) Generation - Remastered", limit= 3, type="track")
for t in search_result["tracks"]["items"]:
            result_matcher = get_complete_track(t)
            similarity_score = similar("The Beat(en) Generation - Remastered",result_matcher)
            if(similarity_score > score):
                print("Found " + result_matcher + " => Similarity Score: " + str(similarity_score))
                song_duration = t["duration_ms"]
                score = similarity_score
                name_found = result_matcher
                searched_name[track_list[i]] = search_query
                spotify_id = t["id"]
                if score > 0:
                    db.cultural_tweet.update_many({"track_id": track_list[i]},
                    {"$set" : {
                        "duration_ms" : song_duration,
                        "spotify_id" : spotify_id,
                        "similarity_score" : score,
                        "name_found" : name_found
                    }}, {"$unset" : { "spotify_id" : False}})

Found the beat(en) generation - remastered the the => Similarity Score: 0.8


TypeError: upsert must be True or False
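
The TypeError arises because `update_many` expects a single update document as its second argument; a third positional argument is interpreted as `upsert`. Several operators can instead be combined in one document. The sketch below only builds such a document, so it runs without a database. It assumes the intent was to clear the earlier `spotify_not_found` flag (setting and unsetting the same field in one update would conflict); the function name `build_match_update` and the argument values are placeholders.

```python
def build_match_update(duration_ms, spotify_id, score, name_found):
    """Build one update document that stores the match and clears the flag."""
    return {
        "$set": {
            "duration_ms": duration_ms,
            "spotify_id": spotify_id,
            "similarity_score": score,
            "name_found": name_found,
        },
        # Assumption: the goal is to remove the earlier "not found" marker
        "$unset": {"spotify_not_found": ""},
    }

# Placeholder values for illustration only
update = build_match_update(123456, "hypothetical-id", 0.8,
                            "the beat(en) generation - remastered the the")
print(sorted(update))

# With a live collection, the document would be passed as the second argument:
# db.cultural_tweet.update_many({"track_id": track_id}, update)
```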

In [12]:
db.cultural_tweet.update_many({"track_id": "db563230b098756cd1f48f6619e22720"},
            {"$set" : {
                "artist_name" : "The The",
                "track_title" : "The Beat(en) Generation (Remastered Album Version)"
            }})

<pymongo.results.UpdateResult at 0x112b14b88>

In [14]:
test

{'_id': ObjectId('59bfff9ced415cd14d0fa890'),
 'artist_id': 'e2505e2af0a61167f95e32c7d219e142',
 'artist_name': 'The The',
 'coordinates': None,
 'created_at': '2014-11-14 04:04:21',
 'geo': None,
 'hashtags': ['googleplay'],
 'id': 533093045299077100,
 'place': None,
 'source': 'IFTTT',
 'text': '#nowplaying The The - The Beat(en) Generation (Remastered Album Version) on #googleplay ! http://t.co/LzICeKVtvC',
 'track_id': 'db563230b098756cd1f48f6619e22720',
 'track_title': 'The Beat(en) Generation (Remastered Album Version)',
 'tweet_lang': 'en',
 'user_id': '4742df668ebc6a50e6ae7a05f27d9e284f208092',
 'user_lang': 'en',
 'user_location': 'Everywhere',
 'user_time_zone': 'Pacific Time (US & Canada)'}