In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import reverse_geocoder as rg
import os.path
import ast
import seaborn as sns

from helpers import *
from datetime import datetime, date, time
from scipy import stats

%load_ext autoreload
%autoreload 2


In [2]:
# Constants
DATA_DIR = './data/'

#### The main purpose of this notebook is to analyze the tweets we downloaded from Kaggle. They are already labeled as positive or negative, so we can use them to train the project's model.

In [3]:
# Load the data from the CSV file and assign the column names
tweets_col_names=['sentiment', 'ID', 'Date',
                        'user', 'text']

tweets_dtypes = {'sentiment': int, 'ID': int, 
                       'Date': str, 'user': str,
                       'text': str }

tweets_df = pd.read_csv(os.path.join(DATA_DIR, 'tweets.csv'), names=tweets_col_names,
                              dtype=tweets_dtypes, 
                              usecols=[0, 1, 2, 4, 5], encoding='latin1')
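
The call above reads a header-less file and skips the column at position 3. A minimal sketch of the same pattern on an in-memory sample follows; the two rows and the `NO_QUERY` value in the skipped column are made up for illustration and are an assumption about the file layout, not taken from the real data.

```python
import io
import pandas as pd

# Hypothetical two-row sample with six columns and no header row.
sample_csv = io.StringIO(
    '0,1,"Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","first tweet"\n'
    '4,2,"Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","second tweet"\n'
)

col_names = ['sentiment', 'ID', 'Date', 'user', 'text']

# usecols keeps positions 0, 1, 2, 4 and 5; names are assigned to the kept
# columns in order, so the column at position 3 never reaches the DataFrame.
sample_df = pd.read_csv(sample_csv, header=None, names=col_names,
                        usecols=[0, 1, 2, 4, 5],
                        dtype={'sentiment': int, 'ID': int})
print(sample_df)
```

Because `names` has exactly as many entries as `usecols`, each name maps to one kept column.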


First we look at the first few rows of the data set to understand the values in each column. The total number of tweets is computed further below.

In [4]:
tweets_df.head()

Unnamed: 0,sentiment,ID,Date,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all...."


As the output shows, the sentiment column encodes the polarity of the tweet (0 for negative, 2 for neutral and 4 for positive).
The remaining columns hold the ID of the tweet, the date of the tweet, the user and the text of the tweet.
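
A quick way to check which sentiment values actually occur is to count them per label. The sketch below uses a small made-up frame in place of `tweets_df`; the `SENTIMENT_LABELS` mapping is a hypothetical helper that only restates the encoding described above.

```python
import pandas as pd

# Toy stand-in for tweets_df (made-up rows, for illustration only).
toy_df = pd.DataFrame({
    'sentiment': [0, 0, 4, 4, 4, 0],
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Hypothetical mapping from the numeric code to a readable label.
SENTIMENT_LABELS = {0: 'negative', 2: 'neutral', 4: 'positive'}

# Map codes to labels, then count how often each label occurs.
label_counts = (toy_df['sentiment']
                .map(SENTIMENT_LABELS)
                .value_counts()
                .to_dict())
print(label_counts)
```

Running the same two calls on `tweets_df['sentiment']` would show whether all three classes are present in the downloaded file.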

Whether the project is feasible depends on how many tweets contain links to songs. The data set we will use is FMA (with data from Echo Nest, now part of Spotify) complemented with Spotify data, so we search the tweet text for the keyword "Spotify", which we expect to appear in the music-related links.

In [5]:
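# Select tweets whose text contains 'Spotify' or 'spotify' (two case-sensitive matches)
f1 = tweets_df[tweets_df.text.str.contains('Spotify')]
f2 = tweets_df[tweets_df.text.str.contains('spotify')]
# Note: a tweet containing both spellings appears twice after the concat,
# and other casings such as 'SPOTIFY' are not matched
tweets_spotify = pd.concat([f1, f2])
num_tweets_spotify = tweets_spotify.count()
print('The number of tweets that have Spotify on the text is: {}'.format(num_tweets_spotify.iloc[0]))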

The number of tweets that have Spotify on the text is: 281
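
The two case-sensitive filters can count a tweet twice when it contains both spellings, and they miss other casings. A possible alternative, sketched here on made-up example texts rather than on the real data, is a single case-insensitive match; its count on the full data set may differ from the number reported above.

```python
import pandas as pd

# Made-up example texts (not taken from the data set).
toy_df = pd.DataFrame({'text': [
    'Listening on Spotify right now',
    'spotify and Spotify in one tweet',
    'SPOTIFY is down again',
    'no music here',
    None,
]})

# case=False matches any casing; na=False treats missing text as "no match".
mask = toy_df['text'].str.contains('spotify', case=False, na=False)
num_matches = int(mask.sum())

# For comparison: the two-pass approach on the same toy data.
f1 = toy_df[toy_df['text'].str.contains('Spotify', na=False)]
f2 = toy_df[toy_df['text'].str.contains('spotify', na=False)]
two_pass_index = list(pd.concat([f1, f2]).index)

print(num_matches, two_pass_index)
```

In the toy data the two-pass result lists the second row twice and never picks up the all-caps row, while the boolean mask marks each matching row exactly once.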


In [6]:
# count() returns the non-null count per column; take the first column's count
num_tweets_total = tweets_df.count()
print('The total number of tweets is: {}'.format(num_tweets_total.iloc[0]))

The total number of tweets is: 1600000


In [7]:
# Share of tweets mentioning Spotify, as a percentage of all tweets
tweets_spotify_percentage = num_tweets_spotify.iloc[0] * 100 / num_tweets_total.iloc[0]
print('The percentage of tweets with Spotify on the text is: {}'.format(tweets_spotify_percentage))

The percentage of tweets with Spotify on the text is: 0.0175625


### Fewer than 0.02% of the tweets (281 out of 1,600,000) mention Spotify, so we conclude that the training data set contains too few song-related tweets to build a reliable model.