# NLP Methods on Music Reviews
This project explores text preprocessing, lexicon normalization, and modeling of music reviews taken from the Kaggle [Pitchfork reviews dataset](https://www.kaggle.com/nolanbconaway/pitchfork-data).

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cm
import numpy as np
import seaborn as sns
import sqlite3
import string, re




# Import Data
The data is stored as a set of tables in a SQLite database. The following cells read each table into a pandas DataFrame.

### Connect to the Database

In [7]:
connection = sqlite3.connect('datasets/database.sqlite')
cursor = connection.cursor()

### Reviews Table

In [8]:
query = "select * from reviews;"
cursor.execute(query)
df_reviews = pd.DataFrame(cursor.fetchall(), columns=["id",'track', 'artist', 'url','score',
    'best_new_music', 'author', 'author_type','date','weekday', 'day', 'month', 'year'])
df_reviews.head()

| | id | track | artist | url | score | best_new_music | author | author_type | date | weekday | day | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22703 | mezzanine | massive attack | http://pitchfork.com/reviews/albums/22703-mezz... | 9.3 | 0 | nate patrin | contributor | 2017-01-08 | 6 | 8 | 1 | 2017 |
| 1 | 22721 | prelapsarian | krallice | http://pitchfork.com/reviews/albums/22721-prel... | 7.9 | 0 | zoe camp | contributor | 2017-01-07 | 5 | 7 | 1 | 2017 |
| 2 | 22659 | all of them naturals | uranium club | http://pitchfork.com/reviews/albums/22659-all-... | 7.3 | 0 | david glickman | contributor | 2017-01-07 | 5 | 7 | 1 | 2017 |
| 3 | 22661 | first songs | kleenex, liliput | http://pitchfork.com/reviews/albums/22661-firs... | 9.0 | 1 | jenn pelly | associate reviews editor | 2017-01-06 | 4 | 6 | 1 | 2017 |
| 4 | 22725 | new start | taso | http://pitchfork.com/reviews/albums/22725-new-... | 8.1 | 0 | kevin lozano | tracks coordinator | 2017-01-06 | 4 | 6 | 1 | 2017 |


In [10]:
len(df_reviews)

18393

### Content Table

In [11]:
query = "select * from content;"
cursor.execute(query)
df_content = pd.DataFrame(cursor.fetchall(), columns=["id",'review'])
df_content.head()

| | id | review |
|---|---|---|
| 0 | 22703 | “Trip-hop” eventually became a ’90s punchline,... |
| 1 | 22721 | Eight years, five albums, and two EPs in, the ... |
| 2 | 22659 | Minneapolis’ Uranium Club seem to revel in bei... |
| 3 | 22661 | Kleenex began with a crash. It transpired one ... |
| 4 | 22725 | It is impossible to consider a given release b... |


### Genres Table

In [12]:
query = "select * from genres;"
cursor.execute(query)
df_genres = pd.DataFrame(cursor.fetchall(), columns=["id",'genre'])
df_genres.head()

| | id | genre |
|---|---|---|
| 0 | 22703 | electronic |
| 1 | 22721 | metal |
| 2 | 22659 | rock |
| 3 | 22661 | rock |
| 4 | 22725 | electronic |
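
The three cells above repeat the same fetch-and-wrap pattern with hand-typed column names. One possible alternative is `pandas.read_sql_query`, which takes the column names from the query result itself, so they would be whatever the database schema defines rather than the names typed above. The sketch below is self-contained: it builds a tiny in-memory SQLite database so that it runs without the Kaggle file. The helper name `load_table` and the demo rows are illustrative assumptions, not part of this project.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_table(connection, table_name):
    """Load an entire SQLite table into a DataFrame (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so check the name against
    # the database's own catalogue before interpolating it into the query.
    known = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table';", connection
    )["name"].tolist()
    if table_name not in known:
        raise ValueError(f"unknown table: {table_name!r}")
    return pd.read_sql_query(f'SELECT * FROM "{table_name}";', connection)


# Illustrative stand-in for datasets/database.sqlite (made-up rows).
with closing(sqlite3.connect(":memory:")) as demo_connection:
    demo_connection.execute("CREATE TABLE genres (id INTEGER, genre TEXT);")
    demo_connection.executemany(
        "INSERT INTO genres VALUES (?, ?);",
        [(1, "electronic"), (2, "metal"), (3, "rock")],
    )
    demo_genres = load_table(demo_connection, "genres")

print(demo_genres)
```

Wrapping the connection in `contextlib.closing` also ensures it is closed once the tables have been read.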


# Data Prep
The following cells perform text preprocessing steps, such as stop-word and punctuation removal.
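
As a rough illustration of what stop-word and punctuation removal involves, here is a minimal sketch using only the standard library. The stop-word list is a tiny made-up sample and the function name `clean_text` is hypothetical; this is not necessarily the approach taken in the cells of this notebook.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; a real run would use a
# fuller list, for example one shipped with an NLP library).
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}

# Matches any single ASCII punctuation character.
PUNCTUATION_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")


def clean_text(text):
    """Lowercase, replace ASCII punctuation with spaces, and drop stop words."""
    without_punctuation = PUNCTUATION_PATTERN.sub(" ", text.lower())
    tokens = without_punctuation.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_text("Kleenex began with a crash."))
# → ['kleenex', 'began', 'with', 'crash']
```

Note that `string.punctuation` covers ASCII characters only, so the curly quotes and apostrophes that appear in the review text (“ ” ’) would need to be handled separately.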

### Merge Dataframes

In [16]:
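# Inner join (the pandas default) on the shared review id; rows whose id
# does not appear in both tables are dropped.
df = df_content.merge(df_reviews, on='id')
# Show five randomly chosen rows of the merged frame.
df.sample(5)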

| | id | review | track | artist | url | score | best_new_music | author | author_type | date | weekday | day | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12833 | 2116 | Once upon a time, labels meant something-- or ... | the memphis family album: music from memphis i... | various artists | http://pitchfork.com/reviews/albums/2116-the-m... | 7.4 | 0 | rob mitchum | contributor | 2006-03-13 | 0 | 13 | 3 | 2006 |
| 16865 | 7995 | Many folks maintain a soft spot for They Might... | no! | they might be giants | http://pitchfork.com/reviews/albums/7995-no/ | 7.0 | 0 | william bowers | contributor | 2002-07-07 | 6 | 7 | 7 | 2002 |
| 12840 | 4036 | Horns of Happiness began as the lo-fi pop side... | would i find your psychic guideline | horns of happiness | http://pitchfork.com/reviews/albums/4036-would... | 7.3 | 0 | cory d. byrom | | 2006-03-09 | 3 | 9 | 3 | 2006 |
| 9011 | 12997 | The story of A Hawk and a Hacksaw is as captiv... | délivrance | a hawk and a hacksaw | http://pitchfork.com/reviews/albums/12997-deli... | 7.8 | 0 | mia clarke | | 2009-05-13 | 2 | 13 | 5 | 2009 |
| 7673 | 14387 | American Primitive guitarist Robbie "Basho" Ro... | we are all one, in the sun: a tribute to robbi... | various artists | http://pitchfork.com/reviews/albums/14387-we-a... | 7.9 | 0 | matthew murphy | contributor | 2010-06-29 | 1 | 29 | 6 | 2010 |
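
Because the merge silently drops unmatched rows and duplicates rows when an `id` repeats, it can be worth checking the join before relying on it. The sketch below uses the `how`, `indicator`, and `validate` parameters of `DataFrame.merge` on two made-up miniature tables; the `demo_*` frames are illustrative assumptions, and whether the real tables satisfy a one-to-one relationship has not been checked here.

```python
import pandas as pd

# Made-up miniature versions of the two tables (illustrative only).
demo_content = pd.DataFrame({"id": [1, 2, 3], "review": ["aaa", "bbb", "ccc"]})
demo_reviews = pd.DataFrame({"id": [1, 2, 4], "score": [9.3, 7.9, 8.1]})

# how="outer" keeps unmatched rows from both sides, indicator=True adds a
# "_merge" column saying where each row came from, and validate raises
# pandas.errors.MergeError if either side repeats an id.
checked = demo_content.merge(
    demo_reviews, on="id", how="outer", indicator=True, validate="one_to_one"
)
print(checked["_merge"].value_counts())

# Rows present in both tables are what the default inner merge would keep.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(len(matched))
```

If `validate="one_to_one"` raises on the real data, that would point to repeated ids in one of the tables, which could then be inspected with `DataFrame.duplicated`.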


In [18]:
# Persist the merged frame; to_parquet needs pyarrow or fastparquet installed.
df.to_parquet("datasets/reviews.parquet")