# SPICED Academy ///  Project Week 04 /// Web Scraping and Text Processing

***

## I. Define goal

The goal of this project is to develop a text classifier that predicts the probability that a song's lyrics belong to a given artist, in this case Mac Miller or James Blake. These two artists were picked because they are my current favourite musicians. Specifically, the albums "Friends that break your heart" (by James Blake) and "Circles" (by Mac Miller) were compared.

***

## II. Import libraries 

In [2]:
#data processing and general
import pandas as pd
import json
import pprint

#web scraping
import lyricsgenius as genius
import api_key
import re

#feature engineering
from functions import FeatureEngineeringLyrics
import nltk 
from nltk.tokenize import TreebankWordTokenizer 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

#machine learning models
from sklearn.naive_bayes import MultinomialNB

#metrics
import sklearn.metrics as metrics

ModuleNotFoundError: No module named 'lyricsgenius'

***

## III. Import and transform data

*Requesting access from Genius*

In [None]:
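#access token from Genius
client_access_token = api_key.your_client_access_token
#note: this rebinds the name `genius` from the imported module to the client object,
#so the cell can only be run once per kernel session
genius = genius.Genius(client_access_token)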

***

*Understanding the functionalities of LyricsGenius*

In [None]:
#accessing a couple of songs from James Blake
#JamesBlake = genius.search_artist("James Blake", 6)

In [None]:
#printing the song lyrics
#for song in JamesBlake.songs:
    #print(song.lyrics)

In [None]:
#searching for one specific song (search_song takes the title first, then the artist)
#song = genius.search_song("Famous last words", "James Blake")

In [None]:
#printing the lyrics of "Famous last words" by "James Blake"
#print(song.lyrics)

<font color = 'blue'> Summary of LyricsGenius </font> Before the lyrics can be used for the machine learning model, the text has to be cleaned. Specifically, line breaks (`\n`) and section tags like `[Verse 1]` have to be filtered out.
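
The cleaning step described above could look roughly like the following. This is a minimal sketch using only the standard library; the helper name `clean_lyrics` and the sample text are made up for illustration and are not part of the project's `functions` module.

```python
import re


def clean_lyrics(raw: str) -> str:
    """Remove section tags such as [Verse 1] and collapse line breaks."""
    # drop anything in square brackets, e.g. [Verse 1] or [Chorus]
    no_tags = re.sub(r"\[[^\]]*\]", " ", raw)
    # replace newlines and repeated whitespace with single spaces
    return re.sub(r"\s+", " ", no_tags).strip()


# made-up sample text, not real lyrics
sample = "[Verse 1]\nfirst line here\nsecond line\n\n[Chorus]\nhook line"
cleaned = clean_lyrics(sample)
print(cleaned)
```

The regular expression treats every bracketed span as a tag, which is an assumption; lyrics that use square brackets for other purposes would need a narrower pattern.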

***

*Accessing the albums "Friends that break your heart" by James Blake and "Circles" by Mac Miller*

In [None]:
#saving the lyrics of the album "Friends that break your heart" by "James Blake" as json file
#album = genius.search_album("Friends that break your heart", "James Blake")
#album.save_lyrics()

In [None]:
#saving the lyrics of the album "Circles" by "Mac Miller" as json file
#album = genius.search_album("Circles", "Mac Miller")
#album.save_lyrics()

***

*Converting the json files into dataframes (James Blake)*

In [None]:
#opening the json files
with open('Lyrics_FriendsThatBreakYourHeart.json', 'r') as read_file:
    jamesblake_json = json.load(read_file)

In [None]:
#understanding the structure of the json file
#pprint.pprint(jamesblake_json['tracks'][0]['song']['lyrics'])

In [None]:
#understanding the structure of the json file
#pprint.pprint(jamesblake_json['tracks'][0])

In [None]:
#writing an empty list
list_jamesblake = []

In [None]:
#extracting the lyrics and the artist's name of each track and appending them to the list
for track in jamesblake_json['tracks']:
    X = track['song']['lyrics']
    y = track['song']['artist']
    list_jamesblake.append([X,y])

In [None]:
#inspecting the filled list
#list_jamesblake

In [None]:
#creating a dataframe from the list
df_jamesblake = pd.DataFrame(list_jamesblake, columns=['X', 'y'])
df_jamesblake

***

*Converting the json files into dataframes (Mac Miller)*

In [None]:
#opening the json files
with open('Lyrics_Circles.json', 'r') as read_file:
    macmiller_json = json.load(read_file)

In [None]:
#understanding the structure of the json file
#pprint.pprint(macmiller_json['tracks'][0]['song']['lyrics'])

In [None]:
#understanding the structure of the json file
#pprint.pprint(macmiller_json['tracks'][0])

In [None]:
#writing an empty list
list_macmiller = []

In [None]:
#extracting the lyrics and the artist's name of each track and appending them to the list
for track in macmiller_json['tracks']:
    X = track['song']['lyrics']
    y = track['song']['artist']
    list_macmiller.append([X,y])

In [None]:
#inspecting the filled list
#list_macmiller

In [None]:
#creating a dataframe from the list
df_macmiller = pd.DataFrame(list_macmiller, columns=['X', 'y'])
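
Both conversions above follow the same steps, so they could be folded into one small helper. The sketch below assumes the JSON layout used in the loops above (`tracks` → `song` → `lyrics` / `artist`); the function name `tracks_to_rows` and the miniature album are hypothetical and only serve the illustration.

```python
def tracks_to_rows(album_json: dict) -> list:
    """Return one [lyrics, artist] pair per track of a saved album file."""
    return [
        [track["song"]["lyrics"], track["song"]["artist"]]
        for track in album_json["tracks"]
    ]


# hypothetical miniature album with the same nesting as the saved files
example_album = {
    "tracks": [
        {"song": {"lyrics": "la la la", "artist": "Artist A"}},
        {"song": {"lyrics": "na na na", "artist": "Artist A"}},
    ]
}

rows = tracks_to_rows(example_album)
print(rows)
```

The returned list can be passed to `pd.DataFrame(rows, columns=['X', 'y'])` in the same way as `list_jamesblake` and `list_macmiller`.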

***

In [None]:
#concatenating both dataframes; ignore_index avoids duplicate row labels
frames = [df_jamesblake, df_macmiller]
df = pd.concat(frames, ignore_index=True)
df

***

## IV. Feature Engineering

In [None]:
fel = FeatureEngineeringLyrics(df)
fel.clean_dataframe(df['X'], df['y'])

***

## V. Splitting the data into train, validation and test data

*Splitting the data*

In [None]:
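#splitting the data into train, validation and test sets
#note: X and y are expected to hold the feature matrix and the labels from the feature engineering step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=25)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=25) # 0.25 x 0.8 = 0.2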
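
The arithmetic behind the two `test_size` values can be checked with exact fractions. This is only a sketch of the proportions; the actual number of songs per set also depends on how the split sizes are rounded.

```python
from fractions import Fraction

# first split: 20 % of all songs go to the test set
test_share = Fraction(20, 100)
remaining = 1 - test_share

# second split: 25 % of the remaining 80 % go to the validation set
val_share = Fraction(25, 100) * remaining
train_share = remaining - val_share

print(train_share, val_share, test_share)  # 3/5 1/5 1/5
```

The result is a 60 / 20 / 20 split into train, validation and test data.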

In [None]:
#checking if the splitting worked
y_train

***

## VI. Classification Model 

*Applying a Naive Bayes model*

In [None]:
#instantiating the Multinomial Naive Bayes classifier
m = MultinomialNB() 

In [None]:
#train the model 
m.fit(X_train, y_train)

In [None]:
#model score (accuracy) on the training data
m.score(X_train, y_train)

In [None]:
#model score (accuracy) on the validation data
m.score(X_val, y_val)

In [None]:
y_pred = m.predict(X_val)

***

*Metrics scores*

In [None]:
#accuracy
round(metrics.accuracy_score(y_val, y_pred),2)

In [None]:
#precision
round(metrics.precision_score(y_val, y_pred, pos_label='macmiller'), 2)

In [None]:
#recall
round(metrics.recall_score(y_val, y_pred,pos_label='macmiller'), 2)

In [None]:
#f1
round(metrics.f1_score(y_val, y_pred, pos_label='macmiller'), 2)

In [None]:
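#computing and plotting the confusion matrix
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
metrics.confusion_matrix(y_val, y_pred)
metrics.ConfusionMatrixDisplay.from_estimator(m, X_val, y_val, cmap='Blues')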
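
To make the relation between the confusion matrix and the scores above explicit, precision, recall and F1 can be computed by hand from the counts of true positives, false positives and false negatives. The following is a minimal sketch with made-up labels; `binary_metrics` is a hypothetical helper, and `'jamesblake'` is an assumed label name chosen to mirror the `'macmiller'` label used above.

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# made-up labels for illustration, not results from the project
y_true_demo = ["macmiller", "macmiller", "jamesblake", "jamesblake", "macmiller"]
y_pred_demo = ["macmiller", "jamesblake", "jamesblake", "macmiller", "macmiller"]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo, "macmiller")
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

In the demo data there are two true positives, one false positive and one false negative for the positive class, so all three scores equal 2/3.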

***

## VII. Calculate test-score

*Calculating the model score on the test set*

In [None]:
#calculating the model score using y_test
round(m.score(X_test, y_test),2)

In [None]:
y_pred = m.predict(X_test)

***

*Metrics scores*

In [None]:
#accuracy
round(metrics.accuracy_score(y_test, y_pred),2)

In [None]:
#precision
round(metrics.precision_score(y_test, y_pred, pos_label='macmiller'), 2)

In [None]:
#recall
round(metrics.recall_score(y_test, y_pred, pos_label='macmiller'), 2)

In [None]:
#f1
round(metrics.f1_score(y_test, y_pred, pos_label='macmiller'), 2)

In [None]:
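#computing and plotting the confusion matrix
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
metrics.confusion_matrix(y_test, y_pred)
metrics.ConfusionMatrixDisplay.from_estimator(m, X_test, y_test, cmap='Blues')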

***

## VIII. Calculating the probability of unseen lyrics

In [None]:

#unseen lyrics are transformed with the already fitted vectorizer (transform, not fit_transform)
#X_unseen = vectorizer.transform(X_unseen)

In [None]:

#m.predict(X_unseen)
#raw text has to be vectorized before it is passed to the model
#m.predict_proba(vectorizer.transform(["yellow submarine"]))
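
One way to get probabilities for raw, unseen text is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline`, which is already imported at the top of the notebook. The sketch below is an assumption about how this could be done, not the project's actual feature engineering: the training snippets are made up, and the label `'jamesblake'` is an assumed name mirroring `'macmiller'`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up training snippets and labels, only to make the sketch runnable
train_texts = [
    "ocean waves and quiet rain",
    "quiet rain on the ocean",
    "city lights and late nights",
    "late nights under city lights",
]
train_labels = ["jamesblake", "jamesblake", "macmiller", "macmiller"]

# the pipeline fits the vectorizer and the classifier together
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)

# raw strings can be passed directly; the fitted vectorizer is reused internally
proba = pipeline.predict_proba(["yellow submarine"])
print(pipeline.classes_)  # column order of the probabilities
print(proba)
```

Because the pipeline only calls `transform` on new text, the vocabulary learned during training is kept, and words that never appeared in the training lyrics are ignored.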