# Capstone Project Overview
The purposes of the capstone are the following:

* Reflect efforts and tools used in this program
* Showcase ability to develop and answer a question of interest
* Apply a predictive model
* Communicate findings in a flawless presentation

# Capstone Deliverables
* Choose a data set
* Deliver a predictive model using supervised or unsupervised learning techniques
* Provide a technical write-up (an organized, well-structured analysis walk-through with detailed, insightful plots and explanations of approaches) in a Jupyter Notebook in a GitHub repository
* Provide a non-technical report describing your capstone project (problem, results, important findings, and suggestions for next steps) in README
* Informative README and collection of Jupyter Notebooks (headings, formatting, comments explaining code, appropriately named files/sensible variables, and without unnecessary files)
* Demonstrate competencies with pandas, seaborn, and visualizations (appropriate plots for categorical and continuous variables, with human-readable labels, descriptive titles, legible axes, and proper scaling for readability) 

# CRISP-DM Framework: Standard Process for Data Projects/Mining

* Business Understanding: Background, Objectives, Success Criteria, Inventory of Resources/Requirements/Assumptions/Constraints/Risks/Contingencies/Terminology/Costs/Benefits, Data Mining Goals/Success Criteria

* Data Understanding: Data Collection/Exploration/Quality Report

* Data Preparation: Data Description/Inclusion/Exclusion/Attributes/Records, Merged Data, Reformatted Data

* Modeling: Select Technique/Assumptions, Generate Test Designs, Build Model/Parameter Settings/Model Description, Assess Model, Revise Parameter Settings

* Evaluation: Evaluate Results/Assessment of Results w.r.t Business Success Criteria/Approved Models, Review Process, Determine Next Steps, List of Possible Action Decisions

* Deployment: Plan Deployment, Plan Monitoring and Maintenance Plan, Produce Final Report/Final Presentation, Review Project/Experience Documentation

# Project Summary and Background

* Background

For a number of years, I have observed the effects of recorded music in persons with dementia and Alzheimer’s disease, namely enhanced remembering and social interaction.  Hence, one of my end targets would be to meld song classification and music recommendation to optimize quality of life for such clinical populations.  For instance, one method of classifying songs is by perceived emotion, and one use of that classification is to recommend music based on it.

* Music Information Retrieval (MIR) and Song Classification

Music Information Retrieval (MIR) can be described as extracting information from music.  My goal in this project is to classify music, by positive and negative valence, using audio features.  

What is the valence of a song?  As a brief description, think of a song that sounds happy.  Valence for that song would likely be classified as positive.  In the models built from this data set, audio features will predict valence.  Songs will be classified as having positive valence, sounding happy or cheerful.  Songs will also be classified as having negative valence, sounding sad, depressing, or angry.

What is an audio feature?  An audio feature is a characteristic of a song.  One example of an audio feature is the tempo of a song, measured in beats per minute.  Song classification by musical valence, predicted from audio features, might enhance the building of future music recommendation systems.

* Music Recommendation Systems and Clinical Populations

Music recommendation systems are important to particular clinical populations, especially to persons who consider music highly important. For example, one person with dementia may benefit from the positive feelings and recollections that music evokes.  Another could experience the calming effect of a slow-tempo Rhythm and Blues song.  Song classification by valence may enhance existing music recommendation systems and help build these systems for such groups.  

* Building Song Classification Models 

There is a Chinese proverb, "A journey of a thousand miles begins with a single step".  Much effort is required to build music recommendation systems for clinical populations.  This project uses audio features to predict the variable termed "valence" and is one small step in enhancing music recommendation systems. 

# Data Sourcing for Research Question

The following is the link to the data set: 
https://www.kaggle.com/code/vatsalmavani/music-recommendation-system-using-spotify-dataset/input

# Understanding the Data

This data set, from the website Kaggle.com, consists of 170,653 rows (samples) and 19 columns (features). The features and related concepts will be outlined during data exploration.  

As a brief introduction, the data set title is "Music Recommendation System Using Spotify Dataset".  There are 15 features - in addition to Artist, Song Title, and Unique ID - that I would partition into the following categories: 

* "General Features of Music": Danceability, Acousticness, Energy, Key, Liveness, Loudness, Mode, Duration

* "Lyric-Related Features of Music": Speechiness, Instrumentalness, Explicit

* "Time-Related Features of Music": Tempo, Release Year, Release Date, Popularity 

Valence is also listed as a feature, but for this analysis, I will revise this and make Valence the target variable, or outcome variable.  Thus, my goal in this analysis will be to correctly classify songs by Valence.  The remaining features in the provided data set will then be assessed to predict Valence during model-building.

## Rationale: Research Task and Question

* Valence

In this data set, the variable Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.  The website, from which the data was sourced, describes tracks with high valence as sounding more positive (e.g. happy, cheerful, and euphoric).  Tracks with low valence would then sound more negative (e.g. sad, depressed, angry).

* Predicting Valence: Positive or Negative

Is it possible to predict valence from audio features, and what would be the optimal model to predict the valence of a song?  Improved utilization of audio features to determine if a song is cheerful, happy, sad, or depressing, might enhance the building of future music recommendation systems.  

* Uses for Predicting Valence

The optimal model in this analysis could enhance the building of a future, highly personalized music recommendation system for a person in a clinical population.  For instance, a person experiencing dementia who considers music highly important might benefit from the positive feelings that music can evoke.  If audio features can better predict positive valence, for example, these features can be used to recommend music that may benefit quality of life for this person.    

# Exploring the Data Set for Comprehension

In [None]:
%matplotlib inline

In [None]:
#Importing initial Libraries and plot settings
import pandas as pd
import numpy as np
import seaborn as sns 
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
sns.set_palette("pastel")
sns.set_theme(style="darkgrid")

### Reading the Data Set

In [None]:
DF = pd.read_csv('data.csv')

### Obtaining column names and shape

In [None]:
DF.head()

In [None]:
DF.columns

In [None]:
DF.shape

#### There are currently 170,653 rows/samples and 19 columns/features in this data set

### Obtaining data types and counting null values per column 

In [None]:
DF.info()

In [None]:
DF.isnull().sum()

### No null values discovered

### Initial review of features and cleaning of the data set 

In [None]:
# Generating descriptive statistics
DF.describe()

#### First review of features by count display

In [None]:
DF["valence"].value_counts()

In [None]:
DF["year"].value_counts()

In [None]:
DF["artists"].value_counts()

In [None]:
DF["danceability"].value_counts()

In [None]:
DF["duration_ms"].value_counts()

##### Reformatting duration in milliseconds to duration in minutes for better comprehension when generating visualizations

In [None]:
#Reformatting code
DF['Duration_Mins'] = DF['duration_ms']/(60000)

In [None]:
DF["energy"].value_counts()

In [None]:
DF["explicit"].value_counts()

In [None]:
DF["instrumentalness"].value_counts()

In [None]:
DF["key"].value_counts()

In [None]:
DF["liveness"].value_counts()

In [None]:
DF["loudness"].value_counts()

In [None]:
DF["mode"].value_counts()

In [None]:
DF["popularity"].value_counts()

In [None]:
DF["release_date"].value_counts()

In [None]:
DF["speechiness"].value_counts()

In [None]:
DF["tempo"].value_counts()

In [None]:
DF["name"].value_counts()

In [None]:
DF["id"].value_counts()

### Assessing for Duplicate Rows

In [None]:
Duplicated = DF.duplicated()

In [None]:
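# Sorting the boolean flags so that any True (duplicate) values appear last
DuplicatedSorted = Duplicated.sort_values()
DuplicatedSorted.head(100)

In [None]:
DuplicatedSorted.tail(100)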

#### No Duplicates Discovered
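
A more compact way to confirm the absence of duplicates is to count the `True` flags directly rather than inspecting the sorted head and tail. The following is a minimal sketch on a hypothetical toy frame (`toy` and its columns are illustrative and not part of the Spotify data set):

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are identical.
toy = pd.DataFrame({
    "name": ["Song A", "Song B", "Song B"],
    "tempo": [120.0, 95.5, 95.5],
})

# duplicated() marks every repeat of an earlier row as True,
# so summing the boolean Series counts the duplicate rows directly.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

A count of zero on the full data frame would support the same conclusion as the head and tail inspection.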

### Generating New Data Frame with Renamed Columns 

In [None]:
RenamedDF = DF.rename({'valence': 'Valence', 
                   'year': 'Release_Year', 
                   'acousticness': 'Acousticness',
                   'artists': 'Artist', 
                   'danceability': 'Danceability',
                   'energy': 'Energy', 
                   'explicit': 'Explicit', 
                   'id': 'ID', 
                   'instrumentalness': 'Instrumentalness', 
                   'key': 'Key',
                   'liveness': 'Liveness', 
                   'loudness': 'Loudness', 
                   'mode': 'Mode', 
                   'name': 'Song_Title', 
                   'popularity': 'Popularity', 
                   'release_date': 'Release_Date',
                   'speechiness': 'Speechiness', 
                   'tempo': 'Tempo'},
                        axis=1)

In [None]:
RenamedDF.head()

# Data Preparation & Feature Engineering
In-depth review of features via histogram and box plot examination.  Increasing understanding of variables and performing any necessary transformations. 

## Valence: Reformatting Outcome Variable

### Valence Description
Valence - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).


In [None]:
plt.hist(RenamedDF['Valence'])
plt.title('Valence Count')

##### Transforming Valence Data to Outcome with Four Classes
Creating Upper Positive, Lower Positive, Upper Negative, and Lower Negative Classes for Valence.  

The model will use existing features to classify Valence. Valence can be split into two categories, or into a binary positive and negative outcome.  However, for this classification analysis, I will first be separating the outcome variable Valence into four classes of equal width on the 0 to 1 scale (0 to .25, .25 to .50, .50 to .75, and .75 to 1), referred to here as quartiles.  The classes will be named Upper Positive, Lower Positive, Upper Negative, and Lower Negative.  I expect that the Lower Positive and Upper Negative classes, in the middle of this distribution, will be too similar for an effective classification model. I am separating Valence into these four classes so that I may also compare the lowest and highest classes in the analysis.
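
One detail of equal-width binning with `pd.cut` is worth illustrating: by default each interval is open on the left, so a score of exactly 0 falls outside the first bin. The sketch below uses hypothetical scores (`scores`, `edges`, and `names` are illustrative) and shows the optional `include_lowest=True` argument as one possible way to keep such values; whether to keep them is a judgment call for the analysis.

```python
import pandas as pd

# Hypothetical valence scores, including both ends of the 0-1 scale.
scores = pd.Series([0.0, 0.25, 0.26, 0.5, 0.75, 1.0])
edges = [0, 0.25, 0.50, 0.75, 1]
names = ["Lower_Negative", "Upper_Negative", "Lower_Positive", "Upper_Positive"]

# Default: intervals are (0, 0.25], (0.25, 0.5], ... so 0.0 belongs to no bin.
default_bins = pd.cut(scores, bins=edges, labels=names)

# include_lowest=True closes the first interval on the left: [0, 0.25].
closed_bins = pd.cut(scores, bins=edges, labels=names, include_lowest=True)

print(default_bins.isna().sum())
print(closed_bins.isna().sum())
```

Rows left without a class appear as null values and would be removed by any later `dropna` step.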

In [None]:
# Separating Valence values into four equal-width classes
# Note: pd.cut excludes the left edge by default, so a Valence of exactly 0 receives no class (NaN)
RenamedDF['Valence_4Cat'] = pd.cut(x=RenamedDF['Valence'],
                               bins = [0,.25,.50, .75, 1],
                               labels = ['Lower_Negative', 
                                         'Upper_Negative', 
                                         'Lower_Positive', 
                                         'Upper_Positive'])

In [None]:
sns.histplot(RenamedDF, x = 'Valence_4Cat')
plt.grid()
plt.title('Outcome: Four Classes of Valence')
plt.xlabel('Valence')
plt.ylabel('Count')

In [None]:
# Completing transformation of Valence classes to numerical range of 0-3 
RenamedDF['Valence_4CatNum'] = RenamedDF['Valence_4Cat'].replace({"Lower_Negative":0, "Upper_Negative":1, "Lower_Positive":2,"Upper_Positive":3})

###### Initial pruning of feature set for future numerical analysis

In [None]:
#Dropping unnecessary prior valence variables from data frame
OutcomeDF = RenamedDF.drop(['Valence_4Cat', 'Valence'], axis =1)

In [None]:
OutcomeDF.head()

###### While reviewing the current data set and dropping the earlier Valence columns, I identified additional columns that can be removed, as I do not foresee these features enhancing Valence classification.

In [None]:
# Dropping the following columns:
# - Artist, ID, Song Title (not numerical variables)
# - Release Date (redundant information, require only release year for analysis)
# - Key (key already categorized by mode)
# - duration_ms (duration in minutes column now exists)
OutcomeDFDrop = OutcomeDF.drop(['Artist', 'ID','Song_Title','Release_Date','Key', 'duration_ms'], axis=1)

In [None]:
OutcomeDFDrop.head()

In [None]:
# Relabeling cleaned data set 
BaseDF = OutcomeDFDrop

## Remaining Features: Evaluating Numerical Features
Outliers for particular features will be removed, first, to diminish their effect on the performance of future models and second, to ensure samples more closely reflect typical songs.  For instance, removing outliers for song duration may increase the likelihood that cases will not be lengthy speeches or operas.
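
The cut-off values used in the queries below are read from box plots. As a rough illustration of where such fences come from, the following sketch applies Tukey's 1.5 × IQR rule to hypothetical durations (`durations` is illustrative; the quartile method used by a plotting library may differ slightly from the pandas default, so the exact fences are an assumption).

```python
import pandas as pd

# Hypothetical track durations in minutes; the last value is an obvious outlier.
durations = pd.Series([2.9, 3.1, 3.4, 3.6, 3.8, 4.0, 4.2, 25.0])

q1, q3 = durations.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

kept = durations[durations.between(lower_fence, upper_fence)]
print(len(durations) - len(kept))
```

Computing the fences in code makes the thresholds reproducible if the data set is refreshed.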

### Assessing Features via Histograms and Box Plots 

#### General Features of Music
Acousticness, Danceability, Energy, Loudness, Liveness, Mode (Major/Minor)

##### Acousticness
Acousticness - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

In [None]:
plt.hist(BaseDF['Acousticness'])
plt.title('Acousticness Count')

##### Danceability 
Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.  A value of 0.0 is least danceable and 1.0 is most danceable.

In [None]:
plt.hist(BaseDF['Danceability'])
plt.title('Danceability Count')

###### Removing outliers observed for Danceability

In [None]:
DanceBox = px.box(BaseDF, x = 'Danceability', title="Danceability Outliers")
DanceBox

In [None]:
DanceabilityQuery = BaseDF.query("Danceability > .05")

##### Energy 
Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.  For example, death metal has high energy, while a Bach prelude scores low on the scale.

In [None]:
plt.hist(BaseDF['Energy'])
plt.title('Energy Count')

##### Loudness 

The overall loudness of a track in decibels (dB).  Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.  Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude).  Values typically range between -60 and 0 dB.

In [None]:
plt.hist(BaseDF['Loudness'])
plt.title('Loudness Count')

###### Removing outliers observed for Loudness

In [None]:
LoudnessBox = px.box(BaseDF, x = 'Loudness', title="Loudness Outliers")
LoudnessBox

In [None]:
LoudnessQuery = DanceabilityQuery.query("Loudness > -25.764")

##### Liveness 

Detects the presence of an audience in the recording.  Higher liveness values represent an increased probability that the track was performed live.  A value above 0.8 provides strong likelihood that the track is live.

In [None]:
plt.hist(BaseDF['Liveness'])
plt.title('Liveness Count')

##### Mode (Major/Minor)
Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

In [None]:
plt.hist(BaseDF['Mode'])
plt.title("Mode Count")

#### Lyric-Related Features of Music
Speechiness, Instrumentalness, Presence of Explicit Lyrics

##### Speechiness 
Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.  Values above 0.66 describe tracks that are probably made entirely of spoken words.  Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music.  Values below 0.33 most likely represent music and other non-speech-like tracks.

In [None]:
plt.hist(BaseDF['Speechiness'])
plt.title('Speechiness Count')

###### Removing outliers observed for Speechiness

In [None]:
SpeechBox = px.box(BaseDF, x = 'Speechiness', title="Speechiness Outliers")
SpeechBox

In [None]:
SpeechinessQuery = LoudnessQuery.query("Speechiness < .14")

##### Instrumentalness 
Predicts whether a track contains no vocals.  “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”.  The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.  Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

In [None]:
plt.hist(BaseDF['Instrumentalness'])
plt.title("Instrumentalness Count")

##### Presence of Explicit Lyrics

In [None]:
plt.hist(BaseDF['Explicit'])
plt.title("Explicit Lyrics Count")

#### Time-Related Features of Music
Tempo, Song Duration, Popularity (in time)

##### Tempo
The overall estimated tempo of a track in beats per minute (BPM).  In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

In [None]:
plt.hist(BaseDF['Tempo'])
plt.title("Tempo Count")

In [None]:
TempoBox = px.box(BaseDF, x = 'Tempo', title="Tempo Outliers")
TempoBox

In [None]:
TempoQuery = SpeechinessQuery.query("Tempo > 30 and Tempo < 199.984")

##### Song Duration (Minutes)
Length of track in minutes.

In [None]:
plt.hist(BaseDF['Duration_Mins'])
plt.title("Duration (Minutes) Count")

###### Removing outliers observed for Duration
Tracks longer than about seven minutes (6.99) or shorter than about 36 seconds (0.605 minutes) were excluded from analysis.

In [None]:
DurationBox = px.box(TempoQuery, x = 'Duration_Mins', title="Duration (Minutes) Outliers")
DurationBox

In [None]:
DurationQuery = TempoQuery.query("Duration_Mins > .605 and Duration_Mins < 6.99")

##### Year of Song Release

In [None]:
plt.hist(BaseDF['Release_Year'])

##### Popularity 
The popularity of the track.  The value will be between 0 and 100, with 100 being the most popular.  The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.  Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past.

In [None]:
plt.hist(BaseDF['Popularity'])
plt.title("Popularity")

###### Relabeling new data frame to encompass all revisions to features and feature set 

In [None]:
PreBaseline = DurationQuery

In [None]:
PreBaseline.head()

In [None]:
PreBaseline.shape

###### Short Summary 
I have generated a new data frame.  The Valence outcome has been split into four classes and the features have been examined and reformatted for future analysis.

I desire to generate a data frame with a binary outcome, that may convey a more meaningful separation between positive and negative Valence classes.  My approach will be to use the existing data frame with four Valence classes.  I will now extract the Upper Positive and Lower Negative cases, or the highest and lowest Valence classes.

I expect that the two center classes of the Valence distribution, the songs categorized as Lower Positive and Upper Negative, will contribute to lower test accuracy scores.  I think these Valence ratings are too close to the middle rating of .5 for the algorithm to perform successfully.  In addition, the extremely large number of cases allows for this analysis and pruning.  

I will then be comparing baseline test accuracy scores for the four-class Valence outcome and the binary Valence outcome. Both models include the same features in the current data frame.  The model with the optimal test accuracy score will be used for L1 Regularization in Logistic Regression, for further dimensionality reduction in the current data set. 

# Comparison Baseline Models: Four-Class Versus Binary 

## Generating Data Frame With Binary Outcome: Lower Negative and Upper Positive Valence 

In [None]:
#Selecting data with lower negative and upper positive Valence ratings
Base2ValUpprLwr = PreBaseline.loc[PreBaseline.Valence_4CatNum.isin([0,3])]

In [None]:
Base2ValUpprLwr.head()

In [None]:
#Obtaining shape
Base2ValUpprLwr.shape

In [None]:
sns.histplot(Base2ValUpprLwr, x = 'Valence_4CatNum')
plt.grid()
plt.title('Binary Outcome: Two Classes of Valence')
plt.xlabel('Valence Classes: 0 (Lower Negative) & 3 (Upper Positive)')
plt.ylabel('Count')

Data frame with binary Valence classification has 59,365 cases and 13 features. 

##### Heat Map Comparison
Comparing Heat Maps for Four-Class Outcome and Binary Outcome models.  Also determining if particular features may be removed, due to extremely high correlations.

###### Four-Class Outcome Heatmap

In [None]:
corr = PreBaseline.corr()

f, ax = plt.subplots(figsize=(14,9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
heatmap = sns.heatmap(corr, annot = True, cmap=cmap, center=0.0, vmax = 1,linewidths=1, ax=ax).set(title = "Numerical Variables Correlation")
plt.show()

There are no numerical variables that correlate sufficiently highly to remove the dimension.  In addition, Valence (outcome variable) correlates most highly with Danceability (.54), Energy (.36), and Loudness (.27). 

###### Valence: Binary Outcome Heatmap
Determining if particular features may be removed, due to extremely high correlations.  Utilizing Upper Positive and Lower Negative Classes for this initial data exploration.

In [None]:
corr2 = Base2ValUpprLwr.corr()

f, ax = plt.subplots(figsize=(14,9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
heatmap = sns.heatmap(corr2, annot = True, cmap=cmap, center=0.0, vmax = 1,linewidths=1, ax=ax).set(title = "Numerical Variables Correlation")
plt.show()

There are still no numerical variables that correlate sufficiently highly to remove the dimension. However, correlations increased.  When obtaining the highest and lowest of the four classes of the outcome variable, correlations increased to the following: Danceability (.54 to .67), Energy (.36 to .52), and Loudness (.27 to .40).  This might be an indication that the binary outcome model will perform better than the four-class model.  Baseline models will now be compared.
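
Reading individual values off a heat map can be supplemented by ranking the correlations with the outcome directly. This is a minimal sketch on synthetic data (`related`, `unrelated`, and `target` are hypothetical columns, not features of this data set):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: one strongly related to the target, one unrelated.
related = rng.normal(size=n)
unrelated = rng.normal(size=n)
target = (related + 0.3 * rng.normal(size=n) > 0).astype(int)

toy = pd.DataFrame({"related": related, "unrelated": unrelated, "target": target})

# Correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.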

# Data Preparation Prior to Baseline

I will be comparing two baseline models with Logistic Regression. 

* First Baseline Model: 4 Classes of Valence

The first baseline model has labeled outcomes for four classes of Valence: Upper Positive, Lower Positive, Upper Negative, and Lower Negative.  Multiple classes will require Multinomial Logistic Regression.

* Second Baseline Model: 2 Classes of Valence

The second baseline model has a binary outcome for Valence.  The highest and lowest classes of the initial four-class model (Upper Positive and Lower Negative) were selected to enhance the separation between positive and negative classes.  The large size of the data set allowed for this analysis. 

I have explored, cleaned, and encoded both data frames and will now complete the data preparation prior to the baseline comparison.

# Obtaining Baseline Metrics
I will now commence comparing the baseline performance of both models with Logistic Regression.  All features retained after data preparation will be used in both models.  I am first using Logistic Regression because I desire to use the L1 Regularization tool with Logistic Regression. This type of regularization will highlight priority features, allowing for substantial dimensionality reduction.

Using the L1 Regularization tool in Logistic Regression for feature selection, I will be further pruning the feature set of the best-performing baseline model.  The initial comparison metric to determine the best baseline model will be test accuracy score.   
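
To illustrate how L1 regularization can prune a feature set, the following sketch fits an L1-penalized Logistic Regression on synthetic data in which only two of six features carry signal. The data, the solver choice (`liblinear`), and the value `C=0.01` are assumptions made for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Hypothetical data: only the first two of six features carry signal.
X_toy = rng.normal(size=(n, 6))
y_toy = (2.0 * X_toy[:, 0] - 1.5 * X_toy[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X_toy)

# The L1 penalty needs a solver that supports it (liblinear or saga).
# A small C means strong regularization, pushing weak coefficients to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.01)
l1_model.fit(X_scaled, y_toy)

# Indices of the features whose coefficients survived the penalty.
selected = np.flatnonzero(l1_model.coef_[0])
print(selected)
```

Features whose coefficients are driven to zero are candidates for removal; the strength of the penalty would normally be tuned rather than fixed in advance.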

## Baseline Test Accuracy Scores: Valence Four-Class Model

### Assessing Baseline Data Frame: Four-Class Model
Valence outcome in this data frame has four classes: Upper Positive, Lower Positive, Upper Negative, and Lower Negative.

In [None]:
PreBaseline.shape

In [None]:
PreBaseline.head()

In [None]:
#Checking for null values
PreBaseline.isna().sum()

In [None]:
# Deleting all rows with null values
FinalBaseDFNoNull = PreBaseline.dropna(axis = 0)

In [None]:
#Checking for null values again
FinalBaseDFNoNull.isna().sum()

In [None]:
FinalBaseDFNoNull.shape

In [None]:
#Double-checking for duplicate rows
FinalBaseDFNoNull.duplicated().sort_values()

In [None]:
#Deleting discovered duplicates
FinalBaseDFNoDup = FinalBaseDFNoNull.drop_duplicates()

In [None]:
#Confirming duplicates removed
FinalBaseDFNoDup.shape

In [None]:
#Renaming outcome variable
Renamed4ClassBaseDF = FinalBaseDFNoDup.rename({'Valence_4CatNum': 'Four_Class_Valence'}, axis =1)

In [None]:
Renamed4ClassBaseDF['Four_Class_Valence']

The baseline data frame has been fully reviewed.

### First Shuffling Data Set

In [None]:
from random import shuffle, seed

In [None]:
#Preparing data for Sci-Kit Learn
# ShuffleM1 holds shuffled row positions; train_test_split below also shuffles rows by default

ShuffleM1 = list(range(0, len(Renamed4ClassBaseDF)))
seed(42)
shuffle(ShuffleM1)
ShuffleM1[:5]

### Performing Train/Test Split
Data prepared for split into train and test set.
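
`train_test_split` shuffles rows by default, and it can also preserve class proportions in both subsets through its `stratify` argument. The sketch below uses hypothetical labels (`X_toy`, `y_toy`) to show the effect; stratification is an optional addition and is not part of the split performed in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 rows of class 0 and 20 rows of class 3.
y_toy = np.array([0] * 80 + [3] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)
print((y_te == 3).mean())
```

With stratification, the test set keeps the same share of each class as the full data, which makes accuracy scores easier to compare across splits.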

In [None]:
# Naming feature and outcome variables for split
X = Renamed4ClassBaseDF.drop(['Four_Class_Valence'], axis=1)
y = Renamed4ClassBaseDF['Four_Class_Valence']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#Performing train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

#### Obtaining Baseline Scores With Multinomial Logistic Regression for 13 Features

In [None]:
#Scaling and fitting the data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
logreg = LogisticRegression(multi_class='multinomial')
logreg.fit(X_train, y_train)
y_proba = logreg.predict_proba(X_test)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [None]:
# Obtaining intercept
logreg.intercept_

In [None]:
# Obtaining coefficients
logreg.coef_

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
#Training accuracy
score = logreg.score(X_train, y_train)
print(score)

In [None]:
# Test accuracy
# predict() returns the most probable of the four classes (0-3), matching the encoding of y_test
y_pred = logreg.predict(X_test)
Accuracy = accuracy_score(y_test, y_pred)

In [None]:
Accuracy

###### Table Summary

In [None]:
d = {'Four-Class Logistic Regression': ['Accuracy Scores'],
     'Train': ['52.87%'],
     'Test': ['17.79%']}

In [None]:
FourClassLR = pd.DataFrame(data=d)

In [None]:
FourClassLRT = FourClassLR.set_index('Four-Class Logistic Regression')

In [None]:
FourClassLRT

##### Test accuracy score to beat, for Four-Class Valence outcome, is 17.79%
Baseline test accuracy score is extremely low. Comparing baseline accuracy for binary (two-class) outcome.

## Baseline Test Accuracy Scores: Valence Two-Class (Binary) Model

### Assessing Baseline Data Frame: Binary Model
Valence outcome in this data frame has two classes: Upper Positive and Lower Negative.

In [None]:
 #Accessing alternate data frame with binary Valence outcome for comparison. 
Base2ValUpprLwr.head()

In [None]:
#Obtaining shape (binary model should have fewer cases than four-class model)
Base2ValUpprLwr.shape

In [None]:
#Checking for null values
Base2ValUpprLwr.isna().sum()

In [None]:
#Checking for duplicates
Base2ValUpprLwr.duplicated().sort_values()

In [None]:
#Deleting discovered duplicates
BinaryBaseDFNoDup = Base2ValUpprLwr.drop_duplicates()

In [None]:
#Confirming duplicates removed
BinaryBaseDFNoDup.shape

In [None]:
#Renaming outcome variable
RenamedBinaryBaseDF = BinaryBaseDFNoDup.rename({'Valence_4CatNum':'Binary_Valence'}, axis =1)

This Binary outcome baseline data frame has been fully reviewed.

### First Shuffling Data Set

In [None]:
#Preparing data for Sci-Kit Learn
# ShuffleM2 holds shuffled row positions; train_test_split below also shuffles rows by default

ShuffleM2 = list(range(0, len(RenamedBinaryBaseDF)))
seed(42)
shuffle(ShuffleM2)
ShuffleM2[:5]

### Performing Train/Test Split
Data prepared for split into train and test set.

In [None]:
# Naming feature and outcome variables for split
X = RenamedBinaryBaseDF.drop(['Binary_Valence'], axis=1)
y = RenamedBinaryBaseDF['Binary_Valence']

In [None]:
#Performing train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)

#### Obtaining Baseline Scores With Logistic Regression for 13 Features

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_proba = logreg.predict_proba(X_test)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [None]:
# Obtaining intercept
logreg.intercept_

In [None]:
# Obtaining coefficients
logreg.coef_

In [None]:
#Training accuracy
score = logreg.score(X_train, y_train)
print(score)

In [None]:
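# Test accuracy at each probability threshold
# Thresholded predictions are mapped to the outcome labels: 0 (Lower Negative) and 3 (Upper Positive)
for threshold in thresholds:
    y_pred = np.where(y_proba > threshold, 3, 0)
    Accuracy = accuracy_score(y_test, y_pred)
    print(threshold, Accuracy)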

In [None]:
Accuracy

###### Summary Tables

In [None]:
d = {'Two-Class Logistic Regression': ['Accuracy Scores'],
     'Train': ['91.99%'],
     'Test': ['39.16%']}

In [None]:
BinaryLR = pd.DataFrame(data=d)

In [None]:
BinaryLRT = BinaryLR.set_index('Two-Class Logistic Regression')

In [None]:
BinaryLRT

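##### Test accuracy for the Binary Valence outcome improved over the Four-Class outcome (17.79% to 39.16%).  The training accuracy above 91% is far higher than the test accuracy; this gap may indicate overfitting, though a gap this large also warrants checking that test predictions use the same label encoding as the outcome.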

In [None]:
d = {'Logistic Regression Outcome Variable': ['Baseline Test Accuracy Scores (13 Features)'],
     'Binary': ['39.16%'],
     'Four-Class': ['17.79%']}

In [None]:
BaselineComparison = pd.DataFrame(data=d)

In [None]:
ValenceTable = BaselineComparison.set_index('Logistic Regression Outcome Variable')

In [None]:
ValenceTable

##### Result of test accuracy score comparison: The binary (Upper Positive and Lower Negative) model is the optimal baseline model.  The binary outcome data frame will be used for further analysis.  Test accuracy score to beat is now 39.16%.  Since test accuracy score is below chance, I will be evaluating whether balancing classes can improve on this test accuracy score. 
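
When a binary test accuracy lands below chance, one thing worth ruling out is a mismatch between predicted and true label encodings. When the outcome is encoded as 0 and 3, as `Binary_Valence` is here, thresholded probabilities cast with `astype(int)` produce 0 and 1 and need to be mapped back to the original labels before scoring. A minimal sketch with hypothetical values (`y_true` and `proba_class3` are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels encoded as 0 and 3, with predicted
# probabilities for the class labeled 3.
y_true = np.array([0, 0, 3, 3, 3])
proba_class3 = np.array([0.1, 0.2, 0.7, 0.8, 0.95])

# Thresholding yields 0/1; a prediction of 1 can never equal a label of 3.
raw_pred = (proba_class3 > 0.5).astype(int)
mismatched = accuracy_score(y_true, raw_pred)

# Mapping the thresholded result back to the original labels fixes this.
mapped_pred = np.where(proba_class3 > 0.5, 3, 0)
matched = accuracy_score(y_true, mapped_pred)

print(mismatched, matched)
```

In this toy case every probability is on the correct side of the threshold, yet the unmapped score only credits the rows labeled 0.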

## Evaluating Balance of Classes for Binary Model

### Visualizing Balance of Classes

In [None]:
#Replacing row values for histogram plot visualization 
RenamedBinaryBaseDF["Binary"] = RenamedBinaryBaseDF["Binary_Valence"].replace({0: "Lower Negative Valence", 
                                     3: "Upper Positive Valence"})

In [None]:
sns.countplot(data=RenamedBinaryBaseDF, x = 'Binary')
plt.title('Count of Target Observations')

#### The classes are closely, but not perfectly, balanced.  I will be evaluating if perfectly balancing the data set increases accuracy.  Accuracy is the chosen metric as it may best capture the number of correct predictions against the total number of predictions - in a binary classification problem.
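
One way to balance two classes is to downsample the larger class to the size of the smaller one, drawing without replacement so that no row is repeated. The following is a sketch on a hypothetical frame (`toy`, `feature`, and `label` are illustrative names); it is an alternative formulation and not the exact procedure used in the cells below.

```python
import pandas as pd

# Hypothetical imbalanced frame: six rows of class 3, four rows of class 0.
toy = pd.DataFrame({
    "feature": range(10),
    "label": [3] * 6 + [0] * 4,
})

# Size of the smaller class sets the per-class sample size.
n_minority = int(toy["label"].value_counts().min())

# Draw n_minority rows from each class without replacement.
balanced = toy.groupby("label").sample(n=n_minority, random_state=0)
print(balanced["label"].value_counts())
```

Sampling within each group keeps the procedure symmetric, so it does not matter in advance which class is the larger one.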

##### Balancing Classes

In [None]:
UpperPosValence = len(RenamedBinaryBaseDF[RenamedBinaryBaseDF['Binary_Valence'] == 3])
LowerNegValenceIndices = RenamedBinaryBaseDF[RenamedBinaryBaseDF.Binary_Valence == 0].index

In [None]:
random_indices = np.random.choice(LowerNegValenceIndices, UpperPosValence)

In [None]:
UpperPosValenceIndices = RenamedBinaryBaseDF[RenamedBinaryBaseDF.Binary_Valence == 3].index

In [None]:
under_sample_indices = np.concatenate([UpperPosValenceIndices,random_indices])

In [None]:
under_sample = RenamedBinaryBaseDF.loc[under_sample_indices]

In [None]:
sns.countplot(data=under_sample, x = 'Binary')
plt.title('Count of Target Observations')
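The random draw in the cells above uses NumPy's defaults, which sample with replacement and without a fixed seed.  A hedged variant is sketched below with made-up index ranges.  It seeds the generator and sets `replace=False`, which is only valid when the class being sampled is the larger one.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the draw is repeatable

# Hypothetical row indices for each class (stand-ins for the DataFrame index)
larger_class_idx = np.arange(0, 60)
smaller_class_idx = np.arange(60, 100)

# Draw as many rows from the larger class as the smaller class has,
# without replacement so no row is duplicated
sampled_idx = rng.choice(larger_class_idx, size=len(smaller_class_idx), replace=False)

balanced_idx = np.concatenate([smaller_class_idx, sampled_idx])
print(len(balanced_idx), len(np.unique(balanced_idx)))
```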

###### Sampling from balanced data set.  Obtaining test accuracy scores for balanced data set with binary Valence outcome.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Naming feature and outcome variables for balanced data set
X_under = under_sample.loc[:,under_sample.columns != 'Binary_Valence']
y_under = under_sample.loc[:,under_sample.columns == 'Binary_Valence']
X_under_train, X_under_test, y_under_train, y_under_test = train_test_split(X_under,y_under,test_size = 0.30, random_state = 0)

X = X_under 
y = y_under 

##### Performing Logistic Regression on Model 

In [None]:
# Naming feature and outcome variables for split (using the balanced data set)
X = under_sample.drop(['Binary_Valence', 'Binary'], axis=1)
y = under_sample['Binary_Valence']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
logreg2 = LogisticRegression()
logreg2.fit(X_train, y_train)
y_proba = logreg2.predict_proba(X_test)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [None]:
# Obtaining intercept
logreg2.intercept_

In [None]:
# Obtaining coefficients
logreg2.coef_

In [None]:
logreg_beta0 = logreg2.intercept_
logreg_beta1 = logreg2.coef_
logreg_thresh = -logreg_beta0/logreg_beta1
logreg_beta0, logreg_beta1, logreg_thresh

In [None]:
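# Mapping thresholded probabilities back to the class labels (0 and 3) before scoring,
# and keeping one accuracy value per threshold instead of only the last one
Accuracy = {}
for threshold in thresholds:
    y_pred = np.where(y_proba > threshold, logreg2.classes_[1], logreg2.classes_[0])
    Accuracy[threshold] = accuracy_score(y_test, y_pred)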


In [None]:
# Obtaining training accuracy for model
score = logreg2.score(X_train, y_train)
print(score)

In [None]:
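# Obtaining test accuracy at each threshold; predictions are mapped back to the
# class labels (0 and 3) so they can be compared with y_test
Accuracy = {}
for threshold in thresholds:
    y_pred = np.where(y_proba > threshold, logreg2.classes_[1], logreg2.classes_[0])
    Accuracy[threshold] = accuracy_score(y_test, y_pred)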

In [None]:
Accuracy

###### Summary Tables

In [None]:
d = {'Two-Class Logistic Regression': ['Accuracy Scores (Balanced Classes)'],
     'Train': ['80.63%'],
     'Test': ['38.81%']}

In [None]:
BalancedLR = pd.DataFrame(data=d)

In [None]:
BalancedLRT = BalancedLR.set_index('Two-Class Logistic Regression')

In [None]:
BalancedLRT

#### Test accuracy score after balancing classes on binary outcome for Valence: 38.81%. Baseline test accuracy score is still below chance.  High training accuracy score, 80.63%, indicates overfitting.

In [None]:
d = {'Binary Outcome Model Comparison': ['Test Accuracy Scores'],
     'Pre-Balancing Classes': ['39.16%'],
     'Post-Balancing Classes': ['38.81%']}

In [None]:
ComparisonBinary = pd.DataFrame(data=d)

In [None]:
BalanceTable = ComparisonBinary.set_index('Binary Outcome Model Comparison')

In [None]:
BalanceTable

##### Baseline test accuracy to beat remains 39.16%.  Balancing classes did not improve test accuracy score.  The data frame pre-balancing the classes will now be used for further analysis.  The next step will be model building based on this data frame.  Dimensionality reduction through Logistic Regression L1 Regularization is the first goal in model-building. 

# Model-Building: Improving the Logistic Regression Model 

## Evaluating Priority Features Via L1 Regularization
L1 Regularization shrinks the coefficients of less useful features to exactly zero as the penalty grows.  It therefore highlights priority features and is a practical solution for reducing dimensions. 
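To make the mechanism concrete, the sketch below fits L1-penalized Logistic Regression at a few values of `C` on synthetic data and counts how many coefficients remain non-zero.  The data set, the number of informative features, and the chosen `C` values are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which carry signal (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

nonzero_counts = []
for C in [1e-4, 1e-2, 1.0, 10.0]:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C, random_state=0)
    model.fit(X_demo, y_demo)
    # Count the coefficients the penalty has not driven to exactly zero
    nonzero_counts.append(int(np.sum(model.coef_[0] != 0)))

print(nonzero_counts)
```

Smaller `C` means a stronger penalty, so the features whose coefficients survive longest as `C` shrinks are the ones treated as priority features.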

In [None]:
# Naming feature and outcome variables
X = RenamedBinaryBaseDF.drop(['Binary_Valence', 'Binary'], axis=1)
y = RenamedBinaryBaseDF['Binary_Valence']

In [None]:
# Obtaining current dimensions of data frame
X.shape

In [None]:
# Reviewing feature names
X.head()

In [None]:
# Standardizing data (zero mean, unit variance per feature)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Performing L1 regularization
Cs = np.logspace(-5, .5)

In [None]:
coef_list = []
for C in Cs:
    lgr = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = C, random_state=42, max_iter = 1000).fit(X_scaled, y)
    coef_list.append(list(lgr.coef_[0]))

In [None]:
coef_list[0]

In [None]:
coef_df = pd.DataFrame(coef_list, columns = X.columns)
coef_df.index = Cs

In [None]:
coef_df.head()

In [None]:
L1_Regularization = plt.figure(figsize = (12, 5))
plt.semilogx(coef_df)
plt.gca().invert_xaxis()
plt.grid()
plt.legend(list(coef_df.columns));
plt.title('Increasing Regularization on Baseline Model')
plt.xlabel("Increasing 1/C")
plt.savefig('coefl1.png')

### After evaluation, priority features are Danceability, Energy, and Release Year

#### Generating new data frame to include only three features

##### Visualizing model

In [None]:
#Reformatting release year to release decade for current plot
#right=False makes each bin [start, end), so 1930 falls in the 1930s rather than the 1920s
decade_starts = list(range(1920, 2040, 10))
RenamedBinaryBaseDF['Release_Decade'] = pd.cut(x=RenamedBinaryBaseDF['Release_Year'],
                               bins = decade_starts,
                               labels = [f'{start}s' for start in decade_starts[:-1]],
                               right = False)

In [None]:
# Providing visualization of current model
f, ax = plt.subplots(figsize=(20,15))
sns.scatterplot(x='Danceability', y='Energy', data=RenamedBinaryBaseDF, 
                hue = 'Binary', 
                size='Release_Decade',
                sizes=(20,200), ax = ax)

# Setting the title and axis labels after the axes object exists
ax.set_title('Danceability by Energy With Decade of Song Release')
ax.set_xlabel('Danceability')
ax.set_ylabel('Energy')

###### Reviewing the plot, the separation between Valence classes is evident.  The upper right tip of the scatter plot, showing high energy and high danceability, appears to contain smaller orange circles.  This indicates positive valence in more recent decades.  In addition, the bottom left corner of the scatter plot contains larger blue circles.  This area, showing low danceability and low energy, indicates negative valence in earlier decades.  L1 Regularization in Logistic Regression appears to have reduced dimensions in a meaningful manner.

In [None]:
# Naming feature and outcome variables for Logistic Regression
X = RenamedBinaryBaseDF[['Danceability','Energy','Release_Year']]
y = RenamedBinaryBaseDF['Binary_Valence']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_proba = logreg.predict_proba(X_test)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [None]:
# Obtaining intercept
logreg.intercept_

In [None]:
# Obtaining coefficients
logreg.coef_

In [None]:
logreg_beta0 = logreg.intercept_
logreg_beta1 = logreg.coef_
logreg_thresh = -logreg_beta0/logreg_beta1
logreg_beta0, logreg_beta1, logreg_thresh

In [None]:
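# Mapping thresholded probabilities back to the class labels (0 and 3) before scoring,
# and keeping one accuracy value per threshold instead of only the last one
Accuracy = {}
for threshold in thresholds:
    y_pred = np.where(y_proba > threshold, logreg.classes_[1], logreg.classes_[0])
    Accuracy[threshold] = accuracy_score(y_test, y_pred)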

###### Summary Tables

In [None]:
d = {'Intercept': ['0.49', '', ''],
     'Coefficients (Danceability, Energy, Release Year)': ['1.08', '3.85', ' -3.84'],
     'Thresholds (Danceability, Energy, Release Year)': ['-4.51', '-1.27', '1.27']}

In [None]:
SimplerModel = pd.DataFrame(data=d)

In [None]:
SmplrModelTble = SimplerModel.set_index('Intercept')

In [None]:
SmplrModelTble

#### Three-Feature Coefficients Table and Visualizations
##### In this model, the feature that had the most explanatory value in the outcome variable was Energy (with Release Year closely following).  One can interpret that ratings for energy and year of song release have the most explanatory power for the outcome (positive or negative valence) in this current Logistic Regression model.  Also, release year has a negative relationship with the model outcome.  I will be visualizing the Release Year feature to review possible reasons for this negative relationship. 
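Logistic Regression coefficients are on the log-odds scale, so one way to read the table is to exponentiate them into odds ratios.  The sketch below does that for the three reported values.  It assumes a one-unit change in whatever units each feature had when the model was fit.  That matters for comparison: Danceability and Energy run from 0 to 1, while Release Year spans roughly a century, so raw coefficient sizes are not directly comparable unless the features were first put on a common scale.

```python
import math

# Coefficients as reported in the summary table above
coefficients = {'Danceability': 1.08, 'Energy': 3.85, 'Release Year': -3.84}

# exp(coefficient) is the multiplicative change in the odds of the positive class
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio {ratio:.3f}')
```

An odds ratio above 1 corresponds to a positive relationship with the outcome, and a ratio below 1 to a negative one.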

##### Follow-Up Visualization to Coefficients Table (Release Year by Energy)

In [None]:
CoefsPlot = sns.jointplot(kind = 'hex', x = RenamedBinaryBaseDF['Release_Year'], 
                          y = RenamedBinaryBaseDF['Energy'])
CoefsPlot.fig.suptitle("Release Year by Energy")
CoefsPlot.fig.tight_layout()
CoefsPlot.fig.subplots_adjust(top=0.95)

###### Reviewing this plot, it appears there were a high number of songs with low energy (ratings < .25) between 1940 and 1960.  One can also see that the majority of songs after 1960 had energy ratings above .40.  Specifically, the years close to 1980 had songs with energy ratings higher than .50. 

In [None]:
CoefsPlot = sns.jointplot(kind = 'hex', x = RenamedBinaryBaseDF['Release_Year'], 
                          y = RenamedBinaryBaseDF['Danceability'])
CoefsPlot.fig.suptitle("Release Year by Danceability")
CoefsPlot.fig.tight_layout()
CoefsPlot.fig.subplots_adjust(top=0.95)

###### Reviewing this plot, it appears there were a large number of songs with high Danceability (ratings > .5) between 1960 and 1990.  The majority of songs have had high danceability since the 1960s. 

###### Three-Feature Coefficients Assessment
It appears that the reason for the negative relationship of Release Year to Valence (outcome variable) might be the large number of high-Danceability songs in earlier decades (1960s to 1990s).

In [None]:
# Obtaining training accuracy for model
score = logreg.score(X_train, y_train)
print(score)

In [None]:
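# Obtaining test accuracy at each threshold; predictions are mapped back to the
# class labels (0 and 3) so they can be compared with y_test
Accuracy = {}
for threshold in thresholds:
    y_pred = np.where(y_proba > threshold, logreg.classes_[1], logreg.classes_[0])
    Accuracy[threshold] = accuracy_score(y_test, y_pred)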

In [None]:
Accuracy

In [None]:
d = {'Three-Feature Model: Logistic Regression': ['Accuracy Scores'],
     'Train (.1 sec)': ['85.79%'],
     'Test': ['38.41%']}

In [None]:
ThreeFeatureLR = pd.DataFrame(data=d)

In [None]:
ThreeFeatureLRT = ThreeFeatureLR.set_index('Three-Feature Model: Logistic Regression')

In [None]:
ThreeFeatureLRT

###### The model trained well (train accuracy 85.79%), with a processing time of .1 second.  However, the test accuracy for this model remained below chance at 38.41% (performing worse than baseline test accuracy 39.16%).  The high training accuracy score also indicates this model was overfit. 

### Summary Plot

In [None]:
def decision_boundary(x0, beta_0, beta_1, beta_2):
    return -(beta_1/beta_2)*x0 - beta_0/beta_2

In [None]:
beta_0 = -6.74
beta_1 = 10.40
beta_2 = 3.60

In [None]:
#Generating scatter plot 
f, ax = plt.subplots(figsize=(20,15))
x = np.linspace(.056, .988, 100) 
Three_FeatureLR = sns.scatterplot(data = RenamedBinaryBaseDF, x = 'Danceability', y = 'Energy', 
                hue = 'Binary', size='Release_Decade',
                sizes=(20,200), ax = ax)
plt.plot(x, decision_boundary(x, beta_0, beta_1, beta_2), '--', color = 'black')
plt.ylim(0, 1)
plt.fill_between(x, decision_boundary(x, beta_0, beta_1, beta_2), alpha = 0.3, color = 'lightblue')
plt.fill_between(x, decision_boundary(x, beta_0, beta_1, beta_2), np.repeat(70, 100), alpha = 0.3)

# Setting the title and axis labels after the axes object exists
ax.set_title('Danceability by Energy With Decade of Song Release (Decision Boundary)')
ax.set_xlabel('Danceability')
ax.set_ylabel('Energy')

###### Reviewing this plot, it is possible to see the areas where the data are misclassified (to the left of the boundary line and throughout the right side of the decision boundary).  This Logistic Regression model might have performed well with the training data, but did not generalize well to the test data.  Low test accuracy scores might be an indication that a curvilinear, or different, model type is required for increased test accuracy.  Prior to moving forward with different algorithms to improve test accuracy scores, I will assess if a two-feature model (including only Danceability and Energy) will improve test accuracy for Logistic Regression. 
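The dashed line comes from setting the model's linear score to zero.  For two features, beta_0 + beta_1*x0 + beta_2*x1 = 0 rearranges to x1 = -(beta_1/beta_2)*x0 - beta_0/beta_2, which is what `decision_boundary` returns.  The sketch below restates that function and checks, for one arbitrary point, that a point on the line maps to a predicted probability of 0.5.  It assumes x0 is Danceability and x1 is Energy, matching the plot axes.

```python
import math

def decision_boundary(x0, beta_0, beta_1, beta_2):
    # Solve beta_0 + beta_1*x0 + beta_2*x1 = 0 for x1
    return -(beta_1 / beta_2) * x0 - beta_0 / beta_2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values used for the summary plot above
beta_0, beta_1, beta_2 = -6.74, 10.40, 3.60

x0 = 0.6  # an arbitrary Danceability value
x1 = decision_boundary(x0, beta_0, beta_1, beta_2)

# On the boundary the linear score is zero, so the predicted probability is 0.5
probability = sigmoid(beta_0 + beta_1 * x0 + beta_2 * x1)
print(x1, probability)
```

Points above the line receive a probability greater than 0.5 when beta_2 is positive, and points below it receive less.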

# Comparing Two-Feature to Three-Feature Model Using Logistic Regression

In [None]:
# Naming feature and outcome variables for Logistic Regression
X = RenamedBinaryBaseDF[['Danceability','Energy']]
y = RenamedBinaryBaseDF['Binary_Valence']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_proba = logreg.predict_proba(X_test)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [None]:
# Obtaining intercept
logreg.intercept_

In [None]:
# Obtaining coefficients
logreg.coef_

In [None]:
logreg_beta0 = logreg.intercept_
logreg_beta1 = logreg.coef_
logreg_thresh = -logreg_beta0/logreg_beta1
logreg_beta0, logreg_beta1, logreg_thresh

In [None]:
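# Mapping thresholded probabilities back to the class labels (0 and 3) before scoring,
# and keeping one accuracy value per threshold instead of only the last one
Accuracy = {}
for threshold in thresholds:
    y_pred = np.where(y_proba > threshold, logreg.classes_[1], logreg.classes_[0])
    Accuracy[threshold] = accuracy_score(y_test, y_pred)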

In [None]:
# Obtaining training accuracy for model
score = logreg.score(X_train, y_train)
print(score)

In [None]:
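# Obtaining test accuracy at each threshold; predictions are mapped back to the
# class labels (0 and 3) so they can be compared with y_test
Accuracy = {}
for threshold in thresholds:
    y_pred = np.where(y_proba > threshold, logreg.classes_[1], logreg.classes_[0])
    Accuracy[threshold] = accuracy_score(y_test, y_pred)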

In [None]:
Accuracy

## Summary Table

In [None]:
d = {'Model Comparison (Logistic Regression)': ['Test Accuracy Scores'],
     '3:Danceability, Energy, Release Year': ['38.41%'],
     '2:Danceability, Energy': ['38.33%']}

In [None]:
ThreeTwoLR = pd.DataFrame(data=d)

In [None]:
ThreeTwoLRT = ThreeTwoLR.set_index('Model Comparison (Logistic Regression)')

In [None]:
ThreeTwoLRT

### The two-feature Logistic Regression model did not improve the test accuracy score.  Therefore, I will be continuing with the three-feature model (with features Danceability, Energy, and Release Year) for model building. 

# Model Comparisons: Logistic Regression, KNN, Decision Trees 

Now, I aim to compare the default performance of this pruned Logistic Regression model to the default performance of K-Nearest Neighbors and Decision Tree algorithms.  K-Nearest Neighbors and Decision Trees were chosen as algorithms because they have processing times suitable for larger data sets.  The accuracy metric will continue to be used for model comparison, similar to the baseline model evaluations.

## Setting Up Model for Comparisons

In [None]:
# Naming feature and outcome variables
X = RenamedBinaryBaseDF[['Danceability','Energy','Release_Year']]
y = RenamedBinaryBaseDF['Binary_Valence']

### K-Nearest Neighbors Model Evaluation

In [None]:
print(X.head())
print('==============')
print(y.head())

In [None]:
# Train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)
print(X_train.shape)
print(X_test.shape)


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

In [None]:
KNNBasic = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors = 5))])
KNNBasic.fit(X_train, y_train)
KNNVBasic_acc_train = KNNBasic.score(X_train, y_train)
KNNBasic_acc_test = KNNBasic.score(X_test, y_test)


In [None]:
KNNVBasic_acc_train

In [None]:
KNNBasic_acc_test

#### Summary Table

In [None]:
d = {'KNN (k = 5)': ['Accuracy Scores'],
     'Train (.1 sec)': ['92.58%'],
     'Test': ['90.17%']}

In [None]:
KNNDefault = pd.DataFrame(data=d)

In [None]:
KNNDefaultTable = KNNDefault.set_index('KNN (k = 5)')

In [None]:
KNNDefaultTable

#### Training and test accuracy scores were higher for the default K-Nearest Neighbors model (n_neighbors = 5) than for the Logistic Regression models.  Time to train was .1 seconds.  The large rise in test accuracy score indicates the effectiveness of this model, which predicts the majority class of the five closest data points.
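The majority-vote rule can be written out in a few lines.  The sketch below is a plain-Python illustration on made-up, already-scaled points.  It is not how scikit-learn implements `KNeighborsClassifier` internally, and the helper name `knn_predict` is hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=5):
    # Distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among their labels
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy points (hypothetical values, not taken from the data set)
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3),
                (0.8, 0.9), (0.9, 0.8), (0.85, 0.7), (0.7, 0.95)]
train_labels = ['negative', 'negative', 'negative',
                'positive', 'positive', 'positive', 'positive']

prediction = knn_predict(train_points, train_labels, query=(0.75, 0.8), k=5)
print(prediction)
```

Because the vote depends on distances, features on very different scales (such as Release Year versus Energy) need scaling first, which is what the `StandardScaler` step in the pipeline provides.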

### Decision Tree Model Evaluation

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn import tree

In [None]:
# Naming feature and outcome variables
X = RenamedBinaryBaseDF[['Danceability','Energy','Release_Year']]
y = RenamedBinaryBaseDF['Binary_Valence']

In [None]:
# Train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)

print(X_train.shape)
print(X_test.shape)

In [None]:
dtree = DecisionTreeClassifier(max_depth = None, random_state = 42).fit(X_train, y_train)
print(dtree)

In [None]:
depth_1 = dtree.get_depth()
print(depth_1)

In [None]:
train_acc = dtree.score(X_train, y_train)
test_acc = dtree.score(X_test, y_test)
print(f'Training Accuracy: {train_acc: .3f}')
print(f'Test Accuracy: {test_acc: .3f}')

#### Summary Tables

In [None]:
d = {'Decision Tree (Depth 31)': ['Accuracy Scores'],
     'Train (.1 sec)': ['100%'],
     'Test': ['86.90%']}

In [None]:
DTMxDpthNone = pd.DataFrame(data=d)

In [None]:
DTNoneTable = DTMxDpthNone.set_index('Decision Tree (Depth 31)')

In [None]:
DTNoneTable

####  This decision tree model did not improve on the test accuracy score of the K-Nearest Neighbors model.  However, its test accuracy score (86.90%) was good and well above that of Logistic Regression (38.41%), and time to train was .1 seconds.  The max depth parameter was set to "None" for maximum complexity, which is why training accuracy was 100%: the tree grew until it fit every training data point.  

### Second Decision Tree Model Evaluation (Grid Search CV)
Since test accuracy scores were good for the Decision Tree model, I will be using Grid Search CV to evaluate whether the decision tree model might be improved with different parameters. 

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Defining the parameter grid for the decision tree model (searched with cross validation)
params = {'max_depth': [15,30,45],
         'min_samples_split': [.1,.2,.05],
          'criterion': ['gini'],
          'min_samples_leaf': [1,10,20]
         }

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
#GridSearch CV
grid = GridSearchCV(DecisionTreeClassifier(random_state = 42), param_grid=params).fit(X_train, y_train)
grid_train_acc = grid.score(X_train, y_train)
grid_test_acc = grid.score(X_test, y_test)
best_params = grid.best_params_
print(f'Training Accuracy: {grid_train_acc: .3f}')
print(f'Test Accuracy: {grid_test_acc: .3f}')
print(f'Best parameters of tree: {best_params}')

#### Summary Tables

In [None]:
d = {'Decision Tree (Depth 15)': ['Accuracy Scores'],
     'Train (18 sec)': ['87.20%'],
     'Test': ['86.60%']}

In [None]:
DT2 = pd.DataFrame(data=d)

In [None]:
DT2Table = DT2.set_index('Decision Tree (Depth 15)')

In [None]:
DT2Table

##### Test accuracy score (86.60%) was extremely close to, but slightly below, the test accuracy score for the first decision tree (86.90%).  Grid Search CV therefore did not improve on the first Decision Tree.  
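Grid Search CV selects parameters by cross-validated accuracy on the training folds, so the test score of the selected tree is not guaranteed to beat an unconstrained tree.  The sketch below, on synthetic data, shows where each number comes from: `best_score_` is the mean cross-validated score of the winning combination, while `score` on held-out data evaluates the refit model.  The data and the reduced grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the notebook uses the Spotify features)
X_demo, y_demo = make_classification(n_samples=600, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

param_grid = {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 10, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the winning combination (training folds only)
cv_accuracy = search.best_score_
# Accuracy of the refit best tree on the held-out test set
test_accuracy = search.score(X_te, y_te)

print(search.best_params_, round(cv_accuracy, 3), round(test_accuracy, 3))
```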

#### Table Summary of Models With Best Test Accuracy Scores

In [None]:
d = {'Three-Features Best Model Comparison': ['Test Accuracy Scores'],
     'KNN': ['90.17%'], 
     'Decision Tree': ['86.90%'],
     'Logistic Regression': ['38.41%']}

In [None]:
ThreeFtOptimal = pd.DataFrame(data=d)

In [None]:
Model_Comparison = ThreeFtOptimal.set_index('Three-Features Best Model Comparison')

In [None]:
Model_Comparison 

##### The models I chose to compare with the Logistic Regression models were K-Nearest Neighbors and Decision Trees.  My justification for using these models was the large size of the data set.  Both models increased the test accuracy score by more than 45 percentage points.  K-Nearest Neighbors outperformed the other models with a test accuracy score of 90.17%.  I will next attempt to optimize the K-Nearest Neighbors model. 
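The size of the improvement is easiest to state as a difference in percentage points.  A short sketch of that arithmetic, using the test accuracy scores from the comparison table:

```python
# Test accuracy scores reported in the comparison table (in percent)
test_accuracy = {'KNN': 90.17, 'Decision Tree': 86.90, 'Logistic Regression': 38.41}

reference = test_accuracy['Logistic Regression']

# Absolute gain over Logistic Regression, in percentage points
gains = {name: round(score - reference, 2)
         for name, score in test_accuracy.items() if name != 'Logistic Regression'}

print(gains)
```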

# Model-Building: K-Nearest Neighbors and GridSearch CV 

## Determining Optimal K Parameter With Grid Search CV

In [None]:
import time

In [None]:
# Naming feature and outcome variables
X = RenamedBinaryBaseDF[['Danceability','Energy','Release_Year']]
y = RenamedBinaryBaseDF['Binary_Valence']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)

print(X_train.shape)
print(X_test.shape)

In [None]:
# Establishing parameters dictionary for n_neighbors parameter
params = {'knn__n_neighbors': list(range(1, 22, 2))}

In [None]:
# Listing parameter values
list(params.values())[0]

In [None]:
# Defining Pipeline
Pipe = Pipeline([('Norm', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
Pipe.fit(X_train, y_train)
Pipe_acc = Pipe.score(X_test, y_test)

In [None]:
params = {'knn__n_neighbors': list(range(1,22,2))} 
knn_grid = GridSearchCV(Pipe, param_grid=params) 
start = time.time()
knn_grid.fit(X_train, y_train) 
stop = time.time()
best_k = list(knn_grid.best_params_.values())[0]
best_acc = knn_grid.score(X_test, y_test)

### Optimal k

In [None]:
print(best_acc)
print(best_k)

In [None]:
#Training time will vary slightly with each computer (approximately 18s)
print(f"Training time: {stop - start}s")

#### Confirming optimal k with another GridSearch CV

In [None]:
## Establishing parameters dictionary for n_neighbors parameter
params2 = {'knn__n_neighbors': list(range(21, 31, 1))}

In [None]:
#Listing parameter values
list(params2.values())[0]

In [None]:
Pipe = Pipeline([('Norm', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
Pipe.fit(X_train, y_train)
Pipe_acc = Pipe.score(X_test, y_test)

In [None]:
knn_grid2 = GridSearchCV(Pipe, param_grid=params2) 
start = time.time()
knn_grid2.fit(X_train, y_train) 
stop = time.time()
best_k = list(knn_grid2.best_params_.values())[0]
best_acc = knn_grid2.score(X_test, y_test)

In [None]:
print(best_acc)
print(best_k)

In [None]:
#Training time will vary slightly with each computer (approximately 18s)
print(f"Training time: {stop - start}s")

##### The second KNN model with GridSearch CV had the same results.

###### Summary Tables and Visualizations

In [None]:
d = {'KNN (k=21)': ['Accuracy Scores'],
     'Train (18 sec)': ['91.57%'],
     'Test': ['91.13%']}

In [None]:
KNNFinal = pd.DataFrame(data=d)

In [None]:
KNNFinalTable = KNNFinal.set_index('KNN (k=21)')

In [None]:
KNNFinalTable

The optimal model for classifying valence of songs from musical characteristics was K-Nearest Neighbors.  The model improved from 90.17% (n_neighbors = 5) to 91.13% (n_neighbors = 21).  The rise in test accuracy score indicates increased effectiveness of the model, with prediction according to the majority class of the 21 closest data points.  GridSearch CV was the tool used to find the best k parameter.  It is also notable that both the training and test accuracy scores for this model were high, 91.57% and 91.13%, respectively.  The model was both sensitive to training data points and generalizable to test data points. 
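The search for k can be condensed into a single pipeline-plus-grid pattern.  The sketch below runs it on synthetic data (an assumption for illustration).  The parameter key follows scikit-learn's `<step name>__<parameter>` convention, and keeping the scaler inside the pipeline means each cross-validation fold is scaled using only its own training portion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])

# '<step name>__<parameter>' routes each candidate value to the 'knn' step
param_grid = {'knn__n_neighbors': list(range(1, 22, 2))}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_['knn__n_neighbors']
test_score = search.score(X_te, y_te)
print(best_k, round(test_score, 3))
```

If the best k lands on the edge of the searched range, as 21 does here, extending the range in a second search is a reasonable follow-up.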

In [None]:
d = {'Optimal Model': ['KNN (18 seconds)'],
     'Features': ['Danceability, Energy, Release Year'], 
     'Binary Outcome': ['Upper Positive, Lower Negative'],
     'Parameters': ['(n_neighbors = 21)']}

In [None]:
Optimal = pd.DataFrame(data=d)

In [None]:
OptmlTable = Optimal.set_index('Optimal Model')

In [None]:
OptmlTable

The K-Nearest Neighbors optimal model included three features (Danceability, Energy, Release Year) and two outcome (binary) classes (Valence Upper Positive, Valence Lower Negative).  Best parameter was n_neighbors = 21 and time to train was 18 seconds. 

##### Confusion Matrix for Optimal KNN model

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
X = RenamedBinaryBaseDF[['Danceability','Energy','Release_Year']]
y = RenamedBinaryBaseDF['Binary']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)

In [None]:
KNNBasic = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors = 21))])
KNNBasic.fit(X_train, y_train)
KNNVBasic_acc_train = KNNBasic.score(X_train, y_train)
KNNBasic_acc_test = KNNBasic.score(X_test, y_test)

In [None]:
confusion_mat = ConfusionMatrixDisplay.from_estimator(KNNBasic, X_test, y_test)
plt.grid(False)
plt.title('Confusion Matrix for K-Nearest Neighbors (k=21)')
plt.ylabel('True')
plt.xlabel('Predicted')
print(confusion_mat)

###### The confusion matrix outlines the number of correct and incorrect test set predictions for this optimal KNN model.  The number of correctly predicted cases for Upper Positive Valence was 9,930 (TP), and the number of correctly predicted cases for Lower Negative Valence was 6,222 (TN).  The total number of test cases was 17,723.  This shows how the test accuracy score was derived for the optimal KNN model.  The equation for the test accuracy score is (TP + TN)/(TP + TN + FP + FN).

###### Test accuracy score computed "by hand" using confusion matrix (matches scikit-learn's computation)

In [None]:
(9930 + 6222)/(9930 + 6222 + 668 + 903) 

In [None]:
# Confirming optimal model test accuracy using scikit-learn
KNNBasic_acc_test

# Results

## Optimal Model: K-Nearest Neighbors

### Test Accuracy Score: 91.13%

#### Final Results Tables

In [None]:
d = {'Three-Feature, Best Model Comparison': ['Test Accuracy Scores'],
     'KNN (k = 21)': ['91.13% (Grid Search)'], 
     'Decision Tree (max depth = None)': ['86.90%'],
     'Logistic Regression (default)': ['38.41%']}

In [None]:
ModelFin = pd.DataFrame(data=d)

In [None]:
FinalTable = ModelFin.set_index('Three-Feature, Best Model Comparison')

In [None]:
FinalTable

###### The K-Nearest Neighbors model with GridSearch CV (n_neighbors = 21) performed best (test accuracy score 91.13%).  Closely following was the unpruned Decision Tree model, with the same three features and outcome (test accuracy score 86.90%).  Both KNN and Decision Tree models had higher test accuracy scores compared with the three-feature, binary outcome Logistic Regression model (test accuracy score 38.41%).

In [None]:
# Baseline review: Comparison to baseline test accuracy score
BinaryLRT

##### Three-feature K-Nearest Neighbor and Decision Tree models not only had higher test accuracy scores than the three-feature Logistic Regression model (38.41%), but also the baseline model.  The baseline model was a binary outcome, 13-feature Logistic Regression model with a test accuracy score of 39.16%.

# Partial Dependence Display

## Generating Partial Dependence Plot

In [None]:
from sklearn.inspection import PartialDependenceDisplay, partial_dependence

In [None]:
X = RenamedBinaryBaseDF[['Danceability','Energy','Release_Year']]
y = RenamedBinaryBaseDF['Binary_Valence']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)

In [None]:
KNNBasic = Pipeline([('scale', StandardScaler()), 
                     ('knn', KNeighborsClassifier(n_neighbors = 21))])
KNNBasic.fit(X_train, y_train)
KNNVBasic_acc_train = KNNBasic.score(X_train, y_train)
KNNBasic_acc_test = KNNBasic.score(X_test, y_test)

In [None]:
# Generating Partial Dependence Plot (should take about 4-5 minutes)
fig, ax = plt.subplots(figsize = (20, 6))
PartialDependenceDisplay.from_estimator(KNNBasic, X, features = ['Danceability', 'Energy', 'Release_Year'], ax = ax)
ax.set_title('Partial Dependence Plots for Three Features')

### Partial Dependence Plot Interpretation

Danceability, Energy, and Release Year are displayed in partial dependence plots.  These plots visualize the influence of each audio feature's value on the average prediction (Valence) in the optimized K-Nearest Neighbors model (n_neighbors = 21).  Higher danceability and higher energy ratings contributed to positive valence classification.  Lower danceability and lower energy ratings contributed to negative valence classification.  Regarding year of song release, it appeared that older songs contributed to positive valence while newer songs contributed to negative valence.  
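The idea behind a partial dependence curve can be sketched by hand: fix one feature at a grid value for every row, average the model's predictions, and repeat across the grid.  The example below uses a made-up prediction function and random inputs.  The helper name `partial_dependence_1d` is hypothetical and is not scikit-learn's implementation.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature_index, grid):
    # For each grid value, overwrite one feature for every row and average the predictions
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value
        averages.append(float(np.mean(predict(X_mod))))
    return np.array(averages)

# Hypothetical model: probability rises with feature 0 and ignores feature 1
def toy_predict(X):
    return 1.0 / (1.0 + np.exp(-4.0 * (X[:, 0] - 0.5)))

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 2))
grid = np.linspace(0.0, 1.0, 5)

pd_feature0 = partial_dependence_1d(toy_predict, X_demo, 0, grid)
pd_feature1 = partial_dependence_1d(toy_predict, X_demo, 1, grid)
print(pd_feature0, pd_feature1)
```

A rising curve corresponds to a feature that pushes predictions toward the positive class, and a flat curve to a feature the model does not use.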

# End Summary 

## 1) Business Understanding: 

Music Information Retrieval (MIR) can be described as extracting information from music.  My goal in this project was to classify music, by positive and negative valence, based on audio features.  Song classification by musical valence, predicted from audio features, might enhance the building of future music recommendation systems.  Highly personalized music recommendation systems, in turn, might improve quality of life for persons in clinical populations who consider music highly important.

## 2) Data Understanding:

This data set ("Music Recommendation System Using Spotify Dataset"), from website Kaggle.com, consisted of 170,653 rows (samples) and 19 columns (features). 

There were 15 features - in addition to Artist, Song Title, and Unique ID - that I partitioned into the following categories: 

* "General Features of Music": Danceability, Acousticness, Energy, Key, Liveness, Loudness, Mode, Duration

* "Lyric-Related Features of Music": Speechiness, Instrumentalness, Explicit

* "Time-Related Features of Music": Tempo, Release Year, Release Date, Popularity.  

Valence was also listed as a feature, but for this analysis, I revised this and made Valence the target variable, or outcome variable.  Thus, my goal in this analysis was to correctly classify songs by Valence.  The remaining features provided in the data set were used to predict Valence during model-building.

The optimal model in this analysis could enhance the building of a future, highly personalized music recommendation system for a person in a clinical population.  For instance, a person experiencing dementia who considers music highly important might benefit from the positive feelings that music can evoke.  If audio features can better predict positive valence, these features can be used to recommend music that may benefit that person's quality of life. 

## 3) Data Preparation Summary

* Null values were removed

* Duplicate samples were removed

* Columns were evaluated and renamed (removed if carrying redundant information)

* Valence was reformatted for Multinomial Logistic Regression (four outcome classes: Upper Positive, Lower Positive, Upper Negative, and Lower Negative) and Logistic Regression (two outcome classes: Upper Positive and Lower Negative)

* Outliers were removed for particular audio features to ensure sample inclusion more closely reflected typical songs (as well as for model-building)

* Data was shuffled and split, with a test size of 0.30

## 4) Modeling:

* Baseline models’ test accuracy scores were compared, and selected baseline model was closely balanced binary (two-class outcome) model

* Remaining 13 dimensions (audio features) were reduced to three dimensions using L1 Regularization for Logistic Regression (L1 Regularization is a method that inherently highlights priority features and therefore is a plausible solution for reducing dimensions)

* Logistic Regression, K-Nearest Neighbors, and Decision Tree models were built with priority three audio features and binary outcome, trained on the training set, and validated with the test set

* Models were built by comparing test accuracy scores, as top-performing models were optimized with GridSearch CV 
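
One way the feature-reduction and model-comparison steps above could look in scikit-learn is sketched below. The data are synthetic, and ranking features by the absolute size of their L1-penalized coefficients is an assumption about how the three priority features were chosen; the value `C=0.05` is likewise illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 13 features (stand-in for the audio features).
X, y = make_classification(
    n_samples=1000, n_features=13, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Scale features so that coefficient sizes are comparable.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# L1-penalized Logistic Regression pushes weak coefficients toward zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_model.fit(X_train_s, y_train)

# Keep the three features with the largest absolute coefficients (assumed rule).
top3 = np.argsort(np.abs(l1_model.coef_[0]))[::-1][:3]

# Fit the three model families on the reduced feature set and compare test accuracy.
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train_s[:, top3], y_train)
    scores[name] = model.score(X_test_s[:, top3], y_test)

print(top3, scores)
```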

### Summary of Model-Building (Test Accuracy Scores Comparisons)

#### Baseline Test Accuracy Scores (Binary and Four-Class Outcomes With 13 Features)

In [None]:
ValenceTable

#### L1 Regularization for Logistic Regression (13 Features to Three Features)

In [None]:
L1_Regularization

#### Best Models Comparison (Binary Valence Outcomes With Three Features)

In [None]:
Model_Comparison

#### Best Models Comparison (After KNN GridSearch CV Analysis) 
In this final results table, the KNN entry is replaced with the final KNN model tuned with GridSearch CV.

In [None]:
FinalTable

#### Baseline 
The baseline test accuracy score is shown again here for comparison with the final results.

In [None]:
BinaryLRT

## 5) Evaluation

The optimal model for classifying the valence of songs from musical characteristics was K-Nearest Neighbors.  The optimal K-Nearest Neighbors model included three features (Danceability, Energy, Release Year) and two (binary) outcome classes (Valence Upper Positive, Valence Lower Negative).  The best parameter was n_neighbors = 21, and the time to train was 18 seconds.

Test accuracy improved from 90.17% (n_neighbors = 5) to 91.13% (n_neighbors = 21).  This rise in test accuracy indicated a more effective model, with each prediction made according to the majority class of the 21 closest data points.  GridSearch CV was the tool used to find the best k parameter.  It was also notable that both the training and test accuracy scores for this model were high, 91.57% and 91.13%, respectively.  The small gap between the two suggests that the final model fit the training data well and generalized to the test data.
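
A search over k of the kind described above might be set up as in the following sketch. It uses synthetic data with three features; the candidate grid (odd values from 1 to 31), the 5-fold cross-validation, and the scaling step inside the pipeline are assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with three features (stand-in for the song data).
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Scaling inside the pipeline keeps each cross-validation fold free of leakage.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values of k (assumed grid); odd values avoid ties in binary voting.
param_grid = {"knn__n_neighbors": list(range(1, 32, 2))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(best_k, round(train_acc, 4), round(test_acc, 4))
```

Comparing `train_acc` with `test_acc` is the same check made in the paragraph above: a small gap suggests the chosen k is not overfitting the training set.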

Closely following in performance was the unpruned Decision Tree model, with the same three features and outcome (test accuracy score 86.90%).  Both the three-feature KNN and Decision Tree models had higher test accuracy scores than the three-feature, binary outcome Logistic Regression model (test accuracy score 38.41%).  One possible reason for the success of the K-Nearest Neighbors and Decision Tree models relative to the Logistic Regression model is that the classes formed local clusters in feature space rather than being linearly separable.
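
The general idea that a linear model can struggle where a neighbor-based model succeeds can be illustrated on a toy data set. The sketch below uses scikit-learn's `make_circles`, which is not the song data and does not confirm the explanation above; it only shows the kind of geometry in which Logistic Regression has no useful straight-line boundary.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two concentric rings: the classes are clustered but not linearly separable.
X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# A linear decision boundary cannot separate an inner ring from an outer ring.
lr_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# KNN votes among nearby points, so it follows the ring-shaped boundary.
knn_acc = KNeighborsClassifier(n_neighbors=21).fit(X_train, y_train).score(X_test, y_test)

print(f"Logistic Regression: {lr_acc:.3f}, KNN: {knn_acc:.3f}")
```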

### Overall Model-Building Summaries

#### Key Increases in Test Accuracy Scores During Model-Building Process

Please click to see diagram "Model-Building Summary (Steps)": https://github.com/ChristineNoelle/Final-Capstone-Project/blob/7da5c533653463fe22aa39fafb8c84052a902721/Model-Building%20Summary%20(Steps).png

### Overall Features Summary
Danceability, Energy, and Release Year were used in the optimal K-Nearest Neighbors model (k = 21). These three features were displayed in a partial dependence plot (please click to see "Partial Dependence Plot": https://github.com/ChristineNoelle/Final-Capstone-Project/blob/7da5c533653463fe22aa39fafb8c84052a902721/Partial%20Dependence%20Plot%20.png).
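
For readers unfamiliar with partial dependence, the sketch below shows one way such curves can be computed for a KNN classifier with `sklearn.inspection.partial_dependence`. The data are synthetic, the feature names are placeholders, and the grid resolution of 20 is an arbitrary choice; the linked plot was not necessarily produced this way.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import partial_dependence
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with three features (placeholder names).
feature_names = ["danceability", "energy", "release_year"]
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=21)),
]).fit(X, y)

# For each feature, average the predicted probability of the positive class
# over the data while that feature is swept across a grid of values.
curves = {}
for idx, name in enumerate(feature_names):
    result = partial_dependence(
        model, X, features=[idx], grid_resolution=20, kind="average"
    )
    curves[name] = result["average"][0]

for name, curve in curves.items():
    print(name, curve.round(2))
```

A flat curve means the model's average prediction barely changes as that feature varies, while a rising curve means higher values of the feature push predictions toward the positive class.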

### Feature: Release Year Summary

It was interesting to see, in the partial dependence plots, that older songs contributed to positive musical valence.  Several theories could explain this result (e.g., the mere-exposure effect).

Upon review of certain plots in this project, it appears there was a large amount of highly danceable music in earlier decades.  Specifically, the decades spanning from 1960 to 1990 had highly danceable songs.  In contrast, there was much low-energy music from about 1945 to 1960 (please click to see this in "Release Year Feature Analysis": https://github.com/ChristineNoelle/Final-Capstone-Project/blob/7da5c533653463fe22aa39fafb8c84052a902721/Release%20Year%20Feature%20Analysis.png).

This might be why the influence of year of song release in the model was tempered in the above partial dependence plot (https://github.com/ChristineNoelle/Final-Capstone-Project/blob/7da5c533653463fe22aa39fafb8c84052a902721/Partial%20Dependence%20Plot%20.png).  

## 6) Deployment

Going forward, these results suggest that danceability ratings, energy ratings, and year of song release might be optimal audio features to collect if one would like to classify songs by valence.  The next step is to determine how the experience of emotions during song listening relates to the valence of a song.  Standardized methods appear to be needed for collecting these audio and emotion features, so that they can be compared across studies.

# Conclusion

In reality, there are countless dimensions and methods for classifying songs.  This particular analysis provided one optimal model for song classification.  An optimal K-Nearest Neighbors model was selected by comparing test accuracy scores, and the final model had a test accuracy score of 91.13%.  Using audio features to predict musical valence is one step toward enhancing music recommendation systems for clinical populations.  According to a Chinese proverb, "A journey of a thousand miles begins with a single step."