# Data Science Project
## Content Cluster | Basic Architecture 


##### Marissa Montano & Mohamed Al-Rasbi


______

#### Why do you care?

The goal of our project is to cluster songs by lyrical content. We will be using a kaggle dataset from Gyanendra Mishra (https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics/data) and we will be using the python Sklearn library to actually help us process the data. We want to see what main communities of songs are out there, and what  themes usually become popular nowadays. 

The methods and techniques we are going to use to cluster these lyrics are term frequency-inverse document frequency (TF-IDF), singular value decomposition (SVD), K-Means Clustering. We are going to extract the lyrics to a feature mapping vector using TF-IDF, then we are going to reduce our dimensions with SVD, and lastly we are going to cluster the songs with k-means clustering.  

The goal from this project is to work with natural language processing (NLP). We could have analyzed data from Spotify and just clustered based off of musical measurements like valence, but that seems like a trivial task. We feel like we would get more out of this project this semester if we analyzed the content of the songs (lyrics) instead. 




#### Feedback 1:
 - Questions you want to answer spesifically 
 - Silhouette coefficient or Genre
 - Word2Vec (gets meaning of lyrics)
 - Other clustering methods

#### Solutions 1:
 - Does sentiment of lyrics help predict the genre? Do artists fall victim to making things that sell by following a specific/generic formula? 
 - We used kaggle because we wanted genres and Genius didn't have them
 - We will compare word2vec and tf-idf
 - We compare our results using three other clustering algorithms





In [2]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split

In [3]:
def preprocess(lyrics):
    # Ignore case
    lyrics = lyrics.lower()
    
    # Remove ',!?:. \n
    lyrics = ''.join([word.strip(",!?:") for word in lyrics])
    lyrics = lyrics.replace('\n', ' ')
    lyrics = lyrics.replace('\'', '')
    
    # Remove everything between hard brackets
    lyrics = re.sub(pattern="[\(\[].*?[\)\]]", repl='', string=lyrics)

    # Remove x4 and (x4), for example
    lyrics = re.sub(pattern="(\()?x\d+(\))?", repl=' ', string=lyrics)

    return(lyrics)

In [8]:
def loadData(file_name, sub_list=False):
    df = pd.read_csv(file_name)
    if sub_list == True
        #import only 1000 data points because my laptop SUCKS
        df = df.head(1000)
    # Clean data
    df = df.dropna()
    for i, row in df.iterrows():
        df.loc[i,'lyrics'] = preprocess(df.loc[i,'lyrics'])
    return df

df = loadData('lyrics.csv', True)

0    oh baby how you doing you know im gonna cut ri...
Name: lyrics, dtype: object


In [36]:
# This is me playing around and cluster looking for genre 

# Need to put hyperperams in yaml file or txt file
n_components = 5
n_clusters = len(set(df["genre"])) 

# TFIDF | turn lyrics to vectors
tfidf = TfidfVectorizer(stop_words = 'english')
X = tfidf.fit_transform(df['lyrics'])

# SVD | dimension reduction
svd = TruncatedSVD(n_components=n_components, random_state = 0)
X_final = svd.fit_transform(X)

# K-mean clustering on lyrics
kmeans = KMeans(n_clusters=n_clusters, random_state = 0)
X_clustered = kmeans.fit_predict(X)

# display by groups
df_plot = pd.DataFrame(list(df["genre"]), list(X_clustered))
df_plot = df_plot.reset_index()
df_plot.rename(columns = {'index': 'Cluster', 0: 'Genre'}, inplace = True)
df_plot['Cluster'] = df_plot['Cluster'].astype(int)

print(df_plot.head())
#for i, row in df_plot.iterrows():
#    print(df_plot.loc[i,'Cluster'],  df_plot.loc[i,'Genre'], "\n")

   Cluster Genre
0        3   Pop
1        3   Pop
2        4   Pop
3        4   Pop
4        2   Pop


In [None]:
# This is something we should do do a word2vec training and clustering to see genre and lyrical sentiment.
# https://github.com/ravishchawla/word_2_vec/blob/master/word_2_vec.ipynb

num_features = 100;    # Dimensionality of the hidden layer representation
min_word_count = 40;   # Minimum word count to keep a word in the vocabulary
num_workers = multiprocessing.cpu_count();       # Number of threads to run in parallel set to total number of cpus.
context = 5          # Context window size (on each side)                                                       
downsampling = 1e-3   # Downsample setting for frequent words
# Initialize and train the model. 
#The LineSentence object allows us to pass in a file name directly as input to Word2Vec,
#instead of having to read it into memory first.
print("Training model...");
model = word2vec.Word2Vec(LineSentence('/mnt/big/out_full_clean'), workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling);
# We don't plan on training the model any further, so calling 
# init_sims will make the model more memory efficient by normalizing the vectors in-place.
model.init_sims(replace=True);
# Save the model
model_name = "model_full_reddit";
model.save(model_name);


def clustering_on_wordvecs(word_vectors, num_clusters):
    # Initalize a k-means object and use it to extract centroids
    kmeans_clustering = KMeans(n_clusters = num_clusters, init='k-means++');
    idx = kmeans_clustering.fit_predict(word_vectors);
    
    return kmeans_clustering.cluster_centers_, idx;

model = word2vec.Word2Vec.load('model_full_reddit');
Z = model.wv.syn0;

centers, clusters = clustering_on_wordvecs(Z, 50);
centroid_map = dict(zip(model.wv.index2word, clusters));


def get_top_words(index2word, k, centers, wordvecs):
    tree = KDTree(wordvecs);

    #Closest points for each Cluster center is used to query the closest 20 points to it.
    closest_points = [tree.query(np.reshape(x, (1, -1)), k=k) for x in centers];
    closest_words_idxs = [x[1] for x in closest_points];

    #Word Index is queried for each position in the above array, and added to a Dictionary.
    closest_words = {};
    for i in range(0, len(closest_words_idxs)):
        closest_words['Cluster #' + str(i+1).zfill(2)] = [index2word[j] for j in closest_words_idxs[i][0]]

    #A DataFrame is generated from the dictionary.
    df = pd.DataFrame(closest_words);
    df.index = df.index+1

    return df;