# Import Libraries

In [47]:
import pandas as pd 
import numpy as np 

# Preparing Data

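Rows in the Spotify dataset that share a `track_name` are treated as duplicates and dropped, keeping only the first occurrence, since repeated tracks add nothing to the merge. Only the `track_name` and `artists` columns are kept. The Grammy columns `artist` and `nominee` are renamed to `artists` and `track_name` so that both datasets share the same key names.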

In [48]:
df_spotify = pd.read_csv("../data/spotify_to_merge.csv")
df_spotify = df_spotify.drop_duplicates(subset='track_name', keep='first')
df_spotify = df_spotify[['track_name', 'artists']]

df_grammys = pd.read_csv("../data/grammys_to_merge.csv")
df_grammys.rename(columns={'artist': 'artists', 'nominee':'track_name'}, inplace=True)
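
To make the effect of these preparation steps concrete, here is a minimal sketch on made-up rows (the toy values and the `toy_` names are illustrative assumptions, not taken from the datasets). It shows that deduplicating on `track_name` alone keeps only the first row per title, even when the artists differ, and that `rename` aligns the Grammy column names with the Spotify ones.

```python
import pandas as pd

# Toy Spotify-like data (hypothetical values, for illustration only)
toy_spotify = pd.DataFrame({
    "track_name": ["Hello", "Hello", "Yesterday"],
    "artists": ["Artist A", "Artist B", "Artist C"],
})

# Keep the first row for each track_name; the "Artist B" row is dropped
toy_spotify = toy_spotify.drop_duplicates(subset="track_name", keep="first")

# Toy Grammy-like data using the original column names
toy_grammys = pd.DataFrame({
    "nominee": ["Hello"],
    "artist": ["Artist A"],
    "winner": [True],
})

# Align the column names so both frames share 'track_name' and 'artists'
toy_grammys = toy_grammys.rename(columns={"artist": "artists", "nominee": "track_name"})

print(toy_spotify)
print(toy_grammys.columns.tolist())
```

Because the deduplication key is the title only, different songs that happen to share a name collapse into one row. Passing `subset=["track_name", "artists"]` would be one way to keep them apart, if that distinction matters for the analysis.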

# Merging Data


The merge uses the `artists` and `track_name` columns as keys.

Since the Grammy columns were already renamed to match the Spotify ones, an outer merge is carried out, which keeps every row from both datasets.

In [49]:
df_merged = pd.merge(df_grammys, df_spotify, on=["artists", "track_name"], how="outer")
df_merged

,track_name,artists,winner,awards_group,title_by_year
0,pagadoff,!nvite,,,
1,strolling,!nvite,,,
2,Going on a Mission,"""Puppy Dog Pals"" Cast",,,
3,Puppy Dog Pals Main Title Theme,"""Puppy Dog Pals"" Cast",,,
4,"Amish Paradise (Parody of ""Gangsta's Paradise""...","""Weird Al"" Yankovic",,,
...,...,...,...,...,...
78322,,,True,Excellence Awards,(1971-1999) AGM
78323,,,True,Excellence Awards,(1971-1999) AGM
78324,,,True,Excellence Awards,(1971-1999) AGM
78325,,,True,Excellence Awards,(1958-1970) AGM
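
An outer merge keeps unmatched rows from both sides, which is where most of the nulls above come from. As a hedged sketch on toy data (the frames `left` and `right` and their values are assumptions for illustration), the `indicator=True` option of `pd.merge` adds a `_merge` column that shows how many rows matched on both keys and how many came from one side only.

```python
import pandas as pd

# Toy Grammy-like frame (hypothetical values)
left = pd.DataFrame({
    "track_name": ["Hello", "Song X"],
    "artists": ["Artist A", "Artist D"],
    "winner": [True, True],
})

# Toy Spotify-like frame (hypothetical values)
right = pd.DataFrame({
    "track_name": ["Hello", "Yesterday"],
    "artists": ["Artist A", "Artist C"],
})

# Same keys and join type as the notebook's merge, plus the indicator column
merged = pd.merge(left, right, on=["artists", "track_name"], how="outer", indicator=True)

# Count rows by origin: 'both', 'left_only', 'right_only'
counts = merged["_merge"].value_counts()
print(counts)
```

Running the same count on the real merge would show directly how few rows fall into the `both` category.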


## Improving Merge Data

Looking at the dataframe after the merge, many rows are of little use: most have null values in the Grammy columns (nulls are handled later), and the last rows shown have no track or artist at all.

The next step is to select the tracks that appear in both datasets (Spotify and Grammys), matching on `track_name`, and then attach the `artists` and `awards_group` of the rows marked as winners. This will be helpful for future analysis.

In [50]:
# Track names present in both datasets
common_tracks = pd.merge(df_grammys[['track_name']], df_spotify[['track_name']], on='track_name', how='inner')

# Rows of the merged dataframe whose 'winner' value is True
nominated_tracks = df_merged[df_merged['winner'] == True]
nominated_winner_tracks = nominated_tracks[['track_name', 'artists', 'awards_group']]

# Winner rows whose track name also appears in both datasets
common_winner_tracks = pd.merge(common_tracks, nominated_winner_tracks, on='track_name', how='inner')


In [51]:
common_winner_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1357 entries, 0 to 1356
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   track_name    1321 non-null   object
 1   artists       946 non-null    object
 2   awards_group  1357 non-null   object
dtypes: object(3)
memory usage: 31.9+ KB
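
The output above lists fewer non-null `track_name` values than rows, even though the frame comes from inner merges. One possible explanation is that `pd.merge` matches null keys against each other, unlike a SQL join. The sketch below illustrates that behavior on toy frames (`a`, `b`, and their values are assumptions for illustration) and shows that dropping null keys first avoids it.

```python
import numpy as np
import pandas as pd

# Toy frames that each contain one null key (hypothetical values)
a = pd.DataFrame({"track_name": ["Hello", np.nan]})
b = pd.DataFrame({
    "track_name": ["Hello", np.nan],
    "awards_group": ["G1", "G2"],
})

# Null keys on both sides are matched with each other by pandas
with_nulls = pd.merge(a, b, on="track_name", how="inner")

# Dropping null keys before merging keeps only real name matches
without_nulls = pd.merge(
    a.dropna(subset=["track_name"]),
    b.dropna(subset=["track_name"]),
    on="track_name",
    how="inner",
)

print(len(with_nulls), len(without_nulls))
```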


## Getting Data Ready

Once the tracks common to both datasets are identified, a final outer merge combines them with the merged dataframe.

Because of the large number of nulls, the plan is to work with at most 10,000 rows, which makes analysis and handling easier.

Finally, the most relevant columns are selected, leaving the final dataframe ready.

In [52]:
final_df = pd.merge(df_merged, common_winner_tracks, on=['track_name', 'artists', 'awards_group'], how='outer')
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79179 entries, 0 to 79178
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   track_name     78963 non-null  object
 1   artists        76882 non-null  object
 2   winner         5662 non-null   object
 3   awards_group   5662 non-null   object
 4   title_by_year  5662 non-null   object
dtypes: object(5)
memory usage: 3.0+ MB


In [53]:
final_df

,track_name,artists,winner,awards_group,title_by_year
0,!I'll Be Back!,Rilès,,,
1,"""A"" You're Adorable",Brian Hyland,,,
2,"""C"" IS FOR COOKIE",Little Apple Band,,,
3,"""C"" is for Cookie",Little Apple Band,,,
4,"""Christe, Redemptor omnium""",Traditional;Sistine Chapel Choir;Massimo Palom...,,,
...,...,...,...,...,...
79174,,,True,Excellence Awards,(1958-1970) AGM
79175,,,True,Excellence Awards,(1958-1970) AGM
79176,,,True,Excellence Awards,(1958-1970) AGM
79177,,,True,Excellence Awards,(1958-1970) AGM
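
The row cap and column selection described in this section could be applied in several ways. The following is only a sketch under stated assumptions: rows that carry award information are kept first, `MAX_ROWS` stands in for the 10,000-row limit, and the toy frame and selected columns are hypothetical.

```python
import pandas as pd

MAX_ROWS = 3  # stands in for the 10,000-row cap mentioned in the text

# Toy frame shaped like the final dataframe (hypothetical values)
toy = pd.DataFrame({
    "track_name": ["A", "B", "C", "D", "E"],
    "artists": ["x", "y", "z", "w", "v"],
    "winner": [None, True, None, True, None],
    "awards_group": [None, "G1", None, "G2", None],
})

# Assumption: rows with award information go first, then the cap is applied
has_award = toy["winner"].notna()
capped = pd.concat([toy[has_award], toy[~has_award]]).head(MAX_ROWS)

# Assumption: these are the columns considered most relevant
capped = capped[["track_name", "artists", "winner"]]

print(capped)
```

Prioritizing award rows is a design choice for this sketch; a random sample via `DataFrame.sample` would be an alternative if an unbiased subset is preferred.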


In [54]:
final_df.to_csv('../data/merged_data.csv', index=False)

# Conclusions

After performing exploratory data analysis (EDA) on each dataset, it is clear that combining them was a significant challenge, especially when the goal is to produce data that supports useful insights.

One example is showing which songs were Grammy-nominated or winners and which were not; the merged data can support other presentations as well.

The large percentage of nulls is a significant issue. They cannot be imputed or dropped, so any analysis has to take them into account explicitly.
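
As a small illustration of working with the nulls rather than removing them, the sketch below measures the share of nulls per column and derives a boolean flag separating winners from all other tracks. The toy frame and the `is_winner` column name are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame shaped like the merged data (hypothetical values)
toy_final = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "artists": ["x", "y", None, "w"],
    "winner": [None, True, None, None],
})

# Fraction of null values in each column
null_share = toy_final.isna().mean()

# Flag winners; rows with a null 'winner' are treated as non-winners
toy_final["is_winner"] = toy_final["winner"].eq(True)

print(null_share)
print(toy_final["is_winner"].value_counts())
```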