# Importing needed packages

In [103]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the data

In [104]:
train_df = pd.read_csv('/kaggle/input/song-popularity-prediction/train.csv')

# Looking at data

In [105]:
train_df.head()

In [106]:
train_df.info()

Based on this information we can make a few conclusions:
* Overall number of rows is 40 000
* We have missing values in several columns
* All columns are numerical - we don't have text data 

## Missing values

In [107]:
train_df.isnull().sum()

The way missing values are handled is likely to matter a lot. I will come back to this later.
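As a starting point, here is a minimal sketch of how the share of missing values per column could be inspected, together with a simple median fill as one possible baseline. The frame `toy_df` and its values are made up for illustration and only stand in for `train_df`. Median imputation is an assumption here, not the approach chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "energy": [0.5, np.nan, 0.7, 0.9],
    "key": [1.0, 5.0, np.nan, np.nan],
    "tempo": [120.0, 98.0, 130.0, 110.0],
})

# Share of missing values per column, largest first
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# One simple baseline (an assumption, not the final choice):
# fill every column with its own median
imputed_df = toy_df.fillna(toy_df.median())
print(imputed_df)
```

Looking at shares rather than raw counts makes it easier to compare columns at a glance.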

## Outliers detection

I will come back to this later
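Until then, one common rule of thumb could look like the sketch below: flag values that lie more than 1.5 interquartile ranges outside the quartiles. The helper name `iqr_outlier_mask`, the factor `k=1.5` and the toy values are assumptions for illustration only.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy loudness-like values (made up); -30.0 is far from the rest
toy_loudness = pd.Series([-8.0, -7.5, -6.0, -9.0, -7.0, -30.0])
outlier_mask = iqr_outlier_mask(toy_loudness)
print(toy_loudness[outlier_mask])
```

Missing values compare as False on both sides, so they are never flagged by this rule.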

# Target variable analysis

## Plots

In [108]:
f,ax=plt.subplots(1,2,figsize=(14,6))
train_df['song_popularity'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Song popularity')
ax[0].set_ylabel('')
sns.countplot(x='song_popularity',data=train_df,ax=ax[1])
ax[1].set_title('Song popularity')
plt.show()

## Class Imbalance

In [109]:
train_df['song_popularity'].value_counts()

The dataset is not balanced - there are more unpopular songs than popular ones.
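To put a number on the imbalance, class shares are often easier to read than raw counts. A minimal sketch, assuming a made-up target series `toy_target` in place of `train_df['song_popularity']`:

```python
import pandas as pd

# Toy target standing in for train_df['song_popularity'] (made-up values)
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="song_popularity")

# Share of each class instead of raw counts
class_share = toy_target.value_counts(normalize=True)

# How many times larger the majority class is than the minority class
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(imbalance_ratio)
```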

## Metric

The evaluation metric for this competition is the Area Under the ROC Curve (AUC), and the task is binary classification.
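AUC can be read as the probability that a randomly chosen popular song gets a higher score than a randomly chosen unpopular one. The sketch below computes it that way by comparing every positive with every negative. The function name `auc_from_scores` and the toy labels and scores are made up for illustration; the pairwise comparison is only practical for small arrays, and a library implementation would normally be used instead.

```python
import numpy as np

def auc_from_scores(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy labels and predicted scores (made up)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because AUC depends only on the ordering of the scores, it is not affected by the class imbalance in the same way accuracy is.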

# Feature analysis

In this section I'm going to:
* Look at columns, determine their types
* Summarize data and show some statistics per feature
* Conduct analysis for every feature which will include:
    * Plots
    * Analysis of the dependency between the feature and the target variable
    * Interactions of features
    * Correlation

## Look at columns

In [110]:
train_df.dtypes

As I've said earlier, all the features we have are numerical

## Statistics

In [111]:
train_df.drop('id', axis=1).describe()

## Feature analysis

In [112]:
# Compute the correlation matrix
corr = train_df.drop('id', axis=1).corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap="coolwarm", vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Song popularity doesn't have a significant correlation with any of the features, but this doesn't mean that the features are useless. Subpopulations within these features can still be correlated with song popularity. To determine this, we need to explore the features in detail.
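To illustrate why a near-zero linear correlation does not rule out a useful feature, here is a minimal sketch on made-up data. The column `feature`, the bin edges and the frame `toy_df` are all hypothetical; the target is constructed so that only the middle range of the feature is "popular".

```python
import numpy as np
import pandas as pd

# Made-up feature on [0, 1]; only the middle range is labelled popular
feature = np.linspace(0.05, 0.95, 10)
toy_df = pd.DataFrame({
    "feature": feature,
    "song_popularity": ((feature > 0.3) & (feature < 0.7)).astype(int),
})

# Linear (Pearson) correlation is ~0 because the relationship is symmetric
linear_corr = toy_df["feature"].corr(toy_df["song_popularity"])

# Share of popular songs inside each bin of the feature
bins = pd.cut(toy_df["feature"], bins=[0.0, 0.3, 0.7, 1.0])
popular_share_per_bin = toy_df.groupby(bins, observed=True)["song_popularity"].mean()

print(linear_corr)
print(popular_share_per_bin)
```

The correlation coefficient sees nothing, while the per-bin shares differ completely, which is the kind of pattern the per-feature plots below are meant to reveal.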

### Song Duration

**Name**: song_duration_ms  
**Number of values**: 35899  
**Number of missing values**: 4101  
**Range**: [25658, 491671]  
**Type**: Continuous  
**Details**: The duration of the song in milliseconds  
**Possible transformations**: No  

In [113]:
song_duration_min = train_df["song_duration_ms"].div(1000*60)  # milliseconds -> minutes
song_duration_min.agg(['count', 'min', 'max', 'mean', 'median'])

The mean duration of a song is 3m 22s

In [114]:
plt.figure(figsize=(12, 5))
g1 = sns.kdeplot(song_duration_min, shade=True, color='black')
g1.set_title("Song duration (min)")
g1.set_xlabel("")
g1.plot()

Here I'm plotting the distribution of song duration in minutes (omitting missing values).  

In [115]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["song_duration_ms"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["song_duration_ms"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

The distributions of song duration for popular and unpopular songs look very similar.
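A visual comparison like this can be backed up with per-class summary statistics. A minimal sketch, assuming a made-up frame `toy_df` in place of `train_df`:

```python
import pandas as pd

# Toy data standing in for train_df (made-up values)
toy_df = pd.DataFrame({
    "song_duration_ms": [180000, 200000, 220000, 190000, 210000, 230000],
    "song_popularity": [0, 0, 0, 1, 1, 1],
})

# Summary statistics of the feature within each target class
duration_by_class = (
    toy_df.groupby("song_popularity")["song_duration_ms"]
    .agg(["mean", "median", "std"])
)
print(duration_by_class)
```

The same pattern applies to any other continuous feature by swapping the column name.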

### Acousticness

**Name**: acousticness  
**Number of values**: 36008  
**Number of missing values**: 3992  
**Range**: [-0.013551, 1.065284]  
**Type**: Continuous  
**Details**: This value describes how acoustic a song is. A score of 1.0 means the song is most likely to be an acoustic one.  
**Possible transformations**: Turn it into a categorical feature based on the split <0.2 and >= 0.2

In [116]:
train_df['acousticness'].agg(['count', 'min', 'max', 'mean', 'median'])

In [117]:
plt.figure(figsize=(12, 5))
g2 = sns.kdeplot(train_df['acousticness'], shade=True, color = 'green')
g2.set_title("Acousticness")
g2.set_xlabel("")
g2.plot()

Based on the statistics and the density plot it is safe to say that more than half of the songs have acousticness that is less than 0.2.  
Seems like it can be transformed into a categorical feature based on the threshold 0.2.   
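A minimal sketch of such a transformation is shown here on made-up values. The helper name `binarize_acousticness` is hypothetical, and keeping missing values as NaN (rather than assigning them to a class) is an assumption.

```python
import numpy as np
import pandas as pd

def binarize_acousticness(values: pd.Series, threshold: float = 0.2) -> pd.Series:
    """1.0 if acousticness >= threshold, 0.0 if below, NaN if missing."""
    flag = (values >= threshold).astype(float)
    # Comparisons with NaN yield False, so restore NaN explicitly
    return flag.where(values.notna())

# Toy values standing in for train_df['acousticness']
toy_acousticness = pd.Series([0.05, 0.19, 0.2, 0.8, np.nan])
is_acoustic = binarize_acousticness(toy_acousticness)
print(is_acoustic)
```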

In [118]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["acousticness"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["acousticness"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

Comparing popular and unpopular songs by acousticness, we can spot some differences, although they are not very pronounced.

### Danceability

**Name**: danceability  
**Number of values**: 35974  
**Number of missing values**: 4026  
**Range**: [0.043961, 0.957131]  
**Type**: Continuous  
**Details**: This value describes how suitable a song is for dancing. A score of 1.0 means the song is most danceable.  
**Possible transformations**: No

In [119]:
train_df['danceability'].agg(['count', 'min', 'max', 'mean', 'median'])

In [120]:
plt.figure(figsize=(12, 5))
g3 = sns.kdeplot(train_df['danceability'], shade=True, color='orange')
g3.set_title("danceability")
g3.set_xlabel("")
g3.plot()

In [121]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["danceability"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["danceability"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

The distributions differ a bit

### Energy

**Name**: energy  
**Number of values**: 35974  
**Number of missing values**: 4026  
**Range**: [-0.001682, 1.039741]  
**Type**: Continuous  
**Details**: Energy represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.  
**Possible transformations**: No

In [122]:
train_df['energy'].agg(['count', 'min', 'max', 'mean', 'median'])

In [123]:
plt.figure(figsize=(12, 5))
g4 = sns.kdeplot(train_df['energy'], shade=True, color='red')
g4.set_title("energy")
g4.set_xlabel("")
g4.plot()

In [124]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["energy"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["energy"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

The distributions differ around two peaks (~0.6, ~0.9) and at the beginning (0.2)

### Instrumentalness

**Name**: instrumentalness  
**Number of values**: 35974  
**Number of missing values**: 4026  
**Range**: [-0.004398, 1.075415]  
**Type**: Continuous  
**Details**: This value represents the amount of vocals in the song. The closer it is to 1.0, the more instrumental the song is.  
**Possible transformations**: Extremely skewed. Transformation is needed.

In [125]:
train_df['instrumentalness'].agg(['count', 'min', 'max', 'mean', 'median'])

In [126]:
plt.figure(figsize=(12, 5))
g5 = sns.kdeplot(train_df['instrumentalness'], shade=True, color='green')
g5.set_title("instrumentalness")
g5.set_xlabel("")
g5.plot()

In [127]:
plt.figure(figsize=(12, 5))
g5_log = sns.kdeplot(train_df['instrumentalness'].transform(lambda x: np.log(x)), shade=True, color='green')
g5_log.set_title("instrumentalness (log)")
g5_log.set_xlabel("")
g5_log.plot()

In [128]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["instrumentalness"].transform(lambda x: np.log(x)), color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["instrumentalness"].transform(lambda x: np.log(x)), color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

The feature may not be very distinctive. If we are going to use it, then a transformation of some kind (for example, log) is needed.
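Note that the range listed above includes zero and slightly negative values, for which a plain `np.log` returns `-inf` or NaN. One way to handle this is to clip the values to a small positive floor before taking the log. The sketch below uses made-up values; the helper name `safe_log` and the floor `eps=1e-6` are assumptions.

```python
import numpy as np
import pandas as pd

def safe_log(values: pd.Series, eps: float = 1e-6) -> pd.Series:
    """Log transform after clipping values below eps up to eps."""
    return np.log(values.clip(lower=eps))

# Toy values standing in for train_df['instrumentalness']
toy_instrumentalness = pd.Series([-0.004, 0.0, 0.001, 0.5, np.nan])
logged = safe_log(toy_instrumentalness)
print(logged)
```

Missing values stay missing, while all non-missing values map to a finite number.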

### Key

**Name**: key  
**Number of values**: 35935  
**Number of missing values**: 4065  
**Range**: [0, 11]  
**Type**: Categorical  
**Details**: The key is the group of pitches, or scale, that forms the basis of a song. There are 12 keys, encoded from 0 to 11.  
**Possible transformations**: No

In [129]:
train_df['key'].agg(['count', 'min', 'max', 'mean', 'median'])

In [130]:
# catplot creates its own figure, so the size is set via height/aspect
g6 = sns.catplot(x="key", kind="count", data=train_df, height=5, aspect=2.4)

In [131]:
# catplot creates its own figure, so the size is set via height/aspect
g = sns.catplot(x="key", hue="song_popularity", kind="count", data=train_df, height=5, aspect=2.4)

The proportion of popular and unpopular songs for every key looks the same
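Proportions are hard to judge from count plots alone, so it can help to compute the share of popular songs per category directly. A minimal sketch on a made-up frame `toy_df` standing in for `train_df`:

```python
import pandas as pd

# Toy data standing in for train_df (made-up values)
toy_df = pd.DataFrame({
    "key": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2],
    "song_popularity": [0, 0, 0, 1, 0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 target within a group is the share of popular songs
popular_share_by_key = toy_df.groupby("key")["song_popularity"].mean()
print(popular_share_by_key)
```

If the shares are close to the overall share of popular songs for every key, the feature carries little information on its own.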

### Liveness

**Name**: liveness  
**Number of values**: 35914  
**Number of missing values**: 4086  
**Range**: [0.027843, 1.065298]  
**Type**: Continuous  
**Details**: This value describes the probability that the song was recorded with a live audience. According to the official documentation “a value above 0.8 provides strong likelihood that the track is live”.  
**Possible transformations**: No

In [132]:
train_df['liveness'].agg(['count', 'min', 'max', 'mean', 'median'])

In [133]:
plt.figure(figsize=(12, 5))
g7 = sns.kdeplot(train_df['liveness'], shade=True, color='purple')
g7.set_title("liveness")
g7.set_xlabel("")
g7.plot()

In [134]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["liveness"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["liveness"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

There are almost no differences between popular and unpopular song distributions

### Loudness

**Name**: loudness  
**Number of values**: 36043  
**Number of missing values**: 3957  
**Range**: [-32.117911, -0.877346]  
**Type**: Continuous  
**Details**: Loudness values are averaged across the entire track. The higher the value, the louder the song.  
**Possible transformations**: No

In [135]:
train_df['loudness'].agg(['count', 'min', 'max', 'mean', 'median'])

In [136]:
plt.figure(figsize=(12, 5))
g8 = sns.kdeplot(train_df['loudness'], shade=True, color='orange')
g8.set_title("loudness")
g8.set_xlabel("")
g8.plot()

In [137]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["loudness"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["loudness"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

There are some differences around (-18, -14), (-11, -9) where there are more popular songs and (-6, -4) where there are more unpopular songs.

### Audio mode

**Name**: audio_mode  
**Number of values**: 40000  
**Number of missing values**: 0  
**Range**: [0, 1]  
**Type**: Binary  
**Details**: Songs can be classified as major and minor. 1.0 represents major mode and 0 represents minor.  
**Possible transformations**: No

In [138]:
train_df['audio_mode'].agg(['count', 'min', 'max', 'mean', 'median'])

In [139]:
# catplot creates its own figure, so the size is set via height/aspect
g9 = sns.catplot(x="audio_mode", kind="count", data=train_df, height=5, aspect=2.4)

In [140]:
# catplot creates its own figure, so the size is set via height/aspect
g = sns.catplot(x="audio_mode", hue="song_popularity", kind="count", data=train_df, height=5, aspect=2.4)

The differences are not so distinctive

### Speechiness

**Name**: speechiness  
**Number of values**: 40000  
**Number of missing values**: 0  
**Range**: [0.015065, 0.560748]  
**Type**: Continuous  
**Details**: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.  
**Possible transformations**: No

In [141]:
train_df['speechiness'].agg(['count', 'min', 'max', 'mean', 'median'])

In [142]:
plt.figure(figsize=(12, 5))
g10 = sns.kdeplot(train_df['speechiness'], shade=True, color='brown')
g10.set_title("speechiness")
g10.set_xlabel("")
g10.plot()

In [143]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["speechiness"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["speechiness"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

There are some differences around the peak

### Tempo

**Name**: tempo  
**Number of values**: 40000  
**Number of missing values**: 0  
**Range**: [62.055779, 219.163578]  
**Type**: Continuous  
**Details**: The tempo of the song. The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, the tempo is the speed or pace of a given piece and derives directly from the average beat duration.   
**Possible transformations**: No

In [144]:
train_df['tempo'].agg(['count', 'min', 'max', 'mean', 'median'])

In [145]:
plt.figure(figsize=(12, 5))
g11 = sns.kdeplot(train_df['tempo'], shade=True, color='fuchsia')
g11.set_title("tempo")
g11.set_xlabel("")
g11.plot()

In [146]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["tempo"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["tempo"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

There are some noticeable differences around the second peak of the distributions

### Time signature

**Name**: time_signature  
**Number of values**: 40000  
**Number of missing values**: 0  
**Range**: [2, 5]  
**Type**: Categorical  
**Details**: The time signature indicates how many counts are in each measure and which type of note will receive one count. The top number is commonly 2, 3, 4, or 6.  
**Possible transformations**: No

In [147]:
train_df['time_signature'].agg(['count', 'min', 'max', 'mean', 'median'])

In [148]:
# catplot creates its own figure, so the size is set via height/aspect
g12 = sns.catplot(x="time_signature", kind="count", data=train_df, height=5, aspect=2.4)

In [149]:
# catplot creates its own figure, so the size is set via height/aspect
g = sns.catplot(x="time_signature", hue="song_popularity", kind="count", data=train_df, height=5, aspect=2.4)

**Write analysis**

### Audio valence

**Name**: audio_valence  
**Number of values**: 40000  
**Number of missing values**: 0  
**Range**: [0.013398, 1.022558]  
**Type**: Continuous  
**Details**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)   
**Possible transformations**: No

In [150]:
train_df['audio_valence'].agg(['count', 'min', 'max', 'mean', 'median'])

In [151]:
plt.figure(figsize=(12, 5))
g13 = sns.kdeplot(train_df['audio_valence'], shade=True, color='cyan')
g13.set_title("audio_valence")
g13.set_xlabel("")
g13.plot()

In [152]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["audio_valence"], color="red", shade=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["audio_valence"], color="blue", shade=True)
g = g.legend(["Unpopular song", "Popular song"])

The differences are not so distinctive

**To be continued**
