# Importing needed packages

In [29]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the data

In [30]:
train_df = pd.read_csv('/kaggle/input/song-popularity-prediction/train.csv')

# Looking at data

In [31]:
train_df.head()

In [32]:
train_df.info()

Based on this information, we can draw a few conclusions:
* The dataset has 40,000 rows
* Several columns have missing values
* All columns are numerical; there is no text data

## Missing values

In [33]:
train_df.isnull().sum()

It seems that the approach used to fill in missing values will matter a lot. I will come back to this later.
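As a point of reference until then, the sketch below shows one possible baseline: measuring the share of missing values per column and filling the gaps with column medians. It runs on a small made-up frame (`toy_df`) rather than `train_df`, and median imputation is only an assumption here, not necessarily the approach chosen later.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "song_duration_ms": [200000.0, np.nan, 180000.0, 240000.0],
    "acousticness": [0.1, 0.5, np.nan, np.nan],
    "song_popularity": [0, 1, 0, 1],
})

# Share of missing values per column, largest first.
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Baseline imputation: fill each feature with its own column median.
feature_cols = [c for c in toy_df.columns if c != "song_popularity"]
imputed_df = toy_df.copy()
imputed_df[feature_cols] = imputed_df[feature_cols].fillna(
    imputed_df[feature_cols].median()
)
print(imputed_df)
```

The same two steps should apply to `train_df` unchanged, with `feature_cols` also excluding `id`.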

## Outliers detection

I will come back to this later

# Target variable analysis

## Plots

In [34]:
f,ax=plt.subplots(1,2,figsize=(14,6))
train_df['song_popularity'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Song popularity')
ax[0].set_ylabel('')
sns.countplot(x='song_popularity',data=train_df,ax=ax[1])
ax[1].set_title('Song popularity')
plt.show()

## Class Imbalance

In [35]:
train_df['song_popularity'].value_counts()

The dataset is not balanced - there are more unpopular songs than popular ones.
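To put a number on the imbalance, the class shares and the majority-to-minority ratio can be computed directly. This is a minimal sketch on a made-up target column (`toy_target`); the real values come from `train_df['song_popularity']`.

```python
import pandas as pd

# Toy target standing in for train_df['song_popularity'] (made-up values).
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="song_popularity")

# Fraction of rows in each class.
class_share = toy_target.value_counts(normalize=True)

# How many majority-class rows there are per minority-class row.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```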

## Metric

This is a binary classification task, and the evaluation metric for this competition is the area under the ROC curve (AUC).
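For intuition, AUC equals the probability that a randomly chosen popular song receives a higher predicted score than a randomly chosen unpopular one, with ties counted as one half. The sketch below implements that definition with NumPy on made-up labels and scores. The helper name `roc_auc` is hypothetical and the pairwise comparison only suits small inputs; in practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as P(score of a positive > score of a negative), ties counted as 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up labels and predicted scores.
auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# A model that gives every song the same score is no better than chance.
auc_constant = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])

print(auc_example, auc_constant)
```

Because AUC depends only on how the scores rank the songs, it is not affected by the decision threshold.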

# Feature analysis

In this section I'm going to:
* Look at columns, determine their types
* Summarize data and show some statistics per feature
* Conduct analysis for every feature which will include:
    * Plots
    * Analysis of the dependency between the feature and the target variable
    * Interactions of features
    * Correlation

## Look at columns

In [36]:
train_df.dtypes

As noted earlier, all the features are numerical.

## Statistics

In [37]:
train_df.drop('id', axis=1).describe()

## Feature analysis

In [38]:
# Compute the correlation matrix
corr = train_df.drop('id', axis=1).corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap="coolwarm", vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Song popularity doesn't have a significant correlation with any single feature, but this doesn't mean that the features are useless. Subpopulations within these features may still be correlated with song popularity. To determine this, we need to explore the features in detail.
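To illustrate why a near-zero correlation does not rule out a useful feature, the sketch below builds a made-up column (`some_feature`, a hypothetical name) whose middle range is popular while both extremes are not. The linear correlation with the target is zero, yet the popularity rate per quantile bin differs sharply.

```python
import pandas as pd

# Made-up data: only the middle values of the feature are popular.
toy_df = pd.DataFrame({
    "some_feature": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "song_popularity": [0, 0, 0, 1, 1, 1, 0, 0, 0],
})

# Pearson correlation only captures linear dependence.
linear_corr = toy_df["some_feature"].corr(toy_df["song_popularity"])

# Split the feature into three equal-sized bins and compare popularity rates.
toy_df["feature_bin"] = pd.qcut(
    toy_df["some_feature"], q=3, labels=["low", "mid", "high"]
)
rate_by_bin = toy_df.groupby("feature_bin", observed=True)["song_popularity"].mean()

print(f"Linear correlation: {linear_corr:.3f}")
print(rate_by_bin)
```

The same binning can be applied to any real feature in `train_df` to look for subpopulations like this.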

### Song Duration

**Name**: song_duration_ms  
**Number of values**: 35899  
**Number of missing values**: 4101  
**Range**: [25658, 491671]  
**Type**: Continuous  
**Details**: The duration of the song in milliseconds  

In [39]:
# Convert milliseconds to seconds
song_duration_s = train_df["song_duration_ms"].div(1000)
song_duration_s.agg(['min', 'max', 'mean', 'median'])

The mean duration of a song is 3m 22s

In [40]:
plt.figure(figsize=(12, 5))
g = sns.kdeplot(song_duration_s, fill=True)
g.set_title("Song duration (s)")
g.set_xlabel("")
plt.show()

The plot shows the distribution of song duration in seconds; missing values are omitted.

In [41]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["song_duration_ms"], color="red", fill=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["song_duration_ms"], color="blue", fill=True)
g = g.legend(["Unpopular song", "Popular song"])

The distributions of song duration for popular and unpopular songs look very similar.
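One way to back up the visual comparison with numbers is to summarize the feature separately for each class. The sketch below does this on a small made-up frame (`toy_df`) in place of `train_df`; the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "song_duration_ms": [180000.0, 210000.0, np.nan, 200000.0, 190000.0, 220000.0],
    "song_popularity": [0, 0, 0, 1, 1, 1],
})

# Per-class summary; missing durations are skipped by the aggregations.
duration_by_class = (
    toy_df.groupby("song_popularity")["song_duration_ms"]
    .agg(["count", "mean", "median", "std"])
)
print(duration_by_class)
```

If the per-class means and medians are close relative to the standard deviation, that supports the impression that the feature alone separates the classes poorly.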

### Acousticness

**Name**: acousticness  
**Number of values**: 36008  
**Number of missing values**: 3992  
**Range**: [-0.013551, 1.065284]  
**Type**: Continuous  
**Details**: This value describes how acoustic a song is. A score of 1.0 means the song is most likely to be an acoustic one.

In [42]:
train_df['acousticness'].agg(['min', 'max', 'mean', 'median'])

In [43]:
plt.figure(figsize=(12, 5))
g = sns.kdeplot(train_df['acousticness'], fill=True)
g.set_title("Acousticness")
g.set_xlabel("")
plt.show()

Based on the statistics and the density plot, more than half of the songs have an acousticness below 0.2.  
This suggests the feature could be turned into a binary categorical feature using 0.2 as the threshold.
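A minimal sketch of that transformation is shown below, on a made-up frame (`toy_df`) rather than `train_df`. The column name `acousticness_low` and keeping missing values as missing are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "acousticness": [0.05, 0.10, 0.15, 0.30, 0.70, np.nan],
    "song_popularity": [0, 1, 0, 1, 1, 0],
})

THRESHOLD = 0.2

# Share of non-missing songs below the threshold.
share_below = (toy_df["acousticness"].dropna() < THRESHOLD).mean()

# 1.0 if acousticness is below the threshold, 0.0 otherwise, NaN if missing.
is_low = toy_df["acousticness"] < THRESHOLD
toy_df["acousticness_low"] = is_low.astype("float").where(
    toy_df["acousticness"].notna()
)

# Popularity rate within each group (rows with a missing flag are dropped).
popularity_by_group = toy_df.groupby("acousticness_low")["song_popularity"].mean()

print(f"Share below {THRESHOLD}: {share_below:.2f}")
print(popularity_by_group)
```

Comparing the popularity rates of the two groups gives a quick check on whether the binary feature carries any signal.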

In [44]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["acousticness"], color="red", fill=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["acousticness"], color="blue", fill=True)
g = g.legend(["Unpopular song", "Popular song"])

Comparing popular and unpopular songs by acousticness, we can spot some differences, although they are not very pronounced.

### Danceability

**Name**: danceability  
**Number of values**: 35974  
**Number of missing values**: 4026  
**Range**: [0.043961, 0.957131]  
**Type**: Continuous  
**Details**: This value describes how suitable a song is for dancing. A score close to 1.0 means the song is most likely a danceable one.

In [45]:
train_df['danceability'].agg(['min', 'max', 'mean', 'median'])

In [46]:
plt.figure(figsize=(12, 5))
g = sns.kdeplot(train_df['danceability'], fill=True)
g.set_title("Danceability")
g.set_xlabel("")
plt.show()

In [47]:
plt.figure(figsize=(12, 5))

g = sns.kdeplot(train_df[train_df["song_popularity"]==0]["danceability"], color="red", fill=True)
g = sns.kdeplot(train_df[train_df["song_popularity"]==1]["danceability"], color="blue", fill=True)
g = g.legend(["Unpopular song", "Popular song"])

**Write analysis**

### 