# 📊🎵 YouTube and Spotify Data Analysis  
#### This project aims to explore music listening trends and user behavior by analyzing datasets from YouTube and Spotify platforms. A comprehensive analysis was conducted using various musical attributes (e.g., `Energy`, `Valence`, `Tempo`) and engagement metrics (e.g., `Views`, `Likes`). 
#### The analysis also draws out practical insights and recommendations for companies.

---

## 🚀 Key Highlights  
- **Most Popular Keys:** 🎹 Tracks in the keys of C# (Do#), F# (Fa#) and A# (La#) and at 120-150 BPM reached the highest number of views.   
- **Musical Features and Engagement:** 🎧 Relationships between features like `Tempo`, `Energy`, and `Speechiness` and metrics like views and likes were examined.  
- **Licensed & Official Video:** 🏷️ The impact of these content types on view counts was analyzed in detail.  
- **Outlier Cleaning:** 🧹 Outliers in `Tempo` and `Duration_min` were removed to improve the accuracy of the analysis.
- **Filling Missing Values:** 💾 Missing numerical values were imputed with a `RandomForestRegressor` model. 
- **Categorization:** 📦 Continuous variables like `Energy`, `Valence`, and `Tempo` were grouped for clearer insights.  

---

## 🧐 Questions Answered  
- Which musical features influence a track's popularity?  
- How do licensed and official video content types shape viewership patterns?  
- Are specific `Tempo` and `Energy` levels more appealing to listeners?
- What factors influence listeners?
- Which artists, tracks, and channels make up the Top 10 by views?  

---

## 📋 Techniques Used  
- **Data Cleaning and Preparation:** Missing and outlier values were addressed prior to analysis (Pandas, RandomForestRegressor, IQR). 
- **Exploratory Data Analysis (EDA):** 📊 Detailed analysis was performed using visualization tools.  
- **Categorization:** Continuous variables were divided into ranges for better comparisons.  
- **Visualizations:** Metrics were visualized using pie charts, histograms, heat maps and scatter plots.  

# ℹ️  ****Columns Information****

**Track**:  
Name of the song, as visible on the Spotify platform.

**Artist**:  
Name of the artist.

**Url_spotify**:  
The URL of the artist.

**Album**:  
The album in which the song is contained on Spotify.

**Album_type**:  
Indicates if the song is released on Spotify as a single or contained in an album.

**Uri**:  
A Spotify link used to find the song through the API.

**Danceability**:  
Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable, and 1.0 is most danceable.

**Energy**:  
A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

**Key**:  
The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

**Loudness**:  
The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Values typically range between -60 and 0 dB.

**Speechiness**:  
Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g., talk show, audiobook, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words, while values between 0.33 and 0.66 describe tracks that may contain both music and speech. Values below 0.33 most likely represent music and other non-speech-like tracks.

**Acousticness**:  
A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

**Instrumentalness**:  
Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 represent instrumental tracks, with higher confidence as the value approaches 1.0.

**Liveness**:  
Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

**Valence**:  
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g., sad, depressed, angry).

**Tempo**:  
The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

**Duration_ms**:  
The duration of the track in milliseconds.

**Stream**:  
Number of streams of the song on Spotify.

**Url_youtube**:  
URL of the video linked to the song on YouTube, if it exists.

**Title**:  
Title of the videoclip on YouTube.

**Channel**:  
Name of the channel that published the video.

**Views**:  
Number of views.

**Likes**:  
Number of likes.

**Comments**:  
Number of comments.

**Description**:  
Description of the video on YouTube.

**Licensed**:  
Indicates whether the video represents licensed content, meaning that the content was uploaded to a channel linked to a YouTube content partner and then claimed by that partner.

**Official_video**:  
Boolean value indicating if the video found is the official video of the song.

# ****Libraries****

In [None]:
# Importing necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import FuncFormatter
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

# ****Creating Dataset (Youtube and Spotify)****

In [None]:
# After importing the necessary libraries, let's start by creating the data set.

# Importing dataset
df_ = pd.read_csv(filepath_or_buffer = "/kaggle/input/spotify-and-youtube/Spotify_Youtube.csv")
df = df_.copy()

# ****Creating Empty Data (2%)****

In [None]:
# Let's create blank values in the dataset.

import random

def add_random_missing_values(dataframe: pd.DataFrame,
                              missing_rate: float = 0.02,
                              seed: int = 42) -> pd.DataFrame:
    
    # Get copy of dataframe
    df_missing = dataframe.copy()

    # Obtain the size of the dataframe and the total number of missing values
    df_size = dataframe.size
    num_missing = int(df_size * missing_rate)
    
    # Set seed (checked against None so that a seed of 0 is also honored)
    if seed is not None:
        random.seed(seed)

    # Get random row and column indexes to turn them NaN
    for _ in range(num_missing):
        row_idx = random.randint(0, dataframe.shape[0] - 1)
        col_idx = random.randint(0, dataframe.shape[1] - 1)

        df_missing.iat[row_idx, col_idx] = np.nan
        
    return df_missing

df = add_random_missing_values(dataframe = df,
                               missing_rate = 0.02)
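
The loop above draws row and column positions one at a time, so the same cell can be picked twice and the realized share of missing cells can end up slightly below `missing_rate`. As a rough sketch of an alternative, the same idea can be written with a boolean mask. The helper name `mask_random_cells` and the toy frame are only illustrative and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def mask_random_cells(dataframe: pd.DataFrame,
                      missing_rate: float = 0.02,
                      seed: int = 42) -> pd.DataFrame:
    """Return a copy with roughly `missing_rate` of all cells set to NaN."""
    rng = np.random.default_rng(seed)
    # Boolean matrix of the same shape: True marks a cell that becomes missing.
    mask = rng.random(dataframe.shape) < missing_rate
    # DataFrame.mask replaces the cells where the condition is True with NaN.
    return dataframe.mask(mask)

# Toy data standing in for the real dataset (assumption for illustration).
toy = pd.DataFrame({"a": np.arange(1000, dtype=float),
                    "b": np.arange(1000, dtype=float)})
toy_missing = mask_random_cells(toy, missing_rate=0.02)
share_missing = toy_missing.isna().to_numpy().mean()
print(f"Share of missing cells: {share_missing:.3%}")
```

Each cell is masked independently here, so the share is close to 2% on average rather than an exact count.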

# ****Dataset review and overview****

In [None]:

# Let's check dataset's shapes.

print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in this dataset.")

In [None]:
# Let's make a general observation on the dataset. We can also do this observation with .head() and .tail() commands.
# But I choose only the df command because it shows the head and tail data together.
# I use the .T command because it makes observation easier when there are many columns.

df.T

In [None]:
# Let's take a look at the columns in general.

df.info()

In [None]:
# Let's check the number of Null values in dataset.
# With the result of this code, we will have a general knowledge about Null values.
# This will help us decide what to do next.

df.isnull().sum()

In [None]:
# With the Describe command, let's check the quartiles, minimum and maximum values, and average of numerical data.
# If we add include = "all" we will examine all columns, but this is not necessary for now. I only want to examine numeric columns.

df.describe().T



In [None]:
# At first look, I see that some of the values that should be between 0 and 1 are incorrect.
# I will manipulate this with appropriate methods in the following process.

In [None]:
# With a short summary table, I want to get a general overview and decide how much to delete or whether I need to fill in values.
# With this table we will be able to see the percentage of missing values, the number of unique values, the data type and more.

summary_table = pd.DataFrame({
    "Data Type" : df.dtypes,
    "Missing" : df.isnull().sum(),
    # Share of all rows that are missing, so the denominator is the total row count.
    "Missing %" : (df.isnull().sum()/len(df))*100,
    "Unique" : df.nunique(),
    "Count" : df.count(),
})

summary_table
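
A missing-value percentage can be computed against two different denominators, and the two give different numbers. The small sketch below uses a made-up frame (not the real dataset) to show the difference between "share of all rows" and "ratio of missing to non-missing cells".

```python
import numpy as np
import pandas as pd

# Made-up data: "Views" has 2 missing values out of 4 rows.
toy = pd.DataFrame({"Views": [10.0, np.nan, 30.0, np.nan],
                    "Likes": [1.0, 2.0, 3.0, 4.0]})

missing = toy.isnull().sum()
# Denominator is the total number of rows: share of rows that are missing.
pct_of_rows = missing / len(toy) * 100
# Denominator is the non-null count: ratio of missing to non-missing cells.
pct_of_non_null = missing / toy.count() * 100

print(pct_of_rows["Views"])      # 50.0
print(pct_of_non_null["Views"])  # 100.0
```

When only a few percent of the cells are missing, the two values are close, but the first one is what is usually meant by "percentage of missing values".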

# ****Manipulation of The Data Set****

In [None]:
# In the numerical columns where I will make inferences, I see that Null values are around 2-6%.
# I might prefer to exclude Null values here because the rate of Null values is very low.
# But it is important to remember that in professional life, this decision is made together with our teammates or according to the upper limit set in the company.
# We can start manipulating our dataset.

In [None]:
# Unnamed: 0 column is useless, so I remove it and reset the index.

df.drop("Unnamed: 0", axis = 1, inplace = True)
df.reset_index(drop = True, inplace = True)

# First, I convert milliseconds to minutes (1 min = 60000 ms) and remove the Duration_ms column because we no longer need it.
df["Duration_min"] = df["Duration_ms"] / 60000
df.drop(columns = "Duration_ms", inplace = True)

# And then I identify outliers.

# ****Identification of Outliers****

## For 0-1 Range

In [None]:
# Let's identify the columns with a value between 0-1.
columns_to_check = ['Danceability', 'Energy', 'Speechiness', 'Acousticness', 
                    'Instrumentalness', 'Liveness', 'Valence']

# Checking values outside the 0-1 range and printing.
for col in columns_to_check:
    out_of_range = df[(df[col] < 0) | (df[col] > 1)]
    print(f"In the {col} column, {len(out_of_range)} values were found outside the 0-1 range.")
    if not out_of_range.empty:
        print(out_of_range[[col]])
    print('-' * 74)

In [None]:
# Since there are no outliers in the columns I checked, I will not perform data cleaning. 
# For the other columns I will check, I will determine outlier values with the IQR method. 
# Since I will perform analysis on these columns, I will perform data cleaning by selecting only these columns.

## IQR Method

In [None]:
# Let's check dataset.

df

In [None]:
# Now let's check the number of outlier values.
# I prefer the IQR method to determine outlier values. 

# Here I create the necessary function to determine the outlier values with IQR values.
def find_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    # I create the lower and upper bound variables.
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # I select the rows that fall outside the bounds as outliers.
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound


columns_to_check = ["Duration_min", "Likes", "Views", "Tempo"]

# With Scatterplot I create a graph to show outliers and print their numbers.

plt.figure(figsize = (12, 8))
for i, col in enumerate(columns_to_check, 1):
    # Calculation of lower and upper limits
    outliers, lower, upper = find_outliers_iqr(df, col)

    # I also create normal values with variables to get a better visualization in the graph.
    normal_values = df[(df[col] >= lower) & (df[col] <= upper)]

    # Creating scatterplot
    plt.subplot(2, 2, i)
    plt.scatter(normal_values.index, normal_values[col], label = "Normal", alpha = 0.6, c = "green")
    plt.scatter(outliers.index, outliers[col], label = "Outlier", alpha=0.6, c = "red")
    plt.title(f"{col} Scatterplot")
    plt.ylabel(col)
    plt.legend()

    # I print the lower and upper bounds with the total number of outliers.
    print(f"Total number of outliers in column {col}: {len(outliers)}")
    print(f"Lower limit: {lower}, Upper limit: {upper}")
    print("-" * 59)

plt.tight_layout()
plt.show()
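
To make the IQR rule used above concrete, here is a small worked example on a hand-made series (the numbers are invented for illustration). The bounds are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, the same formula as in `find_outliers_iqr`.

```python
import pandas as pd

# Invented values: eight ordinary points and one extreme point.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1 = values.quantile(0.25)   # 3.0
q3 = values.quantile(0.75)   # 7.0
iqr = q3 - q1                # 4.0

lower = q1 - 1.5 * iqr       # -3.0
upper = q3 + 1.5 * iqr       # 13.0

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())     # [100.0]
```

Because the bounds depend only on the middle half of the data, heavily skewed columns such as view or like counts tend to have many legitimate values flagged by this rule.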

In [None]:
# The IQR bounds I obtained for the Duration_min, Likes and Views columns were not satisfactory. 
# When I look at the actual values, I see that they legitimately reach about 8 billion Views and 54 million Likes. Therefore, I will not delete data based on the Likes and Views columns.
# For the Duration_min column, I will set a threshold myself and delete accordingly, because there are legitimate songs longer than 6-7 minutes.
# For the Tempo column, I want to examine a boxplot and remove the outliers based on it.

In [None]:
# For the Tempo column, we apply the IQR method again with the find_outliers_iqr function defined above; only the chart is different this time.

tempo_outliers, tempo_lower, tempo_upper = find_outliers_iqr(df, "Tempo")

# Let's visualize Outliers with Boxplot.
plt.figure(figsize = (8, 6))
sns.boxplot(data = df, y = "Tempo", palette = "Greens")
plt.title("Tempo Column - Outliers")
plt.ylabel("Tempo")
plt.show()

# The code we use to print the outlier number.
print(f"Total number of outliers in the Tempo column: {len(tempo_outliers)}")

In [None]:
# I have found that there are 63 outlier values in the Tempo column.
# Let's remove the outlier values from the dataset.
# Note: rows where Tempo is NaN fail both comparisons, so they are dropped here as well.

df = df[(df['Tempo'] >= tempo_lower) & (df['Tempo'] <= tempo_upper)]

In [None]:
# To determine the outlier values in Duration_min, I will plot it against the Likes and Views columns. 
# I chose these columns because I will probably do my analysis with one of these 2 columns.

# With this command I set the figure size.
plt.figure(figsize = (12, 6))

# Likes vs Duration_min
plt.subplot(1, 2, 1)
plt.scatter(df["Likes"], df["Duration_min"], color = "green", alpha = 0.5)
plt.title("Likes vs Duration_min")
plt.xlabel("Likes")
plt.ylabel("Duration (min)")

# Views vs Duration_min
plt.subplot(1, 2, 2)
plt.scatter(df["Views"], df["Duration_min"], color = "midnightblue", alpha = 0.5)
plt.title("Views vs Duration_min")
plt.xlabel("Views")
plt.ylabel("Duration (min)")

plt.tight_layout()
plt.show()


In [None]:
# I see a similarity here, but I will address this in the Correlation Matrix section.

In [None]:
# Let's count how many tracks exceed each candidate threshold for the Duration_min column.
duration_thresholds = [20, 15, 10, 7]
for threshold in duration_thresholds:
    count = (df["Duration_min"] > threshold).sum()
    print(f"Number of values above {threshold} minutes for the Duration_min column: {count}")


In [None]:
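# Given these counts and the sharp increase in the number of tracks once the threshold drops below 10 minutes, I decided that a cutoff of 10 minutes for the Duration_min column was appropriate.

# I remove tracks longer than 10 minutes from the data set.
# Note: rows where Duration_min is NaN fail the comparison, so they are dropped here as well.
df = df[df["Duration_min"] <= 10]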

In [None]:
# In addition, I would like to point out that we could have chosen to adjust the outlier values instead of removing them, or to preserve them by applying other operations. 
# Since I did not perform this analysis with a team, I chose to preserve the accuracy of the data by removing the values.
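
As an illustration of one such alternative, the sketch below caps long durations at the threshold instead of dropping the rows. The values are made up, and capping is only one possible choice; it is not what this notebook does.

```python
import pandas as pd

# Made-up durations in minutes; two of them exceed the threshold.
durations = pd.Series([2.5, 3.0, 3.5, 4.0, 12.0, 45.0], name="Duration_min")
cap = 10.0  # the same 10-minute threshold chosen above

removed = durations[durations <= cap]   # drops the two long tracks
capped = durations.clip(upper=cap)      # keeps every row, limits the value

print(len(removed), len(capped))  # 4 6
print(capped.max())               # 10.0
```

Removing rows keeps the remaining values untouched, while capping keeps the row count but changes the values, so the choice depends on which of the two matters more for the analysis.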

# ****Identifying and Filling Missing Data (With Random Forest Regressor)****

In [None]:
# Let's see how much data is left and check the final status with the summary table

df.shape

In [None]:
summary_table = pd.DataFrame({
    "Data Type" : df.dtypes,
    "Missing" : df.isnull().sum(),
    # Share of all rows that are missing, so the denominator is the total row count.
    "Missing %" : (df.isnull().sum()/len(df))*100,
    "Unique" : df.nunique(),
    "Count" : df.count(),
})

summary_table

In [None]:
# What caught my attention here is that the data types are only object and float64. 
# Storing data in only 2 types will make our analysis a little easier.
# Also, we can see the categorical values in a summary table.

In [None]:
# We can get an overview by seeing the missing values in the numeric columns on the heat map.

# I am creating a variable that returns only numeric columns. Remember, all numeric columns are in float64 format! 
numerical_columns = df.select_dtypes(include = "float64").columns

# Creating graphs
plt.figure(figsize = (10, 6))
sns.heatmap(df[numerical_columns].isna(), cbar = False, cmap = "plasma")
plt.title("Visualization of Missing Values")
plt.show()

##  I chose the Random Forest Regressor model for filling the missing values because:
####    * By averaging predictions from multiple decision trees, it reduces the risk of overfitting.
####    * Random Forest handles both linear and nonlinear relationships effectively.
####    * It can capture complex relationships among numerical features.
####    * Features with sufficient variance and non-missing values provide strong predictors for imputation.



In [None]:
# Random Forest Regressor for filling missing values.

# I selected only numerical columns for the model.
numerical_columns = df.select_dtypes(include = "float64").columns

# Creating a loop to fill in the missing values.
for col in numerical_columns:
    # Here I select the rows that are not missing to train the model.
    non_missing_data = df[df[col].notna()]
    missing_data = df[df[col].isna()]

    # If the column has no missing values, there is nothing to fill, so I skip it.
    if missing_data.empty:
        continue

    # Use the remaining numerical columns as features (the target column itself is excluded).
    X = non_missing_data[numerical_columns].drop(columns = [col])
    y = non_missing_data[col]

    X_missing = missing_data[numerical_columns].drop(columns = [col])

    # Exclude feature columns that contain missing values in the training rows,
    # so only fully observed columns are used as predictors.
    X = X.dropna(axis = 1)
    X_missing = X_missing[X.columns]

    # Let's standardize features.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_missing_scaled = scaler.transform(X_missing)

    # Let's train the model we created.
    model = RandomForestRegressor(random_state=42)
    model.fit(X_scaled, y)

    # Let's predict the missing values.
    predictions = model.predict(X_missing_scaled)

    # And finally let's fill the missing values.
    df.loc[df[col].isna(), col] = predictions

# Let's check the results.
print(df.isna().sum())

In [None]:
# Now that we can see that the empty values are filled, let's start the analysis.
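
The check above only confirms that no NaN values remain; it does not tell us how good the filled values are. One way to get a feeling for that is to hide values that are actually known, predict them, and compare against a simple mean fill. The sketch below does this on synthetic data (the columns `f1`, `f2` and `target` are invented), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in data: "target" depends on two fully observed features.
toy = pd.DataFrame({"f1": rng.normal(size=500), "f2": rng.normal(size=500)})
toy["target"] = 3 * toy["f1"] - 2 * toy["f2"] + rng.normal(scale=0.1, size=500)

# Hold out 20% of the known target values and treat them as if they were missing.
X_train, X_test, y_train, y_test = train_test_split(
    toy[["f1", "f2"]], toy["target"], test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
mae_model = mean_absolute_error(y_test, model.predict(X_test))

# Baseline: fill every held-out value with the training mean.
mae_mean = mean_absolute_error(y_test, np.full(len(y_test), y_train.mean()))
print(f"Random forest MAE: {mae_model:.3f} | mean-fill MAE: {mae_mean:.3f}")
```

If the model's error is not clearly below the mean-fill error for a column, the available features carry little information about that column and a simpler fill may be just as good.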

# ****Correlation Matrix, Counting Numerical Columns****

In [None]:
# For the correlation matrix we select only the columns that include numeric values. 
# When we examine the data, all numeric values in the table are in float64 format. 
# So we only select float64 format.

df_numeric = df.select_dtypes(include = ["float64"])

df_numeric.dtypes

In [None]:
# Let's make a graphical index showing the distribution of numeric columns

# Determine the value for the multiplier (M for Million, B for Billion) expressions shown in the header.
def get_scale_title(col_data, col_name):
    max_val = col_data.max()
    if max_val >= 1e9:
        scale = " (B)"
    elif max_val >= 1e6:
        scale = " (M)"
    else:
        scale = ""
    return f"{col_name}{scale}"

# We logarithmically transformed the columns with large numbers so that the graph can be analyzed.
df_log = df_numeric.apply(lambda x: np.log1p(x) if x.max() > 1e5 else x)

# Let's create graphs with subplots.
plt.figure(figsize = (16, 12))
for i, col in enumerate(df_log, 1):
    plt.subplot(4, 4, i)
    
    # Special settings for Instrumentalness (0.0 - 0.1 range approximately 17500).
    # Only for Instrumentalness, I set the y-axis limit to 1000.
    if col == "Instrumentalness":
        sns.histplot(df[col], kde = True, binwidth = 0.05)
        plt.ylim(0, 1000)
    else:
        sns.histplot(df_log[col], kde = True)
    
    # The codes we use to show the header and the multiplier scale.
    title = get_scale_title(df_numeric[col], col)
    plt.title(f"Distribution of {title} (Log Transformed)" if df_numeric[col].max() > 1e5 else f"Distribution of {col}")
    plt.xlabel(col)
    
plt.tight_layout()
plt.show()

In [None]:
# Before starting with the correlation matrix, we can quickly find out which columns are similar and get information about the distributions in the charts. For example:
    # Most tracks are around 3.5 minutes long.
    # Almost none of the tracks are live recordings.

# By the way, the multipliers are indicated in the title of each chart (M: Million, B: Billion).
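
The log transform applied to the large-valued columns above is `np.log1p`, which computes log(1 + x). A short sketch on invented view counts shows why it is convenient: it is defined at zero and can be inverted with `np.expm1`.

```python
import numpy as np
import pandas as pd

# Invented view counts spanning many orders of magnitude.
views = pd.Series([0, 9, 99_999, 1_000_000_000], dtype=float)

log_views = np.log1p(views)      # log(1 + x), so 0 views maps to 0.0
restored = np.expm1(log_views)   # inverse transform: exp(y) - 1

print(log_views.round(2).tolist())
print(np.allclose(restored, views))  # True
```

On the log scale a difference of 1 corresponds to a factor of about 2.7 in the raw values, which is worth keeping in mind when reading the histograms.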

In [None]:
# Good. Now let's create and visualize a correlation matrix.
# With the correlation matrix, we can see the correlations between the columns and decide on the values to use in the analysis.

# Let's create the correlation matrix from the df_numeric DataFrame.
corr_matrix = df_numeric.corr()

# Here we visualize the correlation matrix.
# Using a heat map in a correlation matrix facilitates analysis and insight.
plt.figure(figsize = (10,8))
sns.heatmap(corr_matrix, annot = True, cmap = "Spectral", center = 0, fmt = ".2f")
plt.title("Correlation Matrix Heat Map")
plt.show()

## ****Examination of Correlation Matrix****

### When we examine the correlation matrix, we see a strong positive correlation between the Likes column and the Views column, while Likes correlates moderately with the Stream and Comments columns. Based on this, it can be said that songs that are watched more on YouTube receive more likes. I also see that songs with a higher number of likes tend to have a higher number of streams on Spotify. 

### There is also a strong positive correlation between Energy and Loudness, meaning that songs with higher energy tend to be louder. In contrast, we observe a moderate negative correlation between Energy and Acousticness. Based on this, we can infer that loud, energetic songs are rarely acoustic, and the matrix confirms this idea.

### Based on all these, I will use the Views column for my analysis among the Stream, Likes and Views columns, as I find the correlation between them sufficient. I will also choose the Energy column between the Energy and Loudness columns.
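
Reading strong pairs off a heat map works for a handful of columns, but the same selection can also be listed programmatically. The sketch below uses synthetic data and an arbitrary threshold of 0.7 (both are assumptions for illustration) to extract each column pair once and keep only the strongly correlated ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
views = rng.lognormal(mean=10, sigma=1, size=300)
# Synthetic data: Likes is tied to Views, Tempo is unrelated to both.
toy = pd.DataFrame({
    "Views": views,
    "Likes": views * 0.01 * rng.uniform(0.8, 1.2, size=300),
    "Tempo": rng.uniform(60, 180, size=300),
})

corr = toy.corr()
# Keep only the upper triangle (without the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)

threshold = 0.7  # arbitrary cutoff for "strong", chosen for this sketch
strong = pairs[pairs.abs() >= threshold]
print(strong)
```

When two columns end up in this list, keeping just one of them, as done above with Views and Energy, avoids counting the same signal twice.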

# ****Data Analysis****

In [None]:
# Energy, Valence, Speechiness, Tempo and Key analysis based on total Views.

# Before starting the analysis, let's create bins for the columns according to the column information given in the introduction. 
energy_bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
valence_bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
speechiness_bins = [0, 0.33, 0.66, 1]
tempo_bins = [0, 80, 120, 160, 200]
# Key holds pitch classes 0-11, so the edges sit halfway between the integers (-0.5, 0.5, ..., 11.5).
# This also assigns imputed, non-integer Key values to the nearest pitch class.
key_bins = np.arange(-0.5, 12)

# Grouping the features into bins (include_lowest = True keeps values equal to the lowest edge, e.g. 0.0)
df["Energy_group"] = pd.cut(df["Energy"], bins = energy_bins, labels = ["0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1"], include_lowest = True)
df["Valence_group"] = pd.cut(df["Valence"], bins = valence_bins, labels = ["0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1"], include_lowest = True)
df["Speechiness_group"] = pd.cut(df["Speechiness"], bins = speechiness_bins, labels = ["0-0.33", "0.33-0.66", "0.66-1"], include_lowest = True)
df["Tempo_group"] = pd.cut(df["Tempo"], bins = tempo_bins, labels = ["Less Than 80", "80-120", "120-160", "160-200"], include_lowest = True)
df["Key_group"] = pd.cut(df["Key"], bins = key_bins, labels = [f"{i}" for i in range(12)])

# Now that we have done the grouping, let's start visualization
plt.figure(figsize = (16, 12))

# Energy
plt.subplot(2, 3, 1)
df.groupby("Energy_group")["Views"].sum().plot(kind = "bar", color = "midnightblue")
plt.title("Total Views by Energy Group")
plt.xlabel("Energy Group")
plt.ylabel("Total Views")

# Valence
plt.subplot(2, 3, 2)
df.groupby("Valence_group")["Views"].sum().plot(kind = "bar", color = "maroon")
plt.title("Total Views by Valence Group")
plt.xlabel("Valence Group")
plt.ylabel("Total Views")

# Speechiness
plt.subplot(2, 3, 3)
df.groupby("Speechiness_group")["Views"].sum().plot(kind = "bar", color = "darkviolet")
plt.title("Total Views by Speechiness Group")
plt.xlabel("Speechiness Group")
plt.ylabel("Total Views")

# Tempo
plt.subplot(2, 3, 4)
df.groupby("Tempo_group")["Views"].sum().plot(kind = "bar", color = "orange")
plt.title("Total Views by Tempo Group")
plt.xlabel("Tempo Group (BPM)")
plt.ylabel("Total Views")

# Key
plt.subplot(2, 3, 5)
df.groupby("Key_group")["Views"].sum().plot(kind = "bar", color = "limegreen")
plt.title("Total Views by Key Group")
plt.xlabel("Key Group")
plt.ylabel("Total Views")
plt.xticks(rotation = 0)

plt.tight_layout()
plt.show()
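
Since the charts above depend on how `pd.cut` assigns values to bins, it helps to check the edge behavior on a few hand-picked values. By default the intervals are closed on the right, so a value equal to the lowest edge falls outside every bin. The sketch below uses invented values and also shows one way to bin integer pitch classes with half-integer edges.

```python
import pandas as pd

energy = pd.Series([0.0, 0.2, 0.21, 1.0])
edges = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ["0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1"]

# Default: intervals are (0, 0.2], (0.2, 0.4], ... so 0.0 gets no bin.
default_cut = pd.cut(energy, bins=edges, labels=labels)
# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(energy, bins=edges, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())         # [True, False, False, False]
print(inclusive_cut.astype(str).tolist())  # ['0-0.2', '0-0.2', '0.2-0.4', '0.8-1']

# Pitch classes 0-11: edges at -0.5, 0.5, ..., 11.5 give one bin per key.
keys = pd.Series([0.0, 1.0, 6.4, 11.0])
key_edges = [k - 0.5 for k in range(13)]
key_group = pd.cut(keys, bins=key_edges, labels=[str(k) for k in range(12)])
print(key_group.astype(str).tolist())      # ['0', '1', '6', '11']
```

With integer edges such as 0, 1, 2, ... the key 0 (C) would fall outside the first interval and every other key would land in the bin labeled one lower, which is why the half-integer edges are used in this sketch.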

In [None]:
# Tracks with above-average but not extreme energy have more views, and therefore more likes and streams. 
    # A songwriter could gain popularity faster by making songs in this style. 
    # A Spotify or YouTube executive could use this information to shape an advertising campaign or sales policy.

# The same goes for the Valence value. 
    # Cheerful, above-average positive songs have more views, but songs with average and below-average valence are preferred over the most positive or most negative tracks. 
    # To avoid risk, it would be better to keep the emotional balance near the average. Highlighting such tracks is likely to increase the number of views and streams.

# When we examine the Speechiness group, the most preferred tracks are the ones where music is at the forefront and lyrics are few.

# While the tempo values most preferred by people are in the 80-160 BPM range, the most preferred key is C (Do), followed by C# (Do#), F# (Fa#) and A# (La#) with almost equal data. 
    # With this data, we observe that listeners are more interested in certain tones. If we analyze these 3 keys, the results will make more sense. 
    # C# (Do#) creates a bright, lively and cheerful atmosphere. It is the key commonly used by Hüseyni Makam in Turkish Art Music and has a calming and melodic structure.
    # F# (Fa#) offers peaceful and elegant tones and is commonly used in the Rast Makam in Turkish Art Music, which I also love to listen to and play on the violin. 
    # Finally, A# (La#) can be preferred to provide strong emotional intonations. It matches the Segah Makam in Turkish art music and creates a more melancholic mood.

# In the light of all this information, the person concerned can choose according to the actions to be taken and achieve the best efficiency.

In [None]:
# Let's see the sum of View values based on Album_type with Donut Chart.

# First, I group by creating the album_views variable
album_views = df.groupby("Album_type")["Views"].sum()

# Now let's create the Donut Chart.
plt.figure(figsize = (8, 8))

plt.pie(album_views, labels = album_views.index, autopct = "%1.2f%%", startangle = 135, colors = ["maroon", "lime", "navy"], wedgeprops = dict(width = 0.3))
plt.title("View Share by Album Type", fontsize = 16)
plt.legend(labels = [f"{label}: {value:,.0f} views" for label, value in zip(album_views.index, album_views)],
           title = "Album Type", loc = "upper left", bbox_to_anchor = (1, 0, 0.5, 1))
plt.show()

In [None]:
# We see that tracks released as albums have more views. 
# Even if all of the data we extracted from the dataset were singles, album tracks would still be dominant.
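
The percentages on the donut chart are shares of total views, and a total grows with the number of tracks in a group as well as with their popularity. The sketch below, on made-up numbers, computes the same share next to the average views per track, so the two effects can be looked at separately.

```python
import pandas as pd

# Made-up rows: two album tracks, one single, one compilation track.
toy = pd.DataFrame({
    "Album_type": ["album", "album", "single", "compilation"],
    "Views": [600.0, 200.0, 150.0, 50.0],
})

views_by_type = toy.groupby("Album_type")["Views"].sum()
# Share of total views per album type, as shown on a pie or donut chart.
share_pct = (views_by_type / views_by_type.sum() * 100).round(2)
# Average views per track, which does not depend on how many tracks a type has.
mean_views = toy.groupby("Album_type")["Views"].mean()

print(share_pct.to_dict())   # {'album': 80.0, 'compilation': 5.0, 'single': 15.0}
print(mean_views.to_dict())  # {'album': 400.0, 'compilation': 50.0, 'single': 150.0}
```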

In [None]:
# Let's check on the pie chart how the total number of views is affected in Licensed and Official cases.

# I set it to show all percentages with the special autopct function.
def custom_autopct(pct, colors):
    return f"{pct:.2f}%"

# Let's calculate total Views by grouping Licensed and official_video columns
licensed_views = df.groupby("Licensed")["Views"].sum()
official_video_views = df.groupby("official_video")["Views"].sum()

# Editing charts size
plt.figure(figsize=(16, 8))

# Let's create Licensed pie chart
plt.subplot(1, 2, 1)
colors_licensed = ["lightsalmon", "maroon"]
wedges, texts, autotexts = plt.pie(
    licensed_views,
    labels=licensed_views.index,
    autopct=lambda pct: custom_autopct(pct, colors_licensed),
    startangle=90,
    colors=colors_licensed,
)
# To make the percentage expressions more visible in the chart, I set their colors according to the background color.
for text, color in zip(autotexts, colors_licensed):
    text.set_color("white" if color == "maroon" else "black")
plt.title("Licensed - Total Views")
plt.legend(
    [f"{label}: {views:,} Views" for label, views in zip(licensed_views.index, licensed_views)],
    loc="best",
)

# Creating official_video pie chart
plt.subplot(1, 2, 2)
colors_official = ["lime", "midnightblue"]
wedges, texts, autotexts = plt.pie(
    official_video_views,
    labels=official_video_views.index,
    autopct=lambda pct: custom_autopct(pct, colors_official),
    startangle=90,
    colors=colors_official,
)
# I do the same here to make the percentage expressions more visible.
for text, color in zip(autotexts, colors_official):
    text.set_color("white" if color == "midnightblue" else "black")
plt.title("Official Video - Total Views")
plt.legend(
    [f"{label}: {views:,} Views" for label, views in zip(official_video_views.index, official_video_views)],
    loc="best",
)

plt.tight_layout()
plt.show()

In [None]:
# Based on this information, we can emphasize how important it is for track producers to publish official content and to obtain copyrights. 
# This information can be presented with the support of these charts.

### Top 10 Analysis

In [None]:
# Now, Let's examine Top 10s. Prizes for you Top 10s! :)

# Top 10 Artist by Views
top10_artist_views = df.groupby("Artist")["Views"].sum().nlargest(10)

# We will analyze the Top 10s on a horizontal bar chart 
plt.figure(figsize = (10, 6))
sns.barplot(x = top10_artist_views.values, y = top10_artist_views.index, palette = "viridis")
plt.title("Top 10 Artists by View Count")
plt.xlabel("Views")
plt.ylabel("Artist")

plt.show()

In [None]:
# Top 10 Track by Views
top10_track_views = df.groupby("Track")["Views"].sum().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x = top10_track_views.values, y = top10_track_views.index, palette = "inferno")
plt.title("Top 10 Tracks by View Count")
plt.xlabel("Views")
plt.ylabel("Track")
plt.show()

In [None]:
# Top 10 Channel by Views
top10_channel_views = df.groupby("Channel")["Views"].sum().nlargest(10)

plt.figure(figsize = (10, 6))
sns.barplot(x = top10_channel_views.values, y = top10_channel_views.index, palette = "viridis")
plt.title("Top 10 Channels by View Count")
plt.xlabel("Views")
plt.ylabel("Channel")
plt.show()
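
One detail to keep in mind with these rankings: grouping by `Track` alone adds up the views of different songs that happen to share a title. The sketch below uses invented rows to show how the ranking can change when the grouping key also includes the artist. Whether this matters for the real dataset is not checked here.

```python
import pandas as pd

# Invented rows: artists A and B each have a different song called "Intro".
toy = pd.DataFrame({
    "Artist": ["A", "B", "C", "C"],
    "Track": ["Intro", "Intro", "Hit", "Ballad"],
    "Views": [500.0, 450.0, 900.0, 100.0],
})

# Grouping by title only merges the two unrelated "Intro" tracks.
by_title = toy.groupby("Track")["Views"].sum().nlargest(2)
# Grouping by artist and title keeps them apart.
by_artist_title = toy.groupby(["Artist", "Track"])["Views"].sum().nlargest(2)

print(by_title.to_dict())         # {'Intro': 950.0, 'Hit': 900.0}
print(by_artist_title.to_dict())  # {('C', 'Hit'): 900.0, ('A', 'Intro'): 500.0}
```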

# ****Overall Analysis****
### We can announce the Top 10 regularly on Youtube and Spotify pages and we can also create a small reward system for these achievements. 
### Prizes or badges can be awarded to those with a total of 21 awards (this limit can be adjusted). 
### We could even scale up and make smaller channels more competitive among themselves. 
### This can be a move that can increase sharing and interaction by creating competition between channels or artists. 
### In this way, channels will advertise themselves and this will result in more users.