---
<a name="top"></a>
# Introduction to Python
## Overview
* [How to run code](#coderunning)
* [Course: Goals & Overview](#course)
* [Get started](#summary)

---
<a name="coderunning"></a>
## [How to run code in Jupyter Notebook](#top)
* #### Select a Code Cell:
    A Jupyter notebook is composed of cells, and you can identify code cells by the In [ ]: prompt on the left side. Click on a code cell to select it. Alternatively, you can use arrow keys to navigate to a code cell.
* #### Run the Code Cell:
    Once you have a code cell selected, you can run the code inside it. There are several ways to do this:
    * Press Shift + Enter: This keyboard shortcut runs the current cell and moves the focus to the next cell (or creates a new cell if none exists).
    * Click the "Run" button: You can find this button in the toolbar above the notebook interface. It looks like a "play" button ▶.
    * Go to the "Run" menu and select "Run Selected Cells": This menu option is another way to run the code in the selected cell.
* #### View Output:
    After running a code cell, any output generated by the code (e.g., print statements, plot visualizations) will appear directly below the cell.
* #### Continue Running Cells:
    You can continue running code cells in the notebook, either by selecting them manually or by using keyboard shortcuts to navigate through the notebook.
* #### Save Your Work:
    Remember to save your changes periodically by clicking the save button in the toolbar or by using the keyboard shortcut (Ctrl + S or Cmd + S on Mac).
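
To try the steps above, you could paste something like the following into an empty code cell and press Shift + Enter. This is only a minimal sketch; the variable name `greeting` is made up for illustration.

```python
# A first cell to try: select it and press Shift + Enter.
greeting = "Hello, notebook!"
print(greeting)  # printed text appears directly below the cell

# In a notebook, the value of the last expression in a cell is also shown
# as output (a plain Python script would display nothing for this line).
len(greeting)
```

Running the cell a second time simply repeats the output, so feel free to change the text and run it again.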

---
<a name="course"></a>
## [Course overview](#top)

* Introduction
* Data Structures
* Control Statements
* Functions
* External Modules & Reference Semantics
* Functional Programming & Iterators
* File Handling & Exceptions
* Pandas
* NumPy
* Matplotlib

## Analyzing statistics for songs of various Spotify artists

As an example, we show you how to use Python to explore a dataset containing information about more than 20,000 songs (it is real data from Spotify and YouTube!). You can use the insights from such an analysis to build models that predict the success of a new song, or to find out which properties the most successful songs share, so that you can write the next hit. Analyzing a dataset of this size would hardly be feasible without the help of a computer.

### Data Preprocessing

First we read the data file.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./data/Spotify_Youtube.csv")
df.head()

Now we can use the data in our code. For example, we can check how many rows the dataset contains...

In [None]:
print("There are {} rows in the dataset.".format(len(df)))

... or compute the average duration of all songs in the dataset:

In [None]:
avg_duration = df["Duration_ms"].mean()
print("On average a song is {:.2f} minutes long".format(avg_duration/60000))
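
A decimal value such as 3.75 minutes is easy to misread as "3 minutes 75 seconds". As a small sketch, a duration in milliseconds can also be shown as minutes and seconds; the helper name `format_duration` is hypothetical and not part of the dataset or of pandas.

```python
def format_duration(duration_ms):
    """Convert a duration in milliseconds to a 'minutes:seconds' string."""
    # Round to whole seconds first, then split into minutes and seconds
    total_seconds = int(round(duration_ms / 1000))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# 225000 ms are 225 seconds, i.e. 3 minutes and 45 seconds
print(format_duration(225000))
```

In the notebook, the same helper could be called as `format_duration(avg_duration)`.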

To analyze the data in a little more detail, we first have to preprocess it and remove all the columns we are not interested in, such as the URL of a song:

In [None]:
df.drop(["Unnamed: 0", "Url_spotify", "Uri", "Url_youtube"], axis=1, inplace=True)
df.columns

Now we want to check if some of the data is missing in our dataset:

In [None]:
missing_values = df.isnull().sum()
missing_values

Incomplete data is a typical problem in data analysis. (Note that a missing value is not necessarily an error; that depends on the dataset you have.) One way to deal with it is to fill the missing values with the mean value of their column. Here we simply remove all rows that are not complete, so that we only work with fully recorded instances:

In [None]:
df.dropna(inplace=True)
print("After cleaning the data we have {} rows left".format(len(df)))
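
For comparison, the alternative mentioned above (filling missing values with the column mean instead of dropping rows) could look roughly like this. The sketch uses a made-up toy table rather than the Spotify data, and the names `toy`, `tempo_mean`, and `filled` are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data with one missing value (not the Spotify dataset)
toy = pd.DataFrame({
    "Track": ["A", "B", "C", "D"],
    "Tempo": [120.0, None, 100.0, 140.0],
})

# mean() skips missing values by default: (120 + 100 + 140) / 3
tempo_mean = toy["Tempo"].mean()

# fillna() returns a new Series; assign() builds a new DataFrame from it,
# so the original `toy` table stays unchanged
filled = toy.assign(Tempo=toy["Tempo"].fillna(tempo_mean))
print(filled)
```

Whether filling or dropping is the better choice depends on the dataset: filling keeps every row, but the invented values can distort later statistics.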

### Exploration and Visualization

Now that we have prepared the data, we can use the visualization tools from `matplotlib` to explore it. Let's plot the most successful songs in terms of the number of streams on Spotify:

In [None]:
from unidecode import unidecode

# helper function to truncate the song title
def truncate(text):
    maxLen = 33
    if(len(text) > maxLen):
        return unidecode(text[0:maxLen] + "...")
    else:
        return unidecode(text)

    
    
# select the 15 songs with the most Spotify streams (sorted ascending, so the top song is drawn last)
topSongs = df.nlargest(15, "Stream", keep="all").drop_duplicates("Track").sort_values("Stream")
topSongs['Track'] = topSongs['Track'].map(truncate)

# plot songs in a bar chart
fig, ax = plt.subplots()
ax.barh(topSongs["Track"], topSongs["Stream"], color="#619DFF")
ax.xaxis.set_ticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels())
ax.set_title("Top Songs on Spotify")
ax.set_xlabel("Number of Streams (in Billions)")

plt.show()

We can also look at the top songs by number of views on YouTube:

In [None]:
# filter out top songs
topSongs = df.nlargest(15, "Views", keep="all").drop_duplicates("Track").sort_values("Views")
topSongs['Track'] = topSongs['Track'].map(truncate)

# plot songs in a bar chart
fig, ax = plt.subplots()
ax.barh(topSongs["Track"], topSongs["Views"], color="#619DFF")
ax.xaxis.set_ticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels())
ax.set_title("Top Music Videos on YouTube")
ax.set_xlabel("Number of Views (in Billions)")

plt.show()
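
The cells above rely on the axis label to tell the reader that the values are in billions. If you want the tick labels themselves to be scaled explicitly, one option is Matplotlib's `FuncFormatter`. The following is only a sketch with made-up numbers; the function name `in_billions` and the two example songs are assumptions for illustration, not the approach used in the cells above.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def in_billions(value, position):
    """Format a raw count as billions with one decimal place."""
    # `position` is the tick index that Matplotlib passes in; it is unused here
    return f"{value / 1e9:.1f}"


fig, ax = plt.subplots()
# Made-up stream counts, only to have something to draw
ax.barh(["Song A", "Song B"], [1.2e9, 2.5e9], color="#619DFF")
ax.xaxis.set_major_formatter(FuncFormatter(in_billions))
ax.set_xlabel("Number of Streams (in Billions)")

plt.show()
```

The same formatter could be attached to the `ax` objects of the two bar charts above before calling `plt.show()`.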

The dataset provides us with many features for every song. We can look at the distribution of some of these features to get a better feeling for the range in which most of the songs can be found:

In [None]:
fig, axs = plt.subplots(2,3,figsize=(15,10))

# number of bins used for every histogram (passed to hist() as the `bins` argument)
binSize = 30

axs[0,0].hist(df["Danceability"], binSize, color = "#00d5e8", ec="black")
axs[0,0].set_title("Danceability Distribution")
axs[0,0].set_xlabel("Danceability Levels")
axs[0,0].set_ylabel("Values Count")
axs[0,0].grid(True)

axs[0,1].hist(df["Energy"], binSize, color = "#00b5e8", ec="black")
axs[0,1].set_title("Energy Distribution")
axs[0,1].set_xlabel("Energy Levels")
axs[0,1].set_ylabel("Values Count")
axs[0,1].grid(True)

axs[0,2].hist(df["Loudness"], binSize, color = "#0084e8", ec="black")
axs[0,2].set_title("Loudness Distribution")
axs[0,2].set_xlabel("Loudness Levels")
axs[0,2].set_ylabel("Values Count")
axs[0,2].grid(True)

axs[1,0].hist(df["Acousticness"], binSize, color = "#5f66fa", ec="black")
axs[1,0].set_title("Acousticness Distribution")
axs[1,0].set_xlabel("Acousticness Levels")
axs[1,0].set_ylabel("Values Count")
axs[1,0].grid(True)

axs[1,1].hist(df["Liveness"], binSize, color = "#7f60ee", ec="black")
axs[1,1].set_title("Liveness Distribution")
axs[1,1].set_xlabel("Liveness Levels")
axs[1,1].set_ylabel("Values Count")
axs[1,1].grid(True)

axs[1,2].hist(df["Tempo"], binSize, color = "#7d40ec", ec="black")
axs[1,2].set_title("Tempo Distribution")
axs[1,2].set_xlabel("Tempo Levels")
axs[1,2].set_ylabel("Values Count")
axs[1,2].grid(True)

plt.show()
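
The six histogram blocks above differ only in the column name and the color, so the same figure can also be produced with a loop. The sketch below shows the idea on synthetic stand-in data so that it runs on its own; the `demo` table and its random values are assumptions, not the Spotify dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data so the sketch is self-contained
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Danceability": rng.random(200),
    "Energy": rng.random(200),
    "Tempo": rng.normal(120, 25, 200),
})

features = list(demo.columns)
fig, axs = plt.subplots(1, len(features), figsize=(15, 5))

# One pass of the loop draws and labels one histogram
for ax, feature in zip(axs, features):
    ax.hist(demo[feature], bins=30, ec="black")
    ax.set_title(f"{feature} Distribution")
    ax.set_xlabel(f"{feature} Levels")
    ax.set_ylabel("Values Count")
    ax.grid(True)

plt.show()
```

With a two-dimensional grid such as the 2 x 3 layout above, `axs.flat` can be used in the `zip` call to walk through all axes in order.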

We can also analyze how some of these features relate to the number of streams and views:

In [None]:
fig, axs=plt.subplots(3,2,figsize=(15,20))

axs[0,0].scatter(df["Duration_ms"]/60000, df["Stream"], color = "#F56B91", ec="white", s=50)
axs[0,0].set_xlabel("Duration in minutes")
axs[0,0].set_ylabel("Number of Streams")
axs[0,0].set_title("Relation between Duration and Streams on Spotify")
axs[0,0].grid(True)

axs[0,1].scatter(df["Duration_ms"]/60000, df["Views"], color = "#F2A688", ec="white", s=50)
axs[0,1].set_xlabel("Duration in minutes")
axs[0,1].set_ylabel("Number of Views")
axs[0,1].set_title("Relation between Duration and Views on YouTube")
axs[0,1].grid(True)

axs[1,0].scatter(df["Loudness"], df["Stream"], color = "#828CE8", ec="white", s=50)
axs[1,0].set_xlabel("Loudness in dB")
axs[1,0].set_ylabel("Number of Streams")
axs[1,0].set_title("Relation between Loudness and Streams on Spotify")
axs[1,0].grid(True)

axs[1,1].scatter(df["Loudness"], df["Views"], color = "#5186EA", ec="white", s=50)
axs[1,1].set_xlabel("Loudness in dB")
axs[1,1].set_ylabel("Number of Views")
axs[1,1].set_title("Relation between Loudness and Views on YouTube")
axs[1,1].grid(True)

axs[2,0].scatter(df["Tempo"], df["Stream"], color = "#5EDFFF", ec="white", s=50)
axs[2,0].set_xlabel("Tempo in bpm")
axs[2,0].set_ylabel("Number of Streams")
axs[2,0].set_title("Relation between Tempo and Streams on Spotify")
axs[2,0].grid(True)

axs[2,1].scatter(df["Tempo"], df["Views"], color = "#60AAFF", ec="white", s=50)
axs[2,1].set_xlabel("Tempo in bpm")
axs[2,1].set_ylabel("Number of Views")
axs[2,1].set_title("Relation between Tempo and Views on YouTube")
axs[2,1].grid(True)



plt.show()
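
Scatter plots show relationships visually. A numeric complement is the correlation coefficient, which pandas computes with `DataFrame.corr()` (Pearson correlation by default, which only captures linear relationships). The following sketch uses a tiny made-up table, not the Spotify data, so the values say nothing about real songs.

```python
import pandas as pd

# Made-up toy values, chosen so that the result is easy to check by eye
toy = pd.DataFrame({
    "Tempo": [90, 100, 110, 120, 130],
    "Stream": [1, 2, 3, 4, 5],   # rises with Tempo
    "Views": [5, 4, 3, 2, 1],    # falls as Tempo rises
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()
print(corr.round(2))
```

A value near 1 means two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.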

This is just a small glimpse into the possibilities that Python provides. There are many more (and also more complex) visualization and analysis tools you can use.

---
<a name="summary"></a>
## [Summary & Exercise](#top)
Now you know how to work with notebooks, so go ahead!