
# Week 3 - Wednesday Lecture

In [14]:
import pandas as pd
import numpy as np
import altair as alt
rng = np.random.default_rng()

In [15]:
# Change the path if necessary
# Doing some small "data cleaning": converting some columns from strings to numbers.
df = pd.read_csv("data/spotify_dataset.csv")
df = df.replace(" ",np.nan)
df["Streams"] = df["Streams"].str.replace(",","")
df.iloc[:,[5,7]] = df.iloc[:,[5,7]].apply(pd.to_numeric,axis=0).copy()
df.iloc[:,12:22] = df.iloc[:,12:22].apply(pd.to_numeric,axis=0).copy()

In [8]:
A = rng.integers(0,100,size=50)
A

array([93, 90, 91, 37, 21, 74, 69, 43,  6, 99, 30, 13,  5, 56, 49, 33, 14,
       84, 18, 77, 57,  5, 51, 19, 18, 70, 97, 79,  5, 41, 40,  9, 51, 97,
       67, 27, 69, 56, 74, 45, 34, 22, 71, 73, 87, 52, 31, 67, 70, 74])

## not and or
In standard Python, these are spelled out:
* `not True`
* `True and False`
* `True or True`

In NumPy and pandas, the elementwise versions are written with symbols:
* `~A`
* `A & B`
* `A | B`
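
As a small sketch of the difference (the arrays `B`, `C`, and `x` below are made-up examples, not part of the lecture data): the spelled-out keywords work on single Boolean values, while `~`, `&`, and `|` act elementwise on Boolean arrays. Because `&` and `|` bind more tightly than comparisons like `>`, each comparison needs its own parentheses.

```python
import numpy as np

# Hypothetical Boolean arrays, for illustration only
B = np.array([True, True, False, False])
C = np.array([True, False, True, False])

print(~B)      # elementwise "not"
print(B & C)   # elementwise "and"
print(B | C)   # elementwise "or"

# Each comparison is wrapped in parentheses before combining with &
x = np.array([10, 30, 60])
mask = (x > 25) & (x < 50)
print(x[mask])

# The keyword `and` does not work elementwise on arrays
try:
    B and C
except ValueError as e:
    print("ValueError:", e)
```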

In [16]:
not True

False

In [17]:
3 != 5

True

In [18]:
not (3 == 3)

False

In [3]:
for a in [True, False]:
    for b in [True, False]:
        print(f"{a} and {b} is {a and b}")

True and True is True
True and False is False
False and True is False
False and False is False


In [11]:
for a in [True, False]:
    for b in [True, False]:
        print(f"{a} or {b} is {a or b}")

True or True is True
True or False is True
False or True is True
False or False is False


* Make a list consisting of integers from A between 25 and 50, using list comprehension.
* Make an array consisting of integers from A between 25 and 50 using NumPy comparisons.
* Get the sub-DataFrame of `df` consisting of songs with popularity > 60 and danceability < 0.4.  Call the result `df_sub`.  (Check: its shape should be 46 by 23.)
* Make a new column in `df_sub` with the name "temp" and filled with all 4s.
* Among those 46 songs, what are the three loudest?  Can you get those three into a list using Python?

In [9]:
[x for x in A if x > 25 and x < 50]

[37, 43, 30, 49, 33, 41, 40, 27, 45, 34, 31]

In [10]:
A[(A > 25) & (A < 50)]

array([37, 43, 30, 49, 33, 41, 40, 27, 45, 34, 31])

In [52]:
df_sub = df.loc[(df["Popularity"] > 60) & (df["Danceability"] < 0.4)]

In [53]:
df_sub.shape

(46, 23)

In [54]:
# You will probably get a SettingWithCopyWarning here.  One fix: add .copy() when creating df_sub.
df_sub["temp"] = 4

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub["temp"] = 4
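
One way to avoid the warning, sketched here on a small made-up DataFrame (the values in `toy` are invented, not taken from the Spotify data): call `.copy()` when creating the sub-DataFrame, so that it is an independent object rather than a possible view of the original.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "Popularity": [70, 55, 90],
    "Danceability": [0.3, 0.2, 0.8],
})

# .copy() makes toy_sub independent of toy
toy_sub = toy.loc[(toy["Popularity"] > 60) & (toy["Danceability"] < 0.4)].copy()
toy_sub["temp"] = 4  # no warning expected here

print(toy_sub)
print("temp" in toy.columns)  # the original DataFrame is unchanged
```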


In [41]:
top3 = df_sub.sort_values("Loudness",ascending=False).iloc[:3]

In [42]:
list(top3["Artist"])

['Bring Me The Horizon', 'The Killers', 'Oasis']
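
As an alternative sketch, the pandas method `nlargest` picks out the top rows directly, without sorting the whole DataFrame first. The DataFrame `toy` below uses invented artist names and loudness values, purely for illustration.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "Artist": ["A", "B", "C", "D"],
    "Loudness": [-7.2, -3.1, -10.5, -4.8],
})

# Approach from the lecture: sort in descending order, then take the first three rows
top3_sorted = toy.sort_values("Loudness", ascending=False).iloc[:3]

# Alternative: nlargest returns the three rows with the largest Loudness
top3_nlargest = toy.nlargest(3, "Loudness")

print(list(top3_sorted["Artist"]))
print(list(top3_nlargest["Artist"]))
```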

## Altair

* Using Altair, make a scatter plot including all the songs with popularity > 60 and danceability < 0.4.  Use energy for the x-axis, popularity for the y-axis, and loudness for the color.  Include a tooltip for song name and artist.
* How many points are in that scatter plot?

In [45]:
alt.Chart(df_sub).mark_circle().encode(
    x = "Energy",
    y = alt.Y("Popularity",scale=alt.Scale(zero=False)),
    color = 'Loudness',
    tooltip = ['Artist','Song Name'],
)

* Draw a plot of Energy vs Acousticness for all Lady Gaga and Billie Eilish songs (leaving `df_sub` and going back to the original `df`), drawing the songs by those two artists in different colors.
* In the plot, one of the songs appears to be an "outlier".  Which song?

In [46]:
df2 = df[(df.Artist == "Billie Eilish") | (df.Artist == "Lady Gaga")]

In [48]:
alt.Chart(df2).mark_circle().encode(
    x = "Energy",
    y = "Acousticness",
    color = 'Artist',
    tooltip = ['Artist','Song Name']
).properties(
    width = 800,
)

## Cleaning data

Earlier we used `df = df.replace(" ",np.nan)` to replace blank spaces with `np.nan` ("not a number").  (What's not a number?)  Do the same thing using `applymap` and a lambda function.

In [12]:
df = pd.read_csv("data/spotify_dataset.csv")

In [13]:
# Note: in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map
df = df.applymap(lambda x: np.nan if x == " " else x)
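
On the question of what "not a number" is: `np.nan` is a special floating-point value used to mark missing data. A short sketch of its basic behavior (the Series `s` is a made-up example):

```python
import numpy as np
import pandas as pd

print(type(np.nan))       # nan is a float
print(np.nan == np.nan)   # False: nan is not equal to anything, even itself
print(pd.isna(np.nan))    # True: use isna to test for missing values

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])
print(s.isna().sum())     # number of missing values
print(s.mean())           # pandas skips nan by default when averaging
```

Because `np.nan == np.nan` is `False`, a test like `x == np.nan` never finds missing values; `pd.isna` is the usual way to check.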

Similarly, rewrite `df["Streams"] = df["Streams"].str.replace(",","")` using `map` and a lambda function.

In [15]:
df["Streams"] = df["Streams"].map(lambda x: x.replace(",",""))

## try and except

In [24]:
3/0

ZeroDivisionError: division by zero

In [21]:
def try_to_divide(x):
    try:
        return 3/x
    except:
        return "Can't"

In [22]:
try_to_divide(5)

0.6

In [23]:
try_to_divide(0)

"Can't"

In [25]:
def try_to_divide(x):
    try:
        return 3/x
    except ZeroDivisionError:
        return "Can't divide by zero"
    except:
        return "Can't for some other reason"

In [26]:
try_to_divide(np.nan)

nan
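
Dividing by `nan` is not an error in Python; it simply returns `nan`, so neither `except` branch runs in the cell above. A sketch of inputs that do reach each branch (the function is repeated so the block runs on its own, and the string input `"a"` is a made-up example):

```python
import numpy as np

def try_to_divide(x):
    try:
        return 3/x
    except ZeroDivisionError:
        return "Can't divide by zero"
    except:
        return "Can't for some other reason"

print(try_to_divide(0))       # 3/0 raises ZeroDivisionError, so the first branch runs
print(try_to_divide("a"))     # 3/"a" raises TypeError, so the second branch runs
print(try_to_divide(np.nan))  # no exception is raised; the result is nan
```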

In [16]:
def try_to_make_numeric(x):
    try:
        return pd.to_numeric(x)
    except:
        return x

In [17]:
df.dtypes

Index                         int64
Highest Charting Position     int64
Number of Times Charted       int64
Week of Highest Charting     object
Song Name                    object
Streams                      object
Artist                       object
Artist Followers             object
Song ID                      object
Genre                        object
Release Date                 object
Weeks Charted                object
Popularity                   object
Danceability                 object
Energy                       object
Loudness                     object
Speechiness                  object
Acousticness                 object
Liveness                     object
Tempo                        object
Duration (ms)                object
Valence                      object
Chord                        object
dtype: object

In [19]:
df = df.apply(try_to_make_numeric,axis=0)

In [20]:
df.dtypes

Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                        int64
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
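
To see what `apply` with `try_to_make_numeric` does column by column, here is a minimal sketch on a made-up DataFrame (the values in `toy` are invented). Columns whose entries all look like numbers get converted; for a column like "Artist", `pd.to_numeric` raises an error, so the `except` branch returns the column unchanged.

```python
import numpy as np
import pandas as pd

def try_to_make_numeric(x):
    try:
        return pd.to_numeric(x)
    except:
        return x

# Toy data, for illustration only: every column starts out holding strings
toy = pd.DataFrame({
    "Streams": ["100", "2500", "37"],
    "Energy": ["0.5", np.nan, "0.9"],
    "Artist": ["A", "B", "C"],
})

# axis=0 applies the function to each column in turn
converted = toy.apply(try_to_make_numeric, axis=0)
print(converted.dtypes)
```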