<a href="https://colab.research.google.com/github/MichelleHon/WiDS-EmpowHer-Mentorship-Program/blob/main/Data_Science_Mentorship_Workshop_1_Full_Version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 1: Introduction to Data Science

### Learning Goals
*   Define data science and identify key applications of data science.
*   Demonstrate **dataset scraping**.
*   Develop practical skills in applying **data cleaning and wrangling** techniques.
*   Design appropriate **visualizations** to answer the question at hand.

#### Applications of Data Science
Some examples include:
*   **Healthcare**: Analyzing patient data to improve diagnostics and treatment plans
*   **Finance**: Detecting fraud, assessing risk, and optimizing investment strategies
*   **Marketing**: Personalizing campaigns by analyzing customer behavior and preferences

#### Key Components of Data Science
1. **Data Collection**: Gathering data from sources such as databases, APIs, and web scraping
2. **Data Cleaning and Wrangling**: Preparing data for analysis by handling missing values, duplicates, and inconsistencies
3. **Data Analysis**: Using statistical methods and algorithms to find patterns and trends in the data
4. **Data Visualization**: Creating charts and graphs to represent data insights clearly and effectively
5. **Machine Learning**: Applying algorithms to build predictive models that automate decision-making

#### Types of Plots
*  **Scatterplots**: Visualize the relationship between 2 quantitative variables
*  **Barplots**: Visualize comparisons of amounts
*  **Histograms**: Visualize the distribution of 1 quantitative variable

##### Characteristics of Scatterplots
*  **Direction**: A relationship is positive when the y variable tends to increase as the x variable increases, and negative when the y variable tends to decrease as the x variable increases.
*  **Strength**: A relationship is strong when the y variable reliably increases, decreases, or stays flat as the x variable increases, and weak otherwise.
*  **Shape**: Either linear or non-linear
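
As a rough numeric companion to these characteristics, the sign of the Pearson correlation coefficient reflects the direction of a relationship and its magnitude reflects the strength of a linear one. The sketch below uses made-up numbers; the `hours`, `score`, and `errors` columns are hypothetical and not part of the workshop datasets.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 73],  # tends to rise as hours rises
    "errors": [9, 8, 8, 6, 5, 3],       # tends to fall as hours rises
})

# Series.corr computes the Pearson correlation by default
r_pos = toy["hours"].corr(toy["score"])
r_neg = toy["hours"].corr(toy["errors"])

print(f"hours vs score:  r = {r_pos:.2f}")
print(f"hours vs errors: r = {r_neg:.2f}")
```

A correlation coefficient only summarizes linear relationships, so it complements the scatterplot rather than replacing it.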

##### Characteristics of Histograms
*  **Shape**: Symmetric or skewed. If the tail is on the left, the data points are more concentrated on the right and the histogram is left-skewed (negatively skewed). If the tail is on the right, the histogram is right-skewed (positively skewed).
*  **Center**: Where do most data points fall (the peak)? Where are the mean and median? How many peaks does the histogram have (unimodal, bimodal, or multimodal)?
*  **Spread**: What is the range of the data points? Are there any data points that are far away from the rest of the data? (outliers)
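
These characteristics can also be checked numerically. The following is a minimal sketch on made-up values (not from the workshop datasets) with a long right tail. In a right-skewed sample like this one, the mean is pulled above the median and the skewness statistic is positive.

```python
import pandas as pd

# Made-up values: most are small, a few are much larger (a long right tail)
values = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 20, 35])

print("mean:  ", values.mean())
print("median:", values.median())
print("skew:  ", values.skew())                # positive for right-skewed data
print("range: ", values.max() - values.min())  # spread, stretched by the outliers
```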

**Rule of Thumb: No pie charts, 3-D visualizations, or tables**

Other important notes:
*  Use simple, colorblind-friendly color palettes. To learn more about colorblind-friendly color palettes, see https://www.color-blindness.com/coblis-color-blindness-simulator/
*  Include labels and legends
*  Avoid overplotting

The visualization should **convey its message and minimize noise**

#### Tidy Data
*  Each row corresponds to a single observation
*  Each column corresponds to a single variable
*  Each cell corresponds to a single value
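
As an illustration of these rules, here is a small sketch that reshapes an untidy table into a tidy one with `DataFrame.melt`. The table and its values are made up; the column names `artist`, `year`, and `songs` are hypothetical.

```python
import pandas as pd

# Made-up "wide" table: the year is stored in the column names,
# so each row holds several observations
wide = pd.DataFrame({
    "artist": ["A", "B"],
    "2022": [10, 7],
    "2023": [12, 9],
})

# Reshape so that each row is one (artist, year) observation
tidy = wide.melt(id_vars="artist", var_name="year", value_name="songs")
print(tidy)
```

After reshaping, `year` is a variable in its own column and each row describes exactly one observation.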


## Top 50 Songs on Spotify in 2023
 Let's explore the dataset together

 1. Data Collection

The dataset is provided in a file called "spotify_top_50_2023.csv". It was obtained from the Spotify Web API.

 Let's load the data and check it out.

In [None]:
# Run this cell to import package
import pandas as pd

In [None]:
# spotify_raw = ...("spotify_top_50_2023.csv")
# spotify_raw.head()

# Solution below
spotify_raw = pd.read_csv("spotify_top_50_2023.csv")
spotify_raw.head()

Unnamed: 0,artist_name,track_name,album_release_date,genres,danceability,energy,loudness,key,tempo,duration_ms,time_signature,popularity
0,Miley Cyrus,Flowers,2023-08-18,['pop'],0.706,0.691,-4.775,0,118.048,200600,4,94
1,SZA,Kill Bill,2022-12-08,"['pop', 'r&b', 'rap']",0.644,0.735,-5.747,8,88.98,153947,4,86
2,Harry Styles,As It Was,2022-05-20,['pop'],0.52,0.731,-5.338,6,173.93,167303,4,95
3,Jung Kook,Seven (feat. Latto) (Explicit Ver.),2023-11-03,['k-pop'],0.79,0.831,-4.185,11,124.987,183551,4,90
4,Eslabon Armado,Ella Baila Sola,2023-04-28,"['corrido', 'corridos tumbados', 'sad sierreno...",0.668,0.758,-5.176,5,147.989,165671,3,86


This dataset is mostly tidy. Each row corresponds to a single observation and each column corresponds to a single variable. However, some songs have multiple genres, so some cells in the genres column hold more than one value.
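
One possible way to tidy such a column is to give every genre its own row. The sketch below uses three songs from the dataset and assumes the genres are stored as comma-separated strings. That format is an assumption; if the CSV stores list-like strings instead, they would need to be parsed first. The names `songs` and `songs_long` are hypothetical.

```python
import pandas as pd

# Three songs from the dataset; the comma-separated genres format
# is an assumption about how the values are stored
songs = pd.DataFrame({
    "track_name": ["Flowers", "Kill Bill", "Seven (feat. Latto) (Explicit Ver.)"],
    "genres": ["pop", "pop, r&b, rap", "k-pop"],
})

# Split each string into a list, then give every genre its own row
songs_long = (
    songs.assign(genre=songs["genres"].str.split(", "))
         .explode("genre")
         .drop(columns="genres")
         .reset_index(drop=True)
)
print(songs_long)
```

In the long form each cell holds a single value, at the cost of repeating the track name once per genre.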

2. Data Cleaning and Wrangling

Let's compare the number of pop songs versus non-pop songs in the dataset. We will use a barplot since we are trying to visualize comparisons of amounts.

In [None]:
# Run this cell
# True only when the genres value is exactly "pop"; songs with several genres count as non-pop
spotify_pop = spotify_raw["genres"] == "pop"
spotify_pop.head()

Unnamed: 0,genres
0,True
1,False
2,True
3,False
4,False


In [None]:
# Run this cell
spotify_raw["pop"] = spotify_pop
spotify_raw.head()

Unnamed: 0,artist_name,track_name,album_release_date,genres,danceability,energy,loudness,key,tempo,duration_ms,time_signature,popularity,pop
0,Miley Cyrus,Flowers,2023-08-18,pop,0.706,0.691,-4.775,0,118.048,200600,4,94,True
1,SZA,Kill Bill,2022-12-08,"pop, r&b, rap",0.644,0.735,-5.747,8,88.98,153947,4,86,False
2,Harry Styles,As It Was,2022-05-20,pop,0.52,0.731,-5.338,6,173.93,167303,4,95,True
3,Jung Kook,Seven (feat. Latto) (Explicit Ver.),2023-11-03,k-pop,0.79,0.831,-4.185,11,124.987,183551,4,90,False
4,Eslabon Armado,Ella Baila Sola,2023-04-28,"corrido, corridos tumbados, sad sierreno, sier...",0.668,0.758,-5.176,5,147.989,165671,3,86,False
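
Before plotting, the counts that a barplot would display can be checked directly. This is a minimal sketch using only the `pop` values of the five rows displayed above; on the full dataset the same calls would apply to `spotify_raw["pop"]`. The names `pop_flags`, `n_pop`, and `n_non_pop` are hypothetical.

```python
import pandas as pd

# The "pop" column for the five rows displayed above
pop_flags = pd.Series([True, False, True, False, False], name="pop")

# value_counts tallies how often each value occurs
counts = pop_flags.value_counts()
print(counts)

# The same tallies as plain integers
n_pop = int(pop_flags.sum())
n_non_pop = int((~pop_flags).sum())
print(n_pop, n_non_pop)
```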


3. Data Visualization

In [None]:
# Run this cell to import packages
import altair as alt
import matplotlib.pyplot as plt

In [None]:
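# Note: rcParams sets the default size of Matplotlib figures only; Altair charts are sized with .properties(width=..., height=...)
plt.rcParams['figure.figsize'] = (10, 8)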

# spotify_barplot = alt.Chart(...,
#                             title = "Barplot of Pop vs Non-Pop Songs"
#                             )...(
#                                 x = alt.X("pop").title("Pop"),
#                                 y = alt.Y("count()").title("Count"),
#                                 )
# ...

# Solution below
spotify_barplot = alt.Chart(spotify_raw,
                            title = "Barplot of Pop vs Non-Pop Songs"
                            ).mark_bar().encode(
                                x = alt.X("pop").title("Pop"),
                                y = alt.Y("count()").title("Count"),
                                )
spotify_barplot

Awesome! Let's try answering this question: Is there a relationship between how loud a song is in decibels and its popularity? Since the question is about the relationship between 2 quantitative variables, we should use a scatterplot to visualize it.

In [None]:
plt.rcParams['figure.figsize'] = (10, 8)

pop_min, pop_max = spotify_raw["popularity"].min(), spotify_raw["popularity"].max()
loud_min, loud_max = spotify_raw["loudness"].min(), spotify_raw["loudness"].max()

pop_range = pop_max - pop_min
loud_range = loud_max - loud_min

pop_padding = pop_range * 0.1
loud_padding = loud_range * 0.1

# spotify_scatterplot = alt.Chart(...,
#                             title = "Scatterplot of Loudness vs Popularity"
#                             )...(
#                                 x = alt.X("popularity").title("Popularity Score")
#                                 .scale(domain=[pop_min - pop_padding, pop_max + pop_padding]),
#                                 y = alt.Y("loudness").title("Loudness (dB)")
#                                 .scale(domain=[loud_min - loud_padding, loud_max + loud_padding]),
#                                 )
# ...

# Solution below
spotify_scatterplot = alt.Chart(spotify_raw,
                            title = "Scatterplot of Loudness vs Popularity"
                            ).mark_circle().encode(
                                x = alt.X("popularity").title("Popularity Score")
                                .scale(domain=[pop_min - pop_padding, pop_max + pop_padding]),
                                y = alt.Y("loudness").title("Loudness (dB)")
                                .scale(domain=[loud_min - loud_padding, loud_max + loud_padding]),
                                )
spotify_scatterplot

There seems to be a relatively weak positive relationship between loudness and popularity score.

In [None]:
# Run this cell to compare: the same scatterplot with Altair's default axis scales
plt.rcParams['figure.figsize'] = (10, 8)

spotify_scatterplot_2 = alt.Chart(spotify_raw,
                            title = "Scatterplot of Loudness vs Popularity"
                            ).mark_circle().encode(
                                x = alt.X("popularity").title("Popularity Score"),
                                y = alt.Y("loudness").title("Loudness (dB)"),
                                )
spotify_scatterplot_2

## Your Turn!

Explore the dataset "insta.csv" and answer the question: What is the distribution of the engagement rate among the top 200 most-followed accounts on Instagram?

1. Data Collection

Load the data and check it out.

In [None]:
# insta_raw = ...("...")
# insta_raw

# Solution below
insta_raw = pd.read_csv("insta.csv")
insta_raw

Unnamed: 0,rank,name,channel_Info,Category,Posts,Followers,Avg. Likes,Eng Rate
0,1,instagram,brand,photography,7.3K,580.1M,7.31K,0.1%
1,2,cristiano,male,"Health, Sports & Fitness",3.4K,519.9M,3.41K,1.4%
2,3,leomessi,male,"Health, Sports & Fitness",1K,403.7M,0.97K,1.7%
3,4,kyliejenner,female,entertainment,7K,375.9M,7.02K,1.7%
4,5,selenagomez,female,entertainment,1.8K,365.3M,1.85K,1.1%
...,...,...,...,...,...,...,...,...
195,196,fcbayern,male,"Health, Sports & Fitness",16.8K,35.4M,16.78K,0.6%
196,197,colesprouse,male,entertainment,1.1K,35.3M,1.14K,3.5%
197,198,shaymitchell,male,entertainment,6.3K,35.1M,6.31K,1.2%
198,199,ivetesangalo,female,entertainment,7.8K,35M,7.77K,0.4%


2. Data Cleaning and Wrangling

Is the dataset tidy? What information do we need to answer the question? Does it need wrangling?

In [None]:
# insta_raw['Eng Rate'] = insta_raw['...'].replace('%', '', regex=True).astype(float) / 100
# insta_raw = insta_raw.sort_values('...')
# ...

# Solution below
insta_raw['Eng Rate'] = insta_raw['Eng Rate'].replace('%', '', regex=True).astype(float) / 100
insta_raw = insta_raw.sort_values('Eng Rate')
insta_raw

Unnamed: 0,rank,name,channel_Info,Category,Posts,Followers,Avg. Likes,Eng Rate
0,1,instagram,brand,photography,7.3K,580.1M,7.31K,0.001
50,51,victoriassecret,brand,fashion,3.1K,73.8M,3.14K,0.001
92,93,zara,brand,fashion,3.9K,55.2M,3.94K,0.001
100,101,ayutingting92,female,entertainment,11.7K,53.5M,11.65K,0.001
101,102,chanelofficial,brand,Beauty & Makeup,5.1K,53.5M,5.11K,0.001
...,...,...,...,...,...,...,...,...
147,148,uarmyhope,,fashion,171,42.2M,171,0.185
150,151,jin,male,,100,42.1M,100,0.230
99,100,thv,male,entertainment,76,53.8M,76,0.239
158,159,agustd,,,83,41.2M,83,0.253
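
The `Posts`, `Followers`, and `Avg. Likes` columns are also stored as text with `K` and `M` suffixes, so they would need similar wrangling before any numeric analysis. Below is a minimal sketch of one way to convert them. `parse_abbreviated` is a hypothetical helper, and it assumes that `K` (thousand) and `M` (million) are the only suffixes and that there are no missing values.

```python
def parse_abbreviated(text):
    """Convert strings such as '7.3K' or '580.1M' to a float.

    Hypothetical helper: assumes only 'K' (thousand) and 'M' (million)
    suffixes appear, and that the input is a non-empty string.
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)  # plain numbers such as '171'


for raw in ["7.3K", "580.1M", "171"]:
    print(raw, "->", parse_abbreviated(raw))
```

Applied to a column, this could look like `insta_raw["Followers"].map(parse_abbreviated)`.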


3. Data Visualization

In [None]:
plt.rcParams['figure.figsize'] = (10, 8)
# insta_histogram = alt.Chart(...,
#                             title = "Histogram of Engagement Rate"
#                             )...(
#                                 x = alt.X("...").title("Engagement Rate (proportion)").bin(),
#                                 y = alt.Y("...").title("Count"),
#                                 )
# insta_histogram

# Solution below
insta_histogram = alt.Chart(insta_raw,
                            title = "Histogram of Engagement Rate"
                            ).mark_bar().encode(
                                x = alt.X("Eng Rate").title("Engagement Rate (proportion)").bin(),
                                y = alt.Y("count()").title("Count"),
                                )
insta_histogram

What can you conclude based on the histogram?

**Answer: The engagement rate distribution is right-skewed: most accounts have low engagement rates, while a few accounts have much higher ones.**

*Challenge (optional):*

Draw a dashed line on the histogram to indicate the median engagement rate. First find the median engagement rate, then draw the line on the histogram.

In [None]:
# median = ...["..."].median()
# median

# Solution below
median = insta_raw["Eng Rate"].median()
median

0.0125

In [None]:
plt.rcParams['figure.figsize'] = (10, 8)
# v_line = alt.Chart(...).mark_rule(strokeDash = [6], size = 1.5).encode(
#     x = alt.datum(...)
# )

# insta_histogram_line = ... + ...
# ...

# Solution below
v_line = alt.Chart(insta_raw).mark_rule(strokeDash = [6], size = 1.5).encode(
    x = alt.datum(median)
)

insta_histogram_line = insta_histogram + v_line
insta_histogram_line

#### References and Datasets
https://github.com/katieburak/girls-in-DS

https://python.datasciencebook.ca/