## This document gathers all the current R code
Caution: this code is provided for documentation purposes only. It requires too much RAM to be executed as is.

In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
rawGeniusDataset <- read_csv("../input/genius-song-lyrics-with-language-information/song_lyrics.csv")

summary(rawGeniusDataset)

### Clean up (round 1)

The goal of this block is to remove from the dataset the rows where the language of the song is unclear (because of a mix of languages or for other reasons).

In [None]:
# Keep rows with a known language and a release year no later than 2023
rawGeniusDataset <- subset(rawGeniusDataset, !is.na(language) & (year <= 2023))
summary(rawGeniusDataset)
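
The clean-up step above can be illustrated on a toy data frame. This is only a minimal sketch: the `toy` data frame and its values are made up for illustration, and only the column names `language` and `year` come from the dataset.

```r
# Toy data (hypothetical values) mimicking the `language` and `year` columns
toy <- data.frame(
  language = c("en", NA, "fr", "es"),
  year     = c(1999, 2010, 2030, 2015),
  stringsAsFactors = FALSE
)

# Comparing a missing value with == or != yields NA, never TRUE or FALSE
na_comparison <- NA_character_ != "en"

# is.na() tests for missing values explicitly
toy_clean <- subset(toy, !is.na(language) & year <= 2023)
print(toy_clean)
```

`subset()` drops the rows where the condition is `FALSE` or `NA`, so both the row with a missing language and the row with an out-of-range year are removed.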

### Identification of the different type of data

In [None]:
# Get the distinct genres

genre <- unique(rawGeniusDataset[,"tag"])
genre

We'll leave out misc and focus on:
- Rap
- Rb (R&B)
- Rock
- Pop (quite broad, includes many subgenres)
- Country

Misc also includes texts from books and poetry.

#### Identification of the languages



In [None]:
# Filter out misc, then look at the distinct languages
rawGeniusDataset <- subset(rawGeniusDataset, tag!="misc")
lang <- unique(rawGeniusDataset[,"language"])
lang

### Dylan's work

We address the following question:

#### What proportion of songs include featured artists?

In [None]:
# Look at the raw values of the features column before any filtering
truc <- rawGeniusDataset$features
head(truc)

Note that there are empty elements, "" and "{}". They correspond to songs that have no featured artist.

In [None]:
# Filter "features" to drop the songs that have no featured artist
df_features <- subset(rawGeniusDataset, features != "" & features != "{}")$features
# head(df_features,n=100)
nb_features <- length(df_features)
nb_tot <- length(rawGeniusDataset$features)
ratio <- nb_features / nb_tot
cat(ratio*100, "% of the songs in the dataset have featured artists.\n")

In [None]:
# Alternative cleaning method:


library(tidyverse)
library(jsonlite)

# "Clean" features by dropping "", "{}" and missing entries, then parse what is left.
# fromJSON() only accepts valid JSON, so this assumes every remaining entry is valid JSON.
truc <- truc[!is.na(truc) & truc != "" & truc != "{}"]
x <- vector()
for (elt in truc) {
  x <- c(x, fromJSON(elt))
}

head(data.frame(x), n=100)

In [None]:
library(jsonlite)

truc <- vector("list", nrow(rawGeniusDataset)) # Preallocate a list with one slot per row

for (i in seq_len(nrow(rawGeniusDataset))) {
  feat <- rawGeniusDataset$features[i]
  # Skip missing and empty entries; fromJSON() needs a single valid JSON string
  if (!is.na(feat) && feat != "" && feat != "{}") {
    truc[[i]] <- fromJSON(feat)
  }
}

truc <- data.frame(features = unlist(truc)) # Flatten the list into a one-column data frame
head(truc,n=100)
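
`fromJSON()` only parses valid JSON. If the `features` strings turn out to be brace-delimited lists such as `{"A","B"}` (an assumption, not verified here), a regular-expression sketch like the following could extract the names instead. `parse_features` is a hypothetical helper and the sample strings are made up.

```r
# Hypothetical helper: extract the double-quoted names from one features string.
# Assumes names are wrapped in double quotes and contain no escaped double quotes.
parse_features <- function(s) {
  if (is.na(s) || s == "" || s == "{}") {
    return(character(0))
  }
  quoted <- regmatches(s, gregexpr('"[^"]*"', s))[[1]]
  gsub('"', "", quoted, fixed = TRUE)
}

# Made-up examples covering the empty cases mentioned above
samples <- c('{"Artist A","Artist B"}', "{}", "", '{"Artist C"}')
parsed <- lapply(samples, parse_features)

# Number of featured artists per song
n_featured <- lengths(parsed)
print(n_featured)
```

Applied to the real column, the same helper could be used with `lapply(rawGeniusDataset$features, parse_features)`, at the cost of holding one list element per song in memory.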

#### Do these collaborations have an impact on the number of page views?

In [None]:
# Data frame of songs with featured artists

# Select the rows that have at least one featured artist
result1 <- subset(rawGeniusDataset, features != "" & features != "{}")
# Keep only the features and views columns
df_avec_features <- result1[, c("features", "views")]


# Data frame of songs without featured artists

# Take the features and views columns
result2 <- rawGeniusDataset[, c("features", "views")]
# Compute "rawGeniusDataset - df_avec_features" to get the rest, i.e. the songs without featured artists
df_sans_features <- anti_join(result2, df_avec_features, by = c("features", "views"))



summary(df_avec_features)
summary(df_sans_features)
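
One possible way to compare the two groups side by side is to flag each song and summarise the views per group. This is a minimal sketch on made-up data: the `toy` tibble and the `has_feature` column are illustrative and not part of the dataset. The median is reported next to the mean because a few very popular songs can pull the mean upwards.

```r
library(dplyr)

# Made-up data with the same column names as the dataset
toy <- tibble(
  features = c("{}", '{"Artist A"}', "{}", '{"Artist B","Artist C"}', ""),
  views    = c(100, 900, 300, 500, 200)
)

views_by_group <- toy %>%
  # has_feature is a hypothetical helper column
  mutate(has_feature = features != "" & features != "{}") %>%
  group_by(has_feature) %>%
  summarise(
    n_songs      = n(),
    median_views = median(views),
    mean_views   = mean(views)
  )

print(views_by_group)
```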


### Sara's work

In [None]:
# Total views per artist, keeping the 6 most viewed artists
artist_views <- rawGeniusDataset %>%
  group_by(artist) %>% # Group by artist
  summarize(total_views_artist = sum(views)) %>%
  arrange(desc(total_views_artist)) %>%
  head(6)

barplot(artist_views$total_views_artist, legend.text=artist_views$artist, col=c("#F5E8C4", "#B7D2F3", "#E5A8D4", "#C7B3F2", "#A1F4E8", "#D9C7A5"))

#### Top #50 and Top #10 artists according to number of views

In [None]:
library(dplyr)

# Cleaning the data
cleanGeniusDataset  <- rawGeniusDataset[!grepl("Genius", rawGeniusDataset$artist),]

top_artists <- cleanGeniusDataset %>%
  group_by(artist) %>%
  summarise(total_views = sum(views)) %>%
  arrange(desc(total_views))

top_artists_50 <- top_artists %>% head(50)

barplot(top_artists_50$total_views, legend=top_artists_50$artist, col="#B7D2F3")

library(ggplot2)
library(ggpubr)

top_artists_10 <- top_artists %>% head(10)

# Share of views (in %) of each artist within the top 10
category <- as.vector(top_artists_10$artist)
values <- as.vector(top_artists_10$total_views)
total <- sum(values)
values <- round(values/total*100)
typeof(values)
print(values)
print(category)

# Bar plot with custom color palette and labels
ggplot(top_artists_10, aes(x=category, y=values, fill=category)) +
  geom_bar(stat="identity", color="black") +
  labs(title = "Top #10 most popular artists", x = "Artist", y = "Share of lyrics page views (% of the top 10 total, all songs combined)") +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(lineheight=3, face="bold", color="black",size=24)) +
  theme(legend.text=element_text(size=14),
        legend.title=element_text(size=14)) +
  scale_fill_manual(name = "Artists", 
                    labels = category, 
                    values = c("#F5E8C4", "#B7D2F3", "#E5A8D4", "#C7B3F2", "#A1F4E8", "#FBB4AE", "#B3CDE3", "#CCEBC5", "#DECBE4", "#FED9A6")) +
  theme(legend.position = "none")


#### Most prevalent genre for each language

In [None]:
# Cleaning the data
cleanGeniusDataset  <- rawGeniusDataset[!grepl("Genius", rawGeniusDataset$artist),]

# group the data by language and genre and count the frequency of each genre
genre_freq <- cleanGeniusDataset %>%
  group_by(language, tag) %>%
  summarize(freq = n()) %>%
  ungroup()

# for each language, find the genre with the highest frequency
top_genre <- genre_freq %>%
  group_by(language) %>%
  slice_max(freq) %>%
  ungroup() %>%
  select(language, tag)

# Top #5 genres worldwide

#print(top_genre)

top_genre_world <- rawGeniusDataset %>%
  group_by(tag) %>%
  summarise(best_genre = n()) %>%
  arrange(desc(best_genre)) %>%
  head(5)

library(ggplot2)
library(ggpubr)

# Share of songs (in %) of each genre within the top 5
category <- as.vector(top_genre_world$tag)
values <- as.vector(top_genre_world$best_genre)
total <- sum(values)
values <- round(values/total*100)
typeof(values)
print(values)
print(category)

# Bar plot with custom color palette and labels
ggplot(top_genre_world, aes(x=category, y=values, fill=category)) +
  geom_bar(stat="identity", color="black") +
  scale_fill_manual(values=c("#F5E8C4", "#B7D2F3", "#E5A8D4", "#C7B3F2", "#A1F4E8")) +
  labs(title = "Genre popularity worldwide", x = "Genre", y = "Share of songs (%)") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

# Pie chart with custom color palette and labels
ggplot(top_genre_world, aes(x="", y=values, fill=category)) +
  geom_bar(stat="identity", color="white", width=1) +
  coord_polar("y", start=0) +
  scale_fill_manual(values=c("#F5E8C4", "#B7D2F3", "#E5A8D4", "#C7B3F2", "#A1F4E8")) +
  labs(title = "Genre popularity worldwide") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))

#print(top_genre_world)

#barplot(top_genre_world$best_genre, legend=top_genre_world$tag, col="#B7D2F3")

#mycols <- c("#F5E8C4", "#B7D2F3", "#E5A8D4", "#C7B3F2", "#A1F4E8")

#pie(top_genre_world$best_genre)
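
The "most frequent genre for each language" step at the top of the cell above can be checked on a small made-up table. This is a minimal sketch: the `toy` tibble is illustrative, and `count()` is used here as a shorthand for `group_by()` followed by `summarize(freq = n())`. Note that `slice_max()` keeps ties by default (`with_ties = TRUE`), so a language can return more than one genre.

```r
library(dplyr)

# Made-up songs: one row per song, with its language and genre
toy <- tibble(
  language = c("en", "en", "en", "fr", "fr", "fr", "es", "es"),
  tag      = c("rap", "rap", "pop", "pop", "pop", "rock", "rock", "pop")
)

# Frequency of each genre within each language
genre_freq <- toy %>%
  count(language, tag, name = "freq")

# Most frequent genre(s) per language; ties are kept
top_genre <- genre_freq %>%
  group_by(language) %>%
  slice_max(freq, n = 1) %>%
  ungroup()

print(top_genre)
```

In this toy table, "es" has one pop song and one rock song, so both rows are returned for that language. Passing `with_ties = FALSE` would keep a single row per language instead.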

#### Top #10 most prolific artists

In [None]:
# Load the dplyr package
#library(dplyr)

# Cleaning the data
cleanGeniusDataset  <- rawGeniusDataset[!grepl("Genius", rawGeniusDataset$artist),]

# Group the data by artist and count the number of songs for each artist
songs_by_artist <- cleanGeniusDataset %>%
  group_by(artist) %>%
  summarise(num_songs = n()) %>%
  ungroup()

# Sort the artists by the number of songs and select the top 10
top_artists <- songs_by_artist %>%
  arrange(desc(num_songs)) %>%
  head(10)

# Share of songs (in %) of each artist within the top 10
category <- as.vector(top_artists$artist)
values <- as.vector(top_artists$num_songs)
total <- sum(values)
values <- round(values/total*100)
typeof(values)
print(values)
print(category)

# Bar plot with custom color palette and labels
ggplot(top_artists, aes(x=category, y=values, fill=category)) +
  geom_bar(stat="identity", color="black") +
  scale_fill_manual(values=c("#F5E8C4", "#B7D2F3", "#E5A8D4", "#C7B3F2", "#A1F4E8", "#FBB4AE", "#B3CDE3", "#CCEBC5", "#DECBE4", "#FED9A6")) +
  labs(title = "Top #10 most prolific artists", x = "Artist", y = "Number of songs (% of the top 10 total)") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

# Pie chart with custom color palette and labels
ggplot(top_artists, aes(x="", y=values, fill=category)) +
  geom_bar(stat="identity", color="white", width=1) +
  coord_polar("y", start=0) +
  scale_fill_manual(values=c("#F5E8C4", "#B7D2F3", "#E5A8D4", "#C7B3F2", "#A1F4E8", "#FBB4AE", "#B3CDE3", "#CCEBC5", "#DECBE4", "#FED9A6")) +
  labs(title = "Top #10 most prolific artists") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))

### Antoine's work

In [None]:
library(tidyverse)
#library(plyr)

# We count which languages are the most popular

lang_freq <- count(rawGeniusDataset, language) 
head(lang_freq)
lang_freq[order(-lang_freq$n), ]

We now do the same thing over time.

In [None]:
library(tidyverse)

In [None]:


# Languages over time (from 1800 to 2022)

langAllowed <- c("en", "es", "fr", "pt", "ru", "de", "it")

langFreqTime <- rawGeniusDataset %>%
  filter((year >= 1800) & (year <= 2022)) %>% # Drop out-of-range years
  filter(language %in% langAllowed) %>%
  group_by(year, language) %>% # Group by year and language
  count() # Count the number of songs per group


head(langFreqTime)

In [None]:
library(ggplot2)

In [None]:
options(repr.plot.width = 15, repr.plot.height = 10)

ggplot(langFreqTime %>% filter(year >= 1900), aes(x = year, y = log10(n), color = language)) +
  geom_line(linewidth = 1) +
  labs(x = "Year of release", y = "Number of songs released (log10)") +
  theme(aspect.ratio = 0.5, text = element_text(size = 24, color = "black", face = "bold"), panel.background = element_blank())
ggsave("langNumATime.png")

We can see a big spike after 2000.

In [None]:
# Zoom in on this spike

ggplot(langFreqTime %>% filter(year >= 2005), aes(x = year, y = log(n), color = language)) +
  geom_line(linewidth = 1) +
  labs(x = "Year of release", y = "Number of songs released (natural log)") +
  theme(aspect.ratio = 0.5)


We immediately notice that 2015 was a strong year for song lyrics on Genius. We do not yet have an explanation for this spike.

We can therefore conclude that English songs tend to be more popular (English has grown noticeably faster since the 1920s).
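
A complementary way to compare languages over time is to look at the share of each language among the songs released in a given year, rather than at raw counts. This is a minimal sketch on made-up counts: the `toy` table is shaped like `langFreqTime` (columns `year`, `language`, `n`), and `share` is a hypothetical column name.

```r
library(dplyr)

# Made-up yearly counts, shaped like langFreqTime (year, language, n)
toy <- tibble(
  year     = c(2000, 2000, 2001, 2001),
  language = c("en", "fr", "en", "fr"),
  n        = c(80, 20, 150, 50)
)

# Share of each language among the songs released the same year
lang_share <- toy %>%
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(lang_share)
```

In this toy example the English count almost doubles between the two years while its share goes down, which is the kind of difference a raw-count plot cannot show.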

#### What are the most popular genres by number of page views?


In [None]:
tag_views <- rawGeniusDataset %>%
  group_by(tag) %>% # Group by genre
  summarize(total_views_genre = sum(views)) 
  
head(tag_views)

ggplot(tag_views, aes(x = tag, y = total_views_genre)) +
  geom_bar(stat = "identity", fill = "#FFFF64", color = "black", linewidth = 2) +
  theme(text = element_text(size = 30, color = "black", face = "bold"), panel.background = element_blank())
ggsave("bar_chart.png")

#### Are there any trends in the popularity of genres as a function of their release date?

In [None]:
# Genres over time (from 1950 to 2022)
library(ggplot2)
library(tidyverse)

genreFreqTime <- rawGeniusDataset %>%
  filter((year >= 1950) & (year <= 2022)) %>% # Drop out-of-range years
  group_by(year, tag) %>% # Group by year and genre
  summarize(total_views_genre = sum(views)) 


head(genreFreqTime)

We plot the result:

In [None]:
options(repr.plot.width = 15, repr.plot.height = 10)

ggplot(genreFreqTime, aes(x = year, y = log10(total_views_genre), color = tag)) +
  geom_line(linewidth = 1) +
  labs(x = "Year of release", y = "Total views by genre (log10)") +
  theme(aspect.ratio = 0.5, text = element_text(size = 25, color = "black", face = "bold"), panel.background = element_blank())
ggsave("genrepoptime.png")

We can see something very interesting here: in recent years, rap has overtaken pop as the most popular genre. We should be cautious, though: rap is very lyrics-oriented, so a lyrics site such as Genius may be biased toward it. There is also quite a lot of noise before 1960, so those data are rather unreliable.
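
Total views mix two effects: how many songs a genre has and how much each song is viewed. One way to tell them apart is to also compute the number of songs and the average views per song. This is a minimal sketch on made-up data: the `toy` tibble and the columns `n_songs` and `views_per_song` are illustrative.

```r
library(dplyr)

# Made-up songs with the same column names as the dataset
toy <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2020),
  tag   = c("rap", "rap", "rap", "pop", "pop"),
  views = c(100, 200, 300, 400, 500)
)

genre_views <- toy %>%
  group_by(year, tag) %>%
  summarise(
    total_views    = sum(views),
    n_songs        = n(),
    views_per_song = mean(views),
    .groups = "drop"
  )

print(genre_views)
```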

#### NLP, preprocessing of the text for further analysis (next step)

In [None]:
# Install the tokenizers package
install.packages("tokenizers")