# Decoding Musical Influence: Analyzing Artist Relationships and Evolution

In this final project, you will tackle a real-world problem by applying the skills and knowledge acquired throughout the InfoSci 102 course. The project is designed to challenge you to analyze, process, and visualize data to derive meaningful insights and propose solutions to a practical issue.

Music has been part of human societies since the beginning of time as an essential component of cultural heritage. As part of an effort to understand the role music has played in the collective human experience, we have been asked to develop a method to quantify musical evolution. There are many factors that can influence artists when they create a new piece of music, including their innate ingenuity, current social or political events, access to new instruments or tools, or other personal experiences. Our goal is to understand and measure the influence of previously produced music on new music and musical artists.

Some artists can list a dozen or more other artists who they say influenced their own musical work. It has also been suggested that influence can be measured by the degree of similarity between song characteristics, such as structure, rhythm, or lyrics. There are sometimes revolutionary shifts in music, offering new sounds or tempos, such as when a new genre emerges, or there is a reinvention of an existing genre (e.g. classical, pop/rock, jazz, etc.). This can be due to a sequence of small changes, a cooperative effort of artists, a series of influential artists, or a shift within society.

You have been asked to develop a model that measures musical influence. This problem asks you to analyze the influence and similarity between artists.


## Project Overview

To do this, you have been given several data sets:

“influence_data” represents musical influencers and followers, as reported by the artists themselves, as well as the opinions of industry experts. These data contain influencers and followers for 5,854 artists over the last 90 years.

“data_by_artist” provides 16 variable entries, including musical features such as danceability, tempo, loudness, and key, along with artist_name and artist_id, for each of 98,340 songs. The mean values of these entries are provided, grouped by artist.


## Tasks List

**Part 0: Comments and Readability (10 pts)**


In programming, it is essential to ensure that your code is self-explanatory. This means that as a programmer, you should include comments within your code to provide explanations and context about the functionality, logic, or any complex sections. Comments help not only others who might read your code but also yourself, especially when you revisit the code after some time.


In this context, you are required to include comments in your code. These comments should serve the purpose of analyzing and explaining the different parts of your code. The goal is to make the code easy to understand for anyone who reads it. Proper commenting enhances code readability and helps in maintaining and debugging.


The given task allocates 10 points for this part, emphasizing the importance of clear and explanatory comments in your code. It ensures that your code is not only functional but also understandable to others, contributing to good programming practices.


**Part 1:** Analyze Artists (30 pts)

From the data_by_artist.csv file, you can build an overview of each artist. To manage this information, use **OOP** in Python: build an **Artist** class to represent each artist and an **Artist Array** to store all artists.

(15 pts) Your Artist class should at least contain:

    a) All attributes of artists
    
    b) A function that adds a song to this artist. Recompute the mean of every attribute as (current song count * current attribute value + new song's attribute value) / (current song count + 1), then update the artist’s attributes.
    
    c) A function that compares the similarity between this artist and another. You may use **cosine similarity** or another method, but your algorithm must use all attributes in the comparison and must be easy to explain.
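
The running-mean update in (b) and the comparison in (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the required design: the `features` dictionary, the `song_count` field, and the method names `add_song` and `similarity` are hypothetical choices, and only built-in Python is used.

```python
import math


class Artist:
    """Minimal sketch: an artist holding the mean of each numeric feature."""

    def __init__(self, artist_id, name, features, song_count=0):
        self.artist_id = artist_id
        self.name = name
        # Assumed layout: feature name (e.g. "tempo") -> current mean value
        self.features = dict(features)
        self.song_count = song_count

    def add_song(self, song_features):
        """Fold one song into the running mean of every feature."""
        n = self.song_count
        for key, value in song_features.items():
            current = self.features.get(key, 0.0)
            # new_mean = (n * current_mean + new_value) / (n + 1)
            self.features[key] = (n * current + value) / (n + 1)
        self.song_count = n + 1

    def similarity(self, other):
        """Cosine similarity over the features both artists share."""
        keys = sorted(set(self.features) & set(other.features))
        v1 = [self.features[k] for k in keys]
        v2 = [other.features[k] for k in keys]
        dot = sum(x * y for x, y in zip(v1, v2))
        norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
        # Return 0.0 for a zero vector instead of dividing by zero
        return dot / norm if norm else 0.0


# Illustrative values only, not taken from the data set
a = Artist(1, "Artist A", {"tempo": 100.0, "danceability": 0.5}, song_count=1)
a.add_song({"tempo": 120.0, "danceability": 0.7})
print(a.song_count, a.features["tempo"])  # 2 110.0
```

Note that features such as tempo and danceability live on very different scales, so the largest-valued feature tends to dominate a raw cosine similarity. Rescaling each feature before comparing is one option worth considering.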


(15 pts) Your Artist Array should at least contain:

    a) The number of artists.
    
    b) A way to get an Artist object by name.
    
    c) A way to get an Artist object by id.
    
    d) A way to compare two artists given their two ids.
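
One way to meet the four requirements above is to keep two dictionaries, one keyed by id and one keyed by name, so that each lookup is a single hash access. The sketch below is hypothetical: the `Artist` stand-in, the `ArtistArray` method names, and the assumption that artist names are unique are all illustrative choices rather than part of the assignment.

```python
import math


class Artist:
    """Hypothetical stand-in; the real class would carry every attribute."""

    def __init__(self, artist_id, name, features):
        self.artist_id = artist_id
        self.name = name
        self.features = list(features)

    def similarity(self, other):
        dot = sum(x * y for x, y in zip(self.features, other.features))
        norm = (math.sqrt(sum(x * x for x in self.features))
                * math.sqrt(sum(y * y for y in other.features)))
        return dot / norm if norm else 0.0


class ArtistArray:
    """Sketch of a container with dictionary lookups by id and by name."""

    def __init__(self):
        self._by_id = {}
        self._by_name = {}  # assumes names are unique; later entries overwrite

    def add(self, artist):
        self._by_id[artist.artist_id] = artist
        self._by_name[artist.name] = artist

    def __len__(self):
        # The number of artists stored
        return len(self._by_id)

    def get_by_id(self, artist_id):
        return self._by_id.get(artist_id)  # None if the id is unknown

    def get_by_name(self, name):
        return self._by_name.get(name)  # None if the name is unknown

    def compare(self, id_a, id_b):
        a, b = self.get_by_id(id_a), self.get_by_id(id_b)
        if a is None or b is None:
            raise KeyError("unknown artist id")
        return a.similarity(b)


# Illustrative values only, not taken from the data set
artists = ArtistArray()
artists.add(Artist(1, "Artist A", [0.5, 120.0]))
artists.add(Artist(2, "Artist B", [0.7, 90.0]))
print(len(artists))                               # 2
print(artists.get_by_name("Artist B").artist_id)  # 2
print(artists.compare(1, 2))
```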

**Part 2:** Analyze influencers and followers. (60 pts)

Once you have organized the per-artist information, we want to dive into the relationships between artists. Please answer the following questions:

    (10 pts) Q1: Find who influences others the most. 
    
    (10 pts) Q2: Which three influencer active-start years influenced the most followers who started from the 2000s?
    
    (20 pts) Q3: How many artists do not influence anyone?
    
    Scope of Q1-Q3: Influence is transitive: if A influences B and B influences C, then we say A influenced C. Because of the data size, we only consider four-layer relationships. That is, in the chain A->B->C->D->E, A influenced B, C, and D, but not E.
    
    (20 pts) Q4: Write a function that can find the shortest influence chain given two artists.
    
    Hint1: Put all data points into a dictionary, then build a graph based on this dictionary.
    
    Hint2: This sheet has more than 40k data points, so you must design your algorithm with performance in mind.
    
    Hint3: There are some bi-influence relationships (A<->B). Your algorithm should not count an artist twice.
    
    Hint4: A demo (A -> B means A influences B):
        A -> B -> C -> D -> E -> F
        |    ^    ^         ^  
        v    |    |         |
        G -> H -> I -> J -> K -> L
    
        In Q1-3, A influenced B, C, D, G, H, I.
        In Q4, the shortest influence chain from A to K is A->G->H->I->J->K. The shortest influence chain from H to C could be either H->B->C or H->I->C. There is no influence chain between L and F.
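
The depth limit used in Q1-Q3 can be illustrated on the demo graph with a breadth-first search that stops after three steps. This is only a sketch under assumptions: the edge list is transcribed by hand from the diagram above, and `build_graph` and `influenced_within` are hypothetical helper names. The `seen` set is what keeps a bi-influence pair (A<->B) from counting an artist twice or counting the starting artist at all.

```python
from collections import deque

# Edges transcribed from the demo diagram: (influencer, follower)
DEMO_EDGES = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"),
    ("A", "G"), ("G", "H"), ("H", "I"), ("I", "J"), ("J", "K"), ("K", "L"),
    ("H", "B"), ("I", "C"), ("K", "E"),
]


def build_graph(edges):
    """Map each artist to the set of artists they directly influence."""
    graph = {}
    for influencer, follower in edges:
        graph.setdefault(influencer, set()).add(follower)
        graph.setdefault(follower, set())  # followers with no followers still get a key
    return graph


def influenced_within(graph, start, max_depth=3):
    """Return everyone reachable from start in at most max_depth steps."""
    seen = {start}      # never revisit a node, including the start itself
    reached = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue    # do not expand beyond the allowed depth
        for follower in graph.get(node, ()):
            if follower not in seen:
                seen.add(follower)
                reached.add(follower)
                queue.append((follower, depth + 1))
    return reached


graph = build_graph(DEMO_EDGES)
print(sorted(influenced_within(graph, "A")))  # ['B', 'C', 'D', 'G', 'H', 'I']
```

Because each node and edge is visited at most once per starting artist, the cost of one search grows with the size of the neighborhood explored rather than with the whole sheet.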

You need to write your problem-solving process into code and place it in the class 'ProblemSolver'.
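
For Q4, a breadth-first search that records each node's predecessor finds a chain with the fewest steps in an unweighted graph. The sketch below places it in a `ProblemSolver` class as requested, but the constructor argument, the `shortest_chain` method name, and the hand-transcribed demo edges are assumptions made for illustration.

```python
from collections import deque


class ProblemSolver:
    """Sketch: holds the influence graph and answers chain queries."""

    def __init__(self, edges):
        # edges: iterable of (influencer, follower) pairs
        self.graph = {}
        for influencer, follower in edges:
            self.graph.setdefault(influencer, set()).add(follower)
            self.graph.setdefault(follower, set())

    def shortest_chain(self, source, target):
        """Return one shortest influence chain as a list, or None if none exists."""
        if source not in self.graph or target not in self.graph:
            return None
        parent = {source: None}  # doubles as the visited set
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if node == target:
                chain = []
                while node is not None:  # walk predecessors back to the source
                    chain.append(node)
                    node = parent[node]
                return chain[::-1]
            for follower in self.graph[node]:
                if follower not in parent:
                    parent[follower] = node
                    queue.append(follower)
        return None


# Edges transcribed from the Hint4 demo diagram
solver = ProblemSolver([
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"),
    ("A", "G"), ("G", "H"), ("H", "I"), ("I", "J"), ("J", "K"), ("K", "L"),
    ("H", "B"), ("I", "C"), ("K", "E"),
])
print(solver.shortest_chain("A", "K"))
print(solver.shortest_chain("H", "C"))  # one of the two three-artist chains
print(solver.shortest_chain("L", "F"))  # None
```

When several chains tie for shortest, which one is returned depends on set iteration order, so the function should be read as returning any one shortest chain.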

**Part 3**: Combination (20 pts bonus)

You now have functions, objects, and a graph of artists. Using them, choose one of the topics below to explore and apply your skills to solve it! (20 pts maximum)

    1. Measure the ability of influencers within each genre.
    
    2. Measure the similarity between every two genres.
    
    3. Measure the trend of music development chronologically.

You need to write your problem-solving process into code and place it in the class 'ProblemSolver'.


## Utilities

We have provided several functions that might help you. Please find them attached.



## Evaluation Standard

For Part 1 and Part 2, you need to use Python's built-in data structures to finish all the tasks. You are free to use all built-in algorithms and functions, as well as pandas for importing data, but you are not allowed to use third-party packages such as numpy and scikit-learn. You will receive full credit if you answer the question correctly; otherwise, you will receive partial credit based on the completeness of your code.

For Part 2, we expect you to answer each question briefly in words, with all code attached.

For Part 3, we expect you to answer the question in words, with all code attached, but it is fine to skip this part if you do not want extra credit. In this part, you are free to use any additional supporting materials, such as a document that does not exceed 2 pages or a video that does not exceed 2 minutes. You will receive credit based on the complexity of your code and the clarity of your document.

In [1]:
# open and read in the file
import pandas as pd

# CSV file
file_path = 'influence_data.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the DataFrame
print(df)

       influencer_id  influencer_name influencer_main_genre  \
0             759491    The Exploited              Pop/Rock   
1              25462           Tricky            Electronic   
2              66915        Bob Dylan              Pop/Rock   
3              71209    Leonard Cohen              Pop/Rock   
4              91438     The Gun Club              Pop/Rock   
...              ...              ...                   ...   
42765         580300   Sufjan Stevens              Pop/Rock   
42766         261309      Vybz Kartel                Reggae   
42767         467203  Michael Jackson                  R&B;   
42768        2518003          Popcaan                Reggae   
42769        2896351        Tommy Lee                Reggae   

       influencer_active_start  follower_id      follower_name  \
0                         1980           74     Special Duties   
1                         1990          335          PJ Harvey   
2                         1960          335  


In [9]:
# iterrows() yields (index, row) pairs, one per DataFrame row
for index, row in df.iterrows():
    print(row['influencer_id'])
    break  # stop after the first row

759491


In [7]:
import math

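def dot_product(v1, v2):
    # Sum of element-wise products of two equal-length vectors
    return sum(x * y for x, y in zip(v1, v2))

def magnitude(vector):
    # Euclidean length of the vector
    return math.sqrt(sum(x * x for x in vector))

def cosine_similarity(v1, v2):
    # Cosine of the angle between v1 and v2; defined as 0.0 if either is a zero vector
    denominator = magnitude(v1) * magnitude(v2)
    if denominator == 0:
        return 0.0
    return dot_product(v1, v2) / denominator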

vector1 = [1, 2, 3]
vector2 = [4, 5, 6]

similarity = cosine_similarity(vector1, vector2)
print("cosine_similarity: ", similarity)


cosine_similarity:  0.9746318461970762
