<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Overview" data-toc-modified-id="Data-Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Overview</a></span></li><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Column-Header-Style" data-toc-modified-id="Column-Header-Style-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Column Header Style</a></span></li><li><span><a href="#Handling-Missing-Values" data-toc-modified-id="Handling-Missing-Values-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Handling Missing Values</a></span></li><li><span><a href="#Duplicates" data-toc-modified-id="Duplicates-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Duplicates</a></span></li></ul></li><li><span><a href="#Hypothesis-Testing" data-toc-modified-id="Hypothesis-Testing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Hypothesis Testing</a></span><ul class="toc-item"><li><span><a href="#Comparison-of-User-Behavior-Between-Two-Cities" data-toc-modified-id="Comparison-of-User-Behavior-Between-Two-Cities-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Comparison of User Behavior Between Two Cities</a></span></li><li><span><a href="#Music-at-the-Beginning-and-End-of-the-Week" data-toc-modified-id="Music-at-the-Beginning-and-End-of-the-Week-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Music at the Beginning and End of the Week</a></span></li><li><span><a href="#Genre-Preferences-in-Alpha-and-Beta" data-toc-modified-id="Genre-Preferences-in-Alpha-and-Beta-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Genre Preferences in Alpha and Beta</a></span></li></ul></li><li><span><a href="#Research-Summary" data-toc-modified-id="Research-Summary-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Research Summary</a></span></li></ul></div>

# Music Listening Habits: A Comparison of Two Major Cities

This project explores the listening habits of users from two major cities. Comparisons between these cities are often surrounded by myths. For example:

- **City Alpha** is a bustling metropolis driven by the fast-paced rhythm of the workweek.
- **City Beta** is known as a cultural hub with distinct tastes.

Using data from **MusicHub**, we will compare the behavior of users from these two cities.

**Objective:** The research aims to test the following hypotheses:

1. User activity varies depending on the day of the week, with differences between City Alpha and City Beta.
2. On Monday mornings, certain genres are more popular in City Alpha, while others dominate in City Beta. Similarly, on Friday evenings, different genres prevail in each city.
3. City Alpha and City Beta have distinct musical preferences: pop music is more popular in City Alpha, while a specific genre such as rap is favored in City Beta.

**Research Process**

The data on user behavior is provided in the file **musichub_data.csv**. Since the quality of the data is unknown, it is crucial to review the dataset before testing the hypotheses.

The study will involve three main stages:

- **Data Overview**: Initial exploration of the data to understand its structure and content.
- **Data Preprocessing**: Cleaning and preparing the data for analysis, addressing any critical data quality issues.
- **Hypothesis Testing**: Conducting statistical analysis to validate or refute the hypotheses.

## Data Overview

Let's start by examining the **MusicHub** data.

**First, import the necessary library:**

In [1]:
import pandas as pd # Import the pandas library

**Next, read the file `musichub_data.csv` and store it in a variable `df`:**

In [2]:
df = pd.read_csv('/Users/arina/Downloads/my-study-projects/musichub_data.csv')  # Read the dataset from a local path
df.to_csv('musichub_data.csv', index=False)  # Save a copy to the working directory

**Display the first ten rows of the table and get general information about the data using the `info()` method:**

In [3]:
df.head(10) # Display the first 10 rows of the dataframe df

|   | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Beta | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Alpha | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Beta | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Beta | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Alpha | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Beta | 13:09:41 | Friday |
| 6 | 4CB90AA5 | TRUE | Roman Messer | dance | Alpha | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Alpha | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Alpha | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Beta | 21:20:49 | Wednesday |


In [4]:
df.info() # Get general information about the data in the dataframe df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   userID  65079 non-null  object
 1   Track   63848 non-null  object
 2   artist  57876 non-null  object
 3   genre   63881 non-null  object
 4   City    65079 non-null  object
 5   time    65079 non-null  object
 6   Day     65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


**Observations**

The table contains seven columns. The data type for all columns is `object`.

According to the data documentation:
* `userID` — unique identifier for the user;
* `Track` — the name of the track;
* `artist` — the name of the artist;
* `genre` — the name of the genre;
* `City` — the city of the user;
* `time` — the time the listening started;
* `Day` — the day of the week.

There are varying numbers of entries in the columns, indicating missing values.

Two style issues are visible in the column names:

- A mix of lowercase and uppercase letters (`Track`, `City`, `Day`).
- No underscore between the words in `userID`.

**Conclusion**

Each row in the table represents data on a track that was listened to. Some columns describe the track itself (title, artist, genre), while others provide information about the user (city, listening time).

Preliminary analysis suggests that the data is sufficient for hypothesis testing. However, there are missing values and inconsistencies in the column names that need to be addressed.

To proceed, these data issues should be resolved.

## Data Preprocessing
In this step, we will fix the style of the column headers and address any missing values. We'll also check the data for duplicates.

### Column Header Style
**First, let's display the current column names:**

In [5]:
df.columns

Index(['userID', 'Track', 'artist', 'genre', 'City', 'time', 'Day'], dtype='object')

**Now, let's rename the columns to follow a consistent style:**

In [6]:
df = df.rename(columns={'userID': 'user_id', 'Track': 'track', 'City': 'city', 'Day': 'day'})
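
As a small illustration, the same renaming can be done programmatically instead of listing each column by hand. The sketch below is not part of the original notebook: `to_snake_case` is a hypothetical helper, and `toy` is an empty DataFrame that only reuses the column names shown above.

```python
import re
import pandas as pd

def to_snake_case(name):
    # Hypothetical helper: put "_" between a lowercase letter and the
    # uppercase letter that follows it, then lowercase the whole name
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name.strip())
    return name.lower()

# Toy frame with the same headers as the dataset (no rows needed)
toy = pd.DataFrame(columns=['userID', 'Track', 'artist', 'genre', 'City', 'time', 'Day'])
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

For seven columns the explicit dictionary used above is just as readable; a helper like this mainly pays off on wider tables.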

### Handling Missing Values
**Next, we'll count the number of missing values in the table:**

In [7]:
df.isna().sum() # Count missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values impact our analysis. For instance, missing values in `track` and `artist` are not critical for our study. These can be replaced with a placeholder.

However, missing values in `genre` might hinder our ability to compare the musical preferences between City Alpha and City Beta. In a real-world scenario, it would be ideal to investigate the cause of these missing values and attempt to recover the data. However, in this project, we will:

* Replace missing values with explicit placeholders.
* Assess how this might affect the results.

**We'll replace the missing values in the `track`, `artist`, and `genre` columns with the string `'unknown'`.**

To do this, create a list `columns_to_replace`, iterate over its elements with a `for` loop, and replace missing values in each column:

In [8]:
columns_to_replace = ['track','artist','genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown') 
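
The loop can also be written as a single assignment, because `fillna()` works on several columns at once. A minimal sketch on made-up rows (the `toy` DataFrame is an assumption for illustration, not the real data):

```python
import pandas as pd

# Toy rows with gaps in each of the three columns
toy = pd.DataFrame({
    'track': ['Soul People', None],
    'artist': [None, 'Space Echo'],
    'genre': ['dance', None],
})

columns_to_replace = ['track', 'artist', 'genre']
# Fill missing values in all listed columns in one step
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

# Confirm that no missing values remain
print(toy.isna().sum())
```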

### Duplicates
**First, let's count the number of explicit duplicates in the dataset:**

In [9]:
df.duplicated().sum() # Count explicit duplicates

3826

**Now, let's remove these duplicates:**

In [10]:
df = df.drop_duplicates().reset_index(drop=True)  # Remove explicit duplicates

Next, let's address implicit duplicates in the `genre` column. For example, the same genre might be recorded in slightly different ways, which could skew our analysis.

**To begin, let's display a sorted list of unique genre names:**

* Extract the genre column from the DataFrame.
* Sort the values in this column.
* Display the unique values in the sorted column.

In [11]:
genre = df['genre']
genre = genre.sort_values()
genre = genre.unique()
genre

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In the list, you'll find implicit duplicates such as variations of `hiphop`, which might include typos or alternative spellings.

**We need to standardize these values.** 

For example, replace `hip`, `hop`, and `hip-hop` with `hiphop`:

In [12]:
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop') # Fix implicit duplicates
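
If more groups of implicit duplicates turn up, the replacement can be wrapped in a reusable function. The sketch below assumes a hypothetical helper, `replace_wrong_genres`, and a toy `genre` column; neither comes from the original notebook.

```python
import pandas as pd

def replace_wrong_genres(data, wrong_genres, correct_genre):
    # Hypothetical helper: map every spelling in wrong_genres to correct_genre
    result = data.copy()
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

# Toy column containing the variants mentioned above
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
toy = replace_wrong_genres(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')

# Re-check the unique values after the fix
print(sorted(toy['genre'].unique()))
```

Listing the unique values again after the replacement is a cheap way to confirm that no variant was missed.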

**Conclusions**

The data preprocessing phase uncovered three main issues:

- Inconsistent column header styles.
- Missing values.
- Duplicates, both explicit and implicit.

We have corrected the column headers for easier data handling and removed duplicates, which should improve the accuracy of the analysis.

Missing values have been replaced with `'unknown'`. We will need to monitor whether the missing data in the `genre` column impacts our study.

With these issues addressed, we can now proceed to hypothesis testing.

## Hypothesis Testing

### Comparison of User Behavior Between Two Cities

The **first hypothesis** suggests that users listen to music differently in City Alpha and City Beta. Let's test this hypothesis by analyzing data from three days of the week—Monday, Wednesday, and Friday. The steps to do this are:

* Separate the users from City Alpha and City Beta.
* Compare how many tracks each group of users listened to on Monday, Wednesday, and Friday.
* Evaluate user activity in each city by grouping the data by city and counting the number of listens in each group.

**First, let's group the data by city and count the number of listens in each city:**

In [13]:
df.groupby('city')['genre'].count() # Counting listens in each city

city
Alpha    42741
Beta     18512
Name: genre, dtype: int64

City Alpha has more listens than City Beta. However, this does not necessarily mean that users in City Alpha listen to music more frequently. It could simply be that there are more users in City Alpha.
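
One way to separate the two explanations is to count unique users alongside listens. This is a sketch on invented rows, assuming the renamed columns `user_id` and `city`; the names `per_city` and `listens_per_user` are illustrative.

```python
import pandas as pd

# Toy data: Alpha has more users, Beta has one very active user
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'A2', 'A3', 'B1', 'B1', 'B1'],
    'city': ['Alpha', 'Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta'],
})

per_city = toy.groupby('city').agg(
    listens=('user_id', 'size'),    # number of rows, i.e. listens
    users=('user_id', 'nunique'),   # number of distinct users
)
per_city['listens_per_user'] = per_city['listens'] / per_city['users']
print(per_city)
```

In the toy data Alpha has more listens in total, yet Beta has more listens per user, which is exactly the situation the raw counts cannot distinguish.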

**Next, group the data by the day of the week and count the number of listens on Monday, Wednesday, and Friday.** Note that the data only includes information for these three days.

In [14]:
df.groupby('day')['genre'].count() # Counting listens on each of the three days

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

Across both cities combined, Wednesday has the fewest listens. However, the pattern may differ when each city is examined separately.

To make this process more efficient, **we'll create a function `number_tracks()` that counts the number of listens for each combination of day and city.**

In [32]:
def number_tracks(day, city):
    # Create a subset of the DataFrame for the specified day and city
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    
    # Count the number of user_id entries (which corresponds to the number of tracks)
    track_list_count = track_list['user_id'].count()
    
    # Return the count of tracks
    return track_list_count

**Use the function to get the number of listens for each combination of day and city:**

In [33]:
# Print track counts for Alpha
print("Track counts for Alpha:")
print(f"Monday: {number_tracks('Monday', 'Alpha')} tracks")
print(f"Wednesday: {number_tracks('Wednesday', 'Alpha')} tracks")
print(f"Friday: {number_tracks('Friday', 'Alpha')} tracks")
print("-" * 30)  # Line to separate the results for Alpha and Beta

# Print track counts for Beta
print("Track counts for Beta:")
print(f"Monday: {number_tracks('Monday', 'Beta')} tracks")
print(f"Wednesday: {number_tracks('Wednesday', 'Beta')} tracks")
print(f"Friday: {number_tracks('Friday', 'Beta')} tracks")

Track counts for Alpha:
Monday: 15740 tracks
Wednesday: 11056 tracks
Friday: 15945 tracks
------------------------------
Track counts for Beta:
Monday: 5614 tracks
Wednesday: 7003 tracks
Friday: 5895 tracks


**Create a table with the following columns — `['city', 'monday', 'wednesday', 'friday']` and data from the `number_tracks` function results:**

In [34]:
info = pd.DataFrame(data=[['Alpha',15740,11056,15945],['Beta',5614,7003,5895]], columns=['city', 'monday', 'wednesday', 'friday']) 
info

|   | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Alpha | 15740 | 11056 | 15945 |
| 1 | Beta | 5614 | 7003 | 5895 |
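
A table like this can also be built directly from the data rather than typing the counts in by hand. The sketch below uses `pd.crosstab` on a few invented rows; the `toy` DataFrame is an assumption, and on the real data the same call would take `df['city']` and `df['day']`.

```python
import pandas as pd

# Toy listens: one row per play
toy = pd.DataFrame({
    'city': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta'],
    'day': ['Monday', 'Friday', 'Friday', 'Wednesday', 'Wednesday', 'Monday'],
})

# Count listens for every city/day combination
counts = pd.crosstab(toy['city'], toy['day'])

# Put the days in calendar order instead of alphabetical order
counts = counts.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(counts)
```

This avoids copying numbers between cells, so the table stays correct if the preprocessing changes.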


**Conclusions**

The data shows differences in user behavior:

- In City Alpha, peak listens occur on Monday and Friday, with a drop on Wednesday.
- In City Beta, music listening peaks on Wednesdays, with Monday and Friday showing nearly equal activity but lower than Wednesday.

Thus, the data supports the first hypothesis.
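
Because City Alpha has far more listens overall, the pattern is easier to compare when each day is expressed as a share of its city's total. The sketch below reuses the counts from the table above; the variable `shares` is illustrative.

```python
import pandas as pd

# Counts taken from the table above
info = pd.DataFrame(
    data=[['Alpha', 15740, 11056, 15945], ['Beta', 5614, 7003, 5895]],
    columns=['city', 'monday', 'wednesday', 'friday'],
).set_index('city')

# Divide every row by its own total to get per-city shares
shares = info.div(info.sum(axis=1), axis=0)
print(shares.round(3))

# Busiest and quietest day for each city
print(shares.idxmax(axis=1))
print(shares.idxmin(axis=1))
```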

### Music at the Beginning and End of the Week
According to the **second hypothesis**, different genres dominate in City Alpha and City Beta on Monday mornings and Friday evenings.

**Save the data for each city in two variables: `alpha_general` and `beta_general`.**

In [35]:
alpha_general = df[df['city'] == 'Alpha']  # Data for City Alpha
beta_general = df[df['city'] == 'Beta']  # Data for City Beta

**Create a function `genre_weekday()` to analyze the top 10 genres for a given day and time range:**

In [36]:
def genre_weekday(data, day, time1, time2):
    genre_df = data[data['day'] == day]  # Filter by day
    genre_df = genre_df[genre_df['time'] > time1]  # Filter by start time
    genre_df = genre_df[genre_df['time'] < time2]  # Filter by end time

    genre_df_grouped = genre_df.groupby('genre')['genre'].count()  # Group by genre and count occurrences
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)  # Sort genres by count
    return genre_df_sorted[:10]  # Return top 10 genres
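
Note that the function compares `time` values as strings. This works because the values are zero-padded `HH:MM:SS` strings, so alphabetical order matches chronological order. The sketch below shows the same filter on invented rows; `genre_weekday_sketch` is a hypothetical variant that uses one combined mask and `value_counts()`, not the notebook's function.

```python
import pandas as pd

# Toy rows around the 07:00-11:00 boundaries
toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['06:59:59', '08:34:34', '08:37:09', '11:00:01', '08:00:00'],
    'genre': ['rock', 'dance', 'dance', 'pop', 'pop'],
})

def genre_weekday_sketch(data, day, time1, time2):
    # String comparison is safe only for zero-padded HH:MM:SS values
    mask = (data['day'] == day) & (data['time'] > time1) & (data['time'] < time2)
    # value_counts() sorts by frequency in descending order
    return data.loc[mask, 'genre'].value_counts().head(10)

top = genre_weekday_sketch(toy, 'Monday', '07:00', '11:00')
print(top)
```

If the times were stored without leading zeros (for example `8:34:34`), the string comparison would no longer be reliable and the column would need to be parsed first.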

**Compare the results for Monday morning (7:00 to 11:00) and Friday evening (17:00 to 23:00) for both cities:**

In [48]:
print("top 10 genres for City Alpha:")
# Results for City Alpha on Monday morning
print(f"\nMonday morning \n{genre_weekday(alpha_general, 'Monday', '07:00', '11:00')}") 

# Results for City Alpha on Friday evening
print(f"\nFriday evening \n{genre_weekday(alpha_general, 'Friday', '17:00', '23:00')}")
print("-" * 30) 

print("top 10 genres for City Beta:")
# Results for City Beta on Monday morning
print(f"\nMonday morning \n{genre_weekday(beta_general, 'Monday', '07:00', '11:00')}")

# Results for City Beta on Friday evening
print(f"\nFriday evening \n{genre_weekday(beta_general, 'Friday', '17:00', '23:00')}")

top 10 genres for City Alpha:

Monday morning 
genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

Friday evening 
genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64
------------------------------
top 10 genres for City Beta:

Monday morning 
genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

Friday evening 
genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59

**Conclusions**

When comparing the top 10 genres on Monday morning, the following observations can be made:

1. In both City Alpha and City Beta, people listen to similar music. The only notable difference is that the genre "world" appears in the top 10 for City Alpha, while "jazz" and "classical" make the top 10 in City Beta.

2. The presence of missing values in the data is significant, especially in City Alpha, where the genre labeled as 'unknown' ranks as the 10th most popular genre. This indicates that the missing data may significantly impact the reliability of the analysis.

Friday evening does not alter this pattern. Some genres rise slightly in popularity, while others drop, but the overall top 10 remains largely the same.

Thus, the second hypothesis is only partially confirmed:

* Users listen to similar music at the beginning and end of the week.
* The difference between City Alpha and City Beta is not very pronounced. The "world" genre makes the top 10 in City Alpha, while jazz makes the top 10 in City Beta.

However, the presence of missing data casts doubt on this result. The large number of missing entries in City Alpha suggests that the top 10 list could look different if the lost genre data were available.
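
To judge how much the missing genres matter, the share of `'unknown'` rows can be measured per city. This is a sketch on invented rows, assuming the placeholder value `'unknown'` introduced during preprocessing; `unknown_share` is an illustrative name.

```python
import pandas as pd

# Toy data: half of Alpha's rows have no genre, none of Beta's
toy = pd.DataFrame({
    'city': ['Alpha', 'Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'jazz', 'pop'],
})

# The mean of a boolean column is the fraction of True values
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

A share that is small and similar in both cities would suggest the rankings are fairly robust; a large or lopsided share would call for more caution.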

### Genre Preferences in Alpha and Beta

The **third hypothesis** suggests that Beta is the capital of rap, where this genre is more frequently listened to than in Alpha. Meanwhile, Alpha is seen as a city of contrasts, but with a predominant preference for pop music.

**Group the `alpha_general` dataframe by genre and count the number of track plays for each genre. Then, sort the results in descending order and save them in the `alpha_genres` Series.**

In [49]:
alpha_genres = alpha_general.groupby('genre')['genre'].count() # Group by genre and count track plays
alpha_genres = alpha_genres.sort_values(ascending=False) # Sort the result in descending order

**Display the top 10 genres in Alpha:**

In [53]:
alpha_genres.head(10) # Display the first 10 rows of alpha_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

**Repeat the same process for the `beta_general` dataframe: group by genre, count the track plays, and sort the results in descending order. Save the results in the `beta_genres` Series.**

In [54]:
beta_genres = beta_general.groupby('genre')['genre'].count() # Group by genre and count track plays
beta_genres = beta_genres.sort_values(ascending=False) # Sort the result in descending order

**Display the top 10 genres in Beta:**

In [55]:
beta_genres.head(10) # Display the first 10 rows of beta_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis is partially confirmed:

* Pop music is indeed the most popular genre in City Alpha, as the hypothesis suggested. Moreover, a related genre, Russian pop music, is also in the top 10.
* Contrary to expectations, hip-hop holds the same position in both cities: it ranks fifth in City Alpha and in City Beta, so rap is not noticeably more popular in City Beta.

This implies that while City Alpha has a strong preference for pop music, the idea that City Beta is the capital of rap is not entirely supported by the data.
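
Since the cities differ in total listens, raw counts are hard to compare directly. The sketch below converts a few counts from the two top-10 lists into shares of each city's total (42741 for Alpha and 18512 for Beta, as counted earlier); the names `plays`, `totals`, and `shares` are illustrative.

```python
import pandas as pd

# Total listens per city, from the groupby on 'city'
totals = pd.Series({'Alpha': 42741, 'Beta': 18512})

# Selected genre counts from the two top-10 lists
plays = pd.DataFrame(
    {'Alpha': [5892, 2096, 1161], 'Beta': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Divide each city's column by that city's total
shares = plays.div(totals, axis=1)
print((shares * 100).round(1))
```

The hip-hop and Russian rap shares differ by well under one percentage point between the cities, which is consistent with the conclusion that rap is not markedly more popular in City Beta.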

## Research Summary

We tested three hypotheses and found the following:

1. Day of the week influences user activity differently in Alpha and Beta.

The first hypothesis was fully confirmed.

2. Musical preferences remain relatively stable throughout the week in both Alpha and Beta. Slight differences were observed at the beginning of the week, on Mondays:
* In Alpha, the "world" genre is more popular.
* In Beta, jazz and classical music are preferred.

Thus, the second hypothesis was only partially confirmed. This result might have been different if there were no missing data.

3. The musical tastes of users in Alpha and Beta have more in common than differences. Contrary to expectations, genre preferences in Beta closely resemble those in Alpha.

The third hypothesis was only partially confirmed: pop leads in Alpha as expected, but rap is no more popular in Beta than in Alpha. If differences in preferences do exist, they are not noticeable among the majority of users.