# Analyzing Music Streaming Behavior

## Introduction

This study examines user behavior in two cities, Springfield and Shelbyville, using real online music streaming data. The main objective is to test the hypothesis that user activity varies by city and day of the week.

Three phases make up the project's structure:

* Data description: Examine the dataset, note its key characteristics, and record observations about its composition and organization.
* Data preprocessing: Fix column names, handle missing values, and remove duplicates to clean up the data.
* Hypothesis testing: Apply analytical techniques to test the hypothesis and determine whether it is accepted, partially accepted, or rejected.

The dataset includes user IDs, track names, artists, genres, cities, playback times, and days of the week. The results should help identify patterns in user behavior by location and by day.

## Objectives

* Determine the differences in daily and weekly music streaming activity between users in Springfield and Shelbyville.

* Clean and prepare the dataset to ensure precise analysis and reliable hypothesis testing.
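
As a rough illustration of the first objective, the comparison could be sketched as a count of plays per city and day. The toy rows below are hypothetical and only mirror the column layout the dataset has after cleaning; this is a sketch under that assumption, not the analysis itself.

```python
import pandas as pd

# Hypothetical toy rows mirroring the cleaned column names (not real data)
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3', 'D4', 'E5'],
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Shelbyville', 'Springfield'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Monday'],
})

# Count plays for every (city, day) pair
plays = toy.groupby(['city', 'day'])['user_id'].count()

# Pivot days into columns so the two cities can be compared side by side;
# pairs with no plays are filled with 0
comparison = plays.unstack(fill_value=0)
print(comparison)
```

Each row of `comparison` is a city and each column a day, which makes differences between the two cities easy to read off.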

## Data description

In [6]:
# Import Pandas
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('music_project_en.csv')

# Display the first ten rows
df.head(10)

|   | userID | Track | artist | genre | City | time | Day |
|---|--------|-------|--------|-------|------|------|-----|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| 5 | 842029A1 | Chains | Obladaet | rusrap | Shelbyville | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Springfield | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Springfield | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | L’estate | Julia Dalia | ruspop | Springfield | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Shelbyville | 21:20:49 | Wednesday |


In [7]:
# Display general information (info() prints its report itself and returns None)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


* **Data Types**: All seven columns are stored as `object` (text), including `time`, which holds clock times as strings. No column is numeric.

* **Information Sufficiency**: The DataFrame contains the variables the analysis needs (the city, day, and time of each play), so there is enough information to test the working hypothesis, provided the data is accurate.

* **Missing Values**: `Track` (63,736 non-null), `artist` (57,512), and `genre` (63,881) each have fewer non-null values than the 65,079 entries, so all three columns contain missing data. The `info()` output makes this easy to spot because it lists the non-null count for every column: a column has missing data whenever that count is lower than the total number of entries. The same output also shows stray spaces around the `userID` and `City` headers, which need cleaning.
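
The comparison of non-null counts against the total number of entries can also be computed directly instead of being read off the `info()` report. A minimal sketch, using a hypothetical three-row frame in place of the real file:

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset
sample = pd.DataFrame({
    'Track': ['Chains', None, 'True'],
    'artist': ['Obladaet', None, None],
    'genre': ['rusrap', 'dance', None],
})

# isna() marks missing cells; sum() counts them per column
missing_counts = sample.isna().sum()
print(missing_counts)

# The same figures derived the way info() implies them:
# total rows minus the non-null count of each column
derived = len(sample) - sample.count()
```

Both `missing_counts` and `derived` hold one number per column, so they can be compared or sorted to see which columns need the most attention.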

## Data preprocessing

### Header style

In [8]:
# Rename columns with an explicit mapping; keys that do not match a header exactly (including spaces and case) are ignored

new_columns = {
    '  user_id': 'user_id',
    'Track': 'track',
    'Artist': 'artist',
    'City': 'city',
    'time': 'time',
    'Day': 'day'
}

df.rename(columns=new_columns, inplace=True)

print(df.columns)

Index(['  userID', 'track', 'artist', 'genre', '  City  ', 'time', 'day'], dtype='object')
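
In the output above, `'  userID'` and `'  City  '` keep their original names. By default `DataFrame.rename` skips mapping keys that do not match an existing label exactly, whitespace included. The sketch below reproduces this on a hypothetical empty frame with similarly padded headers, and shows that passing `errors='raise'` turns the silent skip into a `KeyError`:

```python
import pandas as pd

# Hypothetical frame whose headers carry stray spaces, as in the dataset
demo = pd.DataFrame(columns=['  userID', 'Track', '  City  '])

# 'City' does not match '  City  ', so only 'Track' is renamed
renamed = demo.rename(columns={'Track': 'track', 'City': 'city'})
print(list(renamed.columns))

# errors='raise' reports mapping keys that match no column
try:
    demo.rename(columns={'City': 'city'}, errors='raise')
    mismatch_detected = False
except KeyError:
    mismatch_detected = True
print(mismatch_detected)
```

Cleaning the whitespace first, or raising on unmatched keys, avoids a rename that appears to succeed but leaves some headers untouched.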


In [9]:
# Convert the column headers to lowercase

df.columns = df.columns.str.lower()

# Display the updated column names
print(df.columns)

Index(['  userid', 'track', 'artist', 'genre', '  city  ', 'time', 'day'], dtype='object')


In [10]:
# Strip leading and trailing spaces from the column headers

df.columns = df.columns.str.strip()

# Display the updated column names
print(df.columns)

Index(['userid', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')


In [11]:
# Rename the column "userid" to snake_case

df.rename(columns={'userid': 'user_id'}, inplace=True)

# Display the updated column names
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')


In [12]:
# Check the result: the list of headers
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')


The table headers now follow the style guidelines:

- All characters are lowercase.
- Leading and trailing spaces were removed.
- Names with more than one word use snake_case (`userid` became `user_id`).
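
The three rules can be collected into one helper so they are applied in a single pass. This is a sketch rather than the notebook's own code: `normalize_headers` and its `renames` argument are hypothetical names, and the empty frame only reuses the raw headers reported by `info()`.

```python
import pandas as pd

def normalize_headers(df, renames=None):
    """Return a copy of df with stripped, lowercase, optionally renamed headers."""
    out = df.copy()
    # Strip surrounding whitespace, then lowercase every header
    out.columns = out.columns.str.strip().str.lower()
    # Apply explicit snake_case fixes for multi-word names
    if renames:
        out = out.rename(columns=renames)
    return out

# Hypothetical empty frame carrying the raw headers shown by info()
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
clean = normalize_headers(raw, renames={'userid': 'user_id'})
print(list(clean.columns))
```

Returning a copy instead of renaming in place keeps the raw frame available for comparison, which is a design choice of this sketch and not something the notebook does.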