##Hypothesis Test for gender comparison significance - w.r.t. users' listening events ##

In [None]:
import pandas as pd
from scipy import stats
# Load  dataset
df = pd.read_csv('LFM-1b_users_additional.txt', delimiter='\t' )
df_output_data_new = pd.read_csv('output_data_new.txt', delimiter='\t' )

df_add = df.rename (columns={'user-id':'user_id'})
data  = pd.merge(df_add, df_output_data_new, on='user_id')

# Separate listening-event counts for male and female groups
listening_events_male = data[data['gender'] == 'm']['cnt_listeningevents']
listening_events_female = data[data['gender'] == 'f']['cnt_listeningevents']

# Checking Assumptions
# Assumption 1: Independence is typically not checked explicitly for t-tests, assuming samples are independent.

# Assumption 2: Normality
# You can perform normality tests for both groups.
normality_male = stats.shapiro(listening_events_male)
normality_female = stats.shapiro(listening_events_female)

print("Normality test for male group - p-value:", normality_male.pvalue)
print("Normality test for female group - p-value:", normality_female.pvalue)

# Assumption 3: Homogeneity of Variances
# You can perform Levene's test for homogeneity of variances.
levene_result = stats.levene(listening_events_male, listening_events_female)

print("Levene's test for homogeneity of variances - p-value:", levene_result.pvalue)



Normality test for male group - p-value: 0.0
Normality test for female group - p-value: 0.0
Levene's test for homogeneity of variances - p-value: 1.6293189880895844e-07
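
A p-value of exactly 0.0 from Shapiro-Wilk is worth a second look on data of this size: SciPy's documentation notes that for N > 5000 the W statistic is accurate but the p-value may not be. A minimal sketch of a cross-check follows, using sample skewness and the D'Agostino-Pearson test (`stats.normaltest`). The lognormal array is a synthetic stand-in for a heavy-tailed count column and is an assumption, not the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, heavy-tailed stand-in for a column such as cnt_listeningevents (assumption)
sample = rng.lognormal(mean=8.0, sigma=1.5, size=20_000)

# Skewness near 0 is expected for normal data; large positive values mean a long right tail
skewness = stats.skew(sample)

# D'Agostino-Pearson omnibus test combines skewness and kurtosis
k2_stat, k2_p = stats.normaltest(sample)

print("n:", sample.size)
print("skewness:", skewness)
print("D'Agostino-Pearson p-value:", k2_p)
```

If both checks point the same way as Shapiro-Wilk, the decision to use a non-parametric test stands regardless of the exact p-value.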




In [None]:
# Perform Mann-Whitney U test (non-parametric alternative)
u_statistic, mw_p_value = stats.mannwhitneyu(listening_events_male, listening_events_female, alternative='two-sided')

print("Mann-Whitney U statistic:", u_statistic)
print("Mann-Whitney p-value:", mw_p_value)

if mw_p_value < 0.05:
    print("The difference in listening events between male and female groups is statistically significant.")
else:
    print("There is no statistically significant difference in the listening events between male and female groups.")

Mann-Whitney U statistic: 253200087.0
Mann-Whitney p-value: 1.3201873519308334e-65
The difference in listening events between male and female groups is statistically significant.


The Mann-Whitney U test indicates a statistically significant difference in listening events between the male and female groups (p ≈ 1.32e-65). Because the test is rank-based, it does not require the normality or equal-variance assumptions that both groups violated above. The extremely low p-value is strong evidence against the null hypothesis of no difference between the two groups, although a p-value on its own does not say how large that difference is.

Since the Mann-Whitney U test is non-parametric and does not rely on the assumptions of the t-test, its result is a more trustworthy indication of a difference between the groups here than a t-test would be.
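
With tens of thousands of users per group, even a small shift can produce a tiny p-value, so an effect size helps put the result in context. The sketch below derives two common ones from the U statistic: the common-language effect size U / (n1 * n2) and the rank-biserial correlation. It assumes SciPy 1.7 or later, where `mannwhitneyu` returns the U statistic of the first sample; `group_a` and `group_b` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic groups with a small shift on the log scale (assumption, for illustration only)
group_a = rng.lognormal(mean=8.05, sigma=1.5, size=30_000)
group_b = rng.lognormal(mean=8.00, sigma=1.5, size=12_000)

res = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# Common-language effect size: estimated probability that a random value
# from group_a exceeds a random value from group_b (0.5 means no tendency)
cles = res.statistic / (group_a.size * group_b.size)

# Rank-biserial correlation rescales that probability to the range [-1, 1]
rank_biserial = 2 * cles - 1

print("p-value:", res.pvalue)
print("common-language effect size:", cles)
print("rank-biserial correlation:", rank_biserial)
```

Applied to the real data, the same two lines after the test would show whether the significant result corresponds to a practically meaningful difference.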

##Hypothesis Test for gender comparison significance - count of distinct artists##

In [None]:
import pandas as pd
from scipy import stats

# Load  dataset
df = pd.read_csv('LFM-1b_users_additional.txt', delimiter='\t' )
df_output_data_new = pd.read_csv('output_data_new.txt', delimiter='\t' )

df_add = df.rename (columns={'user-id':'user_id'})
data  = pd.merge(df_add, df_output_data_new, on='user_id')

# Separate artist counts for male and female groups
cnt_artists_male = data[data['gender'] == 'm']['cnt_distinct_artists']
cnt_artists_female = data[data['gender'] == 'f']['cnt_distinct_artists']

# Checking Assumptions
# Assumption 1: Independence is typically not checked explicitly for t-tests, assuming samples are independent.

# Assumption 2: Normality
# You can perform normality tests for both groups.
normality_male = stats.shapiro(cnt_artists_male)
normality_female = stats.shapiro(cnt_artists_female)

print("Normality test for male group - p-value:", normality_male.pvalue)
print("Normality test for female group - p-value:", normality_female.pvalue)

# Assumption 3: Homogeneity of Variances
# You can perform Levene's test for homogeneity of variances.
levene_result = stats.levene(cnt_artists_male, cnt_artists_female)

print("Levene's test for homogeneity of variances - p-value:", levene_result.pvalue)



Normality test for male group - p-value: 0.0
Normality test for female group - p-value: 0.0
Levene's test for homogeneity of variances - p-value: 1.9898832738365336e-49




In [None]:
# Perform Mann-Whitney U test (non-parametric alternative)
u_statistic, mw_p_value = stats.mannwhitneyu(cnt_artists_male, cnt_artists_female, alternative='two-sided')

print("Mann-Whitney U statistic:", u_statistic)
print("Mann-Whitney p-value:", mw_p_value)

if mw_p_value < 0.05:
    print("The difference in count of distinct artists between male and female groups is statistically significant.")
else:
    print("There is no statistically significant difference in the count of distinct artists between male and female groups.")

Mann-Whitney U statistic: 253805592.0
Mann-Whitney p-value: 5.444257313936632e-69
The difference in count of distinct artists between male and female groups is statistically significant.


The Mann-Whitney U statistic is 253,805,592.0 and the corresponding p-value is 5.44e-69. These results indicate that there is a statistically significant difference in the count of distinct artists between the male and female groups.
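
The test says the groups differ but not in which direction. A per-group summary is one way to read that off; the sketch below uses a made-up toy frame, and only the column names `gender` and `cnt_distinct_artists` are taken from the notebook.

```python
import pandas as pd

# Toy data (hypothetical values); the real frame is `data` from the merge above
toy = pd.DataFrame({
    'gender': ['m', 'm', 'm', 'f', 'f', 'f'],
    'cnt_distinct_artists': [120, 300, 80, 90, 60, 150],
})

# Count, median and mean per gender; the median is the natural companion
# to a rank-based test because it is robust to heavy tails
summary = toy.groupby('gender')['cnt_distinct_artists'].agg(['count', 'median', 'mean'])
print(summary)
```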

##Hypothesis Test for gender comparison significance - w.r.t. novelty and mainstreaminess score ##

In [None]:
# Selecting the novelty and mainstreaminess scores computed over a 12-month window
df_score = df[['user-id','novelty_artist_avg_year','mainstreaminess_avg_year']]
df_score

Unnamed: 0,user-id,novelty_artist_avg_year,mainstreaminess_avg_year
0,384,0.3094285950064659,0.000000
1,1206,0.5137868970632553,0.000000
2,2622,0.6989826304571969,0.079669
3,2732,0.8828014254570007,0.032614
4,3653,0.4244110181051142,0.077731
...,...,...,...
120317,50871714,0.5498878061771393,0.075544
120318,50900118,0.6803168281912804,0.103613
120319,50931921,0.35164836049079895,0.012505
120320,50933471,0.5991988703608513,0.039335


In [None]:
# Renaming 'user-id' column to 'user_id'
df_score = df_score.rename(columns={'user-id' :'user_id' })
df_score

Unnamed: 0,user_id,novelty_artist_avg_year,mainstreaminess_avg_year
0,384,0.3094285950064659,0.000000
1,1206,0.5137868970632553,0.000000
2,2622,0.6989826304571969,0.079669
3,2732,0.8828014254570007,0.032614
4,3653,0.4244110181051142,0.077731
...,...,...,...
120317,50871714,0.5498878061771393,0.075544
120318,50900118,0.6803168281912804,0.103613
120319,50931921,0.35164836049079895,0.012505
120320,50933471,0.5991988703608513,0.039335


In [None]:
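# Data preprocessing - checking for missing values
import numpy as np


# '?' marks a missing value in the source file; np.nan is used because the np.NaN alias was removed in NumPy 2.0
df_score = df_score.replace('?', np.nan)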
print('Number of instances = %d' % (df_score.shape[0]))
print('Number of attributes = %d' % (df_score.shape[1]))


print('Number of missing values:')
for col in df_score.columns:
    print('\t%s: %d' % (col,df_score[col].isna().sum()))


for col in df_score.columns:
    print(col,df_score[col].isna().value_counts())

Number of instances = 120322
Number of attributes = 3
Number of missing values:
	user_id: 0
	novelty_artist_avg_year: 7530
	mainstreaminess_avg_year: 0
user_id False    120322
Name: user_id, dtype: int64
novelty_artist_avg_year False    112792
True       7530
Name: novelty_artist_avg_year, dtype: int64
mainstreaminess_avg_year False    120322
Name: mainstreaminess_avg_year, dtype: int64


In [None]:
# Removing the rows with missing values from the data
df_score_cl=df_score.dropna()

In [None]:
# Checking data type to execute numeric calculations
data_type = df_score_cl.dtypes
print(data_type)

user_id                       int64
novelty_artist_avg_year      object
mainstreaminess_avg_year    float64
dtype: object


In [None]:
# Converting columns to numeric
df_score_cl = df_score_cl.apply(pd.to_numeric)
df_score_cl

Unnamed: 0,user_id,novelty_artist_avg_year,mainstreaminess_avg_year
0,384,0.309429,0.000000
1,1206,0.513787,0.000000
2,2622,0.698983,0.079669
3,2732,0.882801,0.032614
4,3653,0.424411,0.077731
...,...,...,...
120317,50871714,0.549888,0.075544
120318,50900118,0.680317,0.103613
120319,50931921,0.351648,0.012505
120320,50933471,0.599199,0.039335
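
The replace, drop and convert steps above can also be collapsed into one conversion. The sketch below is an alternative under stated assumptions rather than the notebook's method: `pd.to_numeric(..., errors='coerce')` turns every unparseable string into NaN, which covers '?' but would also silently cover any other non-numeric token. The `raw` frame is a small hypothetical example.

```python
import pandas as pd

# Hypothetical miniature of df_score, with '?' as the missing-value marker
raw = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'novelty_artist_avg_year': ['0.31', '?', '0.70', '0.88'],
    'mainstreaminess_avg_year': [0.0, 0.05, 0.08, 0.03],
})

# Coerce the object column to float; '?' becomes NaN in the same step
raw['novelty_artist_avg_year'] = pd.to_numeric(
    raw['novelty_artist_avg_year'], errors='coerce'
)

# Drop only the rows where the coerced column is missing
clean = raw.dropna(subset=['novelty_artist_avg_year'])

print(clean.dtypes)
print(len(raw) - len(clean), "row(s) dropped")
```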


In [None]:
# Merging the analysis dataset with the novelty and mainstreaminess scores
df_merged_score = pd.merge(df_score_cl, df_output_data_new , on='user_id')
df_merged_score

Unnamed: 0,user_id,novelty_artist_avg_year,mainstreaminess_avg_year,country,age,gender,playcount,registered_unixtime
0,384,0.309429,0.000000,UK,35,m,42139,1035849600
1,3653,0.424411,0.077731,UK,31,m,18504,1041033600
2,4813,0.907891,0.007011,US,43,m,640,1050364800
3,5069,0.552692,0.057478,AT,30,m,31867,1051488000
4,6958,0.460354,0.007506,US,36,m,34788,1057536000
...,...,...,...,...,...,...,...,...
46206,50673825,0.431379,0.034846,JP,43,m,39,1341630308
46207,50759670,0.239111,0.178529,US,25,m,467,1342153673
46208,50796677,0.484848,0.000000,PL,110,f,1495,1342344762
46209,50871714,0.549888,0.075544,BY,19,f,569,1342728447


##Hypothesis Test for gender comparison significance - w.r.t. novelty score ##

In [None]:
import pandas as pd
from scipy import stats

# Separate for male and female groups
novelty_male = df_merged_score[df_merged_score['gender'] == 'm']['novelty_artist_avg_year']
novelty_female = df_merged_score[df_merged_score['gender'] == 'f']['novelty_artist_avg_year']


# Checking Assumptions
# Assumption 1: Independence is typically not checked explicitly for t-tests, assuming samples are independent.

# Assumption 2: Normality
# You can perform normality tests for both groups.
normality_male = stats.shapiro(novelty_male)
normality_female = stats.shapiro(novelty_female)

print("Normality test for male group - p-value:", normality_male.pvalue)
print("Normality test for female group - p-value:", normality_female.pvalue)

# Assumption 3: Homogeneity of Variances
# You can perform Levene's test for homogeneity of variances.
levene_result = stats.levene(novelty_male, novelty_female)

print("Levene's test for homogeneity of variances - p-value:", levene_result.pvalue)


Normality test for male group - p-value: 2.705977399534438e-40
Normality test for female group - p-value: 9.648014525280374e-35
Levene's test for homogeneity of variances - p-value: 0.07341101762499902




In [None]:
# Perform Mann-Whitney U test (non-parametric alternative)
u_statistic, mw_p_value = stats.mannwhitneyu(novelty_male, novelty_female, alternative='two-sided')

print("Mann-Whitney U statistic:", u_statistic)
print("Mann-Whitney p-value:", mw_p_value)

if mw_p_value < 0.05:
    print("The difference in novelty score between male and female groups is statistically significant.")
else:
    print("There is no statistically significant difference in the novelty score between male and female groups.")

Mann-Whitney U statistic: 225076192.5
Mann-Whitney p-value: 2.5872949336229797e-11
The difference in novelty score between male and female groups is statistically significant.
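
The same split, check and test sequence is repeated for every metric, which makes copy-paste slips easy. One option is a small helper that takes the column name as a parameter. The function `compare_by_gender` below is a hypothetical name introduced for illustration, and the toy frame is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_by_gender(frame, column, group_col='gender', groups=('m', 'f')):
    """Run Levene's test and a two-sided Mann-Whitney U test on `column`,
    split by the two labels in `groups` (hypothetical helper)."""
    first = frame.loc[frame[group_col] == groups[0], column].dropna()
    second = frame.loc[frame[group_col] == groups[1], column].dropna()
    levene = stats.levene(first, second)
    mwu = stats.mannwhitneyu(first, second, alternative='two-sided')
    return {
        'n_first': len(first),
        'n_second': len(second),
        'levene_p': levene.pvalue,
        'u_statistic': mwu.statistic,
        'mw_p': mwu.pvalue,
    }


# Synthetic data with the notebook's column names (values are made up)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'gender': rng.choice(['m', 'f'], size=500),
    'novelty_artist_avg_year': rng.uniform(0, 1, size=500),
})

result = compare_by_gender(toy, 'novelty_artist_avg_year')
print(result)
```

Because both groups are selected inside the function from a single column name, the two samples cannot accidentally refer to the same group.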


##Hypothesis Test for gender comparison significance - w.r.t. mainstreaminess score ##

In [None]:
import pandas as pd
from scipy import stats

# Separate for male and female groups
mainstream_male = df_merged_score[df_merged_score['gender'] == 'm']['mainstreaminess_avg_year']
mainstream_female = df_merged_score[df_merged_score['gender'] == 'f']['mainstreaminess_avg_year']


# Checking Assumptions
# Assumption 1: Independence is typically not checked explicitly for t-tests, assuming samples are independent.

# Assumption 2: Normality
# You can perform normality tests for both groups.
normality_male = stats.shapiro(mainstream_male)
normality_female = stats.shapiro(mainstream_female)

print("Normality test for male group - p-value:", normality_male.pvalue)
print("Normality test for female group - p-value:", normality_female.pvalue)

# Assumption 3: Homogeneity of Variances
# You can perform Levene's test for homogeneity of variances.
# Compare the male group against the female group. The output recorded below came from a run
# that passed mainstream_male for both groups (hence Levene p-value 1.0); rerun this cell to refresh it.
levene_result = stats.levene(mainstream_male, mainstream_female)

print("Levene's test for homogeneity of variances - p-value:", levene_result.pvalue)


Normality test for male group - p-value: 0.0
Normality test for female group - p-value: 0.0
Levene's test for homogeneity of variances - p-value: 1.0




In [None]:

# Perform Mann-Whitney U test (non-parametric alternative)
# Compare the male group against the female group. The output recorded below came from a run
# that passed mainstream_male for both groups (hence p-value 1.0); rerun this cell to refresh it.
u_statistic, mw_p_value = stats.mannwhitneyu(mainstream_male, mainstream_female, alternative='two-sided')

print("Mann-Whitney U statistic:", u_statistic)
print("Mann-Whitney p-value:", mw_p_value)

if mw_p_value < 0.05:
    print("The difference in mainstreaminess score between male and female groups is statistically significant.")
else:
    print("There is no statistically significant difference in the mainstreaminess score between male and female groups.")

Mann-Whitney U statistic: 549527552.0
Mann-Whitney p-value: 1.0
There is no statistically significant difference in the mainstreaminess score  of distinct artists between male and female groups.
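
A Mann-Whitney p-value of exactly 1.0 alongside a Levene p-value of 1.0 is the signature of a sample being compared with itself. In that case U equals n²/2, and the recorded statistic 549527552.0 is exactly 33152²/2, so this output should not be read as evidence about male versus female mainstreaminess. The sketch below reproduces the pattern on synthetic data. The array `x` is a made-up stand-in, and the code assumes a recent SciPy in which `mannwhitneyu` returns the U statistic of its first argument.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic stand-in for a score column (assumption, not the real data)
x = rng.uniform(0, 0.2, size=1_000)

# Comparing a sample with itself: every value ties with its own copy
res = stats.mannwhitneyu(x, x, alternative='two-sided')

print("U statistic:", res.statistic)
print("n**2 / 2:   ", x.size ** 2 / 2)
print("p-value:    ", res.pvalue)
```

Checking that the two arguments are different objects, or that U differs from n1 * n2 / 2 by more than chance, is a cheap guard before interpreting a non-significant result.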
