<h1>LEARNING FROM NETWORK PROJECT</h1>

<h2>Part 1: Network Generation</h2> 

Import necessary packages and create an instance of the class

In [6]:
from NetworkManager import NetworkManager
NM = NetworkManager()

Create the edge list using the <i>featGenerator</i> method (may take a while)

In [None]:
NM.featGenerator("7silW8RiEOoLBgAg5JBCL1", "Riccardo Muti", 3)

Generate the output graph

In [None]:
NM.buildGraphNetwork()

Save the current graph to a .txt file

In [None]:
save_path = 'Graphs/graph_name.txt'
NM.writeNetwork(save_path)

<h2>Part 2: Graph Features Extraction</h2> 

Import necessary packages

In [7]:
import networkx as nx

Load an edge list and build a graph (if needed)

In [8]:
edge_list_path = 'Graphs/kw_2.txt'
NM.buildNetworkFromTxt(edge_list_path)
graph = NM.Graph_network

print(graph)  # nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary

Graph with 1994 nodes and 13251 edges


<h3>Extract Features</h3>

Popularity Score

In [26]:
popularities = NM.getPopularityScores()

--------------------------------------------------------------

------------  Start Artist Popularity Calculation ------------

--------------------------------------------------------------

 Elapsed time : 339.34 s


Number of followers

In [10]:
followers = NM.getFollowersNumber()

--------------------------------------------------------------

------------  Start Artist Followers Calculation -------------

--------------------------------------------------------------

 Elapsed time : 270.38 s


Number of albums

In [11]:
num_albums = NM.getNumAlbums()

--------------------------------------------------------------

------------  Start Number of Albums Calculation -------------

--------------------------------------------------------------

 Elapsed time : 348.62 s


Graph nodes features

In [12]:
pagerank = list(nx.pagerank(graph).values())
closeness_centralities = list(nx.closeness_centrality(graph).values())
degree_centralities = list(nx.degree_centrality(graph).values())
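As a rough illustration of what one of these node features measures, the sketch below computes degree centrality by hand: each node's degree divided by n - 1, which is the normalization NetworkX documents for <i>degree_centrality</i>. The toy edge list and the helper name `degree_centrality_by_hand` are made up for illustration and are not part of NetworkManager.

```python
from collections import defaultdict

def degree_centrality_by_hand(edges):
    """Degree of each node divided by (n - 1).

    Assumes an undirected graph with no self-loops and at least two nodes.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Toy path graph a - b - c (hypothetical data)
toy_edges = [("a", "b"), ("b", "c")]
toy_centrality = degree_centrality_by_hand(toy_edges)
print(toy_centrality)  # {'a': 0.5, 'b': 1.0, 'c': 0.5}
```

Since the cell above keeps only `list(...values())`, the resulting lists follow the graph's node iteration order, so the other feature lists presumably need to be in that same order when they are combined.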

<h3>CSV creation</h3>

Import the package and create an instance of the class

In [13]:
from ArtistFeatures import ArtistFeatures
AF = ArtistFeatures(graph)

Normalize the popularity, follower and album counts, then add all the features

In [14]:
popularities_normalized =  [x / 100.0 for x in popularities]
followers_normalized =  [x / max(followers) for x in followers]
num_albums_normalized =  [x / max(num_albums) for x in num_albums]

AF.add_Feature('Popularities', popularities_normalized)
AF.add_Feature('Followers', followers_normalized)
AF.add_Feature('Num_Albums', num_albums_normalized) 

AF.add_Feature('Page_rank',pagerank)
AF.add_Feature('Closeness_centralities',closeness_centralities)
AF.add_Feature('Degree_centralities',degree_centralities)
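The division by the maximum used above can be sketched as a small helper. This is only a possible variant, assuming non-negative values and a non-empty list; the name `normalize_by_max` is hypothetical. It computes the maximum once instead of once per element, and guards against a maximum of zero.

```python
def normalize_by_max(values):
    """Scale non-negative values into [0, 1] by dividing by the largest one."""
    largest = max(values)  # computed once, not once per element
    if largest == 0:
        # Avoid dividing by zero when every value is 0
        return [0.0 for _ in values]
    return [v / largest for v in values]

# Toy follower counts (hypothetical data)
toy_followers = [10, 250, 1000]
print(normalize_by_max(toy_followers))  # [0.01, 0.25, 1.0]
```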

Save the csv

In [15]:
save_path_csv = 'CSV/kw_2.csv'
AF.create_csv(save_path_csv)

<h2>Part 3: Machine Learning</h2>

Import all necessary packages

In [16]:
import pandas as pd
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

Load the dataset, and divide it into labels Y and data X to train the model

In [17]:
csv_path = 'CSV/kw_2.csv'
ds = pd.read_csv(csv_path, sep = ',')
ds = ds.dropna() 

Data = ds.values
n = Data.shape[0]
Y = Data[:n,1]
X = Data[:n,2:]
feature_names = ds.columns[2:]

print("Amount of data : ",n)
print("Features : ",feature_names.values)

Amount of data :  1994
Features :  ['Followers' 'Num_Albums' 'Page_rank' 'Closeness_centralities'
 'Degree_centralities']


Divide the dataset into training, validation and test sets

In [18]:
n_train = int(2./3.*n)
n_val = int((n-n_train)/2.)
n_test = n - n_train - n_val
from sklearn.model_selection import train_test_split
Xtrain_and_val, Xtest, Ytrain_and_val, Ytest = train_test_split(X, Y, test_size=n_test/n)
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_and_val, Ytrain_and_val, test_size=n_val/(n_train+n_val))


print("Amount of data for training and deciding parameters:",n_train)
print("Amount of data for validation :",n_val)
print("Amount of data for test:",n_test)

Amount of data for training and deciding parameters: 1329
Amount of data for validation : 332
Amount of data for test: 333
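The sizes printed above follow from integer truncation. The sketch below repeats the same arithmetic in isolation; the helper name `split_sizes` is made up for illustration.

```python
def split_sizes(n):
    """Reproduce the notebook's arithmetic for a 2/3, 1/6, 1/6 split."""
    n_train = int(2.0 / 3.0 * n)      # int() truncates towards zero
    n_val = int((n - n_train) / 2.0)
    n_test = n - n_train - n_val      # the remainder goes to the test set
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1994)
print(n_train, n_val, n_test)  # 1329 332 333
```

`train_test_split` also accepts an integer `test_size`, interpreted as an absolute number of samples, and a `random_state` argument for reproducible splits. Passing `n_test` and `n_val` directly would avoid converting the counts to fractions and back.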


<h3>Linear Regression</h3>

In [19]:
LR = linear_model.LinearRegression()

LR.fit(Xtrain_and_val, Ytrain_and_val)

print("1 - R^2 on training + validation data : %.4f"%(1 - LR.score(Xtrain_and_val,Ytrain_and_val)))
print("1 - R^2 on test data : %.4f"%(1 - LR.score(Xtest,Ytest)))

1 - R^2 on training + validation data : 0.5288
1 - R^2 on test data : 0.5979
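For scikit-learn regressors, `score` returns the coefficient of determination R^2, so the quantity printed above is the residual sum of squares divided by the total sum of squares: 0 means a perfect fit, and 1 is what a model that always predicts the mean would get. A minimal sketch on toy data, with the hypothetical helper name `one_minus_r2`:

```python
def one_minus_r2(y_true, y_pred):
    """Residual sum of squares divided by total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return ss_res / ss_tot  # assumes y_true is not constant

# Toy labels (hypothetical data)
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = one_minus_r2(y_true, y_true)       # predictions equal the labels
baseline = one_minus_r2(y_true, [2.5] * 4)   # always predict the mean
print(perfect, baseline)  # 0.0 1.0
```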


<h3>Decision Trees</h3>

Create a decision tree and find the optimal depth and minimum number of samples per leaf using grid search with a 10-fold cross-validation procedure

In [20]:
DTs = DecisionTreeRegressor()
param_grid = {
    'max_depth': [i for i in range (1,20)],
    'min_samples_leaf': [j for j in range (1,5)]
}

gs = GridSearchCV(estimator=DTs,
                 param_grid= param_grid,
                 cv = 10)
gs.fit(Xtrain_and_val,Ytrain_and_val)

print("Best model found : ",gs.best_estimator_)
print("Best score found : %.4f"%gs.best_score_)

Best model found :  DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)
Best score found : 0.8576
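To give an idea of the cost of this search, the sketch below counts the candidates in `param_grid` and the resulting cross-validation fits. It also converts the best score to the 1 - R^2 scale used elsewhere in this notebook, assuming the default scoring, in which case `best_score_` is the mean cross-validated R^2 of the best candidate.

```python
from itertools import product

max_depths = list(range(1, 20))         # 1..19, as in param_grid
min_samples_leaves = list(range(1, 5))  # 1..4, as in param_grid
n_folds = 10

# Every combination of the two hyperparameters is one candidate model
candidates = list(product(max_depths, min_samples_leaves))
n_cv_fits = len(candidates) * n_folds   # one fit per candidate per fold

best_score = 0.8576                     # value printed by the cell above
print(len(candidates), n_cv_fits, round(1 - best_score, 4))  # 76 760 0.1424
```

With the default `refit=True`, the grid search fits the best candidate once more on all the data passed to `fit`.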


Take the best estimator found (refit by the grid search on the training + validation data) and print its results

In [21]:
DT_opt = gs.best_estimator_

print("1 - R^2 on training + validation data : %.4f"%(1 - DT_opt.score(Xtrain_and_val,Ytrain_and_val)))
print("1 - R^2 on test data : %.4f"%(1 - DT_opt.score(Xtest,Ytest)))

1 - R^2 on training + validation data : 0.1212
1 - R^2 on test data : 0.1339


<h3>Decision Trees Without the Number of Followers as a Feature</h3>

Modified dataset : creation and split

In [22]:
######### New dataset ###########
Data = ds.values
n = Data.shape[0]
Y = Data[:n,1]
X_mod = Data[:n,3:]
feature_names = ds.columns[3:]

print("Amount of data : ",n)
print("Features for modified dataset : ",feature_names.values)

########  Split into training validation and test #########

n_train = int(2./3.*n)
n_val = int((n-n_train)/2.)
n_test = n - n_train - n_val
from sklearn.model_selection import train_test_split
X_mod_train_and_val, X_mod_test, Ytrain_and_val, Ytest = train_test_split(X_mod, Y, test_size=n_test/n)
X_mod_train, X_mod_val, Ytrain, Yval = train_test_split(X_mod_train_and_val, Ytrain_and_val, test_size=n_val/(n_train+n_val))



Amount of data :  1994
Features for modified dataset :  ['Num_Albums' 'Page_rank' 'Closeness_centralities' 'Degree_centralities']
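The column slicing can be sketched on the header row alone. The first column name, `Artist`, is an assumption (the notebook never prints it), and `Popularities` is presumed to be column 1 because it is the first feature added to the CSV.

```python
# Hypothetical header row; only the names from 'Followers' onwards are
# printed in the notebook, the first two are assumptions.
columns = ["Artist", "Popularities", "Followers", "Num_Albums",
           "Page_rank", "Closeness_centralities", "Degree_centralities"]

label = columns[1]           # Y = Data[:, 1]
features_full = columns[2:]  # X = Data[:, 2:]
features_mod = columns[3:]   # X_mod = Data[:, 3:], 'Followers' dropped

print(label)
print(features_mod)
```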


Decision trees for modified dataset : GridSearch

In [23]:
DT_mod_s = DecisionTreeRegressor()
param_grid = {
    'max_depth': [i for i in range (1,20)],
    'min_samples_leaf': [j for j in range (1,5)]
}

gs_mod = GridSearchCV(estimator=DT_mod_s,
                 param_grid= param_grid,
                 cv = 10)
gs_mod.fit(X_mod_train_and_val,Ytrain_and_val)

print("Best model found : ",gs_mod.best_estimator_)
print("Best score found : %.4f"%gs_mod.best_score_)

Best model found :  DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)
Best score found : 0.6056


Decision trees : best model results

In [24]:
DT_mod_opt = gs_mod.best_estimator_

print("1 - R^2 on modified training + validation data : %.4f"%(1 - DT_mod_opt.score(X_mod_train_and_val,Ytrain_and_val)))
print("1 - R^2 on modified test data : %.4f"%(1 - DT_mod_opt.score(X_mod_test,Ytest)))

1 - R^2 on modified training + validation data : 0.3281
1 - R^2 on modified test data : 0.3887
