# Data Import and Preparation

### <span style="color:teal">We import the data for 1 week in April 2007 and then also the data for 1 week in April 2008.</span>

Furthermore, we choose to include only 5 of the 20 edges that each row of the data provides for its node. This decision keeps the final network manageable in terms of computation on our systems. The step can be skipped if one has the computational power to deal with the entire network.

Selecting this subset limits our analysis of YouTube's algorithm. It is also safe to assume, however, that even the full list of 20 edges provided in the raw dataset for each node the crawler moved through is only a partial representation of the connections the algorithm facilitates.
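
As a rough illustration of the subsetting described above, the sketch below keeps only the first `k` related-ID columns of a toy frame. The helper name `keep_first_k_related` and the toy data are assumptions made for illustration; the notebook itself drops the unwanted columns by name in a later cell.

```python
import pandas as pd

def keep_first_k_related(df, k=5, prefix="related ID"):
    """Return a copy of df without related-ID columns numbered above k.

    Hypothetical helper: assumes columns are named like "related ID7".
    """
    related = [c for c in df.columns if c.startswith(prefix)]
    # the column number is whatever follows the prefix
    to_drop = [c for c in related if int(c[len(prefix):]) > k]
    return df.drop(columns=to_drop)

# toy data standing in for one crawl file
toy = pd.DataFrame({
    "video ID": ["a", "b"],
    "related ID1": ["b", "c"],
    "related ID2": ["c", "d"],
    "related ID3": ["d", "e"],
})
small = keep_first_k_related(toy, k=2)
print(list(small.columns))
```

Passing a larger `k` (up to 20) would keep more of the edges when the hardware allows it.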

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

<hr>

### Defining helper functions

In [2]:
# function for reading the data in a folder into a list of dataframes
# CAN ONLY READ AND STORE DATA FROM ONE FOLDER AT A TIME, where one folder contains the data for one day's crawl.
def folderTOdf(folder_path):
    file_path = folder_path + "/{0}.txt"

    # set the column names: 9 attribute columns followed by 49 related-ID columns
    attr_cols = ['video ID', 'uploader', 'age', 'category', 'length', 'views', 'rate', 'ratings', 'comments']
    related_cols = ['related ID{0}'.format(i) for i in range(1, 50)]
    col_name = attr_cols + related_cols

    # loop over the 4 files in the folder and read each one into a DataFrame
    df_list = []
    for i in range(4):
        # create the file path string with the variable i
        path = file_path.format(i)
        # read the file into a DataFrame
        df = pd.read_csv(path, delimiter='\t', header=None, names=col_name, low_memory=False)
        # keep only the first 20 related IDs (drop 'related ID21' to 'related ID49')
        df = df.drop(columns=related_cols[20:])
        df_list.append(df)
    return df_list  # will contain 4 dfs - one for each level of crawl.

# function for exporting a df as a csv
def exportCSV(df):
    # use the dataframe's own column names as the CSV header
    header = list(df.columns)

    filename = input('Please Enter The Name of CSV ->\n')

    # define the desired formatting options
    format_options = {'sep': ',',  # use comma as delimiter
                      'index': False,  # don't include index column
                      'float_format': '%.2f',  # format float values to 2 decimal places
                      'header': header,  # column names written as the header row
                      'encoding': 'utf-8'}  # specify encoding type

    # export the dataframe to CSV using the formatting options
    df.to_csv(filename, **format_options)

Now we set the file paths for importing the raw data, which is stored in the directory `first_crawl`.

The `first_crawl` directory has its own structure and stores the actual data in `.txt` files, each of which corresponds to the depth level of the crawl to which the data belongs.

We will, however, use only a subset of the files, so we created another directory which stores only the data corresponding to the selected duration of analysis.
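
A minimal sketch of how the per-depth file paths inside one crawl folder can be built, assuming (as `folderTOdf` above does) that each folder holds files named `0.txt` to `3.txt`. The helper name `depth_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def depth_files(folder_path, depths=4):
    """List the expected per-depth .txt files of one crawl folder.

    Assumption: files are named 0.txt, 1.txt, ... as in folderTOdf.
    """
    folder = Path(folder_path)
    return [folder / "{0}.txt".format(d) for d in range(depths)]

paths = depth_files("first_crawl/April2007/0403")
print([p.name for p in paths])
```

Checking `p.exists()` for each returned path before reading would catch a missing depth file early.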

In [86]:
# set the variables for ALL the file paths
# for 1st week of April, 2007 -->>
fpath1_7 = "first_crawl/April2007/0403"
fpath2_7 = "first_crawl/April2007/0410"
# for 1st week of April, 2008 -->>
fpath1_8 = "first_crawl/April2008/080402"
fpath2_8 = "first_crawl/April2008/080404"
fpath3_8 = "first_crawl/April2008/080406"

In [87]:
folderList1 = [fpath1_7,fpath2_7]
folderList2 = [fpath1_8,fpath2_8,fpath3_8]
# each list will contain as many df_lists as the number of file paths provided
folder_df_list1 = []
folder_df_list2 = []


for path in folderList1:
    folder_df_list1.append(folderTOdf(path))
for path in folderList2:
    folder_df_list2.append(folderTOdf(path))

### Data cleaning

We need the following steps before further processing of the data:
- combine the individual dfs stored in the lists into one single dataframe per year
- drop 3/4 of the related IDs (keep 5 of the 20)
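
One possible way to do the first step with a single `pd.concat` call is sketched below. The helper name `combine_nested` and the toy frames are assumptions for illustration; the notebook's own cell concatenates inside a loop, which gives the same rows but copies the growing frame on every iteration.

```python
from itertools import chain

import pandas as pd

def combine_nested(nested_dfs):
    """Flatten a list of lists of DataFrames and concatenate them once."""
    flat = list(chain.from_iterable(nested_dfs))
    return pd.concat(flat)

# toy stand-ins: two crawl days, each a list of per-depth frames
day1 = [pd.DataFrame({"video ID": ["a"]}), pd.DataFrame({"video ID": ["b"]})]
day2 = [pd.DataFrame({"video ID": ["c"]})]
combined = combine_nested([day1, day2])
print(len(combined))
```

As in the loop version, the original row indices are kept, so index values can repeat across files.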

<hr>

In [159]:
# step 1 - combining into one DF
df7 = pd.DataFrame()
df8 = pd.DataFrame()

for item in folder_df_list1:
    for df in item:
        df7 = pd.concat([df7,df])
        
for item in folder_df_list2:
    for df in item:
        df8 = pd.concat([df8,df])

df7.head(3)

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments,related ID1,...,related ID11,related ID12,related ID13,related ID14,related ID15,related ID16,related ID17,related ID18,related ID19,related ID20
0,W91sqAs-_-g,dusted21,776.0,Music,249.0,1556837.0,4.61,7314.0,3899.0,tZw-8RSyvh8,...,_oPYs-LNYGo,GzqvzhpLfIg,f2uBfi4miC8,8Eaj9OZ--K0,mj7mYbHEasI,gJ0I92_1Vt8,n58uchRpgO0,jDVPJ_7dS3k,KizKliQvF_M,w8dpP4uQglk
1,oqcaJ4NrUKA,freefuelsaver,777.0,Autos & Vehicles,95.0,224066.0,0.0,0.0,30.0,4acBKXjveJI,...,CQr3MnK1RF4,yFs_o1a5TNk,fEeCv3S7ZJU,jCNYbRZSTcs,A2-e3Fgah5A,mHnV2L-RuDo,tJBQUM52dhw,6NbqZy-8FGg,jHV4RZsl-bI,c1lwEBlfvDE
2,XSGc5Vkh_1g,wonderwhenmedia,776.0,People & Blogs,156.0,161916.0,3.77,380.0,874.0,9c6kzCR07PQ,...,857HnIPR_r0,hABUd_qEYTE,_rZtvAynX2M,jDVPJ_7dS3k,oGhl8ySfKHY,R-8DnUw97WM,39mBHoRsbCw,xO3pzL9klmE,12Yq96OFS-c,ljHpDcfljp0


In [160]:
# step 2 - dropping 3/4 of the related IDs ('related ID6' to 'related ID20')
drop_cols = ['related ID{0}'.format(i) for i in range(6, 21)]
df7 = df7.drop(columns=drop_cols)
df8 = df8.drop(columns=drop_cols)

df7.head(3)

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments,related ID1,related ID2,related ID3,related ID4,related ID5
0,W91sqAs-_-g,dusted21,776.0,Music,249.0,1556837.0,4.61,7314.0,3899.0,tZw-8RSyvh8,-L6tFCeR_ZQ,SgHk5JDCdx0,4U5dmIVBzq8,MwzSxbqyzcE
1,oqcaJ4NrUKA,freefuelsaver,777.0,Autos & Vehicles,95.0,224066.0,0.0,0.0,30.0,4acBKXjveJI,Zk22r_GJV-Q,resxQ9PgiqY,ttk-52Zm9Gk,0V5_b9eGufU
2,XSGc5Vkh_1g,wonderwhenmedia,776.0,People & Blogs,156.0,161916.0,3.77,380.0,874.0,9c6kzCR07PQ,nxT2TaktQ2I,fr1sR3Zdc4g,gh6tBFTNWNs,llP4uTdXBSI


In [163]:
# extracting the edge list from the dataframe now
def edgeLst(df):
    E_lst = [] 
    for idx,row in tqdm(df.iterrows()):
        connections = row
        A = list(connections)
        for item in A[9:]:
            edge = (A[0],item)
            E_lst.append(edge)
    return E_lst

In [164]:
eList1 = edgeLst(df7)

308283it [00:07, 43121.85it/s]


In [166]:
eList1[:10]

[('W91sqAs-_-g', 'tZw-8RSyvh8'),
 ('W91sqAs-_-g', '-L6tFCeR_ZQ'),
 ('W91sqAs-_-g', 'SgHk5JDCdx0'),
 ('W91sqAs-_-g', '4U5dmIVBzq8'),
 ('W91sqAs-_-g', 'MwzSxbqyzcE'),
 ('oqcaJ4NrUKA', '4acBKXjveJI'),
 ('oqcaJ4NrUKA', 'Zk22r_GJV-Q'),
 ('oqcaJ4NrUKA', 'resxQ9PgiqY'),
 ('oqcaJ4NrUKA', 'ttk-52Zm9Gk'),
 ('oqcaJ4NrUKA', '0V5_b9eGufU')]

In [167]:
eList2 = edgeLst(df8)
eList2[:10]

438084it [00:10, 43345.99it/s]


[('hu28avESP68', 'XaEOGrodVYs'),
 ('hu28avESP68', 'ys8duGx0Adw'),
 ('hu28avESP68', 'kSMhRFPNiVo'),
 ('hu28avESP68', 'JH4eNPPs8wI'),
 ('hu28avESP68', 'C7_wlNIw6z8'),
 ('PmI76gyuWcM', 'x2nYyQd37xI'),
 ('PmI76gyuWcM', 'Xqhhdb0we0o'),
 ('PmI76gyuWcM', 'VruKg9Jop0Q'),
 ('PmI76gyuWcM', 'k3zgAJP3D58'),
 ('PmI76gyuWcM', 'ig6ClFm_yhM')]
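
The row-by-row `edgeLst` above also emits an edge when a related ID is missing, which appears to be why blank targets show up at the end of the second edge list later in this notebook. A vectorised sketch that skips missing targets is given below. The function name `edge_list_melt` and the toy data are assumptions for illustration, and the edges come out grouped by related-ID column rather than by source row.

```python
import pandas as pd

def edge_list_melt(df, id_col="video ID", prefix="related ID"):
    """Build (source, target) pairs from the related-ID columns, skipping NaN targets."""
    related = [c for c in df.columns if c.startswith(prefix)]
    # wide -> long: one row per (source, related column) pair
    long_df = df.melt(id_vars=[id_col], value_vars=related, value_name="target")
    long_df = long_df.dropna(subset=["target"])
    return list(zip(long_df[id_col], long_df["target"]))

toy = pd.DataFrame({
    "video ID": ["a", "b"],
    "views": [10, 20],
    "related ID1": ["x", "y"],
    "related ID2": ["z", None],  # "b" has only one related video
})
edges = edge_list_melt(toy)
print(sorted(edges))
```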

Checking whether the connected nodes have attribute data themselves. We perform this step because of the way the crawler scraped the data: it did not add the meta-data for the connected nodes at the same time.

In [118]:
def make_nodesets(df):
    Snode_lst = set() #Source nodes
    Tnode_lst = set() #Target nodes, referred to in the dataframes as the related IDs
    
    for item in df["video ID"]:
        Snode_lst.add(item)
    
    col_name = "related ID{0}"
    for i in range(1,6):
        a = set(df[col_name.format(i)])
        Tnode_lst = Tnode_lst | a
    
    return Snode_lst, Tnode_lst

    
node_lst1,connected_nodes1 = make_nodesets(df7)
node_lst2,connected_nodes2 = make_nodesets(df8)
    
print(len(node_lst1))        # output - 3,05,906
print(len(connected_nodes1)) # output - 7,79,254

print(len(node_lst2))        # output - 4,38,084
print(len(connected_nodes2)) # output - 8,54,612

305906
779254
438084
854612


Let's calculate how many of the nodes have meta-data. We take the union of `node_lst1` and `node_lst2`, as they are the primary nodes through which the crawler scraped the data, and we know that we have meta-data for them.

In [119]:
good_n = len(node_lst1 | node_lst2)
print("number of good nodes -> ",good_n) 

# now we take the intersection of the nodes with attribute data and the total number of nodes in the network data for year 07.
cnodes = (node_lst1 | node_lst2) & (node_lst1 | connected_nodes1)
print("common nodes for 07 ->", len(cnodes)) # 3,18,420
ratio7 = len(cnodes)/len(node_lst1 | connected_nodes1)

# NOTE: unlike ratio7, ratio8 divides the size of the whole good-node set by the 08 network size
# without intersecting first, so it can overstate the meta-data coverage for 08.
ratio8 = (len(node_lst1 | node_lst2))/len(node_lst2 | connected_nodes2)
print(ratio7)
print(ratio8)
percentage7 = ratio7 * 100
percentage8 = ratio8 * 100

print("This much of network07 has meta-data -> ",percentage7)  # 35.5 %
print("This much of network08 has meta-data -> ",percentage8)  # 70.9 %

number of good nodes ->  737467
common nodes for 07 -> 318420
0.3552617045465956
0.7097887862790485
This much of network07 has meta-data ->  35.526170454659564
This much of network08 has meta-data ->  70.97887862790485
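
The coverage figure is an intersect-then-divide calculation on sets, and the same formula can be applied to either year. The sketch below shows it on toy sets; the function name `metadata_coverage` is an assumption for illustration and does not appear in the notebook.

```python
def metadata_coverage(nodes_with_meta, network_nodes):
    """Fraction of network_nodes that also appear in nodes_with_meta."""
    if not network_nodes:
        return 0.0
    return len(nodes_with_meta & network_nodes) / len(network_nodes)

good = {"a", "b", "c"}          # nodes we have attributes for
network = {"b", "c", "d", "e"}  # all nodes (sources and targets) of one year
print(metadata_coverage(good, network))  # 2 of the 4 network nodes have meta-data
```

With the notebook's variables, the two calls would be `metadata_coverage(node_lst1 | node_lst2, node_lst1 | connected_nodes1)` and the equivalent with `node_lst2 | connected_nodes2`.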


>Now we add additional nodes, which we will call extra nodes, from the raw data for other weeks. The assumption is that source nodes there might be the same as target nodes in our previous data, so we can take their meta-data and add it to our dataframes.

In [120]:
# set the variable for the EXTRA META DATA file paths
# for April,2007 -->>
fpath3_7 = "first_crawl/April2007/0413"
fpath4_7 = "first_crawl/April2007/0418"
fpath5_7 = "first_crawl/April2007/0420"
fpath6_7 = "first_crawl/April2007/0422"
fpath7_7 = "first_crawl/April2007/0424"
fpath8_7 = "first_crawl/April2007/0426"
fpath9_7 = "first_crawl/April2007/0428"
fpath10_7 = "first_crawl/April2007/0430"
fpath11_7 = "first_crawl/May2007/0502"
fpath12_7 = "first_crawl/May2007/0507"
fpath13_7 = "first_crawl/May2007/0509"
fpath14_7 = "first_crawl/May2007/0511"
fpath15_7 = "first_crawl/May2007/0513"
fpath16_7 = "first_crawl/March2007/0301"
#fpath17_7 = "first_crawl/March2007/0302"
fpath18_7 = "first_crawl/March2007/0303"
fpath19_7 = "first_crawl/March2007/0305"
fpath20_7 = "first_crawl/March2007/0309"

# for 2008 -->>
fpath3_8 = "first_crawl/April2008/080408"
fpath4_8 = "first_crawl/April2008/080412"
fpath5_8 = "first_crawl/April2008/080414"
fpath6_8 = "first_crawl/April2008/080416"
fpath7_8 = "first_crawl/April2008/080418"
fpath8_8 = "first_crawl/April2008/080422"
fpath9_8 = "first_crawl/April2008/080424"
fpath10_8 = "first_crawl/April2008/080426"

In [121]:
f_List = [fpath3_7,fpath4_7,fpath5_7,fpath6_7,fpath7_7,fpath8_7,fpath9_7,fpath10_7,fpath11_7,fpath12_7,fpath13_7,fpath14_7,fpath15_7,
         fpath16_7,fpath18_7,fpath19_7,fpath20_7,fpath3_8,fpath4_8,fpath5_8,fpath6_8,fpath7_8,fpath8_8,fpath9_8,fpath10_8]
extraLst = []   # will contain as many df_lists as there are file paths provided.

for path in f_List:
    extraLst.append(folderTOdf(path))

In [122]:
df_extra = pd.DataFrame()

for item in extraLst:
    for df in item:
        df_extra = pd.concat([df_extra,df])
        
# repeat the cleaning steps again but this time we drop all the related nodes because we are interested in only the attributes of the source nodes and we
# remove all the rows with NaN or nan or empty values.

In [123]:
df_extra = df_extra.drop(columns=['related ID{0}'.format(i) for i in range(1, 21)])
df_extra.head() # Row count at this point - 22,09,596

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments
0,gicTEKbczxw,Ryan06171,785.0,Entertainment,104.0,229247.0,4.68,2454.0,1116.0
1,LOP5By5FevU,consumerist,785.0,Entertainment,125.0,168285.0,3.51,398.0,831.0
2,JT3_itf-ALg,TNAwrestling,786.0,Sports,1063.0,93708.0,4.52,71.0,37.0
3,VwXQM3Vnt3k,kylemj,786.0,Entertainment,565.0,91057.0,3.51,185.0,202.0
4,ZqnCQSd_ik0,autocuctioncenter,786.0,Autos & Vehicles,94.0,90259.0,4.16,449.0,287.0


In [124]:
def remove_nan_empty_rows(dataframe):
    # treat empty strings as missing values
    dataframe.replace('', np.nan, inplace=True)
    # drop every row that has at least one missing value;
    # after this no NaN is left, so no further dropna or fill step is needed
    dataframe.dropna(inplace=True)
    return dataframe

df_extra = remove_nan_empty_rows(df_extra) # Row count after this - 22,04,709
df_extra.head()

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments
0,gicTEKbczxw,Ryan06171,785.0,Entertainment,104.0,229247.0,4.68,2454.0,1116.0
1,LOP5By5FevU,consumerist,785.0,Entertainment,125.0,168285.0,3.51,398.0,831.0
2,JT3_itf-ALg,TNAwrestling,786.0,Sports,1063.0,93708.0,4.52,71.0,37.0
3,VwXQM3Vnt3k,kylemj,786.0,Entertainment,565.0,91057.0,3.51,185.0,202.0
4,ZqnCQSd_ik0,autocuctioncenter,786.0,Autos & Vehicles,94.0,90259.0,4.16,449.0,287.0
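
To make the effect of the cleaning step concrete, here is a small sketch on toy data (the frame and its values are made up for illustration). Empty strings are first turned into `NaN`, then any row with a missing value is dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "video ID": ["a", "b", "c"],
    "uploader": ["u1", "", "u3"],   # empty uploader for "b"
    "views": [10.0, 20.0, np.nan],  # missing view count for "c"
})
# non-mutating variant of the same two steps
cleaned = toy.replace("", np.nan).dropna()
print(cleaned["video ID"].tolist())
```

Only row `a` survives, because it is the only one with every attribute present.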


In [130]:
nodss = df_extra["video ID"]
extraN = set(nodss)
len(extraN) # 21,13,475

2113475

In [133]:
# let's try to see how many of the extra nodes exist in the original data.
#extraN = extraN | set(df_uc['video ID'])
good_n = (node_lst1 | node_lst2) | extraN

# now we take the intersection of the nodes with attribute data and the total number of nodes in the network data for year 07.
cnodes = good_n & (node_lst1 | connected_nodes1)
print("common nodes for 07 ->", len(cnodes)) # 4,26,945
ratio7 = len(cnodes)/len(node_lst1 | connected_nodes1)
percentage7 = ratio7 * 100
print("This much of network07 has meta-data -> ",percentage7)  # 47.63 %

common nodes for 07 -> 426945
This much of network07 has meta-data ->  47.6343221052843


In [134]:
# now we take the intersection of the nodes with attribute data and the total number of nodes in the network data for year 08.
cnodes = good_n & (node_lst2 | connected_nodes2)
print("common nodes for 08 ->", len(cnodes)) # 5,35,858
ratio8 = len(cnodes)/len(node_lst2 | connected_nodes2)
percentage8 = ratio8 * 100
print("This much of network08 has meta-data -> ",percentage8)  # 51.57 %

common nodes for 08 -> 535858
This much of network08 has meta-data ->  51.57464665373751


### Merging data

Let's say we are happy to have only about 48-52% of each network with meta-data. We need to export a CSV file with all these nodes and their attributes.

In [137]:
# list of dataframes which we are extracting the node attributes from:
# df_extra, df7 and df8

# creating the merged df; the related-ID columns are dropped from df7 and df8
# so that only the node attributes remain (errors='ignore' makes the cell safe to re-run)
related_cols = ['related ID1','related ID2','related ID3','related ID4','related ID5']

df_comp = pd.concat([
    df_extra,
    df7.drop(columns=related_cols, errors='ignore'),
    df8.drop(columns=related_cols, errors='ignore'),
])

df_comp  # row count - 25,12,992

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments
0,gicTEKbczxw,Ryan06171,785.0,Entertainment,104.0,229247.0,4.68,2454.0,1116.0
1,LOP5By5FevU,consumerist,785.0,Entertainment,125.0,168285.0,3.51,398.0,831.0
2,JT3_itf-ALg,TNAwrestling,786.0,Sports,1063.0,93708.0,4.52,71.0,37.0
3,VwXQM3Vnt3k,kylemj,786.0,Entertainment,565.0,91057.0,3.51,185.0,202.0
4,ZqnCQSd_ik0,autocuctioncenter,786.0,Autos & Vehicles,94.0,90259.0,4.16,449.0,287.0
...,...,...,...,...,...,...,...,...,...
109870,yHzOWv7W8E8,bs13gas,606.0,Music,25.0,1732.0,4.50,2.0,4.0
109871,LSWRglDJNDQ,videogum,947.0,Music,365.0,1647.0,4.89,9.0,1.0
109872,2FxAiD3NCSE,anprim,748.0,Music,147.0,7185.0,4.67,6.0,2.0
109873,L_XFMCgeI7c,marcelolucio,721.0,Music,261.0,295360.0,4.91,693.0,352.0


In [138]:
# Find the video IDs that occur more than once in the merged dataframe
column_name = "video ID"
non_unique_values = df_comp[df_comp.duplicated(subset=column_name, keep=False)][column_name]

# Drop every row whose video ID is repeated (all copies are removed, not only the repeats)
df_comp = df_comp[~df_comp[column_name].isin(non_unique_values)]

# Reset the index of the dataframe
df_comp = df_comp.reset_index(drop=True)
df_comp  # row count is - 26,63,007

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments
0,gicTEKbczxw,Ryan06171,785.0,Entertainment,104.0,229247.0,4.68,2454.0,1116.0
1,LOP5By5FevU,consumerist,785.0,Entertainment,125.0,168285.0,3.51,398.0,831.0
2,VwXQM3Vnt3k,kylemj,786.0,Entertainment,565.0,91057.0,3.51,185.0,202.0
3,ZqnCQSd_ik0,autocuctioncenter,786.0,Autos & Vehicles,94.0,90259.0,4.16,449.0,287.0
4,XI6P_MsrL0Q,happynationnow,786.0,Comedy,199.0,85195.0,4.41,975.0,545.0
...,...,...,...,...,...,...,...,...,...
2663002,GHxk4zX430M,Ialtriaga,937.0,Music,319.0,5150.0,5.00,17.0,5.0
2663003,yHzOWv7W8E8,bs13gas,606.0,Music,25.0,1732.0,4.50,2.0,4.0
2663004,LSWRglDJNDQ,videogum,947.0,Music,365.0,1647.0,4.89,9.0,1.0
2663005,2FxAiD3NCSE,anprim,748.0,Music,147.0,7185.0,4.67,6.0,2.0
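
The de-duplication cell above removes every copy of a repeated video ID. The sketch below contrasts that with keeping the first copy, using toy data made up for illustration. Which behaviour is preferable depends on whether conflicting attribute rows for the same video should be trusted; the alternative is shown only as an option, not as what the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "video ID": ["a", "b", "a", "c"],
    "views": [1, 2, 3, 4],
})

# behaviour of the cell above: all rows of a repeated ID are dropped
dup_mask = toy.duplicated(subset="video ID", keep=False)
drop_all = toy[~dup_mask].reset_index(drop=True)

# alternative: keep the first row seen for each ID
keep_first = toy.drop_duplicates(subset="video ID", keep="first").reset_index(drop=True)

print(drop_all["video ID"].tolist())
print(keep_first["video ID"].tolist())
```

In the first result video `a` loses its attributes entirely, while the second keeps one row for it.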


### Filtering the dataframe

In [139]:
# from the above created dataframe, we can select all the node's attributes for each year's network graph.
data_07 = pd.DataFrame()
data_08 = pd.DataFrame()

columns = ['video ID', 'uploader', 'age', 'category', 'length', 'views', 'rate',
       'ratings', 'comments']

data_07 = data_07.reindex(columns=columns)
data_08 = data_08.reindex(columns=columns)

def filter_and_create_final_df(df, set_A):
    # set_A holds the video IDs of the nodes we need to keep
    # Filter rows based on values in the "video ID" column
    filtered_df = df[df["video ID"].isin(set_A)]

    # Create a new DataFrame called "final"
    final = pd.DataFrame(filtered_df)

    return final


data_07 = filter_and_create_final_df(df_comp, (node_lst1 | connected_nodes1))
data_07 # row count of 3,84,474

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments
7,s6sUUc3KRYQ,lonelygirl15,785.0,People & Blogs,161.0,46751.0,3.12,344.0,266.0
93,g4QCK5mdNiM,tetrooney,784.0,Sports,425.0,145981.0,4.49,408.0,280.0
97,T8p1IuPcW-c,cr4fty3,782.0,Comedy,33.0,108898.0,4.69,387.0,147.0
102,nLshDQ-HhhA,tarantjuve,784.0,Sports,270.0,79681.0,4.70,159.0,140.0
107,23dWApnJEUA,dirrtydirrtysouth,785.0,Music,228.0,59557.0,4.60,289.0,163.0
...,...,...,...,...,...,...,...,...,...
2662895,LcmabAfWcIk,senglepeng,557.0,Music,184.0,12148.0,4.40,15.0,8.0
2662928,8mMI-0i2ICA,TennCali,739.0,Music,253.0,29976.0,4.79,29.0,11.0
2662946,UkBZ0U-xl8Y,CyberMSX,504.0,Music,217.0,50993.0,4.84,79.0,22.0
2662986,79yrbmcXftA,Trakse,753.0,Music,156.0,38486.0,4.89,107.0,36.0


In [140]:
data_08 = filter_and_create_final_df(df_comp, (node_lst2 | connected_nodes2))
data_08 # row count of 4,97,246

Unnamed: 0,video ID,uploader,age,category,length,views,rate,ratings,comments
20,p8Z-DIAthbM,VictorVB,786.0,Music,202.0,21432.0,4.81,81.0,28.0
50,iYrjc1Srzs4,tpmtv,786.0,News & Politics,99.0,11463.0,4.90,96.0,34.0
52,suV7P1m44kI,esmeedenters,786.0,Music,196.0,11001.0,4.69,298.0,202.0
90,2-avakrRUaU,artquest,781.0,Entertainment,328.0,194068.0,4.73,161.0,122.0
99,YkKHdniFMQ8,melchior64,782.0,Sports,122.0,95791.0,4.85,88.0,57.0
...,...,...,...,...,...,...,...,...,...
2663002,GHxk4zX430M,Ialtriaga,937.0,Music,319.0,5150.0,5.00,17.0,5.0
2663003,yHzOWv7W8E8,bs13gas,606.0,Music,25.0,1732.0,4.50,2.0,4.0
2663004,LSWRglDJNDQ,videogum,947.0,Music,365.0,1647.0,4.89,9.0,1.0
2663005,2FxAiD3NCSE,anprim,748.0,Music,147.0,7185.0,4.67,6.0,2.0


## Exporting the data

In [141]:
exportCSV(data_07)

Please Enter The Name of CSV ->
 S_data07.csv


In [142]:
exportCSV(data_08)

Please Enter The Name of CSV ->
 S_data08.csv


> We can also export the edge list corresponding to each year at this point.

In [169]:
edgeDF1 = pd.DataFrame(eList1)
edgeDF1  # row count of 15,41,415, i.e. exactly 5 edges for each of the 3,08,283 rows of df7.

Unnamed: 0,0,1
0,W91sqAs-_-g,tZw-8RSyvh8
1,W91sqAs-_-g,-L6tFCeR_ZQ
2,W91sqAs-_-g,SgHk5JDCdx0
3,W91sqAs-_-g,4U5dmIVBzq8
4,W91sqAs-_-g,MwzSxbqyzcE
...,...,...
1541410,pkQSb7Cvz-0,0vau-tW-8FA
1541411,pkQSb7Cvz-0,AmeBtjyNzrQ
1541412,pkQSb7Cvz-0,LTXNxxStn_U
1541413,pkQSb7Cvz-0,V4YYDD2TEAo


In [171]:
exportCSV(edgeDF1)

Please Enter The Name of CSV ->
 eList1.csv


In [172]:
edgeDF2 = pd.DataFrame(eList2)
edgeDF2 # the row count is 21,90,420, again exactly 5 edges for each of the 4,38,084 rows of df8.

Unnamed: 0,0,1
0,hu28avESP68,XaEOGrodVYs
1,hu28avESP68,ys8duGx0Adw
2,hu28avESP68,kSMhRFPNiVo
3,hu28avESP68,JH4eNPPs8wI
4,hu28avESP68,C7_wlNIw6z8
...,...,...
2190415,FIkY8xNEs9E,
2190416,FIkY8xNEs9E,
2190417,FIkY8xNEs9E,
2190418,FIkY8xNEs9E,


In [173]:
exportCSV(edgeDF2)

Please Enter The Name of CSV ->
 eList2.csv
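
For scripted runs, a non-interactive export can be handy. The sketch below writes an edge list with named columns to a path or buffer passed in as an argument. The function name `export_edges` and the column names `source` and `target` are assumptions for illustration; `exportCSV` above prompts for the file name instead and writes the default `0`/`1` column labels.

```python
import io

import pandas as pd

def export_edges(edges, path_or_buffer):
    """Write (source, target) pairs as a two-column CSV without the index."""
    edge_df = pd.DataFrame(edges, columns=["source", "target"])
    edge_df.to_csv(path_or_buffer, index=False)
    return edge_df

# an in-memory buffer stands in for a file such as "eList1.csv"
buffer = io.StringIO()
exported = export_edges([("a", "x"), ("b", "y")], buffer)
print(buffer.getvalue())
```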
