# Word Co-occurrences

This notebook analyses co-occurrences of topics in LGBTIQ datasets and visualizes them. Useful links:
- https://rainynotes.net/co-occurrence-matrix-visualization/
- https://www.pingshiuanchua.com/blog/post/keyword-network-analysis-with-python-and-gephi
- https://medium.com/data-analytics-at-nesta/how-to-create-network-visualisations-with-gephi-a-step-by-step-tutorial-e0743c49ec72
- https://gephi.org/users/tutorial-layouts/

Steps:
- Write topics in a text file, one line per id/dataset
- Count Vector Representation
- Turn into matrix

(The co-occurrence matrix is visualized as a network with Gephi, an open-source tool for visualizing networks.)

More on CountVectorizer:

https://medium.com/swlh/understanding-count-vectorizer-5dd71530c1b 

## Write topics in a text file, one line per id/dataset

In [7]:
import pandas as pd
from slugify import slugify

df = pd.read_csv("keywords_LGBTIQ_datasets.csv", sep = ';', encoding="latin-1")
df = df.rename(columns={'id': 'ident'})

# Build a dictionary with ident as keys and the list of topics as values
# 2nd answer here: https://stackoverflow.com/questions/56062627/converting-csv-to-dictionary-with-one-column-as-keys-and-one-column-as-its-value 
# astype(str) replaces applymap(str), which is deprecated in recent pandas versions
topics_by_id = df.astype(str).groupby('ident')['topics_level_1'].apply(list).to_dict()
#print(topics_by_id)

# Iterate over dictionary: https://realpython.com/iterate-through-dictionary-python/
# Convert list to string: https://www.freecodecamp.org/news/python-list-to-string-how-to-convert-lists-in-python/ 
# Open the file once and write one line of topics per ident
with open("output.txt", 'w') as f:
    for key in topics_by_id:
        # slugify lowercases each topic and replaces spaces and punctuation with dashes
        string_list = [slugify(element) for element in topics_by_id[key]]
        # Change - to _ (necessary for each topic to be treated as one word later)
        res = [st.replace("-", "_") for st in string_list]
        result_string = " ".join(res)
        f.write('%s,\n' % result_string)
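Why the dashes have to become underscores can be checked on a small example. This is only a sketch: the two input strings are made-up spellings of two topics, and it assumes CountVectorizer is used with its default settings, as in the cells below.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical spellings of the same line of topics (illustration only)
hyphenated = ["04-health 16-society-and-culture"]
underscored = ["04_health 16_society_and_culture"]

# With default settings, tokens are runs of two or more word characters.
# A dash is not a word character, an underscore is.
tokens_hyphen = sorted(CountVectorizer().fit(hyphenated).vocabulary_)
tokens_underscore = sorted(CountVectorizer().fit(underscored).vocabulary_)

print(tokens_hyphen)      # topics are split into separate words
print(tokens_underscore)  # each topic stays one token
```

With dashes, each topic falls apart into several words; with underscores, each topic is kept as a single vocabulary entry.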




## Count Vector Representation

CountVectorizer transforms a collection of texts into count vectors: each column stands for one word of the vocabulary, and each value is the number of times that word occurs in the given text.

https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

### Tokenize

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

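# Read the file written in the previous step; the with-block closes it again
with open("output.txt", "r") as f:
    data = f.read()
# Each record ends with ",\n"; drop the empty string left after the last one
data_into_list = [line for line in data.split(",\n") if line]
#print(data_into_list) 

count_vect = CountVectorizer()
count_vect.fit(data_into_list)

print(count_vect.vocabulary_)
# --> Unique words along with indices (words are sorted alphabetically; the index is the position in the sorted list)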

{'14_social_stratification_and_groupings': 13, '16_society_and_culture': 15, '08_law_crime_and_legal_systems': 7, '04_health': 3, '19_other': 18, '07_labour_and_employment': 6, '11_politics': 10, '05_history': 4, '01_demography_population_vital_statistics_and_censuses': 0, '13_science_and_technology': 12, '03_education': 2, '06_housing_and_land_use': 5, '02_economics': 1, '09_media_communication_and_language': 8, '15_social_welfare_policy_and_systems': 14, '18_transport_and_travel': 17, '17_trade_industry_and_markets': 16, '12_psychology': 11, '10_natural_environment': 9}


### Derive vector

In [9]:
# Encode the documents as a document-term count matrix
vector = count_vect.fit_transform(data_into_list)
Xc = vector.T @ vector  # Transposed matrix multiplied with original matrix gives a term-by-term co-occurrence matrix
Xc.setdiag(0)  # Set the diagonal to zero: a topic co-occurring with itself carries no information

# Print the co-occurrence matrix as a dense array
print(Xc.toarray())


[[   0   30   12  170    2   24   85   34   18    4   11    4    2  131
    14  109    8    4   35]
 [  30    0   45  303    0   91  292  108   37    2   28   10    5  564
    47  345   36   14   63]
 [  12   45    0  112    1   36  117   49   26    4   15    6    0  212
    12  112   16    4   34]
 [ 170  303  112    0    3  260  824  300  272   80   59   46   47 1774
   174 1565   24   45  515]
 [   2    0    1    3    0    0    6   14    0    0    5    0    0   15
     0    5    0    0  141]
 [  24   91   36  260    0    0  221   83   39    4   23    9    3  380
    33  244   28   10   54]
 [  85  292  117  824    6  221    0  271  117   14   84   32   10 1195
   115  728  100   35  246]
 [  34  108   49  300   14   83  271    0   57    8   39   13    1  519
    31  293   36   10   96]
 [  18   37   26  272    0   39  117   57    0   12   15    9    0  270
     9  165   12    3   54]
 [   4    2    4   80    0    4   14    8   12    0    2    2    0   42
     0   30    0    0   14]
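How the product of the transposed matrix with the original yields co-occurrence counts can be followed on a toy matrix. This is a sketch with invented data and plain NumPy instead of the sparse matrix used above.

```python
import numpy as np

# Hypothetical document-term matrix: rows are datasets, columns are topics A, B, C
X = np.array([
    [1, 1, 0],  # dataset 1 has topics A and B
    [1, 0, 1],  # dataset 2 has topics A and C
    [1, 1, 1],  # dataset 3 has all three topics
])

# Entry (i, j) sums, over all datasets, the product of the counts of topics i and j.
# With 0/1 counts this is the number of datasets that contain both topics.
cooc = X.T @ X

# The diagonal only says how often a topic occurs at all, so it is set to zero
np.fill_diagonal(cooc, 0)

print(cooc)
```

Here A and B co-occur in two datasets, A and C in two, and B and C in one. Note that if a topic appears more than once on a line, the counts are multiplied rather than treated as 0/1.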


## Create matrix for Gephi

In [59]:
names = count_vect.get_feature_names_out() # These are the entity names (i.e. keywords)
df = pd.DataFrame(data = Xc.toarray(), columns = names, index = names)

# Map the slugified topic codes to readable labels
# (named labels rather than dict, to avoid shadowing the built-in dict type)
labels = {'01_demography_population_vital_statistics_and_censuses': 'Demography',
          '02_economics': 'Economics',
          '03_education': 'Education',
          '04_health': 'Health',
          '05_history': 'History',
          '06_housing_and_land_use': 'Housing and land use',
          '07_labour_and_employment': 'Labour and employment',
          '08_law_crime_and_legal_systems': 'Law, crime, and legal systems',
          '09_media_communication_and_language': 'Media, communication, and language',
          '10_natural_environment': 'Natural environment',
          '11_politics': 'Politics',
          '12_psychology': 'Psychology',
          '13_science_and_technology': 'Science and technology',
          '14_social_stratification_and_groupings': 'Social stratification and groupings',
          '15_social_welfare_policy_and_systems': 'Social welfare policy and systems',
          '16_society_and_culture': 'Society and culture',
          '17_trade_industry_and_markets': 'Trade, industry, and markets',
          '18_transport_and_travel': 'Transport and travel',
          '19_other': 'Other'}

# Rename header and index
df = df.rename(columns=labels, index=labels)

df.to_csv('to_gephi.csv', sep = ',')
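As an alternative to the matrix file, the same information can be written as an edge list. This is a sketch under assumptions: the helper `matrix_to_edges` and the toy matrix are hypothetical, and the column names `Source`, `Target` and `Weight` assume that Gephi's spreadsheet import is used with an edges table.

```python
import numpy as np
import pandas as pd

def matrix_to_edges(matrix: pd.DataFrame) -> pd.DataFrame:
    """Turn a symmetric co-occurrence matrix into an undirected edge list."""
    values = matrix.to_numpy()
    # Upper triangle without the diagonal, so each pair appears only once
    rows, cols = np.triu_indices_from(values, k=1)
    edges = pd.DataFrame({
        "Source": matrix.index.to_numpy()[rows],
        "Target": matrix.columns.to_numpy()[cols],
        "Weight": values[rows, cols],
    })
    # Keep only pairs that actually co-occur
    return edges[edges["Weight"] > 0].reset_index(drop=True)

# Hypothetical toy matrix standing in for the real co-occurrence DataFrame
topic_names = ["Health", "Politics", "Education"]
toy = pd.DataFrame([[0, 2, 0], [2, 0, 5], [0, 5, 0]],
                   index=topic_names, columns=topic_names)

edges = matrix_to_edges(toy)
print(edges)
```

For the real data, `matrix_to_edges(df).to_csv("edges.csv", index=False)` would write the file; the file name is only an example.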